Very slow first-time gradient calculation #1119

Open
pxl-th opened this issue Nov 15, 2021 · 33 comments

@pxl-th
Member

pxl-th commented Nov 15, 2021

Hi. I'm currently implementing a self-supervised depth estimation network, similar to Monodepth2, and would like to point out a possible performance concern.

Taking the gradient for the first time takes a very long time: around 5 minutes on CPU and ~10-11 minutes on GPU.
Further iterations are much faster (~8 seconds on CPU, ~0.5 seconds on GPU), but the first-call delay makes iterating on the code quite painful.

The model has a UNet-like architecture, with EfficientNet-B0 as the encoder and a decoder that emits outputs (disparities) at every upsampling level. The loss function is then calculated over these outputs. I can post the code a bit later, once it is in a more presentable form, but architecturally it is similar to Segmentation.jl.

To get the timings I just timed gradient calculation:

@time gradient(θ) do
    train_loss(model, x, train_parameters)
end

Timings for taking first gradient:

  • CPU
330.033668 seconds (401.83 M allocations: 24.610 GiB, 2.25% gc time, 97.00% compilation time)
  • GPU
675.810394 seconds (602.73 M allocations: 30.976 GiB, 2.96% gc time, 88.58% compilation time)
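
As a rough cross-check of the numbers above, the compilation share reported by `@time` can be converted into seconds. This is only a sketch: the helper names `compile_seconds` and `runtime_seconds` are hypothetical, and the percentage printed by `@time` is itself approximate.

```julia
# (total seconds, compilation share) copied from the @time output above.
cpu = (total = 330.033668, compile_share = 0.9700)
gpu = (total = 675.810394, compile_share = 0.8858)

# Hypothetical helpers: split a measurement into compile and non-compile time.
compile_seconds(t) = t.total * t.compile_share
runtime_seconds(t) = t.total - compile_seconds(t)

for (name, t) in (("CPU", cpu), ("GPU", gpu))
    println(name, ": compile ≈ ", round(compile_seconds(t); digits=1), " s, ",
            "everything else ≈ ", round(runtime_seconds(t); digits=1), " s")
end
```

For the CPU run this leaves roughly 10 seconds outside compilation, which is in line with the ~8 seconds reported for later iterations.
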
@willtebbutt
Member

FWIW, I have noticed substantially-increased compilation times recently (anecdotally -- I don't have a good MWE)

@ToucheSir
Member

What are the timings for the first run of the forward pass? GPU compilation can be especially slow on complex code, e.g. JuliaGPU/GPUCompiler.jl#65. If you warm up by running the forward pass first and then calling gradient, that would help with isolating how much time Zygote is responsible for. I'd recommend using tiny inputs to reduce the impact of runtime on these results.

More generally, first gradient compilation in Zygote is quite slow. I'm not sure what can be done about that other than trying a different AD, but my expertise is limited and other folks may have more useful suggestions.
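
A minimal sketch of the warm-up measurement suggested above, assuming only Zygote is installed. The names `W`, `x`, `loss`, and `objective` are hypothetical stand-ins for the real model and input, and the arrays are kept tiny so that runtime is negligible next to compilation:

```julia
using Zygote

# Hypothetical stand-in for a model: a single small weight matrix.
W = randn(Float32, 4, 3)
x = randn(Float32, 3)

loss(W, x) = sum(abs2, W * x)

# Define the closure once; a fresh anonymous function at every call site
# would be a new type and would trigger compilation again.
objective(w) = loss(w, x)

# (1) First forward call: includes compiling the forward pass.
t_forward = @elapsed objective(W)

# (2) First gradient call: the forward pass is already compiled,
#     so this mostly reflects the AD-related compilation.
t_gradient_first = @elapsed gradient(objective, W)

# (3) Second gradient call: steady-state cost without compilation.
t_gradient_warm = @elapsed gradient(objective, W)

g = gradient(objective, W)[1]

println("first forward:  ", t_forward, " s")
println("first gradient: ", t_gradient_first, " s")
println("warm gradient:  ", t_gradient_warm, " s")
```

Comparing (1) and (2) gives a rough idea of how much of the first-call latency comes from the gradient rather than from the forward pass.
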

@pxl-th
Member Author

pxl-th commented Nov 15, 2021

Inputs are already quite small (128x128), and the batch size is set to 1 for these timings as well.

Timings (in the same session, in this order):

@time train_loss(model, x, train_parameters) # (1)
@time gradient(θ) do                      # (2)
    train_loss(model, x, train_parameters)
end

In both cases, Zygote takes the majority of the time.

CPU

(1) 63.867963 seconds (108.40 M allocations: 6.094 GiB, 3.01% gc time, 98.00% compilation time)
(2) 268.388808 seconds (318.07 M allocations: 18.132 GiB, 2.06% gc time, 98.45% compilation time)

GPU

(1) 226.649656 seconds (276.76 M allocations: 15.019 GiB, 2.98% gc time, 77.41% compilation time)
(2) 477.420002 seconds (437.27 M allocations: 22.299 GiB, 2.64% gc time, 94.41% compilation time)

@willtebbutt
Member

@pxl-th could you please also run your benchmarks on 0.6.29 and 0.6.28 to see if this is a recent regression, or a problem that has been around for a while?

@pxl-th
Member Author

pxl-th commented Nov 15, 2021

I tried running on 0.6.29, but I am hitting this error:

ERROR: LoadError: Need an adjoint for constructor Base.Iterators.Enumerate{Vector{Int64}}.

I guess the code relies on #785, as I had to use map, enumerate, etc. instead of push! to avoid array mutation, mainly in the loss calculation.

@mcabbott
Member

mcabbott commented Dec 2, 2021

Worth trying with ideas from #1126, the simplest of which is to run this before your model:

@eval Flux (c::Chain)(x) = foldl((y,f) -> f(y), (x, c.layers...))

@pxl-th
Member Author

pxl-th commented Jan 5, 2022

> Worth trying with ideas from #1126, the simplest of which is to run this before your model:
>
> @eval Flux (c::Chain)(x) = foldl((y,f) -> f(y), (x, c.layers...))

Thanks for the suggestions, but for me the timings with these changes are pretty much the same.
What has helped is using the -O1 optimization flag,
although the timings are still high IMO, especially for the simple ResNet model.

The code has changed a bit, so the timings differ from the previous ones:

Default:
(F) 83.199798 seconds (204.95 M allocations: 10.673 GiB, 3.93% gc time, 71.90% compilation time)
(B) 272.590107 seconds (391.83 M allocations: 20.177 GiB, 2.45% gc time, 96.10% compilation time)

With -O1:
(F) 63.597444 seconds (205.33 M allocations: 10.679 GiB, 5.32% gc time, 74.61% compilation time)
(B) 127.244899 seconds (391.93 M allocations: 20.179 GiB, 5.26% gc time, 93.88% compilation time)

BTW, the code is now available at Monodepth2.jl.


I see similar results with the plain ResNet model in these tests.

Default:
(F): 31.948738 seconds (80.25 M allocations: 4.180 GiB, 4.83% gc time, 52.90% compilation time)
(B): 95.282793 seconds (108.76 M allocations: 5.640 GiB, 2.60% gc time, 97.25% compilation time)

With -O1:
(F): 24.255596 seconds (80.24 M allocations: 4.179 GiB, 6.14% gc time, 56.21% compilation time)
(B): 34.169904 seconds (108.75 M allocations: 5.639 GiB, 6.46% gc time, 94.94% compilation time)


F - forward pass
B - backward pass
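
To put the backward-pass numbers above side by side, here is a small sketch that computes the ratio between the default and `-O1` timings. The `speedup` helper and the labels are hypothetical; the `-O1` runs are assumed to come from passing the flag at startup (for example `julia -O1 script.jl`, with a placeholder script name).

```julia
# Backward-pass (B) timings in seconds, copied from the results above:
# (label, default optimization level, -O1).
timings = [
    ("depth model",  272.590107, 127.244899),
    ("plain ResNet",  95.282793,  34.169904),
]

# Hypothetical helper: how many times faster the -O1 run is.
speedup(default, o1) = default / o1

for (name, default, o1) in timings
    println(name, ": ", round(speedup(default, o1); digits=2), "x faster with -O1")
end
```

By this measure the first gradient call is roughly 2.1x faster for the depth model and about 2.8x faster for the plain ResNet.
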

@mcabbott
Member

mcabbott commented Jan 5, 2022

Another random idea if trying things is JuliaLang/julia#43370 (with Julia 1.8)

@ToucheSir
Member

Give VChain from #1126 a try too. What are the times for the resnet model on CPU? I think comparing compilation latency between CPU and GPU forward passes would be the easiest way to start looking into this.

@pxl-th
Member Author

pxl-th commented Jan 5, 2022

> What are the times for the resnet model on CPU? I think comparing compilation latency between CPU and GPU forward passes would be the easiest way to start looking into this.

ResNet CPU:
(F): 69.924270 seconds (33.55 M allocations: 2.181 GiB, 0.91% gc time, 15.85% compilation time)
(B): 244.059900 seconds (111.98 M allocations: 7.127 GiB, 1.03% gc time, 54.21% compilation time)

I also looked at the output from SnoopCompile, mainly:

Flux.crossentropy(softmax(model(x)), y) # run forward
tinf = SnoopCompile.@snoopi_deep gradient(θ) do
    Flux.crossentropy(softmax(model(x)), y)
end
print_tree(SnoopCompile.flatten(tinf; tmin=1e-2)) # tmin to retain only biggest contributors

And if I understand the output correctly, it seems that Chain is not the main/only major cause.

CPU [ascending order]
Vector{SnoopCompileCore.InferenceTiming}
├─ InferenceTiming: 0.010088/0.040965 on IRTools.Inner.var"#meta#1"(::Any, ::Any, meta::typeof(IRTools.Inner.meta), ::Any)
├─ InferenceTiming: 0.010145/0.024410 on IRTools.Inner.var"#dominators#132"(::Any, dominators::typeof(IRTools.Inner.dominators), ::Any)
├─ InferenceTiming: 0.010495/0.012893 on Base.nteltype(::Type{NamedTuple{(:type, :insert), T}}) where T<:Tuple{Any, Bool}
├─ InferenceTiming: 0.010563/0.012686 on eltype(::Tuple{Any, Any, Any})
├─ InferenceTiming: 0.010569/0.010569 on Base.nteltype(::Type{NamedTuple{(:meta,), T}}) where T<:Tuple
├─ InferenceTiming: 0.010630/0.010630 on Zygote.var"#s70#52"(::Any, ::Any, ::Any, ::Any, ::Any, ::Any)
├─ InferenceTiming: 0.010639/0.025669 on Base.Meta._partially_inline!(::Any, ::Vector{Any}, ::Any, ::Vector{Any}, ::Int64, ::Int64, ::Symbol)
├─ InferenceTiming: 0.010648/0.015004 on iterate(::IRTools.Inner.Pipe, ::Tuple{Vector{Vector{IRTools.Inner.Variable}}, Int64, Int64})
├─ InferenceTiming: 0.010653/0.073189 on IRTools.Inner.var"#meta#1"(::Type, ::UInt64, meta::typeof(IRTools.Inner.meta), ::Type)
├─ InferenceTiming: 0.010676/0.010676 on eltype(::Gtk.GLib.GList{L}) where L<:Gtk.GLib._LList
├─ InferenceTiming: 0.010688/0.019389 on ZygoteRules._pullback(::Zygote.Context, flatten::typeof(Flux.flatten), ::Array{Float32, 4})
├─ InferenceTiming: 0.010709/0.013468 on eltype(::Tuple{Any, Any})
├─ InferenceTiming: 0.010754/0.012320 on Base.nteltype(::Type{NamedTuple{(:unless,), T}}) where T<:Tuple{Any}
├─ InferenceTiming: 0.010775/0.109244 on NNlib.var"#∇conv_data!#196"(Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}()::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, ∇conv_data!::typeof(NNlib.∇conv_data!), ::Array{Float32, 5}, ::Array{Float32, 5}, ::Array{Float32, 5}, ::NNlib.DenseConvDims{3, (3, 3, 1), 256, 512, 1, (2, 2, 1), (1, 1, 1, 1, 0, 0), (1, 1, 1), false})
├─ InferenceTiming: 0.010804/0.038556 on NNlib.col2im!(::SubArray{Float32, 4, Array{Float32, 5}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Int64}, true}, ::SubArray{Float32, 2, Array{Float32, 3}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Int64}, true}, ::NNlib.DenseConvDims{3, (3, 3, 1), 64, 128, 1, (2, 2, 1), (1, 1, 1, 1, 0, 0), (1, 1, 1), false})
├─ InferenceTiming: 0.010883/0.041851 on ZygoteRules._pullback(::Zygote.Context, ::Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}, ::Array{Float32, 4})
├─ InferenceTiming: 0.010976/0.010976 on eltype(::Tuple{Any})
├─ InferenceTiming: 0.011105/0.015665 on ChainRulesCore.ProjectTo(::Vector{Float32})
├─ InferenceTiming: 0.011238/0.011796 on iterate(::IRTools.Inner.Pipe, ::Tuple{Any, Any, Int64})
├─ InferenceTiming: 0.011372/0.014132 on ChainRulesCore.canonicalize(::ChainRulesCore.Tangent{Tuple{Flux.Conv{2, 2, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}}, _A} where _A<:(NamedTuple{L})) where L
├─ InferenceTiming: 0.011420/0.024915 on (Base.Pairs{Symbol})(::NamedTuple{(:unless,), _A} where _A<:Tuple{Any}, ::Tuple{Symbol})
├─ InferenceTiming: 0.011555/0.044246 on ZygoteRules._pullback(::Zygote.Context, ::typeof(Base.Broadcast.broadcasted), ::Matrix{Float32}, ::Matrix{Union{}})
├─ InferenceTiming: 0.011583/0.039987 on NNlib.var"#∇conv_filter!#205"(Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}()::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, ∇conv_filter!::typeof(NNlib.∇conv_filter!), ::Array{Float32, 5}, ::Array{Float32, 5}, ::Array{Float32, 5}, ::NNlib.DenseConvDims{3, (7, 7, 1), 1, 64, 1, (2, 2, 1), (3, 3, 3, 3, 0, 0), (1, 1, 1), false})
├─ InferenceTiming: 0.011618/0.025747 on (Base.Pairs{Symbol})(::NamedTuple{(:type, :insert), _A} where _A<:Tuple{Any, Bool}, ::Tuple{Symbol, Symbol})
├─ InferenceTiming: 0.011739/0.197436 on ZygoteRules._pullback(::Zygote.Context, ::Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}, ::Array{Float32, 4})
├─ InferenceTiming: 0.011784/0.023602 on (Base.Pairs{Symbol})(::NamedTuple{(:meta,)}, ::Tuple{Symbol})
├─ InferenceTiming: 0.011844/0.037883 on NNlib.col2im!(::SubArray{Float32, 4, Array{Float32, 5}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}, Int64}, true}, ::SubArray{Float32, 2, Array{Float32, 3}, Tuple{Base.Slice{Base.OneTo{Int64}}, Base.Slice{Base.OneTo{Int64}}, Int64}, true}, ::NNlib.DenseConvDims{3, (3, 3, 1), 128, 128, 1, (1, 1, 1), (1, 1, 1, 1, 0, 0), (1, 1, 1), false})
├─ InferenceTiming: 0.011913/0.017974 on ZygoteRules._pullback(::Zygote.Context, epseltype::typeof(Flux.epseltype), ::Matrix{Float32})
├─ InferenceTiming: 0.011922/0.020028 on Base.__cat_offset1!(::Any, ::Tuple{Any, Vararg{Any}}, ::Tuple{Bool}, ::Tuple{Vararg{Int64}}, ::Int64)
├─ InferenceTiming: 0.011962/0.017538 on Zygote.accum(::NamedTuple{(:layers,), Tuple{Tuple{NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, NamedTuple{(:s,), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}}}}}}}, NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, Nothing}}}}}}}, ::NamedTuple{(:layers,), Tuple{Tuple{NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, NamedTuple{(:s,), 
Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}}}}}}}, NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, Nothing}}}}}}})
├─ InferenceTiming: 0.011968/0.015333 on NNlib.var"#∇maxpool_direct!#445"(::Float32, ::Float32, ∇maxpool_direct!::typeof(NNlib.∇maxpool_direct!), ::Array{Float32, 5}, ::Array{Float32, 5}, ::Array{Float32, 5}, ::Array{Float32, 5}, ::NNlib.PoolDims)
├─ InferenceTiming: 0.011985/0.014633 on Base.Pairs(::SparseArrays.SparseVector{<:Integer, Ti} where Ti, ::LinearIndices{1, Tuple{Base.OneTo{Int64}}})
├─ InferenceTiming: 0.012131/0.349522 on ZygoteRules._pullback(::Zygote.Context, ::typeof(Base._ntuple), ::Flux.var"#196#197", ::Int64)
├─ InferenceTiming: 0.012167/0.048107 on Zygote.z2d(::NamedTuple{(:layers,), Tuple{Tuple{Nothing, Base.RefValue{Any}}}}, ::Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}}})
├─ InferenceTiming: 0.012208/0.012208 on (::Type{Base.Iterators.Enumerate{_A}} where _A)(::AbstractVector)
├─ InferenceTiming: 0.012219/0.012219 on (::typeof((#_norm_layer_forward#272)))(::Array{Float32, 4})
├─ InferenceTiming: 0.012226/0.279947 on (::typeof((#crossentropy#12)))(::Float32)
├─ InferenceTiming: 0.012369/0.118534 on ZygoteRules._pullback(::Zygote.Context, ::Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ::Array{Float32, 4})
├─ InferenceTiming: 0.012413/0.025442 on iterate(::IRTools.Inner.Pipe, ::Any)
├─ InferenceTiming: 0.012551/0.140657 on ZygoteRules._pullback(::Zygote.Context, ::Flux.Losses.var"##crossentropy#12", ::Int64, ::typeof(Statistics.mean), ::Float32, ::typeof(Flux.Losses.crossentropy), ::Matrix{Float32}, ::Matrix{Float32})
├─ InferenceTiming: 0.012771/0.021761 on (::Zygote.var"#1796#back#231"{Zygote.Jnew{NamedTuple{(:dims,), Tuple{Int64}}, Nothing, true}})(nothing::Nothing)
├─ InferenceTiming: 0.012894/0.012988 on Zygote.accum(::NamedTuple{(:layers,), Tuple{Tuple{NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, NamedTuple{(:s,), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}}}}}}}, NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, Nothing}}}}}}}, ::NamedTuple{(:layers,), Tuple{Tuple{NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, NamedTuple{(:s,), 
Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}}}}}}}, NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, Nothing}}}}}}})
├─ InferenceTiming: 0.012949/0.028781 on ZygoteRules._pullback(::Zygote.Context, ::Flux.MaxPool{2, 2}, ::Array{Float32, 4})
├─ InferenceTiming: 0.013310/0.013310 on Base.Broadcast.combine_styles(::Tuple{Integer})
├─ InferenceTiming: 0.013596/0.047526 on NNlib.var"#∇conv_data!#196"(Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}()::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, ∇conv_data!::typeof(NNlib.∇conv_data!), ::Array{Float32, 5}, ::Array{Float32, 5}, ::Array{Float32, 5}, ::NNlib.DenseConvDims{3, (3, 3, 1), 128, 128, 1, (1, 1, 1), (1, 1, 1, 1, 0, 0), (1, 1, 1), false})
├─ InferenceTiming: 0.013882/0.035197 on Base.Broadcast._broadcast_getindex_evalf(::typeof(Zygote.accum), ::NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, NamedTuple{(:s,), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}}}}}}}, ::NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, Nothing}}}})
├─ InferenceTiming: 0.014414/0.052358 on NNlib.var"#∇conv_filter!#205"(Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}()::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, ∇conv_filter!::typeof(NNlib.∇conv_filter!), ::Array{Float32, 5}, ::Array{Float32, 5}, ::Array{Float32, 5}, ::NNlib.DenseConvDims{3, (3, 3, 1), 128, 256, 1, (2, 2, 1), (1, 1, 1, 1, 0, 0), (1, 1, 1), false})
├─ InferenceTiming: 0.014514/0.049505 on NNlib.var"#∇conv_data!#196"(Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}()::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, ∇conv_data!::typeof(NNlib.∇conv_data!), ::Array{Float32, 5}, ::Array{Float32, 5}, ::Array{Float32, 5}, ::NNlib.DenseConvDims{3, (3, 3, 1), 64, 128, 1, (2, 2, 1), (1, 1, 1, 1, 0, 0), (1, 1, 1), false})
├─ InferenceTiming: 0.014712/0.083152 on ZygoteRules._pullback(::Zygote.Context, ::Flux.Chain{Tuple{Flux.MeanPool{2, 4}, typeof(Flux.flatten), Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}, ::Array{Float32, 4})
├─ InferenceTiming: 0.014731/0.188349 on ZygoteRules.adjoint(::Zygote.Primal)
├─ InferenceTiming: 0.015709/0.015788 on Base.Broadcast.combine_styles(::Tuple{Any, Nothing, Nothing})
├─ InferenceTiming: 0.015726/0.015805 on Base.Broadcast.combine_styles(::Tuple{Any, Nothing})
├─ InferenceTiming: 0.015756/0.015830 on Base.Broadcast.combine_styles(::Tuple{Nothing, Vararg{Any}})
├─ InferenceTiming: 0.015795/0.015874 on Base.Broadcast.combine_styles(::Tuple{Nothing, Any})
├─ InferenceTiming: 0.015845/0.015924 on Base.Broadcast.combine_styles(::Tuple{Any, Nothing, Nothing, Nothing})
├─ InferenceTiming: 0.015958/0.177567 on ZygoteRules._pullback(::Zygote.Context, ::typeof(Flux.applychain), ::Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}, ::Array{Float32, 4})
├─ InferenceTiming: 0.016032/0.040375 on ZygoteRules._pullback(::Zygote.Context, ::Flux.MeanPool{2, 4}, ::Array{Float32, 4})
├─ InferenceTiming: 0.016121/0.016251 on Base.Broadcast.combine_styles(::Tuple)
├─ InferenceTiming: 0.016154/0.022153 on ChainRulesCore.canonicalize(::ChainRulesCore.Tangent{ResidualNetwork{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}}}, Flux.MaxPool{2, 2}, Flux.Chain{Tuple{Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}, 
Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}}}, Flux.Chain{Tuple{Flux.MeanPool{2, 4}, typeof(Flux.flatten), Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}}, NamedTuple{(:entry, :pooling, :layers, :head, :size, 
:stages), Tuple{ChainRulesCore.Tangent{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}}}, NamedTuple{(:layers,), Tuple{ChainRulesCore.Tangent{Tuple{Flux.Conv{2, 2, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}}, Tuple{ChainRulesCore.NoTangent, ChainRulesCore.Tangent{Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, Float32, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent}}}}}}}}, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent}}})
├─ InferenceTiming: 0.016681/0.016723 on eltype(::Base.Generator{UnitRange{Int64}, IRTools.Inner.var"#178#182"})
├─ InferenceTiming: 0.016717/0.016759 on eltype(::Base.Generator{UnitRange{Int64}, IRTools.Inner.var"#138#142"{IRTools.Inner.CFG}})
├─ InferenceTiming: 0.016912/0.016956 on eltype(::Base.Generator{Vector{IRTools.Inner.Block}, IRTools.Inner.var"#157#159"})
├─ InferenceTiming: 0.016952/0.016994 on eltype(::Base.Generator{UnitRange{Int64}, IRTools.Inner.var"#179#183"})
├─ InferenceTiming: 0.016979/0.017022 on eltype(::Base.Generator{Vector{Int64}, IRTools.Inner.var"#133#135"{Vector{Int64}}})
├─ InferenceTiming: 0.017115/0.039701 on Zygote.grad_mut(::Core.Box)
├─ InferenceTiming: 0.017331/0.115292 on ZygoteRules._pullback(::Zygote.Context, ::typeof(Base.Broadcast.broadcasted), ::typeof(+), ::Matrix{Float32}, ::typeof(identity))
├─ InferenceTiming: 0.017366/0.494089 on ZygoteRules._pullback(::Zygote.Context, ::typeof(Flux.applychain), ::Tuple{typeof(Flux.flatten), Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}, ::Array{Float32, 4})
├─ InferenceTiming: 0.017424/0.017470 on eltype(::Base.Generator{Vector{IRTools.Inner.Block}, Zygote.var"#25#26"})
├─ InferenceTiming: 0.017960/0.021058 on ChainRulesCore.canonicalize(::ChainRulesCore.Tangent{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}, _A} where _A<:(NamedTuple{L})) where L
├─ InferenceTiming: 0.018458/0.018842 on Zygote.pair(Val{:layers}()::Val{:layers}, ::Tuple{Nothing, Base.RefValue{Any}, Nothing, Base.RefValue{Any}}, ::Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}})
├─ InferenceTiming: 0.018840/0.034957 on (::Zygote.var"#back#222"{:layers, Zygote.Context, Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}, Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}})(::Any)
├─ InferenceTiming: 0.018858/0.063641 on (::typeof((ntuple)))(nothing::Nothing)
├─ InferenceTiming: 0.019146/0.025782 on ChainRulesCore.canonicalize(::ChainRulesCore.Tangent{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}, _A} where _A<:(NamedTuple{L})) where L
├─ InferenceTiming: 0.019272/0.040831 on (::Zygote.var"#back#222"{:entry, Zygote.Context, ResidualNetwork{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}}}, Flux.MaxPool{2, 2}, Flux.Chain{Tuple{Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}, 
Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}}}, Flux.Chain{Tuple{Flux.MeanPool{2, 4}, typeof(Flux.flatten), Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}}, Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), 
Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}}}})(::Any)
├─ InferenceTiming: 0.019359/0.154767 on ZygoteRules._pullback(::Zygote.Context, ::ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}, ::Array{Float32, 4})
├─ InferenceTiming: 0.019958/0.027300 on Zygote.grad_mut(::Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}})
├─ InferenceTiming: 0.020003/0.451108 on ZygoteRules._pullback(::Zygote.Context, ::typeof(Flux.applychain), ::Tuple{Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}, ::Matrix{Float32})
├─ InferenceTiming: 0.020231/0.028097 on ZygoteRules._pullback(::Zygote.Context, literal_getproperty::typeof(ZygoteRules.literal_getproperty), ::ResidualNetwork{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}}}, Flux.MaxPool{2, 2}, Flux.Chain{Tuple{Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, 
typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}}}, Flux.Chain{Tuple{Flux.MeanPool{2, 4}, typeof(Flux.flatten), Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}}, Val{:head}()::Val{:head})
├─ InferenceTiming: 0.020367/0.042751 on -(::ForwardDiff.Partials{1, Float32})
├─ InferenceTiming: 0.020436/0.025872 on (::Zygote.var"#561#565"{Zygote.Context, Base.var"#180#181"{Flux.var"#196#197"}})(::Int64)
├─ InferenceTiming: 0.020472/0.025695 on ChainRulesCore.canonicalize(::ChainRulesCore.Tangent{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, NamedTuple{(:σ, :weight, :bias, :stride, :pad, :dilation, :groups), Tuple{ChainRulesCore.NoTangent, Array{Float32, 4}, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent}}})
├─ InferenceTiming: 0.021038/0.042871 on *(::ForwardDiff.Partials{1, Float32}, ::Float32)
├─ InferenceTiming: 0.021068/0.026181 on ChainRulesCore.canonicalize(::ChainRulesCore.Tangent{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}, NamedTuple{(:layers, :connection), Tuple{ChainRulesCore.Tangent{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, NamedTuple{(:layers,), Tuple{ChainRulesCore.Tangent{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}, Tuple{ChainRulesCore.NoTangent, ChainRulesCore.Tangent{Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, Float32, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent}}}, ChainRulesCore.NoTangent, ChainRulesCore.Tangent{Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, 
Float32, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent}}}}}}}}, ChainRulesCore.NoTangent}}})
├─ InferenceTiming: 0.022866/0.024282 on IRTools.Inner.varargs!(::Any, ::IRTools.Inner.IR, ::Any)
├─ InferenceTiming: 0.024231/0.091506 on (::Zygote.var"#back#222"{:weight, Zygote.Context, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Array{Float32, 4}})(::Array{Float32, 4})
├─ InferenceTiming: 0.024548/0.080729 on ZygoteRules._pullback(::Zygote.Context, ::typeof(ntuple), ::Flux.var"#273#276"{4, Array{Float32, 4}}, ::Int64)
├─ InferenceTiming: 0.025172/0.042600 on ZygoteRules.adjoint(::Zygote.Context, ::typeof(Zygote.__new__), ::Type{Flux.var"#196#197"})
├─ InferenceTiming: 0.026518/0.039971 on Zygote.accum(::NamedTuple{(:entry, :pooling, :layers, :head, :size, :stages), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, Nothing, Nothing, Nothing, Nothing, Nothing}}, ::Nothing, ::NamedTuple{(:entry, :pooling, :layers, :head, :size, :stages), Tuple{Nothing, Nothing, NamedTuple{(:layers,), Tuple{Tuple{NamedTuple{(:layers,), Tuple{Tuple{NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, Nothing}}}}, NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, Nothing}}}}}}}, NamedTuple{(:layers,), Tuple{Tuple{NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, 
NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, NamedTuple{(:s,), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}}}}}}}, NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, Nothing}}}}}}}, NamedTuple{(:layers,), Tuple{Tuple{NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, NamedTuple{(:s,), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}}}}}}}, NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, 
:active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, Nothing}}}}}}}, NamedTuple{(:layers,), Tuple{Tuple{NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, NamedTuple{(:s,), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}}}}}}}, NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, Nothing}}}}}}}}}}, Nothing, Nothing, Nothing}}, ::Vararg{Any})
├─ InferenceTiming: 0.027003/0.045044 on Zygote.accum(::NamedTuple{(:y, :x, :model), Tuple{Matrix{Float32}, Nothing, Nothing}}, ::NamedTuple{(:y, :x, :model), Tuple{Nothing, Array{Float32, 4}, Nothing}}, ::NamedTuple{(:y, :x, :model), Tuple{Nothing, Nothing, NamedTuple{(:entry, :pooling, :layers, :head, :size, :stages), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, Nothing, NamedTuple{(:layers,), Tuple{Tuple{NamedTuple{(:layers,), Tuple{Tuple{NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, Nothing}}}}, NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, Nothing}}}}}}}, NamedTuple{(:layers,), Tuple{Tuple{NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, 
Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, NamedTuple{(:s,), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}}}}}}}, NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, Nothing}}}}}}}, NamedTuple{(:layers,), Tuple{Tuple{NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, NamedTuple{(:s,), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}}}}}}}, NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, 
NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, Nothing}}}}}}}, NamedTuple{(:layers,), Tuple{Tuple{NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, NamedTuple{(:s,), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}}}}}}}, NamedTuple{(:block,), Tuple{NamedTuple{(:layers, :connection), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}, Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, Nothing}}}}}}}}}}, Nothing, Nothing, Nothing}}}})
├─ InferenceTiming: 0.027286/0.029421 on Zygote.z2d(::NamedTuple, ::Any)
├─ InferenceTiming: 0.028041/0.048187 on ChainRulesCore.canonicalize(::ChainRulesCore.Tangent{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, _A} where _A<:(NamedTuple{L})) where L
├─ InferenceTiming: 0.028684/0.143502 on (::Zygote.var"#back#222"{:y, Zygote.Context, ResNet.var"#10#13"{Matrix{Float32}, Array{Float32, 4}, ResidualNetwork{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}}}, Flux.MaxPool{2, 2}, Flux.Chain{Tuple{Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, 
typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}}}, Flux.Chain{Tuple{Flux.MeanPool{2, 4}, typeof(Flux.flatten), Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}}}, Matrix{Float32}})(::Matrix{Float32})
├─ InferenceTiming: 0.028890/0.029325 on Zygote.matching_cr_sig(::Type, ::Type)
├─ InferenceTiming: 0.029448/0.029448 on (::IRTools.Inner.var"#flatten#154")(::Any)
├─ InferenceTiming: 0.030826/0.030826 on getproperty(Base::Module, length::Symbol)
├─ InferenceTiming: 0.031258/0.031382 on Base.indexed_iterate(::Tuple{Flux.MaxPool{2, 2}, Zygote.var"#back#222"{:pooling, Zygote.Context, ResidualNetwork{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}}}, Flux.MaxPool{2, 2}, Flux.Chain{Tuple{Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, 
typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}}}, Flux.Chain{Tuple{Flux.MeanPool{2, 4}, typeof(Flux.flatten), Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}}, Flux.MaxPool{2, 2}}}, ::Int64)
├─ InferenceTiming: 0.031448/0.031448 on ==(::Zygote.Alpha, ::Zygote.Alpha)
├─ InferenceTiming: 0.031609/0.032366 on invperm(::Vector{Int64})
├─ InferenceTiming: 0.032014/0.056284 on ZygoteRules._pullback(::Zygote.Context, literal_getproperty::typeof(ZygoteRules.literal_getproperty), ::Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, Val{:weight}()::Val{:weight})
├─ InferenceTiming: 0.032422/0.032422 on getproperty(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Tuple{Base.OneTo{Int64}}, IRTools.Inner.var"#85#86"{IRTools.Inner.var"#128#129"{Set{IRTools.Inner.Variable}}}, Tuple{Base.Broadcast.Extruded{Vector{Any}, Tuple{Bool}, Tuple{Int64}}}}, axes::Symbol)
├─ InferenceTiming: 0.033065/0.033065 on getindex((:unless,)::Tuple{Symbol}, 1::Int64)
├─ InferenceTiming: 0.033829/0.034685 on Base.collect_to_with_first!(::Vector{DataType}, ::Type{Any}, ::Base.Generator{UnitRange{Int64}, Base.var"#180#181"{Base.Iterators.var"#1#2"{Tuple{Vector{Any}, Vector{IRTools.Inner.Variable}}}}}, ::Int64)
├─ InferenceTiming: 0.034877/0.034877 on getindex(::Vector{Dict{IRTools.Inner.Slot, Any}}, ::Int64)
├─ InferenceTiming: 0.035932/0.035932 on setproperty!(::Dict{Union{Nothing, IRTools.Inner.Variable}, Nothing}, ::Symbol, ::Vector{UInt8})
├─ InferenceTiming: 0.036065/0.036065 on getproperty(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{4}, NTuple{4, Base.OneTo{Int64}}, typeof(*), Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{4}, Nothing, typeof(*), Tuple{Array{Float32, 4}, Int64}}, Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{4}, Nothing, typeof(conj), Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{4}, Nothing, typeof(^), Tuple{Array{Float32, 4}, Int64}}}}}}, args::Symbol)
├─ InferenceTiming: 0.038160/0.038186 on setindex!(::BitVector, ::Integer, ::Int64)
├─ InferenceTiming: 0.038257/0.038960 on ntuple(::Base.Iterators.var"#7#8", ::Int64)
├─ InferenceTiming: 0.038284/0.038801 on iterate(::Base.Generator{Base.Iterators.Filter{Base.var"#117#118"{Base.var"#115#116"{typeof(in), typeof(pop!), Set{IRTools.Inner.Undefined}}}, Tuple{IRTools.Inner.Undefined, IRTools.Inner.Undefined}}, typeof(identity)})
├─ InferenceTiming: 0.038938/0.038938 on Base.throw_setindex_mismatch(::Vector{Int64}, ::Tuple{Int64})
├─ InferenceTiming: 0.039840/0.039969 on Base.Generator(#25::Zygote.var"#25#26", ::Vector{IRTools.Inner.Block})
├─ InferenceTiming: 0.039853/0.039997 on LinearIndices(::Vector{DataType})
├─ InferenceTiming: 0.040487/0.040487 on (::typeof(∂(ntuple)))(nothing::Nothing)
├─ InferenceTiming: 0.040930/0.040930 on Core.Typeof(::IRTools.Inner.var"#85#86"{IRTools.Inner.var"#161#169"{Dict{Any, Nothing}, IRTools.Inner.Block, IRTools.Inner.var"#queue!#167"{Vector{IRTools.Inner.Block}}}})
├─ InferenceTiming: 0.040941/0.043061 on Base.Broadcast.copyto_nonleaf!(::Vector{IRTools.Inner.Variable}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Tuple{Base.OneTo{Int64}}, MacroTools.var"#23#24"{IRTools.Inner.var"#180#187"{IRTools.Inner.Block}}, Tuple{Base.Broadcast.Extruded{Vector{Any}, Tuple{Bool}, Tuple{Int64}}}}, ::Base.OneTo{Int64}, ::Int64, ::Int64)
├─ InferenceTiming: 0.041481/0.041548 on Base.Broadcast._getindex(::Tuple{Base.Broadcast.Extruded{Vector{Vector{Int64}}, Tuple{Bool}, Tuple{Int64}}}, ::Int64)
├─ InferenceTiming: 0.041752/0.096623 on Zygote.instrument(::IRTools.Inner.IR)
├─ InferenceTiming: 0.042102/0.050115 on copyto!(::BitVector, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Tuple{Base.OneTo{Int64}}, IRTools.Inner.var"#85#86"{IRTools.Inner.var"#180#187"{IRTools.Inner.Block}}, Tuple{Vector{Any}}})
├─ InferenceTiming: 0.042209/0.134717 on ZygoteRules._pullback(::Zygote.Context, ::Flux.var"#_norm_layer_forward##kw", ::NamedTuple{(:reduce_dims, :affine_shape), Tuple{Vector{Int64}, NTuple{4, Int64}}}, ::typeof(Flux._norm_layer_forward), ::Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}, ::Array{Float32, 4})
├─ InferenceTiming: 0.042832/0.043465 on Base.Broadcast.Broadcasted{Style, Nothing, F, Args}(::CUDA.var"#286#287", ::Tuple{Base.RefValue{typeof(+)}, Union{AbstractArray{<:T}, T} where T<:Number, Array{Float32, 4}}, nothing::Nothing) where {Style<:Union{Nothing, Base.Broadcast.BroadcastStyle}, F, Args<:Tuple}
├─ InferenceTiming: 0.043433/0.043837 on Base.Broadcast.instantiate(::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{4}, Nothing, typeof(/), Tuple{Array{Float32, 4}, Array{Float32, 4}}})
├─ InferenceTiming: 0.046286/0.048632 on ZygoteRules._pullback(::Zygote.Context, ::Type{Tuple}, ::Tuple{Flux.Conv{2, 2, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}})
├─ InferenceTiming: 0.046598/0.162556 on (::Zygote.ZBack{NNlib.var"#conv_pullback#244"{Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Array{Float32, 4}, Array{Float32, 4}, NNlib.DenseConvDims{2, (3, 3), 512, 512, 1, (1, 1), (1, 1, 1, 1), (1, 1), false}}})(::Array{Float32, 4})
├─ InferenceTiming: 0.046935/0.046990 on (::Zygote.var"#48#50"{NamedTuple{(:entry, :pooling, :layers, :head, :size, :stages), Tuple{NamedTuple{(:layers,), Tuple{Tuple{Nothing, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{Nothing, Nothing, Nothing, Nothing, Nothing, Float32, Nothing, Nothing, Nothing, Nothing, Nothing}}}}}, Nothing, Nothing, Nothing, Nothing, Nothing}}})(layers::Symbol)
├─ InferenceTiming: 0.047594/0.047594 on Base.copymutable(::AbstractVector)
├─ InferenceTiming: 0.049028/0.049028 on (::typeof(∂(#_norm_layer_forward#272)))(::Array{Float32, 4})
├─ InferenceTiming: 0.051924/0.224176 on (::Zygote.var"#1763#back#223"{Zygote.var"#back#222"{:layers, Zygote.Context, Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}})(::Tuple{Nothing, Base.RefValue{Any}, Nothing, Base.RefValue{Any}})
├─ InferenceTiming: 0.052446/0.075674 on ChainRulesCore.canonicalize(::ChainRulesCore.Tangent{ResNet.var"#10#13"{Matrix{Float32}, Array{Float32, 4}, ResidualNetwork{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}}}, Flux.MaxPool{2, 2}, Flux.Chain{Tuple{Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, 
Vector{Float32}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}}}, Flux.Chain{Tuple{Flux.MeanPool{2, 4}, typeof(Flux.flatten), Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}}}, NamedTuple{(:y, 
:x, :model), Tuple{Matrix{Float32}, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent}}})
├─ InferenceTiming: 0.052463/0.065228 on copyto!(::BitVector, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Tuple{Base.OneTo{Int64}}, IRTools.Inner.var"#89#90"{IRTools.Inner.var"#161#169"{Dict{Any, IRTools.Inner.Variable}, IRTools.Inner.Block, IRTools.Inner.var"#queue!#167"{Vector{IRTools.Inner.Block}}}}, Tuple{Vector{Any}}})
├─ InferenceTiming: 0.053957/0.423415 on ZygoteRules._pullback(::Zygote.Context, ::Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}, ::Matrix{Float32})
├─ InferenceTiming: 0.054958/0.056661 on Zygote.wrap_chainrules_output(::ChainRulesCore.Thunk{NNlib.var"#243#246"{Array{Float32, 4}, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, Array{Float32, 4}, NNlib.DenseConvDims{2, (1, 1), 128, 256, 1, (2, 2), (0, 0, 0, 0), (1, 1), false}}})
├─ InferenceTiming: 0.056967/0.092181 on IRTools.Inner.varargs!(::IRTools.Inner.Meta, ::IRTools.Inner.IR, ::Int64)
├─ InferenceTiming: 0.057399/0.085418 on ZygoteRules._pullback(::Zygote.Context, ::Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, ::Array{Float32, 4})
├─ InferenceTiming: 0.059726/0.077806 on ChainRulesCore.canonicalize(::ChainRulesCore.Tangent{Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, NamedTuple{(:λ, :β, :γ, :μ, :σ², :ϵ, :momentum, :affine, :track_stats, :active, :chs), Tuple{ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, Float32, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent}}})
├─ InferenceTiming: 0.067200/0.067277 on Channel{Any}(9223372036854775807::Int64)
├─ InferenceTiming: 0.069348/0.090344 on ZygoteRules._pullback(::Zygote.Context, ::Flux.var"##_norm_layer_forward#272", ::Vector{Int64}, ::NTuple{4, Int64}, ::typeof(Flux._norm_layer_forward), ::Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}, ::Array{Float32, 4})
├─ InferenceTiming: 0.084313/0.391098 on (::typeof(∂(λ)))(::Array{Float32, 4})
├─ InferenceTiming: 0.096467/0.254334 on ZygoteRules._pullback(::Zygote.Context, ::Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}, ::Array{Float32, 4})
├─ InferenceTiming: 0.103244/8.110116 on ZygoteRules._pullback(::Zygote.Context, ::ResidualNetwork{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}}}, Flux.MaxPool{2, 2}, Flux.Chain{Tuple{Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}, 
Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}}}, Flux.Chain{Tuple{Flux.MeanPool{2, 4}, typeof(Flux.flatten), Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}}, ::Array{Float32, 4})
├─ InferenceTiming: 0.106090/1.802821 on (::Zygote.var"#94#95"{Zygote.Params, typeof(∂(λ)), Zygote.Context})(::Float32)
├─ InferenceTiming: 0.109384/0.109384 on ZygoteRules.gradtuple1(::Tuple{Nothing, Nothing, Matrix{Float32}})
├─ InferenceTiming: 0.124380/3.754166 on ZygoteRules._pullback(::Zygote.Context, ::typeof(Flux.applychain), ::Tuple{Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}}, ::Array{Float32, 4})
├─ InferenceTiming: 0.144772/0.453725 on (::typeof(∂(crossentropy)))(::Float32)
├─ InferenceTiming: 0.147364/0.285503 on (::typeof(∂(λ)))(::Matrix{Float32})
├─ InferenceTiming: 0.184074/8.334357 on ZygoteRules._pullback(::Zygote.Context, ::ResNet.var"#10#13"{Matrix{Float32}, Array{Float32, 4}, ResidualNetwork{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}}}, Flux.MaxPool{2, 2}, Flux.Chain{Tuple{Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, 
typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, Flux.Conv{2, 4, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), Vector{Float32}, Float32, Vector{Float32}}}}, typeof(+)}}}}}}, Flux.Chain{Tuple{Flux.MeanPool{2, 4}, typeof(Flux.flatten), Flux.Dense{typeof(identity), Matrix{Float32}, Vector{Float32}}}}}})
├─ InferenceTiming: 0.190952/0.515821 on ZygoteRules._pullback(::Zygote.Context, ::Flux.var"##_norm_layer_forward#272", ::Vector{Int64}, ::NTuple{4, Int64}, ::typeof(Flux._norm_layer_forward), ::Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, ::Array{Float32, 4})
├─ InferenceTiming: 0.199366/7.903827 on ZygoteRules._pullback(::Zygote.Context, ::Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}}}, ::Array{Float32, 4})
├─ InferenceTiming: 0.206748/0.211004 on (::ChainRulesCore.ProjectTo{AbstractArray, NamedTuple{(:element, :axes), Tuple{ChainRulesCore.ProjectTo{Float32, NamedTuple{(), Tuple{}}}, NTuple{4, Base.OneTo{Int64}}}}})(::Array{Float32, 4})
├─ InferenceTiming: 0.238401/0.277354 on ZygoteRules._pullback(::Zygote.Context, ::Type{Base.Generator{UnitRange{Int64}, Base.var"#180#181"{Flux.var"#196#197"}}}, ::Base.var"#180#181"{Flux.var"#196#197"}, ::UnitRange{Int64})
├─ InferenceTiming: 0.267384/0.364975 on Base.permute!!(::Vector{IRTools.Inner.BasicBlock}, ::AbstractVector{<:Integer})
├─ InferenceTiming: 0.282205/0.282205 on getindex(::Tuple{Zygote.var"#1716#back#201"{Zygote.var"#198#200"{Tuple{}}}, typeof(∂(λ)), Zygote.var"#1728#back#205"{Zygote.var"#203#204"}, typeof(∂(applychain))}, 4::Int64)
├─ InferenceTiming: 0.283315/0.434623 on ZygoteRules._pullback(::Zygote.Context, ::typeof(ntuple), ::Flux.var"#282#283"{Array{Float32, 4}, Int64}, ::Int64)
├─ InferenceTiming: 0.471366/0.928664 on ZygoteRules._pullback(::Zygote.Context, ::typeof(ntuple), ::Flux.var"#196#197", ::Int64)
├─ InferenceTiming: 0.575252/7.605734 on ZygoteRules._pullback(::Zygote.Context, ::typeof(Flux.applychain), ::Tuple{Flux.Conv{2, 2, typeof(identity), Array{Float32, 4}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}}, ::Array{Float32, 4})
├─ InferenceTiming: 0.773211/3.609603 on ZygoteRules._pullback(::Zygote.Context, ::Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, ::Array{Float32, 4})
├─ InferenceTiming: 1.228437/2.100508 on ZygoteRules._pullback(::Zygote.Context, ::Flux.var"#_norm_layer_forward##kw", ::NamedTuple{(:reduce_dims, :affine_shape), Tuple{Vector{Int64}, NTuple{4, Int64}}}, ::typeof(Flux._norm_layer_forward), ::Flux.BatchNorm{typeof(NNlib.relu), Vector{Float32}, Float32, Vector{Float32}}, ::Array{Float32, 4})
├─ InferenceTiming: 4.166818/6.877291 on ZygoteRules._pullback(::Zygote.Context, ::Flux.Conv{2, 2, typeof(identity), Array{Float32, 4}, Flux.Zeros}, ::Array{Float32, 4})
└─ InferenceTiming: 170.547585/195.279104 on Core.Compiler.Timings.ROOT()
GPU [ascending order]
Vector{SnoopCompileCore.InferenceTiming}
├─ InferenceTiming: 0.010150/0.010722 on GPUCompiler.classify_arguments(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}}, ::LLVM.FunctionType)
├─ InferenceTiming: 0.010165/0.022720 on GPUCompiler.finish_module!(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{ForwardDiff.Dual{Nothing, Float32, 1}, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, Zygote.var"#1116#1119"{typeof(-)}, Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}}, ::LLVM.Module, ::LLVM.Function)
├─ InferenceTiming: 0.010210/0.010796 on GPUCompiler.classify_arguments(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CUDA.CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, ::LLVM.FunctionType)
├─ InferenceTiming: 0.010217/0.011866 on GPUCompiler.classify_arguments(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceVector{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, ::LLVM.FunctionType)
├─ InferenceTiming: 0.010258/0.014482 on Zygote.broadcast_forward(::Function, ::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.010284/0.019969 on GPUCompiler.classify_arguments(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(CUDA.big_mapreduce_kernel), Tuple{typeof(identity), typeof(Base.add_sum), Float32, CartesianIndices{4, NTuple{4, Base.OneTo{Int64}}}, CartesianIndices{4, NTuple{4, Base.OneTo{Int64}}}, CUDA.CuDeviceArray{Float32, 4, 1}, CUDA.CuDeviceArray{Float32, 4, 1}}}}, ::LLVM.FunctionType)
├─ InferenceTiming: 0.010458/0.108061 on Zygote.primal(::IRTools.Inner.IR)
├─ InferenceTiming: 0.010473/0.018042 on GPUCompiler.classify_arguments(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(/), Tuple{Float32, Int64}}}}, Int64}}}, ::LLVM.FunctionType)
├─ InferenceTiming: 0.010475/0.019382 on IRTools.Inner.var"#dominators#132"(::Any, dominators::typeof(IRTools.Inner.dominators), ::Any)
├─ InferenceTiming: 0.010492/0.022951 on GPUCompiler.finish_module!(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(*), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(-), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}}}, Int64}}}, ::LLVM.Module, ::LLVM.Function)
├─ InferenceTiming: 0.010514/0.023355 on GPUCompiler.finish_module!(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, Zygote.var"#1125#1130"{Int64}, Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{ForwardDiff.Dual{Nothing, Float32, 1}, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}}, ::LLVM.Module, ::LLVM.Function)
├─ InferenceTiming: 0.010519/0.156733 on (::IRTools.Inner.var"#reaching#184"{Dict{Int64, Dict{Int64, Vector{IRTools.Inner.Slot}}}, Dict{Int64, Dict{IRTools.Inner.Slot, Any}}})(::Any, ::Any)
├─ InferenceTiming: 0.010525/0.237493 on ZygoteRules._pullback(::Zygote.Context, ::Flux.Chain{Tuple{Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, 
CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}, 
Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}}}, ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.010547/0.030943 on GPUCompiler.finish_module!(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Float32}}, Int64}}}, ::LLVM.Module, ::LLVM.Function)
├─ InferenceTiming: 0.010583/0.018489 on GPUCompiler.check_ir!(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(/), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(*), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}}, ::Vector{Tuple{String, Vector{Base.StackTraces.StackFrame}, Any}}, ::LLVM.CallInst)
├─ InferenceTiming: 0.010584/0.026118 on GPUCompiler.finish_module!(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(CUDA.big_mapreduce_kernel), Tuple{typeof(identity), typeof(Base.add_sum), Float32, CartesianIndices{5, NTuple{5, Base.OneTo{Int64}}}, CartesianIndices{5, NTuple{5, Base.OneTo{Int64}}}, CUDA.CuDeviceArray{Float32, 5, 1}, CUDA.CuDeviceArray{Float32, 5, 1}}}}, ::LLVM.Module, ::LLVM.Function)
├─ InferenceTiming: 0.010612/0.032498 on GPUCompiler.finish_module!(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}}, ::LLVM.Module, ::LLVM.Function)
├─ InferenceTiming: 0.010646/0.010827 on Zygote.nt_nothing(::Flux.Dense{typeof(identity), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}})
├─ InferenceTiming: 0.010650/0.061969 on copy(::Base.Broadcast.Broadcasted)
├─ InferenceTiming: 0.010664/0.011711 on GPUCompiler.classify_arguments(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, Zygote.var"#1125#1130"{Int64}, Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{ForwardDiff.Dual{Nothing, Float32, 1}, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}}, ::LLVM.FunctionType)
├─ InferenceTiming: 0.010667/0.017402 on GPUCompiler.classify_arguments(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Float32}}, Int64}}}, ::LLVM.FunctionType)
├─ InferenceTiming: 0.010707/0.018009 on GPUCompiler.classify_arguments(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float32, 4, 1}, Base.Broadcast.Broadcasted{Nothing, NTuple{4, Base.OneTo{Int64}}, typeof(*), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, Nothing, typeof(>), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}, Int64}}}}, Int64}}}, ::LLVM.FunctionType)
├─ InferenceTiming: 0.010778/0.416782 on CUDA.var"#mapreducedim!#288"(::Float32, mapreducedim!::typeof(GPUArrays.mapreducedim!), identity::typeof(identity), add_sum::typeof(Base.add_sum), ::CUDA.CuArray{Float32, 5, CUDA.Mem.DeviceBuffer}, ::CUDA.CuArray{Float32, 5, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.010817/0.041259 on ZygoteRules._pullback(::Zygote.Context, ::Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}, ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.010824/0.069397 on ZygoteRules._pullback(::Zygote.Context, ::Flux.Chain{Tuple{Flux.MeanPool{2, 4}, typeof(Flux.flatten), Flux.Dense{typeof(identity), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.010905/0.910204 on ZygoteRules._pullback(::Zygote.Context, ::typeof(Flux.applychain), ::Tuple{Flux.Dense{typeof(identity), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, ::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.010917/0.020808 on (::GPUCompiler.var"#94#97"{LLVM.Context, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CUDA.CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}})()
├─ InferenceTiming: 0.010931/0.058418 on ChainRulesCore.ProjectTo(::CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.010945/0.010945 on Base.nteltype(::Type{NamedTuple{(:meta,), T}}) where T<:Tuple
├─ InferenceTiming: 0.010980/0.011595 on GPUCompiler.classify_arguments(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(*), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(-), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}}}, Int64}}}, ::LLVM.FunctionType)
├─ InferenceTiming: 0.011074/0.032989 on GPUCompiler.finish_module!(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float32, 4, 1}, Base.Broadcast.Broadcasted{Nothing, NTuple{4, Base.OneTo{Int64}}, typeof(*), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, Nothing, typeof(>), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}, Int64}}}}, Int64}}}, ::LLVM.Module, ::LLVM.Function)
├─ InferenceTiming: 0.011144/0.011144 on eltype(::Tuple{Any})
├─ InferenceTiming: 0.011159/0.123184 on ZygoteRules._pullback(::Zygote.Context, ::Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.011218/0.211611 on ZygoteRules._pullback(::Zygote.Context, ::Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}, ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.011246/0.071674 on GPUCompiler.finish_module!(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(CUDA.big_mapreduce_kernel), Tuple{typeof(identity), typeof(Base.add_sum), Float32, CartesianIndices{4, NTuple{4, Base.OneTo{Int64}}}, CartesianIndices{4, NTuple{4, Base.OneTo{Int64}}}, CUDA.CuDeviceArray{Float32, 4, 1}, CUDA.CuDeviceArray{Float32, 4, 1}}}}, ::LLVM.Module, ::LLVM.Function)
├─ InferenceTiming: 0.011256/0.024713 on Base.Meta._partially_inline!(::Any, ::Vector{Any}, ::Any, ::Vector{Any}, ::Int64, ::Int64, ::Symbol)
├─ InferenceTiming: 0.011326/0.047320 on eltype(::Tuple{Any, Any})
├─ InferenceTiming: 0.011440/0.018505 on GPUCompiler.classify_arguments(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(/), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(*), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}}, ::LLVM.FunctionType)
├─ InferenceTiming: 0.011458/0.017914 on ZygoteRules._pullback(::Zygote.Context, epseltype::typeof(Flux.epseltype), ::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.011507/0.018622 on GPUCompiler.compile_method_instance(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float32, 4, 1}, Base.Broadcast.Broadcasted{Nothing, NTuple{4, Base.OneTo{Int64}}, typeof(Zygote.accum), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}, Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}}}, Int64}}}, ::Core.MethodInstance)
├─ InferenceTiming: 0.011520/0.275042 on IRTools.Inner.Wrap.var"#IR#11"(::IRTools.Inner.Meta, ::Type{IRTools.Inner.IR}, ::Core.CodeInfo, ::Int64)
├─ InferenceTiming: 0.011529/0.011529 on eltype(::Gtk.GLib.GList{L}) where L<:Gtk.GLib._LList
├─ InferenceTiming: 0.011596/0.080520 on IRTools.Inner.var"#meta#1"(::Type, ::UInt64, meta::typeof(IRTools.Inner.meta), ::Type)
├─ InferenceTiming: 0.011714/0.020292 on (::GPUCompiler.var"#94#97"{LLVM.Context, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{ForwardDiff.Dual{Nothing, Float32, 1}, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, Zygote.var"#1116#1119"{typeof(-)}, Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}}})()
├─ InferenceTiming: 0.011721/0.022904 on (::GPUCompiler.var"#94#97"{LLVM.Context, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, Zygote.var"#1123#1127", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{ForwardDiff.Dual{Nothing, Float32, 1}, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}}})()
├─ InferenceTiming: 0.011894/0.024120 on (Base.Pairs{Symbol})(::NamedTuple{(:meta,)}, ::Tuple{Symbol})
├─ InferenceTiming: 0.011915/0.014055 on Base.nteltype(::Type{NamedTuple{(:type, :insert), T}}) where T<:Tuple{Any, Bool}
├─ InferenceTiming: 0.011983/0.120840 on ZygoteRules._pullback(::Zygote.Context, ::typeof(Base.Broadcast.broadcasted), ::typeof(+), ::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ::typeof(identity))
├─ InferenceTiming: 0.012306/0.033140 on GPUCompiler.finish_module!(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(/), Tuple{Float32, Int64}}}}, Int64}}}, ::LLVM.Module, ::LLVM.Function)
├─ InferenceTiming: 0.012337/0.022491 on (::GPUCompiler.var"#94#97"{LLVM.Context, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(/), Tuple{Float32, Int64}}}}, Int64}}}})()
├─ InferenceTiming: 0.012392/0.019113 on GPUCompiler.compile_method_instance(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float32, 4, 1}, Base.Broadcast.Broadcasted{Nothing, NTuple{4, Base.OneTo{Int64}}, typeof(*), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, Nothing, typeof(>), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}, Int64}}}}, Int64}}}, ::Core.MethodInstance)
├─ InferenceTiming: 0.012489/0.013126 on GPUCompiler.classify_arguments(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, Zygote.var"#1123#1127", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{ForwardDiff.Dual{Nothing, Float32, 1}, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}}, ::LLVM.FunctionType)
├─ InferenceTiming: 0.012626/0.013217 on GPUCompiler.classify_arguments(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(CUDA.big_mapreduce_kernel), Tuple{typeof(identity), typeof(Base.add_sum), Float32, CartesianIndices{5, NTuple{5, Base.OneTo{Int64}}}, CartesianIndices{5, NTuple{5, Base.OneTo{Int64}}}, CUDA.CuDeviceArray{Float32, 5, 1}, CUDA.CuDeviceArray{Float32, 5, 1}}}}, ::LLVM.FunctionType)
├─ InferenceTiming: 0.012703/0.016074 on Base.Pairs(::SparseArrays.SparseVector{<:Integer, Ti} where Ti, ::LinearIndices{1, Tuple{Base.OneTo{Int64}}})
├─ InferenceTiming: 0.012716/0.089360 on GPUCompiler.var"#emit_llvm#112"(::Bool, ::Bool, ::Bool, ::Bool, emit_llvm::typeof(GPUCompiler.emit_llvm), ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(*), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}}, ::Core.MethodInstance)
├─ InferenceTiming: 0.012724/0.185567 on GPUCompiler.var"#emit_llvm#112"(::Bool, ::Bool, ::Bool, ::Bool, emit_llvm::typeof(GPUCompiler.emit_llvm), ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(/), Tuple{Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(*), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}}, ::Core.MethodInstance)
├─ InferenceTiming: 0.012768/0.940799 on ZygoteRules._pullback(::Zygote.Context, ::typeof(Flux.applychain), ::Tuple{typeof(Flux.flatten), Flux.Dense{typeof(identity), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.012779/0.089246 on GPUCompiler.var"#emit_llvm#112"(::Bool, ::Bool, ::Bool, ::Bool, emit_llvm::typeof(GPUCompiler.emit_llvm), ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CUDA.CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, ::Core.MethodInstance)
├─ InferenceTiming: 0.012791/0.012791 on Base.nteltype(::Type{NamedTuple{(:unless,), T}}) where T<:Tuple{Any}
├─ InferenceTiming: 0.012815/0.259173 on Core.Compiler.IRCode(::IRTools.Inner.IR)
├─ InferenceTiming: 0.012910/0.022085 on GPUCompiler.classify_arguments(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float32, 4, 1}, Base.Broadcast.Broadcasted{Nothing, NTuple{4, Base.OneTo{Int64}}, typeof(Zygote.accum), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}, Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}}}, Int64}}}, ::LLVM.FunctionType)
├─ InferenceTiming: 0.013007/0.053565 on IRTools.Inner.var"#meta#1"(::Any, ::Any, meta::typeof(IRTools.Inner.meta), ::Any)
├─ InferenceTiming: 0.013061/0.028519 on iterate(::IRTools.Inner.Pipe, ::Any)
├─ InferenceTiming: 0.013263/0.877828 on (::typeof((#crossentropy#12)))(::Float32)
├─ InferenceTiming: 0.013326/0.357505 on ZygoteRules._pullback(::Zygote.Context, ::typeof(Base._ntuple), ::Flux.var"#196#197", ::Int64)
├─ InferenceTiming: 0.013499/0.014134 on iterate(::IRTools.Inner.Pipe, ::Tuple{Any, Any, Int64})
├─ InferenceTiming: 0.013508/0.022323 on Base.__cat_offset1!(::Any, ::Tuple{Any, Vararg{Any}}, ::Tuple{Bool}, ::Tuple{Vararg{Int64}}, ::Int64)
├─ InferenceTiming: 0.013529/0.087795 on GPUCompiler.var"#emit_llvm#112"(::Bool, ::Bool, ::Bool, ::Bool, emit_llvm::typeof(GPUCompiler.emit_llvm), ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}}, ::Core.MethodInstance)
├─ InferenceTiming: 0.013643/0.028902 on (Base.Pairs{Symbol})(::NamedTuple{(:unless,), _A} where _A<:Tuple{Any}, ::Tuple{Symbol})
├─ InferenceTiming: 0.013688/0.095482 on GPUCompiler.var"#emit_llvm#112"(::Bool, ::Bool, ::Bool, ::Bool, emit_llvm::typeof(GPUCompiler.emit_llvm), ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, Zygote.var"#1123#1127", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{ForwardDiff.Dual{Nothing, Float32, 1}, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}}, ::Core.MethodInstance)
├─ InferenceTiming: 0.013876/0.100366 on GPUCompiler.var"#emit_llvm#112"(::Bool, ::Bool, ::Bool, ::Bool, emit_llvm::typeof(GPUCompiler.emit_llvm), ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(CUDA.big_mapreduce_kernel), Tuple{typeof(identity), typeof(Base.add_sum), Float32, CartesianIndices{5, NTuple{5, Base.OneTo{Int64}}}, CartesianIndices{5, NTuple{5, Base.OneTo{Int64}}}, CUDA.CuDeviceArray{Float32, 5, 1}, CUDA.CuDeviceArray{Float32, 5, 1}}}}, ::Core.MethodInstance)
├─ InferenceTiming: 0.013898/0.174759 on GPUCompiler.var"#emit_llvm#112"(::Bool, ::Bool, ::Bool, ::Bool, emit_llvm::typeof(GPUCompiler.emit_llvm), ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(CUDA.big_mapreduce_kernel), Tuple{typeof(identity), typeof(Base.add_sum), Float32, CartesianIndices{4, NTuple{4, Base.OneTo{Int64}}}, CartesianIndices{4, NTuple{4, Base.OneTo{Int64}}}, CUDA.CuDeviceArray{Float32, 4, 1}, CUDA.CuDeviceArray{Float32, 4, 1}}}}, ::Core.MethodInstance)
├─ InferenceTiming: 0.013934/0.022156 on (::Zygote.var"#1796#back#231"{Zygote.Jnew{NamedTuple{(:dims,), Tuple{Int64}}, Nothing, true}})(nothing::Nothing)
├─ InferenceTiming: 0.014116/0.141249 on GPUCompiler.var"#emit_llvm#112"(::Bool, ::Bool, ::Bool, ::Bool, emit_llvm::typeof(GPUCompiler.emit_llvm), ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float32, 4, 1}, Base.Broadcast.Broadcasted{Nothing, NTuple{4, Base.OneTo{Int64}}, typeof(Zygote.accum), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}, Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}}}, Int64}}}, ::Core.MethodInstance)
├─ InferenceTiming: 0.014142/0.030957 on ZygoteRules._pullback(::Zygote.Context, ::Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::Nothing)
├─ InferenceTiming: 0.014189/0.091423 on GPUCompiler.var"#emit_llvm#112"(::Bool, ::Bool, ::Bool, ::Bool, emit_llvm::typeof(GPUCompiler.emit_llvm), ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(Flux.Losses.xlogy), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}}, ::Core.MethodInstance)
├─ InferenceTiming: 0.014288/0.090335 on GPUCompiler.var"#emit_llvm#112"(::Bool, ::Bool, ::Bool, ::Bool, emit_llvm::typeof(GPUCompiler.emit_llvm), ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, Zygote.var"#1125#1130"{Int64}, Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{ForwardDiff.Dual{Nothing, Float32, 1}, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}}, ::Core.MethodInstance)
├─ InferenceTiming: 0.014355/0.091268 on GPUCompiler.var"#emit_llvm#112"(::Bool, ::Bool, ::Bool, ::Bool, emit_llvm::typeof(GPUCompiler.emit_llvm), ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{ForwardDiff.Dual{Nothing, Float32, 1}, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, Zygote.var"#1116#1119"{typeof(-)}, Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}}, ::Core.MethodInstance)
├─ InferenceTiming: 0.014534/0.023667 on (::GPUCompiler.var"#94#97"{LLVM.Context, GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}}})()
├─ InferenceTiming: 0.014792/0.014792 on Base.Broadcast.combine_styles(::Tuple{Integer})
├─ InferenceTiming: 0.014802/0.134129 on GPUCompiler.var"#emit_llvm#112"(::Bool, ::Bool, ::Bool, ::Bool, emit_llvm::typeof(GPUCompiler.emit_llvm), ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float32, 4, 1}, Base.Broadcast.Broadcasted{Nothing, NTuple{4, Base.OneTo{Int64}}, typeof(*), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{4}, Nothing, typeof(>), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}, Int64}}}}, Int64}}}, ::Core.MethodInstance)
├─ InferenceTiming: 0.014847/0.014847 on (::Type{Base.Iterators.Enumerate{_A}} where _A)(::AbstractVector)
├─ InferenceTiming: 0.015379/0.039750 on ZygoteRules._pullback(::Zygote.Context, ::Flux.MaxPool{2, 2}, ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.015826/0.100185 on GPUCompiler.var"#emit_llvm#112"(::Bool, ::Bool, ::Bool, ::Bool, emit_llvm::typeof(GPUCompiler.emit_llvm), ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceVector{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceVector{Float32, 1}, Tuple{Bool}, Tuple{Int64}}}}, Int64}}}, ::Core.MethodInstance)
├─ InferenceTiming: 0.016059/0.016149 on Base.Broadcast.combine_styles(::Tuple{Any, Nothing, Nothing})
├─ InferenceTiming: 0.016067/0.016151 on Base.Broadcast.combine_styles(::Tuple)
├─ InferenceTiming: 0.016125/0.016207 on Base.Broadcast.combine_styles(::Tuple{Nothing, Any})
├─ InferenceTiming: 0.016178/0.016306 on Base.Broadcast.combine_styles(::Tuple{Any, Nothing, Nothing, Nothing})
├─ InferenceTiming: 0.016224/0.016302 on Base.Broadcast.combine_styles(::Tuple{Nothing, Vararg{Any}})
├─ InferenceTiming: 0.016256/0.149636 on GPUCompiler.var"#emit_llvm#112"(::Bool, ::Bool, ::Bool, ::Bool, emit_llvm::typeof(GPUCompiler.emit_llvm), ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(CUDA.partial_mapreduce_grid), Tuple{typeof(identity), typeof(Base.add_sum), Float32, CartesianIndices{5, NTuple{5, Base.OneTo{Int64}}}, CartesianIndices{5, NTuple{5, Base.OneTo{Int64}}}, Val{true}, CUDA.CuDeviceArray{Float32, 6, 1}, CUDA.CuDeviceArray{Float32, 5, 1}}}}, ::Core.MethodInstance)
├─ InferenceTiming: 0.016258/0.016399 on Base.Broadcast.combine_styles(::Tuple{Any, Nothing})
├─ InferenceTiming: 0.016259/0.036863 on ZygoteRules._pullback(::Zygote.Context, ::Flux.MeanPool{2, 4}, ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.016309/0.016393 on Base.Broadcast.combine_styles(::Tuple{Integer, Integer, Integer, Integer, Vararg{Integer, N}} where N)
├─ InferenceTiming: 0.016367/0.023714 on unsafe_load(::Core.LLVMPtr{ForwardDiff.Dual{Nothing, Float32, 1}, 1}, ::Int64, Val{4}()::Val{4})
├─ InferenceTiming: 0.016514/0.022897 on iterate(::IRTools.Inner.Pipe, ::Tuple{Vector{Vector{IRTools.Inner.Variable}}, Int64, Int64})
├─ InferenceTiming: 0.016997/0.191864 on ZygoteRules._pullback(::Zygote.Context, ::typeof(Flux.applychain), ::Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}, ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.017110/0.100047 on GPUCompiler.var"#emit_llvm#112"(::Bool, ::Bool, ::Bool, ::Bool, emit_llvm::typeof(GPUCompiler.emit_llvm), ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(CUDA.partial_mapreduce_grid), Tuple{typeof(identity), typeof(Base.add_sum), Float32, CartesianIndices{4, NTuple{4, Base.OneTo{Int64}}}, CartesianIndices{4, NTuple{4, Base.OneTo{Int64}}}, Val{true}, CUDA.CuDeviceArray{Float32, 5, 1}, CUDA.CuDeviceArray{Float32, 4, 1}}}}, ::Core.MethodInstance)
├─ InferenceTiming: 0.018338/0.018384 on eltype(::Base.Generator{Vector{Int64}, IRTools.Inner.var"#133#135"{Vector{Int64}}})
├─ InferenceTiming: 0.018351/0.168510 on ZygoteRules._pullback(::Zygote.Context, ::ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}, ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.018447/0.018527 on eltype(::Base.Generator{Vector{IRTools.Inner.Block}, IRTools.Inner.var"#157#159"})
├─ InferenceTiming: 0.018592/0.018641 on eltype(::Base.Generator{Vector{IRTools.Inner.Block}, Zygote.var"#25#26"})
├─ InferenceTiming: 0.018679/0.018723 on eltype(::Base.Generator{UnitRange{Int64}, IRTools.Inner.var"#138#142"{IRTools.Inner.CFG}})
├─ InferenceTiming: 0.018791/0.023355 on ChainRulesCore.canonicalize(::ChainRulesCore.Tangent{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, NamedTuple{(:σ, :weight, :bias, :stride, :pad, :dilation, :groups), Tuple{ChainRulesCore.NoTangent, CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent}}})
├─ InferenceTiming: 0.018810/1.059477 on ZygoteRules._pullback(::Zygote.Context, ::Flux.Losses.var"##crossentropy#12", ::Int64, ::typeof(Statistics.mean), ::Float32, ::typeof(Flux.Losses.crossentropy), ::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.019207/0.019254 on eltype(::Base.Generator{UnitRange{Int64}, IRTools.Inner.var"#179#183"})
├─ InferenceTiming: 0.019692/0.108227 on (::typeof(∂(ntuple)))(nothing::Nothing)
├─ InferenceTiming: 0.019744/0.019744 on Zygote.pair(Val{:layers}()::Val{:layers}, ::Any, ::Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}})
├─ InferenceTiming: 0.020078/0.039923 on (::Zygote.var"#back#222"{:layers, Zygote.Context, Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}, Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}})(::Any)
├─ InferenceTiming: 0.020362/0.045544 on (::Zygote.var"#back#222"{:entry, Zygote.Context, ResidualNetwork{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, Flux.MaxPool{2, 2}, Flux.Chain{Tuple{Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, 
CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, 
CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}}}, Flux.Chain{Tuple{Flux.MeanPool{2, 4}, typeof(Flux.flatten), Flux.Dense{typeof(identity), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}, Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, 
Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}})(::Any)
├─ InferenceTiming: 0.020498/0.020543 on eltype(::Base.Generator{UnitRange{Int64}, IRTools.Inner.var"#178#182"})
├─ InferenceTiming: 0.020656/0.044565 on Zygote.grad_mut(::Base.RefValue{typeof(identity)})
├─ InferenceTiming: 0.020700/0.026199 on (::Zygote.var"#561#565"{Zygote.Context, Base.var"#180#181"{Flux.var"#196#197"}})(::Int64)
├─ InferenceTiming: 0.021751/0.085318 on Zygote.grad_mut(::Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}})
├─ InferenceTiming: 0.022611/0.092301 on (::Zygote.var"#back#222"{:weight, Zygote.Context, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}})(::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.023176/0.094462 on -(::ForwardDiff.Partials{1, Float32})
├─ InferenceTiming: 0.023493/0.255932 on ZygoteRules.adjoint(::Zygote.Primal)
├─ InferenceTiming: 0.024382/0.104822 on IRTools.Inner.varargs!(::IRTools.Inner.Meta, ::IRTools.Inner.IR, ::Int64)
├─ InferenceTiming: 0.024860/0.026498 on IRTools.Inner.varargs!(::Any, ::IRTools.Inner.IR, ::Any)
├─ InferenceTiming: 0.026138/0.043617 on ZygoteRules.adjoint(::Zygote.Context, ::typeof(Zygote.__new__), ::Type{Flux.var"#196#197"})
├─ InferenceTiming: 0.027663/0.027663 on (::Type{Base.Broadcast.Extruded{_A, _B, _C}} where {_A, _B, _C})(::AbstractArray, ::Tuple, ::Tuple)
├─ InferenceTiming: 0.027818/0.031938 on unsafe_store!(::Core.LLVMPtr{ForwardDiff.Dual{Nothing, Float32, 1}, 1}, ::ForwardDiff.Dual{Nothing, Float32, 1}, ::Int64, Val{4}()::Val{4})
├─ InferenceTiming: 0.029445/0.200991 on (::Zygote.var"#back#222"{:y, Zygote.Context, ResNet.var"#10#13"{CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ResidualNetwork{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, Flux.MaxPool{2, 2}, Flux.Chain{Tuple{Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), 
CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, 
Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}}}, Flux.Chain{Tuple{Flux.MeanPool{2, 4}, typeof(Flux.flatten), Flux.Dense{typeof(identity), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}}, 
CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}})(::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.029626/0.030230 on !=(::UInt8, 0::Int64)
├─ InferenceTiming: 0.030535/0.033966 on copyto!(::BitVector, ::Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, IRTools.Inner.var"#85#86"{IRTools.Inner.var"#93#94"{IRTools.Inner.var"#103#104"{Dict{Any, Any}}}}, Tuple{Vector{Any}}})
├─ InferenceTiming: 0.031029/0.031029 on (::IRTools.Inner.var"#flatten#154")(::Any)
├─ InferenceTiming: 0.031612/0.033124 on Base.collect_to_with_first!(::Vector, ::Any, ::Base.Generator{Vector{Any}, MacroTools.var"#23#24"{Zygote.var"#18#19"{IRTools.Inner.Pipe, IRTools.Inner.Variable}}}, ::Int64)
├─ InferenceTiming: 0.032316/0.032316 on Base.Broadcast.restart_copyto_nonleaf!(::Vector, ::Vector{Nothing}, ::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Tuple{Base.OneTo{Int64}}, IRTools.Inner.var"#89#90"{Zygote.var"#39#41"{Dict{Any, Any}}}, Tuple{Base.Broadcast.Extruded{Vector{Any}, Tuple{Bool}, Tuple{Int64}}}}, ::Any, ::Int64, ::Base.OneTo{Int64}, ::Int64, ::Int64)
├─ InferenceTiming: 0.032888/0.033002 on Zygote.chain_rrule(::Zygote.ZygoteRuleConfig{Zygote.Context}, ::typeof(==), ::Int64, ::Int64)
├─ InferenceTiming: 0.033012/0.043105 on copyto!(::BitVector, ::Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, IRTools.Inner.var"#flatten#154", Tuple{Vector{Pair{Int64, Any}}}})
├─ InferenceTiming: 0.033241/0.051797 on ZygoteRules._pullback(::Zygote.Context, literal_getproperty::typeof(ZygoteRules.literal_getproperty), ::Flux.Dense{typeof(identity), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Val{:weight}()::Val{:weight})
├─ InferenceTiming: 0.033284/0.033284 on Base.IdSet()
├─ InferenceTiming: 0.035395/0.035737 on foreach(::IRTools.Inner.var"#queue!#167"{Vector{IRTools.Inner.Block}}, ::Vector{IRTools.Inner.Block})
├─ InferenceTiming: 0.035457/0.035505 on *(3::Int64, ::UInt64)
├─ InferenceTiming: 0.036040/0.036040 on (::MacroTools.var"#23#24"{IRTools.Inner.var"#130#131"{Dict{IRTools.Inner.Variable, Int64}}})(::GlobalRef)
├─ InferenceTiming: 0.036143/0.036143 on Base.isslotfilled(::Dict{Int64, Vector{IRTools.Inner.Slot}}, ::Int64)
├─ InferenceTiming: 0.036303/0.036424 on Base.indexed_iterate(::Tuple{Int64, Int64}, 1::Int64)
├─ InferenceTiming: 0.037373/0.040288 on Base.hash_64_64(1::UInt64)
├─ InferenceTiming: 0.037525/0.037525 on |>(::Any, reverse::typeof(reverse))
├─ InferenceTiming: 0.038136/0.038833 on (Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}})(substitute::typeof(IRTools.Inner.substitute), ::Tuple{Tuple{IRTools.Inner.Pipe}, Vector{Any}})
├─ InferenceTiming: 0.038278/0.041837 on Base.collect_to_with_first!(::AbstractArray, ::Any, ::Base.Generator{_A, IRTools.Inner.var"#164#172"} where _A, ::Any)
├─ InferenceTiming: 0.039003/0.039025 on Zygote.gradindex(::Tuple{Nothing, Union{Nothing, Tuple{Any}}, Any}, 3::Int64)
├─ InferenceTiming: 0.039317/0.039317 on getproperty(::Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}}, typeof(map), Tuple{Base.RefValue{IRTools.Inner.Wrap.var"#4#8"}, Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, typeof(IRTools.Inner.successors), Tuple{Vector{IRTools.Inner.Block}}}}}, f::Symbol)
├─ InferenceTiming: 0.040518/0.059940 on Zygote.accum(::NamedTuple{(:y, :x, :model), Tuple{CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, Nothing, Nothing}}, ::NamedTuple{(:y, :x, :model), Tuple{Nothing, CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Nothing}}, (nothing,)::Nothing)
├─ InferenceTiming: 0.041110/0.041110 on getindex(::Tuple{Tuple{IRTools.Inner.Variable}, Tuple{Int64}}, ::Int64)
├─ InferenceTiming: 0.041664/0.041717 on (NamedTuple{(:weight, :bias, :σ)})(::Tuple{CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent})
├─ InferenceTiming: 0.042048/0.044432 on ZygoteRules._pullback(::Zygote.Context, ::Type{Tuple}, ::Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}})
├─ InferenceTiming: 0.044845/0.044845 on NamedTuple{(:element, :axes), Tuple{ChainRulesCore.ProjectTo{Float32, NamedTuple{(), Tuple{}}}, Tuple{Base.OneTo{Int64}}}}(::Tuple{ChainRulesCore.ProjectTo{Float32, NamedTuple{(), Tuple{}}}, Tuple{Base.OneTo{Int64}}})
├─ InferenceTiming: 0.044848/0.045396 on convert(::Type{ForwardDiff.Partials{1, Float32}}, ::ForwardDiff.Partials{1, Bool})
├─ InferenceTiming: 0.045031/0.062628 on Base.rehash!(::Dict{Any, Union{Nothing, IRTools.Inner.Variable}}, ::Int64)
├─ InferenceTiming: 0.045279/0.046171 on GPUCompiler.add_kernel_state!(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(CUDA.big_mapreduce_kernel), Tuple{typeof(identity), typeof(Base.add_sum), Float32, CartesianIndices{4, NTuple{4, Base.OneTo{Int64}}}, CartesianIndices{4, NTuple{4, Base.OneTo{Int64}}}, CUDA.CuDeviceArray{Float32, 4, 1}, CUDA.CuDeviceArray{Float32, 4, 1}}}}, ::LLVM.Module, ::LLVM.Function)
├─ InferenceTiming: 0.045680/0.050890 on (::typeof(∂(applychain)))(::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.047259/0.070766 on ChainRulesCore.canonicalize(::ChainRulesCore.Tangent{ResNet.var"#10#13"{CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ResidualNetwork{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, Flux.MaxPool{2, 2}, Flux.Chain{Tuple{Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), 
CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, 
Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}}}, Flux.Chain{Tuple{Flux.MeanPool{2, 4}, typeof(Flux.flatten), Flux.Dense{typeof(identity), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}}, NamedTuple{(:y, 
:x, :model), Tuple{CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ChainRulesCore.NoTangent, ChainRulesCore.NoTangent}}})
├─ InferenceTiming: 0.049355/0.173307 on CUDA.cufunction_compile(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(CUDA.partial_mapreduce_grid), Tuple{typeof(identity), typeof(Base.add_sum), Float32, CartesianIndices{4, NTuple{4, Base.OneTo{Int64}}}, CartesianIndices{4, NTuple{4, Base.OneTo{Int64}}}, Val{true}, CUDA.CuDeviceArray{Float32, 5, 1}, CUDA.CuDeviceArray{Float32, 4, 1}}}})
├─ InferenceTiming: 0.049851/0.049898 on iterate(::Base.Generator{NTuple{11, Symbol}, Zygote.var"#218#219"}, ::Int64)
├─ InferenceTiming: 0.050193/0.050489 on Base.collect_to!(::Vector{Expr}, ::Base.Generator{UnitRange{Int64}, ForwardDiff.var"#16#17"{ForwardDiff.var"#46#47"}}, 2::Int64, ::Int64)
├─ InferenceTiming: 0.050305/0.053479 on map(::Function, ::CUDA.CuArray{ForwardDiff.Dual{Nothing, Float32, 1}, 2, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.050364/0.050364 on getindex((:element, :axes)::Tuple{Symbol, Symbol}, ::Int64)
├─ InferenceTiming: 0.050711/0.051204 on which(broadcast_kernel::GPUArrays.var"#broadcast_kernel#17", ::Type{Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, Zygote.var"#1123#1127", Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{ForwardDiff.Dual{Nothing, Float32, 1}, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}})
├─ InferenceTiming: 0.050844/0.089162 on Base.collect_similar(::Vector{IRTools.Inner.Statement}, ::Base.Generator{Vector{IRTools.Inner.Statement}, IRTools.Inner.var"#77#79"{IRTools.Inner.var"#85#86"{IRTools.Inner.var"#128#129"{Set{IRTools.Inner.Variable}}}}})
├─ InferenceTiming: 0.051370/0.051512 on Base.indexed_iterate(::Tuple{Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Zygote.var"#198#200"{Tuple{Nothing, Nothing}}}, ::Int64)
├─ InferenceTiming: 0.052188/0.059830 on Base.Broadcast.broadcasted(::typeof(Zygote.accum), ::Tuple, ::Tuple, ::Tuple, ::Tuple, ::Tuple, ::Tuple)
├─ InferenceTiming: 0.053389/0.056245 on sort!(::Vector{Pair{IRTools.Inner.Variable, Tuple{Int64, Int64}}}, ::Int64, ::Int64, Base.Sort.MergeSortAlg()::Base.Sort.MergeSortAlg, Base.Order.By{IRTools.Inner.var"#42#45", Base.Order.ForwardOrdering}(IRTools.Inner.var"#42#45"(), Base.Order.ForwardOrdering())::Base.Order.By{IRTools.Inner.var"#42#45", Base.Order.ForwardOrdering}, ::Vector{Pair{IRTools.Inner.Variable, Tuple{Int64, Int64}}})
├─ InferenceTiming: 0.054509/0.070220 on (Base.Pairs{Symbol})(::NamedTuple{(:type, :insert), _A} where _A<:Tuple{Any, Bool}, ::Tuple{Symbol, Symbol})
├─ InferenceTiming: 0.055266/0.239121 on CUDA.cufunction_compile(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceArray{Float32, 4, 1}, Base.Broadcast.Broadcasted{Nothing, NTuple{4, Base.OneTo{Int64}}, typeof(Zygote.accum), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}, Base.Broadcast.Extruded{CUDA.CuDeviceArray{Float32, 4, 1}, NTuple{4, Bool}, NTuple{4, Int64}}}}, Int64}}})
├─ InferenceTiming: 0.055846/0.056646 on searchsortedfirst(::AbstractRange{<:Integer}, ::Int64, Base.Order.ForwardOrdering()::Base.Order.ForwardOrdering)
├─ InferenceTiming: 0.055943/0.055943 on ndims(::Type{Base.RefValue{typeof(Zygote.accum)}})
├─ InferenceTiming: 0.056293/0.394181 on (::typeof((λ)))(::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.056519/0.056519 on GPUCompiler.lower_byval(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(CUDA.partial_mapreduce_grid), Tuple{typeof(identity), typeof(Base.add_sum), Float32, CartesianIndices{5, NTuple{5, Base.OneTo{Int64}}}, CartesianIndices{5, NTuple{5, Base.OneTo{Int64}}}, Val{true}, CUDA.CuDeviceArray{Float32, 6, 1}, CUDA.CuDeviceArray{Float32, 5, 1}}}}, ::LLVM.Module, ::LLVM.Function)
├─ InferenceTiming: 0.056546/0.056666 on checkbounds(::Type{Bool}, ::Vector{IRTools.Inner.Branch}, 1::Int64)
├─ InferenceTiming: 0.056633/0.170000 on GPUCompiler.var"#emit_llvm#112"(::Bool, ::Bool, ::Bool, ::Bool, emit_llvm::typeof(GPUCompiler.emit_llvm), ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}}, ::Core.MethodInstance)
├─ InferenceTiming: 0.058457/0.058457 on Base.indexed_iterate(::Tuple{Int64, Tuple{Int64, Int64}}, 1::Int64, 1::Int64)
├─ InferenceTiming: 0.058635/0.058925 on <=(2::Int64, ::Int8)
├─ InferenceTiming: 0.059165/0.089157 on ZygoteRules._pullback(::Zygote.Context, ::Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.062894/0.176889 on GPUCompiler.var"#emit_llvm#112"(::Bool, ::Bool, ::Bool, ::Bool, emit_llvm::typeof(GPUCompiler.emit_llvm), ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Float32}}, Int64}}}, ::Core.MethodInstance)
├─ InferenceTiming: 0.067208/0.182267 on GPUCompiler.var"#emit_llvm#112"(::Bool, ::Bool, ::Bool, ::Bool, emit_llvm::typeof(GPUCompiler.emit_llvm), ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(+), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{0}, Nothing, typeof(/), Tuple{Float32, Int64}}}}, Int64}}}, ::Core.MethodInstance)
├─ InferenceTiming: 0.101067/1.190170 on ZygoteRules._pullback(::Zygote.Context, ::typeof(Flux.Losses.crossentropy), ::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, ::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.101743/0.895432 on ZygoteRules._pullback(::Zygote.Context, ::Flux.Dense{typeof(identity), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, ::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.111724/9.626243 on ZygoteRules._pullback(::Zygote.Context, ::ResidualNetwork{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, Flux.MaxPool{2, 2}, Flux.Chain{Tuple{Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, 
CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, 
CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}}}, Flux.Chain{Tuple{Flux.MeanPool{2, 4}, typeof(Flux.flatten), Flux.Dense{typeof(identity), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}, ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.112764/0.190926 on GPUCompiler.var"#emit_llvm#112"(::Bool, ::Bool, ::Bool, ::Bool, emit_llvm::typeof(GPUCompiler.emit_llvm), ::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#17", Tuple{CUDA.CuKernelContext, CUDA.CuDeviceMatrix{Float32, 1}, Base.Broadcast.Broadcasted{Nothing, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(*), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Nothing, typeof(-), Tuple{Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}, Base.Broadcast.Extruded{CUDA.CuDeviceMatrix{Float32, 1}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}}}, Int64}}}, ::Core.MethodInstance)
├─ InferenceTiming: 0.114814/2.986306 on (::Zygote.var"#94#95"{Zygote.Params, typeof((λ)), Zygote.Context})(::Float32)
├─ InferenceTiming: 0.128708/0.197607 on ZygoteRules._pullback(::Zygote.Context, ::Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::Nothing)
├─ InferenceTiming: 0.152103/0.281560 on (::typeof((λ)))(::CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.155450/1.061840 on (::typeof((crossentropy)))(::Float32)
├─ InferenceTiming: 0.198346/9.457739 on ZygoteRules._pullback(::Zygote.Context, ::Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.241073/0.281525 on ZygoteRules._pullback(::Zygote.Context, ::Type{Base.Generator{UnitRange{Int64}, Base.var"#180#181"{Flux.var"#196#197"}}}, ::Base.var"#180#181"{Flux.var"#196#197"}, ::UnitRange{Int64})
├─ InferenceTiming: 0.268560/0.362674 on (::NNlib.var"#33#36"{CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}})()
├─ InferenceTiming: 0.281188/0.434706 on Base.permute!!(::Vector{IRTools.Inner.BasicBlock}, ::AbstractVector{<:Integer})
├─ InferenceTiming: 0.288024/10.003419 on ZygoteRules._pullback(::Zygote.Context, ::ResNet.var"#10#13"{CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ResidualNetwork{Flux.Chain{Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, Flux.MaxPool{2, 2}, Flux.Chain{Tuple{Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 
4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, 
Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}, Flux.Chain{Tuple{ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, ResNet.Shortcut{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}}}, ResNet.ResidualBlock{Flux.SkipConnection{Flux.Chain{Tuple{Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, Flux.Conv{2, 4, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(identity), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}, typeof(+)}}}}}}, Flux.Chain{Tuple{Flux.MeanPool{2, 4}, typeof(Flux.flatten), Flux.Dense{typeof(identity), CUDA.CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}}}})
├─ InferenceTiming: 0.319089/0.566933 on ZygoteRules._pullback(::Zygote.Context, ::Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}, ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 0.510660/1.013936 on ZygoteRules._pullback(::Zygote.Context, ::typeof(ntuple), ::Flux.var"#196#197", ::Int64)
├─ InferenceTiming: 0.640451/9.166997 on ZygoteRules._pullback(::Zygote.Context, ::typeof(Flux.applychain), ::Tuple{Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, Flux.BatchNorm{typeof(NNlib.relu), CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, Float32, CUDA.CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
├─ InferenceTiming: 5.134399/8.290482 on ZygoteRules._pullback(::Zygote.Context, ::Flux.Conv{2, 2, typeof(identity), CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, Flux.Zeros}, ::CUDA.CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
└─ InferenceTiming: 69.771594/95.247655 on Core.Compiler.Timings.ROOT()

@pxl-th
Member Author

pxl-th commented Jan 5, 2022

Here's also the ProfileView flamegraph of tinf for the whole ResNet (just in case). Almost everything in red is Zygote-related.

resnet-gpu-def

@ToucheSir
Member

Yeah, inference generally balks on Zygote and I'm not sure how we can get it inferring/precompiling well (if at all). Also, those forward times are pretty eye-watering! What's the smallest input size you can run the network on (IIRC resnet-50 inference time on 1 sample + FP32 shouldn't be more than a couple of seconds even on CPU)?

@pxl-th
Member Author

pxl-th commented Jan 6, 2022

Also, those forward times are pretty eye-watering!

These are the timings for the very first run on a cold start. Subsequent ones are fast:

Here's forward for the GPU:

45.118269 seconds (67.11 M allocations: 3.499 GiB, 4.40% gc time, 54.40% compilation time) # first run
 0.06753 seconds (2.37 k allocations: 107.922 KiB) # second run

Backward:

128.453328 seconds (108.89 M allocations: 5.649 GiB, 2.32% gc time, 98.03% compilation time)  # first run
  4.318588 seconds (215.16 k allocations: 13.108 MiB, 99.74% compilation time)  # second run
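
The cold-versus-warm split in these numbers can be reproduced on any function. A minimal sketch using only Base, where `work` and the input size are hypothetical stand-ins for the model's forward pass (not the code from this issue):

```julia
# Hypothetical stand-in for a model's forward pass; any function works.
work(x) = sum(abs2, x) / length(x)

# Tiny input, so runtime is negligible next to compilation time.
x = rand(Float32, 16)

t_first  = @elapsed work(x)   # includes compiling `work` for Vector{Float32}
t_second = @elapsed work(x)   # reuses the already compiled code

println("first run:  ", t_first, " s")
println("second run: ", t_second, " s")
```

Keeping the input tiny follows the earlier suggestion in this thread: whatever time remains in the first call is then mostly compilation rather than computation.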

@torfjelde
Contributor

@pxl-th could you please also run your benchmarks on 0.6.29 and 0.6.28 to see if this is a recent regression or a problem that has been around for a while?

We ran into this issue in Turing.jl too. I ran quite a few timings to see if it was due to recent regressions: TuringLang/Turing.jl#1754 (comment)

@pxl-th
Member Author

pxl-th commented Jan 10, 2022

@torfjelde for my case (NN using Flux.jl), compile times can be significantly improved by fixing type-stability issues with layers.
See FluxML/NNlib.jl#370 for timings, mainly first forward pass. But it doesn't improve backward pass much.

@torfjelde
Contributor

torfjelde commented Jan 10, 2022

@torfjelde for my case (NN using Flux.jl), compile times can be significantly improved by fixing type-stability issues with layers. See FluxML/NNlib.jl#370 for timings, mainly first forward pass. But it doesn't improve backward pass much.

Are you referring to instabilities in evaluation or pullback? Because the forward evaluation is def type-stable in the code I've been trying, but I haven't checked the pullback yet (though I'm pretty confident that it is).

EDIT: Using @code_warntype wasn't enough for the pullback, but using Cthulhu.jl I found type instabilities (and I was reminded that we're actually aware of these; I had just forgotten).

@torfjelde
Contributor

Also one thing I'm noticing in our issue: the memory usage on some of the older versions, e.g. [email protected] is insane. This is much better on more recent versions though.

@AriMKatz

@torfjelde have you tried checking with jet.jl?

@torfjelde
Contributor

@torfjelde have you tried checking with jet.jl?

Nope, but I just tried it. I have to use [email protected] because Turing doesn't support Julia 1.7 quite yet. Using @analyze_call I didn't get anything (am I doing something wrong?) 😕

@mcabbott
Member

What has helped is the use of -O1 optimization flag.

We could consider setting this for the package, like so:

https://github.com/JuliaPlots/Plots.jl/pull/2544/files

Assuming it has the same good effect and doesn't cause huge runtime slowdowns. (Or perhaps parts of Zygote could be moved to sub-modules to control what this hits.)
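
For readers who have not seen the per-module form of this setting, here is a minimal sketch of the pattern. The module name `LatencySensitive` and the function `poly` are hypothetical; only `Base.Experimental.@optlevel` is a real Base macro, and this is not what Zygote currently does:

```julia
module LatencySensitive  # hypothetical module name

# Lower the optimization level for code defined in this module only.
# The guard keeps the file loadable on Julia versions without the macro.
if isdefined(Base, :Experimental) &&
   isdefined(Base.Experimental, Symbol("@optlevel"))
    @eval Base.Experimental.@optlevel 1
end

# Hypothetical function that is compiled at the reduced level.
poly(x) = (x^2 - 1) / 3

end # module

println(LatencySensitive.poly(10))
```

Because the setting is scoped to the module it appears in, splitting a package into sub-modules is what would let it apply to some parts and not others.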

@ToucheSir
Member

Also worth looking into precompiling IRTools. When you look at the SnoopCompile flamegraph, IRTools functions make up a very large percentage of the total area.

@mcabbott
Member

Does anything in IRTools run at runtime? That might also be a good candidate for being -O1. And @max_methods 1 à la JuliaLang/julia#43370.
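
As an illustration of the second suggestion, a per-module cap on the number of methods inference considers could look roughly like this. The module `NarrowInference` and the functions `describe` and `labels` are hypothetical; `Base.Experimental.@max_methods` requires Julia 1.8 or newer, hence the guard:

```julia
module NarrowInference  # hypothetical module name

# Ask inference to give up on a call site as soon as more than one method
# could match, instead of analysing every candidate.
if isdefined(Base, :Experimental) &&
   isdefined(Base.Experimental, Symbol("@max_methods"))
    @eval Base.Experimental.@max_methods 1
end

# Three methods: with abstractly typed arguments, several may match.
describe(x::Int) = "integer"
describe(x::AbstractFloat) = "float"
describe(x) = "other"

# Elements of a Vector{Any} are not concretely typed at inference time,
# so `describe(x)` is resolved by dynamic dispatch at run time.
labels(xs) = [describe(x) for x in xs]

end # module

println(NarrowInference.labels(Any[1, 2.0, "three"]))
```

The results are unchanged; the trade-off is less inference work at compile time against more dynamic dispatch at run time.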

@ToucheSir
Member

ToucheSir commented Jan 10, 2022

It runs at generated function generation time, so I'm not sure what that counts as...certainly worth a try though! (Edit: TuringLang/Turing.jl#1754 (comment) has some numbers)

For reference, this is what I was using to analyze the feasibility of precompilation a few weeks back:

using Zygote, SnoopCompile, ProfileView
# useful for further analysis
# using MethodAnalysis, AbstractTrees

f(x) = (x^2 - 1) / 3
tinf = @snoopi_deep gradient(f, 10)
ProfileView.view(flamegraph(tinf))  # tinf not fg, surely

@mcabbott
Member

mcabbott commented Jan 10, 2022

Base.Experimental.@optlevel 0 seems to help and not hurt this Flux example: #1126 (comment) . I saw little effect from adding it only to IRTools.

@CarloLucibello
Member

Is this issue the same as #1126? If so, let's close one of the two to avoid dispersion.

@CarloLucibello
Member

CarloLucibello commented Feb 10, 2022

@darsnack
Member

darsnack commented Feb 10, 2022

Another coarse-grained data point w.r.t. the timeline of these regressions. Gradient tests were disabled on the initial refactor of Metalhead.jl due to the long compile times. That was as far back as June 2021, which puts Zygote in the v0.6.11-14 range.

Perhaps it has gotten worse since then, but the CI logs indicate that the gradient tests took 40 minutes or more.

@mcabbott
Member

#1126 also points to Julia 1.6 (released on 24 March 2021) being much slower than 1.5 in this regard.

@theabhirath
Member

Just for tracking, gradtests on Metalhead (Julia 1.8):
image

@ToucheSir
Member

I think I missed some of the discussion on this, what do the top and bottom test runs represent? Is the bottom just a second run after warmup or did something dramatically improve latency?

@theabhirath
Member

I think I missed some of the discussion on this, what do the top and bottom test runs represent? Is the bottom just a second run after warmup or did something dramatically improve latency?

The former, it's just a second run. I made no changes between the runs

@darsnack
Member

Anything we can do here?[1] This affects nearly every model in Metalhead.jl significantly, which basically means a horrible TTFG experience for anyone doing very basic ML stuff in Flux.

I was playing with the following script and noticed similar behavior re: Chain(::Vector). If I run the "No vector" test, I get ~25 seconds as the TTFG; if I then run "Vector", I get ~3 seconds as the TTFG. Running "Vector" from a cold REPL gets us back to ~25 seconds for TTFG. So I think Chain is definitely not to blame for the pullback compile latency.

using Metalhead
using BenchmarkTools
using Flux

function vgg_vec(imsize; config, inchannels, batchnorm = false, nclasses, fcsize, dropout)
    conv = Metalhead.vgg_convolutional_layers(config, batchnorm, inchannels)
    imsize = Flux.outputsize(conv, (imsize..., inchannels); padbatch = true)[1:3]
    class = Metalhead.vgg_classifier_layers(imsize, nclasses, fcsize, dropout)
    return Chain(Chain(conv), Chain(class))
end

function VGG_vec(depth::Int = 16; pretrain = false, batchnorm = false, nclasses = 1000)
    layers = vgg_vec((224, 224);
                     config = Metalhead.vgg_conv_config[Metalhead.vgg_config[depth]],
                     inchannels = 3,
                     batchnorm = batchnorm,
                     nclasses = nclasses,
                     fcsize = 4096,
                     dropout = 0.5)
    model = VGG(layers)

    return model
end

x = rand(Float32, 224, 224, 3, 1)

## No vector

mnovec = VGG(11)
@btime $(mnovec)($x)
@time Flux.gradient(x -> sum(mnovec(x)), x)
@btime Flux.gradient(x -> sum($(mnovec)(x)), $x)

## Vector

mvec = VGG_vec(11)
@btime $(mvec)($x)
@time gradient(x -> sum(mvec(x)), x)
@btime gradient(x -> sum($(mvec)(x)), $x)

Footnotes

  1. Outside of making Flux AD-agnostic

@ToucheSir
Member

My recollection is that 15 (±10) seconds is the fixed cost of TTFG (minus import time) regardless of what function is being compiled. Having also poked around the Zygote internals a bit recently, there should definitely be something we can do about this.

julia> using Zygote, SnoopCompile, ProfileSVG

julia> f(x) = x + 1
f (generic function with 1 method)

julia> tinf = @snoopi_deep gradient(f, 10.)
InferenceTimingNode: 7.108668/11.338889 on Core.Compiler.Timings.ROOT() with 360 direct children

ttfg

(if the above isn't rendering properly, I suggest downloading from https://gist.github.com/ToucheSir/fe22eea4547696d19c6a65733a0a2404 and opening locally).

The problem is finding someone with the requisite knowledge to tackle issues such as precompilation, reducing codegen time (which SnoopCompile seems to suggest takes just as long), type instability (some of which appears to be intentional), and more. When I tried to get rid of the leftmost flame by adding some precompilation routines, for example (following https://timholy.github.io/SnoopCompile.jl/stable/snoopi_deep_analysis/ and the related JuliaCon workshop), the result was abject failure.

Here are two ideas about what to do. Neither is amazing, but hopefully this gets the conversation going:

  1. Start "bottom-up" and try to get a handle on what is fundamentally causing latency issues with the existing tooling out there. This presupposes some level of competence using said tooling, but perhaps @timholy would be able to suggest folks who could assist here?
  2. Go "top-down" and essentially bisect the model call stack with @nograd/@non_differentiable until we find which functions are causing the most latency. Flux has historically been pretty lax about allowing the AD unnecessary visibility into routines such as dims initialization, so there may be some low-hanging fruit left. I think this would only help with any time on top of the 15ish seconds for TTFG, but it would be far easier to do.
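
To make the second idea concrete, one bisection step could be sketched as follows. The helper `output_width` and the function `loss` are hypothetical stand-ins for setup code inside a model (they are not Flux internals), and the sketch assumes both Zygote and ChainRulesCore are installed:

```julia
using Zygote
using ChainRulesCore

# Hypothetical helper standing in for setup code the AD never needs to trace.
output_width(x) = size(x, 1) * 2

# Mark the helper as non-differentiable so ChainRules-aware ADs
# (Zygote included) skip it instead of generating a pullback for it.
@non_differentiable output_width(::Any)

# Hypothetical loss that calls the helper on its way to a scalar.
loss(x) = sum(x) * output_width(x)

g, = gradient(loss, ones(Float32, 3))
println(g)
```

Timing the first `gradient` call in a fresh session with and without the `@non_differentiable` line gives a rough idea of how much latency a given function contributes.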
