Implementation of EfficientNetv2 and MNASNet #198
Also misc. docs and formatting
Also fix bug in EfficientNet models
# building inverted residual blocks
for (k, t, c, reduction, activation, stride) in configs
for (i, (k, t, c, reduction, activation, stride)) in enumerate(configs)
While I work on refactoring EfficientNet, this part is something that's been annoying me so I decided to put it up. `torchvision` has a cool feature from other papers (see https://pytorch.org/blog/torchvision-mobilenet-v3-implementation/ for a proper explanation) where they use dilations and a reduced tail (i.e. last three blocks) for some engineering gains. My code works but looks terribly ugly - any idea how to make this look prettier?
Pull it out into a utility function?
That's definitely a possibility, but I don't like the fact that this changes the implementation details for MobileNetv3 so fundamentally that unifying the code for the three MobileNets now becomes a nightmare. `torchvision` uses class variables, so they are able to get away with writing this for example - any creative ways to do something similar in Julia?
Isn't that a utility function? :P Perhaps I'm missing something.
> Isn't that a utility function? :P Perhaps I'm missing something.

It is, but `torchvision` writes the code for the three MobileNets differently, and I wanted to avoid that if possible in Julia. But what I meant more was that the tail dilation and reduced tail are built into the config dict because all of those variables are in the same function scope.
Is the extent of the differing logic here the conditional below? If so, perhaps it could be its own function, parameterized on `i` or some boolean to indicate the calculation is being done for a tail layer. This function has a branch for whether you want the dilations and/or reduced dimensions at the tail. Won't be the prettiest thing, but it doesn't have to be either.
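One way to read this suggestion, as a minimal sketch only: a small helper that takes the block index and decides whether the tail adjustments apply. The name `tail_adjust`, its keyword arguments, the halving of channels, and the example configuration values are all hypothetical and chosen for illustration; the tail length of three blocks comes from the earlier comment. This is not the code from this PR or from `torchvision`.

```julia
# Hypothetical helper: adjust one block's settings when it lies in the tail.
# `i` is the block index and `nblocks` the total number of blocks.
function tail_adjust(i::Integer, nblocks::Integer, channels::Integer, stride::Integer;
                     dilated::Bool = false, reduced_tail::Bool = false,
                     tail_length::Integer = 3)
    in_tail = i > nblocks - tail_length
    # Blocks outside the tail are returned unchanged.
    in_tail || return (channels = channels, stride = stride, dilation = 1)
    dilation = dilated ? 2 : 1
    # Assumption: a dilated block keeps its spatial resolution, so stride becomes 1.
    new_stride = dilated ? 1 : stride
    # Assumption: a reduced tail halves the channel count.
    new_channels = reduced_tail ? div(channels, 2) : channels
    return (channels = new_channels, stride = new_stride, dilation = dilation)
end

# Made-up configuration rows: (kernel size, output channels, stride).
configs = [(3, 16, 1), (3, 24, 2), (5, 40, 2), (5, 96, 2), (5, 96, 1), (5, 96, 1)]

adjusted = [tail_adjust(i, length(configs), c, s; dilated = true, reduced_tail = true)
            for (i, (k, c, s)) in enumerate(configs)]
```

Keeping the branch in one function means the loop over `configs` stays identical for every MobileNet variant; only the keyword arguments differ.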
This went away in the recent refactor, but keeping this conversation unresolved as a reminder that maybe some day we'll find a way to bake this into the configuration dict
bda527b to b57e6f6
Side note - using
Okay I'll make my own script over the weekend and sanity check.
Thanks! Comparing to this example https://www.kaggle.com/code/paultimothymooney/efficientnetv2-with-tensorflow?scriptVersionId=120655976&cellId=27
And performance maxes out within a few epochs
@darsnack I was just wondering whether you have had a chance to test this or check my code for any obvious issues. Thanks
Using IanButterworth/EfficientNet-Training@ec80084 and switching just the model to
Other than there being some issue with the model, could it be that EfficientNet requires some subtle difference in the loss function or optimizer setup?
I was trying a lot of these on CIFAR-10 during my GSoC and was facing issues, including with ResNet - the accuracies were nowhere near what I could get with PyTorch, for example. I remember trying to debug this but then had to give up since I got occupied with other commitments. IIRC at the time one theory was that there might be something wrong with the gradients, but we didn't manage to get nearly enough training examples on the GPU to confirm. I could try running these again if I got a GPU somehow 😅 The difference between EfficientNet and ResNet is weird, but not unexpected because they do use different kinds of layers. Maybe MobileNet vs EfficientNet tells us more? The underlying code over there is the exact same because of the way the models are constructed. Even for ResNet, though, the loss curve looks kinda bumpy...
Interesting. I was considering setting up a job to do the same for all appropriate models in Metalhead on my machine, and tracking any useful metrics while doing so for a comparison table. I have a relatively decent setup so it might not take too long. I can understand that full training isn't a part of CI because of resources, but I think it needs to be demonstrated that these models are trainable somewhere public.
I've started putting together a benchmarking script here: #264
As @theabhirath mentioned, this has come up before but we never got to the bottom of why because of time constraints. If you have some bandwidth to do some diagnostic runs looking at whether gradients, activations etc. are misbehaving and where they're misbehaving, we could narrow the problem down to something actionable, whether that be a bug in Metalhead itself or something further upstream like Flux, Zygote or CUDA.
Ok. I think my minimum goal then is to characterise the issue (is it all models?) and make it clear in the docs which ones cannot be expected to be trainable. As someone coming to this package to use it to train on a custom dataset, it's been quite a time sink just to get to that understanding.
If it's an either-or deal between that and finding out why a specific model (e.g. EfficientNet) has problems, I'd lean towards the latter. I suspect any issues may lie at least at the level of some of the shared layer helpers, so addressing those would help other models too. This kind of deep debugging is also something myself and probably @darsnack are less likely to have time to do, whereas running a convergence test for all models would be more straightforward. But if the goal is to do both, that sounds good to me.
I lean more towards what I'm currently trying because I'm not sure I have the knowledge/skill set to dive in and debug. Tbh if I find a model that trains and is performant I may declare victory, but share findings. Or maybe I strike lucky in a dive.
Sorry for the late reply. I did start writing a script, but I never got around to starting my testing. I kept meaning to reply as soon as I did that, but it's been a busy week. Looks like we might have some duplicate work. Either way, I uploaded my code to a repo here. I've added all of you as collaborators. The biggest help right now would be if someone can add in all the remaining models + configs to the
Based on the current runs that finished, it looks like all the EfficientNetv2 models have some bug. Only the ResNets have the quirk Abhirath noticed during the summer where the models start off fine then start to overfit to the training data. I also have AlexNet and EfficientNet (v1) queued up. Let's see how those do.
I will say though that the ResNet loss curves don't look as bad as I remember them. Perhaps in this case, a different learning rate would fix things.
My recollection was that the PyTorch resnets converged faster and overfit less easily even with less help from data augmentation. Is it straightforward to do a head-to-head comparison?
I'm able to replicate the poor training behaviour of EfficientNet-B0 on a local machine, which happens to have an AMD GPU. This suggests the problem may not lie with anything CUDA-specific.
Great, thanks. How does one go about debugging this kind of thing? Are there generic tools for visualizing the model that could help?
I'll modify the script to log gradient norms by layer and also do some local debugging just to sanity check the output isn't obviously wrong. I'll also add MobileNet to the script. I think that might be a good reference model to compare against, assuming it does converge. If it works, that would narrow the issue down to the specific "inverted residual block" that EfficientNet uses.
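For anyone following along, logging gradient norms by layer could look roughly like the sketch below. It assumes the gradients arrive as nested NamedTuples and Tuples whose leaves are arrays or `nothing`; the names `gradnorms` and `gradnorms!` and the example gradient values are made up for illustration, and this is not the benchmark script itself. It uses only the `LinearAlgebra` standard library.

```julia
using LinearAlgebra: norm

# Recursively collect the L2 norm of every array leaf in a nested gradient
# structure. Assumption: leaves are numeric arrays or `nothing` (for
# parameters that received no gradient).
function gradnorms!(out::Dict{String, Float64}, grads, prefix::String = "")
    if grads isa AbstractArray{<:Number}
        out[prefix] = norm(grads)
    elseif grads isa Union{NamedTuple, Tuple}
        for (name, g) in pairs(grads)
            child = isempty(prefix) ? string(name) : string(prefix, ".", name)
            gradnorms!(out, g, child)
        end
    end
    # `nothing` and any other leaf type are skipped.
    return out
end

gradnorms(grads) = gradnorms!(Dict{String, Float64}(), grads)

# Made-up gradients standing in for one training step.
grads = (stem = (weight = [3.0 4.0], bias = nothing),
         blocks = ((weight = zeros(2, 2),), (weight = fill(NaN, 2),)))

norms = gradnorms(grads)

# Flag layers whose gradient vanished or is not finite.
suspect = sort([k for (k, v) in norms if v == 0 || !isfinite(v)])
```

Logging `norms` once per step, keyed by layer path, makes it easier to see whether a problem starts in a particular block or is spread across the whole network.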
If either of you has a machine with a decently powerful CPU, I think a CPU-only run would be interesting to see if we can isolate GPU libraries as a possibility altogether.
I was seeing less time than that for b0 on my decidedly less powerful machine, so maybe using a smaller size would help? We've established that all EfficientNets are affected after all.
I've updated progress on my benchmark script here: #264 (comment). Would anyone be able to help me get the errors resolved? I guess most of them are input image size issues?
I've commented there! Also just to understand your previous EfficientNet training on CPU graph, it still seems to have those fluctuations in loss, right? Meaning that the issue might not be exactly GPU linked (or at least not only GPU linked)?
Updated results: https://wandb.ai/darsnack/metalhead-bench/ For me, I see MobileNetv1 and MobileNetv2 succeeding while MobileNetv3 fails. I tried to check what the differences are by inspection, and the one major difference was the use of So, I ran another MobileNetv3 but with
This is an implementation of EfficientNetv2 and MNASNet. There's clearly some code duplication between the structures of the `efficientnet` function and the `mobilenetvX` functions, but figuring out how to unify that API and its design is perhaps best left for another PR.

TODO
- `efficientnet` function
- `mbconv` and `fused_mbconv`
P.S. memory issues mean that the tests are more fragmented than they ought to be, not sure how we can go about addressing those in the short/medium term