Semantic segmentation #25
This is the paragraph. I don't think 32x32 is linked to the number of classes; it is the low resolution of the logit map. The dataset seems to be ADE20K, which should have 3,688 classes.
32 x 16 = 512, so starting with a cropped 512x512-pixel image you would end up with [batch size, # of patches, # of classes], i.e. [1, 32x32, # of classes], where # of classes is the set of classes you fine-tune on. I don't think you want to touch the intermediate layers; just train a head that learns the mapping between the output of the transformer stack and the segmentation labels.
In the Linear class, I can do the following. Where do I get that # of classes for the output dims?
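The snippet from that comment did not survive extraction; a minimal sketch of what such a patch-wise linear head could look like (the embedding dim and class count below are illustrative assumptions, not values from the thread):

```python
import torch
from torch import nn

# Illustrative assumptions: ViT-L/14 embeddings (D=1024) and 150 classes
# (ADE20K benchmark); substitute your own backbone dim and label count.
embed_dim, num_classes = 1024, 150
head = nn.Linear(embed_dim, num_classes)

patch_tokens = torch.randn(1, 32 * 32, embed_dim)  # (B, num_patches, D)
logits = head(patch_tokens)                        # (B, num_patches, num_classes)
print(tuple(logits.shape))  # (1, 1024, 150)
```

The # of classes simply becomes the `out_features` of the linear layer.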
You would probably prefer a slightly different head. See for instance how it is written for SegFormer. To upsample the map and take the argmax, you may refer to 🤗's doc about semantic segmentation. Take everything I write with a grain of salt, though.
The simplest example of a semantic segmentation task head I've done uses patch_features:
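The code block itself was lost from the comment; a plausible sketch of such a patch_features-based head (my reconstruction, not the poster's actual code, with illustrative ViT-L/14-style dimensions) could be:

```python
import torch
from torch import nn
import torch.nn.functional as F

# Reconstruction sketch (not the original code). Assumed: embed_dim=1024,
# a 518x518 input giving a 37x37 patch grid (patch size 14), 150 classes.
embed_dim, num_classes, grid, patch = 1024, 150, 37, 14

class ConvSegHead(nn.Module):
    def __init__(self, embed_dim, num_classes):
        super().__init__()
        self.conv = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, patch_tokens, h, w):
        b, n, d = patch_tokens.shape
        x = patch_tokens.permute(0, 2, 1).reshape(b, d, h, w)  # (B, D, H, W)
        logits = self.conv(x)                                  # (B, C, H, W)
        # Upsample the coarse logit map back to pixel resolution
        return F.interpolate(logits, scale_factor=patch,
                             mode="bilinear", align_corners=False)

head = ConvSegHead(embed_dim, num_classes)
out = head(torch.randn(1, grid * grid, embed_dim), grid, grid)
print(tuple(out.shape))  # (1, 150, 518, 518)
```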
@Alexankharin But this requires a conv operation, while the paper specifically mentions using a dense linear layer. The only way I can think of doing it is as follows:
Probably I understood the paper wrong, but I thought it mentioned linear classification over features patch-wise.
The fact that the layer is linear does not really matter; it is just a way to say that DINOv2's frozen features are really good, so you can train a simple head and get good results. 😄 If Alexankharin's simple code gives good results, then it is fine. Plus, the explanation is probably correct.
Exactly. Any valid head should be fine. Linear is the easiest to train, but a larger one will get better results.
Spot on too. The linear head is applied separately to each patch token, i.e. it is also a 1x1 convolution.
Closing as answered (and keeping track in #55).
I can confirm that a 1x1 convolution on unrolled patches is mathematically equivalent to a linear layer. No information from neighboring patches is considered when encoding each patch, and there are no edge effects due to the 1x1 kernel, so there is no need for padding. The number of parameters and their inputs and outputs are exactly the same.
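The equivalence is easy to check numerically; a small sketch (shapes are illustrative) copies the conv weights into a linear layer and compares outputs:

```python
import torch
from torch import nn

# Assumed illustrative sizes: D=384 features, 150 classes, a 37x37 patch grid.
embed_dim, num_classes, h, w = 384, 150, 37, 37
conv = nn.Conv2d(embed_dim, num_classes, kernel_size=1)
linear = nn.Linear(embed_dim, num_classes)

# Share parameters: a 1x1 conv kernel (C, D, 1, 1) is just a (C, D) matrix.
with torch.no_grad():
    linear.weight.copy_(conv.weight.squeeze(-1).squeeze(-1))
    linear.bias.copy_(conv.bias)

feats = torch.randn(1, embed_dim, h, w)
out_conv = conv(feats)                              # (B, C, H, W)
out_lin = linear(feats.flatten(2).transpose(1, 2))  # (B, HW, C)
out_lin = out_lin.transpose(1, 2).reshape(1, num_classes, h, w)

print(torch.allclose(out_conv, out_lin, atol=1e-5))
```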
I borrowed from Alexankharin and the U-Net concept to decode it:
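The decoder code was lost from the comment; a hedged sketch of a U-Net-style decoder over DINOv2 patch features (my own reconstruction with assumed sizes; a true U-Net would also add skip connections, e.g. from earlier ViT blocks) might look like:

```python
import torch
from torch import nn

# Sketch only, not the poster's code. Assumed: ViT-S/14 features (D=384),
# 150 classes, a 37x37 patch-feature map as input.
def up_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

class UNetStyleDecoder(nn.Module):
    def __init__(self, embed_dim=384, num_classes=150):
        super().__init__()
        self.decode = nn.Sequential(
            up_block(embed_dim, 256),
            up_block(256, 128),
            up_block(128, 64),
            nn.Conv2d(64, num_classes, kernel_size=1),
        )

    def forward(self, x):  # x: (B, D, H, W) patch features
        return self.decode(x)

dec = UNetStyleDecoder()
out = dec(torch.randn(1, 384, 37, 37))
print(tuple(out.shape))  # (1, 150, 296, 296)
```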
Hi, I would like to know if this means ground-truth segmentation labels of the images are needed? If so, is it possible to perform unsupervised semantic segmentation with DINOv2? Many thanks.
Yes, you need to have the labels.
You can use a conv layer instead. Use the number of classes as the number of out channels. Cheers! Here is a sample code:
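The sample code did not survive extraction; a minimal stand-in sketch (assumed sizes; in practice the features would come from a frozen DINOv2 backbone, e.g. loaded via torch.hub) showing the conv head plus the upsample-and-argmax step mentioned earlier in the thread:

```python
import torch
from torch import nn
import torch.nn.functional as F

# Sketch with assumed ViT-S/14-style values: D=384, 150 classes,
# a 37x37 patch grid from a 518x518 input (patch size 14).
num_classes, embed_dim, grid, patch = 150, 384, 37, 14

head = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

feats = torch.randn(1, embed_dim, grid, grid)  # frozen backbone features
logits = head(feats)                           # (1, C, 37, 37)
logits = F.interpolate(logits, size=(grid * patch, grid * patch),
                       mode="bilinear", align_corners=False)
pred = logits.argmax(dim=1)                    # (1, 518, 518) per-pixel labels
print(tuple(pred.shape))  # (1, 518, 518)
```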
Thanks all for the guidelines, and thanks to the DINOv2 team for releasing the pre-trained model. I have managed to train a semantic segmentation model in my domain, and it has achieved exceptional performance, quite close to human capability. The surprise is that the training dataset was just a dozen masked labels. Do we have an explanation for this high performance in few-shot learning? Just curious!
That is a good question! If you let me speculate for a bit: semantic segmentation could be seen as a combination of two tasks, depth approximation and object classification. Depth approximation could provide good masks (good object-boundary estimation), while object classification provides a means to differentiate between the masks produced. Now if we think of DINOv2, its pretext task (a combination of DINO and iBOT) forces it to estimate the same object embedding when given two overlapping crops of an image. Again, this is all speculation; after all, models are often black boxes with emergent properties, but I think it is still interesting to discuss why we believe these properties emerge, to guide further design. Let me know what you think!
Hi folks, inspired by this thread, I created a tutorial on training a linear classifier on top of a frozen DINOv2 for semantic segmentation: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DINOv2/Train_a_linear_classifier_on_top_of_DINOv2_for_semantic_segmentation.ipynb. DINOv2 is now available in HF Transformers as well :) https://huggingface.co/docs/transformers/main/model_doc/dinov2
Hi @DuongTSon, I want to use DINOv2 in my domain, but the performance is very low. Maybe it's my code's fault. Could you please share your code?
@PeterKim1 Hi, I cannot share the code since it was a project in my company. However, you can take a look at this repository: https://github.com/itsprakhar/Downstream-Dinov2. It covers the basic structure of DINO-based models. Below are some experiences I have gained when using DINOv2:
Hope it can help you!
Thank you so much for this tutorial! Having a ton of fun using it for experimenting in the medical domain. I'm a tinkerer, not an actual programmer, so apologies if this is a bad question... Have you considered replicating this tutorial with the new models that include registers? I can't seem to find them available in HF, and I haven't yet been able to get it working appropriately by loading the model from Torch Hub. These seem like they could be quite promising for semantic segmentation. Thanks!
Hi, they haven't been added to HF yet: huggingface/transformers#27379. However, this should be really easy given the tiny differences in https://github.com/facebookresearch/dinov2/pull/282/files.
You could directly apply a linear layer to a tensor of shape (B, HW, D) instead of reshaping to (B, D, H, W) and using the 1x1-conv "trick" on it. PyTorch allows a linear layer to take tensors with more than 2 dimensions, provided the last one matches "in_features"; it then applies the linear layer to each of the B*H*W feature vectors independently. For clarification:
I created a sample notebook here that uses Torch Hub rather than Hugging Face for creating a custom semantic segmentation model. As a result, you can use the models with registers.
Awesome, will check it out. Thanks!!
I'm not able to find the code for semantic segmentation. In the paper it's written that:

Does this mean a linear layer with 32*32 = 1024 output classes needs to be trained? What about `n_last_blocks_list = [1, 4]` and `n_last_blocks = max(n_last_blocks_list)`? Do those need to be changed to `n_last_blocks_list = [1, 1]` and `n_last_blocks = max(n_last_blocks_list)`? Is there any sample code for semantic segmentation?