Replies: 7 comments 13 replies
-
Yeah, I also can't replicate it. I might try the original JAX code and see if they're doing something different. I'm training like this:
-
Never mind, with --grad-checkpointing I can now run batch size 1200, whereas without it I'm stuck at 220.
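For anyone else hitting memory limits, a minimal sketch of what that flag toggles under the hood, assuming a model whose towers expose set_grad_checkpointing (open_clip's built-in ViT towers do); the model/pretrained tags below are just examples:

```python
import open_clip

# Example tags; pick any pair from open_clip.list_pretrained()
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-16', pretrained='laion2b_s34b_b88k')

# Recompute activations during backward instead of storing them,
# trading extra compute for a much smaller activation footprint
model.set_grad_checkpointing(True)
```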
-
The performance gains claimed in the SigLIP paper have yet to be reproduced here; the CLIP and SigLIP losses are at best roughly equal, and in many situations SigLIP seems a bit worse in practice.
Since the OP I've added some variations of the SigLIP loss that can be switched via the "--loss-dist-impl" arg. See: open_clip/src/open_clip/loss.py, lines 314 to 448 in b2f1403.
'gather' is probably the best balance; 'reduce' might be the best for memory but is slower. The original bidir/shift impls here were supposed to mimic what was described in the paper, but they're probably not great given the number of send/recv calls needed to implement that in torch.
The official codebase never appears to have actually added the exact impl described in the paper. Not sure why. I don't imagine it being any faster than here, except maybe for some specific JAX reasons.
I also feel some of the comparisons made in the paper were against a CLIP loss impl that may not have been as efficient as the defaults we use here (the 'local loss + gather w/ grad' combo).
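For context, a minimal single-device sketch of the sigmoid loss as described in the paper (this is not the distributed implementation; the variants in loss.py differ mainly in how text features are exchanged across ranks):

```python
import torch
import torch.nn.functional as F

def sigmoid_loss(image_features, text_features, logit_scale, logit_bias):
    # features are (N, D) and assumed L2-normalized
    logits = logit_scale * image_features @ text_features.T + logit_bias
    # +1 on the diagonal (matched pairs), -1 for every other pairing
    labels = 2 * torch.eye(logits.shape[0], device=logits.device) - 1
    # -log sigmoid(labels * logits), summed over all pairs, averaged over the batch
    return -F.logsigmoid(labels * logits).sum() / logits.shape[0]
```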
-
Interesting. The state of ML research is really lacking in reproducibility!
If I could have your advice: I have a rich dataset of 3.5 million archaeological items, many with detailed descriptions. I'm looking to train an object retrieval model so that, based on an image, some text, or both, I can retrieve likely candidate archaeological objects.
CLIP-style models seem best suited for this. However, I hear they're a nightmare to fine-tune. I was going to fine-tune a pretrained SigLIP with weight decay off and the same settings as the paper. Are there other alternatives I should try? Maybe DINOv2 SSL?
-
Regarding augmentations: the good thing is that, because the text is long, I am using LLMs to produce multiple versions of it. So essentially one long description can become five slightly different short descriptions focusing on key details. I haven't seen this done and I'm hoping it improves performance.
Image augmentations are important too, since a lot of the database is black and white. I also want augmentations that add noise similar to older photographs; some of the database dates back to 1880.
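A rough sketch of the kind of image pipeline I have in mind, using standard torchvision transforms; the grayscale probability and noise level are placeholders to tune, and the normalization constants are the usual CLIP ones:

```python
import torch
from torchvision import transforms

def add_film_grain(img, std=0.03):
    # img is a float tensor in [0, 1]; add mild Gaussian "grain" and clamp back
    return (img + std * torch.randn_like(img)).clamp(0.0, 1.0)

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomGrayscale(p=0.3),       # mimic the black-and-white portion of the archive
    transforms.ColorJitter(0.2, 0.2, 0.1),   # mild tonal variation
    transforms.ToTensor(),
    transforms.Lambda(add_film_grain),       # rough stand-in for old photographic noise
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
])
```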
On Fri, Jan 3, 2025, 18:56 Ross Wightman wrote:
@alexisdrakopoulos I think you should get decent mileage fine-tuning SigLIP or CLIP models. The SigLIP and DFN (CLIP) models are at the top of most rankings by zero-shot and other downstream eval tasks.
If the captions are very long and tokens will be truncated, you might get better results deploying masking/prioritization à la CLIPA: https://github.com/mlfoundations/open_clip/blob/main/docs/clipa.md#text-token-length-reduction
Using the same pretrain settings as the originals is usually not the best strategy. Use an LR that's at least one order of magnitude smaller; I don't know if I'd fully disable weight decay, maybe try 1/2 to 1/4 of the original to start.
3.5M is a decent number of samples, but not 'a lot' by CLIP/SigLIP standards, so enabling some image augmentations could help.
I use layer-wise LR decay in a lot of timm-based fine-tunes; that can be used here too but isn't added to the codebase. I think the EVA people used it for their models, so you could try borrowing that (open to a PR if it can be cleanly added here):
https://github.com/baaivision/EVA/blob/master/EVA-CLIP/rei/training/optim.py
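A generic sketch of layer-wise LR decay expressed as optimizer param groups (not the EVA implementation linked above); it assumes a ViT tower whose blocks live at visual.transformer.resblocks, as in open_clip's built-in ViT, and the base LR, decay factor, and weight decay are placeholder values:

```python
import torch

def layerwise_param_groups(model, base_lr=1e-5, decay=0.75, weight_decay=0.05):
    num_blocks = len(model.visual.transformer.resblocks)
    num_layers = num_blocks + 1  # +1 so the patch/pos embeddings count as layer 0
    groups = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # crude layer-id assignment: embeddings/stem -> 0, block i -> i + 1,
        # everything else (text tower, projections, logit scale) -> top layer
        if name.startswith('visual.transformer.resblocks.'):
            layer_id = int(name.split('.')[3]) + 1
        elif name.startswith(('visual.conv1', 'visual.class_embedding',
                              'visual.positional_embedding')):
            layer_id = 0
        else:
            layer_id = num_layers
        scale = decay ** (num_layers - layer_id)
        groups.append({'params': [param],
                       'lr': base_lr * scale,
                       'weight_decay': weight_decay})
    return groups

# optimizer = torch.optim.AdamW(layerwise_param_groups(model))
```

In practice norm and bias parameters are often given zero weight decay; the sketch skips that for brevity, and the name matching would need adapting for timm-based towers.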
-
I am GPU poor, so large batch sizes aren't ideal. I can get access to TPUs for a short time, but I haven't tried PyTorch on TPUs, so I'm not sure what kind of issues I'll run into.
Otherwise I'm mainly stuck with a max of 8x 48GB of VRAM.
On Fri, Jan 3, 2025, 18:59 Ross Wightman wrote:
Also, I see no reason why you can't fine-tune a SigLIP model with CLIP loss, or try a CLIP model with SigLIP loss...
CLIP definitely performs better if you can fine-tune at larger batch sizes though, just like pretrain. Getting the global batch size up into the 8-32k range can really help, though you still get results without doing that. SigLIP appears to work a bit better at smaller batch sizes, though I haven't heard anyone trying it at the 16k+ range say they actually saw better results than with CLIP loss at similar batch sizes. I think the SigLIP dataset was better cleaned/curated than others, so that might be a big part of why those models are so good (vs just the loss).
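To illustrate the loss-swapping point, a sketch assuming ClipLoss can be imported from open_clip.loss as in the current codebase; the model/pretrained tags are examples, and the defaults shown are the single-process case:

```python
import torch
import open_clip
from open_clip.loss import ClipLoss

# SigLIP-pretrained tower, fine-tuned with the standard CLIP (InfoNCE) loss
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-16-SigLIP', pretrained='webli')
tokenizer = open_clip.get_tokenizer('ViT-B-16-SigLIP')

# single-process defaults; for DDP pass rank/world_size and consider
# local_loss=True, gather_with_grad=True as in the training script
loss_fn = ClipLoss()

images = torch.randn(4, 3, 224, 224)                        # stand-in batch
texts = tokenizer(['a pot', 'a coin', 'a blade', 'a bead'])
image_features = model.encode_image(images, normalize=True)
text_features = model.encode_text(texts, normalize=True)
loss = loss_fn(image_features, text_features, model.logit_scale.exp())
```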
-
The complete dataset is proprietary and ridiculously expensive. However, a subset of it (2M entries) I can likely publish down the line. I'm hoping to get a paper out of this!
What do you think of https://news.ycombinator.com/item?id=34970045? It's proprietary tech, so it could be fluff, but I'm curious about their claims.
On Fri, Jan 3, 2025, 19:03 Ross Wightman wrote:
Also, if this dataset happens to be openly licensable, I'm sure others would love to explore it... I can help get it on the HF Hub if that's a possibility. Understand it may be proprietary though :)
-
Hello,
The original SigLIP paper says they can fit a 2x larger batch size on TPU with the base SigLIP model compared with CLIP.
But in my experiment I used a 14400 batch size (300 per GPU) on 48 A100-40GB for both runs, with the SigLIP and CLIP models both base-sized standard architectures. During training, SigLIP takes 33.5 GB per GPU while CLIP takes 37.0 GB. They are close, and I couldn't scale the batch size up 2x as the paper suggests.
I am not using any FSDP/DeepSpeed techniques; is that the reason? Or does the GPU type matter a lot? I have no idea.
Can anyone who has trained a SigLIP model share their experience?
Thanks!
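In case it helps with comparisons like this, a quick sketch for recording peak per-GPU memory around a training step (best run after a few warm-up steps so the allocator has settled; the actual step is elided):

```python
import torch

torch.cuda.reset_peak_memory_stats()

# ... one forward / backward / optimizer step goes here ...

peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f'peak allocated this step: {peak_gib:.1f} GiB')
```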