General Post-Training with 4 RTX 4090 GPUs #33
Hi @ztianlin, the autoregressive finetuning only requires 2 A100/H100 GPUs: https://github.com/NVIDIA/Cosmos/tree/main/cosmos1/models/autoregressive/nemo/post_training
Two is still 160 GB. Will DIGITS's 128 GB be enough?
Thanks! And what about diffusion models? I really wish one could post-train diffusion models on a 4090.
@ztianlin I'm a PM at NVIDIA for COSMOS. Can you share why post-training on a 4090 is important?
@jpenningCA @ethanhe42 Currently, the documentation states:
However, this doesn't provide much clarity. It seems reasonable to infer that 8 A100s or H100s were likely used for the 7B and 14B models, but is that level of hardware strictly necessary from a VRAM perspective? What would be the minimum recommended VRAM requirements? Additionally, the README describes that training uses the NeMo Framework's data and model parallelism capabilities, specifically Fully Sharded Data Parallel (FSDP) and Tensor Parallelism. This suggests that parameters, optimizer states, and activations are sharded across all GPUs, and that individual layer weight tensors are also split across GPUs.
Given this information, understanding these details would help determine whether alternative hardware configurations could work for post-training these models.
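As a rough, purely illustrative back-of-envelope on the VRAM question: the sketch below assumes bf16 weights and gradients with fp32 Adam master weights and moments, evenly sharded by FSDP, and ignores activations, the text encoder, and the video tokenizer. None of these assumptions come from the Cosmos docs.

```python
# Back-of-envelope per-GPU static memory for full finetuning under FSDP.
# Assumed breakdown per parameter: bf16 weight (2) + bf16 grad (2)
# + fp32 master copy (4) + fp32 Adam moments m and v (8) = 16 bytes,
# evenly sharded across GPUs. Activation memory is ignored.
def fsdp_per_gpu_gb(n_params: float, n_gpus: int) -> float:
    bytes_per_param = 2 + 2 + 4 + 8
    return n_params * bytes_per_param / n_gpus / 1024**3

for gpus in (2, 4, 8):
    print(f"7B model, {gpus} GPUs: ~{fsdp_per_gpu_gb(7e9, gpus):.0f} GB/GPU before activations")
# -> ~52 GB on 2 GPUs, ~26 GB on 4, ~13 GB on 8; activations (which
#    depend on video size and checkpointing) come on top of this.
```

By this estimate, the parameter and optimizer states of the 7B model could in principle be sharded across 4x 24 GB cards, but activations and the auxiliary models would likely be the deciding factor.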
It's like the question of why we need LLMs on an iPhone. I think most customers have a 4090 rather than an expensive A100/H100.
As @StarsTesla put it metaphorically, my research resources are limited. I believe that if one could easily train on a 4090, the Cosmos community would become larger and more active.
Hi all, you can try PEFT/LoRA finetuning, which might have a lower GPU memory requirement. You need to add the following change to your recipe (e.g. cosmos_diffusion_7b_text2world_finetune) to enable LoRA:
recipe.trainer.strategy.sequence_parallel = False
recipe.model_transform = run.Config(
    llm.peft.LoRA,
    target_modules=['linear_qkv', 'linear_proj'],  # optionally also 'linear_fc1', 'linear_fc2'
    dim=256,
)
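As a rough illustration of how much this shrinks the trainable parameter count, here is a back-of-envelope estimate; the hidden size, layer count, and QKV fan-out below are assumptions for illustration, not values taken from the Cosmos code.

```python
# Rough LoRA trainable-parameter estimate for the recipe change above.
# hidden, layers, and the hidden -> 3*hidden QKV shape are assumed.
hidden = 4096      # assumed transformer hidden size
layers = 36        # assumed number of transformer blocks
rank = 256         # LoRA dim from the recipe above

qkv = rank * (hidden + 3 * hidden)   # A: hidden x rank, B: rank x 3*hidden
proj = rank * (hidden + hidden)      # output projection
total = layers * (qkv + proj)
print(f"~{total / 1e6:.0f}M trainable parameters")  # roughly a couple hundred million
```

That is the same order of magnitude as the ~264M trainable parameters reported later in this thread, and a small fraction of the 7B base model, whose frozen weights still have to sit in memory.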
Hi @ethanhe42, thanks for the LoRA code! It seems 2x A100s can do full finetuning with activation checkpointing and smaller videos (12x45x80 instead of 10x90x160). I'm also noticing that:
I'm trying to understand what is using all this VRAM, and how I might trade resolution/accuracy for less VRAM. As an aside, it seems that on line 76 in
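One factor worth quantifying is that the latent video shape directly sets the transformer sequence length. Below is a quick comparison of the two shapes mentioned above, treating each latent element as one token, which is an assumption about the tokenization rather than something stated in the Cosmos docs.

```python
# Sequence lengths implied by the two latent video shapes above
# (frames x height x width), assuming one token per latent element.
small = 12 * 45 * 80    # 43,200 tokens
full = 10 * 90 * 160    # 144,000 tokens
print(f"small/full = {small / full:.2f}")  # ~0.30, i.e. about 3.3x fewer tokens
# Activation memory grows at least linearly with sequence length (and
# attention compute quadratically), so spatial resolution is one of the
# stronger resolution-vs-VRAM levers available.
```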
What do you see?
You can try to further reduce the LoRA size, or add
Yes, this is for reducing memory usage.
@ethanhe42 Thanks for the response! I see 2 of those prints:
However, a different print also appears in
I'm unsure why going from 4.4B trainable params with activation checkpointing to 264M params without checkpointing only reduces VRAM usage by ~2 GB. It feels like something else is occupying a lot of VRAM, as even dramatically reducing the sequence length doesn't change much.
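A plausible (though unverified) reading is that with LoRA the static memory is dominated by the frozen base weights, so shrinking the trainable set has a bounded effect, and disabling activation checkpointing gives much of that saving back as activation memory. Here is a rough static-memory sketch under the same assumptions as earlier (bf16 weights, 16 bytes per trainable parameter, activations and auxiliary models ignored).

```python
# Rough static memory for a 7B model on 2 GPUs: frozen weights in bf16
# plus 16 bytes per trainable parameter (bf16 weight + grad, fp32 master
# + Adam moments). Activations are deliberately left out.
GB = 1024**3

def static_gb(total_params: float, trainable_params: float, n_gpus: int) -> float:
    frozen = (total_params - trainable_params) * 2
    trainable = trainable_params * (2 + 2 + 4 + 8)
    return (frozen + trainable) / n_gpus / GB

print(f"4.4B trainable: ~{static_gb(7e9, 4.4e9, 2):.0f} GB/GPU")   # ~35 GB
print(f"264M trainable: ~{static_gb(7e9, 264e6, 2):.0f} GB/GPU")   # ~8 GB
# The static gap is large, but turning off activation checkpointing adds
# activation memory that scales with sequence length, which would explain
# why the observed end-to-end difference is only a couple of GB.
```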
Hello, I wonder if it is possible to do general post-training for the diffusion WFM with 4 GeForce RTX 4090 GPUs.
My dad can't afford 8 A100 GPUs. Please show mercy to poor people!