Fixed seasonality, torch.compile and added "resume training" #87
victorbjorsvik started this conversation in Show and tell
For anyone interested: I looked through the Errata and several of the PRs in this repo, implemented most of the suggestions in my own repo, and trained the model for 1 epoch. You can find the repo here. These are the most pertinent changes I made:
Addressed several of the issues Andrej ran into throughout the video, including:
Seasonality in training data
Inability to use torch.compile while doing HellaSwag evals in the training loop
More aggressive learning rate and schedule
Enabled resuming training based on last model checkpoint
All the issues mentioned in Andrej's Errata in his repo
Several other improvements suggested in PRs in Andrej's repo
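The seasonality fix boils down to not feeding the data shards in the same fixed order every epoch. As a minimal sketch (not the actual code from my repo; `shard_order` and the seed value are hypothetical), a deterministic per-epoch permutation keeps runs reproducible while breaking the fixed ordering:

```python
import random

def shard_order(num_shards, epoch, seed=1337):
    # Deterministic per-epoch permutation of shard indices:
    # seeding with seed + epoch means the same run (or a resumed
    # run) always replays the exact shuffle for that epoch, while
    # different epochs see the shards in different orders.
    rng = random.Random(seed + epoch)
    order = list(range(num_shards))
    rng.shuffle(order)
    return order
```

A data loader can then walk `shard_order(num_shards, epoch)` instead of `range(num_shards)`; tokens within a shard can be shuffled the same way if finer-grained mixing is needed.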
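The learning-rate schedule follows the usual linear-warmup-then-cosine-decay shape; only the constants change when making it more aggressive. A sketch with hypothetical values (the specific `max_lr`, `warmup_steps`, and `max_steps` here are placeholders, not the ones from my run):

```python
import math

max_lr = 18e-4          # hypothetical peak LR (3x the video's 6e-4)
min_lr = max_lr * 0.1   # decay floor
warmup_steps = 100
max_steps = 19073       # ~1 epoch of 10B tokens at 524288 tokens/step

def get_lr(step):
    # Linear warmup to max_lr, then cosine decay down to min_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step > max_steps:
        return min_lr
    ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```

"More aggressive" here means a higher peak and a shorter warmup than the conservative GPT-3 settings used in the video.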
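Resuming training mostly comes down to bookkeeping: periodically persisting the step counter and the data-loader position, and restoring them at startup. The real thing also saves model and optimizer state via `torch.save`; the sketch below (function names and the JSON layout are my own, for illustration) shows just the stdlib bookkeeping part:

```python
import json
import os

def save_checkpoint(path, step, shard_idx, pos_in_shard):
    # Write to a temp file, then atomically rename, so a crash
    # mid-write never leaves a corrupt checkpoint behind.
    state = {"step": step, "shard_idx": shard_idx, "pos": pos_in_shard}
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    # Returns None for a fresh run, otherwise the saved state.
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```

On startup the training loop checks `load_checkpoint`, and if it finds state it fast-forwards the data loader to `(shard_idx, pos)` and continues from `step` instead of starting over.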
During training I managed to get a dt of ~0.34 s per step and processed ~1.5M tokens per second. I trained the model for 1 epoch (~10B tokens) in under 2 hours on 8 A100 (80 GB SXM4) GPUs. This gave me a minimum training loss of 2.84857, a minimum validation loss of 3.0383, and a maximum HellaSwag eval of 0.3101. Model checkpoints can be provided upon request.
All credit goes to the good people suggesting edits in the PRs.