About Learning rate decay #64

Open
afcruzs opened this issue Oct 17, 2023 · 2 comments

afcruzs commented Oct 17, 2023

Hello, I have a small question regarding the muP proxy-model sweeps. Did you decay the learning rate fully over the 4B or 16B tokens used to train the proxy models mentioned in Appendix F.4 (GPT-3)? Or did you decay the learning rate over the "real" number of tokens to be used for the target model (effectively decaying very little during the proxy-model sweeps)?

It would be interesting to know what you did in the Appendix 4.3 (GPT-3) experiments, and in general whether this has any effect on transferability at all (perhaps you have some empirical or theoretical insights). Recommendations would be very welcome :)
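
To make the two alternatives concrete, here is a minimal sketch of the two decay horizons being contrasted. The `cosine_lr` helper, the base learning rate, the warmup length, and the step budgets are all hypothetical and purely illustrative; they are not taken from the paper or this repo:

```python
import math

def cosine_lr(step, base_lr, warmup_steps, total_decay_steps, min_lr=0.0):
    """Linear warmup followed by cosine decay over `total_decay_steps`."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_decay_steps - warmup_steps))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical step budgets (tokens / tokens-per-step); numbers are illustrative only.
proxy_steps = 8_000      # e.g. ~4B tokens for the small proxy sweep
target_steps = 600_000   # e.g. ~300B tokens for the full target run

# Option A: the schedule decays fully within the proxy run itself.
lr_option_a = [cosine_lr(s, 6e-4, 500, proxy_steps) for s in range(proxy_steps)]

# Option B: the schedule is defined over the target run's horizon, so the proxy
# run only traverses the early, nearly flat part of the decay curve.
lr_option_b = [cosine_lr(s, 6e-4, 500, target_steps) for s in range(proxy_steps)]
```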

yadandan commented Nov 2, 2023

We have the same question. We also found that the optimal learning rate differs across widths depending on the number of training steps: for instance, in the early stages of training a larger learning rate performs better, but as training progresses a smaller learning rate gradually overtakes it.

xidulu commented Apr 26, 2024

@yadandan Just so I understand better: when you say "performs better", are you referring to training error or test error?

Thanks
