Things That Might Be Wrong #1
smpanaro announced in Announcements
A few parts of the Norm Tweaking paper were ambiguous to me and are currently implemented with my best guess. I've listed them below and linked to the relevant code. If you think there's a more correct implementation, let me know! PRs welcome too!
Loss Function
The loss function from the paper is:
![Loss function from the Norm Tweaking paper](https://private-user-images.githubusercontent.com/2950214/299531582-c89d3616-9d76-46ce-9a54-9f02f03cb30e.png)
Two things that I suspect might be wrong:
Batch Size
The paper says to use Adam as the optimizer, but also to use the equation $lr_i = lr_0 \cdot (1 + scale \cdot (i/L))$ to set the learning rate for layer $i$. I think this implies splitting the 128 samples into smaller batches for each layer so that Adam takes more than a single step. I've chosen to infer each sample individually, but it's unclear to me whether a batch size anywhere between 2 and 128 would be helpful or matter at all.
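For concreteness, here's a minimal sketch of how I'm reading that schedule. The `lr0` and `scale` values are placeholders rather than settings from the paper, and the "norm parameters only" selection is just one way to pick out the weights being tweaked:

```python
import torch

# Placeholder hyperparameters -- the formula is from the paper, these values are not.
lr0 = 1e-5        # base learning rate
scale = 5.0       # how much larger the step gets for deeper layers
num_layers = 32   # L, total number of transformer layers in the model

def layer_lr(i: int) -> float:
    """lr_i = lr_0 * (1 + scale * (i / L)): deeper layers get a larger learning rate."""
    return lr0 * (1 + scale * (i / num_layers))

def make_optimizer(layer: torch.nn.Module, layer_idx: int) -> torch.optim.Adam:
    """One Adam instance per layer, over that layer's norm parameters only."""
    norm_params = [p for name, p in layer.named_parameters() if "norm" in name.lower()]
    return torch.optim.Adam(norm_params, lr=layer_lr(layer_idx))
```

With a batch size of 1, Adam takes 128 steps per layer; with a single batch of all 128 samples, it would take only one.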
Identifying the Language of Tokens
This repo currently implements LLM-QAT's synthetic data generation, not Norm Tweaking's evolution of it. It's unclear to me how to map each token to a language. As a simple example, the token "a" is a word in many languages. Even if it were possible to map each token to a language, it's not clear to me how to determine the proportion of each language in the training data.
![Language proportion table from the Norm Tweaking paper](https://private-user-images.githubusercontent.com/2950214/299534512-a2fc21f9-c13f-46a3-b835-47f55db1af6e.png)
How would one generate this table for an arbitrary tokenizer vocab?
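For comparison only, the naive approach I can imagine is to run an off-the-shelf language identifier over every decoded token and tally the labels. The sketch below assumes the `langid` package and a GPT-2 tokenizer purely as examples, and it runs straight into the ambiguity described above: short tokens like "a" get an essentially arbitrary label.

```python
from collections import Counter
import langid                                      # assumption: langid package for language ID
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer; gpt2 is just an example

counts = Counter()
for token_id in range(tokenizer.vocab_size):
    text = tokenizer.decode([token_id]).strip()
    if not text:
        continue
    lang, _score = langid.classify(text)  # very unreliable for 1-2 character tokens
    counts[lang] += 1

total = sum(counts.values())
for lang, n in counts.most_common(10):
    print(f"{lang}: {100 * n / total:.1f}%")
```

Even if this ran cleanly, it would give per-token language counts, not the proportion of each language in the original training data, which is what the paper's table appears to describe.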
Implementation of Layer Freezing
I'm not sure if the way I've frozen/unfrozen each layer for tweaking is correct. It looks correct when I examine `requires_grad`, but it's possible there's a nuance of PyTorch that I have missed.
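For reference, the pattern is roughly the one below (a simplified sketch, not the exact repo code): disable gradients everywhere, then re-enable them only for the norm modules inside the layer being tweaked.

```python
import torch

def freeze_all_but_norms(model: torch.nn.Module, target_layer: torch.nn.Module) -> None:
    """Simplified sketch: turn off grads for the whole model, then re-enable them
    only for LayerNorm/RMSNorm-style modules inside the layer being tweaked."""
    for param in model.parameters():
        param.requires_grad = False
    for module in target_layer.modules():
        # Matches torch.nn.LayerNorm as well as model-specific classes like LlamaRMSNorm.
        if "norm" in type(module).__name__.lower():
            for param in module.parameters():
                param.requires_grad = True
```

One nuance worth double-checking: Adam only skips parameters whose `.grad` is `None`, so a parameter that still holds a stale `.grad` tensor from an earlier step can keep being updated even after `requires_grad` is flipped off. Constructing the optimizer over only the unfrozen parameters sidesteps this.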