Can not reproduce results by LLAMA-7B on OpenBook QA #24
Comments
+1, I'm getting exactly the same results.
Hi, the results in Table 6 were obtained with OPT-30B (as described in Section 5.3, Q3). For practical use, you can use the accumulated attention scores obtained from the whole prefilling stage. Since OpenbookQA requires only one decoding step, our current implementation is a simulation that splits the original prefilling stage into two parts and treats the second part as a simulated decoding stage. In this simulated version we use only the local statistics of the accumulated attention scores, which can be biased when the sequence is extremely short.
Hey @Kyriection, thanks a lot for your response and the extra clarification. I'm having one more issue, this time with reproducing Figure 8 from the latest version of the paper. I followed your setup exactly and haven't changed anything in the code; I'm just running the commands from the README. Below I've pasted a screenshot of an Excel sheet with my results: in my runs the downstream scores degrade much more quickly than reported in Figure 8. Do you have any idea why I cannot reproduce those results? I'm using huggyllama-llama-7b with an even heavy and recent ratio.
Did you use scores from the prefilling stage for any of the downstream results reported in the paper, or did you use the simulated decoding? I believe that the implementation in the repo, at least for the LM-Eval, follows the simulated decoding approach.
Hi, we adjust how much of the prefilling stage is used in the simulated decoding approach. Since some input samples contain only tens of tokens, using 20% of them to calculate the accumulated attention scores is highly biased. For simplicity, you can directly use the whole prefilling stage to calculate the scores, which is a reasonable and practical setting.
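To make the bias concrete, here is a toy sketch in plain Python. It is not the repository's code: the random causal attention matrix, the 30-token length, and the helper names `causal_attention` and `accumulated_scores` are all assumptions made for illustration. With a causal mask, the first 20% of query rows can only attend to the first 20% of keys, so every later key gets an accumulated score of exactly zero from those local statistics.

```python
import random

def causal_attention(seq_len, seed=0):
    """Toy row-stochastic causal attention: query i attends only to keys 0..i."""
    rng = random.Random(seed)
    rows = []
    for i in range(seq_len):
        raw = [rng.uniform(0.1, 1.0) for _ in range(i + 1)]
        total = sum(raw)
        # Normalize the visible keys and pad the masked (future) keys with zeros.
        rows.append([w / total for w in raw] + [0.0] * (seq_len - i - 1))
    return rows

def accumulated_scores(attn, num_queries):
    """Attention mass each key receives from the first `num_queries` query rows."""
    seq_len = len(attn)
    return [sum(attn[q][k] for q in range(num_queries)) for k in range(seq_len)]

seq_len = 30                         # assumed short prompt ("tens of tokens")
attn = causal_attention(seq_len)
prefix = max(1, int(0.2 * seq_len))  # first 20% of the prefill: 6 query rows

local_scores = accumulated_scores(attn, prefix)    # local statistics
global_scores = accumulated_scores(attn, seq_len)  # whole prefilling stage

scored_locally = sum(1 for s in local_scores if s > 0)
scored_globally = sum(1 for s in global_scores if s > 0)
print(scored_locally, scored_globally)  # prints "6 30"
```

In this toy setting only 6 of the 30 keys carry any score under the 20% prefix, while the whole prefilling stage scores all of them. Real attention maps are not random, so this shows only the structural effect, not its size on any benchmark.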
Yes, I understand. Is this logic implemented somewhere in the code? Also, do you have any idea what could be the reason behind my suboptimal results?
Hi, you can use the implementation here: https://github.com/FMInference/H2O/blob/main/h2o_hf/utils_lm_eval/modify_llama.py#L152. (I tested the current implementation with llama-1-7b on openbookqa: full accuracy is 44.6 and H2O is 44.4.) The previous simulation implementation directly used the first 20% of the prefilling stage to calculate the accumulated attention scores, which are biased when input samples contain only tens of tokens. This might be the reason behind the suboptimal results. By increasing the ratio of the prefilling stage used to calculate the accumulated attention scores, or by directly using the whole prefilling stage (global statistics), this bias can be largely mitigated, resulting in better performance.
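As a rough illustration of the ratio's effect, the sketch below compares the heavy-hitter set chosen from a prefix of the prefill against the set chosen from the whole prefill. Everything here is hypothetical: the random attention matrix, the 40-token length, the budget of 8, and the helper `heavy_hitters` are illustrative assumptions, not the code at the link above. The overlap it reports measures set agreement only and says nothing about downstream accuracy.

```python
import random

def causal_attention(seq_len, seed=0):
    """Toy row-stochastic causal attention: query i attends only to keys 0..i."""
    rng = random.Random(seed)
    rows = []
    for i in range(seq_len):
        raw = [rng.uniform(0.1, 1.0) for _ in range(i + 1)]
        total = sum(raw)
        rows.append([w / total for w in raw] + [0.0] * (seq_len - i - 1))
    return rows

def heavy_hitters(attn, ratio, budget):
    """Top-`budget` keys by attention accumulated over the first `ratio` of query rows."""
    seq_len = len(attn)
    num_queries = max(1, min(seq_len, round(ratio * seq_len)))
    scores = [sum(attn[q][k] for q in range(num_queries)) for k in range(seq_len)]
    ranked = sorted(range(seq_len), key=lambda k: scores[k], reverse=True)
    return set(ranked[:budget])

attn = causal_attention(40)   # assumed 40-token prompt
budget = 8                    # assumed heavy-hitter budget

# Reference set: global statistics from the whole prefilling stage.
reference = heavy_hitters(attn, 1.0, budget)

overlap = {}
for ratio in (0.2, 0.5, 1.0):
    picked = heavy_hitters(attn, ratio, budget)
    overlap[ratio] = len(picked & reference) / budget
    print(f"ratio={ratio}: overlap with global selection = {overlap[ratio]:.2f}")
```

At a ratio of 1.0 the selection matches the global one by construction. At smaller ratios the selection is restricted to keys inside the prefix, which is the local-statistics bias described above.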
Hello, did you find the code for the "simulated decoding" in this repo? Thanks.
Full Cache Baseli
huggyllama/llama-7b
bash scripts/lm_eval/full_cache.sh openbookqa huggyllama/llama-7b llama
huggyllama/llama-7b
H2O
bash scripts/lm_eval/h2o.sh openbookqa huggyllama/llama-7b llama
As shown in the paper :