Impressive work on the innovative data selection method!
I recently finished reading your paper and am particularly curious about the computation of the gradient projection. You mention using a 125M-parameter model and reducing the gradient dimension to 16384. Does this imply storing a 125M × 16384 projection matrix, i.e. roughly 2×10^12 entries (about 8 TB in fp32)? That seems impractical given memory constraints. Even if the random projection matrix were generated on the fly, the computational cost of the projection itself would still be substantial. Yet the paper states that projection costs only about 1% of the forward-backward pass. I find this confusing. Could you provide some information on this matter? Thank you very much!
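(For anyone else puzzling over this: one common way to avoid materializing the full D × K matrix is to regenerate it block-by-block from a fixed seed and accumulate partial products, so peak memory is O(chunk × K) rather than O(D × K). This is only a minimal NumPy sketch of that seeded on-the-fly idea, not the authors' implementation; the function name `project_gradient`, the Gaussian entries, and the chunk size are illustrative assumptions, and real implementations typically use fused GPU kernels for speed.)

```python
import numpy as np

def project_gradient(grad, k, seed=0, chunk=4096):
    """Project a flat D-dim gradient down to k dims without ever
    materializing the full D x k projection matrix.

    Each row-block of the (implicit) projection matrix is regenerated
    on the fly from a deterministic per-block seed, so peak extra
    memory is O(chunk * k) instead of O(D * k). Using the same seed
    for every gradient guarantees all gradients are projected by the
    same (implicit) matrix.
    """
    d = grad.shape[0]
    out = np.zeros(k, dtype=np.float64)
    for i, start in enumerate(range(0, d, chunk)):
        stop = min(start + chunk, d)
        # Deterministic Gaussian block, scaled so the projection
        # approximately preserves inner products (JL-style).
        rng = np.random.default_rng(seed + i)
        block = rng.standard_normal((stop - start, k)) / np.sqrt(k)
        out += grad[start:stop] @ block
    return out
```

Since the blocks are regenerated from seeds, the projection of two different gradients uses the identical implicit matrix, which is what makes downstream inner products between projected gradients meaningful.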
Hi @aztec1900, did you make some progress on this issue? I am also very interested in it, but I didn't find the code to estimate datamodels in this repo.