Add FP16 support to ptq_evaluate.py and update README argument list #1174
Conversation
Signed-off-by: hkayann <[email protected]>
Just 2 quick comments, otherwise looks good!
Thanks again for this.
-    '--dtype', default='float', choices=['float', 'bfloat16'], help='Data type to use')
+    '--dtype',
+    default='float',
+    choices=['float', 'bfloat16', 'half'],
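As an aside, here is a minimal sketch of how the flag could map onto an actual torch dtype once `half` is renamed to `float16` (the `DTYPE_MAP` lookup is an illustrative assumption, not the script's real code):

```python
import argparse

import torch

# Hypothetical lookup from the CLI string to a torch dtype; the real
# ptq_evaluate.py may resolve the string differently.
DTYPE_MAP = {
    'float': torch.float32,
    'bfloat16': torch.bfloat16,
    'float16': torch.float16,  # torch.float16 is the same dtype as torch.half
}

parser = argparse.ArgumentParser()
parser.add_argument(
    '--dtype',
    default='float',
    choices=list(DTYPE_MAP.keys()),
    help='Data type to use')

args = parser.parse_args([])      # empty list just to show the default
print(DTYPE_MAP[args.dtype])      # torch.float32
```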
I think `half` and `float16` are equivalent, and for consistency reasons I would prefer `float16`. If I am missing something, please let me know.
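As a quick illustration (not part of this PR), PyTorch itself treats the two names as aliases for the same dtype:

```python
import torch

# 'half' is just an alias for 'float16'; both refer to the same dtype object.
assert torch.half is torch.float16
assert torch.zeros(1, dtype=torch.half).dtype == torch.float16
print(torch.half)  # prints: torch.float16
```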
Yes, they are functionally equivalent. My initial thought was that using `half` could help prevent typos, since `float16` and `bfloat16` differ only by the letter b.
Now I am trying to save custom-bit-width models, for example an FP16 variant with a 9-bit mantissa and a 6-bit exponent, but that does not seem to be possible with the available PyTorch dtypes.
Unfortunately I'm not sure I can fully help with the second issue, since we mostly focus on minifloat quantization with 8 bits or fewer.
In the meantime, would you mind changing `half` to `float16`? I understand the potential for typos, but I still prefer to be consistent across the codebase, and we never (or rarely) use `half` instead of `float16`.
Thanks!
I can simulate it for now, so I have a working workaround, which is something. I have made all the requested changes. Many thanks again.
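For reference, a minimal sketch of what such a simulation could look like in plain PyTorch, fake-quantizing float32 tensors onto the grid of a custom sign/exponent/mantissa format (the helper name, IEEE-style bias, and saturation behaviour are assumptions, not the author's actual workaround):

```python
import torch


def fake_quantize_minifloat(x: torch.Tensor, exp_bits: int, mant_bits: int) -> torch.Tensor:
    """Restrict ``x`` to the values of a custom (1, exp_bits, mant_bits) float format.

    The tensor stays in float32; only the representable grid changes, which is
    enough to simulate a format that has no native PyTorch dtype.
    """
    bias = 2 ** (exp_bits - 1) - 1          # IEEE-style exponent bias (assumed)
    max_exp = (2 ** exp_bits - 2) - bias    # top exponent code reserved for inf/nan
    min_exp = 1 - bias                      # smallest normal exponent

    sign = torch.sign(x)
    mag = x.abs().clamp(min=torch.finfo(torch.float32).tiny)

    # Per-element exponent, clamped so values below 2**min_exp share one step
    # size (this models gradual underflow / subnormals).
    exp = torch.clamp(torch.floor(torch.log2(mag)), min=min_exp, max=max_exp)

    # Round the mantissa to ``mant_bits`` fractional bits at that exponent.
    step = 2.0 ** (exp - mant_bits)
    q = torch.round(mag / step) * step

    # Saturate at the largest representable normal value.
    max_val = (2.0 - 2.0 ** -mant_bits) * 2.0 ** max_exp
    return sign * torch.clamp(q, max=max_val)


# Example: 1 sign bit, 6 exponent bits, 9 mantissa bits (16 bits total).
w = torch.randn(4, 4)
print(fake_quantize_minifloat(w, exp_bits=6, mant_bits=9))
```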
Signed-off-by: hkayann <[email protected]>
Thanks for this again :) I will let the tests run and then merge it.
Reason for this PR
Changes Made in this PR
Testing Summary
Risk Highlight
Checklist
dev branch.