Low throughput with modernbert #531
Comments
Getting batch_size=32, avg tokens per sentence=1024 with:
Can you use the trt-onnx docker images? ModernBERT requires flash-attention-2 (flash-attn), which requires a different build environment.
Will try this, but I have flash-attn installed in this image.
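A minimal sanity check, assuming a Python interpreter is available inside the image, that flash-attn is importable and the GPU is visible (illustrative only; it does not prove the serving path actually uses flash-attn):

```python
# Quick check inside the container: is flash-attn importable and does
# PyTorch see the GPU? All prints are informational only.
import torch

try:
    import flash_attn
    print("flash-attn version:", getattr(flash_attn, "__version__", "unknown"))
except ImportError:
    print("flash-attn is NOT importable in this environment")

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```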
Does this mean one has to convert the model to ONNX?
@ewianda No, it will use flash-attn.
I tried the image, but the throughput was still the same 🤷
System Info
Testing https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base
On an NVIDIA L4, this seems quite low for a ~150M-parameter model?
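For a rough sense of what "low" means, here is a back-of-envelope upper bound on pairs/s for this setup. All constants are assumptions, not measurements from this issue: ~150M parameters, ~1024 tokens per pair, and roughly 121 TFLOPS dense FP16/BF16 peak for the L4 (datasheet figure); real utilization will be well below peak, and the 2·params·tokens approximation ignores attention overhead:

```python
# Back-of-envelope throughput ceiling for a ~150M-param reranker on an L4.
# All numbers below are assumptions for illustration.
params = 150e6                         # ~150M parameters
tokens = 1024                          # avg tokens per query/passage pair
flops_per_pair = 2 * params * tokens   # ~2 FLOPs per param per token (forward only)

l4_peak_flops = 121e12                 # ~121 TFLOPS dense FP16/BF16 (assumed datasheet value)
for utilization in (0.5, 0.25, 0.1):
    pairs_per_s = utilization * l4_peak_flops / flops_per_pair
    print(f"at {utilization:.0%} of peak: ~{pairs_per_s:.0f} pairs/s")
```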
Information
Tasks
Reproduction
v75 + torch + cuda
Alibaba-NLP/gte-reranker-modernbert-base
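As a point of comparison outside the serving stack, a standalone timing sketch using sentence-transformers' CrossEncoder, with synthetic inputs chosen to roughly match batch_size=32 and ~1024 tokens per pair (a sketch, not the exact reproduction: results will depend on dtype, padding, and whether flash-attn is picked up by transformers):

```python
# Rough standalone throughput measurement of the same reranker on the GPU.
# Requires a transformers version with ModernBERT support.
import time
import torch
from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "Alibaba-NLP/gte-reranker-modernbert-base",
    max_length=1024,   # truncate pairs to ~1024 tokens, matching the issue
    device="cuda",
)

# Synthetic query/passage pairs; long passages get truncated to max_length.
pairs = [
    ("what is the capital of France " * 8,
     "Paris is the capital and most populous city of France. " * 100)
] * 128

torch.cuda.synchronize()
start = time.time()
scores = model.predict(pairs, batch_size=32)
torch.cuda.synchronize()
elapsed = time.time() - start

print(f"{len(pairs)} pairs in {elapsed:.2f}s -> {len(pairs) / elapsed:.1f} pairs/s")
```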