
Low throughput with modernbert #531

Open · 2 of 5 tasks

rawsh-rubrik opened this issue Feb 11, 2025 · 6 comments
Comments

@rawsh-rubrik

System Info

Testing https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base

INFO     2025-02-11 20:36:37,724 infinity_emb INFO:           select_model.py:64
         model=`Alibaba-NLP/gte-reranker-modernbert-base`                       
         selected, using engine=`torch` and device=`cuda`
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
INFO     2025-02-11 20:36:44,229 infinity_emb INFO: using            torch.py:88
         torch.compile(dynamic=True)                                            
W0211 20:37:23.950000 1 torch/_inductor/utils.py:1137] [6/0] Not enough SMs to use max_autotune_gemm mode
INFO     2025-02-11 20:39:06,469 infinity_emb INFO: Getting   select_model.py:97
         timings for batch_size=32 and avg tokens per                           
         sentence=3                                                             
                 2.62     ms tokenization                                       
                 19.90    ms inference                                          
                 0.04     ms post-processing                                    
                 22.56    ms total                                              
         embeddings/sec: 1418.67                                                

INFO     2025-02-11 20:40:19,740 infinity_emb INFO: Getting  select_model.py:103
         timings for batch_size=32 and avg tokens per                           
         sentence=1025                                                          
                 52.67    ms tokenization                                       
                 33388.80 ms inference
                 0.16     ms post-processing
                 33441.63 ms total
         embeddings/sec: 0.96

On an NVIDIA L4, this seems quite low for a ~150M parameter model?
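For reference, the reported embeddings/sec figure is just batch_size divided by the total per-batch latency. A quick sketch reproducing the arithmetic from the log above (timings taken verbatim from the log; small rounding differences vs. the logged rate are expected):

```python
# Back-of-the-envelope throughput from the infinity_emb timings above.
batch_size = 32

def throughput(batch_size: int, total_ms: float) -> float:
    """Sentences per second for one batch processed in total_ms milliseconds."""
    return batch_size / (total_ms / 1000.0)

short_seq = throughput(batch_size, 22.56)      # avg 3 tokens/sentence
long_seq = throughput(batch_size, 33441.63)    # avg 1025 tokens/sentence

print(f"{short_seq:.2f} embeddings/sec")  # ~1418, close to the logged 1418.67
print(f"{long_seq:.2f} embeddings/sec")   # ~0.96, matching the logged 0.96
```

So the long-sequence case is dominated entirely by the 33.4 s inference step; tokenization and post-processing are negligible.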

Information

  • Docker + cli
  • pip + cli
  • pip + usage of Python interface

Tasks

  • An officially supported CLI command
  • My own modifications

Reproduction

v75 + torch + cuda
Alibaba-NLP/gte-reranker-modernbert-base


rawsh-rubrik commented Feb 11, 2025

Getting

batch_size=32 avg tokens per sentence=1024
embeddings/sec: 47.83

with BAAI/bge-reranker-v2-m3 + --no-bettertransformer
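For scale, the two reported numbers put the ModernBERT model roughly 50x behind at the same batch size and comparable sequence length (quick arithmetic using only the figures in this thread):

```python
# Reported embeddings/sec at batch_size=32, ~1024 tokens/sentence.
modernbert_eps = 0.96   # Alibaba-NLP/gte-reranker-modernbert-base
bge_eps = 47.83         # BAAI/bge-reranker-v2-m3 with --no-bettertransformer

print(f"modernbert is ~{bge_eps / modernbert_eps:.0f}x slower")  # ~50x
```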

@michaelfeil (Owner)

Can you use the trt-onnx docker images? ModernBert requires flash-attention-2 (flash-attn) which requires a different build environment.
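One way to sanity-check the environment is to verify that flash-attn is even importable before assuming the flash_attention_2 path is used. A sketch (the import check only proves the package is installed, not that its CUDA kernels were built for the torch/GPU combination in the image; note also the startup warning above, which suggests the model may have been initialized on CPU first, and flash-attn additionally requires fp16/bf16 weights):

```python
# Sketch: confirm flash_attn is importable (importlib is stdlib).
import importlib.util

def has_flash_attn() -> bool:
    """True if the flash_attn package can be imported in this environment."""
    return importlib.util.find_spec("flash_attn") is not None

print("flash_attn importable:", has_flash_attn())

# If it is importable, ModernBERT can be loaded with the flash attention
# path in transformers (requires a GPU; model name taken from this issue):
#
#   from transformers import AutoModelForSequenceClassification
#   model = AutoModelForSequenceClassification.from_pretrained(
#       "Alibaba-NLP/gte-reranker-modernbert-base",
#       torch_dtype=torch.float16,
#       attn_implementation="flash_attention_2",
#   ).to("cuda")
```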


rawsh-rubrik commented Feb 12, 2025

Will try this, but I have flash-attn installed in this image:

FROM nvidia/cuda:12.1.1-devel-ubuntu22.04 AS base

ENV PYTHONUNBUFFERED=1 \
    \
    # pip
    PIP_NO_CACHE_DIR=off \
    PIP_DISABLE_PIP_VERSION_CHECK=on \
    PIP_DEFAULT_TIMEOUT=100 \
    \
    PYTHON="python3.10"
RUN apt-get update && apt-get install build-essential python3-dev libsndfile1 $PYTHON-venv $PYTHON curl -y
WORKDIR /app

FROM base AS builder
# setup venv
RUN $PYTHON -m venv /app/venv
ENV PATH="/app/venv/bin:$PATH"
RUN pip install wheel packaging
RUN pip install "infinity-emb[all]==0.0.75" "sentence-transformers==3.4.1" "transformers==4.48.3"

# install flash-attn
RUN pip install --no-cache-dir flash-attn --no-build-isolation

# Use a multi-stage build -> production version, with download
FROM base AS tested-builder
COPY --from=builder /app /app
ENV HF_HOME=/app/.cache/huggingface
ENV PATH=/app/venv/bin:$PATH
# do nothing
RUN echo "copied all files"

# Use a multi-stage build -> production version
FROM tested-builder AS production
ENTRYPOINT ["infinity_emb"]


ewianda commented Feb 20, 2025

> Can you use the trt-onnx docker images? ModernBert requires flash-attention-2 (flash-attn) which requires a different build environment.

Does this mean one has to convert the model to ONNX?

@michaelfeil (Owner)

@ewianda No, it will use flash-attn.


ewianda commented Feb 21, 2025

I tried the image, but the throughput was still the same 🤷
