
Low throughput with modernbert #531

Open · 2 of 5 tasks

rawsh-rubrik opened this issue Feb 11, 2025 · 6 comments
Comments

@rawsh-rubrik

System Info

Testing https://huggingface.co/Alibaba-NLP/gte-reranker-modernbert-base

INFO     2025-02-11 20:36:37,724 infinity_emb INFO:           select_model.py:64
         model=`Alibaba-NLP/gte-reranker-modernbert-base`                       
         selected, using engine=`torch` and device=`cuda`
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
INFO     2025-02-11 20:36:44,229 infinity_emb INFO: using            torch.py:88
         torch.compile(dynamic=True)                                            
W0211 20:37:23.950000 1 torch/_inductor/utils.py:1137] [6/0] Not enough SMs to use max_autotune_gemm mode
INFO     2025-02-11 20:39:06,469 infinity_emb INFO: Getting   select_model.py:97
         timings for batch_size=32 and avg tokens per                           
         sentence=3                                                             
                 2.62     ms tokenization                                       
                 19.90    ms inference                                          
                 0.04     ms post-processing                                    
                 22.56    ms total                                              
         embeddings/sec: 1418.67                                                

INFO     2025-02-11 20:40:19,740 infinity_emb INFO: Getting  select_model.py:103
         timings for batch_size=32 and avg tokens per                           
         sentence=1025                                                          
                 52.67    ms tokenization                                       
                 33388.80 ms inference
                 0.16     ms post-processing
                 33441.63 ms total
         embeddings/sec: 0.96

On an NVIDIA L4, this seems quite low for a ~150M parameter model?
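For reference, the reported embeddings/sec figure is just batch_size divided by the total per-batch latency. A quick sketch reproducing the arithmetic from the log above (timings taken verbatim from the log; small rounding differences vs. the logged rate are expected):

```python
# Back-of-the-envelope throughput from the infinity_emb timings above.
batch_size = 32

def throughput(batch_size: int, total_ms: float) -> float:
    """Sentences per second for one batch processed in total_ms milliseconds."""
    return batch_size / (total_ms / 1000.0)

short_seq = throughput(batch_size, 22.56)      # avg 3 tokens/sentence
long_seq = throughput(batch_size, 33441.63)    # avg 1025 tokens/sentence

print(f"{short_seq:.2f} embeddings/sec")  # ~1418, close to the logged 1418.67
print(f"{long_seq:.2f} embeddings/sec")   # ~0.96, matching the logged 0.96
```

So the long-sequence case is dominated entirely by the 33.4 s inference step; tokenization and post-processing are negligible.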

Information

  • Docker + cli
  • pip + cli
  • pip + usage of Python interface

Tasks

  • An officially supported CLI command
  • My own modifications

Reproduction

v75 + torch + cuda
Alibaba-NLP/gte-reranker-modernbert-base


rawsh-rubrik commented Feb 11, 2025

Getting

batch_size=32 avg tokens per sentence=1024
embeddings/sec: 47.83

with BAAI/bge-reranker-v2-m3 + --no-bettertransformer
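For scale, the two reported numbers put the ModernBERT model roughly 50x behind at the same batch size and comparable sequence length (quick arithmetic using only the figures in this thread):

```python
# Reported embeddings/sec at batch_size=32, ~1024 tokens/sentence.
modernbert_eps = 0.96   # Alibaba-NLP/gte-reranker-modernbert-base
bge_eps = 47.83         # BAAI/bge-reranker-v2-m3 with --no-bettertransformer

print(f"modernbert is ~{bge_eps / modernbert_eps:.0f}x slower")  # ~50x
```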

@michaelfeil (Owner)

Can you use the trt-onnx docker images? ModernBert requires flash-attention-2 (flash-attn) which requires a different build environment.
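One way to sanity-check the environment is to verify that flash-attn is even importable before assuming the flash_attention_2 path is used. A sketch (the import check only proves the package is installed, not that its CUDA kernels were built for the torch/GPU combination in the image; note also the startup warning above, which suggests the model may have been initialized on CPU first, and flash-attn additionally requires fp16/bf16 weights):

```python
# Sketch: confirm flash_attn is importable (importlib is stdlib).
import importlib.util

def has_flash_attn() -> bool:
    """True if the flash_attn package can be imported in this environment."""
    return importlib.util.find_spec("flash_attn") is not None

print("flash_attn importable:", has_flash_attn())

# If it is importable, ModernBERT can be loaded with the flash attention
# path in transformers (requires a GPU; model name taken from this issue):
#
#   from transformers import AutoModelForSequenceClassification
#   model = AutoModelForSequenceClassification.from_pretrained(
#       "Alibaba-NLP/gte-reranker-modernbert-base",
#       torch_dtype=torch.float16,
#       attn_implementation="flash_attention_2",
#   ).to("cuda")
```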


rawsh-rubrik commented Feb 12, 2025

Will try this, but I have flash-attn installed in this image:

FROM nvidia/cuda:12.1.1-devel-ubuntu22.04 AS base

ENV PYTHONUNBUFFERED=1 \
    \
    # pip
    PIP_NO_CACHE_DIR=off \
    PIP_DISABLE_PIP_VERSION_CHECK=on \
    PIP_DEFAULT_TIMEOUT=100 \
    \
    PYTHON="python3.10"
RUN apt-get update && apt-get install build-essential python3-dev libsndfile1 $PYTHON-venv $PYTHON curl -y
WORKDIR /app

FROM base AS builder
# setup venv
RUN $PYTHON -m venv /app/venv
ENV PATH="/app/venv/bin:$PATH"
RUN pip install wheel packaging
RUN pip install "infinity-emb[all]==0.0.75" "sentence-transformers==3.4.1" "transformers==4.48.3"

# install flash-attn
RUN pip install --no-cache-dir flash-attn --no-build-isolation

# Use a multi-stage build -> production version, with download
FROM base AS tested-builder
COPY --from=builder /app /app
ENV HF_HOME=/app/.cache/huggingface
ENV PATH=/app/venv/bin:$PATH
# do nothing
RUN echo "copied all files"

# Use a multi-stage build -> production version
FROM tested-builder AS production
ENTRYPOINT ["infinity_emb"]


ewianda commented Feb 20, 2025

> Can you use the trt-onnx docker images? ModernBert requires flash-attention-2 (flash-attn) which requires a different build environment.

Does this mean one has to convert the model to ONNX?

@michaelfeil (Owner)

@ewianda No, it will use flash-attn.


ewianda commented Feb 21, 2025

I tried the image, but the throughput was still the same 🤷
