llama2 7b model #89

ha-seungwon · 2024-11-23T03:36:07Z

Hello,

Thank you for your interesting project.

Can I use OnnxStream task in Llama2 -7b fp16 model??

vitoplantamura · 2024-11-24T07:43:34Z

hi, currently the LLM sample application only supports "TinyLlama-1.1B-Chat-v0.3-fp16" and "Mistral-7B-Instruct-v0.2-fp16". Vito

ha-seungwon · 2024-11-25T01:19:16Z

Hello,

So is it not possible to customise another LLM model?

Thanks

vitoplantamura · 2024-11-26T19:19:52Z

Since TinyLlama adopts the same architecture and tokenizer as Llama 2, adding Llama 2 support to src/llm.cpp should be fairly simple. It involves exporting the onnx file, running "onnxsim_large_model" on it, and finally running "onnx2txt". Vito

ha-seungwon · 2024-11-27T07:59:07Z

Hello,

I already try but some error comes out plz help me.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch.nn as nn
import onnx

# Llama2 모델 로드
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Llama2 ONNX 변환용 래퍼 모델 정의
class LlamaModel(nn.Module):
    def __init__(self, model):
        super(LlamaModel, self).__init__()
        self.model = model

    def forward(self, input_ids, attention_mask, position_ids, *past_key_values):
        past_key_values = tuple(
            (past_key_values[i], past_key_values[i + 1]) for i in range(0, len(past_key_values), 2)
        )
        outputs = self.model(
            use_cache=True,
            return_dict=True,
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
        )
        pkv = outputs.past_key_values
        # logits와 각 past_key_values를 반환
        return [outputs.logits] + [item for pair in pkv for item in pair]

# 더미 입력 생성
with torch.no_grad():
    dummy_input = (
        torch.tensor([[1, 2, 3]], dtype=torch.int64),  # input_ids
        torch.tensor([[1, 1, 1]], dtype=torch.int64),  # attention_mask
        torch.tensor([[0, 1, 2]], dtype=torch.int64)   # position_ids
    )

    # 32개 레이어의 past_key_values 추가 (batch_size=1, num_heads=32, past_seq_len=4, head_dim=128)
    for _ in range(32):
        dummy_input += (torch.randn(1, 32, 4, 128, dtype=torch.float16),)  # key
        dummy_input += (torch.randn(1, 32, 4, 128, dtype=torch.float16),)  # value

    # 입력 및 출력 이름 정의
    input_names = ["input_ids", "attention_mask", "position_ids"] + [f"pkv{i}" for i in range(64)]  # 32 layers * 2 (key, value)
    output_names = ["logits"] + [f"opkv{i}" for i in range(64)]  # 32 layers * 2 (key, value)

    # ONNX 변환
    torch.onnx.export(
        LlamaModel(model),
        dummy_input,
        "./onnx_export_model/model.onnx",
        verbose=False,
        input_names=input_names,
        output_names=output_names,
        opset_version=14,
        do_constant_folding=True,
        export_params=True,
        dynamic_axes={
            "input_ids": {1: "sequence"},
            "attention_mask": {1: "sequence"},
            "position_ids": {1: "sequence"},
            **{f"pkv{i}": {2: "past_seq_len"} for i in range(64)},
        },
    )

)

after export my model and "onnxsim_large_model" > "onnx2txt"

Gather -> 68
Shape -> 37
Add -> 227
Range -> 1
Unsqueeze -> 41
Slice -> 162
Cast -> 136
Equal -> 3
And -> 1
Where -> 2
Expand -> 5
Concat -> 130
Reshape -> 129
ScatterND -> 1
Pow -> 65
ReduceMean -> 65
Sqrt -> 65
Div -> 65
Mul -> 386
MatMul -> 290
Transpose -> 161
Cos -> 1
Sin -> 1
Neg -> 64
Softmax -> 32
Sigmoid -> 32
TOTAL -> 2170

output of my onnx2txt code

my error is

how can I fix it?

vitoplantamura · 2024-11-28T08:21:20Z

I will try to reproduce the problem and let you know in the next few days. This problem is typically caused by the fact that the implementation of the HF Transformers has changed compared to the version I used to generate the TinyLlama onnx file. This causes the new onnx file to be different. A quick fix could be to use the same version of the HF Transformers to generate the new onnx file... I'll let you know ASAP, Vito

vitoplantamura · 2024-11-30T03:13:01Z

I was able to run src/llm.cpp with llama2 exported using your script. The problem is that your script preserves the upcasts (float16->float32) and downcasts (float32->float16) needed in certain parts of the model to preserve the accuracy of the activations. Please note that src/llm.cpp handles the upcast problem in the code (search that file for "model.m_requires_upcast"). The solution is, in your script, in the line "AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)" replace "float16" with "float32" (or delete "torch_dtype=torch.float16"). Vito

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

llama2 7b model #89

llama2 7b model #89

ha-seungwon commented Nov 23, 2024

vitoplantamura commented Nov 24, 2024 via email

ha-seungwon commented Nov 25, 2024 •

edited

Loading

vitoplantamura commented Nov 26, 2024 via email

ha-seungwon commented Nov 27, 2024 •

edited

Loading

vitoplantamura commented Nov 28, 2024 via email

vitoplantamura commented Nov 30, 2024 via email

llama2 7b model #89

llama2 7b model #89

Comments

ha-seungwon commented Nov 23, 2024

vitoplantamura commented Nov 24, 2024 via email

ha-seungwon commented Nov 25, 2024 • edited Loading

vitoplantamura commented Nov 26, 2024 via email

ha-seungwon commented Nov 27, 2024 • edited Loading

vitoplantamura commented Nov 28, 2024 via email

vitoplantamura commented Nov 30, 2024 via email

ha-seungwon commented Nov 25, 2024 •

edited

Loading

ha-seungwon commented Nov 27, 2024 •

edited

Loading