# RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions


📃 Paper | 🤗 RAG-Instruct-Llama3-3B | 🤗 RAG-Instruct-Llama3-8B | 📚 RAG-Instruct Dataset

## ⚡ Introduction

Hello! Welcome to the repository for RAG-Instruct!


RAG-Instruct is a method for generating diverse and high-quality RAG instruction data. It synthesizes instruction datasets based on any source corpus, leveraging the following approaches:

- **Five RAG paradigms**, which represent diverse query-document relationships to enhance model generalization across tasks.
- **Instruction simulation**, which enriches instruction diversity and quality by drawing on the strengths of existing instruction datasets.

Using this approach, we constructed a 40K instruction dataset from Wikipedia covering a wide range of RAG scenarios and tasks. Training on RAG-Instruct substantially improves the RAG performance of LLMs across these tasks:

| Model | WQA (acc) | PQA (acc) | TQA (acc) | OBQA (EM) | ARC (EM) | 2WIKI (acc) | HotP (acc) | MSQ (acc) | CFQA (EM) | PubMed (EM) |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama3.1-8B + Naive RAG | 56.7 | 56.8 | 71.5 | 57.6 | 61.4 | 60.7 | 45.5 | 23.5 | 53.1 | 63.0 |
| Llama3.1-8B-Instruct + Naive RAG | 61.9 | 62.8 | 73.9 | 77.2 | 70.3 | 66.8 | 45.5 | 19.0 | 53.7 | 73.6 |
| Llama3.1-8B + RAG-Instruct | 69.7 | 68.4 | 80.0 | 82.4 | 79.6 | 76.8 | 59.6 | 33.7 | 57.3 | 77.0 |

We open-source our models, data, and code in this repository.

## 💻 Model

- **Model Access**

| Model Name | Base LLM | Link |
|---|---|---|
| RAG-Instruct-Llama3-3B | LLaMA-3.2-3B | HF Link |
| RAG-Instruct-Llama3-8B | LLaMA-3.1-8B | HF Link |

- **Deploy**

RAG-Instruct models can be used just like Llama-3.1-8B-Instruct. You can deploy them with tools such as vLLM or SGLang, or run inference directly:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "FreedomIntelligence/RAG-Instruct-Llama3-8B", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("FreedomIntelligence/RAG-Instruct-Llama3-8B")

# Example input
input_text = """### Paragraph:
[1] structure is at risk from new development...
[2] as Customs and Excise stores...
[3] Powis Street is partly underway...
...

### Instruction:
Which organization is currently using a building in Woolwich that holds historical importance?
"""

# Apply the chat template and tokenize
messages = [{"role": "user", "content": input_text}]
inputs = tokenizer(
    tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True),
    return_tensors="pt",
).to(model.device)

# Generate output
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
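For serving, the model can also be loaded through vLLM's offline inference API. A minimal sketch, assuming a recent vLLM release that provides `LLM.chat` (the sampling settings here are illustrative, not values from the paper):

```python
from vllm import LLM, SamplingParams

# Load the model once; vLLM manages batching and the KV cache.
llm = LLM(model="FreedomIntelligence/RAG-Instruct-Llama3-8B")
params = SamplingParams(temperature=0.0, max_tokens=2048)

# Same "### Paragraph: ... ### Instruction: ..." prompt format as above.
input_text = """### Paragraph:
[1] structure is at risk from new development...

### Instruction:
Which organization is currently using a building in Woolwich that holds historical importance?
"""

# llm.chat applies the model's chat template before generation.
messages = [{"role": "user", "content": input_text}]
outputs = llm.chat(messages, params)
print(outputs[0].outputs[0].text)
```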

## 📚 Data

We’ve open-sourced a 40K instruction dataset for RAG. Download it here:

| Data | Description | Link |
|---|---|---|
| RAG-Instruct (Wikipedia) | Diverse RAG instruction data based on Wikipedia | Link |

## 🛠️ Data Construction

We provide scripts to synthesize a diverse RAG instruction dataset.

**1. Download Source Documents.**

We use preprocessed passage data from DPR and embeddings generated with Contriever-MSMARCO (a loading sketch follows the download commands):

- Download the preprocessed passage data:

  ```bash
  cd retrieval_lm
  wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
  ```

- Download the generated embeddings:

  ```bash
  wget https://dl.fbaipublicfiles.com/contriever/embeddings/contriever-msmarco/wikipedia_embeddings.tar
  ```
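For reference, the passage file follows the standard DPR format: tab-separated with `id`, `text`, and `title` columns, one 100-word passage per row. A minimal loading sketch, assuming that layout:

```python
import csv
import gzip

# psgs_w100.tsv.gz: tab-separated with an "id<TAB>text<TAB>title" header (DPR format).
# The full file holds ~21M passages, so stream it rather than loading it all at once.
with gzip.open("psgs_w100.tsv.gz", "rt", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")
    next(reader)  # skip the header row
    for pid, text, title in reader:
        print(pid, title)  # replace with your own processing
        break
```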

**2. Prepare Exemplar Datasets.**

We utilize several high-quality datasets as exemplars, including ShareGPT, Alpaca, WizardLM-70K, Lmsys-chat-1M, and SlimOrca.

To ensure high quality, we filter and sample these datasets with GPT-4o to extract knowledge-intensive data (Q). Using the exemplar data (Q), we retrieve source documents to construct (D*): each exemplar instruction or question is matched to source documents by ranking their relevance (a sketch of this matching follows). For convenience, we provide a processed dataset containing source documents and exemplar data across the five RAG scenarios here.
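The relevance ranking can be approximated with the same Contriever-MSMARCO encoder used for retrieval in this repo. A hedged sketch (the mean pooling follows the standard Contriever recipe; the candidate documents are illustrative, and this is not necessarily the exact matching code used by the authors):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever-msmarco")
model = AutoModel.from_pretrained("facebook/contriever-msmarco")

def embed(texts):
    # Mean pooling over token embeddings, masking out padding (standard Contriever usage).
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

question = "Which organization is currently using a building in Woolwich?"
docs = ["Woolwich town hall is used by ...", "The Thames Barrier is ..."]  # illustrative
scores = embed([question]) @ embed(docs).T  # inner-product relevance scores
ranked = scores.squeeze(0).argsort(descending=True).tolist()
print([docs[i] for i in ranked])  # documents ordered by relevance to the question
```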

**3. Synthesize Data with Prompts.**

Using the retrieved documents (D*) and exemplar data (Q), we synthesize new data points with tailored prompts to create diverse, high-quality instruction-following data:

```bash
cd data_gen
python generate_data.py \
    --data_path examplar_data/data.json \
    --max_workers 16 \
    --save_dir ./output_data/RAG-Instruct.json
```

**4. Run Retriever.**

Before training, we perform retrieval on the synthesized RAG-Instruct dataset. For each data entry, we ensure that the retrieved documents include all source documents (D*) and supplement them with enough unrelated documents (D-) to reach 10 documents in total. We use the preprocessed passage data from DPR and the Contriever-MSMARCO embeddings downloaded above. To retrieve candidate noisy documents (D-), use the following command:

```bash
cd retrieval_lm
python passage_retrieval.py \
    --model_name_or_path facebook/contriever-msmarco \
    --passages psgs_w100.tsv \
    --passages_embeddings "wikipedia_embeddings/*" \
    --input_name RAG_INSTRUCT_DATA_PATH \
    --output_dir YOUR_OUTPUT_FILE \
    --n_docs 250
```

`RAG_INSTRUCT_DATA_PATH` is the path to the synthesized `RAG-Instruct.json` file. The input file must be in json or jsonl format. Each instance should include either a `question` or an `instruction` field, which is used as the query during retrieval.
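For example, a minimal sketch that writes one such jsonl record (the field name follows the requirement above; the question text is illustrative):

```python
import json

# One retrieval query per line; each record needs a "question" or "instruction" field.
record = {"question": "Which organization is currently using a building in Woolwich?"}
with open("RAG-Instruct.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```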

Next, we sample documents ranked beyond the top 200 as (D-) and get the final training data.
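A minimal sketch of this assembly step (the function and shuffling are hypothetical conveniences; the rule itself, keeping all of D* and padding with D- sampled beyond rank 200 up to 10 documents, is stated above):

```python
import random

def assemble_documents(source_docs, retrieved, total=10, noise_rank_cutoff=200):
    """Keep all source documents (D*) and pad with unrelated documents (D-)
    sampled from beyond the top-`noise_rank_cutoff` retrieved passages."""
    noise_pool = retrieved[noise_rank_cutoff:]   # ranks 201..250 from --n_docs 250
    noise = random.sample(noise_pool, total - len(source_docs))
    docs = source_docs + noise
    random.shuffle(docs)  # assumption: shuffle so D* positions are not fixed
    return docs
```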

## 🚀 Training

### Fine-tuning with RAG-Instruct

You can fine-tune a model on the RAG-Instruct dataset to significantly boost its RAG capabilities, using the following command:

```bash
accelerate launch --config_file ./configs/sft.yaml \
    --num_processes 8 \
    --num_machines 1 \
    --machine_rank 0 \
    --deepspeed_multinode_launcher standard train_rag_sft.py \
    --experiment_name RAG-Instruct-training \
    --model_path meta-llama/Llama-3.1-8B-Instruct \
    --data_path FreedomIntelligence/RAG-Instruct \
    --max_seq_len 4096 \
    --learning_rate 5e-6 \
    --train_bsz_per_gpu 2 \
    --gradient_accumulation_steps 16 \
    --output_dir ./ckpts \
    --log_dir ./train_logs \
    --n_epochs 3 \
    --gradient_checkpointing
```
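With these settings, the effective batch size works out to `train_bsz_per_gpu × num_processes × gradient_accumulation_steps = 2 × 8 × 16 = 256` sequences per optimizer step.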

## 🧐 Evaluation

1. First install SGLang. After installation, deploy the model you want to test with the following command:

   ```bash
   log_num=0
   model_name="FreedomIntelligence/RAG-Instruct-Llama3-3B" # Path to the model you are deploying
   port=21${log_num}35  # expands to 21035 when log_num=0
   CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --model-path $model_name --port $port --mem-fraction-static 0.8 --dp 1 --tp 1 > sglang${log_num}.log 2>&1 &
   ```

2. Wait for the model to be deployed. After deployment, run the following for evaluation:

   ```bash
   model_name="FreedomIntelligence/RAG-Instruct-Llama3-3B" # Path to the deployed model
   python eval/eval_sglang.py --model_name $model_name --input_file eval/data/eval_data.json --port $port --max_new_tokens 500
   ```

Here we provide an evaluation example using the PopQA dataset in `eval/data/eval_data.json`. For other evaluation datasets, first run the retriever (see the retrieval step in the Data Construction section), and then use the script above for evaluation.
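The results table above reports accuracy (acc) and exact match (EM). The evaluation script computes these internally; a minimal sketch of the usual definitions, assuming SQuAD-style answer normalization (not necessarily the exact normalization used in `eval_sglang.py`):

```python
import re
import string

def normalize(text):
    # Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style).
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    # EM: the prediction equals some gold answer after normalization.
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

def accuracy_contains(prediction, gold_answers):
    # "acc" in RAG evals is often substring match: a gold answer appears in the output.
    return any(normalize(g) in normalize(prediction) for g in gold_answers)
```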

3. After the evaluation completes, run the following to stop the SGLang service and release GPU memory:

   ```bash
   bash evaluation/kill_sglang_server.sh
   ```

The evaluation code above can be used to test most models supported by SGLang.

## 📖 Citation

```bibtex
@misc{liu2024raginstructboostingllmsdiverse,
      title={RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions},
      author={Wanlong Liu and Junying Chen and Ke Ji and Li Zhou and Wenyu Chen and Benyou Wang},
      year={2024},
      eprint={2501.00353},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.00353},
}
```