RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions


📃 Paper | 🤗 RAG-Instruct-Llama3-3B | 🤗 RAG-Instruct-Llama3-8B | 📚 RAG-Instruct Dataset

⚡ Introduction

Hello! Welcome to the repository for RAG-Instruct!


RAG-Instruct is a method for generating diverse and high-quality RAG instruction data. It synthesizes instruction datasets from any source corpus by leveraging two complementary approaches:

  • Five RAG paradigms, which represent diverse query-document relationships to enhance model generalization across tasks.
  • Instruction simulation, which enriches instruction diversity and quality by utilizing the strengths of existing instruction datasets.

Using this method, we constructed a 40K-example instruction dataset from Wikipedia, covering a wide range of RAG scenarios and tasks.

We open-sourced our models, data, and code here.

💻 Model

  • Model Access

    Model Name               Base LLM        Link
    RAG-Instruct-Llama3-3B   LLaMA-3.2-3B    HF Link
    RAG-Instruct-Llama3-8B   LLaMA-3.1-8B    HF Link
  • Deploy

RAG-Instruct models can be used just like Llama-3.1-8B-Instruct. You can deploy them with tools like vLLM or SGLang, or run inference directly:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained("FreedomIntelligence/RAG-Instruct-Llama3.1-8B", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("FreedomIntelligence/RAG-Instruct-Llama3.1-8B")

# Example input
input_text = """### Paragraph:
[1] structure is at risk from new development...
[2] as Customs and Excise stores...
[3] Powis Street is partly underway...
...

### Instruction:
Which organization is currently using a building in Woolwich that holds historical importance?
"""

# Tokenize and prepare input
messages = [{"role": "user", "content": input_text}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate output
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
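As a deployment illustration, here is a minimal vLLM sketch. This is an assumption-level example, not an official serving recipe; it reuses input_text from the snippet above and assumes vLLM is installed (pip install vllm):

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# A minimal serving sketch; the model ID matches the direct-inference example above.
model_id = "FreedomIntelligence/RAG-Instruct-Llama3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id)

# Reuse input_text from the example above.
messages = [{"role": "user", "content": input_text}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=2048))
print(outputs[0].outputs[0].text)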

📚 Data

We’ve open-sourced a 40K instruction dataset for RAG. Download it here:

Data                       Description                                        Link
RAG-Instruct (Wikipedia)   Diverse RAG instruction data based on Wikipedia    Link
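The dataset can be loaded with the Hugging Face datasets library. A minimal sketch (the "train" split name is an assumption; verify it on the dataset card):

from datasets import load_dataset

# A minimal loading sketch; the "train" split is an assumption
# to check on the dataset card.
ds = load_dataset("FreedomIntelligence/RAG-Instruct", split="train")
print(len(ds))       # number of examples (~40K)
print(ds[0].keys())  # inspect the fields of one example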

🚀 Training

Fine-tuning with RAG-Instruct

You can fine-tune your model on the RAG-Instruct dataset to significantly boost its RAG capabilities. Use the following command:

accelerate launch --config_file ./configs/sft.yaml \
    --num_processes 8  \
    --num_machines 1 \
    --machine_rank 0 \
    --deepspeed_multinode_launcher standard train_rag_sft.py \
    --experiment_name RAG-Instruct-training \
    --model_path meta-llama/Llama-3.1-8B-Instruct \
    --data_path FreedomIntelligence/RAG-Instruct \
    --max_seq_len 4096 \
    --learning_rate 5e-6 \
    --train_bsz_per_gpu 1 \
    --gradient_accumulation_steps 16 \
    --output_dir ./ckpts \
    --log_dir ./train_logs \
    --n_epochs 3 \
    --gradient_checkpointing
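With this configuration, the effective batch size is 8 GPUs × 1 sequence per GPU × 16 gradient-accumulation steps = 128 sequences per optimizer step (assuming the single 8-GPU machine set by --num_processes and --num_machines above).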

🛠️ Data Construction

We provide scripts to synthesize a diverse RAG instruction dataset.

1. Download Source Documents.
We use preprocessed passage data from DPR and embeddings generated with Contriever-MSMARCO:

  • Download the preprocessed passage data:

    cd retrieval_lm
    wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
  • Download the generated embeddings:

    wget https://dl.fbaipublicfiles.com/contriever/embeddings/contriever-msmarco/wikipedia_embeddings.tar
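After downloading, decompress the passages (gzip -d psgs_w100.tsv.gz) and unpack the embeddings (tar -xf wikipedia_embeddings.tar) so that the psgs_w100.tsv and wikipedia_embeddings/* paths used in step 3 resolve; the exact directory layout inside the archive is an assumption worth checking after extraction.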

2. Prepare Exemplar Datasets.

We utilize several high-quality datasets as exemplars, including ShareGPT, Alpaca, WizardLM-70K, Lmsys-chat-1M, and SlimOrca.

To ensure high-quality data, we filtered and sampled these datasets using GPT-4 to extract knowledge-intensive data (Q).

3. Retrieve Documents.
Using the DPR passage data and Contriever-MSMARCO embeddings from step 1, retrieve passages with the following command:

cd retrieval_lm
python passage_retrieval.py \
    --model_name_or_path facebook/contriever-msmarco \
    --passages psgs_w100.tsv \
    --passages_embeddings "wikipedia_embeddings/*" \
    --input_name RAG_INSTRUCT_DATA_PATH \
    --output_dir YOUR_OUTPUT_FILE \
    --n_docs 250

The input file must be in JSON or JSONL format. Each instance must include either a question or an instruction field, which is used as the query during retrieval; a minimal sketch of preparing such a file follows.
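The sketch below writes a JSONL query file; the file name queries.jsonl and the second query are hypothetical illustrations (the first reuses the example from the Deploy section):

import json

# A minimal sketch; "queries.jsonl" is a hypothetical file name.
# Each line holds one JSON object with a "question" or "instruction"
# field, which passage_retrieval.py uses as the retrieval query.
queries = [
    {"instruction": "Which organization is currently using a building in Woolwich that holds historical importance?"},
    {"question": "What is the historical significance of Powis Street?"},  # illustrative only
]
with open("queries.jsonl", "w") as f:
    for q in queries:
        f.write(json.dumps(q) + "\n")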

Using the exemplar data (Q), we retrieve source documents to construct (D*). Specifically, we match the exemplar instructions or questions with source documents by ranking their relevance. For convenience, we provide a processed dataset containing source documents and exemplar data across five RAG scenarios here.
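For intuition, Contriever-style relevance ranking scores a query against passages by the inner product of mean-pooled token embeddings. A minimal sketch of this scoring (an illustration only; the actual pipeline is passage_retrieval.py with the precomputed index):

from transformers import AutoTokenizer, AutoModel
import torch

# Illustrative Contriever scoring; the repository's pipeline instead runs
# passage_retrieval.py against the precomputed wikipedia_embeddings index.
tokenizer = AutoTokenizer.from_pretrained("facebook/contriever-msmarco")
model = AutoModel.from_pretrained("facebook/contriever-msmarco")

def embed(texts):
    # Mean-pool token embeddings over non-padding positions.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

query = embed(["Which organization is currently using a building in Woolwich?"])
passages = embed(["structure is at risk from new development...",
                  "as Customs and Excise stores..."])
print(query @ passages.T)  # higher inner product = more relevant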

4. Synthesize Data with Prompts.
Using the retrieved documents (D*) and exemplar data (Q), we synthesize new data points with tailored prompts to create diverse and high-quality instruction-following datasets.

cd data_gen
python generate_data.py \
    --data_path examplar_data/data.json \
    --max_workers 16 \
    --save_dir ./output_data/RAG-Instruct.json
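Here --max_workers sets the number of parallel generation workers, and --save_dir points to the output file for the synthesized instructions (a JSON file, despite the flag's name).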

🔥 Evaluation

📖 Citation

@misc{liu2024raginstructboostingllmsdiverse,
      title={RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions}, 
      author={Wanlong Liu and Junying Chen and Ke Ji and Li Zhou and Wenyu Chen and Benyou Wang},
      year={2024},
      eprint={2501.00353},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.00353}, 
}
