📃 Paper |🤗 RAG-Instruct-Llama3-3B |🤗 RAG-Instruct-Llama3-8B | 📚 RAG-Instruct Dataset
Hello! Welcome to the repository for RAG-Instruct!
RAG-Instruct is a method for generating diverse and high-quality RAG instruction data. It synthesizes instruction datasets based on any source corpus, leveraging the following approaches:
- Five RAG paradigms, which represent diverse query-document relationships to enhance model generalization across tasks.
- Instruction simulation, which enriches instruction diversity and quality by utilizing the strengths of existing instruction datasets.
Using this approach, we constructed a 40K instruction dataset from Wikipedia, covering a wide range of RAG scenarios and tasks.
We open-sourced our models, data, and code here.
- Model Access
Model Name | Base LLMs | Link |
---|---|---|
RAG-Instruct-Llama3-3B | LLaMA-3.2-3B | HF Link |
RAG-Instruct-Llama3-8B | LLaMA-3.1-8B | HF Link |
- Deploy
RAG-Instruct models can be used just like Llama-3.1-8B-Instruct. You can deploy them with tools like vLLM or SGLang, or run inference directly:
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained("FreedomIntelligence/RAG-Instruct-Llama3.1-8B",torch_dtype="auto",device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("FreedomIntelligence/RAG-Instruct-Llama3.1-8B")
# Example input
input_text = """### Paragraph:
[1] structure is at risk from new development...
[2] as Customs and Excise stores...
[3] Powis Street is partly underway...
...
### Instruction:
Which organization is currently using a building in Woolwich that holds historical importance?
"""
# Tokenize and prepare input
messages = [{"role": "user", "content": input_text}]
inputs = tokenizer(tokenizer.apply_chat_template(messages, tokenize=False,add_generation_prompt=True), return_tensors="pt").to(model.device)
# Generate output
outputs = model.generate(**inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
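The example input follows a simple layout: numbered passages under `### Paragraph:` and the question under `### Instruction:`. The sketch below shows one way to assemble that layout from a list of retrieved passages. The helper name `build_rag_prompt` is hypothetical and not part of this repository, and the exact whitespace is an assumption based on the example above.

```python
def build_rag_prompt(passages, instruction):
    """Format passages and an instruction in the layout of the example above.

    Hypothetical helper for illustration; not part of the repository.
    """
    # Number the passages from 1, one per line: "[1] ...", "[2] ...".
    numbered = "\n".join(
        f"[{i}] {p.strip()}" for i, p in enumerate(passages, start=1)
    )
    return f"### Paragraph:\n{numbered}\n### Instruction:\n{instruction.strip()}\n"


prompt = build_rag_prompt(
    [
        "structure is at risk from new development...",
        "as Customs and Excise stores...",
    ],
    "Which organization is currently using a building in Woolwich "
    "that holds historical importance?",
)
# The result can be passed as the user turn, as in the inference code above.
messages = [{"role": "user", "content": prompt}]
```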
We’ve open-sourced a 40K instruction dataset for RAG. Download it here:
Data | Description | Link |
---|---|---|
RAG-Instruct (Wikipedia) | Diverse RAG instruction data based on Wikipedia | Link |
You can fine-tune your own model on the RAG-Instruct dataset to improve its RAG capabilities. Use the following command:
accelerate launch --config_file ./configs/sft.yaml \
--num_processes 8 \
--num_machines 1 \
--machine_rank 0 \
--deepspeed_multinode_launcher standard train_rag_sft.py \
--experiment_name RAG-Instruct-training \
--model_path meta-llama/Llama-3.1-8B-Instruct \
--data_path FreedomIntelligence/RAG-Instruct \
--max_seq_len 4096 \
--learning_rate 5e-6 \
--train_bsz_per_gpu 1 \
--gradient_accumulation_steps 16 \
--output_dir ./ckpts \
--log_dir ./train_logs \
--n_epochs 3 \
--gradient_checkpointing
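As a quick sanity check on these settings, the effective batch size can be worked out from the flags. This is only a sketch: it assumes one process per GPU, that `train_rag_sft.py` applies gradient accumulation in the usual way, and that the dataset holds roughly 40,000 examples (the "40K" figure above), so the step counts are approximate.

```python
import math

num_processes = 8                 # --num_processes (assumed: one process per GPU)
train_bsz_per_gpu = 1             # --train_bsz_per_gpu
gradient_accumulation_steps = 16  # --gradient_accumulation_steps
n_epochs = 3                      # --n_epochs
dataset_size = 40_000             # "40K" from the text; the exact count may differ

# Samples consumed per optimizer update across all GPUs.
effective_batch_size = (
    num_processes * train_bsz_per_gpu * gradient_accumulation_steps
)

# Approximate number of optimizer updates, rounding up the final partial batch.
steps_per_epoch = math.ceil(dataset_size / effective_batch_size)
total_steps = steps_per_epoch * n_epochs

print(effective_batch_size)          # 128
print(steps_per_epoch, total_steps)  # 313 939
```

If you train on fewer GPUs, raising `--gradient_accumulation_steps` proportionally keeps the effective batch size the same under these assumptions.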
We provide scripts to synthesize a diverse RAG instruction dataset.
1. Download Source Documents.
We use preprocessed passage data from DPR and embeddings generated with Contriever-MSMARCO:
- Download the preprocessed passage data:
cd retrieval_lm
wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
- Download the generated embeddings:
wget https://dl.fbaipublicfiles.com/contriever/embeddings/contriever-msmarco/wikipedia_embeddings.tar
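To inspect the downloaded passages before running retrieval, a small reader like the one below may help. It is a sketch that assumes the DPR dump has a header row followed by three tab-separated columns (`id`, `text`, `title`); the function name `iter_passages` is hypothetical and not part of this repository.

```python
import csv
import gzip


def iter_passages(path):
    """Yield {"id", "text", "title"} dicts from a DPR-style passage TSV.

    Reads plain .tsv files or gzipped .tsv.gz files.
    Assumes columns in the order: id, text, title.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        for row in reader:
            if row[0] == "id":  # skip the header row
                continue
            yield {"id": row[0], "text": row[1], "title": row[2]}
```

For example, `next(iter_passages("psgs_w100.tsv.gz"))` would return the first passage without decompressing the whole file to disk.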
2. Prepare Exemplar Datasets.
We utilize several high-quality datasets as exemplars, including ShareGPT, Alpaca, WizardLM-70K, Lmsys-chat-1M, and SlimOrca.
To ensure high-quality data, we filtered and sampled these datasets using GPT-4 to extract knowledge-intensive data (Q).
3. Retrieve Documents.
Retrieval uses the DPR passage data and Contriever embeddings downloaded in step 1. The command below expects the decompressed psgs_w100.tsv file and the extracted wikipedia_embeddings directory. To retrieve passages, run:
cd retrieval_lm
python passage_retrieval.py \
--model_name_or_path facebook/contriever-msmarco \
--passages psgs_w100.tsv \
--passages_embeddings "wikipedia_embeddings/*" \
--input_name RAG_INSTRUCT_DATA_PATH \
--output_dir YOUR_OUTPUT_FILE \
--n_docs 250
The input file must be in json or jsonl format. Each instance should include either a question or an instruction field, which is used as the query during retrieval.
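The sketch below illustrates the input format described above and checks that every instance carries a usable query. It makes two assumptions that the text does not confirm: a `.json` file holds a top-level list of instances, and `question` takes precedence when both fields are present. The helper `load_queries` and the second sample record are illustrative only.

```python
import json


def load_queries(path):
    """Read a .json or .jsonl file and return one retrieval query per instance."""
    with open(path, encoding="utf-8") as f:
        if path.endswith(".jsonl"):
            # One JSON object per non-empty line.
            instances = [json.loads(line) for line in f if line.strip()]
        else:
            # Assumption: the .json file is a top-level list of objects.
            instances = json.load(f)

    queries = []
    for i, item in enumerate(instances):
        # Assumption: prefer `question`, fall back to `instruction`.
        query = item.get("question") or item.get("instruction")
        if not query:
            raise ValueError(
                f"instance {i} has neither 'question' nor 'instruction'"
            )
        queries.append(query)
    return queries


# Two illustrative records, one per accepted field name.
examples = [
    {"question": "Which organization is currently using a building in Woolwich "
                 "that holds historical importance?"},
    {"instruction": "Summarize the history of Powis Street."},
]
jsonl_text = "\n".join(json.dumps(e) for e in examples) + "\n"
```

Writing `jsonl_text` to a file ending in `.jsonl` gives an input in the expected shape.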
Using the exemplar data (Q), we retrieve source documents to construct (D*). Specifically, we match the exemplar instructions or questions with source documents by ranking their relevance. For convenience, we provide a processed dataset containing source documents and exemplar data across five RAG scenarios here.
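Ranking by relevance can be pictured as scoring each passage embedding against the query embedding and keeping the best matches. The toy sketch below uses the inner product as the score, which is how Contriever compares queries and passages. It is not the repository's `passage_retrieval.py`, which works over the full precomputed embedding files; the function name and the tiny vectors are made up for illustration.

```python
import heapq


def top_k_passages(query_vec, passage_vecs, k):
    """Return (passage_index, score) pairs for the k highest inner products."""
    scores = (
        (idx, sum(q * p for q, p in zip(query_vec, vec)))
        for idx, vec in enumerate(passage_vecs)
    )
    return heapq.nlargest(k, scores, key=lambda pair: pair[1])


# Toy 3-dimensional embeddings; real embeddings have far more dimensions.
query = [1.0, 0.0, 1.0]
passages = [
    [2.0, 0.0, 1.0],  # strongly aligned with the query
    [0.0, 3.0, 0.0],  # orthogonal to the query
    [1.0, 1.0, 1.0],  # partially aligned
]
print(top_k_passages(query, passages, k=2))  # [(0, 3.0), (2, 2.0)]
```

In the retrieval command above, `--n_docs 250` plays the role of `k`.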
4. Synthesize Data with Prompts.
Using the retrieved documents (D*) and exemplar data (Q), we synthesize new data points with tailored prompts to create diverse and high-quality instruction-following datasets.
cd data_gen
python generate_data.py \
--data_path examplar_data/data.json \
--max_workers 16 \
--save_dir ./output_data/RAG-Instruct.json
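The `--max_workers` flag suggests that data points are synthesized concurrently. As a rough illustration only, the sketch below fans a list of items out over a thread pool. Everything in it is hypothetical: `synthesize_one` is a stub standing in for a model call, and the `documents` and `exemplar` field names are assumptions, not the schema used by `generate_data.py`.

```python
from concurrent.futures import ThreadPoolExecutor


def synthesize_one(item):
    """Hypothetical stand-in for one model call.

    A real version would build a prompt from the documents and the exemplar
    and send it to a model; this stub only records the combined input size.
    """
    docs = " ".join(item["documents"])
    return {
        "exemplar": item["exemplar"],
        "input_chars": len(docs) + len(item["exemplar"]),
    }


def synthesize_all(items, max_workers=16):
    """Run synthesize_one over all items using a pool of worker threads."""
    # pool.map yields results in input order, even if calls finish out of order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(synthesize_one, items))


items = [
    {"documents": ["doc a", "doc b"], "exemplar": "first exemplar"},
    {"documents": ["doc c"], "exemplar": "second exemplar"},
]
results = synthesize_all(items, max_workers=2)
```

Threads suit this kind of workload because each call mostly waits on a remote response rather than using the CPU.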
@misc{liu2024raginstructboostingllmsdiverse,
title={RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions},
author={Wanlong Liu and Junying Chen and Ke Ji and Li Zhou and Wenyu Chen and Benyou Wang},
year={2024},
eprint={2501.00353},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.00353},
}