MRAG/datasets at main · spcl/MRAG

History

Name		Name	Last commit message	Last commit date
parent directory ..
README.md		README.md
chem_documents_v1.tar.bz2		chem_documents_v1.tar.bz2
chem_documents_v2.tar.bz2		chem_documents_v2.tar.bz2
legal_documents_v1.tar.bz2		legal_documents_v1.tar.bz2
legal_documents_v2.tar.bz2		legal_documents_v2.tar.bz2
wikipedia_articles.json.bz2		wikipedia_articles.json.bz2
wikipedia_queries.json.bz2		wikipedia_queries.json.bz2

README.md

Datasets

Wikipedia Articles

wikipedia_articles.json.bz2 contains the compressed JSON file, that we used as the synthetic Wikipedia article dataset during the evaluation of MRAG. It was generated using our synthetic dataset generator: multirag datagen. It consists of 25 categories with 50 distinct documents in each category. The documents are based on the summary of Wikipedia articles and each document contains at least 800 characters.

Please uncompess wikipedia_articles.json.bz2 using bunzip2 before using it.

For the format of the JSON file see the respective section in the dataset module documention.

Legal Documents

legal_documents_v1.tar.bz2 contains the legal documents dataset used during the evaluation of MRAG. It consists of 25 categories with 25 topics. Each topic is stored in an individual JSON file and contains 10 documents. The dataset consists of 6,250 documents in total. We provide an improved version of the dataset as legal_documents_v2.tar.bz2 with longer document summaries.

Chemical Plant Accident Dataset

chem_documents_v1.tar.bz2 contains the dataset created for the analysis of causes of chemical plant accidents. Similar to the legal document datasets, this dataset consists of 25 categories with 25 topics, where each topic has 10 documents and is stored in a single JSON file. The dataset consists of 6,250 documents in total. We also provide an improved version of the dataset as chem_documents_v2.tar.bz2 with longer document summaries.

Queries

Wikipedia Article Queries

wikipedia_queries.json.bz2 contains the compressed JSON file, that stores the queries that were used to query the synthetic Wikipedia article dataset during the evaluation of MRAG. It was generated using our synthetic query generator: multirag querygen. It contains 25 queries with 1, 2, 3, 4, 5, 6, 10, 15, 20 and 25 aspects respectively, so a total of 250 queries.

Please uncompess wikipedia_queries.json.bz2 using bunzip2 before using it.

For the format of the JSON file see the respective section in the dataset module documention.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

datasets

datasets

README.md

Datasets

Wikipedia Articles

Legal Documents

Chemical Plant Accident Dataset

Queries

Wikipedia Article Queries

Files

datasets

Directory actions

More options

Directory actions

More options

Latest commit

History

datasets

Folders and files

parent directory

README.md

Datasets

Wikipedia Articles

Legal Documents

Chemical Plant Accident Dataset

Queries

Wikipedia Article Queries