We will walk through how to add a new dataset to DataLab.
If your dataset already has a public link online, you can use that link.
Otherwise, you'll need to host your dataset on a server with downloadable links (please make sure you have permission to redistribute the dataset first). For example, you can place your dataset in
- Google Drive
- Google Cloud
- AWS S3
Suppose the dataset name to be added is `cr`. We need to:
- create a folder `cr` in `DataLab/datasets/`
- create a data loader script `cr.py` in the above folder, i.e., `DataLab/datasets/cr/cr.py`
- finish the data loader script based on some provided examples
- enter the `DataLab/datasets` folder
- run the following Python commands:

```python
from datalabs import load_dataset
dataset = load_dataset("./cr")
print(dataset['train']._info)
print(dataset['train']._info.task_templates)
```
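The core of a data loader script is the method that parses raw files into `(id, example)` pairs. Below is a minimal stdlib sketch of that parsing logic only, assuming a hypothetical tab-separated `label<TAB>text` file format for `cr`; the real script wraps logic like this inside a datalabs builder class, following the provided examples.

```python
def generate_examples(filepath):
    """Yield (id, example) pairs from a tab-separated label/text file.

    This mirrors the kind of parsing a data loader script performs;
    the label<TAB>text format here is an illustrative assumption.
    """
    with open(filepath, encoding="utf-8") as f:
        for idx, line in enumerate(f):
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            label, text = line.split("\t", 1)
            yield idx, {"text": text, "label": label}
```

Keeping the parsing in a small generator like this makes it easy to test the loader on a few lines before running `load_dataset("./cr")`.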
Once you have successfully finished the above steps, if you would like to make your dataset public, you can open a pull request.

Once you have successfully added a new dataset, please update the file `dataset_info.jsonl` by running the following commands (in the `utils/` folder):

```bash
python get_dataset_info.py --previous_jsonl dataset_info.jsonl --output_jsonl dataset_info_dev.jsonl --datasets YOUR_DATASET_NAME
cat dataset_info_dev.jsonl >> dataset_info.jsonl
```
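The two commands above generate a one-record JSON-lines file for the new dataset and append it to the main registry. The append step can be sketched in plain Python (the record fields shown are hypothetical, not the actual `get_dataset_info.py` schema):

```python
import json

def append_dataset_record(registry_path, record):
    """Append one dataset record as a single JSON line, mirroring
    `cat dataset_info_dev.jsonl >> dataset_info.jsonl`."""
    with open(registry_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Because the registry is JSON Lines, appending never requires re-parsing the existing file.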
When adding a new dataset, you will probably encounter the following questions:
- Suggested doc: how to add_new_task_schema
- Suggested doc: how to add language information
- You can find scripts for adding different datasets across different tasks. Suggested doc: more examples
For example:
- if you aim to add a simple text classification dataset
- if you aim to add a simple summarization dataset
- if you aim to add a simple natural language inference dataset
- if you aim to add a dataset with different versions/domains/languages/subdatasets
- if your dataset has been packaged into a zip file, you can refer to this example
- if you want to upload your dataset to the DataLab web platform (which provides a range of data visualization and analysis tools), you can follow this doc.
NOTE:
- Use a lower-case string with `_` instead of `-` for the script name (e.g., `arxiv_sum.py`), and camel case for the class name (e.g., `ArxivSum`).
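The naming convention above can be checked mechanically. A small sketch (the helper name is ours, not part of DataLab) that derives the expected class name from a script name:

```python
def script_to_class_name(script_name):
    """Derive the expected camel-case class name from a snake_case
    loader script name, e.g. arxiv_sum.py -> ArxivSum."""
    stem = script_name[:-3] if script_name.endswith(".py") else script_name
    return "".join(part.capitalize() for part in stem.split("_"))
```

For example, `script_to_class_name("arxiv_sum.py")` gives `"ArxivSum"`, matching the convention.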
DataLab has a special environment variable `DATALAB_PRIVATE_LOC` that you can use to store private datasets. It can be a web location or a location on your filesystem. Insert this exact string, `DATALAB_PRIVATE_LOC`, into your dataset location, and then set the environment variable:

```bash
export DATALAB_PRIVATE_LOC=/path/to/private/root
```

and the environment variable will be substituted into your dataset path. You can see an example of how this is done in the `fig_qa` data loader.
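The substitution works roughly like this (a simplified sketch of the idea; the real logic lives in the datalabs loading code):

```python
import os

PLACEHOLDER = "DATALAB_PRIVATE_LOC"

def resolve_private_path(path):
    """Replace the DATALAB_PRIVATE_LOC placeholder in a dataset path
    with the value of the environment variable of the same name."""
    root = os.environ.get(PLACEHOLDER)
    if root is None:
        raise EnvironmentError(f"{PLACEHOLDER} is not set")
    return path.replace(PLACEHOLDER, root)
```

So a loader can hard-code `"DATALAB_PRIVATE_LOC/my_dataset/train.jsonl"` while each user points the variable at their own private root.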