Nikitas Theodoropoulos, Giorgos Filandrianos, Vassilis Lyberatos, Maria Lymperaiou, Giorgos Stamou
National Technical University of Athens (NTUA)
Artificial Intelligence and Learning Systems Laboratory (AILS)
This repository contains the code implementation for our contribution to the 2nd iteration of the BabyLM Challenge. The challenge is centered around sample-efficient language modelling, given human-like data constraints of 10M and 100M words. Our approach relies on data augmentation using TinyStories — a synthetic dataset of short and simple stories.
We train GPT-Neo decoder models on subsets of TinyStories, varying the amount of available training data. We find that even with access to less than 100M words the models are able to generate high-quality and original completions to a given story.
To measure the effect of synthetic story data on LM pre-training, we train LTG-BERT encoder models on a combined dataset consisting of:
- a subset of TinyStories
- story completions generated by our GPT-Neo models
- a subset of the BabyLM dataset.
Results indicate that synthetic data can occasionally offer modest gains, but overall have a negative influence on linguistic understanding.
Our work is an initial study on the quality of synthetic story data in low-resource settings, and underscores their potential for augmentation in data-constrained LM training. We hope that by releasing our implementation we will aid future research in this direction.
- `baby_lm/`: source code
  - `encoder/`: LTG-BERT model implementation
  - `generator/`: GPT-Neo utilities for dataset, sampling and evaluation
  - `process_data/`: data processing scripts and utilities
- `configs/`: contains configuration files
  - `models/`: model architectures
  - `preprocess/`: dataset preprocessing and creation
  - `sampling/`: dataset generation and model evaluation
  - `train/`: GPT-Neo and LTG-BERT training configs for various data configurations
- `data/`: contains training data and prompts
  - `raw/`: raw training data
  - `generated/`: synthetic data generated by GPT-Neo
  - `processed/`: processed training data
  - `prompts/`: various prompts
- `evaluation_files/`: files needed for the evaluation of the GPT-Neo and LTG-BERT models using the official pipeline, see instructions in Evaluating Linguistic Abilities
- `outputs/`: contains outputs of the project
  - `models/`: trained models
  - `tokenizers/`: trained tokenizers
  - `evaluation/`: GPT-Neo Self-BLEU evaluation results
- `scripts_slurm/`: contains useful script templates for conducting experiments with SLURM
- To install dependencies using Poetry run `poetry install`
- Alternatively, `pip install -r requirements.txt` can be used
- The code was tested with Python 3.12, but should also be compatible with other versions after adjusting dependencies
- To use Weights & Biases please set `WANDB_KEY` in `baby_lm/train_config.py`, otherwise set the `wandb_log` option to `False` while running `train.py`
To begin preprocessing, we first need to download the raw training data:
- The BabyLM text dataset needs to be present in `/data/raw/babylm/`, and can be downloaded from here
- The TinyStories dataset needs to be present in `/data/raw/tinystories/`, and can be found here
- For training GPT-Neo models we use the newer TinyStories dataset generated by GPT-4: `TinyStoriesV2-GPT4-train.txt`, `TinyStoriesV2-GPT4-valid.txt`
- The original dataset `TinyStories-train.txt` is only needed to evaluate the GPT-Neo model released by the TinyStories authors
The expected data directory structure is:
data/raw/
├── babylm
│ ├── dev
│ │ ├── ...
│ ├── test
│ │ ├── ...
│ ├── train_10M
│ │ ├── ...
│ └── train_100M
│ ├── bnc_spoken.train
│   ├── ...
└── tinystories
├── TinyStories-train.txt
├── TinyStoriesV2-GPT4-train.txt
└── TinyStoriesV2-GPT4-valid.txt
Below we give instructions on preprocessing and constructing the training datasets for various data configurations. These are then used for training the GPT-Neo and LTG-BERT models.
To prepare data for GPT-Neo training using the TinyStories dataset run the following command:
python -m baby_lm.process_data._prepare_tinystories_data_decoder --config_file <CONFIG_PATH>
- Different config files can be used, depending on the size of the TinyStories training dataset: `<CONFIG_PATH>` can be `configs/preprocess/tinystories/decoder_tinystories_{5,10,25,50,75,100,500}m.yaml` (see the example below)
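For example, to build the 10M-word TinyStories dataset for GPT-Neo training (using one of the configs listed above), the invocation would look like this:

```bash
# Prepare the 10M-word TinyStories split for GPT-Neo (decoder) training
python -m baby_lm.process_data._prepare_tinystories_data_decoder \
    --config_file configs/preprocess/tinystories/decoder_tinystories_10m.yaml
```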
To prepare data for LTG-BERT training using the BabyLM dataset, run the following command:
python -m baby_lm.process_data._prepare_babylm_data_encoder --config_file <CONFIG_PATH>
- Different config files can be used, depending on the size of the BabyLM training dataset: `<CONFIG_PATH>` can be `configs/preprocess/babylm/babylm_train_{10,100}m.yaml` (see the example below)
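For instance, preparing the 10M-word BabyLM split (Strict-Small track) with the config above would look like:

```bash
# Prepare the 10M-word BabyLM dataset for LTG-BERT (encoder) training
python -m baby_lm.process_data._prepare_babylm_data_encoder \
    --config_file configs/preprocess/babylm/babylm_train_10m.yaml
```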
To prepare data for LTG-BERT training using the TinyStories dataset, run the following command:
python -m baby_lm.process_data._prepare_tinystories_data_encoder --config_file <CONFIG_PATH>
- If you don't want to include generated data, `<CONFIG_PATH>` can be `configs/preprocess/tinystories/encoder_tinystories_{10,100}m_nogen.yaml` (see the example below)
- If you want to use synthetic data, the greedy generation dataset created using the GPT-Neo-5m and GPT-Neo-50m models should be present, see Training Models and Data Generation. Then `<CONFIG_PATH>` can be `configs/preprocess/tinystories/encoder_tinystories_{10,100}m.yaml`
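As an illustration, preparing the 10M-word TinyStories encoder dataset without any generated data (using the `_nogen` config named above) would be:

```bash
# Prepare TinyStories data for LTG-BERT training, without synthetic completions
python -m baby_lm.process_data._prepare_tinystories_data_encoder \
    --config_file configs/preprocess/tinystories/encoder_tinystories_10m_nogen.yaml
```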
To prepare data for LTG-BERT training using a combination of BabyLM and TinyStories data you must first run the two scripts above for the standalone pre-processing of the TinyStories and BabyLM datasets.
- E.g., if you want to train with a combination of 5m of TinyStories and 5m of BabyLM data, the 10m splits from both datasets need to be already processed using the configs `babylm_train_10m.yaml` and `encoder_tinystories_10m.yaml`
Afterwards, to create the combined training dataset, run the following command:
python -m baby_lm.process_data._prepare_joint_training_data_encoder --config_file <CONFIG_PATH>
- If you don't want to use generated data, `<CONFIG_PATH>` can be `configs/preprocess/joint/baby{5,50}m_tiny{5,50}m_nogen.yaml`
- If you want to use synthetic data generated by GPT-Neo, it must already be present. To train a GPT-Neo model and use it to generate the synthetic dataset, see Training Models and Data Generation. Afterwards, depending on the size of the BabyLM and TinyStories training datasets and on the sampling method used for generating the synthetic training data (greedy or nucleus), `<CONFIG_PATH>` can be `configs/preprocess/joint/baby{5,50}m_tiny{5,50}m_{greedy,nucleus1,nucleus5,nucleus10}.yaml` (see the example below)
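For example, building the combined 5m BabyLM + 5m TinyStories dataset with greedily generated synthetic data (assuming the prerequisites above are in place, and instantiating the config pattern listed above) would look like:

```bash
# Combine 5m of BabyLM and 5m of TinyStories data, plus greedy synthetic completions
python -m baby_lm.process_data._prepare_joint_training_data_encoder \
    --config_file configs/preprocess/joint/baby5m_tiny5m_greedy.yaml
```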
Note: To ensure correctness, the data generation process is inefficient: the combined TinyStories and BabyLM splits are produced from scratch for each sampling method, even though the files are the same. To save disk space and resources you can run the preprocessing for only one sampling method, e.g., `greedy`, and then change the `nucleus5` config file to use the processed files from the `greedy` data folder, keeping only the generated data file different.
To train either LTG-BERT or GPT-Neo models, first use the scripts above to preprocess the training data and create the corresponding datasets. Then the following command can be used:
python -m baby_lm.train --training_config <TRAIN_CONFIG> --experiment_config <EXP_CONFIG>
The `<TRAIN_CONFIG>` is the basic configuration file, which is then updated using the `<EXP_CONFIG>` file to define each experiment. Additionally, command-line arguments take precedence over both config files, should you wish to quickly change a training parameter.
- For LTG-BERT training, `<TRAIN_CONFIG>` can be either `configs/train/train_ltg_bert/_base_LTG-BERT.yaml` to train models for the Strict track (100M words), or `configs/train/train_ltg_bert/_small_LTG-BERT.yaml` to train models for the Strict-Small track (10M words). For GPT-Neo training it always has to be `configs/train/train_gpt-neo/_base_GPT-Neo.yaml`.
- The `<EXP_CONFIG>` file varies depending on the experiment, and can be selected from the config files in the `configs/train/train_ltg-bert/` and `configs/train/train_gpt-neo/` directories.
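As an example, a GPT-Neo training run could be launched as follows; the experiment config name here is illustrative (borrowed from the SLURM example further below) and should be replaced with an actual file from the GPT-Neo train config directory:

```bash
# Train GPT-Neo: the base config supplies defaults, the experiment config overrides them
# NOTE: the experiment config name is illustrative; use an existing file from configs/train/train_gpt-neo/
python -m baby_lm.train \
    --training_config configs/train/train_gpt-neo/_base_GPT-Neo.yaml \
    --experiment_config configs/train/train_gpt-neo/4_GPT-Neo_50m.yaml
```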
Note: After training, to use a model for evaluation or data generation, you must convert the tokenizer and model to the HuggingFace format by running the script:
python convert_HF.py --checkpoint_path <PATH> --model_type <TYPE>
where `<PATH>` is the path to the model checkpoint directory, and `<TYPE>` is either `gpt` or `ltg-bert`.
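For instance, converting a trained GPT-Neo checkpoint might look like the following; the checkpoint path is illustrative (it follows the directory layout shown below) and should point at your own trained model:

```bash
# Convert a GPT-Neo checkpoint and its tokenizer to the HuggingFace format
# NOTE: adjust the checkpoint path to your actual checkpoint directory
python convert_HF.py --checkpoint_path models/gpt_neo_50M/checkpoint --model_type gpt
```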
Note: The GPT-Neo models should be moved to this directory structure, with these specific names, in order for the generation and evaluation scripts to work correctly:
models
├── gpt_neo_5M
│ ├── checkpoint
│ │ └── ...
├── gpt_neo_10M
├── gpt_neo_25M
├── gpt_neo_50M
├── gpt_neo_75M
├── gpt_neo_100M
└── gpt_neo_500M
To generate data using GPT-Neo, both the training data and the corresponding trained model need to exist. Then the following command can be run:
python -m baby_lm.generator.sample --config_file <CONFIG_PATH>
- Different config files can be used to change the generation strategy: `<CONFIG_PATH>` can be `configs/sampling/dataset/generate_dataset_tiny{5,10,50,100}m_{greedy,nucleus1,nucleus5,nucleus10}.yaml` (see the example below)
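For example, generating a synthetic dataset with the GPT-Neo model trained on 50M words, using greedy decoding (one of the configs above), would be:

```bash
# Generate story completions with greedy decoding, using the GPT-Neo-50m model
python -m baby_lm.generator.sample \
    --config_file configs/sampling/dataset/generate_dataset_tiny50m_greedy.yaml
```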
Self-BLEU Evaluation
To evaluate the self-BLEU score for the GPT-Neo models included in the paper analysis (GPT-Neo-5m, GPT-Neo-10m, GPT-Neo-25m, GPT-Neo-50m, GPT-Neo-75m, GPT-Neo-100m, `roneneldan/TinyStories-33M_HF`), use the command:
python -m baby_lm.generator.calculate_self_bleu_per_model
Either locally trained or the uploaded HF models can be used for the evaluation, see the script for more details.
Self-BLEU Evaluation for nucleus sampling
python -m baby_lm.generator.calculate_self_bleu_nucleus_k --model <MODEL>
where `<MODEL>` can be either a local model, e.g., `gpt_neo_50M`, or an HF model, e.g., `nikitastheo/GPT-Neo-50m_HF`
To evaluate the models we build upon the official evaluation pipeline of the 2024 BabyLM Challenge. The steps to recreate our evaluation are listed below:
- Create a new virtual environment in your preferred way, e.g., `conda create --name eval python=3.12 && conda activate eval`
- Then install the evaluation pipeline by running `./prepare_eval.sh`
- cd inside `evaluation-pipeline-2024` and run `./install.sh`
- Follow the instructions at https://github.com/babylm/evaluation-pipeline-2024 to download the evaluation data for BLiMP, BLiMP Supplement, EWoK and (Super)GLUE
- Run the evaluation with `python evaluate_all.py --config_file models.yaml`. You can choose which models are evaluated by editing `models.yaml`
- The file `ltg_bert_glue_config_finetune.yaml` contains the hyperparameters used for the (Super)GLUE evaluation.
- The script `collect_results.py <MODEL_NAME>` can be used to collect a model's results in one file (results for all benchmarks need to be present).
- The script `score_predictions.py` can be used to output a summary of a model's performance.
- Both scripts are originally provided by the organizers and were slightly modified by us.
The outputs of the LLM evaluation using Claude Sonnet are stored in the corresponding HF dataset
The directory `scripts_slurm` contains a small but helpful utility for running experiments using SLURM. It was created to make experimentation on a SLURM cluster easier.
It contains two template files, `srun_script_template_multi_gpu.slurm` and `srun_script_template.slurm`. Before running, please change these scripts to match your server configuration by filling in the `...` fields, and leaving variables like `<VARIABLE>` intact.
The two scripts contain variables that are substituted according to an experiment config, and a new SLURM script is created to run that specific experiment. We present an example below:
- First run:
python ./scripts_slurm/run_experiment.py --exp_name 4_GPT-Neo_50m --base_script _base_GPT-Neo --conf_dir configs/train/train_gpt_neo/ --num_gpus 4
- After this command, the file `./scripts_slurm/configs/train/train_gpt_neo/srun_script_4_GPT-Neo_50m_GPUS_4.yaml` will be created and can be run directly with the `sbatch` command on the cluster
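The generated script can then be submitted to the scheduler, e.g. (file name taken from the example above):

```bash
# Submit the generated SLURM script to the cluster scheduler
sbatch ./scripts_slurm/configs/train/train_gpt_neo/srun_script_4_GPT-Neo_50m_GPUS_4.yaml
```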
@misc{theodoropoulos2024berttimestoriesinvestigatingrole,
title={BERTtime Stories: Investigating the Role of Synthetic Story Data in Language Pre-training},
author={Nikitas Theodoropoulos and Giorgos Filandrianos and Vassilis Lyberatos and Maria Lymperaiou and Giorgos Stamou},
year={2024},
eprint={2410.15365},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.15365},
}
For inquiries or comments you can email me directly at [email protected] or open an issue.
During the development of this codebase we were aided by the following public code repositories. We thank the authors for their contribution to open-source research, and hope that the release of our implementation will also help future researchers and ML practitioners.