
BERTtime Stories: Investigating the Role of Synthetic Story Data in Language Pre-training


Nikitas Theodoropoulos, Giorgos Filandrianos, Vassilis Lyberatos, Maria Lymperaiou, Giorgos Stamou

National Technical University of Athens (NTUA)
Artificial Intelligence and Learning Systems Laboratory (AILS)


Paper
HuggingFace Models


This repository contains the code implementation for our contribution to the 2nd iteration of the BabyLM Challenge. The challenge is centered around sample-efficient language modelling, given human-like data constraints of 10M and 100M words. Our approach relies on data augmentation using TinyStories — a synthetic dataset of short and simple stories.

We train GPT-Neo decoder models on subsets of TinyStories, varying the amount of available training data. We find that, even with access to less than 100M words, the models are able to generate high-quality and original completions for a given story.

To measure the effect of synthetic story data on LM pre-training, we train LTG-BERT encoder models on a combined dataset consisting of:

  • a subset of TinyStories
  • story completions generated by our GPT-Neo models
  • a subset of the BabyLM dataset.

Results indicate that synthetic data can occasionally offer modest gains, but overall it has a negative influence on linguistic understanding.

Our work is an initial study of the quality of synthetic story data in low-resource settings, and underscores its potential for data augmentation in data-constrained LM training. We hope that by releasing our implementation we will aid future research in this direction.

Project Structure

  • baby_lm/: source code
    • encoder/: LTG-BERT model implementation
    • generator/: GPT-Neo utilities for datasets, sampling, and evaluation
    • process_data/: data processing scripts and utilities
  • configs/: contains configuration files
    • models/: model architectures
    • preprocess/: dataset preprocessing and creation
    • sampling/: dataset generation and model evaluation
    • train/: GPT-Neo and LTG-BERT training configs for various data configurations
  • data/: contains training data and prompts
    • raw/: raw training data
      • generated/: synthetic data generated by GPT-Neo
    • processed/: processed training data
    • prompts/: various prompts
  • evaluation_files/: files needed to evaluate the GPT-Neo and LTG-BERT models with the official pipeline; see instructions in Evaluating Linguistic Abilities
  • outputs/: contains outputs of the project
    • models/: trained models
    • tokenizers/: trained tokenizers
    • evaluation/: GPT-Neo Self-BLEU evaluation results
  • scripts_slurm/: contains useful script templates for conducting experiments with SLURM

Setup

  • To install dependencies with Poetry, run poetry install
  • Alternatively, pip install -r requirements.txt can be used
  • The code was tested with Python 3.12, but should also be compatible with other versions after adjusting dependencies
  • To use Weights & Biases, set WANDB_KEY in baby_lm/train_config.py; otherwise, set the wandb_log option to False when running train.py
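
For reference, a minimal setup might look like the following sketch (the environment name is illustrative; any Python 3.12 environment works):

# create and activate an environment (name is illustrative)
conda create --name berttime python=3.12 && conda activate berttime
# install dependencies with Poetry, or fall back to pip
poetry install                      # or: pip install -r requirements.txt
# optionally, set WANDB_KEY in baby_lm/train_config.py to enable Weights & Biases logging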

Data Processing

To begin preprocessing, we first need to download the raw training data:

BabyLM data

  • The BabyLM text dataset needs to be present in data/raw/babylm/, and can be downloaded from here

TinyStories data

  • The TinyStories dataset needs to be present in data/raw/tinystories/, and can be found here

  • For training GPT-Neo models we use the newer TinyStories dataset generated by GPT-4: TinyStoriesV2-GPT4-train.txt, TinyStoriesV2-GPT4-valid.txt

  • The original dataset TinyStories-train.txt is only needed to evaluate the GPT-Neo model released by the TinyStories authors

Data directory structure:

The expected data directory structure is:

data/raw/
├── babylm
│  ├── dev
│  │  ├── ...
│  ├── test
│  │  ├── ...
│  ├── train_10M
│  │  ├── ...
│  └── train_100M
│     ├── bnc_spoken.train
│     ├── ...
└── tinystories
   ├── TinyStories-train.txt
   ├── TinyStoriesV2-GPT4-train.txt
   └── TinyStoriesV2-GPT4-valid.txt

Below we give instructions for preprocessing and constructing the training datasets for the various data configurations.
These are then used for training the GPT-Neo and LTG-BERT models.

GPT-Neo Training Data

To prepare data for GPT-Neo training using the TinyStories dataset, run the following command:

  • python -m baby_lm.process_data._prepare_tinystories_data_decoder --config_file <CONFIG_PATH>
  • Different config files can be used, depending on the size of the TinyStories training dataset:
    <CONFIG_PATH> can be configs/preprocess/tinystories/decoder_tinystories_{5,10,25,50,75,100,500}m.yaml
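
For example, to build the 50M-word TinyStories training set for the decoder (config name taken from the list above):

python -m baby_lm.process_data._prepare_tinystories_data_decoder \
    --config_file configs/preprocess/tinystories/decoder_tinystories_50m.yaml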

LTG-BERT Training Data

To prepare data for LTG-BERT training using the BabyLM dataset, run the following command:

  • python -m baby_lm.process_data._prepare_babylm_data_encoder --config_file <CONFIG_PATH>
  • Different config files can be used, depending on the size of the BabyLM training dataset:
    <CONFIG_PATH> can be configs/preprocess/babylm/babylm_train_{10,100}m.yaml
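
For example, to preprocess the 10M-word BabyLM split (Strict-Small track):

python -m baby_lm.process_data._prepare_babylm_data_encoder \
    --config_file configs/preprocess/babylm/babylm_train_10m.yaml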

To prepare data for LTG-BERT training using the TinyStories dataset, run the following command:

  • python -m baby_lm.process_data._prepare_tinystories_data_encoder --config_file <CONFIG_PATH>
  • If you don't want to include generated data:
    <CONFIG_PATH> can be configs/preprocess/tinystories/encoder_tinystories_{10,100}m_nogen.yaml
  • If you want to use synthetic data, the greedy generation datasets from the GPT-Neo-5m and GPT-Neo-50m models must already be present (see Training Models and Data Generation). You can then use the configs:
    <CONFIG_PATH> can be configs/preprocess/tinystories/encoder_tinystories_{10,100}m.yaml
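
For example, to preprocess the 10M-word TinyStories split for the encoder without any generated data (no trained GPT-Neo model required):

python -m baby_lm.process_data._prepare_tinystories_data_encoder \
    --config_file configs/preprocess/tinystories/encoder_tinystories_10m_nogen.yaml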

To prepare data for LTG-BERT training using a combination of BabyLM and TinyStories data you must first run the two scripts above for the standalone pre-processing of the TinyStories and BabyLM datasets.

  • E.g., if you want to train with a combination of 5M words of TinyStories and 5M words of BabyLM data, the 10M splits of both datasets need to already be processed using the configs: babylm_train_10m.yaml, encoder_tinystories_10m.yaml

Afterwards, to create the combined training dataset run the following command:

  • python -m baby_lm.process_data._prepare_joint_training_data_encoder --config_file <CONFIG_PATH>

  • If you don't want to use generated data, you can use the following configs:
    <CONFIG_PATH> can be configs/preprocess/joint/baby{5,50}m_tiny{5,50}m_nogen.yaml

  • If you want to use synthetic data generated by GPT-Neo, it must already be present. To train a GPT-Neo model and use it to generate the synthetic dataset, see Training Models and Data Generation.
    Afterwards, you can use the following configs, depending on the size of the BabyLM and TinyStories training datasets and the sampling method used for generating the synthetic training data (greedy or nucleus):
    <CONFIG_PATH> can be configs/preprocess/joint/baby{5,50}m_tiny{5,50}m_{greedy,nucleus1,nucleus5,nucleus10}.yaml
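
For example, to build the combined 5M + 5M dataset without generated data, after both standalone preprocessing steps above have been run:

python -m baby_lm.process_data._prepare_joint_training_data_encoder \
    --config_file configs/preprocess/joint/baby5m_tiny5m_nogen.yaml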

Note: To ensure correctness, the data creation process is deliberately redundant: the combined TinyStories and BabyLM splits are re-generated for every sampling method, even though these files are identical across methods. To save disk space and resources, you can run the preprocessing for only one sampling method, e.g., using the greedy sampling config, and then edit the nucleus5 config file to use the processed files from the greedy data folder, keeping only the generated data file different.

Training Models

To train either LTG-BERT or GPT-Neo models, first use the scripts above to preprocess the training data and create the corresponding datasets. Then the following command can be used:

  • python -m baby_lm.train --training_config <TRAIN_CONFIG> --experiment_config <EXP_CONFIG>

The <TRAIN_CONFIG> is the basic configuration file which is then updated using the <EXP_CONFIG> file to define each experiment. Additionally, command line arguments take precedence over both config files, should you wish to quickly change a training parameter.

  • For LTG-BERT training, <TRAIN_CONFIG> can be either configs/train/train_ltg_bert/_base_LTG-BERT.yaml to train models for the Strict track (100M words), or configs/train/train_ltg_bert/_small_LTG-BERT.yaml to train models for the Strict-Small track (10M words). For GPT-Neo training it has to always be configs/train/train_gpt-neo/_base_GPT-Neo.yaml.

  • The <EXP_CONFIG> file varies depending on the experiment, and can be selected from the config files in the configs/train/train_ltg-bert/ and configs/train/train_gpt-neo/ directories.
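
As a sketch, a Strict-Small LTG-BERT run could be launched as follows; the experiment config is a placeholder (pick an actual file from the directories above), and the wandb_log override is shown only to illustrate that command line arguments take precedence (the exact flag syntax may differ):

python -m baby_lm.train \
    --training_config configs/train/train_ltg_bert/_small_LTG-BERT.yaml \
    --experiment_config <EXP_CONFIG> \
    --wandb_log False    # illustrative command line override; <EXP_CONFIG> is a placeholder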

Note: After training, to use a model for evaluation or data generation, you must convert the tokenizer and model to the HuggingFace format by running:

  • python convert_HF.py --checkpoint_path <PATH> --model_type <TYPE>, where <PATH> is the path to the model checkpoint directory and <TYPE> is either gpt or ltg-bert.
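
For instance, to convert a trained GPT-Neo checkpoint (the checkpoint path below is illustrative; point it at the directory produced by your training run):

# the checkpoint path is illustrative
python convert_HF.py --checkpoint_path outputs/models/gpt_neo_50M/checkpoint --model_type gpt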

Note: The GPT-Neo models should be moved to this directory structure, with these specific names, in order for the generation and evaluation scripts to work correctly:

models
├── gpt_neo_5M
│  ├── checkpoint
│  │  └── ...
├── gpt_neo_10M
├── gpt_neo_25M
├── gpt_neo_50M
├── gpt_neo_75M
├── gpt_neo_100M
└── gpt_neo_500M

Data Generation

To generate data using GPT-Neo, both the training data and the corresponding trained model need to exist. Then the following command can be run:

  • python -m baby_lm.generator.sample --config_file <CONFIG_PATH>
  • Different config files can be used to change the generation strategy:
    <CONFIG_PATH> can be configs/sampling/dataset/generate_dataset_tiny{5,10,50,100}m_{greedy,nucleus1,nucleus5,nucleus10}.yaml
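
For example, to generate a dataset of greedy completions with the model trained on the 50M-word TinyStories split (config name follows the pattern above):

python -m baby_lm.generator.sample \
    --config_file configs/sampling/dataset/generate_dataset_tiny50m_greedy.yaml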

Evaluation

Evaluating GPT-Neo Generations — Self-BLEU

Self-BLEU evaluation: To evaluate the Self-BLEU score for the GPT-Neo models included in the paper's analysis (GPT-Neo-5m, GPT-Neo-10m, GPT-Neo-25m, GPT-Neo-50m, GPT-Neo-75m, GPT-Neo-100m, roneneldan/TinyStories-33M_HF), use the command:

  • python -m baby_lm.generator.calculate_self_bleu_per_model

Either locally trained models or the uploaded HF models can be used for the evaluation; see the script for more details.

Self-BLEU evaluation for $k$ generations per story (nucleus sampling): To reproduce the analysis presented in the paper, use the command:

  • python -m baby_lm.generator.calculate_self_bleu_nucleus_k --model <MODEL>, where <MODEL> can be either a local model, e.g., gpt_neo_50M, or an HF model, e.g., nikitastheo/GPT-Neo-50m_HF

Evaluating Linguistic Abilities — Challenge Benchmarks

To evaluate the models we build upon the official evaluation pipeline of the 2024 BabyLM Challenge. The steps to recreate our evaluation are listed below:

  1. Create a new virtual environment in your preferred way, e.g., conda create --name eval python=3.12 && conda activate eval
  2. Install the evaluation pipeline by running ./prepare_eval.sh
  3. cd into evaluation-pipeline-2024 and run ./install.sh
  4. Follow the instructions at https://github.com/babylm/evaluation-pipeline-2024 to download the evaluation data for BLiMP, BLiMP Supplement, EWoK and (Super)GLUE
  5. Run the evaluation with python evaluate_all.py --config_file models.yaml. You can choose which models are evaluated by editing models.yaml
  • The file ltg_bert_glue_config_finetune.yaml contains the hyperparameters used for the (Super)GLUE evaluation.
  • The script collect_results.py <MODEL_NAME> can be used to collect a model's results in one file (results for all benchmarks need to be present)
  • The script score_predictions.py can be used to output a summary of a model's performance.
  • Both scripts were originally provided by the organizers and slightly modified by us
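
Put together, a run consolidating the steps above might look like this sketch (models.yaml should be edited beforehand to list the models you want to evaluate; paths may need adjusting to your checkout):

# steps 1-3: environment and pipeline installation
conda create --name eval python=3.12 && conda activate eval
./prepare_eval.sh
cd evaluation-pipeline-2024 && ./install.sh
# step 4: download the BLiMP, BLiMP Supplement, EWoK and (Super)GLUE data,
#         following https://github.com/babylm/evaluation-pipeline-2024
# step 5: run the evaluation (from the directory containing evaluate_all.py)
python evaluate_all.py --config_file models.yaml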

LLM Evaluation

The outputs of the LLM evaluation using Claude Sonnet are stored in the corresponding HF dataset.

SLURM Scripts

The directory scripts_slurm contains a small but helpful utility for running experiments using SLURM. It was created to make experimentation on a SLURM cluster easier. It contains two template files, srun_script_template_multi_gpu.slurm and srun_script_template.slurm. Before running, please adapt these scripts to match your server configuration by filling in the ... fields, leaving variables like <VARIABLE> intact.

The two scripts contain variables that are substituted according to an experiment config, and a new SLURM script is created to run that specific experiment. We present an example below:

  • First run python ./scripts_slurm/run_experiment.py --exp_name 4_GPT-Neo_50m --base_script _base_GPT-Neo --conf_dir configs/train/train_gpt_neo/ --num_gpus 4

  • After this command, the file ./scripts_slurm/configs/train/train_gpt_neo/srun_script_4_GPT-Neo_50m_GPUS_4.yaml will be created, and can be run directly with the sbatch command on the cluster

Please cite the following publication

@misc{theodoropoulos2024berttimestoriesinvestigatingrole,
      title={BERTtime Stories: Investigating the Role of Synthetic Story Data in Language Pre-training}, 
      author={Nikitas Theodoropoulos and Giorgos Filandrianos and Vassilis Lyberatos and Maria Lymperaiou and Giorgos Stamou},
      year={2024},
      eprint={2410.15365},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.15365}, 
}

Correspondence

For inquiries or comments you can email me directly at [email protected] or open an issue.

Acknowledgements

During the development of this codebase we were aided by the following public code repositories. We thank the authors for their contributions to open-source research, and hope that the release of our implementation will also help future researchers and ML practitioners.
