
BERTtime Stories: Investigating the Role of Synthetic Story Data in Language Pre-training


Nikitas Theodoropoulos, Giorgos Filandrianos, Vassilis Lyberatos, Maria Lymperaiou, Giorgos Stamou

National Technical University of Athens (NTUA)
Artificial Intelligence and Learning Systems Laboratory (AILS)


Paper
HuggingFace Models


This repository contains the code implementation for our contribution to the 2nd iteration of the BabyLM Challenge. The challenge is centered around sample-efficient language modelling, given human-like data constraints of 10M and 100M words. Our approach relies on data augmentation using TinyStories — a synthetic dataset of short and simple stories.

We train GPT-Neo decoder models on subsets of TinyStories, varying the amount of available training data. We find that, even with access to less than 100M words, the models are able to generate high-quality and original completions for a given story.

To measure the effect of synthetic story data on LM pre-training, we train LTG-BERT encoder models on a combined dataset consisting of:

  • a subset of TinyStories
  • story completions generated by our GPT-Neo models
  • a subset of the BabyLM dataset.

Results indicate that synthetic data can occasionally offer modest gains, but overall it has a negative influence on linguistic understanding.

Our work is an initial study of the quality of synthetic story data in low-resource settings, and underscores its potential for data augmentation in data-constrained LM training. We hope that by releasing our implementation we will aid future research in this direction.

Project Structure

  • baby_lm/: source code
    • encoder/: LTG-BERT model implementation
    • generator/: GPT-Neo utilities for datasets, sampling, and evaluation
    • process_data/: data processing scripts and utilities
  • configs/: contains configuration files
    • models/: model architectures
    • preprocess/: dataset preprocessing and creation
    • sampling/: dataset generation and model evaluation
    • train/: GPT-Neo and LTG-BERT training configs for various data configurations
  • data/: contains training data and prompts
    • raw/: raw training data
      • generated/: synthetic data generated by GPT-Neo
    • processed/: processed training data
    • prompts/: various prompts
  • evaluation_files/: files needed to evaluate the GPT-Neo and LTG-BERT models with the official pipeline; see instructions in Evaluating Linguistic Abilities
  • outputs/: contains outputs of the project
    • models/: trained models
    • tokenizers/: trained tokenizers
    • evaluation/: GPT-Neo Self-BLEU evaluation results
  • scripts_slurm/: contains useful script templates for conducting experiments with SLURM

Setup

  • To install dependencies with Poetry, run poetry install
  • Alternatively, pip install -r requirements.txt can be used
  • The code was tested with Python 3.12, but should also be compatible with other versions after adjusting dependencies
  • To use Weights & Biases, set WANDB_KEY in baby_lm/train_config.py; otherwise, set the wandb_log option to False when running train.py
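
For reference, a minimal setup might look like the following sketch (the environment name is illustrative; any Python 3.12 environment works):

# create and activate an environment (name is illustrative)
conda create --name berttime python=3.12 && conda activate berttime
# install dependencies with Poetry, or fall back to pip
poetry install                      # or: pip install -r requirements.txt
# optionally, set WANDB_KEY in baby_lm/train_config.py to enable Weights & Biases logging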

Data Processing

To begin preprocessing, we first need to download the raw training data:

BabyLM data

  • The BabyLM text dataset needs to be present in data/raw/babylm/, and can be downloaded from here

TinyStories data

  • The TinyStories dataset needs to be present in data/raw/tinystories/, and can be found here

  • For training GPT-Neo models we use the newer TinyStories dataset generated by GPT-4: TinyStoriesV2-GPT4-train.txt, TinyStoriesV2-GPT4-valid.txt

  • The original dataset TinyStories-train.txt is only needed to evaluate the GPT-Neo model released by the TinyStories authors

Data directory structure:

The expected data directory structure is:

data/raw/
├── babylm
│  ├── dev
│  │  ├── ...
│  ├── test
│  │  ├── ...
│  ├── train_10M
│  │  ├── ...
│  └── train_100M
│     ├── bnc_spoken.train
│     ├── ...
└── tinystories
   ├── TinyStories-train.txt
   ├── TinyStoriesV2-GPT4-train.txt
   └── TinyStoriesV2-GPT4-valid.txt

Below we give instructions for preprocessing and constructing the training datasets for the various data configurations.
These are then used for training the GPT-Neo and LTG-BERT models.

GPT-Neo Training Data

To prepare data for GPT-Neo training using the TinyStories dataset, run the following command:

  • python -m baby_lm.process_data._prepare_tinystories_data_decoder --config_file <CONFIG_PATH>
  • Different config files can be used, depending on the size of the TinyStories training dataset:
    <CONFIG_PATH> can be configs/preprocess/tinystories/decoder_tinystories_{5,10,25,50,75,100,500}m.yaml
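
For example, to build the 50M-word TinyStories training set for the decoder (config name taken from the list above):

python -m baby_lm.process_data._prepare_tinystories_data_decoder \
    --config_file configs/preprocess/tinystories/decoder_tinystories_50m.yaml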

LTG-BERT Training Data

To prepare data for LTG-BERT training using the BabyLM dataset, run the following command:

  • python -m baby_lm.process_data._prepare_babylm_data_encoder --config_file <CONFIG_PATH>
  • Different config files can be used, depending on the size of the BabyLM training dataset:
    <CONFIG_PATH> can be configs/preprocess/babylm/babylm_train_{10,100}m.yaml
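
For example, to preprocess the 10M-word BabyLM split (Strict-Small track):

python -m baby_lm.process_data._prepare_babylm_data_encoder \
    --config_file configs/preprocess/babylm/babylm_train_10m.yaml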

To prepare data for LTG-BERT training using the TinyStories dataset, run the following command:

  • python -m baby_lm.process_data._prepare_tinystories_data_encoder --config_file <CONFIG_PATH>
  • If you don't want to include generated data:
    <CONFIG_PATH> can be configs/preprocess/tinystories/encoder_tinystories_{10,100}m_nogen.yaml
  • If you want to use synthetic data, the greedy generation datasets from the GPT-Neo-5m and GPT-Neo-50m models must already be present (see Training Models and Data Generation). You can then use the configs:
    <CONFIG_PATH> can be configs/preprocess/tinystories/encoder_tinystories_{10,100}m.yaml
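
For example, to preprocess the 10M-word TinyStories split for the encoder without any generated data (no trained GPT-Neo model required):

python -m baby_lm.process_data._prepare_tinystories_data_encoder \
    --config_file configs/preprocess/tinystories/encoder_tinystories_10m_nogen.yaml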

To prepare data for LTG-BERT training using a combination of BabyLM and TinyStories data you must first run the two scripts above for the standalone pre-processing of the TinyStories and BabyLM datasets.

  • E.g., if you want to train with a combination of 5M words of TinyStories and 5M words of BabyLM data, the 10M splits of both datasets need to already be processed using the configs: babylm_train_10m.yaml, encoder_tinystories_10m.yaml

Afterwards, to create the combined training dataset run the following command:

  • python -m baby_lm.process_data._prepare_joint_training_data_encoder --config_file <CONFIG_PATH>

  • If you don't want to use generated data, you can use the following configs:
    <CONFIG_PATH> can be configs/preprocess/joint/baby{5,50}m_tiny{5,50}m_nogen.yaml

  • If you want to use synthetic data generated by GPT-Neo, it must already be present. To train a GPT-Neo model and use it to generate the synthetic dataset, see Training Models and Data Generation.
    Afterwards, you can use the following configs, depending on the size of the BabyLM and TinyStories training datasets and the sampling method used for generating the synthetic training data (greedy or nucleus):
    <CONFIG_PATH> can be configs/preprocess/joint/baby{5,50}m_tiny{5,50}m_{greedy,nucleus1,nucleus5,nucleus10}.yaml
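
For example, to build the combined 5M + 5M dataset without generated data, after both standalone preprocessing steps above have been run:

python -m baby_lm.process_data._prepare_joint_training_data_encoder \
    --config_file configs/preprocess/joint/baby5m_tiny5m_nogen.yaml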

Note: To ensure correctness, the data creation process is deliberately redundant: the combined TinyStories and BabyLM splits are re-generated for every sampling method, even though these files are identical across methods. To save disk space and resources, you can run the preprocessing for only one sampling method, e.g., using the greedy sampling config, and then edit the nucleus5 config file to use the processed files from the greedy data folder, keeping only the generated data file different.

Training Models

To train either LTG-BERT or GPT-Neo models, first use the scripts above to preprocess the training data and create the corresponding datasets. Then the following command can be used:

  • python -m baby_lm.train --training_config <TRAIN_CONFIG> --experiment_config <EXP_CONFIG>

The <TRAIN_CONFIG> is the basic configuration file which is then updated using the <EXP_CONFIG> file to define each experiment. Additionally, command line arguments take precedence over both config files, should you wish to quickly change a training parameter.

  • For LTG-BERT training, <TRAIN_CONFIG> can be either configs/train/train_ltg_bert/_base_LTG-BERT.yaml to train models for the Strict track (100M words), or configs/train/train_ltg_bert/_small_LTG-BERT.yaml to train models for the Strict-Small track (10M words). For GPT-Neo training it has to always be configs/train/train_gpt-neo/_base_GPT-Neo.yaml.

  • The <EXP_CONFIG> file varies depending on the experiment, and can be selected from the config files in the configs/train/train_ltg-bert/ and configs/train/train_gpt-neo/ directories.
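
As a sketch, a Strict-Small LTG-BERT run could be launched as follows; the experiment config is a placeholder (pick an actual file from the directories above), and the wandb_log override is shown only to illustrate that command line arguments take precedence (the exact flag syntax may differ):

python -m baby_lm.train \
    --training_config configs/train/train_ltg_bert/_small_LTG-BERT.yaml \
    --experiment_config <EXP_CONFIG> \
    --wandb_log False    # illustrative command line override; <EXP_CONFIG> is a placeholder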

Note: After training, to use a model for evaluation or data generation, you must convert the tokenizer and model to the HuggingFace format by running:

  • python convert_HF.py --checkpoint_path <PATH> --model_type <TYPE>, where <PATH> is the path to the model checkpoint directory and <TYPE> is either gpt or ltg-bert.
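
For instance, to convert a trained GPT-Neo checkpoint (the checkpoint path below is illustrative; point it at the directory produced by your training run):

# the checkpoint path is illustrative
python convert_HF.py --checkpoint_path outputs/models/gpt_neo_50M/checkpoint --model_type gpt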

Note: The GPT-Neo models should be moved to this directory structure, with these specific names, in order for the generation and evaluation scripts to work correctly:

models
├── gpt_neo_5M
│  ├── checkpoint
│  │  └── ...
├── gpt_neo_10M
├── gpt_neo_25M
├── gpt_neo_50M
├── gpt_neo_75M
├── gpt_neo_100M
└── gpt_neo_500M

Data Generation

To generate data using GPT-Neo, both the training data and the corresponding trained model need to exist. Then the following command can be run:

  • python -m baby_lm.generator.sample --config_file <CONFIG_PATH>
  • Different config files can be used to change the generation strategy:
    <CONFIG_PATH> can be configs/sampling/dataset/generate_dataset_tiny{5,10,50,100}m_{greedy,nucleus1,nucleus5,nucleus10}.yaml
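
For example, to generate a dataset of greedy completions with the model trained on the 50M-word TinyStories split (config name follows the pattern above):

python -m baby_lm.generator.sample \
    --config_file configs/sampling/dataset/generate_dataset_tiny50m_greedy.yaml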

Evaluation

Evaluating GPT-Neo Generations — Self-BLEU

Self-BLEU evaluation: To evaluate the Self-BLEU score for the GPT-Neo models included in the paper's analysis (GPT-Neo-5m, GPT-Neo-10m, GPT-Neo-25m, GPT-Neo-50m, GPT-Neo-75m, GPT-Neo-100m, roneneldan/TinyStories-33M_HF), use the command:

  • python -m baby_lm.generator.calculate_self_bleu_per_model

Either locally trained models or the uploaded HF models can be used for the evaluation; see the script for more details.

Self-BLEU evaluation for $k$ generations per story (nucleus sampling): To reproduce the analysis presented in the paper, use the command:

  • python -m baby_lm.generator.calculate_self_bleu_nucleus_k --model <MODEL>, where <MODEL> can be either a local model, e.g., gpt_neo_50M, or an HF model, e.g., nikitastheo/GPT-Neo-50m_HF

Evaluating Linguistic Abilities — Challenge Benchmarks

To evaluate the models we build upon the official evaluation pipeline of the 2024 BabyLM Challenge. The steps to recreate our evaluation are listed below:

  1. Create a new virtual environment in your preferred way, e.g., conda create --name eval python=3.12 && conda activate eval
  2. Install the evaluation pipeline by running ./prepare_eval.sh
  3. cd into evaluation-pipeline-2024 and run ./install.sh
  4. Follow the instructions at https://github.com/babylm/evaluation-pipeline-2024 to download the evaluation data for BLiMP, BLiMP Supplement, EWoK and (Super)GLUE
  5. Run the evaluation with python evaluate_all.py --config_file models.yaml. You can choose which models are evaluated by editing models.yaml
  • The file ltg_bert_glue_config_finetune.yaml contains the hyperparameters used for the (Super)GLUE evaluation.
  • The script collect_results.py <MODEL_NAME> can be used to collect a model's results in one file (results for all benchmarks need to be present)
  • The script score_predictions.py can be used to output a summary of a model's performance.
  • Both scripts were originally provided by the organizers and slightly modified by us
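
Put together, a run consolidating the steps above might look like this sketch (models.yaml should be edited beforehand to list the models you want to evaluate; paths may need adjusting to your checkout):

# steps 1-3: environment and pipeline installation
conda create --name eval python=3.12 && conda activate eval
./prepare_eval.sh
cd evaluation-pipeline-2024 && ./install.sh
# step 4: download the BLiMP, BLiMP Supplement, EWoK and (Super)GLUE data,
#         following https://github.com/babylm/evaluation-pipeline-2024
# step 5: run the evaluation (from the directory containing evaluate_all.py)
python evaluate_all.py --config_file models.yaml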

LLM Evaluation

The outputs of the LLM evaluation using Claude Sonnet are stored in the corresponding HF dataset.

SLURM Scripts

The directory scripts_slurm contains a small but helpful utility for running experiments using SLURM. It was created to make experimentation on a SLURM cluster easier. It contains two template files, srun_script_template_multi_gpu.slurm and srun_script_template.slurm. Before running, please adapt these scripts to match your server configuration by filling in the ... fields, leaving variables like <VARIABLE> intact.

The two scripts contain variables that are substituted according to an experiment config, and a new SLURM script is created to run that specific experiment. We present an example below:

  • First run python ./scripts_slurm/run_experiment.py --exp_name 4_GPT-Neo_50m --base_script _base_GPT-Neo --conf_dir configs/train/train_gpt_neo/ --num_gpus 4

  • After this command, the file ./scripts_slurm/configs/train/train_gpt_neo/srun_script_4_GPT-Neo_50m_GPUS_4.yaml will be created, and can be run directly with the sbatch command on the cluster

Please cite the following publication

@misc{theodoropoulos2024berttimestoriesinvestigatingrole,
      title={BERTtime Stories: Investigating the Role of Synthetic Story Data in Language Pre-training}, 
      author={Nikitas Theodoropoulos and Giorgos Filandrianos and Vassilis Lyberatos and Maria Lymperaiou and Giorgos Stamou},
      year={2024},
      eprint={2410.15365},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.15365}, 
}

Correspondence

For inquiries or comments you can email me directly at [email protected] or open an issue.

Acknowledgements

During the development of this codebase we were aided by the following public code repositories. We thank the authors for their contributions to open-source research, and hope that the release of our implementation will also help future researchers and ML practitioners.
