(Corpus) Size Does Matter: The Effects of In-Domain Corpus Size on Language Model Performance

XCS224u Final Project: Pre-training BERT from scratch on different sizes of biomedical domain corpora.
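The README does not name a training framework, so the following is only a rough sketch of what "pretraining from scratch" involves, assuming the Hugging Face transformers and datasets libraries, a hypothetical corpus file pretraining_corpus.txt, and masked language modeling as the pretraining objective:

```python
# Minimal sketch: pretrain a randomly initialized BERT-style model on a domain corpus.
# Library choice and file names are assumptions for illustration, not part of this repo.
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# A stock tokenizer here; a domain-specific vocabulary could be swapped in instead.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Randomly initialized weights, i.e. training "from scratch" rather than fine-tuning.
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)

# Load and tokenize the raw-text pretraining corpus.
dataset = load_dataset("text", data_files={"train": "pretraining_corpus.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking of 15% of tokens, as in the original BERT MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-biomed-scratch", num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

Repeating this run with different slices of the biomedical corpus is the kind of setup the project's size-comparison experiments would require.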

Project Repo Framework

1. DataCollection

  • All artifacts related to collecting data (primarily through web scraping) for both pretraining and fine-tuning

2. Preprocessing

  • All artifacts related to preprocessing the data, from raw files up to the point of tokenization (see the vocabulary-training sketch after this list)

3. Modeling

  • All artifacts related to building and tweaking our BERT language models
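Since the preprocessing stage ends at tokenization and the first milestone below is finalizing a tokenization/vocabulary scheme, one plausible step is training a domain-specific WordPiece vocabulary on the pretraining corpus. A minimal sketch, assuming the Hugging Face tokenizers library and hypothetical file names:

```python
# Sketch: build a biomedical WordPiece vocabulary before pretraining.
# The tokenizers library and all file/directory names are assumptions for illustration.
import os

from tokenizers import BertWordPieceTokenizer

os.makedirs("vocab_out", exist_ok=True)

tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["pretraining_corpus.txt"],   # raw text produced by the Preprocessing stage
    vocab_size=30522,                   # matches bert-base vocabulary size
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Writes vocab_out/vocab.txt, which BertTokenizerFast can load for pretraining.
tokenizer.save_model("vocab_out")
```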

Expected Timeframe

  • March 11: Tokenization/Vocabulary scheme finalized + pretraining data secured
  • March 20: Model pretraining completed (for at least one model approach)
  • March 30: All experiments completed
  • April 3: Final paper due
  • April 8: Code completed