(Corpus) Size Does Matter: The Effects of In-Domain Corpus Size on Language Model Performance

XCS224u Final Project: Pre-training BERT from scratch on different sizes of biomedical domain corpora.
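The README does not name a training framework, so the following is only a rough sketch of what "pretraining from scratch" involves, assuming the Hugging Face transformers and datasets libraries, a hypothetical corpus file pretraining_corpus.txt, and masked language modeling as the pretraining objective:

```python
# Minimal sketch: pretrain a randomly initialized BERT-style model on a domain corpus.
# Library choice and file names are assumptions for illustration, not part of this repo.
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# A stock tokenizer here; a domain-specific vocabulary could be swapped in instead.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Randomly initialized weights, i.e. training "from scratch" rather than fine-tuning.
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)

# Load and tokenize the raw-text pretraining corpus.
dataset = load_dataset("text", data_files={"train": "pretraining_corpus.txt"})
tokenized = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# Dynamic masking of 15% of tokens, as in the original BERT MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-biomed-scratch", num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

Repeating this run with different slices of the biomedical corpus is the kind of setup the project's size-comparison experiments would require.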

Project Repo Framework

1. DataCollection

  • All artifacts related to collecting data (primarily through web scraping) for both pretraining and fine-tuning

2. Preprocessing

  • All artifacts related to preprocessing the data, from raw files up to the point of tokenization (see the vocabulary-training sketch after this list)

3. Modeling

  • All artifacts related to building and tweaking our BERT language models
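Since the preprocessing stage ends at tokenization and the first milestone below is finalizing a tokenization/vocabulary scheme, one plausible step is training a domain-specific WordPiece vocabulary on the pretraining corpus. A minimal sketch, assuming the Hugging Face tokenizers library and hypothetical file names:

```python
# Sketch: build a biomedical WordPiece vocabulary before pretraining.
# The tokenizers library and all file/directory names are assumptions for illustration.
import os

from tokenizers import BertWordPieceTokenizer

os.makedirs("vocab_out", exist_ok=True)

tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["pretraining_corpus.txt"],   # raw text produced by the Preprocessing stage
    vocab_size=30522,                   # matches bert-base vocabulary size
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Writes vocab_out/vocab.txt, which BertTokenizerFast can load for pretraining.
tokenizer.save_model("vocab_out")
```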

Expected Timeframe

  • March 11: Tokenization/Vocabulary scheme finalized + pretraining data secured
  • March 20: Model pretraining completed (for at least one model approach)
  • March 30: All experiments completed
  • April 3: Final paper due
  • April 8: Code completed