XSC224u Final Project: Pre-training BERT from scratch on different sizes of biomedical domain corpora.
- All artifacts related to the collection (primarily through web scraping) of data for both pretraining and finetuning
- All artifacts related to the preprocessing of data from raw files up to the point of tokenization.
- All artifacts related to building/tweaking our BERT language models
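Since the preprocessing artifacts run "up to the point of tokenization," the sketch below illustrates the kind of tokenization BERT uses: WordPiece with greedy longest-match-first segmentation. This is a minimal, self-contained illustration (the toy `vocab` and function name are our own, not from the actual pipeline code):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation, WordPiece-style.

    Continuation pieces are prefixed with '##', as in BERT vocabularies.
    Returns [unk] if the word cannot be fully segmented.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate substring until it appears in the vocab.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

# Toy biomedical-flavored vocabulary for illustration only.
vocab = {"bio", "##med", "##ical", "corpus"}
print(wordpiece_tokenize("biomedical", vocab))  # ['bio', '##med', '##ical']
```

In practice a library such as HuggingFace `tokenizers` would train the vocabulary from the corpus; this sketch only shows how segmentation behaves once a vocabulary exists.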
- March 11: Tokenization/Vocabulary scheme finalized + pretraining data secured
- March 20: Model pretraining completed (for at least one model approach)
- March 30: All experiments completed
- April 3: Final paper due
- April 8: Code completed
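The pretraining milestone above centers on masked language modeling, BERT's core pretraining objective: roughly 15% of token positions are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. A minimal sketch of that masking step (function and variable names are illustrative, not our actual training code):

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mlm_prob=0.15, seed=0):
    """BERT-style MLM masking over a token sequence.

    Selects ~mlm_prob of positions; of those, 80% -> [MASK],
    10% -> a random vocab token, 10% left unchanged.
    Returns (inputs, labels); labels is None at unselected positions.
    """
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mlm_prob:
            labels[i] = tok  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_token
            elif r < 0.9:
                inputs[i] = rng.choice(vocab)
            # else: keep the original token (10% case)
    return inputs, labels

inputs, labels = mask_tokens(["the", "gene", "encodes", "a", "protein"],
                             vocab=["the", "gene", "protein"])
```

The loss during pretraining is computed only at positions where `labels` is not `None`, which is why both the corrupted inputs and the original targets are returned together.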