XSC224u Final Project: Pre-training BERT from scratch on different sizes of biomedical domain corpora.
- All artifacts related to the collection (primarily through web scraping) of data for both pretraining and finetuning
- All artifacts related to the preprocessing of data from raw files up to the point of tokenization.
- All artifacts related to building/tweaking our BERT language models
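Since the preprocessing artifacts run "up to the point of tokenization," the sketch below illustrates the kind of tokenization BERT uses: WordPiece with greedy longest-match-first segmentation. This is a minimal, self-contained illustration (the toy `vocab` and function name are our own, not from the actual pipeline code):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation, WordPiece-style.

    Continuation pieces are prefixed with '##', as in BERT vocabularies.
    Returns [unk] if the word cannot be fully segmented.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate substring until it appears in the vocab.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

# Toy biomedical-flavored vocabulary for illustration only.
vocab = {"bio", "##med", "##ical", "corpus"}
print(wordpiece_tokenize("biomedical", vocab))  # ['bio', '##med', '##ical']
```

In practice a library such as HuggingFace `tokenizers` would train the vocabulary from the corpus; this sketch only shows how segmentation behaves once a vocabulary exists.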
- March 11: Tokenization/Vocabulary scheme finalized + pretraining data secured
- March 20: Model pretraining completed (for at least one model approach)
- March 30: All experiments completed
- April 3: Final paper due
- April 8: Code completed
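The pretraining milestone above centers on masked language modeling, BERT's core pretraining objective: roughly 15% of token positions are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. A minimal sketch of that masking step (function and variable names are illustrative, not our actual training code):

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mlm_prob=0.15, seed=0):
    """BERT-style MLM masking over a token sequence.

    Selects ~mlm_prob of positions; of those, 80% -> [MASK],
    10% -> a random vocab token, 10% left unchanged.
    Returns (inputs, labels); labels is None at unselected positions.
    """
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mlm_prob:
            labels[i] = tok  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_token
            elif r < 0.9:
                inputs[i] = rng.choice(vocab)
            # else: keep the original token (10% case)
    return inputs, labels

inputs, labels = mask_tokens(["the", "gene", "encodes", "a", "protein"],
                             vocab=["the", "gene", "protein"])
```

The loss during pretraining is computed only at positions where `labels` is not `None`, which is why both the corrupted inputs and the original targets are returned together.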