Added support for masked language modeling (bidirectional models) #211
This PR builds on the Huggingface subject, which assumes that models are autoregressive (following the `ModelForCausalLM` interface), and adds support for bidirectional models with masked language modeling (following the `ModelForMaskedLM` interface).
Since bidirectional models rely on future context, I use a sliding window approach (see google-research/bert#66). In particular, for each text part, up to `w/2` tokens are included for the current part plus previous context, and the remaining `w/2` tokens in the window are masked.
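As a rough illustration of the windowing logic (not the PR's actual implementation: `build_window`, the default window size, and the plain token-id lists are assumptions made here for clarity):

```python
# Illustrative sketch: keep up to w/2 real tokens (previous context + current part)
# on the left of the window, and fill the remaining positions with [MASK] tokens,
# so the bidirectional model receives a fixed-size window without real future tokens.
from transformers import AutoTokenizer

def build_window(context_ids, current_ids, tokenizer, w=512):
    half = w // 2
    real = (context_ids + current_ids)[-half:]             # up to w/2 real tokens
    masked = [tokenizer.mask_token_id] * (w - len(real))   # rest of the window is masked
    return real + masked

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
context = tokenizer.encode('The quick brown fox', add_special_tokens=False)
current = tokenizer.encode('jumps over the lazy dog', add_special_tokens=False)
print(tokenizer.decode(build_window(context, current, tokenizer, w=16)))
```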
The `region_layer_mapping` for the language system was determined by scoring every transformer layer in BERT's encoder against the `Pereira2018.243sentences-linear`, `Pereira2018.384sentences-linear`, and `Blank2014-linear` benchmarks, and choosing the layer with the highest average score.
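The selection procedure amounts to something like the sketch below; `load_benchmark`, the `HuggingfaceSubject` constructor, and the layer names are assumptions based on the rest of the repo, not the exact script that was used.

```python
# Hedged sketch of the layer sweep: score every BERT encoder layer on the three
# benchmarks and keep the layer with the highest average score.
from brainscore_language import load_benchmark
from brainscore_language.artificial_subject import ArtificialSubject
from brainscore_language.model_helpers.huggingface import HuggingfaceSubject

benchmark_ids = ['Pereira2018.243sentences-linear',
                 'Pereira2018.384sentences-linear',
                 'Blank2014-linear']

layer_scores = {}
for i in range(12):  # bert-base-uncased has 12 encoder layers
    layer = f'bert.encoder.layer.{i}'
    subject = HuggingfaceSubject(
        model_id='bert-base-uncased',
        region_layer_mapping={ArtificialSubject.RecordingTarget.language_system: layer})
    # each benchmark call is assumed to return a scalar-like aggregate score
    scores = [float(load_benchmark(b)(subject)) for b in benchmark_ids]
    layer_scores[layer] = sum(scores) / len(scores)

best_layer = max(layer_scores, key=layer_scores.get)
print(best_layer, layer_scores[best_layer])
```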
This PR also provides unit tests for reading time estimation, next word prediction, and neural recording, using the `bert-base-uncased` model. Future models can use the same format, as long as they implement the `ModelForMaskedLM` interface. For example, to add the base DistilBERT model:
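(What follows is only a sketch: the registry entry, the `HuggingfaceSubject` arguments, and the chosen DistilBERT layer are assumptions for illustration, not a tested configuration.)

```python
# Hypothetical registration of a masked-LM model; the layer name below is a
# placeholder and would need to be chosen by the same benchmark sweep described above.
from brainscore_language import model_registry
from brainscore_language.artificial_subject import ArtificialSubject
from brainscore_language.model_helpers.huggingface import HuggingfaceSubject

model_registry['distilbert-base-uncased'] = lambda: HuggingfaceSubject(
    model_id='distilbert-base-uncased',
    region_layer_mapping={
        ArtificialSubject.RecordingTarget.language_system: 'distilbert.transformer.layer.5',
    },
)
```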