Repo for the paper (Un)solving Morphological Inflection: Lemma Overlap Artificially Inflates Models' Performance.
`LemmaSplitData` contains a lemma split version of the data from SIGMORPHON's 2020 shared task 0.

`lstm` contains the baseline LSTM model.

`generate_lemma_splits.py` is the script used to produce the data in practice.

- Clone the SIGMORPHON 2020 task 0 data (the 3 folders `DEVELOPMENT_LANGUAGES`, `SURPRISE_LANGUAGES` and `GOLD-TEST`) to the folder `DataExperiments/FormSplit`.
- Run the script `generate_lemma_splits.py`. It will generate a folder called `DataExperiments/LemmaSplit` at the same level, with the same sub-division into families (without the covered test files), where the samples are split across lemmas instead of randomly.
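
To illustrate what "split across lemmas" means, here is a minimal sketch. It is not the code of `generate_lemma_splits.py`. The function names `lemma_split` and `lemma_overlap`, the 70/10/20 ratios and the fixed seed are assumptions made for this example. Rows are assumed to be `(lemma, form, tags)` triples, as in the tab-separated task 0 files. All forms of a lemma are kept together, so no lemma appears in more than one partition.

```python
import random
from collections import defaultdict


def lemma_split(rows, ratios=(0.7, 0.1, 0.2), seed=0):
    """Partition (lemma, form, tags) rows so that each lemma lands in exactly one split.

    The ratios are applied to the number of lemmas, not the number of rows
    (hypothetical defaults, chosen only for illustration).
    """
    by_lemma = defaultdict(list)
    for row in rows:
        by_lemma[row[0]].append(row)

    # Sort before shuffling so the result depends only on the seed.
    lemmas = sorted(by_lemma)
    random.Random(seed).shuffle(lemmas)

    n_train = int(len(lemmas) * ratios[0])
    n_dev = int(len(lemmas) * ratios[1])
    parts = {
        "train": lemmas[:n_train],
        "dev": lemmas[n_train:n_train + n_dev],
        "test": lemmas[n_train + n_dev:],  # remainder goes to test
    }
    return {
        name: [row for lemma in chosen for row in by_lemma[lemma]]
        for name, chosen in parts.items()
    }


def lemma_overlap(split_a, split_b):
    """Return the set of lemmas that occur in both splits."""
    return {row[0] for row in split_a} & {row[0] for row in split_b}


if __name__ == "__main__":
    # Toy data: 10 made-up lemmas with 2 forms each.
    rows = [
        (f"lemma{i}", f"lemma{i}+{suffix}", tags)
        for i in range(10)
        for suffix, tags in (("ed", "V;PST"), ("s", "V;3;SG;PRS"))
    ]
    splits = lemma_split(rows)
    for name, part in splits.items():
        print(name, len(part))
    print("train/test lemma overlap:", lemma_overlap(splits["train"], splits["test"]))
```

With a random (form-level) split, `lemma_overlap` between train and test would typically be non-empty, which is the overlap the paper argues inflates performance.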