Checked on Ubuntu 18.04, 64-bit.
- Create a virtualenv: `virtualenv -p python3.6 parsing-as-pretraining`
- Activate the virtualenv: `source parsing-as-pretraining/bin/activate`
- Install the required dependencies: `pip install -r requirements.txt`
To learn how to transform a constituent or a dependency tree into a sequence of labels, please read the README.md of the tree2labels and dep2labels repositories.
In what follows, we assume the linearized datasets are stored in `PTB-linearized/` and `EN_EWT-linearized/`. Each folder contains three files: `train.tsv`, `dev.tsv`, and `test.tsv`.
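As a purely illustrative sketch (the actual columns and label inventory are defined by the tree2labels and dep2labels encodings, so the tokens and placeholder labels below are hypothetical), each `.tsv` file pairs one token per line with its output label, with sentences separated by blank lines:

```
The	<label>
cat	<label>
sleeps	<label>

A	<label>
dog	<label>
barks	<label>
```

See the tree2labels and dep2labels READMEs for the exact format each encoding produces.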
Execute:

```
cd NCRFpp
python main.py --config $PATH_CONFIG_FILE
```
The folders `NCRFpp/const_confs/` and `NCRFpp/dep_confs/` contain some example configuration files.
Parameters used to train different types of models:

- `contextualize`: [True|False] Whether to further contextualize word vectors through the NCRFpp BiLSTMs.
- `use_elmo`: [True|False] Run ELMo to compute the word vectors, instead of using precomputed or random representations.
- `fine_tune_emb`: [True|False] Whether to fine-tune the pretrained encoder during training.
- `use_char`: [True|False] Whether to use the character LSTMs supported by NCRFpp (always `False` in our work).
- `use_features`: [True|False] Whether to use features other than words that are present in the linearized dataset (always `False` in our work).
- `word_emb_dim`: Size of the word embeddings; used when training random representations.
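For instance, a minimal sketch of how these options might appear in an NCRFpp-style `key=value` configuration file (the paths and values below are placeholders, not taken from the repository; see `NCRFpp/const_confs/` and `NCRFpp/dep_confs/` for real examples):

```
train_dir=../PTB-linearized/train.tsv
dev_dir=../PTB-linearized/dev.tsv
test_dir=../PTB-linearized/test.tsv
contextualize=True
use_elmo=False
fine_tune_emb=False
use_char=False
use_features=False
word_emb_dim=300
```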
Specific parameters to train constituent models:

```
###PathsToAdditionalScripts###
tree2labels=../tree2labels
evaluate=../tree2labels/evaluate.py
evalb=../tree2labels/EVALB/evalb
gold_dev_trees=../data/datasets/PTB/dev.trees
optimize_with_evalb=True
```
Specific parameters to train dependency models:

```
###PathsToAdditionalScripts###
dep2labels=../dep2labels
gold_dev_trees=../data/datasets/en-ewt/en_ewt-ud-dev.conllu
optimize_with_las=True
conll_ud=../dep2labels/conll17_ud_eval.py
```
Adapt the paths accordingly and run `./train_bert_model.sh`. The script assumes that the dataset is inside a folder and split into three files named `train.tsv`, `dev.tsv`, and `test.tsv`.
General parameter description:

- `--bert_model`: The base model used during training, e.g. `bert-base-cased`.
- `--task_name`: Specifies the format of the input files (always `sl_tsv`).
- `--model_dir`: Path where the model is saved.
- `--max_seq_length`: Expected maximum sequence length.
- `--output_dir`: Path where the outputs generated by the model are stored.
- `--do_train`: Activate to train the model.
- `--do_eval`: Activate to evaluate the model on the dev set.
- `--do_test`: Activate to run the model on the test set.
- `--do_lower_case`: Lowercase the input when using an uncased model (e.g. `bert-base-uncased`).
Additional options:

- `--parsing_paradigm`: [dependencies|constituency]
- `--not_finetune`: Keeps the BERT weights frozen during training.
- `--use_bilstms`: Flag to indicate whether to use BiLSTMs before the output layer.
Additional specific options for dependency parsers:

- `--path_gold_conll`: Path to the gold CoNLL file used for evaluation.
Additional specific options for constituent parsers:

- `--evalb_param`: [True|False] Whether to use the COLLINS.prm parameter file to compute the bracketing F1 score.
- `--path_gold_parenthesized`: Path to the gold parenthesized trees used for evaluation.
Example:

```
python run_token_classifier.py \
--data_dir ./data/datasets/PTB-linearized/ \
--bert_model bert-base-cased \
--task_name sl_tsv \
--model_dir /tmp/bert.finetune.linear.model \
--output_dir /tmp/dev.bert.finetune.linear.output \
--path_gold_parenthesized ../data/datasets/PTB/dev.trees \
--parsing_paradigm constituency --do_train --do_eval --num_train_epochs 15 --max_seq_length 250 [--use_bilstms] [--not_finetune]
```
Adapt the paths and run the scripts `./run_const_ncrfpp.sh` (constituents) and `./run_dep_ncrfpp.sh` (dependencies).
Adapt the paths and model names accordingly and execute `./run_token_classifier.sh`.
Example for constituency parsing:

```
python run_token_classifier.py \
--data_dir ./data/datasets/PTB-linearized/ \
--bert_model bert-base-cased \
--task_name sl_tsv \
--model_dir ./data/bert_models_const/bert.const.finetune.linear \
--output_dir ./data/outputs_const/test.bert.finetune.linear.output \
--evalb_param True \
--max_seq_length 250 \
--path_gold_parenthesized ./data/datasets/PTB/test.trees \
--parsing_paradigm constituency --do_test [--use_bilstms]
```
Example for dependency parsing:

```
python run_token_classifier.py \
--data_dir ./data/datasets/EN_EWT-pred-linearized \
--bert_model bert-base-cased \
--task_name sl_tsv \
--model_dir ./data/bert_models_dep/bert.dep.finetune.linear \
--output_dir ./data/outputs_dep/test.bert.finetune.linear.output \
--path_gold_conll ./data/datasets/en-ewt/en_ewt-ud-test.conllu \
--max_seq_length 350 \
--parsing_paradigm dependencies --do_test [--use_bilstms]
```
Note: Remember to also use the option `--do_lower_case` if you trained an uncased model.
Use `python evaluate_spans.py [--predicted] [--gold]` to show some charts for the constituent experiments:

- `--predicted`: Path to the directory containing the files (each of them in PTB parenthesized format) for which to plot the charts.
- `--gold`: Path to the file containing the gold trees in PTB (parenthesized) format.
Use `python evaluate_dependencies.py [--predicted] [--gold]` to show some charts for the dependency experiments:

- `--predicted`: Path to the directory containing the files (each of them a predicted CoNLL-U file) for which to plot the charts.
- `--gold`: Path to the corresponding gold CoNLL-U file.
Vilares, D., Strzyz, M., Søgaard, A., and Gómez-Rodríguez, C. Parsing as Pretraining. In AAAI 2020.