This tutorial shows how to train a baseline model for the Keyphrase Extraction track in the ICASSP2023 MUG Challenge.
We model KPE as a sequence-labeling problem and apply the Bert-CRF model implemented in AdaSeq to solve it.
AdaSeq is an easy-to-use library, built on ModelScope, which provides plenty of cutting-edge models, training methods and useful toolkits for sequence understanding tasks.
Python version >= 3.7
git clone https://github.com/modelscope/adaseq.git
pip install -r requirements.txt -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
-
Download and put
train.json
dev.json
except_TS_test1_without_label.csv
to${root_dir}/examples/ICASSP2023_MUG_track4/dataset/
.${root_dir}
is the absolute path of the AdaSeq repository. -
Preprocess the downloaded data by splitting it into splits of 128 characters (or longer) and reformat data into CoNLL format.
cd examples/ICASSP2023_MUG_track4
python preprocess.py
-
Modify the config file Change the
${root_dir}
inconfigs/bert_crf_sbert.yaml
to your own ABSOLUTE path of this repo. -
Start training Let’s train a Bert-CRF model using the preset script as an example.
cd ${root_dir}
python scripts/train.py -c examples/ICASSP2023_MUG_track4/configs/bert_crf_sbert.yaml
Also, there are many methods other than Bert-CRF implemented in AdaSeq, such as RaNER, Global-Pointer, etc.
What's more, a grid search tool is provided for efficient tuning.
First install some requirements in the evaluation script.
pip install jieba rouge yake
Use evaluate_kw.py
to evaluate the predictions. For KPE, the union of the labels from the three annotators is used as the final manual labels for training and evaluation.
There are 4 necessary parameters:
<data_path>
: the source test data. e.g. dev.json.<pred_path>
: the prediction file output by your model.<doc_split_path
>: the file recording where documents are split, which is output by preprocessing.<out_path>
: the file path in which the evaluation log will be.
For example:
cd examples/ICASSP2023_MUG_track4
python evaluate_kw.py dataset/dev.json ${root_dir}/experiments/kpe_sbert/outputs/${datetime}/pred.txt dataset/split_list_dev.json evaluation.log
First use your trained model to make predictions on the test file.
cd ${root_dir}
python scripts/test.py -c examples/ICASSP2023_MUG_track4/configs/bert_crf_sbert.yaml -cp ${model_path}
The output file is pred.txt in your current path.
Then use get_keywords.py
to make predictions on test file. There are 4 necessary parameters as same as those in Step 4
.
For example:
cd examples/ICASSP2023_MUG_track4
python get_keywords.py dataset/test.json ../../pred.txt dataset/split_list_test.json ${out_path}
Now you can submit ${out_path}
to the competition!
Here you go:
sh ./examples/ICASSP2023_MUG_track4/end2end.sh ${your_sdk_token}
The ${your_sdk_token}
should be replaced with your ModelScope sdk token.
After all finished, a submit.json
file will be generated, which can be used to submit for the competition.
Split | Model | Backbone | Exact/Partial F1 @10 | Exact/Partial F1 @15 | Exact/Partial F1 @20 | Checkpoint |
---|---|---|---|---|---|---|
Dev | yake | - | 15.0/24.3 | 19.8/30.4 | 20.4/32.1 | - |
Dev | Bert-CRF | sijunhe/nezha-cn-base | 35.6/43.2 | 38.1/49.5 | 37.2/48.1 | - |
Dev | Bert-CRF | damo/nlp_structbert_backbone_base_std | 35.9/47.7 | 40.1/52.2 | 39.4/51.1 | ModelScope |