Sentence splitting is the task of dividing a complex sentence into two simple sentences. For example, the complex sentence

Mary likes to play football in her free time whenever she meets with her friends that are very nice people.

can be split into

Mary likes to play football in her free time whenever she meets with her friends.

and

Her friends are very nice people.
To use the best sentence-splitting model available so far:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("flax-community/t5-base-wikisplit")
model = AutoModelForSeq2SeqLM.from_pretrained("flax-community/t5-base-wikisplit")

complex_sentence = "This comedy drama is produced by Tidy , the company she co-founded in 2008 with her husband David Peet , who is managing director ."
sample_tokenized = tokenizer(complex_sentence, return_tensors="pt")
answer = model.generate(sample_tokenized["input_ids"],
                        attention_mask=sample_tokenized["attention_mask"],
                        max_length=256,
                        num_beams=5)
gene_sentence = tokenizer.decode(answer[0], skip_special_tokens=True)
print(gene_sentence)
# Output:
# This comedy drama is produced by Tidy. She co-founded Tidy in 2008 with her husband David Peet, who is managing director.
```
Use cases:
- Sentence Simplification
- Data Augmentation
- Sentence Rephrasing
Current baseline results from the paper:
| Model | Exact | SARI | BLEU |
|---|---|---|---|
| t5-base-wikisplit | 17.93 | 67.5438 | 76.9 |
| t5-v1_1-base-wikisplit | 18.1207 | 67.4873 | 76.9478 |
| byt5-base-wikisplit | 11.3582 | 67.2685 | 73.1682 |
| t5-large-wikisplit | 18.6632 | 68.0501 | 77.1881 |
- All of our models achieve better results than the baseline models on two metrics (Exact and SARI scores).
- Our t5-base-wikisplit and t5-v1_1-base-wikisplit models achieve comparable results at half the model size (weights), which enables faster inference.
- We added the WikiSplit metric, freely available in Hugging Face Datasets, so the relevant scores for this task are now easy to compute.
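For reference, the Exact score reported in the table is simply the percentage of predictions that match the reference split exactly. A minimal sketch of that idea (the `exact_match` helper below is illustrative, not the library implementation):

```python
def exact_match(predictions: list[str], references: list[str]) -> float:
    # Percentage of predictions that are string-identical to their
    # reference after trimming surrounding whitespace.
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)

score = exact_match(
    ["Mary likes to play football. Her friends are very nice people."],
    ["Mary likes to play football. Her friends are very nice people."],
)
# score -> 100.0
```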
- t5-base training on Wiki Split
- t5-v1_1-base training on Wiki Split
- byt5-base training on Wiki Split
- t5-large training on Wiki Split
- Streamlit UI for App
- WikiSplit evaluation metric added to Hugging Face Datasets
- Challenge: Get better performance than roberta2roberta_L-24_wikisplit
- Performance improvement through further research
- Tackle gender bias and fairness in text generation
- Benchmarking and Experimenting with Web Split