A Novel Dataset for Abstractive Text Summarization in Hebrew

OnlpLab/HeSum

HeSum: A Novel Dataset for Abstractive Text Summarization in Hebrew

Paper

The paper can be found here.

Data

The data consists of (sub-heading, article) pairs scraped from Shkuf, ha-makom, and the7eye.

The data can be found here. You can also use the CSV files in the data folder.

The data can also be downloaded directly from Hugging Face using the datasets library in Python.

pip3 install datasets

from datasets import load_dataset

hesum = load_dataset('biunlp/HeSum')

The dataset has three splits: train (8,000 examples), validation (1,000 examples), and test (1,000 examples).

Each sample contains the following:

  • summary: Sub-heading of an article
  • article: The article.
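As a quick sanity check of the split sizes and field names described above, the following is a minimal sketch rather than part of the repository. The helper name `check_hesum` is hypothetical. It accepts any mapping from split name to a sequence of records, so it should work with the object returned by `load_dataset('biunlp/HeSum')` as well as with plain dictionaries of lists.

```python
from typing import Mapping, Sequence

# Split sizes as stated in this README.
EXPECTED_SIZES = {"train": 8000, "validation": 1000, "test": 1000}
# Field names as listed in this README.
EXPECTED_FIELDS = ("summary", "article")


def check_hesum(splits: Mapping[str, Sequence[Mapping[str, str]]]) -> list[str]:
    """Return a list of problems found; an empty list means all checks passed.

    `check_hesum` is a hypothetical helper, not part of the HeSum repository.
    """
    problems = []
    for name, expected in EXPECTED_SIZES.items():
        if name not in splits:
            problems.append(f"missing split: {name}")
            continue
        split = splits[name]
        if len(split) != expected:
            problems.append(f"{name}: expected {expected} rows, found {len(split)}")
        if len(split) > 0:
            # Inspect only the first record; all records are assumed to share fields.
            first = split[0]
            for field in EXPECTED_FIELDS:
                if field not in first:
                    problems.append(f"{name}: missing field {field}")
    return problems
```

With the dataset loaded as `hesum`, calling `check_hesum(hesum)` should return an empty list if the download matches the description above.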

Models

There are two models fine-tuned on the HeSum dataset.

  1. mT5LongHeSum-base (2.3 GB): Download.
  2. mT5LongHeSum-large (4.6 GB): Download.

To run a model:

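from datasets import load_dataset
from transformers import pipeline

# Load the test split; any of its articles can be used as input below.
dataset = load_dataset("biunlp/HeSum")['test']
hub_model_id = "biunlp/mT5LongHeSum-large"
summarizer = pipeline("summarization", model=hub_model_id)

article = "<Enter your text here>"  # or, e.g., dataset[0]["article"]
# The pipeline returns a list with one dict per input; the text is under "summary_text".
result = summarizer(article, max_length=250)
print(result[0]["summary_text"])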
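To compare generated summaries with the reference sub-headings, a small loop over the test split can help. The sketch below is an illustration, not the authors' evaluation code. The helper name `summarize_examples` and its `limit` parameter are hypothetical. It assumes the summarizer is called with one article at a time and returns a list whose first element holds the text under `"summary_text"`, which is how the transformers summarization pipeline reports its output.

```python
from typing import Callable, Iterable, Mapping


def summarize_examples(
    summarizer: Callable[..., list],
    examples: Iterable[Mapping[str, str]],
    max_length: int = 250,
    limit: int = 3,
) -> list[dict]:
    """Summarize up to `limit` examples and pair each output with its reference.

    `summarize_examples` is a hypothetical helper, not part of the HeSum repository.
    """
    results = []
    for i, example in enumerate(examples):
        if i >= limit:
            break
        # One article per call: the result is a list with a single dict,
        # and the generated text is stored under the "summary_text" key.
        output = summarizer(example["article"], max_length=max_length)
        results.append(
            {"generated": output[0]["summary_text"], "reference": example["summary"]}
        )
    return results
```

With the objects from the snippet above, `summarize_examples(summarizer, dataset)` would return three generated/reference pairs that can be inspected side by side. Keeping `limit` small is a deliberate choice here, since generation with the large model can be slow on CPU.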
