- Run the following commands in the terminal:
(a) Abstractive Summarization:
pip install transformers torch sentencepiece
(b) Extractive Summarization:
pip install spacy pytextrank
python -m spacy download en_core_web_lg
- Create a .txt file with the content you want summarized.
- Copy the path of the file and paste it into the code.
- Run the code. The output for the text.txt file in the repository looks like:
(a) Abstractive Summarization:
(b) Extractive Summarization:
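Before running either script, a quick check along the following lines can confirm that the packages from the first step are importable and that the input file exists. This is a minimal sketch, not part of the repository: the helper names `missing_packages` and `check_input` are hypothetical, and the package lists simply mirror the install commands above.

```python
from importlib.util import find_spec
from pathlib import Path

# Import names of the packages installed in the first step.
# The spaCy model downloaded with `python -m spacy download` is installed
# as a regular package, so it can be checked the same way.
ABSTRACTIVE_DEPS = ["transformers", "torch", "sentencepiece"]
EXTRACTIVE_DEPS = ["spacy", "pytextrank", "en_core_web_lg"]


def missing_packages(names):
    """Return the entries of `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


def check_input(path):
    """Return True if `path` points to an existing, non-empty file."""
    p = Path(path)
    return p.is_file() and p.stat().st_size > 0
```

For example, `missing_packages(ABSTRACTIVE_DEPS + EXTRACTIVE_DEPS)` returns an empty list when everything is installed, and `check_input("text.txt")` reports whether the sample file can be found from the current directory.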
The repository includes two separate Python scripts for text summarization: one performs abstractive summarization with a pre-trained PEGASUS model, and the other performs extractive summarization with the TextRank algorithm via the spaCy library. Below is an overview of each script, followed by a comparison of which one suits which use case.
- Abstractive Summarization (abstractive.py)
- Libraries Used:
- PEGASUS: A pre-trained Transformer-based model from Google, designed specifically for abstractive summarization.
- Transformers: Hugging Face library to easily load pre-trained models.
- Process:
- The script loads the pre-trained PEGASUS-XSUM model, a PEGASUS checkpoint fine-tuned on the XSum news dataset to generate short summaries of documents.
- It reads the input text from a file and measures its original length.
- The text is tokenized using the PEGASUS tokenizer, ensuring that the input is formatted properly for the model.
- The model generates a summary with controlled parameters such as beam search (num_beams=5), maximum and minimum summary lengths, and a length penalty to prevent excessively short or long summaries.
- The summary is then decoded from token IDs into human-readable text.
- The script outputs the summary and the length of the generated summary in characters.
- Advantages:
- Abstractive Summarization means the model generates a new summary in its own words, which tends to provide more coherent and fluent results.
- It can handle complex documents and condense the content into a more meaningful summary rather than just extracting sentences.
- Limitations:
- The model depends on the data it was fine-tuned on (news summarization in the case of XSUM), so it might not perform optimally on all domains.
- It requires more computational resources due to the size and complexity of the PEGASUS model.
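The process described above for abstractive.py can be sketched roughly as follows. This is an illustration under stated assumptions rather than the repository's actual script: the checkpoint name `google/pegasus-xsum`, the `max_length`, `min_length` and `length_penalty` values, and the function names are all assumed; only `num_beams=5` comes from the description.

```python
from pathlib import Path


def read_text(path):
    """Read the input file and collapse runs of whitespace into single spaces."""
    return " ".join(Path(path).read_text(encoding="utf-8").split())


def summarize_abstractive(text, model_name="google/pegasus-xsum"):
    """Generate a summary of `text` with a PEGASUS checkpoint (name assumed)."""
    # Imported lazily so that read_text() works without the heavy dependencies.
    from transformers import PegasusForConditionalGeneration, PegasusTokenizer

    tokenizer = PegasusTokenizer.from_pretrained(model_name)
    model = PegasusForConditionalGeneration.from_pretrained(model_name)

    # truncation=True cuts the input to the tokenizer's maximum length.
    inputs = tokenizer(text, truncation=True, padding="longest", return_tensors="pt")
    summary_ids = model.generate(
        **inputs,
        num_beams=5,         # beam search width, as described above
        max_length=64,       # assumed value
        min_length=10,       # assumed value
        length_penalty=2.0,  # assumed value
        early_stopping=True,
    )
    # Turn the generated token IDs back into readable text.
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)


def main(path):
    text = read_text(path)
    print(f"Original length: {len(text)} characters")
    summary = summarize_abstractive(text)
    print(summary)
    print(f"Summary length: {len(summary)} characters")
```

Calling `main("text.txt")` downloads the checkpoint on first use, which is where most of the computational cost mentioned above comes from.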
- Extractive Summarization (extractive.py)
- Libraries Used:
- spaCy: An NLP library for processing text and extracting information.
- PyTextRank: A library that implements the TextRank algorithm for extractive summarization.
- Process:
- The script uses spaCy's large English model (en_core_web_lg), which is pre-trained on a large corpus of English text.
- The TextRank algorithm is added as a pipeline component to spaCy. TextRank is a graph-based algorithm that ranks sentences based on their relevance to the overall document.
- The input text is read from the file, and its original length is measured.
- spaCy processes the text, and TextRank identifies the most important sentences, which are then used to generate the summary.
- The summary is formed by extracting the top-ranked sentence(s) and is printed along with its length.
- Advantages:
- Extractive Summarization is faster and simpler than abstractive summarization, as it just selects important sentences from the original text without generating new content.
- It works well with a wide range of text types, as it does not require domain-specific fine-tuning.
- It's less computationally intensive since it uses a simpler algorithm compared to PEGASUS.
- Limitations:
- The summary might not be as coherent or fluent as in abstractive summarization, as it directly extracts sentences from the original text.
- The extracted sentences may not flow smoothly when put together in a summary.
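The process described above for extractive.py can be sketched roughly as follows. This is a minimal illustration, not the repository's actual script: the function names and the `limit_sentences` and `limit_phrases` defaults are assumptions, while `en_core_web_lg` and the `textrank` pipeline component come from the description.

```python
def summarize_extractive(text, limit_sentences=1, limit_phrases=10):
    """Return the top-ranked sentence(s) of `text` using spaCy and PyTextRank."""
    # Imported lazily; importing pytextrank registers the "textrank" factory.
    import spacy
    import pytextrank  # noqa: F401

    nlp = spacy.load("en_core_web_lg")
    nlp.add_pipe("textrank")  # add TextRank as a pipeline component
    doc = nlp(text)

    # summary() yields the highest-ranked sentences as spaCy spans.
    ranked = doc._.textrank.summary(
        limit_phrases=limit_phrases,      # assumed value
        limit_sentences=limit_sentences,  # assumed value
    )
    return " ".join(sent.text.strip() for sent in ranked)


def length_report(original, summary):
    """Format the original and summary lengths in characters."""
    return f"Original: {len(original)} characters, summary: {len(summary)} characters"
```

Because the returned sentences are copied verbatim from the input, raising `limit_sentences` lengthens the summary but does nothing to smooth the transitions between sentences, which is the limitation noted above.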
- When to Use Abstractive Summarization (PEGASUS):
- Complex, lengthy documents where a concise, fluent summary is needed.
- Scenarios where coherence and readability are more important than just picking out key sentences (e.g., summarizing research papers, long articles, or creative writing).
- If you want summaries that paraphrase the input text and focus on delivering the essential meaning in a human-readable format.
- When to Use Extractive Summarization (TextRank with spaCy):
- Shorter, more straightforward documents where you just need to extract the key points or sentences.
- Scenarios where you need to quickly get a sense of the main ideas from a document without needing a rephrased summary.
- If computational resources are limited or if you need to process large amounts of text efficiently.
- PEGASUS (Abstractive Summarization) is generally the better choice when you need coherent, fluent summaries that convey the meaning concisely. It suits complex content but requires more computational power, and because the XSUM checkpoint is fine-tuned on news, results may be weaker on other domains.
- TextRank (Extractive Summarization) is a good choice for faster, simpler tasks or when computational resources are limited, though the summaries might not always be as polished or natural.
For tasks requiring highly fluent and creative summaries, PEGASUS would likely be the better choice. However, for simpler, faster extraction of key points from text, TextRank offers a more efficient solution.
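To check the speed and length trade-off on your own text, a small harness like the one below could time any summarizer function and report how much it compresses the input. This is a sketch only: `compare` and `first_sentence` are hypothetical names, and `first_sentence` is a trivial stand-in so the example runs without loading any model.

```python
import time


def compare(summarizers, text):
    """Run each summarizer on `text` and collect timing and length statistics."""
    results = {}
    for name, summarize in summarizers.items():
        start = time.perf_counter()
        summary = summarize(text)
        elapsed = time.perf_counter() - start
        results[name] = {
            "summary": summary,
            "seconds": elapsed,
            # Summary length relative to the input length, in characters.
            "ratio": len(summary) / len(text) if text else 0.0,
        }
    return results


def first_sentence(text):
    """Stand-in summarizer (hypothetical): return only the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."
```

Passing a dictionary such as `{"baseline": first_sentence}` together with the contents of text.txt returns one entry per summarizer; functions wrapping the two scripts can be added to the dictionary in the same way.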