5umitpandey/Text_Summarisation

Text Summarization

Published Paper

This document explains multiple approaches to text summarization, combining frequency-based methods, position-based methods, TF-IDF-based scoring, and transformer-based models (HuggingFace Transformers).

The goal of summarization is to condense a longer body of text into its most important parts while retaining meaning.


Cleaning and Preprocessing Text

Before applying summarization techniques, the text needs to be cleaned:

  • Remove special characters and digits.
  • Tokenize text into words and sentences.
  • Lemmatize words using WordNetLemmatizer.
  • Remove stopwords to reduce noise.

The cleaned text is then used in different summarization algorithms.


Frequency-Based Summarization

Approach:

  • Tokenize text into sentences.
  • Count word frequency after removing stopwords.
  • Normalize frequency values by dividing by the maximum frequency.
  • Score each sentence based on word frequencies.
  • Select the top-N sentences as the summary.

Position-Based Summarization

Approach:

  • Tokenize text into sentences.
  • Score sentences based on their position in the text (earlier sentences often carry more importance).
  • Select the top-N sentences by position score.

TF-IDF Based Summarization

Approach:

  • Create a frequency matrix of words in each sentence.
  • Compute Term Frequency (TF) and Inverse Document Frequency (IDF).
  • Calculate TF-IDF scores for words.
  • Score sentences by averaging TF-IDF values of words.
  • Select the best sentences above a set threshold.

This weighting gives rare but informative words more influence on sentence scores, so they are more likely to surface in the summary.


Sample Case

Input: Peter and Elizabeth took a taxi to attend the night party in the city. While at the party, Elizabeth collapsed and was rushed to the hospital.

Position-based summary (of the cleaned text, stopwords removed):

Peter Elizabeth took taxi attend night party city party Elizabeth collapsed rushed hospital


Transformer-Based Summarization (HuggingFace)

Approach:

  • Uses pre-trained deep learning models such as sshleifer/distilbart-cnn-12-6.
  • Capable of generating abstractive summaries, unlike extractive methods.
  • Requires the transformers library and a summarization pipeline.
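With the `transformers` summarization pipeline this is only a few lines. The generation parameters below (`max_length`, `min_length`) are illustrative defaults, and the first call downloads the model checkpoint:

```python
from transformers import pipeline

# sshleifer/distilbart-cnn-12-6 is the distilled BART checkpoint named above
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

text = (
    "Peter and Elizabeth took a taxi to attend the night party in the city. "
    "While at the party, Elizabeth collapsed and was rushed to the hospital."
)
result = summarizer(text, max_length=40, min_length=5, do_sample=False)
print(result[0]["summary_text"])
```

Unlike the extractive methods above, the generated summary may contain wording that never appears in the input.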

Key Takeaways

  1. Frequency-based summarization selects sentences with the most frequent important words.
  2. Position-based summarization prioritizes sentences based on placement in the document.
  3. TF-IDF summarization balances word frequency with uniqueness, highlighting rare but meaningful terms.
  4. Transformer-based summarization (e.g., DistilBART) provides abstractive summaries, rephrasing content instead of just extracting sentences.

Each approach has strengths and trade-offs:

  • Rule-based methods (frequency, TF-IDF, position) are fast and simple.
  • Transformer-based methods generally produce more fluent, coherent summaries but are computationally expensive.
