The Summarization App is a web-based tool that provides text summarization for both books and YouTube videos. The app can:
- Summarize chapters from books hosted on Project Gutenberg (UTF-8 plain text only).
- Generate summaries from YouTube videos using either existing transcripts or transcriptions of the audio.
The project uses two pretrained models, BART for text summarization and Whisper for audio transcription, behind an interface built with Streamlit.
- Fetches book content directly from Project Gutenberg (UTF-8 format only).
- Automatically extracts chapters from the book.
- Summarizes selected chapters using BART.
- Fetches and summarizes transcripts of YouTube videos.
- If no transcript is available, downloads and transcribes the audio using Whisper, then generates a summary.
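The chapter extraction step can be sketched as follows. This is a minimal illustration, not necessarily how the app does it: it assumes chapter headings are lines beginning with "Chapter" followed by a Roman or Arabic numeral, and the names `CHAPTER_HEADING` and `split_chapters` are hypothetical.

```python
import re

# Assumed heading shape: "CHAPTER I", "Chapter 2. The Return", and so on.
# Real Project Gutenberg texts vary, so treat this pattern as a starting point.
CHAPTER_HEADING = re.compile(
    r"^[ \t]*chapter[ \t]+([ivxlcdm]+|\d+)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)

def split_chapters(text):
    """Return a list of (heading, body) tuples, one per detected heading."""
    matches = list(CHAPTER_HEADING.finditer(text))
    chapters = []
    for i, match in enumerate(matches):
        start = match.end()
        # A chapter body runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chapters.append((match.group(0).strip(), text[start:end].strip()))
    return chapters
```

A table of contents that repeats the headings would produce extra entries with very short bodies, so a real implementation would likely need to filter those out.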
Ensure you have Python 3.8 or above installed.
The app relies on several key Python libraries:
- streamlit: For the web interface.
- transformers: To use BART for text summarization.
- torch: To handle model computations.
- nltk: For sentence tokenization.
- whisper: For audio transcription.
- yt-dlp: For downloading YouTube audio.
- pytube: To fetch video metadata.
- youtube_transcript_api: For fetching YouTube transcripts.
- requests: For fetching book content.
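A quick way to confirm the libraries above are importable is sketched below. The import names are assumptions based on each package's usual convention (for example, yt-dlp is imported as `yt_dlp`), and `missing_modules` is a hypothetical helper, not part of the app.

```python
import importlib.util

# Assumed import names for the libraries listed above.
REQUIRED_MODULES = [
    "streamlit", "transformers", "torch", "nltk", "whisper",
    "yt_dlp", "pytube", "youtube_transcript_api", "requests",
]

def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules()
    print("Missing:", ", ".join(missing) if missing else "none")
```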
git clone https://github.com/your-username/summarization-app.git
cd summarization-app
It’s recommended to use a virtual environment to manage dependencies:
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Install the required Python packages:
pip install -r requirements.txt
The app requires the punkt tokenizer from NLTK for sentence tokenization. To download it:
python -c "import nltk; nltk.download('punkt')"
Ensure that yt-dlp is installed to handle YouTube video downloads. You can install it via pip:
pip install yt-dlp
Or download directly from the official yt-dlp repository.
To start the Streamlit app, run:
streamlit run app.py
This will open the app in your web browser.
- Supported Source: The app only supports books from Project Gutenberg that are in UTF-8 text format.
- Book URL Requirements:
  - Ensure the book URL is from Project Gutenberg.
  - The book must be in UTF-8 plain text format.
  - The URL should directly point to the UTF-8 version of the plain text file.
- Navigate to Project Gutenberg.
- Search for the book you wish to summarize.
- On the book’s download page, scroll down to the Download options.
- Click on the Plain Text UTF-8 format.
  - Example: https://www.gutenberg.org/files/1342/1342-0.txt (for Pride and Prejudice).
- Copy the URL of the UTF-8 plain text file and use it in the app.
If you attempt to use a different format (e.g., HTML, PDF, or other encodings), the app will not be able to process the book.
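The URL requirements above could be checked before downloading, roughly as sketched here. The function names are hypothetical, and the path test (ending in `.txt` or `.txt.utf-8`) is an assumption about how Project Gutenberg names its plain-text files rather than a documented rule.

```python
from urllib.parse import urlparse

def is_gutenberg_text_url(url):
    """Rough check: an HTTP(S) URL on gutenberg.org whose path looks like plain text."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    on_gutenberg = host == "gutenberg.org" or host.endswith(".gutenberg.org")
    looks_like_text = parsed.path.endswith((".txt", ".txt.utf-8"))
    return parsed.scheme in ("http", "https") and on_gutenberg and looks_like_text

def fetch_book_text(url, timeout=30):
    """Download the book and decode it strictly as UTF-8."""
    import requests  # imported lazily so the validator works without it

    if not is_gutenberg_text_url(url):
        raise ValueError(f"Not a Project Gutenberg plain-text URL: {url}")
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    # Decode the raw bytes directly; a non-UTF-8 file raises UnicodeDecodeError.
    # "utf-8-sig" also drops a leading byte-order mark if one is present.
    return response.content.decode("utf-8-sig")
```

Decoding the bytes explicitly, instead of relying on the encoding guessed from response headers, makes a wrong-format file fail loudly rather than produce garbled text.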
- Supported Source: The app accepts YouTube video URLs.
- Transcript: If the video has a transcript, the app fetches it automatically.
- Transcription: If no transcript is available, the app downloads the video’s audio and transcribes it using Whisper.
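The transcript-first, Whisper-second behavior can be sketched as below. To avoid guessing at any library's exact API, the two steps are passed in as callables: `fetch_transcript` and `transcribe_audio` are hypothetical wrappers (around youtube_transcript_api, and around yt-dlp plus Whisper, respectively), and the URL shapes handled by `extract_video_id` are a common subset, not an exhaustive list.

```python
from urllib.parse import urlparse, parse_qs

YOUTUBE_HOSTS = ("youtube.com", "www.youtube.com", "m.youtube.com")

def extract_video_id(url):
    """Pull the video ID out of common YouTube URL shapes, or return None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        return parsed.path.lstrip("/") or None
    if host in YOUTUBE_HOSTS:
        if parsed.path == "/watch":
            return parse_qs(parsed.query).get("v", [None])[0]
        for prefix in ("/embed/", "/shorts/"):
            if parsed.path.startswith(prefix):
                return parsed.path[len(prefix):].split("/")[0] or None
    return None

def get_text_for_video(url, fetch_transcript, transcribe_audio):
    """Try the transcript first; fall back to audio transcription on failure."""
    video_id = extract_video_id(url)
    if video_id is None:
        raise ValueError(f"Could not find a video ID in: {url}")
    try:
        text = fetch_transcript(video_id)
        if text:
            return text, "transcript"
    except Exception:
        pass  # no usable transcript; fall through to transcription
    return transcribe_audio(url), "whisper"
```

Returning the source alongside the text lets the interface tell the user whether the summary is based on an existing transcript or on a Whisper transcription.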
- Main Page:
  - Select either Book Summarization or YouTube Video Summarization.
- Book Summarization:
  - Enter the URL of a book from Project Gutenberg (in UTF-8 format).
  - Fetch the book content and select a chapter.
  - Summarize the selected chapter and download the summary.
- YouTube Video Summarization:
  - Enter a YouTube video URL.
  - Fetch the video’s transcript (if available) or transcribe its audio using Whisper.
  - Summarize the transcript and download the summary.
- Model: facebook/bart-large-cnn
  - Purpose: Summarization of text (used for both book chapters and YouTube transcripts).
- Model: openai/whisper
  - Purpose: Transcription of YouTube audio (used when no transcript is available).
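Because facebook/bart-large-cnn accepts at most 1024 input tokens, a chapter or transcript usually has to be summarized in pieces. The sketch below shows one way to do that; the 400-word chunk budget, the length settings, and the function names are illustrative assumptions, and it presumes a transformers version that provides the `summarization` pipeline.

```python
def chunk_sentences(sentences, max_words=400):
    """Group sentences into chunks of at most max_words words each.

    A single sentence longer than max_words becomes its own chunk.
    """
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks

def summarize_text(text, max_words=400):
    """Summarize text chunk by chunk and join the partial summaries."""
    # Heavy imports are deferred so chunk_sentences can be used on its own.
    from nltk.tokenize import sent_tokenize
    from transformers import pipeline

    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    parts = []
    for chunk in chunk_sentences(sent_tokenize(text), max_words):
        # truncation=True guards against a chunk that still exceeds the limit.
        result = summarizer(chunk, max_length=130, min_length=30, truncation=True)
        parts.append(result[0]["summary_text"])
    return " ".join(parts)
```

Splitting on sentence boundaries, rather than at a fixed character offset, keeps each chunk readable to the model and is the reason the punkt tokenizer is needed.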
The summaries and transcripts are saved in the following structure:
summaries/
└── <Book or Video Title> - <Author or Video ID>/
    ├── Chapter_<n>.txt
    ├── transcript.txt
    └── summary.txt
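A layout like this can be produced with `pathlib`, as in the following sketch. The helper names and the character-replacement rule in `safe_name` are assumptions for illustration; the app may sanitize titles differently.

```python
import re
from pathlib import Path

def safe_name(name):
    """Replace characters that are invalid on common filesystems with underscores."""
    return re.sub(r'[<>:"/\\|?*]', "_", name).strip()

def output_dir(title, author_or_id, root="summaries"):
    """Create and return <root>/<title> - <author or video ID>/."""
    path = Path(root) / f"{safe_name(title)} - {safe_name(author_or_id)}"
    path.mkdir(parents=True, exist_ok=True)
    return path

def save_text(directory, filename, text):
    """Write text as UTF-8 and return the path of the written file."""
    target = Path(directory) / filename
    target.write_text(text, encoding="utf-8")
    return target
```

For example, `save_text(output_dir("Pride and Prejudice", "Jane Austen"), "Chapter_1.txt", summary)` would write the file under `summaries/Pride and Prejudice - Jane Austen/`.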
- Adding support for more book formats and sources.
- Enhancing the transcription quality using Whisper's larger models.
- Implementing a more robust chapter detection algorithm for books.
This project is licensed under the MIT License.