This project combines web scraping and Natural Language Processing (NLP) to extract and analyze climate reports from the IPCC. The notebook includes steps to download PDF reports, parse content, and apply NLP techniques to derive insights from the data.
This repository focuses on:
- Web Scraping: Automated extraction and downloading of multiple PDF reports from the IPCC website (see the sketch after this list).
- Data Processing: Conversion and management of large PDF files, including validation and file existence checks.
- Natural Language Processing (NLP): Applying NLP to analyze textual content from climate reports, enabling data-driven insights.
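A minimal sketch of how the scraping and download step could look is shown below, assuming `requests` and BeautifulSoup. The listing URL, link selection, and output directory are illustrative assumptions rather than the notebook's exact values.

```python
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Assumed values for illustration only; the notebook may target different pages.
LISTING_URL = "https://www.ipcc.ch/reports/"
DOWNLOAD_DIR = "reports"

def download_ipcc_pdfs(listing_url=LISTING_URL, out_dir=DOWNLOAD_DIR):
    """Scrape a report-listing page for PDF links and download each file once."""
    os.makedirs(out_dir, exist_ok=True)
    page = requests.get(listing_url, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")

    # Keep every anchor whose target ends in .pdf, resolved to an absolute URL.
    pdf_links = [urljoin(listing_url, a["href"])
                 for a in soup.find_all("a", href=True)
                 if a["href"].lower().endswith(".pdf")]

    for link in pdf_links:
        filename = os.path.join(out_dir, link.rsplit("/", 1)[-1])
        if os.path.exists(filename):  # file-existence check: skip already-downloaded reports
            continue
        response = requests.get(link, timeout=60)
        response.raise_for_status()
        with open(filename, "wb") as fh:
            fh.write(response.content)
```

Checking for an existing file before downloading keeps re-runs cheap when the reports are large.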
- Web scraping with BeautifulSoup is used to extract data from the IPCC website.
- Data Transformation (see the conversion sketch after this list):
  - PDF files containing climate change reports are converted to text format.
  - Text data is cleaned and pre-processed for NLP analysis.
- spaCy NLP tools are used to analyze the text data (see the keyword-extraction sketch below), including:
  a. Keyword extraction: Identifying key terms and concepts related to climate change.
  b. Topic modeling: Identifying the main topics discussed in the reports.
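One possible shape for the conversion and pre-processing steps is sketched below. It assumes the `pypdf` library for text extraction and a simple regex-based clean-up; the notebook may use a different PDF parser or cleaning rules.

```python
import re

from pypdf import PdfReader  # assumption: the notebook may use another PDF library

def pdf_to_text(pdf_path):
    """Concatenate the extracted text of every page in a PDF report."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def clean_text(raw_text):
    """Basic pre-processing: lowercase, keep only letters, collapse whitespace."""
    text = raw_text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)     # drop digits, punctuation, symbols
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
```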
The results of the NLP analysis are visualized using WordCloud (see the sketch below). This allows users to explore and understand the key findings of the climate change reports.
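A minimal sketch of the keyword-extraction and word-cloud steps, assuming spaCy's small English model and a simple frequency count over content words (nouns, proper nouns, adjectives). The function names and filtering rules are illustrative, and the topic-modeling step is not shown here.

```python
from collections import Counter

import matplotlib.pyplot as plt
import spacy
from wordcloud import WordCloud

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def keyword_frequencies(text, top_n=50):
    """Count lemmas of content words as a simple form of keyword extraction."""
    # Very long reports may need nlp.max_length raised or the text processed in chunks.
    doc = nlp(text)
    keywords = [tok.lemma_.lower() for tok in doc
                if tok.pos_ in {"NOUN", "PROPN", "ADJ"}
                and tok.is_alpha and not tok.is_stop]
    return Counter(keywords).most_common(top_n)

def plot_wordcloud(frequencies):
    """Render the keyword counts as a word cloud."""
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(dict(frequencies))
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
```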
- The program allows users to input a specific keyword or phrase.
- The program then retrieves all paragraphs from the reports that contain the input keyword or phrase, allowing users to quickly find information related to their interests (see the sketch below).
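The keyword lookup might be as simple as the sketch below, which assumes that paragraphs in the extracted text are separated by blank lines; that splitting rule is an assumption, not necessarily how the notebook segments the reports.

```python
def find_paragraphs(report_text, query):
    """Return every paragraph that mentions the query, case-insensitively."""
    paragraphs = [p.strip() for p in report_text.split("\n\n") if p.strip()]
    query_lower = query.lower()
    return [p for p in paragraphs if query_lower in p.lower()]

# Example: pull every paragraph that mentions "sea level rise" from one report.
# matches = find_paragraphs(report_text, "sea level rise")
```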