From fb6fc43792c086613dfaec4d4d6f9991e9e67664 Mon Sep 17 00:00:00 2001
From: RyanSamman
Date: Sun, 4 Oct 2020 11:38:19 +0300
Subject: [PATCH] Blog Word Cloud

---
 .gitignore                     |   2 +
 README.md                      | 235 +++++++++++++++++++++++++++++
 display.ipynb                  |  80 +++++++++++
 images/BlogWordCloud.png       | Bin 0 -> 252778 bytes
 images/ExampleWordCloud.png    | Bin 0 -> 23399 bytes
 images/FCITLogo.jpg            | Bin 0 -> 24875 bytes
 parser.ipynb                   | 116 ++++++++++++++++
 processedData.json             |  13 ++
 pythonFiles/display.py         |  33 +++++
 pythonFiles/parser.py          |  28 ++++
 pythonFiles/processedData.json |  13 ++
 pythonFiles/pythonFiles.md     |  36 +++++
 pythonFiles/rawBlogText.json   |   5 +
 pythonFiles/scraper.py         |  55 ++++++++
 rawBlogText.json               |   5 +
 scraper.ipynb                  | 137 +++++++++++++++++++
 16 files changed, 758 insertions(+)
 create mode 100644 .gitignore
 create mode 100644 README.md
 create mode 100644 display.ipynb
 create mode 100644 images/BlogWordCloud.png
 create mode 100644 images/ExampleWordCloud.png
 create mode 100644 images/FCITLogo.jpg
 create mode 100644 parser.ipynb
 create mode 100644 processedData.json
 create mode 100644 pythonFiles/display.py
 create mode 100644 pythonFiles/parser.py
 create mode 100644 pythonFiles/processedData.json
 create mode 100644 pythonFiles/pythonFiles.md
 create mode 100644 pythonFiles/rawBlogText.json
 create mode 100644 pythonFiles/scraper.py
 create mode 100644 rawBlogText.json
 create mode 100644 scraper.ipynb

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..8851899
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,2 @@
+chromedriver.exe
+.vscode
\ No newline at end of file
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..dc6a4ae
--- /dev/null
+++ b/README.md
@@ -0,0 +1,235 @@
+# Table of Contents
+- [Table of Contents](#table-of-contents)
+- [What is this?](#what-is-this)
+- [How does it work?](#how-does-it-work)
+  - [Dependencies](#dependencies)
+  - [Scraping the Data](#scraping-the-data)
+    - [Initialize the Browser Driver](#initialize-the-browser-driver)
+    - [Opening a site](#opening-a-site)
+    - [Retrieving Blog Links](#retrieving-blog-links)
+    - [Retrieving Text Data](#retrieving-text-data)
+  - [Processing Data](#processing-data)
+    - [Loading the Data](#loading-the-data)
+    - [Removing newlines](#removing-newlines)
+    - [Selecting all the words present](#selecting-all-the-words-present)
+    - [Getting the Frequency of Each Word](#getting-the-frequency-of-each-word)
+    - [Saving the Data](#saving-the-data)
+  - [Visualizing the Data](#visualizing-the-data)
+    - [Load the processed data](#load-the-processed-data)
+    - [Load the target image](#load-the-target-image)
+    - [Creating the WordCloud Object](#creating-the-wordcloud-object)
+
+# What is this?
+As part of our CPIT221 course, we write a weekly blog. In our previous week's blog, we wrote about our experiences in a group discussion, where the members were chosen at random.
+
+As someone who likes data, it was a no-brainer to scrape and visualize that data, so this is the end result:
+
+![Word Cloud Scraped Image](images/BlogWordCloud.png)
+
+If it wasn't obvious, the size of each word corresponds to how frequently it appears across all the blogs.
+
+# How does it work?
+I split the code for this project into three distinct parts:
+- **Scraping** Data
+- **Processing** Data
+- **Visualizing** Data
+
+Of course, to achieve this, we will need to install some dependencies.
+
+## Dependencies
+To view and run the code, you will need to use a [Jupyter Notebook](https://jupyter.org). Alternatively, [read this and run the .py files](./pythonFiles/pythonFiles.md).
+
+Visual Studio Code has Jupyter Notebook support built into its [Python Extension](https://marketplace.visualstudio.com/items?itemName=ms-python.python), and that is what I used for this project.
+
+Additionally, you will need to install:
+- the [Selenium WebDriver library](https://pypi.org/project/selenium/)
+- the [Selenium Chrome Driver](https://chromedriver.chromium.org/downloads); make sure to either add it to the PATH or place it in the project's current working directory
+- [Matplotlib](https://pypi.org/project/matplotlib/)
+- [WordCloud](https://pypi.org/project/wordcloud/)
+
+## Scraping the Data
+This was achieved using Selenium's WebDriver for Python. Although its intended use is [Integration Testing](https://en.wikipedia.org/wiki/Integration_testing), we can also use it to automate browser actions and effectively scrape data from the browser.
+
+Previously, I had used Python's interactive shell for web scraping and browser automation. However, for this project I decided to finally use Jupyter Notebooks, which offer autocompletion and let you tinker with the code in real time. The main advantage of Jupyter Notebooks here is the ability to run code in blocks called 'cells', which is useful in the tedious process of scraping data.
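+Before running the cells below, the Python packages from the Dependencies section can be installed in one step with pip (assuming `pip` points at the same environment your notebook uses; the Chrome driver itself still has to be downloaded separately):
+
+```sh
+pip install selenium matplotlib wordcloud
+```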
+
+### Initialize the Browser Driver
+```py
+from selenium import webdriver
+
+driver = webdriver.Chrome()
+```
+
+### Opening a site
+```py
+driver.get("https://website.web.edu.sa")
+```
+Then, we will manually log in and move to the weekly writing blogs.
+
+### Retrieving Blog Links
+
+After navigating to the right page, we can scrape all the blog links with the code below.
+```py
+def getBlogs():
+    # Get the list element containing all the blog links
+    blogList = driver.find_element_by_xpath("/html/body/div[5]/div[2]/div/div/div/div/div[3]/div/div[2]/div[4]/div/ul")
+
+    # Retrieve children of