
NikuPAN/web-scraping-project

Web Scraping Project

Last Updated: 2024-06-20

Introduction

This web scraping project extracts content from a website. It reports how the page is structured: which HTML tags it uses and how many items each tag has.
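As a rough illustration of a tag count, here is a minimal sketch that uses only the Python standard library. It is not the project's own implementation (which lists selenium and bs4 as dependencies); the names `TagCounter` and `count_tags` are made up for this example.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times each HTML tag is opened in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names in lowercase.
        self.counts[tag] += 1


def count_tags(html):
    """Return a dict mapping tag name to number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.counts)


sample = "<html><body><h1>Title</h1><p>One</p><p>Two</p></body></html>"
print(count_tags(sample))
```

A real page would first have to be fetched (for example with requests or selenium) before its HTML is passed to a counter like this.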

It can also scrape the subpages and sibling pages linked from the page you enter. Scraping external resources is optional, so the program does not keep running without good reason.
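One way to tell same-site pages from external resources is to compare host names. The sketch below assumes that "external" means "served from a different host than the page you entered"; the project may draw the line differently, and `classify_links` is a hypothetical helper name.

```python
from urllib.parse import urljoin, urlparse


def classify_links(page_url, hrefs):
    """Split hrefs found on page_url into same-site and external URLs."""
    base_host = urlparse(page_url).netloc.lower()
    internal, external = [], []
    for href in hrefs:
        # Resolve relative links against the page they were found on.
        absolute = urljoin(page_url, href)
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar links
        if parts.netloc.lower() == base_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external


internal, external = classify_links(
    "https://example.com/docs/index.html",
    ["intro.html", "/about", "https://cdn.example.org/app.js", "mailto:a@b.c"],
)
print(internal)
print(external)
```

With a split like this, a scraper can follow the internal list by default and touch the external list only when the user opts in.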

The best use case for this project would be integrate with AI to analyse a website, whether it is well structured, how it was designed or, how it can be improved.

Getting Started

Prerequisites

List of software and tools needed:

  • Python 3 Environment
  • Code Editor (Recommended: Visual Studio Code; do not use Notepad unless you are a very experienced developer)

Installation

Step-by-step guide on installing the project:

  • Download Python from the official Python website
  • Download the recommended code editor from Microsoft
  • Install the required external Python dependencies (optional)
    pip install selenium bs4 webdriver-manager requests
  • Alternatively, let the program install them for you at run time; it has an internal package installer.
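An internal package installer of this kind can be sketched as follows. This is an assumption about how such a step might work, not the project's actual code: the names `REQUIRED`, `missing_packages`, and `install_missing` are hypothetical, and the mapping only reflects the pip command shown above.

```python
import importlib.util
import subprocess
import sys

# pip package name -> importable module name (they can differ).
REQUIRED = {
    "selenium": "selenium",
    "bs4": "bs4",
    "webdriver-manager": "webdriver_manager",
    "requests": "requests",
}


def missing_packages(required):
    """Return the pip names whose modules cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


def install_missing(required):
    """Install whatever is missing with the current interpreter's pip."""
    missing = missing_packages(required)
    if missing:
        subprocess.check_call([sys.executable, "-m", "pip", "install", *missing])
    return missing


# Only report here; call install_missing(REQUIRED) to actually install.
print(missing_packages(REQUIRED))
```

Running pip through `sys.executable -m pip` keeps the install in the same environment as the interpreter that runs the program.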

Usage

  • Clone this project
  • Open Project folder in your code editor

Run Directly

  • Two scripts are provided that run the program directly:
    • start.bat - This script is for Windows;
    • start.sh - This script is for Linux
      • NOTE: Your Linux system may not allow you to execute start.sh directly because of file permissions. If this is the case, use chmod +x start.sh to make the file executable.

Run Using Command Prompt

  • Alternatively, you can run the program from a terminal:
  • Open a terminal
  • Type the command
    py main.py
    or
    python main.py

Optional Parameter

  • You can also pass an argument on the command line: py main.py 1 or python main.py 2, where:
    • 1 = runs the main program directly;
    • 2 = opens the settings menu
    • NOTE: Either option skips the main menu!
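The argument handling described above could look roughly like this. The sketch is illustrative only: `start_mode` and the mode names "run", "settings", and "menu" are invented here, and treating any other value as "show the main menu" is an assumption.

```python
import sys


def start_mode(argv):
    """Map the optional first command-line argument to a start mode."""
    option = argv[1] if len(argv) > 1 else None
    if option == "1":
        return "run"       # run the main program directly
    if option == "2":
        return "settings"  # open the settings menu
    return "menu"          # no (or unknown) argument: show the main menu


if __name__ == "__main__":
    print(start_mode(sys.argv))
```

For example, `python main.py 2` would yield `sys.argv == ["main.py", "2"]`, which this function maps to the settings mode.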

Features

  • List of Features:
    • Fetch and scrape web content
    • Generate a directory (hierarchy) map representing the website structure (TO BE COMPLETED)
    • Expand all potentially hidden content in dropdown / expandable HTML components (TO BE COMPLETED)
    • Download HTML webpage(s)
    • Generate a report after a successful fetch, with all text content grouped by HTML tag.
    • Scrape sibling / sub-pages found in the website (Optional)
    • Download sibling / sub-pages found in the website (Optional)
    • Download JavaScript files used in the website (Optional)
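To give an idea of what a report grouped by HTML tag involves, here is a minimal standard-library sketch. It is not the project's report format; `TextByTag`, `group_text`, and the short `VOID_TAGS` set are assumptions made for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextByTag(HTMLParser):
    """Collects text and files it under the innermost open tag."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.groups = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.groups[self.stack[-1]].append(text)


def group_text(html):
    """Return a dict mapping tag name to the text snippets found in it."""
    parser = TextByTag()
    parser.feed(html)
    parser.close()
    return dict(parser.groups)


print(group_text("<h1>Title</h1><p>One<br>more</p><p>Two</p>"))
```

Text is attributed to its nearest enclosing tag only, so nested markup such as a link inside a paragraph is filed under the link's tag rather than the paragraph's.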

License

MIT License

Copyright (c) [2024] [ConceptV Pty LTD]

Permission is hereby granted, free of charge, to ConceptV Pty LTD obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Acknowledgments

  • This project was created by Nick Pang. However, ConceptV Pty LTD has full ownership of this project.
