spectrayan/web-scraper

Web Scraper

A Python module that scrapes web content from URLs listed in a CSV file and saves the extracted content as JSON.

Features

  • Scrapes web pages using Scrapy spiders for crawling and parsing
  • Reads URLs from a CSV file
  • Crawls and extracts content from internal links within the same domain
  • Configurable crawling depth and limits
  • Extracts title, text, and metadata from web pages
  • Handles JavaScript-rendered pages and animations using Selenium
  • Waits for pages to fully load before extracting content
  • Saves scraped data in JSON format
  • Configurable using Pydantic models
  • Comprehensive logging with Loguru

Installation

Prerequisites

  • Python 3.12 or higher
  • uv - A Python package installer and resolver

Dependencies are declared in pyproject.toml and installed with uv, a faster alternative to pip.

Install from source

  1. Clone the repository:

    git clone https://github.com/spectrayan/web-scraper.git
    cd web-scraper
    
  2. Install the package and its dependencies using uv:

    uv pip install -e .
    

Non-editable install

If you only want to run the script and do not need an editable install, install the package and its dependencies using uv:

uv pip install .

This installs the project together with the packages it needs to run (scrapy, loguru, pydantic, pandas, selenium, webdriver-manager), as defined in pyproject.toml.

Usage

Environment Variables

The web scraper can be configured using environment variables. A .env.sample file listing all available configuration options is provided. To use it:

  1. Copy the sample file to create your own .env file:

    cp .env.sample .env
    
  2. Edit the .env file to customize your settings.

  3. Install the web scraper with the optional env extra, which adds support for loading environment variables from the .env file:

    uv pip install ".[env]"
    

    Alternatively, you can install python-dotenv separately:

    uv pip install python-dotenv
    
  4. The environment variables will be automatically loaded from the .env file when you run the scraper.
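
The optional-dependency pattern described in these steps can be sketched as follows. This is a minimal illustration, not the project's actual loading code: the helper name load_env_file is hypothetical, and the only library call used is load_dotenv from python-dotenv.

```python
try:
    # python-dotenv is the optional dependency installed in step 3.
    from dotenv import load_dotenv
except ImportError:
    load_dotenv = None


def load_env_file(path: str = ".env") -> bool:
    """Load variables from `path` into os.environ.

    Returns False when python-dotenv is not installed or when no
    variables were loaded (for example, the file does not exist).
    """
    if load_dotenv is None:
        return False
    return bool(load_dotenv(dotenv_path=path))
```

With this approach the scraper still runs when python-dotenv is absent; it then reads only the variables already set in the process environment.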

Basic Usage

  1. Make sure you have installed the required dependencies:

    uv pip install .
    
  2. Create a CSV file with a list of URLs to scrape. The CSV file should have a column named url. For example:

    url
    https://www.example.com
    https://www.python.org
    https://www.wikipedia.org

  3. Run the scraper:

    python main.py --csv-file your_urls.csv
    

    Alternatively, you can set the SCRAPER_CSV_FILE_PATH environment variable in your .env file and run:

    python main.py
    
  4. The scraped data will be saved in the output directory by default.
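
To check a CSV file before running the scraper, the expected layout (a header row with a url column) can be validated with a short script. This is a sketch using only the standard library; it is not the code in src/scraper/csv_reader.py, and the function name read_urls is hypothetical. Skipping blank rows and dropping duplicates are assumptions made for the example.

```python
import csv
from typing import Iterable, List


def read_urls(lines: Iterable[str]) -> List[str]:
    """Return the unique, non-empty values of the 'url' column, in order."""
    reader = csv.DictReader(lines)
    if reader.fieldnames is None or "url" not in reader.fieldnames:
        raise ValueError("CSV must contain a column named 'url'")
    seen = set()
    urls = []
    for row in reader:
        url = (row.get("url") or "").strip()
        # Assumption for this sketch: skip blanks and repeated URLs.
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Any iterable of lines works as input, so an open file handle such as open("your_urls.csv", newline="") can be passed directly.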

Crawling Configuration

The web scraper can crawl and extract content from internal links within the same domain. This feature is enabled by default and can be configured using the following environment variables:

  • SCRAPER_FOLLOW_INTERNAL_LINKS: Whether to follow internal links within the same domain (default: True)
  • SCRAPER_MAX_DEPTH: Maximum depth for crawling internal links (default: 2)
  • SCRAPER_MAX_PAGES_PER_DOMAIN: Maximum number of pages to crawl per domain (default: 100)

For example, to crawl with a maximum depth of 3 and limit to 50 pages per domain, set the following in your .env file:

SCRAPER_FOLLOW_INTERNAL_LINKS=True
SCRAPER_MAX_DEPTH=3
SCRAPER_MAX_PAGES_PER_DOMAIN=50
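
As an illustration of how these variables map to settings, the sketch below reads them with the documented defaults. The project itself uses Pydantic models for configuration; this example uses a standard-library dataclass instead so that it is self-contained. The names CrawlSettings and settings_from_env are hypothetical, and the set of strings accepted as true is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: these spellings count as "true"; the project may accept others.
_TRUE_VALUES = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class CrawlSettings:
    follow_internal_links: bool = True   # SCRAPER_FOLLOW_INTERNAL_LINKS
    max_depth: int = 2                   # SCRAPER_MAX_DEPTH
    max_pages_per_domain: int = 100      # SCRAPER_MAX_PAGES_PER_DOMAIN


def settings_from_env(env: Optional[Mapping[str, str]] = None) -> CrawlSettings:
    """Build settings from environment variables, using the documented defaults."""
    env = os.environ if env is None else env
    defaults = CrawlSettings()
    follow = env.get("SCRAPER_FOLLOW_INTERNAL_LINKS")
    return CrawlSettings(
        follow_internal_links=(
            defaults.follow_internal_links
            if follow is None
            else follow.strip().lower() in _TRUE_VALUES
        ),
        max_depth=int(env.get("SCRAPER_MAX_DEPTH", defaults.max_depth)),
        max_pages_per_domain=int(
            env.get("SCRAPER_MAX_PAGES_PER_DOMAIN", defaults.max_pages_per_domain)
        ),
    )
```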

Path-Based Crawling

The crawler only follows links that start with the same path as the initial URL. For example:

  • If the initial URL is https://example.com/blog-details, the crawler will only follow links that start with https://example.com/blog-details or https://example.com/blog-details/.
  • If the initial URL is https://example.com/blog-details?id=123, the crawler will only follow links that start with https://example.com/blog-details.
  • If the initial URL is https://example.com/ (or just https://example.com), the crawler will follow all links on the domain.

This ensures that the crawler stays within the specific section of the website that you're interested in, rather than crawling the entire domain. It's particularly useful for scraping specific sections of large websites.

The crawler will only follow links within the same domains as the initial URLs provided in the CSV file. This prevents the crawler from wandering off to external websites.
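
The rules above can be expressed as a small link filter. This is a sketch of the described behavior, not the spider's actual implementation; the function name should_follow is hypothetical, and it assumes links have already been resolved to absolute URLs.

```python
from urllib.parse import urlsplit


def should_follow(start_url: str, link: str) -> bool:
    """Return True if `link` is on the same domain and under the start path."""
    start = urlsplit(start_url)
    target = urlsplit(link)

    # Never leave the domain of the initial URL.
    if target.netloc.lower() != start.netloc.lower():
        return False

    # The query string is ignored; only the path of the initial URL matters.
    base_path = start.path.rstrip("/")
    if not base_path:
        # Initial URL is the site root: every link on the domain qualifies.
        return True

    # Plain prefix match, as described above. Note that this also accepts
    # sibling paths such as /blog-details-archive.
    return target.path.startswith(base_path)
```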

You can also configure crawling behavior using command-line arguments:

# Crawl with a maximum depth of 3
python main.py --csv-file your_urls.csv --max-depth 3

# Disable following internal links
python main.py --csv-file your_urls.csv --no-follow-internal-links

# Limit to 50 pages per domain
python main.py --csv-file your_urls.csv --max-pages-per-domain 50

Command-line arguments take precedence over environment variables.
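
One common way to implement this precedence is to give each command-line option a default of None and fall back to the environment only when the option was not passed. The sketch below shows the idea for the maximum depth; it is an assumption about the approach, not the code in main.py, and the function name resolve_max_depth is hypothetical.

```python
import argparse
from typing import Mapping, Sequence


def resolve_max_depth(argv: Sequence[str], env: Mapping[str, str]) -> int:
    """Resolve max depth: command line first, then environment, then default."""
    parser = argparse.ArgumentParser()
    # default=None lets us tell "not passed" apart from an explicit value.
    parser.add_argument("--max-depth", type=int, default=None)
    args = parser.parse_args(list(argv))

    if args.max_depth is not None:
        return args.max_depth
    # Documented default for SCRAPER_MAX_DEPTH is 2.
    return int(env.get("SCRAPER_MAX_DEPTH", "2"))
```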

JavaScript Rendering

The web scraper uses Selenium to render pages with JavaScript and wait for them to fully load before extracting content. This is particularly useful for scraping modern websites that use animations and dynamically load content.

By default, all pages are rendered using Selenium, which ensures that even content loaded via JavaScript is properly extracted. The scraper waits for the page to be fully loaded and then waits an additional period to allow any animations to complete.

You can configure the maximum time to wait for a page to load using the SELENIUM_WAIT_TIME environment variable in your .env file:

SELENIUM_WAIT_TIME=15  # Wait up to 15 seconds for pages to load

Alternatively, you can set the wait time using the command-line argument:

python main.py --csv-file your_urls.csv --selenium-wait-time 15

The default wait time is 10 seconds, which is sufficient for most websites. If you're scraping websites with particularly heavy JavaScript or slow-loading animations, you may need to increase this value.
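
The waiting behavior described in this section can be approximated as follows. This is a sketch, not the code in src/scraper/middlewares.py: it assumes driver is a Selenium WebDriver (only its execute_script method is used), and the function name wait_until_loaded and the settle and poll parameters are hypothetical.

```python
import time


def wait_until_loaded(driver, max_wait: float = 10.0,
                      settle: float = 0.0, poll: float = 0.1) -> bool:
    """Poll document.readyState until it is 'complete' or `max_wait` elapses.

    After the page reports 'complete', sleep for `settle` seconds so that
    animations can finish. Returns False if the page did not finish in time.
    """
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        state = driver.execute_script("return document.readyState")
        if state == "complete":
            time.sleep(settle)
            return True
        time.sleep(poll)
    return False
```

Raising max_wait corresponds to increasing SELENIUM_WAIT_TIME: slow pages get more time, while pages that load quickly still return as soon as they are ready.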

Command Line Options

python main.py --help

This will display the available command line options:

usage: main.py [-h] --csv-file CSV_FILE [--output-dir OUTPUT_DIR] [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--log-file LOG_FILE]
               [--follow-internal-links] [--no-follow-internal-links] [--max-depth MAX_DEPTH] [--max-pages-per-domain MAX_PAGES_PER_DOMAIN]
               [--selenium-wait-time SELENIUM_WAIT_TIME]

Web Scraper

options:
  -h, --help            show this help message and exit
  --csv-file CSV_FILE, -c CSV_FILE
                        Path to the CSV file containing URLs to scrape
  --output-dir OUTPUT_DIR, -o OUTPUT_DIR
                        Directory to save scraped data
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}, -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Log level
  --log-file LOG_FILE   Path to the log file. If not provided, a default log file will be created.
  --follow-internal-links
                        Follow internal links within the same domain
  --no-follow-internal-links
                        Do not follow internal links
  --max-depth MAX_DEPTH
                        Maximum depth for crawling internal links
  --max-pages-per-domain MAX_PAGES_PER_DOMAIN
                        Maximum number of pages to crawl per domain
  --selenium-wait-time SELENIUM_WAIT_TIME
                        Maximum time to wait for a page to load (in seconds)
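
When the scraper is driven from another Python script, the options above can be assembled into a command list. The sketch below uses only flags that appear in the help output; the helper name build_command is hypothetical, and it assumes main.py is in the current working directory.

```python
import sys
from typing import List, Optional


def build_command(csv_file: str,
                  output_dir: Optional[str] = None,
                  follow_internal_links: Optional[bool] = None,
                  max_depth: Optional[int] = None,
                  max_pages_per_domain: Optional[int] = None,
                  selenium_wait_time: Optional[int] = None) -> List[str]:
    """Build an argument list for main.py; options left as None are omitted."""
    cmd = [sys.executable, "main.py", "--csv-file", csv_file]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if follow_internal_links is True:
        cmd.append("--follow-internal-links")
    elif follow_internal_links is False:
        cmd.append("--no-follow-internal-links")
    if max_depth is not None:
        cmd += ["--max-depth", str(max_depth)]
    if max_pages_per_domain is not None:
        cmd += ["--max-pages-per-domain", str(max_pages_per_domain)]
    if selenium_wait_time is not None:
        cmd += ["--selenium-wait-time", str(selenium_wait_time)]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True) to start a crawl.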

Project Structure

The project follows object-oriented principles, with each module responsible for one piece of functionality:

  • main.py: Entry point for the scraper
  • src/scraper/config.py: Configuration using Pydantic
  • src/scraper/csv_reader.py: Module to read URLs from CSV files
  • src/scraper/items.py: Definition of scraped data structure
  • src/scraper/pipelines.py: Processing and saving scraped data
  • src/scraper/logger.py: Logging setup using Loguru
  • src/scraper/middlewares.py: Middleware for handling JavaScript rendering using Selenium
  • src/scraper/spiders/web_spider.py: Scrapy spider for crawling and parsing

Output Format

The scraper saves each scraped page as a separate JSON file with the following structure:

{
  "url": "https://www.example.com",
  "title": "Example Domain",
  "text": "This domain is for use in illustrative examples in documents...",
  "metadata": {
    "description": "Example website description",
    "keywords": "example, domain",
    "domain": "www.example.com",
    "crawl_depth": 0
  },
  "timestamp": "2023-06-01T12:34:56.789012"
}

Additionally, all scraped pages are combined into a single JSON file named all_items.json.
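
The combined file is convenient for quick analysis after a crawl. The sketch below counts scraped pages per domain. It assumes that all_items.json contains a JSON array of records in the structure shown above, which this README does not state explicitly; the function name pages_per_domain is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def pages_per_domain(all_items_path: str) -> Counter:
    """Count scraped pages per domain using each record's metadata.domain."""
    # Assumption: the file holds a JSON array of page records.
    records = json.loads(Path(all_items_path).read_text(encoding="utf-8"))
    return Counter(record["metadata"]["domain"] for record in records)
```

Calling pages_per_domain on the combined file in your output directory returns a Counter keyed by domain, which can be compared against SCRAPER_MAX_PAGES_PER_DOMAIN to see whether a crawl hit its limit.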

License

This project is licensed under the MIT License - see the LICENSE file for details.
