Web Scraper

A Python module that scrapes web content from URLs listed in a CSV file and saves the results as JSON.

Features

Scrapes web pages using Scrapy spiders for crawling and parsing
Reads URLs from a CSV file
Crawls and extracts content from internal links within the same domain
Configurable crawling depth and limits
Extracts title, text, and metadata from web pages
Handles JavaScript-rendered pages and animations using Selenium
Waits for pages to fully load before extracting content
Saves scraped data in JSON format
Configurable using Pydantic models
Comprehensive logging with Loguru

Installation

Prerequisites

Python 3.12 or higher
uv - A Python package installer and resolver

This project uses pyproject.toml for dependency management with uv, which is a faster alternative to pip.

Install from source

Clone the repository:

```
git clone https://github.com/yourusername/web-scraper.git
cd web-scraper
```

Install the package and its dependencies using uv:
```
uv pip install -e .
```

Install dependencies only

If you just want to run the script and do not need an editable install, you can install the project and its required dependencies using uv:
```
uv pip install .
```
This installs the packages defined in pyproject.toml that the web scraper needs at runtime: scrapy, loguru, pydantic, pandas, selenium, and webdriver-manager.

Usage

Environment Variables

The web scraper can be configured using environment variables. A sample file, .env.sample, lists all available configuration options. To use it:

Copy the sample file to create your own .env file:
```
cp .env.sample .env
```
Edit the .env file to customize your settings.
Install the web scraper with the optional env dependency to load environment variables from the .env file:
```
uv pip install ".[env]"
```
Alternatively, you can install python-dotenv separately:
```
uv pip install python-dotenv
```
The environment variables will be automatically loaded from the .env file when you run the scraper.
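
As an illustration of how an optional .env loader can behave, the following is a minimal sketch, not the project's actual code. The function name load_env_file is hypothetical; load_dotenv comes from python-dotenv, and the sketch simply skips loading when that package is not installed.

```python
import os


def load_env_file(path: str = ".env") -> bool:
    """Load variables from a .env file when python-dotenv is available.

    Hypothetical helper. Returns True when the file contained at least one
    variable, False when the package or the file is missing.
    """
    try:
        from dotenv import load_dotenv  # provided by python-dotenv
    except ImportError:
        # Optional dependency not installed: rely on the real environment.
        return False
    if not os.path.isfile(path):
        return False
    # override=False keeps values that are already set in the shell.
    return bool(load_dotenv(dotenv_path=path, override=False))
```

With this approach, variables exported in the shell still win over values in the .env file.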

Basic Usage

Make sure you have installed the required dependencies:
```
uv pip install .
```
Create a CSV file with a list of URLs to scrape. The CSV file should have a column named url. For example:
```
url
https://www.example.com
https://www.python.org
https://www.wikipedia.org
```
Run the scraper:
```
python main.py --csv-file your_urls.csv
```
Alternatively, you can set the SCRAPER_CSV_FILE_PATH environment variable in your .env file and run:
```
python main.py
```
The scraped data will be saved in the output directory by default.
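
To make the expected CSV layout concrete, here is a minimal sketch of reading the url column. It uses the standard-library csv module as a stand-in; the function name read_urls is hypothetical and the project's own csv_reader.py may work differently.

```python
import csv
import io
from typing import Iterable, List


def read_urls(lines: Iterable[str]) -> List[str]:
    """Return the non-empty values of the 'url' column (hypothetical helper)."""
    reader = csv.DictReader(lines)
    if reader.fieldnames is None or "url" not in reader.fieldnames:
        raise ValueError("CSV must contain a column named 'url'")
    urls = []
    for row in reader:
        url = (row.get("url") or "").strip()
        if url:  # skip rows where the url cell is empty
            urls.append(url)
    return urls


sample = io.StringIO("url\nhttps://www.example.com\nhttps://www.python.org\n")
print(read_urls(sample))
```

For a file on disk, pass an open file handle instead, for example open("your_urls.csv", newline="", encoding="utf-8").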

Crawling Configuration

The web scraper can crawl and extract content from internal links within the same domain. This feature is enabled by default and can be configured using the following environment variables:

SCRAPER_FOLLOW_INTERNAL_LINKS: Whether to follow internal links within the same domain (default: True)
SCRAPER_MAX_DEPTH: Maximum depth for crawling internal links (default: 2)
SCRAPER_MAX_PAGES_PER_DOMAIN: Maximum number of pages to crawl per domain (default: 100)

For example, to crawl with a maximum depth of 3 and limit to 50 pages per domain, set the following in your .env file:

```
SCRAPER_FOLLOW_INTERNAL_LINKS=True
SCRAPER_MAX_DEPTH=3
SCRAPER_MAX_PAGES_PER_DOMAIN=50
```
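
The following sketch shows one way these three variables could be turned into typed settings with the documented defaults. It uses a plain dataclass instead of Pydantic to stay dependency-free. CrawlSettings, load_crawl_settings, and the set of accepted "true" spellings are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Assumed spellings that count as "true"; the real parser may accept others.
_TRUE_VALUES = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class CrawlSettings:
    """Hypothetical container mirroring the documented defaults."""
    follow_internal_links: bool = True
    max_depth: int = 2
    max_pages_per_domain: int = 100


def load_crawl_settings(env: Mapping[str, str] = os.environ) -> CrawlSettings:
    """Build settings from environment variables, falling back to defaults."""
    defaults = CrawlSettings()
    follow = env.get("SCRAPER_FOLLOW_INTERNAL_LINKS")
    return CrawlSettings(
        follow_internal_links=(
            defaults.follow_internal_links
            if follow is None
            else follow.strip().lower() in _TRUE_VALUES
        ),
        max_depth=int(env.get("SCRAPER_MAX_DEPTH", defaults.max_depth)),
        max_pages_per_domain=int(
            env.get("SCRAPER_MAX_PAGES_PER_DOMAIN", defaults.max_pages_per_domain)
        ),
    )
```

Passing a dictionary instead of os.environ makes the function easy to try out without touching the real environment.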

Path-Based Crawling

The crawler only follows links that start with the same path as the initial URL. For example:

If the initial URL is https://example.com/blog-details, the crawler will only follow links that start with https://example.com/blog-details or https://example.com/blog-details/.
If the initial URL is https://example.com/blog-details?id=123, the crawler will only follow links that start with https://example.com/blog-details.
If the initial URL is https://example.com/ (or just https://example.com), the crawler will follow all links on the domain.

This ensures that the crawler stays within the specific section of the website that you're interested in, rather than crawling the entire domain. It's particularly useful for scraping specific sections of large websites.

The crawler will only follow links within the same domains as the initial URLs provided in the CSV file. This prevents the crawler from wandering off to external websites.
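
The rules above can be sketched as a single predicate. This is an illustration under stated assumptions, not the spider's actual code: should_follow is a hypothetical name, links are assumed to be absolute URLs, and "starts with" is taken literally as a prefix match on the path.

```python
from urllib.parse import urlsplit


def should_follow(initial_url: str, link: str) -> bool:
    """Decide whether a link stays inside the section of the initial URL."""
    start, target = urlsplit(initial_url), urlsplit(link)
    # Never leave the domain of the initial URL.
    if target.netloc.lower() != start.netloc.lower():
        return False
    # The query string is not part of .path, so ?id=123 is ignored here.
    base_path = start.path.rstrip("/")
    if not base_path:
        # Initial URL is the site root: every link on the domain qualifies.
        return True
    # Literal prefix match, so /blog-details-archive would also qualify.
    return target.path.startswith(base_path)


print(should_follow("https://example.com/blog-details",
                    "https://example.com/blog-details/post-1"))
print(should_follow("https://example.com/blog-details",
                    "https://example.com/about"))
```

A stricter variant could require the next character after the prefix to be "/" or the end of the path, which would exclude sibling paths that merely share the prefix.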

You can also configure crawling behavior using command-line arguments:

```
# Crawl with a maximum depth of 3
python main.py --csv-file your_urls.csv --max-depth 3

# Disable following internal links
python main.py --csv-file your_urls.csv --no-follow-internal-links

# Limit to 50 pages per domain
python main.py --csv-file your_urls.csv --max-pages-per-domain 50
```

Command-line arguments take precedence over environment variables.
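
One common way to get this precedence is to give the argument no default and fall back to the environment only when the flag is absent. The sketch below shows the idea for --max-depth; resolve_max_depth is a hypothetical helper, and how main.py actually implements this is not shown in this README.

```python
import argparse
from typing import Mapping, Sequence


def resolve_max_depth(argv: Sequence[str], env: Mapping[str, str],
                      default: int = 2) -> int:
    """Resolve the crawl depth: command line first, then environment, then default."""
    parser = argparse.ArgumentParser()
    # default=None lets us tell "flag not given" apart from an explicit value.
    parser.add_argument("--max-depth", type=int, default=None)
    # parse_known_args ignores the other options this sketch does not define.
    args, _unknown = parser.parse_known_args(argv)
    if args.max_depth is not None:
        return args.max_depth
    if "SCRAPER_MAX_DEPTH" in env:
        return int(env["SCRAPER_MAX_DEPTH"])
    return default


print(resolve_max_depth(["--csv-file", "your_urls.csv", "--max-depth", "3"],
                        {"SCRAPER_MAX_DEPTH": "5"}))
```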

JavaScript Rendering

The web scraper uses Selenium to render pages with JavaScript and wait for them to fully load before extracting content. This is particularly useful for scraping modern websites that use animations and dynamically load content.

By default, all pages are rendered using Selenium, which ensures that even content loaded via JavaScript is properly extracted. The scraper waits for the page to be fully loaded and then waits an additional period to allow any animations to complete.

You can configure the maximum time to wait for a page to load using the SELENIUM_WAIT_TIME environment variable in your .env file:

```
SELENIUM_WAIT_TIME=15  # Wait up to 15 seconds for pages to load
```

Alternatively, you can set the wait time using the command-line argument:

```
python main.py --csv-file your_urls.csv --selenium-wait-time 15
```

The default wait time is 10 seconds, which is usually enough for typical pages. If you're scraping websites with particularly heavy JavaScript or slow-loading animations, you may need to increase this value.
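
The wait described above can be sketched as a polling loop. The helper below is a hypothetical illustration, not the middleware's actual code. It accepts any object exposing Selenium's execute_script method, and the settle_time and poll_interval parameters are assumptions, since the length of the extra animation pause is not documented here.

```python
import time


def wait_for_page(driver, wait_time: float = 10.0, settle_time: float = 1.0,
                  poll_interval: float = 0.25) -> bool:
    """Wait until the document reports it is fully loaded.

    Returns True if the page finished loading within wait_time seconds,
    False if the deadline passed first.
    """
    deadline = time.monotonic() + wait_time
    while True:
        # document.readyState becomes "complete" once the load event has fired.
        if driver.execute_script("return document.readyState") == "complete":
            # Extra pause so animations can finish before content is read.
            time.sleep(settle_time)
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

With a real browser this would be called right after driver.get(url), passing the configured wait time as wait_time.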

Command Line Options

```
python main.py --help
```

This will display the available command line options:

```
usage: main.py [-h] --csv-file CSV_FILE [--output-dir OUTPUT_DIR] [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--log-file LOG_FILE]
               [--follow-internal-links] [--no-follow-internal-links] [--max-depth MAX_DEPTH] [--max-pages-per-domain MAX_PAGES_PER_DOMAIN]
               [--selenium-wait-time SELENIUM_WAIT_TIME]

Web Scraper

options:
  -h, --help            show this help message and exit
  --csv-file CSV_FILE, -c CSV_FILE
                        Path to the CSV file containing URLs to scrape
  --output-dir OUTPUT_DIR, -o OUTPUT_DIR
                        Directory to save scraped data
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}, -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Log level
  --log-file LOG_FILE   Path to the log file. If not provided, a default log file will be created.
  --follow-internal-links
                        Follow internal links within the same domain
  --no-follow-internal-links
                        Do not follow internal links
  --max-depth MAX_DEPTH
                        Maximum depth for crawling internal links
  --max-pages-per-domain MAX_PAGES_PER_DOMAIN
                        Maximum number of pages to crawl per domain
  --selenium-wait-time SELENIUM_WAIT_TIME
                        Maximum time to wait for a page to load (in seconds)
```

Project Structure

The project follows object-oriented principles, with each file responsible for one piece of functionality:

main.py: Entry point for the scraper
src/scraper/config.py: Configuration using Pydantic
src/scraper/csv_reader.py: Module to read URLs from CSV files
src/scraper/items.py: Definition of scraped data structure
src/scraper/pipelines.py: Processing and saving scraped data
src/scraper/logger.py: Logging setup using Loguru
src/scraper/middlewares.py: Middleware for handling JavaScript rendering using Selenium
src/scraper/spiders/web_spider.py: Scrapy spider for crawling and parsing

Output Format

The scraper saves each scraped page as a separate JSON file with the following structure:

```
{
  "url": "https://www.example.com",
  "title": "Example Domain",
  "text": "This domain is for use in illustrative examples in documents...",
  "metadata": {
    "description": "Example website description",
    "keywords": "example, domain",
    "domain": "www.example.com",
    "crawl_depth": 0
  },
  "timestamp": "2023-06-01T12:34:56.789012"
}
```

Additionally, all scraped pages are combined into a single JSON file named all_items.json.
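
As a rough sketch of this output layout, the function below writes one JSON file per item plus the combined all_items.json. The function name save_items and the page_0000.json naming scheme are hypothetical, and treating all_items.json as a JSON array is an assumption. The real pipeline in pipelines.py may differ on all three points.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List


def save_items(items: Iterable[Dict], output_dir: str) -> List[Path]:
    """Write each item to its own file and all items to all_items.json."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    collected = list(items)
    written = []
    for index, item in enumerate(collected):
        # Hypothetical naming scheme; the real file names are not documented.
        path = out / f"page_{index:04d}.json"
        path.write_text(json.dumps(item, indent=2, ensure_ascii=False),
                        encoding="utf-8")
        written.append(path)
    # Assumed to be a JSON array holding every scraped item.
    combined = out / "all_items.json"
    combined.write_text(json.dumps(collected, indent=2, ensure_ascii=False),
                        encoding="utf-8")
    written.append(combined)
    return written
```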

License

This project is licensed under the MIT License - see the LICENSE file for details.
