
This project implements a parallel web scraper that efficiently analyzes website content using Intel Threading Building Blocks (TBB).


Vukotije/parallel-web-scraper


📚 Parallel Web Scraper

A high-performance parallel web scraper written in modern C++17, designed to extract structured book data from Books to Scrape.
This project was developed as part of a university course in Parallel Programming, demonstrating:

  • Efficient task parallelism using Intel TBB
  • Modern C++ coding practices
  • Clean modular architecture
  • Real-world HTML parsing and web interaction

🎯 Overview

The scraper automatically discovers catalog pages, extracts book details, collects statistics, and exports all results in JSON format.

Extracted Information:

  • Title
  • Price (incl./excl. tax)
  • Availability
  • Rating
  • Category
  • Description
  • UPC & Metadata
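To make the extracted fields concrete, here is a minimal sketch of a record type that could hold them. The field names and the `ratingToStars` helper are assumptions for illustration and are not taken from the project's `Book.h`; the only detail borrowed from this README is that ratings appear as words such as "Five" in the example output.

```cpp
#include <optional>
#include <string>

// Hypothetical record for one scraped book. Field names are assumptions,
// chosen to mirror the "Extracted Information" list above.
struct Book {
    std::string title;
    double priceInclTax = 0.0;
    double priceExclTax = 0.0;
    std::string availability;
    std::string rating;       // rating as a word, e.g. "Five"
    std::string category;
    std::string description;
    std::string upc;
};

// Hypothetical helper: converts a rating word ("One".."Five") to 1..5.
// Returns std::nullopt for anything else.
std::optional<int> ratingToStars(const std::string& rating) {
    static const char* const kWords[] = {"One", "Two", "Three", "Four", "Five"};
    for (int i = 0; i < 5; ++i) {
        if (rating == kWords[i]) {
            return i + 1;
        }
    }
    return std::nullopt;
}
```

Keeping the rating as text in the record and converting on demand keeps the stored value identical to what was scraped, which can make the JSON output easier to check against the page.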

Key Features

| Feature | Description |
| --- | --- |
| Parallel Processing | Uses Intel TBB for concurrency and high throughput |
| 🔍 Automatic Page Discovery | Dynamically crawls catalog pages |
| 🪶 HTML5 Parsing | Uses Lexbor for high-speed DOM extraction |
| 🌐 Modern HTTP | Uses CPR (libcurl wrapper) for requests |
| 📊 Analytics Output | Produces detailed aggregated statistics |
| 💾 JSON Export | Structured output for further analysis |
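The idea behind the parallel processing feature can be sketched with the standard library alone. This is not the project's TBB code: it swaps in plain `std::thread` workers that pull indices from a shared atomic counter, which is a simpler stand-in for what a TBB parallel loop would schedule. The name `processAll`, the `work` callback, and the convention that `workers == 0` means "use the hardware thread count" are all assumptions of this sketch.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Applies `work` to every URL using a fixed number of threads.
// Each thread repeatedly claims the next unprocessed index, so slow
// pages do not hold up the remaining work. `work` must not throw.
std::vector<std::string> processAll(
    const std::vector<std::string>& urls,
    const std::function<std::string(const std::string&)>& work,
    unsigned workers) {
    if (workers == 0) {
        // Assumption of this sketch: 0 selects the hardware thread count.
        workers = std::max(1u, std::thread::hardware_concurrency());
    }
    std::vector<std::string> results(urls.size());
    std::atomic<std::size_t> next{0};

    auto loop = [&] {
        for (std::size_t i = next.fetch_add(1); i < urls.size();
             i = next.fetch_add(1)) {
            results[i] = work(urls[i]);  // each index is written by one thread only
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < workers; ++t) {
        pool.emplace_back(loop);
    }
    for (auto& th : pool) {
        th.join();
    }
    return results;
}
```

Results are stored by index, so the output order matches the input order regardless of which thread finishes first. A task library such as TBB would typically replace the manual thread pool and counter with a single parallel loop call.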

🏗️ Project Structure

parallel-web-scraper/
├── CMakeLists.txt
├── CMakePresets.json
├── vcpkg.json
├── config.json
├── data/
│   ├── urls.txt
│   └── results.json
└── ParallelWebScraper/
    ├── include/
    │   ├── Book.h
    │   ├── Config.h
    │   ├── HtmlParser.h
    │   ├── HttpClient.h
    │   ├── util.h
    │   └── WebScraper.h
    └── src/
        ├── main.cpp
        ├── Config.cpp
        ├── HtmlParser.cpp
        ├── HttpClient.cpp
        ├── util.cpp
        └── WebScraper.cpp

🔧 Dependencies (Managed via vcpkg)

| Library | Purpose |
| --- | --- |
| Intel TBB | Parallel execution runtime |
| CPR | Modern C++ HTTP client |
| Lexbor | Fast HTML5 parsing |
| nlohmann-json | JSON serialization |

To ensure reproducible environments, dependencies are declared in vcpkg.json.


🚀 Build Instructions (Cross‑Platform via CMake)

1. Configure

cmake -B build -S . -DCMAKE_TOOLCHAIN_FILE=$VCPKG_ROOT/scripts/buildsystems/vcpkg.cmake

2. Build

cmake --build build --config Release

3. Run

./build/Release/ParallelWebScraper

For Windows (PowerShell):

cmake --preset vs2022-release
cmake --build --preset vs2022-release --config Release
.\build\vs2022\Release\ParallelWebScraper.exe

⚙️ Configuration

Modify config.json:

{
  "baseUrl": "https://books.toscrape.com",
  "outputFile": "data/results.json",
  "maxConcurrency": 0,
  "maxRetries": 3,
  "timeoutMs": 30000
}
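How a setting like `maxRetries` might be applied can be illustrated with a small retry loop. This is a sketch under stated assumptions, not the project's `HttpClient` logic: the function name `fetchWithRetry` is hypothetical, the fetch operation is passed in as a callback instead of a real HTTP call, and `maxRetries` is assumed to count retries after the first attempt (so at most `maxRetries + 1` calls).

```cpp
#include <functional>
#include <optional>
#include <string>

// Calls `fetch` until it returns a value or the attempts are used up.
// Assumption: maxRetries counts retries after the first attempt, so
// `fetch` runs at most maxRetries + 1 times.
std::optional<std::string> fetchWithRetry(
    const std::function<std::optional<std::string>()>& fetch,
    int maxRetries) {
    for (int attempt = 0; attempt <= maxRetries; ++attempt) {
        if (auto body = fetch()) {
            return body;  // success: stop retrying
        }
    }
    return std::nullopt;  // every attempt failed
}
```

Passing the fetch step as a callback keeps the retry policy testable without network access; a real client would likely also wait between attempts and apply the configured timeout to each request.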

📊 Example Output (results.json)

{
  "totalBooks": 1000,
  "statistics": {
    "averagePrice": 35.67,
    "fiveStarBooks": 234
  },
  "books": [{ "title": "Example Book", "price": 35.67, "rating": "Five" }]
}
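The statistics in the example output can be derived from the book list in a single pass. The sketch below assumes hypothetical `BookSummary` and `Statistics` types whose members mirror the JSON keys shown above; it is not the project's actual aggregation code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical types mirroring the JSON keys in the example output.
struct BookSummary {
    std::string title;
    double price;
    std::string rating;  // rating as a word, e.g. "Five"
};

struct Statistics {
    std::size_t totalBooks = 0;
    double averagePrice = 0.0;
    std::size_t fiveStarBooks = 0;
};

// Computes the count, mean price, and number of five-star books.
// An empty input yields all-zero statistics (no division by zero).
Statistics summarize(const std::vector<BookSummary>& books) {
    Statistics stats;
    stats.totalBooks = books.size();
    double priceSum = 0.0;
    for (const auto& book : books) {
        priceSum += book.price;
        if (book.rating == "Five") {
            ++stats.fiveStarBooks;
        }
    }
    if (!books.empty()) {
        stats.averagePrice = priceSum / static_cast<double>(books.size());
    }
    return stats;
}
```

For example, two books priced 10.0 and 20.0, one of them rated "Five", give `totalBooks == 2`, `averagePrice == 15.0`, and `fiveStarBooks == 1`.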

🎓 Educational Notes

This project demonstrates:

  • Work-stealing scheduling (TBB)
  • Task decomposition strategies
  • I/O vs CPU-bound pipeline balancing
  • Incremental parsing patterns

Useful further study topics:

  • Lock-free data structures
  • Work queues
  • Actor-model pipelines

📝 License

This project is intended for educational use.
The target website, Books to Scrape, is explicitly provided for scraping practice.
