📚 Parallel Web Scraper

A high-performance parallel web scraper written in modern C++17, designed to extract structured book data from Books to Scrape.
This project was developed as part of a university course in Parallel Programming, demonstrating:

Efficient task parallelism using Intel TBB
Modern C++ coding practices
Clean modular architecture
Real-world HTML parsing and web interaction
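
The task-parallel idea in the list above can be sketched without any third-party dependency. The block below is only an illustration: it uses plain `std::thread` with a static, strided split of the work, whereas the project itself relies on Intel TBB, whose scheduler balances load dynamically. The names `processUrl` and `parallelMap` are hypothetical and do not come from the repository.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-URL scrape step (no network access here).
std::size_t processUrl(const std::string& url) { return url.size(); }

// Applies `work` to every input using `workers` threads.
// Thread w handles indices w, w + workers, w + 2 * workers, ...
std::vector<std::size_t> parallelMap(
    const std::vector<std::string>& inputs,
    unsigned workers,
    const std::function<std::size_t(const std::string&)>& work) {
    workers = std::max(1u, workers);  // guard against a zero worker count
    std::vector<std::size_t> results(inputs.size());
    std::vector<std::thread> pool;
    pool.reserve(workers);
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            for (std::size_t i = w; i < inputs.size(); i += workers) {
                results[i] = work(inputs[i]);  // each slot is written by exactly one thread
            }
        });
    }
    for (auto& t : pool) t.join();
    assert(results.size() == inputs.size());
    return results;
}
```

Because every thread writes to its own slots, no locking is needed. A static split like this can leave threads idle when some pages are slower than others, which is the problem a work-stealing runtime addresses.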

🎯 Overview

The scraper automatically discovers catalog pages, extracts book details, collects statistics, and exports all results in JSON format.

Extracted Information:

Title
Price (incl./excl. tax)
Availability
Rating
Category
Description
UPC & Metadata
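
As a rough sketch, the fields listed above could map onto a record like the following. The actual layout of `Book.h` is not shown in this README, so every member name here is an assumption. The helper `ratingToStars` is also hypothetical; it converts the word form of a rating (the example output uses "Five") into a number.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Assumed shape of a scraped book record; the real Book.h may differ.
struct Book {
    std::string title;
    double priceInclTax = 0.0;
    double priceExclTax = 0.0;
    std::string availability;
    std::string rating;       // word form, e.g. "Five"
    std::string category;
    std::string description;
    std::string upc;
};

// Maps "One".."Five" to 1..5; returns std::nullopt for anything else.
std::optional<int> ratingToStars(std::string_view word) {
    constexpr std::array<std::string_view, 5> words{
        "One", "Two", "Three", "Four", "Five"};
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (words[i] == word) return static_cast<int>(i) + 1;
    }
    return std::nullopt;
}
```

Returning `std::optional` keeps an unrecognized rating distinguishable from a valid one, so a parsing failure is not silently recorded as zero stars.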

Key Features

Feature	Description
⚡ Parallel Processing	Uses Intel TBB for concurrency and high throughput
🔍 Automatic Page Discovery	Dynamically crawls catalog pages
🪶 HTML5 Parsing	Uses Lexbor for high-speed DOM extraction
🌐 Modern HTTP	Uses CPR (libcurl wrapper) for requests
📊 Analytics Output	Produces detailed aggregated statistics
💾 JSON Export	Structured output for further analysis
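
One way to picture automatic page discovery is a loop that keeps following a "next page" link until none is left. The README does not say how the scraper does this, so the sketch below is an assumption. `discoverPages` and `NextPageFn` are hypothetical names, and the callback stands in for fetching a page and locating its next link.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical callback: given a page URL, return the next page's URL, if any.
using NextPageFn =
    std::function<std::optional<std::string>(const std::string&)>;

// Follows next-page links starting at firstUrl, visiting at most maxPages pages.
std::vector<std::string> discoverPages(const std::string& firstUrl,
                                       const NextPageFn& nextPage,
                                       std::size_t maxPages) {
    std::vector<std::string> pages;
    std::optional<std::string> current = firstUrl;
    while (current && pages.size() < maxPages) {
        pages.push_back(*current);
        current = nextPage(pages.back());  // std::nullopt ends the crawl
    }
    return pages;
}
```

The `maxPages` cap is a safety net against link cycles. Once the list of pages is known, the pages can be processed independently, which is what makes the later extraction step easy to parallelize.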

🏗️ Project Structure

parallel-web-scraper/
├── CMakeLists.txt
├── CMakePresets.json
├── vcpkg.json
├── config.json
├── data/
│   ├── urls.txt
│   └── results.json
└── ParallelWebScraper/
    ├── include/
    │   ├── Book.h
    │   ├── Config.h
    │   ├── HtmlParser.h
    │   ├── HttpClient.h
    │   ├── util.h
    │   └── WebScraper.h
    └── src/
        ├── main.cpp
        ├── Config.cpp
        ├── HtmlParser.cpp
        ├── HttpClient.cpp
        ├── util.cpp
        └── WebScraper.cpp

🔧 Dependencies (Managed via vcpkg)

Library	Purpose
Intel TBB	Parallel execution runtime
CPR	Modern C++ HTTP client
Lexbor	Fast HTML5 parsing
nlohmann-json	JSON serialization

To ensure reproducible environments, dependencies are declared in vcpkg.json.

🚀 Build Instructions (Cross‑Platform via CMake)

1. Configure

cmake -B build -S . -DCMAKE_TOOLCHAIN_FILE=$VCPKG_ROOT/scripts/buildsystems/vcpkg.cmake

2. Build

cmake --build build --config Release

3. Run

./build/Release/ParallelWebScraper

For Windows (PowerShell):

cmake --preset vs2022-release
cmake --build --preset vs2022-release --config Release
.\build\vs2022\Release\ParallelWebScraper.exe

⚙️ Configuration

Modify config.json:

{
  "baseUrl": "https://books.toscrape.com",
  "outputFile": "data/results.json",
  "maxConcurrency": 0,
  "maxRetries": 3,
  "timeoutMs": 30000
}
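
The settings above might be consumed roughly as follows. This is a sketch under two assumptions that the README does not confirm: that `"maxConcurrency": 0` means "choose automatically from the hardware", and that `maxRetries` counts extra attempts after the first one. The struct and both helper functions are hypothetical and are not taken from `Config.h`.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <thread>

// Mirrors the keys of config.json; defaults copied from the example above.
struct Config {
    std::string baseUrl = "https://books.toscrape.com";
    std::string outputFile = "data/results.json";
    unsigned maxConcurrency = 0;
    int maxRetries = 3;
    int timeoutMs = 30000;
};

// Assumption: 0 means "auto", resolved from the number of hardware threads.
unsigned effectiveConcurrency(const Config& cfg) {
    if (cfg.maxConcurrency != 0) return cfg.maxConcurrency;
    const unsigned hw = std::thread::hardware_concurrency();  // may be 0 if unknown
    return hw != 0 ? hw : 1;
}

// Assumption: one initial attempt plus up to maxRetries further attempts.
bool withRetries(const Config& cfg, const std::function<bool()>& attempt) {
    for (int i = 0; i <= cfg.maxRetries; ++i) {
        if (attempt()) return true;
    }
    return false;
}
```

A real client would typically also wait between attempts and apply `timeoutMs` to each request; both are left out to keep the sketch short.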

📊 Example Output (results.json)

{
  "totalBooks": 1000,
  "statistics": {
    "averagePrice": 35.67,
    "fiveStarBooks": 234
  },
  "books": [{ "title": "Example Book", "price": 35.67, "rating": "Five" }]
}
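
The `statistics` object above could be derived from the scraped records along these lines. This is a minimal sketch: `BookSummary`, `Statistics`, and `summarize` are hypothetical names, and only the three fields visible in the example output are modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Only the fields that appear in the example output.
struct BookSummary {
    std::string title;
    double price;
    std::string rating;  // word form, e.g. "Five"
};

struct Statistics {
    std::size_t totalBooks;
    double averagePrice;
    std::size_t fiveStarBooks;
};

// Computes the aggregate values in a single pass over the books.
Statistics summarize(const std::vector<BookSummary>& books) {
    Statistics stats{books.size(), 0.0, 0};
    if (books.empty()) return stats;  // avoid dividing by zero
    double sum = 0.0;
    for (const auto& book : books) {
        sum += book.price;
        if (book.rating == "Five") ++stats.fiveStarBooks;
    }
    stats.averagePrice = sum / static_cast<double>(books.size());
    return stats;
}
```

In a parallel run, each worker could compute a partial sum and count for its own slice, with the partial results combined at the end, so that no shared counter has to be locked.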

🎓 Educational Notes

This project demonstrates:

Work-stealing scheduling (TBB)
Task decomposition strategies
I/O vs CPU-bound pipeline balancing
Incremental parsing patterns

Useful further study topics:

Lock-free data structures
Work queues
Actor-model pipelines

📝 License

This project is intended for educational use.
The target website, Books to Scrape, is explicitly provided for scraping practice.
