A high-performance parallel web scraper written in modern C++17, designed to extract structured book data from Books to Scrape.
This project was developed as part of a university course in Parallel Programming, demonstrating:
- Efficient task parallelism using Intel TBB
- Modern C++ coding practices
- Clean modular architecture
- Real-world HTML parsing and web interaction
The scraper automatically discovers catalog pages, extracts book details, collects statistics, and exports all results in JSON format. Each book record includes:
- Title
- Price (incl./excl. tax)
- Availability
- Rating
- Category
- Description
- UPC & Metadata
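
The fields above map naturally onto a plain record type. The sketch below is one possible shape, not the project's actual `Book.h`: the struct layout, the field names, and the `ratingToStars` helper are all hypothetical. It assumes ratings arrive as English words, as in the sample output (`"Five"`).

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical record type; the real Book.h may use different names and types.
struct Book {
    std::string title;
    double priceInclTax = 0.0;
    double priceExclTax = 0.0;
    std::string availability;
    std::string rating;       // rating word, e.g. "Five"
    std::string category;
    std::string description;
    std::string upc;
};

// Converts a rating word ("One" .. "Five") to a star count.
// Returns std::nullopt for any other input instead of guessing.
inline std::optional<int> ratingToStars(std::string_view word) {
    constexpr std::string_view names[] = {"One", "Two", "Three", "Four", "Five"};
    for (int i = 0; i < 5; ++i) {
        if (word == names[i]) return i + 1;
    }
    return std::nullopt;
}
```

Returning `std::optional<int>` rather than a sentinel such as `0` keeps an unparseable rating distinguishable from a valid one when statistics are computed later.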
| Feature | Description |
|---|---|
| ⚡ Parallel Processing | Uses Intel TBB for concurrency and high throughput |
| 🔍 Automatic Page Discovery | Dynamically crawls catalog pages |
| 🪶 HTML5 Parsing | Uses Lexbor for high-speed DOM extraction |
| 🌐 Modern HTTP | Uses CPR (libcurl wrapper) for requests |
| 📊 Analytics Output | Produces detailed aggregated statistics |
| 💾 JSON Export | Structured output for further analysis |
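
As a rough illustration of the analytics step, the sketch below aggregates the two statistics that appear in this README's sample output, `averagePrice` and `fiveStarBooks`. The `BookSummary` and `Statistics` types and the `computeStatistics` function are hypothetical names chosen for this example; the project's own aggregation code may be organized differently.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record holding only the fields shown in the sample output.
struct BookSummary {
    std::string title;
    double price = 0.0;
    std::string rating;  // rating word, e.g. "Five"
};

// Hypothetical aggregate mirroring the sample output's statistics fields.
struct Statistics {
    std::size_t totalBooks = 0;
    double averagePrice = 0.0;
    std::size_t fiveStarBooks = 0;
};

// Single pass over the books: sums prices and counts five-star ratings.
// An empty input yields an average of 0.0 rather than dividing by zero.
inline Statistics computeStatistics(const std::vector<BookSummary>& books) {
    Statistics stats;
    stats.totalBooks = books.size();
    double sum = 0.0;
    for (const auto& book : books) {
        sum += book.price;
        if (book.rating == "Five") ++stats.fiveStarBooks;
    }
    if (!books.empty()) {
        stats.averagePrice = sum / static_cast<double>(books.size());
    }
    return stats;
}
```

This version runs sequentially after all books have been collected, which avoids any shared mutable state between worker tasks.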
parallel-web-scraper/
├── CMakeLists.txt
├── CMakePresets.json
├── vcpkg.json
├── config.json
├── data/
│   ├── urls.txt
│   └── results.json
└── ParallelWebScraper/
    ├── include/
    │   ├── Book.h
    │   ├── Config.h
    │   ├── HtmlParser.h
    │   ├── HttpClient.h
    │   ├── util.h
    │   └── WebScraper.h
    └── src/
        ├── main.cpp
        ├── Config.cpp
        ├── HtmlParser.cpp
        ├── HttpClient.cpp
        ├── util.cpp
        └── WebScraper.cpp
| Library | Purpose |
|---|---|
| Intel TBB | Parallel execution runtime |
| CPR | Modern C++ HTTP client |
| Lexbor | Fast HTML5 parsing |
| nlohmann-json | JSON serialization |
To ensure reproducible environments, dependencies are declared in vcpkg.json.
cmake -B build -S . -DCMAKE_TOOLCHAIN_FILE=$VCPKG_ROOT/scripts/buildsystems/vcpkg.cmake
cmake --build build --config Release
./build/Release/ParallelWebScraper

For Windows (PowerShell):
cmake --preset vs2022-release
cmake --build --preset vs2022-release --config Release
.\build\vs2022\Release\ParallelWebScraper.exe

Modify config.json:
{
  "baseUrl": "https://books.toscrape.com",
  "outputFile": "data/results.json",
  "maxConcurrency": 0,
  "maxRetries": 3,
  "timeoutMs": 30000
}

Example output:

{
  "totalBooks": 1000,
  "statistics": {
    "averagePrice": 35.67,
    "fiveStarBooks": 234
  },
  "books": [{ "title": "Example Book", "price": 35.67, "rating": "Five" }]
}

This project demonstrates:
- Work-stealing scheduling (TBB)
- Task decomposition strategies
- I/O vs CPU-bound pipeline balancing
- Incremental parsing patterns
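
Task decomposition can be sketched without any dependencies. The example below is not the project's TBB code: it substitutes `std::async` for TBB's scheduler, so there is no work stealing, and each chunk is bound to one task for its whole lifetime. `processPage` is a hypothetical stand-in for fetching and parsing one catalog page, and its return value of 20 is a placeholder.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for "fetch and parse one page".
// Returns the number of books found; 20 is a placeholder value.
inline std::size_t processPage(const std::string& url) {
    return url.empty() ? 0 : 20;
}

// Splits `urls` into at most `taskCount` contiguous chunks, processes each
// chunk in its own asynchronous task, and sums the per-chunk results.
inline std::size_t processAll(const std::vector<std::string>& urls,
                              std::size_t taskCount) {
    if (urls.empty()) return 0;
    taskCount = std::clamp<std::size_t>(taskCount, 1, urls.size());
    const std::size_t chunk = (urls.size() + taskCount - 1) / taskCount;  // ceiling division

    std::vector<std::future<std::size_t>> tasks;
    for (std::size_t begin = 0; begin < urls.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, urls.size());
        tasks.push_back(std::async(std::launch::async, [&urls, begin, end] {
            std::size_t found = 0;
            for (std::size_t i = begin; i < end; ++i) found += processPage(urls[i]);
            return found;
        }));
    }

    // Each task returns its own partial count, so no shared counter is needed.
    std::size_t total = 0;
    for (auto& task : tasks) total += task.get();
    return total;
}
```

With 50 URLs and `taskCount` set to 4, each task handles up to 13 pages. Static chunking like this can leave tasks idle when some pages are slower than others, which is the imbalance a work-stealing scheduler is meant to reduce.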
Useful further study topics:
- Lock-free data structures
- Work queues
- Actor-model pipelines
This project is intended for educational use.
The target website, Books to Scrape, is explicitly provided for scraping practice.