@surina-margarita

Description

A command-line web scraper built in Rust that downloads and processes web pages concurrently. The tool follows links up to a specified depth and organizes downloaded content in a hierarchical directory structure.

Features

  • Concurrent Processing: Uses async/await with configurable worker threads (see the sketch after this list)
  • Depth-Limited Crawling: Limits how many links deep the crawler follows, avoiding infinite loops
  • Same-Domain Filtering: Only follows links within the starting domain
  • Organized Storage: Saves pages in a depth-based directory structure
  • CLI Interface: Easy-to-use command-line interface with a built-in help system
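
The PR does not show its internals here, so the following is only a minimal sketch of how these pieces typically fit together: a breadth-first crawl loop with a depth limit, same-domain filtering, and a bounded number of concurrent fetches. The crate choices (tokio, reqwest, futures, url, scraper) and every name below are assumptions for illustration, not the PR's actual code.

```rust
// Illustrative sketch only -- crate choices and helper names are assumptions.
// Cargo deps assumed: tokio (features = ["macros", "rt-multi-thread"]),
// reqwest, futures, url, scraper.
use futures::stream::{self, StreamExt};
use std::collections::HashSet;
use url::Url;

/// Fetch one page and return (url, body), or None on any error.
async fn fetch(url: Url) -> Option<(Url, String)> {
    let body = reqwest::get(url.as_str()).await.ok()?.text().await.ok()?;
    Some((url, body))
}

/// Extract links from an HTML body, keeping only those on `domain`.
fn extract_links(base: &Url, body: &str, domain: &str) -> Vec<Url> {
    let document = scraper::Html::parse_document(body);
    let selector = scraper::Selector::parse("a[href]").unwrap();
    document
        .select(&selector)
        .filter_map(|a| a.value().attr("href"))
        .filter_map(|href| base.join(href).ok())
        .filter(|u| u.domain() == Some(domain))
        .collect()
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical hard-coded settings; the real tool reads these from the CLI.
    let start = Url::parse("https://example.com")?;
    let max_depth = 2;
    let workers = 4;

    let domain = start.domain().unwrap_or("").to_string();
    let mut seen: HashSet<Url> = HashSet::from([start.clone()]);
    let mut frontier = vec![start];

    for depth in 0..=max_depth {
        // Fetch the current depth level with at most `workers` requests in flight.
        let level = std::mem::take(&mut frontier);
        let pages: Vec<(Url, String)> = stream::iter(level)
            .map(fetch)
            .buffer_unordered(workers)
            .filter_map(|page| async move { page })
            .collect()
            .await;

        for (url, body) in &pages {
            // The real tool would write `body` into a depth-based directory here.
            println!("depth {depth}: {url} ({} bytes)", body.len());
            for link in extract_links(url, body, &domain) {
                if seen.insert(link.clone()) {
                    frontier.push(link);
                }
            }
        }
    }
    Ok(())
}
```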

Usage

# Install the tool
cargo install --path .

# Basic usage
webcrawl https://example.com

# Advanced usage
webcrawl --output ./crawled_data --depth 3 --workers 6 https://example.com
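
The flags above map onto a small argument struct. Below is a minimal sketch assuming the clap crate with its derive feature; the field names mirror the flags shown, but the defaults and the parser actually used in this PR are assumptions.

```rust
// Illustrative sketch of the CLI surface shown above, assuming `clap` with
// the "derive" feature. Defaults are placeholders, not taken from the PR.
use clap::Parser;
use std::path::PathBuf;

#[derive(Parser, Debug)]
#[command(name = "webcrawl", about = "Concurrent, depth-limited web scraper")]
struct Args {
    /// Starting URL; only links within its domain are followed.
    url: String,

    /// Directory where crawled pages are stored.
    #[arg(long, default_value = "./crawled_data")]
    output: PathBuf,

    /// Maximum link depth to follow from the starting page.
    #[arg(long, default_value_t = 2)]
    depth: usize,

    /// Number of concurrent workers.
    #[arg(long, default_value_t = 4)]
    workers: usize,
}

fn main() {
    let args = Args::parse();
    println!("{args:?}");
}
```

With a derive-based layout like this, `webcrawl --help` is generated automatically, which would cover the help-system feature listed above.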
