@surina-margarita

Description

A command-line web scraper built in Rust that downloads and processes web pages concurrently. The tool follows links up to a specified depth and organizes downloaded content in a hierarchical directory structure.

Features

  • Concurrent Processing: Uses async/await with configurable worker threads (see the sketch after this list)
  • Depth-Limited Crawling: Limits how many links deep the crawler follows, avoiding infinite loops
  • Same-Domain Filtering: Only follows links within the starting domain
  • Organized Storage: Saves pages in a depth-based directory structure
  • CLI Interface: Easy-to-use command-line interface with a built-in help system
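
The PR does not show its internals here, so the following is only a minimal sketch of how these pieces typically fit together: a breadth-first crawl loop with a depth limit, same-domain filtering, and a bounded number of concurrent fetches. The crate choices (tokio, reqwest, futures, url, scraper) and every name below are assumptions for illustration, not the PR's actual code.

```rust
// Illustrative sketch only -- crate choices and helper names are assumptions.
// Cargo deps assumed: tokio (features = ["macros", "rt-multi-thread"]),
// reqwest, futures, url, scraper.
use futures::stream::{self, StreamExt};
use std::collections::HashSet;
use url::Url;

/// Fetch one page and return (url, body), or None on any error.
async fn fetch(url: Url) -> Option<(Url, String)> {
    let body = reqwest::get(url.as_str()).await.ok()?.text().await.ok()?;
    Some((url, body))
}

/// Extract links from an HTML body, keeping only those on `domain`.
fn extract_links(base: &Url, body: &str, domain: &str) -> Vec<Url> {
    let document = scraper::Html::parse_document(body);
    let selector = scraper::Selector::parse("a[href]").unwrap();
    document
        .select(&selector)
        .filter_map(|a| a.value().attr("href"))
        .filter_map(|href| base.join(href).ok())
        .filter(|u| u.domain() == Some(domain))
        .collect()
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical hard-coded settings; the real tool reads these from the CLI.
    let start = Url::parse("https://example.com")?;
    let max_depth = 2;
    let workers = 4;

    let domain = start.domain().unwrap_or("").to_string();
    let mut seen: HashSet<Url> = HashSet::from([start.clone()]);
    let mut frontier = vec![start];

    for depth in 0..=max_depth {
        // Fetch the current depth level with at most `workers` requests in flight.
        let level = std::mem::take(&mut frontier);
        let pages: Vec<(Url, String)> = stream::iter(level)
            .map(fetch)
            .buffer_unordered(workers)
            .filter_map(|page| async move { page })
            .collect()
            .await;

        for (url, body) in &pages {
            // The real tool would write `body` into a depth-based directory here.
            println!("depth {depth}: {url} ({} bytes)", body.len());
            for link in extract_links(url, body, &domain) {
                if seen.insert(link.clone()) {
                    frontier.push(link);
                }
            }
        }
    }
    Ok(())
}
```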

Usage

# Install the tool
cargo install --path .

# Basic usage
webcrawl https://example.com

# Advanced usage
webcrawl --output ./crawled_data --depth 3 --workers 6 https://example.com
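
The flags above map onto a small argument struct. Below is a minimal sketch assuming the clap crate with its derive feature; the field names mirror the flags shown, but the defaults and the parser actually used in this PR are assumptions.

```rust
// Illustrative sketch of the CLI surface shown above, assuming `clap` with
// the "derive" feature. Defaults are placeholders, not taken from the PR.
use clap::Parser;
use std::path::PathBuf;

#[derive(Parser, Debug)]
#[command(name = "webcrawl", about = "Concurrent, depth-limited web scraper")]
struct Args {
    /// Starting URL; only links within its domain are followed.
    url: String,

    /// Directory where crawled pages are stored.
    #[arg(long, default_value = "./crawled_data")]
    output: PathBuf,

    /// Maximum link depth to follow from the starting page.
    #[arg(long, default_value_t = 2)]
    depth: usize,

    /// Number of concurrent workers.
    #[arg(long, default_value_t = 4)]
    workers: usize,
}

fn main() {
    let args = Args::parse();
    println!("{args:?}");
}
```

With a derive-based layout like this, `webcrawl --help` is generated automatically, which would cover the help-system feature listed above.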
