This is a minimal Python project that demonstrates crawling web pages from a single site. It respects robots.txt, stays within the seed site's domain, rate-limits requests, and extracts links from HTML pages.
- Single-site restriction (same hostname as the seed URL)
- Respects `robots.txt` for the given user agent
- Simple rate limiting between requests
- Deduplicates visited URLs
- Extracts and follows in-domain `<a href>` links
- CLI with options for max pages and rate limit
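
Under the hood these behaviors compose into a small fetch-parse-enqueue loop. The sketch below is illustrative only, assuming `requests` and `beautifulsoup4` as the HTTP and HTML libraries; it is not the actual contents of `crawler/crawler.py`.

```python
# Illustrative sketch of the crawl loop described above; not the real crawler/crawler.py.
import time
import urllib.robotparser
from collections import deque
from urllib.parse import urljoin, urldefrag, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(seed, max_pages=30, rate_limit=0.5, user_agent="SingleSiteCrawler/1.0"):
    host = urlparse(seed).hostname

    # Respect robots.txt for the given user agent.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(urljoin(seed, "/robots.txt"))
    robots.read()

    visited = set()          # deduplicate visited URLs
    queue = deque([seed])

    while queue and len(visited) < max_pages:
        url, _ = urldefrag(queue.popleft())   # drop fragments, keep query strings
        if url in visited or not robots.can_fetch(user_agent, url):
            continue
        visited.add(url)

        resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
        time.sleep(rate_limit)                # simple rate limiting between requests

        # Only parse HTML responses for links.
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue

        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).hostname == host:   # stay on the seed's host
                queue.append(link)

    return visited
```
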
- Create and activate a virtual environment (recommended):

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run a crawl:

  ```bash
  python -m crawler.main --seed https://example.com --max-pages 30 --rate-limit 0.5
  ```

Options:
- `--seed`: Starting URL (must include scheme, e.g. `https://`)
- `--max-pages`: Maximum number of pages to visit (default: 30)
- `--rate-limit`: Seconds to wait between requests (default: 0.5)
- `--user-agent`: User agent string used for requests and robots.txt (default: `SingleSiteCrawler/1.0`)
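
For reference, here is a minimal sketch of how `crawler/main.py` might wire these options together with `argparse`. The option names and defaults mirror the list above, but the structure of the real module is an assumption.

```python
# Hypothetical CLI wiring for the options listed above; the real
# crawler/main.py may be organized differently.
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description="Crawl a single site from a seed URL")
    parser.add_argument("--seed", required=True,
                        help="Starting URL (must include scheme, e.g. https://)")
    parser.add_argument("--max-pages", type=int, default=30,
                        help="Maximum number of pages to visit")
    parser.add_argument("--rate-limit", type=float, default=0.5,
                        help="Seconds to wait between requests")
    parser.add_argument("--user-agent", default="SingleSiteCrawler/1.0",
                        help="User agent used for requests and robots.txt")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(f"Would crawl {args.seed} "
          f"(max {args.max_pages} pages, {args.rate_limit}s between requests)")
```
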
- Only `text/html` pages are parsed for links.
- Fragments are removed from URLs; query strings are preserved.
- If you deploy behind a proxy or need custom headers, extend `crawler/crawler.py` (a sketch follows below).
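
As a hedged illustration of that last note: if the crawler issues its HTTP calls through a `requests.Session` (an assumption about its internals), custom headers and a proxy could be injected roughly like this. The header and proxy values are placeholders.

```python
# Illustrative only: adding custom headers and a proxy, assuming the
# crawler's HTTP calls go through a requests.Session.
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "SingleSiteCrawler/1.0",
    "X-Internal-Auth": "placeholder-token",   # hypothetical custom header
})
session.proxies.update({
    "http": "http://proxy.internal:3128",     # hypothetical proxy endpoint
    "https": "http://proxy.internal:3128",
})

resp = session.get("https://example.com", timeout=10)
print(resp.status_code, resp.headers.get("Content-Type"))
```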