Single-Site Web Crawler (Python)

This is a minimal Python project that demonstrates crawling web pages from a single site. It respects robots.txt, stays within the seed site's domain, rate-limits requests, and extracts links from HTML pages.

Features

  • Single-site restriction (same hostname as the seed URL)
  • Respects robots.txt for the given user agent
  • Simple rate limiting between requests
  • Deduplicates visited URLs
  • Extracts and follows in-domain <a href> links
  • CLI with options for max pages and rate limit
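
To show how these features fit together, below is a minimal sketch of a single-site crawl loop. It assumes requests and BeautifulSoup as the underlying libraries and uses illustrative names; the actual implementation lives in crawler/crawler.py and may be organised differently.

    import time
    from collections import deque
    from urllib.parse import urljoin, urldefrag, urlparse
    from urllib.robotparser import RobotFileParser

    import requests
    from bs4 import BeautifulSoup

    def crawl(seed, max_pages=30, rate_limit=0.5, user_agent="SingleSiteCrawler/1.0"):
        """Illustrative sketch of a single-site crawl loop (not the project's exact code)."""
        host = urlparse(seed).netloc

        # Respect robots.txt for the seed host.
        robots = RobotFileParser()
        robots.set_url(f"{urlparse(seed).scheme}://{host}/robots.txt")
        robots.read()

        visited = set()                 # deduplicate URLs we have already fetched
        queue = deque([seed])
        session = requests.Session()
        session.headers["User-Agent"] = user_agent

        while queue and len(visited) < max_pages:
            url, _ = urldefrag(queue.popleft())   # drop #fragments, keep query strings
            if url in visited or urlparse(url).netloc != host:
                continue                          # stay on the seed's hostname
            if not robots.can_fetch(user_agent, url):
                continue

            resp = session.get(url, timeout=10)
            visited.add(url)

            # Only text/html responses are parsed for links.
            if "text/html" in resp.headers.get("Content-Type", ""):
                soup = BeautifulSoup(resp.text, "html.parser")
                for a in soup.find_all("a", href=True):
                    queue.append(urljoin(url, a["href"]))

            time.sleep(rate_limit)                # simple rate limiting between requests

        return visited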

Quick start

  1. Create and activate a virtual environment (recommended):

     python -m venv .venv
     source .venv/bin/activate

  2. Install dependencies:

     pip install -r requirements.txt

  3. Run a crawl:

     python -m crawler.main --seed https://example.com --max-pages 30 --rate-limit 0.5

Options:

  • --seed: Starting URL (must include a scheme, e.g. https://)
  • --max-pages: Maximum number of pages to visit (default: 30)
  • --rate-limit: Seconds to wait between requests (default: 0.5)
  • --user-agent: User agent string used for requests and robots.txt (default: SingleSiteCrawler/1.0)
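
As a rough illustration, a CLI exposing these options could be defined with argparse along the following lines; the real crawler/main.py may wire things up differently.

    import argparse

    def parse_args():
        # Sketch of an argument parser matching the options listed above;
        # names and defaults mirror the README, not necessarily the real code.
        parser = argparse.ArgumentParser(description="Single-site web crawler")
        parser.add_argument("--seed", required=True,
                            help="Starting URL (must include a scheme, e.g. https://)")
        parser.add_argument("--max-pages", type=int, default=30,
                            help="Maximum number of pages to visit")
        parser.add_argument("--rate-limit", type=float, default=0.5,
                            help="Seconds to wait between requests")
        parser.add_argument("--user-agent", default="SingleSiteCrawler/1.0",
                            help="User agent string used for requests and robots.txt")
        return parser.parse_args()

    if __name__ == "__main__":
        args = parse_args()
        print(args.seed, args.max_pages, args.rate_limit, args.user_agent)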

Notes

  • Only text/html pages are parsed for links.
  • Fragments are removed from URLs; query strings are preserved.
  • If you deploy behind a proxy or need custom headers, extend crawler/crawler.py (see the sketch below).
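
The exact extension point depends on how crawler/crawler.py is structured, but if it uses a requests session, custom headers and a proxy could be configured roughly like this (configure_session is a hypothetical helper shown only for illustration):

    import requests

    def configure_session(user_agent="SingleSiteCrawler/1.0"):
        # Hypothetical helper: build the session the crawler would use for all requests.
        session = requests.Session()
        session.headers.update({
            "User-Agent": user_agent,
            "Accept-Language": "en",          # example of an extra custom header
        })
        # Route traffic through a proxy if your deployment requires one.
        session.proxies.update({
            "http": "http://proxy.internal:3128",
            "https": "http://proxy.internal:3128",
        })
        return session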
