This is a minimal Python project that demonstrates crawling web pages from a single site. It respects robots.txt, stays within the seed site's domain, rate-limits requests, and extracts links from HTML pages.
- Single-site restriction (same hostname as the seed URL)
- Respects `robots.txt` for the given user agent
- Simple rate limiting between requests
- Deduplicates visited URLs
- Extracts and follows in-domain `<a href>` links
- CLI with options for max pages and rate limit
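
Under the hood these behaviors compose into a small fetch-parse-enqueue loop. The sketch below is illustrative only, assuming `requests` and `beautifulsoup4` as the HTTP and HTML libraries; it is not the actual contents of `crawler/crawler.py`.

```python
# Illustrative sketch of the crawl loop described above; not the real crawler/crawler.py.
import time
import urllib.robotparser
from collections import deque
from urllib.parse import urljoin, urldefrag, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(seed, max_pages=30, rate_limit=0.5, user_agent="SingleSiteCrawler/1.0"):
    host = urlparse(seed).hostname

    # Respect robots.txt for the given user agent.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(urljoin(seed, "/robots.txt"))
    robots.read()

    visited = set()          # deduplicate visited URLs
    queue = deque([seed])

    while queue and len(visited) < max_pages:
        url, _ = urldefrag(queue.popleft())   # drop fragments, keep query strings
        if url in visited or not robots.can_fetch(user_agent, url):
            continue
        visited.add(url)

        resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
        time.sleep(rate_limit)                # simple rate limiting between requests

        # Only parse HTML responses for links.
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue

        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).hostname == host:   # stay on the seed's host
                queue.append(link)

    return visited
```
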
- Create and activate a virtual environment (recommended):

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run a crawl:

  ```bash
  python -m crawler.main --seed https://example.com --max-pages 30 --rate-limit 0.5
  ```

Options:
- `--seed`: Starting URL (must include scheme, e.g. `https://`)
- `--max-pages`: Maximum number of pages to visit (default: 30)
- `--rate-limit`: Seconds to wait between requests (default: 0.5)
- `--user-agent`: User agent string used for requests and robots.txt (default: `SingleSiteCrawler/1.0`)
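
For reference, here is a minimal sketch of how `crawler/main.py` might wire these options together with `argparse`. The option names and defaults mirror the list above, but the structure of the real module is an assumption.

```python
# Hypothetical CLI wiring for the options listed above; the real
# crawler/main.py may be organized differently.
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description="Crawl a single site from a seed URL")
    parser.add_argument("--seed", required=True,
                        help="Starting URL (must include scheme, e.g. https://)")
    parser.add_argument("--max-pages", type=int, default=30,
                        help="Maximum number of pages to visit")
    parser.add_argument("--rate-limit", type=float, default=0.5,
                        help="Seconds to wait between requests")
    parser.add_argument("--user-agent", default="SingleSiteCrawler/1.0",
                        help="User agent used for requests and robots.txt")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(f"Would crawl {args.seed} "
          f"(max {args.max_pages} pages, {args.rate_limit}s between requests)")
```
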
- Only `text/html` pages are parsed for links.
- Fragments are removed from URLs; query strings are preserved.
- If you deploy behind a proxy or need custom headers, extend `crawler/crawler.py` (a sketch follows below).
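
As a hedged illustration of that last note: if the crawler issues its HTTP calls through a `requests.Session` (an assumption about its internals), custom headers and a proxy could be injected roughly like this. The header and proxy values are placeholders.

```python
# Illustrative only: adding custom headers and a proxy, assuming the
# crawler's HTTP calls go through a requests.Session.
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "SingleSiteCrawler/1.0",
    "X-Internal-Auth": "placeholder-token",   # hypothetical custom header
})
session.proxies.update({
    "http": "http://proxy.internal:3128",     # hypothetical proxy endpoint
    "https": "http://proxy.internal:3128",
})

resp = session.get("https://example.com", timeout=10)
print(resp.status_code, resp.headers.get("Content-Type"))
```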