Turn any website into an LLM-ready knowledge base with smart PDF conversion.

Web2Context

Crawl websites into clean, structured PDFs. Optimized for LLM context (NotebookLM, Claude, ChatGPT).

Web2Context is a powerful web crawler that converts entire websites into well-organized PDF documents, perfect for feeding into Large Language Models (LLMs) for context and analysis.

Features

  • 🌐 Smart Crawling: Automatically discovers and crawls all pages within a website
  • 📄 PDF Generation: Converts web pages to clean, structured PDFs
  • 🤖 LLM Optimized: Special mode for AI/LLM consumption with clean output
  • ⚡ Fast & Parallel: Multi-worker architecture for efficient crawling
  • 🔄 Smart Updates: Detects content changes and updates only modified pages
  • 🎯 Domain Scoping: Automatically limits crawling to the same domain
  • 📁 Organized Output: Saves PDFs in organized directories by domain

Requirements

  • Python 3.7 or higher
  • pip (Python package manager)
  • Internet connection

Installation

Step 1: Clone the Repository

If you don't have Git installed, you can download the project as a ZIP file from GitHub and extract it.

Option A: Using Git (Recommended)

git clone https://github.com/unvoidf/web2context.git
cd web2context

Option B: Download ZIP

  1. Go to https://github.com/unvoidf/web2context
  2. Click the green "Code" button
  3. Select "Download ZIP"
  4. Extract the ZIP file to your desired location
  5. Open a terminal/command prompt in the extracted folder

Step 2: Install Dependencies

# Install Python packages
pip install -r requirements.txt

# Install Playwright browser (required for web crawling)
playwright install chromium

Note: If you encounter permission errors, try:

  • Windows: python -m pip install -r requirements.txt
  • Linux/Mac: pip3 install -r requirements.txt or python3 -m pip install -r requirements.txt

Step 3: Verify Installation

python web2context.py --help

If you see help text, installation was successful!

Quick Start

Basic Usage

Crawl a website and save all pages as PDFs:

python web2context.py https://example.com/docs

This will:

  • Crawl all pages starting from the provided URL
  • Save PDFs to results/example-com-pdfs/ directory
  • Use default settings (3 workers, 1 second delay)
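The output directory name follows from the starting URL's domain. The tool's exact naming logic is internal, but the `results/example-com-pdfs/` pattern above can be sketched like this (the function name `output_dir_for` is illustrative, not part of the tool):

```python
from urllib.parse import urlparse

def output_dir_for(url: str) -> str:
    """Derive a results/ subdirectory name from the URL's domain,
    mirroring the 'results/example-com-pdfs/' pattern shown above."""
    host = urlparse(url).netloc.replace("www.", "", 1)
    slug = host.replace(".", "-")          # example.com -> example-com
    return f"results/{slug}-pdfs"

print(output_dir_for("https://example.com/docs"))  # results/example-com-pdfs
```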

LLM Optimized Mode

For best results when using PDFs with LLMs:

python web2context.py https://example.com/docs --llm

This mode enables:

  • ✅ Clean output (no timestamps in headers)
  • ✅ Ignores query parameters (stable URLs)
  • ✅ Smart update detection (only updates changed content)
  • ✅ Auto-scope detection (intelligently determines crawl boundaries)
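Ignoring query parameters means the same page always maps to one stable URL, so revisits and re-crawls hit the same PDF. A minimal sketch of that normalization (illustrative only; the tool's internal normalization may differ):

```python
from urllib.parse import urlparse, urlunparse

def normalize_url(url: str) -> str:
    """Drop the query string and fragment so one page maps to one
    stable URL, as --llm mode does for update detection."""
    p = urlparse(url)
    return urlunparse((p.scheme, p.netloc, p.path, "", "", ""))

print(normalize_url("https://example.com/docs/page?ref=nav#intro"))
# https://example.com/docs/page
```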

Turbo Mode

For maximum speed:

python web2context.py https://example.com/docs --llm --fast

This enables:

  • 10 parallel workers (instead of 3)
  • Minimal delay between requests (0.1s instead of 1s)

Command Line Options

  • url: Required. The starting URL to crawl from
  • --llm: Optimize for AI/LLM usage (clean output, smart updates)
  • --fast: Turbo mode: 10 workers, minimal delay
  • --debug: Enable debug logging for troubleshooting

Output Structure

PDFs are saved in the results/ directory:

results/
└── example-com-pdfs/
    ├── Page_Title_1.pdf
    ├── Page_Title_2.pdf
    ├── ...
    └── .hashes/
        ├── Page_Title_1.hash
        └── Page_Title_2.hash

  • PDF Files: The actual PDF documents
  • .hashes/ Directory: Contains content hashes for change detection
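The `.hashes/` directory is what makes update mode cheap: each page's content is hashed, and a PDF is only regenerated when the stored hash no longer matches. The tool's actual hashing scheme is internal; this is a minimal sketch of the idea (function name and SHA-256 choice are assumptions):

```python
import hashlib
from pathlib import Path

def has_changed(page_text: str, hash_file: Path) -> bool:
    """Compare the page's content hash with the one stored under
    .hashes/. Returns True (and records the new hash) only when
    the content differs, so unchanged PDFs are skipped."""
    digest = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
    if hash_file.exists() and hash_file.read_text() == digest:
        return False  # unchanged: keep the existing PDF
    hash_file.write_text(digest)
    return True
```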

Usage Examples

Example 1: Documentation Site

python web2context.py https://docs.example.com --llm

Example 2: Blog Archive

python web2context.py https://blog.example.com/posts --llm --fast

Example 3: GitHub Documentation

python web2context.py https://github.com/user/repo/wiki --llm

The --llm mode automatically detects GitHub/GitLab repositories and scopes crawling appropriately.
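Auto-scoping on a code host means the crawl stays inside the repository rather than wandering across all of github.com. The README doesn't document the exact rules, but the idea can be sketched as deriving a path prefix from the URL (host list and prefix depth here are assumptions):

```python
from urllib.parse import urlparse

def crawl_scope(url: str) -> str:
    """Guess a path prefix that confines the crawl. For GitHub-style
    hosts, scope to /user/repo/<section>; otherwise allow the whole
    domain. A sketch of the idea, not the tool's actual rules."""
    p = urlparse(url)
    parts = [s for s in p.path.split("/") if s]
    if p.netloc in ("github.com", "gitlab.com") and len(parts) >= 2:
        return "/" + "/".join(parts[:3])
    return "/"

print(crawl_scope("https://github.com/user/repo/wiki"))  # /user/repo/wiki
```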

How It Works

  1. Start: Begins crawling from the provided URL
  2. Discover: Extracts all links from each page
  3. Filter: Only processes links from the same domain
  4. Convert: Generates PDFs from each page
  5. Track: Monitors progress and handles errors gracefully
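The discover/filter steps above amount to a breadth-first crawl confined to the starting domain. A minimal sketch (the `get_links` callback stands in for the real page fetch and PDF conversion, which the tool performs with Playwright):

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(start_url, get_links, max_pages=100):
    """Breadth-first crawl confined to the starting domain.
    get_links(url) -> iterable of hrefs found on that page."""
    domain = urlparse(start_url).netloc
    seen, queue, order = {start_url}, deque([start_url]), []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)                 # the real tool renders a PDF here
        for href in get_links(url):
            link = urljoin(url, href)     # resolve relative links
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

Same-domain filtering happens at enqueue time, so off-site links (the `https://b.com/` case below) are never fetched at all:

```python
site = {"https://a.com/": ["/x", "https://b.com/"], "https://a.com/x": ["/"]}
crawl("https://a.com/", lambda u: site.get(u, []))
# ['https://a.com/', 'https://a.com/x']
```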

Handling Existing Output

If the output directory already exists, you'll be prompted to choose:

  • [O]verwrite: Delete existing files and start fresh
  • [A]ppend: Create new versions (file_1.pdf, file_2.pdf, etc.)
  • [S]kip: Skip files that already exist
  • [U]pdate: Update only changed files (uses content hashing)
  • [Q]uit: Cancel the operation

With --llm mode, update mode is automatically used.
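Append mode's `file_1.pdf, file_2.pdf, ...` convention is simple to reason about: the first free numbered variant wins. A sketch of that naming (illustrative; the tool's exact implementation is internal):

```python
from pathlib import Path

def next_version(path: Path) -> Path:
    """In append mode, return the first free name in the sequence
    file.pdf, file_1.pdf, file_2.pdf, ... as listed above."""
    if not path.exists():
        return path
    n = 1
    while True:
        candidate = path.with_name(f"{path.stem}_{n}{path.suffix}")
        if not candidate.exists():
            return candidate
        n += 1
```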

Troubleshooting

"Command not found: python"

Try using python3 instead:

python3 web2context.py https://example.com

"Module not found" errors

Make sure you installed dependencies:

pip install -r requirements.txt
playwright install chromium

Playwright browser errors

Reinstall the browser:

playwright install chromium --force

Timeout errors

Some pages may take longer to load. The crawler will skip them and continue. Check the summary at the end for failed pages.

Permission errors

On Linux/Mac, you might need sudo for global installations, or use a virtual environment:

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Advanced Usage

Custom Output Directory

By default, the crawler auto-generates an output directory based on the domain; to use a custom output directory, modify the code.

Worker Configuration

Default workers can be adjusted in the code:

  • Basic mode: 3 workers
  • Fast mode: 10 workers
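The worker counts above trade politeness for speed: more workers and a shorter delay finish faster but hit the server harder. A sketch of that multi-worker pattern using a thread pool (the `convert` callback stands in for the real fetch + PDF render; this is not the tool's actual code):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def process_all(urls, convert, workers=3, delay=1.0):
    """Convert pages across a pool of workers, pausing `delay`
    seconds inside each task to stay polite to the server.
    workers=3, delay=1.0 matches basic mode; --fast is roughly
    workers=10, delay=0.1."""
    def task(url):
        result = convert(url)   # stand-in for fetch + PDF render
        time.sleep(delay)
        return result
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, urls))
```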

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

Acknowledgments

Built with:


Made for LLM context preparation 🤖📚
