Turn any website into an LLM-ready knowledge base with smart PDF conversion.

Web2Context

Crawl websites into clean, structured PDFs. Optimized for LLM context (NotebookLM, Claude, ChatGPT).

Web2Context is a powerful web crawler that converts entire websites into well-organized PDF documents, perfect for feeding into Large Language Models (LLMs) for context and analysis.

Features

  • 🌐 Smart Crawling: Automatically discovers and crawls all pages within a website
  • 📄 PDF Generation: Converts web pages to clean, structured PDFs
  • 🤖 LLM Optimized: Special mode for AI/LLM consumption with clean output
  • ⚡ Fast & Parallel: Multi-worker architecture for efficient crawling
  • 🔄 Smart Updates: Detects content changes and updates only modified pages
  • 🎯 Domain Scoping: Automatically limits crawling to the same domain
  • 📁 Organized Output: Saves PDFs in organized directories by domain

Requirements

  • Python 3.7 or higher
  • pip (Python package manager)
  • Internet connection

Installation

Step 1: Clone the Repository

If you don't have Git installed, you can download the project as a ZIP file from GitHub and extract it.

Option A: Using Git (Recommended)

git clone https://github.com/unvoidf/web2context.git
cd web2context

Option B: Download ZIP

  1. Go to https://github.com/unvoidf/web2context
  2. Click the green "Code" button
  3. Select "Download ZIP"
  4. Extract the ZIP file to your desired location
  5. Open a terminal/command prompt in the extracted folder

Step 2: Install Dependencies

# Install Python packages
pip install -r requirements.txt

# Install Playwright browser (required for web crawling)
playwright install chromium

Note: If you encounter permission errors, try:

  • Windows: python -m pip install -r requirements.txt
  • Linux/Mac: pip3 install -r requirements.txt or python3 -m pip install -r requirements.txt

Step 3: Verify Installation

python web2context.py --help

If you see help text, installation was successful!

Quick Start

Basic Usage

Crawl a website and save all pages as PDFs:

python web2context.py https://example.com/docs

This will:

  • Crawl all pages starting from the provided URL
  • Save PDFs to results/example-com-pdfs/ directory
  • Use default settings (3 workers, 1 second delay)
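The output directory name follows from the starting URL's domain. The tool's exact naming logic is internal, but the `results/example-com-pdfs/` pattern above can be sketched like this (the function name `output_dir_for` is illustrative, not part of the tool):

```python
from urllib.parse import urlparse

def output_dir_for(url: str) -> str:
    """Derive a results/ subdirectory name from the URL's domain,
    mirroring the 'results/example-com-pdfs/' pattern shown above."""
    host = urlparse(url).netloc.replace("www.", "", 1)
    slug = host.replace(".", "-")          # example.com -> example-com
    return f"results/{slug}-pdfs"

print(output_dir_for("https://example.com/docs"))  # results/example-com-pdfs
```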

LLM Optimized Mode

For best results when using PDFs with LLMs:

python web2context.py https://example.com/docs --llm

This mode enables:

  • ✅ Clean output (no timestamps in headers)
  • ✅ Ignores query parameters (stable URLs)
  • ✅ Smart update detection (only updates changed content)
  • ✅ Auto-scope detection (intelligently determines crawl boundaries)
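Ignoring query parameters means the same page always maps to one stable URL, so revisits and re-crawls hit the same PDF. A minimal sketch of that normalization (illustrative only; the tool's internal normalization may differ):

```python
from urllib.parse import urlparse, urlunparse

def normalize_url(url: str) -> str:
    """Drop the query string and fragment so one page maps to one
    stable URL, as --llm mode does for update detection."""
    p = urlparse(url)
    return urlunparse((p.scheme, p.netloc, p.path, "", "", ""))

print(normalize_url("https://example.com/docs/page?ref=nav#intro"))
# https://example.com/docs/page
```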

Turbo Mode

For maximum speed:

python web2context.py https://example.com/docs --llm --fast

This enables:

  • 10 parallel workers (instead of 3)
  • Minimal delay between requests (0.1s instead of 1s)

Command Line Options

  • url: Required. The starting URL to crawl from
  • --llm: Optimize for AI/LLM usage (clean output, smart updates)
  • --fast: Turbo mode: 10 workers, minimal delay
  • --debug: Enable debug logging for troubleshooting

Output Structure

PDFs are saved in the results/ directory:

results/
└── example-com-pdfs/
    ├── Page_Title_1.pdf
    ├── Page_Title_2.pdf
    ├── ...
    └── .hashes/
        ├── Page_Title_1.hash
        └── Page_Title_2.hash

  • PDF Files: The actual PDF documents
  • .hashes/ Directory: Contains content hashes for change detection
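The `.hashes/` directory is what makes update mode cheap: each page's content is hashed, and a PDF is only regenerated when the stored hash no longer matches. The tool's actual hashing scheme is internal; this is a minimal sketch of the idea (function name and SHA-256 choice are assumptions):

```python
import hashlib
from pathlib import Path

def has_changed(page_text: str, hash_file: Path) -> bool:
    """Compare the page's content hash with the one stored under
    .hashes/. Returns True (and records the new hash) only when
    the content differs, so unchanged PDFs are skipped."""
    digest = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
    if hash_file.exists() and hash_file.read_text() == digest:
        return False  # unchanged: keep the existing PDF
    hash_file.write_text(digest)
    return True
```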

Usage Examples

Example 1: Documentation Site

python web2context.py https://docs.example.com --llm

Example 2: Blog Archive

python web2context.py https://blog.example.com/posts --llm --fast

Example 3: GitHub Documentation

python web2context.py https://github.com/user/repo/wiki --llm

The --llm mode automatically detects GitHub/GitLab repositories and scopes crawling appropriately.
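Auto-scoping on a code host means the crawl stays inside the repository rather than wandering across all of github.com. The README doesn't document the exact rules, but the idea can be sketched as deriving a path prefix from the URL (host list and prefix depth here are assumptions):

```python
from urllib.parse import urlparse

def crawl_scope(url: str) -> str:
    """Guess a path prefix that confines the crawl. For GitHub-style
    hosts, scope to /user/repo/<section>; otherwise allow the whole
    domain. A sketch of the idea, not the tool's actual rules."""
    p = urlparse(url)
    parts = [s for s in p.path.split("/") if s]
    if p.netloc in ("github.com", "gitlab.com") and len(parts) >= 2:
        return "/" + "/".join(parts[:3])
    return "/"

print(crawl_scope("https://github.com/user/repo/wiki"))  # /user/repo/wiki
```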

How It Works

  1. Start: Begins crawling from the provided URL
  2. Discover: Extracts all links from each page
  3. Filter: Only processes links from the same domain
  4. Convert: Generates PDFs from each page
  5. Track: Monitors progress and handles errors gracefully
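The discover/filter steps above amount to a breadth-first crawl confined to the starting domain. A minimal sketch (the `get_links` callback stands in for the real page fetch and PDF conversion, which the tool performs with Playwright):

```python
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(start_url, get_links, max_pages=100):
    """Breadth-first crawl confined to the starting domain.
    get_links(url) -> iterable of hrefs found on that page."""
    domain = urlparse(start_url).netloc
    seen, queue, order = {start_url}, deque([start_url]), []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)                 # the real tool renders a PDF here
        for href in get_links(url):
            link = urljoin(url, href)     # resolve relative links
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

Same-domain filtering happens at enqueue time, so off-site links (the `https://b.com/` case below) are never fetched at all:

```python
site = {"https://a.com/": ["/x", "https://b.com/"], "https://a.com/x": ["/"]}
crawl("https://a.com/", lambda u: site.get(u, []))
# ['https://a.com/', 'https://a.com/x']
```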

Handling Existing Output

If the output directory already exists, you'll be prompted to choose:

  • [O]verwrite: Delete existing files and start fresh
  • [A]ppend: Create new versions (file_1.pdf, file_2.pdf, etc.)
  • [S]kip: Skip files that already exist
  • [U]pdate: Update only changed files (uses content hashing)
  • [Q]uit: Cancel the operation

With --llm mode, update mode is automatically used.
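Append mode's `file_1.pdf, file_2.pdf, ...` convention is simple to reason about: the first free numbered variant wins. A sketch of that naming (illustrative; the tool's exact implementation is internal):

```python
from pathlib import Path

def next_version(path: Path) -> Path:
    """In append mode, return the first free name in the sequence
    file.pdf, file_1.pdf, file_2.pdf, ... as listed above."""
    if not path.exists():
        return path
    n = 1
    while True:
        candidate = path.with_name(f"{path.stem}_{n}{path.suffix}")
        if not candidate.exists():
            return candidate
        n += 1
```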

Troubleshooting

"Command not found: python"

Try using python3 instead:

python3 web2context.py https://example.com

"Module not found" errors

Make sure you installed dependencies:

pip install -r requirements.txt
playwright install chromium

Playwright browser errors

Reinstall the browser:

playwright install chromium --force

Timeout errors

Some pages may take longer to load. The crawler will skip them and continue. Check the summary at the end for failed pages.

Permission errors

On Linux/Mac, you might need sudo for global installations, or use a virtual environment:

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Advanced Usage

Custom Output Directory

By default, the crawler auto-generates an output directory based on the domain; to use a custom output directory, modify the code.

Worker Configuration

Default workers can be adjusted in the code:

  • Basic mode: 3 workers
  • Fast mode: 10 workers
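The worker counts above trade politeness for speed: more workers and a shorter delay finish faster but hit the server harder. A sketch of that multi-worker pattern using a thread pool (the `convert` callback stands in for the real fetch + PDF render; this is not the tool's actual code):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def process_all(urls, convert, workers=3, delay=1.0):
    """Convert pages across a pool of workers, pausing `delay`
    seconds inside each task to stay polite to the server.
    workers=3, delay=1.0 matches basic mode; --fast is roughly
    workers=10, delay=0.1."""
    def task(url):
        result = convert(url)   # stand-in for fetch + PDF render
        time.sleep(delay)
        return result
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, urls))
```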

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

Acknowledgments

Built with:


Made for LLM context preparation 🤖📚
