Crawl websites into clean, structured PDFs. Optimized for LLM context (NotebookLM, Claude, ChatGPT).
Web2Context is a powerful web crawler that converts entire websites into well-organized PDF documents, perfect for feeding into Large Language Models (LLMs) for context and analysis.
- 🔍 Smart Crawling: Automatically discovers and crawls all pages within a website
- 📄 PDF Generation: Converts web pages to clean, structured PDFs
- 🤖 LLM Optimized: Special mode for AI/LLM consumption with clean output
- ⚡ Fast & Parallel: Multi-worker architecture for efficient crawling
- 🔄 Smart Updates: Detects content changes and updates only modified pages
- 🎯 Domain Scoping: Automatically limits crawling to the same domain
- 📁 Organized Output: Saves PDFs in organized directories by domain
- Python 3.7 or higher
- pip (Python package manager)
- Internet connection
If you don't have Git installed, you can download the project as a ZIP file from GitHub and extract it.
Option A: Using Git (Recommended)

```bash
git clone https://github.com/unvoidf/web2context.git
cd web2context
```

Option B: Download ZIP
- Go to https://github.com/unvoidf/web2context
- Click the green "Code" button
- Select "Download ZIP"
- Extract the ZIP file to your desired location
- Open a terminal/command prompt in the extracted folder
```bash
# Install Python packages
pip install -r requirements.txt

# Install the Playwright browser (required for web crawling)
playwright install chromium
```

Note: If you encounter permission errors, try:
- Windows: `python -m pip install -r requirements.txt`
- Linux/Mac: `pip3 install -r requirements.txt` or `python3 -m pip install -r requirements.txt`
```bash
python web2context.py --help
```

If you see the help text, the installation was successful!
Crawl a website and save all pages as PDFs:
```bash
python web2context.py https://example.com/docs
```

This will:

- Crawl all pages starting from the provided URL
- Save PDFs to the `results/example-com-pdfs/` directory
- Use the default settings (3 workers, 1-second delay)
For best results when using PDFs with LLMs:
```bash
python web2context.py https://example.com/docs --llm
```

This mode enables:

- ✅ Clean output (no timestamps in headers)
- ✅ Ignores query parameters (stable URLs; see the sketch below)
- ✅ Smart update detection (only updates changed content)
- ✅ Auto-scope detection (intelligently determines crawl boundaries)
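For instance, ignoring query parameters means `https://example.com/docs?ref=nav` and `https://example.com/docs` count as the same page. A minimal sketch of that kind of normalization (illustrative; not necessarily how web2context implements it internally):

```python
from urllib.parse import urlparse, urlunparse

def normalize_url(url: str) -> str:
    """Drop the query string and fragment so URL variants map to one page."""
    parts = urlparse(url)
    return urlunparse((parts.scheme, parts.netloc, parts.path, "", "", ""))

# Both variants normalize to "https://example.com/docs"
assert normalize_url("https://example.com/docs?ref=nav") == "https://example.com/docs"
```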
For maximum speed:
```bash
python web2context.py https://example.com/docs --llm --fast
```

This enables:

- 10 parallel workers (instead of 3)
- Minimal delay between requests (0.1s instead of 1s)
| Option | Description |
|---|---|
| `url` | Required. The starting URL to crawl from |
| `--llm` | Optimize for AI/LLM usage (clean output, smart updates) |
| `--fast` | Turbo mode: 10 workers, minimal delay |
| `--debug` | Enable debug logging for troubleshooting |
PDFs are saved in the `results/` directory:

```
results/
└── example-com-pdfs/
    ├── Page_Title_1.pdf
    ├── Page_Title_2.pdf
    ├── ...
    └── .hashes/
        ├── Page_Title_1.hash
        └── Page_Title_2.hash
```
- PDF files: the actual PDF documents
- `.hashes/` directory: contains content hashes for change detection
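The hash files make change detection cheap: a page only needs to be re-rendered when its content hash differs from the stored one. A minimal sketch of how such a check could work (the function and file names are illustrative, not the project's actual internals):

```python
import hashlib
from pathlib import Path

def content_changed(page_text: str, hash_file: Path) -> bool:
    """Compare a page's content hash against the stored hash.

    Returns True when the page is new or its content differs,
    refreshing the stored hash as a side effect.
    """
    new_hash = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
    if hash_file.exists() and hash_file.read_text().strip() == new_hash:
        return False  # unchanged: keep the existing PDF
    hash_file.parent.mkdir(parents=True, exist_ok=True)
    hash_file.write_text(new_hash)
    return True
```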
Crawl a documentation site:

```bash
python web2context.py https://docs.example.com --llm
```

Crawl a blog quickly:

```bash
python web2context.py https://blog.example.com/posts --llm --fast
```

Crawl a GitHub wiki:

```bash
python web2context.py https://github.com/user/repo/wiki --llm
```

The `--llm` mode automatically detects GitHub/GitLab repositories and scopes crawling appropriately.
- Start: Begins crawling from the provided URL
- Discover: Extracts all links from each page
- Filter: Only processes links from the same domain
- Convert: Generates PDFs from each page
- Track: Monitors progress and handles errors gracefully
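To make the pipeline concrete, here is a simplified single-worker sketch of that loop using Playwright's sync API. It illustrates the steps above rather than reproducing the project's actual code; the real crawler runs multiple workers in parallel and adds rate limiting, filename sanitization, and the Track step's error handling:

```python
from pathlib import Path
from urllib.parse import urlparse
from playwright.sync_api import sync_playwright

def crawl(start_url: str, out_dir: str = "results") -> None:
    domain = urlparse(start_url).netloc
    queue, seen = [start_url], {start_url}
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        while queue:
            url = queue.pop(0)
            page.goto(url)  # Start: load the page
            # Discover: collect every absolute link on the page
            links = page.eval_on_selector_all(
                "a[href]", "els => els.map(e => e.href)")
            for link in links:
                # Filter: same-domain links only, each visited once
                if urlparse(link).netloc == domain and link not in seen:
                    seen.add(link)
                    queue.append(link)
            # Convert: render the page to a PDF (Chromium only)
            name = urlparse(url).path.strip("/").replace("/", "_") or "index"
            page.pdf(path=f"{out_dir}/{name}.pdf")
        browser.close()
```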
If the output directory already exists, you'll be prompted to choose:
- [O]verwrite: Delete existing files and start fresh
- [A]ppend: Create new versions (file_1.pdf, file_2.pdf, etc.)
- [S]kip: Skip files that already exist
- [U]pdate: Update only changed files (uses content hashing)
- [Q]uit: Cancel the operation
With `--llm` mode, update mode is automatically used.
Try using `python3` instead:

```bash
python3 web2context.py https://example.com
```

Make sure you installed the dependencies:

```bash
pip install -r requirements.txt
playwright install chromium
```

Reinstall the browser:

```bash
playwright install chromium --force
```

Some pages may take longer to load; the crawler will skip them and continue. Check the summary at the end of the run for failed pages.
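The pattern behind that skip-and-continue behavior looks roughly like the sketch below, using Playwright's sync API (the 30-second budget is an assumption, not the project's actual setting):

```python
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout

def fetch_ok(url: str, timeout_ms: int = 30_000) -> bool:
    """Return False instead of raising when a page exceeds its time budget."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            browser.new_page().goto(url, timeout=timeout_ms)
            return True
        except PlaywrightTimeout:
            return False  # the page is recorded as failed; crawling continues
        finally:
            browser.close()
```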
On Linux/Mac, you might need `sudo` for global installations, or use a virtual environment:

```bash
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

Modify the code to specify a custom output directory; otherwise the crawler auto-generates one based on the domain.
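The auto-generated name follows the pattern shown earlier (`results/example-com-pdfs/`). A sketch of how such a name can be derived from the start URL (illustrative; the function name is made up):

```python
from urllib.parse import urlparse

def output_dir_for(url: str) -> str:
    """Map a start URL to a results directory,
    e.g. https://example.com/docs -> results/example-com-pdfs"""
    domain = urlparse(url).netloc.replace(".", "-")
    return f"results/{domain}-pdfs"
```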
The default worker counts can be adjusted in the code:
- Basic mode: 3 workers
- Fast mode: 10 workers
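If you change them, look for where the worker pool is sized. With Python's standard library that typically looks something like the sketch below (the names are illustrative; check web2context.py for the actual ones):

```python
from concurrent.futures import ThreadPoolExecutor

DEFAULT_WORKERS = 3   # basic mode
FAST_WORKERS = 10     # --fast mode

def process_page(url: str) -> None:
    ...  # fetch the page and convert it to a PDF

def process_all(urls: list[str], fast: bool = False) -> None:
    workers = FAST_WORKERS if fast else DEFAULT_WORKERS
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(process_page, urls))  # one URL per worker task
```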
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Issues: Report bugs or request features on GitHub Issues
- Homepage: https://github.com/unvoidf/web2context
Built with:
- Playwright - Web automation
- unidecode - Text normalization
Made for LLM context preparation 🤖📄