Add TermScrape: Complete terminal-based web scraper with LLM support by serhat961 · Pull Request #1 · serhat961/grokscrap-ng

serhat961 · 2025-11-09T22:39:24Z

Implemented a comprehensive web scraping tool with the following features:

Core Components:

- Single URL scraping with multiple output formats (markdown, JSON, text)
- Recursive website crawling with configurable depth limits
- JavaScript rendering support via Playwright headless browser
- LLM-based data extraction using local Ollama models
- HTML parsing and conversion to structured formats
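
To illustrate what "HTML parsing and conversion to structured formats" can look like, here is a minimal sketch using only the standard library. It is not the code in this PR: the class `PageSummary`, the function `html_to_json`, and the choice of `html.parser` are assumptions made for the example, and the PR does not say which parsing library `src/parser.py` uses.

```python
import json
from html.parser import HTMLParser
from typing import List


class PageSummary(HTMLParser):
    """Hypothetical helper: collect the <title> text and all <a href> targets."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.links: List[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs; anchors without href are skipped
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title> is accumulated
        if self._in_title:
            self.title += data


def html_to_json(html: str) -> str:
    """Parse an HTML string and return a JSON object with 'title' and 'links'."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return json.dumps({"title": parser.title.strip(), "links": parser.links})
```

Calling `html_to_json(page_source)` on a fetched page returns a JSON string; markdown or plain-text output would need its own converter and is out of scope for this sketch.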

Modules:

- src/main.py: Click-based CLI interface with scrape and crawl commands
- src/scraper.py: Single URL scraping with requests and error handling
- src/crawler.py: Queue-based recursive crawling with depth control
- src/parser.py: HTML parsing, markdown conversion, structured JSON extraction
- src/browser.py: Async Playwright integration for JavaScript rendering
- src/llm_extract.py: Ollama integration for AI-powered extraction
- src/utils.py: Robots.txt checking, delays, logging, URL validation
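
The "queue-based recursive crawling with depth control" described for `src/crawler.py` can be sketched roughly as below. This is an illustration of the general technique (breadth-first traversal with a depth cap), not the PR's implementation: the function name `crawl`, its parameters, and the injected `fetch_links` callable are all hypothetical, and link fetching is passed in so the example runs without network access.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(
    start_url: str,
    fetch_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
) -> List[str]:
    """Breadth-first crawl from start_url, following links at most max_depth hops.

    fetch_links is a hypothetical callable returning the outgoing links of a URL.
    Returns the URLs in the order they were visited.
    """
    visited = {start_url}
    order: List[str] = []
    queue = deque([(start_url, 0)])  # each entry is (url, depth from the start page)

    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: record the page but do not expand it
        for link in fetch_links(url):
            if link not in visited:
                visited.add(link)  # mark on enqueue so a URL is never queued twice
                queue.append((link, depth + 1))
    return order
```

In a real crawler, `fetch_links` would download the page and extract its anchors. Using an explicit queue instead of function recursion keeps memory use predictable and makes the depth limit a simple counter stored with each queued URL.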

Ethical Features:

- Automatic robots.txt compliance
- Random delays (1-5s) between requests
- Proper User-Agent identification
- Configurable rate limiting
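
As a rough sketch of how the robots.txt check and the random 1-5 second delay might fit together, the example below uses the standard library's `urllib.robotparser`. The helper names (`build_robots_checker`, `is_allowed`, `polite_delay`) and the `USER_AGENT` string are assumptions for illustration, not names taken from `src/utils.py`. The robots.txt body is passed in as a string so the example runs offline.

```python
import random
import time
from typing import Callable
from urllib.robotparser import RobotFileParser

USER_AGENT = "TermScrapeExample/0.1"  # hypothetical User-Agent string


def build_robots_checker(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file (already downloaded)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if robots.txt permits user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


def polite_delay(
    min_s: float = 1.0,
    max_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Sleep for a random duration in [min_s, max_s] seconds and return it.

    The sleep function is injectable so tests can skip the actual wait.
    """
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay
```

A scraper loop would call `is_allowed(...)` before each request and `polite_delay()` between requests, sending `USER_AGENT` in the request headers so site operators can identify the client.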

Testing & Documentation:

- Pytest test suite for scraper and crawler
- Comprehensive usage guide and API documentation
- Contributing guidelines with code standards
- Example scripts demonstrating library usage

Configuration:

- requirements.txt with all dependencies
- pyproject.toml for packaging and distribution
- MIT License
- .gitignore for Python projects

The tool is production-ready and follows PEP 8, with type hints, a modular design, and robust error handling.

serhat961 wants to merge 1 commit into main from claude/termscrape-web-scraper-011CUy64Sz2N4xYpGzdNuovh

2 participants