Add TermScrape: Complete terminal-based web scraper with LLM support #1

Open
serhat961 wants to merge 1 commit into main from claude/termscrape-web-scraper-011CUy64Sz2N4xYpGzdNuovh

@serhat961
Owner

Implemented a comprehensive web scraping tool with the following features:

Core Components:

  • Single URL scraping with multiple output formats (markdown, JSON, text)
  • Recursive website crawling with configurable depth limits
  • JavaScript rendering support via Playwright headless browser
  • LLM-based data extraction using local Ollama models
  • HTML parsing and conversion to structured formats
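
To illustrate the "HTML parsing and conversion to structured formats" item, here is a minimal sketch that uses only the standard library. It is not taken from this PR: the class `PageSummary` and the function `summarize` are hypothetical names, and the actual `src/parser.py` may use a different parsing library and output schema.

```python
import json
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collects the page title and every link target (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.links: list[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors without an href attribute
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(html: str) -> dict:
    """Return a JSON-serializable summary of an HTML document."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}


if __name__ == "__main__":
    sample = "<html><head><title>Demo</title></head><body><a href='/a'>A</a></body></html>"
    print(json.dumps(summarize(sample), indent=2))
```

The same dictionary could be rendered as markdown or plain text by a separate formatter, which keeps parsing and output formatting independent.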

Modules:

  • src/main.py: Click-based CLI interface with scrape and crawl commands
  • src/scraper.py: Single URL scraping with requests and error handling
  • src/crawler.py: Queue-based recursive crawling with depth control
  • src/parser.py: HTML parsing, markdown conversion, structured JSON extraction
  • src/browser.py: Async Playwright integration for JavaScript rendering
  • src/llm_extract.py: Ollama integration for AI-powered extraction
  • src/utils.py: Robots.txt checking, delays, logging, URL validation
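
The queue-based crawling with depth control listed for `src/crawler.py` can be sketched roughly as follows. This is an assumption about the general technique (breadth-first traversal with a visited set), not the PR's actual code: `crawl`, `fetch_links`, and the in-memory `SITE` mapping are hypothetical, and link fetching is injected as a callable so the sketch runs without network access.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(
    start_url: str,
    fetch_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
) -> list[str]:
    """Breadth-first crawl visiting each URL once, at most max_depth hops from start_url."""
    visited: set[str] = {start_url}
    order: list[str] = []
    queue: deque[tuple[str, int]] = deque([(start_url, 0)])

    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # pages at the depth limit are visited but not expanded
        for link in fetch_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# Hypothetical in-memory site standing in for real HTTP requests.
SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": ["https://example.com/d"],
}


def site_links(url: str) -> list[str]:
    return SITE.get(url, [])


if __name__ == "__main__":
    for page in crawl("https://example.com/", site_links, max_depth=2):
        print(page)
```

Marking a URL as visited when it is enqueued, rather than when it is dequeued, prevents the same page from being queued twice when several pages link to it.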

Ethical Features:

  • Automatic robots.txt compliance
  • Random delays (1-5s) between requests
  • Proper User-Agent identification
  • Configurable rate limiting
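
A minimal sketch of the robots.txt check and randomized delay, assuming the standard library's `urllib.robotparser` is used. The PR does not show how `src/utils.py` implements these, so the helper names (`build_robots`, `is_allowed`, `pick_delay`, `polite_wait`) and the `USER_AGENT` string are hypothetical. The robots.txt body is passed in as text so the example runs offline.

```python
import random
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "TermScrapeExample/0.1"  # hypothetical identifier, not from the PR


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


def pick_delay(min_s: float = 1.0, max_s: float = 5.0) -> float:
    """Choose a random pause length in seconds within [min_s, max_s]."""
    return random.uniform(min_s, max_s)


def polite_wait(min_s: float = 1.0, max_s: float = 5.0) -> None:
    """Sleep for a random interval between requests."""
    time.sleep(pick_delay(min_s, max_s))


if __name__ == "__main__":
    rules = build_robots("User-agent: *\nDisallow: /private/\n")
    for target in ("https://example.com/docs", "https://example.com/private/data"):
        print(target, "allowed" if is_allowed(rules, target) else "blocked")
```

In a real crawl, `polite_wait()` would be called before each request, and any URL for which `is_allowed` returns `False` would be skipped.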

Testing & Documentation:

  • Pytest test suite for scraper and crawler
  • Comprehensive usage guide and API documentation
  • Contributing guidelines with code standards
  • Example scripts demonstrating library usage

Configuration:

  • requirements.txt with all dependencies
  • pyproject.toml for packaging and distribution
  • MIT License
  • .gitignore for Python projects

The tool is production-ready and follows PEP 8 with type hints, modular design, and robust error handling.
