A powerful web scraping tool designed for extracting structured data from websites with configurable rules and multiple execution modes.
- Configurable JSON-based scraping rules
- Multiple extraction modes:
  - Static: Fast HTML parsing without JavaScript execution
  - Browser: Full browser emulation with JavaScript support
- Concurrent scraping with adjustable worker count
```bash
go install github.com/crawlerclub/extractor/cmd/rabbitextract@latest
go install github.com/crawlerclub/extractor/cmd/rabbitcrawler@latest
```

`rabbitextract` is a command-line tool for extracting data from a single webpage using JSON configuration rules.
- `-config`: Path to the config JSON file (required)
- `-url`: URL to extract data from (optional if provided in config)
- `-mode`: Extraction mode (optional, defaults to "auto")
  - `auto`: Automatically choose between static and browser mode
  - `static`: Fast HTML parsing without JavaScript
  - `browser`: Full browser emulation with JavaScript support
- `-output`: Output file path (optional, defaults to stdout)
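For example, assuming the config file from the next step exists, the flags combine as follows (the URL and file names are placeholders):

```bash
# Force full browser emulation for a JavaScript-heavy page and write results to a file
rabbitextract -config config.json -url "https://example.com/page" -mode browser -output result.json

# Use fast static parsing; -url is omitted, so the URL must come from the config file
rabbitextract -config config.json -mode static
```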
- Create a configuration file `config.json`:
```json
{
  "name": "example-scraper",
  "example_url": "https://example.com/page",
  "schemas": [
    {
      "name": "articles",
      "entity_type": "article",
      "selector": "//div[@class='article']",
      "fields": [
        {
          "name": "title",
          "type": "text",
          "selector": ".//h1"
        },
        {
          "name": "content",
          "type": "text",
          "selector": ".//div[@class='content']"
        }
      ]
    }
  ]
}
```

- Run the extractor:
```bash
rabbitextract -config config.json -url "https://example.com/page" -output result.json
```

Field types:

- `text`: Extract the text content of an element
- `attribute`: Extract a specific attribute value from an element
- `nested`: Extract a nested object with multiple fields
- `list`: Extract an array of items
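As a sketch of the non-`text` field types, a schema might look like the one below. The `attribute` key on the `url` field and the inner `fields` array on the nested field are illustrative assumptions about the config format, not confirmed by this document:

```json
{
  "name": "article_links",
  "entity_type": "article",
  "selector": "//div[@class='article']",
  "fields": [
    {
      "name": "url",
      "type": "attribute",
      "selector": ".//a[@class='permalink']",
      "attribute": "href"
    },
    {
      "name": "author",
      "type": "nested",
      "selector": ".//div[@class='byline']",
      "fields": [
        { "name": "name", "type": "text", "selector": ".//span[@class='name']" }
      ]
    },
    {
      "name": "tags",
      "type": "list",
      "selector": ".//ul[@class='tags']/li"
    }
  ]
}
```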
Special fields:

- `_id`: Used to generate a unique `external_id` for items
- `_time`: Used to set the `external_time` for items
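For instance, naming fields `_id` and `_time` would let the extractor populate `external_id` and `external_time` for each item. The fragment below shows such a `fields` array; the selectors, and treating both as plain `text` fields, are illustrative assumptions:

```json
"fields": [
  { "name": "_id", "type": "text", "selector": ".//span[@class='post-id']" },
  { "name": "_time", "type": "text", "selector": ".//time" }
]
```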