A dual-interface Go module for building simple web scrapers
- Go struct-tag interface
- Command-line interface
- HTML⇒JSON API server
- Single binary
- Simple configuration
- Zero-downtime config reload with `kill -s SIGHUP <scraper-pid>`
 
### Binaries
See the latest release or download it with this one-liner: `curl https://i.jpillora.com/scraper | bash`
### Source
```sh
$ go get -v github.com/jpillora/scraper
```

### Quick example

```go
package main

import (
	"fmt"
	"log"

	"github.com/jpillora/scraper/scraper"
)

func main() {
	type result struct {
		Title string `scraper:"h3 span"`
		URL   string `scraper:"a[href] | @href"`
	}
	type google struct {
		URL    string   `scraper:"https://www.google.com/search?q={{query}}"`
		Result []result `scraper:"#rso div[class=g]"`
		Query  string   `scraper:"query"`
	}
	g := google{Query: "hello world"}
	if err := scraper.Execute(&g); err != nil {
		log.Fatal(err)
	}
	for i, r := range g.Result {
		fmt.Printf("#%d: '%s' => %s\n", i+1, r.Title, r.URL)
	}
}
```

which outputs:

```
#1: 'Helloworld Travel – Deals on Accommodation, Flights ...' => https://www.helloworld.com.au/
#2: '"Hello, World!" program - Wikipedia' => https://en.wikipedia.org/wiki/%22Hello,_World!%22_program
#3: 'Helloworld Travel - Wikipedia' => https://en.wikipedia.org/wiki/Helloworld_Travel
#4: 'Helloworld Travel Limited' => https://www.helloworldlimited.com.au/
#5: 'Total immersion, Serious fun! with Hello-World!' => https://www.hello-world.com/
#6: 'Helloworld Travel - Home | Facebook' => https://www.facebook.com/helloworldau/
```
Given `google.json`:

```json
{
  "/search": {
    "url": "https://www.google.com/search?q={{query}}",
    "list": "#rso div[class=g]",
    "result": {
      "title": "h3 span",
      "url": ["a[href]", "@href"]
    }
  }
}
```

```sh
$ scraper google.json
2015/05/16 20:10:46 listening on 3000...
```

```sh
$ curl "localhost:3000/search?query=hellokitty"
[
  {
    "title": "Official Home of Hello Kitty \u0026 Friends | Hello Kitty Shop",
    "url": "http://www.sanrio.com/"
  },
  {
    "title": "Hello Kitty - Wikipedia, the free encyclopedia",
    "url": "http://en.wikipedia.org/wiki/Hello_Kitty"
  },
  ...
```
### Configuration

```
{
  <path>: {
    "method": <method>,
    "url": <url>,
    "list": <selector>,
    "result": {
      <field>: <extractor>,
      <field>: [<extractor>, <extractor>, ...],
      ...
    }
  }
}
```
- `<path>` - Required. The path of the scraper.
  - Accessible at `http://<host>:<port>/<path>`
  - You may define path variables like `my/path/:var`; when set to `/my/path/foo`, then `:var = "foo"`
- `<url>` - Required. The URL of the remote server to scrape.
  - It may contain template variables in the form `{{ var }}`; scraper will look for a `var` path variable and, if not found, will then look for a `var` query parameter
- `result` - Required. Represents the resulting JSON object, built by executing each `<extractor>` on the current DOM context. A field may use a sequence of `<extractor>`s to perform more complex queries.
- `<method>` - The HTTP request method (defaults to `GET`)
- `<extractor>` - A string which must be in one of the following forms (see the sketch after this list):
  - a regex in the form `/abc/` - searches the text of the current DOM context (extracts the first group when provided)
  - a regex in the form `s/abc/xyz/` - searches the text of the current DOM context and replaces it with the provided text (sed-like syntax)
  - an attribute in the form `@abc` - gets the attribute `abc` from the DOM context
  - a function in the form `html()` - gets the DOM context as a string
  - a function in the form `trim()` - trims space from the beginning and end of the string
  - a query param in the form `query-param(abc)` - parses the current context as a URL and extracts the provided param
  - a CSS selector `abc` (if not in one of the forms above) - alters the DOM context
- `list` - Optional. A CSS selector used to split the root DOM context into a set of DOM contexts. Useful for capturing search results.
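
To give a sense of how these extractor forms compose, here is a minimal, hypothetical sketch using the Go struct-tag interface; every selector and field name below is made up for illustration:

```go
// Each tag is a sequence of extractors applied left-to-right
// against the current DOM context.
type page struct {
	// regex /abc/: extract the first capture group from the context text
	Year string `scraper:"/([0-9]{4})/"`
	// attribute @abc: select the first <a>, then read its href
	Link string `scraper:"a | @href"`
	// function trim(): select the heading, then strip surrounding space
	Name string `scraper:"h1 | trim()"`
	// query-param(abc): read href, then pull out the "id" query parameter
	ID string `scraper:"a | @href | query-param(id)"`
}
```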
### Go struct-tag interface

Replace each `<variable>` below with your configuration, documented above.
- Define your endpoint struct:

```go
type endpoint struct {
	Method  string   `scraper:"<method>"`
	URL     string   `scraper:"<url>"`
	Result  []result `scraper:"<list>"`
	<param> string   `scraper:"<param>"`
}
```

`Method`, `URL`, `Result` and `Debug` are special fields; the remaining string fields are treated as input parameters. By default, an input parameter uses its field name with the first character lowercased.
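
Filling in the placeholders, a hypothetical concrete endpoint might look like the sketch below; the URL and selector are invented, and it mirrors the quick example above:

```go
// Hypothetical endpoint: Query is an ordinary string field, so it is
// exposed as the input parameter "query" (field name, first character
// lowercased) and fills {{query}} in the URL template.
type search struct {
	Method string   `scraper:"GET"`
	URL    string   `scraper:"https://example.com/search?q={{query}}"`
	Result []result `scraper:"div.result"`
	Query  string   `scraper:"query"`
}
```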
- Define your result struct:

```go
type result struct {
	<field> string `scraper:"<extractor>"`
	<field> string `scraper:"<extractor> | <extractor>"`
}
```

The result struct defines field-to-extractor mappings. All fields must be strings. Since struct tags cannot contain arrays, multiple extractors are instead joined with `|`, as in the sketch below.
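
For instance, a hypothetical result struct chaining extractors with `|` (the selectors here are placeholders):

```go
// Hypothetical mappings: each field chains extractors with "|",
// the struct-tag equivalent of a JSON array of extractors.
type result struct {
	Title string `scraper:"h3 | trim()"`
	URL   string `scraper:"a[href] | @href"`
}
```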
- Execute it:

```go
e := endpoint{MyParam: "hello world"}
if err := scraper.Execute(&e); err != nil {
	...
}
// e.Result is now set
```

### Similar projects

- https://github.com/ernesto-jimenez/scraperboard

### MIT License