jaimeiniesta/funkspector

Funkspector

Web page inspector for Elixir.

Funkspector is a web scraper that lets you extract data from web pages and XML or TXT sitemaps.

Usage

Resolving URLs

Pass Funkspector the URL to resolve and it will return the final URL after following any redirects:

iex> { :ok, final_url, _ } = Funkspector.resolve("http://example.com")

Page Scraping

Simply pass Funkspector the URL of a web page to inspect and it will return its scraped data:

iex> { :ok, document } = Funkspector.page_scrape("https://rocketvalidator.com")

Sitemap Scraping

Funkspector can extract the locations from XML sitemaps, like this:

iex> { :ok, document } = Funkspector.sitemap_scrape("https://rocketvalidator.com/sitemap.xml")

It also supports TXT sitemaps:

iex> { :ok, document } = Funkspector.text_sitemap_scrape("https://rocketvalidator.com/sitemap.txt")
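Since XML and TXT sitemaps use different functions, a caller that receives arbitrary sitemap URLs has to choose between them. The following is a minimal sketch of one way to do that, based only on the file extension in the URL path. `SitemapKind` is a hypothetical helper, not part of Funkspector, and it assumes that any sitemap not ending in `.txt` is XML.

```elixir
defmodule SitemapKind do
  # Hypothetical helper, not part of Funkspector.
  # Returns the name of the Funkspector function to call for a sitemap URL.
  def scraper_for(url) do
    # URI.parse/1 returns a nil path for URLs such as "https://example.com"
    path = URI.parse(url).path || ""

    if String.ends_with?(String.downcase(path), ".txt") do
      :text_sitemap_scrape
    else
      # Assumption: everything else is treated as an XML sitemap
      :sitemap_scrape
    end
  end
end

# Usage:
# url = "https://rocketvalidator.com/sitemap.txt"
# apply(Funkspector, SitemapKind.scraper_for(url), [url])
```

Looking only at the extension keeps the check free of network calls, at the cost of misclassifying TXT sitemaps served from paths without a `.txt` suffix.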

Custom User Agent

You can specify a custom User Agent string using the user_agent option.

Example:

  Funkspector.page_scrape("http://example.com", %{user_agent: "My Bot"})

Basic Auth

You can specify a basic auth username and password using the basic_auth option, which will be passed as an Authorization request header.

Example:

  Funkspector.page_scrape("http://example.com", %{basic_auth: {"user", "secret"}})

Setting a custom timeout

Use recv_timeout to set a custom timeout for the request, in milliseconds.

Example:

  Funkspector.page_scrape("http://example.com", %{recv_timeout: 5_000})
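The options above are each shown on their own. If you want the same settings on every call, one approach is to keep defaults in a small module and merge per-call overrides into them. `ScrapeOptions` is a hypothetical helper, not part of Funkspector, and the sketch assumes that several options can be combined in a single map, which the examples above suggest but do not state.

```elixir
defmodule ScrapeOptions do
  # Hypothetical helper, not part of Funkspector.
  @defaults %{user_agent: "My Bot", recv_timeout: 5_000}

  # Returns the defaults with any per-call overrides merged on top.
  def build(overrides \\ %{}) do
    Map.merge(@defaults, overrides)
  end
end

# Usage (assumes Funkspector accepts several options in one map):
# opts = ScrapeOptions.build(%{basic_auth: {"user", "secret"}})
# Funkspector.page_scrape("http://example.com", opts)
```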

Loading a document's contents instead of requesting it

If you already have the contents of a document, for example from a previous request or a cache, you can pass them in with the contents option and Funkspector will skip the HTTP request. For example:

Funkspector.page_scrape("https://example.com", contents: "<html>...</html>")

Scraped data

For a successful response you'll get a Funkspector.Document with the scraped data, which will depend on the kind of scraper used. All data will be found inside the :data attribute.

Error response

In case of error, Funkspector will return an error tuple with the original URL and the reason for the failure:

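case Funkspector.page_scrape("http://example.com") do
  { :ok, document } ->
    # All scraped data lives under the :data attribute of the document
    IO.inspect(document.data)
  { :error, url, reason } ->
    # inspect/1 is used because the reason may not be a plain string
    IO.puts "Could not scrape #{url} because of #{inspect(reason)}"
end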
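If you call the scrapers from several places, it can help to funnel both result shapes through one function. The following is a minimal sketch, assuming only the `{:ok, document}` and `{:error, url, reason}` tuples described above. `ScrapeResult` is a hypothetical helper, not part of Funkspector.

```elixir
defmodule ScrapeResult do
  # Hypothetical helper, not part of Funkspector.

  # Success: unwrap the scraped data stored under :data.
  def summarize({:ok, %{data: data}}), do: {:ok, data}

  # Failure: build a readable message; inspect/1 copes with non-string reasons.
  def summarize({:error, url, reason}) do
    {:error, "Could not scrape #{url} because of #{inspect(reason)}"}
  end
end

# Usage:
# "http://example.com" |> Funkspector.page_scrape() |> ScrapeResult.summarize()
```

Pattern matching on `%{data: data}` works for structs as well as plain maps, so the helper does not need to know the exact fields of Funkspector.Document.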

Installation

The package is available in Hex and can be installed as follows:

  1. Add funkspector to your list of dependencies in mix.exs:

    def deps do
      [{:funkspector, "~> 0.10"}]
    end

  2. Ensure funkspector is started before your application:

    def application do
      [applications: [:funkspector]]
    end
