First scraper tutorial (Python)
Write a real scraper by copying and pasting code, for programmers or non-programmers (30 minutes).
We’re going to scrape the average number of years children spend in school in different countries from this page, which was once on a UN site but has since been replaced with an Excel spreadsheet.
Go to the new ScraperWiki site and register. Create a dataset, choose the "Code in your browser" tool, then choose Python as the language. You'll get a web-based code editor.
Put in a few lines of code to show it runs, and click the “Run” button or type Ctrl+Return.
print "Hello, coding in the cloud!"
(As we go through this tutorial, you can copy and paste each block of code onto the end of your growing scraper, and run it each time.)
The code runs on ScraperWiki's servers. You can see any output you printed in the console at the bottom of the editor.
You can use any normal Python library to crawl the web, such as urllib2 or Mechanize. There is also a simple built-in ScraperWiki library which may be easier to use. The URL we are about to scrape is hosted on the Web Archive.
import scraperwiki
html = scraperwiki.scrape("http://web.archive.org/web/20110514112442/http://unstats.un.org/unsd/demographic/products/socind/education.htm")
print html
The code can be explained as follows. The first line imports the scraperwiki library. The second line fetches the HTML source of the page at that URL. The last line prints the HTML, so you can be sure you have access to all the information you need.
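If you're curious what scraperwiki.scrape does, it is roughly a thin wrapper around the standard library's URL opener. Here is a minimal sketch of an equivalent function; the name scrape below is our own, not part of any library:

```python
try:
    from urllib2 import urlopen          # Python 2, as used in this tutorial
except ImportError:
    from urllib.request import urlopen   # Python 3 equivalent

def scrape(url):
    # Fetch the URL and return the raw page body,
    # roughly what scraperwiki.scrape does for us
    return urlopen(url).read()
```

You could swap this in for scraperwiki.scrape if you ever run the code outside ScraperWiki.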
lxml is the best library for extracting content from HTML.
import lxml.html
root = lxml.html.fromstring(html)
for tr in root.cssselect("div[align='left'] tr"):
    tds = tr.cssselect("td")
    if len(tds) == 12:
        data = {
            'country': tds[0].text_content(),
            'years_in_school': int(tds[4].text_content())
        }
        print data
So here's what happens in this second block of code. First we import the lxml library, which parses HTML. In the second line root is defined: the parsed document. Then comes the loop, which looks for all the tr table elements in the HTML. It helps to know basic HTML structure to follow these lines. For the purposes of this tutorial:
- table - introduces a table on an HTML page;
- tr - the HTML tag for a table row;
- td - the HTML tag for table data, i.e. a single cell;
- div - a section of an HTML page that can be identified with ids or classes.
The loop that starts on the third line goes through all the tr elements found by root.cssselect. The bits of code like div and td are CSS selectors, just like those used to style HTML. Here we use them to select all the table rows. Then, for each of those rows, we select the individual cells, and if there are 12 of them (i.e. we are in the main table body, rather than in one of the header rows), we extract the country name and the schooling statistic.
Next, tds is defined: a list of all the td elements in each table row.
The if statement narrows the query to rows that have 12 cells. The table on the page doesn't look like it has that many columns, but in the source code there are some empty td tags.
Finally, data is defined to extract elements from the tds list. Since list indices start at 0, tds[0] is the very first td element in the row (the country name) and tds[4] is the fifth (the years of schooling).
The datastore is a magic SQL store, one where you don't need to make a schema up front.
Replace print data in the lxml loop with this save command (make sure you keep it indented with spaces at the start like this):
        scraperwiki.sql.save(unique_keys=['country'], data=data)
The unique keys (just country in this case) identify each piece of data. When the scraper runs again, existing data with the same values for the unique keys is replaced.
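If you want to see this replace-on-save behaviour outside ScraperWiki, here is a sketch of the same idea using Python's built-in sqlite3 module. The table and column names are ours, chosen to mirror the tutorial:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE swdata
                (country TEXT PRIMARY KEY, years_in_school INTEGER)""")

def save(data):
    # Same effect as scraperwiki.sql.save(unique_keys=['country'], data=data):
    # a row with the same country is replaced, not duplicated
    conn.execute("INSERT OR REPLACE INTO swdata VALUES (:country, :years_in_school)",
                 data)

save({'country': 'Examplestan', 'years_in_school': 10})
save({'country': 'Examplestan', 'years_in_school': 11})  # replaces the earlier row

print(conn.execute("SELECT * FROM swdata").fetchall())
# prints [('Examplestan', 11)] - one row, with the newer value
```

Making country the primary key is what enforces the "unique keys" behaviour here.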
Go to the "View in a table" tool to see the data loading in (you'll need to keep reloading the page). Notice that the code keeps running in the background, even when you're not in the "Code in your browser" tool. Wait until it has finished.
If you haven't done so yet, edit the title of your scraper.
Now, you can use other tools. For example choose "Download as spreadsheet".
For more complex queries, choose "Query with SQL". Try this query in the SQL query box.
select * from swdata order by years_in_school desc limit 10
It gives you the records for the ten countries where children spend the most years at school.
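You can experiment with the same kind of query locally using sqlite3. This sketch uses made-up countries and numbers, and a limit of 3 instead of 10 to keep it short:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE swdata (country TEXT, years_in_school INTEGER)")
conn.executemany("INSERT INTO swdata VALUES (?, ?)",
                 [("A-land", 8), ("B-land", 13), ("C-land", 11), ("D-land", 6)])

# Same shape as the query above, trimmed to the top 3
for row in conn.execute(
        "select * from swdata order by years_in_school desc limit 3"):
    print(row)
# prints ('B-land', 13), then ('C-land', 11), then ('A-land', 8)
```

ORDER BY ... DESC sorts from most to fewest years, and LIMIT keeps only the top of that list.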
If you have a scraper you want to write, and feel ready, then get going. Otherwise try the other tutorials.