Excel reading guide (Python)

Microsoft Excel spreadsheet files can contain lots of data in cells, formulae, notes and so on.

For more details, read the xlrd documentation.

Opening the file

Download the Excel file, and load it into the xlrd ("Excel reader"). This example is taken from an FOI request for tram accidents.

import scraperwiki
import xlrd
xlbin = scraperwiki.scrape("http://www.whatdotheyknow.com/request/82804/response/208592/attach/2/ACCIDENTS%20TRAMS%20Laurderdale.xls")
book = xlrd.open_workbook(file_contents=xlbin)

Directly select the worksheet you want by number or name.

sheet = book.sheet_by_index(0)
sheet = book.sheet_by_name('LAUDERDALE AVE')

If you're not sure what sheets there are, this prints them out.

for n, s in enumerate(book.sheets()):
    print "Sheet %d is called %s and has %d columns and %d rows" % (n, s.name, s.ncols, s.nrows)

Reading values

Read the values of a row in one go.

print sheet.row_values(4)

Dates will come out as floating point numbers by default (e.g. 36649.0), unless you specially convert them. This cellval function does the hard work for you.

import datetime
def cellval(cell, datemode):
    if cell.ctype == xlrd.XL_CELL_DATE:
        datetuple = xlrd.xldate_as_tuple(cell.value, datemode)
        if datetuple[3:] == (0, 0, 0):
            return datetime.date(datetuple[0], datetuple[1], datetuple[2])
        return datetime.date(datetuple[0], datetuple[1], datetuple[2], datetuple[3], datetuple[4], datetuple[5])
    if cell.ctype == xlrd.XL_CELL_EMPTY:    return None
    if cell.ctype == xlrd.XL_CELL_BOOLEAN:  return cell.value == 1
    return cell.value

Read a whole row, with dates properly converted.

print [ cellval(c, book.datemode) for c in sheet.row(4) ]

Read individual cells like this.

print cellval(sheet.cell(0,0))

Saving to the datastore

Work out sensible heading names. In this case we remove the "." from "FLEET NO.".

keys = sheet.row_values(2)
keys[1] = keys[1].replace('.', '')
print keys

Make dictionaries from the rows and save them.

for rownumber in range(4, sheet.nrows):
    # create dictionary of the row values
    values = [ cellval(c, book.datemode) for c in sheet.row(rownumber) ]
    data = dict(zip(keys, values))
    data['rownumber'] = rownumber

    # remove the empty column (which has a blank heading)
    del data['']

    # only save if it is a full row (rather than a blank line or a note)
    if data['DATE'] != None and data['FLEET NO'] != None:
        scraperwiki.sql.save(unique_keys=['rownumber'], data=data)

Bad dates

In some cases there are bad dates. Improve the cellval function to catch exceptions.

try:
    datetuple = xlrd.xldate_as_tuple(cell.value, datemode)
except Exception, e:
    print "BAD", cell, e
    return str(cell)
try:
    if datetuple[3:] == (0, 0, 0):
        return datetime.date(datetuple[0], datetuple[1], datetuple[2])
    return datetime.date(datetuple[0], datetuple[1], datetuple[2], datetuple[3], datetuple[4], datetuple[5])
except ValueError, e:
    print "BAD value", datetuple, cell, e
    return str(cell)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Excel reading guide (Python)

Opening the file

Reading values

Saving to the datastore

Bad dates

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally