
Error handling

Leon du Toit edited this page Nov 23, 2013 · 6 revisions

When things go wrong on your own machine it is perfectly possible to investigate the cause because you have the error output in the Terminal. But what if your program is running on some server in the night and you can only investigate problems after the fact? When writing programs that work with data many things can go wrong that have nothing to do with the logical consistency of your code. One has to recognise this and develop a systematic strategy to deal with scenarios when code execution does not happen as planned.

Defensive programming

Both Python and R are dynamically typed languages - the interpreter works out the type of each value while the code is running, rather than checking types before execution. Unlike statically typed languages like Java and C, you don't have to declare that some number is an integer or that some function returns a double - the interpreter takes care of that for you. While this is one of the reasons why writing analytics code in these languages is so productive, it is also a common source of errors that only show up at run time.
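To make the risk concrete, here is a small Python sketch. The function `average` and its inputs are made up for illustration. Nothing is checked when the function is defined; the type problem only appears when the offending line actually runs.

```python
def average(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

# Works: every element is a number.
print(average([1, 2, 3]))

# Fails only at run time: one value arrived as a string,
# as can easily happen when data is read from a text file.
try:
    average([1, "2", 3])
except TypeError as e:
    print("Type problem discovered at run time:", e)
```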

A good way to deal with such sources of error (and also many others) is called defensive programming. Let's take an example where we have a function that needs to make sure that parameter inputs are of a specific type. Let's check out the Python example first.

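import math
from numbers import Real

def log_num(num, base):
    # Defend against input that is not a number, e.g. a string read from a file
    if not isinstance(num, Real):
        return "Cannot log: %r is not a number" % (num,)
    # The logarithm is only defined for positive numbers
    if num <= 0:
        return "Cannot log: %s" % num
    elif base == 10:
        return math.log(num, base)
    elif base == "e":
        return math.log(num)
    # Any other base is not supported by this example
    return "Cannot log with base: %r" % (base,)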

The R implementation of this silly function is pretty much the same.

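log_num <- function(num, base) {
  # Defend against input that is not a number
  if (!is.numeric(num))
    paste('Cannot log:', num, 'is not a number')
  else if (num <= 0)
    paste('Cannot log:', num)
  else if (base == 10)
    log(num, base)
  else if (base == 'e')
    log(num)
  else
    paste('Cannot log with base:', base)
}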

Although the examples may be silly the general strategy proves very useful when working with dynamic languages. Of course, languages have built-in capabilities to deal with errors - this is called exception handling.
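One design choice is worth noting: the functions above report bad input by returning a message, which a caller can easily ignore. A common alternative, sketched here under the assumption that the caller is prepared to handle exceptions, is to raise one instead. The name `checked_log` is hypothetical and the function is a variation on the example, not a replacement for it.

```python
import math
from numbers import Real

def checked_log(num, base=10):
    """Return the logarithm of num, raising an exception on unusable input."""
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(num, bool) or not isinstance(num, Real):
        raise TypeError("num must be a real number, got %r" % (num,))
    if num <= 0:
        raise ValueError("num must be positive, got %r" % (num,))
    if base == 10:
        return math.log10(num)
    if base == "e":
        return math.log(num)
    raise ValueError("base must be 10 or 'e', got %r" % (base,))

# The caller decides what to do with each rejected input.
for candidate in [100, "100", -5]:
    try:
        print(checked_log(candidate))
    except (TypeError, ValueError) as e:
        print("Rejected input:", e)
```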

Exception handling

Defensive programming can only take you so far. Programs can fail in unexpected ways and programming languages have built-in capabilities for dealing with such failures. In general this is called exception handling. When something happens that should not - for example, a function that expects a value gets nothing instead - the program will throw an exception to the caller of the function. If nothing handles the exception, the entire program can halt.

But not all program failures are equal. If you can write your program in a way that when it fails it provides helpful information to someone inspecting the failure, then you've come a long way. This means that one should use exceptions to provide helpful error messages so that people can figure out what went wrong and fix it. This becomes especially helpful as programs grow larger and as complexity increases.
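As a rough illustration of what a helpful message can look like, the sketch below wraps a low-level conversion error in an exception that says where the bad value came from. The names `DataLoadError`, `parse_amount` and `sales.csv` are invented for this example.

```python
class DataLoadError(Exception):
    """Raised when a raw value cannot be turned into usable data."""

def parse_amount(raw, row_number, source):
    """Convert one raw field to a float, adding context if that fails."""
    try:
        return float(raw)
    except ValueError as e:
        # Say which source, which row and which value caused the problem,
        # and keep the original exception attached via "from e".
        raise DataLoadError(
            "%s, row %d: cannot convert %r to a number" % (source, row_number, raw)
        ) from e

try:
    amounts = [parse_amount(raw, i, "sales.csv")
               for i, raw in enumerate(["10.5", "n/a", "3"], start=1)]
except DataLoadError as e:
    print("Load failed:", e)
```

A bare `ValueError` would only say that `'n/a'` is not a float. The wrapped version also tells the person inspecting the failure which file and row to look at.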

Python

Let's look at how we can use Python's exception handling mechanisms to provide helpful messages when we try to read data from a csv file into memory.

import pandas as pd

def get_data_from_file(filename):
    """Read data from csv file into pandas DataFrame.
    :param filename: string for file to read
    """
    try:
        data = pd.read_csv(filename)
        return data
    except IOError as e:
        # In Python 3, IOError is an alias of OSError, so a missing file is caught here too
        print("Could not read file: %s" % filename)
        print("More information: ", e)
    finally:
        print("Pandas tried to get data for %s, hope it worked :)" % filename)

data = get_data_from_file('my_data_file.csv')
# The function returns None when the file could not be read
if data is not None:
    print(data.head())

The interpreter will first execute the code in the try block. If, for example, the filename is wrong and points to a non-existent file, then an Input/Output error (IOError) will be raised. We can deal with this situation by catching the IOError in the except block. Whatever we put in that block will be executed if an IOError is raised. In this case it will print our own custom message and whatever message Pandas provides when this error happens. The finally block will always execute, whether or not there is an exception, and even when the try block returns a value. It can be used for clean-up tasks, such as removing a temporary file after reading it.
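The order in which the blocks run can be made visible with a small standard-library sketch. The function `read_first_line` and the file name are hypothetical, and the optional `else` block, which runs only when the try block raised nothing, is included for completeness.

```python
def read_first_line(path):
    """Return the first line of a text file, or None if it cannot be opened."""
    steps = []
    try:
        steps.append("try")
        handle = open(path)
    except IOError:
        # Reached when the file is missing or unreadable.
        steps.append("except")
        result = None
    else:
        # Reached only when open() succeeded.
        steps.append("else")
        with handle:
            result = handle.readline()
    finally:
        # Reached in both cases.
        steps.append("finally")
    print(" -> ".join(steps))
    return result

# Assuming no file with this name exists, this prints: try -> except -> finally
read_first_line("this_file_does_not_exist.txt")
```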

R

A typical program in the data science world would be a script that gets new data from some external source, does some transformation of the data and then stores the results in some form somewhere else. Whenever networking is involved such programs can fail because databases may be temporarily unavailable or because of some other unrelated network issue.

Imagine we had a function as part of our data pipeline that removed outliers from a data frame. But also imagine that this data frame was fetched from a remote database and that we had to deal with the possibility that the database would be unavailable at the time of calling the function. Let's look at one way to use R's exception handling to deal with this in a graceful way.

#' Filter outliers from a data frame
#'
#' This function removes unwanted observations from a data frame.
#'
#' @param df data frame
#'

filter_outliers <- function(df) {
  func_and_param <- length(as.list(match.call()))
  if (func_and_param == 1) {
    df <- get_backup_from_db()
  }
  tryCatch(
    subset(df, df$target_column < 1000),
    error = function(c) { message('Critical problem - ', conditionMessage(c)) },
    warning = function(c) { message('Non-critical problem - ', conditionMessage(c)) },
    message = function(c) { message(conditionMessage(c)) },
    finally = cleanup_local_disk()
  )
}

What if the caller of this function cannot provide the data frame? We would like the call to execute and not break the program, but provide helpful information about what went wrong so that we can take action and fix it. In this function we first try to solve the problem of a missing input by being defensive.

match.call() reproduces the function call with the input values. We then extract that into a list and check if it contains both the name of the function (which it always does) and the parameter (which it only does if we provide it). If the parameter is not provided in the call then the length of this list will be 1. If func_and_param equals 1 we know we need to defend against a missing data frame input. (R's built-in missing(df) performs the same check more directly.) We do this by trying to get backup data from a database. But what if this backup function fails for another reason? The main expression in the function subset(df, df$target_column < 1000) will evaluate and produce an error. There will be nothing to subset.

We can deal with the potential error by wrapping this evaluation in the tryCatch function. We name three handler functions: error, warning and message. Each handler will be called whenever the evaluation of subset(df, df$target_column < 1000) produces the matching condition. If, for example, an error occurs then the error handler will be called. If a warning occurs then the warning handler will be called, and so on.

What will happen when no data frame is given and no database is available? The subset(df, df$target_column < 1000) call will evaluate to an error and the error handler will be called. This handler emits our custom message, which is printed to the console. Note that the message handler of the same tryCatch call is not invoked for it: the handlers only watch the wrapped expression, not each other. The result is that we are told we have had a critical problem without breaking the execution of the program.
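For comparison with the Python section, the same fall-back-then-catch pattern might look roughly like this in Python. This is a sketch only: `get_backup_from_db` is a stand-in that simulates an unavailable database, rows are plain dictionaries rather than a data frame, and the `target_column` name and the threshold of 1000 are carried over from the R example.

```python
def get_backup_from_db():
    """Hypothetical stand-in: pretend the database is unreachable."""
    raise ConnectionError("database unavailable")

def filter_outliers(rows=None):
    """Keep rows whose target_column is below 1000; return [] on failure."""
    try:
        if rows is None:
            # Defensive step: no input given, so fall back to the backup.
            rows = get_backup_from_db()
        return [row for row in rows if row["target_column"] < 1000]
    except (ConnectionError, KeyError, TypeError) as e:
        # Report the problem without stopping the program.
        print("Critical problem - %s: %s" % (type(e).__name__, e))
        return []

print(filter_outliers([{"target_column": 5}, {"target_column": 5000}]))
print(filter_outliers())
```

In this sketch the fall-back call sits inside the try block, so a failing backup is caught by the same handler as a failing filter step.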

A comment

This section started by sketching a scenario where a program is running in the night on some remote machine. This is the rule rather than the exception for production code. It is relatively easy to debug problems in your programs when they happen in front of your eyes and when you can look at what is printed on the screen for help. But when you get to work the morning after something went wrong in the night, you want an audit trail of how and why the problems happened.

Handling errors in a graceful way, and providing helpful information when they occur is the starting point to producing such an audit trail. The second step, and the topic of the next section, is to store the messages in log files that can be inspected after the fact.
