Anonymisation

A common request is for some data in a csv file for ad hoc analysis. In many cases this data will be at a per-user level. As the provider of such data you cannot give out any information about who these users are. At some level in the database you're working with you may be required to identify people uniquely by their phone numbers (to take one example), but you cannot in good conscience give out this information. So how do you quickly deal with this? Use a hash function to replace each identifier with its digest.
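As a minimal sketch of the idea, the snippet below hashes single phone numbers with Python's standard library. It shows why hashing suits this task: the same input always gives the same identifier, so per-user rows still line up. The function names and the key are hypothetical. The keyed variant (HMAC) is a variation not used elsewhere on this page; it is included on the assumption that you want to make it harder for someone to recover numbers by hashing every possible phone number and comparing.

```python
import hashlib
import hmac


def hash_msisdn(msisdn: str) -> str:
    """Return the SHA-1 hex digest of a phone number given as text."""
    return hashlib.sha1(msisdn.encode("utf-8")).hexdigest()


def keyed_hash_msisdn(msisdn: str, key: bytes) -> str:
    """Variation (not in the original): HMAC-SHA256 with a secret key."""
    return hmac.new(key, msisdn.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key for illustration only; a real one would be kept private.
SECRET_KEY = b"replace-with-a-private-random-key"

first = hash_msisdn("+474247334")
again = hash_msisdn("+474247334")
other = hash_msisdn("+474034018")

print(first == again)  # same input, same identifier
print(first == other)  # different inputs, different identifiers
print(len(first))      # SHA-1 hex digests are 40 characters long
print(len(keyed_hash_msisdn("+474247334", SECRET_KEY)))  # 64 for SHA-256
```

With the keyed variant, identifiers are only comparable between files produced with the same key.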

Hypothetical data

Suppose you have the following input data in a csv file named 'msisdn.csv', and all you have to do is anonymise the phone numbers and leave x1 and x2 unchanged (whatever they may be).

msisdn,x1,x2
"+2107448920",0,19
"+474247334",3,2
"+474034018",12,8
"+466778776",10,11
"+430091909",2,20

Python

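import hashlib

import pandas as pd

# Read msisdn as text; otherwise pandas may parse the numbers as integers.
data = pd.read_csv("msisdn.csv", dtype={"msisdn": str})
print(data)

# Replace each phone number with its SHA-1 hex digest (sha1 needs bytes).
data.msisdn = data.msisdn.apply(lambda x: hashlib.sha1(x.encode("utf-8")).hexdigest())
print(data)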

The hashlib module is part of the Python standard library, so it needs no installation; pandas has to be installed separately.

The example uses the apply method of the msisdn column (a pandas Series) to apply a Python lambda function, also known as an anonymous function, to each value in the column. No explicit loop is necessary, which sits well with a 'vectorised' way of thinking about data.

While there are more verbose and perhaps less cryptic ways of doing the same thing, getting to know this style will be very helpful with common data tasks.
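One of the more verbose ways might look like the sketch below, which uses only the standard library's csv module, a named function instead of a lambda, and an explicit loop. The helper names anonymise and anonymise_rows are hypothetical, and an in-memory string stands in for the first rows of 'msisdn.csv' so that the snippet runs on its own.

```python
import csv
import hashlib
import io


def anonymise(msisdn):
    """Return the SHA-1 hex digest of a phone number given as text."""
    return hashlib.sha1(msisdn.encode("utf-8")).hexdigest()


def anonymise_rows(reader, column="msisdn"):
    """Yield each row with the chosen column replaced by its hash."""
    for row in reader:
        row = dict(row)
        row[column] = anonymise(row[column])
        yield row


# In-memory stand-in for the first rows of 'msisdn.csv'.
sample = io.StringIO(
    'msisdn,x1,x2\n'
    '"+2107448920",0,19\n'
    '"+474247334",3,2\n'
)

reader = csv.DictReader(sample)
rows = list(anonymise_rows(reader))

# Write the anonymised rows back out in csv form.
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=reader.fieldnames)
writer.writeheader()
writer.writerows(rows)
print(output.getvalue())
```

To work on the real file, replace the StringIO objects with open('msisdn.csv', newline='') and an output file opened for writing.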

R

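library(digest)

# Read msisdn as text so the hash is computed on the string itself.
data <- read.csv('msisdn.csv', colClasses = c(msisdn = 'character'))
print(data)

# serialize = FALSE hashes the raw string; USE.NAMES = FALSE stops sapply
# from keeping the original numbers as element names.
data$msisdn <- sapply(data$msisdn, digest, algo = 'sha1', serialize = FALSE, USE.NAMES = FALSE)
print(data)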

The digest package is not part of base R and has to be installed from CRAN like any other package: install.packages('digest').

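The sapply function applies the digest hash function to each value in the data$msisdn column. The function is passed by name, and any extra arguments given to sapply are handed on to digest, so we do not need to specify an anonymous function here.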

Want to add to the wiki?

Please drop me a mail
