Anonymisation
A common request is to get some data in a csv file for ad hoc analysis. In many cases this data will be on a per-user level. As a provider of such data, you cannot give out any information about who these users are. It may be that at some level in the database you're working with you are required to uniquely identify people by their phone numbers (to take one example), but you cannot in good conscience give out this information. So how do you quickly deal with this? Use a hash function.
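As a minimal sketch of why a hash function fits here, the snippet below hashes a few of the sample phone numbers with SHA-1 from the standard library. The helper name hash_id is only illustrative. The same input always produces the same digest, so each user remains distinguishable in the output without the phone number itself appearing.

```python
import hashlib


def hash_id(identifier):
    """Return the SHA-1 hex digest of a string identifier (illustrative helper)."""
    # hashlib works on bytes, so the string is encoded first
    return hashlib.sha1(identifier.encode("utf-8")).hexdigest()


first = hash_id("+2107448920")
again = hash_id("+2107448920")
other = hash_id("+474247334")

print(first == again)  # True: the same number always maps to the same digest
print(first == other)  # False: these two numbers map to different digests
print(len(first))      # 40: a SHA-1 hex digest is 40 characters long
```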
Suppose you had the following input data in a csv file, and all you had to do was anonymise the phone numbers and leave x1 and x2 unchanged (whatever they may be). Suppose the file is named 'msisdn.csv'.
msisdn,x1,x2
"+2107448920",0,19
"+474247334",3,2
"+474034018",12,8
"+466778776",10,11
"+430091909",2,20
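import hashlib
import pandas as pd

# Read msisdn as text so that pandas does not try to parse it as a number
data = pd.read_csv("msisdn.csv", dtype={"msisdn": str})
print(data)
# hashlib expects bytes in Python 3, so encode each string before hashing
data.msisdn = data.msisdn.apply(lambda x: hashlib.sha1(x.encode("utf-8")).hexdigest())
print(data)
The hashlib module is part of the Python standard library, so you can import it without installing anything.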
The example uses the apply method of the pandas column (a Series) to run a Python lambda function, also known as an anonymous function, on each value in the column. No explicit loop is necessary, which sits well with a 'vectorised' way of thinking about data.
While there are more verbose and perhaps less cryptic ways of doing the same thing, getting to know this style will be very helpful with common data tasks.
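For comparison, here is a sketch of what one of the more verbose versions might look like, using a named function and an explicit loop over a plain Python list. The names anonymise and anonymise_all are made up for this illustration and do not come from any library.

```python
import hashlib


def anonymise(value):
    """Return the SHA-1 hex digest of a single value, e.g. a phone number."""
    return hashlib.sha1(str(value).encode("utf-8")).hexdigest()


def anonymise_all(values):
    """Explicit-loop version: hash every element and collect the results."""
    hashed = []
    for value in values:
        hashed.append(anonymise(value))
    return hashed


msisdns = ["+2107448920", "+474247334", "+474034018"]
hashed = anonymise_all(msisdns)
for original, digest in zip(msisdns, hashed):
    print(original, "->", digest)
```

With pandas, the named function can be passed to apply in place of the lambda, for example data.msisdn.apply(anonymise), which gives the same result as the loop.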
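library(digest)
# Read msisdn as text rather than letting read.csv convert it to a number
data <- read.csv('msisdn.csv', colClasses = c(msisdn = 'character'))
print(data)
# Hash the raw string with SHA-1; USE.NAMES = FALSE stops sapply from naming the result after the original numbers
data$msisdn <- sapply(data$msisdn, digest, algo = 'sha1', serialize = FALSE, USE.NAMES = FALSE)
print(data)
The digest package is not part of base R and has to be installed from CRAN like any other package: install.packages('digest').
The sapply function applies the digest hash function to every element of the data$msisdn column. We do not need to specify an anonymous function here, because sapply passes the extra arguments (algo and serialize) on to digest.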
- Please drop me a mail