
Delimited data

Leon du Toit edited this page Sep 2, 2014 · 6 revisions

Delimited data is a catch-all term for data, usually stored in files, in which the values belonging to different variables are separated by a specific character. Common delimiters include the comma, the tab, the bar (|) and the semicolon. Working with such files is extremely common, and being comfortable reading and writing them is a must-have skill for data work. Both Python and R provide high-level, computationally efficient tools for doing this. Since csv (comma separated values) files are the most common, I'll stick to those in the following examples. Suppose we have the following file called ex.csv with a header row and three rows of data:

col1,col2,col3
1,4,6
2,90,1
27,77,90

Let's read this file, manipulate the data and write it to another file.

Python

Once again, the pandas library provides wonderful high-level tools for reading and writing from and to csv files.

import csv
import pandas as pd

df = pd.read_csv("ex.csv")
df["col4"] = ["me", "you", "me"]
belongings = df.groupby("col4").mean() # how much do we have?
belongings.to_csv(
  "grouped_ex.csv",
  sep = ",",
  float_format = "%d",
  header = True,
  quoting = csv.QUOTE_NONE)

When writing the grouped results to a csv file one can specify many different details. Here I chose the file name, the delimiter, how floating point numbers should be formatted, that I wanted column names in the header, and lastly that there should be no quotes around strings (note that to_csv's quoting argument expects one of the csv module's quoting constants, such as csv.QUOTE_NONE, rather than a boolean). Doing the same using only the Python standard library would be much more typing and typically more error prone. It is a good idea to use higher-level functions when you can, especially in the context of data analysis.
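To see why, here is a rough sketch of the same grouping done with only the standard library's csv module. The output file name grouped_ex_stdlib.csv is my own placeholder, and the mean computation has to be spelled out by hand; the sketch recreates ex.csv at the top so it runs as-is:

```python
import csv
from collections import defaultdict

# Recreate the sample file so this sketch is self-contained.
with open("ex.csv", "w", newline="") as f:
    f.write("col1,col2,col3\n1,4,6\n2,90,1\n27,77,90\n")

col4 = ["me", "you", "me"]  # the extra column, added by hand as above

# Group the numeric rows by their col4 label.
groups = defaultdict(list)
with open("ex.csv", newline="") as f:
    for label, row in zip(col4, csv.DictReader(f)):
        groups[label].append([int(row[c]) for c in ("col1", "col2", "col3")])

# Compute per-group means and write them out, mimicking float_format = "%d".
with open("grouped_ex_stdlib.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["col4", "col1", "col2", "col3"])
    for label in sorted(groups):
        rows = groups[label]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        writer.writerow([label] + ["%d" % m for m in means])
```

Even for this tiny example the bookkeeping (grouping, averaging, formatting) is all manual, which is exactly where bugs creep in.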

R

Due to R's explicit focus on statistical computing, it does a much better job than Python's standard library at providing high-level tools for file reading and writing. Suppose we want to read the file called ex.csv, do some manipulation and write the new data to another file:

df <- read.csv("ex.csv", stringsAsFactors = FALSE) # to get useful string types
df$col4 <- c("me", "you", "me")
# aggregate only the numeric columns, grouping by col4
transformed_ex <- aggregate(df[1:3], by = list(col4 = df$col4), FUN = mean)
write.csv(
  transformed_ex,
  "transformed_ex.csv",
  quote = FALSE,
  row.names = FALSE)

There are two aspects to address here: file reading and tabular data manipulation. Base R does a good job of reading files when your data is small, but with medium-sized files (e.g. 1 GB and up) run-times for file reading become uncomfortably slow. The solution is to use a library: data.table. data.table has a function for file reading called fread that is orders of magnitude faster than base R. Simply do this...

install.packages('data.table')
library(data.table)
df <- fread("large_file.csv")

...and you'll have your data in a data.table (an enhanced data frame) ready for manipulation.
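On the Python side, pandas offers a comparable escape hatch for files that don't fit comfortably in memory: the chunksize argument to read_csv streams the file in pieces. A minimal sketch, computing column means chunk by chunk (the file name and chunk size are placeholders; the sketch recreates a small stand-in file so it runs as-is):

```python
import pandas as pd

# Recreate the small sample data as a stand-in; in practice
# "large_file.csv" would be a genuinely big file.
with open("large_file.csv", "w") as f:
    f.write("col1,col2,col3\n1,4,6\n2,90,1\n27,77,90\n")

# Stream the file in chunks, accumulating running column sums.
total = None
rows = 0
for chunk in pd.read_csv("large_file.csv", chunksize=2):
    total = chunk.sum() if total is None else total + chunk.sum()
    rows += len(chunk)

column_means = total / rows
```

This keeps only one chunk in memory at a time, at the cost of restructuring the computation around partial results.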

The second aspect is tabular data manipulation. In the Python section above the pandas library is used to perform a grouping and aggregation operation on the data. Such tabular data manipulation is the bread and butter of practical data science. In R one has roughly three options for working with tabular data, depending on the use case and/or personal preference: 1) use only base R; 2) use data.table; or 3) use dplyr.

As the tiny data manipulation example above showed, it takes some extra work to get the data into a neat format in R, compared to Python and pandas. It can also be slow in comparison. For those reasons libraries like data.table and dplyr are often much better suited to the task. Below I show how to do the aggregation with both:

library(data.table)
library(dplyr)
df <- read.csv("ex.csv", stringsAsFactors = FALSE)
df$col4 <- c("me", "you", "me")

# with data.table
dt_aggregated <- data.table(df)[, 
  list(
    col1 = mean(col1), 
    col2 = mean(col2), 
    col3 = mean(col3)), 
  by = col4]

# with dplyr
dp_aggregated <- df %>%
  group_by(col4) %>%
  summarise(
    col1 = mean(col1),
    col2 = mean(col2),
    col3 = mean(col3))

Both yield the same results but differ in syntax and implementation. Compared to base R both will be much faster, and the code is, in my opinion, cleaner.
