Version control

I've chosen to discuss this very early on for two reasons: firstly, because understanding why and how version control is used is a major advantage in software development; and secondly, because it is new to many analysts who have not been trained as software developers. Adopt version control as part of your normal workflow as early as possible. All code should be managed by a version control system, whether it is a 100-line script that estimates a regression or a 10 000-line project that provides a full-scale analytics dashboard. While there are other popular version control systems (like Mercurial and Subversion), I will focus on git.

Git

Git is a distributed version control system. To follow and replicate the examples discussed in this wiki, you have to download and install git on your computer and sign up for GitHub. Once you have done this, you can clone the repository of this wiki and you will have a local copy of all the code in it.
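To show what cloning gives you, here is a minimal sketch. The address of this wiki's repository is not given here, so the example builds a small stand-in repository in a temporary directory and clones that instead. The paths, the file name script.py and the user name and email are all placeholders.

```shell
# Scratch area; every path below is a stand-in, not the wiki's real location
work=$(mktemp -d)

# Build a small repository to clone from
git init -q "$work/origin"
git -C "$work/origin" config user.name "Example Analyst"
git -C "$work/origin" config user.email "analyst@example.com"
echo "print('hello')" > "$work/origin/script.py"
git -C "$work/origin" add script.py
git -C "$work/origin" commit -q -m "Add example script"

# Clone it: the copy contains the files and the full history
git clone -q "$work/origin" "$work/copy"
ls "$work/copy"
git -C "$work/copy" log --oneline
```

With a hosted repository you would pass its URL to git clone in place of the local path.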

Having a git repository for your code, rather than just having a directory with files, is like having a laboratory for your experiments versus a parking lot. Both will allow you to get work done, but one environment has been designed explicitly to help you focus on getting your work done, while the other is just enough to get by.

A repository is a database that keeps track of the state of your files and directories over time - all the way down to each character change. You can move backward to the way things were a few weeks ago and then fast forward to the present state. You can create multiple versions side by side within the same repo - these are called branches. Changes to code are organised around commits - sets of changes that are applied to a repo over time. With git, anyone can get a copy of a repo, see all of this and contribute from anywhere. It really is like having a virtual laboratory.
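The ideas of moving backward, fast forwarding and branching can be tried out in a throwaway repo. This is only a sketch: the file notes.txt, the branch name experiment and the user name and email are made up for illustration.

```shell
# Create a scratch repository in a temporary directory
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Analyst"
git config user.email "analyst@example.com"

# Two commits, so there is some history to move through
echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "Add first draft of notes"

echo "second draft" > notes.txt
git commit -q -am "Revise notes"

# Move backward: look at the files as they were one commit ago
git checkout -q HEAD~1
cat notes.txt

# Fast forward: return to the branch you were on
git checkout -q -
cat notes.txt

# Start a side-by-side version of the project on a new branch
git checkout -q -b experiment
git branch
```

The first cat shows the old contents and the second shows the current ones; nothing is lost by looking at an earlier state.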

A basic workflow

I will cover only what I see as the bare essentials of git because it is already extremely well documented on the official site and there are many great books about it (Pro Git, for example).

Let's say you want to change an already existing file and add a new one...

$ echo "adding some changes to some file" >> somefile
$ touch my_new_file
$ echo "putting a random comment in my new file" >> my_new_file

# check what git has to say about the changes
$ git status

# make sure you are working on the right branch
$ git branch

# tell git which changes you want to add to the next commit
$ git add somefile
$ git add my_new_file

# commit the changes you've added and write a message
$ git commit -m 'this is the commit message text'

# check how your latest contribution sits with all others
$ git log

Typically your commit message should have a short header that describes what the commit does when it is applied to the repo. More details can be given below the header if needed, but they should be brief and descriptive. Readable and sensible commit messages are part of making a project well documented and well understood. Being disciplined with this will make for easier collaboration.
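One way to write a header and a body from the command line is to pass -m twice: git treats each -m as a separate paragraph, so the first becomes the header and the second the body. The sketch below uses a scratch repo; the file model.txt, the message text and the user name and email are invented for illustration.

```shell
# Scratch repository for the example
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Analyst"
git config user.email "analyst@example.com"

echo "y ~ x1 + x2" > model.txt
git add model.txt

# First -m is the short header, second -m is the body below it
git commit -q -m "Add baseline regression specification" \
    -m "Start with two regressors so later changes can be compared against a simple model."

# Show the header and the body of the latest commit separately
git log -1 --format=%s
git log -1 --format=%b
```

Running git commit without -m opens your editor instead, which is often more comfortable for longer messages.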

Typical project structure

As mentioned, a repo can have multiple branches. A typical real-world project might have four branches: master, test, staging and production. One common workflow is this: all new development happens on master, and testing happens on test, which could be an always-updated version of either master or staging. Once a change or a set of changes is ready, it is merged into the staging branch and deployed to a staging system. From there it is merged into the production branch if all goes well. Therefore, production is typically behind staging, and staging is typically behind master.
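The promotion from master to staging to production can be sketched in a scratch repo as follows. This is an illustration under simplifying assumptions, not a prescribed process: the test branch and the deployment steps are left out, and the file analysis.txt and the user name and email are placeholders.

```shell
# Scratch repository with a first commit on master
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Analyst"
git config user.email "analyst@example.com"
echo "v1" > analysis.txt
git add analysis.txt
git commit -q -m "Add first version of analysis"
git branch -M master   # make sure the branch is called master

# staging and production start from the same state
git branch staging
git branch production

# New development happens on master
echo "v2" > analysis.txt
git commit -q -am "Update analysis to v2"

# The change is ready: merge master into staging
git checkout -q staging
git merge -q master

# production is now one commit behind staging
git rev-list --count production..staging

# All went well on staging: merge staging into production
git checkout -q production
git merge -q staging
```

Because staging and production have no commits of their own here, both merges simply move the branch forward to the newer commit.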

Git aliases

log, status and branch are some of the most common commands you will use in git. Having aliases to shorten the typing will prove very handy.

alias gl='git log'
alias gs='git status'
alias gb='git branch'
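Git also has its own alias mechanism, stored in its configuration, which is an alternative to the shell aliases above. The sketch below sets the aliases inside a scratch repo so that it does not touch your real settings; the short names st, br and lg are arbitrary choices.

```shell
# Scratch repository so the aliases stay local to this example
repo=$(mktemp -d)
cd "$repo"
git init -q

# Define git-level aliases: "git st" will run "git status", and so on
git config alias.st status
git config alias.br branch
git config alias.lg "log --oneline"

# Use one of them
git st
```

Adding --global after git config would store the aliases for all your repos instead of only this one.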

Want to add to the wiki?

Please drop me a mail
