An Arvados pipeline template: VAR files to wormtable (circa late 2014)
These files originally existed in a gist:
This set of shell scripts, chained together by an Arvados pipeline template so that it runs automatically in Arvados, converts a set of VAR files into a set of wormtable files suitable for import into the GA4GH reference server implementation as it stood in late 2014.
This file uses cgatools mkvcf to convert the VAR files into VCF files.
This bash script is complicated because cgatools is VERY picky about the directory structure and file names. Here's what the resultant directory might look like:
Input: a directory with a lot of VAR files in it, ending in .tsv.bz2, e.g. hu132B5C.tsv.bz2
Output: a directory with a lot of VCF files in it, ending in .vcf, e.g. hu132B5C.vcf
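The input-to-output naming described above can be sketched in bash as follows. This only illustrates the file-name mapping (hu132B5C.tsv.bz2 to hu132B5C.vcf). The function name vcf_name_for is hypothetical, and the sketch neither builds the directory layout that cgatools expects nor calls cgatools itself.

```shell
# Hypothetical helper: map a VAR file path to the VCF file name
# that the conversion step would produce for it.
vcf_name_for() {
    local base
    base="$(basename "$1")"                # drop any leading directories
    printf '%s.vcf\n' "${base%.tsv.bz2}"   # swap the .tsv.bz2 suffix for .vcf
}

# List the planned conversions for every VAR file in a directory.
# Assumption: $1 is the input directory, $2 the output directory.
plan_conversions() {
    local indir="$1" outdir="$2" f
    for f in "$indir"/*.tsv.bz2; do
        [ -e "$f" ] || continue            # skip if the glob matched nothing
        printf '%s -> %s/%s\n' "$f" "$outdir" "$(vcf_name_for "$f")"
    done
}
```

Calling plan_conversions with an input and an output directory prints one line per VAR file, which is a cheap way to check the naming before starting a long conversion run.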
This file merges the directory of VCF files into a single VCF file by discarding all the headers except for the header from the first file, and then concatenating all the files.
Input: a directory with a lot of VCF files in it
Output: a single VCF file, i.e. mergedvcf.vcf
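A minimal sketch of the merge described above, assuming that header lines are exactly the lines starting with "#" and that every file ends with a newline. The function name merge_vcfs is hypothetical, and this is not necessarily how the original script does it.

```shell
# Merge all *.vcf files in a directory into one file.
# The first file (in sorted glob order) is copied whole, header included;
# every later file contributes only its non-header lines.
merge_vcfs() {
    local indir="$1" outfile="$2" first=1 f
    : > "$outfile"                          # start from an empty output file
    for f in "$indir"/*.vcf; do
        [ -e "$f" ] || continue             # glob matched nothing
        [ "$f" -ef "$outfile" ] && continue # never read the output as an input
        if [ "$first" -eq 1 ]; then
            cat "$f" >> "$outfile"
            first=0
        else
            # grep exits non-zero when a file has no data lines; ignore that
            grep -v '^#' "$f" >> "$outfile" || true
        fi
    done
}
```

For example, merge_vcfs "$INDIR" "$OUTDIR/mergedvcf.vcf" would produce the single file that the next step consumes. The variable names here are placeholders.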
Input: mergedvcf.vcf
Output: a directory with wormtable files
- index_CHROM+ID.db
- index_CHROM+ID.xml
- index_CHROM+POS.db
- index_CHROM+POS.xml
- table.dat
- table.db
- table.xml
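After the wormtable step finishes, it can be useful to confirm that the output directory contains every file listed above. The following is a small sketch of such a check. The function name check_wormtable_dir is hypothetical and is not part of wormtable or of the original scripts.

```shell
# Return 0 if the directory contains all expected wormtable files,
# otherwise report each missing file on stderr and return 1.
check_wormtable_dir() {
    local dir="$1" name missing=0
    for name in index_CHROM+ID.db index_CHROM+ID.xml \
                index_CHROM+POS.db index_CHROM+POS.xml \
                table.dat table.db table.xml; do
        if [ ! -f "$dir/$name" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A pipeline step could call check_wormtable_dir on its output directory and fail early if anything is missing, rather than letting a later import fail.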
The Arvados pipeline template that orchestrates the above set of steps.
For more information, see:
This relies on the nancy/cgatools-wormtable Docker image, which may also be found at
To set up your own docker image, see the following steps.
$ sudo apt-get install
$ sudo groupadd docker
$ sudo gpasswd -a $USER docker #in my case, I replace $USER with "nancy"
$ sudo service docker restart
$ exec su -l $USER #if you don't want to login+out or spawn a new shell
$ docker pull arvados/jobs
$ docker run -ti arvados/jobs /bin/bash
root@4fa648c759f3:/# apt-get update
cgatools is not super-pleasant to install. Here is my step-by-step for Ubuntu 14.04 (trusty) / Debian 7.8 (wheezy).
cd /home
mkdir nrw
cd nrw
mkdir src
mkdir local
mkdir local/bin
mkdir local/share
mkdir local/share/cgatools-1.8.0/
mkdir local/share/cgatools-1.8.0/doc
mkdir data
mkdir data/ref
apt-get install cmake
curl -O ""
tar -xvf cgatools-
cp cgatools- /home/nrw/local/bin
cp cgatools-* /home/nrw/local/share/cgatools-1.8.0/doc
vi ~/.bashrc
export PATH=$PATH:/home/nrw/local/bin
source ~/.bashrc
hash -r
root@4fa648c759f3:/# cgatools #yep, this works!
cgatools version 1.8.0 build 1
cd /home/nrw/data
curl -O #this step takes about an hour.
:/# apt-get install libdb-dev
:/# pip install wormtable
:/# which vcf2wt #check it's installed in the correct place
root@4fa648c759f3:/# exit
$ docker commit 4fa648c759f3 nancy/cgatools-wormtable
In line 33:
/home/nrw/local/bin/cgatools mkvcf --beta --genome-root $DIR2 --source-names masterVar --reference /home/nrw/data/ref/build37.crr --output $OUTDIR/$shortfilename.vcf --field-names GT
In line 9:
vcf2wt $INDIR/mergedvcf.vcf --truncate --quiet -tf $OUTDIR