From f7f5ff6ebed1343ab7f4efeba30788209239e00e Mon Sep 17 00:00:00 2001
From: igor <igordot@users.noreply.github.com>
Date: Mon, 15 Jul 2019 13:48:26 -0400
Subject: [PATCH] routine sync

---
 README.md                      |  6 +++---
 workflows/draft-genome-init.md | 36 ++++++++++++++++++++++------------
 2 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/README.md b/README.md
index 09296de..1a1d2ef 100644
--- a/README.md
+++ b/README.md
@@ -3,9 +3,9 @@ computational genomics resources
 contents:
 
 * `notes`: notes and simple computational tasks
-* `workflows`: multi-step protocols for specific tasks
-* `scripts`: complete scripts for simple tasks
-* `scripts-phoenix`: scripts optimized for the phoenix cluster (include references to data and system variables)
+* `workflows`: multi-step protocols
+* `scripts`: complete scripts for specific tasks
+* `scripts-bigpurple` and `scripts-phoenix`: scripts optimized for a specific cluster (include references to data and system variables)
 
 related repositories:
 
diff --git a/workflows/draft-genome-init.md b/workflows/draft-genome-init.md
index fe243cb..5d218b0 100644
--- a/workflows/draft-genome-init.md
+++ b/workflows/draft-genome-init.md
@@ -1,10 +1,10 @@
 # Initializing draft genome directory
 
 
-The requirement for [setting up a reference genome directory]
-(https://github.com/igordot/genomics/blob/master/workflows/ref-genome-init.md) is having two basic files:
- - `genome.fa` - genome sequence in FASTA format
- - `genes.gtf` - gene annotations in GTF format
+The requirement for [setting up a reference genome directory](https://github.com/igordot/genomics/blob/master/workflows/ref-genome-init.md) is having two basic files:
+
+* `genome.fa` - genome sequence in FASTA format
+* `genes.gtf` - gene annotations in GTF format
 
 Properly formatted sequence and annotation files are readily available for commonly analyzed genomes. Less popular and 
 draft genomes are less standardized and can be more difficult to work with.
@@ -19,16 +19,20 @@ When working with a new genome, Ensembl is usually a good place to start as it c
 many species. You can find the [SaiBol1 genome](http://pre.ensembl.org/Saimiri_boliviensis/Info/Index).
 
 The FASTA and GTF files are available and can be downloaded:
-```
+
+```bash
 wget -O genome.ensembl.pre.fa.gz ftp://ftp.ensembl.org/pub/pre/fasta/dna/saimiri_boliviensis/Saimiri_boliviensis.SaiBol1.0.dna_rm.toplevel.fa.gz
 wget -O genes.ensembl.pre.gtf.gz ftp://ftp.ensembl.org/pub/pre/gtf/saimiri_boliviensis/SaiBol1.0.genes.gtf.gz
 ```
 
 Once you have the reference files, it's a good idea to spot-check them.
-```
+
+```bash
 zcat genome.ensembl.pre.fa.gz | grep ">" | head
 ```
+
 Output:
+
 ```
 >scaffold:SaiBol1.0:JH378105.1:1:72162052:1 scaffold JH378105.1
 >scaffold:SaiBol1.0:JH378106.1:1:71252344:1 scaffold JH378106.1
@@ -42,10 +46,12 @@ Output:
 >scaffold:SaiBol1.0:JH378114.1:1:44255708:1 scaffold JH378114.1
 ```
 
-```
+```bash
 zcat genes.ensembl.pre.gtf.gz | head
 ```
+
 Output:
+
 ```
 JH378796.1	protein_coding	exon	805	910	.	+	.	 gene_id "ENSP00000271139_1"; transcript_id "ENSP00000271139_1"; exon_number "1"; gene_biotype "protein_coding";
 JH378796.1	protein_coding	CDS	805	910	.	+	0	 gene_id "ENSP00000271139_1"; transcript_id "ENSP00000271139_1"; exon_number "1"; gene_biotype "protein_coding"; protein_id "ENSP00000271139_1";
@@ -72,10 +78,12 @@ NCBI has a lot of databases, so it can be difficult to navigate. It also has the
 [SaiBol1 genome](https://www.ncbi.nlm.nih.gov/genome/6907), which is also based GCA_000235385.1 like Ensembl.
 
 There is no GTF, but there is a GFF.
-```
+
+```bash
 wget -O genome.ncbi.fa.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/Saimiri_boliviensis/CHR_Un/39432_ref_SaiBol1.0_chrUn.fa.gz
 wget -O genes.ncbi.gff3.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/Saimiri_boliviensis/GFF/ref_SaiBol1.0_scaffolds.gff3.gz
 ```
+
 There are actually two GFF files: `scaffolds.gff3.gz` and `top_level.gff3.gz`, but they are identical based on both file 
 size and `md5sum`.
 
@@ -110,19 +118,21 @@ NW_003943604.1	Gnomon	gene	16666	24141	.	+	.	ID=gene1;Dbxref=GeneID:101050261;Na
 These reference files aren't perfect either, but at least the gene names are present.
 
 Fix the FASTA contig names so they match the GFF contig names (`NW_*` or `NC_*`):
-```
+
+```bash
 zcat genome.ncbi.fa.gz | perl -pe 's/gi\|.*\|(N._.+?)\|.*/\1/g' > genome.fa
 ```
 
 Convert GFF to GTF using `gffread` (part of Cufflinks suite):
-```
+
+```bash
 zcat genes.ncbi.gff3.gz | gffread - -T -o genes.ncbi.gff3.gtf
 ```
 
 A few (24 out of over 850,000) of the GTF entries do not contain `gene_id` or `gene_name`. Remove those:
-```
+
+```bash
 cat genes.ncbi.gff3.gtf | grep "gene_name" > genes.gtf
 ```
 
-This leaves us with a clean `genome.fa` and `genes.gtf` for [setting up a reference genome directory]
-(https://github.com/igordot/genomics/blob/master/workflows/ref-genome-init.md).
+This leaves us with a clean `genome.fa` and `genes.gtf` for [setting up a reference genome directory](https://github.com/igordot/genomics/blob/master/workflows/ref-genome-init.md).