Skip to content

Commit

Permalink
routine sync
Browse files Browse the repository at this point in the history
  • Loading branch information
igordot committed Jul 15, 2019
1 parent 4191229 commit f7f5ff6
Show file tree
Hide file tree
Showing 2 changed files with 26 additions and 16 deletions.
6 changes: 3 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,9 +3,9 @@ computational genomics resources
contents:

* `notes`: notes and simple computational tasks
* `workflows`: multi-step protocols for specific tasks
* `scripts`: complete scripts for simple tasks
* `scripts-phoenix`: scripts optimized for the phoenix cluster (include references to data and system variables)
* `workflows`: multi-step protocols
* `scripts`: complete scripts for specific tasks
* `scripts-bigpurple` and `scripts-phoenix`: scripts optimized for a specific cluster (include references to data and system variables)

related repositories:

Expand Down
36 changes: 23 additions & 13 deletions workflows/draft-genome-init.md
Original file line number Diff line number Diff line change
@@ -1,10 +1,10 @@
# Initializing draft genome directory


The requirement for [setting up a reference genome directory]
(https://github.com/igordot/genomics/blob/master/workflows/ref-genome-init.md) is having two basic files:
- `genome.fa` - genome sequence in FASTA format
- `genes.gtf` - gene annotations in GTF format
The requirement for [setting up a reference genome directory](https://github.com/igordot/genomics/blob/master/workflows/ref-genome-init.md) is having two basic files:

* `genome.fa` - genome sequence in FASTA format
* `genes.gtf` - gene annotations in GTF format

Properly formatted sequence and annotation files are readily available for commonly analyzed genomes. Less popular and
draft genomes are less standardized and can be more difficult to work with.
Expand All @@ -19,16 +19,20 @@ When working with a new genome, Ensembl is usually a good place to start as it c
many species. You can find the [SaiBol1 genome](http://pre.ensembl.org/Saimiri_boliviensis/Info/Index).

The FASTA and GTF files are available and can be downloaded:
```

```bash
wget -O genome.ensembl.pre.fa.gz ftp://ftp.ensembl.org/pub/pre/fasta/dna/saimiri_boliviensis/Saimiri_boliviensis.SaiBol1.0.dna_rm.toplevel.fa.gz
wget -O genes.ensembl.pre.gtf.gz ftp://ftp.ensembl.org/pub/pre/gtf/saimiri_boliviensis/SaiBol1.0.genes.gtf.gz
```

Once you have the reference files, it's a good idea to spot-check them.
```

```bash
zcat genome.ensembl.pre.fa.gz | grep ">" | head
```

Output:

```
>scaffold:SaiBol1.0:JH378105.1:1:72162052:1 scaffold JH378105.1
>scaffold:SaiBol1.0:JH378106.1:1:71252344:1 scaffold JH378106.1
Expand All @@ -42,10 +46,12 @@ Output:
>scaffold:SaiBol1.0:JH378114.1:1:44255708:1 scaffold JH378114.1
```

```
```bash
zcat genes.ensembl.pre.gtf.gz | head
```

Output:

```
JH378796.1 protein_coding exon 805 910 . + . gene_id "ENSP00000271139_1"; transcript_id "ENSP00000271139_1"; exon_number "1"; gene_biotype "protein_coding";
JH378796.1 protein_coding CDS 805 910 . + 0 gene_id "ENSP00000271139_1"; transcript_id "ENSP00000271139_1"; exon_number "1"; gene_biotype "protein_coding"; protein_id "ENSP00000271139_1";
Expand All @@ -72,10 +78,12 @@ NCBI has a lot of databases, so it can be difficult to navigate. It also has the
[SaiBol1 genome](https://www.ncbi.nlm.nih.gov/genome/6907), which is also based GCA_000235385.1 like Ensembl.

There is no GTF, but there is a GFF.
```

```bash
wget -O genome.ncbi.fa.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/Saimiri_boliviensis/CHR_Un/39432_ref_SaiBol1.0_chrUn.fa.gz
wget -O genes.ncbi.gff3.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/Saimiri_boliviensis/GFF/ref_SaiBol1.0_scaffolds.gff3.gz
```

There are actually two GFF files: `scaffolds.gff3.gz` and `top_level.gff3.gz`, but they are identical based on both file
size and `md5sum`.

Expand Down Expand Up @@ -110,19 +118,21 @@ NW_003943604.1 Gnomon gene 16666 24141 . + . ID=gene1;Dbxref=GeneID:101050261;Na
These reference files aren't perfect either, but at least the gene names are present.

Fix the FASTA contig names so they match the GFF contig names (`NW_*` or `NC_*`):
```

```bash
zcat genome.ncbi.fa.gz | perl -pe 's/gi\|.*\|(N._.+?)\|.*/\1/g' > genome.fa
```

Convert GFF to GTF using `gffread` (part of Cufflinks suite):
```

```bash
zcat genes.ncbi.gff3.gz | gffread - -T -o genes.ncbi.gff3.gtf
```

A few (24 out of over 850,000) of the GTF entries do not contain `gene_id` or `gene_name`. Remove those:
```

```bash
cat genes.ncbi.gff3.gtf | grep "gene_name" > genes.gtf
```

This leaves us with a clean `genome.fa` and `genes.gtf` for [setting up a reference genome directory]
(https://github.com/igordot/genomics/blob/master/workflows/ref-genome-init.md).
This leaves us with a clean `genome.fa` and `genes.gtf` for [setting up a reference genome directory](https://github.com/igordot/genomics/blob/master/workflows/ref-genome-init.md).

0 comments on commit f7f5ff6

Please sign in to comment.