Reference sequence databases: BOLD

dickgroenenberg edited this page Oct 21, 2020 · 57 revisions

Create a BOLD blast database with taxonomy

First, the data for Arthropoda must be downloaded manually (i.e. using a web browser) because of issues with the API. Download the public datasets sequences.fasta and specimens.tsv. Note: these files are approximately 3.4 GB and 2.5 GB, respectively (as of February 2020), and download speeds through the web interface are limited. Make sure the files are named as indicated and located in the input folder (<path_to_galaxy-tool-BLAST>/utilities/bold/input/) specified in config.yml.
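Because the pipeline expects exact file names, it can help to confirm that both downloads are in place before starting. The following is a minimal sketch, not part of the repository: the function name check_bold_inputs is hypothetical, and the directory passed to it is assumed to be the input folder set in config.yml.

```shell
#!/bin/sh
# Hypothetical helper: verify that the two manually downloaded BOLD files
# exist and are non-empty in the given input folder.
check_bold_inputs() {
    input_dir=$1
    status=0
    for f in sequences.fasta specimens.tsv; do
        if [ -s "$input_dir/$f" ]; then
            echo "ok: $input_dir/$f"
        else
            echo "missing or empty: $input_dir/$f" >&2
            status=1
        fi
    done
    return $status
}

# Example call, run from the galaxy-tool-BLAST checkout:
# check_bold_inputs utilities/bold/input
```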



Create a conda environment (if not created before)

conda env create -f utilities/snakemake37_environment.yml

Go to the utilities folder of BOLD

cd galaxy-tool-BLAST/utilities/bold

Activate the environment

conda activate snakemake37

To create the databases, run the Snakefile

snakemake -j 6

When the snakemake pipeline has finished, the output folder contains one folder for each indexed BOLD database. You can move these folders to a destination of your choice. To use a blast database in Galaxy, add the path of the database to the blastn.xml file, as in the example below.

<macro name="local_databases">
    <param name="database" type="select" multiple="true" label="Database">
        <option value="galaxy-tool-BLAST/utilities/bold/output/BOLD/bold_all_sequences_taxonomy.fa" label="bold">BOLD</option>
    </param>
</macro>

The description below covers the manual creation of the database. It is deprecated but can still be useful.

Create a BOLD blast database with taxonomy

Download fasta files
The fasta files are downloaded through the API; every phylum has its own fasta file. The files are placed in a folder called bold_fasta_files.

sh utilities/bold/get_bold_sequences.sh

Check for errors
A downloaded fasta file can be incomplete; the following script is one way to check for this. Edit or move the script so that it checks the right folder.

utilities/bold/simplecheck_bold_sequences.sh

The output will look something like this. If the text "Fatal error:" is found in a fasta file, the line containing this text is printed under the filename. In this case only bold_fasta_files/Arthropoda_sequences.fasta is affected. An error in the Arthropoda file is a known issue.

bold_fasta_files/Acanthocephala_sequences.fasta
bold_fasta_files/Acoelomorpha_sequences.fasta
bold_fasta_files/Annelida_sequences.fasta
bold_fasta_files/Arthropoda_sequences.fasta
Fatal error: Uncaught <table style="border: 1px" cellspacing="0">
bold_fasta_files/Ascomycota_sequences.fasta
bold_fasta_files/Basidiomycota_sequences.fasta
...
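The check described above can be approximated in a few lines of shell. This is a sketch of an equivalent check, not the contents of simplecheck_bold_sequences.sh; the function name check_fasta_dir is hypothetical.

```shell
#!/bin/sh
# Hypothetical equivalent of the check: print every FASTA file name in a
# folder, followed by any line in that file containing "Fatal error:".
check_fasta_dir() {
    dir=$1
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue          # skip if the glob matched nothing
        echo "$f"
        # grep exits non-zero when nothing matches; that is not an error here
        grep -F "Fatal error:" "$f" || true
    done
}

# Example call:
# check_fasta_dir bold_fasta_files
```

A file name followed directly by the next file name means no error text was found in that file.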

Download taxonomy files

sh utilities/bold/get_bold_taxonomy.sh

Check for errors

sh utilities/bold/simplecheck_bold_taxonomy.sh

If a tsv or fasta file downloaded through the API contains an error, download it manually and place it in the folder with the rest of the files

http://www.boldsystems.org/index.php/Taxbrowser_Taxonpage?taxid=20

# remove the incomplete file
rm bold_fasta_files/Arthropoda_sequences.fasta
# rename the manually downloaded file
mv bold_fasta_files/fasta.fas bold_fasta_files/Arthropoda_sequences.fasta

# remove the incomplete file
rm bold_taxonomy_files/Arthropoda_taxonomy.tsv
# rename the manually downloaded file
mv bold_taxonomy_files/bold_data.txt bold_taxonomy_files/Arthropoda_taxonomy.tsv

Merge files together

cat bold_fasta_files/*.fasta > bold_all_sequences.fa
cat bold_taxonomy_files/*.tsv > bold_all_taxonomy.tsv
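After merging, a quick sanity check is to compare the number of FASTA headers in the merged file with the sum over the per-phylum files. The sketch below is not part of the repository, and the function names count_fasta_records and check_merge are hypothetical. A mismatch could indicate, for example, a source file without a trailing newline, which makes cat join a sequence line to the next header.

```shell
#!/bin/sh
# Hypothetical helper: count header lines (records) in a FASTA file.
count_fasta_records() {
    # grep -c prints 0 but exits non-zero when nothing matches
    grep -c '^>' "$1" || true
}

# Hypothetical helper: compare the record count of the merged file with
# the summed record counts of the individual files in a folder.
check_merge() {
    dir=$1
    merged=$2
    total=0
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue
        total=$((total + $(count_fasta_records "$f")))
    done
    merged_n=$(count_fasta_records "$merged")
    echo "parts: $total, merged: $merged_n"
    [ "$total" -eq "$merged_n" ]         # exit status 0 only if counts agree
}

# Example call:
# check_merge bold_fasta_files bold_all_sequences.fa
```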

Clean the files

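# remove alignment gaps ("-") from sequence lines, editing the file in place (GNU sed)
sed -i -e '/^[^>]/ s/-//g' bold_all_sequences.fa
# keep only the columns needed for the taxonomy
awk -F "\t" '{print $1"\t"$10"\t"$12"\t"$14"\t"$16"\t"$20"\t"$22}' bold_all_taxonomy.tsv > bold_all_taxonomy_filtered.tsv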
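To see what the sed expression in this step does, it can be tried on a two-line sample. The address /^[^>]/ selects only lines that do not start with ">", so hyphens in header lines are kept while gap characters in sequence lines are removed. The wrapper name strip_gaps and the sample record are made up for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper around the same sed expression, reading from stdin.
strip_gaps() {
    sed -e '/^[^>]/ s/-//g'
}

# Small demonstration on a made-up record:
printf '>ID-1|Genus\nAC-GT--A\n' | strip_gaps
# prints:
# >ID-1|Genus
# ACGTA
```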

Download GBIF for kingdom information

wget http://rs.gbif.org/datasets/backbone/backbone-current.zip
unzip -j backbone-current.zip "Taxon.tsv"
awk -F "\t" '{print $18"\t"$19"\t"$20"\t"$21"\t"$22"\t"$23}' Taxon.tsv > gbif_taxonomy.tsv

Add taxonomy

utilities/bold/add_taxonomy_bold.py -t bold_all_taxonomy_filtered.tsv -g gbif_taxonomy.tsv -b bold_all_sequences.fa -o bold_all_sequences_taxonomy.fa   

Create blast database

sudo makeblastdb2.8.0 -in bold_all_sequences_taxonomy.fa -dbtype nucl

Create filtered version

utilities/bold/filter_bold_for_species.py -b bold_all_sequences_taxonomy.fa -o bold_all_sequences_taxonomy_species_only.fa
vsearch --derep_fulllength bold_all_sequences_taxonomy_species_only.fa --output bold_all_sequences_taxonomy_species_only_nodups.fa
sudo makeblastdb2.8.0 -in bold_all_sequences_taxonomy_species_only_nodups.fa -dbtype nucl