Reference sequence databases: BOLD
First, the data for Arthropoda needs to be downloaded manually (i.e. using a web browser) due to issues with the API. Download the public datasets sequences.fasta and specimens.tsv. Note: as of February 2020 these files are approximately 3.4 GB and 2.5 GB, respectively, and download speeds through the web interface are limited. Make sure the files are named as indicated and placed in the input folder (<path_to_galaxy-tool-BLAST>/utilities/bold/input/) specified in config.yml.
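Because the downloads are large and slow, it can help to confirm that both files are in place before starting the pipeline. The following is a minimal sketch, not part of the repository: the function name `missing_inputs` is hypothetical, and the folder passed to it is assumed to be the input folder set in config.yml.

```python
from pathlib import Path

# File names as given in the text above.
REQUIRED_FILES = ("sequences.fasta", "specimens.tsv")


def missing_inputs(input_dir):
    """Return the required file names that are absent or empty in input_dir."""
    folder = Path(input_dir)
    missing = []
    for name in REQUIRED_FILES:
        path = folder / name
        # An empty file usually means the download was interrupted.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed location, relative to the galaxy-tool-BLAST checkout.
    problems = missing_inputs("utilities/bold/input")
    if problems:
        print("Missing or empty input files:", ", ".join(problems))
    else:
        print("All input files found.")
```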
Create a conda environment (if not created before)
conda env create -f utilities/snakemake37_environment.yml
Go to the utilities folder of BOLD
cd galaxy-tool-BLAST/utilities/bold
Activate the environment
conda activate snakemake37
To create the databases, execute the Snakefile
snakemake -j 6
When the snakemake pipeline is done, there will be an output folder containing a folder for each indexed BOLD database. You can move the folders to a destination of your choice. To use a BLAST database in Galaxy, the path of the database needs to be added to the blastn.xml file. See the example below.
<macro name="local_databases">
<param name="database" type="select" multiple="true" label="Database">
<option value="galaxy-tool-BLAST/utilities/bold/output/BOLD/bold_all_sequences_taxonomy.fa" label="bold">BOLD</option>
</param>
</macro>
The description below covers the manual creation of the database. It is deprecated but can still be useful.
Download fasta files
The fasta files will be downloaded through the API; every phylum has its own fasta file. The files are placed in a folder called bold_fasta_files.
sh utilities/bold/get_bold_sequences.sh
Check for errors
A downloaded fasta file may be incomplete; the following script is a way to check for this. Edit or move the script so that it checks the right folder.
utilities/bold/simplecheck_bold_sequences.sh
The output will look something like this. If the text "Fatal error:" is found in a fasta file, the line containing this text is printed under the filename. In this case only bold_fasta_files/Arthropoda_sequences.fasta has a problem. An error in the Arthropoda file is a known issue.
bold_fasta_files/Acanthocephala_sequences.fasta
bold_fasta_files/Acoelomorpha_sequences.fasta
bold_fasta_files/Annelida_sequences.fasta
bold_fasta_files/Arthropoda_sequences.fasta
Fatal error: Uncaught <table style="border: 1px" cellspacing="0">
bold_fasta_files/Ascomycota_sequences.fasta
bold_fasta_files/Basidiomycota_sequences.fasta
...
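The check described above can be sketched in Python as follows. This is an illustration of the idea only and is not the content of simplecheck_bold_sequences.sh; the function name `find_fatal_errors` is hypothetical, and the folder name is taken from the text above.

```python
import glob


def find_fatal_errors(pattern, marker="Fatal error:"):
    """Map each file matching pattern to the lines that contain marker."""
    report = {}
    for filename in sorted(glob.glob(pattern)):
        # errors="replace" keeps the scan going if a download contains stray bytes.
        with open(filename, encoding="utf-8", errors="replace") as handle:
            report[filename] = [line.rstrip("\n") for line in handle if marker in line]
    return report


if __name__ == "__main__":
    # Print each filename, followed by any offending lines, as in the listing above.
    for filename, hits in find_fatal_errors("bold_fasta_files/*.fasta").items():
        print(filename)
        for line in hits:
            print(line)
```

A file with an empty list of hits passed the check; any file with hits should be downloaded again or replaced manually.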
Download taxonomy files
sh utilities/bold/get_bold_taxonomy.sh
Check for errors
sh utilities/bold/simplecheck_bold_taxonomy.sh
If the tsv or fasta file downloaded through the API contains an error, download it manually and place it in the folder with the rest of the files
http://www.boldsystems.org/index.php/Taxbrowser_Taxonpage?taxid=20
#remove the incomplete file
rm bold_fasta_files/Arthropoda_sequences.fasta
#rename the manually downloaded file
mv bold_fasta_files/fasta.fas bold_fasta_files/Arthropoda_sequences.fasta
#remove the incomplete file
rm bold_taxonomy_files/Arthropoda_taxonomy.tsv
#rename the manually downloaded file
mv bold_taxonomy_files/bold_data.txt bold_taxonomy_files/Arthropoda_taxonomy.tsv
Merge files together
cat bold_fasta_files/*.fasta > bold_all_sequences.fa
cat bold_taxonomy_files/*.tsv > bold_all_taxonomy.tsv
Clean the files
sed -i -e '/^[^>]/ s/-//g' bold_all_sequences.fa
awk -F "\t" '{print $1"\t"$10"\t"$12"\t"$14"\t"$16"\t"$20"\t"$22}' bold_all_taxonomy.tsv > bold_all_taxonomy_filtered.tsv
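For readers less familiar with sed and awk, the two cleaning steps can be sketched in Python. This is only an illustration of what the commands do: remove gap characters from sequence lines while leaving headers alone, and keep seven tab-separated columns of the taxonomy file. The function names `strip_gaps` and `select_columns` are hypothetical and not part of the repository.

```python
def strip_gaps(fasta_lines):
    """Yield FASTA lines with '-' removed from sequence lines; headers are untouched."""
    for line in fasta_lines:
        yield line if line.startswith(">") else line.replace("-", "")


def select_columns(tsv_line, columns=(1, 10, 12, 14, 16, 20, 22)):
    """Return the chosen 1-based columns of a tab-separated line, joined by tabs.

    Columns beyond the end of the line become empty strings, as in awk.
    """
    fields = tsv_line.rstrip("\n").split("\t")
    return "\t".join(fields[i - 1] if i <= len(fields) else "" for i in columns)


if __name__ == "__main__":
    print(list(strip_gaps([">id-1|COI-5P\n", "AC-GT--A\n"])))
    print(select_columns("\t".join(str(i) for i in range(1, 23)) + "\n"))
```

Hyphens in header lines are kept on purpose, since identifiers and marker names can contain them.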
Download GBIF for kingdom information
wget http://rs.gbif.org/datasets/backbone/backbone-current.zip
unzip -j backbone-current.zip "Taxon.tsv"
awk -F "\t" '{print $18"\t"$19"\t"$20"\t"$21"\t"$22"\t"$23}' Taxon.tsv > gbif_taxonomy.tsv
Add taxonomy
utilities/bold/add_taxonomy_bold.py -t bold_all_taxonomy_filtered.tsv -g gbif_taxonomy.tsv -b bold_all_sequences.fa -o bold_all_sequences_taxonomy.fa
Create blast database
sudo makeblastdb2.8.0 -in bold_all_sequences_taxonomy.fa -dbtype nucl
utilities/bold/filter_bold_for_species.py -b bold_all_sequences_taxonomy.fa -o bold_all_sequences_taxonomy_species_only.fa
vsearch --derep_fulllength bold_all_sequences_taxonomy_species_only.fa --output bold_all_sequences_taxonomy_species_only_nodups.fa
sudo makeblastdb2.8.0 -in bold_all_sequences_taxonomy_species_only_nodups.fa -dbtype nucl
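The dereplication step removes duplicate sequences before the species-only database is built. The sketch below illustrates the general idea of full-length dereplication under simplifying assumptions: it keeps the first record seen for each distinct sequence and compares sequences case-insensitively. It is not a reimplementation of vsearch and ignores things such as abundance annotation and output ordering; the function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA lines; sequences may span lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dereplicate(records):
    """Keep the first record for each distinct full-length sequence."""
    seen = set()
    unique = []
    for header, sequence in records:
        key = sequence.upper()  # compare case-insensitively
        if key not in seen:
            seen.add(key)
            unique.append((header, sequence))
    return unique


if __name__ == "__main__":
    example = [">a\n", "ACGT\n", ">b\n", "acgt\n", ">c\n", "AC\n", "GG\n"]
    for header, sequence in dereplicate(read_fasta(example)):
        print(f">{header}\n{sequence}")
```

Only exact full-length matches are merged here, so a sequence that is a substring of a longer one is kept as a separate record.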