OBITools is a collection of programs designed for the analysis of next-generation sequencing data for metabarcoding. It contains tools for assembling, de-multiplexing, filtering, dereplication, denoising and taxonomic assignation of sequences. Additionally, the programs allow for the construction of custom reference databases, tailored for specific project needs.
For a complete description of the commands and options included in the OBITools, refer to the online documentation:
http://metabarcoding.org/obitools/doc/index.html
In this tutorial we will use the OBITools packages to analyze DNA metabarcoding. The data consists of 16s rDNA sequences, approximately 270 base pair nucleotides in length, obtained from stomach and intestine contents of grey seal (Halichoerus grypus).
The PCR primers used to produced the sequences contained “tags” of eight nucleotides at the 5’ end. In this case we used eigh different forward primers and eight different reverse primers, for a total of 64 unique combinations of “tags”; each corresponding to an individual sample.
This document is largely based on the tutorial available within the OBITools documentation pages: http://metabarcoding.org/obitools/doc/wolves.html
The datasets for this tutorials are available in the project directories /proj/g2016021/metabarcoding/seal. They consist of the following:
The sequence files in FastQ format:
The filter file with the tags and primers used to sort the sequences into different samples:
The reference database in a FastA format:
The NCBI taxonomy files, formated in the ecoPCR format:
The OBITools installation file (also available at http://metabarcoding.org/obitools/doc/_downloads/get-obitools.py).
According to previous exercises
Copy the files to your local folder so that you can work independently from other users.
$ cp -r /proj/g2016021/metabarcoding/seal ~/glob/metabarcoding/
OBITools requires Python 2.7 to be installed. In order to avoid a different version of Python running by default in UPPMAX, you will need to load it manually.
$ module load python/2.7.9
run the following command to create a directory in which all the OBITools will be installed. In this case we will install it in the metabarcoding directory where you copied the exercise files
$ cd ~/glob/metabarcoding
$ python get-obitools.py
run the shell script named obitools which reconfigures the UNIX environment and activates the obitools.
$ ./obitools
At the end of this exercise you can deactivate the OBITools by typing the command exit
exit
OBITools are no more activated, Bye...
======================================
This is done with the illuminapairedend utility that aligns forward and reverse sequences, and outputs a consensus sequence of each.
illuminapairedend --score-min=40 -r ./seal/seal_f.fastq ./seal/seal_r.fastq > seal.fastq
The ---score-min option discards sequences with low alignment quality. In this case, if the alignment is below 40, the sequences are not aligned but contcatenated, and the mode attribute at the sequence header is set to joined instead of aligned. The joined sequences are removed using the following command:
obigrep -p 'mode!="joined"' seal.fastq > seal.ali.fastq
The sequences without the joined attribute are kept in the seal.ali.fastq file.
Note that from this moment on we add a suffix to the file names (e.g. “.ali” for “aligned” in the previous file) in order to keep track of every step of the progress. You can always look at the contents of your files using the headcommand in linux:
$ head seal.ali.fastq
The sequences are assigned to its corresponding sample based on the unique combination of primers and tags, which is provided in the text file: seal_a_ngsfilter.txt. The file contains information on different experiments (as needed), the sample name, the forward:reverse tags (e. g. GCATGCAG:GTAGCTAG), the forward and reverse primer sequence, respectively, the letter T for using only the forward primer tag, or F for using both primers and both tags, respectively, and any relevant extra information for further analyses.
Use the following command for assigning the sequences to their respective samples:
ngsfilter -t seal_a_ngsfilter.txt -u unidentified.fastq seal.ali.fastq > seal.ali.assigned.fastq
You will produce two files:
seal.ali.assigned.fastq which contains all the sequences assigned to its respective sampleunidentified.fastq which contains samples that were not assignedAt this point, the primers and tags were removed from each sequence, and instead the information on the corresponding sample was added to the sequence header.
In most cases, the same DNA sequence is present several times in each sample. By identifying unique sequences and adding the number of times it was observed in a given sample to the sequence headers, we reduce the size of the file and simplify the process of comparing sequences against a database.
Use the obiuniq command to de-replicate samples:
obiuniq -m sample seal.ali.assigned.fastq > seal.ali.assigned.uniq.fasta
Using the -m sample option we keep the information about the origin of each sequence.
The ´obiuniq´ command returns a FastA file.
look at the first sequence record using the head command. The first record will look something like this:
>MISEQ:26:000000000-AJE6C:1:1101:8944:1169_CONS_SUB_SUB sminR=40.0; forward_primer=cgtgcraaggtagcg; reverse_primer=gtcgccccaaccraag; sminL=40.0; merged_sample={'09.04454M': 879, '09.00453M': 68, '09.00451M': 337, '08.09142M': 6, '09.05449M': 2, '09.05017M': 57, '08.09132M': 48, '09.04319M': 814, '08.09133M': 390, '08.09141M': 191, '09.00019M': 2}; seq_a_single=0; count=2794; seq_length=213; status=full; mode=alignment; seq_b_single=0;
caatcaattgtcttttaaatgaagacctgtatgaatggcataacgagggtttaactgtct
ctttctcctggtcagtgaaactgatctacccgtgcagaagcgggtatgaatatacaagac
gagaagaccctatggagctttagacgcccaccaatcacgaaaagcaggtctcgctcaaca
gatttccaaacaacgtgatactggcacaaacgt
The sequence contains now a series of key=values in the header, corresponding to the samples in qhich the sequence was found (merged_sample={'09.04454M':879 ...}) and the total count for the sequence (count=2794).
In the above example, the sequence has been recorded 879 times in the sample named 09.04454M; 68 times in the sample 09.00453M; and so on.
Use the obiannotate command to keep only these key=values :
obiannotate -k count -k merged_sample seal.ali.assigned.uniq.fasta > $$ ; mv $$ seal.ali.assigned.uniq.fasta
The first sequence should now look like this:
>MISEQ:26:000000000-AJE6C:1:1101:8944:1169_CONS_SUB_SUB merged_sample={'09.04454M': 879, '08.09142M': 6, '08.09141M': 191, '09.00019M': 2, '08.09132M': 48, '09.00451M': 337, '09.04319M': 814, '09.05449M': 2, '09.05017M': 57, '09.00453M': 68, '08.09133M': 390}; count=2794;
caatcaattgtcttttaaatgaagacctgtatgaatggcataacgagggtttaactgtct
ctttctcctggtcagtgaaactgatctacccgtgcagaagcgggtatgaatatacaagac
gagaagaccctatggagctttagacgcccaccaatcacgaaaagcaggtctcgctcaaca
gatttccaaacaacgtgatactggcacaaacgt
Given the sheer size of the dataset, it is likely that it will contain errors produced during the PCR or sequencing steps, or chimeras. It is wise to remove such sequences and sequence variants that are likely to be artifacts.
Use the obistat command to sort the sequences by the count attribute added in the previous steps, and then pipe them to the linux commands sort and head to keep the 20 lowest count values
obistat -c count seal.ali.assigned.uniq.fasta | sort -nk1 | head -20
The output should look like this:
count count total
1 6580 6580
2 803 1606
3 413 1239
4 253 1012
5 182 910
6 116 696
7 79 553
8 67 536
9 55 495
10 47 470
11 46 506
12 37 444
13 32 416
14 35 490
15 28 420
16 20 320
17 24 408
18 27 486
19 13 247
In this example, the dataset contains 6580 sequences that occurred only one time.
The sequences that occurred ten or more times are less likely to correspond to artifacts. The obigrep command can be used to filter out sequences that appeared less than ten times (option -p 'count>=10'), as well as to remove shorter sequences that can contain primer/adapter dimers or other short DNA fragments that do not correspond to our target sequence. In this example we will remove fragments shorter than 150 bases, since our target 16s sequence is approximately 270 bases long (option -l 150).
obigrep -l 150 -p 'count>=10' seal.ali.assigned.uniq.fasta > seal.ali.assigned.uniq.c10.l150.fasta
The obiclean command tags sequences as either head, internal or singleton based on sequence record counts and sequence similarities.
Refer to the obiclean documentation for details:
http://metabarcoding.org/obitools/doc/scripts/obiclean.html
Use the following command for the final denoising step, which removes sequences likely to be produced by PCR errors:
obiclean -s merged_sample -r 0.05 -H seal.ali.assigned.uniq.c10.l150.fasta > seal.ali.assigned.uniq.c10.l150.clean.fasta
The option -s indicates that the tagging is done by each sample individually, under the key merged_sample; -r sets the ratio of sequence count for comparisons to 0.05; -H selects sequences with a head tag in at least one sample.
At this point our database has been compacted to unique sequences without potential sequencing errors, and the information on count per sample has been added to the sequence header. The process of taxonomic assignment by comparing with a reference database should now be much simplified.
Nevertheless, the taxonomic assignment can be speeded up further by building a custom reference database that contains only the DNA fragments of interest to our particular study. This can be doneusing the ecoPCR program:
http://metabarcoding.org/obitools/doc/scripts/ecoPCR.html
For the sake of simplicity, we provide a ready-made reference database in the exercise files, as well as the taxonomy files formatted in the taxonomy files in the ecoPCR format. Refer to the ecoPCR documentation for detailed instructions on building a custom reference database.
The database db_v05_prey.fasta provided for this exercise was built using the latest release of the EMBL sequence set (release 128 at the time of this workshop); with the 16s PCR primers used to produce the dataset, and allowing for three mismatching errors in the Primer sequences.
The latest set of EMBL sequences is available at:
ftp: //ftp.ebi.ac.uk/pub/databases/ena/sequence/release/std/embl/release/
Use the ecotag command to compare the sequence and reference datasets and assign each record to a taxon:
ecotag -d embl_r128 -R db_v05_prey.fasta seal.ali.assigned.uniq.c10.l150.clean.fasta > seal.ali.assigned.uniq.c10.l150.clean.tag.fasta
The resulting file contains several key=value attributes on the taxonomic assignment. For example: * best_match={‘[reference database]’:‘[record in database]’} i.e. best aligned record in the database * best_identity={‘[reference database]’: [identity value 0 to 1]} i.e. percentage of identity between the query and the best match * taxid=[id of the best match] * scientific_name=[name] i.e. name of the taxon assigned, either to species or a higher taxonomic level
The first record of the resulting dataset should look like this:
>MISEQ:26:000000000-AJE6C:1:1101:8944:1169_CONS_SUB_SUB species_name=Clupea harengus; family=55118; scientific_name=Clupea harengus; rank=species; taxid=7950; best_identity={'db_v05_prey': 1.0}; scientific_name_by_db={'db_v05_prey': 'Clupea harengus'}; obiclean_samplecount=11; species=7950; merged_sample={'09.04454M': 879, '08.09142M': 6, '08.09141M': 191, '09.00019M': 2, '09.00453M': 68, '09.00451M': 337, '09.04319M': 814, '09.05449M': 2, '09.05017M': 57, '08.09132M': 48, '08.09133M': 390}; obiclean_count={'09.04454M': 901, '09.04319M': 825, '08.09141M': 193, '09.00019M': 2, '09.00453M': 68, '09.00451M': 340, '08.09142M': 6, '09.05449M': 2, '09.05017M': 58, '08.09132M': 48, '08.09133M': 400}; obiclean_singletoncount=0; obiclean_cluster={'09.04454M': 'MISEQ:26:000000000-AJE6C:1:1101:8944:1169_CONS_SUB_SUB', '09.04319M': 'MISEQ:26:000000000-AJE6C:1:1101:8944:1169_CONS_SUB_SUB', '08.09141M': 'MISEQ:26:000000000-AJE6C:1:1101:22374:1278_CONS_SUB_SUB', '09.00019M': 'MISEQ:26:000000000-AJE6C:1:1101:22374:1278_CONS_SUB_SUB', '09.00453M': 'MISEQ:26:000000000-AJE6C:1:1101:22374:1278_CONS_SUB_SUB', '09.00451M': 'MISEQ:26:000000000-AJE6C:1:1101:8944:1169_CONS_SUB_SUB', '08.09142M': 'MISEQ:26:000000000-AJE6C:1:1101:22374:1278_CONS_SUB_SUB', '09.05449M': 'MISEQ:26:000000000-AJE6C:1:1101:22374:1278_CONS_SUB_SUB', '09.05017M': 'MISEQ:26:000000000-AJE6C:1:1101:8944:1169_CONS_SUB_SUB', '08.09132M': 'MISEQ:26:000000000-AJE6C:1:1101:22374:1278_CONS_SUB_SUB', '08.09133M': 'MISEQ:26:000000000-AJE6C:1:1101:8944:1169_CONS_SUB_SUB'}; species_list={'db_v05_prey': ['Clupea harengus']}; obiclean_internalcount=6; match_count={'db_v05_prey': 1}; obiclean_head=True; taxid_by_db={'db_v05_prey': 7950}; family_name=Clupeidae; genus_name=Clupea; obiclean_status={'09.04454M': 'h', '09.04319M': 'h', '08.09141M': 'i', '09.00019M': 'i', '09.00453M': 'i', '09.00451M': 'h', '08.09142M': 'i', '09.05449M': 'i', '09.05017M': 'h', '08.09132M': 'i', '08.09133M': 'h'}; obiclean_headcount=5; count=2794; id_status={'db_v05_prey': True}; best_match={'db_v05_prey': 'KC193695;'}; order_name=Clupeiformes; rank_by_db={'db_v05_prey': 'species'}; genus=7949; order=32446;
caatcaattgtcttttaaatgaagacctgtatgaatggcataacgagggtttaactgtct
ctttctcctggtcagtgaaactgatctacccgtgcagaagcgggtatgaatatacaagac
gagaagaccctatggagctttagacgcccaccaatcacgaaaagcaggtctcgctcaaca
gatttccaaacaacgtgatactggcacaaacgt
As you can see, the sequence header of the latest FastA file contains several attributes that may not be necessary for further analyses. These can be removed using the --delete-tag option of the obiannotate command:
obiannotate --delete-tag=scientific_name_by_db --delete-tag=obiclean_samplecount --delete-tag=obiclean_count --delete-tag=obiclean_singletoncount --delete-tag=obiclean_cluster --delete-tag=obiclean_internalcount --delete-tag=obiclean_head --delete-tag=taxid_by_db --delete-tag=obiclean_headcount --delete-tag=id_status --delete-tag=rank_by_db --delete-tag=order_name --delete-tag=order seal.ali.assigned.uniq.c10.l150.clean.tag.fasta > seal.ali.assigned.uniq.c10.l150.clean.tag.ann.fasta
The first record of the resulting file will look like this:
>MISEQ:26:000000000-AJE6C:1:1101:8944:1169_CONS_SUB_SUB merged_sample={'09.04454M': 879, '08.09142M': 6, '08.09141M': 191, '09.00019M': 2, '08.09132M': 48, '09.00451M': 337, '09.04319M': 814, '09.05449M': 2, '09.05017M': 57, '09.00453M': 68, '08.09133M': 390}; match_count={'db_v05_prey': 1}; species_name=Clupea harengus; best_match={'db_v05_prey': 'KC193695; family=55118; family_name=Clupeidae; scientific_name=Clupea harengus; taxid=7950; rank=species; obiclean_status={'09.04454M': 'h', '09.04319M': 'h', '08.09141M': 'i', '09.00019M': 'i', '08.09132M': 'i', '09.00451M': 'h', '08.09142M': 'i', '09.05449M': 'i', '09.05017M': 'h', '09.00453M': 'i', '08.09133M': 'h'}; best_identity={'db_v05_prey': 1.0}; species_list={'db_v05_prey': ['Clupea harengus']}; genus_name=Clupea; species=7950; count=2794; '}; order_name=Clupeiformes; rank_by_db={'db_v05_prey': 'species'}; genus=7949; order=32446;
caatcaattgtcttttaaatgaagacctgtatgaatggcataacgagggtttaactgtct
ctttctcctggtcagtgaaactgatctacccgtgcagaagcgggtatgaatatacaagac
gagaagaccctatggagctttagacgcccaccaatcacgaaaagcaggtctcgctcaaca
gatttccaaacaacgtgatactggcacaaacgt
The sequences can be sorted by attributes with the command obisort. For instance use the following command to sort the sequences in decreasing order of count:
obisort -k count -r seal.ali.assigned.uniq.c10.l150.clean.tag.ann.fasta > seal.ali.assigned.uniq.c10.l150.clean.tag.ann.sort.fasta
The first record (the one with the largest sequence count) should now be:
>MISEQ:26:000000000-AJE6C:1:1101:22374:1278_CONS_SUB_SUB count=44523; merged_sample={'09.04454M': 5451, '08.09142T': 2, '08.09142M': 891, '09.02440M': 3107, '08.09138T': 1, '09.00019M': 3017, '09.00451M': 4364, '09.05448M': 2, '09.05017M': 622, '09.00453M': 2812, '09.05016M': 10, '08.09158M': 933, '08.09133M': 3973, '09.00013M': 10, '09.04455M': 438, '08.04480M': 5749, '09.04319M': 6869, '08.04480T': 2, '08.09141M': 4754, '09.04455T': 3, '09.02443M': 1, '08.09134M': 1, '09.05016T': 1, '09.05449M': 204, '09.00010M': 41, '08.09132M': 1260, '09.00016T': 2, '08.09128M': 2, '08.09134T': 1}; species_name=Clupea harengus; family=55118; best_match={'db_v05_prey': 'AP009133; family_name=Clupeidae; scientific_name=Clupea harengus; match_count={'db_v05_prey': 1}; rank=species; taxid=7950; best_identity={'db_v05_prey': 1.0}; species_list={'db_v05_prey': ['Clupea harengus']}; genus_name=Clupea; obiclean_status={'09.04454M': 'h', '08.09142T': 's', '08.09142M': 'h', '09.02440M': 'h', '08.09138T': 's', '09.00019M': 'h', '09.00451M': 'h', '09.05448M': 's', '09.05017M': 'h', '09.00453M': 'h', '09.00016T': 's', '08.09158M': 'h', '08.09133M': 'h', '09.00013M': 's', '09.04455M': 'h', '08.04480M': 'h', '09.04319M': 'h', '08.04480T': 's', '08.09141M': 'h', '09.04455T': 's', '09.02443M': 's', '08.09134M': 's', '09.05016T': 's', '09.05449M': 'h', '09.00010M': 'h', '08.09132M': 'h', '09.05016M': 's', '08.09128M': 's', '08.09134T': 's'}; species=7950; '}; order_name=Clupeiformes; rank_by_db={'db_v05_prey': 'species'}; genus=7949; order=32446;
caatcaattgtcttttaaatgaagacctgtatgaatggcataacgagggtttaactgtct
ctttctcctggtcagtgaaactgatctacccgtgcagaagcgggtatgaatatacaagac
gagaagaccctatggagctttagacgcccaccaatcacgaaaagcaggtctcgctcaaca
gacttccaaacaacgtgatactggcacaaacgt
Finally, use the obitab command to generate a tab-delimited file based on the sequence attributes, that can be imported into other programs (e.g. excel or R)
obitab -o seal.ali.assigned.uniq.c10.l150.clean.tag.ann.sort.fasta > seal.ali.assigned.uniq.c10.l150.clean.tag.ann.sort.tab
Open the file in excel to view the final results