Seal diet analysis using the OBITools pipeline

Step by step instructions

1. Connect to UPPMAX and get a node of your own

According to previous exercises

2. Copy the files to your local directory

Copy the files to your local folder so that you can work independently from other users.

$ cp -r /proj/g2016021/metabarcoding/seal ~/glob/metabarcoding/

3. Installing the OBITools

OBITools requires Python 2.7 to be installed. In order to avoid a different version of Python running by default in UPPMAX, you will need to load it manually.

$ module load python/2.7.9

run the following command to create a directory in which all the OBITools will be installed. In this case we will install it in the metabarcoding directory where you copied the exercise files

$ cd ~/glob/metabarcoding

$ python get-obitools.py

run the shell script named obitools which reconfigures the UNIX environment and activates the obitools.

$ ./obitools

At the end of this exercise you can deactivate the OBITools by typing the command exit

exit



OBITools are no more activated, Bye...

======================================

4. Merge forward and reverse paired-end sequences from an Illumina output

This is done with the illuminapairedend utility that aligns forward and reverse sequences, and outputs a consensus sequence of each.

illuminapairedend --score-min=40 -r ./seal/seal_f.fastq ./seal/seal_r.fastq > seal.fastq

The ---score-min option discards sequences with low alignment quality. In this case, if the alignment is below 40, the sequences are not aligned but contcatenated, and the mode attribute at the sequence header is set to joined instead of aligned. The joined sequences are removed using the following command:

obigrep -p 'mode!="joined"' seal.fastq > seal.ali.fastq

The sequences without the joined attribute are kept in the seal.ali.fastq file.

Note that from this moment on we add a suffix to the file names (e.g. “.ali” for “aligned” in the previous file) in order to keep track of every step of the progress. You can always look at the contents of your files using the headcommand in linux:

$ head seal.ali.fastq

5. Assign each sequence to its corresponding sample using the unique tag combinations

The sequences are assigned to its corresponding sample based on the unique combination of primers and tags, which is provided in the text file: seal_a_ngsfilter.txt. The file contains information on different experiments (as needed), the sample name, the forward:reverse tags (e. g. GCATGCAG:GTAGCTAG), the forward and reverse primer sequence, respectively, the letter T for using only the forward primer tag, or F for using both primers and both tags, respectively, and any relevant extra information for further analyses.

Use the following command for assigning the sequences to their respective samples:

ngsfilter -t seal_a_ngsfilter.txt -u unidentified.fastq seal.ali.fastq > seal.ali.assigned.fastq

You will produce two files:

seal.ali.assigned.fastq which contains all the sequences assigned to its respective sample
unidentified.fastq which contains samples that were not assigned

At this point, the primers and tags were removed from each sequence, and instead the information on the corresponding sample was added to the sequence header.

6. Dereplicate reads into unique sequences

In most cases, the same DNA sequence is present several times in each sample. By identifying unique sequences and adding the number of times it was observed in a given sample to the sequence headers, we reduce the size of the file and simplify the process of comparing sequences against a database.

Use the obiuniq command to de-replicate samples:

obiuniq -m sample seal.ali.assigned.fastq > seal.ali.assigned.uniq.fasta

Using the -m sample option we keep the information about the origin of each sequence.

The ´obiuniq´ command returns a FastA file.

look at the first sequence record using the head command. The first record will look something like this:

>MISEQ:26:000000000-AJE6C:1:1101:8944:1169_CONS_SUB_SUB sminR=40.0; forward_primer=cgtgcraaggtagcg; reverse_primer=gtcgccccaaccraag; sminL=40.0; merged_sample={'09.04454M': 879, '09.00453M': 68, '09.00451M': 337, '08.09142M': 6, '09.05449M': 2, '09.05017M': 57, '08.09132M': 48, '09.04319M': 814, '08.09133M': 390, '08.09141M': 191, '09.00019M': 2}; seq_a_single=0; count=2794; seq_length=213; status=full; mode=alignment; seq_b_single=0;

caatcaattgtcttttaaatgaagacctgtatgaatggcataacgagggtttaactgtct

ctttctcctggtcagtgaaactgatctacccgtgcagaagcgggtatgaatatacaagac

gagaagaccctatggagctttagacgcccaccaatcacgaaaagcaggtctcgctcaaca

gatttccaaacaacgtgatactggcacaaacgt

The sequence contains now a series of key=values in the header, corresponding to the samples in qhich the sequence was found (merged_sample={'09.04454M':879 ...}) and the total count for the sequence (count=2794).

In the above example, the sequence has been recorded 879 times in the sample named 09.04454M; 68 times in the sample 09.00453M; and so on.

Use the obiannotate command to keep only these key=values :

obiannotate -k count -k merged_sample seal.ali.assigned.uniq.fasta > $$ ; mv $$ seal.ali.assigned.uniq.fasta

The first sequence should now look like this:

>MISEQ:26:000000000-AJE6C:1:1101:8944:1169_CONS_SUB_SUB merged_sample={'09.04454M': 879, '08.09142M': 6, '08.09141M': 191, '09.00019M': 2, '08.09132M': 48, '09.00451M': 337, '09.04319M': 814, '09.05449M': 2, '09.05017M': 57, '09.00453M': 68, '08.09133M': 390}; count=2794;

caatcaattgtcttttaaatgaagacctgtatgaatggcataacgagggtttaactgtct

ctttctcctggtcagtgaaactgatctacccgtgcagaagcgggtatgaatatacaagac

gagaagaccctatggagctttagacgcccaccaatcacgaaaagcaggtctcgctcaaca

gatttccaaacaacgtgatactggcacaaacgt

7. Denoise the sequence dataset

Given the sheer size of the dataset, it is likely that it will contain errors produced during the PCR or sequencing steps, or chimeras. It is wise to remove such sequences and sequence variants that are likely to be artifacts.

Get count statistics

Use the obistat command to sort the sequences by the count attribute added in the previous steps, and then pipe them to the linux commands sort and head to keep the 20 lowest count values

obistat -c count seal.ali.assigned.uniq.fasta | sort -nk1 | head -20

The output should look like this:

count     count      total

1          6580      6580

2           803      1606

3           413      1239

4           253      1012

5           182       910

6           116       696

7            79       553

8            67       536

9            55       495

10           47       470

11           46       506

12           37       444

13           32       416

14           35       490

15           28       420

16           20       320

17           24       408

18           27       486

19           13       247

In this example, the dataset contains 6580 sequences that occurred only one time.

Keep only sequences occurring ten or more times

The sequences that occurred ten or more times are less likely to correspond to artifacts. The obigrep command can be used to filter out sequences that appeared less than ten times (option -p 'count>=10'), as well as to remove shorter sequences that can contain primer/adapter dimers or other short DNA fragments that do not correspond to our target sequence. In this example we will remove fragments shorter than 150 bases, since our target 16s sequence is approximately 270 bases long (option -l 150).

obigrep -l 150 -p 'count>=10' seal.ali.assigned.uniq.fasta > seal.ali.assigned.uniq.c10.l150.fasta

Clean the dataset of sequencing variants (PCR errors)

The obiclean command tags sequences as either head, internal or singleton based on sequence record counts and sequence similarities.

Refer to the obiclean documentation for details:

http://metabarcoding.org/obitools/doc/scripts/obiclean.html

Use the following command for the final denoising step, which removes sequences likely to be produced by PCR errors:

obiclean -s merged_sample -r 0.05 -H seal.ali.assigned.uniq.c10.l150.fasta > seal.ali.assigned.uniq.c10.l150.clean.fasta

The option -s indicates that the tagging is done by each sample individually, under the key merged_sample; -r sets the ratio of sequence count for comparisons to 0.05; -H selects sequences with a head tag in at least one sample.

8. Taxonomic assignment of sequences

At this point our database has been compacted to unique sequences without potential sequencing errors, and the information on count per sample has been added to the sequence header. The process of taxonomic assignment by comparing with a reference database should now be much simplified.

Nevertheless, the taxonomic assignment can be speeded up further by building a custom reference database that contains only the DNA fragments of interest to our particular study. This can be doneusing the ecoPCR program:

http://metabarcoding.org/obitools/doc/scripts/ecoPCR.html

For the sake of simplicity, we provide a ready-made reference database in the exercise files, as well as the taxonomy files formatted in the taxonomy files in the ecoPCR format. Refer to the ecoPCR documentation for detailed instructions on building a custom reference database.

The database db_v05_prey.fasta provided for this exercise was built using the latest release of the EMBL sequence set (release 128 at the time of this workshop); with the 16s PCR primers used to produce the dataset, and allowing for three mismatching errors in the Primer sequences.

The latest set of EMBL sequences is available at:

ftp: //ftp.ebi.ac.uk/pub/databases/ena/sequence/release/std/embl/release/

Assign each sequence to a taxon

Use the ecotag command to compare the sequence and reference datasets and assign each record to a taxon:

ecotag -d embl_r128 -R db_v05_prey.fasta seal.ali.assigned.uniq.c10.l150.clean.fasta > seal.ali.assigned.uniq.c10.l150.clean.tag.fasta

The resulting file contains several key=value attributes on the taxonomic assignment. For example: * best_match={‘[reference database]’:‘[record in database]’} i.e. best aligned record in the database * best_identity={‘[reference database]’: [identity value 0 to 1]} i.e. percentage of identity between the query and the best match * taxid=[id of the best match] * scientific_name=[name] i.e. name of the taxon assigned, either to species or a higher taxonomic level

The first record of the resulting dataset should look like this:

>MISEQ:26:000000000-AJE6C:1:1101:8944:1169_CONS_SUB_SUB species_name=Clupea harengus; family=55118; scientific_name=Clupea harengus; rank=species; taxid=7950; best_identity={'db_v05_prey': 1.0}; scientific_name_by_db={'db_v05_prey': 'Clupea harengus'}; obiclean_samplecount=11; species=7950; merged_sample={'09.04454M': 879, '08.09142M': 6, '08.09141M': 191, '09.00019M': 2, '09.00453M': 68, '09.00451M': 337, '09.04319M': 814, '09.05449M': 2, '09.05017M': 57, '08.09132M': 48, '08.09133M': 390}; obiclean_count={'09.04454M': 901, '09.04319M': 825, '08.09141M': 193, '09.00019M': 2, '09.00453M': 68, '09.00451M': 340, '08.09142M': 6, '09.05449M': 2, '09.05017M': 58, '08.09132M': 48, '08.09133M': 400}; obiclean_singletoncount=0; obiclean_cluster={'09.04454M': 'MISEQ:26:000000000-AJE6C:1:1101:8944:1169_CONS_SUB_SUB', '09.04319M': 'MISEQ:26:000000000-AJE6C:1:1101:8944:1169_CONS_SUB_SUB', '08.09141M': 'MISEQ:26:000000000-AJE6C:1:1101:22374:1278_CONS_SUB_SUB', '09.00019M': 'MISEQ:26:000000000-AJE6C:1:1101:22374:1278_CONS_SUB_SUB', '09.00453M': 'MISEQ:26:000000000-AJE6C:1:1101:22374:1278_CONS_SUB_SUB', '09.00451M': 'MISEQ:26:000000000-AJE6C:1:1101:8944:1169_CONS_SUB_SUB', '08.09142M': 'MISEQ:26:000000000-AJE6C:1:1101:22374:1278_CONS_SUB_SUB', '09.05449M': 'MISEQ:26:000000000-AJE6C:1:1101:22374:1278_CONS_SUB_SUB', '09.05017M': 'MISEQ:26:000000000-AJE6C:1:1101:8944:1169_CONS_SUB_SUB', '08.09132M': 'MISEQ:26:000000000-AJE6C:1:1101:22374:1278_CONS_SUB_SUB', '08.09133M': 'MISEQ:26:000000000-AJE6C:1:1101:8944:1169_CONS_SUB_SUB'}; species_list={'db_v05_prey': ['Clupea harengus']}; obiclean_internalcount=6; match_count={'db_v05_prey': 1}; obiclean_head=True; taxid_by_db={'db_v05_prey': 7950}; family_name=Clupeidae; genus_name=Clupea; obiclean_status={'09.04454M': 'h', '09.04319M': 'h', '08.09141M': 'i', '09.00019M': 'i', '09.00453M': 'i', '09.00451M': 'h', '08.09142M': 'i', '09.05449M': 'i', '09.05017M': 'h', '08.09132M': 'i', '08.09133M': 'h'}; obiclean_headcount=5; count=2794; id_status={'db_v05_prey': True}; best_match={'db_v05_prey': 'KC193695;'}; order_name=Clupeiformes; rank_by_db={'db_v05_prey': 'species'}; genus=7949; order=32446;

caatcaattgtcttttaaatgaagacctgtatgaatggcataacgagggtttaactgtct

ctttctcctggtcagtgaaactgatctacccgtgcagaagcgggtatgaatatacaagac

gagaagaccctatggagctttagacgcccaccaatcacgaaaagcaggtctcgctcaaca

gatttccaaacaacgtgatactggcacaaacgt

9. Generate a final results table

As you can see, the sequence header of the latest FastA file contains several attributes that may not be necessary for further analyses. These can be removed using the --delete-tag option of the obiannotate command:

obiannotate --delete-tag=scientific_name_by_db --delete-tag=obiclean_samplecount --delete-tag=obiclean_count --delete-tag=obiclean_singletoncount --delete-tag=obiclean_cluster --delete-tag=obiclean_internalcount --delete-tag=obiclean_head --delete-tag=taxid_by_db --delete-tag=obiclean_headcount --delete-tag=id_status --delete-tag=rank_by_db --delete-tag=order_name --delete-tag=order seal.ali.assigned.uniq.c10.l150.clean.tag.fasta > seal.ali.assigned.uniq.c10.l150.clean.tag.ann.fasta

The first record of the resulting file will look like this:

>MISEQ:26:000000000-AJE6C:1:1101:8944:1169_CONS_SUB_SUB merged_sample={'09.04454M': 879, '08.09142M': 6, '08.09141M': 191, '09.00019M': 2, '08.09132M': 48, '09.00451M': 337, '09.04319M': 814, '09.05449M': 2, '09.05017M': 57, '09.00453M': 68, '08.09133M': 390}; match_count={'db_v05_prey': 1}; species_name=Clupea harengus; best_match={'db_v05_prey': 'KC193695; family=55118; family_name=Clupeidae; scientific_name=Clupea harengus; taxid=7950; rank=species; obiclean_status={'09.04454M': 'h', '09.04319M': 'h', '08.09141M': 'i', '09.00019M': 'i', '08.09132M': 'i', '09.00451M': 'h', '08.09142M': 'i', '09.05449M': 'i', '09.05017M': 'h', '09.00453M': 'i', '08.09133M': 'h'}; best_identity={'db_v05_prey': 1.0}; species_list={'db_v05_prey': ['Clupea harengus']}; genus_name=Clupea; species=7950; count=2794; '}; order_name=Clupeiformes; rank_by_db={'db_v05_prey': 'species'}; genus=7949; order=32446;

caatcaattgtcttttaaatgaagacctgtatgaatggcataacgagggtttaactgtct

ctttctcctggtcagtgaaactgatctacccgtgcagaagcgggtatgaatatacaagac

gagaagaccctatggagctttagacgcccaccaatcacgaaaagcaggtctcgctcaaca

gatttccaaacaacgtgatactggcacaaacgt

The sequences can be sorted by attributes with the command obisort. For instance use the following command to sort the sequences in decreasing order of count:

obisort -k count -r seal.ali.assigned.uniq.c10.l150.clean.tag.ann.fasta > seal.ali.assigned.uniq.c10.l150.clean.tag.ann.sort.fasta

The first record (the one with the largest sequence count) should now be:

>MISEQ:26:000000000-AJE6C:1:1101:22374:1278_CONS_SUB_SUB count=44523; merged_sample={'09.04454M': 5451, '08.09142T': 2, '08.09142M': 891, '09.02440M': 3107, '08.09138T': 1, '09.00019M': 3017, '09.00451M': 4364, '09.05448M': 2, '09.05017M': 622, '09.00453M': 2812, '09.05016M': 10, '08.09158M': 933, '08.09133M': 3973, '09.00013M': 10, '09.04455M': 438, '08.04480M': 5749, '09.04319M': 6869, '08.04480T': 2, '08.09141M': 4754, '09.04455T': 3, '09.02443M': 1, '08.09134M': 1, '09.05016T': 1, '09.05449M': 204, '09.00010M': 41, '08.09132M': 1260, '09.00016T': 2, '08.09128M': 2, '08.09134T': 1}; species_name=Clupea harengus; family=55118; best_match={'db_v05_prey': 'AP009133; family_name=Clupeidae; scientific_name=Clupea harengus; match_count={'db_v05_prey': 1}; rank=species; taxid=7950; best_identity={'db_v05_prey': 1.0}; species_list={'db_v05_prey': ['Clupea harengus']}; genus_name=Clupea; obiclean_status={'09.04454M': 'h', '08.09142T': 's', '08.09142M': 'h', '09.02440M': 'h', '08.09138T': 's', '09.00019M': 'h', '09.00451M': 'h', '09.05448M': 's', '09.05017M': 'h', '09.00453M': 'h', '09.00016T': 's', '08.09158M': 'h', '08.09133M': 'h', '09.00013M': 's', '09.04455M': 'h', '08.04480M': 'h', '09.04319M': 'h', '08.04480T': 's', '08.09141M': 'h', '09.04455T': 's', '09.02443M': 's', '08.09134M': 's', '09.05016T': 's', '09.05449M': 'h', '09.00010M': 'h', '08.09132M': 'h', '09.05016M': 's', '08.09128M': 's', '08.09134T': 's'}; species=7950;  '}; order_name=Clupeiformes; rank_by_db={'db_v05_prey': 'species'}; genus=7949; order=32446;

caatcaattgtcttttaaatgaagacctgtatgaatggcataacgagggtttaactgtct

ctttctcctggtcagtgaaactgatctacccgtgcagaagcgggtatgaatatacaagac

gagaagaccctatggagctttagacgcccaccaatcacgaaaagcaggtctcgctcaaca

gacttccaaacaacgtgatactggcacaaacgt

Finally, use the obitab command to generate a tab-delimited file based on the sequence attributes, that can be imported into other programs (e.g. excel or R)

obitab -o seal.ali.assigned.uniq.c10.l150.clean.tag.ann.sort.fasta > seal.ali.assigned.uniq.c10.l150.clean.tag.ann.sort.tab

Open the file in excel to view the final results