Getting Started
In this tutorial, we will show you how to generate standardised taxonomic profiles from the diverse outputs of two popular taxonomic profilers, Kraken2 and mOTUs, using taxpasta.
If you want a more detailed walkthrough of why standardising the profiles is useful, please see the Deep Dive tutorial.
Preparation
Software
For this tutorial you will need an internet connection and an installation of taxpasta.
Data
First we will make a ‘scratch’ directory in which we can run the tutorial and which we can delete again afterwards.
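A minimal sketch of this step, assuming a POSIX shell. The directory name taxpasta-tutorial is the one referred to later in this tutorial.

```shell
# Create the scratch directory (no error if it already exists) and enter it.
mkdir -p taxpasta-tutorial
cd taxpasta-tutorial
```

The remaining commands in this tutorial are assumed to be run from inside this directory.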
We will also need to download some example taxonomic profiles from Kraken2 and mOTUs. We can download test data from the taxpasta repository using, for example, curl (macOS, Linux) or wget (generally Linux only).
Info
The following test data come from ancient DNA samples profiled against standard databases, and thus have a high unclassified rate due to uncharacterised environmental contamination and extinct species.
With curl:
## mOTUs
curl -O https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/motus/2612_pe-ERR5766176-db_mOTU.out
curl -O https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/motus/2612_se-ERR5766180-db_mOTU.out
## Kraken2
curl -O https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/kraken2/2612_pe-ERR5766176-db1.kraken2.report.txt
With wget:
## mOTUs
wget https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/motus/2612_pe-ERR5766176-db_mOTU.out
wget https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/motus/2612_se-ERR5766180-db_mOTU.out
## Kraken2
wget https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/kraken2/2612_pe-ERR5766176-db1.kraken2.report.txt
We should now see three non-empty files in the taxpasta-tutorial directory:
2612_pe-ERR5766176-db1.kraken2.report.txt 2612_se-ERR5766180-db_mOTU.out
2612_pe-ERR5766176-db_mOTU.out
Profiles
Raw Output
To begin, let’s look at the first few lines of the output from each profiler, starting with the mOTUs profile:
# git tag version 3.0.3 | motus version 3.0.3 | map_tax 3.0.3 | gene database: nr3.0.3 | calc_mgc 3.0.3 -y insert.scaled_counts -l 75 | calc_motu 3.0.3 -k mOTU -C no_CAMI -g 3 -c -p | taxonomy: ref_mOTU_3.0.3 meta_mOTU_3.0.3
# call: python /usr/local/bin/../share/motus-3.0.1//motus profile -p -c -f ERX5474932_ERR5766176_1.fastq.gz -r ERX5474932_ERR5766176_2.fastq.gz -db db_mOTU -t 2 -n 2612_pe-ERR5766176-db_mOTU -o 2612_pe-ERR5766176-db_mOTU.out
#consensus_taxonomy NCBI_tax_id 2612_pe-ERR5766176-db_mOTU
Leptospira alexanderi [ref_mOTU_v3_00001] 100053 0
Leptospira weilii [ref_mOTU_v3_00002] 28184 0
Chryseobacterium sp. [ref_mOTU_v3_00004] NA 0
Chryseobacterium gallinarum [ref_mOTU_v3_00005] 1324352 0
Chryseobacterium indologenes [ref_mOTU_v3_00006] 253 0
Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v3_00007] NA 0
Chryseobacterium jejuense [ref_mOTU_v3_00008] 445960 0
And here is the start of the Kraken2 report:
99.97 627680 627680 U 0 unclassified
0.03 168 0 R 1 root
0.03 168 0 R1 131567 cellular organisms
0.03 168 0 D 2759 Eukaryota
0.03 168 0 D1 33154 Opisthokonta
0.02 152 0 K 33208 Metazoa
0.02 152 0 K1 6072 Eumetazoa
0.02 152 0 K2 33213 Bilateria
0.02 152 0 K3 33511 Deuterostomia
0.02 152 0 P 7711 Chordata
These look quite different. Neither is in a "pure" tabular format that analysis software or spreadsheet tools such as Microsoft Excel or LibreOffice Calc can load conveniently. They also have different types of columns, and Kraken2 additionally uses indentation to show the taxonomic rank of each taxon.
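To make the indentation point concrete, here is a small sketch that turns the leading spaces of the name column in a Kraken2 report into a numeric depth. It assumes the standard six-column, tab-separated report layout with two spaces of indentation per taxonomic level. The function name kraken2_depth is hypothetical, and the sample rows are taken from the report above with tabs and indentation rebuilt by hand.

```shell
# Print depth, rank code, taxonomy ID, and name for each line of a Kraken2
# report read from standard input.
# Assumes six tab-separated columns with the name column indented by two
# spaces per level. kraken2_depth is a hypothetical helper, not part of taxpasta.
kraken2_depth() {
  awk -F'\t' '{
    name = $6
    indent = match(name, /[^ ]/) - 1   # number of leading spaces
    sub(/^ +/, "", name)               # strip the indentation
    printf "%d\t%s\t%s\t%s\n", indent / 2, $4, $5, name
  }'
}

# Four rows from the report above, rebuilt with tabs and indentation.
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  ' 99.97' 627680 627680 U 0 'unclassified' \
  '  0.03' 168 0 R 1 'root' \
  '  0.03' 168 0 R1 131567 '  cellular organisms' \
  '  0.03' 168 0 D 2759 '    Eukaryota' |
  kraken2_depth
```

Parsing like this has to be repeated, with different rules, for every profiler format, which is the kind of custom scripting that a standardised table avoids.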
Standardisation and Merging
taxpasta standardise
This is where taxpasta comes to the rescue!
With taxpasta, you can standardise profiles and combine them into multi-sample taxon tables directly at the command line, rather than having to do this with custom scripts and a lot of manual data munging.
If you want to standardise a single profile, you need three things:
- The name of the taxonomic profiler used to generate the input file (--profiler or -p)
- The requested output file name with a valid suffix that will tell taxpasta which format to save the output in (--output or -o)
- The input profile file itself
taxpasta standardise -p kraken2 -o 2612_pe-ERR5766176-db1_kraken2.tsv 2612_pe-ERR5766176-db1.kraken2.report.txt
[INFO] Write result to '2612_pe-ERR5766176-db1_kraken2.tsv'.
Let's look at the result:
taxonomy_id count
0 627680
1 0
131567 0
2759 0
33154 0
33208 0
6072 0
33213 0
33511 0
This looks much tidier!
You can see that we did not have to specify any additional column names or other arguments. taxpasta has created a suitable table for you.
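Because the standardised table is plain tab-separated text, simple command-line checks become possible. Judging from the rows shown above, the count values correspond to the third column of the Kraken2 report (627680 for unclassified, 0 for root), so summing them gives a quick sanity check. The sketch below assumes a single header line and the counts in the second column. The function name total_count is hypothetical.

```shell
# Sum the count column of a standardised profile, skipping the header line.
# Reads the file given as an argument, or standard input if none is given.
# total_count is a hypothetical helper, not part of taxpasta.
total_count() {
  awk -F'\t' 'NR > 1 { total += $2 } END { print total + 0 }' "$@"
}

# Demo on the first three rows of the table above, fed via standard input.
printf 'taxonomy_id\tcount\n0\t627680\n1\t0\n131567\t0\n' | total_count
# prints 627680
```

With the file from this tutorial, the equivalent call would be total_count 2612_pe-ERR5766176-db1_kraken2.tsv.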
taxpasta merge
What about the more complicated mOTUs case, where we not only have unusual comment headers but also profiles from multiple samples to be standardised?
In this case, we can instead use taxpasta merge, which will both standardise the profiles of different samples and merge them into one table for you, all from the command line.
Again, we need to specify the profiler, the output name and format (via the suffix), and the input profiles themselves.
taxpasta merge -p motus -o dbMOTUs_motus.tsv 2612_pe-ERR5766176-db_mOTU.out 2612_se-ERR5766180-db_mOTU.out
[WARNING] The merged profiles contained different taxa. Additional zeroes were introduced for missing taxa.
[INFO] Write result to 'dbMOTUs_motus.tsv'.
Let's peek at the result.
taxonomy_id 2612_pe-ERR5766176-db_mOTU 2612_se-ERR5766180-db_mOTU
40518 20 2
216816 1 0
1680 6 1
1262820 1 0
74426 2 1
1907654 1 0
1852370 3 1
39491 3 0
33039 2 0
As with Kraken2, this looks much more tabular, and we can see references to both input files.
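As the warning above notes, zeroes were introduced for taxa missing from a profile, so a zero in the merged table can mean that a taxon was absent from that sample. A minimal sketch for listing only the taxa with a non-zero count in every sample follows. It assumes a tab-separated table with taxonomy_id in the first column and one count column per sample. The function name shared_taxa is hypothetical.

```shell
# Keep the header plus every taxon whose count is non-zero in all samples.
# Reads the file given as an argument, or standard input if none is given.
# shared_taxa is a hypothetical helper, not part of taxpasta.
shared_taxa() {
  awk -F'\t' '
    NR == 1 { print; next }        # pass the header through unchanged
    {
      for (i = 2; i <= NF; i++)
        if ($i + 0 == 0) next      # skip taxa with a zero in any sample
      print
    }' "$@"
}

# Demo on three rows of the merged table above, fed via standard input.
printf 'taxonomy_id\t2612_pe-ERR5766176-db_mOTU\t2612_se-ERR5766180-db_mOTU\n40518\t20\t2\n216816\t1\t0\n1680\t6\t1\n' |
  shared_taxa
```

In the demo, the rows for 40518 and 1680 are kept, while 216816 is dropped because its count in the second sample is 0.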
Danger
We do not (directly) support merging across different classifiers/profilers, as each tool may have its own database, metric, and taxonomy ID system. Merging across them risks naïve assumptions and false-positive interpretations, so taxpasta is designed to help prepare data for cross-classifier comparison without performing it itself. Instead, we highly recommend doing this mindfully in an exploratory fashion. We provide examples of how to do this carefully using R and Python in the corresponding How to merge across profilers section.
However, if you really want this functionality, please let the developers know via a feature request.
Additional functionality
- If you want to learn how to use taxpasta to add taxonomic names (rather than IDs) to your profiles, see here.
- Want to customise the sample names in the columns? See here.
Clean Up
Once you’re happy that you’ve completed the tutorial, you can clean up your workspace by removing the tutorial directory.
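A minimal sketch of the clean-up, assuming a POSIX shell and that your current working directory is the taxpasta-tutorial directory:

```shell
# Leave the tutorial directory (assumes it is the current working directory).
cd ..
# Remove the scratch directory and everything in it, if it exists.
if [ -d taxpasta-tutorial ]; then
  rm -r taxpasta-tutorial
fi
```

This deletes the downloaded profiles and the generated tables, so copy out anything you want to keep first.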