If you want a more detailed walkthrough of why standardising the profiles is useful, please see the Deep Dive tutorial.
For this tutorial you will need an internet connection and an installation of taxpasta.
First we will make a ‘scratch’ directory where we can run the tutorial and which we can delete again afterwards.
We will also need to download some example taxonomic profiles from Kraken2 and
mOTUs. We can download test data from the taxpasta repository using, for
example, curl (OSX, Linux) or wget (generally Linux only).
The following test data come from ancient DNA samples profiled against standard databases, and thus have a high unclassified rate due to uncharacterised environmental contamination and extinct species.
    ## mOTUs
    curl -O https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/motus/2612_pe-ERR5766176-db_mOTU.out
    curl -O https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/motus/2612_se-ERR5766180-db_mOTU.out

    ## Kraken2
    curl -O https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/kraken2/2612_pe-ERR5766176-db1.kraken2.report.txt
    ## mOTUs
    wget https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/motus/2612_pe-ERR5766176-db_mOTU.out
    wget https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/motus/2612_se-ERR5766180-db_mOTU.out

    ## Kraken2
    wget https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/kraken2/2612_pe-ERR5766176-db1.kraken2.report.txt
We should now see three files with contents in the tutorial directory:
2612_pe-ERR5766176-db1.kraken2.report.txt 2612_se-ERR5766180-db_mOTU.out 2612_pe-ERR5766176-db_mOTU.out
To begin, let’s look at the contents of the output from each profiler.
The mOTUs profile 2612_pe-ERR5766176-db_mOTU.out starts like this:

    # git tag version 3.0.3 | motus version 3.0.3 | map_tax 3.0.3 | gene database: nr3.0.3 | calc_mgc 3.0.3 -y insert.scaled_counts -l 75 | calc_motu 3.0.3 -k mOTU -C no_CAMI -g 3 -c -p | taxonomy: ref_mOTU_3.0.3 meta_mOTU_3.0.3
    # call: python /usr/local/bin/../share/motus-3.0.1//motus profile -p -c -f ERX5474932_ERR5766176_1.fastq.gz -r ERX5474932_ERR5766176_2.fastq.gz -db db_mOTU -t 2 -n 2612_pe-ERR5766176-db_mOTU -o 2612_pe-ERR5766176-db_mOTU.out
    #consensus_taxonomy	NCBI_tax_id	2612_pe-ERR5766176-db_mOTU
    Leptospira alexanderi [ref_mOTU_v3_00001]	100053	0
    Leptospira weilii [ref_mOTU_v3_00002]	28184	0
    Chryseobacterium sp. [ref_mOTU_v3_00004]	NA	0
    Chryseobacterium gallinarum [ref_mOTU_v3_00005]	1324352	0
    Chryseobacterium indologenes [ref_mOTU_v3_00006]	253	0
    Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v3_00007]	NA	0
    Chryseobacterium jejuense [ref_mOTU_v3_00008]	445960	0
The Kraken2 report 2612_pe-ERR5766176-db1.kraken2.report.txt starts like this:

     99.97	627680	627680	U	0	unclassified
      0.03	168	0	R	1	root
      0.03	168	0	R1	131567	  cellular organisms
      0.03	168	0	D	2759	    Eukaryota
      0.03	168	0	D1	33154	      Opisthokonta
      0.02	152	0	K	33208	        Metazoa
      0.02	152	0	K1	6072	          Eumetazoa
      0.02	152	0	K2	33213	            Bilateria
      0.02	152	0	K3	33511	              Deuterostomia
      0.02	152	0	P	7711	                Chordata
These look quite different. Neither of them is in a nice "pure" tabular format that is convenient for analysis software or spreadsheet tools such as Microsoft Excel or LibreOffice Calc to load. They also have different types of columns and, in the case of Kraken2, the report uses an interesting "indentation" scheme to show the taxonomic rank of each taxon.
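To make the "not purely tabular" point concrete, here is a minimal Python sketch of what loading the mOTUs excerpt by hand involves. It assumes the columns are tab-separated and that the column header is the comment line starting with `#consensus_taxonomy`; the comment lines are shortened and the function name `read_motus` is made up for this illustration.

```python
import csv
import io

# Excerpt of the mOTUs profile shown above (comment lines shortened).
# Assumption: columns are separated by tabs.
MOTUS_EXCERPT = (
    "# git tag version 3.0.3 | motus version 3.0.3\n"
    "# call: python motus profile -p -c\n"
    "#consensus_taxonomy\tNCBI_tax_id\t2612_pe-ERR5766176-db_mOTU\n"
    "Leptospira alexanderi [ref_mOTU_v3_00001]\t100053\t0\n"
    "Leptospira weilii [ref_mOTU_v3_00002]\t28184\t0\n"
    "Chryseobacterium sp. [ref_mOTU_v3_00004]\tNA\t0\n"
)


def read_motus(text):
    """Return (header, rows) from mOTUs-style text, skipping '#' comments."""
    header = None
    data_lines = []
    for line in text.splitlines():
        if line.startswith("#consensus_taxonomy"):
            # The column header is itself written as a comment line.
            header = line.lstrip("#").split("\t")
        elif not line.startswith("#"):
            data_lines.append(line)
    rows = list(csv.reader(io.StringIO("\n".join(data_lines)), delimiter="\t"))
    return header, rows


header, rows = read_motus(MOTUS_EXCERPT)
print(header)
for row in rows:
    print(row)
```

Even this small reader needs special handling for the header line and for `NA` taxonomy IDs, and a Kraken2 report would need a different reader again.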
Standardisation and Merging
This is where taxpasta comes to the rescue! With taxpasta, you can standardise
and combine profiles into multi-sample taxon tables directly at the command
line, rather than having to do this with custom scripts and a lot of manual
data munging.
If you want to standardise a single profile you need three things:
- The name of the taxonomic profiler used to generate the input file
- The requested output file name with a valid suffix that will tell taxpasta which format to save the output in (this tutorial uses .tsv)
- The input profile file itself
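Putting the three pieces together, a call for the Kraken2 report might look like the sketch below. It assumes that `taxpasta standardise` takes the profiler via `-p` and the output file via `-o` (check `taxpasta standardise --help` for your installed version). The Python wrapper and the helper name `standardise_command` are only for illustration; the printed line is the equivalent shell command.

```python
import shutil
import subprocess
from pathlib import Path


def standardise_command(profiler, output, profile):
    """Build the argument list for a `taxpasta standardise` call."""
    # Assumption: -p selects the profiler and -o names the output file.
    return ["taxpasta", "standardise", "-p", profiler, "-o", output, profile]


command = standardise_command(
    "kraken2",
    "2612_pe-ERR5766176-db1_kraken2.tsv",
    "2612_pe-ERR5766176-db1.kraken2.report.txt",
)
print(" ".join(command))

# Only run the command when taxpasta is installed and the input file exists.
if shutil.which("taxpasta") and Path(command[-1]).exists():
    subprocess.run(command, check=True)
```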
[INFO] Write result to '2612_pe-ERR5766176-db1_kraken2.tsv'.
Let's look at the result:
    taxonomy_id	count
    0	627680
    1	0
    131567	0
    2759	0
    33154	0
    33208	0
    6072	0
    33213	0
    33511	0
This looks much more tidy!
You can see that we did not have to specify any additional column names or other
arguments; taxpasta has created a suitable table for you.
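Comparing the standardised table with the raw report suggests that the count column holds the reads assigned directly to each taxon (the third column of the Kraken2 report), not the clade total: root has 168 reads in its clade but a count of 0. The sketch below reproduces that reduction for the first rows of the report. It is not taxpasta's implementation; it assumes tab-separated fields in the usual Kraken2 report order, and `to_taxonomy_counts` is a hypothetical helper.

```python
# Excerpt of the Kraken2 report shown earlier in the tutorial.
# Assumption: fields are tab-separated in this order: percentage, reads in
# the clade, reads assigned directly, rank code, taxonomy ID, indented name.
KRAKEN2_EXCERPT = (
    " 99.97\t627680\t627680\tU\t0\tunclassified\n"
    "  0.03\t168\t0\tR\t1\troot\n"
    "  0.03\t168\t0\tR1\t131567\t  cellular organisms\n"
    "  0.03\t168\t0\tD\t2759\t    Eukaryota\n"
)


def to_taxonomy_counts(report_text):
    """Keep only the taxonomy ID and the directly assigned read count."""
    rows = []
    for line in report_text.splitlines():
        fields = line.split("\t")
        taxonomy_id = int(fields[4])
        direct_reads = int(fields[2])
        rows.append((taxonomy_id, direct_reads))
    return rows


table = to_taxonomy_counts(KRAKEN2_EXCERPT)
print("taxonomy_id\tcount")
for taxonomy_id, count in table:
    print(f"{taxonomy_id}\t{count}")
```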
What about the more complicated mOTUs case, where we not only have unusual comment headers but also profiles from multiple samples to be standardised?
In this case, we can instead use taxpasta merge, which will both standardise
and merge the profiles of different samples into one for you, all through the
command line.
Again, we need to specify the profiler, the output name and format (via the suffix), and the input profiles themselves.
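For the two mOTUs profiles, such a call might look like the sketch below. As before, it assumes that `taxpasta merge` accepts `-p` for the profiler and `-o` for the output file, followed by the input profiles (check `taxpasta merge --help`). The Python wrapper and the helper name `merge_command` are illustrative only; the printed line is the equivalent shell command.

```python
import shutil
import subprocess
from pathlib import Path


def merge_command(profiler, output, profiles):
    """Build the argument list for a `taxpasta merge` call."""
    # Assumption: -p selects the profiler and -o names the output file;
    # the input profiles follow as positional arguments.
    return ["taxpasta", "merge", "-p", profiler, "-o", output, *profiles]


profiles = [
    "2612_pe-ERR5766176-db_mOTU.out",
    "2612_se-ERR5766180-db_mOTU.out",
]
command = merge_command("motus", "dbMOTUs_motus.tsv", profiles)
print(" ".join(command))

# Only run the command when taxpasta is installed and both inputs exist.
if shutil.which("taxpasta") and all(Path(p).exists() for p in profiles):
    subprocess.run(command, check=True)
```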
    [WARNING] The merged profiles contained different taxa. Additional zeroes were introduced for missing taxa.
    [INFO] Write result to 'dbMOTUs_motus.tsv'.
Let's peek at the result.
    taxonomy_id	2612_pe-ERR5766176-db_mOTU	2612_se-ERR5766180-db_mOTU
    40518	20	2
    216816	1	0
    1680	6	1
    1262820	1	0
    74426	2	1
    1907654	1	0
    1852370	3	1
    39491	3	0
    33039	2	0
As with Kraken2, this looks much more tabular, and we can see references to both input files.
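The warning about additional zeroes describes what is effectively an outer join: every taxon seen in any sample gets a row, and samples that lack it receive a count of 0. The following sketch illustrates the idea on toy data. The per-sample dictionaries are hypothetical (loosely based on the first rows above, assuming the zero entries were absent from the second profile), and `merge_profiles` is not part of taxpasta.

```python
def merge_profiles(profiles):
    """Combine {sample: {taxonomy_id: count}} into wide rows, filling gaps with 0."""
    # Collect every taxonomy ID once, in order of first appearance.
    taxa = {}
    for counts in profiles.values():
        taxa.update(dict.fromkeys(counts))
    samples = list(profiles)
    rows = [[t] + [profiles[s].get(t, 0) for s in samples] for t in taxa]
    return ["taxonomy_id"] + samples, rows


# Hypothetical per-sample counts; taxon 216816 is missing from the second
# sample, so the merge has to introduce a zero for it.
profiles = {
    "2612_pe-ERR5766176-db_mOTU": {40518: 20, 216816: 1, 1680: 6},
    "2612_se-ERR5766180-db_mOTU": {40518: 2, 1680: 1},
}

header, rows = merge_profiles(profiles)
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```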
We do not (directly) support merging across different
classifiers/profilers, as each tool may have its own database, metric, and
taxonomy ID system. Merging across them risks naïve assumptions and
false-positive interpretations, so taxpasta is designed to help
prepare data for cross-classifier comparison without doing the merging itself.
Instead, we highly recommend doing this mindfully, in an exploratory fashion.
We provide examples of how to do this carefully using R and Python in the
corresponding how-to guide on merging across classifiers.
However, if you really want this functionality, please let the developers know via a feature request.
- If you want to learn how to use taxpasta to add taxonomic names (rather than IDs) to your profiles, see here.
- Want to customise the sample names in the columns? See here.
Once you’re happy that you’ve completed the tutorial you can clean up your workspace by removing the tutorial directory.