Getting Started

In this getting started tutorial we will show you how to use taxpasta to generate standardised taxonomic profiles from the diverse outputs of two popular taxonomic profilers: Kraken2 and mOTUs.

If you want a more detailed walkthrough of why standardising the profiles is useful, please see the Deep Dive tutorial.

Preparation

Software

For this tutorial you will need an internet connection and an installation of taxpasta.

Data

First we will make a ‘scratch’ directory in which we can run the tutorial, and which we can delete again afterwards.

mkdir taxpasta-tutorial
cd taxpasta-tutorial

We will also need to download some example taxonomic profiles from Kraken2 and mOTUs. We can download test data from the taxpasta repository using, for example, curl (macOS, Linux) or wget (generally Linux only).

Info

The following test data are from ancient DNA samples profiled against standard databases, and thus have a high unclassified rate due to uncharacterised environmental contamination and extinct species.

## mOTUs
curl -O https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/motus/2612_pe-ERR5766176-db_mOTU.out
curl -O https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/motus/2612_se-ERR5766180-db_mOTU.out

## Kraken2
curl -O https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/kraken2/2612_pe-ERR5766176-db1.kraken2.report.txt

Alternatively, using wget:

## mOTUs
wget https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/motus/2612_pe-ERR5766176-db_mOTU.out
wget https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/motus/2612_se-ERR5766180-db_mOTU.out

## Kraken2
wget https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/kraken2/2612_pe-ERR5766176-db1.kraken2.report.txt

We should now see three non-empty files in the taxpasta-tutorial directory:

ls
2612_pe-ERR5766176-db1.kraken2.report.txt  2612_se-ERR5766180-db_mOTU.out
2612_pe-ERR5766176-db_mOTU.out
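
Note that ls only tells us that the files exist, not that the downloads actually contain anything. As a minimal sketch, the small shell function below reports whether each file is present and non-empty. The name check_nonempty is our own invention for this tutorial and is not part of taxpasta.

```shell
# Hypothetical helper (not part of taxpasta): report whether each given
# file exists and is non-empty. Returns non-zero if any file fails.
check_nonempty() {
  status=0
  for f in "$@"; do
    if [ -s "$f" ]; then
      echo "OK      $f"
    else
      echo "MISSING $f"
      status=1
    fi
  done
  return "$status"
}

# Example usage inside the taxpasta-tutorial directory:
#   check_nonempty 2612_pe-ERR5766176-db1.kraken2.report.txt \
#     2612_pe-ERR5766176-db_mOTU.out 2612_se-ERR5766180-db_mOTU.out
```

If any line starts with MISSING, re-run the corresponding download command before continuing.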

Profiles

Raw Output

To begin, let’s look at the contents of the output from each profiler.

head 2612_pe-ERR5766176-db_mOTU.out
# git tag version 3.0.3 |  motus version 3.0.3 | map_tax 3.0.3 | gene database: nr3.0.3 | calc_mgc 3.0.3 -y insert.scaled_counts -l 75 | calc_motu 3.0.3 -k mOTU -C no_CAMI -g 3 -c -p | taxonomy: ref_mOTU_3.0.3 meta_mOTU_3.0.3
# call: python /usr/local/bin/../share/motus-3.0.1//motus profile -p -c -f ERX5474932_ERR5766176_1.fastq.gz -r ERX5474932_ERR5766176_2.fastq.gz -db db_mOTU -t 2 -n 2612_pe-ERR5766176-db_mOTU -o 2612_pe-ERR5766176-db_mOTU.out
#consensus_taxonomy NCBI_tax_id 2612_pe-ERR5766176-db_mOTU
Leptospira alexanderi [ref_mOTU_v3_00001]   100053  0
Leptospira weilii [ref_mOTU_v3_00002]   28184   0
Chryseobacterium sp. [ref_mOTU_v3_00004]    NA  0
Chryseobacterium gallinarum [ref_mOTU_v3_00005] 1324352 0
Chryseobacterium indologenes [ref_mOTU_v3_00006]    253 0
Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v3_00007]  NA  0
Chryseobacterium jejuense [ref_mOTU_v3_00008]   445960  0

head 2612_pe-ERR5766176-db1.kraken2.report.txt
 99.97  627680  627680  U   0   unclassified
  0.03  168 0   R   1   root
  0.03  168 0   R1  131567    cellular organisms
  0.03  168 0   D   2759        Eukaryota
  0.03  168 0   D1  33154         Opisthokonta
  0.02  152 0   K   33208           Metazoa
  0.02  152 0   K1  6072              Eumetazoa
  0.02  152 0   K2  33213               Bilateria
  0.02  152 0   K3  33511                 Deuterostomia
  0.02  152 0   P   7711                    Chordata

These look quite different. Neither of them is in a nice "pure" tabular format that is convenient for analysis software or spreadsheet tools such as Microsoft Excel or LibreOffice Calc to load. They also have different types of columns and, in the case of Kraken2, an interesting "indentation" scheme is used to show the taxonomic rank of each taxon.
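
To get a feel for how much custom parsing a raw report needs, here is a rough sketch of decoding the Kraken2 "indentation" by hand with awk. It assumes the six-column, tab-separated report layout shown above (percentage, clade reads, direct reads, rank code, taxonomy ID, name) and two spaces of indentation per level. The function name kraken2_depth is hypothetical and only used for illustration.

```shell
# Hypothetical helper: for each line of a tab-separated, six-column
# Kraken2 report, print the indentation depth, rank code, taxonomy ID
# and the taxon name with its leading spaces removed.
# Assumption: names are indented by two spaces per level.
kraken2_depth() {
  awk -F'\t' -v OFS='\t' '{
    name = $6
    stripped = name
    sub(/^ +/, "", stripped)                       # drop the indentation
    depth = (length(name) - length(stripped)) / 2  # two spaces per level
    print depth, $4, $5, stripped
  }' "$1"
}

# Example usage:
#   kraken2_depth 2612_pe-ERR5766176-db1.kraken2.report.txt | head
```

Every profiler would need its own variant of such a script, which is exactly the kind of manual work we would like to avoid.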

Standardisation and Merging

taxpasta standardise

This is where taxpasta comes to the rescue!

With taxpasta, you can standardise profiles and combine them into multi-sample taxon tables directly at the command-line, rather than having to do this with custom scripts and a lot of manual data munging.

If you want to standardise a single profile you need three things:

  • The name of the taxonomic profiler used to generate the input file (--profiler or -p)
  • The requested output file name with a valid suffix that will tell taxpasta which format to save the output in (--output or -o)
  • The input profile file itself

taxpasta standardise -p kraken2 -o 2612_pe-ERR5766176-db1_kraken2.tsv 2612_pe-ERR5766176-db1.kraken2.report.txt
[INFO] Write result to '2612_pe-ERR5766176-db1_kraken2.tsv'.

Let's look at the result:

head 2612_pe-ERR5766176-db1_kraken2.tsv
taxonomy_id count
0   627680
1   0
131567  0
2759    0
33154   0
33208   0
6072    0
33213   0
33511   0

This looks much more tidy!

You can see that we did not have to specify any additional column names or other arguments. taxpasta has created a suitable table for you.
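
Because the standardised table is plain tab-separated text with a header, it is now easy to process with generic tools. As a small sketch, the function below keeps the header plus only those taxa with a non-zero count. It assumes the two-column layout shown above (taxonomy_id, count), and the name nonzero_taxa is our own rather than a taxpasta command.

```shell
# Hypothetical helper: print the header and every row whose count
# (second column) is greater than zero.
# Assumption: a tab-separated table with columns taxonomy_id and count.
nonzero_taxa() {
  awk -F'\t' 'NR == 1 || $2 > 0' "$1"
}

# Example usage:
#   nonzero_taxa 2612_pe-ERR5766176-db1_kraken2.tsv
```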

taxpasta merge

What about the more complicated mOTUs case, where we not only have unusual comment headers but also profiles from multiple samples to be standardised?

In this case, we can instead use taxpasta merge, which will both standardise and merge the profiles of different samples into one for you - all through the command-line.

Again, we need to specify the profiler, the output name and format (via the suffix), and the input profiles themselves.

taxpasta merge -p motus -o dbMOTUs_motus.tsv 2612_pe-ERR5766176-db_mOTU.out 2612_se-ERR5766180-db_mOTU.out
[WARNING] The merged profiles contained different taxa. Additional zeroes were introduced for missing taxa.
[INFO] Write result to 'dbMOTUs_motus.tsv'.

Let's peek at the result.

head dbMOTUs_motus.tsv
taxonomy_id 2612_pe-ERR5766176-db_mOTU  2612_se-ERR5766180-db_mOTU
40518   20  2
216816  1   0
1680    6   1
1262820 1   0
74426   2   1
1907654 1   0
1852370 3   1
39491   3   0
33039   2   0

As with Kraken2, this looks much more tabular, and we can see references to both input files.
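
With one column per sample, simple summaries become one-liners. The sketch below assumes the merged layout shown above (a taxonomy_id column followed by one count column per sample). The function names sample_totals and shared_taxa are hypothetical helpers, not part of taxpasta.

```shell
# Hypothetical helper: print the total count of each sample column.
sample_totals() {
  awk -F'\t' -v OFS='\t' '
    NR == 1 { n = NF; for (i = 2; i <= n; i++) name[i] = $i; next }
    { for (i = 2; i <= n; i++) total[i] += $i }
    END { for (i = 2; i <= n; i++) print name[i], total[i] + 0 }
  ' "$1"
}

# Hypothetical helper: print the taxonomy IDs that have a non-zero
# count in every sample.
shared_taxa() {
  awk -F'\t' '
    NR == 1 { next }
    {
      for (i = 2; i <= NF; i++) {
        if ($i <= 0) next   # missing in at least one sample, skip the row
      }
      print $1
    }
  ' "$1"
}

# Example usage:
#   sample_totals dbMOTUs_motus.tsv
#   shared_taxa dbMOTUs_motus.tsv
```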

Danger

We do not (directly) support merging across different classifiers/profilers, as each tool may have its own database, metric, and taxonomy ID system. Merging across them risks naïve assumptions and false-positive interpretations, so taxpasta is designed to help prepare data for cross-classifier comparison without performing the merge itself. We highly recommend doing this mindfully, in an exploratory fashion. We provide examples of how to do this carefully using R and Python in the corresponding How to merge across profilers section.

However, if you really want this functionality, please let the developers know via a feature request.
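
For exploratory work, one cautious option is to keep the results of each profiler side by side in a "long" table with an explicit profiler label, rather than summing counts across tools. The following is only a rough sketch of that idea and not the approach from the How to merge across profilers section. It assumes two-column standardised tables (taxonomy_id, count), and both the function name tag_profile and the file names in the usage comment are hypothetical.

```shell
# Hypothetical helper: prefix every data row of a two-column
# standardised table with a profiler label, dropping the header.
#   $1 = profiler label, $2 = standardised TSV (taxonomy_id, count)
tag_profile() {
  awk -F'\t' -v OFS='\t' -v profiler="$1" \
    'NR > 1 { print profiler, $1, $2 }' "$2"
}

# Example usage (file names are placeholders):
#   {
#     printf 'profiler\ttaxonomy_id\tcount\n'
#     tag_profile kraken2 sample_kraken2.tsv
#     tag_profile motus sample_motus.tsv
#   } > sample_long.tsv
```

Counts from different profilers remain in separate rows here, so any comparison between them stays an explicit, deliberate step.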

Additional functionality

  • If you want to learn how to use taxpasta to add taxonomic names (rather than IDs) to your profiles, see here.
  • Want to customise the sample names in the columns? See here.

Clean Up

Once you’re happy that you’ve completed the tutorial you can clean up your workspace by removing the tutorial directory.

cd ..
rm -rf taxpasta-tutorial