
Deep Dive

Introduction

Following on from our basic 'Getting Started' tutorial, we will once again show you how to use taxpasta to generate standardised taxonomic profiles from the diverse outputs of two popular taxonomic profilers: Kraken2 and mOTUs.

However, in this particular tutorial we will also demonstrate some of the benefits of this standardisation when running downstream analyses on such tables in either of the popular programming languages R and Python.

Preparation

Software

For this tutorial you will need an internet connection, an installation of taxpasta, and, if you want to follow the R parts, an installation of R itself with the readr and dplyr packages from the Tidyverse; the Python parts additionally require Python with the pandas package. Furthermore, we assume a UNIX-based operating system, such as Linux, macOS, or Windows Subsystem for Linux.

To summarise, you will need:

  • an internet connection
  • taxpasta
  • R with the readr and dplyr packages (for the R parts)
  • Python with the pandas package (for the Python parts)
  • a UNIX-based operating system, such as Linux, macOS, or Windows Subsystem for Linux

Data

First we will make a ‘scratch’ directory where we can run the tutorial and which we can delete again afterwards.

mkdir taxpasta-tutorial
cd taxpasta-tutorial

We will also need to download some example taxonomic profiles from Kraken2 and mOTUs. We can download test data from the taxpasta repository using, for example, curl or wget.

With curl:

## mOTUs
curl -O https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/motus/2612_pe-ERR5766176-db_mOTU.out
curl -O https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/motus/2612_se-ERR5766180-db_mOTU.out

## Kraken2
curl -O https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/kraken2/2612_pe-ERR5766176-db1.kraken2.report.txt

Or, alternatively, with wget:

## mOTUs
wget https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/motus/2612_pe-ERR5766176-db_mOTU.out
wget https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/motus/2612_se-ERR5766180-db_mOTU.out

## Kraken2
wget https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/kraken2/2612_pe-ERR5766176-db1.kraken2.report.txt

We should now see the three downloaded files in the taxpasta-tutorial directory:

ls
2612_pe-ERR5766176-db1.kraken2.report.txt  2612_se-ERR5766180-db_mOTU.out
2612_pe-ERR5766176-db_mOTU.out

Profiles

Raw Output

To begin, let’s look at the contents of the output from each profiler.

head 2612_pe-ERR5766176-db_mOTU.out
# git tag version 3.0.3 |  motus version 3.0.3 | map_tax 3.0.3 | gene database: nr3.0.3 | calc_mgc 3.0.3 -y insert.scaled_counts -l 75 | calc_motu 3.0.3 -k mOTU -C no_CAMI -g 3 -c -p | taxonomy: ref_mOTU_3.0.3 meta_mOTU_3.0.3
# call: python /usr/local/bin/../share/motus-3.0.1//motus profile -p -c -f ERX5474932_ERR5766176_1.fastq.gz -r ERX5474932_ERR5766176_2.fastq.gz -db db_mOTU -t 2 -n 2612_pe-ERR5766176-db_mOTU -o 2612_pe-ERR5766176-db_mOTU.out
#consensus_taxonomy NCBI_tax_id 2612_pe-ERR5766176-db_mOTU
Leptospira alexanderi [ref_mOTU_v3_00001]   100053  0
Leptospira weilii [ref_mOTU_v3_00002]   28184   0
Chryseobacterium sp. [ref_mOTU_v3_00004]    NA  0
Chryseobacterium gallinarum [ref_mOTU_v3_00005] 1324352 0
Chryseobacterium indologenes [ref_mOTU_v3_00006]    253 0
Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v3_00007]  NA  0
Chryseobacterium jejuense [ref_mOTU_v3_00008]   445960  0
head 2612_pe-ERR5766176-db1.kraken2.report.txt
 99.97  627680  627680  U   0   unclassified
  0.03  168 0   R   1   root
  0.03  168 0   R1  131567    cellular organisms
  0.03  168 0   D   2759        Eukaryota
  0.03  168 0   D1  33154         Opisthokonta
  0.02  152 0   K   33208           Metazoa
  0.02  152 0   K1  6072              Eumetazoa
  0.02  152 0   K2  33213               Bilateria
  0.02  152 0   K3  33511                 Deuterostomia
  0.02  152 0   P   7711                    Chordata

These look quite different. Neither of them is in a nice "pure" tabular format that is convenient to load into analysis software or spreadsheet tools such as Microsoft Excel or LibreOffice Calc. They also have different column types and, in the case of Kraken2, the taxonomic rank of each taxon is indicated by an unusual indentation of the name.

mOTUs

The following section uses R/Python.

We can try loading a mOTUs profile into R using the common table reading function read_tsv() from the readr package with default arguments.

requireNamespace("readr")
Loading required namespace: readr
profile_motus <- readr::read_tsv("2612_pe-ERR5766176-db_mOTU.out")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 33573 Columns: 1

── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): # git tag version 3.0.3 |  motus version 3.0.3 | map_tax 3.0.3 | ge...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

You can see we immediately hit a parsing problem because, as we saw above, there is a ‘comment’ line at the top of the mOTUs profile with information on how the profile was generated.

While such a comment is very useful for reproducibility, to load this into software expecting ‘true’ tabular data, we have to instead add extra options to the function, which makes loading the table less than smooth for downstream analyses.

profile_motus <- readr::read_tsv("2612_pe-ERR5766176-db_mOTU.out", comment = "#")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 33570 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): Leptospira alexanderi [ref_mOTU_v3_00001]
dbl (2): 100053, 0

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

However, once again we hit another problem: the column headers are also specified as a comment line. Instead, we can try to skip the first two lines entirely.

profile_motus <- readr::read_tsv("2612_pe-ERR5766176-db_mOTU.out", skip = 2)
Rows: 33571 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): #consensus_taxonomy
dbl (2): NCBI_tax_id, 2612_pe-ERR5766176-db_mOTU

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
profile_motus
# A tibble: 33,571 × 3
   `#consensus_taxonomy`                                      NCBI_tax…¹ 2612_…²
   <chr>                                                           <dbl>   <dbl>
 1 Leptospira alexanderi [ref_mOTU_v3_00001]                      100053       0
 2 Leptospira weilii [ref_mOTU_v3_00002]                           28184       0
 3 Chryseobacterium sp. [ref_mOTU_v3_00004]                           NA       0
 4 Chryseobacterium gallinarum [ref_mOTU_v3_00005]               1324352       0
 5 Chryseobacterium indologenes [ref_mOTU_v3_00006]                  253       0
 6 Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v3_00007]         NA       0
 7 Chryseobacterium jejuense [ref_mOTU_v3_00008]                  445960       0
 8 Chryseobacterium sp. G972 [ref_mOTU_v3_00009]                 1805473       0
 9 Chryseobacterium contaminans [ref_mOTU_v3_00010]              1423959       0
10 Chryseobacterium indologenes [ref_mOTU_v3_00011]                  253       0
# … with 33,561 more rows, and abbreviated variable names ¹​NCBI_tax_id,
#   ²​`2612_pe-ERR5766176-db_mOTU`

We can try loading a mOTUs profile into Python using the common table reading function read_table() from the pandas package with default arguments.

import pandas as pd

profile_motus_2612_pe_raw = pd.read_table("2612_pe-ERR5766176-db_mOTU.out")
ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 3

You can see we immediately hit a ParserError, as there is a ‘comment’ line at the top of the mOTUs profile with information on how the profile was generated.

While such a comment is very useful for reproducibility, to load this we have to instead add extra options to the function, which makes loading the table less than smooth for downstream analyses.

profile_motus = pd.read_table(
    "2612_pe-ERR5766176-db_mOTU.out",
    comment="#",
)
profile_motus.head()
Leptospira alexanderi [ref_mOTU_v3_00001] 100053 0
0 Leptospira weilii [ref_mOTU_v3_00002] 28184 0
1 Chryseobacterium sp. [ref_mOTU_v3_00004] nan 0
2 Chryseobacterium gallinarum [ref_mOTU_v3_00005] 1.32435e+06 0
3 Chryseobacterium indologenes [ref_mOTU_v3_00006] 253 0
4 Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v3_00007] nan 0

However, once again we encounter another problem: the column headers are also specified as a comment line and our first line of data is instead read as the column headers. We can try to skip the first two lines entirely to remedy this.

profile_motus = pd.read_table(
    "2612_pe-ERR5766176-db_mOTU.out",
    skiprows=2,
)
profile_motus.head()
#consensus_taxonomy NCBI_tax_id 2612_pe-ERR5766176-db_mOTU
0 Leptospira alexanderi [ref_mOTU_v3_00001] 100053 0
1 Leptospira weilii [ref_mOTU_v3_00002] 28184 0
2 Chryseobacterium sp. [ref_mOTU_v3_00004] nan 0
3 Chryseobacterium gallinarum [ref_mOTU_v3_00005] 1.32435e+06 0
4 Chryseobacterium indologenes [ref_mOTU_v3_00006] 253 0

That works! At least in terms of reasonable table headers. However, we can see that some NCBI taxonomy identifiers are missing, so there is probably more data cleaning ahead of us. Getting to this point already took too much effort to load what is essentially a simple table. Furthermore, if we were interested in loading all mOTUs profiles, we would have to load each sample's profile one by one, requiring more complicated loops and table-joining code.
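
As a rough illustration of that manual effort, a loop over the two mOTUs outputs we downloaded might look like the sketch below. This is only a sketch: it reuses the skiprows=2 workaround from above and outer-joins on the shared taxonomy columns, and it would grow more involved with every additional sample.

import pandas as pd

# Sketch only: manually loading and outer-joining each per-sample mOTUs profile.
motus_files = [
    "2612_pe-ERR5766176-db_mOTU.out",
    "2612_se-ERR5766180-db_mOTU.out",
]

merged = None
for path in motus_files:
    # Skip the two comment lines, as worked out above.
    sample = pd.read_table(path, skiprows=2)
    if merged is None:
        merged = sample
    else:
        # Join on the shared taxonomy columns; each per-sample count column keeps its name.
        merged = merged.merge(
            sample,
            how="outer",
            on=["#consensus_taxonomy", "NCBI_tax_id"],
        )

merged.head()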

Kraken2

And what about the Kraken2 output?

profile_kraken2 <- readr::read_tsv("2612_pe-ERR5766176-db1.kraken2.report.txt")
New names:
• `627680` -> `627680...2`
• `627680` -> `627680...3`
Rows: 43 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): U, unclassified
dbl (4): 99.97, 627680...2, 627680...3, 0

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
profile_kraken2
# A tibble: 43 × 6
   `99.97` `627680...2` `627680...3` U        `0` unclassified
     <dbl>        <dbl>        <dbl> <chr>  <dbl> <chr>
 1    0.03          168            0 R          1 root
 2    0.03          168            0 R1    131567 cellular organisms
 3    0.03          168            0 D       2759 Eukaryota
 4    0.03          168            0 D1     33154 Opisthokonta
 5    0.02          152            0 K      33208 Metazoa
 6    0.02          152            0 K1      6072 Eumetazoa
 7    0.02          152            0 K2     33213 Bilateria
 8    0.02          152            0 K3     33511 Deuterostomia
 9    0.02          152            0 P       7711 Chordata
10    0.02          152            0 P1     89593 Craniata
# … with 33 more rows

This doesn’t fail to load but unfortunately the column headers look a bit weird. It seems the Kraken2 file does not include a column header! In this case we have to specify these ourselves.

profile_kraken2 <- readr::read_tsv(
    "2612_pe-ERR5766176-db1.kraken2.report.txt",
    col_names = c(
        "percent",
        "clade_assigned_reads",
        "direct_assigned_reads",
        "taxonomy_lvl",
        "taxonomy_id",
        "name"
    )
)
Rows: 44 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): taxonomy_lvl, name
dbl (4): percent, clade_assigned_reads, direct_assigned_reads, taxonomy_id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
profile_kraken2
# A tibble: 44 × 6
   percent clade_assigned_reads direct_assigned_reads taxonomy_lvl taxon…¹ name
     <dbl>                <dbl>                 <dbl> <chr>          <dbl> <chr>
 1  100.                 627680                627680 U                  0 uncl…
 2    0.03                  168                     0 R                  1 root
 3    0.03                  168                     0 R1            131567 cell…
 4    0.03                  168                     0 D               2759 Euka…
 5    0.03                  168                     0 D1             33154 Opis…
 6    0.02                  152                     0 K              33208 Meta…
 7    0.02                  152                     0 K1              6072 Eume…
 8    0.02                  152                     0 K2             33213 Bila…
 9    0.02                  152                     0 K3             33511 Deut…
10    0.02                  152                     0 P               7711 Chor…
# … with 34 more rows, and abbreviated variable name ¹​taxonomy_id
We can also try loading the Kraken2 profile into Python with pandas.

profile_kraken2 = pd.read_table("2612_pe-ERR5766176-db1.kraken2.report.txt")
profile_kraken2.head()
99.97 627680 627680.1 U 0 unclassified
0 0.03 168 0 R 1 root
1 0.03 168 0 R1 131567 cellular organisms
2 0.03 168 0 D 2759 Eukaryota
3 0.03 168 0 D1 33154 Opisthokonta
4 0.02 152 0 K 33208 Metazoa

This doesn’t fail to load but unfortunately the column headers look a bit weird. It seems the Kraken2 file does not include a column header! In this case we have to specify these ourselves.

profile_kraken2 = pd.read_table(
    "2612_pe-ERR5766176-db1.kraken2.report.txt",
    names=[
        "percent",
        "clade_assigned_reads",
        "direct_assigned_reads",
        "taxonomy_lvl",
        "taxonomy_id",
        "name",
    ],
)
profile_kraken2.head()
percent clade_assigned_reads direct_assigned_reads taxonomy_lvl taxonomy_id name
0 99.97 627680 627680 U 0 unclassified
1 0.03 168 0 R 1 root
2 0.03 168 0 R1 131567 cellular organisms
3 0.03 168 0 D 2759 Eukaryota
4 0.03 168 0 D1 33154 Opisthokonta

This looks better, but if we look at the Kraken2 documentation, we can also see that sometimes extra columns may be added to the output, meaning that our hard-coded column names would not always work. So again, not trivial.
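
One hedged way to guard against that, sketched below under the assumption that the standard report has exactly these six columns, is to read the file without a header and check the column count before assigning names; anything else would need explicit handling.

import pandas as pd

# Sketch only: read without assuming a header, then verify the column count before naming.
expected_names = [
    "percent",
    "clade_assigned_reads",
    "direct_assigned_reads",
    "taxonomy_lvl",
    "taxonomy_id",
    "name",
]

profile_kraken2 = pd.read_table(
    "2612_pe-ERR5766176-db1.kraken2.report.txt",
    header=None,
)
if profile_kraken2.shape[1] != len(expected_names):
    # A report produced with extra output options would land here and needs its own handling.
    raise ValueError(
        f"Expected {len(expected_names)} columns, found {profile_kraken2.shape[1]}."
    )
profile_kraken2.columns = expected_names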

Comparing Output of Different Profilers

The following section uses R/Python.

What if we wanted to compare the output of the different tools? A nice way to do this would be to merge the files into one table.

In the tidyverse flavour of R, we can do this with the full_join function of the dplyr package. This form of joining tables includes all rows both from the left and right table in the resulting table.

requireNamespace("dplyr")
Loading required namespace: dplyr
dplyr::full_join(profile_motus, profile_kraken2)
Error in `dplyr::full_join()`:
! `by` must be supplied when `x` and `y` have no common variables.
ℹ use `by = character()` to perform a cross-join.

The error "by must be supplied when x and y have no common variables" occurs because the column names are not the same between the two tables produced from the different profilers’ outputs.

We need to specify which column of the left table should be joined with what column of the right table.

raw_merged_table <- dplyr::full_join(profile_motus, profile_kraken2, by = c("NCBI_tax_id" = "taxonomy_id"))
raw_merged_table
# A tibble: 33,615 × 8
   `#consensus_taxonomy`   NCBI_…¹ 2612_…² percent clade…³ direc…⁴ taxon…⁵ name
   <chr>                     <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>   <chr>
 1 Leptospira alexanderi …  100053       0      NA      NA      NA <NA>    <NA>
 2 Leptospira weilii [ref…   28184       0      NA      NA      NA <NA>    <NA>
 3 Chryseobacterium sp. […      NA       0      NA      NA      NA <NA>    <NA>
 4 Chryseobacterium galli… 1324352       0      NA      NA      NA <NA>    <NA>
 5 Chryseobacterium indol…     253       0      NA      NA      NA <NA>    <NA>
 6 Chryseobacterium artoc…      NA       0      NA      NA      NA <NA>    <NA>
 7 Chryseobacterium jejue…  445960       0      NA      NA      NA <NA>    <NA>
 8 Chryseobacterium sp. G… 1805473       0      NA      NA      NA <NA>    <NA>
 9 Chryseobacterium conta… 1423959       0      NA      NA      NA <NA>    <NA>
10 Chryseobacterium indol…     253       0      NA      NA      NA <NA>    <NA>
# … with 33,605 more rows, and abbreviated variable names ¹​NCBI_tax_id,
#   ²​`2612_pe-ERR5766176-db_mOTU`, ³​clade_assigned_reads,
#   ⁴​direct_assigned_reads, ⁵​taxonomy_lvl

With pandas, we can perform such a merge with either the join or merge method, using the how="outer" argument to perform an outer join that includes all rows from both the left and right table in the result. The join method works best when joining tables on their indices. In our case, the indices are not meaningful for the join operation, as they are simple integer ranges, so we will use the merge method instead.

profile_motus.merge(profile_kraken2, how="outer")
MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False

The MergeError occurs because the column names are not the same between the two tables for the different profilers’ outputs. Instead, we need to specify which column of the left table should be joined with what column of the right table. The merge method allows us to do so.

raw_merged_table = profile_motus.merge(
    profile_kraken2,
    how="outer",
    left_on="NCBI_tax_id",
    right_on="taxonomy_id",
)
raw_merged_table.sample(10)
#consensus_taxonomy NCBI_tax_id 2612_pe-ERR5766176-db_mOTU percent clade_assigned_reads direct_assigned_reads taxonomy_lvl taxonomy_id name
33034 Spirosoma endophyticum [ref_mOTU_v3_11717] 662367 0 nan nan nan nan nan nan
23891 Neorhizobium galegae [ref_mOTU_v3_00935] 399 0 nan nan nan nan nan nan
5186 Actinobacteria species incertae sedis [ext_mOTU_v3_15937] nan 0 nan nan nan nan nan nan
18713 Acidobacteria species incertae sedis [ext_mOTU_v3_29464] nan 0 nan nan nan nan nan nan
3049 Flavobacteriaceae species incertae sedis [meta_mOTU_v3_13750] nan 0 nan nan nan nan nan nan
5551 Clostridiales species incertae sedis [ext_mOTU_v3_16302] nan 0 nan nan nan nan nan nan
19547 Clostridiales species incertae sedis [ext_mOTU_v3_30298] nan 0 nan nan nan nan nan nan
28261 Clostridium collagenovorans [ref_mOTU_v3_06789] 29357 0 nan nan nan nan nan nan
2265 Alphaproteobacteria species incertae sedis [meta_mOTU_v3_12884] nan 0 nan nan nan nan nan nan
21849 Clostridiales species incertae sedis [ext_mOTU_v3_32600] nan 0 nan nan nan nan nan nan

But wait, this doesn't look right at all. We know which sample column we have from mOTUs, but what about the Kraken2 read count column? Also, many of the columns of the profiles are not shared between the two classifiers/profilers (see the important caveat at the end of this tutorial).

Ultimately, we have many inconsistencies between different output files. Attempting to merge together the resulting files makes little sense, and thus it's difficult to do any meaningful comparison.

Standardisation and Merging

taxpasta standardise

The following section uses your terminal.

This is where taxpasta comes to the rescue!

With taxpasta, you can standardise and combine profiles into multi-sample taxon tables directly at the command line, rather than having to do this with custom scripts and a lot of manual data munging.

If you want to standardise a single profile you need three things:

  • The name of the taxonomic profiler used to generate the input file (--profiler or -p)
  • The requested output file name with a valid suffix that will tell taxpasta which format to save the output in (--output or -o)
  • The input profile file itself
taxpasta standardise -p kraken2 -o 2612_pe-ERR5766176-db1_kraken2.tsv 2612_pe-ERR5766176-db1.kraken2.report.txt
[INFO] Write result to '2612_pe-ERR5766176-db1_kraken2.tsv'.

Let's look at the result:

head 2612_pe-ERR5766176-db1_kraken2.tsv
taxonomy_id count
0   627680
1   0
131567  0
2759    0
33154   0
33208   0
6072    0
33213   0
33511   0

This looks much tidier!

The remainder of this section uses R/Python.

Now let’s try to load the taxpasta standardised Kraken2 result into R again.

profile_kraken2_std <- readr::read_tsv("2612_pe-ERR5766176-db1_kraken2.tsv")
Rows: 44 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (2): taxonomy_id, count

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
profile_kraken2_std
# A tibble: 44 × 2
   taxonomy_id  count
         <dbl>  <dbl>
 1           0 627680
 2           1      0
 3      131567      0
 4        2759      0
 5       33154      0
 6       33208      0
 7        6072      0
 8       33213      0
 9       33511      0
10        7711      0
# … with 34 more rows

Now let’s try to load the taxpasta standardised Kraken2 result into Python again.

profile_kraken2_std = pd.read_table("2612_pe-ERR5766176-db1_kraken2.tsv")
profile_kraken2_std.head()
taxonomy_id count
0 0 627680
1 1 0
2 131567 0
3 2759 0
4 33154 0

You can see that we did not have to specify any additional column names or other arguments. taxpasta has created a suitable table for you.

taxpasta merge

The following section uses your terminal.

What about the more complicated mOTUs case, where we not only have unusual comment headers but also profiles from multiple samples to be standardised?

In this case, we can instead use taxpasta merge, which will both standardise and merge the profiles of different samples into one for you - all through the command-line.

Again, we need to specify the profiler, the output name and format (via the suffix), and the input profiles themselves.

taxpasta merge -p motus -o dbMOTUs_motus.tsv 2612_pe-ERR5766176-db_mOTU.out 2612_se-ERR5766180-db_mOTU.out
[WARNING] The merged profiles contained different taxa. Additional zeroes were introduced for missing taxa.
[INFO] Write result to 'dbMOTUs_motus.tsv'.

Let's peek at the result.

head dbMOTUs_motus.tsv
taxonomy_id 2612_pe-ERR5766176-db_mOTU  2612_se-ERR5766180-db_mOTU
40518   20  2
216816  1   0
1680    6   1
1262820 1   0
74426   2   1
1907654 1   0
1852370 3   1
39491   3   0
33039   2   0

As with Kraken2, this looks much more tabular, and we can see references to both input files.

The remainder of this section uses R/Python.

Once again, let’s try loading the standardised and merged mOTUs result into R.

profile_motus_merged <- readr::read_tsv("dbMOTUs_motus.tsv")
Rows: 37 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (3): taxonomy_id, 2612_pe-ERR5766176-db_mOTU, 2612_se-ERR5766180-db_mOTU

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
profile_motus_merged
# A tibble: 37 × 3
   taxonomy_id `2612_pe-ERR5766176-db_mOTU` `2612_se-ERR5766180-db_mOTU`
         <dbl>                        <dbl>                        <dbl>
 1       40518                           20                            2
 2      216816                            1                            0
 3        1680                            6                            1
 4     1262820                            1                            0
 5       74426                            2                            1
 6     1907654                            1                            0
 7     1852370                            3                            1
 8       39491                            3                            0
 9       33039                            2                            0
10       39486                            1                            0
# … with 27 more rows

Once again, let’s try loading the standardised and merged mOTUs result into Python.

profile_motus_merged = pd.read_table("dbMOTUs_motus.tsv")
profile_motus_merged.head()
taxonomy_id 2612_pe-ERR5766176-db_mOTU 2612_se-ERR5766180-db_mOTU
0 40518 20 2
1 216816 1 0
2 1680 6 1
3 1.26282e+06 1 0
4 74426 2 1

You can see that we have one taxonomy_id column and two further columns, each referring to one of the two samples, all without having to spend time playing with different loading arguments or additional data transformations.
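
As a small, hedged example of a downstream step this tidy shape enables, per-sample relative abundances can be computed in a couple of lines from the data frame loaded above:

import pandas as pd

# Turn raw counts into per-sample relative abundances (each column divided by its total).
counts = profile_motus_merged.set_index("taxonomy_id")
relative_abundance = counts.div(counts.sum(axis=0), axis=1)
relative_abundance.head()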

If you prefer the column names to be different from just the input file names, you can also provide a sample sheet to customise them.

By default, taxpasta uses taxonomy identifiers to merge tables. If you’re interested in having human-readable taxon names, see How-to add taxon names.

Important caveat

Tip

Carefully read our background documentation on terminology and considerations for comparing results from different metagenomic profilers.

You may have noticed that when "standardising" the output from each profiler, not all columns are retained. This is because each profiler has a different way of reporting relative abundance and produces additional metrics (represented as additional columns) that allow for better evaluation of the accuracy of, or confidence in, each identified taxon.

However, as these metrics are not consistent between profilers, they are not comparable either; therefore taxpasta only retains columns that are conceptually comparable, i.e. taxonomy identifiers and counts. Consequently, taxpasta also does not (directly) support merging across classifiers/profilers, as this needs to be done in a careful and mindful manner. We provide examples of how to do this carefully with R and Python in the corresponding How to Merge Across Classifiers section.

Please be aware that while taxpasta is a utility that makes comparisons between profilers easier to perform, this does not mean that every possible comparison is valid; that will depend on your project, on a case-by-case basis! For example, not all tools use the same taxonomy, so a given taxonomy_id may refer to two different species depending on which taxonomy each tool uses.

As an example, for simple presence-and-absence analyses (such as pathogen screening), taxpasta will be highly suitable for comparing the sensitivity of different tools or reference databases.
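
A minimal sketch of that kind of presence/absence view, using the merged mOTUs table loaded above (the same idea extends to carefully merged cross-profiler tables, as described in the How to Merge Across Classifiers section):

import pandas as pd

# Sketch only: a presence/absence view of the merged mOTUs table.
# A taxon counts as "present" in a sample if it received at least one read.
presence_absence = profile_motus_merged.set_index("taxonomy_id") > 0

# Number of taxa detected per sample.
presence_absence.sum()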

Clean Up

The following section uses your terminal.

Once you’re happy that you’ve completed the tutorial, you can clean up your workspace by removing the tutorial directory.

cd ..
rm -rf taxpasta-tutorial