Deep Dive¶
Introduction¶
Following from our basic 'Getting Started' tutorial,
we will once again show you how to generate standardised taxonomic profiles from the
diverse outputs of two popular taxonomic profilers:
Kraken2 and
mOTUs with taxpasta
.
However, in this particular we will also demonstrate some benefits of this standardisation when running downstream analyses on such tables in either the popular statistical programming language R or Python.
Preparation¶
Software¶
For this tutorial you will need an internet connection, an installation of taxpasta, and, if you want to follow the R parts, an installation of R itself with the readr and dplyr packages from the Tidyverse. Furthermore, we assume a UNIX based operating system, such as Linux, macOS, or Windows Subsystem for Linux.
To summarise, you will need:
-
Unix terminal (e.g.,
bash
) -
Package dependencies
- Unix terminal (e.g.,
bash
) - Python environment with taxpasta
- No additional requirements, since we will make use of pandas which is already part of the taxpasta installation.
Data¶
First we will make a ‘scratch’ directory where we can run the tutorial and delete again afterwards.
We will also need to download some example taxonomic profiles from Kraken2 and
mOTUs. We can download test data from the taxpasta
repository using, for
example, curl
or wget
.
## mOTUs
curl -O https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/motus/2612_pe-ERR5766176-db_mOTU.out
curl -O https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/motus/2612_se-ERR5766180-db_mOTU.out
## Kraken2
curl -O https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/kraken2/2612_pe-ERR5766176-db1.kraken2.report.txt
## mOTUs
wget https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/motus/2612_pe-ERR5766176-db_mOTU.out
wget https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/motus/2612_se-ERR5766180-db_mOTU.out
## Kraken2
wget https://raw.githubusercontent.com/taxprofiler/taxpasta/main/tests/data/kraken2/2612_pe-ERR5766176-db1.kraken2.report.txt
We should now see three files with contents in the taxpasta-tutorial
directory
2612_pe-ERR5766176-db1.kraken2.report.txt 2612_se-ERR5766180-db_mOTU.out
2612_pe-ERR5766176-db_mOTU.out
Profiles¶
Raw Output¶
To begin, let’s look at the contents of the output from each profiler.
# git tag version 3.0.3 | motus version 3.0.3 | map_tax 3.0.3 | gene database: nr3.0.3 | calc_mgc 3.0.3 -y insert.scaled_counts -l 75 | calc_motu 3.0.3 -k mOTU -C no_CAMI -g 3 -c -p | taxonomy: ref_mOTU_3.0.3 meta_mOTU_3.0.3
# call: python /usr/local/bin/../share/motus-3.0.1//motus profile -p -c -f ERX5474932_ERR5766176_1.fastq.gz -r ERX5474932_ERR5766176_2.fastq.gz -db db_mOTU -t 2 -n 2612_pe-ERR5766176-db_mOTU -o 2612_pe-ERR5766176-db_mOTU.out
#consensus_taxonomy NCBI_tax_id 2612_pe-ERR5766176-db_mOTU
Leptospira alexanderi [ref_mOTU_v3_00001] 100053 0
Leptospira weilii [ref_mOTU_v3_00002] 28184 0
Chryseobacterium sp. [ref_mOTU_v3_00004] NA 0
Chryseobacterium gallinarum [ref_mOTU_v3_00005] 1324352 0
Chryseobacterium indologenes [ref_mOTU_v3_00006] 253 0
Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v3_00007] NA 0
Chryseobacterium jejuense [ref_mOTU_v3_00008] 445960 0
99.97 627680 627680 U 0 unclassified
0.03 168 0 R 1 root
0.03 168 0 R1 131567 cellular organisms
0.03 168 0 D 2759 Eukaryota
0.03 168 0 D1 33154 Opisthokonta
0.02 152 0 K 33208 Metazoa
0.02 152 0 K1 6072 Eumetazoa
0.02 152 0 K2 33213 Bilateria
0.02 152 0 K3 33511 Deuterostomia
0.02 152 0 P 7711 Chordata
These look quite different. Neither of them is in a nice "pure" tabular format that is convenient for analysis software or spreadsheet tools such as Microsoft Excel or LibreOffice Calc to load. They also have different types columns and, in the case of Kraken2, it has an interesting "indentation" way of showing the taxonomic rank of each taxon.
mOTUs¶
The following section uses R/Python.
We can try loading a mOTUs profile into R using the common table reading
function read_tsv()
from the readr
package with default arguments.
Loading required namespace: readr
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 33573 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): # git tag version 3.0.3 | motus version 3.0.3 | map_tax 3.0.3 | ge...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
You can see we immediately hit an error, as as we saw above, there is a ‘comment’ line at the top of the mOTUs profile with information on how the profile was generated.
While such a comment is very useful for reproducibility, to load this into software expecting ‘true’ tabular data, we have to instead add extra options to the function, which makes loading the table less than smooth for downstream analyses.
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 33570 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): Leptospira alexanderi [ref_mOTU_v3_00001]
dbl (2): 100053, 0
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
However, once again we hit another error: the column headers are also specified as a comment line… Instead we can try to skip the first two lines entirely.
Rows: 33571 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): #consensus_taxonomy
dbl (2): NCBI_tax_id, 2612_pe-ERR5766176-db_mOTU
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 33,571 × 3
`#consensus_taxonomy` NCBI_tax…¹ 2612_…²
<chr> <dbl> <dbl>
1 Leptospira alexanderi [ref_mOTU_v3_00001] 100053 0
2 Leptospira weilii [ref_mOTU_v3_00002] 28184 0
3 Chryseobacterium sp. [ref_mOTU_v3_00004] NA 0
4 Chryseobacterium gallinarum [ref_mOTU_v3_00005] 1324352 0
5 Chryseobacterium indologenes [ref_mOTU_v3_00006] 253 0
6 Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v3_00007] NA 0
7 Chryseobacterium jejuense [ref_mOTU_v3_00008] 445960 0
8 Chryseobacterium sp. G972 [ref_mOTU_v3_00009] 1805473 0
9 Chryseobacterium contaminans [ref_mOTU_v3_00010] 1423959 0
10 Chryseobacterium indologenes [ref_mOTU_v3_00011] 253 0
# … with 33,561 more rows, and abbreviated variable names ¹NCBI_tax_id,
# ²`2612_pe-ERR5766176-db_mOTU`
We can try loading a mOTUs profile into Python using the common table
reading function read_table()
from the pandas
package with default
arguments.
ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 3
You can see we immediately hit a ParserError
, as there is a ‘comment’
line at the top of the mOTUs profile with information on how the profile
was generated.
While such a comment is very useful for reproducibility, to load this we have to instead add extra options to the function, which makes loading the table less than smooth for downstream analyses.
profile_motus = pd.read_table(
"2612_pe-ERR5766176-db_mOTU.out",
comment="#",
)
profile_motus.head()
Leptospira alexanderi [ref_mOTU_v3_00001] | 100053 | 0 | |
---|---|---|---|
0 | Leptospira weilii [ref_mOTU_v3_00002] | 28184 | 0 |
1 | Chryseobacterium sp. [ref_mOTU_v3_00004] | nan | 0 |
2 | Chryseobacterium gallinarum [ref_mOTU_v3_00005] | 1.32435e+06 | 0 |
3 | Chryseobacterium indologenes [ref_mOTU_v3_00006] | 253 | 0 |
4 | Chryseobacterium artocarpi/ureilyticum [ref_mOTU_v3_00007] | nan | 0 |
However, once again we encounter another problem: the column headers are also specified as a comment line and our first line of data is instead read as the column headers. We can try to skip the first two lines entirely to remedy this.
#consensus_taxonomy | NCBI_tax_id | 2612_pe-ERR5766176-db_mOTU | |
---|---|---|---|
0 | Leptospira alexanderi [ref_mOTU_v3_00001] | 100053 | 0 |
1 | Leptospira weilii [ref_mOTU_v3_00002] | 28184 | 0 |
2 | Chryseobacterium sp. [ref_mOTU_v3_00004] | nan | 0 |
3 | Chryseobacterium gallinarum [ref_mOTU_v3_00005] | 1.32435e+06 | 0 |
4 | Chryseobacterium indologenes [ref_mOTU_v3_00006] | 253 | 0 |
That works! At least in terms of reasonable table headers. However, we can see that there are missing NCBI taxonomy identifiers so there is probably more data cleaning ahead of us. Getting to this point took too much effort already to load what is essentially a simple table. Furthermore, if we are interested in loading all mOTUs profiles, we would have to load each profile one by one for each sample, requiring more complicated loops and table join code.
Kraken2¶
And what about the Kraken2 output?
New names:
Rows: 43 Columns: 6
── Column specification
──────────────────────────────────────────────────────── Delimiter: "\t" chr
(2): U, unclassified dbl (4): 99.97, 627680...2, 627680...3, 0
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `627680` -> `627680...2`
• `627680` -> `627680...3`
# A tibble: 43 × 6
`99.97` `627680...2` `627680...3` U `0` unclassified
<dbl> <dbl> <dbl> <chr> <dbl> <chr>
1 0.03 168 0 R 1 root
2 0.03 168 0 R1 131567 cellular organisms
3 0.03 168 0 D 2759 Eukaryota
4 0.03 168 0 D1 33154 Opisthokonta
5 0.02 152 0 K 33208 Metazoa
6 0.02 152 0 K1 6072 Eumetazoa
7 0.02 152 0 K2 33213 Bilateria
8 0.02 152 0 K3 33511 Deuterostomia
9 0.02 152 0 P 7711 Chordata
10 0.02 152 0 P1 89593 Craniata
# … with 33 more rows
This doesn’t fail to load but unfortunately the column headers look a bit weird. It seems the Kraken2 file does not include a column header! In this case we have to specify these ourselves.
profile_kraken2 <- readr::read_tsv(
"2612_pe-ERR5766176-db1.kraken2.report.txt",
col_names = c(
"percent",
"clade_assigned_reads",
"direct_assigned_reads",
"taxonomy_lvl",
"taxonomy_id",
"name"
)
)
Rows: 44 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): taxonomy_lvl, name
dbl (4): percent, clade_assigned_reads, direct_assigned_reads, taxonomy_id
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 44 × 6
percent clade_assigned_reads direct_assigned_reads taxonomy_lvl taxon…¹ name
<dbl> <dbl> <dbl> <chr> <dbl> <chr>
1 100. 627680 627680 U 0 uncl…
2 0.03 168 0 R 1 root
3 0.03 168 0 R1 131567 cell…
4 0.03 168 0 D 2759 Euka…
5 0.03 168 0 D1 33154 Opis…
6 0.02 152 0 K 33208 Meta…
7 0.02 152 0 K1 6072 Eume…
8 0.02 152 0 K2 33213 Bila…
9 0.02 152 0 K3 33511 Deut…
10 0.02 152 0 P 7711 Chor…
# … with 34 more rows, and abbreviated variable name ¹taxonomy_id
99.97 | 627680 | 627680.1 | U | 0 | unclassified | |
---|---|---|---|---|---|---|
0 | 0.03 | 168 | 0 | R | 1 | root |
1 | 0.03 | 168 | 0 | R1 | 131567 | cellular organisms |
2 | 0.03 | 168 | 0 | D | 2759 | Eukaryota |
3 | 0.03 | 168 | 0 | D1 | 33154 | Opisthokonta |
4 | 0.02 | 152 | 0 | K | 33208 | Metazoa |
This doesn’t fail to load but unfortunately the column headers look a bit weird. It seems the Kraken2 file does not include a column header! In this case we have to specify these ourselves.
profile_kraken2 = pd.read_table(
"2612_pe-ERR5766176-db1.kraken2.report.txt",
names=[
"percent",
"clade_assigned_reads",
"direct_assigned_reads",
"taxonomy_lvl",
"taxonomy_id",
"name",
],
)
profile_kraken2.head()
percent | clade_assigned_reads | direct_assigned_reads | taxonomy_lvl | taxonomy_id | name | |
---|---|---|---|---|---|---|
0 | 99.97 | 627680 | 627680 | U | 0 | unclassified |
1 | 0.03 | 168 | 0 | R | 1 | root |
2 | 0.03 | 168 | 0 | R1 | 131567 | cellular organisms |
3 | 0.03 | 168 | 0 | D | 2759 | Eukaryota |
4 | 0.03 | 168 | 0 | D1 | 33154 | Opisthokonta |
This looks better but if we look at the Kraken2 documentation, we can also see that sometimes extra columns may be added to the output, meaning that our column names wouldn’t always work. So again, not trivial.
Comparing Output of Different Profilers¶
The following section uses R/Python.
What if we wanted to compare the output of the different tools? A nice way to do this would be to merge the files into one table.
In the tidyverse flavour of R, we can do this with the full_join
function of the dplyr
package. This form of joining tables includes
all rows both from the left and right table in the resulting table.
Loading required namespace: dplyr
Error in `dplyr::full_join()`:
! `by` must be supplied when `x` and `y` have no common variables.
ℹ use by = character()` to perform a cross-join.
The error by must be supplied when x and y have no common variables
occurs because the column names are not the same between the two tables
for the different profilers’ outputs.
We need to specify which column of the left table should be joined with what column of the right table.
raw_merged_table <- dplyr::full_join(profile_motus, profile_kraken2, by = c("NCBI_tax_id" = "taxonomy_id"))
raw_merged_table
# A tibble: 33,615 × 8
`#consensus_taxonomy` NCBI_…¹ 2612_…² percent clade…³ direc…⁴ taxon…⁵ name
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 Leptospira alexanderi … 100053 0 NA NA NA <NA> <NA>
2 Leptospira weilii [ref… 28184 0 NA NA NA <NA> <NA>
3 Chryseobacterium sp. [… NA 0 NA NA NA <NA> <NA>
4 Chryseobacterium galli… 1324352 0 NA NA NA <NA> <NA>
5 Chryseobacterium indol… 253 0 NA NA NA <NA> <NA>
6 Chryseobacterium artoc… NA 0 NA NA NA <NA> <NA>
7 Chryseobacterium jejue… 445960 0 NA NA NA <NA> <NA>
8 Chryseobacterium sp. G… 1805473 0 NA NA NA <NA> <NA>
9 Chryseobacterium conta… 1423959 0 NA NA NA <NA> <NA>
10 Chryseobacterium indol… 253 0 NA NA NA <NA> <NA>
# … with 33,605 more rows, and abbreviated variable names ¹NCBI_tax_id,
# ²`2612_pe-ERR5766176-db_mOTU`, ³clade_assigned_reads,
# ⁴direct_assigned_reads, ⁵taxonomy_lvl
With pandas, we can perform such a merge either with the join
or
merge
methods and an "outer"
argument to perform an outer join that
includes all rows both from the left and right table in the resulting
table. The join
method works best when joining tables on their
indices. In our case, the indices are not meaningful for the join
operation as they are simple integer ranges. We will use the merge
method instead.
MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False
The MergeError
occurs because the column names are not the same
between the two tables for the different profilers’ outputs. Instead, we
need to specify which column of the left table should be joined with
what column of the right table. The merge
method allows us to do so.
raw_merged_table = profile_motus.merge(
profile_kraken2,
how="outer",
left_on="NCBI_tax_id",
right_on="taxonomy_id",
)
raw_merged_table.sample(10)
#consensus_taxonomy | NCBI_tax_id | 2612_pe-ERR5766176-db_mOTU | percent | clade_assigned_reads | direct_assigned_reads | taxonomy_lvl | taxonomy_id | name | |
---|---|---|---|---|---|---|---|---|---|
33034 | Spirosoma endophyticum [ref_mOTU_v3_11717] | 662367 | 0 | nan | nan | nan | nan | nan | nan |
23891 | Neorhizobium galegae [ref_mOTU_v3_00935] | 399 | 0 | nan | nan | nan | nan | nan | nan |
5186 | Actinobacteria species incertae sedis [ext_mOTU_v3_15937] | nan | 0 | nan | nan | nan | nan | nan | nan |
18713 | Acidobacteria species incertae sedis [ext_mOTU_v3_29464] | nan | 0 | nan | nan | nan | nan | nan | nan |
3049 | Flavobacteriaceae species incertae sedis [meta_mOTU_v3_13750] | nan | 0 | nan | nan | nan | nan | nan | nan |
5551 | Clostridiales species incertae sedis [ext_mOTU_v3_16302] | nan | 0 | nan | nan | nan | nan | nan | nan |
19547 | Clostridiales species incertae sedis [ext_mOTU_v3_30298] | nan | 0 | nan | nan | nan | nan | nan | nan |
28261 | Clostridium collagenovorans [ref_mOTU_v3_06789] | 29357 | 0 | nan | nan | nan | nan | nan | nan |
2265 | Alphaproteobacteria species incertae sedis [meta_mOTU_v3_12884] | nan | 0 | nan | nan | nan | nan | nan | nan |
21849 | Clostridiales species incertae sedis [ext_mOTU_v3_32600] | nan | 0 | nan | nan | nan | nan | nan | nan |
But wait, this doesn't look right at all. We know which sample column we have
from mOTUs
, but what about the Kraken
read count column? Also, many of the
columns of the profiles are not shared between the two classifiers/profilers
(see an important note about this here).
Ultimately, we have many inconsistencies between different output files. Attempting to merge together the resulting files makes little sense, and thus it's difficult to do any meaningful comparison.
Standardisation and Merging¶
taxpasta standardise¶
The following section uses your terminal
This is where taxpasta
comes to the rescue!
With taxpasta
, you can standardise and combine profiles into multi-sample taxon tables
for you already at the command-line; rather than having to do this with custom scripts
and a lot of manual data munging.
If you want to standardise a single profile you need three things:
- The name of the taxonomic profiler used to generate the input file (
--profiler
or-p
) - The requested output file name with a valid suffix that will tell
taxpasta
which format to save the output in (--output
or-o
) - The input profile file itself
taxpasta standardise -p kraken2 -o 2612_pe-ERR5766176-db1_kraken2.tsv 2612_pe-ERR5766176-db1.kraken2.report.txt
[INFO] Write result to '2612_pe-ERR5766176-db1_kraken2.tsv'.
Let's look at the result:
taxonomy_id count
0 627680
1 0
131567 0
2759 0
33154 0
33208 0
6072 0
33213 0
33511 0
This looks much more tidy!
The remainder of this section uses R/Python.
Now let’s try to load the taxpasta standardised Kraken2 result into R again.
Rows: 44 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (2): taxonomy_id, count
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 44 × 2
taxonomy_id count
<dbl> <dbl>
1 0 627680
2 1 0
3 131567 0
4 2759 0
5 33154 0
6 33208 0
7 6072 0
8 33213 0
9 33511 0
10 7711 0
# … with 34 more rows
Now let’s try to load the taxpasta standardised Kraken2 result into Python again.
profile_kraken2_std = pd.read_table("2612_pe-ERR5766176-db1_kraken2.tsv")
profile_kraken2_std.head()
taxonomy_id | count | |
---|---|---|
0 | 0 | 627680 |
1 | 1 | 0 |
2 | 131567 | 0 |
3 | 2759 | 0 |
4 | 33154 | 0 |
You can see that we did not have to specify any additional column names or other
arguments. taxpasta
has created a suitable table for you.
taxpasta merge¶
The following section uses your terminal
What about the more complicated mOTUs case, where we not only have unusual comment headers but also profiles from multiple samples to be standardised?
In this case, we can instead use taxpasta merge
, which will both standardise
and merge the profiles of different samples into one for you - all
through the command-line.
Again, We need to specify the profiler, the output name and format (via the suffix), and the input profiles themselves.
taxpasta merge -p motus -o dbMOTUs_motus.tsv 2612_pe-ERR5766176-db_mOTU.out 2612_se-ERR5766180-db_mOTU.out
[WARNING] The merged profiles contained different taxa. Additional zeroes were introduced for missing taxa.
[INFO] Write result to 'dbMOTUs_motus.tsv'.
Let's peek at the result.
taxonomy_id 2612_pe-ERR5766176-db_mOTU 2612_se-ERR5766180-db_mOTU
40518 20 2
216816 1 0
1680 6 1
1262820 1 0
74426 2 1
1907654 1 0
1852370 3 1
39491 3 0
33039 2 0
As with Kraken2, this looks much more tabular, and we can see references to both input files.
The remainder of this section uses R/Python.
Once again, let’s try loading the standardised and merged mOTUs result into R.
Rows: 37 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (3): taxonomy_id, 2612_pe-ERR5766176-db_mOTU, 2612_se-ERR5766180-db_mOTU
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 37 × 3
taxonomy_id `2612_pe-ERR5766176-db_mOTU` `2612_se-ERR5766180-db_mOTU`
<dbl> <dbl> <dbl>
1 40518 20 2
2 216816 1 0
3 1680 6 1
4 1262820 1 0
5 74426 2 1
6 1907654 1 0
7 1852370 3 1
8 39491 3 0
9 33039 2 0
10 39486 1 0
# … with 27 more rows
Once again, let’s try loading the standardised and merged mOTUs result into Python.
taxonomy_id | 2612_pe-ERR5766176-db_mOTU | 2612_se-ERR5766180-db_mOTU | |
---|---|---|---|
0 | 40518 | 20 | 2 |
1 | 216816 | 1 | 0 |
2 | 1680 | 6 | 1 |
3 | 1.26282e+06 | 1 | 0 |
4 | 74426 | 2 | 1 |
You can see we have one taxonomy_id
column, and two columns each referring to
one of the two samples - all without having to spend time playing with different
arguments for loading the files or additional data transformations.
If you prefer the columns names to be different from just the input filenames, you can also provide a sample sheet to customise them.
By default, taxpasta
uses taxonomy identifiers to merge tables. If you’re
interested in having human-readable taxon names see How-to add taxon
names.
Important caveat¶
Tip
Carefully read our background documentation on terminology and considerations for comparing results from different metagenomic profilers.
You may have noticed that when "standardising" the output from each profiler that not all columns are retained. This is because each profiler has a different way of reporting relative abundance, and will produce additional metrics (represented as additional columns) that allow for better evaluation of the accuracy or confidence in each identified taxon.
However, as these metrics are not consistent between each profiler, they are
also not comparable, thus in taxpasta
we only retain columns that
are conceptually comparable, i.e., taxonomy identifiers and counts. taxpasta
also therefore does not (directly) support merging across classifiers/profilers
as this needs to be done in a careful and mindful manner.
We provide examples of how to do this carefully using R and Python in the
corresponding How to Merge Across Classifiers section.
Please be aware that while taxpasta
is a utility to make comparison between
profilers easier to be performed, this does not necessarily mean that all
possible comparisons are necessarily valid - this will depend on a case-by-case
basis of your project! For example, not all taxonomies will be the same for each
tool, so a given taxonomy_id
in one profile may refer to a two different species
depending on what taxonomy a tool uses.
As an example, for simple presence-and-absence analyses (such as pathogen
screening), taxpasta
will be highly suitable for comparing sensitivity of
different tools or reference databases.
Clean Up¶
The following section uses your terminal
Once you’re happy that you’ve completed the tutorial you can clean up your workspace by removing the tutorial directory.