merge

What

The purpose of the merge command is to standardise and immediately combine multiple taxonomic profiles from the same tool but different samples. This produces a standard 'taxon table' in either wide or long format.

When

You should use taxpasta merge when you want to standardise multiple profiles in one go and have all profiles combined into a single table. You will use this command if you want to load the table directly into a spreadsheet program or programming language without needing to manually combine profiles.

See standardise if you wish to only standardise without merging.

Warning

You should only use this command if you are interested in raw 'counts'! The standardised output will remove profiler-specific information, such as names, percentages, and lineage information.

How

To use this command, you will need multiple profiles from a single tool, the name of that tool, and an output file name:

taxpasta merge --profiler kraken2 --output taxon_table.tsv sample1.kreport.txt sample2.kreport.txt sample3.kreport.txt

where sample1.kreport.txt, sample2.kreport.txt, sample3.kreport.txt are report files from Kraken2.

This will produce a file called taxon_table.tsv that contains the taxpasta 'standard' multi-column structure described below.

Why

Take, for example, the following Kraken2 output file.

 99.98  787758  787758  U   0   unclassified
  0.02  119 0   R   1   root
  0.02  119 0   R1  131567    cellular organisms
  0.02  119 0   D   2759        Eukaryota
  0.02  119 0   D1  33154         Opisthokonta
  0.01  96  0   K   4751            Fungi
  0.01  96  0   K1  451864            Dikarya

This output format is specific to Kraken2 and is unlikely to be directly comparable with the output of other tools, as they record this information in different formats. Furthermore, the indentation used to show taxonomic rank depth is not particularly 'machine-readable', which makes it difficult to load the file into spreadsheet tools or the tabular formats preferred by languages such as R.
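To illustrate why the indentation is awkward to process, here is a minimal Python sketch that parses a single report line by hand. It assumes the report is tab-separated with six columns and that the name column is indented by two spaces per level. The function name `parse_kraken2_line` and the dictionary keys are hypothetical, and this is not how taxpasta reads Kraken2 reports.

```python
# Illustrative line in the layout shown above (assumed tab-separated).
EXAMPLE_LINE = "  0.02\t119\t0\tD\t2759\t    Eukaryota"


def parse_kraken2_line(line):
    """Split one report line into typed fields (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    percent, clade_reads, direct_reads, rank, taxid, name = fields
    # Depth is only encoded as leading spaces in the name column,
    # assumed here to be two spaces per level.
    depth = (len(name) - len(name.lstrip(" "))) // 2
    return {
        "percent": float(percent),
        "clade_reads": int(clade_reads),
        "direct_reads": int(direct_reads),
        "rank": rank,
        "taxonomy_id": int(taxid),
        "name": name.strip(),
        "depth": depth,
    }


record = parse_kraken2_line(EXAMPLE_LINE)
```

Every tool would need its own parser of this kind, which is the work that a standardised table avoids.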

A more common format in metagenomics is to have a first column with the taxon name and each subsequent column representing a different sample. Each cell represents a count of the number of sequence hits against that row's taxon within that column's sample.

We have chosen to reduce all taxa to their respective identifiers. We chose zero as the identifier for unclassified reads. Since many downstream processing and analytics methods assume integer read counts, we only support such a count or pseudo-count column.

An example of this format could be:

taxonomy_id sample1 sample2 sample3
0 787758 2233938 98872
1 119 12929 872
131567 119 5345 800
2759 119 123 200
33154 119 123 29
4751 96 30 29
451864 96 30 29

The header names each column: the first column (taxonomy_id) identifies the taxon, and each subsequent column (sample1, sample2, sample3, and so on) holds the counts for that taxon in one sample.
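As a rough illustration of how such a table can be consumed, the following Python sketch reads the wide table with the standard library and reshapes it into one row per taxon and sample. It assumes the file is tab-separated, as the .tsv extension suggests. The function name `wide_to_long` and the tuple layout of the result are hypothetical and are not necessarily the long format that taxpasta itself writes.

```python
import csv
import io

# The example table from above, assumed to be tab-separated.
WIDE_TABLE = (
    "taxonomy_id\tsample1\tsample2\tsample3\n"
    "0\t787758\t2233938\t98872\n"
    "1\t119\t12929\t872\n"
    "131567\t119\t5345\t800\n"
    "2759\t119\t123\t200\n"
    "33154\t119\t123\t29\n"
    "4751\t96\t30\t29\n"
    "451864\t96\t30\t29\n"
)


def wide_to_long(handle):
    """Return (taxonomy_id, sample, count) tuples from a wide table."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Every column except the identifier is treated as a sample.
    samples = [col for col in reader.fieldnames if col != "taxonomy_id"]
    rows = []
    for record in reader:
        for sample in samples:
            rows.append((int(record["taxonomy_id"]), sample, int(record[sample])))
    return rows


long_rows = wide_to_long(io.StringIO(WIDE_TABLE))
```

With a real file, `open("taxon_table.tsv", newline="")` would take the place of the `io.StringIO` wrapper.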

As you can see here, this is a much more compact way of looking at multiple samples, with the caveat that you may not have additional information, such as the accuracy of each assignment.

Danger

Taxpasta will assume that all taxonomic profiles to be processed are based on the same underlying taxonomy. That means taxpasta will join taxa by their identifier without complaint, even if the profiles stem from different taxonomies.
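The risk can be sketched with a small Python example of a join that uses only the identifier as its key. This is an illustration of the general technique, not taxpasta's implementation. The function name `merge_by_identifier`, the reduced example profiles, and the choice to fill missing taxa with zero are all assumptions made for the sketch.

```python
def merge_by_identifier(profiles):
    """Join per-sample {taxonomy_id: count} mappings on the identifier alone."""
    all_ids = sorted({taxid for counts in profiles.values() for taxid in counts})
    return {
        taxid: {sample: counts.get(taxid, 0) for sample, counts in profiles.items()}
        for taxid in all_ids
    }


# Hypothetical profiles; imagine sample2 was built with a different taxonomy.
profiles = {
    "sample1": {0: 787758, 4751: 96},
    "sample2": {0: 2233938, 2759: 123, 4751: 30},
}

merged = merge_by_identifier(profiles)
```

Nothing in the join can tell whether identifier 4751 refers to the same taxon in both samples. The two counts end up in the same row either way, so checking that all profiles share one taxonomy has to happen before merging.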