How-to Merge Across Profilers
As stated in the main description of the tool and the tutorials, taxpasta does
not (directly) support merging across different profilers, as each tool may
have its own reference database, taxonomy, and/or abundance metric. Merging
such profiles risks naïve assumptions and false-positive interpretations;
taxpasta is therefore designed to help prepare data for cross-profiler analysis
without performing that analysis itself. We highly recommend doing this
mindfully and in an exploratory fashion.
Here we will show you how you can load such standardised profiles into R and Python in a way that allows you to distinguish between the two tools as necessary.
Dependencies
You will need the following packages and libraries.
Pandas is already part of the taxpasta installation, so you don't need to install anything further.
Merging Across Profilers
Assuming you have two files in the same directory, motus_dbMOTUs.tsv and
kraken2_db2.tsv, both of which are the output of a previous taxpasta merge
command, you can load them as follows.
First, create a list of the TSV files.
Next, we use the list to create a table with the file names in one column,
load the contents of each file into a nested column, and finally remove the
nesting for those contents. If your files were created in wide format (the
default for taxpasta), then we also need to pivot the tables to end up
with a tidy format. Otherwise, the highlighted lines below won't be
necessary.
- Remove all of the highlighted lines 6-11, as well as the pipe on line 5 if your tables are already in long format.
This will result in a long format table containing four columns:
taxonomy_id, sample, count, and filename.
First, we iterate over TSV files in the working directory. Next, we load
the table from each file into a pandas dataframe. Assuming that those tables
are in wide format, since that is taxpasta's default, we pivot the tables
into long format using melt. Otherwise, the highlighted part can be
skipped. Then, we assign the filename to a new column as an identifier for
which profiler was used. Lastly, we concatenate all tables into one
dataframe.
- Remove the call to the melt method (lines 9-14) if your tables are already in long format.
The result is a long format table with four columns: taxonomy_id, sample,
count, and filename.
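The Python steps described above can be sketched roughly as follows. This is a minimal illustration rather than the exact code from the taxpasta documentation: the function name load_profiles and its wide parameter are hypothetical, and the sketch assumes that wide tables contain a taxonomy_id column plus one column per sample, with no extra annotation columns.

```python
from pathlib import Path

import pandas as pd


def load_profiles(directory=".", wide=True):
    """Combine taxpasta TSV files from `directory` into one long-format table.

    Hypothetical helper: assumes each wide table has a `taxonomy_id` column
    and one column per sample.
    """
    tables = []
    for path in sorted(Path(directory).glob("*.tsv")):
        # Read identifiers as strings so the column has one dtype in every file.
        table = pd.read_csv(path, sep="\t", dtype={"taxonomy_id": str})
        if wide:
            # Wide to long: one row per (taxonomy_id, sample) pair.
            table = table.melt(
                id_vars="taxonomy_id",
                var_name="sample",
                value_name="count",
            )
        # Record the source file so the profiler can be told apart later.
        tables.append(table.assign(filename=path.name))
    if not tables:
        raise FileNotFoundError(f"No TSV files found in {directory!r}")
    return pd.concat(tables, ignore_index=True)
```

Calling load_profiles() in the directory that holds motus_dbMOTUs.tsv and kraken2_db2.tsv would return a single dataframe; pass wide=False if the files were already written in long format.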
From here, you can use the filename column to ensure that comparisons between
tools take the tool and database into account. Of course, you may add further
columns, such as profiler, instead of relying on the filename alone.
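As one possible way of adding such columns, the sketch below splits the filename into a profiler and a database part. It assumes that files are named following the pattern profiler_database.tsv, as in the example files above; the function name add_profiler_columns and the database column are hypothetical and not part of taxpasta.

```python
import pandas as pd


def add_profiler_columns(profiles):
    """Derive `profiler` and `database` columns from the `filename` column.

    Hypothetical helper: assumes file names look like <profiler>_<database>.tsv.
    """
    # Drop the file extension, then split at the first underscore.
    stem = profiles["filename"].str.replace(r"\.tsv$", "", regex=True)
    # partition always yields three columns: before, separator, after.
    parts = stem.str.partition("_")
    return profiles.assign(profiler=parts[0], database=parts[2])
```

With these columns in place, a grouping such as groupby(["profiler", "sample"]) keeps the tools separate in any summary, which helps avoid accidentally pooling counts that are based on different databases or abundance metrics.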