How-to Merge Across Profilers
As stated in the main description of the tools and tutorials, taxpasta does
not (directly) support merging across different profilers, as each tool may
have its own reference database, taxonomy, and/or abundance metric. Merging
such profiles risks naïve assumptions and false-positive interpretations, so
taxpasta is designed to help prepare data for cross-profiler analysis
without doing so itself. We highly recommend doing this mindfully in an
Here we will show you how you can load such standardised profiles into R and Python in a way that allows you to distinguish between the two tools as necessary.
You will need the following packages and libraries.
Pandas is already part of the taxpasta installation, so you don't need to install anything further.
Merging Across Profilers
Assuming you had two files in the same directory,
kraken2_db2.tsv - both of which are the output from a previous
taxpasta merge command - you can load them as follows.
First, create a list of the TSV files.
Next, we use the list to create a table with the file names in one column,
load the contents of each file into a nested column, and finally remove the
nesting for those contents. If your files were created in wide format (the
default for taxpasta), then we also need to pivot the tables to end up
with a tidy format. Otherwise, the highlighted lines below won't be needed.
- Remove all of the highlighted lines 6-11, as well as the pipe on line 5 if your tables are already in long format.
This will result in a long format table containing four columns:
First, we iterate over TSV files in the working directory. Next, we load
the table from each file into a pandas dataframe. Assuming that those tables
are in wide format, since that is taxpasta's default, we pivot the tables
into long format using
melt. Otherwise, the highlighted part can be
skipped. Then, we assign the filename to a new column as an identifier for
which profiler was used. Lastly, we concatenate all tables into one
- Remove the call to the melt method (lines 9-14) if your tables are already in long format.
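The Python steps described above can be sketched roughly as follows. This is a minimal illustration rather than the original listing: the helper name `load_profiles` and its `wide` flag are hypothetical, and the column names `taxonomy_id`, `sample`, and `count` are assumptions about the layout of the standardised tables, so adjust them to match your files.

```python
from pathlib import Path

import pandas as pd


def load_profiles(directory=".", wide=True):
    """Load every TSV file in `directory` into one long-format table.

    Hypothetical helper for illustration; assumes each table has a
    `taxonomy_id` column and, in wide format, one column per sample.
    """
    tables = []
    for path in sorted(Path(directory).glob("*.tsv")):
        table = pd.read_table(path)
        if wide:
            # Pivot from wide format (one column per sample) to long format.
            # Set wide=False if the tables are already in long format.
            table = table.melt(
                id_vars="taxonomy_id", var_name="sample", value_name="count"
            )
        # Record the source file so the profiler and database stay identifiable.
        tables.append(table.assign(filename=path.name))
    # Stack all per-file tables into a single table.
    return pd.concat(tables, ignore_index=True)
```

Calling `load_profiles()` in the directory that holds the merged TSV files returns a single dataframe. Wrapping the steps in a function keeps the optional pivot behind one flag, so the same code serves both wide and long inputs.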
The result is a long format table with four columns:
From here, you can ensure that when you are making comparisons between tools you
are taking the tool and database into account via the
filename column. Of
course, you may add further columns like