Skip to content

How-to add taxa names to output

Info

We follow on from the main tutorial including all files just before the clean up step.

If you wish to have actual human-readable taxon names in your standardised output, you need to supply 'taxonomy' files. These files are typically called nodes.dmp and names.dmp. Most profilers use the NCBI taxonomy files.

Assuming that your current working directory is the taxpasta-tutorial directory, we can download the taxonomy files with the following.

curl -O ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz.md5
curl -O ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
md5sum --check taxdump.tar.gz.md5
mkdir taxdump
tar -C taxdump -xzf taxdump.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current

                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

100    49  100    49    0     0     38      0  0:00:01  0:00:01 --:--:--    38

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current

                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

  0 57.8M    0 17376    0     0  13692      0  1:13:49  0:00:01  1:13:48 13681

  8 57.8M    8 4812k    0     0  2169k      0  0:00:27  0:00:02  0:00:25 2168k

 25 57.8M   25 14.8M    0     0  4718k      0  0:00:12  0:00:03  0:00:09 4718k

 42 57.8M   42 24.8M    0     0  6024k      0  0:00:09  0:00:04  0:00:05 6022k

 62 57.8M   62 36.3M    0     0  7159k      0  0:00:08  0:00:05  0:00:03 7447k

 85 57.8M   85 49.2M    0     0  8093k      0  0:00:07  0:00:06  0:00:01  9.9M

100 57.8M  100 57.8M    0     0  8566k      0  0:00:06  0:00:06 --:--:-- 11.3M

taxdump.tar.gz: OK

Once downloaded and extracted, you can supply the directory with the taxdump files to your respective taxpasta commands with the --taxonomy flag, and specify which type of taxonomy information to display, e.g., just the name, the rank, and/or taxonomic lineage.

taxpasta merge -p motus -o dbMOTUs_motus_with_names.tsv --taxonomy taxdump --add-name 2612_pe-ERR5766176-db_mOTU.out 2612_se-ERR5766180-db_mOTU.out
[WARNING] The merged profiles contained different taxa. Additional zeroes were introduced for missing taxa.

[INFO] Write result to 'dbMOTUs_motus_with_names.tsv'.

The merged taxpasta output now looks like:

head dbMOTUs_motus_with_names.tsv
taxonomy_id 2612_pe-ERR5766176-db_mOTU  2612_se-ERR5766180-db_mOTU  name

40518   20  2   Ruminococcus bromii

216816  1   0   Bifidobacterium longum

1680    6   1   Bifidobacterium adolescentis

1262820 1   0   Clostridium sp. CAG:567

74426   2   1   Collinsella aerofaciens

1907654 1   0   Collinsella bouchesdurhonensis

1852370 3   1   Prevotellamassilia timonensis

39491   3   0   [Eubacterium] rectale

33039   2   0   [Ruminococcus] torques

Clean Up

Don't forget to remove the tutorial directory if you don't want to keep it for later use.