eggNOG mapper v2.0.0 v2.0.1 - eggnogdb/eggnog-mapper GitHub Wiki (2024)

Table of Contents

  • Overview
  • What's new in eggNOG-mapper v2
      • v2.0.0
  • Installation
    • Requirements
      • Software Requirements
      • Storage Requirements
    • Download
    • Fetch databases
  • Basic usage
  • Setting up large annotation jobs
      • Phase 1. Homology searches
      • Phase 2. Orthology and functional annotation
  • Citation

Overview

`eggnog-mapper` is a tool for fast functional annotation of novel sequences. It uses precomputed orthologous groups and phylogenies from the eggNOG database to transfer functional information from fine-grained orthologs only.

Common uses of eggNOG-mapper include the annotation of novel genomes, transcriptomes or even metagenomic gene catalogs.

Using orthology predictions for functional annotation gives higher precision than traditional homology searches (e.g. BLAST searches), because it avoids transferring annotations from close paralogs (duplicate genes that are more likely to have diverged in function).

Benchmarks comparing different eggNOG-mapper options against BLAST and InterProScan are [available here](https://github.com/jhcepas/emapper-benchmark/blob/master/benchmark_analysis.ipynb).

EggNOG-mapper is also available as a public online resource: http://eggnog-mapper.embl.de

What's new in eggNOG-mapper v2

v2.0.0

  • Expanded database of precomputed orthology assignments, now based on eggNOG v5.0. This includes 5,090 representative genomes (4,445 bacteria, 168 archaea and 477 eukaryotes), as well as 2,502 viral proteomes.
  • HMMer search mode is deprecated. See the FAQ entry "Why I cannot choose hmmer search mode in version 2.0" (FAQ---Frequently-Asked-Questions#why-i-cannot-choose-hmmer-search-mode-in-version-20).
  • Updated functional sources (e.g. KEGG, GeneOntology)
  • New columns in the output annotation file:
    1. query_name
    2. seed eggNOG ortholog
    3. seed ortholog evalue
    4. seed ortholog score
    5. Predicted taxonomic group
    6. Predicted protein name
    7. Gene Ontology terms
    8. EC number
    9. KEGG_ko
    10. KEGG_Pathway
    11. KEGG_Module
    12. KEGG_Reaction
    13. KEGG_rclass
    14. BRITE
    15. KEGG_TC
    16. CAZy
    17. BiGG Reaction
    18. tax_scope: eggNOG taxonomic level used for annotation
    19. eggNOG OGs
    20. bestOG (deprecated, use smallest from eggnog OGs)
    21. COG Functional Category
    22. eggNOG free text description
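
For downstream processing, the columns above can be loaded by name. The following is a minimal Python sketch, assuming the annotation file is tab-separated and that comment lines start with `#` (both are assumptions about the file layout, not something stated on this page). The key names in `COLUMNS` are illustrative labels derived from the list above, not identifiers used by eggNOG-mapper itself.

```python
import io

# Illustrative labels for the 22 columns, in the order listed above.
COLUMNS = [
    "query_name", "seed_eggNOG_ortholog", "seed_ortholog_evalue",
    "seed_ortholog_score", "predicted_taxonomic_group",
    "predicted_protein_name", "GO_terms", "EC_number", "KEGG_ko",
    "KEGG_Pathway", "KEGG_Module", "KEGG_Reaction", "KEGG_rclass",
    "BRITE", "KEGG_TC", "CAZy", "BiGG_Reaction", "tax_scope",
    "eggNOG_OGs", "bestOG", "COG_functional_category",
    "eggNOG_description",
]


def read_annotations(handle):
    """Yield one dict per annotated query from an open text handle."""
    for line in handle:
        # Skip blank lines and comment lines (assumed to start with '#').
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Pad short rows so that every column key is always present.
        fields += [""] * (len(COLUMNS) - len(fields))
        yield dict(zip(COLUMNS, fields))


# Tiny made-up input, only to show the call; not real eggNOG-mapper output.
demo = io.StringIO(u"# a comment line\nquery1\tseedA\t1e-50\t200.0\n")
for row in read_annotations(demo):
    print(row["query_name"], row["seed_ortholog_evalue"])
```

With a real file, the same function can be given `open("p53_maNOG.emapper.annotations")` or whichever output name applies; that file name is only an example.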

Installation

Requirements

Software Requirements

  • Python 2.7
  • wget
  • DIAMOND binaries (if none are available on the system, the ones packaged with eggNOG-mapper are used)
  • BioPython (required only if using the `--translate` option)

Storage Requirements

  • ~40GB for the eggNOG annotation database
  • ~10GB for sequence database

Download

  • Download and decompress the latest version of eggnog-mapper from https://github.com/jhcepas/eggnog-mapper/releases. The program requires neither compilation nor installation.
  • Alternatively, clone the git repository (master branch):
git clone https://github.com/jhcepas/eggnog-mapper.git

Fetch databases

To download the necessary databases, run the following script:

download_eggnog_data.py 

This will fetch and decompress all precomputed eggNOG data into the data/ directory.

Basic usage

To start an annotation job, provide a FASTA file containing your query sequences, and run `emapper.py`:

python emapper.py -i test/p53.fa --output p53_maNOG -m diamond

Setting up large annotation jobs

The following recommendations are based on our experience annotating huge genomic and metagenomic datasets (>100M proteins).

eggNOG-mapper works in two phases: 1) finding seed orthologous sequences, and 2) expanding annotations. Phase 1 is mainly CPU intensive, while phase 2 is dominated by disk operations. You can therefore optimize the annotation of huge files by running each phase on a different setup.

Phase 1. Homology searches

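1) Split your input FASTA file into chunks, each containing a moderate number of sequences (1M sequences per file worked well in our tests). We usually work with FASTA files in which each sequence is on a single line, so splitting is very simple: every record takes exactly two lines (header and sequence), and 2,000,000 lines correspond to 1M sequences.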

split -l 2000000 -a 3 -d input_file.faa input_file.chunk_
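
The `split -l` command above relies on every sequence occupying a single line. When a FASTA file wraps sequences over several lines, chunks have to be cut at record boundaries instead. The following Python sketch shows one way to do that; the function names are hypothetical, and the chunk naming simply imitates the numeric suffixes that `split -a 3 -d` produces. Note that each chunk is held in memory before being written.

```python
def split_fasta(lines, seqs_per_chunk):
    """Group FASTA lines into chunks of at most `seqs_per_chunk` records."""
    chunk, n_seqs = [], 0
    for line in lines:
        if line.startswith(">"):
            # A new record begins: close the current chunk if it is full.
            if n_seqs == seqs_per_chunk:
                yield chunk
                chunk, n_seqs = [], 0
            n_seqs += 1
        chunk.append(line)
    if chunk:
        yield chunk


def write_chunks(fasta_path, prefix, seqs_per_chunk=1000000):
    """Write chunks as <prefix>000, <prefix>001, ... and return their paths."""
    paths = []
    with open(fasta_path) as handle:
        for i, chunk in enumerate(split_fasta(handle, seqs_per_chunk)):
            path = "%s%03d" % (prefix, i)
            with open(path, "w") as out:
                out.writelines(chunk)
            paths.append(path)
    return paths
```

For example, `write_chunks("input_file.faa", "input_file.chunk_")` would produce files named like those in the `split` command above.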

2) Use diamond mode. Each chunk can be processed independently on a cluster node, and you should tell `emapper.py` not to run the annotation phase yet. This way you can parallelize the diamond searches as much as you want, even when running from a shared file system. For an example with 100M proteins, the `split` command above generates 100 chunk files, and each chunk is then searched with diamond using 16 cores. The commands that need to be submitted to the cluster queue can be generated with something like this:

# generate all the commands that should be distributed in the cluster
for f in *.chunk_*; do
    echo ./emapper.py -m diamond --no_annot --no_file_comments --cpu 16 -i $f -o $f
done
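
The same command list can be built from Python, which makes it easier to filter the inputs. The sketch below (Python 3, hypothetical function name) uses only the `emapper.py` options shown in the loop above. It additionally skips files whose name contains `.emapper.`, on the assumption that a re-run in the same directory would otherwise pick up earlier result files matching `*.chunk_*`.

```python
import shlex


def diamond_commands(paths, cpu=16):
    """Return one emapper.py search command per chunk file."""
    template = ("./emapper.py -m diamond --no_annot --no_file_comments "
                "--cpu {cpu} -i {path} -o {path}")
    commands = []
    for path in sorted(paths):
        # Assumption: names containing '.emapper.' are earlier outputs.
        if ".emapper." in path:
            continue
        commands.append(template.format(cpu=cpu, path=shlex.quote(path)))
    return commands


# Example with literal names; in practice use glob.glob("*.chunk_*").
for command in diamond_commands(["input_file.chunk_001",
                                 "input_file.chunk_000"]):
    print(command)
```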

Phase 2. Orthology and functional annotation

The annotation phase needs to query `data/eggnog.db` intensively. This file is a sqlite3 database, so it is highly recommended that the file lives on the fastest local disk possible. For instance, we store `eggnog.db` on SSDs or, if possible, under `/dev/shm` (a memory-based filesystem).
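
Copying the database to a fast location can be scripted with a free-space check first, since the annotation database is large and a memory-based filesystem may not have room for it. This is a minimal Python 3 sketch with a hypothetical helper name; how `emapper.py` is then pointed at the copy (for instance, through a symlink from `data/eggnog.db`) is left open here and is an assumption, not a documented procedure.

```python
import os
import shutil


def stage_database(db_path, fast_dir):
    """Copy db_path into fast_dir if there is room; return the new path."""
    size = os.path.getsize(db_path)
    free = shutil.disk_usage(fast_dir).free
    if free < size:
        raise OSError("not enough free space in %s: need %d bytes, have %d"
                      % (fast_dir, size, free))
    target = os.path.join(fast_dir, os.path.basename(db_path))
    shutil.copyfile(db_path, target)
    return target
```

A call such as `stage_database("data/eggnog.db", "/dev/shm")` would return `/dev/shm/eggnog.db` on success.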

3) Concatenate all chunk_*.emapper.seed_orthologs files.

cat *.chunk_*.emapper.seed_orthologs > input_file.emapper.seed_orthologs
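
Before concatenating, it can be worth confirming that every chunk actually produced a result file, since a failed cluster job would otherwise go unnoticed. The following Python sketch assumes the output naming `<chunk>.emapper.seed_orthologs` implied by the `cat` pattern above; the function name is hypothetical.

```python
SUFFIX = ".emapper.seed_orthologs"


def chunks_missing_output(chunk_paths, existing_paths):
    """Return the chunks that have no matching seed_orthologs file."""
    existing = set(existing_paths)
    return [p for p in sorted(chunk_paths) if p + SUFFIX not in existing]


# Example with literal names; in practice both lists could come from
# glob.glob("*.chunk_*"), separating inputs from outputs by SUFFIX.
chunks = ["input_file.chunk_000", "input_file.chunk_001"]
found = ["input_file.chunk_000" + SUFFIX]
print(chunks_missing_output(chunks, found))
```

Any chunk reported here would need its search job resubmitted before the concatenation step.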

4) Run the orthologs search and annotation phase in a single multi core machine (10 cores in our example), reading from a fast disk.

emapper.py --annotate_hits_table input_file.emapper.seed_orthologs --no_file_comments -o output_file --cpu 10

We usually annotate at a rate of 300-400 proteins per second using 10 CPU cores and with `eggnog.db` under `/dev/shm`, but you can of course run many such instances in parallel. If you are running `emapper.py` from a conda environment, see [this issue](https://github.com/jhcepas/eggnog-mapper/issues/80).

And _voilà_, you have your annotations.
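
The throughput quoted above gives a rough idea of how long the annotation phase takes. At 300-400 proteins per second, the 100M-protein example needs between 100,000,000 / 400 = 250,000 s (about 69 hours) and 100,000,000 / 300 ≈ 333,333 s (about 93 hours) on a single instance. The Python sketch below wraps that arithmetic; the `instances` parameter assumes that parallel instances scale linearly, which is an idealization and not a measured result.

```python
def annotation_hours(n_proteins, proteins_per_second, instances=1):
    """Estimate wall-clock hours for the annotation phase."""
    seconds = n_proteins / float(proteins_per_second * instances)
    return seconds / 3600.0


# 100M proteins at the two ends of the quoted rate, single instance.
for rate in (300, 400):
    print(rate, round(annotation_hours(100000000, rate), 1))
```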

Citation

Please cite the following two papers if you use eggNOG-mapper v2:

[1] Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper. Jaime Huerta-Cepas, Damian Szklarczyk, Lars Juhl Jensen, Christian von Mering and Peer Bork. Submitted (2016).

[2] eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Jaime Huerta-Cepas, Damian Szklarczyk, Davide Heller, Ana Hernández-Plaza, Sofia K Forslund, Helen Cook, Daniel R Mende, Ivica Letunic, Thomas Rattei, Lars J Jensen, Christian von Mering, Peer Bork. Nucleic Acids Res. 2019 Jan 8; 47(Database issue): D309–D314. doi: 10.1093/nar/gky1085