Functional ANnoTAtion based on embedding space SImilArity
FANTASIA is an advanced pipeline for the automatic functional annotation of protein sequences using state-of-the-art protein language models. It integrates deep learning embeddings and in-memory similarity searches, retrieving reference vectors from a PostgreSQL database with pgvector-backed storage, to associate Gene Ontology (GO) terms with proteins.
For full documentation, visit FANTASIA Documentation.
For users who need a lightweight, standalone alternative, FANTASIA-Lite provides fast Gene Ontology annotation directly from local FASTA files, without requiring a database server or the full FANTASIA infrastructure. It leverages protein language model embeddings and nearest-neighbor similarity in embedding space to deliver high-quality functional annotations with minimal setup.
For FANTASIA-Lite, visit https://github.com/CBBIO/FANTASIA-Lite
Two packaged reference datasets are available; select one depending on your analysis needs:
- Main Reference (last layer, default)
  Embeddings extracted only from the final hidden layer of each PLM.
  Recommended for most annotation tasks (smaller, faster to load).
  Record: https://zenodo.org/records/17795871
- Multilayer Reference (early layers + final layers)
  Embeddings extracted from multiple hidden layers (including intermediate and final).
  Suitable for comparative and exploratory analyses requiring layer-wise representations.
  Record: https://zenodo.org/records/17793273
- Available Embedding Models
  Supports protein language models: ESM-2, ProtT5, ProstT5, Ankh3-Large, and ESM3c for sequence representation.
- Redundancy Filtering
  Filters out homologous sequences using MMseqs2 in the lookup table, allowing controlled redundancy levels through an adjustable threshold, ensuring reliable benchmarking and evaluation.
- Optimized Data Storage
  Embeddings are stored in HDF5 format for input sequences. The reference table, however, is hosted in a public relational PostgreSQL database using pgvector.
- Efficient Similarity Lookup
  High-throughput similarity search with a hybrid approach: reference embeddings are stored in a PostgreSQL + pgvector database, then loaded per model/layer into memory so similarities can be computed efficiently in the application with vectorized CPU or GPU operations. In the repository default configuration, lookup runs on CPU unless lookup.use_gpu: true is enabled.
- Sequential Embedding + Lookup
  FANTASIA first computes query embeddings and stores them in embeddings.h5, then runs the lookup stage. These stages execute sequentially within a run, so embedding and lookup do not compete for GPU resources unless multiple FANTASIA jobs are launched at the same time.
- Global & Local Alignment of Hits
  Candidate hits from the reference table are aligned both globally and locally against the input protein for validation and scoring.
- Multi-layer Embedding Support
  Optional support for intermediate + final layers to enable layer-wise analyses and improved exploration.
- Raw Outputs & Flexible Post-processing
  Exposes raw result tables for custom analyses and includes a flexible post-processing & scoring system that produces TopGO-ready files.
- Performs high-speed searches using in-memory computations. Reference vectors are retrieved from a PostgreSQL database with pgvector-backed storage for comparison.
- Functional Annotation by Similarity
  Assigns Gene Ontology (GO) terms to proteins based on embedding space similarity, using pre-trained embeddings from all supported models.
- Embedding Generation
  Computes protein embeddings using deep learning models (ProtT5, ProstT5, ESM-2, Ankh3-Large, and ESM3c).
- GO Term Lookup
  Performs vector similarity searches using in-memory computations to assign Gene Ontology terms. Reference embeddings are retrieved from a PostgreSQL database with pgvector-backed storage and loaded per model/layer into memory. In the default configuration, this stage runs on CPU (lookup.use_gpu: false). Only experimental evidence codes are used for transfer.
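The lookup idea can be illustrated with a minimal, plain-Python sketch. It is not FANTASIA's implementation (which uses vectorized CPU or GPU operations on much larger matrices); the function names, the toy three-dimensional vectors, and the reference identifiers are all hypothetical. It only shows the principle: compute a distance from the query embedding to each in-memory reference embedding, keep the closest references, and transfer their GO terms.

```python
import math

def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest_references(query, references, k=2):
    """Return the k closest references as (distance, ref_id) pairs, closest first."""
    scored = [(cosine_distance(query, vec), ref_id) for ref_id, vec in references.items()]
    return sorted(scored)[:k]

def transfer_go_terms(hits, annotations):
    """Emit one row per (reference hit, GO term) pair."""
    rows = []
    for distance, ref_id in hits:
        for go_id in annotations.get(ref_id, []):
            rows.append({"go_id": go_id, "reference": ref_id, "distance": distance})
    return rows

# Toy reference table held in memory (hypothetical ids and vectors).
references = {
    "REF_A": [1.0, 0.0, 0.0],
    "REF_B": [0.0, 1.0, 0.0],
    "REF_C": [0.9, 0.1, 0.0],
}
annotations = {"REF_A": ["GO:0003677"], "REF_C": ["GO:0003677", "GO:0005634"]}

query = [1.0, 0.0, 0.0]
hits = nearest_references(query, references, k=2)
rows = transfer_go_terms(hits, annotations)
for row in rows:
    print(row)
```

Each printed row pairs one transferred GO term with one reference hit, which is also how the raw output is organised.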
The repository default is CPU lookup (lookup.use_gpu: false). For single-run workflows on CUDA-capable systems, enabling GPU lookup with lookup.use_gpu: true is recommended. In the current pipeline, embeddings are generated first and lookup runs afterward, so Stage A and Stage B do not overlap within the same run.
When processing multiple proteomes on a single GPU-equipped machine, a sequential launcher script is recommended. Running one proteome at a time preserves the same non-overlapping execution model used within a single FANTASIA run and avoids GPU contention between concurrent jobs. This is often the simplest and most reliable strategy for small-to-medium batches of proteomes.
Example:
./scripts/run_sequential_proteomes.sh ../config/prott5_full.yaml /path/to/proteomes /path/to/experiments prott5
The GPU memory required by the lookup stage depends mainly on:
- the size of the reference embedding matrix
- the lookup query batch size
- the embedding dimensionality
- temporary tensors created during cosine or euclidean distance computation
Because FANTASIA runs embeddings first and lookup afterward, GPU lookup memory requirements do not depend on the embedding step being active within the same run.
For a typical single-model Prot-T5 layer-0 lookup on a proteome, the reference matrix may be on the order of 123,977 x 1024, with lookup batches such as 516 x 1024 using float32 tensors. In practice, this fits comfortably on a 24 GB GPU and is generally expected to fit on a 16 GB GPU as well. Actual memory requirements still depend on the selected reference dataset, enabled layers/models, and lookup batch size.
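As a rough back-of-the-envelope check of the figures above, the sketch below computes the float32 footprint of the three main tensors. It assumes that the full batch-by-reference distance matrix is materialised at once and ignores framework overhead and any extra temporaries, so treat the result as an order-of-magnitude estimate rather than a measurement; the helper name is hypothetical.

```python
BYTES_PER_FLOAT32 = 4

def tensor_mib(rows, cols, bytes_per_value=BYTES_PER_FLOAT32):
    """Size of a dense rows x cols tensor in MiB."""
    return rows * cols * bytes_per_value / 2**20

reference_mib = tensor_mib(123_977, 1024)  # reference embedding matrix
batch_mib = tensor_mib(516, 1024)          # one lookup query batch
distances_mib = tensor_mib(516, 123_977)   # batch x reference distances (assumed dense)

total_mib = reference_mib + batch_mib + distances_mib
print(f"reference: {reference_mib:.1f} MiB")
print(f"batch:     {batch_mib:.1f} MiB")
print(f"distances: {distances_mib:.1f} MiB")
print(f"total:     {total_mib:.1f} MiB")
```

Under these assumptions the core tensors stay below 1 GiB, which is consistent with the expectation that a single-model, single-layer lookup fits on a 16 GB GPU.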
The table below summarizes a lookup-only benchmark on a single proteome using the same precomputed Prot-T5 embeddings and the same reference table. Only the lookup execution device was changed.
Benchmark hardware for the GPU run:
- GPU: NVIDIA GeForce RTX 3090 Ti
- VRAM: 24 GB
- CUDA available in the runtime environment: True
- PyTorch build used for the benchmark: 2.11.0+cu130
| Proteome | Input proteins | Mean protein length (aa) | Max protein length (aa) | Embedded proteins | Lookup tasks | Lookup device | Distance time (total) | Distance time / batch | Lookup wall time |
|---|---|---|---|---|---|---|---|---|---|
| A proteome (Prot-T5, layer 0) | 20,223 | 392.25 | 8,215 | 20,223 | 20,223 | CPU | 1,835.89 s | 45.90 s | 1,933.08 s |
| A proteome (Prot-T5, layer 0) | 20,223 | 392.25 | 8,215 | 20,223 | 20,223 | GPU | 17.05 s | 0.43 s | 126.95 s |
Observed speedup in this benchmark:
- Distance kernel: about 108x faster on GPU (1835.89 s → 17.05 s)
- Lookup wall time: about 15x faster on GPU (1933.08 s → 126.95 s)
In this benchmark, no proteins were discarded before embedding: the input FASTA contained 20,223 proteins and the generated embeddings.h5 also contained 20,223 embedded accessions.
Long proteins were not removed either. Instead, when embedding.max_sequence_length is set, FANTASIA truncates sequences longer than that limit before embedding. This means lookup can still cover the full proteome while controlling the memory cost of the embedding stage.
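A minimal sketch of this truncation behaviour is shown below. The function name is hypothetical, and keeping the N-terminal prefix (the first residues) is an assumption; the text above only states that sequences longer than the limit are truncated, not which end is kept.

```python
def truncate_sequence(sequence, max_sequence_length=None):
    """Return the sequence unchanged if no limit is set or it already fits.

    Assumption: when truncation applies, the first max_sequence_length
    residues are kept.
    """
    if max_sequence_length is None or len(sequence) <= max_sequence_length:
        return sequence
    return sequence[:max_sequence_length]

proteome = {"P1": "MKTAYIAKQR", "P2": "MK"}
truncated = {acc: truncate_sequence(seq, max_sequence_length=5) for acc, seq in proteome.items()}
print(truncated)
```

Every accession is still present after truncation, so the lookup stage sees the same set of proteins as the input FASTA.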
FANTASIA writes lookup results in three main forms:
- Per-accession raw CSV files under raw_results/{model}/layer_{k}/
- A global summary.csv produced during post-processing
- TopGO-ready files under topgo/
If you need to consolidate many per-accession raw CSV files into a single table for downstream analysis, use the merge utility.
Example:
python scripts/merge_raw_results.py \
/path/to/experiment/raw_results/prot-t5/layer_0 \
-o /path/to/experiment/raw_results/prot-t5/layer_0_merged.csv \
--add-source-file
The raw CSVs are the most detailed output. Each row represents one transferred GO annotation associated with one retrieved reference hit for one query protein.
Typical columns include:
- accession: query protein accession
- go_id: transferred GO term
- go_description: GO term name
- category: GO namespace, typically BP, MF, or CC
- distance: embedding-space distance between the query and the selected reference hit
- reliability_index: similarity-derived score computed from distance
- model_name: embedding model used for the lookup
- layer_index: model layer used for the lookup
- protein_id, organism, gene_name: metadata from the matched reference protein
- evidence_code: evidence code associated with the transferred annotation
- query_len, ref_len: query and reference sequence lengths
If sequence-aware storage is enabled, the raw CSVs can also include alignment-derived metrics:
- identity, similarity, alignment_score, gaps_percentage: global alignment metrics
- identity_sw, similarity_sw, alignment_score_sw, gaps_percentage_sw: local Smith-Waterman-style alignment metrics
- alignment_length, alignment_length_sw: aligned lengths for the global and local alignments
distance is the nearest-neighbor distance in embedding space, so lower values indicate a closer reference match.
reliability_index is derived from distance so that higher values indicate stronger support:
- cosine lookup: reliability_index = 1 - distance
- euclidean lookup: reliability_index = 0.5 / (0.5 + distance)
- other metrics: reliability_index = 1 / (1 + distance)
In practice:
- lower distance is better
- higher reliability_index is better
- reliability_index is the easiest column to rank by in the raw files
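The three formulas above can be written as one small function. This is a sketch for illustration, not the project's code: the function name and the metric strings "cosine" and "euclidean" are assumed labels.

```python
def reliability_index(distance, metric):
    """Map an embedding-space distance to a score where higher means stronger support."""
    if metric == "cosine":
        return 1.0 - distance
    if metric == "euclidean":
        return 0.5 / (0.5 + distance)
    # Fallback used for any other metric.
    return 1.0 / (1.0 + distance)

for metric, distance in [("cosine", 0.2), ("euclidean", 0.5), ("manhattan", 1.0)]:
    print(metric, distance, reliability_index(distance, metric))
```

In all three cases a distance of zero maps to a reliability index of 1, and the index decreases as the distance grows.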
When alignment metrics are present:
- identity and related columns summarize the global end-to-end alignment
- identity_sw and related columns summarize the best local alignment segment
This is useful because some hits may share only a conserved local region. A protein can therefore have:
- moderate global identity but high local identity
- strong embedding similarity together with weak sequence alignment, or the reverse
These fields are best interpreted as complementary evidence rather than strict pass/fail filters.
summary.csv is the post-processed accession-by-GO summary table. It should be interpreted as the output of a heuristic ranking procedure, not as a table of probabilities. In particular, final_score is not a probability score and should not be read as a calibrated confidence value. The table aggregates all raw rows belonging to the same (accession, go_id, model_name, layer_index) combination and computes configured statistics such as min, max, and mean.
By default, the repository configuration summarizes:
- reliability_index
- identity
- identity_sw
- support count normalized by limit_per_entry
The default aliases are:
- ri for reliability_index
- id_g for global identity
- id_l for local identity
In the current code, the support count metric is derived from the number of raw rows supporting the same (accession, go_id, model_name, layer_index) group, normalized by limit_per_entry. This means count acts as a support-strength signal rather than a probability: GO terms supported repeatedly across raw hits receive a larger value.
So columns such as max_ri_ProtT5_L0, mean_id_g_ProtT5_L0, or max_id_l_ProtT5_L0 in summary.csv represent aggregated per-model, per-layer evidence for the same accession and GO term.
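The grouping described above can be sketched in plain Python as follows. This is an illustration under assumptions, not the repository's post-processing code: the function name, the toy rows, and the count column name are hypothetical, and only a few of the configurable statistics are computed.

```python
from collections import defaultdict
from statistics import mean

def summarize(raw_rows, limit_per_entry):
    """Aggregate raw rows per (accession, go_id, model_name, layer_index)."""
    groups = defaultdict(list)
    for row in raw_rows:
        key = (row["accession"], row["go_id"], row["model_name"], row["layer_index"])
        groups[key].append(row)

    summary = []
    for (accession, go_id, model, layer), rows in groups.items():
        suffix = f"{model}_L{layer}"
        summary.append({
            "accession": accession,
            "go_id": go_id,
            f"max_ri_{suffix}": max(r["reliability_index"] for r in rows),
            f"mean_id_g_{suffix}": mean(r["identity"] for r in rows),
            # Support count normalised by limit_per_entry (column name assumed).
            f"count_{suffix}": len(rows) / limit_per_entry,
        })
    return summary

raw_rows = [
    {"accession": "P1", "go_id": "GO:0003677", "model_name": "ProtT5", "layer_index": 0,
     "reliability_index": 0.9, "identity": 0.4},
    {"accession": "P1", "go_id": "GO:0003677", "model_name": "ProtT5", "layer_index": 0,
     "reliability_index": 0.7, "identity": 0.6},
    {"accession": "P1", "go_id": "GO:0005634", "model_name": "ProtT5", "layer_index": 0,
     "reliability_index": 0.8, "identity": 0.5},
]
summary = summarize(raw_rows, limit_per_entry=5)
for entry in summary:
    print(entry)
```

In this toy input, the GO term supported by two raw hits receives a normalised count of 2/5, while the term supported by a single hit receives 1/5.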
If weights are configured, FANTASIA also writes:
- weighted columns prefixed by w_
- a composite final_score
final_score is a configuration-driven heuristic ranking score, not a universal probability or calibrated confidence value. Its objective is to combine several evidence signals into one sortable value so candidate GO terms can be prioritized within the same run and configuration.
In the repository default configuration, final_score is built from a weighted combination of:
- the best embedding-derived support (max_ri)
- the best global alignment identity (max_id_g)
- the best local alignment identity (max_id_l)
- the support count
This makes final_score useful for ranking candidate GO terms, filtering outputs, and downstream prioritization, but its numerical value should not be interpreted as a probability of correctness. Changing the configured metrics or weights changes the meaning of the score.
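To make the idea of a weighted combination concrete, here is a minimal sketch. It assumes a plain weighted sum; the actual combination rule, the weight values, and the function name are not taken from the repository and are purely illustrative.

```python
def weighted_columns(metrics, weights):
    """Return the w_-prefixed weighted value for each configured metric."""
    return {f"w_{name}": weights[name] * metrics[name] for name in weights}

def final_score(metrics, weights):
    """Assumed rule: sum of the weighted metric values."""
    return sum(weighted_columns(metrics, weights).values())

# Hypothetical weights; the repository configuration may differ.
weights = {"max_ri": 0.5, "max_id_g": 0.2, "max_id_l": 0.2, "count": 0.1}

# Aggregated evidence for one (accession, GO term) pair, toy values.
metrics = {"max_ri": 0.9, "max_id_g": 0.5, "max_id_l": 0.8, "count": 0.4}

print(weighted_columns(metrics, weights))
print(final_score(metrics, weights))
```

With different weights the same evidence yields a different score, which is why values are only comparable within one run and configuration.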
If lookup.topgo: true, FANTASIA also exports TopGO-compatible files under topgo/.
- Per-model/layer exports keep rows separated by model, layer, and GO category
- Ensemble exports keep the best reliability_index per (accession, go_id, category) across all models and layers
These files contain three columns in tab-separated form:
- accession
- GO term
- reliability index
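The ensemble rule and the three-column layout can be sketched as below. The function names and toy rows are hypothetical, and details such as file naming or whether a header line is written are not covered here.

```python
import csv
import io

def ensemble_best(rows):
    """Keep the highest reliability_index per (accession, go_id, category)."""
    best = {}
    for row in rows:
        key = (row["accession"], row["go_id"], row["category"])
        if key not in best or row["reliability_index"] > best[key]:
            best[key] = row["reliability_index"]
    return best

def write_topgo(best, handle):
    """Write tab-separated lines: accession, GO term, reliability index."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for (accession, go_id, _category), ri in sorted(best.items()):
        writer.writerow([accession, go_id, ri])

rows = [
    {"accession": "P1", "go_id": "GO:0003677", "category": "MF",
     "model_name": "ProtT5", "reliability_index": 0.7},
    {"accession": "P1", "go_id": "GO:0003677", "category": "MF",
     "model_name": "ESM-2", "reliability_index": 0.9},
    {"accession": "P2", "go_id": "GO:0005634", "category": "CC",
     "model_name": "ProtT5", "reliability_index": 0.6},
]
buffer = io.StringIO()
write_topgo(ensemble_best(rows), buffer)
print(buffer.getvalue(), end="")
```

Here the two models disagree on the support for the same term, and the export keeps the higher of the two values.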
FANTASIA requires two key services:
- PostgreSQL 16 with pgvector: Stores reference protein embeddings used by the lookup stage
- RabbitMQ: Message broker for distributed embedding task processing
Prerequisites:
- Python 3.12 (the project metadata specifies >=3.12,<4.0). A Conda environment based on Python 3.12 is a suitable local setup option.
- Docker and Docker Compose installed
Additional dependency notes:
- MMseqs2 is required if you enable redundancy filtering during lookup. FANTASIA invokes the external mmseqs executable, so it must be installed separately and available in your PATH.
- Parasail is used for alignment-based post-processing through its Python package. When FANTASIA is installed through its declared Python dependencies, parasail is provided by the runtime environment and does not need to be invoked as a separate command-line tool.
- Start services (from the FANTASIA directory):
  docker-compose up -d
- Verify services are running:
  docker-compose ps
  Expected output:
  CONTAINER ID   IMAGE                          STATUS
  xxx            pgvector/pgvector:0.7.0-pg16   Up (healthy)
  xxx            rabbitmq:3.13-management       Up (healthy)
- Test database connection:
  PGPASSWORD=clave psql -h localhost -U usuario -d BioData -c "SELECT 1"
The docker-compose.yml is configured with the following default credentials (matching config.yaml):
| Service | Host | Port | User | Password | Database |
|---|---|---|---|---|---|
| PostgreSQL | localhost | 5432 | usuario | clave | BioData |
| RabbitMQ | localhost | 5672 | guest | guest | - |
BioData is the default local PostgreSQL database name used for the restored reference lookup table downloaded from Zenodo. It is a configurable database name, not a separate repository requirement.
RabbitMQ Management UI is available at: http://localhost:15672 (user: guest, password: guest)
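If psql is not at hand, a quick reachability check of the default ports listed above can be done with the Python standard library. This sketch only tests whether a TCP connection can be opened; it does not authenticate or verify that pgvector is installed, and the function names are hypothetical.

```python
import socket

def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Default host/port pairs from the table above.
SERVICES = {"PostgreSQL": ("localhost", 5432), "RabbitMQ": ("localhost", 5672)}

def check_services(services=SERVICES):
    """Map each service name to its reachability."""
    return {name: port_is_open(host, port) for name, (host, port) in services.items()}

if __name__ == "__main__":
    for name, reachable in check_services().items():
        print(f"{name}: {'reachable' if reachable else 'not reachable'}")
```

A "not reachable" result usually means the containers are stopped or the ports are mapped differently from the defaults.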
Connection refused error:
# Check if containers are running
docker-compose ps
# If stopped, restart them
docker-compose restart
# View logs
docker-compose logs postgres
docker-compose logs rabbitmq
Password authentication failed:
Ensure the credentials in docker-compose.yml match those in config.yaml:
# Current values in docker-compose.yml
POSTGRES_USER: usuario
POSTGRES_PASSWORD: clave
POSTGRES_DB: BioData
Cleaning up: To remove containers and volumes:
docker-compose down -v
| Name | Model ID | Params | Architecture | Description |
|---|---|---|---|---|
| ESM-2 | facebook/esm2_t33_650M_UR50D | 650M | Encoder (33L) | Learns structure/function from UniRef50. No MSAs. Optimized for accuracy. |
| ProtT5 | Rostlab/prot_t5_xl_uniref50 | 1.2B | Encoder-Decoder | Trained on UniRef50. Strong transfer for structure/function tasks. |
| ProstT5 | Rostlab/ProstT5 | 1.2B | Multi-modal T5 | Learns 3Di structural states + function. Enhances contact/function tasks. |
| Ankh3-Large | ElnaggarLab/ankh3-large | 620M | Encoder (T5-style) | Fast inference. Good semantic/structural representation. |
| ESM3c | esmc_600m | 600M | Encoder (36L) | New gen. model trained on UniRef + MGnify + JGI. High precision & speed. |
FANTASIA is the result of a collaborative effort between Ana Rojas’ Lab (CBBIO) (Andalusian Center for Developmental Biology, CSIC) and Rosa Fernández’s Lab (Metazoa Phylogenomics Lab, Institute of Evolutionary Biology, CSIC-UPF). This project demonstrates the synergy between research teams with diverse expertise.
This version of FANTASIA builds upon previous work from:
- Metazoa Phylogenomics Lab's FANTASIA
  The original implementation of FANTASIA for functional annotation.
- bio_embeddings
  A state-of-the-art framework for generating protein sequence embeddings.
- GoPredSim
  A similarity-based approach for Gene Ontology annotation.
- MMseqs2
  Used for redundancy-aware sequence clustering and filtering during lookup workflows.
- Parasail
  Provides high-performance pairwise sequence alignment routines used in hit validation and post-processing.
- protein-information-system
  Serves as the reference biological information system, providing a robust data model and curated datasets for protein structural and functional analysis.
We also extend our gratitude to LifeHUB-CSIC for inspiring this initiative and fostering innovation in computational biology.
If you use FANTASIA in your research, please cite the following publications:
- Martínez-Redondo, G. I., Barrios, I., Vázquez-Valls, M., Rojas, A. M., & Fernández, R. (2024).
  Illuminating the functional landscape of the dark proteome across the Animal Tree of Life.
  DOI: 10.1101/2024.02.28.582465
- Barrios-Núñez, I., Martínez-Redondo, G. I., Medina-Burgos, P., Cases, I., Fernández, R., & Rojas, A. M. (2024).
  Decoding proteome functional information in model organisms using protein language models.
  DOI: 10.1101/2024.02.14.580341
FANTASIA is distributed under the terms of the GNU Affero General Public License v3.0.
- Ana M. Rojas: a.rojas.m@csic.es
- Rosa Fernández: rosa.fernandez@ibe.upf-csic.es
- Belén Carbonetto: belen.carbonetto.metazomics@gmail.com
- Àlex Domínguez Rodríguez: adomrod4@upo.es
- Gemma I. Martínez-Redondo: gemma.martinez@ibe.upf-csic.es
- Francisco Miguel Pérez Canales: fmpercan@upo.es
- Francisco J. Ruiz Mota: fraruimot@alum.us.es
