CAFA evaluation frequency

Every 3 years — How often the field's main protein function benchmark scores competing methods

LAFA evaluation frequency

~Every 8 weeks — How often LAFA updates rankings as new UniProt-GOA data releases arrive

7,401 proteins — Proteins with stable sequences and new experimental annotations between Sep 2025 and Mar 2026

SwissProt coverage (target)

~550,000 sequences — Full scale of reviewed SwissProt database LAFA aims to benchmark against

TransFew & FunBind training cutoff

Nov 2022 — Illustrating the performance decay risk for models trained on older annotation data

LAFA: Continuous Protein Function AI Benchmarking

Methods currently hosted

7 total — 3 state-of-the-art predictors (TransFew, FunBind, DeepGOPlus) plus 4 interpretive baselines

Every drug target, every pathogen surface protein, every enzyme in your gut microbiome — each one does something specific, and knowing what it does can be the difference between a promising therapy and a dead end. But of the roughly 250 million protein sequences now catalogued across life on Earth, only a tiny fraction have had their functions confirmed through laboratory experiments. The rest sit in databases, annotated by computational guesswork of wildly varying quality. Improving that guesswork — and, crucially, knowing how much it has improved — is one of computational biology's most consequential open problems.

Here's the uncomfortable truth about how the field measures its own progress: it holds a competition every three years, scores everyone's methods once, and then largely moves on. That competition, the Critical Assessment of protein Function Annotation (CAFA), is genuinely valuable. It has driven real advances. But between rounds, the landscape shifts. New experimental data rolls in. Models get retrained. Methods fall out of use or become impossible to install. And nobody is watching. A new paper from Phan et al. (2026) introduces LAFA — the Longitudinal Assessment of Protein Function Annotation Models — a persistent benchmarking server designed to replace that three-year silence with continuous, reproducible evaluation.

The Science

To understand why LAFA matters, you need to understand what CAFA actually measures and where it breaks down.

Protein function, in the computational biology sense, is represented using the Gene Ontology (GO) — a structured, controlled vocabulary that describes what proteins do at the molecular level (molecular function), what biological processes they participate in (biological process), and where in the cell they operate (cellular component). Think of it as a universal filing system for biological knowledge. CAFA works by releasing a set of target protein sequences, collecting predictions from competing teams about which GO terms apply to each protein, and then — after a four-to-six month annotation accumulation period — comparing those predictions against newly validated experimental annotations. The method that best anticipates what biocurators will confirm wins.

The design is clever: because the ground truth is gathered after predictions are submitted, there's no way for a method to simply memorize the answer. But the system has a structural flaw that Phan et al. (2026) call the Open World assumption problem. Protein annotation is not a closed book. A method might correctly predict that a protein has a particular function, but if that function hasn't been experimentally confirmed and curated into a database by the end of the accumulation window, the prediction is effectively penalized. The method looks wrong when it was, in fact, right — just ahead of the evidence. Over three years, a great deal of evidence accumulates. Methods evaluated at a single snapshot may be systematically misevaluated, and nobody can easily check.

LAFA, available at functionbench.net, is designed to fix this by making evaluation an ongoing process rather than a periodic event. The platform synchronizes with UniProt-GOA — the gold-standard database of protein annotations — approximately every eight weeks. Each new data release creates a new time point ($t_1$), and the window between any two consecutive time points becomes a new evaluation window: a precise, dated period during which new experimental annotations accumulate and serve as fresh ground truth.

The key technical innovation enabling this is containerization. Each participating method is packaged into a self-contained software container, essentially a sealed computational environment that includes all the code, dependencies, and configuration the method needs to run. Once a container is sealed at a given time point $t_0$, it cannot access new data, update itself, or peek at annotations that had not yet been published. This design prevents the subtle data leakage that plagues many informal comparisons and ensures that predictions made in September 2025 can be fairly evaluated against annotations that accumulated through March 2026.

Figure 1: The LAFA timeline. At each data release time point (here, starting Sep 2025), data are collected and predictions are generated from the hosted methods. A time window is defined between any two time points (e.g., Sep 2025 - Nov 2025), and the evaluation for that window covers the predictions that existed at the earlier time point (here, only methods A and B are included in the Sep 2025 - Nov 2025 evaluation). The evaluation of each time window is accessible on the LAFA website, with an option to compare any two windows. If a participating team retrains its model with updated training data and submits a new container (here, method A is retrained to A1), we will generate new predictions and add them to the next evaluation round. The retrained method can then be compared with its previous version to interpret the effect of training data on evaluation. Source: An Phan, Yanli Wang

The back-end runs on a high-performance computing cluster and automates two main workflows: a time point build that constructs a clean snapshot of all available protein sequences and annotations at each release, and a time window evaluation that scores each method's predictions against the annotations that accumulated between two time points. Performance is reported as an $F_{1}$ score, the harmonic mean of precision and recall, taken at the score threshold that maximizes it (the quantity usually called $F_{\max}$ in CAFA) and computed with the well-validated CAFA-evaluator package (Piovesan et al., 2024).
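To make the metric concrete, here is a minimal sketch of a protein-centric maximum $F_1$ computed over a grid of score thresholds. It is not the CAFA-evaluator code: it skips GO hierarchy propagation and any term weighting, and the function name `f_max`, the threshold grid, and the placeholder term IDs are all assumptions made for illustration.

```python
from typing import Dict, Set


def f_max(predictions: Dict[str, Dict[str, float]],
          truth: Dict[str, Set[str]],
          steps: int = 100) -> float:
    """Best F1 over thresholds; assumes every protein in `truth` has >= 1 term."""
    best = 0.0
    for i in range(1, steps + 1):
        tau = i / steps
        precisions = []      # one entry per protein with a prediction at tau
        recall_sum = 0.0     # summed over all benchmark proteins
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, score in scored.items() if score >= tau}
            hits = len(predicted & true_terms)
            if predicted:
                precisions.append(hits / len(predicted))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue  # no method output survives this threshold
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Hypothetical scores and ground truth; "GO:A" etc. are placeholders, not real IDs.
example_predictions = {
    "P1": {"GO:A": 0.9, "GO:B": 0.4},
    "P2": {"GO:A": 0.6, "GO:C": 0.2},
}
example_truth = {"P1": {"GO:A"}, "P2": {"GO:C"}}
example_score = f_max(example_predictions, example_truth)
```

Averaging precision only over proteins that received a prediction, while averaging recall over every benchmark protein, follows the protein-centric convention used in CAFA-style evaluation; a method that abstains on hard proteins keeps its precision but pays in recall.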

What They Found

For LAFA's testbed stage, the team focused on a carefully curated set of 7,401 proteins — specifically, proteins whose amino acid sequences remained unchanged across the study window and which accumulated new experimental annotations between September 2025 and March 2026. This is smaller than the full SwissProt database (roughly 550,000 to 580,000 reviewed sequences), but it provides a controlled, fair playing field for the initial comparison.
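The selection rule described here (unchanged sequence, plus at least one newly gained experimental annotation) can be sketched as a filter over two snapshots. The data layout, the function names, and the exact set of evidence codes are assumptions for illustration; the codes listed are ones the Gene Ontology classifies as experimental, but which codes LAFA actually accepts is not stated in this article.

```python
from typing import Dict, Set, Tuple

Annotation = Tuple[str, str]  # (GO term, evidence code)

# Assumption: a subset of GO experimental evidence codes; LAFA's list may differ.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def experimental_terms(annotations: Set[Annotation]) -> Set[str]:
    """Keep only the GO terms backed by an experimental evidence code."""
    return {term for term, code in annotations if code in EXPERIMENTAL_CODES}


def select_targets(seq_t0: Dict[str, str],
                   seq_t1: Dict[str, str],
                   ann_t0: Dict[str, Set[Annotation]],
                   ann_t1: Dict[str, Set[Annotation]]) -> Dict[str, Set[str]]:
    """Map each qualifying protein to the experimental terms it gained."""
    targets = {}
    for protein, sequence in seq_t0.items():
        if seq_t1.get(protein) != sequence:
            continue  # sequence changed or entry disappeared: not a stable target
        before = experimental_terms(ann_t0.get(protein, set()))
        after = experimental_terms(ann_t1.get(protein, set()))
        gained = after - before
        if gained:
            targets[protein] = gained
    return targets


# Hypothetical snapshots with placeholder identifiers.
seqs_early = {"P1": "MKT", "P2": "MAA", "P3": "MGG"}
seqs_late = {"P1": "MKT", "P2": "MAV", "P3": "MGG"}
anns_early = {"P1": {("GO:A", "IEA")}, "P3": {("GO:B", "IDA")}}
anns_late = {
    "P1": {("GO:A", "IEA"), ("GO:A", "IDA")},
    "P2": {("GO:C", "IMP")},
    "P3": {("GO:B", "IDA")},
}
example_targets = select_targets(seqs_early, seqs_late, anns_early, anns_late)
```

In this toy case only P1 qualifies: P2 gained an experimental annotation but its sequence changed, and P3 kept its sequence but gained nothing new.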

Three state-of-the-art methods are currently hosted, alongside four baselines that serve as interpretive anchors.

TransFew (Boadu et al., 2024) combines learned representations of protein sequences with semantic representations of GO terms themselves — essentially teaching the model to understand the meaning of function labels, not just their statistical co-occurrence with sequence features. It was trained on annotation data from November 2022.

FunBind (Boadu et al., 2025) takes a multimodal approach, fusing protein sequences, textual descriptions, domain annotations, structural features, and GO terms through a large pre-trained foundational model. The version hosted on LAFA uses only the sequence modality, representing a deliberately conservative deployment.

DeepGOPlus (Kulmanov et al., 2020) combines deep convolutional neural networks applied directly to protein sequences with BLAST-based similarity search — a powerful hybrid that has been a strong performer in past CAFA rounds. Notably, its containerized version was trained and tested on data from mid-2025, making it the most recently trained of the three methods.

The four baselines range from the trivially simple (the Naive baseline, which assigns every GO term to every protein weighted by how common that term is) to the genuinely informative. The Non-Experimental GOA baseline is particularly interesting: it predicts that any GO term currently annotated to a protein — even without experimental support — will eventually receive experimental validation. This measures how well computational and literature-based annotations anticipate future experimental confirmation, which turns out to be a non-trivial bar to clear. The BLAST baseline transfers annotations from sequence-similar proteins, while the Embedding similarity baseline does the same using learned protein language model embeddings instead of raw sequence identity.
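As an illustration of how little the Naive baseline knows, the sketch below scores every GO term by its frequency among annotated training proteins and hands the same scores to every target. The function name and input layout are hypothetical, and LAFA's own implementation may differ in details such as how frequencies are counted after ontology propagation.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def naive_baseline(training_annotations: Dict[str, Set[str]],
                   targets: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Assign each target the same term scores: the term's training frequency."""
    n_proteins = len(training_annotations)
    counts = Counter(
        term for terms in training_annotations.values() for term in terms
    )
    scores = {term: count / n_proteins for term, count in counts.items()}
    # Every target gets its own copy of the identical score table.
    return {protein: dict(scores) for protein in targets}


# Hypothetical training annotations with placeholder term IDs.
toy_training = {
    "T1": {"GO:A"},
    "T2": {"GO:A", "GO:B"},
    "T3": {"GO:A"},
    "T4": set(),
}
toy_predictions = naive_baseline(toy_training, ["P1", "P2"])
```

Because the output ignores the target sequence entirely, any method that cannot beat these scores has learned nothing protein-specific, which is what makes the baseline a useful interpretive floor.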

LAFA Data Release Timeline

UniProt-GOA release dates used in LAFA's four evaluation time points, spanning September 2025 to March 2026.

Release	Time point
Sep 2025	1
Nov 2025	2
Dec 2025	3
Mar 2026	4

The evaluation spans four time points — September 2025, November 2025, December 2025, and March 2026 — creating a timeline of overlapping evaluation windows that capture how performance evolves as the annotation ground truth grows richer.
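A short sketch of how the windows multiply, assuming (as the Figure 1 caption states) that a window can be defined between any earlier and any later time point. The variable and function names are illustrative only.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def evaluation_windows(points: Sequence[str]) -> List[Tuple[str, str]]:
    """All (earlier, later) pairs from a chronologically ordered list."""
    return list(combinations(points, 2))


time_points = ["Sep 2025", "Nov 2025", "Dec 2025", "Mar 2026"]
windows = evaluation_windows(time_points)

# Windows between neighbouring releases only, for comparison.
adjacent_windows = list(zip(time_points, time_points[1:]))
```

Four time points give six windows in total, three of which join neighbouring releases; the remaining three span longer periods and overlap the shorter ones.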

Methods Hosted on LAFA: Training Data Recency

Training data vintage (year) for each of the three main prediction methods currently hosted on LAFA, illustrating how outdated training sets may affect performance over time.

Method	Training data year
TransFew	2022
FunBind	2022
DeepGOPlus	2025

One of LAFA's most distinctive contributions is making visible something the community has talked about but rarely measured directly: performance decay. Methods trained on older annotation data may perform well when evaluated on a similar vintage of ground truth, but their advantage can shrink or disappear as the biological knowledge base moves on. When a method developer retrains their model on updated data and submits a new container version, LAFA can run both old and new versions against identical evaluation windows — providing a controlled experiment on the value of keeping training data current. No other platform offers this comparison out of the box.

SwissProt Test Set Size Over Recent Years

Approximate number of reviewed SwissProt protein sequences available as a stable test set for method comparison, reflecting steady growth in the curated proteome.

Bound (recent years)	Reviewed SwissProt sequences
Lower	550,000
Upper	580,000

Why This Changes Things

The implications of LAFA reach further than a tidier leaderboard. They touch on how the field defines progress itself.

Consider the problem of reproducibility. Most protein function prediction methods published in academic papers are difficult or impossible to run by anyone other than their creators. Dependencies change. Web servers go offline. Code that worked in 2021 may not compile in 2025. LAFA's containerization mandate addresses this directly: to be hosted on LAFA, a method must be packaged in a way that any reasonably equipped computing system can run it, years from now, on identical data. This is a soft standard-setting exercise with real bite — if your method can't be containerized, it can't participate.

There's also a deeper epistemological issue at stake. Science advances by accumulating evidence, and protein biology is no exception. The database of experimentally validated protein functions grows continuously, driven by thousands of labs publishing results on specific proteins, organisms, and diseases. A method evaluated in 2022 is being measured against a thinner slice of biological reality than the same method evaluated in 2025. LAFA makes this explicit by timestamping every evaluation window and allowing direct comparison across time. Researchers can now ask not just "which method is best?" but "which method ages best?" — a question with serious practical implications for anyone deploying these tools in drug discovery or genome annotation pipelines.

The platform also directly confronts the Open World assumption problem. Because LAFA accumulates ground truth over multiple time windows, a prediction that was correct but unvalidated in September 2025 may well be vindicated by March 2026. Longer accumulation windows produce fuller, fairer evaluations. The ability to compare a four-month window against an eight-month window — planned as a core LAFA feature — will let the community quantify exactly how much that extra time of annotation accumulation changes the apparent ranking of methods.
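A toy calculation shows the effect. The predictions are frozen, and only the ground truth changes between a shorter and a longer accumulation window. All term IDs and sets below are hypothetical placeholders, not LAFA results.

```python
from typing import Set


def precision(predicted: Set[str], confirmed: Set[str]) -> float:
    """Fraction of predicted terms confirmed by the current ground truth."""
    return len(predicted & confirmed) / len(predicted) if predicted else 0.0


# Hypothetical predictions frozen at the earlier time point.
predicted = {"GO:A", "GO:B", "GO:C"}

# Hypothetical ground truth after a shorter and a longer accumulation window.
confirmed_short = {"GO:A"}
confirmed_long = {"GO:A", "GO:B"}

precision_short = precision(predicted, confirmed_short)
precision_long = precision(predicted, confirmed_long)
gain = precision_long - precision_short
```

The method's output is identical in both cases; its apparent precision rises from one third to two thirds only because a second prediction was experimentally confirmed in the meantime.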

For context: CAFA5, the most recent major CAFA round, attracted hundreds of participating teams and generated substantial media attention in the bioinformatics community (Piovesan et al., 2024). But between CAFA rounds, progress is invisible to anyone who isn't directly plugged into individual research groups. LAFA converts that invisible progress into a public, timestamped record. It is, in effect, a continuously updated answer to the question: are we getting better at this?

The platform is also designed to be extensible. Phan et al. (2026) explicitly invite community-contributed evaluation modules — new metrics, new visualization tools, new ways of slicing the data. This is modeled on the nf-core ecosystem (Ewels et al., 2020), a community-curated library of bioinformatics pipelines that has become a de facto standard in genomics. If LAFA achieves similar adoption in function prediction, it could eventually define what "state of the art" means in the field.

What's Next

LAFA is candid about its current limitations. The testbed evaluation covers 7,401 proteins — a fraction of the 550,000+ sequences in reviewed SwissProt. Scaling to the full database requires substantially more compute, and the team acknowledges this as their most pressing infrastructural challenge. Until that scaling is achieved, the rankings produced by LAFA are informative but not yet definitive.

There are also annotation-level challenges. GO terms themselves change over time: terms can be declared obsolete, merged with other terms, or discovered to have been incorrectly assigned in the first place (Gene Ontology Consortium, 2026; Schnoes et al., 2009). A method that correctly predicted a GO term that was later deprecated looks wrong by no fault of its own. The team plans to add GO-slim-based evaluation — essentially scoring predictions against a simplified, more stable version of the GO hierarchy — to provide a view of performance that is less sensitive to these ontology fluctuations.
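The idea behind a slim-based view can be sketched by walking each fine-grained term up the ontology until it reaches a term in the slim subset, then comparing predictions and ground truth at that coarser level. The toy ontology, the identifiers, and the `to_slim` helper are hypothetical; this is not how LAFA or any GO tooling necessarily implements the mapping.

```python
from typing import Dict, Set


def to_slim(term: str, parents: Dict[str, Set[str]], slim: Set[str]) -> Set[str]:
    """Nearest slim ancestors of `term` (or the term itself if it is slim)."""
    if term in slim:
        return {term}
    found: Set[str] = set()
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in slim:
            found.add(node)  # stop climbing along this path
        else:
            stack.extend(parents.get(node, ()))
    return found


# Hypothetical toy ontology: child -> set of parent terms.
toy_parents = {
    "leaf:1": {"mid:1"},
    "leaf:2": {"mid:1"},
    "mid:1": {"slim:X"},
    "leaf:3": {"slim:Y"},
    "slim:X": {"root"},
    "slim:Y": {"root"},
}
toy_slim = {"slim:X", "slim:Y"}

predicted_slim = to_slim("leaf:1", toy_parents, toy_slim)
true_slim = to_slim("leaf:2", toy_parents, toy_slim)
```

Here the predicted and true terms differ at the fine-grained level but agree once both are mapped to the slim, which is why a slim-based score is less sensitive to churn among specific terms.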

A third planned improvement involves collaboration with UniProt-GOA to establish a private holdout set: a collection of newly curated annotations that is kept secret until evaluation time. This would allow immediate scoring of newly submitted methods without waiting for a full eight-week release cycle, while preserving the time-delayed principle that prevents data leakage.

Perhaps the most forward-looking question LAFA raises is not technical but sociological: will the community actually containerize their methods and submit them? The history of benchmarking in computational biology is littered with platforms that launched with fanfare and died from neglect. LAFA's design choices — open-source, community-extensible, GitHub-hosted — seem aimed at avoiding that fate. But adoption requires a cultural shift. Methods papers need to incentivize reproducibility, not just novelty. Journals and funders could help by making LAFA participation a condition of publication for protein function prediction work, the same way structural biologists are expected to deposit coordinates in the Protein Data Bank.

What LAFA offers, at its core, is accountability. Not the punitive kind — but the scientific kind. The kind that lets the field look back in five years and trace exactly when and why a new approach pulled ahead, what training data it used, and whether that advantage held up as biology's knowledge base kept growing. In a field where understanding protein function is foundational to everything from antibiotic development to understanding rare disease, that accountability is not a luxury. It is overdue.

The Protein Function Benchmark That Never Sleeps: Meet LAFA
