The Algorithm That Designs Better mRNA Vaccines in 7 Minutes

Seven minutes. That is how long it takes the algorithm developed by Fornace, Wang, and Lindsey to design an mRNA sequence for a protein five thousand amino acids long — fully optimized for structural stability, using the most accurate thermodynamic model available (Fornace et al., 2026). For context, a 5,000-amino-acid protein has roughly possible mRNA encodings. The universe is about nanoseconds old. The combinatorial gulf between those numbers is precisely the problem this paper sets out to cross.
mRNA vaccines work by instructing your cells to produce a viral protein, triggering an immune response. The mRNA-1273 vaccine that protected millions during the COVID-19 pandemic, the BNT162b2 shot developed by Pfizer and BioNTech — these were designed, in part, by making careful choices about which genetic letters to use. But "careful choices" has historically meant heuristics, simplified models, and educated guesses. This paper proposes something more rigorous: a mathematically exact method that, for the first time, performs global sequence optimization against a physically accurate model of how RNA folds.
The Science
Every protein is built from a sequence of amino acids. And every amino acid is encoded by a codon — a triplet of RNA bases (A, U, G, or C). Because the genetic code is redundant, most amino acids can be spelled multiple ways. Leucine, for instance, can be encoded by six different codons. This redundancy means that for any given protein, there are exponentially many mRNA sequences that will produce exactly the same chain of amino acids.
This isn't just a theoretical curiosity. The choice of codon spelling has real biological consequences. An mRNA strand with a stable, well-folded secondary structure — meaning its bases pair up into tightly organized loops and stems, like a crumpled paper clip — is more resistant to degradation by cellular enzymes called RNases, and tends to be translated more efficiently into protein. Secondary structure stability is measured as a free energy (specifically, Gibbs free energy): the more negative the value, the more thermodynamically stable the structure. Designing an mRNA sequence therefore means navigating a combinatorial labyrinth to find the spelling that minimizes this free energy.
The challenge, computationally, is that you cannot just test every spelling. You cannot even test a tiny fraction of them. Previous approaches fell into three camps: dynamic programming methods that used simplified, inaccurate energy models; local search methods that mutated sequences one codon at a time without guarantees of convergence; and deep learning approaches that learned codon patterns from genomic data but operated as black boxes with limited physical interpretability.
Fornace, Wang, and Lindsey, working at Lawrence Berkeley National Laboratory and the University of California, Berkeley, take a different route (Fornace et al., 2026). They build on a recently developed tensor-based formulation of RNA secondary structure thermodynamics — a mathematical framework where the physical properties of every possible fold are encoded as a network of multi-dimensional arrays called tensors. The key insight is that the constraint "this sequence must encode a specific protein" can itself be expressed as a tensor train — a compact chain of matrices whose product encodes which RNA sequences are legal. By fusing these two tensor structures, the algorithm can simultaneously integrate over all possible sequences and all possible folds in a single coherent computation.
The result is an algorithm that computes the partition function of the entire sequence-structure space — a quantity that captures, in one number, the collective thermodynamic weight of every valid mRNA sequence folded into every valid secondary structure. From that, sequences can be sampled exactly from a Boltzmann distribution — meaning the probability of drawing any given sequence is precisely proportional to , where is the free energy of that sequence. This is not an approximation. It is exact statistical mechanics applied to a combinatorially vast biological problem.
What They Found
The algorithm delivers on several fronts, and the benchmarks are striking.
Cost of Sequence Design vs. Sequence Analysis
Ratio of wall-clock time to compute the full design partition function (qψ) versus analyzing a single fixed sequence of equal length. Values near 1× mean design costs the same as analysis; the paper reports a ceiling of ≲4× across all tested lengths.
| Label | Value |
|---|---|
| 300 nt | 1.5 |
| 600 nt | 2 |
| 1500 nt | 2.8 |
| 3000 nt | 3.3 |
| 6000 nt | 3.8 |
| 15000 nt | 3.9 |
First, consider the computational cost. Analyzing a single fixed RNA sequence — computing its partition function over secondary structures — already requires time, where is the length of the sequence in nucleotides. The design problem, which must sum over all valid sequences simultaneously, might be expected to be far more expensive. In practice, Fornace et al. find that the cost overhead of design over analysis is less than for all tested problem sizes. Designing a sequence incurs roughly the same computational burden as analyzing four fixed sequences. That is a remarkably tight bound for a problem of this combinatorial scale.
Second, GPU acceleration changes the practical complexity dramatically. While the theoretical complexity is , GPU-parallelized runs empirically approach for many problem sizes — meaning a 5,000-amino-acid protein (15,000 nucleotides) can be designed in approximately 7 minutes.
GPU Speedup Over Serial CPU for Sequence Design
Speedup factor of GPU-accelerated computation over serial CPU for computing the design partition function qψ, as a function of sequence length. Larger proteins benefit dramatically from GPU parallelization.
| Label | Value |
|---|---|
| 300 nt | 2 |
| 600 nt | 5 |
| 1500 nt | 15 |
| 3000 nt | 30 |
| 6000 nt | 50 |
| 15000 nt | 80 |
Serial CPU computation is substantially slower; parallelization across CPUs and GPUs provides speedups of an order of magnitude or more.
Third, the quality of the designed sequences is demonstrably better than existing alternatives. When the researchers benchmark their method against four example systems, the sequences sampled from their distribution have substantially lower free energies (meaning more stable structures) than sequences drawn uniformly at random from the space of valid codon sequences, . They also compare against sequences optimized under a simplified additive free energy model — the kind of approximation used by previous dynamic programming methods. The simplified model's "optimal" sequences look worse under the true, detailed energy model. The point is not merely that this algorithm is faster; it is that simpler models were systematically misleading.
Importantly, the paper argues that sampling is more useful than minimization. Finding the single minimum-free-energy sequence — the "best" spelling — is tempting, but can yield pathological sequences that are thermodynamically extreme and potentially brittle in practice. Sampling from the Boltzmann distribution instead produces an ensemble of sequences that balance stability with diversity, giving downstream experimentalists a rich portfolio of candidates to filter and test. The researchers demonstrate exact computation of marginal statistics across this ensemble: for any position in the protein, you can compute exactly which codon is most likely to appear, and with what probability. You can compute the expected base pairing probability at every nucleotide site. These are not approximations but exact consequences of the underlying probability model.
Fraction of Unpaired Bases by Design Method
Equilibrium fraction of unpaired bases f(ϕ) for sequences generated by different methods. Lower values indicate more stable, better-folded structures. Results averaged across four benchmark systems from the paper.
| Label | Value |
|---|---|
| Uniform random (𝒰) | 0.52 |
| Simplified model optimum (ϕ⁰min) | 0.38 |
| Simplified model sample (𝒟⁰) | 0.44 |
| Full model sample (𝒟) | 0.33 |
| Full model optimum (ϕmin) | 0.28 |
The framework also natively supports soft constraints — design objectives layered on top of the hard requirement of amino acid coding. The codon adaptation index (CAI), for example, rewards codons that appear frequently in a host organism's genome, on the theory that the cell's translational machinery will handle them more fluently. Codon pair bias (CPB) rewards certain adjacent codon combinations. Both can be incorporated as tunable free energy bonuses in the tensor train formulation, without modifying the core algorithm at all. The researchers demonstrate codon frequency matching via a fixed-point iteration method that converges reliably within a small number of iterations.
Why This Changes Things
The significance here runs deeper than computational efficiency. For years, a frustrating gap has existed between what physicists know about RNA thermodynamics and what computational biologists could actually use in sequence design. The most accurate free energy models — the "nearest-neighbor" thermodynamic parameters developed by the Turner lab and others, encoded in software like NUPACK and Vienna RNA — are routinely used to analyze fixed sequences. But designing sequences against those models was considered computationally intractable at scale. The field compensated with simpler approximations: additive base-pair bonuses, codon-by-codon greedy choices, or machine learning surrogates trained on secondary structure outputs.
This paper closes that gap. The tensor-based framework of Fornace et al. makes the fully detailed energy model tractable for design — not by approximating the model, but by restructuring the computation so that sequences and structures are explored together in one integrated dynamic program. The mathematical machinery is elegant: the codon constraints are expressed as a tensor train (a chain of matrix multiplications whose trace counts valid sequences), which can be fused with the structural tensor network without changing the topology of the underlying dynamic programming recursion. The dimensionality of the blocks changes, but the algorithm's logical skeleton stays the same.
The implications for mRNA medicine are real. mRNA therapeutics are no longer solely vaccines; they are increasingly proposed for cancer immunotherapy, rare genetic diseases, and protein replacement therapy. In each case, the design of the mRNA sequence is a foundational step that affects how much protein is produced, for how long, and how safely. Methods that can rigorously optimize this step — not approximately, but exactly — have direct translational value. They also produce interpretable outputs: the marginal codon probabilities, base pairing maps, and free energy landscapes are scientifically meaningful quantities, not black-box scores.
The approach also generalizes. While the paper focuses on single-stranded mRNA design for protein-coding applications, the authors note that the framework extends to non-coding regions and, in principle, to broader classes of sequence-based objectives. Any objective that can be expressed as a tensor train — a broad class — can be folded into the same algorithm. This makes it a platform, not just a solution to one specific problem.
What's Next
The authors are candid about limitations. The current algorithm handles unpseudoknotted secondary structures — folds where base pairs don't cross each other when drawn on a flat diagram. This is a standard and well-motivated assumption (pseudoknots are relatively rare and computationally intractable in general), but it means the algorithm cannot capture the full richness of RNA three-dimensional structure. Extending to pseudoknots, or to multi-strand systems, remains an open problem.
The paper also focuses on the coding region of the mRNA. Real therapeutic mRNAs include untranslated regions, cap structures, and poly-A tails that influence stability and expression. Incorporating these non-coding regions is noted as straightforward in principle but is left for future work.
There is also the question of experimental validation. The benchmarks in this paper are computational: free energy comparisons across methods, convergence of fixed-point iterations, timing measurements. The proof that these computationally optimized sequences actually perform better in cells — producing more protein, surviving longer, eliciting stronger immune responses — requires wet-lab experiments that have not yet been reported. That validation gap is not a criticism of the paper's contributions; it is the natural next step. The researchers have built a rigorous engine. What it produces in a cell remains to be measured.
More broadly, this work sits at the intersection of two accelerating trends. First, mRNA medicine is expanding rapidly: the success of COVID-19 mRNA vaccines has unlocked massive investment in the platform, and dozens of clinical trials are underway for mRNA-based cancer vaccines, flu shots, and rare disease therapies. Second, GPU-accelerated scientific computing is making previously intractable physical simulations routine. The confluence of those trends — better hardware, better math, and a burning medical need — makes the timing of this paper feel significant.
The algorithm Fornace, Wang, and Lindsey have developed is not the final word on mRNA sequence design. But it is the first time the field has had a tool that is simultaneously exact, physically rigorous, computationally tractable at therapeutic scales, and statistically interpretable. In a domain where every percentage point of stability can mean the difference between a viable drug and a failed one, that combination matters. The black box era of codon optimization may be drawing to a close.