Why DNA Has Four Letters: The Physics of Life's Optimal Alphabet

There is a number that sits at the heart of all heredity: four. Four DNA bases — adenine, cytosine, guanine, thymine — encode every organism that has ever lived. Biologists have long understood what these bases do. But a more fundamental question has lingered quietly in the background: why four? Why not two, or six, or twenty?

A new theoretical study from the Centro Atómico Bariloche in Argentina takes that question seriously — and answers it with the mathematics of information theory (Hernández, 2026). The result is surprising. DNA's four-letter alphabet is not optimized for efficient information transmission. Instead, it appears to be optimized for something subtler and more urgent: preventing life from accidentally writing itself.

The Science

The paper, by physicist Damián G. Hernández, builds on a recently proposed "template copying ensemble" model of polymer replication. The model is intentionally coarse-grained — meaning it skips over the molecular machinery of individual polymerase enzymes and focuses instead on the thermodynamic outcomes: how likely is a given copy to match its template, and what energy does that copying cost?

The key move Hernández makes is reconceptualizing replication as a communication channel in the sense that Claude Shannon defined in 1948. Shannon's framework — originally developed for telephone lines and radio transmissions — asks: how much information can pass through a noisy system? In this reframing, the template is the transmitter, the copy is the receiver, and replication errors are the noise. The quantity that measures how faithfully the message gets through is called mutual information $I (T; S)$ — roughly, how much knowing the copy tells you about the template that produced it.
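To make the idea of mutual information concrete, here is a minimal Python sketch that computes $I(T;S)$ from a joint probability table using the standard Shannon definition. The toy channel in `symmetric_joint` (uniform templates, errors spread evenly over the wrong monomers) is an illustrative assumption, not the paper's ensemble, and both function names are hypothetical.

```python
import math

def mutual_information(joint):
    """Mutual information I(T;S) in nats from a joint probability table.

    joint[t][s] is the probability that template symbol t is paired with
    copy symbol s; the entries are expected to sum to 1.
    """
    p_t = [sum(row) for row in joint]          # marginal over templates
    p_s = [sum(col) for col in zip(*joint)]    # marginal over copies
    info = 0.0
    for t, row in enumerate(joint):
        for s, p in enumerate(row):
            if p > 0.0:                        # 0 * log 0 is taken as 0
                info += p * math.log(p / (p_t[t] * p_s[s]))
    return info

def symmetric_joint(m, error):
    """Toy channel (an assumption): uniform templates, errors spread evenly."""
    wrong = error / (m - 1)
    return [[((1.0 - error) if s == t else wrong) / m for s in range(m)]
            for t in range(m)]

if __name__ == "__main__":
    for error in (0.0, 0.02, 0.75):
        print(error, mutual_information(symmetric_joint(4, error)))
```

With no errors the copy determines the template exactly and the result is $\log m$; when the error fraction reaches $(m - 1)/m$ the copy is statistically independent of the template and the result drops to zero.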

The model has four key parameters: $m$, the number of distinct monomer types (four for DNA); $a$, the copying specificity (how strongly the system prefers correct base matches); $Δ μ_{r}$, the per-monomer free energy of assembly (the energy cost of building a polymer from scratch); and $Δ μ_{F}$, the chemical potential provided by fuel molecules like ATP (the energy that drives the copying process forward). By calculating the mutual information analytically in the limit of very long polymer chains, Hernández maps out exactly when replication transmits information faithfully and how efficiently it does so.

Figure 1: Many instances of the template copying ensemble as an information channel. Templates $T$ are sampled from an initial distribution, and then each template goes through a copying mechanism, producing a population of copies $S_{j}$. The two pathways involved are template assembly and spontaneous disassembly. In this scheme, there are two types of monomers. Source: Damián G. Hernández

Key Facts

14–22 kBT: DNA assembly energy (observed). The actual per-monomer free energy of DNA assembly, accounting for bond costs, base stacking, and nucleotide concentrations.

~1.4 kBT: Energy needed for m=4 optimum. The assembly energy that would make a four-base alphabet the information-to-energy optimum, roughly 10x lower than observed.

~10%: Information loss at 2% error rate. Due to the nonlinear relationship between errors and mutual information, a 2% error fraction eliminates nearly 10% of transmittable information.

m* ≈ e^(Δμr): Optimal alphabet size formula. The mathematically optimal number of monomer types for information-per-unit-energy efficiency, determined almost entirely by assembly energy.

D(x_a ∥ x_r): Information per monomer (accurate regime). In the accurate copying regime, mutual information depends solely on copying specificity a, not on how much fuel is supplied.

What They Found

The first major result recovers something familiar in a new form. The model already had a known "phase diagram" — a map of conditions under which copying is accurate versus random. Hernández shows that this phase diagram is identical to the information phase diagram: the system transmits nonzero information per monomer only when the fuel energy exceeds a specific threshold,

$Δ μ_{F} > \max(\log m, Δ μ_{r}) - \log[1 + e^{-a} (m - 1)]$

Below that threshold, copies are essentially random, and $I(T; S) / L = 0$. Above it, information flows, and crucially, how much information flows depends solely on the copying specificity $a$, not on how much extra fuel you pour in (see Figure 2). Pumping in more energy past the threshold doesn't make copies more faithful; only improving the biochemical specificity of the copying mechanism does that.

Figure 2: Phase diagram for mutual information. Left panel: case $Δ μ_{r} < \log m$. Under this condition, the information is nonzero as long as $Δ μ_{F} > Δ μ_{F}^{*}$, corresponding to the orange region, while in the light-blue region random copies dominate, making $I/L = 0$. The amount of information per monomer $I/L$ in this region depends only on $a$ (orange curve), increasing from zero to $\log m$ for large values of $a$. Right panel: case $Δ μ_{r} > \log m$. Under this condition, the region where $I/L > 0$ becomes smaller, and it is given by $Δ μ_{F} > Δ μ_{F}^{*} + Δ μ_{r} - \log m$. In the white region, the population of copies vanishes. The dependence of $I/L$ on $a$ remains the same. Source: Damián G. Hernández
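The threshold condition can be sketched in a few lines of Python. Energies are taken in units of $k_{B} T$ so that they can be compared directly with $\log m$. The expression used for the accurate error fraction $x_{a}$ is an assumption, inferred by requiring the threshold above to agree with the $\log(1 - x_{a})$ term in the article's efficiency formula; the function names are hypothetical.

```python
import math

def accurate_error_fraction(m, a):
    """Assumed form of x_a: each of the m - 1 wrong monomers has weight e^{-a}."""
    w = (m - 1) * math.exp(-a)
    return w / (1.0 + w)

def fuel_threshold(m, a, delta_mu_r):
    """Minimum fuel potential (in k_B T) for nonzero information per monomer."""
    return max(math.log(m), delta_mu_r) - math.log(1.0 + math.exp(-a) * (m - 1))

def transmits_information(m, a, delta_mu_r, delta_mu_f):
    """True when the fuel potential lies above the threshold."""
    return delta_mu_f > fuel_threshold(m, a, delta_mu_r)

if __name__ == "__main__":
    # Illustrative parameter values only, not taken from the paper.
    m, a, delta_mu_r = 4, 6.0, 14.0
    print("threshold:", fuel_threshold(m, a, delta_mu_r))
    print("x_a:", accurate_error_fraction(m, a))
    for delta_mu_f in (5.0, 15.0, 30.0):
        print(delta_mu_f, transmits_information(m, a, delta_mu_r, delta_mu_f))
```

Note that `accurate_error_fraction` takes no fuel argument at all: in this sketch, as in the article's description, extra fuel above the threshold changes nothing about fidelity.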

The information per monomer in the accurate regime is given by the Kullback-Leibler divergence (a measure of how different two probability distributions are) between the accurate error fraction $x_{a}$ and the random error fraction $x_{r} = (m - 1) / m$:

$I(T; S) / L = D(x_{a} ∥ x_{r}) = \log m - x_{a} \log(m - 1) - H(x_{a})$

where $H (x_{a})$ is the binary entropy of the error fraction. This formula looks clean, but it hides something alarming about the relationship between errors and information loss.

The nonlinearity problem. The derivative of $D (x_{a} ∥ x_{r})$ with respect to the error fraction $x_{a}$ diverges near zero errors. This means that even tiny departures from perfect fidelity cause disproportionately large drops in transmitted information. A replication system with a 2% error rate might sound impressively accurate — and by most intuitive measures, it is. But the mathematics shows it can have lost nearly 10% of its maximum information-carrying capacity. The lesson is uncomfortable: "accurate enough" in biological terms may still be surprisingly leaky in information-theoretic terms.
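The nonlinearity is easy to check numerically. The sketch below evaluates the information per monomer from the formula $\log m - x_{a} \log(m - 1) - H(x_{a})$, along with its slope with respect to $x_{a}$, obtained by differentiating that expression term by term. Function names are hypothetical and natural logarithms are used throughout.

```python
import math

def binary_entropy(x):
    """Binary entropy H(x) in nats, with H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)

def info_per_monomer(x_a, m):
    """D(x_a || x_r) = log m - x_a log(m - 1) - H(x_a), in nats."""
    return math.log(m) - x_a * math.log(m - 1) - binary_entropy(x_a)

def info_slope(x_a, m):
    """Derivative of info_per_monomer in x_a; tends to -infinity as x_a -> 0."""
    return math.log(x_a / ((1.0 - x_a) * (m - 1)))

if __name__ == "__main__":
    m = 4
    for x in (0.0, 0.01, 0.02, 0.05):
        retained = info_per_monomer(x, m) / math.log(m)
        print(f"x_a={x:.2f}  retained fraction={retained:.3f}")
    for x in (1e-6, 1e-4, 1e-2):
        print(f"x_a={x:g}  slope={info_slope(x, m):.2f}")
```

For $m = 4$ and a 2% error fraction, the retained fraction comes out at roughly 0.91, in line with the "nearly 10%" loss quoted above, and the slope grows without bound in magnitude as the error fraction shrinks.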

Information Loss from Small Error Rates (m=4 DNA alphabet)

Mutual information per monomer as a fraction of maximum (log m), showing how rapidly information is lost even at small error fractions. Derived from D(x_a || x_r) with m=4.

Error fraction	Information retained (% of maximum)
0% errors	100
1% errors	93
2% errors	90
5% errors	81
10% errors	67
20% errors	43

The second major result concerns information-to-energy efficiency. The study defines a ratio of total mutual information to the minimum fuel energy needed to achieve accurate copying,

$\frac{I_{\text{tot}}}{E_{\text{tot}}^{*}} = \frac{\log m - x_{a} \log(m - 1) - H(x_{a})}{\max(\log m, Δ μ_{r}) + \log(1 - x_{a})}$

As a function of alphabet size $m$, this ratio is not monotonic. It rises, peaks, then falls. The peak occurs near $m^{*} \approx e^{Δ μ_{r}}$, meaning the energetically optimal alphabet size is set almost entirely by the assembly energy (see Figure 3). This is the core finding that makes the paper philosophically striking.

Figure 3: Information to energy cost ratio in a template copying ensemble. Left panel: information to energy cost ratio as a function of template specificity $a$ for different numbers of monomer types $m$, given a particular value of $Δ μ_{r}$. Right panel: same ratio as a function of $m$ for different values of specificity $a$. Source: Damián G. Hernández
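A short Python sketch can locate the peak of the information-to-energy ratio by brute force over integer alphabet sizes. It assumes the same form for the accurate error fraction, $x_{a} = (m - 1) e^{-a} / [1 + (m - 1) e^{-a}]$, that makes the efficiency formula consistent with the fuel threshold; that expression, the parameter values, and the function names are assumptions for illustration. It also assumes $a > 0$, which keeps the energy denominator positive.

```python
import math

def accurate_error_fraction(m, a):
    """Assumed form of x_a: each of the m - 1 wrong monomers has weight e^{-a}."""
    w = (m - 1) * math.exp(-a)
    return w / (1.0 + w)

def binary_entropy(x):
    """Binary entropy H(x) in nats, with H(0) = H(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)

def info_to_energy_ratio(m, a, delta_mu_r):
    """I_tot / E_tot^* for alphabet size m (energies in units of k_B T)."""
    x_a = accurate_error_fraction(m, a)
    info = math.log(m) - x_a * math.log(m - 1) - binary_entropy(x_a)
    energy = max(math.log(m), delta_mu_r) + math.log(1.0 - x_a)
    return info / energy

def best_alphabet_size(a, delta_mu_r, m_max=64):
    """Integer m in [2, m_max] that maximizes the ratio."""
    return max(range(2, m_max + 1),
               key=lambda m: info_to_energy_ratio(m, a, delta_mu_r))

if __name__ == "__main__":
    a = 6.0  # illustrative specificity, not a value from the paper
    for delta_mu_r in (math.log(4), 2.0, 3.0):
        m_star = best_alphabet_size(a, delta_mu_r)
        print(f"delta_mu_r={delta_mu_r:.2f}  m*={m_star}  "
              f"e^delta_mu_r={math.exp(delta_mu_r):.1f}")
```

Under these assumptions, setting the assembly energy to $\log 4$ puts the peak at exactly four monomer types, and raising it pushes the peak toward larger alphabets, tracking $e^{Δ μ_{r}}$.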

Information-to-Energy Ratio vs. Alphabet Size (Non-Monotonic Optimum)

The ratio of information transmitted to minimum fuel energy required peaks at a specific alphabet size m* that depends on assembly energy. At high assembly energies relevant for DNA, m=4 is far from optimal.

Alphabet size	Information-to-energy ratio
m=2	0.55
m=4	0.78
m=6	0.88
m=8	0.8
m=10	0.72
m=16	0.57

For a four-base alphabet to be the information-theoretic optimum, the assembly energy would need to be $Δ μ_{r} \approx \log 4 \approx 1.4\, k_{B} T$, where $k_{B} T$ is thermal energy, roughly the amount of energy random molecular jostling provides at room temperature. But DNA's actual effective assembly energy is measured to be between $14$ and $22\, k_{B} T$. The paper arrives at this range by accounting for three contributions: the intrinsic cost of forming the covalent phosphodiester bond ($+8.6\, k_{B} T$), the stabilizing contribution of base stacking interactions ($-0.8$ to $-3.7\, k_{B} T$), and the concentration of free nucleotides in living cells ($+9.2$ to $+13.8\, k_{B} T$). The total is strikingly high, roughly 10 to 15 times higher than what information-per-joule efficiency would suggest.
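The arithmetic behind that range can be reproduced directly. The sketch below pairs the extremes of each contribution to get the lowest and highest totals, which is an assumption about how the range is assembled rather than a description of the paper's procedure; the constant names are hypothetical.

```python
import math

# Contributions to the per-monomer assembly energy, in units of k_B T,
# as quoted in the article.
BOND = 8.6                    # covalent phosphodiester bond
STACKING = (-3.7, -0.8)       # base stacking: most to least stabilizing
CONCENTRATION = (9.2, 13.8)   # free-nucleotide concentration term

def assembly_energy_range():
    """Lowest and highest totals, pairing the extremes (an assumption)."""
    low = BOND + STACKING[0] + CONCENTRATION[0]
    high = BOND + STACKING[1] + CONCENTRATION[1]
    return low, high

if __name__ == "__main__":
    low, high = assembly_energy_range()
    optimum = math.log(4)     # assembly energy that would make m = 4 optimal
    print(f"total: {low:.1f} to {high:.1f} k_B T")
    print(f"ratio to log 4: {low / optimum:.1f} to {high / optimum:.1f}")
```

The totals land at about 14.1 and 21.6 $k_{B} T$, matching the quoted 14 to 22 range, and dividing by $\log 4$ gives a factor between roughly 10 and 16.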

Why This Changes Things

So why does biology run at such an energetically "wasteful" operating point? Hernández's answer is elegant. A high assembly energy $Δ μ_{r}$ doesn't just affect copying fidelity; it determines whether random polymers spontaneously self-assemble in the absence of a template. When $Δ μ_{r} > \log m$, the energetics strongly suppress background polymerization. Random sequences don't form unless fuel-driven template copying actively drives them. This is, in the language of physics, a "quenched" regime, one where the system is locked into template-directed behavior.

In other words, DNA's high assembly energy is less about making copies faithful and more about making unguided copies impossible. Life doesn't just want accurate replication — it wants replication that only happens when a template is present. An alphabet optimized purely for information-per-joule efficiency would be far more susceptible to spontaneous, uncontrolled polymer growth — a thermodynamic nightmare for any organism trying to maintain sequence integrity.

This reframing shifts how we might think about the origin of life. The four-base alphabet isn't a historical accident or an arbitrary evolutionary legacy. It reflects a genuine physical trade-off between two competing pressures: the efficiency of information transmission on one side, and the suppression of molecular noise on the other. Evolution appears to have landed — or been driven — squarely on the side of noise suppression.

DNA Assembly Energy Components vs. Information-Theoretic Optimum

Breakdown of effective per-monomer assembly energy for DNA compared to the value that would make m=4 information-theoretically optimal.

Component	Energy (kBT)
Optimum for m=4	1.4
Phosphodiester bond cost	8.6
Base stacking (midpoint)	-2.25
dNMP concentration (midpoint)	11.5
Total DNA assembly energy (min)	14

The paper also maps out the full landscape of optimal alphabet sizes as a function of both specificity $a$ and assembly energy $Δ μ_{r}$ (Figure 4). For small assembly energies and low specificity, the optimum is a two-base alphabet, the simplest possible code. As assembly energy increases, larger alphabets become optimal, but only when specificity is high enough to make them worthwhile. The four-base DNA alphabet lies well outside any of these optimal zones for its actual operating parameters, reinforcing the picture that biology has traded information efficiency for thermodynamic security.

Figure 4: Optimal values of the number of monomer types $m^{*}$ for the information-to-energy cost ratio (regions separated by black lines). Here only even values of $m^{*}$ are considered, and the colors in each region represent the maximum of the ratio $I_{\text{tot}}/E_{\text{tot}}^{*}$ for that value of $m^{*}$. Source: Damián G. Hernández

The third result addresses a question that anyone who has heard of proofreading enzymes will immediately ask: can the system do better? Shannon's channel capacity theorem says there is a hard upper bound on how much information any copying system can transmit at a given error rate, regardless of how sophisticated the proofreading machinery becomes. Hernández derives this bound explicitly for the template copying ensemble. Achieving arbitrarily low error rates is theoretically possible, but only by slowing down the effective copying rate by a factor of $C_{a} / \log m$, where $C_{a}$ is the channel capacity. Speed and accuracy are in fundamental tension, and no biological mechanism can escape this trade-off.

A simple strategy like a repetition code, making $n$ copies of each base and decoding by majority vote, does reduce errors, but it does so inefficiently. As $n$ grows, the error rate falls, but the effective information rate falls as $1/n$, staying far below the Shannon bound. More sophisticated proofreading mechanisms could in principle approach the bound, but the gap between naive repetition and the theoretical optimum is large, suggesting real biological proofreading is operating in a complex middle ground between these extremes.
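The gap between repetition and the Shannon bound can be illustrated with a small Monte Carlo sketch. It models copying as a symmetric channel in which an error replaces the correct monomer with one of the $m - 1$ others at random, and it takes the capacity to be the standard symmetric-channel value $\log m - x \log(m - 1) - H(x)$. Both choices are assumptions made for illustration and may not match the paper's exact definition of $C_{a}$; the function names are hypothetical.

```python
import math
import random
from collections import Counter

def channel_capacity(x, m):
    """Capacity of an m-ary symmetric channel with error fraction x, in nats."""
    h = 0.0 if x in (0.0, 1.0) else -x * math.log(x) - (1 - x) * math.log(1 - x)
    return math.log(m) - x * math.log(m - 1) - h

def noisy_copy(symbol, x, m, rng):
    """Copy one monomer; with probability x pick a wrong one uniformly."""
    if rng.random() < x:
        return rng.choice([s for s in range(m) if s != symbol])
    return symbol

def repetition_error_rate(n, x, m, trials=20000, seed=0):
    """Estimated decoding error when each monomer is copied n times."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        sent = rng.randrange(m)
        votes = Counter(noisy_copy(sent, x, m, rng) for _ in range(n))
        decoded = votes.most_common(1)[0][0]   # ties are broken arbitrarily
        if decoded != sent:
            failures += 1
    return failures / trials

def repetition_rate(n, m):
    """Information rate of the repetition code, in nats per monomer copied."""
    return math.log(m) / n

if __name__ == "__main__":
    x, m = 0.1, 4   # illustrative error fraction and alphabet size
    print("capacity:", channel_capacity(x, m))
    for n in (1, 3, 5, 9):
        print(n, repetition_error_rate(n, x, m), repetition_rate(n, m))
```

In this toy setting the decoding error shrinks as $n$ grows, while the rate $\log m / n$ quickly falls well below the capacity that a better code could in principle approach.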

What's Next

Hernández is careful about the limits of the model. The template copying ensemble is deliberately simplified — it treats copying as a thermodynamic ensemble process, averaging over many copies simultaneously, rather than tracking individual molecular events in real time. It does not yet include explicit proofreading mechanisms like the exonuclease activity of DNA polymerase, which can catch and excise mismatched bases after the fact. The Shannon bounds derived here set the theoretical ceiling; future versions of the model incorporating kinetic proofreading will need to be evaluated against exactly these limits.

The model also assumes the large-$L$ (long-chain) limit throughout, which is analytically tractable but may miss important finite-size effects in short RNA molecules — relevant, for instance, to theories of early life where the first self-replicating molecules were likely short and error-prone.

What the paper opens up is a genuinely new language for asking old questions. By mapping replication onto information theory, it connects the physics of DNA copying to a century of work on optimal codes, communication channels, and the fundamental limits of signal transmission. The question of why life chose four bases now has a partially quantitative answer — and it turns out the answer is less about elegance and more about robustness. Life, apparently, is less concerned with transmitting information cheaply than with making sure that information doesn't write itself by accident.

That's a surprisingly conservative design philosophy for a molecule that encodes the instructions for every living thing on Earth. But thermodynamics, as always, leaves little room for idealism.
