The Algorithm That Learns How Science Grows — and

Every scientific field leaves a fingerprint in the structure of its citation network. Machine learning papers cite machine learning papers. Immunology clusters around immunology. A handful of landmark papers accumulate thousands of citations while most go nearly unread — the rich get richer, as the principle goes. These networks are not just sociologically interesting; they are the test beds on which graph algorithms are benchmarked, community detectors are validated, and link-prediction models are trained. The quality of those algorithms depends on the quality of the synthetic networks used to evaluate them. And according to a new study out of Warsaw University of Technology, most of the tools researchers rely on to generate those synthetic networks are, in a precise technical sense, cheating (Brzozowski et al., 2026).

The paper introduces the Citation Seeder (CS) algorithm — a generator that can fake an entire citation network with just four interpretable parameters per research community, matched against real networks using up to four orders of magnitude fewer parameters than the best existing competitor. It also delivers a methodological critique that reframes how the field should think about evaluating any graph generator.

The Science

The challenge at the heart of this paper is deceptively simple to state: how do you generate a fake citation network that is realistic enough to be a useful stand-in for the real thing?

Real citation networks have three structural properties that must hold simultaneously. First, they are directed — a paper cites another paper, not the other way around. Second, they are nearly acyclic — because papers cite older work, the network flows almost entirely in one temporal direction (a directed acyclic graph, or DAG, in graph theory terms). A paper published in 2020 does not, in general, cite a paper published in 2024. Third, they are modular — they contain communities, clusters of papers in the same field that cite each other more than they cite papers outside the field.

Existing community-aware graph generators were designed for undirected social networks or densely cyclic graphs. They can plant community structure, but they produce the wrong kind of flow. Conversely, growth models inspired by Derek de Solla Price's preferential attachment — the "rich-get-richer" mechanism where papers with more citations are more likely to receive additional ones — capture temporal dynamics but ignore community structure. No widely adopted tool has done both at once.

Brzozowski, Gagolewski, and Siudem (2026) built CS on the Price-Pareto model (Chakraborty et al., earlier work), which predicts, analytically, that in-degree distributions within communities should follow a Pareto type II (also called Lomax) distribution — a heavy-tailed shape where a small number of papers accumulate a disproportionate share of citations. The shape of that tail is governed by a preferentiality parameter $ρ_{i} \in (0, 1)$ : the fraction of each new paper's citations that go to already-popular papers (proportional to their existing in-degree) within the same field, versus being distributed randomly across the whole network.

The full CS model has exactly four parameters per community $i$ :

$p_{i}$ : the community's share of new publications (its growth rate)
$m_{i}$ : the typical reference list length in that field
$σ_{i}^{2}$ : the variance in reference list length
$ρ_{i}$ : the preferentiality — how merit-driven vs. accidental citations are

All four can be estimated directly from a real network in closed form — no gradient descent, no cross-validation, no neural network. The total generation time is $O (N + E)$ , meaning it scales linearly with the number of nodes and edges. That matters enormously when you are working with networks of millions of papers.

The study compared 12 methods — five classical generators each in standard and "decycled" form, plus two CS variants — across seven real citation networks ranging from the 3,312-node CiteSeer dataset to the 3.5-million-node DBLP network, evaluated on 26 structural metrics grouped into six categories.

Real Citation Networks Used in Evaluation

Node counts for the seven real citation networks used to benchmark the 12 graph generators. Networks span five orders of magnitude in size.

Real Citation Networks Used in Evaluation
Label	Value
CiteSeer	3,312
Cora	2,708
PubMed	19,717
Cora-Full	19,793
OGBN-ArXiv*	169,343
Cit-Patents*	2,755,865
DBLP*	3,586,177

What They Found

The headline result is clean: CS achieves a competitive mean rank against the best-performing baseline across all 26 metrics, while using up to $1 0^{4}$ fewer parameters. The best baseline is the Degree-Corrected Stochastic Block Model (DC-SBM) run with a cycle-breaking procedure the authors introduce — a method that uses $O (N + k^{2})$ parameters, where $N$ is the network size and $k$ is the number of communities. CS uses $O (k)$ parameters. On networks with tens of thousands of nodes, that difference is four orders of magnitude.

Figure 3: Global mean ranks with 95% bootstrap confidence intervals,
obtained by block-resampling (dataset ×\times metric) rows with
replacement (B=10,000B=10{,}000 draws). Point estimates are shown as
diamonds (◆\blacklozenge) for CS variants and
circles (∙\bullet) for baselines; horizontal bars span
the 2.5th-97.5th percentile of the bootstrap distribution. The dashed
vertical line marks the median rank (6.5) for k=12k=12 methods. Left
panel: all 26 metrics (N=182N=182 blocks); right panel: 20 non-endogenous
metrics (N=140N=140 blocks). — Figure 3: Global mean ranks with 95% bootstrap confidence intervals, obtained by block-resampling (dataset ×\times metric) rows with replacement (B=10,000B=10{,}000 draws). Point estimates are shown as diamonds (◆\blacklozenge) for CS variants and circles (∙\bullet) for baselines; horizontal bars span the 2.5th-97.5th percentile of the bootstrap distribution. The dashed vertical line marks the median rank (6.5) for k=12k=12 methods. Left panel: all 26 metrics (N=182N=182 blocks); right panel: 20 non-endogenous metrics (N=140N=140 blocks). Source: Łukasz Brzozowski, Marek Gagolewski

But the more intellectually significant finding is about why heavyweight models appear to perform well — and why that appearance is misleading.

The paper introduces a distinction between two types of evaluation metrics. Endogenous metrics ask: does the synthetic network have the same community statistics as the real one you fed in as input? Exogenous metrics ask: if you run a community-detection algorithm on the synthetic network, does it behave the same way it would on a real network? The difference matters because a generator could simply memorise the ground-truth community labels from the input and reproduce them exactly, scoring perfectly on endogenous metrics while completely failing to generate realistic network topology. That is precisely what the high-parameter models do.

When the evaluation is restricted to the 20 non-endogenous (exogenous) metrics — the ones that measure genuine generative realism — CS outperforms the $O (N + k^{2})$ DC-SBM baseline. The heavily parametrised model's apparent advantage evaporates. It had been, in the authors' language, overfitting: memorising the planted community statistics rather than learning how networks grow.

Figure 5: Category-level performance scores for selected methods.
Score =(k+1−r¯)/k=(k{+}1-\bar{r})/k, where r¯\bar{r} is the category mean rank and k=12k=12.
Higher is better; 0.5 indicates median performance. — Figure 5: Category-level performance scores for selected methods. Score =(k+1−r¯)/k=(k{+}1-\bar{r})/k, where r¯\bar{r} is the category mean rank and k=12k=12. Higher is better; 0.5 indicates median performance. Source: Łukasz Brzozowski, Marek Gagolewski

This finding has a direct parallel to a well-known failure mode in machine learning. A model that achieves high accuracy on training data by memorising examples rather than learning patterns will fail on new data. Here, a graph generator that scores well because it reproduces the exact statistics it was given will fail to produce synthetic networks that genuinely stress-test community detection algorithms — which is the entire point of the exercise.

The paper also delivers a systematic analysis of cycle breaking — the practice of reversing edge directions to enforce temporal order. The intuition is that static generators produce networks where edges point in arbitrary directions; by applying a topological sort and flipping back-edges, you can impose citation-like flow. The authors find this is not uniformly helpful. For the DC-SBM, cycle breaking dramatically improves degree-distribution fidelity. For other generators, it can degrade community structure. The effect is method-specific and must be evaluated rather than assumed.

Figure 6: Heatmap presenting impact of cycle breaking per category. Each cell represents a percent of instance a near-DAG version of the method achieved better results than the standard method on the corresponding metrics. Source: Łukasz Brzozowski, Marek Gagolewski

Parameter Count: CS vs. Competing Models

Number of parameters required by each model family, shown for a representative network with N=10,000 nodes and k=7 communities. CS uses O(k) parameters; DC-SBM uses O(N+k²).

Parameter Count: CS vs. Competing Models
Label	Value
ER	1
LFR	5
SBM	49
CS	28
Config Model	10,000
DC-SBM	10,049

CS handles this natively. The algorithm grows the network one node at a time, so every generated edge points from a newer node to an older one — a strict DAG by construction. A separate back-edge injection step then adds a small fraction of temporal violations (papers citing papers that came out simultaneously, or revision-era cross-references) to match the real network's back-edge ratio.

Figure 1: An example network generated by our algorithm with parameters estimated from the Cora citation network [41]. The node colours correspond to the communities in the input network. Source: Łukasz Brzozowski, Marek Gagolewski

The figure above shows a CS-generated network estimated from the Cora citation network — seven node colours for seven research communities, the structure visually reminiscent of the real data it was learned from. It looks like science. It is science, in a statistical sense.

Figure 2: CS-generated vs theoretical in-degree distributions (Eq. 1), N=30,000N=30{,}000, k=3k=3, and ρ=0.2\rho=0.2 (low preferentiality, left),
ρ=0.5\rho=0.5 (moderate, centre) or ρ=0.9\rho=0.9 (high, right).
The theoretical CCDF is computed from the same parameters. Deviations in the tails reflect finite-size effects that diminish with increasing NN. — Figure 2: CS-generated vs theoretical in-degree distributions (Eq. 1), N=30,000N=30{,}000, k=3k=3, and ρ=0.2\rho=0.2 (low preferentiality, left), ρ=0.5\rho=0.5 (moderate, centre) or ρ=0.9\rho=0.9 (high, right). The theoretical CCDF is computed from the same parameters. Deviations in the tails reflect finite-size effects that diminish with increasing NN. Source: Łukasz Brzozowski, Marek Gagolewski

Why This Changes Things

The stakes here extend well beyond the narrow problem of citation network generation. Community detection algorithms — tools that identify clusters in complex networks — are used across biology (protein interaction networks), social science (political polarisation networks), economics (financial contagion networks), and infrastructure (power grids, internet routing). Every one of those algorithms needs to be benchmarked on synthetic data with known ground truth. If the synthetic data is unrealistic — if it fails to capture the directed, temporal, heavy-tailed structure of real networks — then benchmark results mean very little.

The LFR Benchmark, introduced in 2008, is the de facto standard for undirected networks. It generates power-law degree distributions and modular community structure and has been used in hundreds of papers to evaluate community detection methods. But citation networks are not undirected. The fact that paper A cites paper B is fundamentally different from the fact that paper B cites paper A. Directionality carries meaning — about intellectual influence, about temporal priority, about the flow of ideas. A benchmark that ignores this is not testing the right thing.

CS is the first community-aware generator to natively produce directed near-DAGs with theoretically grounded structure. Its parameters have concrete meanings: $ρ_{i}$ is the fraction of citations in field $i$ that are merit-driven rather than accidental; $m_{i}$ is the typical reference list length. A researcher studying computer science citation patterns could estimate these parameters from real data and then generate thousands of synthetic networks with identical statistical properties but different random seeds — giving genuine statistical power to any downstream analysis.

Average Out-Degree Across Citation Networks

Typical reference list length (mean out-degree) for each of the seven real citation networks evaluated. This corresponds directly to the m_i parameter CS estimates per community.

Average Out-Degree Across Citation Networks
Label	Value
CiteSeer	1.39
Cora	2
PubMed	2.25
Cora-Full	3.3
Cit-Patents	5.07
DBLP	6.24
OGBN-ArXiv	6.89

The $O (N + E)$ runtime also matters practically. Generating a synthetic version of DBLP — 3.5 million nodes — takes time proportional to the size of the output. Deep learning-based graph generators, which have been proposed as alternatives, require GPU training runs and cannot scale to networks of this size. CS generates at network speed.

There is also something philosophically cleaner about CS's approach. The parameters are not just fitting coefficients — they are quantities with bibliometric interpretations. The preferentiality $ρ_{i}$ connects directly to in-degree inequality via the Gini coefficient: more unequal citation distributions within a field imply higher $ρ_{i}$ , meaning that field's citation dynamics are more "winner-take-all." This makes CS not just a generator but an explanatory model — a way of describing how a field grows in terms of a handful of numbers that can be compared across disciplines, tracked over time, or extrapolated into the future.

What's Next

The paper is honest about limitations. The Price-Pareto model assumes communities grow independently and that preferential attachment is restricted to within-community citations. Real networks have cross-community citation dynamics that CS handles only through the accidental citation mechanism. Fields don't grow in isolation — a breakthrough in machine learning reshapes citation patterns in statistics, neuroscience, and biology simultaneously. Capturing those cascade dynamics would require extending the model.

The authors also note that the back-edge injection heuristic is approximate when true timestamps are unavailable. For datasets like CiteSeer and Cora, where publication dates are absent, the temporal ordering must be inferred from the network structure itself — a reasonable approximation, but not the same as knowing when each paper was actually published.

The endogenous/exogenous distinction the paper introduces deserves to become standard practice in the field. The insight — that a generator can score well by memorising inputs rather than learning structure — applies to any generative benchmark, not just citation networks. Social network generators, biological network generators, knowledge graph generators: all of them could be hiding the same failure mode. Decomposing evaluation metrics into those that measure genuine generative realism versus those that measure input recall is a methodological contribution with broad reach.

The CS algorithm itself opens several directions. Because its parameters have interpretable meanings, it can be used to study counterfactuals: what would a citation network look like if a field's preferentiality doubled? What if two fields merged? What if a new field grew ten times faster than existing ones? These are not just academic questions — they bear on science policy, on how funding shapes intellectual communities, on how algorithmic recommendation systems reshape what gets cited. CS gives researchers a clean, theoretically grounded tool for exploring them.

For now, the immediate contribution is precise and valuable: a fast, parsimonious, interpretable generator that passes the most demanding test a synthetic network can face — making community detection algorithms behave as if they were running on the real thing.

The Algorithm That Learns How Science Grows — and Fakes It Perfectly

The Science

What They Found

Why This Changes Things

What's Next