sweap: The Tool Teaching Computers to Write

Imagine you are designing the software that controls an elevator. You want a guarantee — not a test, not a code review, but a mathematical proof — that the elevator will never trap a passenger indefinitely, and will always eventually reach every floor that was requested. Now imagine the floor counter is an unbounded integer, the number of passengers is unknown, and the environment (button presses, power fluctuations, sensor readings) can take on infinitely many values. How do you write software you can be certain is correct?

This is the problem of reactive synthesis: automatically generating a program from a high-level specification, with a built-in proof that the program satisfies it. It is one of the grandest ambitions in formal methods — and for infinite-state systems, it is known to be formally undecidable. No algorithm is guaranteed to terminate with an answer for every possible input. Yet a new tool called sweap, described by Azzopardi, Di Stefano, and Piterman (2026), keeps making progress in the territory that theory says is impossible.

On the standard benchmark suite used by the research community, sweap solved 84 problems — compared to 65 solved by its own previous version and fewer still by its external rival. That gap represents not just engineering polish, but two genuinely new ideas: a way to shrink an exponential computation to a polynomial one, and a way to detect when a problem has no solution without looping forever.

The Science

Reactive synthesis frames program generation as a two-player game. One player is the controller — the program being designed. The other is the environment — the hostile, unpredictable outside world: user inputs, sensor readings, network messages. The controller wins if it can satisfy the specification (the LTL, or Linear Temporal Logic, formula) no matter what the environment does. If such a strategy exists, the problem is realisable; the tool returns working code. If not, it is unrealisable, and the tool ideally explains why.

The traditional version of this game plays out over a finite set of states — think a light switch that is either on or off. The breakthrough of sweap and its predecessors is tackling the infinite-state version, where variables are full integers: a counter that can hold any value from $- \infty$ to $+ \infty$ , or a running sum that grows without bound. The arena the game is played on now has infinitely many positions.

sweap's core strategy is called CEGAR — Counter-Example Guided Abstraction Refinement (Clarke et al., 2003). The idea is elegant: instead of reasoning about all infinitely many integer states directly, abstract the problem into a finite one by grouping states into coarse buckets (for instance, "x is less than 0," "x is between 0 and 10," "x is greater than 10"). Feed that finite abstraction to a high-performance finite-synthesis engine — sweap uses Strix (Meyer et al., 2018) or SemML (Kretínský et al., 2025) as black-box backends. If the finite solver finds a winning strategy, great: by soundness of the abstraction, it also works in the real, infinite-state problem. If it finds a counterstrategy — a way for the environment to win — sweap checks whether that counterstrategy is genuine or spurious (an artifact of the overly coarse abstraction). If spurious, it refines the abstraction and tries again.

The trap that plagued earlier CEGAR-based synthesis tools was liveness. Safety properties — "the system never enters a bad state" — are relatively easy to learn through refinement. Liveness properties — "the system eventually reaches a good state" — are harder, because you might never learn enough facts to conclude that, say, incrementing an integer will eventually make it bigger than any fixed threshold. sweap's distinctive contribution, introduced in earlier work (Azzopardi et al., 2025) and here extended, is to learn acceleration assumptions: LTL formulas encoding that if a term is incremented infinitely often without being decremented, it must eventually cross every finite threshold. This is the formal analogue of the intuition that "if you keep counting up, you will eventually pass any number." Encoding that intuition precisely and automatically is what lets sweap solve liveness problems its predecessors could not.

What They Found

The headline result is visible in

sweap vs sweap-strix: Problems Solved, Failed, Unsolved

Head-to-head comparison of sweap (with dual abstraction) against sweap-strix (without) on 106 benchmark problems.

sweap vs sweap-strix: Problems Solved, Failed, Unsolved
Label	Value
sweap	84 problems
sweap-strix	65 problems

: sweap solved 84 of 106 benchmark problems, compared to 65 for sweap-strix (the version without the new dual-abstraction feature) and 19 problems left unsolved (timing out rather than producing a wrong answer). The 19 unsolved cases are intellectually honest — the tool knew it didn't know, rather than guessing wrong.

(a) sweap vs sweap-strix. Source: Shaun Azzopardi, Luca Di Stefano

Looking inside those 84 solutions reveals something telling about the new dual-abstraction feature. Of the 84 problems sweap solved, 52 were solved by the base approach alone — the standard CEGAR loop. But 12 were solved only by the dual approach, meaning they would have been missed entirely without it. Another 12 were solved by both methods, but the dual approach was faster. This decomposition matters: it means the dual abstraction is not just a theoretical nicety but a genuine practical necessity for a significant fraction of real problems.

The comparison against Issy — sweap's only external competitor in the infinite-state synthesis space — is shown in

sweap vs Issy: Problems Solved by Format

Number of benchmark problems solved by sweap and Issy across three input specification formats.

sweap vs Issy: Problems Solved by Format
Label	Value
ISSY format	70 problems
RPG format	57 problems
TSL format	38 problems

and

How sweap's 84 Solutions Were Found

Breakdown of which solving mode (base CEGAR vs dual abstraction) was responsible for sweap's 84 solved problems.

How sweap's 84 Solutions Were Found
Label	Value
Only base approach	52
Only dual approach	12
Base faster than dual	8
Dual faster than base	12

. Across all three supported input formats (ISSY, RPG, and TSL), sweap consistently solved more problems and failed on fewer. On ISSY-format benchmarks, sweap solved 70 problems compared to Issy's 42; on RPG benchmarks, 57 vs. 52; on TSL, 38 vs. 32. Perhaps most striking is the failure column: sweap failed on 42 ISSY benchmarks to Issy's 18, which sounds worse until you notice sweap also timed out on 40 (meaning it ran out of clock, not that it gave a wrong answer), while Issy only timed out on 2. This tradeoff reflects the different architectures: sweap attempts harder problems more aggressively and occasionally runs out of time doing so.

(b) sweap vs Issy, ISSY benchmarks. Source: Shaun Azzopardi, Luca Di Stefano

The key technical innovation enabling these results is the binary encoding of predicates. Here is the problem it solves: suppose you have three predicates about a variable $x$ — say $x \leq 0$ , $x \leq 4$ , and $x \leq 10$ . Naively, each predicate gets its own Boolean variable in the abstract problem, so $n$ predicates produce $2^{n}$ Boolean combinations, most of which are logically impossible (you cannot have $x \leq 0$ without also having $x \leq 4$). Earlier CEGAR tools paid the full exponential cost, making the abstract synthesis problem doubly exponential in the number of predicates — exponential in the number of predicates to build the abstraction, then exponential again to run synthesis on it.

sweap's insight is that integer predicates of the form $t \leq c$ naturally define intervals on the number line: in this example, $(- \infty, 0]$ , $(0, 4]$ , $(4, 10]$ , and $(10, + \infty)$ . These four intervals are mutually exclusive and exhaustive, so they can be represented by just $⌈ lo g_{2} 4 ⌉ = 2$ Boolean variables — not 3. More generally, $n$ threshold predicates define $n + 1$ intervals, representable in $⌈ lo g_{2} (n + 1)⌉$ bits. The upshot: the abstract synthesis problem is only singly exponential in the number of predicates, not doubly. For problems with dozens of predicates, this difference is the gap between minutes and millennia.

Why This Changes Things

The significance of sweap extends beyond competitive benchmarks. The dream behind reactive synthesis is correct-by-construction software: rather than writing code and then testing it, you write a specification — "the drone must always avoid obstacles and eventually reach its destination" — and a tool generates code with a built-in proof. This dream is oldest in finite-state systems, where it is already industrially relevant for hardware and protocol design. The extension to infinite-state systems, and specifically to Linear Integer Arithmetic (LIA — the fragment of arithmetic involving integer variables, addition, and comparisons, without multiplication), opens the door to a vast class of practical systems: resource managers, network protocols, control software for physical systems.

What makes sweap's approach particularly interesting is its commitment to being modular. Rather than building a bespoke synthesis engine for integers, it treats existing finite-synthesis tools as black boxes. This means sweap automatically inherits improvements from the broader LTL synthesis community — when Strix or SemML gets faster, sweap gets faster for free. This is a philosophically important design choice: it bets on the continued progress of the field rather than trying to solve everything in one monolithic system.

The dual abstraction for unrealisability detection is worth dwelling on. In the standard CEGAR loop, the environment controls how state predicates are assigned in each round of the game. This is natural — the environment is adversarial, so it should choose the most damaging state description it can. But for certain unrealisable problems, the worst-case environment strategy requires the environment to track arbitrarily precise integer values, which demands an infinite number of predicates and causes the refinement loop to never terminate. sweap's dual mode flips this: it gives the controller responsibility for choosing state predicates, while the environment retains control of input predicates. In this flipped setting, some previously-infinite counterstrategies become finite — because the environment no longer needs to track precise integer values to demonstrate that no controller can win.

The paper's worked example (Listing 2, Section 4) is instructive. The problem asks: can a controller guarantee it reaches state $q_{1}$ while keeping integer variable $x < 10$ forever? The answer is no — once $q_{1}$ is reached, the environment can always choose input $i \geq 10$ to make $x \geq 10$ . But in the standard flow, proving this requires the abstract environment to track arbitrarily fine-grained values of $x$ during the controller's increment/decrement dance in $q_{0}$ , which requires counting — and counting requires infinitely many predicates. The dual flow sidesteps this: now the controller chooses what predicate bucket $x$ is in, and the environment just needs to know "am I in $q_{1}$ ? If so, set $i \geq 10$ ." That is a finite counterstrategy, and sweap finds it.

The equirealisable reductions — transformations that simplify a game into a smaller, equivalent one — are the kind of engineering detail that separates a research prototype from a usable tool. Automatically translating between TSL, RPG, ISSY, and sweap's native format means researchers working in any of these traditions can now run their benchmarks through sweap without manual encoding. Manual encoding is error-prone (the paper acknowledges as much), and errors in encoding invalidate benchmark comparisons. Having a single tool that ingests all four formats honestly is a contribution to the research infrastructure of the field, not just to the tool's own performance numbers.

What's Next

The honest caveat is that "outperforms its only competitor" describes a young and narrow field. The absolute numbers — 84 problems solved, 57 RPG benchmarks, 38 TSL benchmarks — reflect a community that is still building up its benchmark suites. The industrial deployment of infinite-state reactive synthesis remains a research goal rather than a current reality. The problems sweap handles are expressed in a formal language that requires expertise to write; tooling for automatically extracting such specifications from natural-language requirements or existing codebases does not yet exist at scale.

There are also classes of problems sweap cannot yet handle. The paper is explicit that certain highly liveness-dependent problems still defeat the CEGAR approach, and that the dual abstraction does not universally cure non-termination — it cures specific structural patterns. The 19 timeouts in the headline benchmark (

sweap vs sweap-strix: Problems Solved, Failed, Unsolved

Head-to-head comparison of sweap (with dual abstraction) against sweap-strix (without) on 106 benchmark problems.

sweap vs sweap-strix: Problems Solved, Failed, Unsolved
Label	Value
sweap	84 problems
sweap-strix	65 problems

) are reminders that undecidability is real, not a theoretical technicality. Problems with very many predicates, or with complex interactions between liveness and safety, can still exhaust the refinement loop.

The most important open question is scalability to realistic system sizes. The benchmark problems in the literature are carefully crafted to be tractable; real software controllers — for a distributed database, a real-time operating system, a self-driving vehicle — have state spaces and specification complexities orders of magnitude larger. The binary encoding reduces exponential to polynomial in predicate abstraction, which is a genuine speedup, but synthesis itself remains doubly exponential in the formula size, and that wall is not moved by sweap's innovations.

What sweap's results do establish is that the undecidability of infinite-state synthesis is not a practical death sentence for the approach. The combination of smart abstraction, liveness acceleration, dual-mode refinement, and modular reuse of finite-synthesis backends produces a tool that is genuinely useful on a meaningful problem class today — and whose architecture is designed to improve as every component improves. The researchers are building infrastructure for a future where writing a specification and pressing "synthesise" produces not just a program, but a proof — for systems that count, measure, and accumulate, not just systems that flip bits.

sweap: The Tool Teaching Computers to Write Correct Software for Infinite Worlds

The Science

What They Found

Why This Changes Things

What's Next