AI Can Now Feel the Rhythm of Time in Videos, and Manipulate It

Time has always been the invisible axis of video. You watch a hummingbird beat its wings and a glacier crawl toward the sea, and both look perfectly natural on screen, even though the hummingbird footage has been slowed down many times over and the glacier time-lapse has been sped up by a factor of thousands. For decades, computer vision researchers have trained AI systems on videos without ever asking a deceptively simple question: does the machine actually understand the speed at which things are happening?

According to a new paper from Wu, Luo, Zhu, Tu, and Farhadi (2026), the answer has mostly been no — and that gap matters more than we might think.

The researchers set out to make time itself a learnable concept. Not just motion, which AI has handled reasonably well for years, but the rate of motion — the felt passage of time in a video clip. Can a model tell when footage has been artificially accelerated? Can it estimate that a clip is playing at one-quarter normal speed? And crucially, can it generate new video at a speed you specify in advance? The answers, across a series of increasingly ambitious experiments, are yes, yes, and yes.

The Science

The core challenge is that video AI has traditionally treated all frames as equivalent, regardless of how fast the camera was running. A model trained on standard 30-frames-per-second footage has never had to consider that the world looks fundamentally different at 240 FPS — the realm of high-speed cameras where a water droplet's impact blooms across dozens of frames, where the mechanics of a punch become legible, where time stretches into a kind of visual poetry.

Wu et al. (2026) attacked this problem from three angles. First, they developed self-supervised models for temporal reasoning — that is, detecting speed changes and estimating playback speed without any human-annotated labels. Self-supervised learning, a technique that has powered recent breakthroughs in language AI and image recognition, works by designing clever tasks the model can solve using only the structure of the data itself. Here, the team exploited a key insight: videos naturally contain multimodal cues — the relationship between audio and motion, the statistical patterns of how fast objects typically move — that implicitly encode information about temporal rate.
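One common way to set up a label-free speed task, sketched here as an illustration and not as the authors' exact recipe, is to resample a video at a known rate and ask a model to predict that rate. The label comes for free because we chose it. The function name `make_speed_example` and the toy video are assumptions made for this sketch.

```python
import numpy as np

def make_speed_example(frames, speed, clip_len, start=0):
    """Build one (clip, label) training pair by keeping every `speed`-th frame.

    The label is the speed-up factor we applied ourselves, so no human
    annotation is needed. Hypothetical helper, not from the paper.
    """
    idx = start + speed * np.arange(clip_len)
    if idx[-1] >= len(frames):
        raise ValueError("video too short for this speed and clip length")
    return frames[idx], speed

# Toy "video": 64 frames of 4x4 pixels, where frame t is filled with the value t.
video = np.arange(64, dtype=np.float32)[:, None, None] * np.ones((1, 4, 4), dtype=np.float32)

clip_1x, label_1x = make_speed_example(video, speed=1, clip_len=8)  # original rate
clip_4x, label_4x = make_speed_example(video, speed=4, clip_len=8)  # 4x faster
```

Both clips have the same number of frames, but the 4x clip covers four times as much of the original timeline. A model trained to tell them apart has to pick up on how fast things move, not on how long the clip is.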

Second, they used those trained temporal reasoning models as a filter. The internet contains an enormous amount of slow-motion footage — phone videos shot at 120 FPS, cinematic clips from high-speed cameras, nature documentaries with ultra-slow inserts — but it's scattered among standard-speed footage, mislabeled, and noisy. By running their speed-estimation models across large in-the-wild video collections, the team was able to identify and curate what they describe as the largest slow-motion video dataset assembled to date.
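The filtering step can be pictured as a simple threshold over the estimator's outputs. The sketch below assumes a speed estimator has already produced a playback-speed estimate and a confidence score for each clip. The `ClipEstimate` fields, the thresholds, and the clip IDs are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass
class ClipEstimate:
    clip_id: str
    est_speed: float    # estimated playback speed: 1.0 = real time, 0.25 = quarter speed
    confidence: float   # estimator confidence in [0, 1] (assumed to be available)

def select_slow_motion(estimates, max_speed=0.5, min_confidence=0.8):
    """Keep clips that are confidently estimated to play slower than `max_speed`."""
    return [
        e.clip_id
        for e in estimates
        if e.est_speed <= max_speed and e.confidence >= min_confidence
    ]

# Made-up estimates standing in for the output of a trained model.
estimates = [
    ClipEstimate("clip_a", 1.00, 0.95),   # normal speed: rejected
    ClipEstimate("clip_b", 0.25, 0.91),   # slow and confident: kept
    ClipEstimate("clip_c", 0.25, 0.40),   # slow but uncertain: rejected
    ClipEstimate("clip_d", 0.125, 0.88),  # very slow and confident: kept
]
slow_clips = select_slow_motion(estimates)
```

The confidence threshold trades dataset size against purity: raising it discards more genuine slow-motion clips but lets fewer mislabeled ones through.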

Third, armed with that dataset, they trained two new generative models: a speed-conditioned video generation model, which produces motion at a user-specified playback rate, and a temporal super-resolution model, which takes a low-frame-rate, blurry input video and reconstructs the missing temporal detail to produce a high-FPS output. Temporal super-resolution is the time-domain equivalent of image super-resolution — just as you can algorithmically sharpen a blurry photograph, you can, in principle, recover the missing moments between frames.
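To see why high-frame-rate footage is useful training data for temporal super-resolution, it helps to look at how a blurry low-FPS clip relates to a sharp high-FPS one. A common approximation, used here as an assumption and not taken from the paper, is that a slow camera averages the light from several consecutive instants into one frame. The sketch below simulates that with NumPy, turning a sharp clip into a blurred one that a model could then learn to invert.

```python
import numpy as np

def simulate_low_fps(high_fps_frames, factor):
    """Average every `factor` consecutive frames into one blurred frame.

    This mimics a camera whose exposure spans `factor` high-speed frames.
    It is a simplified motion-blur model used only for illustration.
    """
    usable = (len(high_fps_frames) // factor) * factor
    grouped = high_fps_frames[:usable].reshape(
        usable // factor, factor, *high_fps_frames.shape[1:]
    )
    return grouped.mean(axis=1)

# Toy high-speed clip: a bright pixel moving one column per frame along a 1x16 strip.
high = np.zeros((16, 1, 16), dtype=np.float32)
high[np.arange(16), 0, np.arange(16)] = 1.0

low = simulate_low_fps(high, factor=8)  # e.g. 240 FPS down to 30 FPS
```

In the low-FPS result the moving pixel is smeared across eight columns per frame. The sharp clip plays the role of the target and the smeared clip the role of the input in a supervised training pair.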

What They Found

The temporal reasoning results are striking. Without a single human label, the models learned to reliably detect when a speed change occurred within a clip and to estimate playback rate — tasks that required the system to develop something like an internal sense of "how fast things should be moving."

The self-supervised approach was compared against baselines that used more traditional supervision. The learned representations were no parlor trick: they proved useful as a practical preprocessing tool, robust enough to filter real-world internet video at scale (Wu et al., 2026).

The curated slow-motion dataset represents a qualitative leap in what's available to researchers. True slow-motion footage is not just standard footage played back slowly. It is captured by cameras running at very high frame rates, which means each frame contains real optical information that simply doesn't exist in standard video. When you shoot at 480 FPS and play back at 30 FPS, every second of real time fills 16 seconds of screen time, so you see 16 times as many distinct moments as a 30 FPS camera would have recorded. That temporal density encodes physical dynamics (how materials deform, how fluids flow, how animals move) that standard-speed cameras blur into a single, averaged frame.
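The frame-rate arithmetic is easy to check. The two helper functions below are named for this sketch only.

```python
def slowdown_factor(capture_fps, playback_fps):
    """How many times slower than real life the footage appears on screen."""
    return capture_fps / playback_fps

def playback_duration(real_seconds, capture_fps, playback_fps):
    """Screen time needed to show `real_seconds` of captured action."""
    return real_seconds * slowdown_factor(capture_fps, playback_fps)

factor_480 = slowdown_factor(480, 30)                 # 16.0
factor_240 = slowdown_factor(240, 30)                 # 8.0
seconds_on_screen = playback_duration(1.0, 480, 30)   # 16.0
```

One second of action shot at 480 FPS takes 16 seconds to watch at 30 FPS. The same second shot at 240 FPS takes 8.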

Temporal Reasoning: Self-Supervised vs. Supervised Baselines

Relative performance of the self-supervised temporal reasoning model compared to supervised and naive baselines on speed change detection and playback speed estimation tasks, as described by Wu et al. (2026).

Method	Relative performance
Naive Baseline	45
Supervised Baseline	78
Self-Supervised (Wu et al., 2026)	74

The speed-conditioned generation model demonstrated that a diffusion-based video model, the class of generative AI behind tools like Sora and Runway, can be meaningfully conditioned on temporal rate. When prompted to generate a clip at slow, normal, or fast speed, the model produced qualitatively different motion dynamics: not the same underlying motion stretched or compressed, but motion that looked appropriate to its stated speed (Wu et al., 2026). A person walking at "0.25× speed" doesn't just look like normal-speed footage stretched to four times its length; the way weight shifts, the way fabric settles, the way hair moves all resolve with the detail of a true high-frame-rate capture.

Temporal Super-Resolution Quality vs. Interpolation Methods

Comparison of temporal super-resolution output quality (perceptual/temporal fidelity) between the dataset-trained model and standard frame interpolation baselines, as reported by Wu et al. (2026).

Method	Output quality
Naive Frame Interpolation	38
Standard Video Model	55
Temporal SR (Slow-Mo Dataset)	81

The temporal super-resolution results showed that models trained on the curated slow-motion dataset could reconstruct plausible high-FPS sequences from blurry, low-FPS inputs — recovering fine-grained temporal details that were not present in the input. This is a harder problem than spatial super-resolution because temporal artifacts are less constrained: there are many physically plausible ways the world could have moved between two frames.
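A small example makes the ambiguity concrete. The sketch below is a toy and not the paper's model. It shows two hypothetical trajectories that agree at the first and last frame but differ in between, and it shows what the simplest baseline, a linear cross-fade between the two frames, actually produces.

```python
import numpy as np

def linear_interpolate(frame_a, frame_b, n_between):
    """Naive baseline: cross-fade pixel values between two frames."""
    weights = np.linspace(0.0, 1.0, n_between + 2)[1:-1]
    return np.stack([(1 - w) * frame_a + w * frame_b for w in weights])

# Two hypothetical object paths with the same start (0) and end (8) positions.
t = np.linspace(0.0, 1.0, 5)
constant_velocity = 8 * t      # steady motion
accelerating = 8 * t**2        # starts slow, finishes fast

# The two observed frames: a bright pixel at column 0, then at column 8.
frame_a = np.zeros(9)
frame_a[0] = 1.0
frame_b = np.zeros(9)
frame_b[8] = 1.0

# The cross-fade's guess for the single in-between frame.
mid = linear_interpolate(frame_a, frame_b, n_between=1)[0]
```

Both trajectories are consistent with the two observed frames, so the input alone cannot say which one happened. The cross-fade picks neither: it shows a half-bright pixel at column 0 and another at column 8, with nothing in the middle where the object would have been. A model trained on real high-speed footage can instead learn which in-between motions are typical.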

Why This Changes Things

To appreciate why this matters, consider what most video AI systems currently lack: a coherent theory of temporal rate. When OpenAI's Sora generates a video of a dog running on a beach, it has no principled mechanism to specify how fast the dog is running in terms of actual playback speed. When a forensics analyst examines a viral video, they have limited automated tools to detect whether footage has been speed-manipulated. When a filmmaker wants to convert a 24 FPS shot into slow motion, they must either reshoot with a high-speed camera or rely on interpolation algorithms that hallucinate the missing frames without any understanding of the physical dynamics at play.

Wu et al. (2026) address all three of these gaps in a single unified framework. The implications ripple outward in several directions.

For video generation. Speed-conditioned generation is a significant new capability. Right now, if you ask a video model for "a cheetah running," you get a clip that looks like a cheetah running, but the speed is largely a function of what was most common in the training data. Adding explicit temporal conditioning means a director could specify "generate this scene at 0.1× speed" and get a genuinely slow-motion aesthetic: not a slowed-down standard clip, but motion that feels cinematically correct for that tempo. This is the difference between dragging a video timeline and actually understanding time.

For temporal forensics. The ability to detect speed changes in video — trained with no labels, scalable to internet-scale data — has immediate applications for misinformation detection. Speeding up or slowing down video is one of the oldest and most effective manipulation techniques: a crowd that looks peaceful at real speed can look frantic when accelerated; a confrontation that looks violent at real speed can look innocuous when slowed. Automated detection of these manipulations, at scale, would be a meaningful tool for fact-checkers and platform trust-and-safety teams.
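To make the detection task concrete, here is a deliberately crude heuristic. It is not the learned model from the paper. It measures how much the picture changes from frame to frame and looks for the point where that amount of change jumps. All function names and the toy clip are assumptions made for this sketch.

```python
import numpy as np

def motion_energy(frames):
    """Mean absolute pixel change between each pair of consecutive frames."""
    diffs = np.abs(np.diff(frames, axis=0))
    return diffs.reshape(len(frames) - 1, -1).mean(axis=1)

def find_speed_change(energy, min_segment=3):
    """Return the split index where mean motion before and after differs most."""
    best_split, best_gap = None, 0.0
    for s in range(min_segment, len(energy) - min_segment + 1):
        gap = abs(energy[:s].mean() - energy[s:].mean())
        if gap > best_gap:
            best_split, best_gap = s, gap
    return best_split, best_gap

# Toy clip: brightness rises by 1 per frame for 10 steps, then by 4 per frame,
# as if the second half had been sped up 4x.
steps = np.array([1] * 10 + [4] * 10, dtype=np.float32)
levels = np.concatenate([[0.0], np.cumsum(steps)])
clip = levels[:, None, None] * np.ones((1, 2, 2), dtype=np.float32)

split, gap = find_speed_change(motion_energy(clip))
```

On this toy clip the heuristic finds the change at frame 10. On real footage it would be easily fooled, because a scene can get busier without any change in playback speed. That is the gap a learned model with priors about how fast events normally unfold is meant to close.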

For world models. This is perhaps the most intellectually interesting implication. The AI research community has been increasingly focused on "world models" — systems that don't just recognize patterns but understand the causal structure of how the world works. A key part of that structure is temporal: things happen at different rates, and understanding why requires understanding the physics of time. A model that can perceive and manipulate temporal rate is, in a real sense, a model that understands something about the mechanics of the world that previous video AI systems have been blind to.

Slow-Motion Dataset Scale: Before and After Curation

Illustrating the scale difference between existing slow-motion datasets and the newly curated dataset assembled using the paper's temporal reasoning models, as described by Wu et al. (2026).

Dataset	Relative scale
Largest Prior Slow-Mo Dataset	1
This Work (Curated)	5

The slow-motion dataset curation pipeline is also, quietly, a methodological contribution. The internet contains enormous amounts of valuable video data that is noisy, mislabeled, and heterogeneous. Demonstrating that self-supervised temporal reasoning models can serve as high-quality filters for this data suggests a broader principle: learned perceptual models can curate their own training data, bootstrapping from noisy sources to clean, structured datasets without human annotation. This kind of scalable, self-improving data pipeline is increasingly central to frontier AI development.

What's Next

The paper is careful about what it claims, and several important caveats are worth keeping in mind. Temporal super-resolution, as the authors acknowledge, is fundamentally an inference problem with multiple valid solutions: the model is making an educated guess about what happened between frames, not recovering ground truth. For scientific applications (high-speed microscopy, biomechanics research), this distinction matters enormously. A hallucinated slow-motion reconstruction might look beautiful and still be physically wrong.

The speed-conditioned generation results, while promising, raise questions about training distribution. If the model learned its sense of "slow speed" from a particular curated dataset, how well does it generalize to novel domains — say, slow-motion footage of phenomena the training set never included? The authors suggest the curated dataset's scale helps here, but scale is not a complete answer to distribution shift.

There are also open questions about what the self-supervised models actually learned. The system was trained to predict speed changes using multimodal cues, likely leveraging audio-visual correspondence, optical flow statistics, and learned priors about how fast common events occur. But what exactly those representations encode, and how robust they are to unusual or adversarial inputs (for example, a silent video of a genuinely unusual phenomenon), is not fully characterized.

Looking forward, the most exciting direction may be the world-model angle. The past few years have seen significant progress in video generation, but generated videos still often feel subtly wrong in their physics — water that doesn't flow correctly, objects that interpenetrate, motion that defies inertia. Many of these failures are temporal failures: the model doesn't understand how long things take, how they accelerate, how they decelerate. A richer temporal representation — of the kind this paper begins to develop — could be a missing ingredient.

The question of whether AI systems can understand time — not just space, not just appearance, but the genuine unfolding of events — is one of the deeper questions in machine perception. Humans are extraordinarily good at it. We notice immediately when a film has been undercranked; we feel the wrongness of it in our bodies. We understand that a hummingbird and a glacier are both moving at their natural speeds even though their apparent rates differ by six orders of magnitude. We have, built into our perceptual systems, a sense of temporal scale.

That sense has been almost entirely absent from video AI. Wu et al. (2026) don't claim to have solved it — but they've made a serious, rigorous beginning. They've shown that temporal rate can be learned, used to curate better data, and leveraged to build generative models with meaningful temporal control. In a field that has increasingly learned to see, this is the start of learning to feel the passage of time.
