If you’ve used Sora, Kling, Runway, or any AI video generation tool, you’ve probably noticed the same failure mode: the first few seconds look good, then something starts to drift. A character’s face changes subtly between frames. Background details shift. Motion becomes unnatural. By thirty seconds in, the video barely resembles what you asked for. This is temporal drift — and it’s been the defining unsolved problem in AI video generation since these tools emerged. In 2025, several research groups converged on systematic solutions. Here’s what they found.
TL;DR
- Core problem: forgetting (early frames fall out of context window, details lost) and drifting (autoregressive error accumulation) — two problems that trade off against each other
- Root cause: video diffusion models have finite temporal context windows; beyond that, only compressed representations survive
- 2025 solutions:
  - FramePack: inverted temporal generation plus a fixed context length, which enables hour-long video in theory
  - Mixture of Contexts (MoC): sparse attention with learned routing selects the most relevant historical frames
  - A2RD: multimodal memory plus closed-loop self-correction for story-consistent long video
  - Direct Forcing: closes the training-inference distribution gap to reduce error accumulation
- Key insight: forgetting and drifting are a fundamental trade-off; every solution attacks this differently
Why AI Video Generation Is Structurally Hard
Still image generation models only need spatial consistency within one frame. Video generation adds temporal consistency across potentially hundreds of frames. The same character’s face must match at frame 1 and frame 300. Moving objects must follow plausible physics. Lighting and shadows must evolve coherently.
Modern video generation models handle this through diffusion models with 3D spatiotemporal attention — the denoising network processes spatial and temporal tokens together, enabling it to model frame-to-frame relationships. The constraint: context windows are finite.
graph TD
A[Video generation task] --> B[Short clips<br>under 10 seconds]
A --> C[Long video<br>30+ seconds]
B --> D[All frames fit<br>in context window]
C --> E[Early frames fall<br>out of context]
E --> F1[Forgetting<br>Detail loss]
E --> F2[Drifting<br>Error accumulation]
F1 --> G[Face changes<br>Background objects shift]
F2 --> H[Quality degrades<br>Motion becomes unnatural]
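To see why the finite context window bites so quickly, here is a rough back-of-the-envelope sketch. The numbers (8 latent frames per second, 1,024 tokens per latent frame) are illustrative assumptions, not figures from any particular model; the point is only that full spatiotemporal self-attention compares every token with every other token, so cost grows with the square of the clip length.

```python
def attention_pairs(seconds: float, latent_fps: int = 8, tokens_per_frame: int = 1024) -> int:
    """Query-key pairs for full self-attention over every token of a clip.

    latent_fps and tokens_per_frame are hypothetical values chosen for illustration.
    """
    tokens = int(seconds * latent_fps) * tokens_per_frame
    return tokens * tokens  # every token attends to every token


for seconds in (5, 30, 120):
    pairs = attention_pairs(seconds)
    print(f"{seconds:>4}s clip: {pairs:.2e} token pairs per attention layer")
```

Doubling the clip length quadruples the pair count, which is why models cap the window and fall back to compressed history beyond it.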
The Trade-off That Makes This Hard
Forgetting: The longer the video, the more of its early frames fall out of the context window. The model is left working with compressed embeddings instead of pixel-level detail, so a character's face “drifts” toward a different face and background objects change shape or disappear.
Drifting: Autoregressive generation means each step depends on the previous step’s output. During training, the model sees real frames; during inference, it sees its own generated frames. Errors accumulate and amplify across steps (exposure bias / observation bias).
Here’s the dilemma: strengthening memory to address forgetting can worsen drifting, because erroneous early frames get amplified. Reducing memory dependency to address drifting accelerates forgetting. Every solution in 2025 attacks this trade-off from a different angle.
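The drifting half of the trade-off can be illustrated with a toy simulation. This is a sketch under strong assumptions: a "frame" is reduced to a single number, the ground truth is constant, and the model makes a small Gaussian error at every step. With teacher forcing, each step starts from the true previous frame, so errors do not carry over; in an autoregressive rollout, each step starts from the previous generated frame, so errors pile up.

```python
import random


def autoregressive_errors(steps: int, step_noise: float = 0.01, seed: int = 0) -> list[float]:
    """Per-step absolute error when each step builds on the previous generated state."""
    rng = random.Random(seed)
    truth = 1.0
    state = truth
    errors = []
    for _ in range(steps):
        state += rng.gauss(0.0, step_noise)  # small error added on top of past errors
        errors.append(abs(state - truth))
    return errors


def teacher_forced_errors(steps: int, step_noise: float = 0.01, seed: int = 0) -> list[float]:
    """Per-step absolute error when each step restarts from the true previous state."""
    rng = random.Random(seed)
    return [abs(rng.gauss(0.0, step_noise)) for _ in range(steps)]


def mean_final_error(error_fn, steps: int = 300, trials: int = 200) -> float:
    """Average error at the last step over several random seeds."""
    return sum(error_fn(steps, seed=s)[-1] for s in range(trials)) / trials


print("teacher forced :", round(mean_final_error(teacher_forced_errors), 4))
print("autoregressive :", round(mean_final_error(autoregressive_errors), 4))
```

In this toy setup the autoregressive error grows roughly with the square root of the number of steps, while the teacher-forced error stays at the per-step noise level. Real video models are far more complex, but the direction of the effect is the same.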
The 2025 Solutions
FramePack: Inverted Generation Order
FramePack’s core idea is counterintuitive: don’t generate from the beginning forward. Instead, generate anchor frames at key points first, then fill gaps working backward from each endpoint.
When the model generates any given frame, it can see both the start and end of its local segment — two high-quality anchors. Error accumulation paths are shortened because every generation step has bounded bidirectional distance to reference frames.
More importantly: FramePack maintains a fixed-length context window regardless of total video length. Per-step compute cost stays constant. This is what makes hour-long video generation theoretically tractable (demonstrated in lab settings on H100 hardware for 60-minute outputs).
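One way a context can stay bounded regardless of video length is to give older frames progressively fewer tokens. The sketch below assumes a geometric schedule in which each older frame gets half the token budget of the next newer one; the schedule, the 1,024-token starting budget, and the function name are assumptions made here for illustration, not details stated above.

```python
def packed_context_tokens(num_history_frames: int, tokens_newest: int = 1024, ratio: int = 2) -> int:
    """Total context tokens when the frame of age i gets tokens_newest // ratio**i tokens.

    The geometric schedule is a hypothetical choice; it makes the total converge.
    """
    total = 0
    for age in range(num_history_frames):
        budget = tokens_newest // (ratio ** age)
        if budget == 0:
            break  # frames this old no longer receive any tokens
        total += budget
    return total


for frames in (1, 10, 100, 10_000):
    print(f"{frames:>6} history frames: {packed_context_tokens(frames)} context tokens")
```

Because the per-frame budgets form a geometric series, the total never reaches twice the newest frame's budget, so per-step attention cost stays flat as the history grows.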
Mixture of Contexts (MoC): Sparse Memory Retrieval
MoC reframes long video generation as an internal information retrieval problem. Rather than attending to all historical frames (computationally explosive), the model learns a sparse routing module that dynamically selects the most relevant historical frames for each new generation step.
Mandatory anchors — certain key frames like scene beginnings and first character appearances — are always included in the attention window regardless of video length. This directly addresses forgetting without requiring full attention over the entire history. Compute scales sub-quadratically.
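The routing idea can be sketched in a few lines. This is a simplified illustration, not the MoC implementation: it assumes each chunk of history is summarized by one descriptor vector, scores chunks by dot product with the current query, keeps the top-k, and always adds the mandatory anchors. The function and parameter names are hypothetical.

```python
def route_chunks(query: list[float], chunk_keys: list[list[float]], top_k: int,
                 mandatory: set[int]) -> list[int]:
    """Return sorted indices of history chunks the current step should attend to."""
    def dot(a: list[float], b: list[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    scores = [dot(query, key) for key in chunk_keys]
    # Rank only the non-mandatory chunks; anchors are included unconditionally.
    optional = [i for i in range(len(chunk_keys)) if i not in mandatory]
    ranked = sorted(optional, key=lambda i: scores[i], reverse=True)
    return sorted(set(mandatory) | set(ranked[:top_k]))


history = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0], [0.5, 0.5]]
selected = route_chunks(query=[1.0, 0.0], chunk_keys=history, top_k=2, mandatory={0})
print(selected)  # -> [0, 2, 4]
```

Attention then runs only over the selected chunks, so cost depends on top_k plus the number of anchors rather than on the full history length.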
A2RD: Agentic Self-Correction
Agentic Autoregressive Diffusion (A2RD) introduces three mechanisms working together:
- Segment-based autoregressive generation: long videos are divided into manageable segments with clean memory reset points between them
- Multimodal memory: memory includes not just visual frames but text descriptions, object states, and scene summaries — richer conditioning for long-range coherence
- Closed-loop self-correction: after generating each segment, the model evaluates consistency and revises before proceeding
This approach is particularly suited for narrative-heavy content where character state tracking matters across scenes.
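The three mechanisms above can be sketched as a control loop. Everything here is a hypothetical stand-in: `generate` and `score` represent the video model and the consistency evaluator, memory is reduced to a list of text summaries, and the threshold and revision cap are arbitrary. It shows only the shape of a generate, evaluate, revise cycle, not A2RD's actual interfaces.

```python
from typing import Callable


def generate_with_correction(
    prompts: list[str],
    generate: Callable[[str, list[str], int], str],
    score: Callable[[str, list[str]], float],
    threshold: float = 0.8,
    max_revisions: int = 2,
) -> list[str]:
    """Generate one segment per prompt, revising any segment that scores below threshold."""
    memory: list[str] = []  # text-only stand-in for multimodal memory
    segments: list[str] = []
    for prompt in prompts:
        attempt = 0
        segment = generate(prompt, memory, attempt)
        # Closed loop: re-generate until consistent enough or out of revisions.
        while score(segment, memory) < threshold and attempt < max_revisions:
            attempt += 1
            segment = generate(prompt, memory, attempt)
        segments.append(segment)
        memory.append(f"summary of: {segment}")  # update memory only after acceptance
    return segments


def fake_generate(prompt: str, memory: list[str], attempt: int) -> str:
    return f"{prompt}#v{attempt}"


def fake_score(segment: str, memory: list[str]) -> float:
    return 1.0 if segment.endswith("#v1") else 0.5  # pretend the first revision fixes things


print(generate_with_correction(["scene one", "scene two"], fake_generate, fake_score))
```

Writing to memory only after a segment is accepted keeps rejected drafts from contaminating later segments, which is one plausible reason to place the check before the memory update.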
Direct Forcing: Closing the Training-Inference Gap
A complementary solution to drifting: during training, expose the model to its own generated frames (not only ground truth frames). This trains the model to remain consistent even when starting from imperfect inputs, reducing the distributional shift that causes cascading errors during inference. It’s a single-step approximation strategy with modest compute overhead and measurable improvement in autoregressive stability.
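A minimal sketch of the data side of this idea follows, assuming frames are plain numbers and `predict_next` is a hypothetical one-step model call. With some probability, each context frame is replaced by the model's own one-step prediction, so the model trains on inputs that resemble what it will see at inference. This is similar in spirit to scheduled sampling and is not taken from the Direct Forcing paper.

```python
import random
from typing import Callable


def build_training_context(
    gt_frames: list[float],
    predict_next: Callable[[float], float],
    self_feed_prob: float,
    rng: random.Random,
) -> list[float]:
    """Mix ground-truth frames with the model's own one-step predictions."""
    context = [gt_frames[0]]  # the first frame is always ground truth
    for gt in gt_frames[1:]:
        if rng.random() < self_feed_prob:
            # Feed the model its own (imperfect) guess instead of the real frame.
            context.append(predict_next(context[-1]))
        else:
            context.append(gt)
    return context


frames = [0.0, 1.0, 2.0, 3.0]
slightly_off_model = lambda prev: prev + 1.5  # toy model that overshoots each step
print(build_training_context(frames, slightly_off_model, 0.0, random.Random(0)))
print(build_training_context(frames, slightly_off_model, 1.0, random.Random(0)))
```

With `self_feed_prob` at 0 this reduces to ordinary teacher forcing; raising it moves the training distribution toward the inference distribution at the cost of one extra forward pass per replaced frame.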
What Changed in Practice
Video length: From the previous practical ceiling of 10-30 seconds to several minutes of coherent generation. Seedance 2.0 (early 2026) generates 120-second continuous video; FramePack research has demonstrated much longer.
Character consistency: Consistent character appearance across scenes is now viable for real production workflows — advertising, short films, educational content.
Open-source integration: MoC and FramePack techniques are being integrated into ComfyUI and Hugging Face Diffusers, making long-form video accessible to engineers without custom infrastructure.
What’s Still Open
- Face detail in close-ups: Micro-level facial consistency in extreme close-ups remains a hard problem
- Physics consistency: Object motion that reliably respects physics is still research territory (DiffPhy and related approaches are promising but not broadly deployed)
- Evaluation metrics: FVD and LPIPS don’t fully capture human perception of temporal consistency; the field lacks a definitive benchmark
- Compute at training time: FramePack’s inference efficiency doesn’t eliminate the training cost; these models require significant infrastructure to train
References
- Temporal Drift in AI-Generated Video: Causes, Evaluation, and Production Strategies (iMerit)
- Frame Context Packing and Drift Prevention (arXiv 2504.12626)
- A2RD: Agentic Autoregressive Diffusion for Long Video Consistency (arXiv)
- Mixture of Contexts for Long Video Generation (arXiv 2508.21058)
- Pack and Force Your Memory: Long-form and Consistent Video Generation (arXiv)
- State of open video generation models in Diffusers (Hugging Face)
- Solved: The Bug That Haunted AI Video For Years (YouTube)
🇺🇸 English
If you've ever played with AI video tools — Sora, Runway, Kling, any of them — you've seen it happen. The first few seconds look impressive. Then, slowly, something starts to go wrong. A character's face shifts. Background objects warp or disappear. Motion gets choppy and weird. By the thirty-second mark, you're watching something that barely resembles what you asked for.
This phenomenon has a name: temporal drift. And it has been the defining, unsolved problem in AI video generation since these tools were first released. In 2025, multiple research groups converged on real solutions — and this year finally felt like a turning point.
So what exactly is going wrong under the hood?
AI video models operate with a finite context window. Think of it like working memory. Still image generation only needs spatial consistency within one frame. But video adds a time dimension: you need that character's face to match at frame one and at frame three hundred. You need physics to feel plausible. Lighting and shadows need to evolve coherently across the whole sequence.
The way these models work, they use a denoising network that processes spatial and temporal tokens together. Sounds powerful — and it is — but only up to the point where the context window runs out. After that, you're working with compressed summaries instead of actual pixel data. And that's where the trouble starts.
There are actually two distinct failure modes here, and understanding them is key to understanding why this problem is so hard.
The first is **forgetting**. As the video gets longer, early frames fall out of the context window entirely. The model loses access to the fine-grained visual details — the exact shape of a character's nose, the specific texture of a wall. What it retains is a kind of rough sketch. And from that rough sketch, it starts hallucinating details that may or may not match what came before. Face changes. Objects shift. This is forgetting.
The second is **drifting**. AI video generates autoregressively — each new clip depends on the previous one. Here's the problem: during training, the model learned from real, high-quality frames. But during inference, it's feeding off its own generated output. Which is imperfect. Which introduces errors. Which get amplified in the next step. And the next. By the time you're thirty seconds in, you've accumulated a chain of compounding mistakes.
Here's the painful part: these two problems are actually in tension with each other. If you strengthen the model's memory to fight forgetting, you risk amplifying drifted, erroneous early frames. If you cut memory dependency to fight drifting, you accelerate forgetting. Every solution in 2025 attacks this trade-off from a different angle.
Let's go through the four main approaches.
**FramePack** does something counterintuitive: it flips the generation order. Instead of building a video from beginning to end, it generates anchor frames at key points first — think of it like sketching the start and end of a scene before filling in the middle — then works backward from each endpoint to fill in the gaps.
Why does this help? Because when the model generates any given frame, it can see both a starting anchor and an ending anchor for its local segment. Error accumulation paths get dramatically shorter. And critically, FramePack maintains a fixed-length context window regardless of how long the total video is. Per-step compute stays constant. That's what made it possible for researchers to demonstrate hour-long video generation on high-end hardware in lab conditions. Practically speaking, it's what pushes the ceiling from thirty seconds to minutes or beyond.
**Mixture of Contexts**, or MoC, takes a different approach. Instead of changing the generation order, it reframes the problem as a memory retrieval problem. Rather than attending to every historical frame — which would be computationally explosive — the model learns a routing mechanism that dynamically selects the most relevant frames to pull into attention for each new generation step.
It also designates mandatory anchors: certain frames, like the first time a character appears or the beginning of a scene, are always in the attention window no matter what. This directly fights forgetting without requiring full history recall. And because you're only attending to a sparse selection rather than everything, compute scales much more efficiently.
**A2RD — Agentic Autoregressive Diffusion** — takes the most ambitious approach. It combines three mechanisms: it breaks long videos into segments with clean memory reset points between them; it maintains a multimodal memory that includes not just visual frames but text descriptions and scene summaries; and — this is the interesting part — it runs a self-correction loop. After generating each segment, the model evaluates its own consistency and revises before moving on. It's essentially doing its own QA mid-generation. This approach particularly shines for narrative content where you need to track character states across scenes.
The fourth solution, **Direct Forcing**, is more of a training strategy than an architecture change. The idea is simple: during training, expose the model to its own generated frames, not just ground truth real footage. This trains the model to handle imperfect inputs without spiraling. It closes the gap between training conditions and real inference conditions. It's not a silver bullet, but it measurably reduces error accumulation with relatively modest compute overhead.
What did all of this add up to in practice?
The practical ceiling for coherent AI video generation used to be somewhere between ten and thirty seconds. With these techniques, we're now seeing several minutes of coherent output. Some commercial systems have hit 120-second continuous video. The research prototypes have gone further. Character consistency across scenes — which was a real barrier for advertising and short film production — is now viable in actual workflows. And the open-source ecosystem is picking up: FramePack and MoC techniques are being integrated into tools like ComfyUI and Hugging Face Diffusers, so this isn't staying locked inside research labs.
That said, some hard problems remain. Facial micro-detail in extreme close-ups is still unreliable. True physics consistency — objects that reliably behave the way real objects do — is still largely research territory. And the field doesn't yet have great evaluation metrics; existing tools don't fully capture whether a human would perceive a video as temporally consistent or not.
So here's what to take away from all of this.
First: temporal drift isn't one bug, it's two — forgetting and drifting — and they trade off against each other. Any solution that claims to fix both completely is probably oversimplifying.
Second: 2025 produced real, systematic progress on this. FramePack's fixed context windows, MoC's sparse retrieval, A2RD's self-correction loop — these aren't marginal improvements. They represent a genuine architectural shift in how the field thinks about long-form video.
And third: the path to production-quality AI video is now clearer than it's ever been. We're not there yet, but for the first time, the road is mapped.
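🇹🇼 Chinese
If you have used Sora, Kling, or Runway, you have almost certainly run into this: the first few seconds of a video look great, but a few seconds later the face starts to warp, the background quietly changes, and the whole thing feels increasingly off. The problem is called temporal drift, and it plagued AI video generation for more than three years. Only in 2025 did several truly systematic solutions appear.
Let's first be clear about the nature of the problem.
AI video generation is much harder than image generation, not only because there are more frames to process but because it has to stay consistent along the time dimension. A character's face must be the same face at frame 1 and at frame 300; moving objects must be positioned in line with physics; lighting and shadows must evolve plausibly over time. Stacked together, these requirements are very demanding.
At the core of modern video generation models is a diffusion model combined with spatiotemporal attention, which lets the model handle spatial and temporal relationships together. This is exactly where the problem lies: the context window is finite. Short videos are fine, since every frame stays in memory. But past a certain length, early frames are pushed out of the window, the model is left with only compressed representations, and the original pixel-level detail is gone.
This creates two problems that pull against each other.
The first is "forgetting": detail from early frames is lost, a character's face drifts into a different face, and background objects disappear or change shape. The second is "drifting": each step of autoregressive generation depends on the previous step's output. During training the model sees real frames, but at inference it sees only frames it generated itself, so once one frame contains an error, every later frame amplifies it.
The hardest part is that these two problems trade off against each other. Strengthening memory eases forgetting, but it also amplifies the influence of erroneous early frames, so drifting gets worse. Conversely, putting more weight on the current frame keeps drifting in check, but early information is lost even faster. This trade-off held researchers back for years.
In 2025, three new architectures broke through this dilemma, each from a different angle.
The first is FramePack, and its idea is highly counterintuitive. The conventional approach generates forward from the first frame, but FramePack reverses this: it generates high-quality key frames first, then fills in the middle working backward from the end. As a result, whenever the model generates a segment, it has high-quality anchors on both sides to refer to, and the path along which errors accumulate is greatly shortened. More importantly, FramePack keeps a fixed-length context window, so the compute cost of each inference step stays the same no matter how long the video is. A lab version has already generated 60 minutes of video on an H100.
The second is Mixture of Contexts, or MoC. It redefines long video generation as a memory retrieval problem. The model has a memory bank of historical frames, but when generating a new frame it does not run full attention over all of them, which would make compute explode. Instead it learns a sparse routing module that dynamically picks the most relevant historical frames to attend to. In addition, certain key frames, such as the start of a scene or the frame where a character first appears, are set as mandatory anchors that are always included in the attention range and are never forgotten however long the video gets.
The third is A2RD, short for Agentic Autoregressive Diffusion. It introduces multimodal memory: it stores not only visual frames but also text descriptions, object states, and scene summaries. It also has closed-loop self-correction: after generating a segment, it first evaluates consistency, goes back and fixes any problems it finds, and continues only once the segment checks out. This approach is especially well suited to story-driven long videos that need precise tracking of character state.
There is also a training strategy called Direct Forcing, whose reasoning is more direct: since the model sees its own generated frames at inference, let it see its own frames during training too, so that it learns to produce consistent output even from imperfect input. The method is not expensive computationally, yet it significantly reduces error accumulation at inference.
The impact of these solutions is practical. By early 2026, Seedance 2.0 could generate 120 seconds of coherent video, which was hard to imagine a year earlier. Work such as advertising, short films, and educational videos, which needs characters to stay consistent across scenes, now has a genuinely workable production workflow. These techniques have also begun to be integrated into open-source frameworks such as ComfyUI and Diffusers.
Of course, the problem is not fully solved. Facial detail is still unstable in close-up shots, physical consistency of object motion remains an open problem, and training costs are still high.
But the core breakthrough has happened. Three takeaways. First, temporal drift stems from a finite context window combined with autoregressive error accumulation; it is a structural problem, not something hyperparameter tuning can fix. Second, forgetting and drifting form a trade-off, and the 2025 solutions each found a different way around that dilemma instead of brute-forcing one side of it. Third, FramePack's reversed generation order, MoC's sparse memory selection, and A2RD's self-correction all point to the same conclusion: long video generation needs not just bigger models but smarter memory management architectures.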