Sora's core architecture is a Diffusion Transformer (DiT): video is compressed into a latent representation and split into spatiotemporal patch tokens, and a Transformer is trained as the diffusion denoiser over those tokens, with self-attention across all patches providing global coherence. The real engineering challenges are temporal consistency, support for variable duration and resolution, and training scale.
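The "spatiotemporal patch token" idea can be sketched in a few lines. This is a minimal illustration, not Sora's actual pipeline: the function names `patchify` and `make_video`, the single-channel toy video, and the patch size (2 frames x 4 x 4 pixels) are all assumptions chosen for readability, and the real model works on learned latents whose details are not public.

```python
def patchify(video, pt, ph, pw):
    """Split a [T][H][W] video into flattened spacetime patches.

    Each patch spans pt frames, ph rows, and pw columns and becomes one token.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    tokens = []
    for t0 in range(0, T, pt):
        for h0 in range(0, H, ph):
            for w0 in range(0, W, pw):
                tokens.append([
                    video[t][h][w]
                    for t in range(t0, t0 + pt)
                    for h in range(h0, h0 + ph)
                    for w in range(w0, w0 + pw)
                ])
    return tokens


def make_video(T, H, W):
    # Toy single-channel video: each value encodes its own (t, h, w) index.
    return [[[t * H * W + h * W + w for w in range(W)]
             for h in range(H)] for t in range(T)]


# Hypothetical patch size: 2 frames x 4 rows x 4 columns.
short_clip = patchify(make_video(4, 8, 8), pt=2, ph=4, pw=4)
long_wide_clip = patchify(make_video(8, 8, 16), pt=2, ph=4, pw=4)

# Same token width, different sequence lengths.
print(len(short_clip), len(short_clip[0]))
print(len(long_wide_clip), len(long_wide_clip[0]))
```

Because every clip becomes a sequence of equally sized tokens, a longer or wider video only changes the sequence length. That is what lets a single Transformer handle variable durations and resolutions.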
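KV Cache stores the keys and values of already-processed tokens, so each decoding step only computes the projections and attention for the newest token. Without it, every new token triggers a recomputation over the full sequence, costing O(n²) attention work per step; with it, the per-step cost drops to O(n), and the total over a generation of n tokens drops from O(n³) to O(n²). This is the core reason modern LLM inference is fast enough to be usable.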
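A toy single-head, single-layer comparison makes the saving concrete. This is a sketch under simplifying assumptions (no batching, no multi-head split, pure-Python lists instead of tensors); the names `forward_no_cache` and `KVCache` are hypothetical and do not correspond to any library's API.

```python
import math
import random


def matvec(x, W):
    # x: [d_in], W: [d_in][d_out] -> [d_out]
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def attend(q, keys, values):
    # Scaled dot-product attention for one query over the given positions.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def forward_no_cache(xs, Wq, Wk, Wv):
    # Naive decoding step: re-project every token and rerun causal attention
    # for every position, even though only the last output is new.
    # Work per step: n(n+1)/2 query-key scores for a prefix of length n.
    keys = [matvec(x, Wk) for x in xs]
    values = [matvec(x, Wv) for x in xs]
    outputs = [attend(matvec(x, Wq), keys[: i + 1], values[: i + 1])
               for i, x in enumerate(xs)]
    return outputs[-1]


class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x, Wq, Wk, Wv):
        # Project only the new token, append to the cache, attend once.
        # Work per step: n query-key scores for a prefix of length n.
        self.keys.append(matvec(x, Wk))
        self.values.append(matvec(x, Wv))
        return attend(matvec(x, Wq), self.keys, self.values)


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d = 4
Wq, Wk, Wv = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)
tokens = rand_matrix(6, d)  # six toy token embeddings

cache = KVCache()
max_diff = 0.0
for n in range(1, len(tokens) + 1):
    naive = forward_no_cache(tokens[:n], Wq, Wk, Wv)
    cached = cache.step(tokens[n - 1], Wq, Wk, Wv)
    max_diff = max(max_diff, max(abs(a - b) for a, b in zip(naive, cached)))

print(max_diff)  # both paths should agree up to floating-point error
```

The two paths produce the same output for the newest token; the cache only removes redundant work. The trade-off is memory: the cache grows by one key and one value per token, per layer, per head.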
Transformer self-attention is inherently order-agnostic: without position information, permuting the input tokens simply permutes the outputs. Positional encoding is the fix. The approaches have evolved from sinusoidal absolute encoding, to learnable absolute encoding, to relative positional encoding, to RoPE (Rotary Position Embedding). Modern LLMs almost universally use RoPE because it adds no learned parameters, naturally encodes relative distances in the attention scores, and can be extended to longer sequences.
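The "naturally encodes relative distances" property can be checked numerically. Below is a minimal sketch of the rotation, assuming the common formulation in which consecutive pairs of dimensions are rotated by an angle proportional to the position, with the conventional base of 10000; the helper names `rope` and `dot` and the example vectors are illustrative only.

```python
import math


def rope(x, pos, base=10000.0):
    """Rotate each pair (x[i], x[i+1]) by the angle pos * base**(-i/d), i even."""
    d = len(x)
    if d % 2:
        raise ValueError("dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


# Arbitrary example query and key vectors.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

score_near = dot(rope(q, 5), rope(k, 2))       # positions 5 and 2, offset 3
score_far = dot(rope(q, 105), rope(k, 102))    # both shifted by 100, offset 3
score_other = dot(rope(q, 5), rope(k, 4))      # offset 1

print(abs(score_near - score_far) < 1e-9)   # same offset, same score
print(abs(score_near - score_other) > 1e-6)  # different offset, different score
```

The attention score depends only on the offset between the two positions, not on their absolute values, and the rotation itself has nothing to learn: the angles are fixed by `pos`, the dimension index, and `base`.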