VTok: A Unified Video Tokenizer with Decoupled Spatial-Temporal Latents

Feng Wang¹,², Yichun Shi¹, Ceyuan Yang¹, Qiushan Guo¹, Jingxiang Sun¹, Alan Yuille², Peng Wang¹

¹ByteDance Seed ²Johns Hopkins University

Abstract

This work presents VTok, a unified video tokenization framework that can be used for both generation and understanding tasks. Unlike the leading vision–language systems that tokenize videos through a naïve frame-sampling strategy, we propose to decouple the spatial and temporal representations of videos by retaining the spatial features of a single key frame while encoding each subsequent frame into a single residual token, achieving compact yet expressive video tokenization. Our experiments suggest that VTok effectively reduces the complexity of video representation from the product of frame count and per-frame token count to their sum, while the residual tokens sufficiently capture viewpoint and motion changes relative to the key frame. Extensive evaluations demonstrate the efficacy and efficiency of VTok: it achieves notably higher performance on a range of video understanding and text-to-video generation benchmarks compared with baselines using naïve tokenization, all with shorter token sequences per video (e.g., 3.4% higher accuracy on our TV-Align benchmark and 1.9% higher VBench score). Remarkably, VTok produces more coherent motion and stronger guidance following in text-to-video generation, owing to its more consistent temporal encoding. We hope VTok can serve as a standardized video tokenization paradigm for future research in video understanding and generation.
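To make the product-to-sum claim concrete, the following minimal sketch compares sequence lengths under the two schemes. It assumes one reading of the abstract: the key frame keeps all of its spatial tokens and every later frame contributes exactly one residual token. The function names and the example values (16 frames, 256 tokens per frame) are hypothetical and are not taken from the paper.

```python
def naive_token_count(num_frames: int, tokens_per_frame: int) -> int:
    """Frame-sampling tokenization: every frame keeps all of its spatial tokens."""
    if num_frames < 1 or tokens_per_frame < 1:
        raise ValueError("num_frames and tokens_per_frame must be positive")
    return num_frames * tokens_per_frame


def decoupled_token_count(num_frames: int, tokens_per_frame: int) -> int:
    """Decoupled tokenization as read from the abstract (an assumption):
    one key frame with full spatial tokens, plus one residual token
    for each subsequent frame."""
    if num_frames < 1 or tokens_per_frame < 1:
        raise ValueError("num_frames and tokens_per_frame must be positive")
    return tokens_per_frame + (num_frames - 1)


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    frames, per_frame = 16, 256
    naive = naive_token_count(frames, per_frame)
    decoupled = decoupled_token_count(frames, per_frame)
    print(f"naive: {naive} tokens, decoupled: {decoupled} tokens")
    print(f"reduction factor: {naive / decoupled:.1f}x")
```

Under these illustrative sizes the naive count grows with the product of the two quantities, while the decoupled count grows only with their sum, so the gap widens as either the frame count or the per-frame token count increases.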

Video Generation Examples

[Figure: for each text prompt below, a video generated by VTok is shown alongside a video generated by WAN2.2. The video frames are not reproduced here.]

Prompt 1: A dog walks from upper-left to lower-right across the camera frame, steady pace, soft shadows, paws visible, with head slightly bobbing and tail wagging.

Prompt 2: A car moves from left to right across the frame, captured from a slowly-moving forward-facing camera in a quiet outdoor setting.

Prompt 3: A street vendor packs two sandwiches into a paper bag at an open-air stall, with the camera positioned at counter level capturing the hands-on motion.

Prompt 4: A chef flips four pancakes one by one on a hot griddle, close-up shot with only hands shown, faint steam, bubbles forming, each pancake returning to its spot.

Prompt 5: From a refrigerator's interior viewpoint with the interior light illuminating, a cook opens the fridge door, grabs a bottle, and shuts the door until it clicks closed.

Prompt 6: Indoors in a quiet living room, from a stationary camera at waist height, a cat leaps onto a chair, pauses to look around for a beat, then hops back down.

Prompt 7: A paper sheet flutters from the edge of a table, spirals softly downward, and settles flat on the tabletop, captured from a steady side-view camera.