A True Video Foundation Model

Achieving the “GPT Moment” for Video

In the world of language, we’ve witnessed a paradigm shift. Large Language Models (LLMs) like GPT don’t just process words; they achieve a holistic understanding by mapping every token to every other token in a context window. They understand syntax, semantics, and nuance because they see the entire document, not just isolated words.

Now, imagine this power applied to video.

Imagine a Video Foundation Model (VFM) that treats a 15-minute video not as a sequence of 27,000 disconnected images, but as a single, coherent visual and temporal document.

This model wouldn’t just detect a “person” in frame #5,034. It would inherently know it’s the same person from frame #1, understand the narrative of their actions, remember objects they interacted with minutes ago even if they are now off-screen, and grasp the causal links between events. It would achieve true object permanence and narrative comprehension, the same way an LLM understands a story’s plot.

The Grand Challenge: The Tyranny of the Transformer

The very mechanism that gives LLMs their power—unconstrained self-attention—is what breaks when applied to video. The mapping of “all frames to all frames” is our goal, but it’s a computational nightmare.
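
To make the “all frames to all frames” mapping concrete, here is a minimal sketch of scaled dot-product self-attention in NumPy (illustrative only, with random projection weights standing in for learned ones); the N × N score matrix it materializes is exactly where the quadratic cost lives.

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Minimal scaled dot-product self-attention over N tokens.

    x: (N, d) matrix of token embeddings. Q, K, and V use random
    projections here; a real model would learn these weights.
    """
    d = x.shape[1]
    rng = np.random.default_rng(0)
    w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(d)        # (N, N): every token attends to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                   # (N, d): context-aware token representations

# A few thousand text tokens is manageable; hundreds of thousands of video tokens are not.
tokens = np.random.default_rng(1).standard_normal((4_096, 64))
out = self_attention(tokens)             # materializes a 4,096 x 4,096 score matrix
```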

Let’s put it in perspective:

  • A text document might have a few thousand tokens.

  • A single minute of 30fps video, tokenized into small patches, can easily generate over 200,000 tokens (a rough back-of-the-envelope estimate follows this list).
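
Here is that estimate (the frame rate comes from above; the 224×224 resolution and ViT-style 16×16 patch size are illustrative assumptions):

```python
fps = 30
seconds = 60
frames = fps * seconds                         # 1,800 frames per minute

# Assume each frame is resized to 224x224 and split into 16x16 patches,
# as in a ViT-style tokenizer: (224 / 16)^2 = 196 patches per frame.
patches_per_frame = (224 // 16) ** 2

tokens_per_minute = frames * patches_per_frame
print(f"{tokens_per_minute:,} tokens per minute of video")   # 352,800
```

Even with aggressive temporal subsampling, the count stays far beyond the few thousand tokens of a typical text document.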

The O(N^2) complexity of self-attention means that scaling from a few thousand text tokens to hundreds of thousands of video tokens doesn’t just make things slower; it makes them computationally intractable with current methods. We would need a GPU cluster the size of a city just to process a single video clip.
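
To put numbers on that blow-up, here is an order-of-magnitude sketch of the memory needed just to hold one attention score matrix in half precision (it ignores multiple heads, multiple layers, and memory-efficient attention kernels):

```python
def attention_matrix_gib(n_tokens: int, bytes_per_element: int = 2) -> float:
    """Memory for a single N x N attention score matrix in half precision (GiB)."""
    return n_tokens ** 2 * bytes_per_element / 2 ** 30

for n in (4_000, 200_000):
    print(f"{n:>9,} tokens -> {attention_matrix_gib(n):8,.2f} GiB per head, per layer")

# ~0.03 GiB at 4,000 text tokens vs ~74.5 GiB at 200,000 video tokens:
# a 50x longer sequence costs 2,500x more, before any FLOPs are even spent.
```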

We are at a crossroads. Current methods (like running YOLO on each frame and stitching the results) feel like reading a novel by analyzing each word in isolation—you get the vocabulary, but you miss the plot entirely.
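
For contrast, here is roughly what that frame-by-frame baseline looks like (a hedged sketch using OpenCV and the ultralytics YOLO package; the model weights and video path are placeholders): every frame yields an independent set of boxes, with nothing tying an object in one frame to the same object in the next.

```python
import cv2
from ultralytics import YOLO   # assumes the `ultralytics` package is installed

model = YOLO("yolov8n.pt")            # placeholder model weights
cap = cv2.VideoCapture("clip.mp4")    # placeholder video path

per_frame_detections = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    result = model(frame)[0]                   # detections for this frame only
    per_frame_detections.append(result.boxes)  # boxes, classes, confidences

cap.release()

# Each entry is an isolated snapshot: nothing here says the "person" in frame
# 5,034 is the same person as in frame 1, and objects that leave the frame
# simply vanish from the record.
```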

The Central Question for Debate:

Is the path to a true Video Foundation Model through scaling our current paradigm, or through a fundamental architectural rethink?