Sora, Runway, etc. (How do they work?)

Hi, so I felt maybe this course was a little too brief, but still interesting and informative.

To be honest, I really have no personal interest in creating a video generation model, though I am rather curious about the underlying theory, applications, and methods used.

It strikes me that this must somehow be like ‘sustained attention’, but over a sequence of images, which must be pretty tough.

My thinking is that these methods would be interesting for other applications, so I was wondering if anyone could cite some papers or similar on exactly how video generation models pull this off?