Surfable wave prediction

Anuar_Zhuken · June 16, 2024, 7:58pm

The idea:

A live video stream comes from a camera pointed at a surfing spot.
Users of the model want to know where to position themselves (x, y) to catch a wave. Assume a user needs 30-60 seconds to paddle to that position.

Input: 3D tensor (width x height x time)
Output: [ x: int, y: int, secs_from_now: int ]
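
To make the input/output contract above concrete, here is a minimal sketch in plain Python. The names `Prediction`, `FrameWindow`, `is_usable` and the `paddle_secs` parameter are hypothetical, made up for illustration. Frames are assumed to be single-channel grids stored as nested lists, and the paddle time defaults to the conservative end of the 30-60 second range.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """Model output from the post: a pixel position and a lead time in seconds."""
    x: int
    y: int
    secs_from_now: int


class FrameWindow:
    """Rolling buffer of the most recent frames (the time axis of the input)."""

    def __init__(self, width, height, num_frames):
        self.width = width
        self.height = height
        # deque with maxlen drops the oldest frame automatically when full
        self.frames = deque(maxlen=num_frames)

    def push(self, frame):
        # Assumption: a frame is a list of `height` rows with `width` values each.
        if len(frame) != self.height or any(len(row) != self.width for row in frame):
            raise ValueError("frame does not match the configured width x height")
        self.frames.append(frame)

    def is_full(self):
        return len(self.frames) == self.frames.maxlen

    def shape(self):
        # (width, height, time), matching the input description in the post
        return (self.width, self.height, len(self.frames))


def is_usable(pred, window, paddle_secs=60):
    """True if the spot lies inside the frame and leaves enough time to paddle there."""
    in_frame = 0 <= pred.x < window.width and 0 <= pred.y < window.height
    return in_frame and pred.secs_from_now >= paddle_secs


window = FrameWindow(width=4, height=3, num_frames=2)
blank = [[0] * 4 for _ in range(3)]
window.push(blank)
window.push(blank)
print(window.shape(), is_usable(Prediction(x=1, y=2, secs_from_now=60), window))
```

A check like `is_usable` is one way to discard predictions the user could not act on, for example a wave arriving sooner than the paddle time.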

What architecture will produce the best result?

Some scattered thoughts:

Using a pre-trained CNN as the per-frame feature extractor might help both training time and prediction accuracy.
To handle video, either a 3D CNN or a model with some form of attention over time can be used.
A model with attention will most likely do better, because frames from much earlier in the stream can substantially affect the prediction.
I would love to use Transformers.
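
To illustrate the attention point, here is a rough pure-Python sketch of scaled dot-product attention over per-frame feature vectors. It assumes each frame has already been reduced to a small feature vector (for example by a CNN). The function names and the toy numbers are made up for illustration. A real Transformer would add learned query/key/value projections, multiple heads and positional information, none of which is shown here.

```python
import math


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, frame_features):
    """Scaled dot-product attention of one query over per-frame feature vectors.

    Simplification: the frame features serve as both keys and values.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, feat)) / math.sqrt(d)
        for feat in frame_features
    ]
    weights = softmax(scores)
    # Context vector: weighted average of the frame features.
    context = [
        sum(w * feat[i] for w, feat in zip(weights, frame_features))
        for i in range(d)
    ]
    return weights, context


# Toy data: the oldest frame (index 0) is the one that matches the query.
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
weights, context = attend([2.0, 0.0], features)
print(weights, context)
```

In this toy case the oldest frame receives the largest weight regardless of how far back it sits in the sequence, which is the property that makes attention attractive when an early swell determines where a wave breaks later.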