Surfable wave prediction

The idea:

A live video stream comes from a camera pointed at a surf spot.
Users of the model want to know where (x, y) to position themselves to catch a wave. Assume a user needs 30-60 seconds to paddle to that position.

Input: 3D tensor (width x height x time)
Output: [ x: int, y: int, secs_from_now: int ]
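
To make this interface concrete, here is a minimal sketch of a rolling frame buffer for the live stream and a record type for the output. The names `ClipBuffer`, `WavePrediction` and `is_reachable` are hypothetical, frames are stood in by plain nested lists, and the time-first ordering of the clip is an assumption (the spec above lists width x height x time).

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class WavePrediction:
    """Hypothetical output record: pixel position and lead time."""
    x: int
    y: int
    secs_from_now: int

    def is_reachable(self, min_secs: int = 30) -> bool:
        # The user needs roughly 30-60 s to paddle, so a prediction
        # with less lead time than min_secs cannot be acted on.
        return self.secs_from_now >= min_secs


class ClipBuffer:
    """Keeps the last `length` frames of a live stream."""

    def __init__(self, length: int):
        self.frames = deque(maxlen=length)

    def push(self, frame) -> None:
        # The oldest frame is dropped automatically once the buffer is full.
        self.frames.append(frame)

    def is_full(self) -> bool:
        return len(self.frames) == self.frames.maxlen

    def clip(self) -> list:
        # Frames are returned oldest-first (time is the leading axis here,
        # which is an assumption, not something the spec fixes).
        return list(self.frames)
```

A model would only be called once `is_full()` is true, and predictions that fail `is_reachable()` could be discarded before they are shown to the user.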

What architecture will produce the best result?

Some scattered thoughts:

  • using a pre-trained CNN as the per-frame feature extractor might help with both training time and prediction accuracy
  • to handle the time dimension of the video, either a 3D CNN or a model with attention over frames can be used
  • a model with attention will most likely be superior, because it can relate frames that are far apart in time, and frames from much earlier can substantially affect the prediction
  • I would love to use Transformers
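
The combination of per-frame features and attention over time can be sketched as follows. This is only an illustration under several assumptions: random vectors stand in for the features a pre-trained CNN would produce, the weights are untrained, there is a single attention head with no positional encoding, and all names and sizes (`temporal_attention`, `predict`, `T`, `D`, `H`) are hypothetical. It uses NumPy rather than a deep learning framework to keep the mechanics visible.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_attention(feats, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the time axis.

    feats: (T, D) array, one feature vector per frame.
    Returns the (T, H) attended features and the (T, T) attention weights.
    """
    q, k, v = feats @ w_q, feats @ w_k, feats @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights


def predict(feats, params):
    """Maps per-frame features to raw (x, y, secs_from_now) values."""
    attended, _ = temporal_attention(
        feats, params["w_q"], params["w_k"], params["w_v"]
    )
    pooled = attended.mean(axis=0)   # average over the time axis
    return pooled @ params["w_out"]  # linear regression head, shape (3,)


rng = np.random.default_rng(0)
T, D, H = 16, 32, 8  # frames, feature size, attention size (arbitrary)

# Stand-in for per-frame CNN features; a real model would compute these.
feats = rng.normal(size=(T, D))
params = {
    "w_q": rng.normal(size=(D, H)),
    "w_k": rng.normal(size=(D, H)),
    "w_v": rng.normal(size=(D, H)),
    "w_out": rng.normal(size=(H, 3)),
}

out = predict(feats, params)
_, weights = temporal_attention(feats, params["w_q"], params["w_k"], params["w_v"])
```

Each row of `weights` shows how much one frame draws on every other frame in the clip, including much earlier ones, which is the property the attention bullet above relies on. A 3D CNN, by contrast, only mixes frames that fall inside its temporal receptive field.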