The idea:
A live video stream comes from a camera pointed at a surf spot.
Users of the model want to know where (x, y) to position themselves to catch a wave. Assume a user needs 30-60 seconds to paddle to that position.
Input: 3D tensor (width x height x time)
Output: [ x: int, y: int, secs_from_now: int ]
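To make the input and output concrete, here is a minimal sketch in plain Python. The names `WavePrediction`, `FrameWindow` and `is_reachable` are hypothetical, and the 30-second default is just the lower bound of the paddling time assumed above. A rolling window is only one possible way to turn the live stream into a fixed-length clip.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WavePrediction:
    x: int              # horizontal pixel coordinate in the frame
    y: int              # vertical pixel coordinate in the frame
    secs_from_now: int  # seconds until the wave is expected at (x, y)


class FrameWindow:
    """Keeps only the most recent `max_frames` frames of the live stream."""

    def __init__(self, max_frames: int):
        self.frames = deque(maxlen=max_frames)

    def push(self, frame) -> None:
        # Once the deque is full, appending drops the oldest frame.
        self.frames.append(frame)

    def is_full(self) -> bool:
        return len(self.frames) == self.frames.maxlen


def is_reachable(pred: WavePrediction, min_paddle_secs: int = 30) -> bool:
    # A prediction is only useful if the surfer still has time to paddle there.
    return pred.secs_from_now >= min_paddle_secs
```

Predictions with `secs_from_now` below the paddling time could simply be filtered out before they are shown to the user.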
What architecture will produce the best result?
Some scattered thoughts:
- using a pre-trained CNN might be beneficial for both training time and prediction accuracy
- to handle video, either a 3D CNN or a model with some form of attention can be used
- a model with attention will most likely be superior, because frames from much earlier in the stream can substantially affect the prediction
- I would love to use Transformers
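
To make the thoughts above concrete, here is a rough PyTorch sketch of one possible combination: a per-frame CNN encoder followed by a Transformer encoder over time. Everything in it is an assumption for illustration, not a recommendation: the class name `WaveSpotter`, the layer sizes, grayscale input laid out as (batch, time, height, width), and the choice to read the prediction from the last frame's token. The small CNN is a stand-in; a pre-trained backbone could take its place.

```python
import torch
import torch.nn as nn


class WaveSpotter(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_layers=2, max_frames=32):
        super().__init__()
        # Stand-in per-frame encoder; a pre-trained CNN could replace it.
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # one d_model-sized vector per frame
        )
        # Learned positional embedding so attention can tell frame order.
        self.pos = nn.Parameter(torch.zeros(1, max_frames, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Regression head: (x, y, secs_from_now) as floats.
        self.head = nn.Linear(d_model, 3)

    def forward(self, video):
        # video: (batch, time, height, width), with time <= max_frames
        b, t, h, w = video.shape
        feats = self.frame_encoder(video.reshape(b * t, 1, h, w))
        feats = feats.flatten(1).reshape(b, t, -1)  # (batch, time, d_model)
        feats = feats + self.pos[:, :t]
        feats = self.temporal(feats)  # every frame attends to every other frame
        return self.head(feats[:, -1])  # predict from the most recent frame


model = WaveSpotter()
clip = torch.randn(2, 8, 64, 64)  # 2 clips of 8 frames, 64x64 pixels
out = model(clip)                 # shape (2, 3)
```

The outputs here are floats; rounding them to the integer [x, y, secs_from_now] format would be a post-processing step. A 3D CNN variant would replace both the frame encoder and the Transformer with `nn.Conv3d` layers over the whole clip.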