I’m building a Video QA model that should choose exactly one answer from 5 candidate answers for a given question, so it is essentially a classification problem.
First, I use a complex VideoEncoder (including a GNN, self-attention, …) to get video features with shape (batch_size, num_key_frames, hidden_dim). Then a multi-head attention module (q is the video features; both k and v are the question tokens with shape (batch_size, num_tokens, hidden_dim)) is applied to update the video features. Finally, an MLP produces a weight for each key frame, and the weighted sum with shape (batch_size, hidden_dim) is compared against the candidate answers to get the logits.
My code looks like:
# (batch_size, num_key_frames, hidden_dim)
video_features = ...
# (batch_size, num_tokens, hidden_dim)
question_tokens = ...
# (batch_size, hidden_dim)
candidate_answers = ...
# (batch_size, num_key_frames, hidden_dim)
# (the reshaping needed to satisfy `batch_first=False` is omitted for brevity)
video_features, _ = self.attention(video_features, question_tokens, question_tokens)
# mlp is:
# nn.Sequential(
#     params.init_linear(nn.Linear(feature_dim, feature_dim // 2)),
#     nn.Tanh(),
#     params.init_linear(nn.Linear(feature_dim // 2, 1)),
#     nn.Softmax(dim=-2),
# )
# (batch_size, hidden_dim)
video_features = torch.sum(video_features * self.mlp(video_features), dim=1)
# or use a linear layer instead of cosine similarity
logits = torch.cosine_similarity(video_features, candidate_answers, dim=1)
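For reference, here is a self-contained, runnable sketch of the attention + weighted-sum part with dummy tensors (the concrete sizes, the use of `nn.MultiheadAttention`, and `batch_first=True` are assumptions for illustration; my real code uses `batch_first=False` and custom initialization):

```python
import torch
import torch.nn as nn

batch_size, num_key_frames, num_tokens, hidden_dim = 2, 8, 16, 32

# Dummy stand-ins for the real encoder outputs.
video_features = torch.randn(batch_size, num_key_frames, hidden_dim)
question_tokens = torch.randn(batch_size, num_tokens, hidden_dim)
candidate_answers = torch.randn(batch_size, hidden_dim)

# Cross-attention: each video frame queries the question tokens.
attention = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
attended, _ = attention(video_features, question_tokens, question_tokens)

# Frame-weighting MLP; softmax over the frame axis (dim=-2).
mlp = nn.Sequential(
    nn.Linear(hidden_dim, hidden_dim // 2),
    nn.Tanh(),
    nn.Linear(hidden_dim // 2, 1),
    nn.Softmax(dim=-2),
)
weights = mlp(attended)                        # (batch, num_key_frames, 1)
pooled = torch.sum(attended * weights, dim=1)  # (batch, hidden_dim)

logits = torch.cosine_similarity(pooled, candidate_answers, dim=1)
print(pooled.shape, logits.shape)
```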
The model overfits the training set: training accuracy exceeds 98%, but test accuracy is only about 45%. I found that if I remove the video encoder and set video_features to all zeros (i.e. video_features = torch.zeros(batch_size, num_key_frames, hidden_dim, device=device)), keeping only the attention and weighted-sum modules, the model still reaches about 45% accuracy on the test set, while training accuracy drops to about 50%. So the overfitting is suppressed, but test accuracy does not improve.
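Note that if the attention is a standard `nn.MultiheadAttention`, the zero-feature result is at least self-consistent: with an all-zero query tensor, every frame position gets the identical (projected) query, so the attended output is constant across frames and carries only question information. A minimal check of this (sizes are placeholders):

```python
import torch
import torch.nn as nn

batch_size, num_key_frames, num_tokens, hidden_dim = 2, 8, 16, 32

attention = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
question_tokens = torch.randn(batch_size, num_tokens, hidden_dim)

# Zero video features, as in the ablation described above.
video_features = torch.zeros(batch_size, num_key_frames, hidden_dim)
attended, _ = attention(video_features, question_tokens, question_tokens)

# Every frame sends the same (zero) query, so every frame receives the
# same mixture of question tokens: the output is constant across frames.
print(torch.allclose(attended[:, 0:1].expand_as(attended), attended, atol=1e-6))  # → True
```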
How can I troubleshoot potential bugs in the VideoEncoder?