What the Missing Frames Showed: Machine Learning Describes Masked Video Events

Neural networks can describe in words what’s happening in pictures and videos — but can they make sensible guesses about things that happened before or will happen afterward? Researchers probed this ability.

What’s new: Chen Liang at Zhejiang University and colleagues introduced Reasoner, an architecture that generates text descriptions of hidden, or masked, events in videos, along with a dataset for training and evaluating it. They call this capability Visual Abductive Reasoning.

Key insight: To reason about an event in the past or future, it helps to know about the events that came before and/or after it, including their order and how far apart they were: what happened immediately before or after matters most, while more distant events add further context. A transformer typically encodes the positions of input tokens in one of two ways, either a token’s absolute position in the sequence or its pairwise distance from every other token, but not both. However, it’s possible to capture both by learning an embedding for each ordered pair of positions, so that, for example, the pairs (1,3) and (3,1) get different embeddings. Such an encoding reflects both the order of events and their distance apart, making it possible to judge the relevance of any event to the events that surround it.
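
For concreteness, here is a minimal PyTorch sketch of one way to implement such a direction-aware relative encoding: a learned bias, indexed by the signed offset between two positions, is added to the content-based attention scores. The class name, single attention head, and scalar bias per offset are our simplifications, not the authors’ exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectionAwareAttention(nn.Module):
    """Single-head self-attention with a learned bias for each signed relative
    offset, so the position pair (1, 3) gets a different bias than (3, 1)."""

    def __init__(self, dim, max_len):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # One learnable bias per signed offset in [-(max_len - 1), max_len - 1].
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))
        self.max_len = max_len

    def forward(self, x):                                    # x: (batch, seq_len, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / x.size(-1) ** 0.5  # content-based attention

        # Signed offsets j - i encode both distance and direction (order).
        pos = torch.arange(x.size(1), device=x.device)
        offsets = pos[None, :] - pos[:, None]                # (seq_len, seq_len)
        bias = self.rel_bias[offsets + self.max_len - 1]     # look up a bias per pair

        attn = F.softmax(scores + bias, dim=-1)              # add the two matrices
        return attn @ v

attn = DirectionAwareAttention(dim=64, max_len=16)
events = torch.randn(2, 5, 64)    # e.g., five event representations per clip
out = attn(events)                # (2, 5, 64)
```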

How it works: The authors trained an encoder and decoder. The training dataset included more than 8,600 clips of daily activities found on the web and on television. Each clip depicted an average of four sequential events with text descriptions such as “a boy throws a frisbee out and his dog is running after it,” “the dog caught the frisbee back,” and “frisbee is in the boy’s hand.” The authors masked one event per clip. The task was to generate a description of each event in a clip, including the masked one.

  • The authors randomly sampled 50 frames per event and produced a representation of each frame using a pretrained ResNet (see the feature-extraction sketch after this list). They masked the selected event in each clip.
  • The encoder, a vanilla transformer, aggregated the frame representations into event-level visual representations. In addition to the self-attention matrix, it learned a matrix of embeddings that represented the events’ relative positions and their order, and it added the two matrices when computing attention (along the lines of the sketch above).
  • The decoder comprised three stacked transformers, each of which generated a sentence describing each event. Each transformer also produced a confidence score for each description (the average probability per word), which helped later transformers refine the descriptions (see the confidence sketch after this list).
  • During training, one term of the loss function encouraged the system to generate descriptions similar to the ground-truth descriptions. Another term encouraged it to minimize the difference between the encoder’s representations of masked and unmasked versions of an event (see the loss sketch after this list).
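
As a rough illustration of the feature-extraction step, the sketch below samples up to 50 frames from an event and embeds each with a pretrained ResNet from torchvision. The choice of ResNet-50, the preprocessing values, and the function name are our assumptions; the authors’ backbone and pipeline may differ.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Drop the classifier head and keep the 2048-dimensional pooled features.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224),
    T.ConvertImageDtype(torch.float),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed_event(frames):                  # frames: (num_frames, 3, H, W) uint8 tensor
    idx = torch.randperm(frames.size(0))[:50].sort().values   # random 50, kept in order
    batch = torch.stack([preprocess(f) for f in frames[idx]])
    with torch.no_grad():
        return resnet(batch)              # (up to 50, 2048) frame representations
```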
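
The confidence score, the average probability the decoder assigned to each word of a generated sentence, could be computed along these lines; the helper and its padding convention are hypothetical, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def sentence_confidence(logits, token_ids, pad_id=0):
    """Average per-word probability of a generated sentence.
    logits: (seq_len, vocab_size), token_ids: (seq_len,)"""
    probs = F.softmax(logits, dim=-1)
    token_probs = probs[torch.arange(token_ids.size(0)), token_ids]  # prob of each chosen word
    real_words = token_ids != pad_id
    return token_probs[real_words].mean()

logits = torch.randn(12, 5000)            # 12 generated tokens, vocabulary of 5,000
tokens = torch.randint(1, 5000, (12,))    # generated word indices (no padding here)
confidence = sentence_confidence(logits, tokens)   # scalar in (0, 1)
```

A later decoder stage could then condition on the previous stage’s sentences weighted by these scores, though the paper’s exact refinement mechanism isn’t reproduced here.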
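
The two-term objective could be sketched as follows; the consistency weight, mean-squared-error distance, and padding index are our assumptions rather than the authors’ exact choices.

```python
import torch
import torch.nn as nn

cross_entropy = nn.CrossEntropyLoss(ignore_index=0)   # 0 assumed to be the padding index
mse = nn.MSELoss()

def training_loss(word_logits, target_ids,
                  masked_event_repr, unmasked_event_repr,
                  consistency_weight=1.0):
    """Two-term objective: match the ground-truth captions and pull the encoder's
    representation of a masked event toward its unmasked counterpart.
    word_logits: (batch * seq_len, vocab_size), target_ids: (batch * seq_len,)
    *_event_repr: (batch, dim) encoder outputs with and without the mask applied."""
    captioning = cross_entropy(word_logits, target_ids)
    consistency = mse(masked_event_repr, unmasked_event_repr)
    return captioning + consistency_weight * consistency
```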

Results: The authors compared Reasoner to the best competing method, PDVC, a video captioner trained to perform their task. Three human volunteers evaluated the descriptions of masked events generated for 500 randomly drawn test-set examples. The evaluators preferred Reasoner’s description in 29.9 percent of cases, preferred PDVC’s in 10.4 percent, found them equally good in 13.7 percent, and found them equally bad in 46.0 percent. The authors also pitted Reasoner’s output against descriptions of masked events written by humans. The evaluators preferred the human-written descriptions in 64.8 percent of cases, found the two equally good in 22.1 percent, found them equally bad in 4.2 percent, and preferred Reasoner’s in 8.9 percent.

Why it matters: Reasoning over events in video is impressive but specialized. However, many NLP practitioners can take advantage of the authors’ way of using transformers to refine generated text. A decoder needs only one transformer to produce descriptions, but the authors improved theirs by stacking transformers and using the confidence scores from earlier transformers to help later ones refine their output.

We’re thinking: Given a context, transformer-based text generators often stray from it — sometimes to the point of spinning wild fantasies. This work managed to keep transformers focused on a specific sequence of events, to the extent that they could fill in missing parts of the sequence. Is there a lesson here for keeping transformers moored to reality?