Byungchul Tak, Shu Tao, et al.
IC2E 2016
Reuben Tan, Huijuan Xu, et al.
WACV 2021
The goal of weakly-supervised video moment retrieval is to localize the video segment most relevant to a description without access to temporal annotations during training. Prior work uses co-attention mechanisms to learn the relationships between the vision and language modalities, but these approaches lack the contextual information between video frames that can help determine how well a segment relates to the query. To address this, we propose an efficient Latent Graph Co-Attention Network (LoGAN) that exploits fine-grained frame-by-word interactions to jointly reason about the correspondences between all possible pairs of frames, providing context cues absent in prior work. Experiments on the DiDeMo and Charades-STA datasets demonstrate the effectiveness of our approach: we improve Recall@1 by 520% over prior weakly-supervised methods and even achieve an 11% gain over strongly-supervised methods on DiDeMo, while using significantly fewer model parameters than other co-attention mechanisms.
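The abstract names two components: fine-grained frame-by-word co-attention and joint reasoning over all pairs of frames. Below is a minimal PyTorch sketch of those two ideas only, not the authors' implementation; the function names, the plain dot-product similarity, and all shapes are illustrative assumptions.

```python
# Minimal sketch (NOT the authors' code): frame-by-word co-attention plus
# dense reasoning over all frame pairs. Names and shapes are assumptions.
import torch
import torch.nn.functional as F

def frame_by_word_coattention(frames: torch.Tensor, words: torch.Tensor):
    """frames: (T, d) frame features; words: (N, d) word features."""
    sim = frames @ words.t()                                 # (T, N) frame-word similarities
    word_aware_frames = F.softmax(sim, dim=1) @ words        # (T, d): each frame attends over words
    frame_aware_words = F.softmax(sim.t(), dim=1) @ frames   # (N, d): each word attends over frames
    return word_aware_frames, frame_aware_words

def all_pairs_frame_context(frames: torch.Tensor) -> torch.Tensor:
    """Treat the frames as a fully connected graph and aggregate context
    from every other frame (plain scaled dot-product self-attention)."""
    d = frames.size(1)
    edge_weights = F.softmax(frames @ frames.t() / d ** 0.5, dim=1)  # (T, T) pairwise weights
    return edge_weights @ frames                             # (T, d) context-aware frame features

# Toy usage with random features standing in for video/text encoders.
T, N, d = 8, 5, 16
frames, words = torch.randn(T, d), torch.randn(N, d)
word_aware_frames, _ = frame_by_word_coattention(frames, words)
contextual_frames = all_pairs_frame_context(word_aware_frames)   # (T, d)
```

Aggregating over all frame pairs is what supplies the context cues the abstract refers to; the paper's actual model would add learned projections and its latent-graph weighting on top of a skeleton like this.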
Kuniaki Saito, Kihyuk Sohn, et al.
CVPR 2023
Kevin Gu, Eva Tuecke, et al.
ICML 2024
Zongyuan Ge, Sergey Demyanov, et al.
BMVC 2017