Everything at Once - Multi-modal Fusion Transformer for Video Retrieval. Nina Shvetsova, Brian Chen, et al. CVPR 2022.
AVLnet: Learning audio-visual language representations from instructional videos. Andrew Rouditchenko, Angie Boggust, et al. INTERSPEECH 2021.
Cascaded multilingual audio-visual learning from videos. Andrew Rouditchenko, Angie Boggust, et al. INTERSPEECH 2021.