What When and Where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated InstructionsBrian ChenNina Shvetsovaet al.2024CVPR 2024
MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language KnowledgeWei LinLeonid Karlinskyet al.2023ICCV 2023