Qian Huang, George C. Stockman
ICPR 1994
Multilingual text-video retrieval methods have improved significantly in recent years, but performance for languages other than English still lags. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Motivated by the fact that English text-video retrieval outperforms retrieval in other languages, we train a student model that takes input text in different languages to match the cross-modal predictions of teacher models that take input text in English. We propose a cross-entropy-based objective that forces the distribution over the student's text-video similarity scores to be similar to that of the teacher models. We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset into 8 other languages. Our method improves multilingual text-video retrieval performance on Multi-YouCook2 and on several other datasets, such as Multi-MSRVTT and VATEX. We also analyze the effectiveness of different multilingual text models as teachers.
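The distillation objective described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `temperature` parameter, the row-wise softmax over candidate videos, and the averaging over text queries are all assumptions made for illustration. It only shows the general idea of a cross-entropy loss between a teacher's and a student's distributions over text-video similarity scores.

```python
import numpy as np

def log_softmax(scores, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def distillation_loss(student_sim, teacher_sim, temperature=1.0):
    """Cross entropy between teacher and student distributions over videos.

    student_sim: (num_texts, num_videos) similarity scores computed from
                 text in a non-English language (hypothetical input).
    teacher_sim: same shape, computed from the English version of the text.
    temperature: softening factor; an assumption, not taken from the source.
    """
    # Each row becomes a distribution over candidate videos for one query.
    teacher_probs = np.exp(log_softmax(teacher_sim / temperature, axis=1))
    student_log_probs = log_softmax(student_sim / temperature, axis=1)
    # Cross entropy per text query, averaged over the batch.
    return float(-(teacher_probs * student_log_probs).sum(axis=1).mean())

# Toy example with random similarity matrices (4 texts, 4 videos).
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 4))
matching_student = teacher.copy()   # student that agrees with the teacher
mismatched_student = -teacher       # student that ranks videos in reverse

loss_match = distillation_loss(matching_student, teacher)
loss_mismatch = distillation_loss(mismatched_student, teacher)
```

In this sketch the loss is smallest when the student's distribution equals the teacher's, so minimizing it pulls the student's similarity scores for non-English text toward the teacher's scores for the English text.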
Hisashi Kashima, Tsuyoshi Idé, et al.
IEICE Transactions on Information and Systems
James E. Gentile, Nalini Ratha, et al.
BTAS 2009
Alberto Tomita Jr., Tsuyoshi Ebina, et al.
IEICE Transactions on Information and Systems