Pavel Kisilev, Daniel Freedman, et al.
ICPR 2012
We study three aspects of designing appearance-based visual features for automatic lipreading: (a) the choice of the video region of interest (ROI) on which image transform features are obtained; (b) the extraction of speech-discriminant features at each frame; and (c) the use of temporal information to improve visual speech modeling. In particular, with respect to (a), we propose an ROI that includes the speaker's jaw and cheeks in addition to the traditionally used mouth/lip region; with respect to (b) and (c), we propose a two-stage linear discriminant analysis, applied both within each frame and across a large number of frames. On a large-vocabulary, continuous speech audio-visual database, the proposed visual features yield a 13% absolute reduction in visual-only word error rate over a baseline visual front end, and an additional 28% relative improvement in audio-visual over audio-only phonetic classification accuracy.
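The two-stage linear discriminant analysis mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes each frame already has a feature vector (ROI extraction and the image transform are not covered) and a class label, and the function names `lda_fit`, `stack_frames`, and `two_stage_lda`, the ridge term `reg`, and the parameters `d1`, `d2`, and `context` are hypothetical choices made for illustration only.

```python
import numpy as np


def lda_fit(X, y, n_components, reg=1e-6):
    """Fit an LDA projection; returns (global mean, projection matrix W)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    Sw += reg * np.eye(d)  # assumed small ridge so Sw stays invertible
    # Whiten Sw, then diagonalize Sb in the whitened space.
    evals, evecs = np.linalg.eigh(Sw)
    whiten = evecs / np.sqrt(evals)  # scales column j by 1/sqrt(evals[j])
    vals, vecs = np.linalg.eigh(whiten.T @ Sb @ whiten)
    order = np.argsort(vals)[::-1][:n_components]
    return mean, whiten @ vecs[:, order]


def stack_frames(F, context):
    """Concatenate each frame with `context` neighbours on each side.

    Sequence edges are padded by repeating the first and last frame.
    """
    T = F.shape[0]
    padded = np.pad(F, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])


def two_stage_lda(frames, labels, d1, d2, context):
    """Stage 1: LDA within each frame. Stage 2: LDA across stacked frames."""
    m1, W1 = lda_fit(frames, labels, d1)
    static = (frames - m1) @ W1               # per-frame discriminant features
    stacked = stack_frames(static, context)   # add temporal context
    m2, W2 = lda_fit(stacked, labels, d2)
    return (stacked - m2) @ W2                # final visual features
```

For example, calling `two_stage_lda(frames, labels, d1=5, d2=2, context=2)` on a `(T, D)` array of frame features returns a `(T, 2)` array. The dimensions and context width here are placeholders, not values taken from the paper.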
Sudeep Sarkar, Kim L. Boyer
Computer Vision and Image Understanding
Michelle X. Zhou, Fei Wang, et al.
ICMEW 2013
James E. Gentile, Nalini Ratha, et al.
BTAS 2009