Gerasimos Potamianos, Chalapathy Neti, et al.
International Journal of Speech Technology
We propose the use of a hierarchical, two-stage discriminant transformation for obtaining audio-visual features that improve automatic speech recognition. Linear discriminant analysis (LDA), followed by a maximum likelihood linear transform (MLLT), is first applied to MFCC-based audio-only features, as well as to visual-only features obtained by a discrete cosine transform of the video region of interest. Subsequently, a second stage of LDA and MLLT is applied to the concatenation of the resulting single-modality features. The obtained audio-visual features are used to train a traditional HMM-based speech recognizer. Experiments on the IBM ViaVoice™ audio-visual database demonstrate that the proposed feature fusion method improves speaker-independent, large-vocabulary, continuous speech recognition for both the clean and noisy audio conditions considered. In the latter case, a 24% relative word error rate reduction over an audio-only system is achieved.
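The two-stage fusion described above (a per-modality discriminant projection, then a second discriminant stage on the concatenated features) can be sketched as follows. This is a minimal illustration using scikit-learn on synthetic data with hypothetical dimensionalities and per-frame class labels standing in for HMM state labels; the MLLT rotation that follows each LDA in the paper is omitted, since scikit-learn offers no standard equivalent. It is not the authors' implementation.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic stand-ins: 1000 frames, 60-dim MFCC-based audio features,
# 100-dim DCT-based visual features, and per-frame class labels
# (in the paper these would be HMM state labels).
n_frames, n_classes = 1000, 20
audio = rng.normal(size=(n_frames, 60))
visual = rng.normal(size=(n_frames, 100))
labels = rng.integers(0, n_classes, size=n_frames)

# Stage 1: discriminant projection applied to each modality separately.
# (The paper follows each LDA with an MLLT rotation; that step is
# omitted here for lack of a standard library equivalent.)
lda_audio = LinearDiscriminantAnalysis(n_components=n_classes - 1)
lda_visual = LinearDiscriminantAnalysis(n_components=n_classes - 1)
audio_proj = lda_audio.fit_transform(audio, labels)
visual_proj = lda_visual.fit_transform(visual, labels)

# Stage 2: a second LDA on the concatenated single-modality features,
# yielding the fused audio-visual feature vector for HMM training.
fused_in = np.hstack([audio_proj, visual_proj])
lda_av = LinearDiscriminantAnalysis(n_components=n_classes - 1)
av_features = lda_av.fit_transform(fused_in, labels)

print(av_features.shape)  # (1000, 19): fused audio-visual features
```

The hierarchical structure lets each modality be decorrelated and dimension-reduced on its own statistics before fusion, so the second stage operates on a compact joint representation rather than on the raw concatenation of high-dimensional audio and visual vectors.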