Association control in mobile wireless networks
Minkyong Kim, Zhen Liu, et al.
INFOCOM 2008
We compare automatic recognition with human perception of audio-visual speech in the large-vocabulary, continuous speech recognition (LVCSR) domain. Specifically, we study the benefit of the visual modality for both machines and humans when it is combined with audio degraded by speech-babble noise at various signal-to-noise ratios (SNRs). We first consider an automatic speechreading system with a pixel-based visual front end that uses feature fusion for bimodal integration, and we compare its performance with that of an audio-only LVCSR system. We then describe the results of human speech perception experiments in which subjects are asked to transcribe audio-only and audio-visual utterances at various SNRs. For both machines and humans, we observe an effective SNR gain of approximately 6 dB over audio-only performance at 10 dB; however, these gains diverge significantly at other SNRs. Furthermore, automatic audio-visual recognition outperforms human audio-only speech perception at low SNRs.
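As a rough illustration of the feature-fusion approach mentioned in the abstract, the sketch below concatenates time-aligned audio and visual feature vectors frame by frame to form a single bimodal observation stream. The fuse_features helper and the feature dimensions are hypothetical, chosen only for illustration; the paper's actual front-end processing and alignment steps are not reproduced here.

    import numpy as np

    def fuse_features(audio_feats, visual_feats):
        # Frame-synchronous feature fusion: concatenate the audio and
        # visual feature vectors of each frame into one bimodal vector.
        # Assumes both streams are already aligned to a common frame rate.
        if audio_feats.shape[0] != visual_feats.shape[0]:
            raise ValueError("streams must have the same number of frames")
        return np.hstack([audio_feats, visual_feats])

    # Hypothetical dimensions: 39-dim audio features (e.g., MFCCs with
    # deltas) and 41-dim pixel-based visual features over 100 frames.
    audio = np.random.randn(100, 39)
    visual = np.random.randn(100, 41)
    bimodal = fuse_features(audio, visual)  # shape: (100, 80)

The fused vectors would then be fed to a single recognizer (e.g., an HMM-based LVCSR decoder), which is what distinguishes feature fusion from decision-level integration of separate audio and visual classifiers.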
Daniel M. Bikel, Vittorio Castelli
ACL 2008
Yuqing Gao, Hakan Erdoğan, et al.
INTERSPEECH - Eurospeech 2001
Nanda Kambhatla
ACL 2004