Association control in mobile wireless networks
Minkyong Kim, Zhen Liu, et al.
INFOCOM 2008
We compare automatic recognition with human perception of audio-visual speech in the large-vocabulary, continuous speech recognition (LVCSR) domain. Specifically, we study the benefit of the visual modality for both machines and humans when it is combined with audio degraded by speech-babble noise at various signal-to-noise ratios (SNRs). We first consider an automatic speechreading system with a pixel-based visual front end that uses feature fusion for bimodal integration, and we compare its performance with that of an audio-only LVCSR system. We then describe the results of human speech perception experiments in which subjects are asked to transcribe audio-only and audio-visual utterances at various SNRs. For both machines and humans, we observe an effective SNR gain of approximately 6 dB over audio-only performance at 10 dB; however, these gains diverge significantly at other SNRs. Furthermore, automatic audio-visual recognition outperforms human audio-only speech perception at low SNRs.
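As a rough illustration of the feature-fusion approach mentioned in the abstract, the sketch below concatenates time-aligned audio and visual feature vectors frame by frame to form a single bimodal observation stream. The fuse_features helper and the feature dimensions are hypothetical, chosen only for illustration; the paper's actual front-end processing and alignment steps are not reproduced here.

    import numpy as np

    def fuse_features(audio_feats, visual_feats):
        # Frame-synchronous feature fusion: concatenate the audio and
        # visual feature vectors of each frame into one bimodal vector.
        # Assumes both streams are already aligned to a common frame rate.
        if audio_feats.shape[0] != visual_feats.shape[0]:
            raise ValueError("streams must have the same number of frames")
        return np.hstack([audio_feats, visual_feats])

    # Hypothetical dimensions: 39-dim audio features (e.g., MFCCs with
    # deltas) and 41-dim pixel-based visual features over 100 frames.
    audio = np.random.randn(100, 39)
    visual = np.random.randn(100, 41)
    bimodal = fuse_features(audio, visual)  # shape: (100, 80)

The fused vectors would then be fed to a single recognizer (e.g., an HMM-based LVCSR decoder), which is what distinguishes feature fusion from decision-level integration of separate audio and visual classifiers.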
Daniel M. Bikel, Vittorio Castelli
ACL 2008
Yuqing Gao, Hakan Erdoğan, et al.
INTERSPEECH - Eurospeech 2001
Nanda Kambhatla
ACL 2004