Lattice-based Viterbi decoding techniques for speech translation
George Saon, Michael Picheny
ASRU 2007
Since it is impractical to prerecord human speech for dynamic content such as email messages and news, many commercial speech applications use recorded human speech for fixed content (e.g. system prompts) and synthetic speech for dynamic content. However, mixing human speech and synthetic speech may not be optimal from a consistency perspective. A two-condition between-participants experiment (N - 24) was conducted to compare two versions of a telephony application for Personal Information Management (PIM). In the first condition, all the system output was delivered with synthetic speech. In the second condition, users heard a mix of human speech and synthetic speech. Users managed several email and calendar tasks. Users' task performance was rated by two independent judges. Their self-ratings of task performance and attitudinal responses were also measured by means of questionnaires. Users interacting with the interface that used only synthetic speech performed the task significantly better, while users interacting with the mixed-speech interface thought they did better and had more positive attitudinal responses. A consistency framework drawn from human psychological processing is offered to explain the difference in task performance. Cognitive processing and attitudinal response are differentiated. Design implications and directions for future research are suggested.
George Saon, Michael Picheny
ASRU 2007
Russell Bobbitt, Jonathan Connell, et al.
WACV 2011
David W. Jacobs, Daphna Weinshall, et al.
IEEE Transactions on Pattern Analysis and Machine Intelligence
Opher Etzion
DEBS 2007