Shadow carving
Silvio Savarese, Holly Rushmeier, et al.
Proceedings of the IEEE International Conference on Computer Vision
We propose a three-stage, pixel-based visual front end for automatic speechreading (lipreading) that significantly improves recognition of spoken words or phonemes. The proposed algorithm is a cascade of three transforms applied to a three-dimensional video region of interest that contains the speaker's mouth area. The first stage is a typical image compression transform that achieves a high-energy, reduced-dimensionality representation of the video data. The second stage is a linear discriminant analysis-based data projection, which is applied to a concatenation of a small number of consecutive image-transformed video frames. The third stage is a data rotation by means of a maximum likelihood linear transform that optimizes the likelihood of the observed data under the assumption of a class-conditional multivariate normal distribution with diagonal covariance. We apply the algorithm to visual-only 52-class phonetic and 27-class visemic classification on a 162-subject, 8-hour-long, large-vocabulary, continuous speech audio-visual database. We demonstrate significant classification accuracy gains from each added stage of the proposed algorithm; combined, the stages achieve up to 27% improvement. Overall, we achieve 60% (49%) visual-only frame-level visemic classification accuracy with (without) use of test set viseme boundaries. In addition, we report improved audio-visual phonetic classification over the use of a single-stage image transform visual front end. Finally, we discuss preliminary speech recognition results.
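The three-stage cascade described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the image transform, so a 2-D DCT is assumed here; the MLLT stage uses plain gradient ascent with step halving as a simple stand-in optimizer; and all function names, feature sizes, and the context width are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0] /= np.sqrt(2.0)
    return m

def stage1_dct(frames, keep):
    """Stage 1 (assumed DCT): keep the `keep` highest-energy coefficients."""
    t, h, w = frames.shape
    coeffs = dct_matrix(h) @ frames @ dct_matrix(w).T  # broadcasts over frames
    flat = coeffs.reshape(t, h * w)
    idx = np.argsort((flat ** 2).mean(axis=0))[::-1][:keep]
    return flat[:, idx]

def stack_context(x, j):
    """Concatenate each frame with j neighbours on each side (edges repeated)."""
    padded = np.pad(x, ((j, j), (0, 0)), mode="edge")
    return np.hstack([padded[k:k + len(x)] for k in range(2 * j + 1)])

def stage2_lda(x, labels, out_dim):
    """Stage 2: LDA projection via a whitened symmetric eigenproblem."""
    d = x.shape[1]
    mean = x.mean(axis=0)
    sw, sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        xc = x[labels == c]
        mc = xc.mean(axis=0)
        sw += (xc - mc).T @ (xc - mc)
        sb += len(xc) * np.outer(mc - mean, mc - mean)
    sw += 1e-6 * np.trace(sw) / d * np.eye(d)  # small ridge for stability
    l_inv = np.linalg.inv(np.linalg.cholesky(sw))
    _, vecs = np.linalg.eigh(l_inv @ sb @ l_inv.T)  # ascending eigenvalues
    return x @ (l_inv.T @ vecs[:, ::-1][:, :out_dim])

def class_stats(x, labels):
    """Per-class covariance matrices and class priors."""
    classes = np.unique(labels)
    covs = np.stack([np.cov(x[labels == c].T, bias=True) for c in classes])
    weights = np.array([(labels == c).mean() for c in classes])
    return covs, weights

def mllt_objective(a, covs, weights):
    """Per-sample log-likelihood (up to constants) under diagonal covariances."""
    diag = np.einsum("ij,cjk,ik->ci", a, covs, a)  # diag(A S_c A^T) per class
    return np.linalg.slogdet(a)[1] - 0.5 * np.sum(weights[:, None] * np.log(diag))

def stage3_mllt(x, labels, iters=50, step=0.1):
    """Stage 3: estimate the rotation A by gradient ascent (assumed optimizer)."""
    covs, weights = class_stats(x, labels)
    a = np.eye(x.shape[1])
    best = mllt_objective(a, covs, weights)
    for _ in range(iters):
        diag = np.einsum("ij,cjk,ik->ci", a, covs, a)
        grad = np.linalg.inv(a).T - np.einsum(
            "c,ci,ij,cjk->ik", weights, 1.0 / diag, a, covs)
        cand = a + step * grad
        val = mllt_objective(cand, covs, weights)
        if val > best:
            a, best = cand, val   # accept only improving steps
        else:
            step *= 0.5           # otherwise shrink the step
    return x @ a.T, a
```

For a stack of mouth-region frames `frames` with shape `(T, H, W)` and per-frame class labels `labels`, the stages would be chained as `stage1_dct`, then `stack_context`, then `stage2_lda`, then `stage3_mllt`. Because LDA yields at most one fewer useful dimension than there are classes, `out_dim` should stay below the class count.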
Ken C.L. Wong, Satyananda Kashyap, et al.
Pattern Recognition Letters
Diganta Misra, Muawiz Chaudhary, et al.
CVPRW 2024
Aisha Urooj Khan, Hilde Kuehne, et al.
CVPR 2023