Phoneme-to-viseme mappings: the good, the bad, and the ugly

Article


Bear, Y. and Harvey, Richard 2017. Phoneme-to-viseme mappings: the good, the bad, and the ugly. Speech Communication. 95, pp. 40-67.
Authors: Bear, Y. and Harvey, Richard
Abstract

Visemes are the visual equivalent of phonemes. Although not precisely defined,
a working definition of a viseme is “a set of phonemes which have identical
appearance on the lips”. Therefore a phoneme falls into one viseme class but a
viseme may represent many phonemes: a many-to-one mapping. This mapping
introduces ambiguity between phonemes when using viseme classifiers. Not
only is this ambiguity damaging to the performance of audio-visual classifiers
operating on real expressive speech, there is also considerable choice between
possible mappings.
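The many-to-one structure described above can be sketched as a simple lookup table. The groupings below are illustrative examples of visually similar phoneme classes, not the 'Bear' visemes derived in the paper:

```python
# Illustrative many-to-one phoneme-to-viseme map. The class names and
# groupings are examples only (bilabials look alike on the lips, as do
# labiodentals); they are not the mappings evaluated in the paper.
PHONEME_TO_VISEME = {
    # bilabial phonemes share a closed-lip appearance
    "p": "V_bilabial",
    "b": "V_bilabial",
    "m": "V_bilabial",
    # labiodental phonemes share a lip-to-teeth appearance
    "f": "V_labiodental",
    "v": "V_labiodental",
}

def to_visemes(phonemes):
    """Collapse a phoneme sequence into its viseme sequence."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]

# Each phoneme falls into exactly one viseme class, but a viseme covers
# several phonemes, so distinct phoneme strings can collapse to the same
# viseme string -- the ambiguity a viseme classifier must cope with:
print(to_visemes(["b", "m"]))  # both map to V_bilabial
```

The inverse lookup is not a function: recovering phonemes from a viseme sequence requires extra context (e.g. a language model), which is why the choice of mapping matters.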
In this paper we explore the issue of this choice of viseme-to-phoneme map.
We show that there is a definite difference in performance between viseme-to-phoneme
mappings and explore why some maps appear to work better than
others. We also devise a new algorithm for constructing phoneme-to-viseme
mappings from labeled speech data. These new visemes, ‘Bear’ visemes, are
shown to perform better than previously known units.

Journal: Speech Communication
Journal citation: 95, pp. 40-67
ISSN: 0167-6393
Year: 2017
Publisher: Elsevier for: European Association for Signal Processing (EURASIP); International Speech Communication Association (ISCA); and North-Holland
Accepted author manuscript
License: CC BY-NC-ND
Digital Object Identifier (DOI): 10.1016/j.specom.2017.07.001
Web address (URL): https://doi.org/10.1016/j.specom.2017.07.001
Publication dates
Print: 29 Jul 2017
Publication process dates
Deposited: 31 Jul 2017
Accepted: 28 Jul 2017
Copyright information: © 2017 Elsevier
Permalink: https://repository.uel.ac.uk/item/84qz4


Related outputs

Comparing phonemes and visemes with DNN-based lipreading
Thangthai, Kwanchiva, Bear, Y. and Harvey, Richard 2017. Comparing phonemes and visemes with DNN-based lipreading. in: Proceedings of the British Machine Vision Conference. BMVA Press. In Press.
Visual speech recognition: aligning terminologies for better understanding
Bear, Y. and Taylor, Sarah L. 2017. Visual speech recognition: aligning terminologies for better understanding. in: Proceedings of the British Machine Vision Conference. BMVA Press. In Press.
Resolution limits on visual speech recognition
Bear, Y., Harvey, Richard, Theobald, Barry-John and Lan, Yuxuan 2014. Resolution limits on visual speech recognition. in: IEEE International Conference on Image Processing (ICIP) IEEE.
Some observations on computer lip-reading: moving from the dream to the reality
Bear, Y., Owen, Gari, Harvey, Richard and Theobald, Barry-John 2014. Some observations on computer lip-reading: moving from the dream to the reality. Proceedings of SPIE. 9253.
Which phoneme-to-viseme maps best improve visual-only computer lip-reading?
Bear, Y., Harvey, Richard W., Theobald, Barry-John and Lan, Yuxuan 2014. Which phoneme-to-viseme maps best improve visual-only computer lip-reading? in: Bebis, George, Boyle, Richard, Parvin, Bahram, Koracin, Darko, McMahan, Ryan, Jerald, Jason, Zhang, Hui, Drucker, Steven M., Kambhamettu, Chandra, Choubassi, Maha El, Deng, Zhigang and Carlson, Mark (ed.) Advances in Visual Computing: 10th International Symposium, ISVC 2014, Las Vegas, NV, USA, December 8-10, 2014, Proceedings, Part II Springer International Publishing.
Decoding visemes: Improving machine lip-reading
Bear, Y. and Harvey, Richard 2016. Decoding visemes: Improving machine lip-reading. in: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) IEEE.
Finding phonemes: improving machine lip-reading
Bear, Y., Harvey, Richard W. and Lan, Yuxuan 2015. Finding phonemes: improving machine lip-reading. FAAVSP - The 1st Joint Conference on Facial Analysis, Animation and Auditory-Visual Speech Processing. Education Centre of the Jesuits, Vienna, Austria 11 - 13 Sep 2015 International Speech Communication Association. pp. 115-120
Speaker-independent machine lip-reading with speaker-dependent viseme classifiers
Bear, Y., Cox, Stephen J. and Harvey, Richard W. 2015. Speaker-independent machine lip-reading with speaker-dependent viseme classifiers. FAAVSP - The 1st Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing. Education Centre of the Jesuits, Vienna, Austria 11 - 13 Sep 2015 International Speech Communication Association. pp. 190-195
Visual gesture variability between talkers in continuous speech
Bear, Y. 2017. Visual gesture variability between talkers in continuous speech. in: Proceedings of the British Machine Vision Conference. BMVA Press. In Press.