Phoneme-to-viseme mappings: the good, the bad, and the ugly

Article


Bear, Y. and Harvey, Richard 2017. Phoneme-to-viseme mappings: the good, the bad, and the ugly. Speech Communication. 95, pp. 40-67.
Authors: Bear, Y. and Harvey, Richard
Abstract

Visemes are the visual equivalent of phonemes. Although not precisely defined,
a working definition of a viseme is “a set of phonemes which have identical
appearance on the lips”. Therefore a phoneme falls into exactly one viseme
class, but a viseme may represent many phonemes: a many-to-one mapping. This
mapping introduces ambiguity between phonemes when using viseme classifiers.
Not only is this ambiguity damaging to the performance of audio-visual
classifiers operating on real expressive speech; there is also considerable
choice between possible mappings.
In this paper we explore the issue of this choice of viseme-to-phoneme map.
We show that there is a definite difference in performance between
viseme-to-phoneme mappings and explore why some maps appear to work better
than others. We also devise a new algorithm for constructing phoneme-to-viseme
mappings from labeled speech data. These new visemes, ‘Bear’ visemes, are
shown to perform better than previously known units.
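The many-to-one mapping described in the abstract can be sketched as a simple lookup table. This is a minimal illustration only: the viseme labels (V1–V3) and the particular phoneme groupings below are assumptions for demonstration, not the classes derived in the paper.

```python
from collections import defaultdict

# Hypothetical phoneme-to-viseme map (illustrative labels, not the paper's classes).
# Bilabials /p/, /b/, /m/ look near-identical on the lips, so they share one viseme.
PHONEME_TO_VISEME = {
    "p": "V1", "b": "V1", "m": "V1",   # bilabial closure
    "f": "V2", "v": "V2",              # labiodental contact
    "k": "V3", "g": "V3", "n": "V3",   # articulation largely hidden from view
}

def to_visemes(phonemes):
    """Collapse a phoneme sequence into its viseme classes."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]

# Inverting the map shows why it is not one-to-one: each viseme covers
# several phonemes, which is exactly the ambiguity a viseme classifier faces.
VISEME_TO_PHONEMES = defaultdict(list)
for phoneme, viseme in PHONEME_TO_VISEME.items():
    VISEME_TO_PHONEMES[viseme].append(phoneme)
```

Because several phonemes collapse into each class, a classifier that outputs viseme V1 cannot by itself distinguish /p/, /b/, and /m/; recovering the phoneme requires extra context, which is why the choice of mapping matters.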

Journal: Speech Communication
Journal citation: 95, pp. 40-67
ISSN: 0167-6393
Year: 2017
Publisher: Elsevier for: European Association for Signal Processing (EURASIP); International Speech Communication Association (ISCA); and North-Holland
Accepted author manuscript
License: CC BY-NC-ND
Digital Object Identifier (DOI): 10.1016/j.specom.2017.07.001
Web address (URL): https://doi.org/10.1016/j.specom.2017.07.001
Publication dates
Print: 29 Jul 2017
Publication process dates
Deposited: 31 Jul 2017
Accepted: 28 Jul 2017
Copyright information: © 2017 Elsevier
Permalink: https://repository.uel.ac.uk/item/84qz4


Related outputs

Comparing phonemes and visemes with DNN-based lipreading
Thangthai, Kwanchiva, Bear, Y. and Harvey, Richard 2017. Comparing phonemes and visemes with DNN-based lipreading. in: Proceedings of the British Machine Vision Conference. BMVA Press. In press.
Visual speech recognition: aligning terminologies for better understanding
Bear, Y. and Taylor, Sarah L. 2017. Visual speech recognition: aligning terminologies for better understanding. in: Proceedings of the British Machine Vision Conference. BMVA Press. In press.
Visual gesture variability between talkers in continuous speech
Bear, Y. 2017. Visual gesture variability between talkers in continuous speech. in: Proceedings of the British Machine Vision Conference. BMVA Press. In press.
Resolution limits on visual speech recognition
Bear, Y., Harvey, Richard, Theobald, Barry-John and Lan, Yuxuan 2014. Resolution limits on visual speech recognition. in: IEEE International Conference on Image Processing (ICIP). IEEE.
Some observations on computer lip-reading: moving from the dream to the reality
Bear, Y., Owen, Gari, Harvey, Richard and Theobald, Barry-John 2014. Some observations on computer lip-reading: moving from the dream to the reality. Proceedings of SPIE. 9253.
Which phoneme-to-viseme maps best improve visual-only computer lip-reading?
Bear, Y., Harvey, Richard W., Theobald, Barry-John and Lan, Yuxuan 2014. Which phoneme-to-viseme maps best improve visual-only computer lip-reading? in: Bebis, George, Boyle, Richard, Parvin, Bahram, Koracin, Darko, McMahan, Ryan, Jerald, Jason, Zhang, Hui, Drucker, Steven M., Kambhamettu, Chandra, Choubassi, Maha El, Deng, Zhigang and Carlson, Mark (ed.) Advances in Visual Computing: 10th International Symposium, ISVC 2014, Las Vegas, NV, USA, December 8-10, 2014, Proceedings, Part II Springer International Publishing.
Decoding visemes: Improving machine lip-reading
Bear, Y. and Harvey, Richard 2016. Decoding visemes: Improving machine lip-reading. in: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.
Finding phonemes: improving machine lip-reading
Bear, Y., Harvey, Richard W. and Lan, Yuxuan 2015. Finding phonemes: improving machine lip-reading. FAAVSP - The 1st Joint Conference on Facial Analysis, Animation and Auditory-Visual Speech Processing. Education Centre of the Jesuits, Vienna, Austria 11 - 13 Sep 2015 International Speech Communication Association. pp. 115-120
Speaker-independent machine lip-reading with speaker-dependent viseme classifiers
Bear, Y., Cox, Stephen J. and Harvey, Richard W. 2015. Speaker-independent machine lip-reading with speaker-dependent viseme classifiers. FAAVSP - The 1st Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing. Education Centre of the Jesuits, Vienna, Austria 11 - 13 Sep 2015 International Speech Communication Association. pp. 190-195