Comparing phonemes and visemes with DNN-based lipreading

Book chapter


Thangthai, Kwanchiva, Bear, Y. and Harvey, Richard 2017. Comparing phonemes and visemes with DNN-based lipreading. in: Proceedings of the British Machine Vision Conference. BMVA Press. In press.
Authors: Thangthai, Kwanchiva, Bear, Y. and Harvey, Richard
Abstract

There is debate over whether phoneme or viseme units are the most effective for a lipreading
system. Some studies use phoneme units even though phonemes describe unique short
sounds; other studies have tried to improve lipreading accuracy by focusing on visemes, with
varying results. We compare the performance of a lipreading system that models visual
speech using either 13 viseme or 38 phoneme units, and we report the accuracy of our
system at both the word and unit levels. The evaluation task is large-vocabulary continuous
speech recognition using the TCD-TIMIT corpus. We model visual speech with
hybrid DNN-HMMs, and our visual speech decoder is a Weighted Finite-State Transducer
(WFST). We use DCT and Eigenlips features as representations of the mouth ROI image. The
word accuracy of the phoneme lipreading system outperforms that of the viseme-based
system. However, the phoneme system achieved lower accuracy at the unit level, which
shows the importance of the dictionary for decoding classification outputs into words.
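
The visual features named in the abstract (DCT and Eigenlips) can be illustrated with a short sketch. The Python fragment below is not the authors' code; the ROI size, coefficient count and number of principal components are illustrative assumptions, since this record does not give the paper's exact configuration.

# Minimal sketch: DCT and "Eigenlips"-style PCA features from grayscale mouth-ROI frames.
import numpy as np
from scipy.fft import dctn

def dct_features(roi, n_coeffs=44):
    """Lowest-frequency 2-D DCT coefficients of one mouth ROI."""
    coeffs = dctn(roi.astype(float), norm="ortho")   # 2-D type-II DCT
    k = int(np.ceil(np.sqrt(n_coeffs)))              # small top-left (low-frequency) block
    return coeffs[:k, :k].ravel()[:n_coeffs]

def eigenlips_basis(rois, n_components=30):
    """Learn a PCA ("Eigenlips") basis from a stack of vectorised ROIs."""
    X = rois.reshape(len(rois), -1).astype(float)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)  # principal directions
    return mean, Vt[:n_components]

def eigenlips_features(roi, mean, basis):
    """Project one ROI onto the learned Eigenlips basis."""
    return basis @ (roi.ravel().astype(float) - mean)

# Example with random stand-in ROIs of 60x100 pixels per frame.
frames = np.random.default_rng(0).integers(0, 255, size=(200, 60, 100))
mean, basis = eigenlips_basis(frames)
feat = np.concatenate([dct_features(frames[0]), eigenlips_features(frames[0], mean, basis)])

In a hybrid DNN-HMM pipeline such per-frame features would typically be concatenated (often with context windows or deltas) and fed to the DNN, whose phoneme or viseme posteriors the WFST decoder then combines with the dictionary and language model to produce word hypotheses; the specific setup used in the paper is described in the full text, not in this record.
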

Book title: Proceedings of the British Machine Vision Conference
Page range: In Press
Year: 2017
Publisher: BMVA Press
Publication dates
Print: Sep 2017
Publication process dates
Deposited: 24 Aug 2017
Accepted: Jul 2017
Event: 28th British Machine Vision Conference
Additional information

© 2017 The authors

Publisher's version
License: CC BY-ND
Permalink: https://repository.uel.ac.uk/item/84qv9


Related outputs

Visual speech recognition: aligning terminologies for better understanding
Bear, Y. and Taylor, Sarah L. 2017. Visual speech recognition: aligning terminologies for better understanding. in: Proceedings of the British Machine Vision Conference. BMVA Press. In press.
Phoneme-to-viseme mappings: the good, the bad, and the ugly
Bear, Y. and Harvey, Richard 2017. Phoneme-to-viseme mappings: the good, the bad, and the ugly. Speech Communication. 95, pp. 40-67.
Resolution limits on visual speech recognition
Bear, Y., Harvey, Richard, Theobald, Barry-John and Lan, Yuxuan 2014. Resolution limits on visual speech recognition. in: IEEE International Conference on Image Processing (ICIP). IEEE.
Some observations on computer lip-reading: moving from the dream to the reality
Bear, Y., Owen, Gari, Harvey, Richard and Theobald, Barry-John 2014. Some observations on computer lip-reading: moving from the dream to the reality. Proceedings of SPIE. 9253.
Which phoneme-to-viseme maps best improve visual-only computer lip-reading?
Bear, Y., Harvey, Richard W., Theobald, Barry-John and Lan, Yuxuan 2014. Which phoneme-to-viseme maps best improve visual-only computer lip-reading? in: Bebis, George, Boyle, Richard, Parvin, Bahram, Koracin, Darko, McMahan, Ryan, Jerald, Jason, Zhang, Hui, Drucker, Steven M., Kambhamettu, Chandra, Choubassi, Maha El, Deng, Zhigang and Carlson, Mark (ed.) Advances in Visual Computing: 10th International Symposium, ISVC 2014, Las Vegas, NV, USA, December 8-10, 2014, Proceedings, Part II. Springer International Publishing.
Decoding visemes: Improving machine lip-reading
Bear, Y. and Harvey, Richard 2016. Decoding visemes: Improving machine lip-reading. in: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.
Finding phonemes: improving machine lip-reading
Bear, Y., Harvey, Richard W. and Lan, Yuxuan 2015. Finding phonemes: improving machine lip-reading. FAAVSP - The 1st Joint Conference on Facial Analysis, Animation and Auditory-Visual Speech Processing, Education Centre of the Jesuits, Vienna, Austria, 11-13 Sep 2015. International Speech Communication Association. pp. 115-120.
Speaker-independent machine lip-reading with speaker-dependent viseme classifiers
Bear, Y., Cox, Stephen J. and Harvey, Richard W. 2015. Speaker-independent machine lip-reading with speaker-dependent viseme classifiers. FAAVSP - The 1st Joint Conference on Facial Analysis, Animation, and Auditory-Visual Speech Processing, Education Centre of the Jesuits, Vienna, Austria, 11-13 Sep 2015. International Speech Communication Association. pp. 190-195.
Visual gesture variability between talkers in continuous speech
Bear, Y. 2017. Visual gesture variability between talkers in continuous speech. in: Proceedings of the British Machine Vision Conference. BMVA Press. In press.