
Published: 24 October 2022

Visual speech recognition for multiple languages in the wild

  • Pingchuan Ma (ORCID: orcid.org/0000-0003-3752-0803) 1,
  • Stavros Petridis 1,2 &
  • Maja Pantic 1,2

Nature Machine Intelligence volume 4, pages 930–939 (2022)

1896 Accesses · 44 Citations · 56 Altmetric

  • Computational biology and bioinformatics
  • Computer science
  • Human behaviour

A preprint version of the article is available at arXiv.

Visual speech recognition (VSR) aims to recognize the content of speech based on lip movements, without relying on the audio stream. Advances in deep learning and the availability of large audio-visual datasets have led to the development of much more accurate and robust VSR models than ever before. However, these advances are usually due to the larger training sets rather than the model design. Here we demonstrate that designing better models is equally as important as using larger training sets. We propose the addition of prediction-based auxiliary tasks to a VSR model, and highlight the importance of hyperparameter optimization and appropriate data augmentations. We show that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin. It even outperforms models that were trained on non-publicly available datasets containing up to 21 times more data. We show, furthermore, that using additional training data, even in other languages or with automatically generated transcriptions, results in further improvement.



Data availability

The datasets used in the current study are available from the original authors on the LRS2 ( https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html ), LRS3 ( https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html ), CMLR ( https://www.vipazoo.cn/CMLR.html ), Multilingual ( http://www.openslr.org/100 ) and CMU-MOSEAS ( http://immortal.multicomp.cs.cmu.edu/cache/multilingual ) repositories. Qualitative results and the list of cleaned videos for the training and test sets of CMU-MOSEAS and Multilingual TEDx are available on the authors’ GitHub repository ( https://mpc001.github.io/lipreader.html ).

Code availability

Pre-trained networks and testing code are available on a GitHub repository ( https://mpc001.github.io/lipreader.html ) or at Zenodo 66 under an Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) licence.

Potamianos, G., Neti, C., Gravier, G., Garg, A. & Senior, A. W. Recent advances in the automatic recognition of audiovisual speech. Proc. IEEE 91 , 1306–1326 (2003).


Dupont, S. & Luettin, J. Audio-visual speech modeling for continuous speech recognition. IEEE Trans. Multimedia 2 , 141–151 (2000).

Chung, J. S., Senior, A., Vinyals, O. & Zisserman, A. Lip reading sentences in the wild. In Proc. 30th IEEE / CVF Conference on Computer Vision and Pattern Recognition 3444–3453 (IEEE, 2017).

Afouras, T., Chung, J. S., Senior, A., Vinyals, O. & Zisserman, A. Deep audio-visual speech recognition. IEEE Trans. Pattern Anal. Mach. Intell. (2018); https://doi.org/10.1109/TPAMI.2018.2889052

Shillingford, B. et al. Large-scale visual speech recognition. In Proc. 20th Annual Conference of International Speech Communication Association 4135–4139 (ISCA, 2019).

Serdyuk, D., Braga, O. & Siohan, O. Audio-visual speech recognition is worth 32 × 32 × 8 voxels. In Proc. IEEE Automatic Speech Recognition and Understanding Workshop 796–802 (IEEE, 2021).

Zhang, X. et al. Understanding pictograph with facial features: end-to-end sentence-level lip reading of Chinese. In Proc. 33rd AAAI Conference on Artificial Intelligence 9211–9218 (AAAI, 2019).

Zhao, Y., Xu, R. & Song, M. A cascade sequence-to-sequence model for Chinese Mandarin lip reading. In Proc. 1st ACM International Conference on Multimedia in Asia 1–6 (ACM, 2019).

Ma, S., Wang, S. & Lin, X. A transformer-based model for sentence-level Chinese Mandarin lipreading. In Proc. 5th IEEE International Conference on Data Science in Cyberspace 78–81 (IEEE, 2020).

Ma, P., Petridis, S. & Pantic, M. End-to-end audio-visual speech recognition with conformers. In Proc. 46th IEEE International Conference on Acoustics , Speech and Signal Processing 7613–7617 (IEEE, 2021).

Gulati, A. et al. Conformer: convolution-augmented transformer for speech recognition. In Proc. 21st Annual Conference of International Speech Communication Association 5036–5040 (ISCA, 2020).

Makino, T. et al. Recurrent neural network transducer for audio-visual speech recognition. In Proc. IEEE Automatic Speech Recognition and Understanding Workshop 905–912 (IEEE, 2019).

McGurk, H. & MacDonald, J. Hearing lips and seeing voices. Nature 264 , 746–748 (1976).

Sumby, W. H. & Pollack, I. Visual contribution to speech intelligibility in noise. J. Acoust. Soc. Am. 26 , 212–215 (1954).

Petridis, S., Stafylakis, T., Ma, P., Tzimiropoulos, G. & Pantic, M. Audio-visual speech recognition with a hybrid CTC/attention architecture. In Proc. IEEE Spoken Language Technology Workshop 513–520 (IEEE, 2018).

Yu, J. et al. Audio-visual recognition of overlapped speech for the LRS2 dataset. In Proc. 45th IEEE International Conference on Acoustics , Speech and Signal Processing 6984–6988 (IEEE, 2020).

Yu, W., Zeiler, S. & Kolossa, D. Fusing information streams in end-to-end audio-visual speech recognition. In Proc. 46th IEEE International Conference on Acoustics , Speech and Signal Processing 3430–3434 (IEEE, 2021).

Sterpu, G., Saam, C. & Harte, N. How to teach DNNs to pay attention to the visual modality in speech recognition. IEEE / ACM Trans. Audio Speech Language Process. 28 , 1052–1064 (2020).


Afouras, T., Chung, J. S. & Zisserman, A. The conversation: deep audio-visual speech enhancement. In Proc. 19th Annual Conference of International Speech Communication Association 3244–3248 (ISCA, 2018).

Ephrat, A. et al. Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. ACM Trans. Graph. 37 , 112:1–112:11 (2018).

Yoshimura, T., Hayashi, T., Takeda, K. & Watanabe, S. End-to-end automatic speech recognition integrated with CTC-based voice activity detection. In Proc. 45th IEEE International Conference on Acoustics , Speech and Signal Processing 6999–7003 (IEEE, 2020).

Kim, Y. J. et al. Look who’s talking: active speaker detection in the wild. In Proc. 22nd Annual Conference of International Speech Communication Association 3675–3679 (ISCA, 2021).

Chung, J. S., Huh, J., Nagrani, A., Afouras, T. & Zisserman, A. Spot the conversation: speaker diarisation in the wild. In Proc. 21st Annual Conference of International Speech Communication Association 299–303 (ISCA, 2020).

Denby, B. et al. Silent speech interfaces. Speech Commun. 52 , 270–287 (2010).

Haliassos, A., Vougioukas, K., Petridis, S. & Pantic, M. Lips don’t lie: a generalisable and robust approach to face forgery detection. In Proc. 34th IEEE / CVF Conference on Computer Vision and Pattern Recognition 5039–5049 (IEEE, 2021).

Mira, R. et al. End-to-end video-to-speech synthesis using generative adversarial networks. IEEE Trans. Cybern. 1–13 (2022).

Prajwal, K., Mukhopadhyay, R., Namboodiri, V. P. & Jawahar, C. Learning individual speaking styles for accurate lip to speech synthesis. In Proc. 33rd IEEE / CVF Conference on Computer Vision and Pattern Recognition 13796–13805 (IEEE, 2020).

Dungan, L., Karaali, A. & Harte, N. The impact of reduced video quality on visual speech recognition. In Proc. 25th IEEE International Conference on Image Processing 2560–2564 (IEEE, 2018).

Bear, H. L., Harvey, R., Theobald, B.-J. & Lan, Y. Resolution limits on visual speech recognition. In Proc. 21st IEEE International Conference on Image Processing 1371–1375 (IEEE, 2014).

Geirhos, R. et al. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In Proc. 7th International Conference on Learning Representations (OpenReview, 2019).

Cheng, S. et al. Towards pose-invariant lip-reading. In Proc. 45th IEEE International Conference on Acoustics , Speech and Signal Processing 4357–4361 (IEEE, 2020).

Wand, M. & Schmidhuber, J. Improving speaker-independent lipreading with domain-adversarial training. In Proc. 18th Annual Conference of International Speech Communication Association 3662–3666 (ISCA, 2017).

Petridis, S., Wang, Y., Li, Z. & Pantic, M. End-to-end multi-view lipreading. In Proc. 28th British Machine Vision Conference (BMVA, 2017); https://doi.org/10.5244/C.31.161

Bicevskis, K. et al. Effects of mouthing and interlocutor presence on movements of visible vs. non-visible articulators. Can. Acoust. 44 , 17–24 (2016).

Šimko, J., Beňuš, Š. & Vainio, M. Hyperarticulation in Lombard speech: global coordination of the jaw, lips and the tongue. J. Acoust. Soc. Am. 139 , 151–162 (2016).

Ma, P., Petridis, S. & Pantic, M. Investigating the Lombard effect influence on end-to-end audio-visual speech recognition. In Proc. 20th Annual Conference of International Speech Communication Association 4090–4094 (ISCA, 2019).

Petridis, S., Shen, J., Cetin, D. & Pantic, M. Visual-only recognition of normal, whispered and silent speech. In Proc. 43rd IEEE International Conference on Acoustics , Speech and Signal Processing 6219–6223 (IEEE, 2018).

Heracleous, P., Ishi, C. T., Sato, M., Ishiguro, H. & Hagita, N. Analysis of the visual Lombard effect and automatic recognition experiments. Comput. Speech Language 27 , 288–300 (2013).

Efforts to acknowledge the risks of new A.I. technology. New York Times (22 October 2018); https://www.nytimes.com/2018/10/22/business/efforts-to-acknowledge-the-risks-of-new-ai-technology.html

Feathers, T. Tech Companies Are Training AI to Read Your Lips https://www.vice.com/en/article/bvzvdw/tech-companies-are-training-ai-to-read-your-lips (2021).

Liopa. https://liopa.ai . Accessed 24 November 2021.

Crawford, S. Facial recognition laws are (literally) all over the map. Wired (16 December 2019); https://www.wired.com/story/facial-recognition-laws-are-literally-all-over-the-map/

Flynn, S. 13 cities where police are banned from using facial recognition tech. Innovation & Tech Today (18 November 2020); https://innotechtoday.com/13-cities-where-police-are-banned-from-using-facial-recognition-tech/

An update on our use of face recognition. FaceBook (2 November 2021); https://about.fb.com/news/2021/11/update-on-use-of-face-recognition/

Metz, R. Amazon will block police indefinitely from using its facial-recognition software. CNN (18 May 2021); https://edition.cnn.com/2021/05/18/tech/amazon-police-facial-recognition-ban/index.html

Greene, J. Microsoft won’t sell police its facial-recognition technology, following similar moves by Amazon and IBM. Washington Post (11 June 2020) https://www.washingtonpost.com/technology/2020/06/11/microsoft-facial-recognition

Afouras, T., Chung, J. S. & Zisserman, A. LRS3-TED: a large-scale dataset for visual speech recognition. Preprint at https://arxiv.org/abs/1809.00496 (2018).

Zadeh, A. B. et al. CMU-MOSEAS: a multimodal language dataset for Spanish, Portuguese, German and French. In Proc. 2020 Conference on Empirical Methods in Natural Language Processing 1801–1812 (ACL, 2020).

Salesky, E. et al. The multilingual TEDx corpus for speech recognition and translation. In Proc. 22nd Annual Conference of International Speech Communication Association 3655–3659 (ISCA, 2021).

Valk, J. & Alumäe, T. VoxLingua107: a dataset for spoken language recognition. In Proc. IEEE Spoken Language Technology Workshop 652–658 (IEEE, 2021).

Deng, J. et al. RetinaFace: single-stage dense face localisation in the wild. In Proc. 33rd IEEE / CVF Conference on Computer Vision and Pattern Recognition 5203–5212 (IEEE, 2020).

Bulat, A. & Tzimiropoulos, G. How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3D facial landmarks). In Proc. 16th IEEE / CVF International Conference on Computer Vision 1021–1030 (IEEE, 2017).

Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proc. 3rd International Conference on Learning Representations (OpenReview, 2015).

Assael, Y., Shillingford, B., Whiteson, S. & De Freitas, N. LipNet: end-to-end sentence-level lipreading. Preprint at https://arxiv.org/abs/1611.01599 (2016).

Ma, P., Martinez, B., Petridis, S. & Pantic, M. Towards practical lipreading with distilled and efficient models. In Proc. 46th IEEE International Conference on Acoustics , Speech and Signal Processing 7608–7612 (IEEE, 2021).

Park, D. S. et al. SpecAugment: a simple data augmentation method for automatic speech recognition. In Proc. 20th Annual Conference of International Speech Communication Association 2613–2617 (ISCA, 2019).

Liu, C. et al. Improving RNN transducer based ASR with auxiliary tasks. In Proc. IEEE Spoken Language Technology Workshop 172–179 (IEEE, 2021).

Toshniwal, S., Tang, H., Lu, L. & Livescu, K. Multitask learning with low-level auxiliary tasks for encoder-decoder based speech recognition. In Proc. 18th Annual Conference of International Speech Communication Association 3532–3536 (ISCA, 2017).

Lee, J. & Watanabe, S. Intermediate loss regularization for CTC-based speech recognition. In Proc. 46th IEEE International Conference on Acoustics , Speech and Signal Processing 6224–6228 (IEEE, 2021).

Pascual, S., Ravanelli, M., Serrà, J., Bonafonte, A. & Bengio, Y. Learning problem-agnostic speech representations from multiple self-supervised tasks. In Proc. 20th Annual Conference of International Speech Communication Association 161–165 (ISCA, 2019).

Shukla, A., Petridis, S. & Pantic, M. Learning speech representations from raw audio by joint audiovisual self-supervision. In Proc. 37th International Conference on Machine Learning Workshop (PMLR, 2020).

Ma, P., Mira, R., Petridis, S., Schuller, B. W. & Pantic, M. LiRA: learning visual speech representations from audio through self-supervision. In Proc. 22nd Annual Conference of International Speech Communication Association 3011–3015 (ISCA, 2021).

Serdyuk, D., Braga, O. & Siohan, O. Transformer-based video front-ends for audio-visual speech recognition for single and multi-person video. In Proc. 23rd Annual Conference of International Speech Communication Association 2833–2837 (ISCA, 2022).

Watanabe, S. et al. ESPnet: End-to-end speech processing toolkit. In Proc. 19th Annual Conference of International Speech Communication Association 2207–2211 (ISCA, 2018).

Kingma, D. & Ba, J. Adam: a method for stochastic optimization. In Proc. 2nd International Conference on Learning Representations (OpenReview, 2014).

Ma, P., Petridis, S. & Pantic, M. mpc001/Visual_Speech_Recognition_for_Multiple_Languages: visual speech recognition for multiple languages. Zenodo https://doi.org/10.5281/zenodo.7065080 (2022).

Afouras, T., Chung, J. S. & Zisserman, A. ASR is all you need: cross-modal distillation for lip reading. In Proc. 45th IEEE International Conference on Acoustics , Speech and Signal Processing 2143–2147 (IEEE, 2020).

Ren, S., Du, Y., Lv, J., Han, G. & He, S. Learning from the master: distilling cross-modal advanced knowledge for lip reading. In Proc. 34th IEEE / CVF Conference on Computer Vision and Pattern Recognition 13325–13333 (IEEE, 2021).

Zhao, Y. et al. Hearing lips: improving lip reading by distilling speech recognizers. In Proc. 34th AAAI Conference on Artificial Intelligence 6917–6924 (AAAI, 2020).

Download references

Acknowledgements

All training, testing and ablation studies were conducted at Imperial College London.

Author information

Authors and Affiliations

Imperial College London, London, UK

Pingchuan Ma, Stavros Petridis & Maja Pantic

Meta AI, London, UK

Stavros Petridis & Maja Pantic


Contributions

The code was written by P.M., and the experiments were conducted by P.M. and S.P. The manuscript was written by P.M., S.P. and M.P. M.P. supervised the entire project.

Corresponding author

Correspondence to Pingchuan Ma .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Peer review

Peer review information.

Nature Machine Intelligence thanks Joon Son Chung, Olivier Siohan and Mingli Song for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary information.

Supplementary text, Fig. 1, Tables 1–28 and references.

Supplementary Video 1

A demo of visual speech recognition for multiple languages.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article.

Ma, P., Petridis, S. & Pantic, M. Visual speech recognition for multiple languages in the wild. Nat. Mach. Intell. 4, 930–939 (2022). https://doi.org/10.1038/s42256-022-00550-z


Received: 22 February 2022

Accepted: 13 September 2022

Published: 24 October 2022

Issue Date: November 2022

DOI: https://doi.org/10.1038/s42256-022-00550-z


This article is cited by

Continuous lipreading based on acoustic temporal alignments

  • David Gimeno-Gómez
  • Carlos-D. Martínez-Hinarejos

EURASIP Journal on Audio, Speech, and Music Processing (2024)

Audio-guided self-supervised learning for disentangled visual speech representations

  • Shuang Yang

Frontiers of Computer Science (2024)

Research of ReLU output device in ternary optical computer based on parallel fully connected layer

  • Huaqiong Ma

The Journal of Supercomputing (2024)

Sla-former: conformer using shifted linear attention for audio-visual speech recognition

Complex & Intelligent Systems (2024)

3D facial animation driven by speech-video dual-modal signals

  • Zhouzhou Liao


October 10, 2023

Real-time Audio-visual Speech Recognition

by Team PyTorch

Audio-Visual Speech Recognition (AV-ASR, or AVSR) is the task of transcribing text from audio and visual streams, which has recently attracted a lot of research attention due to its robustness to noise. The vast majority of work to date has focused on developing AV-ASR models for non-streaming recognition; studies on streaming AV-ASR are very limited.

We have developed a compact real-time speech recognition system based on TorchAudio, a library for audio and signal processing with PyTorch . It can run locally on a laptop with high accuracy without accessing the cloud. Today, we are releasing the real-time AV-ASR recipe under a permissive open license (BSD-2-Clause license), enabling a broad set of applications and fostering further research on audio-visual models for speech recognition.

This work is part of our approach to AV-ASR research . A promising aspect of this approach is its ability to automatically annotate large-scale audio-visual datasets, which enables the training of more accurate and robust speech recognition systems. Furthermore, this technology has the potential to run on smart devices since it achieves the latency and memory efficiency that such devices require for inference.

In the future, speech recognition systems are expected to power applications in numerous domains. One of the primary applications of AV-ASR is to enhance the performance of ASR in noisy environments. Since visual streams are not affected by acoustic noise, integrating them into an audio-visual speech recognition model can compensate for the performance drop of ASR models. Our AV-ASR system has the potential to serve multiple purposes beyond speech recognition, such as text summarization, translation and even text-to-speech conversion. Moreover, the exclusive use of VSR can be useful in certain scenarios, e.g. where speaking is not allowed, in meetings, and where privacy in public conversations is desired.

Fig. 1: The pipeline for the audio-visual speech recognition system.

Our real-time AV-ASR system is presented in Fig. 1. It consists of three components: a data collection module, a pre-processing module and an end-to-end model. The data collection module comprises hardware devices, such as a microphone and camera, and its role is to collect information from the real world. Once the information is collected, the pre-processing module locates and crops out the face. Next, we feed the raw audio stream and the pre-processed video stream into our end-to-end model for inference.

Data collection

We use torchaudio.io.StreamReader to capture audio/video from a streaming device input, e.g. the microphone and camera on a laptop. Once the raw video and audio streams are collected, the pre-processing module locates and crops the face. It should be noted that the data is immediately deleted during the streaming process.
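As a rough illustration, the snippet below sketches chunked audio/video reading with torchaudio.io.StreamReader. The source string, chunk sizes and stream parameters are assumptions chosen for the example rather than the values used in the released recipe.

```python
from torchaudio.io import StreamReader

# A sketch of chunked audio/video capture. We read from a file here; live device
# capture needs an OS-specific source and format (e.g. src="0:0",
# format="avfoundation" on macOS), which depends on your platform and hardware.
streamer = StreamReader(src="input.mp4")
streamer.add_basic_audio_stream(frames_per_chunk=640, sample_rate=16_000)           # 40 ms of 16 kHz audio
streamer.add_basic_video_stream(frames_per_chunk=1, frame_rate=25, format="rgb24")  # one 40 ms video frame

for audio_chunk, video_chunk in streamer.stream():
    # audio_chunk: (frames, channels) float tensor; video_chunk: (frames, 3, H, W) uint8 tensor.
    # Either chunk may be None if that stream has not yet produced enough frames.
    pass  # hand the chunks to the pre-processing module and the model
```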

Pre-processing

Before feeding the raw stream into our model, each video sequence has to undergo a specific pre-processing procedure, which involves three critical steps. The first step is face detection. Following that, each individual frame is aligned to a reference frame, commonly known as the mean face, in order to normalize rotation and size differences across frames. The final step of the pre-processing module is to crop the face region from the aligned face image. We would like to clearly note that our model is fed with raw audio waveforms and pixels of the face, without any further pre-processing such as face parsing or landmark detection. An example of the pre-processing procedure is illustrated in Table 1.

Table 1: Pre-processing pipeline (0. original → 1. detection → 2. alignment → 3. crop).
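The three pre-processing steps can be sketched as follows; detect_landmarks, mean_face and the crop size are placeholders for the sake of illustration, not the components of the released pipeline.

```python
import numpy as np
import cv2  # OpenCV is assumed to be available for the alignment and crop steps

def preprocess_frame(frame, detect_landmarks, mean_face, crop_size=96):
    """Illustrative per-frame pre-processing: detect -> align to the mean face -> crop."""
    # 1. Face detection / landmark localisation (placeholder callable).
    landmarks = np.asarray(detect_landmarks(frame), dtype=np.float32)      # (N, 2) points

    # 2. Align the frame to the reference "mean face" with a similarity transform.
    matrix, _ = cv2.estimateAffinePartial2D(landmarks, mean_face.astype(np.float32))
    aligned = cv2.warpAffine(frame, matrix, (frame.shape[1], frame.shape[0]))

    # 3. Crop a fixed-size region centred on the reference face.
    cx, cy = mean_face.mean(axis=0).astype(int)
    half = crop_size // 2
    return aligned[cy - half:cy + half, cx - half:cx + half]
```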

Fig. 2: The architecture for the audio-visual speech recognition system.

We consider two configurations: Small with 12 Emformer blocks and Large with 28, with 34.9M and 383.3M parameters, respectively. Each AV-ASR model comprises front-end encoders, a fusion module, an Emformer encoder and a transducer model. Specifically, we use convolutional front-ends to extract features from the raw audio waveforms and facial images. The features are concatenated to form 1,024-dimensional features, which are then passed through a two-layer multi-layer perceptron and an Emformer transducer model. The entire network is trained using the RNN-T loss. The architecture of the proposed AV-ASR model is illustrated in Fig. 2.
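A minimal sketch of this fusion-plus-Emformer design is shown below using torchaudio.models.Emformer. The linear front-ends, feature sizes and Emformer hyperparameters are illustrative assumptions, not the released configuration; in the real model the front-ends are convolutional and the encoder output feeds an RNN-T predictor and joiner trained with the RNN-T loss.

```python
import torch
from torch import nn
from torchaudio.models import Emformer

class AVFusionEncoder(nn.Module):
    """Sketch of audio-visual feature fusion followed by an Emformer encoder."""

    def __init__(self, audio_dim=512, video_dim=512, fused_dim=1024):
        super().__init__()
        self.audio_frontend = nn.Linear(80, audio_dim)       # stand-in for the audio conv front-end
        self.video_frontend = nn.Linear(96 * 96, video_dim)  # stand-in for the visual conv front-end
        self.fusion_mlp = nn.Sequential(                     # two-layer MLP over concatenated features
            nn.Linear(audio_dim + video_dim, fused_dim), nn.ReLU(),
            nn.Linear(fused_dim, fused_dim),
        )
        # A streaming configuration would additionally set left/right context lengths.
        self.encoder = Emformer(input_dim=fused_dim, num_heads=8, ffn_dim=2048,
                                num_layers=12, segment_length=16)

    def forward(self, audio_feats, video_feats, lengths):
        # audio_feats: (B, T, 80), video_feats: (B, T, 96 * 96), lengths: (B,)
        fused = torch.cat([self.audio_frontend(audio_feats),
                           self.video_frontend(video_feats)], dim=-1)
        return self.encoder(self.fusion_mlp(fused), lengths)  # (output, output_lengths)
```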

Datasets. We follow Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels and use the publicly available audio-visual datasets LRS3, VoxCeleb2 and AVSpeech for training. We do not use mouth ROIs, facial landmarks or attributes during either the training or the testing stage.

Comparisons with the state-of-the-art. Non-streaming evaluation results on LRS3 are presented in Table 2. Our audio-visual model with an algorithmic latency of 800 ms (160 ms + 1,280 ms × 0.5) yields a WER of 1.3%, which is on par with those achieved by state-of-the-art offline models such as AV-HuBERT, RAVEn and Auto-AVSR.

Model        Training data (h)   WER (%)
ViT3D-CM     90,000              1.6
AV-HuBERT    1,759               1.4
RAVEn        1,759               1.4
Auto-AVSR    3,448               0.9
Ours         3,068               1.3

Table 2: Non-streaming evaluation results for audio-visual models on the LRS3 dataset.

Noisy experiments. During training, 16 different noise types are randomly injected into the audio waveforms: 13 types from the DEMAND database ('DLIVING', 'DKITCHEN', 'OMEETING', 'OOFFICE', 'PCAFETER', 'PRESTO', 'PSTATION', 'STRAFFIC', 'SPSQUARE', 'SCAFE', 'TMETRO', 'TBUS' and 'TCAR'), two further types from the Speech Commands database (white and pink noise) and one type from the NOISEX-92 database (babble noise). SNR levels are sampled uniformly from [clean, 7.5 dB, 2.5 dB, -2.5 dB, -7.5 dB]. Results of the ASR and AV-ASR models when tested with babble noise are shown in Table 3. As the noise level increases, the performance advantage of our audio-visual model over our audio-only model grows, indicating that incorporating visual data improves noise robustness.

Model   WER (%), from clean (left) to the lowest SNR tested (right)
A       1.6    1.8    3.2    10.9   27.9   55.5
A+V     1.6    1.7    2.1    6.2    11.7   27.6

Table 3: Streaming evaluation WER (%) results at various signal-to-noise ratios for our audio-only (A) and audio-visual (A+V) models on the LRS3 dataset under a 0.80-second latency constraint.
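The noise injection described above boils down to scaling a noise waveform so that the mixture reaches a target SNR. A minimal sketch is given below (recent torchaudio releases also provide a comparable add_noise utility):

```python
import torch

def add_noise_at_snr(speech: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Mix `noise` into `speech` (both 1-D waveforms) at the target SNR in dB."""
    if noise.numel() < speech.numel():                 # loop the noise if it is too short
        repeats = -(-speech.numel() // noise.numel())  # ceiling division
        noise = noise.repeat(repeats)
    noise = noise[: speech.numel()]

    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-10)
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```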

Real-time factor. The real-time factor (RTF) is an important measure of a system's ability to process real-time tasks efficiently. An RTF value of less than 1 indicates that the system meets real-time requirements. We measure RTF using a laptop with an Intel® Core™ i7-12700 CPU running at 2.70 GHz and an NVIDIA GeForce RTX 3070 Ti GPU. To the best of our knowledge, this is the first AV-ASR model that reports RTFs on the LRS3 benchmark. The Small model achieves a WER of 2.6% and an RTF of 0.87 on CPU (Table 4), demonstrating its potential for real-time on-device inference applications.

Model   Device   WER (%)   RTF
Large   GPU      1.6       0.35
Small   GPU      2.6       0.33
Small   CPU      2.6       0.87

Table 4: Impact of AV-ASR model size and device on WER and RTF. Note that the RTF calculation includes the pre-processing step, wherein the Ultra-Lightweight Face Detection Slim 320 model is used to generate face bounding boxes.
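RTF itself is simple to measure: processing time divided by the duration of the input. A sketch follows, with model as a placeholder callable; timing on a GPU would additionally require torch.cuda.synchronize() calls around the measurement.

```python
import time
import torch

def real_time_factor(model, audio, video, sample_rate=16_000):
    """Return processing time divided by input duration; RTF < 1 means faster than real time."""
    duration = audio.shape[-1] / sample_rate          # seconds of audio in the input
    start = time.perf_counter()
    with torch.inference_mode():
        model(audio, video)                           # placeholder forward pass
    return (time.perf_counter() - start) / duration
```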

Learn more about the system from the published works below:

  • Shi, Yangyang, Yongqiang Wang, Chunyang Wu, Ching-Feng Yeh, Julian Chan, Frank Zhang, Duc Le, and Mike Seltzer. “Emformer: Efficient memory transformer based acoustic model for low latency streaming speech recognition.” In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6783-6787. IEEE, 2021.
  • Ma, Pingchuan, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, and Maja Pantic. “Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels.” In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1-5. IEEE, 2023.



Enhancing CTC-Based Visual Speech Recognition

11 Sep 2024 · Hendrik Laux, Anke Schmeink

This paper presents LiteVSR2, an enhanced version of our previously introduced efficient approach to Visual Speech Recognition (VSR). Building upon our knowledge distillation framework from a pre-trained Automatic Speech Recognition (ASR) model, we introduce two key improvements: a stabilized video preprocessing technique and feature normalization in the distillation process. These improvements yield substantial performance gains on the LRS2 and LRS3 benchmarks, positioning LiteVSR2 as the current best CTC-based VSR model without increasing the volume of training data or computational resources utilized. Furthermore, we explore the scalability of our approach by examining performance metrics across varying model complexities and training data volumes. LiteVSR2 maintains the efficiency of its predecessor while significantly enhancing accuracy, thereby demonstrating the potential for resource-efficient advancements in VSR technology.
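As a loose illustration of what feature normalization in a distillation objective can look like (the exact LiteVSR2 recipe is not described here, so this is only an assumption), the sketch below L2-normalises teacher and student feature sequences before a regression loss:

```python
import torch.nn.functional as F

def normalized_distillation_loss(student_feats, teacher_feats):
    """L2-normalise both feature sequences, then regress the student onto the teacher."""
    student = F.normalize(student_feats, dim=-1)
    teacher = F.normalize(teacher_feats.detach(), dim=-1)  # teacher (ASR) features stay frozen
    return F.mse_loss(student, teacher)
```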




A PyTorch implementation of the Deep Audio-Visual Speech Recognition paper.

smeetrs/deep_avsr


Deep Audio-Visual Speech Recognition

The repository contains a PyTorch reproduction of the TM-CTC model from the Deep Audio-Visual Speech Recognition paper. We train three models - Audio-Only (AO), Video-Only (VO) and Audio-Visual (AV), on the LRS2 dataset for the speech-to-text transcription task.

Requirements

System packages:

Python packages:

CUDA 10.0 (if NVIDIA GPU is to be used):

Project Structure

The structure of the audio_only, video_only and audio_visual directories is as follows:

Directories

/checkpoints : Temporary directory to store intermediate model weights and plots while training. Gets automatically created.

/data : Directory containing the LRS2 Main and Pretrain dataset class definitions and other required data-related utility functions.

/final : Directory to store the final trained model weights and plots. If available, place the pre-trained model weights in the models subdirectory.

/models : Directory containing the class definitions for the models.

/utils : Directory containing function definitions for calculating CER/WER, greedy search/beam search decoders and preprocessing of data samples. Also contains functions to train and evaluate the model.

checker.py : File containing checker/debug functions for testing all the modules and the functions in the project as well as any other checks to be performed.

config.py : File to set the configuration options and hyperparameter values.

demo.py : Python script for generating predictions with the specified trained model for all the data samples in the specified demo directory.

preprocess.py : Python script for preprocessing all the data samples in the dataset.

pretrain.py : Python script for pretraining the model on the pretrain set of the LRS2 dataset using curriculum learning.

test.py : Python script to test the trained model on the test set of the LRS2 dataset.

train.py : Python script to train the model on the train set of the LRS2 dataset.

We provide the Word Error Rate (WER) achieved by the models on the test set of the LRS2 dataset with both Greedy Search and Beam Search (with Language Model) decoding techniques. We have tested with clean audio and with noisy audio (0 dB SNR). We also give the WER for cases where only one of the modalities is used in the Audio-Visual model.

Operation Mode   Modality   AO/VO Model             AV Model
                            Greedy    Beam (+LM)    Greedy    Beam (+LM)
Clean Audio      AO         11.4%     8.3%          12.0%     8.2%
Clean Audio      VO         61.8%     55.3%         56.3%     49.2%
Clean Audio      AV         -         -             10.3%     6.8%
Noisy Audio      AO         62.5%     54.0%         59.0%     50.7%
Noisy Audio      AV         -         -             29.1%     22.1%
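For reference, WER is the word-level Levenshtein (edit) distance between the hypothesis and the reference transcript, divided by the number of reference words. The repository's own implementation lives in /utils; the stand-alone version below is only an illustrative sketch.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: word_error_rate("the cat sat", "the cat sat down") == 1 / 3
```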

Pre-trained Weights

Download the pre-trained weights for the Visual Frontend, AO, VO, AV and Language model from here .

Once the Visual Frontend and Language Model weights are downloaded, place them in a folder and add their paths in the config.py file. Place the weights of AO, VO and AV model in their corresponding /final/models directory.

If planning to train the models, download the complete LRS2 dataset from here or in cases of custom datasets, have the specifications and folder structure similar to LRS2 dataset.

Steps have been provided to either train the models or to use the trained models directly for inference:

Set the configuration options in the config.py file before each of the following steps as required. Comments have been provided for each option. Also, check the Training Details section below as a guide for training the models from scratch.

Run the preprocess.py script to preprocess and generate the required files for each sample.

Run the pretrain.py script for one iteration of curriculum learning. Run it multiple times, each time changing the PRETRAIN_NUM_WORDS argument in the config.py file to perform multiple iterations of curriculum learning.

Run the train.py script to finally train the model on the train set.

Once the model is trained, run the test.py script to obtain the performance of the trained model on the test set.

Run the demo.py script to use the model to make predictions for each sample in a demo directory. Read the specifications for the sample in the demo.py file.

Set the configuration options in the config.py file. Comments have been provided for each option.

Important Training Details

We perform iterations of Curriculum Learning by changing the PRETRAIN_NUM_WORDS config option. The number of words used in each iteration of curriculum learning is as follows: 1,2,3,5,7,9,13,17,21,29,37, i.e., 11 iterations in total.

During Curriculum Learning, the minibatch size (default=32) is reduced by half each time we hit an Out Of Memory error.
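One way to implement this halving-on-OOM behaviour is sketched below; train_one_epoch is a placeholder for the project's actual training loop, so treat this as an assumption-laden illustration rather than the repository's code.

```python
import torch

def train_with_oom_backoff(train_one_epoch, batch_size, min_batch_size=1):
    """Halve the minibatch size whenever a CUDA out-of-memory error is raised."""
    while batch_size >= min_batch_size:
        try:
            return train_one_epoch(batch_size)        # placeholder for the real training loop
        except RuntimeError as err:
            if "out of memory" not in str(err):
                raise                                 # re-raise unrelated errors
            torch.cuda.empty_cache()                  # release cached blocks before retrying
            batch_size //= 2
            print(f"CUDA OOM, retrying with batch size {batch_size}")
    raise RuntimeError("Out of memory even at the minimum batch size")
```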

In each iteration, the training is terminated forcefully once the validation set WER flattens. We also make sure that the Learning Rate has decreased to the minimum value before terminating the training.

We train the AO and VO models first. We then initialize the AV model with weights from the trained AO and VO models as follows: AO Audio Encoder → AV Audio Encoder, VO Video Encoder → AV Video Encoder, VO Video Decoder → AV Joint Decoder.
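One way to perform such a selective weight transfer is sketched below. The sub-module prefixes (audioEncoder, videoEncoder, videoDecoder, jointDecoder) are hypothetical names and must be replaced with the identifiers used in the actual class definitions; the checkpoints are assumed to be plain state dicts.

```python
import torch

def init_av_from_ao_vo(av_model, ao_ckpt_path, vo_ckpt_path):
    """Copy AO/VO weights into the corresponding AV sub-modules before AV pretraining."""
    ao_weights = torch.load(ao_ckpt_path, map_location="cpu")
    vo_weights = torch.load(vo_ckpt_path, map_location="cpu")

    av_state = av_model.state_dict()
    for name in av_state:
        if name.startswith("audioEncoder.") and name in ao_weights:
            av_state[name] = ao_weights[name]                   # AO audio encoder -> AV audio encoder
        elif name.startswith("videoEncoder.") and name in vo_weights:
            av_state[name] = vo_weights[name]                   # VO video encoder -> AV video encoder
        elif name.startswith("jointDecoder."):
            vo_name = name.replace("jointDecoder.", "videoDecoder.", 1)
            if vo_name in vo_weights:
                av_state[name] = vo_weights[vo_name]            # VO video decoder -> AV joint decoder
    av_model.load_state_dict(av_state)
```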

The weights of the Audio and Video Encoders are fixed during AV model pretraining. The complete AV model is trained on the train set after the pretraining is complete.

We have used a GPU with 11 GB memory for our training. Each model took around 7 days for complete training.

The pre-trained weights of the Visual Frontend and the Language Model have been obtained from the GitHub repository accompanying Afouras, T. and Chung, J., Deep Lip Reading: A Comparison of Models and an Online Application (2018).

The CTC beam search implementation is adapted from Harald Scheidl's CTC Decoding Algorithms GitHub repository.



What is speech recognition?


Speech recognition is a capability that enables a program or an app to process human speech, a.k.a. what you are saying, into a written format.


It is often confused with voice recognition – the key difference is that speech recognition is used to understand words in spoken language, whilst voice recognition is a biometric technology for identifying an individual’s voice.

Although there are many speech recognition applications and devices available, the more advanced solutions are now using artificial intelligence (AI) and machine learning. These integrate grammar, syntax, structure, and composition of audio and voice signals to understand and process your speech. Some also allow organizations to customize and adapt the technology to their specific requirements (more about this later in this article).

Thanks to the ever-growing use of portable devices such as smartphones, tiny microphones and dictaphones, speech recognition software has entered all aspects of our business and everyday lives. Examples include virtual assistants, like Siri or Alexa, that enable us to command our devices just by talking, and voice search, which allows users to input voice-based search queries.

However, the most significant area as far as business users are concerned is speech to text software. This area is growing rapidly, due in no small part to the availability of cloud-based solutions that are enabling users to access speech to text apps from their smartphones or tablets.

How do speech recognition systems work?

Speech recognition in Windows 10 is a powerful accessibility feature that allows users to control their computer using voice commands. This technology enables you to dictate text, navigate applications, and perform various tasks without needing a keyboard or mouse.

Philips SpeechLive is a cloud-based dictation solution that seamlessly integrates with Windows 10. You can use it with any of your favorite office applications, like Word, Outlook or even Salesforce.

The increasing role of AI

Artificial intelligence (AI) and machine learning methods like deep learning and neural networks are becoming more common in advanced speech recognition software. AI can be used to address common challenges to speech recognition technology. For example:

  • Regional accents and dialects: It can sometimes be difficult to understand what someone with a strong dialect is saying, but AI can assist in detecting the various nuances.
  • Context: Homophones are words that have the same or similar sounds, but different meanings. A simple example is “sell” and “cell.” Once again, AI can help in differentiating.  

Importance of the cloud

Given the explosive growth of hybrid and remote working, there’s an increasing requirement for workers to have access to speech to text capabilities anywhere, at any time and the cloud option delivers this.

A cloud-based solution, such as Philips SpeechLive utilizing the Dragon Professional Anywhere software, enables authors to access fully featured versions of speech to text apps from their devices, irrespective of their location. Transcripts can be shared with other team members, allowing them to add comments or sign-off the contents. And because the software is cloud-based, these changes/additions can be made from anywhere.

Selecting the right speech recognition solution

A key factor in speech recognition technology is its accuracy rate. There is little merit in using speech recognition for input purposes if the resultant document is littered with errors. Fortunately, the use of AI has enabled some speech recognition solutions to achieve accuracy rates as high as 99%.

The need to address user mobility is also an important factor to consider. Will you need access to the speech recognition capabilities whilst working from home or in remote locations? If so, then the availability of a mobile app capable of supporting both Android and iOS devices is essential.

And finally, one of the most important considerations is the degree to which the speech recognition software allows for customization. Think for a moment about all of the industry-specific terms, acronyms, phrases or jargon used in sectors as diverse as legal, healthcare and financial services. It’s vital that the software has the capability to recognize these, whether it is trained to do so, or has the capability to import custom word lists that might already exist.

Potential benefits

Here are just a few of the benefits you can expect to achieve if you select a feature-rich speech recognition solution such as Philips SpeechLive to meet your requirements:

  • Speedier document production - talking is much faster than typing, allowing users to dictate a document roughly three times faster than they can type it.
  • Reduce repetitive tasks - freeing up time so that professionals can focus on other things. For example, fee earners in legal firms have found that using Philips SpeechLive results in spending less time on support activities such as document creation and editing, which in turn allows them to spend more time with clients and enables a greater focus on work that is directly billable. Similarly, within a medical environment, automating the processes involved in generating clinical documentation means that healthcare professionals can spend more time on patient treatment.
  • Another cloud-based benefit is the potential integration with other business apps.

For example, linking speech to text capabilities with other cloud apps such as workflow management and document management can provide a number of significant benefits such as streamlining document-related processes and providing a clear digital audit trail for all dictations.

Unsure if speech recognition is the correct tool for you? Find out by taking our easy 5-step quiz.



Computer Science > Computer Vision and Pattern Recognition

Title: Visual Speech Recognition

Abstract: Lip reading is used to understand or interpret speech without hearing it, a technique especially mastered by people with hearing difficulties. The ability to lip read enables a person with a hearing impairment to communicate with others and to engage in social activities, which otherwise would be difficult. Recent advances in the fields of computer vision, pattern recognition, and signal processing have led to a growing interest in automating this challenging task of lip reading. Indeed, automating the human ability to lip read, a process referred to as visual speech recognition (VSR) (or sometimes speech reading), could open the door for other novel related applications. VSR has received a great deal of attention in the last decade for its potential use in applications such as human-computer interaction (HCI), audio-visual speech recognition (AVSR), speaker recognition, talking heads, sign language recognition and video surveillance. Its main aim is to recognise spoken word(s) by using only the visual signal that is produced during speech. Hence, VSR deals with the visual domain of speech and involves image processing, artificial intelligence, object detection, pattern recognition, statistical modelling, etc.
Comments: Speech and Language Technologies (Book), Prof. Ivo Ipsic (Ed.), ISBN: 978-953-307-322-4, InTech (2011)
Subjects: Computer Vision and Pattern Recognition (cs.CV)
