Multi-Task Learning for E2E ASR Word and Utterance Confidence

David Qiu

Yanzhang (Ryan) He

Qiujia Li

Yu Zhang

Liangliang Cao

Ian Carmichael McGraw

Interspeech (2021)

Abstract

Confidence scores are very useful for downstream applications of automatic speech recognition (ASR) systems. Recent works have proposed using neural attention models to learn word or utterance confidence scores for end-to-end (E2E) ASR. On its own, word confidence does not model deletions, and utterance confidence discards much of the useful word-level training signal. This paper studies the effect of adding an utterance-level loss and an individual deletion loss to the framework proposed in [1]. Empirical results show that multi-task learning with all three objectives improves confidence metrics (NCE, AUC, RMSE) without the need for increasing the model size of the transformer feature extractor. Using the utterance-level confidence for rescoring also decreases the word error rates on Google’s Voice Search and long-tail datasets by 3-5% relative.
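For readers unfamiliar with the confidence metrics named in the abstract, the following is a minimal sketch of two of them, normalized cross entropy (NCE) and root mean squared error (RMSE), using their standard textbook definitions. It is not taken from the paper: the function names and the toy inputs are hypothetical, and the paper's exact evaluation setup may differ. Each confidence is assumed to be a probability in [0, 1], and each label is 1 for a correct word and 0 for an incorrect one.

```python
import math


def normalized_cross_entropy(confidences, correct, eps=1e-12):
    """NCE of confidence scores against binary correctness labels.

    Higher is better; 0 means no better than always predicting the
    average correctness rate. Assumes both classes occur in `correct`.
    """
    n = len(confidences)
    p_c = sum(correct) / n  # empirical fraction of correct words
    if p_c in (0.0, 1.0):
        raise ValueError("NCE is undefined when all labels are identical")
    # Entropy of the labels alone (the constant-prediction baseline).
    h_base = -(p_c * math.log(p_c) + (1 - p_c) * math.log(1 - p_c))
    # Cross entropy of the labels under the predicted confidences.
    h_model = -sum(
        c * math.log(max(p, eps)) + (1 - c) * math.log(max(1 - p, eps))
        for p, c in zip(confidences, correct)
    ) / n
    return (h_base - h_model) / h_base


def rmse(confidences, correct):
    """Root mean squared error between confidences and 0/1 labels."""
    n = len(confidences)
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(confidences, correct)) / n)
```

Under these definitions, a predictor that always outputs the average correctness rate scores an NCE of zero, and a predictor whose confidences track the labels more closely scores a higher NCE and a lower RMSE.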
