– Speech recognition systems have been improved through unsupervised pre-training techniques.
– Pre-trained audio encoders lack an equally performant decoder, so they still require fine-tuning for speech recognition.
– Fine-tuning can lead to overfitting on the fine-tuning dataset and limited generalization to other datasets.
– Existing supervised datasets for speech recognition are limited in size.
– Weakly supervised pre-training with larger datasets improves robustness and generalization.
– This paper introduces Whisper, a weakly supervised speech recognition approach.
– Whisper scales weakly supervised pre-training to 680,000 hours of labeled audio data.
– Models trained with Whisper transfer well to existing datasets without fine-tuning.
– Whisper focuses on multilingual and multitask training, covering 96 languages.
– The paper releases inference code and models for further research on robust speech recognition; a minimal usage sketch follows this list.
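The bullets above mention zero-shot transfer and released inference code; the sketch below shows how such a pre-trained checkpoint could be used for transcription without any fine-tuning. It is a minimal illustration, assuming the released code corresponds to the openai-whisper Python package (installed via `pip install -U openai-whisper`); the model size ("small") and the file name audio.wav are placeholders, not taken from the paper.

```python
# Minimal sketch: zero-shot transcription with a pre-trained Whisper checkpoint.
# Assumes the openai-whisper package and ffmpeg are installed; "audio.wav" is a placeholder.
import whisper

# Load a pre-trained checkpoint as-is (no fine-tuning); sizes range from "tiny" to "large".
model = whisper.load_model("small")

# Transcribe an audio file; the spoken language is detected automatically if not specified.
result = model.transcribe("audio.wav")
print(result["text"])

# The same multitask model can also translate non-English speech into English text.
translated = model.transcribe("audio.wav", task="translate")
print(translated["text"])
```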

– Speech processing models are trained on large amounts of transcripts collected from the internet.
– Scaling weakly supervised pre-training with large, diverse supervised datasets improves robustness and zero-shot transfer performance, without the need for self-supervision or self-training techniques.
– Trained on 680,000 hours of weakly supervised data, the models generalize well to standard benchmarks and achieve competitive results without fine-tuning, approaching human accuracy and robustness; a zero-shot evaluation sketch follows this list.
– Models and inference code are released to support further research on robust speech processing.
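To make the zero-shot claim concrete, here is a minimal evaluation sketch, assuming the openai-whisper and jiwer packages; the benchmark_samples list, file names, and reference transcripts are hypothetical placeholders, and the paper's more careful text normalization is simplified to lowercasing here.

```python
# Sketch of zero-shot benchmark evaluation: transcribe audio with a pre-trained
# checkpoint (no fine-tuning) and score word error rate (WER) against references.
# The sample list below is a hypothetical placeholder for a real benchmark.
import whisper
from jiwer import wer

model = whisper.load_model("small")  # pre-trained checkpoint, used as-is

benchmark_samples = [
    ("sample_0001.wav", "the quick brown fox jumps over the lazy dog"),
    ("sample_0002.wav", "speech recognition systems should be robust"),
]

references, hypotheses = [], []
for audio_path, reference in benchmark_samples:
    result = model.transcribe(audio_path, language="en")
    references.append(reference)
    # Lowercase as a crude stand-in for the paper's text normalization step.
    hypotheses.append(result["text"].strip().lower())

print(f"zero-shot WER: {wer(references, hypotheses):.3f}")
```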