OpenAI Whisper is an automatic speech recognition system, but what can it do?
Updated: March 3, 2023 2:01 PM
OpenAI, the research company known for AI models such as the chatbot ChatGPT and the image generator DALL-E 2, also released a speech recognition model called Whisper in September 2022.
Whisper was largely overshadowed by the hype around those other OpenAI releases.
Whisper is an automatic speech recognition system that can transcribe and translate audio files in approximately 100 different languages from around the world.
The largest version of the model has roughly 1.55 billion parameters, and Whisper was trained on a huge volume of data: about 680,000 hours of audio collected from the web. It shows strong zero-shot performance across a wide range of automatic speech recognition tasks.
Whisper AI training
One feature that distinguishes Whisper from other state-of-the-art automatic speech recognition (ASR) models is its training approach. Rather than being fine-tuned on a curated reference dataset, it is trained with “weak” supervision on a large, noisy dataset of speech audio collected from the internet and paired with transcripts.
According to OpenAI, the developer of Whisper, this training approach has produced a model that generalizes well and delivers strong zero-shot performance, meaning it performs well on datasets it was not specifically fine-tuned for.
The field of artificial intelligence is making significant advances in speech processing tasks such as multilingual speech recognition, voice activity detection, spoken language identification, and speech translation. This technology is advancing rapidly and is applied to a wide range of use cases.
Technical architecture
Whisper employs an encoder-decoder architecture: the input audio is split into 30-second segments, each segment is converted to a log-Mel spectrogram, and the spectrogram is fed to an encoder.
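The segmentation step can be illustrated with a minimal sketch. It is not Whisper's own code: the function name split_into_segments is made up for illustration, the 16 kHz sample rate is an assumption not stated in this article, and the log-Mel conversion that follows segmentation is left out.

```python
SAMPLE_RATE = 16_000   # assumption: 16,000 audio samples per second
SEGMENT_SECONDS = 30   # from the article: 30-second segments
SEGMENT_SAMPLES = SAMPLE_RATE * SEGMENT_SECONDS  # 480,000 samples per segment


def split_into_segments(samples, segment_samples=SEGMENT_SAMPLES):
    """Split a sequence of audio samples into fixed-length segments.

    The final segment is padded with zeros (silence) so that every
    segment has exactly `segment_samples` samples.
    """
    segments = []
    for start in range(0, len(samples), segment_samples):
        chunk = list(samples[start:start + segment_samples])
        # Pad a short final chunk with silence.
        chunk.extend([0.0] * (segment_samples - len(chunk)))
        segments.append(chunk)
    return segments


# Hypothetical usage: 70 seconds of silence becomes three segments.
audio = [0.0] * (70 * SAMPLE_RATE)
segments = split_into_segments(audio)
print(len(segments), {len(segment) for segment in segments})
```

Padding the last chunk keeps every input the same shape, which is the point of working with fixed 30-second windows.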
A decoder is then trained to predict the text caption that corresponds to the input audio. Special tokens mixed into the decoder input direct this single model to perform specific tasks, such as language identification, multilingual speech transcription, phrase-level timestamps, and translation of speech into English.
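To make the idea of task-specific tokens concrete, here is a small sketch of how a decoder prompt might be assembled. The token spellings follow the format described in OpenAI's Whisper paper as best we recall it, but they are plain illustrative strings here, not output from the real tokenizer, and build_decoder_prompt is a hypothetical helper.

```python
def build_decoder_prompt(language="en", task="transcribe", timestamps=False):
    """Return the list of special tokens that tells the decoder what to do.

    language:   short language code, e.g. "en" or "fr"
    task:       "transcribe" (same-language text) or "translate" (to English)
    timestamps: if False, a token asking for no timestamps is appended
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task!r}")

    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens


# Hypothetical usage: French speech, translated to English, with timestamps.
print(build_decoder_prompt(language="fr", task="translate", timestamps=True))
```

Because the task is selected by tokens rather than by separate models, one set of weights can cover transcription, translation, and language identification.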
Whisper has the potential to significantly improve speech recognition and language translation in various applications, from virtual assistants to language learning tools. With its ability to recognize a wide range of accents and handle technical jargon, Whisper is a promising step in making speech recognition more accessible and accurate for everyone.
Model variations
Whisper’s advantage over other speech recognition systems lies in its training on multilingual, multitask data, which makes it a versatile performer with high accuracy.
The model comes in five sizes, four of which also have English-only versions. Depending on the desired application, each size offers a different trade-off between speed and accuracy.
In general, the English-only (.en) models tend to perform better for English-only applications, especially tiny.en and base.en.
For the small.en and medium.en models, the difference becomes less significant. Whisper’s overall performance also varies significantly depending on the language being used.
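As a rough illustration of how the model names fit together, the sketch below builds a name from a size and an English-only flag. The size names tiny, base, small, medium and large match the open-source release as far as we know; the helper model_name is hypothetical and not part of Whisper.

```python
MULTILINGUAL_SIZES = ("tiny", "base", "small", "medium", "large")
ENGLISH_ONLY_SIZES = ("tiny", "base", "small", "medium")  # no English-only "large"


def model_name(size, english_only=False):
    """Return a model name such as "base" or "base.en"."""
    if size not in MULTILINGUAL_SIZES:
        raise ValueError(f"unknown model size: {size!r}")
    if english_only:
        if size not in ENGLISH_ONLY_SIZES:
            raise ValueError(f"no English-only version of {size!r}")
        return f"{size}.en"
    return size


# Hypothetical usage: the smallest English-only model and the largest multilingual one.
print(model_name("tiny", english_only=True), model_name("large"))
```

Smaller models run faster and need less memory, while larger ones are generally more accurate, so the right choice depends on the hardware and the language involved.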
Potential applications
Thanks to its adaptability and accuracy, Whisper is an exceptional resource for producing interview and podcast transcripts, and it can even translate podcasts recorded in other languages into English, all on your own device.
This combination of transcription and translation has the potential to revolutionize the transcription industry.
Testing Whisper’s AI
We put Whisper through its paces by feeding it multiple samples, including a Selena Gomez song, using the Python demo program available on GitHub. Whisper did an excellent job of transcribing the MP4 file to text, outperforming some AI-powered audio transcription services we have tried in the past. The result is shown in the accompanying snapshot.
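For readers who want to try something similar, the following is a minimal sketch based on the open-source whisper Python package, not the exact script used for this test. It assumes the package and ffmpeg are installed; the wrapper transcribe_file and the file name song.mp4 are hypothetical.

```python
from pathlib import Path


def transcribe_file(path, model_name="base", task="transcribe"):
    """Return the text Whisper produces for one audio or video file.

    task="transcribe" keeps the spoken language;
    task="translate" asks for English text from non-English speech.
    """
    audio_path = Path(path)
    if not audio_path.is_file():
        raise FileNotFoundError(f"no such file: {audio_path}")

    # Imported here so the path check above works even without the package.
    import whisper  # assumption: installed with "pip install -U openai-whisper"

    model = whisper.load_model(model_name)  # downloads weights on first use
    result = model.transcribe(str(audio_path), task=task)
    return result["text"]


# Hypothetical usage (uncomment once you have a local file):
# print(transcribe_file("song.mp4"))
# print(transcribe_file("podcast_fr.mp3", task="translate"))
```

The first call for a given model size takes longer because the weights have to be downloaded; later runs reuse the cached copy.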

OpenAI released the Whisper API
OpenAI recently announced that the Whisper model is now available via an API, priced at $0.006 per minute of audio. This allows developers to incorporate the speech-to-text model into their applications and services.
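At that rate, the cost of a job is easy to estimate. The sketch below assumes the price is applied pro rata to the audio duration and ignores any rounding or minimum charge OpenAI may apply; estimate_cost_usd is a hypothetical helper.

```python
from decimal import Decimal

PRICE_PER_MINUTE_USD = Decimal("0.006")  # from the article


def estimate_cost_usd(duration_seconds):
    """Estimate the API cost in US dollars for audio of the given length."""
    minutes = Decimal(str(duration_seconds)) / Decimal(60)
    # Round to a hundredth of a cent for display.
    return (minutes * PRICE_PER_MINUTE_USD).quantize(Decimal("0.0001"))


# Hypothetical usage: a one-hour podcast episode.
print(estimate_cost_usd(60 * 60))
```

Decimal is used instead of float so that small per-minute amounts do not pick up binary rounding noise when they are added up across many files.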
Is OpenAI Whisper free?
Whisper is a free, open-source model; however, the OpenAI API service is priced at $0.006 per minute.
What is Whisper AI?
Whisper is an automatic speech recognition system that can transcribe and translate audio files in approximately 100 different languages.