Whisper-v3 Hallucinations on Real World Data
Jose Nicholas Francisco
Last week Sam Altman announced Whisper v3 on stage at OpenAI’s Dev Day. Like anyone in the community, I was eager to see how the model performed. After all, at Deepgram, we love all things voice AI. So we decided to take it for a spin.
This post shows how I got the model running and the results of the testing I did. Getting the test setup working was relatively straightforward; however, we found some surprising results.
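For context, here is a minimal sketch of the kind of setup involved, using the Hugging Face transformers pipeline with the `openai/whisper-large-v3` checkpoint. The filename and parameters below are illustrative assumptions, not our exact test script.

```python
import torch
from transformers import pipeline

# Pick a GPU if one is available; Whisper large-v3 is slow on CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16 if device != "cpu" else torch.float32,
    device=device,
)

# chunk_length_s splits long audio into 30-second windows for long-form
# transcription; return_timestamps adds segment-level start/end times.
# "call_recording.wav" is a hypothetical example file.
result = asr("call_recording.wav", chunk_length_s=30, return_timestamps=True)
print(result["text"])
```

With a transcript string in hand, comparing it against a ground-truth transcript (e.g., computing word error rate) is the remaining piece of the test harness.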
I’ll show some of the peculiarities up front, then go through the thorough analysis after that.
🔍 The Peculiarities We Found
Peculiarity #1:
Start at 4:06 in this audio clip (the same one embedded above). This is one of the files we used in our testing.
At that moment in the audio, the ground-truth transcription reads “Yeah, I have one Strider XS9. That one’s from 2020. I’ve got two of the Fidgets XSR7s from 2019. And the player tablet is a V2090 that’s dated 2015.”
However, the Whisper-v3 transcript says: