I am stunned by OpenAI Whisper’s voice-to-text results

The following blog entry was largely dictated and then transcribed (with manual editing where I was unhappy with what I said).

OpenAI Whisper using the medium model (unedited with errors in red):

The new OpenAI Whisper Voice to Transcription Service is now by far the best transcription service available. It is open source and can very easily be set up in Python. It does require a GPU, which is problematic if you have an AMD GPU. But once it is set up, the performance improvement over Google's recorder, even when using the Pixel 6 Pro, is so much higher that it threatens to replace it completely. Performance in terms of speed is also very good.

Pixel 6 Pro recorder (unedited with errors in red):

The new open AI, whisper voice to transcription service, is now, by far the best transcription service available, it is open source and can very easily be set up in Python. It does require a GPU which is problematic if you have an AMD GPU. But once it is set up the performance improvement, over Google's recorder even when using the Pixel 6 Pro, there's so much higher that it threatens to replace. 

It completely performance. In terms of speed is also very good.

As you can see, the Pixel sample has great difficulty with punctuation, and it inserts paragraph breaks where none belong. The one shortfall of Whisper is that it doesn’t create paragraphs at all (all the text comes out as one paragraph), but from a practical point of view this is preferable, because a human editor would want to review the transcription anyway and can add paragraph breaks manually at that point.
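For anyone who would still like a rough first pass at paragraph breaks before editing, here is a minimal sketch in Python. It assumes the per-segment output that Whisper's Python package returns alongside the full text (a list of segments, each with "start", "end" and "text" fields). The function name, the pause threshold, and the idea of breaking on long pauses are my own assumptions, not something Whisper does itself.

```python
def split_paragraphs(segments, min_pause=1.5):
    """Group transcript segments into paragraphs, breaking on long pauses.

    `segments` is a list of dicts with "start", "end" (seconds) and "text",
    shaped like the "segments" list in a Whisper transcription result.
    `min_pause` is an arbitrary, hypothetical threshold in seconds.
    """
    paragraphs = []
    current = []
    prev_end = None
    for seg in segments:
        # Start a new paragraph when the silence since the previous
        # segment is at least `min_pause` seconds long.
        if prev_end is not None and current and seg["start"] - prev_end >= min_pause:
            paragraphs.append(" ".join(current))
            current = []
        current.append(seg["text"].strip())
        prev_end = seg["end"]
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs
```

A pause-based split is only a heuristic. A speaker who pauses mid-sentence will get a spurious break, so the result still needs the same human review as the single-paragraph output.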

I find the Whisper approach saves time. The number one complaint I have with the Google recorder is that its egregious punctuation errors take a lot of time to correct.

The real power of the OpenAI model is seen when you feed it very long audio recordings, say 30 minutes or more. It is able to output a finished product with startling accuracy, so much so that the amount of work involved in correcting errors starts to become de minimis.

The setup for Whisper is also extremely easy if your hardware meets the requirements (https://www.assemblyai.com/blog/how-to-run-openais-whisper-speech-recognition-model/). A locally run model would afford greater privacy than a cloud service. Without an Nvidia GPU I was forced to test this on Kaggle, but I have never been as impressed with a transcription model as with Whisper. No doubt developers will create standalone apps soon enough. Although the base model is good (better than Google), the medium model is stunning.
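To give a sense of how little code is involved, here is a minimal sketch of a transcription call in Python. It assumes the openai-whisper package is installed (pip install -U openai-whisper) and that ffmpeg is available on the system. The wrapper function, its name, and the optional load_model parameter are my own illustration, not part of Whisper; only whisper.load_model and the model's transcribe method come from the package.

```python
def transcribe_file(audio_path, model_name="medium", load_model=None):
    """Transcribe an audio file and return the text as a single string.

    `load_model` is a hypothetical hook so a stand-in loader can be passed
    for testing; by default the real Whisper loader is used.
    """
    if load_model is None:
        # Imported lazily so this module loads even where Whisper is absent.
        import whisper
        load_model = whisper.load_model

    # Model weights are downloaded on first use; "base" is smaller and
    # faster, "medium" is the one I found most accurate.
    model = load_model(model_name)

    # The result is a dict; "text" holds the whole transcript as one string.
    result = model.transcribe(audio_path)
    return result["text"].strip()
```

Calling transcribe_file("recording.mp3") (the file name is a placeholder) would return the transcript as one block of text, which matches the single-paragraph output described above. Passing model_name="base" trades some accuracy for speed.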
