Fooling around with Azure voice recognition

French speech to text

« So, can I, like, magically transcribe my corpus into text? »

If you work in the Humanities, and specifically in Linguistics, you probably know the situation. Every now and then, someone comes to see the « Grand Master of All Things Computational » with a simple question. Like « What is the best part-of-speech tagger anyway? », or « How can I transcribe this huge 20+ hour audio corpus, like right now? ‘Cause I need it to write my PhD/article/course on interaction. Like, tomorrow… »

And then you start looking into it, because you can’t possibly leave this question unanswered. Of course you start with the usual « You know, computational linguistics is not magic. Let me get back to you on this. »

This time, it’s about voice to text. So this is about how I spent my Easter Monday morning fooling around with Microsoft Azure’s Cognitive Services.

For those of you who are not familiar with the platform, Microsoft Azure offers several free plans to get you started. I had already tested online voice-to-text services, because my colleagues will NEVER be able to set up even a basic Python (or whatever) script. I tried them, but the results were too error-prone. I must admit that my test files are especially hard to process: dialogues with multiple interlocutors, in a noisy environment, with a low-quality recording… So, I decided to test Microsoft Azure’s voice-to-text services.

First of all, even though there’s tons of documentation, I must say that the Python examples will only get you so far. Specifically, if you’re aiming at continuous transcription from an existing file, the current Python scripts are « buggy » (I suspect the example module is not in sync with the current stable version).
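For reference, here is a minimal sketch of what continuous recognition from a file usually looks like with the `azure-cognitiveservices-speech` Python package. It has not been run against the service for this post, and `transcribe_file` and `join_segments` are names made up for illustration. It assumes a WAV input and the « fr-FR » locale.

```python
import threading


def join_segments(segments):
    """Join recognized utterances into one draft, skipping empty results."""
    return " ".join(s.strip() for s in segments if s and s.strip())


def transcribe_file(wav_path, key, region, language="fr-FR"):
    """Run continuous recognition over a WAV file and return the draft text."""
    # Imported lazily so join_segments() works without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    speech_config.speech_recognition_language = language
    audio_config = speechsdk.audio.AudioConfig(filename=wav_path)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config
    )

    segments = []
    done = threading.Event()

    # "recognized" fires once per finalized utterance.
    recognizer.recognized.connect(lambda evt: segments.append(evt.result.text))
    # Either event means there is nothing more to wait for.
    recognizer.session_stopped.connect(lambda evt: done.set())
    recognizer.canceled.connect(lambda evt: done.set())

    recognizer.start_continuous_recognition()
    done.wait()
    recognizer.stop_continuous_recognition()
    return join_segments(segments)
```

The event-based design matters here: a single-shot call stops at the first pause, whereas the continuous mode keeps collecting utterances until the end of the file.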

Long story short: I managed to set up a Python script to get speech to text on my Linux box, using my trusted Blue Yeti mike (a bit of overkill, since the expected input is 16 kHz mono). This is what I get:

French speech to text

But that’s the easy part: just download the example script, configure the API key and service region (« westus », btw), plug in the mike and get going. I still have a bigger challenge in store: getting a draft text transcription from an existing audio file.
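To make the « configure the API key and service region » step concrete, here is a hedged sketch of single-utterance recognition from the default microphone. The environment variable names (`AZURE_SPEECH_KEY`, `AZURE_SPEECH_REGION`, `AZURE_SPEECH_LANGUAGE`) and both function names are assumptions for illustration, not something the official example script defines.

```python
import os


def load_settings(env=None):
    """Read the key, region and language, defaulting to westus and fr-FR."""
    env = os.environ if env is None else env
    key = env.get("AZURE_SPEECH_KEY")  # hypothetical variable name
    if not key:
        raise RuntimeError("Set AZURE_SPEECH_KEY before running")
    return {
        "key": key,
        "region": env.get("AZURE_SPEECH_REGION", "westus"),
        "language": env.get("AZURE_SPEECH_LANGUAGE", "fr-FR"),
    }


def recognize_from_microphone(settings):
    """Listen for one utterance on the default microphone."""
    # Imported lazily so load_settings() works without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(
        subscription=settings["key"], region=settings["region"]
    )
    speech_config.speech_recognition_language = settings["language"]
    # With no audio_config, the SDK falls back to the default microphone.
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

    result = recognizer.recognize_once()
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return result.text
    return None  # nothing recognized, or the request was canceled
```

Calling `recognize_from_microphone(load_settings())` returns the recognized text, or `None` when nothing was recognized. Keeping the key in the environment rather than in the script also makes it easier to hand the script to colleagues.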

This took a little more time. And without spending days on this, I only managed to get things working with the javascript example from

After a little bit of tinkering to give the voice recognition model the 16 kHz mono signal it expects, this is what I got, first with a relatively « clean » test sample.
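A side note on that tinkering: one way to get a 16 kHz mono file is sketched below. The post does not say which tool did the conversion, so ffmpeg is an assumption here, and the function names are made up. The check itself only uses Python’s standard `wave` module.

```python
import subprocess
import wave


def is_16khz_mono_pcm(wav_path):
    """Return True if the WAV file is 16 kHz, mono, 16-bit PCM."""
    with wave.open(wav_path, "rb") as wav:
        return (
            wav.getnchannels() == 1
            and wav.getframerate() == 16000
            and wav.getsampwidth() == 2  # 2 bytes per sample = 16 bits
        )


def ffmpeg_command(src, dst):
    """Build an ffmpeg call that downmixes to mono and resamples to 16 kHz."""
    return [
        "ffmpeg", "-y",        # overwrite dst if it already exists
        "-i", src,
        "-ac", "1",            # one audio channel
        "-ar", "16000",        # 16 kHz sample rate
        "-c:a", "pcm_s16le",   # 16-bit little-endian PCM
        dst,
    ]


def convert(src, dst):
    """Run the conversion (assumes ffmpeg is installed and on the PATH)."""
    subprocess.run(ffmpeg_command(src, dst), check=True)
```

Checking the file before uploading it is cheap, and it rules out one source of silent failures.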

Azure voice to speech javascript example with noisy file

As can be seen, it’s pretty good, although not 100% accurate. It’s not really a word-error rate problem, but rather a syntactic one: « Ceci est un test de reconnaissance vocale. La linguistique de corpus est une science appliquée des sciences du langage qui repose sur l’analyse de grands volumes de textes annotés et structurée, et éventuellement analyser. »

The last word, « analyser », is phonetically ambiguous with « analysé(s) ». Since it follows the coordinating conjunction « et », which links it to the participles before it, a simple syntactic checker would be able to correct « analyser » into « analysés ». A grammar and spell checker such as LanguageTool could probably be plugged into the recognized text stream.
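To illustrate the idea, here is a toy heuristic, nowhere near a real syntactic checker: it flags words ending in « -er » that come shortly after « et » once a participle-like word (ending in « -é », « -és », « -ée » or « -ées ») has appeared in the sentence. The function name and the window size are arbitrary choices, and the rule will produce false positives (« café », « premier »…), so it is only good for pointing a human reviewer at suspicious spots.

```python
import re
import unicodedata

PARTICIPLE_ENDINGS = ("é", "és", "ée", "ées")


def flag_infinitives_after_participles(sentence, window=3):
    """Flag '-er' words coordinated by 'et' after a participle-like word."""
    # Normalize so that accented letters are single code points.
    text = unicodedata.normalize("NFC", sentence.lower())
    tokens = re.findall(r"\w+", text)

    flagged = []
    seen_participle = False
    for i, token in enumerate(tokens):
        if token.endswith(PARTICIPLE_ENDINGS):
            seen_participle = True
        elif (
            token.endswith("er")
            and seen_participle
            and "et" in tokens[max(0, i - window):i]
        ):
            flagged.append(token)
    return flagged
```

On the second sentence of the sample above, this flags « analyser » and nothing else. A sentence such as « Il faut annoter et analyser le corpus. » is left alone, because no participle comes before the conjunction.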

Now, this was the easy part. Let’s move on to the more challenging test: a dialogue with multiple speakers, in a noisy environment (an Apple Store).

Noisy voice to text transcription

This is the output text: « D’accord, ouais, ils sont en fait, euh, je dois prendre un petit cadeau pour ma copine, OK? la Saint Valentin et je sais pas quoi prendre de LOK un iPhone ou en effet n’est pas la quantité c’est quoi votre fille, votre copine pour votre travail avec la mairie? OK c’est du sport ouais elle A quoi comme téléphone actuellement à un iPhone 5? Voilà c’est ça c’est ça ouais laquelle? Manon pas toujours nous? »

OK, it’s far from perfect, and the transcription doesn’t make much sense as is. But it’s pretty good, considering there are 3 different speakers, plus a number of indistinct speakers in the background. And since the 3 main speakers tend to speak at the same time, things get hard for the voice recognition engine. Another thing is that speakers are not distinguished by default. This could probably be tackled by spending more time reading the documentation.

But if we’re looking for a tool that might save hours of hard manual work, I’m pretty confident that Microsoft Azure’s Cognitive Services might help. Like… a lot! Even assuming a word error rate of about 30%, roughly 70% of the raw transcription could probably be automated.
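For anyone who wants to put an actual number on this, word error rate is usually computed as the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in a manual reference transcript. A minimal sketch, assuming naive whitespace tokenization and lowercasing (punctuation is not stripped, and the function name is illustrative):

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: ref word missing from hyp
                curr[j - 1] + 1,     # insertion: extra word in hyp
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Note that WER can go above 100% when the engine inserts a lot of words, so « 70% automated » is a rough reading rather than an exact equivalence. It also says nothing about how long each error takes to fix by hand.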

So, what’s next?

Well, Azure lets you define custom « phrases », as well as custom workflows for specific purposes. That would probably help keep the WER under 15%.
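Assuming the custom « phrases » correspond to the Speech SDK’s phrase lists (an assumption, as the exact feature is not pinned down here), a sketch in Python could look like this. `clean_phrases` and `add_phrase_hints` are hypothetical helpers, and the recognizer is expected to be an existing `SpeechRecognizer` instance.

```python
def clean_phrases(raw_phrases):
    """Collapse whitespace, drop empties and case-insensitive duplicates."""
    seen = set()
    cleaned = []
    for phrase in raw_phrases:
        phrase = " ".join(phrase.split())
        if phrase and phrase.lower() not in seen:
            seen.add(phrase.lower())
            cleaned.append(phrase)
    return cleaned


def add_phrase_hints(recognizer, raw_phrases):
    """Attach corpus-specific terms to a recognizer as phrase hints."""
    # Imported lazily so clean_phrases() works without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    phrase_list = speechsdk.PhraseListGrammar.from_recognizer(recognizer)
    for phrase in clean_phrases(raw_phrases):
        phrase_list.addPhrase(phrase)
    return phrase_list
```

For the Apple Store recording, a call like `add_phrase_hints(recognizer, ["iPhone 5", "Saint Valentin"])` before starting recognition would be the obvious first thing to try. Whether it gets the WER anywhere near 15% remains to be measured.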

And I still have to test it on Chinese. Because someone’s PhD depends on it. And said someone was (cleverly) using YouTube’s automatic transcription services. The only problem is that, with YouTube, you can never guarantee data integrity. And this is a major concern for corpus linguistics projects.

So, stay tuned!