Transcribe-bot monster meltdown: DeepSpeech, Dragon, Google, IBM, MS, and more!

Speech was a near-impossible field for computers until recently, and since talking to my computer is something I dreamed of as a kid, I have been tracking the field as it progressed through the years.

  • Expectations from dictation are high: if more than two words per paragraph need correcting, most people find it easier to just type the whole thing.
  • Microphone quality is very important to speech transcription. $50 headsets are just “ok”; for best results, professionals use pro mics which cost $300+.
  • According to Google, audio for recognition should not be compressed in a lossy way, as that hampers recognition. This means that network traffic might be an issue and should be considered.
    Calculating bandwidth requirements for 16-bit mono linear PCM:
    48 kHz = 96 KB/sec (what most microphones provide)
    44.1 kHz = 88 KB/sec (default Windows setting)
    16 kHz = 32 KB/sec (lowest quality allowed)
    Assuming ~100 concurrent streams, between 3.2 MB/sec and 9.6 MB/sec of upload bandwidth should be allocated for this at the customer’s site for cloud-based recognition. FLAC can be used for lossless compression in order to save bandwidth, but that would require a FLAC encoder.
  • Most cloud APIs impose a time limit, typically around 60 seconds, on streaming voice recognition:
    Wit: uses chunked HTTP upload, 10sec limit
    IBM: lets you configure the time limit (or remove it)
    Apple: 60sec limit, unknown protocol masked by the iOS SDK
    Google: uses gRPC over HTTP/2, 65sec limit on streaming recognition
    Microsoft: uses WebSockets, 10 minute limit
    These limitations can usually be worked around by periodically reconnecting to the service.
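
The bandwidth figures in the list above are just sample rate × bytes per sample × channels. Here is a minimal Python sketch of that arithmetic; the function names are mine, the 100-stream default simply mirrors the assumption above, and KB / MB are taken as 1,000 / 1,000,000 bytes.

```python
def pcm_bytes_per_second(sample_rate_hz, bits_per_sample=16, channels=1):
    """Raw (uncompressed) linear PCM data rate in bytes per second."""
    return sample_rate_hz * (bits_per_sample // 8) * channels


def site_upload_mb_per_second(sample_rate_hz, concurrent_streams=100):
    """Aggregate upload rate in MB/s (1 MB = 1,000,000 bytes) for many streams."""
    return pcm_bytes_per_second(sample_rate_hz) * concurrent_streams / 1_000_000


if __name__ == "__main__":
    for rate in (48_000, 44_100, 16_000):
        per_stream_kb = pcm_bytes_per_second(rate) / 1000
        print(f"{rate} Hz: {per_stream_kb:.1f} KB/s per stream, "
              f"{site_upload_mb_per_second(rate):.2f} MB/s for 100 streams")
```

These are raw payload rates only; protocol overhead (HTTP/2, WebSocket framing, TLS) comes on top, so treat the numbers as a lower bound when sizing a site's uplink.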

The open source & public cloud market:

The list is ordered approximately by recognition quality & fit for cloud streaming dictation:

  • Google Speech API: [2.4¢/min] [Demo] wins the match in most blogs I read and also gets the popular vote. You can try it online and in Google Docs (Tools => Voice typing). Google also has some better models under development and in alpha stages, which are available to Google partners.
  • IBM Watson: [2¢/min] [Demo] looks like a well-rounded speech recognition engine, which is used in the IBM Watson project and offered as part of the IBM Bluemix cloud.
  • Microsoft Bing Speech: [~2¢/min] [Demo] try it online via the demo link (many languages), or try it in Office with dictate.ms, a Microsoft Garage product.
  • Apple’s Speech API: [free for iOS apps] looks better than Microsoft in most benchmarks, but may only be accessed from iOS & macOS devices (many languages).
  • Wit Speech: [Demo] there is less info on this engine as it comes from a smaller company; it is geared towards voice assistants & commands.
  • Kaldi: [Free OpenSrc] [dockerfile] [docker] The most mature open source speech recognition engine. It has streaming recognition via a GStreamer server. I don’t expect it to compare to Google, but it is an option for on-premise engines, plus it’s free… (English & Estonian only).
  • Amazon Transcribe: [2.4¢/min] no streaming recognition yet, upload only for now. I don’t currently have access to this service, so feel free to use my audio files and send me the results (many languages).
  • Temi / HappyScribe / Trint / Spext: [~10¢/min] Dedicated voice transcription companies which have a cloud transcription offering. Streaming / real-time recognition is not their focus, as they are geared towards offline, long-form audio transcription, but they are worth sampling in order to compare accuracy across the market, so I chose to run the samples through Temi as well.
  • Mozilla Deep Speech: [Free OpenSrc] [dockerfile] [docker] One of the most recent and most exciting speech engines, created by implementing a neural network designed by Andrew Ng on TensorFlow. With Mozilla’s Data Science team actively behind it, it looks like a very promising open source offering. It still has no streaming recognition: coming soon? (English only for now)

Professional cloud speech engines

These professional engines look better than Google Speech in most benchmarks I saw. They have “talk to us” OEM-type licensing and medical-grade speech engines that are actively used in the healthcare industry to save time when filling out patient reports. Needless to say, tolerance for errors in this field is very low.

  • Nuance Dragon: Nuance offers a range of solutions including desktop toolbars, on-premise speech recognition servers, and a cloud speech API. A free trial is offered to customers doing POCs, again with a “talk to us first” policy. I used Dragon Pro 15, which is their non-medical dictation desktop app (many languages).
  • nVoq SayIt: [Demo] best in KLAS for medical speech, powering Dolbey’s Fusion cloud speech APIs. Trials are on a “talk to us” basis here as well (North American English only).

Comparing the engines

There are already a ton of comparisons trying out the different services side by side, and I will link to some prominent blogs at the end, but I couldn’t leave you all without a good experimentation run just for the fun of it.

The test setup

I recorded four short lines of text, two from Darwin’s Origin of Species and two from Alice in Wonderland (Wikipedia), which range in recognition difficulty, in my own voice through a regular headset. Such one-liners are easy to send to all services, since some have length limitations. After processing (the audio has to be 16 kHz, 16-bit linear PCM WAV) I uploaded the same audio files to all services. Since Temi & DeepSpeech do not understand punctuation commands, I removed “comma” and “period” from the resulting text where they were recorded (this was a bit favorable towards DS, which returned “camma” and other stuff instead).
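
Removing the spoken punctuation commands from a transcript can be sketched in a few lines of Python. This is only an illustration of the cleanup step described above, not the exact procedure I followed; the names `PUNCTUATION_COMMANDS` and `strip_punctuation_commands` are made up for the example.

```python
import re

# Spoken punctuation commands that were dictated in the recordings.
PUNCTUATION_COMMANDS = ("comma", "period")

# Match the command words as whole words, regardless of capitalization.
_COMMAND_RE = re.compile(
    r"\b(?:" + "|".join(PUNCTUATION_COMMANDS) + r")\b", flags=re.IGNORECASE
)


def strip_punctuation_commands(text):
    """Remove the spoken command words and collapse leftover whitespace."""
    without_commands = _COMMAND_RE.sub(" ", text)
    return re.sub(r"\s+", " ", without_commands).strip()


if __name__ == "__main__":
    sample = "In the wild animal comma On the contrary comma all its facilities"
    print(strip_punctuation_commands(sample))
```

Note that this only catches exact matches: a misrecognition such as “camma” still has to be removed by hand, and a legitimate use of the word “period” in the text would be removed too.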

All speech engines that support file upload in their demo / trial received the audio files for transcription. Since Wit.AI & nVoq didn’t provide this feature, I used a VB-Audio device driver to transfer audio from a player to a virtual mic; the voice was played at 90% volume.

I do not have an AWS account or a Mac computer, so I don’t have access to those engines. If you do, please run the audio through them and I’ll add your text to the comparison. Kaldi was tested with the Tedlium model.

Alice quote I

Origin: Alice a girl of seven years is feeling bored and drowsy while sitting on the riverbank with her elder sister

  • DS: alice a girl of seven years is feeling boared an drouwsy wild sitting on the riv bank wit her elder sister
  • Google: Alice a girl of 7 years is feeling bored and drowsy while sitting on the riverbank with her elder sister.
  • MS: Alice a girl of 7 years is feeling bored and drowsy while sitting on the riverbank with her elder sister.
  • IBM: Alice A girl of seven years is feeling bored and drowsy while sitting on the riverbank with her elder sister.
  • Temi: Alice, a girl of seven years is feeling bored and drowsy while sitting on the riverbank with her elder sister.
  • nVoq: Palace a girl of 7 years is feeling bored and drowsy while sitting on the roof ban quoted her elder sister
  • Wit.AI: how is the girl of seven years is feeling bored and drowsy while sitting on the riverbank with her elder sister
  • Kaldi (tedlium): place until seven years is feeling bored in trials thewhile sitting on the riverbank with her elder sister.
  • Dragon Pro 15: Alice, a girl of seven years is feeling bored and drowsy while sitting on the riverbank with her elder sister

Darwin quote I

Origin: In the wild animal, on the contrary, all its facilities and power being brought into full action for the necessities of existence.

  • DS: in the wild animal on the contrary all its vasilithies and power being brought into full action for the necissities of existence
  • Google: In the wild animal, on the contrary, all its facilities and power being brought into full action for the necessities of existence.
  • MS: In the wild animal comma On the contrary comma all its facilities in power being brought into full action for the necessities of existence.
  • IBM: In the wild animal, on the contrary, all its facilities and power being brought into full action for the necessities of existence.
  • Temi: In the wild animal, on the contrary, all its facilities and power being brought into full action for the necessities of existence.
  • nVoq: In the wild animal, on the contrary, all its facilities and power being brought into full action 4 the necessities of existence
  • Wit.AI: in the wild animal on the contrary its facilities and power being brought into full action for the necessities of existence
  • Kaldi (tedlium): in the wild animal on the contrary now it’s a sad that these and power being brought into for action for the necessities of existence.
  • Dragon Pro 15: In the wild animal, on the contrary, all its facilities and power being brought into full action for the necessities of existence

** “facilities” should be “faculties”: I mispronounced it, and the STT engines didn’t care…

Darwin quote II

Origin: Is strengthened by exercise, and must even slightly modify the food, the habits, and the whole economy of the race.

  • DS: Is strentened by exercise and must even slightly moify the food the habits and the whole economy of the race.
  • Google: Is strengthened by exercise, and must even slightly modify the food, the habits, and the whole economy of the race.
  • MS: Is strengthened by exercise, and must even slightly modified food, habets, in the whole economy of the race.
  • IBM: Is strengthened by exercise, and must even slightly. Modify the food, the habits, in the whole economy of the grace.
  • Temi: Is strengthened by exercise, and must even slightly modify the food the habits and the whole economy of the race.
  • nVoq: is strengthened by exercise, and must even slightly modify the food, [?] habits, and the whole economy of the race.
  • Wit.AI: is strengthened by exercise and must even slightly modified food habits and the whole economy of the race
  • Kaldi (tedlium): his trenton by exercise karma and must even slightly modified the food habits and the whole economy over the grace.
  • Dragon Pro 15: Is strengthened by exercise, and must even slightly modify the food, [?] habits, and the whole economy of the race.

Alice quote II (harder to recognize)

Origin: Outside, Alice hears the voices of animals that have gathered to gawk at her giant arms. The crowd hurls pebbles at her

  • DS: ooutside an s sheres the voice of animal that have gathered to godbiter ghiant arms the crowd hurl’s devils at he
  • Google: Outside Alice, here’s the voice of animals that have gathered to God Couture giant arms the crowd hurdles Devil’s at her.”
  • MS: Outside Alice here’s the voice of animals that have gathered to goget are giant arms the crowd herels bevels at her.
  • IBM: Outside Alice hears the voice of animals that have gathered to gawk at her giant arms. The crowd hurls pebbles at her.
  • Temi: Outside Ellis hears the voice of animals that have gathered to [?]giant arms. The crowd hurls pebbles at her.
  • nVoq: outside elsehears the voice of animal thatI have gathered to guard a giant arms the wound heals bevels at her
  • Wit.AI: okay and this here’s the voice of animals that have gathered to got to trying to harm the crown hurls levels of her
  • Kaldi (tedlium): outside else here is the voice of animal health careother two battered trying to arms the crown hurls levels at her.
  • Dragon Pro 15: Outside, Alice hears the voice of animals that have gathered to God with her giant arms. The crowd hurls levels that are

Summing it all up, I opted to count “stumbles”, which means counting a whole block of consecutive errors as a single one. I chose this method since advanced language models tend to over-correct for mistakes that skew grammar or context, thus dragging the error across a few words.
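
For anyone who wants to automate this kind of count, here is a rough Python sketch of the idea using the standard library’s `difflib`. It is an approximation I am assuming for illustration, not the way the totals in this article were produced: it treats every contiguous run of differing words as one stumble, and it would also flag harmless differences such as “seven” vs. “7”. The helper names `normalize` and `count_stumbles` are hypothetical.

```python
import difflib
import re


def normalize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9'’]+", text.lower())


def count_stumbles(reference, hypothesis):
    """Count contiguous runs of differing words as one stumble each."""
    ref_words = normalize(reference)
    hyp_words = normalize(hypothesis)
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    # Every non-"equal" opcode is one block of inserted, deleted or replaced words.
    return sum(1 for tag, *_ in matcher.get_opcodes() if tag != "equal")


if __name__ == "__main__":
    reference = "gathered to gawk at her giant arms"
    hypothesis = "gathered to God Couture giant arms"
    print(count_stumbles(reference, hypothesis))
```

In the usage example, the three reference words “gawk at her” are replaced by two wrong words, yet the whole block counts as a single stumble, which is exactly the behavior the metric is after.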

Totals:

  • IBM: 2 stumbles
  • Temi: 2 stumbles
  • Dragon: 3 stumbles
  • Google: 3 stumbles
  • MS: 5 stumbles
  • Wit.AI: 6 stumbles
  • D.Speech: 8 stumbles
  • nVoq: 9 stumbles
  • Kaldi: 12 stumbles

In all fairness, I have to admit that I chose the last quote as the difficult one because Google had trouble with it when I was playing around, so you may take half a point off Google’s count for that :). The distance between IBM and Google is not really enough for me to judge, so it’s obvious that we need a real showdown with some long-form speech, which is coming up below.

In general, we can see the state of open source vs. the cutting edge. We can also see that no engine is perfect: they still have a way to go to reach human-level understanding, despite recent accuracy claims.

nVoq are specialists in healthcare text, and from what I saw they do shine in that field, so I decided that my final showdown will include two test samples: long-form and medical speech.

Interim conclusions:

IBM, Google & Dragon pass to the next stage.
nVoq will also be in the finals as a representative of the medical dictation specialization; it isn’t really in the same category and is not trained on similar voice content.

Finalists: IBM, Google (general speech), nVoq (medical speech)


Bonus test: WaveNet-D voice sample

Since my voice is not perfectly canonical, as I may have a slight Hebrew accent, I wanted a perfect English sample of the last sentence, which heavily confused Google & MS. To achieve this, I used the Google TTS engine to produce a state-of-the-art WaveNet speech sample from text. After some base64 decoding & conversion to 16 kHz, I ran the same upload routine with all the previously tested services…
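
The decoding step can be sketched in Python as follows. The sketch assumes the TTS response is a JSON body with the audio in a base64 `audioContent` field (as the Google Cloud Text-to-Speech REST API returns it) and that the audio was requested as WAV; the function names are hypothetical, and the actual resampling to 16 kHz is left to an external audio tool.

```python
import base64
import io
import json
import wave


def decode_tts_audio(response_json):
    """Return the raw audio bytes from a JSON body with a base64 'audioContent' field."""
    payload = json.loads(response_json)
    return base64.b64decode(payload["audioContent"])


def describe_wav(audio_bytes):
    """Return (channels, sample_width_bytes, sample_rate_hz) of WAV data."""
    with wave.open(io.BytesIO(audio_bytes), "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth(), wav.getframerate()


def needs_resampling(audio_bytes, target_rate_hz=16_000):
    """True if the decoded WAV is not already at the target sample rate."""
    return describe_wav(audio_bytes)[2] != target_rate_hz
```

Checking the header before uploading is a cheap way to catch a sample that is still at the synthesizer’s native rate rather than the 16 kHz that the recognition services in this test were fed.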

Origin: Outside, Alice hears the voices of animals that have gathered to gawk at her giant arms. The crowd hurls pebbles at her.

  • DS: utside alice heres the voices of animals that have gathered do goc ether giant arms the crowd hurls pebbles at her
  • Google: Outside Alice hears the voices of animals that have gathered to God that are giant arms the crowd rolls pebbles at her.
  • MS: Outside Alice here’s the voices of animals that have gathered to gotthat are giant arms the crowd girls pebbles at her.
  • IBM: Outside Alice hears the voices of animals that have gathered to gawk at her giant arms. The crowd hurls pebbles at her.
  • Temi: Outside Alice, here’s the voices of animals that have gathered to Gawk at her giant arms. The crowd hurls pebbles at her.
  • nVoq: outside Alice years the voices of animals that have gathered to gawkergiant arms the crowd rosepebbles at her
  • Wit.AI: outside dallas here’s the voices of animals that have gathered to gawk at your giant arms the crowd rules pebbles at her
  • Kaldi (tedlium): outside alice here’s the voices of animals that have gathered to gawk at a giant arms the crowd rose petals at her.
  • Dragon Pro 15: Outside, Alice hears the voices of animals that have gathered to gawk at her giant arms. The crowd rose pebbles at her

So we got the pebbles part out of the way, but it seems that hurling just ain’t something that the tech giants do; they prefer to roll or girl them instead (I wonder what girling a pebble would mean). I get not understanding “gawk”, since it is a rather rare word.


Stage II: final showdown

For the final stage I wanted to check two different fields: best all-around speech recognition, and best medical speech transcription. I chose these because, as far as I know, computerized dictation is currently used mainly for storytelling (writers) and for medical diagnosis reporting (doctors). All the voice samples in this test were recorded by me and can be found in the GitHub repo which accompanies this article. Punctuation is handled by all engines, so no freebies this time around…

Medical transcription test

Origin: A meld of two diagnoses from online sites
69-year-old female with a history of smoking, asthma and bronchitis now with productive cough intermittently for several months. Left medial foot and ankle pain and swelling. Plantar metatarsal pain for 5 weeks. No known trauma. Possible free air under the diaphragm. On a chest x-ray for what appeared to be shortness of breath she was found to have what was thought to be free air under the right diaphragm. No intra-abdominal pathology.

nVoq:
69-year-old female with a history of smoking, asthma and bronchitis now with productive cough intermittently for several months. 
Left medial foot and ankle pain and swelling. Plantar metatarsal pain for 5 weeks. No known trauma. Possible free air under the diaphragm. 
On a chest x-ray for what appeared to be shortness of breath she was found to have what was thought to be free air under the right diaphragm. No intra-abdominal pathology.

Google:
69 year old female with a history of smoking, asthma and bronchitis now with productive cough intermittently for several months. 
Left medial foot and ankle pain and swelling. Plantar metatarsal pain for 5 week There’s No known trauma. Possible for year under the diaphragm. 
On the chest x-ray for what [?] the shortness of breath she was found to have.What was thought to be free are under the right diaphragm. No intra-abdominal pathology.

IBM:
69 year old female with a history of smoking, Asman bronchitis now with productive cough intermittently for several months. Lift medial foot and ankle pain and swelling. Lanzar metatarsal pain for 5 weeks no known trauma. Possible free air under the diaphragm. On the chest X. ray for what the peer to be shortness of breath she was found to have what was thought to be free air under the right diaphragm period no intractable pathology.

MS:
69 year old female with a history of smoking, after this now with productive cough intermittently for several months period left medial foot and ankle pain and swelling. plantare metatarsal pain for 5 weeks. known from a period possible freear under the front. on the chest X Ray for what appeared to be shortness of breath she was found to have what was not to be free air under the right diaphragm. no intradermal patala G.

Dragon Pro 15:
69-year-old female with a history of smoking, asked mybronchitis. Now, with productive cough intermittently for several months. Left medial foot and ankle pain and swelling. Plantar metatarsal pain for five weeks. No known trauma. Possible free air under the diaphragm. On the chest x-ray for what the peer to be shortness of breath. She was found to have what was thought to be free air under the right diaphragm. No intra-abdominal pathology.

Story telling test

Origin: Futurama, Episode 1 “space pilot 3000”
Look Leela, I don’t understand this world but you obviously do, so I give up. 
If you really think I should be a delivery boy, I’ll do it. He holds out his hand to Leela. She gets the implant gun ready. Fry cringes and looks away. The gun clicks but Fry feels nothing. He opens his eyes and sees Leela drop her own chip on the floor. Your chip, What are you doing?

Google:
Look Lila, I don’t understand this world, but you obviously do, so I give up. If you really think I should be a delivery boy, I’ll do it. He holds out his hand to Lila, She gets the implants gun ready. Fry cringes and looks away. The gun clicks, but fry feels nothing. He opens his eyes and sees Lila drop her own chip on the floor. You’re chip, What are you doing?

MS:
Look lela, I don’t understand this world but you obviously do, so I give up. if you really think I should be a delivery boy, I’ll do it. he holds out his hand to lila, she gets the implant gun ready. Frank whinges and looks away. the gun clips but Friday feels nothing. he opens his eyes and sees lelord drop her own chip on the floor. your trip, what are you doing?

IBM:
Look Leila, I don’t understand this world but you obviously do, so I give up period if you really think I should be a delivery boy, I’ll do it. He holds out his hand to Leila come she gets the implant gun ready. Franklin gyms and looks away. The gun clicks but Frank feels nothing. He opens his eyes and sees Leila drop her own chip on the floor. Your chip. Comma what are you doing?.

nVoq:
Look leila come I do not understand this world but you obviously to, so I give up. If you really think I should be a delivery boy, I will do it. He holds out his hand to Lila come she gets the implant can ready. Frei quenches and looks away. The gun clicks but Fry feels nothing. He opens his eyes and C’s leila drop her own chip on the floor. Your chip, what are you doing?

Dragon Pro 15:
Look Leela, I don’t understand this world, but you obviously do, so I give up. If you really think I should be a delivery boy, I’ll do it. He holds out his hand to Leela, she gets the implant gun ready. Fred cringes and looks away. The gun clicks but Fry feels nothing. He opens his eyes and sees Leela drop her own shape on the floor. Your chip, what are you doing?


Conclusions:

AI and speech recognition have come a long way in the last 5 years or so. Computerized dictation and translation used to be so daft that they made me laugh until my stomach hurt the first few times I tested them. Now it is almost good, churning out flawless sentences 3 times out of 4 with a good engine. I might even consider dictating my next Medium article to Google Docs or MS Dictate. There is also a fair amount of effort going into open source voice recognition, which is not trivial given the level of investment that designing such an algorithm entails. Just footing the bill for the model training hardware is impressive, not to mention the specialized manpower.

Best all-around speech recognition: [Google / Nuance]
When dictating whole paragraphs to the engine, you can see where Google really shines: it has excellent context understanding and no problem parsing punctuation. Despite IBM’s precision in the first stage of the tests, it does not perform as well as Google’s engine when dictating long pieces of text. Dragon has one more error than Google on the story text but fewer on the medical text, and my guess is that its medical engine would do much better still on that. This is not enough to pass judgement, but I would still stick with Google, as it is cheap as an engine, free when using Google Docs, and much easier to obtain.

I used Nuance Dragon 15 Pro, which is not a medical voice engine. It is indeed a very good speech engine and lives up to its name as a market leader. It did have the advantage of a short 3-sentence voice setup to tailor the engine to my voice, but it was just a short paragraph to read, so I won’t dock it any points for that.

Medical speech: [nVoq]
The differentiation from Google is even more apparent when freely playing with the engines, and when considering the fact that physicians have a close to zero tolerance for transcription errors.

It seems that the last 10% of recognition is hard for all engines alike, and that the human brain has much more going on than a mere neural net with a language model behind it. So don’t expect perfect results, but if you are willing to correct a few errors, dictation can be a good option for you.