What is the basic significance of speech recognition audio


I was going through a section of Week 1 titled "Case study: speech recognition", in which Professor Ng discusses how one would transcribe a short audio clip shared in the video. In the clip, the speaker says "umm", pauses for a second, and then says "today’s weather". Prof. Ng shares that he would prefer the transcriber to include the "Umm" sound rather than write only "today’s weather".

I worked as a medical transcriptionist early in my education and career, where we were taught to ignore such sounds and to transcribe only what is correct in terms of English grammar and medical terminology. So if the speaker in the video says "ummm… today’s weather", we as transcriptionists would have typed only "today’s weather".

So I want to know why Andrew Ng preferred including the sound note in the transcribed text!

I will be really happy to hear a good reasoning for this :slight_smile:

Thank you in advance



Hi @Deepti_Prasad

One good reason could be my personal user experience :slight_smile:. For example, when I watch movies (or YouTube) with subtitles, I prefer seeing “Umm…” when the actor says it rather than having it omitted. It makes the subtitles easier to follow, and you are somewhat reassured that you did not miss a word that could sound similar to “Umm…” (maybe a name, or a word like “and”).

Also, I could imagine that in some trial or police cases these “Umms” could indicate that the speaker is unsure or hesitant, whereas if they are omitted the speaker would come across as confident in the text.

I’m sure there are other cases, but those are the ones that came to mind :slight_smile:.



Thank you arvy, got the gist!!!



What about a speaker who pronounces words very badly against a noisy background, especially in medicolegal files, where what we heard as "Ummm" was actually the word "ulnar" (medical terminology)? Would speech recognition help in such a case?

I am asking based on your description because we had files where such software would transcribe all the background noise as symbols like $$$ … or ~~‘’‘’

So in such cases the software was not doing a good job of transcribing these files.


I’m no expert in this field (I have no prior experience in transcription) but I can offer my thoughts.

To my knowledge, fully automated speech recognition would not help in such cases. General models would certainly not work. Models fine-tuned specifically on certain dialects and on medical data might help, but only with a human in the loop (at least doing constant error analysis on certain terms and iteratively fine-tuning the models, or using the models as assistants for human transcribers).
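To make the "error analysis on certain terms" idea a bit more concrete, here is a minimal sketch. It assumes you have pairs of human and model transcripts; the function name `term_miss_counts`, the watched terms, and the example sentences are all hypothetical, made up for illustration.

```python
from collections import Counter

def term_miss_counts(pairs, watch_terms):
    """Count, per watched term, how often it appears in the human
    reference transcript but is missing from the model transcript."""
    misses = Counter()
    for reference, hypothesis in pairs:
        ref_words = set(reference.lower().split())
        hyp_words = set(hypothesis.lower().split())
        for term in watch_terms:
            if term in ref_words and term not in hyp_words:
                misses[term] += 1
    return misses

# Hypothetical (human transcript, model transcript) pairs.
pairs = [
    ("pain along the ulnar nerve", "pain along the um nerve"),
    ("ulnar fracture noted", "ulnar fracture noted"),
    ("signs of hypothyroidism", "signs of hyperthyroidism"),
]
misses = term_miss_counts(pairs, {"ulnar", "hypothyroidism"})
print(dict(misses))
```

Terms with high miss counts would be the ones to review by hand and to prioritize when collecting more fine-tuning data.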

The problem with language models is that they assign probabilities to candidate words, so a rarely used term tends to get a low probability.
This might be fine for casual conversation, but medical language (I think) is full of terms that are used sparingly (there are a lot of “long tail” terms), and deep understanding of the field is needed (with complex logic) to map a sound that could be many things to the right word.
Accents are also a challenge for language models, and I think the medical field is full of people from different backgrounds (age, ethnicity, etc.).
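Here is a toy sketch of why the probabilities matter. It is not how any particular speech recognizer works; it just multiplies an acoustic match score by a word prior, and every number in it is made up.

```python
# Toy illustration only: all numbers are hypothetical, not from a real model.

def pick_word(acoustic_scores, prior):
    """Return the candidate that maximizes acoustic score * prior probability."""
    return max(acoustic_scores, key=lambda w: acoustic_scores[w] * prior.get(w, 0.0))

# How well each candidate matches a mumbled, noisy sound (hypothetical).
acoustic_scores = {"um": 0.5, "ulnar": 0.4}

# Hypothetical word priors: general conversation vs. medical dictation.
general_prior = {"um": 0.02, "ulnar": 0.00001}
medical_prior = {"um": 0.001, "ulnar": 0.004}

general_choice = pick_word(acoustic_scores, general_prior)
medical_choice = pick_word(acoustic_scores, medical_prior)
print(general_choice, medical_choice)
```

With the general prior the rare word loses even though it sounds almost as plausible, while the medical prior flips the choice. That is roughly the effect fine-tuning on medical data is hoped to have.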

Also, even for us humans it is sometimes hard to discern “hypothyroidism” vs “hyperthyroidism” (to me at least they sound quite similar, but of course they mean opposite things), and we need a couple of words or sentences further along in the conversation to be sure which one was meant.
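A very rough sketch of using the surrounding words to decide, and of flagging the case for a human when the context does not help. The cue words in `CUES` and the function `disambiguate` are hypothetical and chosen only for illustration; this is not a clinical rule set.

```python
# Toy context cues, for illustration only; not a clinical rule set.
CUES = {
    "hypothyroidism": {"fatigue", "cold", "levothyroxine"},
    "hyperthyroidism": {"tremor", "heat", "palpitations"},
}

def disambiguate(context):
    """Return the candidate with the most cue words in the context,
    or None on a tie so that a human can review it."""
    words = set(context.lower().split())
    scores = {term: len(cues & words) for term, cues in CUES.items()}
    best = max(scores.values())
    winners = [term for term, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else None

guess = disambiguate("patient reports fatigue and feeling cold")
unsure = disambiguate("patient seen for follow up")
print(guess, unsure)
```

Returning `None` on a tie is a deliberate choice here: when the context gives no signal, the segment goes to a human transcriber instead of being guessed.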

So, in summary, I think fine-tuned models might help human transcribers to varying degrees, but would not work as a fully automated system (at least in the medical field, I would not advise such systems today).

Just my thoughts :slight_smile:
