What is the basic significance of speech recognition audio

Deepti_Prasad · October 30, 2023, 10:53am

Hello,

I was going through a section of week 1 title "Case study: speech recognition where in Professor Ng elaborates about how a transcription would transcribe a small clipping of audio shared in the video where in the the auditor basically makes a sound umm and stops for a second and then tells today’s weather, and Prof. Ng shares that he would prefer the transcriber adding the sound note of Umm rather than just today’s weather.

I have had the chance to work as medical transcriptionist in my early days of education and career managing, where we were taught to ignore this sound, and write or transcribe what is grammatically (as English vocabulary) and medical terms what is correct. So if the video where in auditor tells ummm… today’s weather, we as a transcriptionist would have typed only today’s weather.

So I want to know why did Andrew Ng preferred the sound note with text explanation for this!!!

I will be really happy to listen a good reasoning out for this

Thank you in advance

Regards
DP

arvyzukai · October 30, 2023, 7:28pm

Hi @Deepti_Prasad

One of the good reasons could be my personal user experience . For example, when I watch movies (or youtube) with subtitles I would prefer seeing “Umm…” when the actor says that instead of omitting it. It makes tracking the subtitles easier and you are somewhat reassured that you did not miss the word that could sound similar to “Umm…” (maybe a name or other word like “and” or other).

Also, I could imagine in some trial/police cases, these “Umms” could indicate that the speaker is not sure or somewhat hesitant, while if omitted the speaker in the text would appear as confident.

I’m sure there are other cases, but those came to my mind .

Cheers

Deepti_Prasad · October 31, 2023, 12:48am

Thank you arvy, got the gist!!!

Deepti_Prasad · October 31, 2023, 6:55am

Arvy,

What about auditor who really pronounce very badly with a noisy background especially for medicolwegal files where we have heard Ummm but was actually a word ulnar(medical terminology )… would speech recognition help in such case???

Just asking based on your description because we had files where these software would transcribe all these noisy background with different signs like $$$ … or ~~‘’‘’

So in such case software was not doing a good job at transcribing these files.

arvyzukai · October 31, 2023, 8:34am

I’m no expert in this field (I have no prior experience in transcription) but I can offer my thoughts.

To my knowledge, full automation on speech recognition would not help in such cases. General models would certainly not work. But models fine-tuned specifically on certain dialects and on medical data might help but with human in the loop (at least doing constant error analysis on certain terms and iteratively fine-tuning the models, or as assistants for human transcribers).

The problem with language models is that they assign probabilities to certain terms.
This might be fine for casual conversations but medical language (I think) are full of terms that are use sparingly (there are a lot of “long tail” terms) and deep understanding of the field is needed to assign a certain word to the sound (with complex logic) that might be many of things.
Accents are also a challenge to language models and I think the medical field is full of different background (age, ethnicity, etc.) people.

Also even to us, humans, sometimes it’s hard to discern “hypothyroidism” vs “hyperthyroidism” (they at least to me sound quite similar but of course mean opposite things) and we need a couple of words or sentences further down the conversation to reassure us what was meant.

So, in summary, I think fine-tuned models might be a help of different degree to human transcribers, but would not work as a fully automated system (at least in the medical field I would not advise such systems today).

Just my thoughts

Gabrielamdm · December 23, 2024, 10:51pm

Hello, I find the exchange very interesting. My project is to create an AI agent that analyses reading fluency. In that case, it is very important that it transcribes both hits and misses. I need to measure three parameters: accuracy, speed, and prosody.

Topic		Replies	Views
Language Model and Sequence Generation is about audio? Sequence Models coursera-platform	15	509	January 19, 2025
A noise in the audio Machine Learning in Production	1	588	May 13, 2021
Subtitles are hard to follow for deaf people Machine Learning in Production	9	660	April 24, 2023
Speech to Speech system AI Discussions ai-discussions	3	66	January 25, 2025
Week 3 assignment 2 speech to text labeling question Sequence Models coursera-platform	1	495	October 5, 2022

What is the basic significance of speech recognition audio

Related topics