In video 2 of AI4E - Week 3 - Case Study: Smart Speaker, step 2 is “Speech recognition”, where we do an A->B mapping from an audio file to the text “tell me a joke”. My doubt is that a user can ask for this, or rephrase it, in many different ways. So do we need to build the dataset and train the model with all such possible statements that have the same meaning as “tell me a joke”? The same issue comes up in the next step, step 3, “Intent recognition”: one thing can be asked in multiple ways. Please explain; I am not able to visualise how this will work.
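To make my question concrete, here is a rough sketch (my own toy code, not from the course) of what I imagine Intent recognition doing: many phrasings all mapping to one intent label. Is this roughly the idea, with the hard-coded dictionary replaced by a trained model?

```python
# Toy sketch of "many phrasings -> one intent" (my own example, not
# from the course). A real system would learn this mapping from
# labelled examples instead of using a hard-coded dictionary.

TRAINING_EXAMPLES = {
    "tell me a joke": "tell_joke",
    "make me laugh": "tell_joke",
    "say something funny": "tell_joke",
    "i want to hear a joke": "tell_joke",
}

def recognize_intent(text: str) -> str:
    """Map the text from step 2 (Speech recognition) to an intent label."""
    return TRAINING_EXAMPLES.get(text.lower().strip(), "unknown")

print(recognize_intent("Make me laugh"))   # -> tell_joke
print(recognize_intent("Tell me a joke"))  # -> tell_joke
```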
A follow-up question on the same session … the steps shared for smart speakers are:
- Wakeword detection
- Speech recognition
- Intent recognition
- Execute Joke
Can we also replace step 2, i.e. Speech recognition, with the following?
- Wakeword detection (via A->B mapping)
- NLP - Speech-to-text (to skip/avoid the task-specific A->B mapping in Speech recognition and still get the text)
- NLP - Text translation to English if the speech is in another language
- Pass the extracted text to Intent recognition
- Execute Joke
Is this also correct?
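To show what I mean, here is a toy sketch of this alternative pipeline. To keep it runnable, the “audio” is faked as a plain string, and every function here is my own placeholder, not any real speech API:

```python
# Toy sketch of the alternative pipeline I am proposing (all names are
# my own placeholders; "audio" is faked as a plain transcript string).

def wakeword_detected(audio: str) -> bool:
    """Step 1: wakeword detection (the A->B mapping I would keep)."""
    return audio.lower().startswith("hey device")

def speech_to_text(audio: str) -> str:
    """Step 2: generic speech-to-text instead of a task-specific mapping."""
    return audio.lower().removeprefix("hey device").strip()

def translate_to_english(text: str) -> str:
    """Step 3: translate to English if needed (toy: a tiny lookup table)."""
    phrases = {"raconte-moi une blague": "tell me a joke"}
    return phrases.get(text, text)

def recognize_intent(text: str) -> str:
    """Step 4: map the extracted text to an intent label."""
    return "tell_joke" if "joke" in text else "unknown"

def execute_joke() -> None:
    """Step 5: carry out the recognised intent."""
    print("Why did the model cross the road? To minimise the loss.")

def handle(audio: str) -> None:
    if not wakeword_detected(audio):
        return
    text = translate_to_english(speech_to_text(audio))
    if recognize_intent(text) == "tell_joke":
        execute_joke()

handle("Hey device raconte-moi une blague")  # prints the joke
```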