In the exercise, all the training and test data are single-word speech clips to be detected. In practice, though, don't people speak in whole sentences, from which the trigger words have to be detected? For example, when people are chatting casually at home, does the algorithm need to detect whether a trigger word has been spoken somewhere in the conversation?
I am confused about this.