Why are only single words used in Trigger Word Detection?

In the exercise, all of the training and test data are single spoken words to be detected. In practice, don't people speak whole sentences, from which the trigger word has to be picked out? For example, when people are chatting casually at home, doesn't the algorithm need to detect whether a trigger word was spoken somewhere in the conversation?

I am confused about this.

This is a simple example, because training to recognize an entire spoken sentence would take too much processing time,