I think you’re just overthinking this. With two trigger words, either one triggers, so if both appear in sequence, the second trigger simply overrides the first and you take whatever comes after it as the actual command. With a phrase, the trigger has two stages: the first word gets you to stage one, and the second word only triggers if you’re already in stage one, although we’d probably also need an expiry on stage one.
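To make that concrete, here is a rough sketch of both rules, assuming the model's output has already been turned into a stream of recognized words (with timestamps for the phrase case). The function names, the `expiry` parameter, and the example words are all made up for illustration; none of this comes from the course notebook.

```python
from typing import Iterable, List, Optional, Sequence, Set, Tuple


def command_after_last_trigger(
    words: Sequence[str], triggers: Set[str]
) -> Optional[List[str]]:
    """Two independent trigger words: either one fires, and a later
    trigger overrides an earlier one. Returns the words after the last
    trigger, or None if no trigger word was heard."""
    last = None
    for i, word in enumerate(words):
        if word in triggers:
            last = i  # a later trigger replaces the earlier one
    return None if last is None else list(words[last + 1:])


def detect_phrase(
    events: Iterable[Tuple[float, str]],
    first: str,
    second: str,
    expiry: float,
) -> List[float]:
    """Two-word trigger phrase as a two-stage state machine.

    events are (timestamp_in_seconds, word) pairs in time order.
    Hearing `first` moves us to stage one; `second` only fires if we
    are still in stage one, and stage one lapses after `expiry` seconds
    (a hypothetical tuning knob). Returns the timestamps at which the
    phrase fired."""
    armed_at: Optional[float] = None  # time we entered stage one, if any
    fired: List[float] = []
    for t, word in events:
        if armed_at is not None and t - armed_at > expiry:
            armed_at = None  # stage one expired
        if word == second and armed_at is not None:
            fired.append(t)
            armed_at = None  # reset after firing
        elif word == first:
            armed_at = t  # enter (or refresh) stage one
    return fired
```

For example, `command_after_last_trigger(["hey", "computer", "activate", "lights", "on"], {"computer", "activate"})` gives back `["lights", "on"]`, because the second trigger wins. For the phrase case, the second word on its own, or arriving after the expiry, does nothing.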
If your experiments (which I didn’t listen to) didn’t work, then it’s probably the usual issue: nothing we do here is realistic, in the sense that we can’t afford to train a genuinely valid model in this environment. The resources are too constrained. Witness the very first “is there a cat” experiment in DLS C1 W4: it works terribly on your own pictures. 209 samples is a ludicrously small training set for a problem that complex.