VAD and speech recognition model training

Hi, Week 1 introduced VAD (voice activity detection) as part of the pipeline of a speech recognition system. Later, it was mentioned that changing the VAD component could affect model performance in production. My question is: given this effect, why isn't VAD involved in model training in the first place, e.g. in preparing the training data?

The goal of VAD is to isolate the user's speech in the audio and send it to the prediction server. The VAD's objective could be to predict the start and end times of the speech so that this segment can be cut out of the input audio.
The model on the prediction server is trained on the actual audio to provide the response, as shown in the lecture.
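
To make the "predict start and end times" idea concrete, here is a minimal sketch of a VAD. It assumes a simple frame-energy threshold, which is only a stand-in: the lecture does not say how its VAD works, and a production VAD would more likely be a learned model. The names `detect_speech_span` and `trim_to_speech`, the 20 ms frame size, and the threshold value are all made up for illustration.

```python
import math

def detect_speech_span(samples, sample_rate, frame_ms=20, threshold=0.02):
    """Return (start_sample, end_sample) covering every frame whose RMS
    energy exceeds `threshold`, or None if no frame does.
    Assumption: energy thresholding stands in for a real VAD model."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    active_starts = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms > threshold:
            active_starts.append(start)
    if not active_starts:
        return None
    # Span runs from the first active frame to the end of the last active frame.
    end = min(active_starts[-1] + frame_len, len(samples))
    return active_starts[0], end

def trim_to_speech(samples, sample_rate, **vad_kwargs):
    """Keep only the detected span; this is what would be sent onward."""
    span = detect_speech_span(samples, sample_rate, **vad_kwargs)
    if span is None:
        return []
    start, end = span
    return samples[start:end]

# Toy input at 1 kHz: 0.1 s of silence, 0.2 s of a 50 Hz tone, 0.1 s of silence.
sample_rate = 1000
tone = [0.5 * math.sin(2 * math.pi * 50 * n / sample_rate) for n in range(200)]
signal = [0.0] * 100 + tone + [0.0] * 100

span = detect_speech_span(signal, sample_rate)
print(span)                                     # sample indices of the span
print(tuple(s / sample_rate for s in span))     # same span in seconds
print(len(trim_to_speech(signal, sample_rate))) # samples passed downstream
```

Running the same `trim_to_speech` step over the training clips is one way the VAD could be involved in preparing training data, so that the model sees audio trimmed the same way it will be in production.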

It’s possible to train the two components separately or to train them together (multi-task learning).

The advice is: when there’s a skew between the training and test distributions, retrain the model.

Maybe I can ask a question about the goal of VAD: is it only to prepare the audio for the prediction server?

