So, let’s say we have trained an RL model on data (X, y) that was collected only with the agent initialized at states 1, 2, 3, 4 and 6.

What does the agent do when it is initialized at state 5, for which we haven’t collected any data?

What obstacles does the agent face?

How does the RL model learn to tackle these obstacles or challenges?

In the training phase we already define all possible transitions and their rewards and punishments. So if the RL model finds itself at state 5 and goes left, the value Q(5, left) will decrease; if it goes right instead, the corresponding value will increase. In this way the model learns that, under conditions X, left is not a good step and it is better to go right, and so on throughout the training phase.
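The idea above can be sketched with a minimal tabular Q-learning loop. Everything here is an assumption for illustration: a 1-D chain of states 0..7, a goal at state 7 (+1 reward), a failure at state 0 (-1 reward), and training episodes that start only from states 1, 2, 3, 4 and 6 as in the question.

```python
import random

random.seed(0)

# Assumed 1-D chain: state 7 is the goal (+1), state 0 is a failure (-1).
ACTIONS = ("left", "right")
GOAL, FAIL = 7, 0
START_STATES = [1, 2, 3, 4, 6]  # training episodes never *start* at state 5

def step(s, a):
    """Deterministic transition with terminal rewards at the two ends."""
    s2 = s - 1 if a == "left" else s + 1
    if s2 == FAIL:
        return s2, -1.0, True
    if s2 == GOAL:
        return s2, 1.0, True
    return s2, 0.0, False

Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.2

for _ in range(2000):
    s = random.choice(START_STATES)
    done = False
    while not done:
        # epsilon-greedy action selection
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        s2, r, done = step(s, a)
        target = r if done else r + gamma * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

# Episodes never start at 5, but they pass through it, so the agent still
# learns that going right from state 5 is better than going left.
print(round(Q[(5, "right")], 3), round(Q[(5, "left")], 3))
```

Note that in this toy setup state 5 is still *visited* during training (episodes starting at 4 pass through it), which is exactly why Q(5, ·) ends up populated.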

In addition to Abdelrahman’s answer: first, the model should still be able to suggest an action, but since it has never been trained in such a situation, there are at least 2 possibilities in terms of the quality of that action:

The other examples on which the model has been trained are sufficient for it to “interpolate” the missing piece about starting from state 5;

The model didn’t learn to predict well when starting from state 5.
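The first possibility can be illustrated with a toy function approximator. This is a hedged sketch whose numbers are all assumptions, not from the answer: the true values are taken to be V(s) = gamma**(7 - s), the discounted distance to a goal at state 7, and a straight line fit to states 1–4 and 6 still produces a sensible estimate for the unseen state 5 simply because it generalizes over the state index.

```python
# Assumed ground truth: V(s) = gamma**(7 - s) on a 1-D chain with a goal
# at state 7. We fit V(s) ~ w*s + b on the states seen in training only.
gamma = 0.9
train_states = [1, 2, 3, 4, 6]          # state 5 is deliberately missing
ys = [gamma ** (7 - s) for s in train_states]

# Ordinary least squares by hand (no external libraries needed).
n = len(train_states)
x_mean = sum(train_states) / n
y_mean = sum(ys) / n
w = sum((x - x_mean) * (y - y_mean) for x, y in zip(train_states, ys)) \
    / sum((x - x_mean) ** 2 for x in train_states)
b = y_mean - w * x_mean

v5_pred = w * 5 + b       # interpolated value for the unseen state 5
v5_true = gamma ** 2      # what the assumed true value actually is
print(round(v5_pred, 3), round(v5_true, 3))
```

Whether the second possibility (a poor prediction) occurs instead depends on how representative the trained states are of state 5; a tabular model with no visits to state 5 at all would have nothing to interpolate from.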

If it is an online RL model, that example (starting from state 5) is also queued for training, and your model can improve from there.
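A minimal sketch of that online case (the buffer, rewards, and state layout are assumptions for illustration, not part of the answer): transitions generated when the deployed agent starts from the unseen state 5 are appended to a queue and replayed to update the Q-table.

```python
from collections import deque

ACTIONS = ("left", "right")
Q = {(s, a): 0.0 for s in range(8) for a in ACTIONS}
alpha, gamma = 0.5, 0.9

# Queue of experience gathered at deployment time.
buffer = deque(maxlen=10_000)

# Suppose the deployed agent starts at the unseen state 5, goes right to 6,
# then right again into a goal state 7 with reward +1 (assumed rewards).
buffer.append((5, "right", 0.0, 6, False))
buffer.append((6, "right", 1.0, 7, True))

# Replay the queued transitions a few times: Q(5, right) starts to improve
# even though state 5 never appeared in the original training data.
for _ in range(3):
    for s, a, r, s2, done in buffer:
        target = r if done else r + gamma * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

print(round(Q[(5, "right")], 3), round(Q[(6, "right")], 3))
```

Replaying more than once matters here: on the first pass Q(6, right) is still zero, so the reward only propagates back to state 5 on later passes.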