I am not getting how this model learns Q(s, a) without giving the action as input along with the state. Q(s, a) depends on the action, yet we pass no information about the action.
In the previous architecture we gave the action along with the state as input, and the model produced Q(s, a) as the output.
I think this is accomplished through the training process and the design of the neural network. The network uses the current state and the rewards it observes to learn: it adjusts its weights so that the predicted Q-values (which estimate the expected return) get as close as possible to the target Q-values.
The output layer has 4 units because there are exactly 4 actions. Each unit represents one Q(s, a); for example, the second unit represents Q(s, a = 2), where a = 2 denotes the second action.
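To make the shape concrete, here is a minimal sketch (not the course's actual model) of a network that takes only the state and emits one Q-value per action. The sizes (8 state features, 64 hidden units) are assumptions for illustration:

```python
import numpy as np

# Randomly initialized weights for a one-hidden-layer network.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(8, 64))   # state (8 features) -> hidden
b1 = np.zeros(64)
W2 = rng.normal(scale=0.1, size=(64, 4))   # hidden -> 4 Q-values
b2 = np.zeros(4)

def q_values(state):
    """Forward pass: state vector -> vector of Q(s, a) for all 4 actions."""
    h = np.maximum(0.0, state @ W1 + b1)   # ReLU hidden layer
    return h @ W2 + b2                     # 4 outputs, one per action

s = rng.normal(size=8)                     # an example state
q = q_values(s)
print(q.shape)                             # (4,) - one Q-value per action
```

Note that the action never appears as an input; a single forward pass yields Q(s, a) for every action at once.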
So a is not missed out: it is used to select the relevant one of the four Q-value outputs when computing the loss.
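Here is a sketch of how the action enters training even though it is not a network input: the TD target y = r + gamma * max over a' of Q(s', a') is compared only against the output unit of the action actually taken. The numeric values below are made up for illustration:

```python
import numpy as np

q_pred = np.array([1.0, 2.5, 0.3, -0.7])   # Q(s, a) for the 4 actions
q_next = np.array([0.5, 1.2, 2.0, 0.1])    # Q(s', a') for the 4 actions
a, r, gamma = 1, 1.0, 0.99                 # action taken, reward, discount

y = r + gamma * q_next.max()               # target Q-value: 1.0 + 0.99 * 2.0
loss = (q_pred[a] - y) ** 2                # only the unit for action a
                                           # contributes to the loss
print(round(y, 3), round(loss, 4))         # 2.98 0.2304
```

The gradient of this loss only flows through output unit a, so each of the four units gradually specializes in estimating the value of its own action.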