Hello! I am still in the early days of my NN adventures, and I am curious whether anyone here could help me understand if my intuition about how to solve a problem is correct. And if anyone could point me at resources or papers that have already dealt with this, all the better.
I am exploring the use of neural networks for reinforcement learning tasks and am currently building a Deep Q-Learning network to play a game (seems like a good way to learn).
In each new state, the network takes an observation from the game as input and outputs one linear node per possible (discrete) action (no activation function), where each node represents the expected reward (score) for that action. The system then selects an action, and I compute the loss by comparing the expected reward to the actual reward. It's the next step I am confused about: should I backpropagate only through the output node corresponding to the action I actually took (since I am only taking one action at a time)? Or am I missing some essential truth about generalized NN learning here?
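For what it's worth, here is a minimal NumPy sketch of one common way DQN implementations handle this: build a per-action target vector that equals the network's own predictions everywhere except at the taken action, which gets the TD target. The squared error is then zero for the untaken actions, so no gradient flows through their output nodes. The function name and the numbers below are made up for illustration:

```python
import numpy as np

def td_target_for_update(q_pred, action, reward, q_next, gamma=0.99, done=False):
    """Build a target vector so only the taken action's node produces a gradient.

    q_pred: network outputs Q(s, .) for the current state (one entry per action)
    q_next: network outputs Q(s', .) for the next state
    The target copies q_pred, then overwrites the taken action's entry with
    the TD target r + gamma * max_a' Q(s', a') (just r if the episode ended).
    """
    target = q_pred.copy()
    bootstrap = 0.0 if done else gamma * np.max(q_next)
    target[action] = reward + bootstrap
    return target

q_pred = np.array([1.0, 2.5, 0.3])   # Q-values for the current state
q_next = np.array([0.5, 1.0, 0.2])   # Q-values for the next state
target = td_target_for_update(q_pred, action=1, reward=1.0, q_next=q_next)

# Per-action squared error: nonzero only at the taken action's index.
err = (target - q_pred) ** 2
```

With an autodiff framework you'd get the same effect by gathering only the taken action's Q-value and computing the loss on that scalar, rather than constructing a full target vector.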