In Week 3, in the state-action value function part, the professor said that how to ‘compute the optimal behavior’ would be explained later. However, I am still wondering how we can get the ‘optimal behavior’ before using the Bellman equation to calculate Q, since computing Q itself requires max Q(s’, a’) over a’.
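You may share the video name and the time mark for the part you are asking about, but the general idea is this: we start with a random initial Q (which can be very wrong), then train Q progressively until it becomes good. This breaks the apparent circularity, because each update uses the current estimate of max Q(s’, a’) rather than the true one. Once Q is good, we can use it to choose the optimal behavior: in each state s, take the action a with the largest Q(s, a).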
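Here is a minimal sketch of the "start random, improve repeatedly" idea, under stated assumptions. It uses a hypothetical six-state corridor (similar in spirit to the rover example from the lectures): states 0 to 5, terminal rewards of 100 and 40 at the two ends, actions "left"/"right", and a discount factor of 0.5. All of these numbers are assumptions chosen for illustration. Because the transitions are known and deterministic here, the sketch uses plain tabular value iteration, i.e. it repeatedly applies Q(s, a) = R(s) + γ max_a’ Q(s’, a’) to every entry. This is a stand-in for, not a copy of, the neural-network method the course builds toward later, which learns from sampled experience instead of a known model.

```python
import random

# Hypothetical toy environment (assumed values, for illustration only).
REWARDS = [100, 0, 0, 0, 0, 40]   # reward R(s) for each state 0..5
TERMINAL = {0, 5}                  # episodes end at either end of the corridor
ACTIONS = ("left", "right")
GAMMA = 0.5                        # assumed discount factor


def next_state(s, a):
    """Deterministic transition: move one step left or right."""
    return s - 1 if a == "left" else s + 1


def compute_q(num_sweeps=50, seed=0):
    """Start from a random Q table and repeatedly apply the Bellman equation."""
    rng = random.Random(seed)
    # Arbitrary (and probably very wrong) initial guess for every Q(s, a).
    q = {(s, a): rng.uniform(-100, 100) for s in range(6) for a in ACTIONS}
    for _ in range(num_sweeps):
        new_q = {}
        for s in range(6):
            for a in ACTIONS:
                if s in TERMINAL:
                    # At a terminal state the return is just the reward.
                    new_q[(s, a)] = REWARDS[s]
                else:
                    s2 = next_state(s, a)
                    # Uses the *current* estimate of max_a' Q(s', a'),
                    # which is how the circular definition gets resolved.
                    best_next = max(q[(s2, a2)] for a2 in ACTIONS)
                    new_q[(s, a)] = REWARDS[s] + GAMMA * best_next
        q = new_q
    return q


def greedy_policy(q):
    """Optimal behavior once Q is good: pick argmax_a Q(s, a) in each state."""
    return {
        s: max(ACTIONS, key=lambda a: q[(s, a)])
        for s in range(6)
        if s not in TERMINAL
    }


if __name__ == "__main__":
    q = compute_q()
    for s in range(6):
        print(s, {a: round(q[(s, a)], 2) for a in ACTIONS})
    print(greedy_policy(q))
```

With γ = 0.5 the error in the table shrinks by at least half on every sweep, so the random starting values stop mattering after a few dozen sweeps. In this toy setup the resulting policy goes left from states 1 to 3 and right from state 4.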