How can we compute the 'optimal behavior' in a state-action value function?

In Week 3, in the state-action value function part, the professor said that how to 'compute the optimal behavior' would be explained later. However, I am still wondering how we get the 'optimal behavior' before we use the Bellman equation to calculate Q, since Q itself uses the maximum of Q(s', a') over a'.

Could someone help please? Thanks.

Hi @zhongli,

It would help if you could share the video name and the timestamp of the part you are asking about. The general idea, though, is that we do not need the optimal behavior up front. We assume a random initial Q (which can be very wrong), train it progressively until it becomes good, and then use it to predict the optimal behavior by picking, in each state, the action with the largest Q.
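To make that idea concrete, here is a minimal sketch, not the course's actual training code. It assumes a hypothetical 6-state chain with terminal rewards at both ends and a discount factor of 0.5, and it uses a plain table that is repeatedly overwritten with the Bellman right-hand side (a known-model sweep) instead of a neural network. The names `REWARDS`, `TERMINAL`, `train_q`, and `greedy_policy` are made up for illustration.

```python
import random

# Hypothetical toy problem (an assumption, not taken from this thread):
# six states in a row, the two end states are terminal.
REWARDS = [100.0, 0.0, 0.0, 0.0, 0.0, 40.0]
TERMINAL = {0, 5}
ACTIONS = (-1, +1)  # index 0 = move left, index 1 = move right
GAMMA = 0.5         # discount factor


def train_q(sweeps=50, seed=0):
    """Start from a random Q table and refine it with the Bellman update."""
    rng = random.Random(seed)
    n_states = len(REWARDS)
    # Random initial guess: these numbers are deliberately meaningless.
    q = [[rng.uniform(-100.0, 100.0) for _ in ACTIONS] for _ in range(n_states)]
    for _ in range(sweeps):
        for s in range(n_states):
            for i, step in enumerate(ACTIONS):
                if s in TERMINAL:
                    # Nothing follows a terminal state, so Q is just the reward.
                    q[s][i] = REWARDS[s]
                else:
                    s_next = s + step
                    # Bellman update: uses the CURRENT (possibly wrong) guess
                    # for max over a' of Q(s', a'), not the true optimum.
                    q[s][i] = REWARDS[s] + GAMMA * max(q[s_next])
    return q


def greedy_policy(q):
    """Read the behavior off the table: take the action with the larger Q."""
    return ["left" if row[0] >= row[1] else "right" for row in q]


if __name__ == "__main__":
    q = train_q()
    for s, row in enumerate(q):
        print(s, [round(v, 2) for v in row])
    print(greedy_policy(q))
```

The point to notice is that `max(q[s_next])` is always taken over the current estimate. Early on that estimate is wrong, but each sweep halves the worst-case error (because of the 0.5 discount), so the table settles on consistent values and the greedy policy read from it becomes the optimal one.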