Q(s,a) and Policies

I’m halfway through the third week, but I just want to make sure I’m understanding these concepts correctly.

We were first introduced to Policies and later on to Q(s,a).

Q(s,a) is a function that, given a state and an action, returns the return you get if you take that action and then follow the “optimal path” afterwards (earlier videos presented various policies).

But isn’t a policy just the max of Q(s,a) for each state, a.k.a. the optimal path?

I just think it is a bit confusing that we were first introduced to policies and then to Q(s,a) when, from my understanding, policies are the result of taking the max of Q(s,a) for each state.

And also once going in one direction, you always go in that same direction, at least in the cases presented in the videos. So Policies are always calculated after Q(s,a) for every state?

Thanks in advance

Hi @jjdniz

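A policy defines the agent’s behavior in the environment: it maps each state to the action to take. Q-values represent the expected return for taking an action in a given state and behaving optimally afterwards. A policy can be derived from Q-values by selecting the action with the highest Q-value in each state (the argmax over actions, rather than the max value itself).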
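To make the max versus argmax distinction concrete, here is a minimal sketch. The Q-table, state names, and numbers are made up for illustration and are not taken from the course; `greedy_policy` is a hypothetical helper name.

```python
# Hypothetical Q-table: q[state][action] = expected return (illustrative numbers only).
q = {
    "s1": {"left": 50.0, "right": 12.5},
    "s2": {"left": 25.0, "right": 6.25},
    "s3": {"left": 12.5, "right": 10.0},
    "s4": {"left": 6.25, "right": 20.0},
}

def greedy_policy(q_table):
    """Map each state to the action with the highest Q-value (argmax over actions)."""
    return {state: max(actions, key=actions.get) for state, actions in q_table.items()}

# max over actions gives a value per state; argmax gives the action, i.e. the policy.
best_values = {state: max(actions.values()) for state, actions in q.items()}
policy = greedy_policy(q)

print(best_values)
print(policy)
```

So `best_values` holds numbers (the best achievable return per state), while `policy` holds actions. The policy is what the agent actually follows.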

In many cases, policies are calculated from Q-values, but a policy can also be represented directly, for example as a lookup table or a parameterized function from states to actions, without explicitly computing Q-values.
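As a rough sketch of a policy that exists without any Q-values, the snippet below writes a policy down by hand and simply follows it. The 6-state chain, the rewards, and the discount factor are assumptions chosen for illustration, and `rollout_return` is a hypothetical helper.

```python
# Hypothetical 6-state chain: states 0..5, with terminal states at both ends.
REWARDS = [100.0, 0.0, 0.0, 0.0, 0.0, 40.0]  # assumed reward received in each state
GAMMA = 0.5                                   # assumed discount factor
TERMINAL = (0, 5)

# A policy written down directly: state -> action, no Q-values involved.
policy = {1: "left", 2: "left", 3: "left", 4: "right"}

def rollout_return(state, policy):
    """Discounted return from following `policy` from `state` until a terminal state."""
    total, discount = 0.0, 1.0
    while True:
        total += discount * REWARDS[state]
        if state in TERMINAL:  # terminal states end the episode
            return total
        state += -1 if policy[state] == "left" else 1
        discount *= GAMMA

returns = {s: rollout_return(s, policy) for s in policy}
print(returns)
```

Here the policy comes first and the returns are computed from it, which is the reverse of deriving a policy from Q-values.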