Reinforcement Learning - The State-action value function

I was running simulations in the optional lab “State-action value function” on reinforcement learning, and I have a question about “ties” or “ambiguities” in policy evaluation.

When there is an odd number of states and you take a middle state as the starting point, such as state 3 (out of 5), the rewards and transitions can produce ties or ambiguities between the possible actions. In other words, multiple actions might lead to the same expected cumulative reward when starting from state 3.

For instance, from state 3 you consider two possible actions: move to state 2 or move to state 4. If the expected cumulative rewards for both actions are the same, you might be indifferent in choosing between the two actions, as they provide equal benefits.

This can pose challenges when establishing a policy, especially when you want to ensure deterministic decision-making. In such cases, you might need to consider additional factors, like exploring the long-term consequences of the chosen actions or incorporating stochasticity in your decision-making process to break ties.

In the context of the state-action value function Q(s, a), these ties or ambiguities could lead to situations where multiple actions have the same Q-value from a particular state. As a result, the policy might not be well-defined, and the machine’s behavior could be unpredictable in those situations.

To address this, reinforcement learning algorithms can incorporate exploration mechanisms or stochastic policies that introduce randomness to the decision-making process. These mechanisms help the machine explore different actions and make decisions even in scenarios with ties or ambiguities.
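To make the idea concrete, here is a minimal sketch of one such mechanism, an epsilon-greedy policy (the function name and the example Q-values are my own, not from the lab). With probability epsilon it explores a random action; otherwise it exploits, and when several actions share the maximal Q-value it samples uniformly among them rather than always taking the first:

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    # Explore: with probability epsilon, pick any action uniformly.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Exploit: break ties among maximal Q-values uniformly at random.
    best_q = max(q_values)
    best_actions = [a for a, q in enumerate(q_values) if q == best_q]
    return rng.choice(best_actions)

rng = random.Random(0)
# Actions 1 and 2 are tied at Q = 3.0; over repeated calls both get chosen.
picks = {epsilon_greedy([1.0, 3.0, 3.0, 0.5], 0.1, rng) for _ in range(200)}
```

With this rule the agent's behavior in a tie is no longer undefined: it is explicitly randomized, which also keeps both tied actions being visited and re-evaluated.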

I was wondering how policies are designed and how ties are handled in reinforcement learning when establishing decision-making strategies in environments with an odd number of states. The example in our lab has an even number of states.

Hello @Popa_Mihaela_Simona,

Since you have shared an analysis of the situation, I would like to dig into that with you.

Ties can happen even when the number of states is even. Consider a four-state case where the rewards are, from left to right, [5, 0, 0, 10], \gamma = 0.5, and we start from state 1 (zero-based). However, this example also shows that the rewards and \gamma both have to be set to just the right values for a tie to happen. Such an (arranged) “coincidence” may not be easy to observe in a complex, real-world problem, though I don’t claim it is impossible.
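We can check the tie with a few lines of arithmetic. This is a sketch, assuming (as in the lab's line-world setup) that states 0 and 3 are terminal, the reward is collected at the state being visited, and each action moves one step left or right:

```python
rewards = [5, 0, 0, 10]  # rewards at states 0..3; states 0 and 3 are terminal
gamma = 0.5

# From state 1, going left reaches the terminal reward 5 in one step:
q_left = rewards[1] + gamma * rewards[0]                       # 0 + 0.5 * 5 = 2.5
# Going right takes two steps to reach the terminal reward 10:
q_right = rewards[1] + gamma * rewards[2] + gamma**2 * rewards[3]
                                                               # 0 + 0 + 0.25 * 10 = 2.5
print(q_left, q_right)  # 2.5 2.5 -- an exact tie
```

So from state 1 both actions have the same return, exactly because 0.5 × 5 equals 0.5² × 10.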

This is what the Bellman equation is about: by adding up a series of discounted rewards, it accounts for the long-term consequences.
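For illustration, here is a minimal value-iteration sketch of the Bellman update V(s) = R(s) + \gamma \max_a V(s') for the four-state line world above (my own toy code, not the lab's implementation; I assume states 0 and 3 are terminal and the two actions step left or right):

```python
rewards = [5, 0, 0, 10]
gamma = 0.5
terminal = {0, 3}

# Start from zero values and repeatedly apply the Bellman update.
V = [0.0] * len(rewards)
for _ in range(100):
    new_V = []
    for s, r in enumerate(rewards):
        if s in terminal:
            new_V.append(float(r))  # terminal states keep their own reward
        else:
            # Two actions from s: step left (s-1) or right (s+1); take the better one.
            new_V.append(r + gamma * max(V[s - 1], V[s + 1]))
    V = new_V

print(V)  # [5.0, 2.5, 5.0, 10.0]
```

The converged value V(1) = 2.5 matches the tied returns of both actions from state 1, which is exactly how the long-term consequence shows up in the equation.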

That is to say, when there is a tie, randomly pick one action. It is a good way out! Another way would be to always pick the first action, given that the actions are ordered. However, the second way imposes a bias towards earlier actions, which may be good (if the earlier actions are more fail-safe) or bad (precisely because of the bias).
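Both tie-breaking rules fit in a few lines; this is a sketch using the tied Q-values from the four-state example (the action names are my own labels):

```python
import random

q = {"left": 2.5, "right": 2.5}  # tied Q-values from state 1

# Rule 1: "higher ranked first" -- deterministic; max() returns the
# first maximal key in order, so earlier actions are systematically favored.
first = max(q, key=q.get)        # "left"

# Rule 2: "randomly pick one" -- sample uniformly among the tied maximisers.
best_q = max(q.values())
tied = [a for a, v in q.items() if v == best_q]
chosen = random.choice(tied)     # "left" or "right", each with probability 1/2
```

Note that the common shortcut of taking an argmax silently implements Rule 1, since argmax-style functions return the first maximiser; if you want Rule 2 you have to collect the tied set explicitly, as above.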

The above “randomly pick one” and “higher ranked first” are simple rules that make the policy well-defined.

I think the above responses have provided two simple and useful strategies.


Thank you Raymond for the clarifications and support. I have a clearer picture now, especially since in the meantime I managed to complete the remaining videos and labs and gain a more in-depth understanding.