Just looking for a little clarification on the example and the number that Q(s,a) produces depending on whether we are moving left or right.
I can’t seem to figure out where 6.25 comes from for state 3 moving right; given that the terminal state reward on the right is 40, I calculate that this number should be 5.0.
Maybe I have misunderstood the math behind this, but here is how I calculated it from this state: moving right for 2 actions, then behaving optimally after those, with gamma = 0.5 and misstep_prob = 0.
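Spelling out the arithmetic of that (mistaken) calculation, i.e. two steps right from state 3 and then continuing right to the 40-reward terminal state:

$$R(3) + \gamma R(4) + \gamma^2 R(5) + \gamma^3 R(6) = 0 + 0.5 \cdot 0 + 0.25 \cdot 0 + 0.125 \cdot 40 = 5.0$$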
All other numbers seem to check out, so maybe I missed something.
Any clarification is appreciated.
Thanks.
Edit 1: Ok, so I realize that I jumped the gun slightly, as I moved straight from the state-action value function lesson to the state-action value function example lab.
I am now working through the Bellman equation lesson and can see that some vital parts of the puzzle were missing from my calculation above. I am still finding this rather confusing, though, as I still seem to get the same answer even with the new equation:
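$$Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a')$$

where R(s) is the reward at the current state, γ is the discount factor, and s′ is the state reached after taking action a.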
In the state-action value function lesson, at timestamp 1:10, we can see the setup as described above, but the lab example from the video shows 6.25 for state 3, moving right.
This is the value that I calculate to be 5.0, so I think I must have done something wrong.
So I just completed the state-action value quiz and initially got the last question wrong.
This was because I did not at first recognize that the direction of movement would change after the initial action, since the algorithm then “behaves optimally after that.”
I think the same thing may be happening with my understanding of the lab task. It’s not so much the maths as understanding the purpose of num_actions = 2 and at what point the direction may change in order to behave optimally, so as to get the values displayed in the lab task.
The error was in my misunderstanding of num_actions = 2. Although the lectures explicitly state that the action happens only once, because this variable is given the value 2 I thought it meant that, whichever direction is chosen from whichever starting state, the rover would move in that direction twice and only then begin behaving optimally.
After recalculating the above, but moving only once to the right and then moving optimally (to the left), I do indeed get the value of 6.25.
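Written out, taking one step right from state 3 and then behaving optimally (heading back left towards the 100 reward at the left terminal state):

$$Q(3, \rightarrow) = 0 + 0.5 \cdot 0 + 0.25 \cdot 0 + 0.125 \cdot 0 + 0.0625 \cdot 100 = 6.25$$

or equivalently, in one step of the Bellman equation, Q(3, →) = R(3) + 0.5 · max_a′ Q(4, a′) = 0 + 0.5 · 12.5 = 6.25, where 12.5 is the lab’s value for Q(4, ←).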
I guess the only question that remains, then, is: what does num_actions = 2 actually mean, and how does it affect the outcome of the algorithm?
We assigned the value 2 to the variable num_actions, but the variable is never used in the rest of the optional lab, so it has no effect on the outcome of the algorithm. You can search for num_actions in the C3 W3 optional lab and find that it only appears once, in num_actions = 2; the variable itself is never used anywhere else.
It means “there are 2 actions”: going left or going right.
However, in the assignment, num_actions is set to 4, and it is actually used to build the neural network that predicts the Q-value for each of the 4 actions.
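For anyone curious how that looks in practice, here is a minimal sketch of the idea: the size of the network’s output layer comes from num_actions, so one forward pass produces a Q-value for every action. The hidden-layer sizes and the state dimension below are placeholder assumptions, not necessarily the assignment’s exact architecture.

```python
import tensorflow as tf

num_actions = 4   # the assignment's environment has 4 discrete actions
state_size = 8    # assumed dimension of the state vector (placeholder)

# The output layer has num_actions units, so the network predicts
# one Q(s, a) value for every possible action at once.
q_network = tf.keras.Sequential([
    tf.keras.Input(shape=(state_size,)),
    tf.keras.layers.Dense(64, activation='relu'),   # placeholder hidden layers
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(num_actions, activation='linear'),
])

# Acting greedily: evaluate the Q-values for a state and pick the argmax.
state = tf.random.uniform((1, state_size))
q_values = q_network(state)                 # shape: (1, num_actions)
best_action = int(tf.argmax(q_values, axis=-1)[0])
print(q_values.numpy(), best_action)
```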
I understand now that num_actions = 2 simply reflects the fact that the Mars Rover discrete example has only 2 possible actions, namely left or right.
Not sure why all this didn’t fall into place sooner.