Possible solution error in Reinforcement Learning Quiz?

One of the questions on the Reinforcement Learning Introduction quiz in C3W3 of the ML Specialization asks the following:

You are using reinforcement learning to fly a helicopter. Using a discount factor of 0.75, your helicopter starts in some state and receives rewards -100 on the first step, -100 on the second step, and 1000 on the third and final step (where it has reached a terminal state). What is the return?

Based on the description, the following is my interpretation of it:

Actions: Start → Step 1 → Step 2 → Step 3 (Final)
Reward: 0 → (-100)(0.75) → (-100)(0.75^2) → (1000)(0.75^3)

The quiz grades that as an incorrect solution stating the following:
“Remember the first reward is not discounted.”

However, shouldn’t the first reward be the one for the first (starting) state, which in this case is 0?

Hello @nauman,

I hope you can stay with me for 20 seconds in this video starting from 1:07.

If you keep watching it until 1:25, you will find 4 rewards are listed. And if you map the narration with each of those rewards, you should see the following mappings:

0 → from state 4 you go to the left, we saw that the rewards you get would be zero on the first step from state 4
0 → zero from state 3
0 → zero from state 2
100 → 100 at state 1, the terminal state

Now we go back to the quiz: “your helicopter starts in some state and receives rewards -100 on the first step”. Following the logic above, the grid for “some state” should have a -100, and then the grid for the next state has another -100. I think this addresses the difference from your understanding.

Also, I want to point out that the explanation is actually quite a good one. Let me explain. We assign different orders of the discount factor as coefficients to the sequence of rewards. The “different orders” part makes the discounting stronger in future steps of the sequence, and this achieves our goal already - we want to penalize future rewards so that, among all possible paths to the destination, the shortest one wins. However, is it useful at all to penalize the first reward? At least in the examples shown in the lectures and in this quiz, it is not useful at all. Of course I understand we want to stick with the definition, and my response before this paragraph has done that, but this paragraph is just trying to give you a different perspective.
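
If it helps, here is a minimal sketch of the return definition (my own illustration, not part of the quiz): the reward collected on the very first step gets coefficient 0.75^{0} = 1, and each later reward gets one more power of the discount factor.

```python
def discounted_return(rewards, gamma):
    # G = R1 + gamma*R2 + gamma^2*R3 + ...  (the first reward is not discounted)
    return sum(r * gamma ** t for t, r in enumerate(rewards))

# Quiz example: rewards on the first, second and third steps, with gamma = 0.75
print(discounted_return([-100, -100, 1000], gamma=0.75))  # -100 - 75 + 562.5 = 387.5
```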

Cheers,
Raymond

Hi @nauman
Welcome to the community!
In addition to what @rmwkwok said:

The first step is not multiplied by the discount factor, or in other words, the power of the discount factor on step 1 is zero, so it is equal to 1, like in this image

Cheers,
Abdelrahman

Thank you @rmwkwok and @AbdElRhaman_Fakhry for the prompt replies.

Correct me if I'm wrong, but I now understand that the question applies a -100 to the “some state”, which is assumed to be the starting state. I'm assuming that's also why the starting state is not discounted. However, that leads me to two questions:

  1. When it mentions “the first reward is not discounted”, does that refer to the reward associated with the starting state? Or the state following the action?

  2. If it is the former, then what is the purpose of a reward given to the starting state? Isn't that redundant, considering that regardless of which action you take, that reward will always be included in every possible return?

Appreciate your responses

Thanks @AbdElRhaman_Fakhry

If that's the case, then why is the following an incorrect solution?

The quiz says the correct solution is -100 + 0.75*(-100) + 0.75^2*1000. But if we’re moving to the “first step” that has a reward of -100, shouldn't that reward be discounted by 0.75?

Starting state (no reward, no discount) → first step → second step → third step
0 - 0.75*100 - 0.75^2*100 + 0.75^3*1000

Why is that an incorrect answer according to the quiz? I applied the same logic to the next question in the quiz and got the correct answer:

Sorry for spamming these questions, I'm just trying to figure out the correct logic here.

@AbdElRhaman_Fakhry

Why should the value of the first step be -100*(0.75^{0})? In the course video example starting from 6:27, when starting from state 4 and ending at state 6:

Start (4) → state 5 → state 6

Dr. Ng applies a discount of 0.5^{1} to the first state, not 0.5^{0}.

So 0 + 0.5^{1}*0 + 0.5^{2}*40, same as what I applied to my initial solution.

Please check this and I will continue with you, @nauman.

The starting state is the state that you start from (here the quiz says “the first step”, meaning the current step). You can get a reward from it, but without a discount.

“The first reward is not discounted” means that its discount factor is 0.75 to the power 0, which equals 1 ( 0.75^0 * -100 = -100 ).

Cheers,
Abdelrahman

@nauman
The starting state (the first step, as mentioned here) gets no discount, because the first step should be -100*0.75^{0} and 0.75^{0} = 1. The second step = -100*0.75^{1}, the third step = 1000*0.75^{2}, etc.
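
Summing those terms up (just a quick check on my side, not quoted from the quiz): -100*0.75^{0} - 100*0.75^{1} + 1000*0.75^{2} = -100 - 75 + 562.5 = 387.5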

Please review what I updated, and sorry for the mistake.
Cheers,
Abdelrahman

Thanks @AbdElRhaman_Fakhry.

That sounds about right. I guess I misunderstood the quiz question. The first step is the starting state in that case.

Is there a reason why the first state is given a -100 reward? What is the purpose of that since no action/step is taken before the starting step?

@nauman, according to what I updated…

In this image


He says that if we started at state 1, then Q(1, left) would be 100*0.5^{0} = 100.
Sorry for the mistake I made before.

The reason is that effort is made to reach each later step, so we discount the reward we get there to account for that effort, and the discounting compounds at every step. At the starting step we haven't made any effort yet, since we didn't move and just collect the reward, so the discount is to the power zero.

Also, I personally recommend reading this topic written by @Christian_Simonis. It talks about the intuition behind the discount factor in reinforcement learning (see the small sketch after the quote):

The discount factor, 𝛾, is a real value ∈ [0, 1] that determines how much the agent cares about immediate versus future rewards. Let’s explore the two following cases:

  1. If 𝛾 = 0, the agent cares only about its first (immediate) reward.
  2. If 𝛾 = 1, the agent cares about all future rewards equally.

Source

A reward that you get as soon as possible is simply worth more than a reward in the future, especially the longer the horizon is and the higher the uncertainty.

Note that the discounting concept is also known from finance, e.g. bringing future cash flows to a present value to account for opportunity cost. This concept is quite similar. Feel free to take a look.
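
To make the two cases above concrete, here is a small sketch (my own example, reusing the quiz rewards only as an illustration) of how the choice of the discount factor changes the return:

```python
def discounted_return(rewards, gamma):
    # G = R1 + gamma*R2 + gamma^2*R3 + ...
    return sum(r * gamma ** t for t, r in enumerate(rewards))

rewards = [-100, -100, 1000]           # the quiz rewards, used here only as an illustration
for gamma in (0.0, 0.5, 0.75, 1.0):
    print(gamma, discounted_return(rewards, gamma))

# gamma = 0.0 -> -100.0  (only the immediate reward counts)
# gamma = 1.0 ->  800.0  (all rewards count equally)
```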

Cheers,
Abdelrahman
