“You are using reinforcement learning to fly a helicopter. Using a discount factor of 0.75, your helicopter starts in some state and receives rewards -100 on the first step, -100 on the second step, and 1000 on the third and final step (where it has reached a terminal state). What is the return?”
My confusion: If I have an agent that starts in a particular state….
a. Why is the terminology for starting in that state that it “receives rewards -100 on the first step”? In my mind, it’s not a step - it’s a state - and rewards should be calculated against potential actions. Is that just the way the algorithm is defined? It is at least consistent with the lecture.
b. Why, logically, do we place a reward on that starting state? Is it because the agent may have been placed in an advantageous spot to start (e.g., the Mars rover landed on a pot of gold)?
@TMosh that sounds logical to me, but I looked back at the lecture and, matching it up with the quiz, it seems there is a concept of an undiscounted reward in the starting state. I’ll attach my quiz result, which reinforces this.
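For reference, here is how I now read the return calculation from the quiz question, with the first reward left undiscounted (this is just my interpretation of the lecture's convention, so take it with a grain of salt):

$$G = R_1 + \gamma R_2 + \gamma^2 R_3 = -100 + 0.75(-100) + 0.75^2(1000) = -100 - 75 + 562.5 = 387.5$$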
The best rationalization I can think of (big newbie alert) is that there should be a value associated with the “action” of staying in that state. This would be relevant if the other available actions would result in lesser (potentially negative) returns. For example, the Mars rover lands at the top of a pointy plateau and would tumble if it moved in any direction. I may be waaaay off here, so feel free to zap this and redirect me.
Think I’m good. This concept is touched on in a subsequent lecture (the Bellman Equation one), though with more intuitive terminology. Andrew states that the agent gets a reward “right away,” which would apply to any state, including the initial state…
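To make that concrete, here is a minimal Python sketch of how I understand the return calculation (the function name and structure are my own illustration, not from the course code). The reward received “right away” in the first state is weighted by gamma**0 = 1, i.e. it is undiscounted:

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the k-th reward (0-indexed) is weighted by gamma**k.

    The first reward gets weight gamma**0 = 1, so it is undiscounted.
    """
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# The quiz example: gamma = 0.75, rewards of -100, -100, and 1000
print(discounted_return([-100, -100, 1000], 0.75))  # -100 - 75 + 562.5 = 387.5
```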