I am working on the Python notebook C3_W3_A1_Assignment. I have a question on the part where .step() gets introduced.
I was wondering why there is a reward for doing any single action, since none of them directly reaches the goal. In my opinion there could only be negative rewards for using the engines. However, if I run these lines (there should be a screenshot below my message):
# Select an action.
action = 1

# Run a single time step of the environment's dynamics with the given action.
next_state, reward, done, _ = env.step(action)
I see different rewards for the 4 different actions. I was expecting that perhaps ‘Do nothing’ would give the highest reward, but it’s actually ‘Fire right engine’ that maximises the reward. Why is that?
Also, the new states are different every time I run the code. Is this because of randomness?
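For reference, this is roughly what I ran to compare the four actions (a minimal standalone sketch, assuming the older Gym API used in the notebook, where env.step returns four values and env.seed is available):

import gym

# Reset to the same starting state each time and compare the one-step
# reward of every action (0 = do nothing, 1-3 = the three engines).
env = gym.make('LunarLander-v2')

for action in range(env.action_space.n):
    env.seed(0)                            # fix the randomness so each reset is identical
    initial_state = env.reset()
    next_state, reward, done, _ = env.step(action)
    print(f"action={action}  reward={reward:.3f}  done={done}")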
First, section 3.3 of the assignment describes the rules for gaining or losing rewards, and the lunar lander does not get a positive reward simply because it fires an engine or does nothing. Rewards are not given for actions alone; they follow those rules. For the complete reward algorithm, please check out these 25 lines of code.
As for the states: each time you run the code in the screenshot, the lunar lander’s state will change, and rewards are calculated accordingly. Please check out the rules or the reward algorithm for how rewards are calculated; they are not based simply on the action.
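For intuition, here is a rough paraphrase of what those lines compute (a sketch based on Gym's lunar_lander.py source; the exact constants and state handling may differ slightly between versions):

import numpy as np

def shaping(state):
    # "Potential" of a state: higher when the lander is close to the pad at (0, 0),
    # moving slowly, upright, and has its legs on the ground.
    x, y, vx, vy, angle, _, leg1, leg2 = state
    return (
        -100 * np.sqrt(x**2 + y**2)      # distance from the landing pad
        - 100 * np.sqrt(vx**2 + vy**2)   # speed
        - 100 * abs(angle)               # tilt away from vertical
        + 10 * leg1 + 10 * leg2          # legs in contact with the ground
    )

def step_reward(prev_state, next_state, main_power, side_power):
    # The per-step reward is the CHANGE in shaping, minus small fuel costs.
    # Landing (+100) and crashing (-100) override this on the final step.
    reward = shaping(next_state) - shaping(prev_state)
    reward -= 0.30 * main_power          # firing the main engine
    reward -= 0.03 * side_power          # firing a side engine
    return reward

So a single ‘Fire right engine’ step can yield a positive reward if it leaves the lander closer to the pad, slower, or more upright than it was before.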
I am sorry, I am still not sure I understood.
I understand that the rewards don’t come just from actions, but looking at the lines you referred me to, it doesn’t seem to me that a single ‘Fire right engine’ from the starting position should lead to any reward. Why is that?
None of these conditions is satisfied:
Landing on the landing pad and coming to rest is about 100-140 points.
If the lander moves away from the landing pad, it loses reward.
If the lander crashes, it receives -100 points.
If the lander comes to rest, it receives +100 points.
Each leg with ground contact is +10 points.
Firing the main engine is -0.3 points each frame.
Firing the side engine is -0.03 points each frame.
except for ‘Firing the side engine’, which gives negative points. However, the final reward for ‘Fire right engine’ is positive.
I assumed that the .step() function only returns the instantaneous reward for the action, according to the rules: is that not the case?
First, we need to shift our focus away from “firing any engine = rewards”, because this is not how points are awarded. Points are awarded based on state changes: for example, if the lunar lander moves closer to the landing pad, it gains reward; otherwise, it loses reward. Even if the lunar lander is just free-falling, as soon as it approaches the landing pad it gains reward by that rule. Of course, crashing the lunar lander will cost points.
Again, forget about “firing any engine = rewards”; instead, think about how the state has changed.
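If it helps, you can watch this directly: let the lander free-fall with action 0 for a few frames and print the per-step rewards. They are non-zero even though no engine fires (a quick sketch, assuming the same environment and older Gym API as the notebook):

import gym

env = gym.make('LunarLander-v2')
state = env.reset()

# Free-fall for a few frames: action 0 fires no engine, yet the per-step
# reward is non-zero because the state (position, speed, angle) keeps changing.
for t in range(10):
    state, reward, done, _ = env.step(0)
    print(f"t={t:2d}  reward={reward:+.3f}  done={done}")
    if done:
        break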
I also have the same question as OP; to me this doesn’t make sense. These are the rules in the exercise:
Landing on the landing pad and coming to rest is about 100-140 points.
If the lander moves away from the landing pad, it loses reward.
If the lander crashes, it receives -100 points.
If the lander comes to rest, it receives +100 points.
Each leg with ground contact is +10 points.
Firing the main engine is -0.3 points each frame.
Firing the side engine is -0.03 points each frame.
And in one specific section it then has this sample code:
# Select an action.
action = 0
# Run a single time step of the environment's dynamics with the given action.
next_state, reward, done, _ = env.step(action)
# Display table with values. All values are displayed to 3 decimal places.
utils.display_table(initial_state, action, next_state, reward, done)
And the reward in the output is non-zero (> 0). If I’m understanding correctly, the reward is given as soon as we land in the next state S’ (based on the reward rules), rather than being traced back from the terminal state at the end of the episode (e.g. landed or crashed).
And so, based on the rules given above, I don’t think the state after taking action 0 qualifies for ANY reward at all.
The comment above from rmwkwok, “Points are rewarded based on state changes, such as if the lunar lander moves closer to the landing pad, it gains rewards”, makes a lot of sense and would explain why there is a > 0 reward for S’. HOWEVER, the assignment’s reward rules don’t include this specific rule.
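As a rough sanity check of my reading (reward handed out per step, not traced back from the terminal state), here is a sketch that sums the per-step rewards over one episode, assuming the same environment and older Gym API as the notebook:

import gym

env = gym.make('LunarLander-v2')
state = env.reset()

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()            # random actions, just to reach a terminal state
    state, reward, done, _ = env.step(action)     # reward for this single transition only
    total_reward += reward

print(f"episode return (sum of per-step rewards): {total_reward:.2f}")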