Week 3: Reinforcement learning introduction
Video Heading: The Return of Reinforcement
Prof. Ng explains that the discount rate used in the return should be close to 1. On what basis does he recommend a discount rate (r) close to 1, and why does one need to choose a value close to 1?
What happens if the rewards in reinforcement learning can be both negative and positive, as in a stock exchange where a stock's value can go up and down? Would one still choose a discount rate close to 1, even if one wants to know why a stock's value is going through a loss?
The choice of the discount rate in reinforcement learning should align with your specific investment goals and risk preferences. There is no one-size-fits-all answer, and experimentation with different discount rates may be necessary to find the best strategy for your particular use case.
When you set the discount factor to exactly 1, your agent values all future rewards equally, regardless of when they occur. With a value very close to 1, future rewards are discounted only slightly, so the agent takes a long-term perspective and is willing to delay immediate gratification for the promise of larger rewards in the future.
And when you set it, for example, very close to 0, it means that the agent heavily discounts or devalues future rewards. It prioritizes immediate rewards and doesn’t consider long-term consequences as much.
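To make the two extremes concrete, here is a minimal sketch of the return as a discounted sum, where the reward received k steps from now is multiplied by the discount factor raised to the power k. The function name and the reward sequence are made up for illustration; they are not taken from the lecture.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the reward at step k by gamma ** k."""
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (gamma ** step) * reward
    return total


# Hypothetical reward sequence: nothing for three steps, then a large payoff.
rewards = [0, 0, 0, 100]

patient = discounted_return(rewards, gamma=0.99)  # 100 * 0.99 ** 3
myopic = discounted_return(rewards, gamma=0.1)    # 100 * 0.1 ** 3

print(round(patient, 2))  # 97.03
print(round(myopic, 2))   # 0.1
```

With a discount factor of 0.99 the delayed payoff keeps almost all of its value, while with 0.1 it is worth almost nothing to the agent.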
So the choice depends on your case. Do you need the agent to focus on short-term gains or on long-term ones? For example, if you are planning for something far ahead, you may care less about risks happening right now because you are focused on future gains.
I hope it makes sense now.
I understood the idea of the discount factor being close to 1, but the video left me with a doubt: that it would only take positive reinforcement into consideration.
How would the discount factor treat rewards in the case of financial stocks, where the value can go up and down? Would the discount factor still be close to 1?
I didn’t get this point, as Prof. Ng doesn’t mention short-term or long-term profits.
From what I understood
Rewards are the negative or positive values received for the actions taken in reinforcement learning. Correct me if I am wrong!
Well, in the case of financial stocks, the discount factor is not necessarily close to 1. It is likely to be influenced by various factors, including the risk associated with the stock, prevailing market conditions, and individual investor preferences. It’s important to consider these factors carefully when performing financial valuations of stocks, as the discount rate is a critical component of the valuation process.
That’s correct, but usually "rewards" refers to the positive values, and the negative values are called “penalties”.
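Negative rewards go into the return in exactly the same way as positive ones: each is multiplied by the discount factor raised to the number of steps until it occurs, and then everything is summed. A small sketch, with made-up gains and losses standing in for a stock going up and down (the numbers are hypothetical, not market data):

```python
def discounted_return(rewards, gamma):
    """Discounted sum: the reward at step k is weighted by gamma ** k."""
    return sum((gamma ** step) * reward for step, reward in enumerate(rewards))


# Hypothetical per-step rewards: gains are positive, losses (penalties) are negative.
rewards = [5, -20, 10, 10]

for gamma in (0.5, 0.9):
    print(gamma, round(discounted_return(rewards, gamma), 2))
# 0.5 -1.25
# 0.9 2.39
```

With these numbers the early loss dominates when the discount factor is 0.5, so the return is negative, while at 0.9 the later gains outweigh it. The discount factor does not ignore penalties; it only changes how much the later rewards count relative to the earlier ones.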
I think Prof. Ng explained this in terms of the policy, which was a bit vague to me.
The choice of discount factor values confused me, because for the helicopter he chose 0.9, for the robot he chose 0.5, and for chess he chose 0.995. I want to know his criteria for choosing these factors. From what Prof. Ng explained, I understood that these are based on the rewards in each case. But then why was the robot given 0.5? Is it because its rewards varied between 100 and 40, whereas for chess they vary between >1 and <1?
I hope my understanding of the discount factor is correct.
Your understanding is generally correct. The choice of discount factor is often based on the nature of the problem and the agent’s objectives.
In the case of the helicopter, Prof. Ng chose “0.9” because we want the agent to value future rewards fairly highly while still discounting them a little. This means that it is willing to consider long-term consequences but does not completely ignore short-term rewards. (It’s flying, so we want it to land safely in the future, but that doesn’t mean we want it to crash right now.)
The choice of a discount factor of “0.5” for the robot may indicate that in this scenario, the agent (the robot) values short-term rewards more than long-term ones. This could be because the rewards are highly variable, as you mentioned, ranging from 40 to 100. A lower discount factor places more emphasis on immediate rewards and makes the agent more myopic in its decision-making.
And for chess, the discount factor chosen was “0.995”. This suggests that the agent (the chess-playing program) values long-term consequences very highly and places a significant emphasis on future rewards. That matches how chess is played: you find players planning to win 5-7 moves ahead, so the agent needs to focus more on long-term consequences.
So as you said it depends on your case and your goals.
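To put rough numbers on the three values discussed above, this sketch prints how much a reward 10 steps away still counts under each discount factor, along with 1 / (1 - discount factor), a common rule of thumb for how many steps ahead the agent effectively looks. The rule of thumb and the helper names are my own additions for illustration, not something stated in the lecture.

```python
def weight_after(steps, gamma):
    """Weight applied to a reward that arrives `steps` steps in the future."""
    return gamma ** steps


def effective_horizon(gamma):
    """Rule-of-thumb planning horizon: 1 / (1 - gamma), valid for gamma < 1."""
    return 1.0 / (1.0 - gamma)


for name, gamma in [("robot", 0.5), ("helicopter", 0.9), ("chess", 0.995)]:
    weight = round(weight_after(10, gamma), 4)
    horizon = round(effective_horizon(gamma))
    print(name, gamma, weight, horizon)
```

Under this heuristic, 0.5 corresponds to looking about 2 steps ahead, 0.9 to about 10 steps, and 0.995 to about 200 steps, which lines up with the short-term versus long-term reading of the three examples.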