States, actions, rewards

In reinforcement learning, are the sets of actions, rewards, and states always pre-defined? I assume that the rewards are set and pre-defined and the algorithm cannot change or add to them, and that the algorithm can learn new states depending on what action it has taken. How about new actions? Can it learn new actions?

In reinforcement learning there are no new actions, because you give the model all possible parameters up front. For example, if you want to build a model for a robot arm, you give the model the direction, the speed, and the angle of that direction; those are the possible actions for the arm. By training the model, you adjust these variables so it takes good actions.


Thank you!
Would you be able to answer all the questions please, and confirm my assumption?

Hey @Basira_Daqiq Thanks for your post.

In RL the sets of actions, rewards, and states are not always pre-defined. The flexibility and adaptability of RL come from the fact that the algorithm can learn and interact with its environment to discover new states, actions, and even learn from the feedback (rewards) it receives.

Just keep in mind that the degree to which RL can learn new states, actions, and rewards depends on various factors, including the complexity of the environment, the algorithm’s design, the amount of exploration, and the reward structure.

While there may be predefined elements, like the initial set of actions or some baseline rewards, RL algorithms can indeed learn new actions and states, and even adapt rewards over time.


I’ll give this a try, since I also struggled with this myself.

In order for the model to know what to do in any given state, it needs to know Q(s,a). The important thing to note is that, initially, you don’t know what the Q(s,a) values are for a given state. They are learned by training the model (you can start by just guessing the values, but in order to improve on the guess, you have to train it).
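To make this concrete, here is a minimal sketch of tabular Q-learning on a made-up 1-D "walk to the goal" task. The environment, the number of states, and the reward values are all illustrative assumptions, not from any particular course or library; the point is just that Q(s,a) starts as a guess (zeros) and improves through training.

```python
import random

N_STATES = 5          # toy example: states 0..4, where state 4 is the goal
ACTIONS = [-1, +1]    # step left or step right
ALPHA, GAMMA = 0.5, 0.9

# Initially we don't know Q(s, a), so start with a guess (all zeros).
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    """Toy environment: move, clamp to the grid, reward 1 at the goal."""
    s_next = max(0, min(N_STATES - 1, s + a))
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    done = s_next == N_STATES - 1
    return s_next, reward, done

random.seed(0)
for episode in range(200):
    s, done = 0, False
    while not done:
        a = random.choice(ACTIONS)  # pure exploration, for simplicity
        s_next, r, done = step(s, a)
        # Q-learning update: nudge Q(s,a) toward r + gamma * max_a' Q(s',a')
        best_next = max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s_next

# After training, the learned Q-values should prefer moving right
# (toward the goal) from every state.
```

After enough episodes, `Q[(3, +1)]` ends up larger than `Q[(3, -1)]`, which is exactly the "knowing what to do in a given state" the paragraph above describes.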

States are variables derived from the system: measurable or calculable values such as position, speed, temperature, pressure, and status.

Actions can come from one of two places, depending on whether you are training or testing the model. If you are training the model, then in the case of the flying helicopter, the actions are inputs from a human controlling a joystick. If you are testing the model, the actions come from the policy derived from model training.

The rewards are predefined. You state constraints on the system: if you get here, you get these points; if you go there, you get those points; if you crash, you get these points; if your fuel rate is over this number, you get these points; and so on.
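A predefined reward function like the one described above might look like the following sketch. The state fields and point values here are invented for illustration (loosely in the spirit of a lander task), not taken from any specific assignment.

```python
def reward(state):
    """Predefined reward: a fixed set of constraints mapped to points.
    state = (x, y, fuel_rate, crashed, landed); all values are assumptions."""
    x, y, fuel_rate, crashed, landed = state
    r = 0.0
    if landed:
        r += 100.0   # reaching the target pays out
    if crashed:
        r -= 100.0   # crashing is penalized
    if abs(x) > 10:
        r -= 5.0     # going out of bounds costs points
    if fuel_rate > 2.0:
        r -= 1.0     # burning fuel too fast costs a little
    return r
```

The key point is that this function is fixed before training starts; the algorithm learns behavior around it rather than changing it.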

The total return is the sum of the rewards collected by the series of actions taken from the beginning to the end of an “episode”, where an episode runs from the start of the task to its end. The end could be defined as a crash, going too far out of bounds, a successful execution, or some other criterion you can define.
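As a small worked example, the return over one episode is usually computed as a discounted sum of the per-step rewards. The discount factor value below is an assumption for illustration.

```python
GAMMA = 0.9  # discount factor (assumed value, commonly between 0.9 and 0.99)

def total_return(rewards, gamma=GAMMA):
    """Discounted sum of rewards over one episode."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# An episode that ends in success (reward 10) after two zero-reward steps:
total_return([0.0, 0.0, 10.0])  # ≈ 8.1, i.e. 0.9**2 * 10
```

Discounting makes rewards that arrive sooner count for more, which is why shorter successful episodes yield a higher return.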

So if you think about training a deep reinforcement learning algorithm, every time you train the model, you are creating a policy. A policy is a prescription that says: if you are in a particular state, take a particular action (which moves you to some next state).

The goal is to train the model repeatedly, thereby creating multiple policies, each varying in total return. Every time you train the model with a new policy (or think of it as a state path), using the NN training approach, you are finding the aggregation of policies that produces the greatest total return.

Before you even start training, you can randomly create a policy, just to initialize the NN. But every time you train the model through an exercise, it will aggregate the new policy with the previous ones through transfer learning.

This is my thought process about it. I hope this helps, or at least provokes some discussion.