I am developing an RL agent for a toy FPS game using the Q-learning algorithm. My environment is the FPS game itself: it has a player, spawned enemies, and bullets. The agent stores Q-values in a hash map, qvalues[(state, action)] = qval, giving the value of taking action in state. The full source code is on Pastebin; I will highlight some of the more important parts in the questions below.
Here is the score over 5000 episodes of the agent learning the game:
It doesn't seem to have learnt much.
Here are my questions:
- Is this behavior normal after 5000 episodes? I don't know RL well enough to have a good sense; in other examples I have seen, 5000 episodes seems excessively large.
- Is the Q-learning algorithm implementation correct? I have read "the" RL book and have implemented the algorithm correctly in C++ for the random-walk problem, but I am less confident about the Python implementation. It uses a decayed epsilon-greedy policy on top of Q-learning:
def __hash(self, state, action):
    return hash((state.tobytes(), action))

def get_q_value(self, state, action):
    # Return the Q-value for a given state-action pair; default to 0 if not present
    state = np.reshape(state, [1, self.state_size])
    return self.qvalues.get(self.__hash(state, action), 0.0)

def update_q_value(self, state, action, reward, next_state):
    # Find the maximum Q-value for the next state
    max_next_q = max([self.get_q_value(next_state, a) for a in range(self.action_size)], default=0)
    # Q-learning formula to update Q-value of current state-action pair
    current_q = self.get_q_value(state, action)
    new_q = current_q + self.learning_rate * (reward + self.discount_factor * max_next_q - current_q)
    # Update Q-value in the dictionary
    self.qvalues[self.__hash(state, action)] = new_q

def get_best_action(self, state):
    # Retrieve the action with the highest Q-value for the current state
    q_values = [self.get_q_value(state, action) for action in range(self.action_size)]
    max_q = max(q_values)
    # Return the action with the maximum Q-value (break ties randomly)
    best_actions = [action for action, q in enumerate(q_values) if q == max_q]
    return random.choice(best_actions)

def choose_action(self, state):
    # Epsilon-greedy policy to select action
    if random.random() < self.exploration_rate:
        return random.randint(0, self.action_size - 1)  # Random action
    else:
        return self.get_best_action(state)
- Note that for the simulated actions, the mouse-click shooting of bullets is modeled as two actions, shoot up and shoot down. Because the enemies spawn at the top, I assume that over time the shoot-up action becomes more rewarding and should be chosen to maximize reward. This hypothesis does not seem to hold, judging by the results, which makes me wonder whether my implementation in point #2 is correct. (A sketch of the check I have in mind is after this list.)
- I don't know how to integrate a neural network into the agent yet. It seems to map a state to an action. How should the NN handle the game screen image (84x84 pixels) and the agent state (a position vector of 2 elements)? Is the input 84x84+2 in size? (See the sketch after this list.) I feel the player position should have a bigger weight when choosing an action from the NN; maybe that is handled automatically by the weights in the NN?
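To make the shoot-up hypothesis concrete, this is roughly the check I have in mind: periodically log the Q-values of the two shooting actions for one fixed reference state and see whether shoot up pulls ahead over training. The names below (env, agent, num_episodes, and the action indices) are placeholders for my own objects, not code from the game:

SHOOT_UP, SHOOT_DOWN = 2, 3        # placeholder indices for the two shooting actions
reference_state = env.reset()      # any fixed, representative state

for episode in range(num_episodes):
    # ... run the usual episode loop and Q-value updates here ...
    if episode % 100 == 0:
        q_up = agent.get_q_value(reference_state, SHOOT_UP)
        q_down = agent.get_q_value(reference_state, SHOOT_DOWN)
        print(f"episode {episode}: Q(shoot up)={q_up:.3f}  Q(shoot down)={q_down:.3f}")

If shoot up never overtakes shoot down for a state with an enemy above the player, that would suggest the problem is in the rewards or the update rather than the exploration schedule.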
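For the neural network question, this is the wiring I mean by an input of size 84x84+2: flatten the screen pixels and concatenate the 2-element position, giving 84*84 + 2 = 7058 inputs and one Q-value output per action. This is only a sketch to make the question concrete (PyTorch is used just for illustration; the class name and layer sizes are arbitrary), not something I have implemented:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Sketch: flattened 84x84 screen concatenated with the 2-element position vector.
    def __init__(self, action_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(84 * 84 + 2, 256),   # 7058 input features
            nn.ReLU(),
            nn.Linear(256, action_size),   # one Q-value per action
        )

    def forward(self, screen, position):
        # screen: (batch, 84, 84) pixel tensor, position: (batch, 2)
        x = torch.cat([screen.flatten(start_dim=1), position], dim=1)
        return self.net(x)

My worry is whether a network like this learns on its own to give the 2 position inputs enough weight next to the 7056 pixel inputs, or whether they need special treatment.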
Thanks