Need help with my Q-learning reinforcement learning game-playing agent

I am developing an RL agent for a toy FPS game using the Q-learning algorithm. My environment is the FPS game itself: it has a player, spawned enemies, and bullets. The agent has a hash map `qvalues[(state, action)] = qval` that stores the Q-value for each state-action pair. The full source code is on Pastebin; I will highlight some of the more important points in the questions below.

Here is the score over 5000 episodes of the agent learning the game:

It doesn’t seem to have learnt much.

Here are my questions:

  1. Is this behavior normal after 5000 episodes? I don’t know RL well enough to have a good sense; in other examples I have seen, 5000 episodes seems excessively large.

  2. Is the Q-learning algorithm implementation correct? I have read ‘the’ RL book and have implemented the algorithm correctly in C++ for the random-walk problem, but I am less confident about this Python implementation.
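To double-check the update rule independently of the game, here is a minimal sanity check on a tiny deterministic chain (my own toy example, not your environment): applying the same tabular update as in `update_q_value` should drive `Q(s, right)` toward `GAMMA ** (2 - s)`.

```python
import random

N_STATES = 4              # states 0..3; reaching state 3 ends the episode with reward 1
ACTIONS = [-1, +1]        # 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.3

# Q-table over all (state, action) pairs; the terminal state's entries stay 0
Q = {(s, a): 0.0 for s in range(N_STATES) for a in range(len(ACTIONS))}

def choose(s):
    # epsilon-greedy with random tie-breaking
    if random.random() < EPSILON:
        return random.randrange(len(ACTIONS))
    qs = [Q[(s, a)] for a in range(len(ACTIONS))]
    return random.choice([a for a, q in enumerate(qs) if q == max(qs)])

def run_episode():
    s = 0
    while s != 3:
        a = choose(s)
        s2 = min(max(s + ACTIONS[a], 0), N_STATES - 1)
        r = 1.0 if s2 == 3 else 0.0
        # same Q-learning update as in update_q_value()
        max_next_q = max(Q[(s2, b)] for b in range(len(ACTIONS)))
        Q[(s, a)] += ALPHA * (r + GAMMA * max_next_q - Q[(s, a)])
        s = s2

random.seed(0)
for _ in range(500):
    run_episode()

# Q(s, right) should approach GAMMA ** (2 - s): about 0.81, 0.9, 1.0
print([round(Q[(s, 1)], 3) for s in range(3)])
```

If the same update code converges here but not in the game, the problem is more likely the state representation or rewards than the formula itself.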

It uses a decayed epsilon-greedy policy on top of Q-learning:

```python
def __hash(self, state, action):
    return hash((state.tobytes(), action))

def get_q_value(self, state, action):
    # Return the Q-value for a given state-action pair; default to 0 if not present
    state = np.reshape(state, [1, self.state_size])
    return self.qvalues.get(self.__hash(state, action), 0.0)

def update_q_value(self, state, action, reward, next_state):
    # Find the maximum Q-value for the next state
    max_next_q = max([self.get_q_value(next_state, a)
                      for a in range(self.action_size)], default=0)
    # Q-learning formula to update Q-value of current state-action pair
    current_q = self.get_q_value(state, action)
    new_q = current_q + self.learning_rate * (
        reward + self.discount_factor * max_next_q - current_q)
    # Update Q-value in the dictionary
    self.qvalues[self.__hash(state, action)] = new_q

def get_best_action(self, state):
    # Retrieve the action with the highest Q-value for the current state
    q_values = [self.get_q_value(state, action) for action in range(self.action_size)]
    max_q = max(q_values)
    # Return the action with the maximum Q-value (break ties randomly)
    best_actions = [action for action, q in enumerate(q_values) if q == max_q]
    return random.choice(best_actions)

def choose_action(self, state):
    # Epsilon-greedy policy to select action
    if random.random() < self.exploration_rate:
        return random.randint(0, self.action_size - 1)  # Random action
    else:
        return self.get_best_action(state)
```
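The decay schedule itself is not shown in this excerpt. For reference, a common shape is multiplicative decay with a floor, applied once per episode; the names and constants below are my assumptions, not from the posted code:

```python
# Hypothetical decay constants -- not taken from the posted code
EPS_START, EPS_MIN, EPS_DECAY = 1.0, 0.05, 0.995

eps = EPS_START
history = []
for episode in range(5000):
    history.append(eps)
    # ... run one episode with exploration_rate = eps ...
    eps = max(EPS_MIN, eps * EPS_DECAY)

# epsilon shrinks from 1.0 toward the 0.05 floor
print(history[0], round(history[-1], 3))
```

With these particular constants the floor is reached around episode 600, so most of a 5000-episode run is mostly greedy; that is worth checking against whatever schedule the full code uses.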

  3. Note that the mouse-click shooting is modeled as two discrete actions: shoot up and shoot down. Because the enemies spawn at the top, I assume that over time the shoot-up action becomes more rewarding and should be chosen to maximize reward. The results show this hypothesis does not hold, which makes me wonder whether my implementation (question 2) is correct.

  4. I don’t know how to integrate a neural network into the agent yet. It seems a NN would map a state to an action. How should the NN handle the game screen image (84x84 pixels) together with the agent state (a position vector of 2 elements)? Is the input 84x84+2 in size? I feel the player position should have a bigger weight when choosing an action; maybe that’s handled automatically by the NN’s weights?
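On the input-size question: for a plain fully connected network, flattening the 84x84 frame and concatenating the 2-element position does give an input of 84*84 + 2 = 7058 values, and the relative importance of the position is then learned through the weights rather than set by hand. A sketch of the concatenation in NumPy (array names here are illustrative):

```python
import numpy as np

# Assumed shapes: `screen` is the 84x84 grayscale frame, `pos` the player position
screen = np.zeros((84, 84), dtype=np.float32)
pos = np.array([12.0, 34.0], dtype=np.float32)

# Flatten the image and append the position: 84*84 + 2 = 7058 inputs
nn_input = np.concatenate([screen.ravel(), pos])
print(nn_input.shape)  # (7058,)
```

A common alternative (as in DQN) is to pass the image through convolutional layers first and concatenate the position vector with the flattened conv features before the fully connected layers.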

Thanks

Tip:
When you post your code on the forum, please use the “preformatted text” tag, so that it isn’t interpreted as markdown.
