New Policies in RL - Conquer Policies and Trail-and-Explore Policy

Inventing new things is one of my hobbies, and although I’m not a professional, my passion drives me to give it a try :slight_smile:

Lately, over the past four days, I’ve come up with a few ideas for new reinforcement learning (RL) policies. I’ve brought them to life and tested them (note: the tests are not fully hyperparameter-tuned, and the results recording may not always be accurate). Despite this, I’ve uncovered some useful and interesting findings that I’d love to share. I’m not sure whether these findings are substantial enough for a research paper, so please give advice!

Overview:

  • Conquer Policies (three modified versions: Specific Conquer, Global Conquer V1, and State-Specific Conquer V2) are faster than the Softmax policy, slower than Epsilon-Greedy, and achieve rewards comparable to Softmax. Note: the Conquer policies are modifications of Epsilon-Greedy, since they still use an epsilon-greedy rule to pick the action.
  • Trail-and-Explore Policy: A policy I did not give much attention to at first, but its unique behavior in the Cliff Walking environment caught my attention. It exhibits a combination of Softmax and Epsilon-Greedy characteristics, producing an interesting pattern: this policy behaves differently from both Epsilon-Greedy and Softmax.

Basic Info: Tested on Q-Learning, Episode-Based Tasks

  • Introduction to Each Policy and Their Use Cases:

  • Specific Conquer: Uses state-specific memory to adjust exploration. Applicable for both episode tasks and continuous tasks.

  • Global Conquer V1: Relies on global memory across an episode to guide exploration. Suitable for only episode tasks.

  • State-Specific Conquer V2: Incorporates state-specific action history for refined exploration. Applicable for both episode tasks and continuous tasks.

  • Trail-and-Explore: Balances exploration and exploitation using noise and reward changes. Applicable for both episode tasks and continuous tasks.

  • Some examples of the new policies working in the static-wall experiment (to be honest, sometimes these new policies perform even worse than Softmax and Epsilon-Greedy, and sometimes they are better; their performance seems unpredictable).

Hyperparameters:

GRID_SIZE: 12
ALPHA: 0.1
NUM_EPISODES: 1000
BETA_SPECIFIC: 0.1
BETA_GENERAL: 0.1
BETA_V2: 0.1
TAU: 1.0
EPSILON: 0.1
NUM_TRIALS: 25
GAMMA: 0.9
REWARD_GOAL: 250
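
For reference, here is a sketch of how these could be declared as constants at the top of the script (the short descriptions in the comments are my own interpretation, not taken from the original code):

GRID_SIZE = 12        # side length of the grid world
ALPHA = 0.1           # Q-learning learning rate
NUM_EPISODES = 1000   # episodes per trial
BETA_SPECIFIC = 0.1   # memory rate for Specific Conquer
BETA_GENERAL = 0.1    # memory rate for Global Conquer V1
BETA_V2 = 0.1         # memory rate for State-Specific Conquer V2
TAU = 1.0             # Softmax temperature
EPSILON = 0.1         # exploration rate for Epsilon-Greedy
NUM_TRIALS = 25       # independent runs averaged in the results
GAMMA = 0.9           # discount factor
REWARD_GOAL = 250     # reward for reaching the goal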

Results:

| Policy | Final Reward | Std. Deviation | Compute Time (s) |
|---|---|---|---|
| Specific Conquer | -27.52 | 592.54 | 1.54 |
| Global Conquer V1 | -25.56 | 554.51 | 0.81 |
| Softmax | -23.00 | 605.86 | 3.70 |
| Epsilon-Greedy | -52.32 | 562.10 | 0.75 |
| State-Specific Conquer V2 | -29.72 | 559.93 | 1.83 |

  • I have only tested these policies on Q-learning tasks with episodic setups; further investigation is welcome.

My code for Cliff Walking (the hyperparameters can be fine-tuned, and the result collection may not be 100% correct):

Cliff_Walking_heatmap - Copy.py (19.1 KB)

  • For most of the code I actually used AI to help me write it, including the environment setup, reward collection, graph display, and the run loop, but I did the policy part myself, with some help from AI (I’m not good at coding :( ), and I read the environment and other parts as far as I could… However, I think the AI messed up the reward-collection part, and I haven’t fixed it yet.

Explanation: Basic Logic of the Conquer Policies, Verification, Modifications, and the Trail-and-Explore Policy
Specific Conquer Policies:
The core idea behind Conquer policies is to use a memory mechanism to track how accustomed the agent is to rewards, controlled by a parameter beta (how much the current reward impacts memory). Memory serves as “how long the agent has been used to the reward,” updated as a weighted average of past and current rewards. The baseline (memory / max_q) dynamically reflects the agent’s perception of the reward. For example:
If you score 100% in a test once, you feel incredible (baseline near 0, agent gets greedy).
If you consistently score 100%, it feels normal (baseline near 1, agent explores for better rewards).
This baseline controls the trade-off between greedy and exploratory actions: a low baseline (unexpected reward change) encourages greediness, while a high baseline (stable rewards) promotes exploration.
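
For concreteness, here is a tiny numerical sketch of the memory, baseline, and effective-epsilon formulas described above (the same formulas used in specific_conquer_policy further down; the numbers are made up):

beta = 0.1
memory = 0.0
max_q = 10.0   # suppose the best Q-value in this state stays at 10

for step in range(1, 4):
    memory = beta * max_q + (1 - beta) * memory   # weighted average of past and current reward signal
    baseline = memory / max_q                     # how "used to" this reward level the agent is
    effective_epsilon = max(0.01, 1 - baseline)
    print(step, round(memory, 2), round(baseline, 2), round(effective_epsilon, 2))

# prints:
# 1 1.0 0.1 0.9
# 2 1.9 0.19 0.81
# 3 2.71 0.27 0.73
# the longer the reward stays the same, the closer the baseline gets to 1
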
Modification to V2 (State-Specific Conquer V2): Enhanced by incorporating state-specific action history (last_action and memory_history) to compute a weighted memory of previous actions’ Q-values, using baseline = memory / max_Q[state] for finer control over exploration per state.
Global Conquer Policies:
Original Concept: The initial Global Conquer used a global memory updated per state, but it lacked episode-wide context. *not shown here because it’s not working well :frowning: *
Modified to V1: Adjusted to use a global memory for the entire episode (via global_policy.update), reflecting the average Q-values and total reward.
Logic and Idea of Trail-and-Explore Policy:
Core Idea: This policy compares the current maximum Q-value with the previous one, using a noise range (delta) to decide between exploration and exploitation. If the relative change in Q-values (q_relative) exceeds the noise threshold, the agent explores randomly; otherwise, it acts greedily.
Memory Store and Update: Stores the maximum Q-value per state in memories to track reward trends. Updated each step with the current max Q-value.
Role of delta: Acts as a sensitivity parameter, defining the noise range (-delta to delta) to trigger exploration when reward changes are significant.
Behavior: Combines Softmax-like randomness (via noise) and Epsilon-Greedy-like greediness (when q_relative is low), making it unique in unstable environments.
Modification Note: Could be enhanced by tuning delta dynamically based on episode progress or adding a memory decay factor.
Code Example for Each New Policy
Below are simplified versions of the code for each new policy:

Specific Conquer Policy (Based on State-Specific Conquer V0):

import numpy as np

def specific_conquer_policy(q_table, state, memories, beta=0.1, episode=None, total_episodes=None):
    actions = ['up', 'right', 'down', 'left']
    if state not in q_table:
        q_table[state] = [0.0 for _ in actions]
    if state not in memories:
        memories[state] = 0.0
    max_q = np.max(q_table[state])
    # memory is an exponential moving average of the best Q-value seen in this state
    memories[state] = beta * max_q + (1 - beta) * memories[state]
    baseline = memories[state] / (max_q if max_q != 0 else 1)
    effective_epsilon = max(0.01, 1 - baseline)
    if np.random.rand() < effective_epsilon:
        return np.random.choice(actions)
    return actions[int(np.argmax(q_table[state]))]   # return the action name, not its index
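
A minimal usage sketch inside a Q-learning step, assuming a hypothetical step_env(state, action) that returns (next_state, reward, done); only specific_conquer_policy itself comes from the code above:

actions = ['up', 'right', 'down', 'left']
q_table, memories = {}, {}
state = (0, 0)

for t in range(100):
    action = specific_conquer_policy(q_table, state, memories, beta=0.1)
    next_state, reward, done = step_env(state, action)   # hypothetical environment step
    if next_state not in q_table:
        q_table[next_state] = [0.0 for _ in actions]
    a = actions.index(action)
    # standard Q-learning update (alpha=0.1, gamma=0.9 as in the hyperparameters above)
    q_table[state][a] += 0.1 * (reward + 0.9 * max(q_table[next_state]) - q_table[state][a])
    state = next_state
    if done:
        break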

Global Conquer V1 Policy:
class GlobalConquerPolicyV1:
    def __init__(self, beta=0.1):
        self.beta = beta
        self.global_memory = 0.0   # episode-level memory of the total reward
        self.max_mean_q = 0.0      # largest mean Q-value seen so far

    def update(self, q_table, total_reward):
        # called once per episode: refresh the global memory with the episode's total reward
        cur_mean_q = np.mean(list(q_table.values())) if q_table else 0.0
        self.max_mean_q = max(self.max_mean_q, cur_mean_q)
        self.global_memory = self.beta * total_reward + (1 - self.beta) * self.global_memory

    def get_action(self, q_table, state):
        actions = ['up', 'right', 'down', 'left']
        if state not in q_table:
            q_table[state] = [0.0 for _ in actions]
        baseline = self.global_memory / self.max_mean_q if self.max_mean_q != 0 else 1
        effective_epsilon = max(0.01, 1 - baseline)
        if np.random.rand() < effective_epsilon:
            return np.random.choice(actions)
        return actions[int(np.argmax(q_table[state]))]   # return the action name, not its index
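
A minimal usage sketch of the intended call pattern — get_action every step, update once per episode after the total reward is known (step_env is again a hypothetical environment function, and the Q-learning update is omitted):

policy = GlobalConquerPolicyV1(beta=0.1)
q_table = {}

for episode in range(1000):
    state, total_reward, done = (0, 0), 0.0, False
    while not done:
        action = policy.get_action(q_table, state)
        next_state, reward, done = step_env(state, action)   # hypothetical environment step
        # ... Q-learning update on q_table goes here ...
        total_reward += reward
        state = next_state
    policy.update(q_table, total_reward)   # refresh the global memory once per episode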

State-Specific Conquer V2 Policy:
class StateSpecificConquerPolicyV2:
    def __init__(self, height=12, grid_size=12, beta=0.1):
        self.beta = beta
        self.grid_size = grid_size
        self.Q = np.zeros((height * grid_size, 4))
        self.max_Q = np.zeros(height * grid_size)            # best Q-value ever seen per state
        self.last_action = np.full(height * grid_size, -1)   # last action index taken per state
        self.memory_history = np.zeros(height * grid_size)   # single memory value per state, updated with beta

    def update(self, state_idx, action_values, last_action=None):
        self.Q[state_idx] = action_values
        max_q = np.max(action_values)
        self.max_Q[state_idx] = max(self.max_Q[state_idx], max_q)
        if self.last_action[state_idx] == -1 and last_action is not None:
            # first visit: initialise the memory with the Q-value of the action just taken
            self.last_action[state_idx] = last_action
            self.memory_history[state_idx] = action_values[last_action]
        elif last_action is not None:
            # later visits: exponential moving average of the taken action's Q-value
            q_current = action_values[last_action]
            self.memory_history[state_idx] = self.beta * q_current + (1 - self.beta) * self.memory_history[state_idx]
            self.last_action[state_idx] = last_action

    def compute_memory(self, state_idx):
        return self.memory_history[state_idx]   # the current memory value for this state

    def get_action(self, state, episode, total_episodes, q_table):
        actions = ['up', 'right', 'down', 'left']
        if state not in q_table:
            q_table[state] = [0.0 for _ in actions]
        state_idx = state[0] * self.grid_size + state[1]
        memory = self.compute_memory(state_idx)
        baseline = memory / self.max_Q[state_idx] if self.max_Q[state_idx] != 0 else 1
        effective_epsilon = max(0.01, 1 - baseline)
        if np.random.rand() < effective_epsilon:
            return np.random.choice(actions)
        return actions[int(np.argmax(q_table[state]))]   # return the action name, not its index
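
A minimal usage sketch for V2 (step_env is hypothetical, and the Q-learning update is omitted); update feeds the visited state’s Q-values and the index of the action just taken back into the policy:

policy = StateSpecificConquerPolicyV2(height=12, grid_size=12, beta=0.1)
q_table = {}
actions = ['up', 'right', 'down', 'left']
state = (0, 0)

for t in range(100):
    action = policy.get_action(state, episode=0, total_episodes=1000, q_table=q_table)
    next_state, reward, done = step_env(state, action)   # hypothetical environment step
    # ... Q-learning update on q_table[state] goes here ...
    state_idx = state[0] * policy.grid_size + state[1]
    policy.update(state_idx, q_table[state], last_action=actions.index(action))
    state = next_state
    if done:
        break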

Trail-and-Explore Policy:
def trail_and_explore_policy(q_table, state, previous_q, memories, delta=0.9):
    actions = ['up', 'right', 'down', 'left']
    if state not in q_table:
        q_table[state] = [0.0 for _ in actions]
    current_max_q = np.max(q_table[state])
    memories[state] = current_max_q                  # remember the best Q-value seen in this state
    e = np.random.uniform(-delta, delta)             # noise threshold for this step
    q_diff = previous_q - current_max_q if previous_q is not None else 0.0
    denom = max(abs(q_diff), abs(previous_q)) if previous_q is not None else 0.0
    q_relative = q_diff / denom if denom != 0 else 0.0
    if q_relative > e:
        return np.random.choice(actions), current_max_q
    return actions[int(np.argmax(q_table[state]))], current_max_q   # return (action name, max Q)
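
A minimal usage sketch showing how previous_q is threaded from one step to the next (step_env is hypothetical; the Q-learning update is omitted):

q_table, memories = {}, {}
state, previous_q = (0, 0), None

for t in range(100):
    action, previous_q = trail_and_explore_policy(q_table, state, previous_q, memories, delta=0.9)
    next_state, reward, done = step_env(state, action)   # hypothetical environment step
    # ... Q-learning update on q_table goes here ...
    state = next_state
    if done:
        break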

I’m not a pro and have only self-learned AI for 50 days (via Coursera! :)). I am not looking for fame or wealth, but I think this invention might be of some help to the AI community (quite bold :/). Inventing alone is challenging, so if you find this interesting, please join me in refining these policies! Whether you’re a pro or a learner like me, your help is welcome! :slight_smile:
My email: leeinfinitee00@gmail.com
Have a nice day!

@Lee_AI looking over this, my first question would be: ‘What kind of problem are you trying to solve?’. I would make sure to be clear and state that upfront. Is this for the ARC prize?

Great question indeed. To be honest, I hadn’t thought about it; I was just playing with the math and thinking through the logic. It was not initially designed to solve any particular problem, just a try-out. (Are all papers supposed to solve a problem? I really have no idea, since I have read only a few research papers; I am a beginner and probably wrong.) As for the Prize: I did not know it existed, and certainly this has no chance of winning any prize. I just want to: 1. challenge myself; 2. make mistakes (no mistakes, no success); 3. try to bring up some hopefully helpful findings for others. My first try will certainly fail, I know, but I will try again. Hope this answers your question.

I think the main reason I posted this is that I want to talk to someone who is better than me (I’m not saying I am good), because no one in my school wanna talk about this kind of topic, and I think the AI community is bigger than my school in terms of accessibility.

@Lee_AI:
If you’re going to post your code on the forum, please use the “pre-formatted text” tag, so that your code doesn’t show up as Markdown.

You should be able to edit your original post and fix this.

Linguistic tip:
“wanna” is not a word in the English language. “Want to” is more correct. Writing style is important if you seek to have a positive reputation.


Thanks for the suggestions!