Hello,
I’m experimenting with Ray RLlib’s DQN (Dueling Double DQN) on a minimal custom environment, but I keep seeing many resets within a single training iteration, even though each episode terminates after a single step. I’ve tried tuning the batch-size and horizon-related parameters, but the behavior persists.
1. Custom environment
import gymnasium as gym
from gymnasium import spaces
from ray.tune.registry import register_env

class SimpleEnv(gym.Env):
    def __init__(self, config):
        self.observation_space = spaces.Box(low=0, high=1, shape=(1,), dtype=float)
        self.action_space = spaces.Discrete(2)
        self.step_count = 0
        self.horizon = config.get("horizon", 1)

    def reset(self, seed=None, options=None):
        self.step_count = 0
        print("=== RESET ===")  # I see this printed many times!
        return [0.0], {}

    def step(self, action):
        self.step_count += 1
        done = self.step_count >= self.horizon
        print(f"Step: {self.step_count}, Done: {done}")
        return [0.0], 1.0, done, done, {}

register_env("SimpleEnv-v0", lambda cfg: SimpleEnv(cfg))
- horizon is set to 1, so each episode should be exactly one step long.
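Stepping the environment by hand (outside RLlib) behaves as intended, i.e. one reset followed by a single terminating step:

env = SimpleEnv({"horizon": 1})
obs, info = env.reset()                                  # prints === RESET === once
obs, reward, terminated, truncated, info = env.step(0)   # prints Step: 1, Done: True
print(terminated, truncated)                             # True True -> episode is over after one step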
2. DQN configuration
from ray.rllib.algorithms.dqn import DQNConfig

config = DQNConfig()

# Environment
config.environment("SimpleEnv-v0", env_config={"horizon": 1})

# Runner settings
config.env_runners(
    num_env_runners=0,
    rollout_fragment_length=1,
    batch_mode="complete_episodes",
)

# Training settings
config.training(
    dueling=True,
    double_q=True,
    train_batch_size=50,
    train_batch_size_per_learner=50,
    minibatch_size=25,
    num_steps_sampled_before_learning_starts=50,
    target_network_update_freq=1,
)

# Episode termination
config.soft_horizon = True
config.no_done_at_end = False

# API stack
config.api_stack(
    enable_rl_module_and_learner=True,
    enable_env_runner_and_connector_v2=True,
)

algo = config.build_algo()
algo.train()
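For what it's worth, the result dict returned by train() should also expose how much sampling happened in that iteration; the key names below are my assumption for the new API stack and may differ between Ray versions:

result = algo.train()  # same call as above, but keeping the returned metrics

# On the new API stack, sampling stats are reported under the env-runner results;
# the exact key names are an assumption and may vary with the Ray version.
env_runner_results = result.get("env_runners", {})
print(env_runner_results.get("num_env_steps_sampled"))
print(env_runner_results.get("num_episodes"))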
3. Observed output
Console output (many resets within a single training iteration):
=== RESET ===
Step: 1, Done: True
=== RESET ===
Step: 1, Done: True
=== RESET ===
Step: 1, Done: True
=== RESET ===
Step: 1, Done: True
=== RESET ===
Step: 1, Done: True
=== RESET ===
Step: 1, Done: True
=== RESET ===
Step: 1, Done: True
=== RESET ===
Step: 1, Done: True
=== RESET ===
Step: 1, Done: True
=== RESET ===
Step: 1, Done: True
=== RESET ===
Step: 1, Done: True
=== RESET ===
Step: 1, Done: True
=== RESET ===
Step: 1, Done: True
=== RESET ===
Despite using batch_mode="complete_episodes", rollout_fragment_length=1, and a horizon of 1, RLlib collects multiple one-step "episodes" (and prints many "=== RESET ===" lines) before each training update.
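To put a number on this rather than counting print lines, a small variation of SimpleEnv (a hypothetical helper, not part of my original setup) can track reset() calls directly; with num_env_runners=0 the env lives in the driver process, so a class-level counter is enough:

class CountingEnv(SimpleEnv):
    # Hypothetical helper: identical to SimpleEnv, but counts reset() calls.
    reset_calls = 0

    def reset(self, seed=None, options=None):
        CountingEnv.reset_calls += 1
        return super().reset(seed=seed, options=options)

register_env("CountingEnv-v0", lambda cfg: CountingEnv(cfg))
# After pointing the config at "CountingEnv-v0" and calling algo.train() once,
# CountingEnv.reset_calls ends up much larger than 1, matching the prints above.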
4. What I’ve tried
- Tuned train_batch_size, train_batch_size_per_learner, num_steps_sampled_before_learning_starts, and target_network_update_freq
- Disabled/enabled the RLModule API stack
- Switched between single-agent and multi-agent replay buffers
- Used both .env_runners(...) and .rollouts(...) (see the sketch below)
Nothing stops RLlib from issuing multiple resets per iteration.
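For reference, these are the two runner-config spellings I mean; .rollouts(...) is the older name and, depending on the Ray version, may be deprecated in favor of .env_runners(...):

# Newer spelling (used in the config above):
config.env_runners(
    num_env_runners=0,
    rollout_fragment_length=1,
    batch_mode="complete_episodes",
)

# Older spelling of the same settings (may be deprecated depending on the Ray version):
config.rollouts(
    num_rollout_workers=0,
    rollout_fragment_length=1,
    batch_mode="complete_episodes",
)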
5. Question
- Why does Ray RLlib reset the environment multiple times per training iteration, even with batch_mode="complete_episodes" and a horizon of 1?
- How can I force exactly one environment reset (one 1-step episode) per training iteration when using DQN/D3QN in RLlib?

Any pointers to the correct combination of RLlib settings (or a minimal working example) would be greatly appreciated!