So I have plotted a “simulation of a gradient component” as handled by the Adam algorithm.
Reminder of how Adam works (this diagram still needs a few fixes but it will do)
Explainer
The “simulation of a gradient component” is a function based on sin with added noise; it swings forever, giving Adam something to work on.
Plots are shown for the first 200 steps and for the last 200 steps leading up to step 10,000.
‘Regularized gradient component’
I call “regularized gradient component” the bias-corrected v divided by its associated \sqrt{s^{corr}} + \epsilon.
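Written out, this is the dW_reg quantity computed in the simulation code further down:

dW^{reg} = \frac{v^{corr}}{\sqrt{s^{corr}} + \epsilon}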
Large s
What I didn’t realize is how large s can become (and indeed it becomes large immediately, due to the correction by 1/(1-\beta_{2}), with \beta_{2} = 0.999 as recommended).
In the plots below, we only see the square roots of s and s^{corr}, not s or s^{corr} themselves, which are too large to fit on the plot.
Note that the large s strongly suppresses the magnitude of its associated gradient component nearly immediately. In the final plot, the “regularized gradient component” does not budge much from zero even though the “simulated gradient component” just keeps on swinging.
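A quick back-of-the-envelope check (my own numbers, not from the course): at the very first step, the bias correction exactly cancels the (1-\beta_{2}) factor in the running average, so s^{corr} already equals the squared gradient:

s_1 = (1-\beta_{2})\,dW_1^2 = 0.001\,dW_1^2, \qquad s_1^{corr} = \frac{s_1}{1-\beta_{2}} = dW_1^2

With a gradient component swinging around 10, \sqrt{s_1^{corr}} = |dW_1| is therefore already near 10 at step 1, and since v_1^{corr} = dW_1, the regularized component starts out at roughly \pm 1 right away.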
Some notes on naming / vague connection to physical quantities:
- weight component W[,]: a 1-dimensional position
- gradient component dW[,]: similar to a ‘force’, i.e. an ‘acceleration’, as mentioned by Prof. Ng in Week 2, “gradient descent with momentum”, 5:30
- v: “first moment” (the term as used in physics) of the gradient component, i.e. its mean. Indeed it is a running average of the gradient component, so not quite a mean, but good enough. In “gradient descent with momentum”, Prof. Ng says that, at least for “gradient descent with momentum”, this has the characteristic of a velocity, but that seems not to be the intuition to apply here.
- s: “second moment” (the term as used in physics) of the gradient component, i.e. its spread. This fits the “running average of the square of something”. However, for s we do not first subtract the first moment, as we should. Is that right? Did we forget something?
- Then we divide v by sqrt(s) to get the “regularized gradient component” (the full update is written out right after this list). I can’t imagine what that corresponds to though.
- Finally we update W[,] with the “regularized gradient component”. This is roughly like integrating a position by duration * velocity (the duration is always 1).
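For reference, this is the standard Adam update spelled out at step t (learning rate \alpha; the simulation code below tracks everything here except the final update of W):

v_t = \beta_1 v_{t-1} + (1-\beta_1)\,dW_t
s_t = \beta_2 s_{t-1} + (1-\beta_2)\,dW_t^2
v_t^{corr} = v_t / (1-\beta_1^t), \qquad s_t^{corr} = s_t / (1-\beta_2^t)
W \leftarrow W - \alpha\,v_t^{corr} / (\sqrt{s_t^{corr}} + \epsilon)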
The plot (superposed sines with noise)
- Input, light blue crosses
- Output, dark blue dots
This plot has been generated with the random number generator seeded to 1.
Another plot (simple sine swings with noise)
Here I simulate a gradient component that swings around a mean of 10 with amplitude 5 and some noise. The \sqrt{s} eventually flattens out slightly above the actual mean of 10 (the quick calculation right below shows why it should not settle at exactly 10, even at +infinity). v lags the gradient component with a smaller amplitude. The regularized gradient component, v^{corr} divided by \sqrt{s^{corr}}, stays near 1 and swings around that with a low amplitude.
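A quick sanity check of my own on where \sqrt{s} should settle: s is a running average of dW^2, and for dW = 10 + 5\sin(\theta) + noise with \sigma = 0.5, the long-run mean of dW^2 is

E[dW^2] = 10^2 + \frac{5^2}{2} + 0.5^2 = 112.75, \qquad \sqrt{112.75} \approx 10.6

so flattening out a bit above 10, rather than settling at exactly 10, is the expected behaviour.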
Another plot (simple line with noise)
This goes as expected, although the component, evidently dominant and with a high signal-to-noise ratio, is still scaled down from 5 to ~1.
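The generator for this plot did not make it into the listing below. A minimal sketch of what it could look like, assuming “line” means a constant gradient component at level 5 (my reconstruction, so the shape and level are assumptions; main() would add the noise on top, as it does for the other generators, and would need a matching elif branch):

def build_line(x: np.ndarray, periods: float) -> np.ndarray:
    # Hypothetical generator, not in the original listing: a constant
    # gradient component of 5. The 'periods' argument is unused; it is
    # kept only so the signature matches the other generators.
    return np.full((x.shape[0], 1), 5.0)

With a roughly constant component, v settles near 5 and \sqrt{s} near \sqrt{25 + \sigma^2} \approx 5, so v/\sqrt{s} ends up near 1: that is the “from 5 to ~1” scaling observed above.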
The code
- Updated for legibility on 2025-03-09
- Updated for bugfix on 2025-03-09 (didn’t use bias correction properly - the plot changes slightly)
from typing import Final, Tuple
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.axes as mplaxes
def build_swinging(x: np.ndarray, periods: float) -> np.ndarray:
two_pi: Final[float] = periods * 2 * np.pi
size = x.shape[0]
x1_values: Final[np.ndarray] = np.linspace(0.0, two_pi * 0.5, num=size, dtype=np.float64).reshape(-1, 1)
x2_values: Final[np.ndarray] = np.linspace(0.0, two_pi * 2, num=size, dtype=np.float64).reshape(-1, 1)
x3_values: Final[np.ndarray] = np.linspace(0.0, two_pi * 4, num=size, dtype=np.float64).reshape(-1, 1)
return np.sin(x1_values) * 20 + np.power(np.cos(x2_values), 2) * 10 + np.sin(x3_values) * 3
def build_swinging_soft(x: np.ndarray, periods: float) -> np.ndarray:
two_pi: Final[float] = periods * 2 * np.pi
size = x.shape[0]
values: Final[np.ndarray] = np.linspace(0.0, two_pi * 2, num=size, dtype=np.float64).reshape(-1, 1)
return np.sin(values) * 5 + 10
def update_beta(value: float, beta: float, beta_running: float, t: int) -> Tuple[float, float]:
    # Standard Adam bias correction: at 0-based step t, divide by (1 - beta^(t+1)).
    # beta_running carries beta^(t+1) between calls so the power is built up
    # incrementally instead of being recomputed each step.
    beta_running = beta if t == 0 else beta_running * beta
    value_corr = value / (1 - beta_running)
    return (value_corr, beta_running)
def ema(cur_value: float, running_value: float, beta: float) -> float:
    # One step of an exponentially weighted moving average.
    return (1 - beta) * cur_value + beta * running_value
def simulate_adam(dW_array: np.ndarray, beta_1: float, beta_2: float, epsilon: float) -> np.ndarray:
v_dW = 0.0
s_dW = 0.0
beta_1_running = 0.0
beta_2_running = 0.0
collected_data = []
for t in range(dW_array.shape[0]):
dW = dW_array[t, 0]
v_dW = ema(dW, v_dW, beta_1)
s_dW = ema(np.power(dW, 2), s_dW, beta_2)
v_dW_corr, beta_1_running = update_beta(v_dW, beta_1, beta_1_running, t)
s_dW_corr, beta_2_running = update_beta(s_dW, beta_2, beta_2_running, t)
dW_reg = v_dW_corr / (np.sqrt(s_dW_corr) + epsilon)
collected_data.append((t, dW, v_dW, s_dW, v_dW_corr, s_dW_corr, dW_reg))
    return np.array(collected_data)  # shape (steps, 7); columns: t, dW, v, s, v_corr, s_corr, dW_reg
# THIS IS NOT THE ADAM ALGORITHM
# JUST AN EXPERIMENT TO SEE WHAT'S GOING ON
def simulate_modified_adam(dW_array: np.ndarray, beta_1: float, beta_2: float, epsilon: float) -> np.ndarray:
v_dW = 0.0
s_dW = 0.0
beta_1_running = 0.0
beta_2_running = 0.0
collected_data = []
for t in range(dW_array.shape[0]):
dW = dW_array[t, 0]
v_dW = ema(dW, v_dW, beta_1)
s_dW = ema(np.power(dW - v_dW, 2), s_dW, beta_2) # MODIFIED
v_dW_corr, beta_1_running = update_beta(v_dW, beta_1, beta_1_running, t)
s_dW_corr, beta_2_running = update_beta(s_dW, beta_2, beta_2_running, t)
dW_reg = dW / (np.sqrt(s_dW_corr) + epsilon) + v_dW_corr # MODIFIED
collected_data.append((t, dW, v_dW, s_dW, v_dW_corr, s_dW_corr, dW_reg))
    return np.array(collected_data)  # shape (steps, 7); columns: t, dW, v, s, v_corr, s_corr, dW_reg
def plot_subplot(ax: mplaxes.Axes, t_start: int, t_stop: int, data: np.ndarray, noise_stddev: float, min_y: float, max_y: float) -> None:
ax.plot(data[t_start:t_stop, 0], data[t_start:t_stop, 1], label=f'noisy gradient component (σ = {noise_stddev:.2f})', marker='x', linestyle='none')
ax.plot(data[t_start:t_stop, 0], data[t_start:t_stop, 2], label='v (v ~ "1st moment", "mean")', marker='o', linestyle='none', color='red')
ax.plot(data[t_start:t_stop, 0], data[t_start:t_stop, 4], label='v with bias correction', marker='o', linestyle='none', color='orange')
    ax.plot(data[t_start:t_stop, 0], np.sqrt(data[t_start:t_stop, 3]), label='√(s) (s ~ "2nd moment", "spread")', marker='o', linestyle='none', color='lime')
ax.plot(data[t_start:t_stop, 0], np.sqrt(data[t_start:t_stop, 5]), label='√(s with bias correction)', marker='o', linestyle='none', color='darkgreen')
ax.plot(data[t_start:t_stop, 0], data[t_start:t_stop, 6], label='regularized gradient component', marker='o', linestyle='none', color='blue')
height = max_y - min_y
ax.set_ylim(min_y - 0.1 * height, max_y + 0.1 * height)
ax.set_xlim(t_start, t_stop)
ax.axhline(y=0, color='black', linewidth=2)
ax.legend()
ax.grid()
def plot_all(simulation_data: np.ndarray, noise_stddev: float, min_y: float, max_y: float) -> None:
size = simulation_data.shape[0]
fig, axes = plt.subplots(2, 1, figsize=(15, 2 * 7))
# "axes" is Axes or array of Axes https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.html#matplotlib.axes.Axes
window = 200
    plot_subplot(axes[0], 0, window, simulation_data, noise_stddev, min_y, max_y)
    plot_subplot(axes[1], size - window, size, simulation_data, noise_stddev, min_y, max_y)
plt.tight_layout()
plt.show()
use_fixed_rng_seed: Final[bool] = True
use_what_gradient_generator: Final[str] = "build_swinging_soft"
use_modified_adam: Final[bool] = False
def main():
if use_fixed_rng_seed:
rng = np.random.default_rng(1)
else:
# Every execution is different
rng = np.random.default_rng()
#
# Hyperparameters
#
beta_1: Final[float] = 0.9
beta_2: Final[float] = 0.999
epsilon: Final[float] = 0.0000001
#
# Other parameters
#
size: Final[int] = 10000 # number of data points in the graph
periods = size / 50
#
ticks: Final[np.ndarray] = np.arange(0, size, dtype=np.int32).reshape(-1, 1)
assert ticks.shape == (size, 1)
#
if use_what_gradient_generator == "build_swinging":
noise_stddev: Final[float] = 1.0
noise = rng.normal(size=size, scale=noise_stddev).reshape(-1, 1)
dW_array: Final[np.ndarray] = build_swinging(ticks, periods) + noise
min_y = -25
max_y = 30.0
elif use_what_gradient_generator == "build_swinging_soft":
noise_stddev: Final[float] = 0.5
noise = rng.normal(size=size, scale=noise_stddev).reshape(-1, 1)
dW_array: Final[np.ndarray] = build_swinging_soft(ticks, periods) + noise
min_y = 0.0
max_y = 20.0
else:
raise ValueError(f"Unknown value for 'use_what_gradient_generator': {use_what_gradient_generator}")
#
if use_modified_adam:
simulation_data = simulate_modified_adam(dW_array, beta_1, beta_2, epsilon)
else:
simulation_data = simulate_adam(dW_array, beta_1, beta_2, epsilon)
#
plot_all(simulation_data, noise_stddev, min_y, max_y)
if __name__ == "__main__":
    main()