Some graph plotting for "Adam"

So I have plotted a “simulation of a gradient component” as handled by the Adam algorithm.

Reminder of how Adam works (this diagram still needs a few fixes but it will do)
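
For reference, in the notation used below, the standard Adam update for a single component is (with learning rate \alpha, which does not appear in the simulation, and 1-based step count t):

v := \beta_1 v + (1 - \beta_1) dW
s := \beta_2 s + (1 - \beta_2) dW^2
v^{corr} := v / (1 - \beta_1^t)
s^{corr} := s / (1 - \beta_2^t)
W := W - \alpha \cdot v^{corr} / (\sqrt{s^{corr}} + \epsilon)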

Explainer

The “simulation of a gradient component” is a function based on sin with added noise; it swings forever, giving Adam something to work on.

Plots are shown for the “first 200 steps” and the “last 200 steps” leading up to step 10’000.

‘Regularized gradient component’

I call “regularized gradient component” the bias-corrected first moment divided by the square root of the bias-corrected second moment (plus \epsilon), i.e. dW_{reg} = v^{corr} / (\sqrt{s^{corr}} + \epsilon).

Large s

What I didn’t realize is how large s can become (and indeed it becomes large immediately, due to the correction by 1/(1-\beta_2), with \beta_2 = 0.999 as recommended).

In the plots below, we only see the square roots of s and s^{corr}, not s or s^{corr} themselves, which are too large to fit on the plot.

Note that the large s strongly suppresses the magnitude of its associated gradient component nearly immediately. In the final plot, the “regularized gradient component” does not budge much from zero even though the “simulated gradient component” just keeps on swinging.
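
As a small back-of-envelope check (derived from the update formulas alone, using the textbook 1-based bias correction; the code below special-cases t=0 slightly differently):

import numpy as np

beta_2 = 0.999
dW = 10.0  # hypothetical gradient component, roughly the scale used in the plots below

# One EMA step starting from s = 0, then bias correction at step t = 1:
s = (1 - beta_2) * dW**2      # 0.001 * 100 = 0.1  (tiny before correction)
s_corr = s / (1 - beta_2**1)  # 0.1 / 0.001 = 100.0, i.e. already dW**2
print(np.sqrt(s_corr))        # 10.0 -> sqrt(s_corr) matches |dW| right away

# The correction factor 1/(1 - beta_2**t) decays towards 1 only slowly:
for t in (1, 10, 100, 1000, 10000):
    print(t, 1.0 / (1.0 - beta_2 ** t))
# 1 -> 1000.0, 10 -> ~100.4, 100 -> ~10.5, 1000 -> ~1.58, 10000 -> ~1.00005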

Some notes on naming / vague connection to physical quantities:

  • weight component W[,]: 1-dimensional position
  • gradient component dW[,]: similar to a ‘force’, i.e. an ‘acceleration’, as mentioned by Prof. Ng in Week 2, “gradient descent with momentum”, at 5:30.
  • v: “first moment” (the term as used in physics) of the gradient component, i.e. its mean. It is really a running average of the gradient component, so not quite a mean, but close enough. In “gradient descent with momentum”, Prof. Ng says that, at least there, this quantity has the character of a velocity, but that does not seem to be the intuition to apply here.
  • s: “second moment” (the term as used in physics) of the gradient component, i.e. its spread. This fits the “running average of the square of something”. However, for s we do not first subtract the first moment (the mean), as we would for a variance. Is that right? Did we forget something?
  • Then we divide v by sqrt(s) to get the “regularized gradient component”. I can’t imagine what that corresponds to physically, though.
  • Finally we update W[,] with the “regularized gradient component”. This is roughly like integrating duration * velocity over one step (in full Adam the learning rate plays the role of the duration; here it is effectively always 1).

The plot (superposed sine waves with noise)

  • Input, light blue crosses
  • Output, dark blue dots

This plot has been generated with the random number generator seeded to 1.

Another plot (simple sine swings with noise)

Here I simulate a gradient component that swings around a mean of 10 with amplitude 5 plus some noise. \sqrt{s} eventually flattens out above the actual mean of 10 (I expect it to settle at 10 at +infinity though). v lags the gradient component with a smaller amplitude. The regularized gradient component (v^{corr} divided by \sqrt{s^{corr}}) stays near 1 and swings around that with a small amplitude.

Another plot (simple line with noise)

This goes as expected, although the component, evidently dominant and with a high signal-to-noise ratio, is still scaled down from 5 to ~1.

The code

  • Updated for legibility on 2025-03-09
  • Updated for bugfix on 2025-03-09 (didn’t use bias correction properly - the plot changes slightly)
from typing import Final, Tuple

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.axes as mplaxes


def build_swinging(x: np.ndarray, periods: float) -> np.ndarray:
    two_pi: Final[float] = periods * 2 * np.pi
    size = x.shape[0]
    x1_values: Final[np.ndarray] = np.linspace(0.0, two_pi * 0.5, num=size, dtype=np.float64).reshape(-1, 1)
    x2_values: Final[np.ndarray] = np.linspace(0.0, two_pi * 2, num=size, dtype=np.float64).reshape(-1, 1)
    x3_values: Final[np.ndarray] = np.linspace(0.0, two_pi * 4, num=size, dtype=np.float64).reshape(-1, 1)
    return np.sin(x1_values) * 20 + np.power(np.cos(x2_values), 2) * 10 + np.sin(x3_values) * 3


def build_swinging_soft(x: np.ndarray, periods: float) -> np.ndarray:
    two_pi: Final[float] = periods * 2 * np.pi
    size = x.shape[0]
    values: Final[np.ndarray] = np.linspace(0.0, two_pi * 2, num=size, dtype=np.float64).reshape(-1, 1)
    return np.sin(values) * 5 + 10


def update_beta(value, beta, beta_running, t) -> Tuple[float, float]:
    if t == 0:
        # Do not apply bias correction at t=0, as that would mean division by 0.
        # TODO: this seems subtle and I don't like it at all
        # TODO: maybe we should immediately shift to t=1 instead
        value_corr = value
        beta_running = beta
    else:
        value_corr = value / (1 - beta_running)
        beta_running *= beta
    return (value_corr, beta_running)


def ema(cur_value, running_value, beta) -> float:
    return (1 - beta) * cur_value + beta * running_value


def simulate_adam(dW_array: np.ndarray, beta_1: float, beta_2: float, epsilon: float) -> np.ndarray:
    v_dW = 0.0
    s_dW = 0.0
    beta_1_running = 0.0
    beta_2_running = 0.0
    collected_data = []
    for t in range(dW_array.shape[0]):
        dW = dW_array[t, 0]
        v_dW = ema(dW, v_dW, beta_1)
        s_dW = ema(np.power(dW, 2), s_dW, beta_2)
        v_dW_corr, beta_1_running = update_beta(v_dW, beta_1, beta_1_running, t)
        s_dW_corr, beta_2_running = update_beta(s_dW, beta_2, beta_2_running, t)
        dW_reg = v_dW_corr / (np.sqrt(s_dW_corr) + epsilon)
        collected_data.append((t, dW, v_dW, s_dW, v_dW_corr, s_dW_corr, dW_reg))
    return np.array(collected_data)  # becomes a column-oriented array

# THIS IS NOT THE ADAM ALGORITHM
# JUST AN EXPERIMENT TO SEE WHAT'S GOING ON
def simulate_modified_adam(dW_array: np.ndarray, beta_1: float, beta_2: float, epsilon: float) -> np.ndarray:
    v_dW = 0.0
    s_dW = 0.0
    beta_1_running = 0.0
    beta_2_running = 0.0
    collected_data = []
    for t in range(dW_array.shape[0]):
        dW = dW_array[t, 0]
        v_dW = ema(dW, v_dW, beta_1)
        s_dW = ema(np.power(dW - v_dW, 2), s_dW, beta_2) # MODIFIED
        v_dW_corr, beta_1_running = update_beta(v_dW, beta_1, beta_1_running, t)
        s_dW_corr, beta_2_running = update_beta(s_dW, beta_2, beta_2_running, t)
        dW_reg = dW / (np.sqrt(s_dW_corr) + epsilon) + v_dW_corr # MODIFIED
        collected_data.append((t, dW, v_dW, s_dW, v_dW_corr, s_dW_corr, dW_reg))
    return np.array(collected_data)  # becomes a column-oriented array


def plot_subplot(ax: mplaxes.Axes, t_start: int, t_stop: int, data: np.ndarray, noise_stddev: float, min_y: float, max_y: float) -> None:
    ax.plot(data[t_start:t_stop, 0], data[t_start:t_stop, 1], label=f'noisy gradient component (σ = {noise_stddev:.2f})', marker='x', linestyle='none')
    ax.plot(data[t_start:t_stop, 0], data[t_start:t_stop, 2], label='v (v ~ "1st moment", "mean")', marker='o', linestyle='none', color='red')
    ax.plot(data[t_start:t_stop, 0], data[t_start:t_stop, 4], label='v with bias correction', marker='o', linestyle='none', color='orange')
    ax.plot(data[t_start:t_stop, 0], np.sqrt(data[t_start:t_stop, 3]), label='√(s) (s ~ "2nd moment", "spread")', marker='o', linestyle='none', color='lime')
    ax.plot(data[t_start:t_stop, 0], np.sqrt(data[t_start:t_stop, 5]), label='√(s with bias correction)', marker='o', linestyle='none', color='darkgreen')
    ax.plot(data[t_start:t_stop, 0], data[t_start:t_stop, 6], label='regularized gradient component', marker='o', linestyle='none', color='blue')
    height = max_y - min_y
    ax.set_ylim(min_y - 0.1 * height, max_y + 0.1 * height)
    ax.set_xlim(t_start, t_stop)
    ax.axhline(y=0, color='black', linewidth=2)
    ax.legend()
    ax.grid()


def plot_all(simulation_data: np.ndarray, noise_stddev: float, min_y: float, max_y: float) -> None:
    size = simulation_data.shape[0]
    fig, axes = plt.subplots(2, 1, figsize=(15, 2 * 7))
    # "axes" is Axes or array of Axes https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.html#matplotlib.axes.Axes
    window = 200
    plot_subplot(axes[0], 0, window, simulation_data, noise_stddev, min_y, max_y)
    plot_subplot(axes[1], size - window, size, simulation_data, noise_stddev, min_y, max_y)
    plt.tight_layout()
    plt.show()


use_fixed_rng_seed: Final[bool] = True
use_what_gradient_generator: Final[str] = "build_swinging_soft"
use_modified_adam: Final[bool] = False

def main():
    if use_fixed_rng_seed:
        rng = np.random.default_rng(1)
    else:
        # Every execution is different
        rng = np.random.default_rng()
    #
    # Hyperparameters
    #
    beta_1: Final[float] = 0.9
    beta_2: Final[float] = 0.999
    epsilon: Final[float] = 1e-7
    #
    # Other parameters
    #
    size: Final[int] = 10000  # number of data points in the graph
    periods = size / 50
    #
    ticks: Final[np.ndarray] = np.arange(0, size, dtype=np.int32).reshape(-1, 1)
    assert ticks.shape == (size, 1)
    #
    if use_what_gradient_generator == "build_swinging":
        noise_stddev: Final[float] = 1.0
        noise = rng.normal(size=size, scale=noise_stddev).reshape(-1, 1)
        dW_array: Final[np.ndarray] = build_swinging(ticks, periods) + noise
        min_y = -25
        max_y = 30.0
    elif use_what_gradient_generator == "build_swinging_soft":
        noise_stddev: Final[float] = 0.5
        noise = rng.normal(size=size, scale=noise_stddev).reshape(-1, 1)
        dW_array: Final[np.ndarray] = build_swinging_soft(ticks, periods) + noise
        min_y = 0.0
        max_y = 20.0
    else:
        raise ValueError(f"Unknown value for 'use_what_gradient_generator': {use_what_gradient_generator}")
    #
    if use_modified_adam:
        simulation_data = simulate_modified_adam(dW_array, beta_1, beta_2, epsilon)
    else:
        simulation_data = simulate_adam(dW_array, beta_1, beta_2, epsilon)
    #
    plot_all(simulation_data, noise_stddev, min_y, max_y)


main()

Sorry, David, but what’s the difference, in terms of your simulation settings, between these two plots?

Another point: what if you use your earlier idea of taking the absolute value of the gradient instead of its second moment? Then we could see what difference it makes.

Sorry for being unclear.

It’s just the “simulated gradient component” that oscillates around 10 instead of 0 in the second plot. Like so:

For the first plot, we generate the gradient component a bit clumsily, like this. The x values are passed in but not used; we just generate a wave of the same size as the x array.

def build_swinging(x: np.ndarray, periods: float) -> np.ndarray:
    two_pi: Final[float] = periods * 2 * np.pi
    size = x.shape[0]
    x1_values: Final[np.ndarray] = np.linspace(0.0, two_pi * 0.5, num=size, dtype=np.float64).reshape(-1, 1)
    x2_values: Final[np.ndarray] = np.linspace(0.0, two_pi * 2, num=size, dtype=np.float64).reshape(-1, 1)
    x3_values: Final[np.ndarray] = np.linspace(0.0, two_pi * 4, num=size, dtype=np.float64).reshape(-1, 1)
    return np.sin(x1_values) * 20 + np.power(np.cos(x2_values), 2) * 10 + np.sin(x3_values) * 3

For the second plot, the following is used:

def build_swinging_soft(x: np.ndarray, periods: float) -> np.ndarray:
    two_pi: Final[float] = periods * 2 * np.pi
    size = x.shape[0]
    values: Final[np.ndarray] = np.linspace(0.0, two_pi * 2, num=size, dtype=np.float64).reshape(-1, 1)
    return np.sin(values) * 5 + 10

Another point: what if you use your earlier idea of taking the absolute value of the gradient instead of its second moment? Then we could see what difference it makes.

More coffee is needed.

I was absolutely unhappy with the calculation that Adam performs for the second moment. It is missing the subtraction of the first moment (the mean) and feels wrong. How about these:
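
For reference, the “proper” second moment I have in mind here is the central one, i.e. the variance:

Var[dW] = E[(dW - E[dW])^2] = E[dW^2] - (E[dW])^2

whereas Adam’s s is a running estimate of E[dW^2] (no subtraction of the mean).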

Helpers

def ewma(cur_value, running_value, beta) -> float:
    return (1 - beta) * cur_value + beta * running_value

def update_beta(value, beta, beta_running, t) -> Tuple[float, float]:
    if t == 0:
        # Do not apply bias correction at t=0, as that would mean division by 0.
        # TODO: this seems subtle and I don't like it at all
        # TODO: maybe we should immediately shift to t=1 instead
        value_corr = value
        beta_running = beta
    else:
        value_corr = value / (1 - beta_running)
        beta_running *= beta
    return (value_corr, beta_running)

Adam

def simulate_adam(dW_array: np.ndarray, beta_1: float, beta_2: float, epsilon: float) -> np.ndarray:
    v_dW = 0.0
    s_dW = 0.0
    beta_1_running = 0.0
    beta_2_running = 0.0
    collected_data = []
    for t in range(dW_array.shape[0]):
        dW = dW_array[t, 0]
        v_dW = ewma(dW, v_dW, beta_1)
        s_dW = ewma(np.power(dW, 2), s_dW, beta_2) # WEIRD
        v_dW_corr, beta_1_running = update_beta(v_dW, beta_1, beta_1_running, t)
        s_dW_corr, beta_2_running = update_beta(s_dW, beta_2, beta_2_running, t)
        dW_reg = v_dW_corr / (np.sqrt(s_dW_corr) + epsilon) # WEIRD
        collected_data.append((t, dW, v_dW, s_dW, v_dW_corr, s_dW_corr, dW_reg))
    return np.array(collected_data)  # becomes a column-oriented array

Not Adam, take 1

Modified to properly compute the 2nd moment s (or s^{corr}), which should be the running average of the squared deviation of dW from its mean, not the running average of the square of dW as in Adam.

Modified to compute dW_{reg}, the value to apply to W, as the first moment divided by the square root of the second moment. This then acts like a reliability estimate (a signal-to-noise ratio), attenuating the magnitude of v_{dW}^{corr} when the second moment (the variability) s^{corr} is high.

def simulate_modified_adam(dW_array: np.ndarray, beta_1: float, beta_2: float, epsilon: float) -> np.ndarray:
    v_dW = 0.0
    s_dW = 0.0
    beta_1_running = 0.0
    beta_2_running = 0.0
    collected_data = []
    for t in range(dW_array.shape[0]):
        dW = dW_array[t, 0]
        v_dW = ewma(dW, v_dW, beta_1)
        s_dW = ewma(np.power(dW - v_dW, 2), s_dW, beta_2) # MODDED
        v_dW_corr, beta_1_running = update_beta(v_dW, beta_1, beta_1_running, t)
        s_dW_corr, beta_2_running = update_beta(s_dW, beta_2, beta_2_running, t)
        dW_reg = v_dW_corr / (np.sqrt(s_dW_corr) + epsilon) # MODDED
        collected_data.append((t, dW, v_dW, s_dW, v_dW_corr, s_dW_corr, dW_reg))
    return np.array(collected_data)  # becomes a column-oriented array

Not Adam, take 2

I like this even more: you don’t need epsilon, and you attenuate dW_{reg} more strongly since you don’t take the square root.

def simulate_modified_adam(dW_array: np.ndarray, beta_1: float, beta_2: float) -> np.ndarray:
    v_dW = 0.0
    s_dW = 0.0
    beta_1_running = 0.0
    beta_2_running = 0.0
    collected_data = []
    for t in range(dW_array.shape[0]):
        dW = dW_array[t, 0]
        v_dW = ewma(dW, v_dW, beta_1)
        s_dW = ewma(np.power(dW - v_dW, 2), s_dW, beta_2) # MODDED
        v_dW_corr, beta_1_running = update_beta(v_dW, beta_1, beta_1_running, t)
        s_dW_corr, beta_2_running = update_beta(s_dW, beta_2, beta_2_running, t)
        dW_reg = v_dW_corr / (1.0 + s_dW_corr) # MODDED
        collected_data.append((t, dW, v_dW, s_dW, v_dW_corr, s_dW_corr, dW_reg))
    return np.array(collected_data)  # becomes a column-oriented array

The above corresponds to this dataflow:

Graphs for “take 2”

gradient component is a noisy sine

gradient component is a simpler noisy sine

gradient component is a noisy constant

One notices that in the case of “gradient component = constant”, dW_{reg} is not suppressed as hard as in the original Adam and stays near the mean, as one would hope to see.

This case does not exist. To see why, check out the pseudocode below, which is from the Adam paper (arXiv):

My bad. I told you it is called the second moment without further explanation.

The full name of what Adam uses is “Second Raw Moment”, and it is different from the Variance which is also called the “Second Central Moment”.

In fact, the Adam paper does call it precisely the Second Raw Moment. :sad_but_relieved_face:
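
A quick back-of-envelope check for the “soft” signal above (dW = 10 + 5·sin(·) + noise with σ = 0.5), assuming the running averages have converged to the true expectations:

  • Second raw moment (what Adam tracks): E[dW^2] = 10^2 + 5^2/2 + σ^2 = 112.75, so \sqrt{s^{corr}} ≈ 10.6. This is why \sqrt{s} flattens out slightly above the mean of 10 rather than at 10.
  • Second central moment (roughly what the “Not Adam” variants track): Var[dW] = 5^2/2 + σ^2 = 12.75, square root ≈ 3.6.
  • Similarly, for the noisy-constant plot (constant ≈ 5, modest noise): \sqrt{E[dW^2]} ≈ 5, so under original Adam dW_{reg} ≈ 5/5 ≈ 1, matching the “from 5 to ~1” observation above.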

Is there a plot-to-plot comparison between Adam and “Not Adam, take 1”, and another plot-to-plot comparison between Adam and “Not Adam, take 2”, where in each comparison both Adam and the “Not Adam” variant are fed the same stream of dW?
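
A minimal sketch of how such a comparison could be set up, reusing build_swinging_soft, simulate_adam and the “take 1” simulate_modified_adam defined above; both simulators are fed the identical dW_array (the 200-step window and the plot styling are arbitrary choices):

import numpy as np
import matplotlib.pyplot as plt

# One shared stream of dW for both simulators.
size = 10000
ticks = np.arange(0, size, dtype=np.int32).reshape(-1, 1)
rng = np.random.default_rng(1)
dW_array = build_swinging_soft(ticks, size / 50) + rng.normal(size=size, scale=0.5).reshape(-1, 1)

adam_data = simulate_adam(dW_array, beta_1=0.9, beta_2=0.999, epsilon=1e-7)
take1_data = simulate_modified_adam(dW_array, beta_1=0.9, beta_2=0.999, epsilon=1e-7)

# Overlay the two dW_reg series (column 6 of the collected data) for the first 200 steps.
fig, ax = plt.subplots(figsize=(15, 7))
ax.plot(adam_data[:200, 0], adam_data[:200, 6], label='Adam: dW_reg', marker='o', linestyle='none')
ax.plot(take1_data[:200, 0], take1_data[:200, 6], label='Not Adam, take 1: dW_reg', marker='x', linestyle='none')
ax.axhline(y=0, color='black', linewidth=2)
ax.legend()
ax.grid()
plt.show()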