A note on the "TensorFlow" section

To collect further knowledge about “TensorFlow”, I recommend asking ChatGPT the following question. Hopefully it doesn’t make anything up :wink::

What is the principle behind Tensorflow’s ability to compute the gradient of a cost function in terms of certain parameters? Is it done symbolically or numerically?

Upon which one gets a lengthy response about “Automatic Differentiation (Autodiff)”, in particular:

TensorFlow primarily uses reverse-mode automatic differentiation, which is well-suited for functions with many inputs (parameters) and a single scalar output (like a cost function in deep learning). This method computes gradients efficiently by traversing the computational graph backward, storing intermediate results to avoid redundant computations.
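
(As an aside, that “many parameters, single scalar cost” point is easy to see with TF’s tf.GradientTape, shown in a minimal sketch below; the 1000-parameter toy cost is made up and a TF 2.x eager-mode install is assumed:)

import tensorflow as tf

w = tf.Variable(tf.ones([1000]))       # 1000 parameters
with tf.GradientTape() as tape:
    cost = tf.reduce_sum(w ** 2)       # single scalar cost
grad = tape.gradient(cost, w)          # one backward pass yields all 1000 partials
print(grad.shape)                      # (1000,)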

And also ask:

If you give the cost equation to TensorFlow in Python, you write it in the same way as you would for evaluation. How does TensorFlow get the structure of the cost equation? Doesn’t it need a string to examine?

Upon which one gets told:

Great question! The key insight is that TensorFlow does not need a string representation of the equation to analyze it. Instead, TensorFlow constructs a computational graph dynamically when operations are performed using TensorFlow tensors.
When you define an equation in Python using TensorFlow operations, such as:

import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x**2 + 2*x + 1  # Cost function

TensorFlow does not parse a string representation of y = x**2 + 2*x + 1. Instead, it records all the operations that occur within the tf.GradientTape() context. These operations are executed normally in Python but are also logged into TensorFlow’s computational graph.
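
(To complete that snippet, here is the step the quote leads up to, assuming TF 2.x eager mode; dy/dx = 2x + 2 = 8 at x = 3:)

grad = tape.gradient(y, x)   # follows the taped operations backward from y to x
print(grad.numpy())          # 8.0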

And also ask:

But at least in the call grad = tape.gradient(y, x), the tape.gradient() function must be aware of the names of the variables, i.e. y and x? Maybe the names of the variables are communicated to the callee in Python?

And we get told:

Actually, TensorFlow’s tape.gradient(y, x) function does not rely on variable names like "y" or "x" at all. Instead, it tracks computation using references to objects in memory, not variable names. Here’s how it works:

  • When you perform operations using tf.Variable or tf.Tensor, TensorFlow logs them in a computational graph inside tf.GradientTape().
  • The graph records dependencies between variables and operations but does not store variable names.
  • In tape.gradient(y, x), the arguments y and x are Python variables that reference TensorFlow objects.
  • y is a tensor that results from computations involving x.
  • x is a tf.Variable or tf.Tensor whose gradient we want.
  • When tape.gradient(y, x) is called, TensorFlow follows the computational graph backward from y to x, applying the chain rule.
  • Since the graph stores the sequence of operations, it can determine how y depends on x without needing variable names.
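
A small experiment that illustrates the “references, not names” point above (a sketch, assuming TF 2.x; the alias some_other_name is of course made up):

import tensorflow as tf

w = tf.Variable(2.0)                           # the Python name "w" is irrelevant to TF
with tf.GradientTape() as tape:
    loss = w * w * w                           # loss = w^3
some_other_name = w                            # same object, different Python name
grad = tape.gradient(loss, some_other_name)    # still works: d(w^3)/dw = 3*w^2
print(grad.numpy())                            # 12.0 at w = 2.0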

I have not looked at the TF source code, but one assumes that they have implemented the analytic derivatives of functions that they actually provide, e.g. sigmoid, tanh, relu, etc. Presumably they use autodiff at any Chain Rule step in which they don’t have an analytic derivative available.
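
Without reading the source, one can at least check from the outside that the taped gradient of tf.sigmoid matches its closed form σ(x)(1 − σ(x)); this only confirms the result, not how TF obtains it (a sketch, assuming TF 2.x):

import tensorflow as tf

x = tf.Variable(0.5)
with tf.GradientTape() as tape:
    y = tf.sigmoid(x)
auto_grad = tape.gradient(y, x)                        # what autodiff produces
analytic_grad = tf.sigmoid(x) * (1.0 - tf.sigmoid(x))  # closed-form derivative
print(auto_grad.numpy(), analytic_grad.numpy())        # agree to float precision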

The TF/Keras documentation is quite good. You can also explore further, e.g. starting here.

Torch and other ML platforms also implement autodiff.
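
For comparison, a minimal PyTorch version of the earlier snippet (a sketch, assuming a recent torch install):

import torch

x = torch.tensor(3.0, requires_grad=True)
y = x**2 + 2*x + 1        # same toy cost as above
y.backward()              # reverse-mode autodiff
print(x.grad)             # tensor(8.)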

As I’m sure you already realize, the fundamental mechanisms are based on the same “finite differences” idea that we learned in the Gradient Checking material in C2 W1. But a lot more care has gone into doing it accurately and efficiently. I’m sure if you search the ACM digital library you can find some papers about this.


Thank you for these docs.

There is also a Wikipedia page to which the TensorFlow documentation links.

But I daresay you err when saying that autodiff is based on finite differences. Autodiff is based on analytic information about the functions to be differentiated, so that we can evaluate \frac{\partial f}{\partial w} at the w of interest directly, just as we do by hand. The computational graph may not include non-analytic functions (or else one would have to plug in a Taylor expansion of the non-analytic function).
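
One way to see the difference in practice: compare the taped gradient of sin(w) with its analytic derivative cos(w) and with a central-difference approximation (a sketch, assuming TF 2.x; float64 chosen so the error levels are visible):

import numpy as np
import tensorflow as tf

w = tf.Variable(1.3, dtype=tf.float64)
with tf.GradientTape() as tape:
    f = tf.sin(w)
auto = tape.gradient(f, w).numpy()                        # autodiff result
exact = np.cos(1.3)                                       # analytic derivative
h = 1e-6
numeric = (np.sin(1.3 + h) - np.sin(1.3 - h)) / (2 * h)   # central difference
print(abs(auto - exact), abs(numeric - exact))            # autodiff error sits at round-off level;
                                                          # the finite-difference error does not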

My apologies: you are right that TF autodiff does not use any variants of numeric differentiation. Obviously I had not actually read the documentation that I linked above.


No worries. I will confess this happens to me more often than I would like, too. There is too much to absorb.

Meanwhile

A simple exercise in the “evaluate the derivative at the leaves of the differentiation tree first, then compose while moving back to the root” approach (a natural fit for the innermost-first evaluation order of most programming languages), as described in the August 1964 paper by R.E. Wengert,

A simple automatic derivative evaluation program

https://dl.acm.org/doi/10.1145/355586.364791

It was given in FORTRAN, with a typo that apparently nobody ever worried about.

Here it is, freshly redone in Python, with additional tracing output via strings, a possibility not directly available back then.

It seems to work but I haven’t quite convinced myself why:

import math
from typing import Tuple

import numpy as np


class Dual:
    def __init__(self, x: float, dx: float, desc: str, desc_d: str) -> None:
        self.x = x # part of the evaluation of f(x1,x2,...,xn)
        self.dx = dx # part of the evaluation of ∂f(x1,x2,...,xn)/∂x
        self.desc = desc # string describing formula computed
        self.desc_d = desc_d # string describing derivative computed


def add(u: Dual, v: Dual) -> Dual:
    desc = f"({u.desc} + {v.desc})"
    desc_d = f"({u.desc_d} + {v.desc_d})"
    return Dual(u.x + v.x, u.dx + v.dx, desc, desc_d)


def prod(u: Dual, v: Dual) -> Dual:
    desc = f"({u.desc} * {v.desc})"
    desc_d = f"({u.desc}*{v.desc_d} + {u.desc_d}*{v.desc})"
    return Dual(u.x * v.x, u.x * v.dx + u.dx * v.x, desc, desc_d)


def sine(u: Dual) -> Dual:
    desc = f"sin({u.desc})"
    desc_d = f"{u.desc_d}*cos({u.desc})"
    return Dual(math.sin(u.x), u.dx * math.cos(u.x), desc, desc_d)


#
# We have the formula: f(x1,x2,x3) = x1 * x2 + sin(x2 * x3).
# fun1 evaluates the function at (x1,x2,x3) and, depending on which input is
# seeded with dx = 1, also computes one of ∂f/∂x1, ∂f/∂x2, ∂f/∂x3.
#

def fun1(x1: Dual, x2: Dual, x3: Dual) -> Dual:
    return add(prod(x1, x2), sine(prod(x2, x3)))


def part(x1: float, x2: float, x3: float) -> Tuple[float, float, float, float]:
    f_eval = fun1(Dual(x1, 0.0, "x1", "df/dx1"),
                  Dual(x2, 0.0, "x2", "df/dx2"),
                  Dual(x3, 0.0, "x3", "df/dx3"))
    # f_eval now contains the correct value for fun1(x1,x2,x3)
    print(f"fun1(x1,x2,x3): {f_eval.desc} = {f_eval.x}")

    f_by_x1 = fun1(Dual(x1, 1.0, "x1", "[dx1/dx1=1]"),
                   Dual(x2, 0.0, "x2", "[dx2/dx1=0]"),
                   Dual(x3, 0.0, "x3", "[dx3/dx1=0]"))
    # f_by_x1 now contains the correct value for ∂fun1(x,x2,x3)/∂x @ x1
    print(f"∂fun1(x,x2,x3)/∂x @ x1: {f_by_x1.desc_d} = {f_by_x1.dx}")

    f_by_x2 = fun1(Dual(x1, 0.0, "x1", "[dx1/dx2=0]"),
                   Dual(x2, 1.0, "x2", "[dx2/dx2=1]"),
                   Dual(x3, 0.0, "x3", "[dx3/dx2=0]"))
    # f_by_x2 now contains the correct value for ∂fun1(x1,x,x3)/∂x @ x2
    print(f"∂fun1(x1,x,x3)/∂x @ x2: {f_by_x2.desc_d} = {f_by_x2.dx}")

    f_by_x3 = fun1(Dual(x1, 0.0, "x1", "[dx1/dx3=0]"),
                   Dual(x2, 0.0, "x2", "[dx2/dx3=0]"),
                   Dual(x3, 1.0, "x3", "[dx3/dx3=1]"))
    # f_by_x3 now contains the correct value for ∂fun1(x1,x2,x)/∂x @ x3
    print(f"∂fun1(x1,x2,x)/∂x @ x3: {f_by_x3.desc_d} = {f_by_x3.dx}")
    return f_eval.x, f_by_x1.dx, f_by_x2.dx, f_by_x3.dx


assert np.all(np.isclose(
    [1.72058450, 2.0, 3.880510859, 1.92034057],
    part(1, 2, 3)))

assert np.all(np.isclose(
    [6.909297, 2.0, 2.583853, -0.832293],
    part(3, 2, 1)))
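
As a quick cross-check (not part of Wengert's scheme, just a sanity test, assuming TF 2.x is installed), the same function and point can be pushed through tf.GradientTape and should reproduce part(1, 2, 3):

import tensorflow as tf

x1, x2, x3 = tf.Variable(1.0), tf.Variable(2.0), tf.Variable(3.0)
with tf.GradientTape() as tape:
    f = x1 * x2 + tf.sin(x2 * x3)
grads = tape.gradient(f, [x1, x2, x3])
print(f.numpy(), [g.numpy() for g in grads])
# Expected: 1.72058..., [2.0, 3.88051..., 1.92034...]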