Report on rewriting the week 3 assignment (“tensorflow introduction”)

I have just finished re-coding the assignment “tensorflow introduction” as a standalone program that can be run locally. That took many, many days indeed (around 3 weeks? what??)

A lot of interesting problems had to be solved, and a lot of those were of the usual “how do I organize this code properly” and “what am I doing” type.

I am not sure what I can upload of this, even if I rip out the “graded” elements (currently isolated in a separate module, see below). But there may be interest.

Here are the interesting points:

TensorFlow import problems

I have written the code in JetBrains “PyCharm” (an indispensable helper for highlighting problems), but the current PyCharm cannot properly deal with TensorFlow imports (these seem to be too dynamic). Thus the import statements are not as they should be but reveal TensorFlow implementation details: for example, one needs to write from tensorflow.python.keras.metrics import CategoricalAccuracy instead of just from tensorflow.keras.metrics import CategoricalAccuracy to get a functioning import. I understand JetBrains has a fix ready but not yet released.
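
In code, the workaround currently looks like this:

# works today, but exposes the tensorflow.python implementation package:
from tensorflow.python.keras.metrics import CategoricalAccuracy

# what it should be (and what the documentation shows), once the PyCharm fix is released:
# from tensorflow.keras.metrics import CategoricalAccuracy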

Type annotations

Type annotations have been added throughout, giving code that tells you what it does (at least to a certain degree) rather than code that bamboozles you into believing you understand what it does.
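
For illustration, a made-up helper in the same style (not actual project code, just the flavour of the annotations used):

import tensorflow as tf
from tensorflow import Tensor

def to_columnwise_one_hot(labels: Tensor, num_categories: int) -> Tensor:
    # turn a 1-D tensor of integer labels into a (num_categories, batch_size) one-hot tensor
    return tf.transpose(tf.one_hot(labels, depth=num_categories))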

Use of classes

Classes have been used to group related data items together (i.e. they are mostly used as “records”). This makes passing lots of related values between functions much more readable.

For example, a dedicated class for “hyperparameters”:

class HyperParams:
    # "optimizer" expects an OptimizerV2 instance (the common base class of the tf.keras optimizers,
    # imported via tensorflow.python.keras, see the import remark above); "how_to_compute_cross_entropy"
    # takes a HowToComputeCrossEntropy value, an enum defined in another module of the project.
    def __init__(self,
                 layer_widths: list[int],
                 optimizer: OptimizerV2,
                 num_epochs=1500,
                 minibatch_size=32,
                 how_to_compute_cross_entropy=HowToComputeCrossEntropy.CXE_FUNCTION,
                 print_period=10,
                 collect_period=10):
        self.optimizer = optimizer
        self.layer_widths = layer_widths
        self.num_epochs = num_epochs
        self.minibatch_size = minibatch_size
        self.how_to_compute = how_to_compute_cross_entropy
        self.print_period = print_period  # print every "this many" epochs; do not print if <= 0
        self.collect_period = collect_period  # collect stats every "this many" epochs; do not collect if <= 0

    def num_layers(self) -> int:
        # number of layers w/o the input layer
        return len(self.layer_widths)

    def input_layer_width(self) -> int:
        return self.layer_widths[0]

    def output_layer_width(self) -> int:
        return self.layer_widths[-1]

    def num_categories(self) -> int:
        return self.layer_widths[-1]

    def is_print_now(self, epoch_index: int) -> bool:
        return self.print_period > 0 and (epoch_index % self.print_period == 0 or epoch_index == self.num_epochs - 1)

    def is_collect_now(self, epoch_index: int) -> bool:
        return self.collect_period > 0 and (epoch_index % self.collect_period == 0 or epoch_index == self.num_epochs - 1)
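
A hypothetical configuration could then look like this (assuming a TensorFlow version in which tf.keras.optimizers.Adam is still an OptimizerV2 subclass; the numbers are only examples):

import tensorflow as tf

hyper_params = HyperParams(
    layer_widths=[12288, 25, 12, 6],  # input layer, two hidden layers, 6 output categories
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
    num_epochs=1000,
    minibatch_size=32)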

Improving runtime verification

A lot of assertions and checks had to be added to make sure that what we think we are dealing with is actually what we are dealing with.

For a maximalist example, here is the code that checks the structure of a “minibatch”:

import tensorflow as tf
from tensorflow import Tensor

def is_columnwise_one_hot(t: Tensor) -> bool:
    if t.ndim != 2:  # the number of dimensions is now called "rank", which is bad because it's not the rank
        return False
    # must only contain 0 or 1
    tensor_of_bool_1 = tf.logical_or(tf.equal(t, 0), tf.equal(t, 1))
    if not tf.reduce_all(tensor_of_bool_1):
        return False
    # all columns must sum to 1
    tensor_of_sums = tf.reduce_sum(t, axis=0)
    tensor_of_bool_2 = tf.equal(tensor_of_sums, 1)
    if not tf.reduce_all(tensor_of_bool_2):
        return False
    return True

def _assert_minibatch_structure(minibatch: tuple, hyper_params: HyperParams) -> None:
    # The structure of the minibatch is a bit unexpected...
    # It is actually a tuple (tensor of minibatch_size "X tensors", tensor of minibatch_size "Y tensors").
    # I was expecting a Dataset of minibatch_size tuples of ("X tensor", "Y tensor") instead.
    assert isinstance(minibatch, tuple), f"minibatch should be a tuple but is a {type(minibatch)}"
    assert len(minibatch) == 2, f"minibatch should be a tuple of length 2 but is a tuple of length {len(minibatch)}"
    tensor_X = minibatch[0]
    tensor_Y = minibatch[1]
    assert isinstance(tensor_X, Tensor)
    assert isinstance(tensor_Y, Tensor)
    assert tensor_X.ndim == 3, f"Expected 3 dimensions for 'tensor_X', got {tensor_X.ndim}"
    assert tensor_Y.ndim == 3, f"Expected 3 dimensions for 'tensor_Y', got {tensor_Y.ndim}"
    # the last minibatch may be smaller than minibatch_size if the dataset size is not a multiple of it
    this_minibatch_size = tuple(tensor_X.shape)[0]
    assert 0 < this_minibatch_size <= hyper_params.minibatch_size, f"Expected the size of this minibatch to range in [1,...,{hyper_params.minibatch_size}], got {this_minibatch_size}"
    assert tuple(tensor_Y.shape)[0] == this_minibatch_size, f"Expected 1st dimension of size {this_minibatch_size} for 'tensor_Y', got {tuple(tensor_Y.shape)[0]}"
    assert tuple(tensor_X.shape)[1] == hyper_params.input_layer_width(), f"Expected 2nd dimension of size {hyper_params.input_layer_width()} for 'tensor_X', got {tuple(tensor_X.shape)[1]}"
    assert tuple(tensor_Y.shape)[1] == hyper_params.num_categories(), f"Expected 2nd dimension of size {hyper_params.num_categories()} for 'tensor_Y', got {tuple(tensor_Y.shape)[1]}"
    assert tuple(tensor_X.shape)[2] == 1, f"Expected 3rd dimension of size 1 for 'tensor_X', got {tuple(tensor_X.shape)[2]}"
    assert tuple(tensor_Y.shape)[2] == 1, f"Expected 3rd dimension of size 1 for 'tensor_Y', got {tuple(tensor_Y.shape)[2]}"
    assert is_columnwise_one_hot(tf.transpose(tf.squeeze(tensor_Y))), f"tensor_Y is not a column-wise one-hot tensor: {tensor_Y}"
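
A hypothetical call site, assuming train_dset is a tf.data.Dataset of (x, y) example pairs in which x has shape (input_width, 1) and y has shape (num_categories, 1):

for minibatch in train_dset.batch(hyper_params.minibatch_size):
    _assert_minibatch_structure(minibatch, hyper_params)
    # ... forward propagation, loss, gradients etc. for this minibatch ...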

Porting of test functions

The test functions from the assignment have been rewritten too (only those in the notebook, not the ones used by the grader), and more explicit error messages have been added. As a maximalist example, here is the test for the “compute sum of loss” function:

# (assumes the usual imports: inspect, numpy as np, tensorflow as tf, Tensor, Dataset and Final,
#  plus the project's own graded_compute_sum_of_loss, HowToComputeCrossEntropy and print_green)
def test_compute_total_loss_1(y_dset_onehot: Dataset, how_to_compute: HowToComputeCrossEntropy, verbose: bool) -> None:
    fname: Final[str] = inspect.currentframe().f_code.co_name
    last_layer_z = tf.constant(
        [[2.4048107, 5.0334096],
         [-0.7921977, -4.1523376],
         [0.9447198, -0.46802214],
         [1.158121, 3.9810789],
         [4.768706, 2.3220146],
         [6.1481323, 3.909829]]
    )
    # In the above, the y values will select [6.1481323, 5.0334096]
    # If scaled via softmax, those are [0.775994115215876, 0.573110512580303]
    # If natural log is applied, those are [-0.253610342312369, -0.556676714231485]
    # https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch
    # create batches of size 2
    print(f"{fname}: dataset 'y_dset_onehot' contains tensors of this shape: {next(iter(y_dset_onehot)).shape}")
    y_dset_onehot_batched = y_dset_onehot.batch(batch_size=2)
    assert isinstance(y_dset_onehot_batched, Dataset)
    y_onehot_minibatch: Tensor = next(iter(y_dset_onehot_batched))
    assert isinstance(y_onehot_minibatch, Tensor)
    print(f"{fname}: dataset with batches 'y_dset_onehot_batched' contains tensors of this shape: {y_onehot_minibatch.shape}")
    y_onehot_minibatch_reshaped = tf.transpose(tf.squeeze(y_onehot_minibatch, axis=-1))
    print(f"{fname}: 'y_onehot_minibatch_reshaped' has this this shape: {y_onehot_minibatch_reshaped.shape}")
    print(f"{fname}: 'y_onehot_minibatch_reshaped' has this this content:\n {y_onehot_minibatch_reshaped}")
    sum_of_loss = graded_compute_sum_of_loss(last_layer_z, y_onehot_minibatch_reshaped, how_to_compute, verbose)
    assert isinstance(sum_of_loss, Tensor), f"Result is not a Tensor but a {type(sum_of_loss)}. Use the TensorFlow API!"
    assert tuple(sum_of_loss.shape) == (), f"Result does not have proper shape. Should be () but is {tuple(sum_of_loss.shape)}"
    expected = (0.50722074 + 1.1133534) / 2.0
    obtained = sum_of_loss.numpy()
    print(f"{fname}: obtained {obtained} with {how_to_compute}")
    print(f"{fname}: expected {expected}")
    print(f"{fname}: absolute difference {abs(obtained - expected)}")
    # https://numpy.org/doc/stable/reference/generated/numpy.isclose.html
    assert np.isclose(obtained, expected, atol=1e-7), f"Expected {expected} but got {obtained}"
    print_green(f"{fname} passed")
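
A hypothetical invocation, assuming y_dset_onehot is the one-hot dataset built earlier from the assignment's training labels (the expected values above depend on the first two labels of exactly that dataset):

test_compute_total_loss_1(y_dset_onehot, HowToComputeCrossEntropy.CXE_FUNCTION, verbose=True)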

Printing

The messages printed to stdout about the evolution of the computation have been extended. One now sees things like:

epoch #930: end at 10:45:47
   training batch
      tensorflow accuracy for training batch has worsened by -1.21% to 98.52% since epoch 900
      soft accuracy for training batch has worsened by -1.34% to 94.92% since epoch 900
      hard accuracy for training batch has worsened by -1.21% to 98.52% since epoch 900
      cost for training batch has worsened by -52.32% to 0.06406 since epoch 900
   testing batch
      tensorflow accuracy for testing batch has improved by 1.10% to 76.67% since epoch 900
      soft accuracy for testing batch has improved by 1.49% to 73.26% since epoch 900
      hard accuracy for testing batch has improved by 1.10% to 76.67% since epoch 900
      cost for testing batch has improved by 5.44% to 0.79494 since epoch 900

Modularization

The notebook format is evidently not a friend of modularization, so this has been fixed. The code has been split into a number of Python files, with the “graded” parts isolated in the graded.py file (except for the function that computes the one-hot representation, which currently needs to live in “common”, the topmost module).

Style improvements

The original code has some improvable style elements; these are the ones I remember:

  • Using np.isclose() on Tensor objects; it is not evident why this should work reliably, and it does not seem to be recommended practice;
  • Doing type checks with type(obj) == Type instead of isinstance(obj, Type);
  • Passing the functions-to-test as callables to the test functions instead of having the test function call the function-to-test directly. It is really confusing, and when the test function then calls other graded functions directly (i.e. not via a Callable) it also becomes futile. But maybe there is some non-obvious reason why it is done like that.
  • Big blocks of code have been broken up into smaller functions. Big blocks are not recommended in most cases, and they are made worse in Python by the lack of proper BEGIN/END markers and the sad lack of local scope (you never know where a variable comes from or where it goes).
  • Use of “enums” (which the IDE can complete and the interpreter can check at runtime) rather than strings (which are arbitrary and uncheckable) as keys for dicts; see the sketch right after this list.
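
For instance (made-up names, just to show the idea):

from enum import Enum, auto

class StatKind(Enum):
    COST = auto()
    SOFT_ACCURACY = auto()
    HARD_ACCURACY = auto()

stats: dict[StatKind, list[float]] = {kind: [] for kind in StatKind}
stats[StatKind.COST].append(0.794)  # the IDE can complete StatKind.COST and flags typos immediately
# with plain string keys ("cost", "soft_accuracy", ...) a typo is discovered only at runtime, if at all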

Computing the softmax

The computation of the softmax can be done with TensorFlow or “explicitly” (i.e. code that does it step by step using TensorFlow operations has been added). In spite of reports on the forum that this would fail the testing/grading of assignment 3 due to numerical problems, I found it to work perfectly well; all tests passed immediately.
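
The explicit column-wise variant is essentially this (a minimal sketch, not necessarily line-for-line what ended up in the graded module; subtracting the column maximum is the usual guard against overflow, which is presumably what the numerical worries were about):

import tensorflow as tf

def softmax_columnwise(z: tf.Tensor) -> tf.Tensor:
    # z has shape (num_categories, batch_size); every column is mapped to a probability distribution
    z_shifted = z - tf.reduce_max(z, axis=0, keepdims=True)
    exp_z = tf.exp(z_shifted)
    return exp_z / tf.reduce_sum(exp_z, axis=0, keepdims=True)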

Nicer plots

The generation of plots has been improved somewhat. Here is the plot after 1000 epochs, using Adam gradient descent as given in the assignment. See below for an explainer about the various “accuracy” values:

(Notice that the cost generated for the “test data” increases at some point due to the ANN starting to overfit.)

Accuracy computation

The accuracy is computed in two different ways, which I chose to label “soft accuracy” and “hard accuracy”:

  • “Soft accuracy”: If we follow the strategy whereby the softmax output is used as a vector of probabilities according to which the predictor will randomly select what class to output (i.e. the predictor performs a random draw from the possible classes according to the softmax output), then the predictor will be “accurate” with a certain probability, namely the one corresponding to the true class. So the “soft accuracy” will always be incremented by the softmax output corresponding to the true class. One may note that our “cost” is the negative logarithm of the probability of winning the “soft accuracy” prediction on every element of the batch.
  • “Hard accuracy”: If we follow the strategy whereby the predictor always predicts the class with the largest softmax output, then the predictor will be accurate only if that is the true class. So the hard accuracy is incremented by 1 iff the position of the maximum softmax value corresponds to the true class (see the sketch after this list).
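
A minimal sketch of both computations (assuming, as elsewhere in my code, that columns are examples; this is not the literal project code):

import tensorflow as tf

def soft_and_hard_accuracy(softmax_out: tf.Tensor, y_onehot: tf.Tensor) -> tuple[float, float]:
    # both tensors have shape (num_categories, batch_size)
    prob_of_true_class = tf.reduce_sum(softmax_out * y_onehot, axis=0)  # softmax output at the true class
    soft = tf.reduce_mean(prob_of_true_class)                           # expected hit rate of the "random draw" predictor
    hits = tf.equal(tf.argmax(softmax_out, axis=0), tf.argmax(y_onehot, axis=0))
    hard = tf.reduce_mean(tf.cast(hits, tf.float32))                    # hit rate of the "pick the maximum" predictor
    return float(soft.numpy()), float(hard.numpy())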

Both of the above have been hand-coded and graphed together with the accuracy computed by tensorflow.python.keras.metrics.CategoricalAccuracy. It turns out that the latter computes the “hard accuracy”. (See the plots above)

As an aside, I spent some time checking the code that sets up the TensorFlow-computed accuracy, which was generating nonsensical results, until I noticed that I had to transpose the tensors for Z (also, rather inaccurately, called the logits, as those are only proper “logits” in the case of 2-class classification) and Y before passing them to CategoricalAccuracy.update_state() (at least in my code). It would be nice if TensorFlow allowed naming the tensor axes; then one could assert on the axis names.
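
A sketch of the fix with made-up toy values; the point is only the two transposes, since update_state() expects rows to be examples and columns to be categories:

import tensorflow as tf
from tensorflow.python.keras.metrics import CategoricalAccuracy

z = tf.constant([[2.0, -1.0], [0.5, 3.0], [1.0, 0.0]])  # (3 categories, 2 examples)
y = tf.constant([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]])   # true classes: 0 and 2
metric = CategoricalAccuracy()
metric.update_state(tf.transpose(y), tf.transpose(z))   # without the transposes the result is nonsense
print(metric.result().numpy())                          # 0.5: example 0 is a hit, example 1 is not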

Weight statistics

Additionally, a graph of the weight statistics is generated. Here, the Frobenius norm is normalized by dividing it by the number of neurons of the relevant layer.
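
Per layer, that normalization is simply the following (a sketch, assuming a weight matrix of shape (neurons_in_this_layer, neurons_in_previous_layer)):

import tensorflow as tf

def normalized_frobenius_norm(w: tf.Tensor) -> float:
    # tf.norm() of a 2-D tensor is the Frobenius norm by default
    return float((tf.norm(w) / w.shape[0]).numpy())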

Maybe I should try to do some L2 regularization. :thinking: But not now.

Call graph

Let’s finish with the call graph, which shows the part where the “tape” collects the operations of forward propagation, and the part where we perform a “tapeless” forward propagation over the test data to get accuracy values (it’s probably too large for the forum and will be mangled):
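
As a reminder of what “taped” vs. “tapeless” means here, a self-contained toy example (stand-in values, not the project code):

import tensorflow as tf

w = tf.Variable([[1.0, 2.0]])                      # a stand-in "weight"
x_train = tf.constant([[3.0], [4.0]])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

with tf.GradientTape() as tape:                    # the tape records the forward operations...
    loss = tf.reduce_sum(tf.square(w @ x_train))
grads = tape.gradient(loss, [w])                   # ...so that backprop can replay them
optimizer.apply_gradients(zip(grads, [w]))

x_test = tf.constant([[5.0], [6.0]])
z_test = w @ x_test                                # tapeless: evaluation only, no gradients needed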
