C2_W2_Assignment - How digit recognition really works

Has anyone seen a good example that explains how digit recognition works, i.e. how the pixel values are actually interpreted as a single digit? The neural net (machine) doesn't really know that it is a number; to it, the input is just a collection of pixels with varying brightness. It calculates the weighted sum, applies the activation, does forward prop, and then outputs a digit.

I'm trying to visualize how the weighted sum of the input is interpreted as a digit. Can anyone explain by taking one training sample as an example and running it through the neural net step by step? Like, if x = [ 0.00e+00 … 8.56e-06 1.94e-06 -7.37e-04 -8.13e-03 -1.86e-02 -1.87e-02 -1.88e-02 -1.91e-02 -1.64e-02 -3.78e-03 …, 0.00e+00] is the input as shown above, then we calculate a^{[1]} in the first layer, then another a^{[2]} in the second layer, and finally the output layer makes a prediction?

Prof Ng has done something similar for sample data, I think, but it would be helpful if it could be explained for this handwritten digit identification task for just 1 training sample (only forward prop). I'm trying to see how the pixel data is interpreted as a digit.

Hello @ronnyfrano

It is not an easy thing to visualize and interpret the inner workings of a neural network. In your case, where you are not using convolutional layers but just dense layers, the network computes weighted sums and activations over all the pixels, as you point out, and then compares with what it has learned in the past during training. If a similar pattern in terms of categorization is found, then it predicts with a certain probability, a guess.

The TensorFlow Advanced Techniques Specialization introduces techniques to create class activation maps, gradient maps, etc. that can help you visualize which parts of an image are used to learn from and make a prediction. I don't remember it being done for dense NNs, but the principle should be the same.

Check it out if you have time.


I think part of the difficulty is that last part. Imagine that instead of classifying images of numbers, the network was classifying images of animals. You wouldn't say the network outputs a cat, right? It outputs a vector of probabilities, from which you can determine the index of the element with the largest value, then use that index to look up in a dictionary which class corresponds to that index. Same thing for digit recognition.

Right now you're probably thinking I didn't really answer the question, because that doesn't explain how the vector of probabilities was output. I find it useful to think of the neural network as a black box transformation. You input a 2D matrix of numbers X and apply a bunch of matrix multiplication and addition operations of the form W X + b until finally it outputs a 1D vector of numbers. During training, that output is called \hat{Y}, the predictions. The loss function compares \hat{Y} to the correct, known values Y. If the difference, which is the loss or error, is too big, then adjust the W and b values and try again.
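As a toy sketch of that black-box view (the sizes, values, and the squared-error loss here are all made up for illustration; the course uses cross-entropy):

import numpy as np

# A made-up 'image': 4 pixels flattened into a column vector
X = np.array([[0.0], [0.9], [0.8], [0.1]])   # shape (4, 1)

# Randomly initialized weights and bias for a 3-class output
W = np.random.randn(3, 4) * 0.1              # shape (3, 4)
b = np.zeros((3, 1))                         # shape (3, 1)

Y_hat = W @ X + b                            # shape (3, 1), the predictions

# Compare to the known one-hot label, say class 1
Y = np.array([[0.0], [1.0], [0.0]])
loss = np.mean((Y_hat - Y) ** 2)
print(Y_hat.ravel(), loss)                   # too big? adjust W and b and try again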

It may seem like the neural net is learning to recognize a 1, but what it is really learning is a weights matrix W such that when that matrix is multiplied by the matrix of pixels of an image of a 1, the output vector will have its largest value in the 2^{nd} position.

What muddies the mental picture for me is when I see pictures of “what a CNN learns” that show edges, then patches with shapes, then larger patches that start to look like something, and finally a 1, or a cat. These are not what the network itself is learning. Rather, they are the patches of the input image that trigger the activation function for that filter or layer after the weights matrix has been learned and applied. They represent what portion of the input was ‘downsampled’ into that output. At the last layer, it’s the entire image. At the first layer, it’s a bunch of little filter-sized patches. If it helps to think of the network as a signal extractor, they are the portions of the input that contain useful signal. Comments and suggestions welcome.


This is the key here. The network doesn’t learn “Number 1” or “Cat” but values in one or more W matrices. How are these numbers learned? Via the forward prop, loss calculation, and backward prop processes.

Now, how does the network know that a 1 is a 1 at the end? Because you instructed the network to do it. Let me give you an example with animals.

You want to classify images of 3 animals: Dog, Bird, Horse. Since the computer doesn’t understand words or letters but only numbers, you have to define numbers for these animals. You can do something like:

0 = Dog
1 = Bird
2 = Horse

You can set this up however you want.

Once you have defined this in your dataset, the model is trained and, at the end of the model, we have the last layer with a “softmax” activation. This last layer has to have the same number of units as the number of “classes” we defined before, so it has 3 units: one for Dog, one for Bird, and one for Horse.
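In Keras, such a model might look like this (just a sketch; the hidden-layer size of 25 is an arbitrary choice for illustration):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation='relu'),    # hidden layer (size is arbitrary here)
    tf.keras.layers.Dense(3, activation='softmax')   # one unit per class: Dog, Bird, Horse
])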

This last layer, then, is a vector with 3 positions: 0, 1, 2. This layer is filled with numbers that all sum up to 1. This is a vector of probabilities.

When we do an inference we can get these values on this vector:

0 = 0.25
1 = 0.15
2 = 0.60

These sum up to 1. And the highest probability is 0.60, which belongs to index 2, which, based on our definition, represents Horse.
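That lookup step is just an argmax over the probability vector, something like this minimal sketch (using the values and class ids from above):

import numpy as np

class_names = {0: 'Dog', 1: 'Bird', 2: 'Horse'}   # the ids we defined for our classes
probs = np.array([0.25, 0.15, 0.60])              # the softmax output from the inference

idx = int(np.argmax(probs))                       # index of the highest probability -> 2
print(class_names[idx])                           # Horse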

So to summarize:

  1. The model doesn’t learn to identify the actual number 1, but learns numbers that produce a high probability when you input a 1.
  2. The number of different things you want to classify determines the size of the last vector.
  3. The highest probability in the last vector indicates what class has been identified.
  4. You define the “code” or “id” for each class, as depicted above for Dog, Bird and Horse.

I hope this sheds some light on your question!

Juan


I considered adding something about this, but was writing without my first cup of coffee and had exceeded my attention span :crazy_face:

In the multi-class (ie not binary) classification task it is common to represent the ground truth, or Y, vector in what is known as a one hot format. For @Juan_Olano’s three animal universe and an input image of a horse, that would be [0 0 1]. At the end of the first training forward pass, the \hat{Y} vector of predictions would be pretty random…maybe [0.2 0.5 0.3]. The loss function would penalize these bad guesses, adjust weights during back prop, and try again. Over time, you would see the values in positions 0 and 1 converging towards 0.0 and the value in position 2 approaching (but likely never reaching) 1.0 until at training end you might have [0.25 0.15 0.60] from which you infer definitely not a bird, probably not a dog, most likely a horse. Hope this helps

ps: the model is the same for the digit recognition task. Suppose, to reduce ambiguity, the image is of a number 4. When setting up the training, the labels for images of 4 would all have been represented as a one hot using [0 0 0 0 1 0 0 0 0 0]. During training, the predictions would converge towards that, and at runtime you would hope forward prop would produce a vector with 9 very small floating point numbers and one floating point number near but no larger than 1.0 in the 5^{th} position. Note that this represents the probability that the image is a 4, but is itself a floating point number like 0.85 and definitely not a digit. Hope it helps.
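A quick sketch of those one-hot labels and the final lookup (np.eye is just one convenient way to build them; the prediction values below are invented for illustration):

import numpy as np

num_classes = 10
label = 4                               # the digit shown in the image

y = np.eye(num_classes)[label]          # one-hot target: [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
print(y)

# After training, a prediction should be close to that target:
y_hat = np.array([0.01, 0.02, 0.01, 0.03, 0.85, 0.02, 0.02, 0.02, 0.01, 0.01])
print(np.argmax(y_hat))                 # 4 -- note that 0.85 is a probability, not a digit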


Thanks for all the comments, this does help to get a clearer picture than what I had before (still don’t have the Eureka moment yet :slight_smile: )

This makes sense, exactly what I was looking for to get my head right. To help visualize this, is there an example out there on the internet that walks through this exact thing in terms of the vectors? Like X = [0,0,0,1,1,1,0,0,0], a neural net with 2 layers and 3 neurons, where the output of the first step is a = […], then the output layer values are a = […], then go back and iterate. A simple example which shows the w vector and the output vector it finds at every stage, with just 1 training example.

If there is none out there, then any pointers on how I could go about building such an example manually, instead of using TensorFlow, just to confirm my understanding? Like, what example could I consider, given that the digit recognition example has 400 features and it would be difficult to build an example around that? Something that is real world but still simple enough.

I’ll see if I can find something like what you describe
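In the meantime, here is a minimal NumPy sketch of that kind of walkthrough (the weights are random, so the specific numbers are meaningless; the point is the computation at each stage):

import numpy as np

np.random.seed(0)

# One training example: a tiny 'image' with 9 pixel features
x = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0], dtype=float)   # shape (9,)

# Layer 1: 3 units. W1 maps 9 inputs to 3 outputs.
W1 = np.random.randn(9, 3) * 0.5
b1 = np.zeros(3)
a1 = np.maximum(0, x @ W1 + b1)                           # ReLU activation
print('a1 =', a1)

# Layer 2 (output): 3 units; softmax turns raw scores into probabilities
W2 = np.random.randn(3, 3) * 0.5
b2 = np.zeros(3)
z2 = a1 @ W2 + b2
a2 = np.exp(z2) / np.sum(np.exp(z2))                      # softmax
print('a2 =', a2, ' predicted class =', np.argmax(a2))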


So I was thinking, once it is able to find the matrices W and b after the learning, what is the size of the eventual matrices W and b for a NN of the size below?

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential(
    [
        Dense(25, activation = 'relu'),
        Dense(15, activation = 'relu'),
        Dense(1, activation = 'linear')
    ],
    name='model_1'
)

Since at each layer in a NN we have a different number of params based on the number of units in each layer (like in the image below). So during prediction, what size of matrix W and b does it use to give an actual predicted value?

Each layer has its own matrix W and vector b.

For hidden layer 1: W = (n, 25) and b = (25,)
For hidden layer 2: W = (25, 15) and b = (15,)
For final layer: W = (15, 1) and b = (1,)
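One quick way to check this yourself, assuming the 400-feature input from the course assignment (the default layer names in the output may differ):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation='relu'),
    tf.keras.layers.Dense(15, activation='relu'),
    tf.keras.layers.Dense(1, activation='linear'),
], name='model_1')

model.build(input_shape=(None, 400))   # n = 400 input features

for layer in model.layers:
    print(layer.name, layer.kernel.shape, layer.bias.shape)
# dense   (400, 25) (25,)
# dense_1 (25, 15)  (15,)
# dense_2 (15, 1)   (1,)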

To be precise, your question cannot be answered with just the model you provided, because it depends on input shape. @Juan_Olano showed you the second dimension of the weights and biases, which depends on the number of connections you specified in the Dense layers. But the first dimension depends on the input dimensions. @Juan_Olano also correctly points out that each layer has its own weights and bias objects, which it seems may be a source of confusion for you.

To help you explore further, I cobbled together some sample code I pulled from the interweb that actually loads and trains on the MNIST data. Maybe play around with it a bit and see if it doesn’t help?

import tensorflow as tf
import numpy as np

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
input_shape = (28, 28, 1)

# add a channel dimension and scale pixel values to [0, 1]
x_train = x_train.reshape(x_train.shape[0], x_train.shape[1], x_train.shape[2], 1)
x_train = x_train / 255.0
x_test = x_test.reshape(x_test.shape[0], x_test.shape[1], x_test.shape[2], 1)
x_test = x_test / 255.0

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(name='FlattenLayer', input_shape=input_shape),
  tf.keras.layers.Dense(128, name='Dense128', activation='relu'),
  tf.keras.layers.Dense(10, name='Dense10')   # outputs logits; softmax is applied in the loss
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

history = model.fit(
    x_train,
    y_train,
    batch_size=32,
    epochs=6,
)
print(x_train.shape)  #training inputs

(60000, 28, 28, 1)

print(x_train.shape[1]*x_train.shape[2])  #one flattened training input

784

layer = model.get_layer('Dense128')
print(layer.weights)

[<tf.Variable 'Dense128/kernel:0' shape=(784, 128) dtype=float32, numpy=
array([[ 0.06696945,  0.07679319,  0.06742486, ..., -0.01404408,
         0.07030296, -0.06523466],
       [-0.07856862, -0.02751825,  0.01609654, ...,  0.05214984,
         0.07474259, -0.00408177],
       [ 0.060228  ,  0.00990647, -0.0353794 , ..., -0.01840374,
         0.06142697,  0.01504412],
       ...,
       [-0.07450459, -0.05087827,  0.03410669, ...,  0.04347721,
         0.0032033 ,  0.06776185],
       [-0.04982505, -0.01608631,  0.07413752, ...,  0.02166978,
        -0.02083131, -0.05235609],
       [-0.01907828,  0.05915994,  0.07626631, ...,  0.07750984,
         0.03713375, -0.06325782]], dtype=float32)>,
 <tf.Variable 'Dense128/bias:0' shape=(128,) dtype=float32, numpy=
array([ 0.09507202,  0.0825828 ,  0.06340718,  0.01609202,  0.21900468,
       -0.06261531, -0.01747248, -0.01654577, -0.1401379 , -0.17549014,
       -0.01728175,  0.07885362,  0.19574201, -0.07354902,  0.13560148,
       -0.08775545, -0.01657506, -0.04262279, -0.1097747 ,  0.0043903 ,
        0.10178334,  0.08393242, -0.0524591 ,  0.15432979,  0.24308488,
       -0.02623043,  0.0037841 ,  0.00517588, -0.00893999,  0.04289384,
       ...,
       -0.05562854,  0.03767706,  0.06388604, -0.09254396,  0.06858032,
        0.14181091,  0.08502957, -0.02626839, -0.01771167, -0.08993658,
        0.02208419,  0.11501006,  0.07171755], dtype=float32)>]

Notice the shape of the W in the first Dense layer is (784,128), and the shape of the b in the first Dense layer is (128,) (1D)

layer = model.get_layer('Dense10')
print(layer.weights)

[<tf.Variable 'Dense10/kernel:0' shape=(128, 10) dtype=float32, numpy=
array([[ 0.0404684 , -0.0282199 ,  0.08507891, ..., -0.03951922,
        -0.195442  , -0.45420387],
       [ 0.04301757, -0.78330183, -0.30019557, ..., -0.21572816,
         0.2496909 , -0.05051493],
       [ 0.05784213, -0.3508402 , -0.12698603, ..., -0.63199383,
         0.09412976, -0.1580189 ],
       ...,
       [-0.31220144,  0.25271496,  0.16391952, ...,  0.16664003,
        -0.18537991, -0.21527106],
       [-0.16806315, -0.12554409, -0.01407977, ..., -0.4826628 ,
         0.14121683, -0.14735046],
       [ 0.1426654 , -0.12658076, -0.7396209 , ..., -0.09768049,
         0.15803745,  0.12935531]], dtype=float32)>,
 <tf.Variable 'Dense10/bias:0' shape=(10,) dtype=float32, numpy=
array([-0.15701889, -0.08623032,  0.0506925 , -0.06654423,  0.03800333,
        0.10811225, -0.04681133, -0.12554045,  0.21290296, -0.04749312],
      dtype=float32)>]

In the last layer the shape of W is (128,10) while the shape of b is (10,) (again, 1D).

The shapes of W and b for the successive layers have to flow together from the input shape through to the output shape. This is because we have defined each step as matrix multiplication, where the input and the first hidden layer require coherent shapes, as do the output of the first hidden layer and the input of the second hidden layer, and the output of the last layer with the desired overall output: (60000, 784) × (784, 128) → (60000, 128), then (60000, 128) × (128, 10) → (60000, 10).

Notice that I printed out W and b from each layer only after the training had completed. But with some elbow grease, you could in fact collect those values during each training iteration and watch them evolve. Would be a good TensorFlow brain teaser :slightly_smiling_face:
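For example, here is a small sketch of such a callback (hypothetical; it snapshots one layer's weights at the end of every epoch rather than every batch, to keep it light):

import tensorflow as tf

class WeightTracker(tf.keras.callbacks.Callback):
    """Snapshot a layer's W and b at the end of every epoch."""
    def __init__(self, layer_name):
        super().__init__()
        self.layer_name = layer_name
        self.history = []

    def on_epoch_end(self, epoch, logs=None):
        w, b = self.model.get_layer(self.layer_name).get_weights()
        self.history.append((w.copy(), b.copy()))

# Usage with the model above:
# tracker = WeightTracker('Dense128')
# model.fit(x_train, y_train, batch_size=32, epochs=6, callbacks=[tracker])
# Compare tracker.history[0][0] with tracker.history[-1][0] to watch W evolve.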

Let us know?

couple of additional thoughts:

  • By default, Keras prints the layer’s W matrix as kernel instead of weights even though the variable’s name in the Layer class is layer.weights. :man_shrugging:

  • This is a simple, but real model. You can run it and get around 99% accuracy. Probably don’t need to use the entire 60K for training, but I was lazy and didn’t split it out.

  • The little model you typed in had 1 unit in the last Dense layer. Since this question originated with digit recognition, that would likely need to be 10, right?

So I get this about the number of parameters at each layer. I haven't looked at the code yet, but what I was trying to understand is: after the training is done, how are the matrices W and b used during inference? Since during training we are trying to find W and b, when doing inference, when we are trying to predict, what size of W and b is considered? So basically, I'm trying to understand what happens during inference.

At inference you run the entire network. The sizes of W and b at each layer are the same during inference as they were during training. That's the point of the training, right? To learn the values of the W and b at each layer.
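Concretely, reusing the model, x_test, and y_test from the MNIST code above, inference is one call that runs the full forward pass through every layer (a sketch):

import numpy as np
import tensorflow as tf

logits = model.predict(x_test[:1])           # forward pass: Flatten -> Dense128 -> Dense10
probs = tf.nn.softmax(logits).numpy()        # the model outputs logits, so apply softmax here
print(np.argmax(probs, axis=1), y_test[:1])  # predicted digit vs. true label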

Ohh ok, that clarifies it. I was assuming that at inference you just take the values of W and b and find the prediction somehow without running through the entire network.

So complex models with multiple layers would increase the inference times, correct?

The neural network is a set of operations that mathematically transforms the input(s) into the output(s). The more complex the network, the more parameters, the more computations to perform, and the more expensive both the training and the inference become. In application design there is a concept of 'minimum viable capability', which means don't build features the customer doesn't need. The same holds for neural net design: add enough model architecture complexity to achieve the needed accuracy, but no more.