What is the Cost Function for Softmax?

Sure, no problem. Please make a list of variable names that are confusing.

It’s not that the variable names are confusing; I just don’t know what they are and what values they hold when they aren’t the variable names used by Andrew.

When you say axis 0 and axis 1, do you mean first and second dimensions of a numpy array?

How does TF optimize the weights for each unit when each unit has access to the same input training examples? Does TF use different algorithms for minimising the cost function for each unit and so arrive at different weights for that unit?

When you say axis 0 and axis 1, do you mean first and second dimensions of a numpy array?

Yes, that’s right: in TF / NumPy, axis 0 refers to the first dimension (rows in a 2D array), and axis 1 refers to the second dimension (columns in a 2D array).
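For instance, here is a quick NumPy sketch (the array values are just made up for illustration):

>>> import numpy as np
>>> a = np.array([[1, 2, 3],
...               [4, 5, 6]])  # shape (2, 3): 2 rows, 3 columns
>>> a.sum(axis=0)              # collapse axis 0 (rows): one sum per column
array([5, 7, 9])
>>> a.sum(axis=1)              # collapse axis 1 (columns): one sum per row
array([ 6, 15])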

How does TF optimize the weights for each unit when each unit has access to the same input training examples? Does TF use different algorithms for minimising the cost function for each unit and so arrive at different weights for that unit?

Excellent question!
In TensorFlow (and other ML frameworks), all units (neurons) in a layer see the same input, but they learn different weights because each unit has its own set of randomly initialized weights and bias. Even though they receive the same input, they multiply it by different parameters, so their outputs differ, and so do their gradients during backpropagation. A single optimization algorithm (such as SGD, Adam, etc.) is used for the whole neural network. TensorFlow applies that same algorithm to update all parameters, but since each unit’s weights contribute differently to the network’s output and to the cost function, their gradients are different, and thus each unit’s weights are updated differently.


Let’s implement a linear layer manually and check how it works. We use GradientTape.gradient(target, sources) to calculate the gradient of the loss with respect to the parameters. Then we update the parameters.

>>> import tensorflow as tf
>>> tf.keras.utils.set_random_seed(42)
>>> w = tf.Variable(tf.random.normal((3, 2)), name='weights')
>>> w
<tf.Variable 'weights:0' shape=(3, 2) dtype=float32, numpy=
array([[ 0.3274685, -0.8426258],
       [ 0.3194337, -1.4075519],
       [-2.3880599, -1.0392479]], dtype=float32)>
>>> b = tf.Variable(tf.zeros(2, dtype=tf.float32), name='bias')
>>> b
<tf.Variable 'bias:0' shape=(2,) dtype=float32, numpy=array([0., 0.], dtype=float32)>
>>> x = tf.constant([[1., 2., 3.]])
>>> y = tf.constant([-1., 1.])
>>> with tf.GradientTape(persistent=True) as tape:
...   output = x @ w + b
...   loss = tf.reduce_mean((y - output)**2)
>>> [dl_dw, dl_db] = tape.gradient(loss, [w, b])
>>> dl_dw
<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
array([[ -5.197844,  -7.775473],
       [-10.395688, -15.550946],
       [-15.593533, -23.32642 ]], dtype=float32)>
>>> dl_db
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([-5.197844, -7.775473], dtype=float32)>
>>> w = w - 0.001 * dl_dw
>>> w
<tf.Tensor: shape=(3, 2), dtype=float32, numpy=
array([[ 0.33266637, -0.8348503 ],
       [ 0.32982937, -1.3920009 ],
       [-2.3724663 , -1.0159215 ]], dtype=float32)>
>>> b = b - 0.001 * dl_db
>>> b
<tf.Tensor: shape=(2,), dtype=float32, numpy=array([0.00519784, 0.00777547], dtype=float32)>
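One note on the last two steps: writing w = w - 0.001 * dl_dw rebinds w to a plain tf.Tensor, as the printed output shows. If you want w and b to stay tf.Variables, an in-place update (applied in place of the rebinding above) would look roughly like this, either directly or through a built-in optimizer (REPL outputs omitted):

>>> w.assign_sub(0.001 * dl_dw)   # in-place update; w remains a tf.Variable
>>> b.assign_sub(0.001 * dl_db)
>>> # or, as an alternative, let an optimizer apply the gradients:
>>> opt = tf.keras.optimizers.SGD(learning_rate=0.001)
>>> opt.apply_gradients(zip([dl_dw, dl_db], [w, b]))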

So, as TF starts parameter optimisation with random values for its weight and bias parameters and uses the same optimisation algorithm to minimise the cost function for each unit, does that mean that some units will end up with weight and bias parameters that are more optimised than those of other units, because their starting values might by chance be closer to the optimal ones?

In other words, some randomly assigned starting values for the weight and bias parameters may need fewer iterations than others to reach the values that minimise the cost function.

Also, why do we need more than one unit per layer? Can’t TF just compute one unit’s weight and bias parameters for minimum cost function evaluation using a chosen optimisation algorithm like gradient descent? Why might we need, say, 25 units in an input layer?

So, as TF starts parameter optimisation with random values for its weight and bias parameters and uses the same optimisation algorithm to minimise the cost function for each unit, does that mean that some units will end up with weight and bias parameters that are more optimised than those of other units, because their starting values might by chance be closer to the optimal ones?

As I mentioned in my previous post, TF and other frameworks initialize weights randomly, typically using strategies like Glorot/Xavier or He initialization to ensure good starting conditions. However, even with such methods, the exact values are still random, so some units will start closer to their optimal values, especially early in training. As a result, they may converge faster, while others take longer or even get stuck in less optimal regions (depending on the optimizer, learning rate, etc.). Therefore, two identically structured models trained with different seeds might not perform identically.
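For illustration, here is a minimal sketch of how you might choose the initializer and fix its seed for a Keras Dense layer (the layer size and seed here are arbitrary; by default, Dense uses Glorot uniform initialization):

>>> from tensorflow import keras
>>> layer = keras.layers.Dense(
...     units=25,
...     activation='relu',
...     kernel_initializer=keras.initializers.HeNormal(seed=42),  # He initialization with a fixed seed
...     bias_initializer='zeros')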

Also, why do we need more than one unit per layer? Can’t TF just compute one unit’s weight and bias parameters for minimum cost function evaluation using a chosen optimisation algorithm like gradient descent? Why might we need, say, 25 units in an input layer?

Multiple units allow the model to learn diverse features of the input data. Say you’re feeding in pixel data from an image: one unit might learn to detect vertical edges, another might learn horizontal edges, another might be sensitive to brightness, and so on. If you had only one unit, the network would have very limited capacity; it would learn only one pattern, which is not enough for most tasks. For example, in a digit recognition task (MNIST), you can decrease the number of units and see how the results change. A rough sketch of that experiment is below.
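A rough, self-contained sketch of that experiment (the epoch count and layer sizes are just placeholder choices, not recommendations):

import tensorflow as tf

def build_model(hidden_units):
    # Same architecture each time; only the number of hidden units changes.
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(hidden_units, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),  # 10 digit classes
    ])

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixel values to [0, 1]

for units in (1, 5, 25):  # compare models with different capacity
    model = build_model(units)
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',  # cross-entropy on the softmax output
                  metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=2, verbose=0)
    _, acc = model.evaluate(x_test, y_test, verbose=0)
    print(f'{units:>2} hidden units -> test accuracy {acc:.3f}')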