What is the Cost Function for Softmax?

So a batch is one set of training examples from 1…m?

It can be a full batch of training examples or a mini-batch (a subset of the dataset) after which you apply the optimization step.
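
As a minimal sketch of the difference (the dataset and model here are made up purely for illustration, not taken from the course): with m = 1000 examples, a full batch performs one optimization step per epoch, while mini-batches of 32 perform 32 steps per epoch.

>>> import tensorflow as tf
>>> X = tf.random.normal((1000, 4))    # hypothetical dataset with m = 1000 examples
>>> Y = tf.random.normal((1000, 1))    # hypothetical targets
>>> model = tf.keras.models.Sequential([tf.keras.layers.Dense(1)])
>>> model.compile(optimizer='sgd', loss='mse')
>>> model.fit(X, Y, batch_size=1000, epochs=5)  # full batch: 1 optimization step per epoch
>>> model.fit(X, Y, batch_size=32, epochs=5)    # mini-batches: 32 optimization steps per epoch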

All I really want to know is: what are the mathematical expressions for the loss function and for the cost function when using the softmax activation function? Andrew made a slip in the lecture and referred to the loss function as the cost function.

I re-watched that lecture and agree with you - there was some inconsistency in the terminology used. You can find mathematical expressions for the loss and cost functions in my replies.
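
For reference, here is a recap of the standard expressions. With N classes and softmax outputs a_j = \frac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}}, the loss for a single training example with integer label y^{(i)} is the (sparse) categorical cross-entropy

L^{(i)} = -\log a^{(i)}_{y^{(i)}},

and the cost is the average of the loss over all m training examples,

J(W, b) = \frac{1}{m} \sum_{i=1}^{m} L^{(i)} = -\frac{1}{m} \sum_{i=1}^{m} \log a^{(i)}_{y^{(i)}}.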

What do you mean by optimization step? Do you mean gradient descent to optimise the W_{j} and b parameters so that the cost function evaluates to a low value?

Thanks, I now understand the loss and cost functions for the softmax activation function.

I think, to be more precise, the loss function should be defined as a function of the variables z_k for k = 1 to m, as well as Y and i.

Yes, by optimization step, I mean using an algorithm like gradient descent to update the parameters in order to minimize the cost function. This tutorial provides more details.
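
To make the optimization step concrete, here is a minimal sketch of a single gradient descent update in TensorFlow (the data, model, and loss below are made up purely for illustration, not code from the course):

>>> import tensorflow as tf
>>> x = tf.random.normal((8, 4))                   # a mini-batch of 8 examples with 4 features
>>> y = tf.random.normal((8, 1))                   # corresponding targets
>>> w = tf.Variable(tf.random.normal((4, 1)))      # parameters to be optimized
>>> b = tf.Variable(tf.zeros(1))
>>> optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
>>> with tf.GradientTape() as tape:
...     loss = tf.reduce_mean((tf.matmul(x, w) + b - y) ** 2)   # mean squared error cost
...
>>> grads = tape.gradient(loss, [w, b])            # gradients of the cost w.r.t. w and b
>>> optimizer.apply_gradients(zip(grads, [w, b]))  # one gradient descent (optimization) step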

Do you mean z_k, 1 \le k \le N, k\in {\mathbb Z}, where N is the number of classes?

Yes, that is the correct mathematical notation. But it is also a function of the tensor Y and the index i, where 1 \le i \le m and i, m \in {\mathbb N}.

Please also note that the class labels y^{(i)} are integers. Therefore, Y is a vector (of course it is possible to consider it as a tensor). It is also possible to convert its elements into one-hot vectors, but in this case the equation for the loss function would be slightly different.
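
As a small illustration of that difference in TF (a sketch with made-up numbers, not code from the course), both loss classes compute the same cross-entropy; they just expect the labels in different formats:

>>> import tensorflow as tf
>>> z = tf.constant([[2.0, 1.0, 0.1]])          # logits for one example, N = 3 classes
>>> y_int = tf.constant([0])                    # label as an integer class index
>>> y_hot = tf.constant([[1.0, 0.0, 0.0]])      # the same label as a one-hot vector
>>> scce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
>>> cce = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
>>> # Both return -log(softmax(z)[0]), roughly 0.417 for these numbers
>>> scce(y_int, z), cce(y_hot, z)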

@conscell Yes, I agree.

Do you know how TF determines the different W_j and \text b parameters for each of the units in a layer of, say, 25 units, when the same set of input training examples is available to every unit? Does TF choose to set some of a unit’s W_j parameters to zero, to exclude certain feature values of a given training example when determining that unit’s W_j weight parameters?

Let’s see the details of how it works. First of all, we define a layer of 25 units:

>>> import tensorflow as tf
>>> dense = tf.keras.layers.Dense(units=25)

At this point, its weights (in the TF documentation they are called the kernel) and bias are not initialized:

>>> dense.get_weights()
[]

Now we create an input which contains 3 training examples and 4 features and pass it through this layer:

>>> x = tf.constant([[1., 2., 3., 4.], [5., 6., 7., 8.], [9., 10., 11., 12.]])
>>> x.shape
TensorShape([3, 4])
>>> z = dense(x)
>>> z.shape
TensorShape([3, 25])
>>> dense.get_weights()
[array([[...]], dtype=float32), array([...], dtype=float32)]
>>> dense.get_weights()[0].shape
(4, 25)
>>> dense.get_weights()[1].shape
(25,)

As you can see, the output has a shape of (3, 25), and the layer’s kernel and bias were created with shapes of (4, 25) and (25,) respectively. A Dense layer computes the dot product between the input x and the weights W (the kernel) along the last axis of the input and axis 0 of the kernel, then adds the bias to every row: z = xW + [b, \dots, b]^\top (in TF this addition happens by broadcasting). For example, if the input has dimensions (batch_size, d0), then the layer creates a kernel with shape (d0, units), and the kernel operates along axis 1 of the input.

More precisely, the layer implements the operation output = activation(dot(input, kernel) + bias), where activation is the element-wise activation function passed as the activation argument, kernel is the weights matrix created by the layer, and bias is the bias vector created by the layer. If you don’t specify activation, no activation is applied (i.e. “linear” activation: a(x) = x).
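
We can verify this relation directly by continuing the same session and recomputing z from the kernel and bias (just a quick sanity check):

>>> W, b = dense.get_weights()
>>> z_manual = tf.matmul(x, W) + b       # b is broadcast and added to every row of xW
>>> tf.reduce_max(tf.abs(z - z_manual))  # should be zero (up to floating-point error)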

Weights are randomly initialized, typically using methods like Glorot/Xavier or He initialization, and the bias vector is initialized with zeros. Each column in the weight matrix W represents the weight vector of one unit in the layer. Each weight vector evolves independently during training: the gradients are computed from that unit’s output and its contribution to the total loss. TensorFlow does not force or choose certain weights to be zero in order to exclude specific features. However, during training the optimizer may drive some weights close to zero if it finds that those features aren’t useful for minimizing the loss.
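
For completeness, these defaults can also be spelled out explicitly when creating the layer (just an illustration; GlorotUniform and Zeros are the Keras defaults for Dense):

>>> layer = tf.keras.layers.Dense(
...     units=25,
...     kernel_initializer=tf.keras.initializers.GlorotUniform(),  # default kernel initializer
...     bias_initializer=tf.keras.initializers.Zeros())            # default bias initializer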

The following example demonstrates how a simple linear regression model in TensorFlow learns to assign meaningful weights to input features based on their relevance to the target output.

>>> import tensorflow as tf
>>> tf.keras.utils.set_random_seed(42)  # For reproducibility we set random seed to 42

We create an input tensor x of shape (10, 2), where the first column is a sequence of integers from 0 to 9 (a strong signal), and the second column is random noise (noisy/uninformative feature).

>>> x = tf.keras.ops.hstack([tf.reshape(tf.range(10), (10,1)), tf.random.normal((10,1))])
>>> x
<tf.Tensor: shape=(10, 2), dtype=float32, numpy=
array([[ 0.        ,  0.3274685 ],
       [ 1.        , -0.8426258 ],
       [ 2.        ,  0.3194337 ],
       [ 3.        , -1.4075519 ],
       [ 4.        , -2.3880599 ],
       [ 5.        , -1.0392479 ],
       [ 6.        , -0.5573232 ],
       [ 7.        ,  0.539707  ],
       [ 8.        ,  1.6994323 ],
       [ 9.        ,  0.28893656]], dtype=float32)>

A target output y is computed from the first column using the rule y = 3 x_1 + 1. This makes only the first feature relevant to predicting y.

>>> y = x[:, 0] * 3 + 1
>>> y
<tf.Tensor: shape=(10,), dtype=float32, numpy=array([ 1.,  4.,  7., 10., 13., 16., 19., 22., 25., 28.], dtype=float32)>

A simple linear model (Dense(1)) is created with 2 weights (one for each feature) and a bias. Model weights are initialized randomly.

>>> model = tf.keras.models.Sequential([tf.keras.layers.Dense(1)])
>>> model(x)
<tf.Tensor: shape=(10, 1), dtype=float32, numpy=
array([[0.15062995],
       [0.6255182 ],
       [2.1731575 ],
       [2.3918853 ],
       [2.95398   ],
       [4.587522  ],
       [5.8223114 ],
       [7.3400383 ],
       [8.886603  ],
       [9.250912  ]], dtype=float32)>
>>> model.get_weights()
[array([[1.0131117],
       [0.459983 ]], dtype=float32), array([0.], dtype=float32)]

The model is compiled with stochastic gradient descent (SGD) and mean squared error (MSE) loss.

>>> model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=3e-3), loss=tf.keras.losses.MeanSquaredError())

The model is trained for 3000 epochs to minimize the loss between its predictions and the true values of y.

>>> model.fit(x, y, epochs=3000)
...
Epoch 3000/3000
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step - loss: 2.4529e-05

After training, the model learns a weight close to 3.0 for the first (relevant) feature, a weight near 0 for the second (irrelevant/noisy) feature, and a bias close to 1.0.

>>> model(x)
<tf.Tensor: shape=(10, 1), dtype=float32, numpy=
array([[ 0.98906773],
       [ 3.9935114 ],
       [ 6.992269  ],
       [ 9.998071  ],
       [13.002052  ],
       [16.000353  ],
       [19.000769  ],
       [21.999685  ],
       [24.99845   ],
       [28.003479  ]], dtype=float32)>
>>> model.get_weights()
[array([[ 3.0015907e+00],
       [-2.4382621e-03]], dtype=float32), array([0.9898662], dtype=float32)]

However, this is not always guaranteed. If we initialize the second column with ones (tf.ones((10,1))), then the second feature is constant and therefore redundant with the bias (intercept) term, and the model may distribute the learned contribution between that feature and the bias in arbitrary ways. For example, it may assign part of the weight to the constant feature and adjust the bias accordingly, because multiple weight-bias combinations produce exactly the same outputs.

>>> model.get_weights()
[array([[2.9999967 ],
       [0.73000354]], dtype=float32), array([0.2700143], dtype=float32)]
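
Note that the constant feature and the bias still combine to the true intercept: 0.73000354 \cdot 1 + 0.2700143 \approx 1.00, while the weight on the first feature stays at about 3.0, so the model’s predictions are essentially unchanged even though the individual parameters differ.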

Thanks for this.

I will read over it and digest it carefully.

@ai_is_cool,

I also highly encourage you to try some extra experiments. What if the noise vector doesn’t have zero mean (tf.random.normal((10,1), mean=10)) but keeps the same standard deviation of 1? Can the model converge faster if trained with smaller batches (model.fit(x, y, batch_size=2, epochs=3000))? What happens if x_2 = x_1? Is it possible to eliminate constant features (tf.ones((10,1))), or noisy features that oscillate close to a constant value (tf.random.normal((10,1), mean=1, stddev=0.01))? A sketch of how the first setup could be built is shown below.
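
Only the second column changes; everything else in the example above stays the same:

>>> x2 = tf.random.normal((10, 1), mean=10.0)   # noise with mean 10 and stddev 1
>>> x = tf.keras.ops.hstack([tf.reshape(tf.range(10), (10, 1)), x2])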

What is the variable d0 and what value does it hold?

d0 is the number of input features. In the example above, x has shape (3, 4), so d0 = 4 and the kernel has shape (d0, units) = (4, 25).

Is d0 standard nomenclature in neural network terminology for the number of features? Andrew hasn’t mentioned it so far in his video presentations.

It is from the TF documentation.

Since I am not familiar with the TF documentation, could you perhaps explain what the TF variables mean as you go along? Andrew does not describe them in his Advanced Learning Algorithms course, at least as far as I have gotten in Week 2 so far.