What is the Cost Function for Softmax?

So a batch is one set of training examples from 1…m?

It can be a full batch of training examples or a mini-batch (a subset of the dataset) after which you apply the optimization step.
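
As a minimal sketch of the difference (the dataset and model here are made up purely for illustration, not taken from the course): with m = 1000 examples, a full batch performs one optimization step per epoch, while mini-batches of 32 perform 32 steps per epoch.

>>> import tensorflow as tf
>>> X = tf.random.normal((1000, 4))    # hypothetical dataset with m = 1000 examples
>>> Y = tf.random.normal((1000, 1))    # hypothetical targets
>>> model = tf.keras.models.Sequential([tf.keras.layers.Dense(1)])
>>> model.compile(optimizer='sgd', loss='mse')
>>> model.fit(X, Y, batch_size=1000, epochs=5)  # full batch: 1 optimization step per epoch
>>> model.fit(X, Y, batch_size=32, epochs=5)    # mini-batches: 32 optimization steps per epoch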

All I really want to know is: what are the mathematical expressions for the loss function and for the cost function when using the softmax activation function? Andrew made a slip in the lecture and referred to the loss function as the cost function.

I re-watched that lecture and agree with you - there was some inconsistency in the terminology used. You can find mathematical expressions for the loss and cost functions in my replies.
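
For reference, here is a recap of the standard expressions. With N classes and softmax outputs a_j = \frac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}}, the loss for a single training example with integer label y^{(i)} is the (sparse) categorical cross-entropy

L^{(i)} = -\log a^{(i)}_{y^{(i)}},

and the cost is the average of the loss over all m training examples,

J(W, b) = \frac{1}{m} \sum_{i=1}^{m} L^{(i)} = -\frac{1}{m} \sum_{i=1}^{m} \log a^{(i)}_{y^{(i)}}.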

What do you mean by optimization step? Do you mean gradient descent to optimise the W_{j} and b parameters so that the cost function evaluates to a low value?

Thanks, I now understand the loss and cost functions for the softmax activation function.

I think, to be more precise, the loss function should be defined as a function of the variables z_k for k = 1 to m, as well as Y and i.

Yes, by optimization step, I mean using an algorithm like gradient descent to update the parameters in order to minimize the cost function. This tutorial provides more details.
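
To make the optimization step concrete, here is a minimal sketch of a single gradient descent update in TensorFlow (the data, model, and loss below are made up purely for illustration, not code from the course):

>>> import tensorflow as tf
>>> x = tf.random.normal((8, 4))                   # a mini-batch of 8 examples with 4 features
>>> y = tf.random.normal((8, 1))                   # corresponding targets
>>> w = tf.Variable(tf.random.normal((4, 1)))      # parameters to be optimized
>>> b = tf.Variable(tf.zeros(1))
>>> optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
>>> with tf.GradientTape() as tape:
...     loss = tf.reduce_mean((tf.matmul(x, w) + b - y) ** 2)   # mean squared error cost
...
>>> grads = tape.gradient(loss, [w, b])            # gradients of the cost w.r.t. w and b
>>> optimizer.apply_gradients(zip(grads, [w, b]))  # one gradient descent (optimization) step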

Do you mean z_k, 1 \le k \le N, k\in {\mathbb Z}, where N is the number of classes?

Yes, that is the correct mathematical notation. But it is also a function of the tensor Y and the index i, where 1 \le i \le m and i, m \in {\mathbb N}.

Please also note that the class labels y^{(i)} are integers. Therefore, Y is a vector (of course it is possible to consider it as a tensor). It is also possible to convert its elements into one-hot vectors, but in this case the equation for the loss function would be slightly different.
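
As a small illustration of that difference in TF (a sketch with made-up numbers, not code from the course), both loss classes compute the same cross-entropy; they just expect the labels in different formats:

>>> import tensorflow as tf
>>> z = tf.constant([[2.0, 1.0, 0.1]])          # logits for one example, N = 3 classes
>>> y_int = tf.constant([0])                    # label as an integer class index
>>> y_hot = tf.constant([[1.0, 0.0, 0.0]])      # the same label as a one-hot vector
>>> scce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
>>> cce = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
>>> # Both return -log(softmax(z)[0]), roughly 0.417 for these numbers
>>> scce(y_int, z), cce(y_hot, z)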

@conscell Yes, I agree.

Do you know how TF determines the different W_j and \text b parameters for each of the units in a layer of, say, 25 units, when the same set of input training examples is available to every unit? Does TF choose to set some of a unit’s W_j parameters to zero, to exclude certain feature values of a given training example when determining that unit’s W_j weight parameters?

Let’s see the details of how it works. First of all, we define a layer of 25 units:

>>> import tensorflow as tf
>>> dense = tf.keras.layers.Dense(units=25)

At this point, its weights (in the TF documentation they are called the kernel) and bias are not initialized:

>>> dense.get_weights()
[]

Now we create an input which contains 3 training examples and 4 features and pass it through this layer:

>>> x = tf.constant([[1., 2., 3., 4.], [5., 6., 7., 8.], [9., 10., 11., 12.]])
>>> x.shape
TensorShape([3, 4])
>>> z = dense(x)
>>> z.shape
TensorShape([3, 25])
>>> dense.get_weights()
[array([[...]], dtype=float32), array([...], dtype=float32)]
>>> dense.get_weights()[0].shape
(4, 25)
>>> dense.get_weights()[1].shape
(25,)

As you can see, the output has a shape of (3, 25), and the layer’s kernel and bias were created with shapes of (4, 25) and (25,) respectively. A Dense layer computes the dot product between the input x and the weights W (the kernel) along the last axis of the input and axis 0 of the kernel, then adds the bias to every row: z = xW + [b, \dots, b]^\top (in TF this addition happens by broadcasting). For example, if the input has dimensions (batch_size, d0), then the layer creates a kernel with shape (d0, units), and the kernel operates along axis 1 of the input.

More precisely, the layer implements the operation output = activation(dot(input, kernel) + bias), where activation is the element-wise activation function passed as the activation argument, kernel is the weights matrix created by the layer, and bias is the bias vector created by the layer. If you don’t specify activation, no activation is applied (i.e. “linear” activation: a(x) = x).
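
We can verify this relation directly by continuing the same session and recomputing z from the kernel and bias (just a quick sanity check):

>>> W, b = dense.get_weights()
>>> z_manual = tf.matmul(x, W) + b       # b is broadcast and added to every row of xW
>>> tf.reduce_max(tf.abs(z - z_manual))  # should be zero (up to floating-point error)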

Weights are randomly initialized, typically using methods like Glorot/Xavier or He initialization, and the bias vector is initialized with zeros. Each column in the weight matrix W represents the weight vector of one unit in the layer. Each weight vector evolves independently during training: the gradients are computed from that unit’s output and its contribution to the total loss. TensorFlow does not force or choose certain weights to be zero in order to exclude specific features. However, during training the optimizer may drive some weights close to zero if it finds that those features aren’t useful for minimizing the loss.
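
For completeness, these defaults can also be spelled out explicitly when creating the layer (just an illustration; GlorotUniform and Zeros are the Keras defaults for Dense):

>>> layer = tf.keras.layers.Dense(
...     units=25,
...     kernel_initializer=tf.keras.initializers.GlorotUniform(),  # default kernel initializer
...     bias_initializer=tf.keras.initializers.Zeros())            # default bias initializer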

The following example demonstrates how a simple linear regression model in TensorFlow learns to assign meaningful weights to input features based on their relevance to the target output.

>>> import tensorflow as tf
>>> tf.keras.utils.set_random_seed(42)  # For reproducibility we set random seed to 42

We create an input tensor x of shape (10, 2), where the first column is a sequence of integers from 0 to 9 (a strong signal), and the second column is random noise (noisy/uninformative feature).

>>> x = tf.keras.ops.hstack([tf.reshape(tf.range(10), (10,1)), tf.random.normal((10,1))])
>>> x
<tf.Tensor: shape=(10, 2), dtype=float32, numpy=
array([[ 0.        ,  0.3274685 ],
       [ 1.        , -0.8426258 ],
       [ 2.        ,  0.3194337 ],
       [ 3.        , -1.4075519 ],
       [ 4.        , -2.3880599 ],
       [ 5.        , -1.0392479 ],
       [ 6.        , -0.5573232 ],
       [ 7.        ,  0.539707  ],
       [ 8.        ,  1.6994323 ],
       [ 9.        ,  0.28893656]], dtype=float32)>

A target output y is computed from the first column using the rule y = 3 x_1 + 1. This makes only the first feature relevant to predicting y.

>>> y = x[:, 0] * 3 + 1
>>> y
<tf.Tensor: shape=(10,), dtype=float32, numpy=array([ 1.,  4.,  7., 10., 13., 16., 19., 22., 25., 28.], dtype=float32)>

A simple linear model (Dense(1)) is created with 2 weights (one for each feature) and a bias. Model weights are initialized randomly.

>>> model = tf.keras.models.Sequential([tf.keras.layers.Dense(1)])
>>> model(x)
<tf.Tensor: shape=(10, 1), dtype=float32, numpy=
array([[0.15062995],
       [0.6255182 ],
       [2.1731575 ],
       [2.3918853 ],
       [2.95398   ],
       [4.587522  ],
       [5.8223114 ],
       [7.3400383 ],
       [8.886603  ],
       [9.250912  ]], dtype=float32)>
>>> model.get_weights()
[array([[1.0131117],
       [0.459983 ]], dtype=float32), array([0.], dtype=float32)]

The model is compiled with stochastic gradient descent (SGD) and mean squared error (MSE) loss.

>>> model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=3e-3), loss=tf.keras.losses.MeanSquaredError())

The model is trained for 3000 epochs to minimize the loss between its predictions and the true values of y.

>>> model.fit(x, y, epochs=3000)
...
Epoch 3000/3000
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step - loss: 2.4529e-05

After training, the model learns a weight close to 3.0 for the first (relevant) feature, a weight near 0 for the second (irrelevant/noisy) feature, and a bias close to 1.0.

>>> model(x)
<tf.Tensor: shape=(10, 1), dtype=float32, numpy=
array([[ 0.98906773],
       [ 3.9935114 ],
       [ 6.992269  ],
       [ 9.998071  ],
       [13.002052  ],
       [16.000353  ],
       [19.000769  ],
       [21.999685  ],
       [24.99845   ],
       [28.003479  ]], dtype=float32)>
>>> model.get_weights()
[array([[ 3.0015907e+00],
       [-2.4382621e-03]], dtype=float32), array([0.9898662], dtype=float32)]

However, this is not always guaranteed. If we initialize the second column with ones (tf.ones((10,1))), then the second feature is constant and therefore redundant with the bias (intercept) term, and the model may distribute the learned contribution between that feature and the bias in arbitrary ways. For example, it may assign part of the weight to the constant feature and adjust the bias accordingly, because multiple weight-bias combinations produce exactly the same outputs.

>>> model.get_weights()
[array([[2.9999967 ],
       [0.73000354]], dtype=float32), array([0.2700143], dtype=float32)]
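
Note that the constant feature and the bias still combine to the true intercept: 0.73000354 \cdot 1 + 0.2700143 \approx 1.00, while the weight on the first feature stays at about 3.0, so the model’s predictions are essentially unchanged even though the individual parameters differ.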

Thanks for this.

I will read over it and digest it carefully.

@ai_is_cool,

I also highly encourage you to try some extra experiments. What if the noise vector doesn’t have zero mean (tf.random.normal((10,1), mean=10)) but keeps the same standard deviation of 1? Can the model converge faster if trained with smaller batches (model.fit(x, y, batch_size=2, epochs=3000))? What happens if x_2 = x_1? Is it possible to eliminate constant features (tf.ones((10,1))), or noisy features that oscillate close to a constant value (tf.random.normal((10,1), mean=1, stddev=0.01))? A sketch of how the first setup could be built is shown below.
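
Only the second column changes; everything else in the example above stays the same:

>>> x2 = tf.random.normal((10, 1), mean=10.0)   # noise with mean 10 and stddev 1
>>> x = tf.keras.ops.hstack([tf.reshape(tf.range(10), (10, 1)), x2])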

What is the variable d0 and what value does it hold?

d0 is the number of input features. In the example above, x has shape (3, 4), so d0 = 4 and the kernel has shape (d0, units) = (4, 25).

Is d0 standard nomenclature in neural network terminology for the number of features? Andrew hasn’t mentioned it so far in his video presentations.

It is from the TF documentation.

Since I am not familiar with the TF documentation, could you perhaps explain what the TF variables mean as you go along? Andrew does not describe them in his Advanced Learning Algorithms course, at least as far as I have gotten in Week 2 so far.