Softmax Regression - Constructing Custom Function

Hi All, I passed this course previously and am doing my second revision. For logistic regression in multi-class classification, I ran into a problem while trying to write my own function for Softmax Regression using the gradient descent method.

The following is my understanding of the softmax function and its loss/cost function:

Softmax Function:

a_j = \displaystyle \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}

Loss/Cost Function (categorical cross entropy, averaged over the m training examples):

J = -\displaystyle \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{K} y_j^{(i)} \log \left( a_j^{(i)} \right)

Gradient Descent:

W := W - \alpha \displaystyle \frac{\partial J}{\partial W}, \qquad b := b - \alpha \displaystyle \frac{\partial J}{\partial b}

I am able to compute the loss and cost and compare the results against PyTorch or TensorFlow. However, I ran into a problem when feeding a simple example with 2 features and 3 classes into the gradient descent function.
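For reference, a stripped-down version of the computation I am checking against the frameworks looks roughly like this (the function and variable names here are just for illustration):

import numpy as np

def softmax_rows(Z):
    # naive row-wise softmax over the class dimension
    expZ = np.exp(Z)
    return expZ / np.sum(expZ, axis=1, keepdims=True)

def cross_entropy_cost(P, y):
    # P: (m, K) predicted probabilities, y: (m,) integer class labels
    m = P.shape[0]
    return -np.mean(np.log(P[np.arange(m), y]))

Z = np.array([[1.2, 1.0, 2.0],
              [2.3, 1.6, 3.6]])   # logits for 2 examples, 3 classes
y = np.array([2, 1])              # true class for each example
print(cross_entropy_cost(softmax_rows(Z), y))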

Unfortunately, I could not find examples of a manual implementation on the web, so I turned to ChatGPT and other platforms.

The following is the partial code from Claude:

def fit(self, X, y):
    num_samples, num_features = X.shape
    self.num_classes = len(np.unique(y))

    # Initialize weights and bias
    self.weights = np.zeros((num_features, self.num_classes))
    self.bias = np.zeros((1, self.num_classes))

    # Convert y to one-hot encoded matrix
    y_onehot = np.eye(self.num_classes)[y]
    ......

What I don’t understand is the initialization of the weights and bias. According to ChatGPT, we need to formulate 3 linear equations, z0, z1, and z2. The weights and bias of each equation represent one class. Thus, we need to have the weights in the form of num_of_classes by num_of_features.

Now, I know what needs to be done, but I still don’t quite understand the intuition. Obviously, my understanding of the softmax function may not be complete.

I hope someone can enlighten us, although it is a bit beyond the course curriculum.

Thank You.

Why write 3 separate equations when you can express it all simply with matrix operations? You just have to keep track of how your data is arranged. Claude has assumed you are using the “samples first” orientation, so the X sample input matrix is num_samples x num_features. I don’t know about MLS, but that’s not how Prof Ng does it in DLS. There he uses num_features x num_samples. This is just a choice: Prof Ng is the boss, so he gets to choose. But if you’re writing your own code from scratch, you can be the boss and use the other orientation. Which is actually more common in general …

But Claude is being self-consistent, it appears. With that orientation of X, the W matrix needs to be num_features x num_classes and you’ll do the computation as:

Z = X \cdot W + b

That will give you a result that is num_samples x num_classes. Then you just apply softmax “row-wise” to that.
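In NumPy terms, a quick shape check of that (purely illustrative, with made-up dimensions) would be:

import numpy as np

m, n, K = 3, 2, 3                  # num_samples, num_features, num_classes
X = np.random.randn(m, n)          # "samples first" orientation
W = np.zeros((n, K))
b = np.zeros((1, K))

Z = X @ W + b                      # shape (m, K) = num_samples x num_classes
expZ = np.exp(Z)
A = expZ / np.sum(expZ, axis=1, keepdims=True)   # softmax applied row-wise

print(Z.shape, A.shape)            # (3, 3) (3, 3)
print(A.sum(axis=1))               # each row of A sums to 1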

Thank you Paul for replying. I understand the orientation part. What puzzled me about the formula Z = X \cdot W + b is that the output is num_of_samples by num_of_classes, as compared to the sigmoid case, where the output is num_of_samples by 1.

After some searching, my rudimentary understanding is that softmax computes the probability of each class for a single training example. For each training example, we should have a probability distribution that adds up to 1. Thus, our f(x) = softmax(Z) should contain the probability distribution over the classes for each training example.

I hope that my understanding is correct.

Cheers.

Your understanding of the sigmoid function is incorrect.
Sigmoid is applied element-wise to every element in Z.
So the size and shape of Z and sigmoid(Z) will be identical.
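For example (just to illustrate the shape point; the numbers are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # NumPy broadcasts this element-wise

Z = np.array([[2.1], [3.2], [5.3]])
print(Z.shape, sigmoid(Z).shape)      # (3, 1) (3, 1) -- identical shapes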

Thank you Tom for your reply. Please allow me to explain in more detail, and please pardon me if this post is lengthy, as it is very important for me to get the basic concepts right.

Logistic Regression (Binary Classification)
First, let’s say we have the following example with 2 features and 3 training examples:

X1 = np.array([[1, 3], [5, 4], [8, 9]])
y1 = np.array([0, 0, 1]).reshape(-1,1)

Therefore,

n = 2 # number of features
m = 3 # number of training examples 

X1 should be a 3 by 2 matrix, y1 is a 3 by 1 matrix.

Let us assume our current values for w and b are as follows:

# w_0 = 0.2
# w_1 = 0.3
w = np.array([0.2, 0.3]).reshape(-1,1)
b = 1

We reshape w into a 2 by 1 vector so that we can take the dot product with X1. Using the formula

f(x) = Xw + b

f(x) = z should be a 3 by 1 vector. That means one value for each training example.

fx = X1@w + b
fx
array([[2.1],
       [3.2],
       [5.3]])

As explained by Tom, we apply sigmoid(z) element-wise. Thus, the result has the same shape as z: a number between 0 and 1 for each training example.

In our example, we have the following after applying the sigmoid function:

yhat
array([[0.89090318],
       [0.96083428],
       [0.9950332 ]])

Then we apply the following formula to compute the gradient:

\displaystyle \frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)}, \qquad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)

which, per training example, is similar to

temp_dw = (yhat - y) * X1
temp_db = (yhat - y)

and so on.
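To average those per-example terms over the whole batch, the vectorized form I ended up with is roughly the following (a sketch continuing the variables above; alpha is a hypothetical learning rate):

m = X1.shape[0]                 # X1, w, b, y1 as defined earlier
z = X1 @ w + b
yhat = 1.0 / (1.0 + np.exp(-z))
dw = X1.T @ (yhat - y1) / m     # shape (2, 1): (yhat - y) * x averaged over the m examples
db = np.mean(yhat - y1)         # scalar
# gradient descent step, for some learning rate alpha:
# w = w - alpha * dw
# b = b - alpha * db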

I also used the same sigmoid approach for OvR (one-vs-rest) and achieved satisfactory results.

Let us use the same example but change the labels to multi-class:

X2 = np.array([[1, 3], [5,4], [8, 9]])
y2 = np.array([0, 1, 2]).reshape(-1,1)

My initial mistake was to apply what I had learned with sigmoid to softmax: I used the same 2 by 1 vector for w and a scalar for b.

Now, I will attempt to explain the Softmax function and try to fit the same linear formula into Softmax.

Softmax Regression (Multi-Class Classification)

The Softmax Function is as follows:

a_j = \displaystyle \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}

My interpretation of the softmax function is that it takes the exponential of the prediction for one class and divides it by the sum of the exponentials of the predictions for all classes.

For each training example, f(x) should contain a prediction for each class.

Therefore, f(x) should have a shape of 1 by num_of_classes for a single example.

Now, working backward, to achieve the result above we need to turn w into a matrix and b into a vector, with one column of w and one entry of b per class.

I also learned that regardless of whether we use Categorical Cross Entropy or Sparse Categorical Cross Entropy, f(x) has the same dimensions: num_of_examples by num_of_classes.
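A quick check of that point with made-up numbers (an illustrative sketch): categorical cross entropy takes one-hot labels while sparse categorical cross entropy takes integer labels, but both work off the same num_of_examples by num_of_classes matrix of probabilities and give the same value:

import numpy as np

P = np.array([[0.25, 0.20, 0.55],    # predicted probabilities, shape (m, K)
              [0.19, 0.10, 0.71]])
y_int = np.array([2, 0])             # sparse (integer) labels
Y_onehot = np.eye(3)[y_int]          # one-hot labels, shape (m, K)

cce  = -np.mean(np.sum(Y_onehot * np.log(P), axis=1))
scce = -np.mean(np.log(P[np.arange(len(y_int)), y_int]))
print(cce, scce)                     # identical values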

Using the same example, assuming we have the following:

# class 0: w_0 = 0.2, w_1 = 0.3, b = 0.1
# class 1: w_0 = 0.1, w_1 = 0.2, b = 0.3
# class 2: w_0 = 0.3, w_1 = 0.4, b = 0.5
W = np.array([[0.2, 0.1, 0.3],[0.3, 0.2, 0.4]])
b = np.array([[0.1,0.3,0.5]])

W should be a 2 by 3 matrix and b a 1 by 3 vector. Applying the linear formula:

fx = X2@W + b
fx
array([[1.2, 1. , 2. ],
       [2.3, 1.6, 3.6],
       [4.4, 2.9, 6.5]])

So each row is one training example and each column is the prediction for one class. Then we apply softmax:

soft1 = mySoftmax(fx)
soft1
array([[0.24726331, 0.20244208, 0.55029462],
       [0.19357779, 0.09612788, 0.71029433],
       [0.10650421, 0.0237643 , 0.86973149]])

If we sum up each row we should have 1 for each example.

np.sum(soft1, axis=1, keepdims=True)
array([[1.],
       [1.],
       [1.]])
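For reference, a minimal (numerically naive) mySoftmax that reproduces the numbers shown above would be:

import numpy as np

def mySoftmax(Z):
    # row-wise softmax: exponentiate, then normalise each row so it sums to 1
    expZ = np.exp(Z)
    return expZ / np.sum(expZ, axis=1, keepdims=True)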

I hope that my understanding of the softmax implementation is correct. I also learned that the way we compute the gradient is different when applying softmax in a neural network. However, that is another query for another day.

I must say that by creating my own gradient descent function from scratch, I am forced to develop a deeper understanding of the softmax implementation.

Thank you to those who had the patience to go through my example.

Cheers.

Apologies. I think I got the W matrix wrong for my screenshot. The following should be the correct one.

I haven’t read your incredibly detailed post yet, but just to answer this question:

Yes, as Tom points out, sigmoid is applied element-wise to Z. So the question is: what is the shape of Z? In logistic regression, which does a binary (yes/no) classification, you only need one bit of information as the output per input sample. So you design the weights to be a vector, and the dot product of a sample vector with the weight vector gives you a scalar value (plus the scalar bias, of course). Of course, I’m talking about the case of a single input sample. But if you have m samples, then X is m x n_x and Z is m x 1. Sigmoid converts each of those values to a value between 0 and 1, which we interpret as the model’s prediction of the probability that the sample is a “Yes”.

In the softmax case, you are making a multiclass classification, so the number of output neurons is the number of classes. If you have 4 classes (cat, dog, rabbit, none of the above), then you get a vector of 4 outputs for each input. Then you apply the softmax function to those values to get the probability distribution that represents the model’s prediction of which of the classes it recognizes for a given input.

It’s not any more complicated than that. It will probably be another hour or so before I can commit the time to read your whole treatise and see if there is more to say here.


‘X’ is the wrong variable name to use for a vector of the bias units.


Thank you Paul for confirming my understanding.

Thank you Tom for catching my mistake. The bias should be b; I will correct my notes.


Hi All, I have a final question regarding the softmax function. It is about numerical stability: some formulas use np.exp(x - max(x)) instead of np.exp(x). Is there any literature on this topic?

Not that I’ve seen.

If you google “numeric resolution softmax”, you’ll find several StackExchange threads, Medium articles, and other website articles.

Take a look at the shape of the curve for e^x: it gets very steep very fast as the values of x get larger than 0. The issue is both resolution and potential overflow. As always, if we were doing pure math here and using \mathbb{R}, then this would not matter. But unfortunately we don’t have that luxury and have to operate in the finite space of either 32 bit or 64 bit floating point. That means there are (best case) at most 2^{64} distinct numbers we can represent between -\infty and \infty, and we have to deal with rounding errors and potential loss of resolution or overflow. If you want to write the algorithm so that it works in both 32 bit and 64 bit floating point, check the IEEE 754 spec to see the maximum values that can be represented in each. For binary32 it’s on the order of 10^{38}.
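You can check those limits directly in NumPy, for example:

import numpy as np

print(np.finfo(np.float32).max)   # about 3.4028235e+38
print(np.finfo(np.float64).max)   # about 1.7976931e+308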

ln(10) \approx 2.3026, so the binary32 maximum of roughly 3.4 \times 10^{38} corresponds to e^x with x \approx 88.7.

So somewhere just above x = 88, you’ll get Inf if you compute np.exp(x) in float32. Give it a try:

z = np.array([-0.5, 0, 0.5, 1, 10, 38, 90]).astype(np.float32)
print(f"z = {z}")
print(f"np.exp(z) = {np.exp(z)}")

# Output:
z = [-0.5  0.   0.5  1.  10.  38.  90. ]
np.exp(z) = [6.0653067e-01 1.0000000e+00 1.6487212e+00 2.7182820e+00 2.2026467e+04
 3.1855931e+16           inf]

Subtracting max(x) lets us deal with only numbers x \leq 0 as arguments to e^x. You can see that the curve is pretty tractable in that range, with the slope \leq 1 for all x \leq 0 and e^x \leq 1.

And of course if you work it out, the value is the same:

e^{(a - b)} = e^a * e^{-b} = \displaystyle \frac {e^a}{e^b}

So if you look at the expression for softmax, we get a factor of \displaystyle \frac {1}{e^{max(x)}} in both the numerator and the denominator and they cancel out.
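In code, the stable version just shifts each row by its own max before exponentiating. A minimal sketch (my own illustration, not from the course):

import numpy as np

def stable_softmax(Z):
    # subtract each row's max so every argument passed to exp() is <= 0
    Z_shift = Z - np.max(Z, axis=1, keepdims=True)
    expZ = np.exp(Z_shift)
    return expZ / np.sum(expZ, axis=1, keepdims=True)

Z = np.array([[10.0, 1000.0, 38.0]], dtype=np.float32)
print(stable_softmax(Z))   # no overflow, even though np.exp(1000.0) on its own would be Inf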


Thank you Paul for the helpful and detailed explanation.