Thank you, Tom, for your reply. Please allow me to explain in more detail, and pardon me if this post is lengthy; it is very important for me to get the basic concepts right.

**Logistic Regression (Binary Classification)**

First, let's say we have the following data, with 2 features and 3 examples:

```
X1 = np.array([[1, 3], [5, 4], [8, 9]])
y1 = np.array([0, 0, 1]).reshape(-1,1)
```

Therefore,

```
n = 2 # number of features
m = 3 # number of training examples
```

X1 is a 3 by 2 matrix, and y1 is a 3 by 1 column vector.

Let us assume our initial guesses for `w` and `b` are as follows:

```
# w_0 = 0.2
# w_1 = 0.3
w = np.array([0.2, 0.3]).reshape(-1,1)
b = 1
```

We reshape `w` into a 2 by 1 vector so that we can take a dot product with X1. Using the formula

f(x) = Xw + b

`f(x) = y_hat = z` should be a 3 by 1 vector, i.e. one prediction for each training example.

```
fx = X1@w + b
fx
array([[2.1],
       [3.2],
       [5.3]])
```

As explained by Tom, we apply sigmoid(z) element-wise, so the result has the same shape as z: a number between 0 and 1 for each training example.

In our example, we have the following after applying the sigmoid function:

```
yhat
array([[0.89090318],
       [0.96083428],
       [0.9950332 ]])
```
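As a sketch, the values above can be reproduced end to end (the `sigmoid` helper below is my own, not from the course code):

```python
import numpy as np

def sigmoid(z):
    # element-wise logistic function: maps each entry of z into (0, 1)
    return 1 / (1 + np.exp(-z))

X1 = np.array([[1, 3], [5, 4], [8, 9]])
w = np.array([0.2, 0.3]).reshape(-1, 1)
b = 1

z = X1 @ w + b      # shape (3, 1), one score per example
yhat = sigmoid(z)   # same shape as z, each entry in (0, 1)
print(yhat)
```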

Then we apply the gradient formulas,

dJ/dw_j = (1/m) * sum_i (yhat_i - y_i) * x_ij
dJ/db  = (1/m) * sum_i (yhat_i - y_i)

whose per-example terms are

```
temp_dw = (yhat - y) * X1
temp_db = (yhat - y)
```

and so on.
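To complete the picture, here is a minimal sketch of the fully vectorized gradients, averaging those per-example terms over the m examples:

```python
import numpy as np

X1 = np.array([[1, 3], [5, 4], [8, 9]])
y1 = np.array([0, 0, 1]).reshape(-1, 1)
m = X1.shape[0]

w = np.array([0.2, 0.3]).reshape(-1, 1)
b = 1
yhat = 1 / (1 + np.exp(-(X1 @ w + b)))  # sigmoid predictions, shape (3, 1)

# average the per-example terms (yhat - y) * x over all m examples
dw = X1.T @ (yhat - y1) / m   # shape (2, 1): one gradient entry per feature
db = np.sum(yhat - y1) / m    # scalar
```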

I also applied the same sigmoid function to OvR and achieved satisfactory results.
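For reference, the OvR (one-vs-rest) idea can be sketched as one independent sigmoid classifier per class, predicting the class whose classifier is most confident. The weights below are placeholder values of my own, not fitted parameters:

```python
import numpy as np

X2 = np.array([[1, 3], [5, 4], [8, 9]])

# one (w, b) pair per class, stacked column-wise; placeholder values
W_ovr = np.array([[0.2, 0.1, 0.3], [0.3, 0.2, 0.4]])  # shape (n, num_classes)
b_ovr = np.array([[0.1, 0.3, 0.5]])                   # shape (1, num_classes)

# each column is an independent "class k vs. rest" sigmoid score
scores = 1 / (1 + np.exp(-(X2 @ W_ovr + b_ovr)))      # shape (m, num_classes)

# predict the class whose binary classifier gives the highest score
pred = np.argmax(scores, axis=1)
```

Note that unlike softmax, the OvR scores in each row do not sum to 1, since each classifier is trained independently.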

Let us use the same examples but change the classification to three classes:

```
X2 = np.array([[1, 3], [5,4], [8, 9]])
y2 = np.array([0, 1, 2]).reshape(-1,1)
```

My initial mistake was to apply what I had learned with sigmoid directly to softmax: I used the same 2 by 1 vector for `w` and a scalar for `b`.

Now, I will attempt to explain the Softmax function and try to fit the same linear formula into Softmax.

**Softmax Regression (Multi-Class Classification)**

The softmax function is as follows:

softmax(z)_j = exp(z_j) / sum_k exp(z_k)

My interpretation of the softmax function is that it is the (exponentiated) prediction for one class divided by the sum of the (exponentiated) predictions over all classes.
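The `mySoftmax` helper used below is my own; one way to implement it (with the usual max-subtraction trick for numerical stability) is:

```python
import numpy as np

def mySoftmax(z):
    # subtract the row-wise max first: exp() then cannot overflow,
    # and the result is mathematically unchanged
    z_shifted = z - np.max(z, axis=1, keepdims=True)
    e = np.exp(z_shifted)
    # normalize each row so its entries sum to 1
    return e / np.sum(e, axis=1, keepdims=True)
```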

For each training example, `f(x)` should contain the prediction for each class. Therefore, `f(x)` should have a shape of 1 by `num_of_class` for a single example.

Now, working backward to achieve the result above, we should turn `w` into a matrix and `b` into a vector, with one column/entry per class.

I also learned that regardless of whether we use Categorical Cross Entropy or Sparse Categorical Cross Entropy, `f(x)` should have the same dimensions: `num_of_examples` by `num_of_class`.

Using the same example, assuming we have the following:

```
# class 0: w_0 = 0.2, w_1 = 0.3, b = 0.1
# class 1: w_0 = 0.1, w_1 = 0.2, b = 0.3
# class 2: w_0 = 0.3, w_1 = 0.4, b = 0.5
W = np.array([[0.2, 0.1, 0.3],[0.3, 0.2, 0.4]])
b = np.array([[0.1,0.3,0.5]])
```

The shape of `W` should be a 2 by 3 matrix, and `b` should be a 1 by 3 row vector. Applying the linear formula:

```
fx = X2@W + b
fx
array([[1.2, 1. , 2. ],
       [2.3, 1.6, 3.6],
       [4.4, 2.9, 6.5]])
```

So each row is one training example, and each column is the prediction for one class. Then we apply softmax:

```
soft1 = mySoftmax(fx)
soft1
array([[0.24726331, 0.20244208, 0.55029462],
       [0.19357779, 0.09612788, 0.71029433],
       [0.10650421, 0.0237643 , 0.86973149]])
```

If we sum up each row, we should get 1 for each example.

```
np.sum(soft1, axis=1, keepdims=True)
array([[1.],
       [1.],
       [1.]])
```
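As a side note, here is my own sketch of how the sparse categorical cross-entropy loss could be computed from these probabilities, using the integer labels in `y2`: take the mean negative log of each example's true-class probability.

```python
import numpy as np

soft1 = np.array([[0.24726331, 0.20244208, 0.55029462],
                  [0.19357779, 0.09612788, 0.71029433],
                  [0.10650421, 0.0237643 , 0.86973149]])
y2 = np.array([0, 1, 2])  # integer class labels, one per example

# pick out the predicted probability of the true class for each example
p_true = soft1[np.arange(len(y2)), y2]

# mean negative log-likelihood over the m examples
loss = -np.mean(np.log(p_true))
```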

I hope that my understanding of the softmax implementation is correct. I also learned that the way we compute the gradient is different when applying softmax in a neural network. However, that is another query for another day.

I must say that by creating my own gradient descent function from scratch, I was forced to gain a deeper understanding of the softmax implementation.

Thank you to those who had the patience to go through my example.

Cheers.