The results and the figures in the ‘Multi-class classification’ video seem quite intuitive, since every element of Z^{[L]} goes through the same activation function, and Z^{[L]} is the result of a linear function (where Z^{[L]} = W^{[L]} \cdot A^{[L-1]} + b^{[L]}).
But how can this be proved formally? Can anyone help me?
The formula Z^{[l]} = W^{[l]} \cdot A^{[l-1]} + b^{[l]}, where all the inputs are matrices or vectors, is the definition of a linear (strictly speaking, affine) transformation. Of course softmax is not a linear function, so the full mapping, whether from the original inputs or just from the last layer through its activation, is not a linear function.
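As a quick numerical check (with made-up weights and inputs, just for illustration), the pre-activation step behaves like a linear map, but softmax does not even preserve scaling, so the composed mapping cannot be linear:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax for a 1-D array
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

# Made-up weights and bias for a 3-class, 2-feature example
W = np.array([[1.0, 2.0],
              [3.0, -1.0],
              [0.5, 0.5]])
b = np.array([0.1, -0.2, 0.3])

x = np.array([1.0, 2.0])
z = W @ x + b

# Linearity would require softmax(2 * z) == 2 * softmax(z); it fails:
print(softmax(2 * z))
print(2 * softmax(z))
print(np.allclose(softmax(2 * z), 2 * softmax(z)))  # False
```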
Please help us by clarifying what more you want to prove here.
Oh it seems like I misunderstood something.
But I mean, look at the figures: why are those regions separated by straight lines?
Sorry, I will have to watch that lecture again to reconstruct what those examples are about. It may be tomorrow before I have time to do that. But anything you can graph in 2D like that is a pretty simplistic example.
Ok, it’s simple. As he explained in the lecture, those examples are the simplest possible case: it’s not really a neural network. It’s like Logistic Regression with only one layer, but with a softmax activation instead of sigmoid. So the inputs are points in the plane with coordinates (x_1, x_2) and the output is a softmax vector computed by taking:
z = W \cdot x + b
\hat{y} = softmax(z)
So let’s suppose that C = 4, meaning that we have 4 possible output “classes” to recognize. That means each input x is a 2 x 1 vector, W is a 4 x 2 matrix, and b is a 4 x 1 column vector. Both z and \hat{y} will be 4 x 1 column vectors for the case of processing a single input.
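To make the shapes concrete, here is a minimal numpy sketch of that setup for C = 4; the particular values of W, b, and x are arbitrary, chosen only to show the dimensions:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax applied column-wise
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)

rng = np.random.default_rng(0)

C = 4                              # number of output classes
W = rng.normal(size=(C, 2))        # 4 x 2 weight matrix
b = rng.normal(size=(C, 1))        # 4 x 1 bias (column vector)
x = np.array([[0.5], [-1.2]])      # one input point, a 2 x 1 column vector

z = W @ x + b                      # 4 x 1
y_hat = softmax(z)                 # 4 x 1, entries sum to 1

print(z.shape, y_hat.shape, float(y_hat.sum()))  # (4, 1) (4, 1) and a sum of (approximately) 1.0
```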
Now think about what those lines mean in the graphs: they are the “decision boundaries” between two possible predicted classes. Let’s say it’s between class 1 and class 2. Then that line is the set of points (x_1, x_2) for which the \hat{y}_1 and \hat{y}_2 values are equal. But note that if the softmax values are equal, then the input z values are equal, because softmax is strictly monotonic: both outputs share the same denominator, and e^z is strictly increasing. That means they are the points which satisfy the following equation:
w_{11} * x_1 + w_{12} * x_2 + b_{11} = w_{21} * x_1 + w_{22} * x_2 + b_{21}
What I’ve done there is just write out the meaning of the matrix multiply followed by the addition of b.
Then if you simplify that by gathering like terms, you get this:
(w_{11} - w_{21} ) * x_1 + (w_{12} - w_{22}) * x_2 = b_{21} - b_{11}
Well, what is that? It’s the equation of a line in the plane, right?
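If you want to see that numerically, here is a sketch (again with arbitrary W and b) that picks points on the line derived above and confirms that classes 1 and 2 receive equal softmax scores there:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 2))    # arbitrary 4 x 2 weights
b = rng.normal(size=(4, 1))    # arbitrary 4 x 1 bias

# Boundary between class 1 and class 2 (rows 0 and 1 in zero-based indexing):
# (w11 - w21) * x1 + (w12 - w22) * x2 = b21 - b11
dw = W[0] - W[1]
db = b[1, 0] - b[0, 0]

for x1 in np.linspace(-3.0, 3.0, 5):
    x2 = (db - dw[0] * x1) / dw[1]        # solve the line equation for x2 (assumes dw[1] != 0)
    x = np.array([[x1], [x2]])
    y_hat = softmax(W @ x + b)
    # On the boundary, the two predicted probabilities agree (up to float error)
    print(np.isclose(y_hat[0, 0], y_hat[1, 0]))   # True on every iteration
```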
Prof Ng mentions again at the very end that this is the simplest case. He explicitly points out that if you add more layers to create a real neural network to do this type of softmax classification, then the decision boundaries that can be represented are much more complex than linear functions.
Maybe it’s also worth pointing out that this is just a generalization of the reason that Logistic Regression can only express a linear decision boundary, even when the inputs are vectors with more than 2 dimensions. In the Logistic Regression case, the output is a single number, rather than a vector as in the softmax case:
z = w \cdot x + b
\hat{y} = sigmoid(z)
It is making a prediction in the form of a probability between 0 and 1, so we interpret the prediction as “Yes” (It’s a cat) if \hat{y} > 0.5. So the decision boundary between “Yes” and “No” is expressed by this equation:
\hat{y} = 0.5
But we also know that sigmoid is strictly monotonic and sigmoid(0) = 0.5, so that equation is equivalent to:
z = 0
If you write that out, it becomes:
w \cdot x + b = 0
Or
\displaystyle \sum_{i = 1}^n w_i * x_i = - b
That is the equation of a plane (hyperplane) in the input space of all possible x vectors.
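And here is a corresponding sketch for the Logistic Regression case (arbitrary w and b, with n = 5 features): any point constructed to lie on the hyperplane w \cdot x + b = 0 gets the prediction \hat{y} = 0.5:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n = 5
w = rng.normal(size=n)     # arbitrary weight vector
b = rng.normal()           # arbitrary bias

# Build a point on the hyperplane w . x + b = 0:
# choose x_2, ..., x_n freely, then solve for x_1 (assumes w[0] != 0).
x = rng.normal(size=n)
x[0] = -(b + np.dot(w[1:], x[1:])) / w[0]

z = np.dot(w, x) + b
print(z, sigmoid(z))       # z is (numerically) 0, so the prediction is 0.5
```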
It’s a really good example. Thanks!