In the video “Vectorizing Logistic Regression’s Gradient Computation” I am not quite following the explanation on the second slide. There are two points of confusion for me:
Why are there 2 different formulas for dw, one on the left and one on the right?
The left side shows the element-wise calculation dw_1 += x_1^{(i)} \cdot dz^{(i)} and dw_2 += x_2^{(i)} \cdot dz^{(i)} of the gradient in a loop, for each feature ( x_1^{(i)} and x_2^{(i)} ) individually. In contrast, the right side shows the vectorized formula, dw = \frac{1}{m} X \cdot dZ^T, which performs the gradient calculation in one step using matrix operations. The vectorized version is more efficient, especially for large datasets, because it uses optimized linear algebra operations.
On the left side, in the non-vectorized form, J is incremented with the cost of each example i in the loop, based on the formula J += - [y^{(i)} \log(a^{(i)}) + (1 - y^{(i)}) \log(1 - a^{(i)})], and then the total cost is divided by m (the number of training examples) to get the average cost, J = \frac{J}{m}. On the right side, this operation is vectorized just like dw: the vectorized form of J (not shown on the slide) is computed in a single step using matrix operations. The vectorized form is a more efficient and compact way of computing the total cost J.
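For what it’s worth, here is a minimal NumPy sketch of both forms of the cost computation (the names A, Y and the toy values are mine, not from the slide; A holds the activations a^{(i)} and Y the labels y^{(i)} as 1 × m row vectors):

```python
import numpy as np

# Toy data, purely illustrative: m = 4 examples.
# A holds the activations a^(i), Y the labels y^(i), both as 1 x m row vectors.
A = np.array([[0.9, 0.2, 0.7, 0.4]])
Y = np.array([[1.0, 0.0, 1.0, 0.0]])
m = Y.shape[1]

# Non-vectorized: accumulate the per-example cost in a loop, then average.
J_loop = 0.0
for i in range(m):
    J_loop += -(Y[0, i] * np.log(A[0, i]) + (1 - Y[0, i]) * np.log(1 - A[0, i]))
J_loop /= m

# Vectorized: the same average cost in one line, no explicit loop.
J_vec = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))

print(J_loop, J_vec)  # the two values agree up to floating-point error
```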
I feel @nadtriana is correct, and I know he is at an advanced level… but why is the Winograd solution not seen as ‘practical’? I mean computationally.
I would like to better understand why we avoid it.
Thank you very much for the fast reply! So we do change the formula for J, it’s just not shown on the right. But with regard to dw I’m still confused. I should’ve been clearer: on the left there is a vectorized version in green that says dw += x^{(i)} \cdot dz^{(i)}. So why do we have that one as well as the one on the right?
This expression is part of the non-vectorized gradient computation, and it is used to incrementally update the gradient of the weights during each iteration of the loop over the training examples. So, for each feature j, you’re computing:
dw_j = \sum_{i=1}^{m} x_j^{(i)} \cdot dz^{(i)}
The expression += ensures that each contribution x_j^{(i)} \cdot dz^{(i)} is added to the previous total. Once the loop is complete, dw_j is divided by m to get the average.
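In code, that accumulation looks roughly like this (a minimal sketch with made-up names and toy shapes: X is (n_x, m) with one column per training example, dz is a 1 × m row of the errors dz^{(i)}):

```python
import numpy as np

np.random.seed(0)
n_x, m = 3, 5                    # toy sizes: 3 features, 5 training examples
X = np.random.randn(n_x, m)      # column i is training example x^(i)
dz = np.random.randn(1, m)       # dz^(i) for each example

# Fully non-vectorized: loop over examples i and features j,
# accumulating dw_j += x_j^(i) * dz^(i)
dw = np.zeros((n_x, 1))
for i in range(m):
    for j in range(n_x):
        dw[j, 0] += X[j, i] * dz[0, i]
dw /= m                          # divide by m after the loop to get the average
```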
In the vectorized version, instead of iterating over each training example in a loop, you compute the gradient for all examples in a single matrix multiplication:
dw = \frac{1}{m} X \cdot dZ^T
This computes the same thing as the loop on the left, but in a single step. The matrix X contains all the training examples (each column is a training example, and each row corresponds to a feature), and dZ contains the errors for all the examples. When you multiply X by dZ^T, you get the total gradient for all weights at once, without needing to manually iterate through each example.
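Here is a small self-contained sketch of that (again with made-up names and toy shapes), checking the single np.dot call against the green per-example update dw += x^{(i)} \cdot dz^{(i)} from the left side:

```python
import numpy as np

np.random.seed(1)
n_x, m = 3, 5                         # toy sizes: 3 features, 5 examples
X = np.random.randn(n_x, m)           # column i is training example x^(i)
dZ = np.random.randn(1, m)            # row vector of the errors dz^(i)

# Vectorized: one matrix product replaces the loop over examples.
dw_vec = (1 / m) * np.dot(X, dZ.T)    # shape (n_x, 1), one entry per weight

# Reference: the partially vectorized green update, still looping over examples.
dw_loop = np.zeros((n_x, 1))
for i in range(m):
    dw_loop += X[:, i:i + 1] * dZ[0, i]
dw_loop /= m

print(np.allclose(dw_vec, dw_loop))   # True: both give the same gradient
```

The speedup of the one-liner comes from np.dot dispatching to compiled, optimized linear-algebra routines instead of looping in Python.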
The Winograd algorithm is highly efficient for certain types of matrix multiplication, especially in convolutional neural networks (CNNs). However, its use in logistic regression is not considered practical, especially when implementing vectorization as we do here: the computational overhead of setting up Winograd’s transformations outweighs any potential benefit for the small matrix operations typical of logistic regression.
@nadtriana It’s okay, I mean I know Strassen. My brain has just never come around to why this is seen as ‘intractable’, so I’m left feeling a bit confused.
We are just doing matrix multiplication here. The Strassen algorithm is just another way to do matrix multiplication that is faster in some circumstances, but not in others. The real point here is that once we start doing things the way real solutions are built, we will be using a platform, typically TensorFlow here, but it could just as well be PyTorch. Once we switch to doing that, the performance-critical computations are mostly handled in the platform code. But the libraries also have flexibility: we’re just invoking np.dot or np.matmul or tf.matmul, right? Have you looked at what happens in the numpy library code? Maybe it’s sophisticated enough to use Strassen if the dimensions and sparseness (or lack thereof) are appropriate. It’s not your problem unless you want to become one of the developers who work on the guts of TensorFlow or numpy.
But seriously, I would be really surprised if the numpy matmul code was using Strassen. It’s all open source, though, right? If you’ve got time to kill, you could probably figure this out. Let us know if you decide to “go there”.
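For what it’s worth, a quick way to see which BLAS/LAPACK backend a particular NumPy build delegates its matrix multiplication to (short of reading the source) is:

```python
import numpy as np

# Prints the BLAS/LAPACK libraries this NumPy build is linked against
# (e.g. OpenBLAS or MKL); the exact output format varies between NumPy versions.
np.show_config()
```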
Yes, I completely understand that algorithms matter. Parallelization matters. Vectorization matters. But writing general purpose libraries that perform well in all cases is not a trivial matter. If you believe that somehow you could write better or more efficient matrix multiplication code than is currently available, go for it. The proof will be in the pudding. Please be aware that LINPACK and LAPACK have been a thing for a very long time and lots of people have put lots of effort into optimizing the performance. And I’m sure a fair number of those people knew enough math to be familiar with Strassen and other algorithmic options.
Actually, this is something I half shared with someone over there, and then at Nvidia:
I mean, I know its conception would have to be rethought, but I hoped to revisit the ‘travelling salesman’ problem in a new way (and with new thinking).
I mean, you must be aware of this.
But it turns out I likely cannot even access this hardware from CUDA, so…
The travelling salesman problem has been thought about by lots of people from Leonhard Euler to the present. Route optimization is a big deal these days as well.
Have you seen the Discrete Optimization course from the University of Melbourne? I took an earlier version of it maybe 7 or 8 years ago and remember it being quite interesting, but pretty different from any of the ML/DL techniques. I confess I’ve now forgotten it all, but they do cover some graph theory problems, I think.