Explanation of vectorization in gradient descent

Hey,

In the video “Vectorizing Logistic Regression’s Gradient Computation” I am not quite following the explanation on the second slide. There are two points of confusion for me:

  1. Why are there 2 different formulas for dw, one on the left and one on the right?
  2. What happens to J?

Thank you for your help!

The left side shows the element-wise calculation of the gradient, dw_1 += x_1^{(i)} \cdot dz^{(i)} and dw_2 += x_2^{(i)} \cdot dz^{(i)}, computed in a loop for each feature individually. In contrast, the right side shows the vectorized formula, dw = \frac{1}{m} X \cdot dZ^T, which performs the gradient calculation in one step using matrix operations. The vectorized version is more efficient, especially for large datasets, because it relies on optimized linear algebra routines.

On the left side, the non-vectorized form, J is incremented with the cost of each example i in the loop, using the formula J += -[y^{(i)} \log(a^{(i)}) + (1 - y^{(i)}) \log(1 - a^{(i)})], and then the total is divided by m (the number of training examples) to get the average cost, J = \frac{J}{m}. On the right side, this operation is vectorized in the same way as dw: the vectorized form of J (not shown on the slide) is computed in a single step using matrix operations, which is a more efficient and compact way of obtaining the total cost.
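
If it helps to see both sides concretely, here is a minimal NumPy sketch (toy data and illustrative variable names, not the course code) showing that the loop version and the vectorized version give the same dw and J:

```python
import numpy as np

np.random.seed(0)
n_x, m = 3, 5                        # 3 features, 5 training examples (toy sizes)
X = np.random.randn(n_x, m)          # each column is one training example x^(i)
Y = np.random.randint(0, 2, (1, m))  # labels
w = np.random.randn(n_x, 1)
b = 0.1

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# ---- left side of the slide: loop over examples ----
dw = np.zeros((n_x, 1))
J = 0.0
for i in range(m):
    z_i = np.dot(w[:, 0], X[:, i]) + b
    a_i = sigmoid(z_i)
    dz_i = a_i - Y[0, i]
    for j in range(n_x):
        dw[j] += X[j, i] * dz_i      # dw_j += x_j^(i) * dz^(i)
    J += -(Y[0, i] * np.log(a_i) + (1 - Y[0, i]) * np.log(1 - a_i))
dw /= m
J /= m

# ---- right side of the slide: vectorized ----
A = sigmoid(np.dot(w.T, X) + b)      # shape (1, m)
dZ = A - Y                           # shape (1, m)
dw_vec = (1 / m) * np.dot(X, dZ.T)   # shape (n_x, 1)
J_vec = -(1 / m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))

print(np.allclose(dw, dw_vec))       # True
print(np.isclose(J, J_vec))          # True
```

The db term is left out just to keep the sketch short; the same pattern applies to it.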


I feel @nadtriana is correct, and I know he is at an advanced level… But why is the Winograd solution not seen as ‘practical’? I mean computationally.

I would like to better understand why we avoid it (?)

Thank you very much for the fast reply! So we do change the formula for J; it’s just not shown on the right. But in regards to dw I’m still confused. I should’ve been clearer: on the left there is a vectorized version in green that says dw += x(i) * dz(i). So why do we have that one as well as the one on the right?

Thanks again!

Let’s clarify the green part:

dw_j += x_j^{(i)} \cdot dz^{(i)}

This expression is part of the non-vectorized gradient computation, and it is used to incrementally update the gradient of the weights during each iteration of the loop over the training examples. So, for each feature j, you’re computing:

dw_j = \sum_{i=1}^{m} x_j^{(i)} \cdot dz^{(i)}

The += operator ensures that each contribution x_j^{(i)} \cdot dz^{(i)} is added to the running total. Once the loop is complete, dw_j is divided by m to get the average.

In the vectorized version, instead of iterating over each training example in a loop, you compute the gradient for all examples in a single matrix multiplication:

dw = \frac{1}{m} X \cdot dZ^T

This computes the same thing as the loop on the left, but in a single step. The matrix X contains all the training examples (each column is a training example x^{(i)}, and each row corresponds to a feature), and dZ is a row vector containing the errors dz^{(i)} for all the examples. When you multiply them, you get the total gradient for all weights at once, without needing to manually iterate through each example.
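
As a quick sanity check of that equivalence and of the shapes involved (a toy sketch; the variable names are just illustrative):

```python
import numpy as np

n_x, m = 4, 6                 # 4 features, 6 training examples (arbitrary toy sizes)
X = np.random.randn(n_x, m)   # column i is the training example x^(i)
dZ = np.random.randn(1, m)    # dZ[0, i] stands in for dz^(i)

# green part of the slide: accumulate dw_j over the loop, then average
dw_loop = np.zeros((n_x, 1))
for i in range(m):
    for j in range(n_x):
        dw_loop[j] += X[j, i] * dZ[0, i]
dw_loop /= m

# vectorized version: one matrix product, (n_x, m) @ (m, 1) -> (n_x, 1)
dw_vec = (1 / m) * np.dot(X, dZ.T)

print(dw_vec.shape)                   # (4, 1): one entry per weight
print(np.allclose(dw_loop, dw_vec))   # True
```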


The Winograd algorithm is highly efficient for certain types of matrix multiplication, especially in convolutional neural networks (CNNs). However, its use in logistic regression is not considered practical, even when implementing vectorization: the computational overhead of setting up Winograd’s transformations outweighs any potential benefit for the small matrix operations typical of logistic regression.

@nadtriana It’s okay, I mean I know Strassen. My brain has just never come around to why this is seen as ‘intractable’, so I stand here feeling a bit confused :grin:


We are just doing matrix multiplication here. The Strassen algorithm is just another way to do matrix multiplication that is faster in some circumstances, but not in others. The real point is that once we start doing things the way real solutions are built, we will be using a platform, typically TensorFlow here, but it could just as well be PyTorch. Once we switch to doing that, the performance-critical computations are mostly handled in the platform code. But the libraries also have flexibility: we’re just invoking np.dot or np.matmul or tf.matmul, right? Have you looked at what happens in the numpy library code? Maybe it’s sophisticated enough to use Strassen if the dimensions and sparseness (or lack thereof) are appropriate. It’s not your problem unless you want to become one of the developers who work on the guts of TensorFlow or numpy.

But seriously, I would be really surprised if the numpy matmul code was using Strassen. It’s all open source, though, right? If you’ve got time to kill, you could probably figure this out. Let us know if you decide to “go there”.
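
Just to make the “let the platform handle it” point concrete, here is a rough, informal comparison (toy sizes, nothing rigorous) of a plain Python triple loop against np.matmul:

```python
import time
import numpy as np

# Multiply two 200x200 matrices the naive way and with np.matmul.
# The naive version takes a few seconds; the library call is nearly instant.
n = 200
A = np.random.randn(n, n)
B = np.random.randn(n, n)

def naive_matmul(A, B):
    n, k = A.shape
    k2, p = B.shape
    assert k == k2
    C = np.zeros((n, p))
    for i in range(n):
        for j in range(p):
            s = 0.0
            for t in range(k):
                s += A[i, t] * B[t, j]
            C[i, j] = s
    return C

t0 = time.perf_counter()
C1 = naive_matmul(A, B)
t1 = time.perf_counter()
C2 = np.matmul(A, B)
t2 = time.perf_counter()

print(f"naive loops: {t1 - t0:.2f}s   np.matmul: {t2 - t1:.5f}s")
print(np.allclose(C1, C2))   # True
```

On typical hardware the library call should come out several orders of magnitude faster, which is the point: those inner loops are not our problem.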

I hope you ‘trust me’, and I take you seriously;

So if you were at SGI, you must know it takes more than one cycle to run an instruction-- Probably several.

And, as you said, you could probably ‘parallelize’ it, but that is very complicated and difficult.

Even CUDA is easier.

But the last person in the world I’d like to argue with is you, Paul.

So I will just ‘step away.’

*After some thought (and consideration) I decided to just leave this here. Paul is a very nice person, but you definitely don’t mess around with him.

Yes, I completely understand that algorithms matter. Parallelization matters. Vectorization matters. But writing general purpose libraries that perform well in all cases is not a trivial matter. If you believe that somehow you could write better or more efficient matrix multiplication code than is currently available, go for it. The proof will be in the pudding. Please be aware that LINPACK and LAPACK have been a thing for a very long time and lots of people have put lots of effort into optimizing the performance. And I’m sure a fair number of those people knew enough math to be familiar with Strassen and other algorithmic options.

Actually, this is something I half shared with someone over there, and then at Nvidia;

I mean, I know its conception would have to be refactored in the thinking, but I hoped to revisit the ‘traveling salesman’ problem in a new way (and with new thinking);

I mean, you must be aware of this.

But it turns out I likely cannot access this hardware even from CUDA, so…

Oh well…

I tried?

*I meant the RTX ray-tracing hardware; a path is a path if you think about it that way.

The travelling salesman problem has been thought about by lots of people from Leonhard Euler to the present. Route optimization is a big deal these days as well.

Have you seen the Discrete Optimization course from University of Melbourne? I took an earlier version of it maybe 7 or 8 years ago and remember it being quite interesting, but pretty different from any of the ML/DL techniques. I confess I’ve now forgotten it all, but they do cover some graph theory problems I think.

‘Euler tour’: I am familiar with the bridges; I just thought maybe we could now do this crazy fast. But I don’t have access…

And sorry, as I am speaking with you: my idea wasn’t blunt, or just ‘line to line’.

We might have to ‘fold the dimensions’ to get it to work, but that is kind of what we are doing in neural nets, is it not?