Neural Nets and Parallelism

So, I know I am new to this and am only on the first course-- yet do have some experience with programming/system engineering.

Yet given the algorithms, it is not entirely clear to me how you could parallelize this (via CUDA or whatever).

I mean, yes, for dot-product calculations you could distribute the work across as many real/virtual threads as you have and recombine the results.

And at first I even thought you could train many examples at once, until I realized you are doing this over a shared weight set, and as far as I know there is no technology that lets two (or more) threads access the same memory simultaneously.

Just wondering how this is done/the thinking.

Thoughts?

The parallelism happens at each layer of the forward propagation and backward propagation. You can’t parallel process “across” layers, of course, because they are serial: you need the full output of the previous layer as input to the current layer. But all matrix operations can be parallelized at the level of the vector units in the GPU. GPUs have many vector units that run in parallel handling different parts of a matrix multiply (say). Think about how a matrix multiply works: you are stepping each row of the first operand across all the columns of the second operand. Each row can be handled by a different vector unit, since the results are independent. Also we are parallelizing across all the samples in a given minibatch, if you think about how the vector operations work.
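As a toy illustration of why the rows are independent (a NumPy sketch of the math, not how a GPU actually schedules its vector units):

```python
import numpy as np

# Each row of the product A @ B depends only on the corresponding row of A,
# so the rows can be computed by independent workers (vector units on a GPU).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))

full = A @ B                                                # the usual matrix multiply
per_row = np.stack([A[i] @ B for i in range(A.shape[0])])   # one "worker" per row

assert np.allclose(full, per_row)
```

Each iteration of that loop touches only `A[i]` and `B`, so nothing stops them from running at the same time.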

This kind of vectorization and parallelization is exactly what GPUs are designed to do and why they are such a big deal in ML and crypto and 3D graphics, which is what they were originally invented for and what the G stands for. :nerd_face: It’s all just matrix multiplies fundamentally.

Dear Paul,

I guess one of my misconceptions coming into this was that you had to derive new derivatives all the time, and thus that is where the action is. Instead it seems you do it only once (for whatever loss function you are using), and then it is more of just the ‘optimization’ problem (i.e. gradient descent) that you are working at. Does my revised understanding make sense?

Also, I am still a little confused about gradient descent. I mean the concept itself is pretty straightforward, but don’t your minima kind of ‘move around’ as you add more training examples?

It is also one of those ‘pie in the sky’ ideas I have now and then: rather than gradient descent, if you are doing the matrix math on an FPGA, could you utilize an OpAmp differential circuit to perform the optimization?

Well, the actual values of the derivatives change, because you’ve changed the function on each training iteration, right? But it’s just a bunch of matrix multiplies with different matrices each time. You change the weights (coefficients) and those figure in the back prop calculations. But the other parts of the huge Chain Rule (the derivatives of the activation functions, e.g.) stay the same. As I mentioned, the formula is the same, just with different matrices as input. Just as an example, here’s a formula from back prop:

dA^{[l-1]} = W^{[l]T} \cdot \left( dA^{[l]} * g^{[l]{'}}(Z^{[l]}) \right)

So on each iteration every value on the RHS will be different because of the weight updates, right? But it’s fundamentally the same computation in terms of the actual vector operations you do.
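Here is a minimal NumPy sketch of that step, assuming ReLU as the layer’s activation g and random stand-in values (shapes follow the course convention of (units, samples)):

```python
import numpy as np

# Toy back-prop step: dA_prev = W^T @ (dA * g'(Z)), with ReLU as g.
# W[l] is (n_l, n_prev); dA and Z at layer l are (n_l, m).
rng = np.random.default_rng(1)
n_prev, n_l, m = 3, 4, 5
W_l  = rng.standard_normal((n_l, n_prev))
dA_l = rng.standard_normal((n_l, m))
Z_l  = rng.standard_normal((n_l, m))

def relu_grad(z):
    """g'(z) for ReLU: 1 where z > 0, else 0."""
    return (z > 0).astype(z.dtype)

dZ_l    = dA_l * relu_grad(Z_l)   # elementwise product with the activation derivative
dA_prev = W_l.T @ dZ_l            # propagate the gradient to layer l-1

assert dA_prev.shape == (n_prev, m)
```

Run it with different random `W_l` and the numbers change, but the sequence of vector operations is identical every time, which is the point about the GPU workload being the same shape on each iteration.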

Not sure if I understand your fundamental point here, but I hope the above is at least relevant to what you are wondering here.

If you change your training set, then you’re solving a different problem, so you’re starting the training over again, right? So, yes, the minima may be at least marginally or maybe not so marginally different, because it’s a different problem. So why is that a) surprising or b) a problem?

Note that the solution surfaces here are incredibly complex and there are staggering numbers of local minima. Here’s a thread that talks about weight space symmetry and permutations.

But it turns out that the math is in our favor: there are lots of good solutions in most of the cases we actually deal with and gradient descent with appropriate parameterization can typically find them.

Sorry if I already gave you any of those links before …

Dear Paul,

I am not suggesting there is a ‘problem’ here: 1) I want to ensure I am understanding everything correctly, and 2) my thought is that, perhaps, there is a way more efficient way to do this, and it is not digital, it’s analog.

Of course the matrix calculations would have to be digital. I have no clever analog solution for that (unless we are talking quantum or something).

Appreciate the input, but no need to be upset! Just trying to think through this…

As to your earlier comment regarding ‘minima moving around’, I completely agree with you: a ‘different problem’ would be a different problem, which is precisely why neural nets, or really any model, do not generalize beyond their training data. But here we are trying to solve an optimization as we add more data to it. The thought makes me a little bit wary.

Yup, even when you just switch from one minibatch to the next during training.

Again, I am new, so sorry for a ‘dumb’ question. I mean, I have heard talk of this ‘deep search space’… but none of our fundamental equations is polynomial. It is all basic regression, and then somehow we make this leap?

Well from a strictly mathematical point of view, a linear expression is a first degree polynomial, but I am not sure I understand your point. If it is “why don’t we use higher degree polynomials as a way to get complex functions”, then the answer is that we don’t need to. We can achieve complexity by adding a non-linear activation function at each layer, so that each layer is a non-linear function. Then we can stack (“compose” is the mathematical term) those functions to create as much complexity as we need for a given problem. That method has proven to give excellent results at least in part because the back propagation process is mathematically “tractable” and gives useful solutions in lots of realistic cases. I don’t know the history here and whether researchers have experimented with higher degree polynomials as the layers in a neural network. They do use “polynomial regression” as an enhancement on linear regression for some simple types of problems and that does work. What that actually means is that they run linear regression, but the input features are polynomial combinations of the original features in the case that a linear decision boundary is just not adequate to fit the data.
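As a small sketch of that last point, here is “polynomial regression” done as plain least-squares linear regression on polynomial combinations of the original feature (the data here is made up for illustration):

```python
import numpy as np

# "Polynomial regression" is ordinary linear regression over expanded features:
# we fit y = c0 + c1*x + c2*x^2 by treating [1, x, x^2] as three input features.
x = np.linspace(-3, 3, 25)
y = 2 + 3 * x + 0.5 * x**2          # synthetic data, generated exactly

X = np.column_stack([np.ones_like(x), x, x**2])   # expanded feature matrix
coef, *_ = np.linalg.lstsq(X, y, rcond=None)      # linear least squares

assert np.allclose(coef, [2.0, 3.0, 0.5])
```

The model is still linear in its coefficients; the non-linearity lives entirely in the feature expansion, which is why a linear solver can fit a curved decision boundary.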

Dear Paul,

I realize I am still learning, so I may ask some ‘stupid’ questions at times. Yet that is all to the point: in the end, I better understand.

For one, I was just kind of surprised this is all basically linear regression but with a weight term (your activation function). I mean, I was doing regression trees in Matlab way back in 2007 in grad school at U of T, and to be honest I had no idea that both Hinton and Karpathy were there at the time (I was studying Economics), or I probably wouldn’t be taking this class now.

What I was missing, though, was back-prop. Nor do I disagree that it ‘seems to work’. Rather, my point was that the way it is described is as this super crazy hyper-parameterized space you are chasing the gradient down. In terms of dimensions, perhaps yes, but especially if you are using ReLU, well, it is all linear all the way down.

I mean, in another of my ‘out there’ ideas I might try, I’ve kind of wondered what would happen if we made our activation functions ‘adaptive’, or allowed them to change as the model evolves based on the strength of the connections between nodes. Kind of like a feedback loop.

Anyways friend… Just trying…

I think part of the problem I’m having coming up with good answers to your questions is that you seem to have invented your own nomenclature for everything that doesn’t bear much resemblance to how Prof Ng discusses things. If you have that much academic background, then surely you can appreciate that a big part of learning a field is absorbing and understanding the notation and the terminology. That is especially true in anything math related, and everything we are doing here is “math all the way down”. Although even there, the ML universe doesn’t use the exact same conventions as the “pure math” world. My academic background is in pure math, so I also have to make some adjustments, but most of them are pretty minor. The one that comes to mind is that log means natural log in ML world. In math world, log is base 10 and you say ln for natural log. If you’re coming from the world of economics, they also use a lot of math if you were doing microeconomics or econometrics, but I’m sure they have their own unique terminology and notation.

So back to the quote of yours above: the “weights” are the W and b values which are just the coefficients in the linear transformation at each layer of the network. The activation function is not a “weight term” whatever you mean by that. It is a non-linear function that adds complexity at each layer. As I’ve probably mentioned earlier in our various conversations, it is an easily provable theorem that the composition of linear functions is still linear. So if you don’t include a non-linear “activation” at each layer, then there is literally no point in having multiple layers in your Neural Net: all NNs would be logically equivalent to Logistic Regression.
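That theorem is easy to check numerically. A small NumPy sketch, with random stand-in weights:

```python
import numpy as np

# Two stacked linear layers (no activation) collapse to a single linear layer:
# W2 @ (W1 @ x + b1) + b2  ==  (W2 @ W1) @ x + (W2 @ b1 + b2)
rng = np.random.default_rng(2)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)
x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x + b1) + b2
one_layer  = (W2 @ W1) @ x + (W2 @ b1 + b2)

assert np.allclose(two_layers, one_layer)
```

No matter how many linear layers you stack, you can always pre-multiply the weight matrices into a single `W` and fold the biases into a single `b`, so the extra layers buy you nothing without a non-linearity in between.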

No it’s not. ReLU is a non-linear function. Yes, it is “piecewise linear”, but that is still non-linear. In mathematics there is no “almost”: the function is either linear or it’s not.

It’s an interesting and creative idea. There are adaptive algorithms for doing gradient descent (e.g. making the learning rate dynamic based on various factors), but I’ve never heard of anyone trying to make the activation functions adaptive as well. Maybe you could make them parameterizable in a way similar to the linear transformations and then back prop would have the ability to tweak the activations as well. Interesting. When you get through Course 2 and have a more complete view of the landscape, you could try some experiments with that kind of idea. Or we could do a literature search and see if anyone else has tried anything like that.
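Just to make the idea concrete, here is a hypothetical sketch of one such parameterizable activation: a leaky-ReLU whose negative-side slope `a` is treated as a trainable parameter with its own gradient, so back prop could adjust the activation’s shape along with the weights. All names here are illustrative, not from the course.

```python
import numpy as np

# Hypothetical "adaptive" activation: leaky ReLU with a learnable slope `a`.
def param_relu(z, a):
    """Identity for z > 0; slope `a` for z <= 0."""
    return np.where(z > 0, z, a * z)

def param_relu_grads(z, a, upstream):
    """Back-prop through the activation, including a gradient for `a` itself."""
    dz = np.where(z > 0, 1.0, a) * upstream          # gradient w.r.t. the input z
    da = np.sum(np.where(z > 0, 0.0, z) * upstream)  # gradient w.r.t. the slope a
    return dz, da

z = np.array([-2.0, -0.5, 1.0])
out = param_relu(z, a=0.1)
assert np.allclose(out, [-0.2, -0.05, 1.0])
```

With `da` in hand, gradient descent could update `a` exactly the way it updates a weight, which is one plausible reading of “the model evolves its own activations”.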

Onward! :nerd_face:

Dear Paul,

Yes, I understand. As to activations, I say ‘weight term’ in the sense that it is actually putting a carry on any particular determination: it can pull it up, it can pull it down. What would be the right way for me to express this?

And, no, I am not so smart… Due to occurrences in my early life, math has never been my strong point… But I would very much like to compliment you, the other staff, and especially Prof Ng on this course.

I mean, I tried to audit (with the option to take) MIT’s related course from their Data Science MicroMasters… It doesn’t cover even half of what this course does, and it starts with Perceptrons (I already have a copy of Yaser Abu-Mostafa’s ‘Learning From Data’)… But I was so lost.

In contrast, for Prof. Ng’s class, I get it [or think I’m getting it].

I’ve always been better as a concept/idea guy and thus appreciate you are smarter than me. Honestly, GMM was my ‘death knell’ in grad school-- I just didn’t get it. Luckily there is none of that here. So please forgive my silly questions as I learn.

-A

I realized I should at least add that I did take and pass MIT’s equivalent of 6.004x when they offered it. MIT is f**** tough, but that is why I know at least something about system architecture and could ask this question in the first place.

It’s great to hear that you find Prof Ng’s course(s) well structured and useful. I can claim no credit for that: the mentors are just fellow students who volunteer their time to answer questions here on the forums, but we had nothing to do with the creation or content of the courses.

I think you are trying too hard here. What is “a carry” in your sense? Like a “carry” in successive addition? Before we “go there”, let’s consider the behavior and the implications for some of the commonly used activation functions that Prof Ng has shown us so far.

Consider the following 4 examples: ReLU, Leaky ReLU, tanh and sigmoid. Look at their graphs. What can we conclude from them?

They are all monotonic functions. If z_2 \geq z_1, then you can conclude that g(z_2) \geq g(z_1). But note that monotonicity is not strictly required of an activation. swish for example is not monotonic.

Beyond that, they have somewhat different behaviors:

ReLU acts like a half-wave rectifier: it just drops all negative input values and replaces them with 0.

Leaky ReLU doesn’t drop negative values, but reduces their absolute values to some degree based on the slope you choose (a hyperparameter).

Both tanh and sigmoid have a very similar shape (and in fact are quite closely related mathematically): they have “flat tails” as |z| \rightarrow \infty, so they make large values more or less interchangeable with each other above some threshold of |z|. They also “clamp” all the values between either -1 and 1 (for tanh) or 0 and 1 (for sigmoid). So we can interpret the output of sigmoid as a probability and say that it is predicting “True” if the output is > 0.5.
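For concreteness, here are those four activations evaluated on a few sample inputs (a straightforward NumPy sketch):

```python
import numpy as np

def relu(z):       return np.maximum(0.0, z)
def leaky_relu(z): return np.where(z > 0, z, 0.01 * z)   # slope 0.01 is a hyperparameter
def sigmoid(z):    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(relu(z))        # negatives dropped to 0
print(leaky_relu(z))  # negatives scaled down, not dropped
print(np.tanh(z))     # clamped to (-1, 1), flat tails
print(sigmoid(z))     # clamped to (0, 1); > 0.5 predicts "True"
```

Note all four outputs preserve the ordering of the inputs, which is the monotonicity point from above.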

For the output layer of a network, the choice of activation is fixed by the purpose of your network: if it is a binary classifier (“Yes/No”, “Cat/Not a cat”), then you always use sigmoid. If it is a multiclass classifier (identifying one of a number of animals or objects or …), then you use softmax, which is the multiclass version of sigmoid. It gives you a probability distribution on the outputs of the possible classes.
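A minimal sketch of softmax, assuming the usual max-subtraction trick for numerical stability (the logit values here are made up):

```python
import numpy as np

def softmax(z):
    """Exponentiate and normalize, giving a probability distribution over classes."""
    e = np.exp(z - np.max(z))   # subtracting the max avoids overflow, result unchanged
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
assert np.isclose(p.sum(), 1.0)    # a valid probability distribution
assert p[0] > p[1] > p[2]          # larger logits get larger probabilities
```

With two classes, softmax reduces to sigmoid on the difference of the logits, which is the sense in which it is the multiclass version of sigmoid.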

Finally getting back to your actual question:

How about this as a way to describe what you are getting at there: the activation function provides a form of “interpretation” of the input. How that interpretation looks depends on the function. In the case of ReLU, the “interpretation” is: only positive values are interesting. In the case of Leaky ReLU, the interpretation is “negative values should have less effect than positive ones”. Instead of just dropping them as ReLU does, Leaky ReLU “tones them down”. In the case of tanh and sigmoid, the “interpretation” is clamping the values to a fixed range, but in a monotonic way. The net effect of that is that the differences in values become a lot less significant, the farther away from the origin you are. In the particular case of sigmoid, that gives you the final level of “interpretation”: mapping the input to a probability.

Okay, I’ll accept that. Conceptually I am still wondering, though (and haven’t had the time yet to run through an example to test): let’s say you are using ReLU. In certain cases, will this actually effectively ‘prune’ the network? Or does everything remain connected, albeit with some nodes at a very low contribution value? I’m wondering because of how the network reacts (in prediction) to the instantiation of new/novel data.

My intuition would be that everything remains connected, even with ReLU. Note that a given neuron may not always output a negative value: it may depend on the inputs, right? Of course this is with the proviso that in general we don’t really know what happens at the level of individual neurons and we judge by the results: the predictions the network makes are either useful or they’re not.
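A tiny sketch of that point: the same ReLU neuron outputs zero for one input and a positive value for another, so nothing is permanently “pruned” (the weights here are made up):

```python
import numpy as np

# A single ReLU neuron with fixed weights. Whether it outputs 0 depends
# entirely on the input, so a zero output does not mean a dead connection.
w, b = np.array([1.0, -1.0]), 0.0

def unit(x):
    return max(0.0, w @ x + b)   # one ReLU neuron: ReLU(w . x + b)

assert unit(np.array([0.0, 1.0])) == 0.0   # z = -1, ReLU drops it
assert unit(np.array([1.0, 0.0])) == 1.0   # z = +1, passes through
```

(A neuron whose weights drive z negative for *every* training input is the “dying ReLU” case, but even then the connection still exists and later weight updates can revive it.)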

There is some interesting work where researchers have tried “instrumenting” neurons in the inner layers of networks to try to figure out what they are doing. Prof Ng will show us that in Week 4 of DLS Course 4, so you’ve got a lot to get through before you get there. But the lecture is also posted on YouTube and I think it would probably give you some useful intuitions even if you haven’t yet learned the details of how ConvNets work.

Yeah, no, I am not there yet. I can think of certain cases where pruning might be a good idea, and was wondering if that was what was happening with ReLU (I mean, after all, you are tossing a ton of zeros into the equation).