On a high dimensional field of the cost function, we drop numerous “balls,” but each ball has a “field of gravity” that intensifies as their cost function gets smaller. Each ball will share with other balls their position and strength of their “gravity.” I thought this way, instead of having just one explorer in the space of high dimensional fields, we can have multiple explorers that work with each other to find the lowest cost function! Not only that it’ll be able to thrive in plateaus as well.
Although it is quite abstract, I wanted to know what you guys think about this idea, whether you think it is worthy at all or not. Thank you for reading my thought.
Function minimization is a well-established field of study. Do you have a mathematical basis for the method you propose? How would it improve on the established methods?
It seems to me that rolling a ball on the cost surface is no different from how we are running gradient descent right now. In gradient descent, we compute the cost. With the ball, we compute its field of gravity. Both seem to require the same amount of computation.
Multiple balls mean running gradient descent on multiple networks each initialized to a different set of values
Communication between balls is interesting, but how it is going to help is not described. One thing for certain is that it is going to cost additional computational power.
In real world, gravity helps us find a lowest point and human doesn’t need to compute gravity ourselves. In computer world, gravity has to be computed by ourselves and that part is expensive.
In real world, gravity pull things around for us. In computer world, we decide how to move the ball. That decision consume computational power.
@Eugene_Ku, I am a Physics graduate so I have thought about the points that I share here, they led me to conclude that moving the concept of gravity to the computer world is going to be expensive and thus not necessarily helpful. We need more definitive evidence that it can save computational cost.
For the part of communicating between balls, I would like to hear more from you about how it is going to work and how it is going to help.
There is a similar idea called Bayesian Optimization (not a replacement for gradient descent though) which assumes a (e.g.) Gaussian process along each dimension of hyperparameters (not network’s trainable parameters thus not a replacement for gradient descent).
The idea is that it samples multiple points on the surface to get the surface’s heights at those points, then we assume the surface to be generable with a Gaussian process (constrained by sampled points) so that the generated surface is basically a smooth one (not necessarily the correct one, but we just somehow believe that the correct one should be smooth).
Then on the next round of sampling, we sample on the lowest points on that generated surface (such that we are not making small steps, but big jumps). As we collect more samples, we refine the generated surface.
In this way we are able to combine information brought to us by multiple samples (or balls in your idea). That combination is like your “communication”. However, the assumption that it is a guassian process is very critical , and it is unlikely for a neural network to have a cost surface similarly smooth everywhere. Therefore, blindly applying it to your idea might not be a smart move.
Hi @rmwkwok, thank you for your reply. I’m not sure if the computational cost will be that high as you described. But your arguments help me see how my idea is too baseless.
Well, my ideas were based on an my “theory” that between all of your weight initializations (of balls), there got to be a good minimum in between those values. Would the optimal weights possibly be outside of your initializing weights? Because based on my understanding, a good model has a low value of weights that is well distributed so that there wouldn’t be one weight that is out of the scope. If this is not true at all, then I guess my idea is completely useless. However, if the above condition is true (good optima exists between the range of your weight initialization), I think this is more natural way to optimize without the headache of scaling and diverging gradients.
However, I see that there a lot of works on optimization, and I’m completely blind to them. I’ll start reading couple of optimization papers and perhaps, come back to this idea. Please let me know if you have any papers you recommend, @TMosh
I say: “computing the cost at one set of weight values is equal to computing the gravity value at one set of weight values.” Agree?
What gradient descent requires us is to compute all the cost values along one path (from initial weights to final weights).
What the ball ideas might require (feel free to say otherwise) is to compute all cost values within the “range” covered by all the balls.
Considering the size between a path and a range, I think a path takes less computation.
You can argue that the more balls you have, the more likely the optimum is within the “range”, however, also that more balls are useless in defining the smallest range required.
I think it’s too early to say this. The most critical point here, I think, is not whether the balls can produce a “range” that covers the optimum in all dimensions. Instead, I think the critical point is how the balls are going to communicate with each other cleverly and make a difference.
I agree with you. Don’t be limited by my arguments - they are something for you to crash down. Maybe your idea’s initial cost is higher, but if it ends up less, it is still a win. Or maybe you can refine it such that the cost is lower at all time. Who knows?