Ancient paper: "Backpropagation is Sensitive to Initial Conditions"

In a paper published in 1990:

Backpropagation is Sensitive to Initial Conditions (PDF)

John F. Kolen and Jordan B. Pollack write:

This paper explores the effect of initial weight selection on feed-forward networks learning simple functions with the backpropagation technique. We first demonstrate, through the use of Monte Carlo [i.e. with judicious use of random number generators - ed.] techniques, that the magnitude of the initial condition vector (in weight space) is a very significant parameter in convergence time variability. In order to further understand this result, additional deterministic experiments were performed. The results of these experiments demonstrate the extreme sensitivity of backpropagation to initial weight configuration.
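
The Monte Carlo setup they describe is easy to sketch. Below is my own toy version, not the paper's code: a 2-2-1 XOR network trained with plain sigmoid backpropagation, where the initial weights are drawn uniformly from [-r, r] for several magnitudes r, and the number of epochs to convergence is recorded over many random draws. All hyperparameters here are my guesses.

```python
# Toy Monte Carlo experiment: how does the magnitude r of the initial
# weight interval affect convergence time of backprop on XOR?
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def epochs_to_converge(rng, r, lr=0.5, max_epochs=20000, tol=0.01):
    # Initial condition vector: every weight drawn from [-r, r].
    W1 = rng.uniform(-r, r, size=(2, 2)); b1 = rng.uniform(-r, r, size=2)
    W2 = rng.uniform(-r, r, size=(2, 1)); b2 = rng.uniform(-r, r, size=1)
    for epoch in range(max_epochs):
        h = sigmoid(X @ W1 + b1)            # forward pass
        out = sigmoid(h @ W2 + b2)
        err = out - y
        if np.mean(err ** 2) < tol:         # converged
            return epoch
        # backward pass: gradients of the squared error
        d_out = err * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)
    return max_epochs                        # treated as "did not converge"

rng = np.random.default_rng(0)
for r in (0.1, 0.5, 1.0, 2.0, 5.0):
    times = [epochs_to_converge(rng, r) for _ in range(50)]
    print(f"r={r}: median={np.median(times):.0f}, spread={np.std(times):.0f}")
```

Even this crude version makes the paper's point visible: the spread of convergence times varies wildly with r, not just the median.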

We read:

Chaotic behavior has been carefully circumvented by many neural network researchers (e.g., through the choice of symmetric weights by Hopfield [5]), but has been reported with increasing frequency over the past few years. Connectionists, who use neural models for cognitive modeling, disregard these reports of extreme nonlinear behavior in spite of common knowledge that nonlinearity is what enables network models to perform non-trivial computations in the first place. All work to date has noticed various forms of chaos in network dynamics, but not in learning dynamics. Even if backpropagation is shown to be non-chaotic in the limit, this still does not preclude the existence of fractal boundaries between attractor basins, since other non-chaotic nonlinear systems produce such boundaries (e.g., forced pendulums with two attractors).

What does this mean to the backpropagation community? From an engineering applications standpoint, where only the solution matters, nothing at all. When an optimal set of weights for a particular problem is discovered, it can be reproduced through digital means. From a scientific standpoint, however, this sensitivity to initial conditions demands that neural network learning results must be specially treated to guarantee replicability. When theoretical claims are made (from experience) regarding the power of an adaptive network to model some phenomena, or when claims are made regarding the similarity between psychological data and network performance, the initial conditions for the network need to be precisely specified or filed in a public scientific database.
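
Their replicability demand is cheap to honor today: store the seed and the exact initial weight vector alongside any reported result. A minimal sketch (the file name and record layout are my invention):

```python
# Record the precise initial conditions of a run so it can be replayed.
import json
import numpy as np

seed = 42
rng = np.random.default_rng(seed)
init_weights = rng.uniform(-0.5, 0.5, size=9)   # e.g. a small 2-2-1 network

with open("initial_conditions.json", "w") as f:
    json.dump({"seed": seed, "init_weights": init_weights.tolist()}, f)

# Replication: restore the exact same starting point before training.
with open("initial_conditions.json") as f:
    record = json.load(f)
restored = np.array(record["init_weights"])
assert np.array_equal(init_weights, restored)   # float64 round-trips exactly
```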

One hears that the “landscape has surprisingly few local minima and gradient descent doesn’t get stuck” and so on, but if gradient descent just wanders around a basin forever without converging (much as a sequence of complex numbers can wander along the boundary of the Mandelbrot set forever), that does not help. An interesting problem that I haven’t heard about so far :thinking:
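
To make the “wandering forever” worry concrete with a deliberately contrived toy (mine, not from the paper): on f(w) = w² a fixed step size of 1.0 maps w to -w, so gradient descent bounces between two points indefinitely and the loss never decreases.

```python
# Gradient descent on f(w) = w**2 with lr = 1.0: the update
# w <- w - lr * 2w sends w to -w, so the iterate oscillates forever.
w, lr = 3.0, 1.0
for step in range(6):
    grad = 2 * w
    w = w - lr * grad
    print(f"step {step}: w = {w:+.1f}, loss = {w**2:.1f}")
# step 0: w = -3.0, loss = 9.0
# step 1: w = +3.0, loss = 9.0
# ... and so on, without ever converging
```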

This work was taken up in the 1994 paper:

Catastrophic forgetting in connectionist networks

by Robert M. French

Below is an extract:

Those images look interesting. I have an itch to redo this once I figure out how to make the GPU train a network for each pixel; otherwise it’s going to take some time.
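
A batched version is not that bad even on the CPU: every pixel’s network trains in the same vectorized pass, and swapping NumPy for CuPy (or, with small changes, JAX) would push it onto the GPU. A rough sketch below, where the network size, the pair of swept weights, and all hyperparameters are my own assumptions rather than Kolen and Pollack’s exact setup:

```python
# One XOR network per pixel: pixel (a, b) sets two initial weights of a
# 2-2-1 network, all other weights start at fixed values, and the pixel's
# colour is the epoch at which that run first converges.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

res, lr, max_epochs, tol = 64, 0.5, 5000, 0.01
a, b = np.meshgrid(np.linspace(-4, 4, res), np.linspace(-4, 4, res))
n = res * res                                  # one network per pixel

# Fixed starting point for every weight except the two being swept.
W1 = np.full((n, 2, 2), 0.3); W1[:, 0, 0] = a.ravel(); W1[:, 0, 1] = b.ravel()
b1 = np.full((n, 2), 0.1)
W2 = np.full((n, 2, 1), -0.2)
b2 = np.full((n, 1), 0.1)

conv = np.full(n, max_epochs)                  # epoch of first convergence
for epoch in range(max_epochs):
    h = sigmoid(np.einsum('ij,njk->nik', X, W1) + b1[:, None, :])
    out = sigmoid(np.einsum('nik,nkl->nil', h, W2) + b2[:, None, :])
    err = out - y[None, :, :]
    loss = (err ** 2).mean(axis=(1, 2))
    conv = np.where((loss < tol) & (conv == max_epochs), epoch, conv)
    d_out = err * out * (1 - out)
    d_h = np.einsum('nil,nkl->nik', d_out, W2) * h * (1 - h)
    W2 -= lr * np.einsum('nik,nil->nkl', h, d_out)
    b2 -= lr * d_out.sum(axis=1)
    W1 -= lr * np.einsum('ij,nik->njk', X, d_h)
    b1 -= lr * d_h.sum(axis=1)

image = conv.reshape(res, res)  # view with matplotlib's imshow to see basins
```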