Why data augmentation instead of constraints on weights

When we don’t have enough data to train a model, we augment the data by applying transformations to the original data. For the purpose of discussion, let us consider the problem of image recognition, and let us say we use only horizontal flips for data augmentation.
The question is, “instead of augmenting the data with flipped images, why don’t we put constraints on the weights, such that the algorithm with constrained weights categorizes an image and its flipped version as the same class?” (essentially introducing symmetry in the weights because the data also obeys certain symmetries)
To me, constraining the weights seems more natural, and the right thing to do.

Any thoughts?

Interesting question! Well, have you thought about how you would implement your suggestion? How could you achieve that? How much complexity does that add to the back propagation algorithm?

But maybe the more salient question is does it generalize? There are lots of ways to do data augmentation besides horizontal flips, right? How about vertical flips? How would your method deal with random rotations of the images? Or randomly perturbing the color values by small amounts? The point of data augmentation is that you’re not limited to just one recipe. If you want to augment your data, the more the merrier, right? Well, subject to whatever storage and compute cost limitations you may have.
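For concreteness, here is a minimal numpy sketch (my own illustration, names like `augment` are hypothetical) of what “the more the merrier” looks like with flips alone:

```python
import numpy as np

def augment(image):
    """Return the original image plus simple flip augmentations.

    A minimal sketch: horizontal flip, vertical flip, and both combined
    (a 180-degree rotation). Real pipelines add random rotations, crops,
    color jitter, and more.
    """
    return [
        image,
        image[:, ::-1],     # horizontal flip (reverse columns)
        image[::-1, :],     # vertical flip (reverse rows)
        image[::-1, ::-1],  # both flips = 180-degree rotation
    ]

img = np.arange(1, 17).reshape(4, 4)  # a 4x4 toy image
augmented = augment(img)
print(len(augmented))  # 4 training examples from 1 original
```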

This is just my intuition, but it seems to me that your method is too limited to that one particular case. We already have full generality in the ability of back propagation to learn from what the cost function tells us, and it’s easy to come up with more techniques for data augmentation. In other words, we don’t really have to come up with any new algorithms on the back propagation or cost function side to handle all those cases, whereas your method may only be applicable in a few of the obvious symmetry cases and adds more algorithmic complexity to the process. But this is an experimental science: you could actually implement your method for horizontal and vertical flips and then do a “side by side” comparison of the plain vanilla training versus your “symmetry enhanced” version and see what happens. Maybe my intuitions stated above turn out to be wrong. It’s happened before!


\underline{\textbf{Implementation}} \textbf{:}
Say we have a 4\times4 image, which in matrix form, is \begin{bmatrix} 1 & 2 & 3 & 4 \\ 5 & 6 & 7 & 8\\ 9 & 10 & 11 & 12\\ 13 & 14 & 15 & 16 \end{bmatrix}.

Its horizontally flipped image is the matrix \begin{bmatrix} 4 & 3 & 2 & 1 \\ 8 & 7 & 6 & 5 \\ 12 & 11 & 10 & 9\\ 16 & 15 & 14 & 13 \end{bmatrix}.

When the image and its horizontally flipped version are fed into a NN, the output should be exactly the same for both, since they belong to the same class. This is possible only if the outputs from the previous layer are also exactly the same. Extending this argument backwards (through the NN), we can say that every neuron of the NN should produce identical outputs for the image and its flipped version.
Now let us say that w_{ij} are the weights of the i^{th} neuron in the first layer. The output of every neuron should be the same for the input image and its flipped version.
That is, before the non-linear activation for the i^{th} neuron, we require

w_{i1}(1)+w_{i2}(2)+w_{i3}(3)+w_{i4}(4)+\dots = w_{i1}(4)+w_{i2}(3)+w_{i3}(2)+w_{i4}(1)+\dots

The above equality should hold for any image, not just the example image discussed above. This is possible only if w_{i4}=w_{i1} and w_{i3}=w_{i2}.
Similarly, the following conditions should also be satisfied:
\begin{eqnarray}w_{i8}&=&w_{i5}, w_{i7}=w_{i6}\\w_{i12}&=&w_{i9}, w_{i11}=w_{i10}\\w_{i16}&=&w_{i13}, w_{i15}=w_{i14}\end{eqnarray}
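As a sanity check (a numpy sketch of mine, not part of any course code), we can verify that a weight vector built to satisfy these tied-weight constraints produces identical pre-activations for an image and its horizontal flip:

```python
import numpy as np

rng = np.random.default_rng(0)

# Free weights: two per row of the 4x4 image (columns 1-2); columns 3-4
# are mirrored copies, per the constraints w_i4 = w_i1, w_i3 = w_i2, etc.
free = rng.standard_normal((4, 2))    # one pair of free weights per image row
W = np.hstack([free, free[:, ::-1]])  # mirror them to fill columns 3 and 4
w = W.reshape(-1)                     # flatten to the 16 weights w_{i1..i16}

img = np.arange(1, 17).reshape(4, 4)  # the example image from the post
flipped = img[:, ::-1]                # its horizontal flip

z_orig = w @ img.reshape(-1)          # pre-activation for the image
z_flip = w @ flipped.reshape(-1)      # pre-activation for the flipped image
print(np.isclose(z_orig, z_flip))     # True: identical by construction
```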

So, during initialization of the weights, we randomly initialize only the weights w_{i1},w_{i2},w_{i5},w_{i6},... and set the remaining weights equal to these initialized weights as per the constraints above. That is, we initially set w_{i4} equal to w_{i1}, w_{i3} equal to w_{i2}, and so on.

Even during backpropagation, we update only w_{i1},w_{i2},w_{i5},w_{i6},... using gradient descent. The remaining weights are updated as per the above constraints. That is, the updated w_{i4} is set equal to the updated w_{i1}, the updated w_{i3} is set equal to the updated w_{i2}, and so on.
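One way to sketch that update in numpy (names like `symmetric_update` and `groups` are mine, not an established API): sum the gradient over each tied pair, since a shared parameter receives contributions from every position it occupies, and apply the same step to both members:

```python
import numpy as np

# Tied index pairs for the horizontal-flip constraint on a flattened
# 4x4 input: within each image row, column c is tied to column 3 - c
# (0-indexed), i.e. w_i1 with w_i4 and w_i2 with w_i3, row by row.
groups = [[4 * r + c, 4 * r + (3 - c)] for r in range(4) for c in range(2)]

def symmetric_update(w, grad, lr=0.1):
    """Gradient step that keeps tied weights equal (a sketch).

    A shared parameter receives gradient contributions from every
    position it occupies, so we sum the gradient over each tied pair
    and apply the same update to both members.
    """
    w = w.copy()
    for pair in groups:
        g = grad[pair].sum()
        w[pair] -= lr * g
    return w

w = np.zeros(16)
grad = np.arange(16, dtype=float)  # a made-up gradient for illustration
w = symmetric_update(w, grad)
print(w[0] == w[3], w[1] == w[2])  # True True: tied partners stay equal
```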
Generalization is discussed in the next comment.

\underline{\textbf{Generalization}} \textbf{:}
Similar to the constraints derived for the case of the horizontal flip (discussed in the comment above), we get the following constraints if we include only the vertical flip:
\begin{eqnarray} w_{i13} &=& w_{i1}, w_{i9} = w_{i5}\\ w_{i14} &=& w_{i2}, w_{i10} = w_{i6}\\ w_{i15} &=& w_{i3}, w_{i11} = w_{i7}\\ w_{i16} &=& w_{i4}, w_{i12} = w_{i8}\\ \end{eqnarray}

If we include both the horizontal and vertical flips, the constraints are
\begin{eqnarray} w_{i16} &=& w_{i13} = w_{i4} = w_{i1}\\ w_{i15} &=& w_{i14} = w_{i3} = w_{i2}\\ w_{i12} &=& w_{i9} = w_{i8} = w_{i5}\\ w_{i11} &=& w_{i10} = w_{i7} = w_{i6}\\ \end{eqnarray}
So, we initialize and update (using gradient descent) only the weights w_{i1},w_{i2},w_{i5},w_{i6}. The rest of the weights are populated using the above constraints.
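A small numpy sketch of that expansion (the helper name `expand_both_flips` is mine): build all 16 weights from the 4 free ones by mirroring the top-left 2\times2 block, then check that the pre-activation is unchanged under both flips and their composition:

```python
import numpy as np

def expand_both_flips(free):
    """Build all 16 weights from the 4 free ones (w_i1, w_i2, w_i5, w_i6)
    under the combined horizontal + vertical flip constraints.

    Sketch: place the free 2x2 block in the top-left quadrant, then
    mirror it across the vertical and horizontal center lines.
    """
    a, b, c, d = free
    top_left = np.array([[a, b], [c, d]])
    top = np.hstack([top_left, top_left[:, ::-1]])  # mirror columns
    W = np.vstack([top, top[::-1, :]])              # mirror rows
    return W.reshape(-1)

w = expand_both_flips([1.0, 2.0, 3.0, 4.0])
img = np.arange(1, 17, dtype=float).reshape(4, 4)
# pre-activation is identical for the image, both flips, and the
# 180-degree rotation (horizontal flip composed with vertical flip)
zs = [w @ x.reshape(-1) for x in (img, img[:, ::-1], img[::-1, :], img[::-1, ::-1])]
print(np.allclose(zs, zs[0]))  # True
```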

Fun fact: when we take care of horizontal and vertical flips, rotation by 180 degrees is also automatically taken care of, since it is the composition of the two flips. (A flip over the diagonal, by contrast, would impose additional constraints.)

Horizontal and vertical flips are discrete transformations, and the constraints on weights due to them can be easily derived. Deriving the constraints due to continuous transformations like horizontal shifts, vertical shifts, rotations, etc., needs a little more work, but it is not impossible. In fact, if one takes some time to think about it, one realizes that every transformation used in data augmentation (for image classification) corresponds to a constraint on the weights.
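As a small illustration of that claim (my own sketch, not a full derivation): requiring invariance under cyclic horizontal shifts ties all the weights within each image row together, which we can check numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invariance under cyclic horizontal shifts forces all weights within
# each image row to be equal (the shift orbit of a pixel is its row).
row_weights = rng.standard_normal(4)            # one free weight per row
W = np.repeat(row_weights[:, None], 4, axis=1)  # tie all columns together
w = W.reshape(-1)

img = rng.standard_normal((4, 4))
shifted = np.roll(img, shift=1, axis=1)  # cyclic shift one pixel right

# Row-constant weights only see row sums, which the shift preserves.
print(np.isclose(w @ img.reshape(-1), w @ shifted.reshape(-1)))  # True
```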

\underline{Side-note}: I have a computational fluid dynamics background. It is very common for us to work with symmetries, and symmetries often ease our problems. Constraints are also very common for us. For example, the numerical velocity field (typically of millions of dimensions) of an (incompressible) fluid flow should always satisfy a constraint – it should be divergence free. This background motivated me to ask the very question being discussed in this post.

\underline{Conclusion}: Of course, the obvious next step is to get my hands dirty and do some work. That is, do a side-by-side comparison of my method with plain vanilla training, as you have mentioned. Hopefully, I will sail through. (I am new to deep learning, so there are a lot of basics to learn before I attempt to re-invent the wheel.)

Thanks for fleshing out your ideas to the next level. I’m not sure I buy the argument that you could extend this idea to rotations by an arbitrary (random) angle, but maybe it is not required that you handle every possible augmentation. Just to make sure we’re on the same page here, your model has to be general in the sense that you don’t get one model that handles just one type of augmentation and a different one for other types of input, right? Whatever you are doing will be used with all input images. But maybe it will be a win if you just implement the horizontal and vertical flips plus the (free) 180-degree rotation: that will at least automatically handle those symmetries.

I think it actually wouldn’t be that hard to implement what you are describing. Just add a step at the end of applying the gradients that replicates the upper left quadrant of the matrix to the other three quadrants. In other words, you let back propagation do its thing and then override the results by forcing the symmetries ex post facto. You’d be throwing away some of the work, but maybe you can still get convergence. Give it a try and it will be really interesting to see what happens!
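A hedged numpy sketch of that “override ex post facto” idea (the helper name `symmetrize` is mine): after each unconstrained update, replace the weight matrix by the average over its flip images, which exactly re-imposes the symmetry:

```python
import numpy as np

def symmetrize(W):
    """Re-impose horizontal + vertical flip symmetry after a gradient step.

    Averaging each weight with its three mirror images projects the
    weight matrix onto the flip-symmetric subspace; copying one
    quadrant over the other three, as suggested above, would work too.
    """
    return (W + W[:, ::-1] + W[::-1, :] + W[::-1, ::-1]) / 4.0

# Pretend these are the weights right after an unconstrained update.
W = np.random.default_rng(1).standard_normal((4, 4))
W_sym = symmetrize(W)
print(np.allclose(W_sym, W_sym[:, ::-1]))  # True: horizontal-flip symmetric
print(np.allclose(W_sym, W_sym[::-1, :]))  # True: vertical-flip symmetric
```

Averaging rather than copying one quadrant is a design choice: it is the orthogonal projection onto the symmetric subspace, so no single quadrant’s gradient work is privileged.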

Well, with a little more thought, maybe it’s not so easy to do this. One thing to be aware of is that if you are talking about the Fully Connected Feed Forward networks in DLS Course 1, then take a look at how the input images are handled: they are “unrolled” or flattened into vectors, so the geometric symmetries are not so easy to express. They are still buried in the data, but the bookkeeping is harder. Note that there are two different orders in which you can do the flattening, so your symmetry method would need to know which one you are using. See this thread for more information, and read the whole thread down to the section that discusses order = “F”.
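To make the flattening-order point concrete, here is a small numpy illustration (`order="C"` and `order="F"` are numpy’s row-major and column-major conventions) showing that a horizontal flip permutes the flattened vector differently in each order:

```python
import numpy as np

img = np.arange(1, 17).reshape(4, 4)
flipped = img[:, ::-1]  # horizontal flip

# Row-major ("C") flattening: the flip reverses each 4-element row block.
print(flipped.flatten(order="C"))  # 4 3 2 1, then 8 7 6 5, ...

# Column-major ("F") flattening: the same flip instead reverses the
# order of the 4-element column blocks, so the tied weight indices
# differ from the "C" case.
print(img.flatten(order="F"))      # 1 5 9 13, then 2 6 10 14, ...
print(flipped.flatten(order="F"))  # 4 8 12 16, then 3 7 11 15, ...
```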

In Convolutional Nets (DLS Course 4), the geometry of the input tensors is preserved. But there the transformation being applied is not so straightforward: they are movable filters that are applied serially across the geometry of the input. More thought required to see how one would encode the symmetries in that case.

The other high level point here is that what we are doing here is fundamentally different from Fluid Dynamics. Mind you, I never took any Fluid Dynamics, but I did get as far as Hamiltonians and the Calculus of Variations in Intermediate Physics (but it was a very long time ago). There you are solving differential equations in very high dimensions. Here we are also in very high dimensions, but there are no differential equations governing the behavior. Our constraint is the Cost Surface created by the Loss Function we have chosen. Here’s a thread that gives a link to a really interesting paper about that from Yann LeCun’s group that’s worth a look to get a sense of what the solution spaces look like.

Why would you manipulate the weights, which are the learnable parameters of the model? I know that in certain cases you do manipulate the weights, but that is outside intervention, and I wouldn’t say it is entirely healthy. Wouldn’t it be better for the model to have access to more labeled inputs in different scenarios, without manual outside intervention in the learnable parameters? Plus, I don’t think the output for an image and its flipped version is the same: it is a class probability, and the two images are not the same.