In week 1 of Course 2 of the TensorFlow Developer specialization, we practice augmenting images using the “shift” feature (among others, like rotate, zoom, etc.).
Since CNNs are already translation-invariant, what is the point of augmenting this way?
Lawrence Moroney mentions in a video that if most of our training subjects are centered in the image, the model might not perform as well on unseen images where the subject is off-center. I don’t believe that’s true, because a CNN’s sliding filter runs at every location of the image, right?
(I know CNNs are not rotation-, scale-, or skew-invariant, so all the other augmentation methods seem valid to me.)
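For context, the augmentation in question looks roughly like this (a sketch of that week's setup; the exact parameter values here are my own assumption, not necessarily the course's):

```python
import tensorflow as tf

# Randomly shift, rotate, zoom, etc. each training image on the fly.
# width/height_shift_range move the image by up to 20% of its size.
datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=40,       # degrees
    width_shift_range=0.2,   # the "shift" feature in question
    height_shift_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode="nearest",     # how to fill pixels exposed by a shift
)
```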
This reminds me of a story from the military: they were trying to detect tanks in a field, and after training the model and getting good training results, they put it to use. In the field it found no tanks on rainy days, but on sunny days it reported a tank present even when there was none.
The problem was that the model had really learned to detect whether the sky was clear or not. Coming back to your question, I think Lawrence is right: the position of the object matters. A CNN doesn’t see like humans do. The earlier layers learn lower-level features (edges, textures) and the layers towards the end learn higher-level ones, but if the position of the object changes, the activations feeding the final layers change as well, so the CNN needs to see in training the kinds of variation it will face in reality.
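As a quick illustration of the position point (my own toy sketch, not from the course): a typical Conv → Flatten → Dense classifier has separate Dense weights for every spatial location, so the same object at a new position hits entirely different weights.

```python
import numpy as np
import tensorflow as tf

tf.random.set_seed(0)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8, 8, 1)),
    tf.keras.layers.Conv2D(4, 3, padding="same", activation="relu"),
    tf.keras.layers.Flatten(),                    # ties features to positions
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

img = np.zeros((1, 8, 8, 1), dtype=np.float32)
img[0, 2, 2, 0] = 1.0                             # a tiny "object"
shifted = np.roll(img, 3, axis=2)                 # same object, 3 columns right

# With untrained weights the two scores differ, and nothing in the
# architecture forces them to agree: the network has to learn each
# position, which is exactly what shift augmentation exposes it to.
print(model(img).numpy(), model(shifted).numpy())
```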
Hmm, wow. Besides it intuitively seeming like it should work that way, since the sliding window covers all locations of the image equally, I also believed the many articles online which imply that invariance is common knowledge. For example, here’s one.
Even the first few paragraphs of the paper you cited quote experts who take that invariance for granted. (However, the authors go on to argue that the invariance has been overstated.)
So I guess you are right! This article says CNNs are translation-equivariant, which is different from translation-invariant. In short, equivariant means that when the input shifts, the outputs are the same values but in correspondingly shifted locations; invariant would mean the output doesn’t change at all.
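To convince myself, here's a minimal check of equivariance (my own sketch): shifting the input of a conv layer just shifts its output, away from the borders.

```python
import numpy as np
import tensorflow as tf

# A single 3x3 conv filter with fixed weights, no bias.
conv = tf.keras.layers.Conv2D(1, 3, padding="same",
                              kernel_initializer="ones", use_bias=False)

img = np.zeros((1, 8, 8, 1), dtype=np.float32)
img[0, 2, 2, 0] = 1.0                        # one bright pixel
shifted = np.roll(img, 2, axis=2)            # same pixel, 2 columns right

out_img = conv(img).numpy()
out_shifted = conv(shifted).numpy()

# Equivariance: conv(shift(x)) == shift(conv(x)), ignoring border effects.
print(np.allclose(np.roll(out_img, 2, axis=2), out_shifted))  # True
```

So the feature maps track the object wherever it goes, but any invariance in the final prediction has to come from pooling or from training, which is where the shift augmentation earns its keep.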
Learned something today, thank you. I will dutifully proceed with shifting the images for augmentation.
That’s a useful anecdote. I’ve read of similar stories where, say, a product on an assembly line could not be identified correctly once it was actually on the line, because all the training photos had been taken off the line. I’ve seen some synthetic-data techniques recommend, for that reason, putting some percentage of the objects on crazy backgrounds (outer space, the ocean, neon colors) instead of only the backgrounds you expect your real-world data to contain.
Constructing a good training set is really difficult.
Early self-driving cars had a habit of suddenly swerving toward trees, because the training sets were created by recording human drivers. This included sequences where a human-driven car deliberately steered toward an obstacle in order to record how to correct the deviation.