Finding patterns of features within samples

I ran into a rather basic question: how will a neural network be able to spot patterns within individual samples in the training set? Does it take some synthetic features to guide it to do that?

For a very simple example, let’s say we have two inputs, x1 and x2, and the output y is “1” if they are the same and “0” if they are not.

Wouldn’t the NN fail miserably at recognizing this? Wouldn’t the training be limited to the specific value patterns that x1 and x2 are set to, e.g., (1,1), (2,2), (3,3), (10,10), (11,11), (12,12), (1,5), (2,5), (3,5), etc.? That is, the neural network won’t be able to recognize the simple fact that the output is “1” whenever the two values are the same, and may well fail to predict correctly if we pass it (100,100), or for that matter an in-between value the training set hasn’t encountered, say (5,5)?

So what’s the solution to this problem? I suppose I could create a ‘synthetic’ third feature that’s true if x1 == x2 and false if x1 != x2. Is that what is normally done? The problem in that case, however, is that there would be a combinatorial explosion of possible equalities in larger sets of features: with n features there are n(n-1)/2 pairs to compare. Right? Also, what if we want to do other comparisons, x1 < x2, x1 > x2, etc.? That’s another large set of combinations that would have to be enumerated into new synthesized features.
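Just to put numbers on that explosion, a quick back-of-the-envelope count:

```python
# Count of the proposed synthetic features: every unordered pair of
# raw features, times each comparison type (==, <, >).
from itertools import combinations

n = 100                                  # number of raw features
pairs = list(combinations(range(n), 2))  # n * (n - 1) / 2 unordered pairs
print(len(pairs))                        # 4950
print(3 * len(pairs))                    # 14850 synthetic columns for ==, <, >
```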

Are there other approaches to this problem?

Thanks!

It is not typically necessary to resort to “synthetic features” as you describe; at least I can’t recall Prof Ng ever discussing anything like that here. It may seem counterintuitive, but back propagation is a powerful method for learning what works and what doesn’t in terms of pattern recognition. The key is to start with a cost function that realistically measures the effectiveness of the network’s predictions. Once you have that, the question is just finding a network architecture that can represent enough complexity to detect whatever the patterns in question are: you define the number of layers, the number of neurons per layer, and other “hyperparameter” choices like the activation functions in the hidden layers. Then you start with randomly initialized weights, run the training, and see what can be learned. If that seems too much like “magic”, I agree it’s not a priori obvious that it should work, but it turns out that it does in a large number of cases. I suggest you “hold that thought”, listen to what Prof Ng says throughout this course, and see how the solutions work in the exercises, particularly in Week 4.
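To make those steps concrete, here is a minimal sketch of that workflow applied to your “equal” example. This is my own illustration, not from the course; I’m assuming TensorFlow/Keras and a generated dataset purely for concreteness:

```python
# Minimal sketch of the workflow: generate data, pick an architecture and
# cost function, start from random weights, train, and inspect predictions.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)

# Labeled pairs in the range 0..19, roughly half of them equal.
x1 = rng.integers(0, 20, size=(2000, 1))
x2 = np.where(rng.random((2000, 1)) < 0.5, x1, rng.integers(0, 20, size=(2000, 1)))
X = np.hstack([x1, x2]).astype("float32")
y = (x1 == x2).astype("float32")

# Architecture choices (layers, neurons, activations) are hyperparameters.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
# The cost function: cross-entropy realistically scores the predictions.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Weights are randomly initialized under the hood; backprop does the rest.
model.fit(X, y, epochs=200, verbose=0)

# (5, 5) is inside the training range; (100, 100) is far outside it, so it
# probes the extrapolation worry from the question directly.
print(model.predict(np.array([[5.0, 5.0], [3.0, 9.0], [100.0, 100.0]], dtype="float32")))
```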


If you follow the steps in Paul’s reply, then for your “greater than” problem, after some trials you might arrive at a neural network that works like sigmoid(w(x_1 - x_2) + b), which requires only one layer and one neuron with a sigmoid activation. I hope that by inspection you agree such a network should work for that problem, and it is likely that the parameter b will be trained toward zero. Your “equal” problem needs just a little more capacity, because the output must be high exactly when x_1 - x_2 is zero: a tiny hidden layer that effectively computes |x_1 - x_2| (two ReLU units will do), followed by one sigmoid neuron, handles it.

I want to emphasize that my response above is NOT suggesting that you have to work out formulas like these before you start the steps Paul mentioned. Rather, if you go through the steps, a good, trained model will end up looking like those formulas. I want to assure you that the “equal” and “greater than” problems aren’t difficult for a neural network at all, and, like Paul said, neural networks work in a large number of cases :wink: . You only need to make a move and start going through the steps!
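For illustration only, here is a hand-wired sketch of what such trained solutions can look like, in plain numpy. The weights are values I picked by hand to make the point, not trained ones:

```python
# Hand-wired sketch: network weights chosen by hand, not learned, to show
# the kind of function training can converge to.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

def greater_than(x1, x2, w=10.0, b=0.0):
    # One neuron: sigmoid(w * (x1 - x2) + b). A large w sharpens the
    # decision boundary; training tends to drive b toward zero.
    return sigmoid(w * (x1 - x2) + b)

def equal(x1, x2):
    # Two ReLU units compute |x1 - x2|; the output neuron fires only
    # when that distance is (near) zero.
    dist = relu(x1 - x2) + relu(x2 - x1)   # equals |x1 - x2|
    return sigmoid(-10.0 * dist + 5.0)     # ~1 at dist 0, ~0 at dist >= 1

print(greater_than(7, 3))   # ~1.0
print(greater_than(3, 7))   # ~0.0
print(equal(5, 5))          # ~0.99
print(equal(100, 100))      # ~0.99: the rule generalizes, not the values
print(equal(3, 9))          # ~0.0
```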

Cheers,
Raymond


Thanks for the replies! From a previous course (the ML Specialization) I gathered the intuition that a NN is really about finding patterns in the data, and is not good at extrapolation (or interpolation). Is that not true?

By default, the optimization is set up to optimize “classification” of the training data, i.e., the actual points, without regard to any relations between them. In the general case, the resulting network may draw non-linear boundaries between and around those points, excluding points that fall outside the boundaries, and those boundaries may well even separate sets of the training points.

It seems that the cost function just doesn’t optimize for the two properties being the same, only for specific individual values. If my domain has equality as an “implicit feature”, it’s just not captured at all, not unless I synthesize a feature that captures the comparison of the two features explicitly.

Sorry if that sounds like I’m repeating what I said earlier, just in different words :slight_smile:

I don’t know if I’m missing something, but I guess one way I could test it out is to take a reasonably complex example, one with a similar requirement, and create a model both with and without the “synthetic” feature. Have you guys come across a public data set that we could use to test this out?
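In the meantime, one way to set up that experiment without a public dataset would be to generate the pairs. A rough sketch of what I have in mind (assuming TensorFlow/Keras; the ranges and sizes are arbitrary choices of mine):

```python
# Sketch of the proposed experiment: train on raw (x1, x2) vs. on the same
# data with an explicit x1 == x2 column appended, then compare accuracy on
# values that lie entirely outside the training range.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(1)

def make_pairs(low, high, n):
    x1 = rng.integers(low, high, size=(n, 1))
    x2 = np.where(rng.random((n, 1)) < 0.5, x1, rng.integers(low, high, size=(n, 1)))
    X = np.hstack([x1, x2]).astype("float32")
    return X, (x1 == x2).astype("float32")

def add_synthetic(X):
    # The hand-built "equality" feature proposed above.
    return np.hstack([X, (X[:, :1] == X[:, 1:2]).astype("float32")])

def accuracy(X_train, y_train, X_test, y_test):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(X_train.shape[1],)),
        tf.keras.layers.Dense(8, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X_train, y_train, epochs=200, verbose=0)
    return model.evaluate(X_test, y_test, verbose=0)[1]

X_tr, y_tr = make_pairs(0, 20, 2000)    # training range: 0..19
X_te, y_te = make_pairs(50, 100, 500)   # test range never seen in training

print("raw features:       ", accuracy(X_tr, y_tr, X_te, y_te))
print("with x1 == x2 added:", accuracy(add_synthetic(X_tr), y_tr, add_synthetic(X_te), y_te))
```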