Convolution Confusion (Filters) C4W1

Preface: We didn’t have to deal with this in this assignment, but I am presuming the initialization of the weights here happens in the same way as in Course 2? (i.e. small and random, not zero, to avoid collapse.)

So, throughout the lectures it is rather clear what is happening where Andrew uses explicit filters (i.e. horizontal and vertical). Perhaps I even had this question in the back of my mind for the first few courses, though now it comes into focus since we can literally ‘see it’. And it particularly stands out here because the convolutions and pooling serve almost as a kind of ‘compression’, in a sense.

And moving away from explicit filters, the idea is to ‘let the network figure it out’.

What I am having a hard time wrapping my mind around, though, especially in the first few layers: where is our assurance that the features it is picking up on are actually useful? Granted, overall we are trying to minimize the cost function-- though I guess this also assumes there even is a smooth function over the images we are considering?

Sorry if I am not expressing this well (which is why it is a question/confusion), but let’s say the first layer picks up on ‘something’ that minimizes cost for a time, yet in the end it turns out not to be the most defining feature of the image; but now all the deeper layers are dependent on it.

How do we kick ourselves out of that ‘feedback loop’?

1 Like

The weight initialization occurs automatically when the layers are instantiated by TensorFlow (or sklearn, whatever tools you’re using). It’s built-in. Yes, they’re small random values. What we’re avoiding is symmetry among the weights, not ‘collapse’.
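For what it’s worth, here is a minimal sketch of where that happens in Keras (the sizes are arbitrary; glorot_uniform and zeros are the documented defaults for the kernel and bias):

    import tensorflow as tf

    # Keras layers carry their own initializers. The kernel default is Glorot
    # (Xavier) uniform: small random values scaled by fan-in/fan-out, so no two
    # filters start out identical. Biases start at zero.
    layer = tf.keras.layers.Conv2D(
        filters=8, kernel_size=3,
        kernel_initializer="glorot_uniform",  # the default, written out explicitly
        bias_initializer="zeros")             # also the default

    # The weights are only created when the layer is built on an input shape.
    layer.build(input_shape=(None, 64, 64, 3))
    kernel, bias = layer.weights
    print(kernel.shape)   # (3, 3, 3, 8) -- small random values
    print(bias.shape)     # (8,)         -- zeros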

There is none, other than that the weights are adjusted to minimize the cost. Features which are important will have larger-magnitude weights than those which are not (unimportant weights will trend toward zero).

Yes. You choose the “smooth function” by the loss function you select. For example, you can pick a least-squares error for linear outputs, or various types of categorical loss functions for classification.
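A quick illustration (the models are throwaway placeholders) of where that choice shows up in code:

    import tensorflow as tf

    # The loss you compile with defines the surface that gradient descent descends.
    regressor = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    regressor.compile(optimizer="adam",
                      loss="mse")   # least-squares error for a linear output

    classifier = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
    classifier.compile(optimizer="adam",
                       loss="sparse_categorical_crossentropy",  # one of several categorical losses
                       metrics=["accuracy"])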

There isn’t a feedback loop. All of the weights are adjusted on every iteration. It doesn’t train the early layers first and the other layers later.

It is possible to get a non-optimal solution - because the NN cost functions are not convex. So you may have to train several times (with different initial weights) and pick the solution that gives the best performance.
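In code, that amounts to something like this sketch, where build_model is a hypothetical function that returns a freshly initialized, compiled model:

    import tensorflow as tf

    def best_of_n_runs(build_model, x_train, y_train, x_val, y_val, n_runs=5):
        """Train from several random initializations and keep the best run."""
        best_model, best_val_loss = None, float("inf")
        for seed in range(n_runs):
            tf.keras.utils.set_random_seed(seed)   # different initial weights each run
            model = build_model()
            model.fit(x_train, y_train, epochs=10, verbose=0)
            results = model.evaluate(x_val, y_val, verbose=0, return_dict=True)
            if results["loss"] < best_val_loss:
                best_model, best_val_loss = model, results["loss"]
        return best_model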

2 Likes

@TMosh thank you for your feedback, and let me muse on this a bit. Sometimes I have to ask stupid questions until I understand, but I figure that is part of learning, not just ‘accepting’.

1 Like

@Nevermnd, I think these are very interesting and inspiring questions! We are questioning a scenario where the first layer’s filters may be suboptimal so that, for example, three filters are doing a job that could have been done by one filter. If there were an additional feedback mechanism to somehow coordinate those three filters and combine them, it would make the network simpler. That is a nice-to-have feature, isn’t it? However, this is just not how our neural network works, and I agree with Tom’s response.

Cheers!

@rmwkwok Dear Raymond,

Yes, I just want to make sure I understand. Thinking about this question also brought to mind an interesting article I read a couple of years ago:

https://www.csail.mit.edu/news/why-did-my-classifier-just-mistake-turtle-rifle

I mean, obviously what we see here is a turtle, but those are not the features the network is picking up on. So while on some level it would at least be nice to imagine we are deducing edges first, then mouth, nose, then faces, etc., it seems far less certain there is any guarantee this is what is actually happening.

Also, perhaps I am not expressing it properly, but isn’t the whole network just a giant feedback loop? I mean, forward/back prop is just a feedback loop over the data (i.e. as the weights get adjusted).

Again, this is toward a better understanding: considering that the computation occurs in a linear manner (not a massively parallel one-- meaning you calculate the results of each layer one at a time, in order), is there a way for the deeper layers to say, in effect, ‘Whoops! We have a problem’ and thus influence the earlier layers?

I mean maybe this happens in the back prop, though I just am not seeing it yet…

Or, again, at least in terms of the ConvNet, we are explicitly giving deeper layers (hopefully) more ‘knowledge’ but less data.

Obviously people have found all this works ‘pretty okay’-- so I am not questioning that; it just seems a little strange, and I am trying to understand it better.

1 Like

Hi, Anthony.

Lots of interesting thoughts there! Firstly, thanks very much for the MIT article. Totally interesting. I had heard the term “adversarial example”, but had never looked into the implications. The article was a bit frustrating in that it was written for the general audience, so all you get is “these guys have done some very cool work”. Now we have to go find their actual papers and learn the details. But that one idea of the “non-robust features” that the algorithm can detect, but which don’t trigger the human visual cortex, does hit home in a useful way: so that means there’s a strategy for modifying images in a way guaranteed to get the wrong answer from the human’s point of view, but which is actually the right answer. That idea plus the AI advances since 2019 and it sounds like trouble! :scream_cat: It would be interesting to read their papers and see examples of that type of “non-robust feature”.

Well, it isn’t the network that’s a feedback loop: it’s the training process. But even then, the key point is (as you say) that it’s just the weights that change in the feedback process, not the initial data. But that gives rise to another interesting thought: remember back to DLS C1 W4 where Andrew is finally giving us the fully general case of forward and back propagation for the L layer fully connected net: he points out that at each hidden layer l, you are computing dW^{[l]} and db^{[l]} and dA^{[l-1]}. And then the dA^{[l-1]} is input to the process at layer l - 1. But then when you hit layer 1 (the first hidden layer), you get dA^{[0]} as part of the output and A^{[0]} = X the input data by definition, right? Andrew makes the comment that we can’t change the data, so we just discard dA^{[0]}. But maybe you could make it a full feedback loop and change the data? What are the implications of that? Maybe that’s the type of thing that the “adversarial examples” teams are doing? Gotta go read those papers!
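For reference, the layer-l step from C1 W4 is roughly (in the course’s notation):

dZ^{[l]} = dA^{[l]} * g^{[l]\prime}(Z^{[l]})
dW^{[l]} = \frac{1}{m} dZ^{[l]} A^{[l-1]T}
db^{[l]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)}
dA^{[l-1]} = W^{[l]T} dZ^{[l]}

At l = 1 the last line produces dA^{[0]}, i.e. the gradient of the cost with respect to the input pixels themselves -- exactly the quantity you would need if you wanted to modify the data rather than discard it.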

You’re right that you can’t parallelize forward or back prop across layers, but there is a lot of parallelization possible within the layers: GPUs do both vectorization and parallelization of matrix multiplies and that’s all we’re basically doing here.

But note that all the networks we’ve seen so far are “simply connected” and the graph goes only in one direction. Also note that a deep layer has no notion of whether its intermediate answer is good or bad, right? You don’t find that out until you hit the cost function and that’s at the very end of course. So I’m not sure that the idea of having non-simply connected graphs where later layers loop back to earlier ones really makes any sense. But of course this could easily just be a lack of imagination on my part. :laughing: Maybe you’re onto something here!

One other thought here is that you’ll see some really interesting material in C4 W4 in the lecture “What are Deep ConvNets Learning”. Andrew shows us some very cool work where researchers instrumented the internal nodes of a ConvNet to see what inputs trigger the strongest signal at a given neuron. So you get a view of what’s actually been learned by the network.
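If you want a taste of that now, here is roughly the kind of probe involved, sketched on a toy, untrained model (in the real work this is done on a trained network, and the layer names here are made up):

    import tensorflow as tf

    # Toy ConvNet, just to have named internal layers to probe.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(64, 64, 3)),
        tf.keras.layers.Conv2D(8, 3, activation="relu", name="conv_a"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(16, 3, activation="relu", name="conv_b"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    # A second model that exposes the internal feature maps.
    probe = tf.keras.Model(inputs=model.inputs,
                           outputs=[model.get_layer("conv_a").output,
                                    model.get_layer("conv_b").output])

    images = tf.random.uniform((4, 64, 64, 3))   # stand-in for real images
    acts_a, acts_b = probe(images)
    print(acts_a.shape, acts_b.shape)   # see which inputs light up which filters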

We will soon (DLS C4 W2) see the first example of a compute graph that is not simply connected, so stay tuned for that. But (spoiler alert) it’s still only going forward. Then in C5 Sequence Models, we’ll see some bidirectional networks. So definitely stay tuned for that as well!

1 Like

Hello Anthony @Nevermnd,

I think there are some points that we could pause and think:

Let’s consider the “turtle shell”, as it is one of the more distinctive features of a turtle.

What we need to think about is:

“even if a group of the filters in a CNN is guaranteed (or proved) to detect some turtle shells, can we make a turtle shell that is not detectable by the group?”

“If so, is such a guarantee what you are thinking about? If that is not the guarantee, then what is? Literally any turtle shell? And what does “any turtle shell” imply about the requirements on our training set?”

Even if we can’t come up with a perfect hypothesized guarantee, it is still important that we come up with something close to it, something made to the best of our ability. Try? :wink:

As Paul explained, it is during the training process that we see (negative) feedback, not at inference time, which is what your article is concerned with - the adversarial attack happens after training is done.

Here is the million-dollar question that I wish you would pause for 10 seconds and think about:

“If a wizard were here to let you achieve whatever you hope for, exactly what kind of problem (the ‘whoops’ we mentioned) would you want fed back, and at what time (training or inference)?”

Again, just try, just one example. It doesn’t have to be very well thought out, but it needs to be concrete, ideally with some tangible examples explainable to others. Note that, without thinking outside the box, during training we only have training samples and an improving neural network, and at inference time we only have the samples to be inferred and a well-trained network. Of course, if we think outside the box, we can start to assume something our current neural network approach does not have.

I hope you have some answers by now. With those answers in mind, are our current neural networks built for that?

During training, we feed back, specifically, only the training samples’ errors to update the weights. Is feeding back training-sample errors your answer to an adversarial attack at inference time?

There are two facts here: we have excellent CNNs, and we have proven adversarial examples. Given them, my questions above are important, because if your answers call for some components that our current neural networks don’t have, then we can depart from there and start thinking about something different or something novel :wink:

Cheers,
Raymond

PS: Paul shared other relevant course materials that are definitely worth visiting for this discussion.

@rmwkwok Yes, thank you for pointing out that this particular cited instance occurs at inference. I know Prof. Ng stresses in particular how there was the ‘old’ way of doing things, like hand-designing features, and that it didn’t turn out to be so good. Okay, I can accept that.

However, in my mind I still feel a little stubborn. My first major was Philosophy, so I have a hard time believing symbolic AI is ‘totally dead’. Or, as it comes across in my mind: yes, it has been shown that neural networks, unguided and on their own, can produce really impressive results-- yet at the same time they increasingly require seemingly ridiculous amounts of data to be effective.

Aside from that alone being resource costly, I am skeptical that in the end we are getting much closer to ‘knowledge’ here. It starts to remind me a bit of the ‘infinite monkey’ problem:

Or, yeah, I guess at some point we’d get Shakespeare.

My present thinking (which may be wrong) is that we have an over-tendency to train on very specific end results-- which I guess makes sense, because then one has a direct product to sell. But none of this generalizes all that well (and notably, I’m not sure I am a proponent of so-called AGI, but I do feel we could be doing what we do now better).

Perhaps the interesting thing in the case given is actually that, well, we still see a turtle, while ‘it’ sees something totally invisible to us-- and perhaps this is where AI can help.

Granted, as I learn I am still developing my thoughts on this, and I don’t even know if there is a way to do this in TensorFlow, but might one initialize the weights in the first few layers, still small and random, but drawn toward a distribution that suggests, say, a horizontal or vertical edge detector? (Something like the sketch below is what I am picturing.) And presumably the weights would work their way back out of that pattern if they didn’t find anything--
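Just to make the idea concrete, this is the kind of thing I am picturing (purely a sketch; the class name and numbers are made up, and I have no idea whether this is the idiomatic way to do it):

    import numpy as np
    import tensorflow as tf

    class EdgeBiasedInit(tf.keras.initializers.Initializer):
        """Small random values centred on a vertical-edge pattern (assumes 3x3 kernels)."""

        def __init__(self, scale=0.05):
            self.scale = scale

        def __call__(self, shape, dtype=None):
            # shape = (kernel_h, kernel_w, in_channels, n_filters)
            vertical_edge = np.array([[1., 0., -1.],
                                      [1., 0., -1.],
                                      [1., 0., -1.]], dtype=np.float32) * self.scale
            base = np.zeros(shape, dtype=np.float32)
            base += vertical_edge[:, :, None, None]     # broadcast over channels/filters
            noise = tf.random.normal(shape, stddev=self.scale)
            return tf.cast(tf.constant(base) + noise, dtype or tf.float32)

    # A biased start; gradient descent is still free to move the filters away from it.
    layer = tf.keras.layers.Conv2D(8, kernel_size=3, kernel_initializer=EdgeBiasedInit())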

Nor am I even suggesting such filters would be the best choice. (On the side of this, I am finally studying DSP, so its concurrence is timely.)

https://www.dspguide.com/

Nor does even such an amendment really fit with what I’d call ‘symbolic AI’.

In the end, these are all just questions as one thinks, works through, learns a little. I’ve never been that great at ‘not questioning authority’.

Best,
-A

1 Like

@paulinpaloalto thank you very much for the feedback-- And, yes, I feel it is important I go through the whole course first. Just trying to nip some of my questions in the bud as they come up. One way or another it is interesting to think about.

Hi Anthony,

Yeah, task-specific machine learning models are the reality, the limitation, and the pragmatic use.

“Symbolic AI” is not my field of interest, but if it is about making use of domain knowledge, then I totally agree with it-- the question is what and how. You might add some untrainable edge detectors at the front layers, alongside other trainable filters (see the sketch below)-- yes, you might, but normally people don’t. And edge detectors are also probably not the answer to misclassifying a turtle as a rifle, right? Domain knowledge is good, but which part of the domain knowledge is critical to clearly differentiate a rifle from a turtle? Maybe it is too obvious for a human to even say what that should be.
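Just so we have something concrete to look at, the “untrainable edge detectors alongside trainable filters” idea would look roughly like this (a sketch only; the sizes and names are made up):

    import numpy as np
    import tensorflow as tf

    # Fixed Sobel kernels, shaped (kernel_h, kernel_w, in_channels, n_filters).
    sobel_x = np.array([[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]], dtype=np.float32)
    sobel_y = sobel_x.T
    edge_kernels = np.stack([sobel_x, sobel_y], axis=-1)[:, :, None, :]   # (3, 3, 1, 2)

    inputs = tf.keras.Input(shape=(64, 64, 1))

    # Untrainable branch: the two Sobel filters, frozen.
    fixed = tf.keras.layers.Conv2D(
        2, 3, padding="same", use_bias=False, trainable=False, name="fixed_edges",
        kernel_initializer=lambda shape, dtype=None: tf.constant(edge_kernels))(inputs)

    # Ordinary trainable branch next to it.
    learned = tf.keras.layers.Conv2D(6, 3, padding="same", name="learned_filters")(inputs)

    x = tf.keras.layers.Concatenate()([fixed, learned])   # 2 fixed + 6 learned channels
    x = tf.keras.layers.ReLU()(x)
    # ... the rest of the network would continue from here
    model = tf.keras.Model(inputs, x)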

On the other hand, we are having too much high-level discussion without touching any code. There is no way to base the discussion on observation, and there is no observation to verify anything.

Maybe this discussion is unnecessarily complicated because we didn’t examine any examples to find out where the problem lies.

The idea of having a model that can be applied to different problems is very interesting. There is a name for this but I just can’t remember it now. If I come across it again some day and if I remember this thread, I will post the name.

Raymond

Thanks Raymond,

I’ve also been a little surprised (at least as this course presents it) that individual neurons in layers aren’t represented as objects, but just as matrices [at least insofar as I have spent enough time wrapping my brain around OOP to see it as much better in other contexts].

Granted, seeing as we are grounded in Python here rather than other languages, doing so would probably make the whole process incredibly slow.

Yet it would also, I think, allow one to better query, say, what a certain subset of neurons in a particular layer is ‘doing’, or to add additional variables or functions for querying them more intelligently.

Not a strict idea here yet-- Still learning.

Matrix algebra has been optimized to a fine degree. That’s why it’s used for machine learning (and why GPUs are so useful for machine learning-- they’re matrix algebra machines).

Treating each neuron as an object would incur lots of runtime overhead that doesn’t provide value.

The next higher level of abstraction is how TensorFlow works. For example, a NN layer is an instance of a Python class.

2 Likes

Right, if you’re jonesing for OOP, be careful what you wish for: TF will give you OOP to the max.

Also note that the idea of instrumenting the internal layers to understand what is going on will be covered in Week 4 of this course, as I mentioned in one of my earlier posts on this thread. Look for the lecture “What Are Deep ConvNets Learning?” in Week 4. You can add instrumentation to learn more about how the networks work, but you only do that when you’re doing research as opposed to training a real model. If you’re trying to train GPT-n for some value of n \geq 4, you don’t want to waste any cycles generating frivolous info.

1 Like

Yes, even with this basic level of TF I’ve been trying to understand how we are making calls that don’t even seem (strictly, at least to my mind) like Python syntax anymore.

I.e.

    Z1 = tfl.Conv2D(yada, yada)(input_img)
    ## RELU
    A1 = tfl.yada, yada()(Z1)

The variables fall outside of the direct function call somehow? Still scratching my head a little over how that works.

[image: the Z1 = tfl.Conv2D(yada, yada)(input_img) line, annotated with “A” over the constructor portion and “B” over the (input_img) portion]

The “A” part returns an instance of a class, and that instance is callable like a function. The “yada, yada” parameters are used by the object constructor when it is initialized.

The “B” part has the parameters that are passed to the function.
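Here is a neutral, non-assignment example of the same two-step pattern, using a Dense layer just for illustration:

    import tensorflow as tf

    dense = tf.keras.layers.Dense(units=4, activation="relu")   # "A": the constructor stores the config
    x = tf.random.uniform((2, 3))
    y = dense(x)                                                # "B": the instance is called like a function

    # The assignment style simply fuses the two steps into one line:
    y2 = tf.keras.layers.Dense(units=4, activation="relu")(x)
    print(y.shape, y2.shape)   # (2, 4) (2, 4)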

3 Likes

:grin: of course, “yada, yada” is included so as not to share course code/solutions. Strictly speaking, I learned my OOP in Java. I will dig deeper into the Python methods. The way you have described it helps me understand, though.

I don’t know Java, but in Python a function is just another type of object. You can pass a function as an argument to another function, and a function can return a function as one of its return values, which is essentially what is happening with the TF “Layer” class and all its various subclasses: constructing a layer gives you back an object you can then call like a function.
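A tiny plain-Python illustration of both points (nothing TF-specific in it):

    def make_scaler(factor):
        def scale(x):            # a function returned from a function
            return factor * x
        return scale

    class Scaler:
        """A class whose instances behave like functions -- the pattern Keras layers use."""

        def __init__(self, factor):   # like the "A" part: store the configuration
            self.factor = factor

        def __call__(self, x):        # like the "B" part: apply it to an input
            return self.factor * x

    double = make_scaler(2)
    print(double(5))      # 10
    triple = Scaler(3)
    print(triple(5))      # 15
    print(Scaler(3)(5))   # 15 -- construction and call fused onto one line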

1 Like

Dear @paulinpaloalto… Java was another of my ‘misadventures’ I’d only be willing to talk about with you privately…

But there, after writing your class and its constructor, you create a new instance with the ‘new’ keyword and then access everything on that instance with the ‘.’ operator.

Perhaps Python does it more efficiently, but at first I wasn’t sure what I was looking at here…

1 Like

Python just doesn’t require the “new” keyword.

Calling the class name directly, as if it were a function, is what creates the instance.

Otherwise they’re very similar.
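Side by side, with the class made up purely for comparison:

    # Java:   Conv thing = new Conv(8, 3);
    #         System.out.println(thing.describe());
    # Python:
    class Conv:
        def __init__(self, filters, kernel_size):   # the constructor
            self.filters = filters
            self.kernel_size = kernel_size

        def describe(self):
            return f"{self.filters} filters of size {self.kernel_size}"

    thing = Conv(8, 3)          # calling the class plays the role of 'new'
    print(thing.describe())     # the '.' operator works just like in Java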

@TMosh so I went back and tried to look this up… Can you point me to some non-TF/Keras documentation on how this works in Python?

I mean, maybe the library is pulling something trippy here… yet the convention of the dot ‘.’ operator is fairly pervasive when accessing objects, even if you don’t have to declare them, whether we are talking about Python or anything else.

So you are saying

Z1 = tfl.Conv2D(!@#$)(input_img)

is basically the same as

Z1 = tfl.Conv2D(!@#$).input_img

?

I mean that I could accept (‘new’ is implicit), just slightly different lingo, but not something I have encountered before…