Convolution Confusion (YOLO/UNets)

So I have successfully completed this course, but must admit I still have some conceptual questions I am having a hard time wrapping my mind around just yet. I may also update these questions as I get closer to an answer, but here goes:

  1. In YOLO, I understand some of the really big advantages are: a) no more need for a ‘sliding window’ b) many of the computations can now be shared, or let’s say ‘no longer duplicated’.

However (even with something small in class as presented, like a 19 x 19 grid), I guess I am not clear as to how, for the bounding box, the algorithm manages to find the ‘center’.

I mean, maybe you could find a subset of similar or associated regions, and then from there say ‘okay, these are our high probability regions’-- Let’s take the X,Y min/max from those and find our center.

But (at least apparently) I don’t see that in our calculations, so I am wondering how on Earth the algorithm pulls that off (?)

  2. I understand Neural Nets have been found to do some amazing things, but if we are only then, in YOLO, looking at a bunch of tiny boxes at a time… How can it possibly find an association? Or as Prof. Ng admits and most of us know, humans are way better at this.

But, granted an image not taken from the class, as a person, if you showed me just this:


I would have no wild idea what that was, but the image it is extracted from is exactly this:

So it is not exactly clear to me how this is happening. I mean, I think, personally, despite our innate ability, I feel our visual recognition comes totally from the context. We see the ‘entire scene’ and then decide, ‘does this make sense being there’ ?

Maybe YOLO is doing the same, but we are not looking at little boxes one at a time.

Like, the first time you see Guernica, the canvas is just so big in person, you only see it ‘all at once’.


  3. I am trying to still do my own look up on this, but I find the 1 x 1 convolution confusing. Is this a convolutional version of something that is ‘fully connected’?

  4. And I’ll just leave off on this for the moment-- In the ‘upscale’ stage of UNets, I can only imagine these filters are crucially being produced by the skip transfers of the activations on the ‘downscale’ side, correct? Or how on Earth does it know what to train to as we head back up?

*Oops, forgot one thing-- With regard to ‘anchor boxes’, we were kind of just ‘handed’ them in the assignment, but in actuality, is this something you manually design (like a feature, or as you would in your training set)? Or is this something an algorithm can decide?

Not sure what you mean by this, but computing the center of the box is pure geometry of course. So I assume you mean finding the centroid of each recognized object, so that it can pick which grid cell the object is assigned to. It just learns that through training. It requires a huge amount of data to train an algorithm like YOLO and all the data is labeled with all the info including object types and bounding boxes. You have a loss function which is a hybrid function, since it needs to deal with classifications as well as regression style outputs. Prof Ng does not really discuss how the training works, but there are a number of very detailed threads here on the forum about YOLO. Here’s one that covers the training.

They are learned a priori using a different algorithm. Here’s a thread about that. And here’s a thread about how they are applied.
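To give a flavor of what that ‘different algorithm’ looks like: in the YOLOv2 paper the anchor shapes come from k-means clustering over the ground-truth box dimensions, using 1 - IoU as the distance so big and small boxes are treated fairly. A minimal numpy sketch (not the course code; the function names are mine):

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between box shapes and anchor shapes (widths/heights only,
    as if all boxes shared the same center). boxes: (N, 2); anchors: (K, 2)."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + anchors[None, :, 0] * anchors[None, :, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs with distance = 1 - IoU, YOLOv2-style."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)  # nearest anchor = highest IoU
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = boxes[assign == j].mean(axis=0)
    return anchors
```

Run over the real labeled boxes, the k centroids that fall out are the anchor shapes the network is then trained against.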


Yes, that’s a good way to think of 1 x 1 convolutions. There are several lectures on this in Week 2. Now that you’ve been through the whole course and have seen all that is covered, I find it can be helpful to go back and rewatch some of the earlier lectures. Now that you have wider and deeper knowledge, sometimes the earlier lectures will “hit home” in a new way. At least it’s worth a try.
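One way to convince yourself of that equivalence: a 1 x 1 convolution applies the same fully connected layer to the channel vector at every spatial position. A tiny numpy sketch (shapes are just for illustration):

```python
import numpy as np

def conv1x1(x, W, b):
    """1 x 1 convolution: x is (H, W, C_in), W is (C_in, C_out), b is (C_out,).
    Broadcasting applies one dense layer to each pixel's channel vector."""
    return x @ W + b

rng = np.random.default_rng(0)
x = rng.standard_normal((19, 19, 256))   # a feature map
W = rng.standard_normal((256, 32))       # the same weights at every position
b = np.zeros(32)

out = conv1x1(x, W, b)                   # shape (19, 19, 32)
# Sanity check: one pixel pushed through the 'dense layer' directly
assert np.allclose(out[4, 7], x[4, 7] @ W + b)
```

So it is a fully connected layer across channels, shared over all spatial locations, which is why it is cheap and why it is so often used to shrink the channel dimension.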

It’s got training data, right? That’s what drives everything here as always. Note that the labeling process for semantic segmentation is pretty scary. I’m not sure how they really accomplish this “at scale”, but note that every single pixel in the image has its own label (class), right? And the loss function is computed across all of them. Seems like there would be some unbalanced effects you could get from large vs small objects. I’ll bet there are some papers on that.
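To make the ‘loss across every pixel’ point concrete, here is a hedged numpy sketch of a per-pixel cross-entropy; the plain mean over all pixels at the end is exactly where the large-object vs. small-object imbalance sneaks in, since a big object simply contributes more terms:

```python
import numpy as np

def pixelwise_cross_entropy(logits, labels):
    """Semantic segmentation loss: every pixel has its own class label.
    logits: (H, W, C); labels: (H, W) ints in [0, C). Returns the mean over pixels."""
    z = logits - logits.max(axis=-1, keepdims=True)               # numerically stable softmax
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    H, W, _ = logits.shape
    picked = log_probs[np.arange(H)[:, None], np.arange(W)[None, :], labels]
    return -picked.mean()   # big objects dominate: they own more pixels
```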

Of course the point of the “skip” connections from the downsampling path is that they make it easier for the algorithm to learn the correct reconstruction of the geometry of the original image, but now with the labels in place. Maybe in theory, the algorithm could learn the task without the skip connections, but there is “theory” and there is “practice”. Making the training a lot more efficient is a huge win, since we’re dealing with a lot of data and a lot of parameters here.


@paulinpaloalto Yes, it is pure, simple geometry, and this must happen for the whole thing to work-- And perhaps I just ‘missed it’, but I didn’t seem to see where in the code the determination (for lack of a better term) of where exactly the center of this ‘heatmap’ occurs.

Thank you for your other replies-- Please allow me some time to review.

It’s not in any code we write or would have seen, right? The output of the model includes both the bounding box and the object type for every object detected by the algorithm. The model is trained to recognize objects and that includes computing the bounding boxes for us. If you have the coordinates of the bounding box of an object, then it’s pretty straightforward to compute the centroid of that object. It’s not really the centroid in the physics meaning of that (center of mass), but simply the center of the rectangle, as in the intersection of the two diagonals of the bounding box. Once the algorithm has that value, then it uses that to determine which grid cell will have the data for that object. Note that there is no restriction on objects that they need to be contained within a single grid cell: an object can span multiple cells and the only real purpose of the grid cells is just to “hang” the objects there, so that we can write the loops to process them in a localized way.
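The arithmetic involved really is tiny. A sketch, using the familiar 608 x 608 input and 19 x 19 grid from the lab (the helper names are mine):

```python
def box_center(x_min, y_min, x_max, y_max):
    """Center of a bounding box: the midpoint, i.e. where the diagonals cross."""
    return (x_min + x_max) / 2, (y_min + y_max) / 2

def grid_cell(cx, cy, img_w, img_h, S=19):
    """Which of the S x S grid cells 'owns' an object centered at (cx, cy)."""
    return int(cy / img_h * S), int(cx / img_w * S)   # (row, col)

cx, cy = box_center(100, 50, 300, 250)   # -> (200.0, 150.0)
row, col = grid_cell(cx, cy, 608, 608)   # -> (4, 6)
```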

One other level of things here is that the nature of the YOLO algorithm and its training is that it can recognize the same object multiple times in slightly different ways. We deal with that ex post facto by the “culling” process called Non-Max Suppression, which is covered both in the lectures and in the assignment. There are a number of articles on the forum about that as well, e.g. this one.
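For reference, the whole culling idea fits in a few lines. This is a generic greedy NMS sketch in plain numpy, not the TensorFlow version from the assignment:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping duplicates, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order):
        best, order = order[0], order[1:]
        keep.append(int(best))
        order = np.array([i for i in order if iou(boxes[best], boxes[i]) < iou_threshold],
                         dtype=int)
    return keep
```

Sort by confidence, keep the best box, throw away anything that overlaps it too much, and repeat.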

@paulinpaloalto Oh, okay-- And I agree, though the assignments weren’t too technically challenging, at the same time if you (not meaning ‘you’-- I mean a student) were also paying attention, this class is a little conceptually challenging. So I think I will have to revisit it a few times.

For the moment let me modify a very small part of my question-- With regard to UNets on the ‘upscale’ portion, do you see this as even possible without the skip connections ? Or I can kind of get everything so far, but you can’t add more information from ‘nothing’, right ?

So that is the only way I can think the transpose convolution works… Perhaps I am wrong ?

Notice, however, that that is not what YOLO does. It is common to hear or read that YOLO ‘divides the image into grid cells’, but this is not strictly correct. The input image, whether for training or for runtime prediction, let’s call it X, is never subdivided. Rather, it is the training ground truth labels, Y, and the predictions, \hat{Y}, that are mapped to grid cells (and anchor boxes and classes… the 19x19x425 matrix from the lab).

As @paulinpaloalto describes, for that red car image there would be exactly one cell in the training data that had non-zero entries. All the other cells in Y would be zeros. The specific grid cell with the non-zero values is trivial to determine, since for training data we know the image dimensions, and we know where the object is, either the four corner coordinates or the center and size, depending on how the labels were stored.

During training, any cell in Y other than that one should not predict an object centered within itself, and if it does, the cost function will penalize it. At runtime, forward propagation just recapitulates its training, and (hopefully) exactly one grid cell plus anchor box location (sometimes Redmon et al call them detectors) predicts the object.

In any case, the prediction is not made based upon a 19x19 subdivision of the original image as suggested by the red square. Rather, it is made based on the downsampled/distilled signal flowed through the CNN as it transformed the input signal, for example 183,184 pixels (428x428), into the S*S*B*(1+4+C) output values.
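The S*S*B*(1+4+C) bookkeeping is easy to check; with the lab’s 5 anchor boxes and 80 classes it reproduces the 425:

```python
def yolo_output_depth(num_anchors, num_classes):
    """Depth per grid cell: for each anchor, 1 objectness score,
    4 box coordinates, and num_classes class scores."""
    return num_anchors * (1 + 4 + num_classes)

assert yolo_output_depth(5, 80) == 425   # the 19 x 19 x 425 tensor from the lab
```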

There is more detail in the threads linked above. Let us know what you find?


It’s a good question and I don’t claim to know the answer. If I had to guess, my theory would be that you can’t prove that it’s impossible without the skip connections, but I’d have to believe the training would be hideously more expensive and you’d probably need to modify the architecture of the network. It’s not that you have “nothing”, because the training is based on actual labeled data and the labeled data contains the required geometry. The question is how the downsampling path could encode the geometry information through the normal convolution outputs in such a way that the transpose convolution layers could reconstruct it. But you have the “forcing function” in the form of the loss computed between the generated image and the labeled images which contain the geometry. The question is whether that would be sufficient to force the previous layers to figure out a way to encode the geometry information. Maybe you’d need a lot more output channels on the downsampling path than you can get away with in the real U-Net architecture with skip connections.
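As a side note on the mechanics, a transpose convolution really does manufacture a larger output from a smaller input, by ‘stamping’ a learned kernel at stride-spaced positions and summing the overlaps. A 1-D numpy sketch (my illustration, not course code; the concatenation comment is where the skip connection would enter):

```python
import numpy as np

def transpose_conv1d(x, kernel, stride=2):
    """Each input value stamps (value * kernel) into the output at
    stride-spaced positions; overlaps are summed. Output length grows."""
    n, k = len(x), len(kernel)
    out = np.zeros(stride * (n - 1) + k)
    for i, v in enumerate(x):
        out[i * stride : i * stride + k] += v * kernel
    return out

up = transpose_conv1d(np.array([1.0, 2.0, 3.0]), np.array([1.0, 1.0]))
# 3 inputs become 6 outputs. In a U-Net, these upsampled activations would
# then be concatenated with same-sized skip-connection activations from
# the downsampling path before the next convolution.
```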

But this is all definitely in “angels dancing on the head of a pin” territory :grinning:. The right way to spend additional mental energy here would be to figure out how to make the architecture work better and be easier to train, not what is the nastiest way I can pose the challenge and still get a valid solution that is achievable in polynomial time.

Something about that previous statement about polynomial time reminds me of a joke that Donald Knuth made in one of his books. Of course he was writing back in the 70s, when the state of the art in both compute power and algorithmic sophistication was quite a bit more limited than it is today. At that point, human chess masters were still easily beating the best computer algorithms. I forget the exact way he introduced the idea, but he pointed out that chess was literally a finite game, meaning that there are a finite number of possible chess games. But at that point, a computer still couldn’t solve it. So he said “It’s not sufficient that the problem be finite: it needs to be very finite.” You know you’re a math nerd if you laugh out loud at that statement.

@paulinpaloalto I’ll have to leave this reinvestigation for tomorrow-- But when I was adjunct at Northeastern University circa 2010-2013 I actually got to meet the head of the then IBM Watson team, David Ferrucci-- And to be honest I cut my heart and soul on art, poetry, so I asked him outright: do you think this can understand the ‘meaning’ of a poem, like words not commonly used together in the corpus? He said ‘Yes’, but I think obviously not then.

It is getting a little bit more ‘creepy close’ now, though I think ‘not quite yet’.

Just IMHO.

@paulinpaloalto allow me to revisit this question later. I mean when I really want to ‘beat myself over the head’, I’ve long had a physical copy of the seminal Deep Learning text-- And Ian Goodfellow was one of Ng’s students, no? But surprisingly the chapter on convolutions is rather short, and even there, graphically, they are using a completely different structure than what was presented.

Being able to run a model is great. Though I want to make sure I fully understand it. Give me some time and I will circle back on this.


Just wanted to reinforce that YOLO predicts the center location of an object in the image exactly the same way it predicts class, object presence and bounding box shape/size. That is, it is provided the correct center pixel coordinates at training time and then learns how to reproduce that as a prediction. What YOLO doesn’t do is reverse engineer the center location or the proper grid cell from the bounding box. That is done once, but during training data creation, not by YOLO itself, neither during training nor at runtime.

Based on the image size and the chosen grid cell size, at training data creation time the object’s center coordinates are determined from the labelled training data, the grid cell is computed by simple algebra, and then the object’s training data is embedded in that location of the Y matrix (along with the best anchor box, which is computed at this time as well).

Then at runtime, and this is the real key to grokking the YOLO idea, every grid cell and anchor box tuple (i.e. detector) simultaneously makes predictions. The grid cell location doesn’t have to be computed or ‘assigned’ at the end of forward prop. Rather, it is explicit in the matrix location of the Y ground truth and the corresponding \hat{Y} where the prediction vector occurs.
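That ‘simple algebra’ plus the embedding into Y can be sketched in a few lines (a hypothetical helper of mine; raw box units and a 608 x 608 input assumed for simplicity):

```python
import numpy as np

def encode_label(Y, cx, cy, w, h, class_id, best_anchor, img_size=608, S=19):
    """Training-data creation: write one object's ground truth into Y.
    Y has shape (S, S, B, 5 + C): objectness, box, one-hot class per anchor slot."""
    row = int(cy / img_size * S)
    col = int(cx / img_size * S)
    Y[row, col, best_anchor, 0] = 1.0                # objectness
    Y[row, col, best_anchor, 1:5] = [cx, cy, w, h]   # box (raw units here)
    Y[row, col, best_anchor, 5 + class_id] = 1.0     # one-hot class
    return row, col

B, C = 5, 80
Y = np.zeros((19, 19, B, 5 + C))
row, col = encode_label(Y, cx=304, cy=152, w=60, h=40, class_id=2, best_anchor=1)
```

Every other slot in Y stays zero, which is exactly the ‘exactly one cell with non-zero entries’ situation for a single object.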

If that doesn’t make sense, let’s discuss further.

Also note that this is why Non-Max Suppression must be used with YOLO. Since each detector location is making predictions at the same time, there can be what I think of as false positives… two or more predictions of the same object. An IoU-based similarity measure lets NMS cull those duplicates and lower the false-positive count. There can still be false positives due to other issues, or if the IoU threshold in NMS is set too low.


…you know who Donald Knuth is and have read his books :rofl:

One of my earliest consulting jobs in the 80’s involved implementing a balanced tree I got from The Art of Computer Programming


@ai_curious I am aware of Knuth, but was born in ’82 so I don’t think I have actually read one of his books… Yet, if you like to talk ‘nerd level’, I have a copy of Richard Walter Conway’s ‘Programming for Poets’ in the back of my car-- And who, really, outside of a maintenance system, programs in Pascal/Fortran anymore?

Yet I love poetry, so I thought of ‘rewriting’ it for a new generation-- But this is on my ‘long-list’.

I also feel your contributions are very useful, though I am just going to need to find some time to just sit down and dig into this.

Hopefully, I will ask a more enlightened question later.


Mmmm… Fortran.


The even crunchier recipe is FORTRAN on punch cards. Some of us are old enough to remember those daze. :laughing:

Indeed. Those boxes of punch cards were heavy and awkward to carry around campus. Fortunately we got time-shared CRT monitors fairly soon after that.