State of Computer Vision

Hi Sir,

@thearkamitra
@arosacastillo
@AmmarMohanna
@XpRienzo
@reinoudbosch
@chrismoroney39
@paulinpaloalto

In the lecture video “State of Computer Vision”, we have doubts about the statements below. We are unable to understand many of the points Andrew Ng makes; can you please help clarify?

Statement 1:
So, if you look across a broad spectrum of machine learning problems, you see on average that when you have a lot of data, you tend to find people getting away with using simpler algorithms as well as less hand-engineering. So there’s just less need to carefully design features for the problem; instead you can have a giant neural network, even a simpler architecture, and have the neural network just learn whatever it wants to learn when you have a lot of data.

Doubt 1: What does “instead you can have a giant neural network, even a simpler architecture” mean? We cannot understand this statement. How can a network have a simpler architecture and still be giant?

Doubt 1b: Why do we need a simpler architecture when we have lots of data, and a more complex architecture when we have little data?

Doubt 2: What is the reason for needing less hand-engineering when there is lots of data and more hand-engineering when there is only a little data? Why is it like that? We cannot understand the reason.

Statement 2: But there’s still a lot of hand-engineering of network architectures in computer vision, which is why you see very complicated hyperparameter choices in computer vision, more complex than in a lot of other disciplines.

Doubt 3: What does it mean that a lot of hand-engineering of the NN architecture leads to complicated hyperparameter choices? Can you give an example to help us understand? What is hand-engineering a NN architecture, and how does it lead to complicated hyperparameter choices?

Doubt 4: What does “standardized benchmark datasets” mean? What does “benchmark” mean in this context?

Statement 3: But you also see in the papers people do things that allow you to do well on a benchmark, but that you wouldn’t really use in a production system or a system that you deploy in an actual application.

Regarding statement 3: networks that do well on benchmarks are not used in production systems. Is this due to reasons like computational budget? Can you please let us know what falls under “computational budget”? And are the other reasons things like needing a lot more memory and slower running time? Can you please confirm whether those other reasons are correct?

Statement 4: “Train several neural networks independently and average their outputs.” Here, does averaging their output y-hats mean averaging the accuracy of the results?

General doubt:

As per the lecture, the ensembling technique seems to focus more on test time than on training and cross-validation, based on some things we observed. Is it true that ensembling focuses more on test time than on training time and cross-validation? We are asking this because of the points below, which we observed from the lecture.

  1. “Train several neural networks independently and average their results.” This is a statement from the lecture. Does it mean we train the NNs on the training dataset and cross-validation set, or at test time?

  2. “Because ensembling means that to test on each image, you might need to run an image through anywhere from, say, 3 to 15 different networks, quite typically.” This is a statement from the lecture. Does it mean we train 3 to 15 NNs at test time? That is not right, is it?

  3. Why do we need to run the classifier on multiple crops at test time? We can do data augmentation at training time, right? That is why we asked whether ensembling tends to focus more on test time. Why should it be like that?

Can you please help us understand?

@thearkamitra
@arosacastillo
@AmmarMohanna
@XpRienzo
@reinoudbosch
@chrismoroney39
@paulinpaloalto
@TMosh

Can you please help me understand? I have revisited the same video many times, but please help clarify these doubts.

Generally, the mentors cannot answer your questions about the lectures and philosophy of machine learning. We’re mostly here to help with the programming assignments.

Questions about the course content and machine learning practice can best be answered by the teaching staff, but they are not active on the discussion forum.

Okay Tom Sir,

I’m looping in the dedicated mentors for this content. I hope the mentors can help us. This is not an individual doubt; it is a set of doubts from our team. Someone please help to answer; many of us will be happy.

@thearkamitra
@arosacastillo
@AmmarMohanna
@XpRienzo
@reinoudbosch
@chrismoroney39
@paulinpaloalto

Oh wow, you’ve left a lot to unpack here

1) Simpler in the sense that there are no specific architectural quirks like you have for recurrent nets or ConvNets. If you have a lot of data, you can use a general algorithm like the ones you’ve learnt in the earlier courses, with enough depth that all the parameters can just adjust to your data. It can still be complex in the sense of having a lot of layers and parameters.

1b) It’s not that you need simpler networks when you have a lot of data; it’s that you don’t need to spend time engineering a new algorithm specific to your use case, and can get by with less work designing something novel. You can still create and use more complex networks, and they will probably be more useful or train faster.

2) When you have less data, you don’t cover many of the patterns in the input space, so you have to extract as much predictive signal as possible from the data you do have. Hand-engineering features helps with that.
With a lot of data and a deep enough network, you essentially hand off this responsibility to the network, which can now learn something similar to the features you’d hand-engineer.
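To make the contrast concrete, here is a toy sketch (my own illustration, not course code) of what a couple of hand-engineered features might look like; with enough data you would skip this step and feed raw pixels to a deep network instead:

```python
import numpy as np

def hand_engineered_features(image):
    """A few manually designed features, as you might use with very little data."""
    mean_rgb = image.mean(axis=(0, 1))                    # average colour per channel
    grad_y = np.abs(np.diff(image.mean(axis=2), axis=0))  # vertical intensity changes
    edge_strength = grad_y.mean()                         # crude "edginess" score
    return np.concatenate([mean_rgb, [edge_strength]])    # tiny 4-number feature vector

# With a lot of data, you would instead pass the raw pixels straight into a deep
# network, which learns its own internal features resembling (and surpassing) these.
```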

3) If you are designing a completely new network for a particular use case, you might (though not always) have to consider more things. In CNNs, you’ve seen that you might have to choose a stride, a filter size, the padding and so on, while in a fully connected neural network you only had to choose the number of units in each layer and the number of layers.
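As a rough illustration (my own sketch, using Keras layer names rather than anything from the lecture), compare the choices a single fully connected layer needs with the choices a single convolutional layer needs:

```python
import tensorflow as tf

# Fully connected layer: essentially one architectural choice (number of units).
fc_layer = tf.keras.layers.Dense(units=128, activation="relu")

# Convolutional layer: several hand-designed choices per layer.
conv_layer = tf.keras.layers.Conv2D(
    filters=64,       # number of filters (output channels)
    kernel_size=3,    # filter size f
    strides=2,        # stride s
    padding="same",   # padding scheme
    activation="relu",
)
```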

4) A benchmark dataset is just a standard dataset that people use to check their network’s performance. You can see how other, more established networks perform on it and tune your own network accordingly. For example, in computer vision people usually use a version of ImageNet to test their networks.

5) You can simply think of the computational budget as the money you’d spend on the inference device. Memory costs matter, there are GPUs too, and larger neural networks take a lot of time and power. The energy required to run inference is also very important. If you’re tight on budget, would you want to spend millions on power requirements?

6) Average here just means you take the average of their y-hats, the class probabilities they output. Accuracy only comes in when you convert those averaged probabilities into predictions and compare them to the actual ground truth.
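If it helps, here is a minimal sketch (my own, assuming each model exposes a hypothetical predict_proba method returning class probabilities) of what averaging the y-hats looks like:

```python
import numpy as np

def ensemble_predict(models, x):
    """Average the probability outputs (y-hats) of several independently trained models."""
    # Hypothetical interface: each model has predict_proba(x) -> array of class probabilities.
    probs = np.stack([m.predict_proba(x) for m in models])  # shape (n_models, n_classes)
    avg_probs = probs.mean(axis=0)                           # the averaged y-hat
    return np.argmax(avg_probs)                              # final predicted class

# Accuracy is measured afterwards, by comparing these predicted classes
# with the ground-truth labels of the test set.
```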


On your general doubt:

  1. Training them independently just means you train each of them from scratch; you can most likely use the same training and cross-validation data. It’s just that after you’ve trained all of those networks, you average their probabilities and take that as the result. Since you’ve trained them all independently, there’s a chance each has learnt different features than the others, so averaging lets you take advantage of all of them.

  2. You’ve already done the training at this point, but your testing will take 3 to 15 times longer, because you have to pass the same test images through all of the networks you have trained, then average and check against your test set.

  3. I don’t think augmentation is even relevant here; could you elaborate on what you mean to ask?


@XpRienzo provides lots of helpful information and insight in this thread. Unfortunate that @Anbu, who asks for a lot of individual assistance, doesn’t seem to acknowledge or appreciate receiving it.

I think the reference to augmentation in the video is due to the superficial resemblance to one technique of extending a dataset by turning one training input into several. I have done that personally, where I had 1200x768 training images but my CNN only accepted 416x416. Rather than throw away lots of data, I cropped out many 416x416 sections. Not only did this give me more input images, it also meant the same few labelled objects appeared in different places, which improved generalization.

In this video, however, cropping serves a different purpose. Each crop is fed into the classifier separately, one value - the \hat{y} - is predicted for each forward propagation, and the outputs are averaged to produce a single prediction for the original full-sized image. The effect is that a single classifier is treated like an ensemble, since it runs multiple times. This can help raise accuracy scores if some regions of the image are closer to the training data than others.
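To make that concrete, here is a rough sketch (my own illustration, with a hypothetical predict_proba interface and assumed image shapes) of running one classifier over several crops and averaging the outputs:

```python
import numpy as np

def multicrop_predict(model, image, crop_size=416, n_crops=10):
    """Average one classifier's predictions over several crops of the same image."""
    h, w, _ = image.shape
    probs = []
    for _ in range(n_crops):
        top = np.random.randint(0, h - crop_size + 1)
        left = np.random.randint(0, w - crop_size + 1)
        crop = image[top:top + crop_size, left:left + crop_size, :]
        # Hypothetical interface: model.predict_proba(crop) -> array of class probabilities.
        probs.append(model.predict_proba(crop))
    avg_probs = np.mean(probs, axis=0)  # a single averaged y-hat for the full image
    return np.argmax(avg_probs)
```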

The reason people try all kinds of hacks against benchmarks is so that their paper appears high in a list or table sorted on accuracy. Looks good in a paper, may not be practical for real world applications.

By the way, one place where ensembles might make sense is in complex classification where the problem is best solved by several models, each focusing on a different dimension of the solution. For example, pharmaceutical manufacturers are required to report and act on high-risk adverse event information they receive. But what constitutes “high risk”? If it was reported by a medical professional in a clinical trial, it carries more weight than a post on Twitter. What part of the body was the focus? The indicators of the reaction matter, too. Was it a general discomfort? Rash? Fever? Cardiac arrest? It can be easier to build and train separate models to evaluate the different components of risk than to try to train a single model with many complex features.
