I have several questions regarding week 1 of this course:
1- How is it possible to surpass human-level performance? For example, if the experts’ performance, which is a proxy for Bayes error, is 0.5%, and we get the data labeled by those same experts, then if the model reaches 0.4% error, how do we know that extra 0.1% is right, when the experts themselves aren’t sure about the labels for that extra 0.1%?
2- Regarding the quiz, there is this question:
“You find that a team of ornithologists debating and discussing an image gets an even better 0.1% performance, so you define that as “human-level performance.” After working further on your algorithm, you end up with the following”:
One of the answers is reducing regularization. Well, that’s actually a valid answer, but here I am using a “knob” meant for improving variance to improve bias, which goes against the orthogonalization mentioned by Prof. Andrew, so why is it a correct option?
I mean that Prof. Andrew mentioned in the first lecture that the knobs for improving bias (training error) are a handful of things, including training a bigger network, etc., whereas regularization is a knob meant for variance (dev error) only. That’s even why he mentioned that early stopping is not preferable, as it controls both bias and variance together (training and dev errors).
Could you please help me with the first question too, regarding surpassing human-level performance?
Sorry, it may just be issues with your English, but I don’t understand most of your questions.
I don’t have enough information based on what you’ve said to say anything about that quiz question (item 2 in your original post), but Prof Ng does mention in the lectures that there are some types of predictions where human performance is not very good compared to what is possible with ML models. Examples that I remember are recommender systems and predicting which link on a webpage a user will click next.
Bayes Error is the lowest error that is possible on the task, and all we can say is that both Human Error and Model Error are greater than or equal to the Bayes Error. For image classification tasks (does this picture contain a cat?) the human visual cortex is pretty hard to beat, so HE ≤ ME in most of those cases. But there are cases in which HE > ME, as in the recommender system case that Prof Ng mentioned. One other example that comes to mind is the famous Google AI model that can detect the sex of patients from their retinal scans, which human ophthalmologists had previously thought was impossible.
It might be worth just watching the lectures again.
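To make the Week 1 error terms concrete, here is a small numeric sketch of avoidable bias and variance as the lectures define them. All the error figures below are made up for illustration; the takeaway is that once training error dips below the human proxy, the remaining avoidable bias becomes unknowable, because the true Bayes error is unknown.

```python
# Hypothetical error figures (made up for illustration only).
human_level_error = 0.005   # 0.5%, used as a proxy for Bayes error
training_error    = 0.004   # 0.4%
dev_error         = 0.008   # 0.8%

# Definitions used in the Week 1 lectures:
#   avoidable bias = training error minus human-level (Bayes proxy) error
#   variance       = dev error minus training error
avoidable_bias = training_error - human_level_error
variance = dev_error - training_error

print(f"avoidable bias: {avoidable_bias:.3%}")  # -0.100%: training already beats the proxy,
                                                # so we can no longer measure avoidable bias
print(f"variance:       {variance:.3%}")        #  0.400%: variance is the clearer problem
```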
The other point that Tom made is also worth emphasizing. Regularization does not only affect variance. If you turn the regularization “knob” far enough, you can get to the point where low variance becomes high bias.
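To illustrate that, here is a minimal sketch using scikit-learn’s LogisticRegression on synthetic data as a stand-in for the network in the quiz; the dataset, the C values, and everything else here are made up for illustration. A single regularization knob moves both the training error and the dev error, and at the heavy-regularization end the training error itself tends to climb, i.e. low variance has turned into high bias.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data, split into a train set and a dev set.
X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.5,
                                                  random_state=0)

# In scikit-learn, C is the INVERSE of regularization strength:
# small C means heavy regularization, large C means light regularization.
for C in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=5000).fit(X_train, y_train)
    train_err = 1 - model.score(X_train, y_train)
    dev_err = 1 - model.score(X_dev, y_dev)
    print(f"C={C:>7}: train error={train_err:.3f}  dev error={dev_err:.3f}")

# With heavy regularization (small C) the training error typically rises as well,
# which is the point above: the regularization knob is not purely a variance knob.
```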
So why is early stopping considered contrary to orthogonalization (as mentioned by Prof. Andrew), despite the fact that it affects both bias and variance, just like regularization does?
So I was curious about your question and went back myself to review these materials.
Let me offer a reinterpretation, or my own perception of what Prof. Ng is saying: Rather than the ‘fancy’ term ‘orthogonalization’, I read this as defining problem separation. Or even from my really ancient days of econometrics/statistics, it is kind of like ‘degrees of freedom’.
Figure that, presuming the problem is predictable, any prediction or equation will tend to hinge on a select few important points. That is not to say ‘all the data’ isn’t important, but inevitably you will find inflection points, or factors that affect the course of the data more than others.
What I believe he finds troubling about early stopping is this: maybe our results look great, but by stopping early we have not fully eked out all our inflection points yet, and in the end those could turn out to be really important.
Yes, it may take longer to run, but let’s be sure where the geometry of the problem fully folds. (Sorry, only in my head do I think of this task as a bit like an intricate origami: working on raw data, we are trying to find the best set of ‘folds’ that completes the optimal equation. We are playing with hyperplanes, of course.)
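To put the “one knob, two effects” concern in plainer terms, here is a rough sketch (scikit-learn’s MLPClassifier on synthetic data; every number and name in it is made up for illustration). At each candidate stopping point, both the training error and the dev error change at once, so this single knob cannot tune the fit to the training set (bias) and the gap to the dev set (variance) independently.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic binary-classification data, split into a train set and a dev set.
X, y = make_classification(n_samples=1000, n_features=30, n_informative=8,
                           random_state=1)
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.5,
                                                  random_state=1)

# Train incrementally and report the errors we would lock in if we stopped
# at each checkpoint: one choice of stopping epoch fixes both numbers at once.
model = MLPClassifier(hidden_layer_sizes=(64,), random_state=1)
for epoch in range(1, 201):
    model.partial_fit(X_train, y_train, classes=[0, 1])
    if epoch % 25 == 0:  # candidate early-stopping points
        train_err = 1 - model.score(X_train, y_train)
        dev_err = 1 - model.score(X_dev, y_dev)
        print(f"stop at epoch {epoch:3d}: "
              f"train error={train_err:.3f}  dev error={dev_err:.3f}")
```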
As to point one, recall that the ultimate effectiveness of our model depends not on the training set, or even the test set, but on ‘unseen’ cases.
Consider radiologists and cancer… perhaps all the ‘experts’ miss it, but in the end you either do (or do not) have cancer. So this presumes the problem you are studying ultimately has some ‘ground truth’ behind it.
Or, that is to say: ‘okay, well this is what all the experts think, but at the end of the day you have reality, where probability ends’.
Personally, I also find this an easier example to think about, one where humans still make mistakes, rather than AGI where AI becomes like some ‘God’.
Hey, I think we need to be clear on this one first: “reducing regularization” is not a knob for fixing a variance problem; only “increasing regularization” is.
Here, we reduce regularization in the hope that the bias will be lowered and the model’s error will get closer to human-level performance.
Also, I think Andrew’s idea of orthogonalization is not about knobs for bias/variance, but about knobs for the four metrics in his chain of assumptions (fit the training set well, fit the dev set well, fit the test set well, and perform well in the real world).
It is clear that bias and variance have a trade-off relationship, but their effects on the four metrics via those listed knobs are less clear.
For example, would the “regularization” knob degrade the training set metric significantly? My answer is, it depends. If my network were not just big but too big, that knob would affect the training set’s metric less, but would have a good hope of improving the dev set’s.
Let me point out that your explanation style is very catchy.
Regarding the first question, “not exploring vital points in hyperplane space” could be a drawback of using early stopping, but I don’t fully get what it has to do with orthogonalization.
I think a knob is something that can affect the property in both directions, whether you reduce or increase it. For example: increasing regularization reduces variance because the model generalizes better, and decreasing regularization increases variance because the model overfits the data a bit more.
That’s exactly why I didn’t understand why it’s a correct choice. Regularization is more focused on reducing dev set error, which helps reduce variance. However, if the network is large enough, as in the question, reducing regularization shouldn’t significantly affect the training error, and it is the training error that would need to come down to reduce bias.
The question asks for the two most promising options, and I don’t see that statement in the question as a pre-condition.
In other words, we need to rank all the available options and choose the first two.
You said decreasing regularization increases variance, but you also said “both ways,” so decreasing regularization reduces bias, too. Reducing bias has a good chance of closing the gap between human-level performance and training set error. What is the problem with this?