I would like to share some of my thoughts after watching “Classification with logistic regression” and open a discussion on some of the points mentioned:
Andrew says at the start of the module that linear regression is not appropriate for classification. However, by the end of the chapter I was left with the impression that we use the linear-regression model as the inner function of the logistic function. So it is still a linear model at its core, but I think that the cost function will be defined differently.
Andrew says that advertising used to be driven by classification algorithms. Why does he speak in the past tense, and what is the current way that advertisers target their audience?
Hey @popaqy,
Welcome to the community. Digital advertising today is a large domain in itself. It uses AI to automatically create ad campaigns, to perform SEO more efficiently (there are Smart-SEO tools, for example), to find which products are popular and because of which campaigns, and to decide which ad campaign to show to which user group, in which location, and during which time periods, among much more.
Since AI is used in so many different ways by today’s advertising sector, there is no single algorithm, or even a single type of AI, that we can say for certain governs advertising.
In other words, for some use cases in advertising we use supervised learning, like classification and regression; for others we employ unsupervised learning, like clustering; and for some (though I am not aware of specific examples) someone might even be using reinforcement learning. I hope this helps.
Building on top of your point 1, I want to focus on the sigmoid function itself. It has a very nice feature: it bends the extreme regions of very positive and very negative z (the output of the linear part) from straight lines into almost horizontal lines. The gradient of a horizontal line is 0, so the gradient of an almost horizontal line is very close to 0. Since the gradient tells you how much each model parameter should be updated to reduce the cost, the sigmoid essentially asks the update step not to care about those regions (because the gradient there is almost 0!).
This is great when the linear model’s output already pushes a positive sample to the positive extreme and a negative sample to the negative extreme. It is like saying: “Okay, for my positive sample, it does not matter whether your linear output is +300, +3000, or +30000, because after my sigmoid function they are all going to be close to 1, so please don’t worry about them; their large differences are just very subtle details that can be ignored.” We can also see this mathematically: the sigmoid’s derivative a(1-a) makes the resulting gradient small in those extreme regions.
z = w_1 x_1 + w_2 x_2 + \dots + b

a = \frac{1}{1 + e^{-z}}

J = \text{any cost function of } a
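To make the “almost horizontal” point concrete, here is a minimal NumPy sketch (my own illustration, not code from the course) that evaluates the sigmoid and its slope a(1-a) at a few values of z:

```python
import numpy as np

def sigmoid(z):
    """Logistic function a = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# z values from moderate to extreme
z = np.array([-30.0, -3.0, 0.0, 3.0, 30.0, 300.0])
a = sigmoid(z)
slope = a * (1.0 - a)   # derivative of the sigmoid w.r.t. z

for zi, ai, si in zip(z, a, slope):
    print(f"z = {zi:7.1f}   a = {ai:.6f}   da/dz = {si:.2e}")

# The printout shows da/dz is about 0.25 at z = 0, but essentially 0
# for |z| >= 30: the extreme regions contribute almost nothing to the
# gradient that flows back through the sigmoid itself.
```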
However, what if the linear model’s output isn’t already good? What if it pushes a positive sample to the negative extreme? Here the cost function comes to the rescue, because \frac{\partial{J}}{\partial{a}} is the only remaining piece we can use to change the game! And as you said, “I think that the cost function will be defined differently.”
In some of the next videos you will see that the cost function is

J = -\left[\, y \log(a) + (1 - y) \log(1 - a) \,\right]
With this cost, the a(1-a) term disappears from \frac{\partial{J}}{\partial{z}}, so it saves us from always ignoring the extreme regions. At the same time, the final form keeps the advantage of almost ignoring the samples that are correctly pushed to their respective extremes: it still contains a, which bends extreme values to nearly 0 and 1, so whenever the push is correct, (y - a) becomes almost 0.
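For completeness, here is the chain-rule step behind that cancellation, using the same z, a, y notation as above (a small derivation sketch, not copied from the videos):

\frac{\partial{J}}{\partial{a}} = -\left( \frac{y}{a} - \frac{1-y}{1-a} \right) = \frac{a - y}{a(1-a)}

\frac{\partial{a}}{\partial{z}} = a(1-a)

\frac{\partial{J}}{\partial{z}} = \frac{\partial{J}}{\partial{a}} \cdot \frac{\partial{a}}{\partial{z}} = a - y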
Therefore, with both the sigmoid function and a well-designed cost function (don’t leave out either of them), we can algorithmically ignore the very large but subtle details of the correctly predicted samples, while focusing on the incorrectly predicted ones.
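As a quick sanity check of the whole argument, here is another small NumPy sketch (again my own illustration) comparing \frac{\partial{J}}{\partial{z}} under a squared-error cost versus the logistic cost above, for one correctly pushed and one badly misclassified sample:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_z_squared_error(z, y):
    # J = (a - y)^2  ->  dJ/dz = 2 * (a - y) * a * (1 - a)
    a = sigmoid(z)
    return 2.0 * (a - y) * a * (1.0 - a)

def grad_z_cross_entropy(z, y):
    # J = -[y*log(a) + (1-y)*log(1-a)]  ->  dJ/dz = a - y
    a = sigmoid(z)
    return a - y

cases = [
    ("correctly pushed positive sample (y=1, z=+30)", 30.0, 1.0),
    ("badly misclassified positive sample (y=1, z=-30)", -30.0, 1.0),
]

for name, z, y in cases:
    print(name)
    print(f"  squared error  dJ/dz = {grad_z_squared_error(z, y):+.2e}")
    print(f"  cross entropy  dJ/dz = {grad_z_cross_entropy(z, y):+.2e}")

# For the correctly pushed sample, both gradients are ~0 (leave it alone).
# For the misclassified sample, squared error still gives ~0 because the
# a(1-a) factor has killed the signal, while the logistic cost gives ~ -1,
# a strong push to fix the mistake.
```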