# The cost function and other concepts

I have watched the last video of Week 2 a couple of times. I think it may be interesting to discuss the math, as it was quite striking to me as a non-specialized person. I felt quite dumb, so I read some articles about probability.

I guess the following may be of interest to few people. It is more a rant to see if anyone has corrections or wants to discuss.

I have been reading about some related math. Even the fact that after 4 tails in a coin toss you still get a 1/2 chance of tails on the next toss is striking to me. Not just to me, though: this is called the Gambler's Fallacy. It also seems (I could be wrong) that probability isn't that old an invention; I don't think it could have been conceived without the concept of infinity. Some people can clearly see each toss as an independent event, and then there is no fallacy.

To predict the next value of a set of seemingly random values, one strategy is to use the mean. You can justify this by minimizing the sum of squared errors, \sum (y_i - x)^2, not the plain differences \sum (y_i - x). There are also conditions where the median is the right choice: the median appears if \sum |y_i - x| is minimized instead.
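To check this numerically, here is a small sketch in Python (the sample values are made up for illustration). Nudging the candidate away from the mean increases the squared error, and nudging it away from the median increases the absolute error:

```python
import statistics

# Hypothetical sample values, just for illustration.
values = [2.0, 3.0, 5.0, 9.0, 11.0]

def sum_squared_error(x, ys):
    # \sum (y_i - x)^2
    return sum((y - x) ** 2 for y in ys)

def sum_abs_error(x, ys):
    # \sum |y_i - x|
    return sum(abs(y - x) for y in ys)

mean = statistics.mean(values)      # 6.0
median = statistics.median(values)  # 5.0

# The mean beats nearby candidates on squared error...
for candidate in (mean - 0.5, mean + 0.5):
    assert sum_squared_error(mean, values) < sum_squared_error(candidate, values)

# ...while the median beats nearby candidates on absolute error.
for candidate in (median - 0.5, median + 0.5):
    assert sum_abs_error(median, values) < sum_abs_error(candidate, values)
```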

That approach minimizes the error, understood as an average distance from this magic number to any result. But if we want to predict the next value as a "yes" or "no", there won't be a predilection for either of them (in a fair coin toss). I find it interesting how the phrasing changes how we think about the problem.

So one approach is about the error; the other seems to be about "which" number comes next in a win/loss situation.

[ As a tangent, mean and median are the same in a perfect Gaussian distribution of values. This can be seen without much math. Just imagine a sorted array of values (values can be repeated). Because of the curve's symmetry, and the array's symmetry if it follows the curve, values to the right and left of the central value (assume the count is odd for simplicity) cancel in distance to the center (on average). Then mean and median are the same. For a die, the mean is 3.5, and the median, by the usual convention of averaging the two middle values, is also 3.5. ]
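The die case can be confirmed in a couple of lines (Python's `statistics.median` averages the two middle values when the count is even):

```python
import statistics

die = [1, 2, 3, 4, 5, 6]

print(statistics.mean(die))    # 3.5
print(statistics.median(die))  # 3.5, the midpoint of 3 and 4
```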

In the case of the cost function, as defined in the last video, this is even more complicated. You want the estimate of the probability itself to be a function that can be optimized. I don't think I fully understand what that means yet.

The function then is:

P(y|x) = \hat{y}^y (1-\hat{y})^{(1-y)}

And because, whether y=0 or y=1, the formula reduces to the probability assigned to the correct label, we want to maximize P(y|x): when y=1 that pushes \hat{y} toward 1, and when y=0 it pushes \hat{y} toward 0.
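A quick numeric check that the formula reduces to the probability of the observed label in both cases (the 0.8 is an arbitrary example value):

```python
def likelihood(y, y_hat):
    # P(y|x) = y_hat^y * (1 - y_hat)^(1 - y) for a single example
    return (y_hat ** y) * ((1 - y_hat) ** (1 - y))

# When y = 1, P reduces to y_hat; when y = 0, to 1 - y_hat.
assert likelihood(1, 0.8) == 0.8
assert likelihood(0, 0.8) == 1 - 0.8
```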

And then, if we accept that function, extending it to many examples isn't complex: the likelihood of the whole training set is the product of the per-example probabilities, and taking logs turns that product into a sum.
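That sum of logs, averaged and negated, is the usual cross-entropy cost. A sketch with made-up toy predictions, checking that better predictions give a lower cost:

```python
import math

def cost(ys, y_hats):
    # J = -(1/m) * sum_i [ y_i*log(y_hat_i) + (1 - y_i)*log(1 - y_hat_i) ]
    m = len(ys)
    return -sum(y * math.log(yh) + (1 - y) * math.log(1 - yh)
                for y, yh in zip(ys, y_hats)) / m

# Hypothetical labels and two sets of predictions:
good = cost([1, 0, 1], [0.9, 0.1, 0.8])  # confident and correct
bad  = cost([1, 0, 1], [0.6, 0.4, 0.5])  # hesitant
assert good < bad
```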

Finally, because L is defined as -\log P (the negative log-likelihood), the two derivatives always have opposite signs:

\large\frac{dL}{d\hat{y}} = -\frac{1}{P}\frac{dP}{d\hat{y}}

and indeed, because all we have in algorithm notation is \frac{dL}{d\hat{y}}, maximizing P is the same as minimizing L.
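The opposite-signs claim can be verified numerically with a central-difference derivative (the point y_hat = 0.7 is arbitrary):

```python
import math

def P(y, y_hat):
    # per-example probability, as above
    return (y_hat ** y) * ((1 - y_hat) ** (1 - y))

def L(y, y_hat):
    # loss = negative log-likelihood
    return -math.log(P(y, y_hat))

def d(f, x, h=1e-6):
    # central-difference numeric derivative
    return (f(x + h) - f(x - h)) / (2 * h)

y, y_hat = 1, 0.7
dP = d(lambda yh: P(y, yh), y_hat)
dL = d(lambda yh: L(y, yh), y_hat)

# Opposite signs: maximizing P and minimizing L push y_hat
# in the same direction.
assert dP > 0 > dL
```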

Hi, @Mah_Neh. You are to be congratulated for your intense curiosity! Note that the course (and specialization) is designed to require only a minimal (perhaps solely intuitive) understanding of probability and statistics concepts. That is why I prefaced my post with something like "if you have a probability and statistics background, you may want to read on" … . That is probably (no pun intended) why the last lecture, "Explanation of logistic regression cost function", is marked as optional. If you acquire enough knowledge to follow that, it's icing on the cake. I think of the Specialization as taught from a computer science and "engineering" perspective.

Onward and upward!


Thanks, I wish I could control it sometimes. I still have your previous reply in my read-soon list. Thanks for the encouragement.

Personally, I understand!
