I have watched the last video of Week 2 a couple of times. I think it may be interesting to discuss the maths, as it was quite striking to me as a non-specialist. I felt quite lost, so I read some articles about probability.

I guess the following may only be of interest to a few people. It is more of a rant, to see if anyone has corrections or wants to discuss.

I have been reading about some related maths. Even the fact that after 4 tails in a row of coin tosses you still get a **1/2** chance of tails on the next toss is striking to me. Not just to me, though: this is called the Gambler's Fallacy. It also seems (I could be wrong) that probability isn't that old an invention. I don't think it could have been conceived without the concept of infinity. Some people can clearly see each toss as an independent event, and then there is no fallacy.
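A quick simulation makes the independence concrete. This is just a sketch (the number of trials and the seed are arbitrary choices of mine): among simulated sequences that start with 4 tails, the 5th toss still comes up tails about half the time.

```python
import random

random.seed(0)

# Simulate many sequences of 5 fair coin tosses and check: among the
# sequences whose first 4 tosses are all tails, how often is the 5th tails?
after_four_tails = 0
fifth_is_tails = 0
for _ in range(200_000):
    tosses = [random.random() < 0.5 for _ in range(5)]  # True = tails
    if all(tosses[:4]):
        after_four_tails += 1
        if tosses[4]:
            fifth_is_tails += 1

# The conditional frequency stays near 1/2 -- the 5th toss is independent
# of the first four.
print(round(fifth_is_tails / after_four_tails, 2))  # close to 0.5
```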

To predict the next value of a set of seemingly random values, one strategy is to use the mean. You can justify this by showing that the mean minimizes the squared error \sum_i (y_i - x)^2, not the plain difference \sum_i (y_i - x). There are also situations where you would use the median instead: the median appears if you minimize the sum of absolute differences \sum_i |y_i - x|.
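A brute-force check of these two claims, on a small sample I made up (deliberately skewed so that mean and median differ): scanning candidate values of x, the squared-error sum bottoms out at the mean and the absolute-error sum at the median.

```python
# Brute-force check: the mean minimizes the sum of squared errors,
# the median minimizes the sum of absolute errors.
ys = [1.0, 2.0, 2.0, 3.0, 10.0]  # skewed on purpose so mean != median

candidates = [i / 100 for i in range(0, 1101)]  # trial values of x

best_sq = min(candidates, key=lambda x: sum((y - x) ** 2 for y in ys))
best_abs = min(candidates, key=lambda x: sum(abs(y - x) for y in ys))

mean = sum(ys) / len(ys)            # 3.6
median = sorted(ys)[len(ys) // 2]   # 2.0 (middle of 5 sorted values)

print(best_sq, mean)    # both 3.6
print(best_abs, median) # both 2.0
```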

That approach minimizes the error, understood as an average distance from this magic number to any result. But if we want to predict, as a "yes" or "no", what the next value will be, there won't be a preference for either of them (in a fair coin toss). I find it interesting how the phrasing changes how we think about the problem.

So one approach is about the error, the other one seems about "which" number is the next in a win/loss situation.

[ As a tangent, the mean and median are the same in a perfect Gaussian distribution of values. This can be seen without any maths. Just imagine an array of values (values can be repeated). Because of the curve's symmetry, and the array's symmetry if it follows the curve, values to the right and left of the central value (assume the count is odd for simplicity) cancel out in their distances to the centre (on average). So the mean and median are the same. For a die, the mean is 3.5, and by the usual convention of averaging the two middle values (3 and 4), the median is also 3.5. ]
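The die case can be checked directly with Python's standard `statistics` module, which follows the convention of averaging the two middle values for an even-length list:

```python
import statistics

die = [1, 2, 3, 4, 5, 6]

# For the symmetric outcomes of a fair die, mean and median coincide.
print(statistics.mean(die))    # 3.5
print(statistics.median(die))  # 3.5 (average of the two middle values, 3 and 4)
```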

In the case of the cost function, as defined in the last video, this is even more complicated. You want the estimate of the probability itself to be a function that can be minimized. I don't think I know what that means yet.

The function then is:

P(y|x) = \hat{y}^{\,y}\,(1-\hat{y})^{(1-y)}

And because, for both y=0 and y=1, the estimate is best when this probability is as large as possible (\hat{y} close to 1 when y=1, close to 0 when y=0), we want to maximize P(y|x).
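A tiny sketch of why the single formula covers both cases (the sample value 0.9 for \hat{y} is just an example of mine): when y=1 the formula collapses to \hat{y}, and when y=0 it collapses to 1-\hat{y}.

```python
def p(y_hat, y):
    # P(y | x) = y_hat^y * (1 - y_hat)^(1 - y)
    return y_hat ** y * (1 - y_hat) ** (1 - y)

# If the true label is 1, P is just y_hat; if it is 0, P is 1 - y_hat.
print(p(0.9, 1))  # 0.9   -- confident and correct: high probability
print(p(0.9, 0))  # about 0.1 -- confident and wrong: low probability
```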

And then, if we accept that function, extending it to many x's isn't complex.

Finally, because L is defined as -\log P (not quite -P, but close in spirit), maximizing P is the same as minimizing L:

\large L(\hat{y}, y) = -\log P(y|x) = -\big(y\log\hat{y} + (1-y)\log(1-\hat{y})\big)

and indeed, because all we have in algorithm notation is \frac{dL}{d\hat{y}}, what we do in practice is minimize L.
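A minimal sketch of that last step, assuming the standard cross-entropy loss L = -(y log ŷ + (1-y) log(1-ŷ)) from the course (the learning rate, starting value, and clipping bounds are arbitrary choices of mine): doing gradient descent on \hat{y} using \frac{dL}{d\hat{y}} pushes \hat{y} toward the true label.

```python
import math

def loss(y_hat, y):
    # L(y_hat, y) = -(y*log(y_hat) + (1-y)*log(1-y_hat)) = -log P(y|x)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def dloss(y_hat, y):
    # dL/dy_hat = -y/y_hat + (1-y)/(1-y_hat)
    return -y / y_hat + (1 - y) / (1 - y_hat)

# Gradient descent on y_hat directly: minimizing L drives y_hat toward y.
y, y_hat = 1, 0.2
for _ in range(1000):
    y_hat -= 0.01 * dloss(y_hat, y)
    y_hat = min(max(y_hat, 1e-6), 1 - 1e-6)  # keep y_hat inside (0, 1)

print(round(y_hat, 3))  # close to 1.0: the loss pushed the estimate to the label
```

In the real algorithm \hat{y} is a function of the weights, so the same derivative is chained further back, but the direction of the update is the same idea.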