Maybe I missed it, but I can’t find where the videos define the formula for finding the variance of values within the leaf nodes. (This is for the optional week 4 video on regression trees.) In the video Andrew introduces the concept of variance, says not to worry about the equation for that slide, then fills in all the values. But he never goes back and gives the formula later.

Hi!

The variance for the node would be \dfrac{\sum_{i=1}^N(x_i - \mu)^2}{N}

Where \mu is the mean of the values of the node, x_i is an individual value and N is the number of values for that node. When splitting based on variance the idea is to make splits so that the variance of child nodes gets closer and closer to 0.
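In case it helps, here is a minimal Python sketch of that formula. The function name `node_variance` and the example values are made up for illustration and do not come from the course materials.

```python
def node_variance(values):
    """Variance of the values in one node, dividing by N as in the formula above."""
    n = len(values)
    mu = sum(values) / n  # mean of the node's values
    return sum((x - mu) ** 2 for x in values) / n


# Hypothetical node values, chosen only for illustration.
print(node_variance([2.0, 4.0, 6.0]))  # spread-out values give a positive variance
print(node_variance([3.0, 3.0, 3.0]))  # identical values give a variance of 0
```

A node whose values are all the same has variance 0, which is the ideal a variance-based split moves toward.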

Thanks! This is exactly what I’m looking for.

Hi Sam, I attempted to calculate the first variance using the formula above. Why did I get 1.17 instead of 1.47?

Please note that I used the 5 samples (7.2, 9.2, 8.4, 7.6, 10.2) with N = 5 and \mu = 8.52 in my calculation. I subtracted \mu from each number, squared the result, summed the squares, and divided the total by 5.

I also tried the second set of data and got a variance of 17.49 instead of 21.87. Could you shed some light on where I went wrong?

Thank you

Christina

Hello @Christina_Fan,

Most of us probably learned to compute variance the left way (dividing by n), but sometimes people choose the other way (dividing by n-1).

If the formula is sufficient for now, then it is good.

If you wonder why we would divide by n-1 and want to get into the mindset of a statistician, then as a starting point you might read the first three paragraphs in this section of Wikipedia, or google “population variance vs. sample variance” to find material that suits your learning style. However, you don’t need to get to the bottom of this to complete this specialization or to use decision trees in your work.

Cheers,

Raymond

It’s a statistics thing.

- If you have a sample, the divisor is (N-1).
- If you have the entire population, the divisor is (N).
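To check this against the five values quoted earlier in the thread, here is a short sketch using Python's standard `statistics` module. It only illustrates the two divisors and is not taken from the course code.

```python
import statistics

# The five values quoted earlier in the thread.
values = [7.2, 9.2, 8.4, 7.6, 10.2]

population_var = statistics.pvariance(values)  # divisor is N
sample_var = statistics.variance(values)       # divisor is N - 1

print(round(population_var, 4))  # 1.1776
print(round(sample_var, 4))      # 1.472
```

The sum of squared deviations from the mean of 8.52 is 5.888. Dividing by 5 gives about 1.18 and dividing by 4 gives about 1.47, so the 1.47 figure corresponds to the (N-1) divisor.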