Kernel density process question

I am trying to understand kernel density estimation (week 2 - Visualizing data: Kernel density estimation).

I feel I have understood the idea quite well except for a few details, and I would really appreciate it if someone who understands it more clearly than I do could explain.


We have n original data points, we draw a Gaussian distribution at each of these points, then we sum the Gaussians to get new points that reflect the overlap of the data.

Q1. Do we sum only at the n original data points, or (as other sources suggest) at fixed intervals across the x-axis?

Having done this, the next step in the lecture is to ‘multiply everything by 1/n and sum the curves’.

Q2a). I presume this division by n refers to the n original data points, which suggests that the summation in Q1 is indeed done only at the n original data points. But I don’t see why dividing by n at these n data points guarantees a continuous PDF whose area is 1?

Q2b) If the summation of Gaussians doesn’t occur only at the n original data points and is actually done at, say, m regular intervals, then would the multiplication be by 1/m?

Q3. Even if it is done at m regular intervals and everything is divided by m, I don’t quite understand why this would result in a PDF with an area of 1, when it seems that the area is determined less by the height of the n points and more by the parameters of the original Gaussian kernel. Or is it simply a given that this occurs?

Apologies if this is a basic misunderstanding.

Many thanks for any help anyone can give

Hello @bishopb,

I didn’t watch that lecture but I think I can guess what’s happening from your post.

  1. Each Gaussian (blue curve) centered on a data point (black stroke) spans the whole range of the x-axis. In other words, each Gaussian is non-zero everywhere, no matter how close to zero it looks at its tails. Therefore, when you sum the Gaussians, you go to each point x, ask for the value of each Gaussian at that point, get n values, and sum those n values. You do the same for all x, and you get the orange curve.

  2. Each Gaussian’s area is ONE. You have n Gaussians. Your orange curve is a sum of n Gaussians. Your orange curve’s total area is n \times 1. What would you do to make the orange curve’s area 1?
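
In case it helps to see the two points above numerically, here is a minimal sketch (not the lecture's code). The data values, the bandwidth `h`, and the helper names `gaussian_pdf`, `kde` and `area` are all made up for illustration. It sums the n Gaussians at every x on a grid, multiplies by 1/n, and checks the area for two different grid sizes m, so you can see that m only controls how finely the curve is drawn and does not enter the normalization.

```python
import numpy as np

def gaussian_pdf(x, mu, h):
    # Gaussian density centered at mu with standard deviation h; its area is 1.
    return np.exp(-0.5 * ((x - mu) / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def kde(x, data, h):
    # For every x, evaluate all n Gaussians (one per data point),
    # sum the n values, then multiply by 1/n.
    curves = gaussian_pdf(x[:, None], data[None, :], h)  # shape (len(x), n)
    return curves.sum(axis=1) / len(data)

# Hypothetical data and bandwidth, chosen only for this illustration.
data = np.array([1.0, 1.5, 2.2, 4.0, 4.3])
h = 0.5

def area(m, normalize=True):
    # Evaluate on m evenly spaced x values and integrate with the trapezoid rule.
    x = np.linspace(data.min() - 8 * h, data.max() + 8 * h, m)
    y = kde(x, data, h)
    if not normalize:
        y = y * len(data)  # plain sum of the n Gaussians, no 1/n factor
    dx = x[1] - x[0]
    return float(np.sum((y[:-1] + y[1:]) * 0.5) * dx)

areas = {m: area(m) for m in (200, 2000)}        # both close to 1
raw_area = area(2000, normalize=False)           # close to n = 5
print(areas, raw_area)
```

The grid size m changes only the resolution of the plotted curve. The factor 1/n comes from the number of Gaussians being summed, each of which contributes an area of one.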

Raymond

Thank you so much Raymond, that does indeed explain what I was struggling with - I had not considered that the tails are not zero-valued and also that each Gaussian will have area 1! It makes sense now.

Thank you again

also Raymond

Thank you both for the elaborate question and answer.
aborucu