In the Week 3 Bogota air quality exercise, a KNN algorithm is developed to estimate pollution levels between the pollution sensors. It computes an estimated pollution level for any location in the city as a distance-weighted average of the pollution at the K nearest sensors. Then an MAE is computed to arrive at an optimal value of K.
The estimate is the pollution-level-hat value, but since there is no actual pollution level at these locations, I don't understand how the MAE error term can be calculated.
–Bill Allen, firstname.lastname@example.org
You have the dataset, which includes the actual pollution levels at all of the sensor locations (with some missing values). The dataset stays more or less the same from Week 2 to Week 3, apart from the changes you make along the way in the various labs.
I am not sure which of the two labs you are referring to, but each lab has a utils.py file that defines the function that calculates the MAE, and that file will tell you how the source datasets are used. Again, I am not sure which function you mean, so the following might be irrelevant, but some of those functions simply split the dataset into two parts, fit the KNN to one part, and then evaluate the MAE on the other part.
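To make the distance-weighted estimate described above concrete, here is a minimal sketch of inverse-distance-weighted KNN interpolation. This is an illustration only, not the lab's actual utils.py code; the function name and the inverse-distance weighting scheme are my assumptions.

```python
import numpy as np

def knn_estimate(query_xy, sensor_xy, sensor_values, k=3, eps=1e-9):
    """Estimate pollution at query_xy as a distance-weighted average
    of the k nearest sensors. Illustrative sketch, not the lab's code."""
    # Distance from the query point to every sensor
    dists = np.linalg.norm(sensor_xy - query_xy, axis=1)
    # Indices of the k nearest sensors
    nearest = np.argsort(dists)[:k]
    # Inverse-distance weights (eps avoids division by zero at a sensor)
    weights = 1.0 / (dists[nearest] + eps)
    return np.average(sensor_values[nearest], weights=weights)

# Example: three sensors on a line with known readings
sensors = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
values = np.array([10.0, 20.0, 30.0])
est = knn_estimate(np.array([0.9, 0.0]), sensors, values, k=2)
```

The query point at x = 0.9 sits much closer to the sensor reading 20 than to the one reading 10, so the estimate lands close to 20.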
The dataset includes the pollution levels at the sensors. So, in the first analysis, the KNN model is trained by removing some of the actual data and estimating what the missing data might be. Therefore, when it comes up with an estimate, it can compare it to the actual pollution level to calculate the error (the MAE). It can then iterate to improve the accuracy of the estimates.
The second analysis estimates the pollution anywhere in the city by interpolating the values of the three nearest sensors. This gives an estimated pollution level at a location. But there is no sensor at that location, so there is no actual pollution level to compare it to. So I don't see how they can calculate the error.
I think you would have to look at the study (Week 3 Lab 2) to understand what I mean.
The dataset only has the pollution at the sensor locations. The analysis I’m referring to is estimating the pollution at locations where there is no sensor, so there is no actual pollution level to compare the estimate to.
I believe you were talking about Week 3 Lab 2 Section 4.
If, as you said, we were evaluating at locations without sensors, then I would agree that we couldn't possibly estimate the MAE. However, there are two things you might want to look at for a different view:
1. The lab's text, which says "evaluate … at each sensor station", so there is a sensor at each evaluated location.
2. The behind-the-scenes code inside the function utils.calculate_mae_for_k. To read the code, open the utils.py file by clicking "File" > "Open" in the Jupyter Notebook interface.
Regarding point 2: although this course does not expect intensive code reading, you might want to look at the code for the actual calculation steps, because the code shows literally every detail. A simplified idea of what happens in that function is that it splits full_dataset into a training dataset and an evaluation dataset, "fits" the k-nearest-neighbour model to the training dataset, and then evaluates the model on the evaluation dataset. The evaluation step is on existing data points that have a sensor value.
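The split-fit-evaluate idea can be sketched as follows. This is a simplified illustration of the general pattern, not the actual body of utils.calculate_mae_for_k; the function name, split fraction, and weighting details here are my assumptions.

```python
import numpy as np

def mae_for_k(sensor_xy, sensor_values, k, test_fraction=0.3, seed=0):
    """Hold out some sensors, predict their readings from the remaining
    sensors with distance-weighted KNN, and average the absolute errors.
    Simplified sketch, not the lab's actual implementation."""
    rng = np.random.default_rng(seed)
    n = len(sensor_values)

    # Randomly hold out a fraction of the sensors for evaluation
    test_idx = rng.choice(n, size=max(1, int(n * test_fraction)), replace=False)
    train_mask = np.ones(n, dtype=bool)
    train_mask[test_idx] = False

    errors = []
    for i in test_idx:
        # Distances from the held-out sensor to every training sensor
        dists = np.linalg.norm(sensor_xy[train_mask] - sensor_xy[i], axis=1)
        nearest = np.argsort(dists)[:k]
        weights = 1.0 / (dists[nearest] + 1e-9)
        pred = np.average(sensor_values[train_mask][nearest], weights=weights)
        # The held-out location HAS a real sensor reading to compare against
        errors.append(abs(pred - sensor_values[i]))
    return float(np.mean(errors))
```

The key point is in the last comment: every evaluated location is a real sensor that was deliberately held out, so an actual reading exists and the error is well defined.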
I agree that the really amazing thing is for the model to predict at locations without sensors, but a reasonable strategy is to first estimate the reliability of such an approach at locations that do have sensors.
We can dig deeper if you want, but the reason I presented the facts and comments above is that I hope we are at least on the same page that the MAE is evaluated at locations that have sensors.