AI for Good - Public Health - MAE for KNN algorithm

Bill_Allen · July 31, 2023, 7:23am

In the Week 3 Bogota air quality exercise, a KNN algorithm is developed to estimate the pollution levels between the pollution sensors. It computes an estimated pollution level for any location in the city as a distance-weighted average of the pollution at the K nearest sensors. Then an MAE is computed to arrive at an optimum value of K.

The estimate is the pollution-level-hat value, but there is no actual pollution level at these locations, so I don’t understand how the MAE error term can be calculated.

Any ideas?
–Bill Allen, awallen2@alaska.edu

Mubsi · July 31, 2023, 7:44am

Hi @Bill_Allen,

You have the dataset, which includes the actual pollution levels of all of the locations (and missing values). The dataset is more or less the same from week 2 to week 3, except for the changes you make along the way in the various labs.

Best,
Mubsi

rmwkwok · July 31, 2023, 7:49am

Hello @Bill_Allen,

I am not sure which of the two labs you were talking about, but each lab has a utils.py file that defines the function that calculates the MAE. Those functions will tell you how the source datasets are used. Again, I am not sure which function you were talking about, so the following might be irrelevant, but some of those functions simply split the dataset into two parts, fit the KNN to one part, and then evaluate the MAE on the other part.

Cheers,
Raymond

Bill_Allen · July 31, 2023, 6:03pm

The data set includes the pollution levels at the sensors. So, in the first analysis, the neural net is trained by removing some of the actual data and estimating what the missing data might be. Therefore, when it comes up with an estimate, it can compare it to the actual pollution to calculate the error (the MAE). Then it can iterate to improve the accuracy of the estimates.

The second analysis is estimating the pollution anywhere in the city by interpolating the values of the nearest 3 sensors. This gives an estimated pollution at a location. But there is no sensor at the location, so there is no actual pollution to compare it to. So, I don’t see how they can calculate the error.

I think you would have to look at the study (week 3 lab 2) to understand what I mean.

Bill_Allen · July 31, 2023, 6:09pm

The dataset only has the pollution at the sensor locations. The analysis I’m referring to is estimating the pollution at locations where there is no sensor, so there is no actual pollution level to compare the estimate to.

rmwkwok · August 1, 2023, 1:13am

Hello @Bill_Allen,

I believe you were talking about Week 3 Lab 2 Section 4.

If, as you said, we were evaluating at locations without a sensor, then I agree that we couldn’t possibly calculate the MAE. However, there are two things you might want to read for a different view.

1. The lab’s text, which says “evaluate … at each sensor station”, so there is a sensor.

2. The behind-the-scenes code inside the function utils.calculate_mae_for_k. To read the code, you need to open the utils.py file by clicking “File” > “Open” in the Jupyter Notebook interface.

For point number 2, although this course does not expect intensive code reading, it is there in case you want to look at the actual calculation steps, because the code shows every detail. A simplified general idea of what happens in that function is that it splits the full_dataset into a training dataset and an evaluation dataset, “fits” the k-nearest-neighbour model to the training dataset, and then evaluates the model on the evaluation dataset. The evaluation step is therefore done on existing data that has sensor values.
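To make the idea concrete, here is a minimal sketch of that kind of hold-out evaluation. It is not the lab's utils.calculate_mae_for_k: the function names, the inverse-distance weighting, the planar Euclidean distance, the 25% hold-out fraction, and the synthetic sensor data are all assumptions made for illustration.

```python
import numpy as np

def knn_idw_predict(train_xy, train_values, query_xy, k, eps=1e-9):
    """Distance-weighted average of the k nearest training sensors.

    Assumes inverse-distance weights and planar Euclidean distance;
    the lab may use a different weighting or distance.
    """
    # Pairwise distances, shape (n_query, n_train)
    d = np.linalg.norm(query_xy[:, None, :] - train_xy[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]           # indices of the k nearest sensors
    nearest_d = np.take_along_axis(d, nearest, axis=1)
    w = 1.0 / (nearest_d + eps)                      # eps avoids division by zero
    return (w * train_values[nearest]).sum(axis=1) / w.sum(axis=1)

def mae_for_k(xy, values, k, test_fraction=0.25, seed=0):
    """Hold out some sensors, predict them from the rest, and return the MAE."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(values))
    n_test = max(1, int(round(len(values) * test_fraction)))
    test, train = perm[:n_test], perm[n_test:]
    pred = knn_idw_predict(xy[train], values[train], xy[test], k)
    # The held-out sensors have real readings, so the error is well defined.
    return float(np.mean(np.abs(pred - values[test])))

# Synthetic stand-in for sensor data (NOT the Bogota dataset).
rng = np.random.default_rng(42)
sensor_xy = rng.uniform(0, 10, size=(40, 2))
pm25 = 20 + 2 * sensor_xy[:, 0] + rng.normal(0, 1, size=40)

mae_by_k = {k: mae_for_k(sensor_xy, pm25, k) for k in range(1, 6)}
best_k = min(mae_by_k, key=mae_by_k.get)
print(mae_by_k, "best k:", best_k)
```

The key point is in mae_for_k: every location where an error is computed is a held-out sensor, so an actual reading exists to compare against. The chosen K is then reused for locations that have no sensor.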

I agree that the really amazing thing is for the model to predict at locations without a sensor, but a reasonable strategy is to first estimate the reliability of such an approach at locations with sensors.

We can dig deeper if you want, but I presented the above facts and comments because I hope we are at least on the same page that we evaluate at locations that have sensors.

Cheers,
Raymond
