If I understand correctly, the L-infinity norm of a vector is defined as the maximum absolute value of its components.
In TFDV it is also used by the skew comparator for categorical features, but how exactly? What is the intuition behind the 0.1 threshold? Are we finding the maximum of some vectors?
It would be really helpful if someone could provide a numerical example.
An anomaly is raised when the L-infinity norm of the vector representing the difference between the normalized counts from feature.string_stats.rank_histogram in the control statistics (i.e., serving statistics for skew or previous statistics for drift) and the treatment statistics (i.e., training statistics for skew or current statistics for drift) is greater than feature.skew_comparator.infinity_norm.threshold or feature.drift_comparator.infinity_norm.threshold.
The L-infinity norm of a vector is indeed defined as the maximum absolute value of its components. In TFDV (TensorFlow Data Validation), the skew comparator for categorical features uses the L-infinity norm to measure the largest difference between the normalized category counts of two datasets (training vs. serving for skew, current vs. previous for drift). In the example below, one of the two distributions plays the role of the "expected" frequencies and the other the "observed" frequencies.
Example:
Let’s say we have a categorical feature “color” with three possible values: “red”, “blue”, and “green”. We also have a dataset with 1000 examples, and we expect each color to appear with a frequency of 1/3 (i.e., we expect 333 examples of each color).
However, if we observe that the “red” color appears 400 times, the “blue” color appears 400 times, and the “green” color appears only 200 times, we can say that the “green” color is skewing the distribution. The skewness of the “green” color can be measured using the L-infinity norm of the vector of differences between the observed and expected frequencies:

||p_observed - p_expected||_inf = max(|0.4 - 1/3|, |0.4 - 1/3|, |0.2 - 1/3|) ≈ 0.133

where ||.||_inf denotes the L-infinity norm. In this case, the maximum absolute difference is |200/1000 - 333/1000| ≈ 0.133, which is greater than the threshold of 0.1. Therefore, we would conclude that the skewness of the “green” color is significant.
If the threshold were 0.2, then we would conclude that the skewness of the “green” color is not significant.
Or let’s say “green” appears 350 times, “red” appears 320 times, and “blue” appears 330 times. Then, keeping the 0.1 threshold, the maximum absolute difference is |350/1000 - 1/3| ≈ 0.017, which is below 0.1, so we would conclude that the skewness of the “color” feature is not significant, and we don’t need to address it.
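To make the arithmetic concrete, here is a minimal sketch in plain Python/NumPy (not the TFDV internals) that reproduces the two scenarios above:

```python
import numpy as np

def l_infinity_distance(observed_counts, expected_counts):
    """Maximum absolute difference between two normalized count vectors."""
    observed = np.asarray(observed_counts, dtype=float)
    expected = np.asarray(expected_counts, dtype=float)
    observed_freq = observed / observed.sum()
    expected_freq = expected / expected.sum()
    return np.max(np.abs(observed_freq - expected_freq))

# Expected (roughly uniform) counts for red, blue, green.
expected = [333, 333, 333]

# Scenario 1: red=400, blue=400, green=200 -> ~0.133 > 0.1 (skew flagged)
print(l_infinity_distance([400, 400, 200], expected))

# Scenario 2: red=320, blue=330, green=350 -> ~0.017 < 0.1 (no skew flagged)
print(l_infinity_distance([320, 330, 350], expected))
```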
The intuition behind the 0.1 threshold is that it represents the maximum allowable difference between the observed and expected normalized frequencies of any single category. If the L-infinity norm of the difference vector is greater than 0.1, the skewness is considered significant and needs to be addressed. This matters most when the training and serving distributions diverge noticeably.
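For completeness, this is roughly how the threshold is configured and checked with the TFDV API. This is only a sketch: “color” is the hypothetical feature from the example, and the train.csv/serving.csv paths are placeholders for your own data.

```python
import tensorflow_data_validation as tfdv

# Compute statistics for training and serving data (placeholder file paths).
train_stats = tfdv.generate_statistics_from_csv('train.csv')
serving_stats = tfdv.generate_statistics_from_csv('serving.csv')

# Infer a schema from the training statistics.
schema = tfdv.infer_schema(train_stats)

# Set the L-infinity skew threshold for the hypothetical 'color' feature.
color = tfdv.get_feature(schema, 'color')
color.skew_comparator.infinity_norm.threshold = 0.1

# Compare training vs. serving statistics; an anomaly is reported if the
# L-infinity distance between the normalized 'color' counts exceeds 0.1.
skew_anomalies = tfdv.validate_statistics(
    statistics=train_stats,
    schema=schema,
    serving_statistics=serving_stats)
tfdv.display_anomalies(skew_anomalies)
```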
Hey @JazzKaur
Here I saw you explained the definition and the calculation of the L-infinity norm for a categorical feature. I haven’t yet found a concrete definition for numerical features across different datasets — is it defined as the maximum absolute difference between corresponding values of the feature in the two datasets? And do we set a threshold for each feature, so that if any threshold is exceeded, an anomaly alert is raised for the corresponding feature?