How is chebyshev distance between two datasets for a categorical feature calculated? Can someone provide an example?
First of all, welcome to our community!
Your question is truly intriguing.
I think we should refer to Chebyshev's definition of distance.
This definition imposes some requirements:
the subtraction operation must be defined on the domain of x and y
the abs operation must be defined on the result of the subtraction
the max operation must be defined on the result of the abs
All of these operations are properly defined for numerical data.
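For numerical data the three requirements above are all met, and the definition reduces to the maximum absolute coordinate-wise difference. A minimal sketch:

```python
def chebyshev(x, y):
    """Chebyshev (L-infinity) distance: the maximum absolute
    difference between corresponding coordinates of x and y."""
    return max(abs(a - b) for a, b in zip(x, y))

# max(|1-3|, |4-1|, |2-2|) = max(2, 3, 0) = 3
print(chebyshev([1, 4, 2], [3, 1, 2]))  # → 3
```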
For categorical data, I think the Chebyshev distance can be defined only if some kind of ordering is defined on the feature domain.
Do you agree?
I didn’t find an alternative definition of the Chebyshev distance that applies to categorical features.
Please let me know your opinion.
Thanks for replying. I had just completed week 1 of MLEP course 2, and there was one lecture (‘Tensorflow Data Validation’) which mentioned that the calculation of data skew between two datasets is currently only implemented for categorical features in TFDV, and that the skew is quantified using the Chebyshev distance. The same thing is also mentioned in the description of tfdv.validate_statistics().
In the C2_W1 assignment, as an example, the threshold for this distance was set to 0.03. I am guessing (purely guessing, not sure) that the Chebyshev distance between two datasets for a categorical feature is calculated in the following way:
Each unique value (from the feature’s domain) represents one dimension of the vector summarizing the categorical feature’s stats for a dataset, and the value in each dimension is the fraction of the dataset’s samples containing that unique value. (E.g.: say we have a feature called gender, with 40 samples where gender = ‘male’ and 60 samples where gender = ‘female’; the vector representation would then be [0.4, 0.6].)
We can then calculate the Chebyshev distance for the categorical feature between the two datasets.
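Purely as an illustration of this guess (the helper names below are mine, not TFDV’s actual implementation), the two steps can be sketched as:

```python
from collections import Counter

def value_fractions(samples, domain):
    """Fraction of samples taking each value in the feature's domain
    (hypothetical helper, for illustration only)."""
    counts = Counter(samples)
    n = len(samples)
    return [counts[v] / n for v in domain]

def chebyshev(x, y):
    """Chebyshev distance: maximum absolute coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(x, y))

domain = ['male', 'female']
training = ['male'] * 40 + ['female'] * 60   # → [0.4, 0.6]
serving  = ['male'] * 43 + ['female'] * 57   # → [0.43, 0.57]

p = value_fractions(training, domain)
q = value_fractions(serving, domain)
print(chebyshev(p, q))  # ≈ 0.03, right at a skew threshold of 0.03
```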
What you propose sounds good to me.
In other words, you have found a way to link numerical values to the feature values in the domain, so the Chebyshev definition can be applied again to the new vector ([0.4, 0.6]).
Thanks a lot for sharing your conclusion.