Is Visualisation needed for exploratory analysis?

  1. In data validation, does one really need visualisation, or can most of these checks be automated? For example, use ANOVA to check for variation in numeric features, and KL divergence or chi-square tests to measure differences between the eval and training datasets for categorical features (see the sketch after this list).

  2. At the data validation stage, what do we do with categorical features that have large domains and a high number of empty values? Is the model expected to know how to handle them?

  3. Isn’t the covariance between features also necessary for data validation? For example:
    In the training set, categorical feature A has a domain of 10 values and categorical feature B also has a domain of 10 values.
    In the evaluation set, features A and B individually show distributions identical to the training set, but the distribution of A x B does not. Doesn’t this pose a problem for model accuracy?
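
For point 1, here is a minimal sketch of what such automated checks might look like, assuming hypothetical pandas DataFrames `train_df` and `eval_df` with a numeric column "amount" and a categorical column "color" (all names are made up for illustration):

```python
import numpy as np
import pandas as pd
from scipy import stats

def anova_p_value(train: pd.Series, eval_: pd.Series) -> float:
    """One-way ANOVA: does a numeric feature have the same mean in both splits?"""
    _, p_value = stats.f_oneway(train.dropna(), eval_.dropna())
    return p_value

def chi_square_p_value(train: pd.Series, eval_: pd.Series) -> float:
    """Chi-square test on the per-category counts of the two splits."""
    counts = pd.concat([train.value_counts(), eval_.value_counts()], axis=1).fillna(0)
    _, p_value, _, _ = stats.chi2_contingency(counts.to_numpy())
    return p_value

def kl_divergence(train: pd.Series, eval_: pd.Series) -> float:
    """KL divergence of the eval category distribution from the training one."""
    p = train.value_counts(normalize=True)
    q = eval_.value_counts(normalize=True).reindex(p.index, fill_value=1e-9)
    return float(np.sum(p * np.log(p / q)))

# Flag a feature when its p-value drops below a chosen threshold, e.g. 0.01,
# or its KL divergence exceeds a chosen cutoff.
print(anova_p_value(train_df["amount"], eval_df["amount"]))
print(chi_square_p_value(train_df["color"], eval_df["color"]))
print(kl_divergence(train_df["color"], eval_df["color"]))
```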

  1. You can perform a number of statistical tests to confirm the nature of the features in a given dataset (see the sketch after this list). The visualization produced by tfdv includes point summaries, simple missing-data statistics, and a 10-bin histogram showing the distribution of data points. Given the computational cost of running these statistical tests over a very large number of data points, visualization often provides quicker feedback.
  2. I haven’t seen a dataset with the attributes you’ve mentioned. Imputation techniques are often used to fill missing values. Also, see whether a method like clustering can help reduce the number of categories and thereby reduce the number of missing values (a simple sketch follows this list).
  3. Please explain AxB keeping this in perspective. Are you referring to a feature cross?
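
For reference, a minimal tfdv sketch of point 1 could look like this (the DataFrames are hypothetical); note that tfdv can also flag anomalies automatically rather than relying on the plot alone:

```python
import tensorflow_data_validation as tfdv

# Compute statistics over hypothetical training and eval DataFrames.
train_stats = tfdv.generate_statistics_from_dataframe(train_df)
eval_stats = tfdv.generate_statistics_from_dataframe(eval_df)

# Point summaries, missing-value counts and per-feature histograms, side by side.
tfdv.visualize_statistics(lhs_statistics=eval_stats, rhs_statistics=train_stats,
                          lhs_name="EVAL", rhs_name="TRAIN")

# Automated checks: infer a schema from training data, then validate eval against it.
schema = tfdv.infer_schema(train_stats)
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)
```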
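
And for point 2, one simple stand-in for the clustering idea, assuming a hypothetical DataFrame `df` with a high-cardinality categorical column "merchant": fill the gaps with an explicit token and fold the long tail of rare categories together.

```python
import pandas as pd

# Treat missingness as its own category rather than dropping rows.
df["merchant"] = df["merchant"].fillna("missing")

# Keep only categories that cover a meaningful share of rows; bucket the rest.
shares = df["merchant"].value_counts(normalize=True)
frequent = set(shares[shares >= 0.01].index)   # the 1% threshold is arbitrary
df["merchant"] = df["merchant"].where(df["merchant"].isin(frequent), "other")
```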

Thanks Mentor.

Follow-up on 3: "Please explain AxB keeping this in perspective. Are you referring to a feature cross?"

Let’s say A = [1, 2, …, 10] - 10 distinct values
Let’s say B = [1, 2, …, 10] - 10 distinct values
Thus A x B = 100 possible values

In the training set, let’s say A & B appear with a pattern:
A,B
1,1
2,2
1,2
2,1
3,3
4,4
5,5
(far fewer than the 100 possible values of A x B)

But the serving set shows all 100 possible values.

Hope this helps clarify.

You’ve performed a feature cross, i.e. a cross join in database terms.
For a model to be effective, the distribution of the training dataset should be as close as possible to that of the serving dataset. Odds are good that you are observing a covariate shift or dataset shift between the training and serving datasets.
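
One quick way to confirm the gap, using the toy numbers from your example (the DataFrames below are just a reconstruction of them):

```python
from itertools import product

import pandas as pd

# Training pairs from the example above; serving covers all 100 combinations.
train_df = pd.DataFrame({"A": [1, 2, 1, 2, 3, 4, 5],
                         "B": [1, 2, 2, 1, 3, 4, 5]})
serving_df = pd.DataFrame(list(product(range(1, 11), repeat=2)), columns=["A", "B"])

train_crosses = set(zip(train_df["A"], train_df["B"]))
serving_crosses = set(zip(serving_df["A"], serving_df["B"]))

unseen = serving_crosses - train_crosses
print(f"{len(unseen)} of {len(serving_crosses)} serving crosses never appear in training")
# -> 93 of 100 serving crosses never appear in training
```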

Since the model can learn only from what it has seen, we have two options:

  1. Add rows to the training dataset that cover the feature space seen in the serving dataset, then retrain the model. (preferred)
  2. Map feature crosses that don’t exist in the training data to an explicit "unknown" category, and simulate such unknown crosses during training so the model learns how to handle them (see the sketch below).
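
A minimal sketch of option 2, reusing the hypothetical DataFrames from the earlier snippet: any A x B combination not seen during training collapses into a single "unknown" cross, and a small fraction of training rows is bucketed the same way so the model actually learns a weight for "unknown".

```python
import numpy as np
import pandas as pd

def build_cross(df: pd.DataFrame, known_crosses: set) -> pd.Series:
    """Cross A and B, mapping combinations outside `known_crosses` to 'unknown'."""
    pairs = zip(df["A"], df["B"])
    return pd.Series(
        [f"{a}_x_{b}" if (a, b) in known_crosses else "unknown" for a, b in pairs],
        index=df.index,
    )

train_crosses = set(zip(train_df["A"], train_df["B"]))

train_df["A_x_B"] = build_cross(train_df, train_crosses)      # all crosses known
serving_df["A_x_B"] = build_cross(serving_df, train_crosses)  # unseen -> "unknown"

# Simulate unknown crosses at training time by randomly bucketing a small
# fraction of rows, so "unknown" is not an empty category when the model trains.
rng = np.random.default_rng(seed=0)
mask = rng.random(len(train_df)) < 0.05
train_df.loc[mask, "A_x_B"] = "unknown"
```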