Scaling for extremely large datasets

Hi Deeplearning,

First, to report an issue: in C2W2, the link to the “Scaling” article in “Optional References” seems wrong. It points to https://raw.githubusercontent.com/jbrownlee/Datasets/master/shampoo.csv

I also have a question about performing scaling on large datasets. If I have to load the data chunk by chunk and use Z-score normalization, I need two passes to understand the distribution of the features. How does this two-pass processing fit into the TFX workflow? Is there an example of doing so? Does any additional method need to be used?

Thanks for reporting. The staff have been notified to fix the “Scaling” link.

Two passes is a reasonable solution for computing the z-score, since it depends on the standard deviation. See here to understand how means and variances are merged at the batch level.
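The batch-level merging mentioned above can be sketched in plain NumPy. This is an illustrative implementation of the pairwise merge formula for counts, means, and sums of squared deviations (Chan et al.'s parallel variance algorithm), not a TFX API; the function and variable names are my own:

```python
import numpy as np

def merge_stats(n_a, mean_a, m2_a, n_b, mean_b, m2_b):
    """Merge two chunks' count, mean, and sum of squared deviations (M2).

    Uses the pairwise-merge formula so chunks never need to be revisited.
    """
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2

# Simulated "large" dataset processed chunk by chunk.
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)
chunks = np.array_split(data, 10)

n, mean, m2 = 0, 0.0, 0.0
for chunk in chunks:
    c_n = chunk.size
    c_mean = chunk.mean()
    c_m2 = ((chunk - c_mean) ** 2).sum()
    n, mean, m2 = merge_stats(n, mean, m2, c_n, c_mean, c_m2)

std = np.sqrt(m2 / n)  # population standard deviation

# The merged statistics match the full-dataset statistics,
# so a second pass can z-score each chunk as (chunk - mean) / std.
assert np.isclose(mean, data.mean())
assert np.isclose(std, data.std())
```

With the global mean and standard deviation accumulated this way in the first pass, the second pass only has to apply `(chunk - mean) / std` to each chunk.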