Normalization vs. Standardization

In the two assignments in week 2, we performed normalization on each row of the X matrix (i.e. normalizing each feature), and standardized each column of the X matrix (i.e. standardizing each training example).

Why should we normalize the rows and standardize the columns?

Why shouldn’t we normalize the columns and standardize the rows?


Hi, @Yuchen_Zhang . I am not sure what you mean. In the (graded) assignment, Exercise 2, you are presented with a filled-in cell in which you “standardize” the data:

train_set_x = train_set_x_flatten / 255.
test_set_x = test_set_x_flatten / 255.

The division by 255 is an element-by-element operation. Every value in the ..._flatten matrices is divided by 255 (pixel intensities range from 0 to 255) to keep the values between 0 and 1 (including endpoints). Each column represents an image (with 12288 pixel values), so the number of columns equals the number of example images. The key point here is that every element of the ..._flatten matrices represents a pixel intensity.
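As a minimal sketch of that element-wise scaling, here is the same division applied to a small made-up array standing in for the real image data (the shape and values are illustrative assumptions, not the assignment's dataset):

```python
import numpy as np

# Hypothetical stand-in for train_set_x_flatten:
# 4 pixel values (rows) x 3 images (columns), values in 0..255.
train_set_x_flatten = np.array([[0, 51, 255],
                                [102, 204, 0],
                                [255, 255, 153],
                                [0, 0, 0]], dtype=np.uint8)

# Element-wise division: every entry is divided by 255,
# with no distinction between rows and columns.
train_set_x = train_set_x_flatten / 255.

print(train_set_x.shape)                      # same shape as the input
print(train_set_x.min(), train_set_x.max())   # values now lie in [0, 1]
```

Because no per-row or per-column statistic is involved, the result is the same whether you think of the matrix as rows of features or columns of examples.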

The point of confusion may be as follows. Typically, a standardization operation involves a bit more than that, because the goal is to have each feature (a row in the feature matrix) measured in comparable units. After all, features are not generally pixel values of an image, which already have comparable units. For example, in house-price prediction, the features may be quantities such as square footage, number of rooms, number of baths, acreage, etc.

As an example, if X is an n_x \times m feature matrix, we might do this by subtracting the mean of each feature and then dividing by its standard deviation. To do this we need the row-wise means and standard deviations: X_mean = np.mean(X, axis=1, keepdims=True) and X_stdev = np.std(X, axis=1, keepdims=True). The standardized feature matrix then becomes

(X - X_{mean}) / X_{stdev}.

Note that a small computational miracle happens here. As noted earlier, X is an n_x \times m matrix, but X_{mean} and X_{stdev} hold only one value per feature. Provided they are kept as n_x \times 1 column vectors (for example, by passing keepdims=True to np.mean and np.std), the “broadcasting” operation in the numerator automatically subtracts the X_{mean} vector from each column in the X matrix, and a similar broadcasting operation handles the division.
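A minimal sketch of this row-wise standardization, assuming a toy feature matrix with made-up house-price features (the values are hypothetical). keepdims=True is used so that the means and standard deviations have shape (n_x, 1) and broadcast across the m columns:

```python
import numpy as np

# Hypothetical feature matrix: n_x = 3 features (rows), m = 4 examples (columns).
# Rows: square footage, number of rooms, acreage (made-up values).
X = np.array([[1500., 2000., 2500., 3000.],
              [3., 4., 4., 5.],
              [0.25, 0.5, 0.5, 1.0]])

# Row-wise statistics, kept as (n_x, 1) column vectors so that they
# broadcast against the (n_x, m) matrix X.
X_mean = np.mean(X, axis=1, keepdims=True)
X_stdev = np.std(X, axis=1, keepdims=True)

# Broadcasting subtracts each feature's mean from every column,
# then divides by that feature's standard deviation.
X_standardized = (X - X_mean) / X_stdev

print(X_mean.shape, X_stdev.shape)      # (3, 1) (3, 1)
print(X_standardized.mean(axis=1))      # approximately 0 for each feature
print(X_standardized.std(axis=1))       # approximately 1 for each feature
```

Without keepdims=True, the statistics would have shape (n_x,), which NumPy tries to align with the last axis of X (length m), so the subtraction would fail whenever m differs from n_x.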

I hope this helps!