Based on the feedback, I think the 3rd choice should also be selected as correct, but I couldn’t figure out why. My thinking goes as follows: suppose the predicted pixel value is 0.4. The L1 loss against the group with average brightness 0.3 would be 0.4 - 0.3 = 0.1, and the L2 loss would be 0.1^2 = 0.01. This says to me that L2 gives a smaller loss, hence less (not more) punishment to the prediction.
Using the square of the pixel differences as the penalty can introduce challenges related to fairness in image generation. With this penalty, the model magnifies and prioritizes larger errors between the generated and ground-truth images. This can become problematic when working with images of people from two groups with significantly different average pixel brightness (0.3 for one group and 0.9 for the other).
The issue arises because the square of the pixel differences (L2) is more sensitive to larger deviations, meaning that the generator may become biased toward reducing errors for the group with higher average brightness (0.9) while neglecting the group with lower average brightness (0.3). As a result, the generated images may be more faithful to the characteristics of the brighter group while failing to adequately capture the subtleties and nuances of the darker group.
If the model focuses primarily on reproducing images from the brighter group, it can reinforce existing bias, marginalizing the underrepresented group with lower brightness. In applications where equitable representation and unbiased outcomes are essential, this imbalance in image generation would be unfair, undesirable, and possibly unethical.
To address these bias issues, it is better to use the absolute value of the pixel differences (L1), which treats all errors equally regardless of their magnitude. The generator is then encouraged to produce more balanced results for both groups, and potential biases related to the brightness difference can be diminished.
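To put rough numbers on this, here is a toy sketch in Python. The two group averages (0.3 and 0.9) come from the question, but the generator output of 0.8 is a made-up value chosen to illustrate a model that currently fits the brighter group better:

```python
# Hypothetical setup: one predicted pixel value, two ground-truth group averages.
y_pred = 0.8
groups = {"darker group (avg 0.3)": 0.3, "brighter group (avg 0.9)": 0.9}

for name, y_true in groups.items():
    err = y_pred - y_true
    l1 = abs(err)    # L1: penalty is linear in the error
    l2 = err ** 2    # L2: penalty is quadratic in the error
    print(f"{name}: L1 = {l1:.2f}, L2 = {l2:.2f}")

# L1 penalties: 0.50 vs 0.10 -> the darker group weighs 5x more.
# L2 penalties: 0.25 vs 0.01 -> the darker group weighs 25x more.
# Squaring changes the *relative* weights, concentrating the training signal
# on whichever group the generator currently fits worst.
```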
Thanks for the detailed explanation of the biases! It’s very helpful! I am still trying to figure out this sentence:
I understand that L1 loss treats every additional unit of distance “equally” since it is linear, but how is L2 magnifying the larger errors rather than reducing them? Do you have an analytical/mathematical explanation?
Thanks for your illustrative example. I understand that L2 will magnify the losses if the error is > 1.
However, the pixel values given in the question seem to be in the range [0, 1], and I somehow assumed the predicted pixel value should also be somewhere in [0, 1]. This puts the absolute error for each example in [0, 1] as well, and then L2 is not magnifying but instead shrinking the larger errors compared to L1. Here’s an example:
Suppose we have two errors: -0.1 and 0.2. With L1, the average error would be:
L1 = (1/2) * (|-0.1| + |0.2|) = 0.15
And with L2, the average error would be:
L2 = (1/2) * ((-0.1)^2 + 0.2^2) = 0.025
Perhaps I should not have assumed that the predicted pixel value is also in [0, 1]? But this seems a reasonable assumption once we train the model for a few iterations and the predicted values get closer to the real pixel values.
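In case it helps, a quick snippet reproducing the two averages above (assuming plain per-example averaging, as in the formulas):

```python
errors = [-0.1, 0.2]   # the two example errors

l1 = sum(abs(e) for e in errors) / len(errors)   # (0.1 + 0.2) / 2 = 0.15
l2 = sum(e ** 2 for e in errors) / len(errors)   # (0.01 + 0.04) / 2 = 0.025

print(f"L1 = {l1:.2f}, L2 = {l2:.3f}")   # L1 = 0.15, L2 = 0.025
```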
@Mengying_Zhang1 please consider small and large errors separately. Your example uses small errors, so in that case L2 yields the smaller penalty, but we are talking about large errors (a larger difference between y_{pred} and y_{true}).
Case 1: Small Error (e.g., y_{pred} = 0.8, y_{true} = 0.7)
L1 = |0.8 - 0.7| = 0.1
L2 = (0.8 - 0.7)^2 = 0.01
Case 2: Large Error (e.g., y_{pred} = 0.9, y_{true} = 0.1)
L1 = |0.9 - 0.1| = 0.8
L2 = (0.9 - 0.1)^2 = 0.64
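To see the pattern across the whole allowed range (a small sketch, assuming pixel values in [0, 1] as you do):

```python
# Sweep the possible |y_pred - y_true| values when pixels lie in [0, 1].
for e in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    print(f"|error| = {e:.1f}: L1 = {e:.2f}, L2 = {e ** 2:.2f}")

# L2 never exceeds L1 on [0, 1] -- but look at the growth rate:
# doubling the error doubles the L1 penalty and quadruples the L2 penalty.
```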
It seems that in both the small and large error cases you provided, L2 has a smaller loss than L1? So I’m still having trouble understanding the claim that L2 magnifies the larger errors.
That comparison makes sense to me. However, since L2 was able to magnify the error in Case 2, it will help the model learn through bigger punishments when it predicts a high pixel value (0.9) for the low-pixel-value group (0.1). Wouldn’t that be desirable?
Let us consider L2 in more detail under Case 1 and Case 2, remembering that L2 = (y_{pred} - y_{true})^2:
Case 1: small error
L2 = (0.8 - 0.7)^2 = 0.01
Case 2: large error
L2 = (0.9 - 0.1)^2 = 0.64
The disadvantage of using L2 loss in this example lies in how it treats larger errors far more significantly than L1 loss does. When an error is larger, squaring it makes it even more significant in the overall loss calculation, so the GAN becomes sensitive to outliers, which can have a substantial impact on the L2 loss. For instance, in “Case 2” with a large error, the squared error becomes 0.64, a considerably high loss. Note the relative weighting: under L1 the large error counts 0.8 / 0.1 = 8x as much as the small one, but under L2 it counts 0.64 / 0.01 = 64x as much. Outliers can therefore disproportionately affect the model’s optimization process and lead to suboptimal performance, which would be undesirable.
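To give the analytical/mathematical view asked for earlier: what actually drives learning is the gradient of the loss with respect to the prediction, and that is where the magnification is easiest to see. A minimal sketch (single pixel, analytic derivatives, no framework assumed):

```python
# For error e = y_pred - y_true:
#   dL1/de = sign(e)  -> magnitude 1 for every nonzero error, large or small
#   dL2/de = 2 * e    -> grows linearly with the error itself
for e in (0.1, 0.8):
    print(f"|error| = {e}: L1 gradient = 1.00, L2 gradient = {2 * e:.2f}")

# Under L1, both errors pull on the generator equally hard (1.00 vs 1.00).
# Under L2, the 0.8 error pulls 8x harder than the 0.1 error (1.60 vs 0.20):
# this relative reweighting toward large errors is the "magnification",
# even though the raw L2 value is numerically smaller on [0, 1].
```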
Speaking more generally, when would using an L2 loss be better? Is it the case that L2 always distorts the loss space, so we should be careful with it even outside of GANs?
I agree with the statement that L1 is more balanced in treating distances and L2 is more sensitive to outliers. I also understand now that L2 magnifies the larger errors.
However, I am thinking this sensitivity is not necessarily a bad thing. For the scenario in the quiz question, wouldn’t we want to choose an “unbalanced” algorithm that helps the model learn more about the low-pixel-value groups? That is precisely what L2 does: it magnifies the larger errors (which the low-pixel-value groups will have, compared to the high-pixel-value groups) and thereby forces the model to focus more on learning the low-pixel-value groups. In some sense, my understanding is that this sensitivity is doing us a favor by putting more weight on the minority groups in the unbalanced data.
I would say it is partially true: while the L2 loss would benefit the GAN’s learning of minority groups, overemphasizing the minority group might lead to biased representations or distort the fairness of the generated images, compromising both the overall accuracy (high loss values) and the bias of the model. It would be necessary to ensure that the GAN captures both minority and majority group characteristics without favoring one over the other (which is challenging with L2). L1 is more likely to give you accurate and unbiased image-generation results than L2.
I guess there should be papers on using a mixture of both L1 and L2 and checking both accuracy and fairness; if not, it would be an interesting paper. Maybe you could make it your first publication in this area.
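For what it’s worth, one standard blend of the two already exists: the Huber (smooth L1) loss, which is quadratic (L2-like) for small errors and linear (L1-like) for large ones. Whether it helps the accuracy/fairness trade-off discussed here would still need the kind of experiment you describe, so treat this as a sketch rather than a recommendation; the delta threshold of 0.5 is an arbitrary choice:

```python
def huber(err, delta=0.5):
    """Huber loss: quadratic for |err| <= delta, linear beyond it."""
    a = abs(err)
    return 0.5 * a ** 2 if a <= delta else delta * (a - 0.5 * delta)

for e in (0.1, 0.8):   # the small and large errors from the cases above
    print(f"|error| = {e}: L1 = {e:.2f}, L2 = {e ** 2:.2f}, Huber = {huber(e):.3f}")

# |error| = 0.1: L1 = 0.10, L2 = 0.01, Huber = 0.005
# |error| = 0.8: L1 = 0.80, L2 = 0.64, Huber = 0.275
```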