L2 regularization makes the weights small to reduce overfitting.
Dropout removes some neurons in the hidden layers. Why don't we use an L1 norm on the weights to do the dropout instead, similar to the Lasso, Ridge, and Elastic Net models? Any insights?
Thanks a lot.
Dropout doesn't always shut off the same neurons in a layer; it selects a few at random on each iteration.
If you use L1 instead, you'll permanently turn off a few connections from certain inputs over time, and there's no randomness involved.
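To make the randomness point concrete, here's a minimal sketch of inverted dropout (the common formulation) using NumPy. The drop probability `p` and the seed are arbitrary choices for illustration; a fresh mask is drawn on every forward pass, so different neurons get zeroed each time:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5  # drop probability (assumed value, typically 0.2-0.5)

activations = np.ones(8)  # toy hidden-layer activations

# A new random mask is sampled each iteration; surviving units are
# scaled by 1/(1-p) so the expected activation is unchanged (inverted dropout).
mask = (rng.random(activations.shape) >= p) / (1 - p)
dropped = activations * mask

# Each entry is either 0 (dropped) or 2.0 (kept and rescaled, since 1/(1-0.5) = 2)
print(dropped)
```

Sampling `mask` again on the next iteration would zero a different random subset, which is exactly what a fixed L1 penalty does not do.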
That said, there's nothing stopping you from specifying an L1 penalty on the weights/bias in addition to (or instead of) dropout.
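As a sketch of what that looks like, here is a toy linear model trained by subgradient descent with an L1 penalty added to the loss (a Lasso-style setup; the data, `lam`, and learning rate are all made up for illustration). Notice that the penalty drives the weights on irrelevant inputs toward zero deterministically, rather than dropping units at random:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem: only features 0 and 3 actually matter
X = rng.normal(size=(100, 5))
true_w = np.array([2.0, 0.0, 0.0, -1.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=100)

lam = 0.1   # L1 penalty strength (assumed)
lr = 0.01   # learning rate (assumed)
w = np.zeros(5)

for _ in range(2000):
    # Gradient of MSE plus the subgradient of lam * ||w||_1
    grad = 2 * X.T @ (X @ w - y) / len(y) + lam * np.sign(w)
    w -= lr * grad

# Weights on the irrelevant features (1, 2, 4) are pushed toward zero,
# while the informative weights survive (slightly shrunk by the penalty).
print(np.round(w, 3))
```

This is the standard Lasso behavior the question alludes to: a fixed, data-driven sparsity pattern, as opposed to dropout's per-iteration random masking.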