What Do L1 & L2 Regularization Look Like?
The Beauty of Mathematics
This article visualizes L1 & L2 regularization, with cross entropy loss as the base loss function. In particular, the visualization shows how L1 & L2 regularization reshape the original surface of the cross entropy loss. Although the concept is not difficult, the visualization does make L1 & L2 regularization easier to understand, for example why L1 regularization often leads to a sparse model. Above all, the visualization itself is genuinely beautiful.
1. Cross Entropy Loss ¶
Consider a super simple neural network. The forward propagation of the network is: $$\hat{z}_1=\beta_1x$$ $$\hat{z}_2=\beta_2x$$ $$q=\mathrm{Softmax}(\hat{z})_1,\qquad 1-q=\mathrm{Softmax}(\hat{z})_2$$ Consider the cross entropy loss: $$J(\beta)=-p\log(q)-(1-p)\log(1-q)$$ $$=-p\log\left(\frac{e^{\beta_1x}}{e^{\beta_1x}+e^{\beta_2x}}\right)-(1-p)\log\left(\frac{e^{\beta_2x}}{e^{\beta_1x}+e^{\beta_2x}}\right)$$ $$=-p\log{e^{\beta_1x}}-(1-p)\log{e^{\beta_2x}}+\big(p+(1-p)\big)\log(e^{\beta_1x}+e^{\beta_2x})$$ $$=-p\beta_1x-(1-p)\beta_2x+\log(e^{\beta_1x}+e^{\beta_2x})$$ where $p$ denotes the true label of $z_1$ and $1-p$ denotes the true label of $z_2$, $\beta_1$ and $\beta_2$ are the model parameters, and $x$ denotes the model input (a scalar).
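As a quick sanity check on this simplification, a small sketch like the following (the test values for $\beta_1$, $\beta_2$, $p$, and $x$ are arbitrary) compares the original two-term form with the simplified last line:

```python
import numpy as np

def ce_original(b1, b2, p, x):
    """-p*log(q) - (1-p)*log(1-q), with q the softmax probability of class 1."""
    q = np.exp(b1 * x) / (np.exp(b1 * x) + np.exp(b2 * x))
    return -p * np.log(q) - (1 - p) * np.log(1 - q)

def ce_simplified(b1, b2, p, x):
    """The simplified form derived above."""
    return -p * b1 * x - (1 - p) * b2 * x + np.log(np.exp(b1 * x) + np.exp(b2 * x))

# Arbitrary test point: the two expressions should agree up to floating point error.
print(ce_original(0.7, -1.3, p=0.8, x=2.0), ce_simplified(0.7, -1.3, p=0.8, x=2.0))
```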
The cross entropy loss can then be visualized as:
(For convenience, $p$ is set to 1 and $x$ is set to 1, because we only want to see how the loss varies with different parameter values $\beta_1$ and $\beta_2$.) We can see a smooth curved surface sloping down towards the ground.
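The surface itself can be reproduced with a short sketch along these lines (the plotting range of $[-3,3]$ and the use of matplotlib's 3-D axes are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

P, X = 1.0, 1.0   # the convenience setting used for the figure

def cross_entropy(b1, b2):
    """Simplified cross entropy loss J(beta) derived above."""
    return -P * b1 * X - (1 - P) * b2 * X + np.log(np.exp(b1 * X) + np.exp(b2 * X))

b1, b2 = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(b1, b2, cross_entropy(b1, b2), cmap="viridis")
ax.set_xlabel(r"$\beta_1$")
ax.set_ylabel(r"$\beta_2$")
ax.set_zlabel(r"$J(\beta)$")
plt.show()
```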
2. Cross Entropy Loss with L1 Regularization ¶
Generally, to prevent the parameters from growing arbitrarily large and to mitigate overfitting, we can apply regularization.
L1 regularization penalizes the weights by adding the sum of the L1 norms of all parameters to the loss function:
$$J(\beta)=-p\beta_1x-(1-p)\beta_2x+\log(e^{\beta_1x}+e^{\beta_2x})+\lambda(||\beta_1||_1+||\beta_2||_1)$$
The following graphs show the loss surface for different L1 regularization weights:
See how the folding angle changes with different settings of $\lambda$. Notice that the fold lines lie right above the two axes where $\beta_1=0$ and $\beta_2=0$, which makes the model likely to settle at points where $\beta_1$ or $\beta_2$ is zero, and that is exactly why L1 regularization results in a sparse model (one whose parameters tend to be 0).
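As a rough numerical check, a minimal sketch like the one below (the $\lambda$ values and the use of scipy's Nelder-Mead optimizer are arbitrary choices) minimizes the L1-regularized loss with $p=1$ and $x=1$ and shows $\beta_2$ getting pinned at the fold:

```python
import numpy as np
from scipy.optimize import minimize

def l1_loss(beta, lam, p=1.0, x=1.0):
    """Cross entropy loss with L1 regularization, as defined above."""
    b1, b2 = beta
    ce = -p * b1 * x - (1 - p) * b2 * x + np.log(np.exp(b1 * x) + np.exp(b2 * x))
    return ce + lam * (abs(b1) + abs(b2))

for lam in [0.05, 0.1, 0.3]:   # assumed example values of lambda
    res = minimize(l1_loss, x0=[1.0, -1.0], args=(lam,), method="Nelder-Mead")
    b1, b2 = res.x
    # beta_2 should end up (numerically) pinned at the fold beta_2 = 0,
    # while beta_1 stays at a finite positive value: the sparsity effect.
    print(f"lambda={lam}: beta_1={b1:.3f}, beta_2={b2:.3f}")
```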
TensorFlow simply returns a gradient of 0 at these non-differentiable points (see https://stackoverflow.com/a/41520694).
The derivative values TensorFlow reports for such piecewise-defined functions can be tested directly in code.
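For instance, a minimal TensorFlow 2 sketch, assuming eager execution and using $|x|$ and ReLU as example piecewise functions:

```python
import tensorflow as tf

# Evaluate the gradient TensorFlow reports at the non-differentiable point x = 0.
x = tf.Variable(0.0)

with tf.GradientTape(persistent=True) as tape:
    y_abs = tf.abs(x)        # |x| has a kink at 0
    y_relu = tf.nn.relu(x)   # max(0, x) also has a kink at 0

print(tape.gradient(y_abs, x).numpy())   # 0.0, since TF uses sign(0) = 0
print(tape.gradient(y_relu, x).numpy())  # 0.0, TF treats the gradient of ReLU at 0 as 0
```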
To deal with the non-differentiability introduced by L1 regularization, we can sometimes apply coordinate descent, which avoids computing the gradient of the loss surface altogether.
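For example, a minimal coordinate descent sketch on the loss above (the bounds, number of sweeps, and use of scipy's bounded scalar minimizer are assumptions) alternates one-dimensional minimizations over $\beta_1$ and $\beta_2$ without ever forming a gradient:

```python
import numpy as np
from scipy.optimize import minimize_scalar

LAM, P, X = 0.1, 1.0, 1.0   # assumed example values

def J(b1, b2):
    """L1-regularized cross entropy loss from Section 2."""
    ce = -P * b1 * X - (1 - P) * b2 * X + np.log(np.exp(b1 * X) + np.exp(b2 * X))
    return ce + LAM * (abs(b1) + abs(b2))

b1, b2 = 0.0, 0.0
for sweep in range(20):
    # Minimize over one coordinate at a time, holding the other fixed;
    # each 1-D problem is solved by bracketing, so no gradient is needed.
    b1 = minimize_scalar(lambda t: J(t, b2), bounds=(-10, 10), method="bounded").x
    b2 = minimize_scalar(lambda t: J(b1, t), bounds=(-10, 10), method="bounded").x

print(b1, b2)   # beta_2 should settle essentially at 0, beta_1 at a finite value
```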
3. Cross Entropy Loss with L2 Regularization ¶
Another common choice is L2 regularization, which adds the sum of the squared L2 norms of all parameters:
$$J(\beta)=-p\beta_1x-(1-p)\beta_2x+\log(e^{\beta_1x}+e^{\beta_2x})+\Omega(||\beta_1||_2^2+||\beta_2||_2^2)$$
The following graphs show the loss surface for different L2 regularization weights:
We can observe that L2 regularization curves the loss surface smoothly, so the minimum is attained at a position where $\beta_1$ and $\beta_2$ take finite values. Moreover, as the L2 regularization weight grows, that minimum moves closer to the origin (where $\beta_1=0$ and $\beta_2=0$).
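This behaviour can be checked numerically with a rough sketch like the following (the $\Omega$ values and the use of scipy's BFGS optimizer are arbitrary choices); the minimizer shrinks towards the origin as $\Omega$ increases but is never driven exactly to zero:

```python
import numpy as np
from scipy.optimize import minimize

def l2_loss(beta, omega, p=1.0, x=1.0):
    """Cross entropy loss with L2 regularization, as in Section 3."""
    b1, b2 = beta
    ce = -p * b1 * x - (1 - p) * b2 * x + np.log(np.exp(b1 * x) + np.exp(b2 * x))
    return ce + omega * (b1 ** 2 + b2 ** 2)

for omega in [0.05, 0.1, 0.5, 2.0]:   # assumed example values of Omega
    res = minimize(l2_loss, x0=[0.0, 0.0], args=(omega,), method="BFGS")
    b1, b2 = res.x
    # The minimizer is finite and shrinks toward (0, 0) as Omega grows,
    # but neither coordinate becomes exactly zero (a dense model).
    print(f"Omega={omega}: beta_1={b1:.3f}, beta_2={b2:.3f}")
```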
4. Cross Entropy Loss with L1+L2 Regularization ¶
L1 and L2 regularization can also be applied at the same time, giving a combined objective like: $$J(\beta)=-p\beta_1x-(1-p)\beta_2x+\log(e^{\beta_1x}+e^{\beta_2x})+\lambda(||\beta_1||_1+||\beta_2||_1)+\Omega(||\beta_1||_2^2+||\beta_2||_2^2)$$
5. Conclusion ¶
L1 Regularization
- Penalizes the sum of the absolute values of the weights, which results in a sparse model
- A sparse model lends itself to feature selection
- A sparse model is simple and interpretable, but cannot learn complex patterns
- Robust to outliers
L2 Regularization
- Penalizes the sum of the squared values of the weights, which results in a dense model
- Can learn complex patterns and generally gives better predictions
- Sensitive to outliers