What Do L1 & L2 Regularization Look Like?

The Beauty of Mathematics

2021-04-15 2048 words 10 minutes

This article visualizes L1 and L2 regularization, using cross entropy loss as the base loss function, and shows how each kind of regularization reshapes the original cross entropy loss surface. The concept is not difficult, but the visualization makes L1 and L2 regularization easier to understand, for example why L1 regularization often leads to a sparse model. Above all, the visualization itself is beautiful.

1. Cross Entropy Loss

Consider a super simple neural network:

The forward propagation of the network is:

$$\hat{z_1}=\beta_1x$$

$$\hat{z_2}=\beta_2x$$

$$\mathrm{Softmax}(\hat{z_i}),\ i\in\{1,2\}$$

Consider the cross entropy loss:

$$J(\beta)=-p\log(q)-(1-p)\log(1-q)$$

$$=-p\log(\frac{e^{\beta_1x}}{e^{\beta_1x}+e^{\beta_2x}})-(1-p)\log(\frac{e^{\beta_2x}}{e^{\beta_1x}+e^{\beta_2x}})$$

$$=…$$

$$=-p\log{e^{\beta_1x}}-(1-p)\log{e^{\beta_2x}}+\log(e^{\beta_1x}+e^{\beta_2x})$$

$$=-p\beta_1x-(1-p)\beta_2x+\log(e^{\beta_1x}+e^{\beta_2x})$$

where $p$ is the true label for $z_1$ and $1-p$ is the true label for $z_2$, $q$ is the softmax output for $z_1$, $\beta_1$ and $\beta_2$ are the model parameters, and $x$ is the model input, a scalar.
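As a quick sanity check of the derivation, the sketch below compares the loss computed directly from the softmax output with the simplified closed form on its last line. The function names (`softmax2`, `loss_direct`, `loss_simplified`) and the test points are illustrative choices, not part of the original article.

```python
import math

def softmax2(z1, z2):
    """Softmax over two logits, shifted by the max for numerical stability."""
    m = max(z1, z2)
    e1, e2 = math.exp(z1 - m), math.exp(z2 - m)
    return e1 / (e1 + e2), e2 / (e1 + e2)

def loss_direct(beta1, beta2, x, p):
    """Cross entropy from the softmax output: -p*log(q) - (1-p)*log(1-q)."""
    q1, q2 = softmax2(beta1 * x, beta2 * x)
    return -p * math.log(q1) - (1 - p) * math.log(q2)

def loss_simplified(beta1, beta2, x, p):
    """Closed form from the last line of the derivation."""
    z1, z2 = beta1 * x, beta2 * x
    return -p * z1 - (1 - p) * z2 + math.log(math.exp(z1) + math.exp(z2))

# Arbitrary (beta1, beta2, x, p) points chosen for illustration.
points = [(0.5, -1.0, 1.0, 1.0), (2.0, 3.0, 0.7, 0.3), (-1.5, 0.2, 2.0, 0.0)]
for beta1, beta2, x, p in points:
    a = loss_direct(beta1, beta2, x, p)
    b = loss_simplified(beta1, beta2, x, p)
    print(f"direct={a:.6f}  simplified={b:.6f}")
```

Both columns should agree up to floating-point rounding, which is what the algebra predicts.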

The cross entropy loss can then be visualized as:

(For convenience, both $p$ and $x$ are set to 1, because we only want to see how the loss varies with the parameters $\beta_1$ and $\beta_2$.)

We can see a smooth, curved surface that slopes down towards the ground: the loss keeps decreasing as $\beta_1$ grows relative to $\beta_2$ and never reaches a finite minimum.
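To put numbers on the shape of this surface, here is a minimal sketch that evaluates the loss with $p=1$ and $x=1$ along one line through the parameter plane. The helper name `cross_entropy` and the sampled direction ($\beta_1=t$, $\beta_2=-t$) are assumptions made for illustration.

```python
import math

def cross_entropy(beta1, beta2, x=1.0, p=1.0):
    """Simplified cross entropy loss from section 1."""
    z1, z2 = beta1 * x, beta2 * x
    return -p * z1 - (1 - p) * z2 + math.log(math.exp(z1) + math.exp(z2))

# Walk along the direction beta1 = t, beta2 = -t.
ts = [-2.0, -1.0, 0.0, 1.0, 2.0, 4.0]
losses = [cross_entropy(t, -t) for t in ts]
for t, value in zip(ts, losses):
    print(f"beta1={t:+.1f}, beta2={-t:+.1f}  loss={value:.4f}")
```

With $p=1$ and $x=1$ the loss reduces to $\log(1+e^{\beta_2-\beta_1})$, so it depends only on the difference between the two parameters and keeps shrinking towards 0 as that difference grows.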

2. Cross Entropy Loss with L1 Regularization

Generally, to keep the parameters from growing without bound and to reduce overfitting, we can apply regularization.

L1 regularization penalizes the weights by adding the sum of the L1 norms of all parameters to the loss function:

$$J(\beta)=-p\beta_1x-(1-p)\beta_2x+\log(e^{\beta_1x}+e^{\beta_2x})+\lambda{(||\beta_1||_1+||\beta_2||_1)}$$

The following graphs show the resulting surfaces for different L1 regularization weights $\lambda$:

See how the folding angle changes with different settings of $\lambda$.

Notice that the folding lines lie right above the two axes, where $\beta_1=0$ or $\beta_2=0$. These creases make it easy for the model to end up at points where $\beta_1$ or $\beta_2$ is exactly zero, and that is exactly why L1 regularization results in a sparse model (one in which many parameters are 0).
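The effect of the crease can be checked numerically. The sketch below fixes $\beta_1=1$ (an arbitrary choice), keeps $p=1$ and $x=1$, and looks for the best $\beta_2$ on a fine grid for a few values of $\lambda$. The names `l1_loss` and `best_beta2`, the grid, and the $\lambda$ values are all assumptions made for illustration.

```python
import math

def l1_loss(beta1, beta2, lam, x=1.0, p=1.0):
    """Cross entropy loss plus the L1 penalty lam * (|beta1| + |beta2|)."""
    z1, z2 = beta1 * x, beta2 * x
    ce = -p * z1 - (1 - p) * z2 + math.log(math.exp(z1) + math.exp(z2))
    return ce + lam * (abs(beta1) + abs(beta2))

def best_beta2(beta1, lam):
    """Grid search over beta2 in [-3, 3] with step 0.01 (the grid contains exactly 0)."""
    grid = [i / 100 for i in range(-300, 301)]
    return min(grid, key=lambda b2: l1_loss(beta1, b2, lam))

for lam in (0.1, 0.3, 0.5):
    print(f"lambda={lam}: best beta2 = {best_beta2(1.0, lam):+.2f}")
```

For the smallest $\lambda$ the best $\beta_2$ is clearly nonzero, while for the two larger values it sits exactly on the crease at $\beta_2=0$: once the penalty slope $\lambda$ exceeds the slope of the unregularized loss at the crease, moving away from zero in either direction only increases the loss.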

What does TensorFlow do when asked to differentiate a non-differentiable function?

TensorFlow does not raise an error; at the kink of a function such as the absolute value it simply returns a gradient of 0.

See https://stackoverflow.com/a/41520694

The code below can be used to check the gradient TensorFlow returns for a piecewise-defined function:

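import tensorflow.compat.v1 as tf  # TF 1.x graph-mode API, also shipped with TF 2.x

tf.disable_v2_behavior()

x = tf.Variable(0.0)
# Piecewise-defined function: y = x + 2 for x > 0, y = 2 for x <= 0 (kink at x = 0)
y = tf.where(tf.greater(x, 0.0), x + 2.0, 2.0)
grad = tf.gradients(y, [x])[0]  # symbolic dy/dx
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad))  # gradient evaluated at x = 0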
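For comparison, the following sketch estimates the two one-sided slopes of the same piecewise function at $x=0$ with plain finite differences, without TensorFlow. The step size `h` and the variable names are illustrative assumptions.

```python
def f(x):
    # Same piecewise function as in the TensorFlow snippet: x + 2 for x > 0, otherwise 2.
    return x + 2.0 if x > 0 else 2.0

h = 1e-6  # small step for the finite differences
left_slope = (f(0.0) - f(-h)) / h   # slope approaching 0 from the left
right_slope = (f(h) - f(0.0)) / h   # slope approaching 0 from the right
print(left_slope, right_slope)
```

The two slopes differ (about 0 on the left and about 1 on the right), so there is no single derivative at the kink. A reported gradient of 0 would correspond to the constant branch, which is the one the condition `x > 0` selects at `x = 0`.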

Coordinate Descent

To work around the non-differentiability introduced by L1 regularization, we can sometimes apply coordinate descent, which avoids computing the gradient of the loss surface.

See https://en.wikipedia.org/wiki/Coordinate_descent
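A minimal sketch of the idea, applied to the L1-regularized loss from this section with $p=1$ and $x=1$: each sweep minimizes the loss along one coordinate at a time while the other is held fixed. The one-dimensional solver used here is a golden-section search, picked only because it needs no derivatives; the function names, the search interval, the number of sweeps and $\lambda=0.1$ are all assumptions for illustration, not the method of any particular library.

```python
import math

def l1_loss(beta, lam, x=1.0, p=1.0):
    """Cross entropy loss plus the L1 penalty; beta is a list [beta1, beta2]."""
    z1, z2 = beta[0] * x, beta[1] * x
    ce = -p * z1 - (1 - p) * z2 + math.log(math.exp(z1) + math.exp(z2))
    return ce + lam * (abs(beta[0]) + abs(beta[1]))

def golden_section_min(f, lo, hi, tol=1e-6):
    """Derivative-free 1-D minimization of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if f(c) < f(d):
            hi = d  # the minimum lies in [lo, d]
        else:
            lo = c  # the minimum lies in [c, hi]
    return (lo + hi) / 2

def coordinate_descent(lam, sweeps=20, bound=10.0):
    """Cycle through the coordinates, minimizing along one axis at a time."""
    beta = [0.0, 0.0]
    for _ in range(sweeps):
        for i in range(2):
            def along_axis(value, i=i):
                trial = list(beta)
                trial[i] = value
                return l1_loss(trial, lam)
            beta[i] = golden_section_min(along_axis, -bound, bound)
    return beta

beta = coordinate_descent(lam=0.1)
print(f"beta1={beta[0]:.4f}, beta2={beta[1]:.4f}")
```

No gradient of the penalized loss is evaluated anywhere, so the creases at $\beta_1=0$ and $\beta_2=0$ cause no trouble. In this toy setting the search settles at a sparse point, with $\beta_2$ at zero up to the solver tolerance.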

3. Cross Entropy Loss with L2 Regularization

Another common choice is L2 regularization, which adds the sum of the squared L2 norms of all parameters to the loss function:

$$J(\beta)=-p\beta_1x-(1-p)\beta_2x+\log(e^{\beta_1x}+e^{\beta_2x})+\Omega{(||\beta_1||_2^2+||\beta_2||_2^2)}$$

The following graphs show the resulting surfaces for different L2 regularization weights $\Omega$:

L2 regularization bends the loss surface smoothly upwards, so that the minimum of the loss now lies at a point where $\beta_1$ and $\beta_2$ take finite values. What's more, as the L2 regularization weight grows, this minimum moves closer to the origin (where $\beta_1=0$, $\beta_2=0$).
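This behaviour can be reproduced with a few lines of plain gradient descent on the L2-regularized loss, again with $p=1$ and $x=1$. The sketch is illustrative only: the function names, learning rate, step count and the three $\Omega$ values are assumptions.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def minimize_l2(omega, lr=0.1, steps=5000):
    """Plain gradient descent on the L2-regularized loss with p = 1, x = 1."""
    b1, b2 = 0.0, 0.0
    for _ in range(steps):
        s = sigmoid(b2 - b1)          # softmax probability assigned to the second class
        g1 = -s + 2.0 * omega * b1    # dJ/d(beta1)
        g2 = s + 2.0 * omega * b2     # dJ/d(beta2)
        b1, b2 = b1 - lr * g1, b2 - lr * g2
    return b1, b2

for omega in (0.1, 0.5, 1.0):
    b1, b2 = minimize_l2(omega)
    norm = math.hypot(b1, b2)
    print(f"Omega={omega}: beta1={b1:+.4f}, beta2={b2:+.4f}, norm={norm:.4f}")
```

The minimizer is finite for every $\Omega>0$, and its distance from the origin shrinks as $\Omega$ increases.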

4. Cross Entropy Loss with L1+L2 Regularization

L1 and L2 regularization can also be applied at the same time, which gives a surface like this:
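In terms of the formulas from the previous two sections, and assuming the two penalties are simply added, each with its own weight ($\lambda$ for L1 and $\Omega$ for L2), the combined objective would read:

$$J(\beta)=-p\beta_1x-(1-p)\beta_2x+\log(e^{\beta_1x}+e^{\beta_2x})+\lambda{(||\beta_1||_1+||\beta_2||_1)}+\Omega{(||\beta_1||_2^2+||\beta_2||_2^2)}$$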

5. Conclusion

L1 Regularization

Penalizes the sum of the absolute values of the weights, which results in a sparse model
A sparse model lends itself to feature selection
A sparse model is simple and interpretable, but may fail to capture complex patterns
Robust to outliers

L2 Regularization

Penalizes the sum of the squared values of the weights, which results in a dense model
Can learn more complex patterns and often gives better predictions
Sensitive to outliers