What does L1 & L2 Regularization Look Like?

The Beauty of Mathematics

This article visualizes L1 & L2 Regularization with Cross Entropy Loss as the base loss function. In particular, the visualizations show how L1 & L2 Regularization reshape the original cross entropy loss surface. Although the concept is not difficult, the visualizations do make L1 & L2 regularization easier to understand, for example why L1-reg often leads to a sparse model. Above all, the visualization itself is indeed beautiful.

1. Cross Entropy Loss

Consider a super simple neural network:

![a super simple neural network](/en/posts/2021/cross-entropy-loss-visualized/simple_neural_network.png)
The forward propagation through the network is:

$$\hat{z}_1=\beta_1x$$

$$\hat{z}_2=\beta_2x$$

$$q_i=\mathrm{Softmax}(\hat{z})_i,\ i\in\{1,2\}$$

Consider a cross entropy loss, where $q=q_1$ is the Softmax probability assigned to $z_1$:

$$J(\beta)=-p\log(q)-(1-p)\log(1-q)$$

$$=-p\log\left(\frac{e^{\beta_1x}}{e^{\beta_1x}+e^{\beta_2x}}\right)-(1-p)\log\left(\frac{e^{\beta_2x}}{e^{\beta_1x}+e^{\beta_2x}}\right)$$

$$=-p\log{e^{\beta_1x}}-(1-p)\log{e^{\beta_2x}}+\log(e^{\beta_1x}+e^{\beta_2x})$$

$$=-p\beta_1x-(1-p)\beta_2x+\log(e^{\beta_1x}+e^{\beta_2x})$$

where $p$ denotes the true label of $z_1$ and $1-p$ denotes the true label of $z_2$, $\beta_1$ and $\beta_2$ are the model parameters, and $x$ denotes the model input, a scalar.
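
As a quick sanity check on the simplification, here is a minimal NumPy sketch (the function names are mine) that evaluates both the original Softmax/cross-entropy form and the simplified form for the same parameters:

```python
import numpy as np

def ce_loss(beta1, beta2, x=1.0, p=1.0):
    # Original form: q is the Softmax probability assigned to z1
    q = np.exp(beta1 * x) / (np.exp(beta1 * x) + np.exp(beta2 * x))
    return -p * np.log(q) - (1 - p) * np.log(1 - q)

def ce_loss_simplified(beta1, beta2, x=1.0, p=1.0):
    # Simplified form derived above
    return -p * beta1 * x - (1 - p) * beta2 * x + np.log(np.exp(beta1 * x) + np.exp(beta2 * x))

print(ce_loss(0.5, -0.3))             # ~0.3711
print(ce_loss_simplified(0.5, -0.3))  # same value
```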

The cross entropy loss can then be visualized as:

(For convenience, $p$ is set to 1 and $x$ is set to 1, since we only want to see how the loss varies with the parameters $\beta_1$ and $\beta_2$.)

We can see a smooth surface curving down towards the ground.
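
The surface itself takes only a few lines of Matplotlib to reproduce; here is a rough sketch (the grid range and styling are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: only needed on older Matplotlib

# Grid of parameter values; p = 1 and x = 1 as in the plot above
b1, b2 = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
loss = -b1 + np.log(np.exp(b1) + np.exp(b2))  # -p*b1*x - (1-p)*b2*x + log(e^{b1 x} + e^{b2 x})

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(b1, b2, loss, cmap="viridis")
ax.set_xlabel(r"$\beta_1$")
ax.set_ylabel(r"$\beta_2$")
ax.set_zlabel("cross entropy loss")
plt.show()
```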

2. Cross Entropy Loss with L1 Regularization

Generally, to keep parameters from growing without bound and to mitigate overfitting, we can apply regularization.

L1-Regularization penalizes the weights by adding the sum of the L1-norms of all parameters to the loss function:

$$J(\beta)=-p\beta_1x-(1-p)\beta_2x+\log(e^{\beta_1x}+e^{\beta_2x})+\lambda{(||\beta_1||_1+||\beta_2||_1)}$$

The following graphs show the surfaces for different L1-reg weights:

See how the folding angle changes for different settings of $\lambda$.

Notice that the fold lines lie right above the two axes where $\beta_1=0$ and $\beta_2=0$, which makes it easy for the model to settle at points where $\beta_1$ or $\beta_2$ is exactly zero. That is exactly why L1-Regularization results in a sparse model (one whose parameters tend to be 0).
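
To get a feel for these surfaces yourself, the sketch below plots the L1-regularized loss (again with $p=x=1$) for a few values of $\lambda$; the specific $\lambda$ values, grid range and colormap are my own choices, not necessarily the ones used in the figures:

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: only needed on older Matplotlib

b1, b2 = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))

def l1_loss(b1, b2, lam):
    # Cross entropy loss (p = x = 1) plus the L1 penalty lambda * (|b1| + |b2|)
    return -b1 + np.log(np.exp(b1) + np.exp(b2)) + lam * (np.abs(b1) + np.abs(b2))

fig = plt.figure(figsize=(12, 4))
for i, lam in enumerate([0.0, 0.5, 1.0]):
    ax = fig.add_subplot(1, 3, i + 1, projection="3d")
    ax.plot_surface(b1, b2, l1_loss(b1, b2, lam), cmap="viridis")
    ax.set_title(rf"$\lambda = {lam}$")
plt.show()
```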

What does TensorFlow do when it has to differentiate a non-differentiable function?

TensorFlow simply returns a gradient of 0 at these non-differentiable points.

See https://stackoverflow.com/a/41520694

The code below can be used to check the derivative value TensorFlow returns for piecewise-defined functions:

```python
import tensorflow as tf  # TensorFlow 1.x API (tf.Session / tf.gradients)

x = tf.Variable(0.0)
# Piecewise-defined function: y = 2 for x <= 0, y = x + 2 for x > 0
# (continuous everywhere, but non-differentiable at x = 0)
y = tf.where(tf.greater(x, 0), x + 2.0, 2.0)
grad = tf.gradients(y, [x])[0]
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad))  # prints 0.0 at the kink x = 0
```
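
If you are on TensorFlow 2.x, where sessions are gone, a roughly equivalent check using tf.GradientTape would be:

```python
import tensorflow as tf  # TensorFlow 2.x

x = tf.Variable(0.0)
with tf.GradientTape() as tape:
    # Same piecewise function: y = 2 for x <= 0, y = x + 2 for x > 0
    y = tf.where(tf.greater(x, 0), x + 2.0, 2.0)
print(tape.gradient(y, x))  # tf.Tensor(0.0, shape=(), dtype=float32)
```
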
Coordinate Descent

To sidestep the non-differentiability introduced by L1-Regularization, we can sometimes apply coordinate descent, which avoids computing the gradient of the loss surface altogether.

See https://en.wikipedia.org/wiki/Coordinate_descent
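
As a concrete illustration, here is a small coordinate descent sketch on the L1-regularized loss above (SciPy's bounded scalar minimizer handles each one-dimensional step; the value of $\lambda$, the starting point and the search bounds are arbitrary choices of mine). No gradient of $J$ is computed at any point:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def j(b1, b2, lam=0.5, x=1.0, p=1.0):
    # L1-regularized cross entropy loss from section 2
    return (-p * b1 * x - (1 - p) * b2 * x
            + np.log(np.exp(b1 * x) + np.exp(b2 * x))
            + lam * (abs(b1) + abs(b2)))

b1, b2 = 2.0, 2.0                      # arbitrary starting point
for _ in range(20):                    # a few sweeps are enough here
    # Minimize along one coordinate at a time, holding the other fixed
    b1 = minimize_scalar(lambda v: j(v, b2), bounds=(-10, 10), method="bounded").x
    b2 = minimize_scalar(lambda v: j(b1, v), bounds=(-10, 10), method="bounded").x

print(b1, b2)  # for lambda = 0.5 both parameters end up (numerically) at 0
```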

3. Cross Entropy Loss with L2 Regularization

Another common choice is L2-Regularization, which adds the sum of the squared L2-norms of all parameters:

$$J(\beta)=-p\beta_1x-(1-p)\beta_2x+\log(e^{\beta_1x}+e^{\beta_2x})+\Omega{(||\beta_1||_2^2+||\beta_2||_2^2)}$$

The following graphs show the surfaces for different L2-reg weights:

Notice that L2 regularization keeps the loss surface smoothly curved, so the minimum sits at a point where $\beta_1$ and $\beta_2$ take finite values! What's more, as the L2 regularization weight gets bigger, that minimum moves closer to the origin (where $\beta_1=0$, $\beta_2=0$).
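
To see this numerically, the sketch below runs plain gradient descent on the L2-regularized loss (with $p=x=1$ as before; the learning rate, iteration count and the particular regularization weights are arbitrary):

```python
import numpy as np

def grad(b, omega, x=1.0, p=1.0):
    # Gradient of the L2-regularized cross entropy loss
    b1, b2 = b
    s1 = np.exp(b1 * x) / (np.exp(b1 * x) + np.exp(b2 * x))  # Softmax(z1)
    s2 = 1.0 - s1                                            # Softmax(z2)
    return np.array([-p * x + s1 * x + 2 * omega * b1,
                     -(1 - p) * x + s2 * x + 2 * omega * b2])

for omega in [0.1, 0.5, 1.0, 5.0]:
    b = np.zeros(2)
    for _ in range(5000):       # plain gradient descent, fixed step size
        b -= 0.1 * grad(b, omega)
    print(omega, b)             # the minimizer shrinks towards (0, 0) as omega grows
```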

4. Cross Entropy Loss with L1+L2 Regularization

L1 and L2 Regularization can also be applied at the same time, which looks like this:
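
Following the notation above, stacking both penalty terms gives:

$$J(\beta)=-p\beta_1x-(1-p)\beta_2x+\log(e^{\beta_1x}+e^{\beta_2x})+\lambda{(||\beta_1||_1+||\beta_2||_1)}+\Omega{(||\beta_1||_2^2+||\beta_2||_2^2)}$$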

5. Conclusion

L1 Regularization

  • Penalizes the sum of the absolute values of the weights, which results in a sparse model
  • A sparse model lends itself to feature selection
  • A sparse model is simple and interpretable, but cannot learn complex patterns
  • Robust to outliers

L2 Regularization

  • Penalizes the sum of the squared values of the weights, which results in a dense model
  • Can learn complex patterns and generally gives better predictions
  • Sensitive to outliers