
Regularization

Regularization is a fundamental technique in machine learning used to prevent overfitting. The most widely used techniques, L1 and L2 regularization, introduce a penalty term to the loss function to encourage simpler, more generalizable models.

Overfitting is particularly common in complex models, such as deep neural networks with many parameters. These models can “memorize” the training data, including noise, rather than learning meaningful patterns. Regularization helps mitigate this issue by penalizing large weight values in the model. The core idea is to include the magnitude of model parameters as a penalty term in the loss function:

$$L + \lambda \|w\|_p$$

Here, $L$ represents the original loss function, $\lambda$ is the regularization strength, and $\|w\|_p$ is the $p$-norm of the weight vector $w$.
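To make this concrete, here is a minimal NumPy sketch of the regularized loss above; `data_loss`, `w`, `lam`, and `p` are illustrative placeholders rather than any particular library's API:

```python
import numpy as np

def regularized_loss(data_loss: float, w: np.ndarray, lam: float, p: int) -> float:
    """Add the lambda-scaled p-norm of the weights to the original loss."""
    penalty = np.sum(np.abs(w) ** p) ** (1.0 / p)  # ||w||_p
    return data_loss + lam * penalty

w = np.array([0.5, -1.2, 3.0])
print(regularized_loss(data_loss=0.8, w=w, lam=0.01, p=2))  # slightly above 0.8
```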

By minimizing this regularized loss, we encourage the weights to become smaller, thereby reducing model complexity. But why do we want smaller weights?

At the beginning of training, weights are often initialized close to zero. This means the training process initially focuses on minimizing the original loss $L$. Once $L$ reaches a certain level of minimization, the regularization term, $\lambda \|w\|_p$, starts to play a more significant role, encouraging the model to reduce the magnitude of the weights. If some weights are driven to near zero, it indicates that they have little influence on reducing the overall loss, so we can effectively remove them without significantly impacting performance. Keeping such weights large only increases model complexity without providing any real benefit. In other words, by driving unnecessary weights towards zero, we simplify the model.

Furthermore, without regularization, model weights can grow excessively large during training as the model attempts to perfectly fit the training data. This overfitting is precisely what we aim to prevent with regularization. By penalizing large weights, we discourage the model from relying too heavily on individual data points, promoting the learning of more general patterns.

L1 Regularization (Lasso)

L1 regularization adds an absolute value penalty term to the objective loss function:

$$\begin{aligned} L_{total} &= L + \lambda \|w\|_1 \\ &= L + \lambda \sum_{i=1}^n |w_i| \end{aligned}$$

where $L_{total}$ is the overall loss function, $\lambda$ is the regularization strength, and $|w_i|$ is the absolute value of parameter $w_i$.

An interesting property of L1 regularization is that it encourages sparsity, meaning that it sets some parameters to exactly zero. Due to this property, the L1 penalty becomes useful for feature selection, as it effectively removes irrelevant features by eliminating their corresponding weights.
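As a quick illustration, the following sketch (assuming scikit-learn is available) fits `Lasso` on synthetic data where only the first two of ten features matter; the data itself is made up for demonstration:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only features 0 and 1 actually influence the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.1).fit(X, y)  # alpha plays the role of lambda
print(model.coef_)  # most of the irrelevant coefficients are exactly 0.0
```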

Gradient of L1 regularization

$$\frac{\partial L_{total}}{\partial w_i} = \begin{cases} \frac{\partial L}{\partial w_i} + \lambda & \text{if } w_i > 0 \\ \frac{\partial L}{\partial w_i} - \lambda & \text{if } w_i < 0 \\ \text{undefined} & \text{if } w_i = 0 \end{cases}$$

The gradient of the L1 penalty term is constant (either $+\lambda$ or $-\lambda$) regardless of the weight's magnitude. This means that all weights experience the same shrinking force regardless of their size. As a result, L1 drives small weights all the way to zero over time.
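A minimal sketch of a single subgradient step makes this visible; `grad_L` stands in for the gradient of the original loss and is set to zero here so that only the penalty's effect shows:

```python
import numpy as np

def l1_subgradient_step(w, grad_L, lam, lr):
    # np.sign(w) is +1, -1, or 0; we use 0 as the subgradient at w_i == 0.
    return w - lr * (grad_L + lam * np.sign(w))

w = np.array([2.0, -0.05, 0.0])
print(l1_subgradient_step(w, grad_L=np.zeros(3), lam=1.0, lr=0.1))
# [1.9, 0.05, 0.0]: every nonzero weight moves toward zero by the same
# amount, lr * lam = 0.1. The small weight even overshoots past zero,
# which is why practical L1 solvers often use soft-thresholding instead.
```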

L2 Regularization (Ridge)

Meanwhile, L2 regularization adds a squared penalty term to the loss function:

$$\begin{aligned} L_{total} &= L + \lambda \|w\|_2^2 \\ &= L + \lambda \sum_{i=1}^n w_i^2 \end{aligned}$$

Gradient of L2 regularization

$$\frac{\partial L_{total}}{\partial w_i} = \frac{\partial L}{\partial w_i} + 2\lambda w_i$$

Unlike the L1 penalty, whose gradient is constant, the gradient of the L2 penalty scales proportionally with the magnitude of the weight $w_i$. This means that larger weights experience a larger shrinking force while smaller weights shrink less. Because L2 regularization applies a force proportional to the weight's size, it encourages all weights to be small but typically not exactly zero.
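The analogous sketch for L2 shows the proportional force (this update is the classic "weight decay" step; again `grad_L` is zeroed out for clarity):

```python
import numpy as np

def l2_gradient_step(w, grad_L, lam, lr):
    return w - lr * (grad_L + 2.0 * lam * w)

w = np.array([2.0, -0.05, 0.0])
print(l2_gradient_step(w, grad_L=np.zeros(3), lam=1.0, lr=0.1))
# [1.6, -0.04, 0.0]: the shrinking force 2 * lam * w_i is proportional
# to each weight, so 2.0 loses 0.4 while -0.05 loses only 0.01.
```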

Because L2 regularization punishes large weights with a larger gradient, it promotes smoothness and helps with numerical stability in models with multicollinearity.
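To see the multicollinearity point, here is a sketch (again assuming scikit-learn) with two nearly identical features; exact numbers will vary with the random seed:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
X = np.hstack([x, x + rng.normal(scale=1e-3, size=(200, 1))])  # collinear pair
y = x[:, 0] + rng.normal(scale=0.1, size=200)

print(LinearRegression().fit(X, y).coef_)  # typically large, offsetting values
print(Ridge(alpha=1.0).fit(X, y).coef_)    # small, balanced coefficients
```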

MAP interpretation

In Bayesian statistics, we aim to find the most probable set of model parameters given the observed data and prior belief. Bayes' theorem gives us the posterior distribution:

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}$$

Maximum a posteriori (MAP) estimation seeks the $\theta$ that maximizes this posterior distribution. In machine learning, we often solve this problem by taking the negative logarithm and minimizing it:

$$\begin{aligned} \arg\min_\theta \{-\log p(\theta \mid x)\} &= \arg\min_\theta \{-\log p(x \mid \theta) - \log p(\theta) + \log p(x)\} \\ &= \arg\min_\theta \{-\log p(x \mid \theta) - \log p(\theta)\} \end{aligned}$$

The evidence term $\log p(x)$ drops out because it does not depend on $\theta$. What remains is $-\log p(x \mid \theta)$, which corresponds to the usual loss function (the negative log-likelihood), and $-\log p(\theta)$, from the prior distribution over the parameters, which becomes our regularization term.
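A tiny numeric check (with made-up stand-ins for the likelihood and prior) confirms that the constant $\log p(x)$ term cannot change the minimizer:

```python
import numpy as np

thetas = np.linspace(-2, 2, 401)
neg_log_lik = (thetas - 0.7) ** 2   # stand-in for -log p(x|theta)
neg_log_prior = np.abs(thetas)      # stand-in for -log p(theta)
log_px = 3.14                       # any constant in theta

with_const = neg_log_lik + neg_log_prior + log_px
without_const = neg_log_lik + neg_log_prior
assert np.argmin(with_const) == np.argmin(without_const)  # same minimizer
```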

L1 regularization as Laplace prior

A random variable $w$ has a Laplace distribution if its probability density function is:

$$p(w \mid \mu, b) = \frac{1}{2b} e^{-\frac{|w - \mu|}{b}}$$

where $\mu$ is a location parameter and $b > 0$ is a scale parameter. Let's take the prior distribution for each parameter $w_i$ to be independent and identically distributed Laplace with mean $0$ and scale parameter $b$:

$$p(w_i) = \frac{1}{2b} e^{-\frac{|w_i|}{b}}$$

Taking the negative logarithm, $-\log p(w_i)$, we obtain:

$$\log(2b) + \frac{|w_i|}{b}$$

Computing the gradient with respect to $w_i$ (the constant $\log(2b)$ term vanishes):

$$\frac{\partial}{\partial w_i} \frac{|w_i|}{b} = \begin{cases} \frac{1}{b} & \text{if } w_i > 0 \\ -\frac{1}{b} & \text{if } w_i < 0 \\ \text{undefined} & \text{if } w_i = 0 \end{cases}$$

Comparing this gradient to the gradient of the L1 regularization term, we see that $\frac{1}{b}$ corresponds to $\lambda$. So minimizing the negative log posterior with a Laplace prior is equivalent to minimizing the loss function with an L1 regularization term.
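A short sanity check of this correspondence, evaluated away from $w_i = 0$:

```python
import numpy as np

b = 2.0
lam = 1.0 / b                        # the correspondence lambda = 1/b
w = np.array([1.5, -0.3])

neg_log_prior_grad = np.sign(w) / b  # d/dw [ log(2b) + |w|/b ]
l1_grad = lam * np.sign(w)           # d/dw [ lambda * |w| ]
assert np.allclose(neg_log_prior_grad, l1_grad)
```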

L2 regularization as Gaussian prior

The Gaussian distribution has the probability density function:

$$p(w \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(w - \mu)^2}{2\sigma^2}}$$

where the parameter $\mu$ is the mean of the distribution and the parameter $\sigma^2$ is the variance.

Let's take the prior distribution for each parameter $w_i$ to be independent and identically distributed Gaussian with mean $0$ and variance $\sigma^2$:

$$p(w_i) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{w_i^2}{2\sigma^2}}$$

Similar to the L1 regularization case, taking the negative logarithm of $p(w_i)$, we get:

$$\frac{1}{2}\log(2\pi\sigma^2) + \frac{w_i^2}{2\sigma^2}$$

Computing the gradient with respect to $w_i$ (the constant term again vanishes):

$$\frac{\partial}{\partial w_i} \frac{w_i^2}{2\sigma^2} = \frac{w_i}{\sigma^2}$$

Comparing this to the gradient of the L2 regularization term, $2\lambda w_i$, we can set $\frac{1}{\sigma^2} = 2\lambda$; therefore, minimizing the negative log posterior with a Gaussian prior is equivalent to minimizing the loss function with an L2 regularization term.
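And the matching sanity check for the Gaussian case:

```python
import numpy as np

sigma2 = 4.0
lam = 1.0 / (2.0 * sigma2)       # the correspondence 1/sigma^2 = 2*lambda
w = np.array([1.5, -0.3])

neg_log_prior_grad = w / sigma2  # d/dw [ const + w^2 / (2 sigma^2) ]
l2_grad = 2.0 * lam * w          # d/dw [ lambda * w^2 ]
assert np.allclose(neg_log_prior_grad, l2_grad)
```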

References

Wikipedia Laplace distribution

Wikipedia Gaussian distribution