Revisiting the bias-variance decomposition

A central result in ‘classical’ machine learning and statistics is known as the bias-variance decomposition. Let’s flesh it out.

Consider the univariate regression problem in which we have \(x\in\mathbb{R}^D\), \(y\in\mathbb{R}\), and want to learn the conditional distribution \(p(y\lvert x)\), or at least its mean \(\mathbb{E}[y\lvert x]\), via some parametric function \(f(x)\). Now, any finite training set of \(N\) points drawn from \(p\) will itself be a random variable:

\[\mathcal{D}=\left\{\left(x^{(n)}, y^{(n)}\right) \big|\ (x, y) \sim p(x, y)\right\}_{n=1}^N.\]

The bias-variance decomposition considers how the risk (expected loss under \(p\) for a model \(f\)) depends on this random variable. Let’s do the calculation first, then tease out the intuition.

The risk under the joint distribution \(p(x,y)\) for a regression task using the square loss is given by

\[\begin{aligned} \mathbb{E}[\mathcal{L}] &\stackrel{\mathrm{(a)}}{=} \mathbb{E}_{xy}\mathbb{E}_{\mathcal{D}}\Big[\{y - f(x; \mathcal{D})\}^2\Big] \\ &\stackrel{\mathrm{(b)}}{=} \mathbb{E}_{xy}\mathbb{E}_{\mathcal{D}}\Big[\{y - h(x) + h(x) - f(x; \mathcal{D})\}^2\Big] \\ &\stackrel{\mathrm{(c)}}{=} \mathbb{E}_{xy}\Big[\{y - h(x)\}^2\Big] + \mathbb{E}_x\mathbb{E}_\mathcal{D}\Big[\{h(x) - f(x; \mathcal{D})\}^2\Big] \\ &\stackrel{\mathrm{(d)}}{=} \mathbb{E}_{xy}\Big[\{y - h(x)\}^2\Big] + \mathbb{E}_x\mathbb{E}_\mathcal{D}\Big[ \big\{h(x) - \mathbb{E}_\mathcal{D}[f(x;\mathcal{D})] + \mathbb{E}_\mathcal{D}[f(x;\mathcal{D})] - f(x; \mathcal{D})\big\}^2 \Big] \\ &\stackrel{\mathrm{(e)}}{=} \mathbb{E}_{xy}\Big[\{y - h(x)\}^2\Big] + \mathbb{E}_x\Big[\big\{h(x) - \mathbb{E}_\mathcal{D}[f(x;\mathcal{D})]\big\}^2\Big] + \mathbb{E}_x\mathbb{E}_\mathcal{D}\Big[\big\{f(x; \mathcal{D}) - \mathbb{E}_\mathcal{D}[f(x;\mathcal{D})]\big\}^2\Big] \\ &\stackrel{\mathrm{(f)}}{=} \underbrace{\int\int\{y - h(x)\}^2p(x,y)\mathrm{d}x\mathrm{d}y}_{\mathrm{noise}} + \underbrace{\int\{h(x) - \mathbb{E}_\mathcal{D}[f(x;\mathcal{D})]\big\}^2p(x)\mathrm{d}x}_{\mathrm{bias}^2} + \underbrace{\int\mathbb{E}_\mathcal{D}\Big[\big\{f(x; \mathcal{D}) - \mathbb{E}_\mathcal{D}[f(x;\mathcal{D})]\big\}^2\Big]p(x)\mathrm{d}x}_{\mathrm{variance}}, \end{aligned}\]

reasoning:

\(\mathrm{(a)}\) expresses the fact that \(f\) is a function of both its inputs \(x\) and the dataset \(\mathcal{D}\) it is trained on, which is a random variable.
\(\mathrm{(b)}\) uses the additive identity; here \(h(x):=\mathbb{E}_y[y\lvert x]\) is the optimal hypothesis under the square loss,
\(\mathrm{(c)}\) follows by noting that the cross-term vanishes since \(\mathbb{E}_{xy}[y-h(x)] = 0\),
\(\mathrm{(d)}\) and \(\mathrm{(e)}\) follow by again using the additive identity and noting that the cross-term is zero, and
\(\mathrm{(f)}\) spells out the expectations in terms of the underlying marginals.

This decomposition tells us a few things:

In all but the deterministic case, all expectation models will unsurprisingly incur non-zero risk from irreducible noise in the data.
Models that don’t have sufficient capacity/flexibility or that are poorly specified will incur non-zero risk in expectation due to bias (‘underfitting’) resulting from being unable to closely match \(h(x)\).
Models that ‘overfit’ to their datasets \(\mathcal{D}\) will have high variance over all \(\mathcal{D}\); this also contributes to the true risk.

Simple demonstration

Let’s make something like Figure 3.5 of Bishop. We draw our data from \(p(x,y)\) defined by:

\[p(x) = \mathcal{U}\left(-\frac{\pi}{2}, \frac{\pi}{2}\right) \\ p(y\lvert x) = \mathcal{N}\left(\sin(x), \sigma^2\right)\]

# @title Imports

import numpy as np
import pandas as pd
import plotnine as gg
import sklearn

gg.theme_set(gg.theme_bw());

# @title Create the dataset

num_data = 100
noise_sigma = 0.2

train_xs = np.random.uniform(-np.pi, np.pi, size=num_data)
train_ys = np.sin(train_xs) + noise_sigma * np.random.randn(num_data)

# @title Plot the data and its underlying distribution.

df = pd.DataFrame({'x': train_xs,
                   'y': train_ys, 
                   'h': np.sin(train_xs),
                   'std': noise_sigma,
                   })

p = (gg.ggplot(df)
     + gg.aes(x='x', y='h')
     + gg.geom_point(gg.aes(y='y'))
     + gg.geom_line(color='green', size=2)
     + gg.geom_ribbon(gg.aes(ymin='h - std', ymax='h + std'), alpha=0.2)
     + gg.geom_ribbon(gg.aes(ymin='h - 2 * std', ymax='h + 2 * std'), alpha=0.1)
     + gg.ggtitle(r'$p(x, y)$: Mean in green, $\sigma$ '
                  'and 2$\sigma$ intervals in grey')
     + gg.xlab(r'$x$')
     + gg.ylab(r'$y$')
)
p

png

Let’s let fit a linear model

\[f(x;\theta)=\theta^T\phi(x)\]

where \(\theta\in\mathbb{R}^M\) are model weights and \(\phi: \mathbb{R}\to\mathbb{R}^M\) are features. Some example features:

Polynomial:
\[\phi_P^m(x) = x^m\]
Fourier:
\[\phi_F^m(x) = \sin\left(\frac{2\pi m x}{M}\right)\]
Gaussian:

\[\phi_G^m(x) = \exp\left(-\left[x + \frac{M\pi - 2\pi m}{M}\right]^2\right)\]

We’ll minimize the \(l_2\)-regularized empirical risk

\[\hat{\mathcal{L}} = \lambda\big\|\theta\big\|^2_2 + \sum_{n=1}^N \left(y^{(i)} - f\left(x^{(i)};\theta\right)\right)^2.\]

# Number of training points, number of datasets to sample, number of features to use.
num_train = 25
num_models = 20
num_features = 24

# Define the features.
polynomial_features = np.array([train_xs**n for n in range(num_features)]).T
fourier_features = np.array([np.sin(2*np.pi*n*train_xs / num_features) for n in range(num_features)]).T
gaussian_features = np.array([np.exp(-(train_xs - (2*np.pi*x/num_features - np.pi))**2) for x in range(num_features)]).T

# Train for different amounts of regularization.
features = gaussian_features
results_df = pd.DataFrame()
for lam in [1e-3, 1e-1, 1e0, 1e1]:
  for idx in range(num_models):
    # Sample a small subset of the dataset.
    train_idx = np.random.randint(len(features), size=num_train)
    train_features = features[train_idx]
    train_targets = train_ys[train_idx]

    # Fit a model to it.
    model = sklearn.linear_model.Ridge(alpha=lam)
    model.fit(train_features, train_targets)

    # Collect results.
    df = pd.DataFrame({
        'x': train_xs,
        'y': train_ys,
        'h': np.sin(train_xs),
        'f': model.predict(features),
        'idx': [idx] * num_data,
        'lam': lam,
    })
    results_df = pd.concat([results_df, df])

# Plot our fitted functions.
fit_plot = (gg.ggplot(results_df)
            + gg.aes(x='x')
            
            + gg.geom_line(gg.aes(y='f', group='idx'), color='red', alpha=0.5)
            + gg.facet_wrap('lam', labeller='label_both', ncol=4)
            + gg.coord_cartesian(ylim=[-1.1, 1.1])
            + gg.theme(figure_size=(12, 3))
            + gg.xlab(r'$x$')
            + gg.ylab(r'$f(x)$')
)

# Plot their mean, over the datasets D.
mean_df = results_df.groupby(['x', 'lam', 'y', 'h']).agg({'f': np.mean}).reset_index()

mean_plot = (gg.ggplot(mean_df)
             + gg.aes(x='x', y='f')
             + gg.geom_line(gg.aes(y='h'), color='green')
             + gg.geom_line(color='red')
             + gg.facet_wrap('lam', labeller='label_both', ncol=4)
             + gg.theme(figure_size=(12, 3))
             + gg.xlab(r'$x$')
             + gg.ylab(r'$\mathbb{E}_\mathcal{D}f(x)$')
)

# Weak regularizer (low lambda) -> low bias, high variance.
print(fit_plot)

# Strong regularizer (high lambda) -> high bias, low variance.
print(mean_plot)

png