Open In Colab

A central result in ‘classical’ machine learning and statistics is known as the bias-variance decomposition. Let’s flesh it out.

Consider the univariate regression problem in which we have inputs $x \in \mathbb{R}$, targets $y \in \mathbb{R}$, and want to learn the conditional distribution $p(y \mid x)$, or at least its mean $h(x) = \mathbb{E}[y \mid x]$, via some parametric function $f(x; \mathbf{w})$. Now, any finite training set of $N$ points drawn from $p(x, y)$ will itself be a random variable:

$$\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}, \qquad (x_n, y_n) \overset{\text{i.i.d.}}{\sim} p(x, y).$$

The bias-variance decomposition considers how the risk (the expected loss under $p(x, y)$ for a model $f(\,\cdot\,; \mathcal{D})$ trained on $\mathcal{D}$) depends on this random variable. Let’s do the calculation first, then tease out the intuition.

The risk under the joint distribution $p(x, y)$ for a regression task using the square loss is given by

$$
\begin{aligned}
\mathcal{R} &= \mathbb{E}_{\mathcal{D}}\,\mathbb{E}_{x, y}\big[(y - f(x; \mathcal{D}))^2\big] && (1)\\
&= \mathbb{E}_{\mathcal{D}}\,\mathbb{E}_{x, y}\big[(y - h(x) + h(x) - f(x; \mathcal{D}))^2\big] && (2)\\
&= \mathbb{E}_{x, y}\big[(y - h(x))^2\big] + \mathbb{E}_{\mathcal{D}}\,\mathbb{E}_{x}\big[(h(x) - f(x; \mathcal{D}))^2\big] && (3)\\
&= \mathbb{E}_{x, y}\big[(y - h(x))^2\big] + \mathbb{E}_{\mathcal{D}}\,\mathbb{E}_{x}\big[(h(x) - \bar{f}(x) + \bar{f}(x) - f(x; \mathcal{D}))^2\big] && (4)\\
&= \mathbb{E}_{x, y}\big[(y - h(x))^2\big] + \mathbb{E}_{x}\big[(h(x) - \bar{f}(x))^2\big] + \mathbb{E}_{x}\,\mathbb{E}_{\mathcal{D}}\big[(f(x; \mathcal{D}) - \bar{f}(x))^2\big] && (5)\\
&= \underbrace{\iint (y - h(x))^2\, p(x, y)\, \mathrm{d}x\, \mathrm{d}y}_{\text{noise}}
  + \underbrace{\int \big(h(x) - \bar{f}(x)\big)^2\, p(x)\, \mathrm{d}x}_{\text{bias}^2}
  + \underbrace{\int \mathbb{E}_{\mathcal{D}}\big[(f(x; \mathcal{D}) - \bar{f}(x))^2\big]\, p(x)\, \mathrm{d}x}_{\text{variance}} && (6)
\end{aligned}
$$

where $\bar{f}(x) := \mathbb{E}_{\mathcal{D}}[f(x; \mathcal{D})]$ denotes the model averaged over training sets. Reasoning:

  • (1) expresses the fact that $f(x; \mathcal{D})$ is a function of both its inputs and the dataset it is trained on, which is a random variable.
  • (2) uses the additive identity $y - f = (y - h) + (h - f)$; here $h(x) = \mathbb{E}[y \mid x]$ is the optimal hypothesis under the square loss.
  • (3) follows by noting that the cross-term vanishes, since $\mathbb{E}_{y \mid x}[y - h(x)] = 0$.
  • (4) and (5) follow by again using the additive identity (this time around $\bar{f}(x)$) and noting that the cross-term is zero, since $\mathbb{E}_{\mathcal{D}}[\bar{f}(x) - f(x; \mathcal{D})] = 0$.
  • (6) spells out the expectations in terms of the underlying marginals.

This decomposition tells us a few things:

  • In all but the deterministic case, every model will (unsurprisingly) incur non-zero risk in expectation, due to the irreducible noise in the data.
  • Models that don’t have sufficient capacity/flexibility, or that are poorly specified, will incur non-zero risk in expectation due to bias (‘underfitting’): even the dataset-averaged model $\bar{f}(x)$ cannot closely match $h(x)$.
  • Models that ‘overfit’ to their particular dataset will have high variance across draws of $\mathcal{D}$; this also contributes to the true risk.
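
Before the demonstration, here is a quick numerical sanity check of the decomposition: a minimal sketch that samples many datasets $\mathcal{D}$, fits a model to each, and compares a Monte Carlo estimate of the risk against noise $+$ bias$^2$ $+$ variance. It uses the same sinusoidal $p(x, y)$ (with $\sigma = 0.2$) as the demonstration below; the degree-3 polynomial fit is just an illustrative model choice.

# A minimal sketch: Monte Carlo check that expected risk ~= noise + bias^2 + variance.
import numpy as np

rng = np.random.default_rng(0)
sigma, num_datasets, n_train = 0.2, 500, 25

# Fixed evaluation grid over the (uniform) input range; h(x) = sin(x) is the
# optimal noise-free predictor under the square loss.
xs = np.linspace(-np.pi, np.pi, 200)
h = np.sin(xs)

# Fit a degree-3 polynomial (illustrative choice) to each sampled dataset D.
preds = []
for _ in range(num_datasets):
  x_d = rng.uniform(-np.pi, np.pi, n_train)
  y_d = np.sin(x_d) + sigma * rng.standard_normal(n_train)
  coeffs = np.polyfit(x_d, y_d, deg=3)
  preds.append(np.polyval(coeffs, xs))
preds = np.array(preds)                      # shape: (num_datasets, len(xs))

f_bar = preds.mean(axis=0)                   # E_D[f(x; D)]
noise = sigma**2                             # E[(y - h(x))^2]
bias_sq = np.mean((h - f_bar)**2)            # E_x[(h(x) - f_bar(x))^2]
variance = np.mean(preds.var(axis=0))        # E_x E_D[(f(x; D) - f_bar(x))^2]

# Direct estimate of the risk: squared error against fresh noisy targets.
ys = h + sigma * rng.standard_normal((num_datasets, len(xs)))
risk = np.mean((ys - preds)**2)

print(f'noise + bias^2 + variance = {noise + bias_sq + variance:.4f}')
print(f'Monte Carlo risk          = {risk:.4f}')

The two printed numbers should agree up to Monte Carlo error.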

Simple demonstration

Let’s make something like Figure 3.5 of Bishop (Pattern Recognition and Machine Learning). We draw our data from a $p(x, y)$ defined by

$$x \sim \mathrm{Uniform}(-\pi, \pi), \qquad y = \sin(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2), \quad \sigma = 0.2.$$

# @title Imports

import numpy as np
import pandas as pd
import plotnine as gg
import sklearn.linear_model  # Ridge is used below; the submodule must be imported explicitly.

gg.theme_set(gg.theme_bw());
# @title Create the dataset

num_data = 100
noise_sigma = 0.2

train_xs = np.random.uniform(-np.pi, np.pi, size=num_data)
train_ys = np.sin(train_xs) + noise_sigma * np.random.randn(num_data)
# @title Plot the data and its underlying distribution.

df = pd.DataFrame({'x': train_xs,
                   'y': train_ys, 
                   'h': np.sin(train_xs),
                   'std': noise_sigma,
                   })

p = (gg.ggplot(df)
     + gg.aes(x='x', y='h')
     + gg.geom_point(gg.aes(y='y'))
     + gg.geom_line(color='green', size=2)
     + gg.geom_ribbon(gg.aes(ymin='h - std', ymax='h + std'), alpha=0.2)
     + gg.geom_ribbon(gg.aes(ymin='h - 2 * std', ymax='h + 2 * std'), alpha=0.1)
     + gg.ggtitle(r'$p(x, y)$: Mean in green, $\sigma$ '
                  r'and $2\sigma$ intervals in grey')
     + gg.xlab(r'$x$')
     + gg.ylab(r'$y$')
)
p

[Figure: the sampled data, with the mean $\sin(x)$ in green and the $\sigma$ / $2\sigma$ noise bands in grey.]

Let’s fit a linear model

$$f(x; \mathbf{w}) = \mathbf{w}^\top \boldsymbol{\phi}(x) = \sum_{j=0}^{M-1} w_j\, \phi_j(x),$$

where $\mathbf{w} \in \mathbb{R}^M$ are model weights and $\boldsymbol{\phi}(x) = (\phi_0(x), \ldots, \phi_{M-1}(x))$ are features. Some example features:

  • Polynomial: $\phi_j(x) = x^j$

  • Fourier: $\phi_j(x) = \sin(2\pi j x / M)$

  • Gaussian: $\phi_j(x) = \exp\!\big(-(x - \mu_j)^2\big)$, with centres $\mu_j = 2\pi j / M - \pi$ spread over the input range

We’ll minimize the $\ell_2$-regularized empirical risk

$$\hat{\mathcal{R}}(\mathbf{w}) = \sum_{n=1}^{N} \big(y_n - \mathbf{w}^\top \boldsymbol{\phi}(x_n)\big)^2 + \lambda\, \lVert \mathbf{w} \rVert_2^2.$$
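
As an aside, this objective has a closed-form minimizer, $\mathbf{w}^\star = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top \mathbf{y}$, where $\Phi$ is the $N \times M$ feature matrix. This is what the `sklearn.linear_model.Ridge` call below solves (modulo the unpenalized intercept it fits by default). A minimal numpy sketch, with Phi, y, and lam standing in for the quantities defined below:

# A sketch of the closed-form ridge solution; Phi is an (N, M) feature matrix,
# y an (N,)-vector of targets, and lam the regularization strength.
import numpy as np

def ridge_fit(Phi, y, lam):
  """Return the w minimizing ||y - Phi w||^2 + lam * ||w||^2."""
  M = Phi.shape[1]
  # Solve the regularized normal equations (Phi^T Phi + lam I) w = Phi^T y
  # directly; this is more stable than forming the matrix inverse explicitly.
  return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)

def ridge_predict(Phi, w):
  return Phi @ w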

# Number of training points, number of datasets to sample, number of features to use.
num_train = 25
num_models = 20
num_features = 24

# Define the features.
polynomial_features = np.array([train_xs**n for n in range(num_features)]).T
fourier_features = np.array([np.sin(2*np.pi*n*train_xs / num_features) for n in range(num_features)]).T
gaussian_features = np.array([np.exp(-(train_xs - (2*np.pi*x/num_features - np.pi))**2) for x in range(num_features)]).T

# Train for different amounts of regularization.
features = gaussian_features
results_df = pd.DataFrame()
for lam in [1e-3, 1e-1, 1e0, 1e1]:
  for idx in range(num_models):
    # Sample a small subset of the dataset (indices drawn with replacement).
    train_idx = np.random.randint(len(features), size=num_train)
    train_features = features[train_idx]
    train_targets = train_ys[train_idx]

    # Fit a model to it.
    model = sklearn.linear_model.Ridge(alpha=lam)
    model.fit(train_features, train_targets)

    # Collect results.
    df = pd.DataFrame({
        'x': train_xs,
        'y': train_ys,
        'h': np.sin(train_xs),
        'f': model.predict(features),
        'idx': [idx] * num_data,
        'lam': lam,
    })
    results_df = pd.concat([results_df, df])
# Plot our fitted functions.
fit_plot = (gg.ggplot(results_df)
            + gg.aes(x='x')
            
            + gg.geom_line(gg.aes(y='f', group='idx'), color='red', alpha=0.5)
            + gg.facet_wrap('lam', labeller='label_both', ncol=4)
            + gg.coord_cartesian(ylim=[-1.1, 1.1])
            + gg.theme(figure_size=(12, 3))
            + gg.xlab(r'$x$')
            + gg.ylab(r'$f(x)$')
)
# Plot their mean, over the datasets D.
mean_df = results_df.groupby(['x', 'lam', 'y', 'h']).agg({'f': 'mean'}).reset_index()

mean_plot = (gg.ggplot(mean_df)
             + gg.aes(x='x', y='f')
             + gg.geom_line(gg.aes(y='h'), color='green')
             + gg.geom_line(color='red')
             + gg.facet_wrap('lam', labeller='label_both', ncol=4)
             + gg.theme(figure_size=(12, 3))
             + gg.xlab(r'$x$')
             + gg.ylab(r'$\mathbb{E}_\mathcal{D}f(x)$')
)
# Weak regularizer (low lambda) -> low bias, high variance across datasets
# (visible in the spread of the individual fits).
print(fit_plot)

# Strong regularizer (high lambda) -> high bias, low variance
# (visible in the dataset-averaged fits drifting from the true mean in green).
print(mean_plot)

[Figure: the fitted functions $f(x)$ from each sampled dataset, faceted by $\lambda$.]

[Figure: the dataset-averaged fit $\mathbb{E}_\mathcal{D}f(x)$ (red) against the true mean $h(x)$ (green), faceted by $\lambda$.]
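
To tie the plots back to the decomposition, here is a sketch that estimates the squared-bias and variance terms directly from results_df for each regularization strength, averaging over the sampled $x$ locations. As $\lambda$ grows, the variance term should shrink while the squared bias grows.

# A sketch: estimate bias^2 and variance per regularization strength from the
# fits collected in results_df (averaging over the sampled x locations).
summary = []
for lam, group in results_df.groupby('lam'):
  # Predictions arranged as (dataset index, x); pivot_table sorts x consistently.
  preds = group.pivot_table(index='idx', columns='x', values='f').to_numpy()
  h = group.pivot_table(index='idx', columns='x', values='h').to_numpy()[0]
  f_bar = preds.mean(axis=0)                  # E_D[f(x; D)]
  summary.append({
      'lam': lam,
      'bias_sq': np.mean((h - f_bar)**2),     # E_x[(h(x) - f_bar(x))^2]
      'variance': np.mean(preds.var(axis=0)), # E_x E_D[(f(x; D) - f_bar(x))^2]
  })
print(pd.DataFrame(summary))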