Revisiting the bias-variance decomposition
A central result in ‘classical’ machine learning and statistics is known as the bias-variance decomposition. Let’s flesh it out.
Consider the univariate regression problem in which we have inputs $x \in \mathbb{R}$, targets $y \in \mathbb{R}$, and want to learn the conditional distribution $p(y \mid x)$, or at least its mean $\mathbb{E}[y \mid x]$, via some parametric function $f(x)$. Now, any finite training set of $N$ points drawn from $p(x, y)$ will itself be a random variable:

$$\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}, \qquad (x_n, y_n) \overset{\text{i.i.d.}}{\sim} p(x, y).$$

The bias-variance decomposition considers how the risk (the expected loss under $p(x, y)$ for a model $f$) depends on this random variable. Let’s do the calculation first, then tease out the intuition.
The risk under the joint distribution for a regression task using the square loss is given by

$$
\begin{aligned}
\mathcal{R}
&= \mathbb{E}_{\mathcal{D}}\, \mathbb{E}_{x, y} \big[ \big(f(x; \mathcal{D}) - y\big)^2 \big] \\
&= \mathbb{E}_{\mathcal{D}}\, \mathbb{E}_{x, y} \big[ \big(f(x; \mathcal{D}) - h(x) + h(x) - y\big)^2 \big] \\
&= \mathbb{E}_{\mathcal{D}}\, \mathbb{E}_{x} \big[ \big(f(x; \mathcal{D}) - h(x)\big)^2 \big] + \mathbb{E}_{x, y} \big[ \big(h(x) - y\big)^2 \big] \\
&= \mathbb{E}_{\mathcal{D}}\, \mathbb{E}_{x} \big[ \big(f(x; \mathcal{D}) - \mathbb{E}_{\mathcal{D}} f(x; \mathcal{D}) + \mathbb{E}_{\mathcal{D}} f(x; \mathcal{D}) - h(x)\big)^2 \big] + \mathbb{E}_{x, y} \big[ \big(h(x) - y\big)^2 \big] \\
&= \mathbb{E}_{x} \big[ \big(\mathbb{E}_{\mathcal{D}} f(x; \mathcal{D}) - h(x)\big)^2 \big] + \mathbb{E}_{x}\, \mathbb{E}_{\mathcal{D}} \big[ \big(f(x; \mathcal{D}) - \mathbb{E}_{\mathcal{D}} f(x; \mathcal{D})\big)^2 \big] + \mathbb{E}_{x, y} \big[ \big(h(x) - y\big)^2 \big] \\
&= \underbrace{\int \big(\mathbb{E}_{\mathcal{D}} f(x; \mathcal{D}) - h(x)\big)^2 \, p(x) \, dx}_{\text{bias}^2}
 + \underbrace{\int \mathbb{E}_{\mathcal{D}} \big[ \big(f(x; \mathcal{D}) - \mathbb{E}_{\mathcal{D}} f(x; \mathcal{D})\big)^2 \big] \, p(x) \, dx}_{\text{variance}}
 + \underbrace{\iint \big(h(x) - y\big)^2 \, p(x, y) \, dx \, dy}_{\text{noise}}
\end{aligned}
$$

The reasoning, line by line:

- Line 1 expresses the fact that $f$ is a function of both its inputs and the dataset $\mathcal{D}$ it is trained on, which is a random variable.
- Line 2 uses the additive identity (adding and subtracting the same quantity); here $h(x) = \mathbb{E}[y \mid x]$ is the optimal hypothesis under the square loss.
- Line 3 follows by noting that the cross-term vanishes, since $\mathbb{E}_{y \mid x}[h(x) - y] = 0$.
- Lines 4 and 5 follow by again using the additive identity (this time with $\mathbb{E}_{\mathcal{D}} f(x; \mathcal{D})$) and noting that the resulting cross-term is zero.
- Line 6 spells out the expectations in terms of the underlying marginals.
This decomposition tells us a few things:
- In all but the deterministic case, every model will, in expectation, unsurprisingly incur non-zero risk from the irreducible noise in the data.
- Models that don’t have sufficient capacity/flexibility, or that are poorly specified, will incur non-zero risk in expectation due to bias (‘underfitting’), resulting from being unable to closely match $h(x)$.
- Models that ‘overfit’ to their particular datasets will have high variance over all $\mathcal{D}$; this also contributes to the true risk.
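
Before the demonstration below, here is a minimal Monte Carlo sanity check of the decomposition, under assumed toy settings (a quadratic optimal hypothesis $h$, Gaussian noise, and a deliberately too-simple linear model class): the three estimated terms should approximately sum to the estimated risk.

import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5  # assumed noise standard deviation

def h(x):
  return x ** 2  # assumed optimal hypothesis h(x) = E[y|x]

def sample_dataset(n=30):
  xs = rng.uniform(-1, 1, size=n)
  return xs, h(xs) + sigma * rng.normal(size=n)

def fit_linear(xs, ys):
  # Ordinary least squares on [1, x] features: a biased (underfitting) model class.
  X = np.stack([np.ones_like(xs), xs], axis=1)
  w, *_ = np.linalg.lstsq(X, ys, rcond=None)
  return lambda x: w[0] + w[1] * x

# Predictions of many independently trained models on a common set of test inputs.
test_xs = rng.uniform(-1, 1, size=5_000)
preds = np.stack([fit_linear(*sample_dataset())(test_xs) for _ in range(500)])

bias_sq = np.mean((preds.mean(axis=0) - h(test_xs)) ** 2)
variance = np.mean(preds.var(axis=0))
noise = sigma ** 2
test_ys = h(test_xs) + sigma * rng.normal(size=preds.shape)  # fresh y for each model/x pair
risk = np.mean((preds - test_ys) ** 2)
print(f'bias^2 + variance + noise = {bias_sq + variance + noise:.3f}  vs  risk = {risk:.3f}')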
Simple demonstration
Let’s make something like Figure 3.5 of Bishop. We draw our data from the joint distribution $p(x, y)$ defined by

$$x \sim \mathcal{U}(-\pi, \pi), \qquad y = \sin(x) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2), \quad \sigma = 0.2.$$
# @title Imports
import numpy as np
import pandas as pd
import plotnine as gg
import sklearn.linear_model
gg.theme_set(gg.theme_bw());
# @title Create the dataset
num_data = 100
noise_sigma = 0.2
train_xs = np.random.uniform(-np.pi, np.pi, size=num_data)
train_ys = np.sin(train_xs) + noise_sigma * np.random.randn(num_data)
# @title Plot the data and its underlying distribution.
df = pd.DataFrame({'x': train_xs,
                   'y': train_ys,
                   'h': np.sin(train_xs),
                   'std': noise_sigma,
                   })
p = (gg.ggplot(df)
     + gg.aes(x='x', y='h')
     + gg.geom_point(gg.aes(y='y'))
     + gg.geom_line(color='green', size=2)
     + gg.geom_ribbon(gg.aes(ymin='h - std', ymax='h + std'), alpha=0.2)
     + gg.geom_ribbon(gg.aes(ymin='h - 2 * std', ymax='h + 2 * std'), alpha=0.1)
     + gg.ggtitle(r'$p(x, y)$: Mean in green, $\sigma$ '
                  r'and $2\sigma$ intervals in grey')
     + gg.xlab(r'$x$')
     + gg.ylab(r'$y$')
     )
p
Let’s fit a linear model

$$f(x; \mathbf{w}) = \mathbf{w}^\top \boldsymbol{\phi}(x) = \sum_{j=0}^{J-1} w_j \, \phi_j(x),$$

where $\mathbf{w}$ are model weights and $\boldsymbol{\phi}(x) = (\phi_0(x), \dots, \phi_{J-1}(x))$ are features, with $J$ the number of features. Some example features:

- Polynomial: $\phi_j(x) = x^j$

- Fourier: $\phi_j(x) = \sin(2\pi j x / J)$

- Gaussian: $\phi_j(x) = \exp\!\big(-(x - \mu_j)^2\big)$, with centres $\mu_j = 2\pi j / J - \pi$

We’ll minimize the regularized empirical risk

$$\hat{\mathcal{R}}_\lambda(\mathbf{w}) = \sum_{n=1}^{N} \big(\mathbf{w}^\top \boldsymbol{\phi}(x_n) - y_n\big)^2 + \lambda \lVert \mathbf{w} \rVert_2^2.$$
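
As an aside, this objective has the closed-form minimizer $\mathbf{w}^\star = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top \mathbf{y}$, where $\Phi$ is the $N \times J$ feature matrix. Here is a minimal sketch (with made-up random features and targets) checking it against `sklearn.linear_model.Ridge`; note `fit_intercept=False`, since the closed form above has no separate intercept term.

import numpy as np
import sklearn.linear_model

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 5))  # illustrative N x J feature matrix
y = rng.normal(size=50)         # illustrative targets
lam = 0.1

# Closed-form ridge solution: (Phi^T Phi + lam * I)^{-1} Phi^T y.
w_closed = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

# sklearn's Ridge minimizes ||y - Phi w||^2 + lam * ||w||^2 when fit_intercept=False.
ridge = sklearn.linear_model.Ridge(alpha=lam, fit_intercept=False).fit(Phi, y)

print(np.allclose(w_closed, ridge.coef_))  # expect: True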
# Number of training points, number of datasets to sample, number of features to use.
num_train = 25
num_models = 20
num_features = 24
# Define the features.
polynomial_features = np.array([train_xs**n for n in range(num_features)]).T
fourier_features = np.array([np.sin(2*np.pi*n*train_xs / num_features) for n in range(num_features)]).T
gaussian_features = np.array([np.exp(-(train_xs - (2*np.pi*n/num_features - np.pi))**2) for n in range(num_features)]).T
# Train for different amounts of regularization.
features = gaussian_features
results_df = pd.DataFrame()
for lam in [1e-3, 1e-1, 1e0, 1e1]:
  for idx in range(num_models):
    # Sample a small subset of the dataset.
    train_idx = np.random.randint(len(features), size=num_train)
    train_features = features[train_idx]
    train_targets = train_ys[train_idx]
    # Fit a model to it.
    model = sklearn.linear_model.Ridge(alpha=lam)
    model.fit(train_features, train_targets)
    # Collect results.
    df = pd.DataFrame({
        'x': train_xs,
        'y': train_ys,
        'h': np.sin(train_xs),
        'f': model.predict(features),
        'idx': [idx] * num_data,
        'lam': lam,
    })
    results_df = pd.concat([results_df, df])
# Plot our fitted functions.
fit_plot = (gg.ggplot(results_df)
            + gg.aes(x='x')
            + gg.geom_line(gg.aes(y='f', group='idx'), color='red', alpha=0.5)
            + gg.facet_wrap('lam', labeller='label_both', ncol=4)
            + gg.coord_cartesian(ylim=[-1.1, 1.1])
            + gg.theme(figure_size=(12, 3))
            + gg.xlab(r'$x$')
            + gg.ylab(r'$f(x)$')
            )
# Plot their mean, over the datasets D.
mean_df = results_df.groupby(['x', 'lam', 'y', 'h']).agg({'f': 'mean'}).reset_index()
mean_plot = (gg.ggplot(mean_df)
             + gg.aes(x='x', y='f')
             + gg.geom_line(gg.aes(y='h'), color='green')
             + gg.geom_line(color='red')
             + gg.facet_wrap('lam', labeller='label_both', ncol=4)
             + gg.theme(figure_size=(12, 3))
             + gg.xlab(r'$x$')
             + gg.ylab(r'$\mathbb{E}_\mathcal{D}f(x)$')
             )
# Weak regularizer (low lambda) -> low bias, high variance.
print(fit_plot)
# Strong regularizer (high lambda) -> high bias, low variance.
print(mean_plot)
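
Finally, as a rough numerical companion to the plots (a sketch reusing `results_df` from above), we can estimate the bias² and variance terms for each regularization strength directly from the sampled fits: for each $x$, take the squared gap between the average fit and $h$, and the variance of the fits across datasets, then average over $x$.

# Empirical bias^2 and variance per lambda, averaged over the observed x's.
summary = (results_df
           .groupby(['lam', 'x'])
           .agg(f_mean=('f', 'mean'), f_var=('f', 'var'), h=('h', 'first'))
           .reset_index())
summary['bias_sq'] = (summary['f_mean'] - summary['h']) ** 2
print(summary.groupby('lam')[['bias_sq', 'f_var']].mean())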