
Multi-Task vs. Multi-Output

In this notebook, we will look at the difference between multi-task and multi-output regression. In a nutshell, multi-output is when we have a single function that returns the full multidimensional vector of labels, and multi-task is when we have a separate function for each output in that vector. There are pros and cons to each method, so it is up to the expert to give data scientists some guidance on why one would choose one over the other.

Let's define some terms: we have a dataset $\mathcal{D} = \{X, Y\}$ of pairs of inputs $X \in \mathbb{R}^{N \times D}$ and outputs $Y \in \mathbb{R}^{N \times O}$. Here $N$ is the number of samples, $D$ is the number of features/dimensions, and $O$ is the number of outputs (or tasks).
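
To make the distinction concrete, here is a minimal sketch of the two strategies, where Estimator is a hypothetical placeholder for any model with the usual scikit-learn fit/predict interface (numpy is assumed imported as np):

# Multi-output: a single model maps X to all O outputs at once.
model = Estimator()
model.fit(X, Y)                # Y has shape (N, O)
Y_pred = model.predict(X)      # predictions of shape (N, O)

# Multi-task: one independent model per output column.
models = [Estimator().fit(X, Y[:, j]) for j in range(Y.shape[1])]
Y_pred = np.column_stack([m.predict(X) for m in models])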

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, BayesianRidge
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import WhiteKernel, ConstantKernel, RBF
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import time
# Make Fake Dataset
X, y = make_regression(
    n_samples=1000, 
    n_features=10,    # Total Features
    n_informative=3,   # Informative Features 
    n_targets=10,
    bias=10,
    noise=0.8,
    random_state=123,
)

# Training and Testing
xtrain, xtest, ytrain, ytest = train_test_split(X, y, train_size=500, random_state=123)

Algorithm I - Linear Regression

Test I - MultiOutput Linear Regression

linear_model = LinearRegression()
t0 = time.time()
linear_model.fit(xtrain, ytrain)
t1 = time.time() - t0
ypred = linear_model.predict(xtest)

# Get Stats (note: the metrics take y_true first; r2_score is not symmetric)
mae = mean_absolute_error(ytest, ypred)
mse = mean_squared_error(ytest, ypred)
rmse = np.sqrt(mse)
r2 = r2_score(ytest, ypred)

print(
    f"MAE: {mae:.3f}\nMSE: {mse:.3f}\nRMSE: {rmse:.3f}\nR2: {r2:.3f}" 
    f" \nTime: {t1:.3} seconds"
)
MAE: 0.640
MSE: 0.643
RMSE: 0.802
R2: 1.000 
Time: 0.00388 seconds

Test II - MultiTask Linear Regression

linear_model_multi = MultiOutputRegressor(
    LinearRegression(), 
    n_jobs=-1,              # Number of cores to use to parallelize the training
)

t0 = time.time()
linear_model_multi.fit(xtrain, ytrain)
t1 = time.time() - t0
ypred = linear_model_multi.predict(xtest)

# Get Stats (y_true first, then y_pred)
mae = mean_absolute_error(ytest, ypred)
mse = mean_squared_error(ytest, ypred)
rmse = np.sqrt(mse)
r2 = r2_score(ytest, ypred)

print(
    f"MAE: {mae:.3f}\nMSE: {mse:.3f}\nRMSE: {rmse:.3f}\nR2: {r2:.3f}" 
    f" \nTime: {t1:.3} seconds"
)
MAE: 0.640
MSE: 0.643
RMSE: 0.802
R2: 1.000 
Time: 0.143 seconds

The results are exactly the same in the case of Linear Regression, but for more complex models we should not expect this to hold. One obvious trade-off is the number of tasks (outputs) we have: the multi-task approach fits one estimator per output, so 100 outputs, for example, would mean training 100 separate models, which takes considerably more time.
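
As a rough illustration of this scaling, here is a sketch that reuses the imports above and times the multi-task wrapper as the number of targets grows (the exact timings will vary by machine):

for n_targets in (2, 10, 100):
    Xs, Ys = make_regression(n_samples=500, n_features=10, n_targets=n_targets, random_state=123)
    t0 = time.time()
    # One LinearRegression is fit per target, so time grows roughly linearly.
    MultiOutputRegressor(LinearRegression()).fit(Xs, Ys)
    print(f"{n_targets:>3} targets: {time.time() - t0:.4f} seconds")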

Algorithm II - Gaussian Process Regression

MultiOutput

The GP implementation in scikit-learn handles multiple outputs natively, so we can simply pass in the full matrix of labels.

# define kernel function
kernel = ConstantKernel() * RBF() + WhiteKernel()

# define GP model
gp_model = GaussianProcessRegressor(
    kernel=kernel,            # kernel function (very important)
    normalize_y=True,         # good standard practice
    random_state=123,         # reproducibility
    n_restarts_optimizer=10,  # good practice (avoids local minima)
)

# train GP Model
t0 = time.time()
gp_model.fit(xtrain, ytrain)
t1 = time.time() - t0
ypred = gp_model.predict(xtest)

# Get Stats (y_true first, then y_pred)
mae = mean_absolute_error(ytest, ypred)
mse = mean_squared_error(ytest, ypred)
rmse = np.sqrt(mse)
r2 = r2_score(ytest, ypred)

print(
    f"GP Model:\n"
    f"MAE: {mae:.3f}\nMSE: {mse:.3f}\nRMSE: {rmse:.3f}\nR2: {r2:.3f}" 
    f" \nTime: {t1:.3} seconds"
)
GP Model:
MAE: 0.697
MSE: 0.762
RMSE: 0.873
R2: 1.000 
Time: 8.17 seconds

Multi-Task

# Define kernel function
kernel = ConstantKernel() * RBF() + WhiteKernel()

# Define GP Model
gp_model = GaussianProcessRegressor(
    kernel=kernel,            # kernel function (very important)
    normalize_y=True,         # good standard practice
    random_state=123,         # reproducibility
    n_restarts_optimizer=10,  # good practice (avoids local minima)
)

# Define Multioutput function
gp_model_multi = MultiOutputRegressor(
    gp_model, 
    n_jobs=-1,              # Number of cores to use to parallelize the training
)

# Fit Model
t0 = time.time()
gp_model_multi.fit(xtrain, ytrain)
t1 = time.time() - t0

# Predict with test set
ypred = gp_model_multi.predict(xtest)

# Get Stats (y_true first, then y_pred)
mae = mean_absolute_error(ytest, ypred)
mse = mean_squared_error(ytest, ypred)
rmse = np.sqrt(mse)
r2 = r2_score(ytest, ypred)

print(
    f"GP Model:\n"
    f"MAE: {mae:.3f}\nMSE: {mse:.3f}\nRMSE: {rmse:.3f}\nR2: {r2:.3f}" 
    f" \nTime: {t1:.3} seconds"
)
GP Model:
MAE: 0.698
MSE: 0.766
RMSE: 0.875
R2: 1.000 
Time: 10.8 seconds

The accuracy is essentially identical (marginally worse, in fact), and notice that the training time was of the same order even though the multi-task version fits ten independent GPs in parallel. This sometimes happens because the native multioutput GP must optimize a single set of kernel hyperparameters shared across all outputs, and some algorithms have a more difficult time converging when fitting multiple outputs at once.
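
One way to see the difference between the two fits is to inspect the learned kernels: the native multioutput GP keeps a single set of kernel hyperparameters shared across all outputs, while the multi-task wrapper learns one kernel per output. A quick sketch, assuming gp_model and gp_model_multi from the cells above are still in scope:

# The native multioutput GP stores one fitted kernel for all outputs.
print("Shared kernel:", gp_model.kernel_)

# The multi-task wrapper stores one fitted estimator (and kernel) per output.
for i, est in enumerate(gp_model_multi.estimators_):
    print(f"Output {i}: {est.kernel_}")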

Algorithm III - Bayesian Ridge Regression

This algorithm does not support multiple outputs natively, so instead we will use the multi-task approach.
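
To verify that, here is a quick sketch: fitting BayesianRidge directly on the full label matrix should raise a ValueError, since its fit method only accepts a 1-dimensional y.

# BayesianRidge expects a 1D target vector, so a 2D Y matrix is rejected.
try:
    BayesianRidge().fit(xtrain, ytrain)
except ValueError as e:
    print(f"ValueError: {e}")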

Multi-Task

# Define Bayesian Ridge Model
bayes_model = BayesianRidge(
    n_iter=1000,
    normalize=True,         # good standard practice
    verbose=1,              # Gives updates
)

# Define Multioutput function
bayes_model_multi = MultiOutputRegressor(
    bayes_model, 
    n_jobs=-1,              # Number of cores to use to parallelize the training
)

# Fit Model
t0 = time.time()
bayes_model_multi.fit(xtrain, ytrain)
t1 = time.time() - t0

# Predict with test set
ypred = bayes_model_multi.predict(xtest)

# Get Stats (y_true first, then y_pred)
mae = mean_absolute_error(ytest, ypred)
mse = mean_squared_error(ytest, ypred)
rmse = np.sqrt(mse)
r2 = r2_score(ytest, ypred)

print(
    f"GP Model:\n"
    f"MAE: {mae:.3f}\nMSE: {mse:.3f}\nRMSE: {rmse:.3f}\nR2: {r2:.3f}" 
    f" \nTime: {t1:.3} seconds"
)
Bayesian Ridge Model:
MAE: 0.640
MSE: 0.643
RMSE: 0.802
R2: 1.000 
Time: 0.0338 seconds