In this post I explore different tools and methods that can be used to try and interpret "black box" models.
Since the NFL draft is today, I thought a fun dataset to use for this topic would be the NFL combine data. Using that data we'll first build a model to predict the early career performance for NFL defensive ends (DE). After that, we will use different interpretability methods to better understand the model.
Creating a model¶
Let's get started by importing what we need to set up the data and our modeling pipeline.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.preprocessing import Imputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer
from sklearn.ensemble import RandomForestRegressor
from skll.metrics import spearman
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
import warnings
# setting up the styling for the plots in this notebook
sns.set(style="white", palette="colorblind", font_scale=1.2,
rc={"figure.figsize":(12,9)})
RANDOM_STATE = 420
N_JOBS=8
Here we load in the data that contains each player's combine information and their 3 year Approximate Value, which we will use as a measure of early career performance (going forward I'll abbreviate it as AV). When building our model we won't be predicting a defensive end's raw AV; instead we'll be predicting where his AV ranks (as a percentile) among other defensive ends. Players with a high AV should be closer to 1, while players with a low AV should be closer to 0.
NOTE: If you want to read up on which combine measurables matter for each NFL position, take a look at Bill Lotter's series of blog posts on the topic.
# read in our data
data_df = pd.read_csv('data/combine_data_since_2000_PROCESSED_2018-04-26.csv')
# only get players that have been in the league for 3 years
data_df2 = data_df.loc[data_df.Year <= 2015].copy()
# calculate the player AV percentiles by position
data_df2['AV_pctile'] = (data_df2.groupby('Pos')
.AV
.rank(pct=True,
method='min',
ascending=True))
Now that we've loaded the data and calculated the AV percentiles, let's get the DE data and create a training set and a testing set. We will train and tune our model on the 2000-2011 combine classes and then test it on the 2012-2015 classes.
# Get the data for the position we want, in this case it's DE
pos_df = data_df2.loc[data_df2.Pos=='DE'].copy().reset_index(drop=True)
# Split the data into train and test sets
train_df = pos_df.loc[pos_df.Year <= 2011]
test_df = pos_df.loc[pos_df.Year.isin([2012, 2013, 2014, 2015])]
We'll build a random forest model since that seems to be the most common "black box" algorithm people use at work. One thing to note is that our modeling Pipeline will need to include an Imputer, because some DEs are missing data as they did not participate in all of the combine drills.
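Before we set up the pipeline, it's worth peeking at how much data is actually missing. A quick (hypothetical) check, spelling out the combine columns we'll formalize as features in the next cell:
# count how many DEs in the training data are missing each combine measurement
train_df[['Forty', 'Wt', 'Ht', 'Vertical', 'BenchReps',
          'BroadJump', 'Cone', 'Shuttle']].isnull().sum()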
# Combine measurables
features = ['Forty',
'Wt',
'Ht',
'Vertical',
'BenchReps',
'BroadJump',
'Cone',
'Shuttle']
# what we want to predict
target = 'AV_pctile'
X = train_df[features].values
y = train_df[target].values
# the modeling pipeline
pipe = Pipeline([("imputer", Imputer()),
("estimator", RandomForestRegressor(random_state=RANDOM_STATE))])
To tune our model we will use BayesSearchCV from scikit-optimize, which utilizes Bayesian optimization to find the best hyperparameters. We'll use Spearman's rank correlation as our scoring metric since we are mainly concerned with the ranking of players when it comes to the draft.
# We use spearman's rank correlation as the scoring metric since
# we are concerned with ranking the players
spearman_scorer = make_scorer(spearman)
# the hyperparamters to search over, including different imputation strategies
rf_param_space = {
'imputer__strategy': Categorical(['mean', 'median', 'most_frequent']),
'estimator__max_features': Integer(1, 8),
'estimator__n_estimators': Integer(50, 500),
'estimator__min_samples_split': Integer(2, 200),
}
# create our search object
search = BayesSearchCV(pipe,
rf_param_space,
cv=10,
n_jobs=N_JOBS,
verbose=0,
error_score=-9999,
scoring=spearman_scorer,
random_state=RANDOM_STATE,
return_train_score=True,
n_iter=75)
# fit the model
# I get some funky warnings, possibly due to the spearman scorer,
# I choose to suppress them
with warnings.catch_warnings():
    warnings.filterwarnings('ignore')
    search.fit(X, y)
# best model parameters
search.best_params_
# CV score
search.best_score_
# CV standard deviation
search.cv_results_['std_test_score'][search.best_index_]
Now that we've tuned our model let's evaluate it on the test set.
# test set data
X_test = test_df[features].values
y_test = test_df[target].values
# predictions
y_pred = search.predict(X_test)
# evaluation
model_test_score = spearman_scorer(search, X_test, y_test)
model_test_score
So our model's predictions had a Spearman's rank correlation of about 0.209. Is that good or bad? I'm not sure. To get a better sense of how good or how bad of a score that is we can use the actual draft picks as a comparison point.
# create percentiles for nfl draft picks
# Lower numerical picks (e.g. 1, 2, 3) are ranked closer to 1
# Higher numerical picks (e.g. 180, 200, etc) are ranked closer to 0
draft_pick_pctile = test_df.Pick.rank(pct=True,
method='min',
ascending=False,
na_option='top')
spearman(y_test, draft_pick_pctile)
It looks like NFL teams are 3 times better at ranking DEs than our model.
Ok great, now that we have our model let's explore the different tools we can use to better understand it.
Feature Importance¶
Mean Decrease Impurity¶
When using a tree ensemble like a random forest you can find out which features the model found valuable by checking the feature importances. In scikit-learn the feature importances reflect how well a feature reduces some criterion; in our regression example that criterion is mean squared error. This method for calculating feature importance is typically called mean decrease impurity or gini importance.
We can access our model's feature importances with the feature_importances_ attribute.
# get the estimator and imputer from our pipeline, which will be used
# as we try and interpret our model
estimator = search.best_estimator_.named_steps['estimator']
imputer = search.best_estimator_.named_steps['imputer']
estimator.feature_importances_
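To make the mean decrease impurity calculation a bit more concrete, here is a rough sketch (my own helper built on scikit-learn's tree_ internals, not how the library actually exposes it) of how the importances are tallied for a single tree: sum the weighted decrease in impurity at every split made on each feature, then normalize. Averaging these per-tree values across the forest gives the importances above.
# rough sketch of mean decrease impurity for one fitted tree
def mdi_sketch(dtree, feature_names):
    tree_ = dtree.tree_
    total_weight = tree_.weighted_n_node_samples[0]
    importances = np.zeros(len(feature_names))
    for node in range(tree_.node_count):
        left, right = tree_.children_left[node], tree_.children_right[node]
        if left == -1:
            # leaf node, no split to credit
            continue
        w_node = tree_.weighted_n_node_samples[node] / total_weight
        w_left = tree_.weighted_n_node_samples[left] / total_weight
        w_right = tree_.weighted_n_node_samples[right] / total_weight
        # impurity decrease produced by this split, credited to its feature
        decrease = (w_node * tree_.impurity[node]
                    - w_left * tree_.impurity[left]
                    - w_right * tree_.impurity[right])
        importances[tree_.feature[node]] += decrease
    return importances / importances.sum()
# should closely match the first tree's own feature_importances_
np.allclose(mdi_sketch(estimator.estimators_[0], features),
            estimator.estimators_[0].feature_importances_)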
Alternatively, we could use eli5's explain_weights_df function, which returns the importances and the feature names we pass it as a pandas DataFrame.
import eli5
# create our dataframe of feature importances
feat_imp_df = eli5.explain_weights_df(estimator, feature_names=features)
feat_imp_df
So it looks like a player's weight and their forty time are important to the model. One thing to note is that explain_weights_df also returns the standard deviations, but they may not be trustworthy since those values assume a normal distribution. Instead of relying on those standard deviations we can access each tree in our ensemble and plot the full distribution of feature importances.
# get the feature importances from each tree and then visualize the
# distributions as boxplots
all_feat_imp_df = pd.DataFrame(data=[tree.feature_importances_ for tree in
estimator],
columns=features)
(sns.boxplot(data=all_feat_imp_df)
.set(title='Feature Importance Distributions',
ylabel='Importance'));
Permutation Importance¶
Permutation importance, or mean decrease accuracy (MDA), is an alternative to mean decrease impurity that can be applied to any model. The basic idea of permutation importance is to permute the values of each feature and measure how much that permutation degrades the scoring metric (which in our case is Spearman's rank correlation). This gives us a sense of how our model would perform without that specific feature. All we need to do to calculate permutation importance is use PermutationImportance from eli5.
from eli5.sklearn import PermutationImportance
# we need to impute the data first before calculating permutation importance
train_X_imp = imputer.transform(X)
# set up the met-estimator to calculate permutation importance on our training
# data
perm_train = PermutationImportance(estimator, scoring=spearman_scorer,
n_iter=50, random_state=RANDOM_STATE)
# fit and see the permutation importances
perm_train.fit(train_X_imp, y)
eli5.explain_weights_df(perm_train, feature_names=features)
# plot the distributions
perm_train_feat_imp_df = pd.DataFrame(data=perm_train.results_,
columns=features)
(sns.boxplot(data=perm_train_feat_imp_df)
.set(title='Permutation Importance Distributions (training data)',
ylabel='Importance'));
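To demystify what PermutationImportance is doing under the hood, here is a rough sketch (my own helper, not eli5's implementation) of the calculation for a single feature: shuffle that column, re-score the model, and record how much the Spearman correlation drops.
def permutation_drop(col_idx, n_iter=10, seed=RANDOM_STATE):
    """Rough sketch of permutation importance for one feature column."""
    rng = np.random.RandomState(seed)
    baseline = spearman(y, estimator.predict(train_X_imp))
    drops = []
    for _ in range(n_iter):
        X_perm = train_X_imp.copy()
        # shuffle just this feature's values, leaving the other columns intact
        rng.shuffle(X_perm[:, col_idx])
        drops.append(baseline - spearman(y, estimator.predict(X_perm)))
    return np.mean(drops)
# e.g. a hand-rolled importance for Forty, which should be in the same
# ballpark as the eli5 value above
permutation_drop(features.index('Forty'))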
Based on the permutation importances it again looks like Forty and Wt are the two most important features to the model.
Feature Contributions¶
While feature importances can provide us insight into which variables a model finds valuable, they don't tell us how those features impact our model's predictions. One way to actually do that is by using the decision paths in our trees to see how much each feature changes the predictions when going from the parent to the child node. In the end we can break down these feature contributions in a linear manner such that our predictions can be interpreted like so:
$$prediction = bias + feature_{1}\,contribution + feature_{2}\,contribution + \dots + feature_{n}\,contribution$$
Ando Saabas has written a few blog posts that delve into this topic. For now I will try to explain how feature contributions are calculated by looking at an example that uses a shallow tree in our forest.
from IPython.display import Image
from sklearn.tree import export_graphviz
import graphviz
import pydotplus
from io import StringIO
# source for plotting decision tree
# https://medium.com/@rnbrown/creating-and-visualizing-decision-trees-with-python-f8e8fa394176
# Get all trees of depth 2 in the random forest
depths2 = [tree for tree in estimator.estimators_ if tree.tree_.max_depth==2]
# grab the first one
tree = depths2[0]
# plot the tree
dot_data = StringIO()
export_graphviz(tree, out_file=dot_data, feature_names=features,
filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
To keep things simple let's find the feature contributions for a player that follows the leftmost path in the tree above, say a player who weighs 260 lbs and runs 4.6 seconds in the forty yard dash.
The player first starts at the root (top) node, where the mean AV percentile of all the samples is 0.414 (this is the value of our bias term). At the first split, the tree considers whether the player weighs 270.5 lbs or less, so any change in the prediction caused by this split is attributed to the player's weight. Our player falls into the left child node because he weighs 260 lbs. Since the average percentile at this node is 0.317 versus 0.414 in the parent node, we can say that the player's weight caused a decrease of 0.097 percentage points in his predicted AV percentile. At this new node, the tree then considers whether the player runs a 4.755 forty or less. The player in our example runs a 4.6 forty, so after this split he ends up in the leftmost leaf node of the tree. At this leaf node the player's final predicted percentile is 0.481; since that is an increase of 0.164 percentage points from the previous node, we can say that the player's forty time contributed 0.164 percentage points to his predicted AV percentile. In the end, the feature contributions for this player's predicted AV percentile are as follows:
$$\underset{\text{AV \%ile}}{0.481} = \underset{\text{bias}}{0.414}-\underset{\text{Wt}}{0.097}+\underset{\text{Forty}}{0.164}$$
These are the contribution values we get when we use eli5's explain_prediction_df function.
# simple example of a player with a 4.6 Forty and a Wt of 260 lbs
example = np.array([4.6, 260, 0, 0, 0, 0, 0, 0])
eli5.explain_prediction_df(tree, example, feature_names=features)
And 0.481 should be what the tree predicts.
tree.predict(example.reshape(1,-1))
Which it does. To calculate the bias term and feature contributions for the entire forest of trees, all you do is average the bias terms and feature contributions of all the trees.
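If you want to see that decision path arithmetic in code, here is a rough sketch (my own helper, not eli5's or treeinterpreter's implementation) that walks a single tree's decision path and credits each split's change in the node mean to the feature used at that split. Running it on the shallow tree and the example player above should reproduce the 0.414 bias, the -0.097 Wt contribution and the +0.164 Forty contribution.
def decision_path_contributions(dtree, x, feature_names):
    """Rough sketch: decompose one DecisionTreeRegressor prediction into
    a bias term plus per-feature contributions along its decision path."""
    tree_ = dtree.tree_
    # node ids visited from the root down to the leaf for this one sample
    node_path = dtree.decision_path(x.reshape(1, -1)).indices
    bias = tree_.value[node_path[0]][0][0]
    contributions = {}
    for parent, child in zip(node_path[:-1], node_path[1:]):
        feat = feature_names[tree_.feature[parent]]
        change = tree_.value[child][0][0] - tree_.value[parent][0][0]
        contributions[feat] = contributions.get(feat, 0) + change
    return bias, contributions
decision_path_contributions(tree, example, features)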
NOTE: I'm aware of two libraries that allow you to easily calculate these kinds of feature contributions: eli5 and Ando Saabas' treeinterpreter. Besides the difference in their APIs, there are differences in which algorithms they can be used with. treeinterpreter currently works with the following scikit-learn estimators:
- DecisionTreeRegressor
- DecisionTreeClassifier
- ExtraTreeRegressor
- ExtraTreeClassifier
- RandomForestRegressor
- RandomForestClassifier
- ExtraTreesRegressor
- ExtraTreesClassifier
eli5 works with those estimators plus both of scikit-learn's gradient boosting estimators, as well as XGBoost and LightGBM estimators.
Now let's actually get the feature contributions for each sample in our training and testing sets. One thing to note is that explain_prediction_df only calculates the contributions one observation at a time, which can be time consuming. To speed things up I wrote a helper function that lets us use multiple processes (i.e. multiple CPU cores) to get the feature contributions for all our predictions.
from concurrent.futures import ProcessPoolExecutor
def multiproc_iter_func(max_workers, an_iter, func, item_kwarg, **kwargs):
    """
    A helper function that applies a function to each item in an iterable using
    multiple processes. 'item_kwarg' is the keyword argument for the item in the
    iterable that we pass to the function.
    """
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        future_results = [executor.submit(func, **{item_kwarg: item}, **kwargs)
                          for item in an_iter]
        results = [future.result() for future in future_results]
    return results
# Contributions for training set predictions
# construct a list of all contributions for the entire train set
train_expl_list = multiproc_iter_func(N_JOBS, train_X_imp,
eli5.explain_prediction_df, 'doc',
estimator=estimator,
feature_names=features)
# concatenate them into 1 large dataframe, with the proper player name as an
# index
train_expl_df = pd.concat(train_expl_list, keys=train_df.Player,
names=['Player'])
# take a look at a couple of players
train_expl_df.head(18)
# Contributions for test set predictions
# we need to impute the missing values in the test set
test_X_imp = imputer.transform(X_test)
# now repeat what we did with the training data on the test data
test_expl_list = multiproc_iter_func(N_JOBS, test_X_imp,
eli5.explain_prediction_df, 'doc',
estimator=estimator,
feature_names=features)
test_expl_df = pd.concat(test_expl_list, keys=test_df.Player,
names=['Player'])
test_expl_df.head(18)
# Double check that the sums of contributions equal the actual predictions
y_pred_sums = test_expl_df.groupby('Player').weight.sum()
np.allclose(y_pred, y_pred_sums)
# I"m creating one big dataframe that includes both train and test
# to plot them on same plot using seaborn's boxplot
train_expl_df.rename(columns={'weight': 'contribution'}, inplace=True)
test_expl_df.rename(columns={'weight': 'contribution'}, inplace=True)
train_expl_df['data'] = 'train'
test_expl_df['data'] = 'test'
train_test_expl_df = pd.concat([train_expl_df, test_expl_df])
sns.boxplot(x='feature', y='contribution', hue='data', order=features,
data=train_test_expl_df.loc[train_test_expl_df.feature!='' ],
palette={'train': 'salmon',
'test':'deepskyblue'})
plt.legend(loc=9)
plt.title('Distributions of Feature Contributions');
Swarmplots¶
We can also use swarmplots, which allow for a more granular view of the distribution since they plot each individual observation. If we color each point by the associated feature's (scaled) value, we can see how a feature's contribution changes along with its value.
seaborn doesn't support a colorbar by default, so it's something we have to add on our own. I hacked together a function (based on this stackoverflow answer) that will let us add a vertical colorbar to the right of our swarmplots.
from matplotlib.colorbar import ColorbarBase
from mpl_toolkits.axes_grid1 import make_axes_locatable
# a hacky function that plots a swarmplot along with a colorbar
# based off the code found here:
# https://stackoverflow.com/questions/40814612/map-data-points-to-colormap-with-seaborn-swarmplot
def swarmplot_with_cbar(cmap, cbar_label, *args, **kwargs):
    fig = plt.gcf()
    ax = sns.swarmplot(*args, **kwargs)
    # remove the legend, because we want to set a colorbar instead
    ax.legend().remove()
    ## create colorbar ##
    divider = make_axes_locatable(ax)
    ax_cb = divider.new_horizontal(size="3%", pad=0.05)
    fig.add_axes(ax_cb)
    cb = ColorbarBase(ax_cb, cmap=cmap, orientation='vertical')
    cb.set_label(cbar_label, labelpad=10)
    return fig
# min-max scaling of the feature values allows us to use a colorbar
# to indicate high or low feature values
train_scaled_feat_vals = (train_expl_df.groupby('feature')
.value
.transform(lambda x: x/x.max()))
train_expl_df['scaled_feat_vals'] = train_scaled_feat_vals
cmap = plt.get_cmap('viridis')
cbar_label = 'Feature Value %ile'
plt.title('Distribution of Feature Contributions (training data)')
swarmplot_with_cbar(cmap, cbar_label, x='feature', y='contribution',
hue='scaled_feat_vals', palette='viridis', order=features,
data=train_expl_df.loc[train_expl_df.feature!='' ]);
test_scaled_feat_vals = (test_expl_df.groupby('feature')
.value
.transform(lambda x: x/x.max()))
test_expl_df['scaled_feat_vals'] = test_scaled_feat_vals
plt.title('Distribution of Feature Contributions (test data)')
swarmplot_with_cbar(cmap, cbar_label, x='feature', y='contribution',
hue='scaled_feat_vals', palette='viridis', order=features,
data=test_expl_df.loc[test_expl_df.feature!='' ]);
Based on both plots, we can see that faster forty times and higher weights result in more positive contributions. The other features tend to have most of their contributions hover around 0. It's also interesting to note the gaps in the middle of the Wt distributions on both plots.
Plotting Feature Contributions against Feature Values¶
Let's plot the feature contributions against the feature values to get a better sense of how they relate to one another. We can use seaborn's lmplot to easily create a grid of these kinds of plots for both our training and testing data.
fg = sns.lmplot(x='value', y='contribution', col='feature',
data=train_expl_df.loc[train_expl_df.feature!='' ],
col_order=features, sharex=False, col_wrap=3, fit_reg=False,
size=4, scatter_kws={'color':'salmon', 'alpha': 0.5, 's':30})
fg.fig.suptitle('Feature Contributions vs Feature Values (training data)')
fg.fig.subplots_adjust(top=0.90);
fg = sns.lmplot(x='value', y='contribution', col='feature',
data=test_expl_df.loc[test_expl_df.feature!='' ],
col_order=features, sharex=False, col_wrap=3, fit_reg=False,
size=4, scatter_kws={'color':'salmon', 'alpha': 0.5, 's':30})
fg.fig.suptitle('Feature Contributions vs Feature Values (testing data)')
fg.fig.subplots_adjust(top=0.90);
In both plots for Wt it's interesting to see the rapid increase in contribution at around 270 lbs. The model essentially believes that weighing more than 270 is automatically a positive factor for a player, while weighing less than that is a negative one.
One thing to note about these plots is that when we see different contributions (e.g. -0.05, -0.10, -0.15) for the same feature value (e.g. a forty time of 5 seconds) there is probably another feature (or set of features) that is causing these differences. To view such feature interactions we can set the color of the dots to reflect the value of another feature. Let's take a look at how a player's weight interacts with the contribution of their forty time (at least in the training set).
# before we actually plot anything we need to do a bit of data manipulation
# let's pivot the data and create a new dataframe where the columns are
# the feature contributions and each row is a player, with the player
# name as the index value
# here are different ways to pivot column values to columns
# https://stackoverflow.com/questions/26255671/pandas-column-values-to-columns
# based on running %%timeit, the groupby method was fastest
train_contrib_df = (train_expl_df.groupby(['Player','feature'])
.contribution
.aggregate('first')
.unstack())
# add in the feature values
train_feat_contrib_df = train_contrib_df.merge(train_df[['Player'] + features],
how='left', left_index=True,
right_on='Player',
suffixes=('_contrib', '_value'))
# now we can plot
plt.scatter(x='Forty_value', y='Forty_contrib', c='Wt_value', cmap=cmap,
data=train_feat_contrib_df)
plt.xlabel('Forty')
plt.ylabel('contribution')
plt.colorbar(label='Wt');
We can see that there is some interaction between weight and forty time. Given a specific forty time, players with higher weights tend to have a more positive (or less negative) contribution.
Heatmaps¶
With a little data wrangling and seaborn's heatmap function we can take a look at the full set of contributions for each player in our test set.
def double_heatmap(data1, data2, cbar_label1, cbar_label2,
                   title='', subplot_top=0.86, cmap1='viridis', cmap2='magma',
                   center1=0.5, center2=0, grid_height_ratios=[1,4],
                   figsize=(14,10)):
    # do the actual plotting
    # here we plot 2 separate heatmaps, one for the predictions and actual
    # percentiles, the other for the contributions
    # the reason I chose to do this is because of the difference in magnitudes
    # between the percentiles and the contributions
    fig, (ax, ax2) = plt.subplots(nrows=2, figsize=figsize,
                                  gridspec_kw={'height_ratios': grid_height_ratios})
    fig.suptitle(title)
    fig.subplots_adjust(hspace=0.02, top=subplot_top)
    # heatmap for actual and predicted percentiles
    sns.heatmap(data1, cmap=cmap1, ax=ax, xticklabels=False, center=center1,
                cbar_kws={'location': 'top',
                          'use_gridspec': False,
                          'pad': 0.1,
                          'label': cbar_label1})
    ax.set_xlabel('')
    # heatmap of the feature contributions
    sns.heatmap(data2, ax=ax2, xticklabels=False, center=center2, cmap=cmap2,
                cbar_kws={'location': 'bottom',
                          'use_gridspec': False,
                          'pad': 0.07,
                          'shrink': 0.41,
                          'label': cbar_label2})
    ax2.set_ylabel('')
    return fig
# get the prediction and actual target values to plot
y_test_and_pred_df = pd.DataFrame(np.column_stack((y_test, y_pred)),
index=test_df.Player,
columns=['true_AV_pctile', 'pred_AV_pctile'])
# let's pivot the data such that the feature contributions are the columns
test_heatmap_df = (test_expl_df.groupby(['Player','feature'])
.contribution
.aggregate('first')
.unstack())
# there may be some NaNs if a feature did not contribute to a prediction,
# so fill them in with 0s
test_heatmap_df = test_heatmap_df.fillna(0)
# merge our predictions with the the contributions
test_heatmap_df = test_heatmap_df.merge(y_test_and_pred_df, how='left',
right_index=True, left_index=True)
# sort by predictions
test_heatmap_df.sort_values('pred_AV_pctile', ascending=True, inplace=True)
title = 'Feature contributions to predicted AV %ile \nfor each player in the testing data'
fig = double_heatmap(test_heatmap_df[['true_AV_pctile', 'pred_AV_pctile']].T,
test_heatmap_df[features].T, '%ile', 'contribution',
title=title)
The plot above consists of two heatmaps. Each column in both heatmaps represents a player. The rows in the top heatmap represent the true and predicted AV percentiles, while the rows in the bottom heatmap represent the feature contributions to each player's prediction.
I like the above visualization because it makes it easy to view a large set of predictions and contributions all at once. I definitely prefer it over the alternative of viewing each set of contributions through a printout of the DataFrame.
Joint Feature Contributions¶
A benefit of using a tree ensemble like our random forest model is that it captures interactions among our features without us explicitly defining them. However, in our attempt to interpret the model we have only looked at the importances and contributions of individual features. Since we've yet to measure the impact of the feature interactions that the model has found, we have an incomplete picture of what our model is actually doing. To gain some insight into these feature interactions we can use the treeinterpreter package. It uses the same method as before to calculate contributions, but instead of crediting individual features along the decision paths, treeinterpreter allows us to credit the feature interactions.
NOTE: If you use XGBoost you can use the xgbfir package to inspect feature interactions.
Let's use treeinterpreter on our simple decision tree from before, in order to get an idea of how joint feature contributions (i.e. the contributions of feature interactions) are calculated.
# a reminder of what the tree looks like
Image(graph.create_png())
To get the joint feature contributions, we pass in our estimator and the data to the predict function and set joint_contribution to True. That should return the prediction, the bias term and the joint feature contributions for our player.
import treeinterpreter.treeinterpreter as ti
# get the contributions for our simple player example
# who has a 4.6 forty and weighs 260 lbs
# joint_contribution=True gets the joint feature contributions
# when set to False it just returns the individual feature contributions
example_pred, example_bias, example_contrib = ti.predict(tree,
example.reshape(1, -1),
joint_contribution=True)
# same prediction as before
example_pred
# same bias value as before
example_bias
example_contrib
The joint contributions are returned as a list of dictionaries (in our example, one dictionary for our one player), where the keys are tuples of column indexes representing the features (0 represents Forty and 1 represents Wt) and the values are the joint feature contributions. The joint feature contributions for our simple example are as follows:
$$\underset{\text{AV \%ile}}{0.481} = \underset{\text{bias}}{0.414}-\underset{\text{Wt}}{0.097}+\underset{\text{Forty \& Wt}}{0.164}$$
The contributions are the same values as before; the difference is that instead of crediting the contribution of 0.164 percentage points to just the player's forty time, we also credit the previous feature, Wt, along the decision path. Remember, we are now crediting feature interactions, not just individual features. We only credit a single feature when it's either at the root node (like Wt is) or when it's the only feature used along a decision path.
Let's get the joint feature contributions for the test set predictions.
joint_pred, joint_bias, joint_contrib = ti.predict(estimator,
test_X_imp,
joint_contribution=True)
# double check predictions are correct
np.allclose(y_pred, joint_pred)
# the bias is still the same
joint_bias[:3]
# 96 observations in test set
len(joint_contrib)
# tuples representing the column indexes of our features
list(joint_contrib[0].keys())[:3]
# an example of a joint feature contribution
joint_contrib[0][(0, 1)]
To get the joint feature contributions into a more usable format, let's match the tuples of indexes to the proper feature names. Then we can construct a DataFrame of each player's joint feature contributions, ordered by the absolute value of the contributions.
def create_ordered_joint_contrib_df(contrib):
    """
    Creates a dataframe from the joint contribution info, where the
    feature combinations are ordered (in descending fashion) by the absolute
    value of the joint contribution.
    """
    df = pd.DataFrame(contrib, columns=['feat_interaction', 'contribution'])
    # get the reordered index
    new_idx = (df.contribution.abs()
                 .sort_values(inplace=False, ascending=False)
                 .index)
    df = df.reindex(new_idx).reset_index(drop=True)
    return df
# add the names of the feats to the joint contributions
joint_contrib_w_feat_names = []
# for each observation in the joint contributions
for obs in joint_contrib:
    # create a list for this observation's contributions
    obs_contrib = []
    # for each tuple of column indexes
    for k in obs.keys():
        # get the associated feature names
        feature_combo = [features[i] for i in k]
        # get the contribution value
        contrib = obs[k]
        # store that information in the observation's individual list
        obs_contrib.append([feature_combo, contrib])
    # append that individual list to the larger list containing each
    # observation's joint feature contributions
    joint_contrib_w_feat_names.append(obs_contrib)
# create an ordered dataframe for each player
joint_contrib_dfs = [create_ordered_joint_contrib_df(contrib)
for contrib in joint_contrib_w_feat_names]
# now combine them all
joint_contrib_df = pd.concat(joint_contrib_dfs, keys=test_df.Player, names=['Player'])
# edit feat_interaction column so the values are strings and not lists
joint_contrib_df['feat_interaction'] = joint_contrib_df.feat_interaction.apply(' | '.join)
joint_contrib_df.head()
Great, now we have the joint feature contributions for each player in our test set in a nice DataFrame. Let's take a look at how important each feature and feature interaction is to our predictions. To do that we will measure (as a percentage) how much of the total joint contributions an individual feature or feature interaction is responsible for. In other words we are going to measure the relative importance of each feature and feature interaction.
# first get the sum of the absolute values for each joint feature contribution
abs_imp_joint_contrib = (joint_contrib_df.groupby('feat_interaction')
.contribution
.apply(lambda x: x.abs().sum())
.sort_values(ascending=False))
# then calculate the % of total contribution by dividing by the sum of all absolute vals
rel_imp_join_contrib = abs_imp_joint_contrib / abs_imp_joint_contrib.sum()
rel_imp_join_contrib.head(15)[::-1].plot(kind='barh', color='salmon',
title='Joint Feature Importances');
plt.ylabel('Features')
plt.xlabel('% of total joint contributions');
It looks like the interaction between Forty and Wt was the most important feature interaction in our predictions, accounting for over 20% of the joint feature contributions. Overall, Forty, Wt and their interaction account for more than 50% of the total feature contributions.
We can also take a look at the distributions of contributions for each feature and feature interaction.
top_feat_interactions = rel_imp_join_contrib.head(15).index
top_contrib_mask = joint_contrib_df.feat_interaction.isin(top_feat_interactions)
sns.boxplot(y='feat_interaction', x='contribution',
data=joint_contrib_df.loc[top_contrib_mask],
orient='h', order=top_feat_interactions);
And finally let's make another double heatmap plot to observe some of the joint contributions for each prediction in our test set.
joint_contrib_heatmap_df = (joint_contrib_df[top_contrib_mask]
.groupby(['Player','feat_interaction'])
.contribution
.aggregate('first')
.unstack())
joint_contrib_heatmap_df = joint_contrib_heatmap_df.fillna(0)
joint_contrib_heatmap_df = joint_contrib_heatmap_df.merge(y_test_and_pred_df,
how='left',
right_index=True,
left_index=True)
# sort by predictions
joint_contrib_heatmap_df.sort_values('pred_AV_pctile', ascending=True,
inplace=True)
title = 'Top 15 Joint Feature Contributions to predicted AV %ile\n(testing data)'
fig = double_heatmap(joint_contrib_heatmap_df[['true_AV_pctile', 'pred_AV_pctile']].T,
joint_contrib_heatmap_df[top_feat_interactions].T,
cbar_label1='%ile', cbar_label2='contribution',
title=title, grid_height_ratios=[1, 7], figsize=(14, 12),
subplot_top=0.89)
PDP and ICE plots¶
Individual Conditional Expectation (ICE) plots allow us to visualize how changes in a given feature impact the predictions for a set of observations.
Let's use pycebox to create an ICE plot to view how changes in the forty yard dash impact our predictions on the training data.
from pycebox.ice import ice, ice_plot
# pycebox likes the data to be in a DataFrame so let's create one with our imputed data
# we first need to impute the missing data
train_X_imp_df = pd.DataFrame(train_X_imp, columns=features)
We get the ICE values for our feature of interest (Forty) using the ice function.
forty_ice_df = ice(data=train_X_imp_df, column='Forty',
predict=search.predict)
And then we create the ICE plot with the ice_plot function.
ice_plot(forty_ice_df, c='dimgray', linewidth=0.3)
plt.ylabel('Pred. AV %ile')
plt.xlabel('Forty');
So how was the above plot created and what is it telling us? To create an ICE plot, we first pick a feature of interest. Then for each observation we make predictions across a range of values for that feature, while holding all other features constant. Finally we just visualize those predictions as curves on a plot. By plotting these curves we are able to observe the relationship between the feature of interest and the predicted target variable.
In our ICE plot above we can see how each player's predicted AV percentile tends to decrease in a non-linear manner between forty times of 4.6 seconds and 5.0 seconds. We can also see that each player's prediction is impacted in a different manner. For example, it looks like the player predictions at the top of the plot do not decrease as much as those at the bottom. The differences we see among the curves indicate that there are interactions between the forty times and the other features.
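As a rough sketch of that recipe (not pycebox's actual implementation, and using an evenly spaced grid rather than the observed forty times), we could build the curves ourselves like this:
# for each value in a grid of forty times, set every player's Forty to that
# value, predict, and keep one predicted AV percentile per player
forty_grid = np.linspace(train_X_imp_df.Forty.min(),
                         train_X_imp_df.Forty.max(), num=50)
manual_curves = {}
for forty in forty_grid:
    X_tmp = train_X_imp_df.copy()
    X_tmp['Forty'] = forty
    manual_curves[forty] = estimator.predict(X_tmp)
# rows are the grid of forty times, columns are the individual players
manual_ice_df = pd.DataFrame(manual_curves).T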
To inspect feature interactions we can color the ICE curves by another feature. We can do that by passing in a feature to ice_plot's color_by parameter. Let's color each line by the player's weight.
# new colormap for ICE plot
cmap2 = plt.get_cmap('OrRd')
# set color_by to Wt, in order to color each curve by that player's weight
ice_plot(forty_ice_df, linewidth=0.5, color_by='Wt', cmap=cmap2)
# ice_plot doesn't return a colorbar so we have to add one
# hack to add in colorbar taken from here:
# https://stackoverflow.com/questions/8342549/matplotlib-add-colorbar-to-a-sequence-of-line-plots/11558629#11558629
wt_vals = forty_ice_df.columns.get_level_values('Wt').values
sm = plt.cm.ScalarMappable(cmap=cmap2,
norm=plt.Normalize(vmin=wt_vals.min(),
vmax=wt_vals.max()))
# need to create fake array for the scalar mappable or else we get an error
sm._A = []
plt.colorbar(sm, label='Wt')
plt.ylabel('Pred. AV %ile')
plt.xlabel('Forty');
From the plot above we can see that heavier players are not impacted in the same manner as lighter ones.
We can also add the PDP by setting plot_pdp to True in the ice_plot function. To adjust the styling of the PDP line we pass a dictionary of settings to pdp_kwargs.
ice_plot(forty_ice_df, linewidth=.5, color_by='Wt', cmap=cmap2, plot_pdp=True,
pdp_kwargs={'c': 'k', 'linewidth': 5})
plt.colorbar(sm, label='Wt')
plt.ylabel('Pred. AV %ile')
plt.xlabel('Forty');
The PDP is the average of all ICE curves on the plot, so the PDP above represents the average change in the predicted AV percentile over the range of forty times.
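In other words (assuming pycebox's layout of one row per grid value of Forty and one column per training row, which the colorbar hack above relies on), we could recover the PDP ourselves by averaging the ICE curves:
# the PDP is just the column-wise average of the ICE curves
manual_pdp = forty_ice_df.mean(axis=1)
manual_pdp.head()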
Centered ICE Plots¶
One drawback with our previous ICE plots is that the stacked nature of the lines can make it difficult to observe the differences between the ICE curves. To make it easier to spot those differences we can center or "pinch" the curves at a specific feature value. Typically the minimum is a good centering point. With these centered ICE plots we observe the relative change of the predictions with respect to the predictions at the centered value.
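Numerically, centering just subtracts each curve's value at the chosen pinch point so that every curve starts at 0. A rough sketch with our Forty ICE data (again assuming the DataFrame is indexed by the grid of forty times):
# subtract each curve's prediction at the smallest forty time in the grid
centered_forty_ice_df = forty_ice_df - forty_ice_df.loc[forty_ice_df.index.min()]
centered_forty_ice_df.head()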
To center our ICE curves at the minimum Forty value we just set centered to True.
ice_plot(forty_ice_df, linewidth=.5, color_by='Wt', cmap=cmap2, plot_pdp=True,
pdp_kwargs={'c': 'k', 'linewidth': 5}, centered=True)
plt.colorbar(sm, label='Wt')
plt.ylabel('Pred. AV %ile (centered)')
plt.xlabel('Forty');
Note that each line above starts at 0. The y-axis now represents the difference in each player's prediction relative to their prediction at the minimum forty yard dash time of 4.47 seconds.
Let's take a look at the ICE plots for all our features. To make that easier to do, I created a helper function that can plot out all the ICE plots for each feature. It also adds a rug plot at the bottom of each ICE plot to display information about the distribution of the data.
def plot_ice_grid(dict_of_ice_dfs, data_df, features, ax_ylabel='', nrows=3,
                  ncols=3, figsize=(12, 12), sharex=False, sharey=True,
                  subplots_kws={}, rug_kws={'color':'k'}, **ice_plot_kws):
    """A function that plots ICE plots for different features in a grid."""
    fig, axes = plt.subplots(nrows=nrows,
                             ncols=ncols,
                             figsize=figsize,
                             sharex=sharex,
                             sharey=sharey,
                             **subplots_kws)
    # for each feature plot the ice curves and add a rug at the bottom of the
    # subplot
    for f, ax in zip(features, axes.flatten()):
        ice_plot(dict_of_ice_dfs[f], ax=ax, **ice_plot_kws)
        # add the rug
        sns.distplot(data_df[f], ax=ax, hist=False, kde=False,
                     rug=True, rug_kws=rug_kws)
        ax.set_title('feature = ' + f)
        ax.set_ylabel(ax_ylabel)
        sns.despine()
    # get rid of blank plots
    for i in range(len(features), nrows*ncols):
        axes.flatten()[i].axis('off')
    return fig
# create dict of ICE data for grid of ICE plots
train_ice_dfs = {feat: ice(data=train_X_imp_df, column=feat, predict=estimator.predict)
for feat in features}
fig = plot_ice_grid(train_ice_dfs, train_X_imp_df, features,
ax_ylabel='Pred. AV %ile', alpha=0.3, plot_pdp=True,
pdp_kwargs={'c': 'red', 'linewidth': 3},
linewidth=0.5, c='dimgray')
fig.tight_layout()
fig.suptitle('ICE plots (training data)')
fig.subplots_adjust(top=0.89);
fig = plot_ice_grid(train_ice_dfs, train_X_imp_df, features,
ax_ylabel='Pred AV %ile (centered)',
alpha=.2, plot_points=False, plot_pdp=True,
pdp_kwargs={"c": "red", "linewidth": 3},
linewidth=0.5, c='dimgray', centered=True,
sharey=False, nrows=4, ncols=2, figsize=(11,16))
fig.tight_layout()
fig.suptitle('Centered ICE plots (training data)')
fig.subplots_adjust(top=0.9)
Two-dimensional PDPs¶
We can also plot PDPs for two features at once. This allows us to better understand interactions between the features and how they impact the predictions. We'll use the pdpbox library to create 2-D PDPs.
NOTE: pdpbox can also create the same kind of ICE plots that we created with pycebox above. However, pycebox doesn't have support for 2-D PDPs.
import itertools
from pdpbox import pdp
I'm just going to jump into creating a grid of 2-D PDPs and explain what's happening in the comments of the code below.
from matplotlib.text import Text
def plot_2d_pdp_grid(pdp_inters, feature_pairs,
                     ncols=3, nrows=4, figsize=(13, 16),
                     xaxis_font_size=12, yaxis_font_size=12,
                     contour_line_fontsize=12,
                     tick_labelsize=10, x_quantile=None,
                     plot_params=None, subplots_kws={}):
    """Plots a grid of 2D PDP plots."""
    # create our subplots to plot our PDPs on
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols,
                             figsize=figsize, **subplots_kws)
    # for each feature pair, plot the 2-D pdp
    for pdp_inter, feat_pair, ax in zip(pdp_inters, feature_pairs, axes.flatten()):
        # use pdpbox's _pdp_contour_plot function to actually plot the 2D pdp
        pdp._pdp_contour_plot(pdp_inter, feat_pair,
                              x_quantile=x_quantile, ax=ax,
                              plot_params=plot_params,
                              fig=None)
        # adjust some font sizes
        ax.tick_params(labelsize=tick_labelsize)
        ax.xaxis.get_label().set_fontsize(xaxis_font_size)
        ax.yaxis.get_label().set_fontsize(yaxis_font_size)
        # set the contour line fontsize
        for child in ax.get_children():
            if isinstance(child, Text):
                child.set(fontsize=contour_line_fontsize)
    # get rid of empty subplots
    for i in range(len(pdp_inters), nrows*ncols):
        axes.flatten()[i].axis('off')
    return fig
# get each possible feature pair combination
feature_pairs = [list(feat_pair) for feat_pair in itertools.combinations(features, 2)]
# we will only plot the feature interactions that involve either Forty or Wt
# just to avoid making soooo many plots
forty_wt_feat_pairs = [fp for fp in feature_pairs if 'Forty' in fp or 'Wt' in fp]
# now calculate the data for the pdp interactions
# we can do that with pdpbox's pdp_interact function
# in the current development version on github, parallelization is supported
# but it didn't work for me so I resorted to using that multiprocess helper
# function from before
train_feat_inters = multiproc_iter_func(N_JOBS, forty_wt_feat_pairs,
pdp.pdp_interact, 'features',
model=estimator, train_X=train_X_imp_df)
# and now plot a grid of PDP interaction plots
# NOTE that the contour colors do not represent the same values
# across the different subplots
fig = plot_2d_pdp_grid(train_feat_inters, forty_wt_feat_pairs)
fig.tight_layout()
fig.suptitle('PDP Interaction Plots (training data)', fontsize=20)
fig.subplots_adjust(top=0.95);
LIME¶
Local interpretable model-agnostic explanations (LIME) allow us to explain individual predictions of "black box" models by creating local, interpretable, surrogate models. We fit a local model using the following recipe, which I copied from Christoph Molnar's great book, Interpretable Machine Learning (a rough code sketch of the recipe follows the list):
- Choose your instance of interest for which you want to have an explanation of its black box prediction.
- Perturb your dataset and get the black box predictions for these new points.
- Weight the new samples by their proximity to the instance of interest.
- Fit a weighted, interpretable model on the dataset with the variations.
- Explain prediction by interpreting the local model.
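Here's a rough, stripped-down sketch of that recipe (my own toy code under simplifying assumptions, not the lime package's actual implementation), using Gaussian perturbations, an RBF proximity kernel, and a weighted ridge regression as the interpretable model:
from sklearn.linear_model import Ridge
def lime_like_explanation(instance, X_train, predict_fn, n_samples=5000,
                          kernel_width=None, random_state=RANDOM_STATE):
    """Toy sketch of the LIME recipe above (not the lime package itself)."""
    rng = np.random.RandomState(random_state)
    mean, std = X_train.mean(axis=0), X_train.std(axis=0)
    if kernel_width is None:
        # a default heuristic similar in spirit to lime's
        kernel_width = np.sqrt(X_train.shape[1]) * 0.75
    # 1 & 2. perturb around the instance of interest and get the black box
    # model's predictions for the new points
    perturbed = instance + rng.normal(size=(n_samples, X_train.shape[1])) * std
    black_box_preds = predict_fn(perturbed)
    # 3. weight the new samples by their proximity to the instance
    dists = np.sqrt((((perturbed - instance) / std) ** 2).sum(axis=1))
    proximity = np.exp(-(dists ** 2) / kernel_width ** 2)
    # 4. fit a weighted, interpretable (linear) model on the perturbed data
    local_model = Ridge(alpha=1.0)
    local_model.fit((perturbed - mean) / std, black_box_preds,
                    sample_weight=proximity)
    # 5. the coefficients are the explanation of the local behavior
    return dict(zip(features, local_model.coef_))
# e.g. a toy explanation for the third player in our (imputed) test set
lime_like_explanation(test_X_imp[2], train_X_imp, estimator.predict)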
There are different packages that implement LIME (including eli5 and another package I just discovered called Skater). Here we will use the original lime package created by the authors of the LIME paper.
import lime
from lime.lime_tabular import LimeTabularExplainer
Let's create our LIME explainer and explain an instance from our test set.
# create the explainer by passing our training data,
# setting the correct modeling mode, pass in feature names and
# make sure we don't discretize the continuous features
explainer = LimeTabularExplainer(train_X_imp_df, mode='regression',
feature_names=features,
random_state=RANDOM_STATE,
discretize_continuous=False)
test_X_imp_df = pd.DataFrame(test_X_imp, columns=features)
# the number of features to include in our predictions
num_features = len(features)
# the index of the instance we want to explain
exp_idx = 2
exp = explainer.explain_instance(test_X_imp_df.iloc[exp_idx,:].values,
estimator.predict, num_features=num_features)
Cool, now that we have our explanation, let's inspect it.
# the prediction made by the local surrogate model
exp.local_pred[0]
# the bias term for the local explanation
exp.intercept[0]
# a plot of the weights for each feature
exp.as_pyplot_figure();
# # and a prettier output that we can view in the notebook
# # it looks like it messes with the blog post so I've commented it out
# exp.show_in_notebook()
Great, we have an explanation for that one instance, but what about the rest of the test set? We can use the apply method to get explanations for the whole test set.
NOTE: We can't parallelize explain_instance with our multiprocessing helper function since explain_instance is a bound method.
lime_expl = test_X_imp_df.apply(explainer.explain_instance,
predict_fn=estimator.predict,
num_features=num_features,
axis=1)
Before moving on we should double check that the local predictions from our surrogate models match our actual predictions. We can judge the local predictions by looking at either the root-mean-squared error or the R² between them and our model's predictions.
from sklearn.metrics import mean_squared_error, r2_score
# get all the lime predictions
lime_pred = lime_expl.apply(lambda x: x.local_pred[0])
# RMSE of lime pred
mean_squared_error(y_pred, lime_pred)**0.5
# r^2 of lime predictions
r2_score(y_pred, lime_pred)
If you aren't satisfied with the fit of the local surrogate models you can try to improve them by playing around with the kernel_width parameter in LimeTabularExplainer. We will try and improve the surrogate models by decreasing the kernel width to make the fits more local.
# new explainer with smaller kernel_width
better_explainer = LimeTabularExplainer(train_X_imp_df, mode='regression',
feature_names=features,
random_state=RANDOM_STATE,
discretize_continuous=False,
kernel_width=1)
better_lime_expl = test_X_imp_df.apply(better_explainer.explain_instance,
predict_fn=estimator.predict,
num_features=num_features,
axis=1)
# get all the lime predictions
better_lime_pred = better_lime_expl.apply(lambda x: x.local_pred[0])
# RMSE of lime pred
mean_squared_error(y_pred, better_lime_pred)**0.5
# r^2 of lime predictions
r2_score(y_pred, better_lime_pred)
Our new local predictions better match our actual predictions. To view all of our explanations at once we can create a heatmap in the same manner we did when looking at the feature contributions. To do that we need to create a DataFrame with each instance's feature weights and bias term from the LIME explanation.
# construct a DataFrame with all the feature weights and bias terms from LIME
# create an individual dataframe for each explanation
lime_dfs = [pd.DataFrame(dict(expl.as_list() + [('bias', expl.intercept[0])]), index=[0])
for expl in better_lime_expl]
# then concatenate them into one big DataFrame
lime_expl_df = pd.concat(lime_dfs, ignore_index=True)
lime_expl_df.head()
Now that we have the weights for each feature we can measure their prediction contributions by multiplying the weights by the actual feature values. But before we do that we need to scale the data in our test set since LIME scales the data inside the explainer when the data is not discretized.
# scale the data
scaled_X = (test_X_imp_df - explainer.scaler.mean_) / explainer.scaler.scale_
# calc the lime feature contributions
lime_feat_contrib = lime_expl_df[features] * scaled_X
# add on bias term, actual av %ile and predicted %ile
other_lime_cols = ['bias', 'true_AV_pctile', 'pred_AV_pctile']
lime_feat_contrib[other_lime_cols] = pd.DataFrame(np.column_stack((lime_expl_df.bias,
y_test_and_pred_df)))
lime_feat_contrib.sort_values('pred_AV_pctile', inplace=True)
lime_feat_contrib.head()
Now let's plot each set of explanations using our double_heatmap function. Unlike the previous heatmaps, we will include the bias terms since the surrogate models that LIME creates can have different bias terms for each player.
title = 'LIME Feature Contributions for each prediction in the testing data'
fig = double_heatmap(lime_feat_contrib[['true_AV_pctile', 'pred_AV_pctile']].T,
lime_feat_contrib.loc[:, :'bias'].T, title=title,
cbar_label1='%ile', cbar_label2='contribution',
subplot_top=0.9)
# set the x-axis label for the bottom heatmap
# fig has 4 axes object, the first 2 are the heatmaps, the other 2 are the colorbars
fig.axes[1].set_xlabel('Player');
SHAP¶
SHAP (SHapley Additive exPlanations) is a recent method for model interpretation that leverages game theory to help measure the impact of the features on the predictions. What's the benefit of using the SHAP method for individual feature contributions over the decision path method from before? Well, with the decision path method we have to traverse down the decision tree, crediting each feature for the difference in the predictions along the way. This can result in individual contributions that favor features found in splits lower in the tree. The SHAP method doesn't have that problem, as it doesn't rely on the order of the features specified by the tree; instead it calculates the contributions by essentially averaging the differences in predictions over every possible feature ordering. For more on this topic I suggest reading the blog post by Scott Lundberg (one of the authors of the SHAP paper). And for a good explanation of how SHAP values are calculated I suggest reading the chapter on SHAP from Christoph Molnar's book.
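For reference, the Shapley value the method is built on credits a feature i with its marginal contribution to the prediction, averaged over every subset S of the remaining features:
$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,\left(|F| - |S| - 1\right)!}{|F|!} \left[ f\left(S \cup \{i\}\right) - f\left(S\right) \right]$$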
Now let's actually use the SHAP method to explain our predictions.
import shap
# create our SHAP explainer
shap_explainer = shap.TreeExplainer(estimator)
# calculate the shapley values for our test set
test_shap_vals = shap_explainer.shap_values(test_X_imp)
The shap package provides us with convenience functions to help us plot the SHAP values for our predictions. We can use force_plot to inspect individual predictions.
# load JS in order to use some of the plotting functions from the shap
# package in the notebook
shap.initjs()
# plot the explanation for a single prediction
shap.force_plot(test_shap_vals[0, :], test_X_imp_df.iloc[0, :])
In the above explanation we can see that there is a base value (i.e. a bias term) of 0.4219, and that each feature pushes (i.e. adds to) that value in order to reach a final prediction of 0.46.
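One nice property of SHAP values is that they are additive: the base value plus each feature's SHAP value adds up to the model's prediction. In the (older) version of shap assumed here, shap_values() appends the base value as a final column (the heatmap code further down relies on this), so a rough sanity check is that summing across all columns recovers our predictions:
# sum of base value + per-feature SHAP values should equal the predictions
np.allclose(test_shap_vals.sum(axis=1), y_pred)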
We can also use the force_plot function to look at the explanations for our whole dataset. When using the function in a notebook, it produces an interactive plot that allows you to inspect the SHAP values for each observation by hovering the mouse over the plot. By default the observations are clustered together by how similar they are. For example, the first 17 observations in the plot below are players whose weights have a very large positive impact on their predictions.
shap.force_plot(test_shap_vals, test_X_imp_df)
To plot the distribution of each feature's SHAP values we can use the summary_plot function.
shap.summary_plot(test_shap_vals, test_X_imp_df, auto_size_plot=False)
With the dependence_plot function we can see how a feature's SHAP values change over the range of feature values. The function automatically colors each point on the plot by a 2nd feature, allowing us to better understand the interaction effects.
for feat in features:
shap.dependence_plot(feat, test_shap_vals, test_X_imp_df,
dot_size=100)
And of course a heatmap of the SHAP values.
test_shap_df = pd.DataFrame(np.column_stack((test_shap_vals, y_test_and_pred_df)),
columns= features + ['bias', 'true_AV_pctile',
'pred_AV_pctile'])
test_shap_df.sort_values('pred_AV_pctile', inplace=True)
title = 'SHAP Values for each prediction in the testing data'
fig = double_heatmap(test_shap_df[['true_AV_pctile', 'pred_AV_pctile']].T,
test_shap_df[features].T, '%ile', 'SHAP Value',
title=title, subplot_top=0.89)
fig.axes[1].set_xlabel('Player');
Hopefully you found this blog post helpful. If you see any mistakes, have any questions or suggestions [or if you're hiring :)] you can email me at savvas.tjortjoglou@gmail.com, hit me up on Twitter @savvastj, or just leave a comment below.
If you like this post and want to support my blog you can check out my patreon page here.
Resources¶
Here are a list of resources that I found helpful when writing up this post:
General
Feature Importance
- How are feature_importances in RandomForestClassifier determined?
- Selecting good features – Part III: random forests
Feature Contributions
- Interpreting random forests
- Random forest interpretation with scikit-learn
- Random forest interpretation – conditional feature contributions
ICE plots and PDPs
LIME
- "Why Should I Trust You?": Explaining the Predictions of Any Classifier
- Why your relationship is likely to last (or not): using Local Interpretable Model-Agnostic Explanations (LIME)
- Understanding LIME
SHAP
- Interpretable Machine Learning with XGBoost
- Consistent Individualized Feature Attribution for Tree Ensembles
NFL Combine/Draft
Videos
- Machine Learning and Interpretability
- Towards interpretable reliable models
- Interpretable Machine Learning Using LIME Framework
- Explaining behavior of Machine Learning models with eli5 library
Model Interpretability Packages
As always you can find the notebook and data used for this post on github.