With the 2015 NBA Draft in the books (#knickstaps), I wanted to take a look at some data from previous drafts and explore it as a means of learning some Python and some of its libraries.
Now we'll take a look at the NBA draft data we scraped in the previous post. We'll use pandas to read in our data and manipulate it. Then we'll use matplotlib and seaborn to create some visualizations from the data.
Let's import the libraries we need.
NOTE: There will be a lot of repeated code, which obviously violates the DRY principle.
import pandas as pd
import numpy as np
# we need this 'magic' function to plot within an IPython notebook
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
Read in the CSV file
pandas allows us to easily read in CSV files using read_csv. The index_col parameter allows us to set the column that will act as the names for our rows. In our CSV that is the first column.
draft_df = pd.read_csv("draft_data_1966_to_2014.csv", index_col=0)
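To see what index_col=0 does without needing the real file, here is a minimal sketch that reads a tiny CSV from a string. The column names mirror the ones used in this post, but the rows and numbers are made up for illustration.

```python
import io
import pandas as pd

# A tiny stand-in for the real CSV; the values are made up.
# The first (unnamed) column plays the role of the saved row index.
csv_text = """,Draft_Yr,Pk,Player,WS_per_48
0,1966,1,Player A,0.100
1,1966,2,Player B,0.050
2,1967,1,Player C,0.120
"""

# index_col=0 tells pandas to use the first column as the row labels
# instead of treating it as ordinary data
sample_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample_df.index.tolist())
print(sample_df.columns.tolist())
```

Without index_col=0, that first column would most likely show up as an extra data column (named something like "Unnamed: 0") next to a fresh default index.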
Let's take a look at the data.
draft_df.head()
draft_df.tail()
draft_df.dtypes
Let's look at a few summary statistics using describe.
draft_df.describe()
Let's get the average Win Shares per 48 minutes for the 1966 draft. To do that we index draft_df with the Boolean condition draft_df['Draft_Yr'] == 1966, which returns a DataFrame containing only the data for the 1966 draft. We then grab its WS_per_48 column and call the mean() method.
draft_df[draft_df['Draft_Yr'] == 1966]['WS_per_48'].mean()
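It can help to break that one-liner into its two steps. The sketch below uses a made-up four-row DataFrame (only the column names come from this post) to show the Boolean Series first and the filtered mean second.

```python
import pandas as pd

# Made-up rows that reuse the column names from this post
toy_df = pd.DataFrame({
    'Draft_Yr': [1966, 1966, 1967, 1967],
    'WS_per_48': [0.10, 0.06, 0.12, 0.04],
})

# Comparing a column to a value gives a Boolean Series, one entry per row
is_1966 = toy_df['Draft_Yr'] == 1966

# Indexing with that Series keeps only the rows where it is True
mean_1966 = toy_df[is_1966]['WS_per_48'].mean()

print(is_1966.tolist())
print(mean_1966)
```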
There are a lot of different ways to index and slice data using pandas. I suggest reading the documentation.
Now that we can get the WS_per_48 mean for one year, let's get it for every year.
# draft_df.Draft_Yr.unique() contains all the years
# in our DataFrame
WS48_yrly_avg = [draft_df[draft_df['Draft_Yr']==yr]['WS_per_48'].mean()
                 for yr in draft_df.Draft_Yr.unique()]
WS48_yrly_avg
type(WS48_yrly_avg)
Another way we can get the above information is by using groupby, which allows us to group our data by draft year and then find the mean WS/48 for each year.
WS48_yrly_avg = draft_df.groupby('Draft_Yr').WS_per_48.mean()
WS48_yrly_avg # this is a pandas Series not a list
type(WS48_yrly_avg)
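One difference between the two approaches is worth a quick look: unique() returns years in the order they first appear, while groupby returns a Series indexed by year in sorted order. The sketch below uses made-up, deliberately unsorted rows to show this. If the real data is already sorted by year the two orders agree, but taking the x values from the Series index avoids depending on that.

```python
import pandas as pd

# Made-up rows, deliberately NOT sorted by year
toy_df = pd.DataFrame({
    'Draft_Yr': [1967, 1966, 1967, 1966],
    'WS_per_48': [0.12, 0.10, 0.06, 0.06],
})

# List comprehension: follows the order in which years first appear
years_seen = toy_df.Draft_Yr.unique()
list_avg = [toy_df[toy_df['Draft_Yr'] == yr]['WS_per_48'].mean()
            for yr in years_seen]

# groupby: the result is a Series indexed by year, sorted by year
series_avg = toy_df.groupby('Draft_Yr').WS_per_48.mean()

print(list(years_seen))        # order of appearance
print(list(series_avg.index))  # sorted years

# The Series carries its own labels, so x and y always line up
x_values = series_avg.index
y_values = series_avg.values
```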
Visualizing the Draft
We can now take WS48_yrly_avg and plot it using matplotlib and seaborn.
When creating plots, less is more. So no unnecessary 3D effects, labels, colors, or borders.
# Plot WS/48 by year
# use seaborn to set our graphing style
# the style 'white' creates a white background for
# our graph
sns.set_style("white")
# Set the size to have a width of 12 inches
# and height of 9
plt.figure(figsize=(12,9))
# get the x and y values
x_values = draft_df.Draft_Yr.unique()
y_values = WS48_yrly_avg
# add a title
title = ('Average Career Win Shares Per 48 minutes by Draft Year (1966-2014)')
plt.title(title, fontsize=20)
# Label the y-axis
# We don't need to label the year values
plt.ylabel('Win Shares Per 48 minutes', fontsize=18)
# Limit the range of the axis labels to only
# show where the data is. This helps to avoid
# unnecessary whitespace.
plt.xlim(1966, 2014.5)
plt.ylim(0, 0.08)
# Create a series of grey dashed lines across each
# labeled y-value of the graph
plt.grid(axis='y',color='grey', linestyle='--', lw=0.5, alpha=0.5)
# Change the size of tick labels for both axes
# to a more readable font size
plt.tick_params(axis='both', labelsize=14)
# get rid of borders for our graph using seaborn's
# despine function
sns.despine(left=True, bottom=True)
# plot the line for our graph
plt.plot(x_values, y_values)
# Provide a reference to data source and credit yourself
# by adding text to the bottom of the graph
# the first 2 arguments are the x and y axis coordinates of where
# we want to place the text
# The coordinates given below should place the text below
# the xlabel and aligned left against the y-axis
plt.text(1966, -0.012,
'Primary Data Source: http://www.basketball-reference.com/draft/'
'\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)',
fontsize=12)
# Display our graph
plt.show()
The huge jump in WS/48 coincides with the change to a two-round draft format in 1989. The jump makes sense, since better players now made up a higher percentage of the total players drafted.
Let's take a look at how the number of players drafted has changed over time. First we need to calculate the number of players drafted by year, then replace the y_values variable from the above code with those values.
players_drafted = draft_df.groupby('Draft_Yr').Pk.count()
players_drafted
sns.set_style("white")
plt.figure(figsize=(12,9))
x_values = draft_df.Draft_Yr.unique()
y_values = players_drafted
title = ('The Number of players Drafted in each Draft (1966-2014)')
plt.title(title, fontsize=20)
plt.ylabel('Number of Players Drafted', fontsize=18)
plt.xlim(1966, 2014.5)
plt.ylim(0, 250)
plt.grid(axis='y',color='grey', linestyle='--', lw=0.5, alpha=0.5)
plt.tick_params(axis='both', labelsize=14)
sns.despine(left=True, bottom=True)
plt.plot(x_values, y_values)
plt.text(1966, -35,
'Primary Data Source: http://www.basketball-reference.com/draft/'
'\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)',
fontsize=12)
plt.show()
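Since the two line charts above differ only in their data, title, y-label, and y-limits, one way to cut down on the repetition is to wrap the shared styling in a function. This is just a sketch: plot_yearly_line is a hypothetical helper name, the numbers passed to it are made up, and it removes the spines directly instead of calling sns.despine so that it only depends on matplotlib.

```python
import matplotlib.pyplot as plt


def plot_yearly_line(x_values, y_values, title, ylabel, ylim):
    """Hypothetical helper: draw one yearly line chart in this post's style."""
    fig, ax = plt.subplots(figsize=(12, 9))
    ax.set_title(title, fontsize=20)
    ax.set_ylabel(ylabel, fontsize=18)
    # Limit the axes to where the data is
    ax.set_xlim(min(x_values), max(x_values) + 0.5)
    ax.set_ylim(*ylim)
    # Grey dashed lines across each labeled y-value
    ax.grid(axis='y', color='grey', linestyle='--', lw=0.5, alpha=0.5)
    ax.tick_params(axis='both', labelsize=14)
    # Remove all four borders
    for spine in ax.spines.values():
        spine.set_visible(False)
    ax.plot(x_values, y_values)
    return fig, ax


# Made-up numbers, only to show the call
years = [1966, 1967, 1968]
fig, ax = plot_yearly_line(years, [112, 162, 214],
                           'Players Drafted (made-up numbers)',
                           'Number of Players Drafted', (0, 250))
```

With the real data, the call would pass the years and players_drafted instead, and the source/author text could be added afterwards with ax.text or fig.text.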
Let's combine both of those plots into one plot with 2 y-axis labels. To do this we can use the matplotlib Figure object and an array of (or single) Axes objects that the plt.subplots() function returns. We can access some of the plot elements, like our x-axis and y-axis, that we want to customize through the Axes objects. To create the two different plots we will create two different Axes objects and call the plot method from each of them.
sns.set_style("white")
# change the mapping of default matplotlib color shorthands (like 'b'
# or 'r') to default seaborn palette
sns.set_color_codes()
# set the x and y values for our first line
x_values = draft_df.Draft_Yr.unique()
y_values_1 = players_drafted
# plt.subplots returns a tuple containing a Figure and an Axes
# fig is a Figure object and ax1 is an Axes object
# we can also set the size of our plot
fig, ax1 = plt.subplots(figsize=(12,9))
title = ('The Number of Players Drafted and Average Career WS/48'
'\nfor each Draft (1966-2014)')
plt.title(title, fontsize=20)
# plt.xlabel('Draft Pick', fontsize=16)
# Create a series of grey dashed lines across each
# labeled y-value of the graph
plt.grid(axis='y',color='grey', linestyle='--', lw=0.5, alpha=0.5)
# Change the size of tick labels for the x-axis and left y-axis
# to a more readable font size
plt.tick_params(axis='both', labelsize=14)
# Plot our first line, which deals with the number of players drafted
# We assign it to plot1 to reference later for our legend
# We also give it a label to use in our legend
plot1 = ax1.plot(x_values, y_values_1, 'b', label='No. of Players Drafted')
# Create the ylabel for our number of players drafted line
ax1.set_ylabel('Number of Players Drafted', fontsize=18)
# Set limits for 1st y-axis
ax1.set_ylim(0, 240)
# Have tick color match corresponding line color
for tl in ax1.get_yticklabels():
    tl.set_color('b')
# Now we create our 2nd Axes object that will share the same x-axis
# To do this we call the twinx() method from our first Axes object
ax2 = ax1.twinx()
y_values_2 = WS48_yrly_avg
# Create our second line for the average WS/48 by year
plot2 = ax2.plot(x_values, y_values_2, 'r',
                 label='Avg WS/48')
# Create our label for the 2nd y-axis
ax2.set_ylabel('Win Shares Per 48 minutes', fontsize=18)
# Set the limit for 2nd y-axis
ax2.set_ylim(0, 0.08)
# Set tick size for second y-axis
ax2.tick_params(axis='y', labelsize=14)
# Have tick color match corresponding line color
for tl in ax2.get_yticklabels():
    tl.set_color('r')
# Limit our x-axis values to minimize white space
ax2.set_xlim(1966, 2014.15)
# create our legend
# First add our lines together
lines = plot1 + plot2
# Then create legend by calling legend and getting the label for each line
ax1.legend(lines, [l.get_label() for l in lines])
# Create evenly lined up tick marks for both y-axes
# np.linspace allows us to get evenly spaced numbers over
# the specified interval given by the first 2 arguments.
# Those 2 arguments are the outer bounds of the y-axis values
# the third argument is the number of values we want to create
# ax1 - create 9 tick values from 0 to 240
ax1.set_yticks(np.linspace(ax1.get_ybound()[0], ax1.get_ybound()[1], 9))
# ax2 - create 9 tick values from 0.00 to 0.08
ax2.set_yticks(np.linspace(ax2.get_ybound()[0], ax2.get_ybound()[1], 9))
# need to get rid of spines for each Axes object
for ax in [ax1, ax2]:
    ax.spines["top"].set_visible(False)
    ax.spines["bottom"].set_visible(False)
    ax.spines["right"].set_visible(False)
    ax.spines["left"].set_visible(False)
# Create text by calling the text() method from our figure object
fig.text(0.1, 0.02,
'Data source: http://www.basketball-reference.com/draft/'
'\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)',
fontsize=10)
plt.show()
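The tick alignment trick in the code above comes down to giving both y-axes the same number of evenly spaced ticks. Here is a small sketch of just that step, using the same limits (0 to 240 and 0 to 0.08) as plain numbers instead of reading them from the Axes objects.

```python
import numpy as np

# Same limits as the two y-axes in the plot above
left_ticks = np.linspace(0, 240, 9)    # number of players drafted
right_ticks = np.linspace(0, 0.08, 9)  # WS/48

print(left_ticks)
print(right_ticks)

# Both axes get 9 ticks, so tick i on the left sits at the same
# height as tick i on the right: each is the same fraction of its axis
left_fraction = left_ticks / 240
right_fraction = right_ticks / 0.08
print(np.allclose(left_fraction, right_fraction))
```

Because the ticks share the same heights, a single set of horizontal grid lines serves both axes.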
Let's create a DataFrame of just the top 60 picks, and then grab the data we need to plot. Note: drafts from 1989 to 2004 don't have a minimum of 60 draft picks.
top60 = draft_df[(draft_df['Pk'] < 61)]
top60_yrly_WS48 = top60.groupby('Draft_Yr').WS_per_48.mean()
top60_yrly_WS48
sns.set_style("white")
plt.figure(figsize=(12,9))
x_values = draft_df.Draft_Yr.unique()
y_values = top60_yrly_WS48
title = ('Average Career Win Shares Per 48 minutes for'
'\nTop 60 Picks by Draft Year (1966-2014)')
plt.title(title, fontsize=20)
plt.ylabel('Win Shares Per 48 minutes', fontsize=18)
plt.xlim(1966, 2014.5)
plt.ylim(0, 0.08)
plt.grid(axis='y',color='grey', linestyle='--', lw=0.5, alpha=0.5)
plt.tick_params(axis='both', labelsize=14)
sns.despine(left=True, bottom=True)
plt.plot(x_values, y_values)
plt.text(1966, -0.012,
'Primary Data Source: http://www.basketball-reference.com/draft/'
'\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)'
'\nNote: Drafts from 1989 to 2004 have less than 60 draft picks',
fontsize=12)
plt.show()
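The note about some drafts having fewer than 60 picks can be checked by counting rows per year after filtering. The sketch below shows the idea on made-up data, with a cutoff of 4 standing in for the 60-pick cutoff. The variable names cutoff, top_picks, and short_years are only for illustration.

```python
import pandas as pd

# Made-up data: 1988 has 5 picks, 1989 has only 3
toy_df = pd.DataFrame({
    'Draft_Yr': [1988] * 5 + [1989] * 3,
    'Pk': [1, 2, 3, 4, 5, 1, 2, 3],
})

cutoff = 4  # stands in for the 60-pick cutoff used in the post
top_picks = toy_df[toy_df['Pk'] <= cutoff]

# Count how many picks each year contributes after filtering
picks_per_year = top_picks.groupby('Draft_Yr').Pk.count()

# Years that fall short of the cutoff
short_years = picks_per_year[picks_per_year < cutoff]
print(short_years)
```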
Bar Plots
Let's plot the average WS/48 by pick for the top 60 picks.
top60_mean_WS48 = top60.groupby('Pk').WS_per_48.mean()
top60_mean_WS48
sns.set_style("white")
# set the x and y values
x_values = top60.Pk.unique()
y_values = top60_mean_WS48
fig, ax = plt.subplots(figsize=(15,10))
title = ('Average Win Shares per 48 Minutes for each'
'\nNBA Draft Pick in the Top 60 (1966-2014)')
ax.set_title(title, fontsize=18)
ax.set_xlabel('Draft Pick', fontsize=16)
ax.set_ylabel('Win Shares Per 48 minutes', fontsize=16)
ax.tick_params(axis='both', labelsize=12)
ax.set_xlim(0,61)
ax.set_xticks(np.arange(1,61)) # label the tick marks
# create white y-axis grid lines
ax.yaxis.grid(color='white')
# overlay the white grid line on top of the bars
ax.set_axisbelow(False)
# Now add the bars to our plot
# this is equivalent to plt.bar(x_values, y_values)
ax.bar(x_values, y_values)
sns.despine(left=True, bottom=True)
plt.text(0, -.05,
'Primary Data Source: http://www.basketball-reference.com/draft/'
'\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)'
'\nNote: Drafts from 1989 to 2004 have less than 60 draft picks',
fontsize=12)
plt.show()
Let's plot the same information as above but as a horizontal bar plot, which will give us better spacing for our tick labels.
sns.set_style("white")
# Note we flipped the value variable names
y_values = top60.Pk.unique()
x_values = top60_mean_WS48
fig, ax = plt.subplots(figsize=(10,15))
title = ('Average Win Shares per 48 Minutes for each'
'\nNBA Draft Pick in the Top 60 (1966-2014)')
# Add title with space below for x-axis ticks and label
ax.set_title(title, fontsize=18, y=1.06)
ax.set_ylabel('Draft \nPick', fontsize=16, rotation=0)
ax.set_xlabel('Win Shares Per 48 minutes', fontsize=16)
ax.tick_params(axis='both', labelsize=12)
# set a limit for our y-axis so that we start from pick 1 at the top
ax.set_ylim(61,0)
# Show all values for draft picks
ax.set_yticks(np.arange(1,61))
# pad the y-axis label to not overlap tick labels
ax.yaxis.labelpad = 25
# Move x-axis ticks and label to the top
ax.xaxis.tick_top()
ax.xaxis.set_label_position('top')
# create white x-axis grid lines
ax.xaxis.grid(color='white')
# overlay the white grid line on top of the bars
ax.set_axisbelow(False)
# Now add the horizontal bars to our plot,
# and align them centered with ticks
ax.barh(y_values, x_values, align='center')
# get rid of borders for our graph
# Not using sns.despine as I get an issue with displaying
# the x-axis at the top of the graph
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
plt.text(-0.02, 65,
'Primary Data Source: http://www.basketball-reference.com/draft/'
'\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)'
'\nNote: Drafts from 1989 to 2004 have less than 60 draft picks',
fontsize=12)
plt.show()
Dot Plots/Point Plots
Instead of using a bar plot we can use a dot plot or point plot to represent the above information. seaborn allows us to create point plots using pointplot.
sns.set_style("white")
# fig, ax = plt.subplots(figsize=(10,15))
plt.figure(figsize=(10,15))
# Create Axes object with pointplot drawn on
# By default, pointplot draws the mean along with a
# 95% confidence interval
ax = sns.pointplot(x='WS_per_48', y='Pk', join=False, data=top60,
orient='h')#, ci=None)
title = ('Average Win Shares per 48 Minutes (with 95% CI)'
'\nfor each NBA Draft Pick in the Top 60 (1966-2014)')
# Add title with space below for x-axis ticks and label
ax.set_title(title, fontsize=18, y=1.06)
ax.set_ylabel('Draft \nPick', fontsize=16, rotation=0) # rota
ax.set_xlabel('Win Shares Per 48 minutes', fontsize=16)
ax.tick_params(axis='both', labelsize=12)
# set a limit for our y-axis so that we start from pick 1 at the top
# ax.set_ylim(61,0)
# Show all values for draft picks
# ax.set_yticks(np.arange(1,61))
# pad the y-axis label to not overlap tick labels
ax.yaxis.labelpad = 25
# limit x-axis
ax.set_xlim(-0.1, 0.15)
# Move x-axis ticks and label to the top
ax.xaxis.tick_top()
ax.xaxis.set_label_position('top')
# add horizontal lines for each draft pick
for y in range(len(y_values)):
    ax.hlines(y, -0.1, 0.15, color='grey',
              linestyle='-', lw=0.5)
# Add a vertical line at 0.00 WS/48
ax.vlines(0.00, -1, 60, color='grey', linestyle='-', lw=0.5)
# get rid of borders for our graph
# Not using sns.despine as I get an issue with displaying
# the x-axis at the top of the graph
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
plt.text(-0.1, 63,
'Primary Data Source: http://www.basketball-reference.com/draft/'
'\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)'
'\nNote: Drafts from 1989 to 2004 have less than 60 draft picks',
fontsize=12)
plt.show()
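As far as I understand it, the confidence intervals seaborn draws around each mean are estimated by bootstrapping. The sketch below is a plain numpy version of a percentile bootstrap on made-up WS/48 values. It is meant to show the idea, not to reproduce seaborn's exact implementation, and the sample size and number of resamples are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up WS/48 values standing in for one draft slot
ws48 = rng.normal(loc=0.08, scale=0.05, size=49)

# Resample with replacement many times and record each resample's mean
n_boot = 1000
boot_means = np.array([
    rng.choice(ws48, size=ws48.size, replace=True).mean()
    for _ in range(n_boot)
])

# The middle 95% of the bootstrap means forms the interval
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(ws48.mean(), ci_low, ci_high)
```

Draft slots with fewer players or more spread-out careers produce wider intervals, which is why the error bars in the plot vary so much from pick to pick.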
Box Plots
To create a boxplot using seaborn all we have to do is use seaborn.boxplot, which returns an Axes object with the boxplot drawn onto it.
Let's take a look at the WS/48 for the top 30 picks using boxplots.
top30 = top60[top60['Pk'] < 31]
sns.set_style("whitegrid")
plt.figure(figsize=(15,12))
# create our boxplot which is drawn on an Axes object
bplot = sns.boxplot(x='Pk', y='WS_per_48', data=top30, whis=[5,95])
title = ('Distribution of Win Shares per 48 Minutes for each'
'\nNBA Draft Pick in the Top 30 (1966-2014)')
# We can call all the methods available to Axes objects
bplot.set_title(title, fontsize=20)
bplot.set_xlabel('Draft Pick', fontsize=16)
bplot.set_ylabel('Win Shares Per 48 minutes', fontsize=16)
bplot.tick_params(axis='both', labelsize=12)
sns.despine(left=True)
plt.text(-1, -.5,
'Data source: http://www.basketball-reference.com/draft/'
'\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)'
'\nNote: Whiskers represent the 5th and 95th percentiles',
fontsize=12)
plt.show()
Each box contains the inter-quartile range (IQR), which means the bottom of the box represents the 25th percentile and the top represents the 75th percentile. The median is represented by the line within the box.
By default in seaborn and matplotlib, each whisker extends to the most extreme data point that lies within 1.5 * IQR of the nearest quartile. So the top whisker reaches the largest value that is no more than 1.5 * IQR above the 75th percentile. The dots that fall outside the whiskers are considered outliers.
However, in our boxplot above we set the whiskers to represent the 5th and 95th percentiles by setting the whis parameter to [5, 95]. The dots now represent values that fall within the top or bottom 5% of the distribution.
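To make the default whisker rule concrete, here is a small worked example on made-up numbers. Under that rule the whiskers stop at the most extreme data points that lie within 1.5 times the inter-quartile range of the box, and anything beyond that is drawn as a dot.

```python
import numpy as np

# Made-up values with one obvious outlier
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences sit 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences
inside = values[(values >= lower_fence) & (values <= upper_fence)]
lower_whisker, upper_whisker = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual dot
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, q3, iqr)
print(lower_whisker, upper_whisker)
print(outliers)
```

Here the quartiles are 3.25 and 7.75, so the upper fence is 14.5. The upper whisker stops at 9, the largest value under the fence, and 100 is the only outlier.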
Let's get the top 5% for the 3rd overall draft pick. To do this we get all 3rd overall picks, then get their WS_per_48 and call the quantile() method. Passing 0.95 into quantile() returns the WS_per_48 value of the 95th percentile for all 3rd picks.
pick3_95 = top30[top30['Pk']==3]['WS_per_48'].quantile(0.95)
pick3_95
Now let's get the players that have a WS_per_48 greater than about 0.1784.
# Here we are accessing columns as attributes and then using
# Boolean operations
# Let's create a mask that contains our Boolean operations then index
# the data using the mask
mask = (top30.Pk == 3) & (top30.WS_per_48 > pick3_95)
pick3_top5_percent = top30[mask]
pick3_top5_percent[['Player', 'WS_per_48']]
We can rewrite the above code using the query method. To reference a local variable within our query string we must place '@' in front of its name. pandas also allows us to use English instead of symbols in our query string.
pick3_top5_percent = top30.query('Pk == 3 and WS_per_48 > @pick3_95')
pick3_top5_percent[['Player', 'WS_per_48']]
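To confirm that the mask and the query string select the same rows, here is a sketch on a made-up five-row DataFrame. It uses the 75th percentile instead of the 95th only so that the tiny example has a row above the threshold.

```python
import pandas as pd

# Made-up players; four 3rd picks and one 1st pick
toy_df = pd.DataFrame({
    'Player': ['A', 'B', 'C', 'D', 'E'],
    'Pk': [3, 3, 3, 3, 1],
    'WS_per_48': [0.05, 0.10, 0.15, 0.20, 0.25],
})

# Threshold computed from the 3rd picks only
threshold = toy_df[toy_df['Pk'] == 3]['WS_per_48'].quantile(0.75)

# Boolean mask version
mask = (toy_df.Pk == 3) & (toy_df.WS_per_48 > threshold)
by_mask = toy_df[mask]

# query version; @threshold refers to the local variable above
by_query = toy_df.query('Pk == 3 and WS_per_48 > @threshold')

print(by_mask.equals(by_query))
print(by_mask[['Player', 'WS_per_48']])
```

Player E has the highest WS_per_48 but is excluded by both versions because the pick condition and the threshold condition must both hold.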
Violin Plots
Creating violin plots using seaborn is pretty much the same as creating a boxplot, but we use seaborn.violinplot instead of seaborn.boxplot.
top10 = top60[top60['Pk'] < 11]
sns.set(style="whitegrid")
plt.figure(figsize=(15,10))
# create our violinplot which is drawn on an Axes object
vplot = sns.violinplot(x='Pk', y='WS_per_48', data=top10)
title = ('Distribution of Win Shares per 48 Minutes for each'
'\nNBA Draft Pick in the Top 10 (1966-2014)')
# We can call all the methods available to Axes objects
vplot.set_title(title, fontsize=20)
vplot.set_xlabel('Draft Pick', fontsize=16)
vplot.set_ylabel('Win Shares Per 48 minutes', fontsize=16)
vplot.tick_params(axis='both', labelsize=12)
plt.text(-1, -.55,
'Data source: http://www.basketball-reference.com/draft/'
'\nAuthor: Savvas Tjortjoglou (savvastjortjoglou.com)',
fontsize=12)
sns.despine(left=True)
plt.show()
Each violin in the above plot actually contains a box plot, with a white dot in the middle representing the median.
A violin plot is a combination of a boxplot and a kernel density estimate. Instead of just having whiskers or dots to give us more information about the distribution of our data, the violin plot provides an estimated shape of the distribution.
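For a rough idea of where the violin's outline comes from, here is a hand-rolled Gaussian kernel density estimate in numpy on made-up WS/48 values. The bandwidth uses Scott's rule of thumb, which I'm assuming is close to what seaborn does by default, so treat this as a sketch of the technique and not as seaborn's actual code.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up WS/48 values standing in for one draft slot
ws48 = rng.normal(loc=0.08, scale=0.05, size=200)

# Bandwidth from Scott's rule of thumb for 1-D data (an assumption here)
bandwidth = ws48.std(ddof=1) * ws48.size ** (-1 / 5)

# Evaluate the density on a grid that extends a bit past the data
grid = np.linspace(ws48.min() - 4 * bandwidth,
                   ws48.max() + 4 * bandwidth, 512)

# Place a Gaussian bump on every observation and average them
z = (grid[:, None] - ws48[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (
    ws48.size * bandwidth * np.sqrt(2 * np.pi))

# A density should integrate to roughly 1 over the grid
dx = grid[1] - grid[0]
area = density.sum() * dx
print(area)
```

Mirroring this density curve around a vertical line gives the violin shape. A wider section of the violin means more players ended up near that WS/48 value.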
Software Versions
import sys
print('Python version:', sys.version_info)
import IPython
print('IPython version:', IPython.__version__)
import matplotlib as mpl
print('Matplotlib version:', mpl.__version__)
print('Seaborn version:', sns.__version__)
print('Pandas version:', pd.__version__)
You can check out the IPython notebook and data used in this post at the GitHub repo here.