With the 2015 NBA Draft in the books (#knickstaps) I wanted to take a look at some data from previous drafts and explore it as a means of learning some Python and some of its libraries.
In this first part we'll be scraping and cleaning data from the 1966 draft (the first year without territorial picks) to the 2014 draft.
To scrape the data we want, we will use the BeautifulSoup library. It allows us to search through the HTML of a webpage and extract the information we need. We'll then take the scraped data and store it in a DataFrame using the pandas library.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
Scraping the Data
Let's get a feel for using Beautiful Soup by just scraping the data for the 2014 NBA draft.

First we need to get the HTML document we will be scraping. To do that we will use urlopen, which we imported from the urllib.request module.
# url that we are scraping
url = "http://www.basketball-reference.com/draft/NBA_2014.html"
# this is the html from the given url
html = urlopen(url)
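A small aside: urlopen() gives us back a response object (an http.client.HTTPResponse in Python 3) rather than a string, but Beautiful Soup is happy to read the HTML straight from it.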
Now we create a BeautifulSoup object by passing html to the BeautifulSoup() constructor.
soup = BeautifulSoup(html)
type(soup) # we see that soup is a BeautifulSoup object
Getting Column Headers

First let's grab the information that will become the column headers of our DataFrame. That information is found in the content of the table header cell HTML elements (denoted by <th> tags), which are found within a table row element (<tr>) of a table header element (<thead>) of a table element (<table>).

The HTML structure would look something like this:

<table>
    <thead>
        <tr>
            <th>The text we want is between these HTML tags</th>
        </tr>
    </thead>
</table>

As an example, let's take a look at the table header cell element for the 'Player' column header:

<th>Player</th>

We want to extract the text content (which would be 'Player' in the above example) from all elements like the one above and store them in a list.
By inspecting the HTML document (in Chrome you do this by right-clicking on the webpage and selecting "Inspect element" from the dropdown menu) we see that the 2nd table row is the one that contains the table headers we want.
Using BeautifulSoup to get this information is pretty straightforward. All we have to do is:

1. Use the findAll() method to find the first 2 rows on the page.
   - We pass the element we want as the first argument, in this case 'tr'.
   - We set the limit parameter to 2 so we only get the first 2 table row elements.
   - So we have something like this: soup.findAll('tr', limit=2)
   - The findAll() method above returns us a list of BeautifulSoup Tag objects. In this case it returns the Tag objects that represent the table rows we wanted.
2. We want the 2nd table row, so we have to extract the 2nd element of the list of Tag objects, which is done by just indexing the list as follows: soup.findAll('tr', limit=2)[1]
3. Since the above returns us a Tag object, we can just call findAll() on it to extract the table header cell elements, like this: soup.findAll('tr', limit=2)[1].findAll('th')

Now we have a list containing the table header cell elements that we can iterate through and extract the text we want via the getText() method. Let's extract that text and construct a list of the column headers using a list comprehension.
column_headers = [th.getText() for th in
soup.findAll('tr', limit=2)[1].findAll('th')]
column_headers # our column headers
NOTE: Some of the column headers (or names or labels) need to be renamed, but we won't be doing that until we scrape the data from 1966 and on.
Getting the Data

We now need to extract the data from the HTML table and fill up our DataFrame.

We can follow a similar process to the one we used to extract the header information.

In this case the data we want is found in the table row elements after the first two header rows. So let's get the list of table rows as follows:
data_rows = soup.findAll('tr')[2:] # skip the first 2 header rows
type(data_rows) # now we have a list of table rows
The difference with extracting the player data versus the column headers is that the player data is in a 2-dimensional format (or a matrix), so we have to construct a 2-dimensional list. We can do this using a nested list comprehension:
player_data = [[td.getText() for td in data_rows[i].findAll('td')]
for i in range(len(data_rows))]
Let's take a look at what the above does.

The outer for loop

for i in range(len(data_rows))

gets each table row, which represents each draft pick.

The for loop in the inner list comprehension

[td.getText() for td in data_rows[i].findAll('td')]

extracts the text from the table data elements (<td>) for each row. This text represents the column data for each draft pick.
You could also construct the above list without a list comprehension:
player_data_02 = []  # create an empty list to hold all the data

for i in range(len(data_rows)):  # for each table row
    player_row = []  # create an empty list for each pick/player

    # for each table data element from each table row
    for td in data_rows[i].findAll('td'):
        # get the text content and append to the player_row
        player_row.append(td.getText())

    # then append each pick/player to the player_data matrix
    player_data_02.append(player_row)
Both player_data and player_data_02 are equivalent.
player_data == player_data_02
Now that we have the column labels and the data we can construct a pandas DataFrame.

We pass in player_data as the first parameter in the DataFrame constructor and set the columns parameter to column_headers.
df = pd.DataFrame(player_data, columns=column_headers)
df.head() # head() lets us see the 1st 5 rows of our DataFrame by default
df.head(3) # we can change from the default view by passing in an integer
Cleaning the data

There are a few things we have to do to make our data usable:

- Get rid of a couple of rows (that were header rows) that contain only NoneType values.
- Rename some of the columns.
- Change the data in each column to its proper data type.
- Deal with some more missing values in the form of NaNs.
- Add and drop certain columns.
Getting rid of the rows with missing values

Now let's find the rows containing NoneType values. To do this we can use pandas boolean indexing. We can find the rows we want by calling the isnull() method (which returns True if there is a NoneType or NaN) on the 'Pk' column. If the 'Pk' value is missing then there isn't a draft pick in that row, so we can get rid of that row.
# Finding the None rows
df[df['Pk'].isnull()]
We get rid of those rows by reassigning df a subset of itself that doesn't have the above rows.

Note how we can access the same column of our DataFrame in different ways. In the cell below we are accessing the column as an attribute.
df = df[df.Player.notnull()]
Now there aren't any rows that are full of missing values.
df[df['Pk'].isnull()]
Renaming the columns

We should rename some of the columns since Python is not happy with having '%' or '/' in identifiers.

Let's rename the WS/48 column to WS_per_48 using the rename() method. We set the columns parameter to a dictionary with the key being the column we want to rename and the value being the new name for that column.
df.rename(columns={'WS/48':'WS_per_48'}, inplace=True)
Since there are multiple columns with the '%' character, let's use the str.replace() method on the column index to replace '%' with '_Perc' in those column names:
# get the column names and replace all '%' with '_Perc'
df.columns = df.columns.str.replace('%', '_Perc')
We also need to differentiate between per game stats and total cumulative career stats. For example, we have two MP columns, one is the career total minutes played by that player and the other is his per game average.
To do this we can append '_per_G' to the per game stats using a list comprehension.
# Get the columns we want by slicing the list of column names
# and then replace them with the appended names
df.columns.values[14:18] = [df.columns.values[14:18][col] +
"_per_G" for col in range(4)]
print(df.columns)
Change Data to Proper Data Type

df.dtypes # Take a look at the data types in each column
We can see that our DataFrame consists of a lot of columns with data type object. We can call the DataFrame method convert_objects() and pass in the parameter convert_numeric=True to convert the columns with numerical strings to the most suitable numeric data type.
df = df.convert_objects(convert_numeric=True)
df.dtypes
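If convert_objects() is not available in your version of pandas (it was deprecated in later releases), a rough equivalent can be sketched with pd.to_numeric(). This is just a sketch, and it assumes that Player, Tm, and College are the only text columns we want to leave alone:

# a rough equivalent of convert_objects(convert_numeric=True), assuming
# 'Player', 'Tm', and 'College' are the only text columns in our DataFrame;
# errors='coerce' turns unparseable strings (like '') into NaN
for col in df.columns.drop(['Player', 'Tm', 'College']):
    df[col] = pd.to_numeric(df[col], errors='coerce')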
Not everything is exactly as we want it to be. We should convert the columns Yrs, G, MP, PTS, TRB, and AST to integers using astype(). But before we can do that we need to deal with some NaNs (Not a Number) or else we will get an error like this:
ValueError: Cannot convert NA
Dealing with the NaN values

The NaNs in our data indicate that a player has not played in the NBA. We should replace these NaNs with 0s to indicate that the player has not accumulated any stats.

To replace NaN values with 0s we just pass the value 0 into the fillna() method.
df = df[:].fillna(0) # index all the columns and fill in the 0s
Now that there are no more missing values in our data we can call astype().
To get the columns we will convert we can use loc, which allows us to access the rows and columns of our DataFrame by using their labels. Using loc we can slice consecutive rows or columns in a similar manner to how we slice lists, except the slice ranges in .loc are inclusive on both ends of the indicated slice.

The typical format for slicing with loc is as follows:
df.loc[first_row_label:last_row_label , first_col_label:last_col_label]
(Read more about indexing pandas DataFrames here)
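To see that inclusiveness in action, here's a minimal sketch using a small throwaway DataFrame (not our draft data):

# a tiny DataFrame just to illustrate .loc slicing
tiny = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
tiny.loc[:, 'A':'B'] # returns BOTH the 'A' and 'B' columns, unlike list slicing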
Since the columns Yrs, G, MP, PTS, TRB, and AST are consecutive columns, we can retrieve them by slicing our DataFrame using loc and then convert their data type.
df.loc[:,'Yrs':'AST'] = df.loc[:,'Yrs':'AST'].astype(int)
df.head() # All NaNs are now replaced with 0s
df.dtypes # and we have the data types we want
Adding and Deleting Columns

Now let's finally add a Draft_Yr column to indicate the draft class year. We do this by using the insert() method. The first parameter is where we are inserting the column, the second parameter is the name of the column, and the third parameter is the value(s) for that column.
df.insert(0, 'Draft_Yr', 2014)
Now to delete the Rk column, as it is redundant since we have the Pk column. To delete a column we use the drop() method. We pass in the name of the column we want to delete, the axis we are deleting along, and set inplace to True so that the changes occur in place and we don't have to reassign df to itself.
df.drop('Rk', axis='columns', inplace=True)
df.columns # checkout our revised columns
Scraping and Cleaning Data for Drafts from 1966

Scraping and cleaning the data from all previous drafts follows a similar procedure to the one we used for the 2014 draft. The main difference is that we have to do it multiple times and then combine the data into one large DataFrame.
Scraping

First let's create a URL template that will allow us to access the web page for each year.
url_template = "http://www.basketball-reference.com/draft/NBA_{year}.html"
# create an empty DataFrame
draft_df = pd.DataFrame()
Now we are going to create a for loop to scrape and append our data to our big DataFrame. It does take a bit to get all the data.
for year in range(1966, 2015):  # for each year
    url = url_template.format(year=year)  # get the url
    html = urlopen(url)  # get the html
    soup = BeautifulSoup(html, 'html5lib')  # create our BS object

    # get our player data
    data_rows = soup.findAll('tr')[2:]
    player_data = [[td.getText() for td in data_rows[i].findAll('td')]
                   for i in range(len(data_rows))]

    # Turn yearly data into a DataFrame
    year_df = pd.DataFrame(player_data, columns=column_headers)
    # create and insert the Draft_Yr column
    year_df.insert(0, 'Draft_Yr', year)

    # Append to the big dataframe
    draft_df = draft_df.append(year_df, ignore_index=True)
NOTE: I had to pass html5lib as an extra argument into the BeautifulSoup constructor. html5lib is a third-party Python parser that Beautiful Soup can use. The reason I had to do this was because my default parser (lxml, another third-party parser) was not extracting all of the table rows for some of the years.
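(If you don't already have html5lib, it should be installable with pip install html5lib from the command line.)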
For more about parsers and Beautiful Soup, check out the documentation.
Let's take a look at the first and last few rows of our data.
draft_df.head()
draft_df.tail()
Now we clean
# Convert data to proper data types
draft_df = draft_df.convert_objects(convert_numeric=True)
# Get rid of the rows full of null values
draft_df = draft_df[draft_df.Player.notnull()]
# Replace NaNs with 0s
draft_df = draft_df.fillna(0)
# Rename Columns
draft_df.rename(columns={'WS/48':'WS_per_48'}, inplace=True)
# Change % symbol
draft_df.columns = draft_df.columns.str.replace('%', '_Perc')
# Add per_G to per game stats
draft_df.columns.values[15:19] = [draft_df.columns.values[15:19][col] +
"_per_G" for col in range(4)]
# Changing the Data Types to int
draft_df.loc[:,'Yrs':'AST'] = draft_df.loc[:,'Yrs':'AST'].astype(int)
# Delete the 'Rk' column
draft_df.drop('Rk', axis='columns', inplace=True)
Let's see how our data looks now.
draft_df.dtypes
draft_df['Pk'] = draft_df['Pk'].astype(int) # change Pk to int
draft_df.dtypes
draft_df.isnull().sum() # No missing values in our DataFrame
Writing Out Data to a CSV File

Using the to_csv() method we can easily write out our DataFrame to a CSV file.
draft_df.to_csv("draft_data_1966_to_2014.csv")
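If we want a quick sanity check, we can read the file right back in. Note that index_col=0 treats the unnamed first column (the DataFrame index that to_csv() wrote out) as the index:

# read the csv back in; the head should match draft_df.head()
check_df = pd.read_csv("draft_data_1966_to_2014.csv", index_col=0)
check_df.head()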
Finally, let's print out the version of Python and versions of the libraries we used.
# sys allows us to get the info for the version of Python we use
import sys
import urllib.request
import bs4
print('Python version:', sys.version_info)
print('Urllib.request version:', urllib.request.__version__)
print('BeautifulSoup version:', bs4.__version__)
print('Pandas version:', pd.__version__)
You can find the IPython notebook and data for this post here.