# What is Survival Analysis?¶

Survival analysis is used to study the **time** until some **event** of interest (often referred to as **death**) occurs. Time could be measured in years, months, weeks, days, etc. The event could be anything of interest. It could be an actual death, a birth, a Pokemon Go server crash, etc. In this post we are interested in how long drafted NFL players are in the league, so the event of interest will be the retirement of drafted NFL players. The duration of time leading up to the event of interest can be called the **survival time**. In our case, the survival time is the number of years that a player was active in the league (according to Pro Football Reference).

Some of the players in this analysis are still active players (e.g. Aaron Rodgers, Eli Manning, etc.), so we haven't observed their retirement (the event of interest). Those players are considered **censored**. While we have some information about their career length (or survival time), we don't know the full length of their career. This specific type of censorship, one in which we do not observe end of the survival time, is called **right-censorship**. The methods developed in the field of survival analysis were created in order to deal with the issue of censored data. In this post we will use one such method, called the Kaplan-Meier estimator, to estimate the survival function and construct the survival curve for an NFL career.

## A brief comment on the data used¶

I used the draft data scraped from my previous post. The duration of a player's career is just the difference between "To" value from the PFR draft table and the year the player was drafted. Players were considered active, if there name was in bold. However there are may be some players who are retired that PFR still considers active (e.g. Mike Kafka). You can check out how I prepared the data in this Jupyter notebook. Let me know if you see any issues/mistakes I've made.

# What is the Survival Function?¶

The survival function, $S(t)$, of a population is defined as follows:

$$S(t) = Pr(T > t)$$Capital $T$ is a random variable that represents a subject's survival time. In our case $T$ represents an NFL player's career length. Lower case $t$ represents a specific time of interest for $T$. In our analysis the $t$ represents a specific number of years played. In other words the survival function just gives us the probability that someone survives longer than (or at least as long as) a specified value of time, $t$. So in the context of our analysis, $S(3)$ will provide us the probability that an NFL career lasts longer than 3 years.

# What is the Kaplan-Meier estimator?¶

To estimate the survival function of NFL players we will use the Kaplan-Meier estimator. The Kaplan-Meier estimator is defined by the following product (from the `lifelines`

documentation):

where $d_i$ are the number of death events at time $t$ and $n_i$ is the number of subjects at risk of death just prior to time $t$.

We will walk through a simple example in a bit in order to get a better understanding of the above definition.

# Estimating the Survival Function of NFL Players¶

To estimate the survival function of NFL players we will be using the `lifelines`

library. It provides a user friendly interface for survival analyis using Python. Lets get started by importing what we need and reading in the data.

```
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from lifelines import KaplanMeierFitter
draft_df = pd.read_csv("data/nfl_survival_analysis_data.csv")
# set some plotting aesthetics, similar to ggplot
sns.set(palette = "colorblind", font_scale = 1.35,
rc = {"figure.figsize": (12,9), "axes.facecolor": ".92"})
```

```
draft_df.head()
```

The columns of interest for our analysis are the *Duration* and *Retired* columns. The *Duration* column represents the number of years a player played in the NFL. The *Retired* column represents whether the player retired from the NFL or not. 1 indicates that he is retired, while 0 indicates that he is still an active player.

To calculate the Kaplan-Meier estimate we will need to create a `KaplanMeierFitter`

object.

```
kmf = KaplanMeierFitter()
```

We can then fit the data by calling the `KaplanMeierFitter`

s `fit`

method.

```
# The 1st arg accepts an array or pd.Series of individual survival times
# The 2nd arg accepts an array or pd.Series that indicates if the event
# interest (or death) occured.
kmf.fit(durations = draft_df.Duration,
event_observed = draft_df.Retired)
```

After fitting our data we can access the event table that contains a bunch of information regarding the subjects (the NFL players) at each time period.

```
kmf.event_table
```

The *removed* column contains the number of observations removed during that time period, whether due to death (the value in the *observed* column) or censorship. So the *removed* column is just the sum of the *observed* and *censorship* columns. The *entrance* column tells us whether any new subjects entered the population at that time period. Since all the players we are studying start at $time = 0$ (the moment they were drafted), the *entrance* value is 15,592 at that time and 0 for all other times.

The *at_risk* column contains the number of subjects that are still alive during a given time. The value for *at_risk* at $time = 0$, is just equal to the *entrance* value. For the remaining time periods, the *at_risk* value is equal to the difference between the time previous period's *at_risk* value and *removed* value, plus the current period's *entrance* value. For example for $time = 1$, the number of subject's *at risk* is 10,995 which is equal to 15,592 (the previous *at_risk* value) - 4,597 (the previous *removed* value) + 0 (the current period's *entrance* value).

Since we have access to the survival table we can calculate the survival probability at different times "by hand."

Let us take a look at the definition of the Kaplan-Meier Estimate again:

$$\hat{S}(t) = \prod_{t_i \lt t} \frac{n_i - d_i}{n_i}$$where $d_i$ are the number of death events at time $t$ and $n_i$ is the number of subjects at risk of death just prior to time $t$.

What the above essentially tells us is that the value of the survival function for time $t$, is the product of the survival probabilities for all individual time periods leading up to time $t$.

We can define the survival probability for an individual time period as follows:

$$S_t = \frac{\substack{\text{Number of subjects} \\ \text{at risk at the start}} - \substack{\text{Number of subjects} \\ \text{that died}}}{\substack{\text{Number of subjects} \\ \text{at risk at the start}}}$$**NOTE** the number of deaths in the above formula does not include the number of censored observations.

Lets walk through a simple example and calculate the the probability that an NFL career lasts longer than 2 years. First we calculate the individual survival probabilities for $t = 0$, $t = 1$, and $t = 2$.

Here's the calculation for the survival probability time for $t = 0$:

$$S_0 = \frac{\substack{\text{Number of players at risk at the start} \\ \text{(i.e. Number of players drafted)}} - \substack{\text{Number of players} \\ \text{that immediately failed}}}{\substack{\text{Number of players at risk at the start} \\ \text{(i.e. Number of players drafted)}}} = \frac{15,592 - 4,504}{15,592} = \frac{11,088}{15,592} \approx 0.711$$And the code for the calculation:

```
# get the values for time = 0 from the survival table
event_at_0 = kmf.event_table.iloc[0, :]
# now calculate the survival probability for t = 0
surv_for_0 = (event_at_0.at_risk - event_at_0.observed) / event_at_0.at_risk
surv_for_0
```

What the above means is that about 71.1% of players drafted make it on to the field.

Now the individual survival probability for $t = 1$:

$$S_1 = \frac{\substack{\text{Number of players} \\ \text{that survive the draft}} - \substack{\text{Number of players} \\ \text{that failed in the 1st year}}}{\substack{\text{Number of players} \\ \text{that survive the draft}}} = \frac{10,995 - 1,076}{10,995} = \frac{9,919}{10,995} \approx 0.902$$```
# Calculate the survival probability for t = 1
event_at_1 = kmf.event_table.iloc[1, :]
surv_for_1 = (event_at_1.at_risk - event_at_1.observed) / event_at_1.at_risk
surv_for_1
```

The value for $S_1$ represents the conditional probability that if a player does not immediately fail once drafted, then he has a 90.2% chance of playing 1 year of football.

Below is the calculation for $S_2$:

$$S_2 = \frac{\substack{\text{Number of players that survive the} \\ \text{1st year and are entering the 2nd year}} - \substack{\text{Number of players} \\ \text{that failed in the 2nd year}}}{\substack{\text{Number of players that survive the} \\ \text{1st year and are entering the 2nd year}}} = \frac{9,685 - 1,176}{9,685} = \frac{8,509}{9,685} \approx 0.879$$```
# Calculate the survival probability for t = 2
event_at_2 = kmf.event_table.iloc[2, :]
surv_for_2 = (event_at_2.at_risk - event_at_2.observed) / event_at_2.at_risk
surv_for_2
```

$S_2$ also represents a conditional probability. It is the probability that a player plays in their 2nd year given that he did not retire after his 1st year. This ends up being about 87.9%.

Finally to calculate the probability that an NFL career will last more than 2 years, we just multiply the three individual survival probabilities:

$$S(2) = S_0 \times S_1 \times S_2 = \frac{11,088}{15,592} \times \frac{9,919}{10,995} \times \frac{8,509}{9,685} \approx 0.564$$```
# The probability that an NFL player has a career longer than 2 years
surv_after_2 = surv_for_0 * surv_for_1 * surv_for_2
surv_after_2
```

So we see that drafted players have about a 56.4% chance of making it past their 2nd year, or having a career as long as 2 years. Hopefully going through that short example gives you a better idea of how the Kaplan-Meier estimator works.

Our `KaplanMeierFitter`

object has already done all of the above calculations for us. We can get the survival probability after a given time by simply using the `predict`

method. So to get the value for $S(2)$ we just pass in 2 into the `predict`

method.

```
kmf.predict(2)
```

That's pretty close to the value we calculated by hand. (I'm not sure why they aren't exactly the same. Possibly a rounding issue? If you do know why please let me know).

The `predict`

method can also handle an array of numbers, returning an array of probabilities.

```
# The survival probabilities of NFL players after 1, 3, 5, and 10 yrs played
kmf.predict([1,3,5,10])
```

To get the full list of estimated probabilities from our `KaplanMeierFitter`

, access the `survival_function_`

attribute.

```
kmf.survival_function_
```

The `median_`

attribute also provides us the number of years where on average 50% of players are out of the league.

```
kmf.median_
```

## Plotting the Kaplan-Meier Estimate¶

Plotting the Kaplan-Meier estimate (along with its confidence intervals) is pretty straightfoward. All we need to do is call the `plot`

method.

```
# plot the KM estimate
kmf.plot()
# Add title and y-axis label
plt.title("The Kaplan-Meier Estimate for Drafted NFL Players\n(1967-2015)")
plt.ylabel("Probability a Player is Still Active")
plt.show()
```

The first thing thing that you should notice is that the Kaplan-Meier estimate is a step function. Each horizontal line represents the probability that a player is still active after a given time $t$. For example, when $t = 0$, the probability that a player is still active after that point is about 71%.

### Plotting the Kaplan-Meier Estimate by Position¶

Before we plot the career lengths by position, lets clean up some of the data. We will merge and drop some of the player positions in order to make the plotting a bit more manageable.

```
draft_df.Pos.unique() # check out all the different positions
```

```
draft_df.Pos.value_counts() # get a count for each position
```

```
# Relabel/Merge some of the positions
# Set all HBs to RB
draft_df.loc[draft_df.Pos == "HB", "Pos"] = "RB"
# Set all Safeties and Cornernbacks to DBs
draft_df.loc[draft_df.Pos.isin(["SS", "FS", "S", "CB"]), "Pos"] = "DB"
# Set all types of Linebackers to LB
draft_df.loc[draft_df.Pos.isin(["OLB", "ILB"]), "Pos"] = "LB"
```

```
# drop players from the following positions [FL, E, WB, KR, LS, OL]
# get the row indices for players with undesired postions
idx = draft_df.Pos.isin(["FL", "E", "WB", "KR", "LS", "DL", "OL"])
# keep the players that don't have the above positions
draft_df_2 = draft_df.loc[~idx, :]
```

```
# check the number of positions in order to decide
# on the plotting grid dimiensions
len(draft_df_2.Pos.unique())
```

Now that we have the data organized, lets plot the Kaplan-Meier estimate for each position. I've commented the code below to walk you through the process of plotting each position in a 5x3 plotting grid.

```
# create a new KMF object
kmf_by_pos = KaplanMeierFitter()
duration = draft_df_2.Duration
observed = draft_df_2.Retired
# Set the order that the positions will be plotted
positions = ["QB", "RB", "WR",
"TE", "T", "G",
"C", "DE", "DT",
"NT", "LB", "DB",
"FB", "K", "P"]
# Set up the the 5x3 plotting grid by creating figure and axes objects
# Set sharey to True so that each row of plots share the left most y-axis labels
fig, axes = plt.subplots(nrows = 5, ncols = 3, sharey = True,
figsize=(12,15))
# flatten() creates a 1-D array of the individual axes (or subplots)
# that we will plot on in our grid
# We zip together the two 1-D arrays containing the positions and axes
# so we can iterate over each postion and plot its KM estimate onto
# its respective axes
for pos, ax in zip(positions, axes.flatten()):
# get indices for players with the matching position label
idx = draft_df_2.Pos == pos
# fit the kmf for the those players
kmf_by_pos.fit(duration[idx], observed[idx])
# plot the KM estimate for that position on its respective axes
kmf_by_pos.plot(ax=ax, legend=False)
# place text indicating the median for the position
# the xy-coord passed in represents the fractional value for each axis
# for example (.5, .5) places text at the center of the plot
ax.annotate("Median = {:.0f} yrs".format(kmf_by_pos.median_), xy = (.47, .85),
xycoords = "axes fraction")
# get rid the default "timeline" x-axis label set by kmf.plot()
ax.set_xlabel("")
# label each plot by its position
ax.set_title(pos)
# set a common x and y axis across all plots
ax.set_xlim(0,25)
ax.set_ylim(0,1)
# tighten up the padding for the subplots
fig.tight_layout()
# https://stackoverflow.com/questions/16150819/common-xlabel-ylabel-for-matplotlib-subplots
# set a common x-axis label
fig.text(0.5, -0.01, "Timeline (Years)", ha="center")
# set a common y-axis label
fig.text(-0.01, 0.5, "Probability That a Player is Still Active",
va="center", rotation="vertical")
# add the title for the whole plot
fig.suptitle("Survival Curve for each NFL Position\n(Players Drafted from 1967-2015)",
fontsize=20)
# add some padding between the title and the rest of the plot to avoid overlap
fig.subplots_adjust(top=0.92)
plt.show()
```

## Checking the Conditional Survival Time¶

Another interesitng attribute in our `KaplanMeierFitter`

is the `conditional_time_to_event_`

. It is a `DataFrame`

that contains the estimated median remaining lifetime, conditioned on surviving up to time $t$. So from the table below we see that if a player is in the league for 1 year, their expected remaining career length is 5 years. Please note that some of the conditional survival times for later time values are a bit funky due to the smaller sample sizes of those time periods.

```
kmf._conditional_time_to_event_()
```

# Resources¶

Here are the resources I used to help write up this post and learn about survival analysis:

## Papers, Articles, and Documentation¶

- The
`lifelines`

documentation - The PDF to the original paper by Kapalan and Meier
- Survival Analysis: A Self Learning Text
- A Practical Guide to Understanding Kaplan-Meier Curves
- Understanding survival analysis: Kaplan-Meier estimate
- What is Survival Analysis (PDF)
- A short article by Kaplan

## Videos¶

- Lifelines: Survival Analysis in Python, by Cameron Davidson-Pilon (the creator of the
`lifelines`

library) - Survival Analysis in Python and R, by Linda Uruchurtu

As always you can find my code and data on github. Please let me know if you see any mistakes/issues or have any suggestions on improving this post.

## Comments

comments powered by Disqus