# When Less Data is Enough Data

Jordan Bean · Mar 26

The world is moving faster and we want answers in real-time, or at least an answer that’s better than a guess. A question I often have is when does a sample reach critical mass to gain conviction? It’s somewhere between 0% of the data and 100%. But where?

As an example, let’s take the Major League Baseball season. The goal of each team is to make the playoffs. As a fan, though, with 162 games in a season, at what point can I have confidence one way or the other on whether my team will be a contender?

What if we wanted to know when the season “matters”? At what point in the season can we predict with reasonable certainty whether the team will make the playoffs or not?

Communicating data is about speaking two languages — one for the people that only want to know the conclusions and another for the ones that want to know the details and process. The former doesn’t care about the math and process — they just want the answer and to know they can trust that answer — and the latter doesn’t trust the conclusions without knowing the process. It’s our job to speak both languages.

The article will start with the fan view — in effect, an executive summary of the findings — followed by a more detailed analyst view of how we arrived at those findings.

**The Fan: When should I care?**

The short answer is, even after a single game there’s already separation between the playoff and non-playoff teams. In a world where all teams are equal in game one, playoff and non-playoff teams would both average 0.5 wins. Instead, playoff teams average 0.59 wins and non-playoff teams 0.47 after 1 game. The spread between teams only grows as the season progresses:

About a quarter of the way into the season (~40 games), there’s already a difference of ~4 wins between the average wins of playoff and non-playoff teams. If your team has 23 or more wins, they’re looking pretty good. If they have 19 or fewer, well, not so much. Using just the number of wins that the team has, there’s a ~80% chance that a model will correctly classify a team as having made or not made the playoffs.

Go ahead and click around for yourself on the interactive chart below to see the distribution of wins at each game number:

At the All-Star break (~halfway through the season), the model has an 87% chance of classifying a positive value correctly, and the number of wins begins to show a clear delineation between those destined for the playoffs and those that aren’t. Playoff teams have about 8 more wins by this point, on average, and the distributions show separation between the “lower playoff teams” (25th percentile of playoff-team wins) and the “upper non-playoff teams” (75th percentile of non-playoff-team wins).

What about the September stretch? Teams play ~25 games in September, so that would correspond to starting at game number ~135–140 on the graph above. There’s a difference of about 15 wins on average by this point and our predictability increases to ~95%.

So, if you want to be ~80% confident on whether your team will make the playoffs or not, check whether they have 23+ wins by their 40th game. If they have 46+ wins by the All-Star break, things are really looking good. And, if going into the stretch your team has 80+ wins, kick back and wait for the playoffs because chances are you’ll be playing October baseball.

**The Analyst: Let’s get technical**

*Project code in R for data manipulation and Python for modeling / visualization is **available on GitHub here**.*

*Creating the “Perfect” data set*

Effective problem solving starts with accurately framing the question and preparing the data. I knew the question I wanted to answer — *Approximately how many games does it take to develop confidence in whether a team will make the playoffs or not?* — I just didn’t yet know how to answer it.

In a course I’m taking right now for the MSBA program I’m in, we’re covering probability distributions and quantifying uncertainty. I wondered whether taking a probabilistic approach would be best or if a modeling approach would be better.

Ultimately, I settled on fitting a Random Forest classification model to the data after extensive data prep and cleaning. The ideal data set would answer two questions:

At any given point in a season, how many wins does a team have, and did the team make the playoffs that year? Then, with this data set, we can build a model for each game number and compute accuracy metrics on it to reach our desired level of confidence. After much trial and error, I landed on a data set that looks like this:

For each year from 1990 to 2018 and each team, we have the game number, the number of wins by that point, and whether the team made the playoffs. I also added a “wins above mean” column: the team’s wins minus the mean wins across teams for that year and game number.
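As a rough illustration of that last column, here is a minimal pandas sketch. It is not the project's actual data prep (which was done in R). The `year` and `team` column names, the `wins_above_mean` name, and the four toy rows are assumptions made up for the example; `game_num`, `num_wins_team`, and `playoff_flag` follow the names used in the modeling code.

```python
import pandas as pd

# Made-up toy rows: cumulative wins for four teams after game 40 of one season.
wins = pd.DataFrame({
    "year": [2018, 2018, 2018, 2018],
    "team": ["A", "B", "C", "D"],
    "game_num": [40, 40, 40, 40],
    "num_wins_team": [25, 21, 18, 16],
    "playoff_flag": [1, 1, 0, 0],
})

# Mean wins across all teams for the same year and game number,
# broadcast back onto every row of that group.
group_mean = wins.groupby(["year", "game_num"])["num_wins_team"].transform("mean")

# Positive values mean the team is ahead of the league average at that point.
wins["wins_above_mean"] = wins["num_wins_team"] - group_mean

print(wins[["team", "num_wins_team", "wins_above_mean"]])
```

Grouping by both year and game number keeps the comparison fair: a team is only measured against other teams at the same point of the same season.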

*Modeling and Interpretation of Results*

The nature of this question lends itself well to a classification problem; every year, a team either makes the playoffs or does not. While a number of classification models were tested (KNN, Gradient Boosting, Logistic Regression), I settled on a Random Forest as it’s powerful, interpretable, and relatively fast.

I started by creating a single model and calculating the ROC AUC score (which has a surprisingly non-intuitive meaning, as I learned when researching how to interpret it). The results are below:
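On the meaning of the score: ROC AUC equals the probability that a randomly chosen positive example (a playoff team) receives a higher score than a randomly chosen negative one (a non-playoff team), with ties counted as half. The sketch below checks that definition by brute force. The helper name `pairwise_auc` and the win totals are made up for illustration, and raw win counts stand in for model scores.

```python
from itertools import product


def pairwise_auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs in which the positive
    example has the higher score. Ties count as half a win."""
    total = 0.0
    for pos, neg in product(positive_scores, negative_scores):
        if pos > neg:
            total += 1.0
        elif pos == neg:
            total += 0.5
    return total / (len(positive_scores) * len(negative_scores))


# Made-up win totals after some fixed game number.
playoff_wins = [25, 23, 21]
non_playoff_wins = [22, 19, 18, 16]

auc = pairwise_auc(playoff_wins, non_playoff_wins)
print(round(auc, 3))
```

Here 11 of the 12 possible pairs are ordered correctly (only the 21-win playoff team trails the 22-win non-playoff team), so the score is 11/12. A score of 0.5 would mean the ordering is no better than a coin flip.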

Next, I plotted the ROC curve for a variety of game numbers to see how the curve shifts as we increase the sample size of games. The red dotted line represents a completely random guess and the further up and to the left we are, the better the model performs:
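To make the shape of such a curve concrete, here is a minimal sketch that traces ROC points by sweeping a win threshold. It is a simplification: it uses raw win counts as the score rather than the Random Forest's predicted probabilities, and the function name `roc_points` and the toy data are made up for illustration.

```python
def roc_points(wins, made_playoffs):
    """Return (false_positive_rate, true_positive_rate) pairs, one per
    win threshold, starting from the strictest threshold."""
    positives = sum(made_playoffs)
    negatives = len(made_playoffs) - positives

    # A threshold above every team predicts no playoff teams at all.
    points = [(0.0, 0.0)]
    for threshold in sorted(set(wins), reverse=True):
        true_pos = sum(
            1 for w, y in zip(wins, made_playoffs) if w >= threshold and y == 1
        )
        false_pos = sum(
            1 for w, y in zip(wins, made_playoffs) if w >= threshold and y == 0
        )
        points.append((false_pos / negatives, true_pos / positives))
    return points


# Made-up example: wins after some game number and the playoff outcome (1 = made it).
wins = [25, 23, 22, 21, 19, 18, 16]
made_playoffs = [1, 1, 0, 1, 0, 0, 0]

points = roc_points(wins, made_playoffs)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Each lower threshold labels more teams as playoff teams, so the curve moves up when the newly included team really made the playoffs and to the right when it did not. Better separation between the two groups keeps the curve closer to the top-left corner.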

Now, this model and the resulting calculations are based on just a single model. Each time we run it, we would get different results. To account for this, I decided to run a series of simulation models to better understand the spread of data at each game number.

I ran this 500 separate times, each time choosing a different partitioning of the data (a “random” random state), with a series of randomly chosen model parameters:

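```python
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# `wins` is the prepared DataFrame described above
# (one row per team, year, and game number).
num_runs = 0
desired_runs = 500

# Use < rather than <= so the loop body runs exactly 500 times.
while num_runs < desired_runs:
    for game_number in range(1, 163):
        wins_game_num = wins[wins.game_num == game_number]

        # Pass both the feature and the label so the split returns four pieces.
        x_train, x_test, y_train, y_test = train_test_split(
            wins_game_num[['num_wins_team']],
            wins_game_num['playoff_flag'],
            test_size=0.4,
            random_state=random.randint(0, 100),
            stratify=wins_game_num['playoff_flag'])

        # 'auto' is omitted: scikit-learn 1.3 removed it as a max_features option.
        rf_classifier = RandomForestClassifier(
            n_estimators=random.randint(15, 500),
            max_depth=random.randint(10, 100),
            max_features=random.choice(['sqrt', 'log2', None]),
            min_samples_leaf=random.randint(1, 25))

        ...

    # <Remaining code> (elided in the source; it must also increment num_runs)
```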

This resulted in 500 model simulations computing the ROC AUC score from which we can plot the mean score as well as the range around that score at each game number:

Variability decreases as the game number increases — indicating more certainty in our results — and overall we see that we can quickly improve on a random guess (0.5) to distinguish between the playoff and non-playoff teams based on number of wins.
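The mean and range at each game number can be computed with a simple grouping step. The sketch below assumes each simulation run records a `(game_number, roc_auc)` pair; that storage format, the variable names, and the scores themselves are made up for illustration and are not results from the project.

```python
from collections import defaultdict
from statistics import mean

# Made-up (game_number, roc_auc) pairs standing in for recorded simulation output.
simulated_scores = [
    (1, 0.52), (1, 0.58), (1, 0.55),
    (81, 0.84), (81, 0.86), (81, 0.85),
]

# Collect every run's score under its game number.
scores_by_game = defaultdict(list)
for game_number, auc in simulated_scores:
    scores_by_game[game_number].append(auc)

# Mean score plus the min/max band around it for each game number.
summary = {
    game_number: {"mean": mean(scores), "min": min(scores), "max": max(scores)}
    for game_number, scores in sorted(scores_by_game.items())
}

for game_number, stats in summary.items():
    spread = stats["max"] - stats["min"]
    print(f"game {game_number}: mean={stats['mean']:.3f}, spread={spread:.3f}")
```

Plotting the mean as a line with the min/max band shaded around it gives the kind of chart described above, where a narrowing band signals more stable results.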

**Conclusion**

In my work at Stax Inc., we’re always looking for ways to gain inferences and insights faster. The Private Equity market is competitive, management teams need to react and adapt quicker, and our work needs to reflect this heightened push for speed.

Here, we can gain ~80% of the predictive power looking at ~25% of the data. This isn’t to say that this ratio will always be generalizable, but that inferences can be drawn well before 100% of the data is collected and analyzed in many circumstances.

Source: Towards Data Science