Hey all! As the PGA season truly gets underway this week, I wanted to share some of the insight behind my model, along with the findings I consider most enlightening, surprising, and actionable for DFS purposes. I keep saying “model,” but in truth there are five: one projects earnings, one projects made cut odds, and the last three project the probabilities of a top 10, top 3, and first place finish. I will be sharing the models in their entirety in FTA+ chat (message me @alexblickle1, @jac3600, @thoreosNmilk, or @ftakj on Twitter for subscription options or a free trial).
Statistics of Interest
In order to project each relevant result, I used the four strokes gained statistics (off the tee, approach the green, around the green, and putting), bogey avoidance (% of holes that a player makes bogey or worse), birdie or better percentage (% of holes that a player makes birdie or better), and past performance in the relevant stat (for example, made cut % from the previous year to help project future made cut odds). All data is from the 2016-2018 PGA Tour seasons, including all players who played at least 50 rounds each year (a sample size of 109 golfers). To summarize, the stats of interest are:
- Earnings per event, MC %, top 10%, top 3%, win %
Descriptive vs Predictive
Before we begin, I feel it’s necessary to provide a distinction between descriptive statistics and predictive statistics, with PGA related context. In short, descriptive statistics tell us what happened, while predictive statistics help us forecast what will happen. Obviously, in DFS we want the latter! This idea brings us to the first actionable finding.
Year-to-Year Stability of Each Statistic of Interest
If past performance in a statistic predicts future performance in that statistic well, the stat is considered stable. A stat that is highly volatile, and therefore cannot be predicted well by the same stat in the past, is considered unstable. Let’s examine the stability of each statistic of interest by finding the r^2 when regressing each stat in year n+1 on the same stat from year n. For those unfamiliar with statistics jargon, r^2 tells us the proportion of the variation in the response variable that is explained by the variation in the explanatory variable(s). Let’s use SG:OTT as a clarifying example.
When regressing SG:OTT in year n+1 on SG:OTT from year n (2018 on 2017 or 2017 on 2016), the r^2 is .6162. So, 61.62% of the variation in a season’s SG:OTT can be explained by the variation in the previous season’s SG:OTT. As you’ll see, this number is extraordinarily high! Here are the rest of the stability measurements:
This is actionable info numero uno. The stability of off the tee performance makes it inherently predictive. If this concept doesn’t click for you, think of it this way: What does knowing that putting matters (putting is highly descriptive) do for us if we have no way of predicting future putting performance since it’s so unstable? Since performance off the tee, on the other hand, can be predicted reasonably by past performance off the tee, its descriptiveness carries over into predictiveness.
Put another way, we can expect players who have driven the ball well to continue doing so. We can’t, with any real confidence, project players who have putted well to keep doing so.
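For anyone who wants to replicate the stability idea, here’s a minimal Python sketch of the year-over-year regression. The data is made up (a latent skill plus noise for 109 golfers); the seed and noise levels are hypothetical illustrations, not my actual dataset:

```python
import numpy as np
from scipy import stats

# Hypothetical SG:OTT values for the same 109 golfers in consecutive seasons.
rng = np.random.default_rng(0)
true_skill = rng.normal(0.0, 0.5, size=109)            # latent driving skill
sg_ott_year_n = true_skill + rng.normal(0.0, 0.3, 109)   # observed in year n
sg_ott_year_n1 = true_skill + rng.normal(0.0, 0.3, 109)  # observed in year n+1

# Regress year n+1 on year n; r^2 measures how stable the stat is.
result = stats.linregress(sg_ott_year_n, sg_ott_year_n1)
r_squared = result.rvalue ** 2
print(f"stability r^2 = {r_squared:.3f}")
```

A stat dominated by noise (think putting) would show a much lower r^2 in the same setup, simply because the noise term would swamp the latent skill term.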
More Data is Better
Before we get into the models, a note: for each model, I ran three variations. The first included all pairs of year n+1 results on year n stats. For the second, I looked at 2018 results on a weighted average of 2016 and 2017 stats (weighted by number of rounds played). Finally, I looked at just 2018 on 2017 stats. Without question, the most powerful variation is 2018 results on the weighted averages from 2016 and 2017. I’ll use the earnings model as an example. Using all year n+1 and year n pairings results in an r^2 just under .5. 2018 earnings on 2017 stats yields an r^2 just over .5. However, 2018 earnings on the weighted average stats makes r^2 jump to an impressive .58. *I’ll note here for reproducibility that due to the top-heavy payout structures of PGA Tour events, I actually regress the natural log of earnings on the stats, then convert the results back to a full earnings projection.* Anyway, the massive leap in explained variance when including 2016 stats tells us a lot…
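To make the weighted-average and log-earnings steps concrete, here’s a small sketch. The four golfers’ rounds, SG values, and earnings are invented for illustration, not the real inputs:

```python
import numpy as np
from scipy import stats

# Hypothetical per-golfer inputs: rounds played and an SG stat for two seasons,
# plus next-season earnings. All values are toy numbers.
rounds_2016 = np.array([70.0, 90.0, 60.0, 85.0])
rounds_2017 = np.array([80.0, 75.0, 95.0, 88.0])
sg_2016 = np.array([0.5, -0.2, 1.1, 0.3])
sg_2017 = np.array([0.7, 0.0, 0.9, 0.4])
earnings_2018 = np.array([2.1e6, 6.0e5, 3.5e6, 1.4e6])

# Weight each season's stat by rounds played.
sg_weighted = (sg_2016 * rounds_2016 + sg_2017 * rounds_2017) / (
    rounds_2016 + rounds_2017
)

# Because payouts are so top heavy, regress ln(earnings) on the weighted stat,
# then exponentiate to convert back to a dollar projection.
fit = stats.linregress(sg_weighted, np.log(earnings_2018))
projected_earnings = np.exp(fit.intercept + fit.slope * sg_weighted)
```

The same pattern extends to multiple predictors with `statsmodels` or `sklearn`; `linregress` just keeps the single-stat sketch short.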
Does Course History Matter?
I hypothesize that course history does not matter. Think about it this way: despite having an entire season’s worth of data, the models become vastly more powerful when more data is included, even though that data is over a year old and is being added to an already large sample (the average number of rounds in 2017 for the sample is 81). Yet people cite course history all the time, despite the fact that the typical course history sample size is about 10-30 rounds. There’s simply not enough data for course history to be predictive, in my opinion. I recommend using course history in a broader sense (groups of guys rather than individuals) to identify which type of player performs best at a course.
The Models, For Real This Time
Below is a chart showing which stats are included in each model. The number in the box is the estimated coefficient. At the top you’ll find the r^2 and adjusted r^2 of each model. Again, the higher the r^2, the more powerful the model. Underneath the model’s r^2 is the stability of the Statistic of Interest, illustrating how much better we can estimate each by including the SG stats, bogey avoidance, and birdie or better percentage.
Predictiveness of Each Model
I’m thrilled with the predictiveness of each model. It’s not surprising that the Earnings and Made Cut Odds Models lead the way, but the predictiveness of the Top 10% Model is astounding to me. The power of the Top 3 and Win % Models drops off a bit, but they’re still significantly higher than I was anticipating.
Stability of Top 10%
Look at the stability of Top 10%! The SG stats and such vastly improve the Top 3 and Win % Models, but Top 10% doesn’t need much help (the stats only move r^2 from .49 to .54). I’m thinking this tells us that we should be looking at Top 10% more when discussing player of the year, best of all time, etc., as its stability suggests it may be capturing more true talent/skill/clutch-ness than any other result. From a DFS standpoint, lots of people like to look at past made cut % as a measure of “safety” for a player, particularly because made cuts are displayed on the lineup building page. If you want a small but real edge, with minimal extra work, look at Top 10% instead!
Let’s Talk About Bogey Avoidance
Of everything I found, this is my favorite. Take a look at the model coefficients for Bogey Avoidance. Only one is intuitive at first glance, and that’s the Made Cut Model. The negative estimate says that the fewer bogeys a player makes, the higher probability he has of making each cut. Duh, right? Well, everywhere else that bogey avoidance is statistically significant (and it’s very significant; this is no accident), the coefficient is positive, meaning the more bogeys, the better. I struggled with this result at first… a lot.
Why would making more bogeys lead to higher expected earnings and top 10/3/1 percentages? Assuming something was wrong, I checked and found there are no multicollinearity concerns. Next, I removed the SG stats and BoB from each of the models, and sure enough, BA had a negative and statistically significant estimated coefficient. Then it hit me: a higher bogey %, all else equal, means the player is higher variance. Since top 10s, top 3s, and wins are rare events, being higher variance is beneficial. This also explains the higher expected earnings, since payout structures are so top heavy.
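A quick Monte Carlo sketch illustrates the variance point: two hypothetical players with the same average performance but different volatility, ranked against a simulated 100-golfer field. All the distributions and numbers here are toy assumptions, not fitted to real scoring data:

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims = 20_000

# 99 other golfers' tournament scores per simulation (higher = better,
# think total strokes gained for the week).
field = rng.normal(0.0, 1.0, size=(n_sims, 99))

# Two hypothetical players with identical mean performance but different
# week-to-week variance (the "low bogey %" vs "high bogey %" profiles).
steady = rng.normal(0.0, 0.8, size=n_sims)
volatile = rng.normal(0.0, 1.5, size=n_sims)

def top10_rate(player, field):
    # Top 10 in a 100-golfer field: fewer than 10 of the 99 others beat you.
    better = (field > player[:, None]).sum(axis=1)
    return (better < 10).mean()

print("steady  top-10 rate:", top10_rate(steady, field))
print("volatile top-10 rate:", top10_rate(volatile, field))
```

The volatile player cracks the top 10 noticeably more often despite the identical mean, which is exactly the mechanism behind the positive bogey-% coefficients in the rare-event models.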
Let’s use Henrik Stenson as an example. Stenson has low projected percentages across the board (except for made cut %, which is very high). Not surprisingly, he has an extraordinarily low bogey %. So, if we simply raised his bogey %, would his projected probabilities improve? The short answer is no, but here’s the longer answer. If we raised his bogey percentage, one of two things would have to happen: either his birdie or better percentage would have to increase accordingly, or his strokes gained stats would suffer. In the latter case, the decrease in SG stats would outweigh the increased bogey %, leading to a decrease in odds across the board. In the former, his odds would improve across the board, as he would effectively be making himself a higher-variance player. This suggests he could benefit from being more aggressive as a whole, particularly off the tee, where he’s famous for hitting his 3-wood over and over. The next time you’re torn between two players, take the one with higher BA and BoB percentages.
Where’s the Edge?
To me, the biggest edge we’ve found here is the importance of driving. Performance off the tee is highly significant, with a relatively high coefficient in every model. Stats that are more commonly looked at, like iron play and even putting, have lower coefficients or aren’t even significant in the models. Thus, the biggest edge to me is fading chalky players who are strongest in those areas for lower owned players who excel off the tee.
I realize parts of this article had to be a little heavy on the statistics jargon, so Thor and I will be releasing a podcast tonight to discuss the findings and model in a more relatable way. Thank you all for sticking it out to the end with me! Let’s have a great season.