Special Teams Part II: What Drives Power Play and Penalty Kill Success?

Introduction

Last post we looked at the repeatability of earning more power-plays per game than the opponent, and the influence of outdrawing the opponent on overall success (in terms of points/game). In summary, if you want to look at PP Opp% (the proportion of total minor penalties that a team draws), you have to account for score effects. Teams that are trailing draw approximately 8% more power-plays than teams that are leading. In addition, home/road, back-to-backs, and possibly strength of schedule should be adjusted for as well. After adjusting for score, a team’s ability to outdraw its opponent regresses approximately 70% back to league average at the halfway point of a season. Using only PP Opp% to predict future points, we can say that a team 1 standard deviation above the mean for PP Opp% will be 0.3 standard deviations above the mean for points per game, on average. Put another way, PP Opp% accounts for about 10% of the variance in points per game.

Today we will move on to the meat, potatoes, and Sierra Nevada of special teams. A lot more work has been done on this subject, but it still pales in comparison to the work done at even-strength. The PP and the PK are different beasts and require different skill sets. We know from previous studies that shot rates on the PP are much more repeatable than shooting percentage, just like at even-strength. Further, it appears that Fenwick/60 may be the best indicator of future PP success, while on the PK Sv% appears to be much more reliable, so the picture is mixed. Given that a team’s effect on PK Sv% is very close to none, we can say that the goaltender’s influence significantly changes our approach to analyzing the PK as compared to the PP. Most recently, a pilot study has shown that zone entries are difficult to come by, with further zone entry analysis expected hopefully later this summer or at the beginning of next year.

This leaves us with a few questions to consider. First, all of our data on special teams comes from samples of less than three years, barely enough to draw conclusions. We have more years of data now, so we can check whether those earlier findings hold up. Second, none of the previous studies compared the PP and PK to overall success (in terms of pts/game), which we will do today.

We’ll be compiling a lot of information into tables today, so get your statistical hard hat on.

Methods

NHL game summaries and play-by-play (PBP) data from 2007-2012 were used to generate a game-by-game dataset. The data contain 5v4 non-empty-net situations only; everywhere PP or PK is used, it indicates 5v4 non-empty net or 4v5 non-empty net, respectively. Because we are dealing with shot rate stats, I corrected for scorer bias in the recording of misses and blocked shots. This was accomplished by dividing the total shots (misses and blocked shots) recorded at home by the total recorded on the road. The variance of this distribution was then compared to the variance in goals (total home vs. total road) and corrected for the added sample size.

I then created a simulation model that randomizes each team-year’s games (1-82), calculates the statistic of interest (Table 1 below) from the game-by-game data for that team, and compares it to an out-of-sample variable of interest (the same statistic, Pts, or GF/60 for the PP or GA/60 for the PK). This was repeated 1000 times for each game bin. I’ll note here that for analyzing team statistics, it does us little good to go beyond 41 games, because teams simply vary too much in roster, schedule, and competition. Ultimately we are interested in how teams perform over a subset of an entire year (playoffs notwithstanding), which is why I chose 5->75, 10->70, 20->60, and 40->40 bins. Also, correlations work in both directions when the variables are the same (as in our %regression to the mean analysis), so 5->75 also represents 75->5, for example.
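To make the procedure concrete, here is a minimal Python sketch of the shuffle-and-split idea. It is not the VBA simulator itself: the function names, the `(events, minutes)` per-game layout, and the use of a pooled rate per 60 are all assumptions made for illustration.

```python
import random
from math import sqrt

def rate_per_60(games):
    """Pooled rate: total events / total minutes * 60."""
    events = sum(e for e, _ in games)
    minutes = sum(m for _, m in games)
    return 60.0 * events / minutes

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def split_correlation(team_games, n_in, n_out, n_sims=1000, seed=0):
    """Mean in-sample vs. out-of-sample correlation over random game orders.

    team_games: one list per team-season of (events, minutes) tuples,
    one tuple per game (hypothetical layout).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        ins, outs = [], []
        for games in team_games:
            order = rng.sample(games, len(games))  # shuffled copy of the season
            ins.append(rate_per_60(order[:n_in]))
            outs.append(rate_per_60(order[n_in:n_in + n_out]))
        total += pearson(ins, outs)
    return total / n_sims
```

Calling `split_correlation(team_games, 20, 60)` would correspond to the 20->60 bin for a stat correlated with its out-of-sample self; correlating against Pts or GF/60 would need a second per-game series for the out-of-sample side.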

I’ve included the VBA code for my simulator, or I can send you the Excel workbook. One of the major advantages of blogging is the freedom and transparency of information. The more we share, the better and faster our analysis will be.

I’ve also included an expanded list of variables, as well as their 95% confidence intervals for the correlations, in a spreadsheet in the appendix for those interested in the statistical significance of our findings. The intervals were calculated via the Fisher r-to-z transformation.
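For readers who want to reproduce an interval of this kind, here is a short Python sketch of the standard Fisher r-to-z confidence interval. The function name is illustrative, and the sample size of 150 (30 teams times 5 seasons) in the usage line is an assumption; the post does not state the n behind the published intervals.

```python
from math import atanh, tanh, sqrt

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a Pearson correlation r from n paired observations."""
    z = atanh(r)              # Fisher r-to-z transformation
    se = 1.0 / sqrt(n - 3)    # standard error in z space
    return tanh(z - z_crit * se), tanh(z + z_crit * se)

# Assumed n = 150 team-seasons; r = 0.34 is the 40->40 FF/60 vs. GF/60 value.
low, high = fisher_ci(0.34, 150)
```

The interval is symmetric in z space but not in r space, which is why the transformation is used in the first place.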

Table 1. DESCRIPTIVE STATISTICS
	mean(home)	mean(road)	mean(all)	SD(home)	SD(road)	SD(all)
N	12,274	12,274	24,548
Opp	3.61	3.34	3.48	1.61	1.55	1.58
Opp%	0.52	0.48	0.50	0.15	0.15	0.15
Time	6.14	5.69	5.91	2.89	2.79	2.84
PP%	18%	16%	17%	23%	23%	23%
Shooting%	12.38%	12.26%	12.32%	20.94%	21.36%	21.15%
GF/60	6.21	5.79	6.00	24.8	24.85	24.82
SF/60	43.95	41.46	42.71	28.0	26.22	27.12
FF/60	70.71	66.35	68.53	48.4	40.56	44.49
CF/60	94.16	89.21	91.68	50.5	46.10	48.29
GA/60	0.75	0.74	0.75	8.1	12.32	10.23
SA/60	7.39	8.11	7.75	34.0	32.71	33.34
FA/60	10.72	11.52	11.12	38.5	46.16	42.35
CA/60	12.63	13.56	13.10	42.7	56.99	49.82

Skewness/Kurtosis test for normality was significant for all variables listed

N: number of observations (games) in the sample. Opp: Power-play (or Penalty Kill) opportunities. Opp%: Proportion of PP opportunities drawn by a team. Time: 5v4 time. GF/60: Goals For per 60 minutes of 5v4 ice time. SF/60: Shot For per 60 minutes of 5v4 ice time. FF/60: (Goals+Shots+Misses) For per 60 minutes of 5v4 ice time. CF/60: (Goals+Shots+Misses+Blocked Shots) For per 60 minutes of 5v4 ice time. GA/60: Goals Against per 60 minutes of 5v4 ice time. SA/60: Shots Against per 60 minutes of 5v4 ice time. FA/60: (Goals+Shots+Misses) Against per 60 minutes of 5v4 ice time. CA/60: (Goals+Shots+Misses+Blocked Shots) Against per 60 minutes of 5v4 ice time.

For PK data reverse home and road for values.

All data is non-empty net

Table 2. SCORER BIAS
Cumulative 07-12 Team	Misses correction factor	Blocked Shots correction factor
ANA	1.03	0.84
ATL	0.84	0.96
BOS	0.85	0.86
BUF	0.94	0.93
CAR	1.14	1.11
CBJ	0.90	0.94
CGY	1.03	1.01
CHI	0.77	0.91
COL	0.96	1.03
DAL	1.14	1.07
DET	0.99	0.90
EDM	1.08	1.13
FLA	0.98	0.86
L.A	1.16	0.89
MIN	1.03	0.98
MTL	1.04	1.18
N.J	0.83	0.72
NSH	0.99	1.05
NYI	0.97	1.12
NYR	1.07	1.01
OTT	1.00	1.07
PHI	0.99	1.07
PHX	0.97	0.99
PIT	1.04	1.03
S.J	1.07	1.20
STL	1.03	0.91
T.B	0.99	0.97
TOR	1.21	1.17
VAN	0.96	1.02
WPG	1.03	0.92
WSH	0.99	1.13

Table 1 above shows the descriptive statistics used in the model. Additional model variables can be calculated directly from the above. I’ll also note that I calculated the means as total events for / total time * 60. This gives a better estimate of the long-run expected mean. The average of the game-by-game stats is a tad higher. Also, the SDs are approximations based on game-by-game data.
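The difference between the two ways of averaging can be seen in a tiny Python sketch. The three games below are made-up numbers, not data from the study, and the function names are illustrative.

```python
def pooled_rate_per_60(games):
    """Total events / total minutes * 60 across (events, minutes) tuples."""
    events = sum(e for e, _ in games)
    minutes = sum(m for _, m in games)
    return 60.0 * events / minutes

def mean_game_rate_per_60(games):
    """Simple average of each game's own per-60 rate (games with no time skipped)."""
    rates = [60.0 * e / m for e, m in games if m > 0]
    return sum(rates) / len(rates)

# Hypothetical games: (goals, 5v4 minutes)
games = [(1, 4.0), (0, 8.0), (1, 6.0)]
pooled = pooled_rate_per_60(games)        # about 6.67
per_game = mean_game_rate_per_60(games)   # about 8.33
```

Games with little 5v4 time produce extreme per-game rates, so the simple average can drift away from the pooled rate, which weights every minute equally.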

Table 2 shows the correction for scorer bias I used for each team. The correction was applied by dividing total missed shots (home and away) by the factor listed for the home team, and likewise for blocked shots.
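As a small illustration of how the Table 2 factors are applied, here is a Python sketch. The dictionary and function names are hypothetical, and only two teams’ factors are copied in for brevity.

```python
# Correction factors copied from Table 2 for two teams only; the rest are omitted.
MISS_FACTOR = {"TOR": 1.21, "N.J": 0.83}
BLOCK_FACTOR = {"TOR": 1.17, "N.J": 0.72}

def adjust_game(home_team, misses, blocks):
    """Return scorer-bias-adjusted (misses, blocks) for a game in home_team's rink.

    The counts are the totals recorded in that rink for both teams combined,
    each divided by the home team's factor.
    """
    return misses / MISS_FACTOR[home_team], blocks / BLOCK_FACTOR[home_team]

# A rink whose factor is above 1 has its recorded counts scaled down.
adj_misses, adj_blocks = adjust_game("TOR", 12.1, 11.7)
```

A factor below 1, such as New Jersey’s, scales the recorded counts up instead.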

This study has the power to show non-zero correlations with statistical significance, but only rarely are the differences between correlations statistically significant (see the appendix for 95% confidence intervals). Five years of data isn’t a massive amount, but it’s more than three and improves upon our previous studies. It may take 10-20+ years to show statistically significant differences between variables.

Results

We will begin by looking at how reliable our statistics are. The best and most often used measure of reliability is %regression to the mean, which is calculated as 1 minus the correlation (1-r) of a variable with its out-of-sample self. We can use %regression to the mean to predict the future value of that stat. It’s like golf: lower values are better. See the appendix for an example of how to regress to the mean.

PP %Regression to the Mean
Games	Sh%	PP%	GF/60	SF/60	FF/60	CF/60
5->75	0.97	0.89	0.88	0.69	0.57	0.49
10->70	0.95	0.86	0.83	0.59	0.45	0.36
20->60	0.94	0.81	0.78	0.49	0.34	0.26
40->40	0.93	0.78	0.75	0.43	0.29	0.21
Games	GD/60	SD/60	FD/60	CD/60	Shot%	Fen%	Corsi%
5->75	0.89	0.71	0.60	0.52	0.81	0.76	0.75
10->70	0.85	0.61	0.48	0.39	0.73	0.68	0.66
20->60	0.80	0.51	0.38	0.29	0.66	0.59	0.57
40->40	0.77	0.45	0.32	0.24	0.61	0.54	0.52

Column acronyms are listed in Table 1. The “D” in GD etc. represents differential, e.g. GD/60 = GF/60 – GA/60. Shot%: Proportion of shots taken by the team while it is on the ice. Fen%: Proportion of goals, shots, and missed shots. Corsi%: Proportion of goals, shots, misses, and blocked shots.

PK %Regression to the Mean
Games	Sv%	PK%	GA/60	SA/60	FA/60	CA/60
5->75	0.90	0.86	0.86	0.77	0.66	0.61
10->70	0.86	0.80	0.80	0.68	0.55	0.50
20->60	0.82	0.74	0.74	0.60	0.45	0.39
40->40	0.79	0.70	0.70	0.55	0.39	0.34
Games	GD/60	SD/60	FD/60	CD/60	Shot%	Fen%	Corsi%
5->75	0.86	0.76	0.68	0.62	0.82	0.74	0.73
10->70	0.81	0.68	0.58	0.51	0.75	0.66	0.64
20->60	0.74	0.60	0.48	0.41	0.68	0.57	0.54
40->40	0.71	0.54	0.42	0.35	0.63	0.51	0.49

The PP table above confirms that shot rates, Corsi in particular, are the most reliable. In fact, Corsi For/60 is statistically more reliable than Sh%, PP%, and GF/60 (p-value < 0.05). In the PK table, we are unable to show the same trend with statistical significance. PK save percentage shows much less regression to the mean than PP shooting percentage.

Already we can see the divergence between PP and PK. Teams on the power-play are much better at controlling shots than teams on the penalty kill. This is intuitive, as teams on the power-play are the ones in possession of the puck and have the burden of entering the neutral zone. However, likely as a result of the goaltender, teams on the PK have more control over the percentage of shots that go in the net.

We next move to PP and PK success. One could argue that success should be defined differently, but in my opinion GF or GA per 60 minutes of time on ice is a suitable gold standard against which we can compare our variables of interest. In the tables below we look at the correlation between the variable of interest in each column and out-of-sample PP GF/60 or PK GA/60.

PP GF/60
Games	Sh%	PP%	GF/60	SF/60	FF/60	CF/60
5->75	0.04	0.12	0.12	0.19	0.24	0.21
10->70	0.05	0.15	0.17	0.24	0.30	0.26
20->60	0.07	0.21	0.22	0.30	0.35	0.29
40->40	0.08	0.24	0.25	0.31	0.34	0.28
Games	GD/60	SD/60	FD/60	CD/60	Shot%	Fen%	Corsi%
5->75	0.11	0.17	0.22	0.20	0.06	0.08	0.08
10->70	0.15	0.21	0.27	0.24	0.07	0.11	0.11
20->60	0.20	0.26	0.32	0.28	0.09	0.14	0.14
40->40	0.24	0.28	0.32	0.27	0.11	0.15	0.15

PK GA/60
Games	Sv%	PK%	GA/60	SA/60	FA/60	CA/60	Sv%+CD/60
5->75	0.09	0.14	0.14	0.10	0.16	0.15	0.07
10->70	0.13	0.20	0.20	0.13	0.21	0.20	0.20
20->60	0.18	0.26	0.26	0.16	0.25	0.23	0.36
40->40	0.21	0.30	0.30	0.18	0.25	0.23	0.32
Games	GD/60	SD/60	FD/60	CD/60	Shot%	Fen%	Corsi%
5->75	0.14	0.10	0.16	0.16	0.06	0.10	0.10
10->70	0.19	0.13	0.21	0.20	0.09	0.13	0.13
20->60	0.26	0.16	0.25	0.24	0.11	0.16	0.16
40->40	0.29	0.18	0.26	0.24	0.13	0.18	0.18

Again we confirm that shot rate statistics are the best drivers of PP performance. While Corsi is the most reliable, Fenwick is the most predictive (for GF/60 and, as we will see, Pts/game). We have evidence now that Fenwick/60 is probably the best indicator of future PP success. Based on sample size we can’t conclude this with statistical significance across the board: at 41 games, the p-value for the difference between FF/60 and shooting% is 2%, but the p-value increases to 53% for Corsi For/60 and 75% for Shots For/60. Possession dominates, with Fenwick For/60 looking like the most significant.
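The post does not say which test produced these p-values. One common choice is the Fisher z test for two independent correlations, sketched below in Python. The function name and the sample size of 150 team-seasons are assumptions, and treating the two correlations as independent is itself an approximation, since they come from the same teams.

```python
from math import atanh, sqrt, erfc

def compare_correlations(r1, r2, n1, n2):
    """Two-sided p-value for H0: two independent correlations are equal."""
    se = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (atanh(r1) - atanh(r2)) / se
    return erfc(abs(z) / sqrt(2.0))  # two-sided normal tail probability

# FF/60 (0.34) vs. Sh% (0.08) at 40->40, assuming n = 150 for both
p_value = compare_correlations(0.34, 0.08, 150, 150)
```

Under those assumptions the FF/60 vs. Sh% comparison comes out at roughly 0.02, in line with the 2% quoted above.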

The PK table is significantly more muddied, and we don’t see nearly the same impact of possession on the PK. Teams have less control over the PK as compared to the PP (evidenced by the lower %regression to the mean in PP GF/60 vs. PK GA/60), so we see a less clear picture of what exactly drives PK success, likely because “luck” plays a bigger role. I included Sv%+CD/60 (although it’s kinda cheating because it’s a multivariate regression) to show that taking both possession (in CD/60) and goaltender ability (in Sv%) into account likely provides the best prediction, although it’s only marginally better than using either GA/60 or PK%.

Lastly, we take a look at standings points per game. The tables below are the correlation between the variables in the columns and out of sample points/game. This tells us which of our variables has the greatest impact on winning.

PP Pts/game
Games	Sh%	PP%	GF/60	SF/60	FF/60	CF/60
5->75	0.03	0.09	0.09	0.14	0.22	0.23
10->70	0.03	0.12	0.12	0.18	0.28	0.29
20->60	0.05	0.16	0.17	0.23	0.33	0.33
40->40	0.06	0.19	0.19	0.24	0.33	0.31
Games	GD/60	SD/60	FD/60	CD/60	Shot%	Fen%	Corsi%
5->75	0.09	0.14	0.22	0.23	0.09	0.12	0.12
10->70	0.11	0.19	0.28	0.28	0.13	0.16	0.16
20->60	0.15	0.24	0.33	0.32	0.17	0.20	0.21
40->40	0.18	0.26	0.34	0.32	0.19	0.23	0.23

PK Pts/game
Games	Sv%	PK%	GA/60	SA/60	FA/60	CA/60	Sv%+CD/60
5->75	0.05	0.08	-0.07	-0.03	-0.06	-0.08	0.08
10->70	0.07	0.11	-0.10	-0.05	-0.08	-0.11	0.11
20->60	0.11	0.15	-0.14	-0.06	-0.09	-0.13	0.18
40->40	0.12	0.17	-0.16	-0.07	-0.10	-0.13	0.21
Games	GD/60	SD/60	FD/60	CD/60	Shot%	Fen%	Corsi%
5->75	-0.07	-0.04	-0.07	-0.09	-0.05	-0.08	-0.08
10->70	-0.10	-0.06	-0.09	-0.12	-0.06	-0.10	-0.11
20->60	-0.14	-0.08	-0.12	-0.15	-0.08	-0.13	-0.13
40->40	-0.16	-0.09	-0.12	-0.15	-0.09	-0.14	-0.14

Not surprisingly, we see results similar to those for GF/60 (obviously because the two are correlated). Overall, the most interesting thing here is that it appears (although again not with statistical significance) that the PP has a bigger impact on winning than the PK does. This likely traces back to the repeatability issue. Teams have a more difficult time controlling what happens while they are on the PK relative to the PP. Because Pts/game is mostly dominated by even-strength effort, our correlations aren’t as strong, and so our picture again isn’t as clear. For GMs, even if even-strength possession confounds these results, it doesn’t matter. In fact, it’s a bonus. The acquisition of a strong possession player at good value will also bolster the power-play.

On the PP it still seems as though Fenwick holds up. Clearly the possession stats (shots, Fenwick, Corsi) outperform both shooting percentage and PP%. As was recently discussed, there likely is good value for teams that pursue players who also drive quality possession on the PP as compared to shooting percentage.

The association between PK and points seems to be controlled both by possession and Sv%. When sample sizes are low (5 games, 10 games) CD/60 performs well, but as we approach mid season (40 games), we see the influence of Sv% driving PK success.

So far we’ve focused on prediction, which must deal with the inherent noise (measurement error) in a sample. If instead we aren’t concerned about prediction and want to know the stat with the strongest correlation with PP/PK success (GF or GA per 60), we need to correct for attenuation (which is yet another way of accounting for the %regression to the mean). The formula to correct for attenuation was developed by Spearman in the early 1900s to “rid a correlation coefficient from the weakening effect of measurement error.” Our result is an adjusted correlation that removes the theoretical noise from the sample, i.e. it assumes perfect reliability of each variable or, put another way, 0% regression to the mean. Again, we’re no longer dealing with prediction, but with a theoretical “strongest association.” JLikens used the same formula in a pivotal article that has resulted in all but a few bloggers using Fenwick% predominantly. The formula is:

adjusted r_xy = r_xy / sqrt(r_xx * r_yy)

where r_xy is the observed correlation between the two variables, and r_xx and r_yy are the reliabilities of each variable (its correlation with its out-of-sample self, i.e. 1 minus its %regression to the mean).

Looking at a sub-set of our variables provides us with the theoretically strongest influence on PP/PK success.

PP GF/60
Games	Sh%	SF/60	FF/60	CF/60
5->75	0.59	0.96	1.05	0.85
10->70	0.54	0.93	0.99	0.79
20->60	0.57	0.88	0.91	0.72
40->40	0.59	0.81	0.80	0.62

PK GA/60
Games	Sv%	SA/60	FA/60	CA/60
5->75	0.78	0.53	0.74	0.67
10->70	0.78	0.53	0.72	0.64
20->60	0.81	0.50	0.67	0.60
40->40	0.82	0.49	0.63	0.55
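As a check on how these adjusted values relate to the earlier tables, here is a short Python sketch of Spearman’s correction, r_xy / sqrt(r_xx * r_yy), taking each reliability as 1 minus the %regression to the mean. The function name is illustrative, and the inputs are the rounded 40->40 values from the tables above, so the result differs slightly from the published 0.80.

```python
from math import sqrt

def disattenuate(r_xy, regress_x, regress_y):
    """Spearman's correction for attenuation.

    regress_x, regress_y: %regression to the mean (1 - r with out-of-sample self)
    for each variable, so reliability = 1 - regression.
    """
    r_xx = 1.0 - regress_x
    r_yy = 1.0 - regress_y
    return r_xy / sqrt(r_xx * r_yy)

# FF/60 vs. PP GF/60 at 40->40: observed r = 0.34,
# FF/60 regresses 0.29 and GF/60 regresses 0.75.
adjusted = disattenuate(0.34, 0.29, 0.75)   # about 0.81
```

Because the reliabilities are themselves noisy estimates, the correction can produce values above 1, as in the 5->75 FF/60 cell.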

On the PP the most interesting occurrence is the merging of SF/60 and FF/60. This suggests that shots and misses are essentially indistinguishable from each other from an offensive point of view. Defensively, on the PK there seems to be a difference between goals, shots, and misses. This has to do with the goaltender’s repeatability.

In total, the PP doesn’t change a lot (after all it had a higher reliability to begin with). When we look at the PK, Sv% stands out, while Corsi falls behind Fenwick.

Discussion

While this is not definitive proof, we can say that many metrics previously thought to be drivers of both PP and PK success were confirmed in this study. For the power-play, given its reliability and its strength of association with both points and GF/60, I would suggest using Fenwick For per 60 minutes of 5v4 ice time, corrected for scorer bias. It was shown to be both predictive within a season and the best performer after correcting for attenuation.

The penalty kill picture is slightly less clear, mostly due to reliability. This in itself makes sense, as the PK team does not control the puck and therefore has less influence on the outcome. When predicting PK success (defined in this study as GA/60), both Sv% and CD/60 play a large role. I cheated and added in a (Sv%+CD/60) variable which performed better than any other stat shown above.

One of the more interesting trends I didn’t expect to see was the relationship of PP success and PK success to points/game. Using our “gold standard” for PP and PK success (GF and GA per 60), the correlation to points was slightly higher for the PP, at 0.19 vs. 0.16. Due to the smaller sample size (5 years of data), the p-value for the difference between the two correlations is 78%, so it could easily be due to chance. On the other hand, comparing the best predictor of PP success (FF/60) to the best predictor of PK success (Sv%+CD/60), 0.33 vs. 0.21, respectively, creates a much larger gap. The p-value for this difference is 0.0003%. This may be striking, but we also probably have some confounding occurring. The best predictor of pts/game is Fenwick%. I wouldn’t be shocked in the slightest if FF/60 on the PP was very well correlated with EV Fenwick%. As I cautioned in the previous post, we won’t know the impact the power-play and penalty kill have on pts/game until we involve them in a multivariate linear regression where we can account for these confounding effects.

Lastly, we see an interesting change on the PP in our final table. Corsi falls well behind Fenwick and Shots. In fact, controlling for measurement error and sample size, it appears as though Fenwick and Shots are virtually identical. Also interesting is that the strongest associations (FF/60 on the PP and Sv% on the PK) both approximate 0.8, though they require quite different skills. And most important for prediction (and therefore GMs) is the superior reliability of FF/60 compared to Sv%.

Summary Points

NHL teams spend a substantial portion of games on special teams. 5v4 time alone accounts for 10-20% of game time, and thus needs to be analyzed to further our prediction models.
Although we don’t have enough years of NHL RTSS (5 years and counting) data to conclude with statistical significance, it appears that Fenwick For/60, with misses and blocked shots adjusted for scorer bias, is the best predictor of power-play success (GF/60), and well correlated with winning (Pts/game).
The penalty-kill picture is less clear, likely the result of heavier regression to the mean. Presumably the heavier regression is because PK units spend much more time without the puck. Both Sv% (which is likely goaltender driven), and Corsi differential/60 (Corsi For – Corsi Against per 60) are predictive of future penalty kill success (GA/60), and winning (pts/game), but less powerful than the predictors of power-play success.
Shooting percentage on the power-play is negligible, regressing heavily to the mean, and shows at best very modest correlations with PP success and winning. Even if we correct Sh% for attenuation (i.e. control for how much it regresses to the mean), it still under-performs other more significant metrics like FF/60.
If we focus on which stat in theory has the strongest association with PP success (GF/60), we see that Fenwick For per 60 and Shots For per 60 are virtually indistinguishable. On the PK, however, Sv% becomes the strongest stat.

Appendix

Didn’t have enough tables for you?

Additional tables for the PP with 95%CI and adjusted correlations can be found here.

And for the penalty kill here.

Model details

The Python code I use to scrape the PP/PK data can be found here.

VBA code for the Excel macro that runs the simulations and calculates the correlations can be found here.

Calculating regression to the mean

I’ve gone ahead and already converted the correlations into %regression for you by taking (1-r). We multiply the league average for that stat by our coefficient in the table (the %regression to the mean) for the number of games that have passed, then multiply the remaining percent by the team’s stat, and add the two together.

E.g. if SJS has played 25 games with a 100 FF/60 rate, assuming a 68.5 league average FF/60 rate, we need to regress their FF/60 approximately 34% to the mean to predict what their FF/60 for the next 60ish games will be. So we take 34% league average (the mean) and 66% observed rate: (0.34*68.5) + (0.66*100) = 89.29. If we want to predict their year-end rate, we take the weighted average of their FF/60 now and our prediction for the rest of the season: (25/82)*100 + (57/82)*89.29 = 92.56. You can see how quickly variance can influence year-end totals, even in something as repeatable as FF/60.
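The same arithmetic as a short Python sketch, for anyone who wants to plug in other teams. The function names are illustrative and not part of the original workbook.

```python
def regress_to_mean(observed, league_mean, pct_regression):
    """Blend the league mean and the observed rate by the %regression to the mean."""
    return pct_regression * league_mean + (1.0 - pct_regression) * observed

def year_end_estimate(observed, projected, games_played, season_games=82):
    """Weight the rate so far and the projected rest-of-season rate by games."""
    played = games_played / season_games
    return played * observed + (1.0 - played) * projected

# SJS example from the text: 25 games played, 100 FF/60, league average 68.5
projected = regress_to_mean(100.0, 68.5, 0.34)        # 89.29
year_end = year_end_estimate(100.0, projected, 25)    # about 92.56
```

The 0.34 comes from the 20->60 row of the PP %regression table for FF/60; other stats or game counts would use the matching cell.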
