This is the second part of two on the Devils season looking at marginal goals. Yesterday I wrote about goalscoring; today I look at goaltending. What are marginal goals? This is what I wrote yesterday:
I call this methodology marginal goals. In economics, marginal refers to the next unit, and to the individual incremental impact of that unit: if we are at x and the marginal unit puts us at x+d, economists are interested in this d, the difference between where we are now and where we were before. In the context of hockey, if we are at x and a goal is scored, what is the d of this goal? What is its marginal impact, in the immediate context in which it is scored? When a goal is scored, the goalscoring player adds one to his season total, which is then used to evaluate that player’s scoring ability at the end of the year. But it should be more granular than that. A goal when down by one with two minutes to go is not the same thing as a goal when up by five after the second period. This difference is what marginal goals tries to capture at the time the goal is scored: the individual impact of that particular goal in its wider context.
Goaltenders save shots and prevent goals. Vitek Vanecek has been the main man in net for the Devils this year and has conceded 119 goals, compared to the 60 for Blackwood and 32 for Schmid. But again, there is no context: all other things being equal, conceding a goal when up by five is worth the same as a goal conceded when up by one according to these raw numbers. There are vast amounts of statistics available, quantifying the performance of goaltenders, but these all consider the saves they make, not the goals they allow. For instance, of shots that become goals, do you know what the expected goals are, on average, on those shots? Neither did I, yet this feels like a highly significant aspect of goaltender evaluation. We complain when Blackwood gives up a softy, but how can we quantify this? Does he really give up more softies than Vanecek and Schmid, for instance? And we talk about backbreakers, goals that shift momentum or put the team in a hole. How can we quantify this?
Yesterday I wrote about goalscoring, today the topic is goaltending. If you did not read yesterday’s article that is fine. The main carry over from yesterday would be the initial discussion of the data that I used to analyse goalscoring and will use to analyse goaltending, but I will simply reuse the explanation of the data, so nothing to worry about there.
A goal can only be considered relative to the game it was scored in, and the phase of play that it was scored in. The metrics I propose for these questions are based on a goal’s impact on winning probabilities: if Hughes scores a goal, or Vanecek lets one in, this will change the probability of the Devils winning that game. This change in winning probability I denote dW%. At a given moment in time, the Devils will have a certain chance of winning the game. Going up or down a goal will change this likelihood, and dW% measures the size of this change. For goalscorers, contributing goals with high dW% is a good thing, as they give their team a better chance of winning. For goaltenders, conceding goals with high dW% is a bad thing, as they are throwing away their team’s chances. For every NHL game, at every moment, MoneyPuck provides the probability of each team winning, derived from models which use run-of-play statistics to evaluate the relative strengths of the two teams. The following is how this probability evolved throughout the Devils-Capitals game the other night.
The game started off as 60-40 Devils, before swinging the Capitals’ way by 50% or so when they jumped into a three-goal lead. Erik Haula scored at the end of the first, giving the Devils a roughly 7% greater chance of winning, and so on with the Caps going up by three again before the Devils gradually clawed their way back. Notice how when the Caps were in the lead their probability of winning slowly increased when nothing was happening. Note also how Dougie Hamilton’s goal made it 50-50 while Luke Hughes’ overtime winner alone was worth 50%.
Here is the trend from another game, Tampa Bay versus Arizona from the middle of February. This game was scoreless through regulation and overtime, with the Coyotes winning in the shootout.
Note how Tampa started off with a 65% or so chance of winning, which then converged towards 50% as the game remained scoreless, hitting 50% by the time overtime started.
Why are these evolutions relevant? Well, consider the drop in the Capitals’ probability of winning following Erik Haula’s goal. This is an increase in the Devils’ chance of winning, dW%. As you can see here:
MoneyPuck provides the exact dW% for each goal, where this time Haula increased the Devils’ chances by 7.26%.
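For anyone who wants to reproduce the idea, dW% is simply the Devils’ win probability just after a goal minus the probability just before it, in percentage points. The following is a minimal Python sketch. The function name and the probabilities are my own placeholders, chosen only so that the first swing matches the 7.26 reported for Haula’s goal; they are not values read off MoneyPuck.

```python
def delta_win_pct(win_prob_before: float, win_prob_after: float) -> float:
    """Change in a team's win probability caused by one goal, in percentage points."""
    return (win_prob_after - win_prob_before) * 100.0


# Hypothetical before/after probabilities for a goal scored by the Devils.
haula_swing = delta_win_pct(0.1200, 0.1926)

# Hypothetical goal against: the Devils' win probability drops from 55% to 40%.
# For a goaltender we look at the size of the drop, so the sign is flipped.
goal_against_cost = -delta_win_pct(0.55, 0.40)

print(f"goal for:     +{haula_swing:.2f} dW%")
print(f"goal against: {goal_against_cost:.2f} dW% given away")
```

The same function serves both halves of this series: for scorers a large value is good, for goaltenders a large value on a goal against is bad.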
“Need a Save” Goaltending
The context of goals allowed is twofold. First, was it a big goal in terms of significantly hurting the Devils’ chances of winning? Second, was it a big goal in terms of the Devils’ netminder letting in a shot that they really should not have? We can answer the first question using the same data from above, going through MoneyPuck for the Devils’ games and looking at dW% for each goal a Devils goalie conceded. Regarding the second question, as well as tracking winning probabilities throughout the game, MoneyPuck also tracks cumulative expected goals, providing a number for how likely a given shot is to be a goal:
As above, the logos indicate that a goal has been scored. We can thus, for each goal against this year, find the xG% for each of those goals, and match them with the dW% for the same goal:
Evidently, Joe Snively’s goal the other night was worth 0.085 expected goals. He came in on a two-on-one, the defenseman Brendan Smith took away the other forward Evgeny Kuznetsov, leaving Snively free rein to shoot and score. From that perspective, 8.5 xG% feels pretty low, but it turns out that most shots are very low xG. This makes sense, of course: for a Devils team that has an average of 2.81 expected goals against per game and 27.85 shots against per game, the average xG% per shot against is going to be 2.81/27.85 = 10.09%.
The following looks at the average for the three goaltenders the Devils have used this season in three categories: dW% from goals allowed, xG% on shots that become goals, and the ratio dW% to xG%. From the perspective of the Devils, we would want our goaltenders to minimise this ratio, as this would mean allowing goals that are high xG% and low dW%.
Note that the mean dW% divided by the mean xG% does not equal the mean of dW%/xG% (henceforth “chokeness”), which makes sense, given that the mean of ratios is not typically equal to the ratio of means. The first point of interest is mean dW%. All three goaltenders, on average, allow goals which decrease the Devils’ chances of winning by essentially equivalent amounts. Blackwood does seem to have a slightly lower mean dW%, but, bear with the necessary statistical details, a two-sample two-sided t-test of whether Blackwood’s mean dW% is the same as Schmid’s gives a p-value of 0.578, while comparing Blackwood’s mean dW% to Vanecek’s gives 0.552. The test assumes independent, equal-variance, normally distributed samples, all of which you could question, normality especially, but that is not the point of the analysis: see the p-values more as indicators than gospel.
What this essentially means is that we cannot statistically say that Blackwood’s dW% is different from either Vanecek’s or Schmid’s. And this sort of makes sense, given that dW% depends on what the team in front of the goalie is doing, whether they have put the team up by a couple, or not kept pace with the other team. It makes sense that, since they play behind the same team, the context from this perspective would generally be the same.
Something that does differ between goaltenders is xG%: Vanecek and Blackwood are similar, but Schmid is substantially better, as it takes a shot of on average 23.591 xG% to beat him, compared to the easier shots that the other guys concede on average. This difference is statistically significant at the 5% level against Blackwood (p-value 0.034) and sits right on the borderline against Vanecek (p-value 0.051), ie, the statistics give us reasonable grounds to say that it takes a more difficult shot to beat Schmid than it does to beat either of the other guys. This would suggest that Schmid, at least from the perspective of the goals he allows (recall that other goaltending stats look at the goals he does not allow), has had a better season than his peers, not giving opponents anything for free.
The third column again shows that Schmid is better on average than the others. A soft, back-breaking goal (low xG%, high dW%) would give higher chokeness than an unstoppable token goal. On the average goal Schmid concedes, he moves towards low chokeness relative to Vanecek, who in turn does better than Blackwood.
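To make chokeness concrete, and to show why the ratio of the two column means need not equal the mean of the per-goal ratios, here is a small Python sketch. The three goals are invented for illustration only: one soft back-breaker, one unstoppable token and one in-between goal. They are not taken from the MoneyPuck data.

```python
from statistics import mean

# Hypothetical (dW%, xG%) pairs, NOT real goals.
goals = [
    (20.0, 5.0),   # soft, back-breaking goal: high chokeness
    (5.0, 40.0),   # unstoppable token goal: low chokeness
    (10.0, 10.0),  # somewhere in between
]

# Chokeness of each individual goal: dW% divided by xG%.
chokeness = [dw / xg for dw, xg in goals]

# Two different ways of averaging, which generally disagree.
mean_of_ratios = mean(chokeness)
ratio_of_means = mean([dw for dw, _ in goals]) / mean([xg for _, xg in goals])

print("per-goal chokeness:", chokeness)
print(f"mean of ratios: {mean_of_ratios:.3f}")
print(f"ratio of means: {ratio_of_means:.3f}")
```

The mean of ratios is pulled up sharply by the single soft back-breaker, which is exactly the behaviour I want from the metric: one terrible goal should hurt more than a handful of harmless ones can repair.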
But going back to that Snively goal, would we want Blackwood to save a shot with 8.5 xG%? I use “want” here in the sense that we would have him at fault if he did not. The goal, if you read the first article, had a dW% of roughly 22%. Is that a lot? Is it back-breaking? I want to develop the chokeness notion and make it more well-defined. Let us say that there are three categories of dW% (back-breaker, standard, token) and three categories of xG% (softy, standard, unstoppable). Any single goal belongs to one of the three categories in each dimension, meaning there are nine total possible types of goal. The following table shows these nine types and the “need a save”-ness for each of them: essentially, how desperate are we as fans for the goaltender to stop that shot. We are more desperate for the goalie to stop a back-breaker if the shot is a softy than if it is unstoppable, more desperate for a save on a standard shot if it is of standard dW% than if it is a token.
The colours show “need a save”-ness, where green means that we are fine with it being a goal, yellow slightly less so, orange is frustration that they could not keep it out, red is outright fury. As with chokeness, then, we want goalies to be in the green regions with as many of the goals that they concede as possible, and prefer yellow to orange and orange to red.
The question now is how to define the thresholds separating goals into the different categories: at which level of xG% does a standard goal become a softy, and so on? The following looks at the distribution of xG% on goals allowed this season from all three goaltenders:
Clearly, the majority of shots lie between xG% of 0 and 20. Per Natural Stat Trick, the average save percentage in all situations across the NHL this season is 89.9%. On average, then, roughly 10% of shots on goal go in. Although this perhaps does not necessarily follow logically, let us therefore say that 10% of goals are unstoppable and, for symmetry, that 10% of goals are softies. Note that these thresholds are necessarily arbitrary; however, this does not really matter, given that the same boundaries apply for all three goalies. The only way it could matter would be if one goalie were systematically just above the arbitrarily chosen threshold and another just below, thus misrepresenting their relative qualities; we will see later that this is not the case. The 10th percentile xG% on goals allowed by Devils goalies is 4.2%, the 90th percentile is 38.5%. As such, softies are goals where the xG% on the shot is less than 4.2%. Unstoppable shots are those with xG% greater than 38.5%, and standard shots are in the middle.
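The percentile cut-offs can be computed with nothing more than the Python standard library. This is a sketch under assumptions: the input list is placeholder data (the numbers 1 to 100), not the actual xG% values, and the choice of the "inclusive" interpolation method is mine, so real data run through it may land a fraction away from the 4.2% and 38.5% quoted above.

```python
from statistics import quantiles


def softy_unstoppable_cutoffs(xg_on_goals):
    """Return the (10th percentile, 90th percentile) of xG% on goals allowed."""
    # n=10 splits the data into deciles and returns the 9 cut points between them.
    deciles = quantiles(xg_on_goals, n=10, method="inclusive")
    return deciles[0], deciles[-1]


# Placeholder data standing in for the xG% of every goal allowed.
placeholder_xg = [float(v) for v in range(1, 101)]

softy_cut, unstoppable_cut = softy_unstoppable_cutoffs(placeholder_xg)
print(f"softy if xG% < {softy_cut:.1f}, unstoppable if xG% > {unstoppable_cut:.1f}")
```

The same helper works unchanged on the dW% values if you would rather pick the token and back-breaker cut-offs by percentile as well.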
Regarding dW% thresholds now, the following displays the distribution of dW% for goals the Devils have both scored and conceded this year, upper and lower panel, respectively. Having the data available for the Devils goalscorers, I figured adding it to the analysis to give a larger sample size for dW% made sense, especially considering the distributions are similar (the main differences are at the extremes, where the scorers have more tokens and back-breakers. This is because of goaltenders not being in net for token empty netters, while the Devils have had great success tying games late, thus getting high dW%, something that has not really happened on goals against).
Here, again, the thresholds are arbitrary. For the upper limit, defining back-breakers, the 90th percentile across the combination of goals for and against gives 29.47 dW%. Looking at the distribution for goaltending, this looks somewhat reasonable, as it does seem to separate a group of high dW% from the rest. Let’s set it at 29.5 dW%. Regarding the lower end, defining token goals, intuitively, quite a large proportion of goals serve very little purpose in the grand scheme of things. In the Capitals game, for instance, the third and fourth Capitals goals were worth 8.29 dW% and 6.97 dW%, respectively, not really doing much, marginally, for the team. Intuitively to me it makes sense that somewhere in the 15 to 25% range of goals should be tokens. From the combined set, the 15th percentile is 3.71 dW%, the 20th percentile is 5.33 dW%, and the 25th percentile is 6.69 dW%. 5 dW% is the 19th percentile, 10 dW% is the 33rd percentile. As we can see in the figure, dW% is highly concentrated in the 0-10 dW% region, so I would say that the 20-percentile level of 5.33 dW% “feels right”.
Putting these two sets of definitions together, coding all of the goals against for Devils goaltenders, the following table shows the proportion of goals against for each goalie that is in each colour region.
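The coding step can be sketched in a few lines of Python. The four cut-offs are the ones chosen above; everything else is an assumption of mine: the function names, the handling of goals that sit exactly on a cut-off (treated as standard), and the example goals other than Snively’s, which are made up.

```python
from collections import Counter

SOFTY_XG, UNSTOPPABLE_XG = 4.2, 38.5    # xG% cut-offs from the text
TOKEN_DW, BACK_BREAKER_DW = 5.33, 29.5  # dW% cut-offs from the text


def classify(xg_pct, dw_pct):
    """Return (shot type, impact type) for one goal against."""
    if xg_pct < SOFTY_XG:
        shot = "softy"
    elif xg_pct > UNSTOPPABLE_XG:
        shot = "unstoppable"
    else:
        shot = "standard"

    if dw_pct < TOKEN_DW:
        impact = "token"
    elif dw_pct > BACK_BREAKER_DW:
        impact = "back-breaker"
    else:
        impact = "standard"
    return shot, impact


def proportions(goals):
    """Share of a goalie's goals against that falls in each of the nine cells."""
    counts = Counter(classify(xg, dw) for xg, dw in goals)
    return {cell: n / len(goals) for cell, n in counts.items()}


# Snively's goal (8.5 xG%, roughly 22 dW%) plus three invented goals.
example_goals = [(8.5, 22.0), (3.0, 35.0), (45.0, 2.0), (12.0, 9.0)]
for cell, share in proportions(example_goals).items():
    print(cell, f"{share:.2f}")
```

With these cut-offs the Snively goal comes out as a standard shot with standard impact, so not one to hang on the goalie.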
With the aforementioned thresholds, only three goals this season have been simultaneously a softy and a back-breaker. These were Vanecek’s 21st goal against and Blackwood’s 18th and 53rd, so for Vanecek a long while ago, nothing that is relevant now.
What is relevant now is that Schmid and Vanecek have a larger proportion of their goals than Blackwood in the green and yellow regions. Summing the proportions of goals in the green and yellow regions, Blackwood is 4 percentage points behind Vanecek and 6 behind Schmid. Unsurprisingly, Blackwood is in the orange a lot more than Vanecek and Schmid: these goals, remember, are softies with standard implications or back-breakers on standard shots. These attributes seem to describe Blackwood quite well. The following tables show the exact distribution for each goalie. The row and column sums give the proportion of goals in the type of that row/column.
In terms of xG%, Vanecek gives up fewer softies and has a larger proportion of his goals being unstoppable, relative to the other two guys. This is a big vote of confidence for Vanecek, as it would seem that it takes more dangerous shots to beat him. Going slightly more granular, Schmid does have the highest softy proportion, but he also has an overwhelmingly large softy-token proportion. Considering softies that are let in when the game is actually up for contest, he is actually the least soft. From this perspective, Schmid does give up bad goals, more so than even Blackwood, who is infamous for his soft goals allowed, but he does so in uncompetitive games, in scenarios where conceding does not matter much.
Regarding dW%, Schmid has the largest token proportion, by a wide wide margin. He also has the lowest back-breaker proportion, granted that his value here is not really that different from Vanecek’s. This suggests that, if Schmid does allow a goal, on average that goal is not going to be very meaningful in the context of the game. Both factors, then, speak in favour of Schmid!
Now, what does this mean? It must be acknowledged that the sample size is limited: 32 goals for Schmid is not much to go on. From a different perspective, his 901 minutes of playing time is not insignificant; however, as we are looking at the goals he concedes, the sample size remains an unfortunate 32. As such, very few of these proportions are actually significantly different from one another: we cannot confidently say, using the data available, that they are not equal to one another. Using a two-sample proportion z-test (where the assumptions are respected), the only significant difference between Vanecek and Schmid, even using a one-sided test, is Schmid’s token proportion of 0.2188 compared to Vanecek’s of 0.0756, which has a p-value of 0.0097 (two-sided, 0.0195). Consider that the one-sided p-value for Schmid’s softy proportion being greater than Vanecek’s (0.1563 to 0.1092) is 0.2327! That is, despite Schmid having almost 5 percentage points more of his goals being softies, we cannot say that this is actually significantly more than Vanecek. If we imagine that we give each goalie the same number of extra goals to increase their sample sizes, then, assuming the proportions stay the same, each goalie would have had to concede an additional 206 goals to get a p-value below 0.05! And for a p-value below 0.01, the number is 484! What I am trying to say is that the sample sizes are really small, meaning we cannot say as much about the differences between the goaltenders as we would like to.
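For the curious, the two-sample proportion z-test can be written out with the standard library alone. This sketch assumes the pooled-variance form of the test, and the goal counts (7 and 5 of 32 for Schmid, 9 and 13 of 119 for Vanecek) are the counts implied by the proportions quoted above rather than numbers I have stated elsewhere.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sample z-test for proportions.

    Returns (z, one-sided p-value for p1 > p2, two-sided p-value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_err
    normal = NormalDist()  # standard normal distribution
    p_one_sided = 1 - normal.cdf(z)
    p_two_sided = 2 * (1 - normal.cdf(abs(z)))
    return z, p_one_sided, p_two_sided


# Token goals: Schmid 7 of 32 (0.2188) versus Vanecek 9 of 119 (0.0756).
z_token, p1_token, p2_token = two_proportion_z_test(7, 32, 9, 119)
print(f"token: z={z_token:.3f}, one-sided p={p1_token:.4f}, two-sided p={p2_token:.4f}")

# Softies: Schmid 5 of 32 (0.1563) versus Vanecek 13 of 119 (0.1092).
z_soft, p1_soft, _ = two_proportion_z_test(5, 32, 13, 119)
print(f"softy: z={z_soft:.3f}, one-sided p={p1_soft:.4f}")
```

With so few goals in some cells, the normal approximation behind this test is itself a little shaky, which is one more reason to treat the p-values as indicators rather than verdicts.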
That being said, for what it’s worth, we saw above that Schmid’s chokeness was by far the best of the three. Here we see that, ultimately, he has the most favourable outlook in terms of both xG% and dW%. This leaves me conflicted. After coming in and completely shutting out the Capitals after Blackwood had made an utter mess of things, Schmid has yet again risen in my estimation, and that is from what was already a very high regard. These numbers suggest that he is the man for the job. But would I throw him in against the Rangers? I can’t say. I guess the fact that Vanecek is not significantly worse with their respective sample sizes and has looked better recently would say that there is no need to. But again, since he came in and robbed the Senators in overtime earlier in the year, I can’t remember a night where Schmid has not inspired me and seemingly the team with complete confidence. I guess the conclusion that I draw then is that, if only at the very least, Schmid is a great backup option to Vanecek, and that Vitek should have a very very short leash: one bad game and he’s out.
As a final point, I just want to look quickly at the xG% and dW% thresholds defined above to distinguish between different types of goals. Particularly, I want to look at how sensitive the categorisation of goals is to the chosen thresholds. What I mean by this is that, if we had selected a different set of thresholds, would this have significantly impacted the conclusions we drew in the last couple paragraphs? The below tables show what happens for each goalie when the thresholds converge by 5%, ie, the upper thresholds separating standard goals from being back-breakers or unstoppable decrease by 5% while the lower thresholds separating standard goals from tokens and softies increase by 5%. Each cell shows how many goals would have been reclassified given this threshold change.
There is a pretty common pattern, where standard-standard goals become less frequent — which makes sense: by converging thresholds, a lower range of dW% and xG% values will qualify a goal as standard. This mainly impacts goals in the green regions, meaning we are not particularly interested in these changes. Recall that Vanecek conceded 119 goals, Blackwood 60 and Schmid 32. From this perspective, Schmid was more sensitive to the thresholds, as he saw 3/32 = 9.4% of his goals reclassified, compared to 5% for Vanecek and Blackwood. But it is quite likely that this too is a sample-size issue. But again, these changes affected only green goals, so overall we can conclude that the chosen threshold values were not inappropriately extreme, ie, far out, since converging them has no significant impact.
This second set of tables looks at the impact of changing thresholds in the opposite direction, diverging them, meaning the lower thresholds become lower and the upper thresholds become higher: for example, for a goal to be classified as a token its dW% must be lower than before, and for a goal to be classified as unstoppable its xG% must be higher than before.
We see an opposite pattern from previously, where standard-standard goals become more likely rather than less, which again makes sense (more extreme thresholds make classification as standard more likely). In this case, Schmid’s sensitivity does decrease a bit relative to previously, where his change of 2/32 = 6.25% is more in line with the 5% of Blackwood and Vanecek. This time, however, there is some movement in the relevant red and orange regions. What this means is that we previously set the thresholds between standard and back-breakers, as well as between standard and softies, in such a way that a small outward change to these thresholds would have a large impact on our conclusions. It is thus possible that they are too narrow. As such, if this were to become a more rigorous framework for goaltender evaluation, a more detailed analysis would have to be applied to find suitable values for these thresholds. However, in this case, the changes appear pretty uniform across the three goalies. There is no, for instance, large change for Vanecek while Schmid stays the same, which would suggest that the thresholds made it difficult to compare those two goalies, as a small threshold change could substantially impact the proportions used for comparison. I therefore believe the conclusions drawn above, which in themselves were highly tentative, can be retained.
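The sensitivity check itself is easy to sketch in Python: classify every goal under the original cut-offs and under the shifted ones, then count how many change cell. Two loud assumptions here. First, I read the 5% shift as a relative change in each cut-off value, which is only one possible reading. Second, the four goals are invented so that each direction of shift moves at least one of them.

```python
BASE = {"softy": 4.2, "unstoppable": 38.5, "token": 5.33, "back_breaker": 29.5}


def classify(xg, dw, cut):
    """Return (shot type, impact type) under a given set of cut-offs."""
    shot = ("softy" if xg < cut["softy"]
            else "unstoppable" if xg > cut["unstoppable"] else "standard")
    impact = ("token" if dw < cut["token"]
              else "back-breaker" if dw > cut["back_breaker"] else "standard")
    return shot, impact


def shift(cut, factor):
    """factor > 0 converges the cut-offs, factor < 0 diverges them (relative change)."""
    return {
        "softy": cut["softy"] * (1 + factor),
        "token": cut["token"] * (1 + factor),
        "unstoppable": cut["unstoppable"] * (1 - factor),
        "back_breaker": cut["back_breaker"] * (1 - factor),
    }


def reclassified(goals, base, alt):
    """Number of goals whose cell differs between the two sets of cut-offs."""
    return sum(classify(xg, dw, base) != classify(xg, dw, alt) for xg, dw in goals)


# Hypothetical (xG%, dW%) goals sitting close to the cut-offs.
goals = [(4.3, 10.0), (37.0, 29.0), (8.5, 22.0), (4.1, 5.2)]

converged = reclassified(goals, BASE, shift(BASE, 0.05))
diverged = reclassified(goals, BASE, shift(BASE, -0.05))
print(f"reclassified when converging: {converged}, when diverging: {diverged}")
```

Dividing each count by the goalie’s total goals against gives the reclassification percentages quoted above.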
What do you think of this analysis? I admit that the data (MoneyPuck is great, and I would imagine correct on average, but the manner in which some goals, rebounds especially, are coded makes me slightly sceptical) and methodology (especially the arbitrary threshold definitions) are far from ideal, but this was only meant to be an initial probing into this style of analytics, looking at marginal goals, goals with context. Do you feel that there is something of value here? Regarding goaltending, Schmid does more and more feel like the guy for me, and even the basic stat — his average xG% is much higher than his peers, dW% being constant — would support that, let alone the quietly favourable evidence found when segmenting dW% and xG% further.
Anyway, hopefully you have enjoyed this piece and it has given you something to think about. Is there any aspect of this analysis that you feel I neglected? Somewhere, perhaps, you believe I could have improved? Other than that, were you surprised by any of the results? Can these findings be used as argumentation, in combination with other statistics, for Schmid over Vanecek in the first round and (hopefully) beyond? Let me know in the comments, and thank you for reading my piece and supporting the site!