
A WAR Ensemble of Numbers: The New Jersey Devils by Wins Above Replacement

Wins Above Replacement, or WAR, has become recently prominent in the world of hockey statistics. This post looks at two of the more popular models, one by Manny Elk of Corsica and one by @EvolvingWild, to see how the New Jersey Devils stack up and, from there, question the models.

Taylor Hall led the Devils in WAR according to the models by Manny Elk at Corsica and the Younggrens of @EvolvingWild. But what does that actually mean?
Photo by Ethan Miller/Getty Images

The hot new point of discussion, debate, and dispute in the niche world of hockey analytics is Wins Above Replacement (WAR). The concept is certainly not a new one in sports, as WAR is most associated with baseball, and there have been several attempts to create a hockey version of it. Some have recently come right out and declared their model of WAR. Back in 2017, Ian Tulloch wrote a five-part series at The Leafs Nation about his own attempt at a model, noting others by Manny Elk of Corsica and Luke & Josh Younggren (as per Hockey-Graphs), the twins behind the @EvolvingWild Twitter account. Elk has since made WAR a live stat on his now-actually-functional stats site, Corsica. Through the twins’ Twitter account, their Goals Above Replacement app hosted at Shinyapps has a WAR component. Since those relatively recent releases, WAR has started to show up a bit more in analysis - and in criticism. For an example of the former, CJ recently used it to argue, in the face of Greene’s declining 5-on-5 numbers, that he was possibly more fine last season than originally thought. For an example of the latter, James Mirtle got together with Matt Cane and Tyler Dellow to sort out WAR and highlight various issues with it. That piece ignited plenty of Twitter beefs, helped along by the fact that nothing else is happening at the NHL level right now and by Manny Elk and the EvolvingWild twins reacting poorly to legitimate issues.

But I don’t care about that part. I care about the New Jersey Devils. Chances are, you do too. To that end, let’s at least see what WAR says about the Devils using both the Corsica and EvolvingWild models. Then, let’s go over what it means beyond the literal wording:

The 2017-18 Devils by Corsica’s WAR Model

After being critical of it before, I will give Elk credit for having a functional website where I can easily reference what I found and link to it for others. Here’s the Devils’ WAR list, complete with a component-by-component breakdown of how each player rated on the way to that WAR total. I summed up the totals in this chart:

The 2017-18 Devils by the Corsica WAR model

At least Taylor Hall is at the top. The model definitely shows some love to Kyle Palmieri, Nico Hischier, and Blake Coleman. The two goalies were about even in terms of WAR. Will Butcher and Ben Lovejoy were head and shoulders above a blueline that was mostly not adding value, which surprisingly included Sami Vatanen. The model did not rate Jesper Bratt well either. It believes that Damon Severson and Miles Wood were the worst defensive players on the team, with Nico Hischier and Vatanen also deep in the red, while Coleman and Lovejoy were on the opposite end.

There’s a lot going on here. At first glance, there are a number of odd things about the results. You may notice the game numbers are a little high; some are even above 82. The Corsica model includes both regular season and playoff games. I’ve not seen too many stats that do that, but I suppose it makes some sense as they are games that count. It also does not split up performances with other teams. This means Sami Vatanen’s numbers include his time with Anaheim; Patrick Maroon’s and Michael Grabner’s include their time with their previous teams; and so forth. Also, all players are included - goalies are right there with the skaters - which is not common in stats since the two positions are very different. Lastly, I’m not quite sure, but this is not just 5-on-5 play. I think these numbers reflect everything on the ice, which is also not common since power plays and penalty kills have different situations and objectives than even-strength play.

Oh, and going back to Hall, his 3.92 WAR ranked 19th in the entire NHL. He was behind Nathan MacKinnon and well ahead of Anze Kopitar, just to pick two names at random. Alex Ovechkin was your leader with 8.52 over his 105 games played. For further perspective, only 12 players broke 5 WAR, and seven of them were goalies who had very good seasons. I have a feeling of deja vu, but I’ll get to that later.

So What Does This All Mean?

To answer that, let’s dive into how Elk even came up with this.

Fortunately, Corsica breaks down the individual components that get summed up into WAR, and Elk wrote an explanation of how he formulated the model. I give Elk full kudos for the latter; it even includes snippets of code for each part. I will forewarn you: if you thought the math behind Corsi or other such stats was tough, then this is on another level, as it incorporates regression and other models. I’m still processing it, but I’m more interested in the high-level parts than the details. So here’s the intent behind Corsica’s model:

WAR, first developed by sabermetricians in baseball circles, is an evaluative metric intended to approximate a player’s impact in units of wins added. A player’s contribution is measured against what a replacement level player is expected to offer in the same circumstances. Replacement level is not uniquely defined, but it is commonly equated to players earning a league-minimum salary. I opt to echo this paradigm.

Elk defines replacement level through defining replacement goals added:

In order to obtain WAR, the goals added in each category must be subtracted by the replacement goals added, then multiplied by a conversion constant. The former task requires replacement coefficients for each player position from each regression performed. Using a record of players earning league-minimum salary in each season between 2009-2010 and 2016-2017, exponentiated regression coefficients belonging to these replacement-level players were grouped by position and averaged:

In other words, Hall’s performances were worth nearly four more wins alone than if you had George Generic (a cheap player) instead on the roster. Considering how few players are worth more wins on their own, that means Hall’s performances have a lot of value to his team. On the flipside, the model suggests the Devils would have been better off with George Generic over Boyle, Zacha, Hayes, Bratt, Mueller, Moore, Severson, Vatanen, and Lack. Your mileage may vary if that makes sense to you.
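As a rough illustration of the replacement-level averaging described in that excerpt - with entirely hypothetical players and coefficients, since the actual regression outputs are not published in this form - the grouping step might look like:

```python
# A sketch of the replacement-baseline step quoted above, with entirely
# hypothetical players and coefficients: group league-minimum players by
# position and average their exponentiated regression coefficients.
from collections import defaultdict

# Hypothetical exponentiated coefficients for league-minimum players.
min_salary_players = [
    {"position": "F", "exp_coef": 0.92},
    {"position": "F", "exp_coef": 0.88},
    {"position": "D", "exp_coef": 0.95},
    {"position": "D", "exp_coef": 0.91},
]

def replacement_coefficients(players):
    """Average exponentiated regression coefficients by position."""
    sums, counts = defaultdict(float), defaultdict(int)
    for p in players:
        sums[p["position"]] += p["exp_coef"]
        counts[p["position"]] += 1
    return {pos: sums[pos] / counts[pos] for pos in sums}

print(replacement_coefficients(min_salary_players))
# → approximately {'F': 0.90, 'D': 0.93}
```

The averages become the "George Generic" baseline each real player is measured against.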

Elk breaks down his model into six components: shooting rates for and against (WAR RF and WAR RA), shot quality for and against (WAR QF and WAR QA), shooting, goaltending, penalties taken and drawn (WAR PT and WAR PA), and zone transitions (WAR DZF, WAR NZF, and WAR OZF). The sum of those, compared with a replacement player’s goals added and converted using a 4.5 goals-to-win ratio, gives us WAR. OWAR is the offensive side of WAR and DWAR is the defensive side. The explainer goes into more detail about what goes into each one.
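To make the arithmetic concrete, here is a minimal sketch - my own illustration with made-up numbers, not Elk’s actual code - of how per-component goals added become WAR:

```python
# A toy illustration (made-up numbers, not Elk's actual code) of the
# component structure: goals added per component, minus replacement goals
# added, converted to wins with the 4.5 goals-to-win ratio mentioned above.
GOALS_PER_WIN = 4.5

# Hypothetical goals-added values for one skater, by component.
goals_added = {"RF": 3.0, "RA": 1.5, "QF": 2.0, "QA": -0.5,
               "shooting": 4.0, "PT": -0.5, "PA": 1.0, "transitions": 0.5}
replacement_goals_added = {"RF": 0.5, "RA": 0.0, "QF": 0.5, "QA": 0.0,
                           "shooting": 1.0, "PT": 0.0, "PA": 0.5,
                           "transitions": 0.0}

def war(player, replacement, goals_per_win=GOALS_PER_WIN):
    """Sum (goals added - replacement goals added) over components,
    then convert goals to wins."""
    gar = sum(player[c] - replacement[c] for c in player)
    return gar / goals_per_win

print(round(war(goals_added, replacement_goals_added), 2))  # → 1.89
```

A negative result would mean George Generic adds more than the player does, which is exactly what the model says about a chunk of the Devils’ roster.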

This is where the criticism really comes into play. It is not really clear whether what is being defined as defensive WAR actually reflects defensive play or performances. Per the explainer, Elk does not consider assists, does not break up performances by game situation, and does not include shootouts, which absolutely contribute to team wins in the regular season. The shot quality piece relies on another model - the expected goals model at Corsica - to function, and other components are driven by regression models. In other words, we have a WAR model supported by other models, so the flaws of those originals are now inherent in the WAR model. Oh, and a lot of the source stats are based on the events recorded by the NHL scorer at games, so errors and bias in the play-by-play log and metadata will also show up here. Lastly, this bold assumption about shot quality is stated without justification:

It is assumed that all skaters on the ice can exert an influence on shot quality.

I can understand a model not being intuitive and relying on some assumptions to make it work and coming up with some odd results. But there’s just a lot of head-scratching results with the Devils alone. Here are a few examples that stood out to me while I put the above chart together:

  1. For example, Severson and Moore were together throughout much of the season. Was Severson really that much worse in shot quality against when his expected goals against in all situations last season was 73.04, well behind Vatanen (90.27), Moore (92.65), and Greene (102.33)? And Severson’s xGF% was actually above 50% in all situations and third on the team, whereas those three were below 48% (Greene was at 40% xGF in all situations). These numbers come from Corsica, so it’s not like Elk doesn’t have this info at hand.
  2. For another example, Severson’s penalty WAR was positive (0.02) and Moore’s was quite negative (-0.3). I expect Moore’s value to be negative since Corsica has him at a -21 penalty differential; but Severson was at -9 (20 taken, 11 drawn), which is still not good. Is the replacement level -10? What replacement-level player survives taking nine more penalties than he draws?
  3. For a third example, Miles Wood came up third on the team in WAR shooting rate. Yet shots are defined in the model as unblocked shot attempts, also known as Fenwick. Wood was not at all bad in Fenwick, both in raw totals and in per-60 on-ice rate. But he wasn’t third best at it. Boyle had higher values, along with Hall and Hischier, yet Boyle rates lower in this category. And others had superior Fenwick-for percentages to Wood, yet also rated lower.
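On the penalty question in the second example, a crude back-of-the-envelope conversion - my own, not part of either model - shows why a -9 differential should hurt. The 0.17 goals per two-minute minor is an assumed ballpark figure for a league-average power play’s yield, used purely for illustration:

```python
# A back-of-the-envelope sketch (mine, not from either model) of converting
# a penalty differential into goals. The 0.17 goals-per-minor figure is an
# assumed ballpark value, used here purely for illustration.
GOALS_PER_PENALTY = 0.17  # assumption: rough value of one two-minute minor

def penalty_goal_impact(differential, goals_per_penalty=GOALS_PER_PENALTY):
    """Net goal impact of a penalty differential (drawn minus taken):
    drawing penalties helps, taking them hurts."""
    return differential * goals_per_penalty

# Severson per Corsica: 20 taken, 11 drawn -> -9 differential.
print(round(penalty_goal_impact(11 - 20), 2))  # → -1.53
# Moore's -21 differential:
print(round(penalty_goal_impact(-21), 2))      # → -3.57
```

Under that assumption, a -9 differential costs roughly a goal and a half, so a positive penalty WAR for Severson implies a replacement baseline even worse than -9 - which is the head-scratcher.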

Maybe these three examples involve other factors within those components. The explainer does mention some of the other factors but doesn’t reveal how much weight they carry (home-ice advantage is included in some of them, for some reason). That the components of the WAR model do not square with some of the data Corsica itself reports for all-situations (or other) play makes me even more skeptical of what the Corsica WAR model is telling me.

No model is perfect and no one has really solved the problem of how to measure defense, so it’s not fair to just throw Elk’s model in the trash. But there are real issues that come up if we take Corsica’s WAR to be a way to measure a player’s value over a generic replacement.

Let’s look at another model.

The 2017-18 Devils by Evolving Wild’s WAR Model

So the twins behind the @EvolvingWild account developed their own model for Goals Above Replacement and then converted it into Wins Above Replacement. It is structured differently from Corsica’s model. Each game situation is broken up with its own GAR count. There’s a separate one for penalties, and the sum of all four values gets to total GAR, which leads to WAR (and another feeling of deja vu).

The 2017-18 Devils by the @EvolvingWild WAR model
Chart by @EvolvingWild

Well, Hall is number one. There are also fewer players below the zero value. The EW model liked Brian Gibbons a whole lot, as well as Stefan Noesen. Palmieri is not as highly rated but still shows up well. Defensemen are led by Butcher and Lovejoy again, with Vatanen (and Mueller) getting more respect. This model still didn’t rate Severson well, but it was harsher on Bratt, Moore, and Stafford. Goalies are not a part of the EW model, which is understandable as their position is much different from forward and defenseman. If you clicked on the link, then you’ll note that the game totals are 82 or fewer - the EW model apparently only looks at the regular season. I like that for comparison purposes. I really like that for the players listed, it is based only on the games they played for the team - Vatanen is represented by his work with the Devils and not his time with the Ducks. (Maroon and Grabner didn’t make the 350-minute cutoff.)

For the sake of perspective, Hall’s 4.1 WAR is the second highest in the NHL, right in between Claude Giroux (4.3, the league leader last season) and Sean Couturier (4.0). Nico Hischier’s 2.7 WAR rates in the top 50 in EW’s model, which does have some out-of-the-ordinary names in it like Evgeny Dadonov, Dustin Brown, Mattias Ekholm (the highest defenseman at 2.9), and Yanni Gourde. I can understand how a model can cast new light on some guys; but does the logic work out?

I really don’t know. Maybe I missed it, but there’s no explainer or documentation that I’ve seen from the twins about what goes into this model or what is defined as replacement level. They presented part of it - the underlying model - at the Rochester Institute of Technology Hockey Sports Analytics Conference last month. (Aside: prior to this conference, there was a small beef about whether hockey analytics were progressing. I can understand the push-back on that, but the fact that the conference went from hockey-only to all-sports is a telling sign.) Hockey-Graphs has that presentation, which is something. But the slide deck goes into statistical plus-minus, which is really a trainable model that ends up with GAR and, by extension, WAR.

The slide deck does note a number of variables involved; but without a formulation, I cannot really go into whether there’s something in it that does or does not make sense. It apparently includes giveaways and takeaways? And GF/60 and SF/60 or GA/60 and SA/60 as, I presume, on-ice values? Zone starts seem to play a role? I see xGA/60 and xGF/60 involved, so, again, this model appears to take from another model. Penalties just appear to be a GF/60-to-GA/60 change by situation, which raises the question of why rates are used at all. But do those variables really match up with the common abbreviations on stats sites? All the same, without context or a formulation, I’m still at a loss on a few things. Does this model really take production into account? How is defense actually calculated? What goes into shorthanded value? (I would love to know why Lovejoy has such a high value of 2.5 for shorthanded play - one of the highest in the NHL - while his common partner, Greene, was dead even at zero.) And, most importantly, what is a replacement player in this model? Is George Generic a minimum-salaried player, a representative of 13th forwards and 7th defensemen, both, neither? Hall is 4.1 wins better than replacement - better than what, exactly? And is the second-best player by this WAR model really just four wins ahead of some replacement-level player? Say what you want about the Corsica model, but at least Elk provided some explanation of the point.
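For what it’s worth, the per-60 rates the slide deck lists (GF/60, SF/60, and so on) are straightforward: a raw on-ice count scaled to a common 60 minutes of ice time. A minimal sketch with made-up numbers:

```python
# The per-60 rates in the slide deck (GF/60, SF/60, ...) are just counts
# scaled to 60 minutes of ice time. The numbers here are made up.
def per_60(count, toi_minutes):
    """Scale an on-ice count to a per-60-minutes rate."""
    return count * 60.0 / toi_minutes

# e.g. 45 goals for over 900 minutes of ice time:
print(per_60(45, 900))  # → 3.0
```

The scaling itself is uncontroversial; the open question above is how those rates get weighted inside the model.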

It is tempting to think better of the @EvolvingWild model. I mean, the Devils look better as a group by it. It also explicitly breaks up even-strength play from penalty-kill and power-play situations, so credit is given to those who excel on the PK (Coleman, Lovejoy) and PP (Hall, Palmieri) on top of even-strength play. But there’s plenty missing that I would like to know so I could have an idea of what it means. The twins would do well to take a page from Elk and write something up. Maybe they’ll do that at Hockey-Graphs soon when they’re not being salty on Twitter. Either way, like Elk’s model, it’s something, but I hesitate before declaring that it really means something.

I Feel Like This...Deja Vu

These models stand out amid the newest, hottest debate of mid-August 2018. But there were attempts at a WAR-like stat earlier, and seeing OWAR and DWAR, skaters mixed with goalies, and GAR split up for various states all reminded me of one: Goals Versus Threshold (GVT), developed by Tom Awad and originally presented through the old Hockey Prospectus site back in 2009.

The principle was the same as WAR: a catch-all stat comparing a hockey player’s value in terms of goals against what a replacement player would have come up with - the threshold. It was based on a player’s output and some other stats, split up into offensive, defensive, and goalie/shootout components of GVT. This was back in 2009, so there was no expected goals model to simulate what each shot’s value would be, regression models and machine learning weren’t included, and other aspects like penalties were not even considered. It was a simpler model from a day when Corsi and the like were still being worked on and pushed forward.

There were plenty of explanations and posts all over describing what it was and how it worked. CJ, for example, discussed it in this 2014 post along with point shares - another attempt at a one-stat-for-a-player. It seemed interesting enough. Except it never really caught on. Stats sites didn’t really carry it. Its faults - especially for defensive value - were noted. It was not used as a point to argue what a player did; the rage was all about on-ice rates, both relative to the team and otherwise. As far as I know, GVT was not updated to include what was eventually learned about scoring chances, score effects, zone starts, and so forth. Plus, Rob Vollman informed me that Awad stopped calculating it after 2015-16. Combined with no one else picking up the mantle, we do not even have up-to-date data for it.

The thing is that while GVT is more of a curiosity of the past, these newer WAR models are still influenced by it whether they admit it or not. Yes, Elk and the Younggrens have much more going on, more complex math behind the scenes, and more information than what was known back in 2009. But their WAR models also struggle to define a logical defensive value that is more than just offensive events against them/their team. Their models also leave parts out in ways that leave me a little confused; GVT at least thought to try to include the shootout, and you’d think something that contributes to wins would be included in something called Wins Above Replacement. And the WAR models leave one wanting on what replacement level truly is - for GVT it was solely a baseline; I suppose Awad calculated out what it would be, but whether it was representative of the game is another matter. Most of all, CJ’s criticism from 2014 when he wrote about GVT would still hold with WAR:

That being said, catch-all statistics in hockey are admittedly in their infancy and I agree that it is doubtful that there will ever be a statistic that is comparable to WAR because baseball is a game dictated by “events.” Hockey is more fluid, but as our recording of the game events improves we may approach something that is at the very least a useful benchmark.

I will add this caveat: the issue with events is not really Elk’s or the Younggrens’ or Awad’s fault, or yours or mine. It’s a fault of how data is tracked, kept, and defined. Like the old adage of someone searching for their keys in the dark only around a lamp post because that’s all they can see, what WAR is shooting for and what GVT was shooting for may not actually be possible with what’s being done. Defense is definitely a big part of it. So is defining replacement - minimum-salaried players that do well tend not to stay minimum-salaried players, and those that don’t do well are usually not in the NHL. I think a lot of Cane’s and Dellow’s criticisms at The Athletic are well-founded. While I understand a stat can throw some people off with its conclusions, when a component or the total result is seemingly at odds with the stats based on actual events (and within the same site!), it only raises more doubt about what the model is really doing.

Some Concluding WAR Thoughts

I will agree that the newer WAR models are worth more than GVT was and have more going for them. They seem like several steps forward from where GVT left off. I can appreciate that analytics have involved more complex mathematical concepts in recent years to analyze whether players or teams are good or not; algebra and basic math can only go so far. But whether they are actually useful for you and me remains to be seen. Elk and the two Younggren boys are popular (for a given definition of popular) in the analytics scene, so their work may get pushed more by those who favor them. Then again, popularity of the source only goes so far. Awad was one of the early authorities on GVT with the support of early big names like Vollman. That didn’t mean fans (or teams) were going to use it at length. It remains to be seen whether WAR will catch on.

I mean, do you, the Devils fan, think more or less of the players from last season knowing this? Sure, it hypes up Hall more, but does it sway an opinion or cause someone to take a closer look at a guy like Vatanen or Wood? I don’t know. I’m more inclined to look at stats that state what happened when they were on the ice and what they did instead of a catch-all based on other models and a smattering of other stats that may not be in line with what I want to know.

What could really help - and it would be a good project - is if these models could be validated in some way. Rather than just running them, spitting out the top 20 players, and going, “Yep, these guys are really good so the model must be working,” a more structured approach to determine that each model does what it intends to do would go a long way to quell some of the concerns. It appears Elk and the Younggrens do include historical data in their datasets; that would be a good place to start. But that is just a suggestion.
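One simple version of that kind of validation - my suggestion, not something either author has published - is a repeatability check: correlate the same players’ WAR across consecutive seasons. The values below are made up for illustration:

```python
# A repeatability check (my suggestion, not either author's method): correlate
# the same players' WAR across consecutive seasons. All values are made up.
from math import sqrt

war_2016_17 = {"Player A": 3.1, "Player B": 1.2, "Player C": -0.4, "Player D": 2.0}
war_2017_18 = {"Player A": 2.8, "Player B": 0.9, "Player C": 0.1, "Player D": 1.6}

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

common = sorted(set(war_2016_17) & set(war_2017_18))
r = pearson([war_2016_17[p] for p in common], [war_2017_18[p] for p in common])

# High year-over-year correlation suggests the metric captures a stable skill
# rather than noise; a low one would be a red flag.
print(round(r, 3))
```

It would not prove a model measures what it claims to measure, but it would at least show whether the number is stable enough to describe a player rather than a season’s worth of bounces.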

What will help is that these two WAR models are not the only ones. From the Hockey-Graphs post about the RITSAC presentations, someone named Gordon Arsenoff had a presentation utilizing Markov chains as a means to develop WAR. His site, SALO Hockey, has more details, but it is a different approach - one that will not only lead to different conclusions, but whose methods may spark new thoughts about player contributions from a statistical standpoint. Or not. The point is that other models are out there and in development. Even if they do not fully work, make much sense, or drive fans to use them, they can make progress by experimenting with what can be calculated or thought out. Models are not perfect, which means they can be improved upon.
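To give a flavor of the Markov-chain idea - this is a toy of my own construction, not Arsenoff’s actual model, with made-up transition probabilities - you can treat puck possession as states and ask how often play starting in each zone eventually ends in a goal:

```python
# A toy Markov-chain possession model (my construction, not Arsenoff's):
# transient states are puck possession by zone; GOAL and STOP are absorbing.
# All transition probabilities are made up for illustration.
transitions = {
    "OZ": {"OZ": 0.50, "NZ": 0.30, "GOAL": 0.05, "STOP": 0.15},
    "NZ": {"OZ": 0.35, "NZ": 0.30, "DZ": 0.25, "STOP": 0.10},
    "DZ": {"NZ": 0.45, "DZ": 0.35, "STOP": 0.20},
}

def goal_probability(transitions, iterations=1000):
    """Probability of eventually reaching GOAL from each transient state,
    via fixed-point iteration on p(s) = sum_t P(s, t) * p(t)."""
    p = {s: 0.0 for s in transitions}
    for _ in range(iterations):
        p = {s: sum(prob * (1.0 if t == "GOAL" else p.get(t, 0.0))
                    for t, prob in transitions[s].items())
             for s in transitions}
    return p

probs = goal_probability(transitions)
# Offensive-zone possession should be the likeliest to end in a goal.
print({s: round(v, 3) for s, v in probs.items()})
```

A player’s value in such a framework would come from how he shifts those transition probabilities, which is a genuinely different lens than counting shot events.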

As for the Devils, well, the two WAR models came up with some varying results on who is or is not good. The model at Corsica thinks a good chunk of the lineup was not adding value last season. EvolvingWild’s model was more forgiving (for lack of a better term) but was consistent in noting how Boyle, Bratt, Severson, and Moore did not add value. Both models thought Hall was great. They recognize his star-level performances from last season much more than NBC Sports. But, again, would I utilize this to make a point about a player or determine if someone needs to be improved upon? Possibly, but I wouldn’t use it as a be-all, end-all stat based on what I understand of either.

What do you make of WAR? Are these models the way to go for the future? Would you use them? What do you think of each model and how they rated the Devils from last season? Please leave your answers and other thoughts about WAR, what makes up these WAR models, and the Devils in the comments. Thank you for reading and you’re welcome that I did not use the obvious reference of “WAR - what is it good for?” that surely has been run into the ground by now.