Charting a new course: Part 2 - Coaches and Teams
Elo ratings, Baxes, Thompsons and big ass dashboards
Welcome to Stats Drop, an inundation of rugby league numbers.
This is part 2 of an introduction to Stats Drop metrics. You can read part 1 here.
If you’re receiving this by email, a reminder that the Datawrapper embeds work best on desktop, then next best if you click through on mobile and least best as presented in the email.
One of the nice things about quitting short form posting and posting exclusively in a longer format, other than not having tedious pedants in my replies and giving myself time to confirm whether this thought is worth sharing with the world, is the mitigation of context collapse:
Context collapse "generally occurs when a surfeit of different audiences occupy the same space, and a piece of information intended for one audience finds its way to another" with that new audience's reaction being uncharitable and highly negative for failing to understand the original context.
I would post a table or a chart to Twitter. The real freaks who followed my work would probably understand what it meant and sometimes hit the fav button or discuss the implications like a reasonable person might. Other times, you’d get an interloper, possibly a follower of a follower or someone who liked a joke I made three years earlier and had not paid any attention in the meantime, upset that Shaun Johnson is too highly rated or Nathan Cleary is too low rated. Against what? Who knows. It’s like they hadn’t read a single thing I’d written!
Rather than considering the possibility that there are nuances here, and that I’ve thought about this longer than they have, they would fire off a negative reply and then cease thinking about it, scrolling on or tabbing over to pornography or whatever it is people do to keep their id satisfied on the internet. I’d get annoyed and cause myself psychic damage for no gain.
Now, you’re all my prisoners. If you want the chart, you have to wade through 500 words of explanation that may or may not be all that clear to get to it.
This is another newsletter in that vein.
Elo
Elo ratings were developed by Arpad Elo to rank chess players and are now used in FIFA's official world rankings, among other applications.
I've been using Elo ratings since 2017 to estimate the quality of rugby league teams. Despite the lengthy introduction in part 1, the rating systems themselves haven't changed much since then. The spreadsheet I use to calculate ratings is now efficient enough for Google Sheets to host, but you won't notice that.
The average rating is 1500 and a higher rating reflects a better team. We can use the difference in Elo ratings to estimate the probability of a given team winning against a given opponent. I maintain two systems for each competition:
Form ratings are designed to capture short term performance and move quickly in response to recent results (reflecting roughly the last six to eight weeks in the NRLM). The system variables are optimised to maximise head-to-head tipping success. When two teams match up, an expected margin is calculated from their respective ratings. If a team beats the line, even if they lose the match, their rating goes up by exactly the same amount as the other team's rating goes down (see the sketch after this list). Form ratings only track regular season performance.
Class ratings are slower moving than form ratings and take multiple seasons to change significantly. Class ratings reflect teams' innate quality and act as a handbrake against reading too much into the last couple of matches. Unlike form ratings, class ratings only go up with a win, with the size of the change determined by the difference in ratings. Finals wins are weighted more heavily and grand finals more heavily still.
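To make that concrete, here's a minimal sketch of the shared machinery: the standard Elo logistic curve for win probability, plus a zero-sum, margin-based form update. The K factor, the rating-to-margin conversion and the squashing function are illustrative guesses, not the optimised values behind the published ratings.

```python
import math

def win_probability(rating_a: float, rating_b: float) -> float:
    """Standard Elo logistic curve: probability that team A beats team B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def form_update(rating_a: float, rating_b: float, actual_margin: float,
                k: float = 6.0, elo_per_point: float = 25.0) -> tuple[float, float]:
    """Zero-sum form update: team A's expected margin comes from the rating
    gap, and beating that line lifts A's rating by exactly what B's drops,
    even if A lost the match. k and elo_per_point are illustrative only."""
    expected_margin = (rating_a - rating_b) / elo_per_point
    delta = k * math.tanh((actual_margin - expected_margin) / 12.0)
    return rating_a + delta, rating_b - delta

# A 1550 team beats a 1450 team about 64% of the time...
print(round(win_probability(1550, 1450), 2))  # 0.64
# ...and is expected to win by ~4 points; losing by 2 still costs rating
print(form_update(1550, 1450, actual_margin=-2))
```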
One change I did make this off-season was to calculate NRL Elo ratings back to 1988¹. While I will normally limit my analysis to the NRL era, it means that teams will start that period of analysis with a class rating that roughly reflects their actual class, instead of the standard initial rating of 1500.
The more clever readers will realise that this is a zero-sum game. They may observe that the last three premiership-winning clubs have all peaked in the last five years and then reason that this might lead to some lopsidedness across the league. While class Elo is only a measure of relative quality, not absolute quality, that is a reasonable conclusion to reach.
Coaches
It's all well and good to rate players but what about the big dogs telling them what they're supposed to be doing? I think there are two ways to think about a coach's impact on a team:
Are things better when they leave than when they started?
Are players out- or under-performing expectations?
We had previously attempted to measure these ideas with:
Changes in class Elo rating, which was the basis of The coaches that fucked up your club back in 2020.
Comparisons of pre-season projections to post-season metrics in what was called Coach Factor.
But now we replace them with what I am hoping are the slightly more comprehensible units of Thompson(s) and Bax(es).
Thompsons
We’ve talked a bit about Duncan Thompson previously. Thompsons are units to measure the change in class Elo rating of a team. Each point of change is one Thompson. A regular season game will be worth about five to seven Thompsons, depending on the teams’ ratings, positive if won and negative if lost. As we discovered in 2020, a loss of 50 Thompsons from the starting position is almost certainly a death sentence, although this is not the only way expectations can fail to be met.
Hopefully, we can all be adults about this and acknowledge the shortcomings in this approach without completely discarding it. The situation in which a new coach arrives has considerable impact on what is achievable. Wayne Bennett's first stint at the Broncos is the lowest-rated tenure, joined by Steve Folkes and Bob Fulton at the bottom of this table. They all had very good teams that could not be kept at that level over the length of their tenure. Bennett won three premierships, Folkes one and Fulton had overseen Manly's three consecutive pre-NRL grand final appearances, so these aren't exactly bums. There's a lesson here about the impact of arbitrarily slicing and dicing a coach's body of work, instead of letting the complete resume speak for itself.
More commonly, coaches come into a bad situation where the class rating is very low. Unless they actively sabotage their team through their incompetence, most teams will rise from a sub-1450 rating back to the mean through the magic of mean regression, inflating the record of otherwise mediocre coaches.
To mitigate, but not completely eliminate, this effect, we'll create an expected Thompsons (xTh) metric. Using linear regression, we'll predict the final rating from the initial rating to get a sense of what mean regression alone will provide (R² = 0.35) and try to separate that from what the coach provides. The difference between the actual Thompsons generated in the tenure/career and the expected Thompsons is Thompsons over expectation (Thox).
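For concreteness, here's a minimal sketch of that calculation with invented tenures; numpy's polyfit stands in for whatever actually produced the R² = 0.35 fit.

```python
import numpy as np

# Hypothetical tenures: class rating when the coach arrived and when they left
initial = np.array([1420.0, 1460.0, 1530.0, 1480.0, 1575.0])
final   = np.array([1500.0, 1470.0, 1490.0, 1520.0, 1540.0])

# Regress final rating on initial rating to estimate what mean regression
# alone tends to deliver over a tenure
slope, intercept = np.polyfit(initial, final, 1)

expected_final = slope * initial + intercept
actual_th   = final - initial            # raw change in class Elo (Thompsons)
expected_th = expected_final - initial   # xTh: the mean-regression freebie
thox = actual_th - expected_th           # Thompsons over expectation

for start, th, x in zip(initial, actual_th, thox):
    print(f"start {start:.0f}: {th:+.0f} Th, {x:+.0f} Thox")
```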
The effect of the mean regression is relatively mild, but Thox has a better correlation with grand final wins (0.42) than the straight change in Thompsons does (0.27), although not as good as the correlation with the sheer number of appearances (0.61)².
The problem with looking at this metric over a career is that Thox is only calculated once for each tenure, based on the initial rating, and is blind to how long that tenure is. After enough time, ups and downs start to cancel each other out, but changing clubs every couple of years allows a coach to get a fresh shot at piling on some easy Thox.
This highlights the crudeness of these metrics and that we will have to accept one number isn’t going to do anyone’s work justice - player or coach - not least because it is difficult to separate the impact of the coach on their players and vice versa. But, again, we’re all adults and we can deal with some crudeness.
Baxes
We’ve also talked a bit about Bob Bax previously. Baxes are the difference between a player’s pre-season projected Z score3 and the player’s actual Z score of a match. A player with a projected Z of 150 that puts up 120 in a game has created -30 Baxes for his or her coach. So many Baxes are generated, we will frequently have cause to use kiloBax (kBx) for 1000 Baxes and hectoBax (hBx) for 100 Baxes.
In theory, good coaches help their players achieve their best so will consistently generate more Baxes than other coaches.
There may be many reasons for a discrepancy between a player's projection and actual performance - injury, development, aging, career year syndrome, favouritism (or lack thereof) from the coach, familiarity or suitability (or not) with the coach's scheme, form of teammates - and it may seem unwise to assign all of these potential effects to the coach⁴, when they have a staff and a roster to account for. My assertion is that these effects will wash out with quantity and time, but there's always room for more nuance.
There’s some scholarly work to be done to trace the performance of coaches between competitions using these metrics - put that in the pile with doing the same for player performance - but also whether coaches are subject to the same career year syndrome as players.
My instinct is that they are, and that when things click into place for player or coach, the career year is very much a number of tail-end probabilities all coming up as a royal flush. The distribution of these events, whether the same or different for players and coaches, would be interesting to track.
Dashboards
So far we’ve talked exclusively about the NRLM, largely because its the most popular and familiar competition and has the deepest dataset to wallow around in. We also have four other competitions’ worth of data to work with and in processing it, have achieved one of my longer term goals of producing a nice looking dashboard with way too much information in it so you can make of the statistical situation what you will.
Here’s the history of the 70 minute era of NRLW:
Don’t even try to look at this on anything less than a 55” TV or preferably on a Minority Report-style hologram. Do not think that the email embed is doing this bad boy justice. I will probably just link to dashboards in future so you will be forced to click through and get the full experience.
There are explanations in the footnotes of the dashboard but here are some extended explanations of more familiar metrics that I’ve written about previously:
Pythagorean expectation: There is a relationship between points for and against and winning percentage, which is expressed by the Pythagorean expectation formula (a sketch follows this list). The formula is used to estimate the number of wins a team should have based on their points scored and conceded. Generally, if a team outperforms their Pythagorean expectation (that is, wins more games than the formula predicts), they will win fewer games in the following season, and vice versa; outperforming Pythag is not necessarily a good thing. Pythagorean expectation gets increasingly accurate at estimating win percentage over longer time periods.
SCWP (Should-a Would-a Could-a Points): Uses the principle of Pythagorean expectation but, in lieu of actual points scored and conceded, uses metres gained and conceded and line breaks gained and conceded to estimate the number of points the team should have scored and conceded. The results are called "second-order wins", distinct from the "first-order wins" calculated by standard Pythagorean expectation and from actual wins (sometimes "zeroth-order").
Disappointment line (DSP): The number of wins we expect for each team based on their pre-season class rating, as a proxy for general expectations of what will be possible in the coming season. An average team will be expected to win half of their games, an above average team will be expected to win more than half, and vice versa. Falling short of that number of wins indicates a disappointing season, and the greater the miss, the more disappointing the season.
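As a rough illustration of the first of those definitions, here's the Pythagorean formula in code. The exponent is sport-specific and 2.0 below is a placeholder rather than whatever value actually sits behind the dashboard; SCWP runs the same calculation over estimated points instead of actual ones.

```python
def pythagorean_win_pct(points_for: float, points_against: float,
                        exponent: float = 2.0) -> float:
    """Pythagorean expectation: estimated win percentage from points scored
    and conceded. The exponent here is an illustrative placeholder."""
    pf, pa = points_for ** exponent, points_against ** exponent
    return pf / (pf + pa)

# First-order wins from actual points: 500 for, 400 against over 24 games
games = 24
print(round(pythagorean_win_pct(500, 400) * games, 1))  # ~14.6 expected wins

# SCWP ("second-order wins") would feed this same formula with points
# *estimated* from metres and line breaks for and against, not actual points
```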
Once I’ve completed similar dashboards for the QRLW and men’s competitions, we’ll be able to get a better idea of what metrics matter, which metrics converge to a single point of information (e.g. this team is bad at defence) and what metrics are sustainable year-to-year or subject to mean reversion or some other statistical impact.
I have a couple of other ideas for metrics (a panic index that tracks wooden spoon probabilities via Monte Carlo sims, a point-minute lead metric to replace WCL⁵) that may or may not get added as I find time to calculate them and automate their implementation in the dashboard.
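The panic index is only an idea at this stage, but a wooden spoon Monte Carlo might look something like this sketch, with invented win probabilities for the remaining games and no accounting for wins already banked:

```python
import random

def spoon_probability(win_probs: dict[str, list[float]],
                      sims: int = 10_000) -> dict[str, float]:
    """Estimate wooden spoon probabilities by simulating the remaining games
    from per-game win probabilities. Teams are assumed level on wins so far,
    and ties for last are broken arbitrarily."""
    spoons = {team: 0 for team in win_probs}
    for _ in range(sims):
        wins = {team: sum(random.random() < p for p in probs)
                for team, probs in win_probs.items()}
        spoons[min(wins, key=wins.get)] += 1
    return {team: count / sims for team, count in spoons.items()}

# Three hypothetical teams with five games left
print(spoon_probability({
    "Team A": [0.7, 0.6, 0.8, 0.5, 0.6],
    "Team B": [0.4, 0.5, 0.3, 0.5, 0.4],
    "Team C": [0.2, 0.3, 0.4, 0.3, 0.2],
}))
```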
Programming notes
Stats Drop will be making a regular return in 2025. I’m aiming to send two things out most weeks, which seems to be a broadly sustainable workrate. One will be the regular Wednesday newsletter, trimmed down a little, and the other will be from one of the other sections (Stats Drop, Pony Picayune, Bovine Bulletin) or something standalone.
That means Stats Drop will be roughly monthly-ish, maybe twice monthly if I have enough good ideas, time and energy. The next Stats Drop should be a return of the Deep Dive into the NRL season, which I haven't done in three years, although I have missed fussing over the numbers.
The Dataset™ will be made available in due course but it will only be available to paid subscribers. Paid subscriptions are not yet turned on, as I haven't quite figured out what will sit in front of and behind the paywall. Subscriptions and other financial support will mostly subsidise a laptop with a functioning battery and more RAM.
I am using that due course to fill in some gaps - mostly event data for the NRL and QCup - that were delayed in the interests of first completing these posts, albeit I am already a bit behind my intended publishing schedule (see lack of RAM) and we haven't even gotten close to season kick-off. I am looking forward to a productive 2025.
1. The other was to trim NSW Cup back to 2015 to align with available player data and because it's not as clearly delineated from its predecessor competitions as the Queensland Cup is.
2. I did try to get xTh to include appearances but the multi-variable regression didn't want to cooperate to produce something useful. Appearances are at least partly a consequence of grand final wins, rather than a cause (more grand final wins leads to more job opportunities with a better chance of extending a career, as well as eventually stumbling over a trophy if you try long enough).
3. If no projection is available because the player is a rookie or returning to the league, I use the intercept from the linear regression, which is a Z score of about 40 in the men's comps and 60 in the women's. 90 or 100 might be a better assumption, as a random selection of players will tend towards the mean, but if we were going to improve the predictive power of Z score, rather than accepting it as a pretty number, we'd also want to weave projections into pre-game expectations and I haven't done that yet.
4. Coach factor used to be divided by two to address this but that never changed anything when comparing between coaches because they all had their scores divided by two. Like Taylors, Baxes have no inherent meaning and only serve as a token for comparing different coaching regimes.
5. WCL was the old way of calculating in-game probability based on the time on the clock and the margin at that point in the game. That was fine but not particularly special or rigorous. The probabilities would correlate strongly with the margin, so it's simpler to score a team by the product of how much they are leading by and how long they hold that margin, summed over the game.
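For illustration, that tally might look like the following, with invented scoring events and golden point glossed over:

```python
# Hypothetical scoring events: (minute, home margin after the event)
events = [(0, 0), (12, 6), (25, 4), (55, 10), (71, 16)]

def point_minutes(events: list[tuple[int, int]], full_time: int = 80) -> float:
    """Sum of (margin held) x (minutes held) across the game: effectively the
    integral of the margin over time. Positive favours the home side here."""
    total = 0.0
    for (minute, margin), (next_minute, _) in zip(events, events[1:] + [(full_time, 0)]):
        total += margin * (next_minute - minute)
    return total

print(point_minutes(events))  # 6*13 + 4*30 + 10*16 + 16*9 = 502 point-minutes
```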