Charting a new course: Part 1 - Players
Production, replacement level, wins above reserve grade, Taylors and Z score
Welcome to Stats Drop, an inundation of rugby league numbers.
I haven’t done a Stats Drop in a minute. There are reasons for that, which we’ll get to shortly, but you might be wondering why this looks the way it does. With the success of the Pony Picayune experiment[1], I’m spinning Stats Drop off into its own section.
This means, if you were so inclined, you could just subscribe to this part of the newsletter, which is going to be more generally NRL-oriented than the other parts. This might be more to your taste than me regularly castigating Phillip Street for their latest act of anti-Queensland perfidy, or you could dodge the numbers and stick to the perfidy. It’s up to you. Personally, I like the pontification and the quantification.
If you’re receiving this by email, a reminder that the Datawrapper embeds work best on desktop, then next best if you click through on mobile and least best as presented in the email.
You can read Part 2 of this introduction to Stats Drop metrics here.
Back in 2023, I sensed a problem. Greg Marzhew was rated really highly and Payne Haas was not. This might seem like an inherently anti-Knights/pro-Broncos bias to have, which I do, but Marzhew had accumulated 1.9 wins above reserve grade (WARG) to that point while Haas had 0.8.
Even accounting for Marzhew’s career year and a (hypothetical only) lull in Haas’, has Marzhew ever really been twice the player that Haas is? Over the course of most of a season? Irrespective of how we might choose to measure “twice”, I don’t think Marzhew was on pace for a $1.5 million season.
On closer inspection, it seemed the best middles (of which Haas was one of the most highly rated) and edges were producing at about the same total rate as the bench players, which really doesn’t make sense and merited some investigation.
It turns out if one of the key elements of a player rating system is running metres, and the denominator to assess whether players are out- or under-performing is based on the five previous seasons’ data but not what is happening right now, and the current season is being played at a significantly different rate (e.g. a change of 20% over just four seasons), then the stats are not going to track the eye test well.
The solution was not that complicated. We can use the current season’s data as the denominator to capture league-wide tactical shifts while accepting that the early season numbers will be wonky due to small sample size. I thought I’d also take on a few other issues on the periphery while doing this refurbishment but that took about six months of idle brain power to arrive at the broad strokes of what I wanted it to look like.
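The fix, as a minimal pandas sketch (the frame and column names here are illustrative, not the real ones):

```python
import pandas as pd

def metres_vs_league(games: pd.DataFrame) -> pd.Series:
    """Express each player-game's running metres relative to the league
    average for that same season. `games` needs columns: season, run_metres."""
    # The denominator is the *current* season's average, not a trailing
    # five-season window, so a league-wide 20% jump in metres doesn't
    # make every forward look like a world-beater.
    season_avg = games.groupby("season")["run_metres"].transform("mean")
    return games["run_metres"] / season_avg
```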
By then, I had fallen a long way behind on data collection, and that backlog needed clearing. The NRL has now posted player stats back to 1998, so obviously I had to go and get that as well. I then figured I may as well go back to first principles, recollect some data I already had (match and coaching details from Rugby League Project, mostly) to make sure it was correct and consistent, and rebuild it all from scratch. Once collected, the data had to be processed, regressed, analysed and formatted. I effectively did this last bit twice. All of that took a long time, not least because my laptop kept running out of memory.
This post, and the next one, is what came out of that.
Abstraction
To restate the aim, we are trying to build a model of rugby league that uses statistics to provide interesting insights that can complement, confirm or correct the eye test. We are not going to be right but we are, to borrow a hat tip from friend of the newsletter Rugby League Eye Test, going to be less wrong.
If we’re going back to first principles, let’s really get back to first principles:
1. Waves ripple through the quantum fields that make up our universe.
2. These quantum interactions present as a stadium, field, players, referee and ball at a macroscopic level, which will then play out as a game of rugby league in accordance with both the laws of the game and of physics at a rate of one second per second.
3. Photons will bounce off various surfaces and enter an observer’s retina, creating electrical impulses which are in turn interpreted by their brain on a slight delay, causing all sorts of cognitive responses (e.g. disgust if watching the Tigers, horror if watching the Panthers, etc).
4. A different set of photons will enter a camera lens, which will undertake an in-principle similar electronic processing of those photons but present them as digital bits of information which will be transmitted, stored and displayed on various computers or TV monitors.
5. Later, someone will pull up that information, play it as a video and put it through some software and produce some numbers that represent elements of play.
6. Those numbers will be posted on the internet.
7. An intrepid newsletter writer will copy and paste those numbers into a spreadsheet over an off-season and build a statistical model of how rugby league works.
Modelling rugby league at a quantum level is conceptually possible but practically not. I’m not going to estimate how many atoms and electrons are involved in a game but will just assume it’s a big number and it only gets exponentially more complicated from there. If you’d like to know more, see Sean Carroll’s Quanta and Fields, but that’s how a perfect mirror of reality would be built.
Fortunately, no one needs this level of accuracy (less Heisenberg’s uncertainty), although some lunatics might insist that if you’re not going to have this level of accuracy, then it’s a waste of time. Those people are quantum weenies. We can still glean insight, provided we understand what we’re working with.
The process of abstraction removes information - the colour of the home team’s jersey is recorded on video and in visual memory, but not in statistical data - in exchange for the potential for comprehensibility. Abstraction omits data and adds artefacts to the data as it proceeds through the levels. An observer might not be able to see what’s happening in the opposite corner at level 3. The frame rate and resolution of the camera might not be sufficient to see exactly when the ball touched a specific blade of grass at level 4. Video compression and pixelation or a coding error might turn nine metres into ten at level 5.
There are a number of issues introduced at level 6, which you would think would be relatively straightforward but is illustrative:

- The 1998 stats are borderline non-existent and, where they do exist, untrustworthy. The most egregious example is the Paul Carige game having Carige listed as making one (1) error.
- The player stats for pre-2013 grand finals, which it looks like someone tried to take some care with, are in worse shape than the rest of those seasons’ data.
- Games involving the Gold Coast Chargers or Northern Eagles do not show the away team’s stats, though it only affects those two teams.
- Because of the changes to field goal scoring, field goals are now recorded as 1 point or 2 point, except no one went back to fix the old games, which were tagged as just ‘field goal’, so these don’t show up in the scoring summaries.
- Sometimes the score shown on the draw page is not the same as the score shown on the match page.
- I don’t think it’s right to say that the Eels played their home games in 2001 at Bankwest Stadium or that the Northern Eagles played at Lottoland.
Some of this might make clear why supporting Rugby League Project is so important but a lot of it looks like the typical sort of care and attention to detail that we’ve come to expect from the PVL NRL (see thesis #7). I naturally tend towards ‘if you’re not going to do it properly, don’t do it at all’ dad logic, but I also don’t want someone to do something silly like deciding to follow through and take it all away. Most of the data is fine but a few things need fixing up.
With all of that in mind, we can still build a statistical model that does a passable job of representing reality. The compilation of various NRL, QRL and NSWRL stats with data from Rugby League Project and a light sprinkling of Wikipedia comprises The Dataset™. The Dataset™ is currently a 66.8 MB Excel spreadsheet and is, somehow, still not complete but covers five southern hemisphere competitions: NRLM, NRLW, Queensland Cup, QRLW premiership and NSW Cup.[2] This sounds impressive but is only useful if we understand its limitations.
We’re going to use The Dataset™ to build some useful metrics to analyse the game. Today, we’re looking at the players. Next time, we’ll look at coaches and teams.
Production
We have something like 185,000 lines of player statistics (to one degree or another) over the history of the NRLM. Add another 50k for each of the state cups and 5k for each of the women’s competitions and that’s a near overwhelming amount of data to turn into something useful.
Some of those statistics correlate with winning. The canonical selection from the NRL.com suite comprises tries, running metres, line breaks, line break assists, try assists, tackle busts, kicking metres and, inversely correlated, errors and missed tackles.
Having done this for so long, I don’t particularly want to relitigate why these statistics matter other than to note this is a slimmed down set from predecessor systems, which included statistics like hit ups and dummy half run metres that also correlate but are kind of double-counting running metres.
Accumulation of the statistics that correlate with winning is called production.
Not every player contributes in the same way on the field but they all function as parts of the team and they all affect the result of the match. Few props score tries but many wingers do. Does that make wingers better players or more valuable than props? We need a reference unit to which we can convert different statistics to allow for comparison of production. We call that unit the Taylor (Ty for short).[3]
There are, unfortunately, many ways to convert one number into another. Here are two that I will use.
FTy
FTy are Taylors created by playing footy (that's the F). If a player does good (bad) stuff, he or she gets credit (docked) for it. That's what playing footy is all about.
To establish the correlation between the statistics and the outcome, we put games into buckets. Each bucket represents a performance in a given category in a given season. For example, there were 56 instances of teams scoring exactly five tries in a game in the 2005 season. Of those, 40 won their matches. That bucket has a winning percentage of .714. Repeat for ten buckets (zero tries, one try, two tries, etc) in each statistical category (tries, assists, breaks+busts, etc), with enough games to meet the minimum, for each season and do a linear regression[4] between the quantities of the statistic and the winning percentage of the buckets. The variables that make those statistics have more or less impact in a specific game situation will even out over a long enough time period[5], although no player rating system is immune to chronic stat padding.
The linear regression gives a gradient (slope), which we will use to convert the categories into Taylors, and a coefficient of correlation, which we will use to weight each of the categories against each other, with a heavier weighting given to higher coefficients, when adding up the categories. This is the same technique that has been used since 2019.
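A minimal sketch of that bucket-and-regress step, assuming a frame of one season’s team-games with a 0/1 `won` column (all names here are illustrative, not what The Dataset™ actually uses):

```python
import pandas as pd
from scipy.stats import linregress

def exchange_rate(team_games: pd.DataFrame, stat: str, min_games: int = 10):
    """Bucket one season's team-games by their count in `stat`, regress
    bucket winning percentage on that count, and return the slope (the
    conversion into Taylors) and r (the weight for this category).
    Continuous stats like running metres would need binning first."""
    buckets = (
        team_games.groupby(stat)["won"]
        .agg(win_pct="mean", n="size")   # e.g. 56 games with 5 tries, .714 win%
        .reset_index()
    )
    buckets = buckets[buckets["n"] >= min_games]  # drop thin buckets
    fit = linregress(buckets[stat], buckets["win_pct"])
    return fit.slope, fit.rvalue
```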
A similar chart could be produced for the other categories but it wouldn't look as neat. In some seasons, categories like kicking metres and errors have low to very low correlation to winning and so lots of kicking metres or errors are required to have a measurable impact. In most seasons, these things matter.
The advantage of this approach is that it is relatively simple, which makes it robust. Production is calculated independently of listed position or time on field, which means if this data isn’t available, we can still do a reasonable job of calculating something like FTy.
We only have minutes played from 2001 onwards, which I assume is related to changes to the interchange rules from the Super League era to what we are more familiar with today. Some categories are not available until 2003. But we can still pull together something for this dark age of the NRL's statistical history, should someone be interested in knowing the third most productive second rower in season 2000 (it was Gorden Tallis, 24.2 FTy).
Summing each player’s pre-game average FTy per appearance in that season will yield a head-to-head tipping percentage of 60.0%[6] in NRLM regular season games (excluding round 1, n=4642) and using a career average will bump that up to 61.1%. State cup rates of return are similar, within about a percentage point either way, but closer to 70% in women’s competitions.
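The “pre-game average” is the leak-proof part: each tip only uses what a player had produced before that game. Something like this sketch (column names are mine):

```python
import pandas as pd

def pregame_avg_fty(player_games: pd.DataFrame) -> pd.Series:
    """`player_games` is sorted chronologically and has columns: player, fty.
    Returns each player's average FTy per appearance *before* that game,
    so the tip never peeks at the result it is trying to predict."""
    grouped = player_games.groupby("player")["fty"]
    # expanding mean shifted by one game = strictly pre-game information
    return grouped.transform(lambda s: s.expanding().mean().shift(1))
```

Sum those over the seventeen named players per side and tip the bigger number.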
However, we are not that concerned with this specific kind of prediction, although I will dabble in it. Genuinely good players will get more game time and more opportunities through the coach’s tactics and so naturally generate more production. It takes on the characteristic of a circular argument. We cannot measure the efficiency or the panache or the effectiveness with which the production was generated - you have to watch the games for that - but over time, accumulating more tries, assists, breaks, metres and minimising errors and missed tackles are some of the hallmarks of what we consider good players.
PTy
In the land of 100 metre forwards, the 120 metre forward is king. FTy measures production in a brute force way: the accumulation of raw numbers trumps all and 120 metres is 20% better than 100 metres. PTy attempts to identify that extra 20 metres as the important bit and treats the first 100 metres as unimportant. If every player can run for 100 metres, then the game will be decided in the yardage above that threshold. If every player can score a try, then the result of the game will rest with the player that can score a second try.
Mathematically, PTy is somewhat simpler to derive than FTy. We calculate the margin on the scoreboard but also the margin in tries, running metres (broken into categories of pre-contact, return and post-contact where available), errors, kicking metres, missed tackles, assists (line break and try treated separately), line breaks and tackle busts. Within each season, we regress the margin on the scoreboard against the margin in each statistical category to get a gradient (slope) and coefficient of correlation, and we then build up the exchange rates for each season in a similar fashion to FTy.
The difference is the player inputs. Instead of working with the raw quantity, we calculate the average try scoring rate per minute played at that position and subtract that from the try scoring rate per minute played by each player in each game (minimum 25% of average position game time) and multiply that by the exchange rate. In this way, we highlight the contributions above average[7] per time played at that position, which is where the outcome of the game can really swing.
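Per category, the player-side calculation reduces to something like this (a sketch of my reading of the step above, not production code):

```python
def pty_category(player_per_min: float, position_avg_per_min: float,
                 exchange_rate: float) -> float:
    """One category's PTy in one game: the player's per-minute rate, less
    the positional average per-minute rate, priced at that season's
    exchange rate. Positive = above average, negative = below."""
    return (player_per_min - position_avg_per_min) * exchange_rate
```

A winger scoring tries at exactly the positional per-minute rate contributes 0 PTy from the try category, which is the whole point.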
A player that does nothing in the categories that correlate with winning will rate 0 FTy but approximately -2 PTy for the same performance. Playing to the average level rates 0 PTy but roughly +1 FTy. FTy and PTy have a coefficient of correlation of 0.5 for the NRLM. This is not surprising given the input data is largely the same, but it does show that somewhat different things are being measured.
Z score
In and of themselves, PTy and FTy are not that easy to understand, which makes them less useful for communicating to an audience. It'd be nicer to repackage that information in a single number that is more intuitive. If we set the average production in a game (in both F and P) at that position as 100, express production as a ratio of that average (capped at -250 and +250), and then average the F and P components, that will do the job. A score of 70 is 70% of average. 120 is 20% better than average. Negative numbers bad. Big number is better number.
We will call this the Z score (or just Z, I forget why I picked Z). Z is a rate statistic of production per game relative to a positional average, in contrast to Taylors which measure volume of production with differing yardsticks.
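A minimal sketch of the arithmetic (I’m assuming a nonzero positional yardstick for each component; exactly how the P side, where an average performance is 0 PTy by construction, gets rescaled is glossed over here):

```python
def z_component(production: float, positional_avg: float) -> float:
    """Production as a percentage of the positional average
    (100 = average), capped at -250 and +250."""
    ratio = 100.0 * production / positional_avg  # assumes a nonzero average
    return max(-250.0, min(250.0, ratio))

def z_score(f_comp: float, p_comp: float) -> float:
    """Z: the plain average of the capped F and P components."""
    return (f_comp + p_comp) / 2.0
```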
Z is mostly useful for assessing performance in individual games or production across a season. It is similar in many respects to TPR, the previous player rating for rate of production, but instead of looking like a batting average, it will look more like weighted runs created.
Z scores also form the basis of player projections, which set a mean regressed expectation of what production we might expect out of a player based on their previous one, two or three seasons of play. While there’s still a bit of work to be done here, it’s largely the same as it was in 2019, two player rating systems ago. The mean absolute error between projections and actual performance is in the ball park of 20 points on the Z score (on average, a player projected at 90 will land somewhere between 70 and 110). The dream is to be able to cross-project Z scores between competitions but I haven’t gotten around to that in five years, so I’m not sure when I’m going to get to it.
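The shape of a mean regressed projection is roughly this; the recency weights and the regression fraction below are placeholders for illustration, not the real ones:

```python
def project_z(prior_seasons: list[float], league_mean: float = 100.0,
              shrink: float = 0.3) -> float:
    """`prior_seasons`: Z scores, most recent first (one to three of them).
    Weight recent seasons more heavily, then pull the estimate part-way
    back toward the league mean."""
    weights = [3, 2, 1][: len(prior_seasons)]  # placeholder recency weights
    recent = sum(w * z for w, z in zip(weights, prior_seasons)) / sum(weights)
    return (1 - shrink) * recent + shrink * league_mean
```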
Here’s every NRL player for whom we have data (i.e. no 1998, no away teams for games involving the Northern Eagles), ranked by the number of games they had with a Z score of +250 (the maximum allowed). Of the 165,843 Z scores currently calculated, only 5.8% are +250, so it’s something like getting a 10/10 by production. The absolute best players by this metric get one in every four appearances, which is astonishing, even if this way of looking at things is not particularly meaningful.
Replacement level
Replacement level is a concept borrowed from baseball sabermetrics. For our purposes, a replacement level player is the cheapest player that is readily available that can play regular minutes in first grade. This is not quite the same as being the best player in state cup or being the worst player in the NRL. There are below replacement level players in the NRL and above replacement level players in QCup because the market is not frictionless.
An average player is more valuable than that tag would imply, and there are far more below average players than average or above average players. If a team’s star goes down with an ACL injury, they are unlikely to be replaced by an average player. It is far more likely that a replacement level player will be filling in. That perspective changes both assessment of the match probabilities and how one goes about pricing production. See also: Moneyball.
If it helps, Richard Swain, Casey McGuire and Rory Kostjasyn are three men’s players with over 100 NRL appearances and close to zero wins above reserve grade (replacement). I don’t remember Casey McGuire’s career all that clearly, other than he was one of those Broncos, but Kyle Flanagan, Drew Hutchison and Jaeman Salmon are more recent examples in the same range with fewer appearances. Danny Levi used to be the poster boy for replacement level but he’s put together a few tenths over the last few seasons and has moved away from zero.
The previous version of wins above reserve grade axiomatically stated that a team of replacement level players would win two games on a 24 game first grade schedule. The proportion of NRL/ARL/NSWRL teams since 1988 that fell below this threshold is roughly the same as the proportion of teams that fell below replacement level (.320) in MLB since 1901. We will maintain this but use it in a more cosmetic, and less fundamental, way.
Previously, we would use linear regression to estimate what kind of production a 2-22 team would generate and use that to set replacement level. This would work out to around .055 to .070 TPR in the old money. Now, we take the number of teams in the league, multiply by 17, and the player at that rank in terms of FTy/game and PTy/game (for a minimum number of games played) is deemed the replacement level cutoff. Everyone above him or her is deemed above replacement level. E.g. in a 16 team NRL, the replacement level is set at the 272nd best player as rated by FTy/game and PTy/game with a minimum of 10 games played.
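In code, the cutoff is one sort and one lookup. FTy side shown; PTy works the same way, and the column names are illustrative:

```python
import pandas as pd

def replacement_cutoff(players: pd.DataFrame, n_teams: int = 16,
                       min_games: int = 10) -> float:
    """`players` has columns: games, fty_per_game. Returns the FTy/game of
    the (17 * n_teams)th ranked qualifier, i.e. the 272nd player in a
    16 team league. Everyone above this rate is above replacement."""
    rank = 17 * n_teams
    rates = (players.loc[players["games"] >= min_games, "fty_per_game"]
             .sort_values(ascending=False))
    return rates.iloc[min(rank, len(rates)) - 1]
```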
While this is still mostly a measure of relative quality rather than absolute quality, and might seem like complete gibberish, what the graph says is: there are a lot of replacement level guys lying around. We can vary the number of teams in the NRL and the threshold for what we consider regular minutes and the value that comes in for replacement level is fairly consistent over time[8].
On the other hand, what constitutes the most productive in a given year fluctuates quite a lot, depending on individual player form, the spread of competence within his or her team and across the league and the nature of the sport in that season. As we know from experience, these factors can vary significantly.
When one is assessing the quality of any sporting enterprise, it pays to think not just about the best but also what the yardstick is. If the yardstick is pretty well constant, then when club bosses like to complain about dilution of existing talent and widening the gap between have and have nots, we can conclude they’re really complaining that there aren’t enough 2021 Tom Trbojevicii to go around to keep their jobs secure.
This perspective ignores that the rarity of the star player is what makes the star valuable. Aligning the stars is part, maybe the most important part, of the job of a club boss: if everyone has a star, no one does. There is only one 2021 Tom Trbojevic because even Tom Trbojevic isn’t 2021 Tom Trbojevic. They can’t all have him and someone has to lose but filling in the bottom half dozen spots on the roster should be a piece of cake.
Wins Above Reserve Grade (WARG)
In the previous system, the number of WARG calculated in a given season would vary dramatically, despite the fact that there are basically the same number of wins above reserve grade available every year (number of regular season games played minus replacement level team wins). If production inflates, as it did in 2021, and the bar to clear for an average performance moves from 1400 metres per game to 1800, then it doesn’t necessarily mean that the players in 2021 were historically special, generating twice as many WARG as 2013’s cohort. While this is one way of looking at it, I think it is better to think of those metres as being diluted in value because they’re chasing the same number of wins.
Accordingly, I changed the way WARG were allocated. Each player’s production above replacement level, as a proportion of the whole league’s production above replacement level, is multiplied by the number of available wins above reserve grade. For example, in 2024, the NRLM had 202 available wins, the NRLW had 44.3, QCup 148.3. A very productive player might generate 0.5% of the league production over replacement and so would be entitled to 0.5% of those wins. It’s that simple.
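That allocation, as a sketch (treating sub-replacement production as a zero share is my assumption here):

```python
import pandas as pd

def allocate_warg(over_replacement: pd.Series, available_wins: float) -> pd.Series:
    """`over_replacement`: each player's production above replacement
    level, indexed by player. Splits the season's available wins in
    proportion to each player's share of the league-wide total."""
    share = over_replacement.clip(lower=0)  # assumption: no negative shares
    return available_wins * share / share.sum()
```

So a 0.5% share of the NRLM’s 202 available wins comes out at right on one win above reserve grade.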
WARG is a volume statistic, useful for considering production over a season or a career. A career with 6 WARG is good enough to be in the top 10% of all players with a NRLM appearance (now familiar Dataset™ disclaimers apply) and more than 19 is required to be in the top 0.5%. That is, in no particular order, Darren Lockyer, Matthew Bowen, Paul Gallen, Cameron Smith, Johnathan Thurston, Billy Slater, Robbie Farah, Benji Marshall, Cooper Cronk, Greg Inglis, Jarryd Hayne, Daly Cherry-Evans and James Tedesco at the end of 2024.
We can account for variations in number of appearances by looking at Single Season Pace (WARG divided by games played times regular season games). The equivalent top tier of 0.5% with at least 50 appearances and a SSP of at least 2.3 WARG per 24 games, are Darren Lockyer, Andrew Johns, Brad Fittler, Jarryd Hayne, Tom Trbojevic and Ryan Papenhuyzen.
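For example, 1.5 WARG from 18 appearances paces out to 1.5 ÷ 18 × 24 = 2.0 WARG over a full 24 game season.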
I give the PTy generated over replacement three times the weighting of FTy production in allocating WARG. This is a relatively arbitrary number that I suspect does not matter much, but I have not spent time fine tuning it as yet. If the chart below changes slightly, that is because I have altered some variables.
A related issue I had sensed in 2023, but did not recount in the introduction, was compiling a list of highest rated players by WARG over the previous decade that featured Anthony Milford in the top ten. I could see top twenty or thirty, maybe, as he was playing well until covid, especially in the years of covering for declining Darius, but top ten? He’s now top 35 over that decade, which feels more appropriate.
Incidentally, Haas finished 2023 with 1.5 new WARG and Marzhew 2.2, which feels closer to reality, if still a strange legacy of that year’s super-yardage outside backs.
1. Views for Pony posts were 10-20% higher than the non-Pony posts around them and the season review was the most viewed post of last year, closing in on Laybuttian territory.
2. Origin is mostly there, for a sixth competition, and will be done some time this year. I will then work on integrating internationals and Super League ahead of 2026’s World Cups because one of my favourite things I did was a Deep Dive into the 2021* World Cup, charting as many players’ stats as I could.
3. It is entirely unimportant that 1 Taylor is definitionally equal to the production of 1 try. That is, the amount of production involved in catching a ball, successfully grounding it and being awarded four points. This does not include the amount of production required to get into that position in the first place. I already regret this footnote for the amount of confusion that will be created for the sickos who read it.
4. I remember when I first did this someone emailed me to say I should have done logistic regression instead of linear, except that I only recently realised that games have three outcomes, not two, and so that technique would not be applicable in any case.
5. This is an assertion, not a proven fact.
6. 60% isn’t going to win you many bets or tipping comps but that’s also not the point. It just shows there’s some validity to the approach, even without some of the fine tuning I still plan to do. TPR was around 63% and you could probably tune the weightings to get that number up if that was your aim. As a lark a few years ago, I combined that with Elo ratings as a go/no go and calculated that I could reliably predict the outcome of NRL games to generate the same return as an ASX ETF would give you over the course of several years. That assumes your bookmaker won’t cut you off during the good years you need to offset the bad years, something that Vanguard won’t do. Not financial (but still good) advice!
7. Conceptually, it might be more consistent to use replacement level, rather than average, but we need these numbers to define replacement level.
8. Whether there are more or fewer sub-replacement level players in the NRL will definitionally depend on how many players the clubs use and how many games they get. More churn will equal more sub-replacement level players.