So, I decided to revisit my Pokemon Red Elo estimation work, and decided to write this article up as a test for some of my new rendering capabilities. This article will primarily go over deriving Elo.
Deriving Elo
Elo ratings are typically used in tournaments between players/teams, such as chess. It’s well known that if a player with a lower Elo defeats a player with a higher Elo, the increase/decrease for the player is larger than say if they were comparable in ‘skill’. The Elo rating is a measure of that skill. Let’s derive it. Suppose we have a ratings vector , then our goal is to learn this vector.
and battle with ratings and . Then we define the Elo update for and as:
- Here means A dominates (beats) B
- Where if , and if wins.
- is the scale factor, or how impactful a loss is.
Note if it’s a tie, then it’s considered both a loss and win for simplicity. If was an underdog, say Youngster Joey, taking on E4 Agatha, then the probability of him beating her would be much lower, and thus his Elo would go up quite a bit. On the other hand, it would be expected of Agatha to beat Joey, so the Elo increment is miniscule in comparison.
But what is ? We want the odds of beating to be a function of and , that is there is some such that:
And that . We can disregard ties for the moment since they’re not a substantial issue here.
Because the probabilities add to , we have yielding:
Now, what function reasonably satisfies this? The simplest answer is an exponential of the form , which when substituted yields:
The choice of base and determines how much a win means at a certain odds ratio, and aren’t really too important. This is essentially a sigmoid as well. Now, let’s say we have our data from the tournament, which could look something like the following:
win, trainer_1, trainer_21, AGATHA, YOUNGSTER0, GREEN1, LANCE...
Here, trainer_1 is the defender and trainer_2 is the contender, meaning that the win is whether the defender won or not. But this data sounds like it would come from something that’s like a Bernoulli distribution! It so happens
that this formulation is a GLM with the Logit link, which is what we commonly call the Logistic Regression. So, our data consists of:
- , where the the vector is except for in the kth entry and in the jth entry
Well, we know from equation , what a good way of modelling the probability is. Via the PDF for the Bernoulli distribution, the probability of battle resulting in outcome is:
Recalling that . How do we determine ? This is a classic Logistic problem, with , which we want to maximise, or alternatively minimise .
To compute , we note that via the chain rule and find that:
There’s no closed form solution for this, due to , so we have to rely on Gradient Descent and company to approximate our rankings. The most basic update that you’d know is that , where is the learning rate. Since our problem is reasonably small, we don’t really need to worry about providing a mini-batch of terms as our gradient. Though I will add that you can do an ‘online’ update where new matches provide gradient updates, and slowly update our rankings if needed, which is a cool connection. We could also contextualise momentum (e.g Adam) as smoothing our gradient from previous gradients as well.
Let’s look through the annotated code now!