/ Playbook / Part 2

More Pokemon Red Elo World

So, I decided to revisit my Pokemon Red Elo estimation work, and decided to write this article up as a test for some of my new rendering capabilities. This article will primarily go over deriving Elo.

Deriving Elo

Elo ratings are typically used in tournaments between players/teams, such as chess. It’s well known that if a player with a lower Elo defeats a player with a higher Elo, the increase/decrease for the player is larger than say if they were comparable in ‘skill’. The Elo rating is a measure of that skill. Let’s derive it. Suppose we have a ratings vector θ=(R1,R2,...,Rn)\theta = (R_1, R_2, ..., R_n), then our goal is to learn this vector.

P1P_1 and P2P_2 battle with ratings R1R_1 and R2R_2. Then we define the Elo update for P1P_1 and P2P_2 as:

R1:=R1+α(Outcome(P1,P2)P(P1P2)))R_1' := R_1 + \alpha (\text{Outcome}(P_1,P_2)- \mathbb{P}(P_1\succ P_2)))

R2:=R2+α(Outcome(P2,P1)P(P2P1)))R_2' := R_2 + \alpha (\text{Outcome}(P_2,P_1)- \mathbb{P}(P_2\succ P_1)))

  • Here ABA \succ B means A dominates (beats) B
  • Where Outcome(P1,P2)=1\text{Outcome}(P_1,P_2)=1 if P1P_1, and 00 if P2P_2 wins.
  • α\alpha is the scale factor, or how impactful a loss is.

Note if it’s a tie, then it’s considered both a loss and win for simplicity. If P1P_1 was an underdog, say Youngster Joey, taking on E4 Agatha, then the probability of him beating her would be much lower, and thus his Elo would go up quite a bit. On the other hand, it would be expected of Agatha to beat Joey, so the Elo increment is miniscule in comparison.

But what is P(P1P2)\mathbb{P}(P_1\succ P_2)? We want the odds of P1P_1 beating P2P_2 to be a function of R1R_1 and R2R_2, that is there is some ff such that:

P(P1P2)P(P2P1)=f(R1)f(R2)=OP1,P2\frac{\mathbb{P}(P_1\succ P_2)}{\mathbb{P}(P_2 \succ P_1)} = \frac{f(R_1)}{f(R_2)} = O_{P_1, P_2}

And that P(P1P2)+P(P2P1)=1\mathbb{P}(P_1\succ P_2) + \mathbb{P}(P_2 \succ P_1) = 1. We can disregard ties for the moment since they’re not a substantial issue here.

Because the probabilities add to 11, we have OP1,P2+1=1P(P2P1)O_{P_1, P_2} + 1 = \frac{1}{\mathbb{P}(P_2 \succ P_1)} yielding:

OP1,P2OP1,P2+1=P(P1P2)=f(R1)f(R1)+f(R2)\frac{O_{P_1, P_2}}{O_{P_1, P_2}+1} = \mathbb{P}(P_1\succ P_2) = \frac{f(R_1)}{f(R_1)+f(R_2)}

Now, what function reasonably satisfies this? The simplest answer is an exponential of the form f:(R)=exp(kR)f:(R) = \exp(k R), which when substituted yields:

P(P1P2)=11+exp(k(R1R2))(1)\mathbb{P}(P_1\succ P_2) = \frac{1}{1+\exp(-k(R_1-R_2))} \tag{1}

The choice of base and kk determines how much a win means at a certain odds ratio, and aren’t really too important. This is essentially a sigmoid σ\sigma as well. Now, let’s say we have our data from the tournament, which could look something like the following:

win, trainer_1, trainer_2
1, AGATHA, YOUNGSTER
0, GREEN1, LANCE
...

Here, trainer_1 is the defender and trainer_2 is the contender, meaning that the win is whether the defender won or not. But this data sounds like it would come from something that’s like a Bernoulli distribution! It so happens that this formulation is a GLM with the Logit link, which is what we commonly call the Logistic Regression. So, our data D={(x(i),y(i))}\mathcal{D} = \{(x^{(i)}, y^{(i)})\} consists of:

  • y(i){0,1}y^{(i)} \in \{0,1\}
  • xk,j(i)R391x^{(i)}_{k,j} \in \mathbb{R}^{391}, where the the vector is 00 except for 11 in the kth entry and 1-1 in the jth entry

Well, we know from equation (1)(1), what a good way of modelling the probability is. Via the PDF for the Bernoulli distribution, the probability of battle x(i)x^{(i)} resulting in outcome yiy_i is:

P(y(i)=yix(i);θ)=σ(θTx(i))y(i)(1σ(θTx(i)))1y(i)(Likelihood of a datapoint)\mathbb{P}(y^{(i)}=y_i | x^{(i)}; \theta) = \sigma(\theta^T x^{(i)})^{y^{(i)}} (1-\sigma(\theta^T x^{(i)}))^{1-y^{(i)}} \tag{Likelihood of a datapoint}

Recalling that θTx(i)=RkRj\theta^T x^{(i)} = R_k - R_j. How do we determine θ\theta? This is a classic Logistic problem, with L(D,θ)=i=1NP(y=y(i)x(i);θ)\mathcal{L}(\mathcal{D},\theta) = \prod_{i=1}^N \mathbb{P}(y=y^{(i)} | x^{(i)}; \theta), which we want to maximise, or alternatively minimise J(θ)=logL(D,θ)\mathcal{J}(\theta) = - \log \mathcal{L}(\mathcal{D},\theta).

J(θ)=i=1N(log(σ(θTx(i))y(i)(1σ(θTx(i)))1y(i)))=i=1N(y(i)logσ(wTx(i))+(1y(i))log(1σ(wTx(i))))\begin{align*} \mathcal{J}(\theta) &= -\sum_{i=1}^N \left(\log \left( \sigma(\theta^T x^{(i)})^{y^{(i)}} (1-\sigma(\theta^T x^{(i)}))^{1-y^{(i)}}\right) \right) \\ &=-\sum_{i=1}^N \left( y^{(i)} \log \sigma(w^T x^{(i)} ) + (1-y^{(i)}) \log (1 - \sigma(w^T x^{(i)})) \right) \end{align*}

To compute J(θ)\nabla \mathcal{J}(\theta), we note that θ(σ(θTx))=σ(x)(1σ(x))xT\partial_{\theta} (\sigma(\theta^T x)) = \sigma(x)(1-\sigma(x)) x^T via the chain rule and find that:

J(θ)=i=1N(σ(wTx(i))y(i))x(i)(Gradient of NLL)\nabla \mathcal{J}(\theta) = -\sum_{i=1}^N \left(\sigma(w^T x^{(i)}) - y^{(i)} \right) x^{(i)} \tag{Gradient of NLL}

There’s no closed form solution for this, due to yiy_i, so we have to rely on Gradient Descent and company to approximate our rankings. The most basic update that you’d know is that θi+1=θiηJ(θi)\theta_{i+1} = \theta_{i} - \eta \nabla \mathcal{J}(\theta_i), where η\eta is the learning rate. Since our problem is reasonably small, we don’t really need to worry about providing a mini-batch of kk terms as our gradient. Though I will add that you can do an ‘online’ update where new matches provide gradient updates, and slowly update our rankings if needed, which is a cool connection. We could also contextualise momentum (e.g Adam) as smoothing our gradient from previous gradients as well.

Let’s look through the annotated code now!

Loading content...