Translated by AI
Breaking the Independence Wall with Bivariate Poisson: Evolution of the 2026 World Cup Simulator from v2.6 to v2.9.9

## Introduction
In my previous article, I wrote about refining the World Cup 2026 Simulator (https://noahsark-wc2026.pages.dev/) from v2.0 to v2.6. At the end of that article, I left myself three pieces of homework:
- Complete resolution of the rejection sampling bias
- Relaxation of the PK determinism
- Web Worker implementation
This time I tackled the first item. I also implemented the bivariate Poisson model of Karlis and Ntzoufras (2003), which I had previously only listed as a reference. Going from v2.6 to v2.9.9 took 14 rounds of modifications in a single day.
This article is a record of:
- The process of aligning with real data (2018 + 2022 World Cup tournaments) across 19 metrics
- How I hit the fundamental wall of independent Poisson and arrived at the bivariate Poisson model
- The final results, where 17/19 metrics were kept within ±2pp, along with a transparent disclosure of the remaining 2 metrics
Just like the v2.6 article, the code is in JavaScript.
## Resetting the Goal: Aligning 19 Metrics with Real Data within ±2pp
At the v2.6 stage, the qualitative goal was to "eliminate discomfort." This time, I switched to a quantitative target.
I measured all 128 matches from the two previous World Cups (2018 + 2022) and extracted 19 metrics.
### Macro Metrics (9)
- GS (Group Stage) avg goals/match: 2.52
- KO (Knockout) avg goals/match: 2.69
- Draw rate: 19.8%
- Loser 0 rate (games with a winner where the loser scores zero): 56.2%
- Margin 1 / 2 / 3+ ratios: 45.8% / 20.8% / 13.5%
- AET (After Extra Time) rate: 30%
- PK (Penalty Kick) rate: 19%
### Score Distribution (10)
- Frequencies of 1-0, 1-1, 2-1, 2-0, 0-0, 2-2, 3-0, 3-1, 3-2, 4-0
Here are the measurement results when v2.6 was applied:
| Metric | Real World Cup | v2.6 | Difference |
|---|---|---|---|
| GS Avg | 2.52 | 2.58 | +0.06 |
| KO Avg | 2.69 | 2.43 | -0.26 |
| Draw Rate | 19.8% | 11% | -8.8pp |
| Loser 0 Rate | 56.2% | 77% | +21pp |
| Margin 1 | 45.8% | 33% | -13pp |
| AET Rate | 30% | 33% | +3pp |
| PK Rate | 19% | 25% | +6pp |
The GS/KO averages are roughly correct even in v2.6. However, the Loser 0 rate is off by 21pp and Margin 1 by 13pp. This indicates that the balance between "tight 1-goal games" and "shutouts" is fundamentally skewed, a domain I had not pursued in the previous article.
From here, I will aim for ±2pp.
## Discarding Rejection Sampling: The "Sample-First" Architecture
I tried the approach I had set as homework at the end of the previous article, "raising the λ of the winner conditionally and sampling only once," but it proved insufficient. It produced the intended match results, but the score distribution became distorted.
Specifically, I multiplied the winner's λ by 1 + ε and the loser's by 1 - ε before sampling. With a small ε, many samples came out as draws and failed to produce the intended winner; with a large ε, the winner's score became inflated. Adjusting ε alone was not enough to align all 19 metrics.
### A Shift in Thinking
I discarded the idea of rejection sampling itself.
- Old: Decide the winner first, then search for a score in which that winner prevails (rejection).
- New: Sample a score first, and if it conflicts with the intended winner, apply a minimal correction (sample-first).
```javascript
// v2.9.x: sample-first architecture
// Note: l1, l2 (the base expected goals) and lambda0 are not defined in this excerpt.
function simMatchDetailed(t1, t2, stage) {
  // Step 1: Determine the intended winner based solely on rating
  const ratingP1 = winProb(t1, t2, stage);
  // If stage === 'gs', null (draw) is possible
  const intendedWinner = sampleWinner(ratingP1, stage);

  // Step 2: Asymmetrically shift λ and sample from bivariate Poisson
  const [l1_biased, l2_biased] = asymmetricShift(l1, l2, ratingP1);
  const [s1, s2] = bivariatePoissonRand(l1_biased, l2_biased, lambda0);

  // Step 3: Minimal correction if it conflicts with the intended winner
  return reconcile(s1, s2, intendedWinner);
}
```
Because no sample is ever thrown away, this architecture is free from rejection bias by construction. Tuning also acts on λ directly rather than through ε, so each adjustment maps straight onto the metrics.
### Logic for Minimal Correction
The reconcile step (Step 3) branches into three cases:
1. Sample winner = intended winner: adopt as is.
2. Sample is tied, but the intended result is a win: either add 1 to the winner or subtract 1 from the loser as a minimal correction.
3. Sample winner ≠ intended winner: swap the two scores so that the intended winner holds the higher one (unlike rejection, this does not affect the overall shape of the distribution).

The two options in Case 2 are split roughly 60:40. "Loser -1" creates a clean sheet for the winner (a 1-1 sample becomes 1-0), while "winner +1" produces a tight one-goal game (1-1 becomes 2-1). I tuned the split using the ratio of 1-0 to 2-0 in the real data.
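
The three cases can be sketched as a small function. This is a minimal sketch under stated assumptions, not the simulator's actual `reconcile`: the name `reconcileSketch`, the encoding of the intended winner as `1` or `2`, the injectable `rand` parameter, the `pWinnerPlusOne` default of 0.6 (the text gives the 60:40 split but not which option gets the 60), and the fallback to "winner +1" when the sample is 0-0 are all illustrative choices. An intended draw is not handled because the article does not describe that branch.

```javascript
// Hypothetical sketch of the three-way reconcile step.
// intendedWinner: 1 or 2 (assumed encoding). rand: injectable RNG for testing.
function reconcileSketch(s1, s2, intendedWinner, rand = Math.random, pWinnerPlusOne = 0.6) {
  if (intendedWinner !== 1 && intendedWinner !== 2) {
    throw new Error('this sketch only handles a decided winner (1 or 2)');
  }
  // Look at the score from the intended winner's point of view.
  let win = intendedWinner === 1 ? s1 : s2;
  let los = intendedWinner === 1 ? s2 : s1;

  if (win < los) {
    // Case 3: the wrong side won the sample, so swap the two scores.
    [win, los] = [los, win];
  } else if (win === los) {
    // Case 2: tied sample. "Loser -1" is impossible at 0-0, so fall back to "winner +1".
    if (los === 0 || rand() < pWinnerPlusOne) {
      win += 1;
    } else {
      los -= 1;
    }
  }
  // Case 1 (win > los) needs no change.
  return intendedWinner === 1 ? [win, los] : [los, win];
}

console.log(reconcileSketch(1, 2, 1)); // sample says team 2 won, intended winner is team 1
```

Passing `rand` explicitly keeps the Case 2 branch testable; in a real run the default `Math.random` would be used.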

---

## The Fundamental Barrier: Incompatibility of 0-0 and 2-2
With the sample-first architecture, the Loser 0 rate and Margin 1 improved immediately. However, there was a metric I simply could not close: the coexistence of 0-0 and 2-2 draws.
Real data:
- 0-0: 7.3%
- 2-2: 5.2%
With independent Poisson, lowering λ lifts 0-0 but pushes 2-2 down into the 3% range, while raising λ lifts 2-2 but pulls 0-0 down with it. In my runs 0-0 swung between roughly 6% and over 10% without 2-2 ever reaching the real figure. It is a seesaw no matter what you choose.
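
The seesaw can be checked directly from the Poisson probability mass function. Under independence, P(0-0) = exp(-(λ₁ + λ₂)) depends only on the total expected goals, so anything that raises λ to lift 2-2 necessarily pulls 0-0 down. The sketch below is illustrative only: the helper names and the λ pairs are assumptions chosen for the example, not the simulator's calibrated values.

```javascript
// Poisson probability mass function P(K = k) for mean lambda.
function poissonPmf(k, lambda) {
  let p = Math.exp(-lambda);
  for (let i = 1; i <= k; i++) {
    p *= lambda / i;
  }
  return p;
}

// Probability of the exact score a-b when the two teams score independently.
function independentScoreProb(a, b, lambda1, lambda2) {
  return poissonPmf(a, lambda1) * poissonPmf(b, lambda2);
}

// Illustrative λ pairs (assumed values), ordered from low-scoring to high-scoring.
const examplePairs = [[1.5, 0.75], [1.7, 0.85], [1.9, 0.95]];
for (const [l1, l2] of examplePairs) {
  const p00 = independentScoreProb(0, 0, l1, l2);
  const p22 = independentScoreProb(2, 2, l1, l2);
  console.log(`λ=(${l1}, ${l2})  0-0: ${(100 * p00).toFixed(1)}%  2-2: ${(100 * p22).toFixed(1)}%`);
}
```

As λ grows down the list, the 0-0 column falls while the 2-2 column rises, which is the trade-off described above.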
### Assumptions Implicit in Independent Poisson
The cause lies not in the architecture or the λ values, but in the assumptions implicitly made by independent Poisson itself.
X₁ ~ Poisson(λ₁) and X₂ ~ Poisson(λ₂) being independent implies that "the scores of both teams hold no information about each other." It assumes that the goal distribution for Team 2 is the same regardless of whether Team 1 scored 0 goals or 2 goals in the match.
However, football in real data does not work that way.
- The tempo of the match (open vs. closed) affects both teams' scoring simultaneously.
- Pitch conditions and weather affect both teams simultaneously.
- Refereeing tendencies (many cards vs. letting the game flow) also act similarly.
In other words, X₁ and X₂ in real data have a positive correlation. 0-0 (both low) and 2-2 (both medium) occur simultaneously because both teams often perform poorly or well on that given day.
This cannot be explained in principle by the independence assumption. If you shift one distribution downward, cases where both are low increase, but simultaneously, cases where both are medium decrease. Coexistence is impossible.

---

## Introduction of bivariate Poisson (Karlis-Ntzoufras 2003)
I implemented the bivariate Poisson model by Karlis and Ntzoufras (2003), which I had left as a reference in the previous article. What it does is shockingly simple.
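### Combining Three Poissons
For each match, prepare three independent Poisson variables:

$$
Y_0 \sim \mathrm{Poisson}(\lambda_0), \qquad Y_1 \sim \mathrm{Poisson}(\lambda_1 - \lambda_0), \qquad Y_2 \sim \mathrm{Poisson}(\lambda_2 - \lambda_0)
$$

Then, define the scores of both teams as follows:

$$
X_1 = Y_1 + Y_0, \qquad X_2 = Y_2 + Y_0
$$

The marginal distributions satisfy the following:

$$
X_1 \sim \mathrm{Poisson}(\lambda_1), \qquad X_2 \sim \mathrm{Poisson}(\lambda_2)
$$

The means of the marginal distributions are preserved, while the shared component introduces a covariance of Cov(X₁, X₂) = Var(Y₀) = λ₀.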
### Interpretation of λ₀
λ₀ can be read as the "intensity inherent to that match that affects both teams equally." It is a factor that simultaneously pushes up or down the scoring potential of both teams, such as match openness, pitch conditions, and refereeing tendencies.
The larger λ₀ is, the more strongly the two scores move together: when the shared component comes out high, both teams score, and when it comes out at zero, both teams tend to stay quiet. This is what raises 0-0 and 2-2 at the same time.
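### Implementation
It can be written in about 15 lines.

```javascript
function bivariatePoissonRand(lambda1, lambda2, lambda0) {
  // Safety: keep λ₀ below min(λ₁, λ₂) so that both remaining rates stay positive.
  // Note: the 0.95 factor applies to whichever of the three values is smallest,
  // so it also scales lambda0 itself when lambda0 is already the minimum.
  const l0 = Math.max(0, Math.min(lambda0, lambda1, lambda2) * 0.95);
  const y0 = poissonRand(l0);                            // shared component
  const y1 = poissonRand(Math.max(0.01, lambda1 - l0));  // team 1 only
  const y2 = poissonRand(Math.max(0.01, lambda2 - l0));  // team 2 only
  return [y1 + y0, y2 + y0];
}
```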
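
`poissonRand` itself is not shown in the article. The following is a minimal sketch, assuming Knuth's multiplication method (adequate for the small λ typical of football) and a seeded generator, `mulberry32`, so that the check is reproducible. Both are stand-ins and not necessarily what the simulator uses; all function names here are hypothetical. The second half draws from a plain three-Poisson construction without the 0.95 clamp and compares the sample means and covariance with λ₁, λ₂, and λ₀.

```javascript
// Small seeded PRNG (mulberry32); an assumption so that runs are reproducible.
function mulberry32(seed) {
  let a = seed | 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Knuth's method: multiply uniforms until the product drops to e^{-lambda} or below.
function poissonSample(lambda, rand) {
  const limit = Math.exp(-lambda);
  let k = 0;
  let p = rand();
  while (p > limit) {
    k++;
    p *= rand();
  }
  return k;
}

// Three independent Poissons; the shared draw y0 is added to both scores.
function bivariateSample(lambda1, lambda2, lambda0, rand) {
  const y0 = poissonSample(lambda0, rand);
  const y1 = poissonSample(lambda1 - lambda0, rand);
  const y2 = poissonSample(lambda2 - lambda0, rand);
  return [y1 + y0, y2 + y0];
}

// Empirical means and covariance over n draws.
function summarize(lambda1, lambda2, lambda0, n, seed) {
  const rand = mulberry32(seed);
  let sx = 0, sy = 0, sxy = 0;
  for (let i = 0; i < n; i++) {
    const [x, y] = bivariateSample(lambda1, lambda2, lambda0, rand);
    sx += x;
    sy += y;
    sxy += x * y;
  }
  const mean1 = sx / n;
  const mean2 = sy / n;
  return { mean1, mean2, cov: sxy / n - mean1 * mean2 };
}

console.log(summarize(1.5, 1.1, 0.4, 200000, 42));
```

With enough draws the two means should land near λ₁ and λ₂ and the covariance near λ₀, which is the property the model relies on.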
The value of λ₀ is determined through calibration. Setting BIV_LAMBDA0_FRAC = 0.40, that is, λ₀ = 0.40 × min(λ₁, λ₂), aligned the 19 metrics best. As a rule of thumb, a covariance of about 20% to 40% of the total expected goals seems to be a realistic range.
### Effect
The coexistence of 0-0 and 2-2 was established.
| Score | Independent Poisson | Bivariate Poisson | Real World Cup |
|---|---|---|---|
| 0-0 | 6.2% or 10.5% (depends on λ) | 6.8% | 7.3% |
| 1-1 | 7.0% | 7.2% | 7.3% |
| 2-2 | 3.2% or 4.1% | 4.0% | 5.2% |
Both 0-0 and 2-2 moved toward the real data side simultaneously. 1-1 also aligned naturally. This behavior was theoretically impossible under the independence assumption.

---

## Peripheral Adjustments
Simply adding bivariate Poisson still left some distortions, which I addressed individually.
### Asymmetric λ shift
After determining the intended winner, I increase the λ for the winner and decrease it for the loser. This allows lopsided matches and close games to coexist.

```javascript
const gap = Math.abs(ratingP1 - 0.5) * 2;   // [0, 1]
const shiftWinnerF = 0.10 + 0.20 * gap;     // Increase winner λ by [10%, 30%]
const shiftLoserF = 0.08 + 0.20 * gap;      // Decrease loser λ by [ 8%, 28%]
```
Matches with small rating differences have small shifts, and the λ for both teams does not move much → 1-1, 2-1, 2-2 emerge naturally.
Matches with large differences have large shifts; the winner λ increases and the loser λ decreases → 2-0, 3-0 emerge.
This is asymmetric by design. If I were to make it symmetric (e.g., +20% for winner, -20% for loser), the Loser 0 rate would not align. Hitting the loser side slightly harder aligns better with real data.
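
Put together, the shift might look like the function below. This is a sketch under assumptions: the article calls `asymmetricShift(l1, l2, ratingP1)` without showing its body, so the extra `winnerSide` argument (`1`, `2`, or `null`), the name `asymmetricShiftSketch`, and leaving λ untouched for an intended draw are all hypothetical. Only the two shift formulas come from the snippet above.

```javascript
// Hypothetical sketch: scale the winner's λ up and the loser's λ down.
// winnerSide: 1, 2, or null for an intended draw (assumed encoding).
function asymmetricShiftSketch(l1, l2, ratingP1, winnerSide) {
  if (winnerSide == null) {
    return [l1, l2]; // assumption: no shift when a draw is intended
  }
  const gap = Math.abs(ratingP1 - 0.5) * 2;   // 0 for an even match, 1 for a certain win
  const shiftWinnerF = 0.10 + 0.20 * gap;     // +10% .. +30%
  const shiftLoserF = 0.08 + 0.20 * gap;      // -8% .. -28%
  const up = 1 + shiftWinnerF;
  const down = 1 - shiftLoserF;
  return winnerSide === 1 ? [l1 * up, l2 * down] : [l1 * down, l2 * up];
}

// Even match (ratingP1 = 0.5) versus a heavy favourite (ratingP1 = 0.9), team 1 intended to win.
console.log(asymmetricShiftSketch(1.5, 1.2, 0.5, 1));
console.log(asymmetricShiftSketch(1.5, 1.2, 0.9, 1));
```

The second call moves the two λ values further apart than the first, which is how larger rating gaps lead to scores like 2-0 and 3-0.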
### Post-hoc upgrade
Independent Poisson (and bivariate, as well) structurally overpredicts 1-0. While 1-0 is 22.9% in real data, the raw model output is around 27%. Conversely, 2-0 is underpredicted (raw is around 14% vs. 17.7% in reality).
Since this is a structural quirk, I correct it by probabilistically upgrading scores.
```javascript
// 1-0 → 2-0 (52% probability)
if (winScore === 1 && losScore === 0 && Math.random() < 0.52) {
  winScore += 1; // Increment winner's score by 1
}
// 2-1 → 3-2 (20% probability)
if (winScore === 2 && losScore === 1 && Math.random() < 0.20) {
  // Increment both by 1 to make it 3-2. Correction for high-scoring close games
  winScore += 1;
  losScore += 1;
}
```
This decreases 1-0 to a proper level, increases 2-0, and slightly increases 3-2 by decreasing 2-1. The total Margin 1 is preserved (2-1 → 3-2 remains Margin 1).
### Stochastic AET/PK escalation
In historical World Cups, the occurrence rates for AET and PK are 30% and 19% of KO matches, respectively (63% of AET matches go to PKs).
The old implementation relied heavily on rating differences here as well, with AET becoming less likely as the rating gap grew. Real data shows something closer to independence: AET occurs at a roughly constant rate whether strong teams play strong teams or weak teams play weak teams. I changed this so that escalation is triggered probabilistically.
```javascript
let aet = false;
let pk = false;
// If the sample is tied in a KO match
if (stage === 'ko' && s1 === s2) {
  // AET triggered with 30% probability
  if (Math.random() < 0.30) {
    aet = true;
    // If not decided in AET, PKs with 63% probability
    if (Math.random() < 0.63) {
      pk = true;
    }
  } else {
    // Remaining 70% decided within regulation → +1 for winner
  }
}
```
As a result, AET settles at 30% and PKs at 18.8%, almost exactly in line with real data.
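
Taking the two probabilities at face value, the PK share follows by multiplication: 0.30 × 0.63 = 0.189, which is consistent with the 18.8% measured. A minimal sketch that checks this by simulation follows. The names `escalateSketch` and `escalationRates`, the seeded `mulberry32` generator, and the simplification that every simulated match reaches this branch are illustrative assumptions, not taken from the simulator.

```javascript
// Small seeded PRNG (mulberry32); an assumption so that runs are reproducible.
function mulberry32(seed) {
  let a = seed | 0;
  return function () {
    a = (a + 0x6D2B79F5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Two-stage escalation: AET first, then PKs only if AET happened.
function escalateSketch(rand, pAet = 0.30, pPkGivenAet = 0.63) {
  const aet = rand() < pAet;
  const pk = aet && rand() < pPkGivenAet;
  return { aet, pk };
}

// Fraction of simulated matches that reach AET and PKs.
function escalationRates(n, seed) {
  const rand = mulberry32(seed);
  let aetCount = 0;
  let pkCount = 0;
  for (let i = 0; i < n; i++) {
    const { aet, pk } = escalateSketch(rand);
    if (aet) aetCount++;
    if (pk) pkCount++;
  }
  return { aetRate: aetCount / n, pkRate: pkCount / n };
}

console.log(escalationRates(100000, 1));
```

Because the second draw only happens inside the AET branch, the PK rate can never exceed the AET rate, matching the ordering in the real data.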

---

## Final Specifications for v2.9.9
Final measurement of all 19 metrics (N=36000 GS matches).
| Metric | Real WC | v2.0 | v2.6 | v2.9.9 | Diff |
|---|---|---|---|---|---|
| GS Avg Goals | 2.52 | 3.28 | 2.58 | 2.53 | +0.01 |
| KO Avg Goals | 2.69 | 3.10 | 2.43 | 2.69 | ±0 |
| Draw Rate | 19.8% | - | 11% | 20.2% | +0.4 |
| Loser 0 Rate | 56.2% | - | 77% | 55.4% | -0.8 |
| Margin 1 | 45.8% | - | 33% | 43.1% | -2.7 |
| Margin 2 | 20.8% | - | - | 22.1% | +1.3 |
| Margin 3+ | 13.5% | - | - | 14.7% | +1.2 |
| AET Rate | 30% | 15.3% | 33.1% | 30.2% | +0.2 |
| PK Rate | 19% | - | 67% (AET) | 18.8% | -0.2 |
| 1-0 | 22.9% | - | - | 25.7% | +2.8 |
| 1-1 | 7.3% | - | - | 7.2% | -0.1 |
| 2-1 | 10.4% | - | - | 11.9% | +1.5 |
| 2-0 | 17.7% | - | - | 18.1% | +0.4 |
| 0-0 | 7.3% | - | - | 6.8% | -0.5 |
| 2-2 | 5.2% | - | - | 4.0% | -1.2 |
| 3-0 | 6.2% | - | - | 6.5% | +0.3 |
| 3-1 | 5.2% | - | - | 3.4% | -1.8 |
| 3-2 | 3.1% | - | - | 4.1% | +1.0 |
| 4-0 | 4.2% | - | - | 3.2% | -1.0 |
- Close matches (under ±1pp): 10 metrics (GS avg, KO avg, Draw, AET, PK, Loser 0, 1-1, 2-0, 0-0, 3-0)
- Within ±2pp: 17/19 metrics (89%)
- Outside ±2pp: 2 metrics (Margin 1 at -2.7pp, 1-0 at +2.8pp)
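
The ±2pp tally can be reproduced from the table. A small sketch, with the rows copied from the Real WC and v2.9.9 columns; the names `metricRows` and `tallyWithin` and the data layout are illustrative choices. The two goal averages are in goals rather than percentage points, but they are held to the same threshold here for simplicity.

```javascript
// [metric, real World Cup value, v2.9.9 value], copied from the table above.
const metricRows = [
  ['GS Avg Goals', 2.52, 2.53], ['KO Avg Goals', 2.69, 2.69],
  ['Draw Rate', 19.8, 20.2], ['Loser 0 Rate', 56.2, 55.4],
  ['Margin 1', 45.8, 43.1], ['Margin 2', 20.8, 22.1], ['Margin 3+', 13.5, 14.7],
  ['AET Rate', 30, 30.2], ['PK Rate', 19, 18.8],
  ['1-0', 22.9, 25.7], ['1-1', 7.3, 7.2], ['2-1', 10.4, 11.9],
  ['2-0', 17.7, 18.1], ['0-0', 7.3, 6.8], ['2-2', 5.2, 4.0],
  ['3-0', 6.2, 6.5], ['3-1', 5.2, 3.4], ['3-2', 3.1, 4.1], ['4-0', 4.2, 3.2],
];

// Split the metrics into those within the tolerance and those outside it.
function tallyWithin(rows, tolerance) {
  const within = [];
  const exceeding = [];
  for (const [name, real, simulated] of rows) {
    if (Math.abs(simulated - real) <= tolerance) {
      within.push(name);
    } else {
      exceeding.push(name);
    }
  }
  return { within, exceeding };
}

const tallyResult = tallyWithin(metricRows, 2);
console.log(`${tallyResult.within.length}/${metricRows.length} within ±2`, tallyResult.exceeding);
```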
### Win Probabilities Maintained
Even with a major overhaul of the scoring model, the win probabilities remain almost the same as they were at v2.6.
| Country | v2.6 | v2.9.9 |
|---|---|---|
| Spain | 22.8% | 22.8% |
| Argentina | 14.9% | 14.9% |
| France | 12.8% | 12.9% |
| England | 6.9% | 6.9% |
This is thanks to the design choice of separating match outcomes from scores, as mentioned in the previous article. The two pipelines, match outcomes (Elo / Opta ratings) and scores (λ calculation and bivariate Poisson), are independent, so the outcome side did not break even after I drastically redesigned the score side.

---

## Remaining Limitation: Insufficient 1-2 Type Close Losses
The two metrics exceeding the target in v2.9.9 (Margin 1 -2.7pp, 1-0 +2.8pp) are correlated.
Looking closely at the distribution, the true cause was a shortage of 1-2 scores: 12.4% in v2.9.9 versus 19.8% in reality (-7.4pp). Margin 1 as a whole falls short because the part not covered by 1-0 + 2-1 + 3-2, that is, high-scoring close games such as 4-3, 5-4, and 6-5, is structurally too rare in Poisson-based models.
In a Poisson distribution the mean equals the variance, so the variance is tied to λ and cannot be raised on its own. As a result, matches in which both teams score three or more goals and finish one goal apart do not occur often enough, even with the bivariate model.
### Next Steps
To solve this, I need to change the family of distributions itself.
- Negative Binomial Distribution: Introduce over-dispersion. It can increase the variance of high-scoring matches. However, it increases parameters.
- Skellam Distribution: Directly model the score difference between both teams. You can match the Margin distribution directly.
- Empirical Distribution: Sample directly from real WC data. You lose theoretical elegance, but the metrics will surely align.
All of these require several hours of work. I decided to stop at v2.9.9 and leave it as a record.

---

## Conclusion
I implemented the paper mentioned in the references of the previous article. At the time of writing, I had a lukewarm feeling that it might be "useful for reference," but after returning to the paper after hitting the wall of independent Poisson, I finally understood what Karlis and Ntzoufras were trying to solve back in 2003.
Statistical modeling is, I think, a continuous process of re-tracing problems solved decades ago only after you hit a wall yourself. In that sense, reading literature only occasionally is not enough. It looks completely different when you read it after your own implementation has failed.
In the next cycle, I will try Skellam, Negative Binomial, or perhaps empirical distribution to hit the 1-2 residual. Then, there's the relaxation of the PK determinism and the move to Web Workers, which were homework from before. I don't know how far I can go, but I will continue.
Try the simulator: https://noahsark-wc2026.pages.dev/
English version (same story from a different angle): Medium
Sequel published (2026-04-25): The Story of How Win Probabilities Shifted When I Cleared Two Items of Homework: World Cup 2026 Simulator v2.9.9 to v2.16
It covers the unexpected side effects that appeared after implementing Homework 2 (relaxation of the PK determinism) and Homework 3 (the move to Web Workers).
## References
- Dixon, M. J., & Coles, S. G. (1997). Modelling association football scores and inefficiencies in the football betting market. Journal of the Royal Statistical Society: Series C.
- Karlis, D., & Ntzoufras, I. (2003). Analysis of sports data by using bivariate Poisson models. Journal of the Royal Statistical Society: Series D.
- Skellam, J. G. (1946). The frequency distribution of the difference between two Poisson variates belonging to different populations. Journal of the Royal Statistical Society.