Uses and Abuses of Statistics – MLB Edition

Written by Editorial Team on March 13, 2024. Posted in Culture, News Media.

If you watch a lot of sports as I do, you cannot fail to be aware of the so-called ‘Analytics Revolution’, a phenomenon that has wormed its way into sports broadcasting. Whatever professional teams may be doing with the reams of game and performance data they now collect, one cannot miss how much sportscasters talk about it, before, during and after each broadcast.

As someone whose happiness would greatly increase if said sportscasters would just shut up, I cannot say all this statistic-centric chattering is welcome, but sometimes it is interesting. A frequent use of stats in a broadcast is when one of the commentators cites a statistic that they think is directly relevant to what is happening in the game. For example –

Hockey team x scores the first goal of the game, and the commentator says ‘The team that scores first wins the game z% of the time.’

Baseball team y goes into the 6th inning trailing by 2 runs and the commentator says ‘Teams that trail by 2 or more runs in the second half of a game have only a Z% chance of winning.’

Now, there is no mystery as to where these statements come from. For the first one, you just look at the last 10 years (say) of all NHL games and see which team scored first and which team won. The percentage of the games in which it is the same team gives you z in the statement.
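For the curious, that counting exercise is only a few lines of code. Here is a minimal Python sketch, assuming each game has been boiled down to a pair (team that scored first, team that won). The function name and the toy records are made up for illustration and are not real NHL results.

```python
def first_scorer_win_pct(games):
    """Percentage of games in which the team that scored first also won.

    games: iterable of (first_scorer, winner) pairs (a hypothetical format).
    """
    games = list(games)
    same_team = sum(1 for first_scorer, winner in games if first_scorer == winner)
    return 100.0 * same_team / len(games)


# Invented records, purely to show the calculation.
toy_games = [("x", "x"), ("y", "x"), ("x", "x"), ("y", "y")]
print(first_scorer_win_pct(toy_games))  # 3 of the 4 toy games match -> 75.0
```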

A particular example of this occurred during the second round of last season’s MLB playoffs, when two teams were playing the third game of a best-of-five series, tied at one win each. The commentator said ‘The team that wins the third game in this situation goes on to win the series 70% of the time.’

Once again, it’s clear this statement comes from looking back at previous MLB best-of-five series in which the teams split the first two games, but in this case I had an immediate reaction to this stat: that seems too low.

My immediate, no-pencil-and-paper reaction was not that he had misquoted the statistic, but that I would have expected the team winning the third game to win the series more often than that. I got out my pad and pen, and here is what I came up with.

What would simple probability calculations predict for the probability in question? Imagine team A has won the third game against team B, so it is leading the series 2 to 1 with two possible games to go. Assume also, just as a starting point, that because this is the playoffs, these are two evenly matched teams. Thus, absent any specific information about each team in each game (who is pitching, injuries, weather, etc.), one would expect the probability that either team wins any given game to be ½.

Given that, you can calculate the probability of team A going on to win the series (having won game three) by noting that the series after game three can go only one of three ways:

i. A wins game four and the series
ii. A loses game four but wins game five and the series
iii. A loses both games four and five and loses the series.

This is the whole universe of possibilities, and it is easy to calculate the probability of each one.

i. The probability that A wins game four is ½, by our assumption that ½ is the probability A wins any single game.
ii. The probability A loses game four is ½ and the probability A wins game five is also ½, so the probability of those two events happening is ½ times ½ which is ¼. (The probability-aware out there will note that I have assumed that the probability of winning in each game is independent of what happens in the other game. I will come back to that below.)
iii. The probability A loses each game is again ½, so again the probability it loses both is ¼.

Note that these three probabilities do add up to 1, so we have covered everything. We have also found that the probabilities of i and ii, the two cases in which A wins the series, add up to ¾, or 75%.
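The pad-and-pen arithmetic can be checked with a short Python sketch. It assumes, as the calculation above does, that the two remaining games are independent. The function name and the parameters `p4` and `p5` (A's chance of winning games four and five) are labels of my own choosing.

```python
from fractions import Fraction


def series_outcomes(p4, p5):
    """Probabilities of the three ways the series can end once A leads 2 to 1.

    p4, p5: probability A wins game four / game five, assumed independent.
    """
    return {
        "i": p4,                       # A wins game four
        "ii": (1 - p4) * p5,           # A loses game four, wins game five
        "iii": (1 - p4) * (1 - p5),    # A loses both games
    }


half = Fraction(1, 2)
outcomes = series_outcomes(half, half)
assert sum(outcomes.values()) == 1      # the three cases cover everything
print(outcomes["i"] + outcomes["ii"])   # A wins the series: 3/4
```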

So, on this account, my instincts were right: 70% is lower than 75%.

However, when a calculation comes out differently than an actual number from the world, it is the calculation that must be re-thought. My first thought along those lines was the following: if A wins game three, then it has won two of three games against B, and although that is a small sample, it does point to the possibility that maybe team A is somewhat better than B, and that should be taken into account.

For example, maybe in this scenario the probability A wins either of games four or five should be 0.55 and the probability B wins only 0.45.

This does not help reconcile the data with the calculations, however. If one redoes the calculations for the probability of each of the three outcomes above, one now gets:

i. 0.55
ii. 0.45 x 0.55 = 0.2475
iii. 0.45 x 0.45 = 0.2025

and the predicted probability of A winning the series is now up to 79.75%, even further away from the empirical 70%.

Huh.
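The same arithmetic can be swept over a few values of the single-game probability. This is only a sketch under the assumptions used so far: the games are independent, and A wins each remaining game with the same probability, which I call `p` here. The function name is mine.

```python
def series_win_prob(p):
    """Chance A wins the series from 2 to 1 up, if A wins each remaining
    game independently with the same probability p (an assumption)."""
    return p + (1 - p) * p  # outcome i plus outcome ii


for p in (0.50, 0.55, 0.60):
    print(f"p = {p:.2f} -> series win probability = {series_win_prob(p):.4f}")
```

Because p + (1 - p)p rises as p rises, any value of p above 0.5 pushes the prediction above 75%, not below it.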

So, one has to look at something else, and my preferred culprit would be the assumption built into all these calculations that the outcome of each game is independent of what happened in previous games. In particular, I suspect that a team that has won two of the first three games takes it a little easy in the fourth game. It is not just that team A’s players might ‘relax’ a bit, but also that team A’s manager might save his best pitcher for game five in case he is needed, hoping that if they win game four his ace will be available for game one of the next round. In that scenario, the probability team A wins game four is less than 0.5, not more. You all can probably think of other explanations.
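To put a rough number on that idea, suppose game five is still a coin flip but team A's chance in game four is something else. The sketch below solves for the game-four probability that would reproduce the quoted figure. It takes the 70% at face value and holds game five at ½. Both are assumptions of mine, as is the function name.

```python
def game_four_prob_needed(target, p5=0.5):
    """Game-four win probability for A that yields the target series win rate.

    Solves target = p4 + (1 - p4) * p5 for p4, where p5 is A's assumed
    chance of winning game five.
    """
    return (target - p5) / (1 - p5)


p4 = game_four_prob_needed(0.70)
print(f"{p4:.2f}")
```

Under those assumptions, the team leading 2 to 1 would need to win game four only about 40% of the time. That is an illustration of the size of the effect, not an estimate from data.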

In any case, it was clear that what the sportscaster who said this last autumn wanted us fans to think is ‘whoa, winning game three is really important’, when in fact there is something more interesting to be said: why don’t winners of game three do better than they do in a five-game series?
