If you found your way here you probably know what this is about. Game 5 of the 2017 World Series between the Los Angeles Dodgers and the Houston Astros was, at least back then, a great game. But we have known for some months now that the Houston Astros cheated in the 2017 regular season and probably also in the following years. They stole the opposing catcher’s signs and relayed them in real time to the hitters. There is little doubt left that they also cheated during the World Series in 2017.
A couple of weeks ago it was pointed out that it is pretty unlikely that the Astros hitters swung and missed at only one breaking ball during this Game 5. Tom Verducci claimed that there were no swinging strikes on breaking balls that day. That actually isn’t true, because there was one by Altuve in the bottom of the first inning for the strikeout, as you can see on Baseball Savant and on YouTube at around minute 41. But this doesn’t really make anything better.
So I thought I would calculate the probability of this happening, based on all sliders and curve balls Kershaw threw in 2017. I got the data from Baseball Savant, but weirdly there is no data for the 1st of September 2017, so this day isn’t included. I’m sure it wouldn’t change a lot.
The following chart shows the cumulative probability of swings and misses on Kershaw’s breaking balls for Game 5 of the 2017 World Series, based on all breaking balls Kershaw threw in 2017:
Here is a quick guide on how to read this: I calculated two values. One (blue) is based on all breaking balls thrown by Kershaw in 2017 including the postseason. The other (red) is based on all breaking balls excluding the World Series. So if you think the Astros cheated, use the red line to have a proper picture of what is going on. If you think they didn’t, use the blue line.
The number of swinging strikes is cumulative. This means the chart shows the probability that there were, for example, 3 or fewer swinging strikes in this game, not exactly 3 swings and misses on breaking balls.
You also have to keep in mind that the sample here is Game 5 of the World Series, where Kershaw threw 51 breaking balls. So the chart always shows the probability based on 51 breaking balls. This is just how the math works, which I explain down below if you are interested.
But either way it doesn’t look good for the Astros. Let’s give them the benefit of the doubt and believe that they didn’t cheat during the World Series (blue line). The probability that Kershaw got 1 or fewer swings and misses on his breaking balls while throwing 51 sliders and curve balls that night is 0.009%.
What does that mean?
Let’s put this in some context. Two days later he pitched again against the Astros. This time in L.A. as a reliever. He threw 19 breaking balls and had 3 swings and misses. The probability of getting 3 or fewer swings and misses on that day is 42.68%.
On the 1st of May he pitched against the San Francisco Giants in L.A. He threw a total of 39 breaking balls that day and had only 2 swings and misses. The probability of him getting 2 or fewer swinging strikes on that day is only 0.69163%.
Yes, this is also extremely low, but we are dealing with probabilities here. Probabilities tell us what should happen based on what happened at another time. But that doesn’t mean that everything that should happen also will happen. And before some Astros fans get too excited I will just say this: 0.69163 is around 77 times greater than 0.009.
We still don’t know for sure if the Astros cheated that day, but next to all the other evidence the numbers don’t speak in their favor.
If there are some nerds reading this, you can keep on reading. Below I explain how I arrived at these numbers.
The calculations for these numbers are pretty straightforward. I used the hypergeometric distribution to calculate the probabilities. The formula looks like this:
$$P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$
In our case N is all the breaking balls thrown by Kershaw in 2017: 1,488
K is all swings and misses off breaking balls in 2017: 307
n is all breaking balls on the particular day. So for Game 5 of the World Series it’s 51
k is all swings and misses on this day: 1
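For anyone who wants to check this themselves, here is a minimal sketch that plugs these four values into the hypergeometric distribution. It assumes Python 3.8 or newer (for `math.comb`). The function name `hypergeom_pmf` is just something I made up for illustration. Note that it gives the probability of exactly k swings and misses, not the cumulative value shown in the chart.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """Probability of exactly k swings and misses among n breaking balls,
    given K swings and misses among N breaking balls over the whole season."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


# Values taken from the text (2017 including the postseason)
N = 1488  # all breaking balls thrown by Kershaw
K = 307   # all swings and misses on breaking balls
n = 51    # breaking balls in Game 5
k = 1     # swings and misses in Game 5

p_exactly_one = hypergeom_pmf(k, N, K, n)
print(f"P(exactly {k} swing and miss) = {p_exactly_one:.5%}")
```

Python’s integers have no size limit, so the huge binomial coefficients are computed exactly and only the final division turns them into a floating point number.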
The thing with the two numbers on top of each other inside the brackets is a binomial coefficient, which is also pretty straightforward. So for example:
$$\binom{4}{2} = \frac{4!}{2!\,(4-2)!} = 6$$
6 is the answer to the question of how many ways there are to choose 2 elements out of 4 elements. There are 6, namely: 1,2 – 1,3 – 1,4 – 2,3 – 2,4 – 3,4.
So we can say that if Kershaw throws 4 breaking balls and has 2 swings and misses there are 6 ways he can do that. He can have a swing and a miss on the first and second pitch, on the first and third pitch, on the first and fourth pitch and so on.
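If you want to see those 6 ways listed out, here is a small sketch using Python’s standard library. The variable names `pitches` and `pairs` are only for illustration.

```python
from itertools import combinations
from math import comb

# Number the four breaking balls 1 to 4 and list every pair of pitches
# that could be the two swings and misses.
pitches = [1, 2, 3, 4]
pairs = list(combinations(pitches, 2))

print(pairs)       # [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(len(pairs))  # 6
print(comb(4, 2))  # 6, the binomial coefficient gives the same count directly
```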
So we take these binomial coefficients and use them in our formula above.
What is important to understand is that with the hypergeometric distribution we calculate the probability of, for example, exactly 2 swings and misses. That’s why I used cumulative probabilities in the chart: it just makes more sense to speak of the probability that he had 1 or fewer swings and misses that day, rather than exactly 1 swing and a miss. Of course you could cumulate them the other way around too, so you can give the probability for 1 or more swings and misses, but that wasn’t the concern here.
If you are further interested in this, please just send me an email: bernd [at] marktforschung-schmidl.at
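To make the cumulation concrete, here is a rough sketch that adds up the “exactly k” probabilities for the three games mentioned above, using N = 1,488 and K = 307 from the text. The function names and the `games` dictionary are my own illustration, and it again assumes Python 3.8 or newer. The printed values should land close to the percentages quoted earlier.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """Probability of exactly k swings and misses among n breaking balls."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def hypergeom_cdf(k, N, K, n):
    """Probability of k or fewer swings and misses: sum of the exact cases 0..k."""
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k + 1))


N, K = 1488, 307  # season totals from the text (including the postseason)

# game -> (breaking balls thrown, swings and misses), as given in the text
games = {
    "World Series Game 5": (51, 1),
    "World Series Game 7": (19, 3),
    "May 1 vs. Giants": (39, 2),
}

results = {}
for name, (n, k) in games.items():
    results[name] = hypergeom_cdf(k, N, K, n)
    print(f"{name}: P({k} or fewer) = {results[name]:.5%}")

# Cumulating the other way around: 1 or more is the complement of exactly 0.
p_one_or_more = 1 - hypergeom_cdf(0, N, K, 51)
print(f"World Series Game 5: P(1 or more) = {p_one_or_more:.5%}")
```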