Win Expectancy and Leverage Index tables, R Code

This post is just a quick dump of some code you can use to create win-expectancy and leverage index tables like what I used for my recent Baseball PreGUESTus article. It is written for the free statistical program R, and it builds upon the excellent work on run-expectancy and run distribution tables done by Sobchak at

In order to run this code, you will need R with the package plyr installed. You will also need the file bo_transitions.csv from ChancesIs (either the CSV file hosted on that site, or one created using a similar query to the one Sobchak published) and the file game_state_frequency.csv, which you can copy from this table. Sobchak's data and the game_state_frequency table are from the years 1993-2010. You can collect the data for other years by altering Sobchak's SQL query and this game_state_frequency query.

*Note: you only need game_state_frequency.csv for calculating LI. You don't need it if all you want is a WE table.

Once you have those files on your computer, you can construct a win-expectancy table with the following R code:

Win Expectancy Table, R code

You will have to change the line
setwd("/Users/Seshoumaru/Desktop/untitled folder/baseball/run-win expectancy")

to the folder path where you saved the necessary CSV files.
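If you would rather not hard-code a setwd() call, one option is to keep the folder path in a variable and check that the expected files are there before running anything else. This is only a sketch: `data_dir` is a placeholder you would replace with your own folder, and since the column layout of the CSV files is not described in this post, the example only checks that the files exist.

```r
# Placeholder path: replace with the folder where you saved the CSV files
data_dir <- "path/to/your/csv/folder"

# File names taken from the post above
required <- c("bo_transitions.csv", "game_state_frequency.csv")

# Which of the required files are not in data_dir?
missing_files <- required[!file.exists(file.path(data_dir, required))]

if (length(missing_files) > 0) {
  message("Missing from ", data_dir, ": ",
          paste(missing_files, collapse = ", "))
}
```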

The win expectancy values are generated from Sobchak's simulated run distributions. The code is currently set to run 100,000 simulated innings from each state to estimate the distributions. You can raise the number of simulations to increase the precision, but it will take longer to process. On my computer, 100,000 simulations took about 4 minutes to run, and 1,000,000 simulations took about an hour. The win expectancies themselves are not simulated, however; they are calculated from those distributions.
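To get a rough feel for the precision trade-off, here is a toy sketch. It does not use Sobchak's transition data: `true_dist` is a made-up run distribution, and `estimate_dist` and `std_error` are hypothetical helpers. The point is only that the standard error of an estimated probability shrinks with the square root of the number of simulations, so ten times as many simulations buys roughly a threefold gain in precision.

```r
# Toy illustration only: `true_dist` is a made-up run distribution,
# not one derived from bo_transitions.csv.
set.seed(1)
true_dist <- c(0.73, 0.15, 0.07, 0.03, 0.02)  # P(0), P(1), ..., P(4) runs
runs <- 0:4

estimate_dist <- function(n) {
  sims <- sample(runs, size = n, replace = TRUE, prob = true_dist)
  # factor() keeps run totals that never occurred as zero counts
  as.numeric(table(factor(sims, levels = runs))) / n
}

# Approximate standard error of an estimated probability p from n draws
std_error <- function(p, n) sqrt(p * (1 - p) / n)

est_small <- estimate_dist(1e4)
est_large <- estimate_dist(1e5)

# How much smaller the error gets going from 100,000 to 1,000,000 draws
se_ratio <- std_error(0.15, 1e5) / std_error(0.15, 1e6)
```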

The code limits run scoring to 16 runs for the remainder of the inning you are in, plus 16 runs total for the rest of the game. This is done to greatly reduce processing time. The generated tables cover scores from the home team being down 16 to up 16 (all score differentials are from the perspective of the home team).
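As an illustration of what a 16-run cap looks like on a single distribution, here is a minimal sketch. `cap_runs` is a hypothetical helper, and folding the leftover probability into the last bucket is an assumption made here so the probabilities still sum to 1; the linked code may handle the tail differently.

```r
# Hypothetical helper: cap a run distribution at `max_runs`, folding the
# probability of anything higher into the final bucket (an assumption).
cap_runs <- function(dist, max_runs = 16) {
  # dist[k] is P(k - 1 runs), so index max_runs + 1 corresponds to max_runs
  if (length(dist) <= max_runs + 1) {
    # Already short enough: pad with zeros up to max_runs
    return(c(dist, rep(0, max_runs + 1 - length(dist))))
  }
  head_part <- dist[seq_len(max_runs)]
  tail_mass <- sum(dist[(max_runs + 1):length(dist)])
  c(head_part, tail_mass)
}

# Made-up distribution over 0..40 runs
raw <- 0.5^(1:41)
raw <- raw / sum(raw)
capped <- cap_runs(raw)

# Score differentials covered by the tables, from the home team's perspective
score_diffs <- -16:16
```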

The above code assumes equal run distributions for both teams. With a few changes, you can alter the code to include home-field advantage by using separate distributions for the home and away teams. To do this, you will need to alter Sobchak's query to create additional bo_transition files for just the home team and just the away team (called bo_transitions_home.csv and bo_transitions_away.csv). Once you have added those files, you can run the following code:
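To show why separate distributions matter, here is a rough sketch of turning two rest-of-game run distributions into a home win probability. Everything in it is illustrative: `home_win_prob` is a hypothetical function, the two distributions are made up, and `p_extra` (the chance the home team wins a game that ends regulation tied) is an assumed parameter rather than something taken from the linked code.

```r
# Sketch only: `away_dist` and `home_dist` are made-up rest-of-game run
# distributions (element k is P(k - 1 runs)); `p_extra` is an assumed
# probability that the home team wins a game that is tied after regulation.
home_win_prob <- function(diff, away_dist, home_dist, p_extra = 0.5) {
  # joint[i, j] = P(home scores i - 1 more runs, away scores j - 1 more runs)
  joint <- outer(home_dist, away_dist)
  # Final margin from the home team's perspective for each cell of `joint`
  margin <- diff + outer(seq_along(home_dist) - 1,
                         seq_along(away_dist) - 1, "-")
  sum(joint[margin > 0]) + p_extra * sum(joint[margin == 0])
}

away_dist <- c(0.30, 0.30, 0.20, 0.12, 0.08)
home_dist <- c(0.27, 0.30, 0.21, 0.13, 0.09)

same <- home_win_prob(0, away_dist, away_dist)  # identical teams, tied game
hfa  <- home_win_prob(0, away_dist, home_dist)  # home distribution slightly better
```

With identical distributions and a tied score the result is 0.5 by symmetry; giving the home team the slightly better distribution pushes it above 0.5.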

Win Expectancy Table, HFA version, R code


Andrew said...

Thanks for the code! When I was reading the article, one thing that came to mind would be to use neither LI nor boLI for your hypothetical situation where the closer doesn't need to be saved for tomorrow. Instead, you could simulate a bunch of games and see how likely it would be that there would be a better situation for the closer to come in.

Using your code as a starting point, I wrote a function to simulate a game from any starting point you want. Then I kept track of how many times the starting point had the highest LI for that team.

I ran the function 10000 times for the example in the article, and that point was the highest LI for the game in 49.96% of the games. The highest LI in the game was less than 2 in 83.61% of the games.

This seems to indicate that you should use your closer in this situation, but I am not sure. Maybe a way to make this more thorough would be to calculate how much putting in the closer improves the probability of winning in this situation versus in the other potentially important situations (weighted by the probability of reaching those situations).

amy atkinson said...

I did not get the code exactly yet does this mean that we can use this in our baseball field equipment too? Like a code organizer to file things properly.

Sam Sharpe said...

Thanks this is great. Could you tell me what the functions,, diag.sum are doing exactly? There isn't much commenting so I am a little confused.

Kincaid said...

Sorry I didn't see this earlier. I am not sure how helpful these explanations will be, but hopefully they will at least start to make sense if you get a chance to play around with the functions and look at their outputs.

run.dist.simulation is a table of the probabilities of scoring a given number of runs in an inning from each base-out state. This table gives probabilities for scoring anywhere from 0-16 runs (there is nothing in the simulation limiting it to that range, I just cut it off at 16 runs to reduce the number of calculations needed). This distribution of run scoring only applies to runs scored through the end of the inning, though. To calculate win probabilities, we need the probability of scoring a given number of runs through the end of the game.

What the first function does is take the distribution of run scoring through the end of the inning and add another inning on top of it. So instead of giving the probabilities for scoring anywhere from 0-16 runs by the end of this inning, it turns them into the probabilities of scoring anywhere from 0-32 runs by the end of the next inning (although it looks like I also limited this to 0-30 runs in the win probability calculations to further reduce calculation time). You can then take the run distribution for the next two innings and feed it back into the function to add another inning, which gives you the run distribution over the next three innings, and so on.

That first function only creates run distributions from the start of an inning through the end of the game. The second function does the same thing, but it calculates the run distribution from any base-out state through the end of the game rather than just from the start of an inning. Its extra parameter "h" is a number from 1:24 that identifies the base-out state. (The reason the first function also exists is that the distributions from the start of an inning through the end of the game are fed into the second function later on.)
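The "add another inning" step can be sketched roughly as follows. This is not the function from the linked code: `add_inning` is a hypothetical helper, the single-inning distribution is made up, and folding totals above the cap into the last bucket is an assumption. It also treats the innings as independent draws from the same distribution.

```r
# Hypothetical helper, not the author's function: combine two independent
# run distributions (element k is P(k - 1 runs)) into the distribution of
# their sum, capping the total at `max_runs`.
add_inning <- function(so_far, inning, max_runs = 30) {
  total <- numeric(length(so_far) + length(inning) - 1)
  for (i in seq_along(so_far)) {
    for (j in seq_along(inning)) {
      # (i - 1) + (j - 1) runs lands in slot i + j - 1
      total[i + j - 1] <- total[i + j - 1] + so_far[i] * inning[j]
    }
  }
  if (length(total) > max_runs + 1) {
    # Fold anything above the cap into the last bucket (an assumption)
    total <- c(total[seq_len(max_runs)],
               sum(total[(max_runs + 1):length(total)]))
  }
  total
}

# Made-up single-inning distribution for 0..3 runs
inning <- c(0.70, 0.20, 0.07, 0.03)

two_innings   <- add_inning(inning, inning)
three_innings <- add_inning(two_innings, inning)
# The same idea, feeding each result back in for a full nine innings
nine_innings  <- Reduce(add_inning, rep(list(inning), 9))
```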


diag.sum() is a helper function that is used in conjunction with those two functions. They don't actually return a single row of probabilities for scoring each number of runs. Rather, they return a table of data, with different rows giving probabilities for different combinations that lead to a given number of runs.

For example, say we want to know the probability of scoring exactly 1 run through the end of the game. Because of how the function works, it will give us a table where one row gives the probability of scoring 1 run in this inning and 0 runs for the rest of the game, and another row gives the probability of scoring 0 runs this inning and then 1 run for the rest of the game. To get the total probability of scoring 1 run through the end of the game, we have to add those two probabilities together.

Fortunately, the different combinations that lead to the same number of runs end up on the same diagonal in the table the function returns. This means that to get the total probabilities, we just have to sum up the diagonals of that table, which is what diag.sum() does, hence the name.
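Here is a small stand-in for the diagonal-summing idea, in case it helps to see it on a tiny table. It is an assumption-laden sketch: `sum_antidiagonals` is a hypothetical function, the two distributions are made up, and the table is laid out with runs this inning down the rows and runs for the rest of the game across the columns, which may not match the layout the real code uses.

```r
# Hypothetical stand-in for the idea behind diag.sum(); the original
# function may lay out its table differently.
sum_antidiagonals <- function(tab) {
  # Cells with the same row + column index share the same run total
  as.numeric(tapply(tab, row(tab) + col(tab), sum))
}

this_inning  <- c(0.70, 0.20, 0.10)   # made-up P(0), P(1), P(2) runs
rest_of_game <- c(0.50, 0.30, 0.20)   # made-up P(0), P(1), P(2) runs

# combos[i, j] = P(i - 1 runs this inning, j - 1 runs the rest of the game)
combos <- outer(this_inning, rest_of_game)
# P(0), P(1), ..., P(4) runs in total
totals <- sum_antidiagonals(combos)

# P(exactly 1 run) = P(1 now, 0 later) + P(0 now, 1 later)
p_one <- this_inning[2] * rest_of_game[1] + this_inning[1] * rest_of_game[2]
```

Here `totals[2]` and `p_one` are the same number, since the two ways of scoring exactly 1 run sit on the same diagonal of `combos`.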
