Simulating March Madness in R

March 13, 2018 15-minute read

When my brother told me that his family trip to Vegas just so happened to coincide with the commencement of March Madness, I laughed. Then he asked me to do my data science thing for his bracket selection and I sighed… and said that I’d give it a shot.

I don’t get into college sports. My alma mater, UIC, is more of a research powerhouse than a sports powerhouse. And my feeling is that I’d rather pay to see professionals than pay to see college students work for free. Although, my neighbor and ex-classmate, Curtis Granderson, did donate a ton of money so that the baseball team could build a new stadium. Every year I intend on becoming a regular attendee, but with an Opening Day in February, I lack the dedication.

FiveThirtyEight

Where does a data scientist start? With data! Of course FiveThirtyEight.com has a great data set. I was able to get a head start using their 2017 data. If you go to the bottom you can see that they let you download the csv file. Like true professionals, their csv format is almost identical this year. This made updating my code very easy… And now on to the coding for all the nerds out there.

If you’re here for the insights, spoiler alert: There ain’t much difference between my results and 538’s except that Villanova is going all the way baby!

Simulating the tournament

First I downloaded the data from the website and saved it. I could’ve used the download.file() function to highlight my sweet data science skills, but I don’t need your validation hipster data scientists! The second and third lines of code aren’t necessary right now, but if you look at the 2017 data, they eventually added the women’s bracket and appended forecasts from later dates in the tournament. If you look at this code after today, March 13th, you might want to use this.

Here’s what the data looks like:

dat <- read.csv("fivethirtyeight_ncaa_forecasts_2018.csv", stringsAsFactors = FALSE)
dat <- dat[which(dat$gender == "mens" &
                     dat$forecast_date == "2018-03-12"), ]
dat18 <- dat
dat18[, 5:10] <- dat[, 5:10] / dat[, 4:9]
head(dat18)

##   gender forecast_date playin_flag rd1_win   rd2_win   rd3_win   rd4_win
## 1   mens    2018-03-12           0       1 0.9826613 0.8756271 0.7719774
## 2   mens    2018-03-12           0       1 0.9853076 0.8759937 0.7739714
## 3   mens    2018-03-12           0       1 0.9580615 0.8417617 0.5785810
## 4   mens    2018-03-12           0       1 0.9541362 0.8007723 0.8081242
## 5   mens    2018-03-12           0       1 0.9227570 0.8134675 0.5388669
## 6   mens    2018-03-12           0       1 0.9326323 0.8311586 0.6775802
##     rd5_win   rd6_win   rd7_win team_alive team_id      team_name
## 1 0.7113416 0.6808053 0.5592682          1     258       Virginia
## 2 0.7441768 0.5854901 0.5964982          1     222      Villanova
## 3 0.6178763 0.5637075 0.5951155          1     150           Duke
## 4 0.5157563 0.4794573 0.5201710          1    2305         Kansas
## 5 0.6040952 0.4937998 0.5330441          1     127 Michigan State
## 6 0.4376696 0.5804383 0.4625857          1    2132     Cincinnati
##   team_rating team_region team_seed
## 1       94.31       South         1
## 2       94.92        East         1
## 3       92.59     Midwest         2
## 4       90.88     Midwest         1
## 5       91.67     Midwest         3
## 6       91.06       South         2

From here we simulate tournament play.

I handled the First Four awkwardness last year using code like this. mat.17 is so named because at that point there are 17 teams in the region. The resulting mat.16 brings us to the more standard 16-team tournament.

mat.17 <- dat18[which(dat18$team_region == "Midwest" &
                          dat18$gender == "mens" &
                          dat18$forecast_date == "2018-03-12"),]

first.four <- which(mat.17$team_seed == "11a" | mat.17$team_seed == "11b")
mat.16 <- rbind(mat.17[-first.four, ],
                mat.17[sample(x = first.four, size = 1,
                              prob = mat.17[first.four, "rd1_win"]), ])
mat.16[16, "team_seed"] <- "11"
mat.16[, "team_seed"] <- as.numeric(mat.16[, "team_seed"])
mat.16 <- mat.16[order(mat.16$team_seed), ]
mat.16

##    gender forecast_date playin_flag   rd1_win    rd2_win   rd3_win
## 4    mens    2018-03-12           0 1.0000000 0.95413624 0.8007723
## 3    mens    2018-03-12           0 1.0000000 0.95806152 0.8417617
## 5    mens    2018-03-12           0 1.0000000 0.92275700 0.8134675
## 23   mens    2018-03-12           0 1.0000000 0.79816102 0.5216653
## 25   mens    2018-03-12           0 1.0000000 0.61981004 0.5658178
## 21   mens    2018-03-12           0 1.0000000 0.57353779 0.2372587
## 34   mens    2018-03-12           0 1.0000000 0.58044746 0.1972652
## 22   mens    2018-03-12           0 1.0000000 0.63156066 0.2501076
## 42   mens    2018-03-12           0 1.0000000 0.36843934 0.1838281
## 36   mens    2018-03-12           0 1.0000000 0.41955254 0.1628328
## 43   mens    2018-03-12           1 0.5877584 0.46073568 0.2195347
## 46   mens    2018-03-12           0 1.0000000 0.38018996 0.4731856
## 56   mens    2018-03-12           0 1.0000000 0.20183898 0.2627198
## 53   mens    2018-03-12           0 1.0000000 0.07724300 0.3434272
## 61   mens    2018-03-12           0 1.0000000 0.04193848 0.2556496
## 57   mens    2018-03-12           0 1.0000000 0.04586376 0.2238489
##      rd4_win   rd5_win   rd6_win   rd7_win team_alive team_id
## 4  0.8081242 0.5157563 0.4794573 0.5201710          1    2305
## 3  0.5785810 0.6178763 0.5637075 0.5951155          1     150
## 5  0.5388669 0.6040952 0.4937998 0.5330441          1     127
## 23 0.2713751 0.2677178 0.3011693 0.3526738          1       2
## 25 0.2799675 0.2758782 0.3023590 0.3538549          1     228
## 21 0.3506738 0.4321323 0.4005194 0.4480226          1    2628
## 34 0.2760591 0.3406308 0.2712332 0.3225783          1     227
## 22 0.6034425 0.2983157 0.3133056 0.3646722          1    2550
## 42 0.5025144 0.2280794 0.2512151 0.3020226          1     152
## 36 0.2996991 0.3644445 0.3087235 0.3601552          1     201
## 43 0.2710992 0.3507883 0.2756146 0.3270297          1     183
## 46 0.1810895 0.1809292 0.2254796 0.2750365          1     166
## 56 0.1242912 0.1252019 0.1478387 0.1891547          1     232
## 53 0.1640383 0.2287034 0.1766618 0.2218949          1    2083
## 61 0.1059174 0.1482783 0.1137260 0.1489058          1     314
## 57 0.2781671 0.1104513 0.1245251 0.1618349          1     219
##                team_name team_rating team_region team_seed
## 4                 Kansas       90.88     Midwest         1
## 3                   Duke       92.59     Midwest         2
## 5         Michigan State       91.67     Midwest         3
## 23                Auburn       84.04     Midwest         4
## 25               Clemson       84.15     Midwest         5
## 21       Texas Christian       84.84     Midwest         6
## 34          Rhode Island       83.07     Midwest         7
## 22            Seton Hall       85.05     Midwest         8
## 42  North Carolina State       81.74     Midwest         9
## 36              Oklahoma       82.10     Midwest        10
## 43              Syracuse       81.86     Midwest        11
## 46      New Mexico State       79.79     Midwest        12
## 56 College of Charleston       76.44     Midwest        13
## 53              Bucknell       78.25     Midwest        14
## 61                  Iona       74.79     Midwest        15
## 57          Pennsylvania       75.55     Midwest        16

Next I created a general tournament function using my knowledge of tournaments gained from 10 years of wrestling; i.e. the best team plays the worst team, the second best team plays the second worst team, etc. So if you order the data frames as we’ve done above, this makes picking the teams being simulated very easy. That is expressed in the ind <- c(i, nrow(mat.big) + 1 - i) line of code. The line of code after that simulates a game by pulling one sample (or team) according to their win probability in a given round as prescribed by 538. Lastly, we collect our winners and output them as mat.small.

Tournament.Round <- function(mat.big = mat.16, rd = "rd2_win"){
  vec.small <- c()
  for(i in 1:(nrow(mat.big) / 2)){
    ind <- c(i, nrow(mat.big) + 1 - i)
    temp <- sample(x = ind, size = 1, prob = mat.big[ind, rd])
    vec.small <- c(vec.small, temp)
  }
  mat.small <- mat.big[vec.small, ]
}

Now we use that Tournament.Round() function and the earlier First Four code to create a function that will simulate an entire 17-team regional outcome. The First Four code is kind of awkward and long. If I were to do this for work, I’d clean it up and make it work in perpetuity, but I don’t work for you!

Region.Simulation <- function(region = "Midwest", dat = dat18){
  mat.17 <- dat[which(dat$team_region == region &
                          dat$gender == "mens" &
                          dat$forecast_date == "2018-03-12"),]

  # First Four... only for Midwest and South
  if(region == "Midwest"){
    first.four <- which(mat.17$team_seed == "11a" | mat.17$team_seed == "11b")
    mat.16 <- rbind(mat.17[-first.four, ],
                    mat.17[sample(x = first.four, size = 1,
                                  prob = mat.17[first.four, "rd1_win"]), ])
    mat.16[16, "team_seed"] <- "11"
    mat.16[, "team_seed"] <- as.numeric(mat.16[, "team_seed"])
    mat.16 <- mat.16[order(mat.16$team_seed), ]
  }
  if(region == "South"){
    mat.16 <- mat.17
    mat.16[, "team_seed"] <- as.numeric(mat.16[, "team_seed"])
    mat.16 <- mat.16[order(mat.16$team_seed), ]
  }

  if(region == "East"){
    first.four <- which(mat.17$team_seed == "16a" | mat.17$team_seed == "16b")
    mat.17 <- rbind(mat.17[-first.four, ],
                    mat.17[sample(x = first.four, size = 1,
                                  prob = mat.17[first.four, "rd1_win"]), ])
    mat.17[17, "team_seed"] <- "16"
    first.four <- which(mat.17$team_seed == "11a" | mat.17$team_seed == "11b")
    mat.16 <- rbind(mat.17[-first.four, ],
                    mat.17[sample(x = first.four, size = 1,
                                  prob = mat.17[first.four, "rd1_win"]), ])
    mat.16[16, "team_seed"] <- "11"

    mat.16[, "team_seed"] <- as.numeric(mat.16[, "team_seed"])
    mat.16 <- mat.16[order(mat.16$team_seed), ]
  }
  if(region == "West"){
    first.four <- which(mat.17$team_seed == "16a" | mat.17$team_seed == "16b")
    mat.16 <- rbind(mat.17[-first.four, ],
                    mat.17[sample(x = first.four, size = 1,
                                  prob = mat.17[first.four, "rd1_win"]), ])
    mat.16[16, "team_seed"] <- "16"
    mat.16[, "team_seed"] <- as.numeric(mat.16[, "team_seed"])
    mat.16 <- mat.16[order(mat.16$team_seed), ]
  }
  # Round of 16
  mat.8 <- Tournament.Round(mat.16, rd = "rd2_win")
  mat.4 <- Tournament.Round(mat.8, rd = "rd3_win")
  mat.2 <- Tournament.Round(mat.4, rd = "rd4_win")
  mat.1 <- Tournament.Round(mat.2, rd = "rd5_win")
  list(First_Four = mat.16[c(11, 16), "team_name"],
       First_Round = mat.8[, "team_name"],
       Second_Round = mat.4[, "team_name"],
       Regional_Semis = mat.2[, "team_name"],
       Regional_Final = mat.1[, c("team_name", "rd6_win", "rd7_win")])
}

Here’s an example of the output from one simulation of the Midwest region. For ease of display I’ll only output the regional champ:

Region.Simulation(region = "Midwest")$Regional_Final

##          team_name   rd6_win   rd7_win
## 21 Texas Christian 0.4005194 0.4480226

Now we need to take the previous code and add a few lines to simulate the Final Four and the championship:

March.Madness.Simulation <- function(dat = dat){
  midwest <- Region.Simulation(region = "Midwest", dat = dat)
  east <- Region.Simulation(region = "East", dat = dat)
  west <- Region.Simulation(region = "West", dat = dat)
  south <- Region.Simulation(region = "South", dat = dat)
  reg1 <- sample(x = c("east", "west"), size = 1,
                 prob = c(east$Regional_Final$rd6_win,
                          west$Regional_Final$rd6_win))
  National_Semi1 <- get(reg1)$Regional_Final
  reg2 <- sample(x = c("midwest", "south"), size = 1,
                 prob = c(midwest$Regional_Final$rd6_win,
                          south$Regional_Final$rd6_win))
  National_Semi2 <- get(reg2)$Regional_Final
  National_Champ <- sample(x = c(National_Semi1$team_name,
                                 National_Semi2$team_name),
                          size = 1,
                          prob = c(National_Semi1$rd7_win,
                                   National_Semi2$rd7_win))
  midwest$Regional_Final <- midwest$Regional_Final$team_name
  east$Regional_Final <- east$Regional_Final$team_name
  west$Regional_Final <- west$Regional_Final$team_name
  south$Regional_Final <- south$Regional_Final$team_name

  list(Midwest = midwest, East = east, West = west, South = south,
       National_Semi = c(National_Semi1$team_name, National_Semi2$team_name),
       National_Champ = National_Champ)
}

Analysis

To get a sense of what these probabilities look like in simulation I’m going to simulate 10,000 March Madness Tournaments.

nat.champ.vec <- unlist(March.Madness.Simulation(dat = dat18))
nat.champ.mat <- matrix("", nrow = 10000, ncol = length(nat.champ.vec))
colnames(nat.champ.mat) <- names(nat.champ.vec)

for(i in 1:10000){
  set.seed(i)
  nat.champ.mat[i, ] <- unlist(March.Madness.Simulation(dat = dat18))
}
nat.champ.mat <- as.data.frame(nat.champ.mat)

Let’s see who wins the tournament simulations and how often!

tbl <- sort(table(nat.champ.mat$National_Champ), decreasing = TRUE)
tbl

##
##                 Villanova                  Virginia
##                      1447                      1384
##                      Duke                    Kansas
##                       724                       681
##                Cincinnati            Michigan State
##                       580                       578
##            North Carolina                    Purdue
##                       559                       530
##                   Gonzaga                    Xavier
##                       398                       351
##                  Michigan                  Kentucky
##                       268                       231
##             West Virginia             Wichita State
##                       226                       201
##                 Tennessee                   Houston
##                       194                       159
##                   Arizona                Texas Tech
##                       153                       152
##                Ohio State                   Florida
##                       116                       100
##                Seton Hall                    Butler
##                        78                        59
##           Texas Christian             Florida State
##                        59                        58
##                Miami (FL)                    Auburn
##                        57                        56
##                     Texas             Virginia Tech
##                        56                        56
##                 Texas A&M                 Creighton
##                        47                        44
##              Kansas State                   Clemson
##                        41                        37
##                   Alabama                  Arkansas
##                        33                        28
##              Rhode Island                  Missouri
##                        28                        25
##               Loyola (IL)           San Diego State
##                        24                        24
##                  Davidson                  Oklahoma
##                        17                        17
##                    Nevada                      UCLA
##                        16                        16
##      North Carolina State          New Mexico State
##                        14                        11
##                Providence           St. Bonaventure
##                         9                         9
##                  Syracuse              Murray State
##                         9                         8
##        South Dakota State                  Bucknell
##                         6                         5
##             Arizona State                   Buffalo
##                         4                         3
##             Georgia State                      Iona
##                         3                         3
##                  Marshall              Pennsylvania
##                         2                         2
##     College of Charleston                   Montana
##                         1                         1
## North Carolina-Greensboro         Stephen F. Austin
##                         1                         1

So using the simulations, we have Villanova winning 14.5% of the time and Virginia winning 13.8% of the time. Compare this to 538’s predictions of 17% and 18% respectively… as of 3/13/18. These numbers are pretty far apart.

To be fair, I’m not modeling any changes in win probabilities given certain matchups. I got similar results using last year’s data. Let’s do a proper comparison:

champs <- names(tbl)
MeVs538 <- merge(data.frame(team_name = champs, simulation = c(tbl) / sum(tbl)),
                 dat[, c("team_name", "rd7_win")])
MeVs538 <- MeVs538[order(MeVs538$simulation, decreasing = TRUE), ]
MeVs538$rd7_win <- round(MeVs538$rd7_win, 2)
MeVs538 <- cbind(MeVs538, simulation_odds = round((10000 - c(tbl))/c(tbl), 2))
MeVs538

##                    team_name simulation rd7_win simulation_odds
## 55                 Villanova     0.1447    0.17            5.91
## 56                  Virginia     0.1384    0.18            6.23
## 14                      Duke     0.0724    0.10           12.81
## 21                    Kansas     0.0681    0.08           13.68
## 9                 Cincinnati     0.0580    0.06           16.24
## 28            Michigan State     0.0578    0.06           16.30
## 34            North Carolina     0.0559    0.05           16.89
## 41                    Purdue     0.0530    0.05           17.87
## 18                   Gonzaga     0.0398    0.04           24.13
## 60                    Xavier     0.0351    0.03           27.49
## 27                  Michigan     0.0268    0.02           36.31
## 23                  Kentucky     0.0231    0.02           42.29
## 58             West Virginia     0.0226    0.02           43.25
## 59             Wichita State     0.0201    0.01           48.75
## 49                 Tennessee     0.0194    0.02           50.55
## 19                   Houston     0.0159    0.01           61.89
## 2                    Arizona     0.0153    0.01           64.36
## 53                Texas Tech     0.0152    0.01           64.79
## 37                Ohio State     0.0116    0.01           85.21
## 15                   Florida     0.0100    0.01           99.00
## 44                Seton Hall     0.0078    0.00          127.21
## 8                     Butler     0.0059    0.00          168.49
## 52           Texas Christian     0.0059    0.00          168.49
## 16             Florida State     0.0058    0.00          171.41
## 26                Miami (FL)     0.0057    0.00          174.44
## 5                     Auburn     0.0056    0.00          177.57
## 50                     Texas     0.0056    0.00          177.57
## 57             Virginia Tech     0.0056    0.00          177.57
## 51                 Texas A&M     0.0047    0.00          211.77
## 12                 Creighton     0.0044    0.00          226.27
## 22              Kansas State     0.0041    0.00          242.90
## 10                   Clemson     0.0037    0.00          269.27
## 1                    Alabama     0.0033    0.00          302.03
## 4                   Arkansas     0.0028    0.00          356.14
## 42              Rhode Island     0.0028    0.00          356.14
## 29                  Missouri     0.0025    0.00          399.00
## 24               Loyola (IL)     0.0024    0.00          415.67
## 43           San Diego State     0.0024    0.00          415.67
## 13                  Davidson     0.0017    0.00          587.24
## 38                  Oklahoma     0.0017    0.00          587.24
## 32                    Nevada     0.0016    0.00          624.00
## 54                      UCLA     0.0016    0.00          624.00
## 35      North Carolina State     0.0014    0.00          713.29
## 33          New Mexico State     0.0011    0.00          908.09
## 40                Providence     0.0009    0.00         1110.11
## 46           St. Bonaventure     0.0009    0.00         1110.11
## 48                  Syracuse     0.0009    0.00         1110.11
## 31              Murray State     0.0008    0.00         1249.00
## 45        South Dakota State     0.0006    0.00         1665.67
## 6                   Bucknell     0.0005    0.00         1999.00
## 3              Arizona State     0.0004    0.00         2499.00
## 7                    Buffalo     0.0003    0.00         3332.33
## 17             Georgia State     0.0003    0.00         3332.33
## 20                      Iona     0.0003    0.00         3332.33
## 25                  Marshall     0.0002    0.00         4999.00
## 39              Pennsylvania     0.0002    0.00         4999.00
## 11     College of Charleston     0.0001    0.00         9999.00
## 30                   Montana     0.0001    0.00         9999.00
## 36 North Carolina-Greensboro     0.0001    0.00         9999.00
## 47         Stephen F. Austin     0.0001    0.00         9999.00

And now a simple plot:

plot(MeVs538[, c(2, 3)], xlab = "My Simulations", ylab = "538 Simulations")
abline(0, 1)

Five38plot2

Our simulations are mostly comparable with the exception of the favorites that are a bit off.

I guess now is a good time to read 538’s How Our March Madness Predictions Work. So yeah, they calculate explicit probabilities that take into account matchups and teams' probabilities of advancing. But still… these are really far apart!

Looking at their very nice visualization of the bracket probabilities, it’s easy to see what goes wrong. Villanova has a 50% chance of advancing to the Final Four, whereas Virginia has only a 47% chance. That (and some earlier minor differences) are enough to swamp the 3% and 1% higher probabilities of winning the Final Four and National Championship games, respectively.

What to make of it? Maybe 538 would do well to actually run some simulations.? I think their conditional probabilities may be missing some interaction effects. Or maybe the interaction effects between teams really are that prominent. I don’t know.? Like I said, I don’t follow college sports :)

Edit 1:

After thinking about it overnight, 538’s probabilities are the probabilities of making it to a given round and winning. In order to simulate the tournament correctly, I need the probability of winning a game assuming they’ve made it to that round. To uncondition the variable like this I simply divide by the probability of making it to and winning the previous round. That’s the fifth line of code.

Edit 2:

Awkwardly, if you go and download 538’s data now, the numbers they published on Monday Forecast_date == 2018-03-12, were retroactively changed so that Virginia is now ranked second instead of first. I guess I’ll take that as a win.? (For the skeptics, here’s the video evidence.)

Above I implement Edit 1 using Monday’s original data. My analysis maintains that Villanova is favored, but brings the likelihood below 538’s projections. Villanova is still ranked first for me and second for them, as of the real Monday forecasts. And now Virginia’s Guard has broken his wrist, so everyone is jumping on the Villanova train regardless.

End of Edit

Finally, here are the most frequent results from the simulations:

t(t(apply(nat.champ.mat, 2, function(x){names(sort(table(x), decreasing = TRUE))[1]})))

##                         [,1]
## Midwest.First_Four1     "Syracuse"
## Midwest.First_Four2     "Pennsylvania"
## Midwest.First_Round1    "Kansas"
## Midwest.First_Round2    "Duke"
## Midwest.First_Round3    "Michigan State"
## Midwest.First_Round4    "Auburn"
## Midwest.First_Round5    "Clemson"
## Midwest.First_Round6    "Texas Christian"
## Midwest.First_Round7    "Rhode Island"
## Midwest.First_Round8    "Seton Hall"
## Midwest.Second_Round1   "Kansas"
## Midwest.Second_Round2   "Duke"
## Midwest.Second_Round3   "Michigan State"
## Midwest.Second_Round4   "Auburn"
## Midwest.Regional_Semis1 "Kansas"
## Midwest.Regional_Semis2 "Duke"
## Midwest.Regional_Final  "Kansas"
## East.First_Four1        "UCLA"
## East.First_Four2        "Radford"
## East.First_Round1       "Villanova"
## East.First_Round2       "Purdue"
## East.First_Round3       "Texas Tech"
## East.First_Round4       "Wichita State"
## East.First_Round5       "West Virginia"
## East.First_Round6       "Florida"
## East.First_Round7       "Butler"
## East.First_Round8       "Virginia Tech"
## East.Second_Round1      "Villanova"
## East.Second_Round2      "Purdue"
## East.Second_Round3      "Texas Tech"
## East.Second_Round4      "Wichita State"
## East.Regional_Semis1    "Villanova"
## East.Regional_Semis2    "Purdue"
## East.Regional_Final     "Villanova"
## West.First_Four1        "San Diego State"
## West.First_Four2        "Texas Southern"
## West.First_Round1       "Xavier"
## West.First_Round2       "North Carolina"
## West.First_Round3       "Michigan"
## West.First_Round4       "Gonzaga"
## West.First_Round5       "Ohio State"
## West.First_Round6       "Houston"
## West.First_Round7       "Texas A&M"
## West.First_Round8       "Florida State"
## West.Second_Round1      "Xavier"
## West.Second_Round2      "North Carolina"
## West.Second_Round3      "Michigan"
## West.Second_Round4      "Gonzaga"
## West.Regional_Semis1    "Gonzaga"
## West.Regional_Semis2    "North Carolina"
## West.Regional_Final     "North Carolina"
## South.First_Four1       "Loyola (IL)"
## South.First_Four2       "Maryland-Baltimore County"
## South.First_Round1      "Virginia"
## South.First_Round2      "Cincinnati"
## South.First_Round3      "Tennessee"
## South.First_Round4      "Arizona"
## South.First_Round5      "Kentucky"
## South.First_Round6      "Miami (FL)"
## South.First_Round7      "Texas"
## South.First_Round8      "Creighton"
## South.Second_Round1     "Virginia"
## South.Second_Round2     "Cincinnati"
## South.Second_Round3     "Tennessee"
## South.Second_Round4     "Arizona"
## South.Regional_Semis1   "Virginia"
## South.Regional_Semis2   "Cincinnati"
## South.Regional_Final    "Virginia"
## National_Semi1          "Villanova"
## National_Semi2          "Virginia"
## National_Champ          "Villanova"

Edit 3:

Regarding Edit 2, after some investigation 538’s estimates were changed to reflect that Virginia’s guard was injured. The problem is that when they changed their estimates on March 13th, their March 12th csv was presumably recompiled as well. This is a fantastic example of data governance issues and reproducibility that regulated industries have to take very seriously. 538 and March Madness? Not so much.