Nate Silver’s 538 US presidential election forecasts include 80% intervals for 51 locations (50 states + Washington DC). Here’s a reproducible notebook looking at how well 538 did this year. Click on “Code” to see the code used to perform each step of this analysis.
library(tidyverse)
library(jsonlite)
library(lubridate)
library(forcats)
library(plotly)
We start by loading the election results as displayed by the New York Times (scraped by Alex Gaynor). The results shown here are based on data downloaded on November 8, 2020.
# read the JSON file and extract the relevant information into a data frame
results <- jsonlite::read_json("results.json")
state_name <- results$data$races %>% map_chr("state_name")
electoral_votes <- results$data$races %>% map_dbl("electoral_votes")
biden <- results$data$races %>%
map_dbl(~ .x$candidates %>%
keep(~ .x$last_name == "Biden") %>%
map_dbl("percent"))
trump <- results$data$races %>%
map_dbl(~ .x$candidates %>%
keep(~ .x$last_name == "Trump") %>%
map_dbl("percent"))
actual <- tibble(state = state_name,
biden = biden,
trump = trump,
electoral_votes = electoral_votes)
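As a quick sanity check (an added step, not in the original analysis), we can confirm that we have all 51 locations, that the vote shares are on a 0 to 100 scale, and that the electoral votes total 538:
# sanity check (added): 51 locations, vote shares in percent, 538 total electoral votes
stopifnot(nrow(actual) == 51,
          all(actual$biden >= 0, actual$biden <= 100),
          all(actual$trump >= 0, actual$trump <= 100),
          sum(actual$electoral_votes) == 538)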
We next get the 538 forecasts from their GitHub repository. In particular, we are interested in what they predicted on election day, November 3, 2020.
# extract the relevant information into a data frame
five38 <- read_csv("election-forecasts-2020/presidential_state_toplines_2020.csv")
five38 <- five38 %>%
mutate(date = mdy(modeldate))
forecasts <- five38 %>%
filter(date == "2020-11-03") %>%
transmute(state,
trump = voteshare_inc,
trump_lo = voteshare_inc_lo,
trump_hi = voteshare_inc_hi,
biden = voteshare_chal,
biden_lo = voteshare_chal_lo,
biden_hi = voteshare_chal_hi,
biden_win_prob = winstate_chal)
# reorder the states by actual % Biden and join the two data frames
# and add some relevant columns
both <- actual %>%
left_join(forecasts, by = "state", suffix = c("_actual",
"_538")) %>%
mutate(state = fct_reorder(state, biden_actual),
cover80 = biden_actual <= biden_hi & biden_actual >= biden_lo,
biden_win = biden_actual > trump_actual,
pred_biden_win = biden_538 > trump_538)
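A left join quietly produces NAs when state names fail to match across sources, so it is worth verifying (another added check) that every location found its forecast:
# verify (added check): every location matched a forecast row
stopifnot(nrow(both) == 51, !anyNA(both$biden_538))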
This plot shows the actual percentage voting for Biden (in black) for each location along with 538’s 80% interval.
both %>%
ggplot(aes(x = state,
y = biden_538,
ymin = biden_lo,
ymax = biden_hi,
color = cover80)) +
geom_errorbar() +
geom_point(data = both, aes(x = state, y = biden_actual), color = "black") +
theme(legend.position = "none") +
labs(x = "State", y = "Percent Voting Biden", title = "How Well Did 538 Do?") +
coord_flip()
# Coverage:
coverage <- round(100*binom.test(sum(both$cover80), 51)$conf.int)
# Absolute error:
ae <- both %>%
summarize(ae = mean(abs(biden_actual - biden_538))) %>%
pull(ae)
# Number of wrong calls:
confusion_matrix <- both %>%
transmute(biden_win_actual = biden_actual > trump_actual,
biden_win_538 = biden_538 > trump_538) %>%
table()
The 80% intervals from 538 included the actual value in 44 of the 51 locations, an empirical coverage proportion of about 86%. If we treat each state's interval covering its target as independent of the other states, a 95% confidence interval for the true coverage percentage runs from 74% to 94%, which notably includes the target coverage probability of 80%. In practice, though, one would expect the errors to be positively correlated: a mistake made in one state may very well be made in many states. Accounting for that correlation would make the 95% confidence interval even wider.
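To get a feel for how much positive correlation could matter, here is a minimal simulation sketch (my addition; the correlation parameter rho is an assumption, not something estimated from 538's model). Each state's interval covers with marginal probability 0.8, but the coverage indicators share a common shock, so misses tend to cluster:
# sketch (added): coverage counts when the 51 states share a common shock
# rho is an assumed latent correlation, not estimated from data
set.seed(123)
rho <- 0.5
nsim <- 5000
sim_cover <- replicate(nsim, {
  z <- sqrt(rho) * rnorm(1) + sqrt(1 - rho) * rnorm(51)
  sum(pnorm(z) < 0.8)  # each state covers marginally with probability 0.8
})
# the spread of achievable empirical coverage, much wider than under independence
quantile(sim_cover / 51, c(0.025, 0.975))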
Despite the good coverage, the intervals do appear to be systematically biased (in the statistical sense of the term) in favor of Biden. A point above the \(45^\circ\) line in the next plot means the forecast overestimated Biden's vote share. (The two red points are the two states the forecast called wrong, and point size represents the number of electoral votes.)
lims <- c(25, 95)
g <- both %>%
mutate(correct_forecast = biden_win == pred_biden_win) %>%
ggplot(aes(x = biden_actual,
y = biden_538,
size = electoral_votes,
fill = correct_forecast,
label = state)) +
geom_point(color = "black", pch = 21) +
geom_abline(slope = 1, intercept = 0) +
labs(x = "Actual % Voting for Biden",
y = "Predicted % Voting for Biden") +
theme(legend.position = "none") +
xlim(lims) +
ylim(lims)
ggplotly(g, tooltip = c("state",
"biden_actual",
"biden_538",
"electoral_votes"))
both %>% filter(biden_actual > biden_538) %>%
select(state, biden_actual, biden_538) %>%
knitr::kable()
| state | biden_actual | biden_538 |
|---|---|---|
| California | 65.1 | 64.06170 |
| Colorado | 55.3 | 54.53553 |
| District of Columbia | 92.6 | 91.27472 |
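Only three of the 51 locations saw Biden outperform the forecast. To put a single number on the tilt, one can compute the signed mean error (an added summary, not in the original notebook); positive values mean the forecast overestimated Biden:
# signed mean error (added): positive means the forecast overestimated Biden
both %>%
  summarize(bias = mean(biden_538 - biden_actual))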
On average, the 538 state-level point forecasts were within 3 percentage points of the actual result.
The 538 forecasts got the winner right in all but 2 of the 51 locations (96% accuracy). The mistakes were Florida and North Carolina.
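The misses can also be pulled directly from the data (a convenience query added alongside the confusion matrix above):
# list (added) the locations where the predicted winner differed from the actual
both %>%
  filter(biden_win != pred_biden_win) %>%
  select(state, biden_actual, biden_538, trump_actual, trump_538)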
The 538 forecasts are probabilistic, so we can look at how sure they were in the places they got wrong versus the places they got right.
g <- both %>%
mutate(how_sure = pmax(biden_win_prob, 1 - biden_win_prob),
were_correct = if_else(biden_win == pred_biden_win,
"Forecast Was Right", "Forecast Was Wrong")) %>%
ggplot(aes(x = how_sure, y = were_correct, state = state)) +
geom_point() +
labs(x = "Probability given by 538 model", y = "",
title = "When 538 gave a high probability, did it tend to be right?") +
xlim(0.5, 1)
ggplotly(g, tooltip = "state")
set.seed(123)
p <- both %>%
mutate(p = pmax(biden_win_prob, 1 - biden_win_prob)) %>%
pull(p)
nsim <- 5000
num_right <- colSums(matrix(rbinom(n = nsim * length(p), size = 1, prob = p), ncol = nsim))
tibble(num_right = num_right) %>%
ggplot(aes(x = num_right, y = ..density..)) + geom_histogram(binwidth = 1, color = "white") +
geom_vline(xintercept = sum(both$biden_win == both$pred_biden_win),
col = "red", lwd = 2) +
labs(x = "Number of correct forecasts",
title = "Did 538 do better than they might have expected?")
Their actual performance was within the range they might have expected (and if one were to account for the positive correlation in errors, the distribution shown would be even wider).
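The same common-shock trick sketched earlier for interval coverage applies here (again my addition, with rho an assumed correlation): each simulated draw preserves every state's marginal probability of a correct call but lets the 51 outcomes move together.
# sketch (added): correct-call counts when the 51 outcomes share a common shock
set.seed(123)
rho <- 0.5
num_right_corr <- replicate(nsim, {
  z <- sqrt(rho) * rnorm(1) + sqrt(1 - rho) * rnorm(length(p))
  sum(pnorm(z) < p)  # state i is called correctly with marginal probability p[i]
})
sd(num_right_corr)  # compare with sd(num_right) from the independent simulation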
Let’s repeat this for 538’s forecasts from one month earlier (October 4, 2020).
old_forecasts <- five38 %>%
filter(date == "2020-10-04") %>%
transmute(state,
trump = voteshare_inc,
trump_lo = voteshare_inc_lo,
trump_hi = voteshare_inc_hi,
biden = voteshare_chal,
biden_lo = voteshare_chal_lo,
biden_hi = voteshare_chal_hi,
biden_win_prob = winstate_chal)
old_both <- actual %>%
left_join(old_forecasts, by = "state", suffix = c("_actual",
"_538")) %>%
mutate(state = fct_reorder(state, biden_actual),
cover80 = biden_actual <= biden_hi & biden_actual >= biden_lo,
biden_win = biden_actual > trump_actual,
pred_biden_win = biden_538 > trump_538)
old_both %>%
ggplot(aes(x = state,
y = biden_538,
ymin = biden_lo,
ymax = biden_hi,
color = cover80)) +
geom_errorbar() +
geom_point(data = old_both, aes(x = state, y = biden_actual), color = "black") +
theme(legend.position = "none") +
labs(x = "State", y = "Percent Voting Biden", title = "How Well Did 538 One Month in Advance?") +
coord_flip()
g <- old_both %>%
mutate(correct_forecast = biden_win == pred_biden_win) %>%
ggplot(aes(x = biden_actual, y = biden_538,
size = electoral_votes,
fill = correct_forecast,
label = state)) +
geom_point(color = "black", pch = 21) +
geom_abline(slope = 1, intercept = 0) +
labs(x = "Actual % Voting for Biden",
y = "Predicted % Voting for Biden") +
theme(legend.position = "none") +
xlim(lims) +
ylim(lims)
ggplotly(g, tooltip = c("state",
"biden_actual",
"biden_538",
"electoral_votes"))
g <- old_both %>%
mutate(how_sure = pmax(biden_win_prob, 1 - biden_win_prob),
were_correct = if_else(biden_win == pred_biden_win,
"Forecast Was Right", "Forecast Was Wrong")) %>%
ggplot(aes(x = how_sure, y = were_correct, state = state)) +
geom_point() +
labs(x = "Probability given by 538 model", y = "",
title = "When 538 gave a high probability, did it tend to be right?") +
xlim(0.5, 1)
ggplotly(g, tooltip = "state")
Apparently, their forecasts were doing well even one month before the election.
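To put numbers on that impression (an added comparison), the earlier summaries can be recomputed for the October forecasts and set beside the election-eve values:
# recompute (added) the headline summaries for the October 4 forecasts
old_both %>%
  summarize(coverage = mean(cover80),
            mae = mean(abs(biden_actual - biden_538)),
            correct_calls = sum(biden_win == pred_biden_win))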