Did Illegal Voting Cost Trump the Popular Vote?

On the whole, politicians are not typically known for their honesty, but our recently elected 45th president, Mr. Donald J. Trump, has taken the practice of peddling "alternative facts" to a whole new level. In fact, the Pulitzer Prize-winning fact-checking website PolitiFact had, at one point during the campaign, calculated that a mere 4% of his statements could be considered completely true.

Given Trump's propensity for duplicity, there certainly is no shortage of fantastic claims for an enterprising citizen journalist to dig into. For my money though, his most interesting assertion to date still has to be that millions of illegal ballots cost him the popular vote this past November.

This particular assertion is of interest to me for a handful of reasons. First, I have yet to see a single source outside of the Trump camp treat this claim with even an ounce of credibility. Second, unlike many of his most bombastic claims, this one is based on actual research from reputable sources in some cases. And, finally, this particular claim can be easily proven, or disproven as the case may be, by simply looking at the data.

So, let's jump right in and examine the evidence both for and against Trump's claim of a popular vote win. And feel free to follow along with the analysis by downloading the Jupyter notebook for this article and running it locally.

The Evidence for Trump

There seem to be three main sources of proof for Trump's proclamation of a popular vote win.

The first is a peer-reviewed paper published in the journal Electoral Studies by political scientists Jesse T. Richman, Gulshan A. Chattha, and David C. Earnest. The second is a Pew Research report written by David Becker on the inefficiencies within our current voter registration system. The third, and most recent, is a yet-to-be-seen piece of analysis from True the Vote, a non-profit dedicated to monitoring voter fraud through the use of the VoteStand mobile application for "quickly report[ing] suspected election illegalities."

The VoteStand Analysis

Let's start by quickly dispensing with the VoteStand analysis, since there's really nothing to discuss at this point. The story here is that the person behind the creation of the VoteStand application, Gregg Phillips, made the claim shortly after the election that upwards of 3 million votes were cast illegally in favor of Clinton.

To date, however, Gregg and VoteStand have yet to share any of their findings or their methodologies with the general public, so all we currently have to go on are Gregg's initial claims, and even he seems to be hedging his bet a bit now. So, until we see some actual evidence from the VoteStand analysis, there's not much we can do with the unsubstantiated claims of its founder other than acknowledge them and move on.

The Pew Research Report

The other easily disputable piece of evidence is the Pew Research report. In one tweet, Trump stated that he would be calling for an investigation into voter fraud with a specific interest in voters registered in more than one state.

This idea has no doubt come out of his interpretation of the Pew Research findings. However, though the report found evidence of millions of invalid voter registrations (e.g., 24 million outdated records, 1.8 million deceased voters, and 2.4 million registered in multiple states), the author, David Becker, has since stated that no evidence of voter fraud was ever uncovered in the Pew Research study.

Bottom line, there's nothing in this report to suggest that any illegal voting activity occurred in this past election. On a somewhat snarky side note, however, in Trump's own administration there are at least five members that are actively registered to vote in more than one state.

Trump Team Members Registered to Vote in Multiple States

  • Chief White House strategist Stephen Bannon
  • Press Secretary Sean Spicer
  • Treasury Secretary Steven Mnuchin
  • Tiffany Trump, the president's youngest daughter
  • Senior White House Advisor Jared Kushner, and Trump's son-in-law

The question then, given Trump's interest in invalid voter registrations, is "are people such as Stephen Bannon, Sean Spicer, and even Trump's daughter Tiffany part of some conspiracy to illegally elect Hillary Clinton?" Oh, and I almost forgot VoteStand's founder Gregg Phillips who, as it turns out, is registered to vote in three different states according to the Associated Press.

The point I'm trying to make, and this is confirmed in David Becker's tweet, is that, yes, we know our system isn't perfect; people move around and pass away and our voter records are not always kept up to date, but that's not proof that millions of people voted illegally.

The CCES Analysis

That brings me to the most interesting piece of evidence for Trump's claim: a peer-reviewed paper titled "Do non-citizens vote in U.S. elections?". In this paper, three political scientists from Old Dominion University describe an analysis they performed on data from two large online, opt-in surveys conducted by the London-based polling firm YouGov.

The surveys were given during the 2008 presidential election and the subsequent 2010 midterms, and though these surveys were designed to capture information from the adult citizen population, a relatively small number of non-citizens did participate in the surveys: 339 of 32,800 respondents in 2008 and 489 of 55,400 in 2010. The data from these small samples of non-citizen participants was then used to extrapolate the numbers by which Trump professes to have won the popular vote. Specifically, the authors of the paper conclude that up to an estimated 14.7% of non-citizens could have voted in the 2008 election, and as we'll see shortly, it is this overly-optimistic estimate that Trump is drawing upon to prove his November mandate.
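
To put those sample sizes in perspective, here is a minimal sketch that computes the non-citizen share of each survey from the counts quoted above. The variable names are my own and are used purely for illustration.

```python
# Non-citizen respondents and total respondents, as quoted in the text
surveys = {
    2008: (339, 32800),
    2010: (489, 55400),
}

# Share of each survey's respondents who identified as non-citizens
non_citizen_share = {
    year: non_citizens / total
    for year, (non_citizens, total) in surveys.items()
}

for year, share in non_citizen_share.items():
    print('{0}: {1:.2f}% of respondents were non-citizens'.format(year, share * 100))
```

In both years non-citizens made up about 1% of the sample or less, so everything that follows hinges on a few hundred responses.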

Now, there are numerous reasons why Trump's use of this paper's findings as proof of his win could be somewhat misleading, but there are two reasons in particular that we will discuss in the following sections.

Bad Math and Misinterpretation

First, it's no secret that Trump doesn't particularly like to read, so it's quite possible that he just hasn't read the paper that his claim relies so heavily upon. If he had, he would have no doubt realized that the odds of Hillary winning because of a groundswell of support in the non-citizen community are relatively low at best, and are not supported by the results in the paper. But don't take my word for it. The math is simple, so let's run the numbers.

The Cook Political Report, an independent, non-partisan political newsletter, reported that Hillary Clinton won the popular vote by roughly 2.9 million votes. Within the same time period, the total number of non-citizens living in the U.S. was about 23 million, according to a report from the Kaiser Family Foundation.

In [1]:
# The number of votes Hillary won over Trump
popular_vote_margin = 2864974
# The total number of non-citizens living in the U.S. in 2016
total_non_citizens = 22984400
# Proportion of non-citizen votes needed for a Trump win
percent_needed_for_trump_win = round(popular_vote_margin / total_non_citizens * 100, 1)

So, for Trump to win the popular vote, he would have needed approximately 12.5% of non-citizens to not only vote, but to vote exclusively for Hillary. In the paper, Richman et al. conclude that, at the high end, 11.3% of non-citizens may have actually voted in the 2008 election. Now, that falls a bit shy of the 12.5% that Trump needs for a win, but when you take into account the margin of error (see the calculation below), you get a range of 7.9 - 14.7% (with a confidence level of 95%), which would mean that a popular vote victory could, theoretically at least, be possible.

In [2]:
import math

def margin_of_error(p, n, z=1.96):
    """Returns the Margin of Error for the given proportion.

    p -- the proportion of the sample with a characteristic
    n -- the size of the sample
    z -- z-score for the desired confidence level
    """
    return z * math.sqrt(p * (1 - p)/n)

# The proportion (11.3%) of non-citizens that may have voted in 2008
p = 0.113 
# The total number (339) of non-citizens in the 2008 survey
n = 339
# Calculate the margin of error for the sample population of 339 non-citizens
moe = round(margin_of_error(p, n) * 100, 1)
# Calculate the 95% confidence interval
ci = (11.3 - moe, 11.3 + moe)

print('Margin of Error: {0:.1f}%'.format(moe))
print('Confidence Interval (95%): {0:.1f} - {1:.1f}%'.format(*ci))
Margin of Error: 3.4%
Confidence Interval (95%): 7.9 - 14.7%

So, what could possibly be wrong with this picture?

Well, for starters, the paper's lead author himself has stated that the numbers simply don't add up to a victory for Trump. In a blog post on his personal website, Richman explains that the 12.5% that Trump needs for a win is not likely to have materialized. In fact, as Richman points out in his blog post, the authors of the original paper had already estimated that the most likely proportion of non-citizens who voted would have been closer to 6.4%. If Trump had read the entire paper, he would have noticed this number and realized that his claims just don't add up.

In [3]:
# Richman's paper concludes that 6.4% is a more 
# realistic estimate of non-citizen voters
adj_p = 0.064
illegal_voters = adj_p * total_non_citizens

Now, if we use the more realistic percentage to calculate the number of illegal voters, we get only 1.5 million, which is roughly half of the 2.9 million that Trump needs to declare a win.

Furthermore, based on voter behavior from the 2008 election, Richman calculates that it's likely that only 81.8% of those illegal voters would have voted for Clinton. Most of the rest (17.5%) would have gone with Trump, with some small percentage going to third party candidates. Using this new information, we can get a more accurate picture of the number of votes that Clinton would have received from the non-citizen population.

In [4]:
# Of those illegal voters 81.8% would have voted for Clinton
clinton = round(illegal_voters * 0.818)
# and 17.5% would have gone to Trump
trump = round(illegal_voters * 0.175)
# Number of popular votes for Clinton minus non-citizens
adj_popular_vote_margin = popular_vote_margin - (clinton - trump)
# The proportion of non-citizens needed to vote to have lost the election
# as a result of illegal voting activity (assuming that 90% of non-citizens
# vote for Clinton and keeping the third party vote constant).
non_citizen_proportion_needed = popular_vote_margin/(total_non_citizens * (0.9 - 0.093))

With this new information taken into account, the most likely result would then be that Trump would have still lost the popular vote by 1.9 million votes.

In fact, even if we give Trump the benefit of the doubt and assume that an implausibly high proportion of non-citizens voted for Clinton (let's say 90%), Trump would then need 15.4% of all non-citizens in the U.S. to vote in order to lose the election solely due to illegal voting activity. That's a far cry from the 11.3% reported as the high end in the original paper, and even adjusting that number for the margin of error, the number Trump needs still falls outside of the possible range (7.9 - 14.7%).
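
How sensitive that threshold is to the assumed vote split can be sketched with a small helper. The function name and its parameters are illustrative and are not taken from Richman's paper. It simply restates the calculation from the cell above, holding the third party share fixed at the remainder implied by the 81.8% and 17.5% figures.

```python
POPULAR_VOTE_MARGIN = 2864974   # Clinton's margin over Trump, from above
TOTAL_NON_CITIZENS = 22984400   # non-citizens living in the U.S., from above
# Assumption: third party share is whatever is left of the 81.8% / 17.5% split
THIRD_PARTY_SHARE = 1 - 0.818 - 0.175

def turnout_needed(clinton_share, third_party_share=THIRD_PARTY_SHARE):
    """Non-citizen turnout needed for illegal votes alone to cover the margin."""
    trump_share = 1 - clinton_share - third_party_share
    # Each non-citizen voter only narrows the gap by Clinton's net advantage
    net_clinton_share = clinton_share - trump_share
    return POPULAR_VOTE_MARGIN / (TOTAL_NON_CITIZENS * net_clinton_share)

for share in (0.818, 0.9):
    print('Clinton share {0:.1%}: turnout needed {1:.1%}'.format(
        share, turnout_needed(share)))
```

Under these assumptions, the less lopsided 81.8% split pushes the required turnout even higher, to roughly 19%.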

Bottom line, even if the paper's results are to be believed, the math just doesn't add up. And, as we've just seen, the results of the paper suggest that the possibility of Trump winning the popular vote from illegal votes alone is so slim as to have little to no grounding in reality. Of course, this assumes that the results of the paper can be believed. However, according to another peer-reviewed paper in the same journal (the authors of which are the PI and co-PI of the studies used in the Richman paper), that may not be the case.

Respondent Error

The main argument of "The Perils of Cherry Picking Low Frequency Events in Large Sample Surveys" is that, due to a unique characteristic of these large sample surveys (which we will cover shortly), the findings of Richman et al. can be completely explained by simple respondent error.

Respondent error, according to Wikipedia, "refers to any error introduced into the survey results due to respondents providing untrue or incorrect information" and simply put, it's a natural part of any survey. It could be that the participant didn't understand the question. Or, perhaps they haven't had that first cup of coffee, and in a state of caffeine-deprived grogginess, mistakenly checked the wrong box. The truth of the matter is that errors such as these happen, and the larger the survey the more chances for such errors to occur.

Now, I know that it may seem a little far-fetched to say that we can go from millions of illegal voters handing an illegitimate win to Clinton, to basically saying "there's nothing to see here folks" just because a handful of people checked the wrong box, but this may prove to be the case.

In the remaining sections I work through each of the main points made by Ansolabehere et al. in their short, but effective rebuttal.

A Real World Example

The authors of the paper begin by stepping through an example that illustrates how inferring population behavior from a relatively low-frequency event in a large sample can lead to misinterpretations. In this article, however, I'm going to bypass the fictional example and just jump straight into the real data.

The dataset that had the most impact with respect to the findings in Richman's paper was the 2008 presidential survey data. This dataset contains observations from 32,800 participants, of which 339 (1%) were labeled as non-citizens. Making the assumption that 99.9% of the time (a pretty high rate of success) the respondents answered the citizenship question correctly, we would expect that roughly 33 individuals (32,800 * (1 - 0.999)) would have incorrectly classified themselves as non-citizens. These incorrectly labeled individuals would then account for about 10% (33/339) of the non-citizen sample. If you then make the additional assumption that citizens vote with a much higher regularity than non-citizens (or, perhaps that non-citizens do not vote at all), the result is that the incorrectly labeled participants would be nearly completely responsible for any observed non-citizen voting behavior.

As an example, if we assume that 70% of citizens voted and all non-citizens abstained, then the behavior of the incorrectly labeled citizens alone would give the appearance that nearly 7% ((33 * 0.7)/339) of non-citizens voted illegally in the presidential election. If we were to take this one step further and apply this number to the total number of non-citizens in the US in 2016 (23 million), it would appear that roughly 1.6 million (7% of 23M non-citizens) illegal votes could have been cast in the most recent election. 1.6 million votes is quite a bit larger than the number of illegal votes estimated by Richman (800k), so it would appear that the conclusions reached in the original paper, if our assumptions above hold true, could be just a side effect of expected errors in the data.
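
The arithmetic in the last two paragraphs can be restated as a short sketch. The variable names are mine, and the accuracy and turnout figures are the assumptions stated in the text, not measured values.

```python
survey_size = 32800          # respondents in the 2008 survey
non_citizen_sample = 339     # respondents labeled as non-citizens
accuracy = 0.999             # assumed rate of correct citizenship answers
citizen_turnout = 0.70       # assumed turnout among citizens
us_non_citizens = 23000000   # approximate non-citizen population in 2016

# Citizens expected to check the non-citizen box by mistake
mislabeled = round(survey_size * (1 - accuracy))
# Their share of the non-citizen sample
mislabeled_share = mislabeled / non_citizen_sample
# Apparent non-citizen turnout if ONLY the mislabeled citizens voted
apparent_rate = mislabeled * citizen_turnout / non_citizen_sample
# That rate extrapolated to the whole non-citizen population
apparent_illegal_votes = apparent_rate * us_non_citizens

print('Mislabeled citizens: {0}'.format(mislabeled))
print('Share of non-citizen sample: {0:.1%}'.format(mislabeled_share))
print('Apparent non-citizen voter rate: {0:.1%}'.format(apparent_rate))
print('Apparent illegal votes: {0:,.0f}'.format(apparent_illegal_votes))
```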

The question then is, "just how accurate are our assumptions?"

Well, according to Ansolabehere et al., the rate of respondent error and the citizen voter rate that we posited above (0.1% and 70% respectively) match the rates observed in a separate panel survey that was conducted during the 2010 and 2012 elections. However, rather than just taking this assertion at face value, we can go a step further and actually download the panel data and take a look for ourselves.

Step 1: Download, load, and clean the data.

I had a little trouble at first finding the panel data that was used in the paper. The yearly survey data can be easily found from the CCES homepage, but you have to dig a bit deeper to find the panel data. Luckily, the fine folks at Harvard who conduct the CCES surveys were quick to respond with a link to the panel data used in the paper.

Several formats of the panel data are provided via the download button on the website linked above, but since the original format of the dataset is a Stata file, and pandas has the handy read_stata function, I decided to go with that since it has the nice side effect of making sure that all of the attributes in the dataset have the correct type without any extra work on my part.

Note: If you're following along on your own machine and wish to run the code below, you'll first need to follow the link above to download the panel data before you can execute the next bit of code.

In [5]:
import pandas as pd

cces_panel = pd.read_stata('CCES1012PaneV2.dta')

The next problem you may run into is finding the right attributes in the data to analyze.

We're specifically trying to figure out the percentage of respondents that could have mislabeled themselves with respect to their citizenship, and so we'll need to grab the columns that represent the citizenship question from both years. Unfortunately, as of this writing, the guide that comes with the panel data refers to the columns incorrectly. After a quick email exchange with the PI of the survey, I was able to piece together which features to use. As it turns out, the names of the features we need actually match the names from the original 2010 and 2012 non-panel surveys: V263 and immstat respectively.

Now that we have the correct columns for our analysis, we can clean up the data a bit by removing any participants from the dataset that did not answer the citizenship question in both years.

In [6]:
# Drop respondents who skipped the citizenship question in either year
cces_panel.dropna(subset=['V263', 'immstat'], inplace=True)

With the data loaded and properly scrubbed, we're ready to calculate the respondent error rate for the survey.

Step 2: Calculate the respondent error rate

We'll start our calculation by first calculating the contingency table for the survey year and citizenship features. We could do this by using the crosstab function that the pandas library provides, but we'll do it by hand instead so we can use the totals for further calculations later. The following code creates a subset of the data for each of the four possible groups of participants:

  • cc - Individuals who marked themselves as citizens in both years
  • cnc - Individuals who marked themselves as citizens in 2010 and non-citizens in 2012
  • ncc - Individuals who marked themselves as non-citizens in 2010 and citizens in 2012
  • ncnc - Individuals who marked themselves as non-citizens in both years
In [7]:
# Citizen in 2010 and Citizen in 2012
cc = cces_panel[(cces_panel.V263 != 'Immigrant non-citizen') & 
                (cces_panel.immstat != 'Immigrant non-citizen')]
# Citizen in 2010 and Non-citizen in 2012
cnc = cces_panel[(cces_panel.V263 != 'Immigrant non-citizen') & 
                 (cces_panel.immstat == 'Immigrant non-citizen')]
# Non-citizen in 2010 and Citizen in 2012
ncc = cces_panel[(cces_panel.V263 == 'Immigrant non-citizen') & 
                 (cces_panel.immstat != 'Immigrant non-citizen')]
# Non-citizen in 2010 and Non-citizen in 2012
ncnc = cces_panel[(cces_panel.V263 == 'Immigrant non-citizen') &                  
                  (cces_panel.immstat == 'Immigrant non-citizen')]

Once we have the individual groups created, we simply need to calculate each group's proportion of the entire sample population to produce our contingency table.

In [8]:
# Each group's share of the full panel, as a percentage
cc_percent = round(len(cc)/len(cces_panel) * 100, 2)
cnc_percent = round(len(cnc)/len(cces_panel) * 100, 2)
ncc_percent = round(len(ncc)/len(cces_panel) * 100, 2)
ncnc_percent = round(len(ncnc)/len(cces_panel) * 100, 2)
Response in 2010  Response in 2012  Number of Respondents  Percentage
Citizen           Citizen           18737                  99.25
Citizen           Non-citizen       20                     0.11
Non-citizen       Citizen           36                     0.19
Non-citizen       Non-citizen       85                     0.45
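
For comparison, here is a rough sketch of the crosstab route mentioned earlier. Since the real panel file isn't bundled with this article, the sketch builds a toy frame whose rows simply reproduce the counts in the table above. The 'Citizen' label is a stand-in I've made up for every answer other than 'Immigrant non-citizen', so treat this as an illustration rather than the analysis actually run on the CCES file.

```python
import pandas as pd

NON_CITIZEN = 'Immigrant non-citizen'
CITIZEN = 'Citizen'  # hypothetical stand-in for any other answer

# Toy data: (2010 answer, 2012 answer) pairs matching the table above
counts = {
    (CITIZEN, CITIZEN): 18737,
    (CITIZEN, NON_CITIZEN): 20,
    (NON_CITIZEN, CITIZEN): 36,
    (NON_CITIZEN, NON_CITIZEN): 85,
}
rows = [pair for pair, n in counts.items() for _ in range(n)]
toy_panel = pd.DataFrame(rows, columns=['V263', 'immstat'])

# Boolean flags: did the respondent mark themselves a non-citizen?
nc_2010 = (toy_panel.V263 == NON_CITIZEN).rename('Non-citizen in 2010')
nc_2012 = (toy_panel.immstat == NON_CITIZEN).rename('Non-citizen in 2012')

# Raw counts, then each cell as a percentage of all respondents
table_counts = pd.crosstab(nc_2010, nc_2012)
table_percent = (pd.crosstab(nc_2010, nc_2012, normalize='all') * 100).round(2)

print(table_counts)
print(table_percent)
```

The trade-off is that crosstab gives the whole table in one call, while the hand-built subsets keep each group available as its own DataFrame for the later steps.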

Using the contingency table above we can estimate the respondent error rate (conversely, the success rate) of the survey participants with respect to correctly reporting their citizenship status.

We can see that the overwhelming majority of participants gave matching answers in consecutive years: 99.25% labeled themselves as citizens in both years, while 0.45% identified themselves as non-citizens. It's probably pretty safe to assume that these individuals filled in this portion of the survey correctly, so we'll add them together to get an initial success rate of 99.7%. Of the two remaining groups, those who marked themselves as citizens in 2010 and non-citizens in 2012 are likely to have responded incorrectly since it is highly unlikely that someone would lose their citizenship in the two years between elections. On the other hand, it's very possible that a handful of non-citizens surveyed in 2010 would be granted citizenship in the intervening years, so we'll add that group to our success rate as well, which will bring the total to 99.9%. Conversely, we can say that our respondent error rate is equal to 0.1%, which perfectly matches the number we used in our example above.
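
The same reasoning, worked directly from the respondent counts in the table, looks something like the sketch below. The variable names are mine. Treating only the citizen-to-non-citizen group as errors is the assumption described above, not something the data can confirm.

```python
# Respondent counts from the contingency table above
n_cc, n_cnc, n_ncc, n_ncnc = 18737, 20, 36, 85
total = n_cc + n_cnc + n_ncc + n_ncnc

# Same answer in both years
consistent_rate = (n_cc + n_ncnc) / total
# Plus non-citizens who plausibly naturalized between the two surveys
success_rate = (n_cc + n_ncnc + n_ncc) / total
# Citizens in 2010 who reported being non-citizens in 2012 (assumed errors)
error_rate_estimate = n_cnc / total

print('Consistent answers: {0:.1%}'.format(consistent_rate))
print('Success rate:       {0:.1%}'.format(success_rate))
print('Error rate:         {0:.1%}'.format(error_rate_estimate))
```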

Step 3: Calculate the citizen voter rate

The other number in question was the citizen voter rate of approximately 70%. Let's dig into the data once more to see if the number we used in the example above matches the real world.

Again, it took a little poking around to find the correct feature to analyze, but if we take a look at the VV_2010_gen field, we can see whether or not the participant voted as well as the method by which they did. To determine the number of people who did vote in the 2010 election, we will need to know the list of possible values that each record can take on for that field. The VV_2010_gen field is a category, so we can get a look at its list of possible values through the cc.VV_2010_gen.cat.categories property.

In [9]:
cc.VV_2010_gen.cat.categories
Index(['Absentee', 'Early', 'Mail', 'Voted unknown method',
       'Confirmed Non-voter', 'Unmatched'],
      dtype='object')
Of the possible values listed above, only the 'Absentee', 'Early', 'Mail', and 'Voted unknown method' options pertain to people who voted in the 2010 election. Therefore, we can calculate the percentage of voters by simply counting the number of records with one of those values and dividing that by the total number of survey participants.

In [10]:
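# Values of VV_2010_gen that indicate a validated vote
voted = ['Absentee', 'Early', 'Mail', 'Voted unknown method']
# Share of consistent citizens with a validated 2010 vote
citizen_voter_rate = sum(cc.VV_2010_gen.isin(voted))/len(cc)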

We can then see that the voter rate among individuals registered as citizens in both surveys was 71.0%, which is roughly equivalent to the 70% number we used in the example above. Couple that with the 0.1% respondent error rate that we calculated previously, and it looks like the example we gave does in fact match reality.

In short, it does look like the results of the Richman paper could be completely attributed to response error and thus discarded as proof of illegal voting activity swaying the popular vote in favor of Clinton.

Margin of Error (or, why is this only a problem with large datasets?)

As I mentioned earlier, drawing conclusions from relatively small samples pulled from larger sample surveys is only a problem when the original survey data is quite large, such as the 2008 and 2010 CCES surveys used by Richman et al. to come up with their estimates of possible illegal voting activity. The question then is, "why does this problem only turn up in large sample surveys?"

The answer to that question would be the margin of error.

A margin of error, according to Wikipedia, is "a statistic that expresses the amount of random sampling error in a survey's results." Essentially, it is a number that can be applied to a poll's reported results to calculate a range within which we have a given probabilistic chance of the real value falling. In short, the larger the margin of error, the wider the range, and consequently the lower the possibility of the poll's reported results actually matching the real world figures.
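
As a quick illustration of how the margin of error depends on sample size, the sketch below reuses the same formula as the margin_of_error function defined earlier. The proportion of 0.5 and the sample sizes are arbitrary values I picked for the illustration.

```python
import math

def moe(p, n, z=1.96):
    """Margin of error for proportion p in a sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

p = 0.5  # assumed proportion, chosen only for illustration
sample_sizes = (10, 100, 1000, 10000)
margins = [moe(p, n) for n in sample_sizes]

for n, m in zip(sample_sizes, margins):
    print('n = {0:>5}: +/- {1:.1%}'.format(n, m))
```

Because the sample size sits under a square root, a hundredfold increase in respondents only shrinks the margin of error by a factor of ten.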

To make this concept a bit clearer, we can take a look at a short example.

Let's start with a small dataset, say n = 1000, and then calculate the number of citizens and non-citizens based on the breakdown we saw in the original 2008 CCES survey.

In [11]:
n = 1000
non_citizens_proportion = 339/32800
non_citizens = round(n * non_citizens_proportion)
citizens = n - non_citizens

Using the same breakdown that we saw in the 2008 survey, we end up with a sample of 990 citizens and 10 non-citizens.

Next, we'll use the same error rate that we calculated earlier to determine the number of citizens that would incorrectly identify themselves as non-citizens.

In [12]:
error_rate = 0.001
incorrectly_labeled_citizens = round(citizens * error_rate)

With a respondent error rate of 0.001, we would expect that 1 citizen would incorrectly identify themselves as a non-citizen. Given a sample size of 10, we would expect that one incorrectly labeled citizen to have a high impact on any numbers extrapolated from the sample since they would account for 10.0% of the entire non-citizen sample. We'll see just how big an impact that person has by calculating the expected illegal voter rate, and consequently the number of illegal votes, based solely on respondent error alone.

In [13]:
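# Apparent non-citizen voter rate produced by mislabeled citizens alone
illegal_voter_rate = incorrectly_labeled_citizens/non_citizens * citizen_voter_rate
# Scale that rate up to the full U.S. non-citizen population
illegal_votes = total_non_citizens * illegal_voter_rate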

Making the assumption that only the citizens that were incorrectly labeled as non-citizens voted, and that they voted at the normal citizen voter rate of 71.0%, we were able to calculate the number of illegal votes we would expect to see based on respondent error and nothing more. Doing so gives us an estimate of around 1.6 million illegal votes, which is exactly what we got in our example above.

If we then calculate the margin of error for our small sample of 10 non-citizens, we can calculate the confidence interval, or the range within which the actual number of illegal votes is likely to be found.

In [14]:
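# Margin of error for the 10 non-citizens in the small survey
moe_small_survey = margin_of_error(illegal_voter_rate, non_citizens)
# 95% confidence interval, expressed as a number of illegal votes
ci_small_survey_low = round((illegal_voter_rate - moe_small_survey) * total_non_citizens)
ci_small_survey_high = round((illegal_voter_rate + moe_small_survey) * total_non_citizens)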

The small sample size gives us a relatively large margin of error of ±15.9%. Using this margin of error to calculate a confidence interval, we get a range of between -2.0 and 5.3 million illegal votes cast. This is a huge range that dwarfs our initial estimate of 1.6 million votes and essentially renders our estimate useless. Thus, it's easy to see how extrapolating population behavior from a low-frequency event in a small sample typically would not happen simply because the margin of error is too great.

On the other hand, if we calculate the margin of error and confidence interval again using the larger sample size of 339 non-citizens that we saw in the 2008 survey, we get a much more acceptable margin of error.

In [15]:
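# Margin of error for the 339 non-citizens in the 2008 survey
moe_large_survey = margin_of_error(illegal_voter_rate, 339)
# 95% confidence interval, expressed as a number of illegal votes
ci_large_survey_low = round((illegal_voter_rate - moe_large_survey) * total_non_citizens)
ci_large_survey_high = round((illegal_voter_rate + moe_large_survey) * total_non_citizens)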

The larger sample size gives us a much smaller margin of error of ±2.7%. Using this margin of error to calculate the new confidence interval, we get a range of between 1.0 and 2.3 million illegal votes cast. This is a much more reasonable range and it's easy to see how a researcher could be lulled into believing that they've just uncovered a significant finding in the data. However, as we've already seen, the danger with using findings from these low-frequency events to extrapolate behavior in the larger population is that these findings can be attributed solely to response error.
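
One way to see where the tipping point lies is to look for the smallest non-citizen subsample at which the 95% interval no longer includes zero. The sketch below is my own illustration, not something taken from either paper. It assumes the 7.1% apparent voter rate from the example above (1/10 * 0.71) and the same margin of error formula used throughout.

```python
import math

RATE = 0.071  # apparent illegal voter rate from respondent error alone

def moe_at(p, n, z=1.96):
    """Margin of error for proportion p in a sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

# Grow the subsample until the lower bound of the interval rises above zero
n = 1
while RATE - moe_at(RATE, n) <= 0:
    n += 1
smallest_n = n

print('Interval excludes zero from n = {0}'.format(smallest_n))
```

Under these assumptions, the interval stops including zero at a subsample of about 51 non-citizens. The 2008 survey's 339 is comfortably past that point, while the 10 in our small example is nowhere near it.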


To recap, Trump has claimed, on several occasions, that not only did he win the electoral college vote, he also won the popular vote, if you discount all of the votes Hillary received from non-citizens residing within our borders. To the best of my knowledge, there are three main resources to which Trump has alluded that seem to support this claim.

The first is a bit of analysis tied to the VoteStand application that declared that close to 3 million non-citizens voted on behalf of Hillary in this past election. To date, however, this analysis has not been seen by anyone outside of the organization, and as a result it can be easily discarded as proof.

The second piece of evidence is a Pew Research study that investigated issues with the current voter registration system and found evidence of millions of invalid voter registrations. Though the numbers in this study sound convincing, the author has since stated that no actual evidence of voter fraud was ever uncovered. Therefore, as an interesting look at some of the issues inherent in our voter registration system, this study is fantastic, but as proof of illegal voting activity, unfortunately it falls short.

Finally, the most interesting resource was a peer-reviewed paper from several political scientists at Old Dominion University. In this paper, the authors used data from two different large sample surveys to extrapolate the number of possible illegal voters to be in the millions. The conclusions of this paper, however, have been disputed by the original authors of the surveys as being possibly a result of response error in the survey data. In addition, even if the results of the paper are to be believed, the lead author of the paper has subsequently stated that the expected numbers would still fall well below what is needed for a Trump win of the popular vote.

In short, there is currently no evidence that supports Trump's claim that illegal voting activity cost him the popular vote. So, I'm sorry Mr. President, but it looks like you're just wrong on this one.