As Donald Trump prepares to enter the White House, has our reliance on big data been exposed?
On the eve of the US election, the Huffington Post presidential forecast gave Hillary Clinton a 98.2 percent chance of winning the race to the White House.
The Huffington Post simulated the election 10 million times using state-by-state averages. In other words, the site averaged recent polls from each state and followed the trends up to the day of the election in order to find the most likely outcome.
In roughly 9.8 million of those simulations, Hillary Clinton won the 270 electoral votes she needed to enter the White House. As a result, HuffPost predicted that Clinton had an incredible 98% chance of becoming US President. According to a piece on the eve of the election, “Republican Donald Trump has essentially no path to an Electoral College victory.”
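To make the idea of "simulating the election" concrete, here is a minimal sketch in Python. It is not the Huffington Post's model: the five-region map, the polling leads and the error sizes are all invented for illustration, and the function name `win_probability` is hypothetical. Each simulated election takes the polling lead in every region, adds random polling error, awards the electoral votes to whoever comes out ahead, and checks whether candidate A reaches 270.

```python
import random

# Hypothetical toy map: (electoral votes, candidate A's polling lead in points).
# These figures are invented for illustration; they are not real 2016 averages.
STATES = {
    "Safe A": (200, 12.0),
    "Safe B": (180, -12.0),
    "Lean A 1": (60, 3.0),
    "Lean A 2": (50, 2.5),
    "Lean A 3": (48, 2.0),
}
VOTES_TO_WIN = 270


def win_probability(n_sims, state_sd, shared_sd, seed=0):
    """Fraction of simulated elections in which candidate A reaches 270."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_sims):
        # One polling error shared by every state in this simulated election
        # (an assumption standing in for "the polls were all off the same way").
        shared_error = rng.gauss(0.0, shared_sd)
        votes = 0
        for electoral_votes, lead in STATES.values():
            # Plus an independent error specific to each state.
            simulated_lead = lead + shared_error + rng.gauss(0.0, state_sd)
            if simulated_lead > 0:
                votes += electoral_votes
        if votes >= VOTES_TO_WIN:
            wins += 1
    return wins / n_sims


# Same polls, two different assumptions about how polling errors behave.
independent = win_probability(20_000, state_sd=2.0, shared_sd=0.0)
correlated = win_probability(20_000, state_sd=2.0, shared_sd=3.0)
print(f"State errors independent: {independent:.1%}")
print(f"With a shared error too:  {correlated:.1%}")
```

In this toy setup the first figure comes out noticeably higher than the second, even though the polling leads are identical. The headline probability depends heavily on what the forecaster assumes about polling error, not just on the polls themselves.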
24 hours later, Trump was President in waiting. He takes office on January 20 after losing the popular vote but winning the Electoral College. So how did the pollsters get it so wrong? And in a year of mammoth upsets – of Trump, Brexit and Jamie Vardy havin’ a party – was 2016 the year we lost faith in data?
Let’s go back to basics. Accurately predicting any outcome inherently boils down to the following factors:
1. Correct data
Both in the US election and the EU referendum polls, whole swathes of the population seemed to have slipped through the net as voters leaned to the right on both sides of the pond.
In this respect, questions need to be asked of the analysts, not the process. Did pollsters question where they got their data? Did it accurately represent the US population? And did they look hard enough for the truth rather than the convenient outcome – i.e. a Clinton landslide?
Furthermore, did those who were polled actually vote? Were they floating voters? What was the sample size? Did Russian hacking, and Clinton’s run-in with the FBI, manipulate the results in any way?
Here’s another thought. Given the media’s coverage of the election, and the ridiculously personal war of words between Clinton and Trump, were the general public worried about sharing their voting preferences?
“It’s like the Nielsen ratings. When people would write down what shows they watched, they always ended up watching documentaries on PBS, when in reality they were watching The Simpsons,” said Bennett Borden in an interview with InformationWeek.
2. A history of similar events and trends
US presidential elections have been predicted with (relative) accuracy since Franklin D. Roosevelt’s victory in 1936 – so what went wrong in 2016? The Clinton/Trump election was a political anomaly for so many reasons.
“There was no benchmark for the election of firsts; no real trends to follow other than a wave of negative press for both camps,” said Ashley Bonda, senior manager at big data recruiters Churchill Frank.
“The first woman on the ticket, a celebrity president with zero political experience, accusations on both sides of the fence, record-breaking ratings, and two of the most divisive presidential candidates in recent memory. It’s no surprise that the pundits got it wrong,” he added.
3. Impartiality from data scientists
The waters are muddied further when human behaviour comes into play. Let’s look at the Huffington Post forecast again. Disregarding the data they collected for a moment – which I’m sure was correct at the time – was this a case of confirmation bias?
Put it this way: it’s very likely that the left-leaning Huffington Post actually wanted Clinton to win the US election, so it was much less likely to question its own forecast of a Clinton landslide.
“Pollsters need to get a representative sample, estimate the likelihood of a person actually voting, make many justified and unjustified assumptions, and avoid following their conscious and unconscious biases,” said Gregory Piatetsky-Shapiro, data scientist and editor of KDnuggets.
Quite simply, there are too many variables. “If we toss 100 million fair coins, we can predict the estimated number of heads and tails quite accurately. But using polling to predict the votes of 100 million people is much more difficult,” Piatetsky-Shapiro added.
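The arithmetic behind the coin comparison can be sketched in a few lines of Python. The two formulas are the textbook ones for independent tosses and for a simple random sample; the poll size of 1,000 respondents is an assumption chosen for illustration, not a figure from any pollster mentioned here, and the function names are hypothetical.

```python
import math


def heads_std_dev(n_coins, p_heads=0.5):
    """Standard deviation of the number of heads in n independent tosses."""
    return math.sqrt(n_coins * p_heads * (1 - p_heads))


def poll_margin_of_error(sample_size, share=0.5, z=1.96):
    """Approximate 95% margin of error (as a proportion) for a simple random sample."""
    return z * math.sqrt(share * (1 - share) / sample_size)


n_coins = 100_000_000
spread = heads_std_dev(n_coins)
print(spread)                        # 5000.0 heads
print(spread / n_coins)              # 5e-05, i.e. 0.005% of all tosses

# Assumed poll of 1,000 respondents (illustrative only).
print(poll_margin_of_error(1000))    # about 0.031, i.e. roughly 3 points
```

So the coin count is predictable to within a tiny fraction of a percent, while an idealised 1,000-person poll already carries an uncertainty of about three points. That figure covers random sampling error only; it says nothing about unrepresentative samples, turnout guesses or bias, which is where Piatetsky-Shapiro locates the real difficulty.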
So, as Donald Trump prepares to enter the White House, what have we learned? Trump’s victory certainly isn’t the “death of data” that Republican strategist Mike Murphy described on MSNBC. In fact, some analysts did predict that Trump would win the race to the White House despite losing the popular vote.
Data scientists at NBC even predicted that Hillary would be on the ticket ahead of Joe Biden back in 2014 – so it’s safe to say that the polls are right more often than they’re wrong.
Even so, the polling in the 2016 US election was fundamentally flawed.
Only the analysts can say how reliable the data they polled really was; it’s clear that they didn’t reach a fair demographic of voters, and, as a result, the data was never going to be accurate. Strike one.
There was no precedent for the now infamous campaign between Clinton and Trump. Strike two.
There were clear elements of bias in how the data was collected and presented to the general public. Strike three.
The biggest question still lingering over the election is whether there’s an overreliance on the polls. Even when done correctly, polling data is rarely 100% accurate and often inconclusive. Like all political discourse, it should be approached with caution – and questioned every step of the way.