The massive hype over "big data" during the past several election cycles, not to mention in the world at large, is finally being revealed as nonsense. Nassim Taleb has been the only major voice calling the whole approach bullshit, although he hasn't focused so much on the Trump phenomenon.
This election shows the fatal flaw of the big data program -- like all statistical learning programs, it has absolutely no clue what answer to give when it encounters an entirely unfamiliar environment. Maybe it'll give the right answer, and maybe it'll give a wrong answer -- whatever it says, our only rational response is to ignore it and look elsewhere, if anywhere, for answers.
Take an example: the 538 blog of poser quants tells us that, historically, the eventual Presidential nominee for a party had already done very well in opinion polls with the electorate, had amassed huge amounts of funds from donors, and/or had racked up scores of endorsements from politicians.
With Trump dominating the polls -- and media coverage -- while raising very little money from donors and receiving no endorsements from major politicians, science says he can't win. Or at least, his chances are way below Fiorina's, whom they were "bullish" on after the second GOP debate, compared to their "bearish" stance on the master.
What the spergs can't see is that Trump is unlike anything in the data-set that they've honed their intuitions on. We haven't seen anything like him since Teddy Roosevelt, but nerds generally don't appreciate history, and cannot force themselves to think back further than WWII, and typically not further than 1980 in politics. Sure enough, 538's graphs on the "history" of endorsements for candidates only go back to 1980.
Simply put, if there's no event similar to the Trump phenomenon anywhere in their history, why consult the history at all? It's like asking someone who's been trained on conjugating Spanish verbs to weigh in on how some verb is conjugated in Chinese.
This whole situation brings up one of the central topics of statistical inference -- making a prediction based on interpolation vs. extrapolation.
With interpolation, you're making a guess about an item that lies within the range of what you've already seen, even if you haven't seen that exact item before. Nobody has any major objections to making these kinds of predictions, if you've got a dense enough data-set that will reveal how things behave within that range. You're mapping out a tiny square-inch within a territory that has been extensively surveyed for a mile around it.
With extrapolation, you're making a prediction about an item that lies well outside of the range that your data-set lies in. Honest folks view extrapolation as bogus -- not that the prediction is bound to be wrong, but that there's no reason to pay any heed to a guess that has no basis or grounding in the data-set. You are now sailing into uncharted waters, and assuming that the patterns of a territory you explored earlier will continue to apply in this unexplored territory. What could go wrong with assuming that the same pattern holds true everywhere?
For example, let's say there are two variables X and Y -- I promise, even innumerate people can get this -- like you remember from graphing equations in algebra class. Suppose you have a huge data-set -- thousands of points on the graph, revealing the fine-grained shape of the relationship between the two. Sample points -- (1,2), (2,4), (3,6), (1.1, 2.2), (2.1, 4.2), (3.1, 6.2), etc., all clearly suggesting that the Y value is 2 times the X value.
But what if the points in your data-set only had positive X values? Well, it might not present an obstacle if you're asked to predict what Y value will go with an X value of 2.5 -- supposing you hadn't already been given that point, you'd guess pretty safely that it would be 5, fitting with the rest of the multitude of points around it, and that Y would be 2 times X here as well.
However, if you were thrown a curveball, like X being a negative number, say -10, you wouldn't really know what to predict for the Y value anymore. Points with a negative value for X are outside of the data-set that you're drawing an association from, so you'd have no basis for a good guess. Maybe it'll continue the pattern from the points with positive X values, and Y will be -20. Then again, maybe there's an absolute value function at work, making the magnitude the same but always giving a positive Y value, in this case Y = 20 when X = -10. Or any of an infinite number of other imaginable behaviors in this environment that you have no previous information about.
In such unfamiliar territory, your guess is as good as any. Maybe it'll turn out right, maybe wrong, but you'll have no basis in those thousands of points of "big data" for your guess. If you do guess correctly, it will only be pure dumb luck, and nobody should pay any heed to your guess in the meantime.
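To make the point concrete, here's a minimal sketch in Python (purely illustrative, with made-up models, and obviously not anything 538 actually runs): two rules that agree on every observed point but part ways the moment you leave the observed range.

```python
# Two toy models that fit the positive-X data equally well,
# yet disagree completely once we step outside the observed range.

xs = [1, 2, 3, 1.1, 2.1, 3.1]        # observed X values, all positive
ys = [2 * x for x in xs]             # observed Y values: Y = 2X throughout

def model_a(x):
    return 2 * x                     # plain "Y is twice X"

def model_b(x):
    return 2 * abs(x)                # "Y is twice the magnitude of X"

# Both models reproduce every point in the data-set exactly.
assert all(model_a(x) == y for x, y in zip(xs, ys))
assert all(model_b(x) == y for x, y in zip(xs, ys))

# Interpolation: inside the observed range they agree, so the guess is safe.
print(model_a(2.5), model_b(2.5))    # 5.0 5.0

# Extrapolation: at X = -10 the data cannot choose between them.
print(model_a(-10), model_b(-10))    # -20 20
```

An infinite family of other models would pass the same test on the positive-X data, which is the whole problem.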
How does extrapolation confuse people in an election like this one, with a never-before-seen candidate like Trump?
Lazy people have likened Trump to Perot, particularly if he decides to run on a third-party ticket. They try to analogize from the Perot phenomenon and conclude that Trump has little chance of winning the GOP nomination, and would crash and burn as a third-party candidate.
But Perot had zippo in the polls, let alone dominated the GOP polls by double digits for more or less the entire time, climbing more or less steadily all the while. He wasn't given wall-to-wall media coverage, and did not consistently draw crowds in the thousands and even tens of thousands. And he was a complete unknown before the election, while Trump has instant brand recognition. Not to mention their policy differences, with Trump being a broad populist and Perot focusing narrowly on NAFTA and trade agreements.
Since Trump's situation is radically different from Perot's, the earlier example of Perot predicts nothing about Trump today.
Slightly less lazy comparisons to George Wallace also don't hold up. When he sought the Democratic nomination in 1964, his appeal was largely regional (the Deep South), whereas Trump draws huge enthusiastic crowds in the Midwest, Plains, Deep South, Appalachia, New England, the Southwest -- everywhere, really. And Wallace did not consistently dominate opinion polls. In 1968, he ran as a third-party candidate, but not after first dominating polls, coverage, and crowds while running within one of the two main parties. In 1972, he was nearly assassinated and his campaign ground to a halt. So far (knock on wood), no analogy can be drawn from Wallace's several campaigns to Trump's.
There has quite simply never been a candidate who so dominated a major party's polls, the media coverage, and crowd attendance throughout the second half of the year leading up to the primaries -- yet who was so loathed by that party's leadership, its elected officials, and his fellow candidates, not to mention the other major party, all of whom have launched an all-out mission to take him out.
Therefore, we have no idea whatsoever how the whole thing will unfold. Will the leadership bite the bullet and let him win, or try to sabotage him with attack ads? If that doesn't succeed, will they rig the primaries? If not, will they rig or buy off those at the Convention in the summer? Will they team up with Hillary to keep Trump from re-directing the Republican party? Or help to rig the general election? Or try to assassinate him?
We have no "big data" to draw on that would illuminate our current state of uncertainty. There just hasn't been anything like this before -- certainly, not an earlier example that also has tons of data to learn from. Our hunches may turn out to be right or wrong, but they will not be so on account of "what the data tell us". In an entirely unfamiliar setting, the data tell us nothing.
Speaking of language, this is why computers cannot learn human languages to the degree that we do. They do poorly with irregular forms, such as irregular verbs and irregular plurals. The statistical learning algorithms look for patterns between the present and past tense forms of a verb, across thousands of verbs. It's not hard to learn the pattern for regular verbs -- stick "-ed" on the end. Some irregular verbs fall into families with similarities, but the families aren't hard-and-fast, and some verbs are sui generis.
Train the computer on verbs like "drink / drank / drunk" and it can correctly guess that "sing" goes "sing / sang / sung".
But ask it about the incredibly common verb "hit" -- it'll try to apply some variation to the root form, maybe "hit / hat / hut", or guess that it's regular "hit / hitted / hitted". All its guesses will be wrong since the forms are all the same, "hit / hit / hit". After training on "tooth / teeth," it won't be able to guess that it's "foot / feet," since the sound similarities between "tooth" and "foot" only held in an earlier stage of English (when the vowel was a long "oo"), and today they just have to be memorized individually.
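Here's a rough sketch in Python of the kind of learner being described (a toy of the general idea, not how any real system works; I've thrown in "ring / rang" so the toy has something to analogize "sing" from): it memorizes suffix-rewrite rules from present/past training pairs and, for a new verb, applies whichever rule matches the longest ending.

```python
# A toy analogical learner for English past tenses. It extracts a
# suffix-rewrite rule from each training pair and, for a new verb,
# applies the rule whose suffix matches the longest ending of the verb.

def suffix_rule(present, past):
    """Return (old_suffix, new_suffix), e.g. ("ink", "ank") from drink/drank."""
    i = 0
    while i < min(len(present), len(past)) and present[i] == past[i]:
        i += 1
    return present[i:], past[i:]

def train(pairs):
    rules = {}
    for present, past in pairs:
        old, new = suffix_rule(present, past)
        rules[old] = new                  # e.g. "" -> "ed", "ing" -> "ang"
    return rules

def predict(verb, rules):
    # Longest matching suffix wins; the empty suffix ("just add -ed") is the catch-all.
    for old in sorted(rules, key=len, reverse=True):
        if verb.endswith(old):
            return verb[: len(verb) - len(old)] + rules[old]

training = [("walk", "walked"), ("jump", "jumped"),   # regulars: "" -> "ed"
            ("drink", "drank"), ("ring", "rang")]     # one irregular family
rules = train(training)

print(predict("sing", rules))   # "sang": the analogy to ring/rang pays off
print(predict("hit", rules))    # "hitted": nothing in training covers a verb that doesn't change
</code>
```

Nothing in the training pairs tells it that some very frequent verbs keep the same form in every slot, so it falls back on the regular pattern and gets "hit" wrong.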
These failures of machine learning apply very generally, and are the central weakness in connectionist and neural network approaches to modeling human language and cognition more broadly. They are good at abstracting associations within the data-set that they've been trained on, and can make good guesses about the properties of a new item if it resembles an item they've already seen. But if the new item is unlike anything in the training data-set, the guesses go all over the place and are all equally worthless.
Big data cannot think outside the box.