While on vacation last week I read an amazingly good book, The Signal and the Noise: Why So Many Predictions Fail — But Some Don't by Nate Silver.
For those of you who have not heard of Mr. Silver, here is a quote from Amazon: “Nate Silver built an innovative system for predicting baseball performance, predicted the 2008 election within a hair’s breadth, and became a national sensation as a blogger—all by the time he was thirty. The New York Times now publishes FiveThirtyEight.com, where Silver is one of the nation’s most influential political forecasters.”
Since many of Mr. Silver’s predictions revolve around politics and baseball, you might ask how this impacts the future use of data analysis. Some of the examples in the book describe how statistical analysis can be used to make predictions. For instance, between 1967 and 1997, the Super Bowl accurately predicted the performance of the economy: When a team from the old NFL won the Super Bowl, the stock market gained an average of 14 percent, but when an old AFL team won, it fell by 10 percent. Mr. Silver went on to explain that this phenomenon had a 1 in 4,700,000 chance of being a statistical anomaly. (Since 1998, this supposed predictor has been way off.)
As the book points out, though 1 in 4,700,000 sounds like very long odds, it is not that unlikely when you consider that people win Powerball. The book has lots of examples where scientific predictions turned out to be wrong, from flu virus to earthquake prediction.
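The Powerball comparison can be made concrete with a little arithmetic. The sketch below is only an illustration: the 1 in 4,700,000 figure is the one quoted above, while the numbers of candidate indicators are hypothetical values chosen for the example, and the calculation assumes the candidates are independent of each other.

```python
# Chance that at least one of many independent long-shot coincidences occurs.
# ODDS is the figure quoted in the text; the candidate counts below are
# hypothetical assumptions, not numbers from the book.

ODDS = 4_700_000
p_single = 1 / ODDS


def chance_of_at_least_one(p, attempts):
    """Probability that an event of probability p happens at least once
    in `attempts` independent tries."""
    return 1 - (1 - p) ** attempts


for candidates in (1, 1_000_000, ODDS):
    chance = chance_of_at_least_one(p_single, candidates)
    print(f"{candidates:>9,} candidate indicators: {chance:.1%} chance of a fluke")
```

A coincidence that is nearly impossible for any single indicator becomes unsurprising once enough indicators are examined, which is the same reason somebody eventually wins the lottery.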
Predictions and Big Data
So while I was on vacation last week, this got me thinking about how Silver's ideas relate to the coming world where data analysis will drive everything we will do. People are making predictions about things at an astounding rate, and the rate is only increasing.
Big data is going to require not just collecting the data, but looking at a myriad of other factors. In the NFL/AFL example I used above, the prediction makes no sense, because nothing underlying connects the Super Bowl winner to the stock market.
Correlating data is pretty easy to do, but how do you prove the correlation is valid?
That is going to take even more information and someone with domain expertise in the area. This is going to require more data to be stored, more computer power and the right kind of programs and programmers, because each data type is different.
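To see how easy it is to find a correlation that means nothing, here is a small sketch. Everything in it is made up for illustration: the "market" series and the 2,000 candidate indicators are independent random numbers, so any correlation between them is a fluke by construction. The sketch picks the indicator that best matches the first 30 observations, then checks it against 30 later observations that were held back.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rng = random.Random(42)                     # fixed seed so the run is repeatable
TRAIN, HOLDOUT, CANDIDATES = 30, 30, 2000   # hypothetical sizes


def random_series(length):
    """A series of independent standard-normal values (pure noise)."""
    return [rng.gauss(0, 1) for _ in range(length)]


market = random_series(TRAIN + HOLDOUT)
indicators = [random_series(TRAIN + HOLDOUT) for _ in range(CANDIDATES)]

# Choose the indicator that correlates best with the market on the first 30 points.
best = max(indicators, key=lambda s: abs(pearson(s[:TRAIN], market[:TRAIN])))

in_sample = pearson(best[:TRAIN], market[:TRAIN])
out_of_sample = pearson(best[TRAIN:], market[TRAIN:])

print(f"best of {CANDIDATES} random indicators, in sample: r = {in_sample:+.2f}")
print(f"same indicator on held-back data: r = {out_of_sample:+.2f}")
```

The winning indicator looks impressive on the data it was selected on and typically loses most of that correlation on the held-back data. Having older data to hold back, plus someone who can say whether a relationship makes sense in the first place, is what separates a real predictor from a lucky one.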
For example, consider what goes into weather forecasting. As forecasters develop new models, those models must be validated. And making sure that the new model is a better predictor of the weather than the old model requires lots of historical data. Usually, forecasters will run the new model with the historical data that has been collected and archived. The output of the new model is compared against the output of the older model, and both against the weather that actually occurred, to see if the new model is a statistically better predictor. This is done for hundreds to thousands of days of weather data.
It takes months and sometimes even years for this type of model verification. Since we have the actual data for the weather for each day and the output of the predictions all archived, the weather services around the world have the ability to do this verification. Weather forecasting is far more accurate today than it was even five years ago, not only because we have more computer power but because we have lots of data archived and can validate the new models.
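The verification loop described above can be sketched in a few lines. This is a minimal illustration, not how any weather service actually does it: the ten days of temperatures and both sets of forecasts are invented numbers, the function names are made up for the example, and the sign test stands in for whatever statistical comparison a forecasting center would really use.

```python
from math import comb

# Hypothetical archive: observed daily highs and what each model forecast.
observed   = [20, 22, 19, 25, 27, 24, 21, 18, 23, 26]
old_model  = [23, 20, 22, 22, 30, 21, 24, 20, 20, 29]
new_model  = [21, 23, 18, 24, 28, 25, 20, 21, 22, 27]


def mean_abs_error(forecasts, actuals):
    """Average size of the forecast miss, in the same units as the data."""
    return sum(abs(f - a) for f, a in zip(forecasts, actuals)) / len(actuals)


def sign_test(old_fc, new_fc, actuals):
    """One-sided sign test on archived days.

    Returns (wins, decided, p_value), where p_value is the chance of the new
    model beating the old one on at least `wins` of the `decided` days if the
    two models were really equally good. Tied days are ignored.
    """
    wins = losses = 0
    for old, new, actual in zip(old_fc, new_fc, actuals):
        old_err, new_err = abs(old - actual), abs(new - actual)
        if new_err < old_err:
            wins += 1
        elif new_err > old_err:
            losses += 1
    decided = wins + losses
    if decided == 0:
        return 0, 0, 1.0
    p_value = sum(comb(decided, k) for k in range(wins, decided + 1)) / 2 ** decided
    return wins, decided, p_value


old_mae = mean_abs_error(old_model, observed)
new_mae = mean_abs_error(new_model, observed)
wins, decided, p_value = sign_test(old_model, new_model, observed)

print(f"old model MAE: {old_mae:.1f}, new model MAE: {new_mae:.1f}")
print(f"new model closer on {wins} of {decided} days, p = {p_value:.4f}")
```

None of this is possible without the archive: the observations, the old forecasts and the inputs needed to rerun the new model all have to have been kept. With only ten days the result is fragile, which is why real verification runs over hundreds to thousands of days.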
This leads to a few key takeaway points:
- Weather modeling is pretty well understood, and a number of the predictors have been found. A fairly large group of people worldwide works on weather prediction. The same may not be true of other areas where predictions are made.
- Weather forecasters have years of old data from which to evaluate current and new models. They have the old predictions and the old weather inputs. The same cannot be said for other areas.
- Though you might think this is trivial, the fact that weather forecasters have archived everything from the original inputs to the model outputs for each day is a big advantage over most environments.
People in a few other fields also keep highly detailed records: for example, the seismic data that oil companies collect and the satellite imagery data from the US Geological Survey. But some of the new areas that people are trying to forecast might not have the historical data necessary to make the predictions. In some cases, that data does not exist.
The Need for Archives
So how much historical data from archives needs to be collected to make a prediction valid?
This is likely going to be very different for different uses of data. But the key point here is that you are going to need an archive of information to be able to predict what will happen in the future. Archival data is going to be needed, and at present we have a limited number of archive options.
We have the traditional HSM environments where data is stored on tape, and most of these HSMs have a file system or file system-like interface. And we have object storage, where everything is most often stored on disk and you have a REST/SOAP interface.
The current archival products for most environments make some pretty broad assumptions about usage and the usage model. Some of this is based on how the archival systems are architected, but in some cases it is more about how the HSM or object system is designed.
In my opinion, the new usage models for big data are going to require a rethink of how archives are used. We only have to look to the success of applications such as weather modeling for good examples of how archival data can and should be used.
I highly recommend that everyone read Nate Silver’s book. He did an amazing job giving lots of examples of how data should and should not be used. One of the key points in the book is that successful predictions over the long term require not just large archives but a good understanding of what is important to the predictions.
I'm predicting that archives, which were out of vogue, are going to trend back toward popularity. Maybe my powder blue leisure suit might also be ready to be dusted off. I hope it fits me.