As we start moving into the deep learning and AI world, it might be a good idea to reflect on how we went from basic data collection to an information-based world.
Stored data is just stuff until you can figure out how to turn it into actionable information, and sometimes it takes years of collecting data to have enough to get to that point. Good examples of data that require long-term collection include: medical trials with new processes, medications or equipment; group behavior based on external factors that happen infrequently; and climate change.
The thing about data is you do not know what you do not know about it. A good example is “junk DNA,” a term from the 1970s and 1980s used to describe DNA that did not code for proteins and often sat between genes. By the 2000s, it was discovered that some of that “junk” DNA regulated how and when genes were expressed. Good thing people stored that data, even though storage was costly per byte at the time; the cost to sequence the DNA was higher still, which is why the data was kept. Historically this is pretty common: the cost of collecting the data was high and the cost of storing it was also high, yet those who preceded us did the right thing and kept it, and we have learned a lot from that old data.
We know that some weather forecast centers keep all the data they collect every day, including the output of their forecast models. When such a center develops a new forecast model, it runs the old data through the new model and compares the output against the archived observations to see whether the new model is better than the old one, and by how much. Doing this for one city might seem easy, but doing it for the whole planet means comparing an enormous amount of data and information.
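To make that comparison concrete, here is a minimal sketch in Python of how a re-run of archived data might be scored against observations. The metrics, sample numbers and function name are illustrative assumptions on my part; real forecast verification suites are far more elaborate than this.

```python
import numpy as np

def forecast_skill(forecasts: np.ndarray, observations: np.ndarray) -> dict:
    """Score one model's forecasts against the archived observations."""
    errors = forecasts - observations
    return {
        "bias": float(np.mean(errors)),             # systematic over/under-forecast
        "rmse": float(np.sqrt(np.mean(errors**2))), # typical error magnitude
    }

# Hypothetical re-run: the archived observations stay fixed while the
# old and new models are each scored against them.
observations = np.array([12.1, 14.3, 13.8, 15.0, 16.2])  # e.g., temperature in C
old_model    = np.array([11.0, 15.1, 12.9, 16.4, 17.0])
new_model    = np.array([12.0, 14.6, 13.5, 15.3, 16.5])

old_skill = forecast_skill(old_model, observations)
new_skill = forecast_skill(new_model, observations)
print("old model:", old_skill)
print("new model:", new_skill)
print("new model wins" if new_skill["rmse"] < old_skill["rmse"] else "old model wins")
```

Multiply that by every grid point on the planet, every forecast hour and every archived day, and the scale of the storage and comparison problem becomes clear.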
So the challenge falls to storage and data architects to preserve this data by developing an architecture that meets the need for performance, scalability and governance.
What is Information Management?
Since the dawn of data collection, the whole point has been to make sense of what is collected. Doing that analysis by hand was very time-consuming, and turning data into information was both labor-intensive and costly.
The modern age of information began with the use of Hollerith punch cards for the 1890 U.S. Census, though those cards were blank, unlike the pre-printed ones you might have seen later. The key point here is that having lots of data without tools to analyze it and turn it into information is costly, and before the 1890 census this work was done by hand.
Clearly the information generated in the 1890 census was very rudimentary by today’s standards. But by the standards of the 1890s it was revolutionary that people could look at the results of the census so quickly and make decisions (i.e., actionable information based on data).
Today we wouldn’t call the tabulated data from the 1890 census information. The definition of information, as opposed to mere data, should be judged by the standards of the time, and in many areas that definition is now evolving rapidly.
The size and scope of the information analysis market is expanding at an ever-increasing pace, from self-driving cars to security camera analysis to medical developments. In every industry, in every part of our lives, there is rapid change, and the rate of change is increasing. All of this is data-driven, and all the new and old data collected is being used to develop new types of actionable information. That raises many questions about the requirements that surround all of the data collected and the information developed.
What Does This Mean For You and Your Organization?
There are many requirements based on the type of information and data you have. Some might involve what is called DAR (Data at Rest) encryption, which encrypts the storage device so that, if the device is removed from the system, the data is nearly or totally impossible to access (the degree of difficulty depends on the encryption algorithm and on the size, complexity and entropy of the key or keys for the device).
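To illustrate the principle, here is a minimal Python sketch using the third-party cryptography package; the package choice and sample data are my assumptions for illustration only. In practice, DAR is typically implemented in drive firmware (self-encrypting drives) or at the volume layer, and keys live in a key manager or HSM rather than in application code.

```python
# Minimal data-at-rest illustration with the "cryptography" package
# (pip install cryptography). This is a sketch of the principle, not
# how drive-level DAR is actually implemented.
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # in production the key lives in a key manager/HSM
f = Fernet(key)

plaintext = b"patient record 1234: ..."
ciphertext = f.encrypt(plaintext)

# An attacker who pulls the drive sees only ciphertext.
print(ciphertext[:40])

# With the key, the data is recoverable; without it, it effectively is not.
print(f.decrypt(ciphertext))
```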
Understanding what is required from a governance point of view for your data, or for the resulting information, is based on things like best practices for your industry and on regulations and bodies such as the U.S. National Institute of Standards and Technology (NIST), ISO, HIPAA, the SEC and, in Europe, the GDPR. The resulting architectural or procedural changes are the kinds of things you will need to address as part of your architecture.
You or your compliance group will know best how long you might need to keep data or information, but there are many other requirements you will have to address to ensure that you meet your business objectives in the areas of performance, availability and data integrity, all of which must be maintained for the life of the data and information.
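Data integrity over the life of the data is one requirement that lends itself to a concrete example: fixity checking, in which you record a checksum for every object at ingest and re-verify periodically. The sketch below uses Python's standard-library hashlib; the manifest format and function names are assumptions of mine, and real archives use dedicated tools and store their manifests redundantly.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large archives fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def record_manifest(data_dir: Path, manifest: Path) -> None:
    """At ingest: record a checksum for every file in the directory."""
    digests = {p.name: sha256_of(p) for p in data_dir.iterdir() if p.is_file()}
    manifest.write_text(json.dumps(digests, indent=2))

def verify_manifest(data_dir: Path, manifest: Path) -> list:
    """Years later: re-hash everything and report what changed or vanished."""
    expected = json.loads(manifest.read_text())
    return [
        name for name, digest in expected.items()
        if not (data_dir / name).is_file() or sha256_of(data_dir / name) != digest
    ]
```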
Final Thoughts
Compliance is not easy, nor is it free. The cost depends on many factors, but trying to force compliance after the architecture is planned and built is always far more costly than designing for it beforehand.
It is my opinion that when defining compliance requirements, you should be looking to the future rather than the present because of the cost and challenge of shoehorning things in after the fact. That means that someone needs to be continuously studying compliance requirements in your industry, along with best practices. Data will only become more important in the future, and we need to be up to the challenge.
About the Author:
Henry Newman is CTO, Seagate Government Solutions