Big Data From the Beginning - Page 2
Big Data's Beginnings
Before jumping into a list of applications that can be used for Big Data, we should take a step back and talk about how we "feed the beast." That is, how do we access the data, and how do we store it (i.e., in what file formats)? I consider this to be a crucial part of examining the "top end" of Big Data.
Getting data into the system we're using for analysis is not as easy as it seems. In many cases, the data comes from disparate sources that each have their own file format or storage system. Moreover, some data sources create data at a very fast pace. An example of this is video monitoring that streams data from cameras to storage. Today, virtually every store, gas station, Starbucks, street corner and highway has some sort of video camera. This data must be collected and stored for future analysis. You might say that much of this data is pointless, but you might be surprised.
Remember the tornadoes in spring 2012? Just a few days after they passed, videos from gas station cameras and truck parking lots started showing up on television. Making sensational videos for the nightly news was not the point of these videos. Rather, they can be used to track the path of the tornado, which is useful information for people who model tornadoes near the ground as well as for storm trackers who want to know exactly where the tornado touched down. Insurance companies want to know that precise location, too. And this information is also useful for local governments because the tornado path lets them identify homes or businesses that might have been hit. This enables them to look for people, broken gas lines, downed power lines and cars that might have been moving during the tornado. If you are interested in "Big Data" and creating information and knowledge from data, then you need systems that can quickly gather this data and write it to storage somewhere for quick analysis and retrieval.
This discussion about ingesting data from "sensors" into storage brings up a question: Do we leave the data where it is and access it remotely, or do we pull the data into a centralized system and then analyze it? There are benefits and difficulties with each approach, so you must examine the details of both carefully. Deciding where to put the data is up to you, but you have to realize that there are implications in leaving it in disparate locations.
Issues to think about include:
- Data access methods? (e.g., REST, WebDAV and NFS)
- Data permissions? (Who gets access to the data?)
- Data security
- Performance (e.g., bandwidth and latency)
- Data access priority? (local access trumps remote access?)
- Data integrity?
- Management of the data?
- Responsibility for the data?
These are what you might think of as "basic" issues because we haven't even discussed the data itself. However, in the brave new world of "cloud" everything, these are issues you will have to address.
I don't want to dive into these topics in any depth except for one -- data access methods. This is important because it deals with the design of the application, which leads to the design of the storage system and the network (among other things). There are several interfaces for accessing information on the Internet, but the one that seems to have become the latest de facto standard is called REST (Representational State Transfer). You can use a REST interface to pull data from a remote server to your local server if you wish to do that. But you can also use REST as an interface to perform some function on the remote server on behalf of the client, which in this case is your server. Defining the exact details of the interface and how it works is one of THE critical steps in creating a reasonably useful environment for accessing data from disparate resources.
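To make the two uses of REST concrete, here is a minimal Python sketch using only the standard library. Everything specific in it is an assumption made for illustration: the server `data.example.com`, the `cameras/.../frames` and `jobs` resources, the query parameters and the JSON payload are all hypothetical, and a real interface would define its own. The sketch only builds the requests; `send` shows how one might be dispatched.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://data.example.com/api"  # hypothetical remote data server


def build_request(resource, params=None, method="GET", payload=None):
    """Build (but do not send) an HTTP request for a REST resource."""
    url = f"{BASE_URL}/{resource.lstrip('/')}"
    if params:
        url += "?" + urllib.parse.urlencode(params)
    body = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(url, data=body, method=method)
    req.add_header("Accept", "application/json")
    if body is not None:
        req.add_header("Content-Type", "application/json")
    return req


def send(req, timeout=10):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Use 1: pull data from the remote server to the local one.
# The resource name and parameters are hypothetical.
pull = build_request("cameras/42/frames",
                     params={"start": "2012-04-16", "limit": 100})

# Use 2: ask the remote server to perform a function on our behalf,
# so that only the result crosses the network. Also hypothetical.
job = build_request("jobs", method="POST",
                    payload={"task": "count_frames", "camera": 42})
```

The design choice here is the one the paragraph describes: the first request moves the data to the analysis, while the second moves the analysis to the data.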
A companion to a REST interface is something called WebDAV (Web Distributed Authoring and Versioning). WebDAV is a child of the Web 2.0 movement. It allows users to collaboratively edit and manage files on disparate web servers. You can write applications using the WebDAV interface to access data on different servers. But one of the coolest features of WebDAV is that it maintains properties of the files, such as the author and modification date, and it also provides namespace management, collections and overwrite protection, all of which are very useful in Big Data.
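As a rough illustration of how an application might read those file properties, the Python sketch below builds a WebDAV `PROPFIND` request and parses a Multi-Status reply. `PROPFIND`, the `Depth` header, the `DAV:` namespace and the `displayname` and `getlastmodified` properties come from the WebDAV specification; the server URL, the file name and the sample reply are hypothetical and hand-written for this example.

```python
import urllib.request
import xml.etree.ElementTree as ET

DAV = "{DAV:}"  # ElementTree notation for the WebDAV XML namespace

# Request body asking for two standard WebDAV properties.
PROPFIND_BODY = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<D:propfind xmlns:D="DAV:"><D:prop>'
    "<D:displayname/><D:getlastmodified/>"
    "</D:prop></D:propfind>"
).encode("utf-8")


def build_propfind(url, depth="1"):
    """Build (but do not send) a PROPFIND request for a collection."""
    req = urllib.request.Request(url, data=PROPFIND_BODY, method="PROPFIND")
    req.add_header("Depth", depth)  # "1" = the collection and its children
    req.add_header("Content-Type", "application/xml; charset=utf-8")
    return req


def parse_multistatus(xml_text):
    """Return {href: {property_name: value}} from a Multi-Status body."""
    result = {}
    root = ET.fromstring(xml_text)
    for response in root.iter(f"{DAV}response"):
        href = response.findtext(f"{DAV}href")
        props = {}
        for prop in response.iter(f"{DAV}prop"):
            for child in prop:
                props[child.tag.replace(DAV, "")] = child.text
        result[href] = props
    return result


# Hand-written sample reply (hypothetical file and timestamp).
SAMPLE = """<D:multistatus xmlns:D="DAV:">
  <D:response>
    <D:href>/video/cam42.mp4</D:href>
    <D:propstat>
      <D:prop>
        <D:displayname>cam42.mp4</D:displayname>
        <D:getlastmodified>Mon, 16 Apr 2012 08:00:00 GMT</D:getlastmodified>
      </D:prop>
      <D:status>HTTP/1.1 200 OK</D:status>
    </D:propstat>
  </D:response>
</D:multistatus>"""

request = build_propfind("https://dav.example.com/video/")  # hypothetical URL
properties = parse_multistatus(SAMPLE)
```

In a real deployment the reply would come from sending `request` to the server rather than from a canned string, and the parser would also want to check each `status` element before trusting the properties.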
After considering how to access the data, primarily to get it into our storage system, we really need to think about how the data is actually stored. For example, is the data stored in a flat text file and if so, how is the data organized and how do you access it? Is the data stored in some sort of database? Is the data stored in some standard format such as HDF5? Or is the data stored in a proprietary format? Answering this question is more critical than people think because you will have to write methods that find the data you want and retrieve it, in virtually any tool you use. In many cases, the choice of how and where the data is written and stored depends on the actual applications being developed and how the results are created and stored. But be sure to pay close attention to this aspect of Big Data -- in what form will the data be stored?
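To show why the storage form matters to the code you end up writing, here is a small Python sketch that reads the same records from a flat text file and from a database. The column names and the two sample records are made up for illustration, and SQLite stands in for "some sort of database" only because it ships with Python; it is not a recommendation for Big Data workloads.

```python
import csv
import io
import sqlite3

# Hypothetical flat-text data: one record per line, comma separated.
FLAT_TEXT = (
    "camera_id,timestamp,bytes\n"
    "42,2012-04-16T08:00:00,1048576\n"
    "43,2012-04-16T08:00:05,2097152\n"
)


def read_flat_text(text):
    """Parse flat text; every field arrives as a string and must be converted."""
    rows = csv.DictReader(io.StringIO(text))
    return [(int(r["camera_id"]), r["timestamp"], int(r["bytes"])) for r in rows]


def load_into_database(records):
    """Store the records in an in-memory SQLite table with typed columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE frames (camera_id INTEGER, timestamp TEXT, bytes INTEGER)"
    )
    conn.executemany("INSERT INTO frames VALUES (?, ?, ?)", records)
    conn.commit()
    return conn


def read_database(conn):
    """Retrieve the records with a query instead of a hand-written parser."""
    cursor = conn.execute(
        "SELECT camera_id, timestamp, bytes FROM frames ORDER BY camera_id"
    )
    return cursor.fetchall()


records = read_flat_text(FLAT_TEXT)
conn = load_into_database(records)
same_records = read_database(conn)
```

With the flat file, the reader has to know the column order and convert the types itself. With the database, that knowledge lives in the schema and the query. A self-describing format such as HDF5 or a proprietary format would each need yet another access method.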