Why Use R? - Page 4
Connections With R
So far, the tools I've mentioned have been focused on the database portion of the problem: gathering the data and running queries against it. That is a very important part of the Big Data process (if there is such a thing), but it's not everything. You must take the results of those queries and perform some computations on them, usually statistical ones. For example: what is the average age of people buying a certain product in the middle of Kansas? What was the weather like when the most socks were purchased (temperature, humidity and cloudiness all being factors)? Which section of a genome is most common between people in Texas and people in Germany? Answering questions like these takes analytical computation, and much of that computation is statistical in nature (i.e., heavily math oriented).
Without much doubt, the most popular statistical analysis package is R. R is really a programming language and environment, one that is particularly focused on statistical analysis. To add to the previous discussion of R, it has a wide variety of built-in capabilities, including linear and nonlinear modeling, a large library of classical statistical tests, time-series analysis, classification, clustering and a number of other analysis techniques. It also has very good graphical capabilities, allowing you to visualize your results. R is an interpreted language, which means you can run it interactively or write scripts that R processes. It is also very extensible, allowing you to write code in C, C++, Fortran, R itself or even Java.
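To give a feel for what this looks like in practice, here is a minimal sketch of an interactive R session using only base R and the built-in `cars` dataset (stopping distance versus speed); no extra packages are needed for modeling, testing or plotting:

```r
# Fit a linear model to a built-in dataset: stopping distance as a
# function of speed, using base R's lm().
fit <- lm(dist ~ speed, data = cars)
coef(fit)   # intercept and slope of the fitted line

# A classical statistical test: do slow and fast cars differ in
# mean stopping distance?
slow <- cars$dist[cars$speed <= 15]
fast <- cars$dist[cars$speed > 15]
t.test(slow, fast)

# Visualization is a couple of lines with base graphics.
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
abline(fit)
```

Each of these lines can be typed at the R prompt one at a time, which is exactly the interactive, exploratory style that makes R popular for this kind of analysis.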
For much of Big Data's existence, R has served as the lingua franca for analysis, and the integration between R and the database tools is a bit bumpy but getting smoother. A number of the tools mentioned in this article series have been integrated with R or have articles explaining how to get R and that tool to interact. Since this is an important topic, the list of links below gives a few pointers, but basically, if you Google for "R + [tool]", where [tool] is the tool you are interested in, you will likely find something.
- Column Stores
- Key-Value Store/Tuple Store + R
- CouchDB + R
- MongoDB + R
- Terrastore + R
- Article about Teradata add-on package for R
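Most of these integrations follow the same basic pattern: a client package pulls query results from the database into an R data frame, and from there the full statistical toolkit is available. As a rough sketch, here is what that looks like with the `mongolite` package, one of several MongoDB clients for R; the server address, collection and field names here are hypothetical, so adapt them to your own setup:

```r
# Sketch: querying MongoDB from R with the 'mongolite' package
# (install.packages("mongolite")). Assumes a MongoDB server running
# on localhost and a hypothetical 'purchases' collection with
# 'state', 'age' and 'amount' fields.
library(mongolite)

con <- mongo(collection = "purchases", db = "store",
             url = "mongodb://localhost")

# Pull documents matching a query into an R data frame...
sales <- con$find('{"state": "KS"}')

# ...then hand the result straight to R's statistical machinery.
mean(sales$age)
summary(lm(amount ~ age, data = sales))
```

The point is that once the data crosses into R, the database tool no longer matters; the analysis code is the same regardless of where the data came from.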
But R isn't the only analytical tool available or in use. MATLAB is also commonly used, and there are connections between MATLAB and some of the databases. There are also connections with SciPy, a scientific computing library for Python; since a number of these tools can already integrate with Python, integration with SciPy is straightforward.
Just a quick comment about programming languages for Big Data: if you look through the tools mentioned, including Hadoop, you will see that the most common language is Java. Hadoop itself is written in Java, and a number of the database tools are either written in Java or have Java connectors. Some people view this as a benefit, while others view it as an issue. After Java, the most popular languages are C/C++ and Python.
All of these tools are really useful for analyzing data and can be used to convert data into information. However, one thing that is still missing is good visualization tooling. How do you visualize the information you create from the data? How do you tell visually which information is important and which isn't? How do you present this information easily? How do you visualize information that has more than three dimensions or three variables? These are very important questions that the industry must address.
Whether you realize it or not, visualization can have an impact on storage and data access. Do you store the information or data within the database tool or somewhere else? How can you recall the information and then process it for visualization? Questions such as these impact the design of your storage solution and its performance. Don't take storage lightly.