Social data is being created at an unprecedented rate. Facebook has around a billion users, and Twitter is at the half-billion mark. That alone is a massive amount of data. Now factor in YouTube, which hosts large audio and video files and has just passed the billion-user mark too. Then there is all the unstructured data being retained by other social media sites and by corporate social apps.
Here, then, are some tips on how to deal with the rising tide of social storage.
1. Small Bits
The first important thing to grasp about the storage of social data is that it comes in high volume but that each piece is relatively small. This is quite different from some other types of storage.
“Social media data is mostly small bits — blog posts, tweets, photos, etc.” observed Bill Peterson, Senior Manager, Big Data Solutions Marketing at NetApp. “Even videos are normally small ones.”
2. Object Lesson
There are different use cases for storing social media data. For example, companies like Twitter and Facebook have to store the data somewhere so it can be retrieved when users want to see it. In addition, organizations want to archive their social media data in bulk so they can analyze it and gain insight. The copy kept for retrieval is known as the foreground copy, whereas the bulk archive is the background copy.
“Object-based storage is a natural fit for the foreground copy, as object stores have the necessary scale both in total size and geographic distances to meet the needs of an application like ‘store all the pictures in Facebook’ or ‘store all the tweets on our corporate VPN,’” said Peterson. “Object storage systems normally have HTTP-based interfaces, making it easy to put references to such objects into the Web pages that display them.”
3. Analytics Friendly
The main reason for keeping the background, archival copy of social media data is to run analytics on it and gain insight. Analytics platforms often require lots of small objects to be bundled together into very large files. For example, if you want to analyze tweets, you want a big file full of tweets, not a file (or object) per tweet. Hadoop is one of the platform choices for this class of analytics.
“Hadoop is very good at large files (GB, TB, PB) and not so good at lots of small files,” explained Peterson. “Hadoop also excels at streaming data access and write-once read-many data storage design.”
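To make the bundling idea concrete, here is a minimal Python sketch that groups many small records into a few larger newline-delimited JSON bundles. The function name `bundle_records`, the `target_bytes` parameter and the sample tweets are all hypothetical, and the tiny 16 KB target is chosen only so the example runs quickly; a real analytics pipeline would aim for far larger files.

```python
import json

def bundle_records(records, target_bytes):
    """Group small records into newline-delimited JSON bundles.

    Each bundle is a string of roughly target_bytes (hypothetical parameter);
    a real pipeline would write each bundle out as one large file.
    """
    bundles = []
    current = []
    current_size = 0
    for record in records:
        line = json.dumps(record, ensure_ascii=False) + "\n"
        size = len(line.encode("utf-8"))
        # Close the current bundle once adding this line would exceed the target.
        if current and current_size + size > target_bytes:
            bundles.append("".join(current))
            current, current_size = [], 0
        current.append(line)
        current_size += size
    if current:
        bundles.append("".join(current))
    return bundles

# Made-up sample data: 1,000 tiny "tweets".
tweets = [{"id": i, "text": f"tweet number {i}"} for i in range(1000)]
bundles = bundle_records(tweets, target_bytes=16 * 1024)
print(len(tweets), "records packed into", len(bundles), "bundles")
```

One record per line keeps each bundle easy to split and stream, which suits the write-once, read-many access pattern Peterson describes.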
4. Need for Speed
Social data demands speed. Users typically don’t hang around for buggy applications or slow service. They go elsewhere.
“Working with social data requires storage that can deliver information in near real-time, making solid state drives the top solution,” advised John Scaramuzzo, President of SMART Storage Systems. “However, be on the lookout for SSDs that can achieve high-endurance levels with lower-cost MLC flash to ensure you not only get the required throughput, but can avoid the need to steadily replace burned out drives.”
5. Slower Archives
It isn’t really feasible to discard rarely accessed social data and only store the hot stuff. After all, nobody wants to be the one who, when legal comes looking for something, has to confess that they deleted it. Instead, the data should be split into hot and cold tiers according to organizational needs. The hot data gets fast response times, while you can get away with slower access on the rest.
“For data that is not in active use, response times of 100ms or so are typically acceptable,” said Peterson. “Colder objects can tolerate much lower response times.”
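As a rough illustration of the hot/cold split, the Python sketch below labels objects by how recently they were accessed. The 30-day window, the `temperature` function and the sample object names are assumptions made for the example, not figures from Peterson; the right cutoff depends on organizational needs.

```python
from datetime import datetime, timedelta

# Assumption: anything untouched for more than 30 days counts as cold.
HOT_WINDOW = timedelta(days=30)

def temperature(last_access, now, hot_window=HOT_WINDOW):
    """Return 'hot' for recently accessed objects, 'cold' otherwise."""
    return "hot" if now - last_access <= hot_window else "cold"

# Made-up objects with their last access times.
now = datetime(2013, 6, 1)
objects = {
    "tweet-1": datetime(2013, 5, 28),
    "photo-7": datetime(2011, 1, 15),
}
placement = {name: temperature(ts, now) for name, ts in objects.items()}
print(placement)
```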
6. Three Tiers, At Least
Peterson recommends at least three tiers: the in-memory (or in-flash) tier, the on-disk tier, and the cold-data tier. Movement from the in-memory to disk tier occurs via basic caching. Movement to the cold data layer, on the other hand, involves some amount of collapsing large numbers of small objects into small numbers of large objects.
“If you don’t do this then the cold data tier ends up with too many objects,” added Peterson.
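One simple way to collapse many small objects into one large one is to concatenate them and keep an index of offsets so each item can still be read back. The Python sketch below shows the idea; `pack_objects`, `read_object` and the sample keys are hypothetical names, and this is not a description of how any particular product does it.

```python
def pack_objects(objects):
    """Concatenate small byte objects into one blob plus an offset index."""
    blob = bytearray()
    index = {}
    for key, data in objects.items():
        # Record where this object starts and how long it is.
        index[key] = (len(blob), len(data))
        blob.extend(data)
    return bytes(blob), index

def read_object(blob, index, key):
    """Fetch one original object back out of the packed blob."""
    offset, length = index[key]
    return blob[offset:offset + length]

# Made-up sample: 500 tiny objects become a single cold-tier object.
small = {f"tweet-{i}": f"text of tweet {i}".encode("utf-8") for i in range(500)}
blob, index = pack_objects(small)
print(len(small), "small objects packed into 1 blob of", len(blob), "bytes")
```

The cold tier then tracks one object instead of 500, at the cost of an extra index lookup on the rare occasions an individual item is needed.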
7. Storing Profiles
Social profile data is the information that a user passes on to a website when registering through a site such as Facebook or Google. It includes the user’s hobbies, interests, friends list and so on. That’s a lot of very important data that has to be stored and secured effectively.
“Most of the profile data itself is stored as document indexes for performance reasons,” said Vidya Shivkumar, Vice President of Product at Janrain. “In addition, it is stored in a relational database for queries that are needed in some use cases.”
Janrain, for example, uses relational databases, key-value stores and document indexes.
8. Bulk Up
The sheer volume of social data can add up to an awful lot of storage arrays. In many cases, it might be better to offload the bulk storage to a cloud service from Amazon, Google, Microsoft or another provider. Janrain uses Amazon’s infrastructure for hosting.
“Why should companies feel the need to handle storage on their own?” asked Shivkumar. “There are a lot of vendors who offer this capability and why would a business not consider it?”
9. Don’t Expect Much Deduplication
Deduplication is workload specific. Traditional backups and VMs, for example, can provide excellent dedupe ratios. However, tweets and blog posts tend to compress but not dedupe. Photos, though, may provide some deduplication gains.
“Items like photos dedupe as multiple people will upload the same picture,” explained Peterson.
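The photo case can be sketched with a content-addressed store: identical bytes hash to the same digest, so the second upload of the same picture adds only a name, not another copy. The `DedupStore` class below is a hypothetical in-memory illustration in Python, and the sample bytes are made up.

```python
import hashlib

class DedupStore:
    """Toy content-addressed store: identical uploads are kept only once."""

    def __init__(self):
        self.blobs = {}  # SHA-256 digest -> bytes, stored once
        self.names = {}  # upload name -> digest

    def put(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        # setdefault keeps the existing copy if this content was seen before.
        self.blobs.setdefault(digest, data)
        self.names[name] = digest
        return digest

    def get(self, name):
        return self.blobs[self.names[name]]

store = DedupStore()
photo = b"\x89PNG fake image bytes"  # stand-in for a real image
store.put("alice/cat.png", photo)
store.put("bob/funny.png", photo)  # same picture, different uploader
store.put("carol/tweet.txt", b"a unique tweet")
print(len(store.names), "uploads,", len(store.blobs), "stored blobs")
```

Tweets and blog posts rarely match byte for byte, so they get no benefit from this scheme, which is why compression is the better tool for them.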
10. No Backups
Peterson said that social data is not typically backed up in the usual sense. Instead, multiple copies are made in multiple places. NetApp StorageGrid, for example, allows you to create classes of data by using queries on the metadata.
“Classes of data can have rules, such as requiring at least two copies in at least two different data centers,” noted Peterson. “There is no backup; there is just an engine that attempts to ensure that no matter what happens there are always copies in two physical locations.”
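A rule of the kind Peterson describes can be modeled in a few lines. The Python sketch below is purely illustrative and is not StorageGrid’s API: `ReplicationRule`, `is_satisfied`, `plan_repair` and the data center names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReplicationRule:
    min_copies: int  # minimum number of copies of an object
    min_sites: int   # minimum number of distinct data centers

def is_satisfied(rule, replica_sites):
    """replica_sites lists the data center holding each existing copy."""
    return (len(replica_sites) >= rule.min_copies
            and len(set(replica_sites)) >= rule.min_sites)

def plan_repair(rule, replica_sites, available_sites):
    """Return the extra sites to copy to so that the rule holds again.

    Simplification: new copies only go to sites that hold no copy yet.
    """
    planned = list(replica_sites)
    additions = []
    for site in available_sites:
        if is_satisfied(rule, planned):
            break
        if site not in planned:
            planned.append(site)
            additions.append(site)
    if not is_satisfied(rule, planned):
        raise ValueError("not enough distinct sites to satisfy the rule")
    return additions

# "At least two copies in at least two different data centers."
rule = ReplicationRule(min_copies=2, min_sites=2)
sites = ["dc-east", "dc-west", "dc-eu"]  # made-up site names
print(plan_repair(rule, ["dc-east"], sites))
```

An engine built on this idea would run the check continuously and schedule the planned copies whenever a site or drive is lost.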