Data lakes and data swamps take a similar approach to data storage: both compile structured and unstructured data in one repository. Large enterprises are the most likely to use them because they need to hold enormous amounts of data, even if they don’t yet know when or why they’ll need it. Lakes and swamps typically cost less than structured storage because data can be added in its original form, without first being converted to a particular format, which makes the repository easier to scale.
What does a data lake do?
Data lakes are beneficial because they require less rigid organization than warehouses, which store highly structured data. Big data analytics works best when unstructured and partly structured data can be analyzed together instead of being siloed in different databases or warehouses. Because data lakes store data as objects in whatever form it arrives, they are useful for enterprises with large amounts of unstructured data.
However, that doesn’t mean that throwing a bunch of data in a lake with no controls or organization whatsoever will result in beautiful or useful data analytics for your business. Lakes need structure in their own way. But unlike warehouses, they mainly need:
- Easy ways to locate data
- Governance for data
- Methods of cleaning and sorting data
- Plans for utilizing accurate and useful data
Successful data lakes store metadata alongside each data object. This metadata categorizes the data and makes it easier to locate within the lake. Clearly described objects reduce the time analysts spend sorting through data.
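As a rough illustration of metadata stored alongside each object, here is a minimal Python sketch of an in-memory catalog. The `ObjectMetadata` fields, the `Catalog` class, and the sample object keys are all hypothetical names chosen for this example; a real lake would usually rely on a dedicated catalog service rather than a dictionary.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ObjectMetadata:
    """Descriptive record kept alongside one object (hypothetical schema)."""
    key: str            # where the object lives in the lake
    source: str         # system that produced the data
    content_type: str   # for example "text/csv" or "audio/wav"
    created: date       # when the object was written
    tags: set = field(default_factory=set)  # free-form categories


class Catalog:
    """In-memory stand-in for a metadata catalog."""

    def __init__(self):
        self._entries = {}

    def register(self, meta):
        # Refuse objects that arrive without any categorization.
        if not meta.tags:
            raise ValueError(f"{meta.key} has no tags; refusing to register")
        self._entries[meta.key] = meta

    def find_by_tag(self, tag):
        # Return matching object keys in a stable, sorted order.
        return sorted(m.key for m in self._entries.values() if tag in m.tags)


catalog = Catalog()
catalog.register(ObjectMetadata("raw/orders/2024-03.csv", "order-service",
                                "text/csv", date(2024, 4, 1), {"orders", "finance"}))
catalog.register(ObjectMetadata("raw/support/call-0042.wav", "call-center",
                                "audio/wav", date(2024, 4, 2), {"support"}))

print(catalog.find_by_tag("orders"))
```

Rejecting untagged objects at registration time is one possible design choice: it keeps uncategorized data out of the catalog from the start instead of relying on cleanup later.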
Data governance includes the policies set for stored data: how long it should be stored, who should be allowed to access it, and what compliance requirements it needs to meet. Compliance is particularly important if you’re storing any type of customer data. Data protection regulations set strict guidelines for customer data and may also require organizations to track who has access to it.
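The three governance questions above (retention, access, compliance) can be written down as data. The sketch below is one hypothetical way to do that in Python; the category names, roles, retention periods, and the "privacy-regulation" label are invented for illustration and are not drawn from any specific regulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GovernancePolicy:
    retention_days: int        # how long objects in this category are kept
    allowed_roles: frozenset   # who may read them
    regulations: frozenset     # compliance regimes that apply (labels only)


# Hypothetical policies, one per data category.
POLICIES = {
    "customer": GovernancePolicy(365, frozenset({"compliance", "support"}),
                                 frozenset({"privacy-regulation"})),
    "telemetry": GovernancePolicy(90, frozenset({"analyst", "engineer"}),
                                  frozenset()),
}


def can_access(category, role):
    # Unknown categories are denied by default.
    policy = POLICIES.get(category)
    return policy is not None and role in policy.allowed_roles


def who_can_access(category):
    # Answers the record-keeping question: which roles have access?
    policy = POLICIES.get(category)
    return sorted(policy.allowed_roles) if policy else []


print(can_access("customer", "analyst"))
print(who_can_access("customer"))
```

Denying access to any category without a policy means new kinds of data cannot be read until someone decides how they should be governed.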
Much of the data thrown in a lake will eventually grow outdated. If BI platforms or analysts are using this data to make decisions, that data should be accurate. Data lakes need methods of cleaning out objects once they’re no longer accurate or no longer need to be stored for regulatory purposes.
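A cleanup method can be as simple as comparing each object’s age with its retention period. The following Python sketch assumes each object is described by a hypothetical `(key, created, retention_days)` tuple; it only reports what has expired and leaves the actual deletion to whatever storage system is in use.

```python
from datetime import date, timedelta


def expired_keys(objects, today):
    """Return keys of objects whose retention period ended before `today`.

    `objects` is an iterable of (key, created, retention_days) tuples,
    an assumed layout used only for this example.
    """
    return [
        key
        for key, created, retention_days in objects
        if created + timedelta(days=retention_days) < today
    ]


inventory = [
    ("raw/telemetry/2023-01.json", date(2023, 1, 1), 30),
    ("raw/customers/2024-01.csv", date(2024, 1, 1), 365),
]

print(expired_keys(inventory, today=date(2024, 6, 1)))
```

Passing `today` in explicitly, instead of calling `date.today()` inside the function, keeps the check easy to test and to rerun for a past audit date.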
These four characteristics mark the difference between a data lake and a data swamp.
What does a data swamp do?
Data swamps usually begin as lakes. Enterprises don’t plan to start a data swamp; swamps aren’t sold as-a-service, nor are they marketed. Data lakes turn into swamps when businesses don’t set expectations and guidelines for their data storage. Swamps make analysis and retrieval very challenging.
Data swamps become a catch-all for data. When an organization needs or wants to store data but doesn’t know how to categorize it or doesn’t need to put it in a warehouse, a data lake-turned-swamp is waiting to collect all the unrelated objects and files. Data swamps store unnecessary and outdated objects because users toss anything in them, without setting guidelines for relevance or timeliness.
Data swamps aren’t regularly managed or governed by administrators or analysts. They don’t have controls or categorization placed on their stored objects. That’s part of the reason they don’t lend themselves to big data analytics. The other reason is their lack of metadata. Objects and files stored in swamps frequently don’t have metadata, which makes them incredibly challenging to search or organize.
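One way to notice a lake drifting toward a swamp is to measure how many objects lack basic metadata. The Python sketch below assumes a hypothetical set of required fields (`owner`, `source`, `created`) and a plain dictionary mapping object keys to their metadata; both are illustrative choices, not a standard.

```python
# Fields every object is expected to carry (an assumption for this example).
REQUIRED_FIELDS = ("owner", "source", "created")


def missing_metadata(objects):
    """Map each incomplete object key to the list of fields it lacks."""
    report = {}
    for key, meta in objects.items():
        missing = [f for f in REQUIRED_FIELDS if not meta.get(f)]
        if missing:
            report[key] = missing
    return report


def metadata_coverage(objects):
    """Fraction of objects that carry every required field."""
    if not objects:
        return 1.0
    return 1 - len(missing_metadata(objects)) / len(objects)


lake = {
    "raw/orders.csv": {"owner": "finance", "source": "erp", "created": "2024-01-01"},
    "raw/dump.bin": {"owner": "unknown-team"},
    "raw/misc.txt": {},
}

print(missing_metadata(lake))
print(round(metadata_coverage(lake), 2))
```

A falling coverage figure over time would be an early warning sign that objects are being added without the metadata needed to search or organize them.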
Data swamps are also a danger to compliance. They obscure customer data, and if businesses can’t find data in the murky recesses of the swamp, they could be found non-compliant with regulatory standards that require data to be retrieved or deleted. Many regulations require businesses to keep accurate records of data, including who has access to it, and data swamps make that difficult (or impossible).
Keeping your data lake from becoming a swamp
There’s certainly something very appealing about being able to toss any piece of data in a huge, scalable storage repository without having to worry about it. But that strategy doesn’t set enterprises up for future analytics or success. Data swamps are only useful for unimportant, random data that doesn’t need to be used in any business intelligence ventures.
As previously mentioned, data lakes need organization so they present useful, relevant data. When lakes are intentionally designed, all objects and files have metadata, and data is closely governed, lakes have the potential to give accurate and game-changing business insights. They just require some work at the beginning before they get there.