Over the last decade, the cloud has gradually absorbed the bulk of enterprise systems. So it was inevitable that data warehouses would end up in the cloud. For many organizations, it is easier to rent data warehouse services than to build their own infrastructure.
A wide range of vendors now offer data warehouse services: central repositories where organizations can store large amounts of data gathered from many sources. These data management systems support storage centralization, business intelligence (BI), and analytics.
Cloud-based data warehouses hide a lot of the complexity from the user. In most cases, the various elements, such as databases, storage units, ELT (extraction, loading, and transformation), reporting, data mining, and analytics engines, are provided to the user through a relatively simple interface.
The Benefits of a Data Warehouse
Data warehouses provide many benefits:
- A central location to host a large amount of data.
- A way for data scientists to analyze data easily by having it consolidated in one place.
- A way to retain data and provide historical context.
- The ability to perform queries.
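As a small illustration of the consolidation and querying benefits listed above, the sketch below loads records from two source systems into one table and runs a single query across both. SQLite (from the Python standard library) stands in for a real warehouse here, and the table name, source names, and figures are all invented for the example.

```python
import sqlite3

# Hypothetical extracts from two separate source systems (made-up data).
crm_orders = [("2021-01-15", "crm", 120.0), ("2021-02-03", "crm", 80.0)]
web_orders = [("2021-01-20", "web", 45.5), ("2021-02-11", "web", 60.0)]


def build_warehouse(*sources):
    """Load every source extract into one central in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_date TEXT, source TEXT, amount REAL)"
    )
    for rows in sources:
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def monthly_totals(conn):
    """Query across all sources at once, grouped by calendar month."""
    cursor = conn.execute(
        "SELECT substr(order_date, 1, 7) AS month, SUM(amount) "
        "FROM orders GROUP BY month ORDER BY month"
    )
    return cursor.fetchall()


conn = build_warehouse(crm_orders, web_orders)
totals = monthly_totals(conn)
print(totals)
```

Because both extracts land in the same table, the monthly totals cover every source without any per-system query logic.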
While some organizations are warehousing their data in-house, there are distinct advantages to housing a data warehouse in the cloud. Cloud data warehouses offer greater flexibility and simplicity, are easier to use and manage, and usually cost less. However, they may fall afoul of enterprise or governmental policy regarding governance, security, privacy, and data sovereignty.
That said, many organizations use them because they are relatively plug and play. Someone else provides the infrastructure and takes care of management, support, maintenance, upgrades, and more, all for one monthly cost.
Top Cloud Data Warehouse Vendors 2021
Enterprise Storage Forum reviewed many different data warehouse providers. Here are our picks for top cloud-based data warehouses. Note that some of these providers also accommodate on-premises data warehouses. However, they focus on the cloud, sometimes exclusively.
Cloudera
Cloudera’s CDP Data Hub offers a way to easily ingest, route, manage, and deliver data-at-rest and data-in-motion from the edge, any cloud, or data center to any downstream system with built-in security. Running on the Cloudera Data Platform (CDP), the data hub secures and provides governance for all data and metadata on private clouds, multiple public clouds, or hybrid clouds.
Key Differentiators
- Uses Apache NiFi for flow management and Apache Kafka for streams messaging, both of which are part of Cloudera DataFlow, a real-time streaming data platform that delivers insights and actionable intelligence.
- Enables IT to deliver a cloud-native self-service analytic experience to BI analysts for queries that only take minutes.
- Scales cost-effectively past petabytes.
- Connects to AWS and Azure object storage.
- A burst-to-cloud feature moves data and context from a data center to the cloud.
- Self-service provisioning and administration.
- Data visualization.
- Services to help at every step of the journey, on all infrastructures, from on-premises to the cloud, and ranging from solution design, to implementation and production readiness.
- Real-time analysis of very large and constantly growing data sets.
Oracle
Oracle Data Warehouse in the Cloud can handle many types of data and support many types of analytic systems. It can be used on its own or as a complement and extension of traditional data warehouse installations. As well as storing data in relational databases, it can also store data from many different sources including web pages, social media feeds, search indexes, and equipment sensors.
Key Differentiators
- Oracle offers a complete platform-as-a-service (PaaS) environment that allows integrated control of both hardware and software.
- Optional Oracle Database Exadata Cloud Service for extreme performance.
- 37 years of data management experience and the provider of the top database for decades.
- Compatibility between cloud and on-premises Oracle deployments.
- Instant access to high-performance analytics.
- Enterprise-grade infrastructure.
- Supported by 20 cloud data centers worldwide.
- Oracle Cloud can handle hybrid models that include structured data along with unstructured data in NoSQL and Hadoop.
- Distribute data to geographically dispersed systems and workgroups, including analytic initiatives and departmental BI workgroups.
- Analytic solutions tethered to Oracle Database 12c, while offering a secondary data path leveraging Hadoop as the engine for new data sources.
IBM
IBM Db2 Warehouse on Cloud is a fully managed, elastic cloud data warehouse that delivers independent scaling of storage and compute. Its dashboard makes it easier to see and manage data. A columnar data store, actionable compression, and in-memory processing facilitate analytics and machine learning workloads.
Key Differentiators
- In-memory processing for complex analytics and concurrency.
- Volumes of cloud-native mobile, web and IoT data stored for faster analytics.
- Automation of daily tasks, including monitoring, uptime checks and backups.
- Scalable cloud service.
- Aggregate data from across the business for a cross-organizational view.
- Deployable on multiple cloud providers.
- Self-service and geo-replicated disaster recovery backups for data protection.
- Multi-layer resiliency with Kubernetes-managed compute and block storage.
- Independent scaling of storage and compute: burst on compute during peak demand, scale down when demand falls, and expand storage capacity as data volumes grow.
- High performance on complex analytics workloads using IBM BLU Acceleration.
- Queries run directly on compressed data, and only the data a query needs is loaded into memory, leaving the rest on disk.
- Adaptive Workload Management technology automatically manages resources between concurrent workloads, given user-defined resource targets.
- Manage self-service snapshot backup and restore through the Db2 Warehouse on Cloud web console.
Azure
Microsoft Azure Synapse Analytics brings together data integration, enterprise data warehousing, and big data analytics on one platform. It helps to unify ingestion, preparation, management, querying, and analytics in support of BI needs.
Key Differentiators
- Deliver insights from all data, across data warehouses and big data analytics systems, with speed.
- Apply machine learning models to all intelligent apps.
- Reduce project development time with a unified experience for developing analytics solutions.
- Column- and row-level security and dynamic data masking.
- Almost limitless scalability.
- Query both relational and nonrelational data at petabyte scale using the language of your choice.
- Optimize query performance with workload management, workload isolation, and concurrency.
- Automate mandatory and critical data warehouse migration steps.
- Translate legacy code in minutes so that it can run in the cloud.
- Built on top of an SQL engine.
- Build ETL/ELT processes in a code-free visual environment to easily ingest data from more than 95 native connectors.
- Use your preferred language, including T-SQL, Python, Scala, Spark SQL, and .NET, whether you use serverless or dedicated resources.
SAP Data Warehouse Cloud
SAP Data Warehouse Cloud unifies data and analytics in a multi-cloud solution that includes data integration, database, data warehouse, and analytics capabilities. Built on the SAP HANA Cloud database, this software-as-a-service (SaaS) provides in-memory capabilities for querying and analysis.
Key Differentiators
- SAP Data Warehouse Cloud leverages a single compound metric called a Capacity Unit, which maps to the underlying consumption of compute and storage resources.
- A capacity unit is consumed in compute blocks of 64 GB of memory and four virtual CPUs, and in storage blocks of 256 GB of warm storage.
- Data can be stored either all in memory or a combination of memory/disk to save costs.
- Connect data across multi-cloud and on-premises repositories in real time while preserving business context.
- Virtual workspace and no-code environment to connect, model, visualize, and share data securely in an IT-governed environment.
- Analyze all types of structured, unstructured, and geospatial data.
- Accelerate implementation with pre-integrated database, data warehouse, data intelligence, data lake, and analytics capabilities.
- Leverage prebuilt industry and Line of Business content, templates, data models, and integrations with SAP and third-party data sources and data lakes.
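The block sizes quoted in the list above lend themselves to a quick sizing calculation. The sketch below estimates how many compute and storage blocks a workload would occupy. It assumes that blocks are allocated whole (so requirements are rounded up) and that compute and storage are sized independently. The article does not give the rate at which blocks convert to capacity units, so that step is deliberately left out, and the workload figures are hypothetical.

```python
import math

# Block sizes as quoted in the list above.
COMPUTE_BLOCK_MEMORY_GB = 64
COMPUTE_BLOCK_VCPUS = 4
STORAGE_BLOCK_GB = 256


def blocks_needed(memory_gb, storage_gb):
    """Return (compute_blocks, vcpus, storage_blocks).

    Assumption: blocks are allocated whole, so requirements round up.
    """
    compute_blocks = math.ceil(memory_gb / COMPUTE_BLOCK_MEMORY_GB)
    storage_blocks = math.ceil(storage_gb / STORAGE_BLOCK_GB)
    vcpus = compute_blocks * COMPUTE_BLOCK_VCPUS
    return compute_blocks, vcpus, storage_blocks


# Hypothetical workload: 200 GB of memory and 1,000 GB of warm storage.
sizing = blocks_needed(200, 1000)
print(sizing)
```

Under these assumptions, the hypothetical workload would occupy four compute blocks (16 virtual CPUs) and four storage blocks.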
Amazon Redshift
Amazon Redshift is a fast, simple, cost-effective data warehousing service. It combines a high-performance data warehouse with the flexibility and scalability of a data lake to gain new insights from data. With Redshift, you can query and combine exabytes of structured and semi-structured data across the data warehouse, operational databases, and data lake using standard SQL.
Key Differentiators
- Redshift lets you save the results of queries back to an S3 data lake using open formats, such as Apache Parquet, for additional analysis with other services such as Amazon EMR, Amazon Athena, and Amazon SageMaker.
- High performance at scale.
- Price-performance improves as the data warehouse grows.
- Takes advantage of AWS-designed hardware and machine learning (ML).
- AWS Nitro System accelerates data compression and encryption.
- Graph optimization algorithms automatically organize and store data for faster query results.
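Saving query results back to S3 in Parquet format, as mentioned in the list above, is typically done with Redshift's UNLOAD command. The sketch below only assembles the SQL text; it does not connect to a cluster. The helper name, bucket path, table, and IAM role ARN are hypothetical placeholders, and the statement would still need to be run through whichever Redshift client you use.

```python
def build_unload_statement(query, s3_prefix, iam_role_arn):
    """Assemble a Redshift UNLOAD statement that writes Parquet files to S3.

    Single quotes inside the query are doubled, because UNLOAD takes the
    query as a quoted string literal.
    """
    escaped_query = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped_query}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


# Hypothetical table, bucket, and role, for illustration only.
unload_sql = build_unload_statement(
    "SELECT region, SUM(amount) AS total FROM sales "
    "WHERE sale_year = '2021' GROUP BY region",
    "s3://example-data-lake/sales_by_region/",
    "arn:aws:iam::123456789012:role/ExampleRedshiftUnloadRole",
)
print(unload_sql)
```

The resulting Parquet files can then be read by other services that work against the same S3 location.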
Apache Hive
Apache Hive is database/data warehouse software that supports data querying and analysis of large datasets stored in the Hadoop distributed file system (HDFS) and compatible systems such as Apache HBase. It is distributed under an open source license.
Key Differentiators
- The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.
- Structure can be projected onto data already in storage.
- A command line tool and JDBC driver are provided to connect users to Hive.
- Write Hive Query Language (HQL) statements that are similar to standard SQL statements for data query and analysis.
- Designed to make MapReduce programming easier as there is no need to write lengthy Java code.
- Hive metastore enables you to apply a table structure onto large amounts of unstructured data. Other tools such as Apache Spark and Apache Pig can then access the data in the metastore.
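To make the idea of projecting structure onto data already in storage more concrete, the sketch below generates a HiveQL external table definition and a query that reads much like standard SQL. It only builds the statement text and does not connect to Hive. The helper name, table name, columns, and HDFS path are hypothetical, and comma-delimited text files are assumed.

```python
def build_external_table_ddl(table, columns, location, delimiter=","):
    """Return HiveQL that lays a table structure over files at `location`.

    An external table leaves the underlying files where they are; Hive
    only records the schema and the location.
    """
    column_lines = ",\n".join(
        f"  {name} {hive_type}" for name, hive_type in columns
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{column_lines}\n"
        ")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{location}';"
    )


# Hypothetical table, columns, and HDFS path, for illustration only.
ddl = build_external_table_ddl(
    "web_logs",
    [("ip", "STRING"), ("request_time", "STRING"), ("url", "STRING")],
    "/data/raw/web_logs",
)

# An HQL query over that table; it reads like ordinary SQL.
top_pages_query = (
    "SELECT url, COUNT(*) AS hits\n"
    "FROM web_logs\n"
    "GROUP BY url\n"
    "ORDER BY hits DESC\n"
    "LIMIT 10;"
)

print(ddl)
print(top_pages_query)
```

Both statements could be submitted through the command line tool or the JDBC driver mentioned above.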
Google BigQuery
BigQuery, part of the Google Cloud Platform, is a database-as-a-service (DBaaS) that supports the querying and rapid analysis of enterprise data. This serverless data warehousing solution from Google does all resource provisioning behind the scenes, enabling users to focus on data and analysis.
Key Differentiators
- BigQuery ML enables data scientists and data analysts to build and operationalize machine learning models on planet-scale structured or semi-structured data, directly inside BigQuery, using SQL.
- Export BigQuery ML models for online prediction into Vertex AI.
- BigQuery Omni multicloud analytics to analyze data across clouds such as AWS and Azure.
- BigQuery BI Engine is an in-memory analysis service built into BigQuery that enables users to analyze large and complex datasets interactively with sub-second query response time and concurrency.
- BI Engine natively integrates with Google’s Data Studio, Looker, Connected Sheets, and other BI partner solutions via ODBC/JDBC.
- BigQuery GIS combines the serverless architecture of BigQuery with native support for geospatial analysis to augment analytics workflows with location intelligence.
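As a rough illustration of building a model "directly inside BigQuery, using SQL", the sketch below assembles a BigQuery ML CREATE MODEL statement. It only produces the SQL text and does not call any Google Cloud API. The helper name, dataset, table, column names, and the choice of a logistic regression model are assumptions made for the example.

```python
def build_create_model_statement(
    model, label_column, training_query, model_type="logistic_reg"
):
    """Assemble a BigQuery ML CREATE MODEL statement as SQL text."""
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = '{model_type}', "
        f"input_label_cols = ['{label_column}']) AS\n"
        f"{training_query}"
    )


# Hypothetical dataset, table, and columns, for illustration only.
training_query = (
    "SELECT tenure_months, monthly_spend, churned\n"
    "FROM `analytics.customers`"
)
create_model_sql = build_create_model_statement(
    "analytics.churn_model", "churned", training_query
)
print(create_model_sql)
```

The training data never leaves the warehouse in this pattern; the SELECT that feeds the model is an ordinary query over existing tables.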