The State of Cloud Data Warehouses – 2022 Edition
What's the best data warehouse right now? With only a few days left in the year, it's a good time to check which new features Snowflake, Amazon, Google, and Databricks gave us in 2022. Our Data Architect Mika recaps the current state of the different cloud data warehouse products and highlights some of the recent features introduced for each.
Written by — Mika Heino, Data Architect
Amazon Redshift
AWS re:Invent was held again in Las Vegas, and we were treated to multiple new announcements in the Redshift ecosystem.
If you ask me, the most exciting feature was the Apache Spark integration, which enables you to read data seamlessly from Redshift into a Spark job. The feature is similar to Snowflake Snowpark, but with the limitations of Spark cluster scaling (versus a virtual warehouse). Redshift now also supports zero-ETL integration from Amazon Aurora to Amazon Redshift. You can think of this as similar to what Azure Synapse Link offers in the Microsoft Azure ecosystem: in the AWS ecosystem, data landed into Aurora is available in Redshift, too. During re:Invent, AWS also released an auto-copy feature from S3 to Redshift, which sounds similar to Snowflake Snowpipe. As AWS states, the feature makes it possible to trigger loads from S3 activity: "You can now store a COPY statement into a Copy Job, which automatically loads the new files detected in the specified Amazon S3 path."
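Based on the preview documentation, creating such a copy job looks roughly like this; the table, bucket, and IAM role names below are made up:

```sql
-- Hypothetical target table and S3 path; JOB CREATE ... AUTO ON
-- registers the COPY as a job that picks up new files automatically.
COPY orders
FROM 's3://my-ingest-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
FORMAT CSV
JOB CREATE orders_ingest_job
AUTO ON;
```

The point is that the familiar COPY statement becomes a standing job instead of a one-off command, which is why the Snowpipe comparison fits.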
From a security perspective, Amazon Redshift shipped a few minor yet important releases during the year. Redshift now supports Role-Based Access Control (RBAC) and Row-Level Security (RLS) (Column-Level Security was already supported). At re:Invent, Dynamic Data Masking (DDM) was also added to this list as a Preview feature. This brings the product more in line with Snowflake, since these features are a must in the security landscape we live in in 2022.
How about the Achilles' heel of Redshift, the scalability? Last year, AWS released Redshift Serverless, which went GA during the summer. Redshift Serverless was AWS's first step toward bridging the elasticity gap between Redshift and the likes of Snowflake and Firebolt. During re:Invent and over the past year, AWS added features that improve cluster resize times and introduced Automatic Workload Management to automatically add and remove capacity during busy and quiet times. For users of RA3 nodes, the most interesting feature came with the addition of concurrency scaling for writes, which eases bottleneck waits; as the documentation states: "When concurrency scaling is enabled for a queue, eligible write queries are sent to concurrency scaling clusters without having to wait for resources to free up on the main Amazon Redshift cluster." Users of RA3 clusters were also treated to Multi-AZ clusters, a high-availability feature for business-critical analytical applications running on Redshift.
If you are already invested in Redshift or have fully embraced AWS in your cloud data warehouse architecture, Redshift finally seems to be catching up with the competition on the basic DW features, although its elasticity is still a far cry from its rivals'.
Firebolt
Firebolt entered the data warehouse game last year. Like last year, the marketing game at Firebolt is top-notch. If this were a marketing competition, Firebolt would be the clear winner by far with their out-of-the-box marketing style.
But this isn't a marketing competition.
Like last year, Firebolt still doesn't offer demo or trial accounts on their website to test-drive the product. Firebolt also doesn't offer any documentation about newly added features, so it's really hard to list what's new. On top of this, Firebolt is still only offered on top of AWS. Firebolt seems to be a Swiss Army knife for specific data warehouse use cases (as their case-studies page shows), but the user base will not grow without some major new features over their biggest competitor, Snowflake.
Databricks
Last year, Databricks and Snowflake fought on LinkedIn over the TPC-DS benchmark world record. I stated then that Databricks was focusing on the wrong things by highlighting query speed as the main reason to choose your DWH platform, and my mind hasn't changed. Luckily this messaging has been toned down in 2022, and Databricks has focused on improving the core product.
Last year, all the fuss around Databricks was about Serverless SQL. This year the biggest fuss, at least here at Recordly due to our love for dbt, was the ability to orchestrate dbt tasks in Databricks Workflows. As the documentation puts it: "For example, your workflow can ingest data with Auto Loader, transform the data with dbt, and analyze the data with a notebook task."
2022 was a busy year for Databricks, as multiple features went either into Public Preview or GA. During the year, Delta Sharing went GA; at a high level it is similar to Snowflake Data Sharing, enabling data sharing outside your organization. Similarly, Unity Catalog, Databricks' data governance solution, went GA.
Also, Delta Live Tables, a Databricks solution that helps define data processing pipelines, went GA. Databricks AutoML is GA, too. AutoML is a Databricks feature that eases the ML pipeline process; as stated in the documentation: "You provide the dataset and identify the prediction target, while AutoML prepares the dataset for model training. AutoML then performs and records a set of trials that creates, tunes, and evaluates multiple models." On top of these, AWS PrivateLink support is now GA.
If I forgot something, the full list of release notes can be found here.
Snowflake
What a year.
I could give a lengthy presentation about the groundbreaking Unistore, but it's still in Private Preview, so I'll leave that to a future blog. Unistore is an upcoming feature that enables Snowflake to run both transactional and analytical workloads on the same data without any architectural changes.
Snowflake released multiple new features, making the product more exciting than before. By far the biggest announcements were around Machine Learning capabilities, such as Python support. You can now create ML pipelines inside Snowflake using Python UDFs or Snowpark.
Here's my own test drive on the subject (the image is a link to the actual blog post).
The Python code runs as a function inside Snowflake and can be seamlessly added to any data pipeline. Snowflake also added their first Snowpark-optimized virtual warehouses to support these ML training use cases.
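To give an idea of the mechanics, here is a minimal, hypothetical Python UDF sketch; the function name and logic are made up for illustration, not taken from my blog post:

```sql
-- A Python UDF lives inside Snowflake and is called like any SQL function.
CREATE OR REPLACE FUNCTION clean_email(email STRING)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
HANDLER = 'clean'
AS
$$
def clean(email):
    # Placeholder normalization logic standing in for real feature/ML code
    return email.strip().lower() if email else None
$$;

-- Usage in a pipeline step (hypothetical customers table):
SELECT clean_email(raw_email) FROM customers;
```

In a real ML pipeline the Python body would call into packages from the bundled Anaconda channel instead of a one-liner like this.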
To extend coding beyond plain SQL, Snowflake also released Snowflake Scripting. Snowflake Scripting offers similar functionality to Oracle PL/SQL, making it possible to build more logic inside Snowflake than normal CASE WHEN statements allow. With the SQL API going GA, you can create a fully-fledged application together with Streamlit.
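As a rough illustration of what Snowflake Scripting looks like (the table name and logic are made up):

```sql
-- An anonymous Snowflake Scripting block with a variable and branching,
-- the kind of procedural logic plain SQL can't express.
DECLARE
  row_count INTEGER;
BEGIN
  SELECT COUNT(*) INTO :row_count FROM my_table;
  IF (row_count = 0) THEN
    RETURN 'Nothing to process';
  ELSE
    RETURN 'Processing ' || row_count || ' rows';
  END IF;
END;
```

The same constructs (variables, IF/ELSE, loops, exception handling) are available inside stored procedures as well.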
Outside of these major releases, Snowflake also released a few minor updates, and out of those, I have highlighted the following.
- For SQL support, Snowflake added SELECT * EXCLUDE and RENAME possibilities in late November.
- For monitoring and Data Governance, Snowflake now supports sending emails from a stored procedure and added column lineage capabilities to the Access History views.
- For overall data-pipeline capabilities, new features were added: Tasks now support a familiar DAG mode, and Tasks can also be created in serverless mode, where they run on Snowflake-managed compute resources.
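As a quick illustration of the SQL support item above, EXCLUDE and RENAME let you keep a wildcard select while dropping or aliasing columns; the employees table here is hypothetical:

```sql
-- Select everything except sensitive columns
SELECT * EXCLUDE (ssn, salary) FROM employees;

-- Select everything, but alias one column on the fly
SELECT * RENAME (id AS employee_id) FROM employees;
```

Small as they are, these save a lot of tedious column listing in wide tables.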
Snowflake also added support for Delta Lake, and the Query Acceleration Service (QAS) entered Preview. As the documentation describes QAS: "When enabled for a warehouse, it can improve overall warehouse performance by reducing the impact of outlier queries, which are queries that use more resources than the typical query."
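QAS is enabled per warehouse; a sketch, with a made-up warehouse name and scale factor, might look like:

```sql
-- Allow this warehouse to lease extra compute for outlier queries,
-- capped at 8x the warehouse size
ALTER WAREHOUSE reporting_wh SET
  ENABLE_QUERY_ACCELERATION = TRUE
  QUERY_ACCELERATION_MAX_SCALE_FACTOR = 8;
```

The scale factor caps the cost, which matters since the extra compute is billed on top of the warehouse itself.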
Again, if I forgot something, the full list of release notes can be found here.
Azure Synapse Dedicated Pool
I asked this question last year and I'll ask it again: what's the deal with Azure Synapse Dedicated Pools not having released any major architecture changes over the past years? Azure Synapse Dedicated Pool is starting to rot if you look at its scalability and elasticity functionality.
Don't get me wrong. I worked as an Oracle and Microsoft DBA for about nine years in my first consultant job. I have used every SQL Server version from the legacy 2000 up to 2016. I love the Microsoft ecosystem and the things SQL Server has innovated, and I hate to see that the innovation hasn't carried over into Dedicated Pool (SQL Server itself is still alive, kicking, and constantly releasing new features, the latest being version 2022).
I worked for over a year with Dedicated Pools on a customer project, and we finally completed a migration to Snowflake this year. I can say that year was hell for me and the whole team. We had constant scalability issues as our SQL commands queued continuously. Scaling up the DWUs didn't help at all; it only made the system consume more money.
The list of issues with Azure Synapse Dedicated Pools is so long that it would be a separate article altogether. I can honestly say that, if you or your organization is considering Dedicated Pool as a data warehouse solution, call me and I'll tell you where you're headed and why, for example, Snowflake is a far better option.
Anyway, because I want to find something good in everything, I would like to list the new features of Azure Synapse Analytics Dedicated Pool here, but per Microsoft's documentation, the latest improvement was in December 2020, so no go for that.
Google BigQuery
The product that I would use if Snowflake hadn't been invented. The year has been somewhat quiet, but let's look at what's new.
It seems that a lot of the new stuff translates easily into Snowflake lingo, so here we go. For starters, BigQuery entered the now-infamous Delta Lake and Iceberg fuss by adding support for Iceberg tables. The similarities don't stop there: much like Snowflake External Functions, BigQuery can now call GCP remote functions from within SQL commands.
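A remote function declaration looks roughly like this; the project, connection, and Cloud Functions endpoint below are all hypothetical:

```sql
-- SQL-callable function whose body runs in a Cloud Function
CREATE FUNCTION mydataset.enrich_ip(ip STRING)
RETURNS STRING
REMOTE WITH CONNECTION `my-project.us.my-connection`
OPTIONS (
  endpoint = 'https://us-central1-my-project.cloudfunctions.net/enrich-ip'
);
```

Once created, `mydataset.enrich_ip(col)` can be used in a query like any built-in function, with BigQuery batching rows to the endpoint behind the scenes.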
Features already in place in Snowflake have also made their way into BigQuery, like time travel (similar functionality to Snowflake Time Travel and Oracle Flashback) and table cloning (similar functionality to Snowflake CLONE), both now in Public Preview. On top of that, the JSON data type is now supported.
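Both features use familiar SQL surface area; a sketch with a made-up dataset:

```sql
-- Time travel: query the table as it looked one hour ago
SELECT *
FROM mydataset.orders
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);

-- Table clone: a writable copy that shares storage with the source
CREATE TABLE mydataset.orders_clone
CLONE mydataset.orders;
```

As with Snowflake's CLONE, the appeal is cheap dev/test copies without duplicating the underlying storage up front.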
As for regular features, transactions are now also possible in Google BigQuery: you can run multi-statement SQL commands within a BEGIN TRANSACTION block. On the security side, column-level data masking is now also available in Google BigQuery.
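A minimal multi-statement transaction sketch (the dataset, tables, and logic are made up):

```sql
-- Either both statements commit, or neither does
BEGIN TRANSACTION;

INSERT INTO mydataset.inventory_archive
  SELECT * FROM mydataset.inventory WHERE product_id = 42;

DELETE FROM mydataset.inventory WHERE product_id = 42;

COMMIT TRANSACTION;
```

An archive-then-delete move like this is exactly the kind of operation that was awkward to do safely before transactions arrived.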
The full list of release notes can be found here.
Oracle MySQL HeatWave
Last year, Larry Ellison (still the man, the myth, and the legend) released HeatWave. Even though I haven't seen or heard of any installations of MySQL HeatWave, here are the biggest new features added to HeatWave in 2022.
HeatWave now supports Machine Learning with HeatWave ML. HeatWave ML enables you to "train a model, generate inferences and offer machine learning explanations, without extracting data out of the MySQL database. HeatWave ML fully automates the training process and creates a model with the best algorithm, optimal features, and the optimal hyper-parameters for a given data set and a specified task."
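Going by the documentation, the flow is driven by stored routines; this is a hedged sketch (database, table, and column names are made up, and exact routine signatures may differ by version):

```sql
-- Train a classification model on an in-database table;
-- the model handle is returned in the session variable.
CALL sys.ML_TRAIN('mydb.loans_train', 'defaulted',
                  JSON_OBJECT('task', 'classification'), @loan_model);

-- Load the trained model, then score a new table into an output table
CALL sys.ML_MODEL_LOAD(@loan_model, NULL);
CALL sys.ML_PREDICT_TABLE('mydb.loans_new', @loan_model,
                          'mydb.loans_scored');
```

The notable part is that training and scoring happen without the data ever leaving MySQL.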
HeatWave also supports something that is a must in data warehouses nowadays (I'm looking at you, Azure Synapse Dedicated Pool): you can scale the HeatWave cluster without any downtime.
I know this doesn't sound like much, but I will continue to include Oracle and Larry in the list due to my love-hate relationship with the company.
Last year, I named Snowflake the winner as the best data warehouse option, and I'll stick to that conclusion this year as well. As noted multiple times, Snowflake is more than a data warehouse, and the added ML capabilities make the platform a must for companies building data warehouses. Databricks is also a major contender, but their view on the solution pattern differs from Snowflake's, going all in on the data lake pattern. AWS is bridging the gap more and more, and if your company already uses AWS and you're not planning on building an EDW, Redshift might be an option for you. The same can be said about Google BigQuery. Microsoft, on the other hand, is playing itself out of the game, and I cannot recommend Dedicated Pool for new greenfield installations.
I'll end this with the same notes as last year; competition is good for everybody because ultimately we, the users, win.
Enjoyed the article? You should then consider working with us! We are always happy to welcome new Data Engineers and Data Architects to join our kick-ass bunch of data troops! Read more about us and apply👇