The State of Cloud Data Warehouses - 2021 edition
With the recent addition of Firebolt and Amazon Redshift Serverless, and the dispute between Snowflake and Databricks, now is a reasonable time to look back at what's happened this year in the Cloud Data Warehouse scene. In this article, our Data Architect, Mika, digs into the current state of the different Cloud Data Warehouse products and how they differ from one another.
Written by — Mika Heino, Data Architect
With the recent addition of Firebolt and Amazon Redshift Serverless, and the dispute between Snowflake and Databricks, now is a reasonable time to analyze the current state of Cloud Data Warehouses (or Data Platforms for some people).
In this blog post, I will examine, at a high level, how the different data products differ and highlight some of the features recently introduced for each of them.
What's new from team Selipsky? - Redshift
If you followed AWS re:Invent 2021 last week, you might have noticed that several AWS services got a "Serverless" upgrade, including Redshift. Although Redshift Serverless is still a preview feature, we received extensive documentation and a blog post about it. With Serverless, Redshift has added more options (besides the RA3 instances) for separating compute and storage. According to the AWS documentation, "with Amazon Redshift Serverless, you pay separately for the compute and storage you use. Compute capacity is measured in Redshift Processing Units (RPUs), and you pay for the workloads in RPU-hours with per-second billing." In other words, Serverless offers an endpoint to point your queries at, while AWS handles the needed compute power under the hood.
This feature still doesn't eliminate the need to tune queries, define partitions, or do DBA tasks with Redshift, but you do get more of the ease of use that has been missing from Redshift. You define the base RPU capacity, which in turn determines the minimum vCPU power and memory: one RPU provides two vCPUs and 16 GiB of memory. Amazon Redshift then uses this base RPU value to limit the resources (and thus the cost) used for your workload, and you can always increase the value to improve query performance. If you have already invested in Redshift or fully embraced AWS in your Cloud Data Warehouse architecture, this feature is a must for you.
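The RPU-hour model above is easy to reason about with back-of-the-envelope math. A minimal sketch, assuming a placeholder rate of $0.375 per RPU-hour (a hypothetical figure for illustration, not an official AWS price):

```python
def serverless_compute_cost(rpus: int, runtime_seconds: int,
                            price_per_rpu_hour: float = 0.375) -> float:
    """Cost of a workload that held `rpus` RPUs for `runtime_seconds`.

    Compute is billed in RPU-hours with per-second granularity, so we
    convert seconds to hours before applying the (assumed) rate.
    """
    rpu_hours = rpus * runtime_seconds / 3600
    return round(rpu_hours * price_per_rpu_hour, 4)

# A query burst running for 90 seconds on a 128-RPU base capacity:
print(serverless_compute_cost(rpus=128, runtime_seconds=90))   # 1.2
# The same capacity held for a full hour:
print(serverless_compute_cost(rpus=128, runtime_seconds=3600))  # 48.0
```

The per-second billing is the point: you pay for 90 seconds of 128 RPUs, not for a full provisioned hour.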
A new player has entered the game - Firebolt
Firebolt is the new kid in town, arriving with the promise of being Snowflake but with add-ons that make it better and faster. Unfortunately, Firebolt doesn't offer a demo account to test (though you can order their awesome SQL marketing t-shirts), so I'm at the mercy of Firebolt's marketing material. From the provided material, you can already identify Firebolt's capabilities as similar to core Snowflake, but with the additional possibility of adding indexes and tuning queries.
Firebolt also brags about its better offering for compute nodes. However, the similarities between Firebolt and Snowflake are so extensive that we need to go beyond Firebolt's comparison chart. The Snowflake ecosystem has more features than Firebolt - e.g. Data Sharing and Snowpark capabilities - making Snowflake more than a cloud data warehouse. So comparing Firebolt and Snowflake is not fair after all: Snowflake is the clear winner when considering its entire ecosystem, but Firebolt becomes a valid challenger when comparing core EDW features.
Is Databricks now the fastest kid in town?
If you follow Snowflake or Databricks on LinkedIn, you might have noticed the feud between the two over the TPC-DS benchmark world record. You are, of course, entitled to your own views and opinions about the dispute. However, in my opinion, Databricks is focusing on the wrong things by highlighting query speed as the main reason to choose a DWH platform.
Speed is just one factor. With money, you can get any major data warehouse product to fly; what ultimately matters is the product's ease of use and scalability. When developing at the speed of 2022, you don't want to operate products that take time to give you compute power or require you to adjust multiple parameters to get the best out of them. Of course, I want those parameters for partition/clustering keys to exist somewhere in the product. However, I want the possibility to adjust them without jeopardizing my query speeds.
Ultimately, Databricks has taken a different architectural route than Snowflake, since all the data is stored in either Delta Lake or Azure Data Lake. Therefore, comparing Databricks and Snowflake is like comparing apples and oranges. Of course, you can use external tables in Snowflake, but most of the time you load all the data into Snowflake storage. Furthermore, Databricks doesn't offer any compute that would be "ready" to use with this architecture model.
You always have a wait time, although Databricks introduced a Serverless SQL endpoint feature this fall to mitigate the issue. With this feature, you have instant compute capability available. Unfortunately, Serverless SQL is not as elastic as Snowflake's virtual warehouses (for example, the auto-shutdown time is 10 minutes), but I hope Databricks will develop the feature further in the future.
Going beyond data warehousing - Snowflake
The core product shouldn't be messed with when you are already at the top. This is why Snowflake hasn't touched its core features, as they work like a charm. However, the most significant update from Montana comes with Python support in Snowpark: you can now run Python code inside Snowflake (something that would have been mind-blowing a few years ago). With the new Snowpark product, Snowflake has entered the Data Science scene with a bang.
Nonetheless, since we are only comparing the classic Data Warehouse capabilities of the products, I will leave Snowpark out of this comparison. I will just say that no significant changes have been made to core Snowflake, simply because none are required.
What's happening at Redmond? - Azure Synapse
So Ignite happened, but we didn't hear any major news regarding Azure Synapse Dedicated SQL pools. However, Microsoft released several minor updates to their great Synapse ecosystem. The most significant update for big enterprise customers is that Delta Lake is finally GA. Other updates include new tools that offer better support for log and telemetry analytics, better Spark support, and database templates for users who don't want to build their databases and schemas from scratch.
I'm a massive fan of the Azure Synapse ecosystem, but I'm a bit worried that we haven't heard any significant updates to Dedicated SQL Pools. When building a modern EDW, you want compute and storage to be separated. However, Dedicated SQL Pools still use a lot of legacy code from SQL Server, making the product a bit old-fashioned compared to the competition. Since compute and storage are not decoupled in Synapse, you can't, for example, dedicate separate compute to the ETL/ELT process or to a BI product. When you query something from Dedicated SQL pools, you share the resources with everyone else. Scaling the DWUs will help, but it comes at a cost.
I really hope that we will hear something new during the next Ignite. Otherwise, I fear that Azure Synapse will lose the race for the best Cloud Data Warehouse product.
Ready for EDW use? - Google BigQuery
Google BigQuery is an excellent product, or so I have heard. I have tried BigQuery a few times, and indeed, the product is ideal for analytical use cases. However, I am not sure whether you can build a traditional EDW using Google's tools. BigQuery has previously lacked a lot of basic functionality, although much of it has now been added to the product. For example, this year Google added such basic SQL functionality as "CREATE SCHEMA", "CREATE VIEW", "GRANT", "REVOKE", and "CASE", as well as row-level security, to BigQuery. Anyone who understands our lingua franca, SQL, may realize why I have been skeptical about the product. I hope that Google keeps adding features to BigQuery, because the product is one of the few - in addition to Snowflake and Firebolt - that has been built from the ground up for the cloud and thus is not burdened by legacy code.
In the meantime, Google is promoting BigQuery Omni, which went GA this fall and is Google's answer to scattered data. With Omni, you can query data through the BigQuery interface regardless of whether it lies in AWS or Azure. It's a nice little addition to Google's overall analytics offering, but it doesn't enhance core BigQuery in any way.
Larry Ellison: the man, the myth, the marketing genius - Oracle
If you haven't already heard, Larry proclaimed that "Oracle Claims Leadership in Cloud Datawarehouse Space Over Snowflake". Honestly. All jokes aside, I'm always somewhat biased when it comes to Larry and Oracle. People belittle them for being the "go-to choice" for old-school on-premise stuff. What those people can't see is that Oracle is still relevant - just look at the ORCL ticker. However, although Oracle is claiming space, it's not in the area where we, the data hipsters, are.
Another reason why you shouldn't belittle Oracle is that Snowflake, Firebolt, and Amazon are all standing on the shoulders of a giant - that giant being billionaire Lawrence Joseph Ellison, who hasn't (yet) gone to space like the competition. The Oracle Corporation was one of the first companies to introduce commercially available SQL-based RDBMS software, and it developed many of the major features we take for granted nowadays.
So what am I babbling about? Oracle has released a new version of MySQL, which should be news for us Finns. The latest version is called HeatWave, and it can be considered a Snowflake "killer". According to the marketing material, HeatWave can handle both OLTP and OLAP workloads. However, the material does not go into detail on how HeatWave differs from a traditional MySQL Cluster. Instead, you get answers to questions like "Do customers benefit from real-time analytics?", where HeatWave is deemed better than Snowflake because you are forced to have (sic) an ETL process that loads data into Snowflake from a transactional system. According to Oracle, you don't need ETL processes at all, as HeatWave supports both workloads. I believe that MySQL can be an excellent OLTP engine, but I'll be eagerly waiting for the first case of HeatWave being used for EDW purposes.
Competition is getting hot
It's nice to see that the competition is heating up and that Snowflake is getting more solid competitors in the Cloud Data Warehousing and Data Platform space. I believe that Amazon will ultimately develop an MPP offering that matches Snowflake, and that the silence at Microsoft will eventually end.
Competition is good, since we, the users, will ultimately be the winners. But until those beliefs manifest, the current situation for Snowflake is best summed up by Elton John's awesome '80s comeback song: "I'm Still Standing".
The situation in overall cloud competition is an entirely different story. In that space, Snowflake, Firebolt, and the other players are still niche players, as AWS, Azure, and Google dominate with their overall cloud infrastructure offerings.
Enjoyed the article? Then you should consider working with us! We are constantly looking for Data Engineers and Data Architects to join our kick-ass bunch of data troops! Read more about us and apply 👇