Choosing a data warehouse


#1

I’ve been reading up on the differences between BigQuery and Redshift. I’m curious how you all would decide which is appropriate for an organization that’s just getting set up with a data warehouse. Also, when would Snowflake be appropriate?


#2

There are a few key factors:

1. Where is your existing infrastructure?

If your business has everything in AWS, it can simplify your life to just use Redshift. If you’re planning to go the full Google Analytics 360, you might prefer BigQuery for its native data flow from GA to BQ.

2. How much do you like managing infrastructure/how much infrastructure flexibility do you want?

BigQuery abstracts out all hardware considerations. Just load the data and go. But the Google Cloud ecosystem isn’t as robust as AWS, so there isn’t as much flexibility with respect to using the platform to move the pieces around versus having to write scripts to do that. With Redshift, it’s relatively seamless to move data from Redshift to EMR, or to use Kinesis to stream data into your DB, but then you have to manage the cluster yourself.

3. How much data?

If you don’t have a ton of data, a pay-as-you-query setup like BigQuery might save you money (but might cost you more money longer term if and as your data volume grows), whereas there is a bit of a cost floor for Redshift.


I’m not a fan of the variable cost model of BigQuery, but I don’t want to spend time managing infrastructure, and my company’s got the full GA360 package, so BigQuery makes the most sense for us.

Regarding snowflake, I don’t think there’s a time where it’s “too early” to consider it. If you’re at a volume now where Postgres doesn’t cut it as a data warehouse, and you’re going to go to an MPP system like Redshift or BigQuery, then you should be considering Snowflake now. I implore you to decide which system you want 5 years from now and go with that now. Migration can be painful, and should be avoided whenever possible by choosing wisely up front.

Good luck!


#3

Thank you for the thoughtful response @michael.dunn - very helpful.


#4

At Fishtown Analytics, we’ve worked with Postgres, Redshift, Snowflake, and BigQuery. Ultimately, each database has its own strengths and weaknesses.

@michael.dunn makes a great point above regarding your existing infrastructure. If you’re on GCP, BigQuery probably makes a lot of sense. The same goes for AWS/Redshift. I don’t think this is a hard-and-fast rule though, as Snowflake isn’t offered by a particular cloud provider, and most off-the-shelf ETL services can pipe data across clouds.

My main insight here is that these databases are more similar than they are different, and all (except Postgres) are acceptable choices for a first data warehouse. I think some choices are better than others, but ultimately, there’s a good reason for using each of them. The primary differences between these databases occur at an architectural level, and a summary of each of these is given below:

Postgres

Postgres is a relational database, meaning that it stores rows of data. It is fantastic for transactional application workloads, like selecting a single record from a users table, or updating the status of a couple of orders.

Analytical queries tend to look very different than these “transactional” queries. Instead, you might do a count(distinct) on a whole column of data with no filters applied, or join four tables together and group by half a dozen fields, aggregating the rest. Postgres is not designed to handle queries like this, and you’ll probably find that queries take forever to return unless indexes are liberally added to the columns that you join on.

To Postgres’s credit, it will serve you well for analytical queries until your tables contain millions of records. My thinking here is generally that your data will grow in volume. You can either migrate to a more capable analytical warehouse early (when it’s not strictly necessary yet), and provide yourself a whole lot of runway. Or, you can wait until your data (and team!) outgrows Postgres, and you are forced to migrate. I personally would prefer to be in the former situation, as migrating warehouses is very possible, but can be very not fun.

Redshift

Amazon Redshift is a columnar database that is based on Postgres. Postgres and Redshift feel pretty similar - they both use the same protocol, and share (mostly) identical syntaxes.

Redshift stores data in columns (not rows), meaning that a given query will only read the columns required for its execution. This means that operations like full table scans are not only doable, but totally reasonable. You would probably have a bad time doing a full table scan on a large Postgres table.

A Redshift database is a cluster of worker database nodes. These nodes are able to split up a query, execute its constituent parts in parallel, then combine the results. As a result, you can horizontally scale performance by adding nodes to your cluster. If you check out the Redshift pricing page you’ll see that adding a node will give you more processing power, as well as more storage.

This is sort of a problem though: sometimes you need a lot of storage space, but you don’t need the extra processing power. The converse may also be true. This coupling of “storage” and “compute” is one of the main drawbacks of Redshift compared to the other databases listed below. As data teams push the limits of Redshift, they find themselves “resizing” their cluster to add another node every couple of months. This can be a painful and expensive process, and it gets pretty old pretty quickly.

Redshift comes with a bunch of configuration to optimize query performance like sort keys and dist keys. You can also configure compression for your columns, or perform table maintenance tasks like vacuum and analyze. It ends up being close to a full-time job to keep Redshift humming along, and honestly, I think that time could be better spent actually doing analytics. The databases below don’t have as many bells and whistles, but you end up not really missing them. Maintaining Redshift can be a real chore.

Redshift is pretty easy to set up and is integrated deeply into the AWS ecosystem. You can natively pipe in data from S3, Kinesis, GLUE, etc. If using these tools is high on your priority list, then Redshift might be a good choice for your organization.

BigQuery

BigQuery is Google’s take on a distributed analytical database. Whereas in Redshift you might have six or eight compute nodes, BigQuery will throws hundreds or thousands of nodes at you query. Further, storage on BigQuery is effectively infinite, and you just pay for how much data you load into and query in the warehouse. This means that truly insane amounts of data can be queried with pretty astounding query latency. This video is the best resource I’ve found on BigQuery under the hood; definitely give it a quick look if you’re interested in using BQ.

So, BigQuery can process really big datasets really quickly, but it of course comes with some caveats. BigQuery is really, very bad at doing joins. The recommended way to to handle one-to-many relationships is through repeated records. Repeated records are a radically new data storage pattern that Google just decided to make up. These repeated records are incredibly useful and they work very well, but you’ll need to internalize this feature of BigQuery to make effective use of its power.

I have a pretty good intuition for SQL and how databases work, but really, it just doesn’t translate to BigQuery. I think that BQ is a great choice for a database, as long as the folks in your organization are willing to rewire their brains to take advantage of its features and access patterns.

BigQuery has some annoying artificial limitations. Date partitioning is crucial for performant querying, but only something like 2,300 date partitions can be created at a given time. My bet is that limitations like these will become more flexible (or disappear) over time, but keep in mind that GMail was in Beta for five years, and Google’s gonna do Google things.

Snowflake

Snowflake is situated as a sort of happy medium between Redshift and BigQuery. It generally looks and acts how you would expect a database to behave. If you have existing intuition around analytics, it will probably translate to Snowflake pretty readily. Check out this article for more information on migrating from Redshift to Snowflake.

Snowflake is more similar in architecture to Redshift than BigQuery. Crucially though, its storage is decoupled from its compute. With Snowflake you pay for 1) storage space used and 2) amount of time spent querying data. Snowflake also has a notion of a “logical warehouse” which is the “compute” aspect of the database. These warehouses can be scaled up or down to deliver different grades of performance. You can also configure the number of compute nodes to parallelize query execution. These warehouses can be configured to “pause” when you’re not using them. As a result, you can have a super beefy warehouse for BI queries that’s only running when people are using your BI tools, while your background batch jobs can use cheaper hardware.

Snowflake doesn’t have all the performance optimization bells and whistles of Redshift, but you end up not really needing or missing them. Snowflake’s query optimizer is substantially better than Redshift’s, and I find myself not really thinking about performance considerations when writing SQL for Snowflake.

Final thoughts

I prefer using Snowflake and BigQuery over Redshift because I can mostly just write the logic I want to implement without regard for how the database will execute the query behind the scenes. On Redshift, you have to think about column compression, sort keys, dist keys, disk utilization, or vacuum/analyze statements. ERROR: Disk Full is a thing you have to deal with on Redshift, but that’s just not a class of error that can occur on BigQuery or Snowflake.

It is absolutely possible to migrate warehouses, but it can definitely be a big pain. The best advice I can give is: conduct analytics in a way that preserves your optionality. Avoid writing custom scripts to pipe data into Redshift; instead, use a tool like Singer that will pipe data to any of the warehouses listed above. Be judicious when using non-standard features of a warehouse (like Postgres GIS plugins, or BigQuery javascript UDFs). To be sure, you should use your warehouse to its fullest potential, but definitely be mindful of how these decisions impact your switching cost. Today, Redshift feels old and crufty, but it was the best-in-class analytical warehouse only 6 years ago. This industry moves fast!

The good news is that technology in the data space is becoming more modular, and databases are continuing to coalesce around a standard-ish version of SQL. If you’re just getting started with a warehouse, really any of Redshift, Snowflake, or BigQuery are likely appropriate. I think that choosing Snowflake or BigQuery is probably a more forward-thinking decision, and they’re my preference absent any other business-specific considerations.

This is just one analyst’s opinion though. Here’s some further reading:


#5

I’m curious to here more from Big Query users about the difficulty of joins. From the benchmark posts above, it seems like this isn’t that big of an issue, but @drew seems to suggest otherwise.


#6

Is it fair to say that this mainly applies at some level of scale where performance issues come into play? For example, when I write one-to-many queries in BigQuery, I write them without giving any thought to performance and have never noticed any significant issues. For a simple example, this query takes about 4 seconds:

select com_id, count(tag_id) as tags
from mysql_app.app_companies 
left join mysql_app.app_companies_tags on tag_com_id = com_id
group by 1

However, it may simply be because we’re working with millions of rows in these tables, not billions.

If that’s the case, how would you decide when it’s worth taking time to optimize for performance, either in BigQuery specifically, or any of these data warehouses?