What are the best practices to partition big tables in Redshift

In Redshift we often need to split a huge event table into different smaller table. The objectives are to increase performance and facilitate end users analytics queries. Let’s also assume each event row belongs to an event type and that we have time series data.

What are the best practices regarding such an endeavor and the common pitfall to avoid?

Hi Guillaume!

Let’s call this idea “soft-partitioning” - this would be when instead of keeping all rows for an entity in one table, you split the table into “partitions” and create a view to union them together so that they can be queried as a single entity. In some databases, partitioning is supported natively, but in Redshift that is not the case.

In the example you’ve provided, let’s start by imagining that all event types have the same set of columns, and any extra data has been normalized into other tables, and you’re okay with keeping those per-event fields outside of the soft-partitioned table.

In that case, we would have a set of tables (let’s call them a,b,...,z) containing m columns, which can be accessed like:

SELECT shared_col_1, shared_col_2, ... shared_col_m
FROM a

The main question we need to answer in this case is “how do we partition”, aka what do a, b, c, etc represent?

Well, this really depends on your data-generating-process, and your usage pattern. Unfortunately, making an informed table optimization decision also requires going down some pretty deep rabbit holes about Redshift, so let’s dive right in.

Interlude: Redshift Architecture

At the file-system level, Redshift logically subdivides tables into columns, and columns are logically divided into sorted and unsorted regions, and these column-regions are divided along a dist-key and distributed among the slices of the cluster and split up into 1MB blocks of compressed data. When redshift performs a read, it checks the min/max values per block, and determines which block(s) can be skipped.

Not only that, these blocks are immutable. if a single value in a block changes, the whole block is invalidated. so, if a last_name column is updated for a person who changed their name, the 1MB block containing that last_name value is marked as invalid (“tombstoned” in Redshift parlance) and a new block is created, and this block-replacement happens, implicitly, within a transaction.

Not only that, if the data was sorted before, it’s possible that the update caused the block to become unsorted (such as when updating a sortkey column). Now your table needs to be vacuumed and analyzed again. If it’s a big table, then you may be facing an unfortunate trade off between read performance for that table on one hand and table-write-availability and cluster performance on the other. Vacuums are expensive, and they prevent writes on the target table while the vacuum is writing.

Similar considerations should be taken for inserts - you choose the “order” of your data, and if possible, you want to insert “in order” as often as possible to avoid data being written to the blocks in the unsorted region - the less sorted your data is, the higher the percentage of data that needs to be scanned to complete your query. Again, on a big table, we care about that!

As a final topic in this Redshift architecture deep dive, let’s consider how redshift deals with views of unioned tables.

it is the case that when you have a view defined like

CREATE VIEW breakfast_events AS

SELECT 'make coffee' AS event_type, coffee_events.* FROM coffee_events

UNION ALL

SELECT 'make eggs' AS event_type, eggs_events.* FROM eggs_events

UNION ALL 

... etc

Then Redshift’s query planner “knows” that if you select from that view WHERE event_type = 'make eggs' that it does not actually need to execute the view as written – it doesn’t even need to look at block statistics to know, with total certainty, that 'make eggs' events come from the UNION’ed subqueries for which the event_type was hardcoded to 'make eggs'. If we did not hardcode this value (such as if event_type were pulled from the underlying table), then redshift would need to depend on its statistics or possibly scan all data to make that determination.

So, when we hardcode the “partition” value into the subqueries for this view, we give redshift’s planner perfect information to act on.

Putting it together

So, with that information, we can combine our knowledge of Redshift with our understanding of the problem domain.

Date-based partitioning: A Classic

Traditionally, data is partitioned into sets of “hot” and “cold” tables. In most circumstances, this means monthly partitions, because most arriving events will be from the last month or two, and anything else is a freak occurrence or is a planned backfill. It’s a solid strategy for keeping your data in sort order. In particular, even if you have event_type in your data, one could argue that it’s rare to look at just one event over a very long period of time - usually you look at a session and all events associated with it. For most systems, sessions last span less than 24 hours more than 99% of the time. Therefore, partitioning your data by month makes a decent amount of sense!

But if the data is really large, you might not actually have enough granularity for “highly selective” scans – depending on the sort order of the blocks within a partition, you might not skip any blocks at all, and the scan could still be slow, because for any growing product, this month and last month will have many more events than most prior months, especially if growth is exponential. So let’s say in order to ensure selectivity, we make the first sortkey column of the table event_time.

But wait, what if we want to filter on a specific event_type? We’re going to have to do a full scan of all partitions within the specified date range. if you are looking for a certain event, you may not know the exact time, and could be stuck doing full scans of multiple partitions.

You could use event_type as the first sortkey column, but the you may lose selectivity on event-agnostic queries. Does that matter? It depends. if you have enormous data, such that each event contains many blocks, then making event_time a secondary sortkey could still provide decent selectivity. But that would likely have to be on the order of 10s of millions of events per event type to see appreciable benefit. If you’ve got that, then try it!

categorical partitioning - worth considering

If instead, you partitioned by event_type, you can then sort your data by event_time without any guilt. on any time-selective query (event_time between x and y), you’ll get good block selectivity as long as the event_type has a lot of data (and if an event type doesn’t have a lot of data, frankly, you don’t really have a scale challenge for that event type, so it is moot - if that event type grows, then the date range per block will tighten, and selectivity will improve).

Furthermore, if an event arrives late, it goes into a dedicated unsorted region for that event – you’re less likely to receive a similar benefit in date based partitioning, especially if the sortkey is event type — in that case, basically all data will arrive out of sort order, and so every partition will need to be frequently vacuumed. You’d have to choose between offering good event_type selectivity on one hand, and frequently vacuuming on the other. No such trade-off exists for categorical partitioning with a time based sortkey.

In other words, more needs to go right for classic date based partitioning to provide more benefits.

Extending the problem: varying data structure across categories

So far, we assumed that all events have the exact same fields. But that often is not true. What if, as before, each table has m shared columns, but also some varying number of event specific extra columns? let’s express those using a notation like extra_col_1, ... extra_col_n_<table> such that table a has n_a extra columns, and table z has n_z extra columns.

Another challenge is that those extra columns don’t represent the same things. In snowflake or bigquery the best approach is to pack those into a variant (semistructured) column for the columns that don’t match and call it a day.

In redshift you might get away with json strings if the number of extra attributes is small and sparse, but it just isn’t the redshift way.

Nor would it be great to just add every extra column to each table – you’d have m + n_a + n_b + ... + n_z columns, which could get large quickly – remember, each column is allocated at least one block per slice, and on a cluster with hundreds of slices, you could easily allocate tens or hundreds of gigabytes to completely empty columns.

But storage is cheap on RA3 nodes - that’s true, but there is a performance cost – those blocks still need to be allocated and written to disk, even if they’re empty. adding tens or hundreds of sparse columns can significantly impact write performance. For that reason, it’s worth considering another option.

“virtual” columns

Remember how the word “virtual” was cool in the 90’s? I have news for you, virtual is still cool. Since we’re going to wrap the union-of-tables in a view, the tables don’t actually need to each have all the same columns – you can hardcode them as null in the view definition for the tables which don’t have the particular columns in question.

Example

Let’s continue with the notation from earlier: for events a,b,c,...,z, there would be corresponding tables a,b,c,...z, and each table can be said to have m shared columns and n_<table> extra columns (eg n_a, n_b,...).

So, they’d be UNION'ed like:

SELECT 
​
	a.shared_col_1, a.shared_col_2,...a.shared_col_m,
​
	a.a_extra_col_1, a_extra_col_2, ... a_extra_col_n_a, 
​
	NULL AS b_extra_col_1, NULL AS b_extra_col_2,..., z_extra_col_n_z
FROM a
​
UNION ALL 
​
SELECT 
	b.shared_col_1, b.shared_col_2, ..., b.shared_col_m,
​
	NULL AS a_extra_col_1, NULL AS a_extra_col_2, ... a_extra_col_a_z,
​
	b.extra_col_1, b.extra_col_2, ..., b.extra_col_n_b,
​
	NULL AS c_extra_col_1, NULL AS c_extra_col_2, ..., NULL AS z_extra_col_n_z
FROM b
​
UNION ALL
​
... /* union tables c through x */
​
UNION ALL
​
SELECT z.shared_col_1, z.shared_col_2, ..., z.shared_col_m,
​
NULL AS a_extra_col_1, a_extra_col_2, ... x_extra_col_n_x,
​
z.z_extra_col_1, z.z_extra_col_2, ..., z.z_extra_col_n_z
FROM z

^That’s how I’d do it. There are still some open questions, which I’ll briefly touch on.

That seems like a pain to implement…

yes. You’ll definitely want a macro for this. I’ve said it before that in DBT, union macros are easy, but I’d find myself swallowing my words in implementing this. It’s not extremely complicated, but it’s not trivial.

What will you need above and beyond the standard loop over the table list?

the adapter object has methods to get all the columns of a table. if you do this for all tables, you can:

  • convert the lists into sets, and the intersect all of them to get the set of shared columns.
  • take the set of shared columns and perform a set difference from each of the column sets to get the sets of extra columns per event type
  • use adapter to get the data types of the extra columns – this is (or at least once was) necessary because redshift NULL values are varchar by default – the hardcoded nulls (eg NULL AS a_extra_col_1) you need to cast them to the correct type, so you’ll need to pull all the data types and match them to the generated lists of extra columns
  • do some complicated looping to insert the non-null extra columns in the right place (for tables b to x, the extra columns will sit somewhere in the middle of the column list)

What about distkeys?

well, that’s kind of a different question! I’ll always answer “what should my distkey be” with “what are you joining to?”. Usually, events are joined to users. At this point, I’d recommend checking the skew of the proposed distkey. That’s a tutorial in and of itself so I’ll leave it as an exercise to the reader. But if you aren’t going to be consistently joining to one table, or if the skew of the join column is highly skewed, use DISTSTYLE EVEN and call it a day!

Whew okay that’s my answer. Happy Trails!

P.S.

Dear Reader:
If you’re a redshift engineer reading this, and you would like to issue a correction, please by all means do so!

(Also note: this reply is a rephrase of a conversation we had in DBT slack, sorry for the length!)