Curiosity question: Can someone explain why Redshift goes through the step of hashing the join columns in some joins? I’m not sure I computationally understand what happens there and would be interested in knowing more. Couldn’t find anything good online.
Hey @dylanbaker - this is a great question!
Hash joins can be used when there’s a join on an equality, eg:
select ... from table_a join table_b on table_a.id = table_b.id
it does not work for inequality joins like:
select ... from table_a join table_b on table_a.id > table_b.id
To understand why this is the case, you need to understand Hash Maps.
A Hash Map is a data structure that supports constant-time lookups, meaning: computers can figure out if a key is present in a Hash Map in a single operation! This is not true of all data structures, as we’ll see below.
Another type of data structure is an Array. You can generally think of Arrays as “lists”, but my computer science professors would be mad if I told you the two terms were interchangeable.
Arrays are indeed used to maintain an ordered collection of things. Computers can return the element at a specific index in an Array in a single step. Imagine you have an Array like this:
["Apple", "Banana", "Pear", "Orange", "Grapefruit"]
Each element in the Array has an index:
Given this Array, a computer can tell you that the 3rd element in the Array (index=2) is
Pear in a single operation. It’s harder for computers to answer other types of questions with Arrays. A question like
is there an element named
will require the computer to look at every single element in the Array. You may be able to look up the element at an index in a single operation, but finding an element by its value requires one operation for each element! That means that an Array with
N items requires
N operations to find an element, so we call searching Arrays for elements a “Linear” operation. If the Array doubles in size, then it will require twice as many operations to find an element.
Detour: Sorted Arrays
Bear with me here - I know this question is about Hash Maps!
We just saw that finding an element in an Array is a “linear” operation, but this isn’t always the case. Instead, imagine that we sorted our Array of fruit. Sorted alphabetically, it would look like:
If you know that the values are sorted, then you can start in the middle of the Array and repeatedly bisect it until you find (or don’t find) the value you’re looking for. A search for “Orange” would look like:
Start in the middle
–> index=2, element=Grapefruit
“Orange” comes after “Grapefruit” alphabetically, so we know that if Orange is in the Array, it’s in the second half of the list.
Pick a new index between the current index (2) and the last index in the Array (4)
–> index=3, element=Orange. Found it!
In this example, we only needed to check two elements instead of four! This algorithm is called “Binary Search”, and it’s a great way to find elements in a sorted Array. Since you can cut the search space in half with each guess, the number of guesses you need grows “logarithmically” with the size of the Array. Therefore, doubling the size of the Array only requires one extra guess! We’d call this a
log n search function, since the number of guesses you need to make is about
log_2(N), where N is the size of the Array. This is super handy if you have thousands or millions of items in an Array, and you need to find one in particular.
The big idea here is that by spending a little extra time sorting the list, we can vastly improve the performance of our search functions. So, onto Hash Maps…
Hash Maps (for real this time)
Let’s imagine that we again have a series of Fruit:
["Apple", "Banana", "Pear", "Orange", "Grapefruit"]
Recall: Arrays can tell us the element found at an index in a single operation. Hash Maps are a way of exploiting this property of Arrays to find an element by it’s value in a single operation. To do this, we need to convert the value of an element into its index. We can do that using a hash function.
You might be familiar with hash functions like MD5, but that’s only one type of hash function. A different type of hash function might accept a string (like “Apple”) and return an integer between 0 and 100. This function might work by assigning a number to each letter of the alphabet, then doing some clever math to produce a value for the input string.
A hash function will always produce the same value for a given input, so if Apple is hashed to 75 once, it will always be hashed to 75 using the same algorithm.
With our hash function, we can produce numbers between 0 and 100 for different types of fruits. In the above example, that might look like:
Next, we can create an Array with 100 elements, initialized to be totally empty. Further, we can place each element at the index indicated by the hash of it’s value. That would look like:
In the above example, the index of each element in the Array is determined by hashing it’s value. Just as we saw in the Sorted Array example, building this Hash Map takes time, but it pays dividends! With this data structure, we can determine if elements are present in the Hash Map in a single operation! To determine if an element is present in a Hash Map:
- Calculate the hash of the element you want to find to produce an index
- Get the element at that index
If the element is present, then , you found it! If there’s nothing there, then you know the element isn’t present in the Hash Map. A couple of quick examples:
- Is Banana present?
- hash(Banana) = 52
- lookup(52) = Banana, present!
- Is Falafel present?
- hash(Falafel) = 40
- lookup(40) = NULL, not present!
Pulling this back to the database world: Building a Hash Map takes some time, but then you can do a constant time lookup for every comparison in the join! I think databases typically hash the smaller of the two tables in a join, then they iterate through the bigger table, hashing each value and consulting the Hash Map to determine if the row should come through the join or not. This type of join becomes effectively “linear” in complexity with the size of the bigger table, which is pretty good!
If the database couldn’t do a Hash Join, it would instead need to do a “Nested Loop Join”. This would require the database to check every value in the left table against every value in the right table. The complexity of a Nested Loop Join would be “quadratic”, in that you need to do about N*N (or N²) different operations to process the join. Not great! Nested Loop Joins don’t hold up when you’re joining million-row tables together – your database might end up needing to complete trillions of operations to execute that join! Compare that to a logarithmic algorithm, where log_2(1000000) is close to 20
So, this touches on some other topics that I won’t dig into here, but that are definitely deserving of future posts:
- We saw that sorted datasets support faster lookups, which should give you a sort of intuition for Sort Keys on Redshift, for instance.
- Databases like Redshift use “statistics” to determine which of the tables in a join is “smaller”. That’s part of why it’s so important to run
- In some cases, spending time pre-processing a dataset can pay dividends. While building a Hash Map is time consuming, a Hash Join will be faster than a Nested Loop Join for any moderately sized dataset.
- There’s no such thing as a free lunch! A Hash Map requires memory space, which you’re trading in exchange for performance. It’s hard to have it both ways, and generally optimizing for one will require sacrificing the other. Tradeoffs!
This ended up being more of a computer science flavored answer then a database-specific one, but I think it’s super important to build intuition like this. There are heaps () of other data structures and algorithms out there that Databases make use of, and a basic understanding of the science can really help you reason about things like database performance and optimizer decisions.
 I didn’t really spend time explaining this, but inequality joins don’t work with Hash Maps because you’re not checking for item presence in a list, you’re instead computing an expression. While we can hash a date like
2018-01-01, we definitely can’t hash an expression like
date >= 2018-01-01. These queries will probably be executed using Nest Loop Joins.
 I said “single operation”, but “hash” and “find by index” are two operations! When we talk about algorithm complexity, we frame it in regards to the size of the dataset. More precisely than “a single operation”, you would say that the search occurs “in constant time”. The amount of time/steps required to find an element in a Hash Map does not change with the size of the dataset!
Some further reading:
This is awesome. Makes complete sense. Thanks!
(Has prompted another question relating to the efficient way of joining on inequalities. I’m going to ask that in another post.)