I recently had the pleasure () of flattening an EAV model in dbt.
What is an EAV model?
Let’s just lift the description from Wikipedia:
Entity–attribute–value model ( EAV ) is a data model to encode, in a space-efficient manner, entities where the number of attributes (properties, parameters) that can be used to describe them is potentially vast, but the number that will actually apply to a given entity is relatively modest.
Let’s take an example of a products
model. In a relational model, you might have a products
table like so:
dbt_swag_shop.products
id | name | price | brand | size | color | width_mm | volume_ml | is_bpa_free |
---|---|---|---|---|---|---|---|---|
1 | dbtshirt | 2 | Asos | S | Grey | NULL | NULL | NULL |
2 | Logo Sticker | 3 | StickerMule | NULL | Orange | 50 | NULL | NULL |
3 | dbt water bottle | 1 | Purist | NULL | Orange | NULL | 650 | True |
We can see here that the relational model doesn’t extend very well – each product has really different attributes, so this table would become really wide over time.
In an EAV model, this would instead be expressed through a series of tables like this:
dbt_swag_shop.entities
entity_id | entity_code |
---|---|
1 | product |
2 | product |
3 | product |
dbt_swag_shop.attributes
attribute_id | attribute_code | data_type |
---|---|---|
1 | name | string |
2 | price | integer |
3 | brand | string |
4 | size | string |
5 | color | string |
6 | width_mm | integer |
7 | volume_ml | integer |
8 | is_bpa_free | boolean |
dbt_swag_shop.values_string
value_id | attribute_id | entity_id | value |
---|---|---|---|
1 | 1 | 1 | dbtshirt |
2 | 1 | 2 | Logo Sticker |
3 | 1 | 3 | dbt water bottle |
4 | 3 | 1 | Asos |
5 | 3 | 2 | StickerMule |
6 | 3 | 3 | Purist |
7 | 4 | 1 | S |
8 | 5 | 1 | Grey |
9 | 5 | 2 | Orange |
10 | 5 | 3 | Orange |
dbt_swag_shop.values_integer
value_id | attribute_id | entity_id | value |
---|---|---|---|
1 | 2 | 1 | 2 |
2 | 2 | 2 | 3 |
3 | 2 | 3 | 1 |
4 | 6 | 2 | 50 |
5 | 7 | 7 | 650 |
dbt_swag_shop.values_boolean
value_id | attribute_id | entity_id | value |
---|---|---|---|
1 | 8 | 3 | TRUE |
NB: The values are split into a table for each data type, since databases can’t store a column that contains both strings and decimals.
The EAV approach offers flexibility, and makes it easy to get data in! A computer can easily read this data, but for us mere mortals, it’s a little harder to instinctively understand, say, what the name of a product is.
These days, it’s much more common to store these sparse attributes in a json dictionary (if you’re using Snowflake or BigQuery, you’ll have a pretty good time writing SQL against that data). The only benefits I can really see of the EAV model over a dictionary-based approach is that:
- if you want to change an attribute name (say
color
→colour
), you only have to update the name in one place - data types are strongly forced in an EAV model
Given the choice, I’d prefer to work with attributes that are stored in a dictionary. However, as analytics engineers, often we can’t choose the data structure that we have to work with. So if we have an EAV model, what’s the best way to get it into a shape we can work with?
SQL for flattening an EAV model
Let’s say we want to build a table of products, with the attributes name
, price
, and color
as columns. Here’s the SQL:
with products as (
select * from dbt_swag_shop.entities
where entity_code = 'product'
),
name as (
select * from dbt_swag_shop.values_string
where attribute_id = 1
),
price as (
select * from dbt_swag_shop.values_integer
where attribute_id = 2
),
brand as (
select * from dbt_swag_shop.values_string
where attribute_id = 3
),
final as (
select
product.entity_id as product_id,
name.value as name,
price.value as price,
brand.value as brand
from products
left join name using (entity_id)
left join price using (entity_id)
left join brand using (entity_id)
)
select * from final
Hold up! How did I know which dbt_swag_shop.values_{{ data_type }}
table to query? And how did I know the attribute_id
s to filter on? Well, I ran this query before I wrote my SQL
select * from dbt_swag_shop.values_string
where attribute_code in ('name', 'price', 'brand')
Which returns:
attribute_id | attribute_code | data_type |
---|---|---|
1 | name | string |
2 | price | integer |
3 | brand | string |
Looking back at the main query, you can see that there’s a lot of repetition a lot of repetition – those subqueries all follow a patten, as do the column names. Where there’s a pattern, there’s often an opportunity to turn it into a loop! And what we actually want to do is iterate over the results of that second query!
Putting this together in dbt
In dbt, the run_query macro lets you run a query, and returns the results of that query to your jinja context! Perfect!
Here’s the code I used to flatten out the EAV model given a list of columns I wanted to build:
{# These are the columns we will be creating #}
{% set attribute_codes=[
'name',
'price',
'brand'
] %}
{#- This query gives us the information we need to be able to join the EAV model #}
{%- set product_attributes_query -%}
select * from dbt_swag_shop.values_string
where attribute_code in ('{{ attribute_codes | join ("', '") }}')
{%- endset -%}
{#- Run this query, and load the results into the Jinja context -#}
{%- set product_attributes_results=run_query(product_attributes_query) -%}
{#- Light transformation to make the results easier to work with -#}
{%- if execute -%}
{%- set product_attributes=product_attributes_results.rows -%}
{%- else -%}
{%- set product_attributes=[] -%}
{%- endif -%}
{#- MODEL CODE -#}
with products as (
select * from dbt_swag_shop.entities
where entity_code = 'product'
),
{% for attribute in product_attributes %}
{{ attribute['attribute_code'] }} as (
select * from dbt_swag_shop.value_{{ attribute['data_type'] }}
where attribute_id = {{ attribute['attribute_id'] }}
),
{% endfor %}
final as (
select
product.entity_id as product_id,
{% for attribute in product_attributes %}
{{ attribute['attribute_code'] }}.value as {{ attribute['attribute_code' ] }} {{- ", " if not loop.last }}
{% endfor %}
from products
{% for attribute in product_attributes %}
left join {{ attribute['attribute_code'] }} using (entity_id)
{% endfor %}
)
select * from final
^ This code is pretty complex, but extremely flexible. If you want to add an extra column to your flattened model, you only need to add it to the attribute_codes
list at the top of this code block.
But, I have traded readability for flexibility. If I were actually only creating three columns, I would likely use the first version of the query. Given I was actually using this to flatten a model with a dozen columns, which we want to add to later, it felt like it was worth it.
If you wanted to take it one step further, you could abstract this even more and package it up as a macro, so you can repeat this pattern to flatten all the entities (products
, stores
, customers
, brands
, etc.) in your EAV model by simply calling something like:
{{ flatten_eav_model(
entity='product',
columns=['name', 'price', 'brand']
) }}
I’ll leave that for someone braver than me!