Flattening an EAV model in SQL, the DRY way

I recently had the pleasure (:upside_down_face:) of flattening an EAV model in dbt.

What is an EAV model?

Let’s just lift the description from Wikipedia:

Entity–attribute–value model ( EAV ) is a data model to encode, in a space-efficient manner, entities where the number of attributes (properties, parameters) that can be used to describe them is potentially vast, but the number that will actually apply to a given entity is relatively modest.

Let’s take an example of a products model. In a relational model, you might have a products table like so:

dbt_swag_shop.products

id name price brand size color width_mm volume_ml is_bpa_free
1 dbtshirt 2 Asos S Grey NULL NULL NULL
2 Logo Sticker 3 StickerMule NULL Orange 50 NULL NULL
3 dbt water bottle 1 Purist NULL Orange NULL 650 True

We can see here that the relational model doesn’t extend very well – each product has really different attributes, so this table would become really wide over time.

In an EAV model, this would instead be expressed through a series of tables like this:

dbt_swag_shop.entities

entity_id entity_code
1 product
2 product
3 product

dbt_swag_shop.attributes

attribute_id attribute_code data_type
1 name string
2 price integer
3 brand string
4 size string
5 color string
6 width_mm integer
7 volume_ml integer
8 is_bpa_free boolean

dbt_swag_shop.values_string

value_id attribute_id entity_id value
1 1 1 dbtshirt
2 1 2 Logo Sticker
3 1 3 dbt water bottle
4 3 1 Asos
5 3 2 StickerMule
6 3 3 Purist
7 4 1 S
8 5 1 Grey
9 5 2 Orange
10 5 3 Orange

dbt_swag_shop.values_integer

value_id attribute_id entity_id value
1 2 1 2
2 2 2 3
3 2 3 1
4 6 2 50
5 7 7 650

dbt_swag_shop.values_boolean

value_id attribute_id entity_id value
1 8 3 TRUE

NB: The values are split into a table for each data type, since databases can’t store a column that contains both strings and decimals.

The EAV approach offers flexibility, and makes it easy to get data in! A computer can easily read this data, but for us mere mortals, it’s a little harder to instinctively understand, say, what the name of a product is.

These days, it’s much more common to store these sparse attributes in a json dictionary (if you’re using Snowflake or BigQuery, you’ll have a pretty good time writing SQL against that data). The only benefits I can really see of the EAV model over a dictionary-based approach is that:

  • if you want to change an attribute name (say color -> colour), you only have to update the name in one place
  • data types are strongly forced in an EAV model

Given the choice, I’d prefer to work with attributes that are stored in a dictionary. However, as analytics engineers, often we can’t choose the data structure that we have to work with. So if we have an EAV model, what’s the best way to get it into a shape we can work with?

SQL for flattening an EAV model

Let’s say we want to build a table of products, with the attributes name, price, and color as columns. Here’s the SQL:

with products as (
  select * from dbt_swag_shop.entities
  where entity_code = 'product'
),

name as (
  select * from dbt_swag_shop.values_string
  where attribute_id = 1
),

price as (
  select * from dbt_swag_shop.values_integer
  where attribute_id = 2
),

brand as (
  select * from dbt_swag_shop.values_string
  where attribute_id = 3
),

final as (
  select
    product.entity_id as product_id,
    name.value as name,
    price.value as price,
    brand.value as brand
  
  from products
  left join name using (entity_id)
  left join price using (entity_id)
  left join brand using (entity_id)
)

select * from final

Hold up! How did I know which dbt_swag_shop.values_{{ data_type }} table to query? And how did I know the attribute_ids to filter on? Well, I ran this query before I wrote my SQL :sweat_smile:

select * from dbt_swag_shop.values_string
where attribute_code in ('name', 'price', 'brand')

Which returns:

attribute_id attribute_code data_type
1 name string
2 price integer
3 brand string

Looking back at the main query, you can see that there’s a lot of repetition a lot of repetition – those subqueries all follow a patten, as do the column names. Where there’s a pattern, there’s often an opportunity to turn it into a loop! And what we actually want to do is iterate over the results of that second query!

Putting this together in dbt

In dbt, the run_query macro lets you run a query, and returns the results of that query to your jinja context! Perfect!

Here’s the code I used to flatten out the EAV model given a list of columns I wanted to build:

{# These are the columns we will be creating #}
{% set attribute_codes=[
    'name',
    'price',
    'brand'
] %}

{#- This query gives us the information we need to be able to join the EAV model #}
{%- set product_attributes_query -%}
  select * from dbt_swag_shop.values_string
  where attribute_code in ('{{ attribute_codes | join ("', '") }}')
{%- endset -%}

{#- Run this query, and load the results into the Jinja context -#}
{%- set product_attributes_results=run_query(product_attributes_query) -%}

{#- Light transformation to make the results easier to work with -#}
{%- if execute -%}
{%- set product_attributes=product_attributes_results.rows -%}
{%- else -%}
{%- set product_attributes=[] -%}
{%- endif -%}

{#- MODEL CODE -#}

with products as (
  select * from dbt_swag_shop.entities
  where entity_code = 'product'
),

{% for attribute in product_attributes %}

{{ attribute['attribute_code'] }} as (
    select * from dbt_swag_shop.value_{{ attribute['data_type']  }}
    where attribute_id = {{ attribute['attribute_id'] }}
),

{% endfor %}

final as (
    select
        product.entity_id as product_id,

    {% for attribute in product_attributes %}
        {{ attribute['attribute_code'] }}.value as {{ attribute['attribute_code' ] }} {{- ", " if not loop.last }}
    {% endfor %}

    from products

    {% for attribute in product_attributes %}
    left join {{ attribute['attribute_code'] }} using (entity_id)
    {% endfor %}
)

select * from final

^ This code is pretty complex, but extremely flexible. If you want to add an extra column to your flattened model, you only need to add it to the attribute_codes list at the top of this code block.

But, I have traded readability for flexibility. If I were actually only creating three columns, I would likely use the first version of the query. Given I was actually using this to flatten a model with a dozen columns, which we want to add to later, it felt like it was worth it.

If you wanted to take it one step further, you could abstract this even more and package it up as a macro, so you can repeat this pattern to flatten all the entities (products, stores, customers, brands, etc.) in your EAV model by simply calling something like:

{{ flatten_eav_model(
  entity='product',
  columns=['name', 'price', 'brand']
) }}

I’ll leave that for someone braver than me!

4 Likes