Flattening an EAV model in SQL, the DRY way

claire · July 30, 2019, 7:31pm

I recently had the pleasure () of flattening an EAV model in dbt.

What is an EAV model?

Let’s just lift the description from Wikipedia:

Entity–attribute–value model ( EAV ) is a data model to encode, in a space-efficient manner, entities where the number of attributes (properties, parameters) that can be used to describe them is potentially vast, but the number that will actually apply to a given entity is relatively modest.

Let’s take an example of a products model. In a relational model, you might have a products table like so:

dbt_swag_shop.products

id	name	price	brand	size	color	width_mm	volume_ml	is_bpa_free
1	dbtshirt	2	Asos	S	Grey	NULL	NULL	NULL
2	Logo Sticker	3	StickerMule	NULL	Orange	50	NULL	NULL
3	dbt water bottle	1	Purist	NULL	Orange	NULL	650	True

We can see here that the relational model doesn’t extend very well – each product has really different attributes, so this table would become really wide over time.

In an EAV model, this would instead be expressed through a series of tables like this:

dbt_swag_shop.entities

entity_id	entity_code
1	product
2	product
3	product

dbt_swag_shop.attributes

attribute_id	attribute_code	data_type
1	name	string
2	price	integer
3	brand	string
4	size	string
5	color	string
6	width_mm	integer
7	volume_ml	integer
8	is_bpa_free	boolean

dbt_swag_shop.values_string

value_id	attribute_id	entity_id	value
1	1	1	dbtshirt
2	1	2	Logo Sticker
3	1	3	dbt water bottle
4	3	1	Asos
5	3	2	StickerMule
6	3	3	Purist
7	4	1	S
8	5	1	Grey
9	5	2	Orange
10	5	3	Orange

dbt_swag_shop.values_integer

value_id	attribute_id	entity_id	value
1	2	1	2
2	2	2	3
3	2	3	1
4	6	2	50
5	7	7	650

dbt_swag_shop.values_boolean

value_id	attribute_id	entity_id	value
1	8	3	TRUE

NB: The values are split into a table for each data type, since databases can’t store a column that contains both strings and decimals.

The EAV approach offers flexibility, and makes it easy to get data in! A computer can easily read this data, but for us mere mortals, it’s a little harder to instinctively understand, say, what the name of a product is.

These days, it’s much more common to store these sparse attributes in a json dictionary (if you’re using Snowflake or BigQuery, you’ll have a pretty good time writing SQL against that data). The only benefits I can really see of the EAV model over a dictionary-based approach is that:

if you want to change an attribute name (say color → colour), you only have to update the name in one place
data types are strongly forced in an EAV model

Given the choice, I’d prefer to work with attributes that are stored in a dictionary. However, as analytics engineers, often we can’t choose the data structure that we have to work with. So if we have an EAV model, what’s the best way to get it into a shape we can work with?

SQL for flattening an EAV model

Let’s say we want to build a table of products, with the attributes name, price, and color as columns. Here’s the SQL:

with products as (
  select * from dbt_swag_shop.entities
  where entity_code = 'product'
),

name as (
  select * from dbt_swag_shop.values_string
  where attribute_id = 1
),

price as (
  select * from dbt_swag_shop.values_integer
  where attribute_id = 2
),

brand as (
  select * from dbt_swag_shop.values_string
  where attribute_id = 3
),

final as (
  select
    product.entity_id as product_id,
    name.value as name,
    price.value as price,
    brand.value as brand
  
  from products
  left join name using (entity_id)
  left join price using (entity_id)
  left join brand using (entity_id)
)

select * from final

Hold up! How did I know which dbt_swag_shop.values_{{ data_type }} table to query? And how did I know the attribute_ids to filter on? Well, I ran this query before I wrote my SQL

select * from dbt_swag_shop.values_string
where attribute_code in ('name', 'price', 'brand')

Which returns:

attribute_id	attribute_code	data_type
1	name	string
2	price	integer
3	brand	string

Looking back at the main query, you can see that there’s a lot of repetition a lot of repetition – those subqueries all follow a patten, as do the column names. Where there’s a pattern, there’s often an opportunity to turn it into a loop! And what we actually want to do is iterate over the results of that second query!

Putting this together in dbt

In dbt, the run_query macro lets you run a query, and returns the results of that query to your jinja context! Perfect!

Here’s the code I used to flatten out the EAV model given a list of columns I wanted to build:

{# These are the columns we will be creating #}
{% set attribute_codes=[
    'name',
    'price',
    'brand'
] %}

{#- This query gives us the information we need to be able to join the EAV model #}
{%- set product_attributes_query -%}
  select * from dbt_swag_shop.values_string
  where attribute_code in ('{{ attribute_codes | join ("', '") }}')
{%- endset -%}

{#- Run this query, and load the results into the Jinja context -#}
{%- set product_attributes_results=run_query(product_attributes_query) -%}

{#- Light transformation to make the results easier to work with -#}
{%- if execute -%}
{%- set product_attributes=product_attributes_results.rows -%}
{%- else -%}
{%- set product_attributes=[] -%}
{%- endif -%}

{#- MODEL CODE -#}

with products as (
  select * from dbt_swag_shop.entities
  where entity_code = 'product'
),

{% for attribute in product_attributes %}

{{ attribute['attribute_code'] }} as (
    select * from dbt_swag_shop.value_{{ attribute['data_type']  }}
    where attribute_id = {{ attribute['attribute_id'] }}
),

{% endfor %}

final as (
    select
        product.entity_id as product_id,

    {% for attribute in product_attributes %}
        {{ attribute['attribute_code'] }}.value as {{ attribute['attribute_code' ] }} {{- ", " if not loop.last }}
    {% endfor %}

    from products

    {% for attribute in product_attributes %}
    left join {{ attribute['attribute_code'] }} using (entity_id)
    {% endfor %}
)

select * from final

^ This code is pretty complex, but extremely flexible. If you want to add an extra column to your flattened model, you only need to add it to the attribute_codes list at the top of this code block.

But, I have traded readability for flexibility. If I were actually only creating three columns, I would likely use the first version of the query. Given I was actually using this to flatten a model with a dozen columns, which we want to add to later, it felt like it was worth it.

If you wanted to take it one step further, you could abstract this even more and package it up as a macro, so you can repeat this pattern to flatten all the entities (products, stores, customers, brands, etc.) in your EAV model by simply calling something like:

{{ flatten_eav_model(
  entity='product',
  columns=['name', 'price', 'brand']
) }}

I’ll leave that for someone braver than me!

Topic		Replies	Views
Regarding dbt model versioning Help	0	668	October 13, 2023
dbt tries to materilize downstream models incorrectly as specifies in dbt_project.yml Help dbt-core	0	526	December 4, 2023
Handling translations in models Help jinja	5	1119	October 19, 2022
Dynamic models specific in config tables Help jinja , snowflake	1	1151	January 17, 2023
Creating a "model version" column in your models using md5 Archive	1	2633	June 10, 2021