@jaypeedevlin asked and answered a great question on the dbt Slack:
Has anyone hacked together anything to prevent seeds beyond a certain size? We’d like to put some sort of barrier in place to stop folks using seeds for ingestion instead of just for mapping-type files.
As he notes, the dbt docs strongly advise against using seeds for data loading, because it’s not performant and every change to the data has to be committed to your repository.
It turns out that because you can override the behaviour of basically any macro in dbt, you can also block this from happening inside your project. load_csv_rows() is dispatched to an adapter-specific implementation, so for Snowflake we override snowflake__load_csv_rows() and add the check below at the top of the macro, keeping the rest of the adapter's implementation underneath:
-- macros/snowflake__load_csv_rows.sql
{% macro snowflake__load_csv_rows(model, agate_table) %}

    {# Check that the seed doesn't have too many values (rows * columns) #}
    {% set MAX_SIZE = 60000 %}
    {% set seed_size = (agate_table.rows | length) * (agate_table.columns | length) %}
    {% if seed_size > MAX_SIZE %}
        {{ exceptions.raise_compiler_error("{} values in seed exceeds maximum of {}.".format(seed_size, MAX_SIZE)) }}
    {% endif %}

    {# then the rest of the adapter's implementation of load_csv_rows here... #}

{% endmacro %}
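If you want the same guard on BigQuery (or any other adapter with its own implementation), the check itself doesn’t change, so one option is to pull it out into a small helper macro and call it from each adapter-specific override. The sketch below assumes a helper name (assert_seed_within_limit) and file paths that are purely illustrative, and you’d still paste the rest of the adapter's implementation beneath the call:

-- macros/assert_seed_within_limit.sql
{% macro assert_seed_within_limit(agate_table, max_size=60000) %}
    {# Raise a compiler error if the seed has more than max_size values (rows * columns) #}
    {% set seed_size = (agate_table.rows | length) * (agate_table.columns | length) %}
    {% if seed_size > max_size %}
        {{ exceptions.raise_compiler_error("{} values in seed exceeds maximum of {}.".format(seed_size, max_size)) }}
    {% endif %}
{% endmacro %}

-- macros/bigquery__load_csv_rows.sql
{% macro bigquery__load_csv_rows(model, agate_table) %}
    {# run the check without adding anything to the macro's output #}
    {% do assert_seed_within_limit(agate_table) %}
    {# then the rest of the BigQuery adapter's implementation of load_csv_rows here... #}
{% endmacro %}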
Here are the implementations of load_csv_rows for Snowflake and BigQuery. Other adapters’ implementations live in similar places in their respective repos, or they use the default implementation, as Redshift and Postgres do.
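For adapters that fall back to the default implementation, like Redshift and Postgres, the same pattern should work against default__load_csv_rows. This is again just a sketch, reusing the illustrative helper from above:

-- macros/default__load_csv_rows.sql
{% macro default__load_csv_rows(model, agate_table) %}
    {% do assert_seed_within_limit(agate_table) %}
    {# then the rest of dbt's default implementation of load_csv_rows here... #}
{% endmacro %}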