Prevent people from using seeds for data loading with a custom materialization

@jaypeedevlin asked and answered a great question on the dbt Slack:

Has anyone hacked together anything to prevent seeds beyond a certain size? We’d like to put some sort of barrier in place to stop folks using seeds for ingestion instead of just for mapping-type files.

As he notes, the dbt docs strongly advise against using seeds for data loading, because it's not performant and every time the data changes you need to commit it to your repository.

It turns out that because you can override the behaviour of basically any macro in dbt, you can also block this from happening inside your own project. In this case, we need to override the adapter's load_csv_rows() macro and add the check below at the top of the macro:

--macros/snowflake__load_csv_rows.sql

{% macro snowflake__load_csv_rows(model, agate_table) %}

  {# Check that the seed doesn't have too many values (rows * columns) #}
  {% set MAX_SIZE = 60000 %}
  {% set seed_size = (agate_table.rows | length) * (agate_table.columns | length) %}

  {% if seed_size > MAX_SIZE %}
    {{ exceptions.raise_compiler_error("{} values in seed exceeds maximum of {}.".format(seed_size, MAX_SIZE)) }}
  {% endif %}

{# then the rest of the adapter's implementation of load_csv_rows here... #}

{% endmacro %}

Here are the implementations of load_csv_rows for Snowflake and BigQuery. Other adapters' implementations are either in similar places in their respective repos, or they use the default implementation, like Redshift and Postgres.
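If your adapter falls into that second group, the same pattern should work by overriding default__load_csv_rows in your project instead. Here's a minimal sketch of that, reusing the size check from above — as with the Snowflake example, you'd paste the rest of dbt-core's default implementation of load_csv_rows below the check:

--macros/default__load_csv_rows.sql

{% macro default__load_csv_rows(model, agate_table) %}

  {# Same size check as in the Snowflake example above #}
  {% set MAX_SIZE = 60000 %}
  {% set seed_size = (agate_table.rows | length) * (agate_table.columns | length) %}

  {% if seed_size > MAX_SIZE %}
    {{ exceptions.raise_compiler_error("{} values in seed exceeds maximum of {}.".format(seed_size, MAX_SIZE)) }}
  {% endif %}

{# then the rest of dbt-core's default implementation of load_csv_rows here... #}

{% endmacro %}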


Awesome!! Great idea from Josh, and the implementation was simpler than I expected.


@joellabes hello, thanks for the tip.
On the idea that seeds are “just for mapping-type files”, I have a question: we use big seeds for performance testing across all of our input data types, and I even had to modify the dbt Python source code to increase the size limit for seed CSV files.
What is the recommended approach for big testing seed files?