Clean your warehouse of old and deprecated models (Snowflake)

The tables and views built by dbt models persist in your production database even after the models are deleted from your project. This adds clutter to the warehouse and can slow down operations like database or schema cloning, which are used frequently in CI/CD for Snowflake dbt projects.

I wrote a macro that executes DROP statements for all the tables and views in your target schema that don’t exist in your project. How does it work? First, it fetches all the models, seeds, and snapshots in your project. Then, it queries the information schema for all the tables and views in your target schema, excluding the ones fetched above. Finally, it formats (and then executes) a DROP statement for each table or view that isn’t in your project.
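For illustration (the database, schema, and relation names here are hypothetical), the statements it ends up logging and running look like:

DROP TABLE ANALYTICS.DBT_PROD.OLD_CUSTOMERS;
DROP VIEW ANALYTICS.DBT_PROD.DEPRECATED_ORDERS;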

To invoke from the command line, use: dbt run-operation drop_old_relations

Warning: This will delete everything in your target schema that isn’t in your dbt project.

Warning: This does not cover aliases

I strongly recommend running dbt run-operation drop_old_relations --args '{"dryrun": True}' first to see the DROP commands that would be executed.

Thanks a lot to Randy of Hashmap for providing much of the code! You can find his original version here; it drops models based on the last_altered date in the information schema using a pre-set cutoff.

{% macro drop_old_relations(dryrun=False) %}

{% if execute %}
  {% set current_models=[] %}

  {% for node in graph.nodes.values()
     | selectattr("resource_type", "in", ["model", "seed", "snapshot"]) %}
    {% do current_models.append(node.name) %}
  {% endfor %}
{% endif %}

{% set cleanup_query %}

      WITH MODELS_TO_DROP AS (
        SELECT
          CASE 
            WHEN TABLE_TYPE = 'BASE TABLE' THEN 'TABLE'
            WHEN TABLE_TYPE = 'VIEW' THEN 'VIEW'
          END AS RELATION_TYPE,
          CONCAT_WS('.', TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME) AS RELATION_NAME
        FROM 
          {{ target.database }}.INFORMATION_SCHEMA.TABLES
        WHERE TABLE_SCHEMA = '{{ target.schema }}'
          AND TABLE_NAME NOT IN
            ({%- for model in current_models -%}
                '{{ model.upper() }}'
                {%- if not loop.last -%}
                    ,
                {% endif %}
            {%- endfor -%}))
      SELECT 
        'DROP ' || RELATION_TYPE || ' ' || RELATION_NAME || ';' as DROP_COMMANDS
      FROM 
        MODELS_TO_DROP

  {% endset %}

{% do log(cleanup_query, info=True) %}
{% set drop_commands = run_query(cleanup_query).columns[0].values() %}

{% if drop_commands %}
  {% if dryrun | as_bool == False %}
    {% do log('Executing DROP commands...', True) %}
  {% else %}
    {% do log('Printing DROP commands...', True) %}
  {% endif %}
  {% for drop_command in drop_commands %}
    {% do log(drop_command, True) %}
    {% if dryrun | as_bool == False %}
      {% do run_query(drop_command) %}
    {% endif %}
  {% endfor %}
{% else %}
  {% do log('No relations to clean.', True) %}
{% endif %}

{%- endmacro -%}

EDIT: Added dryrun feature


I love this! I modified the script slightly. The modified version below:

  1. Provides the option of a dry run where the commands are printed, but not executed
  2. Will match any schema prefixed with the target schema, rather than only the exact target schema, since we use multiple schemas of the form <target_schema>_<schema_name> (the exact target schema is still matched). This introduces some risk if another schema’s name happens to start with your target schema, so use with caution. It only matches prefixes, not a full wildcard, so if you’re consistent about prefixing your schemas this shouldn’t be an issue; see the illustration after this list.
  3. Is explicit about excluding non-standard table types (for example, we don’t want this to delete our external tables, which are managed through dbt but slightly outside of this process).
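To illustrate point 2 (the schema names here are hypothetical): with a target schema of DBT_PROD, the generated predicate is

where table_schema ilike 'DBT_PROD%'

which matches DBT_PROD, DBT_PROD_STAGING, and DBT_PROD_MARTS, but would also match an unrelated schema like DBT_PRODUCTS, which is the overlap risk mentioned above.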

Modified version:

{% macro drop_old_relations(dry_run='false') %}
{% if execute %}
  {% set current_models=[] %}
  {% for node in graph.nodes.values()
     | selectattr("resource_type", "in", ["model", "seed", "snapshot"])%}
    {% do current_models.append(node.name) %}
  {% endfor %}
{% endif %}
{% set cleanup_query %}
      with models_to_drop as (
        select
          case 
            when table_type = 'BASE TABLE' then 'TABLE'
            when table_type = 'VIEW' then 'VIEW'
          end as relation_type,
          concat_ws('.', table_catalog, table_schema, table_name) as relation_name
        from 
          {{ target.database }}.information_schema.tables
        where table_schema ilike '{{ target.schema }}%'
          and table_name not in
            ({%- for model in current_models -%}
                '{{ model.upper() }}'
                {%- if not loop.last -%}
                    ,
                {% endif %}
            {%- endfor -%}))
      select 
        'drop ' || relation_type || ' ' || relation_name || ';' as drop_commands
      from 
        models_to_drop
      
      -- intentionally exclude unhandled table_types, including 'external table'
      where drop_commands is not null
  {% endset %}
{% do log(cleanup_query, info=True) %}
{% set drop_commands = run_query(cleanup_query).columns[0].values() %}
{% if drop_commands %}
  {% for drop_command in drop_commands %}
    {% do log(drop_command, True) %}
    {% if dry_run == 'false' %}
      {% do run_query(drop_command) %}
    {% endif %}
  {% endfor %}
{% else %}
  {% do log('No relations to clean.', True) %}
{% endif %}
{%- endmacro -%}
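Since this version takes dry_run as a string that defaults to 'false', a dry run is invoked as:

dbt run-operation drop_old_relations --args '{"dry_run": "true"}'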

Thanks for sharing! Have you considered how to extend this to account for aliases?

What do you mean by aliases?

Where the name of the table created in the database is different from the model name.

Why alias model names?

The names of schemas and tables are effectively the “user interface” of your data warehouse. Well-named schemas and tables can help provide clarity and direction for consumers of this data. In combination with custom schemas, model aliasing is a powerful mechanism for designing your warehouse.
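For example (the model and alias names here are hypothetical), an alias can be set in a model’s config block so the relation is created as customers instead of the model name customers_v2:

-- models/customers_v2.sql
{{ config(alias='customers') }}

select * from {{ ref('stg_customers') }}  -- hypothetical upstream model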

Interesting, I didn’t know about that feature. You could probably extend this by grabbing the alias name in the initial loop of the dbt graph. Something like this, maybe?

  {% for node in graph.nodes.values()
     | selectattr("resource_type", "in", ["model", "seed", "snapshot"])%}
    {% if node.alias %}
      {% do current_models.append(node.alias) %}
    {% else %}
      {% do current_models.append(node.name) %}
    {% endif %}
  {% endfor %}