FAQ: Cleaning up removed models from your production schema

I just went through a similar process, with a focus on dropping source tables that we'd added to our sync job in the past but weren't actually using.

I approached this with a combo of the Jinja graph variable, the information_schema and the venerable spreadsheet.

To build a list of the sources we did use, I started with the compiled results of this snippet:

{% for node in graph.sources.values() -%}
  select '{{ node.schema }}' as s_name, '{{ node.name }}' as t_name
  {%- if not loop.last %} union all {% endif %}
{% endfor %}
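If you're not sure where to run this, one option is to save it in your analyses folder and run dbt compile; the rendered SQL lands under target/compiled/. It renders to a union of one-row selects, one per source defined in the project:

select 'users' as s_name, 'users' as t_name union all 
select 'users' as s_name, 'contactroles' as t_name union all 
...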

And pasted those results into this query:

with defined_sources as (
    select 'users' as s_name, 'users' as t_name union all
    select 'users' as s_name, 'contactroles' as t_name union all
    ...
    select 'ft_hubspot' as s_name, 'ticket' as t_name
),
     
relevant_tables as (
    select 
        table_schema as s_name, 
        table_name as t_name
    from information_schema.tables
    where table_schema not ilike 'dev_%' -- our dbt dev schemas
    and table_schema not ilike 'analytics%' -- our output schemas
    -- Redshift internal schemas, plus schemas whose tables we query with
    -- get_relations_by_pattern instead of defining each one as a source:
    and table_schema not in ('pg_catalog', 'information_schema', 'admin', 'public', 'census', 'inbound__timesheets_and_quality_control')
)

select * from relevant_tables
except
select * from defined_sources

This gave me all the tables in the warehouse that weren't defined in dbt, meaning we weren't using them and could consider dropping them to save disk space and reduce load on the source databases.
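From there, you can even have the warehouse generate the cleanup statements for you. A minimal sketch (the CTEs are abbreviated here, and of course review every statement before running anything):

with defined_sources as (
    select 'users' as s_name, 'users' as t_name -- pasted rows, as above
),

relevant_tables as (
    select table_schema as s_name, table_name as t_name
    from information_schema.tables
    -- same schema filters as above
),

unused_tables as (
    select * from relevant_tables
    except
    select * from defined_sources
)

-- assumes no identifiers need quoting
select 'drop table if exists ' || s_name || '.' || t_name || ';' as ddl
from unused_tables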

To focus on the biggest tables first, I wanted to combine that list with the metadata in SVV_TABLE_INFO:

select
    schema as s_name,
    "table" as t_name, -- quoted because table is a reserved word
    size as size_mb, -- SVV_TABLE_INFO reports size in 1 MB blocks
    tbl_rows as total_rows
from SVV_TABLE_INFO
order by size_mb desc

However, I found that information_schema queries always run against the leader node, which meant I couldn't join the results of this query with my first one. Instead, I dumped both result sets to a spreadsheet and worked it out there.
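In hindsight, one way to sidestep the leader-node restriction might be to cut information_schema out of the loop entirely: the defined_sources CTE is just hard-coded literals, so it could plausibly be joined straight to SVV_TABLE_INFO. A sketch, untested:

with defined_sources as (
    select 'users' as s_name, 'users' as t_name -- pasted rows, as above
)

select
    ti.schema as s_name,
    ti."table" as t_name,
    ti.size as size_mb,
    ti.tbl_rows as total_rows
from SVV_TABLE_INFO as ti
left join defined_sources as ds
    on ds.s_name = ti.schema
    and ds.t_name = ti."table"
where ds.t_name is null -- keep only tables with no matching source definition
order by size_mb desc

(You'd still want to re-apply the schema filters from relevant_tables to exclude dev and output schemas.)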