I just found a trick in dbt to create empty tables and thought I would share with a wider audience and perhaps gather some feedback. I was working on a POC where I needed dbt to create an empty staging table just once. I would love to hear if there are better approaches.
The trick is to define a csv file in seeds with just headers
For example,
We have a file called as staging_table.csv with following contents
BulkColumn,loaded_at
And adding column details in yml file
For example,
Though there is one edge case: If you keep the seed around and run dbt seedin the future it will wipe all data in the table in order to re-sync with the empty seed. If that is okay, you have no problem but if the data that gets placed into this table later is important it will be gone … especially if you automatically call dbt seed as part of a CD run.
If you “just” want dbt to bootstrap an empty table for you, you could write a simple macro and place it in the on_run_start section of your config: on-run-start & on-run-end | dbt Developer Hub
I agree with the edge case. Great insight! Although I am using this as a buffer table, I would prefer retaining a month’s data in production.
As much as I have understood, on_run_start will run every time dbt run is executed. So won’t the table be dropped and recreated each time dbt run is executed?
So won’t the table be dropped and recreated each time dbt run is executed?
@ajsheth That depends on the SQL you use in the macro. If you create something like below it should work:
{% macro ensure_table_XYZ() -%}
{# Only run this macro if there is a DB connection. #}
{%- if not execute %}
{{- return(None) }}
{%- endif %}
{# Only run this macro if we are using `dbt run` #}
{%- if flags.WHICH != 'run' -%}
{{- return(None) }}
{%- endif %}
{%- call statement(fetch_result=False, auto_begin=false) %}
create table if not exists catalog.schema.table_XYZ (
... columns ...
);
{%- endcall %}
{%- endmacro %}
The crux here is that we can use create table IF NOT EXISTS, which turns the macro into a “do nothing” macro if the table has already been created; otherwise it will create the table.
Ofc, you may want to increase the sophistication of this by, for example, having a --full-refresh on a certain model also wipe the buffer table or by making the macro more sophisticated to support different tables.
Totally makes sense. Since I am using MS SQL Server there was no way to implement create table if not exists without using store procedure or some other complicated logic invovling information schema. That’s why I was looking to do this purely through dbt.
After checking different posts and doing some R&D, I found a way to create empty table if it doesn’t exist in SQL Server (or any other database) in dbt style!
{% macro create_if_not_exists(database, schema, table) %}
{%- set source_relation = adapter.get_relation(
database=database,
schema=schema,
identifier=table) -%}
{% set table_exists=source_relation is not none %}
{% if not table_exists %}
Create table {{database}}.{{schema}}.{{table}} (
BulkColumn varchar(max),
loaded_at datetime
)
{% endif %}
{% endmacro %}
This macro can be run using the on-run-start hook as you suggested.
We use a variable (--vars '{mode: deploy}') to handle this.
When mode == deploy, we have a custom materialization which takes the YAML/columns config of the table, and runs a CREATE TABLE IF NOT EXISTS command for it (instead of running the SQL for the model). There’s also an optional parameter to run CREATE OR REPLACE instead (sometimes helpful in development). It also looks at the table metadata for certain things (like clustering keys, etc.).