I have a macro that will add some extra columns to an existing model and performs some data cleansing as per business requirements.
After the data been cleansed, what is the best way to delete the unwanted columns?
Suppose an input data(model) may have millions of columns and we don’t care what are the other columns other than the known columns that are to be deleted.
Below code in a dbt model can select the columns in the list but I don’t know how to delete/not select the columns in the list. Any thoughts?
{% set columns_to_delete = ["column_1", "column_2", "column_3"] -%}
select
{% for column in columns_to_delete %}
{{ column }},
{% endfor %}
I have checked that {% for column **not** in columns_to_delete %} is not working.
hey @s.chaser!
I am unsure if by I don't know how to delete/not select the columns in the list you’re wanting to exclude them from a query or literally drop the specific columns from a table’s ddl.
If you’re just wanting to exclude a set of columns, you can utilize dbt_utils.star like this:
select
{{ dbt_utils.star(from=ref('my_model'), except=["exclude_field_1", "exclude_field_2"]) }}
from {{ ref('my_model') }}
If you need to specifically drop them from a table (assuming you’re using a snowflake warehouse), I imagine you could create a macro like this:
{% macro drop_cols(schema, table, col_list = [], debug = true) %}
{% set sql %}
alter table {{target.database}}.{{schema}}.{{ table }} drop
{% for col in col_list %}
{{col}}
{% if not loop.last %},{% endif %}
{% endfor %}
{% endset %}
{% log(sql, true) %}
{% if not debug %}
{{ run_query(sql) }}
{% endif %}
{% endmacro %}
The result showing that 3 columns already dropped as expected from DATA_DROP_COLUMNS,
with below error messages:
Completed with 1 error and 0 warnings:
Database Error in model drop_columns (models\drop_columns\drop_columns.sql)
001003 (42000): SQL compilation error:
syntax error line 7 at position 0 unexpected '|'.
syntax error line 13 at position 6 unexpected ')'.
Need spent more time to figure out what’s going on here
Happy Holidays! In this situation, the macro was written to be run as a stand alone run-operation like so: dbt run-operation drop_cols --args '{schema: xxx, model_name: DATA_DROP_COLS, col_list: ['COL2', 'COL3', 'COL4']}'.
Completed with 1 error and 0 warnings:
Database Error in model drop_columns_hook (models\drop_columns\drop_columns_hook.sql)
001003 (42000): SQL compilation error:
syntax error line 1 at position 30 unexpected 'DELETE'.
Are you able to point out what I have done wrong here?
Thanks for the link @Reidwil — I have read multiple times of the linked info, still have no clue how to fix it with my limited understanding.
I don’t see any syntax error here
ohh my god, I am sweating now…I was blinded!
Thank you @Reidwil for being patient with me
It is now working as expected!
select *
from {{ ref('data_drop_columns')}}
{{
config(
post_hook="ALTER TABLE SCHEMA_NAME.THIS_MODEL_NAME DROP COLUMN col2, col3, col4"
)
}}
Any chance here to improve from changing SCHEMA_NAME to whatever schema the current model being created in? This because the current model is created in a different schema defined by dbt_project.yml rather than settings in the profiles.yml.
I guess there should be smarter way to do so, just have no luck to find anything to help me to do so.
Thank you again for being so helpful to direct me on the right thing to do.