I’m using dbt-databricks and want to ingest some source tables from parquet files in our data lake. The tables are uploaded every day into a new location on the data lake.
Example of a parquet folder:
raw/<sourcename>/<tablename>/<yyyy>/<mm>/<dd>/tablename_yyyy_mm_dd.parquet
Our yaml source definition would look something like this:
- name: sourcename
  schema: parquet
  tables:
    - name: tablename
      identifier: /mnt/raw/sourcename/tablename/????/??/??/*.parquet
      quoting:
        identifier: true
The dbt source should always point to the latest folder of a given source table. Depending on the time of day, that folder may not exist yet, in which case the source should fall back to the previous day’s file.
So basically I want to reference the source but somehow modify the identifier so that it points to the correct parquet folder. Is this possible? I’m thinking of writing a macro, but I’m unsure of the best approach.
I’m fairly sure I can write a macro that resolves the correct folder path, but I don’t know how to feed that path back into the source definition in the source.yml.
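For the path-resolution part, my rough idea (untested; the macro name `latest_parquet_path` and the `/mnt/raw` prefix are just placeholders for illustration) is to build the glob from dbt’s `run_started_at` context variable, something like:

```sql
{# Hypothetical sketch: render the parquet glob for the run date's folder.
   run_started_at and modules.datetime are standard dbt Jinja context members. #}
{% macro latest_parquet_path(source_name, table_name) %}
    {%- set today = run_started_at -%}
    {%- set yesterday = today - modules.datetime.timedelta(days=1) -%}
    {#- Open question: how to check whether today's folder exists yet,
        so the macro knows when to use `yesterday` instead. -#}
    {{- '/mnt/raw/' ~ source_name ~ '/' ~ table_name ~ '/'
        ~ today.strftime('%Y/%m/%d') ~ '/*.parquet' -}}
{% endmacro %}
```

The existence check (deciding between today and yesterday) is the part I haven’t figured out, and even with that solved I still don’t see how to get the rendered path into the source’s identifier.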
I was thinking of wrapping the normal source macro in a “source_parquet” before the normal source macro runs:
{% macro source_parquet(source_name, table_name) %}
    <Modify the identifier>
    {{ source(source_name, table_name) }}
{% endmacro %}
I would then reference my new macro for parquet sources:
with source_table as (
    select *
    from {{ source_parquet('sourcename', 'tablename') }}
)
Is that feasible or is there a better approach?
Thanks!