Sometimes, there’s a table in your DAG that might be created via a non-SQL process, and it would be useful for your dbt project to “know” about it. For example, let’s say you have a table, `customers_predicted_ltv`, with one record per `customer_id`:
| customer_id | predicted_ltv |
| --- | --- |
| 1 | 120 |
| 2 | 115 |
| … | … |
This table of estimates is created by a machine learning process, using transformations that can’t be done in SQL, so we’re going to call it a non-SQL model. The non-SQL model uses some dbt models as inputs. Further, you want to use the output of this non-SQL model in other dbt models.
First of all, before going down this rabbit hole, really check whether it’s genuinely impossible to represent this model in SQL. Some warehouses offer built-in machine learning materializations (check out this BigQuery package as an example), most warehouses let you add Python or JavaScript UDFs, and many include native stats functions. Whenever possible, try to make this non-SQL model a SQL model. But if you really can’t do that, let’s go on…
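For instance, on BigQuery, a simple regression-based LTV estimate can often stay entirely inside the warehouse via BigQuery ML. A minimal sketch, where the table names, feature columns, and `ltv` label are all hypothetical:

```sql
-- Sketch: train a linear regression inside the warehouse with BigQuery ML,
-- so no external, non-SQL process is needed.
create or replace model my_schema.customer_ltv_model
options (model_type = 'linear_reg', input_label_cols = ['ltv']) as
select
    total_orders,
    days_since_first_order,
    ltv
from my_schema.customer_training_data;

-- Predictions are then just another query; the predicted label column
-- is named predicted_<label>, i.e. predicted_ltv here.
select customer_id, predicted_ltv
from ml.predict(
    model my_schema.customer_ltv_model,
    (select customer_id, total_orders, days_since_first_order
     from my_schema.customers)
);
```

If something like this works for your use case, the whole pipeline can live in dbt and none of the workarounds below are needed.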
It may be a good idea to represent this non-SQL model in your dbt project. For one, representing it in your project makes it documentable. For another, without representing it, you could end up in a weird cycle, where you accidentally use the output of your non-SQL model as an input to a dbt model that is itself an input to the non-SQL model. Making your project “aware” of this node will help prevent this, since dbt won’t let you create cyclic graphs.
So, how should we represent this table?
- Is this non-SQL model a source? Not really, since it has upstream dependencies that are part of your dbt project.
- Is this non-SQL model an exposure? Kind of, since it’s an interface that consumes dbt models. But again, not really: exposures can’t have downstream dependencies in your project.
- Is this non-SQL model a dbt model? Also no, since it doesn’t get created as part of your dbt project. But the nice thing about dbt models is that they have upstream and downstream dependencies, just like this machine learning model, so it’s the closest we have!
A hacky way to represent this non-SQL model in your project is to add it as a dbt model with a simple `select *`.

models/customers_predicted_ltv.sql

```sql
{{ config(materialized='ephemeral') }}

/*
This table is created by a non-SQL process (you should probably add more info IRL),
which selects from the following dbt models:
{{ ref('customers') }}
{{ ref('orders') }}
*/

select * from my_schema.customers_predicted_ltv
```
Things to note:
- including the `ref`s in a comment means that the edges in your DAG will be drawn between the upstream dbt models and this non-SQL model
- declaring this as an ephemeral model means dbt won’t try to build it itself
- the model can now be documented with descriptions in `.yml` files
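As a sketch, the `.yml` documentation for this model might look something like the following (the file path and descriptions are placeholders to adapt to your project):

```yml
# models/schema.yml (hypothetical path)
version: 2

models:
  - name: customers_predicted_ltv
    description: >
      One record per customer, with a predicted lifetime value.
      Created by a non-SQL machine learning process outside of dbt.
    columns:
      - name: customer_id
        description: Primary key of the customer.
      - name: predicted_ltv
        description: Predicted lifetime value (units are hypothetical here).
```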
Then, downstream models can depend on it like any other dbt model!
models/fct_customers.sql
```sql
with customers_predicted_ltv as (

    select * from {{ ref('customers_predicted_ltv') }}

),

...
```
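Fleshing that out, a minimal sketch of a downstream model that joins the predictions onto customers (the `customers` model comes from earlier in the post; everything else here is illustrative):

```sql
with customers as (

    select * from {{ ref('customers') }}

),

customers_predicted_ltv as (

    select * from {{ ref('customers_predicted_ltv') }}

)

-- Left join so customers without a prediction yet are still included
select
    customers.customer_id,
    customers_predicted_ltv.predicted_ltv

from customers
left join customers_predicted_ltv using (customer_id)
```

Because `customers_predicted_ltv` is ephemeral, dbt will inline it as a CTE here rather than trying to build it.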
It’s important to note that this is a hack: while we’ve managed to represent our non-SQL model in our DAG, that model will not be updated by `dbt run`. As a result, it’s likely that your dbt models and your non-SQL model will be updated on different cadences, which can lead to odd edge cases.
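One way to catch drift between the two cadences is a singular dbt test that returns rows (and therefore fails) when the predictions look stale. This is only a sketch, and it assumes the non-SQL process stamps an `updated_at` column on the table, which yours may not have:

```sql
-- tests/assert_customers_predicted_ltv_fresh.sql (hypothetical)
-- A singular dbt test: any row returned is treated as a failure.
-- Assumes the non-SQL process writes an updated_at timestamp; the
-- two-day threshold is an arbitrary placeholder.
select max(updated_at) as most_recent_update
from {{ ref('customers_predicted_ltv') }}
having max(updated_at) < current_timestamp - interval '2 days'
```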
I’d love to hear if anyone has used a similar, or different, approach, and what challenges they encountered!