Representing non-SQL models in a dbt DAG

Sometimes, there’s a table in your warehouse that’s created by a non-SQL process, and it would be useful for your dbt project to “know” about it. For example, let’s say you have a table, customers_predicted_ltvs, with one record per customer_id:

customer_id | predicted_ltv
----------- | -------------
1           | 120
2           | 115

This table of estimates is created by :sparkles: machine learning :sparkles: and uses transformations that can’t be done in SQL, so we’re going to call it a non-SQL model. The non-SQL model uses some dbt models as inputs. Further, you want to use the output of this non-SQL model in other dbt models.

:warning: First of all, before going down this rabbit hole, really check whether it’s genuinely impossible to represent this model in SQL. Some warehouses offer inbuilt machine learning materializations (check out this BigQuery package as an example), and most warehouses support Python or JavaScript UDFs and include native stats functions. Whenever possible, try to make this non-SQL model a SQL model.
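For instance, on BigQuery, an LTV model can sometimes be trained entirely in SQL with BigQuery ML. Here’s a rough sketch of the shape this takes; the dataset, table, and column names below are all made up:

-- Train a model entirely in SQL with BigQuery ML (all names here are hypothetical)
create or replace model my_dataset.customer_ltv_model
  options (model_type = 'linear_reg', input_label_cols = ['ltv']) as
select
  number_of_orders,
  days_since_first_order,
  ltv
from my_dataset.customer_features;

-- Predictions can then be queried like any other relation.
-- BigQuery names the output column predicted_<label>, i.e. predicted_ltv here.
select customer_id, predicted_ltv
from ml.predict(
  model my_dataset.customer_ltv_model,
  (select * from my_dataset.new_customer_features)
);

But if you really can’t make this a SQL model, let’s go on…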

It may be a good idea to represent this non-SQL model in your dbt project — for one, representing it in your project makes it documentable. Secondly, without representing it in your project, it’s possible that you could end up in a weird cycle, where you accidentally use the output of your non-SQL model as an input to a dbt model, and that dbt model is also an input to the non-SQL model. Making your project “aware” of this node will help prevent this since dbt won’t let you create cyclic graphs.

So, how should we represent this table?

  • Is this non-SQL model a source? Not really, since it has upstream dependencies that are part of your dbt project.

  • Is this non-SQL model an exposure? Kind of, since it’s an interface that consumes dbt models, but again, not really — exposures can’t have downstream dependencies in your project.

  • Is this non-SQL model a dbt model? Also no, since it doesn’t get created as part of your dbt project. But the nice thing about dbt models is that they have upstream and downstream dependencies, just like this machine learning model, so it’s the closest we have!

A hacky way to represent this non-SQL model in your project is to add it as a dbt model, with a simple select *.

models/customers_predicted_ltvs.sql

{{ config(materialized='ephemeral') }} 
/*
This table is created by a non-SQL process (you should probably add more info IRL),
which selects from the following dbt models:
{{ ref('customers') }}
{{ ref('orders') }}
*/

select * from my_schema.customers_predicted_ltvs

Things to note:

  • including the refs in a comment means that the edges in your DAG will be drawn between the upstream dbt models and this non-SQL model
  • declaring this as an ephemeral model means dbt won’t try to build it itself
  • now this can be documented with descriptions in .yml files (see the sketch after this list) :slight_smile:
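As a sketch, that documentation might live in a .yml file like this (the file name and descriptions are placeholders):

models/schema.yml

version: 2

models:
  - name: customers_predicted_ltvs
    description: >
      Predicted lifetime value per customer. Created by an external ML
      process rather than by dbt run; see the model's SQL for details.
    columns:
      - name: customer_id
        description: One record per customer.
      - name: predicted_ltv
        description: The customer's predicted lifetime value.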

Then, in other models, you can create downstream dependencies!
models/fct_customers.sql

with customers_predicted_ltvs as (
  select * from {{ ref('customers_predicted_ltvs') }}
),
...
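Because the hack model is ephemeral, dbt injects it as a CTE at compile time rather than building it. The compiled SQL for fct_customers ends up looking roughly like this (the generated CTE name varies between dbt versions):

with __dbt__cte__customers_predicted_ltvs as (
  select * from my_schema.customers_predicted_ltvs
),
customers_predicted_ltvs as (
  select * from __dbt__cte__customers_predicted_ltvs
),
...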

It’s important to note that this is a hack: while we’ve managed to represent our non-SQL model in our DAG, that model will not be updated by a dbt run. As a result, it’s likely that your dbt models and your non-SQL model will be updated on different cadences, which can lead to odd edge cases, such as fresh dbt models being joined to stale predictions.

I’d love to hear if anyone has used a similar, or different, approach, and what challenges they encountered!


We use a different approach to solve the same problem, which has its own advantages and disadvantages. For a similar use case (an ML model to predict customer LTV), we have dbt models both upstream and downstream of the non-SQL ML model. We handle this via external scheduling & dbt model tagging to run the various parts of our dbt workflow. In practice, it looks like this:

  • dbt workflow 1 (includes feature engineering SQL to prepare data that our predictive model needs)
  • unload of the feature engineering tables from our warehouse (into S3)
  • ML workflow (via AWS step functions)
  • load of the ML output into the warehouse
  • dbt workflow 2 (these models are tagged)

The models in the two workflows are tagged differently: those in workflow 1 are untagged, while those in workflow 2 each have a tag. This allows us to trigger each group separately.

This solution requires that each step be triggered in order by an external scheduling system, but it works quite well & is pretty flexible. In our use case, the cadence of the ML & the overall workflow is the same (daily), and this setup makes sure the steps line up correctly.
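To make the tagging piece concrete, here’s a minimal sketch of what a workflow-2 model might look like. The tag name post_ml, the model, and the table the ML workflow loads are all hypothetical:

models/fct_customer_scores.sql

-- A workflow-2 model: reads the table that the ML workflow loaded into the warehouse
{{ config(tags=['post_ml']) }}

select * from analytics.ml_predicted_ltvs

The external scheduler can then run each half of the DAG separately: dbt run --exclude tag:post_ml for workflow 1 (the untagged models), and dbt run --select tag:post_ml for workflow 2 (the tagged, post-ML models).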


@claire this is a great post! thanks so much.

one thing I added to your “hack” script was to automatically generate the schema and table name.

you can have models/customers_predicted_ltvs.sql as

{{ config(materialized='ephemeral') }} 
/*
This table is created by a non-SQL process (you should probably add more info IRL),
which selects from the following dbt models:
{{ ref('customers') }}
{{ ref('orders') }}
*/
-- target.schema resolves to the schema dbt is configured to build into;
-- model.name resolves to this file's name
select * from {{ target.schema }}.{{ model.name }}