Dynamically create models using schema files

Hi,
We have a requirement to create models (.sql files) dynamically from schema files. We have around 10 schema files and want to generate models from them, and these schema files keep changing. I tried using generate_base_model, but it works through analyses and looks for tables that already exist in the database. Any suggestions on how to handle this scenario?

Can you post an example of the schema file you’re trying to input, and the sort of output you would like to see?

Note: @joellabes (dbt Labs) originally posted this reply in Slack. It might not have transferred perfectly.

Hi,
Below is an example version 1 person.json schema file stored in an AWS S3 location (assuming a single location where we maintain all the schemas):

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "firstName": {
      "type": "string"
    },
    "lastName": {
      "type": "string"
    },
    "age": {
      "type": "integer"
    }
  },
  "required": [
    "firstName",
    "lastName",
    "age"
  ]
}

I want to dynamically create a model, person.sql, like the one below:

with source as (
    select * from person
)

select
    firstName,
    lastName,
    age
from source

In the future, the schema may change to a version 2 like the one below (adding a dob column):

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "firstName": {
      "type": "string"
    },
    "lastName": {
      "type": "string"
    },
    "age": {
      "type": "integer"
    },
    "dob": {
      "type": "string"
    }
  },
  "required": [
    "firstName",
    "lastName",
    "age",
    "dob"
  ]
}

I then want to regenerate the person.sql model dynamically, like the version below (with the dob column):

with source as (
    select * from person
)

select
    firstName,
    lastName,
    age,
    dob
from source

OK that makes sense! The first problem you’re going to have is that dbt runs inside your warehouse, and can only do things that you could do in straight SQL.

Which warehouse are you using? You’re going to need some sort of function that can ingest the file from S3 and make it available in SQL. Perhaps a UDF that returns a string of json?

If you can do that, then you should be able to parse a json blob relatively easily and create a model using jinja templating. You’d need to make a file for each model you want to generate.

Once you’ve done that, you will still have the problem that you won’t be able to automatically attach tests to the models because yaml files can’t be dynamically generated.

Honestly, if you only have 10 files, it sounds like a lot more effort than it's worth to make this 100% dynamic. How often do they change, and is the extra functionality really worth the added complexity?

If you really want to go down this path though, an alternative approach might be to template the sql and yaml files in Python and then run dbt as normal on the results of that work.
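To make that route concrete, here's a minimal sketch of what the Python templating step could look like. Everything in it is an assumption for illustration: it assumes the JSON Schema files (like person.json above) have already been pulled down from S3 into a local schemas/ folder, and the paths, file names, and helper functions are all made up rather than anything dbt provides.

# generate_models.py -- hypothetical pre-dbt codegen script (sketch only).
# Reads JSON Schema files from ./schemas/ and writes a .sql model plus a
# minimal .yml (with not_null tests on required columns) per schema file.
import json
from pathlib import Path

SCHEMA_DIR = Path("schemas")
MODEL_DIR = Path("models/generated")

MODEL_TEMPLATE = """with source as (
    select * from {table}
)

select
{columns}
from source
"""


def build_model(table, columns):
    """Render a person.sql-style model from a list of column names."""
    column_sql = ",\n".join("    {}".format(col) for col in columns)
    return MODEL_TEMPLATE.format(table=table, columns=column_sql)


def build_schema_yml(table, required):
    """Render a minimal schema.yml attaching not_null tests to required columns."""
    lines = ["version: 2", "", "models:", "  - name: {}".format(table), "    columns:"]
    for col in required:
        lines += ["      - name: {}".format(col), "        tests:", "          - not_null"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    MODEL_DIR.mkdir(parents=True, exist_ok=True)
    for schema_file in SCHEMA_DIR.glob("*.json"):
        table = schema_file.stem                      # e.g. "person"
        schema = json.loads(schema_file.read_text())
        columns = list(schema.get("properties", {}))  # firstName, lastName, age, ...
        required = schema.get("required", [])

        (MODEL_DIR / "{}.sql".format(table)).write_text(build_model(table, columns))
        (MODEL_DIR / "{}.yml".format(table)).write_text(build_schema_yml(table, required))

You could run something like this as a step in CI or your orchestrator before dbt run, so that dbt itself only ever sees ordinary .sql and .yml files.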

Thank you for your input.
We are using a Snowflake warehouse. We have multiple customers, and the schema files are slightly different for each one, roughly 10-12 schema files per customer. These schemas change about once every six months, or even less often depending on the customer. That's why we were considering a dynamic model creation approach.

OK - dbt requires a .sql file for each model that you want to have materialized in your database, so you have two options:

• If each customer needs their own set of tables (customer_1/person.sql, customer_2/person.sql…), you’d want some sort of codegen tool to make the template files that each reference a macro (there’s a rough sketch of this after the list). It’s up to you whether you dynamically generate those from scratch at the start of each run, or only re-trigger them when you get a new customer or their schema changes. This can’t be done directly in dbt.
• If your requirements allow a single table with each customer’s data unioned together (person.sql, address.sql…), you could check out https://discourse.getdbt.com/t/unioning-identically-structured-data-sources/921/1
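For the first option, the codegen step could be as simple as looping over customers and writing thin model files that delegate to a shared macro. This is only a sketch under assumptions: the customer list, the directory layout, and a generate_person() macro are all hypothetical, not something dbt ships with.

# make_customer_models.py -- hypothetical codegen step run before dbt (sketch only).
# Writes models/customer_<n>/person.sql files that each call an assumed shared
# macro, generate_person(), which would hold the actual SQL logic.
from pathlib import Path

CUSTOMERS = ["customer_1", "customer_2"]  # in practice, read from config or a database

MODEL_BODY = "{{{{ generate_person('{customer}') }}}}\n"

for customer in CUSTOMERS:
    model_dir = Path("models") / customer
    model_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "person.sql").write_text(MODEL_BODY.format(customer=customer))

Whether you regenerate these stubs on every run or only when a customer is added or their schema changes is the same trade-off described in the first bullet above.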

Note: @joellabes (dbt Labs) originally posted this reply in Slack. It might not have transferred perfectly.

We have thousands of customers and are planning to load their data into separate databases.

We are looking at the following options now:

  • Have one project for each customer, with their own schemas, loading data into separate databases and tables.
  • Have one project for all customers that share a similar schema.

I’ll be interested to hear what you ultimately decide on!