Does dbt add primary/foreign key constraints?

Hi there,

I’m a newbie to this tool. From what I’ve read and tested so far, I had a mistaken impression of what the ‘relationships’ test does.

I’ve set relationships between tables, and when I run the test command those tests are successful (i.e. PASS). However, the tables in the data warehouse I’m developing currently have no PK/FK constraints - I have neither defined them in Postgres myself nor figured out where to define them within the model’s schema.

  • What does dbt actually check for in the case of a ‘relationships’ test?
  • Where would I define the PK/FK constraints within the model? I haven’t seen documentation about this - possibly I missed that part.

Short snippet of the schema.yml:
version: 2

models:
  - name: customers
    description: A table containing all customers
    columns:
      - name: uid
        description: This is a unique identifier for the customer
        tests:
          # - unique
          - not_null
      - name: suppliers
        description: This is a unique identifier in the supplier table
        tests:
          - not_null
          - relationships:
              to: ref('customers')
              field: uid

Thanks in advance!

Hey @oferk83, welcome!

dbt is typically used with analytical data warehouses like Redshift, Snowflake, and BigQuery. In these databases, column constraints are usually either unsupported or unenforced by the database. For this reason, dbt has the ability to assert data constraints without actually materializing them in the database.

When you run dbt test, dbt will generate SQL queries that assert the validity of the tests that you’ve defined for your models. You can check out the definition of the relationships schema test in the dbt repo. The unique, not_null, and other tests are all defined similarly.
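
To make that concrete, the relationships test from the schema.yml above compiles to something roughly like the following (the exact SQL varies by dbt version, and the analytics schema name is just an example). The test passes when the query returns a count of zero:

-- find "child" rows whose foreign key value has no matching row in the parent
select count(*)
from (
    select suppliers as from_field
    from analytics.customers      -- the model the test is defined on
    where suppliers is not null
) as child
left join (
    select uid as to_field
    from analytics.customers      -- the ref('customers') relation
) as parent
    on child.from_field = parent.to_field
where parent.to_field is null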

With this paradigm, you can also test other attributes of your models that can’t be expressed as column constraints in your database. Check out some examples of those in the dbt-utils repo.

I’m not sure about BigQuery, but on both Snowflake and Redshift, if you do define a constraint, not only will it not be enforced by the database, but queries may also return incorrect results if the underlying data violates the logical rule the constraint defines.

So defining constraints in the database can be risky if there is ever a possibility of the data being incorrect. Yes, you would catch the problem after the fact if a test was defined in dbt, but a business user might have run a query in the interim, before you were able to remedy the data problem, and gotten incorrect results.

So we have tended to shy away from defining constraints in Snowflake and Redshift when working with them. For now the risks outweigh the benefits, although if things were to change we would revisit the question.

Thanks @josh - really good points!

We don’t have any plans to support adding column constraints natively in dbt, but some folks have made this work using post-hooks to add constraints. This might look like:

{{
  config(
    post_hook='create index if not exists "{{ this.name }}__index_on_COL_NAME" on {{ this }} ("COL_NAME")'
  )
}}
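
Applied to this thread’s original question, the same pattern could be used for a constraint instead of an index. A minimal sketch, assuming a Postgres warehouse and the uid column from the schema above (on an incremental model you would also need to guard against adding the constraint twice):

{{
  config(
    post_hook='alter table {{ this }} add primary key (uid)'
  )
}}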

We have a fledgling postgres package that implements these macros, which might be helpful here: https://github.com/fishtown-analytics/postgres

I haven’t used these macros before, but they might be a good starting point!

I didn’t see anyone mention it here, but the main case for primary/foreign key constraints in data warehouses is data discoverability.

Two main use cases off the top of my head:

  1. Navigating related data and auto-completing SQL queries get a lot better, in DataGrip at least. Database clients like it use primary key and foreign key information to auto-complete join conditions when writing SQL queries. When you query data, you can also use the context menu to select rows that match a foreign key column from a query result, for example.

  2. Metabase automatically fills in table metadata for you if foreign and primary key information is available. With that information in place, the Metabase query builder can show the end user the fields of the joined table so they can be filtered on.

My thoughts:

Someone wanting to take advantage of this today would have to use post-hooks to implement it. It’s a little bit clunky, but it works.

Since what I’m describing is mostly about documenting the relationships between models, I think the sensible place for it would be the YAML files, and then we would need some option to generate the “alter table” statements that make those constraints happen.

@drew Question for you: do we have access to the data in the YAML files when executing post-hooks? I’m thinking it would be relatively easy to create a macro, invoked from a post-hook, that parses the list of columns and generates those exact “alter table” statements.

Hey @thalesmello - I really like the way you’re thinking about this! I don’t believe the yaml file contents are currently available in the hook context, but this is definitely a workflow I’d like to enable.

Presumably we’d want to do something like the following:

  1. add a primary_key test which effectively just implements unique + not_null (a rough sketch follows below)
  2. provide some knowledge of the existing tests for a model in its compilation context
  3. make it easy to distinguish between views/tables in these hooks: we definitely won’t want to add these constraints on views!

Do you buy all of this?
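
For what it’s worth, point 1 could probably be prototyped today as a custom schema test. A rough sketch (illustrative, not an existing dbt feature), using the test_* macro convention where the compiled query returns a count of failing records:

{% macro test_primary_key(model, column_name) %}

    select count(*)
    from (

        -- rows where the column is null
        select {{ column_name }}
        from {{ model }}
        where {{ column_name }} is null

        union all

        -- values that appear more than once
        select {{ column_name }}
        from {{ model }}
        group by {{ column_name }}
        having count(*) > 1

    ) validation_errors

{% endmacro %}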


I do agree with the things you pointed out.

Regarding your third point, that’s something I worked around by checking

{% if this.materialized in ['table', 'incremental'] %}

when I was adding indices in our old Postgres warehouse.
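
Packaged up, that guard might look something like the following when called from a post-hook. The macro name and column are illustrative, and you may need to handle the empty-string case on views depending on your dbt version:

{% macro create_index_if_table(column_name) %}
    {% if this.materialized in ['table', 'incremental'] %}
        create index if not exists "{{ this.name }}__index_on_{{ column_name }}"
            on {{ this }} ("{{ column_name }}")
    {% endif %}
{% endmacro %}

It could then be invoked from a model’s config, e.g. post_hook="{{ create_index_if_table('uid') }}", following the same nested-braces pattern as the earlier example.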
