Foreign Key/Primary Key Constraint Creation

Hi there,

I’m a newbie to this tool. From what I’ve read and tested so far, I was under a false impression of what the ‘relationships’ test does.

I’ve set relationships between tables, and when I run the test command those tests are successful (i.e. PASS). However, the tables in the data warehouse I’m currently developing have no PK/FK constraints: I haven’t defined them in Postgres myself, nor have I figured out where to define them within the model’s schema.

  • What does dbt actually check for in the case of ‘relationships’?
  • Where would I define the PK/FK constraints within the model? I haven’t seen documentation about this; possibly I missed that part.

A short snippet of the schema.yml:
version: 2

models:
  - name: customers
    description: A table containing all customers
    columns:
      - name: uid
        description: This is a unique identifier in the customer table
        tests:
          # - unique
          - not_null
      - name: suppliers
        description: This is a unique identifier in the supplier table
        tests:
          - not_null
          - relationships:
              to: ref('customers')
              field: uid

Thanks in advance!

Hey @oferk83, welcome!

dbt is typically used with analytical data warehouses like Redshift, Snowflake, and BigQuery. In these databases, column constraints are usually either unsupported or unenforced by the database. For this reason, dbt has the ability to assert data constraints without actually materializing them in the database.

When you run dbt test, dbt will generate SQL queries that assert the validity of the tests that you’ve defined for your models. You can check out the definition of the relationships schema test in the dbt repo. The unique, not_null, and other tests are all defined similarly.

With this paradigm, you can also test other attributes of your models that can’t be expressed as column constraints in your database. Check out some examples of those in the dbt-utils repo.

@drew - thanks for the prompt response!

Yeah, I see that. However, I’m wondering what the best practice is, within dbt, for creating the PK and FK constraints that are necessary for dimension and fact tables, regardless of the database (Redshift, Snowflake, and in my case Postgres).
I’m initially trying to develop a star schema with dbt to assess whether this tool could be useful for my team for re-creating the data warehouse, as well as evolving it when necessary, by analysts rather than engineers.

Really excited about this tool!

Ofer

I’m not sure about BigQuery, but for both Snowflake and Redshift if you do define a constraint not only will it not be enforced by the database, but queries may return incorrect results if the underlying data violates the logical rule defined by the constraint.

So it can be risky to define constraints in the database if there is ever a possibility of the data being incorrect. Yes, you would catch it after the fact if a test was defined in dbt, but a business user might have run a query in the interim, before you were able to remedy the data problem, and gotten incorrect results.

So we have tended to shy away from defining constraints in Snowflake and Redshift when working with them. The risks outweighed the benefits for now, although if things somehow were to change we would revisit the question.

Thanks @josh - really good points!

We don’t have any plans to support adding column constraints natively in dbt, but some folks have made this work using post-hooks to add constraints. This might look like:

{{
  config(
    post_hook='create index if not exists "{{ this.name }}__index_on_COL_NAME" on {{ this }} ("COL_NAME")'
  )
}}

We have a fledgling postgres package that implements these macros, which might be helpful here: https://github.com/fishtown-analytics/postgres

I haven’t used these macros before, but might be a good starting point!
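In the same spirit, an actual PK constraint could be added via a post-hook. This is only a sketch: the `uid` column and the `stg_customers` model are placeholders, and it assumes a database like Postgres that accepts `alter table ... add primary key` (Redshift/Snowflake won’t enforce it anyway):

```sql
-- models/customers.sql (sketch; column and model names are placeholders)
{{
  config(
    materialized='table',
    post_hook='alter table {{ this }} add primary key (uid)'
  )
}}

select uid, name
from {{ ref('stg_customers') }}
```

Note that this only makes sense for table-like materializations, since you can’t add a constraint to a view.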

@josh @drew - thanks for your comments and experience.
I am still baffled, though, regarding the definition of constraints in these modern databases such as Redshift and Snowflake.
How, then, are fact and dimension tables designed effectively in these databases?
Also, within dbt, I thought it would be possible to build those objects as the foundation for creating analytics on top of that layer, or for extending the data warehouse / data marts when necessary.

My thought was to have a source --> staging layer --> dwh layer (facts/dimensions for reporting) --> analytical layer (any additional manipulations required or relevant).

I’ll definitely take a look at the suggested Postgres repo. I’ve seen and spoken to a handful of users who use dbt, or are learning to use it, with Postgres as their database, both privately and at tech companies.

@oferk83 this is one of the big selling points of modern analytical databases! While Postgres requires you to create indexes for fast joins/lookups, databases like Redshift, Snowflake, and BigQuery are built on fundamentally different architectures. This isn’t a knock on Postgres (it’s a great database!) – these other databases are just specifically tailored for analytical workloads.

I wrote a little bit about this here if you’re interested in learning more: Choosing a data warehouse

I think your approach of source --> staging layer --> dwh layer is a really good one! I guess the big idea is: you can do this on an analytical database without needing to think about column constraints. There are performance tuning features of these databases that can help speed up queries (like sort + dist keys on Redshift, clustering on Snowflake, clustering/partitioning on BigQuery), but these aren’t exactly column constraints, and they’re easier for us to support from dbt.
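On Redshift, for example, those tuning knobs are exposed directly as model configs. A sketch, where the column names and the `stg_orders` model are placeholders:

```sql
-- models/fct_orders.sql (sketch; column and model names are placeholders)
{{
  config(
    materialized='table',
    sort='created_at',
    dist='customer_uid'
  )
}}

select order_id, customer_uid, created_at, amount
from {{ ref('stg_orders') }}
```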

Hope this helps!

Also, @oferk83 – is the confusion that you think that you can’t have a relationship at some higher semantic level without defining it in the database specifically with a PK/FK constraint?

FYI in any database platform you can join between two tables on any arbitrary columns and/or calculations. Formally defining the constraint just gives you some additional benefits, which will vary depending on the database platform, but actually defining the constraint isn’t required in any way to join Table1.column1 to Table2.column2.

So in some platonic sense in that scenario, the “constraint” still exists in the sense that the semantic meaning of the data in the tables is such that you “should” be able to do the join between the tables. It just hasn’t been defined in the database formally via a constraint.
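In other words, a join like the following works whether or not any constraint is defined; the table and column names here are made up for illustration:

```sql
-- No FK constraint is required for this join to run;
-- the constraint would only document/enforce the relationship.
select
    orders.order_id,
    customers.name
from orders
join customers
    on orders.customer_uid = customers.uid
```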

Apologies if I have misunderstood your question.

No, I’m well aware that joins can be made without defining constraints - that is clearly the definition of table relationships.
I think the confusion is in understanding exactly what the ‘relationships’ test is actually doing when it runs, given that there are no constraints placed on the tables/models.

Got it. The testing feature is verifying that all of the values in the child table and column in question are present in the relevant parent table and column. That is all the test is doing.
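Roughly speaking, the SQL dbt compiles for a relationships test looks something like this (simplified sketch with placeholder table/column names; the exact SQL varies between dbt versions):

```sql
-- Counts "orphaned" child rows; the test passes when this returns 0.
select count(*)
from child_table
left join parent_table
    on child_table.foreign_key = parent_table.primary_key
where child_table.foreign_key is not null
  and parent_table.primary_key is null
```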


I didn’t see anyone mention it here, but the main case for primary/foreign key constraints in data warehouses is data discoverability.

Two main use cases on the top of my mind:

  1. Navigating related data and auto-completing SQL queries become a lot better, at least in DataGrip. These database clients use primary key and foreign key information to auto-complete join conditions when writing SQL queries. When you query data, you can use the context menu to select data matching a foreign key column from a query result, for example.

  2. Metabase automatically fills in table metadata for you if foreign and primary key information is available. With that information in place, the Metabase query builder can show end users the fields of the joined table so they can filter on them.
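Concretely, the statements these tools pick up on are plain DDL; the schema, table, and column names below are illustrative:

```sql
-- Define the keys these clients use for discoverability
alter table analytics.customers
    add primary key (uid);

alter table analytics.orders
    add foreign key (customer_uid)
    references analytics.customers (uid);
```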

My thoughts:

Someone wanting to take advantage of that would have to use post-hooks to implement it. It’s a little bit clunky, but it works.

Since what I talked about is mostly about documenting the relationships between models, I think the sensible place for this would be in the YAML files, and then we should somehow have an option to generate the “alter table” statements to make those happen.

@drew, a question for you: do we have access to the data in the YAML files when executing post-hooks? I’m thinking it would be relatively easy to create a macro, invoked from a post-hook, that parses the list of columns and generates those exact “alter table” statements.

Hey @thalesmello - I really like the way you’re thinking about this! I don’t believe the yaml file contents are currently available in the hook context, but this is definitely a workflow I’d like to enable.

Presumably we’d want to do something like the following:

  1. add a primary_key test which effectively just implements unique + not_null
  2. provide some knowledge of the existing tests for a model in its compilation context
  3. make it easy to distinguish between views/tables in these hooks: we definitely won’t want to add these constraints on views!

Do you buy all of this?
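Point 1 could be prototyped today as a custom schema test macro. A rough, untested sketch, following the convention that a schema test passes when its query returns a count of 0:

```sql
-- macros/test_primary_key.sql (sketch; macro name is hypothetical)
{% macro test_primary_key(model, column_name) %}

select count(*)
from (
    -- rows that violate uniqueness or non-nullness
    select {{ column_name }}
    from {{ model }}
    group by {{ column_name }}
    having count(*) > 1 or {{ column_name }} is null
) validation_errors

{% endmacro %}
```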

I do agree with the things you pointed out.

Regarding your third point, that’s something I worked around by checking

{% if this.materialized in ['table', 'incremental'] %}

when I was adding indices to our old Postgres warehouse.
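Wrapping that check into a macro keeps the guard in one place. A sketch, assuming a hypothetical `add_index` macro invoked from a post-hook (e.g. `post_hook="{{ add_index('uid') }}"`), which renders to nothing for views:

```sql
-- macros/add_index.sql (sketch; macro and column names are placeholders)
{% macro add_index(column_name) %}
    {% if this.materialized in ['table', 'incremental'] %}
        create index if not exists "{{ this.name }}__index_on_{{ column_name }}"
        on {{ this }} ("{{ column_name }}")
    {% endif %}
{% endmacro %}
```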