Help on organizing queries used for exploration of new database schema

fx86 · June 16, 2023, 5:53am

The problem I’m having

At work, I was directed to a schema (bunch of tables) to figure out the logic for metrics we neeed. These tables are named uniformly but do not have any documentation. I was wondering if there is a way to organize queries so that their output could be used to create documentation within dbt using Python and Jinja ?

Example - I am today running discovery against Schema-X. I would be looking for basic stats on the tables within this schema like - row count, top3 unique values, last updated timestamp etc for each table. I could then aim these queries against a new schema tomorrow and get an automated report. I could do this with Python but I would like to do this with dbt instead.

brunoszdl · June 16, 2023, 5:29pm

Hi @fx86! You can create a macro to get the name of the tables for a specific schema and then for each table run a query that will return the information you need.

For example:

{% macro get_schema_information(schema_name) %}

{% if execute %}

  {% set tables_list = [] %}
  {% for node in graph.nodes.values() if node.schema == schema_name %}
      {% tables_list.append(node.name) %}
  {% endfor %}

  {% for table in tables_list %}
     {% if not loop.last %}
          select
            {{ table }}
             , count(*)
            , max(update_date)
          from {{ target.database }}.{{ target.schema }}.{{ table }}
          union all
    {% else %}
          select
             {{ table }}
             , count(*)
            , max(update_date)
          from {{ target.database }}.{{ target.schema }}.{{ table }}
    {% endif %}
  {% endfor %}

{% endif %}

{% endmacro %}

This macro will loop through all the tables in your schema and get the relevant information about each table and union all into a single table.

Then you can create a model such as schema_information.sql and create it as

#schema_information.sql in your models folder
{{ config(materialized='table'}}

{{ get_schema_information('my_schema') }}

Some things to consider:

It will work only if the columns you are looking for have the same name in all tables, such as update_date
It was just an example, you add more things to this macro
I didn’t test it, may contain compilation errors
I am not sure what you want to retrieve, but I think this example can help
Check how tables are called in your adapter, in this example I assumed bigquery which is called like ‘database’.‘schema’.‘table’

fx86 · June 18, 2023, 9:48am

Thanks @ brunoszdl! So you think macros would be the right place for this sort of code to go into and not analysis ?
Also: thanks for the code.

brunoszdl · June 19, 2023, 12:24am

Macros are good for jinja logic (hard to do with pure SQL) and reusable code.

You can build an analysis that uses this macro

I assumed your problem had complex logic (loop through all the tables in the schema) and it is reusable for every schema, so it seems like a good fit for macros.

But if you have other ideas you can consider them too!

fx86 · June 20, 2023, 6:50am

Thanks for the advice, @brunoszdl !

Topic		Replies	Views
How do I temporarily make an archive copy of a schema? Help jinja , environments , custom-schema	1	947	October 13, 2022
Managing hundreds of schemas in the dbt_project.yml In-Depth Discussions best-practice , adopting-dbt	5	3658	August 26, 2022
Need some help with jinja trying to use a for loop over a run_query result set from a seed table Help	4	15409	November 5, 2022
how to create a transformed table for existing ones Help jinja , bigquery	3	1010	April 14, 2023
dbt macro-using dbt_utils.union_relations to union n number of tables, however not able to make relation parameter for input Help jinja , snowflake , macros , dbt-core	0	635	May 10, 2024

Help on organizing queries used for exploration of new database schema

The problem I’m having

Related topics