Help on organizing queries used for exploration of new database schema

The problem I’m having

At work, I was pointed at a schema (a bunch of tables) and asked to figure out the logic for metrics we need. The tables are named uniformly but have no documentation. Is there a way to organize queries so that their output could be used to create documentation within dbt, using Python and Jinja?

Example: today I am running discovery against Schema-X, looking for basic stats on each table in the schema (row count, top 3 unique values, last-updated timestamp, etc.). Tomorrow I could aim the same queries at a new schema and get an automated report. I could do this with Python, but I would like to do it with dbt instead.
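For instance, the per-table stats query I have in mind would look something like this (schema_x, some_table, and update_date are just placeholders for whatever names the schema actually uses):

select
  count(*) as row_count
  , max(update_date) as last_updated
from schema_x.some_table

plus, per column, something like select some_column, count(*) from schema_x.some_table group by 1 order by 2 desc limit 3 for the top 3 values.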

Hi @fx86! You can create a macro that gets the names of the tables in a specific schema and then, for each table, runs a query returning the information you need.

For example:

{% macro get_schema_information(schema_name) %}

{% if execute %}

  {# Collect the names of all models that live in the given schema #}
  {% set tables_list = [] %}
  {% for node in graph.nodes.values() if node.schema == schema_name %}
      {% do tables_list.append(node.name) %}
  {% endfor %}

  {# Emit one select per table, glued together with union all #}
  {% for table in tables_list %}
          select
            '{{ table }}' as table_name
            , count(*) as row_count
            , max(update_date) as last_updated
          from {{ target.database }}.{{ schema_name }}.{{ table }}
    {% if not loop.last %}
          union all
    {% endif %}
  {% endfor %}

{% endif %}

{% endmacro %}

This macro loops through all the tables in your schema, gathers the relevant information about each one, and unions everything into a single result set.
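To make that concrete, assuming the schema my_schema contains just two models, orders and customers (names invented for illustration), the compiled SQL would look roughly like:

select
  'orders' as table_name
  , count(*)
  , max(update_date)
from my_db.my_schema.orders
union all
select
  'customers' as table_name
  , count(*)
  , max(update_date)
from my_db.my_schema.customers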

Then you can create a model such as schema_information.sql with contents like:

-- schema_information.sql in your models folder
{{ config(materialized='table') }}

{{ get_schema_information('my_schema') }}
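Assuming the model name above, you would build the report with:

dbt run --select schema_information

which materializes it as a table you can query directly, e.g. select * from my_schema.schema_information order by row_count desc.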

Some things to consider:

  • It will only work if the columns you are querying have the same name in every table, such as update_date
  • This is just an example; you can add more things to the macro
  • I didn't test it, so it may contain compilation errors
  • I am not sure exactly what you want to retrieve, but I think this example can help
  • Check how tables are addressed in your adapter; in this example I assumed BigQuery, where a table is referenced as database.schema.table

Thanks @brunoszdl! So you think macros would be the right place for this sort of code, rather than an analysis?
Also: thanks for the code.

Macros are good for Jinja logic (things that are hard to do with pure SQL) and for reusable code.

You can build an analysis that uses this macro.
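For example, assuming the macro above is saved in your macros folder, an analysis can be as small as:

-- analyses/schema_x_discovery.sql
{{ get_schema_information('schema_x') }}

Running dbt compile then renders the full SQL under target/compiled/ without materializing anything in the warehouse, which suits one-off discovery work.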

I assumed your problem involved complex logic (looping through all the tables in a schema) that is reusable across schemas, so it seemed like a good fit for a macro.

But if you have other ideas you can consider them too!


Thanks for the advice, @brunoszdl !