At work, I was directed to a schema (a bunch of tables) to figure out the logic for metrics we need. These tables are named uniformly but have no documentation. I was wondering if there is a way to organize queries so that their output could be used to create documentation within dbt, using Python and Jinja?
Example - today I am running discovery against Schema-X. I would be looking for basic stats on the tables within this schema - row count, top 3 unique values, last updated timestamp, etc. for each table. I could then aim these queries at a new schema tomorrow and get an automated report. I could do this with Python, but I would like to do it with dbt instead.
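For reference, the plain-Python version I have in mind is roughly this sketch (database, schema, table, and column names are just placeholders):

```python
# Hypothetical sketch: build a single UNION ALL discovery query in plain Python.
# The column update_date is an assumption; replace it with whatever your tables use.
def build_discovery_query(database, schema, tables):
    """Return one query giving row count and last update timestamp per table."""
    selects = [
        f"select '{t}' as table_name, count(*) as row_count, "
        f"max(update_date) as last_updated "
        f"from {database}.{schema}.{t}"
        for t in tables
    ]
    return "\nunion all\n".join(selects)

query = build_discovery_query("analytics", "schema_x", ["orders", "customers"])
print(query)
```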
Hi @fx86! You can create a macro to get the name of the tables for a specific schema and then for each table run a query that will return the information you need.
For example:
{% macro get_schema_information(schema_name) %}
{% if execute %}
    {% set tables_list = [] %}
    {% for node in graph.nodes.values() if node.schema == schema_name %}
        {% do tables_list.append(node.name) %}
    {% endfor %}
    {% for table in tables_list %}
    select
        '{{ table }}' as table_name
        , count(*) as row_count
        , max(update_date) as last_updated
    from {{ target.database }}.{{ schema_name }}.{{ table }}
    {% if not loop.last %}union all{% endif %}
    {% endfor %}
{% endif %}
{% endmacro %}
This macro will loop through all the tables in your schema, gather the relevant information about each table, and union the results into a single table.
Then you can create a model such as schema_information.sql:
-- schema_information.sql in your models folder
{{ config(materialized='table') }}
{{ get_schema_information('my_schema') }}
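Assuming the schema contains two tables called orders and customers (made-up names) and an update_date column, the compiled SQL would look roughly like:

```sql
select 'orders' as table_name, count(*) as row_count, max(update_date) as last_updated
from my_database.my_schema.orders
union all
select 'customers' as table_name, count(*) as row_count, max(update_date) as last_updated
from my_database.my_schema.customers
```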
Some things to consider:
- It will only work if the columns you are looking for have the same name in all tables, such as update_date
- This is just an example; you can add more things to this macro
- I didn't test it, so it may contain compilation errors
- I am not sure exactly what you want to retrieve, but I think this example can help
- Check how tables are referenced in your adapter; in this example I assumed BigQuery, where a table is referenced as database.schema.table
Macros are good for Jinja logic (hard to do with pure SQL) and reusable code.
You can also build an analysis that uses this macro.
I assumed your problem has complex logic (looping through all the tables in a schema) and is reusable for every schema, so it seems like a good fit for a macro.
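For example, you could drop a file like this into your analyses/ folder (filename is just a suggestion); running dbt compile will render the query under target/compiled/ without materializing anything:

```sql
-- analyses/schema_discovery.sql
{{ get_schema_information('my_schema') }}
```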
But if you have other ideas you can consider them too!