To give you some ideas, here is what I was playing around with yesterday.
It kind of works, but the solution also brought up other questions that I will share in a separate post.
It is possible to use variables in metric definitions, so I added filters on the partition key column directly into the metric definition itself. The variables I currently use are the start and end timestamps that I get from the Airflow run. For example, an Airflow run that starts at 2023-03-16 12:00 on an hourly schedule means data_interval_start = 2023-03-16 12:00 and data_interval_end = 2023-03-16 13:00.
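The interval logic above can be sketched in Python (a toy stand-in for Airflow's data_interval_start / data_interval_end, which the DAG then hands to dbt, typically via the --vars flag):

```python
from datetime import datetime, timedelta

def hourly_data_interval(logical_start: datetime) -> tuple[str, str]:
    """Toy model of Airflow's hourly data interval: the run covers
    exactly one hour starting at the run's logical start time."""
    fmt = "%Y-%m-%d %H:%M"
    end = logical_start + timedelta(hours=1)
    return logical_start.strftime(fmt), end.strftime(fmt)

# the example from the text: a run starting at 2023-03-16 12:00
start, end = hourly_data_interval(datetime(2023, 3, 16, 12, 0))
# start == "2023-03-16 12:00", end == "2023-03-16 13:00"
```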
metrics:
  - name: orders_finished
    label: Orders Finished
    description: "Number of orders finished"
    model: ref('fact_order')
    calculation_method: count_distinct
    expression: order_id
    timestamp: created
    time_grains: [day, week, month, quarter, year]
    filters:
      - field: state
        operator: '='
        value: "'finished'"
      - field: created_date
        operator: '>='
        value: date('{{ var("data_interval_start") }}')
      - field: created_date
        operator: '<='
        value: date('{{ var("data_interval_end") }}')
    dimensions:
      - city_id
I use the metric by materializing its results into an incrementally updated table with city and daily granularity:
{{
    config(
        materialized="incremental",
        incremental_strategy="merge",
        unique_key=["city_id", "date_day"],
        partition_by="date_day",
        schema="schema_name"
    )
}}

select *
from {{ metrics.calculate(
    [metric('orders_finished')],
    grain='day',
    dimensions=['city_id']
) }}
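The merge behaviour of this incremental model can be sketched as follows (a toy illustration of the idea, not dbt's actual implementation): incoming rows whose (city_id, date_day) key already exists overwrite the stored row, and new keys are appended.

```python
def merge_increment(existing, increment, keys=("city_id", "date_day")):
    """Toy version of an incremental merge: the unique_key decides
    whether an incoming row updates an existing row or is inserted."""
    table = {tuple(row[k] for k in keys): row for row in existing}
    for row in increment:
        table[tuple(row[k] for k in keys)] = row  # update or insert
    return sorted(table.values(), key=lambda r: (r["date_day"], r["city_id"]))

existing = [{"city_id": 1, "date_day": "2023-03-16", "orders_finished": 10}]
increment = [
    {"city_id": 1, "date_day": "2023-03-16", "orders_finished": 12},  # existing key: updated
    {"city_id": 2, "date_day": "2023-03-16", "orders_finished": 5},   # new key: inserted
]
merged = merge_increment(existing, increment)
```

This is why re-running the same hourly window is safe: late-arriving data for an already-loaded (city_id, date_day) pair replaces the old aggregate instead of duplicating it.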
The compiled code looks like this. Notice that the filter on created_date lands exactly where we want it: in the innermost query against the base table, where it can restrict which partitions are scanned:
select *
from
(
    with calendar as (
        select *
        from `schema_name`.`dbt_metrics_default_calendar`
    )

    , model_b8a8c92f4fbcd57d006a0a6bf049e5e5__aggregate as (
        select
            date_day,
            city_id,
            max(metric_date_day is not null) as has_data,
            count(distinct property_to_aggregate__orders_finished) as orders_finished
        from (
            select
                cast(base_model.created as date) as metric_date_day,
                calendar.date_day as date_day,
                calendar.date_day as window_filter_date,
                base_model.city_id,
                (order_id) as property_to_aggregate__orders_finished
            from `schema_name`.`fact_order` base_model
            left join calendar
                on cast(base_model.created as date) = calendar.date_day
            where 1=1
            and (
                state = 'finished'
                and created_date >= date('2023-03-16 12:00:00')
                and created_date <= date('2023-03-16 14:00:00')
            )
        ) as base_query
        where 1=1
        group by 1, 2
    )

    , model_b8a8c92f4fbcd57d006a0a6bf049e5e5__final as (
        select
            parent_metric_cte.date_day,
            parent_metric_cte.city_id,
            coalesce(orders_finished, 0) as orders_finished
        from model_b8a8c92f4fbcd57d006a0a6bf049e5e5__aggregate as parent_metric_cte
    )

    select
        date_day,
        city_id,
        orders_finished
    from model_b8a8c92f4fbcd57d006a0a6bf049e5e5__final
    order by 1 desc
) metric_subq
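Why the placement of that filter matters can be sketched with a toy pruning model (an illustration of the general idea, not any specific engine): a predicate on the partition key lets whole partitions be skipped, while a predicate on any other column is only applied after the rows have already been read.

```python
from datetime import date

def scan_table(partitions, partition_filter=None, row_filter=None):
    """Toy partitioned scan. partition_filter prunes whole partitions
    before any I/O; row_filter is applied per row after reading."""
    rows_read, result = 0, []
    for partition_key, rows in partitions.items():
        if partition_filter is not None and not partition_filter(partition_key):
            continue  # partition pruned: zero rows read from it
        rows_read += len(rows)
        result.extend(r for r in rows if row_filter is None or row_filter(r))
    return rows_read, result

# a table partitioned by created_date, one partition per day
table = {
    date(2023, 3, 15): [{"created": "2023-03-15 09:30", "state": "finished"}],
    date(2023, 3, 16): [{"created": "2023-03-16 12:10", "state": "finished"}],
}

# predicate on the partition key: only the matching partition is read
pruned_reads, _ = scan_table(table, partition_filter=lambda d: d == date(2023, 3, 16))

# predicate only on the raw timestamp column: every partition is read
full_reads, _ = scan_table(table, row_filter=lambda r: r["created"].startswith("2023-03-16"))
```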
For reference, here is the version where I simply add a filter at metric calculation time. It is very obvious that the filter gets applied to the timestamp column, not to the date column I have as the partition key. This results in a full scan of the table.
Metric definition:
- name: orders_finished2
  label: Orders Finished
  description: "Number of orders finished"
  model: ref('fact_order')
  calculation_method: count_distinct
  expression: order_id
  timestamp: created
  time_grains: [day, week, month, quarter, year]
  filters:
    - field: state
      operator: '='
      value: "'finished'"
  dimensions:
    - city_id
  meta:
    owner: '@silja.mardla'
Calculating the metric:
{{
    config(
        materialized="incremental",
        incremental_strategy="merge",
        unique_key=["city_id", "date_day"],
        partition_by="date_day",
        schema="schema_name"
    )
}}

select *
from {{ metrics.calculate(
    [metric('orders_finished2')],
    grain='day',
    dimensions=['city_id'],
    start_date=var("data_interval_start"),
    end_date=var("data_interval_end")
) }}
Compiled code:
select *
from
(
    with calendar as (
        select *
        from `schema_name`.`dbt_metrics_default_calendar`
        where date_day >= cast('2023-03-16 12:00:00' as date)
        and date_day <= cast('2023-03-16 14:00:00' as date)
    )

    , model_79b7f4638595cdb0e8236ee9dacac2fd__aggregate as (
        select
            date_day,
            city_id,
            max(metric_date_day is not null) as has_data,
            count(distinct property_to_aggregate__orders_finished2) as orders_finished2
        from (
            select
                cast(base_model.created as date) as metric_date_day,
                calendar.date_day as date_day,
                calendar.date_day as window_filter_date,
                base_model.city_id,
                (order_id) as property_to_aggregate__orders_finished2
            from `schema_name`.`fact_order` base_model
            left join calendar
                on cast(base_model.created as date) = calendar.date_day
            where 1=1
            and (
                cast(base_model.created as date) >= cast('2023-03-16 12:00:00' as date)
                and cast(base_model.created as date) <= cast('2023-03-16 14:00:00' as date)
            )
            and (
                state = 'finished'
            )
        ) as base_query
        where 1=1
        group by 1, 2
    )

    , model_79b7f4638595cdb0e8236ee9dacac2fd__final as (
        select
            parent_metric_cte.date_day,
            parent_metric_cte.city_id,
            coalesce(orders_finished2, 0) as orders_finished2
        from model_79b7f4638595cdb0e8236ee9dacac2fd__aggregate as parent_metric_cte
    )

    select
        date_day,
        city_id,
        orders_finished2
    from model_79b7f4638595cdb0e8236ee9dacac2fd__final
    order by 1 desc
) metric_subq
And here is another version that does not work. I include created_date as a dimension and add a where clause at metric evaluation time. There are two problems: the where clause gets added in the wrong place (to the outer query, after aggregation), and I can't use variables because they don't get evaluated inside the string.
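The variable problem can be illustrated with a toy renderer (a stand-in for Jinja templating, not dbt's actual machinery): only text wrapped in {{ var("...") }} delimiters gets substituted, while var(...) written as plain text inside a string survives verbatim, which is exactly what shows up in the compiled code below.

```python
import re

def render_vars(template: str, dbt_vars: dict) -> str:
    """Toy stand-in for Jinja rendering: substitute {{ var("name") }}
    placeholders; leave a bare var(name) in plain text untouched."""
    pattern = r'\{\{\s*var\(\s*["\'](\w+)["\']\s*\)\s*\}\}'
    return re.sub(pattern, lambda m: dbt_vars[m.group(1)], template)

dbt_vars = {"data_interval_start": "2023-03-16 12:00:00"}

# inside Jinja delimiters: the variable is substituted
evaluated = render_vars(
    "created_date >= date('{{ var(\"data_interval_start\") }}')", dbt_vars
)
# plain text, as in the where= argument below: left verbatim
literal = render_vars("created_date >= date('var(data_interval_start)')", dbt_vars)
```

One possible fix (an assumption I have not verified here) would be to build the string inside the Jinja call itself, e.g. where="created_date >= date('" ~ var("data_interval_start") ~ "')", so that var() is evaluated before the string is handed to metrics.calculate.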
Metric definition:
- name: orders_finished3
  label: Orders Finished
  description: "Number of orders finished"
  model: ref('fact_order')
  calculation_method: count_distinct
  expression: order_id
  timestamp: created
  time_grains: [day, week, month, quarter, year]
  filters:
    - field: state
      operator: '='
      value: "'finished'"
  dimensions:
    - city_id
    - created_date
  meta:
    owner: '@silja.mardla'
Metric calculation:
{{
    config(
        materialized="view",
        schema="schema_name"
    )
}}

select *
from {{ metrics.calculate(
    [metric('orders_finished3')],
    grain='day',
    dimensions=['city_id', 'created_date'],
    where="created_date >= date('var(data_interval_start)')"
) }}
Compiled code:
select *
from -- depends on: `dbt_silja_mardla_spark`.`dbt_metrics_default_calendar`
(
    with calendar as (
        select *
        from `schema_name`.`dbt_metrics_default_calendar`
    )

    , model_79b7f4638595cdb0e8236ee9dacac2fd__aggregate as (
        select
            date_day,
            city_id,
            created_date,
            max(metric_date_day is not null) as has_data,
            count(distinct property_to_aggregate__orders_finished3) as orders_finished3
        from (
            select
                cast(base_model.created as date) as metric_date_day,
                calendar.date_day as date_day,
                calendar.date_day as window_filter_date,
                base_model.city_id,
                base_model.created_date,
                (order_id) as property_to_aggregate__orders_finished3
            from `schema_name`.`fact_order` base_model
            left join calendar
                on cast(base_model.created as date) = calendar.date_day
            where 1=1
            and (
                state = 'finished'
            )
        ) as base_query
        where 1=1
        group by 1, 2, 3
    )

    , model_79b7f4638595cdb0e8236ee9dacac2fd__final as (
        select
            parent_metric_cte.date_day,
            parent_metric_cte.city_id,
            parent_metric_cte.created_date,
            coalesce(orders_finished3, 0) as orders_finished3
        from model_79b7f4638595cdb0e8236ee9dacac2fd__aggregate as parent_metric_cte
    )

    select
        date_day,
        city_id,
        created_date,
        orders_finished3
    from model_79b7f4638595cdb0e8236ee9dacac2fd__final
    where created_date >= date('var(data_interval_start)')
    order by 1 desc
) metric_subq