The problem I’m having
I'm following up on this (archived) thread: How to dynamically generate input to `partitions` config in BigQuery
As the title suggests, I'm looking to dynamically set `partitions` when using incremental mode with the `insert_overwrite` strategy. The objective is to optimize the incremental model by avoiding the creation of a temporary table and the query against it to find the partitions to replace.
The challenge is that I'm dynamically fetching the partitions inside a macro via `run_query()`, except that `run_query()` runs during execution, while the config block is applied during parsing.
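For illustration, here is a minimal sketch of the pattern that fails (the model, column, and macro names are hypothetical):

```sql
-- models/my_model.sql (sketch): this does NOT work as hoped, because
-- config() is resolved at parse time, when `execute` is False, so a
-- run_query() inside the macro can't return anything useful yet.
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {'field': 'day', 'data_type': 'date'},
    partitions = get_partitions_to_replace('my_target', 'my_source')
) }}

select * from {{ ref('my_source') }}
```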
What I’ve already tried
I tried to use `adapter.execute()`, but I'm not managing to fetch any results.
Any suggestions?
Thanks in advance
I believe `bq_insert_overwrite_sql` is the macro that determines how to generate the query, and it chooses between `bq_static_insert_overwrite_sql` and `bq_dynamic_insert_overwrite_sql` depending on whether the `partitions` config is supplied. You could shim this macro and override the logic to do whatever you want. I believe this is all run at compile time.
https://github.com/dbt-labs/dbt-bigquery/blob/f2804c0f3184ebe36cada9113802051c75c347d0/dbt/include/bigquery/macros/materializations/incremental_strategy/insert_overwrite.sql#L40
Note: @Mike Stanley originally posted this reply in Slack. It might not have transferred perfectly.
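For reference, the branching described above could be shimmed roughly like this. The parameter lists below are paraphrased from the linked commit and may differ in your adapter version, so copy the exact signatures from the `insert_overwrite.sql` you actually run:

```sql
-- macros/bq_insert_overwrite_sql.sql (sketch): a project macro with the
-- same name shadows the adapter's built-in version, letting you change
-- how the static vs. dynamic path is chosen.
{% macro bq_insert_overwrite_sql(
    tmp_relation, target_relation, sql, unique_key,
    partition_by, partitions, dest_columns, tmp_relation_exists
) %}
  {% if partitions is not none and partitions != [] %}
    {# static path: a partitions list was supplied in the config #}
    {{ bq_static_insert_overwrite_sql(tmp_relation, target_relation, sql,
        partition_by, partitions, dest_columns, tmp_relation_exists) }}
  {% else %}
    {# dynamic path: dbt discovers the partitions from the temporary table #}
    {{ bq_dynamic_insert_overwrite_sql(tmp_relation, target_relation, sql,
        unique_key, partition_by, dest_columns, tmp_relation_exists) }}
  {% endif %}
{% endmacro %}
```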
Hey Mike, thanks for responding. I'm not sure I get how I can override the macro and apply my own logic; even if I remove the dynamic part, I'd still need to pass the `partitions` param somehow.
Note: @Alaa Houimel originally posted this reply in Slack. It might not have transferred perfectly.
The `partitions` config has its value set at parse time, so you have to deal with the limited Jinja context that's available at parse time. But how dbt actually bakes that `partitions` config into the query it sends to the database is set up at compile time. If you look at line 43, you can see it checks whether `partitions` is defined, and then on line 56 it turns that static list of partitions into the list that will ultimately be constructed into the SQL query.
So you could either patch the logic on line 43 to detect whether the list is static (at parse time) or dynamic (at compile time) and go down different paths, or patch it at line 56 to change how it builds the list called `predicate` to use some list other than the `partitions` config, or patch it at line 89 as well, which is where the `predicate` is used.
You shim a macro like this just by creating a macro in your project with the same name. By default, dispatch searches your project first, and then other locations such as the adapter.
Note: @Mike Stanley originally posted this reply in Slack. It might not have transferred perfectly.
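As a sketch of the "patch line 43" idea, a helper inside the shimmed macro could swap a parse-time sentinel for a runtime lookup (the helper and sentinel names are hypothetical):

```sql
-- macros/my_resolve_partitions.sql (sketch)
{% macro my_resolve_partitions(partitions, target_relation) %}
    {% if execute and partitions == ['__runtime__'] %}
        {# sentinel supplied: fetch the real list from partition metadata #}
        {% set lookup_sql %}
            select format("date('%t')", parse_date('%Y%m%d', partition_id))
            from `{{ target_relation.database }}.{{ target_relation.schema }}.INFORMATION_SCHEMA.PARTITIONS`
            where table_name = '{{ target_relation.identifier }}'
              and partition_id not in ('__NULL__', '__UNPARTITIONED__')
        {% endset %}
        {{ return(run_query(lookup_sql).columns[0].values() | list) }}
    {% else %}
        {# static list (or empty, for the dynamic path): pass through unchanged #}
        {{ return(partitions) }}
    {% endif %}
{% endmacro %}
```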
One thing we experimented with was allowing a config that would tell dbt how to obtain the partition information from a database location where we maintained that information. We patched line 43 so the logic had three branches: if given that partition information location, query the database; else if given a static list, use the existing static logic; else use the existing dynamic logic. This did work, but we weren't very happy with the approach, so we swapped to using the INFORMATION_SCHEMA.PARTITIONS table in our partition predicates.
Note: @Mike Stanley originally posted this reply in Slack. It might not have transferred perfectly.
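One way that "INFORMATION_SCHEMA.PARTITIONS in the partition predicates" can look inside a model is sketched below (the `day` column, daily granularity, and source name are assumptions):

```sql
-- models/my_model.sql (sketch): filter the source scan using partition
-- metadata rather than a runtime-generated partitions list.
select *
from {{ ref('my_source') }}
{% if is_incremental() %}
where day >= (
    select parse_date('%Y%m%d', max(partition_id))
    from `{{ this.database }}.{{ this.schema }}.INFORMATION_SCHEMA.PARTITIONS`
    where table_name = '{{ this.identifier }}'
      and partition_id not in ('__NULL__', '__UNPARTITIONED__')
)
{% endif %}
```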
Part of the reason we didn't like this is that it would force us to maintain these rather low-level shims, and we'd need to keep revalidating them and bringing them into new versions of the adapter. It probably would've been better to implement it as a custom materialisation, but that itself is hard to migrate off of once you've used it a lot.
Note: @Mike Stanley originally posted this reply in Slack. It might not have transferred perfectly.
Thanks for the thorough help! I understand how this can become a headache to maintain over time and across updates.
For now, it's either that, or running separate code (a Python script, for example) to fetch the partitions I need to replace and then passing that as a static variable to dbt via `dbt run ... --vars={partitions=...}`.
I managed to override the macro by adapting the logic after line 45. Here's a summary (a sketch follows below):
• I have a macro, `get_partitions_to_replace`, which takes two params, `target_table_name` and `source_table_name`, and queries the INFORMATION_SCHEMA.PARTITIONS table to get what it needs.
• I had to pass the `source_table_name` to the overridden macro in order to run my `get_partitions_to_replace` inside it.
• I managed that in the "else" part of `{% if execute %}`, by setting the macro's output to `target_table_name + "##" + source_table_name`.
• I can reverse-parse my required params afterwards inside the overridden macro, to correctly fetch my partitions.
Note: @Alaa Houimel originally posted this reply in Slack. It might not have transferred perfectly.
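Putting that summary together, the macro could look something like this (a sketch: all names are hypothetical, and both tables are assumed to live in the target's project and dataset):

```sql
-- macros/get_partitions_to_replace.sql (sketch)
{% macro get_partitions_to_replace(target_table_name, source_table_name) %}
    {% if execute %}
        {# runtime: partitions whose source data is newer than the target's,
           plus partitions missing from the target entirely #}
        {% set lookup_sql %}
            select format("date('%t')", parse_date('%Y%m%d', src.partition_id))
            from `{{ target.project }}.{{ target.dataset }}.INFORMATION_SCHEMA.PARTITIONS` src
            left join `{{ target.project }}.{{ target.dataset }}.INFORMATION_SCHEMA.PARTITIONS` tgt
              on  tgt.table_name = '{{ target_table_name }}'
              and tgt.partition_id = src.partition_id
            where src.table_name = '{{ source_table_name }}'
              and src.partition_id not in ('__NULL__', '__UNPARTITIONED__')
              and (tgt.partition_id is null
                   or src.last_modified_time > tgt.last_modified_time)
        {% endset %}
        {{ return(run_query(lookup_sql).columns[0].values() | list) }}
    {% else %}
        {# parse time: run_query() can't work yet, so smuggle both table
           names into the `partitions` config as a single marker string;
           the shimmed adapter macro splits it on "##" and calls this
           macro again once `execute` is True #}
        {{ return(target_table_name ~ "##" ~ source_table_name) }}
    {% endif %}
{% endmacro %}
```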
For now it works as expected, but I'll need to keep an eye on things in the near future to decide whether to maintain this approach or go with the "two-step" external-script approach.
Note: @Alaa Houimel originally posted this reply in Slack. It might not have transferred perfectly.
Again, thank you very much for pointing out this solution; this one gave me quite a bit of trouble for some days. I agree it would be interesting to have this logic as a built-in strategy in dbt.
Note: @Alaa Houimel originally posted this reply in Slack. It might not have transferred perfectly.
The built-in `_dbt_max_partition` variable holds the latest partition value in the target table (and there is an open PR to switch this to use the INFORMATION_SCHEMA.PARTITIONS table to get the latest partition value from there instead). This lets you say "find me data in source from a date later than the latest date in target," which is almost there in terms of what you need here, and doesn't require any shenanigans. You can actually implement this yourself as a partition predicate.
The only thing it doesn't let you do is say "list the partitions in X that don't exist yet in Y," but I find that pattern is actually unnecessary most of the time.
Note: @Mike Stanley originally posted this reply in Slack. It might not have transferred perfectly.
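A sketch of that pattern (the `day` column and source name are assumptions):

```sql
-- models/my_model.sql (sketch): only pull source rows at or after the
-- latest partition already in the target, so the most recent partition
-- gets fully rebuilt. _dbt_max_partition is the scripting variable that
-- dbt-bigquery declares for the dynamic insert_overwrite strategy.
{{ config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {'field': 'day', 'data_type': 'date'}
) }}

select * from {{ ref('my_source') }}
{% if is_incremental() %}
where day >= _dbt_max_partition
{% endif %}
```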
I'm aware of this variable; it would make even more sense (and be more efficient) to get it from INFORMATION_SCHEMA.PARTITIONS, so it's good to hear there's a PR about it.
Regarding my use case, more often than not I need to update already-existing data in my target based on the source (late arrivals, changes in data, newer predictions, etc.). That's why I need a full comparison of partitions between target and source in terms of `last_modified_time`.
Here's the original blog post that presented this approach; I personally find it quite elegant and efficient:
https://medium.com/@anders.eriksson_32478/dealing-with-delayed-data-using-dbt-on-bigquery-316fa5c81803
Note: @Alaa Houimel originally posted this reply in Slack. It might not have transferred perfectly.