Breaking a large query into parallelized partitioned queries

We have a long-running model that performs a complex set of geospatial comparisons to get point-to-point distances between every person in the US and a list of around 300K points of interest. We are using Snowflake, along with Snowflake's geospatial data types and functions. We also restrict the comparisons, so we are not (for instance) calculating distances between people in New York and POIs in Los Angeles. Even so, the first pass at the model took over 24 hours to execute on a large Snowflake warehouse. This first pass was not implemented in dbt. We then tried partitioning the query into multiple queries, each covering a separate region of the country, executing in parallel. This reduced the overall execution time by around 50%, with approximately the same credit burn on Snowflake. This also was not in dbt; the orchestration was simply a Python script that fired off the regional queries in parallel.
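For context, the pre-dbt orchestration we used looks roughly like this. Everything here is illustrative: the region list, the SQL template, and `run_query` are placeholders, not our actual code (in the real script, `run_query` executes the statement through the Snowflake Python connector).

```python
# Sketch of the pre-dbt orchestration: fan out one query per region
# using a thread pool, so Snowflake runs the regional statements
# concurrently on the warehouse.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitioning; the real job used finer-grained regions.
REGIONS = ["northeast", "southeast", "midwest", "west"]

QUERY_TEMPLATE = """
insert into poi_distances
select p.person_id,
       poi.poi_id,
       st_distance(p.geo_point, poi.geo_point) as distance_m
from persons p
join points_of_interest poi
  on p.region = poi.region
where p.region = '{region}'
"""

def run_query(sql: str) -> str:
    # Placeholder: in the real script this would be a call like
    # cursor.execute(sql) via the Snowflake connector. Returning the
    # SQL keeps this sketch self-contained.
    return sql.strip()

def run_region(region: str) -> str:
    return run_query(QUERY_TEMPLATE.format(region=region))

# Each regional query is an independent statement, so they run in
# parallel up to the pool size (and the warehouse's concurrency).
with ThreadPoolExecutor(max_workers=len(REGIONS)) as pool:
    results = list(pool.map(run_region, REGIONS))
```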

We want to move this model into dbt, but are a little stuck on how to accomplish the parallelism. Based on my understanding of dbt, since this is technically a single model resulting in a single table, it could not “natively” be parallelized the way described. One thought we have is to create separate models for different geographic regions, which can execute in parallel (via dbt threads), with a downstream union to get the final data set. But if we later decided to change our geographic “partitioning”, that would mean creating or removing intermediate models. Depending on how small our geographic regions need to be, it could also require quite a lot of intermediate models. Are there other options we are not thinking of?
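To make the separate-models idea concrete, the downstream union could be driven by a single list, so changing the partitioning at least only means editing one variable plus the matching model files. This is a sketch with assumed names: per-region models called `poi_distances_<region>` and a `distance_regions` var in `dbt_project.yml`.

```sql
-- models/poi_distances.sql: downstream union of the per-region models.
-- Model names and the `distance_regions` var are illustrative.
{% set regions = var('distance_regions', ['northeast', 'southeast', 'midwest', 'west']) %}

{% for region in regions %}
select * from {{ ref('poi_distances_' ~ region) }}
{% if not loop.last %}union all{% endif %}
{% endfor %}
```

The per-region models themselves could each be a one-line macro call that takes the region as an argument, so the boilerplate per region stays small even if the partitioning gets fine-grained.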
