Splitting up a model query that is too big for Snowflake

We have 3300+ schemas in our database (one per customer). We have a model that loops through each schema, selects a specific table from it, and uses UNION ALL to combine the results into one table in our analytics database. The query has now grown too big for Snowflake to run.

Snowflake’s suggestion was to break it up into separate queries of 100 UNION ALL statements each, writing each batch into its own transient table, and then unioning all the transient tables together at the end (whereas the current query has all 3300+ UNION ALL statements in a single statement).
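For reference, the shape they suggested would look roughly like this (the scratch schema, chunk names, and column list here are illustrative):

-- chunk 1 of ~34: 100 UNION ALL statements into one transient table
CREATE OR REPLACE TRANSIENT TABLE my_db.scratch.my_table_chunk_1 AS
SELECT column_1, column_2, column_3 FROM my_db.customer_0001.my_table
UNION ALL
SELECT column_1, column_2, column_3 FROM my_db.customer_0002.my_table
-- ... 98 more SELECT/UNION ALL pairs ...
;

-- repeat for chunks 2 through ~34, then combine the transient tables
CREATE OR REPLACE TABLE analytics.public.my_table AS
SELECT * FROM my_db.scratch.my_table_chunk_1
UNION ALL
SELECT * FROM my_db.scratch.my_table_chunk_2
-- ... remaining chunks ...
;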

Any suggestions on how to do that?

Our current model looks like this:

{{ config(materialized='table') }}

{%- set db = "my_db" -%}
{# custom macro that gets a list of all 3300+ schemas #}
{%- set schemas = run_query(nex_utils.find_schemas_sql(db=db, tbl='my_table')) -%}

{% if execute %}
{% for row in schemas.rows %}
SELECT
      column_1
      , column_2
      , column_3
      ....
FROM {{ db }}.{{ row.get('TABLE_SCHEMA') }}.my_table
{% if not loop.last %} UNION ALL {% endif %}
{% endfor -%}
{% endif %}

Alternative suggestion: does the code run as-is if you don’t run all 3300 schemas at once? Perhaps you could chunk the 3300 into groups and have one model for group 1, another model for group 2, etc., with as many groups as required to bring the processing down to something Snowflake can handle. I think that would be logically easier to read and reason about than juggling some number of UNION ALL statements (and wondering whether you got them all). A sketch of what one group might look like is below.
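Something like this for group 1, assuming your find_schemas_sql macro returns the schemas in a stable order (the slice bounds are illustrative):

-- models/my_table_group_1.sql
{{ config(materialized='table') }}

{%- set db = "my_db" -%}
{%- set schemas = run_query(nex_utils.find_schemas_sql(db=db, tbl='my_table')) -%}

{% if execute %}
{# group 1 covers schemas 0-999; each group model changes only this slice #}
{% for row in (schemas.rows | list)[0:1000] %}
SELECT
      column_1
      , column_2
      , column_3
FROM {{ db }}.{{ row.get('TABLE_SCHEMA') }}.my_table
{% if not loop.last %} UNION ALL {% endif %}
{% endfor -%}
{% endif %}

A final model would then just be SELECT * FROM {{ ref('my_table_group_1') }} UNION ALL SELECT * FROM {{ ref('my_table_group_2') }} and so on across the groups.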

Thanks for the suggestion @patkearns10. Another complicating factor is that the number of schemas will continue to change, so it seems like it would be a bit tedious to manage explicit models for groups of schemas.

What I’ve done for now is add a macro that replaces the find_schemas_sql macro we currently use to build our list of schemas to union. The new macro calls find_schemas_sql, loops through the schemas 100 at a time, unions each batch into a transient table, and returns the list of transient tables that were created. The model then loops through that list, much like it does now, and unions the transient tables together. Simplified, it looks something like the sketch below.
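A simplified sketch of that macro and model (the scratch schema, column list, and chunk size are placeholders for what we actually use):

{% macro create_schema_chunks(db, tbl, chunk_size=100) %}
    {% set chunk_tables = [] %}
    {% if execute %}
        {% set schemas = run_query(nex_utils.find_schemas_sql(db=db, tbl=tbl)) %}
        {# batch the schema list into groups of chunk_size and build one transient table per group #}
        {% for chunk in schemas.rows | batch(chunk_size) %}
            {% set chunk_table = db ~ '.scratch.' ~ tbl ~ '_chunk_' ~ loop.index %}
            {% set ddl %}
                CREATE OR REPLACE TRANSIENT TABLE {{ chunk_table }} AS
                {% for row in chunk %}
                SELECT column_1, column_2, column_3
                FROM {{ db }}.{{ row.get('TABLE_SCHEMA') }}.{{ tbl }}
                {% if not loop.last %} UNION ALL {% endif %}
                {% endfor %}
            {% endset %}
            {% do run_query(ddl) %}
            {% do chunk_tables.append(chunk_table) %}
        {% endfor %}
    {% endif %}
    {{ return(chunk_tables) }}
{% endmacro %}

and the model:

{{ config(materialized='table') }}
{%- set chunk_tables = create_schema_chunks('my_db', 'my_table') -%}
{% if execute %}
{% for t in chunk_tables %}
SELECT * FROM {{ t }}
{% if not loop.last %} UNION ALL {% endif %}
{% endfor %}
{% endif %}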

If there were a way to create a more dynamic set of models that could run in parallel, each handling one group of schemas to union, plus one model that unions them all at the end, that would be awesome. Right now the macro that breaks things up doesn’t run in parallel: each union of 100 tables takes about 10 minutes, so the whole process takes about 2 hours.
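One way to approximate that: dbt already runs independent models in parallel (up to the threads setting in profiles.yml), so a small script could regenerate one thin model file per chunk whenever the schema count changes, with each file just calling a shared macro. A sketch, where all names and the chunk count are illustrative:

-- macros/union_schema_chunk.sql
{% macro union_schema_chunk(db, tbl, chunk, chunk_size=100) %}
    {% if execute %}
        {% set rows = run_query(nex_utils.find_schemas_sql(db=db, tbl=tbl)).rows | list %}
        {% set start = (chunk - 1) * chunk_size %}
        {# note: the generator should only emit files for non-empty chunks,
           since an empty slice would render empty SQL and fail #}
        {% for row in rows[start : start + chunk_size] %}
        SELECT column_1, column_2, column_3
        FROM {{ db }}.{{ row.get('TABLE_SCHEMA') }}.{{ tbl }}
        {% if not loop.last %} UNION ALL {% endif %}
        {% endfor %}
    {% endif %}
{% endmacro %}

-- models/my_table_chunk_7.sql (generated; one file per chunk)
{{ config(materialized='table', transient=true) }}
{{ union_schema_chunk('my_db', 'my_table', chunk=7) }}

-- models/my_table_all.sql (range bound = chunk count + 1, kept in sync by the generator)
{{ config(materialized='table') }}
{% for i in range(1, 35) %}
SELECT * FROM {{ ref('my_table_chunk_' ~ i) }}
{% if not loop.last %} UNION ALL {% endif %}
{% endfor %}

With threads set to 8, dbt would build 8 chunk tables at a time, which should cut the wall-clock time down substantially.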