I recently used the dbt External Tables package to read from my external stage in Snowflake, create an external table, and then build an incremental model on top of it. So far so good and everything is working perfectly.
My set up of the external table is as follows:
```yml
sources:
  - name: ext_table_test
    database: AWS
    schema: S3
    loader: S3
    tables:
      - name: logs
        external:
          location: "@stage.aws.logs/2023"
          file_format: "aws.api_log_file"
          partitions:
            - name: partition_date
              data_type: date
              expression: CAST(SPLIT_PART(metadata$filename, '/', 1) || '-' || SPLIT_PART(metadata$filename, '/', 2) || '-' || SPLIT_PART(metadata$filename, '/', 3) AS DATE)
```
The reason I have `/2023` at the end of the location is that I first wanted to check how much data there would be to process, how dbt would handle it, and how long it would take to process in Snowflake. By the way, the folders in AWS go all the way back to 2015.
Now I would like to bring in only `/2022` as well, and I'm thinking about the best way to do it. The logs can't be changed, and 2022 has ended, so there's no need for an incremental model.
I was thinking about adding a separate external table:

```yml
tables:
  - name: logs_2022
    external:
      location: "@stage.aws.logs/2022"
```
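If I did add a second table like this, the downstream model would presumably just stack the two sources, something like this sketch (model and source names are mine, assuming both sources are defined as above):

```sql
-- Hypothetical dbt model combining the two external tables.
select * from {{ source('ext_table_test', 'logs_2022') }}
union all
select * from {{ source('ext_table_test', 'logs') }}
```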
But is it worth it? In the models later on I have to combine all the data anyway. What I can't do is point the location at the full path, because I don't want data prior to 2022 in the external table slowing down reads. Or will it? I was also thinking about opening up the location but restricting it with a pattern:
```yml
location: "@stage.aws.logs"
pattern: "/(2022|2023)/.*\.gz"
```
but I'm not sure whether this will work.
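One variant I considered, in case Snowflake evaluates the pattern as a regular expression against the whole file path relative to the stage (I'm not certain how it anchors the match, so a leading `/` might never match), would be:

```yml
tables:
  - name: logs
    external:
      location: "@stage.aws.logs"
      file_format: "aws.api_log_file"
      pattern: ".*/(2022|2023)/.*[.]gz"
```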