My data team and I are trying to figure out a way to keep track of the date a particular table was last rebuilt.
The context of why I’m trying to do this
We have a lot of tables that are scheduled for daily refreshes in ECS. Sometimes we update these tables to include new features, and sometimes those updates break a table, so it stops getting any new data past the date it broke. (For example, a column was added but not included in the GROUP BY clause of the code, which broke the SQL, so the table was no longer getting built.)
We set up monitoring alarms triggered by the “ERROR” string in the logs in AWS, so now we should hopefully get an email notification when a refresh job fails. But we were wondering if there’s a way to get that information more directly.
What I’ve already tried
We thought about including an “updated on” column in all the tables, which we could check to see whether a table was actually updated recently. But that would require rewriting a lot of tables and we don’t really want to do that. Is there maybe some kind of dbt log or tracking that we could use?
For example, when the daily job kicks off and rebuilds all the tables with the “daily” tag, is there a way for it to let us know, AFTER it has run all the tables, which ones broke and didn’t get rebuilt?
Each dbt run command creates a file called run_results.json with a record of what it did. Save a copy of that file after each run, then parse the saved files to see what happened.
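If it helps, here’s a minimal sketch of parsing that file to list the models that didn’t build (the path is the default target/ location, and the exact schema can vary a bit between dbt versions):

```python
import json

# run_results.json is written to the target/ directory at the end of each
# dbt invocation; adjust the path if your project configures it differently.
RUN_RESULTS_PATH = "target/run_results.json"

with open(RUN_RESULTS_PATH) as f:
    run_results = json.load(f)

# Each entry in "results" describes one node (model, test, etc.) with a
# "status" such as success, error, or skipped.
problems = [
    (r["unique_id"], r["status"])
    for r in run_results["results"]
    if r["status"] not in ("success", "pass")
]

print(f"Run generated at: {run_results['metadata']['generated_at']}")
for unique_id, status in problems:
    print(f"{status.upper()}: {unique_id}")
```

Anything that shows up with an error or skipped status is a table that didn’t get rebuilt in that run.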
Alternatively, look at the query logs of your database for queries run by the service account that runs your dbt jobs.
Thanks! So if a job runs all tables with the “daily” tag, would it overwrite run_results.json after every table? How would you parse it after every table then? Or does it only create one run_results.json after ALL the tables with the “daily” tag have run?
Well, it depends on where you’re running the dbt command. You’ll need to persist the run_results.json file somewhere (e.g. S3) and then parse it. If you’re running in a cloud environment like ECS, where your VM and its storage aren’t persistent, you’ll have to write the file out to persistent storage as the last step of your process. There are also dbt packages that do this job from what I remember (inserting data from run_results into the database), but I don’t use them; we parse run_results and the query log in our database.
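To the question above: dbt writes a single run_results.json per invocation, at the end, with an entry for every model it selected, so a tag-based daily job gives you one file per run rather than one per table. Here’s a minimal sketch of shipping it to S3 as the last step of the ECS task (the bucket name and key pattern are just placeholders):

```python
import datetime

import boto3

# Placeholder bucket/prefix; point this at wherever you keep run artifacts.
BUCKET = "my-dbt-artifacts"
run_date = datetime.date.today().isoformat()
key = f"run_results/daily/{run_date}/run_results.json"

# Copy the artifact dbt left in target/ to S3 before the container exits,
# so it can still be parsed after the task's ephemeral storage is gone.
s3 = boto3.client("s3")
s3.upload_file("target/run_results.json", BUCKET, key)
print(f"Uploaded to s3://{BUCKET}/{key}")
```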