Hi all,
I’m new to dbt and am trying to figure out how best to handle bad data in a transformation process.
Let’s say I have a very simple NOT NULL check on a source table and I want to move rows that violate that check (“bad” data) to an error table for later inspection and route the “good” rows through my models into my data warehouse.
So far I have come up with two approaches:
Build two models on top of the source, one that selects the bad data and one that selects the good data. Based on those models, proceed with loading the good data and inspect the bad-data model to handle the errors (see the sketch below).
Use the built-in test feature of dbt to find the bad data. If a test fails, handle the bad data (i.e. remove it from the load) and start over until all tests pass, at which point trigger loading of the good data.
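To illustrate the first approach, a rough sketch of the two models might look like this (the source, table and column names are just placeholders, and the source would have to be declared in a sources .yml):

```sql
-- models/staging/orders_bad.sql: rows that violate the NOT NULL check
select *
from {{ source('landing', 'orders') }}
where customer_id is null
```

```sql
-- models/staging/orders_good.sql: rows that pass, used by the downstream models
select *
from {{ source('landing', 'orders') }}
where customer_id is not null
```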
Both approaches seem not quite right: with the first one I am not using the test feature of dbt and have to do a lot of custom coding.
With the second approach I can use the test feature and, with the store_failures option, probably even capture the bad data automatically, but I will need some sort of orchestration software like Airflow to react to the test failures.
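For example, something like this singular test (hypothetical names) should store the failing rows automatically:

```sql
-- tests/assert_orders_customer_id_not_null.sql
{{ config(store_failures = true) }}

-- every row returned here is a test failure; with store_failures enabled,
-- dbt writes these rows to a table in its audit schema for later inspection
select *
from {{ source('landing', 'orders') }}
where customer_id is null
```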
So I feel there must be a better option to find bad data with dbt test, then move it somewhere else and process only the remaining good data.
Would be happy to hear how others in the community solve this.
Hi @rgrupe ,
I think what to do depends on your business rules, but the possibilities I see are:
stopping the load: if a null is something you can’t fix, and you don’t want to continue with bad data or without all the data, make the test on your source data and let it throw an error before you start the run.
fixing the data: if you can fix the data, maybe with a lookup or a default value, then you might remove the test on the source (or make it throw a warning) and handle the fix in the model(s) reading the data. Be sure to put a test on the transformed data, so you know that after the fixing the data is actually fixed (see the sketch after this list).
removing the bad data: if you can’t fix the data but you can continue without it, then you can just read the good data out of the source and do what is needed. I suggest having a warning-severity test on the source to monitor the amount of bad data, and a test on the good-data model to be sure it is all good there.
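A rough sketch of options 2 and 3, with made-up source and column names:

```sql
-- models/staging/stg_orders.sql
-- option 2, "fixing the data": replace the missing key with a default value
select
    order_id,
    coalesce(customer_id, -1) as customer_id,  -- -1 = "unknown customer" default
    order_date
from {{ source('landing', 'orders') }}

-- option 3, "removing the bad data": drop the coalesce and keep only good rows:
-- where customer_id is not null
```

In both cases a test on the resulting model confirms that the fix or the filter actually worked.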
On top of how you process the data there is another layer to consider: how you store it.
I suggest storing the full history of the RAW data coming from the source (like a snapshot does); then, if you just need the current state, read it out of the history and start your transformations there. Those transformations are just views and tables rebuilt every day, without incremental loads.
This way, bad data can arrive on day 1 (you see the warning and either filter it out or snap a default value onto it), and if a good version arrives on day 2 you will use that one, while the bad data is just kept in the history for analysis and auditing.
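If you use a dbt snapshot for that history, a minimal sketch could look like this (the schema, key and timestamp column are assumptions about your data):

```sql
-- snapshots/orders_snapshot.sql
{% snapshot orders_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='order_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select * from {{ source('landing', 'orders') }}

{% endsnapshot %}
```

Downstream models can then read the current state out of the snapshot and stay simple views or tables rebuilt every day.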
@RobMcZag, I have played around with the options and am now using a mix of them.
So when loading the data (pre-dbt) I enforce some rules and then run a set of simple dbt tests on the loaded source data. If those fail, the load stops and the provider of the data is informed; no dbt run command is triggered then.
The more complex checks I include in my models, sorting out bad data for later analysis while routing the good data through. So here I won’t use dbt tests, but rather do it in the model processing.
Finally, when my dbt run cycle is done, I will run more tests on the target tables to make sure everything went as planned.
Still not 100% happy with this, as I would like to have the option to act on dbt test failures directly (not through Airflow or some other tool), but it should work.
Hey @paparao, I do not really understand what you are up to.
You run new tests before every run of the ELT.
There is no need to store the past failures, but you can: they are in the JSON files in your project folder (in the Core version), and you have the logs in the UI in dbt Cloud.
You can decide if a test throws an error (failing the run) or a warning (non-failing).
If there is even just one error that you are not OK to run with, you should not run the ELT after it. The other way around is to split the run into many smaller runs, each with its own tests, but I would not advise it as you get into a lot of maintenance.
And you should keep all your tests green. Otherwise, why have them?
@rgrupe you do not need to load the data to run the test on it.
As long as your data platform can read the data, you can define a source on it and run dbt tests against it before loading.
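For example, with the raw files declared as a source (e.g. via an external table) in your sources .yml, a singular test like this can run before anything is loaded (names are made up):

```sql
-- tests/assert_external_orders_customer_id_not_null.sql
-- returns the rows that would violate the NOT NULL rule, before anything is loaded
select *
from {{ source('landing_external', 'orders') }}
where customer_id is null
```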
I generally have:
tests on external sources, to make sure the incoming data is as I expect
loading of staging models, history tables built from staging, business models, data marts
test them (stg, hist, business & data marts)
Sometimes I may split it in two after the stg or hist layer, to have tests that can block the ETL at that point, but it usually does not make sense as only the history models are incremental; all the others are views or tables rebuilt every day (so I can just re-run those models once the data is fixed).
Consider that I rebuild tables with more than half a billion rows and it takes only a few minutes on an XS warehouse in Snowflake.
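If you keep the history in a snapshot like the sketch further up, reading just the current state is usually a simple view (names are again made up):

```sql
-- models/staging/stg_orders_current.sql
{{ config(materialized='view') }}

-- dbt snapshots mark the current version of each row with dbt_valid_to = null
select *
from {{ ref('orders_snapshot') }}
where dbt_valid_to is null
```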