Hi everyone,
I’m working on a data processing project where we receive financing data from multiple sources like HubSpot and NetSuite. To manage data transformation, I decided to use dbt. Every hour, I pull the entire dataset from these sources using Azure Data Factory (ADF) and load it into a source schema in a PostgreSQL database as text columns.
My dbt project includes a pre-staging layer where I perform various data quality checks and business rule validations before converting the data into the correct format and storing it in a pre-staging schema. I need your advice on this pre-staging layer.
One of the key data quality checks involves validating data types and segregating good data from bad data. I want to ensure that valid data progresses through the pipeline (to the staging and mart layers) while invalid data is moved to an exception schema for further inspection. Additionally, I need to assign error/warning codes to the bad data.
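To make that concrete, the type checks I have in mind look roughly like this on the Postgres side (the table and column names below are just placeholders, not my real schema):

```sql
-- All columns arrive as text from ADF; each CASE branch checks whether the
-- raw value would cast cleanly to its target type and assigns an error or
-- warning code when it would not. A NULL error_code means the row is "good".
select
    src.*,
    case
        when src.amount is null
            then 'ERR_AMOUNT_MISSING'
        when src.amount !~ '^-?[0-9]+(\.[0-9]+)?$'
            then 'ERR_AMOUNT_NOT_NUMERIC'
        when src.invoice_date !~ '^\d{4}-\d{2}-\d{2}$'
            then 'WARN_INVOICE_DATE_NOT_ISO'
        else null
    end as error_code
from source_schema.netsuite_invoices as src;
```

Rows with a NULL error_code would flow on to pre-staging and staging; the rest would be written to the exception schema together with their codes.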
My initial idea is to write a single SELECT statement for the good data and then use the reverse condition for the bad data. Ideally, I’d like to write my logic once and use it to extract both good and bad data.
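In dbt terms, I'm imagining the condition living in a macro so it is written exactly once, with two models selecting the complementary row sets. This is only a sketch and every name in it (macro, models, source) is made up for illustration:

```sql
-- macros/invoice_row_errors.sql (hypothetical macro)
-- Renders a SQL expression that yields an error code, or NULL for valid rows.
{% macro invoice_row_errors(rel) %}
    case
        when {{ rel }}.amount is null then 'ERR_AMOUNT_MISSING'
        when {{ rel }}.amount !~ '^-?[0-9]+(\.[0-9]+)?$' then 'ERR_AMOUNT_NOT_NUMERIC'
        else null
    end
{% endmacro %}

-- models/prestaging/invoices_valid.sql (good rows, cast to target types)
select
    cast(src.amount as numeric) as amount
    -- ...remaining columns cast to their target types...
from {{ source('raw', 'netsuite_invoices') }} as src
where ({{ invoice_row_errors('src') }}) is null

-- models/exceptions/invoices_invalid.sql (bad rows kept as text plus a code)
-- With dbt's default generate_schema_name this lands in <target>_exceptions.
{{ config(schema='exceptions') }}
select
    src.*,
    {{ invoice_row_errors('src') }} as error_code
from {{ source('raw', 'netsuite_invoices') }} as src
where ({{ invoice_row_errors('src') }}) is not null
```

That way staging builds only on the valid model, the invalid model materializes into the exception schema for inspection, and the validation logic is defined in one place.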
I’ve used dbt’s built-in data tests and packages like Elementary, but they only seem to stop invalid data from propagating (by failing or warning); I’d prefer an approach where the pipeline keeps processing the valid rows and simply sets the invalid ones aside.
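As I understand it, a dbt test is essentially a query that returns the failing rows, so it can warn, fail the run, or store those rows for audit (with store_failures), but it doesn't by itself remove them from the model that downstream layers read. A singular test for the same rule would look roughly like this (again, names are invented):

```sql
-- tests/assert_amount_is_numeric.sql (hypothetical singular test)
-- The test passes when this query returns zero rows; any rows it returns are
-- reported or stored, but they stay in the data that staging and marts use.
select *
from {{ source('raw', 'netsuite_invoices') }}
where amount is null
   or amount !~ '^-?[0-9]+(\.[0-9]+)?$'
```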
Could you share your experiences and best practices for handling such requirements? Is my approach correct, or should I consider a different strategy?
Thanks in advance for your insights!