Background:
I’m looking for a replacement of a >20yr old propriatery development tool that offers a lot of functionality similar to dbt. Over time we’ve developed a lot of features tailored to our needs, most that I’m happy to throw away, but one in particular that would be a deal breaker.
Our propriatery tool is better equipped to handle more complex incremental models that dbt has out-of-the-box. If we are to migrate this is something we need to adress.
Question:
I have listed one example ‘use-case’ below. From the functional perspecifive we have one model, however we’d like to implement that in separate steps or multiple technical models without needing to code the technical template.
I’m seeking for a strategy that would allow us to generate these as a template. I can image run_query and/or hooks may do the trick but it feels rather hacky to do so. On the other hand I’m relatively new to dbt so I don’t know all the possibilities. I’m looking forward to your suggestions, thank you.
Example:
Given the source data model:
Table1:
This is a big fact table for which we need incremental processing
PK: Key1, Key2, ValidDate,
Other fields: value, isCurrent*, changeMarker*
- marks technical use only
Table2:
This is a table fed indirectly through user input
PK: Key1, OutputKey, ValidFrom
Other fields: ValidUntil
The developer that implements the business logic gives the following configuration to declare the model.
Config: {
Input: [
Table1: {
TimedimensionType: Snapshot,
Key: [Key1, Key2]
},
Table2: {
TimedimensionType: Interval,
Key: [Key1, OutputKey]
}
],
JoinClause: [
Table1.Key1 = Table2.Key1
],
Output:{
TimedimensionType: Snapshot,
Keys: [
{
inExpr: Table2.OutputKey,
outAlias: OutputKey
}
],
Values: [
{
inExpr: MEDIAN(Table1.Value),
outAlias: MedianValue
}
],
Aggregation: True
}
]
A full execution (in develop/debug) mode will compile into code like
CREATE OR REPLACE TABLE Output AS
SELECT Table2.OutputKey OutputKey, Table1.ValidDate, MEDIAN(Table1.Value) SumValue
FROM Table1
JOIN Table2 ON (Table1.Key1 = Table2.Key1 AND Table2.ValidFrom <= Table1.ValidDate AND Table1.ValidDate < Table2.ValidUntil)
WHERE Table1.isCurrent ='Y'
GROUP BY Table2.OutputKey, Table1.ValidDate
However in production we’d like incremental execution instead. More or less accomplished by the following steps.
Please note that I simplified for the sake of the example, I am not seeking for feedback on inaccuracies nor inefficiencies.
Fetch lastProcessedChangeMarker
Fetch newChangeMarker
MERGE INTO Table1_Interval_helper tgt USING
(SELECT Key1, MIN(ValidDate) ValidFrom, MAX(ValidDate+1) ValidUntil
FROM Table1
WHERE isCurrent ='Y' AND changeMarker > lastProcessedChangeMarker) src
ON (src.key1 = tgt.key1)
WHEN NOT MATCHED THEN INSERT (tgt) values (src)
WHEN MATCHED THEN UPDATE SET tgt.ValidFrom = LEAST(tgt.ValidFrom, src.ValidFrom) AND tgt.ValidUntil = GREATEST(tgt.ValidUntil, src.ValidUntil)
CREATE OR REPLACE TABLE Output_join_helper_current AS
SELECT Key1, Key2, OutputKey, GREATEST(Table1.ValidFrom, Table2.ValidFrom ) ValidFrom , LEAST(Table1.ValidUntil, Table2.ValidUntil ) ValidUntil
FROM Table1_Interval_helper Table1
INNER JOIN Table2 ON (Table1.Key1 = Table2.Key1 AND Table2.ValidFrom < Table1.ValidUntil AND Table1.ValidFrom < Table2.ValidUntil)
CREATE OR REPLACE TABLE Output_table_deltas AS
WITH DELTA_INTERVALS_HELPER AS
(SELECT Key1, Key2, OutputKey, ValidFrom, ValidUntil
FROM Output_join_helper_current
MINUS -- Actually we de a strict interval logic based minus here
SELECT Key1, Key2, OutputKey, ValidFrom, ValidUntil
FROM Output_join_helper_previous),
DELTA_HELPER AS
(SELECT OutputKey, ValidDate
FROM DELTA_INTERVALS_HELPER, DATE_ITERATOR(ValidFrom, ValidUntil)),
DELTA_TABLE1 AS
(SELECT OutputKey, ValidDate
FROM Table1
JOIN Output_join_helper_current ON (Table1.Key1 AND Output_join_helper_current.ValidFrom < Table1.ValidUntil AND Table1.ValidFrom < Output_join_helper_current.ValidUntil)
WHERE changeMarker > lastProcessedChangeMarker)
SELECT *
FROM DELTA_HELPER
UNION
SELECT *
FROM DELTA_TABLE1
MERGE INTO Output tgt
USING (SELECT delta.OutputKey, delta.ValidDate, MEDIAN(CASE WHEN Table1.Key1 IS NOT NULL AND Table2.Key1 IS NOT NULL THEN value END), CASE WHEN
SUM(CASE WHEN Table1.Key1 IS NOT NULL AND Table2.Key1 IS NOT NULL THEN 1 END) > 0 THEN 'Y' ELSE 'N' isCurrent, newChangeMarker ChangeMarker
FROM Output_table_deltas delta
LEFT OUTER JOIN Table2 ON (delta.outputKey =Table2.OutputKey AND Table2.ValidFrom <= delta.ValidDate AND delta.ValidDate < Table2.ValidUntil)
LEFT OUTER JOIN Table1 ON (Table1.key1 = Table2.Key1 AND Table1.ValidDate = delta.ValidDate AND Table1.isCurrent=Y)) src
ON (tgt.OutputKey = src.OutputKey AND tgt.ValidDate = src.ValidDate)
WHEN NOT MATCHED INSERT (tgt) VALUES (src)
WHEN MATCHED UPDATE tgt = src
–Post administration
DROP TABLE Output_join_helper_previous;
RENAME TABLE Output_join_helper_current TO Output_join_helper_previous;
Store newChangeMarker as lastProcessedChangeMarker