The problem I’m having
I recently began learning dbt and am eager to test its capabilities by publishing my first incremental model. It would be greatly beneficial if someone could review my approach to ensure it’s correct. My goal is to avoid developing any bad practices at the start of my learning journey.
I’ve created the following staging model [models/staging/finance/stg_charges.sql]:
WITH source AS (
SELECT _airbyte_data->>'id' AS id,
_airbyte_data->'captured' AS captured,
_airbyte_data->'paid' AS paid,
_airbyte_data->>'invoice' AS invoice,
_airbyte_data->>'payment_intent' AS payment_intent,
_airbyte_data->'amount_refunded' AS amount_refunded,
_airbyte_data->>'balance_transaction' AS balance_transaction,
_airbyte_data->'created' AS created,
_airbyte_data->>'customer' AS customer,
_airbyte_data->>'description' AS description,
_airbyte_data->>'receipt_email' AS receipt_email,
_airbyte_data->>'amount' AS amount,
_airbyte_data->>'failure_code' AS failure_code,
_airbyte_data->>'receipt_number' AS receipt_number,
_airbyte_data->'refunded' AS refunded,
_airbyte_data->>'status' AS status,
_airbyte_data->>'statement_descriptor' AS statement_descriptor,
_airbyte_data->>'currency' AS currency,
_airbyte_data->>'failure_message' AS failure_message,
_airbyte_data->>'payment_method' AS payment_method,
_airbyte_data->'metadata' AS metadata,
_airbyte_data->'updated' AS updated
FROM {{ source('finance', '_airbyte_raw_original_charges') }}
),
staged AS (
SELECT id,
captured,
paid,
invoice AS invoice_id,
payment_intent AS payment_intent_id,
amount_refunded::numeric,
balance_transaction AS balance_transaction_id,
to_timestamp(created::double precision) AS created,
customer AS customer_id,
description,
receipt_email,
amount::numeric/100 AS amount,
failure_code,
receipt_number,
refunded,
status,
statement_descriptor,
currency,
failure_message,
payment_method AS payment_method_id,
metadata ->> 'channel'::text AS metadata_channel,
COALESCE((metadata ->> 'adjust-balance'::text)::numeric, 0::numeric) AS metadata_adjust_balance,
to_timestamp(updated::double precision) AS updated_at,
row_number() over (partition by id ORDER BY updated DESC) AS duplicates
FROM source
)
SELECT *
FROM staged
WHERE duplicates = 1
Then, under marts, I have the final model [models/marts/finance/dim_charges_original.sql]:
{{
config(
materialized = 'incremental',
unique_key = 'id',
incremental_strategy = 'merge'
)
}}
SELECT * FROM {{ ref("stg_charges") }}
{% if is_incremental() %}
WHERE updated_at > current_date - interval '7 day'
{% endif %}
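For comparison, the pattern shown in the dbt docs for incremental models filters against the target table itself rather than using a fixed seven-day window. Here is a minimal sketch of that variant with the same config and columns as my model; I'm not sure which approach is preferable here:

{{
config(
materialized = 'incremental',
unique_key = 'id',
incremental_strategy = 'merge'
)
}}
SELECT * FROM {{ ref("stg_charges") }}
{% if is_incremental() %}
-- only pick up rows newer than anything already in the target table
WHERE updated_at > (SELECT max(updated_at) FROM {{ this }})
{% endif %}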
Context about the data
This is transactional data from Stripe with ~100k rows, and the raw source already contains multiple duplicates. The data is pushed in through Airbyte.
Questions
- Am I correctly implementing the ‘incremental strategy’ in dbt?
- In my ‘stg_charges.sql’ file, I cast the data types and rename columns to their proper attribute names, and I use the row_number() function to eliminate duplicates. Is there a more efficient way to handle deduplication in dbt? (I’ve sketched what I mean with DISTINCT ON after this list.)
- Do I always have to rebuild the entire query in ‘stg_charges.sql’, or is there a way to only process the data not yet included in the final ‘dim_charges_original.sql’ model, or to limit the staging query to a specific timeframe? That could improve the performance of the staging queries. (There is a rough sketch of this after the list as well.)
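To illustrate the second question, here is a rough sketch of doing the deduplication with Postgres’s DISTINCT ON instead of row_number(); I’ve trimmed it to a few columns for brevity, and I don’t know whether it is actually faster:

WITH source AS (
SELECT _airbyte_data->>'id' AS id,
_airbyte_data->>'amount' AS amount,
_airbyte_data->'updated' AS updated
FROM {{ source('finance', '_airbyte_raw_original_charges') }}
)
SELECT DISTINCT ON (id)
id,
amount::numeric/100 AS amount,
to_timestamp(updated::double precision) AS updated_at
FROM source
-- DISTINCT ON keeps the first row per id, so order the newest record first
ORDER BY id, updated DESC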
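And for the third question, this is roughly the kind of thing I had in mind: making stg_charges itself incremental so that each run only scans the raw rows Airbyte has loaded since the previous run. This assumes the raw table exposes Airbyte’s _airbyte_emitted_at column; it is only a sketch with a few columns, not something I’ve tested:

{{
config(
materialized = 'incremental',
unique_key = 'id',
incremental_strategy = 'merge'
)
}}
WITH source AS (
SELECT _airbyte_data->>'id' AS id,
_airbyte_data->'updated' AS updated,
_airbyte_emitted_at
FROM {{ source('finance', '_airbyte_raw_original_charges') }}
{% if is_incremental() %}
-- only scan raw records loaded since the last successful run of this model
WHERE _airbyte_emitted_at > (SELECT max(loaded_at) FROM {{ this }})
{% endif %}
),
deduped AS (
SELECT *,
row_number() OVER (PARTITION BY id ORDER BY updated DESC) AS duplicates
FROM source
)
SELECT id,
to_timestamp(updated::double precision) AS updated_at,
_airbyte_emitted_at AS loaded_at
FROM deduped
WHERE duplicates = 1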