Can you define column data types in dbt?

Hello DBT, can you kindly advise if there is a way to define “column type” (eg: varchar, date) in dbt?

1 Like

Because dbt uses “Create Table As Select” statements (CTAS) to create tables in relevant database platforms, instead of doing separate “CREATE TABLE” and then “INSERT” statements, it doesn’t natively have a way to specify datatypes.

However, what we do in our dbt models is to force a CAST on every line of the projection part of of the SQL Statement (e.g. the column list normally immediately after SELECT) to ensure that the datatypes coming out are what we think they are.

So, like this:

WITH SUBQUERY AS
(SELECT
MD5_BINARY(TO_VARCHAR(DIM_AIRLINE_KEY) || '|' || FLIGHT_NUM || '|' || FLIGHT_LEG_NUM) AS DIM_FLIGHT_LEG_SCHEDULE_KEY,
MD5_BINARY(TO_VARCHAR(DIM_AIRLINE_KEY) || '|' || FLIGHT_NUM) AS DIM_FLIGHT_SCHEDULE_KEY,
-- All of these 5 fields excluded from the star are already in the dimension.
{{ dbt_utils.star(from=ref('stg_flight_leg'), except=["START_DIM_AIRPORT_KEY", "END_DIM_AIRPORT_KEY", "DIM_AIRLINE_KEY", "FLIGHT_NUM"]) }}
FROM {{ ref('stg_flight_leg') }})
SELECT 
    -- Not casting BINARY fields, doesn't make sense
    FACT_FLIGHT_LEG_KEY,
    DIM_FLIGHT_LEG_SCHEDULE_KEY,
    DIM_FLIGHT_SCHEDULE_KEY,
    CAST(FLIGHT_POSITION_DATE AS DATE) AS FLIGHT_POSITION_DATE,
    DIM_AIRPLANE_KEY,
    CAST(FLIGHT_DURATION AS INT) AS FLIGHT_DURATION,
    CAST(FLIGHT_START_DATE AS DATE) AS FLIGHT_START_DATE,
    CAST(FLIGHT_START_TIME AS TIME) AS FLIGHT_START_TIME,
    CAST(FLIGHT_START_TIMESTAMP AS TIMESTAMP) AS FLIGHT_START_TIMESTAMP,
    CAST(FLIGHT_END_DATE AS DATE) AS FLIGHT_END_DATE,
    CAST(FLIGHT_END_TIME AS TIME) AS FLIGHT_END_TIME,
    CAST(FLIGHT_END_TIMESTAMP AS TIMESTAMP) AS FLIGHT_END_TIMESTAMP,
    CAST(FLIGHT_START_LAT AS FLOAT) AS FLIGHT_START_LAT,
    CAST(FLIGHT_START_LONG AS FLOAT) AS FLIGHT_START_LONG,
    CAST(FLIGHT_END_LAT AS FLOAT) AS FLIGHT_END_LAT,
    CAST(FLIGHT_END_LONG AS FLOAT) AS FLIGHT_END_LONG,
    CAST(CNT_LEGS AS NUMBER(18,0)) AS CNT_LEGS,
    CAST(FLIGHT_LEG_NUM AS NUMBER) AS FLIGHT_LEG_NUM,
    CAST(START_AIRPORT_DISTANCE_KM AS FLOAT) AS START_AIRPORT_DISTANCE_KM,
    CAST(END_AIRPORT_DISTANCE_KM AS FLOAT) AS END_AIRPORT_DISTANCE_KM
FROM
SUBQUERY WHERE DIM_FLIGHT_LEG_SCHEDULE_KEY IN
(SELECT DIM_FLIGHT_LEG_SCHEDULE_KEY FROM {{ ref('dim_flight_leg_schedule') }} )

Except for the _KEY fields, which we know are binary so we don’t bother with the cast.

1 Like

I’m wondering what is the best practice of where to do this casting?

For example, my first step is always to create a base_ table that renames columns. Seems like that would be the right place to do the casting so that the types automatically propagate to downstream queries.

Another consideration is that when you CAST something you always have to specify the column name, which can lead to a lot of repeating as in Josh’s example above.

That could probably be solved through a macro. I checked dbt_utils and don’t see anything like that yet.

So then the structure could be something like:

with source as (
    
    select * from stg.ga_page_stats
    
),

casted as (
    select 
        mcast(date, date),
        mcast(pagePath, varchar(3000)),
        mcast(sourceMedium, varchar(512)),
        mcast(campaign, varchar(512)),
        mcast(keyword, varchar(512)),
        mcast(adContent, varchar(512)),
        mcast(fullReferrer, varchar(512)),
        mcast(pageviews, int),
        mcast(uniquePageviews, int)
    from source 
),

renamed as (
    select 
        date as date,
        pagePath as page_path,
        sourceMedium as traffic_source_medium,
        campaign as traffic_campaign,
        keyword as traffic_keyword,
        adContent as traffic_content,
        fullReferrer as full_referrer,
        pageviews as pageviews,
        uniquePageviews as unqiue_pageviews
    from casted 
),

Or am I overcomplicating things here?

1 Like

We do this casting on every single model. The issue of ensuring that datatypes are what we expect they are is not just an issue for models that are immediately adjacent to sources but for every single model along the way. From an engineering perspective we never want to assume - we always want to verify/confirm/dictate/be prescriptive.

Your idea for a macro to make the CASTing syntax a little more elegant is cool, I like that.

A post was split to a new topic: Casting column datatypes based on a seed

We’ve also had a need to cast data types. We unpacked a JSON string.
This is what we did:

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.