SQL Readability: A Consideration

RaulingAverage · April 28, 2020, 6:50am

Hello dbt team,

Hope you are doing well. I wanted to reach out because when working with the dbt example codes, and seeing other people’s repos, I see a lack of what SQL readability should look like, in regards to indentation and formatting.

SQL Readability is an opinionated topic, but bare with me.

In this article, I talk about SQL Readability from where I learned somewhere else, and moreover, provide resources on my research.

Ultimately, queries may look as such:

/*

* Querying the Myspace Events table with user information for the time
* after the first Musician profile-user was created.

*/

WITH earliest_musician_user_date AS ( 

    SELECT 
    
        MIN(created_date)
    
    FROM mySpace_users 
    
    WHERE profile  = 'Musician'

)

SELECT 

    id, 
    created_date, 
    events.login_id, 
    geo_state, 
    first_name, 
    last_name, 
    SUM(login.logIn) AS total_logins -- Obtaining total logins

FROM mySpace_events AS events

LEFT OUTER JOIN AS myspace_users users

ON users.id = events.user_id

WHERE login.date >= CURRENT_DATE - earliest_musician_user_date

GROUP BY 1,2,3,4,5,6

emilie · April 28, 2020, 1:17pm

The best way to encourage readability is through standardization.
When people know the structure of what they’re reading, they can spend less time figuring out what’s going on a more time focused on the content. That’s where SQL Style Guides come in.

We’ve talked about this before when @mhmazur shared his SQL Style Guide in this discourse post: Sharing my SQL Style Guide

@claire recently explained some of the reasoning in Fishtown’s style guide around CTEs: Why the Fishtown SQL style guide uses so many CTEs

Some of the other’s frequently mentioned in dbt slack:

Fishtown’s: https://github.com/fishtown-analytics/corp/blob/master/dbt_coding_conventions.md#sql-style-guide
Kickstarter’s: https://gist.github.com/fredbenenson/7bb92718e19138c20591
GitLab’s (where I work): https://about.gitlab.com/handbook/business-ops/data-team/sql-style-guide/

For a project’s readability, it’s less about any particular query and more about consistency across queries.

It’s hard to be definitive across organizations because every organization has a slightly different context. For example, your example above takes up a lot of space on a 13 inch Macbook. From the discourse view, I cannot see the whole thing on a single screen:

Even when I move it to vim, this pretty straight-forward query takes up most of the screen:

A company that standard issues external monitors to team members can make more spacing demands without considering the negative effect to productivity. Your recommendations should be considerate of who is working on the project and what their environment might look like.

My personal reaction (not GitLab’s) is that your outlined suggestion is wasting a lot of valuable screen space. I would rewrite it as follows:

WITH myspace_events AS (

    SELECT *
    FROM myspace_events

), myspace_users AS (

    SELECT min(created_date) as earliest_musician_date
    FROM myspace_users
    WHERE profile = 'Musician'

), final AS (

    SELECT event_id, -- I would have renamed this ID in a base model
        created_date,
        myspace_events.login_id, 
        geo_state,
        first_name,
        last_name,
        SUM(login) AS total_login -- you reference a login.login but don't alias a table called login
    FROM myspace_events
    LEFT OUTER JOIN myspace_users
    ON myspace_users.user_id = myspace_events.user_id -- I would have renamed user_id in the base model
    WHERE login_date >= (CURRENT_DATE - earliest_musician_date)
    GROUP BY 1, 2, 3, 4, 5, 6

)

SELECT * 
FROM final

This reduces the SQL from 38 to 29 lines.
Notable differences include:

removing comments, as there are better places to put them; the comments I include here are for reference and I would not include them in an actual query
reducing blank lines
helping visualize order of operations using parentheses where there are multiple actions in a single line
avoiding ambiguous column names, like id by renaming them in base models
not aliasing model names when adding 8 characters means you don’t have to
using CTEs to call source data

You can see how much less screenspace this occupies in a text editor:

But opinions on SQL style guides is a bit like opinions on everything else… everyone has one.

RaulingAverage · April 29, 2020, 6:02am

Major +1 to your feedback @emilie. I particularly like your weigh in on the existing query, with supplemented sources.

I am particularly going to update my Readability recommendation from your resources as well (E.g. Gitlab SQL Style Guide)

And this, I feel, would be a great reason to further have the Fishtown team consider an update to the SQL Readability standards that are both on examples & also mentioned in their best practices.

My hopes would be to at least help other team’s before they have their own technical debt before following what’s previously been created, as I would like to be “considerate of who is working on the project and what their environment might look like.”

I know that “SQL style guides is a bit like opinions”, but I was hoping this could be a ripe opportunity for the team to come to re-consideration on SQL Readability before, hopefully, more adoption would occur–especially after Fishtown’s recent highlighted news

Anyways, I really appreciate the response!

emilie · April 29, 2020, 12:47pm

Hmm. I’m just now realizing that your comments about SQL readability are specifically towards Fishtown’s SQL Style Guide. I do find their SQL style very readable- and, it’s incredibly consistent. Maybe you can give specific feedback on what you find difficult to parse, so that someone can help explain the reasoning behind that approach.

RaulingAverage · April 30, 2020, 2:30am

Oh yeah, no worries @emilie!

Though the opinion is to the team, I feel a standardization in the Fishtown SQL Style Guide & examples, can occur.

I would create a PR, but rather get the opinions before taking such an abrupt action.

Here are some concerns I had

Spacing
lowercase on Reserve Words
Lines can be longer than 80 characters
Reserve Words are in-line w.r.t query level (kinda like aligning brackets, in other languages)

Note: These are personal thoughts on past experiences, and do not reflect the readability at my current work now.

Using the dbt coding conventions example

This

with

my_data as (

    select * from {{ ref('my_data') }}

),

some_cte as (

    select * from {{ ref('some_cte') }}

),

final as (

    select [distinct]
        my_data.field_1,
        my_data.field_2,
        my_data.field_3,

        -- use line breaks to visually separate calculations into blocks
        case
            when my_data.cancellation_date is null and my_data.expiration_date is not null then expiration_date
            when my_data.cancellation_date is null then my_data.start_date + 7
            else my_data.cancellation_date
        end as cancellation_date,

        -- use a line break before aggregations
        sum(some_cte.field_4),
        max(some_cte.field_5)

    from my_data

    left join some_cte  
        on my_data.id = some_cte.id

    where my_data.field_1 = 'abc'
      and (
          my_data.field_2 = 'def' or
          my_data.field_2 = 'ghi'
      )

    group by 1, 2, 3, 4
    having count(*) > 1

)

select * from final

Could look like this

WITH my_data AS (
    SELECT * FROM {{ ref('my_data') }}
),

some_cte as (
    SELECT * FROM {{ ref('some_cte') }}
),

final as (

    SELECT 
        [DISTINCT]
        my_data.field_1,
        my_data.field_2,
        my_data.field_3,

        -- Use line breaks to visually separate 
        -- calculations into blocks.
        CASE
        WHEN 
                my_data.cancellation_date IS NULL AND 
                      my_data.expiration_date IS NOT NULL
        THEN 
                expiration_date
        WHEN 
                my_data.cancellation_date IS NULL
        THEN
                 my_data.start_date + 7
        ELSE 
                 my_data.cancellation_date
        END AS cancellation_date,

        -- Use a line break before aggregations.
        SUM(some_cte.field_4),
        MAX(some_cte.field_5)

    FROM 
            my_data

    LEFT JOIN 
            some_cte  
    ON 
             my_data.id = some_cte.id

    WHERE 
            my_data.field_1 = 'abc'
               AND  (
                my_data.field_2 = 'def' or
               my_data.field_2 = 'ghi'
                )

    GROUP BY 
         my_data.field_1,
        my_data.field_2,
        my_data.field_3,
        cancellation_date,
    HAVING 
        COUNT(*) > 1

)

SELECT * FROM FINAL

Topic		Replies	Views
Introducing sqlfmt, an auto-formatter for dbt SQL Show and Tell sqlfmt	22	23170	April 12, 2023
Sharing my SQL Style Guide Show and Tell best-practice	9	6826	December 17, 2020
Why the Fishtown SQL style guide uses so many CTEs Archive	14	38819	March 24, 2022
SQL formatting config Help dbt-cloud	2	166	November 19, 2024
Coding standards document Archive	2	4144	January 15, 2021

SQL Readability: A Consideration

Related topics