`group by 1` is a maintenance problem

nchammas · October 14, 2019, 3:49pm

I couldn’t find any place to post comments directly on the blog post, but I wanted to call out a big problem with group by 1 that the author didn’t cover.

group by 1 makes the order of columns in your select clause semantically meaningful! As in, changing the order of the columns in your select clause could change your output if you use group by 1.

That’s such a dangerous line to cross that it totally kills any value I might see in using group by 1. Maintaining SQL that uses this pattern become a dangerous exercise. You reorder some columns just because it looks better to you, thinking it’s a harmless change, but oops! The output is all different now.

The author writes:

Plus, as someone that writes a lot of SQL, I very much appreciate the fact that 1 is exactly one character.

I think this reflects a typical divide in programming styles and patterns. Some are optimized for the person writing the code, and some are optimized for the person reading and maintaining the code. group by 1 is very much a benefit to the writer, and a harm to the reader and maintainer.

The author opens her post with a choice between two styles of writing a query. But she misses a third style:

with enhanced as (
    select
        date_trunc('month', order_created_at) as order_month,
        *
    from core.fct_orders
)
select
    order_month,
    count(order_id) as n_orders,
    sum(revenue) as total_revenue
from enhanced
group by order_month;

If you want not to repeat yourself in your group by clause, then common table expressions are the way to go. For the sake of maintainability, I’d steer clear of group by 1.

emilie · October 14, 2019, 4:06pm

My 2 cents:

I think @Claire proactively addresses this when she writes,

If I want to know what grain my report is at, it’s easy for me to parse the 1 , jump up a couple of lines of SQL and realize it’s at the grain of order_month .

Your first column should indicate the grain of your model and that shouldn’t change as you enhance the model over time. Changing the grain of a model creates a fundamentally new model.

nchammas · October 14, 2019, 4:21pm

I think enforcing conventions like “first column in select clause should be your group by column”, which perhaps is the author’s ultimate goal, is best left to a linter or to code review.

But in the absence of such a linter, enforcing your convention by recommending a dangerous pattern like group by 1 seems like a step back. It elevates the choice of select column order from a harmless choice of “better style vs. worse style” to a critical choice of “correct results vs. incorrect results”.

And if your goal is better style, then just advocate the better style directly, rather than couple it to a dangerous pattern that raises the stakes of mistakenly deviating from the recommended style.

claire · October 15, 2019, 12:52pm

Hey Nick! I’m a little confused at the moment.

You reorder some columns just because it looks better to you, thinking it’s a harmless change, but oops! The output is all different now.

Can you give an example of a query with “group by 1” where reorganized columns would give a different output?

In my experience, your SQL engine is going to return an error rather than a different result set. We test all our queries before deploying them, so fortunately this wouldn’t make it into prod in our world.

michael.dunn · October 16, 2019, 2:28pm

You have to group by everything that isn’t aggregated, so changing from

select
  a
  ,b
  ,c
  ,sum(d)
from t
group by 1,2,3

to

select
  b
  ,c
  ,a
  ,sum(d)
from t
group by 1,2,3

doesn’t change the resultset in any meaningful way.

Changing it to

select
  a
  ,b
  ,sum(d)
  ,c
from t
group by 1,2,3

certainly would, but as @claire points out, that would cause an error, since you’d no longer be grouping by all non-aggregated columns. Maybe this is exactly what you’re getting at, but this is why it’s important to establish and enforce consistent code style.

nchammas · October 16, 2019, 11:11pm

Oh, the situation is better than I thought.

I am not concerned about errors, since those are easy to see and understand. I was concerned about a seemingly-innocent change in the query silently producing incorrect results, but indeed that doesn’t seem to be the case due to how group by works.

I didn’t think it through with a specific example for group by because my main concern is with ordinal references, and ordinal references generally are a bad idea for the reasons I outlined.

An easy example is with order by:

select
    col1,
    col2,
    col3,
    ...
from le_table
order by 1, 2;

Changing the order of columns in, or adding new columns to, your select clause will change your overall result ordering. If you somehow build on this ordering with window functions, then it may also change the meaning of your results entirely.

Another example of how ordinal references can unexpectedly cut you is with functions that take positional parameters (as opposed to keyword parameters), where reordering the input parameters changes the behavior of the function:

# Did I use the function correctly?
compute_result(input, lookup)

# Here, it's obvious.
compute_result(lookup_table=lookup, input_table=input)

So indeed, what I said here about group by 1:

You reorder some columns just because it looks better to you, thinking it’s a harmless change, but oops! The output is all different now.

is incorrect.

However, though it’s not as bad as I initially claimed, group by 1 still introduces unnecessary maintenance risk due to its use of ordinal references.

For example, changing this:

select
    name,
    count(*)
from person
group by 1;

to this:

select
    city,
    count(*)
from person
group by 1;

will quietly produce different results. With group by name, however, the change will produce an error.

It seems like an overly contrived example, but in a large, messy codebase with multiple contributors, scenarios like this are possible. Data testing will certainly catch a problem like this, but this is a more difficult bug to catch as compared to a query that simply fails. Patterns that make bugs harder to create and easier to spot should be preferred to those that do the opposite.

So, though my initial claim was incorrect, I think the minor added maintenance risk described above, combined with the cognitive load of having to dereference ordinals to meaningful names, still make group by 1 a bad pattern for code readers and maintainers. (And that goes for ordinal references in general.)

michael.dunn · October 17, 2019, 4:02pm

Great write-up.

Not sure about these though:

changing [query grouping by name] to [query grouping by city] will quietly produce different results

Wouldn’t you expect it to produce different results, given you’re changing the columns you’re selecting?

the cognitive load of having to dereference ordinals to meaningful names

You know that you’re grouping by everything that isn’t an aggregate, and if you’ve followed the advice to put all non-aggregates before all aggregates, I don’t see why you’d ever need to mentally map the ordinals of a group by to the column names. 5 non-aggregated columns → group by 1,2,3,4,5. Don’t really care which column is 2 and which is 5, because it literally doesn’t matter.

Your point about ordinal references is well-taken in the context of order by, but for a group by statement, I just don’t see the impact.

nchammas · October 19, 2019, 5:05pm

Yes, sorry, I should have been clearer. Of course, any change to the columns in your select clause will change your output. What I meant in the case of group by 1, though, was that the semantics of the query will have changed.

Maybe this concern simply reflects a bias I have. When I look at a SQL query as a maintainer, I treat the different parts of the query with different levels of sensitivity. To me, the select clause is the “safest” to touch, because modifying it simply adds, removes, or reorders attributes while returning the same logical result. And if I mess that up my query will quickly and loudly fail.

With ordinal references, however, that’s no longer true. To return to my example, if I think I no longer need name in this query:

select
    name,
    count(*)
from person
group by 1;

and replace it with another column, not only have I changed what attributes the query is returning, but I’ve changed how the aggregation is being done. That’s a more significant level of change – a change to the meaning of the result – and that’s what I was pointing out.

In a world without ordinal references, I don’t have to think that extra bit about whether changing the order or content of the select clause is affecting the semantics of my query. It simply changes the attributes being returned.

Hmm, I’m not following. My point is that group by 1 is not meaningful to a human. The 1 has to be mentally translated into a meaningful name. In her post, @claire wrote:

In my opinion, the second query is cleaner and easier to read. If I want to know what grain my report is at, it’s easy for me to parse the 1 , jump up a couple of lines of SQL and realize it’s at the grain of order_month .

This process of dereferencing the ordinal is a cognitive load. It’s not an issue if you have short queries that diligently adhere to a style guide, but when you have large, messy codebase, this load can be noticeable. The load is borne by the person reading or maintaining the code, and less so by the person who initially wrote it. It can be avoided entirely by using CTEs and meaningful aliases, as I demonstrated at the end of my initial post.

I don’t feel as strongly about group by 1 as I did when I wrote that post. Thank you for helping me clarify my thinking there. But overall, I’m still against the idea. I think I’ve made the case as best as can be made, but having explained my thought process out loud, I’ll take one final go at summarizing my argument.

Avoid ordinal references in SQL (including group by 1) because:

They create a meaningless layer of indirection.
They create an unnecessary coupling between the order and selection of columns in your select clause and the overall semantics of your query.

Prefer proper names and aliases because:

They directly confer meaning to the reader.
They decouple the details of how your select clause is structured from the overall semantics of your query.

claire · October 25, 2019, 6:44am

Honestly, I’m just excited that there are people who care as much about their SQL style as I do – I really appreciate your input to the discussion!

I actually don’t have anything against group by column_name, and will happily adopt it if a client I’m working with wants to use it. I just think group by 1 can get a bad rap, which I believe is underserved! (Sure the article had a lot of hyperbole in it, but it’s a blog post, not technical documentation, so we’re allowed to push the boundaries a little )

As I said in the original article:

At the end of the day, be opinionated in your SQL style … and stick with it consistently.

Topic		Replies	Views
Sharing my SQL Style Guide Show and Tell best-practice	9	6703	December 17, 2020
SQL Readability: A Consideration Archive	4	4969	April 30, 2020
Introducing sqlfmt, an auto-formatter for dbt SQL Show and Tell sqlfmt	22	21981	April 12, 2023
Why the Fishtown SQL style guide uses so many CTEs Archive	14	37943	March 24, 2022
How to not lose your mind when auditing data: Part II Show and Tell data-quality	2	9923	April 13, 2020

`group by 1` is a maintenance problem

Related topics