Greenplum again! CTEs are not great in greenplum


#1

Hi Folks,

Bit of a spanner here for me, using Greenplum 4.x. There are two query planners in Greenplum, one called the legacy planner which is very clunky, and the super (not so super) “optimizer”. The optimizer reverts to the legacy planner when things get to complicated. Unfortunately neither of these can manage to rewrite CTEs sensibly. Take a query with some base models in CTEs and run them through the optimizer and it will “revert” to the legacy planner and the result is very expensive and doesn’t run. If you copy and paste the CTEs into subqueries, the optimizer doesnt revert and can do some amazing tricks!

So how difficult would it be to add another materialisation strategy “subquery”, which is like ephemeral except writes the dependent models into subqueries instead of CTEs?

(All of the above may or may not be an issue in GP5.x, but I won’t be able to test that until later in the year.)

-James


#2

Hey @jamesg - this would actually be a great question for an issue in GitHub! We’ll be better able to discuss potential features of dbt over there.

My instinct is that a subquery materialization might be a little complicated, but happy to explore what it could look like in the issue!


#3

Thanks @drew, no problem: https://github.com/fishtown-analytics/dbt/issues/1248


#4

Greenplum is based on postgres, does it keep the postgres behaviour of treating CTEs as an optimisation fence? If it does then I would recommend avoiding CTEs entirely regardless of the query planner.


#5

When selecting the query optimizer in GP4.x (set optimizer=on) then the CTEs may or may not end up being materialized internally (fenced) depending on complexity. With dbt you can easily end up with many CTEs and higher complexity and therefore it gets too complex and reverts to legacy mode and “fences” them.