Greenplum! Extend postgres adapter for greenplum table parameters


#1

Hey everyone,

I want to either extend the existing postgres adapter or clone/extend the postgres adapter for use on greenplum MPP.

This means adding some syntax to the create table statement to deal with storage parameters and to set the distribution key.

I’ve managed to write the code required to do this, but I’m not entirely clear how the adapter structure is laid out in dbt, or what the best way is to make this extension. I see two options:

  1. Extend the postgres adapter to pick up the extra config. This is easy, but it has the downside that the extra config parameters can’t be ignored when switching the same code between generic postgres and Greenplum
  2. Clone the postgres adapter and create a separate greenplum version. This is more involved and I’d need some help to add it to the code base.

Thoughts folks? :slight_smile:

Cheers,
-James


#2

There is actually very little to the postgres adapter, because most of the postgres behaviour is the default behaviour. Adapters only override behaviour when they need to do something different.

I would recommend cloning the postgres adapter and changing what is different in Greenplum. The big caveat is that adapters are getting reworked soon™ so that they’re treated as plugins. That might mean you have to redo some of your work when that happens.


#3

This is cool! We definitely didn’t set out to support Greenplum explicitly, but I’m really glad it’s similar enough to postgres to be functional.

Historically, dbt has shipped with the code required to connect to every database it supports. That meant that there were files called postgres.py, redshift.py, etc, in the dbt source code. In our upcoming Stephen Girard release, we’re building out a plugin architecture for dbt’s adapters. As a result, dbt can be extended to work with different databases without modifying any of the code in dbt-core.

This is all to say: I’m glad you’ve found dbt and Greenplum to be a nice pairing as we’re really focusing on use cases like these at present :slight_smile:

Are you able to share what your new configs look like? If they’re relatively small and self-contained, there’s actually a decent way to do this without changing any dbt code at all! This would sort of be a stopgap until Stephen Girard is complete which, as @elexisvenator noted, is coming soon ™


#4

Greenplum is just postgresql under the covers, so almost everything is the same, including the drivers. But there are some syntax extensions. The only one really relevant here is create table as select: we need to set the storage parameters (eg, append-only, columnar) and the distribution key (randomly, or by a column list)
https://gpdb.docs.pivotal.io/500/ref_guide/sql_commands/CREATE_TABLE_AS.html
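
For reference, the kind of DDL we want dbt to end up emitting looks something like this (a hand-written sketch based on the CREATE TABLE AS docs above; the schema, table, and column names are made up):

```sql
create table analytics.my_model
with (appendonly=true, orientation=column)
as (
    select id, word, count(*) as n
    from analytics.raw_events
    group by 1, 2
)
distributed by (word);
```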

So I created a new file greenplum.sql that would live in include/global_project/macros/adapters, and looks like this:

{% macro distributed_by(raw_distributed_by) %}
  {%- if raw_distributed_by is none -%}
    {{ return('') }}
  {%- endif -%}

  {%- if raw_distributed_by in ['randomly', 'RANDOMLY'] -%}
    {% set distributed_by_clause %}
      distributed randomly
    {%- endset -%}
  {%- else -%}  
    {% set distributed_by_clause %}
      distributed by ({{ raw_distributed_by }})
    {%- endset -%}
  {%- endif -%}

  {{ return(distributed_by_clause) }}
{%- endmacro -%}


{% macro with_table_storage_parameter(raw_table_storage_parameter) %}
  {%- if raw_table_storage_parameter is none -%}
    {{ return('') }}
  {%- endif -%}

  {% set table_storage_parameters %}
  with (
  {#- a single string is emitted as-is; a list is comma-joined -#}
  {%- if raw_table_storage_parameter is string -%}
    {{ raw_table_storage_parameter }}
  {%- else -%}
    {%- for param in raw_table_storage_parameter -%}
      {{ param }}{%- if not loop.last -%}, {%- endif -%}
    {%- endfor -%}
  {%- endif -%}
  )
  {%- endset -%}

  {{ return(table_storage_parameters) }}

{%- endmacro -%}


{% macro postgresql__create_table_as(temporary, relation, sql) -%}
  {%- set raw_distributed_by = config.get('distributed_by', none) -%}
  {%- set raw_table_storage_parameter = config.get('table_storage_parameters', none) -%}

  create table {{ relation }}
  {{ with_table_storage_parameter(raw_table_storage_parameter) }}
  as (
    {{ sql }}
  )
    {{ distributed_by(raw_distributed_by) }}
  ;
{% endmacro %}


{% macro postgresql__create_view_as(relation, sql) -%}
  create or replace view {{ relation }} as (
    {{ sql }}
  );
{% endmacro %}
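
If it helps to sanity-check the clause-building logic outside of Jinja, here it is sketched in plain Python (a hypothetical stand-in for the macros above, not dbt code):

```python
def distributed_by(raw):
    # None -> no clause; 'randomly' -> distributed randomly; else a column list
    if raw is None:
        return ''
    if raw in ('randomly', 'RANDOMLY'):
        return 'distributed randomly'
    return 'distributed by ({})'.format(raw)

def with_table_storage_parameter(raw):
    # Accept a single 'key=value' string or a list of them
    if raw is None:
        return ''
    params = [raw] if isinstance(raw, str) else list(raw)
    return 'with ({})'.format(', '.join(params))
```

For example, `distributed_by('word')` returns `distributed by (word)`, and `with_table_storage_parameter(['appendonly=true', 'orientation=column'])` returns `with (appendonly=true, orientation=column)`.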

As you can see, these are very simple additions, and they should pick up config from a model that looks like:

{{ config(
    materialized='table',
    distributed_by='word'
) }}

What I can’t quite figure out is what else I need to configure and where to activate this new adapter clone. It’s a bit confusing (for me! :slight_smile: ) where and how the line is drawn between the default config/behaviour and picking up something new…


#5

@jamesg this is awesome!

What I can’t quite figure out is what else i need to configure and where to activate this new adapter clone

So, it’s hard to answer this question, as this logic has indeed just diverged. In 0.12.2 (released today!) this code all lives directly in include/global_project/macros/adapters, as you’ve noted. Going forward, we’ll have different “plugins” for each database, each with their own set of connection-related code and SQL overrides. You can check out an example here.

The bad news is that we’re still a ways out from finalizing this code, so contributions are going to be tricky to manage at this very moment. The GREAT news is that dbt is built in such a way that this code doesn’t actually need to live inside of dbt!

In a dbt materialization, dbt will call a macro like create_table_as to effect some change on the database. We call these types of macros “adapter macros” because they outsource the underlying logic to a specific macro based on the active adapter. You can check out an adapter macro in action here. Finally, note that dbt’s macro precedence will prefer “local” macros defined in your project over the “global” macros that ship with dbt.
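
For a concrete picture of that dispatch, the 0.12-era global create_table_as macro looks roughly like this (a sketch from memory, not verbatim dbt source):

```sql
{% macro create_table_as(temporary, relation, sql) -%}
  {#- dispatches to <adapter>__create_table_as for the active adapter -#}
  {{ adapter_macro('create_table_as', temporary, relation, sql) }}
{%- endmacro %}
```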

As a result of all of this, you can actually just take the macros you’ve included above and paste them as-is into .sql files in your project’s macros/ directory. dbt will prefer your version of postgresql__create_table_as over the default implementation, and you should be able to effectively change the behavior of dbt without modifying any of its code :slight_smile:

So, try making a file like this and giving dbt run a spin!

-- macros/greenplum_overrides.sql

{% macro distributed_by(raw_distributed_by) %}
  {%- if raw_distributed_by is none -%}
    {{ return('') }}
  {%- endif -%}

  {%- if raw_distributed_by in ['randomly', 'RANDOMLY'] -%}
    {% set distributed_by_clause %}
      distributed randomly
    {%- endset -%}
  {%- else -%}  
    {% set distributed_by_clause %}
      distributed by ({{ raw_distributed_by }})
    {%- endset -%}
  {%- endif -%}

  {{ return(distributed_by_clause) }}
{%- endmacro -%}


{% macro with_table_storage_parameter(raw_table_storage_parameter) %}
  {%- if raw_table_storage_parameter is none -%}
    {{ return('') }}
  {%- endif -%}

  {% set table_storage_parameters %}
  with (
  {#- a single string is emitted as-is; a list is comma-joined -#}
  {%- if raw_table_storage_parameter is string -%}
    {{ raw_table_storage_parameter }}
  {%- else -%}
    {%- for param in raw_table_storage_parameter -%}
      {{ param }}{%- if not loop.last -%}, {%- endif -%}
    {%- endfor -%}
  {%- endif -%}
  )
  {%- endset -%}

  {{ return(table_storage_parameters) }}

{%- endmacro -%}


{% macro postgresql__create_table_as(temporary, relation, sql) -%}
  {%- set raw_distributed_by = config.get('distributed_by', none) -%}
  {%- set raw_table_storage_parameter = config.get('table_storage_parameters', none) -%}

  create table {{ relation }}
  {{ with_table_storage_parameter(raw_table_storage_parameter) }}
  as (
    {{ sql }}
  )
    {{ distributed_by(raw_distributed_by) }}
  ;
{% endmacro %}


{% macro postgresql__create_view_as(relation, sql) -%}
  create or replace view {{ relation }} as (
    {{ sql }}
  );
{% endmacro %}

Let me know if that all makes sense


#6

Bingo, yep, it works. I had a moment there before realising that it’s called postgres, not postgresql, but once I changed that it spat out the right thing.

Thank you!


#7

That’s great! This type of setup isn’t sustainable long-term, but we’ll be much better suited to let adapters vary in their functionality once our Stephen Girard release comes together. Super happy to hear you can get it working in the meantime :slight_smile:


#8

Yep, I agree. Going forward it would be important to be able to change just the database type in profiles.yml and have everything still work.
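
For example, ideally only the type line in profiles.yml would need to change (hypothetical profile name and credentials below):

```yaml
my_profile:
  target: dev
  outputs:
    dev:
      type: postgres   # would become 'greenplum' once a separate adapter exists
      host: localhost
      port: 5432
      user: dbt
      pass: secret
      dbname: analytics
      schema: public
```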

So anyway I’m happy for now, but also happy to rework it when the next release comes out!