How to get started? Snowplow + BigQuery

Hi,

I’m trying to get started with dbt, but I’m stuck:

I have Snowplow data in BigQuery, and I was able to set up a profile, install the Snowplow package, and build a dbt_project.yml file.

Now when I do a test run, I get error messages:

Database Error in test unique_snowplow_base_events_event_id (models/base/schema.yml)
  Table name "pageviews" missing dataset while no default dataset is set in the request.
  compiled SQL at target/compiled/snowplow/schema_test/unique_snowplow_base_events_event_id.sql

Database Error in test not_null_snowplow_base_events_app_id (models/base/schema.yml)
  Table name "pageviews" missing dataset while no default dataset is set in the request.
  compiled SQL at target/compiled/snowplow/schema_test/not_null_snowplow_base_events_app_id.sql

Database Error in test not_null_snowplow_base_events_event_id (models/base/schema.yml)
  Table name "pageviews" missing dataset while no default dataset is set in the request.
  compiled SQL at target/compiled/snowplow/schema_test/not_null_snowplow_base_events_event_id.sql

and more of course…

I’m not sure where to set the dataset other than in the profiles.yml file. Am I missing something here?

This is my profile:

my-bigquery-snowplow-db:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: oauth
      project: hereismygcpprojectname
      dataset: snowplow
      threads: 1
      timeout_seconds: 300
      location: EU
      priority: interactive

And my dbt_project.yml:

#settings specifically for this models directory
#config other dbt settings within ~/.dbt/profiles.yml
name: 'snowplow'
version: '0.0.1'

source-paths: ["models"]
target-path: "target"
clean-targets: ["target"]
test-paths: ["test"]
analysis-paths: ["analysis"]
data-paths: ["data"]
macro-paths: ["macros"]

models:
  snowplow:
    base:
      materialized: ephemeral
      optional:
        enabled: false
    page_views:
      optional:
        enabled: false

    vars:
      'snowplow:events': pageviews
      'snowplow:context:web_page': null
      #'snowplow:context:performance_timing': TABLE OR {{ REF() }} or FALSE
      #'snowplow:context:useragent': TABLE OR {{ REF() }} or FALSE
      'snowplow:timezone': 'Europe/Berlin'
      'snowplow:page_ping_frequency': 10
      'snowplow:app_ids': ['MZ Webseite']
      'snowplow:pass_through_columns': []

My dataset in Bigquery:

hereismygcpprojectname:snowplow:pageviews

I’m wondering what’s wrong? Can you help me?

Cheers
Andreas

Hi Andreas,

I believe the issue lies in how you set the 'snowplow:events' variable in dbt_project.yml.

Profile Configuration

The dataset that you specify in your profiles.yml is actually where dbt will create the models in your project. So, in your case, the models in the Snowplow package will be created in the snowplow dataset. Generally, you want to keep your models separate from the source tables, so you can name the dataset in your profile something like snowplow_dbt instead.
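For example, only the dataset line in your profile would need to change (snowplow_dbt is just a suggested name):

      dataset: snowplow_dbt  # dbt builds its models here, separate from the raw snowplow dataset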

Project Configuration

The vars section of dbt_project.yml is where you specify the path to the pageviews table. You can either give the fully qualified table name (in your case: hereismygcpprojectname.snowplow.pageviews) or, per dbt best practice, use the ref function.

If you use the ref function, you would need to create a model in your models folder. For example, you can create the model snowplow_pageviews_base.sql:

{{ config(materialized='view') }}

select * from `hereismygcpprojectname`.`snowplow`.`pageviews`

FYI, you can also use sources.
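A minimal sketch with sources (the file name is just an example, and this assumes the raw data lives in the same GCP project as your profile) would be a models/sources.yml:

version: 2

sources:
  - name: snowplow
    schema: snowplow  # the BigQuery dataset that holds the raw pageviews table
    tables:
      - name: pageviews

and then snowplow_pageviews_base.sql becomes:

select * from {{ source('snowplow', 'pageviews') }}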

And then your dbt_project.yml would look something like this:

#settings specifically for this models directory
#config other dbt settings within ~/.dbt/profiles.yml
name: 'snowplow'
version: '0.0.1'

source-paths: ["models"]
target-path: "target"
clean-targets: ["target"]
test-paths: ["test"]
analysis-paths: ["analysis"]
data-paths: ["data"]
macro-paths: ["macros"]

models:
  snowplow:
    base:
      materialized: ephemeral
      optional:
        enabled: false
    page_views:
      optional:
        enabled: false

    vars:
      'snowplow:events': "{{ ref('snowplow_pageviews_base') }}"
      'snowplow:context:web_page': null
      #'snowplow:context:performance_timing': TABLE OR {{ REF() }} or FALSE
      #'snowplow:context:useragent': TABLE OR {{ REF() }} or FALSE
      'snowplow:timezone': 'Europe/Berlin'
      'snowplow:page_ping_frequency': 10
      'snowplow:app_ids': ['MZ Webseite']
      'snowplow:pass_through_columns': []

Also, you want to make sure your source table matches up to what the snowplow package requires :wink:

Hope that helps!

Kevin


Kevin, thank you so much, this helped a lot already! I didn’t realize one was for input and one for output.

I’m stuck with new errors unfortunately:

One SQL error I could fix myself: the packaged script applied TIMESTAMP() to a field that was already of type TIMESTAMP. With that fixed, I managed to build the snowplow_id_map table.

Now for the web page context, the package expects the table where the web page context is stored. However, in BigQuery this is a nested (repeated) column rather than a separate table, and I don’t know how to specify it:

This doesn’t work:

'snowplow:context:web_page': "myproject.snowplow.pageviews.contexts_com_snowplowanalytics_snowplow_web_page_1_0_0"

Do you know how to reference the nested table structure?

Another thing I’ve stumbled across is the missing "root_id" (Snowplow only uses it for stitching context tables to the events table, and since there are no separate tables to stitch in BigQuery, it isn’t there).

Thanks
Andreas

Ok, I just realized what you were suggesting before and I think I can solve it this way:

SELECT x.id as root_id FROM `myproject.snowplow.pageviews`, unnest( contexts_com_snowplowanalytics_snowplow_web_page_1_0_0) as x

But where exactly do I put this file when I’m using the package version of the Snowplow SQL scripts? It doesn’t seem like I can tell it to use my own models folder?

You are heading in the right direction.

You can add this as a separate model in your models folder

snowplow_webpage_context_base.sql:

select context.*
from `myproject.snowplow.pageviews`
cross join unnest(contexts_com_snowplowanalytics_snowplow_web_page_1_0_0) as context

and then reference that model in your variable:

vars:
      'snowplow:events': "{{ ref('snowplow_pageviews_base') }}"
      'snowplow:context:web_page': "{{ ref('snowplow_webpage_context_base') }}"

Thank you for your help @kevinkong!

However, I’m really having trouble wrapping my head around the dependencies. Maybe you can elaborate? With the new file added to my local /models folder, I now get the following:

Compilation Error in model snowplow_page_views (models/page_views/snowplow_page_views.sql)

  1. I have the Snowplow package 0.7.5 installed via dbt deps.
  2. It is located in /dbt_modules/snowplow and contains a macros and a models folder. There are already multiple versions of snowplow_web_page_context.sql in there (in models/page_views, in macros/adapters/default/page_views/, and in macros/adapters/bigquery/page_views/).
  3. In my local models folder I add a new SQL file, snowplow_web_page_context.sql.

I’ve tried a multitude of folder structures in my local models folder, but the errors seem to stay the same.

My questions would be:

  1. If I reference a file with ref, in what order are the files mentioned above resolved? Where does dbt look first? Do I have to replicate all three locations?
  2. What should my local folder structure in /models look like so that the files are found?
  3. How do I disable the original file in dbt_modules and reference my own file instead?
  4. How does the path back to the package files work? If I replicate one file in my local folder, will dbt automatically fall back to the dbt_modules folder for the files I haven’t replicated?

An easier example would be this:

There seems to be an error in snowplow/macros/adapters/bigquery/identification/snowplow_id_map.sql

max(timestamp(collector_tstamp)) should be max(collector_tstamp) instead.

I could fix this directly in that folder, but that would be bad style. I guess I should rather fix it in my own local /models/ folder (I probably won’t need the macro).

So I’ve set up /models/snowplow/identification/snowplow_id_map.sql with the fixed file.

Now I’ve tried to edit my dbt_project.yml:

Somewhere on GitHub it says to disable the package file so that dbt uses my new file:

models:
  snowplow:
    base:
      materialized: ephemeral
      optional:
        enabled: false
    #      snowplow_base_web_page_context:
    #        enabled: false
    identification:
       snowplow_id_map:
         enabled: false

But what I’m getting with dbt run is:

Model 'model.snowplow.snowplow_sessions' depends on model 'snowplow_id_map' which was not found or is disabled

Which kind of makes sense to me, because I didn’t reference my new file anywhere in my dbt_project.yml. I just can’t figure out how exactly to do this.

Sorry for all the questions, but I really can’t figure out exactly how this works. I’ve read the documentation up and down multiple times, but it only seems to scratch the surface.

Best
Andreas

I don’t want to spread any false knowledge about dbt :sweat_smile: so I will let @drew answer these technical questions.

Our project is actually doing the same thing you are doing with snowplow_id_map.sql, and it appears our dbt_project.yml files are set up similarly… My guess is that you also need to specify your local project in the project config, and it should be named differently from the snowplow package.

See Claire’s sample project: https://github.com/fishtown-analytics/jaffle_shop/blob/master/dbt_project.yml

E.g.

models:
  myproject:

  snowplow:
    base:
      materialized: ephemeral
      optional:
        enabled: false
    #      snowplow_base_web_page_context:
    #        enabled: false
    identification:
       snowplow_id_map:
         enabled: false

Hey @volderette - check out this issue if you haven’t already seen it: https://github.com/fishtown-analytics/snowplow/issues/70

This Snowplow package predates Snowplow’s native support for BigQuery. We had our own custom BigQuery loader for Snowplow which split out custom contexts into distinct tables. As such, the BigQuery implementation of this package is probably not going to work so well for a native snowplow-loaded dataset like the one that you have!

I’ll also say: we tried to make the models in the Snowplow package compatible with each of Redshift, Snowflake, and BigQuery. We accomplished this by making macros which implemented the correct SQL for each database, then calling those macros from the relevant dbt models. This usually works pretty well in practice, but here, it’s very confusing. Future versions of this package will not use the same pattern, and will hopefully be a little less confusing to operate!
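Roughly, the pattern looks like this (the file and macro names here are illustrative, not the package’s actual ones):

-- macros/current_timestamp_sql.sql
{% macro current_timestamp_sql() %}
  {% if target.type == 'bigquery' %}
    current_timestamp()
  {% else %}
    getdate()
  {% endif %}
{% endmacro %}

-- models/example_model.sql
select {{ current_timestamp_sql() }} as run_at

The model itself is mostly a thin wrapper, and the database-specific SQL lives in the macros, which is why tracing things through the package can feel like a maze.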

I think your approach of disabling the snowplow_id_map model from the snowplow package, substituting it for your own model, is a really good one! I’m surprised that didn’t work. If you’re still working on that, I’d be happy to help out!

Oh! Yes - definitely change the name: of your project from snowplow to something else! The dbt_project.yml config you sent over above disables your own copy of the model too I think! @kevinkong identified this above, but I missed it on the first pass :slight_smile:
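In other words, the name: at the top of your dbt_project.yml and the key you configure under models: should both be your own project’s name, roughly like this (my_project is just a placeholder):

name: 'my_project'

models:
  my_project:
    # config for your own models, including your replacement snowplow_id_map
  snowplow:
    # config for the installed package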

@drew, thanks for tuning in! I did change the local copy, but it seems there are a number of problems with the table format in BigQuery, as the issue you linked also suggests.

It’s really difficult to trace through the macros and adapters, but I will try to create a local copy and fix the problems. I guess this will take some time.

Best
Andreas

Not sure how far you were able to go. If it helps, here’s a DAG of the Snowplow package.

Also, I find it helpful to have this command handy: $ grep -lr 'some_model' . (include the period at the end). This allows you to locate files that contain the text “some_model” if you are having a hard time tracing things.

Thank you @kevinkong, this is definitely helpful! I started a couple of times, but it is really not easy to grasp the connections between adapters, macros, and models. Oftentimes opening one file is a dead end because it just links to the next file. But I guess it takes some time to get used to the way this works in dbt. I do wonder, though, whether anyone else has tried to set up dbt with the Snowplow BigQuery data model.

Cheers
Andreas

I think I’m almost there; there is just one problem left that I can’t seem to solve:

The incremental update of macros/adapters/bigquery/pageviews/snowplow_page_views.sql doesn’t seem to work.

I’m getting an “UPDATE/MERGE must match at most one source row for each target row” error. This makes sense to me, as page_view_id is never unique, yet the table is merged on page_view_id. I will always have multiple rows with the same page_view_id, because the page_view_index splits them up.
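To show what I mean, a check like the following (the table path is just an example pointing at my target dataset) returns multiple rows per page_view_id:

select page_view_id, count(*) as occurrences
from `hereismygcpprojectname.snowplow_dbt.snowplow_page_views`
group by page_view_id
having count(*) > 1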

I don’t exactly know how to fix this. Should I use the event ID instead?

Best
Andreas