Last week (October 1, 2020), I had the pleasure of joining Looker and Fishtown Analytics on a webinar covering how we’re using dbt and Looker together at Drizly. We had more questions than we were able to get through during the event, so I’m posting my answers below. If you weren’t able to make it, you can watch the recording here!
Q&A
Any limitations or challenges you’re experiencing with dbt?
We have not experienced any limitations with dbt.
One of the biggest hurdles I’ve had with getting dbt up and running is that I want to recompile data sets on a consistent basis, but that’s a whole other set of tools to learn and manage. How do you all do it?
Our data engineer set up our orchestration layer so that our dbt models run on a schedule each morning. We are currently transitioning to Dagster for orchestration.
Can you give some examples to illustrate the division of work between dbt and Looker? What’s an example of modeling you might do in Looker rather than dbt, or vice versa?
See this blog post.
Emily: are there any teams that aren’t using Looker? If so, why?
Our entire company is going to be on Looker, and we, as the centralized analytics team, manage it for all business users.
What’s the total company size at Drizly?
About 250 at the moment, and growing! See open roles here.
For rapidly growing databases, what can be done to optimize for this, and do dbt build times increase as source data grows?
With Snowflake, it’s easy to scale up the warehouse as data size increases, so we have not run into issues with slow build times. However, we also build our data models with data size and optimization in mind. I would check the query optimization recommendations for your specific warehouse.
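As one example, Snowflake lets you resize a warehouse with a single statement; a hypothetical sketch (the warehouse name is illustrative, not our actual setup):

```sql
-- Hypothetical example: scale a Snowflake warehouse up before a heavy
-- dbt run, then back down afterwards to control credit spend.
ALTER WAREHOUSE transform_wh SET WAREHOUSE_SIZE = 'LARGE';
ALTER WAREHOUSE transform_wh SET WAREHOUSE_SIZE = 'XSMALL';
```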
How is dbt different from Airflow?
Airflow can orchestrate many different processes, and many teams use it for data transformation. dbt, by contrast, is focused entirely on data transformation. With orchestration, understanding dependencies is very important: because dbt models reference each other with the ref() function, dbt can automatically build the DAG and determine the order in which to run models.
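As a minimal sketch (the model names are hypothetical, not from our project), a downstream model declares its upstream dependency with ref(), which is what lets dbt infer the DAG:

```sql
-- models/fct_orders.sql (hypothetical model names)
-- The ref() call tells dbt that fct_orders depends on stg_orders,
-- so dbt knows to build stg_orders first when it resolves the DAG.
select
    order_id,
    customer_id,
    order_total
from {{ ref('stg_orders') }}
```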
How are you orchestrating your dbt runs (e.g. airflow, cron, etc)?
We are currently transitioning to Dagster for orchestration.
Any best practice recommendations on the dbt output side in terms of data classification (e.g. Bronze, Silver, and Gold)?
I don’t have a great answer to this question. I have heard people describe bronze as exploratory and gold as ready for BI, with silver somewhere in between. Because dbt provides a development environment, I would say the dev schema is bronze and production should be gold, but teams roll out new production models in many different ways.
How do your dbt environments interact with Snowflake? Does each analyst have their own dev schema in Snowflake?
Yes, each of our analysts has a dedicated dev schema in Snowflake to develop against. Each dev environment is a zero-copy clone of our prod database, and it refreshes each morning.
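In Snowflake, that clone is a single statement; a hypothetical sketch (database names are illustrative):

```sql
-- Hypothetical example: rebuild an analyst's dev environment each morning
-- as a zero-copy clone of prod. No data is physically copied.
CREATE OR REPLACE DATABASE dev_analyst CLONE prod;
```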
Follow-up question: so Looker can read a dictionary from dbt directly, or can you do it in both tools?
Both dbt and Looker allow you to add descriptions to columns.
What do you use for a data dictionary? Do you use the Looker built in one?
We currently use dbt’s data documentation as our data dictionary.
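In dbt, those descriptions live in a YAML properties file and surface on the generated docs site; a minimal sketch with hypothetical model and column names:

```yaml
# models/schema.yml (hypothetical names, not our actual project)
version: 2

models:
  - name: fct_orders
    description: "One row per completed order."
    columns:
      - name: order_total
        description: "Order amount in USD, including tax."
```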
So you would prototype a PDT in Looker and then build it upstream in dbt?
Yes! This is our recommendation.
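In practice (with hypothetical model names), the SQL prototyped in a Looker PDT’s derived_table moves into a dbt model, and the Looker view is then pointed at the dbt-built table:

```sql
-- models/fct_daily_sales.sql (hypothetical): the query originally
-- prototyped as a Looker PDT, now materialized upstream by dbt on a schedule.
{{ config(materialized='table') }}

select
    order_date,
    sum(order_total) as daily_sales
from {{ ref('fct_orders') }}
group by order_date
```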
Building on what you just said about having two coordinated pull requests, with dbt on one side and the matching LookML change on the other: is that not duplicating work? Could you please elaborate?
I think there are two scenarios:
- You build staging, intermediate, and fact models in dbt. The dbt PR is heavy in this instance; the coordinated Looker PR likely only adds a view for the fact model and joins it to any relevant existing explores.
- You add a single new column. This is a one-line change in the dbt PR plus a single new dimension in the Looker view. This may feel duplicative (and is definitely more work than using LookML alone), but it is the time/speed trade-off mentioned in the webinar; see the sketch below.
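To make the second scenario concrete, here is a hypothetical sketch of the dbt side; the coordinated Looker PR would add one matching dimension to the view:

```sql
-- models/fct_orders.sql (hypothetical): the dbt half of the coordinated PR
-- adds a single column to the fact model.
select
    order_id,
    customer_id,
    order_total,
    delivery_zip  -- new column; the Looker PR adds a matching dimension
from {{ ref('stg_orders') }}
```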
For complex but infrequently used models, do you prefer to keep those in dbt or in Looker (where you might be able to build them on the fly as necessary)?
Consistency in where the modeling happens is important for clarity across the team. While PDTs can let you test functionality quickly, we think having a plan to move those models into dbt is the best workflow.
Have you been able to leverage Looker Blocks?
We use the Snowflake usage block to track our Snowflake usage and credits via a Looker dashboard, but we have not explored many other blocks yet.