First few months with dbt - questions on best practices


#1

After writing out the base models plus a few transformation layers for several of our production systems (NetSuite, Salesforce, and Segment), I have gotten to a place where I want to reach out to the community with a few questions and see if there are any best practices that others have established.

First - I think it might be helpful to understand what has been built from the ground up -

Every data source in the project file has a vars: entry that renames the schema.table to reflect its source. Example: netsuite_transaction_lines. This works around bad table naming from the source or poor schema naming in our Fivetran integration. It also makes it easy to rip and replace schemas if we ever needed to spin up two instances of the same data source in Fivetran and migrate.
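For anyone reading along, the vars setup described above might look something like this (the schema and table names below are illustrative, not the actual project):

```yaml
# dbt_project.yml -- map friendly names onto Fivetran schema.table names
# (names here are hypothetical)
vars:
  netsuite_transaction_lines: "fivetran_netsuite.transaction_lines"
  netsuite_transactions: "fivetran_netsuite.transactions"
```

A model then references the var instead of hard-coding the schema, so swapping Fivetran instances only means changing one line in the project file:

```sql
select * from {{ var('netsuite_transaction_lines') }}
```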

There is a base model and a test for every table. We apply some basic filtering like excluding _fivetran_deleted records and rename columns that aren’t human-readable.
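As a concrete sketch, one of those base models might look like this (the column names are made up for illustration):

```sql
-- models/base/netsuite/base_netsuite_transaction_lines.sql
-- exclude soft-deleted Fivetran rows and rename cryptic columns
select
    transaction_line_id,
    transaction_id,
    item_id,
    foreignamount as amount_foreign_currency,
    netamount     as amount_net
from {{ var('netsuite_transaction_lines') }}
where _fivetran_deleted = false
```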

The first transformation layer brings together related records to create one master view. Example: creating a netsuite_customer_xf model that joins all related tables containing attributes of the customer (vertical, sales rep, payment terms, etc.). We also have some calculation models that generate values that get added back to the customer records. Example: average monthly spend over the last 12 months to define customer size (less than $50k, $50k - $100k, etc.). Continuing with the NetSuite example, we have similar views for support cases, items, and transactions. This has turned joining 20+ tables together into using only 1 or 2 to get answers about our customers.
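For readers following along, a stripped-down version of that kind of _xf model might look like this (the table and column names are assumptions, not our real schema):

```sql
-- models/netsuite/netsuite_customer_xf.sql -- one master view of the customer
select
    c.customer_id,
    c.customer_name,
    v.vertical,
    r.sales_rep_name,
    t.payment_terms,
    s.avg_monthly_spend_12m,
    case
        when s.avg_monthly_spend_12m * 12 < 50000  then 'less than 50k'
        when s.avg_monthly_spend_12m * 12 < 100000 then '50k - 100k'
        else '100k+'
    end as customer_size
from {{ ref('base_netsuite_customers') }} c
left join {{ ref('base_netsuite_verticals') }}     v using (vertical_id)
left join {{ ref('base_netsuite_sales_reps') }}    r using (sales_rep_id)
left join {{ ref('base_netsuite_payment_terms') }} t using (payment_terms_id)
left join {{ ref('netsuite_customer_spend') }}     s using (customer_id)
```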

The next layer for NetSuite captures all the business logic defining the things we want to track from a financial perspective and turns them into mega tables that can be consumed by our current BI solution, Metabase. Using just those 1 or 2 tables brings together every field anyone would ever care about for a customer or transaction and makes it available as a value to filter, pivot, and aggregate on.

This has been successful up to this point, but as I look forward, there are some things I am starting to ask myself that I am hoping someone else has already thought through or solved.

  1. I want to start combining data sources together, but am starting to realize that data sources often share similar column names. Is abbreviating source and/or table names considered best practice? It seems like this could get really long and messy, but I am not sure what the alternative would be. As an example: both NetSuite and our production system have an account table with similar columns - account_name, account_id, account_number, etc. Additionally, most of our production tables have status, amount, currency, and created_at columns, so combining tables even within one data source could become difficult without aliasing each of these to the table (account_status, account_currency, account_created, etc.). Assuming this is my only option, does it make more sense to do this on the base models or only when combining sources? Are there pros and cons?

  2. On the same topic of combining sources, I assume I would just add another model config like “_combined” to bring multiple sources together?

  3. The analyst team is responsible for both external and internal reporting. As mentioned above, we currently use Metabase as our internal BI solution. All other reporting happens through Jaspersoft. Creating mega tables works well for a BI solution, but for our Jaspersoft reports, we will be writing custom SQL for each published report. Is creating another model config and folder structure for these custom reports, included in our project as views, worth doing? It seems like this would simplify things for Jaspersoft, as we would simply do a select * from one model (report), and it would allow us to version control the reports and identify dependencies. Curious to know if anyone is using dbt in a similar fashion.
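To make question 1 concrete, the aliasing approach I'm describing would end up looking something like this when two sources are combined (the join key and names below are hypothetical):

```sql
-- combining the NetSuite and production account tables; shared column
-- names (status, created_at, etc.) get prefixed with their source
select
    ns.account_id   as netsuite_account_id,
    ns.status       as netsuite_account_status,
    prod.id         as prod_account_id,
    prod.status     as prod_account_status,
    prod.created_at as prod_account_created_at
from {{ ref('netsuite_accounts') }} ns
left join {{ ref('prod_accounts') }} prod
    on ns.account_number = prod.account_number
```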

Hope that all makes sense. :sweat_smile:


#2

I’ll try to hit your first two questions. I’m not super-familiar with Jaspersoft so I’m hesitant to weigh in on #3.

You’re asking a question of data warehouse design here, and ultimately, the answer isn’t specific to dbt. You can use many different tools to construct a data warehouse, and you will run into these questions about model design and naming regardless of what tool you use. There are excellent books on the topic:

…there are many more as well. What you seem to be building are “dimension tables”–collections of dimensions about core entities in your warehouse. This is a reasonable thing to do. But it shouldn’t be something that you do only within a single data source.

It seems like you’re running into a lot of naming collisions, but you list a lot of fields that don’t seem important to me. created_at will most likely be on every table in every data source, but that’s not a problem! Each individual entity has its own created_at when you look at the raw data, but a customer only has a single date that it was created. This is where design and business logic come into the picture: it’s your job as the modeler to define what the created_at field for the customers entity is! You can also define additional dimensions, like first_purchased_at or lifetime_value. None of these directly correspond to field names in the source data, nor should they. Data modeling is the process of aggregating knowledge about your business; taking source data and transforming it into meaning. If you start making decisions like this and adopt a design mindset with your modeling efforts, you’ll find that you care less about things like naming collisions and more about the underlying business logic. This is good.
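As a sketch of that idea, a customers model might define these entity-level dimensions like so (the model and column names are invented for the example):

```sql
-- models/customers.sql -- we decide what created_at means for a customer
select
    c.customer_id,
    c.created_at,                        -- chosen definition: the CRM record's creation date
    min(o.order_date)  as first_purchased_at,
    sum(o.order_total) as lifetime_value
from {{ ref('customers_base') }} c
left join {{ ref('orders') }} o using (customer_id)
group by 1, 2
```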

You’ll also run into naming collisions at the model level. We resolve this by naming things like this: netsuite_accounts, prod_postgres_accounts, and accounts. Tables that map directly to a single system always have that system name as a prefix, whereas tables that represent a core business entity (like an account) lose the prefix and become system-independent.
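As an illustration of that convention, the system-independent accounts model sits on top of the system-specific ones (the join key and columns here are assumptions):

```sql
-- models/accounts.sql -- a core business entity, so no system prefix
select
    coalesce(ns.account_name, pg.account_name) as account_name,
    ns.account_id                              as netsuite_account_id,
    pg.account_id                              as prod_postgres_account_id,
    ns.payment_terms,
    pg.plan_tier
from {{ ref('netsuite_accounts') }} ns
full outer join {{ ref('prod_postgres_accounts') }} pg
    using (account_number)
```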