After writing out the base models plus a few transformation layers for several of our production systems, NetSuite, Salesforce, and Segment I have gotten to place where I want to reach out to the community on a few questions and see if there are any best practices that others have established.
First - I think it might be helpful to understand what has been built from the ground up -
Every data source in the project file has a vars: where it renames the schema.table to reflect it’s source. Example: netsuite_transaction_lines. This helps with the bad naming of tables from the source or from poor schema naming in our Fivetran integration. This also makes it easy to rip n replace schemas if we needed to spin up two instances of the same data source for Fivetran and migrate.
There is a base model and a test for every table. We apply some basic filtering like excluding _fivetran_deleted records and rename columns that aren’t human-readable.
The first transformation layer brings together related records to create one master view. Example: Creating a netsuite_customer_xf model that joins all related tables that contain attributes (vertical, sales rep, payment terms, etc) of the customer. We also have some calculation models that generate values that get added back to the customer records. Example: average monthly spend over the last 12 months to define customer size (less than 50k, 50k - $100k, etc). Continuing with the example of Netsuite we have similar views for support cases, items, and transactions. This has turned joining 20+ tables together into using only 1 or 2 to get answers about our customers.
The next layer for Netsuite was capturing all the business logic to define different things we wanted to track from a financial perspective and turning them into mega tables that could be consumed by our current BI solution - Metabase. So using the 1 or 2 tables together bring every field anyone would ever care about a customer, transaction and make it available as a value to filter, pivot and aggregate on.
This has been successful up to this point, but as I look forward, there are somethings I am starting to ask myself that I am hoping someone else has already thought of or solved.
-
I want to start combining data sources together, but am starting to realize that data sources often share similar column names. Is abbreviating source and/or table names considered best practice? It seems like this could get really long, messy, but I am not sure what the alternative would be. As an example: both Netsuite and our production system have an account table and similar columns - account_name, account_id, account_number, etc. Additionally, most of our production tables all have a status, amount, currency, created_at columns and combining tables within even one data source could become difficult without aliasing each one of these to the table (account_status, account_currency, account_created, etc). Assuming this is my only option, does it make more sense to do this on the base models or only when combining sources? Are there pros, cons?
-
On the same topic of combining sources, I assume I would just add another model config like “_combined” to bring multiple sources together?
-
The analyst team is responsible for both external and internal reporting. As mentioned above we currently use Metabase as our internal BI solution. All other reporting happens though Jaspersoft. Creating mega tables work well for a BI solution, but for our Jaspersoft reports, we will be doing custom SQL for each published report. Is creating another model config, folder structure for these custom reports to include in our project as views worth doing? It seems like this would simplify things for Jaspersoft as we would simply be doing a select * from one model (report) and it would allow us version control and identify dependencies. Curious to know if anyone is using dbt in a similar fashion.
Hope that all makes sense.