Best Data Architecture with dbt Core and Snowflake on AWS?

Hi everyone,

Our small data team of two engineers is looking for guidance on setting up an efficient and robust data architecture. We’re planning to use dbt Core and Snowflake on AWS, and we’d appreciate any insights on how to best handle CI/CD and orchestration for our specific workflow.

Our current data flow is as follows:

  • Data Ingestion: We started using AWS Glue and Python scripts to pull data from APIs and load it into S3 hourly.

  • Data Loading: Snowpipe picks up new files from S3 in near real time and loads them into Snowflake.

  • Data Transformation: We’d like to use dbt Core for our data transformation models.

  • Code Management: All our code will be stored in GitHub.

The key questions are:

  1. CI/CD Pipeline: How can we set up a CI/CD pipeline that integrates with GitHub to automatically run dbt tests and builds? We want to ensure that every pull request (PR) is thoroughly tested before it’s merged into our main branch. (There’s a rough sketch of what we’re imagining right after the questions.)

  2. Pipeline Orchestration: What is the best way to orchestrate the end-to-end process? Would Airflow be the best option, or is it overkill for such a small team? Would AWS Step Functions be an easier but still capable option? (See the Airflow sketch further down for the kind of flow we have in mind.) We need a tool that can:

    • Trigger the AWS Glue jobs.

    • Wait for the Snowpipe load to complete.

    • Trigger the dbt Core runs after the data is available in Snowflake.

    • Handle dependencies and failures gracefully.

  3. End-to-End ETL Process: Could you provide a detailed example of what this entire process should look like, from the raw data arriving in S3 to the final models being served in Snowflake?

  4. Docker or Not? Given our small team, should we containerize our dbt Core environment using Docker? What are the pros and cons for our specific use case?
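For question 1, here’s roughly the PR check we’ve been imagining: a small script that CI calls on every pull request, building and testing only the models that changed in the PR and deferring everything else to production. This is just a sketch, assuming dbt-core >= 1.5 (programmatic invocation) and that an earlier CI step has downloaded the production manifest.json into ./prod-artifacts; the "ci" target is a placeholder from our profiles.yml.

```python
# ci_check.py -- run dbt builds/tests for a pull request ("Slim CI" style).
# Sketch only: assumes dbt-core >= 1.5 and that the production manifest.json
# has been copied into ./prod-artifacts by an earlier CI step.
import sys

from dbt.cli.main import dbtRunner, dbtRunnerResult

dbt = dbtRunner()

# Build (run + test) only models modified in this PR plus their children,
# deferring unmodified upstream models to the production environment.
args = [
    "build",
    "--select", "state:modified+",
    "--defer",
    "--state", "prod-artifacts",
    "--target", "ci",  # placeholder: a per-PR schema defined in profiles.yml
]

res: dbtRunnerResult = dbt.invoke(args)

if not res.success:
    sys.exit(1)  # non-zero exit fails the CI job, blocking the merge
```

In GitHub Actions this would just be a workflow triggered on pull_request that installs dbt, fetches the prod artifacts, and runs this script.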

We’re open to using other AWS services (like Step Functions or Lambda) or third-party orchestrators. Any advice or examples would be incredibly helpful!
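For questions 2 and 3, this is roughly the end-to-end flow we picture if we went with Airflow (MWAA or self-hosted): trigger the Glue extract, wait until the hour’s files have actually landed in Snowflake, then run dbt. It’s only a sketch, assuming Airflow 2.4+ with the apache-airflow-providers-amazon package installed; the Glue job name, the dbt project path, and the Snowpipe check are placeholders.

```python
# dags/hourly_elt.py -- sketch of an hourly Glue -> Snowpipe wait -> dbt DAG.
# Assumes Airflow 2.4+ with apache-airflow-providers-amazon installed;
# all names below are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.sensors.python import PythonSensor


def snowpipe_caught_up() -> bool:
    """Return True once this hour's files have been loaded by Snowpipe.
    Placeholder: in practice, query Snowflake (snowflake-connector-python)
    or run a `dbt source freshness` check here."""
    return True  # replace with a real check


with DAG(
    dag_id="hourly_elt",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:

    # 1. Extract: pull from the APIs and land files in S3.
    extract = GlueJobOperator(
        task_id="run_glue_extract",
        job_name="api_extract_job",  # placeholder Glue job name
    )

    # 2. Load: wait until Snowpipe has ingested the new files.
    wait_for_load = PythonSensor(
        task_id="wait_for_snowpipe",
        python_callable=snowpipe_caught_up,
        poke_interval=60,
        timeout=30 * 60,
        mode="reschedule",
    )

    # 3. Transform: run and test the dbt project.
    transform = BashOperator(
        task_id="dbt_build",
        bash_command="cd /opt/dbt_project && dbt build --target prod",
    )

    extract >> wait_for_load >> transform
```

A Step Functions state machine could express the same three steps (a Glue task, a wait/poll loop, then an ECS or Lambda task that runs dbt), so for us the real question is whether that’s easier to keep running than an Airflow deployment.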

My two cents:

  1. Glue & Python are okay, but if you need scalability, look at third-party tools that generate the ingestion code for you, especially if the sources are fairly standard. Watch how much of that code you end up maintaining and whether you have good observability over data loading.
  2. I recently replaced Snowpipe with an incremental model built on an external table in dbt. I didn’t need real-time loading, and I can still process 10-30 million records from files every hour in about 3-4 minutes. At least all my code lives in dbt rather than in multiple places (that is key for me).
  3. If you need to orchestrate across different tools, check out Orchestra. And if the dependencies between processes are important, that may drive some changes in your stack, such as using ingestion tools so you can invoke the pipeline much more easily (I am doing that too, moving from a big pile of Python scripts to Rivery).

Hi,

I am coming at this from the other side: we are moving away from a whole framework of AWS Glue and roughly 10 Lambda functions for loading into Snowflake (and before that it was done with a bunch of stored procedures).

  • Working with our commercial teams (the key stakeholders), we have narrowed it down to roughly 300 tables that need near-real-time ingestion. We have created Snowpipes for those, i.e. we load the data in Parquet format into VARIANT columns and then process it as we go upstream.
  • We have a dbt Cloud job that runs roughly every 5 minutes based on tags, and this is working very well.

Cost-wise, given the relatively low volume of records, Snowpipe became a lot more efficient with its per-GB pricing. However, the orchestration side is where we need a more elegant solution: on weekends, for example, I’d rather run this dbt Cloud job every 2 hours instead of every 5 minutes.
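One interim option for that would be to drop the job’s built-in schedule and trigger it from a small scheduled Lambda instead, so the weekday/weekend cadence lives in two EventBridge cron rules rather than in dbt Cloud. Very rough sketch, assuming the dbt Cloud v2 “trigger job run” endpoint on the multi-tenant host; the account ID, job ID, and token are placeholders read from environment variables.

```python
# Sketch of a Lambda handler that triggers a dbt Cloud job on demand.
# Two EventBridge rules would invoke it, e.g. every 5 minutes Mon-Fri and
# every 2 hours Sat-Sun. The requests library must be bundled with the
# deployment package; account ID, job ID and token are placeholders.
import os

import requests

ACCOUNT_ID = os.environ["DBT_CLOUD_ACCOUNT_ID"]
JOB_ID = os.environ["DBT_CLOUD_JOB_ID"]
TOKEN = os.environ["DBT_CLOUD_API_TOKEN"]


def handler(event, context):
    resp = requests.post(
        f"https://cloud.getdbt.com/api/v2/accounts/{ACCOUNT_ID}/jobs/{JOB_ID}/run/",
        headers={"Authorization": f"Token {TOKEN}"},
        json={"cause": "Triggered by EventBridge schedule"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```

The job itself would keep its tag-based selection; only the trigger moves out of dbt Cloud.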

We are looking at Open Flow as an orchestrator + ingestion tool; given that we are highly regulated, it would be nigh impossible to get another tool in.

Cheers

Thank you both @afshin @moseleyi for sharing your valuable insights. Right now we’re trying the Glue + Docker approach in AWS - still exploring the CI/CD portion of the architecture.