Hi everyone,
Our small data team of two engineers is looking for guidance on setting up an efficient and robust data architecture. We’re planning to use dbt Core and Snowflake on AWS, and we’d appreciate any insights on how to best handle CI/CD and orchestration for our specific workflow.
Our current data flow is as follows:
- Data Ingestion: We’ve started using AWS Glue with Python scripts to pull data from APIs and load it into S3 hourly (a rough sketch of the kind of script we run is just below this list).
- Data Loading: Snowpipe picks up new files from S3 in near real-time and loads them into Snowflake.
- Data Transformation: We’d like to use dbt Core for our transformation models.
- Code Management: All our code will be stored in GitHub.
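For context, here is a stripped-down sketch of the kind of hourly ingestion script we run as a Glue Python shell job. The API endpoint, bucket, and prefix are placeholders, not our real names:

```python
"""Rough sketch of our hourly Glue Python shell job (all names are placeholders)."""
import json
from datetime import datetime, timezone

import boto3
import requests

API_URL = "https://api.example.com/v1/orders"  # placeholder endpoint
BUCKET = "my-company-raw"                      # placeholder S3 bucket
PREFIX = "orders"                              # placeholder S3 prefix


def run() -> str:
    # Pull the latest batch from the source API.
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()
    records = response.json()

    # Land the raw payload in S3, partitioned by load date/hour so Snowpipe
    # (and later dbt) can reason about arrival time.
    now = datetime.now(timezone.utc)
    key = (
        f"{PREFIX}/load_date={now:%Y-%m-%d}/load_hour={now:%H}/"
        f"batch_{now:%Y%m%d%H%M%S}.json"
    )

    s3 = boto3.client("s3")
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(records).encode("utf-8"))
    return key


if __name__ == "__main__":
    print(f"Wrote {run()}")
```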
The key questions are:
- CI/CD Pipeline: How can we set up a CI/CD pipeline that integrates with GitHub to automatically run dbt tests and builds? We want every pull request (PR) to be thoroughly tested before it’s merged into our main branch.
- Pipeline Orchestration: What is the best way to orchestrate the end-to-end process? Would Airflow be the best option, or is it overkill for such a small team? Would AWS Step Functions be an easier yet still capable option? We need a tool that can (see the rough Airflow sketch just after this list):
  - Trigger the AWS Glue jobs.
  - Wait for the Snowpipe load to complete.
  - Trigger the dbt Core runs once the data is available in Snowflake.
  - Handle dependencies and failures gracefully.
- End-to-End ETL Process: Could you provide a detailed example of what this entire process should look like, from raw data arriving in S3 to the final models being served in Snowflake?
- Docker or Not? Given our small team, should we containerize our dbt Core environment using Docker? What are the pros and cons for our specific use case?
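To make the orchestration question more concrete, this is roughly the Airflow DAG shape we’ve been imagining. The job name, connection IDs, table name, and dbt path are all placeholders, and I’m not sure polling Snowflake with a SQL sensor is the idiomatic way to "wait for Snowpipe" — that uncertainty is partly why I’m asking:

```python
"""Rough sketch of the hourly DAG we're imagining (all names/connections are placeholders)."""
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow.providers.common.sql.sensors.sql import SqlSensor

with DAG(
    dag_id="hourly_elt",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    # 1. Kick off the Glue job that pulls from the APIs and lands files in S3.
    ingest = GlueJobOperator(
        task_id="run_glue_ingestion",
        job_name="api_to_s3_hourly",  # placeholder Glue job name
        aws_conn_id="aws_default",
    )

    # 2. Wait until Snowpipe has loaded rows for this hour into the raw table.
    #    (Polls Snowflake; there may well be a cleaner way to wait on Snowpipe.)
    wait_for_snowpipe = SqlSensor(
        task_id="wait_for_snowpipe_load",
        conn_id="snowflake_default",  # placeholder Snowflake connection
        sql="""
            select count(*) > 0
            from raw.orders  -- placeholder raw table
            where loaded_at >= dateadd('hour', -1, current_timestamp())
        """,
        poke_interval=120,
        timeout=60 * 60,
    )

    # 3. Run dbt once the raw data is in place.
    dbt_build = BashOperator(
        task_id="dbt_build",
        bash_command="cd /opt/dbt/project && dbt build --target prod",  # placeholder path
    )

    ingest >> wait_for_snowpipe >> dbt_build
```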
We’re open to using other AWS services (like Step Functions or Lambda) or third-party orchestrators. Any advice or examples would be incredibly helpful!
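For what it’s worth, here is the rough Step Functions shape we sketched while comparing options: the state machine is defined as a Python dict and created with boto3, the Glue job name, Lambda ARN, and role ARN are placeholders, and the fixed "wait then run dbt" step is just a guess at how to handle the Snowpipe delay:

```python
"""Very rough Step Functions sketch (all ARNs and names are placeholders)."""
import json

import boto3

# Glue ingestion -> fixed wait for Snowpipe -> Lambda (or ECS task) that runs dbt.
definition = {
    "Comment": "Hourly ELT: Glue ingest, wait for Snowpipe, then run dbt",
    "StartAt": "RunGlueIngestion",
    "States": {
        "RunGlueIngestion": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",  # waits for the Glue job to finish
            "Parameters": {"JobName": "api_to_s3_hourly"},          # placeholder Glue job
            "Next": "WaitForSnowpipe",
        },
        "WaitForSnowpipe": {
            "Type": "Wait",
            "Seconds": 600,  # crude stand-in for "wait until Snowpipe has loaded the files"
            "Next": "RunDbt",
        },
        "RunDbt": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:run-dbt",  # placeholder Lambda
            "Retry": [{"ErrorEquals": ["States.ALL"], "IntervalSeconds": 60, "MaxAttempts": 2}],
            "End": True,
        },
    },
}

if __name__ == "__main__":
    sfn = boto3.client("stepfunctions")
    sfn.create_state_machine(
        name="hourly-elt",
        definition=json.dumps(definition),
        roleArn="arn:aws:iam::123456789012:role/stepfunctions-elt",  # placeholder role
    )
```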