The problem I’m having
I’m encountering network-related errors when running Python models in dbt on a Dataproc cluster.
Specifically, I get a 404 error: “The resource ‘projects/analytics-euw4-01/global/networks/default’ was not found”.
This is expected: the default network is not meant to be used in our case, and my team has configured a dedicated VPC for Dataproc. I just need a way to tell Dataproc which subnetwork to use, and as far as I can tell, this option only exists for the Serverless submission method. Should it be available for clusters as well?
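To make the request concrete, here is a minimal sketch of the cluster-side setting I mean. It assumes the `subnetwork_uri` and `internal_ip_only` fields of the Dataproc cluster's `gce_cluster_config`; the helper name, the region and the subnetwork name are hypothetical placeholders, and as far as I can tell nothing like this is reachable from dbt today.

```python
def build_gce_cluster_config(project_id: str, region: str, subnetwork: str) -> dict:
    """Return the network part of a Dataproc cluster config as a plain dict.

    Hypothetical helper for illustration only; it calls no Google API.
    """
    return {
        "gce_cluster_config": {
            # Point the cluster at our own subnetwork instead of 'default'.
            "subnetwork_uri": (
                f"projects/{project_id}/regions/{region}/subnetworks/{subnetwork}"
            ),
            # With internal IPs only, the subnetwork must have
            # Private Google Access enabled (see the 400 error below).
            "internal_ip_only": True,
        }
    }


# Region and subnetwork name are placeholders, not our real values.
config = build_gce_cluster_config(
    "analytics-euw4-01", "europe-west4", "my-dataproc-subnet"
)
print(config["gce_cluster_config"]["subnetwork_uri"])
```

What I’m missing is a way to pass the equivalent of `subnetwork_uri` through dbt when the submission method is a cluster.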
The context of why I’m trying to do this
My data science team runs custom Python code for preprocessing, to compute KPIs from the predictions made within our product. This code lives in a third-party package that they manage.
My goal is to call this custom Python code from my dbt models, so that these computations become part of my dbt pipeline and everything lives within dbt.
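Roughly, the kind of model I have in mind looks like the sketch below. It assumes the dbt-bigquery `submission_method` and `dataproc_cluster_name` configs; `compute_kpis`, the `predictions` model and the cluster name are hypothetical stand-ins, not our real code.

```python
def compute_kpis(predictions_df):
    # Hypothetical stand-in for the function provided by the
    # data science team's package; here it returns its input unchanged.
    return predictions_df


def model(dbt, session):
    # Run this model on an existing Dataproc cluster rather than Serverless.
    dbt.config(
        materialized="table",
        submission_method="cluster",
        dataproc_cluster_name="my-dataproc-cluster",  # placeholder name
    )
    # 'predictions' is a hypothetical upstream dbt model.
    predictions_df = dbt.ref("predictions")
    return compute_kpis(predictions_df)
```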
What I’ve already tried
Configuring everything correctly on Dataproc’s side.
Some example code or error messages
400 Subnetwork 'default' does not support Private Google Access which is required for Dataproc clusters when 'internal_ip_only' is set to 'true'. Enable Private Google Access on subnetwork 'default' or set 'internal_ip_only' to 'false'.
404 The resource ‘projects/{project_name}/global/networks/default’ was not found
Is there any way to set the VPC configuration for Dataproc from within the dbt configuration?
Have a nice day!