Hi Team,
I am trying to read data from an S3 bucket (Hudi tables) and write the transformed data to GCS storage, but I am unable to get the output written to the GCS bucket. I have all the required permissions in place and set the configs below on the Thrift server, yet dbt keeps defaulting to the s3a filesystem and writes the output data to AWS S3 instead of GCS.
Looking for some guidance from the experts here. Thanks in advance.
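For context, the transformations are plain dbt models run against that Thrift server. A simplified sketch (the model, source, and table names are illustrative, not my real ones):

-- models/orders_transformed.sql (simplified; the real model has more logic)
{{ config(
    materialized='table'
) }}

select *
from {{ source('hudi_s3', 'orders') }}  -- Hudi table on S3, synced to the Glue/Hive metastore

I expected that, with spark.sql.warehouse.dir pointed at gs://gcp-bucket/warehouse/, a model like this would be materialized on GCS, but the table files end up under the s3a warehouse instead.

The Thrift server is started as follows: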
./sbin/start-thriftserver.sh \
  --master local[*] \
  --driver-memory 4G \
  --executor-memory 4G \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.sql.hive.convertMetastoreParquet=false" \
  --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension,org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
  --conf "spark.hadoop.fs.s3a.access.key=" \
  --conf "spark.hadoop.fs.s3a.secret.key=" \
  --conf "spark.hadoop.fs.s3a.endpoint=s3.amazonaws.com" \
  --conf "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem" \
  --conf "hoodie.datasource.hive_sync.enable=true" \
  --conf "hoodie.datasource.hive_sync.use_glue=true" \
  --conf "hoodie.datasource.hive_sync.sync_as_data_source_table=true" \
  --conf "spark.sql.hive.thriftServer.singleSession=true" \
  --conf "hoodie.datasource.hive_sync.database=default" \
  --conf "hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor" \
  --conf "hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator" \
  --conf "hoodie.datasource.write.table.type=COPY_ON_WRITE" \
  --conf "spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog" \
  --conf "spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog" \
  --conf "spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO" \
  --conf "spark.sql.catalog.glue_catalog.warehouse=s3a://bucket-name/folder1/" \
  --conf "spark.sql.thriftServer.bindAddress=0.0.0.0" \
  --conf "spark.driver.bindAddress=0.0.0.0" \
  --conf "spark.sql.catalog.gcp_catalog=org.apache.iceberg.spark.SparkCatalog" \
  --conf "spark.sql.catalog.gcp_catalog.type=hadoop" \
  --conf "spark.sql.catalog.gcp_catalog.warehouse=gs://gcp-bucket/warehouse/" \
  --conf "spark.sql.catalog.gcp_catalog.io-impl=org.apache.iceberg.gcp.gcs.GCSFileIO" \
  --conf "spark.hadoop.fs.defaultFS=gs://gcp-bucket/" \
  --conf "spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem" \
  --conf "spark.hadoop.google.cloud.auth.service.account.enable=true" \
  --conf "spark.hadoop.google.cloud.auth.service.account.json.keyfile=/opt/key-file-from-serviceAcc.json" \
  --conf "spark.eventLog.enabled=true" \
  --conf "spark.eventLog.dir=gs://gcp-bucket/logs/" \
  --conf "spark.sql.warehouse.dir=gs://gcp-bucket/warehouse/" \
  --conf "spark.hadoop.hive.server2.authentication=NONE" \
  --conf "spark.sql.thriftServer.singleSession=true" \
  --conf "spark.sql.thriftServer.protocol=binary"
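To illustrate what I am aiming for: when I connect to the Thrift server directly (for example with beeline), I would expect something along these lines to land under gs://gcp-bucket/warehouse/ (database and table names are again just examples):

-- create a namespace in the GCS-backed Iceberg catalog defined above
CREATE NAMESPACE IF NOT EXISTS gcp_catalog.analytics;

-- write the transformed data to GCS through gcp_catalog
CREATE TABLE gcp_catalog.analytics.orders_transformed
USING iceberg
AS SELECT *
FROM spark_catalog.default.orders_hudi;  -- Hudi table on S3, registered in the metastore

When the same transformation runs through dbt, though, the output table is created under the s3a warehouse rather than on GCS.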
Dear dbt experts,
could you please guide/help here? Thanks in advance!