The problem I’m having
The tmp table being written contains null values, which are carried over to the final Iceberg table. I cannot determine what other misalignment issues result from this.
The context of why I’m trying to do this
I want to set up an ELT pipeline that creates an Iceberg lake, and I would like to implement it in a way that preserves and verifies data integrity.
What I’ve already tried
I am using the dbt-glue adapter with the following versions:
dbt=1.7.4
dbt-glue=1.7.1
I am creating an Iceberg lake and have added the following configs to my dbt profile:
"spark.sql.legacy.allowNonEmptyLocationInCTAS=true
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.sql.warehouse=
--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
--conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.dynamodb.DynamoDbLockManager
--conf spark.sql.catalog.glue_catalog.lock.table=myGlueLockTable
--conf spark.sql.legacy.createTableColumnTypesInCatalog=true
--conf spark.sql.sources.default=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.sources.write.semantic=ORC
--conf spark.sql.legacy.createEmptyManagedTableByDefaul=true"
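For context, this conf string lives in the `conf` field of my profile target. A rough sketch of how I have it wired (role ARN, region, bucket, and schema names are placeholders, and the field names are taken from my reading of the dbt-glue docs, so treat them as assumptions):

```yaml
# profiles.yml (sketch; values are placeholders)
my_glue_project:
  target: dev
  outputs:
    dev:
      type: glue
      role_arn: arn:aws:iam::123456789012:role/my-glue-role
      region: eu-west-1
      schema: iceberg_db
      workers: 2
      worker_type: G.1X
      location: s3://BucketName/Iceberg_db/
      conf: "spark.sql.legacy.allowNonEmptyLocationInCTAS=true
        --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
        --conf spark.serializer=org.apache.spark.serializer.KryoSerializer"
        # ...remaining --conf entries from above...
```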
I went through the dbt logs and isolated the issue to the creation of the tmp table:
spark.sql("CREATE TABLE test_table LOCATION 's3://BucketName/Iceberg_db/tmp_test_table' AS SELECT * FROM tmp_tmp_test_table")
When I inspect the table in Athena, the input format, output format, and SerDe serialization lib assigned to it are wrong, i.e.:
Input Format: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde serialization lib: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
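The same metadata can be checked programmatically from the Glue catalog instead of the Athena console. A small sketch of how I verify it (the boto3 call is commented out since it needs AWS credentials; the database and table names are my own, and the inline dict mirrors the shape of the `GetTable` response I see for the broken tmp table):

```python
# Inspect a table's storage metadata in the Glue Data Catalog.
# Uncomment to run against a real catalog:
# import boto3
# glue = boto3.client("glue", region_name="eu-west-1")
# resp = glue.get_table(DatabaseName="iceberg_db", Name="tmp_test_table")

def storage_formats(get_table_response):
    """Extract (input format, output format, SerDe lib) from a Glue GetTable response."""
    sd = get_table_response["Table"]["StorageDescriptor"]
    return (
        sd.get("InputFormat"),
        sd.get("OutputFormat"),
        sd.get("SerdeInfo", {}).get("SerializationLibrary"),
    )

# Shape of the response for my broken tmp table:
resp = {
    "Table": {
        "StorageDescriptor": {
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
            },
        }
    }
}

print(storage_formats(resp))
```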
When I recreate the process in an interactive notebook and override the above statement with:
spark.sql("""
CREATE TABLE tmp_test_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://bucket_name/iceberg_db/tmp_test_table'
AS SELECT * FROM tmp_tmp_test_table
""")
the table gets the correct values:
Input Format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
Output Format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
SerDe serialization lib: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
This resolves the null issue. Is there a way to override the intermediate tmp table creation, or any other setting, that would set the correct input/output formats and SerDe serialization library?
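One avenue I have considered but not yet verified is dbt's macro dispatch mechanism, which (per the dbt docs) lets a project override adapter macros such as `create_table_as` by defining a macro with the adapter-prefixed name. A sketch of what I imagine, hardcoding the Parquet formats that worked in the notebook (the exact macro name and signature for dbt-glue are my assumption; I have not confirmed that the adapter dispatches its tmp table creation through this macro):

```sql
-- macros/glue__create_table_as.sql (hypothetical override, untested)
{% macro glue__create_table_as(temporary, relation, sql) -%}
  create table {{ relation }}
  row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  with serdeproperties ('serialization.format' = '1')
  stored as
    inputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
    outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
  as {{ sql }}
{%- endmacro %}
```

If project-level overrides of adapter macros are not picked up automatically, the dbt docs describe adding a `dispatch` config in `dbt_project.yml` so the root project is searched before the adapter package.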