The problem I’m having
The tmp table being written contains null values, which are carried over to the final Iceberg table. I cannot determine what other misalignment issues result from this.
The context of why I’m trying to do this
I want to set up an ELT pipeline that creates an Iceberg lake, and I would like to implement it in a way that preserves and verifies data integrity.
What I’ve already tried
I am using the dbt-glue adapter with the following versions:
dbt=1.7.4
dbt-glue=1.7.1
I am creating an Iceberg lake and have added the following configs to my dbt profile:
"spark.sql.legacy.allowNonEmptyLocationInCTAS=true
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.sql.warehouse=
--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
--conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.dynamodb.DynamoDbLockManager
--conf spark.sql.catalog.glue_catalog.lock.table=myGlueLockTable
--conf spark.sql.legacy.createTableColumnTypesInCatalog=true
--conf spark.sql.sources.default=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.sources.write.semantic=ORC
--conf spark.sql.legacy.createEmptyManagedTableByDefaul=true"
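For context, this conf string lives in the `conf` field of my profile target. A rough sketch of how I have it wired (role ARN, region, bucket, and schema names are placeholders, and the field names are taken from my reading of the dbt-glue docs, so treat them as assumptions):

```yaml
# profiles.yml (sketch; values are placeholders)
my_glue_project:
  target: dev
  outputs:
    dev:
      type: glue
      role_arn: arn:aws:iam::123456789012:role/my-glue-role
      region: eu-west-1
      schema: iceberg_db
      workers: 2
      worker_type: G.1X
      location: s3://BucketName/Iceberg_db/
      conf: "spark.sql.legacy.allowNonEmptyLocationInCTAS=true
        --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
        --conf spark.serializer=org.apache.spark.serializer.KryoSerializer"
        # ...remaining --conf entries from above...
```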
I went through the dbt logs and isolated the issue to the creation of the tmp table:
spark.sql("CREATE TABLE test_table LOCATION 's3://BucketName/Iceberg_db/tmp_test_table' AS SELECT * FROM tmp_tmp_test_table")
When I inspect the table in Athena, the input format, output format, and SerDe serialization lib assigned to it are wrong, i.e.:
Input Format: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde serialization lib: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
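The same metadata can be checked programmatically from the Glue catalog instead of the Athena console. A small sketch of how I verify it (the boto3 call is commented out since it needs AWS credentials; the database and table names are my own, and the inline dict mirrors the shape of the `GetTable` response I see for the broken tmp table):

```python
# Inspect a table's storage metadata in the Glue Data Catalog.
# Uncomment to run against a real catalog:
# import boto3
# glue = boto3.client("glue", region_name="eu-west-1")
# resp = glue.get_table(DatabaseName="iceberg_db", Name="tmp_test_table")

def storage_formats(get_table_response):
    """Extract (input format, output format, SerDe lib) from a Glue GetTable response."""
    sd = get_table_response["Table"]["StorageDescriptor"]
    return (
        sd.get("InputFormat"),
        sd.get("OutputFormat"),
        sd.get("SerdeInfo", {}).get("SerializationLibrary"),
    )

# Shape of the response for my broken tmp table:
resp = {
    "Table": {
        "StorageDescriptor": {
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
            },
        }
    }
}

print(storage_formats(resp))
```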
When I recreate the process in an interactive notebook and override the above statement with:
spark.sql("""
CREATE TABLE tmp_test_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
)
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://bucket_name/iceberg_db/tmp_test_table'
AS SELECT * FROM tmp_tmp_test_table
""")
the table gets the correct values:
Input Format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
Output Format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
SerDe serialization lib: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
This resolves the null issue. Is there a way to override the intermediate tmp table creation, or any other setting, that would set the correct input/output formats and SerDe serialization library?
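One avenue I have considered but not yet verified is dbt's macro dispatch mechanism, which (per the dbt docs) lets a project override adapter macros such as `create_table_as` by defining a macro with the adapter-prefixed name. A sketch of what I imagine, hardcoding the Parquet formats that worked in the notebook (the exact macro name and signature for dbt-glue are my assumption; I have not confirmed that the adapter dispatches its tmp table creation through this macro):

```sql
-- macros/glue__create_table_as.sql (hypothetical override, untested)
{% macro glue__create_table_as(temporary, relation, sql) -%}
  create table {{ relation }}
  row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  with serdeproperties ('serialization.format' = '1')
  stored as
    inputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
    outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
  as {{ sql }}
{%- endmacro %}
```

If project-level overrides of adapter macros are not picked up automatically, the dbt docs describe adding a `dispatch` config in `dbt_project.yml` so the root project is searched before the adapter package.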