(DBT-Glue) build fails when creating a seed

The problem I’m having

I run dbt with the Glue connector as a scheduled task in AWS Fargate. I have a test project that loads a small amount of data from a seed and runs a couple of models.
I am aware that seeds are not meant to be used as a data source, but this is just for testing.

The container runs:
dbt list → Project is found
dbt debug → No problems found, connection is OK
dbt build → An error occurs, and the error message is very cryptic:

Found 8 models, 2 seeds, 0 sources, 0 exposures, 0 metrics, 480 macros, 0 groups, 0 semantic models
Concurrency: 1 threads (target='default')
1 of 10 START seed file smartanalytics-jkral-warehouse_seeds.oot_store_sample .. [RUN]
Glue adapter: Glue returned `error` for statement None for code
csv = [{"PK": "Oot#3STEST642789391", .... <long text here> }]
df = spark.createDataFrame(csv)
table_name = 'smartanalytics-jkral-warehouse_seeds.oot_store_sample'
if (spark.sql("show tables in smartanalytics-jkral-warehouse_seeds").where("tableName == 'oot_store_sample'").count() > 0):
    df.write.mode("overwrite").format("parquet").insertInto(table_name, overwrite=True)
else:
    df.write.option("path", "s3://smartanalytics-jkral-sawarehousebucketbd9ba2ed-ocjsa1nnfzif/smartanalytics-jkral-warehouse_seeds/oot_store_sample").format("parquet").saveAsTable(table_name)
SqlWrapper2.execute("""select * from smartanalytics-jkral-warehouse_seeds.oot_store_sample limit 1""")
, NameError: name 'null' is not defined

The context of why I’m trying to do this

I have a container with dbt scheduled to run as an AWS Fargate scheduled task. I want to use Glue for processing, the Iceberg table format, and an S3 bucket for storage.

myproject:
  outputs:
    default:
      type: glue
      glue_version: "3.0"
      query-comment: DBT model
      role_arn: "{{ env_var('DBT_GLUE_ROLE') }}"
      region: "{{ env_var('AWS_REGION') }}"
      location: "s3://{{ env_var('DBT_BUCKET_NAME') }}"
      schema: "{{ env_var('DBT_SCHEMA') }}"
      database: "{{ env_var('DBT_SCHEMA') }}"
      session_provisioning_timeout_in_seconds: 120
      workers: 2
      worker_type: G.1X
      idle_timeout: 5
      datalake_formats: iceberg
      tags: "{{ env_var('DBT_JOB_TAGS') }}"
  target: default
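
For anyone reusing this profile in a container: every value that goes through env_var(...) has to be set before dbt starts. A minimal pre-flight sketch is below. The variable names are taken from the profile above; the helper name missing_env_vars and the example values are hypothetical and only there for illustration.

```python
import os

# Variables referenced via env_var(...) in the profile above.
REQUIRED_VARS = [
    "DBT_GLUE_ROLE",
    "AWS_REGION",
    "DBT_BUCKET_NAME",
    "DBT_SCHEMA",
    "DBT_JOB_TAGS",
]


def missing_env_vars(environ=os.environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


# Example with a made-up, partially filled environment (placeholder values).
example_env = {
    "DBT_GLUE_ROLE": "arn:aws:iam::123456789012:role/example-role",
    "AWS_REGION": "eu-west-1",
}
print(missing_env_vars(example_env))
```

Calling missing_env_vars() with no argument checks the real process environment, so it could run as the first step of the container entrypoint, before dbt list.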

What I’ve already tried

  • Checked whether the role has access to S3 and the Glue Catalog. I tried explicitly adding the glue:CreateTable action to the policy, which had no effect.
    const glueJobPolicy = new Policy(this, `${id}-PassRolePolicy`, {
      statements: [
        new PolicyStatement({
          effect: Effect.ALLOW,
          actions: ['iam:PassRole', 'lakeformation:BatchGrantPermissions'],
          resources: ['*']
        }),
        new PolicyStatement({
          actions: ['glue:*', 's3:*', 'glue:CreateTable'],
          resources: ['*']
        })
      ]
    })
  • Checked the names of the S3 bucket and the Glue database → all are correct

Just an update for anyone facing a similar problem in the future.
It seems that seeds don’t handle JSON strings nested inside a CSV (or I didn’t find the right format for them). A simple seed without JSON strings worked well.
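
As for the NameError itself: null is a JSON token, not a Python name, so it fails as soon as it shows up unquoted in Python source. The sketch below reproduces that in plain Python. It assumes the seed rows end up serialized as JSON and pasted into the generated csv = [...] line; the log only shows the resulting line, so that part is my guess, and the row contents here are made up because the real row is elided in the log.

```python
import json

# Hypothetical seed row; the real row in the log is elided, so these values are made up.
row = {"PK": "example-key", "payload": None}

# JSON serialization writes Python's None as the bare token `null`.
json_text = json.dumps([row])

# Evaluating that text as Python source (which is what pasting it into
# generated code amounts to) fails, because `null` is not a Python name.
try:
    eval(json_text)
    error = None
except NameError as exc:
    error = str(exc)

print(error)

# Parsing the same text as JSON instead of evaluating it works fine.
parsed = json.loads(json_text)
print(parsed)
```

If this is what happens in the adapter, then any seed value that ends up as a bare null in the generated code would trigger the same error, not only nested JSON.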
