Problems with schema

The problem I’m having

Hello! I don’t have much experience with dbt. I am trying to create an external table in Databricks via the stage_external_sources macro. I have it defined in raw/source.yaml as follows:

- name: table_1
  freshness:
    warn_after: {count: 12, period: hour}
    error_after: {count: 24, period: hour}
  external:
    location: "{{'s3://datahub-' + env_var('environment') + '-raw/table_1/'}}"
    using: csv
    infer_schema: true
    partitions:
      - name: timestamp
        data_type: integer
  columns:
    - name: col_1
      data_type: string
    - name: col_2
      data_type: string
    - name: col_n
      data_type: string

The context of why I’m trying to do this

My problem is that my CSV contains more columns than I have defined in my YAML, and when I create the table using:

dbt run-operation stage_external_sources --args "{select: raw.table_1}" --vars "ext_full_refresh: true"

Data from columns that I don’t want is populating the columns that I do want.
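
To illustrate with made-up values: if the file actually contains

    col_1,unwanted_col,col_2
    a,x,b

then my table ends up with x in col_2 instead of b, as if the columns were matched by position rather than by name.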

Could you please help me fix this issue? I have tried to force the schema with infer_schema (I don’t know whether that applies only to data_type), and I don’t know how to force my data to conform to the schema defined in the YAML.

In the docs of this macro, it says:

    # Specify ALL column names + datatypes.
    # Column order must match for CSVs, column names must match for other formats.
    # Some databases support schema inference.
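
For a CSV, that means declaring every column that exists in the file, in the exact order it appears, including the columns you don’t care about (you can simply ignore them downstream). A sketch based on your example (unwanted_col is a placeholder for whatever extra columns your file actually has, in their real positions):

- name: table_1
  freshness:
    warn_after: {count: 12, period: hour}
    error_after: {count: 24, period: hour}
  external:
    location: "{{'s3://datahub-' + env_var('environment') + '-raw/table_1/'}}"
    using: csv
    partitions:
      - name: timestamp
        data_type: integer
  columns:
    # every column in the CSV, in the exact order it appears in the file
    - name: col_1
      data_type: string
    - name: unwanted_col    # placeholder: not needed downstream, but must still be declared
      data_type: string
    - name: col_2
      data_type: string
    - name: col_n
      data_type: string

I’ve left infer_schema out of the sketch on the assumption that an explicit, complete column list makes it unnecessary.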

Hello,

Thanks a lot for your answer! I hadn’t found that documentation in external | dbt Developer Hub,
but it looks like that’s the right answer, so I’ll mark this as solved.

Thanks again 🙂
