dbt pipeline with parquet files

davidbackx · April 20, 2023, 12:53pm

Hello, I am trying to build a dbt pipeline that uses parquet files as a data source. As there is no dbt-parquet package, I think the best adapter for this is dbt-duckdb, since DuckDB supports reading parquet files. My goal is eventually to read files stored in an S3 bucket, but getting it to run locally first would already help me a lot.

Right now, running dbt debug always fails with this error:

IO Error: The file ".../sources/energy.parquet" exists, but it is not a valid DuckDB database file!

My profiles.yml looks like the following.

transform_dbt:
  outputs:
    dev:
      type: duckdb
      path: ./sources/energy.parquet
      extensions:
        - httpfs
        - parquet
      settings:
        s3_region: my-aws-region
        s3_access_key_id: "{{ env_var('S3_ACCESS_KEY_ID') }}"
        s3_secret_access_key: "{{ env_var('S3_SECRET_ACCESS_KEY') }}"
  target: dev
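
For comparison, here is a minimal sketch of a profile that should avoid the error above. In a dbt-duckdb profile, `path` names the DuckDB database file that dbt itself reads from and writes to, not the data file to be queried, which is why DuckDB rejects energy.parquet as "not a valid DuckDB database file". The file name transform_dbt.duckdb is a made-up placeholder; everything else is copied from the profile above.

```yaml
transform_dbt:
  outputs:
    dev:
      type: duckdb
      # Hypothetical database file name (an assumption, not from the original post).
      # DuckDB creates this file if it does not exist yet.
      # The parquet file is no longer referenced here; it belongs in sources.yml.
      path: ./transform_dbt.duckdb
      extensions:
        - httpfs
        - parquet
      settings:
        s3_region: my-aws-region
        s3_access_key_id: "{{ env_var('S3_ACCESS_KEY_ID') }}"
        s3_secret_access_key: "{{ env_var('S3_SECRET_ACCESS_KEY') }}"
  target: dev
```

With a profile along these lines, dbt debug should only need to open (or create) the database file, so it no longer touches the parquet file at all.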

version: 2

sources:
  - name: s3
    schema: energy
    tables:
      - name: energy
        identifier: s3://bucket/energy.parquet

The block above is my sources.yml file.
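
One possible way to point this source at the parquet file is sketched below. It assumes the installed dbt-duckdb version supports the `external_location` meta key described in the adapter's README; with that key, `{{ source('s3', 'energy') }}` should resolve to the file location instead of a schema and table inside the DuckDB database. The local path is taken from the profile above, and treating it as relative to the directory dbt is run from is also an assumption.

```yaml
version: 2

sources:
  - name: s3
    tables:
      - name: energy
        meta:
          # Local file for a first test (assumed relative to where dbt is run).
          # Once this works, the S3 URI "s3://bucket/energy.parquet" could be
          # used here instead; that needs the httpfs extension and S3 settings.
          external_location: "./sources/energy.parquet"
```

A model could then select from `{{ source('s3', 'energy') }}` as usual, and DuckDB would read the parquet file directly when the model runs.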


# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: transform_dbt
version: '1.0.0'
config-version: 2
vars:
  db_name: energy.parquet

# This setting configures which "profile" dbt uses for this project.
profile: 'transform_dbt'

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"  # directory which will store compiled SQL files
clean-targets:         # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"


# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models

# In this example config, we tell dbt to build all models in the example/
# directory as tables. These settings can be overridden in the individual model
# files using the `{{ config(...) }}` macro.
models:
  transform_dbt:
    example:
      materialized: table

And the block above is my dbt_project.yml. I am new to dbt, so any help is greatly appreciated. As I said, my goal is to read from my S3 bucket, but help with reading from a local file would already be very useful.
