Hey @Bijan! This is a really good question, and one that comes up frequently. Let me provide some background on the `dbt seed` command, which I hope will help answer your question.
## Mapping models
In the early days of dbt, we created “mapping” models that were used in the data modeling process. These models looked like:
```sql
-- models/mapping/country_codes.sql
{{ config(materialized='table') }}

select 'DE' as country_code, 'Germany' as country union all
select 'FR' as country_code, 'France' as country union all
select 'BE' as country_code, 'Belgium' as country union all
....
```
In downstream models, we joined to this model to map a country code to a country name. Obviously this wasn’t ideal, so we created `dbt seed` to help store (and version control) these mapping datasets in CSV files instead of SQL files. For this to work, we needed to build a general-purpose CSV loader, but we deliberately stopped short of building a full ETL tool.
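Stored as a seed, the same mapping dataset from the example above becomes a plain CSV file (the directory name for seed files depends on your dbt version and project configuration):

```csv
country_code,country
DE,Germany
FR,France
BE,Belgium
```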
The `dbt seed` command works by:

- Inferring the schema of your CSV file
- Creating an empty table in the database
- Loading data into this table using an `insert into ...` statement
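A rough sketch of what these steps look like in SQL, using the country-codes example above (the schema name and column types are illustrative; the actual DDL varies by adapter):

```sql
-- Steps 1 and 2: infer the schema and create an empty table
create table analytics.country_codes (
    country_code varchar,
    country varchar
);

-- Step 3: load the rows with a plain insert statement
insert into analytics.country_codes (country_code, country) values
    ('DE', 'Germany'),
    ('FR', 'France'),
    ('BE', 'Belgium');
```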
This process works well for mapping datasets – a few hundred rows with a couple of string columns is no problem! Things work less well when you have complex datatypes (dates, booleans, mixed types, etc.). Beyond data type issues, most analytical databases just aren’t good at ingesting data with `insert into ...` statements.
Snowflake, BigQuery, and Redshift all recommend loading data from blob storage (like S3, GCS, etc.). The problems that you’re seeing are a direct result of things like table locks and query compilation overhead, both of which can be avoided by using native database data loading functionality, or an ETL tool that is purpose-built for this type of problem.
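For comparison, a native bulk load of the same file into Snowflake might look something like this (the stage and table names here are placeholders, not part of the original example):

```sql
-- Bulk load from an S3-backed stage instead of row-by-row inserts
copy into analytics.country_codes
from @my_s3_stage/country_codes.csv
file_format = (type = 'csv' skip_header = 1);
```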
## Should I load this CSV with `dbt seed`?
In the docs we note:

> `dbt seed` loads data from csv files into your data warehouse. Because these csv files are located in your dbt repository, they are version controlled and code reviewable. Thus, `dbt seed` is appropriate for loading static data which changes infrequently.
I think the version control aspect is important here – if you’d version control the file, then it’s probably appropriate to use `dbt seed` to load it into your warehouse. If the dataset doesn’t make sense to version control, then you’re probably better off using some other tool!
If you’re looking for a rule of thumb, I’d say a CSV file that’s up to ~1k lines long and less than a few kilobytes is probably a good candidate for use with the `dbt seed` command.
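One final note: once a dataset is loaded with `dbt seed`, downstream models can reference it with `ref()` just like any other model. A hypothetical example (the model and column names here are made up for illustration):

```sql
-- models/customers_with_country.sql (hypothetical model)
select
    c.customer_id,
    m.country
from {{ ref('customers') }} as c
left join {{ ref('country_codes') }} as m
    on c.country_code = m.country_code
```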
## Other resources for loading CSVs into a warehouse
If I missed any good resources, please link them below!