Dbt seed for large .csv files

Hi,

I have a few rather large but static files that I would like to upload using dbt seed.

More precisely, they are files of legacy data from systems not active anymore, and while it’s well possible to host them elsewhere, or write a script to upload them, or any of a myriad of other options, dbt seed just… seems so perfect to use at first glance.

Well, dbt seed seems to be an easy and quick way to get the data where it needs to be, except that dbt seed doesn’t seem to like large data files… It starts and says “Running with dbt=0.13.0” and nothing else happens*.

So… I suppose my question is why does it not work file files in excess of a few MB? (the largest file in question is 33MB in my case)

*I might add that

  1. if I am having some patience, it loads eventually (and once it starts the first model, it loads quickly, too)
  2. At least twice I had had some patience, but not enough, and CTRL+c’d… except that my action appeared to trigger the start of dbt seed uploading the data? Might have been coincidences, but the timing was spot on in both those cases…

What database platform are you using? In Snowflake we actually put these files in S3, point an external stage to them, and write code to create the stage for each dbt run and literally query the stage and import the records. Works in 10 or 20 seconds for 50 or 100k records. You do lose the ability to version the files in git directly integrated with dbt, but versioning csv files that are that large is a bit odd anyway.

Hey @Bijan! This is a really good question, and one that comes up frequently. Let me provide some background on the dbt seed command which I hope will help answer your question.

Mapping models

In the early days of dbt, we created “mapping” models that were used in the data modeling process. These models looked like:

-- models/mapping/country_codes.sql

{{ config(materialized='table') }}

select 'DE' as country_code, 'Germany' as country union all
select 'FR' as country_code, 'France' as country union all
select 'BE' as country_code, 'Belgium' as country union all
....

In downstream models, we joined to this model in order to map a country code to a country name. Obviously this wasn’t ideal, so we created dbt seed to help store (and version control) these mapping datasets in csv files instead of sql files. In order for this to work, we needed to make a general-purpose csv loader, but we definitely stopped short of building a full ETL tool.

The dbt seed command works by

  1. Inferring the schema of your CSV file
  2. Creating an empty table in the database
  3. Loading data into this table using an insert into ... statement

This process works well for mapping datasets – a few hundred rows with a couple of string columns is no problem! Things work less-well when you have complex datatypes (dates, booleans, mixed types, etc). Beyond data type issues, most analytical databases just aren’t good at inserting data with insert into... statements.

Snowflake, BigQuery, and Redshift all recommend loading data from a blob filestore (like S3, GCS, etc). The problems that you’re seeing are a direct result of things like table locks and query compilation, both of which can be avoided by using native database data loading functionality, or some ETL tool that is purpose-built for solving this type of problem.

Should I load this csv with dbt seed?

In the docs we note:

dbt seed loads data from csv files into your data warehouse. Because these csv files are located in your dbt repository, they are version controlled and code reviewable. Thus, dbt seed is appropriate for loading static data which changes infrequently.

I think the version control aspect is important here – if you’d version control the file, then it’s probably appropriate to use dbt seed to load it into your warehouse. If the dataset doesn’t make sense to version control, then you’re probably better off using some other tool!

If you’re looking for a rule of thumb, I’d say a csv file that’s up to ~1k lines long and less than a few kilobytes is probably a good candidate for use with the dbt seed command.

Other resources for loading CSVs into a warehouse

If I missed any good resources, please link them below!

2 Likes