Each month, we have a number of Excel files with ~20-60K rows that need to be uploaded to the database as tables, which are then used in downstream data models. They require some light data cleaning before loading, which is currently done in Python but could be done in SQL once the files are staged. I’m conflicted as to whether to use seeds for this:
Reasons against using seeds
- These files are larger than the ~1,000 rows recommended in the seeds docs
- While these Excel files can be saved as CSVs before uploading, this sometimes leads to loss of column data types
- These files are updated monthly, which might not be “infrequent” as recommended in the seeds docs
- These files need to be renamed to give the proper table name, which loses the “lineage” of the original file name that created the data table
Reasons for using seeds
- It would be great to version control the data files that created the tables, which seeds would allow us to do
If seeds are not the appropriate method, what would be the recommended workflow for loading Excel files into the database?
Really appreciate any thoughts on this! Thank you very much!
Hey @nancy.chelaru, this is a great question and you’re definitely thinking about it in the right way!
I agree that you shouldn’t use seeds. When you say you need to version control the data files, what part of the version control process is most important to you?
- If you need to track changes to the file over time, using dbt snapshots might be a good approach.
- If you need to be able to block changes to the file, subject to a process similar to code review, we’d have to think about it a bit more and it would depend on the specific tool you ultimately choose.
Which database platform and ingestion tools do you already use? Ingestion tools like Fivetran and Stitch can ingest data from a Google Sheet or a folder in Google Drive/Box/Dropbox, or upload a file directly. They tend to require CSV files, so the column-datatype issue you raised remains, although Fivetran’s browser uploader supports changing the column type inline.
If you’re using Snowflake, it also has a web-based loader; BigQuery can read from Google Drive or Google Cloud Storage; Redshift can copy data over from S3 (you’ll probably want to use our Redshift package to help with this).
That’s a lot of options and the advice is pretty generic - if you let me know what tools you’re already using, we can get a bit more specific.
Since you say that you use Python for data cleaning, it sounds like you already have a working way to get the Excel data into Python. How do you do this in Python - do you use Pandas, PyArrow, or something else? You can probably just write the data pretty directly to your data warehouse from your Python code.
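To make that concrete, here’s a minimal sketch of the Python path, assuming you use pandas. The function name and the specific cleaning steps are illustrative, not your actual pipeline; the commented-out lines show where the Excel read and the warehouse write would go (reading the `.xlsx` directly avoids the CSV round-trip that drops column types, and an explicit `dtype` map pins anything ambiguous).

```python
import pandas as pd

def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Light cleaning: normalize column names, trim whitespace in string columns."""
    df = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.strip()
    return df

# Reading the Excel file directly preserves dtypes better than a CSV round-trip;
# an explicit dtype map pins columns pandas might guess wrong (e.g. leading zeros).
# df = pd.read_excel("monthly_report.xlsx", dtype={"account_id": "string"})
# df = clean_frame(df)
# Then write to the warehouse, e.g. via a SQLAlchemy engine:
# df.to_sql("stg_monthly_report", engine, if_exists="replace", index=False)
```

From there, dbt sources and models can take over the heavier transformation in SQL.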
I have a Power Automate Desktop solution that will split large files and load them to a staging table.
From Excel, save the data as a CSV. In the Power Automate Desktop (PAD) CSV Upload dialog, you pick the file, set a few options (split file, create format, create stage, create table), then upload. This creates all the objects in Snowflake and uploads the data very quickly. Then call or write another script to do the ETL in Snowflake.
It uses the Snowflake CLI from a Windows machine.
paddelia1/csvToSnowflake: CSV to Snowflake in 3 steps and 3 seconds (github.com)
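For anyone who'd rather drive those same three steps (stage, upload, copy) from Python instead of PAD, here’s a hypothetical sketch. The function and all object names are placeholders, not the tool’s actual code; the statements mirror Snowflake’s standard `CREATE STAGE` / `PUT` / `COPY INTO` flow.

```python
def snowflake_load_statements(csv_path: str, table: str, stage: str) -> list[str]:
    """Return the SQL a loader would run: create a stage, PUT the file, COPY INTO the table.

    All names are illustrative placeholders.
    """
    return [
        f"CREATE STAGE IF NOT EXISTS {stage} FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)",
        f"PUT file://{csv_path} @{stage} AUTO_COMPRESS = TRUE",
        f"COPY INTO {table} FROM @{stage}",
    ]

# These would be executed with the snowflake-connector-python package (or SnowSQL), e.g.:
# with snowflake.connector.connect(**credentials) as conn:
#     for stmt in snowflake_load_statements("data.csv", "stg_monthly", "my_stage"):
#         conn.cursor().execute(stmt)
```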