Each month, we have a number of Excel files with ~20-60K rows that need to be uploaded to the database as tables, which are then used in downstream data models. They require some light data cleaning before loading, which is currently done in Python but could be done in SQL once the files are staged. I’m conflicted as to whether to use seeds for this:
Reasons against using seeds
- These files are larger than the ~1,000 rows recommended in the seeds docs
- While these Excel files can be saved as CSVs before uploading, this sometimes leads to loss of column data types
- These files are updated monthly, which might not be “infrequent” as recommended in the seeds docs
- These files need to be renamed to give the proper table name, which loses the “lineage” of the original file name that created the data table
Reasons for using seeds
- It would be great to version control the data files that created the tables, which seeds would allow us to do
If seeds are not the appropriate method, what would be the recommended workflow for loading Excel files into the database?
Really appreciate any thoughts on this! Thank you very much!
Hey @nancy.chelaru, this is a great question and you’re definitely thinking about it in the right way!
I agree that you shouldn’t use seeds. When you say you need to version control the data files, what part of the version control process is most important to you?
- If you need to track changes to the file over time, using dbt snapshots might be a good approach.
- If you need to be able to block changes to the file, subject to a process similar to code review, we’d have to think about it a bit more and it would depend on the specific tool you ultimately choose.
Which database platform and ingestion tools do you already use? Ingestion tools like Fivetran and Stitch can ingest data from a Google Sheet or a folder in Google Drive/Box/Dropbox, or upload a file directly. They tend to require CSV files, so the column-datatype issue you raised remains, although Fivetran’s browser uploader supports changing the column type inline.
If you’re using Snowflake, it also has a web-based loader; BigQuery can read from Google Drive or Google Cloud Storage; Redshift can copy data over from S3 (you’ll probably want to use our Redshift package to help with this).
That’s a lot of options and the advice is pretty generic - if you let me know what tools you’re already using, we can get a bit more specific.
Since you say that you use Python for data cleaning, it sounds like you already have a working way to get the Excel data into Python. How do you do this in Python - do you use Pandas, PyArrow, or something else? You can probably just write the data pretty directly to your data warehouse from your Python code.
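To make that concrete, here’s a minimal sketch of the Python path, assuming you use pandas. The function name and the specific cleaning steps are illustrative, not your actual pipeline; the commented-out lines show where the Excel read and the warehouse write would go (reading the `.xlsx` directly avoids the CSV round-trip that drops column types, and an explicit `dtype` map pins anything ambiguous).

```python
import pandas as pd

def clean_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Light cleaning: normalize column names, trim whitespace in string columns."""
    df = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.strip()
    return df

# Reading the Excel file directly preserves dtypes better than a CSV round-trip;
# an explicit dtype map pins columns pandas might guess wrong (e.g. leading zeros).
# df = pd.read_excel("monthly_report.xlsx", dtype={"account_id": "string"})
# df = clean_frame(df)
# Then write to the warehouse, e.g. via a SQLAlchemy engine:
# df.to_sql("stg_monthly_report", engine, if_exists="replace", index=False)
```

From there, dbt sources and models can take over the heavier transformation in SQL.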
I have a Power Automate Desktop solution that will split large files and load them to a staging table.
From Excel, save the data as a CSV. In the Power Automate Desktop (PAD) CSV Upload dialog, you pick the file, set a few options (split file, create format, create stage, create table), then upload. This creates all the objects in Snowflake and uploads the data very quickly. Then call or write another script to do the ETL in Snowflake.
It uses the Snowflake CLI from a Windows machine.
paddelia1/csvToSnowflake: CSV to Snowflake in 3 steps and 3 seconds (github.com)
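For anyone who'd rather drive those same three steps (stage, upload, copy) from Python instead of PAD, here’s a hypothetical sketch. The function and all object names are placeholders, not the tool’s actual code; the statements mirror Snowflake’s standard `CREATE STAGE` / `PUT` / `COPY INTO` flow.

```python
def snowflake_load_statements(csv_path: str, table: str, stage: str) -> list[str]:
    """Return the SQL a loader would run: create a stage, PUT the file, COPY INTO the table.

    All names are illustrative placeholders.
    """
    return [
        f"CREATE STAGE IF NOT EXISTS {stage} FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)",
        f"PUT file://{csv_path} @{stage} AUTO_COMPRESS = TRUE",
        f"COPY INTO {table} FROM @{stage}",
    ]

# These would be executed with the snowflake-connector-python package (or SnowSQL), e.g.:
# with snowflake.connector.connect(**credentials) as conn:
#     for stmt in snowflake_load_statements("data.csv", "stg_monthly", "my_stage"):
#         conn.cursor().execute(stmt)
```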