This post is adapted from a more in-depth Medium article I wrote here.
I occasionally need to grant a non-technical colleague the ability to input information into our data warehouse on an ad-hoc basis. For example, our customer service team at Milk Bar maintains a list of special wedding cake orders in Google Sheets that we need to collect data from for downstream calculations.
This is a tricky problem: I’m not a web developer, and standing up and maintaining a web form or a database for something like this is overkill. I’m also not interested in paying for a separate product to do this for me. What if we could sync their Google Sheet directly to a table in our data warehouse?
I know what you’re thinking. What about data governance? What’s to stop someone from accidentally deleting a column or entering data of the wrong type? How do I know that my colleagues are entering data according to my expectations?
In my experience, with the right guardrails and protections, Google Sheets can be a reliable and flexible component of your data stack. Here are my tips for setting one up properly as a data source.
Securing your Google Sheet
Start by locking down your sheet so that it can only the right people can view and edit it. I like to check the Advanced option “Prevent editors from changing access or adding new people” to prevent inadvertent link sharing by my end users.
Protected sheets and ranges
You can also lock down tabs or ranges within your sheets so that only you or authorized users can edit them. You can even set an option to warn users that editing a cell in a specific range may have consequences and they should proceed with caution.
You might have some sheets that contain necessary information for formulas or calculations that don’t need to be edited by any of your users. If that’s the case, protect these sheets and ranges so only you can modify them.
Data validation (accessed via Data > Data Validation) is the most important requirement for Google Sheets usage in the data stack. With data validation, you can set conditions on ranges within your sheet that must be fulfilled for data entry to be accepted. Data validation is how you can essentially enforce schema control on a Google Sheet.
For instance, let’s say you have a currency column that must have a positive value. With data validation, Google Sheets will reject any negative numbers entered into the validated range.
There are a number of useful data validation criteria to choose from:
- List from a range. This criterion requires any entry in our cell range to match the value of a cell in a validation range. We often use it to force the user to choose from a list of predefined values or categories. As a handy bonus, you can add a drop-down list to each cell in your range so your user can pick a valid value directly. I like to keep a separate, fully protected sheet (only I can edit it) for my validation range.
- List of items. Similar to List from a range, except you provide the valid values directly as a comma-separated list. This is great for validating entries against a short list of static values.
- Number. This criterion is used to check that entries fall within a numeric range. We can prevent negative numbers or force entry to be less than a known maximum.
- Text. We can test entries to see if they contain or do not contain certain sub-strings. Perhaps you want to make sure your users are entering a proper prefix when they copy over invoice numbers. There are also a couple of useful options at the bottom of the drop-down: is valid email and is valid URL. These are both common validation needs that would require regex or something similarly complex to confirm.
- Date. We can validate that entries are valid dates and fall within a required range. Because of the multitude of date formats, dates are notoriously messy. This validation criterion is a vital if your users are entering dates anywhere in your sheet.
Custom formula is. This is the real game-changer for data validation. You can write any custom formula that returns a True or False value and Google will validate entries against it. You can implement some really complex validation logic this way. For example, use the
REGEXPMATCHformula to confirm that entries match a more complicated set of string logic. Google Sheets also doesn’t allow you to apply multiple validation criteria to the same cell, so this option allows us to combine criteria into a single formula.
- Checkbox. You can replace the cell(s) in your cell range with a checkbox that returns a custom value when it is checked and unchecked. True and False will probably account for 99% of the use cases here, but it’s nice to have the flexibility to represent something else if you want to.
Once you’ve thoroughly validated the data in your Google Sheet, your sheet will be just as good as a custom web form, if not better. You can be very confident in the quality of your data.
Optional: Using an IMPORTRANGE intermediate sheet
Instead of syncing directly from the user-accessed sheet to the data warehouse, I like to set up an intermediate import sheet that only I can edit. This intermediate sheet provides me with a bit more control. I configure this sheet to copy all the data from the user-accessed sheet using the
IMPORTRANGE formula. This sheet is where I can make any necessary adjustments before I pass it to my data warehouse.
- For example, I might only import a specific range of the user-accessed sheet, so errant data outside of that range doesn’t break anything downstream.
- I could also union two user-accessed sheets (maybe there are multiple sheets, one for each year) into a single sheet with a vertical array.
- I can even use the
QUERYformula to filter the rows in the user-accessed sheet to remove blanks.
Alternative: Use a Google Form + Google Sheet instead
If users don’t need to edit data after they enter it, you can set up an append-only system using a Google Form. Google Forms has a native integration with Google Sheets, which makes this setup dead simple for append-only needs. Google Forms will append a row for each response to your form in a timestamped Google Sheet. If you add questions, the sheet will expand to match your form.
Ensure clean data by choosing the appropriate question types for your form.
Syncing to the data warehouse
Loading a Google Sheet into a data warehouse is a common problem, so there are a lot of potential solutions.
If you’re a Google BigQuery user, the easiest approach, by far , is to set up the Google Sheet as an external table. BigQuery’s documentation describes how to do this in more detail, but all you need to do is define a schema for the table and you can query the first tab of your sheet directly from BigQuery.
If you’re using Snowflake, Redshift, or another data warehouse, you can take your pick from these other options.
- Most data loading services have an integration that handles loading Google Sheets. See Fivetran’s documentation and Alooma’s.
- Stitch’s Google Sheets integration. Data loading service Stitch has a guide on how to set up some Apps Script code in Google Sheets that will send data to Stitch’s Import API. Stitch will then load that data into your data warehouse of choice. With this approach, you have to sync the data manually by clicking “sync to Stitch,” which may be a blessing or a curse depending on your needs. Note that you will have to sign up with Stitch if you’re not already a customer, but syncing a few sheets isn’t likely to exceed their 5 million rows/month free tier.
- sheetload. This is a Python tool created by the data team at GitLab to load local CSVs and Google Sheets into a data warehouse. At the time of writing, the tool supports Postgres and Snowflake.
- Ro’s Google Sheet integration. The data team at Ro open-sourced their custom approach to loading Google Sheets data into Snowflake. Note that their approach truncates and replaces the data table on every load, so it won’t preserve deleted or modified rows.
- You could also look into rolling your own modular approach with Zapier, webhooks, and a serverless function like AWS Lambda. You could trigger your Zap on row creation in your sheet which would send the data, via a webhook, to a serverless function set up to receive the payload as an endpoint. This approach becomes more challenging if you need to capture updates and deletions.
Do you use Google Sheets in your data stack? If you do, add a comment with your approach and any tips or pitfalls you’ve encountered!