We need a way to capture occasional data points - for now, future sales and marketing events - so we can incorporate them into our forecasting models.
This sort of data doesn’t get updated often, but when it does, the sales & marketing teams need a way to enter their event info somewhere (e.g., “Marketing Event, June 27, 2019, Los Angeles, CA”) so I can pull that into our data warehouse (Snowflake).
We use Stitch for data loading. I thought about using a Google Form, and then pulling via Stitch’s hacky Google Sheets integration, but I’ve tried that before and it didn’t work very well.
I think a solution involving S3 + Snowflake/Snowpipe could work, but I’m not seeing Google Forms-like products that would save or automatically export data to S3. Any ideas/suggestions?
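To make the S3 idea concrete, the Snowflake side of what I'm picturing is roughly the sketch below - bucket, schema, table, and column names are all made up:

```sql
-- Sketch only: bucket, schema, and table names are made up.
-- An auto-ingest pipe also needs S3 event notifications wired to the
-- pipe's SQS queue, which isn't shown here.
create or replace stage raw.marketing_events_stage
  url = 's3://my-bucket/marketing-events/'
  credentials = (aws_key_id = '<key>' aws_secret_key = '<secret>')
  file_format = (type = csv skip_header = 1);

create or replace pipe raw.marketing_events_pipe auto_ingest = true as
  copy into raw.marketing_events
  from @raw.marketing_events_stage;
```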
(Also: we're implementing Salesforce sometime in the near future, so maybe the teams could enter events in a Salesforce object or create calendar events, which would be easier to pull in via the Stitch + Salesforce integration. But we need a solution in the meantime, and I imagine this is a problem most data teams have!)
We've used Stitch's Google Sheets integration dozens of times across dozens of clients - it's a core part of our workflow! We've found it to be quite reliable. What problems are you running into with it?
INTERESTING! Maybe I did something wrong. It’s been a while since I last tried this, so let me check my notes about what happened…
Okay, it had something to do with unique keys - I needed to combine them using an md5 function, and the sync kept hanging and then creating multiple rows. I filed an issue about it and found an alternative function, but I don't think I ever got it to work well. This was the last error message I got from Stitch about this integration:
NULL result in a non-nullable column
File '229649-2644895918347462805.tab.gz', line 1, character 1
Row 1, column "STAGING_SHEET1_13_LOADER_SNOW2_11350_F3EF50D4_3F16_4CD1_ACA9_E860AC2D56AB"["VI_NUMBER__ST":20]
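For context, the kind of key I was trying to build was just a hash of the columns that together identify a row - something along these lines, with made-up column names:

```sql
-- Illustration only: table and column names are placeholders for
-- whatever uniquely identifies a row in the sheet.
select
    *,
    md5(
        coalesce(event_type, '') || '|' ||
        coalesce(event_date::varchar, '') || '|' ||
        coalesce(event_location, '')
    ) as unique_key
from raw.marketing_events;
```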
I’m thinking it’s the spreadsheet that was the problem, so I’ll give this another shot with a cleaner spreadsheet.
Also, @tristan, I wasn’t sure if the sync, once set up, is continuous or if I have to manually kick off a sync job by going to the Stitch menu in Google Sheets and selecting “Sync with Stitch.” Would you know?
You have to manually click "Sync with Stitch" when you want to push new data. I actually prefer that process: as the owner of the spreadsheet, I want to control when its data gets pushed over to my warehouse, and there are many intermediate states where I wouldn't want that data pushed over! The trade-off is that you can't, for example, create automatically updating spreadsheet functions and have them pulled in on a daily schedule.
@tristan, it's working really well (which confirms my suspicion that when I tried to set it up last fall, the problem was a legacy, badly formatted spreadsheet). I've noticed that, to be safe, it's best to filter to the rows containing the max _SDC_RECEIVED_AT or _SDC_SEQUENCE value present in the table. Is that how you set it up as well?
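In SQL terms, what I mean is roughly this (table name made up):

```sql
-- Per the approach above: keep only the rows carrying the max
-- _sdc_sequence value in the table, i.e. the latest sync.
select *
from raw.marketing_events
where _sdc_sequence = (select max(_sdc_sequence) from raw.marketing_events);
```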
We use Zapier to push new or updated spreadsheets from a Google Drive folder to S3. Combined with a template that enforces strict data validation, this seemed to be the best way to get this data into our DW. We then use a Dataiku job to merge the docs from S3 and write the output to an S3 destination, which is set up as an external table in Snowflake.
It has a slightly higher setup cost, but it's fully automated and lets our sales team take ownership of their data: if they want to update a file, they just edit it and the new version comes through via Zapier and replaces the old one. This is also handy because we aren't dealing with a fixed handful of sources - new files are added regularly and need to come through automatically.
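For anyone wanting to try this, the external table piece would look roughly like the sketch below - stage, table, and column names plus the file format are all made up, and the stage would point at wherever the Dataiku output lands in S3:

```sql
-- Rough sketch only; auto_refresh additionally requires S3 event
-- notifications to be configured, which isn't shown here.
create or replace external table raw.marketing_events_ext (
    event_type     varchar as (value:c1::varchar),
    event_date     date    as (value:c2::date),
    event_location varchar as (value:c3::varchar)
)
with location = @raw.merged_events_stage
auto_refresh = true
file_format = (type = csv skip_header = 1);
```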
Yes, I agree with you. On the point about implementing Salesforce in the near future - having the teams enter events in a Salesforce object or as calendar events and pulling those in via the Stitch + Salesforce integration - could you explain how that would help us here?