Pulling in slow-changing, manually-entered data

So we need to capture occasional data points - for now, future sales and marketing events - so we can incorporate them into our forecasting models.

This sort of data doesn’t get updated often, but when it does, the sales & marketing teams need a way to enter their event info somewhere (e.g., “Marketing Event, June 27, 2019, Los Angeles, CA”) so I can pull that into our data warehouse (Snowflake).

We use Stitch for data loading. I thought about using a Google Form, and then pulling via Stitch’s hacky Google Sheets integration, but I’ve tried that before and it didn’t work very well.

I think a solution involving S3 + Snowflake/Snowpipe could work, but I’m not seeing Google Forms-like products that would save or automatically export data to S3. Any ideas/suggestions?
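
For what it's worth, here's roughly what I imagine the Snowflake side would look like if a form tool could drop files into a bucket - a minimal sketch, assuming CSV exports; the bucket, stage, table, and pipe names are all made up:

```sql
-- Minimal Snowpipe sketch (all names are placeholders); assumes CSV exports land in S3
create or replace stage marketing_events_stage
  url = 's3://my-bucket/marketing-events/'                   -- hypothetical bucket/prefix
  credentials = (aws_key_id = '...' aws_secret_key = '...')  -- or use a storage integration
  file_format = (type = 'csv' skip_header = 1);

create or replace table raw.marketing_events (
  event_type     varchar,
  event_date     date,
  event_location varchar
);

-- auto_ingest loads new files as S3 event notifications arrive
create or replace pipe marketing_events_pipe auto_ingest = true as
  copy into raw.marketing_events
  from @marketing_events_stage;
```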

(Also: we’re implementing Salesforce sometime in the near future, so maybe they could enter stuff in a Salesforce object or create a calendar event or something, and that would be easier to pull in via the Stitch + Salesforce integration. But we need a solution in the meantime, and I imagine this is a problem most data teams have!)


We’ve used Stitch’s Google Sheets integration dozens of times across dozens of clients; it’s a core part of our workflow! We’ve found it to be quite reliable. What problems are you running into with it?

We have a custom Google Sheet / CSV uploader that works with Snowflake. Docs are here: https://gitlab.com/gitlab-data/analytics/tree/master/extract/sheetload


INTERESTING! :eyes: Maybe I did something wrong. It’s been a while since I last tried this, so let me check my notes about what happened…

Okay, it had something to do with unique keys - I needed to combine them using an MD5 function, and it kept hanging and then creating multiple rows. I filed an issue about it and found an alternative function, but I don’t think I ever got it to work well. This was the last error message I got from Stitch for this integration:

    NULL result in a non-nullable column
      File '229649-2644895918347462805.tab.gz', line 1, character 1
      Row 1, column "STAGING_SHEET1_13_LOADER_SNOW2_11350_F3EF50D4_3F16_4CD1_ACA9_E860AC2D56AB"["VI_NUMBER__ST":20]
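
(For context, the idea was to combine the identifying columns into a single MD5 hash to use as the unique key. In Snowflake SQL terms that's roughly the below - column names are made up, just to illustrate the shape of it.)

```sql
-- Rough illustration of the combined MD5 key idea; column and table names are placeholders
select
  md5(coalesce(event_type, '') || '|' || coalesce(to_varchar(event_date), '')) as unique_key,
  *
from sheet_rows;
```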

I’m thinking it’s the spreadsheet that was the problem, so I’ll give this another shot with a cleaner spreadsheet.

Also, @tristan, I wasn’t sure if the sync, once set up, is continuous or if I have to manually kick off a sync job by going to the Stitch menu in Google Sheets and selecting “Sync with Stitch.” Would you know?

I also filed an issue about that, but got no response.

You have to manually click “Sync with Stitch” when you want to push new data. I actually prefer that process: as the owner of the spreadsheet, I want to control when its data gets pushed over to my warehouse. There are many intermediate states where I wouldn’t want that data pushed over! The trade-off is that you can’t, for example, create automatically updating spreadsheet formulas and just have their results pulled in once a day.

That’s what I was hoping for, but I can live with manual syncing. :slight_smile: I’ve set up a test and it is syncing correctly to Stitch. Thanks!!

I am also still looking for suggestions on snazzy online form tools that can save results to S3, though - if anyone knows of one, that’d be good to know…

We use Fivetran’s GSheets connector for this and it works amazingly well. Totally worth the $3,000/year connector fee in my opinion.


@tristan, it’s working really well (which confirms my suspicion that when I tried to set it up last fall, the problem was a legacy, badly-formatted spreadsheet). I notice that, to be safest, it’s best to filter to the rows containing the max _SDC_RECEIVED_AT or _SDC_SEQUENCE value present in the table. Is that how you set it up as well?
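
For reference, this is roughly what I mean - a minimal sketch against a made-up table name, assuming each sync pushes the full sheet:

```sql
-- Keep only the rows from the most recent sync (table name is a placeholder)
select *
from raw.google_sheets.events_sheet
qualify _sdc_sequence = max(_sdc_sequence) over ();
```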

If you’re on BQ, that’ll be how Stitch sends the data. Stitch treats BQ as an append-only data store for all integrations, as far as I know.

We use Zapier to push new or updated spreadsheets in a Google Drive folder to S3. Combined with a template that enforces strict data validation, this seemed to be the best way to get data into our DW. We then use a Dataiku job to merge the files from S3 and write the output to an S3 destination, which is set up as an external table in Snowflake.

It has a slightly higher setup cost but is fully automated and lets our sales team take ownership of their data (if they want to update a file, they just edit it, and the new version comes through via Zapier and replaces the old one). This is also handy because we aren’t dealing with just a few fixed sources - new files are added regularly and need to come through automatically.
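
If it’s useful, the Snowflake end of that looks roughly like the sketch below - all names are placeholders, and the stage would also need credentials or a storage integration, omitted here:

```sql
-- External table over the merged S3 output; all names are placeholders
create or replace stage merged_events_stage
  url = 's3://our-bucket/merged-events/'
  file_format = (type = 'csv' skip_header = 1);

create or replace external table merged_events (
  event_type     varchar as (value:c1::varchar),
  event_date     date    as (value:c2::date),
  event_location varchar as (value:c3::varchar)
)
with location = @merged_events_stage
auto_refresh = true
file_format = (type = 'csv' skip_header = 1);
```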

Hello Dan,

Yes, you’re right, and I agree with you about the Salesforce option - entering events in a Salesforce object or as calendar events, then pulling them in via the Stitch + Salesforce integration once it’s implemented. Can you please explain how this might help us?

Regards,
Srija.