We need a way to capture occasional data points - for now, future sales and marketing events - so we can incorporate them into our forecasting models.
This sort of data doesn’t get updated often, but when it does, the sales & marketing teams need a way to enter their event info somewhere (e.g., “Marketing Event, June 27, 2019, Los Angeles, CA”) so I can pull that into our data warehouse (Snowflake).
We use Stitch for data loading. I thought about using a Google Form, and then pulling via Stitch’s hacky Google Sheets integration, but I’ve tried that before and it didn’t work very well.
I think a solution involving S3 + Snowflake/Snowpipe could work, but I’m not seeing Google Forms-like products that would save or automatically export data to S3. Any ideas/suggestions?
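To make the S3 idea concrete, the Snowflake side of what I'm picturing is roughly the sketch below - bucket, schema, table, and column names are all made up:

```sql
-- Sketch only: bucket, schema, and table names are made up.
-- An auto-ingest pipe also needs S3 event notifications wired to the
-- pipe's SQS queue, which isn't shown here.
create or replace stage raw.marketing_events_stage
  url = 's3://my-bucket/marketing-events/'
  credentials = (aws_key_id = '<key>' aws_secret_key = '<secret>')
  file_format = (type = csv skip_header = 1);

create or replace pipe raw.marketing_events_pipe auto_ingest = true as
  copy into raw.marketing_events
  from @raw.marketing_events_stage;
```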
(Also: we're implementing Salesforce sometime in the near future, so maybe the teams could enter events in a Salesforce object or create calendar events, which would be easier to pull in via the Stitch + Salesforce integration. But we need a solution in the meantime, and I imagine this is a problem most data teams have!)
We've used Stitch's Google Sheets integration dozens of times across dozens of clients - it's a core part of our workflow! We've found it to be quite reliable. What problems are you running into with it?
INTERESTING! Maybe I did something wrong. It’s been a while since I last tried this, so let me check my notes about what happened…
Okay, it had something to do with unique keys - I needed to combine them using an md5 function, and the sync kept hanging and then creating multiple rows. I filed an issue about it and found an alternative function, but I don't think I ever got it to work well. This was the last error message I got from Stitch about this integration:
NULL result in a non-nullable column
File '229649-2644895918347462805.tab.gz', line 1, character 1
Row 1, column "STAGING_SHEET1_13_LOADER_SNOW2_11350_F3EF50D4_3F16_4CD1_ACA9_E860AC2D56AB"["VI_NUMBER__ST":20]
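For context, the kind of key I was trying to build was just a hash of the columns that together identify a row - something along these lines, with made-up column names:

```sql
-- Illustration only: table and column names are placeholders for
-- whatever uniquely identifies a row in the sheet.
select
    *,
    md5(
        coalesce(event_type, '') || '|' ||
        coalesce(event_date::varchar, '') || '|' ||
        coalesce(event_location, '')
    ) as unique_key
from raw.marketing_events;
```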
I’m thinking it’s the spreadsheet that was the problem, so I’ll give this another shot with a cleaner spreadsheet.
Also, @tristan, I wasn’t sure if the sync, once set up, is continuous or if I have to manually kick off a sync job by going to the Stitch menu in Google Sheets and selecting “Sync with Stitch.” Would you know?
You have to manually click "Sync with Stitch" when you want to push new data. I actually prefer that process: as the owner of the spreadsheet, I want to control when its data gets pushed over to my warehouse, and there are many intermediate states where I wouldn't want that data pushed over! The trade-off is that you can't, for example, create automatically updating spreadsheet functions and have them pulled in on a daily schedule.
@tristan, it's working really well (which confirms my suspicion that when I tried to set it up last fall, the problem was a legacy, badly formatted spreadsheet). I've noticed that, to be safe, it's best to filter to the rows containing the max _SDC_RECEIVED_AT or _SDC_SEQUENCE value present in the table. Is that how you set it up as well?
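In SQL terms, what I mean is roughly this (table name made up):

```sql
-- Per the approach above: keep only the rows carrying the max
-- _sdc_sequence value in the table, i.e. the latest sync.
select *
from raw.marketing_events
where _sdc_sequence = (select max(_sdc_sequence) from raw.marketing_events);
```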
We use Zapier to push new or updated spreadsheets from a Google Drive folder to S3. Combined with a template that enforces strict data validation, this seemed to be the best way to get this data into our DW. We then use a Dataiku job to merge the docs from S3 and write the output to an S3 destination, which is set up as an external table in Snowflake.
It has a slightly higher setup cost, but it's fully automated and lets our sales team take ownership of their data: if they want to update a file, they just edit it and the new version comes through via Zapier and replaces the old one. This is also handy because we aren't dealing with a fixed handful of sources - new files are added regularly and need to come through automatically.
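For anyone wanting to try this, the external table piece would look roughly like the sketch below - stage, table, and column names plus the file format are all made up, and the stage would point at wherever the Dataiku output lands in S3:

```sql
-- Rough sketch only; auto_refresh additionally requires S3 event
-- notifications to be configured, which isn't shown here.
create or replace external table raw.marketing_events_ext (
    event_type     varchar as (value:c1::varchar),
    event_date     date    as (value:c2::date),
    event_location varchar as (value:c3::varchar)
)
with location = @raw.merged_events_stage
auto_refresh = true
file_format = (type = csv skip_header = 1);
```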
Yes, I agree with you. On the point about implementing Salesforce in the near future - having the teams enter events in a Salesforce object or as calendar events and pulling those in via the Stitch + Salesforce integration - could you explain how that would help us here?