Learnings from optimising 22 of our most expensive Snowflake pipelines
We recently spent a sprint focused on reducing our Snowflake costs. During this sprint, we investigated 22 of our most expensive pipelines (in terms of Snowflake costs), one by one. In total we merged 56 changes, and in this post we’ll be laying out the optimisations that worked best for us.
Most of these changes were just common sense and didn’t involve any advanced data engineering techniques. Still, we’re always making mistakes (and that’s okay!) and hopefully this post will help readers avoid a few of the pitfalls we encountered.
⚠️ Medium is now 14 years old. Our team has inherited a tech stack with a long history, some flaws, and technical debt. Our approach to this problem was pragmatic: we’re not trying to suggest big pipeline redesigns to reach a perfect state, but rather to take our tech stack in its current state and figure out the best options to cut costs quickly. Our product evolves, and as we create new things we can remove some old ones, which is why we don’t need to spend too much time refactoring old pipelines. We know we’ll rebuild those from scratch at some point, with new requirements, better designs, and more consideration for costs and scale.
Do we need this?
In a legacy system, there are often some old services that just “sound deprecated.” For example, we have a pipeline called medium_opportunities, which I had never heard of in 3 years at Medium. After all, it was last modified in 2020… For each of those suspect pipelines we went through a few questions:
Do we need this at all?? Through our investigation, we did find a few pipelines that were costing us more than $1k/month and that were used by… nothing.
A lot of our pipelines simply run a Snowflake query and overwrite a Snowflake table with the results. For those, the question is: Do we need all of the columns? For the pipelines we couldn’t delete, we identified the columns that were never used by downstream services and started removing them. In some cases, this removed the most expensive bottlenecks and cut costs massively.
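One way to find candidates for removal, if your Snowflake edition exposes it, is the ACCESS_HISTORY view, which records which columns each query actually read. Here’s a rough sketch; the fully qualified table name is just an illustration, not our real schema:

```sql
-- Sketch: which columns of a table were actually read in the last 90 days?
-- Columns that never show up in the output are candidates for removal.
-- ACCESS_HISTORY is an Enterprise Edition (and above) view; table name is illustrative.
SELECT
    c.value:"columnName"::string AS column_name,
    COUNT(DISTINCT ah.query_id)  AS queries_reading_it
FROM snowflake.account_usage.access_history AS ah,
     LATERAL FLATTEN(input => ah.direct_objects_accessed) AS o,
     LATERAL FLATTEN(input => o.value:"columns")          AS c
WHERE o.value:"objectName"::string = 'ANALYTICS.PUBLIC.MEDIUM_OPPORTUNITIES'
  AND ah.query_start_time > DATEADD('day', -90, CURRENT_TIMESTAMP())
GROUP BY 1
ORDER BY 2 DESC;
```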
If it turns out the expensive part of your pipeline is needed for some feature, you should question whether that feature is really worth the cost, or whether there is a way to trade off some of the cost by downgrading the feature without impacting it too much. (Of course, there are situations where it’s just an expensive and necessary feature…)
Is the pipeline schedule aligned with our needs? In our investigation we were able to save a bunch of money just by moving some pipelines from running hourly to daily.
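When the schedule lives in a Snowflake task, that change is a couple of statements; the task name below is hypothetical, and the same idea applies in whatever scheduler you use:

```sql
-- Hypothetical task name; suspending first is the safe path before changing a task.
ALTER TASK reading_interests_refresh SUSPEND;
ALTER TASK reading_interests_refresh
    SET SCHEDULE = 'USING CRON 0 6 * * * UTC';  -- was hourly, now daily at 06:00 UTC
ALTER TASK reading_interests_refresh RESUME;
```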
An example:
A common workflow among our pipelines involves computing analytics data in Snowflake and exporting it to transactional SQL databases on a schedule. One such pipeline was running on a daily schedule to support a feature of our internal admin tool. Specifically, it gave some statistics on every user’s reading interests (which we sometimes use when users complain about their recommendations).
It turns out this was quite wasteful, since the feature wasn’t used daily by the small team who relies on it (maybe a couple of times per week). So, we figured we could do away with the pipeline and the feature, and replace them with an on-demand dashboard in our data visualization tool. The data is then computed only when needed, for a specific user. It might require the end user to wait a few minutes for the data, but it’s massively cheaper because we only pay when somebody triggers a query. It’s also less code to maintain, and a data viz dashboard is much easier to update and adapt to our team’s needs.
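As a rough sketch of what the on-demand query can look like (table and column names are made up, and the :user_id placeholder stands in for whatever parameter syntax your dashboard tool uses):

```sql
-- Illustrative only: compute one user's reading interests on demand instead of
-- exporting them for every user on a schedule. Table/column names are hypothetical.
SELECT
    topic,
    COUNT(*) AS reads_last_90_days
FROM analytics.post_reads
WHERE user_id = :user_id
  AND read_at > DATEADD('day', -90, CURRENT_TIMESTAMP())
GROUP BY topic
ORDER BY reads_last_90_days DESC;
```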