🏔️

What’s New with Orchestration in Snowflake

musyu

2024/06/06に公開

ナウキャストでデータエンジニアをしている六車です！
現在開催中のSnowflake DataCloudSummit 2024で発表されたオーケストレーション関連の新機能について紹介します
以下のXの投稿の詳細版です

概要

新機能は以下の通り。

Task Graph: 複数のタスク、依存関係、成功/失敗通知を使用して、複雑なビジネスプロセスをモデル化
Serverless Task Flex: バッチETLパイプラインに柔軟なスケジュールウィンドウを提供し、コスト効率の高いオプションを実現
Triggered Tasks: データの可用性に基づいてパイプラインの実行を自動化し、スケジュール設定の必要性を排除し、低レイテンシーを実現
Task Graph Observability by Snowflake Trail: タスクの実行に関する詳細な可視性を提供し、グラフ表現、実行の詳細、トラブルシューティング機能など
その他の新機能(予定含む): Lower latency Tasks, Tasks Backfill, UIの強化

データオーケストレーションにSnowflakeタスクを使用することでコスト削減、管理の簡素化を強調。

文字起こし全体(自動生成、英語)

What’s New with Orchestration in Snowflake, DE104
の録音データの文字起こしです。Gemini-1.5-flash-001を使用して自動生成しました。

文字起こし(英語)

Um, so what we're going to be chatting with you today is sharing out the latest and greatest new features with the orchestration in Snowflake. And, uh, then we're also going to be hearing from a customer of Snowflake Tasks. I'm Manuela Away. I'm a Senior Product Manager at Snowflake for managing orchestration. Today, our talk is primarily around Snowflake Tasks. And, I'm Joe Tobin, Vice President of Mercury Data Engineering, and I'll be talking about how we've implemented Tasks into our everyday workflows. Right. So, uh, as we mentioned, we're talking we're going to be sharing with you all the latest new features with Snowflake Tasks and a demo of some of these features in action. I'll keep it short so that we can get to the exciting part with Joe who'll be sharing with you about our successful story with our business transformation, our using, using Snowflake Tasks. So, just to give a land of land, uh, this is a typical data pipeline in Snowflake where you've got ingestion. Um, you're transforming your data and then data delivery. So, in this talk, we're going to be focused on Tasks and Serverless Tasks and all of these capabilities are tied into the Snowflake platform so that you can get the power of the platform with your pipelines running on Snowflake. So, just a quick primer around Tasks. Tasks are Snowflake's native orchestration capability, making it easy for you to orchestrate and schedule your Tasks on Snowflake, running on your dedicated user-managed warehouses. We also have Serverless Tasks, which simplifies this process even further, such that you're your pipelines run on Snowflake-managed compute. So, you don't need to worry about right-sizing the compute. We will figure that out for you and automatically scale your work as your workloads grow over time. So, Tasks has been in GA for several years now, so we're really excited around the usage of Tasks across our Snowflake community. We've got about 8,000 active customers running on Snowflake today, running about 30 million daily Tasks. So, why do customers love Serverless Tasks? With other third-party orchestration products, the burden is on the user for managing either your on-prem or cloud data products. Um, you also may need to manage infrastructure, right, for running those. With Snowflake Serverless Tasks, you're able to focus on implementing your business use cases and reaping the benefits from the data rather than the burden of efficiently scaling your compute. We optimize for pricing and performance based on your workloads, so this reduces your any job wastage and lowering the total cost of ownership. So, just a quick highlight on what we're going to be talking about around the new features. So, first, we're going to start out with Task Graphs where we enable you to model your more complex business processes. And, then, you've heard about Snowflake Trail today in the platform keynote, where we have a portfolio of products making it easier for you to gain critical visibility on the the execution of your Tasks. Then, we're going to be talking about Trigger Tasks which is now in public preview where we're lowering the latency for your pipelines and also optimizing your costs. This morning, you heard about Serverless Task Flex, currently in private preview, and this is the new capability that we have for most cost-effectively running your TaskSQL or Python, or GBM workloads. And, more enhancements. So, some of the enhancements that we've made, Task Graphs we've had for a while. In this past year, we've been putting a lot of investment in this area because we kept hearing from you about your desire to be able to model more complex business processes. So, in this example here, I have a graph with five Tasks with the root Task node, uh, which has a Task Graph configuration. So, this is a configuration you can define for your entire Task Graph to use, and this helps you to be able to manage your Task Graphs across different environments. For example, if you're running in development or you're running in production, and you may need to point to different file locations, right? Task Graph configurations will really simplify that for you rather than having hard code that in your Task. We also have introduced in private preview, Task Graph Success notifications, where you can get notified on unsuccessful completion of your Task Graph. And, some customers use this to be able to trigger some downstream action. This is an extension upon the current capability that we have today where you could get notified upon an error with a Task. And then, we have here in Task B, runtime reflection variables where we provide out-of-the-box functions that make it easier for you to get some of the runtime characteristics of your current Task Graph. One example where this may be helpful is if you want to determine when's the last time a Task has processed a block of stream, so you can determine a range of data of, uh, time range that you want to be able to process data. In Task C, we're going to in I'm going to give you a demo of this pretty soon. We're going to be intentionally writing to a table which doesn't exist, which is going to cause a failure. And, this is so that we can showcase the new capability that we have to automatically retry from where the Task Graph last failed. And so, this will ensure that you're not, um, you're lowering your latencies for your Task Graph, you're not wasting resources for re-processing things that are already complete. With Task D, we have we're utilizing the Task predecessor return value, such that you can be able to model some dynamic behaviors. If you can set up a conditional using a return value from a prior Task to be able to determine what path to go down. And, then last, we have the finalizer Task, which is guaranteed to run whenever your Task Graph starts and it's really useful to be able to do some cleanup from your Task Graph. Let's say if you have a temporary table that you're using to store data, you want to clean up those resources in order to save money. All right. And, then, um, Task Graph, uh, observability, which is plugged into Snowflake Trail, where you can get a visual of your Task Graph or data, right? You can get into dig into some of the, uh, execution details to be able to understand trends and fluctuations in your executions and you're able to debug such that you can make take quick actions, uh, on the execution of your Tasks. And then, a lot of these capabilities, such as suspending, resuming, are now available through the UI, um, so those should enhance the moment. All right. So, now let's see this Task Graph in action. All right. So here, I've modeled the uh Task Graph that we just showed in the slides, right? So, we have Task A, um, Task C, we walk through this, you're going to see when we start your we're going to start with Task A and then Task B is run, it's going to run, Task C we'll see will fail, it's expected to fail because it's going to write to a table which doesn't exist. Which causes Task D to not run. Yeah. And then, the finalizer will run to do any cleanup in between runs. Um, one thing I want to show, So, through the UI, you can also see, for example, I have this Task scheduled to run every 45 seconds and it's running on Snowflake-managed compute. You're also able now to do some edits of your Task configuration. You can change the schedule, what type of compute it's running on, You remember in Task A, I have a Task Graph configuration. So, I have those values here, you can change them. I have this task configured to, uh, suspend after six failures. Let's, uh, reduce that. And, when you make that change, you'll see it reflected in the Task running. All right, so let's start this or resume the Task, and I'm just going to kick off the first one so that we can see this working in action. So, you see Task A is starting up. And, this is our live view, uh, live status view. So, um, it's going to go really fast, we're going to be really up here. So, I just created the, um, table which Task D writes to, and I don't know if you saw that, but Task Task A and Task B did not rewrite. It only restarted from Task C. So, let's just take a look at what else you can see from the UI. I want to show you the run history for Task C, where you can see, so this is one we just did, and it succeeded after three retries and the error that it had previously was, it was trying to write to a table which doesn't exist. Right. So, that's fully automated. You didn't have to do anything around, um, other than, uh, enable the capability working. And, I was also writing out to a table here just so that we can track that each of the Tasks were running. Task D, uh, was able to run after Task C when it tried and was successful. So, that is our, um, we'll go back to this, so we just saw the Task Graph, uh, in action. Now, Trigger Tasks, which we're going to, uh, announce is currently in public preview now, um, we since we've released this, we have about 96 customers enabled in their private preview, um, with about 20,000, uh, daily, uh, triggered Task executions. So, their private preview has been going really well, and we're excited to release this now in public preview, with the additional support of streams on views in addition to streams on tables which we had previously. Trigger Tasks enables you to automate your pipeline execution to run when there's new data to process, so there's no schedule needed, um, and it's just it's going to run as soon as data is available. So, I'm going to show you a Trigger Task which reads off of a stream. We have a low-latency Task configured here, which is running every 10 seconds, and this is also a new capability that is going to be in private preview. So, I have this low latency Task which writes to a table that updates a stream, which will trigger the Trigger Task, which then is going to customize the email. So, let's see this in action. So, you'll see here, this is my Trigger Task. It's currently in a started state, and there's no schedule, right? So, it will only run when there's new data. And, it's going to look for this stream, uh, when there's updates to this 20 stream. And, this is the, uh, this is the Task that's going to run every 10 seconds. I'm not going to run it to lock in, uh, during my demo, so I'll just kick it off here and then, when we go back and take a look at that, it's just going to run. You'll see that the Trigger Task triggered on its own, without the schedule, when there was data. There we go. So, this is the custom email that I have configured for that Trigger Task to run, uh, it kicks off a stored procedure in the background and essentially looks at the status of the Trigger, uh, Task state. All right. So, that's Trigger Tasks. No schedule needed, automatically runs once there is new data in your stream. So, you heard a bit about Serverless Task Flex in this morning's platform keynote. So, Serverless Task Flex is the most cost-effective way for you to be able to run your, uh, SQL Python, GBM workloads on Snowflake. And, you simply create a Serverless Task Flex by specifying to be able to run in a flexible scheduling mode, and you would specify a target completion window from which to run. So, before we get into some more of the details there, comparatively a Serverless Task Flex is 42% less costly than running on Serverless Tasks. So, the target use case for Serverless Task Flex is a batch ETL pipeline, which you can run SQL Python GBM where you specify a flexible scheduling window. So, in the example that I have written in blue, this Task is set up to start off at midnight and have a target completion window of 4 hours, so your task will complete sometime between midnight and 4 am. And, it's going to cost you 42% less than running on Serverless Tasks. So, all of these capabilities that we just showed you, you can also get, uh, visibility in your cost associated with your Task and Serverless Tasks is plugged into the Snowflake budget capability where you can see either what's the budget consumed for your particular, uh, Task, or you can set up a budget in your accounts which will alert you if you're coming up on your, uh, budget. And then, some of the other features that we quickly went through, uh, lower latency Tasks. So, we've reduced the floor for scheduling of your Tasks from the current 1 minute to, uh, 10 seconds, and we're working to reduce that into single digit seconds. This is currently in private preview. We have Task Backfill, so I don't know if any of you have had, uh, scenarios where, uh, you have a pipeline set up, it's automated and then the source changes, right, and then you got to go back and rerun and re-process. Task Backfill is intended to make that easier for you, where you specify a range of data to run for an existing Task Graph and we can re-manage all of the Task execution and give you visibility into the progress of that Backfill job. And, then in the Task UI, we showcased the, uh, live status view where you were able to see the status updating as the Task Graph was running, you were able to edit the Task configuration directly from the UI and see the renumber of retry attempts for a Task Graph when it did the automatic retry. So, you have options for how you want to orchestrate your data in Snowflake. We firmly believe that Serverless Tasks really simplifies a lot of this process for you, and so for those types of use cases where you have a variety of workloads, which don't all fit on the same size compute or you're not able to fully utilize a warehouse, um, Serverless Tasks is a great option for you. Serverless Task Flex is an extension upon that for batch pipelines where you can, uh, run on a flexible schedule, and then you get some additional cost benefits from that. And then, lastly, if you have workloads that are all somewhat similar and you can run them concurrently, then user, uh, running your Task on your user-managed warehouses, is a good option. Okay, so now I'm going to hand it over to Joe, and he can get into the details around the success that he saw. Excellent. Thank you, Manuela, and that is really exciting, and that's why we had a great opportunity here to work with Snowflake and improve our orchestration layer through the usage of Tasks inside of Snowflake. So, as I said earlier, I'm Joe Tobin, I head up our data engineering team on our Mercury product team. Our Mercury product is focused in on the identity, knowing who your people are, knowing about your people through the data where I'm focused and connecting those people to the marketplace for for better targeting. We had an opportunity here where we realized that we had a complex multi-solution orchestration layer. This had many complications for us. We were constantly having to maintain, um, excuse me, having to maintain systems in multiple platforms, having to maintain the connectors from those platforms into Snowflake. In addition, because we're connecting from outside sources, we had a challenge where we wanted to look at and manage the data governance, where is that data going, who is accessing that data, what is it being used for. And being a third-party or external systems, we were concerned about data exfiltration, what data is that system taking out, where is it going when it's outside of the system? So, I reached out to my team and I challenged them and I said, I need you to tell me what you need to happen inside of Snowflake so that Tasks and orchestration can run natively. At the time, we started working on this, Tasks was in its beginnings. Um, there wasn't a lot of functionality. There wasn't the awesome features that we just saw. So, I had the team go through and look at and tell me, what tools are we using? What do we like about them? What don't we like about them? What's missing from Snowflake? And, we took that, and we went to the product team at Snowflake and said, if you can do this, we will move more of our Tasks. And, after some time and some back and forth, we had collaborating sessions with them, we had demos with them, and through that feedback cycle, we were able to get to a point with Task where we started a migration and transition process. In that process, we looked at what Snowflake could do, what our third-party systems could do, and begin looking at the efforts to move our products from third-party into Snowflake. We did an extensive amount of testing where we were able to get equal or better, uh, runtimes in our process, where we were able to collect the same type of logging information. And, what processes could we improve, can we improve costs? Everybody's looking to cut their costs and bring that down, could we make dev time better? So, not only are we looking at operational costs from compute or runtimes or, um, even software costs, but also people time, can we get delivery out the door faster using less development time. So, as we went through this process, this is what we started with on the left. We started with a cloud-based orchestration system. It has its own logging capabilities. It has its own connection capabilities, that would connect into Snowflake. Because of those connections, there are always updates that come from those third-party softwares when they are updated, the connectors will need to be updated, or when you get that dreaded email every quarter that says, "Here's all the connectors that are no longer supported." You and your DevOps team go running out and figure out how can I fix this, regression test it, and get it working again. We had the same problem in our on-prem solution. Legacy software, running on-prem, keeping its own logs, no connectivity between the two. And then, as we talked to Snowflake, Snowflake keeps its own logs. So, if there was an issue or a problem, our DevOps team would spend a lot of time troubleshooting, going through logs in one platform, logs in another, and eventually figure out the issue, make the changes, testing in one platform, testing the other. So, as we moved forward and moved into Snowflake, we were able to come to a unified solution where through the use of Streams, Tasks, we could run our stored procedures, we could collect our logs in a unified location. And now, with the introduction of Snowflake Trail, we're able to see that information in one shot. Um, it makes it easier for us to troubleshoot what's going on and makes it easier for us to understand where our problems are, quickly resolve them and get operations running again. So, we got to that one solution. Um, we now run everything for our data builds inside of Snowflake using Snowflake Tasks. The orchestration has been simplified, it's been streamlined. Data governance, because we're inside of Snowflake, we now have improved governance. We can see who is accessing, when they're accessing and what roles through our back. We have stronger control of what's happening with the data within the platform and we're eliminating the data exfiltration, it is staying inside of the platform, and we can now have stronger control of what moves in and out of the platform. So, we had some great business outcomes here. Um, because we are reducing the usage of third-party software and cloud platforms, on-prem platforms, we can now reduce software costs, we can reduce people costs for maintaining those, we can reduce DevOps time for constantly going through and testing and updating connectors as they become available, updated or deprecated. Um, in many cases, we saw process improvement time. There was no more lag or latency of data or instructions moving from third-party tools into Snowflake, Snowflake pushing that information out being collected. Now, if you look at the graph at the bottom here, we had some really strong cases where hey, it was amazing, we saved a lot of time, a lot of processing. We had some marginal ones where, you know, we could go either way. They were fairly quick Tasks, so there wasn't really a big gain. And, in some cases, some things just didn't run as fast, right? It's not a end-all solution, but we got to that one solution, so if something goes wrong, we know where it is and the thing I strive for is we got a happy development team. They're no longer angry looking at logs in three different systems at a time. So, the key learnings we had is we had a plan. We had to look at what we had, we had to find out what we need, we did a lot of testing, um, the collaboration with the Snowflake team was absolutely fantastic. Uh, we wouldn't have achieved this without Manuela and her development team going through and helping us, uh, with testing and listening to feedback. Um, it doesn't solve everything, right? It's, you know, there is no magic bullet out there that's going to give you the solution that you're looking for. In our case, it gave us a best-case scenario, it gave us a best-case solution. Finally, or, um, excuse me, it removed, um, the connectivity or the connector dependency, you know, as I said and so many of you shake your head at that dreaded email, "This connector is deprecated." That alone created a lot of extra work. We don't have to do that anymore, it's all managed inside of Snowflake. It's improved our logging to one location and now we have Snowflake Trail that makes it easier for the graphical interface. Um, we have better data governance and with the announcement of Horizon today, we are excited to move forward with that in improving our data governance and cost management. Manuela also showed you the examples there. It is really great for us to be able to sit down, look at what our budget is, how much are we spending in one platform instead of trying to aggregate three different platforms, breaking across different spend pieces, so that we can understand what it really costs to build out our data product. Uh, and the last thing I really like is it's created a shallow learning curve for us. I no longer have to get developers to learn multiple platforms, how to go through them, how to set them up. Through one platform of Snowflake, and through the training and support that we get from them, we can get developers and operations people up to speed faster in using the products and being able to set up, uh, orchestration faster. I appreciate you listening and if anybody is interested in more about Mercury, uh, feel free to come up to the screen, get to the, uh, I appreciate you listening, and if anybody's interested in more about Mercury, uh, feel free to come up to the screen, get to the, uh, QR code. That's awesome, Joe. Thanks so much for sharing about your story, and it was amazing to see some of those business outcomes. Um, so thank you for your time today, uh, we encourage you to check out these other sessions that are focused around data engineering on Snowflake, and we're happy to take some questions. I think there's, uh, a microphone here in the middle, or if you want to chat, um, you know, you can chat outside as well. Thanks so much for your time. Enjoy your rest of your day.

Task Graph

UIを通じてタスクの設定を変更したり、リトライ回数を確認したりすることが可能に。

デモでエラーを発生して、自動リトライが行われる様子を確認

ブラウザ上から設計を変えることもできる

Serverless Task Flex

今回のSummitで発表された新機能。柔軟なスケジューリングウィンドウでバッチETLパイプラインをコスト最適化して実行する機能
この機能により、最大42%のコスト削減が可能(!?)になるらしい
スピードがあまり求められないバッチ処理には有用そう

Triggered Tasks

Snowflake内でデータ変更イベントが発生したときに、タスクを即座に実行する機能。

以下のようなクエリでTriggered Tasksを作成できます。

CREATE TASK triggeredTask  WAREHOUSE = my_warehouse
  WHEN system$stream_has_data('my_stream')
  AS
    INSERT INTO my_downstream_table
    SELECT * FROM my_stream;

ALTER TASK triggeredTask RESUME;

Task Graph Observability by Snowflake Trail

タスクグラフの実行に対する詳細な可視化を提供し、ユーザーが制御できるようにする機能
最近のタスクの実行時間の変動を確認したり、トラブルシューティングを行うことができる
任意のタスクから、ユーザーはタスクの詳細、グラフ、および実行履歴を表示し、実行、停止/再開などの管理ができる
こちらの機能は、Snowflake Trailによって提供される

その他の新機能(予定含む)

Lower latency Tasks; 10秒スケーリングが可能になる
Tasks Backfill; タスクグラフを過去のデータ範囲で実行および管理する作業を簡素化
UIの強化;
- ライブステータスビュー
- タスクの編集
- タスクグラフのリトライ試行

Tasks BackfillやUIの強化はオーケストレーションツールには必要不可欠な機能であると思っており、これがSnowflakeに統合されることで、データパイプラインの運用管理がより楽になりそうです。ただ、使えるのは少し先になりそう・・・

総評

Snowflakeのオーケストレーション機能は、タスクの設定や管理が直感的に行えるUIを提供しており、また、新機能の追加により、より複雑なビジネスプロセスのモデル化や、コスト削減、低レイテンシー化が可能になってきそうです。
また、Snowflake Trailによる可視性の向上も大きな特徴であり、データパイプラインの運用を効率化する上で非常に有用な機能と言えるかと思います。
オーケストレーションがSnowflakeネイティブに統合されると、外部でAirflowやDagsterなどのオーケストレーションツールを使う必要がなくなるため、データパイプラインの運用がより簡単になることが期待されます。
ただ、新機能の一部はプレビュー段階であり、一般提供されるまでには時間がかかるかもしれません。ユースケースの見極めは大事そうです。今後のSnowflakeの発展が楽しみです。