Data Flow Advantages
Here’s why Data Flow is better than running your own Spark clusters or using other hosted Spark services.
- It's serverless, which means you don’t need experts to provision, patch, upgrade, or maintain Spark clusters. You focus on your Spark code and nothing else.
- It has simple operations and tuning. The Spark UI is one click away and is governed by IAM authorization policies. If a user reports that a job is running too slowly, anyone with access to the Run can open the Spark UI and find the root cause. For completed jobs, the Spark History Server is just as easy to reach.
- It is great for batch processing. Application output is automatically captured and made available through REST APIs. Need to run a four-hour Spark SQL job and load the results into your pipeline management system? In Data Flow, it’s just two REST API calls away.
- It has consolidated control. Data Flow gives you a single view of all Spark applications, who is running them, and how much they consume. Want to know which applications write the most data, and who runs them? Sort by the Data Written column. Is a job running too long? Anyone with the right IAM permissions can see the job and stop it.
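The "two REST API calls" pattern from the batch-processing point above can be sketched as follows. This is a minimal illustration, not the authoritative Data Flow API: the paths, payload fields, and OCIDs below are assumptions for demonstration, so check the Data Flow API reference (or use the OCI SDK) for the real signatures.

```python
def create_run_request(application_id: str, compartment_id: str, display_name: str) -> dict:
    """Call 1: submit a Spark job as a Data Flow Run.

    Path and body fields are illustrative assumptions.
    """
    return {
        "method": "POST",
        "path": "/runs",  # hypothetical path; see the Data Flow API reference
        "body": {
            "applicationId": application_id,   # OCID of the Data Flow Application
            "compartmentId": compartment_id,   # OCID of the target compartment
            "displayName": display_name,
        },
    }


def get_run_output_request(run_id: str) -> dict:
    """Call 2: fetch the automatically captured output of a finished Run.

    Path is an illustrative assumption.
    """
    return {
        "method": "GET",
        "path": f"/runs/{run_id}/logs",  # hypothetical path
    }


# Build both calls for a nightly Spark SQL job (OCIDs are placeholders).
submit = create_run_request("ocid1.dataflowapplication.example",
                            "ocid1.compartment.example",
                            "nightly-sql")
fetch = get_run_output_request("ocid1.dataflowrun.example")
print(submit["method"], submit["path"])  # POST /runs
print(fetch["method"], fetch["path"])    # GET /runs/ocid1.dataflowrun.example/logs
```

A pipeline scheduler would issue the first call, poll the Run until it completes, then issue the second call to pull the results into the downstream system.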