Data Flow Overview

Learn about Data Flow and how you can use it to easily create, share, run, and view the output of Apache Spark applications.

Figure: The Data Flow architecture. The User Layer contains Applications, the Library, and Runs. Below it, the Administrator Layer provides administrator controls for access policies and usage limits. The Infrastructure Layer provides elastic compute and elastic storage, and the Security Layer provides identity management and access management.

What is Oracle Cloud Infrastructure Data Flow?

Data Flow is a cloud-based serverless platform with a rich user interface. It lets Spark developers and data scientists create, edit, and run Spark jobs at any scale without needing clusters, an operations team, or highly specialized Spark knowledge. Because it is serverless, there is no infrastructure for you to deploy or manage. Data Flow is driven entirely by REST APIs, so it integrates easily with your applications and workflows. You can control Data Flow through the REST API, or from the command line, because Data Flow commands are included in the Oracle Cloud Infrastructure Command Line Interface (CLI); a short sketch of programmatic control follows the list below. You can:

  • Connect to Apache Spark data sources.

  • Create reusable Apache Spark applications.

  • Launch Apache Spark jobs in seconds.

  • Create Apache Spark applications using SQL, Python, Java, Scala, or spark-submit.

  • Manage all Apache Spark applications from a single platform.

  • Process data in the Cloud or on-premises in your data center.

  • Create Big Data building blocks that you can easily assemble into advanced Big Data applications.
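For example, the following sketch shows one way an application might be defined and launched programmatically with the Data Flow client in the OCI Python SDK. The compartment OCID, Object Storage path, Spark version, and compute shapes shown here are illustrative placeholders, not service defaults; substitute values from your own tenancy.

  import oci

  # Authenticate with the standard OCI configuration file (~/.oci/config).
  config = oci.config.from_file()
  client = oci.data_flow.DataFlowClient(config)

  # Define a reusable PySpark application stored in Object Storage.
  # All OCIDs, bucket names, versions, and shapes below are placeholders.
  app_details = oci.data_flow.models.CreateApplicationDetails(
      compartment_id="ocid1.compartment.oc1..example",
      display_name="daily-etl",
      language="PYTHON",
      file_uri="oci://my-bucket@my-namespace/etl_job.py",
      spark_version="3.2.1",
      driver_shape="VM.Standard2.1",
      executor_shape="VM.Standard2.1",
      num_executors=2,
  )
  app = client.create_application(app_details).data

  # Launch a run of the application; Data Flow provisions Spark on demand
  # and releases the resources when the run finishes.
  run_details = oci.data_flow.models.CreateRunDetails(
      compartment_id=app.compartment_id,
      application_id=app.id,
      display_name="daily-etl-run",
  )
  run = client.create_run(run_details).data
  print(run.lifecycle_state)

The same create-application and create-run operations are available through the REST API and as oci data-flow commands in the CLI, so the application definition can be reused from whichever interface suits your workflow.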

Figure: Data Flow Spark on-demand reads Spark applications and raw data from Object Storage, and writes processed data back to Object Storage.