Data Flow Overview

Learn about Data Flow and how you can use it to easily create, share, run, and view the output of Apache Spark applications.

Figure 1. Architecture Overview

Data Flow Concepts

An understanding of these concepts is essential for using Data Flow.

Data Flow Applications
Data Flow Applications are reusable Spark application templates. Each one consists of a Spark application, its dependencies, default parameters, and a default run-time resource specification. Once a developer creates a Data Flow Application, anyone with the correct permissions can use it without worrying about the complexities of deploying it, setting it up, or running it.
Data Flow Library
The Data Flow Library is the central repository of Data Flow Applications. Anyone can browse, search, and execute applications published to the Library, subject to having the correct permissions in the Data Flow system.
Data Flow Runs
Every time a Data Flow Application is run, a Data Flow Run is created. The Data Flow Run captures the Application's output, logs, and statistics, which are automatically and securely stored. Output is saved so that anyone with the correct permissions can view it using the UI or REST API. Runs also give you secure access to the Spark UI for debugging and diagnostics.
Elastic Compute
Every time you run a Data Flow Application, you decide how much compute it gets. Data Flow allocates your VMs, runs your job, securely captures all output, and shuts the cluster down. You have nothing to maintain in Data Flow. Clusters run only when there is real work to do.
Elastic Storage
Data Flow works with the Oracle Cloud Infrastructure Object Storage service. For more information, see the Overview of Object Storage.
Security
Data Flow is integrated with Oracle Cloud Infrastructure Identity and Access Management (IAM) for authentication and authorization. Your Spark applications run on behalf of the person who launches them, so the Spark application has the same privileges as that user. You do not need to supply credentials to access any IAM-capable system. In addition, Data Flow benefits from all the other security attributes of Oracle Cloud Infrastructure, including transparent encryption of data at rest and in motion.
Administrator Controls
Data Flow lets you set service limits and create administrators who have full control over all applications and runs. You stay in control regardless of how many users you have.
Apache Spark
Apache Spark is a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.
Spark Application
Spark Applications use the Spark API to perform distributed data processing tasks. Spark Applications can be written in several languages, including Java and Python. They are packaged as files, such as JAR files, that run within the Spark framework.
Spark UI
The Spark UI is included with Apache Spark and is an important tool for debugging and diagnosing Spark applications. You can access the Spark UI for any Data Flow Run, subject to the Run’s authorization policies.
Spark Logs
Spark generates log files, which are useful for debugging and diagnostics. Each Data Flow Run automatically stores log files, which you can access through the UI or API, subject to the Run’s authorization policies.
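
The relationship between a Data Flow Application (a reusable template with defaults) and a Data Flow Run (one execution of that template) can be sketched in a few lines of Python. This is only a conceptual model, not the Data Flow API: the class names, field names, shape labels ("small", "large"), and the storage path are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ResourceSpec:
    """Hypothetical run-time resource specification (illustrative fields only)."""
    driver_shape: str
    executor_shape: str
    num_executors: int


@dataclass(frozen=True)
class Application:
    """Conceptual stand-in for a Data Flow Application: a reusable template."""
    name: str
    artifact_uri: str                      # placeholder location of the Spark artifact
    default_resources: ResourceSpec
    dependencies: Tuple[str, ...] = ()
    default_parameters: Dict[str, str] = field(default_factory=dict)


@dataclass
class Run:
    """Conceptual stand-in for a Data Flow Run: one execution of an Application."""
    application: Application
    parameters: Dict[str, str]
    resources: ResourceSpec


def start_run(
    app: Application,
    parameters: Optional[Dict[str, str]] = None,
    resources: Optional[ResourceSpec] = None,
) -> Run:
    """Create a Run from the template, letting the caller override defaults."""
    # Per-run parameters take precedence over the Application's defaults.
    merged = {**app.default_parameters, **(parameters or {})}
    return Run(app, merged, resources or app.default_resources)


# One template, defined once (all values are made-up examples).
app = Application(
    name="daily-report",
    artifact_uri="object-storage-placeholder/daily_report.jar",
    default_resources=ResourceSpec("small", "small", 2),
    default_parameters={"input": "raw/", "output": "reports/"},
)

# Two independent runs: one with the defaults, one sized up with an override.
run_a = start_run(app)
run_b = start_run(
    app,
    parameters={"output": "adhoc/"},
    resources=ResourceSpec("small", "large", 8),
)
```

In this sketch the template is never modified by a run. Each run records the parameters and resources it actually used, which mirrors the idea that the Application is defined once and every execution is sized and captured separately.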