Getting Started with Data Flow

Learn what Oracle Cloud Infrastructure Data Flow is and what you need to do before you begin using it, including setting up policies and storage, loading data, and importing and bundling Spark applications.

What is Oracle Cloud Infrastructure Data Flow

Data Flow is a cloud-based serverless platform with a rich user interface. It allows Spark developers and data scientists to create, edit, and run Spark jobs at any scale without the need for clusters, an operations team, or highly specialized Spark knowledge. Being serverless means there is no infrastructure for you to deploy or manage. It is entirely driven by REST APIs, giving you easy integration with applications or workflows. You can:

  • Connect to Apache Spark data sources.

  • Create reusable Apache Spark applications.

  • Launch Apache Spark jobs in seconds.

  • Create Apache Spark applications using SQL, Python, Java, or Scala.

  • Manage all Apache Spark applications from a single platform.

  • Process data in the Cloud or on-premises in your data center.

  • Create Big Data building blocks that you can easily assemble into advanced Big Data applications.

Figure 1. Data Flow Overview
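
Because Data Flow is driven entirely by REST APIs, you can also script it with any Oracle Cloud Infrastructure SDK. The following is a minimal sketch using the OCI Python SDK that lists the Data Flow applications in a compartment; the compartment OCID is a placeholder, and the sketch assumes a standard ~/.oci/config setup.

import oci

# Load the DEFAULT profile from ~/.oci/config (assumes the SDK/CLI is already configured).
config = oci.config.from_file()
data_flow = oci.data_flow.DataFlowClient(config)

# Placeholder: replace with your own compartment OCID.
compartment_id = "ocid1.compartment.oc1..example"

# List the Data Flow applications in the compartment.
for app in data_flow.list_applications(compartment_id).data:
    print(app.display_name, app.id)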

Before you Begin with Data Flow

Note

Avoid entering confidential information when assigning descriptions, tags, or friendly names to your cloud resources through the Oracle Cloud Infrastructure Console, API, or CLI. This applies when creating or editing an application in Data Flow.

Before you begin using Data Flow, you must have:

  • An Oracle Cloud Infrastructure account. Trial accounts can be used to demo Data Flow.
  • A Service Administrator role for your Oracle Cloud services. When the service is activated, Oracle sends the credentials and URL to the designated Account Administrator. The Account Administrator creates an account for each user who needs access to the service.
  • A supported browser, such as:
    • Microsoft Internet Explorer 11.x+

    • Mozilla Firefox ESR 38+

    • Google Chrome 42+

  • A Spark application uploaded to Oracle Cloud Infrastructure Object Storage. Do not provide it packaged in a zipped format such as .zip or .gzip.
  • Data for processing loaded into Oracle Cloud Infrastructure Object Storage. Data can be read from external data sources or clouds. Data Flow optimizes performance and security for data stored in an Oracle Cloud Infrastructure Object Store.
  • Note that Spark streaming is not supported.
  • The following summary of technologies supported by Data Flow is for reference only and is not meant to be comprehensive.
    Supported Technologies
      • Supported Spark Versions: Spark 2.4.4
      • Supported Application Types: Java, Scala, SparkSQL, PySpark (Python 3 only)

Set Up Administration

Before you can create, manage, and run applications in Data Flow, the tenant administrator (or any user with elevated privileges to create buckets and modify IAM) must create specific storage buckets and associated policies in IAM. These setup steps in Object Store and IAM are required for Data Flow to function.

Object Store: Setting Up Storage

Before running applications in the Data Flow service, two storage buckets are required in Object Store. For details, see the Object Storage documentation.

In Oracle Cloud Infrastructure Console, create two storage buckets in Object Store for the following:
Data Flow Logs
Data Flow requires a bucket to store the logs (both standard out and standard err) for every application run. Create a standard storage tier bucket called dataflow-logs in the Object Store service. The location of the bucket must follow the pattern:
oci://dataflow-logs@<Object_Store_Namespace>/
Data Flow Warehouse
Data Flow requires a data warehouse for Spark SQL applications. Create a standard storage tier bucket called dataflow-warehouse in the Object Store service. The location of the warehouse must follow the pattern:
oci://dataflow-warehouse@<Object_Store_Namespace>/
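
If you prefer to script this setup rather than use the Console, the following is a minimal sketch using the OCI Python SDK that creates both buckets; the compartment OCID is a placeholder, and the sketch assumes a standard ~/.oci/config setup.

import oci

config = oci.config.from_file()
object_storage = oci.object_storage.ObjectStorageClient(config)

# Your tenancy's Object Storage namespace.
namespace = object_storage.get_namespace().data

# Placeholder: replace with the compartment OCID where the buckets should live.
compartment_id = "ocid1.compartment.oc1..example"

# Create the two standard storage tier buckets that Data Flow expects.
for bucket_name in ("dataflow-logs", "dataflow-warehouse"):
    details = oci.object_storage.models.CreateBucketDetails(
        name=bucket_name,
        compartment_id=compartment_id,
        storage_tier="Standard",
    )
    object_storage.create_bucket(namespace, details)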

Identity: Policy Set Up

Data Flow requires policies to be set in IAM to access resources in order to manage and run applications. For more information on how IAM policies work, refer to the Identity and Access Management documentation. For more information about tags and tag namespaces to add to your policies, see Managing Tags and Tag Namespaces.

Data Flow User Policies

As a general practice, categorize your Data Flow users into two groups for clear separation of authority:
  • For administrator-like users (or super-users) of the service, who can take any action on the service, including managing applications owned by other users and runs initiated by any user within their tenancy, subject to the policies assigned to the group:
    • Create a group in your identity service called dataflow-admin and add users to this group.
    • Create a policy called dataflow-admin and add the following statements:
      ALLOW GROUP dataflow-admin TO READ buckets IN <TENANCY>
      ALLOW GROUP dataflow-admin TO MANAGE dataflow-family IN <TENANCY>
      ALLOW GROUP dataflow-admin TO MANAGE objects IN <TENANCY> WHERE ALL
                {target.bucket.name='dataflow-logs', any {request.permission='OBJECT_CREATE',
                request.permission='OBJECT_INSPECT'}}
      ALLOW GROUP dataflow-admin TO INSPECT tag-namespaces IN <TENANCY>
      ALLOW GROUP dataflow-admin TO READ tag-namespaces IN <TENANCY>
    This policy includes access to the dataflow-logs bucket.
  • The second category is for all other users, who are authorized only to create and delete their own applications. They can run any application within their tenancy, but have no other administrative rights, such as deleting applications owned by other users or canceling runs initiated by other users.
    • Create a group in your identity service called dataflow-users and add users to this group.
    • Create a policy called dataflow-users and add the following statements:
      ALLOW GROUP dataflow-users TO READ buckets IN <TENANCY>
      ALLOW GROUP dataflow-users TO USE dataflow-family IN <TENANCY>
      ALLOW GROUP dataflow-users TO MANAGE dataflow-family IN <TENANCY> WHERE ANY 
      {request.user.id = target.user.id, request.permission = 'DATAFLOW_APPLICATION_CREATE', 
      request.permission = 'DATAFLOW_RUN_CREATE'}
      ALLOW GROUP dataflow-users TO MANAGE objects IN <TENANCY> WHERE ALL 
      {target.bucket.name='dataflow-logs', any {request.permission='OBJECT_CREATE', 
      request.permission='OBJECT_INSPECT'}}
      ALLOW GROUP dataflow-users TO INSPECT tag-namespaces IN <TENANCY>
      ALLOW GROUP dataflow-users TO READ tag-namespaces IN <TENANCY>
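
These groups and policies can also be created programmatically. The following is a minimal sketch using the OCI Python SDK that creates the dataflow-users group and a policy containing the first two statements above (add the remaining statements in the same way); the sketch assumes a standard ~/.oci/config setup.

import oci

config = oci.config.from_file()
identity = oci.identity.IdentityClient(config)

# Tenancy-wide groups and policies are created in the root compartment (the tenancy OCID).
tenancy_id = config["tenancy"]

# Create the dataflow-users group.
identity.create_group(
    oci.identity.models.CreateGroupDetails(
        compartment_id=tenancy_id,
        name="dataflow-users",
        description="Regular Data Flow users",
    )
)

# Create the matching policy with the statements listed above.
identity.create_policy(
    oci.identity.models.CreatePolicyDetails(
        compartment_id=tenancy_id,
        name="dataflow-users",
        description="Policy for regular Data Flow users",
        statements=[
            "ALLOW GROUP dataflow-users TO READ buckets IN TENANCY",
            "ALLOW GROUP dataflow-users TO USE dataflow-family IN TENANCY",
        ],
    )
)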

Federation with an Identity Provider

If you use a SAML 2.0 identity federation system, such as Oracle Identity Cloud Service, Microsoft Active Directory, Okta, or any other provider that supports SAML 2.0, you can use one user name and password across multiple systems, including the Oracle Cloud Infrastructure Console. To enable this single sign-on experience, your tenant administrator (or another user with equivalent privileges) must set up the federation trust in IAM. For details, see the federation documentation for your identity provider.

Once you have configured the federation trust, use the Oracle Cloud Infrastructure Console to map the appropriate Identity Provider User Group to the required Data Flow User Group in the identity service.

Data Flow Service Policy

The Data Flow service needs permission to perform actions on behalf of the user or group on objects within the tenancy.

To set this up, create a policy called dataflow-service and add the following statement:
ALLOW SERVICE dataflow TO READ objects IN tenancy WHERE target.bucket.name='dataflow-logs'

Importing an Apache Spark Application to the Oracle Cloud

Your Spark applications need to be hosted in Oracle Cloud Infrastructure Object Storage before you can run them. You can upload your application to any bucket. The user running the application must have read access to all assets (including all related compartments, buckets and files) for the application to launch successfully.

Best Practices for Bundling Applications

  • Java or Scala applications: For the best reliability, upload applications as Uber JARs or Assembly JARs, with all dependencies included, to the Object Store. Use tools like the Maven Assembly Plugin (Java) or sbt-assembly (Scala) to build appropriate JARs.
  • SQL applications: Upload all your SQL files (.sql) to the Object Store.
  • Python applications: Build applications with the default libraries and upload the Python file to the Object Store. Do not include any third-party libraries or packages; see Known Issues.

Do not provide your application package in a zipped format such as .zip or .gzip.

After your application is uploaded to Oracle Cloud Infrastructure Object Storage, you refer to it using a URI of the form:
oci://<bucket>@<Object_Store_Namespace>/<applicationfile>

For example, with a Java or Scala application, suppose a developer at examplecorp developed a Spark application called logcrunch.jar and uploaded it to a bucket called production_code, in a tenancy whose Object Storage namespace is examplecorp. You can find your Object Storage namespace on the tenancy details page, reached from the user profile icon at the top right of the Console.

The correct URI becomes:
oci://production_code@examplecorp/logcrunch.jar
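
As a sketch, the upload and the resulting URI can also be produced with the OCI Python SDK; the bucket and file names follow the example above, and the sketch assumes a standard ~/.oci/config setup.

import oci

config = oci.config.from_file()
object_storage = oci.object_storage.ObjectStorageClient(config)
namespace = object_storage.get_namespace().data

bucket = "production_code"
object_name = "logcrunch.jar"

# Upload the application JAR to the bucket.
with open("logcrunch.jar", "rb") as jar_file:
    object_storage.put_object(namespace, bucket, object_name, jar_file)

# The URI that Data Flow uses to refer to the uploaded application.
print(f"oci://{bucket}@{namespace}/{object_name}")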

Load Data into the Oracle Cloud

Data Flow is optimized to manage data in Oracle Cloud Infrastructure Object Storage. Managing data in Object Storage maximizes performance and allows your application to access data on behalf of the user running the application.

Loading Data
  • Native web UI: The Oracle Cloud Infrastructure Console lets you manage storage buckets and upload files, including directory trees.
  • Third-party tools: Consider using the REST APIs and the Command Line Interface (CLI). For transferring large amounts of data, consider a third-party transfer tool; a scripted alternative using the OCI Python SDK is sketched after this list.
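
For scripted transfers of larger files, the OCI Python SDK's UploadManager uploads an object in parallel multipart chunks. The following is a minimal sketch; the bucket name, object name, and local path are placeholders.

import oci
from oci.object_storage import UploadManager

config = oci.config.from_file()
object_storage = oci.object_storage.ObjectStorageClient(config)
namespace = object_storage.get_namespace().data

# UploadManager splits large files into parts and uploads them in parallel.
upload_manager = UploadManager(object_storage, allow_parallel_uploads=True)
upload_manager.upload_file(
    namespace,
    "dataflow-input",            # placeholder bucket for your input data
    "events/2021/01/data.csv",   # object name within the bucket
    "/path/to/data.csv",         # local file to upload
)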

Cross Tenancy Access

Your users can work across tenancies; that is, they can act in a tenancy other than the one in which they exist. For example, you can have Data Flow in one tenancy while reading objects stored in a second tenancy. Consider the following scenario:
  • The Data Flow user belongs to group tenancy-a-group in a tenancy called Tenancy_A.
  • Data Flow runs in Tenancy_A.
  • The objects to be read are in a tenancy called Tenancy_B.

You need to allow tenancy-a-group to read buckets and objects in Tenancy_B.

Apply these policies in the root compartment of Tenancy_A:
define tenancy Tenancy_B as tenancy-b-ocid
endorse group tenancy-a-group to read buckets in tenancy Tenancy_B
endorse group tenancy-a-group to read objects in tenancy Tenancy_B

The first statement is a "define" statement that assigns a friendly label to the OCID of Tenancy_B. The second and third statements let the user's group, tenancy-a-group, read buckets and objects in Tenancy_B.

Apply these policies in the root compartment of Tenancy_B:
define tenancy Tenancy_A as tenancy-a-ocid
define group tenancy-a-group as tenancy-a-group-ocid
admit group tenancy-a-group of tenancy Tenancy_A to read buckets in tenancy
admit group tenancy-a-group of tenancy Tenancy_A to read objects in tenancy

The first and second statements are define statements that assign a friendly label to the OCID of Tenancy_A and tenancy-a-group. The third and fourth statements let tenancy-a-group read the buckets and objects in Tenancy_B. The word admit indicates that the access applies to a group outside the tenancy in which the buckets and objects reside.

You can limit access further by restricting the read buckets policy to a compartment. For example, to a compartment called your_compartment:
admit group tenancy-a-group of tenancy Tenancy_A to read buckets in compartment your_compartment
You can restrict access even further by limiting the read objects policy to a bucket. For example, to a bucket called your_bucket in your_compartment:
admit group tenancy-a-group of tenancy Tenancy_A to read objects in compartment your_compartment where target.bucket.name = 'your_bucket'
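
With those policies in place, the application itself simply reads the data through its oci:// URI. The following is a minimal PySpark sketch, assuming your_bucket lives in Tenancy_B and tenancy_b_namespace stands in for Tenancy_B's Object Storage namespace.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cross-tenancy-read").getOrCreate()

# Read objects from a bucket that lives in Tenancy_B; the policies above grant the access.
df = spark.read.csv(
    "oci://your_bucket@tenancy_b_namespace/input/",
    header=True,
)
df.show()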