The Data Flow Library

Learn about the Data Flow Library, including reusable Spark application templates and application security. Also learn how to create, view, edit, and delete applications, and how to apply arguments or parameters.

Reusable Spark Application Templates

Data Flow Applications are infinitely reusable Spark application templates. They consist of a Spark application, its dependencies, default parameters, and a default run-time resource specification. Once a Spark developer creates a Data Flow Application, anyone can use it without worrying about the complexities of deploying, configuring, or running it. They can use it through Spark analytics in custom dashboards, reports, scripts, or REST API calls.

Every time you invoke a Data Flow Application, you create a Data Flow Run. It fills in the details of the application template and launches it on a specific set of IaaS resources.
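
For example, invoking an Application programmatically with the OCI Python SDK looks roughly like the following sketch. It assumes the SDK is installed and configured, and the OCIDs are placeholders:

# Create a Data Flow Run from an existing Application (a hedged sketch;
# the OCIDs below are placeholders).
import oci

config = oci.config.from_file()
df_client = oci.data_flow.DataFlowClient(config)

run_details = oci.data_flow.models.CreateRunDetails(
    compartment_id="ocid1.compartment.oc1..example",
    application_id="ocid1.dataflowapplication.oc1..example",
    display_name="my-first-run",
)

run = df_client.create_run(run_details).data
print(run.id, run.lifecycle_state)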

View Data Flow Applications

View Applications in Data Flow

From the Oracle Cloud Infrastructure Console, click Data Flow and then Applications, or from the Data Flow Dashboard, click Applications in the left-hand menu. The resulting page displays a table of the applications. It lists the name of each application, along with the language (Python, SQL, Java, or Scala), the owner, when it was created, and when it was last updated. To Edit an Application, or just to see more detailed information about it, either click the name of the application, or select Edit from the menu at the end of that application's row in the table. This menu also has options to Run or Delete the application.

If you click the Application name, the Details page for that application is displayed. The Detail Information tab displays the resource configuration and application configuration details. There is also a Tags tab that displays the tags applied to the application. You can use Edit, Run, or Delete to change the application.

Filter Applications in Data Flow
You can filter, search, and sort the list of applications in a number of ways:
  • In the left-hand menu is a filters section. You can filter on the State of your applications and the Language used; both are drop-down lists. You can also set a date range during which applications were updated, using the Updated Start Date and Updated End Date fields. Finally, you can enter all or part of an application name in the Name Prefix field to filter by application name. You can filter on one, some, or all of these options, and you can clear all these fields to remove the filters.
  • The State filter gives the options of Any state, Accepted, In Progress, Canceling, Canceled, Failed, or Succeeded.
  • The Language filter lets you filter by All, Java, Python, SQL, or Scala.
  • The Updated Start Date and Updated End Date fields let you pick a date from a calendar, along with a time (UTC). The calendar displays the current month, but lets you navigate to previous months. There are also quick links for choosing today's date, yesterday's date, or the past three days.
  • If you have applied tags to your applications, you can filter on these tags in the Tag Filters section. You can clear these tag filters.

These filtering options also help you find an application when you can't remember its specifics. For example, you know it was created last week but can't remember exactly when.

Sort Applications in Data Flow

You can sort the list of applications by Created date, either ascending or descending.
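
If you prefer to work programmatically, the same list can be retrieved with the OCI Python SDK. This is a hedged sketch; the compartment OCID is a placeholder, and the exact summary field names can vary slightly between SDK versions:

import oci

config = oci.config.from_file()
df_client = oci.data_flow.DataFlowClient(config)

# Each application summary carries the details shown in the console table:
# name, language, owner, and creation time.
apps = df_client.list_applications(
    compartment_id="ocid1.compartment.oc1..example"
).data

for app in apps:
    print(app.display_name, app.language, app.time_created)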

Create a Java or Scala Data Flow Application

  1. Upload the JAR artifact to Oracle Object Storage.
  2. On the Applications page, click Create.
  3. In the Create Application panel, select JAVA or SCALA as appropriate from the Language options.
  4. Create the application and any necessary items.
    1. Provide a Name for the application.
    2. (Optional) Enter a Description which will help you search for it.
    3. Select the Spark Version from the drop-down list.
    4. Pick the Driver Type from the drop-down list.
    5. Select the Executor Type from the drop-down list.
    6. Enter the Number of Executors you need.
    Tips For Your Application's Default Size helps you calculate the number of resources needed.
  5. Configure the Application.
    1. Enter the File URL to the application, using this format:
       oci://<bucket_name>@<objectstore_namespace>/<file_name>
    2. Enter the Main Class Name.
    3. (Optional) Enter any Arguments. For example, if the Spark application expects two command-line parameters, one for the input and one for the output, enter the following in the Arguments field:
      ${input} ${output}
      You are prompted for default values. It's a good idea to enter them now.
      Note

      Do not include either "$" or "/" characters in the parameter name or value.
  6. (Optional) Add advanced configuration options.
    1. Click Show Advanced Options.
    2. Enter the Key of the configuration property and a Value.
    3. Click + Another Property to add another configuration property.
    4. Repeat the previous two steps until you've added all the configuration properties.
    5. Populate Application Log Location with where you want to save the log files.
    6. Choose Network Access.
      1. If you are Attaching a Private Endpoint to Data Flow, click the Secure Access to Private Subnet radio button. Select the private endpoint from the resulting drop-down list. Click Change Compartment if it is in a different compartment to the Application.
        Note

        You cannot use an IP address to connect to the private endpoint; you must use the FQDN.
      2. If you are not using a private endpoint, click the Internet Access (No Subnet) radio button.
  7. Click Create.

If you want to change the values for Language, Name, or File URL in the future, you can Edit an Application to do this. You can only change Language between Java and Scala; you cannot change it to Python or SQL.
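
The console fields described above map onto the Data Flow API. The following is a hedged sketch using the OCI Python SDK; the OCIDs, Spark version, shapes, and class name are placeholders, and field names can differ slightly between SDK versions:

import oci

config = oci.config.from_file()
df_client = oci.data_flow.DataFlowClient(config)

# Sketch of a Java Application definition; Arguments and Parameters can
# also be supplied here (see Application Parameters later in this topic).
app_details = oci.data_flow.models.CreateApplicationDetails(
    compartment_id="ocid1.compartment.oc1..example",
    display_name="my-java-app",
    language="JAVA",
    spark_version="3.2.1",                 # a version from the Spark Version list
    driver_shape="VM.Standard2.1",         # the Driver Type
    executor_shape="VM.Standard2.1",       # the Executor Type
    num_executors=2,                       # the Number of Executors
    file_uri="oci://<bucket_name>@<objectstore_namespace>/<file_name>",
    class_name="example.MainClass",        # the Main Class Name (placeholder)
)

app = df_client.create_application(app_details).data
print(app.id)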

Create a PySpark Data Flow Application

  1. Upload a PySpark script to Oracle Object Storage.
  2. On the Applications page, click Create.
  3. In the Create Application panel, select PYTHON from the Language options.
  4. Create a Python application in Data Flow.
    1. Enter a Name for the application.
    2. (Optional) Enter an appropriate Description which will help you search for it.
    3. Select the Spark Version from the drop-down list.
    4. Select the Driver Type from the drop-down list.
    5. Select the Executor Type from the drop-down list.
    6. Enter the Number of Executors you need.
    Tips For Your Application's Default Size helps you calculate the number of resources needed.
  5. Configure the application.
    1. Enter the File URL to the application, using this format:
       oci://<bucket_name>@<objectstore_namespace>/<file_name>
    2. Enter the name of the Main Python File.
    3. (Optional) Add any Arguments.
      Note

      Do not include either "$" or "/" characters in the parameter name or value.
  6. (Optional) Add advanced configuration options.
    1. Click Show Advanced Options.
    2. Enter the Key of the configuration property and a Value.
    3. Click + Another Property to add another configuration property.
    4. Repeat the previous two steps until you've added all the configuration properties.
    5. Populate Application Log Location with where you want to save the log files.
    6. Choose Network Access.
      1. If you are Attaching a Private Endpoint to Data Flow, click the Secure Access to Private Subnet radio button. Select the private endpoint from the resulting drop-down list. Click Change Compartment if it is in a different compartment to the Application.
        Note

        You cannot use an IP address to connect to the private endpoint; you must use the FQDN.
      2. If you are not using a private endpoint, click the Internet Access (No Subnet) radio button.
  7. Click Create.

If you want to change the values for Name and File URL in the future you can Edit an Application to do this. You cannot change Language if Python is selected.
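
For reference, here is a minimal sketch of what the Main Python File itself might look like. It assumes the Arguments field contains ${input} ${output}, so the script receives the resolved values as ordinary command-line arguments; the CSV-to-Parquet logic is only illustrative:

# example_app.py: minimal PySpark script for a Data Flow Application.
import sys

from pyspark.sql import SparkSession


def main():
    # Data Flow passes the resolved Arguments as command-line arguments.
    input_path, output_path = sys.argv[1], sys.argv[2]

    spark = SparkSession.builder.appName("example-dataflow-app").getOrCreate()

    df = spark.read.csv(input_path, header=True)        # assumes CSV input
    df.write.mode("overwrite").parquet(output_path)      # writes Parquet output

    spark.stop()


if __name__ == "__main__":
    main()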

Create a SQL Data Flow Application

  1. Upload a SQL script to Oracle Object Storage.
  2. On the Applications page, click Create.
  3. In the Create Application panel, select SQL from the Language options.
  4. Create the application.
    1. Enter a Name.
    2. (Optional) Enter a Description with which you can easily find the application and its runs.
    3. Select the Spark Version from the drop-down list.
    4. Select the Driver Type from the drop-down list.
    5. Select the Executor Type from the drop-down list.
    6. Enter the Number of Executors you need.
    Tips For Your Application's Default Size helps you calculate the number of resources needed.
  5. Configure the application.
    1. Enter the File URL to the SQL script using the format:
       oci://<bucket_name>@<objectstore_namespace>/<file_name>
    2. (Optional) Create any Parameters.
      1. Enter the Name and the Value of the parameter.
        Note

        Do not include either "$" or "/" characters in the parameter name or value.
      2. Click + Another Parameter to add another parameter.
      3. Repeat the previous two steps until you've added all the parameters.
  6. (Optional) Add advanced configuration options.
    1. Click Show Advanced Options.
    2. Enter the Key of the configuration property and a Value.
    3. Click + Another Property to add another configuration property.
    4. Repeat the previous two steps until you've added all the configuration properties.
    5. Populate Application Log Location with where you want to save the log files.
    6. Choose Network Access.
      1. If you are Attaching a Private Endpoint to Data Flow, click the Secure Access to Private Subnet radio button. Select the private endpoint from the resulting drop-down list.
        Note

        You cannot use an IP address to connect to the private endpoint; you must use the FQDN.
      2. If you are not using a private endpoint, click the Internet Access (No Subnet) radio button.
  7. Click Create.

If you want to change the values for Name and File URL in the future you can Edit an Application to do this. You cannot change Language if SQL is selected.

Application Parameters

If the Spark application requires one or more parameters at run time, Data Flow lets you provide the parameter name and a default value when you create the application.

For Java, Python, and Scala, during the application creation step, enter the parameter name in the Arguments field, enclose it in curly brackets and precede it with the $ symbol. For example:
${MyParameter}
Once you provide the parameter name in the Arguments field, a Parameters section with two new fields appears below it: Name and Default Value. The Name field is not editable, and contains the name of your parameter. The Default Value field is editable, and you can enter a default value for the parameter. The default value can be overridden at run time.
If you have more than one parameter, enter each parameter name, one after the other, in the Arguments field. Data Flow expects a space between each one to delimit the argument values, but be careful: the parameters are still parsed if you leave out the space. For example, if you have three parameters called Parameter1, Parameter2, and Parameter3, with values of Value1, Value2, and Value3, and enter them as follows:
${Parameter1}${Parameter2} ${Parameter3}
then the resulting arguments provided to Data Flow have only two values:
Value1Value2 Value3
which might not be what you want.

Each parameter has its own Name and Default Value fields.

For SQL applications, parameter entry does not use the ${MyParameter} format. Instead, the Parameters section provides a Name text field and a corresponding Value field. Enter the name of the parameter and its default value in the corresponding fields. If you need to add multiple parameters, click the +Add Parameters button.

Application Parameters when Running Applications

Application Parameters allow you to re-use your Data Flow Applications in a variety of different ways. See Run Applications for more information.
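
For example, assuming an Application that declares input and output parameters, a run-time override through the OCI Python SDK might look like the following sketch; the OCIDs are placeholders and the paths use the same placeholder format as above:

import oci

config = oci.config.from_file()
df_client = oci.data_flow.DataFlowClient(config)

# Override the Application's default parameter values for this Run only.
run_details = oci.data_flow.models.CreateRunDetails(
    compartment_id="ocid1.compartment.oc1..example",
    application_id="ocid1.dataflowapplication.oc1..example",
    display_name="run-with-overrides",
    parameters=[
        oci.data_flow.models.ApplicationParameter(
            name="input",
            value="oci://<bucket_name>@<objectstore_namespace>/<file_name>"),
        oci.data_flow.models.ApplicationParameter(
            name="output",
            value="oci://<bucket_name>@<objectstore_namespace>/<file_name>"),
    ],
)

run = df_client.create_run(run_details).data
print(run.lifecycle_state)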

Tips For Your Application's Default Size

Every time you run a Data Flow Application you specify a size and number of executors which, in turn, determine the number of OCPUs used to run your Spark application. An OCPU is equivalent to a CPU core, which itself is equivalent to 2 vCPUs. Refer to Compute Shapes for more information on how many OCPUs each shape contains.

A rough rule of thumb is to assume 10 GB of data processed per OCPU per hour. Optimized data formats like Parquet appear to run much faster because only a small subset of the data needs to be processed.

As an example, if we want to process 1 TB of data with an SLA of 30 minutes, we should expect to use about 200 OCPUs: 1,000 GB at 10 GB per OCPU per hour is 100 OCPU-hours, and 100 OCPU-hours divided by 0.5 hours is 200 OCPUs.

There are many ways of allocating 200 OCPUs. For example, you can select an executor shape of VM.Standard2.8 and 25 total executors for 8 * 25 = 200 total OCPUs.

This formula is a rough estimate, and your run times will vary. You can better estimate your actual workload's processing rate by loading your Application and viewing the history of Application Runs. This history shows the number of OCPUs used, the total data processed, and the run time, which lets you estimate the resources you need to meet your SLAs. From there, it is up to you to estimate the amount of data a Run will process and size the Run appropriately.
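
The rule of thumb above reduces to simple arithmetic. The sketch below only encodes the stated assumption of 10 GB per OCPU per hour; the 8-OCPU shape is the VM.Standard2.8 example from above:

# Back-of-the-envelope sizing based on 10 GB processed per OCPU per hour.
def estimate_ocpus(data_gb, sla_hours, gb_per_ocpu_hour=10):
    ocpu_hours = data_gb / gb_per_ocpu_hour
    return ocpu_hours / sla_hours

ocpus = estimate_ocpus(data_gb=1000, sla_hours=0.5)   # 1 TB in 30 minutes
print(ocpus)          # 200.0 OCPUs
print(ocpus / 8)      # 25.0 executors of an 8-OCPU shape such as VM.Standard2.8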

Develop Data Flow-compatible Spark Applications

Data Flow supports running ordinary Spark applications and has no special design-time requirements. We recommend that you develop your Spark application using Spark local mode on your laptop or similar environment. When development is complete, upload the application to Oracle Cloud Infrastructure Object Storage, and run it at scale using Data Flow.
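
For example, during development you can run the same logic in Spark local mode against a small sample before uploading it to Object Storage; the sample paths below are placeholders:

# Develop and test locally using Spark local mode before running in Data Flow.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")                # local mode: use all local cores
         .appName("local-dev-test")
         .getOrCreate())

df = spark.read.csv("./sample_input.csv", header=True)    # small local sample
df.show(5)
df.write.mode("overwrite").parquet("./sample_output")

spark.stop()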

Delete Applications

To delete an Application in Data Flow,
  1. Open the Applications page.
  2. Find the application to be deleted, and select the Delete menu item from the menu on the Application's row in the table.
Alternatively, you can click the Application's name and, on the Application's Details page, click Delete. You must own the Application or be a Data Flow Administrator to delete the Application.

When you delete an Application, its associated Runs are not deleted; the output and logs for those Runs remain available.

Application Security

Spark applications running in Data Flow use the same IAM permissions as the user who initiates the run. The Data Flow service creates a security token in the Spark cluster that allows it to assume the identity of the running user. This means the Spark application can transparently access data based on the end user's IAM permissions. There is no need to hard-code credentials in your Spark application when you access IAM-compatible systems.

If the service you are contacting is not IAM-compatible, you need to use a credential management or key management solution such as Oracle Cloud Infrastructure Key Management.
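
For example, rather than hard-coding a database password, your application can fetch it from an OCI Vault secret at run time. The sketch below uses the OCI Python SDK Secrets service and authenticates with a local configuration file for simplicity; the secret OCID is a placeholder and error handling is omitted:

import base64

import oci

config = oci.config.from_file()
secrets_client = oci.secrets.SecretsClient(config)

# Fetch the current version of a secret stored in OCI Vault.
bundle = secrets_client.get_secret_bundle(
    secret_id="ocid1.vaultsecret.oc1..example"
).data

# The secret content is returned base64-encoded.
password = base64.b64decode(bundle.secret_bundle_content.content).decode("utf-8")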

Learn more about Oracle Cloud Infrastructure IAM in the IAM Documentation.

Adding Third-Party Libraries to Data Flow Applications

Your PySpark applications might need custom dependencies in the form of Python wheels or virtual environments. Your Java or Scala applications might need extra JAR files that you can't, or don't want to, bundle in a fat JAR. You might also want to include native code or other assets to make available within your Spark runtime.

Data Flow allows you to provide a ZIP archive in addition to your application. It is installed on all Spark nodes before launching the application. If you construct it properly, the Python libraries will be added to your runtime, and the JAR files will be added to the Spark classpath. The libraries added are completely isolated to one Run. That means they do not interfere with other concurrent Runs or subsequent Runs. Only one archive can be provided per Run.

Anything in the archive must be compatible with the Data Flow runtime. For example, Data Flow runs on Oracle Linux using particular versions of Java and Python. Binary code compiled for other operating systems, or JAR files compiled for other Java versions, might cause your Run to crash. Data Flow provides tools to help you build archives with compatible software. However, these archives are ordinary Zip files, so you are free to create them any way you want. If you use your own tools, you are responsible for ensuring compatibility.

Dependency archives, like your Spark applications, are uploaded to Oracle Object Storage. Your Data Flow Application definition contains a link to this archive, which can be overridden at run time. When you run your Application, the archive is downloaded and installed before the Spark job runs. The archive is completely private to the Run. This means, for example, that you can concurrently run two different instances of the same Application, with different dependencies, without any conflicts. Dependencies do not persist between Runs, so there won't be any problems with conflicting versions for other Spark applications that you run.
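
The archive link can also be set, or overridden, when you create a Run. The following sketch assumes the OCI Python SDK exposes this as archive_uri on CreateRunDetails, which you should confirm against your SDK version; the OCIDs and object path are placeholders:

import oci

config = oci.config.from_file()
df_client = oci.data_flow.DataFlowClient(config)

# Point this Run at a specific dependency archive in Object Storage.
run_details = oci.data_flow.models.CreateRunDetails(
    compartment_id="ocid1.compartment.oc1..example",
    application_id="ocid1.dataflowapplication.oc1..example",
    display_name="run-with-custom-archive",
    archive_uri="oci://<bucket_name>@<objectstore_namespace>/archive.zip",
)

run = df_client.create_run(run_details).data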

Build a Dependency Archive Using the Data Flow Dependency Packager
  1. Download and install Docker.
  2. Download the packager tool image:
    docker pull phx.ocir.io/oracle/dataflow/dependency-packager:1.0.0
  3. For Python dependencies, create a requirements.txt file. For example, it might look like:
    numpy==1.18.1
    pandas==1.0.3
    pyarrow==0.14.0
    Note

    Do not include pyspark or py4j. These dependencies are provided by Data Flow, and including them causes your Runs to fail.
    The Data Flow Dependency Packager uses Python's pip tool to install all dependencies. If you have Python wheels that can't be downloaded from public sources, place them in a directory beneath where you build the package. Refer to them in requirements.txt with a prefix of /opt/dataflow/. For example:
    /opt/dataflow/<my-python-wheel.whl>

    where <my-python-wheel.whl> represents the name of your Python wheel. Pip sees it as a local file and installs it normally.

  4. For Java dependencies, create a file called packages.txt. For example, it might look like:
    ml.dmlc:xgboost4j:0.90
    ml.dmlc:xgboost4j-spark:0.90
    https://repo1.maven.org/maven2/com/nimbusds/nimbus-jose-jwt/8.11/nimbus-jose-jwt-8.11.jar

    The Data Flow Dependency Packager uses Apache Maven to download dependency JAR files. If you have JAR files that cannot be downloaded from public sources, place them in a local directory beneath where you build the package. Any JAR files in any subdirectory where you build the package are included in the archive.

  5. Use the Docker container to create the archive. Use this command on MacOS or Linux:
    docker run --rm -v $(pwd):/opt/dataflow -it phx.ocir.io/oracle/dataflow/dependency-packager:1.0.0
    If using Windows command prompt as the Administrator, use this command:
    docker run --rm -v %CD%:/opt/dataflow -it phx.ocir.io/oracle/dataflow/dependency-packager:1.0.0
    If using Windows Powershell as the Administrator, use this command:
    docker run --rm -v ${PWD}:/opt/dataflow -it phx.ocir.io/oracle/dataflow/dependency-packager:1.0.0
    These commands create a file called archive.zip.
  6. (Optional) Add static content. You might want to include other content in your archive, for example a data file, an ML model file, or an executable that your Spark program calls at runtime. You do this by adding files to archive.zip after you created it in the previous step.
    For Java applications:
    1. Unzip archive.zip.
    2. Add the JAR files to the java/ directory only.
    3. Zip the file.
    4. Upload it to your object store.
    For Python applications:
    1. Unzip archive.zip.
    2. Add your local modules to only these three subdirectories of the python/ directory:
       python/lib
       python/lib32
       python/lib64
    3. Zip the file.
    4. Upload it to your object store.
    Note

    Only these four directories are allowed for storing your Java and Python dependencies.
    When your Data Flow application runs, the static content is available on any node under the directory where you chose to place it. For example, if you added files under python/lib/ in your archive, they are available in the /opt/dataflow/python/lib/ directory on any node.
  7. Upload archive.zip to your object store.
  8. Add the library to your application. See Create a Java or Scala Data Flow Application or Create a PySpark Data Flow Application for how to do this.
The Structure of the Dependency Archive

Dependency archives are ordinary ZIP files. Advanced users might choose to build archives with their own tools rather than using the Data Flow Dependency Packager. A properly constructed dependency archive has this general outline:

python
python/lib
python/lib/python3.6/<your_library1>
python/lib/python3.6/<your_library2>
python/lib/python3.6/<...>
python/lib/python3.6/<your_libraryN>
python/lib/user
python/lib/user/<your_static_data>
java
java/<your_jar_file1>
java/<...>
java/<your_jar_fileN>
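
If you build the archive with your own tools, a minimal sketch of assembling the layout above in Python might look like this; the local file names are placeholders, and you remain responsible for runtime compatibility:

# Assemble a dependency archive by hand, following the layout above.
import zipfile

with zipfile.ZipFile("archive.zip", "w", zipfile.ZIP_DEFLATED) as archive:
    # Python libraries go under python/lib/python3.6/
    archive.write("my_local_module.py",
                  arcname="python/lib/python3.6/my_local_module.py")
    # Static data goes under python/lib/user/
    archive.write("model.bin", arcname="python/lib/user/model.bin")
    # JAR dependencies go under java/
    archive.write("my-library.jar", arcname="java/my-library.jar")
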
Example Requirements.txt and Packages.txt Files
This example of a requirements.txt file includes the Oracle Cloud Infrastructure SDK for Python version 2.14.3 in a Data Flow Application:
-i https://pypi.org/simple
certifi==2020.4.5.1
cffi==1.14.0
configparser==4.0.2
cryptography==2.8
oci==2.14.3
pycparser==2.20
pyopenssl==19.1.0
python-dateutil==2.8.1
pytz==2020.1
six==1.15.0
This example of a requirements.txt file includes a mix of PyPI sources, web sources, and local sources for Python wheel files:
-i https://pypi.org/simple
blis==0.4.1
catalogue==1.0.0
certifi==2020.4.5.1
chardet==3.0.4
cymem==2.0.3
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz#egg=en-core-web-sm
idna==2.9
importlib-metadata==1.6.0 ; python_version < '3.8'
murmurhash==1.0.2
numpy==1.18.3
plac==1.1.3
preshed==3.0.2
requests==2.23.0
spacy==2.2.4
srsly==1.0.2
thinc==7.4.0
tqdm==4.45.0
urllib3==1.25.9
wasabi==0.6.0
zipp==3.1.0
/opt/dataflow/mywheel-0.1-py3-none-any.whl
To connect to Oracle databases such as ADW, you need to include Oracle JDBC JAR files. Download and extract the compatible driver JAR files into a directory below where you build the package. For example, to package the Oracle 18.3 (18c) JDBC driver, ensure all these JAR files are present:
ojdbc8-18.3.jar
oraclepki-18.3.jar
osdt_cert-18.3.jar
osdt_core-18.3.jar
ucp-18.3.jar