
Exercise 1: ETL with Java

An exercise to learn how to create a Java application in Oracle Cloud Infrastructure Data Flow

Overview

The most common first step in data processing applications is to take data from some source and get it into a format that is suitable for reporting and other forms of analytics. In a database, you would load a flat file into the database and create indexes. In Spark, your first step is usually to clean and convert data from a text format into Parquet format. Parquet is an optimized binary format supporting efficient reads, making it ideal for reporting and analytics. In this exercise, we take source data, convert it into Parquet, and then do a number of interesting things with it. Our dataset is the Berlin Airbnb Data dataset, downloaded from the Kaggle website under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) "Public Domain Dedication" license. The processing in this tutorial follows that same pattern: convert the raw CSV data into Parquet, then use it for downstream analytics.

The data is provided in CSV format, and our first step will be to convert it to Parquet and store it in object storage for downstream processing. We have provided a Spark application to make this conversion, called oow-lab-2019-java-etl-1.0-SNAPSHOT.jar. Your objective is to create a Data Flow Application that runs this Spark application and to execute it with the correct parameters. Since we’re starting out, this exercise guides you step by step and provides the parameters you need. Later you will need to provide the parameters yourself, so make sure you understand what you’re entering and why.
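You don't have to write any code for this exercise, but it helps to know roughly what the conversion job does. The following is a minimal, hypothetical sketch of a converter like convert.Convert (the class shipped in oow-lab-2019-java-etl-1.0-SNAPSHOT.jar may differ in detail): it takes an input path and an output path as its two command-line arguments, reads the CSV, and writes Parquet.

    // Hypothetical sketch only; the actual convert.Convert class may differ.
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class Convert {
        public static void main(String[] args) {
            String inputPath = args[0];   // CSV source, passed as ${input}
            String outputPath = args[1];  // Parquet destination, passed as ${output}

            SparkSession spark = SparkSession.builder()
                    .appName("csv-to-parquet")
                    .getOrCreate();

            // Read the CSV with a header row, letting Spark infer column types.
            Dataset<Row> listings = spark.read()
                    .option("header", "true")
                    .option("inferSchema", "true")
                    .csv(inputPath);

            // Write the same data out as Parquet for efficient downstream reads.
            listings.write().mode("overwrite").parquet(outputPath);

            spark.stop();
        }
    }

Because the application reads its paths from the command line, the same JAR can be pointed at any input and output location, which is exactly what the ${input} and ${output} parameters below take advantage of.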

Create the Java Application

Create a Data Flow Application.

  1. Navigate to the Data Flow service in the Console by expanding the hamburger menu on the top left and scrolling to the very bottom.
  2. Highlight Data Flow, then select Applications. Choose a compartment where you want your Data Flow applications to be created. Finally, click Create Application.
  3. Select JAVA and enter a name for your Application, for example, Tutorial Example 1.
  4. Scroll down to Resource Configuration. You should leave all these values as their defaults.
  5. Scroll down to Application Configuration. Configure the application as follows:
    1. FILE URL: This is the location of the JAR file in object storage. The file is already provided for you, so you don't need to upload a copy. The location for this application is:
      oci://oow_2019_dataflow_lab@bigdatadatasciencelarge/usercontent/oow-lab-2019-java-etl-1.0-SNAPSHOT.jar
    2. MAIN CLASS NAME: Java applications need a Main Class Name, which depends on the application. For this exercise, enter
      convert.Convert
    3. ARGUMENTS: The Spark application expects two command-line parameters, one for the input and one for the output. In the ARGUMENTS field, enter
      ${input} ${output}
      You will be prompted for default values and it’s a good idea to enter these now (see below).
  6. The input and output arguments should be:

    1. Input:
      oci://oow_2019_dataflow_lab@bigdatadatasciencelarge/usercontent/kaggle_berlin_airbnb_listings_summary.csv
    2. Output:
      oci://<yourbucket>@<namespace>/optimized_listings

    Double-check your Application configuration to confirm it matches the values above. (For reference, an equivalent programmatic setup using the OCI SDK for Java is sketched after this list.)

    Note

    You will need to customize the output path to point to a bucket in your tenancy.
  7. When done, click Create. When the Application is created, you will see it in the Application list.
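The Console is all you need for this tutorial, but the same Application can also be created programmatically. The sketch below uses the OCI SDK for Java and is an illustrative outline only: the compartment OCID, shapes, Spark version, and output bucket are placeholders or assumed values, and builder field names can vary between SDK versions.

    // Hypothetical sketch: creating the same Application with the OCI SDK for Java.
    import java.util.Arrays;

    import com.oracle.bmc.auth.ConfigFileAuthenticationDetailsProvider;
    import com.oracle.bmc.dataflow.DataFlowClient;
    import com.oracle.bmc.dataflow.model.ApplicationLanguage;
    import com.oracle.bmc.dataflow.model.ApplicationParameter;
    import com.oracle.bmc.dataflow.model.CreateApplicationDetails;
    import com.oracle.bmc.dataflow.requests.CreateApplicationRequest;

    public class CreateTutorialApplication {
        public static void main(String[] args) throws Exception {
            ConfigFileAuthenticationDetailsProvider provider =
                    new ConfigFileAuthenticationDetailsProvider("~/.oci/config", "DEFAULT");
            DataFlowClient client = DataFlowClient.builder().build(provider);

            CreateApplicationDetails details = CreateApplicationDetails.builder()
                    .compartmentId("<compartment-ocid>")          // placeholder
                    .displayName("Tutorial Example 1")
                    .language(ApplicationLanguage.Java)
                    .fileUri("oci://oow_2019_dataflow_lab@bigdatadatasciencelarge/usercontent/oow-lab-2019-java-etl-1.0-SNAPSHOT.jar")
                    .className("convert.Convert")
                    .arguments(Arrays.asList("${input}", "${output}"))
                    .parameters(Arrays.asList(
                            ApplicationParameter.builder()
                                    .name("input")
                                    .value("oci://oow_2019_dataflow_lab@bigdatadatasciencelarge/usercontent/kaggle_berlin_airbnb_listings_summary.csv")
                                    .build(),
                            ApplicationParameter.builder()
                                    .name("output")
                                    .value("oci://<yourbucket>@<namespace>/optimized_listings") // placeholder
                                    .build()))
                    .driverShape("VM.Standard2.1")    // assumed; use the defaults shown in the Console
                    .executorShape("VM.Standard2.1")  // assumed
                    .numExecutors(1)                  // assumed
                    .sparkVersion("2.4.4")            // assumed; use the version shown in the Console
                    .build();

            String applicationId = client.createApplication(
                    CreateApplicationRequest.builder()
                            .createApplicationDetails(details)
                            .build())
                    .getApplication()
                    .getId();

            System.out.println("Created Application: " + applicationId);
        }
    }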

Congratulations! You have just created your first Data Flow Application. Now let’s run it.

Exercise 1: Run the Data Flow Java Application

Having created a Java application, let's run it.
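As with Application creation, the Console steps below are all you need. For reference only, a Run can also be started programmatically; this hypothetical sketch assumes you have the Application's OCID, that the default parameters configured earlier should be used, and that SDK class and field names may vary between versions.

    // Hypothetical sketch: starting a Run of the Application with the OCI SDK for Java.
    import com.oracle.bmc.auth.ConfigFileAuthenticationDetailsProvider;
    import com.oracle.bmc.dataflow.DataFlowClient;
    import com.oracle.bmc.dataflow.model.CreateRunDetails;
    import com.oracle.bmc.dataflow.requests.CreateRunRequest;

    public class RunTutorialApplication {
        public static void main(String[] args) throws Exception {
            ConfigFileAuthenticationDetailsProvider provider =
                    new ConfigFileAuthenticationDetailsProvider("~/.oci/config", "DEFAULT");
            DataFlowClient client = DataFlowClient.builder().build(provider);

            CreateRunDetails details = CreateRunDetails.builder()
                    .compartmentId("<compartment-ocid>")  // placeholder
                    .applicationId("<application-ocid>")  // placeholder: OCID of Tutorial Example 1
                    .displayName("Tutorial Example 1 Run")
                    .build();                             // omitting arguments uses the Application defaults

            String runId = client.createRun(
                    CreateRunRequest.builder().createRunDetails(details).build())
                    .getRun()
                    .getId();

            // Watch the Run's state in the Console, or poll it with client.getRun(...).
            System.out.println("Started Run: " + runId);
        }
    }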

  1. If you followed the steps precisely, all you need to do is highlight your Application in the list, click the kebab icon and click Run.
  2. You’re given the opportunity to customize parameters before running the Application. In our case, we entered the precise values ahead of time, so we can start the run by clicking Run.
  3. While the Application is running you can optionally load the Spark UI to monitor progress. From the kebab menu for the run in question, select Spark UI.

  4. You will be automatically redirected to the Apache Spark UI, which is useful for debugging and performance tuning.
  5. After a minute or so, your Run should show successful completion with a State of Succeeded.

  6. Drill into the Run to see additional details, and scroll to the bottom to see a listing of logs.

  7. When you click the spark_application_stdout.log.gz file, you can review the application's log output.

  8. You can also navigate to your output object storage bucket to confirm that new files have been created.

    These new files are used by subsequent applications; make sure you can see them in your bucket before moving on to additional exercises. If you want to double-check their contents, see the verification sketch below.
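If you want to inspect the output beyond browsing the bucket, a small Spark job can read the Parquet files back. The sketch below is a hypothetical verification step; it assumes the output path you configured for the Application and that it runs somewhere the oci:// filesystem is available, for example as another Data Flow run.

    // Hypothetical verification sketch: read the Parquet output back and inspect it.
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class VerifyOutput {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("verify-optimized-listings")
                    .getOrCreate();

            // Use the same output path you configured for the Application.
            Dataset<Row> listings = spark.read()
                    .parquet("oci://<yourbucket>@<namespace>/optimized_listings");

            System.out.println("Row count: " + listings.count());
            listings.printSchema();
            listings.show(5);

            spark.stop();
        }
    }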