Oracle Cloud Infrastructure Documentation

Exercise 2: SparkSQL Made Simple

In this exercise we will run a SQL script to perform basic profiling of the dataset we generated in Exercise 1. Make sure you completed Exercise 1 successfully before you attempt this exercise.

Exercise 2: Overview

As with other Data Flow Applications, SQL files are stored in object storage and may be shared among many SQL users. To support this, Data Flow allows you to parameterize SQL scripts and customize them at runtime. As with other applications, you can supply default values for parameters, which often serve as valuable clues to the people running these scripts.

The SQL script is available for use directly in your Data Flow Application; you do not need to create a copy of it. The script is reproduced here to illustrate a few points.

Reference text of the SparkSQL Script: SparkSQL script

Important highlights:
  1. The script begins by creating the SQL tables we need. Currently, Data Flow does not have a persistent SQL catalog, so all scripts must begin by defining the tables they require.
  2. The table's location is set to ${location}. This is a parameter that the user supplies at runtime, which gives Data Flow the flexibility to use one script to process many different locations and to share code among different users. For this lab, we must customize ${location} to point to the output location we used in Exercise 1. A minimal sketch of this kind of script follows this list.
  3. As we will see, the SQL script’s output will be captured and made available to us under the Run.
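
To make these points concrete, here is a minimal sketch of this kind of parameterized SparkSQL script. The table name, column names, and file format are illustrative assumptions; they are not the actual contents of oow_lab_2019_sparksql_report.sql.

  -- Hypothetical sketch only; names and format are assumptions.
  -- No persistent catalog, so the script defines the table it needs,
  -- pointing at the location supplied through the ${location} parameter.
  CREATE TABLE listings
  USING parquet
  LOCATION '${location}';

  -- Query output is captured and surfaced under the Run.
  SELECT COUNT(*) AS total_listings
  FROM listings;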

Exercise 2: Create a SQL Application

  1. In Data Flow, create a SQL Application: select SQL as the LANGUAGE and accept the default resources.
  2. Under Application Configuration, configure the SQL Application as follows:
    1. FILE URL: This is the location of the SQL file in object storage. The location for this application is: oci://oow_2019_dataflow_lab@bigdatadatasciencelarge/usercontent/oow_lab_2019_sparksql_report.sql
    2. PARAMETERS: The SQL script expects one parameter, the location of the output from the prior step. Click +Another Parameter and enter a parameter named location whose value is the output path you used in Exercise 1, based on the template
      oci://<bucket>@<namespace>/optimized_listings

    When you are done, confirm that your Application configuration matches the settings above.

  3. Be sure to customize the location value to a valid path in your tenancy.
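
For illustration only, suppose the location parameter were set to oci://my-bucket@my-namespace/optimized_listings (placeholder bucket and namespace, not real values). At runtime, the table definition in the script would then resolve to something like:

  -- Hypothetical resolved form; the real table name may differ.
  CREATE TABLE listings
  USING parquet
  LOCATION 'oci://my-bucket@my-namespace/optimized_listings';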

Exercise 2: Run a SQL Application

  1. Save your Application and run it from the Applications list.
    Note

    You can ignore this warning should you see it:
    -chgrp: '' does not match expected pattern for group
    Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...
  2. After your Run is complete, open the Run.
  3. Navigate to the Run logs.
  4. Open spark_application_stdout.log.gz and confirm your output agrees with the expected results summarized in step 5.
    Note

    Your rows may appear in a different order, but the values should agree.
  5. Based on our SQL profiling, we conclude that in this dataset Neukolln has the lowest average listing price at $46.57, while Charlottenburg-Wilmersdorf has the highest average at $114.27. (Note: the source dataset has prices in USD rather than EUR.)
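
These figures come from a simple aggregation over the listings data. A query of roughly the following shape produces a per-neighbourhood average price; the table and column names are assumptions, not the actual script text.

  -- Hypothetical profiling query; names are assumptions.
  SELECT neighbourhood_group,
         ROUND(AVG(price), 2) AS avg_price
  FROM listings
  GROUP BY neighbourhood_group
  ORDER BY avg_price;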

This exercise has shown some key aspects of Data Flow: once a SQL Application is in place, anyone can run it easily without worrying about cluster capacity, data access and retention, credential management, or other security considerations. For example, a business analyst can easily leverage Spark-based reporting with Data Flow.