Oracle Cloud Infrastructure Documentation

Exercise 3: Machine Learning with PySpark

This exercise also makes use of the output from Exercise 1, this time using PySpark to perform a simple machine learning task over the input data. Our objective is to identify the best bargains among the various Airbnb listings using Spark machine learning algorithms.

  1. Exercise 3: Overview
  2. Exercise 3: Create a PySpark Application
  3. Exercise 3: Run a PySpark Application

Exercise 3: Overview

A PySpark application that does this is available for you to use directly in your Data Flow Applications. You do not need to create a copy.

Reference text of the PySpark script is provided here to illustrate a few points:

A few observations from this code:
  1. The Python script expects a command-line argument (highlighted in red). When we create the Data Flow Application, we will need to create a parameter which the user will set to the input path.
  2. The script uses linear regression to predict a price per listing and determines the best bargains by subtracting the predicted price from the list price. The most negative value indicates the best value, per the model.
  3. The model in this script is extremely simplified and only takes square footage into account. In a real setting you would use more variables, such as the neighborhood and other important predictor variables.
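
The scoring described in the second observation can be made concrete with a small plain-Python sketch. It does not use Spark: it fits an ordinary least-squares line by hand instead of calling a Spark regression class, and the listings, IDs, and function names below are made up for illustration rather than taken from the lab script or dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rank_bargains(listings):
    """Return (id, price, prediction, value) tuples, best bargain first.

    value = list price - predicted price, so the most negative value
    marks the listing priced furthest below what the model expects.
    """
    slope, intercept = fit_line(
        [item["square_feet"] for item in listings],
        [item["price"] for item in listings],
    )
    scored = []
    for item in listings:
        prediction = slope * item["square_feet"] + intercept
        value = item["price"] - prediction
        scored.append((item["id"], item["price"], prediction, value))
    # Ascending sort puts the most negative value (best bargain) first.
    return sorted(scored, key=lambda row: row[3])


# Hypothetical listings, for illustration only.
LISTINGS = [
    {"id": 1, "square_feet": 500, "price": 60.0},
    {"id": 2, "square_feet": 1000, "price": 110.0},
    {"id": 3, "square_feet": 1500, "price": 160.0},
    {"id": 4, "square_feet": 2000, "price": 90.0},  # large but cheap
]

if __name__ == "__main__":
    for row in rank_bargains(LISTINGS):
        print("id=%d price=%.2f prediction=%.2f value=%.2f" % row)
```

With these made-up numbers the largest listing is priced well below the fitted line, so it sorts to the top. A single-feature model like this also flags listings whose square footage was simply entered incorrectly, which is one reason a real model would use more variables.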

Exercise 3: Create a PySpark Application

  1. Create an Application and select PYTHON as the LANGUAGE.
  2. In Application Configuration, configure the Application as follows:
    1. FILE URL: This is the location of the Python file in object storage. The location for this application is: oci://oow_2019_dataflow_lab@bigdatadatasciencelarge/usercontent/oow_lab_2019_pyspark_ml.py
    2. ARGUMENTS: The Spark app expects one command-line parameter, the input path. In the ARGUMENTS field, enter ${location}. You will be prompted for a default value. Enter the value used as the output path in Exercise 1, following the template:
      oci://<bucket>@<namespace>/optimized_listings
  3. Double-check your Application configuration to confirm it looks similar to the following:
  4. Be sure to customize the location value to a valid path in your tenancy.
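
To illustrate how the location value reaches the script, here is a minimal sketch of reading and checking a single path argument. It assumes the substituted ${location} value arrives as the first command-line argument and uses the oci://<bucket>@<namespace>/<path> form shown above. The function names and the validation rules are hypothetical, and the lab script may handle its argument differently.

```python
import sys


def parse_input_path(argv):
    """Return the input path passed as the only command-line argument."""
    if len(argv) != 2:
        raise SystemExit("usage: %s <input_path>" % argv[0])
    path = argv[1]
    if "<" in path or ">" in path:
        # Catches a template that was pasted without being customized.
        raise SystemExit("replace the <bucket> and <namespace> placeholders")
    if not path.startswith("oci://") or "@" not in path:
        raise SystemExit("expected oci://<bucket>@<namespace>/<path>, got %r" % path)
    return path


def split_oci_path(path):
    """Split an oci:// URI into (bucket, namespace, object_prefix)."""
    rest = path[len("oci://"):]
    location, _, prefix = rest.partition("/")
    bucket, _, namespace = location.partition("@")
    return bucket, namespace, prefix


def main(argv=None):
    """Print the parts of the input path; call this from your own entry point."""
    argv = sys.argv if argv is None else argv
    bucket, namespace, prefix = split_oci_path(parse_input_path(argv))
    print("bucket=%s namespace=%s prefix=%s" % (bucket, namespace, prefix))
```

For example, `main(["app.py", "oci://mybucket@mynamespace/optimized_listings"])` reports the bucket, namespace, and prefix separately, while a path that still contains the `<bucket>` placeholder exits with an error before any job work starts.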

Exercise 3: Run a PySpark Application

  1. Run the Application from the Application list.
  2. When the Run completes, open it and navigate to the logs.

  3. Open the spark_application_stdout.log.gz file. Your output should be identical to this:
  4. From this output, we see that listing ID 690578 is the best bargain, with a predicted price of $313.70 versus a list price of $35.00 and a listed area of 4639 square feet. If this sounds a little too good to be true, the unique ID allows us to drill back into the data to understand whether this really is the steal of the century. Again, a business analyst could easily consume the output of this machine learning algorithm to further their analysis.
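
If you download the log file instead of viewing it in the console, it can be read locally with the Python standard library. This is a minimal sketch that assumes only that the file is gzip-compressed text. The function name is hypothetical, and nothing is assumed about the layout of the log lines themselves.

```python
import gzip


def read_log_lines(path, limit=20):
    """Return up to `limit` lines from a gzip-compressed text log."""
    lines = []
    # "rt" opens the compressed file in text mode; undecodable bytes
    # are replaced rather than raising an error.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if len(lines) >= limit:
                break
            lines.append(line.rstrip("\n"))
    return lines
```

Calling `read_log_lines("spark_application_stdout.log.gz")` from the directory where the file was saved returns the first 20 lines, which is usually enough to see the top-ranked listings.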