Using Notebook Sessions to Build and Train Models

Once you have a notebook session created, you can write and execute Python code using the machine learning libraries in the JupyterLab interface to build and train models.

Authenticating to the Oracle Cloud Infrastructure APIs from a Notebook Session

When you are working within a notebook session, you are operating as the Linux user datascience. This user does not have an Oracle Cloud Infrastructure Identity and Access Management (IAM) identity, so it has no access to the Oracle Cloud Infrastructure API. Oracle Cloud Infrastructure resources include Data Science projects and models and the resources of other Oracle Cloud Infrastructure services, such as Object Storage, Functions, Vault, Data Flow, and so on. To access these resources from the notebook environment, you must use one of the two provided authentication approaches:

(Recommended) Authenticating Using a Notebook Session's Resource Principal

A resource principal is a feature of IAM that enables resources to be authorized principal actors that can perform actions on service resources. Each resource has its own identity, and it authenticates using the certificates that are added to it. These certificates are automatically created, assigned to resources, and rotated, avoiding the need for you to upload credentials to your notebook session.

Data Science enables you to authenticate using your notebook session's resource principal to access other Oracle Cloud Infrastructure resources. Compared to the Oracle Cloud Infrastructure configuration and key files approach, resource principals provide a more secure and easier way to authenticate to resources.

Before you can use resource principals, your tenancy administrator must write policies that grant your resource principal permission to access other Oracle Cloud Infrastructure resources. See Configuring Your Tenancy for Data Science.

You can authenticate with resource principals in a notebook session using the following interfaces:

Oracle Accelerated Data Science SDK:

Run the following in a notebook cell:

import ads
ads.set_auth(auth='resource_principal')

For details, see the Accelerated Data Science documentation.

Oracle Cloud Infrastructure Python SDK:

Run the following in a notebook cell.

import oci
from oci.data_science import DataScienceClient
rps = oci.auth.signers.get_resource_principals_signer()
dsc = DataScienceClient(config={}, signer=rps)
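
Once the client is created, you can use it to call Data Science API operations. For example, here is a minimal illustrative sketch (assuming the resource principal has been granted the necessary permissions) that lists the projects in the notebook session's compartment:

import os

# List the Data Science projects in the compartment of this notebook session.
# NB_SESSION_COMPARTMENT_OCID is one of the environment variables provided by the service.
projects = dsc.list_projects(compartment_id=os.environ["NB_SESSION_COMPARTMENT_OCID"]).data
for project in projects:
    print(project.display_name)
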
Oracle Cloud Infrastructure CLI:

Use the --auth=resource_principal flag with commands.

Note

The resource principal token is cached for 15 minutes. If you change the policy or the dynamic group, you have to wait up to 15 minutes for the change to take effect.

Important

If you don't explicitly use resource principals when invoking an SDK or the CLI, the configuration and key files approach is used by default.

(Default) Authenticating Using Oracle Cloud Infrastructure Configuration and Key Files

You can operate as your own personal IAM user by setting up configuration and key files to access Oracle Cloud Infrastructure resources. This is the default authentication approach.

To authenticate using the configuration and key file approach, you must upload an Oracle Cloud Infrastructure configuration file into the ~/.oci/ directory of the notebook session. For the relevant profile defined in the configuration file, you must also upload the associated .pem key files. After uploading these files, edit the configuration file so that the key_file entries point to the uploaded .pem files. You can do this using the JupyterLab terminal or by opening the configuration file directly in JupyterLab with its basic text editor.
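
After editing the configuration file, you can confirm from a notebook cell that the files are set up correctly. The following is a minimal sketch, assuming the DEFAULT profile:

import oci

# Load the uploaded configuration file and validate it; this raises an error if a
# required entry is missing or the key_file path is wrong.
config = oci.config.from_file("~/.oci/config", "DEFAULT")
oci.config.validate_config(config)

# Use the configuration to call an Oracle Cloud Infrastructure API, for example to
# retrieve the details of your own IAM user.
identity = oci.identity.IdentityClient(config)
print(identity.get_user(config["user"]).data.name)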

Alternatively, you can use the getting-started.ipynb notebook provided to interactively create configuration and key files. To launch the getting-started.ipynb notebook, click Getting Started in the JupyterLab Launcher tab.

Working with Existing Code Files

You can create new files or work with your own existing files.

Uploading Files

Files can be uploaded from your local machine by clicking Upload in the JupyterLab interface or by dragging and dropping files.

Creating a Key Pair in a Notebook Session to Use with a Third-Party Version Control Provider
  1. Open the JupyterLab environment.
  2. Create a ~/.ssh/config file.
  3. Add your private key to the SSH agent:
    ssh-agent bash -c 'ssh-add <path_to_your_private_key>'
Cloning a Git Repository Without an Existing Private Key

If you don't have a private key, you can create one in the notebook session by running the ssh-keygen command in the JupyterLab environment.

These instructions use a Git repository as an example though the steps are similar for other repositories. Flows between third-party version control providers and internal Git servers may differ.

  1. Copy your public key into your version control provider. For example, on GitHub.com, click Settings under your avatar menu, click SSH and GPG keys, and then click New SSH key.
  2. Go back to your notebook environment.
  3. Add your private key with:
    ssh-agent bash -c 'ssh-add <path_to_your_private_key>'

    Now that you have added your private key, a new identity has been created.

  4. Clone your repository on your notebook session environment. For example, you could clone a repository from GitHub.com.
    git clone git@github.com:<your_account>/<your_repository>
  5. If you are working with multiple version control hosts, we recommend that you edit your ~/.ssh/config file and add the information about your Git host and the location of the private key for that particular host:
    Host <your_host.com>
        IdentityFile <path_to_your_private_key>

Using Additional Terminal Commands

You can run sftp, scp, curl, wget, or rsync commands to pull files into your notebook session environment, subject to the networking limitations imposed by your VCN and subnet selection.

Using Pre-Installed Python Packages

You can use many popular open source libraries to build and train models, such as scikit-learn, TensorFlow, Keras, and pandas.

You can also use the pre-installed Oracle-developed Python library, the Accelerated Data Science (ADS) SDK.

A list of the pre-installed Python packages in Data Science is found in these files:

Tips for Using GPUs

From a JupyterLab terminal tab, you can obtain the number of GPUs attached to the notebook session and their specifications by entering this command:

nvidia-smi

The pre-installed gpustat tool allows you to monitor how the GPUs are being used while building and training machine learning models. Its statistics include utilization, memory, and power.

From a JupyterLab terminal window, enter:

gpustat 

To get utilization, memory, and power statistics, enter:

gpustat -u -p -c -i 3 --show-power

For command details, see the gpustat documentation.

Note

To free the data stored in GPU memory, we recommend that you shut down the kernel of the notebook that is running the GPU workload.

Installing Additional Python Libraries

You can install a library that's not pre-installed in the provided image.

Access to the public internet is required to install additional libraries. Install a library by opening a notebook and running this command in a cell:

%%bash
pip install <library-name>==<library-version>

Important

Data Science doesn't allow root privileges in notebook sessions. You can install libraries using only yum and pip as a normal user. Attempting to use sudo or sudo su results in errors.

You can install any open source package available on a publicly-accessible Python Package Index (PyPI) repository. You can also install private or custom libraries from your own internal repositories.
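
For example, here is a minimal sketch of installing a package from an internal repository in a notebook cell; the index URL is a placeholder for your own repository, which must be reachable from the notebook session's subnet:

%%bash
pip install --index-url https://<your_internal_repository>/simple <library-name>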

Note

The VCN or subnet that you used to create the notebook session must have network access to the source locations of the packages that you want to download and install. See Manually Configuring Your Tenancy for Data Science.

Using the Provided Environment Variables in Notebook Sessions

When you start up a notebook session, the service creates useful environment variables that you can use in your code:

NB_SESSION_COMPARTMENT_OCID

The compartment OCID of the current notebook session.

NB_SESSION_OCID

The OCID of the current notebook session.

PROJECT_OCID

The OCID of the project associated with the current notebook session.

USER_OCID

Your user OCID.

PROJECT_COMPARTMENT_OCID

The compartment OCID of the project associated with the current notebook session.

To access these environment variables in your notebook session, use the Python os library. For example:

import os
project_ocid = os.environ['PROJECT_OCID']
print(project_ocid)

Note

The NB_SESSION_COMPARTMENT_OCID and PROJECT_COMPARTMENT_OCID values don't update in a running notebook session if the resources are moved to different compartments after the notebook session is created.

Using the Oracle Accelerated Data Science SDK

The Oracle Accelerated Data Science (ADS) SDK is a Python library that is included as part of the Oracle Cloud Infrastructure Data Science service notebook session resource. ADS offers a friendly user interface that covers many of the steps involved in the lifecycle of machine learning models, from connecting to different data sources to using AutoML for model training to model evaluation and explanation. ADS also provides a simple interface to access the Oracle Cloud Infrastructure Data Science service model catalog and other Oracle Cloud Infrastructure services including object storage.
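
For example, here is a minimal sketch of loading a dataset with ADS in a notebook cell; the Object Storage path and target column are placeholders, and the notebook session's resource principal is assumed to have read access to the bucket:

import ads
from ads.dataset.factory import DatasetFactory

# Authenticate with the notebook session's resource principal.
ads.set_auth(auth='resource_principal')

# Open a CSV file from Object Storage as an ADS dataset, naming the column to predict.
ds = DatasetFactory.open("oci://<bucket_name>@<namespace>/<file_name>.csv",
                         target="<target_column>")
ds.show_in_notebook()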

Note

For complete documentation on how to use the Accelerated Data Science SDK, see Accelerated Data Science Library and Accessing the Library of Notebook Examples.

Connecting to Your Data

You can connect to your data in these ways:

Connecting to Data on Oracle Cloud Infrastructure Object Storage

To retrieve your data, you must first set up a connection to Oracle Cloud Infrastructure Object Storage.

To do this, see Authenticating to the Oracle Cloud Infrastructure APIs from a Notebook Session.

After this setup, you can use the Oracle Cloud Infrastructure Python SDK in a notebook session to retrieve your data from Oracle Cloud Infrastructure Object Storage. Also, you can use the ADS SDK to pull data from Object Storage. Example notebooks are provided in the notebook session environment to show you the necessary steps, see Accessing the Library of Notebook Examples.
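
For example, here is a minimal sketch that reads an object from Object Storage into a pandas dataframe with the Oracle Cloud Infrastructure Python SDK; the bucket and object names are placeholders, and the notebook session's resource principal is assumed to have read access to the bucket:

import io
import oci
import pandas as pd

# Authenticate with the notebook session's resource principal.
rps = oci.auth.signers.get_resource_principals_signer()
object_storage = oci.object_storage.ObjectStorageClient(config={}, signer=rps)

# Download a CSV object and load it into a pandas dataframe.
namespace = object_storage.get_namespace().data
obj = object_storage.get_object(namespace, "<bucket_name>", "<object_name>.csv")
df = pd.read_csv(io.BytesIO(obj.data.content))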

Connecting to Data on the Autonomous Data Warehouse

You can connect to the Autonomous Data Warehouse (ADW) from your notebook session. The autonomous_database.ipynb example notebook interactively illustrates this type of connection.

Note

The VCN and subnet configuration that you selected when creating your notebook session must permit access to your ADW database. Contact your IT administrator to confirm that the networking configuration you selected permits this access.

To connect to ADW and pull data into a dataframe in your notebook session:

  1. Go to the Oracle Cloud Infrastructure Console, and access your ADW instance page.
  2. Click DB Connection to download the wallet file.
  3. Create your own password for the download action.
  4. Create a folder for your wallet files on the notebook environment. We recommend that you create that folder in /home/datascience.
  5. Upload your wallet files into your <path_to_wallet_folder> folder in JupyterLab by clicking Upload.
  6. Open the sqlnet.ora file from the wallet files, then configure the METHOD_DATA to be:
    METHOD_DATA = (DIRECTORY="<path_to_wallet_folder>")
  7. Create a new notebook file, and set the following environment variables in a notebook cell:
    %env TNS_ADMIN=<path_to_wallet_folder>

    You can find the SID names in the tnsnames.ora file included in the wallet files, and then set ADW_SID as an environment variable (the SID identifies the unique name of the Oracle Database):

    %env ADW_SID=<your_SID_name>

    Retrieve your ADW user name and password from your database administrator, and then store your credentials as environment variables in your notebook session:

    %env ADW_USER=<your_username>
    %env ADW_PASSWORD=<your_password>
  8. In a separate notebook cell, run this command to test the connection to the database:
    !sqlplus $ADW_USER/$ADW_PASSWORD@$ADW_SID

    If it's successful, you should see messages similar to:

    SQL*Plus Release 19.0.0.0.0 - Production on Tue Dec 17 16:14:32 2019
     
    Copyright (c) 1982, 2019, Oracle. All rights reserved.
     
    Last Successful login time: Mon Dec 16 2019 14:19:21 -08:00
     
    Connected to:
    Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production
    Version 18.4.0.0.0
    
    SQL>
  9. Define a URI as your connection source:
    import os
    uri = f'oracle+cx_oracle://{os.environ["ADW_USER"]}:{os.environ["ADW_PASSWORD"]}@{os.environ["ADW_SID"]}'
  10. In a different cell, import sqlalchemy and pandas, and then create an engine to connect to your database. Define your SQL query and pass it to the read_sql() method that is part of the Pandas Python library:
    from sqlalchemy import create_engine 
    import pandas as pd 
    engine = create_engine(uri) 
    ds_01 = pd.read_sql('SELECT * from <table_name>', con=engine)

    The returned object is a pandas DataFrame.

You can access notebook examples within notebook sessions that show you the different steps involved in connecting and querying data from ADW and other data sources.

Connecting to Data on Oracle Cloud Infrastructure Streaming

The kafka-python client library is available in the notebook session. It is a client library for the Apache Kafka distributed stream processing system, and it lets data scientists connect to the Streaming service using its Kafka-compatible API. The streaming.ipynb notebook example, provided in the notebook session environment, walks step by step through producing and consuming messages to and from a stream (a minimal connection sketch also follows the list below). The example includes:

  • Creating a Stream Pool and a Stream
  • Storing your Streaming Credentials as Secrets in an Oracle Cloud Infrastructure Vault
  • Retrieving your Secrets from the Vault
  • Producing Messages to a Stream
  • Consuming Messages from a Stream
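
The following is a minimal connection sketch, not a replacement for the full streaming.ipynb flow. It shows how to produce a message to a stream with kafka-python over the Kafka-compatible endpoint; the bootstrap server, stream (topic) name, and credentials are placeholders that you would normally retrieve from a vault:

from kafka import KafkaProducer

# Connect to the Streaming service's Kafka-compatible endpoint with SASL/PLAIN.
producer = KafkaProducer(
    bootstrap_servers="<streaming_endpoint>:9092",
    security_protocol="SASL_SSL",
    sasl_mechanism="PLAIN",
    sasl_plain_username="<tenancy_name>/<username>/<stream_pool_ocid>",
    sasl_plain_password="<auth_token>",
)

# Produce a message to the stream; the topic name is the stream name.
producer.send("<stream_name>", value=b"hello from a notebook session")
producer.flush()
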
Connecting to Data Using Oracle Vault

In addition, you can use the Oracle Cloud Infrastructure Vault service to centrally manage the encryption keys that protect your data and the credentials that you use to securely access resources. You can use the vault.ipynb example notebook to learn how to use vaults with Data Science (a sketch of retrieving a secret follows the list below). It includes:

  • Creating a vault.
  • Creating a key.
  • Working with secrets.
  • Listing resources.
  • Deleting secrets, keys, and vaults.
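
As a minimal sketch of the retrieval step, assuming the notebook session's resource principal is authorized to read the secret, you can fetch and decode a secret with the Oracle Cloud Infrastructure Python SDK; the secret OCID is a placeholder:

import base64
import oci

# Authenticate with the notebook session's resource principal.
rps = oci.auth.signers.get_resource_principals_signer()
secrets_client = oci.secrets.SecretsClient(config={}, signer=rps)

# Retrieve the current version of the secret and decode its base64-encoded content.
bundle = secrets_client.get_secret_bundle(secret_id="<secret_ocid>")
secret_value = base64.b64decode(bundle.data.secret_bundle_content.content).decode("utf-8")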

Accessing the Library of Notebook Examples

There are various interactive example notebooks, included in the notebook session environment, that you can use to learn how to use Data Science.

These examples are designed to help you understand the various features of Data Science and help you solve business problems.

Use Opening Notebook Sessions to open the JupyterLab environment. The Launcher tab is opened by default. You can access the library of notebook examples and choose one to run in two ways:

  • Click the File menu, select New, and then select Notebook Examples.

  • In the Launcher tab, click Notebook Examples.

You can open a new Launcher tab by clicking the plus sign (+) in the File Browser. In the Launcher tab, you can click these buttons to:

Python 3

Opens a new notebook for you to edit. It contains useful tips that explain:

  • Everything previously stored in your /home/datascience folder is now stored on your block volume drive.

  • Check for Public Internet Access:

    import requests
    response=requests.get("https://oracle.com")
    assert response.status_code==200, "Internet connection failed"
  • Oracle Cloud Infrastructure Configuration and Key Files Set Up:

    Follow the instructions in the Getting Started notebook, which is accessible in the Launcher tab by clicking Getting Started.

  • Typical Cell Imports and Settings:

    %load_ext autoreload
    %autoreload 2
    %matplotlib inline
    
    import warnings
    warnings.filterwarnings('ignore')
    
    import logging
    logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)
    
    import ads
    from ads.dataset.factory import DatasetFactory
    from ads.automl.provider import OracleAutoMLProvider
    from ads.automl.driver import AutoML
    from ads.evaluations.evaluator import ADSEvaluator
    from ads.common.data import MLData
    from ads.explanations.explainer import ADSExplainer
    from ads.explanations.mlx_global_explainer import MLXGlobalExplainer
    from ads.explanations.mlx_local_explainer import MLXLocalExplainer
    from ads.catalog.model import ModelCatalog
    from ads.common.model_artifact import ModelArtifact
  • Useful Environment Variables

    import os
    print(os.environ["NB_SESSION_COMPARTMENT_OCID"])
    print(os.environ["PROJECT_OCID"])
    print(os.environ["USER_OCID"])
    print(os.environ["TENANCY_OCID"])
    print(os.environ["NB_REGION"])
Getting Started

Starts the getting-started.ipynb notebook example, which sets up the necessary Oracle Cloud Infrastructure configuration and key files. The notebook also shows you how to use resource principals to authenticate to the Oracle Cloud Infrastructure API from a notebook session through the Accelerated Data Science (ADS) SDK, Python SDK, and CLI interfaces, so that you don't have to create configuration and key files in each notebook session. Run the example in each JupyterLab session so that you have access to Oracle Cloud Infrastructure services.

If this button isn't displayed, then click getting-started.ipynb in the File Browser.

Note

If you don't use resource principals, you have to run this notebook every time you launch a new notebook session to get new Oracle Cloud Infrastructure configuration and key files. Alternatively, you can copy the .oci folder from one notebook session to a new one. Either method creates the necessary configuration and key files.

Notebook Examples

Provides a list of the notebook examples for you to select and open. Each example opens in a new tab when loaded. Overview of the Notebook Examples contains a list of the examples that you can choose to learn from.

If this button isn't displayed, then click the File menu, select New, and then select Notebook Examples to choose a notebook.

Each example is loaded with a temporary file name that you can change by right-clicking the tab name and clicking Rename Notebook.... Enter a new name, and then click Rename.

Terminal

Opens a system shell terminal in a new tab so that you can enter and test code for your notebooks.

Text File

Opens a text file in a new tab.

Show Contextual Help

Provides JupyterLab help about a function you select.

Overview of the Notebook Examples

Use the example notebooks to learn about how to model numerous dataset types in multiple ways.

Most of the examples include training a model, using it to make predictions, and evaluating how well the model generalizes to the problem.

The following list describes each example notebook and what it shows you how to do:
adsdataset_working_with.ipynb

Explore the ADSDataset class. This class is a flexible container for working with data. It is backed by Dask and allows distributed operations on the data.

autonomous_database.ipynb

Connect to an Oracle Autonomous Database (ADB) and execute queries with SQLAlchemy, pandas, and cx_Oracle. This notebook shows you how to load a table from the database using the ADS DatasetFactory object and how to create a new table in the database.

binary_classification_attrition.ipynb

Use a binary classification model to predict employee attrition using modules and functions in ADS.

binary_classification_transportation.ipynb

Build a binary classification model to predict the departure delay of flights using the Oracle ADS DatasetFactory, H2O AutoML, Oracle ADS AutoML, Oracle ADS Evaluator, and Oracle ADS Explainer tools.

class_weighting.ipynb

Use custom scoring functions to enable class-weighted variants of regular scoring functions with Oracle AutoML.

classification_adult.ipynb

Build a classifier using the OracleAutoMLProvider tool for a public Census Income dataset.

classification_wine.ipynb

Build a classifier that explores the various options of the OracleAutoMLProvider tool on a public Italian wine varietal dataset. It also shows how to use the ADSEvaluator class.

data_visualizations.ipynb

Review the comprehensive overview of the different visualization capabilities provided by ADS, including smart data visualizations for columns of different data types.

dataflow.ipynb

Use the ADS Data Flow module that enables you to use the Data Flow service in a notebook session. It allows you to launch and monitor Spark jobs from the notebook session environment.

datasetfactory_loading_data.ipynb

Use the ADSDataset module to connect to a variety of data sources and read data in different formats with the DatasetFactory.open() function.

mlx_ale.ipynb

Explore accumulated local effects (ALEs) as a model explanation diagnostic.

mlx_classification_attrition.ipynb

Use global and local explanation techniques on a binary classification use case.

mlx_classification_wine.ipynb

Use global and local explanation techniques on a multinomial classification dataset.

mlx_pdp_vs_ale.ipynb

Compare partial dependence plots (PDPs) and ALEs.

mlx_regression_housing.ipynb

Perform exploratory data analysis to understand the data using the Boston housing dataset.

mlx_whatif.ipynb

Explore the What-if scenario tool that is integrated into the model explanation toolkit.

model_catalog.ipynb

Use CRUD (create, read, update, delete) operations on models: prepare model artifacts, save models and their metadata to the model catalog, and list, load, and delete models from the model catalog.

model_deployment.ipynb

Deploy models to the Oracle Functions service using Oracle Cloud Shell.

model_evaluation.ipynb

Use the ADSEvaluator class, which is the ML evaluation component of the ADS package. It shows you how to evaluate any general class of supervised machine learning models and how to compare models within the same class.

model_from_other_library.ipynb

Convert a model into an ADSModel object. The ADSModel object unlocks a series of functionalities, including model evaluation through ADSEvaluator and model explanations through ADSExplainer. You can also use the convenient .save() method of an ADSModel object to save the model artifact to the model catalog without having to write or modify any inference script or serialize the model object.

project.ipynb

Reviews notebook concepts and how to use them to manipulate the oci and ads libraries.

pyod_classification_fraud.ipynb

Create a supervised anomaly detection use case using the preinstalled PyOD library.

streaming.ipynb

Use the kafka-python client to connect to Oracle Cloud Infrastructure Streaming service by leveraging the Streaming Kafka compatible API. It shows you how to consume and produce messages using the kafka-python client.

transforming_data.ipynb

Use some of the many ways to clean and transform data in an ADS ADSDataset object.

vault.ipynb

Create a vault, a key, and a secret that is encrypted with that key. The notebook also demonstrates how to retrieve the secret so that it can be used in a notebook, how to update the secret, and basic operations such as listing and deleting vaults, keys, and secrets.