Model Catalog

Overview

The model catalog provides a managed, centralized storage space for models. It ensures that model artifacts are immutable, and it allows data scientists to share models and reproduce them as needed.

The model catalog can be accessed directly in a notebook session with ADS, or in the Oracle Cloud Infrastructure Console by going to the Data Science Projects page, selecting a project, and then clicking Models. The Models page is the model catalog.

After a model and its artifacts are stored in the model catalog, they become available for other data scientists who are working on the same project and have access to the Oracle Cloud Infrastructure compartment the model is stored in.

Data scientists can:

  • List, read, download, and load models from the catalog to their own notebook sessions.

  • Download the model artifact from the catalog and run the model on their laptop or some other machine.

  • Deploy the model artifact as an Oracle Function.

A model can be saved to the model catalog using the Generic Model approach or the ADSModel approach:

  • The Generic Model approach creates a Generic Model artifact using prepare_generic_model() and saves it to the catalog.

  • The ADSModel approach prepares an artifact from the ADSModel object and saves it to the catalog. ADSModel objects are typically created from the AutoML engine. Data scientists can also convert models trained with other machine learning libraries into an ADSModel object (using from_estimator()).

With either approach, a model artifact containing the model is pushed to the model catalog.

Note

  1. ADS is closed source, so you can only use an ADSModel object inside the Data Science service. If you want to export a model outside of the service, use the Generic Model approach.

  2. The Generic Model approach is generic and covers any type of model. The ADSModel approach works with ADS-supported libraries, such as scikit-learn, Keras, XGBoost, and LightGBM. For details on creating ADSModel objects, see Training Models.

  3. Using an ADSModel gives you access to all the features of ADS, such as model evaluation (ADSEvaluator), model explanation (ADSExplainer), and so on. If you are using ADS-supported libraries and can work in a Data Science notebook session, then create and use an ADSModel.

Using the Model Catalog

There are two approaches to save models in the model catalog.

Training a Simple Model

In this approach, a model is created and then saved to the model catalog. The iris dataset is used; it is available through the DatasetBrowser class and the scikit-learn library.

Datasets are provided as a convenience in the notebook session. Datasets are considered Third Party Content and are not considered Materials under Your agreement with Oracle applicable to the Services. See the dataset license.

import os
from ads.dataset.dataset_browser import DatasetBrowser

sklearn = DatasetBrowser.sklearn()
sklearn.list()
['boston', 'breast_cancer', 'diabetes', 'iris', 'wine', 'digits']
ds = sklearn.open('iris')

Split the dataset into training and testing sets. Then train a random forest model on the training dataset.

#Import necessary models from scikit-learn
from sklearn.ensemble import RandomForestClassifier

#Transform dataset and split dataset into test and train sets
transformed_ds = ds.auto_transform(fix_imbalance=False)
train, test = transformed_ds.train_test_split(test_size=0.3)

#Fit a random forest model on training data
rf_clf = RandomForestClassifier().fit(train.X.values, train.y.values)

A random forest model has been trained using scikit-learn. Next, the steps needed to save that model object to the model catalog are explained.

Preparing and Saving a Model to the Model Catalog

Generic Model Approach

With the Generic Model approach, the model can be exported outside of the Data Science service and run on another machine. It can still be used within the Data Science service.

The first step is to save the model object to disk in a directory in your notebook session. If scikit-learn was used to create the model, then joblib is a good option. If a Keras model is to be saved, the model.save(filepath) method saves the model as a single HDF5 file. Pickle and cPickle are not recommended methods to save a Keras model.

A scikit-learn example:

import joblib
import os

model_path = os.path.join('/', 'home', 'datascience', 'my_model')
os.makedirs(model_path, exist_ok=True)
joblib.dump(rf_clf, os.path.join(model_path,'model.pkl'))

There is now a pickle file in /home/datascience/my_model/model.pkl.
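Before preparing the artifact, you can confirm that the serialized file round-trips correctly. This is a minimal, self-contained sketch: it trains a small stand-in model and writes to a temporary directory rather than reusing rf_clf and /home/datascience/my_model.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the rf_clf trained above.
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Same dump step as above, against a throwaway directory.
model_path = tempfile.mkdtemp()
pkl_path = os.path.join(model_path, 'model.pkl')
joblib.dump(clf, pkl_path)

# Round-trip check: the reloaded model should predict identically.
reloaded = joblib.load(pkl_path)
assert (reloaded.predict(X) == clf.predict(X)).all()
```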

Next, use ADS to prepare the additional files that are required to build the model artifact and to ensure that the model can run in Oracle Cloud Infrastructure or be exported elsewhere.

The model catalog is a managed storage service for artifacts. It contains not just the artifact, but also metadata about the model, including its provenance.

Template files for the model artifact files can be generated using the prepare_generic_model() function in ADS.

from ads.common.model_export_util import prepare_generic_model
model_artifact = prepare_generic_model(model_path)

The only required argument to the prepare_generic_model() function is the local path to the folder where the artifacts are stored before they are pushed to the model catalog.

Note

We recommend that you use the home directory, /home/datascience, to store all data, including artifacts. A notebook session can be deactivated at any time to change its compute shape or storage size. When the notebook session is reactivated, the contents of the home folder remain intact, but the boot volume is reinitialized and any data stored outside the home folder is lost.

The prepare_generic_model() function generates a score.py file that contains functions to load the model into memory (load_model()) and to make inferences (predict()). It also generates a requirements.txt file that contains a list of Python dependencies for the model’s runtime environment.

The load_model() function in score.py has to be modified to read the model object. There is also the option to add pre/post inference logic in predict().

This is the score.py file that has been modified to read the scikit-learn model that was saved to disk:

import json
import os
from cloudpickle import cloudpickle

model_pickle_name = 'model.pkl'

"""
Inference script. This script is used for prediction by scoring server when schema is known.
"""
def load_model(model_file_name=model_pickle_name):
    """
    Loads model from the serialized format

    Returns
    -------
    model:  a model instance on which predict API can be invoked
    """
    model_dir = os.path.dirname(os.path.realpath(__file__))
    contents = os.listdir(model_dir)
    if model_file_name in contents:
        with open(os.path.join(os.path.dirname(os.path.realpath(__file__)), model_file_name), "rb") as file:
            return cloudpickle.load(file)
    else:
        raise Exception('{0} is not found in model directory {1}'.format(model_file_name, model_dir))


def predict(data, model=load_model()):
    """
    Returns prediction given the model and data to predict

    Parameters
    ----------
    model: Model instance returned by load_model API
    data: Data format as expected by the predict API of the core estimator. For example, in the case of scikit-learn models, it could be a numpy array, a list of lists, or a pandas DataFrame

    Returns
    -------
    predictions: Output from scoring server
        Format: {'prediction':output from model.predict method}

    """

    from pandas import read_json, DataFrame
    from io import StringIO
    X = read_json(StringIO(data)) if isinstance(data, str) else DataFrame.from_dict(data)
    return {'prediction':model.predict(X).tolist()}
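The payload handling at the start of predict() accepts either a JSON string or a dict-like object. That conversion can be exercised standalone before the full artifact is tested; the sketch below replicates just the input-parsing line, and the column names (sepal_len, sepal_wid) are illustrative only.

```python
from io import StringIO

from pandas import DataFrame, read_json

def to_frame(data):
    # Same conversion used in predict(): a JSON string goes through
    # read_json; anything dict-like goes through DataFrame.from_dict.
    return read_json(StringIO(data)) if isinstance(data, str) else DataFrame.from_dict(data)

json_payload = '[{"sepal_len": 4.7, "sepal_wid": 3.2}]'
dict_payload = {"sepal_len": [4.7], "sepal_wid": [3.2]}

# Both payload styles produce a one-row, two-column DataFrame.
assert to_frame(json_payload).shape == (1, 2)
assert to_frame(dict_payload).shape == (1, 2)
```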

The requirements.txt file contains a base set of common Python dependencies for the model’s runtime environment, for example, numpy==1.17.0 and pandas==0.24.2. With the Generic Model approach, ADS does not automatically populate the requirements file with all the libraries necessary for the model.

You have to modify the requirements file to include the libraries (and library versions) that are needed. Keeping track of the dependencies helps you run the model outside the Data Science service.
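For the scikit-learn model in this example, the completed file might look like the following. The version pins are illustrative; pin the versions actually installed in your notebook environment (for example, as reported by pip freeze).

```text
numpy==1.17.0
pandas==0.24.2
scikit-learn==0.21.3
cloudpickle==1.2.2
```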

After modifying score.py, the resulting artifact can be tested by loading the artifact into memory with reload() and calling the predict() function on some test data:

model_artifact.reload()
model_artifact.model.predict([[4.7,3.2,1.3,0.2]])

If the model behaves as expected it is ready to be saved to the model catalog. This can be done by calling the save() method of the artifact object.

To save the artifact, the model must be assigned a project and compartment OCID. An OCID is an Oracle assigned unique ID. All projects, notebook sessions, and models have OCIDs.

Note

These are useful environment variables that are accessible in the notebook session:

  • PROJECT_OCID: Project OCID attached to this notebook session.

  • NB_SESSION_OCID: OCID of the notebook session.

  • NB_SESSION_COMPARTMENT_OCID: Compartment OCID of the notebook session.

  • USER_OCID: Your user OCID.

A model can be saved in the same compartment as the notebook session, and in the same project. However, it is also possible to save the model in a different compartment or project.

In this example, the model is saved with the same compartment OCID and project OCID as the notebook session. In addition to these IDs, the model can be given a name, description, and the name of the notebook or script the model was trained in.

compartment_id = os.environ['NB_SESSION_COMPARTMENT_OCID']
project_id = os.environ["PROJECT_OCID"]

# Saving the model artifact to the model catalog:
mc_model = model_artifact.save(
  project_id=project_id,
  compartment_id=compartment_id,
  display_name="my_generic_model",
  description="Saving and loading generic model",
  training_script_path="ADS_Generic_Model_Artifact.ipynb",
  ignore_pending_changes=True
)
mc_model

After the model is saved in the model catalog, you can click the “Models” link on your project page to see the model details. You can:

  • Export and download the model artifact (a zip archive) from the Console UI.

  • Export and download the artifact using the Oracle Cloud Infrastructure CLI or SDK.

  • Load the model in a different notebook session by calling the model catalog.

  • Let other data scientists load the model in their notebook session by calling the model catalog.

ADSModel Approach

Another way to save a model to the catalog is to convert it to an ADSModel object. ADS supports models from the scikit-learn, XGBoost, Keras, and LightGBM libraries. Use the from_estimator() method to perform the conversion; see Training Models for more details on creating ADSModel objects.

#Convert the scikit-learn model to an ADSModel instance
from ads.common.model import ADSModel
rf_model = ADSModel.from_estimator(rf_clf)

After an ADSModel has been created, the model artifact has to be prepared and then saved to the model catalog. The model artifact is a ZIP archive of all the files necessary to load and run the model in a different notebook session. The model catalog is a managed storage service for those artifacts; it contains the artifact and metadata about the model, including its provenance.

Load the necessary modules to prepare and save an ADSModel object:

from ads.catalog.model import ModelSummaryList, ModelCatalog
from ads.catalog.project import ProjectSummaryList, ProjectCatalog
from ads.catalog.summary import SummaryList
from ads.common.model_artifact import ModelArtifact

To prepare the artifact for the model catalog, call the prepare() method of an ADSModel object. As with the Generic Model approach, the only required argument to prepare() is the local path to the folder where you would like to store the artifact.

In addition, the prepare() method captures and saves a sample of both the input data provided to and the output returned by the predict() function defined in score.py. Optionally, you can add a sub-directory (fn-folder/) to the artifact that contains all the necessary files for deployment to Oracle Functions. This option is not selected by default.

Prepare the model artifact for the ADSModel case:

path_to_model_artifact = os.path.join('/', 'home', 'datascience', 'model')
model_artifact = rf_model.prepare(path_to_model_artifact, force_overwrite=True)

You can see the files in the model artifact with:

!ls /home/datascience/model
func.py
func.yaml
requirements.txt
runtime.yaml
model.pkl
score.py
(data-sample.json)
(schema.json)

The model artifact contains the following files:

  • The func.py script contains fn specific handling of the input/output data.

  • The func.yaml script contains fn specific versioning information.

  • The requirements.txt file contains a list of required packages for the model. It is auto-generated by the ADSModel.prepare() method, but if you add any preprocessing to score.py, you should include the relevant packages here.

  • The runtime.yaml file that documents the current notebook environment.

  • The model.pkl file is a pickled, serialized version of your ADSModel object, without any ADS dependencies.

  • The score.py inference script is used to perform model inference. You can modify predict() to add custom logic or data transformations before or after the model estimator object’s inference endpoint is called.

  • The data-sample.json file is optionally generated and contains an example of the JSON formatting required by the handler in func.py. You can edit this file.

  • The schema.json file is optionally generated and contains metadata about the input data (feature types, column names, example output, and so on).

When an ADSModel object is saved, score.py, requirements.txt, the serialized model object (model.pkl), and an additional runtime.yaml file (which is not used) are automatically generated. With the ADSModel approach, the artifact can be stored directly in the model catalog without modifying any of the artifact files.

Note

ADSModel objects cannot be exported outside of the Data Science service.

To save an artifact, use the save() method.

compartment_id = os.environ['NB_SESSION_COMPARTMENT_OCID']
project_id = os.environ["PROJECT_OCID"]

# Saving the model artifact to the model catalog:
mc_model = model_artifact.save(project_id=project_id, compartment_id=compartment_id, display_name="random forest model on iris dataset",
description="random forest model on iris dataset", training_script_path="model_catalog.ipynb", ignore_pending_changes=True)

Note

When the notebook file is in a git repo, ADS automatically extracts the model provenance metadata. The provenance information includes the relevant git information to ensure that the model can be reproduced by pulling the exact file and version that the model was trained in. Provenance metadata includes:

  • git_branch: The working branch of the repository.

  • git_commit: The latest commit ID.

  • repository_url: The associated URL of the remote repo. For example, git@github.com:<account>/<repo>.

  • script_dir: The directory containing the relevant score.py script of your artifact.

  • training_script: The path to notebook or Python script that the save() method was called in.

Note

ADS does not enforce a git commit while saving the model to the catalog. It is best to create a git commit prior to saving the model.
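Committing before calling save() ensures the captured git_commit points at code that actually matches the trained model. The snippet below is illustrative only: it creates a throwaway repository to show the commit that the provenance metadata would capture; in practice you would run the git add/commit/rev-parse commands inside your existing notebook repository, and the file name and commit message are placeholders.

```shell
set -e
# Throwaway repo purely for demonstration.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name "You"

# Commit the notebook before calling save(), so provenance
# metadata references a real commit.
echo "demo" > model_catalog.ipynb
git add model_catalog.ipynb
git commit -q -m "Train random forest on iris; prepare model artifact"

# This commit ID is what would be captured as git_commit.
git rev-parse HEAD
```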

After the model is saved in the model catalog, you can click Models on the project page to see a list of models.

Modifying an Artifact

After you modify one of the files in the model artifact, you have to commit the changes by calling reload().

The score.py file is one of the model artifact files.

If you want to change the function _handle_input(data) in score.py, you could use:

def _handle_input(data):
    """
    Convert input data to the format that model expects

    Parameters
    ----------
    data: input data

    Returns
    -------
    transformed_input: Should match the input format for model.predict() API
    """
    #############################
    # To be updated if required #
    #############################
    return data

For example, you can modify the function to convert the input data into a pandas DataFrame:

def _handle_input(data):
    """
    Convert input data to the format that the model expects

    Parameters
    ----------
    data: input data

    Returns
    -------
    transformed_input: Should match the input format for model.predict() API
    """
    import pandas as pd
    return pd.DataFrame(data, columns=['sepal_length_(cm)', 'sepal_width_(cm)', 'petal_length_(cm)', 'petal_width_(cm)'])
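A modified _handle_input() like this one can be checked in isolation before you commit it to the artifact. This self-contained sketch defines the same function and verifies that a raw row of iris measurements comes back as a correctly labeled DataFrame.

```python
import pandas as pd

def _handle_input(data):
    # Same conversion as the modified score.py function: wrap raw rows
    # in a DataFrame with the iris feature columns.
    return pd.DataFrame(data, columns=['sepal_length_(cm)', 'sepal_width_(cm)',
                                       'petal_length_(cm)', 'petal_width_(cm)'])

frame = _handle_input([[4.7, 3.2, 1.3, 0.2]])
assert frame.shape == (1, 4)
assert list(frame.columns) == ['sepal_length_(cm)', 'sepal_width_(cm)',
                               'petal_length_(cm)', 'petal_width_(cm)']
```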

Call reload() to commit the changes to the artifact:

model_artifact.reload()

Next, save the changes back to the catalog by calling model_artifact.save().

Note

Model artifacts stored in the model catalog are immutable. The model metadata can be edited, but the existing model artifact cannot be. To save changes to a model artifact, the model catalog creates a new entry for that model. You can keep track of model versions using tagging on the models (for example, model-run:1, version:3). Models can be filtered in the model catalog UI using tag values.

Listing Models in the Catalog

Each model in the model catalog has a unique OCID. The OCID can be found by clicking the model name in the model catalog. Within a notebook, ADS can be used to specify the OCID of the model to load as follows:

Create an instance of the ModelCatalog object.

mc = ModelCatalog(compartment_id = os.environ['NB_SESSION_COMPARTMENT_OCID'])

Call list_models() to see a list of the models.

my_model_list = mc.list_models()
my_model_list

In this example, my_model_list is an instance of the ModelSummaryList object. There are several operations that you can perform on it. For example, you can filter the list so that only models created by a specific user are shown.

# TODO: Update the your_user_name with your login name
your_user_name = 'insert-your-user-name'
my_new_list = []

if your_user_name != 'insert-your-user-name':
    my_new_list = my_model_list.filter(lambda x: x.user_name==your_user_name)

You can convert the ModelSummaryList into a pandas dataframe and perform operations on it.

list_df = my_model_list.df
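Once you have the dataframe, standard pandas operations apply. The column names below (display_name, lifecycle_state, user_name) are illustrative assumptions; inspect list_df.columns for the actual schema in your ADS version. The sketch builds a stand-in dataframe so it runs outside a notebook session.

```python
import pandas as pd

# Stand-in for my_model_list.df; real column names may differ by ADS version.
list_df = pd.DataFrame({
    'display_name': ['my_generic_model', 'rf_iris', 'old_model'],
    'lifecycle_state': ['ACTIVE', 'ACTIVE', 'INACTIVE'],
    'user_name': ['alice', 'bob', 'alice'],
})

# Keep only active models, then count them per user.
active = list_df[list_df['lifecycle_state'] == 'ACTIVE']
per_user = active.groupby('user_name').size()

assert len(active) == 2
assert per_user['alice'] == 1
```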

Downloading a Model Artifact

Use the download_model() method of a ModelCatalog object to download a model artifact to your notebook session. download_model() requires a model OCID and a target directory for the artifact file. The following downloads the first model from the model list that you created. If you know the OCID of your model, you can pass it directly to download_model().

path_to_my_artifact = os.path.join('/', 'home', 'datascience', 'download_model')
mc.download_model(my_model_list[0].id, path_to_my_artifact, force_overwrite=True)
Artifact directory: /home/datascience/download_model
Contains: ['func.py', 'func.yaml', '.model-ignore', 'model.pkl', 'requirements.txt', 'runtime.yaml', 'score.py']

Loading the Model and Calling the Inference Method

Load the model artifact.

from ads.common.model_artifact import ModelArtifact
model_artifact = ModelArtifact(path_to_my_artifact)

Load the artifact and extract the model object directly. model_artifact.model is the model object returned by load_model(). It can be a generic model trained with Keras or TensorFlow, or it can be an ADSModel object. In this example, it is an ADSModel object:

model_artifact.reload()
rf_model = model_artifact.model
type(rf_model)
ads.common.model.ADSModel

To make new predictions, use predict():

model_artifact.model.predict(test.X.values[1:10])

Other Operations on Models and the Model Catalog

The following are some additional operations that you can perform after connecting to the model catalog using:

mc = ModelCatalog(compartment_id = os.environ['NB_SESSION_COMPARTMENT_OCID'])

Deleting a Model

Delete a model by calling delete_model() and specifying the model OCID.

mc.delete_model(mc.list_models()[0].id)

Retrieving a Model

get_model() is used to retrieve a model. It requires the model’s OCID. You can obtain that by using the Oracle Cloud Infrastructure Console to look in the model catalog, or by using the list_models() method.

retrieved_model = mc.get_model(mc.list_models()[0].id)

Deactivating a Model

The deactivate() method marks a model as Inactive, which is similar to archiving it. Deactivated models are not returned by default queries to the model catalog, although their artifacts are not deleted.

retrieved_model.deactivate()
retrieved_model.commit()

Note

After updating a model in the notebook, you must call commit() to push the changes to the model catalog. Otherwise, the model is not updated in the model catalog.

Editing the Model Metadata

You can modify a model’s display_name, description, and provenance_metadata.

retrieved_model.display_name  = "new name"
retrieved_model.description  = "new description"
retrieved_model.commit()

Display Model Provenance Metadata

If you are using git to track your model changes, ADS automatically extracts the latest commit information and stores that information as part of the model provenance metadata.

retrieved_model.provenance_metadata
{
  "git_branch": "master",
  "git_commit": "bb72983dda7cedc29dea24830c1775ee0c5f1ccd",
  "repository_url": "ssh://git@example.com:/project/MLNotebook.git",
  "script_dir": "/home/datascience/MLNotebook/model",
  "training_script": "example.ipynb"
}