Training Models

Oracle AutoML


Oracle AutoML automates the machine learning experience. It replaces the laborious and time-consuming tasks of the data scientist, whose workflow is as follows:

  1. Select a model from a large number of viable candidate models.

  2. For each model, tune the hyperparameters.

  3. Select only predictive features to speed up the pipeline and reduce over-fitting.

  4. Ensure the model performs well on unseen data (also called generalization).


Oracle AutoML automates this workflow and provides you with an optimal model given a time budget. In addition to incorporating these typical machine learning workflow steps, Oracle AutoML is also optimized to produce a high quality model very efficiently. This is achieved by the following:

  • Scalable design: All stages in the Oracle AutoML Pipeline exploit both inter-node and intra-node parallelism, improving scalability and reducing runtime.

  • Intelligent choices reduce trials in each stage: Algorithms and parameters are chosen based on dataset characteristics. This ensures that the model is both accurate and selected efficiently. This is achieved with the use of meta-learning throughout the pipeline. Meta-learning is used in:

    • Algorithm selection to select an optimal model class.

    • Adaptive sampling to identify the optimal set of samples.

    • Feature selection to determine the ideal feature subset.

    • Hyperparameter optimization to tune the hyperparameter values.

The following topics describe the Oracle AutoML Pipeline and individual stages of the pipeline in more detail.
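For example, the following is a minimal sketch of invoking Oracle AutoML through the ADS SDK. The ads.automl interface shown here and the training split named train are assumptions for illustration; see the pipeline topics for the exact API.

import logging

from ads.automl.provider import OracleAutoMLProvider
from ads.automl.driver import AutoML

# Build the Oracle AutoML provider (assumed interface).
ml_engine = OracleAutoMLProvider(n_jobs=-1, loglevel=logging.ERROR)

# `train` is assumed to be an ADS training dataset split (features and target).
oracle_automl = AutoML(train, provider=ml_engine)

# Ask for the best model within a time budget in seconds (parameter name assumed).
model, baseline = oracle_automl.train(time_budget=160)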

Keras

Familiarize yourself with Keras by reviewing About Keras.

By default, Keras uses TensorFlow as the backend.
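You can confirm which backend is active (assuming a standard Keras installation):

import keras

# Prints the name of the active backend, typically 'tensorflow'.
print(keras.backend.backend())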

These examples examine a binary classification problem that predicts churn. This is a common type of problem that can be solved using Keras, TensorFlow, and scikit-learn.

You need to load the dataset by pulling the data from GitHub, and then cache it locally so that it loads faster after the first time:

from os import path
import numpy as np
import pandas as pd
import requests

import logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

churn_data_file = '/tmp/churn.csv'
if not path.exists(churn_data_file):
    # fetch and save the data
    print('fetching data from web...', end=" ")
    r = requests.get('https://github.com/darenr/public_datasets/raw/master/churn_dataset.csv')
    with open(churn_data_file, 'wb') as fd:
        fd.write(r.content)
    print("Done")


df = pd.read_csv(churn_data_file)
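
A quick sanity check confirms the data loaded as expected; the Exited column is the churn label used later in this example:

# Inspect the shape of the data and the distribution of the churn label.
print(df.shape)
print(df['Exited'].value_counts())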

Import Keras, and import the scikit-learn functions used to prepare the data and generate metrics. Most of these tasks can be done using ADS. For example, ADS can open datasets and split them into training and test sets. This example demonstrates how to do these tasks with external libraries:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, roc_auc_score

from keras.models import Sequential
from keras.layers import Dense

The first step is data preparation. From the pandas.DataFrame, you extract the X-values and Y-values as numpy arrays. The feature selection is performed manually. This example is designed to show how ADS is used with external libraries, but this whole section can be replaced with ADS AutoML. AutoML does not design network architectures. Instead, it creates models using packages such as xgboost, lightgbm, and sklearn estimators.

The next step is feature encoding, using the sklearn LabelEncoder to convert categorical variables into ordinal numbers (‘red’, ‘green’, ‘blue’ -> 0, 1, 2).

Once the variables have been encoded in a form compatible with Keras, the data is split in an 80/20 ratio. The model is trained on the 80% and tested on the 20% to see how well it generalizes to unseen examples.

feature_name = ['CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance',
     'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']

response_name = ['Exited']
data = df[[val for sublist in [feature_name, response_name] for val in sublist]].copy()

# Encode the category columns
for col in ['Geography', 'Gender']:
  data.loc[:, col] = LabelEncoder().fit_transform(data.loc[:, col])

# Do an 80/20 split for the training and test data
train, test = train_test_split(data, test_size=0.2, random_state=42)

# Scale the features and split the features away from the response
sc = StandardScaler() # Feature Scaling
X_train = sc.fit_transform(train.drop('Exited', axis=1).to_numpy())
X_test = sc.transform(test.drop('Exited', axis=1).to_numpy())
y_train = train.loc[:, 'Exited'].to_numpy()
y_test = test.loc[:, 'Exited'].to_numpy()

Next, you design and code the neural network architecture. It is a sequential model with an input layer of 10 nodes, followed by two hidden layers of 255 densely connected nodes each, using the ReLU activation function. The output layer has a single node with a sigmoid activation function because the model performs binary classification. The optimizer is Adam and the loss function is binary cross-entropy. The model is optimized on the accuracy metric. Training takes several minutes to run.

keras_classifier = Sequential()
keras_classifier.add(Dense(units=255, kernel_initializer='uniform', activation='relu', input_dim=10))
keras_classifier.add(Dense(units=255, kernel_initializer='uniform', activation='relu'))
keras_classifier.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))
keras_classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

keras_classifier.fit(X_train, y_train, batch_size=10, epochs=25)
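
You can print the model's layer structure with summary() to verify the architecture described above:

# Prints one row per layer with its output shape and parameter count.
keras_classifier.summary()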

To evaluate this model, you could use sklearn or ADS.

This example uses sklearn:

y_pred = keras_classifier.predict(X_test)
y_pred = (y_pred > 0.5)

cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred)

print("confusion_matrix:\n", cm)
print("roc_auc_score", auc)

This example uses the ADS evaluator package:

from ads.common.model import ADSModel
from ads.evaluations.evaluator import ADSEvaluator
from ads.common.data import MLData

clf = ADSModel.from_estimator(keras_classifier)
evaluator = ADSEvaluator(test, models=[clf], training_data=X_train)
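
As in the combined example at the end of this section, you can then display the evaluation with:

evaluator.show_in_notebook()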

Scikit-Learn

The sklearn pipeline can be used to build a model on the same churn dataset that was used in the Keras section. The pipeline allows the model to contain multiple stages and transformations. In a more sophisticated example, there would be pipeline stages for feature encoding, scaling, and so on. In this pipeline example, a LogisticRegression estimator is used:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline_classifier = Pipeline(steps=[
  ('clf', LogisticRegression())
])

pipeline_classifier.fit(X_train, y_train)
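
As a sketch of the more sophisticated case mentioned above, additional stages can be chained ahead of the estimator. The scaler stage here is hypothetical and would be redundant in this example because X_train was already scaled with StandardScaler:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical multi-stage pipeline: scale the features, then fit the classifier.
multi_stage_classifier = Pipeline(steps=[
  ('scaler', StandardScaler()),
  ('clf', LogisticRegression())
])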

You can evaluate this model using sklearn or ADS.
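
For example, using the sklearn metrics imported earlier:

# Score the pipeline on the held-out test set from the Keras example.
y_pred = pipeline_classifier.predict(X_test)

print("confusion_matrix:\n", confusion_matrix(y_test, y_pred))
print("roc_auc_score", roc_auc_score(y_test, y_pred))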

XGBoost

Familiarize yourself with XGBoost by reviewing XGBoost Documentation.

Import XGBoost, and then create and train an XGBClassifier:

from xgboost import XGBClassifier

xgb_classifier = XGBClassifier(nthread=1)
xgb_classifier.fit(eval_train.X, eval_train.y)
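
For a quick check outside of ADS, you can score the model with the same sklearn metrics, assuming the eval_test split used below exposes its features and labels as eval_test.X and eval_test.y in the same way as eval_train:

# Score the XGBoost model on the evaluation split.
y_pred = xgb_classifier.predict(eval_test.X)

print("confusion_matrix:\n", confusion_matrix(eval_test.y, y_pred))
print("roc_auc_score", roc_auc_score(eval_test.y, y_pred))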

From these three estimators, you create three ADSModel objects: a Keras classifier, a sklearn pipeline with a single LogisticRegression stage, and an XGBoost model:

from ads.common.model import ADSModel
from ads.evaluations.evaluator import ADSEvaluator
from ads.common.data import MLData

keras_model = ADSModel.from_estimator(keras_classifier)
lr_model = ADSModel.from_estimator(pipeline_classifier)
xgb_model = ADSModel.from_estimator(xgb_classifier)

evaluator = ADSEvaluator(eval_test, models=[keras_model, lr_model, xgb_model], training_data=eval_train)
evaluator.show_in_notebook()