Oracle AutoML automates the machine learning experience. It replaces the laborious and time-consuming tasks of the data scientist, whose typical workflow is as follows:
Select a model from a large number of viable candidate models.
For each model, tune the hyperparameters.
Select only predictive features to speed up the pipeline and reduce over-fitting.
Ensure the model performs well on unseen data (also called generalization).
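The manual workflow these steps describe can be sketched with scikit-learn. This is a hedged illustration only (synthetic data, illustrative parameter grids), showing feature selection, model tuning, and a held-out generalization check; Oracle AutoML automates and scales all of these choices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Manually search over feature subsets and hyperparameters
search = GridSearchCV(
    Pipeline([('select', SelectKBest(f_classif)),          # feature selection
              ('clf', LogisticRegression(max_iter=500))]),  # candidate model
    param_grid={'select__k': [5, 10, 20], 'clf__C': [0.1, 1.0, 10.0]},
    cv=3)
search.fit(X_tr, y_tr)

# Check generalization on data the search never saw
print(search.best_params_, round(search.score(X_te, y_te), 3))
```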
Oracle AutoML automates this workflow and provides you with an optimal model given a time budget. In addition to incorporating these typical machine learning workflow steps, Oracle AutoML is also optimized to produce a high-quality model very efficiently. This is achieved by the following:
Scalable design: All stages in the Oracle AutoML Pipeline exploit both inter-node and intra-node parallelism, improving scalability and reducing runtime.
Intelligent choices reduce trials in each stage: Algorithms and parameters are chosen based on dataset characteristics. This ensures that an accurate model is selected efficiently. This is achieved with the use of meta-learning throughout the pipeline. Meta-learning is used in:
Algorithm selection to choose an optimal model class.
Adaptive sampling to identify the optimal set of samples.
Feature selection to determine the ideal feature subset.
The following topics describe the Oracle AutoML Pipeline and individual stages of the pipeline in more detail.
- The Oracle AutoML Pipeline
- Building a Classifier using OracleAutoMLProvider
Keras is an open source neural network library. It can run on top of TensorFlow, Theano, and Microsoft Cognitive Toolkit. By default, Keras uses TensorFlow as the backend. Keras is written in Python, but it has support for R and PlaidML. You can familiarize yourself with Keras by reviewing About Keras.
These examples examine a binary classification problem predicting churn. This is a common type of problem that can be solved using Keras.
If the data is not cached, it is pulled from GitHub, cached, and then loaded.
from os import path

import numpy as np
import pandas as pd
import requests
import logging

logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

churn_data_file = '/tmp/churn.csv'
if not path.exists(churn_data_file):
    # fetch and save some data
    print('fetching data from web...', end=" ")
    r = requests.get('https://github.com/darenr/public_datasets/raw/master/churn_dataset.csv')
    with open(churn_data_file, 'wb') as fd:
        fd.write(r.content)
    print("Done")

df = pd.read_csv(churn_data_file)
Keras must be imported, and scikit-learn is imported to generate metrics. Much of the data preprocessing and modeling could be done using the ADS library. However, the following example demonstrates how to do these tasks with external libraries:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score
from keras.models import Sequential
from keras.layers import Dense
The first step is data preparation. From the pandas.DataFrame, you extract the X and y values as numpy arrays. The feature selection is performed manually. The next step is feature encoding using LabelEncoder. This converts categorical variables into integer codes ('blue', 'green', 'red' –> 0, 1, 2; codes are assigned in sorted label order) so that they are compatible with Keras. The data is then split using an 80/20 ratio. The training is performed on 80% of the data. Model testing is performed on the remaining 20% of the data to evaluate how well the model generalizes.
feature_name = ['CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance',
                'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']
response_name = ['Exited']
data = df[[val for sublist in [feature_name, response_name] for val in sublist]].copy()

# Encode the category columns
for col in ['Geography', 'Gender']:
    data.loc[:, col] = LabelEncoder().fit_transform(data.loc[:, col])

# Do an 80/20 split for the training and test data
train, test = train_test_split(data, test_size=0.2, random_state=42)

# Scale the features and split the features away from the response
sc = StandardScaler()  # Feature Scaling
X_train = sc.fit_transform(train.drop('Exited', axis=1).to_numpy())
X_test = sc.transform(test.drop('Exited', axis=1).to_numpy())
y_train = train.loc[:, 'Exited'].to_numpy()
y_test = test.loc[:, 'Exited'].to_numpy()
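The encoding behavior can be checked directly. This standalone sketch (using only scikit-learn, with illustrative color labels) shows that LabelEncoder assigns codes in sorted label order, not in order of appearance:

```python
from sklearn.preprocessing import LabelEncoder

# Codes follow sorted label order: 'blue' -> 0, 'green' -> 1, 'red' -> 2
le = LabelEncoder().fit(['red', 'green', 'blue'])
codes = le.transform(['red', 'green', 'blue'])
print(le.classes_.tolist(), codes.tolist())  # ['blue', 'green', 'red'] [2, 1, 0]
```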
Following is a depiction of the neural network architecture. It is a sequential model with an input layer of 10 nodes. It has two hidden layers, each with 255 densely connected nodes and the ReLU activation function. The output layer has a single node with a sigmoid activation function because the model performs binary classification. The optimizer is Adam, and the loss function is binary cross-entropy. The model is optimized on the accuracy metric. This takes several minutes to run.
keras_classifier = Sequential()
keras_classifier.add(Dense(units=255, kernel_initializer='uniform', activation='relu', input_dim=10))
keras_classifier.add(Dense(units=255, kernel_initializer='uniform', activation='relu'))
keras_classifier.add(Dense(units=1, kernel_initializer='uniform', activation='sigmoid'))
keras_classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
keras_classifier.fit(X_train, y_train, batch_size=10, epochs=25)
To evaluate this model, you could use sklearn or ADS. This example uses sklearn:
y_pred = keras_classifier.predict(X_test)
y_pred = (y_pred > 0.5)

cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred)
print("confusion_matrix:\n", cm)
print("roc_auc_score", auc)
This example uses the ADS evaluator package:
from ads.common.model import ADSModel
from ads.evaluations.evaluator import ADSEvaluator
from ads.common.data import MLData

eval_test = MLData.build(X=pd.DataFrame(sc.transform(test.drop('Exited', axis=1)), columns=feature_name),
                         y=pd.Series(test.loc[:, 'Exited']),
                         name='Test Data')
eval_train = MLData.build(X=pd.DataFrame(sc.transform(train.drop('Exited', axis=1)), columns=feature_name),
                          y=pd.Series(train.loc[:, 'Exited']),
                          name='Training Data')
clf = ADSModel.from_estimator(keras_classifier, name="Keras")
evaluator = ADSEvaluator(eval_test, models=[clf], training_data=eval_train)
An sklearn Pipeline can be used to build a model on the same churn dataset that was used in the Keras section. A pipeline allows the model to contain multiple stages and transformations. Generally, there would be pipeline stages for feature encoding, scaling, and so on. In this pipeline example, a
LogisticRegression estimator is used:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline_classifier = Pipeline(steps=[
    ('clf', LogisticRegression())
])
pipeline_classifier.fit(X_train, y_train)
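To illustrate the multi-stage aspect of pipelines, the following is a hedged sketch (synthetic data, illustrative step names) that folds scaling into the pipeline as its own stage, so that fit and predict apply both steps:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Two stages: scaling, then classification; fit() runs both in order
scaled_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
scaled_pipeline.fit(X, y)
print(round(scaled_pipeline.score(X, y), 3))
```

Because the scaler is part of the pipeline, calling predict on new data automatically applies the same transformation that was fit on the training data.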
You can evaluate this model using
sklearn or ADS.
XGBoost is an optimized, distributed gradient boosting library designed to be efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as Gradient Boosting Decision Tree (GBDT) or Gradient Boosting Machines (GBM)) and can be used to solve a variety of data science problems. The code runs, unmodified, on several distributed environments (Hadoop, SGE, MPI) and can process billions of observations. You can familiarize yourself with XGBoost by reviewing the XGBoost Documentation.
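The core idea behind gradient boosting can be sketched in a few lines of numpy. This is an illustrative toy only, not the XGBoost implementation: each round fits a depth-1 "stump" to the current residuals (the negative gradient of the squared-error loss) and takes a small step, so the ensemble's error shrinks round by round.

```python
import numpy as np

def best_stump(x, residual):
    """Find the threshold split minimizing squared error on the residual."""
    best = None
    for t in np.unique(x):
        left, right = residual[x <= t], residual[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1], best[2], best[3]

def boost(x, y, n_rounds=50, lr=0.1):
    """Additively combine stumps, each fit to the current residuals."""
    pred = np.full_like(y, y.mean(), dtype=float)
    for _ in range(n_rounds):
        t, left_val, right_val = best_stump(x, y - pred)
        pred += lr * np.where(x <= t, left_val, right_val)
    return pred

x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x)
pred = boost(x, y)
print("boosted MSE:", round(float(np.mean((y - pred) ** 2)), 4))
```

XGBoost generalizes this with regularized objectives, second-order gradients, and parallel tree construction.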
Import XGBoost, then create and fit a classifier:
from xgboost import XGBClassifier

xgb_classifier = XGBClassifier(nthread=1)
xgb_classifier.fit(eval_train.X, eval_train.y)
From the three estimators, you create three ADSModel objects: a Keras classifier, an sklearn pipeline with a single LogisticRegression stage, and an XGBoost classifier:
from ads.common.model import ADSModel
from ads.evaluations.evaluator import ADSEvaluator
from ads.common.data import MLData

keras_model = ADSModel.from_estimator(keras_classifier)
lr_model = ADSModel.from_estimator(pipeline_classifier)
xgb_model = ADSModel.from_estimator(xgb_classifier)

evaluator = ADSEvaluator(eval_test, models=[keras_model, lr_model, xgb_model], training_data=eval_train)
evaluator.show_in_notebook()