AutoML

The AutoMLx Python package automatically creates, optimizes, and explains machine learning pipelines and models. The AutoML pipeline finds and tunes the best model for a given training dataset and prediction task. Its simple pipeline-level Python API jump-starts the data science process with an accurate, tuned model. AutoML supports the following tasks:

  1. Supervised classification or regression on tabular datasets, where the target is a binary or multi-class value (classification) or a real-valued column (regression).

  2. Supervised classification for Image and Text datasets.

  3. Unsupervised anomaly detection, where the target or the labels are not provided.

  4. Univariate and multivariate (single or multiple targets) time series forecasting.

  5. Recommendation, based on data of interactions between users and items.

The AutoML pipeline consists of five major stages: preprocessing, algorithm selection, adaptive sampling, feature selection, and model tuning.

These pieces are readily combined into a simple AutoML pipeline which automatically optimizes the whole pipeline with limited user input/interaction.
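As a rough illustration of how little input the pipeline needs, the sketch below assembles arguments taken from the Pipeline signature documented in the next section. It assumes the package is importable as automlx with Pipeline exposed at the top level, and that any engine initialization your installation requires has already been done. run_automl is a hypothetical helper name, not part of the library.

```python
# Keyword arguments taken from the documented Pipeline signature.
pipeline_kwargs = {
    "task": "classification",
    "score_metric": "f1_macro",
    "n_algos_tuned": 2,
    "model_list": ["LogisticRegression", "RandomForestClassifier"],
    "max_tuning_trials": {"LogisticRegression": 100, "RandomForestClassifier": 200},
    "random_state": 7,
}


def run_automl(X_train, y_train, X_test):
    """Hypothetical helper: fit a pipeline and return predictions.

    Assumes automlx is installed and exposes Pipeline at the top level.
    """
    import automlx  # imported lazily so the kwargs above can be inspected without it

    pipeline = automlx.Pipeline(**pipeline_kwargs)
    pipeline.fit(X_train, y_train)
    return pipeline.predict(X_test)
```

Here n_algos_tuned equals len(model_list), which, per the parameter description, disables algorithm selection so both listed models are tuned.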

Pipeline

Pipeline ( task = 'classification' , dataset_format = 'pandas' , score_metric = None , random_state = 7 , n_algos_tuned = 1 , model_list = None , preprocessing = True , search_space = None , max_tuning_trials = None , search_strategy = 'HyperGD' , ** kwargs )

Create AutoMLPipeline based on task and dataset type

Parameters
  • task ( str , default='classification' ) – Machine learning task, supported: classification, regression, anomaly_detection, forecasting, recommendation

  • dataset_format ( str , default='pandas' ) – Determine the type of input/output dataset. Defaults to pandas

  • score_metric ( str , callable , tuple , list or None , default=None ) –

    One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.

    • If None : it will be determined automatically depending on the task. Default score metrics:

      classification (binary) – neg_log_loss

      classification (multiclass) – neg_log_loss

      regression – neg_mean_squared_error

      forecasting – neg_sym_mean_abs_percent_error

      anomaly_detection – unsupervised_unify95

      recommendation – hit_rate

    • If a list: should be a list of str, callable or tuple. The first score metric in the list will be the one for which the pipeline optimizes.

    • If a callable: score function (or loss function) with signature score_func(model, X, y) .

    • If a tuple: should be a tuple with two values with types (str, callable). The string corresponds to the name of the scoring metric, and the callable should have the same signature as above.

    • If a string: automatically infers the scoring metric from the string. Supported values, by target type:

      unsupervised – unsupervised_unify95, unsupervised_unify95_log_loss

      continuous_forecast – neg_sym_mean_abs_percent_error, neg_root_mean_squared_percent_error, neg_mean_abs_scaled_error, neg_root_mean_squared_error, neg_mean_squared_error, neg_max_absolute_error, neg_mean_absolute_error, neg_max_abs_error, neg_mean_abs_error

      binary – neg_log_loss, roc_auc, accuracy, f1, precision, recall, f1_micro, f1_macro, f1_weighted, f1_samples, recall_micro, recall_macro, recall_weighted, recall_samples, precision_micro, precision_macro, precision_weighted, precision_samples

      multiclass – neg_log_loss, accuracy, f1_micro, f1_macro, f1_weighted, f1_samples, recall_macro, recall_micro, recall_weighted, recall_samples, precision_micro, precision_macro, precision_weighted, precision_samples

      More information on these scoring metrics can be found under Classification metrics. Note: scoring variations such as recall_macro are equivalent to sklearn.metrics.recall_score(..., average="macro").

      continuous – neg_mean_squared_error, r2, neg_mean_absolute_error, neg_mean_squared_log_error, neg_median_absolute_error

      More information on these scoring metrics can be found under Regression metrics.

      recommendation – hit_rate, hits, precision, recall, map, ndcg, auc

      More information on these scoring metrics can be found in the documentation of the AutoRecommender class.

  • random_state ( int , default=7 ) – Random seed used by AutoML.

  • n_algos_tuned ( int , default=1 ) –

    Number of algorithms that are optimized by the AutoML pipeline. Higher values might make AutoML more accurate at the expense of runtime.

    • To disable algorithm selection, set n_algos_tuned = len(model_list) .

  • model_list ( List [ Model | str | Any ] or None , default=None ) –

    Models that will be evaluated by the Pipeline. Users can specify built-in models by name or pass in any custom model. Custom models must have their hyperparameter configuration spaces defined in search_space. Custom models for regression and classification must implement the scikit-learn-style fit and predict methods. Classification models must also support predict_proba. Anomaly detection models must follow the pyod interface. By default, all supported built-in models for the given task are used. Supported built-in models per task:

    classification – CatBoostClassifier, DecisionTreeClassifier, ExtraTreesClassifier, GaussianNB, KNeighborsClassifier, LGBMClassifier, LogisticRegression, RandomForestClassifier, SVC, TorchMLPClassifier, XGBClassifier

    regression – AdaBoostRegressor, DecisionTreeRegressor, ExtraTreesRegressor, KNeighborsRegressor, LGBMRegressor, LinearRegression, LinearSVR, RandomForestRegressor, SVR, TorchMLPRegressor, XGBRegressor

    anomaly_detection – ClusteringLocalFactorOD, HistogramOD, IsolationForestOD, KNearestNeighborsOD, MinCovOD, OneClassSVMOD, PrincipalCompOD, AutoEncoder

    forecasting – NaiveForecaster, ThetaForecaster, ExpSmoothForecaster, ETSForecaster, STLwESForecaster, STLwARIMAForecaster, SARIMAXForecaster, VARMAXForecaster, DynFactorForecaster, ExtraTreesForecaster, XGBForecaster, LGBMForecaster

    recommendation – AlsRecommender, ItemKNNRecommender, BprRecommender, TRexxRecommender

  • preprocessing ( bool , default=True ) –

    Determines whether or not AutoML pre-processes the dataset. This setting should only be used by advanced users.

    • If True, the auto-preprocessor runs on the dataset to normalize the data. Categorical features are label encoded and numeric features are normalized to a mean of 0 and a variance of 1 using sklearn.preprocessing.StandardScaler . Features with more than 20 percent missing values are ignored. The remaining missing values are imputed by the mean for numeric features and the mode for categorical features.

    • If False, the user must cleanse (and normalize, if desired) the dataset before passing it to AutoML. NaNs in the dataset are not allowed and will produce a ValueError . AutoML leaves it to the underlying algorithm implementations to handle strings (encoding strings is recommended).

  • search_space ( dict or None , default=None ) –

    This parameter defines the Model Tuning search space. It is required for custom models. Dictionary keys are algorithm names (str), and each value is the search space for that algorithm. Four types of values are supported:

    • Type 1: each hyperparameter maps to a dictionary with two keys:
      1. 'range', which is a list containing the range.

      2. 'type', which is one of 'continuous', 'discrete', 'categorical'.

      For example, if the user wishes to provide a custom tuning search space for LogisticRegression:

      search_space = {
          'LogisticRegression': {
              'C': {
                  'range': [0.03125, 512],
                  'type': 'continuous'
              },
              'solver': {
                  'range': ['newton-cg', 'lbfgs', 'liblinear', 'sag'],
                  'type': 'categorical'
              },
              'class_weight': {
                  'range': [None, 'balanced'],
                  'type': 'categorical'
              }
          }
      }

    • Type 2: fixed values, where each hyperparameter is set to a single value.

      For example, if the user wishes to fix hyperparameters for LogisticRegression:

      search_space = {
          'LogisticRegression': {
              'C': 0.03125,
              'solver': 'newton-cg'
          }
      }

    • Type 3: If the search space of a model is an empty dictionary, Model Tune is disabled for that model.

    • Type 4: mixed configuration, where some hyperparameters are fixed and others have a search space. For example:

      search_space = {
          'LogisticRegression': {
              'C': 0.03125,
              'solver': 'newton-cg',
              'class_weight': {
                  'range': [None, 'balanced'],
                  'type': 'categorical'
              }
          }
      }

    • To disable Model Tune for all models, set search_space = {} .

    • If None , the default search space defined inside AutoML is used.

    • If all the hyperparameters are fixed for a model, the HyperParameterOptimization step will be skipped for that model. Otherwise, the remaining non-fixed parameters will be tuned.

  • max_tuning_trials ( int , dict or None , default=None ) –

    The maximum number of HPO trials; this limit may be exceeded slightly.

    • If None : AutoML automatically determines when enough HPO trials have been completed.

    • If an integer: the maximum number of trials for each algorithm. That is, if n_algos_tuned == 2 , then up to 2 * max_tuning_trials are performed in total.

    • If a dict : specifies this parameter per algorithm, e.g., {'LogisticRegression': 100, 'RandomForestClassifier': 200} . Missing values in the dictionary default to None.

  • search_strategy ( str , default='HyperGD' ) – The search strategy used in Model Tune. Valid search_strategy values: TPESampler, BruteForceSampler, CmaEsSampler, MOTPESampler, NSGAIISampler, QMCSampler, RandomSampler, HyperGD

  • kwargs ( Any ) –

    Optional arguments. You can find a list of arguments related to each task in their configure method:

    • automlx.AutoClassifier.configure for ‘classification’

    • automlx._interface.regressor.AutoRegressor.configure for ‘regression’

    • automlx._interface.anomaly_detector.AutoAnomalyDetector.configure for ‘anomaly_detection’

    • automlx._interface.forecaster.AutoForecaster.configure for ‘forecasting’

Raises

AutoMLxValueError – If the given task is not supported or the provided dataset format is not supported.

Returns

An AutoMLPipeline for the given task:

  • automlx._interface.classifier.AutoClassifier for ‘classification’

  • automlx._interface.regressor.AutoRegressor for ‘regression’

  • automlx._interface.anomaly_detector.AutoAnomalyDetector for ‘anomaly_detection’

  • automlx._interface.forecaster.Forecaster for ‘forecasting’

  • automlx.express.recommender.AutoRecommender for ‘recommendation’

Return type

AutoMLPipeline
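The four kinds of search_space values described under the search_space parameter can be told apart mechanically. The following is a minimal sketch in plain Python; classify_entry and its return labels are hypothetical names used only for illustration and are not part of AutoMLx.

```python
def classify_entry(model_space):
    """Label a per-model search space with one of the four documented kinds."""
    if not model_space:
        return "disabled"  # Type 3: empty dict turns Model Tune off for this model

    def is_range(value):
        # A tunable hyperparameter is a dict carrying both 'range' and 'type'.
        return isinstance(value, dict) and "range" in value and "type" in value

    tunable = [is_range(value) for value in model_space.values()]
    if all(tunable):
        return "tuned"     # Type 1: every hyperparameter has a range
    if not any(tunable):
        return "fixed"     # Type 2: every hyperparameter is pinned to one value
    return "mixed"         # Type 4: some pinned, some tuned


search_space = {
    "LogisticRegression": {
        "C": 0.03125,
        "solver": "newton-cg",
        "class_weight": {"range": [None, "balanced"], "type": "categorical"},
    },
    "SVC": {},
}

kinds = {name: classify_entry(space) for name, space in search_space.items()}
print(kinds)
```

A check like this can catch a misspelled 'range' or 'type' key before a long tuning run, since such an entry would be reported as fixed rather than tuned.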

AutoClassifier

class AutoClassifier

Classifier AutoMLPipeline

classes_

Holds the label for each class (for task=classification only, otherwise it is set to None ).

Type

List[Any]

selected_features_names_

Names of the engineered features selected by the AutoML pipeline.

Type

List[ str ]

selected_features_names_raw_

Names of the original features selected by the AutoML pipeline. If preprocessing is disabled, then this corresponds to selected_features_names_ ; otherwise, a raw feature is considered selected if at least one of the features engineered from it is selected.

Type

List[ str ]

ranked_models_

List of model names ranked in order of their quality from the last fit call.

Type

List[ str ]

selected_model_

Name of the best model selected by AutoML.

Type

str

selected_model_params_

Dictionary containing the optimal hyperparameters for the selected model. Keys are hyperparameter names with their corresponding values.

Type

dict

selected_rows_

List of indices in the original train dataset provided to AutoML corresponding to the rows sampled during Adaptive Sampling. In the case of CV, this attribute will result in a list of lists corresponding to indices selected in each fold. For example, in the case of no CV, this attribute looks like: [0, 1, 5] , indicating indices 0, 1, and 5 have been selected during adaptive sampling. In the case of CV=3, this attribute looks like: [ [0, 1], [0, 5], [1, 5] ] , indicating indices 0,1 were selected from the first fold, 0,5 were selected in the 2nd fold, and 1,5 were selected in the 3rd fold.

Type

list
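Because selected_rows_ is a flat list without CV and a list of lists with CV, downstream code may want a single shape. The sketch below uses the two example values from the description above; rows_per_fold is a hypothetical helper, not an AutoMLx function.

```python
from collections import Counter


def rows_per_fold(selected_rows):
    """Return the sampled indices as a list of per-fold lists."""
    if selected_rows and isinstance(selected_rows[0], (list, tuple)):
        return [list(fold) for fold in selected_rows]  # CV: already one list per fold
    return [list(selected_rows)]                       # no CV: wrap as a single fold


no_cv = [0, 1, 5]
cv3 = [[0, 1], [0, 5], [1, 5]]

# Count in how many folds each original row index was sampled.
usage = Counter(index for fold in rows_per_fold(cv3) for index in fold)
print(rows_per_fold(no_cv), dict(usage))
```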

selected_valid_rows_

List of indices in the original validation dataset (if CV==None ) provided to AutoML corresponding to the rows sampled during Adaptive Sampling. If CV is not None, the returned value is always None given that Adaptive Sampling does not sample the validation set when CV is enabled.

Type

list

pipelines_

Sorted list of pipelines (length equal to n_algos_tuned ), with 0th element being the best model.

Type

list

completed_trials_summary_

All trials performed by the AutoML Pipeline. The columns are: Algorithm, # Samples, # Features, Mean Validation Score, Hyperparameters, Runtime, Memory Usage, Step, where the hyperparameters are a dict.

Type

pandas.DataFrame

completed_trials_detailed_

A DataFrame indicating all trials performed by the AutoML Pipeline with all score metrics’ values. The columns are: Algorithm, # Samples, # Features, Hyperparameters, Features, Step, Scoring Metric, CV Fold ID, Score, Runtime, Memory Usage, # Epochs.

Type

pandas.DataFrame

n_jobs_

Parallelism internally used by AutoML. Calculated as inter_model_parallelism*intra_model_parallelism .

Type

int

feature_importances_

Importance of each feature in the dataset for the selected model

Type

numpy.ndarray of shape (n_features,)

threshold_tuning_score_

The validation score of the pipelines after applying threshold tuning. The scoring metric used to select this threshold can be found in threshold_tuning_scorer_ . It is None when the task is not classification or threshold_tuning is False.

Type

List[Dict[ str , float ]]

threshold_tuning_scorer_

The scoring metric used to select threshold during threshold tuning. It is None when the task is not classification or threshold_tuning is False.

Type

Metric

configure ( self , score_metric = None , random_state = None , n_algos_tuned = None , model_list = None , adaptive_sampling = None , min_features = None , optimization = None , preprocessing = None , search_space = None , min_class_instances = None , max_tuning_trials = None , search_strategy = None , threshold_tuning = None )

Configure the AutoClassifier

If an argument is set to None, its value is not changed: the previously configured value is kept or, if none was set, the default value is used.
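The None semantics can be modelled as a merge in which None means "leave as is". This is a sketch of the documented behaviour only, not AutoMLx code; apply_configure is a hypothetical name, and DEFAULTS lists a few of the "if not previously set" defaults given in the parameter descriptions below.

```python
# A subset of the documented defaults for AutoClassifier.configure.
DEFAULTS = {
    "random_state": 7,
    "n_algos_tuned": 1,
    "adaptive_sampling": True,
    "min_class_instances": 5,
    "search_strategy": "HyperGD",
    "threshold_tuning": False,
}


def apply_configure(current, **updates):
    """Return a new settings dict; arguments passed as None keep their current value."""
    merged = dict(current)
    for name, value in updates.items():
        if value is not None:
            merged[name] = value
    return merged


settings = apply_configure(DEFAULTS, n_algos_tuned=3)
# A later call that passes None does not reset n_algos_tuned to its default.
settings = apply_configure(settings, n_algos_tuned=None, threshold_tuning=True)
print(settings["n_algos_tuned"], settings["threshold_tuning"])
```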

Parameters
  • score_metric ( str , callable , tuple , list or None , default=None ) –

    One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.

    • If None : it will be determined automatically depending on the task. Default score metrics: binary: neg_log_loss, multiclass: neg_log_loss

    • If a list: should be a list of str, callable or tuple. The first score metric in the list will be the one for which the pipeline optimizes.

    • If a callable: score function (or loss function) with signature score_func(model, X, y) .

    • If a tuple: should be a tuple with two values with types (str, callable). The string corresponds to the name of the scoring metric, and the callable should have the same signature as above.

    • If a string: automatically infers the scoring metric from the string. Supported values, by target type:

    binary – neg_log_loss, roc_auc, accuracy, f1, precision, recall, f1_micro, f1_macro, f1_weighted, f1_samples, recall_micro, recall_macro, recall_weighted, recall_samples, precision_micro, precision_macro, precision_weighted, precision_samples

    multiclass – neg_log_loss, accuracy, f1_micro, f1_macro, f1_weighted, f1_samples, recall_macro, recall_micro, recall_weighted, recall_samples, precision_micro, precision_macro, precision_weighted, precision_samples

    More information on these scoring metrics can be found under Classification metrics. Note: scoring variations such as recall_macro are equivalent to sklearn.metrics.recall_score(..., average="macro").

  • random_state ( int or None , default=None ) – Random seed used by AutoML. Default value (if not previously set): 7

  • n_algos_tuned ( int or None , default=None ) –

    Number of algorithms that are optimized by the AutoML pipeline. Higher values might make AutoML more accurate at the expense of runtime.

    • To disable algorithm selection, set n_algos_tuned = len(model_list) .

    Default value (if not previously set): 1

  • model_list ( List [ str | Any ] or None , default=None ) –

    Models that will be evaluated by the Pipeline. Users can specify built-in models by name or pass in any custom model. Custom models must have their hyperparameter configuration spaces defined in search_space. Custom models for classification must implement the scikit-learn-style fit, predict, and predict_proba methods. By default, all supported built-in models for the given task are used. Supported built-in models per task:

    classification – CatBoostClassifier, DecisionTreeClassifier, ExtraTreesClassifier, GaussianNB, KNeighborsClassifier, LGBMClassifier, LogisticRegression, RandomForestClassifier, SVC, TorchMLPClassifier, XGBClassifier

  • adaptive_sampling ( bool or None , default=None ) – Set to False to disable class balancing and adaptive sampling done in AutoML. Disabling this might significantly increase runtime. Default value (if not previously set): True

  • min_features ( int , float , list or None , default=None ) –

    Minimum number of features to keep. Acceptable values:

    • If int, 0 < min_features <= n_features

    • If float, 0 < min_features <= 1.0

    • If list, names of features to keep; for example, ['a', 'b'] means keep features ‘a’ and ‘b’

    • To disable feature selection, set min_features = 1.0

    Default value (if not previously set): 1

  • optimization ( int or None , default=None ) –

    Sets AutoML speed, accuracy, and reproducibility level. To be used by advanced users only.

    • Level 0: Optimized for reproducibility (controls most randomness)

    • Level 3: Optimized for speed and accuracy

    • Level 10: Optimized for speed

    Default value (if not previously set): 3

  • preprocessing ( bool or None , default=None ) –

    Determines whether or not AutoML pre-processes the dataset. This setting should only be used by advanced users.

    • If True, the auto-preprocessor runs on the dataset to normalize the data. Categorical features are label encoded and numeric features are normalized to a mean of 0 and a variance of 1 using sklearn.preprocessing.StandardScaler . Features with more than 20 percent missing values are ignored. The remaining missing values are imputed by the mean for numeric features and the mode for categorical features.

    • If False, the user must cleanse (and normalize, if desired) the dataset before passing it to AutoML. NaNs in the dataset are not allowed and will produce a ValueError . AutoML leaves it to the underlying algorithm implementations to handle strings (encoding strings is recommended).

    Default value (if not previously set): True

  • search_space ( dict or None , default=None ) –

    This parameter defines the Model Tuning search space. It is required for custom models. Dictionary keys are algorithm names (str), and each value is the search space for that algorithm. Four types of values are supported:

    • Type 1: each hyperparameter maps to a dictionary with two keys:
      1. 'range', which is a list containing the range.

      2. 'type', which is one of 'continuous', 'discrete', 'categorical'.

      For example, if the user wishes to provide a custom tuning search space for LogisticRegression:

      search_space = {
          'LogisticRegression': {
              'C': {
                  'range': [0.03125, 512],
                  'type': 'continuous'
              },
              'solver': {
                  'range': ['newton-cg', 'lbfgs', 'liblinear', 'sag'],
                  'type': 'categorical'
              },
              'class_weight': {
                  'range': [None, 'balanced'],
                  'type': 'categorical'
              }
          }
      }

    • Type 2: fixed values, where each hyperparameter is set to a single value.

      For example, if the user wishes to fix hyperparameters for LogisticRegression:

      search_space = {
          'LogisticRegression': {
              'C': 0.03125,
              'solver': 'newton-cg'
          }
      }

    • Type 3: If the search space of a model is an empty dictionary, Model Tune is disabled for that model.

    • Type 4: mixed configuration, where some hyperparameters are fixed and others have a search space. For example:

      search_space = {
          'LogisticRegression': {
              'C': 0.03125,
              'solver': 'newton-cg',
              'class_weight': {
                  'range': [None, 'balanced'],
                  'type': 'categorical'
              }
          }
      }

    • To disable Model Tune for all models, set search_space = {} .

    • If None , the default search space defined inside AutoML is used.

    • If all the hyperparameters are fixed for a model, the HyperParameterOptimization step will be skipped for that model. Otherwise, the remaining non-fixed parameters will be tuned.

  • min_class_instances ( int or None , default=None ) – The minimum number of instances all classes must have when doing classification. If any class has fewer than this number of instances, training is stopped. This argument may take any value of 2 or higher. Default value (if not previously set): 5

  • max_tuning_trials ( int , dict or None , default=None ) –

    The maximum number of HPO trials; this limit may be exceeded slightly.

    • If None : AutoML automatically determines when enough HPO trials have been completed.

    • If an integer: the maximum number of trials for each algorithm. That is, if n_algos_tuned == 2 , then up to 2 * max_tuning_trials are performed in total.

    • If a dict : specifies this parameter per algorithm, e.g., {'LogisticRegression': 100, 'RandomForestClassifier': 200} . Missing values in the dictionary default to None.

    Default value (if not previously set): None

  • search_strategy ( str or None , default=None ) – The search strategy used in Model Tune. Valid search_strategy values: TPESampler, BruteForceSampler, CmaEsSampler, MOTPESampler, NSGAIISampler, QMCSampler, RandomSampler, HyperGD. Default value (if not previously set): 'HyperGD'

  • threshold_tuning ( bool or None , default=None ) –

    Determines whether or not AutoML optimizes the prediction threshold. Threshold tuning is only used in classification tasks. However, unlike classic threshold tuning, AutoML uses a novel technique that increases or decreases the model’s prediction probabilities for a given class, thereby keeping the prediction threshold fixed at 0.5 for binary classification and allowing the method to generalize to multi-class classification problems.

    • If True, the prediction threshold will be optimized based on the provided score metric. Threshold tuning allows users to post-process classification model predictions to optimize for their custom metric. Threshold tuning will not be exported to ONNX models; therefore, the ONNX model quality may be lower than that of the original model.

    • If False, threshold tuning is not applied.

    Default value (if not previously set): False

Raises

AutoMLxValueError – If min_class_instances is less than 2.
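Since training stops when any class has fewer than min_class_instances rows, it can be useful to check class counts before calling fit. The sketch below is plain Python and does not call AutoMLx; rare_classes is a hypothetical helper, and the built-in ValueError stands in for AutoMLxValueError.

```python
from collections import Counter


def rare_classes(y, min_class_instances=5):
    """Return {label: count} for every class with too few instances."""
    if min_class_instances < 2:
        # Mirrors the documented constraint; the library raises AutoMLxValueError.
        raise ValueError("min_class_instances must be 2 or higher")
    counts = Counter(y)
    return {label: n for label, n in counts.items() if n < min_class_instances}


labels = ["a"] * 6 + ["b"] * 5 + ["c"] * 2
print(rare_classes(labels))  # class 'c' has only 2 instances
```

An empty result means every class meets the minimum under the default of 5.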

fit ( self , X , y , X_valid = None , y_valid = None , cv = 'auto' , col_types = None , time_budget = - 1 )

Automatically identifies the most relevant features, model and hyperparameters for a given training data ( X ) and target ( y ). Final model fit is conducted on a full dataset.

Parameters
  • X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting tasks.

  • y ( pandas.DataFrame , pandas.Series ) – Training dataset target. Note that y is required for forecasting task.

  • X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features

  • y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target

  • cv ( int , str or None , default='auto' ) –

    Determines the cross-validation split. Possible inputs for cv are:

    • None: uses X_valid and y_valid for validation

    • ‘auto’: uses 5 folds if the number of instances is below 1M; disables cross-validation otherwise

    • integer: specifies the number of folds in a (Stratified)KFold

    • iterable: yields (train, test) splits as arrays of indices.

    For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.

  • col_types ( List [ str ] or None , default=None ) – List of length X.shape[1] with string values indicating the type of each feature. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, ‘timedelta’, ‘image’. For text classification, it has to be set to ‘text’. For image classification, features with col_type of image should be a column containing images in PIL format. If not None , it manually specifies the type of every dataset feature.

  • time_budget ( Dict [ str , float ] , float , default=-1 ) –

    • If float: time budget in seconds.

    • If Dict[str, float]: time budget for each step in seconds. Step names are: ModelSelection , ModelTune , FeatureSelection , AdaptiveSampling , ThresholdTuning

    • -1 for unconstrained time budget: best effort mode is enabled and optimization continues until convergence.

Returns

self

Return type

AutoMLPipeline
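The cv rules for fit can be summarized as a small decision function. This is a sketch of the documented rules rather than the library's code: resolve_cv and its arguments are hypothetical, and it assumes that the 5 folds chosen by 'auto' follow the same stratification rule as an explicit integer.

```python
def resolve_cv(cv, n_instances, is_classifier=True, target_kind="binary"):
    """Return (n_folds, splitter_name) implied by the documented cv options."""
    if isinstance(cv, str) and cv == "auto":
        # 5 folds below one million instances, cross-validation disabled otherwise.
        cv = 5 if n_instances < 1_000_000 else None
    if cv is None:
        return None, None  # validation relies on X_valid and y_valid instead
    if isinstance(cv, int):
        stratified = is_classifier and target_kind in ("binary", "multiclass")
        splitter = "StratifiedKFold" if stratified else "KFold"
        return cv, splitter
    # Anything else is treated as an iterable of (train, test) index splits.
    return None, "user-provided splits"


print(resolve_cv("auto", 50_000))
print(resolve_cv("auto", 2_000_000))
print(resolve_cv(3, 1_000, is_classifier=False, target_kind="continuous"))
```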

predict ( self , X )

Predict labels for features (X).

Parameters

X ( pandas.DataFrame ) – Prediction dataset features

Raises
  • AutoMLxNotFittedError – If the pipeline is not fitted yet

  • AutoMLxRuntimeError – If there are no predictions after calling the selected model on the given dataset

Returns

y_pred – The predicted values.

Return type

numpy.ndarray of shape (n_samples,)

predict_proba ( self , X )

Probability estimates.

Parameters

X ( pandas.DataFrame ) – Prediction dataset features

Raises

AutoMLxRuntimeError – If there are no predictions after calling the model on the given dataset

Returns

y_pred – The predicted probabilities.

Return type

numpy.ndarray of shape = (n_samples, n_classes)

score ( self , X , y )

Score of this pipeline for a given set of features ( X ) and labels ( y ). If inferred_score_metric has multiple score metrics, the first score metric will be calculated.

Parameters
Raises

AutoMLxNotFittedError – If the pipeline is not fitted yet

Returns

score – Score of self.predict(X) with respect to y .

Return type

float
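The metric that score reports is the first one configured through score_metric, which may be a custom callable with the signature score_func(model, X, y) or a (name, callable) tuple. The sketch below shows such a callable in plain Python. MajorityModel is a hypothetical stand-in for a fitted model with a predict method, and the metric name "my_accuracy" is made up for illustration.

```python
def accuracy_score_func(model, X, y):
    """Custom metric with the documented signature: fraction of correct predictions."""
    predictions = model.predict(X)
    correct = sum(1 for predicted, actual in zip(predictions, y) if predicted == actual)
    return correct / len(y)


class MajorityModel:
    """Hypothetical stand-in for a fitted model: always predicts one label."""

    def __init__(self, label):
        self.label = label

    def predict(self, X):
        return [self.label for _ in X]


# Tuple form: (name, callable). In a list, the first entry is the one optimized.
named_metric = ("my_accuracy", accuracy_score_func)
score_metric = ["neg_log_loss", named_metric]

value = accuracy_score_func(MajorityModel(1), [[0], [1], [2], [3]], [1, 1, 0, 1])
print(value)
```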

transform ( self , X , y = None )

Apply automatic preprocessing to a given set of features ( X ) and labels ( y ).

Parameters
Returns

  • X ( pandas.DataFrame ) – Transformed dataset features

  • y ( pandas.DataFrame, pandas.Series or None ) – Transformed dataset target

Raises

AutoMLxNotFittedError – If the pipeline is not fitted yet

refit ( self , X , y , X_valid = None , y_valid = None )

Refit previously tuned AutoML Pipeline with new data. Pipeline stages are not re-run on the new datasets. All settings and parameters from Model Selection, Feature Selection, Model Tune are re-used. fit must have been called before calling this method. If validation set is provided, it will be concatenated with training set before doing the refit.

Parameters
Returns

self

Return type

AutoMLPipeline

train ( self , X , y , X_valid = None , y_valid = None , cv = 'auto' , col_types = None , time_budget = - 1 )

Automatically identifies the most relevant model and hyperparameters for this given set of features ( X ) and target ( y ). Does not conduct final model fit. If the latter is desired, use fit .

Parameters
  • X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting task.

  • y ( pandas.DataFrame , pandas.Series ) – Training dataset target.

  • X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features

  • y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target

  • cv ( int , str or None , default='auto' ) –

    Determines the cross-validation split. Possible inputs for cv are:

    • None: uses X_valid and y_valid for validation

    • ‘auto’: uses 5 folds if the number of instances is below 1M; disables cross-validation otherwise

    • integer: specifies the number of folds in a (Stratified)KFold

    • iterable: yields (train, test) splits as arrays of indices.

    For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.

  • col_types ( list of strings or None , default=None ) – List of length X.shape[1] with string values indicating the type of each feature. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, ‘timedelta’, ‘image’. For text classification, it has to be set to ‘text’. For image classification, features with col_type of image should be a column containing images in PIL format. If not None , it manually specifies the type of every dataset feature.

  • time_budget ( Dict [ str , float ] , float , default=-1 ) –

    • If float: time budget in seconds.

    • If Dict[str, float]: time budget for each step in seconds. Step names are: ModelSelection , ModelTune , FeatureSelection , AdaptiveSampling , ThresholdTuning

    • -1 for unconstrained time budget: best effort mode is enabled and optimization continues until convergence.

Returns

self

Return type

AutoMLPipeline
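A per-step time_budget dictionary is easy to mistype, so validating the step names before calling train or fit may save a wasted run. The sketch below is plain Python; normalize_time_budget is a hypothetical helper, and how AutoMLx treats steps missing from the dictionary is not stated here, so the helper only checks names.

```python
# Step names as listed in the time_budget description.
STEP_NAMES = {
    "ModelSelection",
    "ModelTune",
    "FeatureSelection",
    "AdaptiveSampling",
    "ThresholdTuning",
}


def normalize_time_budget(time_budget):
    """Return None for unconstrained, a float of seconds, or a validated per-step dict."""
    if isinstance(time_budget, dict):
        unknown = set(time_budget) - STEP_NAMES
        if unknown:
            raise ValueError(f"unknown step names: {sorted(unknown)}")
        return {step: float(seconds) for step, seconds in time_budget.items()}
    if time_budget == -1:
        return None  # best effort mode: run until convergence
    return float(time_budget)


print(normalize_time_budget({"ModelSelection": 60, "ModelTune": 300}))
print(normalize_time_budget(-1))
```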

AutoRegressor

class AutoRegressor

Regressor AutoMLPipeline

selected_features_names_

Names of the engineered features selected by the AutoML pipeline.

Type

List[ str ]

selected_features_names_raw_

Names of the original features selected by the AutoML pipeline. If preprocessing is disabled, then this corresponds to selected_features_names_ ; otherwise, a raw feature is considered selected if at least one of the features engineered from it is selected.

Type

List[ str ]

ranked_models_

List of model names ranked in order of their quality from the last fit call.

Type

List[ str ]

selected_model_

Name of the best model selected by AutoML.

Type

str

selected_model_params_

Dictionary containing the optimal hyperparameters for the selected model. Keys are hyperparameter names with their corresponding values.

Type

dict

selected_rows_

List of indices in the original train dataset provided to AutoML corresponding to the rows sampled during Adaptive Sampling. In the case of CV, this attribute will result in a list of lists corresponding to indices selected in each fold. For example, in the case of no CV, this attribute looks like: [0, 1, 5] , indicating indices 0, 1, and 5 have been selected during adaptive sampling. In the case of CV=3, this attribute looks like: [ [0, 1], [0, 5], [1, 5] ] , indicating indices 0,1 were selected from the first fold, 0,5 were selected in the 2nd fold, and 1,5 were selected in the 3rd fold.

Type

list

selected_valid_rows_

List of indices in the original validation dataset (if CV==None ) provided to AutoML corresponding to the rows sampled during Adaptive Sampling. If CV is not None, the returned value is always None given that Adaptive Sampling does not sample the validation set when CV is enabled.

Type

list

pipelines_

Sorted list of pipelines (length equal to n_algos_tuned ), with 0th element being the best model.

Type

list

completed_trials_summary_

All trials performed by the AutoML Pipeline. The columns are: Algorithm, # Samples, # Features, Mean Validation Score, Hyperparameters, Runtime, Memory Usage, Step, where the hyperparameters are a dict.

Type

pandas.DataFrame

completed_trials_detailed_

A DataFrame indicating all trials performed by the AutoML Pipeline with all score metrics’ values. The columns are: Algorithm, # Samples, # Features, Hyperparameters, Features, Step, Scoring Metric, CV Fold ID, Score, Runtime, Memory Usage, # Epochs.

Type

pandas.DataFrame

n_jobs_

Parallelism internally used by AutoML. Calculated as inter_model_parallelism*intra_model_parallelism .

Type

int

feature_importances_

Importance of each feature in the dataset for the selected model

Type

numpy.ndarray of shape (n_features,)

configure ( self , score_metric = None , random_state = None , n_algos_tuned = None , model_list = None , adaptive_sampling = None , min_features = None , optimization = None , preprocessing = None , search_space = None , max_tuning_trials = None , search_strategy = None )

Configure the AutoRegressor

If an argument is set to None, then its value is not changed and the default value is used.

Parameters
  • score_metric ( str , callable , tuple , list or None , default=None ) –

    One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.

    • If None : it will be determined automatically depending on the task. Default score metric: neg_mean_squared_error

    • If a list: should be a list of str, callable or tuple. The first score metric in the list will be the one for which the pipeline optimizes.

    • If a callable: score function (or loss function) with signature score_func(model, X, y) .

    • If a tuple: should be a tuple with two values with types (str, callable). The string corresponds to the name of the scoring metric, and the callable should have the same signature as above.

    • If a string: automatically infers the scoring metric from the string:

      continuous – neg_mean_squared_error, r2, neg_mean_absolute_error, neg_mean_squared_log_error, neg_median_absolute_error

    More information on scoring metrics can be found here: Regression metrics.

  • random_state ( int or None , default=None ) – Random seed used by AutoML. Default value (if not previously set): 7

  • n_algos_tuned ( int or None , default=None ) –

    Number of algorithms that are optimized by the AutoML pipeline. Higher values might make AutoML more accurate at the expense of runtime.

    • To disable algorithm selection set n_algos_tuned = len(model_list) .

    Default value (if not previously set): 1

  • model_list ( List [ str | Any ] or None , default=None ) –

    Models that will be evaluated by the Pipeline. Users can specify built-in models by name or pass in any custom model. By default, all supported built-in models for the given task are used. Custom models must have their hyper-parameter configuration spaces defined in search_space. Custom models for regression must implement the scikit-learn-style fit and predict methods. Supported built-in models per task:

    regression – AdaBoostRegressor, DecisionTreeRegressor, ExtraTreesRegressor, KNeighborsRegressor, LGBMRegressor, LinearRegression, LinearSVR, RandomForestRegressor, SVR, TorchMLPRegressor, XGBRegressor

  • adaptive_sampling ( bool or None , default=None ) – Set to False to disable class balancing and adaptive sampling done in AutoML. Disabling this might significantly increase runtime. Default value (if not previously set): True

  • min_features ( int , float , list or None , default=None ) –

    Minimum number of features to keep. Acceptable values:

    • If int, 0 < min_features <= n_features

    • If float, 0 < min_features <= 1.0

    • If list, names of features to keep, for example ['a', 'b'] means keep features 'a' and 'b'

    • To disable feature selection set min_features = 1.0

    Default value (if not previously set): 1

  • optimization ( int or None , default=None ) –

    Sets AutoML speed, accuracy, and reproducibility level. To be used by advanced users only.

    • Level 0: Optimized for reproducibility (controls most randomness)

    • Level 3: Optimized for speed and accuracy

    • Level 10: Optimized for speed

    Defaults to 3

  • preprocessing ( bool or None , default=None ) –

    Determines whether or not AutoML pre-processes the dataset. This setting should only be used by advanced users.

    • If True, auto-preprocessor runs on dataset to normalize data. Categorical features are label encoded and numeric features are normalized to mean of 0 and variance of 1 using sklearn.preprocessing.StandardScaler . Features with more than 20 percent missing values are ignored. The remaining missing values are imputed by mean for numeric features and mode for categorical features.

    • If False, user must cleanse (and normalize if desired) dataset before passing data to AutoML. The use of NaNs in the dataset is not allowed and will produce a ValueError . AutoML will leave it to the underlying algorithm implementations to handle strings (it is recommended to encode strings).

    Default value (if not previously set): True

  • search_space ( dict or None , default=None ) –

    This parameter defines the Model Tuning search space. This parameter is required for custom models. Dictionary keys are algorithm names (str), and each value is the search space for that algorithm. Four types of values are supported:

    • Type 1: the search space of a hyperparameter must have two entries: (1) 'range', which is a list containing the range, and (2) 'type', which is one of 'continuous', 'discrete', 'categorical'. For example, if the user wishes to provide a custom tune search space for AdaBoostRegressor:

      search_space = {
          'AdaBoostRegressor': {
              'learning_rate': {
                  'range': [0.05, 1], 'type': 'continuous'
              },
              'n_estimators': {
                  'range': [10, 50], 'type': 'discrete'
              },
          }
      }

    • Type 2: Fixed values, where the value of a hyperparameter is fixed. For example, if the user wishes to fix hyperparameters for AdaBoostRegressor:

      search_space = {
          'AdaBoostRegressor': {
              'learning_rate': 0.984, 'n_estimators': 30
          }
      }

    • Type 3: If the search space of a model is an empty dictionary, then Model Tune is disabled for that model.

    • Type 4: Mixed configuration, where some hyper-parameters are fixed and others have a search space. For example:

      search_space = {
          'AdaBoostRegressor': {
              'learning_rate': 0.984,
              'n_estimators': {
                  'range': [10, 50], 'type': 'discrete'
              },
          }
      }

    • To disable Model Tune for all models set search_space = {}

    • If None , the default search space defined inside AutoML is used.

    • If all the hyper-parameters are fixed for a model, the HyperParameterOptimization step will be skipped for that model. Otherwise, the remaining non-fixed parameters will be tuned.

  • max_tuning_trials ( int , dict or None , default=None ) –

    The maximum number of HPO trials; this limit may be exceeded slightly.

    • If None : AutoML automatically determines when enough HPO trials have been completed.

    • If an integer: the maximum number of trials for each algorithm. That is, if n_algos_tuned == 2 , then up to 2 * max_tuning_trials are performed in total.

    • If a dict : specifies this parameter per algorithm, e.g., {'LogisticRegression': 100, 'RandomForestClassifier': 200} . Missing values in the dictionary default to None.

    Default value (if not previously set): None

  • search_strategy ( str or None , default=None ) – The search strategy used in Model Tune. Valid search_strategy values: TPESampler, BruteForceSampler, CmaEsSampler, MOTPESampler, NSGAIISampler, QMCSampler, RandomSampler, HyperGD. Default value (if not previously set): 'HyperGD'
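The search_space forms described above can be sketched in plain Python. The helpers is_tunable and is_fully_fixed are hypothetical names introduced only for illustration (they are not AutoMLx functions), and the hyperparameter values are arbitrary. The sketch mirrors the stated rule that tuning is skipped for a model when all of its hyperparameters are fixed; treating an empty dictionary the same way is an assumption based on the Type 3 description.

```python
def is_tunable(spec):
    """A hyperparameter entry is tunable when it is a dict with 'range' and 'type'."""
    return isinstance(spec, dict) and {"range", "type"} <= set(spec)


def is_fully_fixed(model_space):
    """True when no hyperparameter of the model has a search range."""
    return not any(is_tunable(spec) for spec in model_space.values())


search_space = {
    # Type 4 (mixed): learning_rate is fixed, n_estimators is tuned.
    "AdaBoostRegressor": {
        "learning_rate": 0.984,
        "n_estimators": {"range": [10, 50], "type": "discrete"},
    },
    # Type 2 (fixed): every listed hyperparameter has a fixed value.
    "RandomForestRegressor": {"n_estimators": 100},
    # Type 3 (empty): Model Tune is disabled for this model.
    "LinearRegression": {},
}

for name, space in search_space.items():
    status = "tuning skipped" if is_fully_fixed(space) else "remaining parameters tuned"
    print(f"{name}: {status}")
```

A dictionary built this way would be passed as the search_space argument of configure.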

fit ( self , X , y , X_valid = None , y_valid = None , cv = 'auto' , col_types = None , time_budget = - 1 )

Automatically identifies the most relevant features, model and hyperparameters for a given training data ( X ) and target ( y ). Final model fit is conducted on a full dataset.

Parameters
  • X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting tasks.

  • y ( pandas.DataFrame , pandas.Series ) – Training dataset target. Note that y is required for forecasting task.

  • X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features

  • y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target

  • cv ( int , str or None , default='auto' ) –

    Determines the cross-validation split. Possible inputs for cv are:

    • None: uses X_valid and y_valid for validation

    • ’auto’: uses 5 folds if number of instances < 1M, disable cv-folds otherwise

    • integer: specifies the number of folds in a (Stratified)KFold ,

    • iterable: yields (train, test) splits as arrays of indices.

    For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.

  • col_types ( List [ str ] or None , default=None ) – List of length X.shape[1] with string values indicating type of features. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, and ‘timedelta’. If not None , it manually specifies the type of every dataset feature.

  • time_budget ( Dict [ str , float ] , float or None , default=-1 ) –

    • If float: time budget in seconds.

    • If Dict[str, float]: time budget for each step in seconds. Step names are: ModelSelection , ModelTune , FeatureSelection , AdaptiveSampling , ThresholdTuning

    • -1 for unconstrained time budget: best effort mode is enabled and optimization continues until convergence.
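The cv rules listed above can be summarized in a small sketch. The function resolve_n_folds is hypothetical and only restates the documented behaviour for None, 'auto', and integer inputs; the actual split objects are created inside AutoML. A return value of None means that cross-validation is disabled.

```python
def resolve_n_folds(cv, n_instances):
    """Return the number of CV folds, or None when cross-validation is disabled."""
    if cv is None:
        return None                      # X_valid / y_valid are used for validation
    if cv == "auto":
        return 5 if n_instances < 1_000_000 else None
    if isinstance(cv, int) and not isinstance(cv, bool):
        return cv                        # number of (Stratified)KFold folds
    raise TypeError("this sketch handles only None, 'auto', or an int")


print(resolve_n_folds("auto", 50_000))      # 5
print(resolve_n_folds("auto", 2_000_000))   # None
print(resolve_n_folds(3, 50_000))           # 3
```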

Returns

self

Return type

AutoMLPipeline

predict ( self , X )

Predict labels for features (X).

Parameters

X ( pandas.DataFrame ) – Prediction dataset features

Raises
  • AutoMLxNotFittedError – If the pipeline is not fitted yet

  • AutoMLxRuntimeError – If there are no predictions after calling the selected model over the given dataset

Returns

y_pred – The predicted values.

Return type

numpy.ndarray of shape (n_samples,)

score ( self , X , y )

Score of this pipeline for a given set of features ( X ) and labels ( y ). If inferred_score_metric has multiple score metrics, the first score metric will be calculated.

Parameters
Raises

AutoMLxNotFittedError – If the pipeline is not fitted yet

Returns

score – Score of self.predict(X) with respect to y .

Return type

float

transform ( self , X , y = None )

Apply automatic preprocessing to a given set of features ( X ) and labels ( y ).

Parameters
Returns

  • X ( pandas.DataFrame ) – Transformed dataset features

  • y ( pandas.DataFrame, pandas.Series or None ) – Transformed dataset target

Raises

AutoMLxNotFittedError – If the pipeline is not fitted yet

refit ( self , X , y , X_valid = None , y_valid = None )

Refit previously tuned AutoML Pipeline with new data. Pipeline stages are not re-run on the new datasets. All settings and parameters from Model Selection, Feature Selection, Model Tune are re-used. fit must have been called before calling this method. If validation set is provided, it will be concatenated with training set before doing the refit.

Parameters
Returns

self

Return type

AutoMLPipeline

train ( self , X , y , X_valid = None , y_valid = None , cv = 'auto' , col_types = None , time_budget = - 1 )

Automatically identifies the most relevant model and hyperparameters for this given set of features ( X ) and target ( y ). Does not conduct final model fit. If the latter is desired, use fit .

Parameters
  • X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting task.

  • y ( pandas.DataFrame , pandas.Series ) – Training dataset target.

  • X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features

  • y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target

  • cv ( int , str or None , default='auto' ) –

    Determines the cross-validation split. Possible inputs for cv are:

    • None: uses X_valid and y_valid for validation

    • ’auto’: uses 5 folds if number of instances < 1M, disable cv-folds otherwise

    • integer: specifies the number of folds in a (Stratified)KFold ,

    • iterable: yields (train, test) splits as arrays of indices.

    For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.

  • col_types ( list of strings or None , default=None ) – List of length X.shape[1] with string values indicating type of features. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, and ‘timedelta’. If not None , it manually specifies the type of every dataset feature.

  • time_budget ( Dict [ str , float ] , float or None , default=-1 ) –

    • If float: time budget in seconds.

    • If Dict[str, float]: time budget for each step in seconds. Step names are: ModelSelection , ModelTune , FeatureSelection , AdaptiveSampling , ThresholdTuning

    • -1 for unconstrained time budget: best effort mode is enabled and optimization continues until convergence.
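A per-step time budget is a plain dictionary keyed by step name. The sketch below builds one and checks its keys against the step names listed above. The helper check_time_budget is hypothetical (not an AutoMLx function) and the budget values are arbitrary; how AutoML treats steps that are absent from the dictionary is not stated here, so the sketch makes no assumption about it.

```python
STEP_NAMES = {
    "ModelSelection", "ModelTune", "FeatureSelection",
    "AdaptiveSampling", "ThresholdTuning",
}


def check_time_budget(time_budget):
    """Validate a time_budget given as None, a number of seconds, or a per-step dict."""
    if time_budget is None:
        return None
    if isinstance(time_budget, dict):
        unknown = set(time_budget) - STEP_NAMES
        if unknown:
            raise ValueError(f"unknown step names: {sorted(unknown)}")
        return {step: float(seconds) for step, seconds in time_budget.items()}
    return float(time_budget)            # -1 means an unconstrained budget


per_step = check_time_budget({"ModelSelection": 60, "ModelTune": 300})
print(per_step)
print(check_time_budget(-1))
```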

Returns

self

Return type

AutoMLPipeline

AutoAnomalyDetector

class AutoAnomalyDetector

Anomaly Detection AutoMLPipeline

classes_

Holds the label for each class (for task=classification only, otherwise it is set to None ).

Type

List[Any]

selected_features_names_

Names of the engineered features selected by the AutoML pipeline.

Type

List[ str ]

selected_features_names_raw_

Names of original feature names selected by the AutoML pipeline. If preprocessing is disabled, then this corresponds to selected_features_names_ ; otherwise, a raw feature is considered selected if at least one of the features engineered from it is selected.

Type

List[ str ]
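The rule for raw features can be illustrated with a small sketch. The mapping from engineered to raw feature names below is made up for the example (AutoMLx does not necessarily expose such a mapping); only the selection rule itself comes from the description above.

```python
# Hypothetical mapping: engineered feature name -> raw feature it was derived from.
engineered_to_raw = {
    "date_year": "date",
    "date_month": "date",
    "price_scaled": "price",
    "city_encoded": "city",
}

selected_engineered = ["date_month", "price_scaled"]

# A raw feature counts as selected if at least one of its engineered features is selected.
selected_raw = []
for name in selected_engineered:
    raw = engineered_to_raw[name]
    if raw not in selected_raw:
        selected_raw.append(raw)

print(selected_raw)  # ['date', 'price']
```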

ranked_models_

List of model names ranked in order of their quality from the last fit call.

Type

List[ str ]

selected_model_

Name of the best model selected by AutoML.

Type

str

selected_model_params_

Dictionary containing the optimal hyperparameters for the selected model. Keys are hyperparameter names with their corresponding values.

Type

dict

selected_rows_

List of indices in the original train dataset provided to AutoML corresponding to the rows sampled during Adaptive Sampling. In the case of CV, this attribute will result in a list of lists corresponding to indices selected in each fold. For example, in the case of no CV, this attribute looks like: [0, 1, 5] , indicating indices 0, 1, and 5 have been selected during adaptive sampling. In the case of CV=3, this attribute looks like: [ [0, 1], [0, 5], [1, 5] ] , indicating indices 0,1 were selected from the first fold, 0,5 were selected in the 2nd fold, and 1,5 were selected in the 3rd fold.

Type

list

selected_valid_rows_

List of indices in the original validation dataset (if CV==None ) provided to AutoML corresponding to the rows sampled during Adaptive Sampling. If CV is not None, the returned value is always None given that Adaptive Sampling does not sample the validation set when CV is enabled.

Type

list

pipelines_

Sorted list of pipelines (length equal to n_algos_tuned ), with 0th element being the best model.

Type

list

completed_trials_summary_

All trials performed by the AutoML Pipeline. The columns are: Algorithm, # Samples, # Features, Mean Validation Score, Hyperparameters, Runtime, Memory Usage, Step, where the hyperparameters are a dict.

Type

pandas.DataFrame

completed_trials_detailed_

A DataFrame indicating all trials performed by the AutoML Pipeline with all score metrics’ values. The columns are: Algorithm, # Samples, # Features, Hyperparameters, Features, Step, Scoring Metric, CV Fold ID, Score, Runtime, Memory Usage, # Epochs.

Type

pandas.DataFrame

n_jobs_

Parallelism internally used by AutoML. Calculated as inter_model_parallelism*intra_model_parallelism .

Type

int

feature_importances_

Importance of each feature in the dataset for the selected model

Type

numpy.ndarray of shape (n_features,)

configure ( self , score_metric = None , random_state = None , n_algos_tuned = None , model_list = None , optimization = None , preprocessing = None , search_space = None , max_tuning_trials = None , search_strategy = None )

Configure the AutoAnomalyDetector

If an argument is set to None, then its value is not changed and the default value is used.

Parameters
  • score_metric ( str , callable , tuple , list or None , default=None ) –

    One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.

    • If None : it will be determined automatically depending on the task. Default score metric: unsupervised_unify95

    • If a list: should be a list of str, callable or tuple. The first score metric in the list will be the one for which the pipeline optimizes.

    • If a callable: score function (or loss function) with signature score_func(model, X, y) .

    • If a tuple: should be a tuple with two values with types (str, callable). The string corresponds to the name of the scoring metric, and the callable should have the same signature as above.

    • If a string: automatically infers the scoring metric from the string:

      unsupervised – unsupervised_unify95, unsupervised_unify95_log_loss

  • random_state ( int or None , default=None ) – Random seed used by AutoML. Default value (if not previously set): 7

  • n_algos_tuned ( int or None , default=None ) –

    Number of algorithms that are optimized by the AutoML pipeline. Higher values might make AutoML more accurate at the expense of runtime.

    • To disable algorithm selection set n_algos_tuned = len(model_list) .

    Default value (if not previously set): 1

  • model_list ( List [ str | Any ] or None , default=None ) –

    Models that will be evaluated by the Pipeline. Users can specify built-in models by name or pass in any custom model. By default, all supported built-in models for the given task are used. Custom models must have their hyper-parameter configuration spaces defined in search_space. Custom models for anomaly detection must follow the pyod interface. Supported built-in models per task:

    anomaly_detection – ClusteringLocalFactorOD, HistogramOD, IsolationForestOD, KNearestNeighborsOD, MinCovOD, OneClassSVMOD, PrincipalCompOD, AutoEncoder

  • optimization ( int or None , default=None ) –

    Sets AutoML speed, accuracy, and reproducibility level. To be used by advanced users only.

    • Level 0: Optimized for reproducibility (controls most randomness)

    • Level 3: Optimized for speed and accuracy

    • Level 10: Optimized for speed

    Default value (if not previously set): 3

  • preprocessing ( bool or None , default=None ) –

    Determines whether or not AutoML pre-processes the dataset. This setting should only be used by advanced users.

    • If True, auto-preprocessor runs on dataset to normalize data. Categorical features are label encoded and numeric features are normalized to mean of 0 and variance of 1 using sklearn.preprocessing.StandardScaler . Features with more than 20 percent missing values are ignored. The remaining missing values are imputed by mean for numeric features and mode for categorical features.

    • If False, user must cleanse (and normalize if desired) dataset before passing data to AutoML. The use of NaNs in the dataset is not allowed and will produce a ValueError . AutoML will leave it to the underlying algorithm implementations to handle strings (it is recommended to encode strings).

    Default value (if not previously set): True

  • search_space ( dict or None , default=None ) –

    This parameter defines the Model Tuning search space. This parameter is required for custom models. Dictionary keys are algorithm names (str), and each value is the search space for that algorithm. Four types of values are supported:

    • Type 1: the search space of a hyperparameter must have two entries: (1) 'range', which is a list containing the range, and (2) 'type', which is one of 'continuous', 'discrete', 'categorical'. For example, if the user wishes to provide a custom tune search space for IsolationForestOD:

      search_space = {
          'IsolationForestOD': {
              'n_estimators': {
                  'range': [10, 50], 'type': 'discrete'
              },
              'max_features': {
                  'range': [0.5, 0.7], 'type': 'continuous'
              },
              'max_samples': {
                  'range': [5, 10], 'type': 'discrete'
              }
          }
      }

    • Type 2: Fixed values, where the value of a hyperparameter is fixed. For example, if the user wishes to fix hyperparameters for IsolationForestOD:

      search_space = {
          'IsolationForestOD': {
              'n_estimators': 10, 'max_features': 0.5, 'max_samples': 10
          }
      }

    • Type 3: If the search space of a model is an empty dictionary, then Model Tune is disabled for that model.

    • Type 4: Mixed configuration, where some hyper-parameters are fixed and others have a search space. For example:

      search_space = {
          'IsolationForestOD': {
              'n_estimators': {
                  'range': [10, 50], 'type': 'discrete'
              },
              'max_features': 0.5, 'max_samples': 10
          }
      }

    • To disable Model Tune for all models set search_space = {}

    • If None , the default search space defined inside AutoML is used.

    • If all the hyper-parameters are fixed for a model, the HyperParameterOptimization step will be skipped for that model. Otherwise, the remaining non-fixed parameters will be tuned.

  • max_tuning_trials ( int , dict or None , default=None ) –

    The maximum number of HPO trials; this limit may be exceeded slightly.

    • If None : AutoML automatically determines when enough HPO trials have been completed.

    • If an integer: the maximum number of trials for each algorithm. That is, if n_algos_tuned == 2 , then up to 2 * max_tuning_trials are performed in total.

    • If a dict : specifies this parameter per algorithm, e.g., {'LogisticRegression': 100, 'RandomForestClassifier': 200} . Missing values in the dictionary default to None.

    Default value (if not previously set): None

  • search_strategy ( str or None , default=None ) – The search strategy used in Model Tune. Valid search_strategy values: TPESampler, BruteForceSampler, CmaEsSampler, MOTPESampler, NSGAIISampler, QMCSampler, RandomSampler, HyperGD. Default value (if not previously set): 'HyperGD'
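To make the max_tuning_trials forms concrete, the sketch below computes the upper bound on total HPO trials implied by the description above. The helper total_trial_cap is hypothetical, and the two algorithm names stand in for whichever algorithms end up being tuned. A result of None means at least one algorithm has no explicit cap; recall that the documented cap may be exceeded slightly.

```python
def total_trial_cap(max_tuning_trials, tuned_algorithms):
    """Upper bound on total HPO trials, or None if any algorithm is uncapped."""
    if max_tuning_trials is None:
        return None
    if isinstance(max_tuning_trials, dict):
        # Algorithms missing from the dict default to None (no explicit cap).
        caps = [max_tuning_trials.get(name) for name in tuned_algorithms]
    else:
        caps = [max_tuning_trials] * len(tuned_algorithms)
    return None if any(cap is None for cap in caps) else sum(caps)


tuned = ["IsolationForestOD", "HistogramOD"]   # corresponds to n_algos_tuned == 2
print(total_trial_cap(50, tuned))                                              # 100
print(total_trial_cap({"IsolationForestOD": 100}, tuned))                      # None
print(total_trial_cap({"IsolationForestOD": 100, "HistogramOD": 20}, tuned))   # 120
```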

fit ( self , X , X_valid = None , y_valid = None , col_types = None , time_budget = - 1 , contamination = None )

Automatically identifies the most relevant features, model and hyperparameters for a given training data ( X ). Final model fit is conducted on a full dataset.

Parameters
  • X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting tasks.

  • X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features

  • y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target

  • col_types ( List [ str ] or None , default=None ) – List of length X.shape[1] with string values indicating type of features. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, and ‘timedelta’

  • time_budget ( Dict [ str , float ] , float , default=-1 ) –

    • If float: time budget in seconds.

    • If Dict[str, float]: time budget for each step in seconds. Step names are: ModelSelection , ModelTune

    • -1 for unconstrained time budget: best effort mode is enabled and optimization continues until convergence.

  • contamination ( float or None , default=None ) – Fraction of training dataset corresponding to anomalies (between 0.0 and 0.5). Should only be set for supervised anomaly detection (y_valid is required). Should be set to None for unsupervised anomaly detection (when using the unsupervised metrics).
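The constraints on contamination can be expressed as a small pre-flight check. This is a sketch, not AutoMLx code: check_contamination is a hypothetical helper, treating the bounds 0.0 and 0.5 as inclusive is an assumption, and the library's own validation may differ.

```python
def check_contamination(contamination, y_valid=None):
    """Validate contamination before a supervised or unsupervised fit call."""
    if contamination is None:
        return None                       # unsupervised anomaly detection
    if y_valid is None:
        raise ValueError("contamination requires y_valid (supervised anomaly detection)")
    if not 0.0 <= contamination <= 0.5:
        raise ValueError("contamination must be between 0.0 and 0.5")
    return float(contamination)


print(check_contamination(None))                 # None
print(check_contamination(0.1, y_valid=[0, 1]))  # 0.1
```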

Returns

self

Return type

AutoMLPipeline

predict ( self , X )

Predict labels for features (X).

Parameters

X ( pandas.DataFrame ) – Prediction dataset features

Raises
  • AutoMLxNotFittedError – If the pipeline is not fitted yet

  • AutoMLxRuntimeError – If there are no predictions after calling the selected model over the given dataset

Returns

y_pred – The predicted values.

Return type

numpy.ndarray of shape (n_samples,)

predict_proba ( self , X )

Probability estimates.

Parameters

X ( pandas.DataFrame ) – Prediction dataset features

Raises
  • AutoMLxNotFittedError – If the pipeline is not fitted yet.

  • AutoMLxRuntimeError – If there are no predictions after calling the model over the given dataset.

Returns

y_pred – The predicted probabilities.

Return type

numpy.ndarray of shape = (n_samples, n_classes)

score ( self , X , y )

Score of this pipeline for a given set of features ( X ) and labels ( y ). If inferred_score_metric has multiple score metrics, the first score metric will be calculated.

Parameters
Raises

AutoMLxNotFittedError – If the pipeline is not fitted yet

Returns

score – Score of self.predict(X) with respect to y .

Return type

float

transform ( self , X , y = None )

Apply automatic preprocessing to a given set of features ( X ) and labels ( y ).

Parameters
Returns

  • X ( pandas.DataFrame ) – Transformed dataset features

  • y ( pandas.DataFrame, pandas.Series or None ) – Transformed dataset target

Raises

AutoMLxNotFittedError – If the pipeline is not fitted yet

refit ( self , X , X_valid = None , y_valid = None )

Refit previously tuned AutoML Pipeline with new data. Pipeline stages are not re-run on the new datasets. All settings and parameters from Model Selection, Feature Selection, Model Tune are re-used. fit must have been called before calling this method. If validation set is provided, it will be concatenated with training set before doing the refit.

Parameters
Returns

self

Return type

AutoMLPipeline

train ( self , X , X_valid = None , y_valid = None , col_types = None , time_budget = - 1 , contamination = None )

Automatically identifies the most relevant model and hyperparameters for this given set of features ( X ) and target ( y ). Does not conduct final model fit. If the latter is desired, use fit .

Parameters
  • X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting task.

  • X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features

  • y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target

  • col_types ( list of strings or None , default=None ) – List of length X.shape[1] with string values indicating type of features. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, and ‘timedelta’. If not None , it manually specifies the type of every dataset feature.

  • time_budget ( Dict [ str , float ] , float or None , default=-1 ) –

    • If float: time budget in seconds.

    • If Dict[str, float]: time budget for each step in seconds. Step names are: ModelSelection , ModelTune

    • -1 for unconstrained time budget: best effort mode is enabled and optimization continues until convergence.

  • contamination ( float or None , default=None ) – Fraction of training dataset corresponding to anomalies (between 0.0 and 0.5). Should only be set for supervised anomaly detection (y_valid is required). Should be set to None for unsupervised anomaly detection (when using the unsupervised metrics).

Raises

AutoMLxValueError – If contamination has been provided for unsupervised AD

Returns

self

Return type

AutoMLPipeline

AutoForecaster

class AutoForecaster

Forecasting AutoMLPipeline

ranked_models_

List of model names ranked in order of their quality from the last fit call.

Type

List[ str ]

selected_model_

Name of the best model selected by AutoML.

Type

str

selected_model_params_

Dictionary containing the optimal hyperparameters for the selected model. Keys are hyperparameter names with their corresponding values.

Type

dict

pipelines_

Sorted list of pipelines (length equal to n_algos_tuned ), with 0th element being the best model.

Type

list

completed_trials_summary_

All trials performed by the AutoML Pipeline. The columns are: Algorithm, # Samples, # Features, Mean Validation Score, Hyperparameters, Runtime, Memory Usage, Step, where the hyperparameters are a dict.

Type

pandas.DataFrame

completed_trials_detailed_

A DataFrame indicating all trials performed by the AutoML Pipeline with all score metrics’ values. The columns are: Algorithm, # Samples, # Features, Hyperparameters, Features, Step, Scoring Metric, CV Fold ID, Score, Runtime, Memory Usage, # Epochs.

Type

pandas.DataFrame

time_series_period

The seasonality period to force-fit the time series at regardless of whether it is detected in the data.

Type

int or None

n_jobs_

Parallelism internally used by AutoML. Calculated as inter_model_parallelism*intra_model_parallelism .

Type

int

configure ( self , score_metric = None , random_state = None , n_algos_tuned = None , model_list = None , optimization = None , preprocessing = None , search_space = None , max_tuning_trials = None , search_strategy = None , time_series_period = None )

Configure the AutoForecaster

If an argument is set to None, then its value is not changed and the default value is used.

Parameters
  • score_metric ( str , callable , tuple , list or None , default=None ) –

    One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.

    • If None : it will be determined automatically depending on the task. Default score metric: neg_sym_mean_abs_percent_error

    • If a list: should be a list of str, callable or tuple. The first score metric in the list will be the one for which the pipeline optimizes.

    • If a callable: score function (or loss function) with signature score_func(model, X, y) .

    • If a tuple: should be a tuple with two values with types (str, callable). The string corresponds to the name of the scoring metric, and the callable should have the same signature as above.

    • If a string: automatically infers the scoring metric from the string:

      continuous_forecast – neg_sym_mean_abs_percent_error, neg_root_mean_squared_percent_error, neg_mean_abs_scaled_error, neg_root_mean_squared_error, neg_mean_squared_error, neg_max_absolute_error, neg_mean_absolute_error, neg_max_abs_error, neg_mean_abs_error

  • random_state ( int , or None , default=None ) – Random seed used by AutoML. Suggested default: 7

  • n_algos_tuned ( int , or None , default=None ) –

    Number of algorithms that are optimized by the AutoML pipeline. Higher values might make AutoML more accurate at the expense of runtime.

    • To disable algorithm selection set n_algos_tuned = len(model_list) .

    Suggested default: 1

  • model_list ( List [ str ] , or None , default=None ) –

    Models that will be evaluated by the Pipeline. Users can specify built-in models by name (by default, all supported built-in models for a given task are used).

    • All models except VARMAX and DynFactor are applicable when there is a single timeseries in y.

    • If you have multiple timeseries in y that you want to predict as a system, then the multi-target forecasting models VARMAX and DynFactor may be utilized.

    • When you have features or exogenous regressors that you know in advance for your forecast period, pass them into X.

    Supported built-in models per task:

    forecasting – NaiveForecaster, ThetaForecaster, ExpSmoothForecaster, ETSForecaster, STLwESForecaster, STLwARIMAForecaster, SARIMAXForecaster, VARMAXForecaster, DynFactorForecaster, ExtraTreesForecaster, XGBForecaster, LGBMForecaster

  • optimization ( int , or None , default=None ) –

    Sets AutoML speed, accuracy, and reproducibility level. To be used by advanced users only.

    • Level 0: Optimized for reproducibility (controls most randomness)

    • Level 3: Optimized for speed and accuracy

    • Level 10: Optimized for speed

    Suggested default: 3

  • preprocessing ( bool , or None , default=None ) – Determines whether or not AutoML pre-processes the dataset. This setting should only be used by advanced users. Most of the preprocessing can not be turned off for the forecasting task. Suggested default: True

  • search_space ( dict , or None , default=None ) –

    This parameter defines the search space for model tuning. This parameter is required for custom models. Dictionary keys are algorithm names (str) with search space as the key value. We support 4 types of Key values:

    • Type 1: the search space key values must have two parameters:
      1. 'range', which is a list containing the range.

      2. 'type', which is one of 'continuous', 'discrete', 'categorical'. For example, if the user wishes to provide a custom tune search space for ETSForecaster:

      search_space = {
          'ETSForecaster': {
              'error': {'range': ['add', 'mul'], 'type': 'categorical'},
              'damped_trend': {'range': [True, False], 'type': 'categorical'}
          }
      }

    • Type 2: Fixed key values, where the value of a hyperparameter is fixed.

      For example, if the user wishes to fix a hyperparameter for ETSForecaster:

      search_space = {
          'ETSForecaster': {
              'error': 'add',
              'damped_trend': True
          }
      }

    • Type 3: If the search space of a model is an empty dictionary, then Model Tune is disabled for that model.

    • Type 4: If a key value contains a mixed configuration, some hyperparameters are fixed and the others have a search space. For example:

      search_space = {
          'ETSForecaster': {
              'error': 'add',
              'damped_trend': {'range': [True, False], 'type': 'categorical'}
          }
      }

    • To disable Model Tune for all models, set search_space = {} .

    • If None , the default search space defined inside AutoML is used.

    • If all the hyperparameters are fixed for a model, the HyperParameterOptimization step will be skipped for that model. Otherwise, the remaining non-fixed parameters will be tuned.

  • max_tuning_trials ( int , dict or None , default=None ) –

    The maximum number of HPO trials; it may be exceeded slightly.

    • If None : AutoML automatically determines when enough HPO trials have been completed.

    • If an integer: the maximum number of trials for each algorithm. That is, if n_algos_tuned == 2 , then up to 2 * max_tuning_trials are performed in total.

    • If a dict : the parameter is specified per algorithm, e.g., {'LogisticRegression': 100, 'RandomForestClassifier': 200} . Missing values in the dictionary default to None.

    Default value (if not previously set): None

  • search_strategy ( str ) – The search strategy used in model tuning. Valid search_strategy values: TPESampler, BruteForceSampler, CmaEsSampler, MOTPESampler, NSGAIISampler, QMCSampler, RandomSampler, HyperGD. Suggested default: 'HyperGD'

  • time_series_period ( int or None , default=None ) – The seasonality period to force on the time series, regardless of whether it is detected in the data. If None, AutoML infers the seasonality by inspecting the training data; set this value to override that inference.
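The four search_space value types described above can be told apart by inspecting each hyperparameter entry. The following is a minimal sketch in plain Python; split_search_space is a hypothetical helper written for illustration and is not part of AutoMLx. It assumes that an entry is tunable exactly when it is a dict carrying both 'range' and 'type'.

```python
# Hypothetical helper (not an AutoMLx function): it only inspects a
# search_space dictionary and never calls the library.
def split_search_space(model_space):
    """Separate fixed hyperparameters from those that still need tuning."""
    fixed, tunable = {}, {}
    for name, spec in model_space.items():
        # Assumption: a dict with both 'range' and 'type' is a search range
        # (Type 1); any other value is treated as a fixed setting (Type 2).
        if isinstance(spec, dict) and {"range", "type"}.issubset(spec):
            tunable[name] = spec
        else:
            fixed[name] = spec
    return fixed, tunable


# Mixed configuration (Type 4): one fixed value, one search range.
search_space = {
    "ETSForecaster": {
        "error": "add",
        "damped_trend": {"range": [True, False], "type": "categorical"},
    }
}

fixed, tunable = split_search_space(search_space["ETSForecaster"])
print(fixed)          # {'error': 'add'}
print(list(tunable))  # ['damped_trend']

# An empty model entry (Type 3) has nothing to tune.
assert split_search_space({}) == ({}, {})
```

Under this reading, a model whose tunable part is empty would skip the tuning step, which matches the rule stated for fully fixed configurations.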

fit ( self , y , X = None , X_valid = None , y_valid = None , cv = 'auto' , col_types = None , time_budget = - 1 )

Automatically identifies the most relevant features, model and hyperparameters for a given training data ( X ) and target ( y ). Final model fit is conducted on a full dataset.

Parameters
  • y ( pandas.DataFrame ) – Training dataset target.

  • X ( pandas.DataFrame or None , default=None ) – A dataframe of explanatory variables that support the target timeseries in y. These must be known in advance for both the forecast period and the training period.

  • X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features

  • y_valid ( pandas.DataFrame or None , default=None ) – Validation dataset target

  • cv ( int , str or None , default='auto' ) –

    Determines the cross-validation split. Possible inputs for cv are:

    • None: uses X_valid and y_valid for validation

    • 'auto': uses 5 folds if number of instances < 1M, disables cv-folds otherwise

    • integer: specifies the number of folds in a (Stratified)KFold

    • iterable: yields (train, test) splits as arrays of indices.

    For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.

  • col_types ( List [ str ] or None , default=None ) – List of length X.shape[1] with string values indicating type of features. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, and ‘timedelta’

  • time_budget ( Dict [ str , float ] , float , default=-1 ) –

    If float:

    Time budget in seconds.

    If Dict[str, float]:

    Time budget for each step in seconds. Step names are: ModelSelection , ModelTune

    • -1 for unconstrained time budget: best effort mode is enabled and optimization continues until convergence.
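The cv = 'auto' rule listed above can be written out as a small function. This is only an illustrative sketch: resolve_auto_cv is a hypothetical name, and the threshold and fold count are taken directly from the description (5 folds below one million instances, no cross-validation folds otherwise).

```python
# Hypothetical helper (not an AutoMLx function) restating the documented
# behaviour of cv='auto'.
def resolve_auto_cv(n_instances, threshold=1_000_000, folds=5):
    """Return the number of CV folds, or None when cv-folds are disabled."""
    return folds if n_instances < threshold else None


print(resolve_auto_cv(50_000))     # 5
print(resolve_auto_cv(2_000_000))  # None
```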

Returns

self

Return type

AutoMLPipeline

predict ( self , X )

Predict the target for the time steps in X . For a simpler API to predict the target only for time steps in the future, use forecast .

Parameters

X ( pandas.DataFrame ) – A dataframe of explanatory variables that support the target timeseries in y . Predictions will be given for the time steps in X .

Raises
  • AutoMLxNotFittedError – If the pipeline is not fitted yet

  • AutoMLxRuntimeError – If there are no predictions after calling the model over the given dataset

  • AutoMLxRuntimeError – If result of time series numerical inverse transform is None

Returns

y_pred – A data frame containing the predicted values.

Return type

pandas.DataFrame

forecast ( self , periods , alpha = 0.05 , X = None )

Forecast future values of the target.

Parameters
  • periods ( int ) – The number of time steps to forecast from the end of the sample.

  • alpha ( float , default=0.05 ) – A significance level. To receive a prediction interval of 95% alpha must be set to 0.05.

  • X ( pandas.DataFrame , or None , default=None ) – A dataframe of explanatory variables that support the forecast for periods number of timestamps. The index should begin immediately after the last index in y (as provided to fit ). The columns must match the ones used in fit .

Returns

A dataframe with three columns listing prediction, ci_lower and ci_upper for the given confidence interval (CI) provided by level of alpha. Note: CI columns are excluded for models that don’t support intervals.

Return type

pandas.DataFrame

Raises
  • AutoMLxNotFittedError – If the pipeline is not fitted yet.

  • AutoMLxValueError – If explanatory variables are not provided, complete, or length of explanatory variables not equal to requested periods.
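The relationship between alpha and the width of the returned interval is simple arithmetic: the nominal coverage is 1 - alpha. The sketch below only restates that relationship; interval_coverage is a hypothetical helper and not an AutoMLx function.

```python
# Hypothetical helper (not an AutoMLx function): maps a significance level
# to the nominal coverage of the corresponding interval.
def interval_coverage(alpha):
    """Return the nominal interval coverage for a significance level alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return 1.0 - alpha


# alpha = 0.05 corresponds to a 95% interval, alpha = 0.2 to an 80% interval.
for alpha in (0.05, 0.2):
    print(f"alpha={alpha} -> {interval_coverage(alpha):.0%} interval")
```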

plot_forecast ( self , predictions , show_y = True , show_pi = True , additional_frames = None )

Plot the forecasts.

Parameters
  • predictions ( pd.DataFrame ) – A dataframe containing columns mean, pi_lower (optional) and pi_upper (optional)

  • show_y ( bool , default=True ) – If True, plots training series y

  • show_pi ( bool , default=True ) – if True, plots Prediction Intervals (PI) when available

  • additional_frames ( dictionary of pd.DataFrame , optional ) – Plots the dataframes to the same axes, e.g., additional_frames = dict(label1=dataframe1, label2=dataframe2)

Return type

A plotly figure.

Raises

AutoMLxValueError – If predictions column names are incorrect.

score ( self , X , y )

Score of this pipeline for a given set of features ( X ) and labels ( y ). If inferred_score_metric has multiple score metrics, the first score metric will be calculated.

Parameters
  • X ( pd.DataFrame ) – Training dataset features

  • y ( pd.DataFrame , pd.Series ) – Training dataset target

Raises

AutoMLxNotFittedError – If the pipeline is not fitted yet

Returns

score – Score of self.predict(X) with respect to y .

Return type

float

transform ( self , X , y )

Apply automatic preprocessing to a given set of features ( X ) and labels ( y ).

Parameters
Raises

AutoMLxNotFittedError – If the pipeline is not fitted.

Returns

Transformed dataset features, transformed dataset timeseries

Return type

(pd.DataFrame or None, pd.DataFrame or pd.Series or None)

refit ( self , X , y , X_valid = None , y_valid = None )

Refit previously tuned AutoML Pipeline with new data. Pipeline stages are not re-run on the new datasets. All settings and parameters from Model Selection, Feature Selection, Model Tune are re-used. fit must have been called before calling this method. If validation set is provided, it will be concatenated with training set before doing the refit.

Parameters
Returns

self

Return type

AutoMLPipeline

train ( self , X , y , X_valid = None , y_valid = None , cv = 'auto' , col_types = None , time_budget = - 1 )

Automatically identifies the most relevant model and hyperparameters for this given set of features ( X ) and target ( y ). Does not conduct final model fit. If the latter is desired, use fit .

Parameters
  • X ( pandas.DataFrame ) – Training dataset features. Optional for forecasting task.

  • y ( pandas.DataFrame , pandas.Series ) – Training dataset target.

  • X_valid ( pandas.DataFrame or None , default=None ) – Validation dataset features

  • y_valid ( pandas.DataFrame , pandas.Series or None , default=None ) – Validation dataset target

  • cv ( int , str or None , default='auto' ) –

    Determines the cross-validation split. Possible inputs for cv are:

    • None: uses X_valid and y_valid for validation

    • 'auto': uses 5 folds if number of instances < 1M, disables cv-folds otherwise

    • integer: specifies the number of folds in a (Stratified)KFold

    • iterable: yields (train, test) splits as arrays of indices.

    For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.

  • col_types ( list of strings or None , default=None ) – List of length X.shape[1] with string values indicating type of features. Supported types are: ‘categorical’, ‘numerical’, ‘text’, ‘datetime’, and ‘timedelta’. If not None , it manually specifies the type of every dataset feature.

  • time_budget ( Dict [ str , float ] , float or None , default=-1 ) –

    If float:

    Time budget in seconds.

    If Dict[str, float]:

    Time budget for each step in seconds. Step names are: ModelSelection , ModelTune , FeatureSelection , AdaptiveSampling , ThresholdTuning

    • -1 for unconstrained time budget: best effort mode is enabled and optimization continues until convergence.

Returns

self

Return type

AutoMLPipeline

AutoRecommender

class AutoRecommender

Recommender System AutoMLPipeline

ranked_models_

List of model names ranked in order of their quality from the last fit call.

Type

List[ str ]

selected_model_

Name of the best model selected by AutoML.

Type

str

selected_model_params_

Dictionary containing the optimal hyperparameters for the selected model. Keys are hyperparameter names with their corresponding values.

Type

dict

pipelines_

Sorted list of pipelines (length equal to n_algos_tuned ), with 0th element being the best pipeline.

Type

list

completed_trials_summary_

All trials performed by the AutoML Pipeline. The columns are: Algorithm, # Samples, # Features, Mean Validation Score, Hyperparameters, Runtime, Memory Usage, Step, where the hyperparameters are a dict.

Type

pandas.DataFrame

completed_trials_detailed_

A DataFrame indicating all trials performed by the AutoML Pipeline with all score metrics’ values. The columns are: Algorithm, # Samples, # Features, Hyperparameters, Features, Step, Scoring Metric, CV Fold ID, Score, Runtime, Memory Usage, # Epochs.

Type

pandas.DataFrame

n_jobs_

Parallelism internally used by AutoML. Calculated as inter_model_parallelism*intra_model_parallelism .

Type

int

configure ( self , score_metric = None , random_state = None , n_algos_tuned = None , model_list = None , optimization = None , preprocessing = None , search_space = None , max_tuning_trials = None , search_strategy = None )

Configure the AutoRecommender

If an argument is set to None, then its value is not changed and the default value is used.

Parameters

score_metric

One or more score functions (or loss functions) to be computed for each candidate model and hyperparameter configuration.

  • If None : it will be determined automatically depending on the task. Default score metric (if not previously set): recommendation: hit_rate

  • If a string: automatically infers the scoring metric from the string.

Available score metrics:
hits:

Computes the number of relevant recommendations made at K, i.e., the top K recommendations made by the model that matched an actual interaction of a user.

hit_rate:

Computes Hit Rate At K as the number of users to whom at least one relevant item was correctly recommended, divided by the total number of users.

precision:

Computes Precision At K, a measure of how many of the top K recommended items are in the set of true relevant items for all users, without taking into account the order for the computation of the metric.

\[precision@K={\frac{1}{U}}{\sum_{i=1}^{U}{\sum_{j=1}^{K}{\frac{rel_i(r_j)}{K}}}}\]
recall:

Computes Recall At K, the fraction of all relevant items that appear among the top K recommendations, without taking into account the order for the computation of the metric.

\[recall@K={\frac{1}{U}}{\sum_{i=1}^{U}{\sum_{j=1}^{K}{\frac{rel_i(r_j)}{Q_i}}}}\]
map:

Computes Mean Average Precision At K as the sum of every average precision of every user, divided by the number of users. The MAP is meant to calculate average precision for the relevant items in the test set, so it is normalized by the cutoff K or size of interactions for users with less than K interactions in the test set. The mean AP can be defined as the sum for every k of the precision at k, for every value 1 <= k <= K, multiplied by the delta recall.

\[MAP@K={\frac{1}{U}} {\sum_{i=1}^{U} {\frac{AP@K(i)}{\min(Q_i, K)}}}\]
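The metric definitions above can be made concrete with a small pure-Python sketch. All function names are hypothetical and this is not how AutoMLx computes its scores internally. It assumes that recommended maps each user to a ranked list of items, that relevant maps each user to a non-empty set of true items (so Q_i >= 1), and that AP@K(i) is the sum of precision@k over the ranks k at which a relevant item appears.

```python
# Illustrative implementations of the ranking metrics described above.
# Assumption: every user has at least one relevant item (Q_i >= 1).

def _hit_count(recs, rel, k):
    """Number of the top-k recommended items that are relevant."""
    return sum(1 for item in recs[:k] if item in rel)

def hits_at_k(recommended, relevant, k):
    return sum(_hit_count(r, relevant[u], k) for u, r in recommended.items())

def hit_rate_at_k(recommended, relevant, k):
    users_hit = sum(1 for u, r in recommended.items()
                    if _hit_count(r, relevant[u], k) > 0)
    return users_hit / len(recommended)

def precision_at_k(recommended, relevant, k):
    return sum(_hit_count(r, relevant[u], k) / k
               for u, r in recommended.items()) / len(recommended)

def recall_at_k(recommended, relevant, k):
    return sum(_hit_count(r, relevant[u], k) / len(relevant[u])
               for u, r in recommended.items()) / len(recommended)

def map_at_k(recommended, relevant, k):
    total = 0.0
    for user, recs in recommended.items():
        n_hits, ap_sum = 0, 0.0
        for rank, item in enumerate(recs[:k], start=1):
            if item in relevant[user]:
                n_hits += 1
                ap_sum += n_hits / rank  # precision at this rank
        # Normalize by min(Q_i, K), as in the MAP@K formula.
        total += ap_sum / min(len(relevant[user]), k)
    return total / len(recommended)

recommended = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
relevant = {"u1": {"a", "c", "x"}, "u2": {"z"}}

print(hits_at_k(recommended, relevant, 3))      # 2
print(hit_rate_at_k(recommended, relevant, 3))  # 0.5
print(round(map_at_k(recommended, relevant, 3), 4))  # 0.2778
```

In this toy example only u1 receives relevant items (at ranks 1 and 3), so precision@3 and recall@3 both average to 1/3 across the two users.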

random_state int or None, default=None

Random seed used by AutoML. Default value (if not previously set): 7

n_algos_tuned int or None, default=None

Number of algorithms that are optimized by the AutoML pipeline. Higher values might make AutoML more accurate at the expense of runtime.

  • To disable algorithm selection, set n_algos_tuned = len(model_list) .

Default value (if not previously set): 1

model_list List[str] or None, default=None

Models that will be evaluated by the Pipeline. Users can specify built-in models by name or pass in any custom model. Custom models must have their hyper-parameter configuration spaces defined in search_space. Supported built-in models per task:

recommendation – AlsRecommender, ItemKNNRecommender, BprRecommender, TRexxRecommender

Available models:
  • AlsRecommender :

    Alternating Least Squares (ALS) is a recommendation algorithm that uses Collaborative Filtering based on Matrix Factorization. reference:

  • BprRecommender :

    Bayesian Personalized Ranking (BPR) computes each user's item rankings using a maximum posterior estimator. reference:

  • ItemKNNRecommender :

    ItemKNN is a model that internally computes an item-item similarity matrix based on observed co-interactions from users. To produce recommendations, it uses the user interaction history and combines the item vectors of the interacted items to find similar items. If weights were originally put on interactions, they are scaled with the item reciprocal ranks. reference: https://dl.acm.org/doi/10.1145/963770.963776

  • TRexxRecommender :

    T-Rexx is a deep learning model that provides sequence-aware recommendations. It extracts users’ preferences using a multi-head self-attention mechanism. Learned user and item embeddings are combined into predictions via a sampled softmax. It is a hybrid between the SDM model and the SASRec model. reference:

optimization int or None, default=None

Sets AutoML speed, accuracy, and reproducibility level. To be used by advanced users only.

  • Level 0: Optimized for reproducibility (controls most randomness)

  • Level 3: Optimized for speed and accuracy

  • Level 10: Optimized for speed

Default value (if not previously set): 3

preprocessing bool or None, default=None

Not supported for AutoRecommender. Has no effect on this class.

search_space dict or None, default=None

This parameter defines the Model Tuning search space. This parameter is required for custom models. Dictionary keys are algorithm names (str) with search space as the key value. We support 4 types of Key values:

  • Type 1: the search space key values must have two parameters:
    1. 'range', which is a list containing the range.

    2. 'type', which is one of 'continuous', 'discrete', 'categorical'. For example, if the user wishes to provide a custom tune search space for TRexxRecommender:

    search_space = {
        'TRexxRecommender': {
            'dropout_rate': {'range': [0.03125, 0.5], 'type': 'continuous'},
            'optimizer_name': {'range': ['lazyadam', 'adam'], 'type': 'categorical'},
            'dnn_activation': {'range': ['tanh', 'relu'], 'type': 'categorical'}
        }
    }

  • Type 2: Fixed key values, where the value of a hyperparameter is fixed.

    For example, if the user wishes to fix a hyperparameter for TRexxRecommender:

    search_space = {
        'TRexxRecommender': {
            'dropout_rate': 0.5,
            'optimizer_name': 'adam'
        }
    }

  • Type 3: If the search space of a model is an empty dictionary, then Model Tune is disabled for that model.

  • Type 4: If a key value contains a mixed configuration, some hyperparameters are fixed and the others have a search space. For example:

    search_space = {
        'TRexxRecommender': {
            'dropout_rate': 0.5,
            'optimizer_name': 'adam',
            'dnn_activation': {'range': ['tanh', 'relu'], 'type': 'categorical'}
        }
    }

  • To disable Model Tune for all models, set search_space = {} .

  • If None , the default search space defined inside AutoML is used.

  • If all the hyperparameters are fixed for a model, the tuning step will be skipped for that model. Otherwise, the remaining non-fixed parameters will be tuned.

max_tuning_trials int, dict or None, default=None

The maximum number of HPO trials; it may be exceeded slightly.

  • If None : AutoML automatically determines when enough HPO trials have been completed.

  • If an integer: the maximum number of trials for each algorithm. That is, if n_algos_tuned == 2 , then up to 2 * max_tuning_trials are performed in total.

  • If a dict : the parameter is specified per algorithm, e.g., {'AlsRecommender': 100, 'ItemKNNRecommender': 200} . Missing values in the dictionary default to None.

Default value (if not previously set): None

search_strategy str or None, default=None

The search strategy used in Model Tune. Valid search_strategy values: TPESampler, BruteForceSampler, CmaEsSampler, MOTPESampler, NSGAIISampler, QMCSampler, RandomSampler, HyperGD. Default value (if not previously set): 'HyperGD'
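The three accepted forms of max_tuning_trials (None, an integer, or a per-algorithm dict) imply different nominal caps on the total number of HPO trials. The sketch below works through that arithmetic; max_total_trials is a hypothetical helper, not an AutoMLx function, and the result is only a nominal bound since the documentation notes the limit may be exceeded slightly.

```python
# Hypothetical helper (not an AutoMLx function): nominal cap on the total
# number of HPO trials across the algorithms that get tuned.
def max_total_trials(tuned_algorithms, max_tuning_trials):
    """Return the nominal trial cap, or None when AutoML decides on its own."""
    if max_tuning_trials is None:
        return None
    if isinstance(max_tuning_trials, int):
        # The integer applies to each tuned algorithm separately.
        return len(tuned_algorithms) * max_tuning_trials
    # Dict form: algorithms missing from the dict default to None (no cap).
    limits = [max_tuning_trials.get(name) for name in tuned_algorithms]
    if any(limit is None for limit in limits):
        return None
    return sum(limits)


algos = ["AlsRecommender", "ItemKNNRecommender"]
print(max_total_trials(algos, 50))                        # 100
print(max_total_trials(algos, {"AlsRecommender": 100,
                               "ItemKNNRecommender": 200}))  # 300
print(max_total_trials(algos, {"AlsRecommender": 100}))   # None
```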

Raises

AutoMLxValueError – If preprocessing arg is given a value

fit ( self , data , col_types , data_valid = None , time_budget = - 1 )

Automatically identifies the optimal model and hyperparameters for the given training data ( data ). Final model fit is conducted on a full dataset.

Parameters
  • data ( pandas.DataFrame ) – Training dataset.

  • col_types ( dict or list ) –

    Dict or list with string values indicating the type of features of the dataset or their role.

    For a dict:

    Mandatory values, which correspond to column-name keys:

    • "recommendation_subject" : indicates the column with the ids of the subjects that receive recommendations.

    • "recommendation" : indicates the column with the ids of the recommendations.

    Example:

    col_types = {
        "movie_id": "recommendation",
        "user_id": "recommendation_subject"
    }

    where "movie_id" is the column to recommend from and "user_id" is the column to recommend to.

    Additional columns can be added to indicate their type. Example:

    col_types = {
        "movie_id": "recommendation",
        "user_id": "recommendation_subject",
        "rating": "numerical"
    }

    For a list:

    Mandatory values:

    • "recommendation_subject" : indicates the column with the ids of the subjects that receive recommendations.

    • "recommendation" : indicates the column with the ids of the recommendations.

    The values of this list follow the order of the columns of the training data DataFrame. "recommendation_subject" and "recommendation" must be placed at the positions of the corresponding columns. The rest of the columns expect a type value. The length of the list must be equal to the number of columns of the training data DataFrame. Example: col_types = ["recommendation_subject", "recommendation", "numerical"] can be passed for a dataframe with the columns ["user_id", "movie_id", "rating"] when values from the "movie_id" column should be recommended to values from the "user_id" column.

    Supported types are:

    • "categorical" : for columns to be interpreted as categoricals regardless of their data type.

    • "numerical" : for int, float and double type.

    • "text" : for str type that consists of multiple words.

  • data_valid ( pandas.DataFrame or None , default=None ) – Validation dataset.

  • time_budget ( Dict [ str , float ] , float , default=-1 ) –

    If float:

    Time budget in seconds.

    If Dict[str, float]:

    Time budget for each step in seconds. Step names are: ModelSelection , ModelTune

    • -1 for unconstrained time budget: best effort mode is enabled and optimization continues until convergence.

Returns

self

Return type

automlx._interface.automl_pipeline.AutoMLPipeline

predict ( self , subjects , new_data = None , n_recommendations = 10 , repeat_recommendations = False )

Predict labels for given subjects.

Parameters
  • subjects ( pandas.DataFrame ) – Ids of the subjects to generate predictions for.

  • new_data ( pandas.DataFrame ) – Additional new context to be considered in the subject predictions.

  • n_recommendations ( int ) – For each subject id, n_recommendations are predicted.

  • repeat_recommendations ( bool ) – Enables/disables the repetition of predictions.

Returns

A data frame containing two columns: the subject ids along with the predicted subject recommendations.

Return type

pandas.DataFrame

recommend ( self , subjects , new_data = None , n_recommendations = 10 , repeat_recommendations = False )

Recommend labels for given subjects.

Parameters
  • subjects ( pandas.DataFrame ) – Ids of the subjects to generate recommendations for.

  • new_data ( pandas.DataFrame ) – Additional new context to be considered in the subject recommendations.

  • n_recommendations ( int ) – For each subject id, n_recommendations are predicted.

  • repeat_recommendations ( bool ) – Enables/disables the repetition of recommendations.

Raises

AutoMLxNotImplementedError – If new_data is provided, or if repeat_recommendations is passed as True.

Returns

A data frame containing two columns: the subject ids along with the predicted subject recommendations.

Return type

pandas.DataFrame

score ( self , data , score_metric = None , n_recommendations = 10 )

Score of this pipeline for a given dataset ( data ).

Parameters
  • data ( pd.DataFrame ) – Training dataset.

  • score_metric ( str ) – recommendation metric to evaluate (e.g. hit_rate, precision, etc…)

  • n_recommendations ( int ) – Top N recommendations to consider in evaluation.

Returns

score – Score of evaluation

Return type

float

train_test_split ( data , col_types )

Split the given dataset in two by using the leave-one-last-split approach.

The split generates a train/test split by putting a fraction of the last interaction according to chronological order of each recommendation subject in the testing set and leaves the remaining ones in the train dataset.

Parameters
  • data ( pandas.DataFrame ) – Dataset to split in AutoRecommender pipeline.

  • col_types ( dict or list ) –

    Dict or list with string values indicating the type of features of the dataset or their role.

    For a dict:

    Mandatory values, which correspond to column-name keys:

    • "recommendation_subject" : indicates the column with the ids of the subjects that receive recommendations.

    • "recommendation" : indicates the column with the ids of the recommendations.

    Example:

    col_types = {
        "movie_id": "recommendation",
        "user_id": "recommendation_subject"
    }

    where "movie_id" is the column to recommend from and "user_id" is the column to recommend to.

    Additional columns can be added to indicate their type. Example:

    col_types = {
        "movie_id": "recommendation",
        "user_id": "recommendation_subject",
        "rating": "numerical"
    }

    For a list:

    Mandatory values:

    • "recommendation_subject" : indicates the column with the ids of the subjects that receive recommendations.

    • "recommendation" : indicates the column with the ids of the recommendations.

    The values of this list follow the order of the columns of the training data DataFrame. "recommendation_subject" and "recommendation" must be placed at the positions of the corresponding columns. The rest of the columns expect a type value. The length of the list must be equal to the number of columns of the training data DataFrame. Example: col_types = ["recommendation_subject", "recommendation", "numerical"] can be passed for a dataframe with the columns ["user_id", "movie_id", "rating"] when values from the "movie_id" column should be recommended to values from the "user_id" column.

    Supported types are:

    • "categorical" : for columns to be interpreted as categoricals regardless of their data type.

    • "numerical" : for int, float and double type.

    • "text" : for str type that consists of multiple words.
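The list and dict forms of col_types carry the same information, so one can be derived from the other once the column order is known. The sketch below shows that mapping in plain Python; col_types_to_dict is a hypothetical helper and not an AutoMLx function, and the validation it performs simply mirrors the requirements stated above (matching length, both mandatory roles present).

```python
# Hypothetical helper (not an AutoMLx function): converts the list form of
# col_types into the equivalent dict form, given the dataframe column names.
MANDATORY_ROLES = ("recommendation_subject", "recommendation")

def col_types_to_dict(columns, col_types):
    """Pair each column name with its declared type or role."""
    if len(columns) != len(col_types):
        raise ValueError("col_types must have one entry per column")
    mapping = dict(zip(columns, col_types))
    for role in MANDATORY_ROLES:
        if role not in mapping.values():
            raise ValueError(f"missing mandatory role: {role}")
    return mapping


columns = ["user_id", "movie_id", "rating"]
as_list = ["recommendation_subject", "recommendation", "numerical"]
as_dict = col_types_to_dict(columns, as_list)
print(as_dict["movie_id"])  # recommendation
```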

Raises

AutoMLxValueError – If timestamp is passed as None

Returns

Two dataframes (train, test), indexed by timestamp.

Return type

Tuple[pd.DataFrame, pd.DataFrame]
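The idea behind a leave-last-out split can be sketched without any library. The example below is a simplification made for illustration and not the AutoMLx implementation: it assumes interactions are (subject, item, timestamp) tuples and holds out exactly one interaction per subject (the chronologically last one), whereas the method described above may hold out a different share. The name leave_last_out is hypothetical.

```python
# Simplified illustration of a leave-last-out split (hypothetical helper).
# Assumption: exactly one interaction, the latest, is held out per subject.
def leave_last_out(interactions):
    """Split (subject, item, timestamp) tuples into train and test lists."""
    last = {}
    for row in interactions:
        subject, _, timestamp = row
        # Keep the most recent interaction seen so far for each subject.
        if subject not in last or timestamp > last[subject][2]:
            last[subject] = row
    held_out = set(last.values())
    by_time = lambda row: row[2]
    test = sorted(held_out, key=by_time)
    train = sorted((r for r in interactions if r not in held_out), key=by_time)
    return train, test


interactions = [
    ("u1", "a", 1), ("u1", "b", 2), ("u2", "c", 1),
    ("u2", "d", 3), ("u1", "e", 5),
]
train, test = leave_last_out(interactions)
print(test)  # [('u2', 'd', 3), ('u1', 'e', 5)]
```

Every subject that appears in the test list still has its earlier interactions in the train list, which is what lets a fitted model be scored on the held-out items.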

refit ( self , data , data_valid = None )

Refit previously tuned AutoML Pipeline with new data. Pipeline stages are not re-run on the new datasets. All settings and parameters from Model Selection and Model Tune are re-used. fit must have been called before calling this method. If validation set is provided, it will be concatenated with training set before doing the refit.

Parameters
Returns

self

Return type

automlx._interface.automl_pipeline.AutoMLPipeline

train ( self , data , col_types , data_valid = None , time_budget = - 1 )

Automatically identifies the optimal model and hyperparameters for this given dataset ( data ). Does not conduct final model fit. If the latter is desired, use fit .

Parameters
  • data ( pandas.DataFrame ) – Training dataset.

  • col_types ( dict or list ) –

    Dict or list with string values indicating the type of features of the dataset or their role.

    For a dict:

    Mandatory values, which correspond to column-name keys:

    • "recommendation_subject" : indicates the column with the ids of the subjects that receive recommendations.

    • "recommendation" : indicates the column with the ids of the recommendations.

    Example:

    col_types = {
        "movie_id": "recommendation",
        "user_id": "recommendation_subject"
    }

    where "movie_id" is the column to recommend from and "user_id" is the column to recommend to.

    Additional columns can be added to indicate their type. Example:

    col_types = {
        "movie_id": "recommendation",
        "user_id": "recommendation_subject",
        "rating": "numerical"
    }

    For a list:

    Mandatory values:

    • "recommendation_subject" : indicates the column with the ids of the subjects that receive recommendations.

    • "recommendation" : indicates the column with the ids of the recommendations.

    The values of this list follow the order of the columns of the training data DataFrame. "recommendation_subject" and "recommendation" must be placed at the positions of the corresponding columns. The rest of the columns expect a type value. The length of the list must be equal to the number of columns of the training data DataFrame. Example: col_types = ["recommendation_subject", "recommendation", "numerical"] can be passed for a dataframe with the columns ["user_id", "movie_id", "rating"] when values from the "movie_id" column should be recommended to values from the "user_id" column.

    Supported types are:

    • "categorical" : for columns to be interpreted as categoricals regardless of their data type.

    • "numerical" : for int, float and double type.

    • "text" : for str type that consists of multiple words.

  • data_valid ( pandas.DataFrame or None , default=None ) – Validation dataset.

  • time_budget ( Dict [ str , float ] , float or None , default=-1 ) –

    If float:

    Time budget in seconds.

    If Dict[str, float]:

    Time budget for each step in seconds. Step names are: ModelSelection , ModelTune

    • -1 for unconstrained time budget: best effort mode is enabled and optimization continues until convergence.

Returns

self

Return type

automlx._interface.automl_pipeline.AutoMLPipeline