Data Transformations

When datasets are loaded with ADS Dataset Factory, they can be transformed and manipulated easily with the built-in functions. Underlying, an ADSDataset object is a Dask dataframe. Any operation that can be performed to a Dask dataframe can also be applied to an ADS Dataset.

Loading the Dataset

You need to import some of the necessary libraries and modules to run the dataset transformation examples that follow.

import numpy as np
from dask import dataframe as dd

from ads.dataset.factory import DatasetFactory
from ads.dataset.dataset_browser import DatasetBrowser

import ads

Datasets can be loaded into ADS from multiple sources. ADS supports loading of datasets directly from libraries and websites, such as Sklearn using DatasetBrowser.

Datasets are provided as a convenience. Datasets are considered Third Party Content and are not considered Materials under Your agreement with Oracle applicable to the Services. Review the dataset license.

This example loads the wine dataset from the Sklearn library:

sklearn = DatasetBrowser.sklearn()
wine_ds = sklearn.open('wine')

wine_ds is an ADSDataset object. You can extract the Dask dataframe object directly by pulling the ddf attribute:

wine_ds.ddf

Applying Automated Transformations to the Dataset

ADS has built in automatic transform tools for datasets. When the get_recommendations() tool is applied to an ADSDataset object, it shows the user detected issues with the data and recommends changes to apply to the dataset. You can accept the changes is as easy as clicking a button in the drop down menu. After all the changes are applied, the transformed dataset can be retrieved by calling get_transformed_dataset().

wine_ds.get_recommendations()

Alternatively, you can use auto_transform() to apply all the recommended transformations at once. auto_transform() returns a transformed dataset with several optimizations applied automatically. The optimizations include:

  • Dropping constant and primary key columns, which has no predictive quality.

  • Imputation to fill in missing values in noisy data.

  • Dropping strongly co-correlated columns that tend to produce less generalizable models.

  • Balancing a dataset using up or down sampling.

One optional argument to auto_transform() is fix_imbalance, which is set to True by default. When True, auto_transform() corrects any imbalance between the classes. ADS downsamples the dominant class first unless there are too few data points. In that case, ADS upsamples the minority class.

ds = wine_ds.auto_transform()

You can visualize the transformation that has been performed on a dataset by calling visualize_transforms().

Note

visualize_transforms() is only applied to the automated transformations and does not capture any custom transformations that you may have applied to the dataset.

ds.visualize_transforms()
../../_images/visual_transform.png

Row Operations

The operations that can be applied to a Dask dataframe can be applied to an ADSDataset object.

Examples of some of the most common row operations you can apply on an ADSDataset object follow.

Deleting rows

Rows within a dataset can be filtered out by row numbers. The index of the new dataset can be reset accordingly.

#Filter out rows by row number and reset index of new data
ds_subset = ds.loc[10:100]
ds_subset = ds_subset.reset_index()

Do not try to insert index into dataset columns.

Reseting index

Reset the index to the default index , see the Dask documentation. When you reset index, the old index is added as a column index and a new sequential index is used. You can use the drop parameter to avoid the old index being added as a column:

ds_subset = ds.loc[10:100]
ds_subset = ds_subset.reset_index(drop=True)
ds_subset.head()

The index restarts at zero for each partition. This is due to the inability to statically know the full length of the index.

Appending rows

New rows can be added to an existing dataset:

#Create new row to be added
row_to_add = ds.loc[0]
row_to_add['target'] = 'class_0'

#Add in new row to existing dataset
new_addition_ds = ds.merge(row_to_add, how = 'outer')

Alternatively, you can use the append() method of a Dask dataframe to achieve a similar result:

ds2 = wine_ds.ddf.append(ds)

The ds2 is created as a Dask DataFrame object.

Row Filtering based on Column Values

Columns can be filtered out by the values:

ds_filtered = ds[(ds['alcohol'] > 13.0) & (ds['malic_acid'] < 2.5)]
ds_filtered.head()

Removing Duplicated Rows

Duplicate rows can removed using the drop_duplicates function:

ds_without_dup = ds.drop_duplicates()

Column Operations

The column operations that can be applied to a pandas dataframe can be applied to an ADS dataset as in the following examples.

Deleting a Column

To delete specific columns from the dataset, the drop_columns function can be used along with names of the columns to be deleted from the dataset. The ravel pandas command returns the flattened underlying data as an ndarray. The name_of_df.columns[:].ravel() command returns the name of all the columns in a dataframe as an array.

ds_subset_columns = ds.drop_columns(['alcohol', 'malic_acid'])
ds_subset_columns.columns[:].ravel()
array(['ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols',
    'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins',
    'color_intensity', 'hue', 'od280/od315_of_diluted_wines',
    'proline', 'target'], dtype=object)

Renaming a Column

Columns can be renamed with the rename_columns() method:

ds_columns_rename = ds.rename_columns({'alcohol': 'alcohol_amount',
                                'malic_acid': 'malic_acid_amount'})
ds_columns_rename.columns[:].ravel()
array(['alcohol_amount', 'malic_acid_amount', 'ash', 'alcalinity_of_ash',
    'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols',
    'proanthocyanins', 'color_intensity', 'hue',
    'od280/od315_of_diluted_wines', 'proline', 'target'], dtype=object)

Obtaining the Counts of Unique Values in a Column

The count per unique value can be obtained with the value_counts() method:

ds['target'].value_counts().compute()
class_1    71
class_0    59
class_2    48
Name: target, dtype: int64

Normalizing a Column

You can apply a variety of normalization techniques to numerical columns (both continuous and discrete). You can leverage the built in max() and min() methods to perform a minmax normalization:

max_alcohol = wine_ds['alcohol'].max()
min_alcohol = wine_ds['alcohol'].min()
alcohol_range = max_alcohol  - min_alcohol
wine_ds.ddf['norm_alcohol'] = (wine_ds['alcohol'] / alcohol_range)

Creating a Column by Combining Other Columns

This example creates a new column by performing operations to combine two or more columns together:

new_feature_col = ((0.4)*wine_ds['total_phenols'] + (0.6)*wine_ds['flavanoids'])
ds_new_feature = wine_ds.assign_column('new_feature', new_feature_col.compute())
ds_new_feature.head()

Alternatively, you can create a new column directly in the Dask dataframe attribute:

new_feature_col = ((0.4)*wine_ds['total_phenols'] + (0.6)*wine_ds['flavanoids'])
wine_ds.ddf['new_feature'] = new_feature_col
wine_ds.head()

To add new column, use a new name for it. You can add anew column and change it by combining with existing column:

noise = np.random.normal(0,.1,wine_ds.shape[0])
ds_noise = wine_ds.assign_column('noise', noise)

ds_ash = ds_noise.assign_column('noise', ds_noise['noise'] + ds_noise['ash'])
ds_ash = ds_ash.rename(columns={'noise':'ash_with_noise'})
ds_ash.head()

The resulting column is renamed with dict-like mapper.

Changing a Column by Values Derived from a Function

You can apply functions to update column values in existing column. This example updates the column in place using lambda expression:

wine_ds.assign_column('proline', lambda x: x is None or x > 1000)
wine_ds.head()

Changing Data Types of Columns

You can change the data type columns with the astype() method. ADS uses the Dask method, astype(), on dataframe objects. For specifics, see astype for a Dask Dataframe, using numpy.dtype, or pandas dtypes.

When you change the type of a column, ADS updates its semantic type to categorical, continuous, datetime, or ordinal. For example, if you update a column type to integer, its semantic type updates to ordinal. For data type details, see ref:loading-data-specify-dtype.

This example converts a dataframe column from float, to the low-level integer type and ADS updates its semantic type to ordinal:

wine_ds = wine_ds.astype(types={'proline': 'int64'})
print(wine_ds.feature_types['proline']['low_level_type'])
print(wine_ds.feature_types['proline']['type'])

# Note: When you cast a float column to integer, you lose precision.
wine_ds['proline'].head()

To convert a column of type float to categorical, you convert it to integer first. This example converts a column data type from float to integer, then to categorical, and then the number of categories in the column is reduced:

# create a new dataset with a renamed column for binned data and update the values
ds = wine_ds.rename_columns({'color_intensity': 'color_intensity_bin'})
ds = ds.assign_column('color_intensity_bin', lambda x: x/3)

# convert the column from float to categorical:
ds = ds.astype(types={'color_intensity_bin': 'int64'})
ds = ds.astype(types={'color_intensity_bin': 'categorical'})

You can use feature_types to see if the semantic data type of the converted column is categorical:

wine_ds.feature_types['color_intensity_bin']['type']
'categorical'

The low-level type of the converted column is category:

ds['color_intensity_bin'].head()
0    1
1    1
2    1
3    2
4    1
Name: color_intensity_bin, dtype: category
Categories (5, int64): [0, 1, 2, 3, 4]

Dataset Manipulation

ADS has built in functions that support categorical encoding, null values and imputation.

Categorical Encoding

ADS has a built in categorical encoder that can be accessed by calling from ads.dataset.label_encoder import DataFrameLabelEncoder. This example encodes the three classes of wine that make up the dataset:

from ads.dataset.label_encoder import DataFrameLabelEncoder
ds_encoded = DataFrameLabelEncoder().fit_transform(ds.to_pandas_dataframe())
ds_encoded['target'].value_counts()
1    71
0    59
2    48

One-Hot Encoding

One-hot encoding transforms one categorical column with n categories into n or n-1 columns with indicator variables. You can prepare one of the columns to be categorical with categories low, medium, and high:

def convert_to_level(value):
    if value < 12:
        return 'low'
    elif value > 13:
        return 'high'
    else:
        return 'medium'

ds = wine_ds
ds = ds.assign_column('alcohol', convert_to_level)

You can use the Pandas method get_dummies() to perform one-hot encoding on a column. Use the prefix parameter to assign a prefix to the new columns that contain the indicator variables. This example creates n columns with one-hot encoding:

data = ds.to_pandas_dataframe()['alcohol'] # data of which to get dummy indicators
onehot = pd.get_dummies(data, prefix='alcohol')

To create n-1 columns, use drop_first=True when converting the categorical column. You can add a one-hot column to the initial dataset with the merge() method:

data = ds.to_pandas_dataframe()['alcohol'] # data of which to get dummy indicators
onehot = pd.get_dummies(data, prefix='alcohol', drop_first=False)
ds_onehot = ds.merge(onehot)

Encoding for all categorical columns can be accomplished with the fit_transform() method:

from ads.dataset.label_encoder import DataFrameLabelEncoder

ds_encoded = DataFrameLabelEncoder().fit_transform(ds_onehot.to_pandas_dataframe())
ds_encoded['alcohol'].value_counts()
0    92
2    67
1    19

To drop the initial categorical column that you transformed into one-hot, use one of these examples:

ds_onehot = ds_onehot.drop_columns('alcohol')   # before ``fit_transform()`` method
# or
ds_encoded = ds_encoded.drop(columns='alcohol') # after ``fit_transform()`` method

Extracting Null Values from Datasets

To detect all nulls in a dataset, use the isnull function to return a boolean dataset matching the dimension of our input:

ds_null = ds.isnull()
np.any(ds_null)
alcohol                         False
malic_acid                      False
ash                             False
alcalinity_of_ash               False
magnesium                       False
total_phenols                   False
flavanoids                      False
nonflavanoid_phenols            False
proanthocyanins                 False
color_intensity                 False
hue                             False
od280/od315_of_diluted_wines    False
proline                         False
target                          False

Imputation

The fillna function ia used to replace null values with specific values. Generate a null value by replacing the entry below a certain value with null, and then imputing it with a value:

ds_with_null = ds.assign_column("malic_acid", lambda x: None if x < 2 else x)
ds_with_null['malic_acid'].head()
0     NaN
1     NaN
2    2.36
3     NaN
4    2.59
Name: malic_acid, dtype: float64
ds_impute = ds_with_null.fillna(method='bfill')
ds_impute['malic_acid'].head()
0    2.36
1    2.36
2    2.36
3    2.59
4    2.59
Name: malic_acid, dtype: float64

Combining Datasets

ADS datasets can be merged and combined together to form a new dataset.

Joining Datasets

You can merge two datasets together with a database-styled join on columns or indexes by specifying the type of join left, right, outer, or inner. These type are defined by:

  • left: Use only keys from the left dataset, similar to SQL left outer join.

  • right: Use only keys from the right dataset, similar to SQL right outer join.

  • inner: Intersection of keys from both datasets, similar to SQL inner join.

  • outer: Union of keys from both datasets, similar to SQL outer join.

This is an example of performing an outer join on two datasets. The datasets are subsets of the wine dataset, and each dataset contains only one class of wine.

ds_class1 = ds[ds['target']=='class_1']
ds_class2 = ds[ds['target']=='class_2']
ds_merged_outer = ds_class1.merge(ds_class2, how='outer')
ds_merged_outer['target'].value_counts().compute()
class_1    71
class_2    48
class_0     0
Name: target, dtype: int64

Concatenating Datasets

Two datasets can be concatenated along a particular axis (vertical or horizontal) with the option of performing set logic (union or intersection) of the indexes on the other axes. You can stack two datasets vertically with:

ds_concat = dd.concat([ds_class1, ds_class2], axis = 0)
ds_concat['target'].value_counts().compute()
class_1    71
class_2    48
class_0     0
Name: target, dtype: int64

Split Dataset into Train, Validation, Test Data

After all data transformations are complete, you can split the data into a train and test or train, test, and validation set. To split data into a train and test set with a train size of 80% and test size of 20%:

from ads.dataset.dataset_browser import DatasetBrowser
sklearn = DatasetBrowser.sklearn()
wine_ds = sklearn.open('wine')
ds = wine_ds.auto_transform()
train, test = ds.train_test_split(test_size=0.2)

For a train, test, and validation set, the defaults are set to 80% of the data for training, 10% for testing, and 10% for validation. This example sets split to 70%, 15%, and 15%:

data_split = wine_ds.train_validation_test_split(
    test_size=0.15,
    validation_size=0.15
)
train, validation, test = data_split
print(data_split)   # print out shape of train, validation, test sets in split

The resulting three data subsets each have separate data (X) and labels (y).

print(train.X)  # print out all features in train dataset
print(train.y)  # print out labels in train dataset

You can split the dataset right after the DatasetFactory.open() statement:

ds = DatasetFactory.open("path/data.csv").set_target('target')
train, test = ds.train_test_split(test_size=0.25)