Data Visualization

Data visualization is an important aspect of data exploration, analysis, and communication. It is generally one of the first steps in any analysis: it lets analysts gain an understanding of the data efficiently, and it guides exploratory data analysis (EDA) and the modeling process.

An efficient and flexible data visualization tool can provide a lot of insight into the data. ADS provides a smart visualization tool. It automatically detects the data type and renders plots that optimally represent the characteristics of the data. Within ADS, custom visualizations can be created using any plotting library.

Automatic Visualization

The ADS show_in_notebook() method creates a comprehensive preview of all the basic information about a dataset including:

  • the predictive data type (for example, regression, binary classification, or multi-class classification),

  • the number of columns and rows,

  • feature type information,

  • summary visualization of each feature,

  • the correlation map,

  • any data conditions that you should be aware of.

To improve plotting performance, the ADS show_in_notebook() method uses an optimized subset of the data. This smart sample is selected so that it is statistically representative of the full dataset at a 95% confidence level. The correlation map is displayed only when the dataset contains only numerical (continuous or ordinal) columns.
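The text above does not say how the smart sample is sized. As a rough illustration only, the sketch below uses Cochran's formula with a finite population correction, a common way to pick a sample size for a given confidence level. The function name, the 1% margin of error, and the formula itself are assumptions made for this example, not the documented ADS behavior.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin=0.01, proportion=0.5):
    """Hypothetical helper: sample size for a target confidence level.

    Uses Cochran's formula with a finite population correction. This is an
    assumption for illustration; ADS may size its sample differently.
    """
    # Two-sided z-score for the requested confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an infinitely large population.
    n0 = z ** 2 * proportion * (1 - proportion) / margin ** 2
    # Shrink the estimate for a finite population.
    n = n0 / (1 + (n0 - 1) / population)
    # Never ask for more rows than the dataset has.
    return min(population, math.ceil(n))


rows_needed = sample_size(1_000_000)
```

Under these assumptions, a dataset with a million rows needs fewer than ten thousand sampled rows, which is why plotting a sample can be much faster than plotting everything.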

ds.show_in_notebook()
../../_images/show_in_notebook_summary.png
../../_images/show_in_notebook_features.png

To visualize the correlation, call the show_corr() method. If the correlation matrices have not been cached, this call triggers the corr() function which calculates the correlation matrices.

corr() uses the following methods to calculate the correlation based on the data types:

  • Continuous-Continuous: Pearson method. The correlations range from -1 to 1.

  • Categorical-Categorical: Cramér's V method. The correlations range from 0 to 1.

  • Continuous-Categorical: Correlation Ratio method. The correlations range from 0 to 1.

Correlations are displayed independently because the correlations are calculated using different methodologies and the ranges are not the same. Consolidating them into one matrix could be confusing and inconsistent.

Note

Continuous features consist of continuous and ordinal types. Categorical features consist of categorical and zipcode types.
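To make the three measures concrete, here is a minimal pure-Python sketch of their textbook definitions. The function names are made up for this example, and ADS may use refinements (for instance, a bias-corrected Cramér's V) that are not described here. The sketch assumes non-constant inputs, so it does not guard against division by zero.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation of two numeric sequences; ranges from -1 to 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def cramers_v(a, b):
    """Cramér's V of two categorical sequences; ranges from 0 to 1."""
    n = len(a)
    observed = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    # Chi-squared statistic over every cell of the contingency table.
    chi2 = 0.0
    for value_a, total_a in count_a.items():
        for value_b, total_b in count_b.items():
            expected = total_a * total_b / n
            chi2 += (observed.get((value_a, value_b), 0) - expected) ** 2 / expected
    k = min(len(count_a), len(count_b)) - 1
    return math.sqrt(chi2 / (n * k))


def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numeric sequence."""
    n = len(values)
    grand_mean = sum(values) / n
    groups = {}
    for category, value in zip(categories, values):
        groups.setdefault(category, []).append(value)
    # Variation explained by the category means versus total variation.
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    total = sum((v - grand_mean) ** 2 for v in values)
    return math.sqrt(between / total)
```

Only the Pearson value carries a sign, which is one reason the three matrices are not comparable on a single scale.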

ds.show_corr(nan_threshold=0.8)
../../_images/show_corr1.png
../../_images/show_corr2.png
../../_images/show_corr3.png

By default, nan_threshold is set to 0.8. This means that if more than 80% of the values in a column are missing, that column is dropped from the correlation calculation. nan_threshold must be between 0 and 1. Other options include:

  • correlation_threshold: Apply a filter to the correlation matrices and display only the pairs whose correlation values are greater than or equal to the correlation_threshold.

  • frac: The portion of the original data on which to calculate the correlation. Defaults to None, which means no sampling is used. frac must be between 0 and 1.

  • overwrite: Defaults to False. Correlation matrices are cached. Set overwrite to True to recalculate the correlation. Note that both corr() and show_corr() trigger the calculation when there is no cached value, whereas show_in_notebook() calculates the correlation only when all columns in the dataset are numerical.

  • plot_type: Defaults to heatmap. Valid values are heatmap and bar. If bar is chosen, correlation_target must also be set, and the bar chart shows only the correlation values of the pairs that include the target.

  • correlation_target: Defaults to None. It can be any column of type continuous, ordinal, categorical, or zipcode. When correlation_target is set, only pairs that contain correlation_target are displayed.
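The filtering options can be read as simple rules over columns and column pairs. The sketch below mimics those rules in plain Python on toy data. The helper names and the data are hypothetical, and the code illustrates the described semantics rather than the ADS implementation.

```python
def drop_sparse_columns(columns, nan_threshold=0.8):
    """Drop columns in which more than nan_threshold of the values are missing (None)."""
    kept = {}
    for name, values in columns.items():
        missing_fraction = sum(v is None for v in values) / len(values)
        if missing_fraction <= nan_threshold:
            kept[name] = values
    return kept


def filter_pairs(correlations, correlation_threshold=None, correlation_target=None):
    """Keep pairs at or above the threshold and, if a target is given, pairs that include it."""
    selected = {}
    for (first, second), value in correlations.items():
        if correlation_threshold is not None and value < correlation_threshold:
            continue
        if correlation_target is not None and correlation_target not in (first, second):
            continue
        selected[(first, second)] = value
    return selected


# Toy data: col02 is exactly 80% missing (kept), col03 is 100% missing (dropped).
columns = {
    "col01": [1.0, 2.0, 3.0, 4.0, 5.0],
    "col02": [None, None, None, None, 1.0],
    "col03": [None, None, None, None, None],
}
kept_columns = drop_sparse_columns(columns)

# Made-up correlation values, for illustration only.
correlations = {("col01", "col02"): 0.9, ("col01", "col04"): 0.2, ("col02", "col04"): 0.5}
strong_pairs = filter_pairs(correlations, correlation_threshold=0.5)
target_pairs = filter_pairs(correlations, correlation_target="col04")
```

Note that a column with exactly 80% missing values is kept, because only columns with more than the threshold are dropped.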

ds.show_corr(correlation_target='col01', plot_type='bar')
../../_images/show_corr4.png

To explore features, use the smart plot() method. It accepts one or two feature names and automatically determines the best type of plot based on the types of the features to be plotted. Call show_in_notebook() on the result to render the plot.

The three examples that follow use a binary classification dataset with 1,500 rows and 21 columns. Thirteen of the columns have a continuous data type, and eight are categorical.

  • A single categorical feature: The plot() method detects that the feature is categorical because it only has the values of 0 and 1. It then automatically renders a plot of the count of each category.

    ds.plot("col02").show_in_notebook(figsize=(4,4))
    
    ../../_images/single_column_count_plot.png
  • Categorical and continuous feature pair: ADS chooses the best plotting method, which is a violin plot.

    ds.plot("col02", y="col01").show_in_notebook(figsize=(4,4))
    
    ../../_images/violin_plot.png
  • A pair of continuous features: ADS chooses a Gaussian heatmap as the best visualization. It generates a scatter plot and assigns a color to each data point based on the local density (Gaussian kernel).

    ds.plot("col01", y="col03").show_in_notebook()
    
    ../../_images/gaussian_heatmap.png
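The density coloring in the last example can be sketched as follows: each point receives a score equal to the average Gaussian kernel weight of all points around it, and that score is then mapped to a color. This is a minimal pure-Python illustration; the function name and the bandwidth value are assumptions, and the actual ADS implementation is not described here.

```python
import math


def gaussian_density(points, bandwidth=1.0):
    """Return a relative local-density score for each (x, y) point.

    Every point contributes a Gaussian kernel weight that decays with
    distance, so points in crowded regions receive higher scores.
    """
    scores = []
    for x0, y0 in points:
        total = 0.0
        for x1, y1 in points:
            squared_distance = (x0 - x1) ** 2 + (y0 - y1) ** 2
            total += math.exp(-squared_distance / (2 * bandwidth ** 2))
        scores.append(total / len(points))
    return scores


# Three clustered points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
scores = gaussian_density(points)
```

The scores could then be passed as the c argument of Matplotlib's scatter() so that dense regions stand out. This double loop is quadratic in the number of points, so it is only practical for small samples.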

Customized Visualization

ADS provides intelligent default options for your plots. However, the visualization API is flexible enough to let you customize your charts or choose your own plotting library. You can use the ADS call() method to select your own plotting routine.
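Judging from the examples in this section, call() hands the dataset's dataframe to your function as the first argument and forwards any extra arguments. The toy class below imitates that pattern with a plain list in place of a dataframe. TinyDataset and describe are hypothetical names for illustration and are not part of ADS.

```python
class TinyDataset:
    """Hypothetical stand-in for an ADS dataset; not the ADS implementation."""

    def __init__(self, rows):
        # The wrapped data (a dataframe in ADS, a plain list here).
        self.rows = rows

    def call(self, func, *args, **kwargs):
        # Pass the wrapped data first, then forward any extra arguments.
        return func(self.rows, *args, **kwargs)


def describe(rows, label):
    """Example routine: any function whose first parameter is the data."""
    return f"{label}: {len(rows)} rows"


result = TinyDataset([1, 2, 3]).call(describe, label="demo")
```

Because the routine is an ordinary function, the same plotting code can be reused outside of a dataset by passing it a dataframe directly.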

Seaborn

In this example, a dataframe is passed directly to the Seaborn pair plot function, which draws a faceted, pairwise plot between all the features in the dataset. The function creates a grid of axes such that each variable in the data is shared on the y-axis across a row and on the x-axis across a column. The diagonal axes are treated differently: each one shows a histogram of a single feature.

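import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris
from ads.dataset.factory import DatasetFactory

# Load the iris measurements into a dataframe, one column per feature.
data = load_iris()
iris_df = pd.DataFrame(data.data, columns=data.feature_names)
sns.set(style="ticks", color_codes=True)
# call() passes the dataset's dataframe to the function; rows with missing values are dropped first.
DatasetFactory.from_dataframe(iris_df).call(lambda df: sns.pairplot(df.dropna()))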
../../_images/pairgrid.png

Matplotlib

  • Using Matplotlib:

import matplotlib.pyplot as plt
import pandas as pd
from numpy.random import randn
from ads.dataset.factory import DatasetFactory

df = pd.DataFrame(randn(1000, 4), columns=list('ABCD'))

def ts_plot(df, figsize):
    # Index the rows by consecutive days; set_index() returns a new dataframe.
    df = df.set_index(pd.date_range('1/1/2000', periods=len(df)))
    # Plot the running total of each column as a time series.
    df = df.cumsum()
    df.plot(figsize=figsize)
    plt.legend(loc='best')

ds = DatasetFactory.from_dataframe(df, target='A')
ds.call(ts_plot, figsize=(7,7))
../../_images/matplotlib.png
  • Using a Pie Chart:

    import pandas as pd
    import matplotlib.pyplot as plt
    from ads.dataset.factory import DatasetFactory
    
    data = {'data': [1109, 696, 353, 192, 168, 86, 74, 65, 53]}
    df = pd.DataFrame(data, index = ['20-50 km', '50-75 km', '10-20 km', '75-100 km', '3-5 km', '7-10 km', '5-7 km', '>100 km', '2-3 km'])
    
    # One explode offset per wedge; larger values pull a wedge farther from the center.
    explode = (0, 0, 0, 0.1, 0.1, 0.2, 0.3, 0.4, 0.6)
    colors = ['#191970', '#001CF0', '#0038E2', '#0055D4', '#0071C6', '#008DB8', '#00AAAA',
            '#00C69C', '#00E28E', '#00FF80']
    
    def pie_plot(df, figsize):
        df["data"].plot(kind='pie', figsize=figsize, fontsize=17, colors=colors, explode=explode)
        plt.axis('equal')
        plt.ylabel('')
        plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
        plt.show()
    
    ds = DatasetFactory.from_dataframe(df)
    ds.call(pie_plot, figsize=(7,7))
    
    ../../_images/piechart.png

Geographic Information System (GIS) Chart

This example uses California earthquake data retrieved from the United States Geological Survey (USGS) earthquake catalog. It plots the locations of major earthquakes.

earthquake.plot_gis_scatter(lon="longitude", lat="latitude")
../../_images/gis_scatter.png

Datasets are provided as a convenience. Datasets are considered Third Party Content and are not considered Materials under Your agreement with Oracle applicable to the Services. The earthquake dataset is in the public domain. It was retrieved from the USGS Earthquake Hazards Program.