Data Visualization

Data visualization is an important component of data exploration and data analysis in modern data science practice. It is one of the first steps in deriving value from data: it lets analysts gain insights efficiently and guides exploratory data analysis (EDA).

An efficient and flexible visualization tool helps data scientists understand their data quickly. ADS offers a smart visualization tool that automatically detects the type of each data column and chooses a suitable way to plot it. You can also create custom visualizations with ADS by using your preferred plotting libraries and packages.

Automatic Visualization

You can call the ADS show_in_notebook() method on a dataset. This creates a comprehensive preview of the basic information about the dataset, including:

  • the type of the dataset (whether it’s a regression, binary classification or multi-class classification),

  • the number of columns and rows, and the feature type of each column,

  • visualization of each column,

  • the correlation map,

  • and a short dataset header.

To improve plotting performance, the ADS show_in_notebook() method plots a sample of the dataset rather than every row. The sample is a smart sample, sized to be statistically significant at a confidence level of 95% with a confidence interval of 1.0.

Use this format:

ds.show_in_notebook()
../../_images/show_in_notebook_summary.png
../../_images/show_in_notebook_features.png
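
The text above does not say how the smart sample size is calculated. As a rough illustration only, the sketch below uses Cochran's formula with a finite population correction, which is one common way to size a sample for a given confidence level and interval. It assumes that a "confidence interval of 1.0" means a margin of error of one percentage point; the function name sample_size and its parameters are hypothetical and are not part of ADS.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence_level=0.95, margin=0.01, proportion=0.5):
    """Cochran's sample size with finite population correction.

    Assumption: this is an illustrative formula, not the ADS implementation.
    """
    # Two-sided z-score for the confidence level (about 1.96 for 95%)
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    # Sample size for an effectively infinite population
    n0 = z ** 2 * proportion * (1 - proportion) / margin ** 2
    # Shrink the estimate when the population is finite
    n = n0 / (1 + (n0 - 1) / population)
    return min(population, math.ceil(n))


# Compare the sample size for datasets of different row counts
for rows in (1_500, 100_000, 10_000_000):
    print(rows, sample_size(rows))
```

Under these assumptions the sample size levels off as the dataset grows, which is why sampling keeps plotting time roughly constant for very large datasets.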

If you are interested in exploring the relationship between two columns, you can use the plot() method. The plot() method is an automatic plotting tool. You pass in a variable for the x-axis and, optionally, a variable for the y-axis, and then call the show_in_notebook() method to render the plot. The type of plot that ADS generates depends on the types of the columns, which ADS detects automatically through type discovery.

The following three examples use a binary classification sample dataset with 1,500 rows and 21 columns, of which 13 have a continuous type and 8 have a categorical type.

  • Passing only the x variable, col02, which is of a categorical type. The plot() method detects that each value in this column belongs to one of two categories, 0 or 1. Therefore, it uses a count plot and counts the rows in each category:

    ds.plot("col02").show_in_notebook(figsize=(4,4))
    
    ../../_images/single_column_count_plot.png
  • Plotting col02 against col01, where one is of a categorical type and the other is of a continuous type. ADS chooses a violin plot as the best plotting method.

    ds.plot("col02", y="col01").show_in_notebook(figsize=(4,4))
    
    ../../_images/violin_plot.png
  • Plotting col01 against col03, which are both of a continuous type. ADS chooses a Gaussian heatmap as the best way to visualize the data. It generates a scatter plot and assigns a color to each data point based on the local density, estimated with a Gaussian kernel.

    ds.plot("col01", y="col03").show_in_notebook()
    
    ../../_images/gaussian_heatmap.png
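
To make the idea behind the Gaussian heatmap in the last example more concrete, the following sketch computes a relative density score for each point with a Gaussian kernel, using only NumPy. It is not the ADS implementation: the function name gaussian_density, the bandwidth value, and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np


def gaussian_density(x, y, bandwidth=0.3):
    """Return a relative Gaussian-kernel density score for each (x, y) point.

    Assumption: a fixed bandwidth on standardized axes; both inputs must
    have nonzero variance.
    """
    pts = np.column_stack([x, y]).astype(float)
    # Standardize both axes so a single bandwidth applies to each of them
    pts = (pts - pts.mean(axis=0)) / pts.std(axis=0)
    # Pairwise squared distances between all points (n x n matrix)
    diff = pts[:, None, :] - pts[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    # Average Gaussian kernel weight of all points around each point
    kernel = np.exp(-sq_dist / (2 * bandwidth ** 2))
    return kernel.mean(axis=1)


# Synthetic, correlated data used only to exercise the function
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=0.5, size=500)
density = gaussian_density(x, y)
```

The resulting density array can be passed as the c argument of matplotlib.pyplot.scatter so that points in crowded regions get a different color than isolated points. The pairwise distance matrix grows quadratically with the number of points, so this approach only suits samples of modest size.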

Customized Visualization

ADS provides a set of options for plotting the columns of your dataset. The visualization API is flexible enough to let you customize your charts or choose your own plotting library.

You can use the ADS call() method to select your own plotting routine.
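
The examples in this section suggest that call() passes the dataset's dataframe to your function as the first argument and forwards any extra keyword arguments, such as figsize. The sketch below imitates that pattern with a tiny stand-in class. MiniDataset and describe are hypothetical names used only for illustration, and the code is not the ADS implementation.

```python
class MiniDataset:
    """Hypothetical stand-in for an ADS dataset; not the ADS implementation."""

    def __init__(self, data):
        self.data = data

    def call(self, func, *args, **kwargs):
        # Forward the wrapped data as the first positional argument,
        # followed by any extra arguments supplied by the caller
        return func(self.data, *args, **kwargs)


def describe(rows, label):
    # A user-defined routine; with ADS this would be a plotting function
    return f"{label}: {len(rows)} rows"


ds = MiniDataset([1, 2, 3])
result = ds.call(describe, label="demo")
```

Because the user function receives the data directly, any plotting library that accepts a dataframe can be used without ADS needing to know about it.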

Seaborn

In this example, a dataframe is passed directly to the Seaborn pair plot function, which plots the pairwise relationships in the dataset. The function creates a grid of axes such that each variable in the data is shared on the y-axis across a single row and on the x-axis across a single column. The diagonal axes are treated differently: each one shows the univariate distribution of the variable in that column.

import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris
from ads.dataset.factory import DatasetFactory

# Load the iris measurements into a dataframe
data = load_iris()
iris_df = pd.DataFrame(data.data, columns=data.feature_names)

# Pass the dataframe to the Seaborn pair plot function through call()
sns.set(style="ticks", color_codes=True)
DatasetFactory.from_dataframe(iris_df).call(lambda df: sns.pairplot(df.dropna()))
../../_images/pairgrid.png

Matplotlib

  • Using any Matplotlib Function:

    import matplotlib.pyplot as plt
    import pandas as pd
    from numpy.random import randn
    from ads.dataset.factory import DatasetFactory

    df = pd.DataFrame(randn(1000, 4), columns=list('ABCD'))

    def ts_plot(df, figsize):
        # set_index returns a new dataframe, so keep the result;
        # index the rows by date so that the x-axis is a time axis
        df = df.set_index(pd.date_range('1/1/2000', periods=len(df)))
        df = df.cumsum()
        plt.figure()
        df.plot(figsize=figsize)
        plt.legend(loc='best')

    ds = DatasetFactory.from_dataframe(df, target='A')
    ds.call(ts_plot, figsize=(7,7))

    ../../_images/matplotlib.png
  • Using a Pie Chart:

    import pandas as pd
    import matplotlib.pyplot as plt
    from ads.dataset.factory import DatasetFactory
    
    data = {'data': [1109, 696, 353, 192, 168, 86, 74, 65, 53]}
    df = pd.DataFrame(data, index = ['20-50 km', '50-75 km', '10-20 km', '75-100 km', '3-5 km', '7-10 km', '5-7 km', '>100 km', '2-3 km'])
    
    # Offset of each wedge from the center, one value per row
    explode = (0, 0, 0, 0.1, 0.1, 0.2, 0.3, 0.4, 0.6)
    colors = ['#191970', '#001CF0', '#0038E2', '#0055D4', '#0071C6', '#008DB8', '#00AAAA',
            '#00C69C', '#00E28E', '#00FF80', ]
    
    def pie_plot(df, figsize):
        # Pass figsize through so the value given to call() takes effect
        df["data"].plot(kind='pie', figsize=figsize, fontsize=17, colors=colors, explode=explode)
        plt.axis('equal')
        plt.ylabel('')
        plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
        plt.show()
    
    ds = DatasetFactory.from_dataframe(df)
    ds.call(pie_plot, figsize=(7,7))
    
    ../../_images/piechart.png

Geographic Information System (GIS) Chart

This example uses California earthquake data retrieved from the United States Geological Survey (USGS) earthquake catalog. It gives a brief visual overview of the major places where earthquakes happened.

Datasets are provided as a convenience. Datasets are considered Third Party Content and are not considered Materials under Your agreement with Oracle applicable to the Services. The earthquake dataset is in the public domain. It was retrieved from the USGS Earthquake Hazards program.

earthquake.plot_gis_scatter(lon="longitude", lat="latitude")
../../_images/gis_scatter.png