Harvesting Technical Metadata

Harvesting is a process that extracts data structure information from your data sources into your data catalog repository.

Image showing the harvest process

What is a Data Asset?

To harvest a data source, you first register it as a data asset in your data catalog instance. A data asset is any physical data store or stream of data, such as a database, a cloud storage container, or a message stream.

Supported Data Sources

You can use the following data sources (accessible through public or private IP addresses) to create data assets in Data Catalog:

  • Object Storage

  • Oracle Database

  • MySQL

  • Autonomous Data Warehouse

  • Autonomous Transaction Processing

  • PostgreSQL

  • Microsoft SQL Server

  • Microsoft Azure SQL Database

  • IBM DB2

  • Apache Hive installed on Oracle Cloud Infrastructure

  • Apache Kafka installed on Oracle Cloud Infrastructure

  • On-premises data sources connected to Oracle Cloud Infrastructure Virtual Cloud Networks (VCNs)

Depending on the type of data asset you create, you use different data structures to browse the data entities. For example, if you create an Oracle Database data asset, you browse through database objects to review the table and view data entities.

Supported File Types

The following file types are supported for Oracle Object Storage:

  • Comma Separated Value (CSV) files (.csv, .csv.gz)

    Note: The supported separators are , (comma), \t (tab), | (vertical bar), and ; (semicolon).

  • XML files (.xml, .xsd)

  • AVRO files (.avro, .avro.gz)

  • Excel files (.xls, .xlsx)

  • Apache Parquet files (.parquet, .pq)

  • Apache ORC files (.orc)

  • Simple JSON files (.json, .json.gz)

If you choose to harvest unsupported file types, the Data Catalog harvester extracts only basic information from those files, such as their names and paths.
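As a rough illustration of the distinction above, the following Python sketch sorts object names into supported and unsupported file types by extension. The extension list is copied from this section. The function name classify_object, the returned dictionary keys, and the case-insensitive matching are assumptions made for the example and do not describe how the Data Catalog harvester is implemented.

```python
from pathlib import PurePosixPath

# Extensions taken from the supported-file-types list in this section.
SUPPORTED_EXTENSIONS = {
    ".csv": "CSV", ".csv.gz": "CSV",
    ".xml": "XML", ".xsd": "XML",
    ".avro": "AVRO", ".avro.gz": "AVRO",
    ".xls": "Excel", ".xlsx": "Excel",
    ".parquet": "Parquet", ".pq": "Parquet",
    ".orc": "ORC",
    ".json": "JSON", ".json.gz": "JSON",
}


def classify_object(object_name: str) -> dict:
    """Hypothetical helper: report whether an object would get a full harvest.

    Unsupported files keep only basic information (name and path).
    Matching is case-insensitive, which is an assumption of this sketch.
    """
    path = PurePosixPath(object_name)
    lowered = path.name.lower()
    for extension, file_type in SUPPORTED_EXTENSIONS.items():
        if lowered.endswith(extension):
            return {"name": path.name, "path": object_name,
                    "type": file_type, "full_harvest": True}
    return {"name": path.name, "path": object_name,
            "type": None, "full_harvest": False}


supported = classify_object("sales/2023/orders.csv.gz")
unsupported = classify_object("logs/app.log")
```

Here supported is reported as a CSV file eligible for a full harvest, while unsupported carries only its name and path.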

What are Data Entities and Attributes?

A data asset contains one or more data entities. A data entity is a collection of data such as a database table or view, or a single logical file. A data entity normally has many attributes that describe its data. An attribute describes a data item with a name and data type.

Data Asset      Data Entities              Attributes
Database        Tables and Views           Columns
File Container  Files                      Fields
Data Stream     Event or Topic or Payload  Keys
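The relationship between data assets, data entities, and attributes can be pictured as a simple containment hierarchy. The Python sketch below is only a mental model: the class and field names are hypothetical and are not the Data Catalog API or its storage format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attribute:
    # An attribute describes a data item with a name and a data type.
    name: str
    data_type: str


@dataclass
class DataEntity:
    # A data entity is a collection of data, such as a table, view, or file.
    name: str
    attributes: List[Attribute] = field(default_factory=list)


@dataclass
class DataAsset:
    # A data asset is a physical data store or stream containing entities.
    name: str
    asset_type: str
    entities: List[DataEntity] = field(default_factory=list)


# Example: a database asset with one table entity and two column attributes.
orders = DataEntity("ORDERS", [
    Attribute("ORDER_ID", "NUMBER"),
    Attribute("PLACED_AT", "TIMESTAMP"),
])
sales_db = DataAsset("sales_db", "Database", [orders])
```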

Harvesting Steps

When you harvest a data asset, the Data Catalog harvester extracts, standardizes, and indexes metadata from the data asset to create a unified and searchable repository in the data catalog. You can then browse or explore the data catalog to view the harvested data entities and attributes, and annotate and enrich the data assets.
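To make the extract, standardize, and index stages more concrete, here is a minimal Python sketch that harvests table and view structure from an in-memory SQLite database. It is an analogy only: SQLite is not one of the supported data sources listed above, and the harvest function, the uppercase normalization, and the dictionary layout are assumptions of this example, not the behavior of the Data Catalog harvester.

```python
import sqlite3


def harvest(conn: sqlite3.Connection) -> dict:
    """Illustrative harvester: returns {ENTITY_NAME: {"kind", "attributes"}}."""
    index = {}
    # Extract: list tables and views from SQLite's schema catalog.
    objects = conn.execute(
        "SELECT name, type FROM sqlite_master "
        "WHERE type IN ('table', 'view') AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    for name, kind in objects:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
        # Standardize: uppercase names and types for a uniform representation.
        attributes = [
            {"name": col[1].upper(), "data_type": (col[2] or "UNKNOWN").upper()}
            for col in columns
        ]
        # Index: key each entity by its normalized name so it can be looked up.
        index[name.upper()] = {"kind": kind, "attributes": attributes}
    return index


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, placed_at TEXT)")
conn.execute("CREATE VIEW recent_orders AS SELECT order_id FROM orders")
catalog = harvest(conn)
```

Only structure is read here (entity names, attribute names, and data types), never the rows themselves, which mirrors the idea that harvesting collects data structure information rather than the data.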

Harvesting a data source involves the following steps:

  1. Identify connectivity details to connect to the data source.
  2. Create a data asset.
  3. Add a connection to your data asset.
  4. Harvest the data asset.