Training Data Requirements in Generative AI

Understand the guidelines for creating training data for fine-tuning the pretrained models in OCI Generative AI.

Each custom model accepts a single training dataset file in JSONL (JSON Lines) format. The file must contain at least 32 prompt/completion pair examples. The dataset is randomly split in an 80:20 ratio for training and validation. There's no maximum number of examples for the training file, but larger datasets take longer to train.
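To make these numbers concrete, the following Python sketch checks the 32-example minimum and performs a random 80:20 split locally. It is an illustration only: the service performs its own split, and the helper name split_examples, the fixed seed, and the choice to round the training share down are assumptions, not documented behavior.

```python
import random

MIN_EXAMPLES = 32  # minimum number of prompt/completion pairs stated above


def split_examples(examples, train_percent=80, seed=0):
    """Randomly split examples into (training, validation) lists.

    Hypothetical helper for illustration; the service's actual split
    logic is not described in this document.
    """
    if len(examples) < MIN_EXAMPLES:
        raise ValueError(
            f"need at least {MIN_EXAMPLES} examples, got {len(examples)}"
        )
    shuffled = list(examples)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding; rounding down is an assumption.
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]


examples = [
    {"prompt": f"question {i}", "completion": f"answer {i}"} for i in range(40)
]
train, validation = split_examples(examples)
print(len(train), len(validation))
```

With 40 examples, the split yields 32 training examples and 8 validation examples.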

About JSONL

A JSONL file contains one JSON value or object per line. Unlike a regular JSON file, the file isn't evaluated as a whole. Instead, each line is treated as if it were a separate JSON file. This format is ideal for storing a set of inputs in JSON format.
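The difference from a regular JSON file can be seen in a short Python sketch that uses only the standard json module. The sample records are placeholders chosen for illustration.

```python
import json

# Two records, one per line, as they would appear in a JSONL file.
jsonl_text = (
    '{"prompt": "first prompt", "completion": "first completion"}\n'
    '{"prompt": "second prompt", "completion": "second completion"}\n'
)

# Parsing the whole text as a single JSON document fails, because a second
# top-level value follows the first one.
try:
    json.loads(jsonl_text)
    parsed_as_whole = True
except json.JSONDecodeError:
    parsed_as_whole = False

# Parsing line by line succeeds: each line is its own JSON document.
records = [json.loads(line) for line in jsonl_text.split("\n") if line.strip()]

print(parsed_as_whole, len(records))
```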

The OCI Generative AI service accepts a JSONL file for fine-tuning custom models in the following format:

{"prompt": "<first prompt>", "completion": "<expected completion given first prompt>"}
{"prompt": "<second prompt>", "completion": "<expected completion given second prompt>"}
.
.
.

JSONL Example

{"prompt": "What is the capital of France?", "completion": "The capital of France is Paris."}
{"prompt": "What is the smallest state in the USA?", "completion": "The smallest state in the USA is Rhode Island."}

Note

Ensure that each JSONL dataset file that you create for Generative AI has the following properties:

The file is UTF-8 encoded.
Each line item contains a valid JSON object.
Each JSON object has two properties: "prompt" and "completion".
Each JSON object is entered on a new line or followed by a newline character (\n).
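
These properties can be checked locally before the file is submitted. The following Python sketch is a minimal validator under stated assumptions: the function name validate_jsonl is hypothetical, "two properties" is read as exactly the keys "prompt" and "completion", blank lines are reported as invalid, and the 32-example minimum is included as well. The service may apply checks that differ from these.

```python
import json

REQUIRED_KEYS = {"prompt", "completion"}
MIN_EXAMPLES = 32


def validate_jsonl(data):
    """Return a list of problems found in the raw bytes of a JSONL dataset."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc}"]

    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a newline after the last record is expected

    problems = []
    valid = 0
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: not a JSON object")
        elif set(record) != REQUIRED_KEYS:
            # Assumption: extra keys are rejected as well as missing ones.
            problems.append(
                f"line {number}: expected exactly {sorted(REQUIRED_KEYS)}"
            )
        else:
            valid += 1

    if valid < MIN_EXAMPLES:
        problems.append(
            f"only {valid} valid examples; at least {MIN_EXAMPLES} are required"
        )
    return problems


good = "".join(
    json.dumps({"prompt": f"q{i}", "completion": f"a{i}"}) + "\n"
    for i in range(32)
).encode("utf-8")
bad = b'{"prompt": "q"}\n[1, 2]\nnot json\n'

print(validate_jsonl(good))
for problem in validate_jsonl(bad):
    print(problem)
```

An empty list means no problems were found. The second input triggers one message per faulty line, plus a message about the example count.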

After you create the JSONL file, add your dataset to an Object Storage bucket.
