Training Data Requirements in Generative AI

Use the following guidelines to create training data for fine-tuning the pretrained models in Generative AI.

  • File format: jsonl
  • Entry format: {"prompt": "<a prompt>", "completion": "<response>"}
  • Minimum number of entries: 32
  • Maximum number of files per custom model: 1

Fine-tuning Data

Each custom model accepts exactly one training dataset, in JSONL format. The file must contain at least 32 prompt/completion pairs. The dataset is randomly split in a 90:10 ratio into training and validation sets. There's no maximum number of examples for the training file, but larger datasets take longer to train.
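
To get a feel for what the minimum size and the 90:10 split mean for a dataset, the following minimal Python sketch checks the entry count and previews a random split locally. The service performs its own split; the helper name preview_split, the fixed seed, and the round-down rule are assumptions made for illustration only.

```python
import random

MIN_ENTRIES = 32  # minimum number of prompt/completion pairs stated above


def preview_split(entries, seed=0):
    """Shuffle a copy of the entries and return (training, validation) lists."""
    if len(entries) < MIN_ENTRIES:
        raise ValueError(f"need at least {MIN_ENTRIES} entries, got {len(entries)}")
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    # Assumption: 90% rounded down goes to training; the service may round differently.
    cut = len(shuffled) * 9 // 10
    return shuffled[:cut], shuffled[cut:]


examples = [{"prompt": f"Question {i}", "completion": f"Answer {i}"} for i in range(40)]
train, validation = preview_split(examples)
print(len(train), len(validation))  # prints "36 4"
```

With a file close to the 32-entry minimum, only three or four examples end up in the validation set, so a larger dataset might give a more stable validation signal.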

About JSONL

A JSONL (or JSON Lines) file contains one JSON value or object on each line. The file isn't evaluated as a whole, like a regular JSON file. Instead, each line is treated as if it were a separate JSON file. This format lends itself well to storing a set of inputs in JSON format. The OCI Generative AI service accepts a JSONL file for fine-tuning custom models in the following format:

{"prompt": "<first prompt>", "completion": "<expected completion given first prompt>"}
{"prompt": "<second prompt>", "completion": "<expected completion given second prompt>"}
...

JSONL Example

{"prompt": "What is the capital of France?", "completion": "The capital of France is Paris."}
{"prompt": "What is the smallest state in the USA?", "completion": "The smallest state in the USA is Rhode Island."}
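
A file in this format could be produced with Python's standard json module, as in the minimal sketch below. The function name write_jsonl and the file name training_data.jsonl are hypothetical and chosen only for illustration.

```python
import json

# Example prompt/completion pairs, written as (prompt, completion) tuples.
pairs = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("What is the smallest state in the USA?", "The smallest state in the USA is Rhode Island."),
]


def write_jsonl(path, pairs):
    """Write one JSON object per line, UTF-8 encoded, each ending with a newline."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for prompt, completion in pairs:
            record = {"prompt": prompt, "completion": completion}
            # json.dumps escapes newlines inside strings, so each record stays on one line.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    write_jsonl("training_data.jsonl", pairs)  # hypothetical output file name
```

Serializing each record with json.dumps, rather than building the line by hand, avoids quoting mistakes such as an unclosed string.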

Note

Ensure that each JSONL dataset file that you create for Generative AI has the following properties:
  • The file is UTF-8 encoded.
  • Each line contains a valid JSON object followed by a newline character (\n).
  • Each JSON object has two properties: "prompt" and "completion".
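
The properties listed above can be checked locally before uploading. The following is a minimal sketch, not an official validator: the function name validate_jsonl is hypothetical, and it assumes that "two properties" means exactly the keys "prompt" and "completion" with no extras.

```python
import json

REQUIRED_KEYS = {"prompt", "completion"}


def validate_jsonl(path):
    """Return a list of problems found in the file; an empty list means all checks passed."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"file is not valid UTF-8: {exc}"]

    problems = []
    lines = text.split("\n")
    if lines[-1] == "":
        lines.pop()  # the final line was properly terminated by "\n"
    else:
        problems.append("last line is not followed by a newline character")

    for number, line in enumerate(lines, start=1):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(obj, dict):
            problems.append(f"line {number}: expected a JSON object")
        elif set(obj) != REQUIRED_KEYS:
            # Assumption: exactly these two keys are allowed.
            problems.append(f"line {number}: keys must be exactly {sorted(REQUIRED_KEYS)}")
    return problems
```

Calling validate_jsonl on a dataset file and confirming that it returns an empty list is a quick sanity check before the file goes to the bucket.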

After you create the JSONL file, add your dataset to an Object Storage bucket.