Training Data Requirements in Generative AI
Use the following guidelines to create training data for fine-tuning the pretrained models in Generative AI.
File format | Entry format | Minimum number of entries | Maximum number of files per custom model |
---|---|---|---|
jsonl | {"prompt": "<a prompt>", "completion": "<response>"} | 32 | 1 |
- Fine-tuning Data: Custom models accept only one training dataset, in jsonl format. Each file must contain at least 32 prompt/completion pair examples. The dataset is randomly split in a 90:10 ratio for training and validation. There's no maximum number of examples for the training file, but large datasets take longer to train.
- About JSONL: A JSONL (or JSON Lines) file is a file that contains a new JSON value or object on each line. The file isn't evaluated as a whole, like a regular JSON file. Instead, each line is treated as if it were a separate JSON file. This format lends itself well to storing a set of inputs in JSON format. The OCI Generative AI service accepts a JSONL file for fine-tuning custom models in the following format:

{"prompt": "<first prompt>", "completion": "<expected completion given first prompt>"}
{"prompt": "<second prompt>", "completion": "<expected completion given second prompt>"}
...
JSONL Example

{"prompt": "What is the capital of France?", "completion": "The capital of France is Paris."}
{"prompt": "Where is the smallest state in the USA?", "completion": "The smallest state in the USA is Rhode Island."}
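
One way to produce a file in this shape is to serialize each pair with Python's standard `json` module. The following is a minimal sketch, not an official tool: the function name `write_jsonl` and the file name `training_data.jsonl` are hypothetical. It reuses the two example pairs, whereas a real training file needs at least 32 entries.

```python
import json

def write_jsonl(pairs, path):
    """Write (prompt, completion) pairs as one JSON object per line."""
    # newline="\n" stops Python from translating "\n" to "\r\n" on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for prompt, completion in pairs:
            record = {"prompt": prompt, "completion": completion}
            # json.dumps escapes any newline inside the text, so each
            # record stays on a single line. ensure_ascii=False writes
            # non-ASCII characters as UTF-8 instead of \u escapes.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

pairs = [
    ("What is the capital of France?",
     "The capital of France is Paris."),
    ("Where is the smallest state in the USA?",
     "The smallest state in the USA is Rhode Island."),
]
write_jsonl(pairs, "training_data.jsonl")
```

Building each line with `json.dumps` rather than by string formatting avoids quoting mistakes, such as a missing closing quotation mark or an unescaped quote inside a prompt.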
Ensure that each JSONL dataset file that you create for Generative AI has the following properties:
- The file is UTF-8 encoded.
- Each line contains a valid JSON object followed by a new line character (\n).
- Each JSON object has two properties: "prompt" and "completion".
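
These properties can be checked locally before uploading. The following is a rough sketch of such a check, not the validation that the service itself performs. The function name `validate_jsonl` is hypothetical, and the sketch assumes that "two properties" means exactly the keys `prompt` and `completion`. It also applies the minimum of 32 entries stated earlier on this page.

```python
import json

REQUIRED_KEYS = {"prompt", "completion"}
MIN_ENTRIES = 32  # minimum number of entries stated on this page

def validate_jsonl(path):
    """Return a list of problems found in the file (empty if none)."""
    problems = []
    try:
        with open(path, "rb") as f:
            text = f.read().decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"file is not valid UTF-8: {exc}"]

    if text and not text.endswith("\n"):
        problems.append("last line is not followed by a newline character")

    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after the final newline

    for number, line in enumerate(lines, start=1):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(obj, dict):
            problems.append(f"line {number}: not a JSON object")
        elif set(obj) != REQUIRED_KEYS:
            # Assumption: no keys other than the two named ones are allowed.
            problems.append(
                f"line {number}: expected exactly the keys "
                "'prompt' and 'completion'"
            )

    if len(lines) < MIN_ENTRIES:
        problems.append(
            f"only {len(lines)} entries; at least {MIN_ENTRIES} are required"
        )
    return problems
```

Calling `validate_jsonl("training_data.jsonl")` on a hypothetical file returns an empty list when every check passes. Otherwise it returns one message per problem, which makes it easy to fix the file before adding it to a bucket.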
After you create the JSONL file, add your dataset to an Object Storage bucket.