Creating a Pipeline

Create a Data Science pipeline to run a task.

Ensure that you have created the necessary policies, authentication, and authorization for pipelines.

Important

For proper operation of script steps, ensure that you have added the following rule to a dynamic group:

all {resource.type='datasciencepipelinerun', resource.compartment.id='<pipeline-run-compartment-ocid>'}
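
The matching rule places pipeline runs in the dynamic group; the group also needs a policy that grants it access to Data Science resources. The following statement is a sketch only, with placeholder names, and the verb and resource family should be adapted to your own security requirements:

allow dynamic-group <pipeline-run-dynamic-group> to manage data-science-family in compartment <compartment-name>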

Before you begin:

You can create pipelines by using the ADS SDK, OCI Console, or the OCI SDK.

Using ADS to create pipelines makes it easier to develop the pipeline, its steps, and the dependencies between them. ADS supports reading and writing the pipeline to and from a YAML file, and you can use ADS to view a visual representation of the pipeline. We recommend that you use ADS to create and manage pipelines using code.

    1. Use the Console to sign in to a tenancy with the necessary policies.
    2. Open the navigation menu and click Analytics & AI. Under Machine Learning, click Data Science.
    3. Select the compartment that contains the project that you want to use.

      All projects in the compartment are listed.

    4. Click the name of the project.

      The project details page opens and lists the notebook sessions.

    5. Under Resources, click Pipelines.
    6. Click Create pipeline.
    7. (Optional) Select a different compartment for the pipeline.
    8. (Optional) Enter a name and description for the pipeline (limit of 255 characters). If you don't provide a name, a name is automatically generated.

      For example, pipeline2022808222435.

    9. Click Add pipeline steps to start defining the workflow for the pipeline.
    10. In the Add pipeline step panel, select one of the following options, and then finish the pipeline creation:
    From a Job

    The pipeline step uses an existing job. Select one of the jobs in the tenancy.

    1. Enter a unique name for the step. You can't repeat a step name in a pipeline.
    2. (Optional) Enter a step description, which can help you find step dependencies.
    3. (Optional) If this step depends on another step, select one or more steps to run before this step.
    4. Select the job for the step to run.
    5. (Optional) Enter or select any of the following values to control this pipeline step:
      Custom environment variable key and value

      The environment variables for this pipeline step.

      Value

      The value for the custom environment variable key.

      You can click Additional custom environment key to specify more variables.

      Command line arguments

      The command line arguments that you want to use for running the pipeline step.

      Maximum runtime (in minutes)

      The maximum number of minutes that the pipeline step is allowed to run. The service cancels the pipeline run if its runtime exceeds the specified value. The maximum runtime is 30 days (43,200 minutes). We recommend that you configure a maximum runtime on all pipeline runs to prevent runaway pipeline runs.

    6. Click Save to add the step and return to the Create pipeline page.
    7. (Optional) Click +Add pipeline steps to add more steps to complete your workflow, and repeat the preceding steps.
    8. (Optional) Create a default pipeline configuration to use when the pipeline runs by entering environment variables, command line arguments, and maximum runtime options. See step 5 for an explanation of these fields.
    9. (Optional) Select a Compute shape by clicking Select and following these steps:
      1. Select an instance type.
      2. Select a shape series.
      3. Select one of the supported Compute shapes in the series.
      4. Select the shape that best suits how you want to use the resource. For the AMD shape, you can use the default or set the number of OCPUs and memory.

        For each OCPU, select up to 64 GB of memory and a maximum total of 512 GB. The minimum amount of memory allowed is either 1 GB or a value matching the number of OCPUs, whichever is greater.

      5. Click Select shape.
    10. For Block Storage, enter the amount of storage that you want to use, from 50 GB to 10,240 GB (10 TB). You can change the value in 1 GB increments. The default value is 100 GB.
    11. (Optional) To use logging, click Select, and then ensure that Enable logging is selected.
      1. Select a log group from the list. You can change to a different compartment to specify a log group in a different compartment from the job.
      2. Select one of the following to store all stdout and stderr messages:
        Enable automatic log creation

        Data Science automatically creates a log when the job starts.

        Select a log

        Select a log to use.

      3. Click Select to return to the pipeline creation page.
    12. (Optional) Click Show advanced options to add tags to the pipeline.
    13. (Optional) Enter the tag namespace (for a defined tag), key, and value to assign tags to the resource.

      To add more than one tag, click Add tag.

      Tagging describes the various tags that you can use to organize and find resources, including cost-tracking tags.

    14. Click Create.

      After the pipeline is in an active state, you can use pipeline runs to repeatedly run the pipeline.

    From a Script

    The step runs a script. You must upload the artifact that contains all the code for the step to run.

    1. Enter a unique name for the step. You can't repeat a step name in a pipeline.
    2. (Optional) Enter a step description, which can help you find step dependencies.
    3. (Optional) If this step depends on another step, select one or more steps to run before this step.
    4. Drag a step file into the box, or click select a file to browse to and select it.
    5. In Entry point, select one file to be the entry point for the step run. This is useful when the step artifact contains many files.
    6. (Optional) Enter or select any of the following values to control this pipeline step:
      Custom environment variable key and value

      The environment variables for this pipeline step.

      Value

      The value for the custom environment variable key.

      You can click Additional custom environment key to specify more variables.

      Command line arguments

      The command line arguments that you want to use for running the pipeline step.

      Maximum runtime (in minutes)

      The maximum number of minutes that the pipeline step is allowed to run. The service cancels the pipeline run if its runtime exceeds the specified value. The maximum runtime is 30 days (43,200 minutes). We recommend that you configure a maximum runtime on all pipeline runs to prevent runaway pipeline runs.

    7. (Optional) Create a default pipeline configuration to use when the pipeline runs by entering environment variables, command line arguments, and maximum runtime options. See step 6 for an explanation of these fields.
    8. For Block Storage, enter the amount of storage that you want to use, from 50 GB to 10,240 GB (10 TB). You can change the value in 1 GB increments. The default value is 100 GB.
    9. Click Save to add the step and return to the Create pipeline page.
    10. (Optional) Use +Add pipeline steps to add more steps to complete your workflow by repeating the preceding steps.
    11. (Optional) Create a default pipeline configuration to use when the pipeline runs by entering environment variables, command line arguments, and maximum runtime options. See step 6 for an explanation of these fields.
    12. For Block Storage, enter the amount of storage that you want to use, from 50 GB to 10,240 GB (10 TB). You can change the value in 1 GB increments. The default value is 100 GB.
    13. Select one of the following options to configure the network type:
      • Default networking—The workload is attached by using a secondary VNIC to a preconfigured, service-managed VCN and subnet. This subnet allows egress to the public internet through a NAT gateway and access to other Oracle Cloud services through a service gateway.

        If you need access only to the public internet and OCI services, we recommend using this option. It doesn't require you to create networking resources or write policies for networking permissions.

      • Custom networking—Select the VCN and subnet that you want to use for the pipeline.

        For egress access to the public internet, use a private subnet with a route to a NAT gateway.

        If you don't see the VCN or subnet that you want to use, click Change Compartment, and then select the compartment that contains the VCN or subnet.

        Important

        You must use custom networking to use a file storage mount.

    14. (Optional) To use logging, click Select, and then ensure that Enable logging is selected.
      1. Select a log group from the list. You can change to a different compartment to specify a log group in a different compartment from the job.
      2. Select one of the following to store all stdout and stderr messages:
        Enable automatic log creation

        Data Science automatically creates a log when the job starts.

        Select a log

        Select a log to use.

      3. Click Select to return to the pipeline creation page.
    15. (Optional) Click Show advanced options to add tags to the pipeline.
    16. (Optional) Enter the tag namespace (for a defined tag), key, and value to assign tags to the resource.

      To add more than one tag, click Add tag.

      Tagging describes the various tags that you can use to organize and find resources, including cost-tracking tags.

    17. Click Create.

      After the pipeline is in an active state, you can use pipeline runs to repeatedly run the pipeline.

    From Container
    Optionally, when defining pipeline steps, you can choose to bring your own container (BYOC).
    1. Select From container.
    2. In the Container configuration section, click Configure.
    3. In the Configure your container environment panel, select a repository from the list. If the repository is in a different compartment, click Change compartment.
    4. Select an image from the list.
    5. (Optional) Enter an entry point. To add another, click +Add parameter.
    6. (Optional) Enter a CMD. To add another, click +Add parameter.
      Use CMD as arguments to the ENTRYPOINT, or as the only command to run when no ENTRYPOINT is specified.
    7. (Optional) Enter an image digest.
    8. (Optional) If using signature verification, enter the OCID of the image signature.
      For example, ocid1.containerimagesignature.oc1.iad.aaaaaaaaab....
    9. (Optional) Upload the step artifact by dragging it into the box.
      Note

      Uploading the step artifact is optional only when BYOC is configured.
  • These environment variables control the pipeline run.

    You can use the OCI SDK to create a pipeline, as in this Python example (the calls assume an initialized Data Science client, dsc; a setup sketch follows these steps):

    1. Create a pipeline:

      The following parameters are available to use in the payload:

      Parameter name Required Description
      Pipeline (top level)
      projectId Required The project OCID to create the pipeline in.
      compartmentId Required The compartment OCID to create the pipeline in.
      displayName Optional The name of the pipeline.
      infrastructureConfigurationDetails Optional

      Default infrastructure (compute) configuration to use for all the pipeline steps, see infrastructureConfigurationDetails for details on the supported parameters.

      Can be overridden by the pipeline run configuration.

      logConfigurationDetails Optional

      Default log to use for all the pipeline steps, see logConfigurationDetails for details on the supported parameters.

      Can be overridden by the pipeline run configuration.

      configurationDetails Optional

      Default configuration for the pipeline run, see configurationDetails for details on supported parameters.

      Can be overridden by the pipeline run configuration.

      freeformTags Optional Tags to add to the pipeline resource.
      stepDetails
      stepName Required Name of the step. Must be unique in the pipeline.
      description Optional Free text description for the step.
      stepType Required CUSTOM_SCRIPT or ML_JOB
      jobId Required* For ML_JOB steps, this is the job OCID to use for the step run.
      stepInfrastructureConfigurationDetails Optional*

      Default infrastructure (Compute) configuration to use for this step, see infrastructureConfigurationDetails for details on the supported parameters.

      Can be overridden by the pipeline run configuration.

      *Must be defined on at least one level. When defined on more than one level, precedence is (1 being highest):

      1. pipeline run

      2. step

      3. pipeline

      stepConfigurationDetails Optional*

      Default configuration for the step run, see configurationDetails for details on supported parameters.

      Can be overridden by the pipeline run configuration.

      *Must be defined on at least one level. When defined on more than one level, precedence is (1 being highest):

      1. pipeline run

      2. step

      3. pipeline

      dependsOn Optional List of steps that must be completed before this step begins. This creates the pipeline workflow dependencies graph.
      infrastructureConfigurationDetails
      shapeName Required Name of the Compute shape to use. For example, VM.Standard2.4.
      blockStorageSizeInGBs Required Number of GBs to use as the attached storage for the VM.
      logConfigurationDetails
      enableLogging Required Defines whether to use logging.
      logGroupId Required Log group OCID to use for the logs. The log group must be created and available when the pipeline runs.
      logId Optional* Log OCID to use for the logs when not using the enableAutoLogCreation parameter.
      enableAutoLogCreation Optional If set to True, a log for each pipeline run is created.
      configurationDetails
      type Required Only DEFAULT is supported.
      maximumRuntimeInMinutes Optional Time limit in minutes for the pipeline to run.
      environmentVariables Optional

      Environment variables to provide for the pipeline step runs.

      For example:

      "environmentVariables": {
      
       "CONDA_ENV_TYPE": "service"
      
      }

      Review the list of service supported environment variables.

      pipeline_payload = {
          "projectId": "<project_id>",
          "compartmentId": "<compartment_id>",
          "displayName": "<pipeline_name>",
          "pipelineInfrastructureConfigurationDetails": {
              "shapeName": "VM.Standard2.1",
              "blockStorageSizeInGBs": "50"
          },
          "pipelineLogConfigurationDetails": {
              "enableLogging": True,
              "logGroupId": "<log_group_id>",
              "logId": "<log_id>"
          },
          "pipelineDefaultConfigurationDetails": {
              "type": "DEFAULT",
              "maximumRuntimeInMinutes": 30,
              "environmentVariables": {
                  "CONDA_ENV_TYPE": "service",
                  "CONDA_ENV_SLUG": "classic_cpu"
              }
          },
          "stepDetails": [
              {
                  "stepName": "preprocess",
                  "description": "Preprocess step",
                  "stepType": "CUSTOM_SCRIPT",
                  "stepInfrastructureConfigurationDetails": {
                      "shapeName": "VM.Standard2.4",
                      "blockStorageSizeInGBs": "100"
                  },
                  "stepConfigurationDetails": {
                      "type": "DEFAULT",
                      "maximumRuntimeInMinutes": 90
                      "environmentVariables": {
                          "STEP_RUN_ENTRYPOINT": "preprocess.py",
                          "CONDA_ENV_TYPE": "service",
                          "CONDA_ENV_SLUG": "onnx110_p37_cpu_v1"
                  }
              },
              {
                  "stepName": "postprocess",
                  "description": "Postprocess step",
                  "stepType": "CUSTOM_SCRIPT",
                  "stepInfrastructureConfigurationDetails": {
                      "shapeName": "VM.Standard2.1",
                      "blockStorageSizeInGBs": "80"
                  },
                  "stepConfigurationDetails": {
                      "type": "DEFAULT",
                      "maximumRuntimeInMinutes": 60
                  },
                  "dependsOn": ["preprocess"]
              },
          ],
          "freeformTags": {
              "freeTags": "cost center"
          }
      }
      pipeline_res = dsc.create_pipeline(pipeline_payload)
      pipeline_id = pipeline_res.data.id

      Until all pipeline step artifacts are uploaded, the pipeline remains in the CREATING state.

    2. Upload a step artifact:

      After an artifact is uploaded, it can't be changed.

      fstream = open(<file_name>, "rb")
       
      dsc.create_step_artifact(pipeline_id, step_name, fstream, content_disposition=f"attachment; filename={<file_name>}")
    3. Update a pipeline:

      You can only update a pipeline when it's in an ACTIVE state.

      update_pipeline_details = {
          "displayName": "pipeline-updated"
      }
      dsc.update_pipeline(<pipeline_id>, update_pipeline_details)
    4. Start pipeline run:
      pipeline_run_payload = {
          "projectId": project_id,
          "displayName": "pipeline-run",
          "pipelineId": <pipeline_id>,
          "compartmentId": <compartment_id>,
      }
      dsc.create_pipeline_run(pipeline_run_payload)
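
    The preceding Python calls assume that dsc is an initialized Data Science client from the OCI Python SDK. The following is a minimal setup sketch, with a loop that waits for the pipeline to leave the CREATING state after the step artifacts are uploaded; the config file location and polling interval are assumptions:

      import time
      import oci

      # Build a Data Science client from the default OCI config file (~/.oci/config).
      config = oci.config.from_file()
      dsc = oci.data_science.DataScienceClient(config)

      # After uploading all step artifacts, poll until the pipeline is no longer CREATING.
      pipeline = dsc.get_pipeline(pipeline_id).data
      while pipeline.lifecycle_state == "CREATING":
          time.sleep(10)  # assumed polling interval
          pipeline = dsc.get_pipeline(pipeline_id).data
      print(pipeline.lifecycle_state)  # expect ACTIVE before starting pipeline runs
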
  • The ADS SDK is also a publicly available Python library that you can install with this command:

    pip install oracle-ads

    You can use the ADS SDK to create and run pipelines.
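
    As a minimal sketch, a pipeline with a single script step might be defined with the ADS pipeline classes as follows. The class and method names reflect the ads.pipeline interface and are assumptions to verify against your installed ADS version; the shape, conda slug, script name, and OCIDs are placeholders:

      from ads.jobs import ScriptRuntime
      from ads.pipeline import CustomScriptStep, Pipeline, PipelineStep

      # Infrastructure (Compute shape and block storage) for the step.
      infrastructure = (
          CustomScriptStep()
          .with_shape_name("VM.Standard2.1")
          .with_block_storage_size(50)
      )

      # A step that runs a local script in a service conda environment.
      step = (
          PipelineStep("preprocess")
          .with_description("Preprocess step")
          .with_infrastructure(infrastructure)
          .with_runtime(
              ScriptRuntime()
              .with_source("preprocess.py")
              .with_service_conda("onnx110_p37_cpu_v1")
          )
      )

      pipeline = (
          Pipeline("<pipeline_name>")
          .with_compartment_id("<compartment_id>")
          .with_project_id("<project_id>")
          .with_step_details([step])
      )

      pipeline.create()   # create the pipeline resource
      pipeline.run()      # start a pipeline run
      pipeline.show()     # render a visual representation of the pipeline
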

Custom Networking

Use a custom network that you've already created in the pipeline to give yourself extra flexibility over the network configuration.

Creating Pipelines with Custom Networking

You can choose to use custom networking when creating a pipeline.

Note

Switching from custom networking to managed networking isn't supported after the pipeline is created.
Tip

If you see the banner "The specified subnet is not accessible. Select a different subnet.", create a network access policy as described in the section Pipeline Policies.

Using the Console

Choose to use custom networking in the Create pipeline panel.

If you choose default networking, the system uses the existing service-managed network. If you select the custom networking option, you're prompted to pick a VCN and a subnet.

Select the VCN and subnet that you want to use for the resource. For egress access to the public internet, use a private subnet with a route to a NAT gateway. If you don't see the VCN or subnet that you want to use, click Change Compartment, and then select the compartment that contains the VCN or subnet.

Using APIs

Provide subnet-id in the infrastructure-configuration-details to use a custom subnet at the pipeline level. For example:
"infrastructure-configuration-details": {
      "block-storage-size-in-gbs": 50,
      "shape-config-details": {
        "memory-in-gbs": 16.0,
        "ocpus": 1.0
      },
      "shape-name": "VM.Standard.E4.Flex",
      "subnet-id": "ocid1.subnet.oc1.iad.aaaaaaaa5lzzq3fyypo6x5t5egplbfyxf2are6k6boop3vky5t4h7g35xkoa"
}
Or provide subnet-id in the step-infrastructure-configuration-details to use a custom subnet for a particular step. For example:
"step-infrastructure-configuration-details": {
          "block-storage-size-in-gbs": 50,
          "shape-config-details": {
            "memory-in-gbs": 16.0,
            "ocpus": 1.0
          },
          "shape-name": "VM.Standard.E4.Flex",
          "subnet-id": "ocid1.subnet.oc1.iad.aaaaaaaa5lzzq3fyypo6x5t5egplbfyxf2are6k6boop3vky5t4h7g35xkoa"
},
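
The Python SDK payload shown earlier can carry the same setting. The following is a minimal sketch, assuming the camelCase key subnetId; the keys mirror the hyphenated names above and should be verified against the API reference:

# Hedged sketch: pipeline-level custom subnet added to the earlier pipeline_payload example.
pipeline_payload["pipelineInfrastructureConfigurationDetails"] = {
    "shapeName": "VM.Standard.E4.Flex",
    "shapeConfigDetails": {"ocpus": 1.0, "memoryInGBs": 16.0},
    "blockStorageSizeInGBs": 50,
    "subnetId": "<subnet_ocid>",  # assumed camelCase equivalent of subnet-id
}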