Creating a Pipeline

Create a Data Science pipeline to run a task.

Ensure that you have created the necessary policies, authentication, and authorization for pipelines.

Important

For proper operation of script steps, ensure that you have added the following rule to a dynamic group:

all {resource.type='datasciencepipelinerun', resource.compartment.id='<pipeline-run-compartment-ocid>'}
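
The matching rule places pipeline runs in the dynamic group; the group also needs a policy that grants it access to Data Science resources. The following statement is a sketch only, with placeholder names, and the verb and resource family should be adapted to your own security requirements:

allow dynamic-group <pipeline-run-dynamic-group> to manage data-science-family in compartment <compartment-name>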

Before you begin:

You can create pipelines by using the ADS SDK, OCI Console, or the OCI SDK.

Using ADS to create pipelines makes it easier to develop the pipeline, its steps, and the dependencies between them. ADS supports reading and writing the pipeline to and from a YAML file, and you can use ADS to view a visual representation of the pipeline. We recommend that you use ADS to create and manage pipelines using code.

    1. Use the Console to sign in to a tenancy with the necessary policies.
    2. Open the navigation menu and click Analytics & AI. Under Machine Learning, click Data Science.
    3. Select the compartment that contains the project that you want to use.

      All projects in the compartment are listed.

    4. Click the name of the project.

      The project details page opens and lists the notebook sessions.

    5. Under Resources, click Pipelines.
    6. Click Create pipeline.
    7. (Optional) Select a different compartment for the pipeline.
    8. (Optional) Enter a name and description for the pipeline (limit of 255 characters). If you don't provide a name, a name is automatically generated.

      For example, pipeline2022808222435.

    9. Click Add pipeline steps to start defining the workflow for the pipeline.
    10. In the Add pipeline step panel, select one of the following options, and then finish the pipeline creation:
    From a Job

    The pipeline step uses an existing job. Select one of the jobs in the tenancy.

    1. Enter a unique name for the step. You can't repeat a step name in a pipeline.
    2. (Optional) Enter a step description, which can help you find step dependencies.
    3. (Optional) If this step depends on another step, select one or more steps to run before this step.
    4. Select the job for the step to run.
    5. (Optional) Enter or select any of the following values to control this pipeline step:
      Custom environment variable key and value

      The environment variables for this pipeline step.

      Value

      The value for the custom environment variable key.

      You can click Additional custom environment key to specify more variables.

      Command line arguments

      The command line arguments that you want to use for running the pipeline step.

      Maximum runtime (in minutes)

      The maximum number of minutes that the pipeline step is allowed to run. The service cancels the pipeline run if its runtime exceeds the specified value. The maximum runtime is 30 days (43,200 minutes). We recommend that you configure a maximum runtime on all pipeline runs to prevent runaway pipeline runs.

    6. Click Save to add the step and return to the Create pipeline page.
    7. (Optional) Click +Add pipeline steps to add more steps to complete your workflow, and repeat the preceding steps.
    8. (Optional) Create a default pipeline configuration to use when the pipeline runs by entering environment variables, command line arguments, and maximum runtime options. See step 5 for an explanation of these fields.
    9. (Optional) Select a Compute shape by clicking Select and following these steps:
      1. Select an instance type.
      2. Select a shape series.
      3. Select one of the supported Compute shapes in the series.
      4. Select the shape that best suits how you want to use the resource. For the AMD shape, you can use the default or set the number of OCPUs and memory.

        For each OCPU, select up to 64 GB of memory and a maximum total of 512 GB. The minimum amount of memory allowed is either 1 GB or a value matching the number of OCPUs, whichever is greater.

      5. Click Select shape.
    10. For Block Storage, enter the amount of storage that you want to use, from 50 GB to 10,240 GB (10 TB). You can change the value in 1 GB increments. The default value is 100 GB.
    11. (Optional) To use logging, click Select, and then ensure that Enable logging is selected.
      1. Select a log group from the list. You can change to a different compartment to specify a log group in a different compartment from the job.
      2. Select one of the following to store all stdout and stderr messages:
        Enable automatic log creation

        Data Science automatically creates a log when the job starts.

        Select a log

        Select a log to use.

      3. Click Select to return to the pipeline creation page.
    12. (Optional) Click Show advanced options to add tags to the pipeline.
    13. (Optional) Enter the tag namespace (for a defined tag), key, and value to assign tags to the resource.

      To add more than one tag, click Add tag.

      Tagging describes the various tags that you can use to organize and find resources, including cost-tracking tags.

    14. Click Create.

      After the pipeline is in an active state, you can use pipeline runs to repeatedly run the pipeline.

    From a Script

    The step runs a script. You must upload the artifact that contains all the code for the step to run.

    1. Enter a unique name for the step. You can't repeat a step name in a pipeline.
    2. (Optional) Enter a step description, which can help you find step dependencies.
    3. (Optional) If this step depends on another step, select one or more steps to run before this step.
    4. Drag a step file into the box, or click select a file to browse to and select it.
    5. In Entry point, select one file to be the entry point for the step run. This is useful when the step artifact contains many files.
    6. (Optional) Enter or select any of the following values to control this pipeline step:
      Custom environment variable key and value

      The environment variables for this pipeline step.

      Value

      The value for the custom environment variable key.

      You can click Additional custom environment key to specify more variables.

      Command line arguments

      The command line arguments that you want to use for running the pipeline step.

      Maximum runtime (in minutes)

      The maximum number of minutes that the pipeline step is allowed to run. The service cancels the pipeline run if its runtime exceeds the specified value. The maximum runtime is 30 days (43,200 minutes). We recommend that you configure a maximum runtime on all pipeline runs to prevent runaway pipeline runs.

    7. (Optional) Create a default pipeline configuration to use when the pipeline runs by entering environment variables, command line arguments, and maximum runtime options. See step 6 for an explanation of these fields.
    8. For Block Storage, enter the amount of storage that you want to use, from 50 GB to 10,240 GB (10 TB). You can change the value in 1 GB increments. The default value is 100 GB.
    9. Click Save to add the step and return to the Create pipeline page.
    10. (Optional) Use +Add pipeline steps to add more steps to complete your workflow by repeating the preceding steps.
    11. (Optional) Create a default pipeline configuration to use when the pipeline runs by entering environment variables, command line arguments, and maximum runtime options. See step 6 for an explanation of these fields.
    12. For Block Storage, enter the amount of storage that you want to use, from 50 GB to 10,240 GB (10 TB). You can change the value in 1 GB increments. The default value is 100 GB.
    13. Select one of the following options to configure the network type:
      • Default networking—The workload is attached by using a secondary VNIC to a preconfigured, service-managed VCN and subnet. This subnet allows egress to the public internet through a NAT gateway and access to other Oracle Cloud services through a service gateway.

        If you need access only to the public internet and OCI services, we recommend using this option. It doesn't require you to create networking resources or write policies for networking permissions.

      • Custom networking—Select the VCN and subnet that you want to use for the pipeline.

        For egress access to the public internet, use a private subnet with a route to a NAT gateway.

        If you don't see the VCN or subnet that you want to use, click Change Compartment, and then select the compartment that contains the VCN or subnet.

        Important

        You must use custom networking to use a file storage mount.

    14. (Optional) To use logging, click Select, and then ensure that Enable logging is selected.
      1. Select a log group from the list. You can change to a different compartment to specify a log group in a different compartment from the job.
      2. Select one of the following to store all stdout and stderr messages:
        Enable automatic log creation

        Data Science automatically creates a log when the job starts.

        Select a log

        Select a log to use.

      3. Click Select to return to the pipeline creation page.
    15. (Optional) Click Show advanced options to add tags to the pipeline.
    16. (Optional) Enter the tag namespace (for a defined tag), key, and value to assign tags to the resource.

      To add more than one tag, click Add tag.

      Tagging describes the various tags that you can use to organize and find resources, including cost-tracking tags.

    17. Click Create.

      After the pipeline is in an active state, you can use pipeline runs to repeatedly run the pipeline.

    From Container
    Optionally, when defining pipeline steps, you can choose to bring your own container (BYOC).
    1. Select From container.
    2. In the Container configuration section, click Configure.
    3. In the Configure your container environment panel, select a repository from the list. If the repository is in a different compartment, click Change compartment.
    4. Select an image from the list.
    5. (Optional) Enter an entry point. To add another, click +Add parameter.
    6. (Optional) Enter a CMD. To add another, click +Add parameter.
      Use CMD as arguments to the ENTRYPOINT, or as the only command to run when no ENTRYPOINT is specified.
    7. (Optional) Enter an image digest.
    8. (Optional) If using signature verification, enter the OCID of the image signature.
      For example, ocid1.containerimagesignature.oc1.iad.aaaaaaaaab....
    9. (Optional) Upload the step artifact by dragging it into the box.
      Note

      Uploading the step artifact is optional only when BYOC is configured.
  • These environment variables control the pipeline run.

    You can use the OCI SDK to create a pipeline, as in this Python example (the calls assume an initialized Data Science client, dsc; a setup sketch follows these steps):

    1. Create a pipeline:

      The following parameters are available to use in the payload:

      Parameter name Required Description
      Pipeline (top level)
      projectId Required The project OCID to create the pipeline in.
      compartmentId Required The compartment OCID to create the pipeline in.
      displayName Optional The name of the pipeline.
      infrastructureConfigurationDetails Optional

      Default infrastructure (compute) configuration to use for all the pipeline steps, see infrastructureConfigurationDetails for details on the supported parameters.

      Can be overridden by the pipeline run configuration.

      logConfigurationDetails Optional

      Default log to use for all the pipeline steps, see logConfigurationDetails for details on the supported parameters.

      Can be overridden by the pipeline run configuration.

      configurationDetails Optional

      Default configuration for the pipeline run, see configurationDetails for details on supported parameters.

      Can be overridden by the pipeline run configuration.

      freeformTags Optional Tags to add to the pipeline resource.
      stepDetails
      stepName Required Name of the step. Must be unique in the pipeline.
      description Optional Free text description for the step.
      stepType Required CUSTOM_SCRIPT or ML_JOB
      jobId Required* For ML_JOB steps, this is the job OCID to use for the step run.
      stepInfrastructureConfigurationDetails Optional*

      Default infrastructure (Compute) configuration to use for this step, see infrastructureConfigurationDetails for details on the supported parameters.

      Can be overridden by the pipeline run configuration.

      *Must be defined on at least one level. When defined on more than one level, precedence is (1 being highest):

      1. pipeline run

      2. step

      3. pipeline

      stepConfigurationDetails Optional*

      Default configuration for the step run, see configurationDetails for details on supported parameters.

      Can be overridden by the pipeline run configuration.

      *Must be defined on at least one level. When defined on more than one level, precedence is (1 being highest):

      1. pipeline run

      2. step

      3. pipeline

      dependsOn Optional List of steps that must be completed before this step begins. This creates the pipeline workflow dependencies graph.
      infrastructureConfigurationDetails
      shapeName Required Name of the Compute shape to use. For example, VM.Standard2.4.
      blockStorageSizeInGBs Required Number of GBs to use as the attached storage for the VM.
      logConfigurationDetails
      enableLogging Required Defines whether to use logging.
      logGroupId Required Log group OCID to use for the logs. The log group must be created and available when the pipeline runs.
      logId Optional* Log OCID to use for the logs when not using the enableAutoLogCreation parameter.
      enableAutoLogCreation Optional If set to True, a log for each pipeline run is created.
      configurationDetails
      type Required Only DEFAULT is supported.
      maximumRuntimeInMinutes Optional Time limit in minutes for the pipeline to run.
      environmentVariables Optional

      Environment variables to provide for the pipeline step runs.

      For example:

      "environmentVariables": {
      
       "CONDA_ENV_TYPE": "service"
      
      }

      Review the list of service supported environment variables.

      pipeline_payload = {
          "projectId": "<project_id>",
          "compartmentId": "<compartment_id>",
          "displayName": "<pipeline_name>",
          "pipelineInfrastructureConfigurationDetails": {
              "shapeName": "VM.Standard2.1",
              "blockStorageSizeInGBs": "50"
          },
          "pipelineLogConfigurationDetails": {
              "enableLogging": True,
              "logGroupId": "<log_group_id>",
              "logId": "<log_id>"
          },
          "pipelineDefaultConfigurationDetails": {
              "type": "DEFAULT",
              "maximumRuntimeInMinutes": 30,
              "environmentVariables": {
                  "CONDA_ENV_TYPE": "service",
                  "CONDA_ENV_SLUG": "classic_cpu"
              }
          },
          "stepDetails": [
              {
                  "stepName": "preprocess",
                  "description": "Preprocess step",
                  "stepType": "CUSTOM_SCRIPT",
                  "stepInfrastructureConfigurationDetails": {
                      "shapeName": "VM.Standard2.4",
                      "blockStorageSizeInGBs": "100"
                  },
                  "stepConfigurationDetails": {
                      "type": "DEFAULT",
                      "maximumRuntimeInMinutes": 90
                      "environmentVariables": {
                          "STEP_RUN_ENTRYPOINT": "preprocess.py",
                          "CONDA_ENV_TYPE": "service",
                          "CONDA_ENV_SLUG": "onnx110_p37_cpu_v1"
                  }
              },
              {
                  "stepName": "postprocess",
                  "description": "Postprocess step",
                  "stepType": "CUSTOM_SCRIPT",
                  "stepInfrastructureConfigurationDetails": {
                      "shapeName": "VM.Standard2.1",
                      "blockStorageSizeInGBs": "80"
                  },
                  "stepConfigurationDetails": {
                      "type": "DEFAULT",
                      "maximumRuntimeInMinutes": 60
                  },
                  "dependsOn": ["preprocess"]
              },
          ],
          "freeformTags": {
              "freeTags": "cost center"
          }
      }
      pipeline_res = dsc.create_pipeline(pipeline_payload)
      pipeline_id = pipeline_res.data.id

      Until all pipeline step artifacts are uploaded, the pipeline remains in the CREATING state.

    2. Upload a step artifact:

      After an artifact is uploaded, it can't be changed.

      fstream = open(<file_name>, "rb")
       
      dsc.create_step_artifact(pipeline_id, step_name, fstream, content_disposition=f"attachment; filename={<file_name>}")
    3. Update a pipeline:

      You can only update a pipeline when it's in an ACTIVE state.

      update_pipeline_details = {
          "displayName": "pipeline-updated"
      }
      dsc.update_pipeline(<pipeline_id>, update_pipeline_details)
    4. Start pipeline run:
      pipeline_run_payload = {
          "projectId": project_id,
          "displayName": "pipeline-run",
          "pipelineId": <pipeline_id>,
          "compartmentId": <compartment_id>,
      }
      dsc.create_pipeline_run(pipeline_run_payload)
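
    The preceding Python calls assume that dsc is an initialized Data Science client from the OCI Python SDK. The following is a minimal setup sketch, with a loop that waits for the pipeline to leave the CREATING state after the step artifacts are uploaded; the config file location and polling interval are assumptions:

      import time
      import oci

      # Build a Data Science client from the default OCI config file (~/.oci/config).
      config = oci.config.from_file()
      dsc = oci.data_science.DataScienceClient(config)

      # After uploading all step artifacts, poll until the pipeline is no longer CREATING.
      pipeline = dsc.get_pipeline(pipeline_id).data
      while pipeline.lifecycle_state == "CREATING":
          time.sleep(10)  # assumed polling interval
          pipeline = dsc.get_pipeline(pipeline_id).data
      print(pipeline.lifecycle_state)  # expect ACTIVE before starting pipeline runs
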
  • The ADS SDK is also a publicly available Python library that you can install with this command:

    pip install oracle-ads

    You can use the ADS SDK to create and run pipelines.
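
    As a minimal sketch, a pipeline with a single script step might be defined with the ADS pipeline classes as follows. The class and method names reflect the ads.pipeline interface and are assumptions to verify against your installed ADS version; the shape, conda slug, script name, and OCIDs are placeholders:

      from ads.jobs import ScriptRuntime
      from ads.pipeline import CustomScriptStep, Pipeline, PipelineStep

      # Infrastructure (Compute shape and block storage) for the step.
      infrastructure = (
          CustomScriptStep()
          .with_shape_name("VM.Standard2.1")
          .with_block_storage_size(50)
      )

      # A step that runs a local script in a service conda environment.
      step = (
          PipelineStep("preprocess")
          .with_description("Preprocess step")
          .with_infrastructure(infrastructure)
          .with_runtime(
              ScriptRuntime()
              .with_source("preprocess.py")
              .with_service_conda("onnx110_p37_cpu_v1")
          )
      )

      pipeline = (
          Pipeline("<pipeline_name>")
          .with_compartment_id("<compartment_id>")
          .with_project_id("<project_id>")
          .with_step_details([step])
      )

      pipeline.create()   # create the pipeline resource
      pipeline.run()      # start a pipeline run
      pipeline.show()     # render a visual representation of the pipeline
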

Custom Networking

Use a custom network that you've already created in the pipeline to give yourself extra flexibility over the network configuration.

Creating Pipelines with Custom Networking

You can choose to use custom networking when creating a pipeline.

Note

Switching from custom networking to managed networking isn't supported after the pipeline is created.
Tip

If you see the banner "The specified subnet is not accessible. Select a different subnet.", create a network access policy as described in the section Pipeline Policies.

Using the Console

Choose to use custom networking in the Create pipeline panel.

If you choose default networking, the system uses the existing service-managed network. If you select the custom networking option, you're prompted to pick a VCN and a subnet.

Select the VCN and subnet that you want to use for the resource. For egress access to the public internet, use a private subnet with a route to a NAT gateway. If you don't see the VCN or subnet that you want to use, click Change Compartment, and then select the compartment that contains the VCN or subnet.

Using APIs

Provide subnet-id in the infrastructure-configuration-details to use a custom subnet at the pipeline level. For example:
"infrastructure-configuration-details": {
      "block-storage-size-in-gbs": 50,
      "shape-config-details": {
        "memory-in-gbs": 16.0,
        "ocpus": 1.0
      },
      "shape-name": "VM.Standard.E4.Flex",
      "subnet-id": "ocid1.subnet.oc1.iad.aaaaaaaa5lzzq3fyypo6x5t5egplbfyxf2are6k6boop3vky5t4h7g35xkoa"
}
Or provide subnet-id in the step-infrastructure-configuration-details to use a custom subnet for a particular step. For example:
"step-infrastructure-configuration-details": {
          "block-storage-size-in-gbs": 50,
          "shape-config-details": {
            "memory-in-gbs": 16.0,
            "ocpus": 1.0
          },
          "shape-name": "VM.Standard.E4.Flex",
          "subnet-id": "ocid1.subnet.oc1.iad.aaaaaaaa5lzzq3fyypo6x5t5egplbfyxf2are6k6boop3vky5t4h7g35xkoa"
},
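
The Python SDK payload shown earlier can carry the same setting. The following is a minimal sketch, assuming the camelCase key subnetId; the keys mirror the hyphenated names above and should be verified against the API reference:

# Hedged sketch: pipeline-level custom subnet added to the earlier pipeline_payload example.
pipeline_payload["pipelineInfrastructureConfigurationDetails"] = {
    "shapeName": "VM.Standard.E4.Flex",
    "shapeConfigDetails": {"ocpus": 1.0, "memoryInGBs": 16.0},
    "blockStorageSizeInGBs": 50,
    "subnetId": "<subnet_ocid>",  # assumed camelCase equivalent of subnet-id
}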