Local debug component run using common-runtime

Overview

To reproduce the runtime environment of a component step run on a local machine, the Component SDK supports local debugging using common-runtime, which creates a debug environment identical to the one on the remote compute. This document describes the scenarios for this feature and how to use it.

Debugging locally with common-runtime takes four steps.

  1. Get local debug command from failed run step in the portal

    In this step, the user right-clicks a step run in the portal and then clicks Debug in the menu bar, which generates a command like az-ml run debug --run-id <failed-run-id>.

    Note: This is only available to 1P customers (webxt).

  2. Execute local debug command to prepare debug environment

    The user pastes and runs the debug command on the local machine. The command uses common-runtime to generate the same environment as on AmlCompute. When it finishes, VS Code opens, attaches to the debug container, and opens the working directory.

  3. Debug in the container

    The user can debug directly in the VS Code window attached to the container. If the command of the component run starts with python, a .vscode/launch.json file that records the debugging configuration is generated automatically.

  4. Remove debug containers

    When debugging is complete, the user can run the generated script to delete the debug containers.

Limitations

  • For component type, local debug using common-runtime supports CommandComponent, DistributedComponent, and child runs of SweepComponent.

  • For remote compute type, local debug using common-runtime only supports AmlCompute.

  • For the component command, a VS Code launch.json is generated only for commands that start with python.

  • For execution OS type, only Linux components are supported.
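The limitations above can be expressed as a small pre-check. The following is a minimal sketch, not part of the Component SDK: the StepRunInfo class and its field names are hypothetical, and only the accepted values (component types, AmlCompute, Linux, commands starting with python) come from the list above.

```python
from dataclasses import dataclass
from typing import List

# Accepted values are taken from the limitations list; everything else
# (class name, field names) is hypothetical and used for illustration only.
SUPPORTED_COMPONENT_TYPES = {"CommandComponent", "DistributedComponent"}


@dataclass
class StepRunInfo:
    component_type: str   # e.g. "CommandComponent"
    is_sweep_child: bool  # True for a child run of a SweepComponent
    compute_type: str     # e.g. "AmlCompute"
    os_type: str          # e.g. "Linux"
    command: str          # the component's command line


def unsupported_reasons(run: StepRunInfo) -> List[str]:
    """Return the limitations a run violates; an empty list means it qualifies."""
    reasons = []
    if not (run.component_type in SUPPORTED_COMPONENT_TYPES or run.is_sweep_child):
        reasons.append("unsupported component type: " + run.component_type)
    if run.compute_type != "AmlCompute":
        reasons.append("unsupported compute type: " + run.compute_type)
    if run.os_type.lower() != "linux":
        reasons.append("unsupported OS type: " + run.os_type)
    return reasons


def gets_launch_json(run: StepRunInfo) -> bool:
    """A launch.json is only generated for commands that start with python."""
    return run.command.strip().startswith("python")
```

A run that returns an empty list from unsupported_reasons can be debugged locally; gets_launch_json additionally tells whether a debug configuration would be generated for it.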

Prerequisites

Local debug using common-runtime

1. Get local debug command from failed run step in the portal

  1. Right-click a step run in the portal, then click Debug in the menu bar.

debug_menu_bar

  2. A dialog opens that shows the az-ml run debug command. Copy the command.

debug_dialog

Note: This is only available to 1P customers (webxt).

2. Execute local debug command to prepare debug environment

Paste and run the debug command in a terminal. On Windows, run the command inside WSL2.
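One way to confirm that a terminal is running inside WSL2 before running the command is to inspect the kernel release string. The sketch below is a heuristic only: it assumes the stock WSL2 kernel, whose release string typically contains both "microsoft" and "WSL2"; custom kernels may report something else. The function names are hypothetical.

```python
import platform


def looks_like_wsl2(kernel_release: str) -> bool:
    """Heuristic: stock WSL2 kernels report a release such as
    '5.15.90.1-microsoft-standard-WSL2'."""
    release = kernel_release.lower()
    return "microsoft" in release and "wsl2" in release


def current_shell_is_wsl2() -> bool:
    """Apply the heuristic to the machine this script runs on."""
    return platform.system() == "Linux" and looks_like_wsl2(platform.release())
```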

The az-ml run debug command performs the following steps:

  1. Get the common-runtime information for the step run from the backend, using the run ID.

  2. Generate the same debug environment as on the remote compute.

    1. Get the bootstrapper using the common-runtime information.

    2. Execute the bootstrapper to create the debug container, which matches the remote one. If the command of the step run starts with python, a debug configuration is generated in the working directory.

  3. Attach VS Code to the container and open the working directory.

The command execution log looks like this:

execution_log

After execution, the command generates the folder ~/common-runtime-debug/<run-id>/ with the following structure:

./
├── DEBUG                   Contains debug container_name and working_dir path
├── remove_containers.sh    Script to remove debug containers when debugging is completed
├── vm-bootstrapper         Bootstrapper binary
├── stderr                  Bootstrapper execution stderr log
└── stdout                  Bootstrapper execution stdout log

Note: This step may take a long time because it pulls the image and downloads the dataset.
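If the command seems to have stopped early, it can help to check which of the entries listed above are present. The following is a minimal sketch: the entry names and the ~/common-runtime-debug/<run-id>/ location come from the folder structure above, while the function names are hypothetical. It does not parse the DEBUG file, because its format is not described here.

```python
from pathlib import Path
from typing import Iterable, List, Optional

# Entry names taken from the folder structure shown above.
EXPECTED_ENTRIES = ["DEBUG", "remove_containers.sh", "vm-bootstrapper", "stderr", "stdout"]


def debug_dir(run_id: str, home: Optional[Path] = None) -> Path:
    """Return ~/common-runtime-debug/<run-id>/ (home can be overridden)."""
    base = home if home is not None else Path.home()
    return base / "common-runtime-debug" / run_id


def missing_entries(present: Iterable[str]) -> List[str]:
    """Return the expected entries that are not in the given names."""
    present_set = set(present)
    return [name for name in EXPECTED_ENTRIES if name not in present_set]


def check_debug_dir(run_id: str) -> List[str]:
    """List the expected entries missing from the debug folder of a run."""
    folder = debug_dir(run_id)
    names = [p.name for p in folder.iterdir()] if folder.is_dir() else []
    return missing_entries(names)
```

An empty result from check_debug_dir means all five entries exist; a missing vm-bootstrapper or DEBUG entry suggests the environment was not prepared completely, and the stderr log is the first place to look.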

3. Debug in the container

After the command finishes, VS Code opens, attaches to the debug container, and opens the working directory.

If the command of the component run starts with python, a .vscode/launch.json file that records the debugging configuration is generated automatically. Press F5 to debug the component code.

vscode_debug

See the reference for more details about debugging.
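The exact contents of the generated .vscode/launch.json are not shown in this document. As an illustration only, the sketch below builds a typical VS Code Python launch configuration from a command of the form python <script> [args...]. The function name and the configuration name are hypothetical, and the "type" value is an assumption: newer versions of the VS Code Python tooling use "debugpy" instead of "python".

```python
import posixpath
import shlex
from typing import Any, Dict


def launch_config_from_command(command: str) -> Dict[str, Any]:
    """Build a launch.json-style dict for 'python <script> [args...]'.

    This is an illustrative guess, not the file the tool actually writes.
    """
    tokens = shlex.split(command)
    if len(tokens) < 2 or not tokens[0].startswith("python"):
        raise ValueError("expected a command of the form 'python <script> [args...]'")
    script = tokens[1]
    if script.startswith("-"):
        raise ValueError("interpreter options such as -m are not handled in this sketch")
    if not posixpath.isabs(script):
        # Resolve relative scripts against the opened working directory.
        script = "${workspaceFolder}/" + script
    return {
        "version": "0.2.0",
        "configurations": [
            {
                "name": "Debug component",  # hypothetical name
                "type": "python",           # assumption; may be "debugpy"
                "request": "launch",
                "program": script,
                "args": tokens[2:],
                "console": "integratedTerminal",
            }
        ],
    }
```

If the generated file is missing, for example because the component command does not start with python, a configuration along these lines can be written by hand and adjusted to the actual entry point.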

4. Remove debug containers

When debugging is complete, run this command to delete the debug containers.

sh ~/common-runtime-debug/<run-id>/remove_containers.sh

FAQ

az-ml run debug command

The usage of az-ml run debug is as follows:

usage: az-ml run debug [-h] [--subscription_id SUBSCRIPTION_ID] [--resource_group RESOURCE_GROUP] [--workspace_name WORKSPACE_NAME] [--debug] [--run-id RUN_ID]

A CLI tool to local debug using common runtime.

optional arguments:
  -h, --help            show this help message and exit
  --subscription_id SUBSCRIPTION_ID, -s SUBSCRIPTION_ID
                        Subscription id, required when pass run id.
  --resource_group RESOURCE_GROUP, -r RESOURCE_GROUP
                        Resource group name, required when pass run id.
  --workspace_name WORKSPACE_NAME, -w WORKSPACE_NAME
                        Workspace name, required when pass run id.
  --debug               Increase logging verbosity to show all debug logs
  --run-id RUN_ID       The run id of step run to be debugged.
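When the command is launched from a script rather than pasted from the portal, the argument list can be assembled programmatically. The sketch below uses only the flags listed in the help text above; the function names are hypothetical, and it only builds the command without running it.

```python
import shlex
from typing import List


def build_debug_command(run_id: str, subscription_id: str, resource_group: str,
                        workspace_name: str, verbose: bool = False) -> List[str]:
    """Assemble the az-ml run debug argument list from the documented flags."""
    argv = [
        "az-ml", "run", "debug",
        "--run-id", run_id,
        "--subscription_id", subscription_id,
        "--resource_group", resource_group,
        "--workspace_name", workspace_name,
    ]
    if verbose:
        argv.append("--debug")  # show all debug logs
    return argv


def format_command(argv: List[str]) -> str:
    """Render the argument list as a shell-safe string for display."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

The list returned by build_debug_command can be passed to subprocess.run as is, which avoids shell quoting problems with workspace or resource group names.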

Common issues

Docker responded with status code 500: path is mounted on / but it is not a shared mount

You may see the following error when executing az-ml run debug:

CommonRuntimeJobError {
    code: "CommandError",
    category: SystemError,
    message: Compliant(
        "Docker responded with status code 500: path /tmp/azureml/cr/j/xxxxx/.grpc is mounted on / but it is not a shared mount.\n",
    ),
    details: [],
    error: None,
}

To resolve it, run sudo mount --make-shared /.
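To check whether / is already a shared mount, the mount propagation can be read from /proc/self/mountinfo, where a shared mount carries an optional field of the form shared:N. The following is a minimal sketch with hypothetical function names; it assumes a Linux host (or WSL2) where that file exists.

```python
def is_shared_mount(mountinfo_text: str, mount_point: str = "/") -> bool:
    """Return True if the given mount point is marked shared in mountinfo text.

    Each line has six fixed fields, then zero or more optional fields
    (such as 'shared:1'), then a '-' separator.
    """
    shared = False
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 7 or "-" not in fields[6:]:
            continue  # not a well-formed mountinfo line
        if fields[4] != mount_point:
            continue
        optional = fields[6:fields.index("-", 6)]
        # Later entries for the same path are mounted on top, so the last one wins.
        shared = any(field.startswith("shared:") for field in optional)
    return shared


def root_is_shared() -> bool:
    """Check the root mount of the current machine."""
    with open("/proc/self/mountinfo") as handle:
        return is_shared_mount(handle.read(), "/")
```

If root_is_shared returns False, the error above is expected and making the root mount shared should clear it.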

The compute could not authenticate with the Docker registry

You may see the following error when executing az-ml run debug:

CommonRuntimeJobError {
    code: "AggregatedUnauthorizedAccessError",
    category: UserError,
    message: Compliant(
        "Failed to pull Docker image 848b8fb7991f410dafa12b95593b519c.azurecr.io/azureml/azureml_301b04d9d1ade06963f05664646fb2d5. This error may occur because the compute could not authenticate with the Docker registry to pull the image. If using ACR please ensure the ACR has Admin user enabled or a Managed Identity with `AcrPull` access to the ACR is assigned to the compute. If the ACR Admin user's password was changed recently it may be necessary to synchronize the workspace keys.",
    ),
    details: [
        Detail {
            name: "Authentication methods attempted",
        },
        Detail {
            name: "Note",
            value: Literal(
                Compliant(
                    "Identity (MSI) not found on the compute, if the intention is to authenticate with identity ensure that a Managed Identity with `AcrPull` access to the ACR is assigned to the compute",
                ),
            ),
        },
        Detail {
            name: "Error",
            value: Error(
                CommonRuntimeJobError {
                    code: "DockerUnauthorizedAccessError",
                    category: UserError,
                    message: Compliant(
                        "Failed to pull Docker image 848b8fb7991f410dafa12b95593b519c.azurecr.io/azureml/azureml_301b04d9d1ade06963f05664646fb2d5 with authentication mode Anonymous due to: Docker responded with status code 500: {\"message\":\"Get https://848b8fb7991f410dafa12b95593b519c.azurecr.io/v2/azureml/azureml_301b04d9d1ade06963f05664646fb2d5/manifests/latest: unauthorized: authentication required, visit https://aka.ms/acr/authorization for more information.\"}\n. Compute could not authenticate with the Docker registry to pull the image.",
                    ),
                    details: [],
                    error: None,
                },
            ),
        },
    ],
    error: None,
}

This means the local machine could not authenticate with the Docker registry to pull the image. Run az acr login --name <registry-name> to log in to the registry.

See the reference for more details about ACR.
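The registry name is the first label of the image host in the error message, that is, the part before .azurecr.io. The sketch below extracts it and builds the login command; the function names are hypothetical, and the code assumes the image is hosted on a registry whose host name ends in .azurecr.io.

```python
from typing import List

ACR_SUFFIX = ".azurecr.io"


def registry_name_from_image(image: str) -> str:
    """Extract the ACR registry name from an image reference."""
    host = image.split("/", 1)[0]
    if not host.endswith(ACR_SUFFIX) or host == ACR_SUFFIX:
        raise ValueError("not an azurecr.io image reference: " + image)
    return host[: -len(ACR_SUFFIX)]


def acr_login_command(image: str) -> List[str]:
    """Build the az acr login command for the registry hosting the image."""
    return ["az", "acr", "login", "--name", registry_name_from_image(image)]
```

For the image in the error above, this yields az acr login --name 848b8fb7991f410dafa12b95593b519c.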

Unknown runtime specified nvidia

If you see the following error when executing az-ml run debug, the nvidia runtime is not configured in the Docker runtimes:

CommonRuntimeJobError {
   code: "OrchestrateJobError",
   category: SystemError,
   message: Compliant(
     "Failed to execute command group with error API queried with a bad parameter: {\"message\":\"Unknown runtime specified nvidia\"}\n",
   ),
   details: [],
}

Follow these steps to install nvidia-container-runtime and configure the Docker runtimes:

  1. Install nvidia-container-runtime

    $ sudo apt-get install nvidia-container-runtime
    
  2. Add nvidia to docker runtime

    $ sudo tee /etc/docker/daemon.json <<EOF
    {
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }
    EOF
    $ sudo pkill -SIGHUP dockerd
    
  3. Restart the Docker service and check that the runtime was added

    $ sudo systemctl daemon-reload
    $ sudo systemctl restart docker
    $ docker info|grep -i runtime
     Runtimes: nvidia runc
     Default Runtime: runc
    

See the reference for more details about nvidia-container-runtime.
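Note that the tee command in step 2 replaces the whole /etc/docker/daemon.json file. If the file already contains other settings, merging the nvidia entry into the existing configuration keeps them. The following is a minimal sketch with hypothetical function names; it works on already-parsed JSON and leaves reading and writing the file (which needs root) to the caller. It also shows one way to read the Runtimes line printed by docker info in step 3.

```python
import copy
from typing import Any, Dict, List

# Runtime entry taken from step 2 above.
NVIDIA_RUNTIME = {"path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": []}


def add_nvidia_runtime(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a daemon.json dict with the nvidia runtime added.

    Existing keys and an existing nvidia entry are left untouched.
    """
    merged = copy.deepcopy(config)
    runtimes = merged.setdefault("runtimes", {})
    runtimes.setdefault("nvidia", copy.deepcopy(NVIDIA_RUNTIME))
    return merged


def runtimes_from_docker_info(output: str) -> List[str]:
    """Parse the 'Runtimes:' line from docker info output."""
    for line in output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Runtimes:"):
            return stripped[len("Runtimes:"):].split()
    return []
```

After loading the current file with json.load, pass the result through add_nvidia_runtime and write it back with json.dump, then restart Docker as in step 3.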