Interactively debug a component run in remote compute

Overview

To reproduce the runtime environment of a component run, the Component SDK supports remote debugging: it creates a debug run with the same environment on the remote compute. You can then debug the component on the remote compute and reproduce the environment for distributed or GPU use cases. This document describes the scenarios for this feature and how to use it.

There are four steps to debug in remote compute.

  1. Generate debug run from an existing component run

    This step reuses the configuration of the previous component run to submit a new component run to the compute you specify. A sleep infinity command is added before the step command so that the run stays suspended on the compute while you debug.

  2. Connect to the remote compute or container

    After the debug run is submitted to the compute, you can use VS Code to connect to the remote compute over SSH. ITP compute enables SSH automatically. For AmlCompute, you need an SSH-enabled AmlCompute.

  3. Debug in the debug run

    After logging in to the remote compute or container with VS Code, you can debug the script of the debug run. If the command of the previous run starts with python, the SDK automatically generates .vscode/launch.json, which stores the debugging configuration.

  4. Cancel the debug run

    Because the debug run is suspended, you need to cancel it after debugging to release the resources.

Limitations

  • For component type, remote debug supports CommandComponent, DistributedComponent and SweepComponent.

  • For remote compute type, remote debug supports AmlCompute and ITP compute. The AmlCompute must have SSH enabled.

  • For component command, the VS Code launch.json is generated only for Python-style commands (commands that start with python).

  • For execution OS type, only Linux is supported as the OS type of the component.
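
The limitations above can be read as a small pre-flight check. The following is a minimal sketch, not part of the Component SDK: the function names and the type-name strings ("CommandComponent", "ITP", and so on) are hypothetical labels chosen for illustration, and the rules are taken only from the list above.

```python
from typing import List

# Hypothetical labels mirroring the Limitations list; not SDK identifiers.
SUPPORTED_COMPONENT_TYPES = {"CommandComponent", "DistributedComponent", "SweepComponent"}
SUPPORTED_COMPUTE_TYPES = {"AmlCompute", "ITP"}


def remote_debug_problems(component_type: str, compute_type: str,
                          ssh_enabled: bool, os_type: str) -> List[str]:
    """Return human-readable reasons why remote debug would not be supported."""
    problems = []
    if component_type not in SUPPORTED_COMPONENT_TYPES:
        problems.append(f"unsupported component type: {component_type}")
    if compute_type not in SUPPORTED_COMPUTE_TYPES:
        problems.append(f"unsupported compute type: {compute_type}")
    # ITP enables SSH automatically; AmlCompute must be created with SSH enabled.
    if compute_type == "AmlCompute" and not ssh_enabled:
        problems.append("AmlCompute must have SSH enabled")
    if os_type.lower() != "linux":
        problems.append(f"unsupported OS type: {os_type}")
    return problems


def can_generate_launch_json(command: str) -> bool:
    """Only commands whose first word is python get a generated launch.json."""
    parts = command.split()
    return bool(parts) and parts[0] == "python"
```

An empty list from remote_debug_problems means none of the listed limitations applies; a command that fails can_generate_launch_json can still be debugged, but the debug configuration would have to be written by hand.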

Prerequisites

  • Install VS Code by following the instructions here.

  • Install the Remote - SSH and Docker extensions in VS Code.

  • To debug in AmlCompute, install Docker on your machine and add it to the system path.

Using remote compute to debug component run

1. Generate debug run from existing component run

Click a step of the pipeline. The step run ID is shown in the Details tab.

get_run_id.png

Generate the debug run with the Component SDK.

from azureml.core import Workspace
from azure.ml.component import Run

# Replace the placeholder strings with your own values.
workspace = Workspace.get(name="<workspace name>", resource_group="<resource group name>", subscription_id="<subscription id>")
# Load the step run to debug, then submit a debug run to the chosen compute.
failed_step_run = Run.get(workspace=workspace, run_id="<Your failed step run id>")
failed_step_run._debug(compute="<Remote debug compute name>")

If the remote compute is an ITP compute, the debug run is generated in the ITP compute. If the remote compute is an AmlCompute, the debug run is executed in the AmlCompute. If you do not set a remote compute, the debug run uses the same compute configuration as the original component run.

2. Connect to the remote compute with VS Code

After the debug run is generated, information about the debug run and the interactive debugging steps is printed in the terminal. The debug run information looks like the following.

Sample of command component debug info:

INFO     - --------------------------------------------------------------------------------
INFO     - Information about debug run and remote compute:
INFO     - Link to job instance: https://ml.azure.com/runs/test_debug_run_1624009740_44cb4c6f?wsid=/subscriptions/xxx-xxx-xxx/resourcegroups/xxx/workspaces/xxx&tid=72f988bf-86f1-41af-91ab-2d7cd011db47
INFO     - Command Working Directory: /mnt/batch/tasks/shared/LS_root/jobs/azureml/test_debug_run_1624009740_44cb4c6f/wd/azureml/test_debug_run_1624009740_44cb4c6f
INFO     - Command: python basic_component.py --input_dir=/mnt/batch/tasks/shared/LS_root/jobs/azureml/test_debug_run_1624009740_44cb4c6f/wd/inputDir_187df8bc-2389-4d0a-b41c-1d10fe6be02c/Titanic.csv --str_param some_string --enum_param e1 --output_dir /mnt/batch/tasks/shared/LS_root/jobs/azureml/test_debug_run_1624009740_44cb4c6f/wd/output_dir_workspaceblobstore --int_param 123
INFO     - SSH connection command: ssh azureuser@xx.xx.xx.xx -p 50004
INFO     - --------------------------------------------------------------------------------

If you debug an MPI distributed component, the SDK also generates the distributed command used to execute distributed jobs. Sample of distributed component debug info:

INFO     - --------------------------------------------------------------------------------
INFO     - Information about debug run and remote compute:
INFO     - Link to job instance: https://ml.azure.com/runs/test_debug_run_1624009740_44cb4c6f?wsid=/subscriptions/xxx-xxx-xxx/resourcegroups/xxx/workspaces/xxx&tid=72f988bf-86f1-41af-91ab-2d7cd011db47
INFO     - Command Working Directory: /mnt/batch/tasks/shared/LS_root/jobs/azureml/test_debug_run_1624009740_44cb4c6f/wd/azureml/test_debug_run_1624009740_44cb4c6f
INFO     - Debug command:
INFO     - 	Command: python mpi_train.py --training_data $AZUREML_TRANS_DATA_PATH_training_data --max_epochs $AZUREML_PARAMETER_max_epochs --learning_rate $AZUREML_PARAMETER_learning_rate --model_output $AZUREML_TRANS_DATA_PATH_model_output
INFO     - 	Distributed command: ssh worker-0 "/bin/bash --login -c \"cd /workspaceblobstore/azureml/test_debug_run_1626164063_c3f6549c && mpirun --tag-output -hostfile /job/\\\${DLTS_JOB_ID}/hostfile --npernode 4  -bind-to none -x NCCL_DEBUG=INFO -x NCCL_TREE_THRESHOLD=0 -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include eth0 /bin/bash --login -c  \\\"/azureml-envs/azureml_c49e1f3d3829b3ca9fffc2a51fc57a49/bin/python mock_distributed_run.py python mpi_train.py --training_data \\\\\\\$AZUREML_TRANS_DATA_PATH_training_data --max_epochs \\\\\\\$AZUREML_PARAMETER_max_epochs --learning_rate \\\\\\\$AZUREML_PARAMETER_learning_rate --model_output \\\\\\\$AZUREML_TRANS_DATA_PATH_model_output\\\"\""
INFO     - SSH connection command:
INFO     - 	SSH to ps0: ssh -i //10.217.90.227/zhrua/.ssh/id_rsa -p 41327 xxx@azure-xxx.eastus.cloudapp.azure.com
INFO     - 	SSH to worker0: ssh -i //10.217.90.227/zhrua/.ssh/id_rsa -p 40485 xxx@azure-xxx.eastus.cloudapp.azure.com
INFO     - 	SSH to worker1: ssh -i //10.217.90.227/zhrua/.ssh/id_rsa -p 41807 xxx@azure-xxx.eastus.cloudapp.azure.com
INFO     - --------------------------------------------------------------------------------
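
The SSH connection commands in the debug info carry the user, host, port and (for ITP) the identity file needed in the following steps. As a minimal sketch, the helpers below pull those values out of the printed text. Both function names are hypothetical, and the parsing assumes the log layout matches the samples above; the SDK may print a different format in other versions.

```python
import shlex
from typing import Dict, List, Optional


def ssh_commands_from_log(log_text: str) -> List[str]:
    """Collect the ssh commands printed on lines whose label mentions SSH."""
    commands = []
    for line in log_text.splitlines():
        label, sep, tail = line.partition(": ssh ")
        # Skip lines such as "Distributed command: ssh worker-0 ...".
        if sep and "SSH" in label:
            commands.append("ssh " + tail.strip())
    return commands


def parse_ssh_command(command: str) -> Dict[str, Optional[str]]:
    """Split an ssh command line into user, host, port and identity file."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "ssh":
        raise ValueError(f"not an ssh command: {command!r}")
    result: Dict[str, Optional[str]] = {
        "user": None, "host": None, "port": None, "identity_file": None,
    }
    i = 1
    while i < len(tokens):
        token = tokens[i]
        if token in ("-p", "-i") and i + 1 < len(tokens):
            key = "port" if token == "-p" else "identity_file"
            result[key] = tokens[i + 1]
            i += 2
            continue
        if not token.startswith("-") and result["host"] is None:
            # The destination has the form user@host, or just host.
            user, sep, host = token.rpartition("@")
            result["user"] = user if sep else None
            result["host"] = host
        i += 1
    return result
```

For the command component sample, parse_ssh_command("ssh azureuser@xx.xx.xx.xx -p 50004") yields the user azureuser, the host xx.xx.xx.xx and the port 50004 (as a string), with no identity file.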

Connect to the Linux container in ITP compute

After the debug run is generated, the debug info is printed in the terminal. Follow these steps to connect to the ITP compute.

Note: To connect to the remote compute without a password, configure it according to the steps in this section.

Note: Debugging a component that uses a curated environment is not supported in ITP compute.

  1. Add the SSH connection configuration to the SSH targets. Enter the SSH connection command shown in the terminal into the text box.

    create_ssh_target.png

    If you have not configured an SSH key in ITP, you need to use a password to connect to the remote compute. You can find the password in User settings; see reference for more details.

  2. After creating the SSH target, you can find it in the Remote Explorer tab and click it to connect in a new window.

    connect_to_host.png
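
The Remote - SSH extension reads its targets from an OpenSSH client configuration file, so an SSH target can also be written as a Host block. The sketch below formats such a block from the values in the SSH connection command. The helper name and the alias itp-debug are hypothetical; the keywords (Host, HostName, User, Port, IdentityFile) are standard OpenSSH configuration keywords.

```python
from typing import Optional


def ssh_config_entry(alias: str, host: str, user: str, port: int,
                     identity_file: Optional[str] = None) -> str:
    """Format one Host block for an OpenSSH client config file."""
    lines = [
        f"Host {alias}",
        f"    HostName {host}",
        f"    User {user}",
        f"    Port {port}",
    ]
    # IdentityFile is only needed when you connect with an SSH key.
    if identity_file:
        lines.append(f"    IdentityFile {identity_file}")
    return "\n".join(lines) + "\n"


# Placeholder values copied from the sample debug info; replace with your own.
print(ssh_config_entry("itp-debug", "azure-xxx.eastus.cloudapp.azure.com", "xxx", 41327))
```

Appending the printed block to your SSH configuration file should make the alias appear as a target in the Remote Explorer tab.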

Connect to the Linux container in AmlCompute

If the OS type of the step run is Linux, follow these steps to connect to the Linux container in AmlCompute.

  1. Configure ssh-agent on the local system with the private key file that matches the public key configured on the AmlCompute.

    For Windows (OpenSSH), from an admin command prompt, run sc config ssh-agent start=auto and net start ssh-agent. Then run ssh-add <keyfile>.

    For Linux, ssh-agent is present by default. Run ssh-add <keyfile>.

    After running the commands, you can verify that the identity is available to the agent with ssh-add -l. It should list one or more identities that look something like 2048 SHA256:abcdefghijk somethingsomething (RSA).

  2. Create and activate a Docker context that points to the Docker daemon running on the AmlCompute.

    The remote username, host IP address and port are shown in the terminal.

    docker context create <docker-context-name> --docker "host=ssh://<username-in-remote>@<remote-machine-name-or-IP>:<port>"
    docker context use <docker-context-name>
    
  3. Manage Docker as a non-root user

    By default, the Docker daemon is owned by root and other users can only access it with sudo, which causes an error when VS Code connects to the container. Log in to the AmlCompute over SSH and run the following commands to create the docker group and add your user to it.

    sudo groupadd docker
    sudo usermod -aG docker $USER
    
  4. Connect the remote container in AmlCompute.

    After activating the Docker context, you can use the VS Code Docker extension to find and connect to the container. The container name is the same as the debug run ID. Right-click the container shown in the Containers tab, then click Attach Visual Studio Code to connect to the remote container.

    docker-extension.png

See reference for more details about connecting to a remote Docker daemon over SSH.
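
The Docker context commands in step 2 differ only in the context name and the SSH endpoint, so they can be assembled from the values printed in the terminal. This is a minimal sketch; the helper name and the context name aml-debug are hypothetical, and it only builds the command strings without running them.

```python
import shlex
from typing import List


def docker_context_commands(context_name: str, user: str, host: str, port: int) -> List[str]:
    """Return the shell commands that create and activate an SSH-backed Docker context."""
    endpoint = f"host=ssh://{user}@{host}:{port}"
    create = ["docker", "context", "create", context_name, "--docker", endpoint]
    use = ["docker", "context", "use", context_name]
    # shlex.join quotes each argument so the result can be pasted into a POSIX shell.
    return [shlex.join(create), shlex.join(use)]


# Placeholder values copied from the sample debug info; replace with your own.
for command in docker_context_commands("aml-debug", "azureuser", "xx.xx.xx.xx", 50004):
    print(command)
```

Keeping a dedicated context name for debugging makes it easier to switch back to your previous context when you finish.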

3. Debug in remote compute

  1. After connecting to the remote compute, use the working directory shown in the terminal to open the workspace.

    open_folder.png

  2. If your step command is a Python-style command (it starts with python), the debug configuration is generated in .vscode/launch.json. Install the Python extension for VS Code, then press F5 to start debugging directly.
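
To give a feel for what a debug configuration for a Python-style command can contain, the sketch below turns a command and working directory into a launch.json structure. It is an illustration only and not the file the SDK generates: the function name is hypothetical, the field values are assumptions, and recent versions of the Python tooling for VS Code use the debugger type debugpy instead of python.

```python
import json
import shlex
from typing import Any, Dict


def launch_config_from_command(command: str, cwd: str) -> Dict[str, Any]:
    """Build a VS Code launch configuration for a `python <script> [args]` command."""
    tokens = shlex.split(command)
    if len(tokens) < 2 or tokens[0] != "python":
        raise ValueError("only commands of the form `python <script> ...` are handled")
    return {
        "version": "0.2.0",
        "configurations": [
            {
                "name": f"Debug {tokens[1]}",
                "type": "python",          # assumption; newer tooling uses "debugpy"
                "request": "launch",
                "program": tokens[1],      # script path, relative to cwd
                "args": tokens[2:],        # remaining command-line arguments
                "cwd": cwd,                # the Command Working Directory from the debug info
                "console": "integratedTerminal",
            }
        ],
    }


config = launch_config_from_command(
    "python basic_component.py --str_param some_string --int_param 123",
    "/path/to/working/directory",  # placeholder; use the directory printed in the terminal
)
print(json.dumps(config, indent=4))
```

If the generated file is missing or the command does not start with python, a hand-written configuration along these lines, saved as .vscode/launch.json in the working directory, may serve as a starting point.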

4. End debugging

When you finish debugging, first cancel the debug run using the run link shown in the terminal. The debug run adds sleep infinity to its command, so it stays suspended until it is cancelled.

Then, if you debugged in AmlCompute, you may need to restore your previous Docker context with this command.

docker context use <previous context name>

FAQ

How to create an SSH-enabled AmlCompute

Follow these steps to configure the SSH settings when creating an AmlCompute.

  • Enable SSH in the portal

    1. Enable SSH access.

    2. Select Use existing public key as the SSH public key source.

    3. Copy the public key into the text area.

    config_ssh.png

  • Enable SSH with the Python SDK

    Set the admin username and public key in the provisioning configuration to create an SSH-enabled AmlCompute.

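    from azureml.core.compute import ComputeTarget, AmlCompute

    # Setting admin_username and admin_user_ssh_key enables SSH on the cluster nodes.
    # Replace the placeholder strings with your own values.
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4,
                                                           idle_seconds_before_scaledown=2400,
                                                           admin_username='<Your admin username>',
                                                           admin_user_ssh_key='<The public key>')
    # workspace is an azureml.core.Workspace, obtained for example with Workspace.get(...).
    compute = ComputeTarget.create(workspace, '<Your compute name>', compute_config)
    compute.wait_for_completion(show_output=True)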
    

    See reference for more details about AmlCompute configuration.