Sweep Component

A sweep component is a type of component that lets you automate efficient hyperparameter tuning.

Note: You can sweep directly on a command component without defining a sweep component. Learn more:

Overview

Hyperparameters are adjustable parameters that let you control the model training process. For example, with neural networks, you decide the number of hidden layers and the number of nodes in each layer. Model performance depends heavily on hyperparameters.

Hyperparameter tuning, also called hyperparameter optimization, is the process of finding the configuration of hyperparameters that results in the best performance. The process is typically computationally expensive and manual.

A sweep component lets you automate hyperparameter tuning and run the trial component in parallel to optimize hyperparameters efficiently.

Let’s assume you already have a Command or Distributed component that trains a model. When you tune hyperparameters manually, you run this component several times with different hyperparameter combinations. Each of these sub-runs is called a trial run, and the component is referred to as a trial component. After all the trial runs have finished, you select the best result by comparing the metrics of the trial runs.
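The manual process described above can be sketched in plain Python. This is only an illustration: train_and_evaluate is a hypothetical stand-in for one run of the trial component, and its toy scoring formula is invented so that the sketch runs on its own.

```python
import itertools


def train_and_evaluate(learning_rate: float, subsample: float) -> float:
    """Hypothetical stand-in for one trial run; returns a toy 'accuracy'."""
    # Toy formula (an assumption for illustration only):
    # it peaks at learning_rate=0.05 and subsample=0.3.
    return 1.0 - abs(learning_rate - 0.05) - abs(subsample - 0.3)


# Hyperparameter combinations to try by hand.
learning_rates = [0.03, 0.05, 0.1]
subsamples = [0.2, 0.3]

# Each call is one trial run; keep its hyperparameters and its metric.
trials = [
    {"learning_rate": lr, "subsample": ss, "accuracy": train_and_evaluate(lr, ss)}
    for lr, ss in itertools.product(learning_rates, subsamples)
]

# After all trial runs have finished, compare the metrics and pick the best.
best_trial = max(trials, key=lambda t: t["accuracy"])
print(best_trial)
```

A sweep component automates exactly this loop: it chooses the combinations, runs the trials in parallel, and reports the best one.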

You can convert the trial component into a sweep component and automate this process with the steps below:

  1. Prepare the trial component:

    • Hyperparameters to explore are exposed as component input parameters.

    • The component script logs metrics on model performance.

    • The component script writes the sweep component outputs.

    • Inputs and outputs of the trial component are inherited by the parent sweep component.

  2. Define the sampling algorithm and search space

    • Specify the parameter sampling algorithm to use over the hyperparameter search space.

    • Mark parameters of the trial component as hyperparameters and define a search space for each one as a distribution.

  3. Specify the objective

    • Specify the primary metric that represents the model performance you want to optimize.

    • Specify the optimization goal to be maximized or minimized.

  4. Specify early termination policy

    • Specify a policy to automatically terminate poorly performing runs, which can improve compute efficiency.

  5. Specify resource limits

    • Control resource budget for trial runs.

After you define and create the sweep component, you can submit it in a dsl.pipeline like any other component type.

How to write sweep component yaml spec

Please refer to:

Example yaml:

$schema: https://componentsdk.azureedge.net/jsonschema/SweepComponent.json
# meta data of the sweep component
name: microsoft.com.azureml.samples.tune
version: 0.0.1
display_name: Tune
type: SweepComponent
description: A dummy hyperparameter tuning component
is_deterministic: false
tags: {category: Component Tutorial, contact: amldesigner@microsoft.com}

# STEP 1. reference an existing trial component yaml, which:
# - declares the inputs 'learning_rate' and 'subsample' which will be used as hyperparameters
# - logs the primary metrics and output model file.
trial: file:train.yaml

# STEP 2: define sampling algorithm and search_space
algorithm: random

# The search_space structure is defined in yaml and cannot be reset at runtime.
# Each hyperparameter in the search space must correspond to an input parameter.
# In the UI, the user sets a distribution instead of a fixed value for that input parameter.
search_space:
  learning_rate:
    # Defines the default search space for the parameter learning_rate,
    # which should be a subset of the parameter's original range in the trial component.
    type: uniform
    min_value: 0.03
    max_value: 0.1
  subsample:
    type: choice
    values: [0.2, 0.3]

# STEP 3: specify objective of the sweep component
objective: 
  # default primary_metric & goal, user can override in runsetting
  primary_metric: 
    default: accuracy
    # This is the list of available primary_metric values.
    # User code must have logged these metrics to run history.
    enum: [accuracy, precision]
  goal: maximize
  
# STEP 4: specify early_termination policy
early_termination:
  policy_type: median_stopping
  evaluation_interval: 1
  delay_evaluation: 5

# STEP 5: specify resource limit
limits:
  max_total_trials: 20

# NOTE: early_termination & limits can be omitted from the component yaml; the user must then specify them in runsettings at submission.
# If specified in yaml, these values are treated as defaults.
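The rule stated in the comments above, that each hyperparameter in search_space must correspond to an input parameter of the trial component, can be checked before submission. The following is a minimal sketch that assumes both specs are already loaded as Python dictionaries. The function name and the extra input names training_data and max_epochs are hypothetical and not part of the SDK.

```python
def unknown_hyperparameters(search_space: dict, trial_inputs: dict) -> list:
    """Return search_space keys that are not inputs of the trial component."""
    return sorted(name for name in search_space if name not in trial_inputs)


# Hypothetical trial component inputs (only the names matter here).
trial_inputs = {
    "training_data": {},
    "max_epochs": {},
    "learning_rate": {},
    "subsample": {},
}

# Same search space as in the example yaml.
search_space = {
    "learning_rate": {"type": "uniform", "min_value": 0.03, "max_value": 0.1},
    "subsample": {"type": "choice", "values": [0.2, 0.3]},
}

missing = unknown_hyperparameters(search_space, trial_inputs)
print(missing)
```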

Note: The is_deterministic field of a sweep component is set to false by default. To reuse a previous run’s outputs when running a sweep component, set is_deterministic to true in the component yaml.

See more example sweep component yaml files in the GitHub samples repo.

Follow the how-to-access instructions if you get a 404 error when accessing the samples.

How to consume a sweep component

Set inputs & parameters

For inputs and parameters, load a component function and call it with the same logic as for other components.

For parameters that are marked as hyperparameters in the search space, pass a dictionary that follows the same schema as a hyperparameter expression.

component = sweep_component_func(
    # specify normal input ports & parameters
    training_data=input_dataset,
    max_epochs=2,

    # specify hyperparameters
    learning_rate={
        "type": "uniform",
        "min_value": 0.04,
        "max_value": 0.09,
    },
)

Set runsettings

Here is an example to set runsettings in SDK:

component.runsettings.target = "amlcompute"
component.runsettings.sweep.algorithm = "random"
component.runsettings.sweep.objective.configure(primary_metric = "accuracy", goal = "maximize")
component.runsettings.sweep.early_termination.configure(
    policy_type= "median_stopping",
    evaluation_interval= 1,
    delay_evaluation= 5
)
component.runsettings.sweep.limits.configure(max_total_trials = 20)

For more documentation and examples of these concepts, see Algorithm, Objective, Early Termination, and Limits.

Sample notebook

Sweep component outputs

Outputs of a sweep component are defined in the trial component’s yaml, and they also become outputs of the parent run.

outputs:
  saved_model:
    type: path
    description: path of the saved_model of trial run
  training_stats:
    type: path
    description: writes some stats file of the trial component.

However, these outputs behave differently at runtime:

  • In each trial run, an output has the same runtime behavior (mount or upload) as a normal command component output.

  • In the sweep parent run, the value of an output, e.g. training_stats in the example above, is the best trial run’s result.

  • The default output path of a trial run on the datastore is azureml/{run-id}/{output-name}. Here, run-id is the id of the trial run within the sweep run, not the id of the sweep run itself. The trial run id has the format HD_uuid_{trial-number}, where trial-number counts from 0. For the example above, the first trial’s output path will be: azureml/HD_27826510-6552-401a-8b01-c7954bb8fdd3_0/training_stats.

  • If you specify path_on_datastore and want to keep the outputs of every trial run, you must include {run-id} in the path to avoid output path conflicts.
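The path convention described above can be illustrated with a short sketch. The two helper functions are hypothetical (they are not SDK functions) and only reproduce the documented formats HD_uuid_{trial-number} and azureml/{run-id}/{output-name}.

```python
def trial_run_id(sweep_uuid: str, trial_number: int) -> str:
    """Build a trial run id in the HD_uuid_{trial-number} format (numbering starts at 0)."""
    return f"HD_{sweep_uuid}_{trial_number}"


def default_output_path(run_id: str, output_name: str) -> str:
    """Default datastore path of a trial output: azureml/{run-id}/{output-name}."""
    return f"azureml/{run_id}/{output_name}"


# uuid taken from the example in the text
sweep_uuid = "27826510-6552-401a-8b01-c7954bb8fdd3"

# Paths of the training_stats output for the first three trial runs.
paths = [
    default_output_path(trial_run_id(sweep_uuid, n), "training_stats")
    for n in range(3)
]
for path in paths:
    print(path)
```

Because the run id differs for every trial, each trial writes to its own path, which is why a custom path_on_datastore needs {run-id} to stay conflict-free.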

Reference

Sweep

This section describes the sweep component spec.

Name | Type | Required | Description
trial | String | Yes | Reference to an existing command or distributed component. Supports a yaml file or a registered component. Example: file:train.yaml or azureml:registered_component_name:version.
algorithm | String | Yes | The parameter sampling method to use over the hyperparameter search space. Possible values are: random, grid, bayesian.
search_space | Dictionary<String, Hyperparameter Expression> | Yes | The range of values to search for each hyperparameter.
objective | Objective | Yes | Defines the primary metric and goal.
early_termination | EarlyTerminationPolicy | No | Automatically ends poorly performing runs with an early termination policy. Early termination improves computational efficiency.
limits | Limits | Yes | Controls your resource budget by specifying resource limits such as the maximum number of training runs.

Trial

Reference an existing trial component yaml.

  • It is a string that must begin with file: or azureml:.

    • Use file:path_to_yaml_file to specify the local yaml file path of the referenced component.

    • Use azureml:component_name:version to refer to a registered component by name and version. The version is optional; use azureml:component_name to refer to the component's default version.

  • The trial component should be a Command or Distributed component.
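The two reference forms above can be told apart with a few lines of string handling. This is a minimal sketch, not the SDK's own parser: parse_trial_reference and the dictionary layout it returns are hypothetical.

```python
def parse_trial_reference(trial: str) -> dict:
    """Split a trial reference into its parts; raise ValueError on an unknown prefix."""
    if trial.startswith("file:"):
        return {"kind": "file", "path": trial[len("file:"):]}
    if trial.startswith("azureml:"):
        name, _, version = trial[len("azureml:"):].partition(":")
        # An empty version means "use the component's default version".
        return {"kind": "registered", "name": name, "version": version or None}
    raise ValueError(f"trial must begin with 'file:' or 'azureml:', got {trial!r}")


print(parse_trial_reference("file:train.yaml"))
print(parse_trial_reference("azureml:registered_component_name:0.0.1"))
print(parse_trial_reference("azureml:registered_component_name"))
```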

Hyperparameter Expression

Hyperparameters can be discrete or continuous, and each has a distribution of values described by a parameter expression.

Discrete hyperparameters

Discrete hyperparameters are specified as a choice among discrete values or advanced distributions.

Name | Description
choice(list) | A choice can be one or more comma-separated values, a range object, or any arbitrary list object.
quniform(min_value, max_value, q) | Returns a value like round(uniform(min_value, max_value) / q) * q
qloguniform(min_value, max_value, q) | Returns a value like round(exp(uniform(min_value, max_value)) / q) * q
qnormal(mu, sigma, q) | Returns a value like round(normal(mu, sigma) / q) * q
qlognormal(mu, sigma, q) | Returns a value like round(exp(normal(mu, sigma)) / q) * q
randint(upper) | Returns a random integer in the range [0, upper)
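The formulas in the table can be reproduced with the Python standard library to see what each expression samples. This is an illustrative sketch, not the sampler used by the tuning service. The function names mirror the expression names, and the explicit rng argument is an assumption added to make the sketch reproducible.

```python
import math
import random


def quniform(rng, min_value, max_value, q):
    """round(uniform(min_value, max_value) / q) * q"""
    return round(rng.uniform(min_value, max_value) / q) * q


def qloguniform(rng, min_value, max_value, q):
    """round(exp(uniform(min_value, max_value)) / q) * q"""
    return round(math.exp(rng.uniform(min_value, max_value)) / q) * q


def qnormal(rng, mu, sigma, q):
    """round(normal(mu, sigma) / q) * q"""
    return round(rng.gauss(mu, sigma) / q) * q


def qlognormal(rng, mu, sigma, q):
    """round(exp(normal(mu, sigma)) / q) * q"""
    return round(math.exp(rng.gauss(mu, sigma)) / q) * q


def randint(rng, upper):
    """Random integer in the range [0, upper)."""
    return rng.randrange(upper)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Five batch sizes quantized to multiples of 16 between 16 and 128.
batch_sizes = [quniform(rng, 16, 128, 16) for _ in range(5)]
print(batch_sizes)
```

The quantization step q is what makes these distributions discrete: every sampled value is a multiple of q.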

Yaml example:

search_space:
  batch_size:
    type: choice
    values: [16, 32, 64, 128]
  qnormal_parameter:
    type: qnormal
    mu: 0.2
    sigma: 1
    q: 1

Note: We also support the v1 SDK parameter expression contract.

To use the azureml.train.hyperdrive package, please install azureml-train-core with the command pip install azureml-train-core.

SDK python example:

from azureml.train.hyperdrive import choice

component.set_inputs(
  batch_size = choice([16, 32, 64, 128]),
  number_of_hidden_layers = choice(range(1,5))
)

Continuous hyperparameters

Continuous hyperparameters are specified as a distribution over a continuous range of values:

Name | Description
uniform(min_value, max_value) | Returns a value uniformly distributed between min_value and max_value
loguniform(min_value, max_value) | Returns a value drawn according to exp(uniform(min_value, max_value)) so that the logarithm of the return value is uniformly distributed
normal(mu, sigma) | Returns a real value that's normally distributed with mean mu and standard deviation sigma
lognormal(mu, sigma) | Returns a value drawn according to exp(normal(mu, sigma)) so that the logarithm of the return value is normally distributed
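One consequence of the loguniform and lognormal definitions is that their arguments live in log space. The sketch below reproduces the two formulas with the standard library to make this visible. It is an illustration, not the tuning service's sampler, and the explicit rng argument is an assumption added for reproducibility.

```python
import math
import random


def loguniform(rng, min_value, max_value):
    """exp(uniform(min_value, max_value)): the log of the result is uniform."""
    return math.exp(rng.uniform(min_value, max_value))


def lognormal(rng, mu, sigma):
    """exp(normal(mu, sigma)): the log of the result is normal."""
    return math.exp(rng.gauss(mu, sigma))


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# To sample values between 1e-4 and 1e-1, pass the logarithms of the bounds.
samples = [loguniform(rng, math.log(1e-4), math.log(1e-1)) for _ in range(1000)]
print(min(samples), max(samples))
```

Roughly the same number of samples falls into each decade (1e-4 to 1e-3, 1e-3 to 1e-2, 1e-2 to 1e-1), which is usually what is wanted for scale-like hyperparameters.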

Yaml example:

search_space:
    learning_rate: 
        type: normal
        mu: 10
        sigma: 3
    keep_probability: 
        type: uniform
        min_value: 0.05
        max_value: 0.1

SDK python example:

from azureml.train.hyperdrive import normal, uniform

component.set_inputs(
  learning_rate = normal(10, 3),
  keep_probability = uniform(0.05, 0.1)
)

Conditional hyperparameters

Note

  • Conditional hyperparameters are currently available only for the random sampling algorithm.

  • Support for the grid and bayesian algorithms will come later.

A conditional parameter is a choice-type hyperparameter expression whose values are an array of objects. The properties of each object can themselves be hyperparameter expressions.

Example: How to use sweep component conditional hyperparameter

In component spec:

search_space:
  model:
    type: choice
    values:
      - model_name: model_x
        x0:
          type: choice
          values: [1, 2, 3]
        x1:
          type: uniform
          min_value: -1
          max_value: 1
      - model_name: model_y
        y0:
          type: choice
          values: [4, 5, 6]
        y1:
          type: uniform
          min_value: -2
          max_value: 2

In the example above, model is a conditional hyperparameter. It is passed to the trial run as an environment variable, for example:

  • AZUREML_SWEEP_model = {"model_name": "model_x", "x0": "1", "x1": "-0.04776077313177862"}
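Inside the trial script, the sampled value can be read back from the environment. The following is a minimal sketch that assumes the variable holds valid JSON, as the example suggests. The helper name is hypothetical, and the variable is simulated locally so that the sketch also runs outside a sweep.

```python
import json
import os


def read_conditional_hyperparameter(name: str) -> dict:
    """Parse the AZUREML_SWEEP_<name> environment variable as JSON."""
    return json.loads(os.environ[f"AZUREML_SWEEP_{name}"])


# Simulate the variable locally; inside a real trial run it would already be set.
os.environ.setdefault(
    "AZUREML_SWEEP_model",
    '{"model_name": "model_x", "x0": "1", "x1": "-0.04776077313177862"}',
)

model = read_conditional_hyperparameter("model")

# In the example the nested values arrive as strings, so convert them explicitly.
if model["model_name"] == "model_x":
    x0 = int(model["x0"])
    x1 = float(model["x1"])
    print(x0, x1)
```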

To override the conditional search space in the Python SDK:

from azureml.train.hyperdrive import choice, uniform

component = sweep_component_func(
    model = {
        "type": "choice",
        "values": [
            { 
                "model_name": "model_x",
                "x0": {
                    "type": "choice",
                    "values": [2, 3]
                },
                "x1": {
                    "type": "uniform",
                    "min_value": -1,
                    "max_value": 1
                }
            },
            { 
                "model_name": "model_y",
                "y0": {
                    "type": "choice",
                    "values": [4, 5]
                },
                "y1": {
                    "type": "uniform",
                    "min_value": -2,
                    "max_value": 2
                }
            }
        ]
    })

Alternatively, use the v1 SDK parameter expression contract:

from azureml.train.hyperdrive import choice, uniform

component = conditional_sweep_func(
    model=choice(
        [
            {
                "model_name": "model_x",
                "x0": choice([2, 3]),
                "x1": uniform(-1, 1)
            },
            {
                "model_name": "model_y",
                "y0": choice([4, 5]),
                "y1": uniform(-2, 2)
            }
        ]
    )
)

Algorithm

Specify the parameter sampling method to use over the hyperparameter space. Azure Machine Learning supports the following methods:

  • Random sampling

  • Grid sampling

  • Bayesian sampling

Random sampling

Random sampling supports discrete and continuous hyperparameters. It supports early termination of low-performance runs. Some users do an initial search with random sampling and then refine the search space to improve results.

In random sampling, hyperparameter values are randomly selected from the defined search space.

Grid sampling

Grid sampling supports discrete hyperparameters. Use grid sampling if you can budget to exhaustively search over the search space. It supports early termination of low-performance runs.

Grid sampling does a simple grid search over all possible values. Grid sampling can only be used with choice hyperparameters. For example, the following space has six samples:

num_hidden_layers = choice([1, 2, 3]),
batch_size = choice([16, 32])
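The count of six follows from taking every combination of the two choice lists (3 x 2). A small sketch with itertools.product enumerates them; it only illustrates the idea and is not how the service implements grid sampling.

```python
import itertools

# The choice values from the example above.
search_space = {
    "num_hidden_layers": [1, 2, 3],
    "batch_size": [16, 32],
}

# One dictionary per grid point: the Cartesian product of all choice lists.
names = list(search_space)
grid = [
    dict(zip(names, combination))
    for combination in itertools.product(*search_space.values())
]

for point in grid:
    print(point)
```

Because the grid size is the product of the list lengths, it grows quickly as hyperparameters are added, which is why grid sampling needs a budget large enough for an exhaustive search.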

Bayesian sampling

Bayesian sampling is based on the Bayesian optimization algorithm. It picks samples based on how previous samples did, so that new samples improve the primary metric.

Bayesian sampling is recommended if you have enough budget to explore the hyperparameter space. For best results, we recommend a maximum number of runs greater than or equal to 20 times the number of hyperparameters being tuned.

The number of concurrent runs has an impact on the effectiveness of the tuning process. A smaller number of concurrent runs may lead to better sampling convergence, since the smaller degree of parallelism increases the number of runs that benefit from previously completed runs.

Bayesian sampling only supports choice, uniform, and quniform distributions over the search space.

Objective

Specify the primary metric you want hyperparameter tuning to optimize. Each trial run is evaluated for the primary metric. The early termination policy uses the primary metric to identify low-performance runs.

Name | Type | Required | Description
primary_metric | String or Object | Yes | The primary metric that hyperparameter tuning optimizes.
goal | String | Yes | Whether the primary metric is maximized or minimized when evaluating the trials. Possible values are: maximize, minimize.

Yaml example:

objective:
  primary_metric:
    default: accuracy
    # This is the list of available primary_metric values.
    # User code must have logged these metrics using run log.
    enum: [accuracy, precision]
  goal: maximize

Example of overriding the runsetting in the SDK:

component.runsettings.sweep.objective.configure(primary_metric = "precision", goal = "maximize")

Log metrics for hyperparameter tuning

The training script for your model must log the primary metric during model training so that the sweep component can access it for hyperparameter tuning.

Log the primary metric in your training script with the following sample snippet:

from azureml.core.run import Run
run_logger = Run.get_context()
run_logger.log("accuracy", float(val_accuracy))

The training script calculates the val_accuracy and logs it as the primary metric “accuracy”. Each time the metric is logged, it’s received by the hyperparameter tuning service. It’s up to you to determine the frequency of reporting.

For more information on logging values in model training runs, see Enable logging in Azure ML training runs.

Early Termination

Automatically end poorly performing runs with an early termination policy. Early termination improves computational efficiency.

You can configure the following common parameters when a policy is applied:

Name | Type | Required | Description
evaluation_interval | Integer | No | The frequency of applying the policy.
delay_evaluation | Integer | No | Delays the first policy evaluation for a specified number of intervals.

  • evaluation_interval: Each time the training script logs the primary metric counts as one interval. An evaluation_interval of 1 will apply the policy every time the training script reports the primary metric. An evaluation_interval of 2 will apply the policy every other time. If not specified, evaluation_interval is set to 1 by default.

  • delay_evaluation: This is an optional parameter that avoids premature termination of training runs by allowing all configurations to run for a minimum number of intervals. If specified, the policy applies every multiple of evaluation_interval that is greater than or equal to delay_evaluation.
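The interaction of the two parameters can be made concrete by listing the intervals at which a policy would be applied. This is a sketch of the rule as stated above (every multiple of evaluation_interval that is greater than or equal to delay_evaluation). The function name and the total_intervals argument are hypothetical.

```python
def policy_intervals(total_intervals, evaluation_interval=1, delay_evaluation=0):
    """Intervals (1-based) at which the early termination policy is applied."""
    return [
        interval
        for interval in range(1, total_intervals + 1)
        if interval % evaluation_interval == 0 and interval >= delay_evaluation
    ]


# Settings used in the yaml examples: evaluate every interval, starting at interval 5.
print(policy_intervals(8, evaluation_interval=1, delay_evaluation=5))

# Evaluate every other interval, but not before interval 5.
print(policy_intervals(10, evaluation_interval=2, delay_evaluation=5))
```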

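Azure Machine Learning supports the following early termination policies:

  • Bandit policy

  • Median stopping policy

  • Truncation selection policy

  • No termination policy (default)

Note: Bayesian sampling does not support early termination. When using Bayesian sampling, set early_termination.policy_type = 'default'.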

Bandit policy

Bandit policy is based on slack factor/slack amount and evaluation interval. Bandit ends runs when the primary metric isn’t within the specified slack factor/slack amount of the most successful run.

This policy takes the following additional configuration parameters:

  • slack_factor or slack_amount: the slack allowed with respect to the best performing training run. slack_factor specifies the allowable slack as a ratio. slack_amount specifies the allowable slack as an absolute amount instead of a ratio. For example, consider a Bandit policy applied at interval 10. Assume that the best performing run at interval 10 reported a primary metric of 0.8 with a goal to maximize the primary metric. If the policy specifies a slack_factor of 0.2, any training run whose best metric at interval 10 is less than about 0.67 (0.8/(1+slack_factor)) will be terminated.
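The slack_factor arithmetic from the example can be checked with a few lines of Python. The sketch covers only the case described in the text (slack_factor with a goal of maximize). The function names are hypothetical, and the handling of slack_amount or of a minimize goal is deliberately left out because the text does not spell it out.

```python
def bandit_threshold(best_metric: float, slack_factor: float) -> float:
    """Lowest acceptable metric when maximizing: best_metric / (1 + slack_factor)."""
    return best_metric / (1 + slack_factor)


def should_terminate(run_metric: float, best_metric: float, slack_factor: float) -> bool:
    """True if the run's best metric falls below the allowed slack."""
    return run_metric < bandit_threshold(best_metric, slack_factor)


# Example from the text: best run reports 0.8, slack_factor is 0.2.
threshold = bandit_threshold(0.8, 0.2)
print(round(threshold, 4))

print(should_terminate(0.60, best_metric=0.8, slack_factor=0.2))
print(should_terminate(0.70, best_metric=0.8, slack_factor=0.2))
```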

Yaml example:

early_termination:
    policy_type: bandit
    slack_factor: 0.1
    evaluation_interval: 1
    delay_evaluation: 5

Example override in runsetting:

component.runsettings.sweep.early_termination.configure(
    policy_type= 'bandit',
    slack_factor= 0.1,
    evaluation_interval= 1,
    delay_evaluation= 5
)

Note: We also support the v1 SDK policy contract.

from azureml.train.hyperdrive.policy import BanditPolicy
component.runsettings.sweep.early_termination = BanditPolicy(slack_factor = 0.1, evaluation_interval=1, delay_evaluation=5)

In this example, the early termination policy is applied at every interval when metrics are reported, starting at evaluation interval 5. Any run whose best metric is less than 1/(1+0.1), or about 91%, of the best performing run's metric will be terminated.

Median stopping policy

Median stopping is an early termination policy based on running averages of primary metrics reported by the runs. This policy computes running averages across all training runs and stops runs whose primary metric value is worse than the median of the averages.

This policy takes no additional configuration parameters.

Yaml example:

early_termination:
    policy_type: median_stopping
    evaluation_interval: 1
    delay_evaluation: 5

Example override in runsetting:

component.runsettings.sweep.early_termination.configure(
    policy_type= 'median_stopping',
    evaluation_interval= 1,
    delay_evaluation= 5
)

Note: We also support the v1 SDK policy contract.

from azureml.train.hyperdrive.policy import MedianStoppingPolicy
early_termination_policy = MedianStoppingPolicy(evaluation_interval=1, delay_evaluation=5)

In this example, the early termination policy is applied at every interval starting at evaluation interval 5. A run is stopped at interval 5 if its best primary metric is worse than the median of the running averages over intervals 1:5 across all training runs.
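The stopping rule described above can be sketched as follows. It assumes a goal of maximize and in-memory lists of logged metrics. The function name and the sample data are hypothetical, and the real service may differ in details such as how it handles runs with missing intervals.

```python
from statistics import mean, median


def median_stopping(metrics_by_run: dict, interval: int) -> list:
    """Return the ids of runs to stop at `interval` when the goal is to maximize."""
    # Running average of each run's primary metric over intervals 1..interval.
    running_averages = {
        run: mean(values[:interval]) for run, values in metrics_by_run.items()
    }
    cutoff = median(running_averages.values())
    # A run is stopped if its best metric so far is worse than the median.
    return sorted(
        run
        for run, values in metrics_by_run.items()
        if max(values[:interval]) < cutoff
    )


# Hypothetical accuracy values logged by three trial runs over five intervals.
metrics_by_run = {
    "run_a": [0.50, 0.60, 0.70, 0.75, 0.80],
    "run_b": [0.40, 0.50, 0.55, 0.60, 0.62],
    "run_c": [0.20, 0.25, 0.30, 0.30, 0.32],
}

stopped = median_stopping(metrics_by_run, interval=5)
print(stopped)
```

In this data the running averages are about 0.67, 0.53 and 0.27, so the median is about 0.53, and only run_c has a best metric below it.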

Truncation selection policy

Truncation selection cancels a percentage of lowest performing runs at each evaluation interval. Runs are compared using the primary metric.

This policy takes the following additional configuration parameters:

  • truncation_percentage: the percentage of lowest performing runs to terminate at each evaluation interval. Must be an integer between 1 and 99.

Yaml example:

early_termination:
    policy_type: truncation_selection
    truncation_percentage: 20
    evaluation_interval: 1
    delay_evaluation: 5

Note: We also support the v1 SDK policy contract.

SDK example:

from azureml.train.hyperdrive.policy import TruncationSelectionPolicy
early_termination_policy = TruncationSelectionPolicy(truncation_percentage=20, evaluation_interval=1, delay_evaluation=5)

In this example, the early termination policy is applied at every interval starting at evaluation interval 5. A run terminates at interval 5 if its performance at interval 5 is in the lowest 20% of performance of all runs at interval 5.
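The selection step can be sketched as a sort followed by a cut. The sketch assumes a goal of maximize. The function name, the sample data, and the choice to round the number of cancelled runs down are assumptions of this illustration, not documented behavior.

```python
import math


def truncation_selection(metrics: dict, truncation_percentage: int) -> list:
    """Return the ids of the lowest-performing runs to cancel (goal: maximize)."""
    # Rounding down is an assumption of this sketch.
    n_cancel = math.floor(len(metrics) * truncation_percentage / 100)
    ranked = sorted(metrics, key=metrics.get)  # worst metric first
    return ranked[:n_cancel]


# Hypothetical primary metric of ten runs at interval 5: 0.1, 0.2, ..., 1.0.
metrics_at_interval_5 = {f"run_{i}": i / 10 for i in range(1, 11)}

cancelled = truncation_selection(metrics_at_interval_5, truncation_percentage=20)
print(cancelled)
```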

No termination policy (default)

If no policy is specified, the hyperparameter tuning service will let all training runs execute to completion.

component.runsettings.sweep.early_termination.policy_type = 'default'

Picking an early termination policy

  • For a conservative policy that provides savings without terminating promising jobs, consider a Median Stopping Policy with evaluation_interval 1 and delay_evaluation 5. These conservative settings can provide approximately 25%-35% savings with no loss on the primary metric (based on our evaluation data).

  • For more aggressive savings, use Bandit Policy with a smaller allowable slack or Truncation Selection Policy with a larger truncation percentage.

Limits

Control the resource budget for trial runs.

Name | Type | Required | Description
max_total_trials | Integer | Yes | Maximum number of trial runs. Must be an integer between 1 and 1000.
max_concurrent_trials | Integer | No | Maximum number of runs that can run concurrently. If not specified, all runs launch in parallel. If specified, must be an integer between 1 and 100.
timeout_minutes | Integer | No | Maximum duration, in minutes, of the hyperparameter tuning experiment. Runs still in progress after this duration are canceled.

Note: If both max_total_trials and timeout_minutes are specified, the hyperparameter tuning experiment terminates when the first of these two thresholds is reached. The number of concurrent trials is gated on the resources available in the specified compute target. Ensure that the compute target has the available resources for the desired concurrency.

component.runsettings.sweep.limits.configure(max_total_trials = 20, max_concurrent_trials=4)

This code configures the hyperparameter tuning experiment to use a maximum of 20 total runs, running 4 configurations at a time.
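The documented ranges for these limits can be checked before submission. The following sketch uses two hypothetical helper functions (they are not part of the SDK). The batch count is a rough estimate that assumes all trials take about the same time.

```python
import math


def validate_limits(max_total_trials, max_concurrent_trials=None):
    """Raise ValueError if a limit is outside its documented range."""
    if not 1 <= max_total_trials <= 1000:
        raise ValueError("max_total_trials must be between 1 and 1000")
    if max_concurrent_trials is not None and not 1 <= max_concurrent_trials <= 100:
        raise ValueError("max_concurrent_trials must be between 1 and 100")


def batches_needed(max_total_trials, max_concurrent_trials):
    """Rough number of sequential batches if all trials take equally long."""
    return math.ceil(max_total_trials / max_concurrent_trials)


validate_limits(max_total_trials=20, max_concurrent_trials=4)
print(batches_needed(20, 4))
```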

Appendix

Reference doc for: SDK 1.0