Sweep Component
A sweep component is a kind of component that lets you automate efficient hyperparameter tuning.
Note: You can sweep directly on a command component without defining a sweep component. Learn more:
See reference doc
See sample notebook
Overview
Hyperparameters are adjustable parameters that let you control the model training process. For example, with neural networks, you decide the number of hidden layers and the number of nodes in each layer. Model performance depends heavily on hyperparameters.
Hyperparameter tuning, also called hyperparameter optimization, is the process of finding the configuration of hyperparameters that results in the best performance. The process is typically computationally expensive and manual.
A sweep component lets you automate hyperparameter tuning and run the component in parallel to optimize hyperparameters efficiently.
Let’s assume you already have a Command or Distributed component that trains a model. When tuning hyperparameters manually, you run this component several times with different hyperparameter combinations. Each of these sub-runs is called a trial run, and the component is referred to as the trial component. After all the trial runs have finished, you select the best result by comparing the metrics of the trial runs.
You can convert the trial component into a sweep component and automate this process with the steps below:
Prepare the trial component:
The hyperparameters to explore are exposed as component input parameters.
The component script logs metrics on model performance.
The component script writes the sweep component outputs.
The inputs and outputs of the trial component are inherited by the parent sweep component.
Define the sampling algorithm and search space
Specify the parameter sampling algorithm to use over the hyperparameter search space.
Mark parameters of the trial component as hyperparameters and define the search space of each with a distribution.
You can choose among the predefined distributions described in Hyperparameter Expression.
Specify the objective
Specify the primary metric, representing the model performance, that you want to optimize.
Specify whether the optimization goal is to maximize or minimize that metric.
Specify early termination policy
Specify a policy to automatically terminate poorly performing runs, which can improve compute efficiency.
Specify resource limits
Control resource budget for trial runs.
After you define and create the sweep component, you can submit the component in a dsl.pipeline like other component types.
How to write sweep component yaml spec
Please refer to:
Example yaml:
$schema: https://componentsdk.azureedge.net/jsonschema/SweepComponent.json
# meta data of the sweep component
name: microsoft.com.azureml.samples.tune
version: 0.0.1
display_name: Tune
type: SweepComponent
description: A dummy hyperparameter tuning component
is_deterministic: false
tags: {category: Component Tutorial, contact: amldesigner@microsoft.com}
# STEP 1. reference an existing trial component yaml, which:
# - declares the inputs 'learning_rate' and 'subsample' which will be used as hyperparameters
# - logs the primary metrics and output model file.
trial: file:train.yaml
# STEP 2: define sampling algorithm and search_space
algorithm: random
# the search_space structure is defined in yaml and cannot be reset at runtime.
# each hyperparameter in the search space must correspond to an input parameter.
# in the UI, the user sets a distribution instead of a fixed value for that input parameter.
search_space:
  learning_rate:
    # here defines the default search space for parameter learning_rate,
    # which should be a subset of the original range of the parameter in trial component.
    type: uniform
    min_value: 0.03
    max_value: 0.1
  subsample:
    type: choice
    values: [0.2, 0.3]
# STEP 3: specify objective of the sweep component
objective:
  # default primary_metric & goal, user can override in runsetting
  primary_metric:
    default: accuracy
    # this is a list of available primary_metric objective.
    # user code must have logged these metric to run history
    enum: [accuracy, precision]
  goal: maximize
# STEP 4: specify early_termination policy
early_termination:
  policy_type: median_stopping
  evaluation_interval: 1
  delay_evaluation: 5
# STEP 5: specify resource limit
limits:
  max_total_trials: 20
# NOTE: early_termination & limits can be skipped in component yaml, then user needs to specify in runsetting during submission.
# if specified in yaml, these values will be treated as default value.
Note: The is_deterministic field of a sweep component is set to false by default.
If you want to reuse a previous run’s outputs when running a sweep component, set is_deterministic to true in the component yaml.
See more example sweep component yaml files in the github samples repo.
Follow the how to access instructions if you get a 404 error when accessing the samples.
How to consume a sweep component
Set inputs & parameters
For inputs and parameters, load the component function and call it with the same logic as for other components.
For parameters that are marked as hyperparameters in the search space, pass a dictionary that has the same schema as a hyperparameter expression.
component = sweep_component_func(
    # specify normal input port & parameters
    training_data=input_dataset,
    max_epochs=2,
    # specify hyperparameters
    learning_rate={
        "type": "uniform",
        "min_value": 0.04,
        "max_value": 0.09
    },
)
Set runsettings
Here is an example to set runsettings in SDK:
component.runsettings.target = "amlcompute"
component.runsettings.sweep.algorithm = "random"
component.runsettings.sweep.objective.configure(primary_metric = "accuracy", goal = "maximize")
component.runsettings.sweep.early_termination.configure(
    policy_type="median_stopping",
    evaluation_interval=1,
    delay_evaluation=5
)
component.runsettings.sweep.limits.configure(max_total_trials = 20)
See more doc and examples on these concepts: Algorithm, Objective, Early Termination, Limits.
Sample notebook
How to use sweep component for hyperparameter tuning - Demonstrates how to use sweep component for hyperparameter tuning.
How to use sweep component conditional hyperparameter - Demonstrates how to use sweep component with conditional hyperparameter.
Sweep component outputs
The outputs of a sweep component are defined in the trial component’s yaml, and they are also the outputs of the parent run.
outputs:
  saved_model:
    type: path
    description: path of the saved_model of trial run
  training_stats:
    type: path
    description: writes some stats file of the trial component.
However, these outputs behave differently at runtime:
In each trial run, an output has the same runtime behavior (mount or upload) as a normal command component output.
In the sweep parent run, the result of an output, e.g. training_stats in the example above, is the best trial run’s result.
The default output path of a trial run on the datastore is azureml/{run-id}/{output-name}. Here, run-id is the id of the trial run of a sweep run, not the id of the sweep run itself. The trial run id has the format HD_uuid_{trial-number}, and trial-number counts from 0. For the example above, the first trial’s output path will be azureml/HD_27826510-6552-401a-8b01-c7954bb8fdd3_0/training_stats.
If you want to specify path_on_datastore and keep the outputs of every trial run, you must use {run-id} in the path_on_datastore path to avoid output path conflicts.
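As a rough illustration of the default path layout described above, the following sketch builds the datastore path of one trial output. The helper name default_trial_output_path is hypothetical and not part of the SDK; it only mirrors the azureml/{run-id}/{output-name} pattern and the HD_uuid_{trial-number} run id format.

```python
def default_trial_output_path(sweep_uuid: str, trial_number: int, output_name: str) -> str:
    """Build the default datastore path of one trial output (illustrative helper, not an SDK API)."""
    if trial_number < 0:
        raise ValueError("trial_number counts from 0")
    # The trial run id, not the sweep run id, is used in the path.
    run_id = f"HD_{sweep_uuid}_{trial_number}"
    return f"azureml/{run_id}/{output_name}"


first_trial_path = default_trial_output_path(
    "27826510-6552-401a-8b01-c7954bb8fdd3", 0, "training_stats"
)
second_trial_path = default_trial_output_path(
    "27826510-6552-401a-8b01-c7954bb8fdd3", 1, "training_stats"
)
```

Because the trial number is part of the run id, each trial gets its own directory, which is why a custom path_on_datastore needs {run-id} to keep all trial outputs apart.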
Reference
Sweep
This section describes the sweep component spec.
| Name | Type | Required | Description |
|---|---|---|---|
| trial | String | Yes | Reference to an existing command or distributed component. Supports a yaml file or a registered component. Example: file:train.yaml or azureml:registered_component_name:version. |
| algorithm | String | Yes | Specify the parameter sampling method to use over the hyperparameter search space. Possible values are: random, grid, bayesian. |
| search_space | Dictionary | Yes | The range of values to search for each hyperparameter. |
| objective | Objective | Yes | Defines primary metrics and goal. |
| early_termination | EarlyTerminationPolicy | No | Automatically end poorly performing runs with an early termination policy. Early termination improves computational efficiency. |
| limits | Limits | Yes | Control your resource budget by specifying resource limit like the maximum number of training runs. |
Trial
Reference an existing trial component yaml.
It is a string which must begin with file: or azureml:.
Use file:path_to_yaml_file to specify the local yaml file path of the referenced component.
Use azureml:component_name:version to refer to a registered component by name and version. version is optional; you can use azureml:component_name to refer to the default version of a component.
The trial component should be a Command or Distributed component.
Hyperparameter Expression
Hyperparameters can be discrete or continuous, and each has a distribution of values described by a parameter expression.
Discrete hyperparameters
Discrete hyperparameters are specified as a choice among discrete values or advanced distributions.
| Name | Description |
|---|---|
| choice(list) | Choice parameter can be: one or more comma-separated values, a range object, any arbitrary list object. |
| quniform(min_value, max_value, q) | Returns a value like round(uniform(min_value, max_value) / q) * q |
| qloguniform(min_value, max_value, q) | Returns a value like round(exp(uniform(min_value, max_value)) / q) * q |
| qnormal(mu, sigma, q) | Returns a value like round(normal(mu, sigma) / q) * q |
| qlognormal(mu, sigma, q) | Returns a value like round(exp(normal(mu, sigma)) / q) * q |
| randint(upper) | Specify a set of random integers in the range [0, upper) |
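To make the formulas in the table concrete, here is a minimal sketch of the quantized distributions written with Python's standard random module. It is not the service's implementation; the function bodies simply transcribe the "Returns a value like ..." expressions, and the optional rng argument is an addition for reproducibility.

```python
import math
import random


def quniform(min_value, max_value, q, rng=random):
    # round(uniform(min_value, max_value) / q) * q
    return round(rng.uniform(min_value, max_value) / q) * q


def qloguniform(min_value, max_value, q, rng=random):
    # round(exp(uniform(min_value, max_value)) / q) * q
    return round(math.exp(rng.uniform(min_value, max_value)) / q) * q


def qnormal(mu, sigma, q, rng=random):
    # round(normal(mu, sigma) / q) * q
    return round(rng.gauss(mu, sigma) / q) * q


def qlognormal(mu, sigma, q, rng=random):
    # round(exp(normal(mu, sigma)) / q) * q
    return round(math.exp(rng.gauss(mu, sigma)) / q) * q


def randint(upper, rng=random):
    # a random integer in the range [0, upper)
    return rng.randrange(upper)


rng = random.Random(0)  # seeded generator so the draws are repeatable
batch_size = quniform(16, 128, 16, rng)
layer_index = randint(5, rng)
```

With an integer q, every result is a multiple of q, which is what makes these distributions discrete even though they are built from continuous ones.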
Yaml example:
search_space:
  batch_size:
    type: choice
    values: [16, 32, 64, 128]
  qnormal_parameter:
    type: qnormal
    mu: 0.2
    sigma: -1
    q: 1
Note: We also support the v1 sdk parameter expression contract.
To use the azureml.train.hyperdrive package, please install azureml-train-core with the command pip install azureml-train-core.
SDK python example:
from azureml.train.hyperdrive import choice
component.set_inputs(
    batch_size=choice([16, 32, 64, 128]),
    number_of_hidden_layers=choice(range(1, 5))
)
Continuous hyperparameters
Continuous hyperparameters are specified as a distribution over a continuous range of values:
| Name | Description |
|---|---|
| uniform(min_value, max_value) | Returns a value uniformly distributed between min_value and max_value |
| loguniform(min_value, max_value) | Returns a value drawn according to exp(uniform(min_value, max_value)) so that the logarithm of the return value is uniformly distributed |
| normal(mu, sigma) | Returns a real value that's normally distributed with mean mu and standard deviation sigma |
| lognormal(mu, sigma) | Returns a value drawn according to exp(normal(mu, sigma)) so that the logarithm of the return value is normally distributed |
Yaml example:
search_space:
  learning_rate:
    type: normal
    mu: 10
    sigma: 3
  keep_probability:
    type: uniform
    min_value: 0.05
    max_value: 0.1
SDK python example:
from azureml.train.hyperdrive import normal, uniform
component.set_inputs(
    learning_rate=normal(10, 3),
    keep_probability=uniform(0.05, 0.1)
)
Conditional hyperparameters
Note
Currently available for the random sampling algorithm; grid & bayesian algorithm support will come later.
A conditional parameter is a choice type hyperparameter expression whose values are an array of objects. Each property of an object can itself be a hyperparameter expression.
Example: How to use sweep component conditional hyperparameter
In component spec:
search_space:
  model:
    type: choice
    values:
      - model_name: model_x
        x0:
          type: choice
          values: [1, 2, 3]
        x1:
          type: uniform
          min_value: -1
          max_value: 1
      - model_name: model_y
        y0:
          type: choice
          values: [4, 5, 6]
        y1:
          type: uniform
          min_value: -2
          max_value: 2
In the above example, model is a conditional hyperparameter. It is passed to the trial run as an environment variable, e.g.:
AZUREML_SWEEP_model = {"model_name": "model_x", "x0": "1", "x1": "-0.04776077313177862"}
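Inside the trial script, the conditional hyperparameter has to be read back from that environment variable. The sketch below assumes the value is valid JSON written with plain double quotes and that nested values arrive as strings, as the example suggests; the helper name read_conditional_hyperparameter is hypothetical, and the script sets the variable itself only to simulate a trial run.

```python
import json
import os


def read_conditional_hyperparameter(name: str) -> dict:
    """Parse AZUREML_SWEEP_<name> as JSON (assumed format, illustrative helper)."""
    raw = os.environ[f"AZUREML_SWEEP_{name}"]
    return json.loads(raw)


# Simulate what a trial run might see; values are copied from the example.
os.environ["AZUREML_SWEEP_model"] = (
    '{"model_name": "model_x", "x0": "1", "x1": "-0.04776077313177862"}'
)

model = read_conditional_hyperparameter("model")
if model["model_name"] == "model_x":
    # Nested values are strings in the example, so convert them explicitly.
    x0 = int(model["x0"])
    x1 = float(model["x1"])
```

Branching on model_name first keeps the script from looking up keys, such as y0 and y1, that only exist for the other branch of the search space.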
If we want to override the conditional search space in python sdk:
from azureml.train.hyperdrive import choice, uniform
component = sweep_component_func(
    model={
        "type": "choice",
        "values": [
            {
                "model_name": "model_x",
                "x0": {
                    "type": "choice",
                    "values": [2, 3]
                },
                "x1": {
                    "type": "uniform",
                    "min_value": -1,
                    "max_value": 1
                }
            },
            {
                "model_name": "model_y",
                "y0": {
                    "type": "choice",
                    "values": [4, 5]
                },
                "y1": {
                    "type": "uniform",
                    "min_value": -2,
                    "max_value": 2
                }
            }
        ]
    }
)
Or we can use v1 sdk parameter expression contract.
from azureml.train.hyperdrive import choice, uniform
component = conditional_sweep_func(
    model=choice(
        [
            {
                "model_name": "model_x",
                "x0": choice([2, 3]),
                "x1": uniform(-1, 1)
            },
            {
                "model_name": "model_y",
                "y0": choice([4, 5]),
                "y1": uniform(-2, 2)
            }
        ]
    )
)
Algorithm
Specify the parameter sampling method to use over the hyperparameter space. Azure Machine Learning supports the following methods:
Random sampling
Grid sampling
Bayesian sampling
Random sampling
Random sampling supports discrete and continuous hyperparameters. It supports early termination of low-performance runs. Some users do an initial search with random sampling and then refine the search space to improve results.
In random sampling, hyperparameter values are randomly selected from the defined search space.
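As an illustration of what random sampling does conceptually, the sketch below draws trial configurations from a search space written in the dictionary schema used for hyperparameter expressions. It is a simplified stand-in, not the service's sampler: it handles only the uniform and choice types, and sample_search_space is a hypothetical helper.

```python
import random


def sample_search_space(search_space: dict, rng: random.Random) -> dict:
    """Draw one value per hyperparameter (sketch; uniform and choice only)."""
    sample = {}
    for name, expr in search_space.items():
        if expr["type"] == "uniform":
            sample[name] = rng.uniform(expr["min_value"], expr["max_value"])
        elif expr["type"] == "choice":
            sample[name] = rng.choice(expr["values"])
        else:
            raise ValueError(f"unsupported distribution in this sketch: {expr['type']}")
    return sample


search_space = {
    "learning_rate": {"type": "uniform", "min_value": 0.03, "max_value": 0.1},
    "subsample": {"type": "choice", "values": [0.2, 0.3]},
}

rng = random.Random(0)  # seeded so the draws are repeatable
trials = [sample_search_space(search_space, rng) for _ in range(20)]
```

Each entry of trials corresponds to the inputs of one trial run; 20 matches the max_total_trials value used in the examples.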
Grid sampling
Grid sampling supports discrete hyperparameters. Use grid sampling if you can budget to exhaustively search over the search space. Supports early termination of low-performance runs.
Grid sampling does a simple grid search over all possible values. Grid sampling can only be used with choice hyperparameters. For example, the following space has six samples:
num_hidden_layers = choice([1, 2, 3]),
batch_size = choice([16, 32])
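The six samples can be enumerated with a Cartesian product. This is a small sketch of the idea using plain Python lists in place of choice expressions, not the service's implementation.

```python
from itertools import product

# Plain lists stand in for the choice(...) expressions.
search_space = {
    "num_hidden_layers": [1, 2, 3],
    "batch_size": [16, 32],
}

names = list(search_space)
grid = [
    dict(zip(names, combination))
    for combination in product(*(search_space[name] for name in names))
]
```

The grid size is the product of the number of values per hyperparameter (3 x 2 here), so it grows quickly as hyperparameters are added.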
Bayesian sampling
Bayesian sampling is based on the Bayesian optimization algorithm. It picks samples based on how previous samples did, so that new samples improve the primary metric.
Bayesian sampling is recommended if you have enough budget to explore the hyperparameter space. For best results, we recommend a maximum number of runs greater than or equal to 20 times the number of hyperparameters being tuned.
The number of concurrent runs has an impact on the effectiveness of the tuning process. A smaller number of concurrent runs may lead to better sampling convergence, since the smaller degree of parallelism increases the number of runs that benefit from previously completed runs.
Bayesian sampling only supports choice, uniform, and quniform distributions over the search space.
Objective
Specify the primary metric you want hyperparameter tuning to optimize. Each trial run is evaluated for the primary metric. The early termination policy uses the primary metric to identify low-performance runs.
| Name | Type | Required | Description |
|---|---|---|---|
| primary_metric | String or Object | Yes | The primary metric that hyperparameter tuning optimizes. |
| goal | String | Yes | Whether the primary metric is maximized or minimized when evaluating the trials. Possible values are: maximize, minimize. |
Yaml example:
primary_metric:
  default: accuracy
  # this is a list of available primary_metric objective.
  # user code must have logged these metric using run log
  enum: [accuracy, precision]
goal: maximize
Example to override runsetting in sdk:
component.runsettings.sweep.objective.configure(primary_metric = "precision", goal = "maximize")
Log metrics for hyperparameter tuning
The training script for your model must log the primary metric during model training so that the sweep component can access it for hyperparameter tuning.
Log the primary metric in your training script with the following sample snippet:
from azureml.core.run import Run
run_logger = Run.get_context()
run_logger.log("accuracy", float(val_accuracy))
The training script calculates the val_accuracy and logs it as the primary metric “accuracy”. Each time the metric is logged, it’s received by the hyperparameter tuning service. It’s up to you to determine the frequency of reporting.
For more information on logging values in model training runs, see Enable logging in Azure ML training runs.
Early Termination
Automatically end poorly performing runs with an early termination policy. Early termination improves computational efficiency.
You can configure the following common parameters when a policy is applied:
| Name | Type | Required | Description |
|---|---|---|---|
| evaluation_interval | Integer | No | The frequency of applying the policy. |
| delay_evaluation | Integer | No | Delays the first policy evaluation for a specified number of intervals. |
evaluation_interval: Each time the training script logs the primary metric counts as one interval. An evaluation_interval of 1 will apply the policy every time the training script reports the primary metric. An evaluation_interval of 2 will apply the policy every other time. If not specified, evaluation_interval is set to 1 by default.
delay_evaluation: This is an optional parameter that avoids premature termination of training runs by allowing all configurations to run for a minimum number of intervals. If specified, the policy applies every multiple of evaluation_interval that is greater than or equal to delay_evaluation.
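The interaction of the two parameters can be sketched as a small predicate. This is our reading of the rule stated for delay_evaluation, not service code; policy_applies is a hypothetical helper, and treating an unspecified delay_evaluation as 0 is an assumption.

```python
def policy_applies(interval: int, evaluation_interval: int = 1, delay_evaluation: int = 0) -> bool:
    """Return True if the policy is evaluated at this interval (sketch).

    The policy applies at every multiple of evaluation_interval that is
    greater than or equal to delay_evaluation.
    """
    return interval % evaluation_interval == 0 and interval >= delay_evaluation


# Intervals 1..10 with evaluation_interval=2 and delay_evaluation=5.
applied = [i for i in range(1, 11) if policy_applies(i, evaluation_interval=2, delay_evaluation=5)]

# The settings used in the yaml examples: evaluation_interval=1, delay_evaluation=5.
applied_default = [i for i in range(1, 11) if policy_applies(i, 1, 5)]
```

With evaluation_interval 2 and delay_evaluation 5, the first evaluation happens at interval 6, because 5 itself is not a multiple of 2.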
Azure Machine Learning supports the following early termination policies:
NOTE Bayesian sampling does not support early termination. When using Bayesian sampling, set early_termination.policy_type = 'default'.
Bandit policy
Bandit policy is based on slack factor/slack amount and evaluation interval. Bandit ends runs when the primary metric isn’t within the specified slack factor/slack amount of the most successful run.
This policy takes the following additional configuration parameters:
slack_factor or slack_amount: the slack allowed with respect to the best performing training run. slack_factor specifies the allowable slack as a ratio. slack_amount specifies the allowable slack as an absolute amount, instead of a ratio.
For example, consider a Bandit policy applied at interval 10. Assume that the best performing run at interval 10 reported a primary metric of 0.8 with a goal to maximize the primary metric. If the policy specifies a slack_factor of 0.2, any training run whose best metric at interval 10 is less than 0.66 (0.8/(1+slack_factor)) will be terminated.
Yaml example:
early_termination:
  policy_type: bandit
  slack_factor: 0.1
  evaluation_interval: 1
  delay_evaluation: 5
Example override in runsetting:
component.runsettings.sweep.early_termination.configure(
    policy_type='bandit',
    slack_factor=0.1,
    evaluation_interval=1,
    delay_evaluation=5
)
Note We also support v1 sdk policy contract.
from azureml.train.hyperdrive.policy import BanditPolicy
component.runsettings.sweep.early_termination = BanditPolicy(slack_factor = 0.1, evaluation_interval=1, delay_evaluation=5)
In this example, the early termination policy is applied at every interval when metrics are reported, starting at evaluation interval 5. Any run whose best metric is less than 1/(1+0.1), or about 91%, of the best performing run's metric will be terminated.
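The Bandit cutoff is simple arithmetic, sketched below for a goal of maximize only. The slack_factor formula best/(1+slack_factor) comes from the text; the slack_amount formula (best minus slack_amount) is our assumption about what "an absolute amount" means, and bandit_threshold is a hypothetical helper rather than an SDK function.

```python
from typing import Optional


def bandit_threshold(
    best_metric: float,
    slack_factor: Optional[float] = None,
    slack_amount: Optional[float] = None,
) -> float:
    """Metric value below which a run is terminated (sketch, goal = maximize)."""
    if (slack_factor is None) == (slack_amount is None):
        raise ValueError("specify exactly one of slack_factor or slack_amount")
    if slack_factor is not None:
        # Ratio-based slack, as in the text: best / (1 + slack_factor).
        return best_metric / (1 + slack_factor)
    # Absolute slack: assumed to be best - slack_amount.
    return best_metric - slack_amount


def should_terminate(run_best_metric: float, threshold: float) -> bool:
    return run_best_metric < threshold


threshold_02 = bandit_threshold(0.8, slack_factor=0.2)  # 0.8 / 1.2, roughly 0.667
ratio_01 = bandit_threshold(1.0, slack_factor=0.1)      # 1 / 1.1, roughly 0.909
```

A smaller slack_factor moves the threshold closer to the best run and therefore terminates more runs.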
Median stopping policy
Median stopping is an early termination policy based on running averages of primary metrics reported by the runs. This policy computes running averages across all training runs and stops runs whose primary metric value is worse than the median of the averages.
This policy takes no additional configuration parameters.
Yaml example:
early_termination:
  policy_type: median_stopping
  evaluation_interval: 1
  delay_evaluation: 5
Example override in runsetting:
component.runsettings.sweep.early_termination.configure(
    policy_type='median_stopping',
    evaluation_interval=1,
    delay_evaluation=5
)
Note We also support v1 sdk policy contract.
from azureml.train.hyperdrive.policy import MedianStoppingPolicy
early_termination_policy = MedianStoppingPolicy(evaluation_interval=1, delay_evaluation=5)
In this example, the early termination policy is applied at every interval starting at evaluation interval 5. A run is stopped at interval 5 if its best primary metric is worse than the median of the running averages over intervals 1:5 across all training runs.
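The median stopping rule can be sketched in a few lines. This is a simplified reading of the description, assuming a goal of maximize and that every run has reported a metric at each interval so far; median_stopping is a hypothetical helper, and the run histories are made-up numbers.

```python
from statistics import mean, median


def median_stopping(history: dict, interval: int) -> list:
    """Return the ids of runs to stop at the given interval (sketch, goal = maximize).

    history maps a run id to the list of primary metric values it has logged,
    one value per interval.
    """
    # Running average of each run over intervals 1..interval.
    running_averages = {run: mean(values[:interval]) for run, values in history.items()}
    cutoff = median(running_averages.values())
    # Stop a run if its best metric so far is worse than the median of the averages.
    return [run for run, values in history.items() if max(values[:interval]) < cutoff]


history = {
    "run_a": [0.5, 0.6, 0.7, 0.8, 0.9],
    "run_b": [0.1, 0.2, 0.2, 0.3, 0.3],
    "run_c": [0.4, 0.5, 0.5, 0.6, 0.6],
}
stopped = median_stopping(history, interval=5)
```

Here the running averages are 0.7, 0.22 and 0.52, so the median is 0.52 and only run_b, whose best value is 0.3, falls below it.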
Truncation selection policy
Truncation selection cancels a percentage of lowest performing runs at each evaluation interval. Runs are compared using the primary metric.
This policy takes the following additional configuration parameters:
truncation_percentage: the percentage of lowest performing runs to terminate at each evaluation interval. An integer value between 1 and 99.
Yaml example:
early_termination:
  policy_type: truncation_selection
  truncation_percentage: 20
  evaluation_interval: 1
  delay_evaluation: 5
Note We also support v1 sdk policy contract.
SDK python example:
from azureml.train.hyperdrive.policy import TruncationSelectionPolicy
early_termination_policy = TruncationSelectionPolicy(truncation_percentage=20, evaluation_interval=1, delay_evaluation=5)
In this example, the early termination policy is applied at every interval starting at evaluation interval 5. A run terminates at interval 5 if its performance at interval 5 is in the lowest 20% of performance of all runs at interval 5.
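A minimal sketch of truncation selection at a single interval follows, assuming a goal of maximize. How the service rounds the number of runs to cancel is not stated here, so rounding down is an assumption; truncation_selection is a hypothetical helper and the metric values are invented.

```python
import math


def truncation_selection(metrics: dict, truncation_percentage: int) -> list:
    """Return the ids of the lowest performing runs to cancel (sketch, goal = maximize)."""
    if not 1 <= truncation_percentage <= 99:
        raise ValueError("truncation_percentage must be an integer between 1 and 99")
    # Assumption: the number of cancelled runs is rounded down.
    n_cancel = math.floor(len(metrics) * truncation_percentage / 100)
    ranked = sorted(metrics, key=metrics.get)  # worst metric first
    return ranked[:n_cancel]


# Ten runs with their primary metric at interval 5.
metrics_at_interval_5 = {f"run_{i}": 0.5 + 0.04 * i for i in range(10)}
cancelled = truncation_selection(metrics_at_interval_5, truncation_percentage=20)
```

With ten runs and a truncation_percentage of 20, the two runs with the lowest metric are cancelled.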
No termination policy (default)
If no policy is specified, the hyperparameter tuning service will let all training runs execute to completion.
component.runsettings.sweep.early_termination.policy_type = 'default'
Picking an early termination policy
For a conservative policy that provides savings without terminating promising jobs, consider a Median Stopping Policy with evaluation_interval 1 and delay_evaluation 5. These are conservative settings that can provide approximately 25%-35% savings with no loss on the primary metric (based on our evaluation data).
For more aggressive savings, use Bandit Policy with a smaller allowable slack or Truncation Selection Policy with a larger truncation percentage.
Limits
Control resource budget for trial runs.
| Name | Type | Required | Description |
|---|---|---|---|
| max_total_trials | Integer | Yes | Maximum number of trial runs. Must be an integer between 1 and 1000. |
| max_concurrent_trials | Integer | No | Maximum number of runs that can run concurrently. If not specified, all runs launch in parallel. If specified, must be an integer between 1 and 100. |
| timeout_minutes | Integer | No | Maximum duration, in minutes, of the hyperparameter tuning experiment. Runs after this duration are canceled. |
NOTE If both max_total_trials and timeout_minutes are specified, the hyperparameter tuning experiment terminates when the first of these two thresholds is reached. The number of concurrent trials is gated on the resources available in the specified compute target. Ensure that the compute target has the available resources for the desired concurrency.
component.runsettings.sweep.limits.configure(max_total_trials = 20, max_concurrent_trials=4)
This code configures the hyperparameter tuning experiment to use a maximum of 20 total runs, running 4 configurations at a time.
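Before submitting, it can help to check the limits against the ranges in the table. The validator below is a sketch only: validate_limits is a hypothetical helper, the numeric ranges come from the table, and requiring timeout_minutes to be positive is an assumption.

```python
def validate_limits(max_total_trials, max_concurrent_trials=None, timeout_minutes=None):
    """Return a list of problems with the given limits (sketch, not an SDK check)."""
    errors = []
    if not 1 <= max_total_trials <= 1000:
        errors.append("max_total_trials must be an integer between 1 and 1000")
    if max_concurrent_trials is not None and not 1 <= max_concurrent_trials <= 100:
        errors.append("max_concurrent_trials must be an integer between 1 and 100")
    # Assumption: a timeout, when given, has to be a positive number of minutes.
    if timeout_minutes is not None and timeout_minutes <= 0:
        errors.append("timeout_minutes must be positive")
    return errors


ok = validate_limits(max_total_trials=20, max_concurrent_trials=4)
bad = validate_limits(max_total_trials=2000, max_concurrent_trials=0)
```

Returning a list of messages instead of raising on the first problem lets all invalid settings be reported at once.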
Appendix
Reference doc for: SDK 1.0