Hemera Component
Overview
A HemeraComponent is a component that submits Yarn-based jobs to the Bing MagneTar platform (also known as the Multi-Tenancy, or MT, platform) via Hemera. This component is for Microsoft internal use only.
Prerequisites
Before using a Hemera component, you should be familiar with:
Scenarios
Run your Hemera jobs in Azure ML.
Limitation
A Hemera component MUST reference an existing Aether module.
A Hemera component only works with data on ADLS (Cosmos). It can read or write only native Cosmos paths; ADLS paths that are not visible on Cosmos will hit authentication errors at runtime.
The output type MUST be a file type (e.g. AnyFile); this is a hard constraint in Hemera.
The output of a Hemera component is a file whose content is the real output Cosmos path. Customers need an extra component to expose that Cosmos path as a dataset for downstream consumption.
AML and Aether handle optional inputs differently. Because of constraints in Hemera, customers need to make minor code changes in their module entry script if optional inputs are used.
How to write a HemeraComponent YAML spec
Please refer to HemeraComponent spec doc.
Please refer to HemeraComponent Schema.
Example YAML:
$schema: https://componentsdk.azureedge.net/jsonschema/HemeraComponent.json
name: microsoft.com.azureml.samples.hemera.AdsLRDNNRawKeys_Dummy
version: 0.0.1
display_name: Ads LR DNN Raw Keys
type: HemeraComponent
description: Without position feature with NodeLostBlocker false [TensorFlowOnYarn,1.0.1] Ads LR DNN Raw Keys [Prod]
inputs:
  TrainingDataDir:
    type: path
    optional: true
    is_resource: true
  ValidationDataDir:
    type: path
    optional: true
    is_resource: true
  InitialModelDir:
    type: path
    optional: true
    is_resource: true
  YarnCluster:
    type: string
    optional: false
    default: mtprime-bn2-0
  JobQueue:
    type: string
    optional: false
    default: default
outputs:
  output1:
    type: AnyFile
  output2:
    type: AnyFile
command: >-
  run.bat [-_TrainingDataDir {inputs.TrainingDataDir}] [-_ValidationDataDir {inputs.ValidationDataDir}] [-_InitialModelDir {inputs.InitialModelDir}]
  %CLUSTER%={inputs.YarnCluster} -JobQueue {inputs.JobQueue}
  -_ModelOutputDir {outputs.output1} -_ValidationOutputDir {outputs.output2}
hemera:
  ref_id: c3d0fbd2-8b78-4231-a665-3a0de1796264
Note: The ref_id should be the GUID of an existing module in Aether production.
See more examples in the GitHub samples repo.
If you get a 404 error when accessing the samples, follow the how-to-access instructions.
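Because a malformed ref_id is an easy mistake to make when copying GUIDs around, a quick local format check can help before submitting. The following is a minimal sketch only: the function name is hypothetical, it checks just the textual 8-4-4-4-12 GUID shape, and it cannot tell whether the module actually exists in Aether production.

```python
import re

# Canonical GUID text form: 8-4-4-4-12 hexadecimal digits.
GUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def is_well_formed_ref_id(ref_id: str) -> bool:
    """Hypothetical helper: True if ref_id has the textual shape of a GUID."""
    return GUID_PATTERN.fullmatch(ref_id) is not None


if __name__ == "__main__":
    # The ref_id from the example spec above.
    print(is_well_formed_ref_id("c3d0fbd2-8b78-4231-a665-3a0de1796264"))
```

A check like this only catches typos; whether the GUID points at a real Aether module still has to be verified against Aether itself.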
Samples
How to use Hemera component - Demonstrates how to use a Hemera component to run MT jobs.
FAQ
How to check Hemera job logs
MT/Hemera provides a job portal where users can check logs and job statuses. On the AML side, users can find the Hemera job link in logs/azureml/executionlogs.txt. Here is an example:
| executionlogs.txt |
|---|
| [2021-11-10 11:12:48Z] Job is in processing. The TakStatusCode is (Queued) |
| [2021-11-10 11:13:04Z] Job is in processing. The TakStatusCode is (Queued) |
| [2021-11-10 11:13:21Z] Job is in processing. The TakStatusCode is (Queued) |
| [2021-11-10 11:13:39Z] Job is running. The TakStatusCode is (Running) |
| [2021-11-10 11:13:39Z] Add this job in hemera manager |
| JobManager is submitting job 84ec4f6e-2419-4707-aead-879059c045cb@@@m@@@dbaf0465@@@11-10-2021_11-12-26_AM @11/10/2021 3:12:30 AM |
| Submitting 84ec4f6e-2419-4707-aead-879059c045cb@@@m@@@dbaf0465@@@11-10-2021_11-12-26_AM to Job Scheduler @11/10/2021 3:12:30 AM |
| Submitted 84ec4f6e-2419-4707-aead-879059c045cb@@@m@@@dbaf0465@@@11-10-2021_11-12-26_AM to Job Scheduler @11/10/2021 3:12:37 AM |
| =============Job webportal link: https://magnetar/job-detail.html?jobName=JD_qixia_84ec4f6e-2419-4707-aead-879059c045cbmdbaf0465___11-10-2021_11-12-26_AM&groupId=f78b190&subCluster=MTPrime-PROD-BN2-0 |
| JobManager submitted job 84ec4f6e-2419-4707-aead-879059c045cb@@@m@@@dbaf0465@@@11-10-2021_11-12-26_AM @11/10/2021 3:12:37 AM |
| [2021-11-10 11:13:55Z] Job is running. The TakStatusCode is (Running) |
| [2021-11-10 11:14:11Z] Job is running. The TakStatusCode is (Running) |
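If you have a local copy of executionlogs.txt, the portal link can be pulled out with a small script instead of scanning the log by eye. This is a rough sketch that assumes the link always follows the literal text "Job webportal link:" as in the example log above; the function name is hypothetical, and the log format may differ between Hemera versions.

```python
import re
from typing import Optional

# Assumes the link follows the literal marker seen in the example log above.
LINK_PATTERN = re.compile(r"Job webportal link:\s*(\S+)")


def find_job_portal_link(log_text: str) -> Optional[str]:
    """Hypothetical helper: return the first job portal URL in the log, or None."""
    match = LINK_PATTERN.search(log_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = (
        "[2021-11-10 11:13:39Z] Job is running. The TakStatusCode is (Running)\n"
        "=============Job webportal link: https://magnetar/job-detail.html?jobName=JD_example\n"
    )
    print(find_job_portal_link(sample))
```

The pattern stops at the first whitespace character, so it still works when the log lines have been joined into a single line.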
How to populate the real output path as a dataset
The outputs of a Hemera component/module are intermediate files that point to the real output Cosmos paths.
In Aether, it’s common to use modules like “AnyFile to CosmosPath” to expose the real Cosmos path and consume it in downstream modules.
In AML, we can use the “Link” dataset feature to achieve something similar. It allows customers to link an existing dataset to the current component run. So we can create a component that parses the real Cosmos paths from the intermediate files, creates datasets from them, and then links those datasets as outputs of the current component run.
First, create a new component with a Python script like this:
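import argparse

from azureml.core import Run

# find_intermediate_data_file and create_data_dataset are helpers that are
# not shown here; see the full example.

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--intermediate_data_dir',
        help='Intermediate data directory.',
    )
    parser.add_argument(
        '--output_port',
        help='The output port name.',
    )
    args, _ = parser.parse_known_args()

    # The intermediate file written by Hemera contains the real output Cosmos path.
    intermediate_data_file = find_intermediate_data_file(args.intermediate_data_dir)
    remote_cosmos_path = intermediate_data_file.read_text(encoding='utf-8-sig')

    # Create a dataset for that path and link it as the output of this run.
    run = Run.get_context()
    ws = run.experiment.workspace
    dataset = create_data_dataset(ws, remote_cosmos_path)
    run.output_datasets[args.output_port].link(dataset)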
Then submit it with the “Link” output mode via the Component SDK:
link_dataset_component = link_component_func(
    intermediate_data_dir=hemera_component.outputs.output1)
link_dataset_component.outputs.output_dataset.configure(mode="link")
For more details, please check the full example here.
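The script above calls find_intermediate_data_file without showing it. One way such a helper might look is sketched below. This is an assumption, not the implementation from the full example: it supposes the intermediate directory holds exactly one file and that surrounding whitespace in that file can be stripped. The read_cosmos_path wrapper is also hypothetical.

```python
from pathlib import Path


def find_intermediate_data_file(intermediate_data_dir: str) -> Path:
    """Hypothetical helper: return the single file found under the directory."""
    files = [p for p in Path(intermediate_data_dir).rglob("*") if p.is_file()]
    if len(files) != 1:
        raise ValueError(
            f"Expected exactly one file in {intermediate_data_dir}, found {len(files)}"
        )
    return files[0]


def read_cosmos_path(intermediate_data_dir: str) -> str:
    """Hypothetical wrapper: read the path stored in the intermediate file."""
    data_file = find_intermediate_data_file(intermediate_data_dir)
    # 'utf-8-sig' drops a leading byte order mark if the file has one.
    return data_file.read_text(encoding="utf-8-sig").strip()
```

Failing loudly when the directory does not contain exactly one file keeps a wrong path from silently becoming a dataset.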
Optional input argument gets a wrong value after migrating to AML
AML and Aether handle optional inputs differently, so the final command lines passed to the module entry script differ.
In Aether, if a module is defined as below, the final command line still contains the argument -_TrainingDataDir with an empty string as its value.
<m c="Hemera" dm="1" n="Hemera module sample" i="run.bat -_TrainingDataDir &quot;[(TrainingDataDir)]&quot; %CLUSTER%=(YarnCluster:default,yarn_prod_bn2) -JobQueue (JobQueue:default,AdsCP)" />
When “TrainingDataDir” isn’t connected to any data source, the final command line of the Aether module above looks like:
run.bat -_TrainingDataDir "" %CLUSTER%=yarn_prod_bn2 -JobQueue AdsCP
In AML, when an input is defined as optional and no data source is connected, the argument is omitted from the final command line. For the same case, the final command line in AML resolves to:
run.bat %CLUSTER%=yarn_prod_bn2 -JobQueue AdsCP
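The AML behavior described above can be illustrated with a toy resolver. This is not the actual AML implementation; it is a simplified sketch that assumes optional arguments are written as [...] groups around {inputs.X} placeholders, as in the example YAML spec, and the function name is hypothetical.

```python
import re

PLACEHOLDER = re.compile(r"\{inputs\.(\w+)\}")
OPTIONAL_GROUP = re.compile(r"\[([^\[\]]*)\]")


def resolve_aml_command(template: str, inputs: dict) -> str:
    """Toy model: drop [..] groups whose inputs are missing, then fill the rest."""

    def fill(text: str) -> str:
        return PLACEHOLDER.sub(lambda m: str(inputs[m.group(1)]), text)

    def resolve_optional(match: re.Match) -> str:
        group = match.group(1)
        names = PLACEHOLDER.findall(group)
        # Keep the group only if every input it mentions was provided.
        if all(inputs.get(name) is not None for name in names):
            return fill(group)
        return ""

    resolved = fill(OPTIONAL_GROUP.sub(resolve_optional, template))
    # Collapse the extra spaces left behind by dropped groups.
    return " ".join(resolved.split())


if __name__ == "__main__":
    template = (
        "run.bat [-_TrainingDataDir {inputs.TrainingDataDir}] "
        "%CLUSTER%={inputs.YarnCluster} -JobQueue {inputs.JobQueue}"
    )
    print(resolve_aml_command(template, {"YarnCluster": "yarn_prod_bn2", "JobQueue": "AdsCP"}))
```

With TrainingDataDir left unconnected, the whole -_TrainingDataDir group disappears, which is exactly what shifts the positional %CLUSTER% argument in the PowerShell case discussed next.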
The AML command line behavior works fine with named-argument parsing logic. However, PowerShell doesn’t have strong named-argument parsing, and most Hemera modules currently use PowerShell in their entry scripts.
The key problem is that Hemera cloud relies on the “%CLUSTER%=xxx” argument in the command line to determine which cluster the customer is submitting to. PowerShell treats it as a positional argument and assigns it to the first unassigned parameter by default. Consider this entry script:
param([ValidateNotNull()] [string] $_TrainingDataDir,
      [ValidateNotNullOrEmpty()] [string] $JobQueue)
Write-Host "Hello World!"
Write-Host "_TrainingDataDir: $_TrainingDataDir"
Write-Host "JobQueue: $JobQueue"
Consequently, if “-_TrainingDataDir” isn’t provided in the command line, “$_TrainingDataDir” is assigned the wrong value.
Command line
.\Launch.ps1 %CLUSTER%=yarn_prod_bn2 -JobQueue AdsCP
Output
Hello World!
_TrainingDataDir: %CLUSTER%=yarn_prod_bn2
JobQueue: AdsCP
One simple solution is to explicitly add a “$Cluster” parameter at the beginning of the parameter list and let it capture the %CLUSTER% value, even though the script never uses it.
param([string] $Cluster,
      [ValidateNotNull()] [string] $_TrainingDataDir,
      [ValidateNotNullOrEmpty()] [string] $JobQueue)
Write-Host "Hello World!"
Write-Host "Cluster: $Cluster"
Write-Host "_TrainingDataDir: $_TrainingDataDir"
Write-Host "JobQueue: $JobQueue"
With this trick, command line parsing now works correctly.
Command line
.\Launch.ps1 %CLUSTER%=yarn_prod_bn2 -JobQueue AdsCP
Output
Hello World!
Cluster: %CLUSTER%=yarn_prod_bn2
_TrainingDataDir:
JobQueue: AdsCP
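To see why the extra parameter helps, the binding rule can be modeled in a few lines of Python. This is a deliberately simplified model, not PowerShell’s real parameter binder: it assumes each named argument takes exactly one value, matches names case-sensitively, and hands leftover tokens to the still-unbound parameters in declaration order. The function name is hypothetical.

```python
def bind_arguments(param_names, tokens):
    """Toy model of binding: named arguments first, then positional leftovers."""
    bound = {}
    positional = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        name = token[1:] if token.startswith("-") else None
        if name in param_names and i + 1 < len(tokens):
            # "-Name value" binds the value to the named parameter.
            bound[name] = tokens[i + 1]
            i += 2
        else:
            # Anything else is treated as a positional value.
            positional.append(token)
            i += 1
    # Positional values go to the unbound parameters in declaration order.
    unbound = [p for p in param_names if p not in bound]
    for param, value in zip(unbound, positional):
        bound[param] = value
    return {p: bound.get(p, "") for p in param_names}


if __name__ == "__main__":
    tokens = ["%CLUSTER%=yarn_prod_bn2", "-JobQueue", "AdsCP"]
    print(bind_arguments(["_TrainingDataDir", "JobQueue"], tokens))
    print(bind_arguments(["Cluster", "_TrainingDataDir", "JobQueue"], tokens))
```

In the first call the %CLUSTER% token lands in _TrainingDataDir; in the second, the leading Cluster parameter absorbs it and _TrainingDataDir stays empty, mirroring the two outputs shown above.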