AetherBridge Component

Overview

An AetherBridge component is a component that can be used to bridge job request to AEther environment. This component is for Microsoft internal only. This component is in private preview, please play it around and share feedback with us.

Prerequisites

Before using AetherBridge component, you should be familiar with:

To use AetherBridge component to submit job to AEther successfully, you should have below access:

Existing AEther module. Please notice that AetherBridge component is a temporary approach, we will provide AML native solution in long-term. If your AEther module needs to go with AetherBridge component, please follow this form to register your ask.
Contributor role of your ADLS
RWX access to your ADLS data explorer at root folder

Scenarios

Bridge your job requests to AEther through AetherBridge component in AzureML.

Limitation

AetherBridge component is a temporary solution, and bridging job to AEther introduces the following limitations due to the requirement of AEther:

AEther Exepool downloads the whole parent folder of the input file. So please DO NOT put your input file in a large parent folder to avoid long downloading latency or even disk full error. Please copy your input file to an empty folder with an extra Command Component if neccessary.
AzureML AetherBridge component must reference an existing AEther module.
The inputs/outputs configuration of AzureML component must align with the referenced AEther module on both amount and type. AEther File type corresponds to AnyFile and AEther Directory type corresponds to AnyDirectory. Additionally, the inputs/outputs relative path must fit cosmos path constraint, e.g, no spaces is allowed.
The inputs/outputs data must reside in Azure Data Lake on Cosmos directly or indirectly under “local” folder.

How to write AetherBridge component yaml spec

Please refer to AetherBridge component spec doc.

Example yaml:

$schema: https://componentsdk.azureedge.net/jsonschema/AetherBridgeComponent.json 
name: sample_aetherbridge_component_for_scraping_job
version: 0.0.1 
display_name: Sample AetherBridge Component for Scraping Job
type: AetherBridgeComponent
inputs:
  scraping_config:
    type: AnyFile
    optional: false
  entities: 
    type: AnyFile
    optional: false
  scraping_backend:
    type: string
    optional: false
  entity_text_column_name:
    type: string
    optional: false
  entity_id_column_name:
    type: string
    optional: false
outputs: 
  scrape_job_info:
    type: AnyFile
  scrape_file:
    type: AnyFile
  status_file:
    type: AnyFile
  job_statistics:
    type: AnyFile
  was_scrape_canceled:
    type: AnyFile
command: >-
  ScrapeSWS.exe -b {inputs.scraping_backend} -c {inputs.scraping_config} -e {inputs.entities} -m "{inputs.entity_text_column_name}" -g "{inputs.entity_id_column_name}" -o {outputs.scrape_job_info} -s {outputs.scrape_file} -a {outputs.status_file} -k {outputs.job_statistics} --cancelFile {outputs.was_scrape_canceled}
aether:
  module_type: ScrapingCloud
  ref_id: <your-aether-module-id>

Note: “type” of each input/output must be aligned with the corresponding port in AEther module. AnyFile for file port and AnyDirectory for directory port. The port name of scraping job output must be scrape_file, it’s corresponded to the Scrape File output port in the AEther module.

See more examples in github samples repo.

Follow how to access instructions if you meet 404 error when accessing the samples.

Inputs

Inputs are data files fed into the AetherBridge component for aether jobs in aether command. It can be defined in a DataSet/DataReference or provided as output from a previous module. AetherBridge component only supports input from Azure Data Lake on Cosmos. User who submits the job must have permission to access the data storage.

How to consume an AetherBridge component

Get dataset from your ADLS

from azureml.core import Dataset

scraping_config_dataset = Dataset.File.from_files(path=(adls, os.environ.get("ADL_SCRAPING_RELATIVE_PATH", "/local/<your-scraping-config-file-path>")))
entities_dataset = Dataset.File.from_files(path=(adls, os.environ.get("ADL_SCRAPING_RELATIVE_PATH", "/local/<your-entities-file-path>")))

Load component

from azure.ml.component import Component

scraping_func = Component.from_yaml(ws, yaml_file='./component_spec.yaml')

Create pipeline

from azure.ml.component import dsl, Pipeline

@dsl.pipeline(
    name='SubmitScrapingJobThroughAetherBridge',
    description='Submit scraping Job through AetherBridge component using SDK',
    default_datastore=adls
)
def scraping_pipeline_test() -> Pipeline:
    scraping_pipeline_test_file_result = scraping_func(
        scraping_config=input_dataset_config,
        entities=input_dataset_entities,
        scraping_backend="Prod",
        entity_text_column_name="<entity-text-column-name>",
        entity_id_column_name="<entity-id-column-name>"
    )
    return scraping_pipeline_test_file_result.outputs

Validate and submit the pipeline

scraping_pipeline = scraping_pipeline_test()
scraping_pipeline.validate()

run = scraping_pipeline.submit()
run.wait_for_completion()

Sample notebook

How to use AetherBridge component to submit scraping job - Demonstrates how to use AetherBridge component to submit scraping job.

Note: Usually there is no need for customers to set compute and runsettings. Only supports input from Azure Data Lake on Cosmos.