HDInsight Component

Overivew

A HDInsightComponent is a Component that executes a spark job in a HDInsight cluster.

Scenarios

Use Apache Spark to train your model or analysis your data, multiple popular frameworks are supported:

  • PySpark

  • .NET for Spark

  • JAVA

Limitation

Currently only works on compliant HDI cluster created by Office team.

How to write HDInsightComponent yaml spec

Please refer to HDInsightComponent spec doc.

Please also see HDInsightComponent schema .

Example yaml:

$schema: https://componentsdk.azureedge.net/jsonschema/HDInsightComponent.json
name: microsoft.com.azureml.samples.train-in-spark
version: 0.0.1
display_name: Train in Spark
type: HDInsightComponent
description: Train a Spark ML model using an HDInsight Spark cluster

inputs:
  input_path:
    type: path
    description: Iris csv file
    optional: false
  regularization_rate:
    type: float
    description: Regularization rate when training with logistic regression
    optional: true
    default: 0.01
outputs:
  output_path:
    type: path
    description: The output path to save the trained model to

hdinsight:
  file: "train-spark.py"
  args: >-
    --input_path {inputs.input_path} [--regularization_rate {inputs.regularization_rate}] --output_path {outputs.output_path}

See more examples in github samples repo.

Samples

Follow how to access instructions if you meet 404 error when accessing the samples.

Appendix

In legacy module concept, this maps to hdinsight jobType Module.