On this Page
Overview
The AutoML Snap automates the process of exploring and tuning machine learning models for a given dataset within the resource limit. A machine learning model is a mathematical representation of a real-world process that can be used to predict or solve a specific problem. For example, predict whether the customers are going to churn, predict whether the loan will be fully paid, or, forecast sales. To generate a machine learning model, you must provide training data to a machine learning algorithm to learn from.
Currently, the AutoML Snap supports binary classification, multiclass classification, and regression. For each type of problem, the Snap provides a different set of metrics and reports.
Input and Output
Expected input
- First input view: A document stream with classification or regression dataset.
- Second input view: A document that contains a model built by an AutoML Snap from a previous execution.
Expected output
- First output view: A serialization of a machine learning model, and metadata that are not human-readable. Additionally, the output includes a human-readable representation of the model if you select the Readable checkbox.
- Second output view: A document that contains the leaderboard. All the models built by this Snap display in the order of ranking along with metrics indicating the performance of the model.
- Third output view: A document that contains an interactive report of up to top 10 models.
Expected upstream Snaps
- First input view: A Snap that generates a classification or regression dataset. For example, CSV Generator, Mapper, or a combination of File Reader and JSON Parser.
- Second input view: A Snap that offers documents that provide a model built by an AutoML Snap. For example, a combination of File Reader and JSON Parser.
Expected downstream Snaps
- First output view: A Snap that formats and saves the model. For example, a combination of JSON Formatter and File Writer.
- Second output view: A Snap that accepts documents. For example, Aggregate, Mapper, or CSV Formatter.
- Third output view: A Snap that formats and saves the report. For example, a combination of Binary to Document and File Writer.
Prerequisites
None.
Configuring Accounts
Accounts are not used with this Snap.
Configuring Views
Input | This Snap has at most two document input views. |
---|---|
Output | This Snap has at most three document output views. |
Error | This Snap has at most one document error view. |
Troubleshooting
None.
Limitations and Known Issues
The AutoML generates models with statistical information. If the statistic value of a model for RMSLE is NaN, then the Snap skips that value in charts that are generated in reports.
Modes
- Ultra Pipelines: Does not work in Ultra Pipelines.
Snap Settings
Label | Required. The name for the Snap. Modify this to be more specific, especially if there are more than one of the same Snap in the pipeline. |
---|---|
Label field | Required. The label/class/target field in the dataset. This is the field you will train the model to predict. Default value: No default value Example: $class |
Time limit | Required. The maximum number of seconds for (OR up to) which the AutoML Snap can be run. If you set the Time Limit to 0, the Snap takes as much time as required to build N models, where N is the number you specify in the Number of models property.
Default value: 3600 |
Number of models | The number of models that must be included in AutoML. If you set the value to 0, the Snap builds as many models as possible within the time specified in the Time limit property.
Default value: 10 |
Fold | Required. The number of folds in k-fold cross validation. Minimum value: 2 Maximum value: 10 Default value: 5 |
Engine | Required. The engine to be used. Select from the following options:
Default value: H20 |
Algorithms | Group of algorithms to be used to derive the best model. The Snap supports the following algorithms:
Default value: Standard, Tree, XGBoost, NN For Weka engine, the algorithms that the Snap supports under each group are:
For H2O engine, the algorithms that the Snap supports under each group are:
|
Readable | Select this to output the model in a human-readable format. When selected, a $readable field is added to the output, which displays the model in a readable format. Default value: Not selected |
Use random seed | If selected, Random seed is applied to the randomizer in order to get reproducible results Default value: Selected |
Random seed | Required. A number used as the static seed for the randomizer. Default value: 12345 |
Report title | Title for the report. |
Page lookup error: page "Anaplan Read" not found. If you're experiencing issues please see our Troubleshooting Guide. | Page lookup error: page "Anaplan Read" not found. If you're experiencing issues please see our Troubleshooting Guide. |
Temporary Files
During execution, data processing on Snaplex nodes occurs principally in-memory as streaming and is unencrypted. When larger datasets are processed that exceeds the available compute memory, the Snap writes Pipeline data to local storage as unencrypted to optimize the performance. These temporary files are deleted when the Snap/Pipeline execution completes. You can configure the temporary data's location in the Global properties table of the Snaplex's node properties, which can also help avoid Pipeline errors due to the unavailability of space. For more information, see Temporary Folder in Configuration Options.Example
This Pipeline demonstrates how the AutoML Snap helps you train models with the H2O engine.
Download this pipeline.