
Overview

Clustering is the task of grouping a set of objects so that objects in the same group are more similar to each other than to objects in other groups. It is a type of unsupervised learning, a technique that draws inferences from datasets without labeled responses.

The Clustering Snap helps determine the intrinsic grouping among unlabeled numeric data. Typical uses include discovering customer segments for marketing purposes, grouping different species of plants and animals, and grouping books by topic. If your data contains categorical fields, the Snap ignores all such fields.
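
One way to picture how categorical fields are ignored: the sketch below keeps only the fields of a document whose values are numbers. It is an illustration in Python, not the Snap's implementation; the `numeric_fields` helper and the sample document are hypothetical.

```python
def numeric_fields(document):
    """Keep only fields holding int or float values (booleans excluded)."""
    return {
        key: value
        for key, value in document.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }

# Hypothetical input document with one categorical (string) field.
doc = {"sepal_length": 5.1, "sepal_width": 3.5, "species": "setosa"}
features = numeric_fields(doc)
print(features)
```

Under this reading, a number stored as a string (for example "5.1") would also be dropped, which is presumably why the example later on this page places a Type Converter Snap upstream.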

Input and Output

Expected input:

First input view: A document with numeric fields.
Second input view: A document that contains a model built by another Clustering Snap. If the model is not available, the Snap builds a model.

Expected output:

First output view: A document with the input data and assigned cluster index.
Second output view: A document that represents the model built by the Snap.

Expected upstream Snaps

First input view: A Snap that offers documents. For example, Mapper or Categorical to Numeric.
Second input view: A Snap that offers documents which provide a clustering model built by the Clustering Snap. For example, a combination of File Reader and JSON Parser.

Expected downstream Snaps: A Snap that accepts documents. For example, Mapper, JSON Formatter, or Sort.

With one input view, the Snap builds a model. With two input views, the Snap uses the model to give predictions.
SnapLogic recommends using either 2 input views with 1 output view, or 2 output views with 1 input view. Do not use 2 input views with 2 output views.

Prerequisites

None.

Configuring Accounts

Accounts are not used with this Snap.

Configuring Views

Input	This Snap has at most two document input views.
Output	This Snap has at most two document output views.
Error	This Snap has at most one document error view.

Troubleshooting

None.

Limitations and Known Issues

None.

Modes

Ultra Pipelines: Works in Ultra Pipelines.

Snap Settings

Label	Required. The name for the Snap. You can modify this to be specific, especially if you have more than one of the same Snap in your pipeline.
Algorithm	Required. The clustering algorithm used to group the data. The available options are: K-Means: Partitions n observations into k clusters, in which each observation belongs to the cluster with the nearest mean. X-Means: An extension of K-Means that tries to determine the number of clusters automatically, based on Bayesian Information Criterion (BIC) scores. G-Means: Another extension of K-Means that tries to determine the number of clusters automatically, using a normality test. For a detailed description of the algorithms, read here. Default value: K-Means
Max cluster	Required. The maximum number of clusters that the Snap must create. If you set Algorithm to K-Means, the Snap creates exactly the number of clusters that you specify here. For the X-Means and G-Means algorithms, the Snap optimizes the cluster count automatically for your dataset, so the number of clusters might be equal to or less than the value you specify here. Default value: 3 Minimum: 2 Maximum: 10000
Pass through	Select to include all input fields in the output. Otherwise, the Snap outputs only the cluster index. Default value: Selected
Snap Execution	The Snap execution mode. The available options are: Validate & Execute: Executes the Snap during Pipeline validation and execution. Execute only: Executes the Snap during Pipeline execution only, and not during validation. Disabled: Does not execute the Snap during Pipeline execution or validation. Default value: Validate & Execute
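
To make the K-Means option in the Algorithm setting more concrete, here is a minimal sketch of the textbook procedure (Lloyd's algorithm) in plain Python. It is not the Snap's implementation. The function names are hypothetical, and seeding the centroids with the first k points is an assumption made only to keep the run reproducible.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def mean(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    return [sum(axis) / len(points) for axis in zip(*points)]

def k_means(points, k, max_iter=100):
    """Alternate between assigning points and recomputing means until stable."""
    # Assumption: seed with the first k points (real implementations often seed randomly).
    centroids = [list(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        labels = [nearest(p, centroids) for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(mean(members) if members else centroids[i])
        if updated == centroids:
            break  # no centroid moved, so the assignment is stable
        centroids = updated
    return centroids, labels

# Toy data: three well-separated pairs of points.
points = [[1.0, 1.0], [5.0, 5.0], [9.0, 1.0], [1.2, 0.8], [5.2, 4.8], [9.2, 0.8]]
centroids, labels = k_means(points, k=3)
print(labels)
```

With K-Means the cluster count is fixed in advance, which matches the Max cluster behavior described above. X-Means and G-Means instead search for a suitable count up to that maximum, and that search is not shown here.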

Example

This Pipeline demonstrates how the Clustering Snap helps you cluster unlabeled data into groups using the K-Means algorithm and save the model for later use.

Download this Pipeline.

Understanding the Pipeline

In this example, the CSV Generator Snap named Train Dataset contains Iris flower samples with the length and width of their sepals and petals. The Type Converter Snap is configured to automatically detect and convert the data types. Here, all four fields (sepal_length, sepal_width, petal_length, and petal_width) are converted into numeric fields. The output preview of the Type Converter Snap is as follows:

This dataset is passed to the Clustering Snap, which is configured as follows:

Algorithm: K-Means
Max cluster: 3
Pass through: Selected

Because we have selected K-Means, the Snap will create exactly 3 clusters. We select Pass through to include all the input fields in the output.

The Clustering Snap is configured for two output views. The first output view displays the data and the cluster index, and the second output view displays the model.

The preview from the first output view of the Clustering Snap is as follows:

The dataset is grouped into 3 clusters, as shown in the $pred column. The Clustering Snap measures how much the documents differ from one another and places the most similar ones in the same cluster. Each algorithm groups documents differently; this result is from the K-Means algorithm.
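
The effect of the Pass through setting on the first output view can be sketched as follows. This is an illustration only: the `assign_cluster` helper, the centroid values, and the zero-based cluster index are assumptions, and the output field is named `pred` after the $pred column mentioned above.

```python
import math

def assign_cluster(document, centroids, fields, pass_through=True):
    """Return an output document with the cluster index in a `pred` field."""
    point = [document[name] for name in fields]
    index = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
    # With pass-through, copy every input field; otherwise emit only the index.
    output = dict(document) if pass_through else {}
    output["pred"] = index
    return output

fields = ["petal_length", "petal_width"]
centroids = [[1.5, 0.25], [4.25, 1.25], [5.5, 2.0]]  # illustrative, not real model values
doc = {"petal_length": 4.5, "petal_width": 1.5, "note": "sample"}

with_input = assign_cluster(doc, centroids, fields)
index_only = assign_cluster(doc, centroids, fields, pass_through=False)
print(with_input)
print(index_only)
```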

The preview from the second output view of the Clustering Snap is as follows:

The output preview shows the model. The model output of the Clustering Snap is converted to JSON using the JSON Formatter Snap and then passed to a File Writer Snap. This model can then be used in another Pipeline with a Clustering Snap to derive predictions.
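
The save-then-reuse flow can be sketched outside SnapLogic as a JSON round trip. The model layout below (`algorithm`, `fields`, `centroids`) is hypothetical and the centroid values are illustrative; the actual model document produced by the Snap may be structured differently.

```python
import json
import math
import os
import tempfile

# Hypothetical model layout; the Snap's real model document may differ.
model = {
    "algorithm": "K-Means",
    "fields": ["petal_length", "petal_width"],
    "centroids": [[1.5, 0.25], [4.25, 1.25], [5.5, 2.0]],  # illustrative values
}

# Write the model as JSON, in the role played by JSON Formatter and File Writer.
path = os.path.join(tempfile.mkdtemp(), "clustering_model.json")
with open(path, "w", encoding="utf-8") as handle:
    json.dump(model, handle)

# Later, read the model back, in the role played by File Reader and JSON Parser.
with open(path, encoding="utf-8") as handle:
    loaded = json.load(handle)

def predict(document, model):
    """Index of the stored centroid nearest to the document's numeric fields."""
    point = [document[name] for name in model["fields"]]
    centroids = model["centroids"]
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

cluster = predict({"petal_length": 1.4, "petal_width": 0.2}, loaded)
print(cluster)
```

Storing the field names next to the centroids is a design choice in this sketch: it lets the predicting side read values in the same order that was used for training.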

Snap Pack History


Release	Snap Pack Version	Date	Type	Updates
November 2024	main29029	13 Nov 2024	Stable	Updated and certified against the current SnapLogic Platform release.
August 2024	main27765	21 Aug 2024	Stable	Upgraded the `org.json.json` library from v20090211 to v20240303, which is fully backward compatible.
May 2024	main26341	08 May 2024	Stable	Updated and certified against the current SnapLogic Platform release.
February 2024	main25112	14 Feb 2024	Stable	Updated and certified against the current SnapLogic Platform release.
November 2023	main23721	08 Nov 2023	Stable	Updated and certified against the current SnapLogic Platform release.
August 2023	main22460	16 Aug 2023	Stable	Updated and certified against the current SnapLogic Platform release.
May 2023	433patches21854	14 Jul 2023	Latest	Fixed an issue with the Cross Validator (Classification) Snap where the native Windows DLL caused the Snaplex to stall.
May 2023	433patches21644	28 Jun 2023	Latest	Improved an error message in the Remote Python Script Snap to explain the reason and resolution for the case where a Python script has errors.
May 2023	main21015	10 May 2023	Stable	Upgraded with the latest SnapLogic Platform release.
February 2023	main19844	09 Feb 2023	Stable	Upgraded with the latest SnapLogic Platform release.
November 2022	main18944	10 Nov 2022	Stable	Upgraded with the latest SnapLogic Platform release.
August 2022	main17386	11 Aug 2022	Stable	Upgraded with the latest SnapLogic Platform release.
4.29	429patches16809	20 Jul 2022	Latest	Removed the log4j dependency from the ML Core Snaps due to security vulnerabilities.
4.29	main15993	14 May 2022	Stable	Upgraded with the latest SnapLogic Platform release.
4.28	main14627	20 Jul 2022	Stable	Upgraded with the latest SnapLogic Platform release.
4.27	427patches13948	07 Jan 2022	Latest	Fixed an issue where a deadlock occurred when data was loaded from both input views in the following Snaps: Predictor (Classification), Predictor (Regression), and Clustering.
4.27	main12833	13 Nov 2021	Stable	Upgraded with the latest SnapLogic Platform release.
4.26	main11181	14 Aug 2021	Stable	Upgraded with the latest SnapLogic Platform release.
4.25	main9554	08 May 2021	Stable	Upgraded with the latest SnapLogic Platform release.
4.24	main8556	13 Feb 2021	Stable	Upgraded with the latest SnapLogic Platform release.
4.23	main7430	14 Nov 2020	Stable	Upgraded with the latest SnapLogic Platform release.
4.22	main6403	12 Sep 2020	Stable	Upgraded with the latest SnapLogic Platform release.
4.21	snapsmrc542	09 May 2020	Stable	Upgraded with the latest SnapLogic Platform release.
4.20 Patch	mlcore8770	18 Mar 2020	Stable	Added the log4j dependency to the ML Core Snaps to resolve the "`Could not initialize class org.apache.log4j.LogManager`" error.
4.20	snapsmrc535	08 Feb 2020	Stable	Upgraded with the latest SnapLogic Platform release.
4.19	snapsmrc528	14 Nov 2019	Stable	Upgraded with the latest SnapLogic Platform release.
4.18	snapsmrc523	10 Aug 2019	Stable	Upgraded with the latest SnapLogic Platform release.
4.17 Patch	ALL7402	11 Jun 2019	Latest	Pushed automatic rebuild of the latest version of each Snap Pack to SnapLogic UAT and Elastic servers.
4.17	snapsmrc515	11 Jun 2019	Latest	New Snap: Introducing the Clustering Snap that performs exploratory data analysis to find hidden patterns or groupings in data. Enhanced the AutoML Snap. You can now: (1) select algorithms to derive the top models; (2) input the best model generated by another AutoML Snap from a previous execution; (3) view an interactive HTML report that contains statistics of up to 10 models. Added the Snap Execution field to all Standard-mode Snaps. In some Snaps, this field replaces the existing Execute during preview check box.
4.16	snapsmrc508	16 Feb 2019	Stable	New Snap: Introducing the AutoML Snap, which lets you automate the process of selecting machine learning algorithms and tuning hyperparameters. This Snap gives the best predictive model within the specified time limit.
4.15	snapsmrc500	15 Dec 2018	Stable	New Snap Pack. Perform data modeling operations such as model training, cross-validation, and model-based predictions. You can also execute Python scripts remotely. Snaps in this Snap Pack are: Cross Validator (Classification), Cross Validator (Regression), Predictor (Classification), Predictor (Regression), Remote Python Script, Trainer (Classification), and Trainer (Regression). Released the Remote Python Executor account and the Remote Python Executor Dynamic account for the Remote Python Script Snap.