Overview

This Snap performs Principal Component Analysis (PCA) on an input document and outputs a document containing fewer dimensions (or columns). PCA is a dimension-reduction technique that reduces a large set of variables to a smaller set that still contains most of the information in the original set. In simple terms, PCA attempts to find common factors in a given dataset and ranks them in order of importance. The first dimension in the output document therefore accounts for as much of the variance in the data as possible, and each subsequent dimension accounts for as much of the remaining variance as possible. Reducing the number of dimensions also reduces the amount of data that downstream Snaps must process, which can make them faster.
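The variance ranking described above can be illustrated with a minimal sketch. It is not the Snap's implementation: it handles only two-dimensional data, uses the closed-form eigenvalues of a 2x2 covariance matrix, and the function names and sample points are made up for illustration.

```python
import math

def covariance_2d(points):
    """Return the sample covariance terms (var_x, var_y, cov_xy) of 2-D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    return var_x, var_y, cov_xy

def variance_shares(points):
    """Return the fraction of total variance captured by each principal component."""
    var_x, var_y, cov_xy = covariance_2d(points)
    # Eigenvalues of the symmetric 2x2 covariance matrix, in closed form.
    half_trace = (var_x + var_y) / 2
    spread = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
    eigenvalues = [half_trace + spread, half_trace - spread]
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]

# Made-up points that lie close to a straight line.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]
shares = variance_shares(points)
print(shares)  # the first share is the largest
```

Because the sample points lie almost on a line, nearly all of the variance falls on the first component, which is why dropping the second component loses little information.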

PCA is widely used for tasks such as data compression, exploratory data analysis, and pattern recognition. For example, you can use PCA to identify patterns that can help you isolate specific species of flowers that are more closely related than others, as in our example below.

How does it work?

The PCA Snap performs two tasks:

It analyzes the data in the input document and creates a model that:
1. Reduces the number of dimensions in the input document to the number of dimensions specified in the Snap.
2. Retains the amount of variance specified in the Snap.
It then runs this model on the input data and outputs a document containing the transformed data. This simplified view makes it easier for you to identify patterns in the data.
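How the two limits interact is not spelled out here. The sketch below assumes one plausible rule: keep the fewest leading components whose cumulative variance reaches the requested variance, but never more than the requested number of dimensions. The function name, the rule itself, and the sample ratios are illustrative assumptions, not the Snap's documented behavior.

```python
def select_component_count(variance_ratios, max_dimension, min_variance):
    """Pick how many leading components to keep.

    Assumed rule (not confirmed by the Snap documentation): keep the fewest
    components whose cumulative variance reaches min_variance, but never
    more than max_dimension.
    """
    cumulative = 0.0
    for count, ratio in enumerate(variance_ratios, start=1):
        cumulative += ratio
        if cumulative >= min_variance or count >= max_dimension:
            return count, cumulative
    return len(variance_ratios), cumulative

# Made-up variance ratios, sorted from most to least important component.
ratios = [0.60, 0.25, 0.10, 0.05]

count, _ = select_component_count(ratios, max_dimension=10, min_variance=0.9)
print(count)  # 3: the variance target is reached first

count, _ = select_component_count(ratios, max_dimension=2, min_variance=0.9)
print(count)  # 2: the dimension cap is reached first
```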

Input and Output

Expected input

First input view: Required. A document containing data that has numeric fields.
Second input view: A document containing the model (or mathematical formula that performs a transformation on the input data) that you want the PCA Snap to use on the data coming in through the first input. If you do not provide the model, the PCA Snap builds a model that is best suited for the input data provided through the first input.

Expected output

First output view: A document containing transformed data with fewer (lower) dimensions.
Second output view: A document containing the model that the PCA Snap created and used on the input data. If you supply the Snap with the model (created using a PCA Snap earlier) that you want to use, the Snap does not output the model. To see this behavior in action, see the example below.
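Reusing a previously built model can be sketched as follows. The model layout here (a `mean` list plus a list of `components`) is a hypothetical structure chosen for illustration and is not the format of the document that the Snap emits. The output field names `pc0` and `pc1` follow the naming used in the example later on this page.

```python
def apply_model(model, row):
    """Project one numeric row onto the principal components stored in model."""
    centered = [value - mean for value, mean in zip(row, model["mean"])]
    return {
        f"pc{index}": sum(c * w for c, w in zip(centered, component))
        for index, component in enumerate(model["components"])
    }

# Hypothetical model: two unit-length components over three input fields.
model = {
    "mean": [2.0, 4.0, 6.0],
    "components": [
        [1.0, 0.0, 0.0],
        [0.0, 0.6, 0.8],
    ],
}

print(apply_model(model, [3.0, 4.0, 7.0]))  # {'pc0': 1.0, 'pc1': 0.8}
```

Applying a stored model involves only centering and projection, so new data is transformed with exactly the components that were learned earlier.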

Expected upstream Snaps

First input view: A Snap that provides a document containing data in a tabular format. For example, a combination of File Reader and CSV Parser, or Mapper.
Second input view: A Snap that provides documents. For example, a combination of File Reader and CSV Parser.

Expected downstream Snaps

Any Snap that accepts a document. For example, Mapper or CSV Formatter.

Prerequisites

The input data must be in a tabular format. The PCA Snap does not work with data containing nested structures.
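A quick way to check this prerequisite outside the pipeline is to test whether any field holds a nested object or list. This is a minimal sketch, and the field names are hypothetical.

```python
def is_flat(record):
    """Return True when no field value is itself a dict or a list."""
    return not any(isinstance(value, (dict, list)) for value in record.values())

flat = {"sepal_length": 5.1, "sepal_width": 3.5}      # tabular: one value per column
nested = {"sepal": {"length": 5.1, "width": 3.5}}     # nested: not suitable as input

print(is_flat(flat))    # True
print(is_flat(nested))  # False
```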

Configuring Accounts

Accounts are not used with this Snap.

Configuring Views

Input	This Snap has at most two document input views.
Output	This Snap has at most two document output views.
Error	This Snap has at most one document error view.

Troubleshooting

None.

Limitations and Known Issues

None.

Modes

Ultra pipelines: Works in Ultra pipelines when the Snap has two input views and one output view.
Spark mode: Does not work in Spark mode.

Snap Settings

Label	Required. The name for the Snap. Modify this to be more specific, especially if there is more than one of the same Snap in the pipeline.
Dimension	Required. The maximum number of dimensions (or columns) that you want in the output. Minimum value: 0. Maximum value: Undefined. Default value: 10.
Variance	Required. The minimum variance that you want to retain in the output documents. Minimum value: 0. Maximum value: 1. Default value: 0.95.
Pass through	Select this check box to include all the categorical input fields in the output.
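The effect of the Pass through check box can be sketched as follows. This sketch assumes that every non-numeric field counts as categorical. The helper names, field names, and values are made up for illustration.

```python
def split_fields(record):
    """Separate numeric fields from all other fields (treated here as categorical)."""
    numeric, categorical = {}, {}
    for name, value in record.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            numeric[name] = value
        else:
            categorical[name] = value
    return numeric, categorical

def build_output(reduced, record, pass_through):
    """Attach the categorical fields to the reduced record when pass_through is set."""
    _, categorical = split_fields(record)
    return {**reduced, **categorical} if pass_through else dict(reduced)

record = {"sepal_length": 5.1, "sepal_width": 3.5, "species": "setosa"}
reduced = {"pc0": -2.7, "pc1": 0.3}  # made-up values standing in for PCA output

print(build_output(reduced, record, pass_through=True))
print(build_output(reduced, record, pass_through=False))
```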

Example

Visualizing the Iris Flower Classification Dataset in Two Dimensions

In this example, you take a CSV file containing the length and width of the sepals and petals of different species of Iris flowers. You then apply PCA to reduce the number of dimensions to two. This enables you to visualize the flower distribution based on the size of their sepals and petals, using a scatter plot.

The scatter plot below plots flowers by sepal width (x) and sepal length (y). It is relatively confusing and shows no evident pattern, because the data associated with the four dimensions (sepal length, sepal width, petal length, and petal width) in the Iris dataset appears to be scattered all over the graph:

You design the pipeline to transform the input data to a two-dimensional view while retaining at least 95% of the variance:

Once you apply PCA and reduce the number of dimensions to two, it becomes easier to see patterns in the representation:

In this graph, the x axis represents the first principal component (pc0), and the y axis represents the second principal component (pc1). You can now see how the Setosa species, represented by the light blue dots in the graph, is very different from the other species of Iris flowers.

Download this pipeline.

Understanding this Example

In this pipeline, you perform the following tasks:

You read the dataset that you want to use for training the model.
You parse the CSV data and convert the categorical data into numeric data, as the PCA Snap works only with numeric data.
You run the data through the PCA Snap, which generates the model required for identifying the two principal components and uses the model to reduce the number of numeric fields (columns) in the output from four to two.
You now use the model generated to perform PCA on a test dataset.
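The parse-and-convert step in the list above can be approximated outside the pipeline. This sketch assumes that conversion means turning numeric-looking strings into numbers. The column names are hypothetical, and the code is not the CSV Parser or Type Converter implementation.

```python
import csv
import io

def parse_and_convert(csv_text):
    """Parse CSV text and convert values that look numeric into floats."""
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        converted = {}
        for name, value in raw.items():
            try:
                converted[name] = float(value)
            except (TypeError, ValueError):
                converted[name] = value  # leave non-numeric text unchanged
        rows.append(converted)
    return rows

# One illustrative row with hypothetical column names.
sample = (
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
)
rows = parse_and_convert(sample)
print(rows[0])
```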

Key Snaps

Read Train Set (Using the File Reader Snap)

You configure the Read Train Set Snap to read data from here:

You now send the data to the CSV Parser and Type Converter Snaps, which you use with their default settings. The following image represents the output of the Type Converter Snap:

Principal Component Analysis

This Snap only transforms numeric fields, and you configure it to reduce the number of dimensions to 2 while retaining 95% of the variance:

This Snap provides two pieces of output:

It lists out only two components (dimensions), retaining 95% of the variance.
It outputs the model used to generate the data listed above:

As you can see, the model captures two dimensions, and the varianceCoverage property is higher than the 0.95 specified above. The model breaks the varianceCoverage property down into two components: the first principal component accounts for nearly 94% of the variance in the original data, and the second accounts for nearly 4%.
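Using the rounded figures quoted above, the combined coverage works out roughly as follows. The result is approximate because both figures are rounded in the text.

```latex
\text{varianceCoverage} \approx 0.94 + 0.04 = 0.98 \geq 0.95
```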

You then run another dataset through the model produced by the PCA Snap to check whether the model works as expected. The model works reliably, separating the flowers along two principal components derived from the length and width of their sepals and petals. (See the graphs created before and after running the data through the PCA Snap, at the beginning of this example.)

Additional Resources

Snap History

Snap Pack History

Release	Snap Pack Version	Date	Type	Updates
August 2024	main27765	21 Aug 2024	Stable	Upgraded the `org.json.json` library from v20090211 to v20240303, which is fully backward compatible. Enhanced the Date Time Extractor Snap to support the date-time formats `YYYY-MM-dd HH:mm:ss` and `YYYY-MM-dd HH:mm:ss.SSS` and to allow the root path to auto-convert all fields.
May 2024	main26341	08 May 2024	Stable	Updated and certified against the current SnapLogic Platform release.
February 2024	436patches25781	03 Apr 2024	Latest	Enhanced the Deduplicate Snap to honor an interrupt while waiting in the delay loop, so that it manages memory efficiently.
February 2024	main25112	14 Feb 2024	Stable	Updated and certified against the current SnapLogic Platform release.
November 2023	main23721	08 Nov 2023	Stable	Updated and certified against the current SnapLogic Platform release.
August 2023	main22460	16 Aug 2023	Stable	Updated and certified against the current SnapLogic Platform release.
May 2023	433patches21572	20 Jun 2023	Latest	The Deduplicate Snap now manages memory efficiently and eliminates out-of-memory crashes using the following fields: Minimum memory (MB) and Minimum free disk space (MB).
May 2023	433patches21247	31 May 2023	Latest	Fixed an issue with the Match Snap where a null pointer exception was thrown when the second input view had fewer records than the first.
May 2023	main21015	10 May 2023	Stable	Upgraded with the latest SnapLogic Platform release.
February 2023	main19844	09 Feb 2023	Stable	Upgraded with the latest SnapLogic Platform release.
December 2022	431patches19268	19 Dec 2022	Latest	The Deduplicate Snap now treats fields that contain only empty strings or whitespace as having no data.
November 2022	main18944	10 Nov 2022	Stable	Upgraded with the latest SnapLogic Platform release.
August 2022	main17386	11 Aug 2022	Stable	Upgraded with the latest SnapLogic Platform release.
4.29	main15993	14 May 2022	Stable	Upgraded with the latest SnapLogic Platform release.
4.28	main14627	20 Jul 2022	Stable	Enhanced the Type Converter Snap with the Fail safe upon execution checkbox. Select this checkbox to enable the Snap to convert data with valid data types, while ignoring invalid data types.
4.27	427patches13730			Enhanced the Type Converter Snap with the Fail safe upon execution checkbox. Select this checkbox to enable the Snap to ignore invalid data types and convert data with valid data types.
4.27	427patches13948	07 Jan 2022	Latest	Fixed an issue with the Principal Component Analysis Snap, where a deadlock occurred when data was loaded from both input views.
4.27	main12833	13 Nov 2021	Stable	Upgraded with the latest SnapLogic Platform release.
4.26	main11181	14 Aug 2021	Stable	Upgraded with the latest SnapLogic Platform release.
4.25	425patches10994	04 Aug 2021		Fixed an issue with the Deduplicate Snap where the Snap failed when running on a locale that does not format decimals with the period (.) character.
4.25	main9554	08 May 2021	Stable	Upgraded with the latest SnapLogic Platform release.
4.24	main8556	13 Feb 2021	Stable	Upgraded with the latest SnapLogic Platform release.
4.23	main7430	14 Nov 2020	Stable	Upgraded with the latest SnapLogic Platform release.
4.22	main6403	12 Sep 2020	Stable	Upgraded with the latest SnapLogic Platform release.
4.21	snapsmrc542	09 May 2020	Stable	Introduces the Mask Snap, which enables you to hide sensitive information in your dataset before exporting the dataset for analytics or writing the dataset to a target file. Enhances the Match Snap with a new field, Match all, which matches one record from the first input with multiple records in the second input. Also enhances the Comparator field in the Match Snap with a new option, Exact, which identifies and classifies a match as either an exact match or not a match at all. Enhances the Deduplicate Snap with a new field, Group ID, which includes the Group ID for each record in the output. Also enhances the Comparator field in the Deduplicate Snap with a new option, Exact, which identifies and classifies a match as either an exact match or not a match at all. Enhances the Sample Snap with a second output view, which displays data that is not in the first output, and a new algorithm type, Linear Split, which enables you to split the dataset based on the pass-through percentage.
4.20 Patch	mldatapreparation8771	18 Mar 2020	Latest	Removes the unused `jcc-optional` dependency from the ML Data Preparation Snap Pack.
4.20	snapsmrc535	08 Feb 2020	Stable	Upgraded with the latest SnapLogic Platform release.
4.19	snapsmrc528	14 Nov 2019	Stable	New Snap: Introducing the Deduplicate Snap. Use this Snap to remove duplicate records from input documents. When you use multiple matching criteria to deduplicate your data, it is evaluated using each criterion separately, and then aggregated to give the final result.
4.18	snapsmrc523	10 Aug 2019	Stable	Upgraded with the latest SnapLogic Platform release.
4.17 Patch	ALL7402	11 Jun 2019	Latest	Pushed automatic rebuild of the latest version of each Snap Pack to SnapLogic UAT and Elastic servers.
4.17	snapsmrc515	11 Jun 2019	Latest	New Snap: Introducing the Feature Synthesis Snap, which automatically creates features out of multiple datasets that share a one-to-one or one-to-many relationship with each other. New Snap: Introducing the Match Snap, which enables you to automatically identify matched records across datasets that do not have a common key field. Added the Snap Execution field to all Standard-mode Snaps. In some Snaps, this field replaces the existing Execute during preview check box.
4.16	snapsmrc508	16 Feb 2019	Stable	Added a new Snap, Principal Component Analysis, which enables you to perform principal component analysis (PCA) on numeric fields (columns) to reduce dimensions of the dataset.
4.15	snapsmrc500	15 Dec 2018	Stable	New Snap Pack. Perform preparatory operations on datasets such as data type transformation, data cleanup, sampling, shuffling, and scaling. Snaps in this Snap Pack are: Categorical to Numeric, Clean Missing Values, Date Time Extractor, Numeric to Categorical, Sample, Scale, Shuffle, and Type Converter.