In this pipeline, you perform the following tasks:
- You read the dataset that you want to use for training the model.
- You parse the CSV data and convert the categorical data into numeric data, as the PCA Snap works only with numeric data.
- You run the data through the PCA Snap, which generates the model required for identifying the two principal components and uses the model to reduce the number of numeric fields (columns) in the output from four to two.
- You then use the generated model to perform PCA on a test dataset.
Read Train Set (Using the File Reader Snap)
You configure the Read Train Set Snap to read data from here:
You now send the data to the CSV Parser and Type Converter Snaps, which you use with their default settings. The following image represents the output of the Type Converter Snap:
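Outside the pipeline, the parse-and-convert step can be approximated as in the sketch below. It is only an illustration of the idea, not how the CSV Parser or Type Converter Snaps are implemented. The column names and the two sample rows are assumptions, because the training file itself is not reproduced here.

```python
import csv
import io

# Hypothetical sample; column names and rows are assumed for illustration.
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,class
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
"""


def convert(value):
    """Return an int or float when the text parses as one, else the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def parse_and_convert(text):
    """Parse CSV text into a list of dicts with numeric strings converted."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: convert(val) for key, val in row.items()} for row in reader]


rows = parse_and_convert(SAMPLE_CSV)

# Only the numeric fields would be candidates for PCA.
numeric_fields = [k for k, v in rows[0].items() if isinstance(v, (int, float))]
print(numeric_fields)
```

In this sketch, any field that cannot be parsed as a number stays a string and is left out of `numeric_fields`.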
Principal Component Analysis
This Snap only transforms numeric fields, and you configure it to reduce the number of dimensions to 2 while retaining 95% of the variance:
This Snap provides two pieces of output:
- It lists out only two components (dimensions), retaining 95% of the variance.
- It outputs the model used to generate the data listed above:
As you can see, two dimensions are captured, and together they cover more variance than the 0.95 specified above. The model breaks the varianceCoverage property down by component: the first principal component accounts for nearly 94% of the variance of the original data, and the second accounts for nearly 4%, for a combined coverage of nearly 98%.
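The relationship between per-component variance coverage and the 0.95 target can be checked with a little arithmetic. The sketch below uses placeholder ratios chosen to echo the "nearly 94%" and "nearly 4%" figures. They are not the Snap's actual output, and the function name `components_needed` is hypothetical. How the Snap itself combines the dimension and variance settings is not described here, so treat this only as an illustration of cumulative coverage.

```python
# Placeholder per-component variance ratios (assumed, not real Snap output).
variance_coverage = [0.94, 0.04, 0.015, 0.005]
TARGET = 0.95


def components_needed(ratios, target):
    """Count the largest components needed to reach the target coverage."""
    covered = 0.0
    ordered = sorted(ratios, reverse=True)
    for count, ratio in enumerate(ordered, start=1):
        covered += ratio
        if covered >= target:
            return count, covered
    # Target not reachable: report everything that was available.
    return len(ordered), covered


count, covered = components_needed(variance_coverage, TARGET)
print(count, round(covered, 2))
```

With these placeholder numbers, the first component alone falls short of 0.95, and adding the second pushes the cumulative coverage past the target, so two components are enough.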
You run another set of data through the model that you got from the PCA Snap to see whether the model works as expected. You find that the model works reliably, projecting each flower onto two dimensions derived from the length and width of its sepals and petals (see the graphs created before and after running the data through the PCA Snap, at the beginning of this example).
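The fit-then-apply flow described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the PCA Snap's implementation. It finds the leading eigenvectors of the covariance matrix by power iteration with deflation, a technique chosen here only because it needs no external libraries. The function names, the layout of the `model` dictionary, and the small train and test samples are all assumptions.

```python
import math


def column_means(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def covariance(rows, means):
    """Sample covariance matrix of the rows (list of lists)."""
    n, d = len(rows), len(means)
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenpairs(matrix, k, iterations=500):
    """Power iteration with deflation; adequate for a small symmetric matrix."""
    d = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _component in range(k):
        vec = [1.0 / math.sqrt(d)] * d
        for _step in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]
        # Rayleigh quotient gives the eigenvalue estimate for this vector.
        value = sum(vec[i] * work[i][j] * vec[j]
                    for i in range(d) for j in range(d))
        pairs.append((value, vec))
        # Deflate: remove the component just found before the next pass.
        for i in range(d):
            for j in range(d):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs


def fit(rows, k):
    """Build a model dict holding means, components, and variance coverage."""
    means = column_means(rows)
    cov = covariance(rows, means)
    pairs = top_eigenpairs(cov, k)
    total = sum(cov[i][i] for i in range(len(means)))
    return {
        "means": means,
        "components": [vec for _value, vec in pairs],
        "varianceCoverage": [value / total for value, _vec in pairs],
    }


def transform(model, rows):
    """Project rows onto the model's components after centering."""
    return [[sum((x - m) * c for x, m, c in zip(row, model["means"], comp))
             for comp in model["components"]] for row in rows]


# Assumed sample measurements: sepal length, sepal width, petal length, petal width.
train = [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2], [7.0, 3.2, 4.7, 1.4],
         [6.4, 3.2, 4.5, 1.5], [6.3, 3.3, 6.0, 2.5], [5.8, 2.7, 5.1, 1.9]]
test = [[5.0, 3.6, 1.4, 0.2], [6.9, 3.1, 4.9, 1.5]]

model = fit(train, k=2)
projected = transform(model, test)
print(projected)
```

The test rows are centered with the means stored in the model rather than their own means, so the train and test data land in the same two-dimensional space.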
Download this pipeline.