The input is generated by the CSV Generator Snap. A Type Converter Snap downstream of the CSV Generator Snap automatically converts the data types of the values. The output from the Type Converter Snap is then passed into the Copy Snap, which is configured to generate two streams of the input documents: one stream is fed to the Clean Missing Values Snap, and the other is fed to the Profile Snap. Below is an output preview of the Copy Snap, which is the data input for the Clean Missing Values Snap.
One of the documents has no value in the $Category field.
In a small dataset, missing values are easy to spot by inspection. In a large dataset, they are easier to identify using the Profile Snap. The Clean Missing Values Snap also requires the data statistics of the input when Impute with Popular or Impute with Average is selected in the Rule property, so the Profile Snap is useful in either case. This example uses the $popular field in the Profile Snap's output.
The Profile Snap is configured as shown below:
Based on its configuration, the Profile Snap has the following output:
This output indicates that the input documents have one missing value in the $Category field, and that the most popular value is Publishing. The Clean Missing Values Snap uses this value to fill in the missing value in the document.
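To make the two statistics concrete, the following is a minimal Python sketch of how a missing-value count and a most popular value could be computed for one field. It is not the Profile Snap's implementation: the function names `is_missing` and `profile_field`, the shape of the returned dictionary, and the sample documents (including the Name field and the Retail value) are assumptions made for illustration only.

```python
from collections import Counter


def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def profile_field(documents, field):
    """Count missing values and find the most frequent non-missing value.

    Hypothetical helper; the real Profile Snap output has its own structure.
    """
    present = [doc[field] for doc in documents if not is_missing(doc.get(field))]
    missing_count = len(documents) - len(present)
    popular = Counter(present).most_common(1)[0][0] if present else None
    return {"missing": missing_count, "popular": popular}


# Hypothetical sample documents; one has no Category value.
documents = [
    {"Name": "A", "Category": "Publishing"},
    {"Name": "B", "Category": "Publishing"},
    {"Name": "C", "Category": "Retail"},
    {"Name": "D"},
]

stats = profile_field(documents, "Category")
print(stats)  # → {'missing': 1, 'popular': 'Publishing'}
```

With this sample, one document lacks a Category value and Publishing occurs most often, which mirrors the situation described above.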
The output from the Copy Snap (data) and the output from the Profile Snap (statistics) serve as the input for the Clean Missing Values Snap. This Snap is configured as shown below:
Based on its configuration, the Snap handles missing values in the $Category field using the Impute with Popular rule. The Snap treats absent values, null, and whitespace as missing values.
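The rule described above can be sketched in Python as follows. This is an illustrative approximation rather than the Snap's actual logic: the helper names `is_missing` and `impute_with_popular` and the sample documents are hypothetical, and the popular value is passed in directly instead of being read from a Profile Snap output.

```python
def is_missing(value):
    """Treat an absent value (None), null, and whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def impute_with_popular(documents, field, popular):
    """Return copies of the documents with missing `field` values set to `popular`.

    Hypothetical helper; input documents are left unmodified.
    """
    cleaned = []
    for doc in documents:
        new_doc = dict(doc)  # shallow copy so the input stream is not mutated
        if is_missing(new_doc.get(field)):
            new_doc[field] = popular
        cleaned.append(new_doc)
    return cleaned


# Hypothetical sample covering the three cases: absent key, null, whitespace.
documents = [
    {"Name": "A", "Category": "Retail"},
    {"Name": "B"},
    {"Name": "C", "Category": None},
    {"Name": "D", "Category": "   "},
]

cleaned = impute_with_popular(documents, "Category", "Publishing")
for doc in cleaned:
    print(doc)
```

In this sketch, the document that already has a Category keeps it, and the other three receive Publishing. Copying each document before changing it keeps the original stream intact, which is convenient when comparing input and output.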
The Snap uses the value of the $popular field in the Profile Snap's output to handle missing values in the $Category field. The output of the Snap is as shown below:
You may write the output from the Clean Missing Values Snap into another file using the File Writer Snap.
Download this pipeline.