Deduplicate
Overview
You can use this Snap to remove duplicate records from input documents. When you use multiple matching criteria to deduplicate your data, the Snap evaluates each criterion separately and then aggregates the individual results to produce the final result. The Snap treats fields that contain only empty strings or whitespace as having no data.
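The evaluation flow described above can be sketched roughly as follows. This is only an illustration in Python, not the Snap's implementation. The function names (`is_no_data`, `clean_text`, `digits_only`, `similarity`, `match_confidence`, `is_duplicate`) are hypothetical, `difflib.SequenceMatcher` stands in for a comparator, and the plain average used to aggregate the per-criterion scores is an assumption, since this article does not state how the Snap aggregates.

```python
import re
from difflib import SequenceMatcher


def is_no_data(value):
    """Treat None, empty strings, and whitespace-only strings as no data."""
    return value is None or str(value).strip() == ""


def clean_text(value):
    """Hypothetical cleaner: collapse runs of whitespace and lowercase."""
    return " ".join(str(value).split()).lower()


def digits_only(value):
    """Hypothetical cleaner: strip everything except digits (e.g. ZIP codes)."""
    return re.sub(r"\D", "", str(value))


def similarity(a, b):
    """Stand-in comparator: 0.0 (completely different) to 1.0 (equal)."""
    return SequenceMatcher(None, a, b).ratio()


def match_confidence(doc_a, doc_b, criteria):
    """Score each criterion separately, then aggregate the scores.

    `criteria` is a list of (field, cleaner) pairs. Fields with no data in
    either document are skipped. The aggregation here is a plain average,
    which is an assumption made for this sketch. Returns None when no
    field could be compared.
    """
    scores = []
    for field, cleaner in criteria:
        a, b = doc_a.get(field), doc_b.get(field)
        if is_no_data(a) or is_no_data(b):
            continue
        scores.append(similarity(cleaner(a), cleaner(b)))
    return sum(scores) / len(scores) if scores else None


def is_duplicate(doc_a, doc_b, criteria, threshold=0.8):
    """Two documents match when the confidence reaches the threshold."""
    confidence = match_confidence(doc_a, doc_b, criteria)
    return confidence is not None and confidence >= threshold


criteria = [("name", clean_text), ("zip", digits_only)]
a = {"name": "Acme  Corp", "zip": "02110-1234"}
b = {"name": "acme corp", "zip": " "}
print(is_duplicate(a, b, criteria))  # prints True
```

In this sketch the whitespace-only `zip` value in the second record is skipped, so only the cleaned `name` values are compared. The default `threshold` of 0.8 mirrors the default value of the Threshold setting.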
Prerequisites
None.
Limitations
None.
Known Issues
The Deduplicate Snap fails to deduplicate data when a field in the input document contains an empty string, whitespace, or a null value.
Support for Ultra Pipelines
Does not support Ultra Pipelines.
Snap Views
Type | Format | Number of Views | Examples of Upstream and Downstream Snaps | Description |
---|---|---|---|---|
Input | Document | | | A document with data containing duplicate records. |
Output | Document | | | |
Snap Settings
Parameter Name | Data Type | Description | Default Value | Example |
---|---|---|---|---|
Label | String | Specify a unique name for the Snap. You can modify this to be more specific, especially if you have more than one of the same Snap in your Pipeline. | N/A | Deduplicate Office Names |
Threshold | Decimal | Required. The minimum confidence needed for documents to be considered duplicates under the matching criteria. Minimum value: 0. Maximum value: 1. | 0.8 | 0.95 |
Confidence | Checkbox | Select this check box to include the confidence level of each match in the output. | Deselected | N/A |
Group ID | Checkbox | Select this check box to include the group ID for each record in the output. | Deselected | N/A |
Matching Criteria | Fieldset | Specify the settings that the Snap uses to match input documents: the field to compare, the cleaner to apply, and the comparator to use. | N/A | N/A |
Field | JSONPath | The field in the input dataset that you want to use for matching and identifying duplicates. | N/A | $name |
Cleaner | String | Select the cleaner that you want to use on the selected fields. A cleaner makes comparison easier by removing variations in the data that are unlikely to indicate genuine differences. For example, a cleaner might strip everything except digits from a ZIP code, or it might normalize and lowercase text. Depending on the nature of the data in the identified input fields, select the cleaner that you want to use from the available options. | None | Text |
Comparator | String | A comparator compares two values and produces a similarity indicator, which is represented by a number that can range from 0 (completely different) to 1 (exactly equal). Choose the comparator that you want to use on the selected fields, from the drop-down list:
|