Deduplicate

Overview

You can use this Snap to remove duplicate records from input documents. When you use multiple matching criteria to deduplicate your data, the Snap evaluates each criterion separately and then aggregates the results to produce the final result. The Snap treats fields that contain empty strings or only whitespace as having no data and ignores them.

Prerequisites

None.

Limitations

None.

Known Issues

The Deduplicate Snap fails to deduplicate data when a field in the input document contains an empty string, whitespace, or a null value.

Support for Ultra Pipelines

Does not support Ultra Pipelines.

Snap Views

Input

Format: Document

Number of Views:
  • Min: 1
  • Max: 1

Examples of Upstream Snaps:
  • Mapper
  • Copy
  • Numeric to Categorical

Description: A document with data containing duplicate records.

Output

Format: Document

Number of Views:
  • Min: 1
  • Max: 2

Examples of Downstream Snaps:
  • Filter
  • Shuffle
  • Principal Component Analysis

Description:
  • First output view: Required. A document containing deduplicated records.
  • Second output view: Displays a document containing the duplicate records.

Snap Settings

Each setting below is described by its data type, description, default value, and an example.

Label

Data Type: String

Specify a unique name for the Snap. You can modify this to be more specific, especially if you have more than one of the same Snap in your Pipeline.

Default Value: N/A

Example: Deduplicate Office Names

Threshold

Data Type: Decimal

Required. The minimum confidence required for documents to be considered matched as duplicates using the matching criteria.

Minimum Value: 0

Maximum Value: 1

Default Value: 0.8

Example: 0.95
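
To illustrate how a threshold separates matches from non-matches, here is a minimal Python sketch. The `is_duplicate` helper and the confidence values are hypothetical and invented for illustration; the Snap computes confidence internally from the matching criteria, and treating the comparison as inclusive is an assumption.

```python
def is_duplicate(confidence: float, threshold: float = 0.8) -> bool:
    """Treat a pair as duplicates when its confidence reaches the threshold.

    Assumption: the comparison is inclusive (>=). The source does not say.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    return confidence >= threshold


# Made-up confidence values for two candidate pairs of records.
pairs = {
    ("Acme Corp", "ACME Corp."): 0.93,
    ("Acme Corp", "Apex Ltd"): 0.41,
}

matches = [pair for pair, conf in pairs.items() if is_duplicate(conf)]
print(matches)
```

Raising the threshold toward 1 makes matching stricter, so fewer records are flagged as duplicates; lowering it has the opposite effect.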
Confidence

Data Type: Checkbox

Select this check box to include the confidence level of each match in the output.

Default Value: Deselected

Example: N/A

Group ID

Data Type: Checkbox

Select this check box to include the group ID for each record in the output.

Default Value: Deselected

Example: N/A

Matching Criteria

Data Type: Fieldset

Use this fieldset to specify the settings, such as the field, cleaner, and comparator, that the Snap uses to match input documents.

Default Value: N/A

Example: N/A

Field

Data Type: JSONPath

The field in the input dataset that you want to use for matching and identifying duplicates.

Default Value: N/A

Example: $name

Cleaner

Data Type: String

Select the cleaner that you want to use on the selected fields.

A cleaner makes comparison easier by removing variations in the data that are not likely to indicate genuine differences. For example, a cleaner might strip everything except digits from a ZIP code. Or, it might normalize and lowercase text.

Depending on the nature of the data in the identified input fields, you can select the kind of cleaner you want to use from the options available:

  • None
  • Text
  • Number
  • Date Time

Default Value: None

Example: Text
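
The kind of cleaning described above can be sketched in a few lines of Python. This is only an illustration of the idea; the function names `clean_digits` and `clean_text` are hypothetical, and the exact rules applied by the Snap's Text and Number cleaners are not documented here.

```python
import re


def clean_digits(value: str) -> str:
    """Strip everything except digits, for example from a ZIP code."""
    return re.sub(r"\D", "", value)


def clean_text(value: str) -> str:
    """Lowercase the text, trim it, and collapse runs of whitespace."""
    return " ".join(value.lower().split())


print(clean_digits("ZIP: 94-105"))     # 94105
print(clean_text("  Main   STREET "))  # main street
```

After cleaning, values that differ only in case, spacing, or punctuation compare as equal, so the comparator scores only the differences that matter.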

Comparator

Data Type: String

A comparator compares two values and produces a similarity indicator, which is represented by a number that can range from 0 (completely different) to 1 (exactly equal).

Choose the comparator that you want to use on the selected fields, from the drop-down list:

  • Levenshtein: Calculates the least number of edit operations (additions, deletions, and substitutions) required to change one string into another.
  • Weighted Levenshtein: Calculates the least number of edit operations (additions, deletions, and substitutions) required to change one string into another. Each type of symbol has a different weight: numbers have the highest weight, while punctuation has the lowest weight. This makes "Main Street 12" very different from "Main Street 14", while "Main Street 12" is quite similar to "MainStreet12".
  • Longest Common Substring: Identifies the longest string that is a substring of both strings.
  • Q-Grams: Breaks a string into a set of consecutive symbols; for example, 'abc' is broken into a set containing 'ab' and 'bc'. The comparator then calculates the ratio of the overlapping part of the two sets.
  • Exact: Classifies a comparison as either an exact match or not a match at all. An exact match receives a score equal to the value in High; otherwise, the score equals the value in Low.
  • Soundex: Compares strings by converting them into Soundex codes. These codes begin with the first letter of the name, followed by a three-digit code that represents the first three remaining consonants. The letters A, E, I, O, U, Y, H, and W are not coded. Thus, the names 'Mathew' and 'Matthew' would generate the same Soundex code: M-300. This enables you to quickly identify strings that refer to the same person or place, but have variations in their spelling.
  • Metaphone: Similar to Soundex, but improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding.
  • Numeric: Calculates the ratio of the smaller number to the greater.
  • Date Time: Computes the difference between two date-time values and produces a similarity measure ranging from 0.0 (completely different) to 1.0 (exactly equal). This comparator requires data in epoch format. If the date-time data in your dataset is not in epoch format, you must select Date Time in the Cleaner property to convert it into epoch format.
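
To make some of the comparator descriptions above concrete, the following Python sketch implements textbook versions of four of them. It is not the Snap's implementation. The normalization used to turn an edit distance into a 0 to 1 score, the use of the Jaccard ratio for Q-Grams, and the handling of zero in the numeric ratio are all assumptions, and every function name is hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Least number of additions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def qgrams(s: str, q: int = 2) -> set:
    """Set of consecutive q-symbol substrings, e.g. 'abc' -> {'ab', 'bc'}."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Assumed overlap ratio: shared q-grams divided by all distinct q-grams."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    union = ga | gb
    if not union:  # both strings are shorter than q
        return 1.0 if a == b else 0.0
    return len(ga & gb) / len(union)


def numeric_similarity(x: float, y: float) -> float:
    """Ratio of the smaller number to the greater (non-negative inputs assumed)."""
    if x < 0 or y < 0:
        raise ValueError("this sketch only handles non-negative numbers")
    if x == y:
        return 1.0  # also covers 0 and 0, an assumption
    return min(x, y) / max(x, y)


def soundex(name: str) -> str:
    """Standard four-character Soundex code, written without the hyphen."""
    codes = {}
    groups = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
              ("l", "4"), ("mn", "5"), ("r", "6"))
    for letters, digit in groups:
        for ch in letters:
            codes[ch] = digit
    chars = [c for c in name.lower() if c.isalpha()]
    if not chars:
        return ""
    result = chars[0].upper()
    prev = codes.get(chars[0], "")
    for ch in chars[1:]:
        if ch in "hw":  # h and w do not separate repeated codes
            continue
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        prev = digit  # a vowel resets prev to ""
    return (result + "000")[:4]


print(levenshtein("kitten", "sitting"))         # 3
print(sorted(qgrams("abc")))                    # ['ab', 'bc']
print(numeric_similarity(8, 10))                # 0.8
print(soundex("Mathew"), soundex("Matthew"))    # M300 M300
```

Each function returns a score on the same 0 to 1 scale (or, for `soundex`, a code that can be compared for equality), which is what allows scores from different fields to be weighed against a single threshold.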