Overview

Use this Snap to remove duplicate records from input documents. When you specify multiple matching criteria, the Snap evaluates the records against each criterion separately and then aggregates the individual results to give the final result.

Prerequisites

None.

Limitations

None.

Troubleshooting

None.

Modes

Snap Input and Output

Input
  • Type of view: Document
  • Number of views: Min: 1, Max: 1
  • Compatible upstream Snaps: Mapper, Copy, Numeric to Categorical
  • Description: A document with data containing duplicate records.

Output
  • Type of view: Document
  • Number of views: Min: 1, Max: 2
  • Compatible downstream Snaps: Filter, Shuffle, Principal Component Analysis
  • Description, first output view: Required. A document containing deduplicated records.
  • Description, second output view: Displays a document containing the duplicate records.

Snap Settings

Label
  • Data type: String
  • Default value: N/A
  • Example: Deduplicate Office Names

Threshold
  • Data type: Decimal
  • Description: Required. The minimum confidence required for documents to be considered duplicates under the matching criteria. Minimum value: 0. Maximum value: 1.
  • Default value: 0.8
  • Example: 0.95

Confidence
  • Data type: Check box
  • Description: Select this check box to include each match's confidence level in the output.
  • Default value: Deselected
  • Example: N/A

Group ID
  • Data type: Check box
  • Description: Select this check box to include the group ID for each record in the output.
  • Default value: Deselected
  • Example: N/A

Matching Criteria
  • Data type: Fieldset
  • Description: Enables you to specify the settings that the Snap uses to match input documents.
  • Default value: N/A
  • Example: N/A

Field
  • Data type: JSONPath
  • Description: The field in the input dataset that you want to use for matching and identifying duplicates.
  • Default value: N/A
  • Example: $name

Cleaner
  • Data type: String
  • Default value: None
  • Example: Text

Comparator
  • Data type: String
  • Default value: Levenshtein
  • Example: Numeric

Low
  • Data type: Decimal
  • Description: The probability that the input documents match when the values of the specified field are completely different. If this value is left empty, a value of 0.3 is applied automatically.
  • Default value: N/A
  • Example: 0.1

High
  • Data type: Decimal
  • Description: The probability that the input documents match when the values of the specified field match completely. If this value is left empty, a value of 0.95 is applied automatically.
  • Default value: N/A
  • Example: 0.8

Snap Execution
  • Data type: String
  • Description: Specifies the execution type:
      • Validate & Execute: Performs limited execution of the Snap (up to 50 records) during Pipeline validation; performs full execution of the Snap (unlimited records) during Pipeline execution.
      • Execute only: Performs full execution of the Snap during Pipeline execution; does not execute the Snap during Pipeline validation.
      • Disabled: Disables the Snap and, by extension, its downstream Snaps.
  • Default value: Validate & Execute
  • Example: N/A
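
To illustrate how the Threshold, Low, and High settings could interact, the following Python sketch scores a pair of records. It is not the Snap's implementation: this page does not describe how per-criterion results are aggregated, so the linear interpolation between Low and High and the Bayesian combination used here are assumptions, and the function names (levenshtein, similarity, field_probability, combine, match_confidence) are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the edit distance to 0.0 (completely different) .. 1.0 (identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def field_probability(a: str, b: str, low: float, high: float) -> float:
    """ASSUMPTION: interpolate linearly between Low and High by similarity."""
    return low + (high - low) * similarity(a, b)


def combine(p1: float, p2: float) -> float:
    """ASSUMPTION: naive Bayesian combination of two probabilities."""
    denominator = p1 * p2 + (1 - p1) * (1 - p2)
    if denominator == 0:
        return 0.5  # contradictory certainties; fall back to "unknown"
    return (p1 * p2) / denominator


def match_confidence(rec_a: dict, rec_b: dict, criteria: list) -> float:
    """Evaluate each criterion separately, then aggregate the results."""
    confidence = 0.5  # neutral starting point
    for field, low, high in criteria:
        p = field_probability(rec_a[field], rec_b[field], low, high)
        confidence = combine(confidence, p)
    return confidence


# Hypothetical criteria: (field, Low, High)
criteria = [("name", 0.1, 0.8), ("phone", 0.3, 0.95)]
threshold = 0.8

a = {"name": "Little Stars", "phone": "555-0100"}
b = {"name": "Little Stars", "phone": "555-0100"}
c = {"name": "Bright Minds", "phone": "999-9999"}

same = match_confidence(a, b, criteria) >= threshold       # treated as duplicates
different = match_confidence(a, c, criteria) >= threshold  # treated as distinct
```

Under these assumptions, a pair is reported as a duplicate only when the aggregated confidence reaches the Threshold, so raising the Threshold makes the matching stricter.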

Examples

Deduplicating the List of Childhood Centers in Chicago

In this example, you deduplicate the data in a CSV file containing a list of childhood centers in Chicago.

  1. You add a File Reader Snap to the Pipeline and configure it to read the source CSV file stored online:


    The File Reader Snap displays the contents of the file, which contains many duplicate entries:


  2. You add a CSV Parser Snap to the Pipeline to interpret the input data as a CSV document.
  3. You add a Deduplicate Snap to the Pipeline and configure it to use the name, address, ZIP, and phone details in the input document as fields for deduplication:


  4. You add a second output view to the Snap for the duplicate data. The Snap now has two output views: one for the cleaned (deduplicated) data, and another for the duplicate records that the Snap filtered out.


    When executed, the Snap produces two output documents (Output0 and Output1). Output0 contains the deduplicated data, while Output1 contains the duplicate data:


  5. You attach a CSV Formatter Snap to each output view of the Deduplicate Snap to structure the outputs as CSV documents. You then connect a File Writer Snap to each CSV Formatter Snap to write each output to a file.

  6. The Pipeline, when run, generates two output documents: one containing deduplicated data, and the other containing the duplicate data:
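
For readers who want to see the shape of this data flow outside the Designer, the steps above can be roughly sketched in Python. This is only an analogy, not what the Pipeline runs: it swaps in exact matching on normalized values for the Snap's confidence-based matching, and the column names (name, address, zip, phone), the sample rows, and the helper names are all hypothetical.

```python
import csv
import io

# Hypothetical column names standing in for the fields used in step 3.
FIELDS = ["name", "address", "zip", "phone"]

# Made-up sample data; the real example reads a CSV file stored online.
SAMPLE = """name,address,zip,phone
Little Stars,100 Main St,60601,555-0100
little  stars,100 main st,60601,555-0100
Bright Minds,200 Oak Ave,60602,555-0200
"""


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(value.lower().split())


def split_duplicates(rows, fields=FIELDS):
    """Return (unique, duplicates), mirroring the two output views."""
    seen = set()
    unique, duplicates = [], []
    for row in rows:
        key = tuple(normalize(row[f]) for f in fields)
        if key in seen:
            duplicates.append(row)  # second output view (Output1)
        else:
            seen.add(key)
            unique.append(row)      # first output view (Output0)
    return unique, duplicates


def to_csv(rows, fields=FIELDS) -> str:
    """Format rows as CSV text, as a CSV Formatter step would."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Read and parse the source (steps 1 and 2), deduplicate (steps 3 and 4),
# then format each output (step 5).
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
unique, duplicates = split_duplicates(rows)
output0 = to_csv(unique)
output1 = to_csv(duplicates)
```

Writing output0 and output1 to separate files would complete the analogy with the two File Writer Snaps.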
     

Download this Pipeline.

Downloads
