Match

Overview

This Snap performs record linkage: it automatically identifies documents from different data sources (input views) that may represent the same entity, even when the datasets do not share a common key field.

The Match Snap is part of our ML Data Preparation Snap Pack.

This Snap uses Duke, which is a library for performing record linkage and deduplication, implemented on top of Apache Lucene.

Input and Output

Expected Input

  • First Input: The first dataset that must be matched with the second dataset.
  • Second Input: The second dataset that must be matched with the first dataset.

Expected Output

  • First Output: The matched documents and, optionally, the confidence level associated with the matching.
  • Second Output: Optional. Unmatched documents from the first dataset.
  • Third Output: Optional. Unmatched documents from the second dataset.
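
The way documents are routed to the three outputs can be pictured with a small Python sketch. This is only an illustration, not the Snap's implementation: the `similarity` function uses Python's `difflib` as a stand-in for the Duke comparators, and the function name `match_documents`, the field names, and the shape of the matched document are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the Snap itself relies on Duke comparators."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_documents(first, second, left_field, right_field, threshold=0.8):
    """Split two datasets into matched pairs and the unmatched rest of each side."""
    matched, unmatched_first, used_second = [], [], set()
    for doc1 in first:
        best_idx, best_score = None, 0.0
        # Find the most similar document in the second dataset.
        for idx, doc2 in enumerate(second):
            score = similarity(doc1[left_field], doc2[right_field])
            if score > best_score:
                best_idx, best_score = idx, score
        if best_idx is not None and best_score >= threshold:
            # Hypothetical output shape: the pair plus its confidence.
            matched.append({"first": doc1, "second": second[best_idx],
                            "confidence": best_score})
            used_second.add(best_idx)
        else:
            unmatched_first.append(doc1)
    unmatched_second = [d for i, d in enumerate(second) if i not in used_second]
    return matched, unmatched_first, unmatched_second


first_docs = [{"name": "Jon Smith"}, {"name": "Alice Wong"}]
second_docs = [{"full_name": "John Smith"}, {"full_name": "Bob Lee"}]
matched, only_first, only_second = match_documents(
    first_docs, second_docs, "name", "full_name")
```

In this sketch, "Jon Smith" and "John Smith" end up in the matched list, while "Alice Wong" and "Bob Lee" go to the second and third lists respectively.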

Expected Upstream Snaps

  • First Input: A Snap that offers documents. For example, Mapper, MySQL - Select, and JSON Parser.
  • Second Input: A Snap that offers documents. For example, Mapper, MySQL - Select, and JSON Parser.

Expected Downstream Snaps

  • Snaps that accept documents. For example, Mapper, JSON Formatter, and CSV Formatter.

Prerequisites

None.

Configuring Accounts

Accounts are not used with this Snap.

Configuring Views

Input

This Snap has exactly two document input views.

Output

This Snap has at most three document output views.

Error

This Snap has at most one document error view.

Troubleshooting

None.

Limitations and Known Issues

None.

Modes


Snap Settings


Label

Required. The name for the Snap. Modify this to be more specific, especially if the pipeline contains more than one instance of the same Snap.

Threshold

Required. The minimum confidence required for documents to be considered matched.

Minimum Value: 0

Maximum Value: 1

Default Value: 0.8

Confidence

Select this check box to include the confidence level of each match in the output.

Default Value: Deselected

Match all

Select this check box to match one record from the first input with multiple records in the second input. Otherwise, the Snap matches the first record of the second input with the first record of the first input.

Default Value: Deselected

Matching Criteria

Enables you to specify the settings that you want to use to perform the matching between the two input datasets.

Left Field

The field in the first dataset that you want to use for matching. This property is a JSONPath.

Example: $name

Default Value: [None]

Right Field

The field in the second dataset that you want to use for matching. This property is a JSONPath.

Example: $country

Default Value: [None]

Cleaner

Select the cleaner that you want to use on the selected fields. 

A cleaner makes comparison easier by removing variations from data, which are not likely to indicate genuine differences. For example, a cleaner might strip everything except digits from a ZIP code. Or, it might normalize and lowercase text.

Depending on the nature of the data in the identified input fields, you can select the kind of cleaner you want to use from the options available:

  • None
  • Text
  • Number
  • Date Time

Default Value: None
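
To make the idea of a cleaner concrete, here is a minimal Python sketch of the kinds of normalization described above. The function names, the accent-stripping step, the date format, and the use of epoch seconds are assumptions made for illustration; they are not the Snap's actual cleaners.

```python
import re
import unicodedata
from datetime import datetime, timezone


def clean_text(value: str) -> str:
    """Strip accents, lowercase, and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def clean_digits(value: str) -> str:
    """Keep only digits, for example when normalizing a ZIP code."""
    return re.sub(r"\D", "", value)


def clean_datetime(value: str, fmt: str = "%Y-%m-%d %H:%M:%S") -> int:
    """Parse a timestamp as UTC and return epoch seconds (assumed unit)."""
    parsed = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


print(clean_text("  José   GARCÍA "))         # jose garcia
print(clean_digits("ZIP: 94-043"))            # 94043
print(clean_datetime("1970-01-02 00:00:00"))  # 86400
```

After cleaning, values such as "José GARCÍA" and "jose garcia" compare as equal, so the comparator only has to deal with differences that are likely to be genuine.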

Comparator

A comparator compares two values and produces a similarity indicator, which is represented by a number that can range from 0 (completely different) to 1 (exactly equal).

Choose the comparator that you want to use on the selected fields, from the drop-down list:

  • Levenshtein: Calculates the least number of edit operations (additions, deletions, and substitutions) required to change one string into another.
  • Weighted Levenshtein: Calculates the least number of edit operations (additions, deletions, and substitutions) required to change one string into another. Each type of symbol has a different weight: number has the highest weight, while punctuation has the lowest weight. This makes "Main Street 12" very different from "Main Street 14", while "Main Street 12" is quite similar to "MainStreet12".
  • Longest Common Substring: Identifies the longest string that is a substring of both strings.
  • Q-Grams: Breaks each string into a set of short sequences of consecutive symbols; for example, 'abc' is broken into a set containing 'ab' and 'bc'. The similarity is then calculated as the ratio of the overlap between the two sets.
  • Exact: Classifies a pair of values as either an exact match or not a match at all. An exact match is assigned a score equal to the value in High; otherwise, the score equals the value in Low.
  • Soundex: Compares strings by converting them into Soundex codes. These codes begin with the first letter of the name, followed by a three-digit code that represents the first three remaining consonants. The letters A, E, I, O, U, Y, H, and W are not coded. Thus, the names 'Mathew' and 'Matthew' would generate the same Soundex code: M-300. This enables you to quickly identify strings that refer to the same person or place, but have variations in their spelling.
  • Metaphone: Similar to Soundex, but improves on the Soundex algorithm by using information about variations and inconsistencies in English spelling and pronunciation to produce a more accurate encoding.
  • Numeric: Calculates the ratio of the smaller number to the greater.
  • Date Time: Computes the difference between two date-time values and produces a similarity measure ranging from 0.0 (completely different) to 1.0 (exactly equal). This comparator requires data in epoch format. If the date-time data in your dataset is not in epoch format, you must select Date Time in the Cleaner property to convert the data into epoch format.
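
The Levenshtein, Q-Grams, and Numeric comparators listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the Snap's code: dividing the edit distance by the length of the longer string and using a Jaccard ratio for the q-gram overlap are common choices, and the Snap's exact normalization may differ. All function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Least number of additions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # addition
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1] using the longer length (an assumption)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def qgrams(s: str, q: int = 2) -> set:
    """Set of all runs of q consecutive symbols in s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Overlap ratio of the two q-gram sets (Jaccard, an assumption)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0 if a == b else 0.0
    return len(ga & gb) / len(ga | gb)


def numeric_similarity(x: float, y: float) -> float:
    """Ratio of the smaller number to the greater; 0.0 if signs differ."""
    if x == y:
        return 1.0
    if x == 0 or y == 0 or (x < 0) != (y < 0):
        return 0.0
    small, large = sorted((abs(x), abs(y)))
    return small / large


print(levenshtein("kitten", "sitting"))  # 3
print(sorted(qgrams("abc")))             # ['ab', 'bc']
print(numeric_similarity(50, 100))       # 0.5
```

Each function returns a value between 0 (completely different) and 1 (exactly equal), which is the contract that every comparator in the list follows.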

Default Value: Levenshtein
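
The Soundex encoding described in the comparator list can be sketched as follows. This is a plain implementation of the standard American Soundex rules, written for illustration; it emits codes without the hyphen (M300 rather than M-300), and the way the Snap turns two codes into a similarity score is not shown here. The name `soundex` is hypothetical.

```python
# Digit assigned to each coded consonant; vowels, Y, H, and W are not coded.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the four-character Soundex code of a name, or '' if it has no letters."""
    letters = [ch for ch in name.lower() if ch.isascii() and ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # H and W do not separate letters that share a code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    # Pad with zeros and truncate to one letter plus three digits.
    return (first.upper() + "".join(digits) + "000")[:4]


print(soundex("Mathew"), soundex("Matthew"))  # M300 M300
print(soundex("Robert"), soundex("Rupert"))   # R163 R163
```

Because 'Mathew' and 'Matthew' produce the same code, a Soundex-based comparison treats them as the same name despite the spelling difference.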

Low

Enter a decimal value representing the probability that the records match when the specified fields are completely different.

Example: 0.1

Default Value: [None]

If this value is left empty, a value of 0.3 is applied automatically.

High

Enter a decimal value representing the probability that the records match when the specified fields match exactly.

Example: 0.8

Default Value: [None]

If this value is left empty, a value of 0.95 is applied automatically.
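
One way to picture how Low, High, and Threshold could interact is the following Python sketch. It is modeled on the general approach of the Duke library, where each field's similarity is mapped to a probability between Low and High and the per-field probabilities are combined with Bayes' rule. The exact formulas the Snap uses are not stated on this page, so the mapping and the function names below are assumptions.

```python
def field_probability(similarity: float, low: float = 0.3, high: float = 0.95) -> float:
    """Map a comparator similarity in [0, 1] to a match probability (assumed mapping)."""
    if similarity >= 0.5:
        # Scale from 0.5 up to High as the similarity approaches 1.
        return (high - 0.5) * similarity ** 2 + 0.5
    return low


def combine(probabilities) -> float:
    """Combine per-field probabilities with Bayes' rule, starting from a 0.5 prior."""
    result = 0.5
    for p in probabilities:
        denominator = result * p + (1 - result) * (1 - p)
        if denominator == 0:
            return 0.5  # contradictory certainties; fall back to "unknown"
        result = (result * p) / denominator
    return result


threshold = 0.8  # default Threshold value
# Two fields: one identical, one completely different.
confidence = combine([field_probability(1.0), field_probability(0.2)])
print(round(confidence, 4), confidence >= threshold)
```

Under these assumptions, a field that matches exactly raises the overall confidence toward High, a field that differs pulls it toward Low, and the pair counts as matched only if the combined confidence reaches the Threshold.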

Snap execution