To use the Match Snap, you send two datasets into the Snap and list the dataset fields that you want to use to match the records. Once you have the matched data, you need a way to identify the lowest threshold that still gives you the maximum number of reliable matches. To do so, you perform the following tasks:
- Send data from two datasets into the Match Snap and configure the Match Snap to match countries based on their names and capitals.
- Sort the matched country records based on their confidence levels and write them out to a file. This file helps you determine whether to increase or decrease the threshold to adjust the number of matches.
- Write the matched country records without their confidence information to a file. This is the output of the Pipeline.
- Write the unmatched records from the source datasets into separate files, so you can see how many remain unmatched. If you do not need the unmatched records, remove the second and third output views of the Match Snap.
You create the Pipeline as shown below:
Receiving input datasets
You use a CSV Generator Snap to create the first input dataset. This dataset contains URLs to specific pages that contain a country's details, followed by the name of the country, its capital, and its area.
Once you validate the Pipeline, you can view a preview of the output of this Snap:
You use another CSV Generator Snap to create the second input dataset. Each row of this dataset contains a country ID, followed by a country's name, its capital, and area:
Once you validate the Pipeline, you can view the output:
Matching records in dataset fields
You now connect a Match Snap to the two CSV Generator Snaps and configure it to match countries based on their names and capitals, both of which are represented as text:
The Match Snap can offer up to three outputs:
- First Output: The list of matched records.
- Second Output: The list of unmatched records from the first input view.
- Third Output: The list of unmatched records from the second input view.
In this example, you enable all of them.
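The three output views can be pictured with a minimal Python sketch. This is not the Match Snap's algorithm: it uses `difflib.SequenceMatcher` as a stand-in similarity measure, and the function names, the field names (`name`, `capital`), the sample rows, and the "score greater than or equal to threshold" rule are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: dict, b: dict) -> float:
    """Average string similarity of the name and capital fields (stand-in metric)."""
    name = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    capital = SequenceMatcher(None, a["capital"].lower(), b["capital"].lower()).ratio()
    return (name + capital) / 2


def match(first: list, second: list, threshold: float = 0.8):
    """Return (matched, unmatched_first, unmatched_second), mirroring the three views."""
    matched, unmatched_first, used = [], [], set()
    for rec in first:
        # Find the best-scoring candidate in the second dataset that is still free.
        best_idx, best_score = None, 0.0
        for i, cand in enumerate(second):
            if i in used:
                continue
            score = similarity(rec, cand)
            if score > best_score:
                best_idx, best_score = i, score
        if best_idx is not None and best_score >= threshold:
            used.add(best_idx)
            matched.append(
                {"first": rec, "second": second[best_idx], "confidence": round(best_score, 3)}
            )
        else:
            unmatched_first.append(rec)
    unmatched_second = [c for i, c in enumerate(second) if i not in used]
    return matched, unmatched_first, unmatched_second


# Hypothetical sample rows, not the datasets used in the Pipeline.
first = [
    {"name": "Brazil", "capital": "Brasilia"},
    {"name": "Colombia", "capital": "Bogota"},
    {"name": "Japan", "capital": "Tokyo"},
]
second = [
    {"name": "Brazil", "capital": "Brasília"},
    {"name": "Colombia", "capital": "Bogotá"},
    {"name": "Kenya", "capital": "Nairobi"},
]

matched, unmatched_first, unmatched_second = match(first, second)
```

With these sample rows, the two countries that appear in both lists land in the first result, while Japan and Kenya fall through to the second and third results respectively.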
Once you validate the Pipeline, you can view the matched entries in the first output view:
Note that this Snap supports strings that contain diacritics (é, è, â, ñ, and so on), and can match the two versions of 'Brasilia' and 'Bogota' (highlighted in the screenshot below).
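How the Snap treats diacritics internally is not described here. As an illustration only, one common way to make accented and unaccented spellings comparable is Unicode decomposition followed by removal of combining marks. The helper name `fold_diacritics` is hypothetical.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Strip combining marks so that 'Brasília' and 'Brasilia' compare equal."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


same_capital = fold_diacritics("Brasília") == fold_diacritics("Brasilia")
```

Folding is lossy, so a sketch like this suits comparison keys rather than the values you write to the output.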
The second output view shows the list of unmatched records from the first input view:
The third output view shows the list of unmatched records from the second input view:
You can now review the matched records to check whether they represent the same entities.
In this example, the Match Snap produces three outputs:
- The list of matched countries
- The list of countries in Country Dataset 1 that could not be matched
- The list of countries in Country Dataset 2 that could not be matched
In the second half of this Pipeline, you review the matched records and adjust the value in the Threshold field of the Match Snap until you arrive at the lowest threshold value that gives you the maximum number of correct matches. If the output contains wrong matches, increase the threshold. If the output looks correct, but you want to see more matches with lower confidence, try lowering the threshold.
To support this review, you use the Copy Snap to create two copies of the document containing the matched countries. In one copy, you retain the confidence levels, so you can sort the results by confidence. In the other, you remove the confidence levels, so you retain only the data you need. This gives you the following two documents:
- A document containing the matched records and confidence-level data
- A document containing only the matched fields, which is the main output of this pipeline
Generating a document containing matched fields and confidence-level data
From one copy of the list of matched countries, you use the Mapper Snap to create a document containing the matched countries and their confidence levels:
You now sort them in ascending order of confidence, so that the countries with the lowest confidence levels appear at the top of the list:
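The effect of this sort can be sketched in Python. The records and confidence values below are made-up placeholders, not output from the Pipeline, and the field names are assumptions.

```python
# Placeholder matched records; the confidence values are illustrative only.
matched = [
    {"name": "Brazil", "capital": "Brasilia", "confidence": 0.94},
    {"name": "Japan", "capital": "Tokyo", "confidence": 1.0},
    {"name": "Colombia", "capital": "Bogota", "confidence": 0.92},
]

# Ascending sort puts the least certain matches first, where review should start.
by_confidence = sorted(matched, key=lambda record: record["confidence"])
```

Reading from the top of the sorted list shows the weakest matches first, which are the ones most likely to be wrong.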
You use a File Writer Snap to write the sorted data into a CSV file:
Generating a document containing only matched records
From the other copy, you use the Mapper Snap to create a document containing only the matched countries:
You use a File Writer Snap to write the data into a CSV file:
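As a rough Python analogue of this branch, the sketch below drops the confidence field and serializes the remaining fields as CSV. The helper names, field names, and sample rows are hypothetical, and it writes to an in-memory buffer rather than to a file.

```python
import csv
import io


def strip_confidence(records: list) -> list:
    """Keep every field except 'confidence', as the Mapper step does in this branch."""
    return [{k: v for k, v in r.items() if k != "confidence"} for r in records]


def to_csv(records: list) -> str:
    """Serialize records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Placeholder rows; the confidence values are illustrative only.
matched = [
    {"name": "Brazil", "capital": "Brasilia", "confidence": 0.94},
    {"name": "Colombia", "capital": "Bogota", "confidence": 0.92},
]
csv_text = to_csv(strip_confidence(matched))
```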
Modifying matching threshold values to improve the result
You now lower the threshold step by step until you find the value that offers the largest number of reliable matches. To do so, you execute the Pipeline several times, using a lower value in the Threshold field of the Match Snap each time, until you reach a value below which the matched output is no longer reliable. For example, if you lower the threshold value from the default 0.8 to 0.5, you find an additional row of data that is inaccurately reported as matched:
You now know that, for this data, the value in the Threshold field must be above 0.51 to stay reliable while still offering the largest number of correct matches.
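The manual iteration can be summarized as a small search. The sketch below assumes that you have reviewed each candidate pair by hand and recorded whether it is a correct match, and that a pair is accepted when its confidence is greater than or equal to the threshold. The function name, the candidate thresholds, and all confidence values are hypothetical; the 0.51 entry assumes that the inaccurate row seen at threshold 0.5 carries that confidence.

```python
from typing import Optional


def lowest_reliable_threshold(scored_pairs: list, candidates: list) -> Optional[float]:
    """Return the smallest candidate threshold that admits no incorrect pair."""
    reliable = [
        t for t in candidates
        # A threshold is reliable if every pair it accepts was judged correct.
        if all(is_correct for confidence, is_correct in scored_pairs if confidence >= t)
    ]
    return min(reliable) if reliable else None


# (confidence, judged correct on manual review); illustrative values only.
scored_pairs = [(0.94, True), (0.92, True), (0.85, True), (0.51, False)]
candidates = [0.8, 0.7, 0.6, 0.5]

best = lowest_reliable_threshold(scored_pairs, candidates)
```

With these placeholder values the search settles on 0.6, the lowest candidate that still excludes the pair judged incorrect.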
Download this Pipeline.