Join

On this Page

Overview

Snap type

Transform

Description

This Snap joins two or more data streams. It supports inner, left outer, and outer joins. If input data streams are sorted (ascending or descending), it is a streaming Snap at highly optimized performance. If the data streams are not sorted, you may use a Sort Snap in front of the Join Snap or select UNSORTED for the Sorted streams property. Please note that all documents in the same input view must have the same set of fields, otherwise, the naming of the fields in the output documents may appear to be inaccurate.

  • Expected upstream Snaps. Sort, Mapper, CSV Parser, JSON Parser, XML Parser, or any Snap with a document output view.
  • Expected downstream Snaps. CSV Formatter, JSON Formatter, XML Formatter, or any Snap with a document input view.
  • Expected input. All documents in the same stream should have the same set of fields regardless if values are null or not.
  • Expected output. Data joined from input document streams. Field names in the left input data are passed to the output data 'as is. For all field names in the right input document streams, if a field name conflicts with a field name in the left input data, it will be prefixed with its input view name. If there is no conflict, the field names in the right input documents are used in the output data without any modification.
Prerequisites

All documents in the same stream should have the same set of fields.

Known Issue, support and limitations

Known Issue: When the upstream Snaps of the Join Snap contains Copy, Router, Aggregate, or similar Snaps, it is likely that the data flow of a branch in a Pipeline gets blocked until another branch completes streaming the document. The Join Snap might hang if its upstream Snaps in a Pipeline has a blocked branch.

Workaround: Set Sorted streams to Unsorted in the Join Snap to effectively buffer all documents in all input views internally—this unblocks the document flow of all the upstream branches. The internal sorters sort the input documents from the input views into the local temporary stage.

Limited support in Ultra Task Pipelines:

Works in Ultra only if one of the input views on the Join Snap is connected to the unlinked input view. All other input views to the Join Snap must reach the end of the input document.

Examples
Allowed: If one of the views on the Join Snaps is fed by an upstream FileReader.

Not Allowed: If you make a copy of the unlinked input stream and connect both of those output views to a Join Snap.

Account

Accounts are not used with this Snap.

Views
Input

This Snap has two or more document input views.

The input data schema in the upstream Snaps of Join Snap must be consistent for each input view to produce the expected joined output data. Else, the Snap might output unexpected joined data. See examples below for more information.

Workaround:

You can insert a Mapper Snap to add missing fields with null values to fix the inconsistent input schema.

OutputThis Snap has exactly one document output view.
ErrorThis Snap has at most one document error view and produces zero or more documents in the view. The error view behavior of this Snap is different from other Snaps. Error view documents do not contain error, reason, resolution, and stack trace. The left input documents are passed to the error view without any modification if there are no matches in the inner join operation and the Unmatched data to error view property is selected. When the error view is not open (that is, Stop pipeline execution is selected for the error view in the View pane), the input document written to the error view will not stop the pipeline and the pipeline execution will continue.

 

Settings

Label

Required. The name for the Snap. You can modify this to be more specific, especially if you have more than one of the same Snap in your pipeline.

Join Type


 

Required. The type of join to execute. The options available include:

  • Inner
  • Left outer
  • Outer
  • Merge

Default value: Inner

If you select Merge, the documents from the input views are merged into one document. You do not have to specify any other join properties when merging documents.


The rows on the left table are merged with the rows on the right table in the merge operation. If the right table has a fewer number of rows than the left table, null is added in the output document for the remaining rows. 

Join paths


JSON paths to use for left and right sides of the join. 

Each row in the table defines a relationship between the left-field and one of the right fields.
If there are N input views, N-1 rows are required to define each join path relationship. So, M*(N-1) rows are required to define all the join path relationships if there are M relationships. For example, if there are 4 input views and 3 join paths, 9 rows ((4-1) x 3) are required to define all the join path relationships. 

To use a partial set of join path relationships, use multiple Join Snaps.

Default value: [None]

Left path

Required. The JSON path to a value in a document of the first input view. One of the suggested field names should be selected. This property does not support expressions.
Example: $customer_id
Default value: [None]

Right input view

Required. Right input view name which is the second or another next input view.
You may use Suggest to select the right input view names.
Example: input1
Default value: [None] 

Right path

Required. The JSON path to a value in a document of the second or another next input view. One of the suggested field names should be selected. This property does not support expressions.
Example: $customer_id
Default value: [None]

Sorted streams

Required. How the data is sorted. Options available are Ascending, Descending, or Unsorted. If an Unsorted data stream is selected, the Snap sorts input data streams before it starts the join operation. 

Default value: Ascending

Null greater


If selected, null values are considered greater than non-null values. In conjunction with Sort streams:
  • If selected and Sort streams are ascending, nulls appear at the end of the list.
  • If selected and Sort streams are descending, nulls appear at the beginning of the list.
  • If not selected and Sort streams are ascending, nulls appear at the beginning of the list.
  • If not selected and Sort streams are descending, nulls appear at the end of the list.

Default value: Not selected

Unmatched data to error view


If selected, unmatched left input documents are passed to the error view only if the Join type is 'Inner'.

Default value: Not selected (false)

Null-safe access


If selected, the Snap will ignore missing data when accessing the join path. For example, a join path is '$id', but the 'id' key does not exist in the input data. In this case, the Snap will assume its value is null and continue. If unselected, the Snap will write an error to the error view for missing data and stop the execution.

Default value: Not selected (false)

Available Memory Threshold (%)

The Snap keeps all the Right input view documents with the same join-path values in memory until the join operation is done for the specific join-path values. When the Right input view has more than 10,000 input documents with the same join-path values, the Snap checks if the available memory is less than the threshold value mentioned in this property. If so, it starts to store input data into local temporary files to prevent the node from out of memory.

  • The Snap may fail if there isn't sufficient free local disk space in the node.
  • The Snap instances that existed before this property was first introduced, execute with a default value of 20% until its property value is updated.
  • When you select Unsorted in the Sorted stream field, the Snap sorts, and stages all input data into temporary disk files. The following capabilities in this Snap enable you to monitor the status of system memory and disk space:

    • Pipeline Execution Statistics displays a status message for Free disk space, Available memory and Average document size.

    • The Snap reduces the internal sort buffer size to a minimum of 10MB when the available memory in the node becomes lower than 500MB to help avoid the out-of-memory crash.

    • The Snap restores the internal sort buffer size to its original size when the available memory becomes larger than 2GB.

Minimum memory (MB)



If the available memory is less than this property value while processing input documents, the Snap stops to fetch the next input document until more memory is available. This feature is disabled if this property value is 0.

Default value: 500
Example750

Minimum free disk space (MB)


If the free disk space is less than this property value, the Snap stops processing input documents until more free disc space is available. This feature is disabled if this property value is 0.

Default value: 500
Example750

Out-of-resource timeout (minutes)


If the Snap pauses longer than this property value while waiting for more memory available, it throws an exception to prevent the system from running out of memory or disk space.

Default value: 30
Example20

Snap Execution





Select one of the three modes in which the Snap executes. Available options are:

  • Validate & Execute: Performs limited execution of the Snap, and generates a data preview during Pipeline validation. Subsequently, performs full execution of the Snap (unlimited records) during Pipeline runtime.

  • Execute only: Performs full execution of the Snap during Pipeline execution without generating preview data.

  • Disabled: Disables the Snap and all Snaps that are downstream from it.

Default Value: Execute only
Example:
Validate & Execute

Temporary Files

During execution, data processing on Snaplex nodes occurs principally in-memory as streaming and is unencrypted. When larger datasets are processed that exceeds the available compute memory, the Snap writes Pipeline data to local storage as unencrypted to optimize the performance. These temporary files are deleted when the Snap/Pipeline execution completes. You can configure the temporary data's location in the Global properties table of the Snaplex's node properties, which can also help avoid Pipeline errors due to the unavailability of space. For more information, see Temporary Folder in Configuration Options

Examples

Providing Consistent Input Schema to Get Correct Joined Output

This example Pipeline demonstrates how you can get expected output joined data from two inputs by providing a consistent input schema. We use the Join Snap to accomplish this task.

First, we provide input documents with consistent input schema using JSON Generator Snaps as shown below. 

Left Input SchemaRight Input Schema

Next, we connect the Join Snap to the upstream Snaps and configure it as shown below. 

Upon validation, the Snap displays the following joined output as a result of providing a consistent input schema. The key name of the right view is the same as in the left view; hence, the Join Snap prefixes it with the right view label in the output data, right_id, right_field1 and right_field2.

Download this Pipeline.

Inconsistent Joined Output Data as a Result of Inconsistent Input Schema

This example Pipeline demonstrates how the Join Snap generates inconsistent output joined data by providing inconsistent input schema in your inputs.

First, we provide input documents with inconsistent input schema using JSON Generator Snaps. 
The complete key set of input documents is {“id”, “field1”, “field2”}. Note that field2 entry is missing in the first left input document, the field1 entry is missing in the second left input document, and so on. The missing entries with null values cause unexpected results in the joined output data.

Left Input SchemaRight Input Schema