ELT Join
An account for the Snap
You must define an account for this Snap to communicate with your target CDW. Click the account specific to your target CDW below for more information:
Overview
Use this Snap to add a JOIN clause to join tables in separate queries coming from the upstream Snaps. This Snap also allows you to preview the result of the output query. You can validate the modified query using this preview functionality.
ELT Join Snap requires an account configuration
Starting from 4.24 GA, you must configure an account for this Snap.
Prerequisites
None.
Limitation
The ELT Snap Pack does not support the Legacy SQL dialect of Google BigQuery. We recommend that you use only BigQuery's Standard SQL dialect in this Snap.
Known Issues
In any of the supported target databases, this Snap does not appropriately identify or render column references beginning with an _ (underscore) inside SQL queries/statements that use the following constructs and contexts (the Snap works as expected in all other scenarios):
- WHERE clause (ELT Filter Snap)
- WHEN clause
- ON condition (ELT Join and ELT Merge Into Snaps)
- HAVING clause
- QUALIFY clause
- Insert expressions (column names and values in ELT Insert Select, ELT Load, and ELT Merge Into Snaps)
- Update expressions list (column names and values in ELT Merge Into Snap)
- Secondary AND condition
- Inside the SQL query editor (ELT Select and ELT Execute Snaps)
Workaround
As a workaround while using these SQL query constructs, you can:
- Precede this Snap with an ELT Transform Snap to re-map the '_' column references to suitable column names (that do not begin with an _ ) and reference the new column names in the next Snap, as needed.
- In the case of the Databricks Lakehouse Platform, where CSV files do not have a header (column names), a simple query like

  SELECT * FROM CSV.`/mnt/csv1.csv`

  returns default names such as _c0, _c1, _c2 for the columns, which this Snap cannot interpret. To avoid this scenario, you can:
  - Write the data in the CSV file to a DLP table beforehand, as in:

    CREATE TABLE csvdatatable (a1 int, b1 int,…) USING CSV `/mnt/csv1.csv`

    where a1, b1, and so on are the new column names.
  - Then, read the data from this new table (with column names a1, b1, and so on) using a simple SELECT statement.
- In the case of the Databricks Lakehouse Platform, all ELT Snaps' preview data (during validation) contains values with a precision higher than that of the actual floating-point values (float data type) stored in the Delta table. For example, 24.123404659344 instead of 24.1234. However, the Snap reflects the exact values during Pipeline executions.
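The re-mapping workaround above boils down to selecting the underscore-prefixed columns under new names before any downstream construct references them. A minimal sketch using Python's sqlite3 (with hypothetical columns _c0 and _c1 standing in for the default names DLP assigns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A table with default, underscore-prefixed column names such as those
# assigned to a headerless CSV (_c0, _c1, ...).
cur.execute("CREATE TABLE csvdata (_c0 INTEGER, _c1 INTEGER)")
cur.execute("INSERT INTO csvdata VALUES (1, 2)")

# Re-map the '_' columns to new names (a1, b1) so that later clauses
# can reference the new names instead of the underscore-prefixed ones.
cur.execute("CREATE TABLE csvdatatable AS SELECT _c0 AS a1, _c1 AS b1 FROM csvdata")

rows = cur.execute("SELECT a1, b1 FROM csvdatatable WHERE a1 = 1").fetchall()
print(rows)  # [(1, 2)]
```

After the re-mapping, constructs such as WHERE and ON can safely reference a1 and b1.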
Snap Input and Output
| Input/Output | Type of View | Number of Views | Examples of Upstream and Downstream Snaps | Description |
|---|---|---|---|---|
| Input | Document | | | The SQL queries referencing separate tables in which you want to add the JOIN clause. |
| Output | Document | | | The incoming SQL queries joined with a JOIN clause. The output from executing this Snap varies based on which table is connected to which input view. |
Snap Settings
SQL Functions and Expressions for ELT
You can use the SQL Expressions and Functions supported for ELT to define your Snap or Account settings with the Expression symbol (=) enabled, where available. This list is common to all supported target CDWs. You can also use other expressions/functions that your target CDW supports.
Parameter Name | Data Type | Description | Default Value | Example |
---|---|---|---|---|
Label | String | Specify a name for the Snap. You can modify this to be more specific, especially if you have more than one of the same Snap in your pipeline. | ELT Join | Combined Dataset |
Get preview data | Check box | Select this checkbox to include a preview of the query's output. The Snap performs limited execution and generates a data preview during Pipeline validation. In the case of ELT Pipelines, only the SQL query flows through the Snaps, not the actual source data. Hence, the preview data for a Snap is the result of executing the SQL query that the Snap has generated in the Pipeline. The number of records displayed in the preview (upon validation) is the smaller of the following:
Rendering Complex Data Types in the Databricks Lakehouse Platform: Based on the data types of the fields in the input schema, the Snap renders complex data types such as map and struct as the object data type, and array as the array data type. It renders all other incoming data types as-is, except for values in binary fields, which are displayed as base64-encoded strings of the string data type. | Not selected | Selected |
ELT Join Type | String/Drop-down list | Choose the join type to use in the SQL. Available options are:
See Snowflake Join Types, Redshift Join Types, Azure Synapse Join Types, Join Types for Databricks on AWS, and Join operation in BigQuery Standard SQL for more information. Natural Joins for Azure Synapse, Databricks Lakehouse Platform, and BigQuery: Natural Joins are not natively supported by the Azure Synapse, Databricks Lakehouse Platform (DLP), and BigQuery databases. However, this Snap uses a series of query rewrite mechanisms to support these join types, so you can apply Natural Joins to your data sets in Azure Synapse, DLP, and BigQuery. More Joins for BigQuery: BigQuery does not natively support the Left Anti and Left Semi join types. However, this Snap uses a series of query rewrite mechanisms to support these join types in BigQuery. | Inner | Left outer |
ELT Join Condition | String/Expression | Specify the condition to initiate the JOIN operation. You can specify any SQL expression that has a Boolean output (true or false). The ELT Join Condition is ignored if the join type is:
| N/A | CUST_CODE1 = CUST_CODE |
Left Table Alias | String/Expression | Specify the alias to use for the table in the first input view. This enables you to qualify the columns to join with an alias name and resolve any ambiguity due to identical column names. | N/A | TBL1 |
Right Table Alias | String/Expression | Specify the alias to use for the table in the second input view. This enables you to qualify the columns to join with an alias name and resolve any ambiguity due to identical column names. | N/A | TBL2 |
Resultant Column Names Prefix Type | Drop-down list | Not applicable if the target database is the Databricks Lakehouse Platform (DLP). Choose an option from the list to prefix the resultant column names with a table alias; this enables the Snap to prevent collision of identical column names in the resultant table. Available options are:
For existing Pipelines: For Pipelines created prior to 4.24 GA, if you choose the All Columns or Only Duplicate Columns option in this field, ensure that you also configure an account for the Snap. | None | RT.D_DATE_SK |
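Where a target database lacks a native join type (such as Left Anti or Left Semi in BigQuery, noted above), the query-rewrite approach corresponds to standard SQL equivalences. The following is an illustrative sketch in Python's sqlite3 with made-up tables, not the Snap's actual generated SQL: a Left Anti join expressed as a LEFT OUTER JOIN plus a NULL test, and a Left Semi join expressed with EXISTS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE left_t (k TEXT)")
cur.execute("CREATE TABLE right_t (k TEXT)")
cur.executemany("INSERT INTO left_t VALUES (?)", [("a",), ("b",), ("c",)])
cur.executemany("INSERT INTO right_t VALUES (?)", [("b",), ("c",), ("d",)])

# Left Anti join rewritten as LEFT OUTER JOIN + IS NULL:
# keep only the left rows that have no match on the right.
anti = cur.execute("""
    SELECT L.k FROM left_t L
    LEFT OUTER JOIN right_t R ON L.k = R.k
    WHERE R.k IS NULL
""").fetchall()
print(anti)   # [('a',)]

# Left Semi join rewritten as EXISTS: keep left rows that have a match,
# without duplicating them or pulling in right-side columns.
semi = cur.execute("""
    SELECT L.k FROM left_t L
    WHERE EXISTS (SELECT 1 FROM right_t R WHERE R.k = L.k)
""").fetchall()
print(semi)   # [('b',), ('c',)]
```

These equivalences hold in any SQL dialect, which is what makes such rewrites portable across target CDWs.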
Preventing SQL Injection
You can pass Pipeline parameters as values in an SQL expression; however, if you do not phrase the expression properly, the parameter's name may be bound as a value in the database. Inserting such potentially incorrect information into the database is known as SQL injection. It is therefore necessary to take precautions when including Pipeline parameters in your SQL expression. Based on the intended use of the Pipeline parameter, use one or both of the following methods to prevent accidental SQL injection:
Method-1: Simple Substitutions
You can reference the Pipeline parameter directly with a JSON-path without enabling expressions.
For example, if you want to use the Pipeline parameter, name, which contains the value of a column in the ELT Join Condition field:
colname = _name
Method-2: Dynamic Substitutions
You must enable expressions when using Pipeline parameters for dynamic substitutions. Format the SQL expression, except the Pipeline parameter's reference, as a string.
For example, if you want to use the Pipeline parameter, name, which contains the value of a column in the ELT Join Condition field:
_columnname + "= _name"
The Snap evaluates the expression and also carries out path substitutions.
Here is how it works
The Snap pre-processes the query to extract any JSON-Paths and converts them to bound parameters. For example, consider the following query:
_columnname + "= _name"
The Snap converts this query into the following before turning it into a prepared statement for the database:
colname = ?
The Snap evaluates the JSON-Path to get the value to bind the Pipeline parameter in the prepared statement.
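The bound-parameter mechanism described above is the same one that standard database drivers expose. As a sketch (Python's sqlite3 standing in for the target CDW, with made-up table data), this shows the difference between binding a runtime value as a parameter and concatenating it into the SQL text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (colname TEXT)")
cur.executemany("INSERT INTO customers VALUES (?)", [("alice",), ("bob",)])

# A value supplied at runtime, analogous to a Pipeline parameter.
name = "alice' OR '1'='1"  # a malicious value

# Bound parameter: the driver sends the value separately from the SQL
# text, so the quote characters are treated as data, not as syntax.
safe = cur.execute("SELECT * FROM customers WHERE colname = ?", (name,)).fetchall()
print(safe)    # [] -- no row is literally named "alice' OR '1'='1"

# String concatenation: the value is spliced into the SQL text, and the
# injected OR clause matches every row in the table.
unsafe = cur.execute(
    "SELECT * FROM customers WHERE colname = '" + name + "'"
).fetchall()
print(unsafe)  # [('alice',), ('bob',)] -- every row comes back
```

The prepared statement with `?` placeholders corresponds to the `colname = ?` form the Snap produces after extracting JSON-Paths.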
Using escape characters
When expressions are disabled, use \ as an escape character to treat underscore (_) as a string.
For example:
colname = \_name
Troubleshooting
Error | Reason | Resolution |
---|---|---|
Syntax error when a database, schema, or table name contains a hyphen (-) (CDW: Azure Synapse). | Azure Synapse expects any object name containing hyphens to be enclosed in double quotes, as in "<object-name>". | Ensure that you use double quotes for every object name that contains a hyphen when your target database is Azure Synapse. For example: default."schema-1"."order-details". |
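Quoting hyphenated identifiers in double quotes is standard SQL behavior rather than something Synapse-specific. A quick illustration using Python's sqlite3 and a hypothetical table named order-details:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# An unquoted hyphen would be parsed as subtraction, so the name must be
# enclosed in double quotes everywhere it appears in a statement.
cur.execute('CREATE TABLE "order-details" (id INTEGER)')
cur.execute('INSERT INTO "order-details" VALUES (1)')
rows = cur.execute('SELECT id FROM "order-details"').fetchall()
print(rows)  # [(1,)]
```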
Examples
Performing Inner Join
We need a query that contains a JOIN clause. This example demonstrates how we can use the ELT Join Snap to build a query with the JOIN clause.
First, we build SELECT queries to read the target tables. To do so, we can use two ELT Select Snaps, in this example: Read Part A and Read Part B. Each of these Snaps is configured to output a SELECT * query to read the target table in the database. Additionally, these Snaps are also configured to show a preview of the SELECT query's execution as shown:
Read Part A Configuration | Read Part B Configuration |
---|---|
A preview of the outputs from the ELT Select Snaps is shown below:
Read Part A Output | Read Part B Output |
---|---|
The SELECT * queries in both of these Snaps form the inputs for the ELT Join Snap. We want to perform an inner join based on matching values of the CUST_CODE column in the tables. Accordingly, this Snap is configured as shown below:
Upon execution, the ELT Join Snap combines both incoming SELECT * queries and adds the JOIN clause.
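As a rough, hand-written approximation of the shape of such a combined query (run here against Python's sqlite3 with made-up PART_A/PART_B data; the Snap's actual generated SQL depends on your target CDW and settings):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE PART_A (CUST_CODE TEXT, ITEM TEXT)")
cur.execute("CREATE TABLE PART_B (CUST_CODE TEXT, QTY INTEGER)")
cur.executemany("INSERT INTO PART_A VALUES (?, ?)", [("C1", "bolt"), ("C2", "nut")])
cur.executemany("INSERT INTO PART_B VALUES (?, ?)", [("C1", 5), ("C3", 7)])

# The two upstream SELECT * queries become subqueries joined on CUST_CODE.
rows = cur.execute("""
    SELECT A.*, B.QTY
    FROM (SELECT * FROM PART_A) AS A
    INNER JOIN (SELECT * FROM PART_B) AS B
      ON A.CUST_CODE = B.CUST_CODE
""").fetchall()
print(rows)  # [('C1', 'bolt', 5)] -- only CUST_CODE values present in both tables
```

With an inner join, only customer codes present in both tables survive; the other join types in the ELT Join Type setting change which unmatched rows are retained.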
A preview of the ELT Join Snap's output is shown below:
We can also add an ELT Insert-Select Snap downstream and write the result of this query into another table.
Preventing Collision of Identical Columns in Resultant Column Names
In this example, we retrieve data from two tables in a Snowflake database, filter the records based on certain clauses, and join the queries using a JOIN clause so that identically named columns in the result do not collide. We use the ELT Join Snap to achieve this.