Use this Snap to add the SAMPLE clause to the incoming query. The SAMPLE clause creates a subset of the source table based on the specified number of rows. Alternatively, you can also create a subset of rows as a percentage of the number of rows in the source table. You can use this Snap to reduce the cost and resource overhead on the target database and file storage services such as S3 by testing the SQL query on a subset of the data instead of the whole table.
ELT Snap Pack does not support Legacy SQL dialect of Google BigQuery. We recommend that you use only the BigQuery's Standard SQL dialect in this Snap.
ELT Pipelines targeting a Databricks Lakehouse Platform (DLP) instance might fail due to a very long or complex SQL query that they build. As a workaround, you can set an advanced (URL) property useNativeQuery to 1 in your ELT Database Account configuration as shown below:
In any of the supported target databases, this Snap does not appropriately identify nor render column references beginning with an _ (underscore) inside SQL queries/statements that use the following constructs and contexts (the Snap works as expected in all other scenarios):
WHERE clause (ELT Filter Snap)
ON condition (ELT Join, ELT Merge Into Snaps)
Insert expressions (column names and values in ELT Insert Select, ELT Load, and ELT Merge Into Snaps)
Update expressions list (column names and values in ELT Merge Into Snap)
Secondary AND condition
Inside SQL query editor (ELT Select and ELT Execute Snaps)
As a workaround while using these SQL query constructs, you can:
Precede this Snap with an ELT Transform Snap to re-map the '_' column references to suitable column names (that do not begin with an _ ) and reference the new column names in the next Snap, as needed.
In case of Databricks Lakehouse Platform where CSV files do not have a header (column names), a simple query like SELECT * FROM CSV.`/mnt/csv1.csv` returns default names such as _c0, _c1, _c2 for the columns which this Snap cannot interpret. To avoid this scenario, you can:
Write the data in the CSV file to a DLP table beforehand, as in: CREATE TABLE csvdatatable (a1 int, b1 int,…) USING CSV `/mnt/csv1.csv` where a1, b1, and so on are the new column names.
Then, read the data from this new table (with column names a1, b1, and so on) using a simple SELECT statement.
In case of Databricks Lakehouse Platform, all ELT Snaps' preview data (during validation) contains a value with precision higher than that of the actual floating point value (float data type) stored in the Delta. For example, 24.123404659344 instead of 24.1234. However, the Snap reflects the exact values during Pipeline executions.
Snap Input and Output
Type of View
Number of Views
Examples of Upstream and Downstream Snaps
The SQL query in which you want to add the SAMPLE clause. Typically, it would be a SELECT query reading the source table.
The incoming SQL query with the SAMPLE clause and a preview of the query's result if the Get preview data check box is selected.
SQL Functions and Expressions for ELT
You can usetheSQL Expressions and Functions supported for ELTto define your Snap or Account settings with the Expression symbol = enabled, where available. This list is common to all target CDWs supported.You can also use other expressions/functions that your target CDW supports.
Specify a name for the Snap. You can modify this to be more specific, especially if you have more than one of the same Snap in your Pipeline.
Default Value: ELT Sample
Example: Sample Order Data
Get preview data
Select this checkbox to include a preview of the query's output. The Snap performs limited execution and generates a data preview during Pipeline validation.
In case of ELT Pipelines, only the SQL query flows through the Snaps but not the actual source data. Hence, the preview data for a Snap is the result of executing the SQL query that the Snap has generated in the Pipeline.
The number of records displayed in the preview (upon validation) is the smaller of the following:
Number of records available upon execution of the SQL query generated by the Snap.
The value set in the Preview Data Count setting (default is 50 records).
Rendering Complex Data Types in Databricks Lakehouse Platform
Based on the data types of the fields in the input schema, the Snap renders the complex data types like map and struct as object data type and array as an array data type. It renders all other incoming data types as-is except for the values in binary fields are displayed as a base64 encoded string and as string data type.
Default Value: Not selected
Choose the sampling type to query your data.
Available options are:
Number of Records
Percentage of Records
Default Value: Number of Records
Example: Percentage of Records
Number of Records
Sampling Type is Number of Records
Enter the number of records to output.
The ELT Sample Snap outputs the entire table if the total number of records in the table is less than the value specified here.
Default Value: NA
Percentage of Records
Sampling Type is Percentage of Records
Enter the percentage number of records to sample your data based on a percentage value of the total records in the table.
The number of records in the output is rounded off to the lower value if the percentage leads to a decimal. For example, if you enter 75 and if the table has 10 records, the output yields 7 records since 75% of 10 is 7.5.
The number of records in the output is randomized and is up to a maximum of the percentage calculation specified above. Therefore, in the above instance, the output may contain 5, 6, or 7 records.
Default Value: NA
Creating a Subset Table
We want to create a subset of a table for testing various SQL queries. We need a query with the SAMPLE clause. This Pipeline shows how we can use the ELT Sample Snap to accomplish this task.
First, we build a SELECT query to read the target table, ORDERS3. To do so, we use the ELT Select Snap and configure it as shown below. We also configure the Snap to show a preview of the SELECT query's execution:
Upon execution, the ELT Select Snap builds a SELECT query as shown along with a preview of the query's output below:
This table contains 34 records. We only want a sample of 10 records in the subset table. Therefore, we use the ELT Sample Snap and configure it as shown below:
This Snap builds the following query based on this configuration:
From the output preview, we can see that the subset contains 10 records.
Alternatively, if you want to fetch a percentage of records of the original table, you can configure the Snap as shown below:
The number of records in the output is randomized to a maximum of the percentage specified here. In this example, since the original table has 34 records, specifying 50 in the Percentage of Records field yields a maximum of 17 records.
Based on the Snap's configuration, the output contains a random sample of 14 records.
We can also add an ELT Insert-Select Snap downstream and write the result of this query into another table.
Fixed the following issues with the ELT SCD2 Snap:
Where the Snap failed to get the right data type due to column name case mismatches between what is used in the Snap and what is actually used in the Azure Synapse tables (returned by the JDBC driver). You no longer need to type the column names in the exact case that Azure Synapse expects.
The Snap failed with the error—start_date does not exist—while writing SCD2 data to a Redshift table column start_date that is specified as the StartDate of Current Row in the Target Table Temporal Field(s) field set.
The Snap failed with the error—Reference 'END_DATE' is ambiguous—while merging SCD2 updates into DLP tables.
Where the Snap failed due to lack of required access privileges on the target database (for example, create table rights to create temporary tables as needed). The Snap now runs the input SQL statement and the elaborate sub-queries instead of attempting to create a temporary table in such scenarios.
Fixed the issue with the ELT SCD2 and ELT Load Snaps that fail to perform an add/drop operation involving multiple columns on a Redshift target table.
Enhanced the ELT Aggregate Snap to support COUNT_IF aggregate function for Redshift and Azure Synapse target databases. Eliminating duplicates with COUNT_IF aggregate function Note: Selecting the Eliminate Duplicates checkbox while using COUNT_IF aggregate function does not eliminate duplicate records in case of Snowflake and BigQuery databases, as there is no native support for this feature. However, for Redshift, Azure Synapse and Databricks Lakehouse Platform (DLP), the duplicates are eliminated from the list of records when you select this checkbox for COUNT_IF function.
Updated the Expressions and Functions Supported for ELT in the Snap and Account configuration sections. This list is common to all target CDWs supported. You can use these expressions to define your Snap or Account settings with the Expression symbol = enabled, where available.
Enhanced the ELT Load Snap to ensure that the Snap uses the default S3 Folder name specified in the Snap's account to accurately resolve the defined File Name Pattern.
Enhanced the ELT Select and ELT Execute Snaps to allow SQL comments inside the SQL Query Editor and SQL Statement Editor fields respectively.
Enhanced the ELT Transform Snap to display the exact data types of fields listed in the Input Schema and Target Schema in case of Azure Synapse.
Enhanced the ELT Merge Into Snap to to support MERGE INTO ALL option and automatic source table aliasing.
Enhanced the ELT Load Snap by adding the Source File to Target Table Columns Map field set to enable mapping of columns between the source file and the target table. In case of Databricks Lakehouse Platform (DLP) the Snap is enhanced further to support delimiters other than the comma in the source CSV files.
Enhanced the ELT Join Snap to support Left Anti and Left Semi join types in BigQuery though it does not natively support these join types.
Enhanced the underlying load mechanism for ELT SCD2 Snap from Insert-and-Update mode to Merge-into mode to substantially improve the Snap's performance while working with large and very large volumes of data (upwards of 500M rows or 50GB size).
Fixed an issue withELT TransformSnap where it may display incorrect schema only in the previews (during Pipeline validation). This occurs especially when the incoming SQL statement (defined in theSQL Statement Editorof the upstream Snap) contains one or more of the WHERE, GROUP BY, HAVING , ORDER BY, LIMIT, LIMIT followed by OFFSET, and SAMPLE clauses. Here are a few Pipeline scenarios where this issue might surface:
Fixed an issue where the ELT Load Snap connecting to a Databricks Lakehouse Platform (DLP) instance failed to perform the load operation. Ensure that you provide a valid DBFS Folder path in the Snap's account settings as the Snap requires this folder path.
Enhanced the ELT Database Account to support token-based authentication (Source Location Session Credentials) to S3 locations for Snowflake and Redshift target databases.
GROUP BY Fields Listfield set > Output Field to GROUP BY Field.
ORDER-By Fields to ORDER-BY Fields (Aggregate Concatenation Functions Only).
Removed the Suggestion option for Field Name field under General Aggregate Functions List field.
Made the Alias Name fields in the Aggregate Concatenation Functions List and the Percentile Distribution Functions List field sets mandatory.
If your target database is a Databricks Lakehouse Platform (DLP) instance, then the ELT Load Snap supports loading data from source CSV files that contain only comma as the separator between values.
Enhanced the ELT Database Account to allow parameterization of field values using Pipeline Parameters. You can define and use these parameters in expression-enabled fields to pass values during runtime.
Fixed the following Known Issues recorded in the 4.26 GA version:
For a Snowflake target instance, the ELT Insert Select Snap does not suggest column names to select for the Insert Column field in the Insert Expression List.
Intermittent null-pointer exceptions in the ELT Load Snap on Databricks Lakehouse Platform (DLP).
The ELT Insert Select Snap attempts to create the target table even when it exists in the Snowflake database.
When loading data from a JSON file into a target Databricks Lakehouse Platform (DLP) instance using an ELT Load Snap, if you choose the Drop and Create Table option as the Load Action and specify an additional column (that is not available in the JSON file) for the new table, it results in one more column null added to the new target table.
When you use the SQL editor in the ELT Select Snap configuration to define your SQL query, the Pipeline validation fails due to a syntax error in the following scenarios. However, the Pipeline execution works as expected. The only workaround is to drop the LIMIT clause and the optional OFFSET clause from the SQL query during Pipeline validation.
The query contains a LIMIT clause on a Snowflake, Redshift or Databricks Lakehouse Platform target instance: The SQL query created during Pipeline validation includes an additional LIMIT clause, for example: SELECT * FROM "STORE_DATA"."ORDERS" LIMIT 10 LIMIT 990.
The query contains an OFFSET clause (supported in case of Snowflake and Redshift): The SQL query created during Pipeline validation looks like SELECT * FROM "STORE_DATA"."ORDERS" LIMIT 10 OFFSET 4 LIMIT 990.
Enhanced the ELT Snap preview to support the following Snowflake data types:array,object,variant, andtimestamp.
The Snaps convert the values to hexadecimal (HEX) equivalents—the default setting for the session parameter BINARY_OUTPUT_FORMAT in Snowflake. SeeSession Parameters for Binary Valuesfor more information.
If this setting is different from hexadecimal (such as base64) in the Snowflake table, the Snaps still convert the values to hexadecimal equivalents for rendering them in the Snap preview.
Enhanced all ELT Snaps to display the Get preview datacheckbox below the Snap'sLabelfield.
Starting with the 4.26 release, all Snaps in the ELT Snap Pack(except the ELT Copy Snap) require an account to connect to the respective target database. Your existing Pipelines that do not use an account may fail. We recommend you to associate an ELT Database Account to each of the ELT Snaps (except ELT Copy Snap) for your Pipelines.
Enhanced theELT AggregateSnap to support Linear Regression functions on Redshift and Azure Synapse. The Snap also supports these functions on Databricks Lakehouse Platform.
Enhanced theELT ExecuteSnapto enable running multiple DML, DDL, and DCL SQLstatements from the same Snap instance.
Support LEFT ANTI JOIN and LEFT SEMI JOIN types on all supported databases.
Display or hide the Resultant Column Names Prefix Typefield based on the target database selected in the Snap's account.
Enhanced the ELT Load and ELT SCD2 Snaps to provide a list of suggested data types, while adding columns to or creating a table.
Updated the ELT SCD2 Snap to replace End date of historical row option in the Meaning field of Target Table SCD2 Fields field set with End Date of Current Row.
This may cause the existing Pipelines to fail as the End date of historical row option no longer exists.
You need to make the following update in the ELT SCD2 Snap's settings across your Pipelines after upgrading your Snap Pack to this patch:
Select End Date of Current Row from the Meaning drop-down list in the second entry (highlighted in the image).
Fixed the issue with the ELT Insert Select Snap containing an open output preview that fails to retrieve output preview data in case of Redshift and Azure Synapse databases, though the Pipeline runs work as expected.
Fixed an issue where the ELT Execute Snap does not error out (Snap turns Green) even when running an SQL query to drop a non-existent table from a Snowflake or Azure Synapse database.
[Update on ]: Enhanced the ELT Snap previews to support the following data types: array, object, variant, and timestamp.
The Snaps convert the values to hexadecimal (HEX) equivalents—the default setting for the session parameter BINARY_OUTPUT_FORMAT in Snowflake. See Session Parameters for Binary Values for more information.
If this setting is different from hexadecimal (such as base64) in the Snowflake table, the Snaps still convert the values to hexadecimal equivalents for rendering them in the Snap previews.
Enhanced the ELT Snap preview to display the exact binary and varbinary values from Snowflake database during Pipeline validation, by converting the values to hexadecimalequivalents—the default setting in Snowflake. If the setting is different from hexadecimal in the Snowflake table, then the Snaps still convert the values to hexadecimal for rendering the Snap preview.
Enhanced theELT TransformSnap to display the appropriate data type (binaryorvarbinary) for the column names populated in the output schema.
Enhanced theELT Window FunctionsSnap to address potential issues due to an incorrect definition for MINUS function in case of Redshift and Azure Synapse databases.
Support for Google Cloud Storage as a storage location (source)in addition to AWS S3 and Azure Data Lake Storage (ADLS) when your target database is Snowflake.
Automatic downloadof the JDBC driver required for the selectedDatabase Type using the new Download JDBC Driver Automaticallycheck box.
Enhanced theELT LoadSnap to prevent changes to existing tablesduring Pipeline validation.If you set the Load Action as Drop and Create table, and the target table does not exist,the Snap creates a new (empty) target table based on the schema specified in its settings.
Enhanced theELT Window FunctionsSnap to support Covariance, Correlation, and LinearRegression Functions on Snowflake, Redshift, and Azure Synapse databases. The Snap uses function-specific query re-writes to support these functions on Redshift and Azure Synapse databases.
Enhanced theELT Merge IntoandELT Insert SelectSnaps to support up to one output view, and addedtheGet Preview Datacheck box to these Snaps. You can now connectdownstreamELT Snaps to these Snaps.
Fixes the issue of production job failures due to ELT Insert Select Snap after upgrading to 4.24 GA by updating the ELT Transform Snap to continue allowing duplication of fields in the Expression list for the Pipeline to complete successfully.
No changes are needed to your existing Pipelines.
Fixes the column name collision issue in the Snap's output when the two tables being joined have columns with the same/identical names. You can specify the extent of prefix (that is, to prefix all columns, only duplicate columns, or no prefix) using the Resultant Column Names Prefix Type drop-down list. Based on the prefix you choose, a table alias name is prefixed to the identical columns in the output.
The behavior of ELT Load Snap for Load Action during Pipeline validation across the supported databases is as follows:
Append rows to existing table: Does not append the data from the source files into the target table.
Overwrite existing table: Does not overwrite the data.
Drop and Create table: Does not drop the target table even if it exists, but the Snap creates a new target table if a table does not exist.
Alter table: Does not modify the schema of the target table.
Adds support for Azure Synapsedatabase. You can now use theELT Snap Packto transform tables in the Snowflake, Redshift as well as Azure Synapse databases.
Updates the Snap Pack with the following features:
Suggest appropriate column names to select from, in the Snap fields.
Create Hash-distributed tables using the Target Table Hash Distribution Column (Azure Synapse Only) field when theLoad Actionis selected asDrop and Create tableand a condition like WHEN NOT MATCHED BY TARGET.
Enhances the Snap to support Natural JOINS (NATURAL INNER JOIN, NATURAL LEFT OUTER JOIN, NATURAL RIGHT OUTER JOIN, and NATURAL FULL OUTER JOIN) in addition to the INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER, and CROSS Joins in Azure Synapse Database. This enhancement also makes account configuration mandatory when using this Snap.
Fixes the column name collision issue in the Snap's result set when the two tables being joined have columns with the same/identical names. You can specify the Resultant Column Names Prefix Type drop-down list. Based on theprefix typeyou choose, a table alias name is prefixed to identical columns in the output.
Suggest appropriate column names to select from, in the Snap fields. This applies to Snowflake, Redshift, and Azure Synapse databases.
Include theTarget Table Hash Distribution Column (Azure Synapse Only) field for the Snap to create hash-distributed tables always.
Include theUpdate Expression List - When Not Matched By Source field set to allow defining one or more Update Expressions for the WHEN clause - WHEN NOT MATCHED BY SOURCE. This applies toAzure Synapse database.
Include theTarget Table Aliasfield to specify the alias name required for the target table. The Snap is also equipped with the ability to auto-replace the actual table names (with the alias name), if any, used in the ON clause condition, secondary AND conditions, Update Expression list, or Insert Expression list. This applies to Snowflake, Redshift, and Azure Synapse databases.