In this article

Table of Contents
maxLevel2
absoluteUrltrue

Overview

Use this Snap to load files/data from AWS S3 buckets or Azure Cloud Storage containers to a Snowflake database, an Azure Synapse database, or the Databricks Lakehouse Platform (DLP). You can also use this Snap to transfer files from AWS S3 buckets to the Redshift database, and to load data from Amazon Redshift or Google Cloud Storage to BigQuery. See ELT Source-Target-Hosting Support Matrix for complete information.

Note

If you want to use the COPY INTO command in the ELT Execute Snap for loading data into the target database, you need to pass (expose) your account credentials inside the SQL statement. You can use the ELT Load Snap instead to avoid this security risk.

...

Prerequisites

Valid accounts and access permissions to connect to the following:

  • Source: AWS S3, Azure Cloud Storage, or Google Cloud Storage

  • Target: Snowflake, Redshift, Azure Synapse, Databricks Lakehouse Platform, or BigQuery

Limitations

...

Multiexcerpt include macro
nameME_ELT_GBQ_StandardSQL_UOI
pageELT Copy

...

Known Issues

...

Due to an issue with the Redshift COPY INTO SQL for PARQUET input files table schema management, the Snap fails to perform ALTER TABLE actions before loading data from a PARQUET source file to the target table in Redshift. This issue does not occur when the source file is in CSV or another format.

...

  • This is due to the limitation with the Databricks SQL Admin Console that does not allow you to add the configuration parameter spark.databricks.delta.alterTable.rename.enabledOnAWS true to its SQL Warehouse Settings.

...

Because BigQuery does not support Automatic Schema Inferencing (ASI) from source files in Amazon S3, the ELT Load Snap fails to produce the required output when you choose Drop and Create table as the Load Action and define the Target to Source Column Map in the Snap’s settings. However, the same configuration works fine for Pipeline validation purposes.

...

The ELT Load Snap cannot automatically infer schema from GZIP-compressed source files located in an ADLS Gen2 storage location for loading the data into Snowflake target tables.

...

Due to an issue with BigQuery table schema management (the time travel feature), an ALTER TABLE action (Add or Update column) that you attempt after deleting a column (DROP action) in your BigQuery target table causes the table to break and the Snap to fail.

  • As a workaround, you can consider either avoiding ALTER TABLE actions on your BigQuery instance or creating (CREATE) a temporary copy of your table and deleting (DROP) it after you use it.

Due to an issue with the Simba Spark JDBC driver for DLP, you cannot insert a NULL value in the nullable columns of Boolean data type in your DLP instance using any of the Write-type Snaps—ELT Load, ELT SCD2, ELT Insert Select, ELT Merge Into, and ELT Execute.

The only workaround currently available for this issue is to upgrade your JDBC driver to databricks-jdbc-2.6.25-1.jar, and use the corresponding JDBC driver class (com.databricks.client.jdbc.Driver) and JDBC URL in your Snap account.

Info

This latest JDBC driver for DLP uses a JDBC URL structure and driver class that is different from the Simba Spark JDBC driver.

...

You cannot add a column to your BigQuery target table with a deleted column name using the ELT Load Snap, as BigQuery reserves deleted column names and data until the pre-configured time travel duration (from 2 through 7 days).

...

When loading data from a CSV file to a target DLP table, the header names in the file must exactly match the column names in the target table. Otherwise, the Snap returns the error—Column names should be the same in the target table and CSV file and aborts the load operation.

...

When you are enclosing the column names within backticks (`<Column_Name>`) for creating a new target table in a DLP instance using the ELT Load Snap—typically to ensure that the column names are used verbatim—you may encounter the error: ERROR processing query/statement: no viable alternative at input 'FROM'(line X, pos XX). To prevent this error, edit the header in your source file to exclude any special characters, thereby avoiding the use of backticks in the target table column names.

...

For Redshift and Azure Synapse databases, the ELT Load Snap displays JSON values for fields with datatypes Date and Time, when validating the Pipeline.

...

However, the ELT Load Snap outputs the correct date and time values when executing the Pipeline.

Info

To prevent these values from showing up in JSON format in the previews and ELT Transform Snap, you can cast the time or date values to a VARCHAR type in your custom query as follows: 
CAST (<time_column_type_name> AS VARCHAR(<size_based_on_precision_required>)).
Example: CAST (time_data AS VARCHAR(20)).

...

If you have dropped one or more columns from your target table outside of the Snap and then use it again to add these columns back and load data into the table, the Snap adds the new columns to the target table, but not the corresponding data from the source file.

...

Multiexcerpt include macro
nameME_ELT_KI_Underscore_Refs
pageELT Snap Pack

Snap Input and Output

...

None

...

  • Min: 0
  • Max: 0

...

Document

...

  • Min: 1
  • Max: 1

...

  • ELT Transform
  • ELT Copy

...

Snap Settings

Info
titleSQL Functions and Expressions for ELT

You can use the SQL Expressions and Functions supported for ELT to define your Snap or Account settings with the Expression symbol = enabled, where available. This list is common to all target CDWs supported. You can also use other expressions/functions that your target CDW supports.

...

Multiexcerpt include macro
namegetpreviewdata
pageELT Intersect

...

Required. Enter the name of the database schema. If you do not specify a schema name, the suggestions for this field retrieve all schema names in the specified database when you click the Suggest button.

Multiexcerpt include macro
nameME_Schema_Name
pageELT Insert-Select

...

Required. Enter the name of the table in which you want to perform the load operation. 

Multiexcerpt include macro
nameME_Schema_And_Table_Names
pageELT Insert-Select

Note
  • If the target table does not exist during run-time, the Snap creates one with the name that you specify in this field and writes the data into it. During Pipeline validation, the Snap creates the new table but does not write any records into it.
    • The new table thus created will not be dropped in the event of a subsequent/downstream Snap failure during validation.
  • You can specify the table name without double quotes (""). You must use double quotes when the table name includes special characters, such as hyphens (-).
  • A table name must always start with a letter.
  • Integers and underscores (_) can also be a part of the table name.
  • All characters are automatically converted to uppercase by the Snap. Use double quotes to retain lowercase.

...

Specify the hash distribution column name for the target table in Azure Synapse, if you choose the Load Action as Drop and Create table.

Info

This field is applicable only for Azure Synapse. Azure Synapse needs a table to be always hash-distributed for improved query performance. If the target table is created outside the Snap, you need not specify the target table column name.

...

Select the load action to perform on the target table.

Available options are:

  • Append rows to existing table. Loads the data from the source files into the specified target table.
  • Overwrite existing table. Overwrites the target table with the data from the source files.
  • Drop and Create table. Drops the target table if it exists, creates a new table with the columns provided in the Table Columns field set, and then loads the data from the source files into the target table. Activates the Data Type field in the Table Columns field set.
  • Alter table. Modifies the schema of the target table based on the configuration of the Modifier field in the Table Columns field set and loads the data from the source files into the target table. Activates the Data Type and Modifier fields in the Table Columns field set. 
Note

The Snap does not modify the existing tables during Pipeline validation, but if you choose the Drop and Create table option in the Load Action field and the target table does not exist, it creates a new (empty) target table based on the schema specified in the Snap.

...

This field set enables you to define the table options for creating the target table before performing the load operation. You must specify each table option in a separate row. Click + to add a new row.

  • Table Option

...

Click the Suggest icon and choose a table option that you want to use, from the suggested values. Specify the required values for the selected option.

The table options you can use are:

  • PARTITION BY
  • CLUSTER BY
  • OPTIONS
  • COMMENT

...

This field set enables you to configure the schema of the target table. You can use this field set to add/drop columns in the target table if you select the Drop and Create table or Alter table options in the Load Action field. 

This field set consists of the following fields:

  • Column
  • Data Type
  • Modifier

...

ID

FNAME

...

Activates when you select the Drop and Create table or Alter table options in the Load Action field.

Enter the data type of the values in the specified column. 

...

INT

VARCHAR

...

Activates when you select the Alter table option in the Load Action field.

Select whether you want to add/drop the specified column.

Available options are:

  • Add
  • Drop

...

Applicable only when you are loading data from a Redshift source to a BigQuery target.

Specify the schema name of the source table in Redshift.

...

Applicable only when you are loading data from a Redshift source to a BigQuery target.

Specify the name of the Redshift source table (from the schema selected above).

...

Specify the fully-qualified path to the folder in the Cloud storage in this field, if you want to override the default path defined in the Snap's Account settings. For example:

  • AWS S3 path: s3://<bucket name>/<folder>
  • DBFS folder: /<dbfs folder name>
  • Azure: azure://<accountname>.blob.core.windows.net/<containername>/<azurefolder>
  • Google Cloud Storage: gcs://<bucketname>/<folder>/

...

This field set enables you to specify the staged files to load. You must specify each file in a separate row. Click + to add a new row.

Note
titleUse File List for up to 2 files

For the best performance of your load operation, we recommend that you use the File Name Pattern field to load data from multiple files at a source location (S3 or Redshift) instead of specifying the individual file names/paths in the File List fieldset.

This field set consists of the following field:

  • File 

...

Enter the pattern to use to match the file name and/or absolute path. You can specify this as a regular expression pattern string, enclosed in single quotes.

Info
titleFor Redshift tables
  • Redshift does not natively support the use of regular expressions. However, Snaps in the ELT platform provide you with this ability through a key-based search mechanism.
  • If your search is based on AWS IAM roles, you cannot use regular expressions with Redshift tables.
Note
titleUsing Wildcard Patterns in BigQuery

BigQuery has its own set of rules to resolve wildcard patterns like *.* (that you provide in this field). Refer to Batch loading data > Load Wildcards in the BigQuery documentation for the rules around using wildcards in BigQuery.

See Loading Using Pattern Matching for details on Snowflake regular expressions, Examples of COPY INTO (Delta Lake on Databricks) for DLP, or Regular Expressions and the Java Programming Language for using regular expressions with Redshift, Azure Synapse, or BigQuery in this field.

...

Snowflake: '.*key.*[.]csv'

Redshift: './key.*csv'

...

Select the format for the file. The options vary based on the database type (Snowflake/Redshift/Azure Synapse) in your account settings.

The supported file formats are: 

  • Snowflake: AVRO, CSV, JSON, ORC, PARQUET, XML
  • Redshift: AVRO, CSV, JSON, ORC, PARQUET
  • Azure Synapse: CSV, ORC, PARQUET

...

This field set enables you to define the file format options for the load operation. You must specify each file format option in a separate row. Click + to add a new row.

  • Format Option

See Snowflake Data Loading Commands, Redshift Data Format Parameters, or COPY INTO (Transact-SQL) for Azure Synapse Analytics for the list of options that you can use based on the target database.

Info
titleAuto-Inferencing Source Table Schema

ELT Load Snap allows you to infer the table schema from the source files and then apply the auto-inferred schema to your target tables. As of the July 2022 release, this feature is supported on the Redshift target CDW and with the source files in an S3 location. Learn more at Automatic Schema Inference with ELT Load Snap.

...

Click the Suggest icon and choose a file format option that you want to use during the load operation, from the suggested values. Specify the required value for the selected option.

Note
titleLoading a CSV file with nulls into DLP

If the CSV file you want to load into a DLP instance contains null values, ensure that you define the Format Option in the File Format Option List fieldset as sep = ','  without which the load operation does not succeed.

...

Select the encryption type that you want to use for the loaded data and/or files.

Available options are:

  • None
  • Server-Side Encryption (Not supported by Azure Synapse)
  • Server-Side KMS Encryption (Not supported by Azure Synapse)
Note

Server-side encryption is available only for S3 accounts and Snowflake.

...

Activates when you select the Server-side KMS Encryption option in the Encryption Type field.

Enter the KMS key to use to encrypt the files. 

Note

Not required for Redshift databases. The load operation automatically recognizes and loads the encrypted files. See Loading encrypted files from Amazon S3 for details. 

...

This field set enables you to define the COPY options for the load operation. You must specify each COPY option in a separate row. Click + to add a new row.

This field set comprises the following field:

  • Copy Option

See COPY Options for Snowflake, COPY Parameter Reference for Redshift, or COPY INTO (Transact-SQL) for Azure Synapse Analytics.

...

Select an output preview type for the data validation performed as part of the Pipeline validation. The options vary based on the database type (Snowflake/Redshift) in your account settings.

Available options for Snowflake are:

  • None
  • RETURN_n_ROWS
  • RETURN_ERRORS
  • RETURN_ALL_ERRORS

Available options for Redshift are:

  • None
  • NOLOAD
Info

No specific validation modes are available for Azure Synapse.

For more details, see Validating Staged Files for Snowflake and Validating Input Data for Redshift.

...

Activates when you select the RETURN_n_ROWS option in the Validation Mode field.

Enter the number of records to retrieve from the target table to validate whether the load operation was successful.

...

Select this checkbox to include the column position from the source table in the case of Snowflake or Azure Synapse.

In the case of other target CDWs (Redshift, DLP, and BigQuery), you can leave this checkbox unchecked.

...

This field set enables you to define the mapping between the columns in the target table and the corresponding columns in the source file. You must define each mapping in a separate row. Click + to add a new row.

The available options are:

  • Target Table Column Name
  • Source File Column Name

...

Click the Suggest icon and choose a column from the list of target table columns displayed. Ensure that you have defined the Schema Name or the Target Table Name to see the list of columns in this field's suggestions.

...

Click the Suggest icon and choose a column from the list of source file columns displayed. Ensure that you have defined the Schema Name (or source file path) to see the list of columns in this field's suggestions.

...

Snap Behavior During Pipeline Validation

This Snap's behavior during Pipeline validation varies for each Load Action selected.

...

Snowflake

The data is not loaded into the Snowflake table. It is just validated using the RETURN_n_ROWS validation mode. The number of records in the output is determined by the preview count set in the Pipeline settings. The Snap creates the target table (if not existing already) in the database when you select Load Action as Drop and Create table. For all other load actions, the Snap verifies if the specified target table exists in the database.

Redshift

The data is not loaded into the Redshift table. The files in S3 are just validated using the NOLOAD validation mode to verify data integrity of the files. The target table is also affected based on your selection in the Load Action field. The Snap creates the target table (if not existing already) in the database when you select Load Action as Drop and Create table. For all other load actions, the Snap verifies if the specified target table exists in the database.

Azure Synapse

The data is not loaded into the Azure Synapse table. The Snap creates the target table (if not existing already) in the database when you select Load Action as Drop and Create table. For all other load actions, the Snap verifies if the specified target table exists in the database.

Troubleshooting

...

Depending on your target database, use the respective File Format Option in the ELT Load Snap.

  • For Snowflake: SKIP_HEADER = 1
  • For Azure Synapse: FIRSTROW = 2
  • For Databricks Lakehouse Platform: 'header' = 'true'

...

Invalid DBFS folder name

(Target CDW: Databricks Lakehouse Platform)

...

Database cannot be blank.

(when seeking the suggested list for Schema Name field)

...

Suggestions in the Schema Name and Target Table Name fields do not work when you have not specified a valid value for the Database Name field in this Snap.

...

Specify the target Database Name in this Snap to view and choose from a suggested list in the Schema Name and Target Table Name fields respectively.

...

Unknown Source / Not found: <path>

(Target CDW: BigQuery)

...

BigQuery has a unique way of resolving wildcard patterns like *.* (that you provide in the File or File Name Pattern fields of this Snap).

...

Make sure that the file name pattern you provide conforms to the rules defined in Batch loading data > Load Wildcards section of BigQuery documentation.

...

Since the Redshift COPY INTO SQL does not natively support the File Pattern option, unlike the other CDWs, SnapLogic supports this externally through Java Regular Expressions (RegEx).

The RegEx *.* returns matching results from the root location instead of the folder location specified before *.* in the File Name Pattern field.

...

Keyword RANGE is not acceptable as a column name.

(CDW: Databricks Lakehouse Platform)

...

Keyword ORDER is not acceptable as a column name in the JSON file.

(Target CDW: BigQuery)

...

SQL exception from Snowflake: Syntax error in one or more positions in the SQL query.

...

[Simba][SparkJDBCDriver](500051) ERROR processing query/statement. Error Code: 0

Cannot create table ('<schema name>`.`<table name>`'). The associated location (`…<table name>`) is not empty but it's not a Delta table

(Target CDW: Databricks Lakehouse Platform)

...

The specified location contains one of the following:

  • A non-Delta table (such as CSV, ORC, JSON, PARQUET)

  • A corrupted table

  • A Delta table with a different table schema

So, the Snap/Pipeline cannot overwrite this table with the target table as needed.

...

Ensure that you take appropriate action (mentioned below) on the existing table before running your Pipeline again (to create another Delta table at this location).

Move or drop the existing table from the schema manually using one of the following commands:

Access the DBFS through a terminal and run:

  • dbfs mv dbfs:/<current_table_path> dbfs:/<new_table_path> to move the table or

  • dbfs rm -r dbfs:/<table_path> to drop the table.

OR

Use a Python notebook and run:

  • dbutils.fs.mv(from: String, to: String, recurse: boolean = false): boolean to move the table/file/directory or

  • dbutils.fs.rm(dir: String, recurse: boolean = false): boolean to drop the table/file/directory.

...

Column names should be the same in the target table and CSV file. 

(Target: DLP)

...

Examples

To skip the header row while uploading data from a CSV file, use the following File Format Options that suit your target database.

...

FIRSTROW = 2

...

Loading Files from S3 Bucket to Snowflake

We need to perform Snowflake ELT transformations upon a file present in an S3 bucket. Therefore, we must first load the file into the Snowflake database from the source location. This Pipeline shows how we can use the ELT Load Snap to accomplish this task.


Before we can load the file, we must configure the ELT account to read from the source S3 bucket by configuring the Source Location, Authentication, Access Key, Secret Key, S3 Bucket, and S3 Folder fields. The fields vary based on the source location and authentication. See Configuring ELT Database Accounts for details. 


We configure the ELT Load Snap as described below.

Let us understand the Snap's configuration:

  • The Database Name, Schema Name, and Target Table Name specify the target database, schema, and table into which we want to write the file's data.
  • Load Action indicates that if a table with the same name exists, we configure the Snap to drop that table and create a new one in its place.
  • Since this is a new table, the Table Columns field set specifies the table's schema.
  • The File List field set specifies the source file's path in the S3 bucket and the file's format.
  • The Format Option field set specifies that the Snap must ignore the file's header. Otherwise, the header is also loaded into the target table.
  • The Copy Option field set specifies that the Snap must abort the loading operation if it encounters an error.
  • The Validation Mode and Rows to Return fields specify that 10 rows from the target table be retrieved upon successful completion.

To validate a successful execution, the ELT Load Snap executes a SELECT * command on the target table and returns 10 rows.
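For illustration only, the load and validation described above are roughly equivalent to the following Snowflake statements. The table name, file path, and option values here are hypothetical and will differ from the SQL that the Snap actually generates.

Code Block
languagesql
-- Hypothetical equivalent of the configured load (CSV file staged in S3).
COPY INTO SALES_DB.PUBLIC.ORDERS
FROM 's3://example-bucket/app_data/TestData.csv'   -- access is supplied by the Snap account settings
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
ON_ERROR = ABORT_STATEMENT;

-- Validation step described above: retrieve 10 rows from the target table.
SELECT * FROM SALES_DB.PUBLIC.ORDERS LIMIT 10;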

We can either create a separate Pipeline to perform ELT operations on the table or add appropriate Snaps downstream of the ELT Load Snap in the same Pipeline.

Download this Pipeline.

Uploading data from a CSV File in Azure Data Lake Gen2 Storage to Azure Synapse

The following one-Snap (ELT Load) Pipeline connects to an Azure Data Lake Gen2 Storage using an ADLS Gen2 Basic Auth account to upload a CSV file from the storage to an Azure Synapse database. This CSV file is created on MacOS with the default line endings.


...


...


For the Pipeline to work as expected with a CSV file created on MacOS with the default (MacOS) line endings, define the value for the file format option ROWTERMINATOR as '0X0A' instead of the usual '\n'. With '\n' as the ROWTERMINATOR, the Pipeline runs successfully, but does not insert the data from the CSV file into the newly-created table in Azure Synapse.
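For reference, here is a hedged sketch of where this option sits in an Azure Synapse COPY INTO (Transact-SQL) statement; the table name and storage path are hypothetical.

Code Block
languagesql
-- Hypothetical COPY INTO for Azure Synapse: '0X0A' handles the MacOS
-- default line endings; use '\n' for a CSV file created on Windows.
COPY INTO dbo.SALES_ORDERS
FROM 'https://exampleaccount.dfs.core.windows.net/examplecontainer/TestData.csv'
WITH (
    FILE_TYPE = 'CSV',
    FIRSTROW = 2,
    ROWTERMINATOR = '0X0A'
);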

...


...


Download this Pipeline.

Downloads

...

titleImportant Steps to Successfully Reuse Pipelines

...

...

In this article

Table of Contents
maxLevel2
absoluteUrltrue

Multiexcerpt include macro
nameME_NE_ELT_Snap_Accounts
templateData[]
pageELT Case Expression
addpanelfalse

Overview

You can use this Snap to load files/data from Amazon S3 buckets or Azure Cloud Storage containers to a Snowflake, Azure Synapse database, or a Databricks Lakehouse Platform (DLP) instance. You can also use this Snap to load or transfer data from:

  • Amazon S3 buckets to the Amazon Redshift database.

  • Amazon Redshift or Google Cloud Storage to BigQuery tables. 

  • A DBFS folder to a DLP table.

Refer to ELT Source-Target-Hosting Support Matrix for complete information. 

Note

If you want to use the COPY INTO command in the ELT Execute Snap for loading data into the target database, you need to pass (expose) your account credentials inside the SQL statement. You can use the ELT Load Snap instead to avoid this security risk.
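For context, here is a minimal sketch of the kind of statement this refers to, assuming a Snowflake target and an S3 source; the bucket, table, and key names are hypothetical. Running such a COPY INTO through the ELT Execute Snap requires the credentials to appear in clear text inside the SQL, whereas the ELT Load Snap reads them from its account settings.

Code Block
languagesql
-- Hypothetical COPY INTO issued through ELT Execute: the S3 credentials
-- must be written directly into the SQL statement.
COPY INTO SALES_DB.PUBLIC.ORDERS
FROM 's3://example-bucket/app_data/'
CREDENTIALS = (AWS_KEY_ID = '<access_key_id>' AWS_SECRET_KEY = '<secret_access_key>')
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);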


Snap Type

The ELT Load Snap is a Write-type Snap that writes/loads data from a source file/table to a table in the destination/target CDW.

Prerequisites

  • Valid accounts and access permissions to connect to one source and one target in the following source and target locations (Refer to the ELT Source-Target-Hosting Support Matrix for the supported source-target combinations):

    • Source: Amazon S3, Azure Cloud Storage, DBFS location, or Google Cloud Storage

    • Target: Snowflake, Redshift, Azure Synapse, DLP, or BigQuery

Limitations

Cfm tabs page
colorPalettesDefault
primaryTabtrue
tabsPageTitleSnowflake
color#0052CC
hasBordertrue
iddgx1yzdri3n
backgroundColorPaletteDefault
  • Due to a limitation in Snowflake in inferring the source files (and thereby the source schema) based on file name patterns in a Storage Integration-based session, the Snap fails to load data from any supported source location into your Snowflake target instance based on any File Name Pattern that you specify. As a workaround, we recommend that you use the File List field set to define the source data file locations for the Snap to load the data.

  • Snowflake does not support COPY statements that result in data transformation in any of the validation modes it allows. Therefore, the ELT Load Snap fails during pipeline validation when the number of columns you define in the Target to Source Column Map field set is not the same as the number of columns in the target table. But the same pipeline runs successfully because this limitation does not apply to pipeline execution.

    • As a workaround for this limitation, ensure that you specify error_on_column_count_mismatch=false in the Format Option field under the File Format Option List field set.

Cfm tabs page
colorPalettesDefault
tabsPageTitleDLP
color#0052CC
id348p80xg72o
backgroundColorPaletteDefault
  • If the CSV file you want to load into a DLP instance contains null values, you need to define the Format Option in the File Format Option List field set as sep = ','  without which the load operation does not succeed.

Cfm tabs page
colorPalettesDefault
tabsPageTitleGoogle BigQuery
color#0052CC
idl97ljlflqen
backgroundColorPaletteDefault
  • The ELT Snap Pack does not support the Legacy SQL dialect of Google BigQuery. We recommend that you use only BigQuery's Standard SQL dialect in this Snap.

  • Unlike other Write-type Snaps, the Snap does not display the end SQL query in its output preview when connected to a BigQuery instance. This is because BigQuery makes use of APIs instead of an SQL query to load data.

Known Issues

Cfm tabs page
colorPalettesDefault
primaryTabtrue
tabsPageTitleAll CDWs
color#0052CC
hasBordertrue
idl48z4834xn
backgroundColorPaletteDefault
  • Suggestions displayed for the Schema Name field in this Snap are from all databases that the Snap account user can access, instead of the specific database selected in the Snap's account or Settings.

  • In any of the supported target databases, this Snap does not appropriately identify or render column references beginning with an _ (underscore) inside SQL queries/statements that use the following constructs and contexts (the Snap works as expected in all other scenarios):

    • WHERE clause (ELT Filter Snap)

    • WHEN clause

    • ON condition (ELT Join, ELT Merge Into Snaps)

    • HAVING clause

    • QUALIFY clause

    • Insert expressions (column names and values in ELT Insert Select, ELT Load, and ELT Merge Into Snaps)

    • Update expressions list (column names and values in ELT Merge Into Snap)

    • Secondary AND condition

    • Inside SQL query editor (ELT Select and ELT Execute Snaps)

Info

As a workaround while using these SQL query constructs, you can:

  • Precede this Snap with an ELT Transform Snap to re-map the '_' column references to suitable column names (that do not begin with an _ ) and reference the new column names in the next Snap, as needed.
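A minimal sketch of the kind of re-mapping the ELT Transform Snap would effectively apply (the table and column names below are hypothetical): the underscore-prefixed columns are aliased to safe names before they reach the constructs listed above.

Code Block
languagesql
-- Hypothetical re-mapping: alias the underscore-prefixed columns so that
-- downstream clauses (WHERE, ON, HAVING, and so on) reference safe names.
SELECT
    _order_id AS order_id,
    _order_total AS order_total
FROM STAGING_ORDERS;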

Cfm tabs page
colorPalettesDefault
primaryTabfalse
tabsPageTitleSnowflake
color#0052CC
hasBordertrue
id9f1eacsrm7r
backgroundColorPaletteDefault
  • The ELT Load Snap cannot automatically infer schema from GZIP-compressed source files located in an ADLS Gen2 storage location for loading the data into Snowflake target tables.

  • The ELT Load Snap fails to complete the load operation on the target Snowflake instance when you choose to automatically infer the schema from the source data available in Microsoft Azure Storage Blob (WASB), specified as a File Name Pattern, and accessed using the Source Location Credentials. The Snap returns the error: Azure null Storage Account incorrectly configured. But the same Snap configuration with the source files specified using File List instead of a File Name Pattern succeeds without any issues.

Cfm tabs page
colorPalettesDefault
tabsPageTitleAmazon Redshift
color#0052CC
idn1p7s3ybg0c
backgroundColorPaletteDefault
  • The ELT Load Snap does not infer accurate Date & Timestamp data types for Redshift.

  • Due to an issue with the Redshift COPY INTO SQL for PARQUET input files table schema management, the Snap fails to perform ALTER TABLE actions before loading data from a PARQUET source file to the target table in Redshift. This issue does not occur when the source file is in CSV or another format.

  • The ELT Load Snap displays JSON values for fields with datatypes Date and Time, when validating the pipeline.

Expected Output | Validation output (Redshift)

”time_data” : “09:23:56” | ”time_data” : {“time”: “09:23:56”}

”date_data” : “2020-12-20” | ”date_data” : “2020-12-20”

However, the ELT Load Snap outputs the correct date and time values when executing the pipeline.

Info

To prevent these values from showing up in JSON format in the previews and ELT Transform Snap, you can cast the time or date values to a VARCHAR type in your custom query as follows: 
CAST (<time_column_type_name> AS VARCHAR(<size_based_on_precision_required>)).
Example: CAST (time_data AS VARCHAR(20)).
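A minimal sketch of such a custom query, assuming a hypothetical table ORDERS with TIME and DATE columns named time_data and date_data:

Code Block
languagesql
-- Hypothetical custom query: casting TIME/DATE columns to VARCHAR so that
-- validation previews show plain strings instead of JSON objects.
SELECT
    order_id,
    CAST (time_data AS VARCHAR(20)) AS time_data,
    CAST (date_data AS VARCHAR(20)) AS date_data
FROM ORDERS;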

Cfm tabs page
colorPalettesDefault
tabsPageTitleAzure Synapse
color#0052CC
iduuv7qiu4bw
backgroundColorPaletteDefault
  • The ELT Load Snap displays JSON values for fields with datatypes Date and Time, when validating the pipeline.

Expected Output | Validation output (Azure Synapse)

”time_data” : “09:23:56” | ”time_data” : {“time”: “09:23:56”}

”date_data” : “2020-12-20” | ”date_data” : {“date”: “2020-12-20”}

However, the ELT Load Snap outputs the correct date and time values when executing the pipeline.

Info

To prevent these values from showing up in JSON format in the previews and ELT Transform Snap, you can cast the time or date values to a VARCHAR type in your custom query as follows: 
CAST (<time_column_type_name> AS VARCHAR(<size_based_on_precision_required>))
Example: CAST (time_data AS VARCHAR(20))

Cfm tabs page
colorPalettesDefault
tabsPageTitleDLP
color#0052CC
id348p80xg72o
backgroundColorPaletteDefault
  • ELT pipelines built using an ELT Load Snap fail to perform the DROP AND CREATE TABLE and ALTER TABLE operations on Delta tables when working with a Databricks SQL persona on the AWS Cloud with the error Operation not allowed: ALTER TABLE RENAME TO is not allowed for managed Delta tables on S3. However, the same action runs successfully with the Data Science and Engineering persona on the AWS Cloud.

    • This is due to the limitation with the Databricks SQL Admin Console that does not allow you to add the configuration parameter spark.databricks.delta.alterTable.rename.enabledOnAWS true to its SQL Warehouse Settings.

  • Due to an issue with the Simba Spark JDBC driver for DLP, you cannot insert a NULL value in the nullable columns of Boolean data type in your DLP instance using any of the Write-type Snaps—ELT Load, ELT SCD2, ELT Insert Select, ELT Merge Into, and ELT Execute.

    • The only workaround currently available for this issue is to upgrade your JDBC driver to databricks-jdbc-2.6.25-1.jar, and use the corresponding JDBC driver class (com.databricks.client.jdbc.Driver) and JDBC URL in your Snap account.
      This latest JDBC driver for DLP uses a JDBC URL structure and driver class that is different from the Simba Spark JDBC driver.

  • When loading data from a CSV file to a target DLP table, the header names in the file must exactly match the column names in the target table. Otherwise, the Snap returns the error—Column names should be the same in the target table and CSV file and aborts the load operation.

  • When loading data from a JSON file into a target DLP instance using this Snap, if you choose the Drop and Create Table option as the Load Action and specify an additional column (that is not available in the JSON file) for the new table, it results in an additional null column being added to the new target table. In the case of CSV files, this Snap behaves the same way as mentioned above, but only when the CSV file uses a comma separator.

  • In case of DLP where CSV files do not have a header (column names), a simple query like SELECT * FROM CSV.`/mnt/csv1.csv` returns default names such as _c0, _c1, _c2 for the columns which this Snap cannot interpret. To avoid this scenario, you can:

    • Write the data in the CSV file to a DLP table beforehand, as in: CREATE TABLE csvdatatable (a1 int, b1 int,…) USING CSV `/mnt/csv1.csv` where a1, b1, and so on are the new column names.

    • Then, read the data from this new table (with column names a1, b1, and so on) using a simple SELECT statement.

  • In case of DLP, all ELT Snaps' preview data (during validation) contains a value with precision higher than that of the actual floating point value (float data type) stored in the Delta. For example, 24.123404659344 instead of 24.1234. However, the Snap reflects the exact values during pipeline executions.

Cfm tabs page
colorPalettesDefault
tabsPageTitleGoogle BigQuery
color#0052CC
idbup8j276xu
backgroundColorPaletteDefault
  • Because BigQuery does not support Automatic Schema Inferencing (ASI) from source files in Amazon S3, the ELT Load Snap fails to produce the required output when you choose Drop and Create table as the Load Action and define the Target to Source Column Map in the Snap’s settings. However, the same configuration works fine for pipeline validation purposes.

  • Due to an issue with BigQuery table schema management (the time travel feature), an ALTER TABLE action (Add or Update column) that you attempt after deleting a column (DROP action) in your BigQuery target table causes the table to break and the Snap to fail.

    • As a workaround, you can consider either avoiding ALTER TABLE actions on your BigQuery instance or creating (CREATE) a temporary copy of your table and deleting (DROP) it after you use it.

  • You cannot add a column to your BigQuery target table with a deleted column name using the ELT Load Snap, as BigQuery reserves deleted column names and data until the pre-configured time travel duration (from 2 through 7 days).

Snap Views

Type

Format

Number of Views

Examples of Upstream and Downstream Snaps

Description

Input 

Document

  • Min: 0

  • Max: 1

  • ELT Transform

  • ELT Copy

  • ELT Select

A document containing the SQL query that you can use from the upstream ELT Snaps. 

Output

Document

  • Min: 0

  • Max: 1

  • ELT Transform

  • ELT Copy

A document containing the status of the load operation and an SQL query that you can use in any downstream ELT Snaps. 

Error

Error handling is a generic way to handle errors without losing data or failing the Snap execution. You can handle the errors that the Snap might encounter when running the pipeline by choosing one of the following options from the When errors occur list under the Views tab:

  • Stop Pipeline Execution: Stops the current pipeline execution if the Snap encounters an error.

  • Discard Error Data and Continue: Ignores the error, discards that record, and continues with the remaining records.

  • Route Error Data to Error View: Routes the error data to an error view without stopping the Snap execution.

Learn more about Error handling in pipelines.

Snap Settings

Info
  • Asterisk ( * ): Indicates a mandatory field.

  • Suggestion icon: Indicates a list that is dynamically populated based on the configuration.

  • Expression icon: Indicates the value is an expression (if enabled) or a static value (if disabled). Learn more about Using Expressions in SnapLogic.

  • Add icon: Indicates that you can add fields in the field set.

  • Remove icon: Indicates that you can remove fields from the field set.

  • Upload icon: Indicates that you can upload files.

Info

SQL Functions and Expressions for ELT

You can use the SQL Expressions and Functions supported for ELT to define your Snap or Account settings with the Expression symbol enabled, where available. This list is common to all supported target CDWs. You can also use other expressions/functions that your target CDW supports.

Field Name

Field Type

Field Dependency

Description

Label*

Default Value: ELT Load

Example: S3 Load

String

None.

Specify a name for the Snap. You can modify this to be more specific, especially if you have more than one of the same Snap in your pipeline.

Get preview data

Default Value: Not selected

Example: Selected

Checkbox

None.

Select this checkbox to include a preview of the query's output. The Snap performs limited execution and generates a data preview during pipeline validation.

Info

For ELT pipelines, only the SQL query flows through the Snaps but not the actual source data. Therefore, the preview data for a Snap is the result of executing the SQL query that the Snap has generated in the pipeline.

The number of records displayed in the preview (on validation) is the smaller of the following:

  • Number of records available on execution of the SQL query generated by the Snap.

  • The value set in the Preview Data Count setting (default is 50 records).

Note

Rendering Complex Data Types in DLP

Based on the data types of the fields in the input schema, the Snap renders complex data types such as map and struct as the object data type and array as an array data type. It renders all other incoming data types as-is, except that values in binary fields are displayed as base64-encoded strings with the string data type.

Database Name

Default Value: N/A

Example: TEST_DB

String

None.

Enter the name of the database in which the target table exists. Leave this blank if you want to use the database name specified in the Default Database Name field in the Account settings.

Schema Name (Not applicable to Databricks Lakehouse Platform)*

Default Value: N/A

Example: SALES

String

Not applicable to DLP.

Enter the name of the database schema. If you do not specify a schema name, the suggestions retrieved for this field contain all schema names in the specified database.

Info
  • Ensure that you include the exact same schema name, including the double quotes if used, when you repeat the schema name in the Target Table Name field.

  • Leave this field blank if your target database is Databricks Lakehouse Platform.

Target Table Name*

Default Value: N/A

Example: SALES_ORDERS

String

None.

Enter the name of the table or view in which you want to perform the load operation. 

Info

Only views that can be updated (have new rows) are listed as suggestions. So, Join views are not included. This also implies that the Snap account user has the Insert privileges on the views listed as suggestions.

Note
  • Ensure that you include the exact same schema name, including the double quotes if used, as specified in the Schema Name field.

  • If the target table or view does not exist during run-time, the Snap creates one with the name that you specify in this field and writes the data into it. During pipeline validation, the Snap creates the new table or view but does not write any records into it.

    • This new table or view is not dropped if a subsequent/downstream Snap failure occurs during validation.

  • You must use double quotes ("") when including special characters, such as hyphens (-), in the name.

  • A table or view name must always start with a letter.

  • Integers and underscores (_) can also be a part of the table or view name.

  • All characters are automatically converted to uppercase by the Snap. Use double-quotes to retain lowercase.
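For illustration (hypothetical names), these rules mean that an unquoted name such as sales_orders is created as SALES_ORDERS, while a double-quoted name such as "sales-orders" is used verbatim:

Code Block
languagesql
-- Hypothetical effect of the naming rules above on the created target table.
CREATE TABLE SALES_ORDERS (ID INT);      -- Target Table Name entered as: sales_orders
CREATE TABLE "sales-orders" (ID INT);    -- Target Table Name entered as: "sales-orders"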

Target Table Hash Distribution Column (Azure Synapse Only)

Default Value: N/A

Example: Var_table

String/Expression

Applicable to Azure Synapse only.

Specify the hash distribution column name for the target table in Azure Synapse, if you choose the Load Action as Drop and Create table.

Info

Azure Synapse needs a table to be always hash-distributed for improved query performance. If the target table is created outside the Snap, you need not specify the target table column name here.

Load Action

Default Value: Append rows to existing table

Example: Alter table

Dropdown list

None.

Select the load action to perform on the target table.

Available options are:

  • Append rows to existing table. Loads the data from the source files into the specified target table.

  • Overwrite existing table. Overwrites the target table with the data from the source files.

  • Drop and Create table. Drops the target table if it exists, creates a new table with the columns provided in the Table Columns field set, and then loads the data from the source files into the target table. Activates the Data Type field in the Table Columns field set.

  • Alter table. Modifies the schema of the target table based on the configuration of the Modifier field in the Table Columns field set and loads the data from the source files into the target table. Activates the Data Type and Modifier fields in the Table Columns field set. 

Note

The Snap does not modify the existing tables during pipeline validation, but if you choose the Drop and Create table option in the Load Action field and the target table does not exist, it creates a new (empty) target table based on the schema specified in the Snap.

Table Columns

This field set enables you to configure the schema of the target table. You can use this field set to add/drop columns in the target table if you select the Drop and Create table or Alter table options in the Load Action field. 

This field set consists of the following fields:

  • Column

  • Data Type

  • Modifier

Column

Default Value: N/A

Example: ID, FNAME

String

Load Action is Drop and Create table or Alter table.

Enter the name of the column that you want to load in the target table. You can also specify the columns to drop if you select the Alter table option in the Load Action field. 

Data Type

Default Value: N/A

Example: INT, VARCHAR

String

Load Action is Drop and Create table or Alter table.

Enter the data type of the values in the specified column. 

Use the VARIANT data type for target table columns that correspond to a map (JSON OBJECT) or array (JSON LIST) in your nested Parquet files.
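For example, assuming a Snowflake target and the hypothetical nested Parquet schema shown later on this page (a tourist record with a nested visits map/list), the Table Columns configuration would translate to a table definition along these lines:

Code Block
languagesql
-- Hypothetical target table: the nested "visits" field from the Parquet
-- source maps to a VARIANT column; flat fields keep their scalar types.
CREATE TABLE TOURIST_VISITS (
    NAME   VARCHAR(100),
    VISITS VARIANT
);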

Modifier

Default Value: Add

Example: Drop

String

Load Action is Alter table.

Select whether you want to add/drop the specified column.

Available options are:

  • Add

  • Drop

Table Option List

This field set enables you to define the table options for creating the target table before performing the load operation. These options vary based on your target CDW. Refer to the Table options for Load action for the complete list of permitted table options. You must specify each table option in a separate row.

  • Table Option

Table Option

Default Value: N/A

Example: OPTIONS(custID=03761)

Expression/Suggestion

Choose a table option that you want to use from the suggested options. Specify the required values for the selected option.

Redshift Schema Name (Only when BigQuery with Source Location as Redshift)

Default Value: N/A

Example: PUBLIC, DEF_SCHEMA

String/Expression

Database Type is BigQuery and Source Location is Amazon Redshift in the Snap Account.

Specify the schema name of the source table in Redshift.

Redshift Table Name (Only when BigQuery with Source Location as Redshift)

Default Value: N/A

Example: DataLoad_Source01

String/Expression

Database Type is BigQuery and Source Location is Amazon Redshift in the Snap Account.

Specify the name of the Redshift source table (from the schema selected above).

Cloud Storage Path

Default Value: N/A

Example: s3://bigbucket/app_data

String/Expression

Specify the fully qualified path to the folder in the Cloud storage in this field, if you want to override the default path defined in the Snap's Account settings. For example:

  • AWS S3 path: s3://<bucket name>/<folder>

  • DBFS folder: /<dbfs folder name>

  • Azure: azure://<accountname>.blob.core.windows.net/<containername>/<azurefolder>

  • Google Cloud Storage: gcs://<bucketname>/<folder>/

File List

This field set enables you to specify the staged files to load. You must specify each file in a separate row. Click + to add a new row.

Info

Use File List for up to 2 files

For the best performance of your load operation, we recommend that you use the File Name Pattern field to load data from multiple files at a source location (S3 or Redshift) instead of specifying the individual file names/paths in the File List field set.

This field set consists of the following field:

  • File 

File

Default Value: N/A

Example: TestData.csv

String

None.

Enter the name of the staged file to load.

File Name Pattern

Default Value: N/A

Example

  • Snowflake: '.*key.*[.]csv'

  • Redshift: './key.*csv'

String

None.

Enter the pattern to use to match the file name and/or absolute path. You can specify this as a regular expression pattern string, enclosed in single quotes, to use the key-based search mechanism.

Info

If your search is based on AWS IAM roles, you cannot use regular expressions with Redshift tables.

Note

Using Wildcard Patterns in BigQuery

BigQuery has its own set of rules to resolve wildcard patterns such as *.* (that you provide in this field). Refer to Batch loading data > Load Wildcards in the BigQuery documentation for the rules on using wildcards in BigQuery.

Refer to Loading Using Pattern Matching for details on Snowflake regular expressions, Examples of COPY INTO (Delta Lake on Databricks) for DLP, or Regular Expressions and the Java Programming Language for using regular expressions with Redshift, Azure Synapse, or BigQuery in this field.

File Format Type

Default Value: None

Example: AVRO

String/Expression/Suggestion

None.

Select the format for the file. The options vary based on the database type in your account settings.

The supported file formats are: 

  • Snowflake: None, AVRO, CSV, JSON, ORC, PARQUET, XML

  • Redshift: None, AVRO, CSV, JSON, ORC, PARQUET

  • Azure Synapse: CSV, ORC, PARQUET

  • DLP: AVRO, CSV, JSON, ORC, PARQUET

  • BigQuery: AVRO, CSV, JSON, ORC, PARQUET

File Format Option List

This field set enables you to define the file format options for the load operation. You must specify each file format option in a separate row.

  • Format Option

Refer to Snowflake Data Loading Commands, Redshift Data Format Parameters, COPY INTO (Transact-SQL) for Azure Synapse Analytics, COPY INTO Format options for DLP, or BigQuery's Batch load data guide specific to your source file format type for the list of options that you can use based on the target database.

Info

Automatic Inferencing of Source Table Schema

The ELT Load Snap allows you to infer the table schema from the source files and then apply the auto-inferred schema to your target tables. This feature is supported on the Redshift target CDW with the source files in an S3 location and on the Snowflake target CDW with the source files on an Amazon S3, Azure Blob Storage (WASB), ADLS Gen2, or Google Cloud Storage location. Learn more at Automatic Schema Inference with the ELT Load Snap.

Format Option

Default Value: N/A

Example: SKIP_HEADER = 1

String/Expression/Suggestion

Based on the combination of Database Type and File Format Type.

Choose a file format option that you want to use during the load operation from the suggested values. Specify the required value for the selected option.

Note

Loading a CSV file with nulls into a DLP instance

If the CSV file you want to load into a DLP instance contains null values, ensure that you define sep = ',' as the Format Option without which the load operation fails.

Encryption Type

Default Value: None

Example: Server-Side Encryption

String/Expression/Suggestion

None.

Select the encryption type that you want to use for the loaded data and/or files.

Available options are:

  • None

  • Server-Side Encryption (Not supported by Azure Synapse)

  • Server-Side KMS Encryption (Not supported by Azure Synapse)

Note

Server-side encryption is available only for S3 accounts and Snowflake.

KMS Key

Default Value: N/A

Example: WC2YY-2BFMV-KHKB3-FDDGQ-8NRQY

String/Expression

Encryption Type is Server-Side KMS Encryption.

Enter the KMS key to use to encrypt the files. 

Note

Not required for Redshift databases. The load operation automatically recognizes and loads the encrypted files. Refer to Loading encrypted files from Amazon S3 for details. 

Copy Option List

This field set enables you to define the COPY options for the load operation. You must specify each COPY option in a separate row. Click + to add a new row.

This field set comprises the following field:

  • Copy Option

See COPY Options for Snowflake, COPY Parameter Reference for Redshift, COPY INTO (Transact-SQL) for Azure Synapse Analytics, or COPY_OPTIONS under COPY INTO Parameters for DLP.

Copy Option

Default Value: N/A

Example: ON_ERROR = ABORT_STATEMENT

String/Expression/Suggestion

None.

Choose a COPY option from the suggested values to use if an error occurs during the load operation. Specify the required value for the selected option.

Validation Mode

Default Value: None

Example: RETURN_n_ROWS

String/Expression/Suggestion

None.

Select an output preview type for the data validation performed as part of the pipeline validation. The options vary based on the database type (Snowflake/Redshift) in your account settings.

Available options for Snowflake are:

  • None

  • RETURN_n_ROWS

  • RETURN_ERRORS

  • RETURN_ALL_ERRORS

Available options for Redshift are:

  • None

  • NOLOAD

For more details, refer to Validating Staged Files for Snowflake and Validating Input Data for Redshift.

Info

No specific validation modes are available for Azure Synapse, DLP, and Google BigQuery.

Rows To Return

Default Value: N/A

Example: 20

String (Number)

Validation Mode is RETURN_n_ROWS.

Enter the number of records to retrieve from the target table to validate whether the load operation was successful.

Enable source column position (Snowflake and Azure Synapse Only)

Default Value: Not selected

Example: Selected

Checkbox

Database Type is Snowflake or Azure Synapse.

Select this checkbox to specify the source table column’s position in the Target to Source Column Map instead of its name.

  • This is applicable only to Snowflake and Azure Synapse target CDWs. For other target CDWs (Redshift, DLP, and BigQuery), you can leave this checkbox unchecked and specify source column names.

  • Selecting this checkbox renames the Source File Column Name field in the Target to Source Column Map fieldset below to Source Column Position (Valid Only For Snowflake and Azure Synapse).

Target to Source Column Map

This field set enables you to define the mapping between the columns in the target table and the corresponding columns in the source file. You must define each mapping in a separate row.

The available options are:

  • Target Table Column Name

  • Source File Column Name

Target Table Column Name

Default Value: N/A
Example: ORDER_ID

String/Expression/Suggestion

None.

Choose a column from the list of target table columns displayed. Ensure that you have defined the Schema Name or the Target Table Name to view the list of columns in this field's suggestions.

Source File Column Name

Default Value: N/A
Example: ORD_ID

String/Expression/Suggestion

None.

Choose a column from the list of source file columns displayed. Ensure that you have defined the Schema Name (or source file path) to view the list of columns in this field's suggestions.

Source Column Position (Valid Only For Snowflake and Azure Synapse)

Default Value: N/A
Example: (An integer) 1, 3, 4

Integer/Expression

Database Type is Snowflake or Azure Synapse.

Specify the column’s position in the source file.

Table options for Load action

Cfm tabs page
indexInGroup1
tabsPageTitleSnowflake
primaryTabtrue
colorPalettesDefault
color#0052CC
fixedtrue
idd8ud3og4gcm
hasBordertrue
  • CLUSTER BY ( < expr > [, < expr >,...])

  • STAGE_FILE_FORMAT = ( { FORMAT_NAME = '<file_format_name>' | TYPE = { CSV | JSON | AVRO | ORC | PARQUET | XML } [ formatTypeOptions ] } )

  • STAGE_COPY_OPTIONS = ( copyOptions )

  • DATA_RETENTION_TIME_IN_DAYS = < integer >

  • MAX_DATA_EXTENSION_TIME_IN_DAYS = < integer >

  • ENFORCE_LENGTH = TRUE | FALSE

  • DEFAULT_DDL_COLLATION = '<collation_specification>'

  • COPY GRANTS

  • [ WITH ] ROW ACCESS POLICY < policy_name > ON ( < col_name > [, <col_name>... ] )

  • [ WITH ] TAG ( <tag_key> = '<tag_value>' [, <tag_key> = '<tag_value>', ...] )

  • COMMENT = '<string_literal>'

Refer to CREATE TABLE - Snowflake Documentation for details about these options.
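For illustration, here is a hedged sketch of how a couple of these options could appear in the CREATE TABLE statement issued for a Drop and Create table load action on Snowflake; the table, columns, and option values are hypothetical.

Code Block
languagesql
-- Hypothetical Snowflake CREATE TABLE combining two of the options above.
CREATE OR REPLACE TABLE SALES_ORDERS (
    ORDER_ID   INT,
    ORDER_DATE DATE
)
CLUSTER BY (ORDER_DATE)
DATA_RETENTION_TIME_IN_DAYS = 1
COMMENT = 'Created via the ELT Load Snap Table Option List';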

Cfm tabs page
indexInGroup2
tabsPageTitleAmazon Redshift
colorPalettesDefault
color#0052CC
fixedtrue
idk06v531tq3g
  • DISTSTYLE { AUTO | EVEN | KEY | ALL }

  • DISTKEY ( column_name )

  • [COMPOUND | INTERLEAVED ] SORTKEY ( column_name [,...]) | [ SORTKEY AUTO ]

  • ENCODE AUTO

Refer to CREATE TABLE - Amazon Redshift for details about these options.

Cfm tabs page
indexInGroup3
tabsPageTitleAzure Synapse
colorPalettesDefault
color#0052CC
fixedtrue
idt87b25d4gk
  • DISTRIBUTION = HASH ( distribution_column_name ) | DISTRIBUTION = ROUND_ROBIN | DISTRIBUTION = REPLICATE

  • PARTITION ( partition_column_name RANGE [ LEFT | RIGHT ] FOR VALUES ( [ boundary_value [,...n] ] ) )

  • CLUSTERED COLUMNSTORE INDEX | CLUSTERED COLUMNSTORE INDEX ORDER (column [,...n]) | HEAP | CLUSTERED INDEX ( { index_column_name [ ASC | DESC ] } [ ,...n ] )

Refer to CREATE TABLE (Azure Synapse Analytics) - SQL Server for details about these options.

Cfm tabs page
indexInGroup4
tabsPageTitleDLP
colorPalettesDefault
color#0052CC
fixedtrue
idneual45i3z
  • PARTITIONED BY ( { partition_column [ column_type ] } [, ...] )

  • CLUSTERED BY ( cluster_column [, ...] ) [ SORTED BY ( { sort_column [ ASC | DESC ] } ) ] INTO num_buckets BUCKETS

  • LOCATION path

  • OPTIONS ( { option_key [ = ] option_val } [, ...] )

  • TBLPROPERTIES ( { property_key [ = ] property_val } [, ...] ) }

Refer to CREATE TABLE [USING] - Azure Databricks - Databricks SQL for details about these options.

Cfm tabs page
indexInGroup5
tabsPageTitleGoogle BigQuery
colorPalettesDefault
color#0052CC
fixedtrue
id3hl3v0q4m5j
  • PARTITION BY _PARTITIONDATE | <date_column> | DATE({ <timestamp_column> | <datetime_column> }) | DATETIME_TRUNC(<datetime_column>, { DAY | HOUR | MONTH | YEAR }) | TIMESTAMP_TRUNC(<timestamp_column>, { DAY | HOUR | MONTH | YEAR }) | TIMESTAMP_TRUNC(_PARTITIONTIME, { DAY | HOUR | MONTH | YEAR }) | DATE_TRUNC(<date_column>, { MONTH | YEAR }) |  RANGE_BUCKET(<int64_column>, GENERATE_ARRAY(<start>, <end>[, <interval>]))

  • CLUSTER BY <column1> [, <column2>, ...<column4>]

  • OPTIONS(<name>=<value> [, ... ])

  • COMMENT '<This is a comment>'

Refer to Data definition language (DDL) statements in GoogleSQL  |  BigQuery  |  Google Cloud for details about these options.

Using file format options with ELT Load

Here are a few common file format options that can be used for the corresponding CDW when loading data from CSV files in a storage location to a table in that CDW.

| Target CDW | Exclude header row from the input data stream | Specify row delimiters |
| --- | --- | --- |
| Azure Synapse | FIRSTROW = 2 | ROWTERMINATOR = '0X0A' (CSV file created on MacOS) or ROWTERMINATOR = '\n' (CSV file created on Windows) |
| DLP | 'header' = 'true' | lineSep = None (\r), \r\n, or \n |
| Google BigQuery | skip_leading_rows = 1 | N/A (set allow_quoted_newlines = true if quoted data sections in a CSV file contain newline characters) |
| Redshift | IGNOREHEADER 1 | N/A |
| Snowflake | SKIP_HEADER = 1 | RECORD_DELIMITER = '<character>' or NONE |
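To illustrate where such options surface, here is a sketch of the kind of Snowflake COPY INTO statement that the header and delimiter options above correspond to; the table name, stage, and folder are assumptions made for the example.

Code Block
languagesql
-- Illustrative sketch: loading CSV files while skipping the header row and
-- using a newline record delimiter, as per the Snowflake options above.
COPY INTO sales_orders
FROM @my_s3_stage/orders/
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 RECORD_DELIMITER = '\n');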

Load data from AVRO/ORC/JSON/Parquet files

Data in your AVRO/ORC/JSON/Parquet files can exist as any combination of the following structures: Flat data, nested lists, and nested maps.

  • Flat file fields contain data in a simple two-dimensional spreadsheet-like format (Level 0). Such fields can be read and loaded using the standard mapping and loading methods.

  • File fields in nested lists or maps contain structured and hierarchical data where values for each field can be:

    • A series of repeated fields (Level 1),

    • A series of values (Level 1),

    • Another set of key-value pairs (Level 2), or

    • A set of keys with each or some of them containing a nested set of key-value pairs (Level 3), and so on.

Lists vs. Maps in nested files

  • The lists are very similar to the JSON arrays in their notation and are characterized by square brackets [ and ] in the source files.

  • The maps are very similar to the JSON objects in their notation and are characterized by curly braces { and } in the source files.

  • However, we recommend that you reference the actual AVRO/ORC/Parquet file to create your source-to-target field mappings and not its JSON-converted version.

Different Parquet file structures

The table below depicts the possible schema variations within a Parquet file with examples of source table data. It also indicates how to reference the fields (keys in the source files) and map them with the corresponding target table columns to load your data.

Source Files (AVRO/ORC/JSON/Parquet)

Canonical format: Flat data | Nested List | Nested Map

Non-canonical format: Nested Map

Sample schema (column definitions)

Code Block
languagepy
Field(name: "name", datatype: String)
Field(name: "age", datatype: Int32)
Field(name: "city", datatype: String)
Code Block
languagepy
Field(name: "tourist", datatype: Struct[
    Field(name: "name", datatype: String),
    Field(name: "visits", datatype: List[String, String])
])
Code Block
languagepy
Field(name: "tourist", datatype: Struct{
    Field(name: "name", datatype: String),
    Field(name: "visits", datatype: Map[String, String])
})
Code Block
languagepy
Field(name: "tourist", datatype: Struct{
    Field(name: "fname.lname", datatype: String),
    Field(name: "visits.made", datatype: Map[String, String])
})

Example Data

Code Block
+-------+----+--------+
| name  | age| city   |
+-------+----+--------+
| Mason | 54 | London |
| Cindy | 73 | Austin |
| Lewis | 28 | Sydney |
+-------+----+--------+
Code Block
{
    name: "Mason", 
    visits: 
    [
      {"cname":"Sweden", "pname": "Berne"}, 
      {"cname": "Mexico", "pname": "Mexico City"}, 
      {"cname": "Russia", "pname": "Moscow"}
    ]  
}
Info

Lines 3 through 8 in this example denote the nested list “visits”.

Code Block
{
    name: "Mason", 
    visits: 
    {
      "place": 
      [
        {"pname": Berne", "year": "2017"}, 
        {"pname": "Mexico City", "year": "2019"}, 
        {"pname": "Moscow", "year": "2020"}
       ]  
     }
}
Info

Lines 3 through 11 in this example denote the nested map “visits”.

Code Block
{
    fname.lname: "Kyle Mason", 
    visits.made: 
    {
      "place": 
      [
        {"pname": Berne", "year": "2017"}, 
        {"pname": "Mexico City", "year": "2019"}, 
        {"pname": "Moscow", "year": "2020"}
       ]  
     }
}

Referencing columns from source table schema in the Snap

Flat data: name | age | city

Nested List (canonical): name | visits:cname | visits:pname

Nested Map (canonical): name | visits.place:pname | visits.place:year

Nested Map (non-canonical): "fname.lname" | "visits.made".place:pname | "visits.made".place:year

Info

Use colon ':' to read and load data from child columns of a nested list in your Parquet file.

Info

Use period '.' to read and load data from child columns of a nested map structure in your Parquet file.

Info
  • Use double quotes " for column names that contain special characters, to read and load data from these columns.

  • Use period '.' to read and load data from child columns of a nested map structure in your Parquet file.

Consider the following ELT Load Snap configuration:

ELT_Load_Snap_Config_Parquet_1.pngImage Added

Besides the standard and mandatory settings in this Snap, the following settings are unique and pivotal to reading and loading data from Parquet files:

  • Load Action

  • Target Table Columns (for the first load and for any alter table actions thereafter)

    • New target table definition and table column alterations.

  • File List > File (for individual file loads)

  • File Name Pattern (to load multiple source files of one format that follow a specific naming and path pattern)

  • File Format Type = AVRO/ORC/JSON/PARQUET

  • Copy Option List (to load data based on MATCH_BY_COLUMN_NAME = Case-sensitive or Case-insensitive with no column mapping specified). This applies to flat files, not to files with nested data; see the sketch after this list.

  • Target to Source Column Map

    • Mapping some or all source columns to corresponding columns in the target table.

You must determine the structure (canonical or non-canonical) and schema of your Parquet files (described above) and fill in the target table details: column names, data types, and the source-to-target column mapping.
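For reference, when you rely on the Copy Option List with MATCH_BY_COLUMN_NAME instead of an explicit column map, the resulting load on Snowflake roughly takes the following shape; the table, stage, and file names are assumptions made for the example.

Code Block
languagesql
-- Illustrative sketch: loading a flat Parquet file into Snowflake by matching
-- source column names to target column names, with no explicit column map.
COPY INTO sales_orders
FROM @my_stage/orders.parquet
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;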

Learn more in the example pipeline Load data from a canonical format Parquet file below.

Supported data structures and Target CDWs

The ELT Load Snap currently supports loading data from these semi-structured file formats (AVRO/ORC/JSON/Parquet) stored in Azure Storage to Snowflake and DLP.

Cfm tabs page
indexInGroup1
tabsPageTitleSnowflake
primaryTabtrue
colorPalettesDefault
color#0052CC
fixedtrue
idul2qpdp8gsd
hasBordertrue

The following table provides further details on the granular-level support to load data into Snowflake tables.

| File structure | AVRO | ORC | JSON | PARQUET |
| --- | --- | --- | --- | --- |
| Flat data | Yes | Yes | Yes (Plain, JSON Lines) | Yes |
| Nested list with a fixed number of elements | Yes | Yes | Yes | Yes |
| Nested list with a variable number of elements | Yes | Yes | Yes | Yes |
| Nested map | Yes | N/A | Yes | Yes |
| Non-canonical format | Yes | N/A | N/A | Yes |

Cfm tabs page
indexInGroup2
tabsPageTitleDLP
colorPalettesDefault
color#0052CC
fixedtrue
hasBordertrue
id94l29cfhlzs

The following table provides further details on the granular-level support to load data into DLP tables.

| File structure | AVRO | ORC | JSON | PARQUET |
| --- | --- | --- | --- | --- |
| Flat data | Yes | Yes | Yes (JSON Lines only) | Yes |
| Nested list with a fixed number of elements | Yes | Yes | Yes (JSON Lines only) | Yes |
| Nested list with a variable number of elements | Yes | Yes | Yes (JSON Lines only) | No |
| Nested map | Yes | N/A | Yes | Yes |
| Non-canonical format | Yes | N/A | N/A | Yes |

Snap Behavior During Pipeline Validation

This Snap's behavior during pipeline validation varies for each Load Action selected.

| Load Action | Target Table exists? | Validation Result | Action on Target Table# |
| --- | --- | --- | --- |
| Append rows to existing table | No | Success | Creates an empty target table based on the schema information available in the Snap settings. |
| Append rows to existing table | Yes | Success | No updates made. |
| Overwrite existing table | No | Success | Creates an empty target table based on the schema information available in the Snap settings. |
| Overwrite existing table | Yes | Success | No updates made. |
| Drop and Create table | No | Success | Creates an empty target table based on the schema information available in the Snap settings. |
| Drop and Create table | Yes | Success | No updates made. |
| Alter table | No | Success | Creates an empty target table based on the schema information available in the Snap settings. |
| Alter table | Yes | Success | No updates made. |

# Bulk load operations are not done during pipeline validation.

Cfm tabs page
colorPalettesDefault
primaryTabtrue
tabsPageTitleSnowflake
color#0052CC
idbz0r7nehnb
hasBordertrue
backgroundColorPaletteDefault

The data is not loaded into the Snowflake table. It is only validated using the RETURN_n_ROWS validation mode. The number of records in the output is determined by the preview count set in the pipeline settings. The Snap creates the target table (if it does not already exist) in the database when you select Load Action as Drop and Create table. For all other load actions, the Snap verifies if the specified target table exists in the database.
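For reference, this validation roughly corresponds to a Snowflake COPY INTO statement of the following shape, where n is the preview count; the table, stage, and file format details are assumptions made for the example.

Code Block
languagesql
-- Illustrative sketch: validating (not loading) the source files and returning
-- up to 10 rows, per Snowflake's RETURN_n_ROWS validation mode.
COPY INTO sales_orders
FROM @my_stage/orders/
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
VALIDATION_MODE = RETURN_10_ROWS;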

Cfm tabs page
colorPalettesDefault
tabsPageTitleAmazon Redshift
color#0052CC
idb98ckl3hgoe
backgroundColorPaletteDefault

The data is not loaded into the Redshift table. The files in S3 are only validated using the NOLOAD validation mode to verify the data integrity of the files. The target table is also affected based on your selection in the Load Action field. The Snap creates the target table (if it does not already exist) in the database when you select Load Action as Drop and Create table. For all other load actions, the Snap verifies if the specified target table exists in the database.
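For reference, this validation roughly corresponds to a Redshift COPY statement with the NOLOAD option; the table name, S3 path, and IAM role are assumptions made for the example.

Code Block
languagesql
-- Illustrative sketch: checking data integrity of the S3 files without loading
-- them into the target table, per Redshift's NOLOAD validation mode.
COPY sales_orders
FROM 's3://my-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS CSV
IGNOREHEADER 1
NOLOAD;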

Cfm tabs page
colorPalettesDefault
tabsPageTitleAzure Synapse
color#0052CC
idwrcpabvwyyh
backgroundColorPaletteDefault

The data is not loaded into the Azure Synapse table. The Snap creates the target table (if it does not already exist) in the database when you select Load Action as Drop and Create table. For all other load actions, the Snap verifies if the specified target table exists in the database.

Cfm tabs page
colorPalettesDefault
tabsPageTitleDLP
color#0052CC
idi5d64y0zk9
backgroundColorPaletteDefault

The data is not loaded into the DLP table. The Snap creates the target table (if it does not already exist) in the database when you select Load Action as Drop and Create table. For all other load actions, the Snap verifies if the specified target table exists in the database.

Cfm tabs page
colorPalettesDefault
tabsPageTitleGoogle BigQuery
color#0052CC
idqmzzmmaptp
backgroundColorPaletteDefault

The data is not loaded into the BigQuery table. The Snap creates the target table (if it does not already exist) in the database when you select Load Action as Drop and Create table. For all other load actions, the Snap verifies if the specified target table exists in the database.

Troubleshooting

Cfm tabs page
colorPalettesDefault
tabsPageTitleCommon Errors (All CDWs)
primaryTabtrue
color#0052CC
hasBordertrue
id9x3emex3tnn
backgroundColorPaletteDefault

Error

Reason

Resolution

File read error while uploading data from a CSV file

This can happen when the source CSV file contains a header row which the target database cannot recognize/interpret. 

Depending on your target database, use the respective File Format Option in the ELT Load Snap.

  • For Snowflake: SKIP_HEADER = 1

  • For Azure Synapse: FIRSTROW = 2

  • For DLP: 'header' = 'true'

  • For Google BigQuery: skip_leading_rows = 1

  • For Redshift: IGNOREHEADER 1

Database cannot be blank.

(when seeking the suggested list for Schema Name field)

Suggestions in the Schema Name and Target Table Name fields do not work when you have not specified a valid value for the Database Name field in this Snap.

Specify the target Database Name in this Snap to view and choose from a suggested list in the Schema Name and Target Table Name fields respectively.

Cfm tabs page
colorPalettesDefault
primaryTabfalse
tabsPageTitleSnowflake
color#0052CC
hasBordertrue
idd2nm5s7mcih
backgroundColorPaletteDefault

Error

Reason

Resolution

Time '"01:01:23"' is not recognized.

The CSV file you are trying to load into the Snowflake database contains single or double quotes for time and/or date values. Hence, the ELT Load Snap is unable to read and insert these values into the target table.

After you specify the CSV files in the ELT Load Snap’s settings, add an entry in the File Format Options field set as FIELD_OPTIONALLY_ENCLOSED_BY='\'' or FIELD_OPTIONALLY_ENCLOSED_BY='\"' to match the quotes used in the CSV files for the date and/or time values. The Snap then ignores these quotes when reading the values from the CSV files.

SQL exception from Snowflake: Syntax error in one or more positions in the SQL query.

Column names in Snowflake tables are case-sensitive. Snowflake stores all column names in uppercase unless they are surrounded by quotes at the time of creation, in which case the exact casing is preserved. See Identifier Requirements — Snowflake Documentation.

Ensure that you follow the same casing for the table column names across the pipeline.

Cfm tabs page
colorPalettesDefault
tabsPageTitleAmazon Redshift
color#0052CC
hasBordertrue
id5hahlhfd4my
backgroundColorPaletteDefault

Error

Reason

Resolution

Loading data into a Redshift table using the RegEx pattern *.* in File Name Pattern field results in the operation being performed twice.

Unlike the other CDWs, the Redshift COPY INTO SQL does not natively support a file pattern option, so SnapLogic supports this capability externally through Java Regular Expressions (RegEx).

The RegEx *.* returns matching results from the root location instead of the folder location specified before *.* in the File Name Pattern field.

Use the RegEx csv/person/.+\..+ in the File Name Pattern field to load data from all the files available in a specific source folder.

Cfm tabs page
colorPalettesDefault
tabsPageTitleAzure Synapse
color#0052CC
hasBordertrue
idoge3sfknrks
backgroundColorPaletteDefault

Error

Reason

Resolution

<To be added>

Cfm tabs page
colorPalettesDefault
tabsPageTitleDLP
color#0052CC
hasBordertrue
idppkuwmfh5ao
backgroundColorPaletteDefault

Error

Reason

Resolution

Invalid DBFS folder name

You have not specified an appropriate DBFS path in the DBFS Folder path field in the Snap's (DLP) account settings.

Edit the Snap's account settings to ensure that you have provided a valid DBFS Folder path.

Keyword RANGE is not acceptable as a column name.

This can happen with any reserved keyword if it is used as a column/field name in the table to be created.

Ensure that you enclose such column names (reserved keywords) in backticks (`). For example: `RANGE` STRING.

[Simba][SparkJDBCDriver](500051) ERROR processing query/statement. Error Code: 0

Cannot create table ('<schema name>`.`<table name>`'). The associated location (`…<table name>`) is not empty but it's not a Delta table.

The specified location contains one of the following:

  • A non-Delta table (such as CSV, ORC, JSON, PARQUET)

  • A corrupted table

  • A Delta table with a different table schema

So, the Snap/pipeline cannot overwrite this table with the target table as needed.

Ensure that you take appropriate action (mentioned below) on the existing table before running your pipeline again (to create another Delta table at this location).

Move or drop the existing table from the schema manually using one of the following commands:

Access the DBFS through a terminal and run:

  • dbfs mv dbfs:/<current_table_path> dbfs:/<new_table_path> to move the table or

  • dbfs rm -r dbfs:/<table_path> to drop the table.

OR

Use a Python notebook and run:

  • dbutils.fs.mv(from: String, to: String, recurse: boolean = false): boolean to move the table/file/directory or

  • dbutils.fs.rm(dir: String, recurse: boolean = false): boolean to drop the table/file/directory.

Column names should be the same in the target table and CSV file. 

One or more header names in your source CSV file do not match with the corresponding target table column names. 

Ensure that each of the column names in your target table exactly matches the corresponding header name used in the CSV file or vice versa before you retry running the pipeline.

Cfm tabs page
colorPalettesDefault
tabsPageTitleGoogle BigQuery
color#0052CC
hasBordertrue
idatlmc3vhj36
backgroundColorPaletteDefault

Error

Reason

Resolution

Unknown Source / Not found: <path>

BigQuery has a unique way of resolving wildcard patterns like *.* (that you provide in the File or File Name Pattern fields of this Snap).

Make sure that the file name pattern you provide conforms to the rules defined in Batch loading data > Load Wildcards section of BigQuery documentation.

Keyword ORDER is not acceptable as a column name in the JSON file.

This can happen with any reserved keyword if it is used as a column/field name in the table to be created.

Ensure that you enclose such column names (reserved keywords) between backticks (`). For example: `ORDER`.

Error code 400: Column <column_name> was recently deleted in the table <table_name>.

BigQuery reserves deleted column names for up to the time travel duration (2 through 7 days) as set by your DBA.

Use a different column name to complete your Load action.

Examples

Loading Files from S3 Bucket to Snowflake

We need to perform Snowflake ELT transformations upon a file present in an S3 bucket. Therefore, we must first load the file into the Snowflake database from the source location. This pipeline shows how we can use the ELT Load Snap to accomplish this task.

Image Added

Before we can load the file, we must configure the ELT account to read from the source S3 bucket by configuring the Source Location, Authentication, Access Key, Secret Key, S3 Bucket, and S3 Folder fields. The fields vary based on the source location and authentication. See Configuring ELT Database Accounts for details. 

Image Added

We configure the ELT Load Snap as shown below:

Image Added

Let us understand this Snap's configuration:

  • The Database Name, Schema Name, and Target Table Name specify the target database, schema, and table into which we want to write the file's data.

  • Table Action indicates that if a table with the same name exists, the Snap must drop the table and create a new one in its place.

  • Since this is a new table, the Table Columns field set specifies the table's schema.

  • The File List field set specifies the source file's path in the S3 bucket and the file's format.

  • The Format Option field set specifies that the Snap must ignore the file's header. Otherwise, the header is also loaded into the target table.

  • The Copy Option field set specifies that the Snap must abort the loading operation if it encounters an error.

  • The Validation Mode and Rows to Return fields specify that 10 rows from the target table be retrieved upon successful completion.

To validate a successful execution, the ELT Load Snap executes a SELECT * command on the target table and returns 10 rows:

Image Added

We can either create a separate pipeline to perform ELT operations on the table or add appropriate Snaps downstream of the ELT Load Snap in the same pipeline.

Download this pipeline.

Uploading data from a CSV File in Azure Data Lake Gen2 Storage to Azure Synapse

The following one-Snap (ELT Load) pipeline connects to an Azure Data Lake Gen2 Storage using an ADLS Gen2 Basic Auth account to upload a CSV file from the storage to an Azure Synapse database. This CSV file is created on MacOS with the default line endings.

Image Added

Account Configuration

Snap Configuration

Image AddedImage Added

For the pipeline to work as expected with a CSV file created on MacOS with the default (MacOS) line endings, define the value for the file format option ROWTERMINATOR as '0X0A' instead of the usual '\n'. With '\n' as the ROWTERMINATOR, the pipeline runs successfully, but does not insert the data from the CSV file into the newly created table in Azure Synapse.
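For reference, the load in this example roughly corresponds to an Azure Synapse COPY INTO statement with the following file format options; the table name and storage path are assumptions made for the example, and the credential clause is omitted for brevity.

Code Block
languagesql
-- Illustrative sketch: loading a MacOS-created CSV file into Synapse, skipping
-- the header row and using the 0X0A (LF) row terminator.
COPY INTO dbo.sales_orders
FROM 'https://myaccount.dfs.core.windows.net/mycontainer/orders.csv'
WITH (
    FILE_TYPE = 'CSV',
    FIRSTROW = 2,
    ROWTERMINATOR = '0X0A'
);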

Snap Output with ROWTERMINATOR = '\n'

Snap Output with ROWTERMINATOR = '0X0A'

Image AddedImage Added

Download this pipeline.

Load data from a canonical format Parquet file

Suppose that your Parquet file contains a few rows of data split into three columns (id, batters, and topping) as shown below:

Nested_Parquet_data.pngImage Added

You can configure the ELT Load Snap in the following ways to load this data into a two-dimensional table on your Snowflake instance.

  • Create a target table or update an existing table with the same three columns - id, batters, and topping. In this case, the data type of the batters and topping columns must be VARIANT. Or

  • Create a (flat) target table with the following columns and data types, and load values into the respective columns using the Parquet file data identifiers.

Cfm tabs page
primaryTabtrue
colorPalettesDefault
an.spaceKeyDRWIP
tabsPageTitleTarget table definition
color#0052CC
fixedtrue
id4aex5ih67jf
hasBordertrue
backgroundColorPaletteDefault

| Target table column name (in Snowflake) | Data type |
| --- | --- |
| row_id | integer |
| batters_list | variant |
| batter_info | variant |
| batter_id | integer |
| batter_type | string |
| topping_info | variant |
| topping_id | integer |
| topping_type | string |
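A minimal sketch of the Snowflake DDL that this target table definition corresponds to; the table name is an assumption made for the example.

Code Block
languagesql
-- Illustrative sketch: a flat Snowflake target table matching the definition above.
CREATE TABLE bakery_items (
    row_id        INTEGER,
    batters_list  VARIANT,
    batter_info   VARIANT,
    batter_id     INTEGER,
    batter_type   STRING,
    topping_info  VARIANT,
    topping_id    INTEGER,
    topping_type  STRING
);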

Cfm tabs page
tabsPageTitleTarget table to source column mapping
colorPalettesDefault
an.spaceKeyDRWIP
color#0052CC
idfar3ws7mce
backgroundColorPaletteDefault

| Target table column name (in Snowflake) | Source column mapping (from Parquet) |
| --- | --- |
| row_id | id |
| batters_list | batters |
| batter_info | batters.batter |
| batter_id | batters.batter:id |
| batter_type | batters.batter:type |
| topping_info | topping |
| topping_id | topping:id |
| topping_type | topping:type |

Load data from a non-canonical format Parquet file

Suppose that your Parquet file contains a few rows of data split into three columns (id, batters, and topping) as shown below:

image-20240328-213729.pngImage Added

If you want to load only a part of this data set (a subset of columns) to a target table in Snowflake, here is how you define your Source File Column Names in the Target to Source Column Map.

Cfm tabs page
primaryTabtrue
colorPalettesDefault
an.spaceKeyDRWIP
tabsPageTitleTarget table definition
color#0052CC
fixedtrue
id2fo7fw2n06b
hasBordertrue
backgroundColorPaletteDefault

| Target table column name (in Snowflake) | Data type |
| --- | --- |
| row_id | integer |
| batters_id | integer |
| batter_type | string |
| batter_model | string |
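A minimal sketch of the Snowflake DDL that this target table definition corresponds to; the table name is an assumption made for the example.

Code Block
languagesql
-- Illustrative sketch: a Snowflake target table for the subset of columns above.
CREATE TABLE bakery_items_subset (
    row_id       INTEGER,
    batters_id   INTEGER,
    batter_type  STRING,
    batter_model STRING
);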

Cfm tabs page
tabsPageTitleTarget to source column mapping
colorPalettesDefault
an.spaceKeyDRWIP
color#0052CC
idbjmk6tc0jgs
backgroundColorPaletteDefault

| Target table column name (in Snowflake) | Source column mapping (from Parquet) |
| --- | --- |
| row_id | id |
| batters_id | "batters.id" |
| batter_type | "batters.batter":type |
| batter_model | "batters.batter":machine.model |

A few other scenarios

Use the MATCH_BY_COLUMN_NAME copy option without mapping individual columns:

ELT_Load_Snap_Config_Parquet_MBCN.pngImage Added

Note

In this scenario, the Snap assumes that the target table contains the same column names as columns in the source Parquet file.

Use VARIANT columns without mapping individual columns:

ELT_Load_Snap_Config_Parquet_Variant.pngImage Added

Note

In this scenario, the target table can have only one column of type VARIANT.

Downloads

Note

Important Steps to Successfully Reuse Pipelines

  1. Download and import the pipeline into SnapLogic.

  2. Configure Snap accounts as applicable.

  3. Provide pipeline parameters as applicable.

Attachments
patterns*.slp, *.zip

...

Snap Pack History

Refer to the Snap Pack History table in ELT Snap Pack

...

...