Snowflake - Bulk Load

Overview

You can use the Snowflake - Bulk Load Snap to load data from input sources or files stored on external object stores like Amazon S3, Google Cloud Storage, and Azure Storage Blob into the Snowflake data warehouse.

Snap Type

The Snowflake - Bulk Load Snap is a Write-type Snap that performs a bulk load operation.

Prerequisites

You must have minimum permissions on the database to execute Snowflake Snaps. To understand if you already have them, you must retrieve the current set of permissions. The following commands enable you to retrieve those permissions:

SHOW GRANTS ON DATABASE <database_name>;
SHOW GRANTS ON SCHEMA <schema_name>;
SHOW GRANTS TO USER <user_name>;

  • You must enable the Snowflake account to use private preview features for creating the Iceberg table.

  • You must create an external volume using either a Snowflake worksheet or the Snowflake Execute Snap. Learn more about creating external volume.

Security Prerequisites

You must have the following permissions in your Snowflake account to execute this Snap: 

  • Usage (DB and Schema): Privilege to use the database, role, and schema.

  • Create table: Privilege to create a temporary table within this schema.

The following commands enable minimum privileges in the Snowflake console:

grant usage on database <database_name> to role <role_name>;
grant usage on schema <database_name>.<schema_name> to role <role_name>;
grant create table on schema <database_name>.<schema_name> to role <role_name>;

Learn more about Snowflake privileges: Access Control Privileges.

Internal SQL Commands

This Snap uses the following Snowflake commands internally:

  • COPY INTO - Enables loading data from staged files to an existing table.

  • PUT - Enables staging the files internally in a table or user stage.
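
As a rough illustration of how these two commands fit together, the following Python sketch composes a PUT statement that stages a local file and a COPY INTO statement that loads the staged files into a table. The helper names, the stage path @~/bulkload, and the file and table names are hypothetical, and the statements the Snap actually issues may differ.

```python
def put_statement(local_path: str, stage: str = "@~/bulkload") -> str:
    """Compose a PUT that uploads a local file to an internal stage."""
    return f"PUT file://{local_path} {stage}"


def copy_into_statement(table: str, stage: str = "@~/bulkload") -> str:
    """Compose a COPY INTO that loads staged files into an existing table."""
    return f"COPY INTO {table} FROM {stage}"


if __name__ == "__main__":
    # Hypothetical file and table names, for illustration only.
    print(put_statement("/tmp/chunk_0001.csv"))
    print(copy_into_statement("schema_demo.employees_table"))
```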

Requirements for External Storage Location

The following are mandatory when using an external staging location:

When using an Amazon S3 bucket for storage:

  • The Snowflake account should contain S3 Access-key ID, S3 Secret key, S3 Bucket and S3 Folder.

  • The Amazon S3 bucket where Snowflake will write the output files must reside in the same region as your cluster.

When using a Microsoft Azure storage blob:

  • A working Snowflake Azure database account.

When using Google Cloud Storage:

  • Provide permissions such as Public access and Access control to the Google Cloud Storage bucket on the Google Cloud Platform.

Support for Ultra Pipelines

Works in Ultra Pipelines. However, we recommend that you do not use this Snap in an Ultra Pipeline.

Limitations

  • The special character '~' is not supported in the temp directory name on Windows because it is reserved for the user's home directory.

  • Snowflake provides the option to use Cross Account IAM in external staging. You can adopt cross-account access through the Storage Integration option. With this setup, you do not need to pass any credentials around; you access the storage using only the named stage or integration object. For more details: Configuring Cross Account IAM Role Support for Snowflake Snaps

  • The Snowflake Bulk Load Snap expects the columns in the documents from upstream Snaps to be in the same order as the columns of the target table; otherwise, data validation fails.

  • If a Snowflake Bulk Load operation fails due to inadequate memory space on the JCC node when the Data source is Input View and the Staging location is Internal Stage, you can store the data on an external staging location (S3, Azure Blob or GCS).

  • When the bulk load operation fails due to an invalid input and when the input does not contain the default columns, the error view does not display the erroneous columns correctly.
    This is a bug in Snowflake and is being tracked under JIRA SNOW-662311 and JIRA SNOW-640676.

  • This Snap does not support creating an Iceberg table with an external catalog in Snowflake because the endpoint currently allows only read-only access, without any write capabilities, to tables that are created using an external catalog. Learn more about Iceberg catalog options.

  • Snowflake does not support cross-cloud and cross-region Iceberg tables when you use Snowflake as the Iceberg catalog. If the Snap displays an error message such as External volume <volume_name> must have a STORAGE_LOCATION defined in the local region ..., ensure that the External volume field uses an active storage location in the same region as your Snowflake account.

Behavior change

Starting from 4.35 GA, if your Snaplex is behind a proxy and the Snowflake Bulk Load Snap uses the default Snowflake driver (3.14.0 JDBC driver), then the Snap might encounter a failure. There are two ways to prevent this failure:

1. Add the following key-value pair in the Global properties section of the Node Properties tab:
   Key: jcc.jvm_options
   Value: -Dhttp.useProxy=true

2. Add the following key-value pairs in the URL properties of the Snap under Advanced properties.

We recommend you use the second approach to prevent the Snap’s failure.

Known Issues

Snap Views

Type

Format

Number of Views

Examples of Upstream and Downstream Snaps

Description

Input

  • Format: Document

  • Number of Views: Min: 0, Max: 2

  • Examples of Upstream and Downstream Snaps: JSON Generator, Binary to Document

  • Description: Documents containing the data to be uploaded to the target location.

Second Input View

This Snap has one document input view by default. 

You can add a second input view for metadata for the table as a document so that when the target table is absent, this table metadata can be created in the database with a similar schema as the source table. This schema is usually from the second output of a database Select Snap. If the schema is from a different database, the data types might not be properly handled.

Learn more about adding metadata for the table in the second input view from the example Providing Metadata For Table Using The Second Input View.

Output

  • Format: Document

  • Number of Views: Min: 0, Max: 1

  • Examples of Upstream and Downstream Snaps: Mapper, Snowflake Execute

  • Description: If an output view is available, then the output document displays the number of input records and the status of the bulk upload.

Error

Error handling is a generic way to handle errors without losing data or failing the Snap execution. You can handle the errors that the Snap might encounter when running the Pipeline by choosing one of the following options from the When errors occur list under the Views tab:

  • Stop Pipeline Execution: Stops the current pipeline execution if the Snap encounters an error.

  • Discard Error Data and Continue: Ignores the error, discards that record, and continues with the remaining records.

  • Route Error Data to Error View: Routes the error data to an error view without stopping the Snap execution.

Learn more about Error handling in Pipelines.

Snap Settings

  • Asterisk (*): Indicates a mandatory field.

  • Suggestion icon: Indicates a list that is dynamically populated based on the configuration.

  • Expression icon: Indicates whether the value is an expression (if enabled) or a static value (if disabled). Learn more about Using Expressions in SnapLogic.

  • Add icon: Indicates that you can add fields in the field set.

  • Remove icon: Indicates that you can remove fields from the field set.

Field Name

Field Type

Field Dependency

Description

Label*


Default Value: Snowflake - Bulk Load
Example: Load Employee Tables

String

N/A

Specify the name for the instance. You can modify this to be more specific, especially if you have more than one of the same Snap in your Pipeline.

Schema Name

 

Default Value: N/A
Example: schema_demo

 

String/Expression/Suggestion



N/A

Specify the database schema name. If you do not specify a schema, the suggestions for the Table Name field list the tables of all schemas. This field is suggestible and retrieves the available database schemas when you request suggestions.

Table Name*

 

Default Value: N/A
Example: employees_table

String/Expression/Suggestion



N/A

Specify the name of the table on which to execute the bulk load operation.



Create table if not present

 

Default Value: Deselected



Checkbox







N/A

Select this checkbox to automatically create the target table if it does not exist.

Iceberg table

 

Default Value: Deselected

Checkbox

Appears when you select Create table if not present.

Select this checkbox to create an Iceberg table with the Snowflake catalog. Learn more about how to create an Iceberg table with Snowflake as the Iceberg catalog.

External volume*

 

Default Value: N/A
Example:

String/Expression/Suggestion

Appears when you select the Iceberg table checkbox.

Specify the external volume for the Iceberg table. Learn more about how to configure an external volume for Iceberg tables.

Base location*

 

Default Value: N/A
Example: iceberg_s3_stage

String/Expression

Appears when you select the Iceberg table checkbox.

Specify the Base location for the Iceberg table.

Data source

 

Default Value: Input view
Example: Staged files



Dropdown list

N/A

Specify the source from where the data should load. The available options are Input view and Staged files.

Preserve case sensitivity

 

Default Value: Deselected

Checkbox



N/A

Select this checkbox to preserve the case sensitivity of the column names.

  • If you do not select Preserve case sensitivity, the input documents are loaded to the target table if the key names in the input documents match the target table column names, ignoring the case.

  • If you include a second input view, selecting Preserve case sensitivity has no effect on the column names of the target table, because the Snap uses the metadata from the second input view.
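
The case-insensitive matching described above can be pictured with the following Python sketch. It is only an illustration, not the Snap's implementation: the function name is hypothetical, and the exact-match behavior shown for preserve_case=True is an assumption about what preserving case implies.

```python
def match_keys_to_columns(document: dict, table_columns: list[str],
                          preserve_case: bool = False) -> dict:
    """Map input-document keys to target-table column names.

    preserve_case=False: keys match columns regardless of case.
    preserve_case=True:  only exact (case-sensitive) matches are kept
                         (assumed behavior, for illustration).
    """
    if preserve_case:
        return {k: v for k, v in document.items() if k in table_columns}
    by_lower = {c.lower(): c for c in table_columns}
    return {by_lower[k.lower()]: v
            for k, v in document.items() if k.lower() in by_lower}


if __name__ == "__main__":
    doc = {"name": "Ada", "Id": 1}
    print(match_keys_to_columns(doc, ["ID", "NAME"]))
    print(match_keys_to_columns(doc, ["ID", "NAME"], preserve_case=True))
```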

Load empty strings

Default Value: Deselected

Checkbox



N/A

Select this checkbox to load empty string values in the input documents as empty strings to the string-type fields. Otherwise, empty string values in the input documents are loaded as null. Null values are loaded as null regardless.
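
A minimal Python sketch of this rule follows. The function name is hypothetical and the code only illustrates the described behavior; it is not taken from the Snap.

```python
def prepare_value(value, load_empty_strings: bool = False):
    """Return the value to load for one input field.

    Empty strings become None (null) unless load_empty_strings is set;
    None always stays None.
    """
    if value == "" and not load_empty_strings:
        return None
    return value
```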

Truncate data

Default Value:  Deselected

Checkbox



N/A

Select this checkbox to truncate the existing data in the target table before loading the data. If you use the Bulk Update Snap, performing a bulk insert is faster than truncating and then updating.

Staging location


Default Value: Internal
Example: External



Dropdown list/Expression





N/A

Select the type of staging location that is to be used for data loading:

  • External: Location that is not managed by Snowflake. The location should be an AWS S3 bucket, a Microsoft Azure Storage Blob, or Google Cloud Storage. The credentials for the location are mandatory while validating the Account.

  • Internal: Location that is managed by Snowflake.

Flush chunk size (in bytes)

String/Expression

 

 

Appears when you select Input view for Data source and Internal for Staging location.

When using internal staging, data from the input view is written to a temporary chunk file on the local disk. When the size of a chunk file exceeds the specified value, the current chunk file is copied to the Snowflake stage and then deleted. A new chunk file simultaneously starts to store the subsequent chunk of input data. The default size is 100,000,000 bytes (100 MB), which is used if this value is left blank.
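
The chunking behavior described above can be sketched in Python as follows. This is a simplified in-memory model under stated assumptions: records are already encoded as bytes, each returned list stands in for one chunk file that would be copied to the Snowflake stage and then deleted, and the function name is hypothetical.

```python
DEFAULT_FLUSH_CHUNK_SIZE = 100_000_000  # bytes; the default described above


def chunk_records(records, flush_chunk_size=None):
    """Group encoded records into chunks, flushing once a chunk exceeds the limit."""
    limit = flush_chunk_size or DEFAULT_FLUSH_CHUNK_SIZE  # blank value -> default
    chunks, current, size = [], [], 0
    for record in records:
        current.append(record)
        size += len(record)
        if size > limit:            # chunk exceeded the configured size
            chunks.append(current)  # stands in for "copy to stage, then delete"
            current, size = [], 0   # start a new chunk for subsequent input
    if current:                     # flush whatever remains at end of input
        chunks.append(current)
    return chunks
```

For example, with a limit of 5 bytes, the records b"aaaa", b"bbbb", and b"cc" end up in two chunks: the first is flushed after the second record pushes it past the limit, and the last record is flushed at the end of the input.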

Target


Default Value: N/A
Example: s3://test_bucket

String/Expression





N/A

Specify an internal or external location to load the data. If you select External for Staging Location, a staging area is created in Azure, GCS, or S3 as configured. Otherwise, a staging area is created in Snowflake's internal location.

This field accepts the following input:

  • Named Stage: The name of a user-defined named stage. Use this only when the Staging location is set to Internal.
    Format: @<Schema>.<StageName>[/path]

  • Internal Path: The staging location represented by a path.
    Format: @~/[path]

  • S3 URL: The external S3 URL that specifies an S3 storage bucket.
    Format: s3://[path]

  • Microsoft Azure Storage Blob URL: The external URL required to connect to the Microsoft Azure Storage.

  • Folder Name: Anything else (including no input). This is regarded as a Folder name under the Internal Home Path (@~) if using internal staging or under the S3 bucket and folder specified in the Snowflake account.
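
To make the accepted formats easier to tell apart, the following Python sketch classifies a Target value by its prefix. It is an illustration only: the function name is hypothetical, and the azure:// prefix is an assumption, because this article does not state the format of the Microsoft Azure Storage Blob URL.

```python
def classify_target(target) -> str:
    """Classify a Target value according to the formats listed above."""
    target = (target or "").strip()
    lowered = target.lower()
    if target.startswith("@~"):
        return "internal path"   # @~/[path]
    if target.startswith("@"):
        return "named stage"     # @<Schema>.<StageName>[/path]
    if lowered.startswith("s3://"):
        return "s3 url"          # s3://[path]
    if lowered.startswith("azure://"):
        return "azure url"       # assumed prefix, not given in this article
    return "folder name"         # anything else, including no input
```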

Storage Integration

 

Default Value: N/A
Example

String/Expression

Appears when you select Staged files for Data source and External for Staging location.

Specify the pre-defined storage integration that is used to authenticate the external stages.

Staged file list

Use this field set to define the staged file(s) to be loaded to the target table.



Staged file

String/Expression

Appears when you select Staged files for Data source.

Specify the staged file to be loaded to the target table.

File name pattern


Default Value: N/A

Example: .length

String/Expression



Appears when you select Staged files for Data source.

Specify a regular expression pattern string, enclosed in single quotes, with the file names and/or paths to match.
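
Before running a load, you might preview which staged files a pattern would select. The Python sketch below does this with the standard re module. It assumes the pattern must match the whole file path, and Python's regular expression dialect may differ from Snowflake's in places; the function name and the file names are hypothetical.

```python
import re


def matching_files(pattern: str, staged_files: list[str]) -> list[str]:
    """Return the staged file paths that the pattern matches in full."""
    regex = re.compile(pattern)
    return [path for path in staged_files if regex.fullmatch(path)]


if __name__ == "__main__":
    files = ["2024/sales_q1.csv", "2024/sales_q1.json", "inventory.csv"]
    print(matching_files(r".*sales.*[.]csv", files))
```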

File format object


Default Value: None

Example: jsonPath()

String/Expression

N/A

Specify an existing file format object to use for loading data into the table. The specified file format object determines the format type such as CSV, JSON, XML, AVRO, or other format options for data files.

File format type


Default Value: None
Example: CSV

String/Expression/Suggestion

N/A

Specify the file format type to use for loading data into the table. The available file formats include CSV, JSON, XML, and AVRO.

File format option



Default value:  N/A
Example: BINARY_FORMAT=UTF8



String/Expression



N/A

Specify the file format option. Separate multiple options by using blank spaces and commas.
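
The three file format fields roughly correspond to the FILE_FORMAT clause of the COPY INTO command, which takes either a named file format object or a type with options. The Python sketch below composes such a clause. The function name is hypothetical, and giving the file format object precedence over the type is an assumption made for this illustration, not documented Snap behavior.

```python
def file_format_clause(format_type=None, format_object=None, options="") -> str:
    """Compose a FILE_FORMAT clause from the file format fields."""
    if format_object:
        # A named file format object is referenced by name.
        return f"FILE_FORMAT = (FORMAT_NAME = '{format_object}')"
    parts = []
    if format_type:
        parts.append(f"TYPE = {format_type}")
    if options:
        parts.append(options)  # passed through as entered in the field
    return f"FILE_FORMAT = ({' '.join(parts)})" if parts else ""


if __name__ == "__main__":
    print(file_format_clause("CSV", options="BINARY_FORMAT=UTF8"))
```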

Table Columns





Use this field set to specify the columns to be used in the COPY INTO command. This only applies when the Data source is Staged files.

Columns
Default value: None

String/Expression/Suggestion

N/A

Specify the table columns to use in the Snowflake COPY INTO query. This configuration is valid when the staged files contain a subset of the columns in the Snowflake table. For example, if the Snowflake table contains columns A, B, C, and D, and the staged files contain columns A and D then the Table Columns field would have two entries with values A and D. The order of the entries should match the order of the data in the staged files.

Select Query


Default Value: N/A
Example
select substr(t.$2,4), t.$1, t.$5, t.$4 
from @mystage t

String/Expression

Appears when the Data source is Staged files.

Specify the SELECT query to transform data before loading it into the Snowflake database. 

The SELECT statement transform option enables querying the staged data files by either reordering the columns or loading a subset of table data from a staged file. For example, select $1:location, $1:dimensions.sq_ft, $1:sale_date, $1:price from @mystage/sales.json.gz t
This query loads the file sales.json.gz from the internal stage mystage (which stores the data files internally), wherein location, dimensions.sq_ft, and sale_date are the objects.

(OR)

select substr(t.$2,4), t.$1, t.$5, t.$4 from @mystage t
This query reorders the column data from the internal stage mystage before loading it into a table. The SUBSTR (SUBSTRING) function removes the first few characters of a string before inserting it.
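
In Snowflake, a transforming SELECT is wrapped in parentheses in the FROM clause of COPY INTO, optionally with a list of target columns. The following Python sketch composes such a statement. The function name and the table name mytable are hypothetical, and the statement the Snap generates may differ.

```python
def copy_into_with_transform(table: str, select_query: str, columns=None) -> str:
    """Compose a COPY INTO that loads the result of a SELECT over staged files."""
    column_list = f" ({', '.join(columns)})" if columns else ""
    return f"COPY INTO {table}{column_list} FROM ({select_query})"


if __name__ == "__main__":
    query = "select substr(t.$2,4), t.$1, t.$5, t.$4 from @mystage t"
    print(copy_into_with_transform("mytable", query))
```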

Encryption type

 

Default Value: None
Example: Server-Side Encryption

Dropdown list



N/A

Specify the type of encryption to be used on the data. The available encryption options are:

  • None: Files do not get encrypted.

  • Server-Side Encryption: The output files on Amazon S3 are encrypted with server-side encryption.

  • Server-Side KMS Encryption: The output files on Amazon S3 are encrypted with an Amazon S3-generated KMS key. 

KMS key



Default Value: N/A
Example: <Encrypted>

String/Expression





N/A

Specify the KMS key that you want to use for S3 encryption. Learn more about the KMS key: AWS KMS Overview and Using Server Side Encryption.

Buffer size (MB)



Default Value: 10MB
Example: 20MB

String/Expression







N/A

Specify the amount of data, in MB, to be loaded into the S3 bucket at a time. This property is required when bulk loading to Snowflake using AWS S3 as the external staging area.

Minimum value: 5 MB

Maximum value: 5000 MB
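
If you compute the buffer size in an expression or a script, a simple range check can catch invalid values early. The Python sketch below is a hypothetical helper that accepts values such as 10MB or 20 and checks them against the limits stated above; how the Snap itself parses this field is not described in this article.

```python
MIN_BUFFER_MB, MAX_BUFFER_MB = 5, 5000  # limits stated above


def validate_buffer_size(value: str) -> int:
    """Parse a buffer size such as '10MB' or '20' and check its range."""
    text = value.strip().upper()
    if text.endswith("MB"):
        text = text[:-2].strip()
    size = int(text)  # raises ValueError for non-numeric input
    if not MIN_BUFFER_MB <= size <= MAX_BUFFER_MB:
        raise ValueError(
            f"Buffer size must be between {MIN_BUFFER_MB} and {MAX_BUFFER_MB} MB"
        )
    return size
```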

Manage Queued Queries



Default Value: Continue to execute queued queries when the Pipeline is stopped or if it fails
Example: Cancel queued queries when the Pipeline is stopped or if it fails

Dropdown list

N/A

Select an option to determine whether the Snap should continue or cancel the execution of the queued Snowflake Execute SQL queries when you stop the Pipeline or when it fails.

Additional Options

On Error

 

Default Value: ABORT_STATEMENT
Example: CONTINUE





Dropdown list

N/A

Select an action to perform when errors are encountered in a file. The available actions are:

  • ABORT_STATEMENT: Aborts the COPY statement if any error is encountered. The error will be thrown from the Snap or routed to the error view.

  • CONTINUE: Continues loading the file. The error will be shown as a part of the output document.

  • SKIP_FILE: Skips the file if any errors are encountered in the file.

  • SKIP_FILE_*error_limit*: Skips file when the number of errors in the file exceeds the number specified in Error Limit.

  • SKIP_FILE_*error_percent_limit*%: Skips file when the percentage of errors in the file exceeds the percentage specified in Error percentage limit.
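
These actions correspond to the ON_ERROR copy option of the Snowflake COPY INTO command, where the percentage form is written as a quoted string. The following Python sketch shows one way the selection and its limit could be combined into that option. The function name is hypothetical and the sketch is not the Snap's implementation.

```python
def on_error_option(action: str, error_limit: int = 0,
                    error_percent_limit: int = 0) -> str:
    """Compose the ON_ERROR copy option for a selected action."""
    if action == "SKIP_FILE_*error_limit*":
        return f"ON_ERROR = SKIP_FILE_{error_limit}"
    if action == "SKIP_FILE_*error_percent_limit*%":
        # The percentage form is a quoted string in Snowflake.
        return f"ON_ERROR = 'SKIP_FILE_{error_percent_limit}%'"
    if action in ("ABORT_STATEMENT", "CONTINUE", "SKIP_FILE"):
        return f"ON_ERROR = {action}"
    raise ValueError(f"Unknown On Error action: {action}")


if __name__ == "__main__":
    print(on_error_option("SKIP_FILE_*error_limit*", error_limit=3))
    print(on_error_option("SKIP_FILE_*error_percent_limit*%", error_percent_limit=1))
```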

Error Limit

 

Default Value: 0
Example: 3

Integer

Appears when you select SKIP_FILE_*error_limit* for On Error.

Specify the error limit at which to skip a file. The Snap skips the file when the number of errors in it exceeds the specified limit. This applies when SKIP_FILE_*error_limit* is selected for On Error.

Error Percentage Limit

Default Value: 0
Example: 1

Integer

Appears when you select SKIP_FILE_*error_percent_limit*% for On Error.

Specify the percentage of errors at which to skip a file. The Snap skips the file when the percentage of errors in it exceeds the specified value. This applies when SKIP_FILE_*error_percent_limit*% is selected for On Error.

Size Limit

Default Value: 0
Example: 5

Integer

N/A

Specify the maximum size (in bytes) of data to be loaded.

Purge

Default value: Deselected

Checkbox

Appears when the Staging location is External.

Specify whether to purge the data files from the location automatically after the data is successfully loaded.

Return Failed Only

Default Value: Deselected

Checkbox

N/A

Specify whether to return only the files that failed to load.

Force

Default Value: Deselected

Checkbox

N/A

Specify if you want to load all files, regardless of whether they have been loaded previously and have not changed since they were loaded.

Truncate Columns

Default Value: Deselected

Checkbox

N/A

Select this checkbox to truncate column values that are larger than the maximum column length in the table.
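
Size Limit, Purge, Return Failed Only, Force, and Truncate Columns have counterparts among the Snowflake COPY INTO copy options (SIZE_LIMIT, PURGE, RETURN_FAILED_ONLY, FORCE, and TRUNCATECOLUMNS). The Python sketch below composes those options from the field values. The function name is hypothetical, and treating a Size Limit of 0 as "no limit" is an assumption made for this illustration.

```python
def copy_options(size_limit: int = 0, purge: bool = False,
                 return_failed_only: bool = False, force: bool = False,
                 truncate_columns: bool = False) -> str:
    """Compose COPY INTO copy options from the Additional Options fields."""
    options = []
    if size_limit > 0:  # assumption: 0 means no size limit is applied
        options.append(f"SIZE_LIMIT = {size_limit}")
    if purge:
        options.append("PURGE = TRUE")
    if return_failed_only:
        options.append("RETURN_FAILED_ONLY = TRUE")
    if force:
        options.append("FORCE = TRUE")
    if truncate_columns:
        options.append("TRUNCATECOLUMNS = TRUE")
    return " ".join(options)
```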

Validation Mode

Default Value: None
Example: RETURN_n_ROWS

Dropdown list

N/A

Select the validation mode for visually verifying the data before loading it. The available options are:

  • NONE

  • RETURN_n_ROWS

  • RETURN_ERRORS

  • RETURN_ALL_ERRORS

Validation Errors Type



Default Value: Full error
Example: Do not show errors

Dropdown list

Appears when you select NONE for Validation Mode.

Select one of the following methods for displaying the validation errors:

  • Aggregate errors per row: Provides a summary view of the errors. You can expand rows to reveal a detailed view of the errors.

  • Full error: Provides the complete error message.

  • Do not show errors

Rows to Return


Default Value: 0
Example: 5

Integer

Appears when you select RETURN_n_ROWS, RETURN_ERRORS, or RETURN_ALL_ERRORS for Validation Mode.

Specify the number of rows to validate and return. These rows are not loaded into the target table; instead, the Snap validates the data to be loaded and returns results based on the selected validation option (RETURN_n_ROWS, RETURN_ERRORS, or RETURN_ALL_ERRORS).
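
In the Snowflake COPY INTO command, these modes are expressed through the VALIDATION_MODE parameter, with the row count embedded in the RETURN_n_ROWS form. The following Python sketch shows how the two fields could combine. The function name is hypothetical and the sketch is not the Snap's implementation.

```python
def validation_mode_clause(mode: str, rows_to_return: int = 0) -> str:
    """Compose the VALIDATION_MODE clause for a selected validation mode."""
    if mode == "NONE":
        return ""  # no validation; the data is loaded normally
    if mode == "RETURN_n_ROWS":
        return f"VALIDATION_MODE = RETURN_{rows_to_return}_ROWS"
    if mode in ("RETURN_ERRORS", "RETURN_ALL_ERRORS"):
        return f"VALIDATION_MODE = {mode}"
    raise ValueError(f"Unknown validation mode: {mode}")
```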

Snap Execution


Default Value: Execute only
Example: Validate & Execute

Dropdown list

N/A

Select one of the three modes in which the Snap executes. Available options are:

  • Validate & Execute: Performs limited execution of the Snap, and generates a data preview during Pipeline validation. Subsequently, performs full execution of the Snap (unlimited records) during Pipeline runtime.

  • Execute only: Performs full execution of the Snap during Pipeline execution without generating preview data.

  • Disabled: Disables the Snap and all Snaps that are downstream from it.

Troubleshooting

Error

Reason

Resolution

Error: Data can only be read from Google Cloud Storage (GCS) with the supplied account credentials (not written to it).

Reason: Snowflake Google Storage Database accounts do not support external staging when the Data source is the Input view. Data can only be read from GCS with the supplied account credentials (not written to it).

Resolution: Use internal staging if the data source is the input view or change the data source to staged files for Google Storage external staging.

Examples

Provide metadata in the table using the second input view

This example Pipeline demonstrates how to provide metadata for the table definition through the second input view, to enable the Bulk Load Snap to create a table according to the definition.

1. Configure the Snowflake Execute Snap as follows to drop the newTable with the DROP TABLE query.

2. Configure the Mapper Snap as follows to pass the input data.

3. Configure the Snowflake Bulk Load Snap with two input views:

            a. First input view: Input data from the upstream Mapper Snap.

            b. Second input view: Table metadata from JSON Generator. If the target table is not present, a table is created in the database based on the schema from the second input view.

JSON Generator Configuration: Table metadata to pass to the second input view.

4. Finally, configure the Snowflake Select Snap with two output views.

Output from the first output view.

Output from the second output view: The schema of the target table is from the second output (rows) of the Snowflake Select Snap.

Download this Pipeline.

Load binary data into Snowflake

The following example Pipeline demonstrates how you can convert the staged data into binary data using the binary file format before loading it into the Snowflake database.

 

To begin with, configure the Snowflake Execute Snap with this query: select * from "PUBLIC"."EMP2" limit 25. This query reads 25 records from the EMP2 table.