
Databricks - Bulk Load

Overview

You can use this Snap to perform a bulk load operation on your DLP instance. The source of your data can be a file from a cloud storage location, an input view from an upstream Snap, or a table that can be accessed through a JDBC connection. The source data can be in a CSV, JSON, PARQUET, TEXT, or an ORC file.

This Snap uses the following Databricks commands internally:

Snap Type

The Databricks - Bulk Load Snap is a write-type Snap that loads data into your DLP instance.

Prerequisites

  • Valid access credentials to a DLP instance with adequate access permissions to perform the action in context.

  • Valid access to the external source data in one of the following: Azure Blob Storage, ADLS Gen2, DBFS, GCP, AWS S3, or another database (JDBC-compatible).

Support for Ultra Pipelines

Does not support Ultra Pipelines

Limitations

Snaps in the Databricks Snap Pack do not support array, map, and struct data types in their input and output documents.

Known Issues

The Databricks - Bulk Load Snap fails to execute the DROP AND CREATE TABLE and ALTER TABLE operations on Delta tables when using the Databricks SQL persona on the AWS Cloud. The error message Operation not allowed: ALTER TABLE RENAME TO is not allowed for managed Delta tables on S3 is displayed. However, the same actions run successfully when using the Data Science and Engineering persona on the AWS Cloud.
Cause: This issue arises due to a limitation within the Databricks SQL Admin Console, which prevents you from adding the configuration parameter spark.databricks.delta.alterTable.rename.enabledOnAWS true to the SQL Warehouse Settings. As a result, the Snap encounters restrictions when attempting to perform certain operations on managed Delta tables stored on Amazon S3.
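
For reference, a sketch of the restricted operation and the configuration parameter referenced above (the table names are hypothetical; the exact statements are generated by the Snap):

  -- A rename of the form the Snap attempts, which SQL Warehouses on AWS reject
  ALTER TABLE cust_db.cust_records RENAME TO cust_db.cust_records_old;

  -- The Spark configuration parameter that the Databricks SQL Admin Console
  -- does not accept in the SQL Warehouse Settings
  spark.databricks.delta.alterTable.rename.enabledOnAWS true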

Snap Views

Type

Format

Number of Views

Examples of Upstream and Downstream Snaps

Description

Input 

Document

  • Min: 0

  • Max: 2

  • Mapper

  • Copy

  • JSON Generator

  • Databricks - Select

This Snap can read from two input documents at a time:

  • One JSON document for the incoming data to be loaded into the target Databricks instance.

  • Another JSON document that contains the table schema (metadata) for creating the target table.

Output

Document

  • Min: 0

  • Max: 1

  • Databricks - Select

  • Databricks - Unload

A JSON document containing the bulk load request details and the result of the bulk load operation.

Error

Error handling is a generic way to handle errors without losing data or failing the Snap execution. You can handle the errors that the Snap might encounter while running the Pipeline by choosing one of the following options from the When errors occur list under the Views tab. The available options are:

  • Stop Pipeline Execution: Stops the current pipeline execution when the Snap encounters an error.

  • Discard Error Data and Continue: Ignores the error, discards that record, and continues with the rest of the records.

  • Route Error Data to Error View: Routes the error data to an error view without stopping the Snap execution.

Learn more about Error handling in Pipelines.

Snap Settings

  • Asterisk (*): Indicates a mandatory field.

  • Suggestion icon: Indicates a list that is dynamically populated based on the configuration.

  • Expression icon: Indicates whether the value is an expression (if enabled) or a static value (if disabled). Learn more about Using Expressions in SnapLogic.

  • Add icon: Indicates that you can add fields in the fieldset.

  • Remove icon: Indicates that you can remove fields from the fieldset.

Field Name

Field Type

Field Dependency

Description

Label*

Default Value: Databricks - Bulk Load
Example: Db_BulkLoad_FromS3

String

None

The name for the Snap. You can modify this to be more specific, especially if you have more than one of the same Snap in your Pipeline.

Database name

Default Value: None.
Example: cust_db

String/Expression/Suggestion

None

Enter the name of the database in which the target table exists. Leave this blank if you want to use the database name specified in the Database Name field in the account settings.

Table Name*

Default Value: None.
Example: cust_records

String/Expression/Suggestion

None

Enter the name of the table in which you want to perform the bulk load operation. 

Source Type

Default Value: Cloud Storage File
Example: Input View

Dropdown list

None

Select the type of source from which you want to load the data into your DLP instance. The available options are:

  • Cloud Storage File. A file from a cloud location like AWS S3, Azure, or GCS. You can configure a series of options for the bulk load operation as described in this document.

  • Input View. A JSON file coming from the preceding Snap’s output. You need to specify only the Load action.

  • JDBC. A table in another database that can be connected to using a JDBC connector. You can specify the Source table name to load the data from or the Target Table Columns to replace the existing target table with a new one.

Load action*

Default Value: Drop and create table
Example: Append rows to existing table

Dropdown list

None

Select the appropriate load action you want to perform on the target table for this bulk upload operation. You can:

  • Drop and create table. Removes the existing table in the specified database and creates a new table with the schema defined in the Snap, in an Input View, or in a JDBC-connected database table.

  • Append rows to existing table. Inserts new rows of data into the existing target table.
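
Conceptually, the two load actions correspond to Databricks SQL along the following lines (an illustrative sketch with hypothetical table, column, and path names, not the exact statements the Snap generates):

  -- Drop and create table: recreate the target table with the schema defined in the Snap, then load into it
  DROP TABLE IF EXISTS cust_db.cust_records;
  CREATE TABLE cust_db.cust_records (cust_ID INT, cust_name STRING);
  COPY INTO cust_db.cust_records FROM 's3://my-bucket/source-data/' FILEFORMAT = CSV;

  -- Append rows to existing table: skip the DROP/CREATE and run only the load
  COPY INTO cust_db.cust_records FROM 's3://my-bucket/source-data/' FILEFORMAT = CSV;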

Source table name

String

Source Type is JDBC.

Enter the name of the source table. If you do not qualify it with a database name in this field, the Snap uses the default database configured in the account for the JDBC Account type.

Target Table Columns

Source Type is Cloud Storage file or JDBC and Load action is Drop and create table.

Use this fieldset to specify the target table schema for creating a new table. Specify the Column and Data Type for as many columns as you need to load in the target table.

Column

Default Value: None.
Example: cust_ID

String

None

Enter the name of the column that you want to load in the target table.

Data Type

Default Value: None.
Example: int, string

String

None

Enter the data type of the values in the specified column.

File format type

Default Value: CSV
Example: PARQUET

Dropdown list

Source Type is Cloud Storage file.

Select the file format of the source data file. It can be CSV, JSON, ORC, PARQUET, or TEXT.

File Format Option List

Source Type is Cloud Storage file.

You can use this field set to choose the file format options to associate with the bulk load operation, based on your source file format. Choose one file format option in each row.

File format option

Default Value: None.
Example: cust_ID

String/Expression/Suggestion

Source Type is Cloud Storage file.

Select a file format option from the available options and set appropriate values to suit your bulk load needs, without affecting the syntax displayed in this field.
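
For example, selecting the CSV option header and setting it to true is reflected in the generated load in a form similar to the following (an illustrative sketch with hypothetical names, not the Snap's exact statement):

  COPY INTO cust_db.cust_records
  FROM 's3://my-bucket/source-data/'
  FILEFORMAT = CSV
  FORMAT_OPTIONS ('header' = 'true');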

Files provider

Default Value: File list
Example: pattern

Dropdown list

Source Type is Cloud Storage file.

Select how you want to specify the source files: as a File list or as a pattern. Based on your selection, the corresponding field appears: the File list fieldset for File list, or the File pattern field for pattern.

File list

Source Type is Cloud Storage file and Files provider is File list.

You can use this field set to specify the file paths to be used for the bulk load operation. Choose one file path in each row.

File

Default Value: None.
Example: cust_data.csv

String

Source Type is Cloud Storage file and Files provider is File list.

Enter the path of the file to be used for the bulk upload operation.

File pattern

Default Value: None.
Example: folder1/file_[a-g].csv

String/Expression

Source Type is Cloud Storage file and Files provider is pattern.

Enter the regex pattern to use to match the file name and/or absolute path. You can specify this as a regular expression pattern string, enclosed in single quotes. Learn more: Examples of COPY INTO (Delta Lake on Databricks) for DLP.
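
For instance, the example pattern above would typically appear in a COPY INTO statement as follows (an illustrative sketch with a hypothetical bucket path; see the linked COPY INTO examples for the supported syntax):

  COPY INTO cust_db.cust_records
  FROM 's3://my-bucket/source-data/'
  FILEFORMAT = CSV
  PATTERN = 'folder1/file_[a-g].csv';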

Encryption type

Default Value: None.
Example: Server-Side KMS Encryption

String

Source Type is Cloud Storage file.

Select the encryption type to use for decrypting the source data and/or files staged in the S3 buckets.

Server-side encryption is available only for S3 accounts.

KMS key

Default Value: None.
Example: MF96D-M9N47-XKV7X-C3GCQ-G5349

String/Expression

Source Type is Cloud Storage file and Encryption type is Server-Side KMS Encryption.

Enter the KMS key to use to encrypt the files. If your source files are in S3, see Loading encrypted files from Amazon S3 for more detail.

Number of Retries

Example: 3

Minimum value: 0

Default value: 0

Integer

Source Type is Input View.

Specifies the maximum number of retry attempts when the Snap fails to write.

 

Retry Interval (seconds)

Example: 3

Minimum value: 1

Default value: 1

Integer

Source Type is Input View.

Specifies the minimum number of seconds the Snap must wait before each retry attempt.

Manage Queued Queries

Default value: Continue to execute queued queries when pipeline is stopped or if it fails

Example: Cancel queued queries when pipeline is stopped or if it fails

Dropdown list

None

Select this property to determine whether the Snap should continue or cancel the execution of the queued Databricks SQL queries when you stop the Pipeline.

If you select Cancel queued queries when pipeline is stopped or if it fails, the read queries under execution are cancelled, whereas the write queries under execution are not. Databricks internally determines which queries are safe to cancel and cancels those queries.

Due to an issue with DLP, aborting an ELT Pipeline validation (with preview data enabled) causes only those SQL statements that retrieve data using bind parameters to get aborted while all other static statements (that use values instead of bind parameters) persist.

  • For example, select * from a_table where id = 10 will not be aborted while select * from test where id = ? gets aborted.

To avoid this issue, ensure that you always configure your Snap settings to use bind parameters inside its SQL queries.

Snap Execution

Default Value: Execute only
Example: Validate & Execute

Dropdown list

None

Select one of the three modes in which the Snap executes. Available options are:

  • Validate & Execute: Performs limited execution of the Snap, and generates a data preview during Pipeline validation. Subsequently, performs full execution of the Snap (unlimited records) during Pipeline runtime.

  • Execute only: Performs full execution of the Snap during Pipeline execution without generating preview data.

  • Disabled: Disables the Snap and all Snaps that are downstream from it.

Troubleshooting

Error

Reason

Resolution

Missing property value

You have not specified a value for the required field where this message appears.

Ensure that you specify valid values for all required fields.

Examples

Bulk Load Employee data from a CSV file into a DLP instance

Consider the scenario where we need the employee data from a CSV file to be fed into a DLP instance so that we can analyze the data.

Prerequisite:

  • Configure the Bulk Load Snap account to connect to the AWS S3 service using Source Location Credentials to read the CSV file.

  • We need two Snaps:

    • Databricks Bulk Load: To load the data from the CSV file in an S3 location

    • Databricks Select: To read the data loaded in the target table and generate some insights.

Configure the Databricks - Bulk Load Snap to load employee data from the CSV into a new table, company_employees.

Here is how we do it:

  • Select Drop and create table as the Load action.

  • Define the schema for the new table in the Target Table Columns field set.

  • Choose the source data type and indicate that the file contains a valid header.

  • Specify the file names (with relative paths, here) to load the data from.

  • As our CSV file in the S3 location is not encrypted, we leave the corresponding fields blank.

Run the pipeline—it loads the valid data into the target table and displays the new table name and the number of records loaded.

Next, to read the data from the new table in the DLP instance, use the Databricks - Select Snap. Provide the Table name and configure the Snap with a WHERE clause, salary < 500000.

On validation, the Snap retrieves and displays the data from the company_employees table that matches the WHERE condition specified.
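
The data retrieved in this step is equivalent to running the following query against the new table (assuming no further configuration in the Select Snap):

  SELECT *
  FROM company_employees
  WHERE salary < 500000;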

Download this Pipeline

Downloads

  1. Download and import the Pipeline into SnapLogic.

  2. Configure Snap accounts as applicable.

  3. Provide Pipeline parameters as applicable.

Snap Pack History

    Release

    Snap Pack Version

    Date

    Type

    Updates

    May 2024

    437patches26400

    Latest

    Fixed an invalid session handle issue with the Databricks Snap Pack that intermittently triggered an error message when the Snaps failed to connect with Databricks to execute the SQL statement.

    May 2024

    main26341

    Stable

    Updated the Delete Condition (Truncates a Table if empty) field in the Databricks - Delete Snap to Delete condition (deletes all records from a table if left blank) to indicate that all entries will be deleted from the table when this field is blank, but no truncate operation is performed.

    February 2024

    main25112

    Stable

    Updated and certified against the current SnapLogic Platform release.

    November 2023

    main23721

    Stable

    Updated and certified against the current SnapLogic Platform release.

    August 2023

    main22460

    Stable

    Updated and certified against the current SnapLogic Platform release.

    May 2023

    433patches21630

    Latest

    Enhanced the performance of the Databricks - Insert Snap to improve the amount of time it takes for validation.

    May 2023

    main21015

    Stable

    Upgraded with the latest SnapLogic Platform release.

    February 2023

    main19844

    Stable

    Upgraded with the latest SnapLogic Platform release.

    November 2022

    main18944

    Stable

    The Databricks - Insert Snap now creates the target table only from the table metadata of the second input view when the following conditions are met:

    • The Create table if not present checkbox is selected.

    • The target table does not exist.

    • The table metadata is provided in the second input view.

    September 2022

    430patches18305

    Latest

    The following fields are added to each Databricks Snap as part of this enhancement:

    • Number of Retries: The number of attempts the Snap should make to perform the selected operation when the Snap account connection fails or times out.

    • Retry Interval (seconds): The time interval in seconds between two consecutive retry attempts.

    September 2022

    430patches17796

    Latest

    The Manage Queued Queries property in the Databricks Snap Pack enables you to decide whether a given Snap should continue or cancel executing the queued Databricks SQL queries.

    August 2022

    main17386

    Stable

    Upgraded with the latest SnapLogic Platform release.

    4.29.2.0

    42920rc17045

    Latest

    A new Snap Pack for Databricks Lakehouse Platform (Databricks or DLP) introduces the following Snaps:

