
...

ASI Support for Snowflake in SnapLogic ELT

Prerequisites

To use the ELT Load Snap's ASI feature with your Snowflake instance, you need the following minimum access privileges on the corresponding objects (a sample grant script follows the table):

Privilege    | Object
USAGE        | Storage integration
CREATE STAGE | Schema
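
As a minimal sketch, the grants below cover these prerequisites when run by an account administrator. The integration, schema, and role names (my_storage_int, my_db.elt_schema, elt_role) are hypothetical placeholders, not values from this article:

    -- Grant the minimum privileges the ASI feature needs (placeholder names).
    GRANT USAGE ON INTEGRATION my_storage_int TO ROLE elt_role;
    GRANT CREATE STAGE ON SCHEMA my_db.elt_schema TO ROLE elt_role;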

...

Info

ASI for Snowflake and our Offering

Snowflake's INFER_SCHEMA function automatically infers a schema from PARQUET, AVRO, and ORC source files. SnapLogic additionally provides the ability to infer schema from CSV and JSON source files. The ELT Load Snap can read these files from an Amazon S3, Azure Blob Storage (WASB), ADLS Gen2, or Google Cloud Storage location and use the inferred schema to create tables in a target Snowflake instance.
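
For reference, this is roughly what the underlying Snowflake call looks like when run directly in SQL. The stage and named file format (@elt_stage, parquet_fmt) are hypothetical placeholders, not values from this article:

    -- Inspect the column names and data types Snowflake infers from staged files.
    SELECT *
    FROM TABLE(
      INFER_SCHEMA(
        LOCATION => '@elt_stage/data/',
        FILE_FORMAT => 'parquet_fmt'
      )
    );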

The following entries describe each ASI ability or supported behavior:

Source file locations

The ASI feature for Snowflake requires that your source files are available in an Amazon S3, WASB, ADLS Gen2, or Google Cloud Storage location.

Note

Source File Definition for ASI in ELT Load Snap

You can define the source files for ASI as:

  • A File List (set of individual file names in one of the supported source file locations) or

  • A File Name Pattern (set of files in the source location that follow a consistent pattern in their file name or file path). 

Source file formats

  • You can infer schema from source files in CSV, JSON, PARQUET, AVRO, and ORC formats.

    • JSON files can be of any of these types: JSON, JSON Lines, and Extended JSON Lines.

  • For file formats other than CSV, the ASI feature for Snowflake loads all data into a single column of the target table and assigns the VARIANT data type to that column (see the example after this list).
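
As a quick illustration of the single VARIANT column behavior, such data is typically queried with Snowflake's path notation and casts. This is a sketch only; the table name and the column name v are hypothetical placeholders:

    -- Query fields inside the single VARIANT column (assumed to be named "v").
    SELECT v:customer.id::NUMBER    AS customer_id,
           v:customer.name::STRING  AS customer_name
    FROM my_db.elt_schema.raw_events;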

Schema inference mechanism

When you set the ELT Load Snap to auto infer the source schema, the Snap does the following (a SQL sketch of the resulting table creation follows this list):

  • By default, reads the first 400 records in the input file to detect and assign the data type for each column.

  • Assigns the VARCHAR data type to the corresponding column in Snowflake if it cannot detect the data type.
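
On the Snowflake side, creating a table from an inferred schema follows the pattern below. This is a minimal sketch: the 400-record sampling and the VARCHAR fallback are behaviors of the Snap rather than of this SQL, and the table, stage, and file format names are hypothetical placeholders:

    -- Create a target table whose columns come from the inferred schema.
    CREATE TABLE my_db.elt_schema.sales USING TEMPLATE (
      SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
      FROM TABLE(
        INFER_SCHEMA(
          LOCATION => '@elt_stage/sales/',
          FILE_FORMAT => 'parquet_fmt'
        )
      )
    );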

Full file scans

  • A full file scan consumes significantly more resources and time, but can provide the best results: accurate, maximum precision and scale values for DECIMAL columns. This capability is planned for upcoming releases of the ELT Snap Pack.

Merging inferred schema

When you have multiple source files to read and load data from, the ASI feature reads the first 400 records of each file and automatically merges any new columns and data types detected in subsequent files into the schema detected from the first file.

  • This feature is useful when you specify a File Name Pattern or multiple files in the File list for loading into your target CDW.

Needs a consistent header row across files

For accurate schema inference, ensure that the column names and the number of columns in each file's header match those in all the other specified files.

Date and time formats

You can specify DATE_FORMAT or TIME_FORMAT as an additional Format Option alongside INFERSCHEMA = TRUE in the File Format Option List. We support only the following default values for these format options:

  • AUTO for DATE_FORMAT

  • AUTO for TIME_FORMAT 

  • AUTO for TIMESTAMP_FORMAT

Alternatively, you can use DATEFORMAT = 'auto' or TIMEFORMAT = 'auto'. Learn more at Using automatic recognition with DATEFORMAT and TIMEFORMAT - Snowflake.
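
As a hedged illustration of where these options live in plain Snowflake SQL, a named file format can carry the same default values; the format name csv_auto_fmt is a placeholder:

    -- File format using the supported default (auto) date/time options.
    CREATE OR REPLACE FILE FORMAT csv_auto_fmt
      TYPE = CSV
      DATE_FORMAT = 'AUTO'
      TIME_FORMAT = 'AUTO'
      TIMESTAMP_FORMAT = 'AUTO';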

Compressed file formats

The ASI feature works for compressed files stored in an Amazon S3 location, but not for files stored in WASB, ADLS Gen2, or Google Cloud Storage.

  • You can infer schema from CSV and JSON source files compressed in GZIP format. 

    • Ensure that you add GZIP as an additional Format Option in the File Format Option List to work with these source files.

  • You can infer schema from PARQUET, AVRO, and ORC files compressed in GZIP or SNAPPY format (see the sketch after this list).
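
In plain Snowflake SQL, the counterpart of the GZIP Format Option is the COMPRESSION property on a file format; this is a sketch with a placeholder format name, not the Snap's own syntax:

    -- Declare GZIP compression so staged .gz files can be read and inferred.
    CREATE OR REPLACE FILE FORMAT csv_gzip_fmt
      TYPE = CSV
      COMPRESSION = GZIP;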

Authentication types

Source Location Credentials and Storage Integration for Amazon S3, WASB, and ADLS Gen2; Storage Integration only for Google Cloud Storage.
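
For reference, a Storage Integration is defined on the Snowflake side roughly as below; the integration name, IAM role ARN, and bucket path are hypothetical placeholders:

    -- Storage integration for an Amazon S3 source location (placeholder values).
    CREATE STORAGE INTEGRATION my_storage_int
      TYPE = EXTERNAL_STAGE
      STORAGE_PROVIDER = 'S3'
      ENABLED = TRUE
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/elt-load-role'
      STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/source-files/');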

Other Compatible File Format Options

...

  • You cannot infer some Snowflake-specific data types, such as BINARY and GEOGRAPHY, from CSV files using this Snap. It assigns the VARCHAR data type to columns that contain BINARY or GEOGRAPHY values.

  • Support is limited to the default format value (auto) for the DATE_FORMAT, TIME_FORMAT, and TIMESTAMP_FORMAT file format options. Learn more about these file format options in Snowflake Documentation | Format Type Options.

  • Because the Snap reads only the first 400 rows of data in each source file for ASI, the maximum size or value in a column may not be detected.

    • To reduce the risk of very large values landing in smaller data type columns, SnapLogic assigns the maximum-size data type to those columns. This may result in:

      • Higher memory consumption than required.

      • Loss of precision and scale when reading and assigning the DECIMAL data type.

    • This limitation will be addressed when the Full file scan mode is introduced.

  • Although ASI reads only 400 records, individual records that are unusually large can increase I/O overhead and affect the Snap's performance.

    • If the JCC and the source data files are not colocated, the additional network overhead may further affect the ELT Load Snap's performance, depending on the record size and the memory and I/O resources available to the JCC.

  • Due to a Snowflake limitation in inferring source files (and thereby the source schema) from file name patterns in a Storage Integration-based session, the Snap fails to load data from any supported source location into your Snowflake target instance when you specify a File Name Pattern. As a workaround, we recommend that you use the File List field set to define the source data file locations for the Snap.

...