In this article

Overview

Use the SnapLogic Automatic Schema Inference (ASI) feature to extract schema information (column names and their data types) from a source file in CSV, JSON, PARQUET, AVRO, or ORC format and to assign the corresponding target-CDW-specific data types to the columns of the new table. This is especially useful when the target table does not exist.

  • As of the September 2022 release, the ELT Load Snap supports this feature only for Redshift target CDWs, when your Load Action is Append rows to existing table or Overwrite existing table and the target table does not exist in the Redshift database.

  • We plan to normalize this feature across CDWs with full-fledged support for ASI (all load actions) in our upcoming releases.

    • Other CDWs (DLP and BigQuery) have native but limited support for Automatic Schema Inference.

How to use the SnapLogic ASI feature

To infer table schema from your source files, specify the Format Option INFERSCHEMA = TRUE in the File Format Option List of the ELT Load Snap.
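For example, to infer schema from GZIP-compressed CSV source files, the File Format Option List might contain entries like these (illustrative values; see the option tables later in this article for each format's compatible options):

```
INFERSCHEMA = TRUE
GZIP
DATEFORMAT = 'auto'
```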

ASI Support for Redshift in SnapLogic ELT

The following table describes the ASI feature and how it helps you simplify your data loads:

ASI for Redshift is a SnapLogic Feature

Redshift does not have any native support for the automatic schema inference feature. SnapLogic provides you with the ability to infer schema from source files in an S3 location and use the schema information to create tables in a target Redshift instance.

Source file formats

You can infer schema from source files in CSV, JSON, PARQUET, AVRO, and ORC formats. The JSON files can be in any of these flavors: JSON, JSON Lines, and Extended JSON Lines.
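For reference, a JSON Lines file holds one complete JSON object per line (a minimal illustrative sample; the field names are hypothetical):

```
{"id": 1, "name": "widget"}
{"id": 2, "name": "gadget"}
```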

Compressed file formats

  • You can also infer schema from CSV and JSON source files compressed in GZIP format.

    • Ensure that you add the Format Option GZIP to the File Format Option List to work with these source files.

  • You can infer schema from PARQUET, AVRO, and ORC files compressed in GZIP or SNAPPY format.

Source file locations

The ASI feature requires that your source files are available in an AWS S3 location.

Date and time formats

You can specify DATEFORMAT or TIMEFORMAT as another Format Option with INFERSCHEMA = TRUE in the File Format Option List. We support the following default values for these format options:

  • YYYY-MM-DD for DATEFORMAT

  • YYYY-MM-DD HH:MI:SS for TIMEFORMAT without time zone 

  • YYYY-MM-DD HH:MI:SSOF for TIMEFORMAT with time zone

Alternatively, you can use DATEFORMAT='auto' or TIMEFORMAT='auto'. Learn more at Using automatic recognition with DATEFORMAT and TIMEFORMAT - Amazon Redshift.
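As an illustration, the default Redshift-style patterns above correspond roughly to these Python strptime patterns (shown only to clarify what the tokens match; OF denotes the UTC offset):

```python
from datetime import datetime

# YYYY-MM-DD            -> %Y-%m-%d
date_val = datetime.strptime("2022-09-15", "%Y-%m-%d")

# YYYY-MM-DD HH:MI:SS   -> %Y-%m-%d %H:%M:%S (timestamp without time zone)
ts_val = datetime.strptime("2022-09-15 10:30:00", "%Y-%m-%d %H:%M:%S")

# YYYY-MM-DD HH:MI:SSOF -> %Y-%m-%d %H:%M:%S%z (timestamp with UTC offset)
tz_val = datetime.strptime("2022-09-15 10:30:00+00:00", "%Y-%m-%d %H:%M:%S%z")
```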

Schema inference mechanism

When you set the ELT Load Snap to auto-infer source schema, the Snap:

  • By default, reads the first 400 records in the input file to detect and assign the data type for each column.

  • Assigns the VARCHAR data type to the corresponding column in Redshift if it cannot detect the data type.
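The sampling behavior described above can be sketched in Python. This is an illustrative model only, not the Snap's actual implementation, and the candidate type names (BIGINT, DOUBLE PRECISION, DATE) are stand-in examples:

```python
from datetime import datetime

SAMPLE_SIZE = 400  # the Snap reads the first 400 records by default


def detect_type(values):
    """Pick the first candidate type that fits every sampled value; fall back to VARCHAR."""
    for type_name, parse in [
        ("BIGINT", int),
        ("DOUBLE PRECISION", float),
        ("DATE", lambda v: datetime.strptime(v, "%Y-%m-%d")),
    ]:
        try:
            for v in values:
                parse(v)
            return type_name
        except (ValueError, TypeError):
            continue
    return "VARCHAR"  # assigned whenever no data type can be detected


def infer_schema(records):
    """records: parsed rows as dicts; returns a {column: inferred type} map."""
    columns = {}
    for row in records[:SAMPLE_SIZE]:
        for col, value in row.items():
            columns.setdefault(col, []).append(value)
    return {col: detect_type(vals) for col, vals in columns.items()}
```

For example, `infer_schema([{"id": "1", "price": "9.99", "name": "x"}])` maps `id` to BIGINT, `price` to DOUBLE PRECISION, and `name` to VARCHAR.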

Full file scans

  • A full file scan, though it consumes significant resources and time, can provide the best results: accurate data types and the maximum precision and scale values for a DECIMAL-type column. Support for full file scans is planned for upcoming releases of the ELT Snap Pack.

Merging inferred schema

When you have multiple source files to read and load data from, the ASI feature reads all files (first 400 records only) and automatically merges the new columns/data types detected from the subsequent files with the schema detected from the first file.

  • This feature is useful when you specify a File Name Pattern or multiple files in the File list for loading into your target CDW.
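The merge behavior can be illustrated with a small sketch. The helper name is hypothetical, and the conflict rule shown (the first file's schema wins for columns seen in more than one file) is an assumption made for illustration; per-file schemas are assumed to be {column: type} maps:

```python
def merge_schemas(per_file_schemas):
    """Union the columns found across files: new columns from later files
    are appended, and a column seen in more than one file keeps the type
    detected in the first file that contained it (assumed rule)."""
    merged = {}
    for schema in per_file_schemas:
        for column, dtype in schema.items():
            merged.setdefault(column, dtype)  # keep the first-seen type
    return merged


# File 2 adds a new column `region`; `id` keeps the type from file 1.
merged = merge_schemas([
    {"id": "BIGINT", "name": "VARCHAR"},
    {"id": "VARCHAR", "region": "VARCHAR"},
])
```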

Other Compatible File Format Options

Based on the file format of your source files, you can seamlessly use the following file format options with INFERSCHEMA = TRUE:

The compatible options and their default values, per source file format:

CSV

  • GZIP: N/A

  • DATEFORMAT (also supports DATEFORMAT = 'auto'): YYYY-MM-DD

  • TIMEFORMAT (also supports TIMEFORMAT = 'auto'): YYYY-MM-DD HH:MI:SS for timestamp without time zone; YYYY-MM-DD HH:MI:SSOF for timestamp with time zone

  • DELIMITER: Comma ( , )

  • ENCODING: UTF8

  • IGNOREHEADER: 0

  • NULL: '\N'

  • QUOTE: Double quote mark ( " )

JSON

  • JSON 'auto' | 'auto ignorecase'

  • GZIP: N/A

  • DATEFORMAT: YYYY-MM-DD

  • TIMEFORMAT: YYYY-MM-DD HH:MI:SS for timestamp without time zone; YYYY-MM-DD HH:MI:SSOF for timestamp with time zone

  • DELIMITER: Pipe ( | )

  • ENCODING: UTF8

  • QUOTE: Double quote ( " )

Limitations

ASI Support for Snowflake in SnapLogic ELT

Prerequisites

To leverage the ELT Load Snap's ASI feature for your Snowflake instance, you need the following minimum access privileges on the respective objects:

  • USAGE on the Storage integration

  • CREATE STAGE on the Schema

The following table describes the ASI feature and how it helps you simplify your data loads:

ASI for Snowflake and our Offering

Snowflake's INFER_SCHEMA feature provides the capability to automatically infer schema from PARQUET, AVRO, and ORC source file formats. SnapLogic provides you with the additional ability to infer schema from CSV and JSON source files. The ELT Load Snap can read these files from an Amazon S3, Azure Blob Storage (WASB), ADLS Gen2, or Google Cloud Storage location and use the schema information to create tables in a target Snowflake instance.

Source file locations

The ASI feature for Snowflake requires that your source files are available in an Amazon S3, WASB, ADLS Gen2, or Google Cloud Storage location.

Source File Definition for ASI in ELT Load Snap

You can define the source files for ASI as:

  • A File List (set of individual file names in one of the supported source file locations) or

  • A File Name Pattern (set of files in the source location that follow a consistent pattern in their file name or file path). 

Source file formats

  • You can infer schema from source files in CSV, JSON, PARQUET, AVRO, and ORC formats.

    • The JSON files can be in any of these flavors: JSON, JSON Lines, and Extended JSON Lines.

  • For file formats other than CSV, the ASI feature for Snowflake treats the entire target table as one column and assigns the VARIANT data type to this column.

Schema inference mechanism

When you set the ELT Load Snap to auto-infer source schema, the Snap:

  • By default, reads the first 400 records in the input file to detect and assign the data type for each column.

  • Assigns the VARCHAR data type to the corresponding column in Snowflake if it cannot detect the data type.

Full file scans

  • A full file scan, though it consumes significant resources and time, can provide the best results: accurate data types and the maximum precision and scale values for a DECIMAL-type column. Support for full file scans is planned for upcoming releases of the ELT Snap Pack.

Merging inferred schema

When you have multiple source files to read and load data from, the ASI feature reads all files (first 400 records only) and automatically merges the new columns/data types detected from the subsequent files with the schema detected from the first file.

  • This feature is useful when you specify a File Name Pattern or multiple files in the File list for loading into your target CDW.

Needs a consistent header row across files

For accurate schema inference, ensure that the column names and the number of columns in each file's header match those in the remaining specified files.
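A pre-flight check along these lines (a sketch using Python's standard csv module; the function name is hypothetical) can catch mismatched headers before the load:

```python
import csv


def headers_match(paths):
    """Return True if every CSV file in `paths` has the same header row
    (same column names, same column count, same order)."""
    reference = None
    for path in paths:
        with open(path, newline="") as f:
            header = next(csv.reader(f))
        if reference is None:
            reference = header  # first file sets the expected header
        elif header != reference:
            return False
    return True
```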

Date and time formats

You can specify DATE_FORMAT, TIME_FORMAT, or TIMESTAMP_FORMAT as another Format Option with INFERSCHEMA = TRUE in the File Format Option List. We support the following default values for these format options:

  • AUTO for DATE_FORMAT

  • AUTO for TIME_FORMAT 

  • AUTO for TIMESTAMP_FORMAT

Alternatively, you can use DATE_FORMAT='AUTO' or TIME_FORMAT='AUTO'. Learn more in the Snowflake documentation on date and time input formats.

Compressed file formats

The ASI feature works for compressed files stored in an Amazon S3 location, but not on WASB, ADLS Gen2, or Google Cloud Storage.

  • You can infer schema from CSV and JSON source files compressed in GZIP format. 

    • Ensure that you add another Format Option GZIP in the File Format Option List to work with these source files.

  • You can infer schema from PARQUET, AVRO, and ORC files compressed in GZIP or SNAPPY format.

Authentication types

Source Location Credentials or Storage Integration for Amazon S3, WASB, and ADLS Gen2; Storage Integration for Google Cloud Storage.

Other Compatible File Format Options

Based on the file format of your source files, you can seamlessly use the following file format options with INFERSCHEMA = TRUE:

The compatible options and their default values, per source file format:

CSV

  • COMPRESSION: NONE

  • DATE_FORMAT (also supports DATE_FORMAT = 'YYYY-MM-DD', 'DD-MON-YYYY', and 'MM/DD/YYYY'): AUTO

  • TIME_FORMAT: AUTO

  • TIMESTAMP_FORMAT: AUTO

  • FIELD_DELIMITER: Comma ( , )

  • RECORD_DELIMITER: New line character

  • ENCODING: UTF8

  • SKIP_HEADER: 0

  • NULL_IF: \\N

  • ESCAPE: NONE

  • ESCAPE_UNENCLOSED_FIELD: Backslash ( \\ )

  • FIELD_OPTIONALLY_ENCLOSED_BY (to ensure that all the non-VARCHAR values in the source file are assigned the most appropriate data type): NONE

  • DELIMITER: Pipe ( | )

  • ENCODING: UTF8

  • QUOTE: Double quote ( " )

As of the September 2022 release, the ELT Load Snap does not support using the MANIFEST file format option with ASI.

Limitations

Best Practices

Use the following best practices to prevent any unusual or unexpected behavior in the ELT Load Snap or the ASI results:

Colocation of the JCC and input data files

  • Ensure that the processing entity (JCC) and the source data are on the same node or cluster to avoid performance degradation in ASI due to networking overheads.

    • If the input file record size is moderate, reading 400 records from a different location may not incur significant networking overhead.

    • If the record size is massive, the networking overhead significantly affects both the ASI results and the Snap's performance.

Data Compression

To save disk space, use GZIP compression when loading CSV or JSON files, and SNAPPY or GZIP compression when loading the Apache file formats (PARQUET, AVRO, and ORC). ASI supports both of these compression modes.
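For example, a CSV file can be GZIP-compressed with Python's standard gzip module before you stage it for the ELT Load Snap (illustrative only; the file names are hypothetical):

```python
import gzip
import shutil

# A tiny sample CSV file (hypothetical name and contents).
with open("orders.csv", "w", newline="") as f:
    f.write("id,name\n1,widget\n")

# Compress it to orders.csv.gz; the compressed copy is what you stage
# in the source location for the ELT Load Snap.
with open("orders.csv", "rb") as src, gzip.open("orders.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)
```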