Automatic Schema Inference with ELT Load Snap
Overview
Use the SnapLogic Automatic Schema Inference (ASI) feature to extract schema information—column names with their data types—from a source file in CSV, JSON, PARQUET, AVRO, or ORC format, and assign the corresponding target CDW-specific data types to the columns of the new table when the target table does not exist.
As of the September 2022 release, the ELT Load Snap supports this feature only for Redshift target CDWs, when your Load Action is Append rows to existing table or Overwrite existing table and the target table does not exist in the Redshift database.
We plan to normalize this feature across CDWs with full-fledged support for ASI (all load actions) in our upcoming releases.
Other CDWs—DLP and BigQuery—have native but limited support for Automatic Schema Inference.
How to use the SnapLogic ASI feature
Specify the Format Option to infer table schema from your source files as follows:
In the ELT Load Snap's Settings, add the suggested format option INFERSCHEMA = TRUE | FALSE (This option is provided by SnapLogic) to the File Format Option List fieldset, and then delete the trailing text | FALSE (This option is provided by SnapLogic) so that the entry reads INFERSCHEMA = TRUE.
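For example, to infer schema from CSV files while letting the target CDW recognize date and time values automatically, the File Format Option List might contain entries like these (an illustrative sketch; the exact values depend on your source files):

```sql
-- Illustrative File Format Option List entries (one fieldset row each);
-- the DATEFORMAT/TIMEFORMAT rows are optional additions:
INFERSCHEMA = TRUE     -- SnapLogic-provided option that turns on ASI
DATEFORMAT = 'auto'    -- let the target CDW auto-recognize date values
TIMEFORMAT = 'auto'    -- let the target CDW auto-recognize time values
```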
ASI Support for Redshift in SnapLogic ELT
The following table describes the ASI feature and how it helps you simplify your data loads:
ASI for Redshift is a SnapLogic Feature
Redshift does not have any native support for the automatic schema inference feature. SnapLogic provides you with the ability to infer schema from source files in an S3 location and use the schema information to create tables in a target Redshift instance.
Ability / Support | Description |
---|---|
Source file formats | You can infer schema from source files in CSV, JSON, PARQUET, ORC, and AVRO formats. The JSON files can be in any of these flavors: JSON, JSON Lines, and Extended JSON Lines. |
Compressed file formats | You can infer schema from GZIP-compressed CSV and JSON files, and from Snappy- or GZIP-compressed PARQUET, AVRO, and ORC files. |
Source file locations | The ASI feature requires that your source files are available in an AWS S3 location. |
Date and time formats | You can specify DATEFORMAT or TIMEFORMAT as another Format Option with INFERSCHEMA = TRUE in the File Format Option List. The supported default values for these format options are YYYY-MM-DD (DATEFORMAT) and YYYY-MM-DD HH:MI:SS (TIMEFORMAT). Alternatively, you can use DATEFORMAT='auto' or TIMEFORMAT='auto'. Learn more at Using automatic recognition with DATEFORMAT and TIMEFORMAT - Amazon Redshift. |
Schema inference mechanism | When you set the ELT Load Snap to auto-infer the source schema, the Snap reads the first 400 records of each source file, infers the column names and data types from those records, and creates the target table with the corresponding Redshift data types (see the sketch after this table). Full file scans are not performed in the current release; a Full file scan mode is planned for a future release. |
Merging inferred schema | When you have multiple source files to read and load data from, the ASI feature reads all files (first 400 records of each only) and automatically merges the new columns/data types detected in subsequent files with the schema detected from the first file. |
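To make the mechanism concrete, here is a minimal sketch of the kind of Redshift DDL and load that ASI enables when the target table does not exist. The table name, columns, S3 path, and IAM role are hypothetical, and the Snap's generated SQL may differ:

```sql
-- Hypothetical result of ASI on a CSV source: a new table with inferred
-- columns, sized defensively with maximum-size types, followed by the load.
CREATE TABLE sales_data (
    order_id   BIGINT,
    order_date DATE,
    amount     DECIMAL(38, 10),   -- maximum precision assigned defensively
    region     VARCHAR(65535)     -- maximum VARCHAR size assigned defensively
);

COPY sales_data
FROM 's3://example-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
FORMAT AS CSV
IGNOREHEADER 1
DATEFORMAT 'auto'
TIMEFORMAT 'auto';
```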
Other Compatible File Format Options
Based on the file format of your source files, you can seamlessly use the following file format options with INFERSCHEMA = TRUE:
Source File Format | Compatible File Format Options | Default Value |
---|---|---|
CSV | GZIP | N/A |
 | DATEFORMAT | YYYY-MM-DD |
 | TIMEFORMAT | YYYY-MM-DD HH:MI:SS |
 | DELIMITER | Comma ( , ) |
 | ENCODING | UTF8 |
 | IGNOREHEADER | 0 |
 | NULL | '\N' |
 | QUOTE | Double quote mark ( " ) |
JSON | JSON 'auto' \| 'auto ignorecase' | |
 | GZIP | N/A |
 | DATEFORMAT | YYYY-MM-DD |
 | TIMEFORMAT | YYYY-MM-DD HH:MI:SS |
 | DELIMITER | Pipe ( \| ) |
 | ENCODING | UTF8 |
 | QUOTE | Double quote ( " ) |
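For instance, combining these options to infer schema from GZIP-compressed JSON files could look like the following (illustrative only; the exact fieldset syntax may vary by Snap version):

```sql
-- Illustrative File Format Option List entries for a JSON source:
INFERSCHEMA = TRUE
JSON = 'auto ignorecase'   -- match JSON keys to column names, ignoring case
GZIP                       -- source files are GZIP-compressed
TIMEFORMAT = 'auto'        -- auto-recognize time values
```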
Limitations
You cannot infer some Redshift-specific data types like VARBYTE, SUPER, and GEO using this Snap. It assigns the VARCHAR data type to columns with VARBYTE, SUPER, and GEO values.
JSONPath files and the associated JSON files are not supported. Learn more at Copying from JSON files.
Only the default format values and 'auto' are supported for the DATEFORMAT and TIMEFORMAT file format options. Learn more about these file format options at DATEFORMAT for Redshift.
Because the Snap reads only the first 400 rows of data in each source file for ASI, the maximum size/value in a column may not get detected.
SnapLogic attempts to prevent the risk of very large values landing in smaller-size data type columns by assigning the maximum-size data type to the columns. This may result in:
- Higher than required memory consumption.
- Loss of precision and scale when reading and assigning the DECIMAL data type.
This limitation will be addressed when the Full file scan mode is introduced.
Although ASI reads only 400 records per file, individual records that are unusually large may increase I/O overhead and degrade the Snap's performance.
If the JCC and the source data files are not colocated, the additional networking overhead may affect the performance of the ELT Load Snap, depending on the record size and the memory and I/O resources available to the JCC.
ASI Support for Snowflake in SnapLogic ELT
Prerequisites
To leverage the ELT Load Snap's ASI feature with your Snowflake instance, you need the following minimum access privileges on the corresponding objects:
Privilege | Object |
---|---|
USAGE | Storage integration |
CREATE STAGE | Schema |
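For example, an administrator could grant these minimum privileges to the role your Snowflake account uses with statements like the following (a sketch; the integration, schema, and role names are hypothetical):

```sql
-- Hypothetical grants covering the two ASI prerequisites:
GRANT USAGE ON INTEGRATION my_storage_integration TO ROLE elt_load_role;
GRANT CREATE STAGE ON SCHEMA analytics_db.staging TO ROLE elt_load_role;
```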
The following table describes the ASI feature and how it helps you simplify your data loads:
ASI for Snowflake and our Offering
Snowflake's INFER_SCHEMA feature provides the capability to automatically infer schema from PARQUET, AVRO, and ORC source file formats. SnapLogic provides you with the additional ability to infer schema from CSV and JSON source files. The ELT Load Snap can read these files from an Amazon S3, Azure Blob Storage (WASB), ADLS Gen2, or Google Cloud Storage location and use the schema information to create tables in a target Snowflake instance.
Ability / Support | Description |
---|---|
Source file locations | The ASI feature for Snowflake requires that your source files are available in an Amazon S3, WASB, ADLS Gen2, or Google Cloud Storage location. |
Source file formats | You can infer schema from source files in CSV, JSON, PARQUET, AVRO, and ORC formats. Snowflake's INFER_SCHEMA function natively covers PARQUET, AVRO, and ORC; SnapLogic adds the ability to infer schema from CSV and JSON files. |
Schema inference mechanism | When you set the ELT Load Snap to auto-infer the source schema, the Snap reads the first 400 records of each source file, infers the column names and data types from those records, and creates the target table with the corresponding Snowflake data types. Full file scans are not performed in the current release; a Full file scan mode is planned for a future release. |
Merging inferred schema | When you have multiple source files to read and load data from, the ASI feature reads all files (first 400 records of each only) and automatically merges the new columns/data types detected in subsequent files with the schema detected from the first file. Needs a consistent header row across files: for accurate schema inference, ensure that the column names and the number of columns in each file's header match those in the remaining specified files. |
Date and time formats | You can specify DATE_FORMAT or TIME_FORMAT as another Format Option with INFERSCHEMA = TRUE in the File Format Option List. The supported default value for these format options is AUTO. Alternatively, you can explicitly set DATE_FORMAT = 'AUTO' or TIME_FORMAT = 'AUTO'. Learn more at Using automatic recognition with DATEFORMAT and TIMEFORMAT - Snowflake. |
Compressed file formats | The ASI feature works for compressed files stored in an Amazon S3 location, but not on WASB, ADLS Gen2, or Google Cloud Storage. Use GZIP compression for CSV and JSON files, and Snappy or GZIP compression for PARQUET, AVRO, and ORC files. |
Authentication types | Source Location Credentials and Storage Integration for Amazon S3, WASB, and ADLS Gen2; Storage Integration for Google Cloud Storage. |
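For reference, the native Snowflake capability mentioned above is the INFER_SCHEMA table function, which you can also call directly; a minimal standalone example follows (the stage and named file format are hypothetical):

```sql
-- Querying Snowflake's native INFER_SCHEMA function against staged
-- PARQUET files (stage and named file format are assumed to exist):
SELECT *
FROM TABLE(
  INFER_SCHEMA(
    LOCATION => '@my_stage/data/',
    FILE_FORMAT => 'my_parquet_format'
  )
);
```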
Other Compatible File Format Options
Based on the file format of your source files, you can seamlessly use the following file format options with INFERSCHEMA = TRUE:
Source File Format | Compatible File Format Options | Default Value |
---|---|---|
CSV | COMPRESSION | NONE |
 | DATE_FORMAT | AUTO |
 | TIME_FORMAT | AUTO |
 | TIMESTAMP_FORMAT | AUTO |
 | FIELD_DELIMITER | Comma ( , ) |
 | RECORD_DELIMITER | New line character |
 | ENCODING | UTF8 |
 | SKIP_HEADER | 0 |
 | NULL_IF | \\N |
 | ESCAPE | NONE |
 | ESCAPE_UNENCLOSED_FIELD | Backslash ( \\ ) |
 | FIELD_OPTIONALLY_ENCLOSED_BY (to ensure that all the non-VARCHAR values in the source file are assigned the most appropriate data type) | NONE |
JSON | DELIMITER | Pipe ( \| ) |
 | ENCODING | UTF8 |
 | QUOTE | Double quote ( " ) |
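For example, a File Format Option List for a quoted, headered CSV source on Snowflake might combine these options as follows (illustrative values):

```sql
-- Illustrative File Format Option List entries for a CSV source on Snowflake:
INFERSCHEMA = TRUE
SKIP_HEADER = 1                     -- first row is a header, not data
FIELD_OPTIONALLY_ENCLOSED_BY = '"'  -- so non-VARCHAR values get proper types
DATE_FORMAT = 'AUTO'
```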
Limitations
You cannot infer some Snowflake-specific data types like BINARY and GEOGRAPHY from CSV files using this Snap. It assigns the VARCHAR data type to columns with BINARY and GEOGRAPHY values.
Only the default format values (AUTO) are supported for the DATE_FORMAT, TIME_FORMAT, and TIMESTAMP_FORMAT file format options. Learn more about these file format options at Snowflake Documentation | Format Type Options.
Because the Snap reads only the first 400 rows of data in each source file for ASI, the maximum size/value in a column may not get detected.
SnapLogic attempts to prevent the risk of very large values landing in smaller-size data type columns by assigning the maximum-size data type to the columns. This may result in:
- Higher than required memory consumption.
- Loss of precision and scale when reading and assigning the DECIMAL data type.
This limitation will be addressed when the Full file scan mode is introduced.
Although ASI reads only 400 records per file, individual records that are unusually large may increase I/O overhead and degrade the Snap's performance.
If the JCC and the source data files are not colocated, the additional networking overhead may affect the performance of the ELT Load Snap, depending on the record size and the memory and I/O resources available to the JCC.
Due to a Snowflake limitation in inferring source files (and thereby the source schema) from file name patterns in a Storage Integration-based session, the Snap fails to load data from any supported source location into your Snowflake target instance when you specify a File Name Pattern. As a workaround, we recommend that you use the File List fieldset to define the source data file locations for the Snap to load the data.
Best Practices
Use the following best practices to prevent any unusual or unexpected behavior in the ELT Load Snap or the ASI results:
Factor Affecting ELT Load | Best Practice |
---|---|
Colocation of the JCC and input data files | Run the ELT Load Snap on a JCC that is colocated with your source data files (for example, in the same region as your cloud storage location) to minimize the networking overhead of reading large records. |
Data Compression | To save disk space, use GZIP compression for loading CSV or JSON files, and Snappy or GZIP compression for loading the Apache file formats (PARQUET, AVRO, and ORC). ASI supports both of these compression modes. |