Redshift - Unload

On this Page

Snap Type:

Read

Description:

This Snap unloads the result of a query to one or more files on Amazon S3, using Amazon S3 server-side encryption (SSE)/server-side encryption with KMS/client-side encryption with CMK. The target Redshift table is not modified. 

ETL Transformations & Data Flow

https://github.com/awslabs/amazon-redshift-utils/tree/master/src/UnloadCopyUtility.


[
    {
        "entries": [
            {
                "url": "s3:///mybucket/unload_folder/test0000_part_00"
            },
            {
                "url": "s3:///mybucket/unload_folder/test0001_part_00"
            },
            {
                "url": "s3:///mybucket/unload_folder/test0002_part_00"
            },
            {
                "url": "s3:///mybucket/unload_folder/test0003_part_00"
            }
        ],
        "status": "success",
        "unloadQuery": "UNLOAD ('SELECT * FROM public.company') TO 's3://mybucket/unload_folder/test' 
CREDENTIALS 'aws_access_key_id=;aws_secret_access_key=' MANIFEST ESCAPE ALLOWOVERWRITE PARALLEL"
    }
]

Upon the successful preview, the expected output data is as follows:

[
    {
        "entries": [
            {
                "url": "s3:///<bucket>/<folder>/<prefix>000n_part_00"
            }
        ],
        "status": "preview",
        "unloadQuery": "UNLOAD ('SELECT * FROM public.company') TO 's3://mybucket/unload_folder/test' 
CREDENTIALS 'aws_access_key_id=;aws_secret_access_key=' MANIFEST ESCAPE ALLOWOVERWRITE PARALLEL"
    }
]

The Snap behaves the same on a Groundplex as it does in a Cloudplex.

Expected upstream Snaps: Any Snap with a document output view

Expected downstream Snaps: Any Snap with a document input view, such as JSON Formatter, Mapper, and so on. The CSV Formatter Snap cannot be connected directly to this Snap since the output document map data is not flat.

This Snap provides a front end to the Amazon Redshift Unload/Copy Utility. The Snap allows data to be efficiently moved from one Redshift instance to an optionally encrypted S3 bucket. The COPY snap moves data from an optionally encrypted S3 bucket into a second Redshift instance. Alternately the data may be downloaded to another system using the S3 READ Snap.

By default the data will be written to a separate CSV-encoded compressed file per Redshift 'slice'. It is possible to override this behavior but the resulting file must be smaller than 6.2 GB.

The default CSV delimiter is a carat ('^'), not a comma. (Note: elsewhere I've seen the default delimiter listed as a pipe ('|') character.)

Input & Output

Input: Key-value map data to evaluate expression properties of the Snap.

Input Schema Provided: No

Output: This snap produces one document. The fields are:

  • entries - a list of S3 URLs.
  • status - "success" or "unsuccess"
  • unloadQuery - the actual SQL command executed.

Output Schema Provided: No

Preview Supported: Yes

Passthrough Supported: No

Output Examples:

    • Upon the successful execution, the expected output data is as follows:

Prerequisites:
  • The Redshift account should contain S3 Access-key ID, S3 Secret key, S3 Bucket and S3 Folder.
  • The Redshift account security settings should allow access from the IP Address of the cloudplex or groundplex.
  • The Amazon S3 bucket where Amazon Redshift will write the output files must reside in the same region as your cluster.
  • The Amazon S3 bucket where Amazon Redshift should allow write access from the IP address of the cloudplex or groundplex.
Limitations and Known Issues:

Works in Ultra Task Pipelines.

Configurations:

Account & Access

This Snap uses account references created on the Accounts page of SnapLogic Manager to handle access to this endpoint. This Snap requires a Redshift Account with S3 properties. See Configuring Redshift Accounts for information on setting up this type of account. 

Views:

Input

This Snap has at most one input view. 

Output

This Snap has at most one document output view.
The output document contains map data:

  • a "status" field with "success" or "preview", 
  • a "unloadQuery" field with the query sent to Redshift, and
  • an "entries" field with a list of S3 URL's written by the Redshift UNLOAD operation.
Error

This Snap has at most one document error view and produces zero or more documents in the view.

The error view would contain error, reason, resolution and stack trace.

Troubleshooting:

The preview on this Snap will not execute the Redshift UNLOAD operation. Connect a JSON Formatter and a File Writer Snaps to the error view and then execute the pipeline. If there is any error, you will be able to preview the output file in the File Writer Snap for the error information.

Settings

Label


Required The name for the Snap. You can modify this to be more specific, especially if you have more than one of the same Snap in your pipeline.

Select query

Required.


Defines a SELECT query. The results of the query are unloaded. In most cases, it is worthwhile to unload data in sorted order by specifying an ORDER BY clause in the query; this approach will save the time required to sort the data when it is reloaded.

Example: SELECT * FROM public.company ORDER BY id

Default value: SELECT * FROM 


S3 file prefix


The prefix of AWS S3 file names which are used by Redshift to write data. The Snap uses S3 Bucket and S3 Folder in the RedShift Account to format the full S3 path. File names created by RedShift are in the format:  s3://<bucket>/<folder>/<s3-file-prefix><slice-number>_part_<file-number>, where <s3-file-prefix> is the value of this property. The Amazon S3 bucket where Amazon Redshift will write the output files must reside in the same region as your cluster.

Example: test

Default value: [None]


Delimiter


Required. Single ASCII character that is used to separate fields in the output file. DELIMITER will be ignored if FIXEDWIDTH is specified. If the data contains the delimiter character, you will need to specify the ESCAPE option to escape the delimiter, or use ADDQUOTES to enclose the data in double quotes. Alternatively, specify a delimiter that is not contained in the data.

Example: a pipe character ( | ), a comma ( , ), or a tab ( \t )

Default value: a pipe character ( | )


Fixed width


Specifies FIXEDWIDTH spec., in which Redshift unloads the data to a file where each column width is a fixed length, rather than separated by a delimiter. The FIXEDWIDTH spec is a string that specifies the number of columns and the width of the columns. DELIMITER is ignored if FIXEDWIDTH is specified. Because FIXEDWIDTH does not truncate data, the specification for each column in the UNLOAD query needs to be at least as long as the length of the longest entry for that column. The format for fixedwidth_spec is: 'colID1:colWidth1,colID2:colWidth2, ...'

Example: colID1:36,colID2:36,colID3:256

Default value: [None]


Null as


Specifies a string that represents a null value in unload files. If this option is used, all output files contain the specified string in place of any null values found in the selected data. If this option is not specified, null values are unloaded as: 
Zero-length strings for delimited output 
Whitespace strings for fixed-width output
 If a null string is specified for a fixed-width unload and the width of an output column is less than the width of the null string, the following behavior occurs:
 An empty field is output for non-character columns
 An error is reported for character columns
Example: null

Default value: [None]

Anaplan server may ignore this setting and try to auto-detect.  See Anaplan for details.

Escape


If selected, for CHAR and VARCHAR columns in delimited unload files, an escape character (\) is placed before every occurrence of the following characters:
 Linefeed: \n
 Carriage return: \r
 The delimiter character specified for the unloaded data.
 The escape character: \
 A quote character: " or ' (if both ESCAPE and ADDQUOTES are selected).
We strongly recommend that you always select this property unless you are certain that your data does not contain any delimiters or other characters that might need to be escaped.

 Important:  If you loaded your data using a COPY with the ESCAPE option, you must also specify the ESCAPE option with your UNLOAD command to generate the reciprocal output file. Similarly, if you UNLOAD using the ESCAPE option, you will need to use ESCAPE when you COPY the same data.

Default value: Selected


Add quotes


If checked, RedShift places quotation marks around each unloaded data field, so that Amazon Redshift can unload data values that contain the delimiter itself.

Default value: Not selected


Allow overwrite


By default, UNLOAD fails if it finds files that it would possibly overwrite. If specified, UNLOAD will overwrite existing files, including the manifest file.

Default value: Not selected


Manifest


Creates a manifest file that explicitly lists the data files that are created by the UNLOAD process. The manifest is a text file in JSON format that lists the URL of each file that was written to Amazon S3. The manifest file is written to the same Amazon S3 path prefix as the unload files in the format <s3_path_prefix>manifest. For example, if the unload file S3 path prefix is 's3://mybucket/myfolder/venue_', the manifest file location will be 's3://mybucket/myfolder/venue_manifest'. Since this Snap needs to read the content of the manifest file for the output document data, it always includes MANIFEST option in the query. After reading the content of the manifest file, the Snap deletes the manifest file if this property is un-selected.

Default value: Not selected


Parallel


By default, UNLOAD writes data in parallel to multiple files, according to the number of slices in the cluster. If unchecked, UNLOAD writes to one or more data files serially, sorted absolutely according to the ORDER BY clause, if one is used. The maximum size for a data file is 6.2 GB. Therefore, if the unload data is larger than 6.2 GB, UNLOAD will create more than one file. The UNLOAD command is designed to use parallel processing. We recommend leaving this property selected for most cases.

Default value:  Selected


Gzip


Unloads data to one or more gzip-compressed file per slice. Each resulting file is appended with a .gz extension.

Default value: Not selected

Client-side Encryption


Specifies the Amazon S3 client-side encryption type for the output files on Amazon S3. UNLOAD by default creates encrypted files using Amazon S3 server-side encryption with AWS-managed encryption keys (SSE). UNLOAD does not support Amazon S3 server-side encryption with a customer-supplied key (SSE-C). To unload to encrypted gzip-compressed files, check the GZIP property. If selected, provide the customer master key in the Master symmetric key property.

Default value: Not selected

IAM role


Check this property if the bulk load/unload has to be done using IAM role. If checked, ensure the properties (AWS account ID, role name and region name) are provided in the account.

Default value:  [None] 

KMS Encryption type

Specifies the type of KMS S3 encryption to be used on the data. The available encryption options are:

  • None - Files do not get encrypted using KMS encryption
  • Server-Side KMS Encryption If selected, the output files on Amazon S3 are encrypted using this encryption with Amazon S3 generated KMS key. 

Default value: None

If both the KMS and Client-side encryption types are selected, the Snap gives precedence to the SSE,  and displays an error prompting the user to select either of the options only.

Master symmetric key

Conditional. This property applies only when the Client-side Encryption property is selected. This is the customer master key for the data to be encrypted client side. For more information about this please refer to Customer Master Keys and Using Client Side Encryption.

Default value:  [None] 

KMS key


Specifies the AWS Key Management Service (KMS) key ID or ARN to be used for the S3 encryption. This is only required if the KMS Encryption type property is configured to use the encryption with KMS. For more information about the KMS key refer to AWS KMS Overview and Using Server Side Encryption

Default value:  [None] 

Snap Execution

Select one of the three modes in which the Snap executes. Available options are:

  • Validate & Execute: Performs limited execution of the Snap, and generates a data preview during Pipeline validation. Subsequently, performs full execution of the Snap (unlimited records) during Pipeline runtime.
  • Execute only: Performs full execution of the Snap during Pipeline execution without generating preview data.
  • Disabled: Disables the Snap and all Snaps that are downstream from it.

Example


Transferring data from one Redshift instance to another

The Redshift Unload and Redshift Copy Snaps can be used to transfer data from one Redshift instance to a second.

Sample Unload snap settings


The first mapper extracts the URL elements

The splitter breaks the string array of URLs into a series of documents.

The second mapper strips the leading "s3:///" from the URLs.

The S3 Upsert Snap uses an expression property for the filename.

The output of the Upsert Snap shows the results.



The key points are that

  1. The S3 file list must include the 'slice' information, e.g., '0000_part_00'. There will be one file per slice.
  2. The Unload Snap will always write to the S3 bucket associated with the Redshift account. The Upsert Snap's S3 file list must include the S3 bucket name.
  3. The destination table must be created beforehand. There is no auto-creation mode.
  4. The null substitution is not recognized by the upsert. It must be updated manually. This could cause problems with uniqueness constraints.


Exported pipeline is available in the Downloads section below.

This Snap will leave the data file on the S3 instance.

Downloads

  File Modified

File Redshift Unload - Example.slp

Apr 20, 2018 by Mohammed Iqbal


Redshift IAM Account Setup

  • If the EC2 plex (where your Pipeline is running with IAM role), Redshift cluster, and S3 bucket are in the same AWS account, then you must use Redshift Account (normal IAM account).
  • If the EC2 plex (where your Pipeline is running with IAM role) is in one account and the Redshift cluster and S3 bucket are in a different AWS account, you must use Redshift Cross-account IAM role Account to run your Pipelines successfully.

This is applicable only for Redshift - Bulk Load, Redshift - Unload, and Redshift - S3 Upsert Snaps.