S3 Select


Overview

You can use this Snap to execute a SELECT query on CSV, JSON, and Parquet S3 objects, which may be compressed in GZIP or BZIP2 format.

  • AWS S3 does not define folder objects; the '/' characters are part of S3 object key names.

  • This Snap captures metadata and lineage information from the input document.
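The Snap drives the Amazon S3 Select service. As a rough point of reference only (not the Snap's own implementation), the equivalent request at the AWS API level looks like the following boto3 sketch; the bucket name and object key are hypothetical:

  import boto3

  # Hypothetical bucket and key; illustrates the kind of S3 Select call the Snap performs conceptually.
  s3 = boto3.client("s3")
  response = s3.select_object_content(
      Bucket="example-bucket",
      Key="abc/test.csv",
      ExpressionType="SQL",
      Expression="SELECT * FROM S3Object s WHERE s.variety = 'Setosa'",
      InputSerialization={
          "CSV": {"FileHeaderInfo": "USE"},
          "CompressionType": "NONE",  # GZIP or BZIP2 for compressed CSV/JSON objects
      },
      OutputSerialization={"CSV": {}},
  )

  # The selected data streams back as events; "Records" events carry the matching bytes.
  for event in response["Payload"]:
      if "Records" in event:
          print(event["Records"]["Payload"].decode("utf-8"), end="")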

S3 Select Snap settings

Snap Type

The S3 Select Snap is a Read-type Snap that reads a subset of your S3 data based on a SELECT query.

Prerequisites

A valid S3 account with the required permissions.

Support for Ultra Pipelines

Works in Ultra Pipelines

Limitations

The conditional fields Field Delimiter, Record Delimiter, Quote Character, Quote Escape Character, and File Header Info are associated with the CSV Data Format and are not displayed for non-CSV data formats. They are also not displayed when the Data Format field is expression-enabled. If you plan to expression-enable the Data Format field and still need to configure the CSV-related properties, follow these steps:

  1. Set CSV as the Data Format.

  2. Configure the conditional fields.

  3. Enable the expression.

These steps are necessary because the values of these properties are used during execution only when the Data Format evaluates to CSV, even though the associated conditional fields are no longer displayed.

Known Issues

None.

Snap Views

Input

  Format: Document
  Number of Views: Min: 0, Max: 1
  Examples of Upstream Snaps: Mapper, JSON Parser
  Description: An upstream Snap is optional. Any document with key-value pairs can be used to evaluate expression properties.

Output

  Format: Binary
  Number of Views: Min: 0, Max: 1
  Examples of Downstream Snaps: CSV Parser, JSON Parser, Mapper (binary input view), File Writer
  Description: The binary data selected by the query. An example of the output binary data appears in the Snap's output preview in text format.

Error

Error handling is a generic way to handle errors without losing data or failing the Snap execution. You can handle the errors that the Snap might encounter while running the Pipeline by choosing one of the following options from the When errors occur list under the Views tab:

  • Stop Pipeline Execution: Stops the current pipeline execution if the Snap encounters an error.

  • Discard Error Data and Continue: Ignores the error, discards that record, and continues with the remaining records.

  • Route Error Data to Error View: Routes the error data to an error view without stopping the Snap execution.

Learn more about Error handling in Pipelines.

Snap Settings

  • Asterisk (*): Indicates a mandatory field.

  • Suggestion icon: Indicates a list that is dynamically populated based on the configuration.

  • Expression icon: Indicates the value is an expression (if enabled) or a static value (if disabled). Learn more about Using Expressions in SnapLogic.

  • Add icon: Indicates that you can add fields in the fieldset.

  • Remove icon: Indicates that you can remove fields from the fieldset.

  • Upload icon: Indicates that you can upload files.


Label*

Default Value: S3 Select
Example: Select Invoices from Q4 2018
Field Type: String
Field Dependency: None

The name for the Snap. You can modify this to be more specific, especially if you have more than one of the same Snap in your Pipeline.

Bucket*

Default Value: None
Example:
Field Type: String/Expression/Suggestion
Field Dependency: None

Specify the S3 bucket name where S3 objects are located.

Do not add an s3:// prefix before the bucket name, because the Snap can fail.

  • Bucket names are globally unique and, in most cases, can be accessed without a region name. If you cannot access a bucket without its region name, specify the region using the following syntax:

    • <S3_bucket_name>@<region_name>

  • You can also access an S3 bucket through an S3 Virtual Private Cloud (VPC) endpoint by specifying the bucket name in the following syntax:

    • <S3_bucket_name>@<VPC_S3_endpoint>

Note: If you enter an incorrect region name but the bucket name is valid, the AWS S3 service may still access the bucket without any error.

Object Key*

Default Value: None
Examples:

  • test.csv

  • abc/test.json

  • abc/xyz/test.xml

Field Type: String/Expression/Suggestion
Field Dependency: None

Specify the S3 object key name, which may include one or more forward-slash ('/') characters.

The forward-slash character is part of the S3 object key name; AWS S3 does not define folder objects. The suggestion list displays a maximum of 1,000 object keys.
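As a side illustration of the flat key space (separate from this Snap), listing keys under a prefix with boto3 looks like the following; the bucket name is hypothetical:

  import boto3

  # "abc/xyz/test.csv" is a single key; the slashes are characters in the name, not folders.
  s3 = boto3.client("s3")
  response = s3.list_objects_v2(Bucket="example-bucket", Prefix="abc/", MaxKeys=1000)
  for item in response.get("Contents", []):
      print(item["Key"])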

Select Query*

Default Value: None
Example: select * from S3Object
Field Type: String/Expression
Field Dependency: None

Enter a SELECT query to be executed on the S3 object. For a detailed description of the SELECT command, refer to the Amazon documentation: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-select.html
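For orientation, a few expression shapes covered by the linked SELECT SQL reference are sketched below; the column names are hypothetical and only illustrate the syntax:

  # Hypothetical S3 Select expressions; S3Object is the fixed table name for the queried object.
  EXPR_ALL        = "SELECT * FROM S3Object"
  EXPR_FILTER     = "SELECT s.id, s.amount FROM S3Object s WHERE s.region = 'EMEA'"
  EXPR_POSITIONAL = "SELECT s._1, s._3 FROM S3Object s"    # CSV columns by position (no usable header)
  EXPR_AGGREGATE  = "SELECT COUNT(*) FROM S3Object s"      # aggregates such as COUNT, SUM, AVG
  EXPR_LIMIT      = "SELECT * FROM S3Object s LIMIT 10"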

Input Data Format

Compression Type*

Default Value: None
Example: GZIP
Field Type: String/Expression/Dropdown list
Field Dependency: None

Select one of the following compression types: NONE, GZIP, or BZIP2.

The Amazon S3 Select service does not support Compression Type BZIP2 or GZIP for the Parquet input data type.

Data format*

Default Value: CSV
Example: JSON
Field Type: String/Expression/Dropdown list
Field Dependency: None

Select one of the following data formats for the input data: CSV, JSON, or Parquet.
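For reference, these three formats (together with the compression note above) map roughly onto the underlying S3 Select input serialization, as in this illustrative boto3-style sketch; it is not the Snap's configuration format, and the JSON Type value is an assumption for the example:

  # Illustrative input-serialization shapes for the three supported data formats.
  CSV_INPUT     = {"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "GZIP"}
  JSON_INPUT    = {"JSON": {"Type": "LINES"}, "CompressionType": "BZIP2"}
  PARQUET_INPUT = {"Parquet": {}, "CompressionType": "NONE"}  # Parquet input supports neither GZIP nor BZIP2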

Field Delimiter

Default Value: , (Comma)
Examples:

  • | (Pipe)

  • \t (tab)

Field Type: String/Expression
Field Dependency: Appears when you select CSV for Data Format.

Enter the field delimiter character used in the input data.

Record Delimiter

Default Value: LF (new line)
Example: CRLF
Field Type: String/Expression/Suggestion
Field Dependency: Appears when you select CSV for Data Format.

Enter the record delimiter used in the input data.

Quote Character

Default Value: " (double quote)
Example: ' (single quote)
Field Type: String/Expression
Field Dependency: Appears when you select CSV for Data Format.

Enter the quote character used in the input data.

Quote Escape Character

Default Value: \
Example: " (double quote)
Field Type: String/Expression
Field Dependency: Appears when you select CSV for Data Format.

Enter the escape character used to escape quote characters inside values that are wrapped in quote characters in the input data. Note that some CSV files use the double-quote character to escape double quotes inside values. For example, "He said ""Yes""" is parsed as: He said "Yes".
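To see the doubled-quote convention in isolation (independent of the Snap), Python's csv module parses such a value as follows:

  import csv
  import io

  # A CSV record whose quoted value escapes embedded double quotes by doubling them.
  record = '"He said ""Yes""",42\n'
  parsed = next(csv.reader(io.StringIO(record), quotechar='"', doublequote=True))
  print(parsed)  # ['He said "Yes"', '42']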

File Header Info

Default Value: USE
Examples:

  • NONE

  • IGNORE

Field Type: Dropdown list
Field Dependency: Appears when you select CSV for Data Format.

Specify whether to use the header information in the SELECT query.

Show Advanced Properties

Default Value: Deselected
Field Type: Checkbox
Field Dependency: None

Select this checkbox to display the advanced properties.
Deselect this checkbox to hide the properties.

Maximum Retries*

Default Value: 3
Example: 2
Field Type: Integer/Expression
Field Dependency: Appears when you select the Show Advanced Properties checkbox.

Specify the maximum number of retry attempts to perform in case of a temporary network loss.
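Conceptually, this setting bounds a retry loop similar to the generic sketch below; this only illustrates the idea of a retry budget for transient network errors and is not the Snap's internal logic:

  import time

  def with_retries(operation, maximum_retries=3, backoff_seconds=2.0):
      """Run operation(), retrying up to maximum_retries times on transient connection errors."""
      for attempt in range(maximum_retries + 1):
          try:
              return operation()
          except ConnectionError:
              if attempt == maximum_retries:
                  raise  # retry budget exhausted; surface the error
              time.sleep(backoff_seconds * (attempt + 1))  # simple linear backoff between attempts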

Snap Execution

Default Value:
Example: Validate & Execute
Field Type: Dropdown list
Field Dependency: None

Select one of the following three modes in which the Snap executes:

  • Validate & Execute: Performs limited execution of the Snap, and generates a data preview during Pipeline validation. Subsequently, performs full execution of the Snap (unlimited records) during Pipeline runtime.

  • Execute only: Performs full execution of the Snap during Pipeline execution without generating preview data.

  • Disabled: Disables the Snap and all Snaps that are downstream from it.

Examples

Selecting a Subset of Data from an Amazon S3 Object (CSV file)

This example Pipeline demonstrates how to use the Amazon S3 Select Snap to select a subset of data from a CSV file.

Pipeline shows multiple outputs for S3 Select

Prerequisites: A valid AWS S3 Account

In this example we take a CSV file and use S3 Select to output different subsets of the data.

Overview of steps:

  • Generate the CSV data using the CSV Generator Snap.

  • Format the file using the CSV Formatter Snap.

  • Upload the file using the S3 Upload Snap.

  • Copy the output of the S3 Upload Snap to four different flows using the Copy Snap.

  • Select a subset of the data using the S3 Select Snap.

  1. Configure the CSV Generator Snap to generate a new CSV document for the downstream Snap in the Pipeline.

     

  2. Configure the CSV Formatter Snap to format the data as specified in the Snap's settings.

    CSV Formatter Settings

     

  3. Configure the S3 Upload Snap to upload the object (select/demo.csv) to the S3 bucket.

    S3 Upload configuration Settings

     

  4. Use the Copy Snap to send the same information to multiple endpoints. Configure the Copy Snap to copy the S3 object document stream to the Snap's output views. In this example we configured four different output views.

  5. Configure the S3 Select Snap to select a subset of the data from the S3 object. On validation, the Snap retrieves the data based on the SELECT statement.

    Under Settings, expand the Input Data Format section to select options for File Header Info. This is where you specify whether to use the header information in the SELECT statement.

    S3 Select showing File Header Info option selected is USE

  6. Add a CSV Parser Snap after each S3 Select Snap to read the CSV binary data from its input view, parse it, and then write it to its output view.
    In this example we configured different output views that use different settings for File Header Info (the sketch after this example recaps these three variations):

    1. Output 1: USE means the header data is used in the SELECT statement. In this example we are referencing a column name in the header (variety = 'Setosa').
      SELECT * FROM s3Object s WHERE s.variety = 'Setosa'

      CSV Parser 1 output table

       

    2. Output 2: IGNORE means the header data is ignored in the SELECT statement. Here, we are using a column index instead of the column name to refer to the same column in the header (s._5 = 'Setosa').
      SELECT * FROM s3Object s WHERE s._5 = 'Setosa'

      CSV Parser 2 output table

       

    3. Output 3: NONE means the input is treated as having no header row, so the SELECT statement cannot reference header columns.
      SELECT * FROM s3Object s

      CSV Parser 3 output table

       

Download this Pipeline.
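For comparison outside SnapLogic, the three File Header Info variations above can be reproduced against the uploaded object with a boto3 sketch like the one below. The bucket name is hypothetical; the object key (select/demo.csv) and the expressions come from the walkthrough:

  import boto3

  s3 = boto3.client("s3")
  variations = [
      ("USE",    "SELECT * FROM s3Object s WHERE s.variety = 'Setosa'"),  # header names usable
      ("IGNORE", "SELECT * FROM s3Object s WHERE s._5 = 'Setosa'"),       # positional column reference
      ("NONE",   "SELECT * FROM s3Object s"),                             # first row treated as data
  ]
  for header_info, expression in variations:
      response = s3.select_object_content(
          Bucket="example-bucket",  # hypothetical
          Key="select/demo.csv",
          ExpressionType="SQL",
          Expression=expression,
          InputSerialization={"CSV": {"FileHeaderInfo": header_info}, "CompressionType": "NONE"},
          OutputSerialization={"CSV": {}},
      )
      print("--- FileHeaderInfo =", header_info, "---")
      for event in response["Payload"]:
          if "Records" in event:
              print(event["Records"]["Payload"].decode("utf-8"), end="")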


Selecting a Subset of Data from an Amazon S3 Object (JSON file)

This example Pipeline demonstrates how to use the Amazon S3 Select Snap to select a subset of data from a JSON file.

Pipeline showing the Snaps in this example

Prerequisites: A valid AWS S3 Account

In this example we take a JSON file and use S3 Select to select a subset of the data (in this case, employees from the Sales department).

Overview of steps:

  • Generate the JSON data using the JSON Generator Snap.

  • Format the file using the JSON Formatter Snap.

  • Upload the file using the S3 Upload Snap.

  • Select a subset of the data using the S3 Select Snap.

  • Use the JSON Parser Snap to read the JSON binary data from its input view, parse it, and then write it to its output view.

  1. Use the JSON Generator Snap to generate a new JSON document for the next Snap in the Pipeline.

  2. Configure the JSON Formatter Snap to format the data as specified in the Snap's settings.

  3. Configure the S3 Upload Snap to upload the object (employees.json) to the S3 bucket.

    S3 Upload configuration Settings

     

  4. Configure the S3 Select Snap to select a subset of the data from the S3 object (in this case, we are selecting employees from the Sales department). On validation, the Snap retrieves the data based on the SELECT statement.

    S3 Select configuration Settings

     

  5. Use the JSON Parser Snap to read the JSON binary data from its input view, parse it, and then write it to its output view.
    The output shows the list of employees from the Sales department (the sketch after this example shows an equivalent call outside SnapLogic).

Download this Pipeline.
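As with the CSV example, the JSON selection can be sketched outside SnapLogic with boto3. The bucket name is hypothetical, and the object is assumed to contain one JSON employee record per line with a department field:

  import boto3

  s3 = boto3.client("s3")
  response = s3.select_object_content(
      Bucket="example-bucket",  # hypothetical
      Key="employees.json",
      ExpressionType="SQL",
      Expression="SELECT * FROM S3Object s WHERE s.department = 'Sales'",
      InputSerialization={"JSON": {"Type": "LINES"}, "CompressionType": "NONE"},
      OutputSerialization={"JSON": {}},
  )
  for event in response["Payload"]:
      if "Records" in event:
          print(event["Records"]["Payload"].decode("utf-8"), end="")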

Downloads

  1. Download and import the Pipeline into SnapLogic.

  2. Configure Snap accounts, as applicable.

  3. Provide Pipeline parameters, as applicable.

 


Snap Pack History

Release | Snap Pack Version | Type

May 2024 | main26341 | Stable
Enhanced the S3 Select Snap to capture metadata and lineage information from the input document.

February 2024 | 436patches25360 | Latest
Fixed an issue with the Amazon S3 Snaps that displayed a null pointer exception when the Access Key ID or Secret Key field was empty while using the S3 Express Bucket in the S3 Account. The Snaps now throw a configuration exception if either field is empty.

February 2024 | main25112 | Stable
Updated and certified against the current SnapLogic Platform release.

November 2023 | 435patches24238 | Latest
Added support for Amazon S3 Express One Zone in the Amazon S3 Snap Pack.

November 2023 | main23721 | Stable
Updated and certified against the current SnapLogic Platform release.

August 2023 | main22460 | Stable
Updated and certified against the current SnapLogic Platform release.

May 2023 | 433patches21816 | Latest
The Amazon S3 Snaps automatically detect the Maximum session duration value for the Cross-Account IAM role (1 through 12 hours) and round the value down to the nearest hour. For example, if the Snap administrator sets the Maximum session duration to 3 hours and 45 minutes, the Snaps read it as 3 hours. The Snaps also refresh the session before it expires; however, the automatic session refresh does not cover a very large file upload or download that takes longer than the maximum session duration.

May 2023 | main21015 | Stable
Upgraded with the latest SnapLogic Platform release.

February 2023 | 432patches20385 | Latest
Added support for Ultra Task Pipelines.

February 2023 | main19844 | Stable
Upgraded with the latest SnapLogic Platform release.

November 2022 | main18944 | Stable

  • The S3 Browser Snap output now includes the Storage Class field, which indicates the archived status of the S3 object.

  • The S3 Download Snap no longer fails even when the pipeline has multiple Snaps after 430patches18348.

October 2022 | 430patches18674 | Latest

  • Introduced the following Snaps:

    • S3 Archive: Enables you to archive an S3 object and change its storage class.

    • S3 Restore: Enables you to restore an archived S3 object.

    • S3 Select: Enables you to retrieve a subset of data from an S3 object.

  • The S3 Download, S3 Archive, S3 Copy, S3 Delete, S3 Restore, and S3 Upload Snaps no longer accumulate active threads, because threads are now released immediately after execution.

  • The S3 Download Snap no longer fails even when the pipeline has multiple Snaps after 430patches18348.

  • The S3 Browser Snap output now includes the Storage Class field, which indicates the archived status of the S3 object.

August 2022 | 430patches17354 | Latest
The KMS Region field in the S3 Account now suggests the regions when you click the suggestion icon.

August 2022 | main17386 | Stable
Introduced the Amazon S3 Snap Pack, which enables you to browse, copy, delete, download, or upload objects in S3. This Snap Pack contains the following Snaps:

  • S3 Browser: Lists the attributes of Amazon S3 objects in a specific bucket matching the prefix.

  • S3 Copy: Sends a copy request to the AWS S3 service to copy an Amazon S3 object from a source bucket to a target bucket.

  • S3 Delete: Removes an object from the specified bucket.

  • S3 Download: Downloads Amazon S3 objects from the S3 bucket.

  • S3 Upload: Uploads binary data to Amazon S3 objects.

  • S3 Presigned: Generates a presigned URL in the output document to access an Amazon S3 object.

