S3 Select
In this article
Overview
You can use this Snap to execute a SELECT query on CSV, JSON, and Parquet S3 objects, which may be compressed in GZIP or BZIP2 format.
AWS S3 does not define any folder object and the '/' characters are part of the S3 object key names.
This Snap captures metadata and lineage information from the input document.
Snap Type
The S3 Select Snap is a Read-type Snap that reads a subset of your S3 data based on a SELECT query.
Prerequisites
A valid S3 account with the required permissions.
Support for Ultra Pipelines
Works in Ultra Pipelines.
Limitations
The conditional fields Field Delimiter, Record Delimiter, Quote Character, Quote Escape Character, and File Header Info associated with the CSV Data Format are not displayed for non-CSV data formats. They are also not displayed if the Data Format field is expression-enabled. If you plan to make the Data Format expression-enabled to process the CSV-related properties, follow these steps:
Set CSV as the Data Format.
Configure the conditional fields.
Enable the expression.
You must follow these steps because the values of these properties are used during the execution only if the value of the Data Format is set to CSV, even when the associated conditional fields are not displayed.
Known Issues
None.
Snap Views
Type | Format | Number of Views | Examples of Upstream and Downstream Snaps | Description |
---|---|---|---|---|
Input | Document |
|
| An upstream Snap is optional. Any document with key-value pairs to evaluate expression properties. |
Output | Binary
|
|
| An example of the output binary data is as following:
|
Error | Error handling is a generic way to handle errors without losing data or failing the Snap execution. You can handle the errors that the Snap might encounter while running the Pipeline by choosing one of the following options from the When errors occur list under the Views tab:
Learn more about Error handling in Pipelines. |
Snap Settings
Asterisk ( * ): Indicates a mandatory field.
Suggestion icon ( ): Indicates a list that is dynamically populated based on the configuration.
Expression icon ( ): Indicates the value is an expression (if enabled) or a static value (if disabled). Learn more about Using Expressions in SnapLogic.
Add icon ( ): Indicates that you can add fields in the fieldset.
Remove icon ( ): Indicates that you can remove fields from the fieldset.
Upload icon ( ): Indicates that you can upload files.
Field Name | Field Type | Field Dependency | Description |
---|---|---|---|
Label* Default Value: S3 Select | String | None
| The name for the Snap. You can modify this to be more specific, especially if you have more than one of the same Snap in your Pipeline. |
Bucket* Default Value: None
| String/Expression/Suggestion | None | Specify the S3 bucket name where S3 objects are located. Do not add S3:/// before bucket name, because the Snap can fail. |
Object Key* Default Value: None
| String/Expression/Suggestion
| None | Specify the S3 object key name, which may include one or more forward-slash ('/') characters. |
Select Query* Default Value: None | String/Expression | None | Enter a SELECT query to be executed on the S3 object. For a detailed description on using the SELECT command, refer to the Amazon documentation: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-glacier-select-sql-reference-select.html |
Input Data Format |
| ||
Compression Type* Default Value: None | String/Expression/Dropdown list | None | Select one of the following compression types: NONE, GZIP, or BZIP2. |
Data format* Default Value: CSV | String/Expression/Dropdown list | None | Select one of the data formats for the input data: CSV, JSON or Parquet. |
Field Delimiter Default Value:
| String/Expression | Appears when you select CSV for Data Format. | Enter the field delimiter character used in the input data. |
Record Delimiter Default Value: LF (new line) | String/Expression/Suggestion | Appears when you select CSV for Data Format. | Enter the record delimiter used in the input data. |
Quote Character Default Value: | String/Expression | Appears when you select CSV for Data Format. | Enter the quote character used in the input data. |
Quote Escape Character Default Value: | String/Expression | Appears when you select CSV for Data Format. | Enter the escape character used to escape quote characters inside the values, where values are wrapped around quote characters in the input data. Note that some CSV file uses double quote character to escape double quotes inside values. For example, |
File Header Info Default Value: USE
| Dropdown list | Appears when you select CSV for Data Format. | Specify whether to use the header information in the SELECT query. |
Show Advanced Properties Default Value: Deselected | Checkbox | None | Select this checkbox to display the advanced properties. |
Maximum Retries* Default Value: 3 | Integer/Expression | Appears when you select Show Advanced Properties checkbox. | Specify the maximum number of retry attempts to perform in case of a temporary network loss. |
Snap Execution Default Value: | Dropdown list | None | Select one of the following three modes in which the Snap executes:
|
Examples
Selecting a Subset of Data from an Amazon S3 Object (CSV file)
This example Pipeline demonstrates how to use the Amazon S3 Select Snap to select a subset of data from a CSV file.
Prerequisites: A valid AWS S3 Account
In this example we take a CSV file and use S3 Select to output different subsets of the data.
Overview of steps:
Upload a single CSV file to the S3 object using the CSV Generator Snap.
Format the file using the CSV Formatter Snap.
Upload the file using the S3 Upload Snap.
Copy the output of the S3 Upload Snap to 4 different flows using the S3 Copy Snap.
Select a subset of the data using the S3 Select Snap.
Configure the CSV Generator Snap to generate a new CSV document for the downstream Snap in the Pipeline.
Configure the CSV Formatter Snap to format the data as specified in the Snap's settings.
Configure the S3 Upload Snap to upload the S3 object (select/demo.csv) object to the S3 object bucket.
Use the Copy Snap to send the same information to multiple endpoints. Configure the Copy Snap to copy the S3 object document stream to the Snap's output views. In this example we configured four different output views.
Configure the S3 Select Snap to select a subset of the data from the S3 object. On validation, the Snap retrieves the data based on the SELECT statement.
Under Settings, expand the Input Data Format section to select options for File Header Info. This is where you specify whether to use the header information in the SELECT statement.
Add a CSV Parser Snap after each S3 Select Snap to read the CSV binary data from its input view, parse it, and then write it to its output view.
In this example we configured different output views that use different settings for File Header Info:Output 1: USE means the header data is used in the SELECT statement. In this example we are referencing a column name in the header (
variety = 'Setosa'
).SELECT * FROM s3Object s WHERE s.variety = 'Setosa'
Output 2: IGNORE means the header data is ignored in the SELECT statement. Here, we are using a column index instead of the column name to refer to the same column in the header (
s._5 = 'Setosa'
).SELECT * FROM s3Object s WHERE s._5 = 'Setosa'
Output 3: NONE means there is no header data in the SELECT statement.
SELECT * FROM s3Object s
Selecting a Subset of Data from an Amazon S3 Object (JSON file)
This example Pipeline demonstrates how to use the Amazon S3 Select Snap to select a subset of data from a JSON file.
Prerequisites: A valid AWS S3 Account
In this example we take a JSON file and use S3 Select to select a subset of the data (in this case, employees from the Sales department).
Overview of steps:
Upload a single JSON file to the S3 object using the JSON Generator Snap.
Format the file using the JSON Formatter Snap.
Upload the file using the S3 Upload Snap.
Select a subset of the data using the S3 Select Snap.
Use the JSON Parser Snap to read the JSON binary data from its input view, parse it, and then write it to its output view.
Use the JSON Generator Snap to generate a new JSON document for the next Snap in the Pipeline.
Configure the JSON Formatter Snap to format the data as specified in the Snap's settings.
Configure the S3 Upload Snap to upload the S3 object (employees.json) to the S3 object bucket.
Configure the S3 Select Snap to select a subset of the data from the S3 object (in this case, we are selecting employees from the Sales department). On validation, the Snap retrieves the data based on the SELECT statement.
Use the JSON Parser Snap to read the JSON binary data from its input view, parse it, and then write it to its output view.
The output shows the list of employees from the Sales department.
Download this Pipeline.
Downloads
Snap Pack History
Related Links
Have feedback? Email documentation@snaplogic.com | Ask a question in the SnapLogic Community
© 2017-2024 SnapLogic, Inc.