

Snap type:

Read


Description:

This Snap reads data from HDFS (Hadoop Distributed File System) and produces a binary data stream at the output. For the hdfs protocol, use a SnapLogic on-premises Groundplex and make sure that its instance is within the Hadoop cluster and that SSH authentication has already been established. The Snap also supports the webhdfs protocol, which does not require a Groundplex and works with all versions of Hadoop. The Snap also supports reading from a Kerberized cluster using the HDFS protocol.

  • Expected upstream Snaps: [None]
  • Expected downstream Snaps: Any data transformation or formatting Snaps.
  • Expected input: [None]
  • Expected output: A binary data stream containing the contents of the file read.
Note

HDFS 2.4.0 is supported for the HDFS protocol.

Hadoop allows you to configure proxy users to access HDFS on behalf of other users; this is called impersonation. When user impersonation is enabled on the Hadoop cluster, any jobs submitted using a proxy are executed with the impersonated user's existing privilege levels rather than those of the superuser associated with the cluster. For more information on user impersonation in this Snap, see the section on User Impersonation below.
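The following is a minimal, hypothetical sketch (not the Snap's actual implementation) of what reading a file over the hdfs protocol looks like with the standard Hadoop Java client; the host, port, and file path are placeholders.

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the NameNode; host and port are placeholders.
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode:8020"), new Configuration());
        // Open the file and stream its raw bytes, much like the Snap's
        // binary output view streams data to downstream Snaps.
        try (InputStream in = fs.open(new Path("/user/john/input/sample.csv"))) {
            in.transferTo(System.out);
        }
    }
}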

Prerequisites:

[None]

Limitations and Known Issues:
  • Supports reading from HDFS Encryption.
  • Works in Ultra Task Pipelines.
  • The platform does not support generating output previews for files larger than 8 KB. This does not mean that the Snap has failed: the file is still read when the Snap executes; only the output preview is not generated during validation.
Account: 

This Snap uses account references created on the Accounts page of SnapLogic Manager to handle access to this endpoint. This Snap supports an Azure Storage account, an Azure Data Lake account, a Kerberos account, or no account. The account types supported by each protocol are as follows:

Protocol    Account type       Documentation
wasb        Azure Storage      Azure Storage
wasbs       Azure Storage      Azure Storage
adl         Azure Data Lake    Azure Data Lake
hdfs        Kerberos           Kerberos


Required settings for account types are as follows:

Account Type                Settings
Azure Storage               Account name, Primary access key
Azure Data Lake             Tenant ID, Access ID, Secret Key
Kerberos                    Client Principal, Service Principal, Keytab File
IAM Roles for Amazon EC2    In global.properties: jcc.jvm_options = -DIAM_CREDENTIAL_FOR_S3=TRUE

Note that this feature is supported only in Groundplex nodes hosted in the EC2 environment.

For more information on IAM roles, see http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html

Kerberos Account UI Configuration


Warning

The security model configured for the Groundplex (SIMPLE or KERBEROS authentication) must match the security model of the remote server. Due to limitations of the Hadoop library, we can only create the necessary internal credentials for the configuration of the Groundplex.
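For reference, here is a minimal sketch of how a Hadoop client logs in against a Kerberized cluster; the principal and keytab path are placeholders, and the Snap performs the equivalent internally using the Kerberos account's Client Principal and Keytab File settings.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLoginSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Must match the security model of the remote server (see the Warning above).
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        // Placeholders for the Client Principal and Keytab File settings.
        UserGroupInformation.loginUserFromKeytab(
                "client@EXAMPLE.COM", "/etc/security/keytabs/client.keytab");
        System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
    }
}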


Views:


Input: This Snap has at most one document input view. It may contain values for the File expression property.
Output: This Snap has exactly one binary output view, which provides the binary data stream read from the specified sources. Examples of Snaps that can be connected to this output are CSV Parser, JSON Parser, and XML Parser.
Error: This Snap has at most one document error view and produces zero or more documents in the view.


Settings

Label

Required. Specify the name for the Snap. You can modify this to be more specific, especially if you have more than one of the same Snap in your pipeline.

Directory



Specify the URL for the data source (directory). The Snap supports the following protocols.

  • hdfs://<hostname>:<port>/<path to directory>/

  • webhdfs://<hostname>:<port>/<path to directory>/

  • wasb:///<container name>/<path to directory>/

  • wasbs:///<container name>/<path to directory>/

  • adl://<container name>/<path to directory>/ 

  • abfs:///<filesystem>/<path>/
  • abfs://<filesystem>@<accountname>.<endpoint>/<path>
  • abfss:///<filesystem>/<path>/
  • abfss://<filesystem>@<accountname>.<endpoint>/<path>

When you use the ABFS protocol to connect to an endpoint, the account name and endpoint details provided in the URL override the corresponding values in the Account Settings fields.

The Directory property is not used during pipeline execution or preview; it is used only in the Suggest operation. When you click the Suggest icon, the Snap displays a list of subdirectories under the given directory, generating the list by applying the value of the Filter property.

Examples:

  • hdfs://ec2-54-198-212-134.compute-1.amazonaws.com:8020/user/john/input

  • webhdfs://cdh-qa-2.fullsail.Snaplogic.com:50070/user/ec2-user/csv/

  • $filename

  • wasb:///snaplogic/testDir/

  • wasbs:///snaplogic/testDir/

  • adl://snapqa/

  • abfs:///filesystem2/dirl/a+b
  • abfss:///filesystem2/dirl/a+b

Default value:  hdfs://<hostname>:<port>/

Note

SnapLogic automatically appends azuredatalakestore.net to the store name you specify when using Azure Data Lake; therefore, you do not need to include azuredatalakestore.net in the URI when specifying the directory.


Filter


Specify a glob pattern (for example, *.csv) that the Snap applies when generating the list of subdirectories and files for the Suggest operation, and when reading files with a blank or wildcard File property.

File


Specify the name of the file to read. This can also be a relative path under the directory given in the Directory property; it should not start with the URL separator "/".
The File property can be a JavaScript expression, which is evaluated with values from the input view document. When you click the Suggest icon, the Snap displays a list of regular files under the directory given in the Directory property, generating the list by applying the value of the Filter property.
If this property is left blank when the Snap executes, the * wildcard is used and all files under the directory that match the glob filter are read (see the sketch after the examples below).
Examples:

  • sample.csv
  • tmp/another.csv
  • $filename
  • _filename

Default value:  [None]
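To make the glob behavior concrete, here is a hypothetical sketch (not the Snap's code) that resolves a pattern such as *.csv against a directory using Hadoop's FileSystem.globStatus; the host, port, and paths are placeholders.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode:8020"), new Configuration());
        // Lists every path directly under /user/john/input whose name ends
        // in .csv; globStatus returns null when nothing matches.
        FileStatus[] matches = fs.globStatus(new Path("/user/john/input/*.csv"));
        if (matches != null) {
            for (FileStatus status : matches) {
                System.out.println(status.getPath());
            }
        }
    }
}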


User Impersonation



Select this check box to enable user impersonation.

Note
For encryption zones, use user impersonation. 

Default value:  Not selected

For more information on working with user impersonation, click the link below.


User Impersonation Details

Generic User Impersonation Behavior

When the User Impersonation check box is selected, and Kerberos is the account type, the Client Principal configured in the Kerberos account impersonates the pipeline user.

When the User Impersonation option is selected, and Kerberos is not the account type, the user executing the pipeline is impersonated to perform HDFS Operations. For example, if the user logged into the SnapLogic platform is operator@snaplogic.com, the user name "operator" is used to proxy the super user. 

User impersonation behavior on pipelines running on Groundplex with a Kerberos account configured in the Snap

  • When the User Impersonation checkbox is selected in the Snap, it is the pipeline user who performs the file operation. For example, if the user logged into the SnapLogic platform is operator@snaplogic.com, the user name "operator" is used to proxy the super user.
  • When the User Impersonation checkbox is not selected in the Snap, the Client Principal configured in the Kerberos account performs the file operation.



For non-Kerberized clusters, you must activate Superuser access in the Configuration settings.


HDFS Snaps support the following accounts:

  • Azure storage account
  • Azure Data Lake account
  • Kerberos account
  • No account

When an account is configured with an HDFS Snap, user impersonation settings have no effect for any account type except the Kerberos account.
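The sketch below illustrates the generic proxy-user mechanism Hadoop uses for impersonation; the user name and path are placeholders, and the cluster must whitelist the proxy user in core-site.xml (the hadoop.proxyuser.<superuser>.hosts and hadoop.proxyuser.<superuser>.groups properties) for this to succeed.

import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class ImpersonationSketch {
    public static void main(String[] args) throws Exception {
        // The authenticated user (for example, the Client Principal) proxies
        // the pipeline user "operator".
        UserGroupInformation proxy = UserGroupInformation.createProxyUser(
                "operator", UserGroupInformation.getLoginUser());
        boolean readable = proxy.doAs((PrivilegedExceptionAction<Boolean>) () -> {
            // All HDFS operations inside doAs run as "operator".
            FileSystem fs = FileSystem.get(new Configuration());
            return fs.exists(new Path("/user/operator/input"));
        });
        System.out.println("Readable as operator: " + readable);
    }
}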



Number Of Retries

Specify the maximum number of attempts to be made to receive a response.

Info
  • The request is terminated if the attempts do not result in a response.
  • When the retries are exhausted, the Snap writes the error to the error view.
  • A retry, that is, a repeated attempt to receive a response, occurs only when the Snap loses its connection with the server.

Default value: 0

Retry Interval (seconds)

Specify the time interval between two successive retry requests. A retry happens only when the previous attempt resulted in an exception.

Default value: 1
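A minimal sketch of how these two settings interact, under the assumption that a retry is triggered only by a lost connection; openStream here is a hypothetical stand-in for the Snap's actual request.

import java.io.IOException;
import java.io.InputStream;

public final class RetrySketch {
    static InputStream openWithRetries(int maxRetries, int intervalSeconds)
            throws IOException, InterruptedException {
        IOException last = null;
        // One initial attempt plus up to maxRetries repeats.
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return openStream();
            } catch (IOException e) {
                last = e;  // connection lost; retry if attempts remain
                if (attempt < maxRetries) {
                    Thread.sleep(intervalSeconds * 1000L);
                }
            }
        }
        throw last;  // retries exhausted: the error goes to the error view
    }

    // Hypothetical stand-in for opening the connection to the server.
    private static InputStream openStream() throws IOException {
        throw new IOException("connection lost");
    }
}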

Snap Execution

Select one of the three following modes in which the Snap executes:

  • Validate & Execute: Performs limited execution of the Snap and generates a data preview during Pipeline validation; then performs full execution of the Snap (unlimited records) during Pipeline runtime.
  • Execute only: Performs full execution of the Snap during Pipeline execution without generating preview data.
  • Disabled: Disables the Snap and all Snaps downstream from it.

Troubleshooting

For troubleshooting information common to the Hadoop Snaps, see the Hadoop Directory Browser page.


For additional resources and related links, see the Hadoop Snap Pack page.