

Snap type:

Read


Description:

This Snap reads data from HDFS (Hadoop Distributed File System) and produces a binary data stream at the output. For the hdfs protocol, use a SnapLogic on-premises Groundplex and make sure that its instance is within the Hadoop cluster and that SSH authentication has already been established. The Snap also supports the webhdfs protocol, which does not require a Groundplex and works with all versions of Hadoop. The Snap also supports reading from a Kerberized cluster using the HDFS protocol.

  • Expected upstream Snaps: [None]
  • Expected downstream Snaps: Any data transformation or formatting Snaps.
  • Expected input: [None]
  • Expected output: A binary data stream containing the contents of the file read.
Note

HDFS 2.4.0 is supported for the HDFS protocol.

Hadoop allows you to configure proxy users to access HDFS on behalf of other users; this is called impersonation. When user impersonation is enabled on the Hadoop cluster, any jobs submitted using a proxy are executed with the impersonated user's existing privilege levels rather than those of the superuser associated with the cluster. For more information on user impersonation in this Snap, see the section on User Impersonation below.
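The following is a minimal, hypothetical sketch (not the Snap's actual implementation) of what reading a file over the hdfs protocol looks like with the standard Hadoop Java client; the host, port, and file path are placeholders.

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the NameNode; host and port are placeholders.
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode:8020"), new Configuration());
        // Open the file and stream its raw bytes, much like the Snap's
        // binary output view streams data to downstream Snaps.
        try (InputStream in = fs.open(new Path("/user/john/input/sample.csv"))) {
            in.transferTo(System.out);
        }
    }
}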

Prerequisites:

[None]

Limitations and Known Issues:
  • Supports reading from HDFS Encryption.
  • Works in Ultra Task Pipelines.
  • The platform does not support generating output previews for files larger than 8 KB. This does not mean that the Snap has failed: the file is still read when the Snap executes; only the output preview is not generated during validation.
Account: 

This Snap uses account references created on the Accounts page of SnapLogic Manager to handle access to this endpoint. This Snap supports an Azure Storage account, an Azure Data Lake account, a Kerberos account, or no account. The account types supported by each protocol are as follows:

Protocol    Account type       Documentation
wasb        Azure Storage      Azure Storage
wasbs       Azure Storage      Azure Storage
adl         Azure Data Lake    Azure Data Lake
hdfs        Kerberos           Kerberos


Required settings for account types are as follows:

Account Type                Settings
Azure Storage               Account name, Primary access key
Azure Data Lake             Tenant ID, Access ID, Secret Key
Kerberos                    Client Principal, Service Principal, Keytab File
IAM Roles for Amazon EC2    In global.properties: jcc.jvm_options = -DIAM_CREDENTIAL_FOR_S3=TRUE

Note that this feature is supported only in Groundplex nodes hosted in the EC2 environment.

For more information on IAM roles, see http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html

Kerberos Account UI Configuration


Warning

The security model configured for the Groundplex (SIMPLE or KERBEROS authentication) must match the security model of the remote server. Due to limitations of the Hadoop library, we can only create the necessary internal credentials for the configuration of the Groundplex.
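For reference, here is a minimal sketch of how a Hadoop client logs in against a Kerberized cluster; the principal and keytab path are placeholders, and the Snap performs the equivalent internally using the Kerberos account's Client Principal and Keytab File settings.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLoginSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Must match the security model of the remote server (see the Warning above).
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);
        // Placeholders for the Client Principal and Keytab File settings.
        UserGroupInformation.loginUserFromKeytab(
                "client@EXAMPLE.COM", "/etc/security/keytabs/client.keytab");
        System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
    }
}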


Views:


Input: This Snap has at most one document input view. It may contain values for the File expression property.
Output: This Snap has exactly one binary output view, which provides the binary data stream read from the specified sources. Examples of Snaps that can be connected to this output are CSV Parser, JSON Parser, and XML Parser.
Error: This Snap has at most one document error view and produces zero or more documents in the view.


Settings

Label

Required. Specify the name for the Snap. You can modify this to be more specific, especially if you have more than one of the same Snap in your pipeline.

Directory



Specify the URL for the data source (directory). The Snap supports the following protocols.

  • hdfs://<hostname>:<port>/<path to directory>/

  • webhdfs://<hostname>:<port>/<path to directory>/

  • wasb:///<container name>/<path to directory>/

  • wasbs:///<container name>/<path to directory>/

  • adl://<container name>/<path to directory>/ 

  • abfs:///<filesystem>/<path>/
  • abfs://<filesystem>@<accountname>.<endpoint>/<path>
  • abfss:///<filesystem>/<path>/
  • abfss://<filesystem>@<accountname>.<endpoint>/<path>

When you use the ABFS protocol to connect to an endpoint, the account name and endpoint details provided in the URL override the corresponding values in the Account Settings fields.

The Directory property is not used during pipeline execution or preview; it is used only in the Suggest operation. When you click the Suggest icon, the Snap displays a list of subdirectories under the given directory, generating the list by applying the value of the Filter property.

Examples:

  • hdfs://ec2-54-198-212-134.compute-1.amazonaws.com:8020/user/john/input

  • webhdfs://cdh-qa-2.fullsail.Snaplogic.com:50070/user/ec2-user/csv/

  • $filename

  • wasb:///snaplogic/testDir/

  • wasbs:///snaplogic/testDir/

  • adl://snapqa/

  • abfs:///filesystem2/dirl/a+b
  • abfss:///filesystem2/dirl/a+b

Default value:  hdfs://<hostname>:<port>/

Note

SnapLogic automatically appends azuredatalakestore.net to the store name you specify when using Azure Data Lake; therefore, you do not need to include azuredatalakestore.net in the URI when specifying the directory.


Filter


Specify a glob pattern (for example, *.csv) that the Snap applies when generating the list of subdirectories and files for the Suggest operation, and when reading files with a blank or wildcard File property.

File


Specify the name of the file to read. This can also be a relative path under the directory given in the Directory property; it should not start with the URL separator "/".
The File property can be a JavaScript expression, which is evaluated with values from the input view document. When you click the Suggest icon, the Snap displays a list of regular files under the directory given in the Directory property, generating the list by applying the value of the Filter property.
If this property is left blank when the Snap executes, the * wildcard is used and all files under the directory that match the glob filter are read (see the sketch after the examples below).
Examples:

  • sample.csv
  • tmp/another.csv
  • $filename
  • _filename

Default value:  [None]
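To make the glob behavior concrete, here is a hypothetical sketch (not the Snap's code) that resolves a pattern such as *.csv against a directory using Hadoop's FileSystem.globStatus; the host, port, and paths are placeholders.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode:8020"), new Configuration());
        // Lists every path directly under /user/john/input whose name ends
        // in .csv; globStatus returns null when nothing matches.
        FileStatus[] matches = fs.globStatus(new Path("/user/john/input/*.csv"));
        if (matches != null) {
            for (FileStatus status : matches) {
                System.out.println(status.getPath());
            }
        }
    }
}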


User Impersonation



Select this check box to enable user impersonation.

Note
For encryption zones, use user impersonation. 

Default value:  Not selected

For more information on working with user impersonation, click the link below.


User Impersonation Details

Generic User Impersonation Behavior

When the User Impersonation check box is selected, and Kerberos is the account type, the Client Principal configured in the Kerberos account impersonates the pipeline user.

When the User Impersonation option is selected, and Kerberos is not the account type, the user executing the pipeline is impersonated to perform HDFS Operations. For example, if the user logged into the SnapLogic platform is operator@snaplogic.com, the user name "operator" is used to proxy the super user. 

User impersonation behavior on pipelines running on Groundplex with a Kerberos account configured in the Snap

  • When the User Impersonation checkbox is selected in the Snap, it is the pipeline user who performs the file operation. For example, if the user logged into the SnapLogic platform is operator@snaplogic.com, the user name "operator" is used to proxy the super user.
  • When the User Impersonation checkbox is not selected in the Snap, the Client Principal configured in the Kerberos account performs the file operation.



For non-Kerberized clusters, you must activate Superuser access in the Configuration settings.


HDFS Snaps support the following accounts:

  • Azure storage account
  • Azure Data Lake account
  • Kerberos account
  • No account

When an account is configured with an HDFS Snap, user impersonation settings have no effect for any account type except the Kerberos account.
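The sketch below illustrates the generic proxy-user mechanism Hadoop uses for impersonation; the user name and path are placeholders, and the cluster must whitelist the proxy user in core-site.xml (the hadoop.proxyuser.<superuser>.hosts and hadoop.proxyuser.<superuser>.groups properties) for this to succeed.

import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class ImpersonationSketch {
    public static void main(String[] args) throws Exception {
        // The authenticated user (for example, the Client Principal) proxies
        // the pipeline user "operator".
        UserGroupInformation proxy = UserGroupInformation.createProxyUser(
                "operator", UserGroupInformation.getLoginUser());
        boolean readable = proxy.doAs((PrivilegedExceptionAction<Boolean>) () -> {
            // All HDFS operations inside doAs run as "operator".
            FileSystem fs = FileSystem.get(new Configuration());
            return fs.exists(new Path("/user/operator/input"));
        });
        System.out.println("Readable as operator: " + readable);
    }
}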



Number Of Retries

Specify the maximum number of attempts to be made to receive a response.

Info
  • The request is terminated if the attempts do not result in a response.
  • When the retries are exhausted, the Snap writes the error to the error view.
  • A retry, that is, a repeated attempt to receive a response, occurs only when the Snap loses its connection with the server.

Default value: 0

Retry Interval (seconds)

Specify the time interval between two successive retry requests. A retry happens only when the previous attempt resulted in an exception.

Default value: 1
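A minimal sketch of how these two settings interact, under the assumption that a retry is triggered only by a lost connection; openStream here is a hypothetical stand-in for the Snap's actual request.

import java.io.IOException;
import java.io.InputStream;

public final class RetrySketch {
    static InputStream openWithRetries(int maxRetries, int intervalSeconds)
            throws IOException, InterruptedException {
        IOException last = null;
        // One initial attempt plus up to maxRetries repeats.
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return openStream();
            } catch (IOException e) {
                last = e;  // connection lost; retry if attempts remain
                if (attempt < maxRetries) {
                    Thread.sleep(intervalSeconds * 1000L);
                }
            }
        }
        throw last;  // retries exhausted: the error goes to the error view
    }

    // Hypothetical stand-in for opening the connection to the server.
    private static InputStream openStream() throws IOException {
        throw new IOException("connection lost");
    }
}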

Snap Execution

Select one of the three following modes in which the Snap executes:

  • Validate & Execute: Performs limited execution of the Snap and generates a data preview during Pipeline validation; then performs full execution of the Snap (unlimited records) during Pipeline runtime.
  • Execute only: Performs full execution of the Snap during Pipeline execution without generating preview data.
  • Disabled: Disables the Snap and all Snaps downstream from it.

Troubleshooting

For troubleshooting information common to the Hadoop Snaps, see the Hadoop Directory Browser page.


For additional resources and related links, see the Hadoop Snap Pack page.