On this Page

Snap type:

Read


Description:

This Snap reads data from HDFS (Hadoop File System) and produces a binary data stream at the output. For the hdfs protocol, please use a SnapLogic on-premises Groundplex and make sure that its instance is within the Hadoop cluster and SSH authentication has already been established. The Snap also supports reading from a Kerberized cluster using the HDFS protocol.

  • Expected upstream Snaps: [None]
  • Expected downstream Snaps: Any data transformation or formatting Snaps.
  • Expected input: [None]
  • Expected output: A document with the columns and data of the Parquet file.

HDFS 2.4.0 is supported for the HDFS protocol. This Snap supports HDFS, ADL (Azure Data Lake), ABFS(Azure Data Lake Storage Gen 2 ), and WASB(Azure storage) protocols.

Hadoop allows you to configure proxy users to access HDFS on behalf of other users; this is called impersonation. When user impersonation is enabled on the Hadoop cluster, any jobs submitted using a proxy are executed with the impersonated user's existing privilege levels rather than those of the superuser associated with the cluster. For more information on user impersonation in this Snap, see the section on User Impersonation below.

Prerequisites:

[None]

Limitations and Known Issues:
  • Supports reading from HDFS Encryption.
  • Works in Ultra Task Pipelines.
  • The platform does not support generating output previews for files larger than 8KB. This does not mean that the Snap has failed: The file will be read upon the Snap's execution; only the output preview will not be generated upon validation.
Account: 

This Snap uses account references created on the Accounts page of SnapLogic Manager to handle access to this endpoint. This Snap supports Azure storage account, Azure Data Lake account, Kerberos account, or no account. Account types supported by each protocol are as follows:

ProtocolAccount typesDocumentation
wasbAzure StorageAzure Storage
wasbsAzure StorageAzure Storage
adl

Azure Data Lake

Azure Data Lake
hdfsKerberosKerberos


Required settings for account types are as follows:

Account Type

Settings

Azure Storage

Account name, Primary access key

Azure Data Lake 

Tenant ID, Access ID, Secret Key

Kerberos Account

Client Principal, Service Principal, Keytab File

IAM Roles for Amazon EC2

global.properties

jcc.jvm_options = -DIAM_CREDENTIAL_FOR_S3=TRUE

Please note this feature is supported in only in Groundplex nodes hosted in the EC2 environment.

For more information on IAM Roles, see http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html

Kerberos Account UI Configuration


The security model configured for the Groundless (SIMPLE or KERBEROS authentication) must match the security model of the remote server. Due to limitations of the Hadoop library we are only able to create the necessary internal credentials for the configuration of the Groundplex.


Views:


InputThis Snap has at most one document input view. It may contain values for the File expression property.
OutputThis Snap has exactly one binary output view and provides the binary data stream read from the specified sources. Examples of Snaps that can be connected to this output are CSV Parser, JSON Parser, and XML Parser.
ErrorThis Snap has at most one document error view and produces zero or more documents in the view.


Settings

Label

Required. Specify the name for the Snap. You can modify this to be more specific, especially if you have more than one of the same Snap in your pipeline.

Directory



Specify the URL for the data source (directory). The Snap supports the following protocols.

  • hdfs://<hostname>:<port>/<path to directory>/

  • wasb:///<container name>/<path to directory>/

  • wasbs:///<container name>/<path to directory>/

  • adl://<container name>/<path to directory>/ 

  • abfs:///<filesystem>/<path>/
  • abfs://<filesystem>@<accountname>.<endpoint>/<path>
  • abfss:///<filesystem>/<path>/
  • abfss://<filesystem>@<accountname>.<endpoint>/<path>

When you use the ABFS protocol to connect to an endpoint, the account name and endpoint details provided in the URL override the corresponding values in the Account Settings fields.

The Directory property is not used in the pipeline execution or preview and used only in the Suggest operation. When you press the Suggest icon, it will display a list of subdirectories under the given directory. It generates the list by applying the value of the Filter property.

Examples:

  • hdfs://ec2-54-198-212-134.compute-1.amazonaws.com:8020/user/john/input

  • $filename

  • wasb:///snaplogic/testDir/

  • wasbs:///snaplogic/testDir/

  • adl://snapqa/

  • abfs:///filesystem2/dirl/a+b
  • abfss:///filesystem2/dirl/a+b

Default value:  hdfs://<hostname>:<port>/

SnapLogic automatically appends "azuredatalakestore.net" to the store name you specify when using Azure Data Lake; therefore, you do not need to add 'azuredatalakestore.net' to the URI while specifying the directory.


File Filter


Specify the Glob filter pattern.

Use glob patterns to display a list of directories or files when you click the Suggest icon in the Directory or File property. A complete glob pattern is formed by combining the value of the Directory property with the Filter property. If the value of the Directory property does not end with "/", the Snap appends one, so that the value of the Filter property is applied to the directory specified by the Directory property.


The following rules are used to interpret glob patterns:

The * character matches zero or more characters of a name component without crossing directory boundaries. For example, the *.csv pattern matches a path that represents a file name ending in .csv, and *.* matches all file names that contain a period.

The ** characters match zero or more characters across directories; therefore, it matches all files or directories in the current directory and in its subdirectories. For example, /home/** matches all files and directories in the /home/ directory.

The ? character matches exactly one character of a name component. For example, 'foo.?' matches file names that start with 'foo.' and are followed by a single-character extension.

The \ character is used to escape characters that would otherwise be interpreted as special characters. The expression \\ matches a single backslash, and \{ matches a left brace, for example.

The ! character is used to exclude matching files from the output. 

The [ ] characters form a bracket expression that matches a single character of a name component out of a set of characters. For example, '[abc]' matches 'a', 'b', or 'c'. The hyphen (-) may be used to specify a range, so '[a-z]' specifies a range that matches from 'a' to 'z' (inclusive). These forms can be mixed, so '[abce-g]' matches 'a', 'b', 'c', 'e', 'f' or 'g'. If the character after the [ is a ! then it is used for negation, so '[!a-c]' matches any character except 'a', 'b', or 'c'.

Within a bracket expression, the '*', '?', and '\' characters match themselves. The '-' character matches itself if it is the first character within the brackets, or the first character after the !, if negating.

The '{ }' characters are a group of sub-patterns where the group returns a match if any sub-pattern in the group matches the contents of a target directory. The ',' character is used to separate sub-patterns. Groups cannot be nested. For example, the pattern '*.{csv, json}' matches file names ending with '.csv' or '.json'.

Leading dot characters in a file name are treated as regular characters in match operations. For example, the '*' glob pattern matches file name ".login".

All other characters match themselves.

Examples:

'*.csv' matches all files with a csv extension in the current directory only.

'**.csv' matches all files with a csv extension in the current directory and in all its subdirectories.

*[!{.pdf,.tmp}] excludes all files with the extension PDF or TMP.


File


The name of the file to be read. This can also be a relative path under the directory given in the Directory property. It should not start with a URL separator "/".
The File property can be a JavaScript expression which will be evaluated with values from the input view document. When you press the Suggest icon, it will display a list of regular files under the directory in the Directory property. It generates the list by applying the value of the Filter property.
If this property is left blank (the * wildcard is used) when the Snap is executed, all files under the directory matching the glob filter will be read.
Example: 

  • sample.csv
  • tmp/another.csv
  • $filename
  • _filename

Default value:  [None]


User Impersonation


Select this check box to enable user impersonation.

For encryption zones, use user impersonation. 

Default value:  Not selected

For more information on working with user impersonation, click the link below.


Generic User Impersonation Behavior

When the User Impersonation check box is selected, and Kerberos is the account type, the Client Principal configured in the Kerberos account impersonates the pipeline user.

When the User Impersonation option is selected, and Kerberos is not the account type, the user executing the pipeline is impersonated to perform HDFS Operations. For example, if the user logged into the SnapLogic platform is operator@snaplogic.com, the user name "operator" is used to proxy the super user. 

User impersonation behavior on pipelines running on Groundplex with a Kerberos account configured in the Snap

  • When the User Impersonation checkbox is selected in the Snap, it is the pipeline user who performs the file operation. For example, if the user logged into the SnapLogic platform is operator@snaplogic.com, the user name "operator" is used to proxy the super user.
  • When the User Impersonation checkbox is not selected in the Snap, the Client Principal configured in the Kerberos account performs the file operation.



For non-Kerberised clusters, you must activate Superuser access in the Configuration settings.


HDFS Snaps support the following accounts:

  • Azure storage account
  • Azure Data Lake account
  • Kerberos account
  • No account

When an account is configured with an HDFS Snap, user impersonation settings have no impact on all accounts, except the Kerberos account.



Number Of Retries

Specify the maximum number of attempts to be made to receive a response.

  • The request is terminated if the attempts do not result in a response.
  • When the number of retries are exhausted, the Snap writes the error to the error view.
  • Retry operation, which is the attempts to receive a response occurs only when the Snap loses the connection with the server.

Default value: 0

Retry Interval (seconds)

Specify the time interval between two successive retry requests. A retry happens only when the previous attempt resulted in an exception.

Default value: 1


Select one of the three following modes in which the Snap executes: Available options are:

  • Validate & Execute: Performs limited execution of the Snap and generates a data preview during Pipeline validation; then performs full execution of the Snap (unlimited records) during Pipeline runtime.
  • Execute only: Performs full execution of the Snap during Pipeline execution without generating preview data.
  • Disabled: Disables the Snap and all Snaps downstream from it.

Troubleshooting