Snap type | Read
Description | This Snap browses a given directory path in the Hadoop file system (using the HDFS protocol) and generates a list of all the files in the directory and subdirectories. Use this Snap to identify the contents of a directory before you run any command that uses this information.
For example, if you need to iteratively run a specific command on a list of files, this Snap can help you view the list of all available files.
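In case it helps to relate the Snap's behavior to the underlying platform, the following is a minimal sketch of an equivalent recursive listing through the Hadoop FileSystem Java API. The NameNode host, port, and directory path are placeholders, and the sketch only illustrates the kind of listing the Snap produces; it is not the Snap's implementation.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListHdfsDirectory {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode host, port, and directory path.
        String directoryUrl = "hdfs://namenode.example.com:8020/user/demo/input";

        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(URI.create(directoryUrl), conf)) {
            // The second argument (true) recurses into subdirectories,
            // matching the Snap's listing of files in the directory
            // and its subdirectories.
            RemoteIterator<LocatedFileStatus> files =
                    fs.listFiles(new Path(directoryUrl), true);
            while (files.hasNext()) {
                LocatedFileStatus status = files.next();
                System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
            }
        }
    }
}
```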
Input and Output |
Prerequisites | The user executing the Snap must have at least Read permissions on the directory being browsed.
Support and limitations | Works in Ultra Task Pipelines.
Account | This Snap uses account references created on the Accounts page of the SnapLogic Manager to handle access to this endpoint. This Snap supports Azure Data Lake Gen2 OAuth2 and Kerberos accounts.
Views |
Settings
Label | Required. The name for the Snap. You can modify this to be more specific, especially if you have more than one of the same Snap in your pipeline.
Directory | The URL for the data source (directory). The Snap supports both HDFS and ABFS(S) protocols; representative URL formats for each protocol are shown below. When you use the ABFS protocol to connect to an endpoint, the account name and endpoint details provided in the URL override the corresponding values in the Account Settings fields.
Default value: [None]
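The angle-bracket values in the formats below are placeholders for your own host, port, file system (container), account name, endpoint, and directory path, not fixed values from this page. For Azure Data Lake Storage Gen2, the endpoint is typically dfs.core.windows.net.

```
hdfs://<hostname>:<port>/<path to directory>/

abfs://<filesystem>@<account name>.<endpoint>/<path to directory>/
abfss://<filesystem>@<account name>.<endpoint>/<path to directory>/
```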
File filter | Required. The GLOB pattern applied to select the contents (files/sub-folders) of the directory. The filter does not navigate directory structures recursively. The File filter property can be a JavaScript expression, which is evaluated with the values from the input view document. Example patterns are shown below.
Default value: [None]
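The patterns below are illustrative examples only of the kind of GLOB values this field accepts; they are not values from this guide.

```
*            select every file and sub-folder in the directory
*.csv        select only files ending in .csv
sales_2023*  select entries whose names start with sales_2023
```

When the field is expression-enabled, a value such as $filePattern (a hypothetical field in the input document) could supply the pattern at runtime.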
User Impersonation | Select this check box to enable user impersonation. For more information, see the related documentation on working with user impersonation.
Ignore empty result | If selected, no document is written to the output view when the result is empty. If not selected and the Snap receives an input document, the input document is passed to the output view; if not selected and there is no input document, an empty document is written to the output view.
Default value: Selected
Troubleshooting
Writing to S3 files with HDFS version CDH 5.8 or later

When running an HDFS version later than CDH 5.8, the Hadoop Snap Pack may fail to write to S3 files. To resolve this, make the following changes in Cloudera Manager:
Example
Hadoop Directory Browser in Action

The Hadoop Directory Browser Snap lists out the contents of a Hadoop file system directory. In this example, we shall:

* Use a Hadoop Directory Browser Snap to list the contents of a target directory.
* Create a new file in that directory using the output of the first Snap.
* Run a second Hadoop Directory Browser Snap against the same directory to confirm that the new file appears in the listing.
If the pipeline executes successfully, the second execution of the Hadoop Directory Browser will list out one additional file: the one we created using the output of the first execution of the Hadoop Directory Browser Snap.

How This Works

The table below lists the tasks performed by each Snap and documents the configuration details required for each Snap.
Run the pipeline. Once execution is done, click the Check Pipeline Statistics button to verify that it worked. You should find that the output of the second Hadoop Directory Browser Snap lists one more file than the output of the first: the file created from the first Snap's output.
...