Overview

Use the HDFS ZipFile Writer Snap to read incoming data and write it to a ZIP file in an HDFS directory. You can also specify file access permissions for the new ZIP file and configure how the Snap behaves if the destination directory already contains a ZIP file with the same name.

For the HDFS protocol, use a SnapLogic on-premises Groundplex and ensure that its instance is within the Hadoop cluster and that SSH authentication is established.

The HDFS protocol supported by this Snap is HDFS 2.4.0.

Expected Input and Output

Expected Input: Binary data stream containing documents to be written to a ZIP file.
Expected Output: Zipped file containing the incoming documents.
Expected Upstream Snaps: Required. Any Snap that offers binary data in its output view. Examples: JSON Formatter, HDFS Reader, File Reader.
Expected Downstream Snaps: Any Snap that takes document data as input. Examples: Mapper, HDFS Reader.

Prerequisites

The user executing the Snap must have Write permissions on the target directory.

Configuring Accounts

This Snap uses account references created on the Accounts page of SnapLogic Manager to handle access to this endpoint. See Hadoop Accounts for information on setting up this type of account.

Configuring Views

Input	This Snap has at least one document input view.
Output	This Snap has at most one document output view.
Error	This Snap has at most one document error view.

Limitations and Known Issues

None at this time.

Modes

Ultra Pipelines: May work in Ultra Pipelines.
Spark Mode: Supports Spark mode.

Snap Settings

Label	Required. The name for the Snap. Modify this to be more specific, especially if the pipeline contains more than one instance of the same Snap.
Directory	The location of the directory where the new ZIP file must be saved. Syntax: Default value: [None]
File	The relative path and name of the file that must be created post execution. Examples: sample.zip, tmp/another.zip, $filename. Default value: [None]
User Impersonation	Select this check box to enable user impersonation. For encryption zones, use user impersonation. Default value: Not selected
	Generic user impersonation behavior:
	- When the User Impersonation check box is selected and Kerberos is the account type, the Client Principal configured in the Kerberos account impersonates the pipeline user.
	- When the User Impersonation check box is selected and Kerberos is not the account type, the user executing the pipeline is impersonated to perform HDFS operations. For example, if the user logged into the SnapLogic platform is operator@snaplogic.com, the user name "operator" is used to proxy the super user.
	User impersonation behavior for pipelines running on a Groundplex with a Kerberos account configured in the Snap:
	- When the User Impersonation check box is selected in the Snap, the pipeline user performs the file operation. For example, if the user logged into the SnapLogic platform is operator@snaplogic.com, the user name "operator" is used to proxy the super user.
	- When the User Impersonation check box is not selected in the Snap, the Client Principal configured in the Kerberos account performs the file operation.
	For non-Kerberized clusters, you must activate Superuser access in the Configuration settings.
	HDFS Snaps support the following accounts: Azure storage account, Azure Data Lake account, Kerberos account, and No account. When an account is configured with an HDFS Snap, the user impersonation setting affects only the Kerberos account; it has no effect on the other account types.
File Action	Required. Specifies what the Snap does if the file it is asked to create already exists. Available options:
	- Overwrite: If the target file exists, the Snap overwrites it.
	- Append: The Snap appends new records to the existing file. Append is supported for the ADL protocol only.
	- Ignore: If the file already exists, the Snap neither throws an exception nor overwrites the file; it creates an output document indicating that the new data has been ignored.
	- Error: If the file already exists, an error is displayed in the Pipeline Run Log.
	Default value: Overwrite
File Permissions	The file permission sets to assign to the file. To assign file permissions:
	- Click the + button against File permissions. This adds a row to the fieldset.
	- Click the Suggestible icon in the User type field and select the user type for which you want to enable access. The drop-down offers the following options: Owner (the user account under whose name the new file will be created), Group (the user group to which the user being impersonated belongs), and Others (all other users who have at least Read access to the target directory).
	- Click the Suggestible icon in the File permissions field and select the permission you want to enable for the user type selected in the User type field.
Base directory	Enter the name of the root directory in the ZIP file.
Use input view label	If selected, the input view label is used as the name of each file added to the ZIP file. Otherwise, the input view ID is used when the input binary stream does not have a content-location in its header. When this option is selected and an input view receives more than one binary input stream, the file names for the second and subsequent streams are the input view label with '_n' appended. If the label is in the format 'name.ext', '_n' is appended to 'name'; for example, name_2.ext for the second input stream. Example: If this option is selected, Base directory is testFolder, and the input view label is test.csv, the file name for the first binary input stream in that input view is testFolder/test.csv, the second is testFolder/test_2.csv, the third is testFolder/test_3.csv, and so on. Default value: Not selected
Snap execution	Select one of the three modes in which the Snap executes. Available options are: Validate & Execute: Performs limited execution of the Snap, and generates a data preview during Pipeline validation. Subsequently, performs full execution of the Snap (unlimited records) during Pipeline runtime. Execute only: Performs full execution of the Snap during Pipeline execution without generating preview data. Disabled: Disables the Snap and all Snaps that are downstream from it.
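
The naming rule described for Use input view label can be sketched in Python as follows. This is an illustrative sketch only, not the Snap's implementation: the function name zip_entry_name and its parameters are hypothetical, and splitting the label at its last dot is an assumption, because the setting description only covers labels of the form name.ext.

```python
import posixpath


def zip_entry_name(label: str, stream_index: int, base_directory: str = "") -> str:
    """Return the entry name for the n-th (1-based) binary stream of an input view.

    Hypothetical helper that mirrors the documented rule: the first stream uses
    the label as-is, later streams get '_n' appended to the name part.
    """
    if stream_index < 1:
        raise ValueError("stream_index is 1-based")
    if stream_index == 1:
        name = label
    else:
        # Assumption: split 'name.ext' at the last dot.
        stem, dot, ext = label.rpartition(".")
        if dot and stem:
            name = f"{stem}_{stream_index}.{ext}"
        else:
            # No extension found, so append '_n' to the whole label.
            name = f"{label}_{stream_index}"
    # ZIP entry names use forward slashes regardless of platform.
    return posixpath.join(base_directory, name) if base_directory else name


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(zip_entry_name("test.csv", n, base_directory="testFolder"))
```

With Base directory testFolder and the label test.csv, the sketch yields testFolder/test.csv, testFolder/test_2.csv, and testFolder/test_3.csv for the first three streams, matching the example in the setting description.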

The binary document header content-location of the HDFS ZipFile Writer input is the name of the file within the ZIP file (example: foo.txt). It does not include the base directory, although it can contain subdirectories. In contrast, the content-location header in the output of the HDFS ZipFile Reader consists of the name of the ZIP file, the base directory, and the content location provided to the Writer. Thus, while each Snap works well on its own, it is currently not possible to build a Reader > Writer > Reader combination in a pipeline without intermediate Snaps that supply the binary document header information.
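
To illustrate how a base directory and the incoming content-location values combine into entry names inside an archive, the following Python sketch builds a ZIP file in memory with the standard zipfile module. It does not show how the Snap is implemented; the helper names write_zip and entry_names, the sample data, and the base directory value are hypothetical.

```python
import io
import posixpath
import zipfile
from typing import Dict, List


def write_zip(streams: Dict[str, bytes], base_directory: str = "") -> bytes:
    """Build a ZIP archive in memory.

    The keys of 'streams' play the role of content-location values: names
    relative to the archive, without the base directory.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for content_location, payload in streams.items():
            if base_directory:
                entry = posixpath.join(base_directory, content_location)
            else:
                entry = content_location
            archive.writestr(entry, payload)
    return buffer.getvalue()


def entry_names(zip_bytes: bytes) -> List[str]:
    """List the entry names stored in a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.namelist()


if __name__ == "__main__":
    data = write_zip(
        {"foo.txt": b"hello", "sub/bar.txt": b"world"},
        base_directory="testFolder",
    )
    print(entry_names(data))
```

In this sketch the entries are stored as testFolder/foo.txt and testFolder/sub/bar.txt, while the incoming names were foo.txt and sub/bar.txt. This gap between incoming names and stored names is the kind of mismatch the paragraph above describes.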

Troubleshooting

Writing to S3 files with HDFS version CDH 5.8 or later

When running HDFS version CDH 5.8 or later, the Hadoop Snap Pack may fail to write to S3 files. To overcome this, make the following changes in Cloudera Manager:

Go to HDFS configuration.
In Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml, add an entry with the following details:
- Name: fs.s3a.threads.max
- Value: 15
Click Save.
Restart all the nodes.
Under Restart Stale Services, select Re-deploy client configuration.
Click Restart Now.
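
After the client configuration is redeployed, one way to confirm the setting is to inspect the deployed core-site.xml. The Python sketch below parses Hadoop's standard configuration/property/name/value layout. The helper name hadoop_property and the inline sample are hypothetical, and the location of core-site.xml on your nodes depends on your installation.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def hadoop_property(core_site_xml: str, name: str) -> Optional[str]:
    """Return the value of a named property from core-site.xml text, or None."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Inline sample standing in for the contents of a deployed core-site.xml.
SAMPLE = """<configuration>
  <property>
    <name>fs.s3a.threads.max</name>
    <value>15</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(hadoop_property(SAMPLE, "fs.s3a.threads.max"))
```

To check a real file, read its text and pass it to hadoop_property in place of SAMPLE.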

Examples

Writing and Reading a ZIP File in HDFS

The first part of this example demonstrates how you can use the HDFS ZipFile Writer Snap to zip and write a new file into HDFS. The second part demonstrates how you can unzip and check the contents of the newly created ZIP file.

You can download this pipeline from the Downloads section below.

Understanding the Sample Pipeline

Create the pipeline as shown below:

The Hadoop Directory Browser Snap

Use a Hadoop Directory Browser Snap to first check the contents of the target directory. This gives you a baseline for confirming, later in the example, that the new file was added to the HDFS directory as expected.

Enter the Directory URL as appropriate and specify the File filter as *.zip. This instructs the Snap to list all the ZIP files in the target directory.

If the Snap executes as expected, you should see the contents of your target directory, as shown below:

Generating a File for Upload

You now need to choose a file to upload into the target directory. You could either select a file directly or use a JSON Generator Snap coupled with a JSON Formatter Snap, as in the example pipeline.

The HDFS ZipFile Writer Snap

Your file is now ready. Configure the HDFS ZipFile Writer Snap to upload the file as a ZIP file into the target directory in HDFS, as shown below.

The Hadoop Directory Browser Snap

Use a Copy Snap to perform two tasks after the ZIP file is created: first, to check whether the new file was created as expected, and second, to read the contents of the newly created ZIP file from the target HDFS directory.

To check whether the new file was created, add a Hadoop Directory Browser Snap to the pipeline.

If the ZIP file was created, you should see it in the output, as shown below:

HDFS ZipFile Reader

Once you have confirmed that the new ZIP file has been created, use the HDFS ZipFile Reader Snap to read the new ZIP file. If the contents of the new ZIP file are the same as the contents of the input file, you know that the pipeline works.

To read the output of the HDFS ZipFile Reader Snap, use a File Reader Snap:

If the contents of the new file are the same as the contents of the original file, you know the example works.

Downloads

	File	Modified

No files shared here yet.

Additional Resources

Snap History