HDFS ZipFile Writer
Overview
Use the HDFS ZipFile Write Snap to read incoming data and write it to a ZIP file in an HDFS directory. This Snap also enables you to specify file access permissions for the new ZIP file, and to configure how the Snap handles the new ZIP file if the destination directory already contains another ZIP file with the same name.
For the HDFS protocol, use a SnapLogic on-premises Groundplex and ensure that its instance is within the Hadoop cluster and that SSH authentication is established.
The HDFS protocol version supported by this Snap is HDFS 2.4.0. This Snap supports both the HDFS and ABFS (Azure Data Lake Storage Gen2) protocols.
Expected Input and Output
- Expected Input: Binary data stream containing documents to be written to a ZIP file.
- Expected Output: Zipped file containing the incoming documents.
- Expected Upstream Snaps: Required. Any Snap that offers binary data in its output view. Examples: JSON Formatter, HDFS Reader, File Reader.
- Expected Downstream Snaps: Any Snap that takes document data as input. Examples: Mapper, HDFS Reader.
Prerequisites
The user executing the Snap must have Write permissions on the target directory. A quick way to confirm this from a shell on a cluster node is shown below.
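This check is a sketch, not part of the Snap itself; the directory path is illustrative and should be replaced with your target directory.
```
# Illustrative permission check; /data/landing is a placeholder path.
hdfs dfs -ls -d /data/landing                 # inspect owner, group, and permission bits
hdfs dfs -touchz /data/landing/_write_test    # succeeds only if the user has write access
hdfs dfs -rm /data/landing/_write_test        # clean up the test file
```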
Configuring Accounts
This Snap uses account references created on the Accounts page of SnapLogic Manager to handle access to this endpoint. See Configuring Hadoop Accounts for information on setting up this type of account.
Configuring Views
View | Description
---|---
Input | This Snap has at least one document input view.
Output | This Snap has at most one document output view.
Error | This Snap has at most one document error view.
Limitations and Known Issues
None at this time.
Modes
- Ultra Pipelines: Works in Ultra Pipelines.
Snap Settings
Setting | Description
---|---
Label | Required. The name for the Snap. Modify this to be more specific, especially if there is more than one of the same Snap in the pipeline.
Directory | The URL for the data source (directory). The Snap supports both the HDFS and ABFS(S) protocols; illustrative URL forms for each protocol appear after this table. When you use the ABFS protocol to connect to an endpoint, the account name and endpoint details provided in the URL override the corresponding values in the Account Settings fields. With the ABFS protocol, SnapLogic creates a temporary file to store the incoming data, so the hard drive on which the JCC runs must have enough space to temporarily store all the account data coming in from ABFS. Default value: [None]
File | The relative path and name of the file to create when the Snap executes. Default value: [None]
User Impersonation | Select this check box to enable user impersonation. Use user impersonation for encryption zones. A cluster-side configuration sketch appears after this table. Default value: Not selected
File Action | Required. Specifies what the Snap must do if the file it is about to create already exists. Available options: Overwrite, Ignore, and Error. Default value: Overwrite
File Permissions | The file permission sets to assign to the new file.
Base directory | The name of the root directory to create inside the ZIP file.
Use input view label | If selected, the input view label is used to name the files added to the ZIP file when an incoming binary stream has no content-location in its header; otherwise, the input view ID is used. When this option is selected and an input view carries more than one binary stream, the file names for the second and subsequent streams are the input view label appended with '_n'. If the label is in the format 'name.ext', '_n' is appended to 'name'; for example, name_2.ext for the second input stream.
Snap execution | Select one of the three modes in which the Snap executes: Validate & Execute, Execute only, or Disabled.
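The original URL examples for the Directory field did not survive on this page; the forms below are illustrative sketches, with every hostname, port, filesystem, account name, and path shown as a placeholder.
```
hdfs://<hostname>:<port>/<path to directory>/
abfs://<filesystem>@<account name>.<endpoint>/<path to directory>/
abfss://<filesystem>@<account name>.<endpoint>/<path to directory>/
```
User impersonation itself relies on Hadoop's standard proxyuser mechanism, configured on the cluster side. A minimal sketch, assuming the JCC process runs as a hypothetical OS user named snapuser:
```
<!-- Illustrative core-site.xml entries; "snapuser" is a hypothetical JCC user.
     Narrow the hosts/groups wildcards to fit your security policy. -->
<property>
  <name>hadoop.proxyuser.snapuser.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.snapuser.groups</name>
  <value>*</value>
</property>
```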
Troubleshooting
Writing to S3 files with HDFS version CDH 5.8 or later
When running CDH 5.8 or later, the Hadoop Snap Pack may fail to write to S3 files. To work around this, make the following changes in Cloudera Manager:
- Go to HDFS configuration.
- In Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml, add an entry with the following details (the rendered XML appears after these steps):
- Name: fs.s3a.threads.max
- Value: 15
- Click Save.
- Restart all the nodes.
- Under Restart Stale Services, select Re-deploy client configuration.
- Click Restart Now.
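The property name and value come from the steps above; once the safety-valve entry is saved, it typically renders in core-site.xml in this form:
```
<!-- Rendered core-site.xml entry for the safety-valve setting above -->
<property>
  <name>fs.s3a.threads.max</name>
  <value>15</value>
</property>
```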
Examples
Writing and Reading a ZIP File in HDFS
The first part of this example demonstrates how to use the HDFS ZipFile Write Snap to zip and write a new file into HDFS. The second part demonstrates how to unzip and verify the contents of the newly created ZIP file.
You can download this pipeline from the Downloads section below.
Downloads
Important steps to successfully reuse Pipelines
- Download and import the pipeline into the SnapLogic application.
- Configure Snap accounts as applicable.
- Provide pipeline parameters as applicable.