Parquet Writer

This Snap converts documents into the Parquet format and writes the data to HDFS or S3. Nested schemas such as LIST and MAP are also supported. You can also use this Snap to pass schema information to the Catalog Insert Snap.

  • Expected upstream Snaps: Any Snap with a document output view. In the second input view, the Snap expects a Hive Execute Snap that provides the output of a "Describe table" statement.
  • Expected downstream Snaps: Any Snap with a document input view
  • Expected input: A document.
  • Expected output: A document with a filename for each Parquet file written.
    Example: {"filename" : "hdfs://localhost/tmp/2017/april/sample.parquet"}



The user must have access and permission to write to HDFS or AWS S3. 

Limitations and Known Issues:
  • "Generate template" does not work for nested structures such as MAP and LIST types. Generate template is a button within the schema editor, accessed through the Edit Schema property.
  • All expression Snap properties can be evaluated (when the '=' button is pressed) from pipeline parameters only, not from input documents. Input documents are data to be formatted and written to the target files. 
  • Parquet Snaps work well in a Linux environment. However, due to limitations in the Hadoop library on Windows, their functioning in a Windows environment may not always be as expected. We recommend you use a Linux environment for working with Parquet Snaps.
 How to Use the Parquet Writer Snap on a Windows Plex

The Parquet Writer Snap is tested against Windows Server 2008, 2010 and 2012.

To use the Parquet Writer Snap on a Windows Plex:

  1. Create a temporary directory. For example: C:\test\.
  2. Place two files, "hadoop.dll" and "winutils.exe", in the newly created temporary directory.
  3. Add the -Djava.library.path JVM option in the Windows Plex node properties.

  4. If you already have existing jvm_options, append "-Djava.library.path=C:\\test" to them, separated by a space.

  5. Restart the JCC for configurations to take effect.
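The screenshots referenced in the steps above are not shown in this extract; as a sketch, the resulting node-properties entry might look like the following (the -Xmx flag stands in for any jvm_options you already have):

```
jvm_options = -Xmx2g -Djava.library.path=C:\\test
```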

You cannot use a SAS URI (generated on a specific blob) through the SAS Generator Snap.


This Snap uses account references created on the Accounts page of SnapLogic Manager to handle access to this endpoint. This Snap supports several account types, as listed below.

The security model configured for the Groundplex (SIMPLE or KERBEROS authentication) must match the security model of the remote server. Due to limitations of the Hadoop library, the Snap can create the necessary internal credentials only for the security model configured on the Groundplex.


This Snap has one or two document input views. When the second input view is enabled, the Snap ignores other schema settings, such as the Edit Schema property or the Hive Metastore-related properties, and accepts the schema from the second input view only. When the second input view is disabled, the Snap expects to receive the schema from the information provided in the Hive Metastore URL property.

Supported data types: 

  • Primitive: boolean, integer, float, double, and byte_array
  • Logical: map, list

This Snap has at most one document output view.

Error: This Snap has at most one document error view and produces zero or more documents in the view.



Required. The name for the Snap. You can modify this to be more specific, especially if you have more than one of the same Snap in your pipeline.


A directory in one of the supported file storage systems to write data to. All files within the directory must be Parquet formatted.

The supported file storage systems are as follows:

Protocol | Directory Format | Example
wasb | wasb:///<storage container>/<path to directory> | wasb:///container/tmp
wasbs | wasbs:///<storage container>/<path to directory> | wasbs:///container/tmp
adl | adl://<store name>/<path to directory> | adl://storename/tmp
adls | adls://<store name>/<path to directory> | adls://storename/tmp





SnapLogic automatically appends the Azure Data Lake domain to the store name you specify; therefore, you do not need to add it to the URI when specifying the directory.

The Directory property is not used in pipeline execution or preview; it is used only in the Suggest operation. When you click the Suggest icon, the Snap displays a list of subdirectories under the given directory. It generates the list by applying the value of the Filter property.


  • hdfs://
  • webhdfs://
  • _dirname

Default value:  hdfs://<hostname>:<port>/


Use glob patterns to display a list of directories or files when you click the Suggest icon in the Directory or File property. A complete glob pattern is formed by combining the value of the Directory property with the Filter property. If the value of the Directory property does not end with "/", the Snap appends one, so that the value of the Filter property is applied to the directory specified by the Directory property.

Default Value: *

Glob Pattern Interpretation Rules

The following rules are used to interpret glob patterns:

  • The * character matches zero or more characters of a name component without crossing directory boundaries. For example, the *.csv pattern matches a path that represents a file name ending in .csv, and *.* matches all file names that contain a period.

  • The ** characters match zero or more characters across directories; therefore, it matches all files or directories in the current directory and in its subdirectories. For example, /home/** matches all files and directories in the /home/ directory.

  • The ? character matches exactly one character of a name component. For example, 'foo.?' matches file names that start with 'foo.' and are followed by a single-character extension.

  • The \ character is used to escape characters that would otherwise be interpreted as special characters. The expression \\ matches a single backslash, and \{ matches a left brace, for example.

  • The ! character is used to exclude matching files from the output. 
  • The [ ] characters form a bracket expression that matches a single character of a name component out of a set of characters. For example, '[abc]' matches 'a', 'b', or 'c'. The hyphen (-) may be used to specify a range, so '[a-z]' specifies a range that matches from 'a' to 'z' (inclusive). These forms can be mixed, so '[abce-g]' matches 'a', 'b', 'c', 'e', 'f' or 'g'. If the character after the [ is a ! then it is used for negation, so '[!a-c]' matches any character except 'a', 'b', or 'c'.

    Within a bracket expression, the '*', '?', and '\' characters match themselves. The '-' character matches itself if it is the first character within the brackets, or the first character after the !, if negating.

  • The '{ }' characters are a group of sub-patterns where the group returns a match if any sub-pattern in the group matches the contents of a target directory. The ',' character is used to separate sub-patterns. Groups cannot be nested. For example, the pattern '*.{csv, json}' matches file names ending with '.csv' or '.json'.

  • Leading dot characters in a file name are treated as regular characters in match operations. For example, the '*' glob pattern matches file name ".login".

  • All other characters match themselves.


  • '*.csv' matches all files with a csv extension in the current directory only.
  • '**.csv' matches all files with a csv extension in the current directory and in all its subdirectories.
  • *[!{.pdf,.tmp}] excludes all files with the extension PDF or TMP.
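The rules above follow Java-style glob semantics. For simple, single-component patterns, Python's fnmatch behaves similarly and can be used to sanity-check a filter value (a rough approximation, not the Snap's actual matcher):

```python
from fnmatch import fnmatch

# '*' matches any run of characters, '?' exactly one character,
# and '[abc]' a single character from a set.
assert fnmatch("sample.csv", "*.csv")       # extension match
assert not fnmatch("notes.json", "*.csv")   # different extension
assert fnmatch("foo.x", "foo.?")            # single-character extension
assert fnmatch("b.txt", "[abc].txt")        # bracket expression
```

Note that fnmatch's '*' can cross directory boundaries, unlike the Snap's glob rules, so this check is only reliable for file names without path separators.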


Filename or a relative path to a file under the directory given in the Directory property. It should not start with a URL separator "/". The File property can be a JavaScript expression, which is evaluated with values from the input view document. When you click the Suggest icon, the Snap displays a list of regular files under the directory in the Directory property. It generates the list by applying the value of the Filter property.


  • sample.csv
  • tmp/another.csv
  • _filename

Default value:  [None]

User Impersonation

Select this check box to enable user impersonation.

For encryption zones, use user impersonation. 

Default value:  Not selected

Generic User Impersonation Behavior

When the User Impersonation check box is selected, and Kerberos is the account type, the Client Principal configured in the Kerberos account impersonates the pipeline user.

When the User Impersonation option is selected, and Kerberos is not the account type, the user executing the pipeline is impersonated to perform HDFS operations. For example, if the user logged in to the SnapLogic platform is "operator", then the user name "operator" is used to proxy the superuser.

User impersonation behavior on pipelines running on Groundplex with a Kerberos account configured in the Snap

  • When the User Impersonation checkbox is selected in the Snap, the pipeline user performs the file operation. For example, if the user logged in to the SnapLogic platform is "operator", then the user name "operator" is used to proxy the superuser.
  • When the User Impersonation checkbox is not selected in the Snap, the Client Principal configured in the Kerberos account performs the file operation.

For non-Kerberized clusters, you must activate Superuser access in the Configuration settings.

HDFS Snaps support the following accounts:

  • Azure storage account
  • Azure Data Lake account
  • Kerberos account
  • No account

When an account is configured with an HDFS Snap, user impersonation settings have no effect for any account type except the Kerberos account.

Hive Metastore URL

This setting, along with the Database and Table settings, is used to determine the schema. If the data being written has a Hive schema, the Snap can be configured to read the schema instead of requiring it to be entered manually. Set the value to a Hive Metastore URL where the schema is defined.

Default value: [None]


The Hive Metastore database where the schema is defined. See the Hive Metastore URL setting for more information.

Default value: [None]


The table to read the schema from in the Hive Metastore database. See the Hive Metastore URL setting for more information.

Default value: [None]

Fetch Hive Schema at Runtime

When selected, the Snap fetches the schema from the Metastore table before writing. The write fails if the Snap cannot connect to the Metastore or the table does not exist during the pipeline's execution. When this checkbox is selected, the Metastore schema is used instead of the one set in the Snap's Edit Schema property.

Default value: Not selected

Edit Schema

A valid Parquet schema that describes the data. The schema can be specified based on a Hive Metastore table schema or generated from suggest data. Save the pipeline before editing the schema to generate suggest data, which assists in specifying the schema based on the schema of incoming documents. If no suggest data is available, some documentation and an example schema are generated instead. Alter one of those schemas to describe the input data.

The Parquet schema can also be written manually. A schema is defined by a list of fields; here is an example describing the contact information of a person.
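The referenced example schema is not reproduced in this extract; a minimal sketch of such a schema in Parquet's message syntax (field names are illustrative) might be:

```
message contact {
  required binary name (UTF8);
  optional binary phone_number (UTF8);
}
```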

After defining the message type, a list of fields is given. A field comprises a repetition, a type, and the field name. Available repetitions are required, optional, and repeated.

Each field has a type. The primitive types include:

  • binary - used for strings
  • fixed_len_byte_array - used for byte arrays of fixed length
  • boolean - a 1 bit boolean value
  • int32 - a 32 bit integer
  • int64 - a 64 bit integer
  • int96 - a 96 bit integer
  • float - a 32 bit floating point number
  • double - a 64 bit floating point number

These types can be annotated with a logical type to specify how the application should interpret the data. The Logical types include:
  • UTF8 - used with binary to specify the string as UTF8 encoded
  • INT_8 - used with int32 to specify the int as an 8 bit signed integer
  • INT_16 - used with int32 to specify the int as a 16 bit signed integer
  • Unsigned types - may be used to produce smaller in-memory representations of the data. If the stored value is larger than the maximum allowed by int32 or int64, then the behavior is undefined.
    • UINT_8 - used with int32 to specify the int as an 8 bit unsigned integer
    • UINT_16 - used with int32 to specify the int as a 16 bit unsigned integer
    • UINT_32 - used with int32 to specify the int as a 32 bit unsigned integer
    • UINT_64 - used with int64 to specify the int as a 64 bit unsigned integer
  • DECIMAL(precision, scale) - used to describe arbitrary-precision signed decimal numbers of the form value * 10^(-scale) to the given precision. The annotation can be used with int32, int64, fixed_len_byte_array, or binary. See the Parquet documentation for limits on the precision that can be given.
  • DATE - used with int32 to specify the number of days since the Unix epoch, 1 January 1970

    This Snap supports only the following date format: yyyy-MM-dd.

  • TIME_MILLIS - used with int32 to specify the number of milliseconds after midnight
  • TIMESTAMP_MILLIS - used with int64 to store the number of milliseconds from the Unix epoch, 1 January 1970
  • INTERVAL - used with a fixed_len_byte_array of length 12, where the array stores 3 unsigned little-endian integers. These integers specify
    • a number in months
    • a number in days
    • a number in milliseconds
  • JSON - used with binary to represent an embedded JSON document
  • BSON - used for an embedded BSON document
The following is an example of a schema using all of the primitive types and some of the logical types:
message document {
  # Primitive Types
  optional int32 32_num;
  optional int64 64_num;
  optional boolean truth;
  optional binary message;
  optional float pi;
  optional double e;
  optional int96 96_num;
  optional fixed_len_byte_array (1) one_byte;

  # Logical Types
  optional binary snowman (UTF8);
  optional int32 8_num (INT_8);
  optional int32 16_num (INT_16);
  optional int32 u8_num (UINT_8);
  optional int32 u16_num (UINT_16);
  optional int32 u32_num (UINT_32);
  optional int64 u64_num (UINT_64);
  optional int32 dec_num (DECIMAL(5,2));
  optional int32 jan7 (DATE);
  optional int32 noon (TIME_MILLIS);
  optional int64 jan7_epoch (TIMESTAMP_MILLIS);
  optional binary embedded (JSON);
}

"Generate template" does not work for nested structures such as MAP and LIST types.
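As an illustration of the arithmetic behind the DECIMAL and DATE annotations used in the schema above (a minimal sketch of the decoding rules, not Snap code):

```python
from datetime import date, timedelta

def decode_decimal(unscaled: int, scale: int) -> float:
    # DECIMAL(precision, scale): stored value * 10^(-scale)
    return unscaled / (10 ** scale)

def decode_date(days_since_epoch: int) -> date:
    # DATE: number of days since the Unix epoch, 1 January 1970
    return date(1970, 1, 1) + timedelta(days=days_since_epoch)

# 12345 stored with DECIMAL(5,2) represents 123.45
assert decode_decimal(12345, 2) == 123.45
# 6 days after the epoch is 7 January 1970 (cf. the jan7 field above)
assert decode_date(6) == date(1970, 1, 7)
```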


Required. The type of compression to use when writing the file. The available options are:

  • NONE
  • GZIP
  • LZO
  • To use LZO compression, you must explicitly enable the LZO compression type on the cluster (as an administrator) for the Snap to recognize and run the format. For more information, see Data Compression. For detailed guidance on setting up LZO compression, see Cloudera documentation on Installing the GPL Extras Parcel.
  • Many compression algorithms require both Java and system libraries and will fail if the latter are not installed. If you see unexpected errors, ask your system administrator to verify that all the required system libraries are installed; they are typically not installed by default. The system libraries will probably be located in the /usr/lib/x86_64-linux-gnu directory.
Partition by

Click the '+' button to add a new row, and use the suggest button to select a key name in the input document; the value of this key is used as the 'Partition by' folder name. All input documents should contain this key name, or an error document is written to the error view. See the 'Partition by' example below for an illustration.

Default value: [None]

Azure SAS URI Properties

Shared Access Signatures (SAS) properties of the Azure Storage account.

Specify the Shared Access Signatures (SAS) URI that you want to use to access the Azure storage blob folder specified in the Azure Storage Account.

You can get a valid SAS URI either from the Shared access signature in the Azure Portal or by generating one with the SAS Generator Snap.

If a SAS URI value is provided in the Snap settings, the account settings (if any account is attached) are ignored.

Snap Execution

Select one of the three modes in which the Snap executes. Available options are:

  • Validate & Execute: Performs limited execution of the Snap, and generates a data preview during Pipeline validation. Subsequently, performs full execution of the Snap (unlimited records) during Pipeline runtime.
  • Execute only: Performs full execution of the Snap during Pipeline execution without generating preview data.
  • Disabled: Disables the Snap and all Snaps that are downstream from it.


  • To generate a schema based on suggest data, the Pipeline must have the suggest-data available in the browser. This may require the user to save the Pipeline before editing the schema.
  • The Snap can only write data into HDFS.

Writing to S3 files with HDFS version CDH 5.8 or later

When running an HDFS version later than CDH 5.8, the Hadoop Snap Pack may fail to write to S3 files. To overcome this, make the following changes in Cloudera Manager:

  1. Go to HDFS configuration.
  2. In Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml, add an entry with the following details:
    • Name: fs.s3a.threads.max
    • Value: 15
  3. Click Save.
  4. Restart all the nodes.
  5. Under Restart Stale Services, select Re-deploy client configuration.
  6. Click Restart Now.
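In XML form, the safety-valve entry from step 2 corresponds to the following core-site.xml property:

```xml
<property>
  <name>fs.s3a.threads.max</name>
  <value>15</value>
</property>
```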

Unable to Connect to the Hive Metastore

Error Message: Unable to connect to the Hive Metastore.

Description: This error occurs when the Parquet Writer Snap is unable to fetch schema for Kerberos-enabled Hive Metastore.

Resolution: Pass the Hive Metastore's schema directly to the Parquet Writer Snap. To do so:

  1. Enable the 'Schema View' in the Parquet Writer Snap by adding the second Input View.
  2. Connect a Hive Execute Snap to the Schema View. Configure the Hive Execute Snap to execute the DESCRIBE TABLE command to read the table metadata and feed it to the schema view. 
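For example, the Hive Execute Snap's statement could be a plain DESCRIBE of the target table (the database and table names here are hypothetical):

```sql
-- Reads column names and types, which the Parquet Writer's
-- schema view consumes in place of a direct Metastore connection.
DESCRIBE mydb.contacts;
```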

Temporary Files

During execution, data processing on Snaplex nodes occurs principally in memory as streaming and is unencrypted. When a dataset being processed exceeds the available compute memory, the Snap writes pipeline data to local storage, unencrypted, to optimize performance. These temporary files are deleted when the Snap/pipeline execution completes. You can configure the temporary data's location in the Global properties table of the Snaplex's node properties, which can also help avoid pipeline errors due to the unavailability of space. For more information, see Temporary Folder in Configuration Options.


 Parquet Writer with the second input view

In the pipeline below, the Parquet Writer Snap receives the table metadata from the second input view. Alternatively, the Hive Metastore information may be provided directly to the Snap using the Hive Metastore URL property, in which case a single input view is sufficient.



The Parquet Writer Snap with Directory path:

The Hive Execute Snap with the table metadata information from the second input view of the Parquet Writer Snap:

 Parquet Writer configured to write to a local instance of HDFS

Here is an example of a Parquet Writer Snap configured to write to a local instance of HDFS. The output is written to /tmp/parquet-example. No Hive Metastore is configured and no compression is used.

See the documentation on the Schema setting to view an example of the schema.

 Parquet Writer using the Partition by

Example on "Partition by":

Assume the following input documents:

    [
        {
            "month" : "MAR",
            "day" : "01",
            "msg" : "Hello, World",
            "num" : 1
        },
        {
            "month" : "FEB",
            "day" : "07",
            "msg" : "Hello, World",
            "num" : 3
        },
        {
            "month" : "MAR",
            "day" : "01",
            "msg" : "Hello, World",
            "num" : 2
        },
        {
            "month" : "FEB",
            "day" : "07",
            "msg" : "Hello, World",
            "num" : 4
        }
    ]

The settings of the Parquet Writer Snap are as follows:

The pipeline execution generates two files, one for each distinct combination of the "month" and "day" values.



The key-value pairs for "month" and "day" will not be included in the output files.
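The partitioning behavior described above can be sketched in a few lines; the folder-name layout here is an assumption for illustration only:

```python
from collections import defaultdict

# The four input documents from the example above.
docs = [
    {"month": "MAR", "day": "01", "msg": "Hello, World", "num": 1},
    {"month": "FEB", "day": "07", "msg": "Hello, World", "num": 3},
    {"month": "MAR", "day": "01", "msg": "Hello, World", "num": 2},
    {"month": "FEB", "day": "07", "msg": "Hello, World", "num": 4},
]

partition_keys = ("month", "day")
partitions = defaultdict(list)
for doc in docs:
    # The folder name is derived from the 'Partition by' key values;
    # the exact path layout is hypothetical.
    folder = "/".join(doc[k] for k in partition_keys)
    # The partition keys are dropped from the written records, matching
    # the note that "month" and "day" are not in the output files.
    partitions[folder].append(
        {k: v for k, v in doc.items() if k not in partition_keys}
    )

# Two distinct (month, day) combinations -> two output files
assert sorted(partitions) == ["FEB/07", "MAR/01"]
```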

 Parquet Reader/Writer with S3 in Standard Mode

Reading and writing Parquet files from/to AWS S3 requires an S3 account.

  1. Create an S3 account or use an existing one.
    1. If it is a regular S3 account, name the account and supply the Access-key ID and Secret key.
    2. If the account is an IAM-role-enabled account:

      1. Select the IAM role checkbox.

      2. Leave the Access-key ID and Secret key blank.

      3. The IAM role properties are optional. You can leave them blank.

        To use IAM Role Properties, ensure to select the IAM Role check box.

  2. Within the Parquet Snap, use a valid S3 path for the directory in the format:
    s3://<bucket name>/<folder name>/.../<filename>

Inserting and Querying Custom Metadata from the Flight Metadata Table

The Pipeline in this zipped example demonstrates how you can:

  • Use the Catalog Insert Snap to update metadata tables.
  • Use the Catalog Query Snap to read the updated metadata information.

In this example:

  1. We import a file containing the metadata.
  2. We create a Parquet file using the data in the imported file.
  3. We insert metadata that meets specific requirements into a partition in the target table.
  4. We read the newly-inserted metadata using the Catalog Query Snap.

 Understanding the Pipeline

The Pipeline is designed as follows:

The File Reader Snap reads flight statistics and the JSON Parser Snap parses the data into a JSON file.

The Parquet Writer Snap creates a Parquet file with the data of the JSON file, in an S3 database.

The output of the Parquet Writer Snap includes the schema of the file. This is the metadata that must be included in the catalog.

The Catalog Insert Snap picks up the schema from the Parquet file and associates it with a specific partition in the target table. It also adds a custom property to the partition.

Once the Snap completes execution, the table is inserted into the metadata catalog and you can view the table in the SnapLogic Manager.

To view the table, navigate to the Project where you have created the Pipeline, click the Table tab, and then click the new table created after executing the Pipeline. This displays the table. Click Show schema to view the metadata.

The Schema view does not display the custom metadata that you inserted into the partition. Use the Catalog Query Snap to view all the updates made by the Catalog Insert Snap.

Download this ZIP file.


Working with the Sample ZIP File

This ZIP file contains two files:

  • Metadata_Catalog_Insert_Read.slp
  • AllDataTypes.json

To import this Pipeline:

  1. Download the ZIP file and extract its contents into a local directory.
  2. Import the Metadata_Catalog_Insert_Read.slp Pipeline into a SnapLogic project.
  3. Open the Pipeline and click the File Reader Snap.
  4. In the File Reader Settings popup, import and read the AllDataTypes.json file.
  5. Your Pipeline and test data are now ready. Review the other steps listed out in this example before validating or executing this Pipeline.

  File Modified

File Example_Parquet_Writer.slp

Aug 30, 2022 by Kalpana Malladi

Related Links

Snap Pack History

Release | Snap Pack Version | Date | Type | Updates
November 2022 | main18944 | Stable

The AWS S3 and S3 Dynamic accounts now support a maximum session duration of an IAM role defined in AWS.

August 2022 | main17386 | Stable | Extended the AWS S3 Dynamic Account support to ORC Reader and ORC Writer Snaps to support AWS Security Token Service (STS) using temporary credentials.
4.29 Patch | 429patches16630 | Latest
  • Extended the AWS S3 Dynamic Account support to ORC Reader and ORC Writer Snaps to support AWS Security Token Service (STS) using temporary credentials.
  • Fixed an issue in the following Snaps that use AWS S3 dynamic account, where the Snaps displayed the security credentials like Access Key, Secret Key, and Security Token in the logs. Now, the security credentials in the logs are blurred for the Snaps that use AWS S3 dynamic account.
4.29 | main15993 | Stable

Enhanced the AWS S3 Account for Hadoop account to include the S3 Region field that allows cross-region or proxied cross-region access to S3 buckets in the Parquet Reader and Parquet Writer Snaps.

4.28 Patch | 428patches15216 | Latest | Added the AWS S3 Dynamic account for Parquet Reader and Parquet Writer Snaps.
4.28 | main14627 | Stable | Upgraded with the latest SnapLogic Platform release.
4.27 Patch | 427patches13769 | Latest

Fixed an issue with the Hadoop Directory Browser Snap where the Snap was not listing the files in the given directory for Windows VM.

4.27 Patch | 427patches12999 | Latest | Enhanced the Parquet Reader Snap with the int96 As Timestamp checkbox, which when selected enables the Date Time Format field. You can use this field to specify a date-time format of your choice for int96 data-type fields. The int96 As Timestamp checkbox is available only when you deselect the Use old data format checkbox.





Enhanced the Parquet Writer and Parquet Reader Snaps with Azure SAS URI properties, and Azure Storage Account for Hadoop with SAS URI Auth Type. This enables the Snaps to consider SAS URI given in the settings if the SAS URI is selected in the Auth Type during account configuration. 

4.26 | 426patches12288 | Latest

Fixed a memory leak issue when using HDFS protocol in Hadoop Snaps.

4.26 | main11181 | Stable | Upgraded with the latest SnapLogic Platform release.
4.25 Patch | 425patches9975 | Latest

Fixed the dependency issue in the Hadoop Parquet Reader Snap while reading from AWS S3. The issue is caused by conflicting definitions for some of the AWS classes (dependencies) in the classpath.

  • Enhanced the HDFS Reader and HDFS Writer Snaps with the Retry mechanism that includes the following settings:
    • Number of Retries: Specifies the maximum number of retry attempts when the Snap fails to connect to the Hadoop server.
    • Retry Interval (seconds): Specifies the minimum number of seconds the Snap must wait before each retry attempt.
4.24 Patch | 424patches9262 | Latest

Enhanced the AWS S3 Account for Hadoop to support role-based access when you select the IAM role checkbox.

4.24 Patch | 424patches8876



Fixes the missing library error in Hadoop Snap Pack when running Hadoop Pipelines in JDK11 runtime.

Stable | Upgraded with the latest SnapLogic Platform release.
4.23 Patch | 423patches7440 | Latest

Fixes an issue in the HDFS Reader Snap by adding support for reading and writing files larger than 2 GB using the ABFS(S) protocol.

Stable | Upgraded with the latest SnapLogic Platform release.
Stable | Upgraded with the latest SnapLogic Platform release.
4.21 Patch | hadoop8853 | Latest

Updates the Parquet Writer and Parquet Reader Snaps to support the yyyy-MM-dd format for the DATE logical type.



Stable | Upgraded with the latest SnapLogic Platform release.
4.20 Patch | hadoop8776 | Latest

Updates the Hadoop Snap Pack to use the latest version of org.xerial.snappy:snappy-java for compression type Snappy, in order to resolve the java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.maxCompressedLength(I)I error.

Stable | Upgraded with the latest SnapLogic Platform release.
4.19 Patch | hadoop8270 | Latest

Fixes an issue with the Hadoop Parquet Writer Snap wherein the Snap throws an exception when the input document includes one or all of the following:

  • Empty lists.
  • Lists with all null values.
  • Maps with all null values.
Stable | Upgraded with the latest SnapLogic Platform release.
4.18 Patch | hadoop8033 | Latest

Fixed an issue with the Parquet Writer Snap wherein the Snap throws an error when working with WASB protocol.


Pushed automatic rebuild of the latest version of each Snap Pack to SnapLogic UAT and Elastic servers.


Added the Snap Execution field to all Standard-mode Snaps. In some Snaps, this field replaces the existing Execute during preview check box.


Added a new property, Output for each file written, to handle multiple binary input data in the HDFS Writer Snap.

  • Added two new Snaps: HDFS ZipFile Reader and HDFS ZipFile Writer.
  • Added support for the Data Catalog Snaps in Parquet Reader and Parquet Writer Snaps.
4.14 Patch | hadoop5888 | Latest

Fixed an issue wherein the Hadoop Snaps were throwing an exception when a Kerberized account is provided but the Snap is run in a non-Kerberized environment.

  • Added the Hadoop Directory Browser Snap, which browses a given directory path in the Hadoop file system using the HDFS protocol and generates a list of all the files in the directory. It also lists subdirectories and their contents.
  • Added support for S3 file protocol in the ORC Reader, and ORC Writer Snaps.
  • Added support for reading nested schema in the Parquet Reader Snap.
4.13 Patch | hadoop5318 | Latest
  • Fixed the HDFS Reader/Writer and Parquet Reader/Writer Snaps, wherein Hadoop configuration information does not parse from the client's configuration files.
  • Fixed the HDFS Reader/Writer and Parquet Reader/Writer Snaps, wherein User Impersonation does not work on Hadooplex.


  • KMS encryption support added to AWS S3 account in the Hadoop Snap Pack.
  • Enhanced the Parquet Reader, Parquet Writer, HDFS Reader, and HDFS Writer Snaps to support WASB and ADLS file protocols.
  • Added the AWS S3 account support to the Parquet Reader and Writer Snaps. 
  • Added second input view to the Parquet Reader Snap that when enabled, accepts table schema.
  • Supported with AWS S3, Azure Data Lake, and Azure Storage Accounts.
4.12 Patch | hadoop5132 | Latest

Fixed an issue with the HDFS Reader Snap wherein the pipeline becomes stale while writing to the output view.



Stable | Upgraded with the latest SnapLogic Platform release.
4.11 Patch | hadoop4275

Addressed an issue with the Parquet Reader Snap leaking file descriptors (connections to HDFS data nodes). Open file descriptor counts are now stable.


Added Kerberos support to the standard mode Parquet Reader and Parquet Writer Snaps.

4.10 Patch | hadoop4001 | Latest

Added support for the HDFS Writer to write to encryption zones.

4.10 Patch | hadoop3887 | Latest

Addressed the suggest issue for the HDFS Reader on Hadooplex.

4.10 Patch | hadoop3851 | Latest
  • ORC Snaps support read/write from the local file system.
  • Addressed an issue to bind the Hive Metadata to the Parquet Writer schema at runtime.
4.10 Patch | hadoop3838 | Latest

Made HDFS Snaps work with Zone encrypted HDFS.



  • Updated the Parquet Writer Snap with Partition by property to support the data written into HDFS based on the partition definition in the schema in Standard mode.
  • Support for S3 accounts with IAM Roles added to Parquet Reader and Parquet Writer
  • HDFS Reader/Writer with Kerberos support on Groundplex (including user impersonation).
4.9 Patch | hadoop3339 | Latest

Addressed the following issues:

  • ORC Reader passing, but ORC Writer failing when run on a Cloudplex.
  • ORC Reader Snap is not routing error to error view.
  • Intermittent failures with the ORC Writer
4.9.0 Patch | hadoop3020 | Latest

Added missing dependency org.iq80.snappy:snappy to Hadoop Snap Pack.

Stable | Upgraded with the latest SnapLogic Platform release.



Snap-aware error handling policy enabled for Spark mode in Sequence Formatter and Sequence Parser. This ensures the error handling specified on the Snap is used.

4.7.0 Patch | hadoop2343 | Latest

Spark Validation: Resolved an issue with validation failing when setting the output file permissions.



  • Updated the HDFS Writer and HDFS Reader Snaps with Azure Data Lake account for standard mode pipelines.
  • HDFS Writer: Spark mode support added to write to a specified directory in an Azure Storage Layer using the wasb file system protocol.
  • HDFS Reader: Spark mode support added to read a single file or an HDFS directory from an Azure Storage Layer.
  • The following Snaps now support error view in Spark mode: HDFS Reader, Sequence Parser.
  • Resolved an issue in the HDFS Writer Snap that sent the same data to both the output and error views.


  • HDFS Reader and HDFS Writer Snaps updated to support IAM Roles for Amazon EC2.
  • Support for Spark mode added to Parquet Reader, Parquet Writer
  • The HBase Snaps are no longer available as of this release.
  • Resolved an issue with Sequence Formatter not working in Spark mode.
  • Resolved an issue with HDFSReader not using the filter set when configuring SparkExec paths.
  • NEW! Parquet Reader and Writer Snaps
  • NEW! ORC Reader and Writer Snaps
  • Spark support added to the HDFS Reader, HDFS Writer, Sequence Formatter, and Sequence Parser Snaps.
  • Behavior change: HDFS Writer in SnapReduce mode now requires the File property to be blank.
  • Implemented wasbs:// protocol support in Hadoop Snap Pack.
  • Resolved an issue with HDFS Reader unable to read all files under a folder (including all files under its subfolders) using the ** filter.