...
Use this Snap to read document data from the input view and write it to the output view in binary (Parquet) format.
...
Snap Type
The Parquet Formatter Snap is a WriteFormat-type Snap.
Prerequisites
None.
Support for Ultra Pipelines
...
Does not work in Ultra Pipelines.
Limitations and Known Issues
None.
Snap Views
Type | Format | Number of Views | Examples of Upstream and Downstream Snaps | Description |
---|---|---|---|---|
Input | Document | | | Requires document data as input. You can override the schema setting by inserting an object like this into the second input view. |
Output | Binary | | | Writes the document data in the binary (Parquet) format to the output. |
Error | Error handling is a generic way to handle errors without losing data or failing the Snap execution. You can handle the errors that the Snap might encounter when running the Pipeline by choosing one of the following options from the When errors occur list under the Views tab:
Learn more about Error handling in Pipelines. |
Snap Settings
Field Name | Field Type | Description |
---|---|---|
Label*
Default Value: Parquet Formatter Example: Transform Parquet Formatter | String | Specify the name for the Snap. You can modify this to be more specific, especially if you have more than one of the same Snap in your pipeline. |
Edit Schema | Button | Specify a valid Parquet schema that describes the data.
The schema can be based on a Hive Metastore table schema or generated from suggest data. Save the pipeline before editing the schema to generate suggest data, which helps you specify the schema from the schema of the incoming documents. If no suggest data is available, an example schema is generated along with documentation. Alter one of those schemas to describe the input data.
The Parquet schema can also be written manually. A schema is defined by a list of fields, and here is an example describing the contact information of a person.
After defining the message type, a list of fields is given. A field comprises a repetition, a type, and the field name. The available repetitions are required, optional, and repeated.
These types can be annotated with a logical type to specify how the application should interpret the data. The logical types include:
Unsigned types - may be used to produce smaller in-memory representations of the data. If the stored value is larger than the maximum allowed by int32 or int64, the behavior is undefined.
DECIMAL(precision, scale) - describes arbitrary-precision signed decimal numbers of the form value * 10^(-scale) to the given precision. The annotation can be used with int32, int64, fixed_len_byte_array, or binary. See the Parquet documentation for the limits on precision.
DATE - used with int32 to specify the number of days since the Unix epoch, 1 January 1970. Note: This Snap supports only the following date format: yyyy-MM-dd.
INTERVAL - stores a number in months, a number in days, and a number in milliseconds.
The following is an example of a schema using all the primitive types and some examples of logical types: |
Compression
Default Value: NONE Example: SNAPPY | Dropdown list |
Choose the type of compression to use when writing the file. The following are the available options:
Many compression algorithms require both Java and system libraries, and the algorithms fail if the latter are not installed. If you see unexpected errors, ask your system administrator to verify that all the required system libraries are installed; they are typically not installed by default. The system libraries have names such as liblzo2.so.2 or libsnappy.so.1 and could be located in the /usr/lib/x86_64-linux-gnu directory.
| ||
Decimal rounding mode
Default Value: Half up Example: Up | Dropdown list | Select the required rounding method for decimal values when they exceed the required number of decimal places. The following are the available options:
| ||
Snap execution Default Value: Validate & Execute | Dropdown list | Select one of the following three modes in which the Snap executes:
|
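The Decimal rounding mode setting and the DECIMAL(precision, scale) and DATE annotations described above can be illustrated with a small Python sketch. It is not the Snap's implementation: the helper names `to_unscaled` and `date_to_days` are hypothetical, and the mapping of the Half up and Up options to Python's `ROUND_HALF_UP` and `ROUND_UP` constants is an assumption made for illustration.

```python
from datetime import date
from decimal import Decimal, ROUND_HALF_UP, ROUND_UP

# Assumed mapping from the dropdown labels to Python rounding constants.
ROUNDING_MODES = {"Half up": ROUND_HALF_UP, "Up": ROUND_UP}


def to_unscaled(value: str, scale: int, mode: str = "Half up") -> int:
    """Round `value` to `scale` decimal places and return the unscaled integer,
    so that the rounded value equals unscaled * 10**(-scale)."""
    quantum = Decimal(1).scaleb(-scale)  # scale=2 gives Decimal("0.01")
    rounded = Decimal(value).quantize(quantum, rounding=ROUNDING_MODES[mode])
    return int(rounded.scaleb(scale))


def date_to_days(text: str) -> int:
    """Convert a yyyy-MM-dd string to the number of days since 1 January 1970."""
    return (date.fromisoformat(text) - date(1970, 1, 1)).days


print(to_unscaled("12.345", 2, "Half up"))  # 1235
print(to_unscaled("12.341", 2, "Half up"))  # 1234
print(to_unscaled("12.341", 2, "Up"))       # 1235
print(date_to_days("1970-01-02"))           # 1
```

With a scale of 2, the value 12.341 has more decimal places than the schema allows, so the chosen mode decides whether 1234 or 1235 is stored as the unscaled integer.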
...
Schema
...
Error
...
Reason
...
Resolution
...
Account validation failed.
...
The Pipeline ended before the batch could complete execution due to a connection error.
...
Verify that the Refresh token field is configured to handle the inputs properly. If you are not sure when the input data is available, configure this field as zero to keep the connection always open.
Examples
Excluding Fields from the Input Data Stream
You can exclude unneeded fields from the input data stream by omitting them in the Input schema fieldset. This example demonstrates how to use the Parquet Formatter Snap to achieve this result:
...
Code Block |
---|
{
"schema": "message document {\n optional binary AUTOSYNC_PRIMARYKEY (STRING);\n optional binary AUTOSYNC_VALUEHASH (STRING);\n optional binary AUTOSYNC_CURRENTRECORDFLAG (STRING);\n optional int64 AUTOSYNC_EFFECTIVEBEGINTIME (TIMESTAMP(MILLIS,true));\n optional int64 AUTOSYNC_EFFECTIVEENDTIME (TIMESTAMP(MILLIS,true));\n optional double ID1;\n optional binary ID2 (STRING);\n optional binary ID3 (STRING);\n optional binary ID4 (STRING);\n optional binary ID5 (STRING);\n optional binary ID6 (STRING);\n optional binary ID7 (STRING);\n optional binary ID8;\n optional double ID9;\n optional double ID10;\n optional double ID11;\n optional double ID12;\n optional double ID13;\n optional double ID14;\n optional int32 ID15 (DATE);\n optional int64 ID16 (TIMESTAMP(MILLIS,true));\n optional int64 ID17 (TIMESTAMP(MILLIS,true));\n optional int64 ID18 (TIMESTAMP(MILLIS,true));\n optional double ID100;\n}\n"
} |
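A schema object of the shape shown above can also be assembled programmatically. The following Python sketch builds the message text from a list of fields, each given as a repetition, a type, a name, and an optional logical type, and wraps it under the `schema` key. The `build_schema` helper and the reduced field list are hypothetical and only mirror the layout of the object above; whether the Snap accepts this exact document on its second input view is an assumption.

```python
import json

VALID_REPETITIONS = {"required", "optional", "repeated"}


def build_schema(message_name, fields):
    """Build Parquet message-type text from
    (repetition, type, name, logical_type_or_None) tuples."""
    lines = [f"message {message_name} {{"]
    for repetition, physical_type, name, logical in fields:
        if repetition not in VALID_REPETITIONS:
            raise ValueError(f"unknown repetition: {repetition}")
        annotation = f" ({logical})" if logical else ""
        lines.append(f" {repetition} {physical_type} {name}{annotation};")
    lines.append("}")
    return "\n".join(lines) + "\n"


# A shortened field list taken from the object above.
fields = [
    ("optional", "binary", "ID2", "STRING"),
    ("optional", "double", "ID9", None),
    ("optional", "int32", "ID15", "DATE"),
    ("optional", "int64", "ID16", "TIMESTAMP(MILLIS,true)"),
]

schema_document = {"schema": build_schema("document", fields)}
print(json.dumps(schema_document, indent=1))
```

Leaving a field out of the list leaves it out of the generated schema text.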
Transform document data into Parquet format and vice versa
This example demonstrates how to convert input document data to Parquet format and then convert the Parquet data back to document output.
...
Download this pipeline.
Step 1: Configure the JSON Generator Snap with input data.
...
Step 2: Configure the Parquet Formatter Snap with the schema for the input document data.
...
Step 3: Configure the Parquet Parser Snap. On validation, the Snap converts the Parquet data back to document data.
Parquet Parser Configuration | Parquet Parser Output |
---|---|
Downloads
Info |
---|
|
...