Overview

The Sample Snap is a Flow type Snap that enables you to generate a sample dataset from the input dataset. Sampling is carried out using one of the following algorithms together with a predefined pass-through percentage:

Linear Split
Streamable Sampling
Strict Sampling
Stratified Sampling
Weighted Stratified Sampling

These algorithms are explained in the Snap Settings section below.

You can also provide a random seed so that the same sample set is generated for a given seed value. To optimize the Snap's usage of node memory, configure the maximum percentage of memory that the Snap can use to buffer the input dataset. If this limit is exceeded, the Snap writes the dataset to a temporary local file. This helps you avoid timeout errors when executing the pipeline.
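
The effect of a fixed seed can be illustrated with a minimal Python sketch. This is not the Snap's implementation: the helper name `sample_with_seed` and the use of Python's `random` module are assumptions made purely for illustration.

```python
import random

def sample_with_seed(records, fraction, seed):
    """Hypothetical helper (not the Snap's code): seeded random sample."""
    rng = random.Random(seed)          # dedicated generator, seeded for reproducibility
    k = int(len(records) * fraction)   # number of records to keep, rounded down
    return rng.sample(records, k)

records = list(range(100))
first = sample_with_seed(records, 0.5, seed=12345)
second = sample_with_seed(records, 0.5, seed=12345)

print(first == second)   # True: the same seed reproduces the same sample
print(len(first))        # 50
```

Because the seed fully determines the generator's sequence, rerunning with the same seed and the same input yields an identical sample.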

Input and Output

Expected input: The input document from which the sample dataset is to be generated. The Snap accepts both numeric and categorical data; the stratified sampling and weighted stratified sampling algorithms require datasets containing categorical fields.

Expected output:

First output: Document output containing the sample dataset.
Second output: Document output containing the dataset that is not present in the first output.
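
The relationship between the two outputs can be sketched as a partition: every input document lands in exactly one of them. The helper `split_into_views` below is hypothetical and only illustrates that relationship; it does not reflect how the Snap selects documents internally.

```python
import random

def split_into_views(records, fraction, seed=12345):
    """Hypothetical helper: return (first_output, second_output)."""
    rng = random.Random(seed)
    k = int(len(records) * fraction)
    # Indices of the documents routed to the first output view.
    chosen = set(rng.sample(range(len(records)), k))
    first_output = [r for i, r in enumerate(records) if i in chosen]
    second_output = [r for i, r in enumerate(records) if i not in chosen]
    return first_output, second_output

records = [{"id": i} for i in range(8)]
first_output, second_output = split_into_views(records, 0.25)
print(len(first_output), len(second_output))   # 2 6
```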

Expected upstream Snaps: Snaps that provide a document output stream containing the dataset. For example, CSV Generator or a combination of File Reader and CSV Parser.

Expected downstream Snaps: Snaps that accept a document input. For example, Mapper or a combination of JSON Parser and File Writer.

Prerequisites

A basic understanding of the sampling algorithms supported by the Snap is preferable.

Configuring Accounts

Accounts are not used with this Snap.

Configuring Views

Input	This Snap has exactly one document input view.
Output	This Snap has at most two document output views.
Error	This Snap has at most one document error view.

Troubleshooting

None

Limitations and Known Issues

None

Modes

Ultra Pipelines: Works with Ultra Pipelines only when Streamable Sampling is selected as the sampling algorithm.
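
A short sketch suggests why a per-record probabilistic algorithm suits record-at-a-time processing: the pass/drop decision depends only on the current record, so nothing has to be buffered. The generator `streamable_sample` is a hypothetical name used for illustration, not part of the Snap.

```python
import random

def streamable_sample(stream, probability, seed=None):
    """Yield (record, passed) pairs one at a time; illustrative only."""
    rng = random.Random(seed)
    for record in stream:
        # Each record is decided independently of all others,
        # so the stream never needs to be held in memory.
        yield record, rng.random() < probability

stream = iter(range(1000))   # stands in for an incoming document stream
decisions = list(streamable_sample(stream, 0.5, seed=12345))
passed = [record for record, ok in decisions if ok]
print(len(passed))           # near 500, but not guaranteed to be exactly 500
```

The other algorithms need to know the total record count (or the per-class counts) before they can decide, which is why they buffer the dataset first.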

Snap Settings

Label	Required. The name for the Snap. Modify this to be more specific, especially if there is more than one instance of the same Snap in the pipeline.
Pass through percentage	Required. The share of records to pass through to the output, expressed as a decimal fraction (for example, 0.5 for 50%). This value is treated differently depending on the algorithm selected. The number of records output by the Snap is determined by the pass-through percentage and the total number of input records. For example, with 100 records and a pass-through percentage of 0.5, 50 records are expected to be passed through; with 103 records, only 51 are expected. For the Stratified Sampling and Weighted Stratified Sampling algorithms, the number of records per class is also factored in. Default value: 0.5
Algorithm	Required. The sampling algorithm to be used. Choose one of the following options from the drop-down menu: Linear Split: Use this to partition a dataset. The Snap first buffers the dataset and then splits it based on the value you enter in Pass through percentage. For example, if you specify a pass-through percentage of 0.7, the first output view contains the first 70% of the documents and the second output view contains the remaining 30%. Streamable Sampling: Use this in Ultra Pipelines. The Snap passes each record with the probability defined in the Pass through percentage property. For example, with a pass-through percentage of 0.5, each record is passed with a probability of 50%, so the record count in the sample is not guaranteed to be exactly 50% of the input dataset. Strict Sampling: The Snap extracts a sample whose size matches the pass-through percentage exactly. Stratified Sampling: Use this to generate a sample dataset containing the same number of records for each class specified in the Stratified field property. This helps reduce the problem of an unbalanced dataset by down-sampling the majority classes while keeping most of the minority classes. Weighted Stratified Sampling: Use this to generate a sample that preserves the original ratio of the number of documents in each class specified in the Stratified field property. For example, if the pass-through percentage is 0.5, then 50% of the documents are passed through and the class ratio is unchanged. If you select Stratified Sampling or Weighted Stratified Sampling, you must also configure the Stratified field property. Default value: Streamable Sampling
Stratified field	Conditional. The field in the dataset that contains the class information for the data. This is a suggestible property that lists all the fields in the incoming dataset. Select the field to be treated as the stratified field; sampling is done based on this field. Example: Consider an employee record dataset containing fields such as Name, ID, Position, and Location. The Position and Location fields help you classify the data, so the input for this property is either $Position or $Location. Default value: None
Use random seed	If selected, Random seed is applied to the randomizer in order to get reproducible results. Default value: Selected
Random seed	Conditional. Required if the Use random seed property is selected. The number used as a static seed for the randomizer. The result can differ if the value specified in Maximum memory % or the JCC memory is different. Default value: 12345
Maximum memory %	Required. The maximum portion of the node's memory, as a percentage, that can be used to buffer the incoming dataset. If this percentage is exceeded, the dataset is written to a temporary local file and the sample is then generated from this temporary file. This setting is useful for handling large datasets without over-utilizing the node memory. The minimum default memory to be used by the Snap is set at 100 MB. Default value: 10
Snap Execution	Select one of the following three modes in which the Snap executes: Validate & Execute: Performs limited execution of the Snap and generates a data preview during Pipeline validation. Subsequently, performs full execution of the Snap (unlimited records) during Pipeline runtime. Execute only: Performs full execution of the Snap during Pipeline execution without generating preview data. Disabled: Disables the Snap and all Snaps that are downstream from it. Default value: Execute only. Example: Validate & Execute
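
The record counts quoted for Pass through percentage (50 of 100, 51 of 103) are consistent with rounding down, and the Linear Split description amounts to cutting the buffered dataset at that boundary. The sketch below assumes floor rounding, which is inferred from the 103-record example rather than confirmed; `expected_count` and `linear_split` are hypothetical names.

```python
import math

def expected_count(total_records, pass_through):
    """Records passed through, assuming the result is rounded down."""
    # The small epsilon guards against binary floating-point rounding error.
    return math.floor(total_records * pass_through + 1e-9)

def linear_split(records, pass_through):
    """Split a buffered dataset at the pass-through boundary, preserving order."""
    cut = expected_count(len(records), pass_through)
    return records[:cut], records[cut:]

print(expected_count(100, 0.5))   # 50
print(expected_count(103, 0.5))   # 51

first, second = linear_split(list(range(10)), 0.7)
print(first)    # [0, 1, 2, 3, 4, 5, 6]
print(second)   # [7, 8, 9]
```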

Temporary Files

During execution, data processing on Snaplex nodes occurs principally in-memory as streaming and is unencrypted. When a dataset being processed exceeds the available compute memory, the Snap writes unencrypted Pipeline data to local storage to optimize performance. These temporary files are deleted when the Snap/Pipeline execution completes. You can configure the location of the temporary data in the Global properties table of the Snaplex's node properties, which can also help avoid Pipeline errors caused by insufficient disk space. For more information, see Temporary Folder in Configuration Options.
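
As an analogy only, Python's standard `tempfile.SpooledTemporaryFile` shows the same buffer-then-spill pattern: data stays in memory until a size threshold is crossed and is then rolled over to a temporary file. The Snap's actual mechanism is not described here, and the 1 KB threshold below is an arbitrary value chosen for the illustration.

```python
import json
import tempfile

# Keep data in memory until it exceeds max_size bytes, then roll over to disk.
with tempfile.SpooledTemporaryFile(max_size=1024, mode="w+") as spool:
    for i in range(1000):
        spool.write(json.dumps({"id": i}) + "\n")   # total size far exceeds 1 KB

    spool.seek(0)                                    # rewind to read the records back
    records = [json.loads(line) for line in spool]

# The temporary file is removed automatically when the `with` block exits.
print(len(records))   # 1000
```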

Example

Data Sampling

This example applies four of the sampling algorithms to the same input document:

Streamable Sampling
Strict Sampling
Stratified Sampling
Weighted Stratified Sampling

Download this pipeline.

Understanding the pipeline

The input is a CSV document generated by the CSV Generator Snap. A preview of the output from the CSV Generator is as shown below:

This document is passed to the Copy Snap, which generates five document streams: four go into Sample Snaps and one goes into a Profile Snap. The Profile Snap generates a statistical profile of the incoming document, which in this case is the input document for the Sample Snaps. A preview of the output from the Profile Snap is as shown below:

The Profile Snap's output shows two relevant properties of the input document:

Total number of records: 50
Number of classes: 2 (M, F)
- Number of M documents: 33
- Number of F documents: 17

This data is useful in understanding how the Sample Snap creates a sample dataset for each sampling algorithm selected.

Using the same pass-through percentage (50%), all four sampling algorithms are demonstrated here:

Streamable Sampling: The Sample Snap is configured as shown below:

The output from the Snap is as shown below:

The downstream Profile Snap's output is useful in understanding the sample dataset's attributes:

The total number of documents in the sample dataset is 28, which is close to, but not exactly, the 25 documents that a 50% pass-through percentage gives for 50 input documents.
Strict Sampling: The Sample Snap is configured as shown below:

The output from the Snap is as shown below:

The downstream Profile Snap's output is useful in understanding the sample dataset's attributes:

The total number of documents in the sample dataset is 25, exactly 50% of the 50 input documents.
Stratified Sampling: The Sample Snap is configured as shown below:

The $Gender field is specified as the stratified field. The Snap selects an equal number of documents for each class of the stratified field while maintaining the pass-through percentage.

The output from the Snap is as shown below:

The downstream Profile Snap's output is useful in understanding the sample dataset's attributes:

The total number of documents in the sample dataset is 24.
Weighted Stratified Sampling: The Sample Snap is configured as shown below:

The $Gender field is specified as the stratified field. The Snap maintains the pass-through percentage while also maintaining the ratio of the classes.

The output from the Snap is as shown below:

The downstream Profile Snap's output is useful in understanding the sample dataset's attributes:

The total number of documents in the sample dataset is 24.
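
The sample sizes reported in this example (25, 24, and 24) can be reproduced from the Profile Snap's class counts under one plausible set of rounding rules. The sketch assumes that counts are rounded down, that Stratified Sampling divides the target evenly among the classes, and that Weighted Stratified Sampling applies the percentage to each class separately. These rules are inferred from the numbers in this example and are not confirmed Snap behavior.

```python
import math

class_counts = {"M": 33, "F": 17}     # from the Profile Snap output in this example
pass_through = 0.5
total = sum(class_counts.values())    # 50

# Strict Sampling: apply the percentage to the whole dataset.
strict = math.floor(total * pass_through)

# Stratified Sampling (assumed rule): the same number of records from every class.
per_class = math.floor(total * pass_through / len(class_counts))
stratified = {label: per_class for label in class_counts}

# Weighted Stratified Sampling (assumed rule): apply the percentage within each class.
weighted = {label: math.floor(n * pass_through) for label, n in class_counts.items()}

print(strict)                                  # 25
print(stratified, sum(stratified.values()))    # {'M': 12, 'F': 12} 24
print(weighted, sum(weighted.values()))        # {'M': 16, 'F': 8} 24
```

Under these assumptions both stratified variants lose one document to rounding, which matches the reported total of 24 in each case.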

Downloads

Important steps to successfully reuse Pipelines

Download and import the pipeline into the SnapLogic application.
Configure Snap accounts as applicable.
Provide pipeline parameters as applicable.

File	Modified
ML_Sample.slp	Nov 09, 2018 by Mohammed Iqbal

Snap Pack History

Release	Snap Pack Version	Date	Type	Updates
November 2024	main29029	13 Nov 2024	Stable	Updated and certified against the current SnapLogic Platform release.
August 2024	main27765	21 Aug 2024	Stable	Upgraded the `org.json.json` library from v20090211 to v20240303, which is fully backward compatible. Enhanced the Date Time Extractor Snap to support the date-time formats `YYYY-MM-dd HH:mm:ss` and `YYYY-MM-dd HH:mm:ss.SSS` and to allow the root path to auto-convert all fields.
May 2024	main26341	08 May 2024	Stable	Updated and certified against the current SnapLogic Platform release.
February 2024	436patches25781	03 Apr 2024	Latest	Enhanced the Deduplicate Snap to honor an interrupt while waiting in the delay loop, so that memory is managed efficiently.
February 2024	main25112	14 Feb 2024	Stable	Updated and certified against the current SnapLogic Platform release.
November 2023	main23721	08 Nov 2023	Stable	Updated and certified against the current SnapLogic Platform release.
August 2023	main22460	16 Aug 2023	Stable	Updated and certified against the current SnapLogic Platform release.
May 2023	433patches21572	20 Jun 2023	Latest	The Deduplicate Snap now manages memory efficiently and eliminates out-of-memory crashes using the following fields: Minimum memory (MB) and Minimum free disk space (MB).
May 2023	433patches21247	31 May 2023	Latest	Fixed an issue with the Match Snap where a null pointer exception was thrown when the second input view had fewer records than the first.
May 2023	main21015	10 May 2023	Stable	Upgraded with the latest SnapLogic Platform release.
February 2023	main19844	09 Feb 2023	Stable	Upgraded with the latest SnapLogic Platform release.
December 2022	431patches19268	19 Dec 2022	Latest	The Deduplicate Snap now treats fields containing empty strings or whitespace as having no data and ignores them.
November 2022	main18944	10 Nov 2022	Stable	Upgraded with the latest SnapLogic Platform release.
August 2022	main17386	11 Aug 2022	Stable	Upgraded with the latest SnapLogic Platform release.
4.29	main15993	14 May 2022	Stable	Upgraded with the latest SnapLogic Platform release.
4.28	main14627	20 Jul 2022	Stable	Enhanced the Type Converter Snap with the Fail safe upon execution checkbox. Select this checkbox to enable the Snap to convert data with valid data types, while ignoring invalid data types.
4.27	427patches13730			Enhanced the Type Converter Snap with the Fail safe upon execution checkbox. Select this checkbox to enable the Snap to ignore invalid data types and convert data with valid data types.
4.27	427patches13948	07 Jan 2022	Latest	Fixed an issue with the Principal Component Analysis Snap, where a deadlock occurred when data is loaded from both the input views.
4.27	main12833	13 Nov 2021	Stable	Upgraded with the latest SnapLogic Platform release.
4.26	main11181	14 Aug 2021	Stable	Upgraded with the latest SnapLogic Platform release.
4.25	425patches10994	04 Aug 2021		Fixed an issue with the Deduplicate Snap where the Snap failed when running on a locale that does not format decimals with the period (.) character.
4.25	main9554	08 May 2021	Stable	Upgraded with the latest SnapLogic Platform release.
4.24	main8556	13 Feb 2021	Stable	Upgraded with the latest SnapLogic Platform release.
4.23	main7430	14 Nov 2020	Stable	Upgraded with the latest SnapLogic Platform release.
4.22	main6403	12 Sep 2020	Stable	Upgraded with the latest SnapLogic Platform release.
4.21	snapsmrc542	09 May 2020	Stable	Introduces the Mask Snap, which enables you to hide sensitive information in your dataset before exporting the dataset for analytics or writing the dataset to a target file. Enhances the Match Snap by adding a new field, Match all, which matches one record from the first input with multiple records in the second input, and by adding a new Comparator option, Exact, which classifies a record as either an exact match or not a match at all. Enhances the Deduplicate Snap by adding a new field, Group ID, which includes the Group ID for each record in the output, and by adding the same Exact option to the Comparator field. Enhances the Sample Snap by adding a second output view, which displays data that is not in the first output, and a new algorithm type, Linear Split, which enables you to split the dataset based on the pass-through percentage.
4.20 Patch	mldatapreparation8771	18 Mar 2020	Latest	Removes the unused `jcc-optional` dependency from the ML Data Preparation Snap Pack.
4.20	snapsmrc535	08 Feb 2020	Stable	Upgraded with the latest SnapLogic Platform release.
4.19	snapsmrc528	14 Nov 2019	Stable	New Snap: Introducing the Deduplicate Snap. Use this Snap to remove duplicate records from input documents. When you use multiple matching criteria to deduplicate your data, it is evaluated using each criterion separately, and then aggregated to give the final result.
4.18	snapsmrc523	10 Aug 2019	Stable	Upgraded with the latest SnapLogic Platform release.
4.17 Patch	ALL7402	11 Jun 2019	Latest	Pushed automatic rebuild of the latest version of each Snap Pack to SnapLogic UAT and Elastic servers.
4.17	snapsmrc515	11 Jun 2019	Latest	New Snap: Introducing the Feature Synthesis Snap, which automatically creates features out of multiple datasets that share a one-to-one or one-to-many relationship with each other. New Snap: Introducing the Match Snap, which enables you to automatically identify matched records across datasets that do not have a common key field. Added the Snap Execution field to all Standard-mode Snaps. In some Snaps, this field replaces the existing Execute during preview check box.
4.16	snapsmrc508	16 Feb 2019	Stable	Added a new Snap, Principal Component Analysis, which enables you to perform principal component analysis (PCA) on numeric fields (columns) to reduce dimensions of the dataset.
4.15	snapsmrc500	15 Dec 2018	Stable	New Snap Pack. Perform preparatory operations on datasets such as data type transformation, data cleanup, sampling, shuffling, and scaling. Snaps in this Snap Pack are: Categorical to Numeric, Clean Missing Values, Date Time Extractor, Numeric to Categorical, Sample, Scale, Shuffle, and Type Converter.