JAR Submit - Spark SQL 2.x

Overview

Use this Snap to run existing Java Spark applications on Azure Databricks or AWS EMR clusters.

Input and Output

  • Expected input: None
  • Expected output: None
  • Expected upstream Snaps: None
  • Expected downstream Snaps: None

Prerequisites

  • Your Snaplex must be an eXtremeplex with sufficient resources available to execute the Pipeline correctly.
  • The eXtremeplex must have at least two nodes because the Snap creates a child process; otherwise, the Pipeline completes execution with a "NoUpdate" message.
  • A valid account with appropriate access permissions.

Configuring Accounts

This Snap requires a valid SnapLogic account. See Configuring eXtreme Execute Accounts for details.

Configuring Views

Input

None.

Output

None.

Error

Not supported.

Troubleshooting

None.

Limitations 

  • eXtreme Pipelines that use the JAR Submit Snap on Azure Databricks fail with the job canceled when the SparkContext shuts down.
  • On AWS EMR, the Spark job remains in the running state even after the SnapLogic Designer/Dashboard displays the Pipeline status as Failed.
  • This Snap requires a cluster with at least one master node and two core nodes. If you use this Snap in a single-node AWS EMR cluster, the Pipelines do not execute and remain in the Started state until you stop/abort them manually; they also block other Pipelines in the queue. However, you can pass additional Spark submit arguments if you still want to use this Snap in a single-node cluster. For example, you can use the following arguments in an AWS EMR cluster with the c5.4xlarge instance type and one core node:

    --executor-memory 10g --num-executors 4 --driver-memory 4g

    • Use a variation of the above Spark submit arguments for other cluster configurations; a command-line equivalent is sketched after this list.
    • These Spark submit arguments may not work in some cases, depending on the JAR application being invoked, the available memory, and the CPU usage.
    • Contact SnapLogic Support for assistance if you want to use these Snaps in a single-node AWS EMR cluster.
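
For orientation, the contents of the Spark Submit Args field correspond to flags on a standard spark-submit command line. A hypothetical equivalent invocation (the class name and JAR name are placeholders, not values from this page):

    spark-submit --executor-memory 10g --num-executors 4 --driver-memory 4g \
        --class com.example.WordCount my-spark-app.jar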

Known Issues

None.

Modes


Snap Settings


Label

Name for this Snap.
Java JAR Path

Enter the path of the AWS S3 or Azure Databricks directory where the Java JAR file is located. The JAR file must exist in the same bucket/container that the eXtreme Execute account is configured to access.
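
A hypothetical example value, assuming an S3 bucket named my-bucket that the configured eXtreme Execute account can access:

    s3://my-bucket/spark-apps/my-spark-app.jar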

Starting with 4.24 GA, the Snap cannot access JAR files in the plex artifact bucket.

Script Args

Enter additional arguments to pass to the Java JAR file.

Main Class in JAR

Required. The fully qualified name of the main class in the JAR file.

Spark Submit Args

Enter additional Spark submit arguments to pass to the JAR file.
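
Hypothetical example values for these fields (the class name, path, and flags are placeholders that depend entirely on your application):

    Main Class in JAR:  com.example.WordCount
    Script Args:        s3://my-bucket/data/input.txt
    Spark Submit Args:  --executor-memory 10g --num-executors 4
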
Timeout

Specify the maximum duration, in seconds, for which the script is allowed to run. Set this value to 1 or greater to enforce a timeout.

Default: -1

Both the default value of -1 and a value of 0 result in no timeout; the script can run indefinitely.
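
To make these settings concrete, here is a minimal sketch of a Java Spark application whose main class could be supplied in the Main Class in JAR field; every name in it is hypothetical:

    package com.example;

    import org.apache.spark.sql.SparkSession;

    // Hypothetical application: its fully qualified class name,
    // com.example.WordCount, is what you would enter in Main Class in JAR.
    public final class WordCount {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("WordCount")
                    .getOrCreate();
            // Values from the Script Args field arrive as ordinary arguments.
            String inputPath = args[0];
            long lines = spark.read().textFile(inputPath).count();
            System.out.println("Line count: " + lines);
            spark.stop();
        }
    }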


Additional Scripts for Azure Databricks

Azure Databricks uses a shared context: all Pipelines running in the same cluster generate and write logs to a single cluster log. To get correct job statistics per Pipeline, set the scheduler pool name to the Pipeline runtime ID, which is passed through the code internally as the last script argument. The script must read it and include the following line:

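A minimal Java sketch, assuming the application's SparkSession is named spark (setLocalProperty is a standard Spark API; the variable names are placeholders):

    // Read the Pipeline runtime ID, passed internally as the last script
    // argument, and set it as the scheduler pool name.
    String runtime = args[args.length - 1];
    spark.sparkContext().setLocalProperty("spark.scheduler.pool", runtime);
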
"runtime" above is the last argument passed.

Example


This Snap is intended to be used as a stand-alone Pipeline. The example described here matches the Pipeline file in the Download section.

In this case, the JAR file is invoked from its S3 location through the Java JAR Path field, using the complete file path including the .jar extension.

You can run the JAR file as a CLI executable by entering the file path in the Script Args field.


Download


File: JAR_Submit_Basic_UseCase.slp
Modified: Feb 08, 2019 by John Brinckwirth


Snap Pack History

4.27 (main12833)

  • No updates made.

4.26 (main11181)

  • No updates made.

4.25 (main9554)

  • No updates made.

4.24 (main8556)

4.23 (main7430)

  • Accounts now support validation: click Validate in the account settings dialog to verify that your account is configured correctly.

4.22 (main6403)

  • No updates made.

4.21 Patch 421patches5928

  • Adds Hierarchical Data Format v5 (HDF5) support in AWS EMR. With this enhancement, you can read HDF5 files and parse them into JSON files for further data processing. See Enabling HDF5 Support for details.
  • Adds support for Python virtual environments to the PySpark Script Snap to enable reading HDF5 files in the S3 bucket. You can specify the path to the virtual environment's ZIP file in the Snap.

4.21 Patch 421patches5851

  • Optimizes Spark engine execution on AWS EMR, requiring fewer compute resources.

4.21 (snapsmrc542)

  • No updates made.

4.20 (snapsmrc535)

  • Introduced a new account type, Azure Databricks Account. This enhancement makes account configuration mandatory for the PySpark Script and JAR Submit Snaps.
  • Enhanced the PySpark Script Snap to display the Pipeline Execution Statistics after a Pipeline with the Snap executes successfully.

4.19 (snapsmrc528)

  • No updates made.

4.18 (snapsmrc523)

  • No updates made.

4.17 Patch ALL7402

  • Pushed automatic rebuild of the latest version of each Snap Pack to SnapLogic UAT and Elastic servers.

4.17 (snapsmrc515)

  • No updates made. Automatic rebuild with a platform release.

4.16 (snapsmrc508)

  • New Snap Pack. Execute Java Spark and PySpark applications through the SnapLogic platform. Snaps in this Snap Pack are:
    • JAR Submit: Upload your existing Spark Java JAR programs as eXtreme Pipelines.
    • PySpark Script: Upload your existing PySpark scripts as eXtreme Pipelines.