PySpark Script - Spark SQL 2.x

Overview

Use this Snap to run an existing PySpark script on an Azure Databricks or EMR cluster.

Input and Output

  • Expected input: None
  • Expected output: None
  • Expected upstream Snaps: None
  • Expected downstream Snaps: None

Prerequisites

  • Your Snaplex type must be an eXtremeplex.
  • The eXtremeplex must have at least two nodes since the Snap creates a child process. Otherwise, the Pipeline completes execution with a NoUpdate message. 
  • A valid account with appropriate access permissions.

Configuring Accounts

This Snap requires a valid SnapLogic account. See Configuring eXtreme Execute Accounts for details.

Configuring Views

  • Input: None.
  • Output: None.
  • Error: Not supported.

Troubleshooting

None.

Limitations

  • On AWS EMR, the Spark job continues to run even after the SnapLogic Designer/Dashboard displays the Pipeline status as Failed.
  • This Snap requires a cluster with at least one master node and two core nodes. If you use this Snap in a single-node AWS EMR cluster, the Pipelines do not execute and remain in the Started state until you stop or abort them manually; they also block other Pipelines in the queue. However, you can use additional Spark submit arguments if you still want to use this Snap in a single-node cluster. For example, you can use the following arguments in an AWS EMR cluster with the c5.4xlarge instance type and one core node:

    --executor-memory 10g --num-executors 4 --driver-memory 4g
    • You must use a variation of the above Spark submit arguments for other cluster configurations.
    • The Spark submit arguments may not work in some cases, depending on the JAR submit application being invoked, the available memory, and the CPU usage.
    • Contact SnapLogic Support for assistance if you want to use these Snaps in a single-node AWS EMR cluster.

Known Issue

Breaking change for Pipelines containing the PySpark Script Snap: Due to an internal field name change, existing Pipelines (created prior to 4.24 GA) will likely fail. You must recreate these Pipelines for them to execute successfully.

 

Modes


Snap Settings


Label

Name for this Snap.
Enable PySpark editor

Select this check box to prepare or edit your PySpark script in the PySpark script editor. Click the Edit PySpark Script button to start editing.

Alternatively, you can deselect this check box and provide the script file's path/location in the PySpark Script Path field.

Default value: Not selected

Edit PySpark script

When you click this option, a new window displays the source script, which you can edit manually. Alternatively, if you do not specify a file path to a script in the PySpark Script Path field, you can click Import to upload a script from a Project folder in your Org. You can also click Export to download the script file as a .txt file.
PySpark Script Path

Activated when you deselect the Enable PySpark editor check box.

Enter the path of the AWS S3 or Azure Databricks directory where the PySpark script file is located. The script file must exist in the same bucket/container that the eXtreme Execute account is configured to access.

Starting with 4.24 GA, the Snap cannot access scripts in the plex artifact bucket.

Script Args

Enter additional PySpark script arguments to pass to the source script file.
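
Script arguments are passed to the script positionally. Below is a minimal sketch of how a PySpark script might read them; the two argument names are hypothetical. On Azure Databricks, the scheduler pool name is appended after your own arguments (see Additional Script for Azure Databricks below):

    import sys

    # Hypothetical positional arguments, e.g. Script Args: "2021-01-01 us-east-1"
    run_date = sys.argv[1]
    region = sys.argv[2]
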
Spark submit args

Enter additional Spark submit arguments to apply to the source script file.

The PySpark Script Snap does not support the --archives option. However, the Snap uses the option internally while handling the ZIP file of the virtual machine environment.

Example:

  • --py-files s3://mybucket/script/helloworld.py
  • --files s3://mybucket/files/sample.egg

Default value: N/A
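
When you distribute additional Python modules with --py-files, the script can import them like any other module. A minimal sketch, assuming the hypothetical helloworld.py from the example above defines a greet() function:

    # helloworld.py was distributed via --py-files, so it is available on the driver and executors.
    import helloworld

    helloworld.greet()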

Virtual environment path

Applicable only to AWS EMR. Enter the path for the virtual machine environment's ZIP file in the S3 bucket. See Creating Virtual Machine Environment for details. 

Example: s3://mybucket/virtualenv/venv.zip

Default value: N/A

Timeout

Specify, in seconds, the duration for which the script is allowed to run. Set this value to 1 or greater to enforce a timeout.

Default value: -1

The default value (-1) results in an indefinite timeout; setting the value to 0 also produces no timeout.

Pipeline Execution Statistics

After a Pipeline with the PySpark Script Snap executes successfully, you can view the Pipeline Execution Statistics from the following locations:

  • In the Designer tab, click the statistics icon to open statistics for the current execution or validation of a Pipeline.
  • In the Dashboard > Pipeline, click on the status (Completed for past statistics and Started for active statistics) in the Status column.

The following statistics are displayed under the Snap Statistics tab for the PySpark Script Snap:

  • Application Name: Name of the application for the parent and child jobs, if any.
  • App ID: The application ID for both the parent and child jobs.
  • Snap Pack versions: The Snap Pack version of the Snaps in the Pipeline.
  • Total Input Bytes: The total number of bytes that were input during the Pipeline execution.
  • Total Output Bytes: The total number of bytes that were output during the Pipeline execution.
  • Total Input Records: The total number of records that were passed as input during the Pipeline execution.
  • Total Output Records: The total number of records that were passed as output during the Pipeline execution.

For details on the rest of the Pipeline Execution Statistics, see here.

Additional Script for Azure Databricks

Azure Databricks uses a shared context, and all Pipelines running in the same cluster write logs to a single cluster log. To get correct job statistics per Pipeline, set the scheduler pool name to the Pipeline runtime ID, which the Snap passes internally as the last script argument. The script must read this argument and include a line similar to the following:
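
A minimal sketch, assuming the script obtains the active SparkContext and sets the standard spark.scheduler.pool property:

    import sys
    from pyspark import SparkContext

    # The Pipeline runtime ID arrives as the last script argument; use it as the scheduler pool name.
    SparkContext.getOrCreate().setLocalProperty("spark.scheduler.pool", sys.argv[-1])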

If the script uses two script arguments, then the scheduler pool's name is passed as the third argument.


Example


This Snap is intended to be used as a stand-alone Pipeline. You can invoke a PySpark script from its S3 location through the path; however, if you do not have access to that location, or if you have edited the script on your local machine, you can copy its contents directly into the Edit PySpark script window.

Likewise, you can click Import to upload a PySpark script directly into this Snap and edit it in this window.
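
For reference, here is a minimal stand-alone PySpark script of the kind you might paste into the editor. The bucket and paths are hypothetical and must point to locations your eXtreme Execute account can access:

    from pyspark.sql import SparkSession

    # Hypothetical S3 locations; replace with paths your account is configured to access.
    spark = SparkSession.builder.appName("PySparkBasicExample").getOrCreate()
    df = spark.read.json("s3://mybucket/input/records.json")
    df.groupBy("status").count().write.mode("overwrite").csv("s3://mybucket/output/status_counts")
    spark.stop()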

Download

  • File: PySpark_Basic_UseCase.slp (modified Feb 07, 2019 by John Brinckwirth)

Snap Pack History


4.27 (main12833)

  • No updates made.

4.26 (main11181)

  • No updates made.

4.25 (main9554)

  • No updates made.

4.24 (main8556)

4.23 (main7430)

  • Accounts now support validation: click Validate in the account settings dialog to verify that your account is configured correctly.

4.22 (main6403)

  • No updates made.

4.21 Patch 421patches5928

  • Adds Hierarchical Data Format v5 (HDF5) support in AWS EMR. With this enhancement, you can read HDF5 files and parse them into JSON files for further data processing. See Enabling HDF5 Support for details.
  • Adds support for Python virtual environments to the PySpark Script Snap to enable reading HDF5 files in the S3 bucket. You can specify the path for the virtual environment's ZIP file in the Virtual environment path field.

4.21 Patch 421patches5851

  • Optimizes Spark engine execution on AWS EMR, requiring fewer compute resources.

4.21 (snapsmrc542)

  • No updates made.

4.20 (snapsmrc535)

  • Introduced a new account type, Azure Databricks Account. This enhancement makes account configuration mandatory for the PySpark Script and JAR Submit Snaps.
  • Enhanced the PySpark Script Snap to display the Pipeline Execution Statistics after a Pipeline with the Snap executes successfully.

4.19 (snapsmrc528)

  • No updates made.

4.18 (snapsmrc523)

  • No updates made.

4.17 Patch ALL7402

  • Pushed automatic rebuild of the latest version of each Snap Pack to SnapLogic UAT and Elastic servers.

4.17 (snapsmrc515)

  • No updates made. Automatic rebuild with a platform release.

4.16 (snapsmrc508)

  • New Snap Pack. Execute Java Spark and PySpark applications through the SnapLogic platform. Snaps in this Snap Pack are:
    • JAR Submit: Upload your existing Spark Java JAR programs as eXtreme Pipelines.
    • PySpark Script: Upload your existing PySpark scripts as eXtreme Pipelines.