PySpark Snap Setup for Linux


Overview

This document provides comprehensive instructions for setting up PySpark on a Linux system, including prerequisites, installation steps, examples, and troubleshooting.

Prerequisites

  • Python and Java: You need both Python and Java installed on your system.

    • For Spark 3.x, use Java 11.

Pre-installation

Download Spark: Visit the Apache Spark archive and select the appropriate package (with or without Hadoop, depending on your system's configuration).

Install Spark

  1. Extract and set up Spark.

    $ tar -xvzf spark-3.5.6-bin-hadoop3.tgz
    $ mv spark-3.5.6-bin-hadoop3 /home/gaian/software/spark
  2. Configure Environment variables.

    1. Edit your .bashrc.

      $ nano ~/.bashrc
    2. Add the following at the end.

      export SPARK_HOME=/home/gaian/software/spark
      export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
  3. Reload.

    $ source ~/.bashrc
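To sanity-check the configuration, you can run a small Python sketch that verifies the variables added to .bashrc are visible to a process. The helper below is illustrative only (it is not part of Spark or SnapLogic):

```python
import os

def missing_spark_env(env):
    """Report which of the settings added to ~/.bashrc are missing."""
    problems = []
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif spark_home + "/bin" not in env.get("PATH", "").split(":"):
        problems.append("$SPARK_HOME/bin is not on PATH")
    return problems

# Check the current shell environment
print(missing_spark_env(os.environ) or "Spark environment looks good")
```

If this reports a missing setting, re-check the export lines in ~/.bashrc and reload it with source ~/.bashrc.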

Run commands

To interact with Spark, you can open either a Spark shell or a PySpark shell in your terminal and execute commands.

  • $ spark-shell: This command opens a Spark shell directly in your terminal, allowing you to run Spark scripts.

  • $ pyspark: This command functions similarly to spark-shell, opening a PySpark shell for Python-based Spark interactions.

Examples: Basic commands

Here are some basic commands you can try in the PySpark shell:

# Simple data
data = [
    ("Alice", 25, "Engineer"),
    ("Bob", 30, "Data Scientist"),
    ("Charlie", 35, "Manager"),
    ("Diana", 28, "Analyst")
]
columns = ["name", "age", "role"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Show the data
df.show()

# Show a specific number of rows
df.show(2)  # Show only 2 rows

With these steps, your PySpark installation is complete.

Post installation steps

To run the PySpark Snap from the Control Plane on your Groundplex, follow the instructions below:

File permissions

The Groundplex process might run as a different user (check with ps -ef | grep snaplogic), so make sure that the user can read your Spark install and script:

$ sudo chmod +x /home/gaian/software/spark/bin/spark-submit
$ sudo chmod o+x /home/xyz
$ sudo chmod o+x /home/xyz/software
$ sudo chmod o+x /home/xyz/software/spark
$ sudo chmod o+x /home/xyz/software/spark/bin
$ sudo chmod +r /path/to/your/pyspark_script.py
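Every directory on the path to the script needs execute (traverse) permission for the Groundplex user. As an illustration, this sketch lists the ancestors that need o+x for a given path (the helper is hypothetical, not part of Spark or SnapLogic; top-level directories such as /home usually already have o+x):

```python
from pathlib import PurePosixPath

def dirs_needing_traverse(path):
    """List every ancestor directory that must be traversable (o+x)
    for another user to reach the given file."""
    parents = PurePosixPath(path).parents
    # parents run from the immediate directory up to "/"; drop the root
    return [str(p) for p in reversed(parents) if str(p) != "/"]

# Prints one chmod command per ancestor directory
for d in dirs_needing_traverse("/home/xyz/software/spark/bin/spark-submit"):
    print(f"sudo chmod o+x {d}")
```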

 

Run the Scripts

Spark Home: This specifies the path to your Spark folder, where your bin folder is located.

Default Script: You get the default script when you drag and drop the PySpark Snap into the designer and click on Edit PySpark Script to edit the script.

The default script targets Python 2. Since we are using Python 3, you need to modify the script as shown below.

  • Old script

    <The old script content would be here if provided>

  • Modified Script

    Python

import sys
from operator import add

from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: wordcount <master> <file>", file=sys.stderr)
        sys.exit(-1)
    sc = SparkContext(sys.argv[1], "WordCount")
    lines = sc.textFile(sys.argv[2], 1)
    counts = (
        lines.flatMap(lambda x: x.split(" "))
             .map(lambda x: (x, 1))
             .reduceByKey(add)
    )
    output = counts.collect()
    for (word, count) in output:
        print(f"{word}: {count}")

Since the default script expects arguments, specifically an input text file for word counting, you need to specify the path of that file in the Script args field.
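For intuition, the flatMap/map/reduceByKey pipeline in the word-count script computes the same result as this plain-Python sketch (illustrative only, no Spark required; empty tokens from repeated spaces are dropped here for readability):

```python
from collections import Counter

def word_count(text):
    # flatMap: split every line into words
    words = [w for line in text.splitlines() for w in line.split(" ") if w]
    # map + reduceByKey(add): count occurrences per word
    return dict(Counter(words))

print(word_count("Hello my name is john\nHello again"))
```

In Spark, the same counting happens in parallel across partitions, with reduceByKey merging per-word counts from each partition.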

Terminal Command: 

$ /home/xyz/software/spark/bin/spark-submit /home/xyz/spark_scripts/word_count.py local[*] /home/xyz/spark_scripts/sample.txt

Output

Hello: 4
my: 6
name: 6
is: 6
john: 3
doe: 3
bye: 3
i: 1
am: 1
working: 1
in: 2
xyz: 1
Solutions: 1
MTS-I: 1
role.: 1

  • The PySpark Snap includes the default script, which you can access by clicking the Edit PySpark Script button. To execute the above command, you must manually create the file word_count.py, containing that code, in your directory.

  • sample.txt is the input file; the program counts the number of words it contains.

 

Custom Script

To execute a custom script from your file system, specify its path in the Spark submit args field.

Terminal Command

$ /home/xyz/software/spark/bin/spark-submit /home/xyz/spark_scripts/sample_script.py

Output

Spark Version: 3.5.6
Python Version: 3.12
Sample Data:
+---+-------+
| id|message|
+---+-------+
|  1|  Hello|
|  2|PySpark|
|  3| Ubuntu|
+---+-------+

 

The following sample script can be saved as a .py file and executed from the terminal, or supplied as a custom script to the PySpark Snap.

from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder \
    .appName("Ubuntu Virtual Env Test") \
    .getOrCreate()

print(f"Spark Version: {spark.version}")
print(f"Python Version: {spark.sparkContext.pythonVer}")

# Create sample data
data = [(1, "Hello"), (2, "PySpark"), (3, "Ubuntu")]
df = spark.createDataFrame(data, ["id", "message"])

print("\nSample Data:")
df.show()

# Stop session
spark.stop()
print("Test completed successfully!")

 

Terminal Command (save the above file as sample_script.py)

$ /home/xyz/software/spark/bin/spark-submit /home/xyz/spark_scripts/sample_script.py

 

Output

Spark Version: 3.5.6
Python Version: 3.12
Sample Data:
+---+-------+
| id|message|
+---+-------+
|  1|  Hello|
|  2|PySpark|
|  3| Ubuntu|
+---+-------+

 

Start Spark Master and Worker (Optional)

Start Spark Master

$ $SPARK_HOME/sbin/start-master.sh

Note the master URL from the console output or the web UI (default: spark://<hostname>:7077).

Start Spark Worker (connect it to master)

$ $SPARK_HOME/sbin/start-worker.sh spark://<hostname>:7077

To stop and restart Spark

Stop any old processes and start master and worker fresh:

# Stop any running Spark processes (safe to run)
$ $SPARK_HOME/sbin/stop-all.sh

# To stop just the worker (when you don't want to stop everything)
$ $SPARK_HOME/sbin/stop-worker.sh

# (Optional) Clear stale work dirs that can cause weird state
$ rm -rf $SPARK_HOME/work/* /tmp/spark-* 2>/dev/null || true

# Start master
$ $SPARK_HOME/sbin/start-master.sh

# Start worker and connect it to the master (use the precise hostname)
$ $SPARK_HOME/sbin/start-worker.sh spark://<host-name>:7077

 

Confirm the master started and get its URL

After starting, run:

# Check processes
ps -ef | grep -E 'org.apache.spark.deploy.master.Master|start-master' | grep -v grep

# Check ports
ss -ltnp | grep ':7077' || netstat -plnt | grep 7077

# Open master web UI
curl -sS http://<ip address of the machine where the spark is installed>:8080 | sed -n '1,5p'

Expected: the master process is present, port 7077 is listening, and the web UI returns HTML (or visit http://<ip address of the machine where the spark is installed>:8080 in your browser).

 

Example

PySpark Examples | Word count using PySpark script