PySpark Examples
PySpark script pipeline on Windows
Read CSV data with PySpark
The following example pipeline demonstrates how to execute a PySpark script on a Windows Groundplex to read and display CSV data. The script takes a CSV file path as an argument, reads the file using Spark DataFrame operations, and displays its contents. The pipeline is configured to run locally on the Windows Groundplex.
Configure the PySpark script Snap as follows:
Click the Edit PySpark script button and provide the following script. The script is configured to run with spark-submit using local master mode with all available cores.
This PySpark script reads and displays CSV data. It accepts a file path as a command-line argument and processes the file using Apache Spark's DataFrame operations.
from pyspark.sql import SparkSession
import sys
import os

spark = SparkSession.builder.appName("PySpark_ScriptArgs_File").getOrCreate()

# Check if file argument is provided
if len(sys.argv) > 1:
    file_path = sys.argv[1]
    if os.path.exists(file_path):
        # Example: Read CSV as DataFrame
        df = spark.read.csv(file_path, header=True, inferSchema=True)
        print("Data from file:")
        df.show()
    else:
        print(f"File does not exist: {file_path}")

spark.stop()
This script is executed with the argument C:/pyspark_scripts/input_data.csv. It reads and displays that specific CSV file using Spark's distributed processing capabilities, even though it runs locally with all available CPU cores (--master local[*]).
Validate the Snap. On validation, the Snap displays the following output in the preview:
Execute the Snap. On execution, the Snap runs the PySpark script, which creates a Spark session, accepts a CSV file path as a command-line argument, reads the CSV file into a DataFrame with header and schema inference, displays the data, and stops the Spark session.
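The following plain-Python sketch illustrates how the CSV path might reach the script as sys.argv[1] when it is launched through spark-submit with --master local[*]. The script location (read_csv.py) and both helper functions are hypothetical and exist only for illustration. The exact command the Snap builds may differ.

```python
# Hypothetical script location; only the CSV path comes from the example above.
SCRIPT_PATH = "C:/pyspark_scripts/read_csv.py"
CSV_PATH = "C:/pyspark_scripts/input_data.csv"


def build_spark_submit_command(script_path, script_args, master="local[*]"):
    """Return a spark-submit argument list for a local run (illustrative helper)."""
    return ["spark-submit", "--master", master, script_path, *script_args]


def script_argv(command):
    """Return what the script would see as sys.argv: its own path plus its arguments."""
    # Assumption for this sketch: the script is the first item ending in ".py".
    script_index = next(i for i, part in enumerate(command) if part.endswith(".py"))
    return command[script_index:]


command = build_spark_submit_command(SCRIPT_PATH, [CSV_PATH])
argv = script_argv(command)

print(" ".join(command))  # the full command line
print(argv[1])            # the value the script reads as sys.argv[1]
```

Everything placed after the script path is passed to the script unchanged, which is why the script checks len(sys.argv) > 1 before reading the file path.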
PySpark script pipeline on Linux
Word count using PySpark script
The following example pipeline demonstrates how to execute a PySpark script on a Linux Groundplex to count the frequency of each word in text files. The script uses the PySpark framework to distribute the computation across a Spark cluster: it reads the text files, splits them into words, and counts the occurrences of each word using map-reduce operations.
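The map-reduce steps described above can be sketched in plain Python without a Spark installation. This is only an illustration of what each stage produces on a single machine. The sample lines are made up, and the loop at the end stands in for Spark's reduceByKey, which performs the same per-key summation across partitions.

```python
from itertools import chain
from operator import add

# Made-up input standing in for the lines of a text file.
lines = ["to be or", "not to be"]

# flatMap step: split every line into words and flatten into one sequence.
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map step: pair each word with an initial count of 1.
pairs = [(word, 1) for word in words]

# reduceByKey step: combine the counts of identical words with add.
counts = {}
for word, n in pairs:
    counts[word] = add(counts.get(word, 0), n)

for word, count in counts.items():
    print("%s: %i" % (word, count))
```

In Spark, the order of the collected results is not guaranteed, so the printed word order may differ from this single-machine sketch.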
Configure the PySpark script Snap as follows:
Click the Edit PySpark script button and provide the following script. The script reads text files, splits lines into words using flatMap, maps each word to a tuple, reduces by key to count word frequencies, and collects the results to print word counts. The Spark application runs on Spark 3.5.6 with Hadoop 3 binaries.
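import sys
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    # The script expects two arguments: <master> <file>
    if len(sys.argv) < 3:
        print("Usage: wordcount <master> <file>", file=sys.stderr)
        sys.exit(-1)
    sc = SparkContext()
    # Read the input file (second argument) with a minimum of one partition
    lines = sc.textFile(sys.argv[2], 1)
    # Split lines into words, pair each word with 1, and sum the counts per word
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    # Collect the results on the driver and print each word with its count
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))
Validate the Snap. On validation, the Snap displays the following output.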