Pipeline Execution Flow

Introduction

Pipeline execution flow is the systematic, step-by-step process that moves data from initiation to completion. This orchestrated flow ensures the efficient processing of data across diverse business systems. Within this process, the control plane manages integration tasks by strategically organizing components across the Snaplex infrastructure, while the Snaplex nodes execute the assigned integration tasks under its guidance. A pipeline can be executed by a Triggered, Ultra, or Scheduled Task. Regardless of the task type, the pipeline goes through distinct states during its lifecycle.
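The lifecycle described above can be sketched as a small state machine. This is an illustrative model only: the state names below follow the stages in this document (NoUpdate, Prepared, Started, Completed/Failed), and the transition rules are assumptions for the sketch, not SnapLogic's internal state machine.

```python
from enum import Enum

class PipelineState(Enum):
    NO_UPDATE = "NoUpdate"    # initialization
    PREPARED = "Prepared"     # prepare stage
    STARTED = "Started"       # execution stage
    COMPLETED = "Completed"   # successful completion
    FAILED = "Failed"         # terminal error state

# Assumed legal forward transitions between states (illustrative).
TRANSITIONS = {
    PipelineState.NO_UPDATE: {PipelineState.PREPARED, PipelineState.FAILED},
    PipelineState.PREPARED: {PipelineState.STARTED, PipelineState.FAILED},
    PipelineState.STARTED: {PipelineState.COMPLETED, PipelineState.FAILED},
}

def can_transition(current: PipelineState, nxt: PipelineState) -> bool:
    """Return True if the pipeline may move from `current` to `nxt`."""
    return nxt in TRANSITIONS.get(current, set())
```

A monitoring client could use such a model to reject impossible status updates, for example a pipeline reporting Prepared after it has already Started.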

Architecture Overview

The pipeline execution architecture involves a user interface for pipeline design, a control plane that orchestrates tasks and manages the overall workflow, and a distributed Snaplex environment whose nodes execute integration tasks in parallel. The metadata store in the control plane holds essential pipeline information, and the Snap store provides a catalog of pre-built components. Throughout the process, collaboration between these components ensures efficient task execution, error handling, and communication for seamless data integration.

[Image: SnapLogic execution architecture (Snaplogic execution.png)]

Steps in the pipeline execution flow

 

[Image: Pipeline execution flow (execution flow.png)]

Initialization

The initialization phase involves the preparatory steps required to set up and begin the pipeline execution. Also called the NoUpdate state, it lays the groundwork for the subsequent stages in the pipeline execution flow, allowing the system to establish the context and allocate resources for the upcoming execution.

This stage involves the following activities:

  • Request processing: The control plane receives the pipeline execution request, which can originate from various sources:

    • Scheduled task

    • External triggers like API calls and event notifications

    • Manual initiation by a user from the Designer or Manager

  • Leader node decision: In this stage, the control plane assesses the status and resource capacity of the Snaplex nodes, prioritizes nodes based on workload, processing power, and memory, selects the most appropriate node, and prepares it for pipeline execution.

This state is only relevant if the pipeline is executed on the leader node.
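The leader node decision described above can be sketched as a simple scoring function. This is a hypothetical illustration: the node fields ("status", "active_pipelines", "free_mem_mb") and the ranking criteria are assumptions for the sketch; SnapLogic's actual selection logic is internal.

```python
# Hypothetical node-selection sketch: pick the healthy node with the fewest
# running pipelines, breaking ties by preferring more free memory.
# Field names are illustrative assumptions, not the SnapLogic node schema.
def pick_node(nodes: list[dict]) -> dict:
    healthy = [n for n in nodes if n.get("status") == "running"]
    if not healthy:
        raise RuntimeError("no healthy Snaplex node available")
    return min(healthy, key=lambda n: (n["active_pipelines"], -n["free_mem_mb"]))
```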

Prepare

During the Prepare stage, several important tasks lay the groundwork for the execution of the pipeline. This stage involves communication between the control plane and the data plane (Snaplex), ensures that all necessary components are in place for pipeline execution, and identifies and addresses potential configuration issues. The Prepare stage involves the following key activities:

  • Retrieving metadata and dependencies: The pipeline's metadata, including its structure, configuration, and dependencies (like referenced Snaps and shared resources), is fetched from the control plane. This metadata provides a comprehensive overview of the pipeline's structure, components, and connections.

  • Pre-execution checks: The system identifies any missing or invalid mandatory attributes in the Snap configuration and retrieves the account credentials like passwords or API keys of the Snaps in the pipeline.

  • Endpoint interaction: The system prepares and verifies connections to any target endpoints using the specified protocols. This helps ensure smooth data flow during the execution stage.

  • Initial Resource Allocation: Depending on the pipeline's needs and expected workload, the control plane allocates initial resources on the designated Snaplex node. This ensures resources like sufficient memory and processing power are available for the pipeline execution.

  • Pipeline distribution: The pipeline code and components are distributed to the designated Snaplex nodes. This involves transferring the necessary files and configuration data to the node for execution.
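The pre-execution checks above can be sketched as a validation pass over a Snap's settings. This is a minimal illustration: the flat settings dictionary and the notion of "mandatory attributes" as a list of keys are assumptions for the sketch, not the actual Snap configuration schema.

```python
# Hypothetical pre-execution check: flag any mandatory Snap attribute
# that is missing or left blank. The settings shape is an assumption.
def missing_mandatory(settings: dict, mandatory: list[str]) -> list[str]:
    """Return the mandatory attribute names that are absent or empty."""
    return [key for key in mandatory if settings.get(key) in (None, "")]
```

A pipeline would fail the Prepare stage (rather than partway through Execution) if this check returned a non-empty list for any Snap.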

Execution

During the execution stage, the actual processing of integration tasks takes place, and the orchestrated flow of data, as defined in the pipeline, is carried out in real-time. Throughout this process, the Snaplex nodes communicate with the control plane to report the status of task execution, provide updates, and receive further instructions.

  • Snap execution: The individual Snaps are activated and begin processing data. They perform the designated task according to the directives provided by the control plane.

  • Endpoint interactions: The pipeline establishes connections to any required external endpoints (for example, databases, applications, and cloud services) using the specified protocols. This enables the pipeline to process data from these systems.

  • Data flow orchestration: The pipeline coordinates the flow of data between Snaps and endpoints, ensuring that data moves through the pipeline in the correct sequence and format.

  • Resource management: The Snaplex node dynamically manages resources (memory, CPU, network) during execution to ensure optimal performance and prevent bottlenecks. The pipeline collects execution metrics, such as processing time, data volume, and error rates.

  • Pipeline execution statistics: Pipeline execution statistics provide information on the pipeline's status as it executes. The statistics are updated periodically so that you can monitor the pipeline's progress. For more details, see Pipeline Execution Statistics.
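Because the statistics are updated periodically, a monitoring client typically polls until the pipeline reaches a terminal state. The sketch below assumes a caller-supplied `fetch_stats` function (for example, one that queries a runtime-status endpoint) returning a dictionary with a "state" key; both the function and the return shape are illustrative assumptions.

```python
import time

# Illustrative polling loop for periodic pipeline-statistics updates.
# `fetch_stats` and the stats dictionary shape are assumptions for the sketch.
def wait_for_completion(fetch_stats, poll_secs=5.0, timeout=600.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll until the pipeline reaches a terminal state or the timeout expires."""
    deadline = clock() + timeout
    while True:
        stats = fetch_stats()
        if stats.get("state") in ("Completed", "Failed", "Stopped"):
            return stats
        if clock() >= deadline:
            raise TimeoutError("pipeline did not finish within the timeout")
        sleep(poll_secs)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays.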

Completion

After the pipeline execution is complete, resources are released and the pipeline sends a comprehensive set of execution metrics to the control plane, including:

  • Execution start and end timestamps: These timestamps establish the duration of the pipeline execution and help assess its overall performance.

  • Data volume processed: This refers to the amount of data that is ingested, transformed, and processed by the pipeline during its execution.

  • Number of records processed: This refers to the count of individual data records that are ingested, transformed, and processed by the pipeline during its execution.

  • Success or failure status: A successful status indicates that the pipeline completed its data processing tasks without encountering errors or issues, while a failure status indicates that the pipeline encountered errors or exceptions during execution.

  • Any errors or warnings encountered: This refers to the issues or notifications that arise during the execution of the pipeline. These errors and warnings can include data validation failures, connectivity issues with endpoints, resource constraints, or any other issues that might impact the successful processing of the data.

  • Resource usage statistics: This includes the following statistics:

    • CPU utilization: The percentage of available CPU resources utilized by the pipeline during execution.

    • Memory usage: The amount of memory (RAM) consumed by the pipeline to store data, code, and intermediate results.

    • Network usage: The amount of network traffic generated by the pipeline, typically in bytes or megabytes, for data transfer and communication with external systems.
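The completion-time metrics listed above can be assembled into a single report. The sketch below derives duration from the start and end timestamps and status from the error list; the report keys are assumptions for illustration, not SnapLogic's actual metrics payload.

```python
from datetime import datetime, timezone

# Illustrative assembly of a completion report from raw execution figures.
# Key names are assumptions for the sketch.
def execution_report(start: datetime, end: datetime,
                     documents: int, errors: list[str]) -> dict:
    return {
        "start": start.isoformat(),
        "end": end.isoformat(),
        "duration_secs": (end - start).total_seconds(),
        "documents_processed": documents,
        "status": "Failed" if errors else "Completed",
        "errors": errors,
    }
```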

Pipeline workflow: How control plane and nodes work together

Control plane

The control plane acts as the central management layer overseeing pipeline execution. It plays a pivotal role in managing the lifecycle of data integration tasks, from initiation to completion, managing the execution of integration workflows and ensuring efficient collaboration with Snaplex nodes.

Snaplex nodes

Snaplex nodes are the distributed execution nodes in the SnapLogic architecture. They process the integration tasks assigned by the control plane. Each Snaplex node functions as an independent computing unit capable of executing tasks concurrently, contributing to the scalability and performance of the overall integration process.

 

Initialization

Control plane:

  • Receives the pipeline execution request from the external trigger.

  • Assigns specific tasks within a pipeline to Snaplex nodes based on their availability and capabilities.

  • Initiates the process by identifying the relevant pipeline based on the execution request.

Snaplex nodes:

  • Receive the pipeline execution request from the control plane.

  • Allocate resources like memory and processing power to perform the data integration tasks.

  • Set up their execution environment based on the pipeline's requirements.

Prepare

Control plane:

  • Accesses the metadata store to retrieve the required information about the designated pipeline.

  • Performs validation and authorization checks to ensure that the pipeline execution adheres to defined security and access controls.

Snaplex nodes:

  • Establish connections to relevant data sources and targets, including databases, APIs, and file systems involved in the data integration process.

  • Perform pre-execution checks to validate that the environment is ready and any prerequisites are met.

  • Communicate their readiness and initialization status back to the control plane.

Execution

Control plane:

  • Loads the pipeline metadata into its runtime environment.

  • Monitors the progress of pipeline execution.

  • Captures relevant metrics for auditing, troubleshooting, and performance analysis.

  • Based on real-time communication from the Snaplex nodes, manages errors through predefined error-handling logic or by stopping the entire pipeline if needed.

Snaplex nodes:

  • Execute the specific tasks assigned to them.

  • Execute tasks in parallel to optimize performance and handle large volumes of data efficiently.

  • Handle errors locally and try to resolve them before communicating them to the control plane.

Completion

Control plane:

  • Logs error messages and timestamps.

  • Consolidates the output results of each Snap and the transformed data generated by the Snaplex nodes.

  • Performs post-execution cleanup activities like releasing resources and closing connections.

  • Provides a final status report with the status of the pipeline execution and relevant performance metrics.

Snaplex nodes:

  • Release allocated compute resources, close open connections, and clean up temporary files.

  • Send comprehensive metrics to the control plane for analysis and optimization.

  • Archive detailed execution logs and troubleshooting information for future reference.