Configuring eXtremeplex on AWS EMR

Supported Versions

  • Spark 2.4.4
  • Amazon EMR 5.29

Prerequisites 

  • Familiarity with the SnapLogic and AWS platforms.
  • SnapLogic Org admin account.
  • AWS Account with S3 buckets.
  • Relevant IAM roles and policies. See Creating IAM Roles and Policies for details.
  • S3 log bucket for EMR logs.
  • S3 artifact bucket for SnapLogic. This location is used by SnapLogic to store artifacts needed to start a cluster and run Pipelines.
  • The S3 artifact and log buckets must be in the same region as the cluster. For information on the above IAM policies and how to customize them, read the related Amazon documentation.
  • SnapLogic connects to your AWS Account, so all the associated costs are borne by your organization.  

Configuring eXtremeplex on AWS EMR

You can build and execute Spark-mode SnapLogic eXtreme Pipelines using your AWS EMR account.

Steps

  1. Log in to the SnapLogic eXtreme platform and go to the Manager tab.

  2. Under Project Spaces in the left panel, browse to the project folder in which to create a pipeline. Alternatively, create a new project folder by clicking a given directory and selecting Create Project.


  3. In the project folder, go to the Accounts tab, click + and select eXtreme > AWS Account to create your AWS account in SnapLogic. Currently, there are two account types: AWS Account and AWS IAM Role Account. 

    Account-specific configuration is as shown below:
    • AWS Account
       
      The Access-key ID and the Secret key are the same as your AWS access keys.
    • AWS IAM Role Account for cross-account IAM Role support:

      The Role ARN is the Amazon Resource Name (ARN) for the role. You must create a cross-account IAM role in AWS before configuring this account, and you must create the external ID at the same time. The external ID enhances the role's security and helps prevent confused deputy attacks. See Creating a Cross Account IAM Role in AWS Console for more information.

      The Access-key ID and the Secret key are the same as your AWS access keys.

  4. Click Validate to verify that SnapLogic is able to connect with your AWS account using the given credentials. After the AWS account validates, click Apply to create the account.

  5. Go to the Snaplexes tab and click + to create an eXtremeplex in the selected project space. 

    • Snaplex type. Select eXtremeplex. The related fields display.
    • Name. Enter a unique name for the eXtremeplex.
    • Environment. Enter a unique alphanumeric name to identify the eXtremeplex node. This name cannot be the same as that of another Snaplex within the same project space.
    • Version. Select the version of eXtremeplex. Choose Default for the latest version or an earlier version for compatibility with your pipelines.
    • Account Type. Select the cloud provider account to use for this eXtremeplex.

      Configuring Accounts

      After you select the Account Type, the Account tab is displayed.


      1. In the Account tab, select the account you created in Step 3 above, or click Add Account to add one.
      2. When you finish configuring the account, click Update.
    • Region. Select the AWS region where you want to initiate the EMR cluster.
    • Network. Enter the ID of the Amazon Virtual Private Cloud (VPC) or subnet in which to bring up the EMR cluster. The ID should begin with vpc- or subnet-.

      Network is mandatory for certain instance types. See the list of instance types that must be launched in a VPC. 

    • Instance Type. Select the Amazon Elastic Compute Cloud (EC2) Instance Type. We recommend using the generic instance types like m4 and m5, avoiding smaller instances like t1.micro or t2.nano.

      Amazon EBS (Elastic Block Store) optimization is not supported for i2.8xlarge instance type.

    • Market. Select one of the following three options:

      • On Demand. For a fixed per hour price mechanism.

      • Spot Instance. For dynamic pricing, based on supply and demand (Amazon uses its spare EC2 capacity for Spot Instances). If you select Spot Instance, enter a bid as a percentage of Amazon's On Demand price. For example, entering 50 means that you bid up to 50% of Amazon's On Demand price. Your cluster launches as soon as your bid exceeds Amazon's current Spot price.

      • Hybrid. For a combination of both On Demand and Spot Instance options to provide the best solution in terms of cost, performance, and reliability. 

        • Amazon can terminate active Spot Instances after a 2-minute warning if the real-time price rises above your bid. Hence, we recommend using Hybrid or Spot Instances only for SnapLogic Pipelines that do not require a Service Level Agreement.
        • To view the actual costs incurred for AWS EMR clusters started and managed by SnapLogic eXtreme, refer to the billing section of the AWS Console. SnapLogic assumes no responsibility regarding the incurred costs.
    • Volume size (GB). Enter the storage capacity required to support your Pipeline runs on the cluster. The default is 500 GB, which is also the minimum volume size.
    • No. of Nodes. Enter the total number of nodes to create in the EMR cluster, including one mandatory master node. The master node is an EC2 instance that coordinates and manages the work of the core nodes. A functioning cluster requires at least one master and one core node.

    • S3 Log Bucket. Enter the bucket name and path to store the EMR cluster logs in this format: s3://<bucket_name>/<path_to_the_log_folder>.

    • S3 Artifact Bucket. Enter the S3 bucket name to store the SnapLogic artifacts. You can include region and subdirectories as well in the path. For example, s3://<bucket-name>/<sdir1>/<sdir2>

    • EC2 Instance Profile. Enter the IAM (Identity and Access Management) role for the Amazon EC2 instance profile. We recommend using EMR_EC2_LimitedRole. You will have to associate this with a limited IAM Role policy if you are using AWS CLI to manage EC2 instance profiles.
    • EMR Role. Enter the IAM role for the EMR cluster. We recommend using EMR_DefaultRole.

    • Cluster Tag. Enter a key-value pair for your cluster tag, which will display in the AWS console. Cluster tags help categorize your resources. 
      Tags that you add to an existing cluster in the Create Snaplex dialog do not take effect until you restart the cluster.

      • You can add, edit, or delete tags for running clusters only in the AWS console (and not in the Create Snaplex dialog in SnapLogic).
      • Tag changes made in the AWS console are not reflected in SnapLogic. Further, when a cluster restarts, any tags configured in Create Snaplex override the tags specified in the AWS console. To avoid issues, it is best to specify the cluster tags in the Create Snaplex dialog.
      • The total number of tags is restricted to 50 per cluster.
      • The tag Key (up to 128 characters) must be unique, while the tag Value (up to 256 characters) can be null or empty.
      • The tag Key and Value are case sensitive.

      The above cluster tag conditions are set by AWS policies, not by SnapLogic. The complete list of restrictions is available in the AWS documentation.
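
The constraints stated for the fields in step 5 can be checked before you fill in the dialog. The following is a minimal sketch in Python and is not part of SnapLogic. The function name, the dictionary keys, and the S3 pattern are assumptions made for illustration. The limits themselves (vpc- or subnet- prefix, 500 GB minimum volume, at least two nodes, 50 tags, 128-character keys, 256-character values) come from the field descriptions above.

```python
import re

MIN_VOLUME_GB = 500       # minimum (and default) volume size
MIN_NODES = 2             # one master node plus at least one core node
MAX_TAGS = 50             # tag limit per cluster
MAX_TAG_KEY_LEN = 128
MAX_TAG_VALUE_LEN = 256

# Assumed pattern: "s3://", a bucket name, then an optional path.
S3_URI = re.compile(r"^s3://[^/\s]+(/\S*)?$")


def validate_settings(settings):
    """Return a list of problems found in a dict of eXtremeplex settings.

    The dict keys are illustrative names, not SnapLogic identifiers.
    """
    problems = []

    # Network is only mandatory for some instance types, so check it if given.
    network = settings.get("network", "")
    if network and not network.startswith(("vpc-", "subnet-")):
        problems.append("network must begin with vpc- or subnet-")

    if settings.get("volume_size_gb", MIN_VOLUME_GB) < MIN_VOLUME_GB:
        problems.append("volume size must be at least 500 GB")

    if settings.get("nodes", MIN_NODES) < MIN_NODES:
        problems.append("cluster needs at least one master and one core node")

    for key in ("s3_log_bucket", "s3_artifact_bucket"):
        if not S3_URI.match(settings.get(key, "")):
            problems.append(f"{key} must look like s3://<bucket_name>/<path>")

    # Dict keys are unique and case sensitive by construction.
    tags = settings.get("tags", {})
    if len(tags) > MAX_TAGS:
        problems.append("at most 50 tags are allowed per cluster")
    for tag_key, tag_value in tags.items():
        if len(tag_key) > MAX_TAG_KEY_LEN:
            problems.append(f"tag key {tag_key[:20]!r}... exceeds 128 characters")
        if len(tag_value or "") > MAX_TAG_VALUE_LEN:
            problems.append(f"tag value for {tag_key!r} exceeds 256 characters")
    return problems
```

An empty list means that none of the documented limits is violated. It does not guarantee that AWS accepts the values, because the check covers only the restrictions listed in this article.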

  6. In the Advanced tab, enter the auto terminate and auto scaling details.  


    • Auto Terminate. Enter the minutes of inactivity after which the EMR cluster should automatically terminate.
    • Auto Scaling. Enable the check box if you want the EMR cluster to automatically scale, based on dynamically changing needs. 
    • Max Cluster Size. Enter the maximum number of nodes to which the cluster can scale.  
    • Auto Scaling Role. Enter the AWS IAM role for auto scaling. We recommend EMR_AutoScaling_DefaultRole.
    • Scale Out. Provide the information for automatic scale out of the cluster, when the demand for nodes goes up.
      • Nodes. Enter the number of nodes to add each time the auto-scale triggers.
      • Available Memory. Enter the threshold for the available YARN memory, as a percentage, below which the specified number of nodes is automatically added to the cluster. For example, with Nodes set to 1, a value of 15 means that one node is added when the available memory in the cluster drops below 15%.
      • Cooldown. Enter the idle time between two scale-out events in seconds.
    • Scale In. Provide the information for automatic scale in of the cluster, when the demand for nodes goes down.
      • Nodes. Enter the number of nodes to remove each time the auto-scale triggers.
      • Available Memory. Enter the threshold for the available YARN memory, as a percentage, above which the specified number of nodes automatically shuts down. For example, with Nodes set to 1, a value of 85 means that one node shuts down when the available memory rises above 85%.
      • Cooldown. Enter the idle time between two scale-in events in seconds.
    • Hybrid Instances. If you select Hybrid for Market type on the Settings tab, then you can also select Hybrid Instances on the Advanced tab for multi-instance types (Master, Core, and Task Nodes) on the cluster.  

      • Master Node Instance Type. Runs infrastructure services to manage the cluster.
      • Core Node Instance Type. Hosts HDFS and runs the application masters.
      • Task Node Instance Type. Adds more compute power to the cluster.

        Hybrid Instance Types

        A benefit of the Hybrid Instance cluster is that it significantly reduces cluster startup time, because the Core Node Instance group always contains On-Demand Instances. The cluster can therefore spin up quickly even if the Task Node Instance group has not started yet. Task Nodes are optional in the cluster and are always defined as Spot Instances, so they start only when Spot capacity is available and the maximum bid price exceeds the market price. If those conditions are not met, the Task Node Instance groups do not start. However, once the Core Node Instance groups have started, Pipelines can run immediately. In contrast, a Spot-only cluster cannot start if the specified Spot Instances are not available, because all of its Core Node instances also use Spot Instances.

        • In terms of cost, the Hybrid Instance cluster is typically more expensive than a Spot cluster and less expensive than an On-Demand cluster.
        • In terms of performance, the Hybrid Instance cluster is typically better than Spot or On-Demand Instance clusters if the Task Nodes are running.
        • In terms of reliability, the Hybrid Instance cluster is far better than a Spot-only cluster and as reliable as an On-Demand cluster.
        • In terms of startup time, the Hybrid Instance cluster starts much faster than a Spot cluster and at almost the same speed as an On-Demand cluster.

        If you do not select the Hybrid option on the Advanced tab, then the instance type specified in the Settings tab is used for all three node groups (Master, Core, and Task).
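
The way the Scale Out, Scale In, Cooldown, and Max Cluster Size settings in step 6 interact can be illustrated with a short Python sketch. This is only a model of the settings as described above, not the way Amazon EMR evaluates its scaling rules. The names ScaleRule and next_cluster_size, and the lower bound of two nodes (one master plus one core node), are assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class ScaleRule:
    nodes: int               # nodes to add or remove each time the rule triggers
    memory_threshold: float  # available YARN memory, in percent
    cooldown_s: int          # seconds required between two events of this kind


def next_cluster_size(current_nodes, available_memory_pct, scale_out, scale_in,
                      max_cluster_size, since_last_out_s, since_last_in_s,
                      min_cluster_size=2):
    """Return the node count after one evaluation of the scaling settings."""
    # Scale out: available memory fell below the threshold and cooldown elapsed.
    if (available_memory_pct < scale_out.memory_threshold
            and since_last_out_s >= scale_out.cooldown_s):
        return min(current_nodes + scale_out.nodes, max_cluster_size)
    # Scale in: available memory rose above the threshold and cooldown elapsed.
    if (available_memory_pct > scale_in.memory_threshold
            and since_last_in_s >= scale_in.cooldown_s):
        return max(current_nodes - scale_in.nodes, min_cluster_size)
    return current_nodes


# Example with the thresholds used in the text: 15% and 85%, one node per event.
scale_out = ScaleRule(nodes=1, memory_threshold=15, cooldown_s=300)
scale_in = ScaleRule(nodes=1, memory_threshold=85, cooldown_s=300)
print(next_cluster_size(4, 10, scale_out, scale_in, 6, 600, 600))
```

With four nodes and 10% available memory, the sketch adds one node. At the maximum cluster size, or while the cooldown has not elapsed, the node count stays unchanged.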

  7. Click Create to complete configuring your eXtremeplex. 
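
The Spot Instance bid described under Market in step 5 is simple arithmetic: the percentage you enter is applied to the On Demand price. The sketch below shows that calculation in Python. The function names and the hourly prices are hypothetical and serve only to illustrate the rule; look up actual prices in the AWS Console.

```python
def spot_bid(on_demand_price, bid_percent):
    """Maximum hourly price implied by a bid given as a percentage."""
    return on_demand_price * bid_percent / 100


def can_launch(on_demand_price, bid_percent, current_spot_price):
    """True when the bid exceeds the current Spot price, as described in step 5."""
    return spot_bid(on_demand_price, bid_percent) > current_spot_price


# Hypothetical prices in USD per hour, for illustration only.
on_demand = 0.20
print(spot_bid(on_demand, 50))          # bid of 50% of the On Demand price
print(can_launch(on_demand, 50, 0.08))  # Spot price below the bid
print(can_launch(on_demand, 50, 0.12))  # Spot price above the bid
```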

Security Groups

Default cluster security groups are inherited from the configuration at the Virtual Private Cloud (VPC) level. However, you can modify these default security groups per your security needs. Amazon creates a cluster security group if you do not have one for EMR. See EMR-managed Security Groups for details.

Known Issues

Issue: The eXtreme AWS EMR Pipeline string length should not exceed 256 characters. For example, the following string has 264 characters, so the associated Pipeline execution fails:

SnapLogicPipe[5a74e25abf9b4c001e792c1f_3ecfa1d2-6b2a-437b-9a81-249bf2982a2e][/tahoeqa/4.20 Fall 2019/miscellaneous/Copy of emr - pipeline reads from S3, parse CSV, format to CSV and Write to S3 - length of pipeline's label is 125 characters][johndoe@snaplogic.com]

Workaround: Ensure that the Pipeline string length is less than 256 characters.

The SnapLogic Pipeline string comprises the following components:

  • Pipeline RUUID (usually 61 characters)
  • Pipeline path that includes the org name, the project, and the subfolder, if any
  • Pipeline name
  • User ID

Issue: Reading and writing Avro format files currently fail for AWS EMR.

Workaround: None

Issue: SnapLogic does not support the instance types m5a.xlarge, m5a.2xlarge, m5a.4xlarge, m5a.12xlarge, and m5a.24xlarge in the AWS EMR-5.14.0 release.

Workaround:

  • Instead of m5a.xlarge, use m4.xlarge.
  • Instead of m5a.2xlarge, use m4.2xlarge.
  • Instead of m5a.4xlarge, use m4.4xlarge.
  • Instead of m5a.12xlarge, use m4.10xlarge.
  • Instead of m5a.24xlarge, use m4.16xlarge.
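
The Pipeline string length limit from the first known issue can be estimated before you name or move a Pipeline. The following Python sketch assembles the string from its components and checks its length. The layout (SnapLogicPipe, then the RUUID, the path with the Pipeline name, and the user ID, each in square brackets) is inferred from the single example above and is an assumption, as are the function names. The check uses the stricter reading of the limit (less than 256 characters).

```python
MAX_PIPELINE_STRING = 256


def pipeline_string(ruuid, path, name, user_id):
    """Build the Pipeline string. Layout assumed from the documented example."""
    return f"SnapLogicPipe[{ruuid}][{path}/{name}][{user_id}]"


def fits_limit(ruuid, path, name, user_id):
    """True when the assembled string is shorter than 256 characters."""
    return len(pipeline_string(ruuid, path, name, user_id)) < MAX_PIPELINE_STRING


# RUUID taken from the example above; the other values are hypothetical.
ruuid = "5a74e25abf9b4c001e792c1f_3ecfa1d2-6b2a-437b-9a81-249bf2982a2e"
print(fits_limit(ruuid, "/myorg/myproject", "emr csv copy", "user@example.com"))
```

Because the RUUID and user ID are fixed, shortening the Pipeline name or moving the Pipeline to a shallower folder are the practical ways to stay under the limit.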