Overview

SnapLogic Data Science, in combination with the Intelligent Integration Platform, is a self-service platform for end-to-end ML that offers a low-code approach to data acquisition, data exploration and preparation, model training and validation, and model deployment. The Data Science solution includes the following Snap Packs: ML Data Preparation, ML Core, ML Natural Language Processing, and ML Analytics.

Check out the Data Science Use Cases to understand how you can leverage the above machine learning Snaps.

Data Science Project Life Cycle

The main objective of SnapLogic Data Science is to analyze and interpret data to help you gain meaningful business insights. The following are the key steps in a typical data science project lifecycle:

  1. Data Acquisition: Collate and integrate data from multiple sources, including files, databases, on-premises applications, cloud applications, APIs, and IoT devices.
  2. Data Exploration: Perform data profiling, analytics, and visualization to better understand the data. Then, define the business problems you want to solve with that data.
  3. Data Preparation: Perform data cleansing, scaling, and other statistical techniques to transform the data into datasets that are suitable for Machine Learning algorithms.
  4. Model Development: Develop, apply, and evaluate machine learning algorithms on the dataset. The result is an ML model that can make predictions.
  5. Model Deployment: Host the ML model as an API so that it can be easily consumed by external applications.
  6. Optimization: Keep track of the performance of the model, gather feedback, and iterate. Data distribution changes over time; monitoring and optimization are the key success factors.

Data Science Pipelines

A SnapLogic Pipeline is a set of Snaps that connect together to perform a specific task. In a data science project, we recommend the following five Pipeline types:

  1. Profiling Pipeline: Compute data statistics on the dataset. Data statistics are critical for understanding the dataset and its distribution, and for selecting the appropriate data preparation techniques in the next step.
  2. Data Preparation Pipeline: Apply data cleansing and other statistical transformations to prepare the dataset before applying ML algorithms.
  3. Cross Validation Pipeline: Perform k-fold cross validation using various ML algorithms with different sets of parameters to find an optimal algorithm for your dataset.
  4. Model Building Pipeline: Train the model using an optimal algorithm and parameters from the cross validation step.
  5. Model Hosting Pipeline: Deploy the model as a low-latency API using Ultra Tasks.
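
The cross validation and model building steps above can be sketched in plain Python. This is a minimal illustration of k-fold cross validation over several candidate algorithms, not SnapLogic's implementation: the two toy "algorithms" (`train_majority`, `train_threshold`), the helper names, and the sample data are all assumptions made for the example.

```python
from statistics import mean

def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k folds of near-equal size."""
    folds = [[] for _ in range(k)]
    for i in range(n_rows):
        folds[i % k].append(i)
    return folds

def cross_validate(train_fn, rows, labels, k=5):
    """Return the mean accuracy of train_fn across k held-out folds."""
    scores = []
    for held_out in k_fold_indices(len(rows), k):
        held = set(held_out)
        train_x = [r for i, r in enumerate(rows) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        predict = train_fn(train_x, train_y)
        correct = sum(predict(rows[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return mean(scores)

def train_majority(train_x, train_y):
    """Hypothetical baseline: always predict the most frequent label."""
    most_common = max(set(train_y), key=train_y.count)
    return lambda row: most_common

def train_threshold(train_x, train_y):
    """Hypothetical algorithm: cut the single feature midway between class means."""
    pos = [x[0] for x, y in zip(train_x, train_y) if y == 1]
    neg = [x[0] for x, y in zip(train_x, train_y) if y == 0]
    cut = (mean(pos) + mean(neg)) / 2
    return lambda row: 1 if row[0] > cut else 0

# Made-up dataset: one numeric feature, binary label.
rows = [[v] for v in [1, 2, 3, 4, 5, 11, 12, 13, 14, 15]]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

candidates = {"majority": train_majority, "threshold": train_threshold}
results = {name: cross_validate(fn, rows, labels, k=5) for name, fn in candidates.items()}
best = max(results, key=results.get)  # the algorithm to use for model building
print(results, best)
```

The algorithm stored in `best` is the one you would then train on the full dataset in the model building step.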

The following Pipelines make up the set used in the Telco Customer Churn Prediction use case. You can reuse these Pipelines by connecting them to your data sources.

Profiling. This Pipeline reads the dataset from the SnapLogic File System (SLFS), performs type conversion, and computes data statistics, which are saved to SLFS in JSON format.

Data Preparation. This Pipeline also reads the dataset from SLFS and performs type conversion. The Mapper Snap then removes the id field from the dataset, and the Clean Missing Values Snap replaces all missing values in the dataset with the average value. The average value is included in the data statistics computed in the previous Pipeline; the File Reader Snap reads these statistics.

Cross Validation. This step uses two Pipelines.
The child Pipeline performs k-fold cross validation using a specific ML algorithm.
The parent Pipeline uses the Pipeline Execute Snap to automate k-fold cross validation across multiple algorithms. The Pipeline Execute Snap spawns and executes the child Pipeline multiple times, each time with a different algorithm. Instances of the child Pipeline can be executed sequentially, or in parallel to speed up the process by taking advantage of multi-core processors. The Aggregate Snap applies the max function to find the algorithm with the best result.

Model Building. Once you know which algorithm performs best on your dataset, this Pipeline builds the model using the Trainer (Classification) Snap. You can store this model in JSON, binary, or other formats.

Model Hosting. This Pipeline is scheduled as an Ultra Task to provide a REST API to external applications. The request arrives through an open input view. The key Snap in this Pipeline is Predictor (Classification), which hosts the ML model supplied by the JSON Parser Snap and consumes requests from the Extract Params (Mapper) Snap. It applies the ML model to the data in the request and generates a prediction.
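
The hand-off between the Profiling and Data Preparation Pipelines can be illustrated outside SnapLogic with a short Python sketch. It computes per-field statistics, serializes them as JSON (standing in for the statistics file saved to SLFS), drops the id field, and replaces missing values with the stored mean. The field names (`tenure`, `monthly_charges`) and function names are assumptions made for this example, not the actual Telco dataset schema or any Snap's API.

```python
import json
from statistics import mean

def profile(rows):
    """Compute count, min, max, and mean per field, ignoring missing (None) values."""
    stats = {}
    fields = sorted({f for row in rows for f in row})
    for field in fields:
        values = [row[field] for row in rows if isinstance(row.get(field), (int, float))]
        if values:
            stats[field] = {
                "count": len(values),
                "min": min(values),
                "max": max(values),
                "mean": mean(values),
            }
    return stats

def clean_missing(rows, stats):
    """Replace None with the field's mean taken from previously computed statistics."""
    return [
        {f: (stats[f]["mean"] if v is None and f in stats else v) for f, v in row.items()}
        for row in rows
    ]

# Hypothetical sample records with missing values.
rows = [
    {"id": 1, "tenure": 10, "monthly_charges": 30.0},
    {"id": 2, "tenure": None, "monthly_charges": 50.0},
    {"id": 3, "tenure": 20, "monthly_charges": None},
]

stats_json = json.dumps(profile(rows))  # profiling step: statistics saved as JSON
stats = json.loads(stats_json)          # preparation step: statistics read back
without_id = [{f: v for f, v in row.items() if f != "id"} for row in rows]
cleaned = clean_missing(without_id, stats)
print(cleaned)
```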



Data Science Snap Packs

ML Data Preparation Snap Pack

This Snap Pack contains Snaps to prepare the dataset. Data preparation processes include handling missing values, scaling, sampling, transforming, and others.

| Snap | Description |
| --- | --- |
| Clean Missing Values | Handle missing values by dropping or imputing them. |
| Type Converter | Determine types of values in columns. This Snap supports four data types: integer, floating point, text, and datetime. |
| Categorical to Numeric | Convert categorical columns into numeric columns by integer encoding or one hot encoding. |
| Numeric to Categorical | Convert numeric columns into categorical columns by custom ranging or binning. |
| Scale | Scale values in columns to specific ranges or apply statistical transformations. |
| Shuffle | Randomly shuffle rows. |
| Sample | Randomly keep or drop rows. This Snap supports stratified sampling. |
| Date Time Extractor | Extract components from date-time objects. |
| Principal Component Analysis | Perform principal component analysis (PCA) on numeric fields (columns) to reduce the dimensions of the dataset. |
| Feature Synthesis | Create features out of multiple datasets that share a one-to-one or one-to-many relationship with each other. Features are measurements of data points, for example, height, mean, mode, min, and max. |
| Mask | Hide sensitive information in your dataset before exporting the dataset for analytics. |
| Match | Perform record linkage to automatically identify documents from different data sources (input views) that may represent the same entity, without relying on a common key field. |
| Deduplicate | Remove duplicate records from input documents. |
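
To make a few of these transformations concrete, here is a minimal Python sketch of integer encoding, one hot encoding, range scaling, and binning. It illustrates the general techniques only; the function names and sample values are assumptions for this example and do not reflect how the Snaps are implemented or configured.

```python
from bisect import bisect_right

def integer_encode(values):
    """Map each distinct category to an integer, in order of first appearance."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return [codes[v] for v in values], codes

def one_hot_encode(values):
    """Return one 0/1 column per category, with categories sorted alphabetically."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories

def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale values into the range [low, high]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [low for _ in values]  # constant column: nothing to scale
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]

def bin_numeric(values, edges, labels):
    """Assign each value to a labeled range; requires len(labels) == len(edges) + 1."""
    return [labels[bisect_right(edges, v)] for v in values]

colors = ["red", "green", "red", "blue"]
encoded, codes = integer_encode(colors)
one_hot, categories = one_hot_encode(colors)
scaled = min_max_scale([10, 20, 30])
age_groups = bin_numeric([17, 18, 40, 65], edges=[18, 65], labels=["minor", "adult", "senior"])
print(encoded, one_hot, scaled, age_groups)
```

Integer encoding keeps a single column but implies an ordering among categories, while one hot encoding avoids that at the cost of one column per category.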

ML Core Snap Pack

This Snap Pack contains Snaps that experiment with machine learning algorithms, build ML models, and use ML models. It also contains the Remote Python Script Snap, which executes Python scripts natively.

| Snap | Description |
| --- | --- |
| AutoML | Automate the process of exploring and tuning machine learning models for a given dataset within specific resource limits. |
| Clustering | Perform exploratory data analysis to find hidden patterns or groupings in data. |
| Cross Validator (Classification) | Perform k-fold cross validation with state-of-the-art machine learning algorithms on a classification dataset. |
| Cross Validator (Regression) | Perform k-fold cross validation with state-of-the-art machine learning algorithms on a regression dataset. |
| Trainer (Classification) | Train the model using state-of-the-art machine learning algorithms on a classification dataset. |
| Trainer (Regression) | Train the model using state-of-the-art machine learning algorithms on a regression dataset. |
| Predictor (Classification) | Apply the model trained by the Trainer (Classification) Snap to get predictions for unlabeled data. |
| Predictor (Regression) | Apply the model trained by the Trainer (Regression) Snap to get predictions for unlabeled data. |
| Remote Python Script | Execute a Python script natively on a Python server. |
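
The relationship between a Trainer and its Predictor (train once, store the model, then apply it to unlabeled data) can be sketched as follows. The example uses a simple nearest-centroid classifier as a stand-in; it is not one of the algorithms the Snaps are documented to use, and the model layout, function names, and sample data are assumptions for illustration.

```python
import json
from statistics import mean

def train(rows, labels):
    """Build a nearest-centroid model: one mean vector per label (JSON-serializable)."""
    centroids = {}
    for label in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return {"algorithm": "nearest_centroid", "centroids": centroids}

def predict(model, row):
    """Return the label whose centroid is closest to the row (squared distance)."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(model["centroids"], key=lambda label: sq_dist(model["centroids"][label]))

# Hypothetical labeled training data with two numeric features.
rows = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]]
labels = ["stay", "stay", "churn", "churn"]

model_json = json.dumps(train(rows, labels))  # training step: model stored as JSON
model = json.loads(model_json)                # prediction step: model loaded back
prediction = predict(model, [4.5, 4.9])
print(prediction)
```

Storing the model in a portable format such as JSON is what lets the training and prediction steps run in separate Pipelines.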

ML Natural Language Processing Snap Pack

This Snap Pack contains Snaps that enable you to perform operations in natural language processing (NLP).

| Snap | Description |
| --- | --- |
| Tokenizer | Converts sentences into an array of tokens. |
| Common Words | Finds the most popular words in the dataset of input sentences. |
| Bag of Words | Vectorizes sentences into a set of numeric fields. |
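
The three operations build on one another, as the following Python sketch suggests: sentences are tokenized, the most frequent tokens form a vocabulary, and each sentence is then turned into word counts over that vocabulary. The tokenization rule (lowercase, split on non-alphanumeric characters) and the function names are assumptions for this example; the Snaps may tokenize differently.

```python
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

def common_words(sentences, top_n):
    """Return the top_n most frequent tokens across all sentences."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    return [word for word, _ in counts.most_common(top_n)]

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(tokenize(sentence))
    return {word: counts[word] for word in vocabulary}

# Made-up input sentences.
sentences = [
    "The service is great",
    "The price is high",
    "Great price, great service",
]
vocabulary = common_words(sentences, 3)
vectors = [bag_of_words(s, vocabulary) for s in sentences]
print(vocabulary, vectors)
```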

ML Analytics Snap Pack

This Snap Pack contains Snaps to analyze the data.

| Snap | Description |
| --- | --- |
| Profile | Compute data statistics. |
| Type Inspector | Display data types. |
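
As a rough illustration of what inspecting data types involves, the sketch below classifies raw string values into the four types named for the Type Converter Snap (integer, floating point, text, and datetime) and then derives one type per column. The inference rules, the ISO 8601 date assumption, and the sample field names are all assumptions for this example, not the Snap's actual logic.

```python
from datetime import datetime

def infer_type(value):
    """Classify a raw string as integer, floating point, datetime, or text."""
    text = value.strip()
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)  # note: also accepts strings such as "nan" and "inf"
        return "floating point"
    except ValueError:
        pass
    try:
        datetime.fromisoformat(text)  # assumes ISO 8601 formatted dates
        return "datetime"
    except ValueError:
        return "text"

def column_types(rows):
    """Derive one type per field; mixed integer/float columns become floating point."""
    types = {}
    for field in rows[0]:
        seen = {infer_type(row[field]) for row in rows}
        if len(seen) == 1:
            types[field] = seen.pop()
        elif seen <= {"integer", "floating point"}:
            types[field] = "floating point"
        else:
            types[field] = "text"
    return types

# Hypothetical records as they might arrive from a CSV file (all strings).
rows = [
    {"tenure": "12", "charges": "29.85", "signup": "2019-03-01", "plan": "basic"},
    {"tenure": "7", "charges": "30", "signup": "2020-11-15", "plan": "premium"},
]
types = column_types(rows)
print(types)
```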

Getting Started with SnapLogic Data Science

Machine Learning Showcase

The best way to learn about data science and machine learning is to try them out. We have developed and deployed demos using SnapLogic Data Science. They are available on the SnapLogic Machine Learning Showcase.


The table below briefly describes each of the demos available in the SnapLogic Machine Learning Showcase.

| Title | Keyword | Description |
| --- | --- | --- |
| Iris Flower Classification | Classification, Logistic Regression | The Iris flower dataset is one of the most popular datasets in the world. It contains 1 categorical field, which is the name of the Iris flower, and 4 numeric fields: petal length, petal width, sepal length, and sepal width. The goal is to build an ML model that can predict the name of the flower based on its size. |
| Diabetes Progression Prediction | Regression, Linear Regression | The diabetes progression dataset from this research paper contains 11 numeric fields: 10 describe the patient (age, sex, BMI, BP, and 6 serum measurements), and 1 is the diabetes progression. The goal is to build an ML model that can predict the diabetes progression based on serum measurements and other information. |
| The Decision Tree | Machine Learning Algorithm, K-Fold Cross Validation | The decision tree is one of the easiest ML algorithms to understand, yet it is powerful. In this demo, you can select a dataset, then perform cross validation, build the model, and apply it using the decision tree algorithm. |
| Telco Customer Churn Prediction | Classification, Logistic Regression, Data Visualization, Predictive Analytics | Customer churn is a big problem for service providers because losing customers results in losing revenue and could indicate service deficiencies. There are many reasons why customers decide to leave a service. With data analytics and machine learning, we can identify the important factors behind churn, create a retention plan, and predict which customers are likely to churn. In this case, we use a dataset from a telecommunications company. |
| Handwritten Digit Recognition | Classification, Convolutional Neural Networks, Python, Keras + TensorFlow | The MNIST dataset is probably the best dataset for learning about convolutional neural networks. It contains 70,000 handwritten digit images. Each image is 28 by 28 pixels and contains only one digit, which has already been scaled and centered, so we can skip the data preparation step. We use the Keras library to define the high-level architecture of the neural network and train the convolutional neural network model using TensorFlow as the backend. |
| Image Recognition (Inception-v3) | Classification, Convolutional Neural Networks, Python, Keras + TensorFlow | Inception-v3 is a deep convolutional neural network model that is trained on the ImageNet dataset. This model can accurately identify 1000 types of objects in images. Try taking a picture and submitting it to the model. The Remote Python Script Snap is used to execute the Python script and host deep learning models. The executor supports both CPU and GPU instances. |
| Natural Language Processing | Python, TextBlob, NLTK | TextBlob is a very easy-to-use Natural Language Processing (NLP) library in Python. It is built on top of the popular NLTK library. In this demo, you will see how you can use SnapLogic Data Science to deploy open source ML libraries as scalable REST APIs that can be easily integrated with your applications. |
| Ultra Task API Tester | REST API, Ultra Task | The key success factor in all of the demos is operationalization. We can easily build ML-driven applications if we have ML APIs. In this demo, you can see examples of ML APIs, and you can use the tester to test your own ML APIs. |
| Loan Repayment Prediction | Classification, AutoML, Data Visualization, Predictive Analysis | In this demo, we train several models on loan data to identify loans that are likely to end up being charged off. Banks and other lenders can use this model to avoid making bad loans and invest in good loans that yield returns. |



Machine Learning Use Cases

To learn how to build the above demos with SnapLogic Data Science, go through our use cases.

Pipeline Patterns

You can access pre-built Data Science Pipelines that reflect the use cases in the Designer > Patterns tab.