Overview

SnapLogic Data Science, in combination with the Intelligent Integration Platform (IIP), is a self-service platform for end-to-end ML that offers a low-code approach to data acquisition, data exploration and preparation, model training and validation, and model deployment. The solution includes the following sets of pre-built Pipeline components (Snap Packs):

ML Analytics Snap Pack: Perform analytic operations, such as data profiling and data type inspection.
ML Data Preparation Snap Pack: Perform preparatory operations on datasets, such as data type transformation, data cleansing, sampling, shuffling, and scaling.
ML Core Snap Pack: Perform machine learning operations on datasets, such as cross validation, model training, and model-based predictions.
ML Natural Language Processing Snap Pack: Perform operations in natural language processing (NLP).

To demonstrate realistic scenarios, the Data Science Use Cases section includes eight use cases that show how to use the Snaps in the Snap Packs above.

Snap Packs

ML Data Preparation Snap Pack

This Snap Pack contains Snaps to prepare the dataset. Data preparation includes handling missing values, scaling, sampling, transforming, and other operations.

Snap	Description
Clean Missing Values	Handle missing values by dropping or imputing them.
Type Converter	Determine types of values in columns. There are 4 supported types: integer, floating point, text, and datetime.
Categorical to Numeric	Convert categorical columns into numeric columns by integer encoding or one hot encoding.
Numeric to Categorical	Convert numeric columns into categorical columns by custom ranging or binning.
Scale	Scale values in columns to specific ranges or apply statistical transformations.
Shuffle	Randomly shuffle rows.
Sample	Randomly keep/drop rows. Stratified sampling is supported.
Date Time Extractor	Extract components from date-time objects.
Principal Component Analysis	Perform principal component analysis (PCA) on numeric fields (columns) to reduce the dimensions of the dataset.
Feature Synthesis	Create features out of multiple datasets that share a one-to-one or one-to-many relationship with each other. Features are measurements of data points, for example, height, mean, mode, min, and max.
Mask	Hide sensitive information in your dataset before exporting the dataset for analytics.
Match	Perform record linkage to automatically identify documents from different data sources (input views) that may represent the same entity, without relying on a common key field.
Deduplicate	Remove duplicate records from input documents.
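The following is a minimal Python sketch of three of the operations described above: mean imputation (as in Clean Missing Values), one hot encoding (as in Categorical to Numeric), and min-max scaling (as in Scale). It illustrates the general techniques only and is not the Snaps' implementation; the function names, field names, and toy dataset are hypothetical.

```python
from statistics import mean

# Hypothetical toy dataset; None marks a missing value.
rows = [
    {"age": 30, "plan": "basic"},
    {"age": None, "plan": "premium"},
    {"age": 50, "plan": "basic"},
]

def impute_mean(rows, field):
    """Replace missing values in `field` with the mean of the observed values."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def one_hot(rows, field):
    """Replace a categorical field with one 0/1 column per category."""
    categories = sorted({r[field] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != field}
        for c in categories:
            new_row[f"{field}_{c}"] = 1 if r[field] == c else 0
        encoded.append(new_row)
    return encoded

def min_max_scale(rows, field):
    """Scale a numeric field linearly into the range [0, 1]."""
    values = [r[field] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero for constant columns
    return [{**r, field: (r[field] - lo) / span} for r in rows]

# Impute first, then encode, then scale.
prepared = min_max_scale(one_hot(impute_mean(rows, "age"), "plan"), "age")
```

The order matters in this sketch: scaling runs last because it needs every value in the column to be present.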

ML Core Snap Pack

This Snap Pack contains Snaps that experiment with machine learning algorithms, build ML models, and use ML models. It also contains the Remote Python Script Snap, which executes Python scripts natively.

Snap	Description
AutoML	Automate the process of exploring and tuning machine learning models for a given dataset within specific resource limits.
Clustering	Perform exploratory data analysis to find hidden patterns or groupings in data.
Cross Validator (Classification)	Perform k-fold cross validation with state-of-the-art machine learning algorithms on a classification dataset.
Cross Validator (Regression)	Perform k-fold cross validation with state-of-the-art machine learning algorithms on a regression dataset.
Trainer (Classification)	Train a model using state-of-the-art machine learning algorithms on a classification dataset.
Trainer (Regression)	Train a model using state-of-the-art machine learning algorithms on a regression dataset.
Predictor (Classification)	Apply the model trained by the Trainer (Classification) Snap to get predictions for unlabeled data.
Predictor (Regression)	Apply the model trained by the Trainer (Regression) Snap to get predictions for unlabeled data.
Remote Python Script	Execute Python scripts natively on a Python server.
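For readers unfamiliar with k-fold cross validation, the sketch below shows the general technique in plain Python: the dataset is split into k folds, each fold serves once as the test set while the remaining folds are used for training, and the scores are averaged. This is an illustration under stated assumptions, not how the Cross Validator Snaps are implemented; all function names are hypothetical, and the "algorithm" is a trivial majority-label baseline.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_rows))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(rows, labels, k, train_fn, predict_fn):
    """Return the mean accuracy across k folds."""
    scores = []
    for train, test in k_fold_indices(len(rows), k):
        model = train_fn([rows[i] for i in train], [labels[i] for i in train])
        correct = sum(predict_fn(model, rows[i]) == labels[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)

# Hypothetical baseline: always predict the most common training label.
def train_majority(rows, labels):
    return max(set(labels), key=labels.count)

def predict_majority(model, row):
    return model

# Toy data: the rows themselves are ignored by the baseline.
rows = list(range(9))
labels = ["a", "a", "b"] * 3
accuracy = cross_validate(rows, labels, 3, train_majority, predict_majority)
```

The sketch does not shuffle the rows; in practice, the dataset is usually shuffled before it is split into folds.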

ML Natural Language Processing Snap Pack

This Snap Pack contains Snaps that enable you to perform operations in natural language processing (NLP).

Snap	Description
Tokenizer	Converts sentences into an array of tokens.
Common Words	Finds the most popular words in the dataset of input sentences.
Bag of Words	Vectorizes sentences into a set of numeric fields.

ML Analytics Snap Pack

This Snap Pack contains Snaps to analyze the data.

Snap	Description
Profile	Compute data statistics.
Type Inspector	Display data types.
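As a rough illustration of what data profiling and type inspection involve, the sketch below infers a coarse type for each column and computes a few summary statistics. It is a simplified example, not the output format or logic of the Profile and Type Inspector Snaps; the function names, the set of statistics, and the sample columns are hypothetical, and datetime detection is omitted.

```python
from statistics import mean, median, pstdev

def inspect_type(values):
    """Return a coarse type name for a column, ignoring missing values."""
    present = [v for v in values if v is not None]
    if present and all(isinstance(v, int) and not isinstance(v, bool) for v in present):
        return "integer"
    if present and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "floating point"
    return "text"

def profile(values):
    """Compute basic statistics for one column."""
    present = [v for v in values if v is not None]
    stats = {
        "type": inspect_type(values),
        "count": len(values),
        "missing": len(values) - len(present),
    }
    if stats["type"] in ("integer", "floating point"):
        stats.update(
            min=min(present),
            max=max(present),
            mean=mean(present),
            median=median(present),
            stddev=pstdev(present),  # population standard deviation
        )
    else:
        stats["distinct"] = len(set(present))
    return stats

# Hypothetical columns; None marks a missing value.
columns = {
    "tenure": [1, 12, None, 24, 3],
    "plan": ["basic", "premium", "basic", None, "basic"],
}
report = {name: profile(values) for name, values in columns.items()}
```

Statistics such as the mean and the number of missing values are what later guide the choice of data preparation steps, for example mean imputation.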

Typical Data Science Project Life Cycle

The main goal of data science is to get the most out of data, which can be considered a new type of currency. The more data we have, the more intelligent things we can do. Data science projects vary by type of business, but most can be broken down into the 6 general steps listed below.

Data Acquisition: Gather the data from multiple sources including files, databases, on-premise applications, cloud applications, APIs, IoT devices, etc. Data from different sources have different formats. Integration is the key.
Data Exploration: Perform data profiling, analytics, and visualization to better understand the data. Then, define the business problems we want to solve with the data we have.
Data Preparation: Perform data cleansing, scaling, and other statistical techniques to transform the data into datasets that are suitable for machine learning algorithms.
Model Development: Develop, apply, and evaluate machine learning algorithms on the dataset. The result of this step is an ML model that can make predictions based on the input.
Model Deployment: Host the ML model as an API so that it can be easily consumed by external applications.
Optimization: Keep track of the performance of the model, gather feedback, and iterate. Because data distribution changes over time, monitoring and optimization are key success factors in the long term.

Pipeline

A Pipeline is a set of Snaps that are connected together to perform a specific task. There are different types of Pipelines depending on the type of task they perform. For data science projects, we highly recommend the following 5 Pipeline types.

Profiling Pipeline: Compute data statistics of the dataset. Data statistics are critical for understanding the dataset and its distribution, and for selecting the appropriate data preparation techniques in the next step.
Data Preparation Pipeline: Apply data cleansing and other statistical transformations to prepare the dataset before applying ML algorithms.
Cross Validation Pipeline: Perform k-fold cross validation using various ML algorithms with different sets of parameters to find the most suitable algorithm for your dataset.
Model Building Pipeline: Train the model using the most suitable algorithm and parameters from the cross validation step.
Model Hosting Pipeline: Deploy the model as a low-latency API using an Ultra Task.

The table below shows a set of Pipelines from the Telco Customer Churn Prediction use case with a brief description of each. You can reuse these Pipelines by connecting them to your data sources and replacing the data preparation Snaps. We provide these Pipelines as patterns that you can easily import into your project space.

Pipeline	Description
Profiling	This Pipeline reads the dataset from the SnapLogic File System (SLFS), performs type conversion, and computes data statistics, which are saved back to SLFS in JSON format.
Data Preparation	This Pipeline also reads the dataset from SLFS and performs type conversion. The Mapper Snap then removes the id field from the dataset, and the Clean Missing Values Snap replaces all missing values in the dataset with the average value. The average value is included in the data statistics computed in the Profiling Pipeline; the File Reader1 Snap reads those statistics.
Cross Validation	This step uses 2 Pipelines. The top Pipeline (child Pipeline) performs k-fold cross validation using a specific ML algorithm. The bottom Pipeline (parent Pipeline) uses the Pipeline Execute Snap to automate k-fold cross validation across multiple algorithms: the Pipeline Execute Snap spawns and executes the child Pipeline multiple times, each time with a different algorithm. Instances of the child Pipeline can be executed sequentially, or in parallel to speed up the process by taking advantage of multi-core processors. The Aggregate Snap applies the max function to find the algorithm with the best result.
Model Building	Once you know which algorithm performs best on your dataset, this Pipeline builds the model using the Trainer (Classification) Snap. You can store the model in JSON, binary, or other formats.
Model Hosting	This Pipeline is scheduled as an Ultra Task to provide a REST API to external applications. Requests come in through an open input view. The core Snap in this Pipeline is Predictor (Classification), which hosts the ML model from the JSON Parser Snap and consumes requests from the Extract Params (Mapper) Snap. It applies the ML model to the data in each request and returns a prediction.
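The parent/child pattern used in the Cross Validation step can be pictured with the following Python sketch: a parent function runs a child function once per algorithm, optionally in parallel, and then takes the maximum score, much as the Aggregate Snap applies the max function. This is an analogy only. The function names and algorithm names are hypothetical, and the accuracy values are hard-coded placeholders rather than real results; an actual child Pipeline would run k-fold cross validation instead.

```python
from concurrent.futures import ThreadPoolExecutor

def child_pipeline(algorithm):
    """Stand-in for one child Pipeline run: return the algorithm and its score."""
    # Placeholder accuracies for illustration only (not measured results).
    placeholder_scores = {
        "decision_tree": 0.78,
        "logistic_regression": 0.81,
        "naive_bayes": 0.74,
    }
    return {"algorithm": algorithm, "accuracy": placeholder_scores[algorithm]}

def parent_pipeline(algorithms, max_workers=1):
    """Run the child once per algorithm and return the best result.

    max_workers=1 runs the children sequentially; a larger value runs
    them in parallel.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(child_pipeline, algorithms))
    # Equivalent of aggregating with max over the accuracy field.
    return max(results, key=lambda r: r["accuracy"])

best = parent_pipeline(
    ["decision_tree", "logistic_regression", "naive_bayes"],
    max_workers=3,
)
```

The algorithm selected here would then be the one passed to the Model Building step.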

Getting Started with SnapLogic Data Science

Machine Learning Showcase

The best way to learn about data science and machine learning is to try them out. We have developed and deployed demos using SnapLogic Data Science. They are available on the SnapLogic Machine Learning Showcase.

The table below briefly describes each of the demos available in the SnapLogic Machine Learning Showcase.

Title	Keywords	Description
Iris Flower Classification	Classification, Logistic Regression	The Iris flower dataset is one of the most popular datasets in the world. It contains 1 categorical field, the name of the Iris flower, and 4 numeric fields: petal length, petal width, sepal length, and sepal width. The goal is to build an ML model that can predict the name of the flower based on the size of its petals and sepals.
Diabetes Progression Prediction	Regression, Linear Regression	The diabetes progression dataset from this research paper contains 11 numeric fields: 10 input fields (the patient's age, sex, BMI, BP, and 6 serum measurements) and the diabetes progression. The goal is to build an ML model that can predict the diabetes progression based on the serum measurements and other information.
The Decision Tree	Machine Learning Algorithm, K-Fold Cross Validation	The decision tree is one of the easiest ML algorithms to understand, yet it is powerful. In this demo, you can select a dataset, then perform cross validation, build the model, and apply it using the decision tree algorithm.
Telco Customer Churn Prediction	Classification, Logistic Regression, Data Visualization, Predictive Analytics	Customer churn is a big problem for service providers because losing customers results in losing revenue and could indicate service deficiencies. There are many reasons why customers decide to leave a service. With data analytics and machine learning, we can identify the important factors of churning, create a retention plan, and predict which customers are likely to churn. In this case, we use a dataset from a telecommunications company.
Handwritten Digit Recognition	Classification, Convolutional Neural Networks, Python, Keras + TensorFlow	The MNIST dataset is probably the best dataset for learning about convolutional neural networks. It contains 70,000 handwritten digit images. Each image is 28 by 28 pixels and contains only one digit, which has already been scaled and centered, so we can skip the data preparation step. We use the Keras library to define the high-level architecture of the neural network and train the convolutional neural network model using TensorFlow as the backend.
Image Recognition (Inception-v3)	Classification, Convolutional Neural Networks, Python, Keras + TensorFlow	Inception-v3 is a deep convolutional neural network model that is trained on the ImageNet dataset. This model can accurately identify 1000 types of objects in images. Try taking a picture and submitting it to the model. The Remote Python Script Snap is used to execute the Python script and host the deep learning model. The executor supports both CPU and GPU instances.
Natural Language Processing	Python, TextBlob, NLTK	TextBlob is a very easy-to-use Natural Language Processing (NLP) library in Python. It is built on top of the popular NLTK library. In this demo, you will see how you can use SnapLogic Data Science to deploy open source ML libraries as scalable REST APIs that can be easily integrated with your applications.
Ultra Task API Tester	REST API, Ultra Task	The key success factor in all of the demos is operationalization. We can easily build ML-driven applications if we have ML APIs. In this demo, you can see examples of ML APIs and use the tester to test your own ML APIs.
Loan Repayment Prediction	Classification, AutoML, Data Visualization, Predictive Analysis	In this demo, we train several models on loan data to identify loans that are likely to end up being charged off. Banks and other lenders can use this model to avoid making bad loans and invest in good loans that yield returns.

Machine Learning Use Cases

After checking out the demos on the SnapLogic Machine Learning Showcase, you can learn more about how we built them and what else you can do with SnapLogic Data Science. We provide tutorials/stories for most of the demos mentioned above. You can access them here.

The table below lists all the use cases for which we provide tutorials.

Title	Description
Iris Flower Classification	Build a model to classify flowers based on the size of the sepal and petal.
Iris Flower Classification using Neural Networks	Build a neural network model to classify flowers based on the size of the sepal and petal.
Diabetes Progression Prediction	Build a model to predict diabetes progression based on the patient's demographics and serum measurements.
Telco Customer Churn Prediction	Build a model to predict customer churn based on demographics and subscription history.
Sentiment Analysis Using SnapLogic Data Science	Build a sentiment analysis model using review data from Yelp.
Lending Club Loan Approval	Build a model to predict the likelihood that a loan will be charged off.
Kickstarter Project Success Prediction	Build a model to predict the likelihood of success of a Kickstarter project based on project information.
Handwritten Digit Recognition	Build a convolutional neural network model to classify handwritten digits.
Image Recognition (Inception-v3)	Use the Inception-v3 model to identify objects in images.
Natural Language Processing	Use the TextBlob library to perform simple NLP operations.
Speech Recognition	Use a pre-trained model from the DeepSpeech library to transcribe audio.

Pipeline Patterns

We have packaged the Pipelines from the demos and use cases as patterns. If you sign up for a SnapLogic Data Science trial, the Pipelines are available in the Patterns tab. You can also click here to view them on the SnapLogic Manager page. If you are not a trial user, you can find those Pipelines here.

SnapLogic Data Science (Machine Learning)