On this Page

Pipeline	Description
	Data Preparation: This pipeline reads the dataset from Yelp and, because we are sure that reviews associated with a 1-star rating are definitely negative and those associated with a 5-star rating are definitely positive, retains only the 1-star and 5-star reviews. Doing so enables us to build a reliable model. The pipeline now uses stratified sampling to balance the ratio of 1-star and 5-star reviews. This ensures that our model is not biased towards either sentiment. Once we have a balanced dataset, we break each review into an array of words and identify the 200 words that are most commonly used across reviews. These are the words whose relative placement to each other we shall examine to create our algorithm. Finally, we generate statistics (such as the most popular word, the number of unique words, and so on) related to these 200 words to better understand how they function in the sample dataset.
	Cross Validation. We have 2 pipelines in this step. The top pipeline (child pipeline) performs k-fold cross validation using a specific ML algorithm. The pipeline at the bottom (the parent pipeline) uses the Pipeline Execute Snap to automate the process of performing k-fold cross validation on multiple algorithms. The Pipeline Execute Snap spawns and executes the child pipeline multiple times with different algorithms. You can execute instances of the child pipeline sequentially or leverage the SnapLogic multi-core processor to speed up the process by executing these pipelines in parallel. The Aggregate Snap applies the max function to identify the algorithm that offers the best results.

	Model Building. Based on the cross validation result, we can see that the logistic regression and support vector machines algorithms perform the best. The Trainer (Classification) Snap trains the logistic regression model. Once the model is ready, we use the JSON Formatter and File Writer Snaps to write the model to SLFS.
	Model Hosting. This pipeline is scheduled as an Ultra Task to offer sentiment analysis as a REST-API-driven service to external applications. New requests are offered as input to the Filter Snap. The Predictor (Classification) Snap receives the the ML model from the JSON Parser1 Snap and hosts it. The Tokenizer and Bag of Words Snaps prepare the input text for sentiment analysis. The File Reader Snap reads the common words that are required for the bag of words operation. For more information on how to offer an ML ultra task as a REST API, see SnapLogic Pipeline Configuration as a REST API.
	API Testing. This pipeline takes a sample request, sends it as a REST API request, and displays the results received. The JSON Generator Snap contains a sample request, including the token and the text. The REST Post Snap sends a request to the Ultra Task (API). The Mapper Snap extracts prediction from the response body. More information on scheduling and managing ML ultra tasks is here.

The Data Preparation Pipeline

...

Note
This is where it gets technical. We have highlighted the key Snaps in the image below to simplify understanding.

This pipeline contains the following key Snaps:

	Snap Label	Snap Name	Description
1	Read Review Dataset	File Reader	Reads an extract of the Yelp dataset containing 10,000 reviews from the SnapLogic File System (SLFS).
2	Filter 1 and 5 Stars	Filter	Retains only 1-star and 5-star reviews.
3	Stratified Sampling	Sample	Applies stratified sampling to balance the ratio of 1-star and 5-star reviews.
4	Extract Text and Sentiment	Mapper	Maps $stars to $sentiment and replaces 1 (star) with negative, and 5 (star) with positive. It also allows the input data ($text) to pass through unchanged to the downstream Snap.
5	Tokenizer	Tokenizer	Breaks each review into an array of words, of which two copies are made.
6	Common Words	Common Words	Computes the frequency of the top 200 most common words in one copy of the array of words.
7	Bag of Words	Bag of Words	Converts the second copy of the array of words into a vector of word frequencies.
8	Profile	Profile	Computes data statistics using the output from the Common Words Snap.

Key Data Preparation Snaps

Filter

This Snap keeps 1-star and 5-star reviews while dropping all others (2-star, 3-star, and 4-star reviews). This brings the number of useful reviews down from 10000 to 4428.

...

In the filtered list of 4428 reviews, there are 3555 5-star reviews and 873 1-star reviews. If we train the ML model on this dataset, the model will be biased towards positive (5-star) reviews, because it will get to sample more positive reviews than negative ones. This is what we call an unbalanced problem. There are many methods of dealing with unbalanced datasets, including downsampling (reducing the number of samples that form the majority until a balance is reached) and upsampling (duplicating samples that form the minority). In this case, we use the downsampling method.

We keep all the 873 1-star reviews and the same number of 5-star reviews. The Pass through percentage is (873 + 873) / 4428 = 0.3943. We specify $stars in the Stratified field property and select Stratified Sampling as the Algorithm. Stratified sampling is a technique that maintains the same number of documents of each class; in this case, the class is $stars.

Expand

title	Output preview

The output contains fewer documents; however, the list contains an equal number of 1-star and 5-star reviews.

...

We need two pipelines for cross-validation, and design them as shown below:

The pipeline displayed at the top (the child pipeline) performs k-fold cross validation using a specific ML algorithm. The pipeline at the bottom (the parent pipeline) uses the Pipeline Execute Snap to automate the process of performing k-fold cross validation on multiple algorithms.

The Pipeline Execute Snap in the parent pipeline spawns and executes the child pipeline multiple times with different algorithms. Instances of the child pipeline can be executed sequentially or in parallel to speed up the process. The Aggregate Snap applies a max function to find the algorithm that offers the best result.

This pipeline contains the following key Snaps:

Cross Validator: Performs 10-fold cross validation using various algorithms.
CSV Generator: Lists out the algorithms that must be used for cross validation.
Pipeline Execute: Spawns and executes the child pipeline 7 times, each time with a different algorithm.
Mapper: Lists out each algorithm with its accuracy numbers.
Aggregate: Applies the max function to find the algorithm with the greatest accuracy.

Key Cross Validation Snaps

...

The Cross Validation pipeline (above) demonstrated to us that the logistic regression and support vector machines algorithms offered the greatest cross-validation accuracy for our dataset. In the Model Building pipeline, we use the Trainer (Classification) Snap to train a logistic regression model, which we write to SLFS using the JSON Formatter and File Writer Snaps.

Key Model Building Snaps

Trainer (Classification)

...

Follow the instructions available here to schedule this pipeline as a REST API.

This pipeline is scheduled as an Ultra Task to provide REST API to external applications. The request comes into the open input view of the Filter Snap. The core Snap in this pipeline is Predictor (Classification), which hosts the ML model from JSON Parser1 Snap. The Tokenizer and Bag of Words Snaps prepare the input text. The File Reader Snap reads the common words that are required in the bag of words operation. See more information here.

Predictor (Classification)

...

Versions Compared

Old Version 17

New Version 18

Key

The Data Preparation Pipeline