Sentiment Analysis Using SnapLogic Data Science

Overview

What does this use case do?

This use case demonstrates how you can use SnapLogic Machine Learning (ML) Snaps to perform sentiment analysis. Sentiment analysis enables you to computationally identify and classify opinions expressed in a piece of text.

In this use case, we build a simple sentiment analysis model that classifies input text as either positive or negative in sentiment.

How this use case is structured

In the initial sections of this use case, we offer a high-level description of the project and its key tasks. We then describe the pipelines and Snaps that make up the use case. In each section, we first offer a functional description of the task before getting into the technical details.

The Dataset Used for this Use Case

For this use case, we use the Yelp dataset. Yelp provides a subset of their data as an open dataset containing information about businesses, reviews, users, check-ins, tips, and photos. The full dataset is available from Yelp's open dataset website.

In this use case, we focus only on user-review data. To simplify our use case, we use only 5-star and 1-star reviews as positive and negative examples, respectively.

Building a Sentiment Analysis Model

Process Summary

To build a sentiment analysis model, we perform the following high-level tasks:

  1. Data Preparation: Prepare the data required to train the model.

  2. Cross Validation: Use multiple algorithms to cross-validate the data and identify the algorithm that offers the most reliable results.

  3. Model Building: Build the sentiment analysis model using the algorithm identified in the previous step.

  4. Model Hosting: Make the model available as an Ultra Task.

  5. API Testing: Run a sample sentiment analysis request to check whether it works as expected.

Pipelines Used

We use the following pipelines to perform each of the tasks listed above:



Pipeline

Description


Data Preparation:

  1. This pipeline reads the Yelp dataset and retains only the 1-star and 5-star reviews. Because 1-star reviews are reliably negative and 5-star reviews are reliably positive, restricting the data in this way enables us to build a reliable model.

  2. The pipeline then uses stratified sampling to balance the ratio of 1-star and 5-star reviews. This ensures that our model is not biased towards either sentiment.

  3. Once we have a balanced dataset, we break each review into an array of words and identify the 200 words that are most commonly used across reviews. These 200 words form the vocabulary for the bag-of-words representation that the model is trained on.

  4. Finally, we generate statistics (such as the most popular word, the number of unique words, and so on) related to these 200 words to better understand how they function in the sample dataset.
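The preparation steps above can be sketched in plain Python. This is a minimal stdlib illustration of the same logic, not the SnapLogic implementation; the mini-dataset below is entirely hypothetical.

```python
import random
from collections import Counter

# Hypothetical mini-dataset standing in for the Yelp review extract.
reviews = [
    {"stars": 5, "text": "great food great service"},
    {"stars": 1, "text": "terrible food slow service"},
    {"stars": 3, "text": "it was okay"},
    {"stars": 5, "text": "great place"},
    {"stars": 1, "text": "bad bad place"},
]

# 1. Keep only the unambiguous 1-star and 5-star reviews.
labeled = [r for r in reviews if r["stars"] in (1, 5)]

# 2. Stratified downsampling: keep the same number of reviews per class
#    so the model is not biased towards either sentiment.
by_class = {1: [], 5: []}
for r in labeled:
    by_class[r["stars"]].append(r)
n = min(len(v) for v in by_class.values())
random.seed(0)
balanced = [r for v in by_class.values() for r in random.sample(v, n)]

# 3. Tokenize each review and collect the most common words across
#    reviews (the real pipeline keeps the top 200).
tokens = [r["text"].split() for r in balanced]
common = Counter(w for ts in tokens for w in ts).most_common(200)
```

In the real pipeline these steps are performed by the Filter, Sample, Tokenizer, and Common Words Snaps described later in this document.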

Cross Validation

We use two pipelines in this step. The child pipeline (top) performs k-fold cross validation using a specific ML algorithm. The parent pipeline (bottom) uses the Pipeline Execute Snap to automate k-fold cross validation across multiple algorithms.

  1. The Pipeline Execute Snap spawns and executes the child pipeline multiple times, once per algorithm. You can execute instances of the child pipeline sequentially, or execute them in parallel to speed up the process.


  2. The Aggregate Snap applies the max function to identify the algorithm that offers the best results.
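As a rough sketch of the idea, the following shows how k-fold splits can be generated and how the winning algorithm is selected with a max over mean accuracies. The per-fold scores below are made up for illustration; in the real use case the child pipeline computes them.

```python
from statistics import mean

def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross validation."""
    fold = n // k
    idx = list(range(n))
    for i in range(k):
        test = idx[i * fold:(i + 1) * fold]
        train = idx[:i * fold] + idx[(i + 1) * fold:]
        yield train, test

# Hypothetical per-fold accuracies for each candidate algorithm.
scores = {
    "logistic_regression": [0.91, 0.93, 0.92],
    "support_vector_machines": [0.90, 0.93, 0.91],
    "decision_tree": [0.84, 0.86, 0.85],
}

# Mirror the Aggregate Snap's max function: pick the algorithm with the
# best mean accuracy across folds.
best = max(scores, key=lambda algo: mean(scores[algo]))
```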

Model Building

Based on the cross validation results, we can see that the logistic regression and support vector machines algorithms perform best.

  1. The Trainer (Classification) Snap trains the logistic regression model.

  2. Once the model is ready, we use the JSON Formatter and File Writer Snaps to write the model to SLFS.
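Conceptually, the final step serializes the trained model to JSON, as the JSON Formatter and File Writer Snaps do. A minimal sketch, with entirely hypothetical model parameters and a local temp file standing in for SLFS:

```python
import json
import os
import tempfile

# Hypothetical model parameters; in the real pipeline the Trainer
# (Classification) Snap produces the trained logistic regression model.
model = {
    "algorithm": "logistic_regression",
    "vocabulary": ["great", "bad", "service"],
    "weights": [1.7, -2.1, 0.3],
    "bias": 0.1,
}

# Mirror the JSON Formatter + File Writer Snaps: serialize the model
# to a JSON file (a local temp file here stands in for SLFS).
path = os.path.join(tempfile.gettempdir(), "sentiment_model.json")
with open(path, "w") as f:
    json.dump(model, f)
```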

Model Hosting

This pipeline is scheduled as an Ultra Task to offer sentiment analysis as a REST-API-driven service to external applications.

  1. New requests are offered as input to the Filter Snap.

  2. The Predictor (Classification) Snap receives the ML model from the JSON Parser1 Snap and hosts it.

  3. The Tokenizer and Bag of Words Snaps prepare the input text for sentiment analysis.

  4. The File Reader Snap reads the common words that are required for the bag of words operation.

For more information on how to offer an ML ultra task as a REST API, see SnapLogic Pipeline Configuration as a REST API.
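To make the hosted computation concrete, here is a hedged sketch of what a logistic regression predictor does with an incoming request: tokenize the text, build a bag-of-words vector over the model's vocabulary, and apply the logistic function. The model parameters below are hypothetical, in the same shape as the JSON written out during model building.

```python
import math

# Hypothetical logistic regression model.
model = {
    "vocabulary": ["great", "bad", "service"],
    "weights": [1.7, -2.1, 0.3],
    "bias": 0.1,
}

def predict(text):
    """Tokenizer -> Bag of Words -> Predictor, as in the hosting pipeline."""
    words = text.lower().split()                       # Tokenizer Snap
    x = [words.count(w) for w in model["vocabulary"]]  # Bag of Words Snap
    z = sum(wi * xi for wi, xi in zip(model["weights"], x)) + model["bias"]
    p = 1 / (1 + math.exp(-z))                         # logistic function
    return "positive" if p >= 0.5 else "negative"      # Predictor Snap
```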

API Testing

This pipeline takes a sample request, sends it to the REST API, and displays the results received.

  1. The JSON Generator Snap contains a sample request, including the token and the text.

  2. The REST Post Snap sends a request to the Ultra Task (API).

  3. The Mapper Snap extracts the prediction from the response body. For more information on scheduling and managing ML Ultra Tasks, see the SnapLogic documentation.
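A minimal stdlib sketch of such a test request. The endpoint URL and token below are placeholders, and the actual network call is left commented out:

```python
import json
import urllib.request

# Placeholder Ultra Task endpoint and bearer token; substitute your own.
URL = "https://example.com/api/1/rest/slsched/feed/sentiment"
TOKEN = "my-ultra-task-token"

# Mirror the JSON Generator Snap: a sample request containing the token
# and the text to classify.
payload = json.dumps({"token": TOKEN, "text": "The food was amazing!"}).encode()

# Mirror the REST Post Snap: build an HTTP POST request for the Ultra Task.
req = urllib.request.Request(
    URL,
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Sending the request and extracting the prediction (Mapper Snap) would be:
# with urllib.request.urlopen(req) as resp:
#     prediction = json.load(resp)["prediction"]
```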

The Data Preparation Pipeline

We design the Data Preparation pipeline as shown below:

This is where it gets technical. We have highlighted the key Snaps in the image below to simplify understanding.



This pipeline contains the following key Snaps:

In the list below, each entry shows the Snap label followed by the Snap type in parentheses.

  1. Read Review Dataset (File Reader): Reads an extract of the Yelp dataset containing 10,000 reviews from the SnapLogic File System (SLFS).

  2. Filter 1 and 5 Stars (Filter): Retains only 1-star and 5-star reviews.

  3. Stratified Sampling (Sample): Applies stratified sampling to balance the ratio of 1-star and 5-star reviews.

  4. Extract Text and Sentiment (Mapper): Maps $stars to $sentiment, replacing 1 (star) with negative and 5 (star) with positive. It also allows the input data ($text) to pass through unchanged to the downstream Snap.

  5. Tokenizer (Tokenizer): Breaks each review into an array of words, of which two copies are made.

  6. Common Words (Common Words): Computes the frequency of the top 200 most common words in one copy of the array of words.

  7. Bag of Words (Bag of Words): Converts the second copy of the array of words into a vector of word frequencies.

  8. Profile (Profile): Computes data statistics using the output from the Common Words Snap.
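The Tokenizer, Common Words, and Bag of Words steps can be illustrated in a few lines of Python. The tokenized reviews below are hypothetical, and the vocabulary is truncated to 3 words instead of 200 for brevity:

```python
from collections import Counter

# Hypothetical tokenized reviews (the output of the Tokenizer Snap).
token_lists = [
    ["great", "food", "great", "service"],
    ["bad", "food", "slow", "service"],
]

# Common Words Snap: find the most frequent words across all reviews.
counts = Counter(w for ts in token_lists for w in ts)
vocabulary = [w for w, _ in counts.most_common(3)]

# Bag of Words Snap: one word-frequency vector per review over that
# vocabulary.
vectors = [[ts.count(w) for w in vocabulary] for ts in token_lists]
```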