Problem Scenario
Machine learning has been showing promising results in a wide range of applications, and healthcare is one of them. Machine learning can help doctors diagnose patients more accurately. In this use case, we use machine learning algorithms to predict the progression of diabetes in patients.
Description
In the study behind this dataset, the researchers collected ten baseline measurements for each patient: Age, Sex, BMI, BP, and six serum measurements (S1, S2, ..., S6). One year after baseline, a measure of diabetes progression was recorded. Our goal is to teach the machine to predict diabetes progression based on these 10 measurements.
The screenshot below shows a preview of this dataset. There are 10 measurements, and the diabetes progression is represented as Y in the rightmost column.
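A comparable dataset ships with scikit-learn, so a quick way to explore the data outside SnapLogic is the short Python sketch below. This is illustrative only and not part of the pipelines; the assumption that the CSV matches scikit-learn's classic diabetes dataset, and the rename of the target column to Y, are ours.

```python
# Illustrative sketch only: assumes the CSV resembles the classic
# diabetes dataset (Age, Sex, BMI, BP, S1..S6, Y). Not part of the
# SnapLogic pipelines themselves.
from sklearn.datasets import load_diabetes

data = load_diabetes(as_frame=True)
df = data.frame                          # 10 feature columns plus "target"
df = df.rename(columns={"target": "Y"})  # mirror the Y column naming above

print(df.shape)   # (442, 11): 442 patients, 10 measurements + Y
print(df.head())  # preview, similar to the screenshot above
```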
The live demo is available at our Machine Learning Showcase.
Objectives
- Cross Validation: Use the Cross Validator (Regression) Snap from the ML Core Snap Pack to perform 10-fold cross validation with the Linear Regression algorithm. K-fold cross validation evaluates a machine learning algorithm by randomly partitioning the dataset into k folds; the model is trained on k-1 folds and tested on the remaining fold, and the process repeats so that each fold serves as the test set exactly once. A minimal sketch of the fold mechanics follows this list.
- Model Building: Use the Trainer (Regression) Snap from the ML Core Snap Pack to build a linear regression model from the training set of 392 samples, then serialize and store it.
- Model Evaluation: Use the Predictor (Regression) Snap from the ML Core Snap Pack to apply the model to the test set of the remaining 50 samples and compute the error.
- Model Hosting: Use the Predictor (Regression) Snap from the ML Core Snap Pack to host the model and expose it as an API using an Ultra Task.
- Testing the API: Build the API as a Task, then execute the Task to verify it works.
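To make the k-fold idea concrete, here is a minimal Python sketch (using scikit-learn, outside SnapLogic) of how ten folds rotate through the training and test roles. The scikit-learn dataset loader is a stand-in for the CSV file used by the pipelines.

```python
# Minimal sketch of 10-fold cross validation mechanics (illustrative,
# not the Cross Validator Snap's implementation).
from sklearn.datasets import load_diabetes
from sklearn.model_selection import KFold

X, y = load_diabetes(return_X_y=True)  # stand-in for the CSV data

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # Each fold serves as the test set exactly once; the other nine
    # folds form the training set for that round.
    print(f"fold {fold}: train={len(train_idx)} samples, test={len(test_idx)} samples")
```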
To accomplish these objectives, we are going to build four pipelines (Cross Validation, Model Building, Model Evaluation, and Model Hosting) and a Task. Each pipeline is described in the Pipelines section below.
Pipelines
Cross Validation
In this pipeline, we use the Cross Validator (Regression) Snap to perform 10-fold cross validation using the Linear Regression algorithm. The result shows that the overall mean absolute error is 44.595.
The File Reader Snap reads the data, which is in CSV format. Then, the CSV Parser Snap converts the binary data into documents. Since the fields in the documents from the CSV Parser Snap are Strings (text), we use the Type Snap to automatically derive the column types; in this case, the values are converted into either BigInteger or BigDecimal numeric types. The Cross Validator (Regression) Snap then performs 10-fold cross validation using the Linear Regression algorithm. Finally, we use the Document to Binary Snap and the File Writer Snap to save the result. In this case, the result is saved on the SnapLogic File System (SLFS), where it can be previewed or downloaded from the Manager.
The screenshot below shows that the overall mean absolute error is 44.595. You may try changing the regression algorithm in the Cross Validator (Regression) Snap to see which algorithm performs best on this dataset.
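If you want to reproduce the experiment outside SnapLogic, the scikit-learn sketch below runs 10-fold cross validation of linear regression scored by mean absolute error. It is only an approximation: the fold assignment and the regression implementation differ, so the number will not match 44.595 exactly.

```python
# Approximate the Cross Validator (Regression) Snap's experiment:
# 10-fold CV of linear regression, scored by mean absolute error.
# Illustrative only; the result will not match the Snap exactly.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

scores = cross_val_score(
    LinearRegression(), X, y,
    cv=10,
    scoring="neg_mean_absolute_error",  # scikit-learn negates error metrics
)
print("overall mean absolute error:", -scores.mean())
```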
Model Building
In this pipeline, we use the Trainer (Regression) Snap to build the model from the training set using the Linear Regression algorithm.
The File Reader Snap reads the training set containing 392 samples. Then, the CSV Parser Snap converts the binary data into documents. Since the fields in the documents from the CSV Parser Snap are Strings (text), we use the Type Snap to automatically derive the column types. The Trainer (Regression) Snap then trains the model using the Linear Regression algorithm. The model consists of two parts: metadata describing the schema of the dataset, and the serialized model itself. If the Readable option in the Trainer Snap is selected, a human-readable representation of the model is also generated. Finally, the model is saved as a JSON file on the SLFS using the JSON Formatter and File Writer Snaps.
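As a rough analogue of what this pipeline produces, the Python sketch below fits a linear regression on a 392-sample training split and stores the serialized model together with simple schema metadata in a JSON file. The file name, split seed, and artifact layout are our assumptions for illustration, not the Trainer Snap's actual format.

```python
# Illustrative analogue of Model Building: fit a linear regression on a
# training split and store the serialized model plus schema metadata.
# (Not the Trainer Snap's actual output format.)
import json
import pickle
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = load_diabetes(as_frame=True)
X, y = data.data, data.target

# Assumed split mirroring the 392-sample training set / 50-sample test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# Metadata describing the schema, plus the serialized model itself.
artifact = {
    "metadata": {"features": list(X.columns), "target": "Y"},
    "model": pickle.dumps(model).hex(),  # serialized estimator as hex text
}
with open("diabetes_model.json", "w") as f:
    json.dump(artifact, f)
```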
Model Evaluation
In this pipeline, the model generated above is tested against the test set.
The Predictor (Regression) Snap has two input views: the first accepts the test set, and the second accepts the model generated in the previous pipeline. The Snap then predicts the progression of diabetes for each test sample.
The predictions from the Predictor (Regression) Snap are merged with the actual diabetes progression values (the answers), which a Mapper Snap extracts from the $Y column of the test set. The merged result is shown in the screenshot below (lower-right corner). We then use the Aggregate Snap to compute the mean absolute error and the mean squared error, and save the result using the CSV Formatter and File Writer Snaps.
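A comparable evaluation can be sketched in Python: load the stored model, predict on the 50-sample test split, and compute the two error metrics. The artifact format and the split seed are assumptions carried over from the model-building sketch above.

```python
# Illustrative analogue of Model Evaluation: load the stored model,
# predict on the held-out 50-sample test set, and compute MAE and MSE.
import json
import pickle
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

data = load_diabetes(as_frame=True)
X, y = data.data, data.target
# Same seed as the model-building sketch, so the same 50 samples are held out.
_, X_test, _, y_test = train_test_split(X, y, test_size=50, random_state=0)

# Load the artifact written by the model-building sketch.
with open("diabetes_model.json") as f:
    artifact = json.load(f)
model = pickle.loads(bytes.fromhex(artifact["model"]))

predictions = model.predict(X_test)
print("mean absolute error:", mean_absolute_error(y_test, predictions))
print("mean squared error:", mean_squared_error(y_test, predictions))
```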
Model Hosting
This pipeline is scheduled as an Ultra Task to provide a REST API that external applications can access. The core components of this pipeline are the File Reader, JSON Parser, and Predictor (Regression) Snaps, which are the same as in the Model Evaluation pipeline. Instead of taking data from the test set, the Predictor (Regression) Snap takes the data from the API request. The Check Token Snap (Router) authenticates the request by checking the token, which can be changed in the pipeline parameters. The Extract Params Snap (Mapper) extracts the input data from the request. The Body Wrapper Snap (Mapper) maps the prediction into $content, which becomes the response body. Finally, the CORS Wrapper Snap (Mapper) adds headers to allow Cross-Origin Resource Sharing.
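To illustrate the request flow, here is a minimal Flask sketch that mirrors the same steps: check a token, extract the 10 measurements from the request, run the model, wrap the prediction in a response body, and add a CORS header. This is not the Ultra Task implementation; the endpoint path, token value, request layout, and file name are placeholders.

```python
# Minimal Flask sketch mirroring the hosting pipeline's logic
# (illustrative only; not the SnapLogic Ultra Task).
import json
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)

with open("diabetes_model.json") as f:  # artifact from the training sketch
    artifact = json.load(f)
model = pickle.loads(bytes.fromhex(artifact["model"]))
FEATURES = artifact["metadata"]["features"]
TOKEN = "my-secret-token"  # placeholder; a pipeline parameter in SnapLogic

@app.route("/predict", methods=["POST"])
def predict():
    body = request.get_json(force=True)
    # Check Token: reject requests that do not carry the expected token.
    if body.get("token") != TOKEN:
        return jsonify({"error": "invalid token"}), 401
    # Extract Params: pull the 10 measurements out of the request body.
    row = [[body["params"][name] for name in FEATURES]]
    # Predictor: run the regression model on the extracted measurements.
    prediction = float(model.predict(row)[0])
    # Body Wrapper + CORS Wrapper: wrap the prediction and allow cross-origin calls.
    response = jsonify({"content": prediction})
    response.headers["Access-Control-Allow-Origin"] = "*"
    return response

if __name__ == "__main__":
    app.run(port=8080)
```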
Testing the API
To test the API, we must first build it as a Task and then execute that Task.
Building API
To build an API from this pipeline, click the calendar icon in the toolbar. You can use either a Triggered Task or an Ultra Task.
A Triggered Task is good for batch processing since it starts a new pipeline instance for each request, while an Ultra Task is suited to providing a low-latency REST API to external applications. In this case, we use an Ultra Task. You do not need to specify a bearer token here, since the Router Snap performs authentication inside the pipeline. To see the task details, open the Manager by clicking Show tasks in this project in Manager in the Create Task window, as shown in the screenshot below.
Testing
After creating the Ultra Task, you can test it. The screenshot below shows a sample request and response. Based on the patient's 10 measurements, the pipeline returns 103.88 as the predicted diabetes progression. The expected diabetes progression for this patient is 118.
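As a hedged illustration of such a test, the snippet below posts a request with Python's requests library against the local Flask sketch from the Model Hosting section. For the real Ultra Task, substitute its URL and the pipeline's token; the measurement values shown are placeholders, not the patient from the screenshot.

```python
# Illustrative API test. URL, token, and measurement values are
# placeholders; replace them with the Ultra Task URL, the pipeline's
# token, and a real patient's 10 measurements.
import requests

URL = "http://localhost:8080/predict"  # local Flask sketch; swap in the Ultra Task URL
payload = {
    "token": "my-secret-token",        # must match the pipeline's token parameter
    "params": {
        "age": 0.02, "sex": 0.05, "bmi": 0.06, "bp": 0.02,
        "s1": -0.03, "s2": -0.02, "s3": -0.05, "s4": 0.03,
        "s5": 0.02, "s6": 0.01,        # illustrative values only
    },
}

response = requests.post(URL, json=payload)
print(response.status_code)  # expect 200 when the token is accepted
print(response.json())       # e.g. {"content": <predicted diabetes progression>}
```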