Kickstarter Project Success Prediction

Problem Scenario

Crowdfunding platforms are now a popular way for innovators and people with creative ideas to raise funds from the public. On Kickstarter alone, one of the most popular online crowdfunding platforms, more than 14,000,000 backers (people who invest in projects) have funded almost 150,000 projects. At any given moment, almost 4,000 projects are live and open to funding. Many crowdfunding projects succeed and many fail, and the outcome depends on many factors. It would be valuable to find a way to estimate and improve the chance of success for future projects.

Description

Out of curiosity, we want to use Machine Learning algorithms to predict whether a project will succeed or fail. If the prediction works, it should also help us find ways to improve the success rate of future projects. We chose Kickstarter because of its large number of projects spanning several years. Many open datasets are available on the internet, and we got one from here. This dataset contains over 300,000 projects; however, it only contains general information about each project: title, category, currency, country, goal, pledge, important dates, and state. Much more information could be added to improve accuracy, such as description, keywords, activities, competitors, patents, team, and company reputation. For demonstration purposes, we only consider 20,000 projects. The screenshot below shows a preview of this dataset.

Objectives

  1. Profiling: Use Profile Snap from ML Analytics Snap Pack to get statistics of this dataset.
  2. Data Preparation: Perform data preparation on this dataset using Snaps in ML Data Preparation Snap Pack.
  3. AutoML: Use AutoML Snap from ML Core Snap Pack to build models and pick the one with the best performance.
  4. Cross Validation: Use Cross Validator (Classification) Snap from ML Core Snap Pack to perform 10-fold cross validation on various Machine Learning algorithms. The results show how accurately each algorithm predicts project success.
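
To make the idea of 10-fold cross validation in the last objective concrete, here is a minimal sketch in plain Python. It is not how the Cross Validator (Classification) Snap is implemented; the function names, the toy majority-class "model", and the sample labels are all assumptions made for illustration.

```python
import random

def k_fold_indices(n_rows, k=10, seed=0):
    """Shuffle row indices and split them into k nearly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def majority_class(labels):
    """Toy stand-in for a model: always predict the most frequent training label."""
    return max(set(labels), key=labels.count)

def cross_validate(labels, k=10):
    """Return the mean accuracy of the toy model over k held-out folds."""
    folds = k_fold_indices(len(labels), k)
    accuracies = []
    for held_out in folds:
        held_out_set = set(held_out)
        # Train on every row that is not in the held-out fold.
        train = [labels[i] for i in range(len(labels)) if i not in held_out_set]
        prediction = majority_class(train)
        correct = sum(1 for i in held_out if labels[i] == prediction)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k

# Hypothetical labels: 60 failed and 40 successful projects.
labels = ["failed"] * 60 + ["successful"] * 40
mean_accuracy = cross_validate(labels, k=10)
```

Each row is used for evaluation exactly once and for training in the other nine rounds, which is why the averaged accuracy is a fairer estimate than a single train/test split.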

We are going to build 5 Pipelines: Profiling, Data Preparation, Data Modelling and 2 Pipelines for Cross Validation with various algorithms. Each of these Pipelines is described in the Pipelines section below.

Pipelines

Profiling

To get useful statistics, we first need to transform the data slightly.

We use the first Mapper Snap (Select Fields) to select and rename fields.

Then, we use the Type Converter Snap to automatically derive the data types.
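
As a rough analogy for automatic type derivation, the sketch below tries to read each string value as an integer, then as a float, and otherwise keeps it as text. This is an assumption about the general idea only, not the Type Converter Snap's actual rules; the helper name and the sample row are hypothetical.

```python
def infer_value(text):
    """Convert a string to int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text

# Hypothetical row as it might arrive from a CSV file: every value is a string.
row = {"goal": "5000", "pledged": "1234.5", "state": "failed"}
typed_row = {name: infer_value(value) for name, value in row.items()}
```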

Since we only focus on successful and failed projects, we use the Filter Snap to remove live and canceled projects, along with projects in any other state.
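
The filtering step can be pictured with a short Python sketch. The field name `state` comes from the dataset description; the exact state strings and the sample records are assumptions.

```python
# States we keep for modelling (exact spellings are assumed).
KEPT_STATES = {"successful", "failed"}

def keep_finished(projects):
    """Keep only projects whose outcome is known: successful or failed."""
    return [p for p in projects if p.get("state") in KEPT_STATES]

# Hypothetical sample records.
projects = [
    {"name": "A", "state": "successful"},
    {"name": "B", "state": "live"},
    {"name": "C", "state": "failed"},
    {"name": "D", "state": "canceled"},
]
finished = keep_finished(projects)
```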

The Date Time Extractor Snap converts the launch date and the deadline to epoch values, which are then used to compute the duration in the Mapper Snap (Compute Duration). With Pass through enabled in the Mapper Snap, all input fields are sent to the output view along with $duration. However, we drop $deadline here.
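
The duration computation can be sketched in Python as follows. The field name `launched`, the date format, the UTC time zone, and the use of milliseconds are all assumptions; only `deadline` and `duration` are named in the text above.

```python
from datetime import datetime, timezone

def to_epoch_ms(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a date string (assumed to be UTC) into epoch milliseconds."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)

def add_duration(project):
    """Pass all fields through, add 'duration', and drop 'deadline'."""
    launched = to_epoch_ms(project["launched"])
    deadline = to_epoch_ms(project["deadline"])
    result = {k: v for k, v in project.items() if k != "deadline"}
    result["launched"] = launched
    result["duration"] = deadline - launched
    return result

# Hypothetical record with a 30-day campaign.
project = {
    "name": "A",
    "launched": "2020-01-01 00:00:00",
    "deadline": "2020-01-31 00:00:00",
}
with_duration = add_duration(project)
```

Dropping the deadline after the subtraction avoids keeping two fields that carry the same information, since the deadline can be recovered from the launch time plus the duration.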

At this point, the dataset is ready to be fed into the Profile Snap.

Finally, we use the Profile Snap to compute statistics and save them on SLFS in JSON format.
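
For a sense of what a profiling step produces, here is a small Python sketch that computes per-field statistics and serializes them as JSON. The choice of statistics and the output layout are assumptions and do not reflect the Profile Snap's actual output schema; the sample rows are hypothetical.

```python
import json
import statistics
from collections import Counter

def profile(rows):
    """Compute simple per-field statistics for a list of dict records."""
    report = {}
    fields = {name for row in rows for name in row}
    for name in sorted(fields):
        values = [row[name] for row in rows if row.get(name) is not None]
        numeric = [
            v for v in values
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        if values and len(numeric) == len(values):
            # Numeric field: report range and mean.
            report[name] = {
                "count": len(values),
                "min": min(numeric),
                "max": max(numeric),
                "mean": statistics.mean(numeric),
            }
        else:
            # Categorical field: report the most frequent values.
            report[name] = {
                "count": len(values),
                "top_values": Counter(map(str, values)).most_common(3),
            }
    return report

# Hypothetical prepared rows.
rows = [
    {"goal": 1000, "state": "failed"},
    {"goal": 3000, "state": "successful"},
    {"goal": 2000, "state": "failed"},
]
report_json = json.dumps(profile(rows), indent=2)
```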