This pipeline contains the following Snaps:
- JSON Generator: Contains an array of tokens created using the Yelp dataset. For details on how you can create such an array, see the example in the Tokenizer documentation.
- Mapper: Enables you to retain only the text part of the input document, indicated in the Snap Settings as $text.
- File Reader: Picks up and reads the contents of the yelp_common_words.json file, which contains the top 100 common words in the Yelp dataset. For details on how you can create this file, review the example in the Tokenizer documentation.
- JSON Parser: Parses the input JSON data and offers documents as input to the Bag of Words Snap.
- Bag of Words: Generates a document that reports how often each of the common words (received from the File Reader and JSON Parser Snaps) appears in the tokenized sentences (received from the JSON Generator and Mapper Snaps).
The JSON Generator Snap contains an array of tokens created using an extract of the Yelp dataset (you can review the entire dataset here) and makes it available to the Mapper Snap as a document. If you do not have an array of tokens, you can create one using the Tokenizer Snap.
The Mapper Snap picks up the tokenized data in the input document coming from the JSON Generator Snap and maps it to the field $text, making all the relevant data accessible to the Bag of Words Snap.
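As a rough illustration of what this mapping achieves, the Python sketch below keeps only the text field of each incoming document. It is not the Snap's implementation: the helper name `keep_text`, the sample documents, and the extra `stars` field are all hypothetical.

```python
def keep_text(document):
    """Return a new document that contains only the 'text' field."""
    return {"text": document["text"]}


# Hypothetical input documents; 'stars' is a made-up extra field
# that the mapping is meant to drop.
incoming = [
    {"text": ["great", "food"], "stars": 5},
    {"text": ["slow", "service"], "stars": 2},
]

mapped = [keep_text(doc) for doc in incoming]
print(mapped)
```

Each resulting document carries only the array of tokens, which is the shape the downstream step expects in this example.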
The Bag of Words Snap is configured to pick up the input data coming in from the Mapper Snap.
The Bag of Words Snap runs the Mapper output (the array of tokenized words in each sentence) against the JSON Parser output (the array listing the frequencies of the 100 most common words in the same dataset) and creates a document detailing how often each common word appears in each sentence.
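The counting step described above can be sketched in plain Python. This is a minimal illustration of the general bag-of-words technique, not the Snap's actual implementation or output format: the function name `bag_of_words`, the four-word vocabulary, and the sample sentences are assumptions made for the example.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one tokenized sentence."""
    counts = Counter(tokens)
    # Tokens outside the vocabulary are ignored; vocabulary words that do
    # not occur in the sentence get a count of 0.
    return {word: counts[word] for word in vocabulary}


# Hypothetical stand-in for the common-words input (the real file is
# described as holding the top 100 words).
common_words = ["the", "food", "was", "good"]

# Hypothetical stand-in for the tokenized sentences coming from the Mapper.
sentences = [
    {"text": ["the", "food", "was", "good", "and", "the", "service", "was", "fast"]},
    {"text": ["great", "food"]},
]

vectors = [bag_of_words(doc["text"], common_words) for doc in sentences]
for vector in vectors:
    print(vector)
```

Every sentence yields one entry per vocabulary word, so all output documents share the same keys regardless of sentence length.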
Download this pipeline.