For this assignment, you are to write a Spark program to predict Yelp user ratings based on their review text.
You can download the data files from here:
https://uofi.box.com/s/b3hwy8rax1eipmiz2r78csxp06jnodci . Unzip the folder and find review.json
and user.json . You can read the description of each file and their attributes here:
https://www.yelp.com/dataset/documentation/main
What you need to do:
1- Data Exploration:
Load the review.json file and extract the “text” and “stars” attributes.
Find the distribution of the “stars” attribute; that is, find the number of reviews for
each star value.
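In Spark this distribution is a one-line groupBy (df.groupBy("stars").count()); the underlying counting logic can be sketched in plain Python with a Counter. The star values below are made up purely for illustration:

```python
from collections import Counter

# Toy star ratings standing in for the "stars" column of review.json
stars = [5, 4, 1, 5, 3, 2, 5, 4, 1, 3]

# Count how many reviews fall under each star value --
# the same result df.groupBy("stars").count() would give in Spark
distribution = Counter(stars)

for star, count in sorted(distribution.items()):
    print(star, count)
```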
2- Feature Engineering:
The star ratings 1, 2, and 3 typically indicate dissatisfaction, and the star ratings 4
and 5 indicate satisfaction. Create a new column “rating” with value 0 (if the star
rating is 1, 2, or 3) and 1 (if the star rating is 4 or 5). This will be the target variable
you want to predict.
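The binarization rule is sketched below in plain Python; in a Spark DataFrame this would typically be expressed with when(col("stars") <= 3, 0).otherwise(1):

```python
def star_to_rating(stars: int) -> int:
    # 1, 2, 3 -> 0 (dissatisfied); 4, 5 -> 1 (satisfied)
    return 0 if stars <= 3 else 1

print([star_to_rating(s) for s in [1, 2, 3, 4, 5]])  # [0, 0, 0, 1, 1]
```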
Find the distribution of the “rating” column; that is, find the count of reviews for
rating=0 and rating=1. Is the rating attribute balanced? If not, you should
down-sample your data. That means, keep the rating value with the lowest
count but take a sample of the reviews for the other category in order to have a
balanced distribution between both classes. This is called stratified sampling, and
you can accomplish it in Spark using the “sampleBy” method of DataFrame.
For an example of stratified sampling, see here:
http://allaboutscala.com/big-data/spark/#dataframe-statistics-sampleby and here:
https://mapr.com/blog/churn-prediction-pyspark-using-mllib-and-ml-packages/
(the section on stratified sampling)
Unfortunately, the dataset is still too big for our tiny cluster and running ML
models on it can take a long time with our limited resources. So when doing
down-sampling using the sampleBy method, multiply all fractions by 0.1 to get a
sample of only 10% of reviews in each rating category after down-sampling.
Below are the counts I get after stratified sampling. (Depending on the seed you
give to the sampleBy method, you might get different counts, but if you set the
seed to 111, you should get the same distribution as mine.)
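One way to derive the fractions dictionary that sampleBy expects is sketched below in plain Python. The per-rating counts are hypothetical (not the actual Yelp counts), and the 0.1 multiplier implements the 10% shrink described above:

```python
# Hypothetical per-rating counts (rating -> number of reviews)
counts = {0: 2_000_000, 1: 4_500_000}

# Down-sample every class toward the minority-class count,
# then multiply each fraction by 0.1 to keep only 10% overall
min_count = min(counts.values())
fractions = {rating: (min_count / count) * 0.1 for rating, count in counts.items()}

print(fractions)

# In Spark this dictionary would then be used as:
#   balanced = df.sampleBy("rating", fractions, seed=111)
```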
Extract TFIDF vectors from the review text. When creating the CountVectorizer, use
setMinDF(100) to only include words in the feature vector that appear in at
least 100 reviews. Make sure that you remove stop words and punctuation and
use stemming as explained in the labs.
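To make the feature-extraction step concrete, here is a toy TF-IDF computation in plain Python. It mirrors what Spark's CountVectorizer (raw term counts) followed by the IDF estimator produce, using Spark's smoothed IDF formula log((N+1)/(df+1)); the documents and vocabulary are made up for illustration:

```python
import math
from collections import Counter

# Three toy tokenized reviews (already lowercased, stop words removed)
docs = [
    ["great", "food", "great", "service"],
    ["bad", "food"],
    ["great", "place"],
]

n_docs = len(docs)
vocab = sorted({w for doc in docs for w in doc})

# Document frequency: number of documents containing each term
df = {w: sum(1 for doc in docs if w in doc) for w in vocab}

# Smoothed IDF, matching Spark's IDF estimator: log((N+1)/(df+1))
idf = {w: math.log((n_docs + 1) / (df[w] + 1)) for w in vocab}

# TF-IDF vector per document: raw term count * idf weight
tfidf = []
for doc in docs:
    tf = Counter(doc)
    tfidf.append([tf[w] * idf[w] for w in vocab])

print(vocab)
print(tfidf[0])
```

In the assignment itself these vectors come out of the pipeline stages (Tokenizer, StopWordsRemover, CountVectorizer, IDF), with minDF=100 filtering out rare words.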
3- Building Machine Learning Pipelines:
Use three different machine learning models (Logistic Regression, Random
Forest, and Gradient Boosted Classification Trees) to predict the ratings based
on the TFIDF vector of the review text. For an example of Gradient Boosted
Classification Trees, please refer to:
https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#gradient-boosted-tree-classifier
Use CrossValidation with three folds and the Area Under Curve (AUC) metric to
evaluate and tune each model’s hyper-parameters (please refer to the labs
posted for this module for examples of cross validation in Spark).
Create a separate pipeline for each ML model. Split the data into training and
testing sets, fit each model on the training data, get the predictions for the test
data, and print the AUC for each model.
Explain which model did a better job of predicting the ratings in the test set.
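When comparing the three models, it helps to remember what AUC measures: the probability that a randomly chosen positive example (rating = 1) receives a higher score than a randomly chosen negative one. A brute-force sketch of that definition on made-up scores (Spark's BinaryClassificationEvaluator computes the same quantity far more efficiently):

```python
def auc(labels, scores):
    # Probability that a random positive outranks a random negative;
    # ties count as half, matching the rank-based AUC definition
    pos = [s for lbl, s in zip(labels, scores) if lbl == 1]
    neg = [s for lbl, s in zip(labels, scores) if lbl == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores for six test reviews
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.2, 0.8, 0.3]

print(auc(labels, scores))
```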
4- Making an ensemble of the above three models:
Typically, an ensemble of multiple models works better than a single model. In
this step, you take the predictions generated by the three models above (that is,
logistic regression, random forest, and gradient boosted classification trees),
zip them together, and compute a “prediction_ensembled” column which is
basically a majority vote of the three prediction columns generated by each
model. That is, if two or more of the models generated the same prediction,
then use that prediction in the prediction_ensembled column; otherwise, if
none of the predictions are the same, then use the rating value 1 for the
prediction_ensembled column. (You can accomplish this in Spark SQL using a
simple CASE WHEN query, see an example here:
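The majority-vote rule can be sketched in plain Python on toy prediction values; the Spark SQL equivalent is a CASE WHEN over the three prediction columns:

```python
def ensemble(p_lr, p_rf, p_gbt):
    # Majority vote over the three model predictions; if no two
    # predictions agree, fall back to rating value 1 as specified
    votes = [p_lr, p_rf, p_gbt]
    for candidate in set(votes):
        if votes.count(candidate) >= 2:
            return candidate
    return 1

# Zip the three prediction columns together row by row (toy values)
preds = list(zip([1, 0, 1], [1, 0, 0], [0, 1, 0]))
print([ensemble(*row) for row in preds])  # [1, 0, 0]
```

In Spark SQL the same rule reads roughly: CASE WHEN p1 = p2 OR p1 = p3 THEN p1 WHEN p2 = p3 THEN p2 ELSE 1 END.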