Sampleby pyspark

20/03/2021 Client: saad24vbs Deadline: 7 Days

points)

For this assignment, you are to write a Spark program to predict Yelp user ratings based on their review text.

You can download the data files from here: https://uofi.box.com/s/b3hwy8rax1eipmiz2r78csxp06jnodci . Unzip the folder and find review.json and user.json. You can read the description of each file and its attributes here: https://www.yelp.com/dataset/documentation/main

What you need to do:

1- Data Exploration:

• Load the review.json file and extract the “text” and “stars” attributes.

• Find the distribution of the “stars” attribute; that is, find the number of reviews for each star value.
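The two exploration steps could be sketched in PySpark roughly as follows. This is a minimal sketch under stated assumptions, not the solution from the labs: the helper names `load_reviews`, `rows_to_distribution`, and `star_distribution` are hypothetical, `spark` is assumed to be an existing SparkSession, and the default path assumes review.json sits in the working directory.

```python
def load_reviews(spark, path="review.json"):
    """Read the review file and keep only the two columns the assignment needs.

    spark.read.json expects one JSON object per line by default.
    The default path is an assumption; point it at your unzipped folder.
    """
    return spark.read.json(path).select("text", "stars")


def rows_to_distribution(pairs):
    """Turn (star_value, review_count) pairs into a dict ordered by star value."""
    return {stars: count for stars, count in sorted(pairs)}


def star_distribution(reviews):
    """Count reviews per star value on the cluster, then collect the few result rows."""
    rows = reviews.groupBy("stars").count().collect()
    return rows_to_distribution((row["stars"], row["count"]) for row in rows)
```

With a session available, `star_distribution(load_reviews(spark))` returns a small dictionary keyed by star value. Collecting is safe here because the grouped result has at most one row per star value.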

2- Feature Engineering:

• Star ratings of 1, 2, and 3 typically indicate dissatisfaction, and star ratings of 4 and 5 indicate satisfaction. Create a new column “rating” with values 0 (if the star rating is 1, 2, or 3) and 1 (if the star rating is 4 or 5). This will be the target variable you want to predict.

• Find the distribution of the “rating” column; that is, find the count of reviews for rating=0 and rating=1. Is the rating attribute balanced? If not, you should down-sample your data. That means keeping all reviews for the rating value with the lowest count and taking a sample of the reviews for the other category, so that both classes have a balanced distribution. This is called stratified sampling, and you can accomplish it in Spark using the “sampleBy” method of the DataFrame. For an example of stratified sampling, see http://allaboutscala.com/big-data/spark/#dataframe-statistics-sampleby and https://mapr.com/blog/churn-prediction-pyspark-using-mllib-and-ml-packages/ (the section on stratified sampling).

• Unfortunately, the dataset is still too big for our tiny cluster, and running ML models on it can take a long time with our limited resources. So when down-sampling with the sampleBy method, multiply all fractions by 0.1 to get a sample of only 10% of the reviews in each rating category after down-sampling. Below are the counts I get after stratified sampling (depending on the seed you give to the sampleBy method, you might get different counts, but if you set the seed to 111, you should get the same distribution as mine).

• Extract TF-IDF vectors from the review text. When creating the CountVectorizer, use setMinDF(100) to include only words that appear in at least 100 reviews in the feature vector. Make sure that you remove stop words and punctuation and use stemming, as explained in the labs.
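The feature-engineering steps above might be sketched as follows. This is a hedged illustration rather than the labs' code: every function name here (`to_rating`, `balanced_fractions`, `add_rating_and_downsample`, `tfidf_stages`) and the intermediate column names `tokens`, `filtered`, and `tf` are hypothetical. The fraction formula is one reasonable reading of the instructions: the smallest class gets a fraction of 0.1, and each larger class is scaled down to about the same size.

```python
def to_rating(stars):
    """Label rule from the assignment: 4 or 5 stars -> 1, otherwise 0."""
    return 1 if stars >= 4 else 0


def balanced_fractions(counts, scale=0.1):
    """Per-class sampling fractions for sampleBy.

    counts maps each rating value to its review count. The smallest class is
    sampled at `scale`; larger classes are shrunk to roughly the same size.
    """
    smallest = min(counts.values())
    return {label: scale * smallest / n for label, n in counts.items()}


def add_rating_and_downsample(reviews, seed=111):
    """Add the 0/1 rating column, then draw the balanced 10% sample."""
    from pyspark.sql import functions as F  # lazy import; needs PySpark installed

    labeled = reviews.withColumn(
        "rating", F.when(F.col("stars") >= 4, 1).otherwise(0)
    )
    counts = {row["rating"]: row["count"]
              for row in labeled.groupBy("rating").count().collect()}
    return labeled.sampleBy("rating", fractions=balanced_fractions(counts), seed=seed)


def tfidf_stages(min_df=100):
    """Tokenize, drop stop words, count terms, and weight by IDF.

    Stemming is left out: Spark ML has no built-in stemmer, so that step
    depends on the approach shown in the labs.
    """
    from pyspark.ml.feature import (CountVectorizer, IDF, RegexTokenizer,
                                    StopWordsRemover)

    return [
        # Splitting on non-word characters also discards punctuation.
        RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+"),
        StopWordsRemover(inputCol="tokens", outputCol="filtered"),
        CountVectorizer(inputCol="filtered", outputCol="tf", minDF=float(min_df)),
        IDF(inputCol="tf", outputCol="features"),
    ]
```

For example, with 200 reviews in one class and 800 in the other, `balanced_fractions` gives 0.1 and 0.025. Note that sampleBy samples each row independently, so the resulting class counts are only approximately equal.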

3- Building Machine Learning pipelines.

• Use three different machine learning models (Logistic Regression, Random Forest, and Gradient Boosted Classification Trees) to predict the ratings based on the TF-IDF vector of the review text. For an example of Gradient Boosted Classification Trees, please refer to: https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#gradient-boosted-tree-classifier

• Use cross-validation with three folds and the Area Under Curve (AUC) metric to evaluate and tune each model’s hyper-parameters (please refer to the labs posted for this module for examples of cross-validation in Spark).

Create a separate pipeline for each ML model. Split the data into training and testing sets, fit each model on the training data, get the predictions for the test data, and print the AUC for each model.

Explain which model did a better job of predicting the ratings in the test set.
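One way to wire up a pipeline with three-fold cross-validation on AUC is sketched below for logistic regression only. It is an illustration under assumptions, not the labs' code: the function names are hypothetical, `feature_stages` is assumed to be a list of stages that produces a `features` column, the label column is assumed to be `rating`, and the 80/20 split and the `regParam` grid values are placeholders rather than recommendations.

```python
def auc_evaluator():
    """Evaluator that scores predictions by area under the ROC curve."""
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    return BinaryClassificationEvaluator(labelCol="rating",
                                         metricName="areaUnderROC")


def build_lr_cross_validator(feature_stages, evaluator, num_folds=3, seed=111):
    """Feature stages + logistic regression, wrapped in k-fold cross-validation."""
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    lr = LogisticRegression(featuresCol="features", labelCol="rating")
    pipeline = Pipeline(stages=list(feature_stages) + [lr])
    # Placeholder grid: the values to search are an assumption.
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
    return CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                          evaluator=evaluator, numFolds=num_folds, seed=seed)


def fit_and_score(cross_validator, evaluator, data, seed=111):
    """Split the data, fit on the training part, and return the test-set AUC."""
    train, test = data.randomSplit([0.8, 0.2], seed=seed)  # split ratio assumed
    model = cross_validator.fit(train)
    return evaluator.evaluate(model.transform(test))


def best_model(auc_by_model):
    """Name of the model with the highest test AUC."""
    return max(auc_by_model, key=auc_by_model.get)
```

The same pattern should carry over to `RandomForestClassifier` and `GBTClassifier` from `pyspark.ml.classification` by swapping the estimator and the grid. Collecting each model's test AUC in a dictionary and passing it to `best_model` gives a starting point for the written comparison.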

4- Making an ensemble of the above three models:

• Typically, an ensemble of multiple models works better than a single model. In this step, you take the predictions generated by the three models above (that is, logistic regression, random forest, and gradient boosted classification trees), zip them together, and compute a “prediction_ensembled” column, which is basically a majority vote of the three prediction columns generated by the models. That is, if two or more of the models generated the same prediction, then use that prediction in the prediction_ensembled column; otherwise, if none of the predictions are the same, then use the rating value 1 for the prediction_ensembled column. (You can accomplish this in Spark SQL using a simple CASE WHEN query, see an example here:
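The majority vote described above can be sketched as a CASE WHEN query. This is an assumption-laden illustration, not the example the assignment links to: the column names `lr_prediction`, `rf_prediction`, and `gbt_prediction`, the view name `predictions`, and the helper names are all hypothetical. The plain-Python `majority_vote` mirrors the SQL branch by branch so the rule can be checked without a cluster.

```python
ENSEMBLE_SQL = """
SELECT *,
       CASE
           WHEN lr_prediction = rf_prediction  THEN lr_prediction
           WHEN lr_prediction = gbt_prediction THEN lr_prediction
           WHEN rf_prediction = gbt_prediction THEN rf_prediction
           ELSE 1
       END AS prediction_ensembled
FROM predictions
"""


def majority_vote(lr_pred, rf_pred, gbt_pred):
    """Same rule as the CASE expression: any agreeing pair wins, else fall back to 1."""
    if lr_pred == rf_pred or lr_pred == gbt_pred:
        return lr_pred
    if rf_pred == gbt_pred:
        return rf_pred
    return 1


def add_ensemble_column(spark, predictions):
    """Register the joined predictions as a temp view and run the query.

    `predictions` is assumed to hold the three prediction columns side by side.
    """
    predictions.createOrReplaceTempView("predictions")
    return spark.sql(ENSEMBLE_SQL)
```

Because each model predicts either 0 or 1, at least two of the three predictions always agree, so the fallback branch should never fire in practice; it is kept only to match the rule as stated.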
