Sampleby pyspark

20/03/2021 Client: saad24vbs Deadline: 7 Days

points)

For this assignment, you are to write a Spark program to predict a Yelp user's rating based on their review text.

You can download the data files from here: https://uofi.box.com/s/b3hwy8rax1eipmiz2r78csxp06jnodci

Unzip the folder and find review.json and user.json. You can read the description of each file and its attributes here: https://www.yelp.com/dataset/documentation/main

What you need to do:

1- Data Exploration:

• Load the review.json file and extract the “text” and “stars” attributes.

• Find the distribution of the “stars” attribute; that is, find the number of reviews for each star value.
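The two exploration steps above can be sketched as follows. This is a minimal sketch, not the lab's reference solution: the helper name `star_distribution` and the default path are assumptions, and it assumes review.json holds one JSON object per line, which is what `spark.read.json` expects by default.

```python
def star_distribution(spark, path="review.json"):
    """Return (reviews, counts) for the Yelp review file.

    `spark` is an existing SparkSession; `path` is a hypothetical location
    of the unzipped review.json file.
    """
    # Keep only the two attributes the assignment asks for.
    reviews = spark.read.json(path).select("text", "stars")
    # One row per star value with the number of reviews that received it.
    counts = reviews.groupBy("stars").count().orderBy("stars")
    return reviews, counts


# Usage (requires PySpark and the data file):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("yelp-exploration").getOrCreate()
#   reviews, counts = star_distribution(spark, "review.json")
#   counts.show()
```

Returning the `reviews` DataFrame alongside the counts lets the later steps reuse it without reading the file again.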

2- Feature Engineering:

• The star ratings 1, 2, and 3 typically indicate dissatisfaction, and the star ratings 4 and 5 indicate satisfaction. Create a new column “rating” with the value 0 (if the star rating is 1, 2, or 3) or 1 (if the star rating is 4 or 5). This will be the target variable you want to predict.

• Find the distribution of the “rating” column; that is, find the count of reviews for each of rating=0 and rating=1. Is the rating attribute balanced? If not, you should down-sample your data. That means keeping all reviews for the rating value with the lowest count and taking a sample of the reviews for the other category, in order to have a balanced distribution between both classes. This is called stratified sampling, and you can accomplish it in Spark using the “sampleBy” method of the DataFrame. For an example of stratified sampling, see http://allaboutscala.com/big-data/spark/#dataframe-statistics-sampleby and https://mapr.com/blog/churn-prediction-pyspark-using-mllib-and-ml-packages/ (the section on stratified sampling).

• Unfortunately, the dataset is still too big for our tiny cluster, and running ML models on it can take a long time with our limited resources. So when down-sampling with the sampleBy method, multiply all fractions by 0.1 to get a sample of only 10% of the reviews in each rating category after down-sampling. Below are the counts I get after stratified sampling. (Depending on the seed you give to the sampleBy method, you might get different counts, but if you set the seed to 111, you should get the same distribution as mine.)

• Extract TF-IDF vectors from the review text. When creating the CountVectorizer, use setMinDF(100) to only include words in the feature vector that appear in at least 100 reviews. Make sure that you remove stop words and punctuation and use stemming as explained in the labs.
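The target column and the down-sampling fractions described in this section can be sketched as below. This is one possible reading of the instructions, not the lab's solution: the helper names `balanced_fractions` and `add_rating_and_sample` are assumptions. The smallest class gets fraction 1.0 and every other class gets smallest_count / its_count, then all fractions are multiplied by 0.1.

```python
def balanced_fractions(counts, scale=0.1):
    """Per-class fractions to pass to DataFrame.sampleBy.

    `counts` maps each rating value to its number of reviews. Before scaling,
    the smallest class is kept whole and larger classes are cut down to the
    same expected size; `scale` then shrinks every class by the same factor.
    """
    smallest = min(counts.values())
    return {label: scale * smallest / n for label, n in counts.items()}


def add_rating_and_sample(reviews, seed=111, scale=0.1):
    """Add the 0/1 "rating" column, then draw a balanced stratified sample."""
    # Imported here so balanced_fractions can be used without a Spark install.
    from pyspark.sql import functions as F

    # Stars 4 and 5 become rating 1; stars 1, 2, and 3 become rating 0.
    rated = reviews.withColumn(
        "rating", F.when(F.col("stars") >= 4, 1).otherwise(0)
    )
    rows = rated.groupBy("rating").count().collect()
    counts = {row["rating"]: row["count"] for row in rows}
    return rated.sampleBy("rating", fractions=balanced_fractions(counts, scale), seed=seed)


# Example with made-up counts: 200 reviews with rating 0, 800 with rating 1.
print(balanced_fractions({0: 200, 1: 800}))
```

Note that sampleBy samples each row independently, so the resulting class counts are close to balanced rather than exactly equal.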

3- Building Machine Learning pipelines.

• Use three different machine learning models (Logistic Regression, Random Forest, and Gradient Boosted Classification Trees) to predict the ratings based on the TF-IDF vector of the review text. For an example of Gradient Boosted Classification Trees, please refer to: https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#gradient-boosted-tree-classifier

• Use cross-validation with three folds and the Area Under Curve (AUC) metric to evaluate and tune each model’s hyper-parameters (please refer to the labs posted for this module for examples of cross-validation in Spark).

Create a separate pipeline for each ML model. Split the data into training and testing sets, fit each model on the training data, get the predictions for the test data, and print the AUC for each model.

Explain which model did a better job of predicting the ratings in the test set.
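One way to wire a single model into a cross-validated pipeline is sketched below. It is a rough sketch under several assumptions: the helper name `build_cv_pipeline`, the intermediate column names, and the 80/20 split in the usage comments are all hypothetical, and the stemming step from the labs is not reproduced here, so it would need to be added as an extra stage.

```python
def build_cv_pipeline(classifier, param_grid, num_folds=3, min_df=100):
    """Wrap TF-IDF feature stages plus `classifier` in a CrossValidator.

    `classifier` is a Spark ML classifier configured with
    featuresCol="features" and labelCol="rating"; `param_grid` comes from
    ParamGridBuilder. Returns (cross_validator, evaluator).
    """
    # Imported here so the module can be loaded without a Spark install.
    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import CountVectorizer, IDF, RegexTokenizer, StopWordsRemover
    from pyspark.ml.tuning import CrossValidator

    # Splitting on non-word characters lowercases the text and drops punctuation.
    tokenizer = RegexTokenizer(inputCol="text", outputCol="words", pattern="\\W+")
    remover = StopWordsRemover(inputCol="words", outputCol="filtered")
    # minDF=100 keeps only terms that appear in at least 100 reviews.
    counter = CountVectorizer(inputCol="filtered", outputCol="tf", minDF=min_df)
    idf = IDF(inputCol="tf", outputCol="features")

    pipeline = Pipeline(stages=[tokenizer, remover, counter, idf, classifier])
    evaluator = BinaryClassificationEvaluator(labelCol="rating", metricName="areaUnderROC")
    cv = CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=param_grid,
        evaluator=evaluator,
        numFolds=num_folds,
    )
    return cv, evaluator


# Usage (requires PySpark; `sampled` is the down-sampled DataFrame):
#   from pyspark.ml.classification import LogisticRegression
#   from pyspark.ml.tuning import ParamGridBuilder
#   lr = LogisticRegression(featuresCol="features", labelCol="rating")
#   grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
#   train, test = sampled.randomSplit([0.8, 0.2], seed=111)
#   cv, evaluator = build_cv_pipeline(lr, grid)
#   print("LR AUC:", evaluator.evaluate(cv.fit(train).transform(test)))
```

Calling the helper once per classifier gives the three separate pipelines the assignment asks for, each tuned and scored with the same evaluator.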

4- Making an ensemble of the above three models:

• Typically, an ensemble of multiple models works better than a single model. In this step, you take the predictions generated by the three models above (that is, logistic regression, random forest, and gradient boosted classification trees), zip them together, and compute a “prediction_ensembled” column, which is basically a majority vote of the three prediction columns generated by each model. That is, if two or more of the models generated the same prediction, then use that prediction in the prediction_ensembled column; otherwise, if none of the predictions are the same, then use the rating value 1 for the prediction_ensembled column. (You can accomplish this in Spark SQL using a simple CASE WHEN query; see an example here:
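The majority vote itself can be sketched as follows. This is not the example the assignment refers to: the column names `lr_prediction`, `rf_prediction`, `gbt_prediction` and the view name `predictions` are hypothetical. Because each prediction is either 0 or 1, at least two of the three always agree, so the vote reduces to checking whether the predictions sum to 2 or more.

```python
def majority_vote(lr_pred, rf_pred, gbt_pred):
    """Return the label (0 or 1) predicted by at least two of the three models."""
    votes = int(lr_pred) + int(rf_pred) + int(gbt_pred)
    return 1 if votes >= 2 else 0


# The same rule as a Spark SQL query over a temp view named "predictions"
# that holds one prediction column per model (hypothetical names).
ENSEMBLE_SQL = """
SELECT *,
       CASE WHEN lr_prediction + rf_prediction + gbt_prediction >= 2
            THEN 1.0 ELSE 0.0 END AS prediction_ensembled
FROM predictions
"""

# Usage (requires PySpark; `joined` has the three prediction columns):
#   joined.createOrReplaceTempView("predictions")
#   ensembled = spark.sql(ENSEMBLE_SQL)

print(majority_vote(1.0, 0.0, 1.0))
```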
