Sampleby pyspark

20/03/2021 Client: saad24vbs Deadline: 7 Days


For this assignment, you are to write a Spark program to predict Yelp user ratings based on their review text.

You can download the data files from here: https://uofi.box.com/s/b3hwy8rax1eipmiz2r78csxp06jnodci . Unzip the folder and find review.json and user.json. You can read the description of each file and its attributes here: https://www.yelp.com/dataset/documentation/main

What you need to do:

1- Data Exploration:

• Load the review.json file and extract the “text” and “stars” attributes.

• Find the distribution of the “stars” attribute; that is, find the number of reviews for each star value.
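The two exploration steps could be sketched in PySpark roughly as follows. This is a minimal sketch rather than the lab's reference solution: the function names `load_reviews` and `star_distribution` are hypothetical, `spark` is assumed to be an existing SparkSession, and the default path assumes review.json sits in the working directory.

```python
def load_reviews(spark, path="review.json"):
    """Read the JSON-lines review file and keep only the two needed columns."""
    # `spark` is assumed to be an already-created SparkSession.
    return spark.read.json(path).select("text", "stars")


def star_distribution(reviews):
    """Return one row per star value with its review count, sorted by stars."""
    return reviews.groupBy("stars").count().orderBy("stars")
```

Calling `star_distribution(load_reviews(spark)).show()` would then print the number of reviews for each star value.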

2- Feature Engineering:

• The star ratings 1, 2, and 3 typically indicate dissatisfaction, and the star ratings 4 and 5 indicate satisfaction. Create a new column “rating” with values 0 (if the star rating is 1, 2, or 3) and 1 (if the star rating is 4 or 5). This will be the target variable you want to predict.
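One way to express this rule is shown below, first as a plain-Python function and then as a Spark column expression. The names `stars_to_rating` and `add_rating_column` are hypothetical, and the sketch assumes the DataFrame has a numeric “stars” column.

```python
def stars_to_rating(stars):
    """Plain-Python version of the rule: 4 or 5 stars -> 1, anything lower -> 0."""
    return 1 if stars >= 4 else 0


def add_rating_column(reviews):
    """Add the binary "rating" target column to a Spark DataFrame of reviews."""
    # Imported lazily so the pure-Python helper above works without Spark installed.
    from pyspark.sql import functions as F

    return reviews.withColumn("rating", F.when(F.col("stars") >= 4, 1).otherwise(0))
```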

• Find the distribution of the “rating” column; that is, find the count of reviews for each of rating=0 and rating=1. Is the rating attribute balanced? If not, you should down-sample your data. That means keeping all reviews for the rating value with the lowest count and taking a sample of the reviews for the other category, so that both classes end up with a balanced distribution. This is called stratified sampling, and you can accomplish it in Spark using the “sampleBy” method of the DataFrame. For an example of stratified sampling, see http://allaboutscala.com/big-data/spark/#dataframe-statistics-sampleby and https://mapr.com/blog/churn-prediction-pyspark-using-mllib-and-ml-packages/ (the section on stratified sampling).

• Unfortunately, the dataset is still too big for our tiny cluster, and running ML models on it can take a long time with our limited resources. So when down-sampling with the sampleBy method, multiply all fractions by 0.1 to get a sample of only 10% of the reviews in each rating category after down-sampling. Below are the counts I get after stratified sampling. (Depending on the seed you give to the sampleBy method, you might get different counts, but if you set the seed to 111, you should get the same distribution as mine.)
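The fractions passed to sampleBy can be derived from the class counts. The sketch below is one possible reading of the instructions: the smaller class gets fraction 1.0 and the larger class gets smallest/count, and both are then multiplied by 0.1. The helper names `balanced_fractions` and `downsample` are hypothetical, and the DataFrame is assumed to already contain the “rating” column.

```python
def balanced_fractions(class_counts, scale=0.1):
    """Map each class label to its sampleBy fraction.

    Before scaling, the smallest class keeps everything (1.0) and every other
    class is cut down to the size of the smallest one.
    """
    smallest = min(class_counts.values())
    return {label: (smallest / count) * scale for label, count in class_counts.items()}


def downsample(reviews, seed=111, scale=0.1):
    """Balance the "rating" classes of a Spark DataFrame and keep about 10% of rows."""
    rows = reviews.groupBy("rating").count().collect()
    counts = {row["rating"]: row["count"] for row in rows}
    return reviews.sampleBy("rating", fractions=balanced_fractions(counts, scale), seed=seed)
```

For example, with 200 reviews of rating 0 and 800 of rating 1, the fractions come out as 0.1 and 0.025. Because sampleBy draws each row independently, the resulting class sizes are only approximately equal.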

• Extract TF-IDF vectors from the review text. When creating the CountVectorizer, use setMinDF(100) so that the feature vector includes only words that appear in at least 100 reviews. Make sure that you remove stop words and punctuation and use stemming, as explained in the labs.
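A possible set of feature stages is sketched below, assuming the standard `pyspark.ml.feature` transformers. The column names (“tokens”, “filtered”, “tf”, “features”) and the helper names are hypothetical. Spark ML has no built-in stemmer and the labs' stemming approach is not reproduced here, so that step is only marked with a comment.

```python
import re

# Split on runs of non-word characters, so punctuation never reaches the tokens.
TOKEN_PATTERN = r"\W+"


def build_tfidf_stages(min_df=100):
    """Return the pipeline stages that turn the "text" column into TF-IDF "features"."""
    # Imported lazily; constructing these stages requires an active SparkSession.
    from pyspark.ml.feature import IDF, CountVectorizer, RegexTokenizer, StopWordsRemover

    tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern=TOKEN_PATTERN)
    remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
    # A stemming stage (as covered in the labs) would be inserted here.
    counter = CountVectorizer(inputCol="filtered", outputCol="tf").setMinDF(float(min_df))
    idf = IDF(inputCol="tf", outputCol="features")
    return [tokenizer, remover, counter, idf]


def preview_tokens(text):
    """Pure-Python preview of what TOKEN_PATTERN does to one lowercased review."""
    return [tok for tok in re.split(TOKEN_PATTERN, text.lower()) if tok]
```

`preview_tokens` exists only to check the regular expression locally; on the cluster the RegexTokenizer stage does the same splitting and lowercasing.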

3- Building Machine Learning Pipelines:

• Use three different machine learning models (Logistic Regression, Random Forest, and Gradient Boosted Classification Trees) to predict the ratings based on the TF-IDF vector of the review text. For an example of Gradient Boosted Classification Trees, please refer to: https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#gradient-boosted-tree-classifier

• Use cross-validation with three folds and the Area Under Curve (AUC) metric to evaluate and tune each model’s hyperparameters (please refer to the labs posted for this module for examples of cross-validation in Spark).

Create a separate pipeline for each ML model. Split the data into training and testing sets, fit each model on the training data, get the predictions for the test data, and print the AUC for each model.

Explain which model did the best job of predicting the ratings in the test set.
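The per-model pipelines with three-fold cross-validation might be set up along these lines. This is a sketch under several assumptions: the helper names `build_cv` and `fit_and_score` are hypothetical, the hyperparameter grids are illustrative values and not ones prescribed by the assignment, the feature stages are assumed to end in a “features” column, and the “rating” label is assumed to be numeric.

```python
MODEL_NAMES = ("lr", "rf", "gbt")


def build_cv(model_name, feature_stages, label_col="rating", seed=111):
    """Build a 3-fold CrossValidator around feature stages plus one classifier."""
    if model_name not in MODEL_NAMES:
        raise ValueError(f"unknown model name: {model_name!r}")

    # Imported lazily; these classes need an active SparkSession when constructed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import (
        GBTClassifier, LogisticRegression, RandomForestClassifier)
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    # The grid values below are illustrative only.
    if model_name == "lr":
        clf = LogisticRegression(featuresCol="features", labelCol=label_col)
        grid = ParamGridBuilder().addGrid(clf.regParam, [0.01, 0.1]).build()
    elif model_name == "rf":
        clf = RandomForestClassifier(featuresCol="features", labelCol=label_col, seed=seed)
        grid = ParamGridBuilder().addGrid(clf.numTrees, [20, 50]).build()
    else:
        clf = GBTClassifier(featuresCol="features", labelCol=label_col, seed=seed)
        grid = ParamGridBuilder().addGrid(clf.maxIter, [10, 20]).build()

    evaluator = BinaryClassificationEvaluator(labelCol=label_col, metricName="areaUnderROC")
    pipeline = Pipeline(stages=list(feature_stages) + [clf])
    return CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                          evaluator=evaluator, numFolds=3, seed=seed)


def fit_and_score(cv, train, test):
    """Fit the cross-validator on the training set and return the AUC on the test set."""
    model = cv.fit(train)
    return cv.getEvaluator().evaluate(model.transform(test))
```

The assignment does not give a split ratio; something like `train, test = data.randomSplit([0.8, 0.2], seed=111)` is one common choice, after which `fit_and_score(build_cv(name, stages), train, test)` can be printed for each name in `MODEL_NAMES`.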

4- Making an ensemble of the above three models:

• Typically, an ensemble of multiple models works better than a single model. In this step, you take the predictions generated by the three models above (that is, logistic regression, random forest, and gradient boosted classification trees), zip them together, and compute a “prediction_ensembled” column, which is basically a majority vote of the three prediction columns generated by the models. That is, if two or more of the models generated the same prediction, then use that prediction in the prediction_ensembled column; otherwise, if none of the predictions are the same, then use the rating value 1 for the prediction_ensembled column. (You can accomplish this in Spark SQL using a simple CASE WHEN query, see an example here:
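The linked example is not included in this copy of the handout. As a stand-in, the majority-vote rule can be sketched as follows, first in plain Python and then as a Spark SQL CASE WHEN query. The names `majority_vote` and `ENSEMBLE_SQL`, the column names `pred_lr`, `pred_rf`, and `pred_gbt`, and the view name `predictions` are all hypothetical; the query assumes the three prediction columns have already been joined into one table.

```python
from collections import Counter


def majority_vote(pred_lr, pred_rf, pred_gbt, fallback=1):
    """Return the value at least two models agree on, or `fallback` if none agree."""
    value, votes = Counter([pred_lr, pred_rf, pred_gbt]).most_common(1)[0]
    return value if votes >= 2 else fallback


# The same rule as a Spark SQL query over a hypothetical "predictions" view.
ENSEMBLE_SQL = """
SELECT *,
       CASE
           WHEN pred_lr = pred_rf OR pred_lr = pred_gbt THEN pred_lr
           WHEN pred_rf = pred_gbt THEN pred_rf
           ELSE 1.0
       END AS prediction_ensembled
FROM predictions
"""
```

With `predictions` registered as a temporary view, `spark.sql(ENSEMBLE_SQL)` would return the original columns plus prediction_ensembled. With three binary classifiers at least two always agree, so the fallback branch is only a safeguard.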
