sampleBy in PySpark

20/03/2021 Client: saad24vbs Deadline: 7 Days

points)

For this assignment, you are to write a Spark program to predict Yelp user ratings based on their review text.

You can download the data files from here: https://uofi.box.com/s/b3hwy8rax1eipmiz2r78csxp06jnodci . Unzip the folder and find review.json and user.json. You can read the description of each file and its attributes here: https://www.yelp.com/dataset/documentation/main

What you need to do:

1- Data Exploration:

• Load the review.json file and extract the “text” and “stars” attributes.

• Find the distribution of the “stars” attribute; that is, find the number of reviews for each star value.
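As a small illustration of the counting step, the sketch below tallies reviews per star value in plain Python. It assumes review.json holds one JSON object per line; the sample records and the helper name `star_distribution` are made up for illustration. In Spark, the same result would typically come from a `groupBy("stars").count()` on the DataFrame returned by `spark.read.json`.

```python
import json
from collections import Counter


def star_distribution(json_lines):
    """Count how many reviews carry each star value."""
    counts = Counter()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[record["stars"]] += 1
    # Sort by star value so the output is easy to read.
    return dict(sorted(counts.items()))


# Illustrative records only; the real file has many more attributes.
sample_lines = [
    '{"text": "Great food", "stars": 5}',
    '{"text": "Cold and slow", "stars": 1}',
    '{"text": "Loved it", "stars": 5}',
    '{"text": "Pretty good", "stars": 4}',
]

distribution = star_distribution(sample_lines)
print(distribution)
```

With a real file, the same function could be called on an open file handle, since iterating over a file yields one line at a time.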

2- Feature Engineering:

• The star ratings 1, 2, and 3 typically indicate dissatisfaction, and the star ratings 4 and 5 indicate satisfaction. Create a new column “rating” with values 0 (if the star rating is 1, 2, or 3) and 1 (if the star rating is 4 or 5). This will be the target variable you want to predict.

• Find the distribution of the “rating” column; that is, find the count of reviews for rating=0 and for rating=1. Is the rating attribute balanced? If not, you should down-sample your data. That means keeping all reviews for the rating value with the lowest count and taking a sample of the reviews for the other category, so that both classes have a balanced distribution. This is called stratified sampling, and you can accomplish it in Spark using the “sampleBy” method of the DataFrame. For an example of stratified sampling, see http://allaboutscala.com/big-data/spark/#dataframe-statistics-sampleby and https://mapr.com/blog/churn-prediction-pyspark-using-mllib-and-ml-packages/ (the section on stratified sampling).

• Unfortunately, the dataset is still too big for our tiny cluster, and running ML models on it can take a long time with our limited resources. So when down-sampling with the sampleBy method, multiply all fractions by 0.1 to get a sample of only 10% of the reviews in each rating category after down-sampling. Below are the counts I get after stratified sampling. (Depending on the seed you give to the sampleBy method, you might get different counts, but if you set the seed to 111, you should get the same distribution as mine.)

• Extract TF-IDF vectors from the review text. When creating the CountVectorizer, use setMinDF(100) to only include words in the feature vector that appear in at least 100 reviews. Make sure that you remove stop words and punctuation and use stemming, as explained in the labs.
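The target mapping and the down-sampling fractions from this section can be worked out with a few lines of arithmetic. The sketch below is plain Python, not the assignment's solution: the class counts are invented, and the helper names `to_rating` and `downsample_fractions` are hypothetical. It assumes the smallest class should be sampled at the 0.1 rate and every larger class at a proportionally smaller rate, so the expected sample sizes match.

```python
def to_rating(stars):
    """Map a star value to the binary target: 1 for 4-5 stars, 0 for 1-3."""
    return 1 if stars >= 4 else 0


def downsample_fractions(class_counts, scale=0.1):
    """Return a per-class sampling fraction that balances the classes.

    The smallest class gets fraction `scale`; larger classes get a
    proportionally smaller fraction so expected sample sizes are equal.
    """
    smallest = min(class_counts.values())
    return {label: scale * smallest / count
            for label, count in class_counts.items()}


# Made-up counts for illustration only (not the real Yelp numbers).
example_counts = {0: 2000, 1: 4000}
fractions = downsample_fractions(example_counts)

# Expected number of sampled rows per class: fraction * original count.
expected_sizes = {label: fractions[label] * count
                  for label, count in example_counts.items()}
print(fractions, expected_sizes)

# In PySpark, a dict like `fractions` is what sampleBy accepts, e.g.
#   sampled = df.sampleBy("rating", fractions=fractions, seed=111)
```

Because sampleBy samples each row independently, the resulting class counts are only approximately equal, which is why the counts depend on the seed.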

3- Building Machine Learning Pipelines:

• Use three different machine learning models (Logistic Regression, Random Forest, and Gradient Boosted Classification Trees) to predict the ratings based on the TF-IDF vector of the review text. For an example of Gradient Boosted Classification Trees, please refer to: https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#gradient-boosted-tree-classifier

• Use cross-validation with three folds and the Area Under Curve (AUC) metric to evaluate and tune each model’s hyper-parameters (please refer to the labs posted for this module for examples of cross-validation in Spark).

Create a separate pipeline for each ML model. Split the data into testing and training sets, fit each model on the training data, get the predictions for the test data, and print the AUC for each model.

Explain which model did the best job of predicting the ratings in the test set.

4- Making an ensemble of the above three models:

• Typically, an ensemble of multiple models works better than a single model. In this step, you take the predictions generated by the three models above (that is, logistic regression, random forest, and gradient boosted classification trees), zip them together, and compute a “prediction_ensembled” column, which is basically a majority vote of the three prediction columns generated by the models. That is, if two or more of the models generated the same prediction, then use that prediction in the prediction_ensembled column; otherwise, if none of the predictions are the same, then use the rating value 1 for the prediction_ensembled column. (You can accomplish this in Spark SQL using a simple CASE WHEN query; see an example here:
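A minimal sketch of the majority vote, written in plain Python with a hypothetical Spark SQL counterpart kept as a string. The column names `lr_pred`, `rf_pred`, and `gbt_pred`, the table name `predictions`, and the helper `ensemble_vote` are assumptions, not names from the assignment. With three 0/1 predictions at least two always agree, so the fallback to 1 is included only to mirror the rule as stated.

```python
from collections import Counter


def ensemble_vote(lr_pred, rf_pred, gbt_pred, fallback=1):
    """Return the value predicted by at least two models, else `fallback`."""
    value, votes = Counter([lr_pred, rf_pred, gbt_pred]).most_common(1)[0]
    return value if votes >= 2 else fallback


# Hypothetical Spark SQL form of the same rule for 0/1 predictions:
# the sum of three binary columns is >= 2 exactly when the majority is 1.
ENSEMBLE_SQL = """
SELECT *,
       CASE WHEN lr_pred + rf_pred + gbt_pred >= 2 THEN 1 ELSE 0 END
           AS prediction_ensembled
FROM predictions
"""

# Toy rows of (lr_pred, rf_pred, gbt_pred).
toy_rows = [(1, 1, 0), (0, 0, 1), (1, 0, 1), (0, 0, 0)]
ensembled = [ensemble_vote(*row) for row in toy_rows]
print(ensembled)
```

The SQL string would only run after the three prediction columns have been joined into one table or temporary view, which this sketch does not show.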
