Sampleby pyspark

20/03/2021 Client: saad24vbs Deadline: 7 Days

points)

For this assignment, you are to write a Spark program to predict Yelp user ratings based on their review text.

You can download the data files from here:

https://uofi.box.com/s/b3hwy8rax1eipmiz2r78csxp06jnodci

Unzip the folder and find review.json and user.json. You can read the description of each file and its attributes here:

https://www.yelp.com/dataset/documentation/main

What you need to do:

1- Data Exploration:

• Load the review.json file and extract the “text” and “stars” attributes.

• Find the distribution of the “stars” attribute; that is, find the number of reviews for each star value.
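The two exploration steps could be sketched in PySpark roughly as follows. This is a minimal sketch, not the lab solution: it assumes an existing SparkSession is passed in as `spark`, that review.json sits in the working directory, and the helper names `load_reviews` and `star_distribution` are made up for illustration.

```python
def load_reviews(spark, path="review.json"):
    """Read the reviews and keep only the two attributes the assignment uses.

    `spark` is assumed to be an already-created SparkSession.
    """
    return spark.read.json(path).select("text", "stars")


def star_distribution(reviews):
    """Return one row per star value with the number of reviews, ordered by star."""
    return reviews.groupBy("stars").count().orderBy("stars")
```

Calling `star_distribution(load_reviews(spark)).show()` would then print the count of reviews for each star value.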

2- Feature Engineering:

• Star ratings of 1, 2, and 3 typically indicate dissatisfaction, and star ratings of 4 and 5 indicate satisfaction. Create a new column “rating” with the value 0 (if the star rating is 1, 2, or 3) or 1 (if the star rating is 4 or 5). This will be the target variable you want to predict.

• Find the distribution of the “rating” column; that is, find the count of reviews for rating=0 and for rating=1. Is the rating attribute balanced? If not, you should down-sample your data. That means keeping all reviews for the rating value with the lowest count and taking a sample of the reviews for the other category, so that both classes have a balanced distribution. This is called stratified sampling, and you can accomplish it in Spark using the “sampleBy” method of the DataFrame. For an example of stratified sampling, see http://allaboutscala.com/big-data/spark/#dataframe-statistics-sampleby and https://mapr.com/blog/churn-prediction-pyspark-using-mllib-and-ml-packages/ (the section on stratified sampling).

• Unfortunately, the dataset is still too big for our tiny cluster, and running ML models on it can take a long time with our limited resources. So when down-sampling with the sampleBy method, multiply all fractions by 0.1 to get a sample of only 10% of the reviews in each rating category after down-sampling. Below are the counts I get after stratified sampling (depending on the seed you give to the sampleBy method, you might get different counts, but if you set the seed to 111, you should get the same distribution as mine).

• Extract TF-IDF vectors from the review text. When creating the CountVectorizer, use setMinDF(100) so that the feature vector includes only words that appear in at least 100 reviews. Make sure that you remove stop words and punctuation and use stemming, as explained in the labs.
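The labeling and down-sampling steps in this section could look roughly like the sketch below. It is one possible reading of the instructions, not the instructor's solution: the helper names are hypothetical, and the input is assumed to be a DataFrame with “text” and “stars” columns. The fraction for each class is the minority count divided by that class's count, multiplied by 0.1.

```python
def add_rating(reviews):
    """Add the binary target: stars 1-3 become rating 0, stars 4-5 become rating 1."""
    return reviews.selectExpr(
        "text", "stars", "CASE WHEN stars >= 4 THEN 1 ELSE 0 END AS rating"
    )


def downsample_fractions(class_counts, scale=0.1):
    """Sampling fraction per class: shrink each class to the minority size, then scale.

    `class_counts` maps each rating value to its number of reviews.
    """
    smallest = min(class_counts.values())
    return {label: scale * smallest / count for label, count in class_counts.items()}


def stratified_downsample(labeled, seed=111):
    """Balance the two rating classes with DataFrame.sampleBy, keeping about 10%."""
    counts = {
        row["rating"]: row["count"]
        for row in labeled.groupBy("rating").count().collect()
    }
    return labeled.sampleBy("rating", fractions=downsample_fractions(counts), seed=seed)
```

Because sampleBy draws each row independently with the given probability, the resulting class counts are approximately, not exactly, equal.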

3- Building Machine Learning pipelines.

• Use three different machine learning models (Logistic Regression, Random Forest, and Gradient Boosted Classification Trees) to predict the ratings based on the TF-IDF vector of the review text. For an example of Gradient Boosted Classification Trees, please refer to: https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#gradient-boosted-tree-classifier

• Use CrossValidation with three folds and the Area Under Curve (AUC) metric to evaluate and tune each model’s hyper-parameters (please refer to the labs posted for this module for examples of cross-validation in Spark).

Create a separate pipeline for each ML model. Split the data into training and testing sets, fit each model on the training data, get the predictions for the test data, and print the AUC for each model.

Explain which model did a better job of predicting the ratings in the test set.
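A per-model pipeline with three-fold cross-validation might be assembled along the lines below. This is a hedged sketch: the function names, the intermediate column names (“tokens”, “filtered”, “tf”, “features”), and reusing 111 as the cross-validation seed are all assumptions, and the stemming step from the labs is not shown because Spark ML has no built-in stemmer. The classifier passed in is assumed to have been created with labelCol="rating".

```python
def text_feature_stages(min_df=100.0):
    """Tokenizer, stop-word filter, and TF-IDF stages (stemming is not shown)."""
    # Imported lazily so the module loads even where Spark is not installed.
    from pyspark.ml.feature import CountVectorizer, IDF, RegexTokenizer, StopWordsRemover

    return [
        # Splitting on runs of non-word characters also drops punctuation.
        RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+"),
        StopWordsRemover(inputCol="tokens", outputCol="filtered"),
        # minDF=100 keeps only terms that occur in at least 100 reviews.
        CountVectorizer(inputCol="filtered", outputCol="tf", minDF=min_df),
        IDF(inputCol="tf", outputCol="features"),
    ]


def tuned_pipeline(classifier, param_grid, label_col="rating", seed=111):
    """Wrap features + classifier in 3-fold cross-validation scored by ROC AUC."""
    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator

    pipeline = Pipeline(stages=text_feature_stages() + [classifier])
    evaluator = BinaryClassificationEvaluator(
        labelCol=label_col, metricName="areaUnderROC"
    )
    cv = CrossValidator(
        estimator=pipeline,
        estimatorParamMaps=param_grid,
        evaluator=evaluator,
        numFolds=3,
        seed=seed,
    )
    return cv, evaluator


def held_out_auc(cv, evaluator, train, test):
    """Fit on the training split and return AUC on the held-out test split."""
    return evaluator.evaluate(cv.fit(train).transform(test))


def rank_by_auc(auc_by_model):
    """Sort (model name, AUC) pairs from best to worst."""
    return sorted(auc_by_model.items(), key=lambda item: item[1], reverse=True)
```

For logistic regression, for instance, one could create `lr = LogisticRegression(labelCol="rating")`, build a grid with `ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()`, and split the data with `randomSplit([0.8, 0.2], seed=111)`; the grid values and the 80/20 split are illustrative choices, not values given in the assignment. Collecting the three test AUCs in a dictionary and passing it to `rank_by_auc` gives the ordering needed for the comparison.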

4- Making an ensemble of the above three models:

• Typically, an ensemble of multiple models works better than a single model. In this step, you take the predictions generated by the three models above (that is, logistic regression, random forest, and gradient boosted classification trees), zip them together, and compute a “prediction_ensembled” column, which is basically a majority vote of the three prediction columns generated by the models. That is, if two or more of the models generated the same prediction, use that prediction in the prediction_ensembled column; otherwise, if none of the predictions are the same, use the rating value 1 for the prediction_ensembled column. (You can accomplish this in Spark SQL using a simple CASE WHEN query, see an example here:
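The example the assignment links to is not included in this copy. As an independent sketch of the majority-vote rule described above, the following shows the logic in plain Python and as a Spark SQL CASE WHEN expression. The three prediction column names are hypothetical, and the sketch assumes the three models' predictions have already been combined into one DataFrame.

```python
# Hypothetical names for the three models' prediction columns.
PREDICTION_COLS = ("lr_prediction", "rf_prediction", "gbt_prediction")


def majority_vote(a, b, c, fallback=1):
    """Return the value at least two predictions share, else the fallback."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return fallback


def ensemble_case_sql(cols=PREDICTION_COLS, fallback=1):
    """Build the equivalent Spark SQL CASE WHEN expression as a string."""
    a, b, c = cols
    return (
        f"CASE WHEN {a} = {b} OR {a} = {c} THEN {a} "
        f"WHEN {b} = {c} THEN {b} "
        f"ELSE {fallback} END"
    )


def add_ensemble(predictions):
    """Append prediction_ensembled to a DataFrame holding the three columns."""
    return predictions.selectExpr(
        "*", f"{ensemble_case_sql()} AS prediction_ensembled"
    )
```

With three binary predictions, at least two always agree, so the fallback value 1 only comes into play if the predictions can take more than two distinct values.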
