Perform a multivariate linear regression that models the ice cream sales revenues as a function of the two weather variables. Make one figure of your choice that illustrates this linear regression (possibly with multiple panels).


Solution

Solved using MATLAB.
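
The regression step can be illustrated with a minimal sketch. It is written in Python rather than MATLAB, uses only the standard library, and fits the model revenue = b0 + b1*x1 + b2*x2 by solving the normal equations. The data and the names weather, revenue and forecast are made up for illustration; they are not the assignment's data.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares fit; returns [intercept, b1, b2, ...]."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(coef, row):
    """Expected revenue for one day's pair of weather values."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


# Made-up data (assumption): revenue generated exactly by 50 + 10*x1 - 2*x2.
weather = [(20, 5), (25, 0), (30, 2), (18, 10), (28, 1), (22, 7)]
revenue = [50 + 10 * x1 - 2 * x2 for x1, x2 in weather]

coef = fit_ols(weather, revenue)
forecast = [(24, 3), (26, 1)]  # hypothetical forecast days
total = sum(predict(coef, day) for day in forecast)
print(coef, total)
```

Because the toy revenues contain no noise, the fit recovers the generating coefficients; real data would leave residuals, which are what a figure of fitted versus observed revenues would show.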

Results

b) The 5-day weather forecast for next week is:

What are the total expected revenues for the next 5 days (assuming the owner works every day)? What is the 90% confidence interval for your estimate?

Solution
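
As a sketch of how such an estimate is usually formed, assuming the fitted model from part (a) has an intercept and two weather coefficients, n training days, design matrix X, and independent normal errors (the symbols below are introduced here and do not come from the original solution):

```latex
\hat{R}_{\text{total}}
  = \sum_{i=1}^{5} \mathbf{x}_i^{\top}\hat{\boldsymbol{\beta}}
  = \mathbf{c}^{\top}\hat{\boldsymbol{\beta}},
\qquad
\mathbf{c} = \sum_{i=1}^{5} \mathbf{x}_i,
\quad
\mathbf{x}_i = (1,\; x_{i1},\; x_{i2})^{\top}

\hat{R}_{\text{total}} \;\pm\; t_{0.95,\,n-3}\; s\,
  \sqrt{\mathbf{c}^{\top}(X^{\top}X)^{-1}\mathbf{c}},
\qquad
s^{2} = \frac{\mathrm{RSS}}{n-3}
```

This interval is for the expected total. If the interval should instead cover the realised total revenue of the five days, the term under the square root becomes c^T (X^T X)^{-1} c + 5, which adds the noise of five new independent days.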

6. Spam Emails. A charity has contacted you to ask if you could help them filter the hundreds of spam emails they receive every day. Spam directly affects the quality of the service offered by the charity because it crowds out important and urgent emails received from people in need. One of your friends thinks spam can be detected by calculating the frequency of certain words and characters in the text of an email. Your friend developed a program that calculates the frequencies of 48 selected keywords and 9 other variables that capture various other metrics from the content of the email. Hence, in total, there are 57 metrics extracted from a given email. The charity gave you a random sample of 1,000 emails they received last month, and your friend ran the program on those emails. Then, your friend read through all of the 1,000 emails and annotated each email to indicate whether it was indeed spam or not. The result of this hard work is saved in the file spam-train.csv, where the first 57 columns represent the various metrics calculated by your friend’s program, and the last (58th) column is the spam annotation: 1 if the email was spam, 0 otherwise.

a) Based on this “training” dataset presented in spam-train.csv, develop a predictive model, based on a logistic regression coupled with a ROC analysis, that classifies emails as spam or not. The charity is willing to accept that no more than 2 out of 100 non-spam emails are wrongly classified as “spam”.
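
The ROC step amounts to choosing the score threshold that catches the most spam while keeping the false positive rate at or below 2%. The following is a minimal Python sketch of that selection only. It assumes the logistic regression has already produced a spam probability for each training email; the function name pick_threshold and the toy scores are illustrative, not the assignment's data.

```python
def pick_threshold(scores, labels, max_fpr=0.02):
    """Return (threshold, tpr, fpr) with the highest TPR such that FPR <= max_fpr.

    An email is classified as spam when its score >= threshold.
    labels: 1 for spam, 0 for non-spam.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = (float("inf"), 0.0, 0.0)  # flag nothing if no threshold qualifies
    # Lowering the threshold can only raise TPR and FPR, so scan downwards
    # and stop at the first threshold that violates the FPR constraint.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fpr = fp / n_neg
        if fpr > max_fpr:
            break
        best = (t, tp / n_pos, fpr)
    return best


# Toy scores (assumption): with only 4 non-spam emails a 2% limit cannot
# be demonstrated, so a looser limit of 25% is used for illustration.
toy_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0]
loose = pick_threshold(toy_scores, toy_labels, max_fpr=0.25)
strict = pick_threshold(toy_scores, toy_labels, max_fpr=0.0)
print(loose, strict)
```

On the 1,000 training emails the same function would be called with max_fpr=0.02, which matches the charity's tolerance of 2 wrongly flagged emails per 100 non-spam emails.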

Result

b) Now that your predictive model is developed, your friend has retrieved the new emails received yesterday, computed the 57 metrics for them and saved the results in the file spam-test.csv. Run your predictive model on this new data set and classify each email as spam or not. What proportion of the emails in this new data set does your model identify as spam?

The predicted values are stored in the results vector in the workspace.
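
For illustration, the scoring and counting step can be sketched in Python as follows. The coefficients, the threshold and the single-metric emails are hypothetical placeholders; in the actual exercise the coefficients would come from the fitted logistic regression, the threshold from the ROC analysis, and each email would have 57 metrics.

```python
import math


def spam_probability(coef, metrics):
    """Logistic model: coef[0] is the intercept, coef[1:] the metric weights."""
    z = coef[0] + sum(c * m for c, m in zip(coef[1:], metrics))
    return 1.0 / (1.0 + math.exp(-z))


def classify(emails, coef, threshold):
    """Return a list of 0/1 predictions (1 = spam) for the given emails."""
    return [1 if spam_probability(coef, e) >= threshold else 0 for e in emails]


# Hypothetical model with a single metric, for illustration only.
coef = [-1.0, 2.0]
emails = [[0.0], [1.0], [2.0]]
results = classify(emails, coef, threshold=0.7)
proportion = sum(results) / len(results)
print(results, proportion)
```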

c) A software company has approached the charity, claiming they have new state-of-the-art software that can detect spam like never before. The director of the charity is tempted to buy this software but hesitates because of its hefty price. The software company has run its state-of-the-art program on the same training set spam-train.csv as you did. The software gives a numerical score to an email: the higher the score, the more likely the email is spam. The scores of the 1,000 emails of the training dataset are saved in spam_comp.txt. The director asks you if the software company does better than the method you and your friend provided from your benevolent (free) work. Answer the director with a short paragraph that explains your comparative analysis along with one single figure of your choice.

Solution

The software company has not provided a score threshold above which an email is considered spam. If we take the threshold to be 500, the accuracy of our model compared to theirs is 50%, which is quite low, so it makes sense to use their state-of-the-art solution.
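
Since the company's scores come without a threshold, one threshold-free way to compare the two methods is the area under the ROC curve (AUC), which depends only on how each method ranks spam relative to non-spam. The sketch below is in Python and uses made-up scores; it is not the analysis behind the paragraph above.

```python
def auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen spam email scores higher than a randomly chosen non-spam email
    (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up scores (assumption): our model outputs probabilities, the company's
# software outputs scores on an arbitrary scale. AUC ignores the scale.
labels = [1, 1, 0, 0]
our_scores = [0.9, 0.8, 0.3, 0.1]
company_scores = [900, 200, 300, 100]
print(auc(our_scores, labels), auc(company_scores, labels))
```

Plotting both ROC curves on the same axes, with the 2% false positive limit marked, would give the single figure the director asks for; the method with the higher true positive rate at that limit is the better choice for the charity's constraint.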

7. Bike Sharing: A city put in place a bike sharing system a year ago. Through this system, users are able to easily rent a bike from a particular station and return it at another station (possibly the same one). There are complaints that some stations often have no bike available to rent. The logistics of making sure there are enough bikes available at all renting stations are complex and partially based on the duration of the bike ride for each user. Municipal employees have noticed that the ride duration tends to be longer when the weather is nice. If this is true, the municipal staff that moves bikes to empty stations could plan its activity in advance, based on weather forecasts, to improve bike availability. The manager of the bike sharing program wants to be sure that weather influences the bike ride duration and hires you as a data analyst consultant to study this.

The manager provides a dataset of 2,000 bike rides randomly selected over the last year in the file bike_trips.csv, as well as the file bike_stations.csv, which translates the numerical station codes into English station names. You also have access to weather data for the city in the file bike_weather.csv. The file bike_INFO.txt contains important additional information about those three files.

a) Merge the information from all three datasets (about bike rides, station names and weather) in a single table that must have the following format (note that the codes for weather and stations are replaced by their “names”):

The table above is illustrative and does not represent the values of the actual data contained in the file. The variable duration is the duration of the bike ride in minutes. Hint: this question involves several steps of joining tables.

Solution

For further details, see the file 7aa.m.
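
The joining steps can be sketched as dictionary lookups. This Python sketch is not a translation of 7aa.m. Every column name below (date, start_code, end_code, code, name, weather_code) and the weather code table are assumptions made for illustration, since the real layout is described in bike_INFO.txt.

```python
def merge_trips(trips, stations, weather, weather_names):
    """Join trips with station names and weather names.

    trips:         list of dicts with date, start_code, end_code, duration
    stations:      list of dicts with code, name
    weather:       list of dicts with date, weather_code
    weather_names: dict mapping weather_code -> readable name
    """
    station_name = {s["code"]: s["name"] for s in stations}
    weather_on = {w["date"]: w["weather_code"] for w in weather}
    merged = []
    for t in trips:
        merged.append({
            "date": t["date"],
            "start_station": station_name[t["start_code"]],  # join 1
            "end_station": station_name[t["end_code"]],      # join 2
            "weather": weather_names[weather_on[t["date"]]],  # join 3
            "duration": t["duration"],
        })
    return merged


# Tiny made-up tables (assumption) to show the shape of the result.
trips = [{"date": "2020-05-01", "start_code": 1, "end_code": 2, "duration": 12.5}]
stations = [{"code": 1, "name": "Central Park"}, {"code": 2, "name": "Harbour"}]
weather = [{"date": "2020-05-01", "weather_code": 1}]
weather_names = {1: "sunny", 2: "cloudy", 3: "rain", 4: "storm"}

table = merge_trips(trips, stations, weather, weather_names)
print(table[0])
```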

b) Produce and display a table that summarizes the average bike ride duration for each of the four types of weather. Create a well-annotated boxplot that shows the distribution of bike ride durations by weather type.

Solution
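
A minimal Python sketch of the summary table, assuming the merged data is available as (weather, duration) pairs. The values are made up, and the quartiles are included because they are the numbers a boxplot draws.

```python
from collections import defaultdict
from statistics import mean, quantiles


def summarize_by_weather(rides):
    """rides: iterable of (weather, duration_in_minutes) pairs.

    Returns {weather: {"n", "mean", "q1", "median", "q3"}}.
    Each weather type needs at least two rides for the quartiles.
    """
    groups = defaultdict(list)
    for weather, duration in rides:
        groups[weather].append(duration)
    summary = {}
    for weather, durations in groups.items():
        q1, median, q3 = quantiles(durations, n=4, method="inclusive")
        summary[weather] = {
            "n": len(durations),
            "mean": mean(durations),
            "q1": q1,
            "median": median,
            "q3": q3,
        }
    return summary


# Made-up durations (assumption), not the values in bike_trips.csv.
rides = [("sunny", 10), ("sunny", 20), ("sunny", 30), ("sunny", 40),
         ("rain", 5), ("rain", 15)]
summary = summarize_by_weather(rides)
for weather, stats in summary.items():
    print(weather, stats)
```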

[Figure: box plot of bike ride duration by weather type]

c) Conduct a rigorous statistical analysis that determines if the bike ride durations differ with the type of weather. Write a short paragraph that summarizes your analysis.

Solution

As we can see, bike rides last longer when the weather is good for biking. The average ride duration is higher in sunny and cloudy weather, decreases when it is raining, and decreases further during a storm, which is very unsuitable for bike riding.
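
One possible way to back this observation with a test is a one-way ANOVA across the four weather types. The Python sketch below computes the usual F statistic and, because the standard library has no F distribution, estimates the p-value with a permutation test instead of the classical F test. The durations are made up, and the choice of test is an assumption rather than the method used in the original solution.

```python
import random
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F: between-group over within-group mean squares."""
    values = [v for g in groups for v in g]
    grand = mean(values)
    means = [mean(g) for g in groups]
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def permutation_p_value(groups, n_perm=2000, seed=0):
    """Share of random relabelings whose F is at least the observed F."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        # Small relative tolerance guards against floating-point ties.
        if f_statistic(shuffled) >= observed * (1 - 1e-9):
            count += 1
    return (count + 1) / (n_perm + 1)


# Made-up ride durations in minutes (assumption), one list per weather type.
durations = [[10, 12, 11, 13], [20, 22, 21, 23], [30, 31, 29, 32]]
f_obs = f_statistic(durations)
p_value = permutation_p_value(durations)
print(f_obs, p_value)
```

A small p-value indicates that at least one weather type has a different mean duration. Ride durations are often right-skewed, so checking the ANOVA assumptions or using a rank-based test such as Kruskal-Wallis would be a reasonable complement.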