Solution
Using MATLAB:
Results
b)
The 5-day weather forecast for next week is:
What is the total expected revenue for the next 5 days (assuming the owner works
every day)? What is the 90% confidence interval for your estimate?
Solution
6. Spam Emails. A charity has contacted you to ask if you could help them filter
the hundreds of spam emails they receive every day. Spam directly affects the
quality of the service offered by the charity because it crowds out important
and urgent emails received from people in need. One of your friends thinks spam
can be detected by calculating the frequency of certain words and characters in
the text of an email. Your friend developed a program that calculates the
frequencies of 48 selected keywords and 9 other variables that capture various
other metrics from the content of the email. Hence, in total, there are 57
metrics extracted from a given email. The charity gave you a random sample of
1,000 emails they received last month, and your friend ran the program on those
emails. Then, your friend read through all 1,000 emails and annotated each one
to indicate whether it was indeed spam. The result of this hard work is saved
in the file spam-train.csv, where the first 57 columns represent the various
metrics calculated by your friend’s program, and the last (58th) column is the
spam annotation: 1 if the email was spam, 0 otherwise.
a)
Based on the “training” dataset in spam-train.csv, develop a predictive model,
based on logistic regression coupled with a ROC analysis, that classifies
emails as spam or not. The charity is willing to accept that no more than 2 out
of 100 non-spam emails are wrongly classified as “spam”.
Result
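A minimal Python sketch (the solution itself was produced in MATLAB) of how the 2% constraint can be turned into a classification threshold via the ROC curve. It assumes the logistic regression has already been fitted and that `scores` holds its predicted spam probabilities for the training emails; the function names and the toy data are hypothetical.

```python
def roc_point(scores, labels, threshold):
    """Return (false-positive rate, true-positive rate) at one threshold.

    An email is classified as spam when its score is >= threshold.
    labels: 1 = spam, 0 = non-spam.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return fp / n_neg, tp / n_pos


def threshold_for_max_fpr(scores, labels, max_fpr=0.02):
    """Lowest threshold whose false-positive rate does not exceed max_fpr.

    The FPR can only fall as the threshold rises, so the first admissible
    threshold (scanning upwards) also gives the highest true-positive rate.
    """
    for t in sorted(set(scores)):
        fpr, tpr = roc_point(scores, labels, t)
        if fpr <= max_fpr:
            return t, fpr, tpr
    # No observed score qualifies: classify nothing as spam.
    return float("inf"), 0.0, 0.0


# Toy data (not taken from spam-train.csv).
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.35, 0.8, 0.9]
toy_labels = [0, 0, 0, 0, 1, 1, 1]
threshold, fpr, tpr = threshold_for_max_fpr(toy_scores, toy_labels, max_fpr=0.25)
```

With the real data, `max_fpr=0.02` would encode the charity's requirement; the returned true-positive rate is the share of spam caught at that operating point.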
b)
Now that your predictive model is developed, your friend has retrieved new
emails received yesterday, ran the 57 metrics on them and saved the results in
the file spam-test.csv. Run your predictive model on this new data set and
classify each email as spam or not. What is the proportion of spam emails that
were identified by your model in this new data set?
The predicted values are stored in the results vector in the workspace.
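As a rough illustration in Python (not the MATLAB code used here), the proportion asked for could be computed from a vector of predicted probabilities and a threshold chosen on the training set. The names `classify` and `spam_proportion`, the threshold of 0.5 and the scores are all hypothetical.

```python
def classify(scores, threshold):
    """Label an email 1 (spam) when its score reaches the threshold, else 0."""
    return [1 if s >= threshold else 0 for s in scores]


def spam_proportion(predictions):
    """Share of emails that were classified as spam."""
    return sum(predictions) / len(predictions)


# Toy scores standing in for the model's output on spam-test.csv.
results = classify([0.1, 0.6, 0.9, 0.4], threshold=0.5)
proportion = spam_proportion(results)
```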
c)
A software company has approached the charity, claiming they have new state-of-the-art
software that can detect spam like never before. The director of the charity
is tempted to buy this software but hesitates because of its hefty price. The
software company has run its state-of-the-art program on the same training set
spam-train.csv as you did. The software gives a numerical score to an email:
the higher the score, the more likely the email is spam. The scores of the
1,000 emails of the training dataset are saved in spam_comp.txt. The director
asks you if the software company does better than the method you and your
friend provided from your benevolent (free) work. Answer the director with a
short paragraph that explains your comparative analysis along with one single
figure of your choice.
Solution
The software company has not provided a threshold for classifying an email as
spam. If we take the threshold to be 500, then the accuracy of our model
compared to theirs is 50%, which is quite low; hence it makes sense to use
their state-of-the-art solution.
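Because the company's scores come without a threshold, one threshold-free way to compare the two methods is the area under the ROC curve (AUC), which depends only on how each method ranks the emails. The Python sketch below is an assumption about how such a comparison could be set up, not the analysis performed above; the function name and toy scores are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen spam email receives a
    higher score than a randomly chosen non-spam email (ties count 0.5).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: scores on different scales are directly comparable.
labels = [0, 0, 1, 1]
auc_ours = auc([0.10, 0.40, 0.35, 0.80], labels)    # probabilities
auc_theirs = auc([100, 200, 300, 400], labels)      # arbitrary score scale
```

The single figure requested by the director could then overlay both ROC curves, with the AUC values in the legend and the 2% false-positive limit marked.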
7. Bike Sharing: A city put in place a bike sharing system a year ago. Through
this system, users are able to easily rent a bike from a particular station and
return it to another station (possibly the same one). There are complaints that
some stations often have no bikes available to rent. The logistics of making
sure there are enough bikes available at all rental stations are complex and
partially based on the duration of the bike ride for each user. Municipal
employees have noticed that the ride duration tends to be longer when the
weather is nice. If this is true, the municipal staff who move bikes to empty
stations could plan their activity in advance, based on weather forecasts, to
improve bike availability. The manager of the bike sharing program wants to be
sure that weather influences the bike ride duration and hires you as a data
analyst consultant to study this.
The manager provides a dataset of 2,000 bike rides randomly selected over the
last year in the file bike_trips.csv, as well as the file bike_stations.csv,
which translates the numerical station codes into English station names. You
also have access to weather data for the city in the file bike_weather.csv. The
file bike_INFO.txt contains important additional information about those three
files.
a)
Merge the information from all three datasets (about bike rides, station names
and weather) in a single table that must have the following format (note that
the codes for weather and stations are replaced by their “names”):
The table above is illustrative and does not represent the values of the actual
data contained in the file. The variable duration is the duration of the bike
ride in minutes. Hint: this question involves several steps of joining tables.
Solution
For further details, see the file 7aa.m.
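The joining steps can be sketched in Python as two lookups applied to every ride (the actual solution is in MATLAB). All column names (`station_id`, `station_name`, `start_station_id`, `end_station_id`, `date`, `weather_code`) and the code-to-name mapping are assumptions made for illustration; the real ones are described in bike_INFO.txt.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_rides(trips, stations, weather, weather_names):
    """Left-join rides with station names and with the weather on the ride date."""
    station_name = {s["station_id"]: s["station_name"] for s in stations}
    weather_code = {w["date"]: w["weather_code"] for w in weather}
    merged = []
    for t in trips:
        code = weather_code.get(t["date"])  # None if the date has no weather row
        merged.append({
            "date": t["date"],
            "start_station": station_name.get(t["start_station_id"]),
            "end_station": station_name.get(t["end_station_id"]),
            "weather": weather_names.get(code),
            "duration": float(t["duration"]),  # minutes
        })
    return merged


# Hypothetical miniature versions of the three files.
TRIPS = "date,start_station_id,end_station_id,duration\n2020-01-01,1,2,12.5\n2020-01-02,2,2,30\n"
STATIONS = "station_id,station_name\n1,Central\n2,Harbour\n"
WEATHER = "date,weather_code\n2020-01-01,1\n2020-01-02,3\n"
WEATHER_NAMES = {"1": "sunny", "3": "rain"}  # assumed coding

table = merge_rides(read_rows(TRIPS), read_rows(STATIONS),
                    read_rows(WEATHER), WEATHER_NAMES)
```

With the real files, the CSV text would be read from bike_trips.csv, bike_stations.csv and bike_weather.csv instead of the inline strings.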
b)
Produce and display a table that summarizes the average bike ride duration for
each of the four types of weather. Create a well-annotated boxplot that shows
the distribution of bike ride durations by weather type.
Solution
Results
Box plot
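A hedged Python sketch of the summary table: it groups ride durations by weather type and reports the count, mean and median per group. The row layout (dicts with `weather` and `duration` keys) and the toy values are assumptions, not the actual data.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_by_weather(rows):
    """Return {weather: {"n", "mean", "median"}} for ride durations in minutes."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["weather"]].append(float(r["duration"]))
    return {
        w: {"n": len(d), "mean": mean(d), "median": median(d)}
        for w, d in groups.items()
    }


# Toy rows (hypothetical values).
rides = [
    {"weather": "sunny", "duration": 10},
    {"weather": "sunny", "duration": 20},
    {"weather": "rain", "duration": 5},
]
summary = summarize_by_weather(rides)
```

The same per-weather groups are what a boxplot would display, one box per weather type, with the axes labelled by weather type and duration in minutes.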
c)
Conduct a rigorous statistical analysis that determines if the bike ride
durations differ with the type of weather. Write a short paragraph that
summarizes your analysis.
Solution
As we can see, the bike ride duration is longer when the weather is good for
biking. The average ride duration is higher in sunny and cloudy weather,
decreases when it is raining, and decreases further during a storm, which is
very unsuitable for bike riding.
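Comparing the group averages describes the pattern, but a formal test is needed to judge whether the differences could be due to chance. One option, sketched below in Python as an assumption rather than the analysis performed here, is a permutation test on the one-way ANOVA F statistic: the weather labels are shuffled many times to see how often a statistic at least as large as the observed one arises. Function names and the toy durations are hypothetical.

```python
import random
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F: between-group over within-group mean square."""
    values = [x for g in groups for x in g]
    grand = mean(values)
    between_ss = within_ss = 0.0
    for g in groups:
        m = mean(g)
        between_ss += len(g) * (m - grand) ** 2
        within_ss += sum((x - m) ** 2 for x in g)
    k, n = len(groups), len(values)
    return (between_ss / (k - 1)) / (within_ss / (n - k))


def permutation_test(groups, n_perm=2000, seed=0):
    """Return (observed F, permutation p-value) under 'weather has no effect'."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    observed = f_statistic(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed:
            hits += 1
    # The +1 terms count the observed arrangement, so p is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy durations in minutes for three weather types (hypothetical).
toy_groups = [[30, 32, 35, 31, 33], [20, 22, 19, 21, 23], [10, 12, 11, 9, 13]]
f_obs, p_value = permutation_test(toy_groups, n_perm=500)
```

A small p-value would support the claim that ride duration differs by weather type; pairwise follow-up comparisons would be needed to say which weather types differ.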