Project Due Sunday, October 27 by 1700
For the third project, you will be creating a suite of functions to process and analyze a Twitter data set derived from one of the data sets used in the fivethirtyeight article https://fivethirtyeight.com/features/the-worst-tweeter-in-politics-isnt-trump/. The data file you have is a subset and was pre-processed, so do not use the original data.
For the first part of the project, you will write 2 functions and correct 1 function. The second part of the project is to write a short main program that uses your function.
You are provided with two text files, sen_tweets_edited_2.csv (yes, it has a .csv extension, but is also a text file) and test_file.txt, which is a small fille in the format of sen_tweets_edited_2.csv that you can use for testing. You are provided with two Python skeleton files Project_3.py for your functions and Project_3_Main.py for your main program. The other files provided are related to the optional extra credit.
About Twitter data and Tweets
Tweets are submission to the social media platform Twitter. They are 140 characters in length or less. Tweets often contain hashtags which are words or phrases with the prefix #, for example, #avocadotoast. Tweets may reference other Twitter users, as indicated by the @, for example, @realDonaldTrump. Users may re-post a tweet posted by another user – this is referred to as re-tweeting or a retweet. Users may reply directly to a tweet posted by another individual, which is called a reply. Users may also indicate that they like a tweet by making it a favorite.
Each function description has a bulleted list of key points. These will answer many of your questions and provide hints, suggestions, and
smaller subtasks if you do not know how to start or get stuck.
Keep the instructions open on your desktop while you work on the project and refer to them often.
Part One: Processing and analyzing tweets
All of the code for Part 1 should be submitted in the Project_3.py file. You will fill in your functions under the definition lines.
Task 1: process_hashes (tweet)
process_hashes has one required argument: a single string tweet that is a tweet. This function extracts hashtags from the tweet and returns them in a list. The function should work as follows:
• The function takes a string as input and returns a list. Think carefully about when you should convert this string to a list and which steps below are performed on the string, and which on the list. WRITE THIS OUT ON PAPER FIRST!
• All of the text should be put into upper case. • Punctuation should be removed, including hashtags and at signs. Think carefully about what
should get removed when and how hashtags are identified. Specifically, you need to remove, ?.,!_:;#@
• Delete trailing 's on hashtags. This means if the hashtag is: #POTUS's, after processing (including step above) it should be: POTUS
• The function returns a list that contains the hashtags found. This list will be empty if there were no hashtags. The list might contain the same hashtag more than once if it appears in the tweet more than once. (Do not remove duplicates.)
• You should not be reading in from a file anywhere in this function. The input is a string called tweet. Use tweet for processing.
• We haven’t discussed regular expressions. If you use them, you may get none of the points regardless of what the internet tells you to do. (You can do the extra credit with regular expressions.)
Here are some examples, note this is not actual code, and words like “Input tweet” should not print when your function runs. Please also note these tweets were chosen because of the text they contain, they are not a political statement.
Input tweet: ".@realDonaldTrump's #SwampCabinet must be held accountable. I will hold them to account even if @SenateGOP won't. "
process_hashes returns:
['SWAMPCABINET'] Input tweet: 'Why are Republicans asking the Supreme Court to raise taxes on Alaska families? #ACAWorks for #Alaska http://t.co/lHeHUoXRQq'
process_hashes returns:
['ACAWORKS', 'ALASKA']
Task 2: popular_tweets(filename, how = 'retweet', cutoff = 100, counts = False) correction
You have been given code for a function called popular_tweets(filename, how = 'retweet', cutoff = 100, counts = False) which does not work properly. Correct the code to meet the following specifications.
popular_tweets(filename, how = 'retweet', cutoff = 100, counts = False) returns a dictionary where the keys are strings corresponding to Twitter usernames and the values are either 1) a list of tweets (strings) by that user or 2) integers representing a count of tweets by that user. Whether the value is a list or an integer depends on the optional argument counts. More details of how the function works are presented below.
popular_tweets takes one required input filename, a string that is the name of a file containing tab-delimited Twitter data. The file specified by filename should be in the format (the spaces below represent tabs, \t, NOT spaces):
ID tweet_text replies retweets favorites username party state
For example, one line in the file might be:
179162 @DrNordal, it was nice meeting with you. Thanks for stopping by. 1 3 0 SenDeanHeller R NV
The ID is 179162. Next is the actual tweet. 1 is the number of replies. 3 is the number of retweets. 0 is the number of favorites. The username is SenDeanHeller. Party is R (Republican). State is NV (Nevada). You have been given two files in this format: test_file.txt is a small file you may wish to use for testing your code; sen_tweets_edited_2.csv is a larger data set. Pay close attention to what data is in which column.
The optional arguments how and cutoff determine which tweets will be included in the final output dictionary. The function is looking for tweets that are popular based on either how many replies they received, how many times they were retweeted, or how many times they were favorited. For the tweet shown above, there was 1 reply, 3 retweets, and 0 favorites.
The argument how tells whether to determine popularity based on replies, retweets or favorites. It will always be a string with default value 'retweet'. The only other possible values for how are 'reply' and 'favorite'. Each option corresponds to a column in the original data file.
The argument cutoff tells how many replies/retweets/favorites the tweet must have to be included in the output. If a tweet has fewer replies/retweets/favorites than the value of cutoff, it will not appear in the output dictionary. The default value of cutoff is 100.
In the examples below, the original data is the same for all 3 cases, but the output is different. The original data consists of five lines in a file (this data is fabricated but based on real data):
185791 Bump fire stocks allow 88 151 293 SenF D CA
286443 Congratulations to @TeamCoachBuzz 5 30 268 timkaine D VA
25697 I'm grateful for #Arkansas 0 4 4 JohnBoozman R AR
286523 Sea level threatens Hampton Roads 69 473 1819 timkaine D VA
251370 I also stand ready to work 8 11 30 SenShelby R AL
• Popularity based on retweets, how = 'retweet', cutoff = 100; function returns:
{'SenF': ['Bump fire stocks allow'], 'timkaine': ['Sea level threatens Hampton Roads']}
Notice that each key is a username and the values are lists containing tweets that had more than 100 retweets. There are no entries in the dictionary for JohnBoozman (4 retweets) or SenShelby (11 retweets), and timkaine’s first tweet does not appear (only 30 retweets)