Python Project
CSCI 140 Project 3
Checkpoint deadline Wednesday, October 23 by 1700
Project Due Sunday, October 27 by 1700
For the third project, you will be creating a suite of functions to process and analyze a Twitter data set derived from one of the data sets used in the fivethirtyeight article https://fivethirtyeight.com/features/the-worst-tweeter-in-politics-isnt-trump/. The data file you have is a subset and was pre-processed, so do not use the original data.
For the first part of the project, you will write 2 functions and correct 1 function. The second part of the project is to write a short main program that uses your functions.
You are provided with two text files, sen_tweets_edited_2.csv (yes, it has a .csv extension, but is also a text file) and test_file.txt, which is a small file in the format of sen_tweets_edited_2.csv that you can use for testing. You are provided with two Python skeleton files Project_3.py for your functions and Project_3_Main.py for your main program. The other files provided are related to the optional extra credit.
About Twitter data and Tweets
Tweets are submissions to the social media platform Twitter. They are at most 140 characters long. Tweets often contain hashtags, which are words or phrases with the prefix #, for example, #avocadotoast. Tweets may reference other Twitter users, as indicated by the @, for example, @realDonaldTrump. Users may re-post a tweet posted by another user – this is referred to as re-tweeting or a retweet. Users may reply directly to a tweet posted by another individual, which is called a reply. Users may also indicate that they like a tweet by making it a favorite.
Each function description has a bulleted list of key points. These will answer many of your questions and provide hints, suggestions, and smaller subtasks if you do not know how to start or get stuck.
Keep the instructions open on your desktop while you work on the project and refer to them often.
Part One: Processing and analyzing tweets
All of the code for Part 1 should be submitted in the Project_3.py file. You will fill in your functions under the definition lines.
Task 1: process_hashes(tweet)
process_hashes has one required argument: a single string tweet that is a tweet. This function extracts hashtags from the tweet and returns them in a list. The function should work as follows:
• The function takes a string as input and returns a list. Think carefully about when you should convert this string to a list and which steps below are performed on the string, and which on the list. WRITE THIS OUT ON PAPER FIRST!
• All of the text should be put into upper case.
• Punctuation should be removed, including hashtags and at signs. Think carefully about what should get removed when, and how hashtags are identified. Specifically, you need to remove: ?.,!_:;#@
• Delete trailing 's on hashtags. This means if the hashtag is: #POTUS's, after processing (including step above) it should be: POTUS
• The function returns a list that contains the hashtags found. This list will be empty if there were no hashtags. The list might contain the same hashtag more than once if it appears in the tweet more than once. (Do not remove duplicates.)
• You should not be reading in from a file anywhere in this function. The input is a string called tweet. Use tweet for processing.
• We haven’t discussed regular expressions. If you use them, you may get none of the points regardless of what the internet tells you to do. (You can do the extra credit with regular expressions.)
Here are some examples, note this is not actual code, and words like “Input tweet” should not print when your function runs. Please also note these tweets were chosen because of the text they contain, they are not a political statement.
Input tweet: ".@realDonaldTrump's #SwampCabinet must be held accountable. I will hold them to account even if @SenateGOP won't. "
process_hashes returns:
['SWAMPCABINET']
Input tweet: 'Why are Republicans asking the Supreme Court to raise taxes on Alaska families? #ACAWorks for #Alaska http://t.co/lHeHUoXRQq'
process_hashes returns:
['ACAWORKS', 'ALASKA']
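To make the steps above concrete, here is one possible shape for the function. This is a sketch, not the required implementation – the exact order of steps and the variable names are up to you, and you should still work it out on paper first:

```python
def process_hashes(tweet):
    """Return a list of uppercase hashtags found in one tweet string."""
    hashtags = []
    for word in tweet.upper().split():
        if word.startswith('#'):          # identify hashtags BEFORE stripping '#'
            for ch in '?.,!_:;#@':        # remove the listed punctuation
                word = word.replace(ch, '')
            if word.endswith("'S"):       # trailing 's becomes 'S after upper()
                word = word[:-2]
            if word:
                hashtags.append(word)
    return hashtags
```

Note the order matters: the word is uppercased and tested for the leading # before the punctuation (including the # itself) is stripped out.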
Task 2: popular_tweets(filename, how = 'retweet', cutoff = 100, counts = False) correction
You have been given code for a function called popular_tweets(filename, how = 'retweet', cutoff = 100, counts = False) which does not work properly. Correct the code to meet the following specifications.
popular_tweets(filename, how = 'retweet', cutoff = 100, counts = False) returns a dictionary where the keys are strings corresponding to Twitter usernames and the values are either 1) a list of tweets (strings) by that user or 2) integers representing a count of tweets by that user. Whether the value is a list or an integer depends on the optional argument counts. More details of how the function works are presented below.
popular_tweets takes one required input filename, a string that is the name of a file containing tab-delimited Twitter data. The file specified by filename should be in the format (the gaps below represent tabs, \t, NOT spaces):
ID tweet_text replies retweets favorites username party state
For example, one line in the file might be:
179162 @DrNordal, it was nice meeting with you. Thanks for stopping by. 1 3 0 SenDeanHeller R NV
The ID is 179162. Next is the actual tweet. 1 is the number of replies. 3 is the number of retweets. 0 is the number of favorites. The username is SenDeanHeller. Party is R (Republican). State is NV (Nevada). You have been given two files in this format: test_file.txt is a small file you may wish to use for testing your code; sen_tweets_edited_2.csv is a larger data set. Pay close attention to what data is in which column.
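To see how one of these lines splits into columns, here is a quick sketch using the example line from above (with the tabs written as \t escapes):

```python
# The example line from the instructions, with literal tab characters
line = ("179162\t@DrNordal, it was nice meeting with you. Thanks for stopping by."
        "\t1\t3\t0\tSenDeanHeller\tR\tNV")

data = line.rstrip('\n').split('\t')
# data[0] = ID, data[1] = tweet text, data[2] = replies, data[3] = retweets,
# data[4] = favorites, data[5] = username, data[6] = party, data[7] = state
print(data[5], data[3])   # SenDeanHeller 3
```

Remember that every field comes out of split as a string, so the reply/retweet/favorite columns must be converted with int() before comparing them to cutoff.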
The optional arguments how and cutoff determine which tweets will be included in the final output dictionary. The function is looking for tweets that are popular based on either how many replies they received, how many times they were retweeted, or how many times they were favorited. For the tweet shown above, there was 1 reply, 3 retweets, and 0 favorites.
The argument how tells whether to determine popularity based on replies, retweets or favorites. It will always be a string with default value 'retweet'. The only other possible values for how are 'reply' and 'favorite'. Each option corresponds to a column in the original data file.
The argument cutoff tells how many replies/retweets/favorites the tweet must have to be included in the output. If a tweet has fewer replies/retweets/favorites than the value of cutoff, it will not appear in the output dictionary. The default value of cutoff is 100.
In the examples below, the original data is the same for all 3 cases, but the output is different. The original data consists of five lines in a file (this data is fabricated but based on real data):
185791 Bump fire stocks allow 88 151 293 SenF D CA
286443 Congratulations to @TeamCoachBuzz 5 30 268 timkaine D VA
25697 I'm grateful for #Arkansas 0 4 4 JohnBoozman R AR
286523 Sea level threatens Hampton Roads 69 473 1819 timkaine D VA
251370 I also stand ready to work 8 11 30 SenShelby R AL
• Popularity based on retweets, how = 'retweet', cutoff = 100; function returns:
{'SenF': ['Bump fire stocks allow'], 'timkaine': ['Sea level threatens Hampton Roads']}
Notice that each key is a username and the values are lists containing tweets that had at least 100 retweets. There are no entries in the dictionary for JohnBoozman (4 retweets) or SenShelby (11 retweets), and timkaine’s first tweet does not appear (only 30 retweets).
• Popularity based on retweets, how = 'retweet', cutoff = 10; function returns:
{'SenF': ['Bump fire stocks allow'], 'timkaine': ['Congratulations to @TeamCoachBuzz', 'Sea level threatens Hampton Roads'], 'SenShelby': ['I also stand ready to work']}
Notice that since the cutoff is lower, timkaine’s first tweet is now included (30 retweets) as well as SenShelby’s tweet (11 retweets)
• Popularity based on replies, how = 'reply', cutoff = 100; function returns:
{}
Notice that the output is an empty dictionary because NONE of the tweets had 100 replies or more.
• Popularity based on replies, how = 'reply', cutoff = 10; function returns:
{'SenF': ['Bump fire stocks allow'], 'timkaine': ['Sea level threatens Hampton Roads']}
Now with a lower cutoff, we get the tweet from SenF (88 replies) and one tweet from timkaine (69 replies).
• Popularity based on favorites, how = 'favorite', cutoff = 100; function returns:
{'SenF': ['Bump fire stocks allow'], 'timkaine': ['Congratulations to @TeamCoachBuzz', 'Sea level threatens Hampton Roads']}
Notice that we get both tweets for timkaine since both had at least 100 favorites. We also get the tweet from SenF which had 293 favorites.
The examples above are all cases where the argument counts is equal to False – this produces a dictionary with lists of tweets as values. When counts is True, the values in the dictionary are integers that represent how many tweets each user had that met the popularity cutoff. Here are the same examples from above with counts = True.
• how = 'retweet', cutoff = 100, counts = True; function returns:
{'SenF': 1, 'timkaine': 1}
Notice that the keys are the same as in the example above, but instead of the text of the tweets as values, we have a count of how many tweets there were. You will see this in all of the examples.
• how = 'retweet', cutoff = 10, counts = True; function returns:
{'SenF': 1, 'timkaine': 2, 'SenShelby': 1}
• how = 'reply', cutoff = 100, counts = True; function returns:
{}
• how = 'reply', cutoff = 10, counts = True; function returns:
{'SenF': 1, 'timkaine': 1}
• how = 'favorite', cutoff = 100, counts = True; function returns:
{'SenF': 1, 'timkaine': 2}
Debug the code you have been given to make the function work as described in the specifications. You may not add or delete lines, and you must correct the lines in place without introducing dramatic changes. Plausible changes include switching the position of two lines, adding some code to lines, changing code to use correct variables and/or indices that are incorrect, and changing indentation. You should not be creating new objects or variables, adding lines, deleting lines, or re-writing lines to use completely different structures (e.g. adding list comprehension when there is no list comprehension).
Key points for debugging: (these will make more sense if you look at the code first)
• Use the comments in the code to guide you; use the debugging print lines, but only with test_file.txt – otherwise they will print more output than is useful
• The purpose of the info dictionary is to map each possible value for the parameter how to the appropriate column in the text file. You should ensure this mapping is correct.
• The output of this function is a dictionary with strings as keys and lists of strings (tweets) as values OR integers as values depending on whether count is True or False. Make the dictionary every time with the username mapped to a list of tweets first. Then deal with converting this to counts after it is complete. The information used to make the dictionary comes from the file.
• When working with dictionaries with lists as the values, it is important to determine whether you are creating a new entry in the dictionary OR adding to an existing entry. These are two separate cases that the code needs to deal with.
• You must understand what each and every variable in the function does in order to debug correctly. Randomly guessing at things will make the code worse. It is suggested that you write down each variable name (result, line, file, info, data, how, cutoff, counts, item) and what it stands for BEFORE making any significant changes to the code. For example:
result – a dictionary that will store each username mapped to either a list of tweets OR a count of tweets for that user
line – a string that is a single line read in from a text file in the format described in the instructions
You fill in the rest!
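The two key patterns from the bullet points above – handling a new dictionary entry versus an existing one, and converting list values to counts only at the end – can be sketched in isolation with fabricated sample data (this is the pattern, not the function itself):

```python
# Pattern: build a dict mapping usernames to lists of tweets first
result = {}
sample = [('timkaine', 'Congratulations to @TeamCoachBuzz'),
          ('timkaine', 'Sea level threatens Hampton Roads'),
          ('SenShelby', 'I also stand ready to work')]
for username, text in sample:
    if username not in result:
        result[username] = [text]       # new entry: start a one-item list
    else:
        result[username].append(text)   # existing entry: add to its list

# Only AFTER the dictionary is complete, convert lists to counts if asked
counts = True
if counts:
    for username in result:
        result[username] = len(result[username])
print(result)   # {'timkaine': 2, 'SenShelby': 1}
```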
Task 3: graph_usage(tweets, cutoff = 10)
Write a function called graph_usage(tweets, cutoff = 10) which displays a graph showing how often hashtags occur in the input data. This function takes one required argument tweets, which is a list of strings, that is, a list of tweet strings. For example, tweets could be:
['This is a cool #tweet about #tweets', 'Sometimes people write #tweets on Twitter', 'Hey @username, did you #tweet that?','RT @madeup This person tweeted to @username about #tweets', 'Hey @username and @fakeuser, here is something #new about #tweets']
You MUST use your process_hashes function (see suggestions below) to extract the hashtags from the data. That will result in hashtags that are in all uppercase and have the # removed. Here is the output graph for the data above:
#tweets appeared 4 times in the list of tweets, #tweet appeared 2 times, and #new appeared once. Notice that on the y-axis of the graph each hashtag is in all caps and there is no #.
If you implement things correctly using what we have discussed you will have to exert minimal effort to generate these plots. They are the natural output of using a FreqDist object from NLTK and calling its plot function. If you are writing many lines of code to try to generate these plots, you are doing something wrong.
The optional argument cutoff determines how many hashtags to include. For example, if cutoff is 2, the output graph will only contain the 2 most common hashtags (don’t worry about ties, let NLTK handle that). Here is the plot using the same data with cutoff = 2:
So that you can see what this looks like when there are many hashtags, here is output from a larger data set. This is the output using cutoff = 10:
This is the output using cutoff = 5:
How to implement this function: Read this, it tells you exactly what to do
• The input to this function is a list called tweets. If you are trying to read in from a file anywhere in this function, you are doing it wrong.
• The first step in this function is to extract the hashtags from the text in tweets. Consider that tweets is a list of strings. You need to look at each item in tweets individually and use your process_hashes function to get the hashtags. Store all of these together in a new list.
• The NLTK functions/objects will automagically generate the plot you need. Make a FreqDist object from your list. Use the plot method, it will accept the cutoff argument directly. If you are not using NLTK, something is wrong.
• The function should return the FreqDist object you made. You will need to write the code to draw the plot, and then the last line of your function should return the FreqDist object.
• If you run in a notebook and don’t see a plot, make sure you have executed %matplotlib inline. Do not put that in the code you submit – it will crash at the command line
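Putting the bullet points together, the whole function can be as short as the following sketch. It assumes your process_hashes from Task 1 is in scope; FreqDist.plot accepts the number of samples to show as its first argument:

```python
from nltk import FreqDist

def graph_usage(tweets, cutoff=10):
    """Plot the `cutoff` most common hashtags found in a list of tweet strings."""
    all_tags = []
    for tweet in tweets:
        all_tags.extend(process_hashes(tweet))  # uppercase hashtags, '#' removed
    freq = FreqDist(all_tags)
    freq.plot(cutoff)   # draws the frequency plot for the top `cutoff` hashtags
    return freq
```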
Part 2: Main Program
Your main program will make use of the files of tweets you were given and the functions you wrote in the previous sections. Unless otherwise specified, you MUST use your functions. You should not repeat code from your functions inside of the main program.
You have also been given a helper function called join_tweets. The join_tweets function takes a dictionary as input and returns a list. The input dictionary will have lists of strings as values. For example, for the input:
{'SenF': ['Bump fire stocks allow'], 'timkaine': ['Congratulations to @TeamCoachBuzz', 'Sea level threatens Hampton Roads'], 'SenShelby': ['I also stand ready to work']}
The output for join_tweets would be:
['Bump fire stocks allow', 'Congratulations to @TeamCoachBuzz', 'Sea level threatens Hampton Roads', 'I also stand ready to work']
Notice that it took the values from every dictionary item and put them all in one list to produce a list of all of the tweets in the dictionary. This function is correct. Do not change it, but you will need to use it.
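For reference, the behavior described above amounts to flattening the dictionary's value lists into one list. The sketch below (with the hypothetical name flatten_tweet_dict, since the actual join_tweets code is already given to you and must not be changed) shows an equivalent computation:

```python
def flatten_tweet_dict(tweet_dict):
    """Gather every tweet from every value list into one flat list."""
    all_tweets = []
    for tweet_list in tweet_dict.values():
        all_tweets.extend(tweet_list)   # each value is itself a list of strings
    return all_tweets
```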
Your main program should do the following using the sen_tweets_edited_2.csv file as the input file for popular_tweets. If this file crashes your computer, you may use the test file instead, just indicate that in your write-up.
• Create a dictionary that has the most popular tweets based on being retweeted at least 1000 times. Your dictionary should contain the actual text of the tweets. (Hint: use popular_tweets) Print out ONLY the usernames for this dictionary.
• Plot the 10 most common hashtags from the most popular tweets based on being retweeted at least 1000 times. You will need to process your dictionary using join_tweets first. If you are using your dictionary as input to graph_usage, you are doing it wrong.
• Create a dictionary that has the most popular tweets based on having at least 500 replies. Your dictionary should contain counts of how many tweets each user had, NOT the actual tweets. (Hint: use popular_tweets) Print out ONLY the usernames for this dictionary.
Extra credit opportunities: (Worth up to 3%)
1) Write the process_hashes function using list comprehension
2) Write a function called process_hashes_regex that uses regular expressions. Fill in your function under the def line provided.
3) Re-write the join_tweets function using list comprehension – think very carefully about how to do this. Fill in your function under the def line provided for join_tweets_lc.
SUBMISSION EXPECTATIONS
Project_3.py Your implementations/correction of the functions in Part 1. This MUST include the code given to you for join_tweets. It may also contain any extra credit you completed.
Project_3_Main.py Your main program for Part 2. The first line in this file should be:
from Project_3 import *
Project_3.pdf A PDF document containing your reflections on the project including any extra credit opportunities you chose to pursue. You must also cite any sources you use. Please be aware that you can consult sources, but all code written must be your own. Programs copied in part or wholesale from the web or other sources will result in reporting of an Honor Code violation.
SUGGESTED COMPLETION SCHEDULE
This describes when to start each task to have a complete checkpoint submission.
After lecture 10/9 or 10/10: process_hashes and first task of main program
After lecture 10/16 or 10/17: popular_tweets and tasks 2 and 4 of main program
After lecture 10/18 or 10/20: graph_usage and task 3 of main program
POINT VALUES AND GRADING RUBRIC
-process_hashes (27.5 pts)
-popular_tweets correction (30 pts)
-graph_usage (27.5 pts)
-Main program (12.5 pts)
-Writeup (2.5 pts)