Python Project
Due November 10, 1700 (5:00 PM)
For this project you will be creating a set of functions to process and visualize data related to the tweets you processed for Project 3. You will also correct part of a main program which performs quality control steps on the data, and then you will write additional lines of code in the main program to use your functions to display subsets of the data.
The data set is derived from one of the data sets used in the fivethirtyeight article https://fivethirtyeight.com/features/the-worst-tweeter-in-politics-isnt-trump/. Do not use the original data file. You have been given the following files to use for the project:
senators_parsed.csv is a subset of the original data and was pre-processed
test_file.csv is a much smaller file you can use for testing
formatted.csv is the result you expect after completing Part 1 – you can use this file if you cannot complete Part 1
us-states.json is a JSON file containing geographic information for the United States
All of your function code will be submitted in the Project_4.py file, and the main program in Project_4_Main.py - skeleton files have been provided.
In order to complete this assignment you must have functional versions of the following packages installed: pandas, numpy, matplotlib, seaborn, folium.
BE AWARE:
In order to receive full credit, you must use pandas functions and methods and the seaborn/matplotlib libraries to make plots. This means that implementations which write loops to iterate over data frames and series to carry out tasks that can be achieved with pandas functions/methods, and/or that use other imported objects, will receive minimal credit. Plots made using other software will not be accepted for credit. Manipulations performed on the data without corresponding code (e.g. opening data in Excel and editing it) will receive no credit.
Each function description has a bulleted list of key points. These will answer many of your questions and provide hints, suggestions, and smaller subtasks if you do not know how to start or get stuck.
Keep these instructions open on your Desktop while working and refer to them often. They will tell you exactly how to complete the assignment.
Part One: Data QC
The first part of this project requires reading in the data from the file senators_parsed.csv and reformatting some of the data. This will generate the data frame called sen_2016 that you will use for your main program in Part 3.
You have been provided with code in the main program to correct. You must edit these lines in place. You may not add any new lines of code or alter these lines dramatically (this means you must use the pandas functionality and should not add other functions, loops, or list comprehensions; you should use the columns the original code tries to use).
It will be helpful for you to look at the data frame after each step. Ask if you don’t know what this means.
Your code should do the following (more details are provided in the skeleton file):
-Read the data in from the file to a pandas data frame called sen
-Create a column in the data frame called year that contains the year the tweet was created, as a string. You will derive this from the created_at column, and it MUST be done using a lambda. Each year should have the prefix 20: for example, if created_at is 12/11/15 16:17, year should be 2015, NOT just 15.
HINT: first try to process a single string instead of working with the whole column of data, e.g. try to generate '2015' from the string '12/11/15 16:17'
-Create a column in the data frame called month that contains the name of the month the tweet was created, as a string. You will derive this from the created_at column using the months dictionary provided in the code, and it MUST be done using a lambda. For example, if created_at is 12/11/15 16:17, month should be December.
HINT: first try to process a single string instead of working with the whole column of data, e.g. try to generate 'December' from the string '12/11/15 16:17' using the months dictionary (a short sketch of this appears at the end of Part 1)
-Drop the bioguide, created_at and url columns
-Put the usernames into all caps
-Create a data frame that only contains data from November of 2016. We will call this data frame sen_2016 and you will use it in Part 3.
IMPORTANT: If the senators_parsed.csv file crashes your computer or takes too long to run when you try to run this code, use test_file.csv instead. It is in the same format. Just replace the filename in the first line of code.
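To make the two lambda hints above concrete, here is a minimal sketch of the single-string processing, run on one example value rather than the whole column. The months dictionary shown here is a truncated stand-in for the one provided in the code.

    # Sketch of the hinted single-string processing (stand-in months dict).
    months = {'1': 'January', '11': 'November', '12': 'December'}

    s = '12/11/15 16:17'

    # year: keep the date part, split on '/', take the last piece, prefix '20'
    year = '20' + s.split()[0].split('/')[-1]   # '2015'

    # month: the month number is everything before the first '/'
    month = months[s.split('/')[0]]             # 'December'

Once these expressions work on a single string, the same logic can go inside the lambdas you apply to the created_at column.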
Part Two: Functions
For this section of the project you will create 4 functions to use with your processed data frame (or any data in the same format such as the data in formatted.csv).
Function 1: subset_data(df, column = 'retweets', cutoff = 100, above = True)
The subset_data function returns a data frame that is a subset (or a copy, see below) of the data frame passed in by the user as the required argument df. The data frame df is subset based on the arguments column, cutoff, and above.
• column is a string that specifies which column (variable) in df should be used for subsetting
• cutoff specifies the value of interest in that column – this will be a numerical cutoff – it is only ONE number
• above is a Boolean. When above is True, include rows where the value in column is > cutoff. When above is False, include rows where the value in column is <= cutoff.
ALL YOU ARE DOING HERE IS SUBSETTING A DATA FRAME BASED ON VALUES IN A COLUMN. If you understand how to subset a data frame, this will be straightforward.
This is best demonstrated by example. Suppose we have the following data frame:
We call subset_data on this data frame with column = 'COL1', cutoff = 6, above = True. It returns:
Notice that this doesn't include 6 – only rows with COL1 values strictly greater than 6.
We call subset_data on this data frame with column = 'COL2', cutoff = 11, above = False. The result would be:
You can assume that the user will only pass in the name of a numeric column (column contains ints or floats) and that cutoff will be a float or integer. This means you don’t need to check the types of the inputs.
If the user passes in a column which is not in the data frame, you should return None.
If the user passes in a combination of arguments for which there are no rows, you should return an empty data frame. (HINT: if you write your code properly using subsetting, this will happen automatically). Here is an example of the latter.
We call subset_data on the original data frame df with column = 'COL1', cutoff = 150, and above = True. The result would be this empty data frame (notice it has the column names but no rows):
The best way to write this function is to write the logic out on paper first. The code itself is not complex, but it is easy to stumble if you don't work out the logic before you start coding.
Important points:
• You do not need to check input data types – df will always be a data frame, column will always be a string, cutoff will always be numeric, and above will always be True or False.
• This will only work if the column specified by column is in the data frame. Check that.
• If column is not in the data frame, the function should return None. You should not have a line of code that says return None; this should happen naturally.
• The data frame is passed in as the variable df. If there is any code in your function that reads in from a file, then you have done this wrong.
• Use subsetting. Taking the approach of deleting rows will be more difficult. Do not use a loop. Use pandas. Our reference implementation is 5 lines.
• When above is True, include rows where the value in column is greater than cutoff. When above is False, include rows where it is less than or equal to cutoff.
• If there are no rows that meet the cutoff, return an empty data frame. This will happen naturally if your code is correct. If you are constructing an empty data frame with pd.DataFrame, you are doing it wrong.
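Putting the bullet points together, a sketch along these lines satisfies all of the cases above (this is only one possible shape, not necessarily the reference implementation):

    def subset_data(df, column='retweets', cutoff=100, above=True):
        # only subset if the requested column actually exists
        if column in df.columns:
            if above:
                return df[df[column] > cutoff]   # strictly greater than cutoff
            return df[df[column] <= cutoff]      # less than or equal to cutoff
        # if column is missing, we fall off the end and return None naturally

Note that the empty-data-frame case needs no special handling: subsetting with a condition no rows satisfy already produces an empty data frame with the original column names.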
Function 2: aggregate_data(df, column, how = 'sum')
This function may sound very difficult to write. It is not, as long as you use your notes, which have the code almost verbatim. Read the bullet points to help you.
Write a function called aggregate_data that takes two required arguments as input: df, a data frame, and column, a string indicating the name of a column in that data frame. aggregate_data returns a new data frame where the data is grouped and summarized by column. The optional argument how indicates how to summarize/combine the data. The possible values for how are 'sum', 'count', 'mean', or 'median'.
There are some nuances to this that are best explained by example.
Suppose that we have the following data frame:
We call aggregate_data with column = 'COL3' and how = 'sum', the result is:
Notice that COL3 is now the index of the new data frame rather than a data column – the row names are the possible values of COL3.
The data for COL1 is the sum of COL1 values for when COL3 is center/left/right. For example, in the original data, there are 3 rows where COL3 is right – COL1 values in these rows are 5, 2, 100, so the value reported for right in the new data frame is 5+2+100 = 107.
Why is there no data for COL2? Because we asked for the sum – how do you find the sum of turtle and potato? If you ask for a mathematical operation like the sum or the mean, your new data frame will only have values for the numeric columns.
We call aggregate_data with column = 'COL3' and how = 'count', the result is:
Now we have data for COL1, COL2, and COL4, because all it is doing is counting how many observations there were where COL3 was center (1), left (2), and right (3).
Important points (start here if you feel like you have no idea what to do):
• This function is basically a wrapper around pandas functionality to group and aggregate data in a data frame. Find that in your notes and then work that into a function.
• You should be using pandas methods on the data frame. If a loop appears in your function, you have done something wrong. The pandas methods will naturally know what to do with numeric vs. non-numeric columns. Use them.
• The data is in df. Do not read anything in from a file.
• You do not have to check data types – df will always be a data frame, column will always be a string, and how will always be a string, specifically one of the options listed above.
• You need to check that column is in the data frame. If not, your function should return None. But you should not have a line of code that says return None.
• You do not need to check that how is a valid value – it will always be one of the options listed in the description.
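As a sketch of the shape this takes (assuming a recent pandas version, where the mathematical aggregations need numeric_only=True to skip text columns such as COL2 in the example):

    def aggregate_data(df, column, how='sum'):
        if column in df.columns:
            grouped = df.groupby(column)        # column becomes the index
            if how == 'count':
                return grouped.count()          # counts all columns, numeric or not
            elif how == 'sum':
                return grouped.sum(numeric_only=True)
            elif how == 'mean':
                return grouped.mean(numeric_only=True)
            else:                               # how == 'median'
                return grouped.median(numeric_only=True)
        # if column is missing, we fall off the end and return None naturally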
Function 3: plot_data(df, col1, col2, rotate_labels = False)
The plot_data function produces and returns a plot (a bar chart) that shows the counts of observations of one variable characterized by a second variable. The arguments to plot_data are:
• df, a Data Frame object
• col1, a string representing a column of numeric data in df
• col2, a string representing a column of categorical data in df
• rotate_labels, an optional Boolean argument indicating whether the x-axis data labels should be rotated for readability (see below)
The plot produced by the function will have one bar for each value of col2. The height of the bar will be the sum of the observations from col1 for that value of col2. This is best explained with an example. Suppose that we have a data frame with the following columns:
We call plot_data with col1 = 'COL1' and col2 = 'COL3':
Notice there is one bar for each value from COL3. The height of the center bar is 14, which is the sum of COL1 values when COL3 is center. Similarly, for left the height of the bar is 2+6 = 8, and the height of the bar for right is 5+2+100 = 107.
The x-axis labels are readable here, but if we had more categories, they might not be, so we might want to rotate them. Let’s take a look at the same plot with rotate_labels = True:
Notice that the only difference in the plots is that the x-axis labels are rotated 90 degrees.
Important points: (This tells you exactly what to do step by step!)
• df is a data frame. You should not be reading in a data frame from a file in this function.
• Both col1 and col2 need to be in the data frame in order to make a plot. If you can't make a plot, return None (but you should not have a line in your code that says return None). You do not need to check the types of the arguments or the type of data in the columns they correspond to.
• Use your aggregate_data function to create a new data frame where col2 is the index and the values in the data frame are the sums of col1. [If you figure out how to do this with seaborn without using your function that is OK too. It is just not OK to use a loop.]
• Make the graph from the data frame you made in the step above. You will need to pass in two arguments: the first is what to use on the x-axis, the second is what to use on the y-axis. THIS IS IN YOUR NOTES.
• When the optional argument rotate_labels is True, the x-axis labels should be rotated 90 degrees.
• Don’t forget to have your function return the plot! Remember plots may show up in the notebook but not on the command line and that is OK.
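Here is one way the pieces can fit together, sketched under the assumption that you reuse your aggregate_data function and plot with seaborn's barplot (your notes may use a different plotting call):

    import seaborn as sns

    def plot_data(df, col1, col2, rotate_labels=False):
        if col1 in df.columns and col2 in df.columns:
            sums = aggregate_data(df, col2, how='sum')    # col2 is now the index
            ax = sns.barplot(x=sums.index, y=sums[col1])  # one bar per col2 value
            if rotate_labels:
                ax.tick_params(axis='x', labelrotation=90)
            return ax
        # a missing column falls through and returns None naturally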
Function 4: map_data(df, data_col, json_file, key, cmap = 'BuPu', filename = 'map.html')
This function may sound difficult to write. It is not, as long as you use your notes, which have the code you need almost verbatim. Read the bullet points to help you.
Write a function called map_data which produces and returns a choropleth map and also saves the map to a file. The arguments to map_data are:
• df, a data frame where the index (row names) corresponds to the areas on the map (this will be states or countries or counties etc. – see example below)
• data_col, a string that is the name of a column in df with the numeric data to map
• json_file, a string that names a .json file with the geographic area information for mapping
• key, a string that is the identifier in the .json file that matches the index of df
• cmap, an optional string argument specifying a color map to use for the map
• filename, an optional string argument specifying the filename to save the map to
The choropleth map will have regions shaded in according to the values in data_col. Here is an example. Suppose we have the data frame:
We call map_data on this data frame with data_col = 'COL1', json_file = 'us-states.json', key = 'properties.name' and defaults for the other arguments. It will produce a map, saved in the file map.html, that looks like this (zoomed in on the US):
Note that the grey areas are states with no data. For the extra credit, you can change the color of states with no data and other options like the scale used.
Important points:
• All of the code is already in your notes. Modify it to use the arguments passed in to the function and to have it return the map. You also need to save the map to a file.
• Make sure that the column which is supposed to contain the data exists in the data frame.
• The map may not appear on screen when you run the function from the command line. That's fine. It should save to a file, however.
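A sketch of the folium version follows. The map center and zoom are assumptions, and note that folium's key_on argument expects a 'feature.' prefix, so the key from the example above ('properties.name') is passed to folium as 'feature.properties.name':

    import folium

    def map_data(df, data_col, json_file, key, cmap='BuPu', filename='map.html'):
        if data_col in df.columns:
            # starting view roughly centered on the US (an assumption)
            m = folium.Map(location=[39.8, -98.6], zoom_start=4)
            folium.Choropleth(
                geo_data=json_file,        # path to the .json file
                data=df[data_col],         # Series whose index matches the regions
                key_on='feature.' + key,   # folium requires the 'feature.' prefix
                fill_color=cmap,
            ).add_to(m)
            m.save(filename)               # save the map to a file as required
            return m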
Part Three: Putting it all together
For this part of the assignment you will add lines to the main program to use the functions you wrote in Part 2 on the re-formatted data frame you created in Part 1. The lines of code you write for Part 3 will follow (i.e. go under) the lines of code from Part 1 that you corrected. ASK IF YOU DO NOT UNDERSTAND THIS – YOUR CODE WILL NOT WORK PROPERLY OTHERWISE.
You must use your functions to accomplish the following tasks. You should not repeat code from the functions in the main program. You will have to think about which functions to use for each task – you may need to combine two or more of your functions to make these work.
1) Create a data frame where the index (row names) contains usernames (the user column) and the data shows the total number of retweets, favorites, and replies for each user.
2) Subset the data frame you made in the step above to include only those individuals with more than 100,000 total retweets. Print out the resulting data frame.
3) Create a data frame where the index (row names) contains party names and the data shows the mean number of retweets, favorites, and replies for each party.
4) Create a plot that shows the total number of favorites by Party
5) Make a map that shows the total number of replies by State. Hint: you will need to create a new data frame that contains this information using one of your functions before using your map function to make the map. If you are trying to pass in sen_2016, you are doing it wrong.
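To illustrate the kind of chaining these tasks call for, here is a sketch of tasks 1, 2, and 5. Column names such as 'user', 'retweets', 'replies', and 'state' are taken from the task descriptions above; check them against the actual columns in your data frame.

    # 1) totals per user: 'user' becomes the index, numeric columns are summed
    by_user = aggregate_data(sen_2016, 'user', how='sum')

    # 2) keep only users with more than 100,000 total retweets, then print
    print(subset_data(by_user, column='retweets', cutoff=100000, above=True))

    # 5) totals per state feed the map function (not sen_2016 itself)
    by_state = aggregate_data(sen_2016, 'state', how='sum')
    map_data(by_state, 'replies', 'us-states.json', 'properties.name')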
Extra credit:
Please note that the following tasks will be worth only a small percentage of extra credit (up to 3%). You may import any modules and use any functions you like for the extra credit.
1) Add functionality to the plot_data function to allow for a log scale on the y-axis. The function should automatically plot on a log scale if the smallest data value is more than 100 times smaller than the largest data value. Fill this in under def plot_data_wl (see the sketch after this list).
2) Customize and improve the maps produced by map_data. This includes changing the color for states with no data, adjusting the scale appropriately to the data, zooming in, labelling the legend and any other customizations that improve the look of the map.
3) If you don't make any adjustments to the map and plot total retweets, replies, or favorites by state, it looks a bit strange. Look into the data to explain what you are seeing. What is causing this trend/pattern on the maps? (You might find your Prj 3 functions helpful.)
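For extra credit 1, a sketch that builds on the plot_data sketch above. The threshold test follows the description: switch to a log scale when the smallest value is more than 100 times smaller than the largest.

    def plot_data_wl(df, col1, col2, rotate_labels=False):
        ax = plot_data(df, col1, col2, rotate_labels)
        if ax is not None:
            heights = aggregate_data(df, col2, how='sum')[col1]
            # log y-axis when the data spans more than two orders of magnitude
            if heights.min() > 0 and heights.max() > 100 * heights.min():
                ax.set_yscale('log')
        return ax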
SUBMISSION EXPECTATIONS
Project_4.py Your implementations of the four functions in Part 2. Project_4_Main.py Your correction of the code in Part 1 and the additional main program code from Part 3.
Project_4.pdf A PDF document containing your reflections on the project including any extra credit opportunities you chose to pursue. You must also cite any sources you use. Please be aware that you can consult sources, but all code written must be your own. This means the consultants cannot write your code. Code obviously not authored by you will receive no credit. Programs copied in part or wholesale from the web or other sources will result in reporting of an Honor Code violation.
POINT VALUES AND GRADING RUBRIC
Part 1: Code correction (20 pts) [YOU WILL BE GRADED ON STYLE]
Part 2: subset_data (20 pts), aggregate_data (15 pts), plot_data (15 pts), map_data (12.5 pts) [YOU WILL BE GRADED ON STYLE]
Part 3: Main program (15 pts) [YOU WILL BE GRADED ON STYLE]
Writeup (2.5 pts)
SUGGESTED COMPLETION SCHEDULE
Most of the code you need will come straight out of the lecture notes. You should complete the project as we go so the material is current to you. It will be much more difficult if you wait until a few days before it is due to start. We suggest you take a close look at the PS4 questions while working on this project – several of them were designed to correspond to this project.
Part 1 code correction: you can start this as soon as you do the readings, use material from lectures 10/28 or 10/29, and 10/30 or 10/31
subset_data: complete after lecture 10/28 or 10/29
aggregate_data: complete after lecture 10/28 or 10/29
Part 3, steps 1-3: complete after lecture 10/30 or 10/31
plot_data and Part 3, step 4: complete after lecture 11/4 or 11/5
map_data and Part 3, step 5: complete after lecture 11/6 or 11/7 (there are no checkpoint tests for the maps)