DS540: Advanced Python for Data Science

Project proposal 1: Exam performance analysis

You are requested to analyse a dataset of students’ exam performance using the Python programming language. The dataset contains students’ scores in three subjects (math, reading, and writing). Write a Python program to perform the following tasks:

Task 1: Load the dataset in a suitable format and show its properties, e.g. number of records, number of features and their types (hint: use the Pandas library to read the data). Inspect the dataset and perform data cleaning (e.g. removing duplicate records and fixing missing data).

Task 2: Provide descriptive statistics of the dataset and perform an exploratory data analysis (EDA) to answer the following analysis questions:
• Compare students’ exam scores in the different subjects (math, reading, writing). What trend did you find?
• Who performed better in the different subjects, male or female students?
• Show any attributes (features) that are correlated with exam scores (e.g. Does parental level of education affect their children’s exam scores? Does test preparation influence students’ performance?) (hint: use the corr() method in Pandas).
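The loading, cleaning and correlation steps of Tasks 1 and 2 could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference solution: the rows are made-up toy values standing in for the real file (which would normally be read with pd.read_csv), and the column names such as "math score" are assumptions about the dataset’s header.

```python
import pandas as pd

# Made-up toy rows standing in for the real dataset, which would normally be
# loaded with pd.read_csv(...). Column names and values are assumptions.
toy = pd.DataFrame({
    "gender": ["female", "male", "male", "female", "male", "male"],
    "test preparation course": ["none", "completed", "none", "completed", "none", "none"],
    "math score": [72.0, 69.0, 90.0, None, 76.0, 76.0],
    "reading score": [72, 90, 95, 57, 78, 78],
    "writing score": [74, 88, 93, 44, 75, 75],
})
SCORES = ["math score", "reading score", "writing score"]


def clean(df):
    """Drop exact duplicate rows, then fill missing scores with the column median."""
    out = df.drop_duplicates().reset_index(drop=True)
    out[SCORES] = out[SCORES].fillna(out[SCORES].median())
    return out


df = clean(toy)
print(df.shape)                              # number of records and features
print(df.dtypes)                             # feature types
print(df[SCORES].describe())                 # descriptive statistics
print(df.groupby("gender")[SCORES].mean())   # mean score per subject by gender

# corr() works on numeric columns, so encode the categorical attribute first.
df["prep completed"] = (df["test preparation course"] == "completed").astype(int)
corr = df[SCORES + ["prep completed"]].corr()
print(corr.round(2))
```

Filling gaps with the median is only one possible choice; dropping incomplete rows may be more appropriate, depending on how much data is missing. A multi-valued attribute such as parental level of education would need an ordinal or one-hot encoding before corr() can be applied to it.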
(You are encouraged to pose further analysis questions based on any trends you notice in the dataset.)

Task 3: Show a visual representation of your analysis (hint: use data visualization packages such as Matplotlib and Seaborn).

Task 4: Build a machine learning model to predict a student’s exam performance in each subject given the following attributes: gender, race/ethnicity, parental level of education, lunch, and test preparation course.

Download the dataset from the following link: Students exam performance data
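One possible shape for the Task 4 model of proposal 1 is sketched below: the five categorical attributes are one-hot encoded and a linear regression predicts the three scores together. The choice of scikit-learn, of linear regression, the score column names, and every row of toy data are assumptions made for illustration; the brief does not prescribe a particular model.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

FEATURES = ["gender", "race/ethnicity", "parental level of education",
            "lunch", "test preparation course"]
TARGETS = ["math score", "reading score", "writing score"]  # assumed column names

# Made-up toy rows; the real dataset would be loaded with pd.read_csv(...).
toy = pd.DataFrame(
    [
        ["female", "group B", "bachelor's degree", "standard", "none", 72, 72, 74],
        ["female", "group C", "some college", "standard", "completed", 69, 90, 88],
        ["male", "group A", "high school", "free/reduced", "none", 47, 57, 44],
        ["male", "group C", "some college", "standard", "none", 76, 78, 75],
        ["female", "group B", "high school", "free/reduced", "completed", 58, 70, 72],
        ["male", "group D", "bachelor's degree", "standard", "completed", 88, 85, 84],
        ["female", "group D", "some college", "free/reduced", "none", 55, 64, 62],
        ["male", "group B", "high school", "standard", "completed", 71, 68, 66],
    ],
    columns=FEATURES + TARGETS,
)

# One-hot encode the categorical inputs, then fit one linear model per subject.
# handle_unknown="ignore" lets the model score categories unseen during training.
model = make_pipeline(OneHotEncoder(handle_unknown="ignore"), LinearRegression())
model.fit(toy[FEATURES], toy[TARGETS])

new_student = pd.DataFrame(
    [["male", "group C", "high school", "standard", "completed"]],
    columns=FEATURES,
)
pred = model.predict(new_student)   # shape (1, 3): one prediction per subject
print(dict(zip(TARGETS, pred[0].round(1))))
```

On the full dataset, the model would normally be fitted on a training split (for example with train_test_split) and judged on held-out data with an error metric such as mean absolute error.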

Project proposal 2: Tweets sentiment analysis

You are requested to perform natural language processing on users’ tweets using the Python programming language. The dataset contains textual data obtained from Twitter users. Write a Python program to perform the following tasks:

Task 1: Load the dataset in a suitable format and show its properties, e.g. number of records, number of features and their types (hint: use the Pandas library to read the data). Inspect the dataset and perform data cleaning (e.g. removing duplicate records and fixing missing data).

Task 2: Pre-process the textual data and extract features using NLP techniques as follows:
• Pre-processing steps:
1. Convert to lowercase.
2. Remove stop words.
3. Normalise the text (punctuation removal, spelling correction, stemming).
4. Tokenisation.
• Extract the following features:
1. Compute word count per tweet.
2. Average word length per tweet.
3. Special character count per tweet.
4. Tweet sentiment (hint: use the TextBlob library to obtain tweets’ sentiments).
5. N-grams.
6. TF-IDF.
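The pre-processing steps and features 1, 2, 3 and 5 above can be sketched in plain Python as follows. This is an illustrative sketch, not a prescribed solution: the stop-word list is a tiny hand-written assumption, naive_stem is a crude hypothetical stand-in for a real stemmer (such as the Porter stemmer in NLTK), spelling correction is left out, and stop words are removed after tokenisation because that is simpler to express on a token list.

```python
import string

# Tiny hand-written stop-word list for illustration only (an assumption);
# a real project would use a fuller list, e.g. the one shipped with NLTK.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "at", "to", "of", "and", "you"}


def naive_stem(word):
    """Crude suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tweet):
    """Lowercase, strip punctuation, tokenise, drop stop words, then stem."""
    text = tweet.lower().translate(str.maketrans("", "", string.punctuation))
    return [naive_stem(tok) for tok in text.split() if tok not in STOP_WORDS]


def word_count(tweet):
    """Number of whitespace-separated words in the raw tweet."""
    return len(tweet.split())


def avg_word_length(tweet):
    """Mean length of the whitespace-separated words (0.0 for an empty tweet)."""
    words = tweet.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def special_char_count(tweet):
    """Count characters that are neither alphanumeric nor whitespace."""
    return sum(1 for ch in tweet if not ch.isalnum() and not ch.isspace())


def ngrams(tokens, n):
    """Return all contiguous n-token tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tweet = "The cats are RUNNING in the garden!!!"   # made-up example tweet
tokens = preprocess(tweet)
print(tokens)
print(word_count(tweet), round(avg_word_length(tweet), 2), special_char_count(tweet))
print(ngrams(tokens, 2))
```

The naive stemmer turns "running" into "runn", which shows why a proper stemmer is preferable in practice. Note also that the word-level features here are computed on the raw tweet, so attached punctuation counts towards word length; computing them on cleaned tokens is an equally defensible choice. Sentiment and TF-IDF are not covered by this sketch; they would typically rely on TextBlob and on a vectoriser such as scikit-learn’s TfidfVectorizer.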

Task 3: Using visual representations, show the following: the most commonly used words in tweets (using WordCloud), the number of positive, negative, and neutral tweets, and the word count distribution among the different sentiments (hint: use data visualization packages such as Matplotlib and Seaborn).
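Before plotting, the quantities behind these three visuals can be computed with the standard library, as in the sketch below. The tweets and polarity values are made up, and mapping a polarity score to a label by its sign (positive above zero, negative below, neutral at zero) is an assumed convention; TextBlob reports polarity in the range -1 to 1 but does not itself assign these three labels.

```python
from collections import Counter


def label_from_polarity(polarity):
    """Map a polarity score in [-1, 1] to a sentiment label (assumed thresholds)."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


# Hypothetical (tokens, polarity) pairs standing in for pre-processed tweets.
tweets = [
    (["great", "phone", "great", "camera"], 0.8),
    (["battery", "died", "again"], -0.4),
    (["phone", "arrived", "today"], 0.0),
]

# Word frequencies: the input for a word cloud.
word_freq = Counter(tok for tokens, _ in tweets for tok in tokens)

# Number of tweets per sentiment: the input for a bar chart.
sentiment_counts = Counter(label_from_polarity(p) for _, p in tweets)

# Word counts grouped by sentiment: the input for a histogram or box plot.
word_counts_by_sentiment = {}
for tokens, polarity in tweets:
    label = label_from_polarity(polarity)
    word_counts_by_sentiment.setdefault(label, []).append(len(tokens))

print(word_freq.most_common(2))
print(dict(sentiment_counts))
print(word_counts_by_sentiment)
```

These structures can then be handed to the plotting libraries, for instance word_freq to the wordcloud package’s generate_from_frequencies method and sentiment_counts to a Matplotlib bar chart.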
Task 4: Perform sentiment analysis using machine learning techniques to classify tweets into positive, negative, or neutral sentiments given the following features: word count per tweet, average word length per tweet, and special character count per tweet.
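A classifier over these three features might look like the following sketch. The choice of scikit-learn and of logistic regression is an assumption (the brief leaves the technique open), and the feature rows and labels are made up purely so that the example runs.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

LABELS = {"positive", "negative", "neutral"}

# Made-up rows: [word count, average word length, special character count].
X_train = [
    [12, 4.1, 3], [15, 4.5, 5], [10, 3.9, 4],
    [20, 5.2, 1], [18, 5.0, 0], [22, 4.8, 2],
    [6, 4.0, 0], [5, 3.5, 1], [7, 4.2, 0],
]
y_train = ["positive"] * 3 + ["negative"] * 3 + ["neutral"] * 3

# Scale the features so that no single one dominates, then fit the classifier.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

X_new = [[14, 4.3, 4], [6, 3.8, 0]]
pred = clf.predict(X_new)          # one label per new tweet
proba = clf.predict_proba(X_new)   # class probabilities, columns follow clf.classes_
print(list(pred))
```

On the real dataset, the labels would come from the sentiment feature extracted in Task 2, and the model should be assessed on a held-out split, for example with train_test_split and classification_report from scikit-learn.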
Download the dataset from the following link: Sentiment analysis Dataset

Project guidelines

The report should provide the following information:
• A written description of the data, with relevant spreadsheets.
• An explanation of how you analysed your data (hint: which Python packages/functions did you use?).
• An explanation of what data you analysed, followed by relevant visualization.
• The results of your analysis, followed by relevant visualization, with important results highlighted.
• Details of your machine learning model development.
Notes:
1. Follow the attached report template.
2. Your report must not exceed 10 pages, including any references.
3. You must work in groups of 1-2 students.
4. The submission deadline is Saturday of Week 13 (28/11/2020).
5. You must submit your Jupyter Notebook along with the report.