The main aim of this coursework is to critically analyse data sources and data sets, critically evaluate possible data analytics challenges and solutions, choose, design and implement data mining algorithms for the chosen data, and apply data mining techniques to specific case studies. The coursework is worth 100 marks, and the distribution of marks is detailed in the marking scheme.
You are expected to explore one or two data sets of your choice from open data mining/machine learning (re)sources, to develop case studies, and to apply data mining techniques to the data set(s) for supervised and/or unsupervised learning, whichever is suitable given the data set characteristics. Tasks A, B, and G are compulsory, and you must choose 2 tasks from C, D, E, and F:
Task A. [20 marks] Data Choice.
Name the chosen data set(s) (from module resources, UCI ML Repository or other open data sources or own collection) and describe the data (e.g. attribute types and values, source of data)
[5 marks]
Adult data set, used to predict whether a person's annual income is more than 50K or not
http://archive.ics.uci.edu/ml/datasets/adult
Describe the data mining problem (and background) you will address e.g. as a classification, prediction, association, clustering, or text mining related exercise
[5 marks] Classification (prediction) and association rule mining
Introduce the specific data mining question(s) related to the problem, with specific reference to the dataset(s) and the expected or proposed outcome of the data mining task upon completion
[10 marks]
Predicting whether each adult's income is above or below 50K from the recorded attributes, and finding the best association rules that describe which attribute values go with each income class.
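To make the "best rules" idea concrete, here is a minimal sketch of how the support and confidence of one association rule can be computed. The records are made up for illustration and are not taken from the Adult data set; the helper names `matches`, `support` and `confidence` are hypothetical, not part of any library.

```python
from typing import Dict, List

# Toy records, invented for illustration only (not real Adult data set rows).
records: List[Dict[str, str]] = [
    {"education": "Bachelors", "sex": "Male", "income": ">50K"},
    {"education": "Bachelors", "sex": "Female", "income": ">50K"},
    {"education": "HS-grad", "sex": "Male", "income": "<=50K"},
    {"education": "Bachelors", "sex": "Male", "income": "<=50K"},
    {"education": "HS-grad", "sex": "Female", "income": "<=50K"},
]


def matches(record: Dict[str, str], items: Dict[str, str]) -> bool:
    """Return True if the record has every attribute=value pair in items."""
    return all(record.get(key) == value for key, value in items.items())


def support(data: List[Dict[str, str]], items: Dict[str, str]) -> float:
    """Fraction of records that contain all the given items."""
    return sum(matches(r, items) for r in data) / len(data)


def confidence(data: List[Dict[str, str]],
               antecedent: Dict[str, str],
               consequent: Dict[str, str]) -> float:
    """Among records matching the antecedent, the fraction also matching the consequent."""
    n_antecedent = sum(matches(r, antecedent) for r in data)
    if n_antecedent == 0:
        return 0.0  # rule never fires, so report zero confidence
    n_both = sum(matches(r, {**antecedent, **consequent}) for r in data)
    return n_both / n_antecedent


# Rule: education=Bachelors -> income=>50K
rule_support = support(records, {"education": "Bachelors", "income": ">50K"})
rule_confidence = confidence(records, {"education": "Bachelors"}, {"income": ">50K"})
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule is usually considered "good" when both values are above chosen thresholds; the thresholds themselves are a design choice and are not fixed by this sketch.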
Task B. [20 marks] Data Analysis
Analyse the data. Describe the context and content of the data in light of the chosen data mining task and proposed outcomes, discussing characteristics of the data that will present opportunities/challenges for the task.
[10 marks]
Add functionality/ies for descriptive statistics in the view of this problem: details of contextual programming and usage of graphical representation and analysis of data are expected. For example, sort the data by class, line or bar plot each of the features individually if applicable; for each feature compute characteristics like its minimum, maximum, mean, mode and standard deviation, and study the correlation between features for each class or the distance matrix. If the attributes are not normalized, then this step will also be considered.
[10 marks]
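The statistics listed above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the rows are made-up values, the column choice (age, hours per week, income class) is an assumption, and `describe`, `pearson` and `min_max_normalize` are hypothetical helper names.

```python
import math
import statistics
from collections import defaultdict

# Made-up rows for illustration: (age, hours_per_week, income_class).
rows = [
    (25, 40, "<=50K"),
    (38, 50, ">50K"),
    (28, 40, "<=50K"),
    (44, 45, ">50K"),
    (25, 30, "<=50K"),
]


def describe(values):
    """Minimum, maximum, mean, mode and sample standard deviation of a feature."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # sample (n - 1) standard deviation
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric features."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / (norm_x * norm_y)


def min_max_normalize(values):
    """Rescale values to the range [0, 1]; a constant feature maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


ages = [r[0] for r in rows]
hours = [r[1] for r in rows]

age_stats = describe(ages)
age_hours_corr = pearson(ages, hours)
ages_scaled = min_max_normalize(ages)

# Group one feature by class so the statistics can be compared per class.
ages_by_class = defaultdict(list)
for age, _, income in rows:
    ages_by_class[income].append(age)
mean_age_by_class = {c: statistics.mean(v) for c, v in ages_by_class.items()}

print(age_stats)
print(round(age_hours_corr, 3), ages_scaled, mean_age_by_class)
```

The same functions can be applied to every numeric attribute in turn; nominal attributes only have a meaningful mode, so they would need separate handling.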
Task B: Data Analysis
Visualization
The Preprocess tab of the Weka Explorer shows summary information about the data set:
· Sum of weights
· Attributes
· Instances
The Preprocess tab also shows statistics for whichever attribute is selected. For example, selecting Age, which is numeric, displays its Minimum, Maximum, Mean and StdDev values.
This helps me to understand the data better.
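The same kind of overview can be reproduced outside Weka as a cross-check. The sketch below is an assumption-laden illustration, not Weka's implementation: the sample rows are made up, only four of the data set's columns are used, "?" is treated as the missing-value marker, and `summarize` is a hypothetical helper.

```python
import csv
import io

# Made-up sample in the comma-separated layout of the data file;
# "?" is assumed to mark a missing value.
sample = """age,workclass,hours-per-week,income
39,Private,40,<=50K
50,?,13,<=50K
38,Private,40,>50K
"""

reader = csv.DictReader(io.StringIO(sample), skipinitialspace=True)
table = list(reader)

num_instances = len(table)               # number of rows (instances)
num_attributes = len(reader.fieldnames)  # number of columns (attributes)


def summarize(data, name):
    """Count missing and distinct values for one attribute."""
    values = [row[name] for row in data]
    present = [v for v in values if v != "?"]
    return {"missing": len(values) - len(present), "distinct": len(set(present))}


workclass_summary = summarize(table, "workclass")
print(num_instances, num_attributes, workclass_summary)
```

Comparing these counts with the figures shown in Weka is a quick way to confirm that the file was loaded as intended.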