{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " Assignment 2: Data Preprocessing
\n", "
\n", "UFO Sighting Data Exploration
\n", "
\n", " MCIS 6283-Machine Learning
\n", "\n", "Due date: Jan 23rd, 2019 (Wednesday)
\n", "Total Points: 100
\n", "\n", "Instructor: Xin Yang
\n", "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Please put your name, student ID, date and time here (5 points)\n", "* Name:Sannidha Nallamothu\n", "* Student ID:999900278\n", "* Date:\n", "* Time:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* In this assignment, you will investigate UFO data over the last century to gain some insight.\n", "* Please use all the techniques we have learned in the class to preprocesss/clean the dataset
ufo_sightings_large.csv
. \n", "* After the dataset is preprocessed, please split the dataset into training sets and test sets\n", "* Fit KNN to the training sets. \n", "* Print the score of KNN on the test sets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Import dataset \"ufo_sightings_large.csv\" in pandas (5 points)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Checking column types & Converting Column types (10 points)\n", "Take a look at the UFO dataset's column types using the dtypes attribute. Please convert the column types to the proper types.\n", "For example, the date column, which can be transformed into the datetime type. \n", "That will make our feature engineering efforts easier later on." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Dropping missing data (10 points)\n", "Let's remove some of the rows where certain columns have missing values. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Extracting numbers from strings (10 points)\n", "The
length_of_time column in the UFO dataset is a text field that has the number of \n", "minutes within the string. \n", "Here, you'll extract that number from that text field using regular expressions." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Example sketch: pull the number of minutes out of the free-text field with a regular expression\n", "import re\n", "\n", "def return_minutes(time_string):\n", "    # Grab the first run of digits, if any\n", "    num = re.search(r\"\\d+\", str(time_string))\n", "    if num is not None:\n", "        return int(num.group(0))\n", "\n", "ufo[\"minutes\"] = ufo[\"length_of_time\"].apply(return_minutes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Identifying features for standardization (10 points)\n", "In this section, you'll investigate the variance of columns in the UFO dataset to \n", "determine which features should be standardized. You can log normalize the high-variance column." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "\n", "# Example sketch: check the variance of the numeric columns\n", "print(ufo[[\"seconds\", \"minutes\"]].var())\n", "\n", "# 'seconds' typically shows far higher variance, so log normalize it\n", "# (zeros would give -inf; filter or offset them first if present)\n", "ufo[\"seconds_log\"] = np.log(ufo[\"seconds\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Encoding categorical variables (20 points)\n", "There are a couple of columns in the UFO dataset that need to be encoded before they can be \n", "modeled through scikit-learn. \n", "You'll do that transformation here,
using both binary and one-hot encoding methods." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Example sketch: binary-encode 'country' and one-hot encode 'state'\n", "# (the column names and the 'us' value are assumptions -- adjust to the actual data;\n", "# the target column 'type' should not be one-hot encoded into the features)\n", "ufo[\"country_enc\"] = ufo[\"country\"].apply(lambda val: 1 if val == \"us\" else 0)\n", "state_dummies = pd.get_dummies(ufo[\"state\"], prefix=\"state\")\n", "ufo = pd.concat([ufo, state_dummies], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Text vectorization (10 points)\n", "Let's transform the
desc column in the UFO dataset into tf-idf vectors, \n", "since there's likely something we can learn from this field." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Example sketch: turn the free-text 'desc' column into tf-idf vectors\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "vec = TfidfVectorizer()\n", "desc_tfidf = vec.fit_transform(ufo[\"desc\"].astype(str))\n", "print(desc_tfidf.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. Selecting the ideal dataset (10 points)\n", "Let's get rid of some of the unnecessary features. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Example sketch: drop columns made redundant by the steps above\n", "# (the exact list depends on your earlier preprocessing choices)\n", "to_drop = [\"city\", \"country\", \"date\", \"desc\", \"lat\", \"length_of_time\", \"long\", \"recorded\", \"seconds\", \"state\"]\n", "ufo = ufo.drop([c for c in to_drop if c in ufo.columns], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 9. Split the X and y using train_test_split, setting stratify = y (5 points)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X = ufo.drop([\"type\"], axis=1)\n", "y = ufo[\"type\"].astype(str)\n", "\n", "# Stratified split keeps the class proportions in both sets\n", "# (classes with only one row must be dropped first for stratify to work)\n", "train_X, test_X, train_y, test_y = train_test_split(X, y, stratify=y, random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 10. Fit knn to the training sets and print the score of knn on the test sets (5 points)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "knn = KNeighborsClassifier(n_neighbors=5)\n", "# Fit knn to the training sets\n", "knn.fit(train_X, train_y)\n", "# Print the score of knn on the test sets\n", "print(knn.score(test_X, test_y))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }