
Practical Data Mining

COMP-321B

Tutorial 5: Document Classification

Eibe Frank

July 17, 2014

© 2008-2014 University of Waikato

1 Introduction

A popular application area of machine learning is text classification. You may have actually encountered machine-learning-based text classification before: so-called "Bayesian" personalized spam filters are essentially document classifiers that classify email messages (i.e., short documents) into two groups, namely junk and not junk. They are trained on messages that have been manually labeled, perhaps by putting them into appropriate folders (e.g., "ham" vs. "spam"). In this tutorial, we look at how to perform document classification using tools in WEKA. Please submit this tutorial to the teaching staff at the end of the lab.

2 Data with string attributes

The raw data in document classification is text, but most machine learning algorithms expect examples that are described by a fixed set of nominal or numeric attributes. Hence we need to convert the text data into a form suitable for learning. This is usually done by creating a dictionary of terms from all the documents in the training corpus and making a numeric attribute for each of the terms. Then, for a particular document, the value of an attribute is based on the frequency of the corresponding term in the document, or it simply indicates the presence or absence of the term in the document.
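To make the transformation concrete, here is a small Python sketch (not part of WEKA) of the default behaviour: build a dictionary over all training documents, then map each document to a binary vector indicating which terms it contains. The tokenization here is just whitespace splitting; the real filter offers many more options.

```python
# Sketch (not WEKA itself): binary term-presence vectors, mirroring what
# StringToWordVector produces with default options.

def build_dictionary(docs):
    """Collect the set of distinct lower-cased terms over all documents."""
    terms = set()
    for doc in docs:
        terms.update(doc.lower().split())
    return sorted(terms)

def to_binary_vector(doc, dictionary):
    """1 if the term occurs in the document, else 0."""
    present = set(doc.lower().split())
    return [1 if t in present else 0 for t in dictionary]

docs = ["crude oil supply", "olive oil flavor"]
dictionary = build_dictionary(docs)
vectors = [to_binary_vector(d, dictionary) for d in docs]
```

Each document thus becomes one row over the same fixed attribute set, which is exactly the fixed-length representation that standard learning schemes require.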

The resulting data is usually very high-dimensional because the dictionary consists of hundreds or thousands of terms. Each document is thus represented as a vector in a high-dimensional space. There is obviously also the class attribute, which has the label of each document, and which we want to learn to predict based on the contents of the document.

The filter weka.filters.unsupervised.attribute.StringToWordVector can be used to perform the conversion of raw text into term-frequency-based attributes for us. The filter assumes that the text of the documents is stored in an attribute of type string, which is like a nominal attribute but without a pre-specified set of values. In the filtered data, this string attribute is replaced by a fixed set of numeric attributes, and the class attribute is made the first attribute in the resulting set. The numeric attributes have values 0 and 1 by default, indicating whether a term is absent or present in a document.

To perform document classification in WEKA, we first need to create an appropriate ARFF file with a string attribute that holds the documents' text, declared in the header of the ARFF file using @attribute document string, where document is the name of the attribute. We also need a nominal attribute that holds the documents' classifications.
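Such a file follows the standard ARFF layout; a minimal skeleton (the relation name, class labels, and document text below are made up for illustration, not taken from Table 1) might look like this:

```
@relation documents

@attribute document string
@attribute class {yes, no}

@data
'Some illustrative document text', yes
'Another short example', no
```

Note that each string value is quoted so that it is read as a single attribute value.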

To get some feeling for how this works, make an ARFF file from the labeled mini-"documents" in Table 1 and run StringToWordVector with default options on this data. How many attributes are generated? Now change the value of the option minTermFreq to two. What are the attributes that are generated?


Document text                                        Classification
The price of crude oil has increased significantly   yes
Demand of crude oil outstrips supply                 yes
Some people do not like the flavor of olive oil      no
The food was very oily                               no
Crude oil is in short supply                         yes
Use a bit of cooking oil in the frying pan           no

Table 1: Training “documents”.

Answer:

Build a J48 decision tree from the last version of the data you generated. State the tree that is output, in textual form.

Answer:

Usually, we want to build a classifier to classify new documents. Let us assume we want to classify the documents given in Table 2, once they have been converted into ARFF format, using question marks for the missing class labels. Classification should be done based on the decision tree classifier generated from the documents in Table 1.

To do this, we need to use the FilteredClassifier in conjunction with the StringToWordVector filter and the base classifier that we want to apply (i.e., J48). Configure the FilteredClassifier using default options for StringToWordVector and J48, and specify the data from Table 2 as the test set. (Ignore the fact that the Explorer states that the data has an unknown number of instances: it loads the data incrementally when the classifier is tested.) The data from Table 1 should be the training data (in unfiltered, raw form, so that its format is consistent with that of the test set), so make sure that data has been loaded in the Preprocess panel.

Document text                              Classification
Oil platforms extract crude oil            Unknown
Canola oil is supposed to be healthy       Unknown
Iraq has significant oil reserves          Unknown
There are different types of cooking oil   Unknown

Table 2: Test "documents".

We want to look at the classifications for the test data, so make sure that you select Output predictions under More options... in the Classify panel. After you have hit Start, take a look at the model and the predictions generated. What are the predictions (in the order in which the documents are listed in Table 2)? Please explain how they come about.

Answer:

3 Classifying actual short text documents

A well-known dataset for evaluating document classifiers is a corpus consisting of a collection of Reuters newswire articles. In C:\Program Files\Weka-3-6\data, you can find some training data derived from this benchmark corpus, more specifically ReutersCorn-train.arff and ReutersGrain-train.arff. There are also corresponding test datasets in *-test.arff. We will use these to experiment with two learning algorithms and the StringToWordVector filter.

The actual documents in the corn and grain data are the same; only the labels differ. In the first dataset, articles that talk about corn-related issues are labeled with a class value of 1, and other articles are labeled with the value 0. In the second dataset, the same kind of labeling is applied, but with respect to grain-related issues. In the first case, we want to build a classification model that can be used to identify articles that talk about corn; in the second case, we want to do the same for grain.

Let us now consider document classification using decision trees vs. document classification using a naive Bayes classifier. To this end, build document classifiers for the two training sets by applying the FilteredClassifier with StringToWordVector using (a) J48 and (b) NaiveBayesMultinomial, in each case evaluating them on the corresponding test set. (NaiveBayesMultinomial implements the multinomial naive Bayes model discussed in class that is specifically designed for document classification.) What is the percentage of correct classifications obtained for the four scenarios? Which classifier would you deploy based on your results?

Answer:

Percent correct is actually not the only evaluation metric that is commonly used for document classification problems. WEKA lists several other "per-class" evaluation statistics that are often used for evaluating information retrieval systems like search engines, under Detailed Accuracy By Class in the Classifier output text area. They are based on the number of true positives (TP), the number of false positives (FP), the number of true negatives (TN), and the number of false negatives (FN) in the test data. A true positive is an example in the test data that is classified correctly as an example of the target class concerned; a false positive is a (negative) example that is incorrectly assigned to the target class. FN and TN are defined analogously. (In the information retrieval setting, true positives would be documents or web sites that are returned by the search engine and that are actually relevant, false positives would be irrelevant documents that are returned by the search engine, and so on.)

It is clear that these numbers can only be computed on a per-class basis, where the corresponding class is considered the positive class, and all other classes are considered joined together as the negative class. Based on this approach, the statistics output by WEKA under Detailed Accuracy By Class are computed as follows (ROC Area, another statistic that is output by WEKA, is discussed further below):

• TP Rate: TP / (TP + FN)

• FP Rate: FP / (FP + TN)

• Precision: TP / (TP + FP)

• Recall: TP / (TP + FN)

• F-Measure: the harmonic mean of precision and recall
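The statistics above follow directly from the four counts; here is a small Python sketch that computes them (the counts used are made up for illustration, not taken from any WEKA run):

```python
# Per-class statistics as defined above, computed from raw counts.

def per_class_stats(tp, fp, tn, fn):
    tp_rate = tp / (tp + fn)          # also known as recall / sensitivity
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    recall = tp_rate                  # identical formula to TP Rate
    # harmonic mean of precision and recall
    f_measure = 2 * precision * recall / (precision + recall)
    return tp_rate, fp_rate, precision, recall, f_measure

# Hypothetical counts for one positive class:
stats = per_class_stats(tp=40, fp=10, tn=45, fn=5)
```

Note that TP Rate and Recall share the same formula; WEKA simply reports the value under both names.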

Based on the formulas, what are the best possible values for each of the statistics in this list, considering an arbitrary dataset and classifier? Describe in English when these values are attained assuming we treat newswire articles from the above corn category as positive examples and other articles as negative examples.


Answer:

In the Classifier Output, there is also the ROC Area, which differs from these other statistics because it is based on ranking the examples in the test data according to how likely they are to belong to the positive class. The likelihood is given by the class probability that the classifier predicts. (Note that most classifiers in WEKA can produce class probabilities in addition to discrete classifications.) The ROC area (also known as AUC) is the probability that a randomly chosen positive instance in the test data is ranked above a randomly chosen negative instance, based on the ranking produced by the classifier.
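This pairwise-ranking interpretation can be computed directly. Here is a small Python sketch (independent of WEKA): for every (positive, negative) pair of test instances, check whether the positive one receives the higher estimated probability, counting ties as half.

```python
# AUC as a pairwise-ranking probability: the fraction of
# (positive, negative) pairs in which the positive instance is
# scored above the negative one (ties count half).

def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfect ranking: every positive scored above every negative.
assert auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]) == 1.0

# One discordant pair out of four:
a = auc([0.9, 0.2, 0.3, 0.1], [1, 1, 0, 0])
```

This brute-force version is quadratic in the number of instances; it is meant only to make the definition concrete.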

The best outcome is that all positive examples are ranked above all negative examples. In that case the AUC is one. In the worst case it is zero. In the case where the ranking is essentially random, the AUC is 0.5. Hence we want an AUC that is at least 0.5, otherwise our classifier has not learned anything from the training data. Which of the two classifiers used above produces the best AUC for the two Reuters datasets? Compare this to the outcome for percent correct. What do the different outcomes mean?


Answer:

Interestingly, there is a close relationship between ROC Area and TP Rate/FP Rate. Rather than just obtaining a single pair of values for the latter statistics, we can obtain a whole range of value pairs by imposing different classification thresholds on the class probabilities estimated by the classifier: in the ranked list of instances, ranked according to their estimated probability of being positive, we simply count how many truly positive and negative instances exhibit a probability above the specified threshold.

By default, an instance is classified as “positive” if the estimated probability for the positive class exceeds 0.5, otherwise it is classified as negative (i.e., the most likely class is predicted). We can change this threshold from 0.5 to some other value between 0 and 1, and recompute the TP Rate/FP Rate. By repeating this with all possible different thresholds, we can compute what is called an ROC curve. You can graph it in WEKA by right-clicking on the Result list entry corresponding to the classification model you are evaluating, and selecting Visualize threshold curve.

When you do this, you get a plot with FP Rate on the x axis and TP Rate on the y axis. Depending on the classifier you use, this plot can be quite smooth, or it can be fairly discrete. The interesting thing is that if you connect the dots shown in the plot by lines, and you compute the area under the resulting curve, you get the ROC Area discussed above! That is where the acronym AUC for the ROC Area comes from (Area Under the Curve).
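The threshold-sweeping procedure, and the fact that the area under the resulting curve matches the pairwise AUC, can be illustrated with a short Python sketch (again independent of WEKA; ties in the estimated probabilities are ignored here for simplicity):

```python
# Trace a ROC curve by sweeping the classification threshold downward
# over the estimated probabilities, then integrate the area under it
# with the trapezoid rule.

def roc_points(scores, labels):
    pos = labels.count(1)
    neg = labels.count(0)
    # Sort by descending score; lowering the threshold past each score
    # admits one more instance as "positive".
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, y in ranked:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))   # (FP Rate, TP Rate)
    return points

def area_under(points):
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

pts = roc_points([0.9, 0.2, 0.3, 0.1], [1, 1, 0, 0])
```

For this made-up example, the area under the traced curve agrees with the pairwise AUC computed for the same scores and labels.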

What does the ideal ROC curve corresponding to perfect performance look like (a rough sketch is sufficient)? (Note that this corresponds to the case where all positive instances are ranked above all negative ones when sorted according to the estimated probability of being positive.)

Answer:


Using the threshold curve GUI, you can also plot other types of curves, e.g., a precision/recall curve (with Recall on the x axis and Precision on the y axis). This plots precision/recall values for each probability threshold evaluated. Change the axes to obtain a precision/recall curve. What shape does the ideal precision/recall curve corresponding to perfect performance have (a rough sketch is again sufficient)?

Answer:

4 Exploring the StringToWordVector filter

By default, the StringToWordVector filter simply outputs binary attributes for all raw single-word terms. These attributes indicate the presence or absence of a term as the attribute values in the transformed dataset. However, there are many options that can be changed, e.g.:

• outputWordCounts: if set to true, actual word frequencies are output, rather than just binary values.

• IDFTransform and TFTransform: when both are set to true, raw term frequencies are transformed into so-called TF × IDF values that are popular for representing documents in information retrieval applications.

• stemmer: allows you to choose from different word stemming algorithms that attempt to reduce words to their stems.

• useStopList: allows you to determine whether stop words are deleted. Stop words are words that are usually not informative (e.g., articles).

• tokenizer: allows you to choose a different tokenizer for generating terms, e.g., one that produces word n-grams instead of single words.

However, there are several other useful options as well. Check the descriptions in the GenericObjectEditor for more information by clicking on More.
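As an illustration of the TF × IDF weighting mentioned above, here is a rough Python sketch. WEKA's exact transformation (including its handling of normalization) may differ between versions, so treat this only as an approximation of the idea: frequent-in-document terms are up-weighted, while terms that occur in many documents are down-weighted.

```python
# Illustrative TF x IDF weighting, roughly in the spirit of what
# StringToWordVector does when outputWordCounts, TFTransform, and
# IDFTransform are all enabled. Approximate, not WEKA's exact formula.
import math

def tf_idf(docs):
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    # Document frequency: in how many documents does each term occur?
    df = {}
    for tokens in tokenized:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    weights = []
    for tokens in tokenized:
        w = {}
        for term in set(tokens):
            tf = tokens.count(term)
            w[term] = math.log(1 + tf) * math.log(n / df[term])
        weights.append(w)
    return weights

w = tf_idf(["crude oil supply", "olive oil flavor"])
# "oil" occurs in every document, so its IDF factor (and weight) is zero.
```

This also shows why very common terms carry little weight under this scheme, which is one reason the transform can help a learner focus on discriminative vocabulary.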

Experiment with the options that are available. What options give you the best AUC for the two datasets from above when NaiveBayesMultinomial is applied as the classifier? (Note that an exhaustive search is not required.)


Answer:

Often, not all attributes (i.e., terms) are important when classifying documents, because many words may be irrelevant for determining the topic of a document. We can use WEKA's AttributeSelectedClassifier, using ranking with InfoGainAttributeEval and the Ranker search, to try and eliminate attributes that are not so useful. As before, we need to use the FilteredClassifier to transform the data before it is passed to the AttributeSelectedClassifier. Experiment with this set-up, using default options for StringToWordVector and NaiveBayesMultinomial as the classifier. Vary the number of most informative attributes that are selected from the info-gain-based ranking by changing the value of the numToSelect field in the Ranker method. Record the AUC values you obtain. What number of attributes gives you the best AUC for the two datasets from above? What AUC values are the best ones you manage to obtain? (An exhaustive search is again not required.)
