Practical Data Mining

COMP-321B

Tutorial 5: Document Classification

Eibe Frank

July 17, 2014

© 2008-2014 University of Waikato

1 Introduction

A popular application area of machine learning is text classification. You may have actually encountered machine-learning-based text classification before: so-called “Bayesian” personalized spam filters are essentially document classifiers that classify email messages (i.e., short documents) into two groups, namely junk and not junk. They are trained on messages that have been manually labeled, perhaps by putting them into appropriate folders (e.g., “ham” vs. “spam”). In this tutorial, we look at how to perform document classification using tools in WEKA. Please submit this tutorial to the teaching staff at the end of the lab.

2 Data with string attributes

The raw data in document classification is text, but most machine learning algorithms expect examples that are described by a fixed set of nominal or numeric attributes. Hence we need to convert the text data into a form suitable for learning. This is usually done by creating a dictionary of terms from all the documents in the training corpus and making a numeric attribute for each of the terms. Then, for a particular document, the value of an attribute is based on the frequency of the corresponding term in the document, or it simply indicates the presence or absence of the term in the document.

The resulting data is usually very high-dimensional because the dictionary consists of hundreds or thousands of terms. Each document is thus represented as a vector in a high-dimensional space. There is obviously also the class attribute, which has the label of each document, and which we want to learn to predict based on the contents of the document.

The filter weka.filters.unsupervised.attribute.StringToWordVector can be used to perform the conversion of raw text into term-frequency-based attributes for us. The filter assumes that the text of the documents is stored in an attribute of type string, which is like a nominal attribute but without a pre-specified set of values. In the filtered data, this string attribute is replaced by a fixed set of numeric attributes, and the class attribute is made the first attribute in the resulting set. The numeric attributes have values 0 and 1 by default, indicating whether a term is absent or present in a document.
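
The tutorial works through the Explorer GUI, but the same conversion can also be scripted against WEKA's Java API. The following is a minimal sketch under that assumption; the file name documents.arff and the position of the class attribute are illustrative choices, not something prescribed by the tutorial.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class WordVectorDemo {
        public static void main(String[] args) throws Exception {
            // Load the raw documents (hypothetical file name).
            Instances raw = DataSource.read("documents.arff");
            // Assume the class attribute is the last one in the raw file.
            raw.setClassIndex(raw.numAttributes() - 1);

            // Apply the filter with default options: binary 0/1 term indicators.
            StringToWordVector filter = new StringToWordVector();
            // filter.setMinTermFreq(2);  // programmatic equivalent of minTermFreq
            filter.setInputFormat(raw);
            Instances vectors = Filter.useFilter(raw, filter);

            System.out.println(vectors.numAttributes() + " attributes generated");
        }
    }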

To perform document classification in WEKA, we first need to create an appropriate ARFF file with a string attribute that holds the documents’ text, declared in the header of the ARFF file using @attribute document string, where document is the name of the attribute. We also need a nominal attribute that holds the documents’ classifications.
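
For example, an ARFF file for the data in Table 1 could begin as follows; the attribute names document and class are choices made here, not names required by WEKA:

    @relation training_documents

    @attribute document string
    @attribute class {yes, no}

    @data
    'The price of crude oil has increased significantly', yes
    'Demand of crude oil outstrips supply', yes
    % ... the remaining documents from Table 1 follow the same pattern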

To get some feeling for how this works, make an ARFF file from the labeled mini-“documents” in Table 1 and run StringToWordVector with default options on this data. How many attributes are generated? Now change the value of the option minTermFreq to two. What are the attributes that are generated?


Document text                                         Classification
The price of crude oil has increased significantly    yes
Demand of crude oil outstrips supply                  yes
Some people do not like the flavor of olive oil       no
The food was very oily                                no
Crude oil is in short supply                          yes
Use a bit of cooking oil in the frying pan            no

Table 1: Training “documents”.

Answer:

Build a J48 decision tree from the last version of the data you generated. State the tree that is output, in textual form.

Answer:

Usually, we want to build a classifier to classify new documents. Let us assume we want to classify the documents given in Table 2, once they have been converted into ARFF format, using question marks for the missing class labels. Classification should be done based on the decision tree classifier generated from the documents in Table 1.

To do this, we need to use the FilteredClassifier in conjunction with the StringToWordVector filter and the base classifier that we want to apply (i.e., J48). Configure the FilteredClassifier using default options for StringToWordVector and J48, and specify the data from Table 2 as the test set. (Ignore the fact that the Explorer states that the data has an unknown number of instances: it loads the data incrementally when the classifier is tested.) The data from Table 1 should be the training data (in unfiltered, raw form, so that its format is consistent with that of the test set), so make sure that data has been loaded in the Preprocess panel.

Document text                                Classification
Oil platforms extract crude oil              Unknown
Canola oil is supposed to be healthy         Unknown
Iraq has significant oil reserves            Unknown
There are different types of cooking oil     Unknown

Table 2: Test “documents”.
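
The same set-up has a programmatic counterpart in WEKA's Java API. The sketch below assumes the Table 1 and Table 2 data have been saved under the hypothetical names table1-train.arff and table2-test.arff, each with the class attribute last:

    import weka.classifiers.meta.FilteredClassifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class PredictDemo {
        public static void main(String[] args) throws Exception {
            Instances train = DataSource.read("table1-train.arff");
            Instances test = DataSource.read("table2-test.arff");
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            // Filter and base learner are applied together, so the test set is
            // transformed with the dictionary built from the training set.
            FilteredClassifier fc = new FilteredClassifier();
            fc.setFilter(new StringToWordVector());
            fc.setClassifier(new J48());
            fc.buildClassifier(train);

            // Print the predicted class label for each test document.
            for (int i = 0; i < test.numInstances(); i++) {
                double pred = fc.classifyInstance(test.instance(i));
                System.out.println(test.classAttribute().value((int) pred));
            }
        }
    }

Because the FilteredClassifier builds its dictionary from the training documents only, the test documents are mapped onto exactly the same attribute set, which is why the raw, unfiltered data can be used on both sides.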

We want to look at the classifications for the test data, so make sure that you select Output predictions under More options... in the Classify panel. After you have hit Start, take a look at the model and the predictions generated. What are the predictions (in the order in which the documents are listed in Table 2)? Please explain how they come about.

Answer:

3 Classifying actual short text documents

A well-known dataset for evaluating document classifiers is a corpus consisting of a collection of Reuters newswire articles. In C:\Program Files\Weka-3-6\data, you can find some training data derived from this benchmark corpus, more specifically ReutersCorn-train.arff and ReutersGrain-train.arff. There are also corresponding test datasets in *-test.arff. We will use these to experiment with two learning algorithms and the StringToWordVector filter.

The actual documents in the corn and grain data are the same; only the labels differ. In the first dataset, articles that talk about corn-related issues are labeled with a class value of 1, and other articles are labeled with the value 0. In the second dataset, the same kind of labeling is applied, but with respect to grain-related issues. In the first case, we want to build a classification model that can be used to identify articles that talk about corn; in the second case we want to do the same for grain.

Let us now consider document classification using decision trees vs. document classification using a naive Bayes classifier. To this end, build document classifiers for the two training sets by applying the FilteredClassifier with StringToWordVector using (a) J48 and (b) NaiveBayesMultinomial, in each case evaluating them on the corresponding test set. (NaiveBayesMultinomial implements the multinomial naive Bayes model discussed in class that is specifically designed for document classification.) What is the percentage of correct classifications obtained for the four scenarios? Which classifier would you deploy based on your results?
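
Programmatically, each of the four scenarios amounts to evaluating one FilteredClassifier on the matching test file. A sketch, assuming the Reuters ARFF files have been copied into the working directory:

    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayesMultinomial;
    import weka.classifiers.meta.FilteredClassifier;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class ReutersEval {
        public static void main(String[] args) throws Exception {
            Instances train = DataSource.read("ReutersCorn-train.arff");
            Instances test = DataSource.read("ReutersCorn-test.arff");
            // The class attribute is assumed to be the last one in these files.
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            FilteredClassifier fc = new FilteredClassifier();
            fc.setFilter(new StringToWordVector());
            fc.setClassifier(new NaiveBayesMultinomial());  // or new J48()
            fc.buildClassifier(train);

            // Evaluate on the held-out test documents.
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(fc, test);
            System.out.println("Percent correct: " + eval.pctCorrect());
        }
    }

Swapping in new J48() for the base classifier and the grain files for the corn files covers all four scenarios.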

Answer:

Percent correct is actually not the only evaluation metric that is commonly used for document classification problems. WEKA lists several other “per-class” evaluation statistics that are often used for evaluating information retrieval systems like search engines, under Detailed Accuracy By Class in the Classifier output text area. They are based on the number of true positives (TP), the number of false positives (FP), the number of true negatives (TN), and the number of false negatives (FN) in the test data. A true positive is an example in the test data that is classified correctly as an example of the target class concerned; a false positive is a (negative) example that is incorrectly assigned to the target class. FN and TN are defined analogously. (In the information retrieval setting, true positives would be documents or web sites that are returned by the search engine and that are actually relevant, false positives would be irrelevant documents that are returned by the search engine, and so on.)

It is clear that these numbers can only be computed on a per-class basis, where the corresponding class is considered the positive class, and all other classes are treated jointly as the negative class. Based on this approach, the statistics output by WEKA under Detailed Accuracy By Class are computed as follows (ROC Area, another statistic that is output by WEKA, is discussed further below; a sketch showing how to retrieve these values programmatically follows the list):

• TP Rate: TP / (TP + FN)

• FP Rate: FP / (FP + TN)

• Precision: TP / (TP + FP)

• Recall: TP / (TP + FN)

• F-Measure: the harmonic mean of precision and recall, i.e., 2 × Precision × Recall / (Precision + Recall)
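
Continuing the evaluation sketch from the previous section, the same per-class statistics can be read directly from the Evaluation object; treating class index 1 as the positive (topic-related) class is an assumption about how the labels are ordered in the ARFF header:

    // Appended to the main method of the evaluation sketch above.
    int positive = 1;  // assumed index of the positive class
    System.out.println("TP Rate:   " + eval.truePositiveRate(positive));
    System.out.println("FP Rate:   " + eval.falsePositiveRate(positive));
    System.out.println("Precision: " + eval.precision(positive));
    System.out.println("Recall:    " + eval.recall(positive));
    System.out.println("F-Measure: " + eval.fMeasure(positive));
    System.out.println("ROC Area:  " + eval.areaUnderROC(positive));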

Based on the formulas, what are the best possible values for each of the statistics in this list, considering an arbitrary dataset and classifier? Describe in English when these values are attained assuming we treat newswire articles from the above corn category as positive examples and other articles as negative examples.


Answer:

In the Classifier Output, there is also the ROC Area, which differs from these other statistics because it is based on ranking the examples in the test data according to how likely they are to belong to the positive class. The likelihood is given by the class probability that the classifier predicts. (Note that most classifiers in WEKA can produce class probabilities in addition to discrete classifications.) The ROC area (also known as AUC) is the probability that a randomly chosen positive instance in the test data is ranked above a randomly chosen negative instance, based on the ranking produced by the classifier.

The best outcome is that all positive examples are ranked above all negative examples. In that case the AUC is one. In the worst case it is zero. In the case where the ranking is essentially random, the AUC is 0.5. Hence we want an AUC above 0.5; otherwise our classifier has not learned anything from the training data. Which of the two classifiers used above produces the best AUC for the two Reuters datasets? Compare this to the outcome for percent correct. What do the different outcomes mean?


Answer:

Interestingly, there is a close relationship between ROC Area and TP Rate/FP Rate. Rather than just obtaining a single pair of values for the latter statistics, we can obtain a whole range of value pairs by imposing different classification thresholds on the class probabilities estimated by the classifier: in the ranked list of instances, ranked according to their estimated probability of being positive, we simply count how many truly positive and negative instances exhibit a probability above the specified threshold.

By default, an instance is classified as “positive” if the estimated probability for the positive class exceeds 0.5, otherwise it is classified as negative (i.e., the most likely class is predicted). We can change this threshold from 0.5 to some other value between 0 and 1, and recompute the TP Rate/FP Rate. By repeating this with all possible different thresholds, we can compute what is called an ROC curve. You can graph it in WEKA by right-clicking on the Result list entry corresponding to the classification model you are evaluating, and selecting Visualize threshold curve.

When you do this, you get a plot with FP Rate on the x axis and TP Rate on the y axis. Depending on the classifier you use, this plot can be quite smooth, or it can be fairly discrete. The interesting thing is that if you connect the dots shown in the plot by lines, and you compute the area under the resulting curve, you get the ROC Area discussed above! That is where the acronym AUC for the ROC Area comes from (Area Under the Curve).
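
The curve data can also be computed programmatically with weka.classifiers.evaluation.ThresholdCurve. A sketch, reusing the eval object from the earlier evaluation sketch and assuming a WEKA release in which Evaluation stores its predictions by default:

    import weka.classifiers.evaluation.ThresholdCurve;
    import weka.core.Instances;

    // Appended after eval.evaluateModel(...) in the evaluation sketch above.
    ThresholdCurve tc = new ThresholdCurve();
    // One row per threshold, holding the TP/FP/TN/FN counts, the derived
    // rates, precision, recall, and the probability threshold itself.
    Instances curve = tc.getCurve(eval.predictions(), 1);  // class index 1 assumed positive
    System.out.println("Area under the ROC curve: " + ThresholdCurve.getROCArea(curve));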

What does the ideal ROC curve corresponding to perfect performance look like (a rough sketch is sufficient)? (Note that this corresponds to the case where all positive instances are ranked above all negative ones when sorted according to the estimated probability of being positive.)

Answer:


Using the threshold curve GUI, you can also plot other types of curves, e.g., a precision/recall curve (with Recall on the x axis and Precision on the y axis). This plots precision/recall values for each probability threshold evaluated. Change the axes to obtain a precision/recall curve. What shape does the ideal precision/recall curve corresponding to perfect performance have (a rough sketch is again sufficient)?

Answer:

4 Exploring the StringToWordVector filter

By default, the StringToWordVector filter simply outputs binary attributes for all raw single-word terms. These attributes indicate the presence or absence of a term in the transformed dataset. However, there are many options that can be changed, e.g.:

• outputWordCounts: if set to true, actual word frequencies are output, rather than just binary values.

• IDFTransform and TFTransform: when both are set to true, raw term frequencies are transformed into so-called TF × IDF values that are popular for representing documents in information retrieval applications.

• stemmer: allows you to choose from different word stemming algorithms that attempt to reduce words to their stems.

• useStopList: allows you to determine whether stop words are deleted. Stop words are words that are usually not informative (e.g., articles).

• tokenizer: allows you to choose a different tokenizer for generating terms, e.g., one that produces word n-grams instead of single words.

However, there are several other useful options as well. Check the descriptions in the GenericObjectEditor for more information by clicking on More.
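
As an illustration of how a few of these options fit together, the filter handed to the FilteredClassifier could be configured as sketched below; the particular choices (word counts, TF × IDF, a Lovins stemmer, word 1- and 2-grams) are arbitrary starting points, not a recommendation:

    import weka.core.stemmers.LovinsStemmer;
    import weka.core.tokenizers.NGramTokenizer;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class FilterConfig {
        // Returns a StringToWordVector with a few non-default options set.
        static StringToWordVector configuredFilter() {
            StringToWordVector filter = new StringToWordVector();
            filter.setOutputWordCounts(true);  // frequencies instead of 0/1
            filter.setTFTransform(true);       // together with IDFTransform:
            filter.setIDFTransform(true);      // TF x IDF term weights
            filter.setStemmer(new LovinsStemmer());
            NGramTokenizer tok = new NGramTokenizer();
            tok.setNGramMinSize(1);
            tok.setNGramMaxSize(2);            // single words and word bigrams
            filter.setTokenizer(tok);
            return filter;
        }
    }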

Experiment with the options that are available. What options give you the best AUC for the two datasets from above when NaiveBayesMultinomial is applied as the classifier? (Note that an exhaustive search is not required.)


Answer:

Often, not all attributes (i.e., terms) are important when classifying documents, because many words may be irrelevant for determining the topic of a document. We can use WEKA’s AttributeSelectedClassifier, using ranking with InfoGainAttributeEval and the Ranker search, to try and eliminate attributes that are not so useful. As before we need to use the FilteredClassifier to transform the data before it is passed to the AttributeSelectedClassifier. Experiment with this set-up, using default options for StringToWordVector and NaiveBayesMultinomial as the classifier. Vary the number of most informative attributes that are selected from the info-gain-based ranking by changing the value of the numToSelect field in the Ranker method. Record the AUC values you obtain. What number of attributes gives you the best AUC for the two datasets from above? What AUC values are the best ones you manage to obtain? (An exhaustive search is again not required.)
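
A sketch of this nested set-up in Java form; the helper name build and its numToSelect parameter are introduced here purely for illustration:

    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.classifiers.bayes.NaiveBayesMultinomial;
    import weka.classifiers.meta.AttributeSelectedClassifier;
    import weka.classifiers.meta.FilteredClassifier;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class SelectedSetup {
        // StringToWordVector first, then info-gain ranking, then
        // multinomial naive Bayes on the selected attributes.
        static FilteredClassifier build(int numToSelect) {
            Ranker ranker = new Ranker();
            ranker.setNumToSelect(numToSelect);  // number of top-ranked terms kept

            AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
            asc.setEvaluator(new InfoGainAttributeEval());
            asc.setSearch(ranker);
            asc.setClassifier(new NaiveBayesMultinomial());

            FilteredClassifier fc = new FilteredClassifier();
            fc.setFilter(new StringToWordVector());
            fc.setClassifier(asc);
            return fc;
        }
    }

The resulting classifier can be trained and evaluated exactly like the FilteredClassifier in the earlier evaluation sketch, varying numToSelect between runs and recording the AUC each time.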
