Practical Data Mining
COMP-321B
Tutorial 5: Document Classification
Eibe Frank
July 17, 2014
© 2008-2014 University of Waikato
1 Introduction
A popular application area of machine learning is text classification. You may have actually encountered machine-learning-based text classification before: so-called “Bayesian” personalized spam filters are essentially document classifiers that classify email messages (i.e., short documents) into two groups, namely junk and not junk. They are trained on messages that have been manually labeled, perhaps by putting them into appropriate folders (e.g., “ham” vs “spam”). In this tutorial, we look at how to perform document classification using tools in WEKA. Please submit this tutorial to the teaching staff at the end of the lab.
2 Data with string attributes
The raw data in document classification is text, but most machine learning algorithms expect examples that are described by a fixed set of nominal or numeric attributes. Hence we need to convert the text data into a form suitable for learning. This is usually done by creating a dictionary of terms from all the documents in the training corpus and making a numeric attribute for each of the terms. Then, for a particular document, the value of an attribute is based on the frequency of the corresponding term in the document, or it simply indicates the presence or absence of the term in the document.
The resulting data is usually very high-dimensional because the dictionary consists of hundreds or thousands of terms. Each document is thus represented as a vector in a high-dimensional space. There is obviously also the class attribute, which has the label of each document, and which we want to learn to predict based on the contents of the document.
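To make this representation concrete, here is a minimal plain-Java sketch (not using WEKA) that builds a dictionary from two example sentences and turns a third one into a 0/1 presence vector; the sentences and all names are just placeholders for illustration:

import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class BagOfWordsSketch {
  public static void main(String[] args) {
    // A tiny training corpus (placeholder documents).
    List<String> corpus = Arrays.asList(
        "the price of crude oil has increased significantly",
        "demand of crude oil outstrips supply");

    // Build the dictionary: one entry per distinct lower-case term.
    Set<String> dictionary = new LinkedHashSet<String>();
    for (String doc : corpus) {
      dictionary.addAll(Arrays.asList(doc.toLowerCase().split("\\s+")));
    }

    // Represent one further document as a 0/1 presence vector over that dictionary.
    List<String> terms = Arrays.asList("crude oil is in short supply".split("\\s+"));
    int[] vector = new int[dictionary.size()];
    int i = 0;
    for (String term : dictionary) {
      vector[i++] = terms.contains(term) ? 1 : 0;
    }

    System.out.println(dictionary);
    System.out.println(Arrays.toString(vector));
  }
}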
The filter weka.filters.unsupervised.attribute.StringToWordVector can be used to perform the conversion of raw text into term-frequency-based attributes for us. The filter assumes that the text of the documents is stored in an attribute of type string, a nominal attribute without a pre-specified set of values. In the filtered data, this string attribute is replaced by a fixed set of numeric attributes and the class attribute is made the first attribute in the resulting set. The numeric attributes have values 0 and 1 by default, indicating whether a term is absent or present in a document.
To perform document classification in WEKA, we first need to create an appropriate ARFF file with a string attribute that holds the documents’ text, declared in the header of the ARFF file using @attribute document string, where document is the name of the attribute. We also need a nominal attribute that holds the documents’ classifications.
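For reference, the conversion performed by StringToWordVector can also be scripted against the WEKA Java API rather than run in the Explorer. The following is a minimal sketch; the file name documents.arff is a placeholder for an ARFF file of the kind just described, with the nominal class as the last attribute:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class ToWordVectors {
  public static void main(String[] args) throws Exception {
    // Load the raw data; the file name is just a placeholder.
    Instances raw = DataSource.read("documents.arff");
    raw.setClassIndex(raw.numAttributes() - 1);

    StringToWordVector filter = new StringToWordVector();
    // filter.setMinTermFreq(2);         // corresponds to the minTermFreq option
    // filter.setOutputWordCounts(true); // word counts instead of 0/1 values
    filter.setInputFormat(raw);

    Instances vectors = Filter.useFilter(raw, filter);
    System.out.println("Number of attributes: " + vectors.numAttributes());
    System.out.println(vectors.stringFreeStructure()); // prints the header only
  }
}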
To get some feeling for how this works, make an ARFF file from the labeled mini-“documents” in Table 1 and run StringToWordVector with default options on this data. How many attributes are generated? Now change the value of the option minTermFreq to two. What are the attributes that are generated?
Document text                                          Classification
The price of crude oil has increased significantly     yes
Demand of crude oil outstrips supply                   yes
Some people do not like the flavor of olive oil        no
The food was very oily                                 no
Crude oil is in short supply                           yes
Use a bit of cooking oil in the frying pan             no
Table 1: Training “documents”.
Answer:
Build a J48 decision tree from the last version of the data you generated. State the tree that is output, in textual form.
Answer:
Usually, we want to build a classifier to classify new documents. Let us assume we want to classify the documents given in Table 2, once they have been converted into ARFF format, using question marks for the missing class labels. Classification should be done based on the decision tree classifier generated from the documents in Table 1.
To do this, we need to use the FilteredClassifier in conjunction with the StringToWordVector filter and the base classifier that we want to apply (i.e., J48). Configure the FilteredClassifier using default options for StringToWordVector and J48, and specify the data from Table 2 as the test set. (Ignore the fact that the Explorer states that the data has an unknown number of instances: it loads the data incrementally when the classifier is tested.) The data from Table 1 should be the training data (in unfiltered, raw form, so that its format is consistent with that of the test set), so make sure that data has been loaded in the Preprocess panel.

Document text                                Classification
Oil platforms extract crude oil              Unknown
Canola oil is supposed to be healthy         Unknown
Iraq has significant oil reserves            Unknown
There are different types of cooking oil     Unknown

Table 2: Test “documents”.
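For reference, the same set-up can be reproduced with the WEKA Java API. The sketch below assumes the Table 1 data is stored in train.arff and the Table 2 data (with ? in place of the class labels) in test.arff; both file names are placeholders, and the class is assumed to be the last attribute:

import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class ClassifyDocuments {
  public static void main(String[] args) throws Exception {
    Instances train = DataSource.read("train.arff"); // Table 1 data (placeholder name)
    Instances test = DataSource.read("test.arff");   // Table 2 data (placeholder name)
    train.setClassIndex(train.numAttributes() - 1);
    test.setClassIndex(test.numAttributes() - 1);

    // FilteredClassifier applies StringToWordVector before J48 and re-applies the
    // dictionary built from the training data when classifying the test documents.
    FilteredClassifier fc = new FilteredClassifier();
    fc.setFilter(new StringToWordVector());
    fc.setClassifier(new J48());
    fc.buildClassifier(train);

    // Print the predicted class label for each test document (string attribute
    // assumed to be the first attribute).
    for (int i = 0; i < test.numInstances(); i++) {
      double pred = fc.classifyInstance(test.instance(i));
      System.out.println(test.instance(i).stringValue(0) + " -> "
          + test.classAttribute().value((int) pred));
    }
  }
}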
We want to look at the classifications for the test data, so make sure that you select Output predictions under More options... in the Classify panel. After you have hit Start, take a look at the model and the predictions generated. What are the predictions (in the order in which the documents are listed in Table 2)? Please explain how they come about.
Answer:
3 Classifying actual short text documents
A well-known dataset for evaluating document classifiers is a corpus consisting of a collection of Reuters newswire articles. In C:\Program Files\Weka-3-6\data, you can find some training data derived from this benchmark corpus, more specifically ReutersCorn-train.arff and ReutersGrain-train.arff. There are also corresponding test datasets in *-test.arff. We will use these to experiment with two learning algorithms and the StringToWordVector filter.
The actual documents in the corn and grain data are the same; only the labels differ. In the first dataset, articles that talk about corn-related issues are labeled with a class value of 1, and other articles are labeled with the value 0. In the second dataset, the same kind of labeling is applied, but with respect to grain-related issues. In the first case, we want to build a classification model that can be used to identify articles that talk about corn; in the second case, we want to do the same for grain.
Let us now consider document classification using decision trees vs. document classification using a naive Bayes classifier. To this end, build document classifiers for the two training sets by applying the FilteredClassifier with StringToWordVector using (a) J48 and (b) NaiveBayesMultinomial, in each case evaluating them on the corresponding test set. (NaiveBayesMultinomial implements the multinomial naive Bayes model discussed in class that is specifically designed for document classification.) What is the percentage of correct classifications obtained for the four scenarios? Which classifier would you deploy based on your results?
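If you want to script one of these scenarios rather than use the Explorer, a sketch along the following lines evaluates NaiveBayesMultinomial on the corn data; swapping in J48 or the grain files covers the other scenarios. The path is the data directory mentioned above and should be adjusted to your installation:

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class EvaluateReuters {
  public static void main(String[] args) throws Exception {
    String dir = "C:/Program Files/Weka-3-6/data/"; // adjust to your installation
    Instances train = DataSource.read(dir + "ReutersCorn-train.arff");
    Instances test = DataSource.read(dir + "ReutersCorn-test.arff");
    train.setClassIndex(train.numAttributes() - 1);
    test.setClassIndex(test.numAttributes() - 1);

    FilteredClassifier fc = new FilteredClassifier();
    fc.setFilter(new StringToWordVector());
    fc.setClassifier(new NaiveBayesMultinomial()); // or new J48()
    fc.buildClassifier(train);

    // Evaluate on the separate test set and report percent correct.
    Evaluation eval = new Evaluation(train);
    eval.evaluateModel(fc, test);
    System.out.println("Percent correct: " + eval.pctCorrect());
  }
}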
Answer:
Percent correct is actually not the only evaluation metric that is commonly used for document classification problems. WEKA lists several other “per-class” evaluation statistics that are often used for evaluating information retrieval systems like search engines, under Detailed Accuracy By Class in the Classifier output text area. They are based on the number of true positives (TP), the number of false positives (FP), the number of true negatives (TN), and the number of false negatives (FN) in the test data. A true positive is an example in the test data that is classified correctly as an example of the target class concerned; a false positive is a (negative) example that is incorrectly assigned to the target class. FN and TN are defined analogously. (In the information retrieval setting, true positives would be documents or web sites that are returned by the search engine and that are actually relevant, false positives would be irrelevant documents that are returned by the search engine, and so on.)
It is clear that these numbers can only be computed on a per-class basis, where the corresponding class is considered the positive class, and all other classes are considered joined together as the negative class. Based on this approach, the statistics output by WEKA under Detailed Accuracy By Class are computed as follows (ROC Area, another statistic that is output by WEKA, is discussed further below):
• TP Rate: TP / (TP + FN)
• FP Rate: FP / (FP + TN)
• Precision: TP / (TP + FP)
• Recall: TP / (TP + FN)
• F-Measure: the harmonic mean of precision and recall, i.e., 2 × Precision × Recall / (Precision + Recall)
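To see how these statistics relate to each other, here is a minimal plain-Java sketch that evaluates the formulas above for one target class; the counts are made up purely for illustration:

public class PerClassStats {
  public static void main(String[] args) {
    // Hypothetical counts for one target class on some test set.
    double tp = 40, fp = 10, tn = 945, fn = 5;

    double tpRate = tp / (tp + fn);   // identical to recall
    double fpRate = fp / (fp + tn);
    double precision = tp / (tp + fp);
    double recall = tpRate;
    double fMeasure = 2 * precision * recall / (precision + recall); // harmonic mean

    System.out.printf("TP Rate=%.3f FP Rate=%.3f Precision=%.3f Recall=%.3f F=%.3f%n",
        tpRate, fpRate, precision, recall, fMeasure);
  }
}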
Based on the formulas, what are the best possible values for each of the statistics in this list, considering an arbitrary dataset and classifier? Describe in English when these values are attained assuming we treat newswire articles from the above corn category as positive examples and other articles as negative examples.
Answer:
In the Classifier Output, there is also the ROC Area, which differs from these other statistics because it is based on ranking the examples in the test data according to how likely they are to belong to the positive class. The likelihood is given by the class probability that the classifier predicts. (Note that most classifiers in WEKA can produce class probabilities in addition to discrete classifications.) The ROC area (also known as AUC) is the probability that a randomly chosen positive instance in the test data is ranked above a randomly chosen negative instance, based on the ranking produced by the classifier.
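To make this definition concrete, the following plain-Java sketch (with made-up probabilities and labels) estimates the AUC directly as the fraction of positive/negative pairs that are ranked correctly, counting ties as half a correct pair:

public class PairwiseAUC {
  public static void main(String[] args) {
    // Hypothetical predicted probabilities of the positive class, and true labels.
    double[] prob = {0.9, 0.8, 0.7, 0.4, 0.3, 0.2};
    boolean[] positive = {true, false, true, true, false, false};

    double correctlyOrdered = 0;
    int pairs = 0;
    for (int i = 0; i < prob.length; i++) {
      if (!positive[i]) continue;
      for (int j = 0; j < prob.length; j++) {
        if (positive[j]) continue;
        pairs++;
        if (prob[i] > prob[j]) correctlyOrdered += 1.0;        // positive ranked above negative
        else if (prob[i] == prob[j]) correctlyOrdered += 0.5;  // ties count as half
      }
    }
    System.out.println("AUC estimate: " + correctlyOrdered / pairs);
  }
}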
The best outcome is that all positive examples are ranked above all negative examples. In that case the AUC is one. In the worst case it is zero. In the case where the ranking is essentially random, the AUC is 0.5. Hence we want an AUC that is at least 0.5, otherwise our classifier has not learned anything from the training data. Which of the two classifiers used above produces the best AUC for the two Reuters datasets? Compare this to the outcome for percent correct. What do the different outcomes mean?
Answer:
Interestingly, there is a close relationship between ROC Area and TP Rate/FP Rate. Rather than just obtaining a single pair of values for the latter statistics, we can obtain a whole range of value pairs by imposing different classification thresholds on the class probabilities estimated by the classifier: in the ranked list of instances, ranked according to their estimated probability of being positive, we simply count how many truly positive and negative instances exhibit a probability above the specified threshold.
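As a concrete illustration of this thresholding step, the plain-Java sketch below (again with made-up probabilities and labels) computes the TP Rate/FP Rate pair for one particular threshold; sweeping the threshold from 0 to 1 traces out the points of the curve discussed next:

public class ThresholdSketch {
  public static void main(String[] args) {
    // Hypothetical estimated probabilities of the positive class, and true labels.
    double[] prob = {0.9, 0.8, 0.7, 0.4, 0.3, 0.2};
    boolean[] positive = {true, false, true, true, false, false};
    double threshold = 0.5; // vary this between 0 and 1 to obtain further points

    int tp = 0, fp = 0, fn = 0, tn = 0;
    for (int i = 0; i < prob.length; i++) {
      boolean predictedPositive = prob[i] > threshold;
      if (positive[i] && predictedPositive) tp++;
      else if (!positive[i] && predictedPositive) fp++;
      else if (positive[i]) fn++;
      else tn++;
    }
    System.out.println("TP Rate: " + (double) tp / (tp + fn));
    System.out.println("FP Rate: " + (double) fp / (fp + tn));
  }
}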
By default, an instance is classified as “positive” if the estimated probability for the positive class exceeds 0.5, otherwise it is classified as negative (i.e., the most likely class is predicted). We can change this threshold from 0.5 to some other value between 0 and 1, and recompute the TP Rate/FP Rate. By repeating this with all possible different thresholds, we can compute what is called an ROC curve. You can graph it in WEKA by right-clicking on the Result list entry corresponding to the classification model you are evaluating, and selecting Visualize threshold curve.
When you do this, you get a plot with FP Rate on the x axis and TP Rate on the y axis. Depending on the classifier you use, this plot can be quite smooth, or it can be fairly discrete. The interesting thing is that if you connect the dots shown in the plot by lines, and you compute the area under the resulting curve, you get the ROC Area discussed above! That is where the acronym AUC for the ROC Area comes from (Area Under the Curve).
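The same curve and area can also be obtained programmatically. The sketch below uses WEKA's ThresholdCurve class on the predictions recorded by an Evaluation object; the file names, the positive class index of 1, and the choice of NaiveBayesMultinomial are assumptions for illustration, and method signatures may differ slightly between WEKA versions:

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.evaluation.ThresholdCurve;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class ROCCurveSketch {
  public static void main(String[] args) throws Exception {
    Instances train = DataSource.read("ReutersCorn-train.arff"); // adjust path as needed
    Instances test = DataSource.read("ReutersCorn-test.arff");
    train.setClassIndex(train.numAttributes() - 1);
    test.setClassIndex(test.numAttributes() - 1);

    FilteredClassifier fc = new FilteredClassifier();
    fc.setFilter(new StringToWordVector());
    fc.setClassifier(new NaiveBayesMultinomial());
    fc.buildClassifier(train);

    Evaluation eval = new Evaluation(train);
    eval.evaluateModel(fc, test);

    // One row per threshold, with columns such as TP Rate, FP Rate, Precision, Recall, ...
    ThresholdCurve tc = new ThresholdCurve();
    Instances curve = tc.getCurve(eval.predictions(), 1); // class index 1 assumed positive
    System.out.println("Points on the curve: " + curve.numInstances());
    System.out.println("Area under the curve: " + ThresholdCurve.getROCArea(curve));
    System.out.println("ROC Area from Evaluation: " + eval.areaUnderROC(1));
  }
}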
What does the ideal ROC curve corresponding to perfect performance look like (a rough sketch is sufficient)? (Note that this corresponds to the case where all positive instances are ranked above all negative ones when sorted according to the estimated probability of being positive.)
Answer:
Using the threshold curve GUI, you can also plot other types of curves, e.g., a precision/recall curve (with Recall on the x axis and Precision on the y axis). This plots precision/recall values for each probability threshold evaluated. Change the axes to obtain a precision/recall curve. What shape does the ideal precision/recall curve corresponding to perfect performance have (a rough sketch is again sufficient)?
Answer:
4 Exploring the StringToWordVector filter
By default, the StringToWordVector filter simply outputs binary attributes for all raw single-word terms. These attributes indicate the presence or absence of a term in the transformed dataset. However, there are many options that can be changed, e.g.:
• outputWordCounts: if set to true, actual word frequencies are output, rather than just binary values.
• IDFTransform and TFTransform: when both are set to true, raw term frequencies are transformed into so-called TF × IDF values that are popular for representing documents in information retrieval applications.
• stemmer: allows you to choose from different word stemming algorithms that attempt to reduce words to their stems.
• useStopList: allows you to determine whether stop words are deleted. Stop words are words that are usually not informative (e.g., articles).
• tokenizer: allows you to choose a different tokenizer for generating terms, e.g., one that produces word n-grams instead of single words.
However, there are several other useful options as well. Check the descriptions in the GenericObjectEditor for more information by clicking on More.
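As a sketch of how some of these options could be set programmatically, the setter names below correspond to the property names shown in the WEKA 3.6 GenericObjectEditor, but you should double-check them against your installed version:

import weka.core.stemmers.LovinsStemmer;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class ConfigureFilter {
  public static void main(String[] args) {
    StringToWordVector filter = new StringToWordVector();
    filter.setOutputWordCounts(true);       // word frequencies instead of 0/1 values
    filter.setTFTransform(true);            // log-scaled term frequencies...
    filter.setIDFTransform(true);           // ...combined with inverse document frequency
    filter.setStemmer(new LovinsStemmer()); // one of the available stemming algorithms
    filter.setUseStoplist(true);            // delete common, uninformative words

    NGramTokenizer tokenizer = new NGramTokenizer();
    tokenizer.setNGramMinSize(1);
    tokenizer.setNGramMaxSize(2);           // single words and word bigrams
    filter.setTokenizer(tokenizer);
  }
}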
Experiment with the options that are available. What options give you the best AUC for the two datasets from above when NaiveBayesMultinomial is applied as the classifier? (Note that an exhaustive search is not required.)
Answer:
Often, not all attributes (i.e., terms) are important when classifying documents, because many words may be irrelevant for determining the topic of a document. We can use WEKA’s AttributeSelectedClassifier, using ranking with InfoGainAttributeEval and the Ranker search, to try and eliminate attributes that are not so useful. As before, we need to use the FilteredClassifier to transform the data before it is passed to the AttributeSelectedClassifier. Experiment with this set-up, using default options for StringToWordVector and NaiveBayesMultinomial as the classifier. Vary the number of most informative attributes that are selected from the info-gain-based ranking by changing the value of the numToSelect field in the Ranker method. Record the AUC values you obtain. What number of attributes gives you the best AUC for the two datasets from above? What AUC values are the best ones you manage to obtain? (An exhaustive search is again not required.)
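A sketch of this nested set-up via the WEKA Java API might look as follows; the file names, the value of numToSelect, and the positive class index of 1 are placeholder choices that you would adjust and vary:

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class SelectAttributes {
  public static void main(String[] args) throws Exception {
    Instances train = DataSource.read("ReutersCorn-train.arff"); // adjust path as needed
    Instances test = DataSource.read("ReutersCorn-test.arff");
    train.setClassIndex(train.numAttributes() - 1);
    test.setClassIndex(test.numAttributes() - 1);

    // Rank the word attributes by information gain and keep the top numToSelect.
    Ranker ranker = new Ranker();
    ranker.setNumToSelect(100); // vary this value and record the resulting AUC

    AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
    asc.setEvaluator(new InfoGainAttributeEval());
    asc.setSearch(ranker);
    asc.setClassifier(new NaiveBayesMultinomial());

    // As before, StringToWordVector runs first, inside a FilteredClassifier.
    FilteredClassifier fc = new FilteredClassifier();
    fc.setFilter(new StringToWordVector());
    fc.setClassifier(asc);
    fc.buildClassifier(train);

    Evaluation eval = new Evaluation(train);
    eval.evaluateModel(fc, test);
    System.out.println("ROC Area: " + eval.areaUnderROC(1)); // class index 1 assumed positive
  }
}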