Using a Large Set of Weak Classifiers for Text Analytics

Steven J. Simske; A. Marie Vans

doi:10.2352/issn.2168-3204.2017.1.0.146

Abstract

TF*IDF is a common approach used for text mining and information retrieval. In earlier work, we described a method that uses 112 variations of the TF*IDF equation to classify 588 CNN news articles belonging to 12 classes. We found that no single TF*IDF variant could accurately classify all the documents; the highest accuracy attained by any single variant was 45%. In this article, we extend that work to show how different measurements derived from the TF*IDF classification results can indicate that some classes may be logically inconsistent as classes. These methods may also be used to create more cohesive classes.
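To make the idea of several TF*IDF variants acting as weak classifiers concrete, the following is a minimal sketch and not the method used in the paper. The toy corpus, the two TF forms, the two IDF forms, and the names `TRAIN`, `TF_VARIANTS`, `IDF_VARIANTS`, `classify`, and `predict_all` are all assumptions made for illustration; the 112 variations mentioned above are not reproduced here. Each class is treated as one pooled "document" when computing IDF, which is also an assumption.

```python
import math
from collections import Counter

# Hypothetical toy training data: class label -> tokenized documents.
TRAIN = {
    "sports": [["team", "win", "game"], ["game", "score", "team"]],
    "politics": [["vote", "election", "senate"], ["senate", "bill", "vote"]],
}

# Illustrative TF and IDF forms (assumed; not the paper's 112 variations).
TF_VARIANTS = {
    "raw": lambda count: float(count),
    "log": lambda count: 1.0 + math.log(count),
}
IDF_VARIANTS = {
    "log": lambda n, df: math.log(n / df),
    "ratio": lambda n, df: n / df,
}


def class_term_counts(train):
    """Pool each class's documents into a single term-count table."""
    return {
        label: Counter(tok for doc in docs for tok in doc)
        for label, docs in train.items()
    }


def classify(tokens, train, tf, idf):
    """Return the class whose summed TF*IDF weight over the tokens is largest."""
    counts = class_term_counts(train)
    n_classes = len(counts)
    scores = {}
    for label, term_counts in counts.items():
        score = 0.0
        for tok in tokens:
            count = term_counts.get(tok, 0)
            if count == 0:
                continue  # term never seen in this class
            # Number of classes in which the term occurs at least once.
            df = sum(1 for tc in counts.values() if tok in tc)
            score += tf(count) * idf(n_classes, df)
        scores[label] = score
    return max(scores, key=scores.get)


def predict_all(tokens, train):
    """Run every TF/IDF combination and collect one prediction per variant."""
    return {
        (tf_name, idf_name): classify(tokens, train, tf, idf)
        for tf_name, tf in TF_VARIANTS.items()
        for idf_name, idf in IDF_VARIANTS.items()
    }
```

Calling `predict_all(["team", "game"], TRAIN)` yields one predicted label per variant. Comparing how often the variants agree for the documents of a given class is one simple way to probe whether that class behaves cohesively.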

Archiving Conference

ISSN 2161-8798

Society for Imaging Science and Technology

15 May 2017

pp. 146-151