Clustering based two-stage text classification requiring minimal training data
Zhang, Xue; Xiao, Wang-Xin
2012
Abstract: Clustering-aided classification methods assume that clusters learned under the guidance of initial training data can, to some extent, characterize the underlying distribution of the data set. However, our experiments show that whether this assumption holds depends on both the separability of the data set and the size of the training set. It is often violated on data sets with poor separability, especially when the initial training data are too few, and in such cases clustering-based methods perform worse. In this paper, we propose a clustering-based two-stage text classification approach to address this problem. In the first stage, labeled and unlabeled data are clustered under the guidance of the labeled data, and a self-training-style clustering strategy is then used to iteratively expand the training data under the guidance of an oracle or expert. In the second stage, discriminative classifiers are trained on the expanded labeled data set. Unlike other clustering-based methods, the proposed clustering strategy can effectively cope with data of poor separability. Furthermore, the proposed framework converts the problem of sparsely labeled text classification into a supervised one. Supervised classification models such as SVM can therefore be applied, and techniques developed for supervised learning, such as feature selection, sampling methods, and data editing or noise filtering, can be used to further improve classification accuracy. Our experimental results demonstrate the effectiveness of the proposed approach, especially when the training data set is very small. © 2012 IEEE.
Indexed by: EI
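The two stages described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' algorithm: the record gives no implementation details, so the seeded k-means-like clustering, the rule for choosing which points to send to the oracle (the unlabeled points nearest to a cluster centroid), the perceptron used as a stand-in for an SVM, and all function names and toy data below are assumptions made for illustration only.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def nearest_label(x, centroids):
    """Label whose centroid is closest to x (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))

def seeded_cluster(X, labeled, iters=10):
    """Stage 1a (assumed form): k-means-like clustering seeded by labeled points.
    Labeled points always stay in the cluster of their own label."""
    labels = sorted(set(labeled.values()))
    centroids = {c: centroid([X[i] for i, y in labeled.items() if y == c])
                 for c in labels}
    for _ in range(iters):
        assign = [labeled[i] if i in labeled else nearest_label(X[i], centroids)
                  for i in range(len(X))]
        centroids = {c: centroid([X[i] for i, a in enumerate(assign) if a == c])
                     for c in labels}
    return centroids

def expand_labels(X, labeled, oracle, rounds=3, per_round=2):
    """Stage 1b (assumed form): each round, re-cluster and ask the oracle to
    label the unlabeled points that lie closest to any cluster centroid."""
    labeled = dict(labeled)
    for _ in range(rounds):
        centroids = seeded_cluster(X, labeled)
        unlabeled = [i for i in range(len(X)) if i not in labeled]
        unlabeled.sort(key=lambda i: min(math.dist(X[i], c)
                                         for c in centroids.values()))
        for i in unlabeled[:per_round]:
            labeled[i] = oracle(i)
    return labeled

def train_perceptron(X, labeled, max_epochs=1000):
    """Stage 2 stand-in: a binary perceptron (labels 0/1) instead of an SVM."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i, y in labeled.items():
            t = 1 if y == 1 else -1
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            if t * score <= 0:  # misclassified or on the boundary: update
                w = [wj + t * xj for wj, xj in zip(w, X[i])]
                b += t
                mistakes += 1
        if mistakes == 0:  # every labeled point is classified correctly
            break
    return w, b

def predict(model, x):
    """Return 1 if x falls on the positive side of the hyperplane, else 0."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

# Toy 2-D "documents": two well-separated groups, one seed label per class.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
     (5, 5), (5, 6), (6, 5), (6, 6), (5.5, 5.5)]
truth = [0] * 5 + [1] * 5
seeds = {0: 0, 5: 1}
expanded = expand_labels(X, seeds, oracle=lambda i: truth[i])
model = train_perceptron(X, expanded)
```

In this sketch the oracle is simulated by looking up the known ground truth. In practice it would be a human expert, and real text data would first be converted to feature vectors (for example, term weights) before clustering.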
Language: English
DOI: 10.1109/ICSAI.2012.6223496
Content type: Other
Source URL: http://ir.pku.edu.cn/handle/20.500.11897/294681
Collection: School of Electronics Engineering and Computer Science (信息科学技术学院)
Recommended citation
GB/T 7714
Zhang, Xue; Xiao, Wang-Xin. Clustering based two-stage text classification requiring minimal training data. 2012-01-01.