Learning outliers to refine a corpus for Chinese webpage categorization | |
Luo, Dingsheng ; Wang, Xinhao ; Wu, Xihong ; Chi, Huisheng | |
2005 | |
English abstract | Webpage categorization has become an important topic in recent years. Text is usually the main content of a webpage, so automatic text categorization (ATC) is the key technique for this task. For Chinese text categorization, as well as Chinese webpage categorization, one of the basic and urgent problems is the construction of a good benchmark corpus. In this study, a machine learning approach is presented to refine a corpus for Chinese webpage categorization, where the AdaBoost algorithm is adopted to identify outliers in the corpus. The standard k nearest neighbor (kNN) algorithm under a vector space model (VSM) is adopted to construct a webpage categorization system. Simulation results, as well as manual investigation of the identified outliers, reveal that the presented method works well. © Springer-Verlag Berlin Heidelberg 2005.; EI; 0 |
Language | English |
Content type | Other |
Source URL | [http://ir.pku.edu.cn/handle/20.500.11897/295273] |
Collection | School of Information Science and Technology |
Recommended citation (GB/T 7714) | Luo, Dingsheng, Wang, Xinhao, Wu, Xihong, et al. Learning outliers to refine a corpus for Chinese webpage categorization. 2005-01-01. |