Cost-Free Learning in the Class Imbalance Problem Based on Mutual Information

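Title	Cost-Free Learning in the Class Imbalance Problem Based on Mutual Information (original: 基于互信息的代价缺失学习在不平衡数据中的研究)
Author	Zhang Xiaowan (张晓晚)
Degree	Doctor of Engineering
Defense date	2014-05-20
Degree-granting institution	University of Chinese Academy of Sciences
Place conferred	Institute of Automation, Chinese Academy of Sciences
Advisor	Hu Baogang (胡包钢)
Keywords	Class Imbalance; Cost-Free Learning; Cost-Sensitive Learning; Mutual Information; Abstaining; Graphical Evaluation Method
Alternative title	Cost-Free Learning in the Class Imbalance Problem based on Mutual Information
Major	Pattern Recognition and Intelligent Systems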
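Abstract (translated from Chinese)	Learning methods for imbalanced data can be divided, according to whether they require cost information, into cost-sensitive learning and methods that need no costs. This thesis defines all learning that requires no cost information as cost-free learning. When costs are unknown, cost-sensitive learning does not apply, and cost-free learning, such as sampling and some criterion-based methods, can be used instead. However, neither existing cost-sensitive nor existing cost-free methods can handle classification with a reject option when information about errors and rejects is unknown. Based on information theory, this thesis therefore proposes a new cost-free learning strategy that maximizes the normalized mutual information between the true classes and the predicted classes of the data. The method automatically balances the different types of errors and rejects, and it handles binary and multi-class classification, with or without a reject option. The main contributions of this thesis are as follows:
1. For the case of unknown costs in imbalanced-data learning, and in view of the limitations of cost-sensitive learning and the shortcomings of existing imbalanced-data methods on classification with a reject option, this thesis proposes a general cost-free learning strategy. Hu previously proposed the mutual information classifier and verified through numerical experiments that mutual information automatically protects minority-class samples when handling imbalanced data, but that method was not applied to real data. This thesis learns from real imbalanced data sets. Exploiting the ability of normalized mutual information to distinguish error types and reject types automatically, it takes normalized mutual information as the learning target and uses conventional classifiers simply and directly to handle imbalanced data. With costs unknown, for binary and multi-class classification, with or without a reject option, the strategy automatically balances the misclassifications and rejects of each class and obtains reasonable classification results. The proposed strategy therefore compensates for the shortcomings of existing imbalanced-data learning methods and effectively addresses the problems that conventional classification methods encounter on imbalanced data.
2. In cost-sensitive learning, unknown costs are a common and widely acknowledged difficulty. Many methods therefore try to avoid explicit costs or attempt to learn the costs. When a reject option is introduced, the unknown-cost problem becomes more complex, because no existing learning method can learn reasonable reject information. With the proposed cost-free learning strategy, optimizing the learning target yields two kinds of meaningful optimal parameters. For binary and multi-class classification with a reject option, the optimal rejection thresholds are obtained automatically. For binary classification without rejects, this thesis connects with Elkan's classic work on costs and decision thresholds to obtain "equivalent" misclassification costs. For binary classification with rejects, Hu has pointed out that cost-sensitive learning suffers from parameter redundancy; this thesis uses the "equivalent" misclassification costs obtained in the no-reject case as prior knowledge to resolve the redundancy and obtain "equivalent" reject costs. The rejection thresholds and "equivalent" costs are not specified by hand. They depend entirely on the data distribution and the base classifier, and are therefore objective. These "equivalent" costs can also serve as an objective reference for setting subjective costs in cost-sensitive learning, which links cost-free learning with cost-sensitive learning.
3. Graphical evaluation methods allow classifier performance to be analyzed visually and intuitively. This thesis gives, for the first time, a geometric interpretation of the abstaining case in ROC space, and further explores how rejects relate across PR space and cost space. The graphical evaluation curves also show that classification without rejects and Chow's reject rule are special cases of the general abstaining problem. This work provides a clear graphical interpretation for classification performance analysis and makes it convenient for users to adjust parameters interactively through graphical methods.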
Abstract (English)	Approaches to learning in the class imbalance problem fall into two categories: cost-sensitive learning (CSL) and learning that does not require any cost information. In this thesis, we define cost-free learning (CFL) as any learning approach that seeks optimal classification results without requiring cost information. If the costs are unknown, CSL cannot be applied, while CFL approaches such as sampling and some criteria-based methods can. However, to the best of our knowledge, none of the existing CSL and CFL approaches can properly handle abstaining classification when no information is given about errors and rejects. Based on information theory, we propose a novel CFL strategy that seeks to maximize the normalized mutual information between the targets and the decision outputs of classifiers. As the degree of class imbalance changes, the proposed strategy balances the errors and rejects accordingly and automatically. Using the strategy, we can handle binary and multi-class classification, with or without abstaining. The main contributions of this thesis are as follows:
1. For the situation in which costs are unknown in the class imbalance problem, this thesis presents a general CFL strategy, motivated by the limitations of CSL and the deficiencies of existing learning methods in dealing with abstaining classification. Hu[1] proposed the mutual information classifier and verified its effectiveness on numerical examples. In this thesis, we focus on learning from real data sets. Using normalized mutual information (NI) as the learning target, we build on conventional classifiers and adopt them for simple and direct implementations. The proposed CFL strategy balances the errors and rejects accordingly and automatically, and it handles binary and multi-class classification, with or without abstaining. Therefore, the proposed strategy can not only compensate for the limitations of existing imbalanced learning methods, but also solve the problems caused by applying conventional classification algorithms to class imbalance learning.
2. Unknown costs are not unusual in real-world applications, and they are regarded as a challenge in the field of CSL. Various methods attempt to avoid specifying cost values, or attempt to learn the costs. However, the problem becomes more complex when a reject option is involved, since existing learning methods cannot get p...
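Language	Chinese
Other identifier	201018014628073
Content type	Dissertation
Source URL	[http://ir.ia.ac.cn/handle/173211/6582]
Collection	Graduates: Doctoral Dissertations
Recommended citation (GB/T 7714)	Zhang Xiaowan (张晓晚). Cost-Free Learning in the Class Imbalance Problem Based on Mutual Information (基于互信息的代价缺失学习在不平衡数据中的研究)[D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences. 2014.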
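
The abstract describes choosing decisions, including rejection thresholds, by maximizing the normalized mutual information between true and predicted classes. The following is a minimal sketch of that idea, not the thesis's implementation. It assumes that NI is defined as I(T;Y)/H(T) on a confusion matrix with an optional reject column, that the base classifier outputs a score in [0, 1] for the positive class, and that a plain grid search is an acceptable optimizer. All function names (`normalized_mutual_information`, `abstaining_confusion`, `search_thresholds`, `equivalent_cost_ratio`) are hypothetical. The last function inverts Elkan's threshold relation p* = c_FP / (c_FP + c_FN), which holds for calibrated posterior probabilities and zero cost for correct decisions.

```python
import numpy as np


def normalized_mutual_information(confusion):
    """Assumed definition: NI = I(T;Y) / H(T).

    Rows are true classes; columns are decisions (a last reject column is optional).
    """
    joint = np.asarray(confusion, dtype=float)
    joint = joint / joint.sum()
    p_true = joint.sum(axis=1, keepdims=True)   # marginal over true classes
    p_pred = joint.sum(axis=0, keepdims=True)   # marginal over decisions
    independent = p_true @ p_pred               # product of the marginals
    mask = joint > 0                            # skip empty cells (0 * log 0 = 0)
    mutual_info = np.sum(joint[mask] * np.log2(joint[mask] / independent[mask]))
    nonzero_true = p_true[p_true > 0]
    entropy_true = -np.sum(nonzero_true * np.log2(nonzero_true))
    return float(mutual_info / entropy_true) if entropy_true > 0 else 0.0


def abstaining_confusion(y_true, scores, t_low, t_high):
    """Build a 2x3 confusion matrix: columns are negative, positive, reject."""
    y_true = np.asarray(y_true, dtype=int)
    scores = np.asarray(scores, dtype=float)
    # Negative below t_low, positive at or above t_high, reject in between.
    decisions = np.where(scores < t_low, 0, np.where(scores >= t_high, 1, 2))
    confusion = np.zeros((2, 3))
    np.add.at(confusion, (y_true, decisions), 1)
    return confusion


def search_thresholds(y_true, scores, grid):
    """Grid search for the (t_low, t_high) pair that maximizes NI."""
    best = (-1.0, None, None)
    for i, t_low in enumerate(grid):
        for t_high in grid[i:]:                 # enforce t_low <= t_high
            ni = normalized_mutual_information(
                abstaining_confusion(y_true, scores, t_low, t_high))
            if ni > best[0]:
                best = (ni, float(t_low), float(t_high))
    return best


def equivalent_cost_ratio(threshold):
    """Cost ratio c_FN / c_FP implied by a no-reject threshold (Elkan's relation).

    Assumes calibrated posterior scores and zero cost for correct decisions.
    """
    return (1.0 - threshold) / threshold


if __name__ == "__main__":
    # Toy imbalanced data (invented for illustration only): 8 negatives, 2 positives.
    y = [0] * 8 + [1] * 2
    s = [0.05, 0.1, 0.1, 0.15, 0.2, 0.2, 0.3, 0.45, 0.4, 0.9]
    print(search_thresholds(y, s, np.linspace(0.0, 1.0, 21)))
```

Under this definition, a classifier that assigns every sample to the majority class scores NI = 0 regardless of its accuracy, which illustrates why an NI objective does not ignore the minority class. When `t_low` equals `t_high` the reject region is empty, so the no-reject case is covered by the same search.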
