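Title: Content-based Web Sensitive Information Recognition and Filtering Method (基于内容的网页敏感信息识别与过滤方法)
Author: Chen Zhouyao (陈周耀)
Degree type: Master of Engineering
Defense date: 2007-06-06
Degree-granting institution: Graduate School of the Chinese Academy of Sciences (中国科学院研究生院)
Place conferred: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Advisor: Hu Weiming (胡卫明)
Keywords: information filtering; information fusion; decision-tree classification; multi-instance learning
Alternative title: Content-based Web sensitive information recognition and filtering method
Major: Computer Application Technology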
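Chinese abstract (translated): With the rapid growth of the Internet in recent years, people around the world have benefited greatly from the global sharing and dissemination of information. At the same time, the WWW era is a double-edged sword: while spreading all kinds of information, the Internet also widely spreads harmful content, including reactionary, pornographic, violent, and other material that endangers social stability. Cleaning up the network environment by effectively recognizing and filtering harmful information has become an urgent problem. Researchers in China and abroad have begun work on filtering web pages that contain sensitive information. Building on that prior work and on our own exploration, this thesis proposes several new methods for sensitive information filtering.
Because filtering must rest on efficient recognition, the thesis approaches the problem from three directions: first, Bayes-based recognition of discrete sensitive text on the Internet; second, classification of web pages by type with a decision-tree algorithm, followed by fusion of text and image information to recognize sensitive pages; third, fast recognition of sets of web pages based on multi-instance learning. The main contributions are:
(1) A sensitive-text recognition algorithm based on Bayes theory. Extensive statistics and observation show that in navigation-oriented web pages the semantic links between words are usually weak, so the words can be treated as relatively independent; we call such text discrete text. On this basis we build a Bayes classifier to handle discrete text in web pages. We first define the keywords of the sensitive class, then treat the frequencies of these words as the basic semantic statistics and use them as the classifier input. The classifier outputs a probability that the target page contains sensitive information.
(2) A framework for processing sensitive web pages that combines divide and conquer with the fusion of text and image information. The framework applies different text and image classifiers to different types of pages. Text and image information is first separated during page preprocessing. The C4.5 decision-tree algorithm then assigns the target page to one of three classes (text, image, or mixed), and each class is handled by its corresponding classifier. Finally, information fusion combines the classifier outputs into an overall assessment of whether the target page contains sensitive information. This overcomes the limitation that existing methods can, for the most part, filter only one particular type of web page.
(3) An algorithm for fast recognition of sets of web pages based on multi-instance learning. The object to be processed is usually a website rather than a single page, so applying multi-instance learning with the website as the unit can effectively improve recognition efficiency. Multi-instance learning is an approach to set recognition that has been widely applied in recent years and is relatively mature.
(4) The design and implementation of a system for filtering sensitive information in web pages.
(5) An improvement of the existing active search system for sensitive information using multi-instance learning, which raises its recognition efficiency.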
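Contribution (1) can be illustrated with a minimal sketch. The abstract does not give the exact model, so the code below assumes a multinomial naive Bayes classifier with Laplace smoothing; the keyword list, the training texts, and all function names are hypothetical placeholders, not taken from the thesis.

```python
import math
from collections import Counter

# Hypothetical sensitive-class keywords; the thesis vocabulary is not published here.
KEYWORDS = ["flagged_a", "flagged_b", "flagged_c"]


def keyword_counts(text):
    """Count how often each keyword occurs in a whitespace-tokenized text."""
    tokens = Counter(text.lower().split())
    return [tokens[k] for k in KEYWORDS]


def train(docs, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model (assumed model) with Laplace smoothing.

    Each class (0 = benign, 1 = sensitive) needs at least one training document.
    """
    model = {}
    for cls in (0, 1):
        rows = [keyword_counts(d) for d, y in zip(docs, labels) if y == cls]
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + alpha * len(KEYWORDS)
        model[cls] = {
            "log_prior": math.log(len(rows) / len(docs)),
            "log_like": [math.log((t + alpha) / denom) for t in totals],
        }
    return model


def sensitive_probability(model, text):
    """Return the posterior probability that the text belongs to class 1."""
    counts = keyword_counts(text)
    scores = {
        cls: m["log_prior"] + sum(c * ll for c, ll in zip(counts, m["log_like"]))
        for cls, m in model.items()
    }
    peak = max(scores.values())  # subtract the peak for numerical stability
    weights = {cls: math.exp(s - peak) for cls, s in scores.items()}
    return weights[1] / (weights[0] + weights[1])


# Toy training data, invented for illustration only.
docs = [
    "flagged_a flagged_b flagged_a links",
    "flagged_c flagged_a flagged_b flagged_b",
    "weather news flagged_c",
    "sports news today",
]
labels = [1, 1, 0, 0]
model = train(docs, labels)
p_high = sensitive_probability(model, "flagged_a flagged_b flagged_a")
p_low = sensitive_probability(model, "flagged_c")
```

Treating keyword counts as conditionally independent given the class mirrors the abstract's observation that words in discrete text are only weakly related. A page with none of the keywords simply falls back to the class prior in this sketch.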
English abstract: With the rapid development of the Internet, people benefit greatly from the sharing of information. Meanwhile, the WWW era is a double-edged sword: the Internet also spreads harmful and erotic content widely. Sensitive information recognition and filtering is therefore of great importance and has recently become one of the most active research topics. Much work has been done on filtering pornographic web pages. This thesis builds on past research and introduces new classification methods, which we apply in our filtering system. We divide the work into three parts: the first is recognition of discrete web page text that contains sensitive information; the second is a web sensitive information recognition framework based on decision-tree classification and text and image fusion; the third is an exploration of filtering sets of web pages using multi-instance learning. The main contributions of this thesis are:
(1) We propose a discrete sensitive text recognition algorithm based on Bayes theory. Because words are relatively independent in discrete text, we use a Bayes classifier to handle them. The occurrence counts of key words in the text are treated as the basic semantic feature and taken as the classifier input. The output is the probability that the incoming page is sensitive.
(2) We propose a novel web sensitive information recognition framework that applies a divide-and-conquer strategy and a fusion algorithm to combine the results of text and image classification. First, we obtain the web data flows and pre-process them; then we use the C4.5 decision-tree algorithm to classify pages by type and apply the corresponding classifiers to the text and image data in each page. Finally, we combine the classification results into an overall judgment. This addresses the limitation that existing methods can only filter certain kinds of sensitive web pages.
(3) We propose using multi-instance learning to filter sensitive information in sets of web pages. In practice, we usually need to deal with a web site rather than a single web page. Because multi-instance learning is a relatively mature method for data set recognition, we use it to improve the effectiveness of sensitive information filtering.
(4) We took part in implementing a browser plug-in that blocks sensitive information.
(5) We improved an active search engine for sensitive web pages using multi-instance learning.
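Contributions (2) and (3) can be sketched together under stated assumptions. A fixed threshold rule stands in for the trained C4.5 tree, fusion is assumed to be a weighted average, and the site-level decision uses the standard multi-instance assumption that a bag (site) is positive if at least one instance (page) is positive. The thresholds, weights, and names are hypothetical; the abstract does not specify the actual fusion rule or learning algorithm.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    text_score: float   # assumed output of a text classifier, in [0, 1]
    image_score: float  # assumed output of an image classifier, in [0, 1]
    text_chars: int     # amount of text on the page
    image_count: int    # number of images on the page


def page_type(page: Page) -> str:
    """Hand-written stand-in for the trained C4.5 tree (thresholds are arbitrary)."""
    if page.image_count == 0:
        return "text"
    if page.text_chars / page.image_count < 50:
        return "image"
    return "mixed"


def page_score(page: Page, text_weight: float = 0.5) -> float:
    """Divide and conquer: route by page type, fusing both scores for mixed pages."""
    kind = page_type(page)
    if kind == "text":
        return page.text_score
    if kind == "image":
        return page.image_score
    # Assumed fusion rule: weighted average of the two classifier outputs.
    return text_weight * page.text_score + (1 - text_weight) * page.image_score


def site_is_sensitive(pages: List[Page], threshold: float = 0.5) -> bool:
    """Bag-level decision: the site is flagged if its highest page score passes."""
    best = max((page_score(p) for p in pages), default=0.0)
    return best >= threshold


text_page = Page(text_score=0.9, image_score=0.0, text_chars=2000, image_count=0)
image_page = Page(text_score=0.1, image_score=0.8, text_chars=100, image_count=10)
mixed_page = Page(text_score=0.2, image_score=0.6, text_chars=3000, image_count=5)
```

Taking the maximum page score is the simplest bag-level rule. It shows why treating a site as one unit can save work: scanning can stop as soon as one page passes the threshold. It covers only the decision step, not the training of a multi-instance learner.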
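Language: Chinese
Other identifier: 200428014628005
Content type: Thesis
Source URL: [http://ir.ia.ac.cn/handle/173211/7401]
Collection: Graduates_Master's Theses
Recommended citation
GB/T 7714
Chen Zhouyao (陈周耀). Content-based Web sensitive information recognition and filtering method (基于内容的网页敏感信息识别与过滤方法) [D]. Institute of Automation, Chinese Academy of Sciences. Graduate School of the Chinese Academy of Sciences. 2007.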