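Title: Dictionary-Based Automatic Chinese-Tibetan Sentence Alignment (基于词典的汉藏句子自动对齐)
Author: Yu Xin (于新)
Degree type: Master's
Defense date: 2011-06-01
Degree-granting institution: Graduate University of Chinese Academy of Sciences (中国科学院研究生院)
Place of conferral: Beijing
Advisor: Wu Jian (吴健)
Keywords: Chinese-Tibetan sentence alignment; dictionary; Tibetan sentence segmentation; word segmentation granularity; parallel corpus
Degree discipline: Computer Software and Theory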
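Chinese abstract (translated): Much research in natural language processing extracts regularities of linguistic knowledge from large-scale corpora and then uses those regularities to guide processing. A bilingual corpus carries mutual-translation information between two languages, which makes it extremely valuable for research and applications in bilingual natural language processing for that language pair. Alignment is one of the key technologies in processing a bilingual corpus, and building sentence-level aligned data is the most basic task in constructing one.
Drawing on mature sentence alignment methods for other languages and taking the particularities of Tibetan into account, this thesis proposes dictionary-based automatic sentence alignment for Chinese and Tibetan. Tibetan sentence segmentation is the first problem sentence alignment has to solve. The thesis explores it on the basis of statistical experiments and determines Tibetan sentence boundaries by combining sentence-final words, sentence-initial conjunctions, and sentence length. Numeric item labels in Tibetan text are split off according to the formation rules of Tibetan numerals. The bilingual dictionary used for alignment is compiled and its word coverage evaluated. A further problem is that Chinese and Tibetan word segmentation differ in granularity. This is handled by looking words up further in the Tibetan-Chinese dictionary and matching them against the Chinese sentence, which raises the score of correct sentence pairs and thereby improves alignment accuracy. The method achieves a precision of 81.11%. The algorithm is also optimized for efficiency.
The aim of this thesis is to build a practical Chinese-Tibetan sentence alignment tool and to use it to process a Chinese-Tibetan bilingual corpus. The experimental results indicate that the tool built with this method has significant practical and research value.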
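The three boundary cues named in the abstract (sentence-final words, sentence-initial conjunctions, sentence length) could be combined roughly as in the sketch below. It is only an illustration under assumptions and not the thesis's implementation: the function name `split_sentences`, the `min_len` threshold, the rule for combining the cues, and the placeholder tokens `END` and `BUT` are all hypothetical. The input is assumed to be already split into candidate segments of tokens.

```python
from typing import List, Sequence, Set


def split_sentences(
    segments: Sequence[Sequence[str]],
    final_words: Set[str],
    initial_conjunctions: Set[str],
    min_len: int = 3,
) -> List[List[str]]:
    """Merge candidate segments into sentences using three cues.

    segments: token lists, each ending at a candidate boundary.
    final_words: words assumed to typically end a sentence.
    initial_conjunctions: words assumed to typically start a sentence.
    min_len: hypothetical minimum sentence length in tokens.
    """
    sentences: List[List[str]] = []
    current: List[str] = []
    for i, seg in enumerate(segments):
        current.extend(seg)
        nxt = segments[i + 1] if i + 1 < len(segments) else None
        # Cue 1: the segment ends with a typical sentence-final word.
        ends_with_final = bool(seg) and seg[-1] in final_words
        # Cue 2: the next segment starts with a sentence-initial conjunction.
        next_starts_conj = bool(nxt) and nxt[0] in initial_conjunctions
        # Cue 3: the accumulated sentence is long enough to stand alone.
        long_enough = len(current) >= min_len
        is_boundary = nxt is None or (
            long_enough and (ends_with_final or next_starts_conj)
        )
        if is_boundary and current:
            sentences.append(current)
            current = []
    return sentences


# Placeholder tokens only; these are not real Tibetan words.
segments = [["a", "b", "END"], ["c"], ["d", "e", "END"], ["BUT", "f", "g"]]
result = split_sentences(segments, {"END"}, {"BUT"})
print(result)
```

In this toy input the short segment `["c"]` is not closed as a sentence of its own, because it neither ends with a sentence-final word nor precedes a conjunction, so it is merged into the following segment.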
English abstract: Many studies in natural language processing (NLP) extract linguistic knowledge from large-scale corpora and then apply it to NLP tasks. A bilingual corpus contains translation information between two languages and provides very valuable information for bilingual research. Alignment is a key technology in creating a bilingual corpus, and sentence-level alignment is a basic topic in building a bilingual parallel corpus.
Drawing on mature sentence alignment methods for other languages and considering the specific characteristics of Tibetan, this thesis proposes dictionary-based automatic sentence alignment for Chinese and Tibetan. The first problem to be solved in Chinese-Tibetan sentence alignment is Tibetan sentence recognition, on which we carry out exploratory research based on statistical experiments. Tibetan sentences are split according to typical sentence-final words, such as verbs, auxiliary verbs, and mood words. Conjunctions at the beginning of a sentence and sentence length also contribute to sentence recognition. Numerical symbols in Tibetan text are split according to the rules of Tibetan numerals. The bilingual dictionary used for alignment is compiled and its word coverage is evaluated. Another problem encountered in Chinese-Tibetan sentence alignment is the difference in granularity between Chinese and Tibetan word segmentation. It is addressed by looking up the remaining large Tibetan words in the Tibetan-Chinese dictionary and then matching them in the original Chinese sentence. Experiments show that the average precision of this approach is 81.11%. The efficiency of the method is also optimized.
The purpose of this thesis is to establish a practical Chinese-Tibetan sentence alignment tool and to apply it in the construction of a Chinese-Tibetan bilingual parallel corpus.
The experimental results demonstrate that the sentence alignment tool built with the proposed method has important practical and research value.
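The dictionary lookup with a fallback for differing segmentation granularity, as the abstract describes it, might look like the sketch below. Everything concrete here is an assumption for illustration: the function name `pair_score`, the `use_fallback` flag, the scoring formula (matched Tibetan words divided by total Tibetan words), and the placeholder keys `T1` to `T3` standing in for Tibetan words. The thesis's actual scoring function is not given in this record.

```python
from typing import Dict, List, Sequence


def pair_score(
    tibetan_words: Sequence[str],
    chinese_tokens: Sequence[str],
    chinese_sentence: str,
    dictionary: Dict[str, List[str]],
    use_fallback: bool = True,
) -> float:
    """Score a candidate Tibetan-Chinese sentence pair with a dictionary.

    A Tibetan word counts as matched if one of its dictionary
    translations is a segmented Chinese token. If that fails and
    use_fallback is set, the translation is also searched for as a
    substring of the unsegmented Chinese sentence, which covers cases
    where the Chinese segmenter split the word more finely.
    """
    if not tibetan_words:
        return 0.0
    token_set = set(chinese_tokens)
    matched = 0
    for word in tibetan_words:
        translations = dictionary.get(word, [])
        if any(t in token_set for t in translations):
            matched += 1
        elif use_fallback and any(t and t in chinese_sentence for t in translations):
            matched += 1
    return matched / len(tibetan_words)


# Placeholder keys T1..T3 stand in for Tibetan words (hypothetical data).
toy_dict = {"T1": ["图书馆"], "T2": ["学生"], "T3": ["北京"]}
sentence = "学生在图书馆看书"
# The segmenter is assumed to have split 图书馆 into 图书 + 馆.
tokens = ["学生", "在", "图书", "馆", "看书"]

with_fallback = pair_score(["T1", "T2", "T3"], tokens, sentence, toy_dict)
without_fallback = pair_score(
    ["T1", "T2", "T3"], tokens, sentence, toy_dict, use_fallback=False
)
print(with_fallback, without_fallback)
```

In the toy data the fallback lets `T1` match even though its translation was split across two Chinese tokens, so the correct pair scores higher with the fallback than without it, which is the effect the abstract attributes to this step.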
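Subject: Computer Applications; Chinese Language and Character Information Processing (including Chinese character information processing)
Release date: 2011-06-13
Content type: Thesis
Source URL: http://124.16.136.157/handle/311060/10795
Collection: Institute of Software_National Engineering Research Center of Fundamental Software_Theses
Recommended citation
GB/T 7714
Yu Xin. Dictionary-Based Automatic Chinese-Tibetan Sentence Alignment [D]. Beijing. Graduate University of Chinese Academy of Sciences. 2011.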