Research on Data Optimization Methods in Neural Machine Translation (神经网络机器翻译中的数据优化方法研究)

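Title	Research on Data Optimization Methods in Neural Machine Translation (神经网络机器翻译中的数据优化方法研究)
Author	李小青 (Li Xiaoqing)
Degree type	Doctor of Engineering
Defense date	2017
Degree-granting institution	中国科学院研究生院 (Graduate University of Chinese Academy of Sciences)
Place of conferral	Beijing
Advisor	宗成庆 (Zong Chengqing)
Keywords	neural machine translation; similarity model; online learning; attention mechanism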
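Chinese abstract (translated)	After more than twenty years of development, statistical machine translation has made great progress and is widely used in industry and daily life. However, constrained by discrete symbolic representations, traditional statistical methods have inherent weaknesses, such as overly strong independence assumptions and poor generalization, and the development of statistical machine translation has gradually reached a bottleneck. Neural machine translation is a new approach that has emerged in recent years. It uses continuous vector representations, which can overcome the weaknesses of traditional statistical methods to some extent, and its end-to-end training greatly reduces the complexity of building a translation system, so it has attracted growing interest. At the same time, the new architecture brings new problems and challenges. Starting from the perspective of data, this thesis uses augmentation, selection, and replacement to compensate for insufficient data or to improve data utilization, in order to address salient problems in neural machine translation. It makes the following contributions:

1. A hybrid attention model for sentence-pair data. Existing neural translation models consider only content information when computing attention weights, namely the hidden states of the source and target sides, and ignore the positions of the source and target words in the sentence. As a result, translations often contain repeated, omitted, and skipped segments. To address this problem, the thesis proposes an attention model that combines position information with content information. By adding explicit position information to the original data, the method guides the model to attend to more reasonable positions, including positions that have not been attended to before and positions near the previous point of attention. Experiments show that with this model both translation quality and alignment quality improve significantly, and that the automatically learned position embeddings are readily interpretable.

2. A method for updating network parameters in real time based on bilingual data. Neural translation systems typically have tens of millions of parameters to tune. On the one hand, this gives the model strong expressive power and lets it fit a wide range of complex linguistic phenomena; on the other hand, it greatly increases the difficulty of training. Common practice is to learn the parameters on the entire training corpus and then decode all test sentences with the same parameters. However, given the limits on training data size and model complexity, one fixed set of parameters cannot cover all translation knowledge and therefore cannot be optimal for every test sentence. To address this problem, the thesis proposes a dynamic, sentence-sensitive parameter update method. For each test sentence, the method retrieves a subset of similar sentences from the bilingual training corpus in real time and applies a fine-tuning strategy to update the translation knowledge needed to decode the current sentence, yielding network parameters specific to that sentence. Experiments show that the method significantly improves translation quality. In particular, when the training corpus contains sentences highly similar to the test sentence, the method yields a gain of more than 10 BLEU points.

3. A rare-word replacement method based on bilingual and monolingual data. To control computational complexity, existing neural translation systems can use only a small vocabulary. Rare words outside the vocabulary are all replaced by a special "UNK" symbol. This causes two problems for neural machine translation: first, the rare words themselves cannot be translated; second, "UNK" breaks the original structure of the sentence, which harms the translation and reordering of the frequent words in it. To address this problem, the thesis proposes a rare-word replacement method based on semantic similarity. Using a similarity model learned from bilingual and monolingual corpora, the method replaces rare words in the training and test data with semantically similar frequent words. This preserves the integrity of the sentence structure, helps improve the translation quality of the whole sentence, and allows the rare words to be translated in a post-processing step. In addition, the thesis proposes a named-entity replacement method as a complement to similar-word replacement, and proposes using a character-level encoder-decoder model to translate and align named entities. Experimental results show that the proposed methods effectively improve translation quality and significantly outperform a previously proposed replacement method.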
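The retrieval step of the second contribution (selecting training sentence pairs similar to a test sentence before fine-tuning) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not state which similarity measure is used, so token-set Jaccard overlap is assumed here, and the names `jaccard`, `retrieve_similar`, and the toy corpus are hypothetical.

```python
from typing import List, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence)


def jaccard(a: str, b: str) -> float:
    """Token-set overlap of two whitespace-tokenized sentences (assumed measure)."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def retrieve_similar(test_src: str, corpus: List[SentencePair], k: int = 2) -> List[SentencePair]:
    """Return up to k training pairs whose source side overlaps most with test_src."""
    scored = [(jaccard(test_src, src), (src, tgt)) for src, tgt in corpus]
    # Stable sort: pairs with equal scores keep their corpus order.
    scored.sort(key=lambda item: item[0], reverse=True)
    # Drop pairs that share no token with the test sentence.
    return [pair for score, pair in scored[:k] if score > 0.0]


# Toy corpus; the target strings are placeholders, not real translations.
corpus = [
    ("the cat sits on the mat", "<target 1>"),
    ("stock prices fell sharply", "<target 2>"),
    ("the cat sleeps on the sofa", "<target 3>"),
]
subset = retrieve_similar("the cat sits on the sofa", corpus, k=2)
# `subset` would then be used to fine-tune a copy of the base model's
# parameters before decoding this one test sentence (not shown here).
print(subset)
```

In a setup like this, the base parameters would be restored after each sentence so that every test sentence starts from the same model; that detail is an assumption rather than something the abstract states.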
English abstract	After years of development, statistical machine translation (SMT) has made great progress and has been widely applied in industry and daily life. However, because it is restricted to symbolic representations, SMT has inherent defects, such as overly strong independence assumptions and poor generalization capacity, and its development has gradually reached a bottleneck. Neural machine translation (NMT) is a novel method proposed in recent years. It relies on continuous vector representations, which can fundamentally overcome the defects faced by SMT, and its end-to-end training paradigm greatly simplifies the architecture of machine translation systems. As a result, NMT has quickly become the hottest research area in machine translation. However, the new architecture also brings new problems and challenges. In this thesis, we study the problems in NMT from the perspective of data and achieve the following results.

1. Hybrid Attention Model with Both Location and Content Information. Current NMT systems consider only content information when calculating the attention weights, i.e. the hidden states of the source and target sides, and ignore the position of each source and target word in the sentence. This often results in repeated, missing, and jumping translations. To solve these problems, we propose a hybrid attention model which incorporates both location and content information. By modeling the relations between source and target positions, and between previously attended positions and the current position, our model is able to pay more attention to reasonable positions, including those which have never been attended to before and those near the last attended position. Experimental results demonstrate that both translation quality and alignment quality are improved after incorporating location information, and that the automatically learned location embeddings are highly interpretable.

2. One Sentence One Model for Neural Machine Translation. There are tens of millions of parameters in current NMT systems. On the one hand, the huge number of parameters empowers the network to handle various kinds of language phenomena. On the other hand, it also greatly increases the difficulty of parameter learning. Common practice trains the parameters on the whole training data and uses the same parameters to decode all test sentences. However, due to the limits on data size and model complexity, the same parameters cannot cover all the translation knowledge and thus cannot be optimal for every test sentence. To solve this problem, we propose an online method for parameter updating. For every test sentence, we select training sentences similar to it for fine-tuning, so that the translation knowledge needed to decode the current sentence can be refreshed in real time. Experimental results demonstrate that this method can significantly improve translation quality. In particular, when highly similar sentences can be found, this method improves performance by more than 10 BLEU points.

3. Rare Word Replacement Based on Semantic Similarity. In order to control computational complexity, current NMT systems can employ only a small vocabulary. All rare words outside the vocabulary are converted into a single UNK symbol. This conversion causes two problems for NMT: first, rare words cannot be translated; second, UNKs break sentence structure and affect the translation of common words in the rest of the sentence. To solve these problems, we propose a novel replacement method based on semantic similarity. By replacing rare words with similar, high-frequency symbols, including similar in-vocabulary words and named entity types, we can track the positions of rare words after translation, which enables them to be translated with the help of a dictionary, while also keeping the sentence intact and improving the translation quality of the whole sentence. In addition, we propose to translate named entities with a character-level encoder-decoder model, which can further be used to derive high-quality entity alignments. Experimental results demonstrate that our method effectively improves translation quality and significantly surpasses a previous replacement method.
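The rare-word replacement idea in the third contribution can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's method: the similarity model learned from bilingual and monolingual data is stood in for by a hand-written table of toy word vectors compared with cosine similarity, and the names `cosine`, `replace_rare_words`, `vocab`, and `emb` are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def replace_rare_words(
    tokens: List[str], vocab: Set[str], emb: Dict[str, Vector]
) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Replace out-of-vocabulary tokens with the most similar in-vocabulary word.

    Returns the rewritten tokens plus (position, original word) pairs, which a
    post-processing step could use to translate the rare word itself.
    """
    candidates = [w for w in vocab if w in emb]
    out: List[str] = []
    replaced: List[Tuple[int, str]] = []
    for i, tok in enumerate(tokens):
        if tok in vocab:
            out.append(tok)
        elif tok in emb and candidates:
            best = max(candidates, key=lambda w: cosine(emb[tok], emb[w]))
            out.append(best)
            replaced.append((i, tok))
        else:
            # No similarity information available: fall back to UNK.
            out.append("UNK")
    return out, replaced


# Toy data (assumed): a small vocabulary and 2-dimensional word vectors.
vocab = {"the", "dog", "car", "barked"}
emb = {"dog": [1.0, 0.1], "car": [0.1, 1.0], "dachshund": [0.95, 0.12]}

new_tokens, replaced = replace_rare_words(["the", "dachshund", "barked"], vocab, emb)
print(new_tokens, replaced)
```

Recording the positions of the replaced words is what would let a later step map the translated stand-in word back to a translation of the original rare word, for example through a dictionary lookup.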
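Content type	Dissertation
Source URL	[http://ir.ia.ac.cn/handle/173211/13012]
Collection	Graduates_Doctoral Dissertations
Author affiliation	中国科学院自动化研究所 (Institute of Automation, Chinese Academy of Sciences)
Recommended citation (GB/T 7714)	李小青. 神经网络机器翻译中的数据优化方法研究[D]. 北京. 中国科学院研究生院. 2017.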