Research on Content-Based Segmentation and Annotation of Continuous Audio Streams

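Title	Research on Content-Based Segmentation and Annotation of Continuous Audio Streams
Author	张世磊 (Zhang Shilei)
Degree	Doctor of Engineering
Defense date	2007-05-27
Degree-granting institution	Graduate University of the Chinese Academy of Sciences (中国科学院研究生院)
Place of conferral	Institute of Automation, Chinese Academy of Sciences
Advisors	徐波 (Xu Bo); 张树武 (Zhang Shuwu)
Keywords	Segmentation and annotation; Audio segmentation; Audio classification; Speaker clustering; Speaker diarization system; Pre-processing system for speech recognition
Alternative title	Research on Content-based Audio Segmentation and Annotation
Major	Pattern Recognition and Intelligent Systems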
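Abstract (translated from Chinese)	In recent years, the performance of automatic speech recognition systems has improved greatly, and the high-quality transcripts they produce can meet the requirements of many tasks. With the rapid growth of multimedia audio information, how to process continuous audio streams recorded in real environments, such as broadcast speech and telephone speech, is becoming increasingly important. A continuous audio stream must pass through a series of preprocessing steps, including segmentation, classification and labeling, and clustering, before it can be fed into a conventional speech recognition system. At the same time, making recognized text more readable has become an active research topic in speech recognition. Extracting useful metadata from the audio stream is essential for helping readers understand the transcript and for applications such as audio retrieval. Content-based segmentation and annotation of continuous audio streams is a key audio processing technique that addresses these needs. Driven by practical requirements, and based on an analysis of the characteristics of continuous audio processing and the main problems it faces, this dissertation examines the segmentation, classification, and clustering of audio streams as well as the system framework. The main contributions are as follows. (1) For audio segmentation, a two-level segmentation algorithm is proposed that divides the process into a region level and a boundary level. The region level detects potential regions that contain candidate acoustic change points. Region detection is more robust than conventional change-point detection: it reduces, to some extent, the sensitivity to window length, and it allows a larger window shift, which speeds up detection. Because adjacent detection windows overlap, a true change point can be detected by several regions, which ensures a high detection rate. At the boundary level, the true change point is searched for within each potential region, and a two-step boundary detection procedure ensures good accuracy. For speed, the T² statistic is used to search for candidate change points, and the Bayesian information criterion (BIC) is used to confirm them. (2) For audio type labeling, the dissertation studies feature selection and classification in speech/non-speech classification based on support vector machines (SVMs). Within a decision-tree-based multi-class classification framework, effective audio features are selected for each SVM classifier, which yields good multi-class performance. In addition, to speed up SVM training, an effective sample selection algorithm based on regression trees is proposed. Experimental results show that the number of support vectors and the training time are both reduced significantly while classification accuracy remains essentially unchanged. (3) For speaker labeling, an effective unsupervised speaker clustering algorithm is proposed. Its stopping criterion is based on candidate states; the advantage is that it can be applied directly to real tasks without task-specific parameter tuning and still yields a reasonable number of clusters. To overcome the drawback that speaker segmentation results directly affect clustering results, the iterative speaker segmentation and clustering algorithm is studied in depth. (4) Finally, with the segmentation, classification, and clustering problems reasonably solved, a scalable general framework for a content-based audio stream segmentation and annotation system is proposed, which divides the system into three stages: speech detection, speaker change point detection, and speaker clustering. On this basis, two applications are studied: use of the system under constrained conditions, and use as a speech recognition preprocessing system for broadcast speech.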
Abstract (English)	Recent years have seen great improvements in the performance of automatic speech recognition (ASR) systems. These speech-to-text (STT) systems can now produce transcriptions of sufficient quality to enable some important tasks, such as information retrieval, and research is now focusing on making the transcripts more readable. On the other hand, real-world audio signals, such as broadcast news and telephone speech, cannot be fed directly into an ASR system. The problem of separating audio signals into their different audio types is thus becoming increasingly significant. Audio segmentation and annotation techniques provide useful preprocessing for any audio indexing and recognition system. This dissertation presents research on content-based audio segmentation and annotation. Its main contributions are as follows: (1) A two-level segmentation method is proposed that effectively detects speaker changes and other acoustic changes in a continuous audio stream. The method divides acoustic change detection into two levels: a region level, which detects potential change regions containing candidate acoustic change points, and a boundary level, which searches for and refines the true change points. The region level searches for potential change regions in consecutive local windows. The boundary level applies the T² statistic and the BIC to detect segment boundaries within the potential regions. (2) A novel method based on regression trees is proposed for choosing effective training samples for SVMs in the audio classification task. The objective is to reduce SVM training time by choosing effective examples from the training set and to balance the number of training points in the two classes. Tests on a dataset of about 6 hours of audio show that the computation time can be reduced significantly without a significant decrease in prediction accuracy. (3) An efficient, robust, and unsupervised algorithm for speaker clustering is proposed. Each utterance is modeled by a single Gaussian distribution. A novel distance metric is proposed for determining the stopping criterion. The advantage of the proposed method is that it achieves comparable performance without requiring a tunable threshold term. In addition, a framework for iterative segmentation and clustering is studied, in which speaker change detection and clustering are performed concurrently in a single iterative process. (4) A general framework for audio segmentation and annotation is proposed. The system consists of three main components: speech detection, speaker change detection, and speaker clustering. The dissertation also investigates two application problems: operation under constrained conditions, and use as a preprocessing step for an ASR system.
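Language	Chinese
Other identifier	200418014628078
Content type	Dissertation
Source URL	[http://ir.ia.ac.cn/handle/173211/5976]
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	张世磊 (Zhang Shilei). Research on Content-Based Segmentation and Annotation of Continuous Audio Streams [D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of the Chinese Academy of Sciences. 2007.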
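
The boundary-level step described in the abstract (search for candidate change points with the T² statistic, then confirm them with the BIC) can be illustrated with a minimal sketch. This is not the dissertation's implementation: it assumes a one-dimensional feature sequence in which each segment is modeled by a single Gaussian, and the function names (`t2_statistic`, `delta_bic`, `detect_change`) as well as the parameters `margin` and `penalty_weight` are hypothetical choices made for illustration.

```python
import math


def mean_var(xs):
    """Return the mean and maximum-likelihood variance of a 1-D sequence."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, max(var, 1e-12)  # floor avoids log(0) on constant segments


def t2_statistic(xs, i):
    """Hotelling T^2 for a split at index i (1-D case, pooled variance)."""
    left, right = xs[:i], xs[i:]
    n1, n2 = len(left), len(right)
    mu1, var1 = mean_var(left)
    mu2, var2 = mean_var(right)
    pooled = (n1 * var1 + n2 * var2) / (n1 + n2)
    return (n1 * n2 / (n1 + n2)) * (mu1 - mu2) ** 2 / pooled


def delta_bic(xs, i, penalty_weight=1.0):
    """Delta-BIC for a split at index i; a positive value favors a change."""
    n = len(xs)
    _, var = mean_var(xs)
    _, var1 = mean_var(xs[:i])
    _, var2 = mean_var(xs[i:])
    gain = 0.5 * (n * math.log(var)
                  - i * math.log(var1)
                  - (n - i) * math.log(var2))
    # A 1-D Gaussian has 2 free parameters (mean and variance).
    penalty = 0.5 * penalty_weight * 2 * math.log(n)
    return gain - penalty


def detect_change(xs, margin=10, penalty_weight=1.0):
    """Pick the split with the largest T^2, then confirm it with delta-BIC.

    Returns the change index, or None if no change is confirmed.
    `margin` (assumed) keeps both segments long enough to estimate a variance.
    """
    if len(xs) <= 2 * margin:
        return None
    candidates = range(margin, len(xs) - margin)
    best = max(candidates, key=lambda i: t2_statistic(xs, i))
    return best if delta_bic(xs, best, penalty_weight) > 0 else None


if __name__ == "__main__":
    # Synthetic stream whose level shifts halfway through.
    stream = [0.0, 1.0] * 50 + [10.0, 11.0] * 50
    print(detect_change(stream))
```

The cheaper T² statistic is evaluated at every candidate index, while the BIC test runs only once on the best candidate, which reflects the speed motivation stated in the abstract. Real audio features such as cepstral vectors are multidimensional, so a practical version would need full covariance matrices and their log-determinants.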
