EPAS: A Sampling Based Similarity Identification Algorithm for the Cloud
Zhou, Yongtao1; Deng, Yuhui1,3; Xie, Junjie1; Yang, Laurence T.2
Journal: IEEE TRANSACTIONS ON CLOUD COMPUTING
Date: 2018-07-01
Volume: 6; Issue: 3; Pages: 720-733
Keywords: Similarity detection; sampling; shingle; position-aware; cloud
ISSN: 2168-7161
DOI: 10.1109/TCC.2016.2527646
Abstract: The explosive growth of data brings new challenges to data storage and management in cloud environments. These data usually have to be processed in a timely fashion in the cloud, so any increased latency may cause massive losses to enterprises. Similarity detection plays a very important role in data management, and typical algorithms such as Shingle, Simhash, Traits and the Traditional Sampling Algorithm (TSA) are extensively used. The Shingle, Simhash and Traits algorithms read the entire source file to calculate the corresponding similarity characteristic value, thus requiring many CPU cycles and much memory space and incurring heavy disk access. In addition, the overhead increases with the volume of the data set and results in long delays. Instead of reading the entire file, TSA samples some data blocks and calculates their fingerprints as the similarity characteristic value. The overhead of TSA is fixed and negligible. However, a slight modification of a source file shifts the positions of the file content, so TSA fails to identify similarity even after slight modifications. This paper proposes an Enhanced Position-Aware Sampling algorithm (EPAS) that identifies file similarity for the cloud by applying a modulo operation to the file length. EPAS concurrently samples data blocks from the head and the tail of the modulated file to avoid the position shift incurred by the modifications. Meanwhile, an improved metric is proposed to measure the similarity between different files and bring the detection probability close to the actual probability. Furthermore, this paper describes a query algorithm that reduces the time overhead of similarity detection. Our experimental results demonstrate that EPAS significantly outperforms the existing well-known algorithms in terms of time overhead and CPU and memory occupation. Moreover, EPAS achieves a better tradeoff between precision and recall than other similarity detection algorithms. Therefore, it is an effective approach to similarity identification for the cloud.
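The head-and-tail sampling idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the sampling positions, the modulo rule, or the improved metric, so everything here is an assumption made for illustration. The names `sample_fingerprints` and `matched_fraction`, the parameters `n_samples`, `block_len` and `modulus`, the choice to round the file length down to a multiple of `modulus`, and the plain matched-fraction score are all hypothetical.

```python
import hashlib


def _fingerprint(block: bytes) -> str:
    # Any stable hash works for the illustration; SHA-256 is an arbitrary choice.
    return hashlib.sha256(block).hexdigest()


def sample_fingerprints(data: bytes, n_samples: int = 4,
                        block_len: int = 4, modulus: int = 64):
    """Return (head_fps, tail_fps) for sampled blocks of `data`.

    Assumption: the length is rounded down to a multiple of `modulus`
    so that a small change in file size leaves the sampling interval
    unchanged. This is one possible reading of "modulo file length".
    """
    if n_samples < 2:
        raise ValueError("n_samples must be at least 2")
    mod_len = len(data) - (len(data) % modulus)
    if mod_len < n_samples * block_len:
        raise ValueError("file too small for the chosen parameters")
    interval = (mod_len - block_len) // (n_samples - 1)
    # Head samples are addressed from the start of the file.
    head = [_fingerprint(data[i * interval:i * interval + block_len])
            for i in range(n_samples)]
    # Tail samples are addressed backwards from the end of the file,
    # so an insertion near the head does not move them relative to the end.
    end = len(data)
    tail = [_fingerprint(data[end - block_len - i * interval:end - i * interval])
            for i in range(n_samples)]
    return head, tail


def matched_fraction(a: bytes, b: bytes, **params) -> float:
    """Fraction of position-wise matching fingerprints (head plus tail).

    A deliberately simple score, not the improved metric from the paper.
    """
    head_a, tail_a = sample_fingerprints(a, **params)
    head_b, tail_b = sample_fingerprints(b, **params)
    matches = sum(x == y for x, y in zip(head_a, head_b))
    matches += sum(x == y for x, y in zip(tail_a, tail_b))
    return matches / (len(head_a) + len(tail_a))
```

With `a = bytes(range(256))` and `b = a[:2] + b"\xff\xff\xff" + a[2:]`, every head sample of `b` is shifted by the three inserted bytes and none matches, while three of the four tail samples still line up, so `matched_fraction(a, b)` is 0.375 where head-only sampling would report no similarity at all.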
Funding: National Science Foundation (NSF) of China [61572232]; National Science Foundation (NSF) of China [61272073]; Key Program of NSF of Guangdong Province [S2013020012865]; Fundamental Research Funds for the Central Universities; Open Research Fund of Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences [CARCH201401]; Science and Technology Planning Project of Guangdong Province [2013B090200021]
WOS research area: Computer Science
Language: English
Publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
WOS accession number: WOS:000443894000010
Content type: Journal article
Source URL: http://119.78.100.204/handle/2XEOYT63/4953
Collection: Institute of Computing Technology, Chinese Academy of Sciences, Journal Articles (English)
Corresponding author: Zhou, Yongtao
Author affiliations:
1. Jinan Univ, Dept Comp Sci, Guangzhou 510632, Guangdong, Peoples R China
2. St Francis Xavier Univ, Dept Comp Sci, Antigonish, NS B2G 2W5, Canada
3. Chinese Acad Sci, Inst Comp Technol, State Key Lab Comp Architecture, Beijing 100080, Peoples R China
Recommended citation:
GB/T 7714: Zhou, Yongtao, Deng, Yuhui, Xie, Junjie, et al. EPAS: A Sampling Based Similarity Identification Algorithm for the Cloud[J]. IEEE TRANSACTIONS ON CLOUD COMPUTING, 2018, 6(3): 720-733.
APA: Zhou, Yongtao, Deng, Yuhui, Xie, Junjie, & Yang, Laurence T. (2018). EPAS: A Sampling Based Similarity Identification Algorithm for the Cloud. IEEE TRANSACTIONS ON CLOUD COMPUTING, 6(3), 720-733.
MLA: Zhou, Yongtao, et al. "EPAS: A Sampling Based Similarity Identification Algorithm for the Cloud". IEEE TRANSACTIONS ON CLOUD COMPUTING 6.3 (2018): 720-733.