Enabling Coordinated Register Allocation and Thread-level Parallelism Optimization for GPUs

doi:10.1145/2830772.2830813

	Enabling Coordinated Register Allocation and Thread-level Parallelism Optimization for GPUs
	Xie, Xiaolong ; Liang, Yun ; Li, Xiuhong ; Wu, Yudong ; Sun, Guangyu ; Wang, Tao ; Fan, Dongrui
	2015
Keywords	EFFICIENT COMPILER FRAMEWORK CONTROL FLOW CACHE MODEL PROCESSORS DIVERGENCE GPGPUS FILE
Abstract	The key to high performance on GPUs lies in massive threading, which enables thread switching and hides the latency of functional units and memory accesses. However, running with the maximum thread-level parallelism (TLP) does not necessarily lead to optimal performance, because of excessive thread contention for cache resources. As a result, thread throttling techniques are employed to limit the number of threads that execute concurrently and so preserve data locality. On the other hand, GPUs are equipped with a large register file to enable fast context switching between threads. However, thread throttling techniques that are designed to mitigate cache contention lead to underutilization of registers. Register allocation is a significant factor for performance, as it not only determines single-thread performance but also indirectly affects the TLP. The design space of register allocation and TLP presents new opportunities for performance optimization. However, the complicated correlation between the two factors inevitably leads to many performance dynamics and uncertainties. In this paper, we propose Coordinated Register Allocation and Thread-level parallelism (CRAT), a compiler-based performance optimization framework. To achieve this goal, CRAT first enables effective register allocation. Given a per-thread register limit, CRAT allocates registers by analyzing the lifetime of variables. To reduce the spilling cost, CRAT spills registers to shared memory when possible. Then, CRAT explores the design space by first pruning the design points that cause serious L1 cache thrashing and register underutilization. After that, CRAT employs a prediction model to find the best tradeoff between single-thread performance and TLP. We evaluate CRAT using a set of representative workloads on GPUs. Experimental results indicate that, compared to the optimal thread throttling technique, our framework achieves a performance improvement of up to 1.79X (geometric mean 1.25X).
Funding	National Science Foundation China [61300005]; National High-Tech Research & Development Program of China [2015AA01A301]
Indexed in	EI; CPCI-S(ISTP)
Email	xiexl_pku@pku.edu.cn; ericlyun@pku.edu.cn; lixiuhong@pku.edu.cn; wuyd_pku@pku.edu.cn; gsun@pku.edu.cn; wangtao@pku.edu.cn; fandr@ict.ac.cn
Pages	395-406
Date	05-09 December 2015
Language	English
Source	48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
DOI	10.1145/2830772.2830813
Content type	Other
Source URL	http://ir.pku.edu.cn/handle/20.500.11897/436440
Collection	School of Information Science and Technology
Recommended citation (GB/T 7714)	Xie, Xiaolong, Liang, Yun, Li, Xiuhong, et al. Enabling Coordinated Register Allocation and Thread-level Parallelism Optimization for GPUs. 2015-01-01.
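
The abstract states that register allocation indirectly affects the TLP, and that some design points leave registers underutilized. The following Python sketch illustrates only that arithmetic; it is not CRAT's allocator, pruning step, or prediction model. The hardware figures (a 65536-entry register file and a cap of 2048 resident threads per streaming multiprocessor) and the function names `resident_threads` and `design_points` are assumptions chosen for illustration, and register allocation granularity is ignored.

```python
def resident_threads(regs_per_thread, threads_per_block,
                     regfile_size=65536, max_threads_per_sm=2048):
    """Threads that can be resident on one SM under the assumed limits."""
    if regs_per_thread < 1 or threads_per_block < 1:
        raise ValueError("register and block sizes must be positive")
    # Blocks are scheduled whole, so count in units of blocks.
    blocks_by_regs = regfile_size // (regs_per_thread * threads_per_block)
    blocks_by_threads = max_threads_per_sm // threads_per_block
    return min(blocks_by_regs, blocks_by_threads) * threads_per_block


def design_points(reg_limits, threads_per_block,
                  regfile_size=65536, max_threads_per_sm=2048):
    """Return (regs_per_thread, resident_threads, register_utilization) tuples."""
    points = []
    for regs in reg_limits:
        threads = resident_threads(regs, threads_per_block,
                                   regfile_size, max_threads_per_sm)
        # Fraction of the (assumed) register file actually occupied.
        utilization = threads * regs / regfile_size
        points.append((regs, threads, utilization))
    return points


if __name__ == "__main__":
    for regs, threads, util in design_points([16, 32, 64, 128], 256):
        print(f"{regs:3d} regs/thread -> {threads:4d} threads, "
              f"{util:.0%} of the register file in use")
```

Under these assumed limits, a limit of 16 registers per thread already reaches the thread cap while leaving half of the register file idle, whereas raising the limit from 32 to 64 registers per thread halves the number of resident threads. This is the kind of tradeoff between single-thread performance and TLP that the abstract describes.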
