Warp
# Warp是什么
The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps.
Warp是Nvidia GPU中Thread的调度单位,每个Warp包含32个Thread。
# Thread在Warp中的组织方式
When a multiprocessor is given one or more thread blocks to execute, it partitions them into warps and each warp gets scheduled by a warp scheduler for execution.
Warp的划分是由处理器自动进行的。
The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0.
每个Warp中的Thread ID都是连续的。
所以,用户在程序里可以通过Thread ID判断Thread被划到了哪个Warp里,但无法指定Warp的划分方式。
# Warp对计算的影响
Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently.
每个Warp在从相同的指令地址上开始执行,但有各自的指令计数器。
可以理解为多个Thread都在扫描同一个数组,但各自有各自的进度。
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path.
但Warp中的每个Thread不是各自执行不同的指令,而是全部执行相同的指令。
所以这个运行模式也不是多个“扫描机”在相互独立地扫描同一个数组,而是一个“扫描机”在扫描数组然后把数据发给多个“执行机”。
If threads of a warp diverge via a data-dependent conditional branch, the warp executes each branch path taken, disabling threads that are not on that path.
在这种模式下,如果Thread里面有if什么的跳过了一些指令,那这个Thread将会被暂时暂停,直到“扫描机”扫到了这个Thread要执行的下一条指令。
# Warp相关操作
The
__shfl_sync()
intrinsics permit exchanging of a variable between threads within a warp without use of shared memory. The exchange occurs simultaneously for all active threads within the warp (and named inmask
), moving 4 or 8 bytes of data per thread depending on the type.
Threads within a warp are referred to as lanes, and may have an index between 0 and
warpSize-1
(inclusive). Four source-lane addressing modes are supported:
- __shfl_sync()
- Direct copy from indexed lane
- __shfl_up_sync()
- Copy from a lane with lower ID relative to caller
- __shfl_down_sync()
- Copy from a lane with higher ID relative to caller
- __shfl_xor_sync()
- Copy from a lane based on bitwise XOR of own lane ID
可以看到主要都是一些传数据的函数。
Threads may only read data from another thread which is actively participating in the
__shfl_sync()
command. If the target thread is inactive, the retrieved value is undefined.
T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize); T __shfl_up_sync(unsigned mask, T var, unsigned int delta, int width=warpSize); T __shfl_down_sync(unsigned mask, T var, unsigned int delta, int width=warpSize); T __shfl_xor_sync(unsigned mask, T var, int laneMask, int width=warpSize);
1
2
3
4
T
can beint
,unsigned int
,long
,unsigned long
,long long
,unsigned long long
,float
ordouble
. With thecuda_fp16.h
header included,T
can also be__half
or__half2
. Similarly, with the cuda_bf16.h header included, T can also be__nv_bfloat16
or__nv_bfloat162
.
__shfl_sync()
returns the value ofvar
held by the thread whose ID is given bysrcLane
. Ifwidth
is less thanwarpSize
then each subsection of the warp behaves as a separate entity with a starting logical lane ID of 0. IfsrcLane
is outside the range [0:width-1
], the value returned corresponds to the value of var held by thesrcLane
modulowidth
(i.e. within the same subsection).
所以这些函数的第一个参数mask
和地四个参数width
分别通过掩码和范围限定了操作涉及的Thread的ID范围,第二个参数var
指定了要传哪个变量的值,第三个参数用于指定具体要找哪个Thread,最后返回值是要取的那个参数的值。
# 其他
If applications have warp-synchronous codes, they will need to insert the new
__syncwarp()
warp-wide barrier synchronization instruction between any steps where data is exchanged between threads via global or shared memory. Assumptions that code is executed in lockstep or that reads/writes from separate threads are visible across a warp without synchronization are invalid.
void __syncwarp(unsigned mask=0xffffffff);
1will cause the executing thread to wait until all warp lanes named in mask have executed a
__syncwarp()
(with the same mask) before resuming execution. All non-exited threads named in mask must execute a corresponding__syncwarp()
with the same mask, or the result is undefined.
就是Warp级的同步,类似__syncthreads()