https://main-horse.github.io/posts/visualizing-6d/
DataParallel
- Identical copies of the model exist on every accelerator.
- Gradients are aggregated across DP ranks via all-reduce.
The post also mentions a paper related to FSDP2: SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile.

- After the backward pass, communication happens over the DP links,
- overlapped with the next layer's backward (see the sketch after this list).
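A minimal sketch of that DP step, assuming a process group already set up (e.g. via torchrun) and the NCCL backend; the hook-based structure here is illustrative, not the post's code.

```python
# Illustrative DP gradient sync (not the post's code). Assumes torchrun launch,
# dist.init_process_group already called, and NCCL (needed for ReduceOp.AVG).
import torch
import torch.distributed as dist

def attach_dp_allreduce_hooks(model: torch.nn.Module, handles: list):
    """All-reduce each parameter's gradient as soon as it is accumulated,
    so the communication of later layers overlaps with the backward
    compute of earlier layers."""
    def make_hook(param: torch.nn.Parameter):
        def hook(*_):
            # async_op=True returns a work handle instead of blocking,
            # which is what allows the overlap with the ongoing backward.
            handles.append(dist.all_reduce(param.grad, op=dist.ReduceOp.AVG,
                                           async_op=True))
        return hook

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(make_hook(p))

# Usage: `handles` fills up during loss.backward(); wait on every handle
# before calling optimizer.step().
```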
Hybrid/Fully Sharded DP

During the backward pass, the reduce-scatter is followed by an all-reduce.
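A hedged sketch of that gradient path, assuming the two process groups (the FS/shard group and the DP/replica group) already exist; shapes and names are illustrative.

```python
# Illustrative HSDP gradient path (not the post's code): reduce-scatter inside
# the shard (FS) group, then all-reduce the resulting shard across replicas (DP).
import torch
import torch.distributed as dist

def hsdp_grad_sync(full_grad: torch.Tensor, shard_group, replica_group) -> torch.Tensor:
    fs = dist.get_world_size(shard_group)
    flat = full_grad.flatten()                    # assume numel is divisible by fs
    grad_shard = torch.empty(flat.numel() // fs, dtype=flat.dtype, device=flat.device)
    # 1) each FS rank ends up with 1/fs of the summed gradient
    dist.reduce_scatter_tensor(grad_shard, flat, group=shard_group)
    # 2) that same shard is then summed across the replica (DP) groups
    dist.all_reduce(grad_shard, group=replica_group)
    return grad_shard
```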
Tensor Parallel

- A combination of TP + HSDP.
- Each square shown in the figure corresponds to a GPU pair.
The inner dot within each square lights up whenever a TP communication occurs.
Why would you ever do this? Consider the simple case of an 8x3090 node, where:
- Pairs of GPUs are NVLink’d, making TP viable,
- Groups of 4x GPUs are separated across a NUMA Boundary.
- Though do GPUs really need to account for NUMA boundaries here?
- First, an all-gather along the DP dimension gives each TP rank its own full (unsharded) parameters,
- then the TP compute runs, followed by an all-gather across the TP group (see the sketch after this list).
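A sketch of the TP step only: a hypothetical column-parallel linear, where the DP-level parameter all-gather is assumed to have already produced `w_shard` on each rank.

```python
# Hypothetical column-parallel linear, sketching "compute, then all-gather
# across the TP group". w_shard is this rank's slice of the weight matrix.
import torch
import torch.distributed as dist

def column_parallel_linear(x: torch.Tensor, w_shard: torch.Tensor, tp_group) -> torch.Tensor:
    # x:       [batch, in_features]                 replicated on every TP rank
    # w_shard: [out_features // tp, in_features]    this rank's column shard
    local_out = x @ w_shard.t()                     # [batch, out_features // tp]
    tp = dist.get_world_size(tp_group)
    pieces = [torch.empty_like(local_out) for _ in range(tp)]
    dist.all_gather(pieces, local_out, group=tp_group)
    return torch.cat(pieces, dim=-1)                # [batch, out_features]
```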
Context Parallel
Context Parallelism is a form of data parallelism
CP shards data on the sequence dimension, rather than the batch dim, which requires extra communication overhead to synchronize ops that are sequence-aware, like Scaled Dot Product Attention or SoftMoE dispatch. If you have no sequence-aware ops, CP degrades to DP!
- Without any sequence-aware ops, CP degenerates into DP.
CP does two things:
- op synchronization. This is similar in functionality to TP – at specific execution points, activations have to be shared across each CP group.
- This corresponds to the communication within each CP group at the attention layer.
- weight synchronization. The CP ranks have to participate in something like FSDP, and the easiest way to do this is to ‘fold’ the CP group into the FS-or-DP group. In the example below, the CP dim is folded into the FS dim; this is equivalent to creating
mesh['fs','cp']._flatten('fscp') and passing mesh['fsdp'] to fully_shard in PyTorch.
- This likewise describes folding CP and the DP/FS dims together for FSDP to use (a mesh sketch follows below).
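A sketch of that mesh setup, assuming PyTorch's DeviceMesh and the FSDP2-style fully_shard; the 2×2×2 sizes are made up, and the DP replicate dimension is left unsharded here to keep the fold itself visible.

```python
# Illustrative setup for folding CP into the FS dim (sizes are made up;
# assumes a torchrun launch whose world size matches the mesh, here 2*2*2 = 8).
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard   # FSDP2-style API

mesh = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "fs", "cp"))

# Fold CP into FS, as in the quote, so parameter sharding (and its
# all-gather / reduce-scatter) spans both the FS and CP ranks.
fscp_mesh = mesh["fs", "cp"]._flatten("fscp")

model = torch.nn.Linear(1024, 1024, device="cuda")
fully_shard(model, mesh=fscp_mesh)   # shard over the folded fs×cp group
```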

- An all-gather is done at the DP + CP level.

- Then an all-gather within the CP group collects the full sequence.

- Then the TP compute plus an all-reduce.
- The figure here is quite detailed: for CP, the all-gather comes first, then the compute;
- for TP, the all-reduce comes after the compute.
- So the communication sits on both sides of the compute (see the sketch after this list).
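A sketch of that ordering around attention, assuming TP splits the heads and CP splits the sequence; the helper name and shapes are mine, and only the placement of the collectives is the point.

```python
# Illustrative placement of the CP and TP collectives around attention:
# CP communicates *before* the compute, TP communicates *after* it.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def cp_tp_attention(q, k, v, w_o_shard, cp_group, tp_group):
    # q, k, v:   [batch, local_heads, local_seq, head_dim]  (TP splits heads, CP splits seq)
    # w_o_shard: [hidden, local_heads * head_dim]            row-parallel output-proj shard
    cp = dist.get_world_size(cp_group)

    def gather_seq(t):   # CP: all-gather K/V along the sequence dim first
        parts = [torch.empty_like(t) for _ in range(cp)]
        dist.all_gather(parts, t.contiguous(), group=cp_group)
        return torch.cat(parts, dim=2)

    out = F.scaled_dot_product_attention(q, gather_seq(k), gather_seq(v))
    b, h, s, d = out.shape
    partial = out.transpose(1, 2).reshape(b, s, h * d) @ w_o_shard.t()

    # TP: all-reduce the partial row-parallel projection after the compute
    dist.all_reduce(partial, group=tp_group)
    return partial       # [batch, local_seq, hidden]
```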
EP

- Here EP is placed together with FS.
- Non-expert weights are first synchronized via all-gather; in this case that just means the router.
- After the router runs, the purple part is the all2all (see the dispatch sketch after this list).
- Once the all2all is done, the rest of the compute is the same as with plain CP + TP before.
- This part feels a bit muddled, though: MoE is usually applied to the MLP, in which case CP communication would not be involved at all.
- If it were applied to the attention layer, would that mean different sequences might go through different attention experts?
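A sketch of the all2all dispatch (the purple part), assuming the simplified case where the router sends an equal-sized slab of tokens to every EP rank; real routers produce uneven splits and need explicit split sizes or capacity padding.

```python
# Illustrative MoE token dispatch with equal splits (a real router produces
# uneven splits and would pass output/input split sizes explicitly).
import torch
import torch.distributed as dist

def dispatch_tokens(tokens_by_dest: torch.Tensor, ep_group) -> torch.Tensor:
    # tokens_by_dest: [ep, tokens_per_dest, hidden] -- row i holds the tokens
    # this rank routed to EP rank i (already permuted by the router).
    ep, n, h = tokens_by_dest.shape
    send = tokens_by_dest.reshape(ep * n, h).contiguous()
    recv = torch.empty_like(send)
    # all_to_all_single splits `send` into `ep` equal chunks along dim 0 and
    # exchanges them; chunk i of `recv` is what EP rank i routed to this rank.
    dist.all_to_all_single(recv, send, group=ep_group)
    return recv.view(ep, n, h)   # run the local expert(s) on these, then all2all back
```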
PP
The SOTA in pipelining is ZBPP, which promises something close to perfect pipelining, at the cost of your entire codebase + my entire visualization setup thus far
Perfect pipeline, at the cost of your entire codebase

- It looks like the entire device mesh is replicated one more time.
- After each pipeline stage finishes its compute, communication within the PP group is introduced,
- and everything else is just operations inside each PP group. So PP communicates far less frequently than the other parallelism methods (see the sketch after this list).
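A sketch of that traffic pattern: a hypothetical forward-only stage loop (not a real schedule), assuming for simplicity that the default process group is exactly the PP group, so ranks are stage indices.

```python
# Illustrative forward-only pipeline loop (no real schedule): the only
# cross-stage traffic is one activation send/recv per microbatch.
# Assumes the default process group *is* the PP group (rank == stage index).
import torch
import torch.distributed as dist

def run_stage_forward(stage: torch.nn.Module, num_microbatches: int,
                      act_shape, device="cuda"):
    rank, world = dist.get_rank(), dist.get_world_size()
    outputs = []
    for _ in range(num_microbatches):
        if rank == 0:
            x = torch.randn(act_shape, device=device)   # stand-in for real data
        else:
            x = torch.empty(act_shape, device=device)
            dist.recv(x, src=rank - 1)                   # one recv per microbatch
        y = stage(x)
        if rank < world - 1:
            dist.send(y, dst=rank + 1)                   # one send per microbatch
        outputs.append(y)
    return outputs
```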
There are also some design decisions later in the post:
That hierarchy isn’t important for a mere 2⁶ GPUs; but it could make sense at scale:
- TP=8 applied within-node, with low-latency async TP over NVSwitch.
- EP=16 across rail-optimized leaf switches for best all2all perf
- CP×FS×DP=256, such that the 5D submesh [TP,EP,CP,FS,DP] fills a 32k island
  - CP>FS>DP in terms of latency priority
- PP=? across islands
- The cross-island network is relatively slow, so PP is placed there (a mesh sketch follows below).
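A sketch of what that hierarchy could look like as a DeviceMesh, ordered slowest network first. Only TP=8, EP=16 and CP×FS×DP=256 come from the quote; the PP degree and the particular DP/FS/CP split are made-up placeholders, and the launch would obviously need that many ranks.

```python
# Illustrative 6D mesh for the hierarchy above (PP degree and the DP/FS/CP
# split of 256 are placeholders; only TP=8, EP=16, DPxFSxCP=256 are quoted).
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh(
    "cuda",
    (4, 4, 16, 4, 16, 8),                        # pp, dp, fs, cp, ep, tp
    mesh_dim_names=("pp", "dp", "fs", "cp", "ep", "tp"),
)
# Innermost dims sit on the fastest network: mesh["tp"] stays inside a node,
# mesh["ep"] within rail-optimized leaf switches, mesh["pp"] spans islands.
```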
PP and FSDP
If you allowed for a naive implementation of FSDP to be used, each forward and backward microbatch in a pipeline schedule would require its own allgather/reducescatter. This quickly pushes up the communication cost of FSDP by O(microbatches), which will obviously destroy MFU if e.g. microbatches>=24 as in ZBPP.
- Because FSDP introduces communication at every layer, while PP runs forward/backward once per microbatch,
- combining PP with FSDP amplifies FSDP's communication volume.
- Avoiding that amplification by skipping the communication would defeat FSDP's memory savings.
The author also proposes a new approach here:
- in GPipe, because all forward steps are executed at the start, and I’m using ZeRO2, only 1x allgather of params is required per layer.
- for every backward layer step, I create new local gradients and reduce-scatter them to accumulated gradient shards. This ensures that the total memory required for storing gradients is always roughly equivalent to that of their sharded size.
- only for the last microbatch’s backward, I apply an allreduce of gradients across the DP axis. Meaning: nosync is applied for DP, but not FS.
- In short: use ZeRO-2, but reduce-scatter gradients within the FS group, and only do the DP-level all-reduce at the very end (see the sketch after this list).
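A sketch of that gradient handling with illustrative names; the explicit-call structure is mine, not the post's, and normalization constants are omitted.

```python
# Illustrative ZeRO-2-style gradient handling under pipelining.
# Per backward microbatch: reduce-scatter fresh full-size grads into
# persistent sharded accumulators (memory stays ~sharded size).
# After the *last* microbatch only: all-reduce the shards across DP.
import torch
import torch.distributed as dist

def accumulate_microbatch_grads(local_grads, grad_shards, fs_group):
    fs = dist.get_world_size(fs_group)
    for g, shard in zip(local_grads, grad_shards):
        flat = g.flatten().contiguous()          # assume numel divisible by fs
        tmp = torch.empty_like(shard)            # shard has 1/fs of the elements
        dist.reduce_scatter_tensor(tmp, flat, group=fs_group)
        shard += tmp                             # accumulate in sharded form

def finish_grad_sync(grad_shards, dp_group):
    # Deferred DP sync: the equivalent of no_sync on the DP axis until now.
    for shard in grad_shards:
        dist.all_reduce(shard, group=dp_group)
```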