Parallelism Mesh Zoo Notes


Two recommended books

  • https://jax-ml.github.io/scaling-book/

  • https://huggingface.co/spaces/nanotron/ultrascale-playbook

Why do we need a device mesh:

Device meshes are a reflection of the physical constraints of networking between GPUs.

Choose parallelism strategies according to the physical topology, to optimize communication overhead.
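
As a concrete illustration (not from the original post), here is a minimal PyTorch sketch; the 2-node x 8-GPU shape and the dim names are assumptions, chosen so that the innermost dimension maps to the fast intra-node links.

    # Minimal sketch, assuming 2 nodes x 8 GPUs launched with torchrun (16 ranks).
    # The innermost mesh dim varies fastest over ranks, so "tp" lands on the
    # fast intra-node links and "dp" on the slower inter-node network.
    from torch.distributed.device_mesh import init_device_mesh

    mesh = init_device_mesh("cuda", (2, 8), mesh_dim_names=("dp", "tp"))

    tp_group = mesh["tp"].get_group()  # 8 ranks inside one node
    dp_group = mesh["dp"].get_group()  # 2 ranks, one per node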

How to think about a device mesh:

  • We typically think of 2D and 3D tensors as grids and cubes, but I find it is more helpful (especially in higher dimensions) to think of the device mesh as imposing some self-similar (fractal) structure on the GPUs
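A sketch of that "self-similar" reading, using an assumed 2 x 2 x 8 mesh over 32 GPUs: slicing named dimensions out of the mesh leaves smaller meshes with the same structure.

    # Sketch with assumed sizes (2 x 2 x 8 = 32 GPUs); requires a recent PyTorch
    # that supports slicing a DeviceMesh by one or more dim names.
    from torch.distributed.device_mesh import init_device_mesh

    mesh = init_device_mesh("cuda", (2, 2, 8), mesh_dim_names=("pp", "dp", "tp"))

    # From the point of view of the current rank, each slice is itself a mesh:
    dp_tp = mesh["dp", "tp"]      # the 2 x 8 grid making up this rank's pp stage
    tp = mesh["tp"]               # the 8-GPU row inside that grid
    print(dp_tp.shape, tp.shape)  # (2, 8) (8,)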

HSDP is an extension of FSDP where you shard weights (FSDP) up to the point where you can’t actually do a giant all-gather/reduce-scatter over every GPU, and then replicate these shards to cover the rest of your cluster (DP)

  • The difference between FSDP and HSDP is mainly in the intermediate all-gather

  • Even a 0.5x increase in communication volume apparently has a fairly large impact (see the HSDP sketch below)
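
A minimal HSDP sketch along those lines, assuming a recent PyTorch where FSDP2's fully_shard is exposed under torch.distributed.fsdp; the 4 x 8 layout (shard inside a node, replicate across nodes) is an assumption.

    # Sketch with assumed sizes: shard parameters over the 8 GPUs of a node
    # (FSDP), replicate those shards across 4 such nodes (DP), so the expensive
    # all-gather / reduce-scatter stays on intra-node links.
    import torch.nn as nn
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.fsdp import fully_shard  # FSDP2 entry point

    mesh = init_device_mesh(
        "cuda", (4, 8), mesh_dim_names=("dp_replicate", "dp_shard")
    )

    model = nn.Linear(4096, 4096)
    # A 2-D mesh turns fully_shard into HSDP: replicate over dim 0, shard over dim 1.
    fully_shard(model, mesh=mesh)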

In the “reduce effective batch size” framing, the idea behind TP is that you can only scale up DP until your cluster is as large as your batch size. From a modeling perspective, it can be undesirable to have a batch size that is too large, so you can’t just keep increasing your batch size to get more parallelism

  • DP is limited by the batch size (worked numbers in the sketch below)
    • But hasn't CP already solved this?
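
A quick worked example of that limit; all numbers are made up.

    # Assumed numbers: DP alone caps out at one sample per rank.
    global_batch = 1024            # sequences per optimizer step
    num_gpus = 4096
    max_dp = global_batch          # dp degree cannot exceed the batch size
    tp = num_gpus // max_dp        # so another axis (tp, cp, pp, ...) must
                                   # supply the remaining 4x parallelism
    print(max_dp, tp)              # 1024 4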

Ulysses SP: It aims to alleviate memory pressure from extremely long sequences, so sequences are sharded on input, and only when attention needs to be computed is an all-to-all issued to re-shard on the attention heads rather than the sequence.
Importantly, this means it competes with TP for sharding on the attention heads, which is why you also see people use it to replace TP in MoE models, since it has much less communication than TP

  • Both TP and Ulysses SP shard on attention heads, so they conflict to some extent

  • But the all-to-all Ulysses needs is cheaper than TP's all-gather

In verl, you will just see a device mesh ["dp", "sp"] when you are using their FSDP backend (which is what supports Ulysses).
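A shape-level sketch of that all-to-all (illustrative only; the function below is an assumption, not verl's API): along the "sp" dim of such a mesh, activations stay sequence-sharded outside attention, and the exchange re-shards them onto the head dimension for the attention call.

    # Illustrative sketch, not verl's implementation. Input is sequence-sharded:
    #   x: [batch, seq_len / sp, num_heads, head_dim]
    # Output is head-sharded with the full sequence:
    #   [batch, seq_len, num_heads / sp, head_dim]
    import torch
    import torch.distributed as dist

    def ulysses_reshard(x, sp_group):
        sp = dist.get_world_size(sp_group)
        # Rank r holds sequence chunk r; split its heads into sp pieces ...
        send = [t.contiguous() for t in x.chunk(sp, dim=2)]
        recv = [torch.empty_like(send[0]) for _ in range(sp)]
        # ... and exchange so rank r ends up with head chunk r of every
        # sequence chunk, i.e. the full sequence for num_heads / sp heads.
        dist.all_to_all(recv, send, group=sp_group)
        return torch.cat(recv, dim=1)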

CP with ring attention: CP operates very similarly to SP outside of the attention calls (as it is just plain data parallelism when there is no cross-token dependency), but because it never shards on attention heads, it doesn’t compete with TP and can be used completely orthogonally to TP (TP shards hidden, CP shards sequence).

  • Better compatibility with TP

  • In the attention layers, after TP has sharded the heads, ring attention runs across the CP group (see the mesh sketch below)
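
A mesh-level sketch of that orthogonality; the sizes are assumed. CP and TP sit on independent mesh dimensions because they shard different tensor dimensions.

    # Assumed sizes: 64 GPUs as dp=4, cp=2, tp=8, with tp innermost so it stays
    # on intra-node links. TP shards hidden / attention heads; CP shards the
    # sequence and rotates KV blocks around its ring during attention.
    from torch.distributed.device_mesh import init_device_mesh

    mesh = init_device_mesh("cuda", (4, 2, 8), mesh_dim_names=("dp", "cp", "tp"))

    cp_group = mesh["cp"].get_group()  # ring for passing KV shards in attention
    tp_group = mesh["tp"].get_group()  # all-reduce group for tensor-parallel layers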

In torchtitan, we create a flattened mesh dim “dp_shard_cp” specifically for FSDP sharding

  • Same as in Megatron: dp and cp are flattened into a single dimension for FSDP to use for its all-gather (a flattening sketch follows)
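
A sketch of that flattening, assuming a recent PyTorch where DeviceMesh._flatten (the private helper torchtitan relies on) is available; the sizes are made up.

    # Assumed sizes; _flatten is a private DeviceMesh API used by torchtitan.
    from torch.distributed.device_mesh import init_device_mesh

    mesh = init_device_mesh(
        "cuda", (2, 4, 2, 8),
        mesh_dim_names=("pp", "dp_shard", "cp", "tp"),
    )

    # Fuse dp_shard and cp into one dim so FSDP's all-gather / reduce-scatter
    # spans both: 4 * 2 = 8 ranks per shard group.
    dp_shard_cp = mesh["dp_shard", "cp"]._flatten("dp_shard_cp")
    # fully_shard(model, mesh=dp_shard_cp) would then shard over those 8 ranks.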

I’ve seen both ["dp", "pp", ...] or ["pp", "dp", ...] for meshes with PP, but the order probably doesn’t make too much of a difference as you are likely solidly inter-node at this point

  • Once you are outside the TP (intra-node) range, the ordering probably doesn't matter much (a small worked example follows this list)

  • Should we also care about the higher-level layout, e.g. how racks are organized and how communication works within a rack?
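
A small worked example of why the ordering stops mattering; the topology numbers are assumed.

    # Assumed topology: 8 GPUs per node and tp = 8, so the tp dim already fills
    # a node. Any dim placed outside it (pp, dp, ...) crosses nodes either way,
    # which is why ["dp", "pp", ...] vs ["pp", "dp", ...] changes little.
    gpus_per_node = 8
    pp, dp, tp = 4, 16, 8
    assert tp == gpus_per_node        # tp traffic stays on NVLink
    # One step along dp or pp always jumps by at least a whole node:
    print(tp, dp * tp, pp * dp * tp)  # 8 ranks/node, 128 per pp stage, 512 total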

It’s actually more intuitive to imagine that you have two distinct meshes: ["pp", "dp_replicate", "dp_shard", "cp", "tp"] and ["pp", "dp_shard_mod_ep", "ep", "tp"]

  • Same as in Megatron

The keen-eyed may also notice that there is no intrinsic reason the tp mesh size inside and outside of the expert parallel region, but this is not easily done if you have to have a single global device mesh for everything

  • This looks like a broken sentence, but the intended meaning should be that the TP size does not have to be the same inside and outside the expert-parallel region; you can have an independent ETP (see the sketch below)
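
A sketch of those two views with assumed sizes; the only real constraint is that both cover the same set of ranks, i.e. dp_replicate * dp_shard * cp == dp_shard_mod_ep * ep with pp and tp held fixed (though in principle the expert-side tp could differ).

    # Assumed sizes: 2 * 1 * 8 * 2 * 4 = 128 ranks in both views.
    from torch.distributed.device_mesh import init_device_mesh

    pp, dp_replicate, dp_shard, cp, tp = 2, 1, 8, 2, 4
    ep = 8
    dp_shard_mod_ep = (dp_replicate * dp_shard * cp) // ep  # 16 // 8 = 2

    dense_mesh = init_device_mesh(
        "cuda", (pp, dp_replicate, dp_shard, cp, tp),
        mesh_dim_names=("pp", "dp_replicate", "dp_shard", "cp", "tp"),
    )
    expert_mesh = init_device_mesh(
        "cuda", (pp, dp_shard_mod_ep, ep, tp),
        mesh_dim_names=("pp", "dp_shard_mod_ep", "ep", "tp"),
    )
    # Nothing intrinsic forces the tp here to equal the dense tp (that would be
    # an independent ETP); it is just awkward with a single global device mesh.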
