More than code - To Everyone Working Toward Good.

Parallelism Mesh Zoo Notes

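Two recommended books: https://jax-ml.github.io/scaling-book/ and https://huggingface.co/spaces/nanotron/ultrascale-playbook. Why a device mesh is needed: device meshes are a reflection of the physical constraints of networking between GPUs. Choose the parallelism strategy according to the physical topology so that communication overhead is minimized. How to think about a device mesh: W…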

January 18, 2026 | 0 comments | 56 views | 0 likes | sheep
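The device mesh idea in the excerpt above can be made concrete with a small sketch. It lays out global ranks as a 2D grid and reads communication groups off the rows and columns. The helper names (`build_mesh`, `tp_groups`, `dp_groups`) are hypothetical, and the layout assumes that consecutive ranks share a node with a fast interconnect, so the inner axis is used for tensor parallelism. Real clusters may map ranks differently.

```python
def build_mesh(world_size: int, tp_size: int) -> list[list[int]]:
    """Arrange ranks 0..world_size-1 as a (dp, tp) grid.

    Assumption: ranks on the same row sit on the same node, so the
    bandwidth-hungry tensor-parallel traffic stays on the fast links.
    """
    if world_size % tp_size != 0:
        raise ValueError("world_size must be divisible by tp_size")
    dp_size = world_size // tp_size
    return [[dp * tp_size + tp for tp in range(tp_size)] for dp in range(dp_size)]


def tp_groups(mesh: list[list[int]]) -> list[list[int]]:
    """Each row of the mesh is one tensor-parallel group."""
    return [list(row) for row in mesh]


def dp_groups(mesh: list[list[int]]) -> list[list[int]]:
    """Each column of the mesh is one data-parallel group."""
    return [list(col) for col in zip(*mesh)]


mesh = build_mesh(world_size=8, tp_size=4)
print(tp_groups(mesh))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(dp_groups(mesh))  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

With 8 ranks and a tensor-parallel size of 4, the two rows would correspond to two nodes, and gradient synchronization for data parallelism crosses nodes along the columns.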

https://main-horse.github.io/posts/visualizing-6d/ DataParallel: Identical copies of the model exist on every accelerator. Gradients are aggregated with an all-reduce. The post also mentions this FSDP2 paper: SimpleFSDP: Simpler Fully Sharded Data Parallel with torch.compile. After backward, communication runs across the DP links, at the same time as the next layer's back…

January 18, 2026 | 0 comments | 53 views | 0 likes | sheep
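The gradient aggregation described in the DataParallel excerpt above can be simulated in a single process. This is a minimal sketch, not a real collective: `all_reduce_mean` is a hypothetical helper, gradients are plain Python lists, and averaging (rather than summing) is an assumption about how the replicas combine their results.

```python
def all_reduce_mean(per_rank_grads: list[list[float]]) -> list[list[float]]:
    """Simulate an all-reduce: every rank ends up with the same averaged gradient.

    per_rank_grads[r] is the local gradient computed by replica r on its
    own slice of the batch.
    """
    num_ranks = len(per_rank_grads)
    length = len(per_rank_grads[0])
    reduced = [
        sum(grad[i] for grad in per_rank_grads) / num_ranks
        for i in range(length)
    ]
    # Every replica receives its own copy of the reduced gradient.
    return [list(reduced) for _ in range(num_ranks)]


local_grads = [[1.0, 2.0], [3.0, 4.0]]  # two replicas, two parameters
synced = all_reduce_mean(local_grads)
print(synced)
```

Because every replica applies the same averaged gradient to an identical copy of the model, the copies stay in sync after each optimizer step.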

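A brief introduction to the ContextParallel implementation in MegatronLM, mainly oriented toward the source code. There are quite a few papers related to this topic, as well as some good articles on Zhihu: Sequence Parallelism: Long Sequence Training from System Perspective. Ring self attention mainly introduces distributed computation. From the paper's description it should take two rounds: first compute the scores, then compute S * V. This presumably requires that the S for a given Q be placed on the same device. There is no online computation, so here the Attention act…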

January 10, 2026 | 0 comments | 77 views | 0 likes | sheep
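The two-round reading of ring self attention in the excerpt above (scores first, then S * V, with no online softmax) can be sketched as a single-process simulation. Everything here is an illustrative assumption: `ring_self_attention` is a hypothetical name, devices are modeled as list indices, and the ring transfer is modeled by indexing into the neighbor's chunk rather than by real communication.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ring_self_attention(q_chunks, k_chunks, v_chunks):
    """q_chunks[d], k_chunks[d], v_chunks[d] are the vectors held by device d."""
    n = len(q_chunks)
    scale = 1.0 / math.sqrt(len(k_chunks[0][0]))
    outputs = []
    for d in range(n):
        # Round 1: K chunks travel around the ring; device d collects the
        # full score row for each of its local queries.
        scores = {}
        for step in range(n):
            src = (d - step) % n  # whose K chunk arrives at this step
            scores[src] = [
                [dot(q, k) * scale for k in k_chunks[src]] for q in q_chunks[d]
            ]
        # Plain softmax over the complete row (no online accumulation).
        probs = []
        for qi in range(len(q_chunks[d])):
            row = [s for src in range(n) for s in scores[src][qi]]
            row_max = max(row)
            exps = [math.exp(s - row_max) for s in row]
            total = sum(exps)
            probs.append([e / total for e in exps])
        # Round 2: V chunks travel around the ring; accumulate S * V.
        dim = len(v_chunks[0][0])
        out = [[0.0] * dim for _ in q_chunks[d]]
        for step in range(n):
            src = (d - step) % n
            offset = sum(len(k_chunks[s]) for s in range(src))
            for qi, p_row in enumerate(probs):
                for j, v in enumerate(v_chunks[src]):
                    for c in range(dim):
                        out[qi][c] += p_row[offset + j] * v[c]
        outputs.append(out)
    return outputs


q = [[[0.0, 0.0]], [[0.0, 0.0]]]  # one query per device
k = [[[1.0, 0.0]], [[0.0, 1.0]]]
v = [[[1.0, 2.0]], [[3.0, 4.0]]]
result = ring_self_attention(q, k, v)
print(result)
```

Since each device keeps the whole score row for its own queries before normalizing, the scores for one Q must live on one device, which matches the requirement noted in the excerpt.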

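This post introduces the DataParallel implementation in MegatronLM, for readers who want to read the source code. It mainly covers DDP/FSDP; the distributed optimizer will get a separate post. There is an official design document that is worth a quick look: https://docs.nvidia.com/megatron-core/developer-guide/latest/api-guide/custom_fsdp.html# DDP: The DDP code in MegatronLM mainly lives in core/distributed/distributed…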

January 10, 2026 | 0 comments | 81 views | 1 like | sheep

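This post introduces the EP-related code in MegatronLM. This is my first contact with MoE, and I have not compared it against implementations in other systems (DeepSpeed, etc.), so this post only covers some details of MegatronLM. For the theoretical background, I read these papers while going through the code: GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding; Switch Transformers: Scaling to Trillion Pa…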

January 10, 2026 | 0 comments | 81 views | 0 likes | sheep

This post introduces the PipelineParallel implementation in MegatronLM, mainly from the source-code side. The main related paper is: Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. There are also some classic prerequisite papers: GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism; PipeDream: Generalized Pipeline Parallelism f…

January 10, 2026 | 0 comments | 66 views | 0 likes | sheep

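SequenceParallel in MegatronLM is built mainly for TP. Approaches such as DeepSpeed-Ulysses/RingAttention are called ContextParallel in MegatronLM and will be covered in a separate post. The main paper is: Reducing Activation Recomputation in Large Transformer Models. The logic of SequenceParallel is relatively simple: the earlier TensorParallel design mainly targets the MLP/Attention layers in the figure, while the inputs and outputs of the other layers, on all tp…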

January 10, 2026 | 0 comments | 56 views | 0 likes | sheep

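This post introduces the TensorParallel implementation in MegatronLM, mainly oriented toward the source code. Related paper: Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, which is worth reading first. The TensorParallel implemented in MegatronLM requires changes to the model structure: the layers of the original model are replaced with layers that support parallel computation, unlike Torch FSDP, whose approach is agnostic to the model structure. So while reading the code, the main things to look at are two…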

January 10, 2026 | 0 comments | 66 views | 0 likes | sheep
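The TensorParallel excerpt above says that ordinary layers are swapped for layers that support parallel computation. One such layer, the column-parallel linear layer from the Megatron-LM paper, can be sketched in plain Python. This is a toy single-process illustration, not MegatronLM's code: the helper names are hypothetical, matrices are nested lists, and the final concatenation stands in for an all-gather across tensor-parallel ranks.

```python
def matmul(x, w):
    """Dense matrix product of x (rows x inner) and w (inner x cols)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, parts):
    """Split the weight matrix into `parts` equal column shards."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by parts")
    step = cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def column_parallel_linear(x, w, tp_size):
    """Each simulated rank multiplies the full input by its own column shard."""
    shards = split_columns(w, tp_size)
    partial_outputs = [matmul(x, shard) for shard in shards]  # one per rank
    # Stand-in for an all-gather: concatenate the output slices per row.
    gathered = []
    for r in range(len(x)):
        row = []
        for part in partial_outputs:
            row.extend(part[r])
        gathered.append(row)
    return gathered


x = [[1.0, 2.0]]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
print(column_parallel_linear(x, w, tp_size=2))  # [[11.0, 14.0, 17.0, 20.0]]
```

The sharded result equals the unsharded `matmul(x, w)`, which is why the replacement layer can stand in for the original one while each rank stores only a fraction of the weights.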

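Pork ribs stewed with potatoes and green beans. The method this time:
* Blanch the ribs, then simmer them over low heat for a while
* Cut up the green beans and cut the potatoes into chunks
* Add oil, then ginger slices, garlic slices (just smash and roughly chop them), and scallion segments; stir-fry briefly over low heat
* Add the ribs and stir-fry. Add soy sauce, a generous amount, about one ladleful
* Stir-fry a little, then add the green beans. Stir-fry briefly, then add hot water to cover the ribs
* Cover and stew over medium-low heat. Once the water boils, add the potatoes
* Stew for about 10 minutes, lift the lid, and taste the broth; if it is not flavorful enough, add more soy sauce
* Reduce the sauce, add some scallion segments, and serve

Improvements:
* The ribs were a bit tough; either simmer them longer or pressure-cook them first. That makes the ribs softer and tastier
* The potatoes/green beans got a bit mushy in the later stew; I stewed them for about 15 minutes. The time could be cut a bit…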

January 3, 2026 | 0 comments | 110 views | 0 likes | sheep

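There was an earlier introduction to FSDP1; this time we look at FSDP2, again at the source-analysis level. One special point is that the FSDP2 documentation on GitHub is very thorough and explains both the supported features and the code-structure design clearly, so this post mainly serves as a supplement. Before reading the FSDP2 code, I recommend reading this document first: https://github.com/pytorch/pytorch/issues/114299 Also, because I am not very familiar with torch dynamo, I will not cover how FSDP2 relates to compiler optimizations. In my view, the main differences between FSDP2 and FSDP1 are the following: …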

January 1, 2026 | 0 comments | 129 views | 1 like | sheep


Parallelism Mesh Zoo Notes

Visualizing 6D Mesh Parallelism Notes

MegatronLM ContextParallel

MegatronLM DataParallel

MegatronLM ExpertParallel

MegatronLM PipelineParallel

MegatronLM SequenceParallel

MegatronLM TensorParallel

cook notes 07

Pytorch FSDP2 Introduction