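Section 3 (Distributed Training) gives a detailed background introduction and analysis, including:

- The points of conflict between FSDP and PP (pipeline parallelism)
- The definition of compute/communication overlap
- An analysis of the critical batch size; the appendix also gives a detailed derivation along with an intuitive explanation

a (mini-)batch is used to approximate the true gradients of the weights with respect to the loss. Increasin…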
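
To make the quoted intuition concrete, here is a minimal sketch (not taken from the paper) of how a mini-batch gradient approximates the full-data gradient, and how the approximation error shrinks as the batch grows. The toy 1-D regression problem, the function names, and the batch sizes are all assumptions chosen for illustration; the sketch uses only the Python standard library.

```python
import random
import statistics


def per_example_grads(w, xs, ys):
    """Gradient of 0.5 * (w*x - y)**2 with respect to w, one value per example."""
    return [(w * x - y) * x for x, y in zip(xs, ys)]


def minibatch_grad_mse(grads, batch_size, trials, rng):
    """Mean squared error of the mini-batch gradient vs. the full-data gradient."""
    true_grad = statistics.fmean(grads)
    errors = []
    for _ in range(trials):
        batch = rng.choices(grads, k=batch_size)  # sample with replacement
        errors.append((statistics.fmean(batch) - true_grad) ** 2)
    return statistics.fmean(errors)


def run_demo(seed=0):
    rng = random.Random(seed)
    # Hypothetical toy data: y = 3x + noise, with gradients evaluated at w = 0.
    xs = [rng.gauss(0.0, 1.0) for _ in range(2000)]
    ys = [3.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    grads = per_example_grads(0.0, xs, ys)
    return {b: minibatch_grad_mse(grads, b, 2000, rng) for b in (1, 4, 16, 64)}


if __name__ == "__main__":
    for b, mse in run_demo().items():
        print(f"batch_size={b:3d}  mse={mse:.4f}")
```

In this toy setting the squared error of the gradient estimate falls roughly in proportion to 1/batch_size. The critical batch size analysis starts from this kind of trade-off: larger batches give a less noisy gradient, but each extra example reduces the noise by less.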