
Choosing a Transport in RCCL/NCCL: P2P, SHM, or NET

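This article summarizes, for the transport path selection step, the conditions each transport must satisfy to be chosen and the priority among the transports.

This is a preliminary summary and will be updated later; additions are welcome.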

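Source location: tools/topo_expl

Topology structure:

Before the checks run, ret is initialized to 0, meaning the transport cannot be used. If the conditions of a given transport are satisfied, ret is set to 1.

PATH_LOC is the path from a node to itself. PATH_NVL means every link on the path is NVLink. PATH_PIX means the path crosses at most one PCIe switch. PATH_PXB means the path crosses multiple PCIe switches without going through the CPU. PATH_PHB means the path goes through the CPU (the PCIe host bridge). PATH_SYS is a path between different NUMA nodes.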

// Local (myself)
#define PATH_LOC 0
// Connection traversing NVLink
#define PATH_NVL 1
// Connection through NVLink using an intermediate GPU
#define PATH_NVB 2
// Connection traversing at most a single PCIe bridge
#define PATH_PIX 3
// Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
#define PATH_PXB 4
// Connection between a GPU and a NIC using an intermediate GPU. Used to enable rail-local, aggregated network send/recv operations.
#define PATH_PXN 5
// Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
#define PATH_PHB 6
// Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
#define PATH_SYS 7
// Connection through the network
#define PATH_NET 8
// Disconnected
#define PATH_DIS 9

P2P:

Selecting the transport path:
Source location: src/transport.cc
See the canConnect call inside the selectTransport function.
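
The priority idea can be sketched as a loop that asks each transport's canConnect in order and keeps the first one that sets ret to 1. This is a simplified model, not the actual selectTransport code: the types PeerPair and TransportEntry, the helper names, and the treatment of NET as an always-available fallback are assumptions made for illustration.

```cpp
#include <array>

// Hypothetical, simplified stand-in for a pair of peers: it only records
// what the real p2pCanConnect / shmCanConnect would have answered.
struct PeerPair {
  bool p2pOk;
  bool shmOk;
};

enum class Transport { P2P, SHM, NET, None };

// Mimics the canConnect convention: write 0 or 1 into *ret.
using CanConnectFn = void (*)(int* ret, const PeerPair& peers);

inline void p2pAccepts(int* ret, const PeerPair& p) { *ret = p.p2pOk ? 1 : 0; }
inline void shmAccepts(int* ret, const PeerPair& p) { *ret = p.shmOk ? 1 : 0; }
// Assumption: NET is modeled as the fallback that always accepts.
inline void netAccepts(int* ret, const PeerPair&) { *ret = 1; }

struct TransportEntry {
  Transport id;
  CanConnectFn canConnect;
};

// Walk the transports in priority order (P2P, SHM, NET) and return the
// first one whose canConnect sets ret to 1.
inline Transport selectTransportModel(const PeerPair& peers) {
  static const std::array<TransportEntry, 3> kTransports = {{
      {Transport::P2P, p2pAccepts},
      {Transport::SHM, shmAccepts},
      {Transport::NET, netAccepts},
  }};
  for (const TransportEntry& t : kTransports) {
    int ret = 0;  // 0 means "this transport cannot be used"
    t.canConnect(&ret, peers);
    if (ret == 1) return t.id;
  }
  return Transport::None;
}
```

In this model, a pair for which P2P is rejected but SHM is accepted, PeerPair{false, true}, resolves to Transport::SHM, and a pair rejected by both falls through to Transport::NET.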

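The source tree contains two copies of p2pCanConnect.

Currently, an INFO log statement added to copy 2 produces no output; it has to be added in p2p.cc for the log to appear when running the tests.

1. p2pCanConnect
Source path: src/transport/p2p.cc

2. p2pCanConnect
Source path: tools/topo_expl/model.cpp

The logic is walked through first, followed by the source code.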

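1. Check whether the devices support fine-grained GPU operations

If either device lacks fine-grain support (fine-grained GPU shared memory), the two devices are immediately ruled out for P2P.

2. Check whether the devices belong to the same host or container

Compare hostHash and shmDev:
hostHash: a unique identifier of the host the device belongs to.
shmDev: identifies the /dev/shm device the process sees; peers in isolated containers see different values.
If the devices are not on the same host, or do not share the same shared-memory domain, P2P is ruled out.

3. Check the topology and the communication path between the devices

Call ncclTopoCheckP2p to check the P2P topology between the two devices:
If *ret = 0: there is no usable P2P path between the devices.
If intermediateRank != -1: traffic between the devices has to be relayed through a third GPU. In this case P2P is disabled only when the useMemcpy flag is set, and the function returns without running the NET check in step 4.

4. Decide whether network communication would be better

Call ncclTopoCheckNet to check whether going through the network (e.g. a NIC) would be better than P2P:
If useNet = 1: the network is the better fit, so P2P is disabled.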
 

/* Determine if two peers can communicate with P2P */
ncclResult_t p2pCanConnect(int* ret, struct ncclTopoSystem* topo, struct ncclTopoGraph* graph, struct ncclPeerInfo* info1, struct ncclPeerInfo* info2) {
  if (!info1->hasFineGrain || !info2->hasFineGrain) {
    *ret = 0;
    return ncclSuccess;
  }

  // Rule out different nodes / isolated containers
  if (info1->hostHash != info2->hostHash || info1->shmDev != info2->shmDev) {
    *ret = 0;
    return ncclSuccess;
  }

  // Check topology / p2p level.
  int intermediateRank;
  NCCLCHECK(ncclTopoCheckP2p(topo, info1->busId, info2->busId, ret, NULL, &intermediateRank));
  if (*ret == 0) return ncclSuccess;
  if (intermediateRank != -1) {
    if (useMemcpy) *ret = 0;
    return ncclSuccess;
  }

  // Check if NET would work better
  int useNet = 0;
  NCCLCHECK(ncclTopoCheckNet(topo, info1->busId, info2->busId, &useNet));
  if (useNet) {
    *ret = 0;
    return ncclSuccess;
  }

  *ret = 1;
  return ncclSuccess;
}
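
To try the decision order without a GPU, the checks can be modeled as a pure function. This is only a sketch: PeerInfoModel, TopoAnswersModel and p2pDecision are hypothetical names, and the answers that ncclTopoCheckP2p and ncclTopoCheckNet would compute from the real topology are passed in as plain values.

```cpp
#include <cstdint>

// Hypothetical, reduced view of a peer: only the fields the check reads.
struct PeerInfoModel {
  bool hasFineGrain;
  std::uint64_t hostHash;
  std::uint64_t shmDev;
};

// Hypothetical stand-in for what the topology queries would report
// for one pair of GPUs.
struct TopoAnswersModel {
  bool p2pAllowed;       // *ret as set by ncclTopoCheckP2p
  int intermediateRank;  // -1 when the path is direct
  bool useNet;           // result of ncclTopoCheckNet
};

// Returns 1 if P2P can be used, 0 otherwise, following the same order of
// checks as the p2pCanConnect source quoted in the article.
inline int p2pDecision(const PeerInfoModel& a, const PeerInfoModel& b,
                       const TopoAnswersModel& topo, bool useMemcpy) {
  // Step 1: both devices need fine-grain support.
  if (!a.hasFineGrain || !b.hasFineGrain) return 0;
  // Step 2: same host and same /dev/shm (not isolated containers).
  if (a.hostHash != b.hostHash || a.shmDev != b.shmDev) return 0;
  // Step 3: topology / p2p level.
  if (!topo.p2pAllowed) return 0;
  // Relayed through another GPU: only useMemcpy vetoes P2P, and the NET
  // check below is skipped.
  if (topo.intermediateRank != -1) return useMemcpy ? 0 : 1;
  // Step 4: prefer NET if the topology says it would work better.
  if (topo.useNet) return 0;
  return 1;
}
```

One consequence that is easy to miss when reading the flattened source: with an intermediate GPU on the path, a useNet result never comes into play, because the function returns before ncclTopoCheckNet is called.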

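The key piece here is ncclTopoCheckP2p, which checks whether the path between two GPUs in the topology qualifies for P2P, mainly by comparing the path type against p2pLevel. By default p2pLevel is PATH_SYS, and a level set by the user (NCCL_P2P_DISABLE / NCCL_P2P_LEVEL) takes precedence. Without a user override, ARM CPUs and x86 CPUs from Intel or ZHAOXIN get the stricter p2pLevel = PATH_PXB. P2P is allowed (p2p = 1) when the actual path satisfies path->type <= p2pLevel.

The source location below is the post-build (hipified) path: build/release/hipify/src/graph/paths.cc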

int ncclTopoUserP2pLevel = -1;
ncclResult_t ncclTopoCheckP2p(struct ncclTopoSystem* system, int64_t id1, int64_t id2, int* p2p, int *read, int* intermediateRank) {
  *p2p = 0;
  if (read) *read = 0;
  if (intermediateRank) *intermediateRank = -1;

  // Get GPUs from topology
  int g1, g2;
  NCCLCHECK(ncclTopoIdToIndex(system, GPU, id1, &g1));
  struct ncclTopoNode* gpu1 = system->nodes[GPU].nodes+g1;
  if (ncclTopoIdToIndex(system, GPU, id2, &g2) == ncclInternalError) {
    // GPU not found, we can't use p2p.
    return ncclSuccess;
  }

  int intermediateIndex = -1;
  // Set intermediate GPU rank, if routing through an intermediate GPU.
  struct ncclTopoLinkList* path = gpu1->paths[GPU]+g2;
  if (path->count == 2) {
    struct ncclTopoNode* intermediateNode = path->list[0]->remNode;
    if (intermediateNode->type == GPU) {
      intermediateIndex = intermediateNode - system->nodes[GPU].nodes;
      if (intermediateRank) *intermediateRank = intermediateNode->gpu.rank;
    }
  }

  // In general, use P2P whenever we can.
  int p2pLevel = PATH_SYS;

  // User override
  if (ncclTopoUserP2pLevel == -1)
    NCCLCHECK(ncclGetLevel(&ncclTopoUserP2pLevel, "NCCL_P2P_DISABLE", "NCCL_P2P_LEVEL"));
  if (ncclTopoUserP2pLevel != -2) {
    p2pLevel = ncclTopoUserP2pLevel;
    goto compare;
  }

  // Don't use P2P through ARM CPUs
  int arch, vendor, model;
  NCCLCHECK(ncclTopoCpuType(system, &arch, &vendor, &model));
  if (arch == NCCL_TOPO_CPU_ARCH_ARM) p2pLevel = PATH_PXB;
  if (arch == NCCL_TOPO_CPU_ARCH_X86 && vendor == NCCL_TOPO_CPU_VENDOR_INTEL) {
    p2pLevel = PATH_PXB;
  }
  if (arch == NCCL_TOPO_CPU_ARCH_X86 && vendor == NCCL_TOPO_CPU_VENDOR_ZHAOXIN) {
    p2pLevel = PATH_PXB;
  }

compare:
  // Compute the PCI distance and compare with the p2pLevel.
  if (path->type <= p2pLevel) *p2p = 1;

#if defined(__HIP_PLATFORM_AMD__) || defined(__HIPCC__)
#else
  if (*p2p == 1) {
    // NCCL_IGNORE_DISABLED_P2P=2 is used by unit tests that don't want to
    // validate against NVML at all since they are pretending to be on other hw.
    if (g1 != g2 && ncclParamIgnoreDisabledP2p() != 2) {
      int indexes[3] = {-1,-1,-1};
      int verticeN = 0;
      NCCLCHECK(ncclNvmlEnsureInitialized());

      indexes[verticeN++] = system->nodes[GPU].nodes[g1].gpu.dev;
      if (intermediateIndex != -1) indexes[verticeN++] = system->nodes[GPU].nodes[intermediateIndex].gpu.dev;
      indexes[verticeN++] = system->nodes[GPU].nodes[g2].gpu.dev;

      for (int i=1; i < verticeN; i++) {
        nvmlGpuP2PStatus_t status;
        status = ncclNvmlDevicePairs[indexes[i-1]][indexes[i-0]].p2pStatusRead;
        bool good = status == NVML_P2P_STATUS_OK;
        status = ncclNvmlDevicePairs[indexes[i-1]][indexes[i-0]].p2pStatusWrite;
        good &= status == NVML_P2P_STATUS_OK;
        if (!good) {
          if (!ncclParamIgnoreDisabledP2p()) {
            if (path->type <= PATH_NVB) {
              WARN("P2P is disabled between NVLINK connected GPUs %d and %d. This should not be the case given their connectivity, and is probably due to a hardware issue. If you still want to proceed, you can set NCCL_IGNORE_DISABLED_P2P=1.", indexes[i-1], indexes[i-0]);
              return ncclUnhandledCudaError;
            } else if (path->type < PATH_SYS) {
              INFO(NCCL_INIT, "P2P is disabled between connected GPUs %d and %d. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.", indexes[i-1], indexes[i-0]);
            }
          }
          *p2p = 0;
        }
      }
    }
  }
#endif

  if (path->type == PATH_NVL) {
    struct ncclTopoNode* gpu2 = system->nodes[GPU].nodes+g2;
    // Enable P2P Read for Ampere/NVLink only
    if (read && (gpu1->gpu.cudaCompCap == gpu2->gpu.cudaCompCap) && (gpu1->gpu.cudaCompCap == 80)) *read = 1;
  }

  return ncclSuccess;
}

// MNNVL: Check whether peers are in the same fabric cluster and clique
ncclResult_t ncclTopoCheckMNNVL(struct ncclTopoSystem* system, struct ncclPeerInfo* info1, struct ncclPeerInfo* info2, int* ret) {
  *ret = 0;

  nvmlGpuFabricInfoV_t *fabricInfo1 = &info1->fabricInfo;
  nvmlGpuFabricInfoV_t *fabricInfo2 = &info2->fabricInfo;
  // A zero UUID means we don't have MNNVL fabric info
  if ((((long *)&fabricInfo2->clusterUuid)[0]|((long *)fabricInfo2->clusterUuid)[1]) == 0) return ncclSuccess;
  if ((memcmp(fabricInfo1->clusterUuid, fabricInfo2->clusterUuid, NVML_GPU_FABRIC_UUID_LEN) == 0) &&
      (fabricInfo1->cliqueId == fabricInfo2->cliqueId)) {
    INFO(NCCL_NET, "MNNVL matching peer 0x%lx UUID %lx.%lx cliqueId 0x%x",
         info2->busId, ((long *)fabricInfo2->clusterUuid)[0], ((long *)fabricInfo2->clusterUuid)[1], fabricInfo2->cliqueId);
    *ret = 1;
  }
  return ncclSuccess;
}
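
The p2pLevel part of ncclTopoCheckP2p reduces to picking a threshold and comparing it with the path type. A minimal sketch of that rule, leaving out the NVML validation and the intermediate-GPU handling: CpuKind, kNoUserOverride, effectiveP2pLevel and p2pAllowed are names made up for this example, and the user level is passed in directly instead of being read from the environment.

```cpp
// Path types, ordered from closest to farthest (values as listed above).
enum PathType {
  PATH_LOC = 0, PATH_NVL = 1, PATH_NVB = 2, PATH_PIX = 3, PATH_PXB = 4,
  PATH_PXN = 5, PATH_PHB = 6, PATH_SYS = 7, PATH_NET = 8, PATH_DIS = 9
};

// Hypothetical condensed CPU classification for this sketch.
enum class CpuKind { Arm, X86Intel, X86Zhaoxin, Other };

// Mirrors the "-2 means no user override" test in the quoted source.
constexpr int kNoUserOverride = -2;

// Pick the farthest path type for which P2P is still acceptable.
inline int effectiveP2pLevel(int userLevel, CpuKind cpu) {
  // A user-supplied level wins over the CPU-based defaults.
  if (userLevel != kNoUserOverride) return userLevel;
  switch (cpu) {
    case CpuKind::Arm:
    case CpuKind::X86Intel:
    case CpuKind::X86Zhaoxin:
      return PATH_PXB;  // do not go through the CPU on these platforms
    default:
      return PATH_SYS;  // in general, use P2P whenever possible
  }
}

// P2P is allowed when the actual path is at least as close as the level.
inline bool p2pAllowed(PathType pathType, int userLevel, CpuKind cpu) {
  return pathType <= effectiveP2pLevel(userLevel, cpu);
}
```

Because the PATH_* constants grow with distance, a single <= comparison is enough. For example, a PATH_PHB path passes with the default PATH_SYS level but fails once the level is tightened to PATH_PXB.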

SHM:

shmCanConnect also exists in two source locations:

1. Source location: src/transport/shm.cc


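2. Source path: tools/topo_expl/model.cpp

The logic is walked through first, followed by the source code.

1. Initialize ret to 0;

2. Check whether shared memory (SHM) is disabled;

3. Check whether the network should be used instead;

4. Check whether the two devices are on the same host;

5. Check whether the devices share the same /dev/shm;

6. If all conditions hold, set ret = 1.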

/* Determine if two peers can communicate with SHM */
ncclResult_t shmCanConnect(int* ret, struct ncclTopoSystem* topo, struct ncclTopoGraph* graph, struct ncclPeerInfo* info1, struct ncclPeerInfo* info2) {
  *ret = 0;

  if (ncclParamShmDisable() == 1) return ncclSuccess;

  int useNet = 0;
  NCCLCHECK(ncclTopoCheckNet(topo, info1->busId, info2->busId, &useNet));
  if (useNet) return ncclSuccess;

  // Same host?
  TRACE(NCCL_INIT|NCCL_SHM, "peer1 hostHash %lx peer2 hostHash %lx", info1->hostHash, info2->hostHash);
  if (info1->hostHash != info2->hostHash) return ncclSuccess;

  // Common /dev/shm (between containers) ?
  TRACE(NCCL_INIT|NCCL_SHM, "peer1 shmDev %lx peer2 shmDev %lx", info1->shmDev, info2->shmDev);
  if (info1->shmDev != info2->shmDev) return ncclSuccess;

  *ret = 1;

  return ncclSuccess;
}
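
The SHM check can be modeled the same way, as a pure function of its inputs. This is a sketch under assumptions: ShmPeerModel and shmDecision are hypothetical names, and the values that ncclParamShmDisable and ncclTopoCheckNet would return are passed in as booleans.

```cpp
#include <cstdint>

// Hypothetical, reduced view of a peer: only the fields the SHM check reads.
struct ShmPeerModel {
  std::uint64_t hostHash;  // identifies the host
  std::uint64_t shmDev;    // identifies the /dev/shm the process sees
};

// Returns 1 if SHM can be used, 0 otherwise, in the same order as the
// shmCanConnect source quoted in the article.
inline int shmDecision(const ShmPeerModel& a, const ShmPeerModel& b,
                       bool shmDisabled, bool useNet) {
  if (shmDisabled) return 0;               // SHM turned off by parameter
  if (useNet) return 0;                    // topology prefers the network
  if (a.hostHash != b.hostHash) return 0;  // different hosts
  if (a.shmDev != b.shmDev) return 0;      // different /dev/shm (containers)
  return 1;
}
```

Compared with the P2P check, there is no dependency on the path type here: two processes on the same host that see the same /dev/shm qualify unless SHM is disabled or the network is preferred.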

