Testing the relationship between srcnode_gpc, srcunit_l1, and srcunit_tex in NCU

  • 1. Reference links
  • 2. Notes
  • 3. Reproduction steps

1. Reference links

  • Different betweent in lts__t_sectors_srcunit_tex_op_read.sum and lts__t_bytes.sum

2. Notes

lts__t_sectors_srcunit_tex_op_read.sum: total number of 32-byte sectors requested by src unit tex
(this is all l1tex including local, global, surface, and texture) and operation type is read from any aperture (device, sysmem, peer).
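
Since each sector is 32 bytes, a sector count can be converted to bytes by multiplying by 32. The following is a minimal sketch of that conversion; the names `kSectorBytes`, `sectors_to_bytes`, and `bytes_to_sectors` are illustrative helpers and not part of NCU or any NVIDIA API, and the second helper assumes a contiguous, 32-byte-aligned range.

```cpp
#include <cstdint>

// Assumption: an LTS (L2) sector is 32 bytes, as stated in the metric description.
constexpr std::uint64_t kSectorBytes = 32;

// Hypothetical helper: convert a sector count reported by ncu into bytes.
constexpr std::uint64_t sectors_to_bytes(std::uint64_t sectors) {
    return sectors * kSectorBytes;
}

// Hypothetical helper: sectors covered by a contiguous, 32-byte-aligned byte range.
constexpr std::uint64_t bytes_to_sectors(std::uint64_t bytes) {
    return (bytes + kSectorBytes - 1) / kSectorBytes;
}

// One warp reading one float per thread, fully coalesced: 32 * 4 = 128 bytes = 4 sectors.
static_assert(bytes_to_sectors(32 * sizeof(float)) == 4, "one warp of floats spans 4 sectors");
static_assert(sectors_to_bytes(4) == 128, "4 sectors hold 128 bytes");
```

The checks are `static_assert`s, so compiling the file (for example with any C++11 compiler) is enough to verify them.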

3. Reproduction steps

tee srcnode_srcunit_test.cu<<-'EOF'
#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>
#include <vector>
#include <stdio.h>
#include <assert.h>
#include <cstdio>
#include <cuda.h>

// Abort with file/line information if a CUDA runtime call fails.
#define CHECK_CUDA(call)                                           \
    do {                                                           \
        cudaError_t err = call;                                    \
        if (err != cudaSuccess) {                                  \
            std::cerr << "CUDA error at " << __FILE__ << ":" << __LINE__; \
            std::cerr << " code=" << err << " (" << cudaGetErrorString(err) << ")" << std::endl; \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Each thread copies one float: one global load and one global store.
__global__ void kernel(float *input, float *output)
{
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    output[tid] = input[tid];
}

int main(int argc, char *argv[])
{
    int deviceid = 0;
    CHECK_CUDA(cudaSetDevice(deviceid));

    int block_count = 1;
    int block_size = 32 * 4; // 4 warps issue requests
    int thread_size = block_count * block_size;

    float *input;
    CHECK_CUDA(cudaMalloc(&input, thread_size * sizeof(float)));
    float *output;
    CHECK_CUDA(cudaMalloc(&output, thread_size * sizeof(float)));

    kernel<<<block_count, block_size>>>(input, output);
    CHECK_CUDA(cudaDeviceSynchronize());

    CHECK_CUDA(cudaFree(input));
    CHECK_CUDA(cudaFree(output));
    return 0;
}
EOF
/usr/local/cuda/bin/nvcc -std=c++17 -arch=sm_86 -lineinfo  -o srcnode_srcunit_test srcnode_srcunit_test.cu -I /usr/local/cuda/include -L /usr/local/cuda/lib64 -lcuda

/usr/local/NVIDIA-Nsight-Compute/ncu --query-metrics | grep "aperture_device " | awk '{ALL=$1".sum,"ALL}END{print ALL}' > metrics.cfg
/usr/local/NVIDIA-Nsight-Compute/ncu --metrics `cat metrics.cfg` ./srcnode_srcunit_test

/usr/local/NVIDIA-Nsight-Compute/ncu --query-metrics | grep "lts.*op_read " | grep -v "aperture" | awk '{print $1}' | egrep "op_read$" |awk '{ALL=$1".sum,"ALL}END{print ALL}' > metrics.cfg
/usr/local/NVIDIA-Nsight-Compute/ncu --metrics `cat metrics.cfg` ./srcnode_srcunit_test
  • Metric descriptions
lts__t_requests_aperture_device              Counter request # of LTS requests accessing device memory (vidmem)
lts__t_requests_srcnode_gpc_aperture_device  Counter request # of LTS requests from node GPC accessing device memory (vidmem)
lts__t_requests_srcunit_l1_aperture_device   Counter request # of LTS requests from unit L1 accessing device memory (vidmem)
lts__t_requests_srcunit_tex_aperture_device  Counter request # of LTS requests from unit TEX accessing device memory (vidmem)

lts__t_sectors_aperture_device               Counter sector  # of LTS sectors accessing device memory (vidmem)
lts__t_sectors_srcnode_gpc_aperture_device   Counter sector  # of LTS sectors from node GPC accessing device memory (vidmem)
lts__t_sectors_srcunit_l1_aperture_device    Counter sector  # of LTS sectors from unit L1 accessing device memory (vidmem)
lts__t_sectors_srcunit_tex_aperture_device   Counter sector  # of LTS sectors from unit TEX accessing device memory (vidmem)

lts__t_requests_op_read                      Counter request # of LTS requests for reads
lts__t_requests_srcnode_gpc_op_read          Counter request # of LTS requests from node GPC for reads
lts__t_requests_srcunit_l1_op_read           Counter request # of LTS requests from unit L1 for reads
lts__t_requests_srcunit_tex_op_read          Counter request # of LTS requests from unit TEX for reads

lts__t_sectors_op_read                       Counter sector  # of LTS sectors for reads
lts__t_sectors_srcnode_gpc_op_read           Counter sector  # of LTS sectors from node GPC for reads
lts__t_sectors_srcunit_l1_op_read            Counter sector  # of LTS sectors from unit L1 for reads
lts__t_sectors_srcunit_tex_op_read           Counter sector  # of LTS sectors from unit TEX for reads
  • Output
  kernel(float *, float *) (1, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.6
    Section: Command line profiler metrics
    ----------------------------------------------- ----------- ------------
    Metric Name                                     Metric Unit Metric Value
    ----------------------------------------------- ----------- ------------
    lts__t_requests_aperture_device.sum                 request           84
    lts__t_requests_srcnode_gpc_aperture_device.sum     request           74
    lts__t_requests_srcunit_l1_aperture_device.sum      request            0
    lts__t_requests_srcunit_tex_aperture_device.sum     request            8  # the instruction issued 4 requests
    lts__t_sectors_aperture_device.sum                   sector          319
    lts__t_sectors_srcnode_gpc_aperture_device.sum       sector          290
    lts__t_sectors_srcunit_l1_aperture_device.sum        sector            0
    lts__t_sectors_srcunit_tex_aperture_device.sum       sector           32  # the instruction requested 16 sectors, but L2 requested 32 sectors from DRAM
    ----------------------------------------------- ----------- ------------

  kernel(float *, float *) (1, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.6
    Section: Command line profiler metrics
    --------------------------------------- ----------- ------------
    Metric Name                             Metric Unit Metric Value
    --------------------------------------- ----------- ------------
    lts__t_requests_op_read.sum                 request           83
    lts__t_requests_srcnode_gpc_op_read.sum     request           68
    lts__t_requests_srcunit_l1_op_read.sum      request            0
    lts__t_requests_srcunit_tex_op_read.sum     request            4  # requests from 4 warps, after coalescing
    lts__t_sectors_op_read.sum                   sector          299  # total read sectors received by LTS
    lts__t_sectors_srcnode_gpc_op_read.sum       sector          272  # sectors from GPC = 68*4, so GPC requests are coalesced into 4 sectors each
    lts__t_sectors_srcunit_l1_op_read.sum        sector            0
    lts__t_sectors_srcunit_tex_op_read.sum       sector           16  # one warp reads 128 bytes = 4 sectors; 4*4 = 16 sectors needed by the instruction
    --------------------------------------- ----------- ------------
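
The annotated values for the TEX-sourced read counters follow from the launch configuration. Below is a sketch of that arithmetic, assuming fully coalesced, 32-byte-aligned accesses and one L2 read request per warp-level load; the function names are illustrative and the model is a simplification, not a description of how the hardware actually counts.

```cpp
#include <cstdint>

constexpr std::uint64_t kWarpSize = 32;     // threads per warp
constexpr std::uint64_t kSectorBytes = 32;  // bytes per L2 sector

// Assumption: each warp's coalesced load becomes exactly one L2 read request.
constexpr std::uint64_t expected_read_requests(std::uint64_t threads) {
    return (threads + kWarpSize - 1) / kWarpSize;
}

// Assumption: the threads read one contiguous, sector-aligned range.
constexpr std::uint64_t expected_read_sectors(std::uint64_t threads,
                                              std::uint64_t bytes_per_thread) {
    return (threads * bytes_per_thread + kSectorBytes - 1) / kSectorBytes;
}

// Launch from the reproduction: 1 block x 128 threads, one 4-byte float each.
static_assert(expected_read_requests(128) == 4,
              "matches lts__t_requests_srcunit_tex_op_read.sum = 4");
static_assert(expected_read_sectors(128, 4) == 16,
              "matches lts__t_sectors_srcunit_tex_op_read.sum = 16");

// GPC-sourced reads in the measured output: 272 sectors over 68 requests.
static_assert(272 / 68 == 4 && 272 % 68 == 0,
              "4 sectors per GPC read request");
```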

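The statistics show that lts__t_requests_op_read is the total number of read requests received at the L2 tag stage. These requests come from the GPC or from other sources, and the GPC requests in turn include L1, TEX, and others.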
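
If the counters nest in this way (an assumption drawn from the measurements here, not from NVIDIA documentation), the share attributable to "other" sources at each level can be obtained by subtraction. The sketch below applies that to the measured read counters; the struct and function names are illustrative.

```cpp
#include <cstdint>

// Read counters for one kernel, taken from the ncu output in this post.
struct ReadCounters {
    std::int64_t total;  // lts__t_*_op_read.sum
    std::int64_t gpc;    // lts__t_*_srcnode_gpc_op_read.sum
    std::int64_t l1;     // lts__t_*_srcunit_l1_op_read.sum
    std::int64_t tex;    // lts__t_*_srcunit_tex_op_read.sum
};

// Assumption: GPC traffic is a subset of the total.
constexpr std::int64_t from_non_gpc(ReadCounters c) {
    return c.total - c.gpc;
}

// Assumption: L1 and TEX traffic are disjoint subsets of the GPC traffic.
constexpr std::int64_t from_other_gpc_units(ReadCounters c) {
    return c.gpc - c.l1 - c.tex;
}

constexpr ReadCounters kRequests{83, 68, 0, 4};
constexpr ReadCounters kSectors{299, 272, 0, 16};

static_assert(from_non_gpc(kRequests) == 15, "15 read requests not from the GPC");
static_assert(from_other_gpc_units(kRequests) == 64, "64 GPC read requests not from L1 or TEX");
static_assert(from_non_gpc(kSectors) == 27, "27 read sectors not from the GPC");
static_assert(from_other_gpc_units(kSectors) == 256, "256 GPC read sectors not from L1 or TEX");
```

Under these assumptions, only 4 of the 68 GPC read requests (16 of 272 sectors) are attributable to the kernel's own load instruction via TEX.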


