Testing the relationship among srcnode_gpc, srcunit_l1, and srcunit_tex in NCU
- 1. Reference links
- 2. Notes
- 3. Reproduction steps
1. Reference links
- Different betweent in lts__t_sectors_srcunit_tex_op_read.sum and lts__t_bytes.sum
2. Notes
lts__t_sectors_srcunit_tex_op_read.sum: total number of 32-byte sectors requested by src unit tex
(this is all l1tex including local, global, surface, and texture) and operation type is read from any aperture (device, sysmem, peer).
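Because every sector is 32 bytes, a sector count converts directly to bytes and back. The following is a minimal sketch of that conversion; the helper names `sectors_to_bytes` and `bytes_to_sectors` are made up for illustration and are not part of any NVIDIA API.

```cpp
#include <cstdint>

// Sector size stated in the note above: 32 bytes.
constexpr std::uint64_t kSectorBytes = 32;

// Hypothetical helper: number of bytes covered by `sectors` L2 sectors.
constexpr std::uint64_t sectors_to_bytes(std::uint64_t sectors) {
    return sectors * kSectorBytes;
}

// Hypothetical helper: sectors needed to cover `bytes` contiguous,
// sector-aligned bytes (rounded up).
constexpr std::uint64_t bytes_to_sectors(std::uint64_t bytes) {
    return (bytes + kSectorBytes - 1) / kSectorBytes;
}

// Compile-time checks of the arithmetic.
static_assert(sectors_to_bytes(16) == 512, "16 sectors cover 512 bytes");
static_assert(bytes_to_sectors(128) == 4, "128 bytes span 4 sectors");
```

For example, a warp of 32 threads that each read one 4-byte float from consecutive addresses touches 128 bytes, which is `bytes_to_sectors(128)`, i.e. 4 sectors, provided the access is sector-aligned.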
3. Reproduction steps
tee srcnode_srcunit_test.cu<<-'EOF'
#include <iostream>
#include <cuda_runtime.h>
#include <vector>
#include <stdio.h>
#include <assert.h>
#include <cstdio>
#include <cuda.h>

// Abort with file/line information if a CUDA runtime call fails.
#define CHECK_CUDA(call) \
    do { \
        cudaError_t err = call; \
        if (err != cudaSuccess) { \
            std::cerr << "CUDA error at " << __FILE__ << ":" << __LINE__; \
            std::cerr << " code=" << err << " (" << cudaGetErrorString(err) << ")" << std::endl; \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

// Each thread copies one float from input to output.
__global__ void kernel(float *input, float *output)
{
    unsigned int tid = threadIdx.x + blockIdx.x * blockDim.x;
    output[tid] = input[tid];
}

int main(int argc, char *argv[])
{
    int deviceid = 0;
    cudaSetDevice(deviceid);
    int block_count = 1;
    int block_size = 32 * 4; // 4 warps issue requests
    int thread_size = block_count * block_size;
    float *input;
    CHECK_CUDA(cudaMalloc(&input, thread_size * 4));
    float *output;
    CHECK_CUDA(cudaMalloc(&output, thread_size * 4));
    kernel<<<block_count, block_size>>>(input, output);
    CHECK_CUDA(cudaDeviceSynchronize());
}
EOF
/usr/local/cuda/bin/nvcc -std=c++17 -arch=sm_86 -lineinfo -o srcnode_srcunit_test srcnode_srcunit_test.cu -I /usr/local/cuda/include -L /usr/local/cuda/lib64 -lcuda

/usr/local/NVIDIA-Nsight-Compute/ncu --query-metrics | grep "aperture_device " | awk '{ALL=$1".sum,"ALL}END{print ALL}' > metrics.cfg
/usr/local/NVIDIA-Nsight-Compute/ncu --metrics `cat metrics.cfg` ./srcnode_srcunit_test

/usr/local/NVIDIA-Nsight-Compute/ncu --query-metrics | grep "lts.*op_read " | grep -v "aperture" | awk '{print $1}' | egrep "op_read$" |awk '{ALL=$1".sum,"ALL}END{print ALL}' > metrics.cfg
/usr/local/NVIDIA-Nsight-Compute/ncu --metrics `cat metrics.cfg` ./srcnode_srcunit_test
- Metric descriptions
lts__t_requests_aperture_device              Counter  request  # of LTS requests accessing device memory (vidmem)
lts__t_requests_srcnode_gpc_aperture_device  Counter  request  # of LTS requests from node GPC accessing device memory (vidmem)
lts__t_requests_srcunit_l1_aperture_device   Counter  request  # of LTS requests from unit L1 accessing device memory (vidmem)
lts__t_requests_srcunit_tex_aperture_device  Counter  request  # of LTS requests from unit TEX accessing device memory (vidmem)

lts__t_sectors_aperture_device               Counter  sector   # of LTS sectors accessing device memory (vidmem)
lts__t_sectors_srcnode_gpc_aperture_device   Counter  sector   # of LTS sectors from node GPC accessing device memory (vidmem)
lts__t_sectors_srcunit_l1_aperture_device    Counter  sector   # of LTS sectors from unit L1 accessing device memory (vidmem)
lts__t_sectors_srcunit_tex_aperture_device   Counter  sector   # of LTS sectors from unit TEX accessing device memory (vidmem)

lts__t_requests_op_read                      Counter  request  # of LTS requests for reads
lts__t_requests_srcnode_gpc_op_read          Counter  request  # of LTS requests from node GPC for reads
lts__t_requests_srcunit_l1_op_read           Counter  request  # of LTS requests from unit L1 for reads
lts__t_requests_srcunit_tex_op_read          Counter  request  # of LTS requests from unit TEX for reads

lts__t_sectors_op_read                       Counter  sector   # of LTS sectors for reads
lts__t_sectors_srcnode_gpc_op_read           Counter  sector   # of LTS sectors from node GPC for reads
lts__t_sectors_srcunit_l1_op_read            Counter  sector   # of LTS sectors from unit L1 for reads
lts__t_sectors_srcunit_tex_op_read           Counter  sector   # of LTS sectors from unit TEX for reads
- Output
kernel(float *, float *) (1, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.6
  Section: Command line profiler metrics
  ----------------------------------------------- ----------- ------------
  Metric Name                                     Metric Unit Metric Value
  ----------------------------------------------- ----------- ------------
  lts__t_requests_aperture_device.sum                 request           84
  lts__t_requests_srcnode_gpc_aperture_device.sum     request           74
  lts__t_requests_srcunit_l1_aperture_device.sum      request            0
  lts__t_requests_srcunit_tex_aperture_device.sum     request            8  # the instructions issued 4 requests
  lts__t_sectors_aperture_device.sum                   sector          319
  lts__t_sectors_srcnode_gpc_aperture_device.sum       sector          290
  lts__t_sectors_srcunit_l1_aperture_device.sum        sector            0
  lts__t_sectors_srcunit_tex_aperture_device.sum       sector           32  # the instructions requested 16 sectors, but L2 requested 32 sectors from DRAM
  ----------------------------------------------- ----------- ------------

kernel(float *, float *) (1, 1, 1)x(128, 1, 1), Context 1, Stream 7, Device 0, CC 8.6
  Section: Command line profiler metrics
  --------------------------------------- ----------- ------------
  Metric Name                             Metric Unit Metric Value
  --------------------------------------- ----------- ------------
  lts__t_requests_op_read.sum                 request           83
  lts__t_requests_srcnode_gpc_op_read.sum     request           68
  lts__t_requests_srcunit_l1_op_read.sum      request            0
  lts__t_requests_srcunit_tex_op_read.sum     request            4  # requests from 4 warps, after coalescing
  lts__t_sectors_op_read.sum                   sector          299  # total read sectors received by LTS
  lts__t_sectors_srcnode_gpc_op_read.sum       sector          272  # sectors from the GPC = 68*4, so GPC requests are coalesced into 4 sectors each
  lts__t_sectors_srcunit_l1_op_read.sum        sector            0
  lts__t_sectors_srcunit_tex_op_read.sum       sector           16  # one warp reads 128 bytes = 4 sectors; 4*4 = 16 sectors needed by the instructions
  --------------------------------------- ----------- ------------
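The per-source ratios quoted in the annotations can be recomputed from the read counters of the second run. The sketch below hard-codes the values measured in this particular run; the struct `LtsReadCounters` and the helper `sectors_per_request` are illustrative names, not Nsight Compute APIs.

```cpp
#include <cstdint>

// Hypothetical container for one (requests, sectors) pair of read counters.
struct LtsReadCounters {
    std::uint64_t requests;
    std::uint64_t sectors;
};

// Values copied from the second ncu run above.
constexpr LtsReadCounters kTotal{83, 299};  // lts__t_{requests,sectors}_op_read.sum
constexpr LtsReadCounters kGpc{68, 272};    // ..._srcnode_gpc_op_read.sum
constexpr LtsReadCounters kTex{4, 16};      // ..._srcunit_tex_op_read.sum

// Average number of 32-byte sectors per request; 0 when there are no requests.
constexpr double sectors_per_request(LtsReadCounters c) {
    return c.requests == 0
               ? 0.0
               : static_cast<double>(c.sectors) / static_cast<double>(c.requests);
}

// Expected TEX read traffic of the test kernel:
// 4 warps, each reading 32 floats (128 bytes), 32 bytes per sector.
constexpr std::uint64_t kWarps = 4;
constexpr std::uint64_t kBytesPerWarp = 32 * sizeof(float);
constexpr std::uint64_t kExpectedTexSectors = kWarps * kBytesPerWarp / 32;

static_assert(kExpectedTexSectors == kTex.sectors, "4 warps x 4 sectors = 16 sectors");
static_assert(sectors_per_request(kGpc) == 4.0, "272 / 68 = 4 sectors per GPC request");
static_assert(sectors_per_request(kTex) == 4.0, "16 / 4 = 4 sectors per TEX request");
```

Both the GPC and TEX counters average exactly 4 sectors per request, while the total (299 sectors over 83 requests) averages about 3.6, so the requests that do not come from the GPC carry fewer sectors on average.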
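Judging from these counts, lts__t_requests_op_read is the total number of read requests received at the L2 tag stage. These requests come either from the GPC or from other sources, and the GPC requests in turn comprise requests from L1, TEX, and other units.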
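Taking that reading at face value (the L1 and TEX counts are disjoint subsets of the GPC count, which is itself a subset of the total), the "other" shares follow by subtraction. This is only a sketch under that assumption; `other_share` is an illustrative helper name.

```cpp
#include <cstdint>

// Read-request counts from the second ncu run above.
constexpr std::uint64_t kTotalRequests = 83;  // lts__t_requests_op_read.sum
constexpr std::uint64_t kGpcRequests   = 68;  // lts__t_requests_srcnode_gpc_op_read.sum
constexpr std::uint64_t kL1Requests    = 0;   // lts__t_requests_srcunit_l1_op_read.sum
constexpr std::uint64_t kTexRequests   = 4;   // lts__t_requests_srcunit_tex_op_read.sum

// Hypothetical helper: what remains of `whole` after removing the known parts.
// Assumes the parts are disjoint subsets of the whole; clamps to 0 if they are not.
constexpr std::uint64_t other_share(std::uint64_t whole, std::uint64_t parts) {
    return parts > whole ? 0 : whole - parts;
}

// Requests that reach L2 from outside the GPC.
constexpr std::uint64_t kNonGpcRequests = other_share(kTotalRequests, kGpcRequests);
// GPC requests attributed to neither L1 nor TEX.
constexpr std::uint64_t kGpcOtherRequests =
    other_share(kGpcRequests, kL1Requests + kTexRequests);

static_assert(kNonGpcRequests == 15, "83 - 68 = 15 requests from outside the GPC");
static_assert(kGpcOtherRequests == 64, "68 - 0 - 4 = 64 GPC requests not from L1 or TEX");
```

Under this assumption, only 4 of the 68 GPC read requests in this run are attributed to TEX, which matches the 4 warp-level loads issued by the test kernel.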