CUDA sample volumeRender

it2025-08-11  6

CUDA sample volumeRender

如果出现错误,记得判定一下OpenGL是不是在集成显卡上生成的,如果是则把集成显卡禁用掉就行了。

1、thread、block、grid

https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming)

https://nyu-cds.github.io/python-gpu/02-cuda/

https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

https://cvw.cac.cornell.edu/GPU/thread

https://zhuanlan.zhihu.com/p/123170285

https://www.cnblogs.com/1024incn/p/4541313.html

https://nichijou.co/cuda2-warp/

将problem划分成coarse sub-problems in parallel by blocks of threads。

将每个sub-problem划分到block中更细的threads中。

kernel也就是一个操作,或者说c++编写的一个函数。所以才会有核函数的概念,一个kernel由一个grid组成。由一系列线程并行执行:

kernel内的线程完成同一操作,也就是 all threads run the same code。每个线程都会有thread ID,该ID就是用来计算存储地址及其他操作。 

grid是由一系列的线程block组成。

毫无疑问block就是由thread组成的。

基本顺序就是program由一系列的function组成,而一个function表示一个kernel grid,一个grid由一系列的block组成,一个block由一系列thread组成。

所以每个kernel grid都有自己独立的工作区。

每个kernel grid中的block完全独立,不同的block被分配到不同的 Streaming Multiprocessors (SMs),但是同一个block不可以分配到不同的SM。

单个block内的threads可以:

synchronize、share memorycooperate(communicate)

来自不同blocks的threads都是完全独立的,位移共享的资源就是global memory。

thread级任务并行性——不同的thread执行不同的任务block和grid级并行——不同的block或grid执行不同的任务data并行性——不同的threads和blocks处理memory中不同部分的数据

thread ID:每个block和thread都有自己唯一的索引,只能从运行的kernel中访问。变量blockIdx和threadIdx分别包含被调用block和thread的索引。

The thread index (threadIdx)The block index (blockIdx)The size and shape of a block (blockDim)The size and shape of a grid (gridDim)

Global memory所有thread以及host(CPU)均可访问此内存:

Global memory内存由host分配和释放用于初始化GPU将要处理的数据

Shared memory:每个线程块都有自己的shared memory:

只能由block内的threads访问比local或global momory快得多需要特殊处理才能获得最佳性能只在block的生命周期内存在

Local memory每个线程都有自己的private local memory:

只在thread的生命周期内存在通常由编译器自动处理

Constant and texture memory 这些是所有线程均可访问的只读内存空间:

Constant memory 用于缓存所有功能单元共享的值texture memory是为硬件提供的texture 操作而优化的

一般CUDA程序的三个主要步骤:

 host-to-device transfer也就是数据重host memory to device steps。  加载GPU程序并执行,将数据缓存在芯片上以提高性能。  将结果从device memory复制到host memory,也称为device-to-host transfer。

每个CUDA kernel都以__global__声明说明符开头。程序员使用内置变量为每个thread提供unique glokal ID。

CUDA架构限制了每个块的线程数(1024个线程每个块限制)。线程块的维度可以通过内置的blockDim变量在内核中访问。一个block中的所有threads都可以使用一个内部__syncthreads来同步。对于__syncthreads,线程块中的所有线程必须等待,然后才能继续执行。在<<<...>>>中指定的每个块的thread数和每个网格的block数,可以是int或dim3。这些三尖括号标记了从主机代码到设备代码的调用。它也被称为kernel launch。

Registers—这些是每个thread专用的,这意味着分配给该线程的寄存器对其他线程不可见。编译器做出有关寄存器利用率的决策。L1/Shared memory (SMEM)—每个SM都有一个快速的on-chip scratched存储器,可用作L1 cache和shared memory。CUDA block中的所有线程可以共享shared memory,并且在给定SM上运行的所有CUDA块可以共享SM提供的物理内存资源。Read-only memory—每个SM都具instruction cache,constant memory,texture和RO cache,这对kernel代码是只读的。L2 cache—L2 cache在所有SM之间共享,因此每个CUDA block中的每个线程都可以访问该内存。Global memory—这是GPU和位于GPU中的DRAM的帧缓冲区大小。

CUDA使用kernel execution configuration<<< ... >>>来告诉CUDA运行时在GPU上启动多少个threads。 CUDA将线程组织到称为“thread block”的组中。kernel可以启动组织为“grid”结构的多个thread blocks。

<<< M , T >>>//表示kernel以M个thread blocks的grid启动。每个thread block具有T个并行线程。

很容一看出来,一个kernel是一个grid,所以需要声明该grid有多少个thread blocks,并且每个thread block有多少个thread。

SP(Streaming Processor):流处理器, 是GPU最基本的处理单元,在fermi架构开始被叫做CUDA core。

SM(Streaming MultiProcessor): 一个SM由多个CUDA core组成,SM还包括特殊运算单元(SFU),共享内存(shared memory),寄存器文件(Register File)和调度器(Warp Scheduler)等。

按照SIMD模型,SM最小执行单元为warp,一个warp中有多个线程。SM执行单元SPs共享单个指令fetch/dispatch,这些线程将同一个指令应用于不同数据。因此,一个warp中的所有线程将总是具有相同的执行时间。

warp是SM的基本执行单元。一个warp包含32个并行thread,这32个thread执行于SIMT(Single-Instruction, Multiple-Thread,单指令多线程)模式。也就是说所有thread执行同一条指令,并且每个thread会使用各自的data执行该指令。

一个warp中的线程必然在同一个block中,同一个block不会再两个SM中,也就是block会调用多个warp,如果block的线程数不能整除warp线程数,则最后一个warp没有填满,没填满的warp中的thread是inactive。只要调用warp就会消耗SM资源,只不过有些warp中的线程是inactive。

一个warp中的thread执行相同的指令,有相同的执行时间,如果某一个thread阻塞了,同一个warp的其他thread都会阻塞,因此有了warp divergence。所以warp divergence只会出现在同一个warp中。

同一个warp中的thread执行同一操作,如果因为控制流语句(if)进入不同的分支才触发的warp divergence,那么只要避免控制流语句触发即可。也就是将同一分支放入同一warp。

也就引入了branch efficiency代码比较简单的时候,CUDA编译器自动优化代码。

warp的context包含三个部分:

Program counterRegisterShared memory

当一个block或得到足够的资源时,就成为active block。block中的warp就称为active warp。active warp又可以被分为下面三类:

Selected warp 被选中的warpStalled warp 没准备好要执行的称为Stalled warpEligible warp 没被选中,但是已经做好准备被执行的称为Eligible warp

SM中warp调度器每个cycle会挑选active warp送去执行,

warp适合执行需要满足下面两个条件:

32个CUDA core有空所有当前指令的参数都准备就绪

cudaError_t cudaDeviceSynchronize(void)可以用来停住CUP等待CUDA中的操作完成:

__device__ void __syncthreads(void)当该函数被调用,block中的每个thread都会等待所有其他thread执行到某个点来实现同步。

#include "cuda_runtime.h" #include "device_launch_parameters.h" #include <stdio.h> cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size); __global__ void addKernel(int *c, const int *a, const int *b) { int i = threadIdx.x;//获取一维的线程索引 c[i] = a[i] + b[i]; } int main() { const int arraySize = 5; const int a[arraySize] = { 1, 2, 3, 4, 5 }; const int b[arraySize] = { 10, 20, 30, 40, 50 }; int c[arraySize] = { 0 }; // Add vectors in parallel. cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);//并行声明返回cuda状态 if (cudaStatus != cudaSuccess) { fprintf(stderr, "addWithCuda failed!"); return 1; } printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n", c[0], c[1], c[2], c[3], c[4]); // cudaDeviceReset must be called before exiting in order for profiling and // tracing tools such as Nsight and Visual Profiler to show complete traces. cudaStatus = cudaDeviceReset(); if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaDeviceReset failed!"); return 1; } return 0; } // Helper function for using CUDA to add vectors in parallel. cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size) { int *dev_a = 0; int *dev_b = 0; int *dev_c = 0; cudaError_t cudaStatus; // Choose which GPU to run on, change this on a multi-GPU system. cudaStatus = cudaSetDevice(0);//选择cuda 0作为默认选项 if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?"); goto Error; } // Allocate GPU buffers for three vectors (two input, one output) . cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));//为c创建cuda数据 if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaMalloc failed!"); goto Error; } cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));//为a创建cuda数据 if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaMalloc failed!"); goto Error; } cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));//为b创建cuda数据 if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaMalloc failed!"); goto Error; } // Copy input vectors from host memory to GPU buffers. cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);//将a内存拷贝给显存 if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaMemcpy failed!"); goto Error; } cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);//将b内存拷贝给显存 if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaMemcpy failed!"); goto Error; } // Launch a kernel on the GPU with one thread for each element. addKernel<<<1, size>>>(dev_c, dev_a, dev_b); // Check for any errors launching the kernel cudaStatus = cudaGetLastError(); if (cudaStatus != cudaSuccess) { fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus)); goto Error; } // cudaDeviceSynchronize waits for the kernel to finish, and returns // any errors encountered during the launch. cudaStatus = cudaDeviceSynchronize();//同步 if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus); goto Error; } // Copy output vector from GPU buffer to host memory. cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);//显存拷贝到内存 if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaMemcpy failed!"); goto Error; } Error: cudaFree(dev_c); cudaFree(dev_a); cudaFree(dev_b); return cudaStatus; }

 

 

volumeRender_kernel.cu

/* * Copyright 1993-2015 NVIDIA Corporation. All rights reserved. * * Please refer to the NVIDIA end user license agreement (EULA) associated * with this source code for terms and conditions that govern your use of * this software. Any use, reproduction, disclosure, or distribution of * this software and related documentation outside the terms of the EULA * is strictly prohibited. * */ // Simple 3D volume renderer #ifndef _VOLUMERENDER_KERNEL_CU_ #define _VOLUMERENDER_KERNEL_CU_ #include <helper_cuda.h> #include <helper_math.h> typedef unsigned int uint; typedef unsigned char uchar; cudaArray *d_volumeArray = 0; cudaArray *d_transferFuncArray; typedef unsigned char VolumeType; //typedef unsigned short VolumeType; cudaTextureObject_t texObject; // For 3D texture cudaTextureObject_t transferTex; // For 1D transfer function texture typedef struct { float4 m[3]; } float3x4; __constant__ float3x4 c_invViewMatrix; // inverse view matrix struct Ray { float3 o; // origin float3 d; // direction }; // intersect ray with a box // http://www.siggraph.org/education/materials/HyperGraph/raytrace/rtinter3.htm __device__ int intersectBox(Ray r, float3 boxmin, float3 boxmax, float *tnear, float *tfar) { // compute intersection of ray with all six bbox planes float3 invR = make_float3(1.0f) / r.d; float3 tbot = invR * (boxmin - r.o); float3 ttop = invR * (boxmax - r.o); // re-order intersections to find smallest and largest on each axis float3 tmin = fminf(ttop, tbot); float3 tmax = fmaxf(ttop, tbot); // find the largest tmin and the smallest tmax float largest_tmin = fmaxf(fmaxf(tmin.x, tmin.y), fmaxf(tmin.x, tmin.z)); float smallest_tmax = fminf(fminf(tmax.x, tmax.y), fminf(tmax.x, tmax.z)); *tnear = largest_tmin; *tfar = smallest_tmax; return smallest_tmax > largest_tmin; } // transform vector by matrix (no translation) __device__ float3 mul(const float3x4 &M, const float3 &v) { float3 r; r.x = dot(v, make_float3(M.m[0])); r.y = dot(v, make_float3(M.m[1])); r.z = dot(v, make_float3(M.m[2])); return r; } // transform vector by matrix with translation __device__ float4 mul(const float3x4 &M, const float4 &v) { float4 r; r.x = dot(v, M.m[0]); r.y = dot(v, M.m[1]); r.z = dot(v, M.m[2]); r.w = 1.0f; return r; } __device__ uint rgbaFloatToInt(float4 rgba) { rgba.x = __saturatef(rgba.x); // clamp to [0.0, 1.0] rgba.y = __saturatef(rgba.y); rgba.z = __saturatef(rgba.z); rgba.w = __saturatef(rgba.w); return (uint(rgba.w*255)<<24) | (uint(rgba.z*255)<<16) | (uint(rgba.y*255)<<8) | uint(rgba.x*255); } __global__ void d_render(uint *d_output, uint imageW, uint imageH, float density, float brightness, float transferOffset, float transferScale, cudaTextureObject_t tex, cudaTextureObject_t transferTex) { const int maxSteps = 500; const float tstep = 0.01f; const float opacityThreshold = 0.95f; const float3 boxMin = make_float3(-1.0f, -1.0f, -1.0f); const float3 boxMax = make_float3(1.0f, 1.0f, 1.0f); uint x = blockIdx.x*blockDim.x + threadIdx.x; uint y = blockIdx.y*blockDim.y + threadIdx.y; if ((x >= imageW) || (y >= imageH)) return; float u = (x / (float) imageW)*2.0f-1.0f; float v = (y / (float) imageH)*2.0f-1.0f; // calculate eye ray in world space Ray eyeRay; eyeRay.o = make_float3(mul(c_invViewMatrix, make_float4(0.0f, 0.0f, 0.0f, 1.0f))); eyeRay.d = normalize(make_float3(u, v, -2.0f)); eyeRay.d = mul(c_invViewMatrix, eyeRay.d); // find intersection with box float tnear, tfar; int hit = intersectBox(eyeRay, boxMin, boxMax, &tnear, &tfar); if (!hit) return; if (tnear < 0.0f) tnear = 0.0f; // clamp to near plane // march along ray from front to back, accumulating color float4 sum = make_float4(0.0f); float t = tnear; float3 pos = eyeRay.o + eyeRay.d*tnear; float3 step = eyeRay.d*tstep; for (int i=0; i<maxSteps; i++) { // read from 3D texture // remap position to [0, 1] coordinates float sample = tex3D<float>(tex, pos.x*0.5f+0.5f, pos.y*0.5f+0.5f, pos.z*0.5f+0.5f); //sample *= 64.0f; // scale for 10-bit data // lookup in transfer function texture float4 col = tex1D<float4>(transferTex, (sample-transferOffset)*transferScale); col.w *= density; // "under" operator for back-to-front blending //sum = lerp(sum, col, col.w); // pre-multiply alpha col.x *= col.w; col.y *= col.w; col.z *= col.w; // "over" operator for front-to-back blending sum = sum + col*(1.0f - sum.w); // exit early if opaque if (sum.w > opacityThreshold) break; t += tstep; if (t > tfar) break; pos += step; } sum *= brightness; // write output color d_output[y*imageW + x] = rgbaFloatToInt(sum); } extern "C" void setTextureFilterMode(bool bLinearFilter) { if (texObject) { checkCudaErrors(cudaDestroyTextureObject(texObject));//销毁texture对象 } cudaResourceDesc texRes;//定义资源描述子 memset(&texRes,0,sizeof(cudaResourceDesc));//将textRes初始化为0 texRes.resType = cudaResourceTypeArray;//资源类型 texRes.res.array.array = d_volumeArray;//数组数据 cudaTextureDesc texDescr;//定义纹理描述子 memset(&texDescr,0,sizeof(cudaTextureDesc));//将texDescr初始化为0 texDescr.normalizedCoords = true;//指示纹理读取是否标准化 texDescr.filterMode = bLinearFilter ? cudaFilterModeLinear : cudaFilterModePoint;//纹理过滤模式 texDescr.addressMode[0] = cudaAddressModeWrap;//Wrap地址模式 texDescr.addressMode[1] = cudaAddressModeWrap;//Wrap地址模式 texDescr.addressMode[2] = cudaAddressModeWrap;//Wrap地址模式 texDescr.readMode = cudaReadModeNormalizedFloat;//读取texture作为normal浮点数 checkCudaErrors(cudaCreateTextureObject(&texObject, &texRes, &texDescr, NULL));//创建一个纹理对象 } extern "C" void initCuda(void *h_volume, cudaExtent volumeSize) { // create 3D array cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<VolumeType>();//创建CUDA通道格式描述子 checkCudaErrors(cudaMalloc3DArray(&d_volumeArray, &channelDesc, volumeSize));//创建volume显存 // copy data to 3D array cudaMemcpy3DParms copyParams = {0};//CUDA 3D显存复制参数 copyParams.srcPtr = make_cudaPitchedPtr(h_volume, volumeSize.width*sizeof(VolumeType), volumeSize.width, volumeSize.height);//编码源存储器地址 copyParams.dstArray = d_volumeArray;//目的地显存地址 copyParams.extent = volumeSize;//请求内存拷贝大小 copyParams.kind = cudaMemcpyHostToDevice;//类型的转移 checkCudaErrors(cudaMemcpy3D(&copyParams));//在3D对象之间复制数据 cudaResourceDesc texRes;//CUDA资源描述子 memset(&texRes, 0, sizeof(cudaResourceDesc));//将textRes初始化为0 texRes.resType = cudaResourceTypeArray;//资源类型 texRes.res.array.array = d_volumeArray;//数组数据 cudaTextureDesc texDescr;//定义纹理描述子 memset(&texDescr, 0, sizeof(cudaTextureDesc));//将texDescr初始化为0 texDescr.normalizedCoords = true; // access with normalized texture coordinates 使用归一化纹理坐标访问 texDescr.filterMode = cudaFilterModeLinear; // linear interpolation 线性差值 texDescr.addressMode[0] = cudaAddressModeClamp; // clamp texture coordinates 限制到边缘地址模式 texDescr.addressMode[1] = cudaAddressModeClamp; texDescr.addressMode[2] = cudaAddressModeClamp; texDescr.readMode = cudaReadModeNormalizedFloat;//读取texture作为normal浮点数 checkCudaErrors(cudaCreateTextureObject(&texObject, &texRes, &texDescr, NULL));//创建一个纹理对象 // create transfer function texture 创建传递函数纹理 float4 transferFunc[] = { { 0.0, 0.0, 0.0, 0.0, }, { 1.0, 0.0, 0.0, 1.0, }, { 1.0, 0.5, 0.0, 1.0, }, { 1.0, 1.0, 0.0, 1.0, }, { 0.0, 1.0, 0.0, 1.0, }, { 0.0, 1.0, 1.0, 1.0, }, { 0.0, 0.0, 1.0, 1.0, }, { 1.0, 0.0, 1.0, 1.0, }, { 0.0, 0.0, 0.0, 0.0, }, }; cudaChannelFormatDesc channelDesc2 = cudaCreateChannelDesc<float4>();//创建CUDA通道格式描述子 cudaArray *d_transferFuncArray;//CUDA 数组 checkCudaErrors(cudaMallocArray(&d_transferFuncArray, &channelDesc2, sizeof(transferFunc)/sizeof(float4), 1));//创建d_transferFuncArray显存 checkCudaErrors(cudaMemcpyToArray(d_transferFuncArray, 0, 0, transferFunc, sizeof(transferFunc), cudaMemcpyHostToDevice));//Copies data between host and device memset(&texRes,0,sizeof(cudaResourceDesc));//将texRes初始化为0 texRes.resType = cudaResourceTypeArray;//资源类型 texRes.res.array.array = d_transferFuncArray;//CUDA 数组 memset(&texDescr,0,sizeof(cudaTextureDesc));//将texDescr初始化为0 texDescr.normalizedCoords = true; // access with normalized texture coordinates 使用归一化纹理坐标访问 texDescr.filterMode = cudaFilterModeLinear; // linear interpolation 线性差值 texDescr.addressMode[0] = cudaAddressModeClamp; // wrap texture coordinates texDescr.readMode = cudaReadModeElementType;//读取指定元素类型的纹理 checkCudaErrors(cudaCreateTextureObject(&transferTex, &texRes, &texDescr, NULL));//创建一个纹理对象 } extern "C" void freeCudaBuffers() { checkCudaErrors(cudaDestroyTextureObject(texObject));//销毁一个纹理对象 checkCudaErrors(cudaDestroyTextureObject(transferTex));// 销毁一个纹理对象 checkCudaErrors(cudaFreeArray(d_volumeArray));//Frees an array on the device checkCudaErrors(cudaFreeArray(d_transferFuncArray));//Frees an array on the device } extern "C" void render_kernel(dim3 gridSize, dim3 blockSize, uint *d_output, uint imageW, uint imageH, float density, float brightness, float transferOffset, float transferScale) { d_render<<<gridSize, blockSize>>>(d_output, imageW, imageH, density, brightness, transferOffset, transferScale, texObject, transferTex); } extern "C" void copyInvViewMatrix(float *invViewMatrix, size_t sizeofMatrix) { checkCudaErrors(cudaMemcpyToSymbol(c_invViewMatrix, invViewMatrix, sizeofMatrix));//Copies data to the given symbol on the device } #endif // #ifndef _VOLUMERENDER_KERNEL_CU_

volumeRender.cpp

/* * Copyright 1993-2015 NVIDIA Corporation. All rights reserved. * * Please refer to the NVIDIA end user license agreement (EULA) associated * with this source code for terms and conditions that govern your use of * this software. Any use, reproduction, disclosure, or distribution of * this software and related documentation outside the terms of the EULA * is strictly prohibited. * */ /* Volume rendering sample This sample loads a 3D volume from disk and displays it using ray marching and 3D textures. Note - this is intended to be an example of using 3D textures in CUDA, not an optimized volume renderer. Changes sgg 22/3/2010 - updated to use texture for display instead of glDrawPixels. - changed to render from front-to-back rather than back-to-front. */ // OpenGL Graphics includes #include <helper_gl.h> #if defined (__APPLE__) || defined(MACOSX) #pragma clang diagnostic ignored "-Wdeprecated-declarations" #include <GLUT/glut.h> #ifndef glutCloseFunc #define glutCloseFunc glutWMCloseFunc #endif #else #include <GL/freeglut.h> #endif // CUDA Runtime, Interop, and includes #include <cuda_runtime.h> #include <cuda_gl_interop.h> #include <cuda_profiler_api.h> #include <vector_types.h> #include <vector_functions.h> #include <driver_functions.h> // CUDA utilities #include <helper_cuda.h> // Helper functions #include <helper_cuda.h> #include <helper_functions.h> #include <helper_timer.h> #include <time.h> typedef unsigned int uint; typedef unsigned char uchar; #define MAX_EPSILON_ERROR 5.00f #define THRESHOLD 0.30f // Define the files that are to be save and the reference images for validation const char *sOriginal[] = { "volume.ppm", NULL }; const char *sReference[] = { "ref_volume.ppm", NULL }; const char *sSDKsample = "CUDA 3D Volume Render";// //const char *volumeFilename = "Bucky.raw"; //cudaExtent volumeSize = make_cudaExtent(32, 32, 32); //typedef unsigned char VolumeType; const char* volumeFilename = "img_itk.raw"; cudaExtent volumeSize = make_cudaExtent(512, 512, 143);//创建cuda显存 typedef unsigned char VolumeType; //char *volumeFilename = "mrt16_angio.raw"; //cudaExtent volumeSize = make_cudaExtent(416, 512, 112); //typedef unsigned short VolumeType; uint width = 512, height = 512;//生成的OpenGL大小 //uint width = 1536, height = 1536; dim3 blockSize(16, 16);//block大小,二维,16*16=256个线程 dim3 gridSize;//grid大小,即block个数 float3 viewRotation; float3 viewTranslation = make_float3(0.0, 0.0, -4.0f); float invViewMatrix[12]; float density = 0.05f; float brightness = 1.0f; float transferOffset = 0.0f; float transferScale = 1.0f; bool linearFiltering = true; GLuint pbo = 0; // OpenGL pixel buffer object GLuint tex = 0; // OpenGL texture object struct cudaGraphicsResource *cuda_pbo_resource; // CUDA Graphics Resource (to transfer PBO) CUDA图形资源(用于传输PBO pixel buffer object) StopWatchInterface *timer = 0; //定义 StopWatch Interface接口 // Auto-Verification Code 自动验证码 const int frameCheckNumber = 2; int fpsCount = 0; // FPS count for averaging FPS计算平均值 int fpsLimit = 1; // FPS limit for sampling 采样FPS极限 int g_Index = 0; unsigned int frameCount = 0; int *pArgc; char **pArgv; #ifndef MAX #define MAX(a,b) ((a > b) ? a : b) #endif extern "C" void setTextureFilterMode(bool bLinearFilter);//创建一个3D texture extern "C" void initCuda(void *h_volume, cudaExtent volumeSize);//create 3D array extern "C" void freeCudaBuffers();//释放cuda对象 extern "C" void render_kernel(dim3 gridSize, dim3 blockSize, uint *d_output, uint imageW, uint imageH, float density, float brightness, float transferOffset, float transferScale);//render kernel extern "C" void copyInvViewMatrix(float *invViewMatrix, size_t sizeofMatrix);//逆矩阵 void initPixelBuffer();//初始化openGL void computeFPS() { frameCount++; fpsCount++; if (fpsCount == fpsLimit) { char fps[256]; float ifps = 1.f / (sdkGetAverageTimerValue(&timer) / 1000.f);// 1s/返回计时器执行的平均时间 sprintf(fps, "Volume Render: %3.1f fps", ifps); glutSetWindowTitle(fps);//OpenGL标题 fpsCount = 0; fpsLimit = (int)MAX(1.f, ifps); sdkResetTimer(&timer);//重置计时器的计数器。 } } // render image using CUDA 使用CUDA渲染图像 void render() { copyInvViewMatrix(invViewMatrix, sizeof(float4)*3);//拷贝逆矩阵 // map PBO to get CUDA device pointer uint *d_output;//opengl指针 // map PBO to get CUDA device pointer checkCudaErrors(cudaGraphicsMapResources(1, &cuda_pbo_resource, 0));//Map graphics resources for access by CUDA size_t num_bytes; checkCudaErrors(cudaGraphicsResourceGetMappedPointer((void **)&d_output, &num_bytes, cuda_pbo_resource));//Get an device pointer through which to access a mapped graphics resource. //printf("CUDA mapped PBO: May access %ld bytes\n", num_bytes); // clear image checkCudaErrors(cudaMemset(d_output, 0, width*height*4));//清除image // call CUDA kernel, writing results to PBO render_kernel(gridSize, blockSize, d_output, width, height, density, brightness, transferOffset, transferScale);//开始渲染 getLastCudaError("kernel failed"); checkCudaErrors(cudaGraphicsUnmapResources(1, &cuda_pbo_resource, 0)); } // display results using OpenGL (called by GLUT) void display() { sdkStartTimer(&timer); // use OpenGL to build view matrix GLfloat modelView[16]; glMatrixMode(GL_MODELVIEW); glPushMatrix(); glLoadIdentity(); glRotatef(-viewRotation.x, 1.0, 0.0, 0.0); glRotatef(-viewRotation.y, 0.0, 1.0, 0.0); glTranslatef(-viewTranslation.x, -viewTranslation.y, -viewTranslation.z); glGetFloatv(GL_MODELVIEW_MATRIX, modelView); glPopMatrix(); invViewMatrix[0] = modelView[0]; invViewMatrix[1] = modelView[4]; invViewMatrix[2] = modelView[8]; invViewMatrix[3] = modelView[12]; invViewMatrix[4] = modelView[1]; invViewMatrix[5] = modelView[5]; invViewMatrix[6] = modelView[9]; invViewMatrix[7] = modelView[13]; invViewMatrix[8] = modelView[2]; invViewMatrix[9] = modelView[6]; invViewMatrix[10] = modelView[10]; invViewMatrix[11] = modelView[14]; clock_t start, end; start = clock(); render(); end = clock(); printf("render time: %f \n", double(end - start) / CLOCKS_PER_SEC); // display results glClear(GL_COLOR_BUFFER_BIT); // draw image from PBO glDisable(GL_DEPTH_TEST); glPixelStorei(GL_UNPACK_ALIGNMENT, 1); #if 0 // draw using glDrawPixels (slower) glRasterPos2i(0, 0); glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo); glDrawPixels(width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0); glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0); #else // draw using texture // copy from pbo to texture glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo); glBindTexture(GL_TEXTURE_2D, tex); glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0); glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0); // draw textured quad glEnable(GL_TEXTURE_2D); glBegin(GL_QUADS); glTexCoord2f(0, 0); glVertex2f(0, 0); glTexCoord2f(1, 0); glVertex2f(1, 0); glTexCoord2f(1, 1); glVertex2f(1, 1); glTexCoord2f(0, 1); glVertex2f(0, 1); glEnd(); glDisable(GL_TEXTURE_2D); glBindTexture(GL_TEXTURE_2D, 0); #endif glutSwapBuffers(); glutReportErrors(); sdkStopTimer(&timer); computeFPS(); } void idle() { glutPostRedisplay(); } void keyboard(unsigned char key, int x, int y) { switch (key) { case 27: #if defined (__APPLE__) || defined(MACOSX) exit(EXIT_SUCCESS); #else glutDestroyWindow(glutGetWindow()); return; #endif break; case 'f': linearFiltering = !linearFiltering; setTextureFilterMode(linearFiltering); break; case '+': density += 0.01f; break; case '-': density -= 0.01f; break; case ']': brightness += 0.1f; break; case '[': brightness -= 0.1f; break; case ';': transferOffset += 0.01f; break; case '\'': transferOffset -= 0.01f; break; case '.': transferScale += 0.01f; break; case ',': transferScale -= 0.01f; break; default: break; } printf("density = %.2f, brightness = %.2f, transferOffset = %.2f, transferScale = %.2f\n", density, brightness, transferOffset, transferScale); glutPostRedisplay(); } int ox, oy; int buttonState = 0; void mouse(int button, int state, int x, int y) { if (state == GLUT_DOWN) { buttonState |= 1<<button; } else if (state == GLUT_UP) { buttonState = 0; } ox = x; oy = y; glutPostRedisplay(); } void motion(int x, int y) { float dx, dy; dx = (float)(x - ox); dy = (float)(y - oy); if (buttonState == 4) { // right = zoom viewTranslation.z += dy / 100.0f; } else if (buttonState == 2) { // middle = translate viewTranslation.x += dx / 100.0f; viewTranslation.y -= dy / 100.0f; } else if (buttonState == 1) { // left = rotate viewRotation.x += dy / 5.0f; viewRotation.y += dx / 5.0f; } ox = x; oy = y; glutPostRedisplay(); } int iDivUp(int a, int b) { return (a % b != 0) ? (a / b + 1) : (a / b); } void reshape(int w, int h) { width = w; height = h; initPixelBuffer(); // calculate new grid size gridSize = dim3(iDivUp(width, blockSize.x), iDivUp(height, blockSize.y)); glViewport(0, 0, w, h); glMatrixMode(GL_MODELVIEW); glLoadIdentity(); glMatrixMode(GL_PROJECTION); glLoadIdentity(); glOrtho(0.0, 1.0, 0.0, 1.0, 0.0, 1.0); } void cleanup() { sdkDeleteTimer(&timer); freeCudaBuffers(); if (pbo) { cudaGraphicsUnregisterResource(cuda_pbo_resource); glDeleteBuffers(1, &pbo); glDeleteTextures(1, &tex); } // Calling cudaProfilerStop causes all profile data to be // flushed before the application exits checkCudaErrors(cudaProfilerStop()); } void initGL(int *argc, char **argv) { // initialize GLUT callback functions glutInit(argc, argv); glutInitDisplayMode(GLUT_RGB | GLUT_DOUBLE); glutInitWindowSize(width, height); glutCreateWindow("CUDA volume rendering"); if (!isGLVersionSupported(2,0) || !areGLExtensionsSupported("GL_ARB_pixel_buffer_object")) { printf("Required OpenGL extensions are missing."); exit(EXIT_SUCCESS); } } void initPixelBuffer() { if (pbo) { // unregister this buffer object from CUDA C ;从CUDA C中注销这个缓冲区对象 checkCudaErrors(cudaGraphicsUnregisterResource(cuda_pbo_resource));//从CUDA C中注销CUDA图形资源(用于传输PBO pixel buffer object) // delete old buffer glDeleteBuffers(1, &pbo);// delete old bufferobject glDeleteTextures(1, &tex);// delete old texture object } // create pixel buffer object for display;为显示创建像素缓冲对象 glGenBuffers(1, &pbo); glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo); glBufferData(GL_PIXEL_UNPACK_BUFFER_ARB, width*height*sizeof(GLubyte)*4, 0, GL_STREAM_DRAW_ARB); glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0); // register this buffer object with CUDA checkCudaErrors(cudaGraphicsGLRegisterBuffer(&cuda_pbo_resource, pbo, cudaGraphicsMapFlagsWriteDiscard));//用CUDA注册这个缓冲区对象,CUDA只会写入该资源,不会读取该资源,从CUDA到openGL // create texture for display glGenTextures(1, &tex); glBindTexture(GL_TEXTURE_2D, tex); glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL); glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST); glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST); glBindTexture(GL_TEXTURE_2D, 0); } // Load raw data from disk void *loadRawFile(char *filename, size_t size) { FILE *fp = fopen(filename, "rb"); if (!fp) { fprintf(stderr, "Error opening file '%s'\n", filename); return 0; } void *data = malloc(size); size_t read = fread(data, 1, size, fp); fclose(fp); #if defined(_MSC_VER_) printf("Read '%s', %Iu bytes\n", filename, read); #else printf("Read '%s', %zu bytes\n", filename, read); #endif return data; } void runSingleTest(const char *ref_file, const char *exec_path) { bool bTestResult = true; uint *d_output; checkCudaErrors(cudaMalloc((void **)&d_output, width*height*sizeof(uint)));//分配显存 checkCudaErrors(cudaMemset(d_output, 0, width*height*sizeof(uint)));//初始化 float modelView[16] = { 1.0f, 0.0f, 0.0f, 0.0f, 0.0f, 1.0f, 0.0f, 0.0f, 0.0f, 0.0f, 1.0f, 0.0f, 0.0f, 0.0f, 4.0f, 1.0f };//变换矩阵 invViewMatrix[0] = modelView[0]; invViewMatrix[1] = modelView[4]; invViewMatrix[2] = modelView[8]; invViewMatrix[3] = modelView[12]; invViewMatrix[4] = modelView[1]; invViewMatrix[5] = modelView[5]; invViewMatrix[6] = modelView[9]; invViewMatrix[7] = modelView[13]; invViewMatrix[8] = modelView[2]; invViewMatrix[9] = modelView[6]; invViewMatrix[10] = modelView[10]; invViewMatrix[11] = modelView[14]; // call CUDA kernel, writing results to PBO copyInvViewMatrix(invViewMatrix, sizeof(float4)*3);// // Start timer 0 and process n loops on the GPU int nIter = 10; for (int i = -1; i < nIter; i++) { if (i == 0) { cudaDeviceSynchronize();//等待GPU同步完成 sdkStartTimer(&timer);//开启计时器 } render_kernel(gridSize, blockSize, d_output, width, height, density, brightness, transferOffset, transferScale);//渲染函数 } cudaDeviceSynchronize();//等待GPU同步完成 sdkStopTimer(&timer);//结束计时器 // Get elapsed time and throughput, then log to sample and master logs 获取运行时间和吞吐量,然后记录到样本日志和主日志中 double dAvgTime = sdkGetTimerValue(&timer)/(nIter * 1000.0); printf("volumeRender, Throughput = %.4f MTexels/s, Time = %.5f s, Size = %u Texels, NumDevsUsed = %u, Workgroup = %u\n", (1.0e-6 * width * height)/dAvgTime, dAvgTime, (width * height), 1, blockSize.x * blockSize.y); getLastCudaError("Error: render_kernel() execution FAILED"); checkCudaErrors(cudaDeviceSynchronize()); unsigned char *h_output = (unsigned char *)malloc(width*height*4); checkCudaErrors(cudaMemcpy(h_output, d_output, width*height*4, cudaMemcpyDeviceToHost)); sdkSavePPM4ub("volume.ppm", h_output, width, height); bTestResult = sdkComparePPM("volume.ppm", sdkFindFilePath(ref_file, exec_path), MAX_EPSILON_ERROR, THRESHOLD, true); cudaFree(d_output); free(h_output); cleanup(); exit(bTestResult ? EXIT_SUCCESS : EXIT_FAILURE); } // Program main int main(int argc, char **argv) { pArgc = &argc; pArgv = argv; char *ref_file = NULL; #if defined(__linux__) setenv ("DISPLAY", ":0", 0); #endif //start logs printf("%s Starting...\n\n", sSDKsample); if (checkCmdLineFlag(argc, (const char **)argv, "file")) { getCmdLineArgumentString(argc, (const char **)argv, "file", &ref_file); fpsLimit = frameCheckNumber; } if (ref_file) { findCudaDevice(argc, (const char **)argv);// Initialization code to find the best CUDA Device } else { // First initialize OpenGL context, so we can properly set the GL for CUDA. // This is necessary in order to achieve optimal performance with OpenGL/CUDA interop. initGL(&argc, argv);//首先初始化OpenGL上下文 findCudaDevice(argc, (const char **)argv);//初始化CUDA } // parse arguments char *filename; if (getCmdLineArgumentString(argc, (const char **) argv, "volume", &filename)) { volumeFilename = filename; } int n; if (checkCmdLineFlag(argc, (const char **) argv, "size")) { n = getCmdLineArgumentInt(argc, (const char **) argv, "size"); volumeSize.width = volumeSize.height = volumeSize.depth = n; } if (checkCmdLineFlag(argc, (const char **) argv, "xsize")) { n = getCmdLineArgumentInt(argc, (const char **) argv, "xsize"); volumeSize.width = n; } if (checkCmdLineFlag(argc, (const char **) argv, "ysize")) { n = getCmdLineArgumentInt(argc, (const char **) argv, "ysize"); volumeSize.height = n; } if (checkCmdLineFlag(argc, (const char **) argv, "zsize")) { n= getCmdLineArgumentInt(argc, (const char **) argv, "zsize"); volumeSize.depth = n; } // load volume data char *path = sdkFindFilePath(volumeFilename, argv[0]);//查找文件 if (path == 0) { printf("Error finding file '%s'\n", volumeFilename); exit(EXIT_FAILURE); } size_t size = volumeSize.width*volumeSize.height*volumeSize.depth*sizeof(VolumeType);//获取文件大小 void *h_volume = loadRawFile(path, size);//读取raw数据 initCuda(h_volume, volumeSize);//初始化显存 free(h_volume);//释放内存数据 sdkCreateTimer(&timer);//创建计时器 printf("Press '+' and '-' to change density (0.01 increments)\n" " ']' and '[' to change brightness\n" " ';' and ''' to modify transfer function offset\n" " '.' and ',' to modify transfer function scale\n\n"); // calculate new grid size gridSize = dim3(iDivUp(width, blockSize.x), iDivUp(height, blockSize.y));//根据blocksize和图像的大小计算grid大小 if (ref_file) { runSingleTest(ref_file, argv[0]); } else { // This is the normal rendering path for VolumeRender glutDisplayFunc(display); glutKeyboardFunc(keyboard); glutMouseFunc(mouse); glutMotionFunc(motion); glutReshapeFunc(reshape); glutIdleFunc(idle); initPixelBuffer(); #if defined (__APPLE__) || defined(MACOSX) atexit(cleanup); #else glutCloseFunc(cleanup); #endif glutMainLoop(); } }

 

最新回复(0)