CUDA sample volumeRender

it2025-08-11 20

如果出现错误，记得判定一下OpenGL是不是在集成显卡上生成的，如果是则把集成显卡禁用掉就行了。

1、thread、block、grid

https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming)

https://nyu-cds.github.io/python-gpu/02-cuda/

https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

https://cvw.cac.cornell.edu/GPU/thread

https://zhuanlan.zhihu.com/p/123170285

https://www.cnblogs.com/1024incn/p/4541313.html

https://nichijou.co/cuda2-warp/

将problem划分成coarse sub-problems in parallel by blocks of threads。

将每个sub-problem划分到block中更细的threads中。

kernel也就是一个操作，或者说c++编写的一个函数。所以才会有核函数的概念，一个kernel由一个grid组成。由一系列线程并行执行：

kernel内的线程完成同一操作，也就是 all threads run the same code。每个线程都会有thread ID，该ID就是用来计算存储地址及其他操作。

grid是由一系列的线程block组成。

毫无疑问block就是由thread组成的。

基本顺序就是program由一系列的function组成，而一个function表示一个kernel grid，一个grid由一系列的block组成，一个block由一系列thread组成。

所以每个kernel grid都有自己独立的工作区。

每个kernel grid中的block完全独立，不同的block被分配到不同的 Streaming Multiprocessors (SMs),但是同一个block不可以分配到不同的SM。

单个block内的threads可以：

synchronize、share memorycooperate(communicate)

来自不同blocks的threads都是完全独立的，位移共享的资源就是global memory。

thread级任务并行性——不同的thread执行不同的任务block和grid级并行——不同的block或grid执行不同的任务data并行性——不同的threads和blocks处理memory中不同部分的数据

thread ID：每个block和thread都有自己唯一的索引，只能从运行的kernel中访问。变量blockIdx和threadIdx分别包含被调用block和thread的索引。

The thread index (threadIdx)The block index (blockIdx)The size and shape of a block (blockDim)The size and shape of a grid (gridDim)

Global memory所有thread以及host（CPU）均可访问此内存:

Global memory内存由host分配和释放用于初始化GPU将要处理的数据

Shared memory:每个线程块都有自己的shared memory:

只能由block内的threads访问比local或global momory快得多需要特殊处理才能获得最佳性能只在block的生命周期内存在

Local memory每个线程都有自己的private local memory：

只在thread的生命周期内存在通常由编译器自动处理

Constant and texture memory 这些是所有线程均可访问的只读内存空间:

Constant memory 用于缓存所有功能单元共享的值texture memory是为硬件提供的texture 操作而优化的

一般CUDA程序的三个主要步骤：

host-to-device transfer也就是数据重host memory to device steps。加载GPU程序并执行，将数据缓存在芯片上以提高性能。将结果从device memory复制到host memory，也称为device-to-host transfer。

每个CUDA kernel都以__global__声明说明符开头。程序员使用内置变量为每个thread提供unique glokal ID。

CUDA架构限制了每个块的线程数(1024个线程每个块限制)。线程块的维度可以通过内置的blockDim变量在内核中访问。一个block中的所有threads都可以使用一个内部__syncthreads来同步。对于__syncthreads，线程块中的所有线程必须等待，然后才能继续执行。在<<<...>>>中指定的每个块的thread数和每个网格的block数，可以是int或dim3。这些三尖括号标记了从主机代码到设备代码的调用。它也被称为kernel launch。

Registers—这些是每个thread专用的，这意味着分配给该线程的寄存器对其他线程不可见。编译器做出有关寄存器利用率的决策。L1/Shared memory (SMEM)—每个SM都有一个快速的on-chip scratched存储器，可用作L1 cache和shared memory。CUDA block中的所有线程可以共享shared memory，并且在给定SM上运行的所有CUDA块可以共享SM提供的物理内存资源。Read-only memory—每个SM都具instruction cache，constant memory，texture和RO cache，这对kernel代码是只读的。L2 cache—L2 cache在所有SM之间共享，因此每个CUDA block中的每个线程都可以访问该内存。Global memory—这是GPU和位于GPU中的DRAM的帧缓冲区大小。

CUDA使用kernel execution configuration<<< ... >>>来告诉CUDA运行时在GPU上启动多少个threads。 CUDA将线程组织到称为“thread block”的组中。kernel可以启动组织为“grid”结构的多个thread blocks。

<<< M , T >>>//表示kernel以M个thread blocks的grid启动。每个thread block具有T个并行线程。

很容一看出来，一个kernel是一个grid，所以需要声明该grid有多少个thread blocks，并且每个thread block有多少个thread。

SP(Streaming Processor):流处理器，是GPU最基本的处理单元，在fermi架构开始被叫做CUDA core。

SM(Streaming MultiProcessor): 一个SM由多个CUDA core组成，SM还包括特殊运算单元(SFU)，共享内存(shared memory)，寄存器文件(Register File)和调度器(Warp Scheduler)等。

按照SIMD模型，SM最小执行单元为warp，一个warp中有多个线程。SM执行单元SPs共享单个指令fetch/dispatch，这些线程将同一个指令应用于不同数据。因此，一个warp中的所有线程将总是具有相同的执行时间。

warp是SM的基本执行单元。一个warp包含32个并行thread，这32个thread执行于SIMT(Single-Instruction, Multiple-Thread，单指令多线程)模式。也就是说所有thread执行同一条指令，并且每个thread会使用各自的data执行该指令。

一个warp中的线程必然在同一个block中,同一个block不会再两个SM中，也就是block会调用多个warp，如果block的线程数不能整除warp线程数，则最后一个warp没有填满，没填满的warp中的thread是inactive。只要调用warp就会消耗SM资源，只不过有些warp中的线程是inactive。

一个warp中的thread执行相同的指令，有相同的执行时间，如果某一个thread阻塞了，同一个warp的其他thread都会阻塞，因此有了warp divergence。所以warp divergence只会出现在同一个warp中。

同一个warp中的thread执行同一操作，如果因为控制流语句（if）进入不同的分支才触发的warp divergence，那么只要避免控制流语句触发即可。也就是将同一分支放入同一warp。

也就引入了branch efficiency代码比较简单的时候，CUDA编译器自动优化代码。

warp的context包含三个部分：

Program counterRegisterShared memory

当一个block或得到足够的资源时，就成为active block。block中的warp就称为active warp。active warp又可以被分为下面三类：

Selected warp 被选中的warpStalled warp 没准备好要执行的称为Stalled warpEligible warp 没被选中，但是已经做好准备被执行的称为Eligible warp

SM中warp调度器每个cycle会挑选active warp送去执行，

warp适合执行需要满足下面两个条件：

32个CUDA core有空所有当前指令的参数都准备就绪

cudaError_t cudaDeviceSynchronize(void)可以用来停住CUP等待CUDA中的操作完成：

__device__ void __syncthreads(void)当该函数被调用，block中的每个thread都会等待所有其他thread执行到某个点来实现同步。

#include "cuda_runtime.h" #include "device_launch_parameters.h" #include <stdio.h> cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size); __global__ void addKernel(int *c, const int *a, const int *b) { int i = threadIdx.x;//获取一维的线程索引 c[i] = a[i] + b[i]; } int main() { const int arraySize = 5; const int a[arraySize] = { 1, 2, 3, 4, 5 }; const int b[arraySize] = { 10, 20, 30, 40, 50 }; int c[arraySize] = { 0 }; // Add vectors in parallel. cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize);//并行声明返回cuda状态 if (cudaStatus != cudaSuccess) { fprintf(stderr, "addWithCuda failed!"); return 1; } printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}\n", c[0], c[1], c[2], c[3], c[4]); // cudaDeviceReset must be called before exiting in order for profiling and // tracing tools such as Nsight and Visual Profiler to show complete traces. cudaStatus = cudaDeviceReset(); if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaDeviceReset failed!"); return 1; } return 0; } // Helper function for using CUDA to add vectors in parallel. cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size) { int *dev_a = 0; int *dev_b = 0; int *dev_c = 0; cudaError_t cudaStatus; // Choose which GPU to run on, change this on a multi-GPU system. cudaStatus = cudaSetDevice(0);//选择cuda 0作为默认选项 if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?"); goto Error; } // Allocate GPU buffers for three vectors (two input, one output) . cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));//为c创建cuda数据 if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaMalloc failed!"); goto Error; } cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));//为a创建cuda数据 if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaMalloc failed!"); goto Error; } cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));//为b创建cuda数据 if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaMalloc failed!"); goto Error; } // Copy input vectors from host memory to GPU buffers. cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);//将a内存拷贝给显存 if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaMemcpy failed!"); goto Error; } cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);//将b内存拷贝给显存 if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaMemcpy failed!"); goto Error; } // Launch a kernel on the GPU with one thread for each element. addKernel<<<1, size>>>(dev_c, dev_a, dev_b); // Check for any errors launching the kernel cudaStatus = cudaGetLastError(); if (cudaStatus != cudaSuccess) { fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus)); goto Error; } // cudaDeviceSynchronize waits for the kernel to finish, and returns // any errors encountered during the launch. cudaStatus = cudaDeviceSynchronize();//同步 if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus); goto Error; } // Copy output vector from GPU buffer to host memory. cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);//显存拷贝到内存 if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaMemcpy failed!"); goto Error; } Error: cudaFree(dev_c); cudaFree(dev_a); cudaFree(dev_b); return cudaStatus; }

volumeRender_kernel.cu

volumeRender.cpp

/* * Copyright 1993-2015 NVIDIA Corporation. All rights reserved. * * Please refer to the NVIDIA end user license agreement (EULA) associated * with this source code for terms and conditions that govern your use of * this software. Any use, reproduction, disclosure, or distribution of * this software and related documentation outside the terms of the EULA * is strictly prohibited. * */ /* Volume rendering sample This sample loads a 3D volume from disk and displays it using ray marching and 3D textures. Note - this is intended to be an example of using 3D textures in CUDA, not an optimized volume renderer. Changes sgg 22/3/2010 - updated to use texture for display instead of glDrawPixels. - changed to render from front-to-back rather than back-to-front. */ // OpenGL Graphics includes #include <helper_gl.h> #if defined (__APPLE__) || defined(MACOSX) #pragma clang diagnostic ignored "-Wdeprecated-declarations" #include <GLUT/glut.h> #ifndef glutCloseFunc #define glutCloseFunc glutWMCloseFunc #endif #else #include <GL/freeglut.h> #endif // CUDA Runtime, Interop, and includes #include <cuda_runtime.h> #include <cuda_gl_interop.h> #include <cuda_profiler_api.h> #include <vector_types.h> #include <vector_functions.h> #include <driver_functions.h> // CUDA utilities #include <helper_cuda.h> // Helper functions #include <helper_cuda.h> #include <helper_functions.h> #include <helper_timer.h> #include <time.h> typedef unsigned int uint; typedef unsigned char uchar; #define MAX_EPSILON_ERROR 5.00f #define THRESHOLD 0.30f // Define the files that are to be save and the reference images for validation const char *sOriginal[] = { "volume.ppm", NULL }; const char *sReference[] = { "ref_volume.ppm", NULL }; const char *sSDKsample = "CUDA 3D Volume Render";// //const char *volumeFilename = "Bucky.raw"; //cudaExtent volumeSize = make_cudaExtent(32, 32, 32); //typedef unsigned char VolumeType; const char* volumeFilename = "img_itk.raw"; cudaExtent volumeSize = make_cudaExtent(512, 512, 143);//创建cuda显存 typedef unsigned char VolumeType; //char *volumeFilename = "mrt16_angio.raw"; //cudaExtent volumeSize = make_cudaExtent(416, 512, 112); //typedef unsigned short VolumeType; uint width = 512, height = 512;//生成的OpenGL大小 //uint width = 1536, height = 1536; dim3 blockSize(16, 16);//block大小,二维，16*16=256个线程 dim3 gridSize;//grid大小，即block个数 float3 viewRotation; float3 viewTranslation = make_float3(0.0, 0.0, -4.0f); float invViewMatrix[12]; float density = 0.05f; float brightness = 1.0f; float transferOffset = 0.0f; float transferScale = 1.0f; bool linearFiltering = true; GLuint pbo = 0; // OpenGL pixel buffer object GLuint tex = 0; // OpenGL texture object struct cudaGraphicsResource *cuda_pbo_resource; // CUDA Graphics Resource (to transfer PBO) CUDA图形资源(用于传输PBO pixel buffer object) StopWatchInterface *timer = 0; //定义 StopWatch Interface接口 // Auto-Verification Code 自动验证码 const int frameCheckNumber = 2; int fpsCount = 0; // FPS count for averaging FPS计算平均值 int fpsLimit = 1; // FPS limit for sampling 采样FPS极限 int g_Index = 0; unsigned int frameCount = 0; int *pArgc; char **pArgv; #ifndef MAX #define MAX(a,b) ((a > b) ? a : b) #endif extern "C" void setTextureFilterMode(bool bLinearFilter);//创建一个3D texture extern "C" void initCuda(void *h_volume, cudaExtent volumeSize);//create 3D array extern "C" void freeCudaBuffers();//释放cuda对象 extern "C" void render_kernel(dim3 gridSize, dim3 blockSize, uint *d_output, uint imageW, uint imageH, float density, float brightness, float transferOffset, float transferScale);//render kernel extern "C" void copyInvViewMatrix(float *invViewMatrix, size_t sizeofMatrix);//逆矩阵 void initPixelBuffer();//初始化openGL void computeFPS() { frameCount++; fpsCount++; if (fpsCount == fpsLimit) { char fps[256]; float ifps = 1.f / (sdkGetAverageTimerValue(&timer) / 1000.f);// 1s/返回计时器执行的平均时间 sprintf(fps, "Volume Render: %3.1f fps", ifps); glutSetWindowTitle(fps);//OpenGL标题 fpsCount = 0; fpsLimit = (int)MAX(1.f, ifps); sdkResetTimer(&timer);//重置计时器的计数器。 } } // render image using CUDA 使用CUDA渲染图像 void render() { copyInvViewMatrix(invViewMatrix, sizeof(float4)*3);//拷贝逆矩阵 // map PBO to get CUDA device pointer uint *d_output;//opengl指针 // map PBO to get CUDA device pointer checkCudaErrors(cudaGraphicsMapResources(1, &cuda_pbo_resource, 0));//Map graphics resources for access by CUDA size_t num_bytes; checkCudaErrors(cudaGraphicsResourceGetMappedPointer((void **)&d_output, &num_bytes, cuda_pbo_resource));//Get an device pointer through which to access a mapped graphics resource. //printf("CUDA mapped PBO: May access %ld bytes\n", num_bytes); // clear image checkCudaErrors(cudaMemset(d_output, 0, width*height*4));//清除image // call CUDA kernel, writing results to PBO render_kernel(gridSize, blockSize, d_output, width, height, density, brightness, transferOffset, transferScale);//开始渲染 getLastCudaError("kernel failed"); checkCudaErrors(cudaGraphicsUnmapResources(1, &cuda_pbo_resource, 0)); } // display results using OpenGL (called by GLUT) void display() { sdkStartTimer(&timer); // use OpenGL to build view matrix GLfloat modelView[16]; glMatrixMode(GL_MODELVIEW); glPushMatrix(); glLoadIdentity(); glRotatef(-viewRotation.x, 1.0, 0.0, 0.0); glRotatef(-viewRotation.y, 0.0, 1.0, 0.0); glTranslatef(-viewTranslation.x, -viewTranslation.y, -viewTranslation.z); glGetFloatv(GL_MODELVIEW_MATRIX, modelView); glPopMatrix(); invViewMatrix[0] = modelView[0]; invViewMatrix[1] = modelView[4]; invViewMatrix[2] = modelView[8]; invViewMatrix[3] = modelView[12]; invViewMatrix[4] = modelView[1]; invViewMatrix[5] = modelView[5]; invViewMatrix[6] = modelView[9]; invViewMatrix[7] = modelView[13]; invViewMatrix[8] = modelView[2]; invViewMatrix[9] = modelView[6]; invViewMatrix[10] = modelView[10]; invViewMatrix[11] = modelView[14]; clock_t start, end; start = clock(); render(); end = clock(); printf("render time: %f \n", double(end - start) / CLOCKS_PER_SEC); // display results glClear(GL_COLOR_BUFFER_BIT); // draw image from PBO glDisable(GL_DEPTH_TEST); glPixelStorei(GL_UNPACK_ALIGNMENT, 1); #if 0 // draw using glDrawPixels (slower) glRasterPos2i(0, 0); glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo); glDrawPixels(width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0); glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0); #else // draw using texture // copy from pbo to texture glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo); glBindTexture(GL_TEXTURE_2D, tex); glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0); glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0); // draw textured quad glEnable(GL_TEXTURE_2D); glBegin(GL_QUADS); glTexCoord2f(0, 0); glVertex2f(0, 0); glTexCoord2f(1, 0); glVertex2f(1, 0); glTexCoord2f(1, 1); glVertex2f(1, 1); glTexCoord2f(0, 1); glVertex2f(0, 1); glEnd(); glDisable(GL_TEXTURE_2D); glBindTexture(GL_TEXTURE_2D, 0); #endif glutSwapBuffers(); glutReportErrors(); sdkStopTimer(&timer); computeFPS(); } void idle() { glutPostRedisplay(); } void keyboard(unsigned char key, int x, int y) { switch (key) { case 27: #if defined (__APPLE__) || defined(MACOSX) exit(EXIT_SUCCESS); #else glutDestroyWindow(glutGetWindow()); return; #endif break; case 'f': linearFiltering = !linearFiltering; setTextureFilterMode(linearFiltering); break; case '+': density += 0.01f; break; case '-': density -= 0.01f; break; case ']': brightness += 0.1f; break; case '[': brightness -= 0.1f; break; case ';': transferOffset += 0.01f; break; case '\'': transferOffset -= 0.01f; break; case '.': transferScale += 0.01f; break; case ',': transferScale -= 0.01f; break; default: break; } printf("density = %.2f, brightness = %.2f, transferOffset = %.2f, transferScale = %.2f\n", density, brightness, transferOffset, transferScale); glutPostRedisplay(); } int ox, oy; int buttonState = 0; void mouse(int button, int state, int x, int y) { if (state == GLUT_DOWN) { buttonState |= 1<<button; } else if (state == GLUT_UP) { buttonState = 0; } ox = x; oy = y; glutPostRedisplay(); } void motion(int x, int y) { float dx, dy; dx = (float)(x - ox); dy = (float)(y - oy); if (buttonState == 4) { // right = zoom viewTranslation.z += dy / 100.0f; } else if (buttonState == 2) { // middle = translate viewTranslation.x += dx / 100.0f; viewTranslation.y -= dy / 100.0f; } else if (buttonState == 1) { // left = rotate viewRotation.x += dy / 5.0f; viewRotation.y += dx / 5.0f; } ox = x; oy = y; glutPostRedisplay(); } int iDivUp(int a, int b) { return (a % b != 0) ? (a / b + 1) : (a / b); } void reshape(int w, int h) { width = w; height = h; initPixelBuffer(); // calculate new grid size gridSize = dim3(iDivUp(width, blockSize.x), iDivUp(height, blockSize.y)); glViewport(0, 0, w, h); glMatrixMode(GL_MODELVIEW); glLoadIdentity(); glMatrixMode(GL_PROJECTION); glLoadIdentity(); glOrtho(0.0, 1.0, 0.0, 1.0, 0.0, 1.0); } void cleanup() { sdkDeleteTimer(&timer); freeCudaBuffers(); if (pbo) { cudaGraphicsUnregisterResource(cuda_pbo_resource); glDeleteBuffers(1, &pbo); glDeleteTextures(1, &tex); } // Calling cudaProfilerStop causes all profile data to be // flushed before the application exits checkCudaErrors(cudaProfilerStop()); } void initGL(int *argc, char **argv) { // initialize GLUT callback functions glutInit(argc, argv); glutInitDisplayMode(GLUT_RGB | GLUT_DOUBLE); glutInitWindowSize(width, height); glutCreateWindow("CUDA volume rendering"); if (!isGLVersionSupported(2,0) || !areGLExtensionsSupported("GL_ARB_pixel_buffer_object")) { printf("Required OpenGL extensions are missing."); exit(EXIT_SUCCESS); } } void initPixelBuffer() { if (pbo) { // unregister this buffer object from CUDA C ;从CUDA C中注销这个缓冲区对象 checkCudaErrors(cudaGraphicsUnregisterResource(cuda_pbo_resource));//从CUDA C中注销CUDA图形资源(用于传输PBO pixel buffer object) // delete old buffer glDeleteBuffers(1, &pbo);// delete old bufferobject glDeleteTextures(1, &tex);// delete old texture object } // create pixel buffer object for display;为显示创建像素缓冲对象 glGenBuffers(1, &pbo); glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, pbo); glBufferData(GL_PIXEL_UNPACK_BUFFER_ARB, width*height*sizeof(GLubyte)*4, 0, GL_STREAM_DRAW_ARB); glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, 0); // register this buffer object with CUDA checkCudaErrors(cudaGraphicsGLRegisterBuffer(&cuda_pbo_resource, pbo, cudaGraphicsMapFlagsWriteDiscard));//用CUDA注册这个缓冲区对象，CUDA只会写入该资源，不会读取该资源，从CUDA到openGL // create texture for display glGenTextures(1, &tex); glBindTexture(GL_TEXTURE_2D, tex); glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL); glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST); glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST); glBindTexture(GL_TEXTURE_2D, 0); } // Load raw data from disk void *loadRawFile(char *filename, size_t size) { FILE *fp = fopen(filename, "rb"); if (!fp) { fprintf(stderr, "Error opening file '%s'\n", filename); return 0; } void *data = malloc(size); size_t read = fread(data, 1, size, fp); fclose(fp); #if defined(_MSC_VER_) printf("Read '%s', %Iu bytes\n", filename, read); #else printf("Read '%s', %zu bytes\n", filename, read); #endif return data; } void runSingleTest(const char *ref_file, const char *exec_path) { bool bTestResult = true; uint *d_output; checkCudaErrors(cudaMalloc((void **)&d_output, width*height*sizeof(uint)));//分配显存 checkCudaErrors(cudaMemset(d_output, 0, width*height*sizeof(uint)));//初始化 float modelView[16] = { 1.0f, 0.0f, 0.0f, 0.0f, 0.0f, 1.0f, 0.0f, 0.0f, 0.0f, 0.0f, 1.0f, 0.0f, 0.0f, 0.0f, 4.0f, 1.0f };//变换矩阵 invViewMatrix[0] = modelView[0]; invViewMatrix[1] = modelView[4]; invViewMatrix[2] = modelView[8]; invViewMatrix[3] = modelView[12]; invViewMatrix[4] = modelView[1]; invViewMatrix[5] = modelView[5]; invViewMatrix[6] = modelView[9]; invViewMatrix[7] = modelView[13]; invViewMatrix[8] = modelView[2]; invViewMatrix[9] = modelView[6]; invViewMatrix[10] = modelView[10]; invViewMatrix[11] = modelView[14]; // call CUDA kernel, writing results to PBO copyInvViewMatrix(invViewMatrix, sizeof(float4)*3);// // Start timer 0 and process n loops on the GPU int nIter = 10; for (int i = -1; i < nIter; i++) { if (i == 0) { cudaDeviceSynchronize();//等待GPU同步完成 sdkStartTimer(&timer);//开启计时器 } render_kernel(gridSize, blockSize, d_output, width, height, density, brightness, transferOffset, transferScale);//渲染函数 } cudaDeviceSynchronize();//等待GPU同步完成 sdkStopTimer(&timer);//结束计时器 // Get elapsed time and throughput, then log to sample and master logs 获取运行时间和吞吐量，然后记录到样本日志和主日志中 double dAvgTime = sdkGetTimerValue(&timer)/(nIter * 1000.0); printf("volumeRender, Throughput = %.4f MTexels/s, Time = %.5f s, Size = %u Texels, NumDevsUsed = %u, Workgroup = %u\n", (1.0e-6 * width * height)/dAvgTime, dAvgTime, (width * height), 1, blockSize.x * blockSize.y); getLastCudaError("Error: render_kernel() execution FAILED"); checkCudaErrors(cudaDeviceSynchronize()); unsigned char *h_output = (unsigned char *)malloc(width*height*4); checkCudaErrors(cudaMemcpy(h_output, d_output, width*height*4, cudaMemcpyDeviceToHost)); sdkSavePPM4ub("volume.ppm", h_output, width, height); bTestResult = sdkComparePPM("volume.ppm", sdkFindFilePath(ref_file, exec_path), MAX_EPSILON_ERROR, THRESHOLD, true); cudaFree(d_output); free(h_output); cleanup(); exit(bTestResult ? EXIT_SUCCESS : EXIT_FAILURE); } // Program main int main(int argc, char **argv) { pArgc = &argc; pArgv = argv; char *ref_file = NULL; #if defined(__linux__) setenv ("DISPLAY", ":0", 0); #endif //start logs printf("%s Starting...\n\n", sSDKsample); if (checkCmdLineFlag(argc, (const char **)argv, "file")) { getCmdLineArgumentString(argc, (const char **)argv, "file", &ref_file); fpsLimit = frameCheckNumber; } if (ref_file) { findCudaDevice(argc, (const char **)argv);// Initialization code to find the best CUDA Device } else { // First initialize OpenGL context, so we can properly set the GL for CUDA. // This is necessary in order to achieve optimal performance with OpenGL/CUDA interop. initGL(&argc, argv);//首先初始化OpenGL上下文 findCudaDevice(argc, (const char **)argv);//初始化CUDA } // parse arguments char *filename; if (getCmdLineArgumentString(argc, (const char **) argv, "volume", &filename)) { volumeFilename = filename; } int n; if (checkCmdLineFlag(argc, (const char **) argv, "size")) { n = getCmdLineArgumentInt(argc, (const char **) argv, "size"); volumeSize.width = volumeSize.height = volumeSize.depth = n; } if (checkCmdLineFlag(argc, (const char **) argv, "xsize")) { n = getCmdLineArgumentInt(argc, (const char **) argv, "xsize"); volumeSize.width = n; } if (checkCmdLineFlag(argc, (const char **) argv, "ysize")) { n = getCmdLineArgumentInt(argc, (const char **) argv, "ysize"); volumeSize.height = n; } if (checkCmdLineFlag(argc, (const char **) argv, "zsize")) { n= getCmdLineArgumentInt(argc, (const char **) argv, "zsize"); volumeSize.depth = n; } // load volume data char *path = sdkFindFilePath(volumeFilename, argv[0]);//查找文件 if (path == 0) { printf("Error finding file '%s'\n", volumeFilename); exit(EXIT_FAILURE); } size_t size = volumeSize.width*volumeSize.height*volumeSize.depth*sizeof(VolumeType);//获取文件大小 void *h_volume = loadRawFile(path, size);//读取raw数据 initCuda(h_volume, volumeSize);//初始化显存 free(h_volume);//释放内存数据 sdkCreateTimer(&timer);//创建计时器 printf("Press '+' and '-' to change density (0.01 increments)\n" " ']' and '[' to change brightness\n" " ';' and ''' to modify transfer function offset\n" " '.' and ',' to modify transfer function scale\n\n"); // calculate new grid size gridSize = dim3(iDivUp(width, blockSize.x), iDivUp(height, blockSize.y));//根据blocksize和图像的大小计算grid大小 if (ref_file) { runSingleTest(ref_file, argv[0]); } else { // This is the normal rendering path for VolumeRender glutDisplayFunc(display); glutKeyboardFunc(keyboard); glutMouseFunc(mouse); glutMotionFunc(motion); glutReshapeFunc(reshape); glutIdleFunc(idle); initPixelBuffer(); #if defined (__APPLE__) || defined(MACOSX) atexit(cleanup); #else glutCloseFunc(cleanup); #endif glutMainLoop(); } }

最新回复(0)