Yin的笔记本

vuePress-theme-reco Howard Yin    2021 - 2025
Yin的笔记本 Yin的笔记本

Choose mode

  • dark
  • auto
  • light
Home
Category
  • CNCF
  • Docker
  • namespaces
  • Kubernetes
  • Kubernetes对象
  • MyIdeas
  • Linux
  • Revolution
  • WebRTC
  • 云计算
  • 人工智能
  • 分布式
  • 图像处理
  • 图形学
  • 微服务
  • 数学
  • OJ笔记
  • 博弈论
  • 形式语言与自动机
  • 数据库
  • 服务器运维
  • 编程语言
  • C
  • Git
  • Go
  • Java
  • Node
  • Nvidia
  • Python
  • Rust
  • Tex
  • Shell
  • Vue
  • 视频编解码
  • 计算机网络
  • SDN
  • 论文笔记
  • 讨论
  • 边缘计算
  • 量子信息技术
Tag
TimeLine
About
查看源码
author-avatar

Howard Yin

299

Article

152

Tag

Home
Category
  • CNCF
  • Docker
  • namespaces
  • Kubernetes
  • Kubernetes对象
  • MyIdeas
  • Linux
  • Revolution
  • WebRTC
  • 云计算
  • 人工智能
  • 分布式
  • 图像处理
  • 图形学
  • 微服务
  • 数学
  • OJ笔记
  • 博弈论
  • 形式语言与自动机
  • 数据库
  • 服务器运维
  • 编程语言
  • C
  • Git
  • Go
  • Java
  • Node
  • Nvidia
  • Python
  • Rust
  • Tex
  • Shell
  • Vue
  • 视频编解码
  • 计算机网络
  • SDN
  • 论文笔记
  • 讨论
  • 边缘计算
  • 量子信息技术
Tag
TimeLine
About
查看源码
  • 详解3D Gaussian Splatting CUDA Kernel:前向传播

    • foward函数: RasterizeGaussiansCUDA
      • 开头
      • 分配显存空间
      • 渲染过程的入口
    • 深入CudaRasterizer::Rasterizer::forward(一)主要数据结构及初始化
      • 深入CudaRasterizer::Rasterizer::forward(二)预处理过程FORWARD::preprocess
        • 并行预处理函数preprocessCUDA
          • preprocessCUDA函数开头
          • 剔除近处的高斯球
          • 高斯球中心坐标转换到相机坐标系
          • 由高斯球参数计算3D协方差矩阵
          • 3D协方差矩阵计算函数computeCov3D
          • 3D协方差矩阵计算投影到成像平面上的2D协方差
          • 2D协方差矩阵计算函数computeCov2D
          • 2D协方差矩阵求逆
          • 计算高斯球与成像平面上哪些tiles相交
          • 从球谐函数系数求投影点颜色
          • 记录重要信息
        • 深入CudaRasterizer::Rasterizer::forward(三)高斯球排序
          • 构造排序用的Keys
          • 对高斯球进行基数排序
          • 计算每个tile要渲染的部分在point_list_keys上的起点和终点
        • 深入CudaRasterizer::Rasterizer::forward(四)渲染过程FORWARD::render
          • 并行渲染函数renderCUDA
            • 获取当前tile对应的像素range和thread对应的像素坐标
            • 获取当前tile对应的像素range
            • 渲染前的初始化
            • for循环处理该像素的每个高斯球

        详解3D Gaussian Splatting CUDA Kernel:前向传播

        vuePress-theme-reco Howard Yin    2021 - 2025

        详解3D Gaussian Splatting CUDA Kernel:前向传播


        Howard Yin 2024-11-24 18:43:00 图形学3D Gaussian Splatting

        PyBind绑定了3个函数,核心函数就两个:rasterize_gaussians和rasterize_gaussians_backward

        PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
          m.def("rasterize_gaussians", &RasterizeGaussiansCUDA);
          m.def("rasterize_gaussians_backward", &RasterizeGaussiansBackwardCUDA);
          m.def("mark_visible", &markVisible);
        }
        
        1
        2
        3
        4
        5

        # foward函数: RasterizeGaussiansCUDA

        # 开头

        最外层的渲染函数,输入输出差不多都是torch::Tensor,和Python里的输入输出直接对应:

        std::tuple<int, torch::Tensor, torch::Tensor, torch::Tensor, torch::Tensor, torch::Tensor>
        RasterizeGaussiansCUDA(
        	const torch::Tensor& background,
        	const torch::Tensor& means3D,
            const torch::Tensor& colors,
            const torch::Tensor& opacity,
        	const torch::Tensor& scales,
        	const torch::Tensor& rotations,
        	const float scale_modifier,
        	const torch::Tensor& cov3D_precomp,
        	const torch::Tensor& viewmatrix,
        	const torch::Tensor& projmatrix,
        	const float tan_fovx, 
        	const float tan_fovy,
            const int image_height,
            const int image_width,
        	const torch::Tensor& sh,
        	const int degree,
        	const torch::Tensor& campos,
        	const bool prefiltered,
        	const bool debug)
        {
          if (means3D.ndimension() != 2 || means3D.size(1) != 3) {
            AT_ERROR("means3D must have dimensions (num_points, 3)");
          }
          
          const int P = means3D.size(0); //高斯点数量
          const int H = image_height;
          const int W = image_width;
        
          auto int_opts = means3D.options().dtype(torch::kInt32);
          auto float_opts = means3D.options().dtype(torch::kFloat32);
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11
        12
        13
        14
        15
        16
        17
        18
        19
        20
        21
        22
        23
        24
        25
        26
        27
        28
        29
        30
        31
        32

        # 分配显存空间

        给out_color和radii分配显存空间,初始化geomBuffer、binningBuffer、imgBuffer:

          torch::Tensor out_color = torch::full({NUM_CHANNELS, H, W}, 0.0, float_opts); //分配显存空间用于存储渲染图
          torch::Tensor radii = torch::full({P}, 0, means3D.options().dtype(torch::kInt32)); //分配显存空间用于存储每个高斯球在渲染图中的半径
          
          torch::Device device(torch::kCUDA);
          torch::TensorOptions options(torch::kByte);
          torch::Tensor geomBuffer = torch::empty({0}, options.device(device)); //初始化, 还未分配显存空间
          torch::Tensor binningBuffer = torch::empty({0}, options.device(device)); //初始化, 还未分配显存空间
          torch::Tensor imgBuffer = torch::empty({0}, options.device(device)); //初始化, 还未分配显存空间
          std::function<char*(size_t)> geomFunc = resizeFunctional(geomBuffer);
          std::function<char*(size_t)> binningFunc = resizeFunctional(binningBuffer);
          std::function<char*(size_t)> imgFunc = resizeFunctional(imgBuffer);
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11

        这里的resizeFunctional是一个闭包,输出一个调显存空间大小lambda表达式,给后面代码里面分配显存用:

        std::function<char*(size_t N)> resizeFunctional(torch::Tensor& t) {
            auto lambda = [&t](size_t N) {
                t.resize_({(long long)N});
        		return reinterpret_cast<char*>(t.contiguous().data_ptr());
            };
            return lambda;
        }
        
        1
        2
        3
        4
        5
        6
        7

        这里的out_color和radii好理解,分别是渲染结果和高斯球在渲染图中的半径,但geomBuffer、binningBuffer、imgBuffer是存的什么?且看后文。

        # 渲染过程的入口

        调用CudaRasterizer::Rasterizer::forward执行渲染,结果构造为Python里的tuple:

          int rendered = 0;
          if(P != 0)
          {
        	  int M = 0;
        	  if(sh.size(0) != 0)
        	  {
        		M = sh.size(1);
              }
        
        	  rendered = CudaRasterizer::Rasterizer::forward(
        	    geomFunc,
        		binningFunc,
        		imgFunc,
        	    P, degree, M,
        		background.contiguous().data<float>(),
        		W, H,
        		means3D.contiguous().data<float>(),
        		sh.contiguous().data_ptr<float>(),
        		colors.contiguous().data<float>(), 
        		opacity.contiguous().data<float>(), 
        		scales.contiguous().data_ptr<float>(),
        		scale_modifier,
        		rotations.contiguous().data_ptr<float>(),
        		cov3D_precomp.contiguous().data<float>(), 
        		viewmatrix.contiguous().data<float>(), 
        		projmatrix.contiguous().data<float>(),
        		campos.contiguous().data<float>(),
        		tan_fovx,
        		tan_fovy,
        		prefiltered,
        		out_color.contiguous().data<float>(),
        		radii.contiguous().data<int>(),
        		debug);
          }
          return std::make_tuple(rendered, out_color, radii, geomBuffer, binningBuffer, imgBuffer);
        }
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11
        12
        13
        14
        15
        16
        17
        18
        19
        20
        21
        22
        23
        24
        25
        26
        27
        28
        29
        30
        31
        32
        33
        34
        35
        36

        # 深入CudaRasterizer::Rasterizer::forward(一)主要数据结构及初始化

        真 · 渲染过程的入口:

        // Forward rendering procedure for differentiable rasterization
        // of Gaussians.
        int CudaRasterizer::Rasterizer::forward(
        	std::function<char* (size_t)> geometryBuffer,
        	std::function<char* (size_t)> binningBuffer,
        	std::function<char* (size_t)> imageBuffer,
        	const int P, int D, int M,
        	const float* background,
        	const int width, int height,
        	const float* means3D,
        	const float* shs,
        	const float* colors_precomp,
        	const float* opacities,
        	const float* scales,
        	const float scale_modifier,
        	const float* rotations,
        	const float* cov3D_precomp,
        	const float* viewmatrix,
        	const float* projmatrix,
        	const float* cam_pos,
        	const float tan_fovx, float tan_fovy,
        	const bool prefiltered,
        	float* out_color,
        	int* radii,
        	bool debug)
        {
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11
        12
        13
        14
        15
        16
        17
        18
        19
        20
        21
        22
        23
        24
        25
        26

        预处理前的初始化:

        	const float focal_y = height / (2.0f * tan_fovy); //计算焦距fx,fy
        	const float focal_x = width / (2.0f * tan_fovx); //计算焦距fx,fy
        
        	size_t chunk_size = required<GeometryState>(P); //计算所需的显存大小
        	char* chunkptr = geometryBuffer(chunk_size); //分配显存块,并返回首地址
        	GeometryState geomState = GeometryState::fromChunk(chunkptr, P); //将分配的显存块初始化为GeometryState
        
        	if (radii == nullptr)
        	{
        		radii = geomState.internal_radii;
        	}
        
        	dim3 tile_grid((width + BLOCK_X - 1) / BLOCK_X, (height + BLOCK_Y - 1) / BLOCK_Y, 1); //图像划分为16x16的tiles,xy方向上各有多少tiles
        	dim3 block(BLOCK_X, BLOCK_Y, 1); //块中线程数量,16*16
        
        	// Dynamically resize image-based auxiliary buffers during training
        	size_t img_chunk_size = required<ImageState>(width * height); //计算所需的显存大小
        	char* img_chunkptr = imageBuffer(img_chunk_size); //分配显存块,返回首地址
        	ImageState imgState = ImageState::fromChunk(img_chunkptr, width * height); //将分配的显存块初始化为ImageState
        
        	if (NUM_CHANNELS != 3 && colors_precomp == nullptr)
        	{
        		throw std::runtime_error("For non-RGB, provide precomputed Gaussian colors!");
        	}
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11
        12
        13
        14
        15
        16
        17
        18
        19
        20
        21
        22
        23
        24

        GeometryState和ImageState的定义在rasterizer_impl.h,里面就是一大堆数组,用来存储渲染中的各种数据,后面可以看到每个数组的用法:

        	struct GeometryState
        	{
        		size_t scan_size;
        		float* depths;
        		char* scanning_space;
        		bool* clamped;
        		int* internal_radii;
        		float2* means2D;
        		float* cov3D;
        		float4* conic_opacity;
        		float* rgb;
        		uint32_t* point_offsets;
        		uint32_t* tiles_touched;
        
        		static GeometryState fromChunk(char*& chunk, size_t P);
        	};
        
        	struct ImageState
        	{
        		uint2* ranges;
        		uint32_t* n_contrib;
        		float* accum_alpha;
        
        		static ImageState fromChunk(char*& chunk, size_t N);
        	};
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11
        12
        13
        14
        15
        16
        17
        18
        19
        20
        21
        22
        23
        24
        25

        这里的required是在调用GeometryState或ImageState里面的fromChunk:

        	template<typename T> 
        	size_t required(size_t P)
        	{
        		char* size = nullptr;
        		T::fromChunk(size, P);
        		return ((size_t)size) + 128;
        	}
        
        1
        2
        3
        4
        5
        6
        7

        GeometryState的fromChunk实现为:

        CudaRasterizer::GeometryState CudaRasterizer::GeometryState::fromChunk(char*& chunk, size_t P)
        {
        	GeometryState geom;
        	obtain(chunk, geom.depths, P, 128);
        	obtain(chunk, geom.clamped, P * 3, 128);
        	obtain(chunk, geom.internal_radii, P, 128);
        	obtain(chunk, geom.means2D, P, 128);
        	obtain(chunk, geom.cov3D, P * 6, 128);
        	obtain(chunk, geom.conic_opacity, P, 128);
        	obtain(chunk, geom.rgb, P * 3, 128);
        	obtain(chunk, geom.tiles_touched, P, 128);
        	//参数:第三个in,第四个out,最后一个num。当第一个参数为NULL时, 所需的分配大小被写入第二个参数,并且不执行任何工作 https://github.com/dmlc/cub/blob/master/cub/device/device_scan.cuh
        	cub::DeviceScan::InclusiveSum(nullptr, geom.scan_size, geom.tiles_touched, geom.tiles_touched, P); //算出计算数组前缀和所需的内存空间大小,InclusiveSum表示包括自身,ExclusiveSum表示不包括自身
        	obtain(chunk, geom.scanning_space, geom.scan_size, 128); //按照算出的内存空间大小,给geom.scanning_space分配空间。看来后面会有计算数组前缀和的操作,这个geom.scanning_space就是在计算数组前缀和时用来存放中间结果的数组了
        	obtain(chunk, geom.point_offsets, P, 128);
        	return geom;
        }
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11
        12
        13
        14
        15
        16
        17

        ImageState的fromChunk实现为:

        CudaRasterizer::ImageState CudaRasterizer::ImageState::fromChunk(char*& chunk, size_t N)
        {
        	ImageState img;
        	obtain(chunk, img.accum_alpha, N, 128);
        	obtain(chunk, img.n_contrib, N, 128);
        	obtain(chunk, img.ranges, N, 128);
        	return img;
        }
        
        1
        2
        3
        4
        5
        6
        7
        8

        其中obtain就是把chunk开始的一段显存空间首地址赋值给ptr然后对chunk自增:

        	template <typename T>
        	static void obtain(char*& chunk, T*& ptr, std::size_t count, std::size_t alignment)
        	{
        		std::size_t offset = (reinterpret_cast<std::uintptr_t>(chunk) + alignment - 1) & ~(alignment - 1);
        		ptr = reinterpret_cast<T*>(offset);
        		chunk = reinterpret_cast<char*>(ptr + count);
        	}
        
        1
        2
        3
        4
        5
        6
        7

        这样就能理解前面的内分配过程了:

        	size_t chunk_size = required<GeometryState>(P); //计算所需的显存大小
        	char* chunkptr = geometryBuffer(chunk_size); //分配显存块,并返回首地址
        	GeometryState geomState = GeometryState::fromChunk(chunkptr, P); //将分配的显存块初始化为GeometryState
        
        1
        2
        3

        required调用fromChunk时有char* size = nullptr;,实际上size是0,于是放进fromChunk的chunk参数里就自增得到所需的内存大小,然后调用resizeFunctional实际分配内存和fromChunk给数组指针赋值。

        # 深入CudaRasterizer::Rasterizer::forward(二)预处理过程FORWARD::preprocess

        FORWARD::preprocess计算每一个高斯球投影出来的圆半径及覆盖的范围。

        	// Run preprocessing per-Gaussian (transformation, bounding, conversion of SHs to RGB)
        	CHECK_CUDA(FORWARD::preprocess(
        		P, D, M,
        		means3D,
        		(glm::vec3*)scales,
        		scale_modifier,
        		(glm::vec4*)rotations,
        		opacities,
        		shs,
        		geomState.clamped,
        		cov3D_precomp,
        		colors_precomp,
        		viewmatrix, projmatrix,
        		(glm::vec3*)cam_pos,
        		width, height,
        		focal_x, focal_y,
        		tan_fovx, tan_fovy,
        		radii,
        		geomState.means2D,
        		geomState.depths,
        		geomState.cov3D,
        		geomState.rgb,
        		geomState.conic_opacity,
        		tile_grid,
        		geomState.tiles_touched,
        		prefiltered
        	), debug)
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11
        12
        13
        14
        15
        16
        17
        18
        19
        20
        21
        22
        23
        24
        25
        26
        27

        这个preprocess函数经过一层封装,实际上是并行调用preprocessCUDA对每个高斯球进行处理:

        void FORWARD::preprocess(int P, int D, int M,
        	const float* means3D,
        	const glm::vec3* scales,
        	const float scale_modifier,
        	const glm::vec4* rotations,
        	const float* opacities,
        	const float* shs,
        	bool* clamped,
        	const float* cov3D_precomp,
        	const float* colors_precomp,
        	const float* viewmatrix,
        	const float* projmatrix,
        	const glm::vec3* cam_pos,
        	const int W, int H,
        	const float focal_x, float focal_y,
        	const float tan_fovx, float tan_fovy,
        	int* radii,
        	float2* means2D,
        	float* depths,
        	float* cov3Ds,
        	float* rgb,
        	float4* conic_opacity,
        	const dim3 grid,
        	uint32_t* tiles_touched,
        	bool prefiltered)
        {
        	preprocessCUDA<NUM_CHANNELS> << <(P + 255) / 256, 256 >> > ( //一个线程处理一个高斯球
        		P, D, M,
        		means3D,
        		scales,
        		scale_modifier,
        		rotations,
        		opacities,
        		shs,
        		clamped,
        		cov3D_precomp,
        		colors_precomp,
        		viewmatrix, 
        		projmatrix,
        		cam_pos,
        		W, H,
        		tan_fovx, tan_fovy,
        		focal_x, focal_y,
        		radii,
        		means2D,
        		depths,
        		cov3Ds,
        		rgb,
        		conic_opacity,
        		grid,
        		tiles_touched,
        		prefiltered
        		);
        }
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11
        12
        13
        14
        15
        16
        17
        18
        19
        20
        21
        22
        23
        24
        25
        26
        27
        28
        29
        30
        31
        32
        33
        34
        35
        36
        37
        38
        39
        40
        41
        42
        43
        44
        45
        46
        47
        48
        49
        50
        51
        52
        53
        54

        # 并行预处理函数preprocessCUDA

        # preprocessCUDA函数开头

        // Perform initial steps for each Gaussian prior to rasterization.
        template<int C>
        __global__ void preprocessCUDA(int P, int D, int M, // P是高斯球的数量、D和M是球谐函数的级数
        	const float* orig_points, //高斯球中心在世界空间下的坐标
        	const glm::vec3* scales,
        	const float scale_modifier,
        	const glm::vec4* rotations,
        	const float* opacities,
        	const float* shs,
        	bool* clamped,
        	const float* cov3D_precomp,
        	const float* colors_precomp,
        	const float* viewmatrix, // view transform matrix,世界空间坐标转相机空间坐标
        	const float* projmatrix, // 这里的projmatrix其实是projection transform和view transform的合体,直接把世界空间坐标转NDC空间坐标
        	const glm::vec3* cam_pos,
        	const int W, int H,
        	const float tan_fovx, float tan_fovy,
        	const float focal_x, float focal_y,
        	int* radii,
        	float2* points_xy_image,
        	float* depths,
        	float* cov3Ds,
        	float* rgb,
        	float4* conic_opacity,
        	const dim3 grid, //block块的范围
        	uint32_t* tiles_touched,
        	bool prefiltered)
        {
        	auto idx = cg::this_grid().thread_rank(); //线程在组内的标号,区间为[0,num_threads)
        	if (idx >= P)
        		return; //如果编号大于高斯球的数量,则返回,也就是说一个线程对应一个高斯球处理
        
        	// Initialize radius and touched tiles to 0. If this isn't changed,
        	// this Gaussian will not be processed further.
        	radii[idx] = 0; //高斯球半径
        	tiles_touched[idx] = 0; //高斯球在图片上覆盖的tile数量
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11
        12
        13
        14
        15
        16
        17
        18
        19
        20
        21
        22
        23
        24
        25
        26
        27
        28
        29
        30
        31
        32
        33
        34
        35
        36

        # 剔除近处的高斯球

        	// Perform near culling, quit if outside.
        	float3 p_view;
        	if (!in_frustum(idx, orig_points, viewmatrix, projmatrix, prefiltered, p_view))
        		return; //执行近剔除,如果高斯球离成像平面太近则退出
        
        1
        2
        3
        4

        # 高斯球中心坐标转换到相机坐标系

        	// Transform point by projecting
        	float3 p_orig = { orig_points[3 * idx], orig_points[3 * idx + 1], orig_points[3 * idx + 2] }; //高斯球中心在世界空间下的坐标
        	float4 p_hom = transformPoint4x4(p_orig, projmatrix); //世界空间坐标直接转NDC坐标,并且自动变齐次
        	float p_w = 1.0f / (p_hom.w + 0.0000001f);
        	float3 p_proj = { p_hom.x * p_w, p_hom.y * p_w, p_hom.z * p_w }; //齐次转回非齐次
        
        1
        2
        3
        4
        5

        # 由高斯球参数计算3D协方差矩阵

        	// If 3D covariance matrix is precomputed, use it, otherwise compute
        	// from scaling and rotation parameters. 
        	const float* cov3D;
        	if (cov3D_precomp != nullptr)
        	{
        		cov3D = cov3D_precomp + idx * 6; //预设的3D协方差
        	}
        	else
        	{
        		computeCov3D(scales[idx], scale_modifier, rotations[idx], cov3Ds + idx * 6); //用高斯球的scales和rotations其3D协方差,放进cov3Ds里
        		cov3D = cov3Ds + idx * 6; //从cov3Ds里读取高斯球的3D协方差矩阵
        	}
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11
        12

        # 3D协方差矩阵计算函数computeCov3D

        // Forward method for converting scale and rotation properties of each
        // Gaussian to a 3D covariance matrix in world space. Also takes care
        // of quaternion normalization.
        __device__ void computeCov3D(const glm::vec3 scale, float mod, const glm::vec4 rot, float* cov3D)
        {
        
        1
        2
        3
        4
        5

        # 从scale参数计算缩放矩阵SSS

        首先根据高斯球的scale参数创建缩放矩阵:

        S=[sx000sy000sz]S=\left[\begin{matrix}s_x & 0 & 0 \\0 & s_y & 0 \\0 & 0 & s_z\end{matrix}\right] S=⎣⎢⎡​sx​00​0sy​0​00sz​​⎦⎥⎤​

        	// Create scaling matrix
        	glm::mat3 S = glm::mat3(1.0f);
        	S[0][0] = mod * scale.x;
        	S[1][1] = mod * scale.y;
        	S[2][2] = mod * scale.z;
        
        1
        2
        3
        4
        5

        其中的mod来自于外面Python传来的scale_modifier,在使用图形界面看3DGS场景的时候可见调节高斯球大小的功能就是由此实现。

        # 从四元数计算旋转矩阵RRR

        旋转矩阵 RRR 可以从归一化的四元数 q=[r,x,y,z]q = [r, x, y, z]q=[r,x,y,z] 计算得出:

        R=[1−2(y2+z2)2(xy−rz)2(xz+ry)2(xy+rz)1−2(x2+z2)2(yz−rx)2(xz−ry)2(yz+rx)1−2(x2+y2)]R = \left[\begin{matrix} 1 - 2(y^2 + z^2) & 2(xy - rz) & 2(xz + ry) \\ 2(xy + rz) & 1 - 2(x^2 + z^2) & 2(yz - rx) \\ 2(xz - ry) & 2(yz + rx) & 1 - 2(x^2 + y^2) \end{matrix}\right]R=⎣⎢⎡​1−2(y2+z2)2(xy+rz)2(xz−ry)​2(xy−rz)1−2(x2+z2)2(yz+rx)​2(xz+ry)2(yz−rx)1−2(x2+y2)​⎦⎥⎤​

        	// Normalize quaternion to get valid rotation
        	glm::vec4 q = rot;// / glm::length(rot);
        	float r = q.x;
        	float x = q.y;
        	float y = q.z;
        	float z = q.w;
        
        	// Compute rotation matrix from quaternion
        	glm::mat3 R = glm::mat3(
        		1.f - 2.f * (y * y + z * z), 2.f * (x * y - r * z), 2.f * (x * z + r * y),
        		2.f * (x * y + r * z), 1.f - 2.f * (x * x + z * z), 2.f * (y * z - r * x),
        		2.f * (x * z - r * y), 2.f * (y * z + r * x), 1.f - 2.f * (x * x + y * y)
        	);
        
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11
        12
        13
        14

        # 计算3D协方差矩阵Σ\SigmaΣ

        由缩放矩阵SSS和旋转矩阵RRR计算3D协方差矩阵的公式来自3DGS原论文:

        Σ=RSSTRT\Sigma=RSS^TR^T Σ=RSSTRT

        	glm::mat3 M = S * R;
        
        	// Compute 3D world covariance matrix Sigma
        	glm::mat3 Sigma = glm::transpose(M) * M;
        
        	// Covariance is symmetric, only store upper right
        	cov3D[0] = Sigma[0][0];
        	cov3D[1] = Sigma[0][1];
        	cov3D[2] = Sigma[0][2];
        	cov3D[3] = Sigma[1][1];
        	cov3D[4] = Sigma[1][2];
        	cov3D[5] = Sigma[2][2];
        }
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11
        12
        13

        # 3D协方差矩阵计算投影到成像平面上的2D协方差

        	// Compute 2D screen-space covariance matrix
        	float3 cov = computeCov2D(p_orig, focal_x, focal_y, tan_fovx, tan_fovy, cov3D, viewmatrix); //计算椭球投影成椭圆的样子[cov.x,cov.y,cov.y,cov.z],即2D协方差矩阵[abbc]
        
        1
        2

        # 2D协方差矩阵计算函数computeCov2D

        按注释所讲,此函数对应公式来自于"EWA Splatting" (Zwicker et al., 2002)原论文:

        // Forward version of 2D covariance matrix computation
        __device__ float3 computeCov2D(const float3& mean, float focal_x, float focal_y, float tan_fovx, float tan_fovy, const float* cov3D, const float* viewmatrix)
        {
        	// The following models the steps outlined by equations 29
        	// and 31 in "EWA Splatting" (Zwicker et al., 2002). 
        	// Additionally considers aspect / scaling of viewport.
        	// Transposes used to account for row-/column-major conventions.
        	float3 t = transformPoint4x3(mean, viewmatrix); // 高斯球中心点坐标变换到相机坐标系
        
        1
        2
        3
        4
        5
        6
        7
        8

        # 限制高斯球中心点坐标范围

        此举意在保持数值稳定性避免精度溢出:

        	const float limx = 1.3f * tan_fovx;
        	const float limy = 1.3f * tan_fovy;
        	const float txtz = t.x / t.z; // xy除了个z
        	const float tytz = t.y / t.z;
        	t.x = min(limx, max(-limx, txtz)) * t.z; // 限制大小后又把z给乘回去了
        	t.y = min(limy, max(-limy, tytz)) * t.z;
        
        1
        2
        3
        4
        5
        6

        # 计算雅可比矩阵JJJ

        在3D到2D的投影计算过程中,雅可比矩阵 JJJ 表示2维空间坐标(u,v)(u,v)(u,v)对3维空间中的坐标(x,y,z)(x,y,z)(x,y,z)的导数,即:

        J=∂(u,v)∂(x,y,z)=[∂u∂x∂u∂y∂u∂z∂v∂x∂v∂y∂v∂z]J=\frac{\partial (u,v)}{\partial (x,y,z)}=\left[ \begin{matrix} \frac{\partial u}{\partial x} & \frac{\partial u}{\partial y} & \frac{\partial u}{\partial z} \\ \frac{\partial v}{\partial x} & \frac{\partial v}{\partial y} & \frac{\partial v}{\partial z}\\ \end{matrix} \right]J=∂(x,y,z)∂(u,v)​=[∂x∂u​∂x∂v​​∂y∂u​∂y∂v​​∂z∂u​∂z∂v​​]

        3DGS只针对透视投影,根据透视投影的计算公式:

        z[uv1]=[fx0cx0fycy001]⋅[xyz]=z[fxxz+cxfyyz+cy1]z \left[ \begin{matrix} u\\v\\1 \end{matrix} \right] = \left[ \begin{matrix} f_x&0&c_x\\0&f_y&c_y\\0&0&1 \end{matrix} \right] \cdot \left[ \begin{matrix} x\\y\\z \end{matrix} \right] = z \left[ \begin{matrix} f_x\frac{x}{z}+c_x\\f_y\frac{y}{z}+c_y\\1 \end{matrix} \right] z⎣⎢⎡​uv1​⎦⎥⎤​=⎣⎢⎡​fx​00​0fy​0​cx​cy​1​⎦⎥⎤​⋅⎣⎢⎡​xyz​⎦⎥⎤​=z⎣⎢⎡​fx​zx​+cx​fy​zy​+cy​1​⎦⎥⎤​

        可知(u,v)(u,v)(u,v)和(x,y,z)(x,y,z)(x,y,z)的对应关系:

        u=fxxz+cxv=fyyz+cy\begin{aligned} u&=f_x\frac{x}{z}+c_x\\ v&=f_y\frac{y}{z}+c_y\\ \end{aligned} uv​=fx​zx​+cx​=fy​zy​+cy​​

        进而求导得雅可比矩阵JJJ:

        J=[fx1z0−fxxz20fy1z−fyyz2]J=\left[ \begin{matrix} f_x\frac{1}{z} & 0 & -f_x\frac{x}{z^2} \\ 0 & f_y\frac{1}{z} & -f_y\frac{y}{z^2} \\ \end{matrix} \right]J=[fx​z1​0​0fy​z1​​−fx​z2x​−fy​z2y​​]

        	glm::mat3 J = glm::mat3(
        		focal_x / t.z, 0.0f, -(focal_x * t.x) / (t.z * t.z),
        		0.0f, focal_y / t.z, -(focal_y * t.y) / (t.z * t.z),
        		0, 0, 0);
        
        1
        2
        3
        4

        代码里的J为了方便glm库的加速所以加上了一行0。

        # 计算2D协方差矩阵Σ′\Sigma'Σ′

        2D协方差矩阵公式来自3DGS原论文:

        Σ′=JWΣWTJT\Sigma'=JW\Sigma W^TJ^T Σ′=JWΣWTJT

        其中,WWW为视角变换矩阵的前3x3部分:

        	glm::mat3 W = glm::mat3(
        		viewmatrix[0], viewmatrix[4], viewmatrix[8],
        		viewmatrix[1], viewmatrix[5], viewmatrix[9],
        		viewmatrix[2], viewmatrix[6], viewmatrix[10]);
        
        1
        2
        3
        4

        注: 上面的雅可比矩阵 JJJ 蕴含相机内参的信息、视角变换矩阵 WWW 蕴含相机外参信息。 于是已知相机内外参和3D协方差矩阵Σ\SigmaΣ则可计算高斯球投影在成像平面上的2D协方差矩阵Σ′\Sigma'Σ′,很好理解。

        注:公式Σ′=JWΣWTJT\Sigma'=JW\Sigma W^TJ^TΣ′=JWΣWTJT的这种投影已经包含了正交投影所需的各种变换,具体怎么由正交投影推导出这么简洁优美的表达形式的要去看"EWA Splatting" (Zwicker et al., 2002)原论文。

        计算Σ′=JWΣWTJT\Sigma'=JW\Sigma W^TJ^TΣ′=JWΣWTJT:

        	glm::mat3 T = W * J;
        
        	glm::mat3 Vrk = glm::mat3(
        		cov3D[0], cov3D[1], cov3D[2],
        		cov3D[1], cov3D[3], cov3D[4],
        		cov3D[2], cov3D[4], cov3D[5]);
        
        	glm::mat3 cov = glm::transpose(T) * glm::transpose(Vrk) * T;
        
        1
        2
        3
        4
        5
        6
        7
        8

        这里的Vrk就是Σ\SigmaΣ

        最后保存计算结果:

        	// Apply low-pass filter: every Gaussian should be at least
        	// one pixel wide/high. Discard 3rd row and column.
        	cov[0][0] += 0.3f;
        	cov[1][1] += 0.3f;
        	return { float(cov[0][0]), float(cov[0][1]), float(cov[1][1]) };
        }
        
        1
        2
        3
        4
        5
        6

        # 2D协方差矩阵求逆

        求坐标x\bm xx上的高斯函数值G(x)G(\bm x)G(x)公式来自3DGS原论文:

        G(x)=e−12xTΣ−1xG(\bm x)=e^{-\frac{1}{2}\bm x^T\Sigma^{-1}\bm x} G(x)=e−21​xTΣ−1x

        所以为了后面渲染过程方便计算这里先算出了Σ−1\Sigma^{-1}Σ−1保存为conic,称为“锥体(conic)参数”(高斯球投影为实心椭圆,其边界为圆锥曲线):

        Σ=[abbc]Σ−1=Σ∗∣Σ∣=[c−b−ba]ac−b2\begin{aligned} \Sigma&=\left[\begin{matrix}a&b\\b&c\end{matrix}\right]\\ \Sigma^{-1}&=\frac{\Sigma^*}{|\Sigma|}=\frac{\left[\begin{matrix}c&-b\\-b&a\end{matrix}\right]}{ac-b^2} \end{aligned} ΣΣ−1​=[ab​bc​]=∣Σ∣Σ∗​=ac−b2[c−b​−ba​]​​

        	// Invert covariance (EWA algorithm) //求2D协方差矩阵的行列式
        	float det = (cov.x * cov.z - cov.y * cov.y); //协方差矩阵[abbc]的模ac-b^2
        	if (det == 0.0f)
        		return; //协方差矩阵的模为0说明投影出来椭圆面积为0,退出
        	float det_inv = 1.f / det; //协方差矩阵[abbc]的行列式
        	float3 conic = { cov.z * det_inv, -cov.y * det_inv, cov.x * det_inv }; //协方差矩阵[abbc]的逆 = det*(c,-b,a)
        
        1
        2
        3
        4
        5
        6

        # 计算高斯球与成像平面上哪些tiles相交

        	// Compute extent in screen space (by finding eigenvalues of
        	// 2D covariance matrix). Use extent to compute a bounding rectangle
        	// of screen-space tiles that this Gaussian overlaps with. Quit if
        	// rectangle covers 0 tiles. 
        	float mid = 0.5f * (cov.x + cov.z);
        	float lambda1 = mid + sqrt(max(0.1f, mid * mid - det));
        	float lambda2 = mid - sqrt(max(0.1f, mid * mid - det));
        	float my_radius = ceil(3.f * sqrt(max(lambda1, lambda2))); //这里的3就是正态分布的3sigma法则,max表明用于算覆盖范围的半径是长轴半径
        	float2 point_image = { ndc2Pix(p_proj.x, W), ndc2Pix(p_proj.y, H) }; //ndc坐标转换到像素坐标
        	uint2 rect_min, rect_max;
        	getRect(point_image, my_radius, rect_min, rect_max, grid); //计算高斯球覆盖了哪些tile,返回rect_min, rect_max
        	if ((rect_max.x - rect_min.x) * (rect_max.y - rect_min.y) == 0)
        		return;
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11
        12
        13

        其中ndc2Pix把相机ndc坐标系上的坐标转换到像素坐标:

        __forceinline__ __device__ float ndc2Pix(float v, int S)
        {
        	return ((v + 1.0) * S - 1.0) * 0.5;
        }
        
        1
        2
        3
        4

        getRect函数定义如下,可以看出,计算方法是以高斯球中心为中点,边长为长轴半径x2的方形区域的相交区域:

        __forceinline__ __device__ void getRect(const float2 p, int max_radius, uint2& rect_min, uint2& rect_max, dim3 grid) //grid是成像平面的范围
        {
        	rect_min = {
        		min(grid.x, max((int)0, (int)((p.x - max_radius) / BLOCK_X))), //BLOCK_X=BLOCK_Y=16,所以是16x16为一个tile
        		min(grid.y, max((int)0, (int)((p.y - max_radius) / BLOCK_Y)))
        	};
        	rect_max = {
        		min(grid.x, max((int)0, (int)((p.x + max_radius + BLOCK_X - 1) / BLOCK_X))),
        		min(grid.y, max((int)0, (int)((p.y + max_radius + BLOCK_Y - 1) / BLOCK_Y)))
        	};
        }
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11

        # 从球谐函数系数求投影点颜色

        	// If colors have been precomputed, use them, otherwise convert
        	// spherical harmonics coefficients to RGB color.
        	if (colors_precomp == nullptr)
        	{
        		glm::vec3 result = computeColorFromSH(idx, D, M, (glm::vec3*)orig_points, *cam_pos, shs, clamped);
        		rgb[idx * C + 0] = result.x;
        		rgb[idx * C + 1] = result.y;
        		rgb[idx * C + 2] = result.z;
        	}
        
        1
        2
        3
        4
        5
        6
        7
        8
        9

        # 记录重要信息

        	// Store some useful helper data for the next steps.
        	depths[idx] = p_view.z; //该高斯椭球中心距成像平面距离(深度)
        	radii[idx] = my_radius; //该高斯椭球投影在成像平面上的长轴半径
        	points_xy_image[idx] = point_image; //该高斯椭球中心在成像平面上的像素坐标
        	// Inverse 2D covariance and opacity neatly pack into one float4
        	conic_opacity[idx] = { conic.x, conic.y, conic.z, opacities[idx] }; //高斯椭球的协方差逆矩阵和透明度放在一起了
        	tiles_touched[idx] = (rect_max.y - rect_min.y) * (rect_max.x - rect_min.x); //该高斯椭球与多少个tiles相交
        }
        
        1
        2
        3
        4
        5
        6
        7
        8

        # 深入CudaRasterizer::Rasterizer::forward(三)高斯球排序

        预处理后的初始化:

        	// Compute prefix sum over full list of touched tile counts by Gaussians
        	// E.g., [2, 3, 0, 2, 1] -> [2, 5, 5, 7, 8]
        	CHECK_CUDA(cub::DeviceScan::InclusiveSum(geomState.scanning_space, geomState.scan_size, geomState.tiles_touched, geomState.point_offsets, P), debug) //计算tiles_touched数组的前缀和存到point_offsets中,从后面的代码可以看出,每个高斯椭球都在BinningState都分配了一片显存空间存储与其相交的所有tiles的数据,每个高斯椭球相交的tiles数量都不一样,这里point_offsets就存着每个高斯椭球对应的显存空间起点
        
        	// Retrieve total number of Gaussian instances to launch and resize aux buffers
        	int num_rendered;
        	CHECK_CUDA(cudaMemcpy(&num_rendered, geomState.point_offsets + P - 1, sizeof(int), cudaMemcpyDeviceToHost), debug); //指针point_offsets+P-1是point_offsets数组(tiles_touched数组的前缀和数组)的最后一个元素的值,也即总共覆盖的tiles数量,把它赋给num_rendered,所以num_rendered就是需要渲染的tiles总量,被多个高斯球覆盖的tiles重复计数
        
        	size_t binning_chunk_size = required<BinningState>(num_rendered); //计算所需的显存大小
        	char* binning_chunkptr = binningBuffer(binning_chunk_size); //分配显存块,并返回首地址
        	BinningState binningState = BinningState::fromChunk(binning_chunkptr, num_rendered); //将分配的显存块初始化为BinningState,每个高斯椭球都在BinningState都分配了一片显存空间存储与其相交的所有tiles的数据
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11

        # 构造排序用的Keys

        目前为止,FORWARD::preprocess和前面的初始化中计算了geomState中的各项:高斯椭球中心在成像平面上的像素坐标geomState.means2D、高斯椭球中心距成像平面距离geomState.depths、每个高斯椭球对应的存储相交tiles的显存空间起点geomState.point_offsets,以及高斯椭球投影在成像平面上的长轴半径radii; binningState里每个高斯椭球都都分配了一片显存空间存储与其相交的所有tiles的数据。 接下来将其放进duplicateWithKeys中初始化key用于排序,每个高斯球一个线程:

        	// For each instance to be rendered, produce adequate [ tile | depth ] key 
        	// and corresponding dublicated Gaussian indices to be sorted
        	duplicateWithKeys << <(P + 255) / 256, 256 >> > ( //这里用到上步InclusiveSum得到的累计高斯球touch的tiles数
        		P,
        		geomState.means2D,
        		geomState.depths,
        		geomState.point_offsets, //这里用到上步InclusiveSum得到的累计高斯球touch的tiles数
        		binningState.point_list_keys_unsorted, //存储key (tileID|depth)
        		binningState.point_list_unsorted, // 存储对应的高斯球idx
        		radii, //像素平面上高斯圆的半径,最长轴的3倍
        		tile_grid) //全图中tile的数量
        	CHECK_CUDA(, debug)
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11
        12

        duplicateWithKeys就是将每个高斯椭球深度和相交tiles的编号组合成key放进gaussian_keys_unsorted里、高斯椭球编号放进gaussian_values_unsorted里,其定义如下:

        // Generates one key/value pair for all Gaussian / tile overlaps. 
        // Run once per Gaussian (1:N mapping).
        __global__ void duplicateWithKeys(
        	int P,
        	const float2* points_xy,
        	const float* depths,
        	const uint32_t* offsets,
        	uint64_t* gaussian_keys_unsorted,
        	uint32_t* gaussian_values_unsorted,
        	int* radii,
        	dim3 grid)
        {
        	auto idx = cg::this_grid().thread_rank();
        	if (idx >= P)
        		return;
        
        	// Generate no key/value pair for invisible Gaussians
        	if (radii[idx] > 0)
        	{
        		// Find this Gaussian's offset in buffer for writing keys/values.
        		uint32_t off = (idx == 0) ? 0 : offsets[idx - 1];
        		uint2 rect_min, rect_max;
        
        		getRect(points_xy[idx], radii[idx], rect_min, rect_max, grid);
        
        		// For each tile that the bounding rect overlaps, emit a 
        		// key/value pair. The key is |  tile ID  |      depth      |,
        		// and the value is the ID of the Gaussian. Sorting the values 
        		// with this key yields Gaussian IDs in a list, such that they
        		// are first sorted by tile and then by depth. 
        		for (int y = rect_min.y; y < rect_max.y; y++)
        		{
        			for (int x = rect_min.x; x < rect_max.x; x++)
        			{
        				uint64_t key = y * grid.x + x; //key的上半边是tile ID,即tile的xy坐标的组合
        				key <<= 32; //tile ID作为key的上半边
        				key |= *((uint32_t*)&depths[idx]); //key的下半边为该高斯球中心的深度
        				gaussian_keys_unsorted[off] = key; //记录key
        				gaussian_values_unsorted[off] = idx; //value值为高斯球编号
        				off++;
        			}
        		}
        	}
        }
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11
        12
        13
        14
        15
        16
        17
        18
        19
        20
        21
        22
        23
        24
        25
        26
        27
        28
        29
        30
        31
        32
        33
        34
        35
        36
        37
        38
        39
        40
        41
        42
        43
        44

        # 对高斯球进行基数排序

        对gaussian_keys_unsorted进行排序,并调整gaussian_values_unsorted里对应的顺序,分别放进point_list_keys和point_list里:

        	int bit = getHigherMsb(tile_grid.x * tile_grid.y); //查找最高有效位(most significant bit),排序算法cub::DeviceRadixSort::SortPairs中需要
        
        	// Sort complete list of (duplicated) Gaussian indices by keys
        	CHECK_CUDA(cub::DeviceRadixSort::SortPairs(
        		binningState.list_sorting_space,
        		binningState.sorting_size,
        		binningState.point_list_keys_unsorted, binningState.point_list_keys,
        		binningState.point_list_unsorted, binningState.point_list,
        		num_rendered, 0, 32 + bit), debug) //0, 32 + bit分别是The least-significant 和 most-significant bit index needed for key comparison,是可选的参数,这里应该是因为key是uint64_t很长,指定最高有效位可以减少计算量?
        
        1
        2
        3
        4
        5
        6
        7
        8
        9

        # 计算每个tile要渲染的部分在point_list_keys上的起点和终点

        每个tile分配两个uint,每个要渲染的key都启动一个identifyTileRanges检查是否是point_list_keys上的tile起点或终点,如果是,将idx写入imgState.ranges:

        	CHECK_CUDA(cudaMemset(imgState.ranges, 0, tile_grid.x * tile_grid.y * sizeof(uint2)), debug);
        
        	// Identify start and end of per-tile workloads in sorted list
        	if (num_rendered > 0)
        		identifyTileRanges << <(num_rendered + 255) / 256, 256 >> > (
        			num_rendered,
        			binningState.point_list_keys,
        			imgState.ranges);
        	CHECK_CUDA(, debug)
        
        1
        2
        3
        4
        5
        6
        7
        8
        9

        其具体的计算过程identifyTileRanges:

        // Check keys to see if it is at the start/end of one tile's range in 
        // the full sorted list. If yes, write start/end of this tile. 
        // Run once per instanced (duplicated) Gaussian ID.
        __global__ void identifyTileRanges(int L, uint64_t* point_list_keys, uint2* ranges)
        {
        	auto idx = cg::this_grid().thread_rank();
        	if (idx >= L)
        		return;
        
        	// Read tile ID from key. Update start/end of tile range if at limit.
        	uint64_t key = point_list_keys[idx]; //获取key
        	uint32_t currtile = key >> 32; //获取tile编号
        	if (idx == 0)
        		ranges[currtile].x = 0; //数组中第一项没有上一个,直接赋0
        	else
        	{
        		uint32_t prevtile = point_list_keys[idx - 1] >> 32; //获取上一个tile编号
        		if (currtile != prevtile) //如果上一个tile编号和这一个不一样
        		{
        			ranges[prevtile].y = idx; //记录range
        			ranges[currtile].x = idx; //记录range
        		}
        	}
        	if (idx == L - 1)
        		ranges[currtile].y = L; //数组中最后一个不一定和上一个tile编号不一样,直接赋L
        }
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11
        12
        13
        14
        15
        16
        17
        18
        19
        20
        21
        22
        23
        24
        25
        26

        # 深入CudaRasterizer::Rasterizer::forward(四)渲染过程FORWARD::render

        	// Let each tile blend its range of Gaussians independently in parallel
        	const float* feature_ptr = colors_precomp != nullptr ? colors_precomp : geomState.rgb;
        	CHECK_CUDA(FORWARD::render(
        		tile_grid, block, //tile_grid个tile,每个tile有block个key要渲染
        		imgState.ranges,
        		binningState.point_list,
        		width, height,
        		geomState.means2D,
        		feature_ptr,
        		geomState.conic_opacity,
        		imgState.accum_alpha,
        		imgState.n_contrib,
        		background,
        		out_color), debug)
        
        	return num_rendered;
        }
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11
        12
        13
        14
        15
        16
        17

        FORWARD::render对grid个tile各启动block(16x16x1)个线程,即每个像素一个线程运行renderCUDA:

        void FORWARD::render(
        	const dim3 grid, dim3 block,
        	const uint2* ranges,
        	const uint32_t* point_list,
        	int W, int H,
        	const float2* means2D,
        	const float* colors,
        	const float4* conic_opacity,
        	float* final_T,
        	uint32_t* n_contrib,
        	const float* bg_color,
        	float* out_color)
        {
        	renderCUDA<NUM_CHANNELS> << <grid, block >> > (
        		ranges,
        		point_list,
        		W, H,
        		means2D,
        		colors,
        		conic_opacity,
        		final_T,
        		n_contrib,
        		bg_color,
        		out_color);
        }
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11
        12
        13
        14
        15
        16
        17
        18
        19
        20
        21
        22
        23
        24
        25

        # 并行渲染函数renderCUDA

        每个像素一个renderCUDA线程:

        // Main rasterization method. Collaboratively works on one tile per
        // block, each thread treats one pixel. Alternates between fetching 
        // and rasterizing data.
        template <uint32_t CHANNELS>
        __global__ void __launch_bounds__(BLOCK_X * BLOCK_Y)
        renderCUDA(
        	const uint2* __restrict__ ranges,
        	const uint32_t* __restrict__ point_list,
        	int W, int H,
        	const float2* __restrict__ points_xy_image,
        	const float* __restrict__ features,
        	const float4* __restrict__ conic_opacity,
        	float* __restrict__ final_T,
        	uint32_t* __restrict__ n_contrib,
        	const float* __restrict__ bg_color,
        	float* __restrict__ out_color)
        {
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11
        12
        13
        14
        15
        16
        17

        # 获取当前tile对应的像素range和thread对应的像素坐标

        	// Identify current tile and associated min/max pixel range.
        	auto block = cg::this_thread_block();
        	uint32_t horizontal_blocks = (W + BLOCK_X - 1) / BLOCK_X;
        	uint2 pix_min = { block.group_index().x * BLOCK_X, block.group_index().y * BLOCK_Y }; //该tile覆盖范围的像素坐标
        	uint2 pix_max = { min(pix_min.x + BLOCK_X, W), min(pix_min.y + BLOCK_Y , H) }; //该tile覆盖范围的像素坐标
        	uint2 pix = { pix_min.x + block.thread_index().x, pix_min.y + block.thread_index().y }; //该thread需要处理的像素的坐标
        	uint32_t pix_id = W * pix.y + pix.x; //该thread对应的像素编号
        	float2 pixf = { (float)pix.x, (float)pix.y };
        
        	// Check if this thread is associated with a valid pixel or outside.
        	bool inside = pix.x < W&& pix.y < H; //是否超出图像区域范围
        	// Done threads can help with fetching, but don't rasterize
        	bool done = !inside;
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11
        12
        13

        # 获取当前tile对应的像素range

        	// Load start/end range of IDs to process in bit sorted list.
        	uint2 range = ranges[block.group_index().y * horizontal_blocks + block.group_index().x]; //当前tile在point_list_keys中的范围,包括x下界和y上界
        	const int rounds = ((range.y - range.x + BLOCK_SIZE - 1) / BLOCK_SIZE); //当前tile在point_list_keys中的range大小除BLOCK_SIZE,也即把当前线程即要处理的高斯球每BLOCK_SIZE个分一个batch,之后for循环里是一个batch一个batch的处理
        	int toDo = range.y - range.x; //当前线程即要处理的高斯球数量
        
        1
        2
        3
        4

        # 渲染前的初始化

        初始化block内部用的__shared__数组和thread里用的一下变量:

        	// Allocate storage for batches of collectively fetched data.
        	__shared__ int collected_id[BLOCK_SIZE]; //存储block内各thread处理的高斯球的编号
        	__shared__ float2 collected_xy[BLOCK_SIZE]; //存储block内各thread处理的高斯球中心在2D平面的投影坐标
        	__shared__ float4 collected_conic_opacity[BLOCK_SIZE]; //存储block内各thread处理的高斯球的2D协方差逆矩阵和不透明度
        
        	// Initialize helper variables
        	float T = 1.0f; //透射率
        	uint32_t contributor = 0; //计算经过了多少高斯
        	uint32_t last_contributor = 0; //存储最终经过的高斯球数量
        	float C[CHANNELS] = { 0 }; //最后渲染的颜色
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10

        batch大小为什么等于block中的thread数量BLOCK_SIZE?简言之就是复用BLOCK_SIZE个thread分批读取数据。

        block(tile)里的每个thread(像素)都要对这个tile中的所有高斯球进行渲染,所以一个thread需要访问上面的collected_*数组里的每一项。 这里最速度的情况是一次把这个tile中的所有高斯球都读进来渲染,collected_*数组长度设置为高斯球的数量。 当然,这种方法占内存太高,本文采用的方法是利用thread进行并行读取,所以才把collected_*数组长度设置为BLOCK_SIZE,这样就是充分利用block中的BLOCK_SIZE个thread,读一批处理一批。

        # for循环处理该像素的每个高斯球

        	// Iterate over batches until all done or range is complete
        	for (int i = 0; i < rounds; i++, toDo -= BLOCK_SIZE)
        	{
        
        1
        2
        3

        这个for循环分两个部分,首先BLOCK_SIZE个thread并行读取一批BLOCK_SIZE个高斯球进collected_*数组里:

        		// End if entire block votes that it is done rasterizing
        		int num_done = __syncthreads_count(done); //计算一个block里done为true的数量
        		if (num_done == BLOCK_SIZE) //如果done有BLOCK_SIZE个true,则表明所有批次的高斯球全部处理完成,可以退出
        			break; //如果一个block里面thread全部完成,则退出循环
        
        		// Collectively fetch per-Gaussian data from global to shared
        		int progress = i * BLOCK_SIZE + block.thread_rank(); //thread_rank是当前线程在组内的标号,区间为[0, num_threads);progress是当前批次应该从第几个高斯球开始读
        		if (range.x + progress < range.y) //如果当前线程有效,即处理的高斯球不越界
        		{ //读取一批BLOCK_SIZE个高斯球
        			int coll_id = point_list[range.x + progress]; //point_list表示与已排序的point_list_keys对应的高斯球编号,coll_id为当前线程处理的高斯球编号
        			collected_id[block.thread_rank()] = coll_id; //读取待处理的高斯球id
        			collected_xy[block.thread_rank()] = points_xy_image[coll_id]; //读取待处理的高斯球中心在像素平面的坐标
        			collected_conic_opacity[block.thread_rank()] = conic_opacity[coll_id]; //读取待处理的高斯球2D协方差逆矩阵和不透明度
        		}
        		block.sync(); //同步,确保读取全部完成
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11
        12
        13
        14
        15

        然后每个thread各自处理读进来的BLOCK_SIZE个高斯球:

        		// Iterate over current batch
        		for (int j = 0; !done && j < min(BLOCK_SIZE, toDo); j++) //block(tile)里的每个thread(像素)都要对这个tile中的所有高斯球进行渲染
        		{
        			// Keep track of current position in range
        			contributor++;
        
        1
        2
        3
        4
        5

        # 求高斯函数值G(x)G(\bm x)G(x)中的指数部分

        求高斯函数值G(x)G(\bm x)G(x)公式来自3DGS原论文:

        G(x)=e−12xTΣ−1xG(\bm x)=e^{-\frac{1}{2}\bm x^T\Sigma^{-1}\bm x} G(x)=e−21​xTΣ−1x

        其中的指数部分展开一下就是:

        −12xTΣ−1x=−12[ΔxΔy][ABBC][ΔxΔy]=−12AΔx2−BΔxΔy−12CΔy2\begin{aligned} -\frac{1}{2}\bm x^T\Sigma^{-1}\bm x&=-\frac{1}{2}\left[\begin{matrix}\Delta x&\Delta y\end{matrix}\right]\left[\begin{matrix}A&B\\B&C\end{matrix}\right]\left[\begin{matrix}\Delta x\\\Delta y\end{matrix}\right]\\ &=-\frac{1}{2}A\Delta x^2 -B\Delta x \Delta y -\frac{1}{2}C\Delta y^2 \end{aligned}−21​xTΣ−1x​=−21​[Δx​Δy​][AB​BC​][ΔxΔy​]=−21​AΔx2−BΔxΔy−21​CΔy2​

        其中,(Δx,Δy)T=(xi,yi)T−(xpixel,ypixel)T(\Delta x, \Delta y)^T=(x_i, y_i)^T-(x_{\text{pixel}}, y_{\text{pixel}})^T(Δx,Δy)T=(xi​,yi​)T−(xpixel​,ypixel​)T为高斯球在成像平面上投影的高斯椭圆中心(xi,yi)T(x_i, y_i)^T(xi​,yi​)T到当前像素位置(xpixel,ypixel)T(x_{\text{pixel}}, y_{\text{pixel}})^T(xpixel​,ypixel​)T的向量。

        对应代码中的float power = -0.5f * (con_o.x * d.x * d.x + con_o.z * d.y * d.y) - con_o.y * d.x * d.y;:

        			// Resample using conic matrix (cf. "Surface 
        			// Splatting" by Zwicker et al., 2001)
        			float2 xy = collected_xy[j]; //高斯球中心在像素平面的坐标
        			float2 d = { xy.x - pixf.x, xy.y - pixf.y }; //当前像素和高斯球中心的相对位置
        			float4 con_o = collected_conic_opacity[j]; //高斯球2D协方差逆矩阵和不透明度
        			float power = -0.5f * (con_o.x * d.x * d.x + con_o.z * d.y * d.y) - con_o.y * d.x * d.y; //通过相对位置、高斯球2D协方差逆矩阵计算高斯分布的指数部分
        			if (power > 0.0f)
        				continue;
        
        1
        2
        3
        4
        5
        6
        7
        8

        # alpha-blending

        alpha-blending算法计算的当前像素颜色CCC由高斯球iii的颜色CiC_iCi​和透明度参数Ai\Alpha_iAi​按深度顺序由近及远混合而成,并最后加上背景颜色CbgC_{bg}Cbg​:

        αk=AkG(xk)Tj=∏k=1j−1(1−αk)C=TN+1⋅Cbg+∑i=1NTi⋅αi⋅Ci\begin{aligned} \alpha_k&=\Alpha_kG(\bm x_k)\\ T_j&=\prod_{k=1}^{j-1} (1 - \alpha_k)\\ C&=T_{N+1}\cdot C_{bg}+\sum_{i=1}^{N}T_{i}\cdot\alpha_{i}\cdot C_{i} \end{aligned}αk​Tj​C​=Ak​G(xk​)=k=1∏j−1​(1−αk​)=TN+1​⋅Cbg​+i=1∑N​Ti​⋅αi​⋅Ci​​

        对应代码中的float alpha = min(0.99f, con_o.w * exp(power));、float test_T = T * (1 - alpha)、C[ch] += features[collected_id[j] * CHANNELS + ch] * alpha * T;、out_color[ch * H * W + pix_id] = C[ch] + T * bg_color[ch];:

        			// Eq. (2) from 3D Gaussian splatting paper.
        			// Obtain alpha by multiplying with Gaussian opacity
        			// and its exponential falloff from mean.
        			// Avoid numerical instabilities (see paper appendix). 
        			float alpha = min(0.99f, con_o.w * exp(power)); //通过高斯分布的指数部分和不透明度计算alpha值
        			if (alpha < 1.0f / 255.0f) //alpha值太低的高斯球不渲染
        				continue;
        			float test_T = T * (1 - alpha); //由alpha值计算透射率
        			if (test_T < 0.0001f) //透射率太低表明前几个高斯球已经遮住了这个像素,则该像素渲染结束
        			{
        				done = true;
        				continue;
        			}
        
        			// Eq. (3) from 3D Gaussian splatting paper.
        			for (int ch = 0; ch < CHANNELS; ch++)
        				C[ch] += features[collected_id[j] * CHANNELS + ch] * alpha * T; //alpha-blending混合颜色
        
        			T = test_T; //记录透射率
        
        			// Keep track of last range entry to update this
        			// pixel.
        			last_contributor = contributor;
        		}
        	}
        
        	// All threads that treat valid pixel write out their final
        	// rendering data to the frame and auxiliary buffers.
        	if (inside)
        	{
        		final_T[pix_id] = T; //记录最终的透射率
        		n_contrib[pix_id] = last_contributor; //记录经过几个高斯球才被遮挡
        		for (int ch = 0; ch < CHANNELS; ch++)
        			out_color[ch * H * W + pix_id] = C[ch] + T * bg_color[ch]; //记录颜色
        	}
        }
        
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10
        11
        12
        13
        14
        15
        16
        17
        18
        19
        20
        21
        22
        23
        24
        25
        26
        27
        28
        29
        30
        31
        32
        33
        34
        35
        36
        帮助我们改善此页面!
        创建于: 2024-10-09 21:55:30

        更新于: 2024-11-24 18:43:09