nv-tlabs · moennen · Mar 24, 2026 · Apr 21, 2026
diff --git a/.gitignore b/.gitignore
@@ -12,6 +12,7 @@ outputs/
 extra_info/
 eval/
 extra_info/
+plan/
 
 debug_**
 
@@ -24,6 +25,7 @@ thirdparty/kaolin/
 
 threedgrt_tracer/.ninja_log
 threedgrt_tracer/include/3dgrt/kernels/slang/*.cuh*
+threedgut_tracer/include/threedgutSlang.cuh
 *.egg-info
 .idea
 

diff --git a/TODO.md b/TODO.md
@@ -0,0 +1,23 @@
+# TODO
+
+## 3DGRT: half-precision particle features
+
+`conf.render.particle_feature_half` is compiled into the 3DGRT kernel via `-DPARTICLE_FEATURE_HALF`
+but the Python-side cast is missing. In `threedgrt_tracer/tracer.py`, `gaussians.get_features()`
+must be cast to `.half()` before being passed to `_Autograd.apply` when the flag is set,
+matching what 3DGUT already does.
+
+See the `TODO` comment in `threedgrt_tracer/tracer.py`.
+
+## 3DGRT: NHT support in CUDA path (`gaussianParticles.cuh`)
+
+The NHT feature transform (`FEATURE_TRANSFORM_TYPE=1`) is implemented for the Slang path
+(`gaussianParticles.slang`) but not yet in the CUDA path (`gaussianParticles.cuh`).
+Full NHT support in 3DGRT requires extending `gaussianParticles.cuh` with the NHT
+interpolation and activation logic currently only present in the Slang kernel.
+
+## 3DGUT: refactor `evalBackwardNoKBuffer` to share path with k-buffer backward
+
+`evalBackwardNoKBuffer` (`gutKBufferRenderer.cuh`) duplicates logic from the k-buffer backward
+path. The two should be unified into a shared implementation to reduce code duplication and
+ensure future fixes apply to both.
diff --git a/TODO_half_3dgrt.md b/TODO_half_3dgrt.md
@@ -0,0 +1,144 @@
+# 3DGRT half-precision feature support
+
+Goal: make `conf.render.particle_feature_half=true` and `conf.render.feature_output_half=true`
+work end-to-end in `threedgrt_tracer`, matching the behavior already implemented in
+`threedgut_tracer`. Gradient buffers remain fp32 on both paths.
+
+Semantics (mirroring 3dgut):
+- `particle_feature_half=true`: storage for `particleRadiance` (per-particle feature buffer)
+  is fp16. Slang entry points already expect `feat_elem_t*` (`__half*` when the macro is set).
+  Gradient `particleRadianceGrad` stays fp32.
+- `feature_output_half=true`: storage for the per-ray integrated feature buffer (`rayRadiance`)
+  is fp16. Gradient `rayRadianceGrad` stays fp32. Tracer `.forward()` casts fp16 back to fp32
+  before returning, mirroring 3dgut.
+
+## Scope
+
+Files to touch (by layer):
+
+- C++ pipeline type layer
+  - `threedgrt_tracer/include/3dgrt/pipelineParameters.h`
+    Introduce `TFeatureDensityElem` (output/ray feature) and `TParticleFeatureElem` (particle
+    storage) typedefs, guarded on the two macros. Change `particleRadiance` from `const float*`
+    to `const TParticleFeatureElem*`, and `rayRadiance` from
+    `PackedTensorAccessor32<float, 4>` to `PackedTensorAccessor32<TFeatureDensityElem, 4>`.
+    `particleRadianceGrad` and `rayRadianceGrad` stay fp32.
+- OptiX raygen kernels
+  - `threedgrt_tracer/src/kernels/cuda/referenceSlangOptix.cu`
+  - `threedgrt_tracer/src/kernels/cuda/referenceSlangBwdOptix.cu`
+    1. Replace `const_cast<float*>(params.particleRadiance)` with
+       `const_cast<TParticleFeatureElem*>(params.particleRadiance)`.
+    2. FWD write to `rayRadiance`: wrap with `__float2half` when `FEATURE_OUTPUT_HALF`.
+    3. BWD read from `rayRadiance`: wrap with `__half2float` when `FEATURE_OUTPUT_HALF`.
+       `rayRadianceGrad` reads stay fp32.
+- Host launcher
+  - `threedgrt_tracer/src/optixTracer.cpp`
+    1. `trace()`: allocate `rayRad` with `torch::kHalf` when `FEATURE_OUTPUT_HALF=1`,
+       build `packed_accessor32<TFeatureDensityElem, 4>(rayRad)`, and
+       `getPtr<const TParticleFeatureElem>(particleRadiance)`.
+    2. `traceBwd()`: same dtype for the forward `rayRad` input; the grad tensors remain fp32.
+- Python tracer
+  - `threedgrt_tracer/tracer.py`
+    1. Cast `gaussians.get_features()` to `.half()` when `conf.render.particle_feature_half`.
+    2. Keep `ray_features.float()` return to caller; `ray_features` saved in ctx may be fp16
+       when `feature_output_half=true` (already saves the raw output, consistent with 3dgut).
+
+No changes required in Slang `.slang` or generated `.cuh`: the generalization already landed
+and compiles correctly once `SLANG_CUDA_ENABLE_HALF=1` is set (done).
+
+## Task breakdown
+
+Each task is independently reviewable and testable (run `validate.py` for the relevant flag
+combinations after each task).
+
+### T1 — Introduce typedefs in `pipelineParameters.h`
+- Add `TFeatureDensityElem` and `TParticleFeatureElem` (guarded by `FEATURE_OUTPUT_HALF` and
+  `PARTICLE_FEATURE_HALF`), include `cuda_fp16.h` when either is set.
+- Change `particleRadiance` to `const TParticleFeatureElem*` and `rayRadiance` accessor to
+  `PackedTensorAccessor32<TFeatureDensityElem, 4>`.
+- No functional change when both macros are 0 (typedefs resolve to `float`).
+- Test: build with both flags false (current default) → no-op rebuild; CI NeRF-Synthetic 3dgrt
+  smoke test still passes.
+
+### T2 — Update OptiX kernels for fp16 reads/writes
+- Apply the `__float2half` / `__half2float` wrappers in `referenceSlangOptix.cu` and
+  `referenceSlangBwdOptix.cu` under `FEATURE_OUTPUT_HALF`.
+- Update `const_cast` sites to `TParticleFeatureElem*`.
+- Test: build with both flags false → identical numerical output to baseline (no wrappers
+  compiled in).
+
+### T3 — Host buffer allocation and accessor typing
+- `optixTracer.cpp`: select dtype `kHalf` vs `kFloat32` for `rayRad`; use
+  `packed_accessor32<TFeatureDensityElem, 4>(rayRad)`.
+- `getPtr<const TParticleFeatureElem>(particleRadiance)` for the particle buffer.
+- Test: with flags false → unchanged; add a runtime assert in DEBUG builds that the tensor
+  dtype matches the typedef via `TORCH_CHECK(rayRad.scalar_type() == ...)`.
+
+### T4 — Python cast for `particle_feature_half`
+- `tracer.py`: mirror 3dgut's conditional `.half()` cast on `gaussians.get_features()`.
+- Test: flags false → unchanged.
+
+### T5 — End-to-end validation with flags enabled
+- Run `validate.py` with `render.particle_feature_half=true render.feature_output_half=true`
+  using an existing NHT config (e.g. `nerf_synthetic_3dgrt_mcmc_nht.yaml`).
+- Compare PSNR after N iterations against the fp32 baseline — expected within 0.1 dB.
+- Gradients: single backward pass on a fixed seed; check that
+  `particleRadianceGrad` and `rayRadianceGrad` are finite and within tolerance of the
+  fp32 reference.
+
+### T6 — Rename `*Radiance*` → `*Features*` in 3dgrt
+Naming cleanup to align with the post-SH NHT feature abstraction. The legacy `Radiance`
+suffix comes from the SH-only era; the buffers now carry arbitrary per-particle / per-ray
+features. Purely mechanical rename, no behavioral change. Runs AFTER T1–T5 land so we are
+not also chasing name drift during the fp16 functional work.
+
+Rename mapping (all scopes):
+- `PipelineParameters::particleRadiance`       → `particleFeatures`
+- `PipelineParameters::rayRadiance`            → `rayFeatures`
+- `PipelineBackwardParameters::particleRadianceGrad` → `particleFeaturesGrad`
+- `PipelineBackwardParameters::rayRadianceGrad`      → `rayFeaturesGrad`
+- `OptixTracer::trace(..., torch::Tensor particleRadiance, ...)` arg                  → `particleFeatures`
+- `OptixTracer::traceBwd(..., torch::Tensor particleRadiance, rayRad, rayRadGrd, ...)` args
+  → `particleFeatures`, `rayFeat`, `rayFeatGrd` (local tensors + Python side kwargs).
+- `particleRadianceGrad` local in `optixTracer.cpp::traceBwd` → `particleFeaturesGrad`.
+- Python: `tracer.py` local variables `ray_features` / `ray_features_grd` are already
+  feature-named; cross-check that the pybind11 binding signature in `bindings.cpp` uses
+  the new C++ arg names.
+
+Out of scope for T6 (per the resolved decisions below):
+- `particleRadianceSphDegree` C++ field and `conf.render.particle_radiance_sph_degree` YAML.
+- `shRadiativeParticles.slang` filename and internal `shRadiance*` identifiers (SH path).
+- Any `*Radiance*` identifiers that only exist on the SH-specific code path.
+
+Test:
+- Build + full `validate.py` run with fp32 flags (both false) → identical numerical
+  output to pre-T6 baseline (bit-identical expected since only identifier renames).
+- Build + `validate.py` with fp16 flags (both true) → identical output to T5 result.
+
+## Tests to write up-front
+
+- `tests/test_3dgrt_half_flags.py` (new, small)
+  - Parametrize over `(particle_feature_half, feature_output_half) ∈ {(F,F),(T,F),(F,T),(T,T)}`.
+  - Forward only, single frame, fixed scene; compare `pred_features.float()` to the (F,F)
+    baseline with `atol=5e-3, rtol=1e-2`.
+  - Forward + backward; compare `mog_sph.grad` to the (F,F) baseline at the same tolerance.
+
+## Decisions (resolved with user)
+
+1. T5 validation ownership: user runs validation; the plan only needs to keep the hooks in
+   place (no tolerance tuning required from the implementer).
+2. Gradient buffers stay fp32 end-to-end (no half-grad path).
+3. T6 rename scope is restricted to identifiers naming buffers that can carry NHT features
+   (i.e. the per-particle feature storage and per-ray integrated feature output, plus their
+   fp32 gradients). Scalars and SH-specific paths are NOT renamed:
+     - keep `particleRadianceSphDegree` (C++ field) and `conf.render.particle_radiance_sph_degree`
+       (YAML) — scalar, shared with the SH path
+     - keep `shRadiativeParticles.slang` filename and its internal `shRadiance*` identifiers —
+       SH-only code path.
+4. T6 runs AFTER T1–T5.
+
+## Non-goals
+
+- No changes to CUDA fallback path (`gaussianParticles.cuh`) — per the existing TODO that is
+  a separate workstream.
+- No changes to `threedgrt_playground`.
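Reviewer sketch (not part of the patch): the tolerance check planned for `tests/test_3dgrt_half_flags.py` in `TODO_half_3dgrt.md` can be illustrated without a GPU. The snippet below is a toy stand-in, assuming a scalar back-to-front blend in place of the real tracer; `to_half`, `is_close`, and `integrate` are hypothetical helpers, and only the flag combinations and the `atol=5e-3, rtol=1e-2` tolerances come from the plan.

```python
import struct

def to_half(x):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def is_close(actual, expected, atol=5e-3, rtol=1e-2):
    """Same acceptance rule as the plan: |a - e| <= atol + rtol * |e|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)

# The four (particle_feature_half, feature_output_half) combinations.
FLAG_COMBOS = [(p, o) for p in (False, True) for o in (False, True)]

def integrate(features, alphas, particle_half, output_half):
    """Toy scalar renderer (assumption): blend features back to front."""
    acc = 0.0
    for f, a in zip(features, alphas):
        if particle_half:
            f = to_half(f)          # fp16 particle storage
        acc = (1.0 - a) * acc + a * f
    if output_half:
        acc = to_half(acc)          # fp16 per-ray output storage
    return float(acc)               # caller always sees fp32-like values

features = [0.137, 0.52, 0.891, 0.304]
alphas = [0.4, 0.25, 0.6, 0.1]
baseline = integrate(features, alphas, False, False)
results = {c: integrate(features, alphas, *c) for c in FLAG_COMBOS}
all_within_tolerance = all(is_close(v, baseline) for v in results.values())
```

The real test would replace `integrate` with a forward (and backward) call into the tracer and compare tensors rather than scalars.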
diff --git a/TODO_nht_cuda.md b/TODO_nht_cuda.md
@@ -0,0 +1,73 @@
+# Handwritten CUDA port of `featuresIntegrateBwdToLocalGrad` (NHT path)
+
+## Status
+- [x] **T1** — Tetrahedron constants (`tetraV0`, `tetraN0..N3`) placed in
+      `nht_detail` namespace at the top of `shRadiativeGaussianParticles.cuh`.
+      Values derived from Slang's vertex ordering; verified via script
+      (w_k == 1 at v_k, 0 at other vertices).
+- [x] **T2** — Method body replaced, gated by `#if NHT_FEATURES_BWD_LOCAL_GRAD_CUDA`.
+      Default = `1` (native CUDA). Flip to `0` in
+      `threedgut_tracer/include/3dgut/kernels/cuda/models/shRadiativeGaussianParticles.cuh`
+      to restore the Slang-autodiff path (kept unchanged in the `#else` branch).
+- [ ] **T3** — Rebuild, run `validate.py` (or one training step) with the macro
+      at 1 vs 0. Compare:
+      - feature gradient buffer L2  (primary parity check)
+      - density / position gradients (sanity; should be identical since we don't
+        touch those paths)
+      - `renderBackward` ms in nsys.
+- [ ] **T4** — If parity holds, keep default = 1. Otherwise flip to 0 and iterate.
+
+## What the handwritten CUDA does (semantics to match Slang exactly)
+
+Replicates the sequence inside Slang's `particleFeaturesIntegrateBwdToBuffer`
+called with `exclusiveGradient=true` and the shifted `featureLocalGrad` buffer:
+
+1. Early-out when `alpha <= 0`.
+2. Recover pre-hit accumulator:
+   `acc_prev[i] = (integratedFeatures[i] - features[i]*alpha) / (1-alpha)`.
+3. VJP of back-to-front `y_i = (1-alpha)*acc_prev_i + alpha*f_i` against
+   incoming `dy = integratedFeaturesGrad`:
+   - `dFeatures[i] = alpha * dy_i`
+   - `alphaGrad += sum_i (features[i] - acc_prev[i]) * dy_i`
+   - `integratedFeaturesGrad[i] = (1-alpha) * dy_i`  (new accumulator grad)
+4. Barycentric weights `w[0..3]` from `canonicalIntersection` (Cramer form
+   matching Slang, precomputed `N_k` face normals).
+5. Load 4 vertex feature blocks × `InterpPointFeatureDim` once
+   (`__half2float` when `PARTICLE_FEATURE_HALF=1`).
+6. Activation backward → `dBase[InterpPointFeatureDim]`:
+   | Activation | Forward | Backward |
+   |---|---|---|
+   | None (0) | `out = base` | `dBase = dFeatures` |
+   | Siren (1) | `sin(base * 2^f)` | `dBase += cos(base*freq) * freq * dOut` |
+   | Sincos (2) | `sin + cos` | `dBase += (cos - sin) * freq * dOut` |
+   | Relu (3) | `max(0, base)` | `dBase = (features[i] > 0) ? dFeatures[i] : 0` |
+7. Barycentric backward:
+   - `featureLocalGrad[k*IPFD + i] += w[k] * dBase[i]`  (matches Slang's `+=` with exclusiveGradient=true)
+   - `canonicalIntersectionGrad += sum_k (sum_i vert[k][i] * dBase[i]) * N_k`
+
+## Guardrails
+- `static_assert(FeatureTransformType == 1)` — NHT-only.
+- `static_assert(FEATURE_INTERPOLATION_TYPE == 0)` — barycentric only.
+- `static_assert(FEATURE_INTERPOLATION_SUPPORT == 1)` — tetrahedra only.
+- `static_assert` on `RAY_FEATURE_DIM` / `INTERP_POINT_FEATURE_DIM` / activation consistency.
+- `static_assert(4 * IPFD == ParticleFeatureDim)` — buffer layout.
+
+Any unsupported config fails at compile time — fallback is to flip the macro to 0.
+
+## Confidence
+
+- **Forward parity** (interpolation + integration, current config `activation=relu`):
+  high (see comparison with `neural-harmonic-textures/Interpolation.cuh` — same
+  tetrahedron geometry, different indexing; same integration math).
+- **Backward numerical parity**: medium-high. The Relu path is trivial. The
+  (1-α)/α lerp VJP + barycentric VJP is standard. Main risk is a sign or
+  vertex-index swap — covered by T3 gradient diff.
+- **Perf win**: medium-high. Expected 3–5× on this single kernel.
+
+## Open reference points
+
+- Slang source:    `threedgut_tracer/include/3dgut/kernels/slang/models/neuralHarmonicFeaturesParticle.slang`
+- External CUDA ref: `/nv/dev/neural-harmonic-textures/gsplat/gsplat/cuda/csrc/RasterizeToPixelsFromWorldNHT3DGSBwd.cu`
+  (sincos activation; do NOT copy the activation bwd verbatim — see
+  "Caveats" in the forward-parity discussion: Slang's sincos sums into one
+  channel, ref's keeps them separate).
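Reviewer sketch (not part of the patch): steps 1-3 and the activation table in `TODO_nht_cuda.md` can be checked against finite differences before the T3 GPU parity run. This is a minimal pure-Python model, assuming `freq` is the per-channel frequency factor, that Sincos means `sin(base*freq) + cos(base*freq)`, and that the early-out leaves the incoming gradient untouched; the function names are hypothetical and do not exist in the repository.

```python
import math

def blend(acc_prev, f, alpha):
    """Forward: y_i = (1 - alpha) * acc_prev_i + alpha * f_i."""
    return [(1.0 - alpha) * a + alpha * x for a, x in zip(acc_prev, f)]

def blend_bwd(y, f, alpha, dy):
    """Steps 1-3: returns (dFeatures, alphaGrad, new accumulator grad)."""
    if alpha <= 0.0:
        # Early-out: nothing was blended, gradient passes through (assumed).
        return [0.0] * len(f), 0.0, list(dy)
    # Step 2: recover the pre-hit accumulator from the post-hit value.
    acc_prev = [(yi - fi * alpha) / (1.0 - alpha) for yi, fi in zip(y, f)]
    # Step 3: vector-Jacobian product of the blend.
    d_f = [alpha * g for g in dy]
    d_alpha = sum((fi - ai) * g for fi, ai, g in zip(f, acc_prev, dy))
    d_acc = [(1.0 - alpha) * g for g in dy]
    return d_f, d_alpha, d_acc

def act_fwd(kind, base, freq):
    """Forward column of the activation table (scalar version)."""
    if kind == "none":
        return base
    if kind == "siren":
        return math.sin(base * freq)
    if kind == "sincos":
        return math.sin(base * freq) + math.cos(base * freq)
    if kind == "relu":
        return max(0.0, base)
    raise ValueError(kind)

def act_bwd(kind, base, freq, d_out):
    """Backward column of the activation table (scalar version)."""
    if kind == "none":
        return d_out
    if kind == "siren":
        return math.cos(base * freq) * freq * d_out
    if kind == "sincos":
        return (math.cos(base * freq) - math.sin(base * freq)) * freq * d_out
    if kind == "relu":
        return d_out if base > 0.0 else 0.0
    raise ValueError(kind)

def numeric_grad(fn, x, eps=1e-6):
    """Central finite difference of a scalar function."""
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)

acc_prev = [0.2, 0.5, 0.9]
f = [0.7, 0.1, 0.4]
dy = [1.0, -2.0, 0.5]
alpha = 0.3
y = blend(acc_prev, f, alpha)
d_f, d_alpha, d_acc = blend_bwd(y, f, alpha, dy)
# Scalar loss L(alpha) = sum_i dy_i * y_i, differentiated numerically.
d_alpha_numeric = numeric_grad(
    lambda a: sum(g * v for g, v in zip(dy, blend(acc_prev, f, a))), alpha)
```

A sign or index swap of the kind flagged under "Confidence" would show up as a mismatch between `d_alpha` and `d_alpha_numeric`, or between `act_bwd` and the finite difference of `act_fwd`.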
diff --git a/configs/apps/colmap_3dgrt_mcmc_nht.yaml b/configs/apps/colmap_3dgrt_mcmc_nht.yaml
@@ -0,0 +1,23 @@
+# @package _global_
+# NHT (Neural Harmonic Textures) variant for colmap datasets with 3DGRT and MCMC
+
+defaults:
+  - /base_mcmc
+  - /dataset: colmap
+  - /initialization: colmap
+  - /render: 3dgrt
+  - _self_
+
+model:
+  feature_type: "nht"
+
+render:
+  pipeline_type: referenceSlang
+  backward_pipeline_type: referenceSlangBwd
+  particle_kernel_max_alpha: 0.999
+
+loss:
+  use_opacity: true
+  lambda_opacity: 0.02
+  use_scale: true
+  lambda_scale: 0.005
diff --git a/configs/apps/colmap_3dgut_mcmc_nht.yaml b/configs/apps/colmap_3dgut_mcmc_nht.yaml
@@ -0,0 +1,21 @@
+# @package _global_
+# NHT (Neural Harmonic Textures) variant for colmap datasets with 3DGUT and MCMC
+
+defaults:
+  - /base_mcmc
+  - /dataset: colmap
+  - /initialization: colmap
+  - /render: 3dgut
+  - _self_
+
+model:
+  feature_type: "nht"
+
+render:
+  particle_kernel_max_alpha: 0.999
+
+loss:
+  use_opacity: true
+  lambda_opacity: 0.02
+  use_scale: true
+  lambda_scale: 0.005
diff --git a/configs/apps/nerf_synthetic_3dgrt_mcmc_nht.yaml b/configs/apps/nerf_synthetic_3dgrt_mcmc_nht.yaml
@@ -0,0 +1,23 @@
+# @package _global_
+# NHT (Neural Harmonic Textures) variant for nerf_synthetic with 3DGRT and MCMC
+
+defaults:
+  - /base_mcmc
+  - /dataset: nerf
+  - /initialization: random
+  - /render: 3dgrt
+  - _self_
+
+model:
+  feature_type: "nht"
+
+render:
+  pipeline_type: referenceSlang
+  backward_pipeline_type: referenceSlangBwd
+  particle_kernel_max_alpha: 0.999
+
+loss:
+  use_opacity: true
+  lambda_opacity: 0.02
+  use_scale: true
+  lambda_scale: 0.005
diff --git a/configs/apps/nerf_synthetic_3dgut_mcmc_nht.yaml b/configs/apps/nerf_synthetic_3dgut_mcmc_nht.yaml
@@ -0,0 +1,21 @@
+# @package _global_
+# NHT (Neural Harmonic Textures) variant for nerf_synthetic with 3DGUT and MCMC
+
+defaults:
+  - /base_mcmc
+  - /dataset: nerf
+  - /initialization: random
+  - /render: 3dgut
+  - _self_
+
+model:
+  feature_type: "nht"
+
+render:
+  particle_kernel_max_alpha: 0.999
+
+loss:
+  use_opacity: true
+  lambda_opacity: 0.02
+  use_scale: true
+  lambda_scale: 0.005