Skip to content

Make full_thread_state a shared pointer#3966

Open
abellina wants to merge 7 commits into
NVIDIA:mainfrom
abellina:shared_ptr_full_thread_state
Open

Make full_thread_state a shared pointer#3966
abellina wants to merge 7 commits into
NVIDIA:mainfrom
abellina:shared_ptr_full_thread_state

Conversation

@abellina

Copy link
Copy Markdown
Collaborator

Contributes to #3905

This change is simply making full_thread_state a shared pointer, and updating all calls to use -> instead of ..

We do this so that when we pass full_thread_state around, we are incrementing the ref count in the pointer and even if the thread disappears, we are still able to hold a reference to the state object until we exit the function call.

This PR doesn't change behavior, but in the future we can do things like: get the thread state under a lock, unlock, return thread state and use that object without fear that it will be destroyed by another thread.

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
@abellina

Copy link
Copy Markdown
Collaborator Author

build

@greptile-apps

greptile-apps Bot commented Nov 18, 2025

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR wraps full_thread_state objects in std::shared_ptr in the threads map, changing std::map<long, full_thread_state> to std::map<long, std::shared_ptr<full_thread_state>>. All member accesses are mechanically updated from . to ->, and affected helper-function signatures are updated to accept std::shared_ptr<full_thread_state> instead of full_thread_state&.

  • The mechanical .-> conversion is applied consistently across all ~100+ access sites with no functional changes introduced.
  • Two previously-reported use-after-free bugs in block_thread_until_ready's wait loops are fixed: after_block() is now guarded by if (thread != threads.end()), preventing a dereference of threads.end() when the thread is removed while blocked.
  • The cudf submodule is bumped to a newer commit.

Confidence Score: 4/5

Safe to merge — the change is a mechanical accessor update with no functional behavior changes and two pre-existing use-after-free bugs correctly fixed.

The refactoring is consistent and complete across all ~100+ access sites. The two after_block() dereference-after-removal bugs are now properly guarded. The only remaining concerns are stylistic: helper function signatures dropped const-correctness and incur an unnecessary ref-count copy on each call to transition and checkpoint_metrics.

src/main/cpp/src/SparkResourceAdaptorJni.cpp — specifically the helper function signatures for transition, checkpoint_metrics, and is_thread_bufn_or_above.

Important Files Changed

Filename Overview
src/main/cpp/src/SparkResourceAdaptorJni.cpp Mechanical shared_ptr refactoring of full_thread_state; fixes two after_block() use-after-free bugs in wait loops; minor const-correctness regression on helper function signatures.
thirdparty/cudf Submodule pointer bump; no logic change in this repository.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant T as Calling Thread
    participant M as threads map (shared_ptr)
    participant S as full_thread_state (heap object)

    T->>M: threads.find(thread_id)
    M-->>T: "iterator → shared_ptr (refcount=1)"

    Note over T,S: Old code: map held full_thread_state by value. Removal destroyed the object immediately.

    T->>S: before_block() via shared_ptr
    T->>S: "wake_condition->wait(lock)"

    Note over M,S: Another thread calls remove_thread_association()
    M->>S: threads.erase(iterator) — refcount drops but object stays alive while iterator's copy exists

    T->>M: threads.find(thread_id) → end()
    Note over T,S: New code: guard prevents after_block() dereference on removed thread.
    T->>S: "if (thread != end()) after_block() — skipped safely"
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant T as Calling Thread
    participant M as threads map (shared_ptr)
    participant S as full_thread_state (heap object)

    T->>M: threads.find(thread_id)
    M-->>T: "iterator → shared_ptr (refcount=1)"

    Note over T,S: Old code: map held full_thread_state by value. Removal destroyed the object immediately.

    T->>S: before_block() via shared_ptr
    T->>S: "wake_condition->wait(lock)"

    Note over M,S: Another thread calls remove_thread_association()
    M->>S: threads.erase(iterator) — refcount drops but object stays alive while iterator's copy exists

    T->>M: threads.find(thread_id) → end()
    Note over T,S: New code: guard prevents after_block() dereference on removed thread.
    T->>S: "if (thread != end()) after_block() — skipped safely"
Loading

Reviews (1): Last reviewed commit: "Resolve thirparty/cudf ref" | Re-trigger Greptile

@greptile-apps greptile-apps Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 2 comments

Edit Code Review Agent Settings | Greptile
React with 👍 or 👎 to share your feedback on this new summary format

Comment thread src/main/cpp/src/SparkResourceAdaptorJni.cpp Outdated
Comment thread src/main/cpp/src/SparkResourceAdaptorJni.cpp Outdated
@abellina

Copy link
Copy Markdown
Collaborator Author

build

@greptile-apps greptile-apps Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, no comments

Edit Code Review Agent Settings | Greptile
React with 👍 or 👎 to share your feedback on this new summary format

@abellina

Copy link
Copy Markdown
Collaborator Author

build

@greptile-apps greptile-apps Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, no comments

Edit Code Review Agent Settings | Greptile
React with 👍 or 👎 to share your feedback on this new summary format

@revans2 revans2 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you measured the performance impact of this? shared_ptr has o do a lot more work than a normal pointer. Did you run into problems anywhere that needed this? Are you planning more changes that require it, like looping through a data structure and you can modify it as you loop?

@abellina

Copy link
Copy Markdown
Collaborator Author

Have you measured the performance impact of this? shared_ptr has o do a lot more work than a normal pointer. Did you run into problems anywhere that needed this? Are you planning more changes that require it, like looping through a data structure and you can modify it as you loop?

Main change I was looking to do was be able to take the lock to get state, and then return the state object as is without a lock. This implies I can add locks inside of the per-thread state object to protect its internal datastructures. But this is something I am planning for later, as moving state inside of full_thread_state is, well, not trivial and I don't have that done.

In terms of perf, I tested it with other patches and performance in the plugin under NDS never got worse. But, I can split this up and rework the patch that is coming after this to use raw pointers and then we can just measure adding it when absolutely necessary.

@abellina

Copy link
Copy Markdown
Collaborator Author

Lets punt on this, I don't have a great answer to the performance question on this change on its own, and it's not strictly necessary to what I am trying to accomplish right now.

@abellina abellina closed this Nov 18, 2025
@abellina

Copy link
Copy Markdown
Collaborator Author

never mind, I do need shared pointers. I am storing full_thread_state in different maps in one of my patches. I'll have to reopen, let me test this patch independently

@abellina abellina reopened this Nov 18, 2025

@greptile-apps greptile-apps Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, no comments

Edit Code Review Agent Settings | Greptile
React with 👍 or 👎 to share your feedback on this new summary format

@abellina

abellina commented Nov 19, 2025

Copy link
Copy Markdown
Collaborator Author

I ran NDS at sf3k, and I don't see significant changes. It's slightly faster at the end, but it's all in the noise:

Results: benchmark: Previous (242200.0 ms) vs Current (241000.0 ms) Diff 1200 E2E 1.00x
query1: Previous (1627.2 ms) vs Current (1670.2 ms) Diff -43 E2E 0.97x
query2: Previous (1647.4 ms) vs Current (1415.6 ms) Diff 231 E2E 1.16x
query3: Previous (381.4 ms) vs Current (442.8 ms) Diff -61 E2E 0.86x
query4: Previous (4595.2 ms) vs Current (4713.4 ms) Diff -118 E2E 0.97x
query5: Previous (2116.6 ms) vs Current (2190.8 ms) Diff -74 E2E 0.97x
query6: Previous (865.0 ms) vs Current (834.8 ms) Diff 30 E2E 1.04x
query7: Previous (2573.2 ms) vs Current (2749.8 ms) Diff -176 E2E 0.94x
query8: Previous (761.8 ms) vs Current (741.8 ms) Diff 20 E2E 1.03x
query9: Previous (1683.6 ms) vs Current (1709.4 ms) Diff -25 E2E 0.98x
query10: Previous (1528.2 ms) vs Current (1460.0 ms) Diff 68 E2E 1.05x
query11: Previous (2824.8 ms) vs Current (2946.2 ms) Diff -121 E2E 0.96x
query12: Previous (655.2 ms) vs Current (643.2 ms) Diff 12 E2E 1.02x
query13: Previous (1612.6 ms) vs Current (1565.8 ms) Diff 46 E2E 1.03x
query14_part1: Previous (5497.6 ms) vs Current (5151.6 ms) Diff 346 E2E 1.07x
query14_part2: Previous (4460.8 ms) vs Current (4506.4 ms) Diff -45 E2E 0.99x
query15: Previous (1049.6 ms) vs Current (1183.4 ms) Diff -133 E2E 0.89x
query16: Previous (3335.2 ms) vs Current (3326.8 ms) Diff 8 E2E 1.00x
query17: Previous (1649.8 ms) vs Current (1578.4 ms) Diff 71 E2E 1.05x
query18: Previous (1972.4 ms) vs Current (1997.0 ms) Diff -24 E2E 0.99x
query19: Previous (1175.4 ms) vs Current (1236.0 ms) Diff -60 E2E 0.95x
query20: Previous (520.2 ms) vs Current (514.2 ms) Diff 6 E2E 1.01x
query21: Previous (452.2 ms) vs Current (600.0 ms) Diff -147 E2E 0.75x
query22: Previous (1074.6 ms) vs Current (1032.6 ms) Diff 42 E2E 1.04x
query23_part1: Previous (5555.2 ms) vs Current (5415.0 ms) Diff 140 E2E 1.03x
query23_part2: Previous (6028.0 ms) vs Current (6012.6 ms) Diff 15 E2E 1.00x
query24_part1: Previous (6998.6 ms) vs Current (6630.6 ms) Diff 368 E2E 1.06x
query24_part2: Previous (6456.0 ms) vs Current (6545.8 ms) Diff -89 E2E 0.99x
query25: Previous (1886.6 ms) vs Current (1829.8 ms) Diff 56 E2E 1.03x
query26: Previous (757.2 ms) vs Current (720.8 ms) Diff 36 E2E 1.05x
query27: Previous (1354.6 ms) vs Current (1074.4 ms) Diff 280 E2E 1.26x
query28: Previous (4232.0 ms) vs Current (4181.0 ms) Diff 51 E2E 1.01x
query29: Previous (2514.2 ms) vs Current (2593.6 ms) Diff -79 E2E 0.97x
query30: Previous (1856.8 ms) vs Current (1830.0 ms) Diff 26 E2E 1.01x
query31: Previous (1667.4 ms) vs Current (1656.4 ms) Diff 11 E2E 1.01x
query32: Previous (894.4 ms) vs Current (938.4 ms) Diff -44 E2E 0.95x
query33: Previous (1135.0 ms) vs Current (1037.4 ms) Diff 97 E2E 1.09x
query34: Previous (1865.0 ms) vs Current (1851.8 ms) Diff 13 E2E 1.01x
query35: Previous (1758.0 ms) vs Current (1734.8 ms) Diff 23 E2E 1.01x
query36: Previous (1292.2 ms) vs Current (1438.2 ms) Diff -146 E2E 0.90x
query37: Previous (516.0 ms) vs Current (558.2 ms) Diff -42 E2E 0.92x
query38: Previous (1861.4 ms) vs Current (1923.8 ms) Diff -62 E2E 0.97x
query39_part1: Previous (1816.6 ms) vs Current (1797.2 ms) Diff 19 E2E 1.01x
query39_part2: Previous (1333.0 ms) vs Current (1147.6 ms) Diff 185 E2E 1.16x
query40: Previous (1065.6 ms) vs Current (1089.0 ms) Diff -23 E2E 0.98x
query41: Previous (287.4 ms) vs Current (319.0 ms) Diff -31 E2E 0.90x
query42: Previous (317.4 ms) vs Current (337.6 ms) Diff -20 E2E 0.94x
query43: Previous (789.6 ms) vs Current (770.0 ms) Diff 19 E2E 1.03x
query44: Previous (666.2 ms) vs Current (647.2 ms) Diff 19 E2E 1.03x
query45: Previous (1165.8 ms) vs Current (1186.6 ms) Diff -20 E2E 0.98x
query46: Previous (1415.8 ms) vs Current (1438.0 ms) Diff -22 E2E 0.98x
query47: Previous (1713.4 ms) vs Current (1593.8 ms) Diff 119 E2E 1.08x
query48: Previous (904.4 ms) vs Current (980.8 ms) Diff -76 E2E 0.92x
query49: Previous (1892.0 ms) vs Current (1784.2 ms) Diff 107 E2E 1.06x
query50: Previous (7365.6 ms) vs Current (7141.4 ms) Diff 224 E2E 1.03x
query51: Previous (1675.8 ms) vs Current (1951.0 ms) Diff -275 E2E 0.86x
query52: Previous (490.8 ms) vs Current (449.8 ms) Diff 41 E2E 1.09x
query53: Previous (632.2 ms) vs Current (657.2 ms) Diff -25 E2E 0.96x
query54: Previous (1373.2 ms) vs Current (1343.6 ms) Diff 29 E2E 1.02x
query55: Previous (408.0 ms) vs Current (370.4 ms) Diff 37 E2E 1.10x
query56: Previous (794.6 ms) vs Current (832.4 ms) Diff -37 E2E 0.95x
query57: Previous (1259.0 ms) vs Current (1353.8 ms) Diff -94 E2E 0.93x
query58: Previous (901.0 ms) vs Current (814.4 ms) Diff 86 E2E 1.11x
query59: Previous (1805.2 ms) vs Current (1847.2 ms) Diff -42 E2E 0.98x
query60: Previous (1220.0 ms) vs Current (1185.6 ms) Diff 34 E2E 1.03x
query61: Previous (1213.0 ms) vs Current (1246.6 ms) Diff -33 E2E 0.97x
query62: Previous (1129.2 ms) vs Current (1124.2 ms) Diff 5 E2E 1.00x
query63: Previous (794.4 ms) vs Current (731.0 ms) Diff 63 E2E 1.09x
query64: Previous (14201.0 ms) vs Current (14071.2 ms) Diff 129 E2E 1.01x
query65: Previous (3162.6 ms) vs Current (3127.4 ms) Diff 35 E2E 1.01x
query66: Previous (2386.2 ms) vs Current (2469.6 ms) Diff -83 E2E 0.97x
query67: Previous (17100.6 ms) vs Current (17021.8 ms) Diff 78 E2E 1.00x
query68: Previous (1173.4 ms) vs Current (1210.6 ms) Diff -37 E2E 0.97x
query69: Previous (1359.4 ms) vs Current (1469.0 ms) Diff -109 E2E 0.93x
query70: Previous (1544.2 ms) vs Current (1496.8 ms) Diff 47 E2E 1.03x
query71: Previous (3042.4 ms) vs Current (3247.4 ms) Diff -205 E2E 0.94x
query72: Previous (2612.4 ms) vs Current (2579.6 ms) Diff 32 E2E 1.01x
query73: Previous (987.4 ms) vs Current (1046.6 ms) Diff -59 E2E 0.94x
query74: Previous (2250.4 ms) vs Current (2338.0 ms) Diff -87 E2E 0.96x
query75: Previous (6662.0 ms) vs Current (6678.6 ms) Diff -16 E2E 1.00x
query76: Previous (1665.2 ms) vs Current (1502.4 ms) Diff 162 E2E 1.11x
query77: Previous (925.2 ms) vs Current (1049.4 ms) Diff -124 E2E 0.88x
query78: Previous (8079.2 ms) vs Current (8006.2 ms) Diff 73 E2E 1.01x
query79: Previous (865.0 ms) vs Current (877.4 ms) Diff -12 E2E 0.99x
query80: Previous (3891.8 ms) vs Current (3862.2 ms) Diff 29 E2E 1.01x
query81: Previous (2065.8 ms) vs Current (2007.8 ms) Diff 58 E2E 1.03x
query82: Previous (599.6 ms) vs Current (668.2 ms) Diff -68 E2E 0.90x
query83: Previous (651.4 ms) vs Current (685.8 ms) Diff -34 E2E 0.95x
query84: Previous (1176.0 ms) vs Current (988.6 ms) Diff 187 E2E 1.19x
query85: Previous (1539.2 ms) vs Current (1480.4 ms) Diff 58 E2E 1.04x
query86: Previous (894.2 ms) vs Current (923.2 ms) Diff -29 E2E 0.97x
query87: Previous (1955.6 ms) vs Current (1965.4 ms) Diff -9 E2E 1.00x
query88: Previous (3135.6 ms) vs Current (3136.4 ms) Diff 0 E2E 1.00x
query89: Previous (1088.0 ms) vs Current (967.2 ms) Diff 120 E2E 1.12x
query90: Previous (699.0 ms) vs Current (685.4 ms) Diff 13 E2E 1.02x
query91: Previous (951.0 ms) vs Current (1108.6 ms) Diff -157 E2E 0.86x
query92: Previous (539.8 ms) vs Current (507.0 ms) Diff 32 E2E 1.06x
query93: Previous (9397.4 ms) vs Current (9176.0 ms) Diff 221 E2E 1.02x
query94: Previous (3957.0 ms) vs Current (3870.4 ms) Diff 86 E2E 1.02x
query95: Previous (4760.8 ms) vs Current (4813.4 ms) Diff -52 E2E 0.99x
query96: Previous (4285.2 ms) vs Current (4331.0 ms) Diff -45 E2E 0.99x
query97: Previous (1799.0 ms) vs Current (1710.2 ms) Diff 88 E2E 1.05x
query98: Previous (1279.4 ms) vs Current (1239.6 ms) Diff 39 E2E 1.03x
query99: Previous (1477.0 ms) vs Current (1555.8 ms) Diff -78 E2E 0.95x

* verification.
*/
void transition(full_thread_state& state, thread_state const new_state)
void transition(std::shared_ptr<full_thread_state> state, thread_state const new_state)

@ttnghia ttnghia Dec 23, 2025

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: We can use std::shared_ptr<> const& as parameter. Doing so will avoid the atomic ops when creating a new shared_ptr and reduce overhead.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, if the intent of this change is to allow this function to continue runing while leaving the state at the caller to be destroyed independently then this can be left as-is.

* Checkpoint all of the metrics for a thread.
*/
void checkpoint_metrics(full_thread_state& state)
void checkpoint_metrics(std::shared_ptr<full_thread_state> state)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar here.

}

bool is_thread_bufn_or_above(JNIEnv* env, full_thread_state const& state)
bool is_thread_bufn_or_above(JNIEnv* env, std::shared_ptr<full_thread_state> state)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar here.

@nvauto

nvauto commented Jan 19, 2026

Copy link
Copy Markdown
Collaborator

NOTE: release/26.02 has been created from main. Please retarget your PR to release/26.02 if it should be included in the release.

@nvauto

nvauto commented Mar 16, 2026

Copy link
Copy Markdown
Collaborator

NOTE: release/26.04 has been created from main. Please retarget your PR to release/26.04 if it should be included in the release.

@nvauto

nvauto commented May 18, 2026

Copy link
Copy Markdown
Collaborator

NOTE: release/26.06 has been created from main. Please retarget your PR to release/26.06 if it should be included in the release.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the spark_resource_adaptor thread-tracking state so that each full_thread_state is heap-allocated and referenced via std::shared_ptr, enabling safe retention of thread state beyond the lifetime of the originating thread.

Changes:

  • Replaced threads map value type from full_thread_state to std::shared_ptr<full_thread_state>.
  • Updated state access patterns throughout the state machine from . to ->.
  • Updated helper APIs (transition, checkpoint_metrics, is_thread_bufn_or_above) to accept shared-pointer state.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 693 to +695
auto const was_threads_inserted = threads.emplace(
thread_id, full_thread_state(thread_state::THREAD_RUNNING, thread_id, task_id));
thread_id,
std::make_shared<full_thread_state>(thread_state::THREAD_RUNNING, thread_id, task_id));
Comment on lines +778 to +779
auto const was_inserted = threads.emplace(
thread_id, std::make_shared<full_thread_state>(thread_state::THREAD_RUNNING, thread_id));
* verification.
*/
void transition(full_thread_state& state, thread_state const new_state)
void transition(std::shared_ptr<full_thread_state> state, thread_state const new_state)
* Checkpoint all of the metrics for a thread.
*/
void checkpoint_metrics(full_thread_state& state)
void checkpoint_metrics(std::shared_ptr<full_thread_state> state)
}

bool is_thread_bufn_or_above(JNIEnv* env, full_thread_state const& state)
bool is_thread_bufn_or_above(JNIEnv* env, std::shared_ptr<full_thread_state> state)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants