neethuhaneesha
Contributor

@neethuhaneesha neethuhaneesha commented Oct 14, 2025

Fix for fdbdecode memory issues.
The fdbdecode command was crashing with memory-related errors such as:
double free or corruption (!prev), free(): invalid pointer, munmap_chunk(): invalid pointer, Segmentation fault
These errors occurred only during the program’s shutdown phase, after all decoding work was completed. They did not affect the correctness of the decoded key–value output.

Root Cause
Valgrind analysis revealed that the crashes were caused by static object destruction order issues, leading to use-after-free and double-free situations.
Issue 1: EventCacheHolder
A static EventCacheHolder instance invoked clear() during its destruction, which accessed a LatestEventCache object that had already been destroyed.
Issue 2: BlobStats
Another static variable, a std::unique_ptr<BlobStats>, owned a LatencySample that holds a reference to an EventCacheHolder (see the second trace below). During shutdown, destroying it triggered the same invalid access pattern described above.

==40744== Invalid read of size 2
==40744==    at 0x14571E4: operator< (NetworkAddress.h:64)
==40744==    by 0x14571E4: operator() (stl_function.h:400)
==40744==    by 0x14571E4: _M_lower_bound (stl_tree.h:1905)
==40744==    by 0x14571E4: lower_bound (stl_tree.h:1270)
==40744==    by 0x14571E4: lower_bound (stl_map.h:1259)
==40744==    by 0x14571E4: operator[] (stl_map.h:517)
==40744==    by 0x14571E4: LatestEventCache::clear(std::string const&) (Trace.cpp:632)
==40744==    by 0x1243F58: ~EventCacheHolder (Trace.h:524)
==40744==    by 0x1243F58: delref (FastRef.h:70)
==40744==    by 0x1243F58: delref<EventCacheHolder> (FastRef.h:95)
==40744==    by 0x1243F58: ~Reference (FastRef.h:126)
==40744==    by 0x1243F58: CounterCollectionImpl::TraceCountersActorState<CounterCollectionImpl::TraceCountersActor>::~TraceCountersActorState() (Stats.actor.g.cpp:162)
==40744==    by 0x1244270: a_body1Catch1 (Stats.actor.g.cpp:188)
==40744==    by 0x1244270: CounterCollectionImpl::TraceCountersActorState<CounterCollectionImpl::TraceCountersActor>::a_callback_error(ActorCallback<CounterCollectionImpl::TraceCountersActor, 1, Void>*, Error) (Stats.actor.g.cpp:417)
==40744==    by 0x69691E: delFutureRef (flow.h:866)
==40744==    by 0x69691E: delFutureRef (flow.h:863)
==40744==    by 0x69691E: Future<Void>::~Future() (flow.h:948)
==40744==    by 0x498A2DC: __run_exit_handlers (in /usr/lib64/libc.so.6)
==40744==    by 0x498A42F: exit (in /usr/lib64/libc.so.6)
==40744==    by 0x49725D6: (below main) (in /usr/lib64/libc.so.6)

==40744== Invalid read of size 8
==40744==    at 0x1452988: _M_lower_bound (stl_tree.h:1904)
==40744==    by 0x1452988: lower_bound (stl_tree.h:1270)
==40744==    by 0x1452988: lower_bound (stl_map.h:1259)
==40744==    by 0x1452988: clearPrefix_internal(std::map<std::string, TraceEventFields, std::less<std::string>, std::allocator<std::pair<std::string const, TraceEventFields> > >&, std::string const&) (Trace.cpp:627)
==40744==    by 0x1457232: LatestEventCache::clear(std::string const&) (Trace.cpp:632)
==40744==    by 0xE91777: ~EventCacheHolder (Trace.h:524)
==40744==    by 0xE91777: delref (FastRef.h:70)
==40744==    by 0xE91777: delref<EventCacheHolder> (FastRef.h:95)
==40744==    by 0xE91777: ~Reference (FastRef.h:126)
==40744==    by 0xE91777: ~LatencySample (Stats.h:227)
==40744==    by 0xE91777: ~BlobStats (S3BlobStore.h:88)
==40744==    by 0xE91777: operator() (unique_ptr.h:85)
==40744==    by 0xE91777: std::unique_ptr<S3BlobStoreEndpoint::BlobStats, std::default_delete<S3BlobStoreEndpoint::BlobStats> >::~unique_ptr() (unique_ptr.h:361)
==40744==    by 0x498A2DC: __run_exit_handlers (in /usr/lib64/libc.so.6)
==40744==    by 0x498A42F: exit (in /usr/lib64/libc.so.6)
==40744==    by 0x49725D6: (below main) (in /usr/lib64/libc.so.6)
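
The hazard in both traces is a holder whose destructor touches a cache that has already been destroyed. The following is a minimal sketch of that ordering and of one possible guard, not the change made in this PR. `EventCache`, `GuardedHolder`, and both helper functions are hypothetical stand-ins for illustration, and the use of `std::weak_ptr` is an assumption (the real code uses flow's `Reference`).

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for LatestEventCache: a map keyed by event name.
struct EventCache {
    std::map<std::string, int> latest;
    void clear(const std::string& key) { latest.erase(key); }
};

// Hypothetical stand-in for EventCacheHolder. It observes the cache through a
// weak_ptr, so its destructor can tell whether the cache still exists.
struct GuardedHolder {
    std::weak_ptr<EventCache> cache;
    std::string name;
    bool* cleared; // set to true only if clear() was actually called

    GuardedHolder(std::weak_ptr<EventCache> c, std::string n, bool* flag)
      : cache(std::move(c)), name(std::move(n)), cleared(flag) {}

    ~GuardedHolder() {
        if (auto alive = cache.lock()) { // cache still exists: safe to touch
            alive->clear(name);
            *cleared = true;
        }
        // Otherwise skip the cleanup instead of reading freed memory.
    }
};

// Destroys the cache before the holder (the order seen at exit in the traces)
// and reports whether the holder's destructor touched the cache.
bool holderTouchedDestroyedCache() {
    bool cleared = false;
    auto cache = std::make_shared<EventCache>();
    cache->latest["BlobStats"] = 1;
    {
        GuardedHolder holder(cache, "BlobStats", &cleared);
        cache.reset(); // simulate the cache being destroyed first
    }                  // holder is destroyed here, after the cache
    return cleared;
}

// Normal order: the holder is destroyed while the cache is still alive.
bool holderClearsLiveCache() {
    bool cleared = false;
    auto cache = std::make_shared<EventCache>();
    cache->latest["BlobStats"] = 1;
    {
        GuardedHolder holder(cache, "BlobStats", &cleared);
    }
    return cleared && cache->latest.empty();
}
```

With an unguarded raw pointer or reference in place of the `weak_ptr`, the first scenario would be the use-after-free that Valgrind reports above.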

100k Correctness passing:
20251022-184501-neethuhaneeshabingi-c75f24d2a60bc088 compressed=True data_size=51345406 duration=4751108 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=1:21:46 sanity=False started=100000 stopped=20251022-200647 submitted=20251022-184501 timeout=5400 username=neethuhaneeshabingi

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 9133b8b
  • Duration 0:11:29
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 9133b8b
  • Duration 0:24:26
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 9133b8b
  • Duration 0:46:13
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 9133b8b
  • Duration 0:56:25
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 9133b8b
  • Duration 1:03:07
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 9133b8b
  • Duration 1:08:36
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

Contributor

@jzhou77 jzhou77 left a comment

I have a question about the fix.

- static std::unique_ptr<BlobStats> blobStats;
- static Future<Void> statsLogger;
+ std::unique_ptr<BlobStats> blobStats;
+ Future<Void> statsLogger;
Contributor

It looks like the idea was to have a singleton BlobStats and logger. Does this fix change the behavior so that there are multiple BlobStats and loggers?

Contributor Author

These stats are now per S3BlobStoreEndpoint. I checked the code and it looks like we only have one S3BlobStoreEndpoint, so I assumed we should be fine. Correct me if I'm wrong.

If we have more than one S3BlobStoreEndpoint, we would have to maintain multiple blobStats and logger instances, and could sometimes OOM if they do not all fit in memory. In that case I can keep them static and add checks in the destructors for safe deletion.
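
The behavioral difference under discussion, one shared stats object versus one per endpoint, can be sketched as follows. This is an illustration under assumptions, not the FoundationDB code: `Stats`, `SharedStatsEndpoint`, and `OwnedStatsEndpoint` are hypothetical names, and the single `requests` counter stands in for the real BlobStats fields.

```cpp
#include <memory>

// Hypothetical stand-in for BlobStats.
struct Stats {
    int requests = 0;
};

// Static member: one Stats shared by every endpoint. It has static storage
// duration, so it is destroyed during process exit, after main() returns.
struct SharedStatsEndpoint {
    static std::unique_ptr<Stats> stats;
    SharedStatsEndpoint() {
        if (!stats)
            stats = std::make_unique<Stats>();
    }
    void request() { ++stats->requests; }
};
std::unique_ptr<Stats> SharedStatsEndpoint::stats;

// Non-static member: each endpoint owns its Stats, which is destroyed together
// with the endpoint rather than by the exit handlers.
struct OwnedStatsEndpoint {
    std::unique_ptr<Stats> stats = std::make_unique<Stats>();
    void request() { ++stats->requests; }
};
```

With the static member, requests made through two endpoints accumulate in the same counter. With the per-instance member, each endpoint counts only its own requests, and memory use grows with the number of endpoints, which is the concern raised above.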

Contributor

It looks like in the fdbdecode case only one S3BlobStoreEndpoint is created, via IBackupContainer::openContainer(params->container_url, params->proxy, {});.

I noticed that openContainer() is supposed to cache IBackupContainer objects. However, the code after the following block creates a new r but does not save it to m_cache. If this is fixed, opening the same URL multiple times will share the same IBackupContainer.

Reference<IBackupContainer> IBackupContainer::openContainer(const std::string& url,
                                                            const Optional<std::string>& proxy,
                                                            const Optional<std::string>& encryptionKeyFileName) {
	static std::map<std::string, Reference<IBackupContainer>> m_cache;

	Reference<IBackupContainer>& r = m_cache[url];
	if (r)
		return r;

Contributor

I came across this bit of code and also thought that the new 'r' was not being added to the cache after creation, but I think it is. 'r' is a reference to the cache entry, so later, when we do 'r = makeReference(...)', we are inserting the new object into the cache. What do you two think?

Contributor

You are right! I missed the Reference<IBackupContainer>& part.
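
The point settled in this thread, that assigning through a reference returned by the map's operator[] stores the new object in the cache, can be checked with a small sketch. It uses std::shared_ptr in place of flow's Reference, and `Container` and this free-standing `openContainer` are hypothetical simplifications of the real function.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for IBackupContainer.
struct Container {
    std::string url;
};

std::shared_ptr<Container> openContainer(const std::string& url) {
    static std::map<std::string, std::shared_ptr<Container>> cache;

    // operator[] default-inserts a null entry for a new key and returns a
    // reference to the mapped value stored inside the map.
    std::shared_ptr<Container>& r = cache[url];
    if (r)
        return r;

    // Assigning through the reference writes into the map entry itself, so
    // the new container is cached without an explicit insert.
    r = std::make_shared<Container>(Container{ url });
    return r;
}
```

Calling `openContainer` twice with the same URL returns the same object, while a different URL yields a different one.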

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 9133b8b
  • Duration 0:11:44
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 9133b8b
  • Duration 0:24:47
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 9133b8b
  • Duration 0:34:59
  • Result: ❌ FAILED
  • Error: Error while executing command: ctest -j ${NPROC} --no-compress-output -T test --output-on-failure. Reason: exit status 8
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 9133b8b
  • Duration 0:42:57
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 9133b8b
  • Duration 0:47:29
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: 5ef949b
  • Duration 0:24:21
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 9133b8b
  • Duration 1:00:42
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 5ef949b
  • Duration 0:43:02
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 9133b8b
  • Duration 1:09:49
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 5ef949b
  • Duration 0:52:47
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 5ef949b
  • Duration 0:53:19
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 5ef949b
  • Duration 1:04:16
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 5ef949b
  • Duration 1:05:04
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@neethuhaneesha neethuhaneesha merged commit 2f1c194 into apple:main Oct 22, 2025
6 checks passed
@neethuhaneesha neethuhaneesha deleted the decode-main branch October 22, 2025 20:10