@xingbowang
Contributor

Summary:

Add a new db_stress_replay binary to help debug stress test failures.
Integrate the new db_stress_replay binary so that it runs automatically after a stress test failure, to check whether the issue is consistently reproducible.

Test Plan:

Sample output from an existing test failure:

./db_stress_replay --db=/home/xbw/workspace/sandcastle/iter_diverged/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --op_logs="S 000000000000038D000000000000012B0000000000000299 *N*N*N* SFP 000000000000038D000000000000012B00000000000002A2 *P*P*P* SFP 000000000000038D000000000000012B000000000000029E *" --column_family="3"

Parsing op_logs...
Parsed 18 operations

Opening database: /home/xbw/workspace/sandcastle/iter_diverged/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox
Attempting to load options from database...
Successfully loaded options from database
Using 10 column families from OPTIONS file

Database opened successfully (read-only mode)

Using column family: 3

=== Starting operation replay (18 operations) ===
ReadOptions: allow_unprepared_value=0, total_order_seek=1

[  0] Seek(000000000000038D000000000000012B0000000000000299)
      -> Valid: key=000000000000038D000000000000012B0000000000000299, value_size=32
[  1] PrepareValue()
      -> Valid: key=000000000000038D000000000000012B0000000000000299, value_size=32
[  2] Next()
      -> Valid: key=000000000000038D000000000000012B000000000000029A, value_size=96
[  3] PrepareValue()
      -> Valid: key=000000000000038D000000000000012B000000000000029A, value_size=96
[  4] Next()
      -> Valid: key=000000000000038D000000000000012B000000000000029B, value_size=32
[  5] PrepareValue()
      -> Valid: key=000000000000038D000000000000012B000000000000029B, value_size=32
[  6] Next()
      -> Valid: key=000000000000038D000000000000012B00000000000002A2, value_size=96
[  7] PrepareValue()
      -> Valid: key=000000000000038D000000000000012B00000000000002A2, value_size=96
[  8] SeekForPrev(000000000000038D000000000000012B00000000000002A2)
      -> Valid: key=000000000000038D000000000000012B00000000000002A2, value_size=96
[  9] PrepareValue()
      -> Valid: key=000000000000038D000000000000012B00000000000002A2, value_size=96
[ 10] Prev()
      -> Valid: key=000000000000038D000000000000012B000000000000029B, value_size=32
[ 11] PrepareValue()
      -> Valid: key=000000000000038D000000000000012B000000000000029B, value_size=32
[ 12] Prev()
      -> Valid: key=000000000000038D000000000000012B000000000000029A, value_size=96
[ 13] PrepareValue()
      -> Valid: key=000000000000038D000000000000012B000000000000029A, value_size=96
[ 14] Prev()
      -> Valid: key=000000000000038D000000000000012B0000000000000299, value_size=32
[ 15] PrepareValue()
      -> Valid: key=000000000000038D000000000000012B0000000000000299, value_size=32
[ 16] SeekForPrev(000000000000038D000000000000012B000000000000029E)
      -> Valid: key=000000000000038D000000000000012B000000000000029E, value_size=96
[ 17] PrepareValue()
      -> Valid: key=000000000000038D000000000000012B000000000000029E, value_size=96

=== Replay completed ===
Total operations: 18
Errors encountered: 0

Final iterator state:
  Valid: true
  Key: 000000000000038D000000000000012B000000000000029E
  Value size: 96
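
For readers decoding the `--op_logs` string by hand, here is a minimal Python sketch of how the sample above appears to map to the 18 replayed operations. The token grammar is inferred only from this one sample (`S <key>` = Seek, `SFP <key>` = SeekForPrev, `*` = PrepareValue, `N` = Next, `P` = Prev). The function and table names are hypothetical, and the actual binary may accept more op codes or a different syntax.

```python
# Single-character ops that take no argument (inferred from the sample only).
SIMPLE_OPS = {"*": "PrepareValue", "N": "Next", "P": "Prev"}
# Ops that are followed by a hex-encoded key as the next token.
SEEK_OPS = {"S": "Seek", "SFP": "SeekForPrev"}

SAMPLE = (
    "S 000000000000038D000000000000012B0000000000000299 *N*N*N* "
    "SFP 000000000000038D000000000000012B00000000000002A2 *P*P*P* "
    "SFP 000000000000038D000000000000012B000000000000029E *"
)


def parse_op_logs(op_logs):
    """Return a list of (op_name, key_or_None) tuples in replay order."""
    tokens = op_logs.split()
    ops = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in SEEK_OPS:
            # A seek op consumes the following token as its target key.
            if i + 1 >= len(tokens):
                raise ValueError(f"{tok} is missing its key")
            ops.append((SEEK_OPS[tok], tokens[i + 1]))
            i += 2
        else:
            # Any other token is a run of single-character ops, e.g. "*N*N*N*".
            for ch in tok:
                if ch not in SIMPLE_OPS:
                    raise ValueError(f"unknown op {ch!r} in token {tok!r}")
                ops.append((SIMPLE_OPS[ch], None))
            i += 1
    return ops


if __name__ == "__main__":
    for index, (name, key) in enumerate(parse_op_logs(SAMPLE)):
        print(f"[{index:3d}] {name}({key or ''})")
```

Under these assumptions the sample parses to 18 operations, matching the "Parsed 18 operations" line in the output.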

Reviewers:

Subscribers:

Tasks:

Tags:

Summary:

Add a new db_stress_replay binary to help debug stress test failures.

Test Plan:

Unit test

Reviewers:

Subscribers:

Tasks:

Tags:
@meta-cla meta-cla bot added the CLA Signed label Dec 5, 2025
@meta-codesync

meta-codesync bot commented Dec 5, 2025

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D88525321.

@meta-codesync

meta-codesync bot commented Dec 6, 2025

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D88525321.

@hx235
Contributor

hx235 commented Dec 8, 2025

While I do see the point of lowering the mental barrier by providing an off-the-shelf tool and making info extraction easier for debugging, we need to make sure the tool is well implemented so people don't end up debugging the debugging tool. One concern is code redundancy between the parsing logic in the tool and the printing logic in the test; we should have one ground truth. The other is keeping the read options used for scanning in sync between your tool and the stress test. If the tool can reuse what's already used in your debugging process (e.g., ldb, dump) and is known to be correct, that would be better.
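
One possible way to keep read options in sync, sketched here as an assumption rather than anything this PR implements, is for the stress test to print its read options in a parseable form and for the replay tool to consume that line instead of hardcoding values. The Python sketch below round-trips the `ReadOptions: k=v, k=v` line format seen in the sample output. Both helper names are hypothetical, and it assumes every option is a boolean printed as 0 or 1.

```python
PREFIX = "ReadOptions: "


def format_read_options(opts):
    """Render a dict of boolean options as 'ReadOptions: k=0, k=1'."""
    body = ", ".join(f"{key}={int(value)}" for key, value in opts.items())
    return PREFIX + body


def parse_read_options(line):
    """Parse a line produced by format_read_options back into a dict."""
    if not line.startswith(PREFIX):
        raise ValueError("not a ReadOptions line")
    opts = {}
    for pair in line[len(PREFIX):].split(", "):
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"malformed option {pair!r}")
        # Assumption: values are booleans encoded as 0 or 1.
        opts[key] = bool(int(value))
    return opts
```

Because the printer and the parser share one format, a round-trip check such as `parse_read_options(format_read_options(opts)) == opts` can guard against the two sides drifting apart.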

@hx235 hx235 left a comment

Let's discuss more before adding tools to debug stress tests, to make sure they come with minimum maintenance cost and maximum helpfulness.
