Conversation


@yairgott yairgott commented Jul 2, 2025

Overview

Sharing memory between the module and engine reduces memory overhead by eliminating redundant copies of stored records in the module. This is particularly beneficial for search workloads that require indexing large volumes of documents.

Vectors

Vector similarity search requires storing large volumes of high-cardinality vectors. For example, a single vector with 512 dimensions (at 4 bytes per dimension) consumes 2048 bytes, and typical workloads often involve millions of vectors. Because there is no memory-sharing mechanism between the module and the engine, valkey-search currently doubles memory consumption when indexing vectors, significantly increasing operational costs. This limitation creates adoption friction and reduces valkey-search's competitiveness.

Memory Allocation Strategy

At a fundamental level, there are two primary allocation strategies:

  • [Chosen] Module-allocated memory shared with the engine.
  • Engine-allocated memory shared with the module.

For valkey-search, it is crucial that vectors reside in cache-aligned memory to maximize SIMD optimizations. Allowing the module to allocate memory provides greater flexibility for different use cases, though it introduces slightly higher implementation complexity.
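
For illustration only, a minimal sketch of how the module might obtain a cache-aligned copy of a vector before sharing it with the engine; the 64-byte alignment value and the helper name are assumptions, not part of this PR:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: deep-copy a vector into cache-line-aligned memory so
 * SIMD loads operate on aligned data. 64 is an assumed cache-line size. */
static void *alloc_aligned_vector_copy(const void *src, size_t len) {
    void *dst = NULL;
    /* Alignment must be a power of two and a multiple of sizeof(void *). */
    if (posix_memalign(&dst, 64, len) != 0) return NULL;
    memcpy(dst, src, len);
    return dst;
}
```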

Old Implementation

The old implementation was based on ref-counting and introduced a new SDS type. After further discussion, we agreed to simplify the design by removing ref-counting and avoiding the introduction of a new SDS type.

New Implementation - Key Points

  1. The engine exposes a new interface, VM_HashExternalize, which externalizes a buffer as a hash field. The externalized buffer is owned by the module. The function accepts the hash key, hash field, and a buffer along with its length.
  2. ExternalizedValue is a new data type that captures the externalized buffer and its length.
  3. FIELD_SDS_AUX_BIT_EXTERNALIZED is a new SDS AUX flag used to distinguish between regular/embedded SDS values and ExternalizedValue.
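
A rough sketch of the shape of the pieces described in points 1–3 above; the exact signature, types, field names, and flag value are illustrative assumptions, not the PR's actual code:

```c
#include <stddef.h>

/* Opaque module types provided by the engine (declared here only so the
 * sketch is self-contained). */
typedef struct ValkeyModuleCtx ValkeyModuleCtx;
typedef struct ValkeyModuleString ValkeyModuleString;

/* ExternalizedValue: captures a module-owned buffer and its length. */
typedef struct ExternalizedValue {
    void *buf;  /* module-owned, cache-aligned buffer */
    size_t len; /* buffer length in bytes */
} ExternalizedValue;

/* Assumed aux-flag value distinguishing externalized values from
 * regular/embedded SDS values. */
#define FIELD_SDS_AUX_BIT_EXTERNALIZED (1 << 0)

/* VM_HashExternalize: externalize `buf` (owned by the module) as the value
 * of `field` in hash `key`. Argument order and return convention are
 * assumptions. Returns 0 on success. */
int VM_HashExternalize(ValkeyModuleCtx *ctx, ValkeyModuleString *key,
                       ValkeyModuleString *field, void *buf, size_t len);
```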

valkey-search Usage

Insertion

  1. Upon receiving a keyspace notification for a new hash or JSON key with an indexed vector attribute, valkey-search allocates cache-aligned memory and deep-copies the vector value into it.
  2. valkey-search then calls VM_HashExternalize on this buffer to avoid keeping two copies of the vector; see the sketch below.
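
Putting the two steps together, a sketch of the module-side insertion path, reusing the hypothetical declarations from the sketches above; everything except the names quoted from this PR is assumed:

```c
/* Sketch: handle a keyspace notification for a hash key whose indexed
 * vector field was just set (all names are illustrative). */
static void on_vector_field_set(ValkeyModuleCtx *ctx, ValkeyModuleString *key,
                                ValkeyModuleString *field,
                                const void *vec, size_t vec_len) {
    /* Step 1: deep-copy the vector into cache-aligned, module-owned memory. */
    void *aligned = alloc_aligned_vector_copy(vec, vec_len);
    if (aligned == NULL) return; /* allocation failed; skip indexing */

    /* Step 2: externalize the buffer as the hash field's value so the engine
     * and the vector index share one copy instead of two. */
    if (VM_HashExternalize(ctx, key, field, aligned, vec_len) != 0) {
        free(aligned); /* engine did not take the buffer; clean up */
        return;
    }

    /* Step 3: insert `aligned` into the vector index (omitted). */
}
```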

Deletion

When receiving a keyspace notification for a deleted hash key or hash field that was indexed as a vector, valkey-search deletes the corresponding entry from the index.

Update

Handled similarly to insertion.

yairgott and others added 24 commits July 2, 2025 23:56
Signed-off-by: yairgott <[email protected]>
Signed-off-by: yairgott <[email protected]>
Signed-off-by: yairgott <[email protected]>
Signed-off-by: yairgott <[email protected]>
Signed-off-by: yairgott <[email protected]>
Signed-off-by: yairgott <[email protected]>
Signed-off-by: yairgott <[email protected]>
Resolves valkey-io#2267

Timed-out tests get logged at the end of the test run.

```
!!! WARNING The following tests failed:

*** [TIMEOUT]: WAIT should not acknowledge 2 additional copies of the data in tests/unit/wait.tcl
Cleanup: may take some time... OK
```

Signed-off-by: Sarthak Aggarwal <[email protected]>
…alkey-io#2283)

When a replica does a CLUSTER RESET SOFT, we should generate a
new shard_id for it. If the node was a replica, it inherited the
shard_id of its primary; after the CLUSTER RESET SOFT the replica
becomes a new empty primary, so it should have its own shard_id.

Otherwise, when it joins back into the cluster or responds to PONG,
we will have two primaries in the same shard.

---------

Signed-off-by: Binbin <[email protected]>
In the unit test framework, there is a check that `zmalloc_used_memory()
== 0` after each test suite. If an assertion fails in any test case, the
test case is aborted and often it means some memory is not freed. The
check for used memory == 0 then fails for all test suites that run after
the failed one.

This PR makes sure a memory leak is only reported once, for the test
suite that didn't free its memory.

Two test cases that explicitly checked that all memory had been freed
are deleted. This is handled by the test framework.

Signed-off-by: Viktor Söderqvist <[email protected]>
… and fix zadd --sequential to create expected data (valkey-io#2102)

This extends the `__rand_int__` pattern to add `__rand_1st__` through
`__rand_9th__`.

For the new placeholders, multiple occurrences within a command will
have the same value.

With the `--sequential` option, each placeholder will use separate
counters.

For `__rand_int__`, the behavior is unchanged - multiple occurrences of
the pattern within a command will have different values.

Examples of using the `__rand_int__` with `--sequential`:

```
$ ./valkey-benchmark -r 300 -n 300 --sequential -q -- set key__rand_int__ val__rand_int__
set key__rand_int__ val__rand_int__: 60000.00 requests per second, p50=0.391 msec
$ ./valkey-cli info keyspace
# Keyspace
db0:keys=150,expires=0,avg_ttl=0
$ ./valkey-cli get key000000000050
"val000000000051"
```

For the new patterns, multiple occurrences within the command will have
the same value.

```
$ ./valkey-benchmark -r 300 -n 300 --sequential -q -- set key__rand_1st__ val__rand_1st__
set key__rand_int__ val__rand_int1_: 60000.00 requests per second, p50=0.383 msec
$ ./valkey-cli info keyspace
# Keyspace
db0:keys=300,expires=0,avg_ttl=0
$ ./valkey-cli get key000000000050
"val000000000050"
```

I also fixed the zadd benchmark so it produces the expected number of
keys when used with `--sequential`. (By using the same counter twice per
command, it was effectively counting by twos.)

I made this for myself but raised a PR in case y'all like it. 🙂

---------

Signed-off-by: Rain Valentine <[email protected]>
During memory eviction, we may end up with a primary kv hash table
that's below the 13% `MIN_FILL_PERCENT_SOFT`, but inside `resize()` for
the kv table we check with `ht->type->resizeAllowed` whether allocating
a new table crosses the configured `maxmemory` limit without taking into
account that shrinking, while temporarily increasing the memory usage,
will ultimately reduce the memory usage in all situations.

Blocking hash tables from shrinking in situations where we would only
temporarily cross the maxmemory limit has at least three negative side
effects:
* The performance of valkey is negatively impacted if, for example,
we're moving items from table A to table B since table A would
eventually end up being a very sparse hash table that never shrinks.
* In extreme situations, the performance of valkey slows down
significantly when trying to find more keys to evict since we're trying
to find keys in a sparse table that only becomes more sparse over time.
* In more extreme situations, such as when reducing `maxmemory` by
around 90% or writing large data to a single key, valkey can end up in a
situation where, no matter how many keys are evicted from the kv
table, the table's bucket struct overhead uses more memory than the data
stored in the table, causing all keys to be evicted from the table and
leaving valkey unusable until maxmemory is raised above the current
memory usage:

```
127.0.0.1:6379> config set maxmemory 2gb
OK
127.0.0.1:6379> debug populate 20000000
OK
(8.94s)
127.0.0.1:6379> debug htstats 0
[Dictionary HT]
Hash table 0 stats (main hash table):
 table size: 29360128
 number of entries: 20000000
[Expires HT]
127.0.0.1:6379> config set maxmemory 200mb
OK
127.0.0.1:6379> dbsize
(integer) 18725794
[...]
127.0.0.1:6379> dbsize
(integer) 0
127.0.0.1:6379> keys *
(empty array)
127.0.0.1:6379> set foo bar
(error) OOM command not allowed when used memory > 'maxmemory'.
127.0.0.1:6379> config get maxmemory
1) "maxmemory"
2) "209715200"
127.0.0.1:6379> info memory
# Memory
used_memory:269890904
used_memory_human:257.39M
[...]
127.0.0.1:6379> config set maxmemory 300mb
OK
127.0.0.1:6379> info memory
# Memory
used_memory:1455512
used_memory_human:1.39M
[...]
127.0.0.1:6379> set foo bar
OK
```

This PR addresses these issues by allowing hash tables to allocate a new
table for shrinking, even if allocating a new table causes the maxmemory
limit to be crossed during resize.

Additionally, the check for whether we should fast-forward the resize
operation has been moved to before checking whether we are allowed to
allocate a new resize table. Because we don't start a resize in this
section of the function yet, a fast-forward usually drops the memory
usage enough to allow another resize operation to start immediately
after the fast-forward resize completes.

As a side effect of this PR, we now also refuse to start a hashtable
resize if the policy is set to `HASHTABLE_RESIZE_FORBID`.

---------

Signed-off-by: Fusl <[email protected]>
Signed-off-by: Viktor Söderqvist <[email protected]>
Co-authored-by: Viktor Söderqvist <[email protected]>
…el (valkey-io#2308)

Sentinel sends the switch-master message more quickly, ref
[valkey-io#2282](valkey-io#2282).

Signed-off-by: youngmore1024 <[email protected]>
- Subset of valkey-io#2183

---------

Signed-off-by: Josh Soref <[email protected]>
- Subset of valkey-io#2183

---------

Signed-off-by: Josh Soref <[email protected]>
Signed-off-by: Madelyn Olson <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
…rectory as path (valkey-io#630)

Helps testing specific functionality like `cluster` rather than running the entire test suite.

### Usage example
#### Run all the tests under tests/unit/cluster directory
```
./runtest --single tests/unit/cluster
```

#### Run all the tests under test/unit but skip the file `memefficiency`
```
./runtest --single tests/unit --skipunit tests/unit/memefficiency.tcl
```

#### Run all the tests under tests/integration
```
./runtest --single tests/integration
```

Note: this doesn't work with tests under `tests/cluster`. We should
migrate all of the tests under `tests/cluster` to `tests/unit/cluster`.

---------

Signed-off-by: Harkrishn Patro <[email protected]>
)

Needs backporting to 8.1 to fix
valkey-io#2320

---------

Signed-off-by: Harkrishn Patro <[email protected]>
Resolves test failures like
https://github.com/valkey-io/valkey/actions/runs/16093360514/job/45412856532.
Valgrind tests can run slowly, so the 100 ms expire time may have
already begun to expire items during the ingestion phase.

Signed-off-by: Madelyn Olson <[email protected]>
Signed-off-by: yairgott <[email protected]>
Signed-off-by: yairgott <[email protected]>
Signed-off-by: yairgott <[email protected]>
@yairgott yairgott marked this pull request as ready for review July 8, 2025 06:31
@yairgott yairgott changed the title from "Hash externalizer" to "Adding support for sharing memory between the module and the engine" Jul 8, 2025

codecov bot commented Jul 8, 2025

Codecov Report

Attention: Patch coverage is 63.25301% with 61 lines in your changes missing coverage. Please review.

Project coverage is 71.41%. Comparing base (4827720) to head (61052cd).
Report is 40 commits behind head on unstable.

Files with missing lines Patch % Lines
src/t_hash.c 57.93% 53 Missing ⚠️
src/module.c 12.50% 7 Missing ⚠️
src/ziplist.c 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2299      +/-   ##
============================================
- Coverage     71.43%   71.41%   -0.02%     
============================================
  Files           123      123              
  Lines         66908    67163     +255     
============================================
+ Hits          47795    47964     +169     
- Misses        19113    19199      +86     
Files with missing lines Coverage Δ
src/aof.c 80.54% <100.00%> (+0.13%) ⬆️
src/db.c 90.06% <100.00%> (-0.04%) ⬇️
src/rdb.c 76.90% <100.00%> (+0.28%) ⬆️
src/server.h 100.00% <ø> (ø)
src/ziplist.c 15.18% <0.00%> (ø)
src/module.c 9.56% <12.50%> (+<0.01%) ⬆️
src/t_hash.c 89.80% <57.93%> (-6.40%) ⬇️

... and 24 files with indirect coverage changes


Signed-off-by: yairgott <[email protected]>

yairgott commented Jul 8, 2025

cc @zuiderkwast

Signed-off-by: yairgott <[email protected]>
ranshid and others added 5 commits August 6, 2025 12:33
test_vest uses a mock_defrag function to simulate the defrag flow.
In the scenario where there is no system malloc_size, we need to use
zmalloc_usable_size in order to prevent stepping over the
"to-be-replaced" buffer.

will partially fix: valkey-io#2435

Signed-off-by: Ran Shidlansik <[email protected]>
Fix missing check for executing client in `lookupKey` function

### Issue
The `lookupKey` function in db.c accesses
`server.executing_client->cmd->proc` without first verifying that
`server.executing_client` is not NULL. This was introduced in valkey-io#1499
where the check for executing client was added without verifying it
could be null.

The server crashes with a null pointer dereference when the
current_client's flag.no_touch is set.
```
27719 valkey-server *
/lib64/libpthread.so.0(+0x118e0)[0x7f34cb96a8e0]
src/valkey-server 127.0.0.1:21113(lookupKey+0xf5)[0x4a14b7]
src/valkey-server 127.0.0.1:21113(lookupKeyReadWithFlags+0x50)[0x4a15fc]
src/valkey-server 127.0.0.1:21113[0x52b8f1]
src/valkey-server 127.0.0.1:21113(handleClientsBlockedOnKeys+0xa5)[0x52b16f]
src/valkey-server 127.0.0.1:21113(processCommand+0xf1e)[0x4712c9]
src/valkey-server 127.0.0.1:21113(processCommandAndResetClient+0x35)[0x490fd5]
src/valkey-server 127.0.0.1:21113(processInputBuffer+0xe1)[0x4912e5]
src/valkey-server 127.0.0.1:21113(readQueryFromClient+0x8c)[0x49177b]
src/valkey-server 127.0.0.1:21113[0x57daa6]
src/valkey-server 127.0.0.1:21113[0x57e280]
src/valkey-server 127.0.0.1:21113(aeProcessEvents+0x261)[0x45b259]
src/valkey-server 127.0.0.1:21113(aeMain+0x2a)[0x45b450]
src/valkey-server 127.0.0.1:21113(main+0xd43)[0x479bf6]
/lib64/libc.so.6(__libc_start_main+0xea)[0x7f34cb5cd13a]
src/valkey-server 127.0.0.1:21113(_start+0x2a)[0x454e3a]
```

### Fix
Added a null check for `server.executing_client` before attempting to
dereference it:
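
A minimal sketch of the shape of such a guard (the actual condition in db.c is more involved and is not reproduced here):

```c
/* Only inspect the executing command if there is an executing client;
 * lookupKey can be reached (e.g. via handleClientsBlockedOnKeys) with
 * server.executing_client == NULL. */
if (server.executing_client != NULL) {
    struct serverCommand *cmd = server.executing_client->cmd;
    /* ... existing logic that dereferences cmd->proc ... */
    (void)cmd;
}
```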

### Tests
Added a regression test in tests/unit/type/list.tcl.

---------

Signed-off-by: Uri Yagelnik <[email protected]>
Signed-off-by: Ran Shidlansik <[email protected]>
Co-authored-by: Ran Shidlansik <[email protected]>
We used to print the context, but after valkey-io#2276 we lost it.

unstable:
```
*** Extract version and sha1 details from info command and print in tests/unit/info-command.tcl
```

now:
```
*** [err]: Extract version and sha1 details from info command and print in tests/unit/info-command.tcl
Expected '0' to be equal to '1' (context: type source line 7 file /xxx/info-command.tcl cmd {assert_equal 0 1} proc ::test)
```

We can see the difference: we now provide enough context when an
assertion fails. Otherwise we would need to scroll back (usually through
a lot of server logs) to find the context.

Signed-off-by: Binbin <[email protected]>
Remove unused includes

Signed-off-by: Xiaolong Chen <[email protected]>
ranshid and others added 7 commits August 7, 2025 11:54
1. Fix the way we toggle active expire.
2. Reduce the number of spawned servers during the test.

Just merged some start_server blocks to reduce the time spent on
starting and shutting down the server.

---------

Signed-off-by: Ran Shidlansik <[email protected]>
Across multiple runs of the big list test in defrag, the latency check
is tripping because the maximum observed defrag cycle latency
occasionally spikes above our 5 ms limit. While most cycles complete in
just a few milliseconds, rare slowdowns push some cycles into the double
digit millisecond range, so a 5 ms hard cap is too aggressive for stable
testing.

```
/runtest --verbose --tls --single unit/memefficiency --only '/big list' --accurate --loop --fastfail
```

```
[err]: Active Defrag big list: standalone in tests/unit/memefficiency.tcl
Expected 12 <= 5 (context: type proc line 18 cmd {assert {$max_latency <= $limit_ms}} proc ::validate_latency level 1)
(Fast fail: test will exit now)

[err]: Active Defrag big list: standalone in tests/unit/memefficiency.tcl
Expected 21 <= 5 (context: type proc line 18 cmd {assert {$max_latency <= $limit_ms}} proc ::validate_latency level 1)
(Fast fail: test will exit now)

[err]: Active Defrag big list: standalone in tests/unit/memefficiency.tcl
Expected 25 <= 5 (context: type proc line 18 cmd {assert {$max_latency <= $limit_ms}} proc ::validate_latency level 1)
(Fast fail: test will exit now)

[err]: Active Defrag big list: standalone in tests/unit/memefficiency.tcl
Expected 17 <= 5 (context: type proc line 18 cmd {assert {$max_latency <= $limit_ms}} proc ::validate_latency level 1)
(Fast fail: test will exit now)

```

---------

Signed-off-by: Seungmin Lee <[email protected]>
Co-authored-by: Seungmin Lee <[email protected]>
Fixes valkey-io#626 - Skip CI tests when only documentation files are modified

---------

Signed-off-by: Hanxi Zhang <[email protected]>
…o#2455)

Fixes the GitHub Actions error where both `paths` and `paths-ignore`
were defined for the same event, which is not allowed.

Resolves the error: "you may only define one of `paths` and
`paths-ignore` for a single event"

Removed the conflicting `paths` section from the `pull_request` trigger,
keeping only `paths-ignore` to skip documentation changes while allowing
the workflow to run on all other changes.

This is a follow-up fix to address the issue identified in the previous
PR.

Signed-off-by: Hanxi Zhang <[email protected]>
Test `Instance #5 is still a slave after some time (no failover)` is
supposed to verify that command `CLUSTER FAILOVER` will not promote a
replica without quorum from the primary; later in the file (`Instance 5
is a master after some time`), we verify that `CLUSTER FAILOVER FORCE`
does promote a replica under the same conditions.

There are a couple of issues with the tests:

1. `Instance #5 is still a slave after some time (no failover)` should
verify that instance 5 is a replica (i.e. that there's no failover), but
we call `assert {[s -5 role] eq {master}}`.
2. The reason the above assert works is that we previously send
`DEBUG SLEEP 10` to the primary, which pauses the primary for longer
than the configured 3 seconds for `cluster-node-timeout`.
The primary is marked as failed from the perspective of the rest of the
cluster, so quorum can be established and instance 5 is promoted as
primary.

This commit fixes both issues by shortening the sleep to less than 3
seconds and then asserting that the role is still replica. Test `Instance #5
is a master after some time` is updated to sleep for a shorter duration
to ensure that `FAILOVER FORCE` succeeds under the exact same
conditions.

### Testing
`./runtest --single unit/cluster/manual-failover --loop --fastfail`

Signed-off-by: Tyler Amano-Smerling <[email protected]>
…-io#2452)

Adds a simple check for the backup directory and removes the TODO.

Signed-off-by: Sarthak Aggarwal <[email protected]>
@madolson madolson moved this to In Progress in Valkey 9.0 Aug 8, 2025
murphyjacob4 and others added 12 commits August 10, 2025 10:05
The bug caused many invalid HMSET entries to be added to the AOF during
rewriting.

This now properly skips emitting HMSET until we find an entry with no
expiry.

Signed-off-by: Jacob Murphy <[email protected]>
…lkey-io#2462)

In the current implementation, `HGETEX`, when applied to a non-existing
object, simply returns a null value instead of an array (as in the
`HMGET` case).

---------

Signed-off-by: Ran Shidlansik <[email protected]>
closes valkey-io#2352

---------

Signed-off-by: Xiaolong Chen <[email protected]>
Signed-off-by: Ran Shidlansik <[email protected]>
Co-authored-by: Ran Shidlansik <[email protected]>
…y-io#2464)

Following the new API presented in
valkey-io#2089, we might access out-of-bounds memory in the case of some
illegal command input.

Signed-off-by: Ran Shidlansik <[email protected]>
Based on the review suggestion
valkey-io#2432 (comment)

---------

Signed-off-by: Xiaolong Chen <[email protected]>
### Description
User data is logged when a crash is caused by moduleRDBLoadError.

### Change
Redact user data when hide-user-data-from-log is enabled.

Signed-off-by: VanessaTang <[email protected]>
…ey-io#2469)

Right now, if a TLS connect fails, you get an unhelpful error message in
the log since it prints out NULL. This change makes sure that the reported
error is always a string (never NULL) and also tries to print out the
underlying errors.

Signed-off-by: Madelyn Olson <[email protected]>
Introduces a new family of commands for migrating slots via replication.
The procedure is driven by the source node which pushes an AOF formatted
snapshot of the slots to the target, followed by a replication stream of
changes on that slot (a la manual failover).

This solution is an adaptation of the solution provided by
@enjoy-binbin, combined with the solution I previously posted at valkey-io#1591,
modified to meet the designs we had outlined in valkey-io#23.

## New commands

* `CLUSTER MIGRATESLOTS SLOTSRANGE start end [start end]... NODE
node-id`: Begin sending the slot via replication to the target. Multiple
targets can be specified by repeating `SLOTSRANGE ... NODE ...`
*  `CLUSTER CANCELMIGRATION ALL`: Cancel all slot migrations
* `CLUSTER GETSLOTMIGRATIONS`: See a recent log of migrations

This PR only implements "one shot" semantics with an asynchronous model.
Later, "two phase" (e.g. slot level replicate/failover commands) can be
added with the same core.

## Slot migration jobs

Introduces the concept of a slot migration job. While active, a job
tracks a connection created by the source to the target over which the
contents of the slots are sent. This connection is used for control
messages as well as replicated slot data. Each job is given a 40
character random name to help uniquely identify it.

All jobs, including those that finished recently, can be observed using
the `CLUSTER GETSLOTMIGRATIONS` command.

## Replication

* Since the snapshot uses AOF, the snapshot can be replayed verbatim to
any replicas of the target node.
* We use the same proxying mechanism used for chaining replication to
copy the content sent by the source node directly to the replica nodes.

## `CLUSTER SYNCSLOTS`

To coordinate the state machine transitions across the two nodes, a new
command is added, `CLUSTER SYNCSLOTS`, that performs this control flow.

Each end of the slot migration connection is expected to install a read
handler in order to handle `CLUSTER SYNCSLOTS` commands:

* `ESTABLISH`: Begins a slot migration. Provides slot migration
information to the target and authorizes the connection to write to
unowned slots.
* `SNAPSHOT-EOF`: appended to the end of the snapshot to signal that the
snapshot is done being written to the target.
* `PAUSE`: informs the source node to pause whenever it gets the
opportunity
* `PAUSED`: added to the end of the client output buffer when the pause
is performed. The pause is only performed after the buffer shrinks below
a configurable size
* `REQUEST-FAILOVER`: request the source to either grant or deny a
failover for the slot migration. The failover is only granted if the target
is still paused. Once a failover is granted, the pause is refreshed for
a short duration
* `FAILOVER-GRANTED`: sent to the target to inform that REQUEST-FAILOVER
is granted
* `ACK`: heartbeat command used to ensure liveness

## Interactions with other commands

* FLUSHDB on the source node (which flushes the migrating slot) will
result in the source dropping the connection, which will flush the slot
on the target and reset the state machine back to the beginning. The
subsequent retry should very quickly succeed (it is now empty)
* FLUSHDB on the target will fail the slot migration. We can iterate
with better handling, but for now it is expected that the operator would
retry.
* Generally, FLUSHDB is expected to be executed cluster-wide, so
preserving partially migrated slots doesn't make much sense
* SCAN and KEYS are filtered to avoid exposing importing slot data

## Error handling

* For any transient connection drops, the migration will be failed and
require the user to retry.
* If there is an OOM while reading from the import connection, we will
fail the import, which will drop the importing slot data
* If there is a client output buffer limit reached on the source node,
it will drop the connection, which will cause the migration to fail
* If at any point the export loses ownership or either node is failed
over, a callback will be triggered on both ends of the migration to fail
the import. The import will not reattempt with a new owner
* The two ends of the migration are routinely pinging each other with
SYNCSLOTS ACK messages. If at any point there is no interaction on the
connection for longer than `repl-timeout`, the connection will be
dropped, resulting in migration failure
* If a failover happens, we will drop keys in all unowned slots. The
migration does not persist through failovers and would need to be
retried on the new source/target.

## State machine

```
                                                                            
                Target/Importing Node State Machine                         
   ─────────────────────────────────────────────────────────────            
                                                                            
             ┌────────────────────┐
             │SLOT_IMPORT_WAIT_ACK┼──────┐
             └──────────┬─────────┘      │
                     ACK│                │
         ┌──────────────▼─────────────┐  │
         │SLOT_IMPORT_RECEIVE_SNAPSHOT┼──┤
         └──────────────┬─────────────┘  │
            SNAPSHOT-EOF│                │                                  
        ┌───────────────▼──────────────┐ │                                  
        │SLOT_IMPORT_WAITING_FOR_PAUSED┼─┤                                  
        └───────────────┬──────────────┘ │                                  
                  PAUSED│                │                                  
        ┌───────────────▼──────────────┐ │ Error Conditions:                
        │SLOT_IMPORT_FAILOVER_REQUESTED┼─┤  1. OOM                          
        └───────────────┬──────────────┘ │  2. Slot Ownership Change        
        FAILOVER-GRANTED│                │  3. Demotion to replica          
         ┌──────────────▼─────────────┐  │  4. FLUSHDB                      
         │SLOT_IMPORT_FAILOVER_GRANTED┼──┤  5. Connection Lost              
         └──────────────┬─────────────┘  │  6. No ACK from source (timeout) 
      Takeover Performed│                │                                  
         ┌──────────────▼───────────┐    │                                  
         │SLOT_MIGRATION_JOB_SUCCESS┼────┤                                  
         └──────────────────────────┘    │                                  
                                         │                                  
   ┌─────────────────────────────────────▼─┐                                
   │SLOT_IMPORT_FINISHED_WAITING_TO_CLEANUP│                                
   └────────────────────┬──────────────────┘                                
Unowned Slots Cleaned Up│                                                   
          ┌─────────────▼───────────┐                                      
          │SLOT_MIGRATION_JOB_FAILED│                                      
          └─────────────────────────┘                                      

                                                                                           
                                                                                           
                      Source/Exporting Node State Machine                                  
         ─────────────────────────────────────────────────────────────                     
                                                                                           
               ┌──────────────────────┐                                                    
               │SLOT_EXPORT_CONNECTING├─────────┐                                          
               └───────────┬──────────┘         │                                          
                  Connected│                    │                                          
             ┌─────────────▼────────────┐       │                                          
             │SLOT_EXPORT_AUTHENTICATING┼───────┤                                          
             └─────────────┬────────────┘       │                                          
              Authenticated│                    │                                          
             ┌─────────────▼────────────┐       │                                          
             │SLOT_EXPORT_SEND_ESTABLISH┼───────┤                                          
             └─────────────┬────────────┘       │                                          
  ESTABLISH command written│                    │                                          
     ┌─────────────────────▼─────────────┐      │                                          
     │SLOT_EXPORT_READ_ESTABLISH_RESPONSE┼──────┤                                          
     └─────────────────────┬─────────────┘      │                                          
   Full response read (+OK)│                    │                                          
          ┌────────────────▼──────────────┐     │ Error Conditions:                        
          │SLOT_EXPORT_WAITING_TO_SNAPSHOT┼─────┤  1. User sends CANCELMIGRATION           
          └────────────────┬──────────────┘     │  2. Slot ownership change                
     No other child process│                    │  3. Demotion to replica                  
              ┌────────────▼───────────┐        │  4. FLUSHDB                              
              │SLOT_EXPORT_SNAPSHOTTING┼────────┤  5. Connection Lost                      
              └────────────┬───────────┘        │  6. AUTH failed                          
              Snapshot done│                    │  7. ERR from ESTABLISH command           
               ┌───────────▼─────────┐          │  8. Unpaused before failover completed   
               │SLOT_EXPORT_STREAMING┼──────────┤  9. Snapshot failed (e.g. Child OOM)     
               └───────────┬─────────┘          │  10. No ack from target (timeout)        
                      PAUSE│                    │  11. Client output buffer overrun        
            ┌──────────────▼─────────────┐      │                                          
            │SLOT_EXPORT_WAITING_TO_PAUSE┼──────┤                                          
            └──────────────┬─────────────┘      │                                          
             Buffer drained│                    │                                          
            ┌──────────────▼────────────┐       │                                          
            │SLOT_EXPORT_FAILOVER_PAUSED┼───────┤                                          
            └──────────────┬────────────┘       │                                          
   Failover request granted│                    │                                          
           ┌───────────────▼────────────┐       │                                          
           │SLOT_EXPORT_FAILOVER_GRANTED┼───────┤                                          
           └───────────────┬────────────┘       │                                          
      New topology received│                    │                                          
            ┌──────────────▼───────────┐        │                                          
            │SLOT_MIGRATION_JOB_SUCCESS│        │                                          
            └──────────────────────────┘        │                                          
                                                │                                          
            ┌─────────────────────────┐         │                                          
            │SLOT_MIGRATION_JOB_FAILED│◄────────┤                                          
            └─────────────────────────┘         │                                          
                                                │                                          
           ┌────────────────────────────┐       │                                          
           │SLOT_MIGRATION_JOB_CANCELLED│◄──────┘                                          
           └────────────────────────────┘                                                 
```

Co-authored-by: Binbin <[email protected]>

---------

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Madelyn Olson <[email protected]>
Co-authored-by: Binbin <[email protected]>
Co-authored-by: Ping Xie <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
When we added the safeguard against sub-replicas, we didn't add
many tests for it. As time has shown, sub-replicas can appear in different
scenarios. Here's a test for a case that used to happen in a failover scenario.

Signed-off-by: Binbin <[email protected]>
…alkey-io#2431)

The assert was added in valkey-io#2301, and we found that there are some
situations that would trigger the assert and crash the server.

The reason we added the assert is because, in the code:
1. sender_claimed_primary and sender are in the same shard
2. and sender is the old primary, sender_claimed_primary is the old replica
3. and now sender become a replica, sender_claimed_primary become a primary

That means a failover happened in the shard, and sender should be the primary
of sender_claimed_primary. But this assumption may be wrong: we rely
on shard_id to determine whether nodes are in the same shard, and assume
that a shard can only have one primary.

However, from valkey-io#2279 we know there is a case where we can
create two primaries in the same shard due to the untimely update of shard_id.
So we can create a test that triggers the assert in this way:
1. Precondition: two primaries in the same shard, one with slots and one empty.
2. The replica does a cluster failover.
3. The empty primary does a cluster replicate with the replica (the new primary).

We change the assert to an if condition to fix it.

Closes valkey-io#2423.

Note that the test written here also exposes the issue in valkey-io#2441, so these two
may need to be addressed together.

Signed-off-by: Binbin <[email protected]>
Signed-off-by: yairgott <[email protected]>
Signed-off-by: yairgott <[email protected]>
@yairgott yairgott marked this pull request as draft August 12, 2025 04:54
@yairgott
Copy link
Contributor Author

This PR was based on Ran's hash TTL branch. Closing this PR in favor of #2472.

@yairgott yairgott closed this Aug 12, 2025
@github-project-automation github-project-automation bot moved this from In Progress to Done in Valkey 9.0 Aug 12, 2025