
Conversation

@chrisroberts (Member) commented Oct 18, 2025

Description

Batch jobs have a few documented behaviors:

  • On node drain, the allocation is allowed to complete unless the drain deadline is reached, at which point the allocation is killed. The allocation is not replaced.
  • When using the alloc stop command, the allocation is stopped and then rescheduled according to its reschedule policy.
  • On job restart with the -reschedule flag, the allocation is migrated and its reschedule policy is ignored.

This changeset updates Nomad's behavior for batch job allocations so they behave as documented. It removes the modifications introduced in dfa07e1 (#26025) that forced batch job allocations into a failed state when migrating; the reported issue that change attempted to resolve was itself incorrect behavior. The reconciler has been adjusted to properly handle batch job allocations as documented.

Changes of note

Eval trigger reasons

A new eval trigger reason was added to provide better information to the user. It is shown and explained in the final examples below.

Allocations API

The Allocations.Stop function was using an old helper that silently dropped any defined query parameters. It was updated to use the newer helper, which passes the full query.

The Allocation.DesiredTransition function was also updated to match its counterpart, using the same helper functions.
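
Below is a hedged sketch of what the Stop fix looks like, assuming the api package's internal helper names (the query-aware putQuery versus the older write-style helper); it illustrates the change rather than reproducing the exact diff:

func (a *Allocations) Stop(alloc *Allocation, q *QueryOptions) (*AllocStopResponse, error) {
	var resp AllocStopResponse
	// putQuery encodes the caller's QueryOptions into the request,
	// where the older helper silently dropped them.
	_, err := a.client.putQuery("/v1/allocation/"+alloc.ID+"/stop", nil, &resp, q)
	return &resp, err
}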

New desired transition field

A new field was added to the DesiredTransition struct: MigrateDisablePlacement. It is set when draining to allow the allocation to be stopped while preventing a replacement from being placed, achieving the documented drain behavior.
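
A minimal sketch of the field on the api DesiredTransition struct follows; the neighboring fields are from the existing public API, and the comment wording is illustrative:

type DesiredTransition struct {
	// Migrate marks the allocation for migration off its current
	// node, for example during a node drain.
	Migrate *bool

	// Reschedule marks the allocation for rescheduling according to
	// the group's reschedule policy.
	Reschedule *bool

	// MigrateDisablePlacement (new in this changeset) allows a
	// draining allocation to be stopped while preventing the
	// scheduler from placing a replacement.
	MigrateDisablePlacement *bool
}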

Testing & Reproduction steps

batch jobspec
job "sleep-job" {
  type = "batch"

  group "sleeper" {
    count = 5

    reschedule {
      attempts       = 3
      interval       = "15m"
      delay          = "4m"
      delay_function = "constant"
      max_delay      = "5m"
      unlimited      = false
    }

    ephemeral_disk {
      size = 10
    }

    task "do_sleep" {
      driver = "raw_exec"

      logs {
        disabled      = true
        max_files     = 1
        max_file_size = 1
      }

      config {
        command = "sleep"
        args    = ["1d"]
      }

      resources {
        memory = 10
        cpu    = 5
      }
    }

    task "extra_sleep" {
      driver = "raw_exec"

      logs {
        disabled      = true
        max_files     = 1
        max_file_size = 1
      }

      config {
        command = "sleep"
        args    = ["2d"]
      }

      resources {
        memory = 10
        cpu    = 5
      }
    }
  }
}

Behavior on main

alloc stop command

This shows the behavior of the alloc stop command on a batch job allocation. The job is started and then a single allocation is stopped:

➜ nomad run sleep.hcl

==> View this job in the Web UI: http://10.86.244.24:4646/ui/jobs/sleep-job@default

==> 2025-10-17T17:51:06-07:00: Monitoring evaluation "40250ff8"
    2025-10-17T17:51:06-07:00: Evaluation triggered by job "sleep-job"
    2025-10-17T17:51:07-07:00: Allocation "71d6882e" created: node "0e569f27", group "sleeper"
    2025-10-17T17:51:07-07:00: Allocation "8e671f60" created: node "0e569f27", group "sleeper"
    2025-10-17T17:51:07-07:00: Allocation "c72be233" created: node "b0dccea3", group "sleeper"
    2025-10-17T17:51:07-07:00: Allocation "ca3f8856" created: node "6c4fcb70", group "sleeper"
    2025-10-17T17:51:07-07:00: Allocation "421b7a60" created: node "b0dccea3", group "sleeper"
    2025-10-17T17:51:07-07:00: Evaluation status changed: "pending" -> "complete"
==> 2025-10-17T17:51:07-07:00: Evaluation "40250ff8" finished with status "complete"

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-17T17:51:06-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         5        0       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
421b7a60  b0dccea3  sleeper     0        run      running  3s ago   2s ago
71d6882e  0e569f27  sleeper     0        run      running  3s ago   2s ago
8e671f60  0e569f27  sleeper     0        run      running  3s ago   2s ago
c72be233  b0dccea3  sleeper     0        run      running  3s ago   2s ago
ca3f8856  6c4fcb70  sleeper     0        run      running  3s ago   2s ago

➜ nomad alloc stop 42
==> 2025-10-17T17:51:31-07:00: Monitoring evaluation "855d8b1a"
    2025-10-17T17:51:31-07:00: Evaluation triggered by job "sleep-job"
    2025-10-17T17:51:32-07:00: Allocation "8b1af122" created: node "6c4fcb70", group "sleeper"
    2025-10-17T17:51:32-07:00: Evaluation status changed: "pending" -> "complete"
==> 2025-10-17T17:51:32-07:00: Evaluation "855d8b1a" finished with status "complete"

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-17T17:51:06-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         5        1       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
8b1af122  6c4fcb70  sleeper     0        run      running  3s ago   2s ago
421b7a60  b0dccea3  sleeper     0        stop     failed   29s ago  3s ago
71d6882e  0e569f27  sleeper     0        run      running  29s ago  28s ago
8e671f60  0e569f27  sleeper     0        run      running  29s ago  28s ago
c72be233  b0dccea3  sleeper     0        run      running  29s ago  28s ago
ca3f8856  6c4fcb70  sleeper     0        run      running  29s ago  28s ago

Here we can see the result of the alloc stop command: the allocation is stopped in a failed state and is immediately replaced. The desired behavior is that the allocation is stopped with a complete status and rescheduled according to the reschedule policy.
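
For scripted reproduction, the same stop can be issued through the Go API client. This is a hedged sketch using the public github.com/hashicorp/nomad/api package; the allocation ID is a placeholder:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// DefaultConfig honors NOMAD_ADDR and related environment variables.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Look up the allocation by ID (placeholder value).
	alloc, _, err := client.Allocations().Info("421b7a60-0000-0000-0000-000000000000", nil)
	if err != nil {
		log.Fatal(err)
	}

	// Stop the allocation; this is the call whose query parameters
	// were previously dropped by the old helper.
	resp, err := client.Allocations().Stop(alloc, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("monitor eval:", resp.EvalID)
}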

drain behavior

This shows the behavior of a node drain on batch job allocations. The job is started and then a single node is drained with a one-second deadline:

➜ nomad run sleep.hcl

==> View this job in the Web UI: http://10.86.244.24:4646/ui/jobs/sleep-job@default

==> 2025-10-17T17:58:19-07:00: Monitoring evaluation "28b04ae3"
    2025-10-17T17:58:19-07:00: Evaluation triggered by job "sleep-job"
    2025-10-17T17:58:20-07:00: Allocation "8841e305" created: node "6c4fcb70", group "sleeper"
    2025-10-17T17:58:20-07:00: Allocation "de029dc7" created: node "6c4fcb70", group "sleeper"
    2025-10-17T17:58:20-07:00: Allocation "f33973b8" created: node "0e569f27", group "sleeper"
    2025-10-17T17:58:20-07:00: Allocation "2d9fb037" created: node "b0dccea3", group "sleeper"
    2025-10-17T17:58:20-07:00: Allocation "733eb34d" created: node "b0dccea3", group "sleeper"
    2025-10-17T17:58:20-07:00: Evaluation status changed: "pending" -> "complete"
==> 2025-10-17T17:58:20-07:00: Evaluation "28b04ae3" finished with status "complete"

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-17T17:58:19-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         5        0       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
2d9fb037  b0dccea3  sleeper     0        run      running  4s ago   3s ago
733eb34d  b0dccea3  sleeper     0        run      running  4s ago   3s ago
8841e305  6c4fcb70  sleeper     0        run      running  4s ago   3s ago
de029dc7  6c4fcb70  sleeper     0        run      running  4s ago   3s ago
f33973b8  0e569f27  sleeper     0        run      running  4s ago   3s ago


➜ nomad node drain -enable -yes -deadline 1s b0
2025-10-17T17:58:36-07:00: Ctrl-C to stop monitoring: will not cancel the node drain
2025-10-17T17:58:36-07:00: Node "b0dccea3-ab06-6141-474b-05f5892f72b8" drain strategy set
2025-10-17T17:58:38-07:00: Alloc "2d9fb037-5c72-786b-21c2-5e0938463f53" marked for migration
2025-10-17T17:58:38-07:00: Alloc "733eb34d-a409-6469-1245-8607a8c57804" marked for migration
2025-10-17T17:58:38-07:00: Drain complete for node b0dccea3-ab06-6141-474b-05f5892f72b8
2025-10-17T17:58:38-07:00: Alloc "2d9fb037-5c72-786b-21c2-5e0938463f53" draining
2025-10-17T17:58:38-07:00: Alloc "733eb34d-a409-6469-1245-8607a8c57804" draining
2025-10-17T17:58:39-07:00: Alloc "2d9fb037-5c72-786b-21c2-5e0938463f53" status running -> failed
2025-10-17T17:58:39-07:00: Alloc "733eb34d-a409-6469-1245-8607a8c57804" status running -> failed
2025-10-17T17:58:39-07:00: All allocations on node "b0dccea3-ab06-6141-474b-05f5892f72b8" have stopped

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-17T17:58:19-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         5        2       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
10065b8b  0e569f27  sleeper     0        run      running  5s ago   4s ago
9d99b920  0e569f27  sleeper     0        run      running  5s ago   4s ago
2d9fb037  b0dccea3  sleeper     0        stop     failed   25s ago  5s ago
733eb34d  b0dccea3  sleeper     0        stop     failed   25s ago  5s ago
8841e305  6c4fcb70  sleeper     0        run      running  25s ago  24s ago
de029dc7  6c4fcb70  sleeper     0        run      running  25s ago  24s ago
f33973b8  0e569f27  sleeper     0        run      running  25s ago  24s ago

The drain stops the two allocations on the node in a failed state and immediately places two new allocations. For drains, the allocations should instead be stopped with a complete status and not replaced.
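
The drain can also be triggered programmatically; this is a hedged sketch using the Go API client, with the node ID taken from the transcript above:

package main

import (
	"log"
	"time"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Enable draining with a one-second deadline, mirroring
	// `nomad node drain -enable -yes -deadline 1s`; markEligible is
	// false so the node stays ineligible after the drain.
	spec := &api.DrainSpec{Deadline: time.Second}
	if _, err := client.Nodes().UpdateDrain(
		"b0dccea3-ab06-6141-474b-05f5892f72b8", spec, false, nil,
	); err != nil {
		log.Fatal(err)
	}
}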

Behavior with this changeset

alloc stop command
➜ nomad run sleep.hcl

==> 2025-10-20T08:10:34-07:00: Monitoring evaluation "d89ce708"
    2025-10-20T08:10:34-07:00: Evaluation triggered by job "sleep-job"
    2025-10-20T08:10:35-07:00: Allocation "05ad7436" created: node "6c4fcb70", group "sleeper"
    2025-10-20T08:10:35-07:00: Allocation "7a1b5420" created: node "0e569f27", group "sleeper"
    2025-10-20T08:10:35-07:00: Allocation "995f5e33" created: node "b0dccea3", group "sleeper"
    2025-10-20T08:10:35-07:00: Allocation "a5fd7420" created: node "0e569f27", group "sleeper"
    2025-10-20T08:10:35-07:00: Allocation "c5c12c43" created: node "6c4fcb70", group "sleeper"
    2025-10-20T08:10:35-07:00: Evaluation status changed: "pending" -> "complete"
==> 2025-10-20T08:10:35-07:00: Evaluation "d89ce708" finished with status "complete"

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-20T08:10:34-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         5        0       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
05ad7436  6c4fcb70  sleeper     0        run      running  3s ago   2s ago
7a1b5420  0e569f27  sleeper     0        run      running  3s ago   2s ago
995f5e33  b0dccea3  sleeper     0        run      running  3s ago   2s ago
a5fd7420  0e569f27  sleeper     0        run      running  3s ago   2s ago
c5c12c43  6c4fcb70  sleeper     0        run      running  3s ago   2s ago

➜ nomad alloc stop 05
==> 2025-10-20T08:10:43-07:00: Monitoring evaluation "abb43bda"
    2025-10-20T08:10:43-07:00: Evaluation triggered by job "sleep-job"
    2025-10-20T08:10:44-07:00: Evaluation status changed: "pending" -> "complete"
==> 2025-10-20T08:10:44-07:00: Evaluation "abb43bda" finished with status "complete"

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-20T08:10:34-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         4        0       1         0     0

Future Rescheduling Attempts
Task Group  Eval ID   Eval Time
sleeper     63d25748  3m47s from now

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created  Modified
05ad7436  6c4fcb70  sleeper     0        stop     complete  14s ago  4s ago
7a1b5420  0e569f27  sleeper     0        run      running   14s ago  13s ago
995f5e33  b0dccea3  sleeper     0        run      running   14s ago  13s ago
a5fd7420  0e569f27  sleeper     0        run      running   14s ago  13s ago
c5c12c43  6c4fcb70  sleeper     0        run      running   14s ago  13s ago

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-20T08:10:34-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         5        0       1         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
0befef56  b0dccea3  sleeper     0        run      running   3m56s ago  3m55s ago
05ad7436  6c4fcb70  sleeper     0        stop     complete  7m57s ago  7m47s ago
7a1b5420  0e569f27  sleeper     0        run      running   7m57s ago  7m56s ago
995f5e33  b0dccea3  sleeper     0        run      running   7m57s ago  7m56s ago
a5fd7420  0e569f27  sleeper     0        run      running   7m57s ago  7m56s ago
c5c12c43  6c4fcb70  sleeper     0        run      running   7m57s ago  7m56s ago

Now the allocation is stopped in a complete state and has not been immediately replaced. Instead, it has been rescheduled according to the reschedule policy, as documented: once the delayed evaluation runs (roughly four minutes later, matching the jobspec's constant 4m delay), the new allocation is placed.

drain behavior

This shows the behavior of a node drain on batch job allocations. The job is started and then a single node is drained with a one-second deadline:

➜ nomad run sleep.hcl

==> 2025-10-20T08:21:36-07:00: Monitoring evaluation "ad5b6d81"
    2025-10-20T08:21:36-07:00: Evaluation triggered by job "sleep-job"
    2025-10-20T08:21:37-07:00: Allocation "f7af18cc" created: node "0e569f27", group "sleeper"
    2025-10-20T08:21:37-07:00: Allocation "7386d7b1" created: node "b0dccea3", group "sleeper"
    2025-10-20T08:21:37-07:00: Allocation "8392ca41" created: node "6c4fcb70", group "sleeper"
    2025-10-20T08:21:37-07:00: Allocation "8765c6ba" created: node "6c4fcb70", group "sleeper"
    2025-10-20T08:21:37-07:00: Allocation "d647f127" created: node "b0dccea3", group "sleeper"
    2025-10-20T08:21:37-07:00: Evaluation status changed: "pending" -> "complete"
==> 2025-10-20T08:21:37-07:00: Evaluation "ad5b6d81" finished with status "complete"

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-20T08:21:36-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         5        0       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
7386d7b1  b0dccea3  sleeper     0        run      running  4s ago   3s ago
8392ca41  6c4fcb70  sleeper     0        run      running  4s ago   3s ago
8765c6ba  6c4fcb70  sleeper     0        run      running  4s ago   3s ago
d647f127  b0dccea3  sleeper     0        run      running  4s ago   3s ago
f7af18cc  0e569f27  sleeper     0        run      running  4s ago   4s ago

➜ nomad node drain -enable -yes -deadline 1s b0
2025-10-20T08:22:11-07:00: Ctrl-C to stop monitoring: will not cancel the node drain
2025-10-20T08:22:11-07:00: Node "b0dccea3-ab06-6141-474b-05f5892f72b8" drain strategy set
2025-10-20T08:22:13-07:00: Alloc "7386d7b1-fe02-a718-58a5-54dcd196937c" marked for migration
2025-10-20T08:22:13-07:00: Alloc "d647f127-203f-9536-56ea-5f6ee595c493" marked for migration
2025-10-20T08:22:13-07:00: Drain complete for node b0dccea3-ab06-6141-474b-05f5892f72b8
2025-10-20T08:22:14-07:00: Alloc "7386d7b1-fe02-a718-58a5-54dcd196937c" draining
2025-10-20T08:22:14-07:00: Alloc "d647f127-203f-9536-56ea-5f6ee595c493" draining
2025-10-20T08:22:14-07:00: Alloc "7386d7b1-fe02-a718-58a5-54dcd196937c" status running -> complete
2025-10-20T08:22:14-07:00: Alloc "d647f127-203f-9536-56ea-5f6ee595c493" status running -> complete
2025-10-20T08:22:14-07:00: All allocations on node "b0dccea3-ab06-6141-474b-05f5892f72b8" have stopped

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-20T08:21:36-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         3        0       2         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created  Modified
7386d7b1  b0dccea3  sleeper     0        stop     complete  41s ago  4s ago
8392ca41  6c4fcb70  sleeper     0        run      running   41s ago  40s ago
8765c6ba  6c4fcb70  sleeper     0        run      running   41s ago  40s ago
d647f127  b0dccea3  sleeper     0        stop     complete  41s ago  4s ago
f7af18cc  0e569f27  sleeper     0        run      running   41s ago  41s ago

The drain stops the two allocations on the node in a complete state, and the allocations are not replaced. This matches the documented behavior.

New evaluation trigger reason

Nomad's current behavior when rescheduling an allocation is to assume the allocation being replaced has failed. When stopping an allocation, this results in an eval status like the following:

➜ nomad eval status 8dd
ID                 = 8dde8bd1
Create Time        = 24s ago
Modify Time        = 24s ago
Status             = pending
Status Description = created for delayed rescheduling
Type               = batch
TriggeredBy        = alloc-failure
Job ID             = sleep-job
Namespace          = default
...

The TriggeredBy value implies the eval was triggered by the allocation failing, but it was actually triggered by the allocation being rescheduled due to the alloc stop command. To describe the reason more accurately, the EvalTriggerAllocReschedule constant was introduced and used in this situation, which yields the value alloc-reschedule as shown below:

➜ nomad eval status 440
ID                 = 44058981
Create Time        = 10s ago
Modify Time        = 10s ago
Status             = pending
Status Description = created for delayed rescheduling
Type               = batch
TriggeredBy        = alloc-reschedule
Job ID             = sleep-job
Namespace          = default
...
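
As a rough sketch (the existing constant's Go name is an assumption; only EvalTriggerAllocReschedule is introduced by this changeset), the trigger constants look like:

const (
	// Existing trigger used when a failed allocation is retried.
	EvalTriggerRetryFailedAlloc = "alloc-failure"

	// New trigger used when an allocation is rescheduled without
	// having failed, e.g. after alloc stop.
	EvalTriggerAllocReschedule = "alloc-reschedule"
)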

Links

Fixes #26929

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.
  • If a change needs to be reverted, we will roll out an update to the code within 7 days.

Changes to Security Controls

Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.

@chrisroberts marked this pull request as ready for review October 23, 2025 20:17
tgross previously approved these changes Oct 24, 2025

@tgross (Member) left a comment

LGTM! I've left a few very small comments, but nothing critical. We'll need a changelog entry before we can merge it.

In addition to the specific behaviors discussed here, I made a build from this and double-checked we were still getting the expected behavior when a batch allocation simply fails and gets rescheduled. That's still all working as expected. Nice work on this.

Comment on lines -187 to 188
// long pauses on this API call.
//
// BREAKING: This method will have the following signature in 1.6.0
// func (a *Allocations) Stop(allocID string, w *WriteOptions) (*AllocStopResponse, error) {
func (a *Allocations) Stop(alloc *Allocation, q *QueryOptions) (*AllocStopResponse, error) {

Not having a versioned API makes it super painful to fix these kinds of things. You're right, better just to back out that intention to change it and live with it.

// Wait for allocs to be replaced
finalAllocs := waitForAllocsStop(t, store, n1.ID, nil)
waitForPlacedAllocs(t, store, n2.ID, 5)
waitForPlacedAllocs(t, store, n2.ID, 3)

Bah, no wonder this bug lurked for so long!

Comment on lines 159 to 166
reschedule := false
if rescheduleQS := req.URL.Query().Get("reschedule"); rescheduleQS != "" {
	var err error
	reschedule, err = strconv.ParseBool(rescheduleQS)
	if err != nil {
		return nil, fmt.Errorf("reschedule value is not a boolean: %w", err)
	}
}

There's a parseBool helper in command/agent/http.go that you can use like

reschedule, err := parseBool(req, "reschedule")
if err != nil {
	return nil, err
}

(which we could use above for no_shutdown_delay as well)

Comment on lines 31 to 33
// filterServerTerminalAllocs returns a new allocSet that includes only
// batch job type that are not marked for rescheduling or non-server-terminal
// allocations.

Clarifying the compound clause a bit:

Suggested change
// filterServerTerminalAllocs returns a new allocSet that includes only
// batch job type that are not marked for rescheduling or non-server-terminal
// allocations.
// filterServerTerminalAllocs returns a new allocSet that includes only
// non-server-terminal allocations, and batch job allocs that are not marked for rescheduling.
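
For illustration, here is a hypothetical sketch of the filtering that comment describes; allocSet and the exact predicates are assumptions about the reconciler's internals, not the actual implementation:

func filterServerTerminalAllocs(all allocSet) allocSet {
	filtered := make(allocSet)
	for id, alloc := range all {
		// Always keep allocations the servers do not consider terminal.
		if !alloc.ServerTerminalStatus() {
			filtered[id] = alloc
			continue
		}
		// Keep server-terminal batch allocs only when they are not
		// marked for rescheduling.
		if alloc.Job.Type == structs.JobTypeBatch &&
			!alloc.DesiredTransition.ShouldReschedule() {
			filtered[id] = alloc
		}
	}
	return filtered
}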

tgross previously approved these changes Oct 24, 2025

@tgross (Member) left a comment

LGTM!

Allocations of batch jobs have a few defined behaviors documented
which do not work as expected:

First, on node drain, the allocation is allowed to complete unless
the deadline is reached, at which point the allocation is killed. The
allocation is not replaced.

Second, when using the `alloc stop` command, the allocation is
stopped and then rescheduled according to its reschedule policy.

Third, on job restart if the `-reschedule` flag is used the
allocation will be migrated and its reschedule policy will be
ignored.

This update removes the change introduced in dfa07e1 (#26025)
that forced batch job allocations into a failed state when
migrating. The reported issue it was attempting to resolve was
itself incorrect behavior. The reconciler has been adjusted
to properly handle batch job allocations as documented.
@chrisroberts merged commit 3a20db3 into main Oct 29, 2025
40 checks passed
@chrisroberts deleted the f-drain-behavior-main branch October 29, 2025 16:48


Development

Successfully merging this pull request may close these issues.

scheduler: incorrect scheduling of batch job allocations on drain
