
Conversation

@chrisroberts (Member) commented Oct 18, 2025

Description

Batch jobs have a few documented behaviors:

  • On node drain, the allocation is allowed to complete unless the drain deadline is reached, at which point the allocation is killed. The allocation is not replaced.
  • When using the alloc stop command, the allocation is stopped and then rescheduled according to its reschedule policy.
  • On job restart with the -reschedule flag, the allocation is migrated and its reschedule policy is ignored.

This changeset updates Nomad's behavior for batch job allocations so they behave as documented. It removes the modifications introduced in dfa07e1 (#26025) that forced batch job allocations into a failed state when migrating; the reported issue that change attempted to resolve was itself incorrect behavior. The reconciler has been adjusted to properly handle batch job allocations as documented.

Changes of note

Eval trigger reasons

A new eval trigger reason was added to provide better information to the user. It is shown and explained in the final examples below.

Allocations API

The Allocations.Stop function was using an old helper that silently dropped any defined query parameters. It was updated to use the newer helper, which passes the full query.

The Allocation.DesiredTransition function was also updated to match its counterpart, using the same helper functions.
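
Below is a hedged sketch of what the Stop fix looks like, assuming the api package's internal helper names (the query-aware putQuery versus the older write-style helper); it illustrates the change rather than reproducing the exact diff:

func (a *Allocations) Stop(alloc *Allocation, q *QueryOptions) (*AllocStopResponse, error) {
	var resp AllocStopResponse
	// putQuery encodes the caller's QueryOptions into the request,
	// where the older helper silently dropped them.
	_, err := a.client.putQuery("/v1/allocation/"+alloc.ID+"/stop", nil, &resp, q)
	return &resp, err
}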

New desired transition field

A new field was added to the DesiredTransition struct: MigrateDisablePlacement. It is set when draining to allow the allocation to be stopped while preventing a replacement from being placed, achieving the documented drain behavior.
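
A minimal sketch of the field on the api DesiredTransition struct follows; the neighboring fields are from the existing public API, and the comment wording is illustrative:

type DesiredTransition struct {
	// Migrate marks the allocation for migration off its current
	// node, for example during a node drain.
	Migrate *bool

	// Reschedule marks the allocation for rescheduling according to
	// the group's reschedule policy.
	Reschedule *bool

	// MigrateDisablePlacement (new in this changeset) allows a
	// draining allocation to be stopped while preventing the
	// scheduler from placing a replacement.
	MigrateDisablePlacement *bool
}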

Testing & Reproduction steps

batch jobspec
job "sleep-job" {
  type = "batch"

  group "sleeper" {
    count = 5

    reschedule {
      attempts       = 3
      interval       = "15m"
      delay          = "4m"
      delay_function = "constant"
      max_delay      = "5m"
      unlimited      = false
    }

    ephemeral_disk {
      size = 10
    }

    task "do_sleep" {
      driver = "raw_exec"

      logs {
        disabled      = true
        max_files     = 1
        max_file_size = 1
      }

      config {
        command = "sleep"
        args    = ["1d"]
      }

      resources {
        memory = 10
        cpu    = 5
      }
    }

    task "extra_sleep" {
      driver = "raw_exec"

      logs {
        disabled      = true
        max_files     = 1
        max_file_size = 1
      }

      config {
        command = "sleep"
        args    = ["2d"]
      }

      resources {
        memory = 10
        cpu    = 5
      }
    }
  }
}

Behavior on main

alloc stop command

This shows the behavior of the alloc stop command on a batch job allocation. The job is started and then a single allocation is stopped:

➜ nomad run sleep.hcl

==> View this job in the Web UI: http://10.86.244.24:4646/ui/jobs/sleep-job@default

==> 2025-10-17T17:51:06-07:00: Monitoring evaluation "40250ff8"
    2025-10-17T17:51:06-07:00: Evaluation triggered by job "sleep-job"
    2025-10-17T17:51:07-07:00: Allocation "71d6882e" created: node "0e569f27", group "sleeper"
    2025-10-17T17:51:07-07:00: Allocation "8e671f60" created: node "0e569f27", group "sleeper"
    2025-10-17T17:51:07-07:00: Allocation "c72be233" created: node "b0dccea3", group "sleeper"
    2025-10-17T17:51:07-07:00: Allocation "ca3f8856" created: node "6c4fcb70", group "sleeper"
    2025-10-17T17:51:07-07:00: Allocation "421b7a60" created: node "b0dccea3", group "sleeper"
    2025-10-17T17:51:07-07:00: Evaluation status changed: "pending" -> "complete"
==> 2025-10-17T17:51:07-07:00: Evaluation "40250ff8" finished with status "complete"

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-17T17:51:06-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         5        0       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
421b7a60  b0dccea3  sleeper     0        run      running  3s ago   2s ago
71d6882e  0e569f27  sleeper     0        run      running  3s ago   2s ago
8e671f60  0e569f27  sleeper     0        run      running  3s ago   2s ago
c72be233  b0dccea3  sleeper     0        run      running  3s ago   2s ago
ca3f8856  6c4fcb70  sleeper     0        run      running  3s ago   2s ago

➜ nomad alloc stop 42
==> 2025-10-17T17:51:31-07:00: Monitoring evaluation "855d8b1a"
    2025-10-17T17:51:31-07:00: Evaluation triggered by job "sleep-job"
    2025-10-17T17:51:32-07:00: Allocation "8b1af122" created: node "6c4fcb70", group "sleeper"
    2025-10-17T17:51:32-07:00: Evaluation status changed: "pending" -> "complete"
==> 2025-10-17T17:51:32-07:00: Evaluation "855d8b1a" finished with status "complete"

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-17T17:51:06-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         5        1       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
8b1af122  6c4fcb70  sleeper     0        run      running  3s ago   2s ago
421b7a60  b0dccea3  sleeper     0        stop     failed   29s ago  3s ago
71d6882e  0e569f27  sleeper     0        run      running  29s ago  28s ago
8e671f60  0e569f27  sleeper     0        run      running  29s ago  28s ago
c72be233  b0dccea3  sleeper     0        run      running  29s ago  28s ago
ca3f8856  6c4fcb70  sleeper     0        run      running  29s ago  28s ago

Here we can see the result of the alloc stop command: the allocation is stopped in a failed state and is immediately replaced. The desired behavior is that the allocation is stopped with a complete status and rescheduled according to the reschedule policy.
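
For scripted reproduction, the same stop can be issued through the Go API client. This is a hedged sketch using the public github.com/hashicorp/nomad/api package; the allocation ID is a placeholder:

package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// DefaultConfig honors NOMAD_ADDR and related environment variables.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Look up the allocation by ID (placeholder value).
	alloc, _, err := client.Allocations().Info("421b7a60-0000-0000-0000-000000000000", nil)
	if err != nil {
		log.Fatal(err)
	}

	// Stop the allocation; this is the call whose query parameters
	// were previously dropped by the old helper.
	resp, err := client.Allocations().Stop(alloc, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("monitor eval:", resp.EvalID)
}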

drain behavior

This shows the behavior of a node drain on batch job allocations. The job is started and then a single node is drained with a one-second deadline:

➜ nomad run sleep.hcl

==> View this job in the Web UI: http://10.86.244.24:4646/ui/jobs/sleep-job@default

==> 2025-10-17T17:58:19-07:00: Monitoring evaluation "28b04ae3"
    2025-10-17T17:58:19-07:00: Evaluation triggered by job "sleep-job"
    2025-10-17T17:58:20-07:00: Allocation "8841e305" created: node "6c4fcb70", group "sleeper"
    2025-10-17T17:58:20-07:00: Allocation "de029dc7" created: node "6c4fcb70", group "sleeper"
    2025-10-17T17:58:20-07:00: Allocation "f33973b8" created: node "0e569f27", group "sleeper"
    2025-10-17T17:58:20-07:00: Allocation "2d9fb037" created: node "b0dccea3", group "sleeper"
    2025-10-17T17:58:20-07:00: Allocation "733eb34d" created: node "b0dccea3", group "sleeper"
    2025-10-17T17:58:20-07:00: Evaluation status changed: "pending" -> "complete"
==> 2025-10-17T17:58:20-07:00: Evaluation "28b04ae3" finished with status "complete"

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-17T17:58:19-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         5        0       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
2d9fb037  b0dccea3  sleeper     0        run      running  4s ago   3s ago
733eb34d  b0dccea3  sleeper     0        run      running  4s ago   3s ago
8841e305  6c4fcb70  sleeper     0        run      running  4s ago   3s ago
de029dc7  6c4fcb70  sleeper     0        run      running  4s ago   3s ago
f33973b8  0e569f27  sleeper     0        run      running  4s ago   3s ago


➜ nomad node drain -enable -yes -deadline 1s b0
2025-10-17T17:58:36-07:00: Ctrl-C to stop monitoring: will not cancel the node drain
2025-10-17T17:58:36-07:00: Node "b0dccea3-ab06-6141-474b-05f5892f72b8" drain strategy set
2025-10-17T17:58:38-07:00: Alloc "2d9fb037-5c72-786b-21c2-5e0938463f53" marked for migration
2025-10-17T17:58:38-07:00: Alloc "733eb34d-a409-6469-1245-8607a8c57804" marked for migration
2025-10-17T17:58:38-07:00: Drain complete for node b0dccea3-ab06-6141-474b-05f5892f72b8
2025-10-17T17:58:38-07:00: Alloc "2d9fb037-5c72-786b-21c2-5e0938463f53" draining
2025-10-17T17:58:38-07:00: Alloc "733eb34d-a409-6469-1245-8607a8c57804" draining
2025-10-17T17:58:39-07:00: Alloc "2d9fb037-5c72-786b-21c2-5e0938463f53" status running -> failed
2025-10-17T17:58:39-07:00: Alloc "733eb34d-a409-6469-1245-8607a8c57804" status running -> failed
2025-10-17T17:58:39-07:00: All allocations on node "b0dccea3-ab06-6141-474b-05f5892f72b8" have stopped

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-17T17:58:19-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         5        2       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
10065b8b  0e569f27  sleeper     0        run      running  5s ago   4s ago
9d99b920  0e569f27  sleeper     0        run      running  5s ago   4s ago
2d9fb037  b0dccea3  sleeper     0        stop     failed   25s ago  5s ago
733eb34d  b0dccea3  sleeper     0        stop     failed   25s ago  5s ago
8841e305  6c4fcb70  sleeper     0        run      running  25s ago  24s ago
de029dc7  6c4fcb70  sleeper     0        run      running  25s ago  24s ago
f33973b8  0e569f27  sleeper     0        run      running  25s ago  24s ago

The drain stops the two allocations on the node in a failed state and immediately places two new allocations. For drains, the allocations should instead be stopped with a complete status and not replaced.
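
The drain can also be triggered programmatically; this is a hedged sketch using the Go API client, with the node ID taken from the transcript above:

package main

import (
	"log"
	"time"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Enable draining with a one-second deadline, mirroring
	// `nomad node drain -enable -yes -deadline 1s`; markEligible is
	// false so the node stays ineligible after the drain.
	spec := &api.DrainSpec{Deadline: time.Second}
	if _, err := client.Nodes().UpdateDrain(
		"b0dccea3-ab06-6141-474b-05f5892f72b8", spec, false, nil,
	); err != nil {
		log.Fatal(err)
	}
}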

Behavior with this changeset

alloc stop command
➜ nomad run sleep.hcl

==> 2025-10-20T08:10:34-07:00: Monitoring evaluation "d89ce708"
    2025-10-20T08:10:34-07:00: Evaluation triggered by job "sleep-job"
    2025-10-20T08:10:35-07:00: Allocation "05ad7436" created: node "6c4fcb70", group "sleeper"
    2025-10-20T08:10:35-07:00: Allocation "7a1b5420" created: node "0e569f27", group "sleeper"
    2025-10-20T08:10:35-07:00: Allocation "995f5e33" created: node "b0dccea3", group "sleeper"
    2025-10-20T08:10:35-07:00: Allocation "a5fd7420" created: node "0e569f27", group "sleeper"
    2025-10-20T08:10:35-07:00: Allocation "c5c12c43" created: node "6c4fcb70", group "sleeper"
    2025-10-20T08:10:35-07:00: Evaluation status changed: "pending" -> "complete"
==> 2025-10-20T08:10:35-07:00: Evaluation "d89ce708" finished with status "complete"

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-20T08:10:34-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         5        0       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
05ad7436  6c4fcb70  sleeper     0        run      running  3s ago   2s ago
7a1b5420  0e569f27  sleeper     0        run      running  3s ago   2s ago
995f5e33  b0dccea3  sleeper     0        run      running  3s ago   2s ago
a5fd7420  0e569f27  sleeper     0        run      running  3s ago   2s ago
c5c12c43  6c4fcb70  sleeper     0        run      running  3s ago   2s ago

➜ nomad alloc stop 05
==> 2025-10-20T08:10:43-07:00: Monitoring evaluation "abb43bda"
    2025-10-20T08:10:43-07:00: Evaluation triggered by job "sleep-job"
    2025-10-20T08:10:44-07:00: Evaluation status changed: "pending" -> "complete"
==> 2025-10-20T08:10:44-07:00: Evaluation "abb43bda" finished with status "complete"

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-20T08:10:34-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         4        0       1         0     0

Future Rescheduling Attempts
Task Group  Eval ID   Eval Time
sleeper     63d25748  3m47s from now

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created  Modified
05ad7436  6c4fcb70  sleeper     0        stop     complete  14s ago  4s ago
7a1b5420  0e569f27  sleeper     0        run      running   14s ago  13s ago
995f5e33  b0dccea3  sleeper     0        run      running   14s ago  13s ago
a5fd7420  0e569f27  sleeper     0        run      running   14s ago  13s ago
c5c12c43  6c4fcb70  sleeper     0        run      running   14s ago  13s ago

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-20T08:10:34-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         5        0       1         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
0befef56  b0dccea3  sleeper     0        run      running   3m56s ago  3m55s ago
05ad7436  6c4fcb70  sleeper     0        stop     complete  7m57s ago  7m47s ago
7a1b5420  0e569f27  sleeper     0        run      running   7m57s ago  7m56s ago
995f5e33  b0dccea3  sleeper     0        run      running   7m57s ago  7m56s ago
a5fd7420  0e569f27  sleeper     0        run      running   7m57s ago  7m56s ago
c5c12c43  6c4fcb70  sleeper     0        run      running   7m57s ago  7m56s ago

Now the allocation is stopped in a complete state and has not been immediately replaced. Instead, it has been rescheduled according to the reschedule policy, as documented: once the delayed evaluation runs (roughly four minutes later, matching the jobspec's constant 4m delay), the new allocation is placed.

drain behavior

This shows the behavior of a node drain on batch job allocations. The job is started and then a single node is drained with a one-second deadline:

➜ nomad run sleep.hcl

==> 2025-10-20T08:21:36-07:00: Monitoring evaluation "ad5b6d81"
    2025-10-20T08:21:36-07:00: Evaluation triggered by job "sleep-job"
    2025-10-20T08:21:37-07:00: Allocation "f7af18cc" created: node "0e569f27", group "sleeper"
    2025-10-20T08:21:37-07:00: Allocation "7386d7b1" created: node "b0dccea3", group "sleeper"
    2025-10-20T08:21:37-07:00: Allocation "8392ca41" created: node "6c4fcb70", group "sleeper"
    2025-10-20T08:21:37-07:00: Allocation "8765c6ba" created: node "6c4fcb70", group "sleeper"
    2025-10-20T08:21:37-07:00: Allocation "d647f127" created: node "b0dccea3", group "sleeper"
    2025-10-20T08:21:37-07:00: Evaluation status changed: "pending" -> "complete"
==> 2025-10-20T08:21:37-07:00: Evaluation "ad5b6d81" finished with status "complete"

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-20T08:21:36-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         5        0       0         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
7386d7b1  b0dccea3  sleeper     0        run      running  4s ago   3s ago
8392ca41  6c4fcb70  sleeper     0        run      running  4s ago   3s ago
8765c6ba  6c4fcb70  sleeper     0        run      running  4s ago   3s ago
d647f127  b0dccea3  sleeper     0        run      running  4s ago   3s ago
f7af18cc  0e569f27  sleeper     0        run      running  4s ago   4s ago

➜ nomad node drain -enable -yes -deadline 1s b0
2025-10-20T08:22:11-07:00: Ctrl-C to stop monitoring: will not cancel the node drain
2025-10-20T08:22:11-07:00: Node "b0dccea3-ab06-6141-474b-05f5892f72b8" drain strategy set
2025-10-20T08:22:13-07:00: Alloc "7386d7b1-fe02-a718-58a5-54dcd196937c" marked for migration
2025-10-20T08:22:13-07:00: Alloc "d647f127-203f-9536-56ea-5f6ee595c493" marked for migration
2025-10-20T08:22:13-07:00: Drain complete for node b0dccea3-ab06-6141-474b-05f5892f72b8
2025-10-20T08:22:14-07:00: Alloc "7386d7b1-fe02-a718-58a5-54dcd196937c" draining
2025-10-20T08:22:14-07:00: Alloc "d647f127-203f-9536-56ea-5f6ee595c493" draining
2025-10-20T08:22:14-07:00: Alloc "7386d7b1-fe02-a718-58a5-54dcd196937c" status running -> complete
2025-10-20T08:22:14-07:00: Alloc "d647f127-203f-9536-56ea-5f6ee595c493" status running -> complete
2025-10-20T08:22:14-07:00: All allocations on node "b0dccea3-ab06-6141-474b-05f5892f72b8" have stopped

➜ nomad status sleep-job
ID            = sleep-job
Name          = sleep-job
Submit Date   = 2025-10-20T08:21:36-07:00
Type          = batch
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
sleeper     0       0         3        0       2         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created  Modified
7386d7b1  b0dccea3  sleeper     0        stop     complete  41s ago  4s ago
8392ca41  6c4fcb70  sleeper     0        run      running   41s ago  40s ago
8765c6ba  6c4fcb70  sleeper     0        run      running   41s ago  40s ago
d647f127  b0dccea3  sleeper     0        stop     complete  41s ago  4s ago
f7af18cc  0e569f27  sleeper     0        run      running   41s ago  41s ago

The drain stops the two allocations on the node in a complete state, and the allocations are not replaced. This matches the documented behavior.

New evaluation trigger reason

Nomad's current behavior when rescheduling an allocation is to assume the allocation being replaced has failed. When stopping an allocation, this results in an eval status like the following:

➜ nomad eval status 8dd
ID                 = 8dde8bd1
Create Time        = 24s ago
Modify Time        = 24s ago
Status             = pending
Status Description = created for delayed rescheduling
Type               = batch
TriggeredBy        = alloc-failure
Job ID             = sleep-job
Namespace          = default
...

The TriggeredBy value implies the eval was triggered by the allocation failing, but it was actually triggered by the allocation being rescheduled due to the alloc stop command. To describe the reason more accurately, the EvalTriggerAllocReschedule constant was introduced and used in this situation, which yields the value alloc-reschedule as shown below:

➜ nomad eval status 440
ID                 = 44058981
Create Time        = 10s ago
Modify Time        = 10s ago
Status             = pending
Status Description = created for delayed rescheduling
Type               = batch
TriggeredBy        = alloc-reschedule
Job ID             = sleep-job
Namespace          = default
...
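
As a rough sketch (the existing constant's Go name is an assumption; only EvalTriggerAllocReschedule is introduced by this changeset), the trigger constants look like:

const (
	// Existing trigger used when a failed allocation is retried.
	EvalTriggerRetryFailedAlloc = "alloc-failure"

	// New trigger used when an allocation is rescheduled without
	// having failed, e.g. after alloc stop.
	EvalTriggerAllocReschedule = "alloc-reschedule"
)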

Links

Fixes #26929

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.
  • If a change needs to be reverted, we will roll out an update to the code within 7 days.

Changes to Security Controls

Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.

@chrisroberts marked this pull request as ready for review October 23, 2025 20:17
tgross previously approved these changes Oct 24, 2025

@tgross (Member) left a comment

LGTM! I've left a few very small comments, but nothing critical. We'll need a changelog entry before we can merge it.

In addition to the specific behaviors discussed here, I made a build from this and double-checked we were still getting the expected behavior when a batch allocation simply fails and gets rescheduled. That's still all working as expected. Nice work on this.

Comment on lines -187 to 188
// long pauses on this API call.
//
// BREAKING: This method will have the following signature in 1.6.0
// func (a *Allocations) Stop(allocID string, w *WriteOptions) (*AllocStopResponse, error) {
func (a *Allocations) Stop(alloc *Allocation, q *QueryOptions) (*AllocStopResponse, error) {

Not having a versioned API makes it super painful to fix these kinds of things. You're right, better just to back out that intention to change it and live with it.

// Wait for allocs to be replaced
finalAllocs := waitForAllocsStop(t, store, n1.ID, nil)
waitForPlacedAllocs(t, store, n2.ID, 5)
waitForPlacedAllocs(t, store, n2.ID, 3)

Bah, no wonder this bug lurked for so long!

Comment on lines 159 to 166
reschedule := false
if rescheduleQS := req.URL.Query().Get("reschedule"); rescheduleQS != "" {
	var err error
	reschedule, err = strconv.ParseBool(rescheduleQS)
	if err != nil {
		return nil, fmt.Errorf("reschedule value is not a boolean: %w", err)
	}
}

There's a parseBool helper in command/agent/http.go that you can use like

reschedule, err := parseBool(req, "reschedule")
if err != nil {
	return nil, err
}

(which we could use above for no_shutdown_delay as well)

Comment on lines 31 to 33
// filterServerTerminalAllocs returns a new allocSet that includes only
// batch job type that are not marked for rescheduling or non-server-terminal
// allocations.

Clarifying the compound clause a bit:

Suggested change
// filterServerTerminalAllocs returns a new allocSet that includes only
// batch job type that are not marked for rescheduling or non-server-terminal
// allocations.
// filterServerTerminalAllocs returns a new allocSet that includes only
// non-server-terminal allocations, and batch job allocs that are not marked for rescheduling.
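
For illustration, here is a hypothetical sketch of the filtering that comment describes; allocSet and the exact predicates are assumptions about the reconciler's internals, not the actual implementation:

func filterServerTerminalAllocs(all allocSet) allocSet {
	filtered := make(allocSet)
	for id, alloc := range all {
		// Always keep allocations the servers do not consider terminal.
		if !alloc.ServerTerminalStatus() {
			filtered[id] = alloc
			continue
		}
		// Keep server-terminal batch allocs only when they are not
		// marked for rescheduling.
		if alloc.Job.Type == structs.JobTypeBatch &&
			!alloc.DesiredTransition.ShouldReschedule() {
			filtered[id] = alloc
		}
	}
	return filtered
}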

tgross previously approved these changes Oct 24, 2025

@tgross (Member) left a comment

LGTM!

Allocations of batch jobs have a few defined behaviors documented
which do not work as expected:

First, on node drain, the allocation is allowed to complete unless
the deadline is reached, at which point the allocation is killed. The
allocation is not replaced.

Second, when using the `alloc stop` command, the allocation is
stopped and then rescheduled according to its reschedule policy.

Third, on job restart if the `-reschedule` flag is used the
allocation will be migrated and its reschedule policy will be
ignored.

This update removes the change introduced in dfa07e1 (#26025)
that forced batch job allocations into a failed state when
migrating. The reported issue it was attempting to resolve was
itself incorrect behavior. The reconciler has been adjusted
to properly handle batch job allocations as documented.
@chrisroberts merged commit 3a20db3 into main Oct 29, 2025
40 checks passed
@chrisroberts deleted the f-drain-behavior-main branch October 29, 2025 16:48


Development

Successfully merging this pull request may close these issues.

scheduler: incorrect scheduling of batch job allocations on drain
