[ddmd] add --no-state-machine flag for test fixtures and Linux build #729

Open: zeeshanlakhani wants to merge 2 commits into main from zl/ddmd-no-state-machine

@zeeshanlakhani zeeshanlakhani commented May 7, 2026

Omicron's oxidecomputer/omicron#10381 introduces a stubbed `ddmd` admin endpoint because spawning a real `ddmd` in a generic test toolchain is not viable: the routing state machine (discovery, exchange, route synchronization) depends on illumos networking facilities the toolchain does not provide. Consumers of the stub, e.g., Nexus RPW (multicast members), sled-agent's DDM reconciler, and anything that resolves the DDM internal-DNS service name, cannot exercise the real admin surface from Omicron's test harness.

This work adds an opt-in `--no-state-machine` flag to `ddmd` that runs only the admin API server and skips the state machine entirely, allowing the fixture to spawn the real binary. This is analogous to `mgd --no-bgp-dispatcher`, which Omicron's `MgdInstance` already uses for the same purpose.
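A minimal sketch of the intended behavior, assuming a hypothetical `Opts` struct and `Service` enum (neither name is taken from the `ddmd` source): the admin API is always started, and the state machine is started only when the flag is absent.

```rust
/// Hypothetical subset of ddmd's command-line options (illustration only).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Opts {
    /// Mirrors the `--no-state-machine` flag described above.
    pub no_state_machine: bool,
}

/// Hypothetical labels for the two halves of the daemon.
#[derive(Debug, PartialEq, Eq)]
pub enum Service {
    AdminApi,
    StateMachine,
}

/// Decide which services to start: the admin API always runs,
/// the routing state machine only when it has not been disabled.
pub fn services_to_start(opts: &Opts) -> Vec<Service> {
    let mut services = vec![Service::AdminApi];
    if !opts.no_state_machine {
        services.push(Service::StateMachine);
    }
    services
}
```

With the default options both services are selected; with `no_state_machine: true` only `Service::AdminApi` remains, which is the mode the test fixture relies on.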

To make the fixture path usable on Linux, `ddmd` itself must build on Linux. The previous code pulled the illumos-only crates `libnet`, `dpd-client`, `opte-ioctl`, and `oxide-vpc` unconditionally through `ddm`, which failed to link on Linux (`-lzfs`, `-ldlpi`). This change introduces an `illumos` feature in both `ddm` and `ddmd` (default-on, mirroring `mgd`'s `mg-lower` pattern) that marks those four crates optional. The buildomat `linux.sh` job now builds `ddmd` and `ddmadm`, with `ddmd` invoked as `cargo build --bin ddmd --no-default-features`.

The illumos-only halves of `ddm` are isolated by the feature gate:

- The routing state machine implementation moves from `sm.rs` into `sm/state.rs`.
- The exchange runtime (HTTP push/pull and route programming) moves from `exchange.rs` into `exchange/runtime.rs`.
- The discovery runtime (UDPv6 solicitation/advertisement loops) moves from `discovery.rs` into `discovery/runtime.rs`.

Each parent `mod.rs` keeps the platform-agnostic types and re-exports the runtime surface so existing call sites resolve unchanged on illumos. The runtime submodules are gated as a unit by `#[cfg(all(feature = "illumos", target_os = "illumos"))]`. We also remove the single-function `ddm/src/util.rs`, inlining the function into `discovery/runtime.rs`, where its sole caller lives.
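The layout can be sketched with inline modules. This is an illustration under assumptions, not the actual `ddm` code: `PacketKind`, `describe`, and `runtime_compiled_in` are hypothetical names, and only the `cfg` predicate is taken from the description above.

```rust
pub mod discovery {
    /// Platform-agnostic type that stays in the parent module
    /// (hypothetical stand-in for the real shared types).
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum PacketKind {
        Solicitation,
        Advertisement,
    }

    /// The runtime half is compiled only when the feature is enabled
    /// AND the target is illumos, so it is gated as a single unit.
    #[cfg(all(feature = "illumos", target_os = "illumos"))]
    mod runtime {
        use super::PacketKind;

        /// Hypothetical runtime entry point.
        pub fn describe(kind: PacketKind) -> String {
            format!("sending {kind:?}")
        }
    }

    /// Re-export under the same gate so that callers on illumos keep
    /// using `discovery::describe` without knowing about `runtime`.
    #[cfg(all(feature = "illumos", target_os = "illumos"))]
    pub use runtime::describe;
}

/// Reports whether the gated runtime was compiled into this build.
pub fn runtime_compiled_in() -> bool {
    cfg!(all(feature = "illumos", target_os = "illumos"))
}
```

On a Linux build without the feature, `discovery::PacketKind` is still available while `discovery::describe` does not exist at all, which is what keeps the illumos-only dependencies out of the link step.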

The SIGTERM cleanup handler is installed regardless of the flag, so Ctrl-C still exits cleanly in `--no-state-machine` mode. The imported route sets are empty in that mode, so the cleanup itself is a no-op. Passing `--addr` alongside `--no-state-machine` is harmless but ignored, with a warning logged.
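To illustrate why the cleanup is a no-op, here is a rough sketch with hypothetical names (`cleanup_imported`, `ignored_addr_warning`) and a plain set of strings standing in for the imported route sets. The real handler and log message are not shown in this description.

```rust
use std::collections::BTreeSet;

/// Withdraw every imported prefix through `withdraw` and return how many
/// were handled. With an empty set the loop body never runs.
pub fn cleanup_imported(
    imported: &BTreeSet<String>,
    mut withdraw: impl FnMut(&str),
) -> usize {
    let mut handled = 0;
    for prefix in imported {
        withdraw(prefix.as_str());
        handled += 1;
    }
    handled
}

/// Return the warning to log when `--addr` values were supplied but will
/// be ignored because the state machine is disabled. The message text is
/// an assumption made for this example.
pub fn ignored_addr_warning(no_state_machine: bool, addrs: &[&str]) -> Option<String> {
    if no_state_machine && !addrs.is_empty() {
        Some(format!(
            "--no-state-machine set; ignoring {} --addr value(s)",
            addrs.len()
        ))
    } else {
        None
    }
}
```

Because nothing is ever imported in `--no-state-machine` mode, `cleanup_imported` is called with an empty set and never invokes `withdraw`, so installing the handler unconditionally costs nothing.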

@zeeshanlakhani zeeshanlakhani requested review from jgallagher and taspelund and removed request for jgallagher and taspelund May 7, 2026 01:59
@zeeshanlakhani zeeshanlakhani force-pushed the zl/ddmd-no-state-machine branch from 6a10770 to 6dcf010 Compare May 7, 2026 03:23
@zeeshanlakhani zeeshanlakhani changed the title from "[ddmd] add --no-state-machine flag for test fixtures" to "[ddmd] add --no-state-machine flag for test fixtures and Linux build" May 7, 2026
@zeeshanlakhani zeeshanlakhani force-pushed the zl/ddmd-no-state-machine branch from 6dcf010 to 3b54e16 Compare May 7, 2026 04:25
zeeshanlakhani added a commit to oxidecomputer/omicron that referenced this pull request May 7, 2026
…fixture

We address @jgallagher's review by:

- Replacing the four positional `u16` arguments in `DnsConfigBuilder::host_zone_switch`
  with a `HostSwitchZonePorts` named-fields structure.

- Replacing the dropshot-based stubbed `DdmInstance` in test-utils with a
  fixture that spawns and supervises a real `ddmd` subprocess running with
  `--no-state-machine`, analogous to `MgdInstance` and `mgd --no-bgp-dispatcher`.
  Only the switch-zone `ddmd` is registered in internal DNS, while sled-global-zone
  instances are accessed locally by their own host and don't need DNS registration.

  This **does** require maghemite changes, already PR'ed to oxidecomputer/maghemite#729.

  To make this all work, we wire `ddmd` into the developer xtask toolchain.
  `cargo xtask download maghemite-ddmd` reuses the existing `mg-ddm.tar.gz`
  illumos zone artifact (extracting `ddmd`/`ddmadm`). On Linux it overlays a
  raw `ddmd` binary, and on macOS it builds from source.

Also, we had to bump `oxnet` from 0.1.4 to 0.1.5 to satisfy the new maghemite pin.
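
The motivation for the named-fields structure can be sketched as follows. The type `ExamplePorts` and its field names are hypothetical; the actual fields of `HostSwitchZonePorts` are not shown in this thread.

```rust
/// Hypothetical named-fields bundle of four ports (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ExamplePorts {
    pub dendrite: u16,
    pub mgs: u16,
    pub mgd: u16,
    pub ddmd: u16,
}

/// Positional form: four `u16` values that the compiler cannot tell apart,
/// so swapping two arguments at a call site compiles without complaint.
pub fn describe_positional(dendrite: u16, mgs: u16, mgd: u16, ddmd: u16) -> String {
    format!("dendrite={dendrite} mgs={mgs} mgd={mgd} ddmd={ddmd}")
}

/// Named form: every value is labeled where the struct is constructed.
pub fn describe_named(ports: ExamplePorts) -> String {
    describe_positional(ports.dendrite, ports.mgs, ports.mgd, ports.ddmd)
}
```

A call such as `describe_positional(2, 1, 3, 4)` silently assigns the wrong ports, whereas the struct literal makes each assignment explicit and order-independent.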
Contributor

@jgallagher jgallagher left a comment

Just a few notes on the mechanics of the split; I'll defer to folks who know maghemite better for the code organization.

Comment thread ddmd/src/main.rs Outdated
Comment thread ddmd/src/main.rs Outdated
Comment thread ddm/Cargo.toml Outdated
Includes:

- Reject `--no-state-machine` together with `--addr` at the clap level via `conflicts_with`.
- Collapse the two cfg-gated `termination_handler` variants into a single function with a cfg-gated body.
- Rename the `illumos` Cargo feature to `state-machine` so that it describes the gated
  functionality (and matches the CLI flag) rather than colliding semantically with
  `target_os = "illumos"`.
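
A dependency-free sketch of the first two items. The commit itself uses clap's `conflicts_with`; the hand-rolled check below only mimics the resulting behavior, and `Conflict`, `reject_conflicts`, and `termination_action` are hypothetical names. The `--addr=value` form is assumed for illustration.

```rust
/// Stand-in for the error clap reports when two flags conflict.
#[derive(Debug, PartialEq, Eq)]
pub struct Conflict {
    pub first: &'static str,
    pub second: &'static str,
}

/// True if `flag` appears as `--flag` or `--flag=value`.
fn has_flag(args: &[&str], flag: &str) -> bool {
    args.iter().any(|a| {
        *a == flag || a.strip_prefix(flag).map_or(false, |rest| rest.starts_with('='))
    })
}

/// Reject the combination up front instead of warning and ignoring `--addr`.
pub fn reject_conflicts(args: &[&str]) -> Result<(), Conflict> {
    if has_flag(args, "--no-state-machine") && has_flag(args, "--addr") {
        return Err(Conflict { first: "--no-state-machine", second: "--addr" });
    }
    Ok(())
}

/// One function whose body is cfg-gated, rather than two cfg-gated
/// copies of the whole function. Exactly one `let` survives compilation.
pub fn termination_action() -> &'static str {
    #[cfg(all(feature = "state-machine", target_os = "illumos"))]
    let action = "withdraw imported routes";
    #[cfg(not(all(feature = "state-machine", target_os = "illumos")))]
    let action = "nothing to clean up";
    action
}
```

Rejecting the combination at parse time turns a silently ignored option into an immediate error, and keeping a single function signature means callers do not need their own `cfg` attributes.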