test flake due to sp-sim's ereport port receiving an unexpected UDP datagram #10387

@hawkw

The integration_tests::affinity::test_affinity_group_usage test failed on a CI run on pull request #10382:

https://github.com/oxidecomputer/omicron/pull/10382/checks?check_run_id=74546484800

Log showing the specific test failure:

https://buildomat.eng.oxide.computer/wg/0/details/01KQXPZCBW2NHNXNEH0BGF1EQB/BcTcBzPAAKSDAzJ3U3i0Pi7K6b8Y806PLnRRLXofAduFgCIf/01KQXPZSQB0ZRAP1X61NM18WMQ#S8248

Excerpt from the log showing the failure:

    
  stderr ───
    log file: /var/tmp/omicron_tmp/test_all-b1323e5628efcde2-test_affinity_group_usage.17207.0.log
    note: configured to log to "/var/tmp/omicron_tmp/test_all-b1323e5628efcde2-test_affinity_group_usage.17207.0.log"
    DB URL: postgresql://root@[::1]:56137/omicron?sslmode=disable
    DB address: [::1]:56137
    log file: /var/tmp/omicron_tmp/test_all-b1323e5628efcde2-test_affinity_group_usage.17207.2.log
    note: configured to log to "/var/tmp/omicron_tmp/test_all-b1323e5628efcde2-test_affinity_group_usage.17207.2.log"
    log file: /var/tmp/omicron_tmp/test_all-b1323e5628efcde2-test_affinity_group_usage.17207.3.log
    note: configured to log to "/var/tmp/omicron_tmp/test_all-b1323e5628efcde2-test_affinity_group_usage.17207.3.log"

    thread 'integration_tests::affinity::test_affinity_group_usage' (2) panicked at sp-sim/src/gimlet.rs:429:62:
    called `Result::unwrap()` on an `Err` value: couldn't understand ereport header from [::1]:47491: Size(SizeError)

    Stack backtrace:
       0: <anyhow::Error>::msg::<alloc::string::String>
                 at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/anyhow-1.0.102/src/backtrace.rs:10:14
       1: anyhow::__private::format_err
                 at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/anyhow-1.0.102/src/lib.rs:690:13
       2: sp_sim::ereport::recv_request::{closure#0}::{closure#0}
                 at /work/oxidecomputer/omicron/sp-sim/src/ereport.rs:51:17
       3: <core::result::Result<gateway_ereport_messages::Request, zerocopy::error::ConvertError<core::convert::Infallible, zerocopy::error::SizeError<&[u8], gateway_ereport_messages::Request>, zerocopy::error::ValidityError<&[u8], gateway_ereport_messages::Request>>>>::map_err::<anyhow::Error, sp_sim::ereport::recv_request::{closure#0}::{closure#0}>
                 at /rustc/4a4ef493e3a1488c6e321570238084b38948f6db/library/core/src/result.rs:968:27
       4: {async_fn#0}
                 at /work/oxidecomputer/omicron/sp-sim/src/ereport.rs:50:60
       5: <sp_sim::gimlet::UdpTask>::run::{closure#0}::{closure#0}
                 at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.52.1/src/macros/select.rs:705:49
       6: <core::future::poll_fn::PollFn<<sp_sim::gimlet::UdpTask>::run::{closure#0}::{closure#0}> as core::future::future::Future>::poll
                 at /rustc/4a4ef493e3a1488c6e321570238084b38948f6db/library/core/src/future/poll_fn.rs:151:9
       7: {async_fn#0}
                 at /work/oxidecomputer/omicron/sp-sim/src/gimlet.rs:723:13
       8: {async_block#6}
                 at /work/oxidecomputer/omicron/sp-sim/src/gimlet.rs:429:56
       9: poll<alloc::boxed::Box<sp_sim::gimlet::{impl#2}::spawn::{async_fn#0}::{async_block_env#6}, alloc::alloc::Global>>
                 at /rustc/4a4ef493e3a1488c6e321570238084b38948f6db/library/core/src/future/future.rs:133:9

This ticket diverges a bit from the standard "test flake in CI" template, as I am fairly certain that the flake has little to do with this test in particular. Instead, as I discussed in #10382 (comment), I think that something[1] sent an unanticipated UDP datagram to the SP simulator's ereport port. The datagram was not a valid ereport request header, so the simulator panicked. I presume this could have happened to any test, and it hit test_affinity_group_usage purely by chance.

I had thought it was a good idea to make the SP sim panic upon receipt of a UDP datagram that is not a valid ereport header, so that we fail any test where MGS sends something malformed. However, I don't think this one came from MGS: looking through the log files, I didn't see anything that looked like an ereport request, and the ereport ingester background task is disabled in this test.

Sadly, I haven't been able to determine which process owned port 47491 at the time of the event. I'd like to figure out whether this is just a strange random fluke that we shouldn't worry too much about, or whether we need to change the SP simulator code to not panic here to prevent future flakes.
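To make the "don't panic" option concrete, here is a minimal sketch of what it might look like. This is not the actual sp-sim code: it assumes a hypothetical fixed 8-byte header rather than the real `gateway_ereport_messages::Request` layout, and `Header`, `HeaderError`, `parse_header`, and `handle_datagram` are made-up names for illustration. The idea is only that a datagram of the wrong size gets logged and dropped instead of reaching an `unwrap()`.

```rust
use std::net::SocketAddr;

/// Size of the HYPOTHETICAL request header used in this sketch.
/// The real ereport request header layout may differ.
const HEADER_LEN: usize = 8;

/// Hypothetical header fields, chosen only for illustration.
#[derive(Debug, PartialEq, Eq)]
struct Header {
    version: u8,
    flags: u8,
    limit: u16,
    start: u32,
}

/// Stand-in for the size error seen in the panic message.
#[derive(Debug, PartialEq, Eq)]
enum HeaderError {
    Size { expected: usize, actual: usize },
}

/// Parse a datagram as a header, rejecting any buffer whose length
/// is not exactly `HEADER_LEN`.
fn parse_header(buf: &[u8]) -> Result<Header, HeaderError> {
    if buf.len() != HEADER_LEN {
        return Err(HeaderError::Size {
            expected: HEADER_LEN,
            actual: buf.len(),
        });
    }
    Ok(Header {
        version: buf[0],
        flags: buf[1],
        limit: u16::from_le_bytes([buf[2], buf[3]]),
        start: u32::from_le_bytes([buf[4], buf[5], buf[6], buf[7]]),
    })
}

/// Handle one received datagram. A malformed datagram is logged and
/// dropped (returns `None`) rather than propagated as a fatal error.
fn handle_datagram(buf: &[u8], peer: SocketAddr) -> Option<Header> {
    match parse_header(buf) {
        Ok(header) => Some(header),
        Err(err) => {
            eprintln!("ignoring malformed datagram from {peer}: {err:?}");
            None
        }
    }
}
```

The trade-off is the one described above: if stray datagrams are silently dropped, a genuinely malformed request from MGS would no longer fail the test by panicking. One possible middle ground (again, just an assumption) would be to count dropped datagrams and let tests that care assert on that count.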

Footnotes

  1. Unfortunately, I was not really able to find any hint in the test's logs as to what that might have been, though I'm pretty sure it was not the test's MGS.

Labels: Test Flake