The integration_tests::affinity::test_affinity_group_usage test failed on a CI run on pull request #10382:
https://github.com/oxidecomputer/omicron/pull/10382/checks?check_run_id=74546484800
Log showing the specific test failure:
https://buildomat.eng.oxide.computer/wg/0/details/01KQXPZCBW2NHNXNEH0BGF1EQB/BcTcBzPAAKSDAzJ3U3i0Pi7K6b8Y806PLnRRLXofAduFgCIf/01KQXPZSQB0ZRAP1X61NM18WMQ#S8248
Excerpt from the log showing the failure:
stderr ───
log file: /var/tmp/omicron_tmp/test_all-b1323e5628efcde2-test_affinity_group_usage.17207.0.log
note: configured to log to "/var/tmp/omicron_tmp/test_all-b1323e5628efcde2-test_affinity_group_usage.17207.0.log"
DB URL: postgresql://root@[::1]:56137/omicron?sslmode=disable
DB address: [::1]:56137
log file: /var/tmp/omicron_tmp/test_all-b1323e5628efcde2-test_affinity_group_usage.17207.2.log
note: configured to log to "/var/tmp/omicron_tmp/test_all-b1323e5628efcde2-test_affinity_group_usage.17207.2.log"
log file: /var/tmp/omicron_tmp/test_all-b1323e5628efcde2-test_affinity_group_usage.17207.3.log
note: configured to log to "/var/tmp/omicron_tmp/test_all-b1323e5628efcde2-test_affinity_group_usage.17207.3.log"
thread 'integration_tests::affinity::test_affinity_group_usage' (2) panicked at sp-sim/src/gimlet.rs:429:62:
called `Result::unwrap()` on an `Err` value: couldn't understand ereport header from [::1]:47491: Size(SizeError)
Stack backtrace:
0: <anyhow::Error>::msg::<alloc::string::String>
at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/anyhow-1.0.102/src/backtrace.rs:10:14
1: anyhow::__private::format_err
at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/anyhow-1.0.102/src/lib.rs:690:13
2: sp_sim::ereport::recv_request::{closure#0}::{closure#0}
at /work/oxidecomputer/omicron/sp-sim/src/ereport.rs:51:17
3: <core::result::Result<gateway_ereport_messages::Request, zerocopy::error::ConvertError<core::convert::Infallible, zerocopy::error::SizeError<&[u8], gateway_ereport_messages::Request>, zerocopy::error::ValidityError<&[u8], gateway_ereport_messages::Request>>>>::map_err::<anyhow::Error, sp_sim::ereport::recv_request::{closure#0}::{closure#0}>
at /rustc/4a4ef493e3a1488c6e321570238084b38948f6db/library/core/src/result.rs:968:27
4: {async_fn#0}
at /work/oxidecomputer/omicron/sp-sim/src/ereport.rs:50:60
5: <sp_sim::gimlet::UdpTask>::run::{closure#0}::{closure#0}
at /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/tokio-1.52.1/src/macros/select.rs:705:49
6: <core::future::poll_fn::PollFn<<sp_sim::gimlet::UdpTask>::run::{closure#0}::{closure#0}> as core::future::future::Future>::poll
at /rustc/4a4ef493e3a1488c6e321570238084b38948f6db/library/core/src/future/poll_fn.rs:151:9
7: {async_fn#0}
at /work/oxidecomputer/omicron/sp-sim/src/gimlet.rs:723:13
8: {async_block#6}
at /work/oxidecomputer/omicron/sp-sim/src/gimlet.rs:429:56
9: poll<alloc::boxed::Box<sp_sim::gimlet::{impl#2}::spawn::{async_fn#0}::{async_block_env#6}, alloc::alloc::Global>>
at /rustc/4a4ef493e3a1488c6e321570238084b38948f6db/library/core/src/future/future.rs:133:9
This ticket has diverged a bit from the standard "test flake in CI" template, as I am fairly certain that the flake has nothing much to do with this test in particular. Instead, as I discussed in a comment on #10382, I think that something [1] sent an unanticipated UDP datagram to the SP simulator's ereport port. The datagram was not a valid ereport request header, so the simulator panicked. I presume this could have happened to any test, and it hit test_affinity_group_usage purely by chance. I had thought it was a good idea to make the SP sim panic upon receipt of a UDP datagram that is not a valid ereport header, so that we fail any test where MGS sends something malformed, but I don't think this one came from MGS. Looking through the log files, I didn't see any sign of anything that looked like an ereport request, and the ereport ingester background task is disabled in this test.
Sadly, I haven't been able to determine which process owned port 47491 at the time of the event. I'd like to figure out whether this is just a strange random fluke that we shouldn't worry too much about, or whether we need to change the SP simulator code to not panic here in order to prevent future flakes.
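For discussion, here is a minimal sketch of what the "don't panic" option might look like: log the sender and the parse error, drop the datagram, and keep serving. It is not the actual sp-sim code. `HypotheticalHeader`, `HEADER_LEN`, `HeaderError`, `Stats`, and `handle_datagram` are all made up for illustration, and the real header is `gateway_ereport_messages::Request` parsed via zerocopy, whose layout is not reproduced here.

```rust
use std::net::SocketAddr;

/// Hypothetical header size, chosen only for illustration. The real
/// `gateway_ereport_messages::Request` layout is NOT reproduced here.
const HEADER_LEN: usize = 8;

/// Stand-in for the real ereport request header (assumption).
#[derive(Debug, PartialEq, Eq)]
struct HypotheticalHeader {
    version: u8,
    rest: [u8; HEADER_LEN - 1],
}

/// Stand-in for the size error seen in the panic message (assumption).
#[derive(Debug, PartialEq, Eq)]
enum HeaderError {
    Size { expected: usize, actual: usize },
}

/// Parse a header from the front of a datagram, rejecting short buffers.
fn parse_header(buf: &[u8]) -> Result<HypotheticalHeader, HeaderError> {
    if buf.len() < HEADER_LEN {
        return Err(HeaderError::Size { expected: HEADER_LEN, actual: buf.len() });
    }
    let mut rest = [0u8; HEADER_LEN - 1];
    rest.copy_from_slice(&buf[1..HEADER_LEN]);
    Ok(HypotheticalHeader { version: buf[0], rest })
}

/// Bookkeeping so that a test can still notice malformed traffic.
#[derive(Debug, Default)]
struct Stats {
    handled: usize,
    dropped: Vec<(SocketAddr, HeaderError)>,
}

/// Handle one datagram. A malformed header is logged and recorded
/// rather than unwrapped, so a stray packet cannot take the task down.
fn handle_datagram(
    stats: &mut Stats,
    from: SocketAddr,
    buf: &[u8],
) -> Option<HypotheticalHeader> {
    match parse_header(buf) {
        Ok(header) => {
            stats.handled += 1;
            Some(header)
        }
        Err(err) => {
            eprintln!("dropping malformed datagram from {from}: {err:?}");
            stats.dropped.push((from, err));
            None
        }
    }
}
```

The tradeoff is that a malformed request from MGS would no longer fail a test by itself. Recording the drops, as `Stats::dropped` does in this sketch, would let tests that do expect ereport traffic assert that nothing was dropped, while unrelated tests ignore stray datagrams.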
Footnotes
[1] Unfortunately, I was not really able to find any hint in the test's logs as to what that might have been, though I'm pretty sure it was not the test's MGS.