TASK: Work with QA on update grouping, handling mass rebuild-type events for openQA testing #200
Comments
I think https://bodhi.fedoraproject.org/updates/FEDORA-2024-c9a2438d21 is the biggest update we have successfully handled through Bodhi + openQA so far; it has 445 packages. I did various bits of work on the openQA tests to make them handle an update that size; I hope they would handle something even larger, but it's impossible to know without trying. It does cause some issues for Bodhi itself too - creating or editing an update of that size takes a very long time, and loading the test results in the webUI takes a long time too.
I guess we could also just try unconditionally creating a batched update, no matter how large the batch, and see how Bodhi and openQA manage. I think the biggest we'd realistically wind up with is 2-3k packages? Maybe they could deal with that. I kinda suspect Bodhi might be stretched just a bit too far and start hitting timeouts on update creation, but maybe it'd be OK.
Sorry for the delay; I've been busy getting rid of ODCS. First off, thanks for the extremely detailed summary of the conversation, @AdamWill. OK, on to the meat of the discussion:

Bodhi Updates

I'm worried about the failure case if we can't create a Bodhi update. What exactly is our fallback path if 1) Bodhi crashes and doesn't create an update at all, or 2) Bodhi times out on creation but DOES actually (eventually) create the update? The core issue is that we need to ensure that the packages we just finished building get into the buildroot for the next batch as soon as possible, since the next batch may be relying on builds that just finished. Any additional delay added to that (such as gating Bodhi updates) increases the risk that we won't have an appropriate buildroot for the next batch. Yes, we could delay the start of the next batch until the Bodhi update is pushed to stable, but if gating tests interfere, that could lead to blocking further builds entirely. One probably-crazy idea I just had is that at the conclusion of a batch, we could immediately tag its results directly to the [...]

Compose testing

I like the idea of moving the sync to [...]. Obviously, if we implement this for ELN, we should do it in a way that's reusable for other Fedora streams like Rawhide.
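For illustration, here is a minimal sketch of the fallback check for case 2 above: after a timeout on update creation, ask Bodhi whether an update containing our builds already exists before retrying. It assumes Bodhi's list-updates endpoint accepts a repeated 'builds' filter; treat the endpoint details as assumptions rather than a verified recipe.

```python
# Minimal sketch: after a timeout on update creation, ask Bodhi whether an
# update containing our builds already exists before retrying creation.
# Assumes the /updates/ endpoint accepts a repeated "builds" query parameter.
import requests

BODHI_URL = "https://bodhi.fedoraproject.org"


def find_existing_update(builds):
    """Return the alias of an existing update containing `builds`, or None."""
    resp = requests.get(
        f"{BODHI_URL}/updates/",
        params={"builds": builds},  # list of NVRs, e.g. ["foo-1.0-1.eln130"] (hypothetical)
        headers={"Accept": "application/json"},
        timeout=60,
    )
    resp.raise_for_status()
    updates = resp.json().get("updates", [])
    return updates[0]["alias"] if updates else None
```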
What does the ELN SIG need to do?
We have recently enabled testing of ELN updates in openQA - ELN updates are the ones with only four tests run on them - but some issues have emerged which @sgallagher and @yselkowitz and I discussed on a call:
@sgallagher had a couple of ideas to address this on the ELN side:
@sgallagher also said that, due to the details of how EBS batching and buildroot handling work, it's probably not feasible to gate ELN updates; he wants the tests to run, but he doesn't want the updates to wait for the tests to complete and be blocked if they fail; they should always go through. This is already how things work, but it does make the testing less 'effective' and means we need more manual review and oversight (we need to notice when there's a failure, and any time there is a failure, it will 'cascade' to subsequent updates until it's resolved or the offending build is manually untagged). On the openQA side, I had a couple of ideas to mitigate this:
One of them is to have the sync script wait until it sees an openQA job completion message with a 'remaining' value of 0 - this means that, at the time it finished, no other jobs were scheduled for the same compose. Then you can hit greenwave's API and request a decision (which comes back as JSON). This isn't really that difficult, but it is probably pretty messy to do in-line in a shell script. It might require converting the shell script into something more sophisticated, or taking the sync step out of the script and doing it in an infra "toddler" or a standalone message consumer. Doing it that way would also handle the case where a test fails because of a bad needle or just a blip, then we re-run it and it passes; an always-listening consumer would trigger again in that case and sync the compose, while a one-time script which only checks the first time it sees a message with a remaining count of 0 would not.
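As a rough sketch of what such a standalone consumer could look like, the following uses fedora-messaging and greenwave's decision API. The message topic, the body fields ('BUILD', 'remaining'), and the greenwave decision_context/product_version values are assumptions made for the sake of the example, not the actual configuration.

```python
# A hypothetical always-listening consumer: wait for the last openQA job for
# a compose to finish, ask greenwave for a gating decision, and sync if it
# passes. Topic, body fields and greenwave parameters are assumptions.
import requests
from fedora_messaging import api, config

GREENWAVE_URL = "https://greenwave.fedoraproject.org/api/v1.0/decision"


def sync_compose(compose_id):
    # Placeholder for the existing sync step currently done in the shell script.
    print(f"would sync {compose_id}")


def on_message(message):
    # Only act on openQA job completion messages (topic name is an assumption).
    if not message.topic.endswith("openqa.job.done"):
        return
    body = message.body
    # remaining == 0: at the time this job finished, no other jobs were
    # scheduled for the same compose.
    if body.get("remaining") != 0:
        return
    compose_id = body.get("BUILD")  # assumed field carrying the compose ID
    if not compose_id:
        return
    decision = requests.post(
        GREENWAVE_URL,
        json={
            "decision_context": "eln_compose_sync",  # hypothetical context name
            "product_version": "fedora-eln",         # hypothetical
            "subject_type": "compose",
            "subject_identifier": compose_id,
        },
        timeout=30,
    ).json()
    # Because this consumer never stops listening, a failed test that is
    # re-run and passes will produce a new message and a fresh decision here.
    if decision.get("policies_satisfied"):
        sync_compose(compose_id)


if __name__ == "__main__":
    config.conf.setup_logging()
    api.consume(on_message)
```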