CI management for Tier 2 platforms #1666
I think we should rework all the Tiers to have, in a nutshell:
These two conditions combined do not sound like a healthy CI situation. If the test failures do not prevent pull requests from landing, it should not be tested on every pull request. It adds confusion.
I agree, CI failure of a Tier 2 platform should block landing stuff until the team responsible for that platform has been pinged and come back with a path forward (e.g. wait for a fix to land, or skip the failing test), with a maximum waiting time after which that platform would automatically be downgraded to Tier 3, and the PR can land.
the stated (and desired, IMO) difference between t1 and t2 is the size of the user base. not releasing binaries for t2 will affect all its users, whatever their volume might be in comparison with t1 users. I am worried about that approach, for the sake of providing a frictionless experience to contributors. let's compare this with a security scenario: if there are fewer folks to fix vulnerabilities, will we let the code bleed out, or stop supporting part of the code? no, we raise funds to address it. I think for t2 platforms that have users but fewer contributors, we could adopt a similar mechanism, until we see the platform is not in use.
What I imagine is that CI failures would not be as visible as Tier 1 (for example not shown as a red cross in the GitHub UI)
I totally agree here, we should also document how/why/when a platform drops or climbs the tier list
@gireeshpunathil I'm not suggesting we stop releasing binaries that we already release today. If we have Tier 2 platforms for which we release binaries, they should be promoted to Tier 1.
@aduh95 what you suggest sgtm too
I think in the short term that sounds good, but we also need to watch out for the case where "nobody has the time to ping anyone and things just get stalled for no good reason": when a PR is opened, no one is obliged to review it or look into the CI failure, so it still falls on the individual contributors to relentlessly ping other collaborators who might know how this works, provided that
And as this is a volunteer-based project, I often see PRs getting abandoned or stalled just because "the CI doesn't pass on the first run" (sometimes by collaborators, too), which is something we should make some effort to avoid, or it makes it difficult for the project to gain/retain new contributors. We can see how this pans out, and if it still ends up stalling pull requests that would not have this friction if the tier 2 platforms weren't blocking any PR, we should consider removing them from the PR CI altogether.
I wholeheartedly agree with this. Pull requests have waaaaay too much friction now, so much so that many people aren't going to bother, myself included. Take away the roadblocks and let the tier 2 platforms look after themselves.
while it is reasonable in an OSS project where contributors' time is voluntary, prioritising decisions based on the contributors' perspective alone may risk alienating tier2 users - "come fix it yourself if you care". the project is sufficiently mature and being used heavily in the field (enterprise and non-enterprise) and the TSC should try everything it can before amending release decisions and the platform support matrix:
this positions our relationship between contributors and users as a mutual one and reflects the spirit of community.
For tier 1 users, I think that's a risk we should consider; for tier 2 I don't think so, as by definition they don't have as many users, and usually they are backed by some company who answers to these tier 2 users as their own customers. By caring too much about tier 2 users, we also risk giving the community the wrong impression that they are somehow a customer of the Node.js project as well and are entitled to customer service from the project (this is already not true for tier 1 users, but at least tier 1 users are less likely to pay a company in a way that can create the illusion that the money is proxied to the random Node.js project contributor they are talking to).
I think this is already the reality, and is how a volunteer-based project should operate in general. For actual, generic bugs that affect all platforms, it has always been "come fix it yourself if you care". There are still many bugs lying around among the 1600+ issues with dozens of thumbs up over the years while no one makes enough effort to fix them. IMO it's too much of a luxury to prioritize anything about tier 2 platforms while things are already bad enough for generic bugs on all platforms and we have been in survival mode for a long time.
The TSC is also based on volunteers. Unless there are volunteers in the TSC actively pledging to spend the time and make this list happen, the more likely outcome is that nothing is going to be done along these lines. Shifting the "biting off more than we can chew" problem from individual contributors to a subset of them (the TSC) isn't really going to solve the problem, either; the most practical solution remains to stop taking on more responsibility than what the volunteers are able to afford.
I agree with @bnoordhuis here.
Looking at that list there is only one entry that appeared only on a Tier 2 platform. Everything else also occurred on a Tier 1 platform.
The flake does not need to be solely reproducible on one platform - the problem is that running tests on too many platforms increases the flake surface. Suppose a flake has a 1% rate of failing on multiple platforms: the more platforms we run on every node-test-pull-request, the more likely it is to show up in at least one of them and create friction in a PR. One can argue that it is important to make sure that we test Node.js under stress so the tests always pass, even if we risk false positives from unstable machine states; that can be true for tier 1, but for tier 2 with fewer users, do we really need the test suite to be that perfect? I doubt it. If we can't even make it perfect for tier 1, trying to also make it perfect for tier 2 is biting off way more than we can chew.
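To make the compounding effect concrete, here is a rough back-of-the-envelope sketch (the 1% flake rate and the platform counts are illustrative assumptions, and the runs are treated as independent):

```js
// Illustrative only: if a flaky test fails with probability p on any given
// platform run, the chance of at least one red result across n independent
// platform runs is 1 - (1 - p)^n.
const p = 0.01; // assumed per-platform flake rate

for (const n of [5, 10, 20, 30]) {
  const atLeastOneFailure = 1 - (1 - p) ** n;
  console.log(`${n} platforms: ~${(atLeastOneFailure * 100).toFixed(1)}% chance of a red run`);
}
// 5 platforms ≈ 4.9%, 20 platforms ≈ 18.2%, 30 platforms ≈ 26.0%
```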
The majority of the current Tier 2 platforms are Red Hat/IBM ones and it is important for us (Red Hat/IBM) to have the CI run on every PR and to have release binaries on them. IBM stopped producing its own Node.js rebuilds in 2018 because of the investments (both machine resources and developers, including to upstream V8) made in the community, so for AIX there is nowhere else to get Node.js (RHEL customers can get the Red Hat Build of Node.js).

At a time when there is a lot of talk about sustainability and whether companies contribute towards open source projects, I would point out that there is more than one way to contribute, and for Node.js both IBM and MNX sponsor several machines, which include Linux x64 servers (i.e. not their primary platforms of concern). This includes all three jenkins-workspace machines, which are large (in terms of RAM/storage) machines that all pull request CI jobs go through for rebasing onto the base branch prior to fanning out onto the specific architectures. Specifically on the IBM/RH side of things, we have provisioned (through sponsorship) more than the guideline minimum number (2) of servers in the test CI for our platforms, along with the effort required to keep them running reliably.

The reality is that of the releases where we have had to discuss releasing without a platform, the most recent cases were Tier 1 (32-bit Windows in Node.js 18 and macOS in Node.js 23).
My two cents here. Since it is already a fairly long discussion I'll keep them brief.
@richardlau Does it make a big difference for your platform to be in each PR CI run compared to the daily CI? If yes, can you please explain why?
G'day!
I'm not a Richard, but rather a member of the core team for illumos (known also, here, as SmartOS), and as another Tier 2 platform I think we're in a similar boat. From our community's experience assisting in maintenance on other popular toolchains like Go and Rust, where we are also at Tier 2 status, it's super important to keep the context alive when dealing with platform issues. That is, it's important that we can at least aid in making a decision on whether to ignore a failure prior to integration of a change, so that we can participate in the discussion of the change while it is still fresh in the author's mind. Most of the time we'll presumably be able to offer immediate guidance for the vast majority of relatively small differences between UNIX platforms. I don't believe anybody is demanding free work from contributors, to be clear: rather, the author of any change is best placed to help understand and amend that change -- and, understandably, they will have paged it out and moved on to other things once it lands! If anything, it seems more disrespectful of their time and flow to try and ask them questions later as an attempt by Tier 2 maintainers to catch up with regressions introduced into the software. I am strongly in favour of continuing to test our platform in pre-integration CI, and as I mentioned in another issue just now, I am here to help improve communication between our two projects so that we can make that as painless as possible for folks!
I think if we start landing PRs with tests that are not passing on Tier 2 platforms, then we will quickly lose them as viable Node.js platforms. I don't think this is good for Node.js, nor is moving the responsibility solely onto the companies volunteering to support those platforms. Frankly, it's very rare that I have a failure on any Tier 2 platform: most of the time my frustration is directed at Windows (your mileage may vary). How much work is actually blocked by code that doesn't work on Tier 2 platforms? As many have stated, the biggest issue in maintaining all our platforms is keeping the HW & OS up to date. Can someone proposing this change do an inventory of PRs that are blocked by the inability to get a change implemented on any Tier 2 platform for reasons not related to HW & OS versions?
Isn't this contradicting what you say:
Either most PRs are not platform-specific, so it's fine not to test most PRs in the testing CI and only catch any real failure in the daily master CI, or most PRs would really fail on tier 2 platforms without special care, such that those platforms would quickly become non-viable. My impression is that most failures on tier 2 platforms are flakes, and catching them is meaningless to anyone and only introduces friction to contribution. They don't need to be real blockers - the fact that a PR works locally but doesn't work in the CI, and that one has to look at Jenkins to find failures and ping someone else to restart the CI, is enough to make contributors frustrated and give up.
I don't think it needs to become a final blocker to introduce friction - the fact that there are many flakes present on tier 2 platforms (you can find many of those in https://github.com/nodejs/reliability/tree/main/reports) already shows that they are increasing the friction. We have already not been able to keep the CI green on even just the tier 1 platforms for many, many years; trying to also extend it to tier 2 platforms is just biting off more than we can chew, and removing them from the testing CI would already make the CI greener simply because there would be fewer attempts to "reproduce" a flake. People don't get blocked by these flakes - they can simply stop trying when it doesn't work on the first try, since they aren't obliged to get the changes in; this happens not only to casual contributors but also to collaborators. You'd only know what the blocker even is when the author tries hard enough to eliminate all the other friction, and that is already assuming too much from most contributors.
@mcollina That does not reflect the truth at all, but the Node.js build repository has had 3 access requests in the last 2 years: https://github.com/nodejs/build/issues?q=is%3Aissue%20label%3A%22access%20request%22%20label%3Aplatform%3Asmartos%20%20 You can also find all issues related to smartOS created in the Node.js build repository: https://github.com/nodejs/build/issues?q=is%3Aissue%20label%3Aplatform%3Asmartos You can also find smartOS-specific incidents in the Node.js CI: For
While it is great that there is help, it still requires someone to ask for help first, and Node.js isn't a project where this works particularly well. Many people would simply give up on a PR if it fails in the CI, as they are not obliged to get the changes in anyway; if the CI doesn't pass on the first run, a new contributor needs to ping a collaborator to keep retrying until they know that it's a flake, and this back-and-forth alone is enough to stall many PRs. I think the only way this would work is for someone from the platform team to actively monitor every single pull request testing CI run, and proactively comment in the PR when it fails on that platform, telling the author whether this is a flake that they can ignore or a true error and how it should be fixed, instead of waiting for someone to take the initiative to triage, find out who to ping, and do the ping (note that non-members cannot ping teams, so new contributors again have to ask for help from someone in the organization, which introduces even more friction). I don't see that kind of extra workload as a good use of the platform teams' time, either, but that's what it takes to reduce the friction. The alternative of only doing that for the daily CI seems much more reasonable to me.
Note that we are discussing moving it to the daily master CI, so if someone from the platform team is paying attention to it, it should take no more than 24 hours for them to notice a merged PR that breaks a tier 2 platform in the daily master CI. That seems like a reasonable timeframe to me - in Node.js you generally can't expect someone to reply to something non-urgent within 24 hours, anyway. What happens a lot more commonly these days, however, is not that people miss a true platform-specific failure on tier 2 platforms because they don't run tests on them, but that people get too many false positives from running the flaky test suite on too many platforms, including tier 2 ones, which increases their chances of reproducing a flake. And triaging a flake is already costly enough for many contributors that it can stall PRs as people get frustrated.
Can the tier 2 tests be run on every PR, but marked as "optional", so their failures are known but they don't block progress?
If there are enough flaky tests that this is a big issue, why not disable (or mark as ignored) the flaky tests rather than entire platforms? If they're truly just flaky, perhaps retry their execution a few times prior to giving up? It seems like that would go a lot further towards solving the problem you're describing, where casual contributors are discouraged by CI failures -- not just on Tier 2 platforms, but indeed on all platforms! It would also then be easier to get automated notifications of platform-specific CI failures sent to people who can help, because there would be a lot more confidence in a failure being a real failure rather than a flake.
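A minimal sketch of the retry idea suggested above, assuming a generic async test body; the helper name and retry count are hypothetical, and this is not how the project's Jenkins CI is actually configured:

```js
'use strict';

// Re-run a (possibly flaky) test body a few times before reporting failure.
async function withRetries(runTest, attempts = 3) {
  let lastError;
  for (let i = 1; i <= attempts; i++) {
    try {
      return await runTest(); // pass on the first successful attempt
    } catch (err) {
      lastError = err;
      console.warn(`attempt ${i}/${attempts} failed: ${err.message}`);
    }
  }
  throw lastError; // only fail after every attempt has failed
}

// Usage: emulate a test that flakes roughly half the time.
withRetries(async () => {
  if (Math.random() < 0.5) throw new Error('simulated flake');
  console.log('test passed');
}).catch(() => { process.exitCode = 1; });
```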
This is already what we do today, but
Most people, myself included, would just do the bare minimum to get what they are working on shepherded through, and can't really invest all that much in the general situation of the CI. Constantly gardening the CI is a very demanding job; we used to have a bit more volunteer effort doing it, but these days we generally don't, and we are even drafting a SoW to hire someone to do all this work: #1629 - with no guarantee of when or whether it will actually happen. If we are already this desperate about the amount of volunteer time available for the CI even when we only look at tier 1 platforms, trying to run the tests on even more platforms just to reproduce flakes more often really looks like biting off more than we can chew to me.
@joyeecheung I agree with you that the flaky tests are our main problem. Cutting down the supported platforms might improve that situation, but it's a bandaid with too many side effects and risks, including alienating a few of our corporate sponsors.
I don't think this will happen. Asking people to be on call every day to verify a failed build that might be flaky is something I would not put on the companies supporting those platforms. Note that all the maintainers of those platforms are asking for their platforms to keep being tested in every CI run. I'm not comfortable disregarding their point of view on this topic. Putting it in other terms... how can we improve the situation for Tier 2 platforms while keeping them in the testing for every pull request?
@anonrig I specifically asked about work blocked by genuine OS differences between Tier 1 and Tier 2 platforms. You have posted examples of HW & OS problems. Are there any open PRs that have been blocked by a divergence of behaviors between the platforms that the respective teams did not address promptly?
@aduh95's point really resonates with me and I think it provides a good solution:
I think we could set that to 2 weeks.
It needs to be done by someone, at the end of the day. If not the platform teams, who would be the volunteers to verify these failed builds on tier 2 platforms?
We are already desperate enough even if we only care about tier 1 platforms; thinking about how we can improve the situation for tier 2 platforms when CI on tier 1 platforms has already been in chaos for years is too much of a luxury IMO (even if we only count from the point where I couldn't stand the CI and developed https://github.com/nodejs/reliability, that repository is already 6 years old, and unfortunately I don't think the CI is any better than it was 6 years ago...)
Anyway, I only opened this issue because @mcollina asked me to during the TSC meeting. I don't think we have consensus on actually doing anything to node-test-pull-request, in which case I don't think anything of substance is going to change, and I think 48 hours is the maximum timespan I should invest in the discussions on this topic - any more than that and I am dying on a hill that I am not really motivated to die on, and I feel that I have started to offend people that I don't really want to offend. Sorry for wasting people's time, but I am just going to close it to kick it out of my head during the holidays :)
Let's take for example experimental-strip-types. This was identified as failing on big endian platforms on 15 Jul and was resolved in swc. It didn't land on main until 24 Jul, which is over a week later. So we're talking ten days from the first opportunity to spot the failure on a Tier 2 platform until the next daily build (25 Jul). Whoever on the IBM/RH side spots this (most likely me) would have to unravel how this feature is implemented and work out that the fix would require a new version of a dependency (i.e. the context, as alluded to in #1666 (comment), is lost). I am very aware that we have a (IMO bad) habit of landing features at the last minute for releases (often with semver majors, where I know we have changed things to hopefully improve matters), which will put added pressure on Tier 2 platforms to get things resolved before release. And I reiterate that for us it's important to have release binaries on our platforms -- I would push to get our platforms upgraded to Tier 1 if we decide not to release binaries for Tier 2 (as has been suggested). (FYI today was my last working day of the year. I'll be back on 6 Jan. At this moment in time I'm disinclined to check GH over the holiday period.)
I think we should have a vote on this. If there is no consensus, the goal of the TSC is to resolve it eventually by voting. @joyeecheung
@joyeecheung thanks for opening this. I think debating this was important.
Just back from vacation and I see the discussion is closed so I won't say too much. I think @richardlau has covered my thoughts pretty well in terms of the importance for Tier 2 platforms to be in the PR testing, binaries being shipped, and that we've worked hard to ensure the IBM/RH platforms in Tier 2 don't block forward progress.
Spinning off from #1663, during the TSC discussions today we noticed that the tiers are documented as follows in BUILDING.md
Which does not say anything about Tier 2 platforms having to be in the pull request testing CI, and there isn't much distinction between Tier 1 and Tier 2 if they are both in the testing CI.
Personally I think, given the current situation of the testing CI, we are biting off more than we can chew if we still insist on keeping the Tier 2 platforms in every pull request testing CI run. It requires more resources than we currently have, increases CI run time, and creates a bigger flake surface (for example, in the reliability report yesterday you can see several tier 2 platforms on the list). It also shifts the burden of making every change work on those platforms onto individual contributors, who may be new to the project; it can be frustrating for them to get blocked by niche platforms they don't care much about, and overall this makes it harder to gain and retain new contributors.
Is it really more important to keep every pull request passing all the tests on these tier 2 platforms than providing a frictionless experience for contributors? Personally I don't think that's the right prioritization. Keeping the tests green on a platform in the CI, specifically, is not the same as just "keeping Node.js working on that platform" - for example a test could pass perfectly fine on a well-maintained version of the platform, but still be in trouble for some reason in the CI on that platform, which may be strictly related to the state of the machine in the CI, or the specific setup of that machine. It does not seem right, say, for a new contributor to open a pull request to add util.something, which is a straightforward JS thing with a straightforward test, and end up scratching their head at some mysterious flake/machine problem on a niche tier 2 platform in the CI, having to keep pinging other collaborators to restart the CI to get it landed (assuming they don't get frustrated enough themselves and give up on the PR). This can happen on tier 1 platforms too, but at least the user base of tier 1 platforms is big enough to deserve the attention of innocent new contributors; tier 2 really isn't.

I think a more tenable solution would be to keep tier 2 platforms in the node-daily-master CI, and let Jenkins notify the corresponding stakeholders when tests fail there. The stakeholders could come back opening an issue or a pull request to fix the test for that platform, and make sure that things are in order for the releases (if there are infrastructure issues that make it hard to build the binaries in time for the release then they'll be delayed - which is precisely what the Tier 2 description says). For node-test-pull-request, we can either remove the tier 2 platforms, or only make sure that they can build but not actually run the tests, as the marginal cost of keeping all the tests green on all these platforms isn't really worth the benefit at this point IMO.
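For a sense of scale, the kind of change described above is tiny; a hypothetical sketch (util.something is the placeholder name from the paragraph above, not a real API, and the code only mirrors the shape of a straightforward addition plus its test):

```js
'use strict';
const assert = require('node:assert');

// Stand-in for the hypothetical util.something addition.
function something(value) {
  return typeof value === 'string' ? value.trim() : String(value);
}

// The "straightforward test": plain assertions, no platform-specific behavior.
assert.strictEqual(something('  hi  '), 'hi');
assert.strictEqual(something(42), '42');
console.log('all assertions passed');
```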
cc @nodejs/tsc @nodejs/build