Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Postmortem: Taskcluster Decision Task PR broke master commit checks (multiple times) #20523

Closed
2 tasks
stephenmcgruer opened this issue Nov 29, 2019 · 3 comments
Closed
2 tasks
Assignees
Labels
postmortem This issue is a postmortem for some outage

Comments

@stephenmcgruer
Copy link
Contributor

stephenmcgruer commented Nov 29, 2019

Owner: @stephenmcgruer
Postmortem Created: 2019-11-28 21:45 EST
Status: In Review
Issue: No specific issue for landing the Taskcluster Decision Graph change exists. See here.

Impact: Medium. Approximately 14 (4 for the first landing, 10 for the second) commits to master did not run Taskcluster successfully. This means that experimental Chrome and Firefox results for those commits are not available on wpt.fyi (example). The change also broke the epoch branches responsible for producing stable runs, and we lose a few days of stable results.

Root Cause: A planned change from a single status Taskcluster yaml file to a 'decision task' approach had a number of bugs and unintended side effects, which were only discovered as it rolled out.

Timeline

  • 2019-11-22 11:47 EST: The first attempt at creating a decision task for Taskcluster is merged.
  • 2019-11-22 11:48 EST: Subsequent commits (from other PRs) land on master and begin to fail (example).
  • FIRST OUTAGE BEGINS
  • 2019-11-22 14:33 EST: Having identified three known bugs, @Hexcles reverts the Taskcluster decision task CL.
  • FIRST OUTAGE ENDS
  • 2019-11-22 14:33 EST: A reland PR is created, with the intention to address the known issues before relanding.
  • 2019-11-26 16:24 EST: Having gone through some rounds of review, part of the change is pulled into a separate PR and landed without issue.
  • 2019-11-28 09:20 EST: The reland PR is merged.
  • 2019-11-28 09:21 EST: Subsequent commits (from other PRs) land on master and begin to fail (example). This is noticed by the Taskcluster decision task PR author, who begins working on a fix.
  • SECOND OUTAGE BEGINS
  • 2019-11-28 10:32 EST: A fix for the believed issue is landed. Unfortunately, it does not address the problems, and it appears (??) that no-one notices that master is still failing.
  • 2019-11-28 17:16 EST: Whilst working on an incidental PR, @stephenmcgruer notices that the Taskcluster run for it has hung. Away from his laptop, he requests (via internal chat) that members of the Chromium Ecosystem Infra team investigate whether this is widespread.
  • 2019-11-28 17:54 EST: @KyleJu confirms that every commit since the decision task PR landed has failed, for reasons similar to the first time.
  • 2019-11-28 18:24 EST: Back at a laptop, @stephenmcgruer puts together a revert PR of the decision task PR plus related changes since.
  • 2019-11-28 18:55 EST: The revert is merged.
  • SECOND OUTAGE ENDS
  • 2019-11-28 20:03 EST: @stephenmcgruer confirms that master has gone green again.
  • 2019-11-29 08:46 EST: Having found the fix for the latest problem + discussed on IRC, @jgraham and @stephenmcgruer agree to try once more to land the decision task PR.
  • 2019-11-29 08:56 EST: @jgraham and @stephenmcgruer observe that master appears to be running ok (no failure yet), but strange errors from Taskcluster are seen on the individual commits.
  • 2019-11-29 09:23 EST: Working theory is that Taskcluster is trying to respond to events other than pushes and failing, but there appears to be no way to tell. An attempt is made to contact Taskcluster engineers on #taskcluster to get logs.
  • 2019-11-29 09:47 EST: Master commit for Taskcluster decision task successfully passes. \o/
  • 2019-11-29 09:52 EST: @stephenmcgruer realizes that the task names changed, so wpt.fyi won't be able to parse them. @jgraham starts working on PR to change the names back.
  • 2019-11-29 10:20 EST: Task name fix merged.
  • 2019-11-29 10:23 EST: @jgraham notices that PR checks are failing on chrome-stability specifically.
  • 2019-11-29 11:00 EST: @stephenmcgruer finds the cause; xvfb isn't enabled for chrome-stability after the decision task change. A fix is uploaded.
  • 2019-11-29 11:16 EST: Fix for chrome-stability lands.
  • 2019-11-29 11:30 EST: @stephenmcgruer confirms the fix appears to have worked.
  • 2019-12-02 07:18 EST: @rakuco discovers that stable runs are missing for Chrome, Firefox and experimental runs for WebKitGTK. @stephenmcgruer begins to investigate.
  • 2019-12-02 09:23 EST: @stephenmcgruer discovers an overlooked error in an epochs commit, which confirms the outage was caused by a bug in the Taskcluster decision task change.
  • 2019-12-02 12:53 EST: A fix is merged
  • 2019-12-03 05:00 EST: Fix is confirmed to work, the runs are back.

Lessons Learnt

Things that went well

  • N/A, it seems :(

Things that went poorly

  • It is difficult to test changes to the Taskcluster configurations properly. One can push to a trigger branch and the PR author did so, but missed the timeouts as we saw some chunks go green (logs). However, it really isn't possible to test the impact of a TaskCluster configuration change on PR checks.
  • In the second outage, @stephenmcgruer had been asked to monitor the Taskcluster decision task post-landing. However he was unaware that the problems last time had been on commited PRs, and so had been looking at in-flight PRs and seeing no issues.
  • It appears that no-one involved received or noticed notifications (email or github) that master commits were failing checks in either case.

Where we got lucky

  • In the second outage, @stephenmcgruer just happened to notice that his in-flight PR had hung; otherwise the outage would have gone unnoticed until at least the next day.
  • In the second outage, both @stephenmcgruer and @KyleJu happened to be available after working hours to debug and revert.

Action Items

@stephenmcgruer stephenmcgruer added the postmortem This issue is a postmortem for some outage label Nov 29, 2019
@stephenmcgruer stephenmcgruer self-assigned this Nov 29, 2019
@stephenmcgruer stephenmcgruer changed the title Postmortem: Taskcluster Decision Graph PR broke master commit checks (multiple times) Postmortem: Taskcluster Decision Task PR broke master commit checks (multiple times) Nov 29, 2019
@Hexcles
Copy link
Member

Hexcles commented Nov 29, 2019

I'm proposing an action item: more tests for https://github.com/web-platform-tests/wpt/blob/master/tools/ci/tc/tests/test_valid.py

#20533 (comment)

I think maybe we should really check in the full expected JSONs and deep compare scheduled.

@stephenmcgruer
Copy link
Contributor Author

@jgraham @Hexcles - I noticed that I had just let this postmortem sit uncompleted from November :(. I tidied it up a little and added a few AIs, but I'm not sure whether we want to action them or not. Lmk what you think (and if you have any other suggestions for the postmortem)

@foolip
Copy link
Member

foolip commented Apr 30, 2021

Over a year later, I'm gonna go ahead and close this :)

@foolip foolip closed this as completed Apr 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
postmortem This issue is a postmortem for some outage
Projects
None yet
Development

No branches or pull requests

3 participants