Skip to content

Conversation

@stephanreiter
Copy link

@stephanreiter stephanreiter commented May 1, 2018

See JENKINS-17116.

When a build is aborted by the user, Jenkins will now gracefully terminate involved processes by giving it up to 30 seconds time to exit after having received SIGTERM (on Linux) or Ctrl+C on Windows.

Proposed Changelog entries

@jenkinsci/code-reviewers

@stephanreiter
Copy link
Author

Maybe making the 30secs configurable (per project?) is a good idea.

@stephanreiter
Copy link
Author

In Line 558 of ProcessTree.java we should replace "java" with the actual path to the java.exe that creates the jvm the code is running in, i.e. property java.home ...

@daniel-beck daniel-beck self-requested a review May 1, 2018 20:58
@daniel-beck
Copy link
Member

Maybe making the 30secs configurable (per project?) is a good idea.

Seems like one of those options that clutter up the UI further. Possibly good enough for a system property, but in general we should "just" get something like this right.

@stephanreiter
Copy link
Author

Good point. 30secs might be too short. At work we are using scons as the build system and when you abort a build it sometimes sits there for a minute updating its DB on disk. So maybe 5mins are a safer choice here ...

@stephanreiter stephanreiter changed the title When aborting a build, wait up to 30s for process termination When aborting a build, wait up to 2min for process termination May 2, 2018
@stephanreiter
Copy link
Author

I have no clue why test "Linux / Linux Publishing / considersKillingVetos – hudson.util.ProcessTreeKillerTest" should fail. On Linux, we are now merely waiting after the SIGTERM and not sending out any extra SIGTERMS.

@stephanreiter
Copy link
Author

Same test failure is observed in another pull request (#3417), so not introduced here.

@dwnusbaum
Copy link
Member

@stephanreiter test failure was fixed in #3419, looks like the build is good now.

@stephanreiter
Copy link
Author

Cool! I think I am quite happy with the change now. Please take a look.

return;

LOGGER.log(FINER, "Killing recursively {0}", getPid());
killSoftly(getPid());
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll send Ctrl+C only to the root of a process tree. That's alright for my use case.
In case every process in the tree should receive Ctrl+C, we should change the implementation of WinProcess and do Ctrl+C sending in killRecursively and kill (should then pass a timeout to it).

@daniel-beck daniel-beck added the needs-more-reviews Complex change, which would benefit from more eyes label May 5, 2018
@stephanreiter
Copy link
Author

Can we proceed with this please?

private void killSoftly(int pid) {
// send Ctrl+C to the process in the first iteration
// after that just wait for it to cease to exist
for (int i = 0; i < softkillWaitSeconds; ++i) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good comment form selckin on IRC: better to calculate the deadline = nowtimestamp+softkillWaitSeconds) and loop while (nowtimestamp < deadline)

LOGGER.fine("Killing pid="+pid);
UnixReflection.destroy(pid);
// after sending SIGTERM, wait for the process to cease to exist
for (int i = 0; i < softkillWaitSeconds; ++i) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good comment form selckin on IRC: better to calculate the deadline = nowtimestamp+softkillWaitSeconds) and loop while (nowtimestamp < deadline)

catch (Exception e) {
break;
}
} while(System.currentTimeMillis() < deadline);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You want to use System.nanoTime, not currentTimeMillis. See Javadoc for why.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that nanoTime is immune to system clock changes and hence preferred for elapsed-time calculations. Will fix!

classPath = new ArrayList<String>();
for (final String resourcePath : resourcePaths) {
if (resourcePath.contains("jenkins-core") && resourcePath.endsWith(".jar")) {
String jarPath = servletContext.getRealPath(resourcePath);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not smell right. Have you even tested this when using remote agents?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this with a local installation only. The issue to solve is to start a new process which attaches to the process we want to send Ctrl+C to. I'd be glad for suggestions since you have an elaborate nose.

@jglick
Copy link
Member

jglick commented May 24, 2018

I suspect this entire PR is in the wrong repository. See: https://github.com/kohsuke/winp

@stephanreiter
Copy link
Author

The entire PR is certainly not in the wrong repository. The windows-specific bits could be moved into winp; linux-specific bits cannot.

@stephanreiter
Copy link
Author

I'll try to move the killSoftly method into winp. There I will need to execute a function in the winp.dll in a new process to send ctrl+c. rundll should allow that ... we'll see.

@stephanreiter
Copy link
Author

Alright, working on Ctrl+C sending on Windows via WinP here: jenkinsci/winp#49

@oleg-nenashev
Copy link
Member

@stephanreiter will try to take a look ASAP

@oleg-nenashev oleg-nenashev self-assigned this May 26, 2018
@oleg-nenashev oleg-nenashev self-requested a review May 26, 2018 11:46
@stephanreiter
Copy link
Author

Upgraded this change here based on what I am hoping will make it into WinP.
See jenkinsci/winp#49

@oleg-nenashev oleg-nenashev changed the title When aborting a build, wait up to 2min for process termination [JENKINS-17116] - When aborting a build, wait up to 2min for process termination Jun 8, 2018
@stephanreiter
Copy link
Author

@oleg-nenashev after you have released a new version of WinP, could you help me update this pull request here to include the new WinP, please?

@oleg-nenashev
Copy link
Member

@stephanreiter FYI I have pushed a commit to retrigger the build after fixing Maven Central staging

@stephanreiter
Copy link
Author

Seems to build now, yippie! I added exception catching around sendCtrlC, i.e. if Ctrl+C couldn't be sent then we just proceed with the regular hard kill. Exception catching became necessary now that sendCtrlC throws if something goes wrong instead of merely returning false.
Thanks for your help so far, @oleg-nenashev ! I appreciate it a lot! :)

@daniel-beck
Copy link
Member

Only looked at this superficially, but I regularly press 'abort build' again if I don't see anything happening. If this will affect abort duration, there should be a message logged that it might take a while.

@stephanreiter
Copy link
Author

This change here gives processes a chance to execute a clean shutdown. Depending on the executable, aborting will therefore take longer than before. At work, we use SCons for building and it sometimes takes a few seconds to shutdown. (FInally, though, SCons will cleanly shutdown; that was my motivation for fixing the referenced issue.)

Logging when process termination starts and ends sounds like good feedback to the user. I don't know how to do that, though. Just regular logging on a certain loglevel?

@daniel-beck
Copy link
Member

The nearest http://javadoc.jenkins-ci.org/hudson/model/TaskListener.html (or rather its logger) would receive such messages.

@stephanreiter
Copy link
Author

@daniel-beck I don't know how to get to a relevant TaskListener logger at the time the abort action is triggered by the user. The callstack for that even is:
at hudson.model.Executor.interrupt(Executor.java:153)
at hudson.model.Executor.doStop(Executor.java:853)
at hudson.model.AbstractBuild.doStop(AbstractBuild.java:1320)

I pushed a change that allows for tweaking of the softkill time via a system property and makes sure that when killing a processtree on Linux, the overall operation won't take longer than the softkill time (i.e. all process in the tree share the 2min default softkill time, it is no longer per process in the tree -- thanks to Nick Talbot for pointing that out!!).

Stephan Reiter added 2 commits August 14, 2018 17:28
@stephanreiter
Copy link
Author

@oleg-nenashev @daniel-beck @dwnusbaum can you help me finish this PR, please?
I don't know how to show the user a message that aborting may take some time and I don't know Jenkins enough to be able to tell whether clicking multiple times on Abort has any effect at all.
Looking at the time frame this PR is sitting here and how it made progress only in bursts, I'd like to finally see this over the finish line and be done with it.

@oleg-nenashev
Copy link
Member

@stephanreiter If you are fine with that, I will try to add logging on the top of your pull-request (today on the evening, I'd guess)

@stephanreiter
Copy link
Author

I would very much appreciate that, @oleg-nenashev . Thank you so much, for this and the work you put into the PR already!

@oleg-nenashev
Copy link
Member

Didn't get to it on the weekend, sorry. I will see if I have some time this week

@oleg-nenashev oleg-nenashev added the work-in-progress The PR is under active development, not ready to the final review label Aug 31, 2018
@oleg-nenashev
Copy link
Member

@daniel-beck I have spent few hours yesterday to investigate your proposal w.r.t TaskListeners. So, I think it should be removed from the scope.

  • hudson.model.Executor does not keep/cache reference to the Run's TaskListener. It receives it from API calls
  • doStop() (termination from WebUI) is executed on the executor directly, and it does not get TaskListener reference available within Run#execute() context. It would be tricky to even get a message there without caching the listener
  • In order to propagate TaskListener across the the ProcessTree call hierarchy, we need to change API as well (if we want to send event only when a soft termination happens)

I have started assembling some API changes to enable TaskListener propagation, because I agree this is important for error/event propagation. Apparently some changes also collide with #3557. But anyway, it would be a pretty big change which should be done separately IMHO.

I propose to...

  1. create a follow-up JIRA ticket for TaskListener
  2. merge this PR as is OR send events to INFO/FINE system logs instead of the TaskListener. The logging won't help a common user much tho

@daniel-beck WDYT?

@daniel-beck
Copy link
Member

If it's not possible then we'll need to accept this. Thanks for investigating.

@oleg-nenashev oleg-nenashev added ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback and removed needs-more-reviews Complex change, which would benefit from more eyes work-in-progress The PR is under active development, not ready to the final review labels Aug 31, 2018
@oleg-nenashev
Copy link
Member

Created https://issues.jenkins-ci.org/browse/JENKINS-53373 as a follow-up for logging

@oleg-nenashev
Copy link
Member

@jenkinsci/sig-platform Just FYI, here we are dropping support of Windows 2000 (as discussed at the first meeting).

@slide
Copy link
Member

slide commented Aug 31, 2018

+1

1 similar comment
@MarkEWaite
Copy link
Contributor

+1

@oleg-nenashev oleg-nenashev merged commit d8eac92 into jenkinsci:master Aug 31, 2018
@daniel-beck
Copy link
Member

https://issues.jenkins-ci.org/browse/JENKINS-55106 indicates the lack of logging is a problem, and 2 minutes is excessive, as it seems consecutive when multiple processes are involved.

@stephanreiter
Copy link
Author

Lack of information in the UI is indeed a problem.
2min is not excessive in my opinion because depending on load on the machine, shutdown of, for example scons (build tool) can take a minute. The time is tweakable, so in case someone has problems this can be used as a workaround. Would be better for them though to make a program react to sigterm in time instead of relying on sigkill.

@daniel-beck
Copy link
Member

The time is tweakable, so in case someone has problems this can be used as a workaround.

Lack of logging means this isn't discoverable however, so it might as well not exist.

@merlin66
Copy link

Until logging is implemented, maybe the default timeout should be set very short then, to avoid regressions?

As long as those of us who need a long timeout can set a long timeout explicitly, that shouldn't be a problem.

@omehegan
Copy link
Member

@stephanreiter https://issues.jenkins-ci.org/browse/JENKINS-54502 suggests that this change might have introduced a bug with aborting builds on Windows masters.

@slonopotamus
Copy link
Contributor

2min timeout introduced here doesn't actually work. Instead, Jenkins waits only for 20s. See https://issues.jenkins-ci.org/browse/JENKINS-59152

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-merge The PR is ready to go, and it will be merged soon if there is no negative feedback

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants