@paciorek
Contributor

This is an attempt to start OOD Desktop (in particular xfwm4) in a way that prevents current problems. It's not ready to merge yet; the goal for now is to start some discussion and allow some testing.

The current problem, which users and I have both noticed, is that the Desktop is sometimes almost unusable because app windows can't be resized or moved. The borders around the windows (and around the entire Desktop) also disappear.

I believe I have diagnosed this as occurring because the xfwm4 window manager process dies shortly after the Desktop starts. I've noticed that very briefly (for about one second) when the Desktop appears, the usual border around the Desktop is there (i.e., the top 'management' bar and the small app bar at the bottom). Then it disappears. When I then look using ps, xfwm4 is not running.
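A quick way to repeat that check on the compute node is to ask whether any xfwm4 process owned by the current user is still alive a few seconds after the Desktop appears. This is only a sketch: it assumes `pgrep` (from procps) is available, and the helper name `is_running` is made up for illustration.

```shell
# Succeed if the current user owns a process whose name is exactly $1.
is_running() {
  pgrep -u "$(id -u)" -x "$1" > /dev/null
}

if is_running xfwm4; then
  echo "xfwm4 is running"
else
  echo "xfwm4 is NOT running"
fi
```

Running it once right after launch and again a few seconds later would show whether the process disappears in between.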

Looking at output.log, I see the following messages:

Another Window Manager (Xfwm4) is already running on screen :1.0
To replace the current window manager, try "--replace"

(xfwm4:2197088): xfwm4-WARNING **: 13:48:36.095: Could not find a screen to manage, exiting

Following some hints from Claude, I tried to see if xfwm4 was being started multiple times, but I didn't see any indications of that happening.
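One way to look for repeated starts, sketched under the assumption that the `set -x` trace from script.sh.erb ends up in output.log: count the traced xfwm4 invocations and pull out the window-manager messages. Both function names are hypothetical.

```shell
# Count traced invocations of xfwm4. With `set -x`, the shell prints each
# command prefixed by one or more "+" characters before running it.
count_wm_starts() {
  grep -c -E '^\++ xfwm4( |$)' "$1"
}

# Show (with line numbers) where xfwm4 reports a conflict or a warning.
show_wm_errors() {
  grep -n -E 'Another Window Manager|xfwm4-WARNING' "$1"
}

# Example:
#   count_wm_starts output.log
#   show_wm_errors output.log
```

A count of 1 alongside the "already running" message would suggest that the second window manager is being started by something outside this script.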

So I don't know why another xfwm4 is already running.

Ideally it would be nice to figure that out so as to come up with a robust solution.

Even if we can't, my thought with this PR is that either sleeping before starting xfwm4 (perhaps helping if something needs to finish starting before xfwm4 can start robustly) or using --replace (to replace whatever problematic xfwm4 has already started) might help.
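A rough sketch of the `--replace` idea, combined with a check that the window manager survives its first moments. The `WM_CMD` variable and the `start_wm` function are hypothetical names; the xfwm4 flags are the ones already used in this thread plus the `--replace` flag that the log message suggests.

```shell
WM_CMD="${WM_CMD:-xfwm4}"   # hypothetical override so the logic can be tried without X

start_wm() {
  settle="${1:-2}"   # seconds to wait before checking on the process
  # --replace asks xfwm4 to take over from any window manager already on the screen.
  "$WM_CMD" --replace --compositor=off --sm-client-disable &
  wm_pid=$!
  sleep "$settle"
  # kill -0 sends no signal; it only tests whether the process still exists.
  if kill -0 "$wm_pid" 2>/dev/null; then
    echo "window manager still running (pid $wm_pid)"
  else
    echo "window manager exited within ${settle}s" >&2
    return 1
  fi
}

# Example: start_wm 2 || echo "xfwm4 did not stay up" >&2
```

Logging the early exit this way would at least leave a clear marker in output.log when the failure recurs.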

@paciorek
Contributor Author

Flagging @saroj-lbl for any thoughts. I didn't see this problem in limited testing on LRC OOD, but given the apps seem identical (or nearly so), I wouldn't expect it to be Savio-only.

@saroj-lbl
Collaborator

Hi Chris,

We are not aware of any such problem on LRC OOD as of yet. I will try to keep a desktop app open for several hours today to test. Is there a particular application or set of applications for which the problem shows up?

Looking at the LRC OOD apps' script.sh.erb, it looks quite similar to BRC OOD:

(
  export SEND_256_COLORS_TO_REMOTE=1
  export XDG_CONFIG_HOME="<%= session.staged_root.join("config") %>"
  export XDG_DATA_HOME="<%= session.staged_root.join("share") %>"
  export XDG_CACHE_HOME="$(mktemp -d)"
  set -x
  xfwm4 --compositor=off --daemon --sm-client-disable
  xsetroot -solid "#D3D3D3"
  xfsettingsd --sm-client-disable
  xfce4-panel --sm-client-disable
) &

At first I noticed that the difference is the --daemon flag, but output.log says that it is an unknown option.
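Whether a given build of xfwm4 accepts a flag can be probed from its help text. The sketch below assumes that `xfwm4 --help` lists the supported options and can be run on the node in question; `supports_flag` is a made-up helper name.

```shell
# Succeed if the --help text of command $1 mentions flag $2 as a whole word.
supports_flag() {
  "$1" --help 2>&1 | grep -q -w -F -e "$2"
}

for flag in --daemon --replace; do
  if supports_flag xfwm4 "$flag"; then
    echo "xfwm4 lists $flag"
  else
    echo "xfwm4 does not list $flag (or xfwm4 is unavailable)"
  fi
done
```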

@markyashar

Also, in case this is useful: Taking a look at the OOD docs at https://osc.github.io/ood-documentation/latest/tutorials/tutorials-interactive-apps/add-matlab/edit-script-sh.html#use-xfce-for-the-window-manager, there is an example of how xfwm4 is set up in the ../template/script.sh.erb file for the ood matlab interactive app. The example is as follows:

Use XFCE for the Window Manager:

XFCE is OSC's preferred desktop environment for launching VNC applications. The code for starting XFCE in the background looks like this (see highlighted lines 1-20):

Launch Xfce Window Manager and Panel

(
export SEND_256_COLORS_TO_REMOTE=1
# session.staged_root.join("config") refers to /.../bc_my_center_matlab/template/config
# which is copied at job start time to a session specific directory.
# It will override without replacing any XFCE settings that the user
# already has.
export XDG_CONFIG_HOME="<%= session.staged_root.join("config") %>"
export XDG_DATA_HOME="<%= session.staged_root.join("share") %>"
export XDG_CACHE_HOME="$(mktemp -d)"
module restore
set -x
xfwm4 --compositor=off --daemon --sm-client-disable
xsetroot -solid "#D3D3D3"
xfsettingsd --sm-client-disable
xfce4-panel --sm-client-disable
) &

cd "$HOME"

Start MATLAB

Load the required environment

module load xalt/latest <%= context.version %>

Launch MATLAB

Switch the implementation on if the user requested a visualization GPU node

<%- if context.node_type.include?("vis") -%>
module load intel/16.0.3 virtualgl # Perform whatever set up you want / need
module list # List loaded modules for debugging purposes
set -x
vglrun matlab -desktop -nosoftwareopengl # Launch MATLAB using VirtualGL
<%- else -%>

When not using a GPU node

module list # List loaded modules for debugging purposes
set -x
matlab -desktop # Launch MATLAB
<%- end -%>

@markyashar

Taking a look at the file /global/home/users/myashar/ondemand/dev/brc_desktop/template/script.sh.erb, for example, we do see that it includes the following:
...

Launch Xfce Window Manager and Panel

export SEND_256_COLORS_TO_REMOTE=1
export XDG_CONFIG_HOME="<%= session.staged_root.join("config") %>"
export XDG_DATA_HOME="<%= session.staged_root.join("share") %>"
export XDG_CACHE_HOME="$(mktemp -d)"
set -x
xfwm4 --compositor=off --sm-client-disable &
sleep 5
xsetroot -solid "#D3D3D3"
xfsettingsd --sm-client-disable
xfce4-panel --sm-client-disable &
sleep 5

cd "$HOME"

@paciorek
Contributor Author

paciorek commented Jul 24, 2025

@markyashar I'm not seeing any difference between the OSC script and ours that might help explain our Desktop mis-behavior. Was there something in particular you were pointing out?

@paciorek
Contributor Author

@saroj-lbl regarding keeping the app open for hours, my current understanding is that the issue happens immediately upon starting the app, so I don't think you'll see anything by keeping it open. Just an FYI -- given this is mysterious, any diagnostic effort is welcome!

@markyashar

markyashar commented Jul 24, 2025

@paciorek I think the only basic differences between the OSC script and ours are these: their script has the line "module restore" before the line "set -x", which ours does not have; and our script has the line "sleep 5" after the line "xfwm4 --compositor=off --sm-client-disable &", whereas the OSC script has neither the "&" nor the "sleep 5". There are similar differences between our script and the LRC OOD apps' script.sh.erb that Saroj sent. These may be minor, but it could be worth experimenting with removing them, e.g., taking out the ampersand in our script, removing the "sleep 5" line, and adding the "module restore" line.

Also, our script has

xfce4-panel --sm-client-disable &
sleep 5

whereas the others just have

xfce4-panel --sm-client-disable

Or, the other scripts have the & outside of the parentheses, but I'm not sure how much of a difference this ultimately makes ...
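The placement of the `&` does change ordering, and that can be seen without X at all. In the sketch below, `fake_wm` is a stand-in (an assumption for illustration) for a window manager that stays in the foreground rather than daemonizing; the two patterns mirror the "whole block backgrounded" and "individual command backgrounded" layouts.

```shell
fake_wm() { sleep 1; }   # stand-in for a window manager that does not daemonize

# Pattern A: the whole block is backgrounded. The outer script continues at
# once, but inside the subshell each command waits for the previous one.
pattern_a() {
  ( fake_wm; echo "A: panel started" ) &
  echo "A: script continues"
  wait
}

# Pattern B: only the window manager is backgrounded, so the next command
# runs immediately.
pattern_b() {
  fake_wm &
  echo "B: panel started"
  echo "B: script continues"
  wait
}

pattern_a
pattern_b
```

In pattern A the "panel started" line appears only after `fake_wm` returns, so with a real foreground window manager the later commands in the block would not run until it exits. In pattern B they run right away.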

@paciorek
Contributor Author

Yeah, as far as module restore, that seems to relate to default environment modules and I don't think that is relevant for us.

I myself introduced the sleep 5 back in November to address these same sorts of windowing issues. I should have remembered/mentioned that PR #48 was related.

I'm going to experiment more with the "sleep"s to see if reducing the amount of time would be ok, as having 5x3=15 seconds of delay seems like a bit much from a user experience perspective.
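An alternative to tuning fixed delays is to poll until the window manager is actually present and give up after a timeout. This is a sketch under assumptions: `xprop` is installed on the node, and xfwm4 sets the EWMH `_NET_SUPPORTING_WM_CHECK` property on the root window once it is managing the screen. The names `wait_until` and `wm_ready` are hypothetical.

```shell
# Retry the command given after the first argument once per second until it
# succeeds; fail if it has not succeeded after $1 seconds.
wait_until() {
  timeout_s="$1"; shift
  waited=0
  until "$@"; do
    [ "$waited" -ge "$timeout_s" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
}

# Succeed if some window manager has advertised itself on the root window.
wm_ready() {
  xprop -root _NET_SUPPORTING_WM_CHECK 2>/dev/null | grep -q 'window id'
}

# Possible use in place of "xfwm4 ... &" followed by "sleep 5":
#   xfwm4 --compositor=off --sm-client-disable &
#   wait_until 10 wm_ready || echo "no window manager detected after 10s" >&2
```

When the window manager comes up quickly, this returns within about a second instead of always paying the full delay.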

@paciorek
Contributor Author

Ok, so one very confusing thing is that the LRC script and the Savio script that is currently live at /var/www/ood/apps/sys/brc_desktop/template/script.sh.erb look to be the same (i.e., the whole block of commands is backgrounded).

However, somehow the Savio script that is live at /var/www/... is not what we have in this GitHub repository. In particular it does not reflect the changes in commit 1b23744 from PR #48 (in which the whole block of commands is not backgrounded). I'm not sure how that could be the case.

So that makes this all the more confusing. I'm going to have to check with @wfeinstein to understand why /var/www/ood/apps/sys/brc_desktop/template/script.sh.erb is not the same as the script.sh.erb in commit d56f8cd.
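A discrepancy like this can be pinned down by diffing the live file against the version recorded at a commit. The sketch below is hedged: `report_drift` is a made-up helper, and the in-repo path `template/script.sh.erb` used in the example is an assumption based on the directory layout mentioned earlier in this thread.

```shell
# Print a unified diff between a reference copy ($2) and the deployed file ($1),
# and fail if they differ.
report_drift() {
  live="$1"
  reference="$2"
  if diff -u "$reference" "$live"; then
    echo "no differences"
  else
    echo "deployed copy differs from reference" >&2
    return 1
  fi
}

# Example, run from a checkout of the repository:
#   git show 1b23744:template/script.sh.erb > /tmp/script_at_commit.erb
#   report_drift /var/www/ood/apps/sys/brc_desktop/template/script.sh.erb /tmp/script_at_commit.erb
```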

Remove a `sleep` that is probably not needed.
@paciorek
Contributor Author

Now that I've tested my robustification efforts, I realized that xfsettingsd needs to be started as a daemon or xfce will not start. So I think this PR now provides a working Desktop.

However, in testing today, I can't reproduce the failures that user Jeff Lockhart and I were seeing last week. So it's hard to pin down what changes we want.

Here's a summary of my current thinking:

  • I think we do definitely want the --daemon flag for xfsettingsd.
  • I'm not sure about the --replace flag for xfwm4, though it would seem to be a helpful robustification.
  • I do think we want the changes to how things are backgrounded given in commit 1b23744, given my work in PR #48 (Modify Desktop app script to fix windowing issues). That said, LRC has the "old" way of backgrounding the entire block, and things have been fine there.
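Put together, the three points above might look roughly like the following. It is a dry-run sketch, not the PR's actual script: `DRY_RUN`, `run`, `run_bg`, and `start_desktop_components` are hypothetical names, and the flags are the ones discussed in this thread. With `DRY_RUN=1` (the default here) the commands are only printed, so the layout can be inspected without an X session.

```shell
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 prints commands, 0 runs them

# Run a command in the foreground, or just print it in dry-run mode.
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Run a command in the background, or just print it in dry-run mode.
run_bg() {
  if [ "$DRY_RUN" = 1 ]; then echo "+ $* &"; else "$@" & fi
}

start_desktop_components() {
  run_bg xfwm4 --replace --compositor=off --sm-client-disable
  run xsetroot -solid "#D3D3D3"
  run xfsettingsd --daemon --sm-client-disable
  run_bg xfce4-panel --sm-client-disable
}

start_desktop_components
```

Each long-running component is either backgrounded individually or asked to daemonize, so no command in the sequence blocks the ones after it.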

@markyashar

In case there's anything useful here, this is how the Bearborg AI Chatbot (Gemini Pro) responds after I fed this thread and all of the background and context into it (note that one thing it suggests is adding the line "pkill -f xfwm4" to the script):

"Let's break down this complex issue and create a plan to fix the BRC OOD Desktop problems.

Problem Summary:

  • Intermittent issues with the Xfce desktop on Savio where window borders disappear and windows become unmanageable.
  • This appears to be caused by xfwm4 (the window manager) crashing or not starting correctly.
  • Error message observed: Another Window Manager (Xfwm4) is already running on screen :1.0
  • Inconsistent behavior between the current GitHub repository code, the deployed Savio code, and the working LRC OOD implementation.

Root Causes & Potential Solutions:

  1. xfwm4 Startup Conflicts: The "Another Window Manager" error suggests a conflict, likely due to a pre-existing xfwm4 process or lingering resources from a previous session.

    • Solution 1: --replace flag: Using xfwm4 --replace is a good first step. It instructs xfwm4 to take over from any existing window manager, resolving potential conflicts.

    • Solution 2: Robust Cleanup: Implement a more robust cleanup process before starting xfwm4. This might involve checking for and killing any existing xfwm4 processes:

      pkill -f xfwm4  # Kill any existing xfwm4 processes
  2. Timing Issues: The original introduction of sleep 5 suggests potential timing issues where xfwm4 starts before other necessary components are ready.

    • Solution: Strategic Delays: While sleep can be a workaround, it's not ideal. Investigate why the delay is needed. Are other processes (like xfsettingsd) taking too long to start? If so, address the root cause of the delay. Consider using a loop that checks for the existence of critical processes or resources before proceeding, rather than relying on arbitrary sleep durations.
  3. Inconsistent Code Deployment: The discrepancy between the GitHub repo, deployed code, and LRC's implementation adds complexity.

    • Solution: Verify Deployment: This is critical. Ensure the correct code from the GitHub repository is being deployed to Savio. Investigate why the deployed script doesn't reflect the changes from PR #48 (Modify Desktop app script to fix windowing issues). This could involve checking deployment scripts, server configurations, or caching mechanisms.
  4. xfsettingsd Daemonization: xfsettingsd must be run as a daemon.

    • Solution: Ensure Daemonization: Verify that xfsettingsd is correctly launched in the background. Use the & at the end of the command line: xfsettingsd --sm-client-disable &
  5. Backgrounding and Process Groups: The variations in how commands are backgrounded might be causing subtle issues.

    • Solution: Consistent Backgrounding: Adopt a consistent approach across all environments (Savio, LRC, and the example). Using the method from the OSC documentation (backgrounding the entire block within parentheses) appears to be the most robust and recommended approach. It ensures that all related processes are managed as a single group.

Action Plan:

  1. Deployment Synchronization: Immediately address the inconsistent deployment issue. Ensure the current GitHub code is live on Savio.
  2. --replace and Cleanup: Implement the xfwm4 --replace flag and the pkill -f xfwm4 command as a precautionary measure.
  3. xfsettingsd as Daemon: Ensure xfsettingsd is run as a daemon (xfsettingsd --sm-client-disable &).
  4. Refine Backgrounding: Use the parentheses method for backgrounding the entire Xfce startup block, as in the OSC documentation.
  5. Minimize Sleeps: Reduce sleep durations if still needed after implementing the above changes. Aim for eliminating them entirely by addressing the root causes of any delays.
  6. Thorough Testing: After each change, conduct thorough testing on Savio to verify stability and reproduce the original issues. Include scenarios where the desktop is left open for extended periods. Test with different applications to see if specific programs trigger the problem.
  7. Logging: Add more robust logging to the script to capture the startup process of Xfce components. This can help pinpoint the exact point of failure.

By systematically addressing these points, you should be able to create a more robust and reliable Xfce desktop experience on Savio. Clear communication with @wfeinstein about the deployment discrepancy is essential. "
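One caveat about the suggested cleanup: `pkill -f` matches against the full command line, so `pkill -f xfwm4` can hit any process the user owns whose arguments merely contain that string, including the window manager of a second Desktop session on the same node. A narrower variant is sketched below; it is Linux-specific (it reads /proc), and `kill_on_display` is a hypothetical helper, not something from the repository.

```shell
# Terminate processes named exactly $1 that belong to the current user and
# whose environment has DISPLAY equal to $2 (default: this session's DISPLAY).
kill_on_display() {
  name="$1"
  target="${2:-$DISPLAY}"
  for pid in $(pgrep -u "$(id -u)" -x "$name"); do
    [ -r "/proc/$pid/environ" ] || continue
    # /proc/PID/environ is NUL-separated; convert it to one variable per line.
    if tr '\0' '\n' < "/proc/$pid/environ" | grep -q -x -F -e "DISPLAY=$target"; then
      kill "$pid"
    fi
  done
}

# Example: kill_on_display xfwm4 "$DISPLAY"
```

Since /proc/PID/environ reflects the environment the process was started with, this only works when the window manager was launched with DISPLAY already set.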

@paciorek
Contributor Author

Well, I still don't understand what is going on, and now I am seeing over the last few days that the Desktop app seems to be behaving fine, so it's hard to experiment more.

One other thing I've realized is that with the current Savio and LRC Desktop apps, where xfsettingsd and xfce4-panel sit inside the backgrounded block, I don't think either of those commands is actually running, because the prior xfwm4 command is not backgrounded (and inserting some print/echo statements indicates they are never invoked). It does look like both xfsettingsd and xfce4-panel are started when xfce4-session is run (xfce4-session is called from within desktops/xfce.sh).
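To see which process actually launched each component, the parent of every matching process can be listed. This is a sketch assuming the procps versions of `pgrep` and `ps`; `show_parents` is a made-up name.

```shell
# For each command name given, print the PID, the parent PID, and the
# parent's command name of every matching process owned by the current user.
show_parents() {
  for name in "$@"; do
    for pid in $(pgrep -u "$(id -u)" -x "$name"); do
      ppid=$(ps -o ppid= -p "$pid" | tr -d ' ')
      parent=$(ps -o comm= -p "$ppid")
      printf '%s pid=%s parent=%s (%s)\n' "$name" "$pid" "$ppid" "$parent"
    done
  done
}

show_parents xfwm4 xfsettingsd xfce4-panel
```

A parent of xfce4-session for xfsettingsd and xfce4-panel would be consistent with the observation that the copies inside the backgrounded block never start. Note that a daemonized process may be reparented, in which case the listed parent is not the original launcher.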

So that is odd.

I'll also note that with this PR (as well as with the code from commit 1b23744) it looks like XFCE puts up some pop-up windows (related to the "notification area" and "XFCE4 Policy Kit agent") perhaps because xfsettingsd and xfce4-panel are started twice.

I'm going to wait more and try to reproduce the behavior that Jeff and I saw a few weeks ago. If I can, I will follow up here, possibly with a proposal to modify the current backgrounded block approach to add --replace. And we might remove the xfsettingsd and xfce4-panel invocations...

@paciorek
Contributor Author

It looks like Wei's merge of PR #62 and copying to the live app removed the discrepancy between the repo and the live apps versions of Desktop that I noted in a previous comment.

We'll have to decide whether to also merge in this PR, which might fix some cases where there are problems with xfwm4 dying. But I don't feel like I have a good handle on to what extent this PR would address those problems. And in contradiction of my claim earlier in this discussion, XFCE is starting even though xfsettingsd does not currently have the --daemon flag.

4 participants