Fsm dialog #241

Merged
merged 59 commits into from
Dec 10, 2024

Conversation

dalonsoa
Collaborator

Description

Adds the dialog to gather the arguments needed to execute transitions on the FSM. A couple of bugs with the hardcoded FSM have been fixed as well.

Recording.2024-11-22.135358.mp4

Fixes #216
Fixes #217
Fixes #161 (the umbrella issue for the FSM)

Type of change

  • Documentation (non-breaking change that adds or improves the documentation)
  • New feature (non-breaking change which adds functionality)
  • Optimization (non-breaking, back-end change that speeds up the code)
  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (whatever its nature)

Key checklist

  • All tests pass (eg. python -m pytest)
  • The documentation builds and looks OK (eg. python -m sphinx -b html docs docs/build)
  • Pre-commit hooks run successfully (eg. pre-commit run --all-files)

Further checks

  • Code is commented, particularly in hard-to-understand areas
  • Tests added or an issue has been opened to tackle that in the future. (Indicate issue here: # (issue))

@codecov-commenter

codecov-commenter commented Nov 22, 2024

Codecov Report

Attention: Patch coverage is 90.76923% with 12 lines in your changes missing coverage. Please review.

Project coverage is 86.49%. Comparing base (21536ea) to head (67a2935).
Report is 60 commits behind head on state_machine.

Files with missing lines                        Patch %   Lines missing
process_manager/process_manager_interface.py    16.66%    5
controller/controller_interface.py              92.00%    4
controller/views/partials.py                    90.47%    2
controller/views/pages.py                       66.66%    1
Additional details and impacted files
@@                Coverage Diff                @@
##           state_machine     #241      +/-   ##
=================================================
+ Coverage          84.18%   86.49%   +2.30%     
=================================================
  Files                 37       39       +2     
  Lines                487      607     +120     
=================================================
+ Hits                 410      525     +115     
- Misses                77       82       +5     
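
The figures in the report can be cross-checked with a little arithmetic. This is only a sketch: it assumes coverage is hits divided by lines, and that patch coverage is covered changed lines over total changed lines. The patch size of 130 lines is inferred from the 12 missing lines and the 90.76923% figure; it is not stated in the report.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Return coverage as the percentage of lines that were hit."""
    return 100.0 * hits / lines


# Hits and lines taken from the coverage diff above.
base = coverage_pct(410, 487)
head = coverage_pct(525, 607)

# Missing lines per file, as listed in the report.
missing_per_file = {
    "process_manager/process_manager_interface.py": 5,
    "controller/controller_interface.py": 4,
    "controller/views/partials.py": 2,
    "controller/views/pages.py": 1,
}
missing = sum(missing_per_file.values())

# ASSUMPTION: 130 changed lines, inferred from 12 missing lines at 90.76923%.
patch_lines = 130
patch = coverage_pct(patch_lines - missing, patch_lines)

print(f"base  {base:.4f}%")
print(f"head  {head:.4f}%")
print(f"delta {head - base:+.4f}%")
print(f"patch {patch:.5f}% ({missing} lines missing)")
```

Computed this way, the base value lands just below 84.19%, so the reported 84.18% looks like truncation rather than rounding, although that is a guess about how the report formats numbers.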

Contributor

@cc-a cc-a left a comment

Great stuff @dalonsoa. I've suggested a few minor tweaks.

controller/controller_interface.py (outdated, resolved)
controller/controller_interface.py (outdated, resolved)
controller/forms.py (resolved)
controller/templates/controller/index.html (outdated, resolved)
controller/templates/controller/index.html (outdated, resolved)
@dalonsoa dalonsoa mentioned this pull request Dec 3, 2024
10 tasks
@cc-a
Contributor

cc-a commented Dec 4, 2024

A little realism, but I'm not sure how much we really need, so long as we can interact with the processes as we need to.

@dalonsoa
Collaborator Author

dalonsoa commented Dec 5, 2024

If that's all, let's make that the default session and move on, so we don't get stuck on this.

@AdrianDAlessandro
Contributor

I say merge this as-is and make a separate issue about changing the default to lr-session.

@plasorak

plasorak commented Dec 5, 2024

I've been digging into this and, in addition to being sluggish, two processes always die when performing transitions - always the same ones:

app-1  | Command 'execute_fsm_command' failed on 'ru-01' (response flag 'DRUNC_EXCEPTION_THROWN')
app-1  | Command 'execute_fsm_command' failed on 'ru-02' (response flag 'DRUNC_EXCEPTION_THROWN')

No other exception or message appears in the app logs, and the response flag from the command sent to the FSM is always FSM_EXECUTED_SUCCESSFULLY, which is misleading, since it gets stuck and cannot actually run STOP. The logs from the drunc app are confusing but seem to refer mostly to the two failed apps, ru-01 and ru-02. Restarting them in the process manager does not help.
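
One way to surface the mismatch described above is to walk the whole response tree rather than trusting the top-level flag. The sketch below is purely illustrative: the `Response` dataclass, its `children` field and the `find_flag` helper are hypothetical and are not drunc's actual response type or API. Only the two flag names come from the logs.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Response:
    """Hypothetical stand-in for a command response; NOT drunc's real type."""

    name: str
    flag: str
    children: list[Response] = field(default_factory=list)


def find_flag(response: Response, flag: str = "DRUNC_EXCEPTION_THROWN") -> list[str]:
    """Return the names of all nodes in the tree whose flag equals `flag`."""
    found = [response.name] if response.flag == flag else []
    for child in response.children:
        found.extend(find_flag(child, flag))
    return found


# Example tree mirroring the situation in the logs: the top level reports
# success while two children report an exception. The node names other than
# ru-01 and ru-02 are made up for the example.
top = Response(
    name="root-controller",
    flag="FSM_EXECUTED_SUCCESSFULLY",
    children=[
        Response(name="ru-01", flag="DRUNC_EXCEPTION_THROWN"),
        Response(name="ru-02", flag="DRUNC_EXCEPTION_THROWN"),
    ],
)
failed = find_flag(top)
```

With a check like this, a dialog could report the transition as failed whenever `failed` is non-empty, regardless of the top-level flag.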

I'm going to try the newest nightly build, in case that helps, but I'm running out of ideas.

I think it would be helpful to see the logs of ru-controller here.

@dalonsoa
Collaborator Author

dalonsoa commented Dec 6, 2024

@plasorak here they are, with entries from just before and after the ru-0X processes fail. On Windows this works fine; it fails on macOS and Linux, and the new 1x1 configuration works fine on both.

[05:41:31] INFO rest_api_child.py:303 ru-01-commander: Received reply from ru-01 to start
           INFO broadcast_sender.py:65 Broadcast: Propagated execute_fsm_command to children (ru-01) successfully
           INFO rest_api_child.py:303 ru-02-commander: Received reply from ru-02 to start
           INFO broadcast_sender.py:65 Broadcast: Propagated execute_fsm_command to children (ru-02) successfully
           INFO broadcast_sender.py:65 Broadcast: Changing operational_state from propagating-start to start-propagated
           INFO broadcast_sender.py:65 Broadcast: Changing operational_state from start-propagated to executing-start
           INFO broadcast_sender.py:65 Broadcast: Changing operational_state from executing-start to start-terminated
           INFO broadcast_sender.py:65 Broadcast: Changing operational_state from configured to ready
           INFO broadcast_sender.py:65 Broadcast: Changing operational_state from start-terminated to finalising-start
           INFO broadcast_sender.py:65 Broadcast: Changing operational_state from finalising-start to ready
           INFO broadcast_sender.py:65 Broadcast: User 'nobody' successfully executed 'execute_fsm_command'
           INFO broadcast_sender.py:65 Broadcast: User 'nobody' successfully executed 'status'
Exception in thread connectivity_service_updating_thread:
Traceback (most recent call last):
  File "/basedir/NFD_DEV_241114_A9/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 445, in _make_request
    six.raise_from(e, None)
  File "", line 3, in raise_from
  File "/basedir/NFD_DEV_241114_A9/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 440, in _make_request
    httplib_response = conn.getresponse()
  File "/cvmfs/dunedaq.opensciencegrid.org/spack/externals/ext-v2.1/spack-0.22.0/opt/spack/linux-almalinux9-x86_64/gcc-12.1.0/python-3.10.10-gcsatsf5lmzrhmprzux7uv67w2omc7e3/lib/python3.10/http/client.py", line 1374, in getresponse
    response.begin()
  File "/cvmfs/dunedaq.opensciencegrid.org/spack/externals/ext-v2.1/spack-0.22.0/opt/spack/linux-almalinux9-x86_64/gcc-12.1.0/python-3.10.10-gcsatsf5lmzrhmprzux7uv67w2omc7e3/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/cvmfs/dunedaq.opensciencegrid.org/spack/externals/ext-v2.1/spack-0.22.0/opt/spack/linux-almalinux9-x86_64/gcc-12.1.0/python-3.10.10-gcsatsf5lmzrhmprzux7uv67w2omc7e3/lib/python3.10/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/cvmfs/dunedaq.opensciencegrid.org/spack/externals/ext-v2.1/spack-0.22.0/opt/spack/linux-almalinux9-x86_64/gcc-12.1.0/python-3.10.10-gcsatsf5lmzrhmprzux7uv67w2omc7e3/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/basedir/NFD_DEV_241114_A9/.venv/lib/python3.10/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/basedir/NFD_DEV_241114_A9/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "/basedir/NFD_DEV_241114_A9/.venv/lib/python3.10/site-packages/urllib3/util/retry.py", line 532, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/basedir/NFD_DEV_241114_A9/.venv/lib/python3.10/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/basedir/NFD_DEV_241114_A9/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/basedir/NFD_DEV_241114_A9/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 447, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/basedir/NFD_DEV_241114_A9/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 336, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=5000): Read timed out. (read timeout=0.5)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/cvmfs/dunedaq.opensciencegrid.org/spack/externals/ext-v2.1/spack-0.22.0/opt/spack/linux-almalinux9-x86_64/gcc-12.1.0/python-3.10.10-gcsatsf5lmzrhmprzux7uv67w2omc7e3/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/cvmfs/dunedaq.opensciencegrid.org/spack/externals/ext-v2.1/spack-0.22.0/opt/spack/linux-almalinux9-x86_64/gcc-12.1.0/python-3.10.10-gcsatsf5lmzrhmprzux7uv67w2omc7e3/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/basedir/NFD_DEV_241114_A9/.venv/lib/python3.10/site-packages/drunc/controller/controller.py", line 281, in update_connectivity_service
    ctrler.connectivity_service.publish(
  File "/basedir/NFD_DEV_241114_A9/.venv/lib/python3.10/site-packages/drunc/connectivity_service/client.py", line 110, in publish
    http_post(
  File "/basedir/NFD_DEV_241114_A9/.venv/lib/python3.10/site-packages/drunc/utils/utils.py", line 268, in http_post
    r = post(address, json=data, **post_kwargs)
  File "/basedir/NFD_DEV_241114_A9/.venv/lib/python3.10/site-packages/requests/api.py", line 119, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "/basedir/NFD_DEV_241114_A9/.venv/lib/python3.10/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/basedir/NFD_DEV_241114_A9/.venv/lib/python3.10/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/basedir/NFD_DEV_241114_A9/.venv/lib/python3.10/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/basedir/NFD_DEV_241114_A9/.venv/lib/python3.10/site-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='localhost', port=5000): Read timed out. (read timeout=0.5)
[05:41:46] INFO broadcast_sender.py:65 Broadcast: User 'nobody' successfully executed 'status'
[05:41:48] INFO broadcast_sender.py:65 Broadcast: User 'nobody' successfully executed 'status'
[05:41:53] INFO broadcast_sender.py:65 Broadcast: Propagating take_control to children
           INFO broadcast_sender.py:65 Broadcast: Propagating take_control to children (ru-01)
[05:41:54] INFO rest_api_child.py:517 ru-01-rest-api-child: Ignoring command 'take_control' sent to 'ru-01'
           INFO broadcast_sender.py:65 Broadcast: Propagating take_control to children (ru-02)
           INFO rest_api_child.py:517 ru-02-rest-api-child: Ignoring command 'take_control' sent to 'ru-02'
           INFO broadcast_sender.py:65 Broadcast: User 'nobody' successfully executed 'take_control'
           INFO broadcast_sender.py:65 Broadcast: Changing operational_state from ready to preparing-drain_dataflow
           INFO broadcast_sender.py:65 Broadcast: Changing operational_state from preparing-drain_dataflow to drain_dataflow-ready
           INFO broadcast_sender.py:65 Broadcast: Changing operational_state from drain_dataflow-ready to propagating-drain_dataflow
           INFO broadcast_sender.py:65 Broadcast: Propagating execute_fsm_command to children
           INFO broadcast_sender.py:65 Broadcast: Propagating execute_fsm_command to children (ru-01)
           INFO rest_api_child.py:532 ru-01-rest-api-child: Sending 'drain_dataflow' to 'ru-01'
           INFO broadcast_sender.py:65 Broadcast: Propagating execute_fsm_command to children (ru-02)
           INFO rest_api_child.py:532 ru-02-rest-api-child: Sending 'drain_dataflow' to 'ru-02'
[05:41:55] INFO rest_api_child.py:303 ru-02-commander: Received reply from ru-02 to drain_dataflow
           INFO rest_api_child.py:303 ru-01-commander: Received reply from ru-01 to drain_dataflow
           INFO broadcast_sender.py:65 Broadcast: Propagated execute_fsm_command to children (ru-02) successfully
           INFO broadcast_sender.py:65 Broadcast: Propagated execute_fsm_command to children (ru-01) successfully
           INFO broadcast_sender.py:65 Broadcast: Changing operational_state from propagating-drain_dataflow to drain_dataflow-propagated
[05:41:56] INFO broadcast_sender.py:65 Broadcast: Changing operational_state from drain_dataflow-propagated to executing-drain_dataflow
           INFO broadcast_sender.py:65 Broadcast: Changing operational_state from executing-drain_dataflow to drain_dataflow-terminated
           INFO broadcast_sender.py:65 Broadcast: Changing operational_state from ready to dataflow_drained
           INFO broadcast_sender.py:65 Broadcast: Changing operational_state from drain_dataflow-terminated to finalising-drain_dataflow
           INFO broadcast_sender.py:65 Broadcast: Changing operational_state from finalising-drain_dataflow to dataflow_drained
           INFO broadcast_sender.py:65 Broadcast: User 'nobody' successfully executed 'execute_fsm_command'
           INFO broadcast_sender.py:65 Broadcast: User 'nobody' successfully executed 'status'
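
The traceback in these logs shows `connectivity_service_updating_thread` dying on an uncaught `ReadTimeout` after a 0.5 s read timeout. As a minimal sketch of one possible mitigation (not what drunc does), the publish call could be retried with backoff, with the final failure reported instead of raised so that the thread survives. `publish_with_retry` and all of its parameters are hypothetical names; the caller passes in whichever exception types should count as a timeout, for example `requests.exceptions.ReadTimeout`.

```python
import time
from typing import Callable, Tuple, Type


def publish_with_retry(
    publish: Callable[[], None],
    timeout_errors: Tuple[Type[BaseException], ...] = (TimeoutError,),
    attempts: int = 3,
    backoff: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `publish` up to `attempts` times, retrying on timeout errors.

    Returns True as soon as a call succeeds and False if every attempt
    timed out. Other exception types are not caught and still propagate.
    """
    for attempt in range(attempts):
        try:
            publish()
            return True
        except timeout_errors:
            # Wait with exponential backoff, but not after the last attempt.
            if attempt < attempts - 1:
                sleep(backoff * 2**attempt)
    return False
```

The `sleep` parameter exists only so the waiting can be stubbed out in tests. Whether retrying is appropriate here depends on why the connectivity service is slow, which this thread leaves open.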

@plasorak

plasorak commented Dec 6, 2024

Thanks @dalonsoa, new issue here: DUNE-DAQ/drunc#321

@dalonsoa
Collaborator Author

dalonsoa commented Dec 6, 2024

Fantastic! Any idea why things might be failing on Linux and macOS but not on Windows?

@plasorak

plasorak commented Dec 6, 2024

Not really... This could be due to the performance of the connectivity service. Is it possible your Windows machine is more powerful?

@dalonsoa
Collaborator Author

dalonsoa commented Dec 6, 2024

Maybe... It is from last year, so pretty new, but not particularly high specs, I think.

Contributor

@jamesturner246 jamesturner246 left a comment

Cool. 👍 The dynamic class creation thing is weird. 😕
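
For readers unfamiliar with the pattern, dynamic class creation here presumably means building a form class at runtime from the arguments a given transition needs. The following is a generic, standard-library-only sketch of that idea using `type()`. It is not the code in `controller/forms.py`: the `Field`, `BaseForm` and `make_form_class` names, and the `run_number` and `comment` arguments, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Field:
    """Hypothetical field spec: how to convert the raw string, and if required."""

    convert: Callable[[str], Any]
    required: bool = True


class BaseForm:
    """Minimal form base class; subclasses supply a `fields` mapping."""

    fields: Dict[str, Field] = {}

    def __init__(self, data: Dict[str, str]):
        self.data = data
        self.cleaned: Dict[str, Any] = {}
        self.errors: Dict[str, str] = {}

    def is_valid(self) -> bool:
        """Convert every field, collecting errors; True if there are none."""
        self.cleaned.clear()
        self.errors.clear()
        for name, spec in self.fields.items():
            raw = self.data.get(name, "")
            if raw == "":
                if spec.required:
                    self.errors[name] = "This field is required."
                continue
            try:
                self.cleaned[name] = spec.convert(raw)
            except ValueError:
                self.errors[name] = f"Invalid value: {raw!r}"
        return not self.errors


def make_form_class(transition: str, arguments: Dict[str, Field]) -> type:
    """Create a BaseForm subclass at runtime for one transition's arguments."""
    name = f"{transition.title().replace('_', '')}Form"
    return type(name, (BaseForm,), {"fields": dict(arguments)})


# Hypothetical usage: a 'start' transition taking a run number and a comment.
StartForm = make_form_class(
    "start",
    {"run_number": Field(int), "comment": Field(str, required=False)},
)
form = StartForm({"run_number": "42"})
```

The three-argument form of `type()` is what makes the class "dynamic": the set of fields is only known once the transition's arguments are known, so the class cannot be written out statically.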

controller/forms.py (resolved)
controller/controller_interface.py (resolved)
@cc-a cc-a merged commit 12f6dc7 into state_machine Dec 10, 2024
4 checks passed
@cc-a cc-a deleted the fsm_dialog branch December 10, 2024 12:36
6 participants