[8.0] fix: Singularity issue with non existing SE + JobAgent issue with exception raised during submission #8118

Merged

@aldbr aldbr commented Apr 4, 2025

I found these two related issues while experimenting with the Site configuration:

  • Initial issue:
2025-04-03T15:36:40,991592Z WorkloadManagement/JobAgent/Singularity INFO: Creating singularity container
2025-04-03T15:36:40,992786Z WorkloadManagement/JobAgent/WorkloadManagement/JobAgent INFO: Found Job LogLevel JDL parameter with value DEBUG

# The issue that triggers the first problem: an SE section is not found
2025-04-03T15:36:41,618254Z WorkloadManagement/JobAgent ERROR: StorageFactory._getConfigStorageName: Failed to get storage options Path /Resources/StorageElements/<NOT_EXISTING_SE> does not exist or it's not a section
  • 1st problem: self.storages is not initialized because the StorageElement object is malformed (SE section not found)
2025-04-03T15:36:41,619437Z WorkloadManagement/JobAgent/WorkloadManagement/JobAgent ERROR: Exception occurred when submitting JobID: 2140558
Traceback (most recent call last):
  File "/cvmfs/lhcbdev.cern.ch/lhcbdirac/versions/v12.0.0a11-1743438349/Linux-aarch64/lib/python3.11/site-packages/DIRAC/WorkloadManagementSystem/Agent/JobAgent.py", line 637, in _submitJob
    result = self.computingElement.submitJob(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cvmfs/lhcbdev.cern.ch/lhcbdirac/versions/v12.0.0a11-1743438349/Linux-aarch64/lib/python3.11/site-packages/DIRAC/Resources/Computing/SingularityComputingElement.py", line 384, in submitJob
    mountedPath = StorageElement(seName).getStorageParameters(protocol="file")["Value"]["Path"]
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cvmfs/lhcbdev.cern.ch/lhcbdirac/versions/v12.0.0a11-1743438349/Linux-aarch64/lib/python3.11/site-packages/DIRAC/Resources/Storage/StorageElement.py", line 705, in getStorageParameters
    for storage in self.storages.values():
                   ^^^^^^^^^^^^^
  File "/cvmfs/lhcbdev.cern.ch/lhcbdirac/versions/v12.0.0a11-1743438349/Linux-aarch64/lib/python3.11/site-packages/DIRAC/Resources/Storage/StorageElement.py", line 1368, in __getattr__
    raise AttributeError(f"StorageElement does not have a method '{name}'")
AttributeError: StorageElement does not have a method 'storages'
2025-04-03T15:36:41,620682Z WorkloadManagement/JobAgent/WorkloadManagement/JobAgent ERROR: Job submission failed 2140558
  • 2nd problem: instead of capturing the message of the exception to reschedule the job, we capture the exception itself
2025-04-03T15:36:41,620789Z WorkloadManagement/JobAgent/WorkloadManagement/JobAgent ERROR: Error in DIRAC JobWrapper or inner CE execution: StorageElement does not have a method 'storages'
2025-04-03T15:36:41,620842Z WorkloadManagement/JobAgent/WorkloadManagement/JobAgent WARN: Failure ==> rescheduling (during StorageElement does not have a method 'storages')
2025-04-03T15:36:41,620910Z WorkloadManagement/JobAgent/WorkloadManagement/JobAgent ERROR: Agent exception while calling method <bound method JobAgent.execute of <DIRAC.WorkloadManagementSystem.Agent.JobAgent.JobAgent object at 0x40000e892f10>>
Traceback (most recent call last):
  File "/cvmfs/lhcbdev.cern.ch/lhcbdirac/versions/v12.0.0a11-1743438349/Linux-aarch64/lib/python3.11/site-packages/DIRAC/Core/Base/AgentModule.py", line 314, in am_secureCall
    result = functor(*args)
             ^^^^^^^^^^^^^^
  File "/cvmfs/lhcbdev.cern.ch/lhcbdirac/versions/v12.0.0a11-1743438349/Linux-aarch64/lib/python3.11/site-packages/DIRAC/WorkloadManagementSystem/Agent/JobAgent.py", line 311, in execute
    result = self._checkSubmittedJobs()
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/cvmfs/lhcbdev.cern.ch/lhcbdirac/versions/v12.0.0a11-1743438349/Linux-aarch64/lib/python3.11/site-packages/DIRAC/WorkloadManagementSystem/Agent/JobAgent.py", line 700, in _checkSubmittedJobs
    self._rescheduleFailedJob(jobID, result["Message"])
  File "/cvmfs/lhcbdev.cern.ch/lhcbdirac/versions/v12.0.0a11-1743438349/Linux-aarch64/lib/python3.11/site-packages/DIRAC/WorkloadManagementSystem/Agent/JobAgent.py", line 828, in _rescheduleFailedJob
    self.jobs[jobID]["JobReport"].setJobStatus(
  File "/cvmfs/lhcbdev.cern.ch/lhcbdirac/versions/v12.0.0a11-1743438349/Linux-aarch64/lib/python3.11/site-packages/DIRAC/WorkloadManagementSystem/Client/JobReport.py", line 38, in
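
The second problem can be illustrated in isolation. The following is a minimal sketch, not the actual JobAgent code: `s_ok`, `s_error`, `submit` and `submit_and_wrap` are hypothetical stand-ins that mimic DIRAC's `{"OK": ..., "Message": ...}` return structure so the snippet runs without DIRAC installed. `json.dumps` is used only as an example of a consumer that needs plain strings; the call that actually fails inside JobReport is cut off in the log above.

```python
import json


# Hypothetical stand-ins mimicking DIRAC-style return structures.
def s_ok(value=None):
    return {"OK": True, "Value": value}


def s_error(message):
    return {"OK": False, "Message": message}


def submit(job_id):
    """Hypothetical submission that fails like the first traceback above."""
    raise AttributeError("StorageElement does not have a method 'storages'")


def submit_and_wrap(job_id, use_message):
    """Wrap a failed submission in an error structure.

    use_message=False reproduces the problem (the exception object is stored),
    use_message=True stores only the exception text.
    """
    try:
        submit(job_id)
    except Exception as exc:
        return s_error(str(exc) if use_message else exc)
    return s_ok()


buggy = submit_and_wrap(2140558, use_message=False)
fixed = submit_and_wrap(2140558, use_message=True)

# Both variants look identical when interpolated into a log line...
same_when_logged = f"{buggy['Message']}" == f"{fixed['Message']}"

# ...but only the string survives a consumer that requires plain data.
try:
    json.dumps(buggy)
    buggy_serializable = True
except TypeError:
    buggy_serializable = False

fixed_serializable = json.dumps(fixed) is not None
```

This may explain why the log lines above look normal ("rescheduling (during StorageElement does not have a method 'storages')") even though the value passed on for rescheduling is an exception object rather than a string.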

BEGINRELEASENOTES
*Resources
FIX: do not try to use a malformed StorageElement instance in SingularityCE
*WorkloadManagement
FIX: report the message of the Exception instead of the Exception itself in JobAgent.submitJob
ENDRELEASENOTES
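
For readers unfamiliar with the first failure mode, here is a minimal sketch of how an attribute that is never assigned in `__init__` ends up in a `__getattr__` fallback, and of one possible guard on the caller side. `FakeStorageElement`, its `valid` flag, `get_storage_parameters` and `mounted_path` are hypothetical names for illustration only; they are not DIRAC's `StorageElement` API, and the actual change in this PR may differ.

```python
class FakeStorageElement:
    """Hypothetical stand-in for an object whose setup can fail part-way."""

    def __init__(self, name, known_sections):
        self.name = name
        self.valid = name in known_sections
        if not self.valid:
            # Early return: self.storages is never assigned.
            return
        self.storages = {"file": {"Path": known_sections[name]}}

    def __getattr__(self, attr):
        # Python calls this only when normal attribute lookup fails,
        # so a missing 'storages' is reported as a missing "method".
        raise AttributeError(f"StorageElement does not have a method '{attr}'")

    def get_storage_parameters(self):
        if not self.valid:
            return {"OK": False, "Message": f"{self.name} is not a valid StorageElement"}
        return {"OK": True, "Value": self.storages["file"]}


def mounted_path(se_name, known_sections):
    """Return the mount path, or None when the SE cannot be used."""
    result = FakeStorageElement(se_name, known_sections).get_storage_parameters()
    if not result["OK"]:
        # Skip the malformed instance instead of indexing ["Value"] blindly.
        return None
    return result["Value"]["Path"]


sections = {"LOCAL-SE": "/data"}  # assumed configuration content

# Unguarded access on a malformed instance reproduces the error text.
broken = FakeStorageElement("MISSING-SE", sections)
try:
    broken.storages
    error_text = None
except AttributeError as exc:
    error_text = str(exc)

missing = mounted_path("MISSING-SE", sections)
present = mounted_path("LOCAL-SE", sections)
```

The point of the guard is that the caller checks the result before dereferencing it, so a missing configuration section degrades to "no mount" rather than an exception during submission.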

@DIRACGridBot DIRACGridBot added the alsoTargeting:integration Cherry pick this PR to integration after merge label Apr 4, 2025
@chrisburr chrisburr merged commit 3255c8e into DIRACGrid:rel-v8r0 Apr 7, 2025
26 checks passed
@DIRACGridBot DIRACGridBot added the sweep:done All sweeping actions have been done for this PR label Apr 7, 2025
DIRACGridBot pushed a commit to DIRACGridBot/DIRAC that referenced this pull request Apr 7, 2025
…obAgent issue with exception raised during submission
@DIRACGridBot

Sweep summary

Sweep ran in https://github.com/DIRACGrid/DIRAC/actions/runs/14305428060

Successful:

  • integration
