
@hftsoi
Contributor

@hftsoi hftsoi commented Oct 28, 2021

PR description:

This PR just comments out two lines to remove the warning message about an unexpected CTP7 firmware ID, which is causing a hassle for debugging the ongoing collisions. Functionality is not changed.
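For illustration, a minimal sketch of the kind of change; the names and message text below are hypothetical, not the actual EventFilter/L1TRawToDigi source (in CMSSW the message would go through edm::LogError/LogWarning): the firmware-ID check itself stays, only the message emission is commented out, so the unpacking behaviour is untouched.

```cpp
// Hypothetical sketch only -- illustrative names, not the actual CaloLayer1Setup.cc code.
#include <cstdint>
#include <iostream>

bool checkFirmwareId(uint32_t firmwareId, uint32_t expectedFirmwareId) {
  if (firmwareId != expectedFirmwareId) {
    // These are the kind of lines being commented out by this PR:
    // std::cerr << "Unexpected CTP7 firmware ID 0x" << std::hex << firmwareId
    //           << " (expected 0x" << expectedFirmwareId << ")\n";
  }
  return true;  // unpacking proceeds exactly as before
}
```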

@cmsbuild
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35873/26270

  • This PR adds an extra 20KB to the repository

Code check has found code style and quality issues which could be resolved by applying the following patch(es)

@mmusich
Contributor

mmusich commented Oct 28, 2021

please test

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35873/26273

  • This PR adds an extra 20KB to the repository

@cmsbuild
Contributor

A new Pull Request was created by @hftsoi (Ho-Fung Tsoi) for master.

It involves the following packages:

  • EventFilter/L1TRawToDigi (l1)

@rekovic, @cecilecaillol can you please review it and eventually sign? Thanks.
@dinyar, @missirol, @thomreis, @Martin-Grunewald this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release managers for this.

cms-bot commands are listed here

@rekovic
Contributor

rekovic commented Oct 28, 2021

A backport to 12_0_X is available (#35877).

@cmsbuild
Contributor

-1

Failed Tests: RelVals-INPUT
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2058f1/20008/summary.html
COMMIT: f9aaa17
CMSSW: CMSSW_12_1_X_2021-10-27-2300/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/35873/20008/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-INPUT

  • 138.3 138.3_RunMinimumBias2021Splash+RunMinimumBias2021Splash+RECODR3Splash+HARVESTDR3/step2_RunMinimumBias2021Splash+RunMinimumBias2021Splash+RECODR3Splash+HARVESTDR3.log

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 6 differences found in the comparisons
  • DQMHistoTests: Total files compared: 42
  • DQMHistoTests: Total histograms compared: 2901440
  • DQMHistoTests: Total failures: 11
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 2901406
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.004 KiB( 41 files compared)
  • DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
  • Checked 177 log files, 37 edm output root files, 42 DQM output files
  • TriggerResults: no differences found

@BenjaminRS
Contributor

I guess this is a spurious failure caused by a problem with the input relval? (Clicking on Details, it states there are no changes.)
Also, the backport looks to have passed all the checks...

@rekovic
Contributor

rekovic commented Oct 28, 2021

The failing job cannot have anything to do with this PR. The reported error messages in the failing job don't point to an obvious reason apart from an external termination request.


A fatal system signal has occurred: external termination request
The following is the call stack containing the origin of the signal.

A fatal system signal has occurred: external termination request
The following is the call stack containing the origin of the signal.

Waiting for stacktrace completion failed: timed out waiting for GDB to complete.

Current Modules:

Module: EcalHaloDataProducer:EcalHaloData (crashed)

A fatal system signal has occurred: external termination request

@rekovic
Contributor

rekovic commented Oct 28, 2021

+1

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (but tests are reportedly failing). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@tvami
Contributor

tvami commented Oct 28, 2021

@perrotta @qliphy this is a trivial PR; since pre5 has issues, can we merge this without the tests, by giving it a +1, and include it in pre5?

@tvami
Contributor

tvami commented Oct 28, 2021

type bug-fix

@perrotta
Contributor

@perrotta @qliphy this is a trivial PR; since pre5 has issues, can we merge this without the tests, by giving it a +1, and include it in pre5?

@tvami I don't see why merging this PR would be urgently useful in 12_1_X: it disables a warning message (which nonetheless refers to a real issue) whose massive presence can undoubtedly complicate the debugging. On the other hand, it seems to me that this only applies to online collisions, which are not the target of 12_1_X. Is that correct?

If so, my suggestion would be:

  • Merge it only in 12_0_X, to allow quick debugging of the issue, and find a real solution to it
  • Do not merge it in 12_1_X, so that if the fix above is not eventually implemented the warning will show up again when online data taking moves to (some daughter release of) 12_1_X
  • In the meantime, a simple update could issue the warning message only once, or a few times, per job (a minimal sketch of this idea follows below)

Please let me know if I misunderstood the usefulness or the range of application of this PR.
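
For the last point, a minimal sketch of what such a rate-limited warning could look like, using a plain static counter; the function name, message text, and limit are hypothetical and not the actual CMSSW code (in CMSSW the message would be emitted through edm::LogWarning rather than std::cerr):

```cpp
// Hypothetical sketch of a per-job warning throttle; not the actual CMSSW code.
#include <atomic>
#include <cstdint>
#include <iostream>

namespace {
  constexpr int kMaxFwIdWarnings = 5;    // frequency is up to the implementer
  std::atomic<int> fwIdWarningCount{0};  // shared across threads within the job
}

void warnUnexpectedFwId(uint32_t seenId, uint32_t expectedId) {
  const int n = ++fwIdWarningCount;
  if (n > kMaxFwIdWarnings)
    return;  // stay silent after the first few occurrences
  std::cerr << "Unexpected CTP7 firmware ID 0x" << std::hex << seenId
            << " (expected 0x" << expectedId << ")";
  if (n == kMaxFwIdWarnings)
    std::cerr << " -- further occurrences will be suppressed for this job";
  std::cerr << std::dec << '\n';
}
```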

@tvami
Contributor

tvami commented Oct 28, 2021

Hi @perrotta sorry my bad, I didn't explain the reasons at all.

Indeed this is relevant for data taking alone, and so it's more urgent in the backport. I was thinking that a possible rereco would use 12_1_X, but you are right that could be a 12_1_1 if needed.

a simple update could issue the warning message only once, or a few times, per job

This was the suggestion about 3 months ago; I'll let the L1T team comment on this further.

@tvami
Contributor

tvami commented Oct 28, 2021

I'll also remove the urgent label from here and put it on the backport

@cmsbuild cmsbuild removed the urgent label Oct 28, 2021
@smuzaffar smuzaffar modified the milestones: CMSSW_12_1_X, CMSSW_12_2_X Oct 29, 2021
@perrotta
Contributor

As I said, I am a bit reluctant to merge this PR as it is in the master. The warning message it refers to is a genuine one: I understand if we silence it completely in the online release, but in a release which has to last and be forward ported I think it may be risky to just forget about it.

I second the suggestion of @tvami and I'd ask you to either:

  • fix the issue at the firmware level
  • modify this PR so that at least once in a while (up to you to decide the frequency) the warning message is still output

At least the second option can be easily implemented, and I'd go with it even if you can eventually fix the firmware.

Please let us know if you can take care of the above.

@tvami
Contributor

tvami commented Oct 29, 2021

As I said, I am a bit reluctant to merge this PR as it is in the master. The warning message it refers to is a genuine one: I understand if we silence it completely in the online release, but in a release which has to last and be forward ported I think it may be risky to just forget about it.

I second the suggestion of @tvami and I'd ask you to either:

  • fix the issue at the firmware level
  • modify this PR so that at least once in a while (up to you to decide the frequency) the warning message is still output

At least the second option can be easily implemented, and I'd go with it even if you can eventually fix the firmware.

Please let us know if you can take care of the above.

Let me tag @BenjaminRS

@BenjaminRS
Contributor

Hi @perrotta and @tvami -- we discussed with L1T Calo Layer 1; @asavincms has said:

this message is obsolete, I have checked it already and found that many years ago we did such a comparison,
but since then we never updated the FW ID in the release, so this is not needed

So it should be safe for that particular elog message. I believe @rekovic is looking into the option of collecting various warning messages to be displayed with less frequency.

As this is the master branch: we could do a proper PR to remove the testing of the FW ID, which is not used, and just have the backport comment out the 2 lines of offending code? But do you want this to be unsynchronised?

@perrotta
Contributor

Hi @perrotta and @tvami -- we discussed with L1T Calo Layer 1; @asavincms has said:

this message is obsolete, I have checked it already and found that many years ago we did such a comparison,
but since then we never updated the FW ID in the release, so this is not needed

So it should be safe for that particular elog message. I believe @rekovic is looking into the option of collecting various warning messages to be displayed with less frequency.

As this is the master branch: we could do a proper PR to remove the testing of the FW ID, which is not used, and just have the backport comment out the 2 lines of offending code?

If the message is not needed any more, then it is better to remove it fully, rather than just commenting it out.
If the message is still potentially a useful warning, I'd go with the plan @rekovic is looking into.

But do you want this to be unsynchronised?

What do you mean? Are you suggesting to first comment them out (as is done here) and eventually implement a PR which allows the warning messages to be displayed with less frequency? Well, it took several months to comment out two lines of code; I wouldn't like it if hiding the issue by completely shutting up the warning takes even longer ;-)
Let's keep this on standby until the other PR is ready.

@BenjaminRS
Contributor

My understanding is that it is not needed any more. @asavincms can you confirm that?
Re: unsynchronised -- I believed that we wanted a quick fix to stop the elogs being a hassle now during these collisions (hence commenting out the 2 lines). Then new PRs can be made to:

  1. remove the if-else statement in CaloLayer1Setup.cc completely and just return res[0] if so desired (see the sketch after this list)
  2. reduce the frequency of reporting various L1T warnings
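
For suggestion (1), a rough before/after sketch with hypothetical types and structure (not the actual CaloLayer1Setup.cc code): since both branches of the obsolete firmware-ID comparison end up returning the same unpacker, the comparison can be dropped and res[0] returned directly.

```cpp
// Hypothetical sketch of suggestion (1); not the actual CaloLayer1Setup.cc code.
#include <memory>
#include <vector>

struct Unpacker {};  // stand-in for the real unpacker type
using UnpackerPtr = std::shared_ptr<Unpacker>;

UnpackerPtr selectUnpacker(const std::vector<UnpackerPtr>& res) {
  // Before (roughly): if (fwId == expectedFwId) return res[0];
  //                   else { /* warn about unexpected CTP7 firmware ID */ return res[0]; }
  // Both branches return res[0], so the obsolete check can simply go away:
  return res[0];
}
```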

@asavincms
Contributor

asavincms commented Oct 29, 2021 via email

@asavincms
Contributor

I am not sure if my message was clear, but the line is really obsolete; we don't need it, and I ask to complete this PR.

Sascha

@BenjaminRS
Contributor

I have implemented suggestion (1) from my comment as PR #35940

@perrotta
Contributor

perrotta commented Nov 2, 2021

I have implemented suggestion (1) from my comment as PR #35940

Thank you @BenjaminRS !
Can this PR get closed, then? (Backports with this simple and tested solution can remain, instead, to be merged only if needed)

@BenjaminRS
Contributor

Hi @perrotta -- yes I think this one can be closed and #35940 would replace it for the master branch.

@perrotta
Contributor

perrotta commented Nov 2, 2021

-1
