-
Couldn't load subscription status.
- Fork 192
Make polling intervals in the ThermalMonitor class configurable #635
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Make polling intervals in the ThermalMonitor class configurable #635
Conversation
…urable. Default polling interval of 60s is quite high and can be unresponsive. Vendors should be able to override this polling interval
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
would it make more sense to do this via a config file like was done for sensormond? https://github.com/sonic-net/sonic-platform-daemons/blob/master/sonic-sensormond/scripts/sensormond#L485 |
Hi Gregory, We currently have the config set by the existing pmon_daemon_control.json config file: Example usage of this new config: PR which adds new config flags to thermalctld: PR which sets new config flags using pmon_daemon_control.json: When we were working on this initially we thought pmon_daemon_control.json would be the most fitting place to put the configuration. We hadn't seen your change to use platform_env.conf yet. What are your thoughts on platform_env.conf vs pmon_daemon_control.json? |
I had actually originally tried the same thing using the pmon daemon control to act as a further input for this to the supervisord configuration: 9e91789#diff-195551a342dbd08efef0aa4f3e21a4f4e24e2ee87cadcc27070299e9fc4a0390R560. General community sentiment though was to use the platform_env.conf for platform specific configurability (at least when this was discussed in the chassis call when we reviewed it originally). |
|
Chassisd also uses it for a timing related configuration: https://github.com/sonic-net/sonic-platform-daemons/blob/master/sonic-chassisd/scripts/chassisd#L312. Probably best to keep it consistent across the different daemons. Imo you should join next weeks chassis call and bring it up for discussion if you feel strongly towards the pmon_daemon_control method. |
|
Thanks Gregory, I will join next week and get more ideas before refactoring this. |
| thermal_control = ThermalControlDaemon() | ||
| parser = argparse.ArgumentParser() | ||
| parser.add_argument('--thermal-monitor-initial-interval', type=int, default=5) | ||
| parser.add_argument('--thermal-monitor-update-interval', type=int, default=60) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see @gregoryboudreau comments earlier in the issue. I have similar comment - It is not possible to add platform speficif timer intervals when we start thermalctld in supervisord.conf : https://github.com/sonic-net/sonic-buildimage/blob/master/dockers/docker-platform-monitor/docker-pmon.supervisord.conf.j2.
We can define it in platform.json file for that particular platform.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Am of the opinion that we keep this unified across daemons within PMON and put it in the platform_env.conf given that sensormond and chassisd already use this for timing related configurations that differ from the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Judy and Gregory, will update these PRs soon to go with that feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have updated the PR to get timing-related configurations from platform_env.conf, so that it is consistent with sensormond and chassisd. Unit tests were added as well.
Please take another look.
Thank you both Judy and Gregory!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks ! sorry if I got this comment late @lotus-nexthop
platform_env.conf was originally thought to set configurations platform wide eg: https://github.com/sonic-net/sonic-buildimage/blob/master/device/arista/x86_64-arista_7060x6_16pe_384c_b/platform_env.conf, https://github.com/sonic-net/sonic-buildimage/blob/master/device/arista/x86_64-arista_7280cr3_32p4/platform_env.conf
Prefer to update the timer configuraion details in the "pmon_daemon_control.json" where we can keep configs pertaining to any daemon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gregoryboudreau I feel the timer configurations are a pmon daemon specific configurations, which is clear if it is in this format : https://github.com/sonic-net/sonic-buildimage/blob/master/device/nexthop/common/pmon_daemon_control.json#L5-L8
Good to keep all pmon daemon configs in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, should we change at least sensormond to follow this model then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also +1 to making this change in pmon_daemon_control.json. We can do incremental improvement starting with sensormond.
I can revert this PR back to the original. Please note that it will need to be paired with sonic-net/sonic-buildimage#23139 as well for the full functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is now reverted back to the original.
Please help review both this PR and sonic-net/sonic-buildimage#23139. Thank you both!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another possibility is to have thermal related daemons/threads use the same interval specified by thermal_manager for the fan/thermal algorithm (or some scaled smaller interval to ensure operational fluctuations do not delay temperature updates to the algorithm): https://github.com/sonic-net/sonic-platform-daemons/blob/master/sonic-thermalctld/scripts/thermalctld#L837
This would ideally also be able to apply to xcvrd for DOM polling, syncd for ASIC temperature polling, and perhaps others I'm unaware of.
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| k, v = line.split('=', 1) | ||
| k = k.strip() | ||
| v = v.strip() | ||
| if k == 'THERMALCTLD_THERMAL_MONITOR_INITIAL_INTERVAL': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can these be wrapped in a try/except? Not sure if you want these to fallback to defaults or take the daemon down on an incorrect config file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for raising this up! I think it's better to fallback to the default values and let the daemon continue.
I added try/except and updated the unit tests as well. Thanks, Gregory!
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Default polling interval of 60s is quite high and can be unresponsive. Vendors should be able to override this polling interval
Description
Motivation and Context
How Has This Been Tested?
Additional Information (Optional)