Add extended flag to CPU collector #5

scottcunningham · 2017-01-13T11:22:35Z

This allows us to produce both simple and complex metrics. Simple metrics can be aggregated, but we still have the option to drill down into per-host metrics in situations that require them (e.g. checking steal on an individual VM).
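As a rough illustration of the intended behaviour, the sketch below shows how a `simple` flag and an `extended` flag could decide which metrics get emitted. It is not the collector's actual code: the function name `select_cpu_metrics`, the metric name `percent`, the input dict of per-mode CPU time deltas, and the choice to treat everything except `idle` as busy are all assumptions made for the example.

```python
def select_cpu_metrics(times, simple=False, extended=False):
    """Pick which CPU metrics to emit (hypothetical helper, not Diamond code).

    times: dict mapping a CPU mode name ('user', 'steal', 'idle', ...) to the
    time spent in that mode during the sampling interval.
    """
    total = float(sum(times.values()))
    if total == 0:
        return {}

    metrics = {}

    # Aggregate figure: assumed here to be everything that is not idle.
    if simple or extended:
        metrics['percent'] = 100.0 * (total - times.get('idle', 0)) / total

    # Per-mode figures: emitted by default, and also when 'extended' is set.
    if extended or not simple:
        for mode, value in times.items():
            metrics[mode] = 100.0 * value / total

    return metrics


sample = {'user': 30, 'system': 10, 'steal': 10, 'idle': 50}
simple_only = select_cpu_metrics(sample, simple=True)
both = select_cpu_metrics(sample, extended=True)
```

With `simple` alone only the aggregate is returned; with `extended` the aggregate is returned alongside per-mode values such as `steal`, which is the combination the description asks for.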

Without shutting down the socket before closing, with carbon-relay behind an Amazon EC2 ELB, Diamond can get stuck in an inconsistent state where sockets are kept open in a CLOSE-WAIT state.

For example, we let two nodes running Diamond report metrics to carbon-relay behind an ELB over a period of 24 hours with the following parameters:

- node01 - running Diamond with the socket shutdown logic added (this PR)
- node02 - running Diamond without the socket shutdown logic added

Partway into the test (16:52), we observed that node02 stopped reporting metrics (see Grafana screenshot attached to GitHub PR).

Upon checking the logs, we see the following errors on node02 but NOT node01:

    [2016-11-24 16:52:27,915] [MainThread] GraphiteHandler: Socket error, trying reconnect.
    [2016-11-24 17:00:43,056] [MainThread] GraphiteHandler: Socket error, trying reconnect.
    [2016-11-24 17:02:43,045] [MainThread] GraphiteHandler: Socket error, trying reconnect.
    [2016-11-24 17:06:52,968] [MainThread] GraphiteHandler: Socket error, trying reconnect.

This coincides exactly with the time that the node stopped reporting metrics in to carbon.

When checking the socket status on that node with `ss`, we see the following:

```
root@node02:~# ss -ntp | grep diam
CLOSE-WAIT 1 0 172.20.244.54:45725 10.20.255.252:2004 users:(("diamond",12826,4),("diamond",12812,4),("diamond",12803,4),("diamond",12795,4),("diamond",12792,4),("diamond",12786,4),("diamond",12750,4))
CLOSE-WAIT 1 0 172.20.244.54:49532 10.20.255.233:2004 users:(("diamond",12780,4))
```

For reference, following is the `pstree` output for Diamond processes:

```
├─diamond(12750)─┬─diamond(12753)─┬─{diamond}(12782)
│                │                ├─{diamond}(12788)
│                │                ├─{diamond}(12814)
│                │                ├─{diamond}(12828)
│                │                ├─{diamond}(12853)
│                │                ├─{diamond}(12854)
│                │                └─{diamond}(12855)
│                ├─diamond(12780)
│                ├─diamond(12786)
│                ├─diamond(12792)
│                ├─diamond(12795)
│                ├─diamond(12803)
│                ├─diamond(12812)
│                └─diamond(12826)
```

However, on node01 which has the socket shutdown fix, we see the following sockets open:

```
root@node01:~# ss -ntp | grep diam
ESTAB 0 0 172.20.245.210:52872 10.20.255.252:2004 users:(("diamond",19942,4))
```

node01 was able to report metrics for the duration of the test.
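The shutdown-before-close pattern described above can be sketched with the standard `socket` module as follows. This is a minimal illustration, not the handler's actual code; the helper name `shutdown_and_close` is an assumption. The relevant property is that `shutdown()` acts on the connection itself, while `close()` only releases one file descriptor, so a connection whose descriptor is shared by several processes (as the `ss` output for node02 suggests) is not torn down by `close()` alone.

```python
import socket


def shutdown_and_close(sock):
    """Shut down both directions of a connection, then release the descriptor.

    Hypothetical helper for illustration only.
    """
    try:
        # Terminates the connection for every descriptor that refers to it.
        sock.shutdown(socket.SHUT_RDWR)
    except socket.error:
        # The peer may already have gone away; closing is still required.
        pass
    finally:
        sock.close()


# Demonstration with a local socket pair. The duplicate descriptor stands in
# for a copy of the socket held by another process.
left, right = socket.socketpair()
duplicate = left.dup()
right.settimeout(2)

shutdown_and_close(left)
peer_data = right.recv(1)  # empty bytes: the peer sees end-of-stream

duplicate.close()
right.close()
```

In the demonstration the peer sees end-of-stream even though `duplicate` is still open, which a plain `close()` on `left` would not have achieved.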

Shutdown socket connection before closing

This reverts commit ff155a4, reversing changes made to 015e28f.


scottcunningham · 2017-01-13T11:59:35Z

src/collectors/cpu/cpu.py

        config_help.update({
            'percore':  'Collect metrics per cpu core or just total',
            'simple':   'only return aggregate CPU% metric',
+            'extended':  'return aggregate CPU% metric but also complex CPU metrics',


Does anyone have any ideas for a better name for this?

scottcunningham · 2017-01-18T12:20:58Z

Superseded by #6.

Scott Cunningham and others added 5 commits October 13, 2016 16:36

Merge branch 'nginx_collector_precision_fix' into ens-master

015e28f

Merge pull request #4 from Ensighten/socket-shutdown-fix

ff155a4

Shutdown socket connection before closing

Revert "Merge pull request #4 from Ensighten/socket-shutdown-fix"

2723624

This reverts commit ff155a4, reversing changes made to 015e28f.

Add extended flag to CPU collector

8b65329

This allows us to produce both simple and complex metrics. Simple metrics can be aggregated, but we still have the option to drill down into per-host metrics in situations that require them (e.g. checking steal on an individual VM).

scottcunningham commented Jan 13, 2017


scottcunningham force-pushed the ens-master branch from 2723624 to 8c7ea91 Compare January 18, 2017 12:03

scottcunningham closed this Jan 18, 2017
