From 09e032ed3a0a41ec13a89ff676ae285a04ab0a9a Mon Sep 17 00:00:00 2001 From: isabelle-guitton Date: Thu, 9 Jan 2025 11:44:13 +0100 Subject: [PATCH 1/8] Issue #533 Document the self-monitoring feature * Reviewed and updated the self-monitoring section in configure-monitoring.md --- .../configuration/configure-monitoring.md | 58 +++++++------------ 1 file changed, 22 insertions(+), 36 deletions(-) diff --git a/metricshub-doc/src/site/markdown/configuration/configure-monitoring.md b/metricshub-doc/src/site/markdown/configuration/configure-monitoring.md index 5679e5f26..b79868f1e 100644 --- a/metricshub-doc/src/site/markdown/configuration/configure-monitoring.md +++ b/metricshub-doc/src/site/markdown/configuration/configure-monitoring.md @@ -1209,33 +1209,27 @@ hw.status{state="degraded"} 1 In this case, only the `degraded` state is reported, and the zero values for `ok` and `failed` are suppressed after the initial state transition. -## Self-Monitoring +#### Self-Monitoring -**MetricsHub** includes **self-monitoring capabilities** to track its own performance. This feature can monitor key aspects such **job duration metrics**. +The self-monitoring feature helps you track **MetricsHub**'s performance by providing metrics like job duration. These metrics offer detailed insights into task execution times, helping identify bottlenecks or inefficiencies and optimizing performance. -### Configuration: `enableSelfMonitoring` +To enable this feature, set the `enableSelfMonitoring` parameter to `true` in the relevant section of the `config/metricshub.yaml` file as described below: -This configuration controls whether **MetricsHub** reports internal signals such as job duration metrics. -#### Supported Values - -- `true` (default): Enables self-monitoring capabilities. -- `false`: Disables self-monitoring capabilities. 
- -#### Configuration Scopes - -You can configure `enableSelfMonitoring` at the following levels: +| Self-Monitoring | Set enableSelfMonitoring to true | +|----------------------------------------------------|---------------------------------------------------------| +| For all resources | In the global section (top of the file) | +| For all the resources of a specific resource group | Under the corresponding `` section | +| For a specific resource | Under the corresponding `` section | -1. **Global Configuration** - Applies to all monitored resources. +##### Example 1: Enabling self-monitoring for all resources ```yaml enableSelfMonitoring: true # Set to "false" to disable resourceGroups: ... ``` -2. **Per Resource Group** - Applies to all resources within a specific group. +##### Example 2: Enabling self-monitoring for all resources of a specific resource group ```yaml resourceGroups: @@ -1244,8 +1238,7 @@ You can configure `enableSelfMonitoring` at the following levels: resources: ... ``` -3. **Per Resource** - Applies to an individual resource. +##### Example 3: Enabling self-monitoring for a specific resource ```yaml resourceGroups: @@ -1255,9 +1248,7 @@ You can configure `enableSelfMonitoring` at the following levels: enableSelfMonitoring: true # Set to "false" to disable ``` -### Examples of Self-Monitoring Metrics - -When enabled, **MetricsHub** reports the `metricshub.job.duration` metrics, for example: +When enabled, **MetricsHub** reports the `metricshub.job.duration` metrics. For example: ``` metricshub.job.duration{job.type="discovery", monitor.type="enclosure", connector_id="HPEGen10IloREST"} 0.020 @@ -1268,21 +1259,16 @@ metricshub.job.duration{job.type="collect", monitor.type="cpu", connector_id="HP ``` Where: -- **`job.type`**: Specifies the type of operation performed by MetricsHub. - - Possible values: - - `discovery`: Identifies and registers components. - - `collect`: Gathers telemetry data from the monitored components. 
- - `simple`: Executes a single straightforward task. - - `beforeAll` or `afterAll`: Runs preparatory or cleanup operations. -- **`monitor.type`**: Indicates the specific category of component being monitored. - - Examples: - - Hardware components like `cpu`, `memory`, `physical_disk`, or `disk_controller`. - - Environmental metrics like `temperature` or `battery`. - - Logical entities like `connector`. -- **`connector_id`**: The unique identifier of the connector defining the method and protocol to collect metrics for the specified component. - - Example: `"HPEGen10IloREST"` denotes the HPE Gen10 iLO REST connector. - -These metrics provide granular insights into task execution times, enabling the identification of bottlenecks or inefficiencies and helping optimize monitoring performance. +* **`job.type`**: is the operation performed by **MetricsHub**. Possible values are: + * `discovery`: identifies and registers components. + * `collect`: gathers telemetry data from monitored components. + * `simple`: executes a straightforward task. + * `beforeAll` or `afterAll`: performs preparatory or cleanup operations. +* ***`monitor.type`**: is the component being monitored. Examples: + * Hardware: `cpu`, `memory`, `physical_disk`, or `disk_controller`. + * Environmental metrics: `temperature` or `battery`. + * Logical entities: `connector`. +- **`connector_id`**: is a unique identifier for the connector defining the collection method and protocol. `HPEGen10IloREST` refers for example to the `HPE Gen10 iLO REST` connector. 
#### Timeout, duration and period format From 2af62ee183690b4631f8ee230763e6fda0e4d5eb Mon Sep 17 00:00:00 2001 From: isabelle-guitton Date: Tue, 14 Jan 2025 11:11:52 +0100 Subject: [PATCH 2/8] Issue #533 Work in progress - First draft --- .../configuration/configure-monitoring.md | 21 ---------- .../site/markdown/troubleshooting/index.md | 7 ++-- .../troubleshooting/self-monitoring.md | 39 +++++++++++++++++++ metricshub-doc/src/site/site.xml | 1 + 4 files changed, 44 insertions(+), 24 deletions(-) create mode 100644 metricshub-doc/src/site/markdown/troubleshooting/self-monitoring.md diff --git a/metricshub-doc/src/site/markdown/configuration/configure-monitoring.md b/metricshub-doc/src/site/markdown/configuration/configure-monitoring.md index b79868f1e..eb62a8081 100644 --- a/metricshub-doc/src/site/markdown/configuration/configure-monitoring.md +++ b/metricshub-doc/src/site/markdown/configuration/configure-monitoring.md @@ -1248,27 +1248,6 @@ To enable this feature, set the `enableSelfMonitoring` parameter to `true` in th enableSelfMonitoring: true # Set to "false" to disable ``` -When enabled, **MetricsHub** reports the `metricshub.job.duration` metrics. For example: - -``` -metricshub.job.duration{job.type="discovery", monitor.type="enclosure", connector_id="HPEGen10IloREST"} 0.020 -metricshub.job.duration{job.type="discovery", monitor.type="cpu", connector_id="HPEGen10IloREST"} 0.030 -metricshub.job.duration{job.type="discovery", monitor.type="temperature", connector_id="HPEGen10IloREST"} 0.025 -metricshub.job.duration{job.type="discovery", monitor.type="connector", connector_id="HPEGen10IloREST"} 0.015 -metricshub.job.duration{job.type="collect", monitor.type="cpu", connector_id="HPEGen10IloREST"} 0.015 -``` - -Where: -* **`job.type`**: is the operation performed by **MetricsHub**. Possible values are: - * `discovery`: identifies and registers components. - * `collect`: gathers telemetry data from monitored components. 
- * `simple`: executes a straightforward task. - * `beforeAll` or `afterAll`: performs preparatory or cleanup operations. -* ***`monitor.type`**: is the component being monitored. Examples: - * Hardware: `cpu`, `memory`, `physical_disk`, or `disk_controller`. - * Environmental metrics: `temperature` or `battery`. - * Logical entities: `connector`. -- **`connector_id`**: is a unique identifier for the connector defining the collection method and protocol. `HPEGen10IloREST` refers for example to the `HPE Gen10 iLO REST` connector. #### Timeout, duration and period format diff --git a/metricshub-doc/src/site/markdown/troubleshooting/index.md b/metricshub-doc/src/site/markdown/troubleshooting/index.md index f1418d3bf..e102b58a1 100644 --- a/metricshub-doc/src/site/markdown/troubleshooting/index.md +++ b/metricshub-doc/src/site/markdown/troubleshooting/index.md @@ -10,9 +10,10 @@ This section provides guidance on: * **Troubleshooting common MetricsHub issues:** * [No data for a specific monitored resource](./no-data-resources.md) * [No data in the observability platforms](./no-data-observability-platforms.md) -* **Enabling logs for:** - * [the MetricsHub Agent](./metricshub-logs.md) - * [the OTel Collector](./otel-logs.md). +* **Enabling:** + * [MetricsHub's self-monitoring](./self-monitoring.md) + * [the MetricsHub Agent logs](./metricshub-logs.md) + * [the OTel Collector logs](./otel-logs.md). For further assistance, consider: diff --git a/metricshub-doc/src/site/markdown/troubleshooting/self-monitoring.md b/metricshub-doc/src/site/markdown/troubleshooting/self-monitoring.md new file mode 100644 index 000000000..664be0b50 --- /dev/null +++ b/metricshub-doc/src/site/markdown/troubleshooting/self-monitoring.md @@ -0,0 +1,39 @@ +keywords: self-monitoring, performance, job duration, troubleshooting +description: How to track MetricsHub's own performance. 
+ +## MetricsHub self-monitoring + + + + + +### Enabling self-monitoring + +Refer to [Monitoring Configuration](../configuration/configure-monitoring.md#self-monitoring) page to know how to enable the self-monitoring feature. + +### + +Monitoring MetricsHub’s own performance ensures that your observability stack runs efficiently, enabling proactive troubleshooting and optimization. Use the self-monitoring feature described in the [Monitoring Configuration](../configuration/configure-monitoring.md#self-monitoring) page to access detailed metrics. + +When self-monitoring is enabled, the `metricshub.job.duration` metric provides insights into task execution times. Key tags include: + +* **`job.type`**: Operation performed by **MetricsHub**. Possible values are: + * `discovery`: Identifies and registers components. + * `collect`: Gathers telemetry data from monitored components. + * `simple`: Executes a straightforward task. + * `beforeAll` or `afterAll`: Performs preparatory or cleanup operations. +* **`monitor.type`**: Component being monitored. Examples: + * Hardware metrics: `cpu`, `memory`, `physical_disk`, or `disk_controller`. + * Environmental metrics: `temperature` or `battery`. + * Logical entities: `connector`. +- **`connector_id`**: Unique identifier for the connector, such as HPEGen10IloREST for the HPE Gen10 iLO REST connector. 
+ +Example: + +``` +metricshub.job.duration{job.type="discovery", monitor.type="enclosure", connector_id="HPEGen10IloREST"} 0.020 +metricshub.job.duration{job.type="discovery", monitor.type="cpu", connector_id="HPEGen10IloREST"} 0.030 +metricshub.job.duration{job.type="discovery", monitor.type="temperature", connector_id="HPEGen10IloREST"} 0.025 +metricshub.job.duration{job.type="discovery", monitor.type="connector", connector_id="HPEGen10IloREST"} 0.015 +metricshub.job.duration{job.type="collect", monitor.type="cpu", connector_id="HPEGen10IloREST"} 0.015 +``` diff --git a/metricshub-doc/src/site/site.xml b/metricshub-doc/src/site/site.xml index d9ae7a468..e9a8ad61f 100644 --- a/metricshub-doc/src/site/site.xml +++ b/metricshub-doc/src/site/site.xml @@ -114,6 +114,7 @@ + From a6bf3272548f233abefb533e233bf436204c3770 Mon Sep 17 00:00:00 2001 From: isabelle-guitton Date: Wed, 15 Jan 2025 15:38:36 +0100 Subject: [PATCH 3/8] Issue #533: Document the self-monitoring feature * Wrote the first version of the doc --- .../configuration/configure-monitoring.md | 78 ++++++++++--------- .../troubleshooting/degraded-performance.md | 53 +++++++++++++ .../site/markdown/troubleshooting/index.md | 5 +- .../troubleshooting/self-monitoring.md | 39 ---------- metricshub-doc/src/site/site.xml | 1 + 5 files changed, 97 insertions(+), 79 deletions(-) create mode 100644 metricshub-doc/src/site/markdown/troubleshooting/degraded-performance.md delete mode 100644 metricshub-doc/src/site/markdown/troubleshooting/self-monitoring.md diff --git a/metricshub-doc/src/site/markdown/configuration/configure-monitoring.md b/metricshub-doc/src/site/markdown/configuration/configure-monitoring.md index eb62a8081..7ed18a077 100644 --- a/metricshub-doc/src/site/markdown/configuration/configure-monitoring.md +++ b/metricshub-doc/src/site/markdown/configuration/configure-monitoring.md @@ -8,7 +8,7 @@ description: How to configure the MetricsHub Agent to collect metrics from a var **MetricsHub** extracts 
metrics from the resources configured in the `config/metricshub.yaml` file. These **resources** can be hosts, applications, or other components running in your IT infrastructure. Each **resource** is typically associated with a physical location, such as a data center or server room, or a logical location, like a business unit. -In **MetricsHub**, these locations are referred to as **sites**. +In **MetricsHub**, these locations are referred to as **sites**. In highly distributed infrastructures, multiple resources can be organized into **resource groups** to simplify management and monitoring. To reflect this organization, you are asked to define your **resource group** first, followed by your **site** and its corresponding **resources** in the `config/metricshub.yaml` file stored in: @@ -16,16 +16,16 @@ To reflect this organization, you are asked to define your **resource group** fi > * `C:\ProgramData\MetricsHub\config` on Windows systems > * `./metricshub/lib/config` on Linux systems -> **Important**: We recommend using an editor supporting the +> **Important**: We recommend using an editor supporting the [Schemastore](https://www.schemastore.org/json#editors) to edit **MetricsHub**'s configuration YAML - files (Example: [Visual Studio Code](https://code.visualstudio.com/download) and - [vscode.dev](https://vscode.dev), + files (Example: [Visual Studio Code](https://code.visualstudio.com/download) and + [vscode.dev](https://vscode.dev), with [RedHat's YAML extension](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml)). ## Step 1: Configure resource groups > Note: For centralized infrastructures, `resourceGroups` are not required. - Simply configure resources as explained in [Step 2](./configure-monitoring.html#step-2-configure-resources). + Simply configure resources as explained in [Step 2](./configure-monitoring.html#step-2-configure-resources). 
Create a resource group for each site to be monitored under the `resourceGroups:` section: @@ -35,6 +35,7 @@ resourceGroups: attributes: site: # Specify where resources are hosted ``` + Replace: * `` with the actual name of your resource group @@ -68,6 +69,7 @@ At this stage, you can configure sustainability metrics reporting. For more deta host.type: ``` + * or under the resource group you previously specified *(recommended for highly distributed infrastructures)* ```yaml @@ -83,7 +85,7 @@ At this stage, you can configure sustainability metrics reporting. For more deta ``` -The syntax to adopt for configuring your resources will differ whether your resources have unique +The syntax to adopt for configuring your resources will differ whether your resources have unique or similar characteristics (such as device type, protocols, and credentials). ### Syntax for unique resources @@ -108,7 +110,9 @@ resources: host.extra.attribute: [ , , etc. ] ``` + Whatever the syntax adopted, replace: + * `` with the actual hostname or IP address of the resource * `` with the type of resource to be monitored. Possible values are: * [`win`](https://metricshub.com/docs/latest/connectors/tags/windows.html) for Microsoft Windows systems @@ -162,8 +166,6 @@ resourceGroups: ``` - - ### Protocols and credentials #### HTTP @@ -200,7 +202,7 @@ resourceGroups: timeout: 60 ``` -#### ICMP Ping +#### ICMP Ping Use the parameters below to configure the ICMP ping protocol: @@ -566,6 +568,7 @@ resourceGroups: ### Customize resource hostname By default, the `host.name` attribute specified for a resource determines both: + * the hostname used to execute requests against the resource for collecting metrics * the hostname associated with each OpenTelemetry metric collected for the resource. 
@@ -592,7 +595,7 @@ resources: #### Example for resources sharing similar characteristics -For resources with shared characteristics, you can define multiple hostnames in the configuration: +For resources with shared characteristics, you can define multiple hostnames in the configuration: ```yaml resources: @@ -648,17 +651,16 @@ Follow the structure below to declare your monitor: ``` Refer to: -- [Monitors](https://sentrysoftware.org/metricshub-community-connectors/develop/monitors.html) for more information on how to configure custom resource monitoring. -- [Monitoring the health of a service](../usecases/service-health.md) for a practical example that demonstrates how to use this feature effectively. - +* [Monitors](https://sentrysoftware.org/metricshub-community-connectors/develop/monitors.html) for more information on how to configure custom resource monitoring. +* [Monitoring the health of a service](../usecases/service-health.md) for a practical example that demonstrates how to use this feature effectively. ### Basic Authentication settings #### Enterprise Edition authentication -In the Enterprise Edition, the **MetricsHub**'s internal `OTLP Exporter` authenticates itself with the _OpenTelemetry Collector_'s [OTLP gRPC Receiver](send-telemetry.md#otlp-grpc) by including the HTTP `Authorization` request header with the credentials. +In the Enterprise Edition, the **MetricsHub**'s internal `OTLP Exporter` authenticates itself with the *OpenTelemetry Collector*'s [OTLP gRPC Receiver](send-telemetry.md#otlp-grpc) by including the HTTP `Authorization` request header with the credentials. -These settings are already configured in the `config/metricshub.yaml` file of **MetricsHub Enterprise Edition**. Changing them is **not recommended** unless you are familiar with managing communication between the **MetricsHub** `OTLP Exporter` and the _OpenTelemetry Collector_'s `OTLP Receiver`. 
+These settings are already configured in the `config/metricshub.yaml` file of **MetricsHub Enterprise Edition**. Changing them is **not recommended** unless you are familiar with managing communication between the **MetricsHub** `OTLP Exporter` and the *OpenTelemetry Collector*'s `OTLP Receiver`. To override the default value of the *Basic Authentication Header*, configure the `otel.exporter.otlp.metrics.headers` and `otel.exporter.otlp.logs.headers` parameters under the `otel` section: @@ -835,7 +837,7 @@ To know which connectors are available, refer to [Connectors Directory](../metri Otherwise, you can list the available connectors using the below command: ```shell-session -$ metricshub -l +metricshub -l ``` For more information about the `metricshub` command, refer to [MetricsHub CLI (metricshub)](../guides/cli.md). @@ -853,9 +855,9 @@ patchDirectory: /opt/patch/connectors # Replace with the path to your patch conn loggerLevel: ... ``` -#### Customize data collection +#### Customize data collection -**MetricsHub** allows you to customize data collection on your Windows or Linux servers, specifying exactly which processes or services to monitor. This customization is achieved by configuring the following connector variables: +**MetricsHub** allows you to customize data collection on your Windows or Linux servers, specifying exactly which processes or services to monitor. 
This customization is achieved by configuring the following connector variables: | Connector Variable | Available for | Usage | |--------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------| @@ -868,7 +870,7 @@ Refer to the [Connectors directory](../metricshub-connectors-directory.html#) an ##### Procedure -In the `config/metricshub.yaml` file, locate the resource for which you wish to customize data collection and specify the `variables` attribute available under the `additionalConnectors` section: +In the `config/metricshub.yaml` file, locate the resource for which you wish to customize data collection and specify the `variables` attribute available under the `additionalConnectors` section: ```yaml resources: @@ -886,17 +888,16 @@ resources: | Property | Description | |--------------------------|--------------------------------------------------------------------------------------------------------------------------------| -| ` ` | Custom ID for this additional connector. | -| `uses` | _(Optional)_ Provide an ID for this additional connector. If not specified, the key ID will be used. | -| `force` | _(Optional)_ Set to `false` if you want the connector to only be activated when detected (Default: `true` - always activated). | +| `` | Custom ID for this additional connector. | +| `uses` | *(Optional)* Provide an ID for this additional connector. If not specified, the key ID will be used. | +| `force` | *(Optional)* Set to `false` if you want the connector to only be activated when detected (Default: `true` - always activated). | | `variables` | Specify the connector variable to be used and its value (Format: `: `). 
| > Note: If a connector is added under the `additionalConnectors` section with missing or unspecified variables, those variables will automatically be populated with default values defined by the connector itself. For practical examples demonstrating effective use of this feature, refer to the following pages: -- [Monitoring a process command line](../usecases/process-command-line.md) -- [Monitoring a service running on Linux](../usecases/service-linux.md). - +* [Monitoring a process command line](../usecases/process-command-line.md) +* [Monitoring a service running on Linux](../usecases/service-linux.md). #### Filter monitors @@ -907,10 +908,10 @@ To manage the volume of telemetry data sent to your observability platform and t You can apply monitor inclusion or exclusion in data collection for the following scopes: * All resources -* All the resources within a specific resource group. A resource group is a container that holds resources to be monitored and generally refers to a site or a specific location. +* All the resources within a specific resource group. A resource group is a container that holds resources to be monitored and generally refers to a site or a specific location. * A specific resource -This is done by adding the `monitorFilters` parameter in the relevant section of the `config/metricshub.yaml` file as described below: +This is done by adding the `monitorFilters` parameter in the relevant section of the `config/metricshub.yaml` file as described below: | Filter monitors | Add monitorFilters | |----------------------------------------------------|---------------------------------------------------------| @@ -929,21 +930,22 @@ To obtain the monitor name: 2. Click the connector of your choice (e.g.: [WindowsOS Metrics](../connectors/windows.html)) 3. Scroll-down to the **Metrics** section and note down the relevant monitor **Type**. 
-> **Warning**: Excluding monitors may lead to missed outage detection or inconsistencies in collected data, such as inaccurate power consumption estimates or other metrics calculated by the engine. Use exclusions carefully to avoid overlooking important information. -The monitoring of critical devices such as batteries, power supplies, CPUs, fans, and memories should not be disabled. +> **Warning**: Excluding monitors may lead to missed outage detection or inconsistencies in collected data, such as inaccurate power consumption estimates or other metrics calculated by the engine. Use exclusions carefully to avoid overlooking important information. +The monitoring of critical devices such as batteries, power supplies, CPUs, fans, and memories should not be disabled. ##### Example 1: Including monitors for all resources - ```yaml monitorFilters: [ +enclosure, +fan, +power_supply ] # Include specific monitors globally resourceGroups: ... ``` -##### Example 2: Excluding monitors for all resources +##### Example 2: Excluding monitors for all resources + ```yaml monitorFilters: [ "!volume" ] # Exclude specific monitors globally ``` + ##### Example 3: Including monitors for all resources within a specific resource group ```yaml @@ -952,6 +954,7 @@ The monitoring of critical devices such as batteries, power supplies, CPUs, fans monitorFilters: [ +enclosure, +fan, +power_supply ] # Include specific monitors for this group resources: ... ``` + ##### Example 4: Excluding monitors for all resources within a specific resource group ```yaml @@ -960,6 +963,7 @@ The monitoring of critical devices such as batteries, power supplies, CPUs, fans monitorFilters: [ "!volume" ] # Exclude specific monitors for this group resources: ... 
``` + ##### Example 5: Including monitors for a specific resource ```yaml @@ -978,7 +982,7 @@ The monitoring of critical devices such as batteries, power supplies, CPUs, fans resources: : monitorFilters: [ "!volume" ] # Exclude specific monitors for this resource - ``` + ``` #### Discovery cycle @@ -1145,9 +1149,9 @@ By default, **MetricsHub** compresses StateSet metrics to reduce unnecessary rep This configuration controls how StateSet metrics are reported, specifically whether zero values should be suppressed or not. -- **Supported values:** - - `none`: No compression is applied. All StateSet metrics, including zero values, are reported on every collection cycle. - - `suppressZeros` (default): **MetricsHub** compresses StateSet metrics by reporting the zero value only the first time a state transitions to zero. Subsequent reports will include only the non-zero state values. +* **Supported values:** + * `none`: No compression is applied. All StateSet metrics, including zero values, are reported on every collection cycle. + * `suppressZeros` (default): **MetricsHub** compresses StateSet metrics by reporting the zero value only the first time a state transitions to zero. Subsequent reports will include only the non-zero state values. 
To configure the StateSet compression level, you can apply the `stateSetCompression` setting in the following scopes: @@ -1215,7 +1219,6 @@ The self-monitoring feature helps you track **MetricsHub**'s performance by prov To enable this feature, set the `enableSelfMonitoring` parameter to `true` in the relevant section of the `config/metricshub.yaml` file as described below: - | Self-Monitoring | Set enableSelfMonitoring to true | |----------------------------------------------------|---------------------------------------------------------| | For all resources | In the global section (top of the file) | @@ -1248,7 +1251,6 @@ To enable this feature, set the `enableSelfMonitoring` parameter to `true` in th enableSelfMonitoring: true # Set to "false" to disable ``` - #### Timeout, duration and period format Timeouts, durations and periods are specified with the below format: @@ -1258,4 +1260,4 @@ Timeouts, durations and periods are specified with the below format: | s | seconds | 120s | | m | minutes | 90m, 1m15s | | h | hours | 1h, 1h30m | -| d | days (based on a 24-hour day) | 1d | \ No newline at end of file +| d | days (based on a 24-hour day) | 1d | diff --git a/metricshub-doc/src/site/markdown/troubleshooting/degraded-performance.md b/metricshub-doc/src/site/markdown/troubleshooting/degraded-performance.md new file mode 100644 index 000000000..d8d6940f8 --- /dev/null +++ b/metricshub-doc/src/site/markdown/troubleshooting/degraded-performance.md @@ -0,0 +1,53 @@ +keywords: self-monitoring, performance, job duration, troubleshooting +description: How to track MetricsHub's own performance. + +## Degraded Performance + +If you observe delays in data collection, missing data points, or timeouts, enable the self-monitoring feature as described in the [Monitoring Configuration](../configuration/configure-monitoring.md#self-monitoring) page. 
This feature provides detailed metrics about job execution times, helping you identify inefficiencies such as misconfigurations, bottlenecks, or performance issues in specific components.
+
+When self-monitoring is enabled, the `metricshub.job.duration` metric provides insights into task execution times. Key attributes include:
+
+* **`job.type`**: The operation performed by **MetricsHub**. Possible values are:
+ * `discovery`: Identifies and registers components.
+ * `collect`: Gathers telemetry data from monitored components.
+ * `simple`: Executes a straightforward task.
+ * `beforeAll` or `afterAll`: Performs preparatory or cleanup operations.
+* **`monitor.type`**: The component being monitored, such as:
+ * *Hardware metrics*: `cpu`, `memory`, `physical_disk`, or `disk_controller`.
+ * *Environmental metrics*: `temperature` or `battery`.
+ * *Logical entities*: `connector`.
+* **`connector_id`**: The unique identifier for the connector, such as HPEGen10IloREST for the HPE Gen10 iLO REST connector.
+
+These metrics can be viewed in Prometheus/Grafana or in the `metricshub-agent-$resourceId-$timestamp.log` file.
+
+### Example
+
+Example of metrics emitted for the `HPEGen10IloREST` connector:
+
+```bash
+metricshub.job.duration{job.type="discovery", monitor.type="enclosure", connector_id="HPEGen10IloREST"} 0.020
+metricshub.job.duration{job.type="discovery", monitor.type="cpu", connector_id="HPEGen10IloREST"} 0.030
+metricshub.job.duration{job.type="discovery", monitor.type="temperature", connector_id="HPEGen10IloREST"} 0.025
+metricshub.job.duration{job.type="discovery", monitor.type="connector", connector_id="HPEGen10IloREST"} 0.015
+metricshub.job.duration{job.type="collect", monitor.type="cpu", connector_id="HPEGen10IloREST"} 0.015
+```
+
+In this example:
+
+* during `discovery`:
+ * The `enclosure` monitor takes `0.020` s.
+ * The `cpu` monitor takes `0.030` s.
+ * The `temperature` monitor takes `0.025` s.
+ * The `connector` monitor takes `0.015` s.
+* during `collect`, the `cpu` metrics collection takes `0.015` s.
+
+These metrics indicate that **MetricsHub** is functioning as expected, with task durations well within acceptable ranges.
+
+If task durations are above 5 seconds, consider the following:
+
+* **Verify resource availability**: Ensure the monitored system has sufficient CPU, memory, and storage resources to handle monitoring tasks.
+* **Check MetricsHub configuration**: Review your configuration to ensure **MetricsHub** is set up correctly.
+* **Restart services**: If configurations appear correct, try restarting relevant services.
+* **Inspect network configurations**: Check for network latency or connectivity issues between **MetricsHub** and the monitored resources, and ensure network settings (e.g., firewalls or proxies) are not causing delays.
+* **Examine logs**: Look for warnings or errors in the [MetricsHub logs](./metricshub-logs.md) or the monitored system's logs to identify potential problems.
+* **Review timeouts**: Ensure timeout settings are appropriate for the environment to prevent unnecessary delays or retries.
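The 5-second check described in `degraded-performance.md` can be sketched as a small script. This is a minimal sketch, assuming the `metricshub.job.duration` values are available as text lines shaped like the example in that page; the helper names `parse_job_durations` and `slow_jobs` are hypothetical and not part of MetricsHub.

```python
import re

# One sample line: metric name, {attributes}, then the duration in seconds.
# Assumption: the text layout mirrors the example shown in the documentation.
LINE_RE = re.compile(r'^(?P<name>[\w.]+)\{(?P<attrs>[^}]*)\}\s+(?P<value>[0-9.]+)$')
ATTR_RE = re.compile(r'([\w.]+)="([^"]*)"')

THRESHOLD_SECONDS = 5.0  # threshold quoted in the troubleshooting text


def parse_job_durations(text):
    """Return a list of (attributes dict, duration in seconds) tuples."""
    results = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None or match.group("name") != "metricshub.job.duration":
            continue  # skip blank lines and unrelated metrics
        attrs = dict(ATTR_RE.findall(match.group("attrs")))
        results.append((attrs, float(match.group("value"))))
    return results


def slow_jobs(durations, threshold=THRESHOLD_SECONDS):
    """Keep only the jobs whose duration exceeds the threshold."""
    return [(attrs, secs) for attrs, secs in durations if secs > threshold]


# The five sample lines from the documentation example.
SAMPLE = """\
metricshub.job.duration{job.type="discovery", monitor.type="enclosure", connector_id="HPEGen10IloREST"} 0.020
metricshub.job.duration{job.type="discovery", monitor.type="cpu", connector_id="HPEGen10IloREST"} 0.030
metricshub.job.duration{job.type="discovery", monitor.type="temperature", connector_id="HPEGen10IloREST"} 0.025
metricshub.job.duration{job.type="discovery", monitor.type="connector", connector_id="HPEGen10IloREST"} 0.015
metricshub.job.duration{job.type="collect", monitor.type="cpu", connector_id="HPEGen10IloREST"} 0.015
"""

durations = parse_job_durations(SAMPLE)
for attrs, secs in slow_jobs(durations):
    # Report each job that crosses the threshold, with its identifying attributes.
    print(f'{attrs["job.type"]}/{attrs["monitor.type"]} ({attrs["connector_id"]}): {secs:.3f} s')
```

With the sample values, no job exceeds the threshold, so the loop reports nothing, which matches the page's conclusion that durations are within acceptable ranges. Passing a lower `threshold` to `slow_jobs` is one way to surface the relatively slowest monitors.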
diff --git a/metricshub-doc/src/site/markdown/troubleshooting/index.md b/metricshub-doc/src/site/markdown/troubleshooting/index.md index e102b58a1..259cd0a68 100644 --- a/metricshub-doc/src/site/markdown/troubleshooting/index.md +++ b/metricshub-doc/src/site/markdown/troubleshooting/index.md @@ -7,11 +7,12 @@ description: How to troubleshoot MetricsHub common issues and basic steps to res This section provides guidance on: -* **Troubleshooting common MetricsHub issues:** +* **Troubleshooting common MetricsHub issues**, such as: + * [Degraded performance](./degraded-performance.md) * [No data for a specific monitored resource](./no-data-resources.md) * [No data in the observability platforms](./no-data-observability-platforms.md) + * **Enabling:** - * [MetricsHub's self-monitoring](./self-monitoring.md) * [the MetricsHub Agent logs](./metricshub-logs.md) * [the OTel Collector logs](./otel-logs.md). diff --git a/metricshub-doc/src/site/markdown/troubleshooting/self-monitoring.md b/metricshub-doc/src/site/markdown/troubleshooting/self-monitoring.md deleted file mode 100644 index 664be0b50..000000000 --- a/metricshub-doc/src/site/markdown/troubleshooting/self-monitoring.md +++ /dev/null @@ -1,39 +0,0 @@ -keywords: self-monitoring, performance, job duration, troubleshooting -description: How to track MetricsHub's own performance. - -## MetricsHub self-monitoring - - - - - -### Enabling self-monitoring - -Refer to [Monitoring Configuration](../configuration/configure-monitoring.md#self-monitoring) page to know how to enable the self-monitoring feature. - -### - -Monitoring MetricsHub’s own performance ensures that your observability stack runs efficiently, enabling proactive troubleshooting and optimization. Use the self-monitoring feature described in the [Monitoring Configuration](../configuration/configure-monitoring.md#self-monitoring) page to access detailed metrics. 
-
-When self-monitoring is enabled, the `metricshub.job.duration` metric provides insights into task execution times. Key tags include:
-
-* **`job.type`**: Operation performed by **MetricsHub**. Possible values are:
- * `discovery`: Identifies and registers components.
- * `collect`: Gathers telemetry data from monitored components.
- * `simple`: Executes a straightforward task.
- * `beforeAll` or `afterAll`: Performs preparatory or cleanup operations.
-* **`monitor.type`**: Component being monitored. Examples:
- * Hardware metrics: `cpu`, `memory`, `physical_disk`, or `disk_controller`.
- * Environmental metrics: `temperature` or `battery`.
- * Logical entities: `connector`.
-- **`connector_id`**: Unique identifier for the connector, such as HPEGen10IloREST for the HPE Gen10 iLO REST connector.
-
-Example:
-
-```
-metricshub.job.duration{job.type="discovery", monitor.type="enclosure", connector_id="HPEGen10IloREST"} 0.020
-metricshub.job.duration{job.type="discovery", monitor.type="cpu", connector_id="HPEGen10IloREST"} 0.030
-metricshub.job.duration{job.type="discovery", monitor.type="temperature", connector_id="HPEGen10IloREST"} 0.025
-metricshub.job.duration{job.type="discovery", monitor.type="connector", connector_id="HPEGen10IloREST"} 0.015
-metricshub.job.duration{job.type="collect", monitor.type="cpu", connector_id="HPEGen10IloREST"} 0.015
-```
diff --git a/metricshub-doc/src/site/site.xml b/metricshub-doc/src/site/site.xml
index e9a8ad61f..c147467de 100644
--- a/metricshub-doc/src/site/site.xml
+++ b/metricshub-doc/src/site/site.xml
@@ -112,6 +112,7 @@
+
From ec2371b063056fb2e906dda4009182fe9b9f5267 Mon Sep 17 00:00:00 2001
From: isabelle-guitton
Date: Wed, 15 Jan 2025 15:54:08 +0100
Subject: [PATCH 4/8] Issue #533: Document the self-monitoring feature

* Updated site.xml
---
 metricshub-doc/src/site/site.xml | 1 -
 1 file changed, 1 deletion(-)

diff --git a/metricshub-doc/src/site/site.xml b/metricshub-doc/src/site/site.xml
index 85b4649db..c423740f6 100644
--- a/metricshub-doc/src/site/site.xml
+++ b/metricshub-doc/src/site/site.xml
@@ -116,7 +116,6 @@
-
From 30f03b78e4d19b7dc7c40f9aa32522a3dca06670 Mon Sep 17 00:00:00 2001
From: isabelle-guitton
Date: Thu, 16 Jan 2025 09:42:52 +0100
Subject: [PATCH 5/8] Issue #533 Document the self-monitoring feature

* Provided additional information
---
 .../troubleshooting/degraded-performance.md | 26 +++++++++++++------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/metricshub-doc/src/site/markdown/troubleshooting/degraded-performance.md b/metricshub-doc/src/site/markdown/troubleshooting/degraded-performance.md
index d8d6940f8..2e738565a 100644
--- a/metricshub-doc/src/site/markdown/troubleshooting/degraded-performance.md
+++ b/metricshub-doc/src/site/markdown/troubleshooting/degraded-performance.md
@@ -18,7 +18,7 @@ When self-monitoring is enabled, the `metricshub.job.duration` metric provides i
 * *Logical entities*: `connector`.
 * **`connector_id`**: The unique identifier for the connector, such as HPEGen10IloREST for the HPE Gen10 iLO REST connector.
-These metrics can be viewed in Prometheus/Grafana or in the `metricshub-agent-$resourceId-$timestamp.log` file.
+These metrics can be viewed in Prometheus/Grafana or in the `metricshub-agent-$resourceId-$timestamp.log` file. Refer to the [MetricsHub Log Files](./metricshub-logs.md) page for details on locating and interpreting log files.
 ### Example
@@ -35,15 +35,25 @@ metricshub.job.duration{job.type="collect", monitor.type="cpu", connector_id="HP
 In this example:
 * during `discovery`:
- * The `enclosure` monitor takes `0.020` s.
- * The `cpu` monitor takes `0.030` s.
- * The `temperature` monitor takes `0.025` s.
- * The `connector` monitor takes `0.015` s.
-* during `collect`, the `cpu` metrics collection takes `0.015` s.
+ * The `enclosure` monitor takes `0.020` seconds.
+ * The `cpu` monitor takes `0.030` seconds.
+ * The `temperature` monitor takes `0.025` seconds.
+ * The `connector` monitor takes `0.015` seconds.
+* during `collect`, the `cpu` metrics collection takes `0.015` seconds.
-These metrics indicate that **MetricsHub** is functioning as expected, with task durations well within acceptable ranges.
+These metrics indicate that **MetricsHub** is functioning as expected, with task durations well within acceptable ranges. Jobs exceeding 5 seconds may require further investigation.
-If task durations are above 5 seconds, consider the following:
+For example, if a job takes more than 5 seconds, as shown below:
+
+```bash
+metricshub.job.duration{job.type="collect", monitor.type="network", connector_id="WbemGenNetwork"} 5.8
+```
+
+1. Identify, the `job.type`, `monitor.type`, and `connector.id`. In this example, collecting network metrics with the `WbemGenNetwork` is the bottleneck
+2. Open the connector's configuration file and review the job steps
+3. Check the `metricshub-agent-$resourceId-$timestamp.log` file for the start and end timestamps of each job step to identify where performance degradation occurs.
+
+You can also:
 * **Verify resource availability**: Ensure the monitored system has sufficient CPU, memory, and storage resources to handle monitoring tasks.
 * **Check MetricsHub configuration**: Review your configuration to ensure **MetricsHub** is set up correctly .
From 8762944f1dad2039527cafd0eaa29bb1f977cda3 Mon Sep 17 00:00:00 2001
From: isabelle-guitton
Date: Thu, 16 Jan 2025 17:30:40 +0100
Subject: [PATCH 6/8] Issue #533: Document the self-monitoring feature

* Took Nassim's comment into account
---
 .../src/site/markdown/troubleshooting/degraded-performance.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/metricshub-doc/src/site/markdown/troubleshooting/degraded-performance.md b/metricshub-doc/src/site/markdown/troubleshooting/degraded-performance.md
index 2e738565a..af53a71ee 100644
--- a/metricshub-doc/src/site/markdown/troubleshooting/degraded-performance.md
+++ b/metricshub-doc/src/site/markdown/troubleshooting/degraded-performance.md
@@ -50,8 +50,7 @@ metricshub.job.duration{job.type="collect", monitor.type="network", connector_id
 ```
 1. Identify, the `job.type`, `monitor.type`, and `connector.id`. In this example, collecting network metrics with the `WbemGenNetwork` is the bottleneck
-2. Open the connector's configuration file and review the job steps
-3. Check the `metricshub-agent-$resourceId-$timestamp.log` file for the start and end timestamps of each job step to identify where performance degradation occurs.
+2. Check the `metricshub-agent-$resourceId-$timestamp.log` file for the start and end timestamps of each job step to identify where performance degradation occurs.
 You can also:
From a5da3c838350db628a26e2a101b6f0d652eef359 Mon Sep 17 00:00:00 2001
From: Nassim Boutekedjiret
Date: Fri, 17 Jan 2025 10:53:45 +0100
Subject: [PATCH 7/8] Issue #533 document the self monitoring feature

Apply suggestions from code review
---
 .../src/site/markdown/configuration/configure-monitoring.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/metricshub-doc/src/site/markdown/configuration/configure-monitoring.md b/metricshub-doc/src/site/markdown/configuration/configure-monitoring.md
index ea30646fe..953cd4178 100644
--- a/metricshub-doc/src/site/markdown/configuration/configure-monitoring.md
+++ b/metricshub-doc/src/site/markdown/configuration/configure-monitoring.md
@@ -888,7 +888,7 @@ resources:
 | Property | Description |
 |--------------------------|--------------------------------------------------------------------------------------------------------------------------------|
-| `` | Custom ID for this additional connector. |
+| `` | Custom ID for this additional connector. |
 | `uses` | *(Optional)* Provide an ID for this additional connector. If not specified, the key ID will be used. |
 | `force` | *(Optional)* Set to `false` if you want the connector to only be activated when detected (Default: `true` - always activated). |
 | `variables` | Specify the connector variable to be used and its value (Format: `: `). |
@@ -1219,7 +1219,7 @@ The self-monitoring feature helps you track **MetricsHub**'s performance by prov
 To enable this feature, set the `enableSelfMonitoring` parameter to `true` in the relevant section of the `config/metricshub.yaml` file as described below:
-| Self-Monitoring | Set enableSelfMonitoring to true |
 |----------------------------------------------------|---------------------------------------------------------|
 | For all resources | In the global section (top of the file) |
 | For all the resources of a specific resource group | Under the corresponding `` section |
From 919fc70b8ab7d13480a3667feafa142e19dff58d Mon Sep 17 00:00:00 2001
From: Nassim Boutekedjiret
Date: Fri, 17 Jan 2025 10:54:47 +0100
Subject: [PATCH 8/8] Issue #533 document the self monitoring feature

* Update metricshub-doc/src/site/markdown/configuration/configure-monitoring.md
---
 .../src/site/markdown/configuration/configure-monitoring.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/metricshub-doc/src/site/markdown/configuration/configure-monitoring.md b/metricshub-doc/src/site/markdown/configuration/configure-monitoring.md
index 953cd4178..1e77d35c5 100644
--- a/metricshub-doc/src/site/markdown/configuration/configure-monitoring.md
+++ b/metricshub-doc/src/site/markdown/configuration/configure-monitoring.md
@@ -1219,7 +1219,7 @@ The self-monitoring feature helps you track **MetricsHub**'s performance by prov
 To enable this feature, set the `enableSelfMonitoring` parameter to `true` in the relevant section of the `config/metricshub.yaml` file as described below:
-| Self-Monitoring | Set enableSelfMonitoring to true |
 |----------------------------------------------------|---------------------------------------------------------|
 | For all resources | In the global section (top of the file) |
 | For all the resources of a specific resource group | Under the corresponding `` section |