Issue #533: document the self monitoring feature #545

Merged
merged 10 commits into from
Jan 17, 2025
127 changes: 47 additions & 80 deletions metricshub-doc/src/site/markdown/configuration/configure-monitoring.md
@@ -8,24 +8,24 @@ description: How to configure the MetricsHub Agent to collect metrics from a var
**MetricsHub** extracts metrics from the resources configured in the `config/metricshub.yaml` file.
These **resources** can be hosts, applications, or other components running in your IT infrastructure.
Each **resource** is typically associated with a physical location, such as a data center or server room, or a logical location, like a business unit.
In **MetricsHub**, these locations are referred to as **sites**.
In highly distributed infrastructures, multiple resources can be organized into **resource groups** to simplify management and monitoring.

To reflect this organization, you are asked to define your **resource group** first, followed by your **site** and its corresponding **resources** in the `config/metricshub.yaml` file stored in:

> * `C:\ProgramData\MetricsHub\config` on Windows systems
> * `./metricshub/lib/config` on Linux systems

> **Important**: We recommend using an editor supporting the
[Schemastore](https://www.schemastore.org/json#editors) to edit **MetricsHub**'s configuration YAML
files (Example: [Visual Studio Code](https://code.visualstudio.com/download) and
[vscode.dev](https://vscode.dev),
with [RedHat's YAML extension](https://marketplace.visualstudio.com/items?itemName=redhat.vscode-yaml)).

## Step 1: Configure resource groups

> Note: For centralized infrastructures, `resourceGroups` are not required.
Simply configure resources as explained in [Step 2](./configure-monitoring.html#step-2-configure-resources).

Create a resource group for each site to be monitored under the `resourceGroups:` section:

@@ -35,6 +35,7 @@ resourceGroups:
attributes:
site: <site-name> # Specify where resources are hosted
```

Replace:

* `<resource-group-name>` with the actual name of your resource group
@@ -68,6 +69,7 @@ At this stage, you can configure sustainability metrics reporting. For more deta
host.type: <type>
<protocol-configuration>
```

* or under the resource group you previously specified *(recommended for highly distributed infrastructures)*

```yaml
@@ -83,7 +85,7 @@ At this stage, you can configure sustainability metrics reporting. For more deta
<protocol-configuration>
```

The syntax for configuring your resources depends on whether they have unique
or similar characteristics (such as device type, protocols, and credentials).

### Syntax for unique resources
@@ -108,7 +110,9 @@ resources:
host.extra.attribute: [ <extra-attribute-for-hostname1>, <extra-attribute-for-hostname2>, etc. ]
<protocol-configuration>
```

Whatever the syntax adopted, replace:

* `<hostname>` with the actual hostname or IP address of the resource
* `<type>` with the type of resource to be monitored. Possible values are:
* [`win`](https://metricshub.com/docs/latest/connectors/tags/windows.html) for Microsoft Windows systems
@@ -162,8 +166,6 @@ resourceGroups:
<protocol-configuration>
```



### Protocols and credentials

#### HTTP
@@ -200,7 +202,7 @@ resourceGroups:
timeout: 60
```

#### ICMP Ping

Use the parameters below to configure the ICMP ping protocol:

@@ -566,6 +568,7 @@ resourceGroups:
### Customize resource hostname

By default, the `host.name` attribute specified for a resource determines both:

* the hostname used to execute requests against the resource for collecting metrics
* the hostname associated with each OpenTelemetry metric collected for the resource.

@@ -592,7 +595,7 @@ resources:

#### Example for resources sharing similar characteristics

For resources with shared characteristics, you can define multiple hostnames in the configuration:

```yaml
resources:
@@ -651,14 +654,13 @@ Refer to:
- [Monitors](https://sentrysoftware.org/metricshub-community-connectors/develop/monitors.html) for more information on how to configure custom resource monitoring.
- [Monitoring the health of a Web service](https://metricshub.com/usecases/monitoring-the-health-of-a-web-service/) for a practical example that demonstrates how to use this feature effectively.


### Basic Authentication settings

#### Enterprise Edition authentication

In the Enterprise Edition, the **MetricsHub**'s internal `OTLP Exporter` authenticates itself with the *OpenTelemetry Collector*'s [OTLP gRPC Receiver](send-telemetry.md#otlp-grpc) by including the HTTP `Authorization` request header with the credentials.

These settings are already configured in the `config/metricshub.yaml` file of **MetricsHub Enterprise Edition**. Changing them is **not recommended** unless you are familiar with managing communication between the **MetricsHub** `OTLP Exporter` and the *OpenTelemetry Collector*'s `OTLP Receiver`.

To override the default value of the *Basic Authentication Header*, configure the `otel.exporter.otlp.metrics.headers` and `otel.exporter.otlp.logs.headers` parameters under the `otel` section:

@@ -835,7 +837,7 @@ To know which connectors are available, refer to [Connectors Directory](../metri
Otherwise, you can list the available connectors using the command below:

```shell-session
metricshub -l
```

For more information about the `metricshub` command, refer to [MetricsHub CLI (metricshub)](../guides/cli.md).
@@ -853,9 +855,9 @@ patchDirectory: /opt/patch/connectors # Replace with the path to your patch conn
loggerLevel: ...
```

#### Customize data collection

**MetricsHub** allows you to customize data collection on your Windows or Linux servers, specifying exactly which processes or services to monitor. This customization is achieved by configuring the following connector variables:

| Connector Variable | Available for | Usage |
|--------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------|
@@ -868,7 +870,7 @@ Refer to the [Connectors directory](../metricshub-connectors-directory.html#) an

##### Procedure

In the `config/metricshub.yaml` file, locate the resource for which you wish to customize data collection and specify the `variables` attribute available under the `additionalConnectors` section:

```yaml
resources:
@@ -886,15 +888,14 @@ resources:

| Property | Description |
|--------------------------|--------------------------------------------------------------------------------------------------------------------------------|
| `<connector-custom-id>` | Custom ID for this additional connector. |
| `uses` | *(Optional)* Provide an ID for this additional connector. If not specified, the key ID will be used. |
| `force` | *(Optional)* Set to `false` if you want the connector to only be activated when detected (Default: `true` - always activated). |
| `variables` | Specify the connector variable to be used and its value (Format: `<variable-name>: <value>`). |

> Note: If a connector is added under the `additionalConnectors` section with missing or unspecified variables, those variables will automatically be populated with default values defined by the connector itself.
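
As a rough sketch of how these properties fit together, the fragment below declares one additional connector for a resource. The placeholders mirror those used in the table, and `<connector-id>` is a placeholder name introduced here for illustration. The nesting under `<resource-id>` is assumed from the procedure above, so check it against your own `config/metricshub.yaml`.

```yaml
resources:
  <resource-id>:
    additionalConnectors:
      <connector-custom-id>:          # custom ID for this additional connector
        uses: <connector-id>          # optional; hypothetical placeholder, defaults to the key above
        force: true                   # optional; set to false to activate only when detected
        variables:
          <variable-name>: <value>    # omitted variables fall back to the connector's defaults
```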

For practical examples demonstrating effective use of this feature, refer to the following pages:

* [Monitoring a process command line](https://metricshub.com/usecases/monitoring-a-process-on-windows/)
* [Monitoring a service running on Linux](https://metricshub.com/usecases/monitoring-a-service-running-on-linux/).

@@ -907,10 +908,10 @@ To manage the volume of telemetry data sent to your observability platform and t
You can apply monitor inclusion or exclusion in data collection for the following scopes:

* All resources
* All the resources within a specific resource group. A resource group is a container that holds resources to be monitored and generally refers to a site or a specific location.
* A specific resource

This is done by adding the `monitorFilters` parameter in the relevant section of the `config/metricshub.yaml` file as described below:

| Filter monitors | Add monitorFilters |
|----------------------------------------------------|---------------------------------------------------------|
@@ -929,21 +930,22 @@ To obtain the monitor name:
2. Click the connector of your choice (e.g.: [WindowsOS Metrics](../connectors/windows.html))
3. Scroll down to the **Metrics** section and note down the relevant monitor **Type**.

> **Warning**: Excluding monitors may lead to missed outage detection or inconsistencies in collected data, such as inaccurate power consumption estimates or other metrics calculated by the engine. Use exclusions carefully to avoid overlooking important information.
The monitoring of critical devices such as batteries, power supplies, CPUs, fans, and memories should not be disabled.

##### Example 1: Including monitors for all resources


```yaml
monitorFilters: [ +enclosure, +fan, +power_supply ] # Include specific monitors globally
resourceGroups: ...
```

##### Example 2: Excluding monitors for all resources

```yaml
monitorFilters: [ "!volume" ] # Exclude specific monitors globally
```

##### Example 3: Including monitors for all resources within a specific resource group

```yaml
@@ -952,6 +954,7 @@ The monitoring of critical devices such as batteries, power supplies, CPUs, fans
monitorFilters: [ +enclosure, +fan, +power_supply ] # Include specific monitors for this group
resources: ...
```

##### Example 4: Excluding monitors for all resources within a specific resource group

```yaml
@@ -960,6 +963,7 @@ The monitoring of critical devices such as batteries, power supplies, CPUs, fans
monitorFilters: [ "!volume" ] # Exclude specific monitors for this group
resources: ...
```

##### Example 5: Including monitors for a specific resource

```yaml
@@ -978,7 +982,7 @@ The monitoring of critical devices such as batteries, power supplies, CPUs, fans
resources:
<resource-id>:
monitorFilters: [ "!volume" ] # Exclude specific monitors for this resource
```

#### Discovery cycle

@@ -1145,9 +1149,9 @@ By default, **MetricsHub** compresses StateSet metrics to reduce unnecessary rep

This configuration controls how StateSet metrics are reported, specifically whether zero values should be suppressed or not.

* **Supported values:**
* `none`: No compression is applied. All StateSet metrics, including zero values, are reported on every collection cycle.
* `suppressZeros` (default): **MetricsHub** compresses StateSet metrics by reporting the zero value only the first time a state transitions to zero. Subsequent reports will include only the non-zero state values.

To configure the StateSet compression level, you can apply the `stateSetCompression` setting in the following scopes:

@@ -1209,33 +1213,26 @@ hw.status{state="degraded"} 1

In this case, only the `degraded` state is reported, and the zero values for `ok` and `failed` are suppressed after the initial state transition.

## Self-Monitoring
#### Self-Monitoring

**MetricsHub** includes **self-monitoring capabilities** to track its own performance. This feature can monitor key aspects such as **job duration metrics**.
The self-monitoring feature helps you track **MetricsHub**'s performance by providing metrics like job duration. These metrics offer detailed insights into task execution times, helping identify bottlenecks or inefficiencies and optimizing performance.

### Configuration: `enableSelfMonitoring`
To enable this feature, set the `enableSelfMonitoring` parameter to `true` in the relevant section of the `config/metricshub.yaml` file as described below:

This configuration controls whether **MetricsHub** reports internal signals such as job duration metrics.

#### Supported Values

- `true` (default): Enables self-monitoring capabilities.
- `false`: Disables self-monitoring capabilities.

#### Configuration Scopes

You can configure `enableSelfMonitoring` at the following levels:
| Self-Monitoring | Set enableSelfMonitoring to true |
|----------------------------------------------------|---------------------------------------------------------|
| For all resources | In the global section (top of the file) |
| For all the resources of a specific resource group | Under the corresponding `<resource-group-name>` section |
| For a specific resource | Under the corresponding `<resource-id>` section |

1. **Global Configuration**
Applies to all monitored resources.
##### Example 1: Enabling self-monitoring for all resources

```yaml
enableSelfMonitoring: true # Set to "false" to disable
resourceGroups: ...
```

2. **Per Resource Group**
Applies to all resources within a specific group.
##### Example 2: Enabling self-monitoring for all resources of a specific resource group

```yaml
resourceGroups:
@@ -1244,8 +1241,7 @@ You can configure `enableSelfMonitoring` at the following levels:
resources: ...
```

3. **Per Resource**
Applies to an individual resource.
##### Example 3: Enabling self-monitoring for a specific resource

```yaml
resourceGroups:
@@ -1255,35 +1251,6 @@ You can configure `enableSelfMonitoring` at the following levels:
enableSelfMonitoring: true # Set to "false" to disable
```
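
The three scopes can also be combined. The fragment below is a minimal sketch that enables self-monitoring globally and switches it off for one resource. It assumes the most specific scope takes precedence, which this page does not state explicitly, so verify the behavior before relying on it.

```yaml
enableSelfMonitoring: true            # global scope: applies to all resources
resourceGroups:
  <resource-group-name>:
    resources:
      <resource-id>:
        enableSelfMonitoring: false   # assumption: overrides the global value for this resource only
```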

### Examples of Self-Monitoring Metrics

When enabled, **MetricsHub** reports the `metricshub.job.duration` metrics, for example:

```
metricshub.job.duration{job.type="discovery", monitor.type="enclosure", connector_id="HPEGen10IloREST"} 0.020
metricshub.job.duration{job.type="discovery", monitor.type="cpu", connector_id="HPEGen10IloREST"} 0.030
metricshub.job.duration{job.type="discovery", monitor.type="temperature", connector_id="HPEGen10IloREST"} 0.025
metricshub.job.duration{job.type="discovery", monitor.type="connector", connector_id="HPEGen10IloREST"} 0.015
metricshub.job.duration{job.type="collect", monitor.type="cpu", connector_id="HPEGen10IloREST"} 0.015
```

Where:
- **`job.type`**: Specifies the type of operation performed by MetricsHub.
- Possible values:
- `discovery`: Identifies and registers components.
- `collect`: Gathers telemetry data from the monitored components.
- `simple`: Executes a single straightforward task.
- `beforeAll` or `afterAll`: Runs preparatory or cleanup operations.
- **`monitor.type`**: Indicates the specific category of component being monitored.
- Examples:
- Hardware components like `cpu`, `memory`, `physical_disk`, or `disk_controller`.
- Environmental metrics like `temperature` or `battery`.
- Logical entities like `connector`.
- **`connector_id`**: The unique identifier of the connector defining the method and protocol to collect metrics for the specified component.
- Example: `"HPEGen10IloREST"` denotes the HPE Gen10 iLO REST connector.

These metrics provide granular insights into task execution times, enabling the identification of bottlenecks or inefficiencies and helping optimize monitoring performance.

#### Timeout, duration and period format

Timeouts, durations and periods are specified with the below format:
@@ -1293,4 +1260,4 @@ Timeouts, durations and periods are specified with the below format:
| s | seconds | 120s |
| m | minutes | 90m, 1m15s |
| h | hours | 1h, 1h30m |
| d | days (based on a 24-hour day) | 1d |
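
For illustration, the fragment below writes a timeout using this format. The `protocols` and `http` keys stand in for whichever `<protocol-configuration>` block you use and are assumptions of this sketch; only the value syntax is taken from the table.

```yaml
resources:
  <resource-id>:
    protocols:
      http:              # assumed protocol block; shown only to give the timeout a home
        timeout: 1m15s   # 1 minute and 15 seconds, combining the m and s units
```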