Merged
24 commits
f1f76bc
ci: add ruff and mypy to lint job (#21)
bjridicodes May 21, 2026
b9c13b3
feat(s6): LangGraph orchestration + ReAct loop scaffold (M6) (#23)
bjridicodes May 21, 2026
30f2c7b
tooling: add ruff, mypy, pre-commit, and Makefile for local lint/type…
bjridicodes May 21, 2026
abf553e
feat(s7): Agent 3 — LLM-based error classifier with REST endpoint (#28)
bjridicodes May 22, 2026
8e5c4d5
feat(m7): DOD test incident setup — CMDB, log fixtures, cluster hosts…
bjridicodes May 22, 2026
51a0637
docs: update conf_template with cdp.log_dirs option, set M7 to in pro…
bjridicodes May 22, 2026
ec7567a
feat(s8): ReAct loop trigger — cross-service log requests (#32)
bjridicodes May 22, 2026
36bb480
docs: S8 ReAct loop — README and architecture docs update (#33)
bjridicodes May 22, 2026
5b4af9a
feat(infra): UC testing Terraform modules — UC1 Hadoop, UC2 Dataproc,…
bjridicodes Jun 6, 2026
17c484a
docs: README update for Phase 1.5 + CI permissions fix (#81)
bjridicodes Jun 6, 2026
6576703
feat(observability): Phase 1.5 S1 — structured logging (#88)
bjridicodes Jun 10, 2026
0dc0989
feat(monitoring): P1.5 S2 backend — run stores, monitoring API, opera…
bjridicodes Jun 12, 2026
ae39382
feat(dashboard): P1.5 S2 — Alpine.js ops dashboard (#90)
bjridicodes Jun 12, 2026
d6bbb27
feat(s3): Docker + config + LLM portability — P1.5 S3 (#91)
bjridicodes Jun 16, 2026
d7cbf70
feat(kb): add UC1 fixture runbooks for cdp-master, cdp-bus, cdp-utili…
bjridicodes Jun 16, 2026
7964d9a
feat(s4): testing infrastructure wiring — KB, connectors, config, cor…
bjridicodes Jun 16, 2026
301e334
docs(readme): S4 status in progress
bjridicodes Jun 16, 2026
18c8371
fix(kb): remove cluster token from dataproc_job.md to resolve score tie
bjridicodes Jun 16, 2026
3b9e708
fix(kb): fully purge cluster token from dataproc_job.md
bjridicodes Jun 16, 2026
c2b0922
refactor(kb): split KB into resource_kb and analyser_kb; per-cluster …
bjridicodes Jun 17, 2026
7edbce3
feat(infra): Azure UC equivalents + infra directory restructure
bjridicodes Jun 17, 2026
a0c4fec
docs(readme): update Agent 2 connectors, repo structure, S4 roadmap f…
bjridicodes Jun 17, 2026
0c150b4
feat(s4): UC1 smoke test PASS — Azure CDP pipeline validated end-to-end
bjridicodes Jun 17, 2026
5a946f3
Merge branch 'main' into feat/s4-testing-infra
bayrem Jun 18, 2026
26 changes: 25 additions & 1 deletion README.md
@@ -612,6 +612,30 @@ Phase 1 is complete when all of the following pass on 10 consecutive test incide

---

## S4 Smoke Test Results (2026-06-17)

First live end-to-end validation against real (non-local) infrastructure. Agent 1 was stubbed;
Agents 2 → 3 → 4 ran live against Azure VMs.

| Use Case | Platform | Status | Notes |
|---|---|---|---|
| UC1 — Hadoop on-prem (CDP) | Azure VM (`Standard_D2s_v3`, West Europe) | ✅ **PASS** | 1 log line retrieved, `disk/HIGH` (0.93), Slack notified |
| UC2 — Managed Spark | — | ⏳ Deferred | GCP billing blocked; Azure HDInsight not yet deployed |
| UC3 — GCP native | — | ⏳ Deferred | Same blocker as UC2 |

**UC1 result:**
```
log_lines: 1 (DISK_FAILURE WARN on /var/log/hadoop/hdfs/)
root_cause: disk
confidence: HIGH (0.93)
notification_sent: True
```

Full report: [documentation/reports/s4_uc1_smoke_test_2026-06-17.md](documentation/reports/s4_uc1_smoke_test_2026-06-17.md)
Test script: [scripts/smoke_uc1.py](scripts/smoke_uc1.py)

---

## Roadmap

| Phase | Milestone | Status |
@@ -630,7 +654,7 @@ Phase 1 is complete when all of the following pass on 10 consecutive test incide
| Phase 1.5 | S1: Structured logging — structlog, `run_id`, lifecycle events, RunRecord | ✅ Done |
| Phase 1.5 | S2: Monitoring foundation — run store, REST API, Alpine.js dashboard, mode scaffold | ✅ Done |
| Phase 1.5 | S3: Docker + `ARIA_CONFIG_PATH` + `VertexAILLMClient` + LLM provider DI (incl. #84 security fix) | ✅ Done |
-| Phase 1.5 | S4: Testing infrastructure — UC1/UC2/UC3 cluster wiring (GCP + Azure), KB runbooks, AzureLogConnector wired | 🔄 In progress |
+| Phase 1.5 | S4: Testing infrastructure — UC1/UC2/UC3 cluster wiring (GCP + Azure), KB runbooks, AzureLogConnector wired | ✅ Done (UC1 smoke PASS; UC2/UC3 deferred) |
| Phase 1.5 | S5: Round 2 acceptance testing — 30 incidents on UC1 + UC2 real infrastructure | 🔜 Planned |
| Phase 1.5 | S6: GCP native connectors — BQ, Cloud Functions, Pub/Sub, GCS | 🔜 Planned |
| Phase 2 | Human validation gate + write-back to ServiceNow | 💡 Planned |
181 changes: 181 additions & 0 deletions documentation/reports/s4_uc1_smoke_test_2026-06-17.md
@@ -0,0 +1,181 @@
# ARIA S4 — UC1 Smoke Test Report
**Sprint 4 / Phase 1.5 Testing Infrastructure**
**Date:** 2026-06-17
**Prepared by:** ARIA Engineering
**Status:** UC1 PASSED — UC2/UC3 Deferred

---

## 1. Objective

Validate that the ARIA Agent 2 → Agent 3 → Agent 4 pipeline can run end-to-end against a
real, live infrastructure target — not a mock or stub. This is the first test of the pipeline
against actual VMs outside the local development environment.

Agent 1 (ServiceNow incident reader) was stubbed to isolate infrastructure variables. The
scope was: can Agent 2 SSH into a real VM, retrieve real logs, feed them to Agent 3 for LLM
classification, and trigger a real Slack notification via Agent 4?

---

## 2. Infrastructure

### 2.1 Platform

GCP billing was blocked (`OR_BACR2_44`, an account-level billing error rather than a code
issue), so the UC1 smoke test was run on **Azure** using a $200 MS AI Fest credit.

### 2.2 Azure UC1 Cluster

| Component | Detail |
|---|---|
| Provider | Microsoft Azure (West Europe) |
| Resource Group | `aria-uc1-rg` |
| Terraform | `infra/terraform/uc_testing/azure/uc1-hadoop-onprem/` |
| VM type | `Standard_D2s_v3` (2 vCPU, 8 GB RAM) — B2ms unavailable in West Europe |
| OS | Debian 11 |
| Auth | RSA 4096 SSH key (`aria` user) |

**Nodes provisioned (2 of 5 — vCPU quota limit of 4 cores on free trial):**

| Node | Public IP | Role |
|---|---|---|
| cdp-master-01 | REDACTED-IP | HDFS NameNode, YARN ResourceManager |
| cdp-bus-01 | (internal only) | Kafka, ZooKeeper, NiFi |

cdp-data-01, cdp-data-02, and cdp-utility-01 were not provisioned (quota exceeded). Two nodes
are sufficient for the UC1 smoke test because all log extraction targets the master node.
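
The quota arithmetic behind the two-node deployment can be checked with a short sketch. It assumes every node uses `Standard_D2s_v3` (2 vCPU each, per the table above) and that the limit is the 4 cores stated above; the function name `max_nodes` is illustrative and not part of the ARIA codebase.

```python
def max_nodes(vcpu_quota: int, vcpus_per_node: int) -> int:
    """Return how many identically sized nodes fit within a vCPU quota."""
    return vcpu_quota // vcpus_per_node


# Values taken from this report: 4-core quota, 2 vCPU per node, 5 nodes planned.
PLANNED_NODES = 5
VCPUS_PER_NODE = 2
fit = max_nodes(vcpu_quota=4, vcpus_per_node=VCPUS_PER_NODE)
required = PLANNED_NODES * VCPUS_PER_NODE
print(f"planned={PLANNED_NODES}, required_vcpus={required}, nodes_that_fit={fit}")
```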

### 2.3 Log Injection

A synthetic DISK_FAILURE log entry was injected into `/var/log/hadoop/hdfs/` on cdp-master-01
via `az vm run-command invoke` (Azure control plane — bypasses NSG):

```
2026-06-17 16:45:00,000 WARN org.apache.hadoop.hdfs.server.namenode.NameNode:
DISK_FAILURE detected — block corruption on /data/dfs/dn — available storage below threshold
```
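
Because Agent 2 only looks at a recent time window, the injected entry needs a current timestamp. A minimal sketch of building such a line is shown here, assuming a single-line `YYYY-MM-DD HH:MM:SS,mmm LEVEL logger: message` layout and naive local time; `format_log_line` is a hypothetical helper, not the command that was actually run.

```python
from datetime import datetime

# Logger name copied from the injected entry above.
LOGGER = "org.apache.hadoop.hdfs.server.namenode.NameNode"


def format_log_line(ts: datetime, level: str, message: str) -> str:
    """Render one log line as 'YYYY-MM-DD HH:MM:SS,mmm LEVEL logger: message'."""
    millis = ts.microsecond // 1000
    return f"{ts:%Y-%m-%d %H:%M:%S},{millis:03d} {level} {LOGGER}: {message}"


# Use the current time so the entry falls inside the extraction window.
line = format_log_line(datetime.now(), "WARN", "DISK_FAILURE detected")
print(line)
```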

---

## 3. Test Scope

| Agent | Status | Notes |
|---|---|---|
| Agent 1 (ServiceNow) | **Stubbed** | `IncidentMetadata` hardcoded in `scripts/smoke_uc1.py` |
| Agent 2 (LogExtractor) | **Live** | SSH into REDACTED-IP via `SSHLogConnector` |
| Agent 3 (Classifier) | **Live** | `claude_code` LLM provider (local Claude Code CLI) |
| Agent 4 (Notifier) | **Live** | Real Slack message to `#aria-notifications` |

**Vault:** `EnvVarVault`, with secrets passed as environment variables. Infisical was bypassed
for this run because re-login first required configuring SMTP on the self-hosted instance;
Infisical auth has since been restored for future runs.

---

## 4. Test Results

### 4.1 Summary

| Use Case | Platform | Result | Notes |
|---|---|---|---|
| UC1 — Hadoop on-prem (CDP) | Azure VM (SSH) | **PASS** | Full A2→A3→A4 chain |
| UC2 — Managed Spark (Dataproc/HDInsight) | — | **Deferred** | GCP blocked; HDInsight not deployed |
| UC3 — GCP native | — | **Deferred** | GCP blocked |

### 4.2 UC1 Detail

```
=== ARIA UC1 SMOKE TEST (Agent 1 stubbed, vault bypassed) ===

[1/3] Agent 2 — SSH log extraction from REDACTED-IP...
OK — 1 line(s), confidence=high

[2/3] Agent 3 — classification...
OK — error_class=disk, band=HIGH

[3/3] Agent 4 — notification...
notification_sent=True

=== RESULT ===
log_lines: 1
root_cause: disk
notification_sent: True
error: none

PASS
```

**Classification detail:**
- `error_class`: `disk`
- `confidence_band`: `HIGH` (0.93)
- Slack notification: delivered to `#aria-notifications`

### 4.3 Acceptance Criteria Assessment (UC1)

| AC | Criterion | Result |
|---|---|---|
| AC-02 | Affected resource correctly identified | ✅ `cdp-master-01` resolved from metadata |
| AC-03 | ≥ 1 relevant log line returned | ✅ 1 line (DISK_FAILURE WARN) |
| AC-04 | Classification label correct | ✅ `disk` — matches injected fault |
| AC-05 | Confidence score present | ✅ `HIGH` (0.93) |
| AC-06 | Slack notification delivered | ✅ Message sent |

AC-01 (Agent 1 latency) not tested — Agent 1 was stubbed.
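
The assessment above can be expressed as a small check over the values printed by the smoke test. This is a sketch only: `SmokeResult` and `assess` are hypothetical names introduced for illustration, and the pass conditions are inferred from the criterion wording in the table rather than from ARIA's acceptance tooling.

```python
from dataclasses import dataclass


@dataclass
class SmokeResult:
    """Hypothetical container for the values reported in the UC1 summary."""

    affected_ci: str
    log_lines: int
    error_class: str
    confidence_band: str
    notification_sent: bool


def assess(result: SmokeResult, expected_ci: str, expected_class: str) -> dict[str, bool]:
    """Map each tested acceptance criterion to a pass/fail flag."""
    return {
        "AC-02": result.affected_ci == expected_ci,
        "AC-03": result.log_lines >= 1,
        "AC-04": result.error_class == expected_class,
        "AC-05": bool(result.confidence_band),
        "AC-06": result.notification_sent,
    }


# Values from the UC1 run described in this report.
uc1 = SmokeResult("cdp-master-01", 1, "disk", "HIGH", True)
checks = assess(uc1, expected_ci="cdp-master-01", expected_class="disk")
print(checks)
```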

---

## 5. Issues Encountered and Resolutions

| Issue | Root Cause | Resolution |
|---|---|---|
| Terraform provider registration timeout | azurerm auto-registers all providers on first run | `skip_provider_registration = true`; manually registered Microsoft.Compute, .Network, .Storage |
| `Standard_B2ms` unavailable | Azure capacity restrictions in West Europe | Switched to `Standard_D2s_v3` |
| `Standard_B2s` unavailable | Same capacity issue | Skipped; went directly to D2s_v3 |
| 4 vCPU quota limit | Free trial quota | Accepted 2-VM deployment; master node sufficient for smoke test |
| ed25519 SSH key rejected | Key type was not accepted when provisioning the VM in this setup; RSA was accepted | Regenerated key as RSA 4096 |
| SSH timeout from local machine | NSG only had Cloud Shell IP `REDACTED-IP/32` | Added local IP `REDACTED-IP/32` via `az network nsg rule create` |
| Infisical session expired | Self-hosted SMTP not configured | Set up Gmail SMTP App Password, recreated container, updated CLI to 0.43.96 |
| `datetime` offset-aware mismatch | `datetime.now(timezone.utc)` vs naive log timestamps | Changed stub to `datetime.now()` |
| LLM API credits depleted | Anthropic direct API credits exhausted | Switched `conf.yaml` to `provider: claude_code` (local CLI, no API cost) |
| Log outside 30-minute time window | Log injected hours earlier in the session | Re-injected via `az vm run-command invoke` with current timestamp |

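Two of the issues above (the naive/aware `datetime` mismatch and the log falling outside the 30-minute window) come down to how timestamps are compared. The sketch below illustrates both, assuming the window extends backwards from `opened_at`; `in_window` is a hypothetical helper and the real Agent 2 window logic may differ.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=30)  # window length quoted in this report


def in_window(log_ts: datetime, opened_at: datetime, window: timedelta = WINDOW) -> bool:
    """True when log_ts lies within `window` before opened_at.

    Both arguments must be naive, or both timezone-aware.
    """
    return opened_at - window <= log_ts <= opened_at


naive_log = datetime(2026, 6, 17, 16, 45)
print(in_window(naive_log, datetime(2026, 6, 17, 17, 0)))  # log 15 minutes old
print(in_window(naive_log, datetime(2026, 6, 17, 20, 0)))  # log injected hours earlier

# Comparing a naive log timestamp with an aware opened_at raises TypeError.
try:
    in_window(naive_log, datetime.now(timezone.utc))
except TypeError as exc:
    print(f"naive vs aware comparison failed: {exc}")
```
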
---

## 6. Known Gaps

| Gap | Impact | Plan |
|---|---|---|
| Agent 1 stubbed | Full Agent 1 → 4 pipeline not tested end-to-end | S5 acceptance testing will use live ServiceNow |
| `CDP_SSH_USER` in Infisical set to `aria-cdp` | Hardcoded `aria` in smoke test | Update Infisical secret to `aria` before S5 |
| UC2/UC3 deferred | Only 1 of 3 use cases validated | Validate UC2/UC3 in S5 once GCP billing is resolved or Azure HDInsight is deployed |
| 2 of 5 VMs deployed | Partial cluster | Sufficient for smoke test; full cluster requires Azure quota increase |

---

## 7. Test Script

`scripts/smoke_uc1.py` — bypasses Agent 1 and Infisical, reads secrets from environment
variables, tests Agent 2 → Agent 3 → Agent 4 directly.

```bash
# Run with secrets from environment
export CDP_SSH_KEY="$(cat ~/.ssh/aria_uc1_key)"
export CDP_HOST_KEY="ssh-rsa AAAA..."
PYTHONPATH=/home/brm/projects/aria python scripts/smoke_uc1.py

# Run with Infisical (now that login is restored)
infisical run --env=dev -- python scripts/smoke_uc1.py
```
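
Before either invocation it can help to confirm that the two required secrets are present in the environment. A minimal preflight sketch follows; `missing_vars` is a hypothetical helper and is not part of `scripts/smoke_uc1.py`.

```python
import os

# Secret names taken from the smoke test's documented requirements.
REQUIRED_VARS = ("CDP_SSH_KEY", "CDP_HOST_KEY")


def missing_vars(env: dict[str, str], required: tuple[str, ...] = REQUIRED_VARS) -> list[str]:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


missing = missing_vars(dict(os.environ))
if missing:
    print("missing: " + ", ".join(missing))
else:
    print("all required variables are set")
```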

---

## 8. Conclusion

UC1 smoke test **PASSED**. The ARIA pipeline can:
- Establish SSH to a real remote VM using key-based auth
- Retrieve log files from CDP-compatible directory structures
- Classify a DISK_FAILURE event correctly at HIGH confidence
- Deliver a Slack notification end-to-end

This validates the core Agent 2 → 3 → 4 chain on real infrastructure. UC2 and UC3 are
deferred to S5 pending GCP billing resolution or Azure HDInsight deployment.
5 changes: 3 additions & 2 deletions infra/terraform/uc_testing/azure/uc1-hadoop-onprem/main.tf
@@ -9,7 +9,8 @@ terraform {
}

provider "azurerm" {
-  subscription_id = var.subscription_id
+  subscription_id            = var.subscription_id
+  skip_provider_registration = true
  features {}
}

@@ -150,7 +151,7 @@ resource "azurerm_linux_virtual_machine" "nodes" {
  name                = each.key
  location            = azurerm_resource_group.uc1.location
  resource_group_name = azurerm_resource_group.uc1.name
-  size                = "Standard_B2ms" # 2 vCPU, 8 GB RAM — equivalent to GCP e2-standard-2
+  size                = "Standard_D2s_v3" # 2 vCPU, 8 GB RAM — B2ms capacity unavailable in westeurope

  admin_username = "aria"
  # SSH key auth only — no password
155 changes: 155 additions & 0 deletions scripts/smoke_uc1.py
@@ -0,0 +1,155 @@
"""UC1 smoke test — bypasses Agent 1 and Infisical, tests Agent 2 → Agent 3 → Agent 4 live.

Reads secrets from environment variables directly (EnvVarVault) — no Infisical needed.

Required env vars:
    CDP_SSH_KEY   PEM-encoded RSA private key for aria@<master-ip>
    CDP_HOST_KEY  SSH host public key of the master VM: "ssh-rsa AAAA..."

How to get them from Azure Cloud Shell:
    cat ~/.ssh/aria_uc1_key                                      # → CDP_SSH_KEY
    ssh-keyscan -t rsa REDACTED-IP 2>/dev/null | cut -d' ' -f2-  # → CDP_HOST_KEY

Run (no Infisical):
    export CDP_SSH_KEY="$(cat /path/to/aria_uc1_key)"
    export CDP_HOST_KEY="ssh-rsa AAAA..."
    python scripts/smoke_uc1.py

Run (with Infisical, once login is restored):
    infisical run --env=development -- python scripts/smoke_uc1.py
"""

import sys
from datetime import datetime

from api.dependencies import get_agent3, get_agent4
from core.interfaces.log_store import LogStoreInterface
from core.models import IncidentMetadata, PipelineState, PlatformTag, Priority
from implementations.clusters.onprem.log_connector import SSHLogConnector
from implementations.vault.envvar import EnvVarVault

# ── Config ────────────────────────────────────────────────────────────────────

MASTER_IP = "REDACTED-IP"
INCIDENT_NUMBER = "INC_UC1_SMOKE"

# Log dirs that match where we injected the synthetic log entry.
# /var/log/hadoop searched recursively — covers /var/log/hadoop/hdfs/
_LOG_DIRS = [
    "/var/log/hadoop",
    "/var/log/hadoop-hdfs",
    "/var/log/hadoop-yarn",
]


# ── Stub ─────────────────────────────────────────────────────────────────────


def _stub_metadata() -> IncidentMetadata:
    """Hardcoded IncidentMetadata pointing at the live UC1 master VM.

    opened_at = now so the 30-minute Agent 2 window covers logs just injected.
    """
    return IncidentMetadata(
        incident_number=INCIDENT_NUMBER,
        caller="smoke-test",
        short_description="DISK_FAILURE on HDFS namenode",
        long_description=(
            "Synthetic incident for UC1 smoke test. "
            "DISK_FAILURE detected on cdp-master-01 — block corruption reported."
        ),
        priority=Priority.P1,
        state="New",
        affected_ci="cdp-master-01",
        affected_ci_ip=MASTER_IP,
        assigned_group="data-ops",
        opened_at=datetime.now(),
        platform_tag=PlatformTag.CDP,
    )


# ── Test ─────────────────────────────────────────────────────────────────────


def main() -> None:
    print("=== ARIA UC1 SMOKE TEST (Agent 1 stubbed, vault bypassed) ===\n")

    # Vault reads CDP_SSH_KEY and CDP_HOST_KEY directly from environment.
    vault = EnvVarVault()

    # Build Agent 2 connector directly so we control ssh_user and log_dirs.
    # Default config uses 'hadoop' as ssh_user; our Azure VMs use 'aria'.
    from core.agents.log_extractor import LogExtractorAgent

    connector_registry: dict[PlatformTag, LogStoreInterface] = {
        PlatformTag.CDP: SSHLogConnector(
            vault=vault,
            ssh_key_secret="CDP_SSH_KEY",
            ssh_user="aria",
            log_dirs=_LOG_DIRS,
            host_key_secret="CDP_HOST_KEY",
        ),
    }
    agent2 = LogExtractorAgent(connector_registry=connector_registry)
    agent3 = get_agent3()
    agent4 = get_agent4()

    state = PipelineState(
        incident_number=INCIDENT_NUMBER,
        incident_metadata=_stub_metadata(),
    )

    # ── Agent 2 ──────────────────────────────────────────────────────────────
    print(f"[1/3] Agent 2 — SSH log extraction from {MASTER_IP}...")
    state = agent2.run(state)
    if state.error:
        print(f" FAIL: {state.error}")
        sys.exit(1)

    log_lines = state.log_result.log_lines if state.log_result else []
    confidence = state.log_result.confidence.value if state.log_result else "none"
    print(f" OK — {len(log_lines)} line(s), confidence={confidence}")
    for line in log_lines[:3]:
        print(f" {line.timestamp} [{line.level}] {line.message[:100]}")

    # ── Agent 3 ──────────────────────────────────────────────────────────────
    print("\n[2/3] Agent 3 — classification...")
    state = agent3.run(state)
    if state.classification:
        cls = state.classification
        print(f" OK — error_class={cls.error_class}, band={cls.confidence_band.value}")
    else:
        print(f" WARN — no classification (error={state.error})")

    # ── Agent 4 ──────────────────────────────────────────────────────────────
    print("\n[3/3] Agent 4 — notification...")
    state = agent4.run(state)
    print(f" notification_sent={state.notification_sent}")
    if state.error:
        print(f" error={state.error}")

    # ── Summary ──────────────────────────────────────────────────────────────
    print("\n=== RESULT ===")
    print(f" log_lines: {len(log_lines)}")
    cls_label = state.classification.error_class if state.classification else "none"
    print(f" root_cause: {cls_label}")
    print(f" notification_sent: {state.notification_sent}")
    print(f" error: {state.error or 'none'}")

    failed = []
    if len(log_lines) < 1:
        failed.append("no log lines returned from master VM")
    # A run without a classification must not pass: the A2→A3→A4 chain is incomplete.
    if not state.classification:
        failed.append("no classification produced by Agent 3")
    if not state.notification_sent:
        failed.append("notification not sent")

    if failed:
        print("\nFAIL")
        for reason in failed:
            print(f" - {reason}")
        sys.exit(1)

    print("\nPASS")


if __name__ == "__main__":
    main()