13 changes: 8 additions & 5 deletions README.md
@@ -172,8 +172,9 @@ Time window: `opened_at − 30 min`. On empty result, retries once with a 60-min
**Implemented connectors:**
- `SSHLogConnector` (`implementations/clusters/onprem/`) — provider-agnostic SSH connector for any on-premise cluster (CDP, HDP, Oracle RAC, MapR, etc.). Log dirs and SSH credentials are constructor params.
- `GCPLogConnector` (`implementations/clusters/cloud/gcp/`) — Cloud Logging API with vault-backed service account.
+- `AzureLogConnector` (`implementations/clusters/cloud/azure/`) — Azure Monitor Log Analytics workspace, KQL-based queries, `DefaultAzureCredential` auth. Workspace ID resolved from vault at query time.

-**Cloud stubs:** Databricks, AWS EMR, Azure Monitor — raise `NotImplementedError`, full implementations planned.
+**Cloud stubs:** Databricks, AWS EMR — raise `NotImplementedError`, full implementations planned.
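
To make the Azure connector's query shape concrete, here is a minimal Python sketch of a time-windowed KQL query with one widened retry. It assumes the window ends at `opened_at`, that the retry widens the lookback to 60 minutes (the hunk header above is truncated after "60-min", so this is a guess), and that logs live in a `Syslog` table. `build_kql`, `query_with_retry`, and the `run_query` callable are hypothetical names for illustration, not the connector's actual API.

```python
from datetime import datetime, timedelta, timezone


def build_kql(table: str, opened_at: datetime, lookback_min: int = 30, limit: int = 500) -> str:
    """Return a KQL query covering [opened_at - lookback_min, opened_at] in UTC."""
    end = opened_at.astimezone(timezone.utc)  # expects a timezone-aware datetime
    start = end - timedelta(minutes=lookback_min)
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # ISO 8601 timestamps are valid KQL datetime literals
    return (
        f"{table}\n"
        f"| where TimeGenerated between (datetime({start.strftime(fmt)}) .. datetime({end.strftime(fmt)}))\n"
        f"| order by TimeGenerated asc\n"
        f"| take {limit}"
    )


def query_with_retry(run_query, table: str, opened_at: datetime, windows=(30, 60)):
    """Try each lookback window in turn and stop at the first non-empty result."""
    for minutes in windows:
        rows = run_query(build_kql(table, opened_at, minutes))
        if rows:
            return rows
    return []  # every window came back empty
```

Passing the query runner in as a callable keeps the window logic testable without a Log Analytics workspace.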

### Agent 3 — Classifier ✅ Implemented

@@ -484,9 +485,9 @@ aria/
│ │ ├── onprem/ # SSHLogConnector — any bare-metal/VM cluster (CDP, HDP, Oracle RAC, MapR, etc.)
│ │ └── cloud/
│ │ ├── gcp/ # GCPLogConnector — Cloud Logging API
+│ │ ├── azure/ # ✅ AzureLogConnector — Log Analytics workspace (Azure Monitor)
│ │ ├── databricks/ # stub — planned
-│ │ ├── aws/ # stub — planned
-│ │ └── azure/ # stub — planned
+│ │ └── aws/ # stub — planned
│ ├── itsm/
│ │ └── servicenow/ # ServiceNowConnector
│ ├── coms/
@@ -508,7 +509,9 @@ aria/
├── documentation/ # MkDocs site source (mkdocs serve)
├── infra/
│ └── terraform/
-│ └── uc_testing/ # UC1 (Hadoop VMs) · UC2 (Dataproc) · UC3 (GCP native)
+│ └── uc_testing/
+│ ├── gcp/ # UC1 (Hadoop VMs) · UC2 (Dataproc) · UC3 (GCP native)
+│ └── azure/ # UC1 (Hadoop VMs) · UC2 (HDInsight) · UC3 (Azure native)
├── ml/ # Datasets, few-shot prompt assets, evaluation scripts
├── tests/acceptance/ # ground_truth.json · round results · AC reports
├── Dockerfile # P1.5 S3 — python:3.11-slim, non-root, single stage
@@ -627,7 +630,7 @@ Phase 1 is complete when all of the following pass on 10 consecutive test incide
| Phase 1.5 | S1: Structured logging — structlog, `run_id`, lifecycle events, RunRecord | ✅ Done |
| Phase 1.5 | S2: Monitoring foundation — run store, REST API, Alpine.js dashboard, mode scaffold | ✅ Done |
| Phase 1.5 | S3: Docker + `ARIA_CONFIG_PATH` + `VertexAILLMClient` + LLM provider DI (incl. #84 security fix) | ✅ Done |
-| Phase 1.5 | S4: Testing infrastructure — UC1/UC2/UC3 cluster wiring, KB runbooks, CMDB validation | 🔄 In progress |
+| Phase 1.5 | S4: Testing infrastructure — UC1/UC2/UC3 cluster wiring (GCP + Azure), KB runbooks, AzureLogConnector wired | 🔄 In progress |
| Phase 1.5 | S5: Round 2 acceptance testing — 30 incidents on UC1 + UC2 real infrastructure | 🔜 Planned |
| Phase 1.5 | S6: GCP native connectors — BQ, Cloud Functions, Pub/Sub, GCS | 🔜 Planned |
| Phase 2 | Human validation gate + write-back to ServiceNow | 💡 Planned |
19 changes: 15 additions & 4 deletions api/dependencies.py
@@ -21,6 +21,7 @@
from core.interfaces.vault import VaultInterface
from core.models import PlatformTag
from core.orchestrator.pipeline import ARIAPipeline
+from implementations.clusters.cloud.azure.log_connector import AzureLogConnector
from implementations.clusters.cloud.gcp.log_connector import GCPLogConnector
from implementations.clusters.onprem.log_connector import SSHLogConnector
from implementations.itsm.servicenow.connector import ServiceNowConnector
@@ -75,7 +76,11 @@ def _get_vault() -> VaultInterface:
from implementations.vault.gcp_secret_manager import GCPSecretManagerVault

return GCPSecretManagerVault.from_env()
-# hashicorp, aws, azure already have implementations — wire them here as they get used.
+if backend == "azure":
+from implementations.vault.azure_kv import AzureKeyVaultClient
+
+return AzureKeyVaultClient.from_env()
+# hashicorp, aws — implementations exist, wire when needed.
return EnvVarVault()
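
The backend selection in this hunk could also be written as a lookup table, which keeps the environment-variable fallback in one place as more backends get wired. This is a minimal sketch under stated assumptions: `ARIA_VAULT_BACKEND`, `get_secret`, `FakeAzureVault`, and the factory table are hypothetical names for illustration, and `EnvVarVault` here is a stand-in rather than the project's class.

```python
import os
from typing import Callable, Dict, Optional


class EnvVarVault:
    """Stand-in fallback vault that reads secrets from environment variables."""

    def get_secret(self, name: str) -> Optional[str]:
        return os.environ.get(name)


class FakeAzureVault(EnvVarVault):
    """Hypothetical placeholder for a cloud-backed vault client."""


def _azure_factory() -> EnvVarVault:
    # A real factory would import its SDK-backed client here, so the
    # dependency is only loaded when this backend is actually selected.
    return FakeAzureVault()


# Backend name -> zero-argument factory (hypothetical registry).
_FACTORIES: Dict[str, Callable[[], EnvVarVault]] = {"azure": _azure_factory}


def get_vault(backend: Optional[str] = None) -> EnvVarVault:
    """Return the vault for `backend`, or EnvVarVault when it is not wired."""
    name = (backend or os.environ.get("ARIA_VAULT_BACKEND", "")).lower()
    factory = _FACTORIES.get(name)
    return factory() if factory else EnvVarVault()
```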


@@ -231,9 +236,9 @@ def get_pipeline() -> "ARIAPipeline":
def get_agent2() -> LogExtractorAgent:
"""Build and cache the Agent 2 (Log Extractor) instance.

-Registers CDP (SSH) and GCP (Cloud Logging) connectors. Missing credentials
-are non-fatal at construction — connectors resolve secrets at query time
-and return empty results gracefully if credentials are absent.
+Registers CDP (SSH), GCP (Cloud Logging), and Azure (Log Analytics) connectors.
+Missing credentials are non-fatal at construction — connectors resolve secrets
+at query time and return empty results gracefully if credentials are absent.
Injects an LLM client for query planning if ARIA_AGENT2_MODEL is set.
"""
vault = _get_vault()
@@ -254,6 +259,12 @@ def get_agent2() -> LogExtractorAgent:
vault,
resource_types=["cloud_dataproc_cluster", "cloud_dataproc_job"],
),
+# Azure Log Analytics workspace — workspace ID resolved from AZURE_LOG_WORKSPACE_ID
+# secret at query time. Covers UC2 (HDInsight) and UC3 (Azure-native) incidents.
+PlatformTag.AZURE: AzureLogConnector(
+vault,
+workspace_id_secret="AZURE_LOG_WORKSPACE_ID",
+),
}
llm = None
model = _resolve_model("2")
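
The docstring's contract (construction never fails on missing credentials, and queries return empty results when the secret is absent) can be sketched as follows. This is an illustration only: `DictVault`, `LazyWorkspaceConnector`, `get_secret`, and `fetch_logs` are hypothetical names, and the real `AzureLogConnector` may be structured differently.

```python
from typing import Dict, List, Optional


class DictVault:
    """Hypothetical in-memory vault used only for this sketch."""

    def __init__(self, secrets: Dict[str, str]) -> None:
        self._secrets = secrets

    def get_secret(self, name: str) -> Optional[str]:
        return self._secrets.get(name)


class LazyWorkspaceConnector:
    """Stores only the secret name; the value is looked up on every query."""

    def __init__(self, vault: DictVault, workspace_id_secret: str) -> None:
        # Nothing is resolved here, so a missing secret cannot break startup.
        self._vault = vault
        self._secret_name = workspace_id_secret

    def fetch_logs(self, query: str) -> List[dict]:
        workspace_id = self._vault.get_secret(self._secret_name)
        if not workspace_id:
            return []  # absent credentials degrade to an empty result
        # A real connector would call the log backend here; this echoes instead.
        return [{"workspace": workspace_id, "query": query}]
```

Resolving at query time also means a secret rotated in the vault is picked up without rebuilding the cached agent.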
Empty file.
190 changes: 190 additions & 0 deletions infra/terraform/uc_testing/azure/uc1-hadoop-onprem/main.tf
@@ -0,0 +1,190 @@
terraform {
required_version = ">= 1.5"
required_providers {
azurerm = {
source = "hashicorp/azurerm"
version = "~> 3.0"
}
}
}

provider "azurerm" {
subscription_id = var.subscription_id
features {}
}

# ── Resource Group ─────────────────────────────────────────────────────────────
resource "azurerm_resource_group" "uc1" {
name = var.resource_group_name
location = var.location
}

# ── Virtual Network ────────────────────────────────────────────────────────────
resource "azurerm_virtual_network" "uc1" {
name = "aria-uc1-vnet"
location = azurerm_resource_group.uc1.location
resource_group_name = azurerm_resource_group.uc1.name
address_space = ["10.10.0.0/16"]
}

resource "azurerm_subnet" "uc1" {
name = "aria-uc1-subnet"
resource_group_name = azurerm_resource_group.uc1.name
virtual_network_name = azurerm_virtual_network.uc1.name
address_prefixes = ["10.10.0.0/24"]
}

# ── Network Security Group ─────────────────────────────────────────────────────
resource "azurerm_network_security_group" "uc1" {
name = "aria-uc1-nsg"
location = azurerm_resource_group.uc1.location
resource_group_name = azurerm_resource_group.uc1.name

# Allow SSH from operator workstation (needed for key validation and log injection)
security_rule {
name = "allow-ssh-from-operator"
priority = 100
direction = "Inbound"
access = "Allow"
protocol = "Tcp"
source_port_range = "*"
destination_port_range = "22"
source_address_prefix = var.allowed_ssh_cidr
destination_address_prefix = "*"
}

# Allow all inbound traffic within the subnet (inter-node Hadoop communication)
security_rule {
name = "allow-internal-subnet"
priority = 200
direction = "Inbound"
access = "Allow"
protocol = "*"
source_port_range = "*"
destination_port_range = "*"
source_address_prefix = "10.10.0.0/24"
destination_address_prefix = "10.10.0.0/24"
}
}

resource "azurerm_subnet_network_security_group_association" "uc1" {
subnet_id = azurerm_subnet.uc1.id
network_security_group_id = azurerm_network_security_group.uc1.id
}

# ── Public IP for master node only ─────────────────────────────────────────────
resource "azurerm_public_ip" "master" {
name = "aria-uc1-master-pip"
location = azurerm_resource_group.uc1.location
resource_group_name = azurerm_resource_group.uc1.name
allocation_method = "Static"
sku = "Standard"
}

# ── Network Interfaces ─────────────────────────────────────────────────────────
# Azure separates NIC from VM — one NIC per node.
resource "azurerm_network_interface" "nodes" {
for_each = local.nodes
name = "aria-uc1-${each.key}-nic"
location = azurerm_resource_group.uc1.location
resource_group_name = azurerm_resource_group.uc1.name

ip_configuration {
name = "internal"
subnet_id = azurerm_subnet.uc1.id
private_ip_address_allocation = "Dynamic"
# Only the master node gets a public IP
public_ip_address_id = each.key == "cdp-master-01" ? azurerm_public_ip.master.id : null
}
}

# ── Node definitions ────────────────────────────────────────────────────────────
locals {
nodes = {
"cdp-master-01" = { role = "hdfs-namenode,yarn-resourcemanager,hiveserver2", disk_gb = 64 }
"cdp-data-01" = { role = "hdfs-datanode,yarn-nodemanager", disk_gb = 128 }
"cdp-data-02" = { role = "hdfs-datanode,yarn-nodemanager", disk_gb = 128 }
"cdp-utility-01" = { role = "hive-metastore,spark-history,oozie,hue", disk_gb = 64 }
"cdp-bus-01" = { role = "kafka,zookeeper,nifi", disk_gb = 64 }
}

# cloud-init script mirrors the GCP startup script logic:
# installs Java 11, Hadoop 3.3.6, creates CDP-compatible log directory structure.
cloud_init = <<-CLOUDINIT
#cloud-config
package_update: true
packages:
- openjdk-11-jdk
- python3
- python3-pip
- wget
- curl
- rsyslog
- openssh-server

runcmd:
- systemctl enable ssh
- systemctl start ssh

# Hadoop binaries — for authentic log format
- HADOOP_VERSION=3.3.6
- wget -q "https://downloads.apache.org/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz" -O /tmp/hadoop.tar.gz
- tar -xzf /tmp/hadoop.tar.gz -C /opt/
- ln -s /opt/hadoop-$HADOOP_VERSION /opt/hadoop
- rm /tmp/hadoop.tar.gz

# Log directory structure — mirrors real CDP layout exactly
- mkdir -p /var/log/hadoop/hdfs /var/log/hadoop/yarn
- mkdir -p /var/log/hive /var/log/spark /var/log/kafka
- mkdir -p /var/log/zookeeper /var/log/oozie /var/log/nifi
- chmod -R 755 /var/log/hadoop /var/log/hive /var/log/spark
- chmod -R 755 /var/log/kafka /var/log/zookeeper /var/log/oozie /var/log/nifi

- echo "ARIA UC1 Azure node ready $(hostname) at $(date)" >> /var/log/aria-setup.log
CLOUDINIT
}

# ── Virtual Machines ───────────────────────────────────────────────────────────
resource "azurerm_linux_virtual_machine" "nodes" {
for_each = local.nodes
name = each.key
location = azurerm_resource_group.uc1.location
resource_group_name = azurerm_resource_group.uc1.name
size = "Standard_B2ms" # 2 vCPU, 8 GB RAM — equivalent to GCP e2-standard-2

admin_username = "aria"
# SSH key auth only — no password
disable_password_authentication = true

admin_ssh_key {
username = "aria"
public_key = var.aria_ssh_public_key
}

network_interface_ids = [azurerm_network_interface.nodes[each.key].id]

os_disk {
name = "aria-uc1-${each.key}-osdisk"
caching = "ReadWrite"
storage_account_type = "StandardSSD_LRS"
disk_size_gb = each.value.disk_gb
}

source_image_reference {
publisher = "Debian"
offer = "debian-11"
sku = "11"
version = "latest"
}

# cloud-init runs on first boot — same Hadoop setup as GCP startup script
custom_data = base64encode(local.cloud_init)

tags = {
aria-role = each.value.role
aria-uc = "uc1"
aria-env = "testing"
}

depends_on = [azurerm_network_interface.nodes]
}
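
The address plan in `main.tf` (a `10.10.0.0/24` subnet inside a `10.10.0.0/16` virtual network, five nodes with dynamic private IPs) can be sanity-checked before `terraform apply`. The sketch below uses Python's standard `ipaddress` module. `usable_hosts` and `check_plan` are hypothetical helper names; the constant reflects Azure reserving five addresses in every subnet.

```python
import ipaddress

AZURE_RESERVED_PER_SUBNET = 5  # Azure holds back five addresses in each subnet


def usable_hosts(subnet_cidr: str) -> int:
    """Number of addresses left for NICs after Azure's reservations."""
    return ipaddress.ip_network(subnet_cidr).num_addresses - AZURE_RESERVED_PER_SUBNET


def check_plan(vnet_cidr: str, subnet_cidr: str, node_count: int) -> bool:
    """True when the subnet sits inside the VNet and can hold every node."""
    vnet = ipaddress.ip_network(vnet_cidr)
    subnet = ipaddress.ip_network(subnet_cidr)
    return subnet.subnet_of(vnet) and usable_hosts(subnet_cidr) >= node_count


if __name__ == "__main__":
    print(check_plan("10.10.0.0/16", "10.10.0.0/24", node_count=5))
```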
17 changes: 17 additions & 0 deletions infra/terraform/uc_testing/azure/uc1-hadoop-onprem/outputs.tf
@@ -0,0 +1,17 @@
output "master_external_ip" {
description = "Public IP of cdp-master-01 — use for SSH access and ServiceNow CMDB"
value = azurerm_public_ip.master.ip_address
}

output "node_internal_ips" {
description = "Map of node name to private IP — populate ServiceNow CMDB member CI ip_address fields"
value = {
for name, _ in local.nodes :
name => azurerm_network_interface.nodes[name].private_ip_address
}
}

output "resource_group_name" {
description = "Resource group containing all UC1 resources — use for az CLI commands and cleanup"
value = azurerm_resource_group.uc1.name
}
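
Both IP outputs are described as inputs for ServiceNow CMDB records. A minimal sketch of turning `terraform output -json` into one row per node follows. It relies on the documented shape of that command's output, where each output is wrapped in an object with a `value` key. `cmdb_rows` and the row field names are hypothetical and not tied to any ServiceNow API.

```python
import json
from typing import Dict, List, Optional


def cmdb_rows(terraform_output_json: str, master_name: str = "cdp-master-01") -> List[Dict[str, Optional[str]]]:
    """Build one CMDB-style row per node from `terraform output -json` text."""
    outputs = json.loads(terraform_output_json)
    internal = outputs["node_internal_ips"]["value"]  # node name -> private IP
    master_public = outputs["master_external_ip"]["value"]
    rows = []
    for name in sorted(internal):
        rows.append(
            {
                "name": name,
                "ip_address": internal[name],
                # Only the master node has a public IP in this layout.
                "public_ip": master_public if name == master_name else None,
            }
        )
    return rows
```

The JSON text could come from `subprocess.run(["terraform", "output", "-json"], capture_output=True, text=True).stdout`, run in the module directory.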
@@ -0,0 +1,10 @@
subscription_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" # az account show --query id
resource_group_name = "aria-uc1-rg"
location = "West Europe"

# Your workstation public IP — curl ifconfig.me
allowed_ssh_cidr = "YOUR_IP/32"

# Generate with: ssh-keygen -t ed25519 -f ~/.ssh/aria_uc1_key -C aria -N ""
# Then: cat ~/.ssh/aria_uc1_key.pub
aria_ssh_public_key = "ssh-ed25519 AAAA... aria"
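
The example values above are placeholders (`YOUR_IP/32`, `ssh-ed25519 AAAA...`) that have to be replaced before applying. A small pre-flight check, sketched here with hypothetical helper names, could catch an unedited file early. The key check only inspects the OpenSSH prefix and does not verify that the key is valid.

```python
import ipaddress

# Base64 of the length-prefixed "ssh-ed25519" type string that starts every such key blob.
ED25519_BLOB_PREFIX = "AAAAC3NzaC1lZDI1NTE5"


def validate_ssh_cidr(value: str) -> bool:
    """True only for a single IPv4 host written as a /32 CIDR."""
    try:
        net = ipaddress.ip_network(value, strict=True)
    except ValueError:
        return False  # placeholder text or malformed address
    return net.version == 4 and net.prefixlen == 32


def looks_like_ed25519_key(value: str) -> bool:
    """Cheap shape check for an OpenSSH ED25519 public key line."""
    parts = value.split()
    return len(parts) >= 2 and parts[0] == "ssh-ed25519" and parts[1].startswith(ED25519_BLOB_PREFIX)
```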
26 changes: 26 additions & 0 deletions infra/terraform/uc_testing/azure/uc1-hadoop-onprem/variables.tf
@@ -0,0 +1,26 @@
variable "subscription_id" {
type = string
description = "Azure subscription ID — get from: az account show --query id"
}

variable "resource_group_name" {
type = string
default = "aria-uc1-rg"
description = "Name of the Azure resource group to create for UC1"
}

variable "location" {
type = string
default = "West Europe"
description = "Azure region for all UC1 resources"
}

variable "allowed_ssh_cidr" {
type = string
description = "Your workstation IP/32 for SSH access — get from: curl ifconfig.me"
}

variable "aria_ssh_public_key" {
type = string
description = "ED25519 public key for ARIA SSH access — generate with: ssh-keygen -t ed25519 -f ~/.ssh/aria_uc1_key -C aria"
}