Description
The anomaly_root_cause agent, when invoked from IABENCH-v1 task IA-1
with Ollama + Llama 3.1 8B, returns "unknown" as the predicted fault
type for every synthetic-data scenario. F1 is 0.000 not because
predictions are wrong, but because the agent does not emit a recognizable
fault label at all.
Surfaced by the PR 1 health-check run on the IABENCH-v1 foundation PR.
Steps to reproduce
- Check out
main branch with PR 1 merged
- Ensure Ollama is running locally with
llama3.1:8b pulled
- In PowerShell:
$env:PYTHONPATH = "."
- Run:
industrial-agents bench --suite all --provider ollama --model llama3.1:8b
- Open
benchmarks/results/iabench_all_llama3.1_8b.json
- Inspect the
IA-1 task's details[] array — all entries show
predicted: "unknown"
Expected behaviour
For each anomaly scenario, the agent should return a recognizable
structured fault type (e.g., bearing-wear, hydraulic-leak,
filter-clog) that can be canonicalized and compared against ground
truth. Some predictions may be wrong, but the agent should at minimum
emit a fault label from the taxonomy, producing F1 > 0.0.
Actual behaviour
All three scenarios returned predicted: "unknown":
| Asset |
Signal |
Ground truth |
Prediction |
| motor_01 |
vibration_rms |
bearing-wear |
unknown |
| press_01 |
clamp_pressure_bar |
hydraulic-leak |
unknown |
| hydraulic_01 |
filter_dp_bar |
filter-clog |
unknown |
Result: F1=0.000. Run is flagged reliable: true (no exceptions, just
genuine "unknown" responses from the LLM).
Likely causes:
- Agent's prompt template doesn't constrain output to the structured
fault taxonomy
- Llama 3.1 8B may not be capable of reliable structured output for this
task
- Response parser may be too strict in extracting the label
Suggested next step: inspect the agent's raw LLM output for one scenario,
then either tighten the prompt with a JSON schema constraint, add few-shot
examples, or document Llama 3.1 8B as insufficient for IA-1.
Reference: benchmarks/results/iabench_all_llama3.1_8b.json (2026-06-07
UTC run)
Framework version
v0.1.0-pre (bench/iabench-foundation @ 967baf6)
LLM provider
ollama
Environment
Windows 11, Python 3.12, Llama 3.1 8B (Q4_K_M quant via ollama pull)
Description
The
anomaly_root_causeagent, when invoked from IABENCH-v1 task IA-1with Ollama + Llama 3.1 8B, returns
"unknown"as the predicted faulttype for every synthetic-data scenario. F1 is 0.000 not because
predictions are wrong, but because the agent does not emit a recognizable
fault label at all.
Surfaced by the PR 1 health-check run on the IABENCH-v1 foundation PR.
Steps to reproduce
mainbranch with PR 1 mergedllama3.1:8bpulled$env:PYTHONPATH = "."industrial-agents bench --suite all --provider ollama --model llama3.1:8bbenchmarks/results/iabench_all_llama3.1_8b.jsonIA-1task'sdetails[]array — all entries showpredicted: "unknown"Expected behaviour
For each anomaly scenario, the agent should return a recognizable
structured fault type (e.g.,
bearing-wear,hydraulic-leak,filter-clog) that can be canonicalized and compared against groundtruth. Some predictions may be wrong, but the agent should at minimum
emit a fault label from the taxonomy, producing F1 > 0.0.
Actual behaviour
All three scenarios returned
predicted: "unknown":Result: F1=0.000. Run is flagged
reliable: true(no exceptions, justgenuine "unknown" responses from the LLM).
Likely causes:
fault taxonomy
task
Suggested next step: inspect the agent's raw LLM output for one scenario,
then either tighten the prompt with a JSON schema constraint, add few-shot
examples, or document Llama 3.1 8B as insufficient for IA-1.
Reference:
benchmarks/results/iabench_all_llama3.1_8b.json(2026-06-07UTC run)
Framework version
v0.1.0-pre (bench/iabench-foundation @ 967baf6)
LLM provider
ollama
Environment
Windows 11, Python 3.12, Llama 3.1 8B (Q4_K_M quant via ollama pull)