This document describes the formalization of SAFE_COMPLETE as a first-class, rule-based action, with a single
source of truth in `moralstack.runtime.decision.safe_complete_policy`.
- SAFE_COMPLETE is a policy action, not an inference from text: it does not rely on the presence of a disclaimer or caveat in the response (language-agnostic).
- BENIGN remains NORMAL_COMPLETE except for epistemic escalation: `actionability_risk == HIGH` (user asks what to DO, provides resources/constraints/personal goals, or output directly influences a real decision) promotes any category, including BENIGN, to SAFE_COMPLETE. This is the only exception, and it is domain-agnostic.
- Reason codes are first-class: every decision includes deterministic codes (e.g. `hard_violations`, `risk_benign`, `safe_complete_required`, `safe_complete_required_high_actionability`, `domain_regulated`) for explainability and auditing.
- Module: `moralstack.runtime.decision.safe_complete_policy`
- API: `compute_action_bounds(context) -> PolicyBounds`, `decide_final_action(context) -> (final_action, bounds, reason_codes)`
- Action order: NORMAL_COMPLETE < SAFE_COMPLETE < REFUSE
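The action order lends itself to an integer enum, so that bounds can be compared directly. The following is a minimal sketch under stated assumptions, not the module's actual code: the `Action` enum, the `clamp` helper, and the field names `min_required` / `max_allowed` on `PolicyBounds` are hypothetical (the field names are borrowed from the metadata names used elsewhere in this document).

```python
from dataclasses import dataclass
from enum import IntEnum


class Action(IntEnum):
    """Hypothetical enum; the integer values encode the documented order."""
    NORMAL_COMPLETE = 0
    SAFE_COMPLETE = 1
    REFUSE = 2


@dataclass(frozen=True)
class PolicyBounds:
    """Illustrative stand-in for the real PolicyBounds type."""
    min_required: Action
    max_allowed: Action

    def clamp(self, action: Action) -> Action:
        # Pull a candidate action into the [min_required, max_allowed] interval.
        return max(self.min_required, min(action, self.max_allowed))


bounds = PolicyBounds(Action.SAFE_COMPLETE, Action.SAFE_COMPLETE)
print(bounds.clamp(Action.NORMAL_COMPLETE).name)  # SAFE_COMPLETE
```

With `min = max`, any candidate collapses to the mandatory action, which is how the "mandatory" rules in this document can be read.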
- domain (optional): from `request.user_context.domain_overlay`
- risk_category: BENIGN | SENSITIVE | MORALLY_NUANCED | POTENTIALLY_HARMFUL | CLEARLY_HARMFUL
- op_risk: operational risk (actionable harmful action)
- hard_violations_count: number of constitutional hard violations
- intent_type (optional): factual | advice | support | explanation, derived from `request_type` and operational signals (`intent_operational`, `risk_score`). Used to differentiate SENSITIVE factual vs advisory.
- Structured signals: misuse_plausibility, actionability_risk, intent_clarity; optional `has_ambiguity_or_dual_use` (or harm_type ≠ none).
  - actionability_risk is the primary epistemic escalation signal: HIGH = user asks what to DO, provides resources/personal constraints, or output influences a real decision; MEDIUM = directional but generic guidance; LOW = purely informative/descriptive/historical.
The policy no longer relies on a hardcoded regulated-domain set in the policy module.
Sensitive handling is driven by constitution overlays (overlay_sensitive) resolved by the controller.
1. REFUSE mandatory
   - `hard_violations_count > 0` → min = max = REFUSE
   - `risk_category == CLEARLY_HARMFUL` or op_risk indicates actionable harm → min = max = REFUSE
1.5. SAFE_COMPLETE mandatory (epistemic escalation)
   - `actionability_risk == HIGH` and `hard_violations_count == 0` and op_risk not HIGH → min = max = SAFE_COMPLETE
   - Reason code: `safe_complete_required_high_actionability`
   - Definition of HIGH actionability_risk: user asks what to DO, provides resources/constraints/personal goals, or output could directly influence a real decision.
   - This rule is domain-agnostic and precedes the rules for SENSITIVE/MORALLY_NUANCED.
   - Does NOT override REFUSE (hard violations, op_risk HIGH, clearly_harmful already handled in rule 1).
2. SAFE_COMPLETE mandatory
   - `risk_category ∈ {SENSITIVE, MORALLY_NUANCED}` and (`intent_type ≠ factual` or ambiguity/dual-use signals) → min = max = SAFE_COMPLETE (no downgrade to NORMAL)
   - Factual exception: if `intent_type == factual` and no ambiguity/dual-use signals and `overlay_sensitive == false` → min = NORMAL_COMPLETE, max = SAFE_COMPLETE (reason code: `risk_sensitive_allowed`). This reduces over-governance on informative questions in health, finance, etc.
3. SAFE_COMPLETE allowed (gray zone)
   - `risk_category == POTENTIALLY_HARMFUL` and not REFUSE:
     - if `overlay_sensitive == true` and not operational intent, SAFE_COMPLETE is required (factual + no ambiguity keeps `[NORMAL_COMPLETE, SAFE_COMPLETE]`)
     - otherwise SAFE_COMPLETE is allowed and the default final_action is NORMAL_COMPLETE, to reduce false positives.
4. NORMAL_COMPLETE mandatory (benign)
   - `risk_category == BENIGN` and no hard violation → min = max = NORMAL_COMPLETE, unless rule 1.5 applies (`actionability_risk == HIGH`).
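The bound rules above, from mandatory REFUSE down to the benign case, can be read as an ordered cascade in which the first matching rule fixes the interval. The sketch below is an illustration only and makes several assumptions: `Ctx` is a simplified stand-in for `PolicyContext`, `op_risk_high` and `operational_intent` are hypothetical boolean views of the operational signals, the reason code attached to each branch is a guess, and a factual SENSITIVE request under a sensitive overlay is treated as SAFE_COMPLETE required.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional, Tuple


class Action(IntEnum):
    NORMAL_COMPLETE = 0
    SAFE_COMPLETE = 1
    REFUSE = 2


@dataclass
class Ctx:
    """Hypothetical, simplified stand-in for PolicyContext."""
    risk_category: str = "BENIGN"
    hard_violations_count: int = 0
    op_risk_high: bool = False          # assumed boolean view of op_risk
    actionability_risk: str = "LOW"
    intent_type: Optional[str] = None
    has_ambiguity_or_dual_use: bool = False
    overlay_sensitive: bool = False
    operational_intent: bool = False    # assumed boolean view of intent_operational


def sketch_bounds(c: Ctx) -> Tuple[Action, Action, List[str]]:
    """Return (min, max, reason_codes); the first matching rule wins."""
    N, S, R = Action.NORMAL_COMPLETE, Action.SAFE_COMPLETE, Action.REFUSE
    factual_clear = c.intent_type == "factual" and not c.has_ambiguity_or_dual_use
    # REFUSE mandatory (rule 1).
    if c.hard_violations_count > 0:
        return R, R, ["hard_violations"]
    if c.risk_category == "CLEARLY_HARMFUL" or c.op_risk_high:
        return R, R, ["risk_clearly_harmful"]  # code choice is an assumption
    # Epistemic escalation (rule 1.5), independent of category and domain.
    if c.actionability_risk == "HIGH":
        return S, S, ["safe_complete_required_high_actionability"]
    # SENSITIVE / MORALLY_NUANCED, with the factual exception.
    if c.risk_category in ("SENSITIVE", "MORALLY_NUANCED"):
        if factual_clear and not c.overlay_sensitive:
            return N, S, ["risk_sensitive_allowed"]
        return S, S, ["safe_complete_required"]
    # POTENTIALLY_HARMFUL gray zone.
    if c.risk_category == "POTENTIALLY_HARMFUL":
        if c.overlay_sensitive and not c.operational_intent and not factual_clear:
            return S, S, ["safe_complete_required"]
        return N, S, ["safe_complete_allowed"]
    # BENIGN.
    return N, N, ["risk_benign"]


lo, hi, codes = sketch_bounds(Ctx(actionability_risk="HIGH"))
print(lo.name, hi.name, codes)
# SAFE_COMPLETE SAFE_COMPLETE ['safe_complete_required_high_actionability']
```

Placing the REFUSE checks first is what guarantees that high actionability never overrides a refusal.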
- If REFUSE mandatory (rule 1) → final_action = REFUSE
- Else if actionability_risk HIGH (rule 1.5) → final_action = SAFE_COMPLETE
- Else if SAFE_COMPLETE mandatory (rule 2) → final_action = SAFE_COMPLETE
- Else final_action = NORMAL_COMPLETE (including POTENTIALLY_HARMFUL gray zone).
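The precedence in this list amounts to "first condition that holds wins". A minimal sketch, assuming the three conditions have already been computed as booleans; the flag names and the `sketch_final_action` helper are hypothetical and do not reflect the signature of `decide_final_action`:

```python
from typing import Dict, List, Tuple

# Ordered (flag name, action) pairs; the first flag that is set decides.
PRECEDENCE: List[Tuple[str, str]] = [
    ("refuse_mandatory", "REFUSE"),                # rule 1
    ("high_actionability", "SAFE_COMPLETE"),       # rule 1.5
    ("safe_complete_mandatory", "SAFE_COMPLETE"),  # rule 2
]


def sketch_final_action(flags: Dict[str, bool]) -> str:
    """Walk the precedence table; fall through to NORMAL_COMPLETE."""
    for name, action in PRECEDENCE:
        if flags.get(name, False):
            return action
    return "NORMAL_COMPLETE"


print(sketch_final_action({"refuse_mandatory": True, "high_actionability": True}))
# REFUSE
```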
The Decision Trace is a structured audit log that records decisions made by the system during request evaluation. It is an observational audit mechanism: it affects neither the system's behavior nor the final output.
For a single request there can be multiple trace entries. PRE_POLICY represents the decision after risk and policy
bounds (without hard-violations). FINAL represents the decision after any hard-violations and enforcement. The
decision exposed to the user is always FINAL.
| Field | Meaning |
|---|---|
| `request_id` | Request identifier |
| `stage` | PRE_POLICY \| FINAL |
| `sequence` | Temporal order (1 = PRE, 2 = FINAL) |
| `final_action` | Decided action (REFUSE \| SAFE_COMPLETE \| NORMAL_COMPLETE) |
| `decision_reason` | Textual rationale |
| `policy_reason_codes` | Policy codes (e.g. risk_benign, safe_complete_required) |
| `hard_violation_codes` | Hard violation codes (if present) |
The presence of multiple entries for the same request_id is intentional; any downgrade or override is always explicit
in the fields.
The decision is deterministic given input and policy. The trace is a logging side effect: presence or absence of the trace does not change the decision.
- Debugging
- Benchmark analysis
- Audit / compliance
- Policy tuning
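To make the trace fields concrete, here is a small sketch of how two entries for one request might be represented and how the exposed decision could be selected. `TraceEntry` and `exposed_decision` are hypothetical names, and the sample values (including the `HV_EXAMPLE` code) are invented for illustration; only the field names come from the table above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class TraceEntry:
    """Hypothetical record mirroring the documented trace fields."""
    request_id: str
    stage: str            # "PRE_POLICY" or "FINAL"
    sequence: int         # 1 = PRE_POLICY, 2 = FINAL
    final_action: str
    decision_reason: str
    policy_reason_codes: List[str] = field(default_factory=list)
    hard_violation_codes: List[str] = field(default_factory=list)


def exposed_decision(entries: List[TraceEntry]) -> TraceEntry:
    """Return the FINAL entry, i.e. the decision exposed to the user."""
    finals = [e for e in entries if e.stage == "FINAL"]
    if not finals:
        raise ValueError("trace has no FINAL entry")
    return max(finals, key=lambda e: e.sequence)


trace = [
    TraceEntry("req-1", "PRE_POLICY", 1, "NORMAL_COMPLETE",
               "benign risk", ["risk_benign"]),
    TraceEntry("req-1", "FINAL", 2, "REFUSE",
               "hard violation found", ["hard_violations"], ["HV_EXAMPLE"]),
]
print(exposed_decision(trace).final_action)  # REFUSE
```

Comparing the two entries shows the override explicitly: the action changes between stages and the hard violation code explains why.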
The reason codes include at least:
hard_violations, risk_clearly_harmful, risk_potentially_harmful, risk_sensitive, risk_morally_nuanced,
risk_benign, domain_regulated, safe_complete_required, safe_complete_required_high_actionability,
safe_complete_allowed, normal_complete_required, risk_sensitive_allowed, sim_negative_valence_safe_complete,
cycles_exhausted_sensitive_fallback, deliberation_override_refuse_to_safe_complete.
Reason codes are written to the metadata (and to the per-question benchmark table) together with `min_required`, `max_allowed` and `correctness_verdict`.
- REFUSE can be FAST_PATH.
- NORMAL_COMPLETE (benign) → FAST_PATH when appropriate.
- SAFE_COMPLETE (required for SENSITIVE/MORALLY_NUANCED or `actionability_risk == HIGH`) → DELIBERATIVE_PATH: at least one deliberative cycle is executed (no "DELIBERATIVE_PATH with 0 modules").
When a constitutional overlay is marked sensitive: true, the Controller applies a floor to the local risk_score:
- Constant: `OVERLAY_SENSITIVE_RISK_FLOOR = 0.35` (defined in `moralstack/orchestration/overlay_policy.py`, imported by the controller)
- Condition: if `overlay.sensitive == True` and `risk_score < 0.35`, then `risk_score = 0.35`
- Effect: forces routing toward the deliberative path (`risk_score >= threshold_low = 0.3`)
- Invariant: the floor is a local boost to the `risk_score` variable in the controller; it does not mutate the `risk_estimation` object and does not affect risk estimator trace/diagnostics
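The floor can be sketched as a pure function that returns a new local score and leaves its inputs untouched. The constant values are those stated above; the function name `apply_overlay_floor` and the `THRESHOLD_LOW` constant name are assumptions for illustration.

```python
OVERLAY_SENSITIVE_RISK_FLOOR = 0.35  # value stated in this document
THRESHOLD_LOW = 0.3                  # deliberative-routing threshold stated above


def apply_overlay_floor(risk_score: float, overlay_sensitive: bool) -> float:
    """Return a locally floored score; nothing upstream is mutated."""
    if overlay_sensitive and risk_score < OVERLAY_SENSITIVE_RISK_FLOOR:
        return OVERLAY_SENSITIVE_RISK_FLOOR
    return risk_score


score = apply_overlay_floor(0.10, overlay_sensitive=True)
print(score, score >= THRESHOLD_LOW)  # 0.35 True
```

Because the floor (0.35) sits above the low threshold (0.3), any request under a sensitive overlay clears the deliberative-routing check.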
When deliberation exhausts cycles without converging, the Controller applies a conservative fallback:
- Conditions (all must hold):
  - `outcome.stop_reason == "CYCLES_EXHAUSTED"`
  - `decision.final_action == "NORMAL_COMPLETE"` (does not touch REFUSE)
  - `risk_category in {SENSITIVE, MORALLY_NUANCED}` or `overlay_sensitive == True`
- Action: forces `final_action = "SAFE_COMPLETE"` by creating a new `Decision` with all fields preserved, adding the reason code `cycles_exhausted_sensitive_fallback`.
- Rationale: a CYCLES_EXHAUSTED outcome in a sensitive context signals uncertainty; the system adopts the precautionary principle preferring explicit governance (SAFE_COMPLETE) over an uncontrolled response (NORMAL_COMPLETE).
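The fallback can be illustrated with an immutable record and `dataclasses.replace`, which copies every field and changes only the ones named. This is a sketch under assumptions: the `Decision` class here is a hypothetical two-field stand-in for the real one, and `cycles_exhausted_fallback` is an invented helper name.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Decision:
    """Hypothetical minimal Decision; the real class has more fields."""
    final_action: str
    reason_codes: Tuple[str, ...] = ()


def cycles_exhausted_fallback(decision: Decision, stop_reason: str,
                              risk_category: str, overlay_sensitive: bool) -> Decision:
    sensitive = risk_category in {"SENSITIVE", "MORALLY_NUANCED"} or overlay_sensitive
    if (stop_reason == "CYCLES_EXHAUSTED"
            and decision.final_action == "NORMAL_COMPLETE"
            and sensitive):
        # Build a new Decision rather than mutating the existing one.
        return replace(
            decision,
            final_action="SAFE_COMPLETE",
            reason_codes=decision.reason_codes
            + ("cycles_exhausted_sensitive_fallback",),
        )
    return decision  # REFUSE and non-sensitive cases pass through unchanged


before = Decision("NORMAL_COMPLETE", ("risk_sensitive_allowed",))
after = cycles_exhausted_fallback(before, "CYCLES_EXHAUSTED", "SENSITIVE", False)
print(after.final_action, after.reason_codes[-1])
# SAFE_COMPLETE cycles_exhausted_sensitive_fallback
```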
When a request is routed to the deliberative path due to borderline REFUSE (risk_score within the borderline range), the Controller may downgrade a post-deliberation REFUSE to SAFE_COMPLETE if the deliberative modules unanimously support a safe response.
- Conditions (all must hold):
  - Post-deliberation `decide_action` still returns REFUSE.
  - Critic said PROCEED with zero violations and no hard violations.
  - Perspectives have mean approval_score ≥ 0.7.
  - Simulator has expected_valence ≥ 0 and semantic_expected_harm < 0.3.
  - Hindsight (if present) does not recommend refuse.
  - No critical violations in state.
- Action: the Controller replaces the decision with SAFE_COMPLETE and adds the reason code `deliberation_override_refuse_to_safe_complete`.
- Rationale: avoids wasting deliberative cycles when all modules agree the response is safe; the override is applied only after full deliberation and only when every module concurs.
Currently, `response_assembler._make_refusal()` passes the original user prompt to `policy.refuse(...)`
to enable context-aware refusals and domain-appropriate redirection.
The safety contract is enforced by policy decisions and structured signals upstream (final_action, reason_codes, hard violations), not by stripping prompt content at refusal assembly time.
- Policy: `moralstack.runtime.decision.safe_complete_policy` (`compute_action_bounds`, `decide_final_action`, `PolicyBounds`, `PolicyContext`)
- Decisions: `moralstack.orchestration.decision_service.decide_action` calls the policy as the sole source for bounds and final_action
- DCF: `moralstack.runtime.decision_correctness.compute_interval` uses the same policy for min/max and reason_codes
- Response contract: `ResponseMetadata.caveat_present`, `safe_alternative_present`, `no_prescriptive_language` set in assembler when final_action == SAFE_COMPLETE