
Consequence Simulator

Module: moralstack/runtime/modules/simulator_module.py

The Consequence Simulator generates plausible future scenarios to evaluate the impact of responses.

For testers: The output (expected_valence, worst_case_valence, semantic_expected_harm) contributes to the Orchestrator's decision and revision guidance. Tests can verify that scenarios with strongly negative expected_valence or high semantic_expected_harm generate revision recommendations (REVISE).


Overview

The Consequence Simulator:

  • Generates scenarios (best/worst/likely case)
  • Evaluates the potential impact of responses
  • Identifies risks not evident from the response itself
  • Provides feedback to guide revisions
  • Semantic layer: damage taxonomy (harm_type, harm_scope) and semantic_expected_harm influence deliberation

Ultra-Lean Design

To reduce token usage:

  • Narrative: each consequence.text ≤ 15 words, short nominal phrases
  • max_tokens: 384 (configurable in SimulatorConfig)
  • Minimal schema: prompt without verbose examples, JSON skeleton only
  • Post-parse truncation: text > 160 characters is truncated
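
The truncation rule can be sketched in a few lines (the helper name is illustrative; the 160-character limit matches the bullet above):

```python
# Sketch of the post-parse truncation rule (illustrative helper name).
MAX_TEXT_CHARS = 160

def truncate_narrative(text: str, limit: int = MAX_TEXT_CHARS) -> str:
    """Clamp over-long consequence narratives after JSON parsing."""
    return text if len(text) <= limit else text[:limit]

print(len(truncate_narrative("x" * 500)))  # 160
```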

Scenario Types

ScenarioType

| Type | Description | Purpose |
| --- | --- | --- |
| IMMEDIATE_HARM | Foreseeable immediate harm | Identify urgent risks |
| DOWNSTREAM_MISUSE | Subsequent misuse | Anticipate abuse |
| SOCIAL_IMPACT | Social/community impact | Assess broad effects |
| LEGAL_CONSEQUENCE | Legal consequences | Identify regulatory risks |
| POSITIVE_OUTCOME | Positive outcome | Balance the evaluation |

Semantic Harm Layer

Damage taxonomy (harm_type)

| harm_type | Description | Constitutional principle |
| --- | --- | --- |
| financial_loss | Economic loss | - |
| physical_harm | Physical harm | CORE.NM.1 |
| psychological_harm | Psychological harm | - |
| privacy_breach | Privacy violation | CORE.PRIV.1 |
| legal_risk | Legal risk | CORE.NM.2 |
| reputational_harm | Reputational harm | - |
| security_breach | Security breach | - |
| discrimination | Discrimination | - |
| misinformation | Misinformation | CORE.DECEPTION.1 |
| exploitation | Exploitation | CORE.NONEXPLOITATION.1 |
| self_harm | Self-harm | MH.CRISIS.1 |
| emotional_manipulation | Hidden emotional influence | CORE.MANIPULATION.1, CORE.AUTONOMY.1 |
| financial_deception | Financial deception / fraud | FIN.SCAM.1, CORE.DECEPTION.1 |
| none | No semantic harm | - |

Scope (harm_scope)

| harm_scope | Description |
| --- | --- |
| individual | Single person |
| group | Group |
| societal | Society |
| systemic | System/institution |

semantic_expected_harm calculation

risk_i = likelihood * harm_severity  # for harm_type != "none"
semantic_expected_harm = max(risk_i)
dominant_harm_types = top 2 harm_type by risk_i
worst_harm = argmax(risk_i)  # {harm_type, harm_scope, risk}
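
The calculation above can be checked with a plain-Python sketch; the dict-based consequences here are simplified stand-ins for full Consequence objects:

```python
def aggregate_semantic_harm(consequences):
    """Sketch of the semantic harm aggregation described above."""
    # risk_i = likelihood * harm_severity, only for harm_type != "none"
    risks = [
        (c["harm_type"], c["harm_scope"], c["likelihood"] * c["harm_severity"])
        for c in consequences
        if c["harm_type"] != "none"
    ]
    if not risks:
        return {"semantic_expected_harm": 0.0, "dominant_harm_types": [], "worst_harm": None}
    risks.sort(key=lambda r: r[2], reverse=True)
    top_type, top_scope, top_risk = risks[0]
    return {
        "semantic_expected_harm": top_risk,                # max(risk_i)
        "dominant_harm_types": [r[0] for r in risks[:2]],  # top 2 by risk_i
        "worst_harm": {"harm_type": top_type, "harm_scope": top_scope, "risk": top_risk},
    }

consequences = [
    {"harm_type": "misinformation", "harm_scope": "individual", "likelihood": 0.25, "harm_severity": 0.6},
    {"harm_type": "physical_harm", "harm_scope": "individual", "likelihood": 0.15, "harm_severity": 0.8},
    {"harm_type": "none", "harm_scope": "individual", "likelihood": 0.6, "harm_severity": 0.0},
]
# risks: misinformation 0.25*0.6 = 0.15, physical_harm 0.15*0.8 = 0.12
print(aggregate_semantic_harm(consequences)["semantic_expected_harm"])  # ≈ 0.15
```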

Influence on deliberation

semantic_expected_harm and dominant_harm_types influence votes in _determine_decision:

  • semantic_expected_harm >= 0.4 → +1 vote REVISE
  • semantic_expected_harm >= 0.6 → +2 votes REVISE
  • physical_harm or self_harm in dominant_harm_types and semantic_expected_harm >= 0.4 → +2 votes REVISE

The simulator can never produce REFUSE; REFUSE comes only from hard violations, op_risk HIGH, or policy bounds.
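
A minimal sketch of the simulator's contribution to the vote count (illustrative: the real _determine_decision aggregates votes from several modules, and whether the two thresholds stack is not stated above, so this sketch assumes the higher band replaces the lower):

```python
def simulator_revise_votes(semantic_expected_harm, dominant_harm_types):
    """Illustrative REVISE vote computation for the thresholds listed above."""
    votes = 0
    # Threshold votes (assumed non-stacking: higher band replaces the lower)
    if semantic_expected_harm >= 0.6:
        votes += 2
    elif semantic_expected_harm >= 0.4:
        votes += 1
    # Severe harm types add further votes when semantic harm is high
    if semantic_expected_harm >= 0.4 and {"physical_harm", "self_harm"} & set(dominant_harm_types):
        votes += 2
    return votes

print(simulator_revise_votes(0.5, ["self_harm"]))  # 3 (1 threshold vote + 2 severe-harm votes)
```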


Output Structure

Consequence

@dataclass
class Consequence:
    text: str  # Narrative description (max 160 chars)
    likelihood: float  # Probability [0, 1]
    scenario_id: str  # Unique identifier
    scenario_type: ScenarioType  # Category
    outcome_valence: float  # Valence [-1, 1]
    affected_stakeholders: list[str]  # Affected parties (max 3)
    harm_type: str  # Semantic harm type
    harm_severity: float  # Severity [0, 1]
    harm_scope: str  # Scope
    reversibility: float  # Reversibility [0, 1]

SimulationResult

@dataclass
class SimulationResult:
    consequences: list[Consequence]  # Generated scenarios
    worst_case_valence: float  # min(valence)
    best_case_valence: float  # max(valence)
    expected_valence: float  # Likelihood-weighted average
    semantic_expected_harm: float  # max(likelihood * harm_severity)
    dominant_harm_types: list[str]  # top 2 harm_type
    worst_harm: dict | None  # {harm_type, harm_scope, risk}
    raw_response: str  # LLM response (debug)

Expected Valence Calculation

expected_valence = Σ(valence_i × likelihood_i) / Σ(likelihood_i)

| Expected Valence | Interpretation |
| --- | --- |
| > 0.5 | Predominantly positive outcome |
| 0.0 - 0.5 | Mixed outcome |
| < 0.0 | Predominantly negative outcome → generates guidance |
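
A quick check of the formula with three illustrative scenarios:

```python
def expected_valence(consequences):
    """Likelihood-weighted average: Σ(valence_i × likelihood_i) / Σ(likelihood_i)."""
    total_weight = sum(c["likelihood"] for c in consequences)
    if total_weight == 0:
        return 0.0  # no weight: treat as neutral
    return sum(c["outcome_valence"] * c["likelihood"] for c in consequences) / total_weight

scenarios = [
    {"outcome_valence": 0.8, "likelihood": 0.6},
    {"outcome_valence": -0.5, "likelihood": 0.25},
    {"outcome_valence": -0.7, "likelihood": 0.15},
]
# (0.8*0.6 - 0.5*0.25 - 0.7*0.15) / (0.6 + 0.25 + 0.15)
print(expected_valence(scenarios))  # ≈ 0.25 → mixed outcome band
```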

Usage

Initialization

from moralstack.runtime.modules.simulator_module import LLMConsequenceSimulator

simulator = LLMConsequenceSimulator(policy_llm=policy)

Simulation

result = simulator.simulate(
    prompt="User request",
    response="Response to evaluate",
)

print(f"Expected valence: {result.expected_valence}")
print(f"Worst case: {result.worst_case_valence}")
print(f"Best case: {result.best_case_valence}")

for consequence in result.consequences:
    print(f"- [{consequence.scenario_type.value}] {consequence.text}")
    print(f"  Valence: {consequence.outcome_valence}, Likelihood: {consequence.likelihood}")

Example Output

SimulationResult(
    consequences=[
        Consequence(
            text="User uses the information as a starting point for deeper research",
            likelihood=0.6,
            scenario_type=ScenarioType.POSITIVE_OUTCOME,
            outcome_valence=0.8,
            affected_stakeholders=["user"]
        ),
        Consequence(
            text="User might interpret the advice as a substitute for professional consultation",
            likelihood=0.25,
            scenario_type=ScenarioType.DOWNSTREAM_MISUSE,
            outcome_valence=-0.5,
            affected_stakeholders=["user", "healthcare_system"]
        ),
        Consequence(
            text="In case of unrecognized emergency, delay in appropriate care",
            likelihood=0.15,
            scenario_type=ScenarioType.IMMEDIATE_HARM,
            outcome_valence=-0.7,
            affected_stakeholders=["user"]
        )
    ],
    expected_valence=0.25,
    worst_case_valence=-0.7,
    best_case_valence=0.8,
)

Caching

The simulator implements caching to avoid recomputation:

# Automatic cache based on hash(prompt + response)
# Avoids duplicate LLM calls for identical inputs
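
The cache can be sketched as a hash-keyed dict (class and method names are illustrative, as is the choice of SHA-256 for the key):

```python
import hashlib

class SimulationCache:
    """Illustrative hash-keyed cache for simulation results."""

    def __init__(self):
        self._store = {}

    def _key(self, prompt, response):
        # Stable key derived from prompt + response
        return hashlib.sha256(f"{prompt}\x00{response}".encode("utf-8")).hexdigest()

    def get(self, prompt, response):
        return self._store.get(self._key(prompt, response))

    def put(self, prompt, response, result):
        self._store[self._key(prompt, response)] = result

cache = SimulationCache()
cache.put("User request", "Response", {"expected_valence": 0.25})
print(cache.get("User request", "Response"))  # cache hit: no second LLM call needed
```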

Orchestrator Integration

The Simulator contributes to aggregated guidance:

if simulation.expected_valence < 0:
    # Generate specific guidance
    guidance_parts.append(
        f"[SIMULATOR] Negative consequences predicted: {worst_consequence.text}"
    )

Impact on Decisions

| Expected Valence | Impact |
| --- | --- |
| ≥ 0.5 | No negative impact |
| 0 - 0.5 | Generates warning |
| < 0 | Generates revision guidance |

Semantic harm (independent of valence): semantic_expected_harm >= 0.4 adds REVISE votes; physical_harm or self_harm among the dominant harm types, combined with high semantic harm, adds further REVISE votes.


Factory Methods

SimulationResult.empty()

# No relevant consequences
result = SimulationResult.empty()

SimulationResult.from_error()

# Fallback on error
result = SimulationResult.from_error("Simulation failed")
# Assumes neutral valence (0.0)
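
A simplified stand-in showing how the two factories might behave (field subset only; the real SimulationResult has more fields):

```python
from dataclasses import dataclass, field

@dataclass
class SimulationResultSketch:
    """Illustrative subset of SimulationResult, for the factories above."""
    consequences: list = field(default_factory=list)
    worst_case_valence: float = 0.0
    best_case_valence: float = 0.0
    expected_valence: float = 0.0
    raw_response: str = ""

    @classmethod
    def empty(cls):
        # No relevant consequences; all valences neutral
        return cls()

    @classmethod
    def from_error(cls, message):
        # Fallback on error: neutral valence, message kept for debugging
        return cls(raw_response=f"error: {message}")

result = SimulationResultSketch.from_error("Simulation failed")
print(result.expected_valence)  # 0.0 (neutral)
```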

Environment Variables

All simulator tuning can be overridden via .env. Variables are read at simulator construction; empty or missing values fall back to the defaults below. See .env.template for the full list. In application runs (CLI and benchmark), these environment variables are the single source of simulator configuration; no CLI flag or code path overrides them.

Model (simulator LLM)

MORALSTACK_SIMULATOR_MODEL

  • Default: (none — uses the same model as the rest of the stack, e.g. OPENAI_MODEL or gpt-4o)
  • Type: string (OpenAI model id)
  • Description: OpenAI model used only for the consequence simulator. When set and non-empty, the CLI and benchmark create a dedicated OpenAIPolicy with this model for the simulator; the rest of the stack keeps using OPENAI_MODEL.
  • Effect:
    • Set to a model id (e.g. gpt-4o-mini, gpt-4.1-nano as in .env.template / .env.minimal): The simulator uses that model. Lets you use a smaller/cheaper model for simulation and a larger one for generation.
    • Unset or empty: The simulator uses the same policy (and model) as the rest of the pipeline.

In the recommended configuration (.env.template), the simulator uses gpt-4.1-nano. Benchmark testing shows this reduces average deliberative latency by ~27% compared to gpt-4o-mini on the simulator, with no compliance degradation (98.8% maintained) and minimal quality impact (avg score 9.39 vs 9.36 with gpt-4o across all modules).

LLM and retry behaviour

MORALSTACK_SIMULATOR_MAX_RETRIES

  • Default: 3
  • Type: int (>= 1)
  • Description: Number of parse attempts for the simulator JSON response before raising an error.

Simulator generation uses OpenAI's json_object response format (response_format={"type": "json_object"} on GenerationConfig), which guarantees valid JSON and greatly reduces retries caused by malformed JSON.
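
The retry behaviour can be sketched as a parse-until-valid loop (generate stands in for the simulator's LLM call; names are illustrative):

```python
import json

def parse_with_retries(generate, prompt, max_retries=3):
    """Illustrative retry loop: re-ask the LLM until its JSON parses.

    With the json_object response format, malformed JSON should be rare,
    so most calls succeed on the first attempt.
    """
    last_err = None
    for _ in range(max_retries):
        raw = generate(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            last_err = err
    raise ValueError(f"simulator JSON parse failed after {max_retries} attempts") from last_err

# Fake generator that fails once, then returns valid JSON
attempts = iter(["not json", '{"consequences": []}'])
print(parse_with_retries(lambda p: next(attempts), "prompt"))  # {'consequences': []}
```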

MORALSTACK_SIMULATOR_MAX_TOKENS

  • Default: 384
  • Type: int (>= 1)
  • Description: Maximum tokens for the simulator LLM response. 384 keeps narratives compact (ultra-lean design).

MORALSTACK_SIMULATOR_TEMPERATURE

  • Default: 0.8
  • Type: float (0–2)
  • Description: LLM temperature for scenario generation. Higher values produce more diverse scenarios.

MORALSTACK_SIMULATOR_TOP_P

  • Default: 0.95
  • Type: float (0–1)
  • Description: Nucleus sampling (top-p) for simulator LLM generation. Controls diversity of token sampling.

Scenario generation

MORALSTACK_SIMULATOR_DEFAULT_NUM_SCENARIOS

  • Default: 3
  • Type: int (>= 1)
  • Description: Default number of consequence scenarios to generate per simulation call.

MORALSTACK_SIMULATOR_USE_SEEDED_GENERATION

  • Default: false
  • Type: bool (1/true/yes or 0/false/no)
  • Description: When true, each scenario is generated with a separate seed prompt for greater diversity. More costly but produces more varied scenarios.

Caching

MORALSTACK_SIMULATOR_ENABLE_CACHING

  • Default: true
  • Type: bool (1/true/yes or 0/false/no)
  • Description: Enable caching of simulation results to avoid recomputation on identical inputs.
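
The conventions above (empty or missing values fall back to defaults; booleans accept 1/true/yes) can be sketched as a config loader; field and helper names are illustrative and may differ from the real SimulatorConfig:

```python
import os
from dataclasses import dataclass

def _env_bool(name, default):
    raw = os.environ.get(name, "").strip().lower()
    return default if raw == "" else raw in ("1", "true", "yes")

def _env_int(name, default):
    raw = os.environ.get(name, "").strip()
    return int(raw) if raw else default

def _env_float(name, default):
    raw = os.environ.get(name, "").strip()
    return float(raw) if raw else default

@dataclass
class SimulatorConfigSketch:
    """Illustrative config loader mirroring the variables documented above."""
    max_retries: int = 3
    max_tokens: int = 384
    temperature: float = 0.8
    top_p: float = 0.95
    num_scenarios: int = 3
    use_seeded_generation: bool = False
    enable_caching: bool = True

    @classmethod
    def from_env(cls):
        return cls(
            max_retries=_env_int("MORALSTACK_SIMULATOR_MAX_RETRIES", 3),
            max_tokens=_env_int("MORALSTACK_SIMULATOR_MAX_TOKENS", 384),
            temperature=_env_float("MORALSTACK_SIMULATOR_TEMPERATURE", 0.8),
            top_p=_env_float("MORALSTACK_SIMULATOR_TOP_P", 0.95),
            num_scenarios=_env_int("MORALSTACK_SIMULATOR_DEFAULT_NUM_SCENARIOS", 3),
            use_seeded_generation=_env_bool("MORALSTACK_SIMULATOR_USE_SEEDED_GENERATION", False),
            enable_caching=_env_bool("MORALSTACK_SIMULATOR_ENABLE_CACHING", True),
        )

os.environ["MORALSTACK_SIMULATOR_MAX_TOKENS"] = "512"
os.environ["MORALSTACK_SIMULATOR_ENABLE_CACHING"] = "no"
cfg = SimulatorConfigSketch.from_env()
print(cfg.max_tokens, cfg.enable_caching)  # 512 False
```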

See Also