lang-bmb/bmb-test

bmb-test

Tests are how humans trust AI.

The human verification layer for BMB — because contracts speak machine, but tests speak human.


The Problem

BMB solves safety through contracts:

fn sort(arr: &[i32]) -> Vec<i32>
  post forall i: 0..ret.len()-1. ret[i] <= ret[i+1]
  post forall x in arr. count(ret, x) == count(arr, x)

The compiler proves this is safe. But can you verify it's correct?

  • Can you read forall i: 0..ret.len()-1 in 3 seconds?
  • Do you trust AI-generated contracts you can't understand?
  • How do you know the AI understood what you wanted?

If humans can't verify the code, humans can't trust the code.


The Solution

Tests speak human:

#[example]
fn sort_behavior() {
    assert(sort(&[3, 1, 2]) == [1, 2, 3]);
    assert(sort(&[]) == []);
    assert(sort(&[5, 5, 5]) == [5, 5, 5]);
}

Now you know what the code does. In 3 seconds.


Philosophy

Traditional Testing vs BMB Testing

Role                  Traditional   BMB
Find bugs             Tests         Contracts (compile-time)
Prove safety          Tests         Contracts (Z3 verified)
Check bounds          Tests         Contracts (pre idx < len)
Null safety           Tests         Contracts (T? + pre)
Human understanding   -             Tests ← new role

In BMB, tests don't find bugs. Tests build trust.

Two Languages for Two Audiences

┌─────────────────────────────────────────────────────────┐
│                                                         │
│   Machine Language              Human Language          │
│                                                         │
│   post forall i: 0..n-1.       [3,1,2] → [1,2,3]       │
│     arr[i] <= arr[i+1]         [1] → [1]               │
│                                 [] → []                 │
│                                                         │
│   ↓                             ↓                       │
│   Z3 Solver verifies           Human brain verifies    │
│                                                         │
└─────────────────────────────────────────────────────────┘

Both must agree. Both must pass. Then you can trust.
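The same contrast can be sketched in plain Python. This is purely illustrative (BMB's contracts are proven by Z3 at compile time, whereas everything here runs at test time, and `sort_items`/`is_sorted` are made-up names):

```python
def sort_items(arr):
    return sorted(arr)

# Machine-style check: a universally quantified property over indices.
# Z3 proves this for ALL inputs; here we can only evaluate it on one.
def is_sorted(xs):
    return all(xs[i] <= xs[i + 1] for i in range(len(xs) - 1))

assert is_sorted(sort_items([3, 1, 2]))

# Human-style check: concrete input/output pairs a reader
# can verify at a glance.
assert sort_items([3, 1, 2]) == [1, 2, 3]
assert sort_items([]) == []
```

The quantified predicate and the concrete pairs assert the same thing; only the audience differs.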


Test Types

1. #[example] — The Core

Human-readable input/output pairs. The primary trust interface.

#[example]
fn binary_search_behavior() {
    let arr = [1, 3, 5, 7, 9];
    
    // Found
    assert(binary_search(&arr, 5) == Some(2));
    assert(binary_search(&arr, 1) == Some(0));
    
    // Not found
    assert(binary_search(&arr, 4) == None);
    assert(binary_search(&arr, 0) == None);
    
    // Edge cases
    assert(binary_search(&[], 1) == None);
    assert(binary_search(&[1], 1) == Some(0));
}

Rule: If a human can't verify the example in 3 seconds, it's too complex.

2. #[scenario] — Business Logic

Natural-language-like flows for domain logic:

#[scenario]
fn user_checkout_flow() {
    let cart = Cart::new();
    
    // Add items
    cart.add("Apple", quantity: 3, price: 1000);
    cart.add("Banana", quantity: 2, price: 500);
    assert(cart.total() == 4000);
    
    // Apply discount
    cart.apply_coupon("SAVE10");  // 10% off
    assert(cart.total() == 3600);
    
    // Checkout
    let order = cart.checkout(user: "alice");
    assert(order.status == "confirmed");
    assert(cart.is_empty());
}

3. #[equivalence] — Reference Implementation

Compare AI-generated code against a simple, obviously correct reference implementation:

#[equivalence]
fn sort_matches_reference() {
    // Simple exchange sort: slow but obviously correct
    fn reference_sort(arr: &[i32]) -> Vec<i32> {
        let mut v = arr.to_vec();
        for i in 0..v.len() {
            for j in i+1..v.len() {
                if v[i] > v[j] { 
                    swap(&mut v[i], &mut v[j]); 
                }
            }
        }
        return v;
    }
    
    // AI implementation must match reference
    assert_equivalent(
        optimized_sort,
        reference_sort,
        inputs: random_arrays(100)
    );
}
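The reference-equivalence idea is portable beyond BMB. A minimal Python sketch, assuming a stand-in `optimized_sort` for the implementation under test and a fixed seed in place of BMB's `random_arrays(100)`:

```python
import random

def reference_sort(arr):
    # Exchange sort: slow but obviously correct.
    v = list(arr)
    for i in range(len(v)):
        for j in range(i + 1, len(v)):
            if v[i] > v[j]:
                v[i], v[j] = v[j], v[i]
    return v

def optimized_sort(arr):
    # Stand-in for the AI-generated implementation under test.
    return sorted(arr)

# Compare both implementations on 100 random inputs.
random.seed(0)
for _ in range(100):
    arr = [random.randint(-50, 50) for _ in range(random.randint(0, 20))]
    assert optimized_sort(arr) == reference_sort(arr)
```

The reference is deliberately naive; its job is to be readable, not fast.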

4. #[contract_audit] — Contract Readability

Verify that contracts actually express human intent:

#[contract_audit]
fn sort_contract_makes_sense() {
    // The contract says "sorted output"
    // Does it really mean what we think?
    
    // This should pass
    assert_contract_accepts(sort, 
        input: [3, 1, 2], 
        output: [1, 2, 3]
    );
    
    // This should fail
    assert_contract_rejects(sort,
        input: [3, 1, 2],
        output: [1, 2]  // Missing element!
    );
    
    // This should also fail
    assert_contract_rejects(sort,
        input: [3, 1, 2],
        output: [3, 2, 1]  // Wrong order!
    );
}

5. #[golden] — Snapshot Testing

For complex outputs that are hard to specify but easy to verify visually:

#[golden("snapshots/parse_json.golden")]
fn json_parser_output() {
    let input = r#"{"name": "alice", "age": 30}"#;
    return format_ast(parse_json(input));
}
// snapshots/parse_json.golden
Object {
  "name": String("alice"),
  "age": Number(30)
}

Human reviews the golden file once. Future changes diff against it.
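In languages without built-in snapshot support, the same pattern is a few lines. A hedged Python sketch (the `snapshots/` path and `check_golden` helper are illustrative, not part of bmb-test):

```python
from pathlib import Path

def check_golden(name, actual, update=False):
    """Compare `actual` against a stored snapshot; write it on first run."""
    path = Path("snapshots") / f"{name}.golden"
    if update or not path.exists():
        path.parent.mkdir(exist_ok=True)
        path.write_text(actual)  # human reviews this file once
        return
    # Later runs diff against the approved snapshot.
    assert actual == path.read_text(), f"snapshot mismatch: {path}"
```

The `update` flag mirrors the usual "accept new snapshot" workflow: a human inspects the diff, then re-records deliberately.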


Workflow

┌─────────────────────────────────────────────────────────┐
│                  STEP 1: Human writes tests             │
│                     (what I want)                       │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  #[example]                                             │
│  fn sort_behavior() {                                   │
│      assert(sort(&[3,1,2]) == [1,2,3]);                │
│  }                                                      │
│                                                         │
└───────────────────────────┬─────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│                  STEP 2: AI generates code              │
│                (implementation + contracts)             │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  fn sort(arr: &[i32]) -> Vec<i32>                      │
│    post forall i: 0..ret.len()-1. ret[i] <= ret[i+1]  │
│  { /* AI implementation */ }                           │
│                                                         │
└───────────────────────────┬─────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│                  STEP 3: Dual verification              │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌─────────────────┐    ┌─────────────────┐            │
│  │  bmb verify     │    │  bmb test       │            │
│  │  (contracts)    │    │  (examples)     │            │
│  │                 │    │                 │            │
│  │  ✓ Z3 proof    │    │  ✓ Human intent │            │
│  │    complete     │    │    satisfied    │            │
│  └────────┬────────┘    └────────┬────────┘            │
│           │                      │                      │
│           └──────────┬───────────┘                      │
│                      ▼                                  │
│              BOTH PASS = TRUST                          │
│                                                         │
└─────────────────────────────────────────────────────────┘

Commands

# Run all example tests
bmb test

# Run specific test file
bmb test tests/sort_examples.bmb

# Run with coverage report (which examples cover which functions)
bmb test --coverage

# Verify examples align with contracts
bmb test --audit

# Generate missing examples from contracts (AI-assisted)
bmb test --generate-examples

# Watch mode
bmb test --watch

Coverage: Example Coverage, Not Line Coverage

Traditional coverage asks: "Which lines were executed?"
BMB coverage asks: "Which behaviors are demonstrated?"

$ bmb test --coverage

sort:
  ✓ empty array
  ✓ single element
  ✓ already sorted
  ✓ reverse sorted
  ✓ duplicates
  ✗ large array (>1000 elements)  # missing example
  
binary_search:
  ✓ found at start
  ✓ found at middle
  ✓ found at end
  ✓ not found (too small)
  ✓ not found (too large)
  ✗ not found (between elements)  # missing example
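Behavior coverage can be approximated in any test framework by tagging each example with the behavior it demonstrates and diffing against a required list. A toy Python sketch (the labels, `demonstrates` decorator, and registry are all hypothetical):

```python
# Behaviors the team decided every sort implementation must demonstrate.
REQUIRED = {"empty", "single", "sorted", "reverse", "duplicates"}
covered = set()

def demonstrates(label):
    """Register the behavior label an example test covers."""
    def wrap(fn):
        covered.add(label)
        return fn
    return wrap

@demonstrates("empty")
def sort_empty():
    assert sorted([]) == []

@demonstrates("duplicates")
def sort_duplicates():
    assert sorted([5, 5, 1]) == [1, 5, 5]

# Report behaviors with no demonstrating example.
missing = REQUIRED - covered
print("missing behaviors:", sorted(missing))
```

A report like the one above falls out of the set difference; line coverage never enters the picture.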

Contract Audit Report

$ bmb test --audit

sort:
  Contract: post forall i: 0..ret.len()-1. ret[i] <= ret[i+1]
  
  ✓ Examples demonstrate: ascending order
  ✗ Missing demonstration: preserves all elements
    Suggestion: add example with duplicates
    
  Contract: post ret.len() == arr.len()
  
  ✓ Examples demonstrate: length preservation

Example Generation (AI-Assisted)

When you have contracts but no examples:

$ bmb test --generate-examples sort

Generated examples for sort:

#[example]
fn sort_generated() {
    // Basic case
    assert(sort(&[3, 1, 2]) == [1, 2, 3]);
    
    // Empty
    assert(sort(&[]) == []);
    
    // Single element
    assert(sort(&[42]) == [42]);
    
    // Already sorted
    assert(sort(&[1, 2, 3]) == [1, 2, 3]);
    
    // Duplicates
    assert(sort(&[2, 1, 2]) == [1, 2, 2]);
}

Review and save? [y/n]

Human reviews, possibly edits, then saves. AI proposes, human approves.


Best Practices

DO: Write Examples First

// 1. Human writes examples (the specification)
#[example]
fn factorial_behavior() {
    assert(factorial(0) == 1);
    assert(factorial(1) == 1);
    assert(factorial(5) == 120);
}

// 2. AI generates implementation + contracts
// 3. Both verify

DO: Keep Examples Simple

// ✓ Good - verifiable in 3 seconds
assert(sort(&[3, 1, 2]) == [1, 2, 3]);

// ✗ Bad - requires mental computation
assert(sort(&[9,2,7,4,1,8,3,6,5]) == [1,2,3,4,5,6,7,8,9]);

DO: Cover Edge Cases

#[example]
fn divide_examples() {
    // Normal
    assert(divide(10, 2) == 5);
    
    // Edge: zero numerator
    assert(divide(0, 5) == 0);
    
    // Edge: result is zero
    assert(divide(3, 5) == 0);  // integer division
    
    // Note: divide(x, 0) is prevented by contract
    // No example needed - it cannot happen
}

DON'T: Test What Contracts Guarantee

// ✗ Unnecessary - contract already guarantees this
#[test]
fn sort_returns_sorted() {
    let result = sort(&[3, 1, 2]);
    for i in 0..result.len()-1 {
        assert(result[i] <= result[i+1]);  // This IS the contract!
    }
}

// ✓ Useful - demonstrates behavior to humans
#[example]
fn sort_examples() {
    assert(sort(&[3, 1, 2]) == [1, 2, 3]);
}

Integration

With bmb-mcp (Chatter)

AI uses tests as specification when generating code:

Human: "Implement sort"
AI: [reads #[example] tests] → understands intent
AI: [generates code + contracts]
AI: [runs bmb test] → verifies against human intent

With CI (action-bmb)

- uses: lang-bmb/action-bmb@v1
  with:
    command: test --audit
    fail-if-missing-examples: true

With IDE (vscode-bmb)

  • Inline example results
  • "Generate example" code action
  • Contract ↔ Example alignment warnings

FAQ

Why not property-based testing?

Properties are powerful but unreadable:

// Property - correct but hard to verify
#[property]
fn sort_is_permutation(arr: Vec<i32>) {
    assert(is_permutation(sort(&arr), arr));
}

// Example - instantly verifiable
#[example]
fn sort_preserves_elements() {
    assert(sort(&[3, 1, 2]) == [1, 2, 3]);  // all elements present
}

Properties are for machines. Examples are for humans.

Why not mutation testing?

Mutation testing asks: "Would tests catch this bug?"

In BMB, contracts catch bugs. Tests build trust.

What about integration tests?

Integration tests remain valuable — they verify module interactions that contracts can't express:

#[integration]
fn api_returns_sorted_users() {
    let response = http_get("/api/users?sort=name");
    assert(response.status == 200);
    assert(is_sorted(response.body.users, by: "name"));
}

License

MIT


Note: Self-Hosting

bmb-test is written in BMB itself. This is intentional:

  • Dogfooding: Proves BMB is mature enough for real tools
  • Consistency: The entire BMB ecosystem uses one language
  • Demonstration: bmb-test source code serves as BMB examples

bmb-test tests itself using #[example] tests.


Contracts prove safety. Tests prove intent.
Both must pass before you trust.
