+++
title = "Sailing Bril’s IR into Concurrent Waters"
[extra]
bio = """
Ananya Goenka is an undergraduate studying CS at Cornell. While she isn’t nerding out over programming languages or weird ISA quirks, she’s writing for Creme de Cornell or getting way too invested in obscure books.
"""
[[extra.authors]]
name = "Ananya Goenka"
link = "https://ananyagoenka.github.io"
+++

1. Goal and Motivation
--------------------------------------

My project adds native shared-memory concurrency support to [Bril](https://github.com/sampsyo/bril). It introduces two simple yet powerful instructions, `spawn` and `join`, which let you launch and synchronize threads directly in Bril programs. Beyond powering up the IR with threads, I wanted to lay the groundwork for real optimizations by writing a thread-escape analysis pass: a static check that flags heap allocations that stay local to a single thread, so future compiler passes can skip unnecessary synchronization or RPC overhead.

2. Design Decisions and Specification
--------------------------------------

### 2.1 Instruction Semantics

We introduced two new opcodes:

* { "op": "spawn", "dest": "t", "type": "thread", "funcs": \["worker"\], "args": \["x", "y"\] }
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would probably be a little easier to read if it were in Markdown code backticks.


* **join**: An effect operation that takes a single thread handle argument and blocks until the corresponding thread completes.

To prevent clients from forging thread IDs, we defined a new primitive type `thread` in our TypeScript definitions (`bril.ts`). This opaque type ensures only genuine `spawn` instructions can produce valid thread handles.
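
For a concrete picture, here is roughly what a program using these opcodes might look like in Bril's text form. The canonical encoding is the JSON above; the textual spelling of `spawn` and `join` below is my assumption about how the text format would render it:

```
@main {
  x: int = const 1;
  y: int = const 2;
  # launch @worker(x, y) in a new thread; t is an opaque handle
  t: thread = spawn @worker x y;
  # block until that thread finishes
  join t;
}

@worker(a: int, b: int) {
  s: int = add a b;
  print s;
}
```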

### 2.2 Shared State Model

A critical design question was state sharing: should threads share only the heap or also local variables? We chose a hybrid model:

1. **Isolated stacks**: Each thread receives its own copy of the caller's environment (the `Env` map), so local variables remain private.

2. **Shared heap**: All heap allocations (`alloc`, `load`, `store`) target a single global `Heap` instance, exposing potential data races, a faithful reflection of real-world concurrency (see the sketch below).
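
To make the model concrete, here is a sketch in the spirit of the shared-counter tests described later (the actual benchmark programs may differ). Each thread gets its own copy of its locals, but both calls to `@inc` operate on the same heap cell, so the printed value depends on how the two unsynchronized read-modify-write sequences interleave:

```
@main {
  one: int = const 1;
  zero: int = const 0;
  cell: ptr<int> = alloc one;    # a single shared heap cell
  store cell zero;
  t1: thread = spawn @inc cell;
  t2: thread = spawn @inc cell;  # both threads receive the same pointer
  join t1;
  join t2;
  v: int = load cell;
  print v;                       # 1 or 2, depending on the interleaving
  free cell;
}

@inc(p: ptr<int>) {
  one: int = const 1;
  v: int = load p;
  v2: int = add v one;
  store p v2;                    # unsynchronized read-modify-write
}
```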


### 2.3 Interpreter Implementation

Both variants described below extend Bril's reference interpreter, `brili`, which is written in TypeScript and runs on Deno.

#### Stubbed Concurrency (Option A)

Our first pass implemented `spawn`/`join` synchronously in-process: `spawn` simply runs the target function to completion by calling back into the interpreter, exactly like an ordinary function call, which makes `join` a no-op. This stub served as a correctness check and let us validate the grammar and TypeScript types without introducing any asynchrony.
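
Conceptually, the stub reduces to something like the following TypeScript sketch (the names `spawnStub`, `joinStub`, and `runFunction` are illustrative, not the actual interpreter internals):

```
// Option A, sketched: "spawn" runs the function right now, in-process,
// and "join" has nothing left to wait for. All names are illustrative.

type ThreadHandle = { kind: "thread"; id: number };
type Value = number | boolean | ThreadHandle;

let nextThreadId = 0;

// Spawn: interpret the target function to completion immediately,
// exactly like a direct call, then hand back a handle that is never
// really needed afterwards.
function spawnStub(
  runFunction: (name: string, args: Value[]) => void, // recursive entry into the interpreter
  funcName: string,
  args: Value[],
): ThreadHandle {
  runFunction(funcName, args);
  return { kind: "thread", id: nextThreadId++ };
}

// Join: a no-op, because the "thread" already finished inside spawnStub.
function joinStub(_handle: ThreadHandle): void {}
```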

#### Web Worker-Based Concurrency (Option B)

To simulate true concurrency, we leveraged Deno’s Web Workers. Each `spawn`:

1. Allocates a unique thread ID and a corresponding promise resolver.

2. Spins up a new `Worker` running `brili_worker.ts`, passing it the full Bril program, function name, arguments, **and a proxy to the shared heap**.

3. The worker executes `runBrilFunction(...)` in its own isolate and signals completion via `postMessage`.

4. The main isolate bridges heap operations through an RPC-style `HeapProxy`, intercepting `alloc`/`load`/`store`/`free` calls from workers and dispatching them on the master heap.


This architecture faithfully exposes nondeterministic interleaving and shared-memory errors (e.g., data races), but at the cost of heavy message-passing overhead.
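
Here is a condensed TypeScript sketch of the main-isolate side of this design. The message shapes and helper names are illustrative assumptions rather than the exact ones in `brili.ts`/`brili_worker.ts`, and the master heap is shrunk to a flat number array to keep the sketch short:

```
type HeapRequest =
  | { kind: "alloc"; amount: number }
  | { kind: "load"; loc: number }
  | { kind: "store"; loc: number; value: number }
  | { kind: "free"; loc: number };

// The master heap lives only in the main isolate.
const heap: number[] = [];

function serviceHeapRequest(req: HeapRequest): number | undefined {
  switch (req.kind) {
    case "alloc": {
      const base = heap.length;
      heap.length += req.amount;
      return base;
    }
    case "load":
      return heap[req.loc];
    case "store":
      heap[req.loc] = req.value;
      return undefined;
    case "free":
      return undefined; // no-op in this sketch
  }
}

let nextThreadId = 0;
const joinPromises = new Map<number, Promise<void>>(); // handle -> completion

function spawnWorker(program: unknown, funcName: string, args: unknown[]): number {
  const id = nextThreadId++;
  let markDone!: () => void;
  joinPromises.set(id, new Promise<void>((res) => (markDone = res)));

  const worker = new Worker(new URL("./brili_worker.ts", import.meta.url), {
    type: "module",
  });

  worker.onmessage = (e: MessageEvent) => {
    const msg = e.data;
    if (msg.kind === "heap_req") {
      // Every remote alloc/load/store/free lands here and is answered by RPC.
      const result = serviceHeapRequest(msg.req as HeapRequest);
      worker.postMessage({ kind: "heap_res", reqId: msg.reqId, result });
    } else if (msg.kind === "done") {
      markDone();
      worker.terminate();
    }
  };

  // Ship the whole program plus the entry point and arguments to the worker.
  worker.postMessage({ kind: "run", program, funcName, args, threadId: id });
  return id; // the interpreter stores this as the opaque thread handle
}

// join awaits the completion promise associated with the handle.
async function joinThread(id: number): Promise<void> {
  await joinPromises.get(id);
}
```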

3. Thread-Escape Analysis Pass
-------------------------------

We wrote `examples/escape.py`, a Python script that performs a conservative, intraprocedural escape analysis:

1. **Seed escapes**: Pointers passed to `spawn`, returned via `ret`, stored into other pointers, or passed to external calls are marked as escaping.

2. **Propagation**: Any pointer derived from an escaping pointer through `ptradd` or `load` chains also escapes.

3. **Reporting**: The pass reports per-function and global statistics: the fraction of allocations that remain thread-local (see the example below).
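
For example, on a (hypothetical) program like the one below, the pass seeds `shared` as escaping because it is passed to `spawn`, propagates that fact to the `ptradd`-derived `q`, and leaves `local` marked thread-local:

```
@main {
  n: int = const 4;
  off: int = const 1;
  shared: ptr<int> = alloc n;       # escapes: handed to another thread
  q: ptr<int> = ptradd shared off;  # derived from an escaping pointer, so it escapes too
  local: ptr<int> = alloc n;        # never leaves this thread: stays thread-local
  t: thread = spawn @worker shared;
  join t;
  free local;
  free shared;
}

@worker(p: ptr<int>) {
  one: int = const 1;
  store p one;
}
```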


This simple pass identifies heap objects that never cross thread boundaries, opening opportunities for a future optimizer to elide synchronization or use stack allocation.

4. Benchmark Suites
--------------------

We developed two complementary benchmark suites:

* **Correctness Suite** (`benchmarks/concurrency/*.bril`): A set of small tests covering `spawn`/`join` arity errors, double joins, shared counters, parallel sums, and array writers. We validated behavior under both the stubbed and worker-based interpreters using Turnt.

* **Performance Suite** (`benchmarks/concurrency/perf/*.bril`): Larger, data-parallel workloads that measure how the worker-based interpreter fares as input size grows (a simplified sketch of one appears after this list):

* `perf_par_sum_100k.bril`: Summing 1..100 000 split across two threads. With 100 k elements, the worker-based run is roughly 20× slower than sequential (see Section 5.2).

* `perf_matmul_split.bril`: Matrix multiplication on 100×100 matrices with a row-range split; roughly 10× slower than sequential.

* `perf_big_par_sum_10M.bril`: Summing 1..10 000 000, where RPC overhead dominates and the concurrent run failed to finish within our timeout.
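
To give a feel for the shape of these workloads, here is a simplified sketch of a two-thread parallel sum (not the actual `perf_par_sum_100k.bril`; in particular, I have written the inner loop to accumulate through the heap, which is the access pattern that produces the roughly 200 k RPC round trips discussed in Section 5.2):

```
@main {
  n: int = const 100000;
  two: int = const 2;
  one: int = const 1;
  zero: int = const 0;
  half: int = div n two;
  out: ptr<int> = alloc two;        # two heap cells for the partial sums
  out2: ptr<int> = ptradd out one;
  t1: thread = spawn @range_sum zero half out;
  t2: thread = spawn @range_sum half n out2;
  join t1;
  join t2;
  a: int = load out;
  b: int = load out2;
  total: int = add a b;
  print total;
  free out;
}

# Sum the integers in [lo, hi), accumulating through the heap cell dst.
@range_sum(lo: int, hi: int, dst: ptr<int>) {
  one: int = const 1;
  zero: int = const 0;
  store dst zero;
  i: int = id lo;
.loop:
  done: bool = ge i hi;
  br done .end .body;
.body:
  cur: int = load dst;              # one heap load per iteration...
  cur2: int = add cur i;
  store dst cur2;                   # ...and one heap store
  i: int = add i one;
  jmp .loop;
.end:
  ret;
}
```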


5. Empirical Evaluation
------------------------

### 5.1 Correctness

All the correctness benchmarks pass under the worker-enabled interpreter (`brili-conc`), matching our specification. Error cases in the error directory (arity mismatch, double join, wrong handle type) correctly exit with error codes.

### 5.2 Performance

Below are the wall-clock times measured for three representative benchmarks, comparing the sequential interpreter (run with `--no-workers`) against our concurrent version (with true Deno Web Workers and an RPC-based shared heap).

```
=== Benchmark: perf_par_sum_100k ===
Sequential: 217 ms, Concurrent: 4.5 s → 20.8× slowdown

=== Benchmark: perf_matmul_split ===
Sequential: 4.2 s, Concurrent: 44.9 s → 10.8× slowdown

=== Benchmark: perf_big_par_sum_10M ===
Sequential: 15.8 s, Concurrent: did not finish (timed out) → RPC overhead dominates
```

#### Why “concurrent” is actually slower

1. Our design (real Web Workers + an RPC heap) means _every_ `load` and `store` from a worker turns into:
```
postMessage({ kind: "heap_req", … }) // worker → main
→ main services request, touches native JS array
postMessage({ kind: "heap_res", … }) // main → worker
```
On a 100 k-element loop, that's roughly 200 k round trips. At the tens of microseconds each round trip costs in Deno's message-passing layer (our measured 4.5 s over ~200 k trips works out to roughly 20 µs apiece), messaging alone accounts for essentially the entire concurrent runtime. At 10 M elements the count grows to ~20 M messages; even assuming the cost scales only linearly with message count, that is minutes of pure messaging, which is why the run blew past our timeout.

2. Each of our benchmarks launches two worker threads (the thread count is fixed, not tied to input size), and spawning even two workers in Deno carries a fixed overhead of tens to hundreds of milliseconds. On the 100 k-element runs that startup cost is a noticeable fraction of the total; on the 10 M-element run it is dwarfed by, and simply adds to, the RPC penalty.

3. The main isolate spends its time servicing heap requests from _both_ workers, so it becomes a serialization point: the workers are mostly blocked waiting for heap replies while the main thread is busy shuffling messages rather than interpreting. Everyone is busy, but almost none of that time is useful Bril work, so no isolate gets anywhere near a full core's worth of real computation.

That said, moderately coarse workloads fare somewhat better under this design. In the matrix-multiplication benchmark (a 100 × 100 matrix split by row ranges), each heap access is surrounded by more arithmetic inside V8 than in the flat parallel sum, so messaging is a smaller fraction of the total work. In our tests the sequential run took ~4 s and the concurrent run ~45 s: still much slower, but a smaller relative penalty than the 10 M-element sum, which never finished at all.

Some lessons and thoughts. To actually _win_ with real parallelism under this design, we have to drive down the number of messages. One option is to batch memory operations: transform long loops into single RPC calls that process entire slices (“sum these 1 000 elements” in one go), amortizing the message-passing cost. Going further, a `SharedArrayBuffer` could eliminate the RPC entirely by mapping our Bril heap into a typed array visible to all workers; each `load`/`store` then becomes a direct memory access, and large-N benchmarks could finally see true multicore speedups (a sketch of this idea follows below). As an intermediate step, grouping every 1 000 loads/stores into one batched message would cut messaging overhead by two orders of magnitude, which should already push the break-even point down toward the 10 M-element range.
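
As a rough illustration of the `SharedArrayBuffer` direction, here is a TypeScript sketch of a heap of integer cells that every worker could access directly. It assumes all Bril heap values fit in 32-bit integers, which the real heap does not guarantee, and it ignores deallocation:

```
// A fixed-size heap of int cells backed by a SharedArrayBuffer, so workers
// read and write memory directly instead of round-tripping through
// postMessage. This is a design sketch, not the current implementation.

const WORDS = 1 << 20;                          // 1M cells
const sab = new SharedArrayBuffer(WORDS * 4);
const cells = new Int32Array(sab);              // one view per isolate, same memory

// Cell 0 is reserved as the bump-allocation cursor so that any isolate can
// allocate without talking to the main thread.
Atomics.store(cells, 0, 1);

function alloc(amount: number): number {
  // Atomically reserve `amount` cells; Atomics.add returns the old cursor,
  // which becomes the base index of the new allocation.
  return Atomics.add(cells, 0, amount);
}

function load(loc: number): number {
  return Atomics.load(cells, loc);              // direct shared read, no RPC
}

function store(loc: number, value: number): void {
  Atomics.store(cells, loc, value);             // direct shared write, no RPC
}

// The SharedArrayBuffer itself can be posted to each worker once at spawn
// time; structured cloning shares the underlying memory rather than copying it.
```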

_In summary_, our RPC-based “shared-heap” prototype cleanly demonstrates the mechanics of true concurrency in Bril, but messaging overhead erases any parallel gains on the fine-grained loops we tested. Future work on shared-memory backing or batched RPC will be necessary to turn this into a truly performant parallel interpreter.

### 5.3 Escape Analysis Statistics

The pass prints one line per function, reporting how many of that function's `alloc` instructions it can prove never escape their thread; `main` and `mixed` below are simply the function names in two of our tiny test programs, which is why the counts are so small (each test contains only one or two allocations). Output from two representative tests:

```
[escape] main: 1/1 allocs are thread-local # solo allocation stays local
[escape] TOTAL: 1/1 (100%) thread-local

[escape] main: 0/1 allocs are thread-local # spawn causes escape
[escape] mixed: 1/2 allocs are thread-local # one local, one escaping
```

These results match a manual reading of the test programs: the pass flags exactly the allocations that reach a `spawn` and keeps the rest thread-local.

6. Challenges and Lessons Learned
----------------------------------

The switch from synchronous stubs to Web Workers introduced real complexity in sharing the heap. Since workers cannot share JS memory directly, we built an RPC-based `HeapProxy` that marshals each `alloc`/`load`/`store` call, and debugging message ordering, promise lifetimes, and termination conditions consumed significant effort.
My biggest lesson was that I expected concurrency to speed everything up, but the 10 M-element test exposed how RPC overhead can completely negate parallel gains.
I also took some shortcuts. A fully context-sensitive, interprocedural pass would be more precise, but the current intraprocedural, flow-insensitive approach seemed to be the simplest useful thing I could do.

7. Future Work
---------------

Building on this foundation, several directions remain:

* **SharedArrayBuffer Heap**: Replace RPC with a `SharedArrayBuffer`-backed heap to eliminate message-passing overhead and unlock true multi-core speed-ups for large workloads.

* **Par-for Loop Transform**: Implement a compiler pass that recognizes map/reduce patterns and automatically rewrites loops into multiple `spawn` calls, guided by data-dependency analysis.

* **Interprocedural Escape Analysis**: Extend `escape.py` to track pointers across function boundaries and calls, increasing precision and enabling stack-based allocation for truly local objects.

* **Robust Testing Harness**: Integrate with continuous integration (CI) to run our concurrency and escape-analysis suites on every commit, ensuring regressions are caught early.