[inference] speedup by 84% - async, layer caching #103

Draft
zazer0 wants to merge 18 commits into hijohnnylin:main from zazer0:feat/optimise-inference-speed

Conversation


zazer0 (Contributor) commented May 24, 2025

Problem

  • Inference can be slow, depending on the size and quantity of requests.
    • A steering request processed the original and steered generations serially rather than in parallel.
  • When steering with identical features repeatedly, layers were manually re-loaded each time.
    • Since this use case is common, it seems ideal to cache at least a few of the most recently used layers.

Fix / Feature

  • Implemented async generation: the original and steered responses are now processed in parallel (minor speedup).
  • Implemented least-recently-used (LRU) layer caching for the 5 most recently requested layers (major speedup).
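The two optimizations above can be sketched together as follows. This is a minimal illustration, not the PR's actual code: the names `LayerCache` and `generate_both` are hypothetical, and a real implementation would cache loaded SAE/layer objects and await model generation coroutines.

```python
import asyncio
from collections import OrderedDict


class LayerCache:
    """LRU cache for loaded layers (hypothetical sketch; keeps the N most recent)."""

    def __init__(self, max_size=5):
        self.max_size = max_size
        self._cache = OrderedDict()

    def get(self, key, loader):
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        value = loader(key)  # cache miss: load the layer
        self._cache[key] = value
        if len(self._cache) > self.max_size:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return value


async def generate_both(gen_default, gen_steered):
    # Run the original and steered generations concurrently instead of serially.
    return await asyncio.gather(gen_default(), gen_steered())
```

The eviction policy means repeated steering requests against the same handful of layers skip the reload entirely, which is where the bulk of the reported speedup would come from.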

Testing

🔥 Added + committed benchmarking script + results showing 84% speedup 🎉
✅ Added unit + integration tests for all aspects


vercel bot commented May 24, 2025

@zazer0 is attempting to deploy a commit to the Neuronpedia Team on Vercel.

A member of the Team first needs to authorize it.

@zazer0 zazer0 marked this pull request as ready for review May 24, 2025 01:57

hijohnnylin (Owner) commented May 24, 2025

this looks promising! i know this is still draft but quick question on parallel steer generation - does this result in the same results as if they ran separately if we specify a seed?

for example

scenario A (current):
set seed = 42
feature = dog feature
generate steer results sequentially (or they run on separate instances)

  1. default = "i am gemma"
    <seed is reset to 42>
  2. steered = "i am dog"

scenario B (parallel):
set seed = 42
feature = dog feature
generate steer results in parallel

  1. default = "i am gemma"
  2. steered = "i am a canine" <-- different

i havent tested your code to see if this is the case or not, just flagging it as something to consider
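One plausible mechanism for the divergence flagged above can be shown with a toy sketch: if both generations share one seeded RNG stream, interleaving their draws gives each generation a different subsequence of random numbers than a sequential run (with a seed reset in between) would. Python's `random` module stands in here for the actual torch sampling; this is an illustration of the hazard, not a claim about what the PR's code does.

```python
import random


def generate_sequential(seed, n):
    # Scenario A: the RNG is (re)seeded before this generation, so it
    # always sees draws r1, r2, ..., rn regardless of what ran before.
    random.seed(seed)
    return [random.random() for _ in range(n)]


def generate_interleaved(seed, n):
    # Scenario B: both generations draw from ONE shared seeded stream.
    # "default" gets r1, r3, r5, ... and "steered" gets r2, r4, r6, ...
    # so each sees a different subsequence than it would alone.
    random.seed(seed)
    default_draws, steered_draws = [], []
    for _ in range(n):
        default_draws.append(random.random())
        steered_draws.append(random.random())
    return default_draws, steered_draws
```

If the parallel path shares RNG state like this, the first sampled token can still match the sequential run, but later tokens diverge, which matches the "same beginning, different ending" pattern in the repro below.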


zazer0 (Author) commented May 24, 2025

> this looks promising! i know this is still draft but quick question on parallel steer generation - does this result in the same results as if they ran separately if we specify a seed?

Good question! Not sure, will look into it and clarify 👍

hijohnnylin (Owner) commented:

@zazer0 I intended to merge this today (it looks great!) and deployed it to some test servers to test outputs, but realized that the steer/completion_chat endpoint was not updated. Would you have time to add that as well? If not, I can take a look.


codecov bot commented Jun 1, 2025

Codecov Report

Attention: Patch coverage is 34.14634% with 270 lines in your changes missing coverage. Please review.

Project coverage is 8.79%. Comparing base (3063b04) to head (7690555).
Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
...pedia_inference/endpoints/steer/completion_chat.py 28.79% 131 Missing and 5 partials ⚠️
...euronpedia_inference/endpoints/steer/completion.py 3.61% 80 Missing ⚠️
...uronpedia_inference/endpoints/activation/single.py 5.55% 17 Missing ⚠️
.../neuronpedia_inference/endpoints/activation/all.py 8.33% 11 Missing ⚠️
...ia_inference/endpoints/activation/topk_by_token.py 10.00% 9 Missing ⚠️
...ce/neuronpedia_inference/layer_activation_cache.py 89.77% 7 Missing and 2 partials ⚠️
apps/inference/neuronpedia_inference/server.py 0.00% 8 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##            main    #103      +/-   ##
========================================
+ Coverage   7.98%   8.79%   +0.81%     
========================================
  Files        121     122       +1     
  Lines      17040   17354     +314     
  Branches     362     422      +60     
========================================
+ Hits        1360    1527     +167     
- Misses     15669   15804     +135     
- Partials      11      23      +12     
Flag Coverage Δ
autointerp 100.00% <ø> (ø)
inference 40.65% <34.14%> (+2.88%) ⬆️
webapp 5.21% <ø> (ø)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

hijohnnylin (Owner) commented:

Thanks for the updates to include completion-chat. I tested this today (7690555):

Summary

It seems that running DEFAULT and STEERED separately produces different results than running DEFAULT and STEERED in the same request, using the same settings and seed. I am not sure what the reason is - but this matters to us because we're a research platform, so we need consistent results between the two.

Repro - a Spanish feature

Setup

poetry run python start.py \
  --model_id gemma-2-2b \
  --override_model_id gemma-2-2b-it \
  --sae_sets gemmascope-res-16k \
  --model_dtype bfloat16 \
  --sae_dtype float16 \
  --include_sae 20-gemmascope-res-16k

Test Command (change the "types" array to ["STEERED"], ["DEFAULT"], or ["STEERED", "DEFAULT"])

curl -X POST http://127.0.0.1:5002/v1/steer/completion-chat \
  -H "Content-Type: application/json" \
  -d '{
     "prompt": [
        {
            "content": "I'\''m a",
            "role": "user"
        }
     ],
     "model": "gemma-2-2b-it",
     "features": [
       {
         "model": "gemma-2-2b",
         "source": "20-gemmascope-res-16k",
         "index": 8590,
         "strength": 90
       }
     ],
     "types": [
       "STEERED"
     ],
     "steer_special_tokens": true,
     "n_completion_tokens": 16,
     "temperature": 0.5,
     "strength_multiplier": 4,
     "freq_penalty": 2,
     "seed": 16,
     "steer_method": "SIMPLE_ADDITIVE",
     "normalize_steering": false
   }' | jq .

Test Outputs (clipped for readability):

STEERED only

"raw": "<bos><start_of_turn>user\nI'm a<end_of_turn>\n<start_of_turn>model\n¡Hola! ¿ cómo puedo ayudarte hoy? \n\n ( ¡En el que"

DEFAULT only

"raw": "<bos><start_of_turn>user\nI'm a<end_of_turn>\n<start_of_turn>model\nPlease tell me more! I'm eager to know what you are. 😊"

both STEERED and DEFAULT

[STEERED]
"raw": "<start_of_turn>user\nI'm a<end_of_turn>\n<start_of_turn>model\n¡Hola! ¿ cómo puedo ayudarte hoy? \n\n ( ¡En el momento",

[DEFAULT]
"raw": "<start_of_turn>user\nI'm a<end_of_turn>\n<start_of_turn>model\nPlease tell me more!  What are you? 😄 \n\nFor example,",

Recap

  • The STEERED-only result ends with "¡En el que" while the STEERED result when both are run in the same request ends in "¡En el momento"
  • The DEFAULT-only result ends with "I'm eager to know what you are. 😊" while the DEFAULT result when both are run in the same request ends in "What are you? 😄 \n\nFor example,"
  • When running the same test above on the main code (3063b04), the respective texts match.

Next Steps

  • For the completion endpoints: if we can somehow get the results to match, this would be an amazing performance speedup to merge.
  • For the other endpoints (activation/all, etc.): is there a speedup for repeated similar requests (e.g. the same or a similar SAE/text)? It may still be worth merging the optimizations on these other endpoints.
  • The CLAUDE.md looks quite useful. Have you been using it with Claude Code with good results?
  • Question about memory: since we are caching results, do we expect to use more memory, or does it stay the same? I'm considering whether we would need to reduce the max token limit for a change like this.
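On the memory question, a rough back-of-envelope is possible. The numbers below are assumptions for illustration (a 16k-latent gemmascope residual SAE on gemma-2-2b with a residual width of 2304, stored in float16), not measurements from the repo:

```python
# Assumed dimensions - adjust to the actual model/SAE being cached.
d_model = 2304       # gemma-2-2b residual stream width (assumption)
d_sae = 16384        # "16k" SAE latent count (assumption)
bytes_per_param = 2  # float16

# Encoder and decoder weight matrices dominate; biases are negligible.
params_per_sae = 2 * d_model * d_sae
mb_per_sae = params_per_sae * bytes_per_param / 1e6
cache_mb = 5 * mb_per_sae  # LRU cache of the 5 most recent layers

print(f"~{mb_per_sae:.0f} MB per SAE, ~{cache_mb:.0f} MB for a 5-entry cache")
```

Under these assumptions, each cached layer costs on the order of 150 MB, so a 5-entry cache adds roughly three quarters of a gigabyte of steady-state memory on top of the base model, which may matter when sizing the max token limit.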

@hijohnnylin hijohnnylin marked this pull request as draft July 16, 2025 05:53
hijohnnylin (Owner) commented:

Moving this back to draft until author has time to look at it.
