[inference] speedup by 84% - async, layer caching #103

Draft
zazer0 wants to merge 18 commits into hijohnnylin:main from zazer0:feat/optimise-inference-speed

Conversation


zazer0 (Contributor) commented May 24, 2025

Problem

  • Inference can be slow, depending on the size and quantity of requests.
    • A steering request processed the original and steered generations serially rather than in parallel.
  • When steering with identical features repeatedly, layers were manually re-loaded each time.
    • Since this use case is common, it seems ideal to cache at least a few of the most recently used layers.

Fix / Feature

  • Implemented async generation: the original and steered responses are now processed in parallel (minor speedup).
  • Implemented least-recently-used (LRU) layer caching for the 5 most recently requested layers (major speedup).
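The two optimizations above can be sketched together as follows. This is a minimal illustration, not the PR's actual code: the names `LayerCache` and `generate_both` are hypothetical, and a real implementation would cache loaded SAE/layer objects and await model generation coroutines.

```python
import asyncio
from collections import OrderedDict


class LayerCache:
    """LRU cache for loaded layers (hypothetical sketch; keeps the N most recent)."""

    def __init__(self, max_size=5):
        self.max_size = max_size
        self._cache = OrderedDict()

    def get(self, key, loader):
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        value = loader(key)  # cache miss: load the layer
        self._cache[key] = value
        if len(self._cache) > self.max_size:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return value


async def generate_both(gen_default, gen_steered):
    # Run the original and steered generations concurrently instead of serially.
    return await asyncio.gather(gen_default(), gen_steered())
```

The eviction policy means repeated steering requests against the same handful of layers skip the reload entirely, which is where the bulk of the reported speedup would come from.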

Testing

🔥 Added + committed benchmarking script + results showing 84% speedup 🎉
✅ Added unit + integration tests for all aspects


vercel bot commented May 24, 2025

@zazer0 is attempting to deploy a commit to the Neuronpedia Team on Vercel.

A member of the Team first needs to authorize it.

@zazer0 zazer0 marked this pull request as ready for review May 24, 2025 01:57

hijohnnylin (Owner) commented May 24, 2025

this looks promising! i know this is still draft but quick question on parallel steer generation - does this result in the same results as if they ran separately if we specify a seed?

for example

scenario A (current):
set seed = 42
feature = dog feature
generate steer results sequentially (or they run on separate instances)

  1. default = "i am gemma"
    <seed is reset to 42>
  2. steered = "i am dog"

scenario B (parallel):
set seed = 42
feature = dog feature
generate steer results in parallel

  1. default = "i am gemma"
  2. steered = "i am a canine" <-- different

i havent tested your code to see if this is the case or not, just flagging it as something to consider
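One plausible mechanism for the divergence flagged above can be shown with a toy sketch: if both generations share one seeded RNG stream, interleaving their draws gives each generation a different subsequence of random numbers than a sequential run (with a seed reset in between) would. Python's `random` module stands in here for the actual torch sampling; this is an illustration of the hazard, not a claim about what the PR's code does.

```python
import random


def generate_sequential(seed, n):
    # Scenario A: the RNG is (re)seeded before this generation, so it
    # always sees draws r1, r2, ..., rn regardless of what ran before.
    random.seed(seed)
    return [random.random() for _ in range(n)]


def generate_interleaved(seed, n):
    # Scenario B: both generations draw from ONE shared seeded stream.
    # "default" gets r1, r3, r5, ... and "steered" gets r2, r4, r6, ...
    # so each sees a different subsequence than it would alone.
    random.seed(seed)
    default_draws, steered_draws = [], []
    for _ in range(n):
        default_draws.append(random.random())
        steered_draws.append(random.random())
    return default_draws, steered_draws
```

If the parallel path shares RNG state like this, the first sampled token can still match the sequential run, but later tokens diverge, which matches the "same beginning, different ending" pattern in the repro below.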


zazer0 (Author) commented May 24, 2025

> this looks promising! i know this is still draft but quick question on parallel steer generation - does this result in the same results as if they ran separately if we specify a seed?

Good question! Not sure, will look into it and clarify 👍

hijohnnylin (Owner) commented:

@zazer0 I intended to merge this today (it looks great!) and deployed it to some test servers to test outputs, but realized that the steer/completion_chat endpoint was not updated. Would you have time to add that as well? If not, I can take a look.


codecov bot commented Jun 1, 2025

Codecov Report

Attention: Patch coverage is 34.14634% with 270 lines in your changes missing coverage. Please review.

Project coverage is 8.79%. Comparing base (3063b04) to head (7690555).
Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
...pedia_inference/endpoints/steer/completion_chat.py 28.79% 131 Missing and 5 partials ⚠️
...euronpedia_inference/endpoints/steer/completion.py 3.61% 80 Missing ⚠️
...uronpedia_inference/endpoints/activation/single.py 5.55% 17 Missing ⚠️
.../neuronpedia_inference/endpoints/activation/all.py 8.33% 11 Missing ⚠️
...ia_inference/endpoints/activation/topk_by_token.py 10.00% 9 Missing ⚠️
...ce/neuronpedia_inference/layer_activation_cache.py 89.77% 7 Missing and 2 partials ⚠️
apps/inference/neuronpedia_inference/server.py 0.00% 8 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##            main    #103      +/-   ##
========================================
+ Coverage   7.98%   8.79%   +0.81%     
========================================
  Files        121     122       +1     
  Lines      17040   17354     +314     
  Branches     362     422      +60     
========================================
+ Hits        1360    1527     +167     
- Misses     15669   15804     +135     
- Partials      11      23      +12     
Flag Coverage Δ
autointerp 100.00% <ø> (ø)
inference 40.65% <34.14%> (+2.88%) ⬆️
webapp 5.21% <ø> (ø)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

hijohnnylin (Owner) commented:

Thanks for the updates to include completion-chat. I tested this today (7690555):

Summary

It seems that running DEFAULT and STEERED separately produces different results than running DEFAULT and STEERED in the same request, using the same settings and seed. I am not sure what the reason is - but this matters to us because we're a research platform, so we need consistent results between the two.

Repro - a Spanish feature

Setup

poetry run python start.py \
  --model_id gemma-2-2b \
  --override_model_id gemma-2-2b-it \
  --sae_sets gemmascope-res-16k \
  --model_dtype bfloat16 \
  --sae_dtype float16 \
  --include_sae 20-gemmascope-res-16k

Test Command (change the "types" array to ["STEERED"], ["DEFAULT"], or ["STEERED", "DEFAULT"])

curl -X POST http://127.0.0.1:5002/v1/steer/completion-chat \
  -H "Content-Type: application/json" \
  -d '{
     "prompt": [
        {
            "content": "I'\''m a",
            "role": "user"
        }
     ],
     "model": "gemma-2-2b-it",
     "features": [
       {
         "model": "gemma-2-2b",
         "source": "20-gemmascope-res-16k",
         "index": 8590,
         "strength": 90
       }
     ],
     "types": [
       "STEERED"
     ],
     "steer_special_tokens": true,
     "n_completion_tokens": 16,
     "temperature": 0.5,
     "strength_multiplier": 4,
     "freq_penalty": 2,
     "seed": 16,
     "steer_method": "SIMPLE_ADDITIVE",
     "normalize_steering": false
   }' | jq .

Test Outputs (clipped for readability):

STEERED only

"raw": "<bos><start_of_turn>user\nI'm a<end_of_turn>\n<start_of_turn>model\n¡Hola! ¿ cómo puedo ayudarte hoy? \n\n ( ¡En el que"

DEFAULT only

"raw": "<bos><start_of_turn>user\nI'm a<end_of_turn>\n<start_of_turn>model\nPlease tell me more! I'm eager to know what you are. 😊"

both STEERED and DEFAULT

[STEERED]
"raw": "<start_of_turn>user\nI'm a<end_of_turn>\n<start_of_turn>model\n¡Hola! ¿ cómo puedo ayudarte hoy? \n\n ( ¡En el momento",

[DEFAULT]
"raw": "<start_of_turn>user\nI'm a<end_of_turn>\n<start_of_turn>model\nPlease tell me more!  What are you? 😄 \n\nFor example,",

Recap

  • The STEERED-only result ends with "¡En el que" while the STEERED result when both are run in the same request ends in "¡En el momento"
  • The DEFAULT-only result ends with "I'm eager to know what you are. 😊" while the DEFAULT result when both are run in the same request ends in "What are you? 😄 \n\nFor example,"
  • When running the same test above on the main code (3063b04), the respective texts match.

Next Steps

  • For the completion endpoints: if we can somehow get the results to match, this would be an amazing performance speedup to merge.
  • For the other endpoints (activation/all, etc.): is there a speedup for repeated similar requests (e.g. the same or a similar SAE/text)? It may still be worth merging the optimizations on these other endpoints.
  • The CLAUDE.md looks quite useful. Have you been using it with Claude Code with good results?
  • Question about memory: since we are caching results, do we expect to use more memory, or does it stay the same? I'm considering whether we would need to reduce the max token limit for a change like this.
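On the memory question, a rough back-of-envelope is possible. The numbers below are assumptions for illustration (a 16k-latent gemmascope residual SAE on gemma-2-2b with a residual width of 2304, stored in float16), not measurements from the repo:

```python
# Assumed dimensions - adjust to the actual model/SAE being cached.
d_model = 2304       # gemma-2-2b residual stream width (assumption)
d_sae = 16384        # "16k" SAE latent count (assumption)
bytes_per_param = 2  # float16

# Encoder and decoder weight matrices dominate; biases are negligible.
params_per_sae = 2 * d_model * d_sae
mb_per_sae = params_per_sae * bytes_per_param / 1e6
cache_mb = 5 * mb_per_sae  # LRU cache of the 5 most recent layers

print(f"~{mb_per_sae:.0f} MB per SAE, ~{cache_mb:.0f} MB for a 5-entry cache")
```

Under these assumptions, each cached layer costs on the order of 150 MB, so a 5-entry cache adds roughly three quarters of a gigabyte of steady-state memory on top of the base model, which may matter when sizing the max token limit.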

@hijohnnylin hijohnnylin marked this pull request as draft July 16, 2025 05:53
hijohnnylin (Owner) commented:

Moving this back to draft until author has time to look at it.
