replaces sae-auto-interp with delphi #113
Draft
anthonyduong9 wants to merge 1 commit into main from replace-sae-auto-interp-with-eai-delphi
22 changes: 13 additions & 9 deletions
apps/autointerp/neuronpedia_autointerp/routes/explain/default.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,6 +1,15 @@ | ||
| import traceback | ||
|
|
||
| import torch | ||
| from delphi.clients import OpenRouter | ||
| from delphi.latents.latents import ( | ||
| ActivatingExample, | ||
| Latent, | ||
| LatentRecord, | ||
| NonActivatingExample, | ||
| ) | ||
| from delphi.scorers import DetectionScorer, FuzzingScorer | ||
| from delphi.scorers.scorer import ScorerResult | ||
| from fastapi import HTTPException | ||
| from neuronpedia_autointerp_client.models.np_score_fuzz_detection_type import ( | ||
| NPScoreFuzzDetectionType, | ||
|
|
@@ -11,10 +20,6 @@ | |
| from neuronpedia_autointerp_client.models.score_fuzz_detection_post_request import ( | ||
| ScoreFuzzDetectionPostRequest, | ||
| ) | ||
| from sae_auto_interp.clients import OpenRouter | ||
| from sae_auto_interp.features import Example, Feature, FeatureRecord | ||
| from sae_auto_interp.scorers import DetectionScorer, FuzzingScorer | ||
| from sae_auto_interp.scorers.scorer import ScorerResult | ||
|
|
||
| from neuronpedia_autointerp.utils import ( | ||
| convert_classifier_output_to_score_classifier_output, | ||
|
|
@@ -39,45 +44,54 @@ async def generate_score_fuzz_detection(request: ScoreFuzzDetectionPostRequest): | |
| We currently show 5 examples at a time (batch_size=5). | ||
| """ | ||
| try: | ||
| feature = Feature("feature", 0) | ||
| feature = Latent("feature", 0) | ||
| activating_examples = [] | ||
| non_activating_examples = [] | ||
|
|
||
| for activation in request.activations: | ||
| example = Example(activation.tokens, torch.tensor(activation.values)) # type: ignore | ||
| if sum(activation.values) > 0: | ||
| example = ActivatingExample( | ||
| tokens=activation.tokens, # type: ignore | ||
| activations=torch.tensor(activation.values), | ||
| str_tokens=activation.tokens, | ||
|
Review comment: Same thing here
Reply (Contributor, Author): Similar comment as to #113 (comment). |
||
| quantile=1, | ||
| ) | ||
| activating_examples.append(example) | ||
| else: | ||
| example = NonActivatingExample( | ||
| tokens=activation.tokens, # type: ignore | ||
| activations=torch.tensor(activation.values), | ||
| str_tokens=activation.tokens, | ||
| distance=-1, | ||
| ) | ||
| non_activating_examples.append(example) | ||
|
|
||
| feature_record = FeatureRecord(feature) | ||
| feature_record.test = [activating_examples] | ||
| feature_record.extra_examples = non_activating_examples # type: ignore | ||
| feature_record.random_examples = non_activating_examples # type: ignore | ||
| feature_record.explanation = request.explanation # type: ignore | ||
| feature_record = LatentRecord(feature) | ||
| feature_record.test = activating_examples | ||
| feature_record.not_active = non_activating_examples | ||
| feature_record.extra_examples = non_activating_examples | ||
| feature_record.explanation = request.explanation | ||
|
|
||
| client = OpenRouter(api_key=request.openrouter_key, model=request.model) | ||
|
|
||
| if request.type == NPScoreFuzzDetectionType.FUZZ: | ||
| scorer = FuzzingScorer( | ||
| client, | ||
| tokenizer=None, # type: ignore | ||
| batch_size=5, | ||
| verbose=False, | ||
| log_prob=False, | ||
| ) | ||
| elif request.type == NPScoreFuzzDetectionType.DETECTION: | ||
| scorer = DetectionScorer( | ||
| client, | ||
| tokenizer=None, # type: ignore | ||
| batch_size=5, | ||
| verbose=False, | ||
| log_prob=False, | ||
| ) | ||
| else: | ||
| raise HTTPException(status_code=400, detail="Invalid scoring type") | ||
|
|
||
| result: ScorerResult = await scorer.__call__(feature_record) # type: ignore | ||
| result: ScorerResult = await scorer.__call__(feature_record) | ||
| score = per_feature_scores_fuzz_detection(result.score) | ||
|
|
||
| breakdown = [ | ||
|
|
||
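The loop in the diff above sorts each request activation into one of two lists, depending on whether its values sum to more than zero. A minimal, library-free sketch of that split follows. The helper name `split_by_activation` and the `(tokens, values)` tuple layout are assumptions made for illustration; the PR itself builds `ActivatingExample` and `NonActivatingExample` objects instead.

```python
from typing import List, Sequence, Tuple

Row = Tuple[List[str], List[float]]


def split_by_activation(rows: Sequence[Row]) -> Tuple[List[Row], List[Row]]:
    """Partition (tokens, values) rows using the same test as the diff."""
    activating: List[Row] = []
    non_activating: List[Row] = []
    for tokens, values in rows:
        # A row counts as activating when its activation values
        # sum to more than zero, mirroring `sum(activation.values) > 0`.
        if sum(values) > 0:
            activating.append((tokens, values))
        else:
            non_activating.append((tokens, values))
    return activating, non_activating


rows = [
    (["the", "cat"], [0.0, 2.5]),
    (["a", "dog"], [0.0, 0.0]),
]
activating, non_activating = split_by_activation(rows)
print(len(activating), len(non_activating))  # 1 1
```

Keeping the test in one place makes it easy to see that an all-zero row always lands in the non-activating list.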
This does not seem correct. We are expecting str_tokens to be the decoded tokens into strings. Do you have access to those in your request?
`activations.tokens` is a list of strings. I think what's confusing is that we're able to pass `activations.tokens` to `Example.tokens` despite https://github.com/EleutherAI/delphi/blob/db49cb78120c1926a4a3c4928c76ece6be64dcb3/delphi/latents/latents.py#L66-L73. But https://github.com/EleutherAI/delphi/blob/db49cb78120c1926a4a3c4928c76ece6be64dcb3/delphi/scorers/embedding/embedding.py#L111-L121 shows that `Example.tokens` can either be a list of integers or strings.

Do we want to keep both `Example.tokens` and `Example.str_tokens` in `delphi` but make changes so `tokens` can only be a list of strings?

If so, I can update the requests in this repo, in a separate PR, and make the changes to `delphi` as well.
Yea this is definitely a mistake on our part that we let pass. Example.tokens should definitely only be a list of integers and Example.str_tokens the corresponding strings. In this case, because you don't care about the integers you can just pass a dummy list?
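One way to read the "dummy list" suggestion is sketched below. It assumes only what the thread states: `tokens` should hold integers and `str_tokens` the corresponding strings. `ExampleSketch` and `build_example` are hypothetical stand-ins written for this illustration, not delphi classes; the real `ActivatingExample` and `NonActivatingExample` constructors may expect tensors or further arguments such as `quantile` or `distance`.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ExampleSketch:
    """Hypothetical stand-in for a delphi example; not the real API."""

    tokens: List[int]        # integer token IDs (placeholders here)
    str_tokens: List[str]    # decoded token strings from the request
    activations: List[float]


def build_example(str_tokens: Sequence[str], values: Sequence[float]) -> ExampleSketch:
    """Build an example when only string tokens are available."""
    if len(str_tokens) != len(values):
        raise ValueError("str_tokens and values must have the same length")
    # The request carries no token IDs, so fill `tokens` with a
    # placeholder ID per position to keep the integer-only contract.
    dummy_ids = [0] * len(str_tokens)
    return ExampleSketch(
        tokens=dummy_ids,
        str_tokens=list(str_tokens),
        activations=list(values),
    )


example = build_example(["the", "cat"], [0.0, 2.5])
print(example.tokens)  # [0, 0]
```

The placeholder IDs carry no meaning, so this only fits callers that, as in this thread, read `str_tokens` and never the integer IDs.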