Commit 2a93475

Merge pull request #142 from opentensor/staging
v1.2.0 Release
2 parents 0f7c18d + be21a17

File tree

15 files changed: +236 −34 lines

CHANGELOG.md

Lines changed: 15 additions & 4 deletions

@@ -1,13 +1,24 @@
 # Changelog
 
-## 1.1.8 / 2023-08-12
+## 1.2.0 / 2023-08-28
+### What's changed
+- Adds Direct Preference Optimization (DPO) style rewards by @opentaco in #99
+- Changes print format on exception catch by @camfairchild in #135
+- Brings back netuid and wandb to logged config by @p-ferreira in #137
+- Adds DPO penalty update by @Eugene-hu in #138
+- Adds original reward output to wandb logs by @isabella618033 in #139
+- Reweights reward models by @Eugene-hu in #140
+- Updates stale documentation by @steffencruz in #129
 
-### What's Changed
-- Make sure to serve axon first by @camfairchild in 14921d35c
 
-**Full Changelog**: https://github.com/opentensor/validators/compare/v1.1.7...v1.1.8
+**Full Changelog**: https://github.com/opentensor/validators/compare/v1.1.7...v1.2.0
 
+## 1.1.8 / 2023-08-12
+### What's Changed
+- Make sure to serve axon first by @camfairchild in 14921d35c
+- Adds scripts for releases on github by @camfairchild in #128
+- Wandb config log changes by @isabella618033 in #132
 
 ## 1.1.7 / 2023-08-11
 ### What's Changed

README.md

Lines changed: 0 additions & 7 deletions

@@ -118,13 +118,6 @@ Check the [README of the data collector](./scripts/README.md) for more informati
 ----
 ## Experimental Features
-### Prompt-Based Scoring
-The reward mechanism for miner completions plays a crucial role in the overall quality of the network. As such, we are constantly developing and testing new methods that make the reward process **open** and **robust**. This benefits everyone. Presently, miners' weights are set based on evaluations of their completions that are carried out by a reward model. This presents two major challenges:
-
-1. Reward model evaluations are a bottleneck, owing to the large model size
-2. Reward models are vulnerable to attacks, which reduces the network quality for everyone
-
-Consequently, validators also perform *shadow scoring*, which outsources the reward mechanism to the network. This feature is currently under development, and so the prompt-based scores are only used for research purposes.
 
 ## Sentence Embedding Gating Model
 Another cornerstone of the validator functionality is the use of a mixture of experts (MoE) model, which we call the gating model, to enable queries to be efficiently routed to the best-suited miners. **This incentivizes miners to become specialists, which in turn improves response quality**. It also reduces latency and addresses bandwidth issues in the network.

openvalidators/__init__.py

Lines changed: 1 addition & 1 deletion

@@ -28,6 +28,6 @@
 from . import weights
 from . import event
 
-__version__ = "1.1.8"
+__version__ = "1.2.0"
 version_split = __version__.split(".")
 __spec_version__ = (1000 * int(version_split[0])) + (10 * int(version_split[1])) + (1 * int(version_split[2]))
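The `__spec_version__` line packs the semantic version into a single integer; a quick standalone check of that arithmetic for the new version string (a sketch mirroring the formula, not the module itself):

```python
# Mirrors the __spec_version__ arithmetic: "major.minor.patch" -> one integer.
version = "1.2.0"
major, minor, patch = (int(part) for part in version.split("."))
spec_version = (1000 * major) + (10 * minor) + (1 * patch)
print(spec_version)  # 1020
# Note: this packing collides once minor >= 100 or patch >= 10
# (e.g. 1.2.10 -> 1030, the same integer as 1.3.0).
```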

openvalidators/config.py

Lines changed: 6 additions & 0 deletions

@@ -265,6 +265,12 @@ def add_args(cls, parser):
             help="Weight for the reciprocate reward model",
             default=DefaultRewardFrameworkConfig.reciprocate_model_weight,
         )
+        parser.add_argument(
+            "--reward.dpo_weight",
+            type=float,
+            help="Weight for the dpo reward model",
+            default=DefaultRewardFrameworkConfig.dpo_model_weight,
+        )
         parser.add_argument(
             "--reward.rlhf_weight",
             type=float,
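Because the new flag name contains a dot, plain argparse stores the value under an attribute literally named `reward.dpo_weight` (bittensor's config wrapper later nests such keys); a minimal sketch of how the flag parses, with an assumed default of 0.3:

```python
import argparse

parser = argparse.ArgumentParser()
# Dotted flag as added in config.py; the default here stands in for
# DefaultRewardFrameworkConfig.dpo_model_weight (0.3 in this release).
parser.add_argument("--reward.dpo_weight", type=float,
                    help="Weight for the dpo reward model", default=0.3)

args = parser.parse_args(["--reward.dpo_weight", "0.5"])
# argparse keeps the dot in the attribute name, so plain args.reward... fails;
# getattr with the literal string is needed.
print(getattr(args, "reward.dpo_weight"))  # 0.5
```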

openvalidators/event.py

Lines changed: 23 additions & 1 deletion

@@ -41,27 +41,49 @@ class EventSchema:
     nsfw_filter: Optional[List[float]]  # Output vector of the nsfw filter
     reciprocate_reward_model: Optional[List[float]]  # Output vector of the reciprocate reward model
     diversity_reward_model: Optional[List[float]]  # Output vector of the diversity reward model
+    dpo_reward_model: Optional[List[float]]  # Output vector of the dpo reward model
     rlhf_reward_model: Optional[List[float]]  # Output vector of the rlhf reward model
     prompt_reward_model: Optional[List[float]]  # Output vector of the prompt reward model
     relevance_filter: Optional[List[float]]  # Output vector of the relevance scoring reward model
     task_validator_filter: Optional[List[float]]
 
+    dahoas_reward_model_normalized: Optional[List[float]]  # Normalized output vector of the dahoas reward model
+    nsfw_filter_normalized: Optional[List[float]]  # Normalized output vector of the nsfw filter
+    reciprocate_reward_model_normalized: Optional[List[float]]  # Normalized output vector of the reciprocate reward model
+    diversity_reward_model_normalized: Optional[List[float]]  # Normalized output vector of the diversity reward model
+    dpo_reward_model_normalized: Optional[List[float]]  # Normalized output vector of the dpo reward model
+    rlhf_reward_model_normalized: Optional[List[float]]  # Normalized output vector of the rlhf reward model
+    prompt_reward_model_normalized: Optional[List[float]]  # Normalized output vector of the prompt reward model
+    relevance_filter_normalized: Optional[List[float]]  # Normalized output vector of the relevance scoring reward model
+    task_validator_filter_normalized: Optional[List[float]]
+
     # Weights data
     set_weights: Optional[List[List[float]]]
 
     @staticmethod
     def from_dict(event_dict: dict, disable_log_rewards: bool) -> 'EventSchema':
         """Converts a dictionary to an EventSchema object."""
         rewards = {
-            'dahoas_reward_model': event_dict.get(RewardModelType.dahoas.value),
             'blacklist_filter': event_dict.get(RewardModelType.blacklist.value),
+            'dahoas_reward_model': event_dict.get(RewardModelType.dahoas.value),
             'task_validator_filter': event_dict.get(RewardModelType.task_validator.value),
             'nsfw_filter': event_dict.get(RewardModelType.nsfw.value),
             'relevance_filter': event_dict.get(RewardModelType.relevance.value),
             'reciprocate_reward_model': event_dict.get(RewardModelType.reciprocate.value),
             'diversity_reward_model': event_dict.get(RewardModelType.diversity.value),
+            'dpo_reward_model': event_dict.get(RewardModelType.dpo.value),
             'rlhf_reward_model': event_dict.get(RewardModelType.rlhf.value),
             'prompt_reward_model': event_dict.get(RewardModelType.prompt.value),
+
+            'dahoas_reward_model_normalized': event_dict.get(RewardModelType.dahoas.value + '_normalized'),
+            'task_validator_filter_normalized': event_dict.get(RewardModelType.task_validator.value + '_normalized'),
+            'nsfw_filter_normalized': event_dict.get(RewardModelType.nsfw.value + '_normalized'),
+            'relevance_filter_normalized': event_dict.get(RewardModelType.relevance.value + '_normalized'),
+            'reciprocate_reward_model_normalized': event_dict.get(RewardModelType.reciprocate.value + '_normalized'),
+            'diversity_reward_model_normalized': event_dict.get(RewardModelType.diversity.value + '_normalized'),
+            'dpo_reward_model_normalized': event_dict.get(RewardModelType.dpo.value + '_normalized'),
+            'rlhf_reward_model_normalized': event_dict.get(RewardModelType.rlhf.value + '_normalized'),
+            'prompt_reward_model_normalized': event_dict.get(RewardModelType.prompt.value + '_normalized'),
         }
 
         # Logs warning that expected data was not set properly
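The normalized columns reuse each raw key with a `_normalized` suffix, and `dict.get` quietly yields `None` for any model that was disabled, matching the `Optional` fields. A toy sketch of that lookup pattern (toy keys and values, not the real schema):

```python
# Toy event dict: only the dpo model logged both raw and normalized vectors.
event_dict = {
    "dpo_reward_model": [0.1, 0.9],
    "dpo_reward_model_normalized": [0.0, 1.0],
}

def lookup(event_dict, base_key):
    # dict.get returns None when a model was disabled or not logged,
    # which maps directly onto the Optional[List[float]] schema fields.
    return event_dict.get(base_key), event_dict.get(base_key + "_normalized")

raw, norm = lookup(event_dict, "dpo_reward_model")
missing_raw, missing_norm = lookup(event_dict, "rlhf_reward_model")
```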

openvalidators/forward.py

Lines changed: 8 additions & 6 deletions

@@ -87,18 +87,20 @@ async def run_step(self, prompt: str, k: int, timeout: float, name: str, exclude
     # Compute the rewards for the responses given the prompt.
     rewards: torch.FloatTensor = torch.zeros(len(responses), dtype=torch.float32).to(self.device)
     for weight_i, reward_fn_i in zip(self.reward_weights, self.reward_functions):
-        reward_i = reward_fn_i.apply(prompt, responses, name).to(self.device)
-        rewards += weight_i * reward_i
+        reward_i, reward_i_normalized = reward_fn_i.apply(prompt, responses, name)
+        rewards += weight_i * reward_i_normalized.to(self.device)
         if not self.config.neuron.disable_log_rewards:
             event[reward_fn_i.name] = reward_i.tolist()
-            bt.logging.trace(str(reward_fn_i.name), reward_i.tolist())
+            event[reward_fn_i.name + '_normalized'] = reward_i_normalized.tolist()
+            bt.logging.trace(str(reward_fn_i.name), reward_i_normalized.tolist())
 
     for masking_fn_i in self.masking_functions:
-        mask_i = masking_fn_i.apply(base_prompt, responses, name).to(self.device)
-        rewards *= mask_i  # includes diversity
+        mask_i, mask_i_normalized = masking_fn_i.apply(base_prompt, responses, name)
+        rewards *= mask_i_normalized.to(self.device)  # includes diversity
         if not self.config.neuron.disable_log_rewards:
             event[masking_fn_i.name] = mask_i.tolist()
-            bt.logging.trace(str(masking_fn_i.name), mask_i.tolist())
+            event[masking_fn_i.name + '_normalized'] = mask_i_normalized.tolist()
+            bt.logging.trace(str(masking_fn_i.name), mask_i_normalized.tolist())
 
     # Train the gating model based on the predicted scores and the actual rewards.
     gating_scores: torch.FloatTensor = self.gating_model(prompt).to(self.device)
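The loop above first accumulates a weighted sum of each model's normalized rewards, then multiplies in each mask, so any response failing a filter ends up with zero reward. A toy-scale sketch of that aggregation (made-up weights and scores, not real model output):

```python
import torch

# Toy normalized scores from two reward models, for three responses each.
reward_outputs = [torch.tensor([0.2, 0.8, 0.5]), torch.tensor([0.6, 0.4, 0.9])]
weights = torch.tensor([0.3, 0.7])

rewards = torch.zeros(3)
for w, r in zip(weights, reward_outputs):
    rewards += w * r  # weighted sum of normalized rewards

# Binary masks (e.g. blacklist/nsfw filters) zero out failing responses.
mask = torch.tensor([1.0, 0.0, 1.0])
rewards *= mask
```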

openvalidators/neuron.py

Lines changed: 5 additions & 0 deletions

@@ -36,6 +36,7 @@
     Blacklist,
     TaskValidator,
     NSFWRewardModel,
+    DirectPreferenceRewardModel,
     OpenAssistantRewardModel,
     ReciprocateRewardModel,
     RelevanceRewardModel,

@@ -174,6 +175,7 @@ def __init__(self):
         else:
             self.reward_weights = torch.tensor(
                 [
+                    self.config.reward.dpo_weight,
                     self.config.reward.rlhf_weight,
                     self.config.reward.reciprocate_weight,
                     self.config.reward.dahoas_weight,

@@ -192,6 +194,9 @@ def __init__(self):
             raise Exception(message)
 
         self.reward_functions = [
+            DirectPreferenceRewardModel(device=self.device)
+            if self.config.reward.dpo_weight > 0
+            else MockRewardModel(RewardModelType.dpo.value),
             OpenAssistantRewardModel(device=self.device)
             if self.config.reward.rlhf_weight > 0
             else MockRewardModel(RewardModelType.rlhf.value),
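The conditional expressions above keep `reward_functions` index-aligned with `reward_weights`: a zero-weight model is replaced by a cheap mock instead of being dropped, so the two lists still zip together in `run_step`. A toy sketch of the pattern (class and key names here are illustrative stand-ins, not the real models):

```python
# Weight-gating pattern: zero-weight entries get a cheap stub so the model
# list stays index-aligned with the weight list.
class MockModel:
    def __init__(self, name):
        self.name = name
    def apply(self, *args):
        return 0.0  # stub: contributes nothing

class ExpensiveModel:  # stand-in for a large reward model
    def __init__(self, name):
        self.name = name

def build(weights):
    # Only instantiate the heavy model when its weight is actually used.
    return [ExpensiveModel(n) if w > 0 else MockModel(n)
            for n, w in weights.items()]

models = build({"dpo": 0.3, "dahoas": 0.0})
```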

openvalidators/reward/__init__.py

Lines changed: 1 addition & 0 deletions

@@ -1,6 +1,7 @@
 from .blacklist import Blacklist
 from .task_validator import TaskValidator
 from .nsfw import NSFWRewardModel
+from .dpo import DirectPreferenceRewardModel
 from .open_assistant import OpenAssistantRewardModel
 from .reciprocate import ReciprocateRewardModel
 from .relevance import RelevanceRewardModel

openvalidators/reward/config.py

Lines changed: 4 additions & 2 deletions

@@ -18,6 +18,7 @@
 
 
 class RewardModelType(Enum):
+    dpo = 'dpo_reward_model'
     rlhf = 'rlhf_reward_model'
     reciprocate = 'reciprocate_reward_model'
     dahoas = 'dahoas_reward_model'

@@ -34,7 +35,8 @@ class DefaultRewardFrameworkConfig:
     """Reward framework default configuration.
     Note: All the weights should add up to 1.0.
     """
-    rlhf_model_weight: float = 0.6
-    reciprocate_model_weight: float = 0.4
+    dpo_model_weight: float = 0.3
+    rlhf_model_weight: float = 0.4
+    reciprocate_model_weight: float = 0.3
     dahoas_model_weight: float = 0
     prompt_model_weight: float = 0
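The docstring above requires the weights to sum to 1.0, and the reweighted defaults still do: the old 0.6/0.4 split becomes 0.3/0.4/0.3 once DPO joins. A quick check (plain floats, mirroring the validation the neuron performs on startup):

```python
# New default reward-framework weights from this release.
defaults = {
    "dpo_model_weight": 0.3,
    "rlhf_model_weight": 0.4,
    "reciprocate_model_weight": 0.3,
    "dahoas_model_weight": 0.0,
    "prompt_model_weight": 0.0,
}
total = sum(defaults.values())
# Use a tolerance: float addition of 0.3 + 0.4 + 0.3 need not be exactly 1.0.
assert abs(total - 1.0) < 1e-9, "reward weights must sum to 1.0"
```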

openvalidators/reward/dpo.py

Lines changed: 117 additions & 0 deletions (new file)

# The MIT License (MIT)
# Copyright © 2021 Yuma Rao

# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the “Software”), to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software,
# and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

# The above copyright notice and this permission notice shall be included in all copies or substantial portions of
# the Software.

# THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO
# THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
# OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.

import torch
import bittensor as bt
from typing import List
from .config import RewardModelType
from .reward import BaseRewardModel
from transformers import AutoTokenizer, AutoModelForCausalLM


class DirectPreferenceRewardModel(BaseRewardModel):

    reward_model_name: str = "cerebras/btlm-3b-8k-base"

    @property
    def name(self) -> str:
        return RewardModelType.dpo.value

    def __init__(self, device: str):
        super().__init__()
        self.device = device
        self.penalty = 1.2  # Same penalty as the original [paper](https://arxiv.org/pdf/1909.05858.pdf).
        self.tokenizer = AutoTokenizer.from_pretrained(DirectPreferenceRewardModel.reward_model_name)
        self.model = AutoModelForCausalLM.from_pretrained(DirectPreferenceRewardModel.reward_model_name,
                                                          trust_remote_code=True,
                                                          torch_dtype=torch.float16).to(self.device)

    def reward_single(self, prompt: str, completion: str, name: str, with_penalty=True) -> float:
        r"""Calculates a direct preference optimization (DPO) style reward for a completion,
        which is a reference model's average log-probability for completion tokens given a prompt.
        Uses guidance from https://github.com/eric-mitchell/direct-preference-optimization/blob/main/trainers.py.
        """
        with torch.no_grad():

            # Check if completion is empty or too short; if so, assign the lowest reward.
            if completion.strip() == '' or len(completion) <= 5:
                return -11.  # exp(-11)=1.67e-5 < 2e-5=1/50257 (typical vocab size)

            # Tokenize the combined prompt + completion.
            combined = self.tokenizer(prompt + completion, return_tensors="pt").input_ids[0].to(self.device)  # [seq_len]
            # Tokenize only the prompt, to help determine prompt token length.
            prompt_part = self.tokenizer(prompt, return_tensors="pt").input_ids[0].to(self.device)  # [prompt_len]

            # Completion doesn't fit into model sequence, so return lowest reward.
            if self.tokenizer.model_max_length <= len(prompt_part):
                return -11.  # exp(-11)=1.67e-5 < 2e-5=1/50257 (typical vocab size)

            # Truncate combined to fit into model max sequence length.
            if self.tokenizer.model_max_length < len(combined):
                combined = combined[:self.tokenizer.model_max_length]

            labels = combined.clone()  # [seq_len]
            # Ignore prompt part for calculating reward.
            labels[:len(prompt_part)] = -100
            # Label only each next token prediction ground-truth.
            labels = labels[1:]  # [seq_len-1]
            loss_mask = (labels != -100)  # [seq_len-1]
            # Dummy token to allow for indexing, but loss will be ignored.
            labels[labels == -100] = 0
            # Reshape for gather operation.
            labels = labels.unsqueeze(0).unsqueeze(2)  # [batch_size=1, seq_len-1, 1]

            # Forward pass to calculate logit predictions for each sequence position.
            logits = self.model(combined.unsqueeze(0)).logits  # [batch_size=1, seq_len, vocab_len]
            # Predict only where labels are available.
            logits = logits[:, :-1, :]  # [batch_size=1, seq_len-1, vocab_len]

            if with_penalty:
                # Apply penalty for repeated generation.
                for i in range(len(prompt_part) + 1, len(combined) - 1):
                    logit = logits[:, i, :].clone()
                    inputs = combined[len(prompt_part):i].clone()
                    logits[:, i, :] = self.logit_penalty(input_ids=inputs, logit=logit)

            # Rescale via log(softmax(logits)).
            logits = logits.log_softmax(-1)
            # Calculate the model's log-probability for each actual completion token.
            per_token_logps = torch.gather(logits, dim=2, index=labels).squeeze(2)  # [batch_size=1, seq_len-1]
            # Average log-probability over completion sequence.
            reward = (per_token_logps * loss_mask).sum(-1) / loss_mask.sum(-1)  # [batch_size=1]
            reward = reward[0].cpu().detach()

            # NaNs can possibly arise through log(0)=-inf; replace with a suitably small reward.
            if torch.isnan(reward) or torch.isinf(reward):
                return -11.  # exp(-11)=1.67e-5 < 2e-5=1/50257 (typical vocab size)
            return reward.item()

    def get_rewards(self, prompt: str, completions: List[str], name: str) -> torch.FloatTensor:
        rewards = torch.tensor([self.reward_single(prompt, completion, name) for completion in completions],
                               dtype=torch.float32).to(self.device)
        bt.logging.trace(f"DirectPreferenceRewardModel | rewards: {rewards.tolist()}")
        return rewards

    def logit_penalty(self, input_ids: torch.LongTensor, logit: torch.FloatTensor) -> torch.FloatTensor:
        # Counts the unique tokens within each generation.
        uniques, counts = input_ids.unique(return_counts=True)
        score = torch.gather(logit, 1, uniques.unsqueeze(0))

        # If score < 0, multiply by the repetition penalty to reduce the previous
        # token's probability; otherwise divide by it.
        score = torch.where(score < 0, score * (self.penalty**counts), score / (self.penalty**counts))

        logit.scatter_(1, uniques.unsqueeze(0), score.to(logit.dtype))
        return logit
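The core of `reward_single` is masking out the prompt tokens, gathering each completion token's log-probability, and averaging. A toy-scale sketch of exactly that step, with random logits and a tiny vocabulary standing in for the real model (seed, sizes, and `clamp` in place of the `-100` reassignment are illustrative choices):

```python
import torch

torch.manual_seed(0)
vocab, seq_len, prompt_len = 8, 6, 3
combined = torch.randint(vocab, (seq_len,))   # prompt + completion token ids
logits = torch.randn(1, seq_len, vocab)       # stand-in for model output

labels = combined.clone()
labels[:prompt_len] = -100                    # ignore prompt tokens
labels = labels[1:]                           # next-token targets
loss_mask = labels != -100
labels = labels.clamp(min=0)                  # dummy ids where masked
logits = logits[:, :-1, :].log_softmax(-1)    # predictions for each target

per_token_logps = torch.gather(
    logits, dim=2, index=labels.view(1, -1, 1)).squeeze(2)
# Average log-probability over completion tokens only.
reward = (per_token_logps * loss_mask).sum(-1) / loss_mask.sum(-1)
```
Because `log_softmax` outputs are strictly negative, the reward is always negative here, which is why the model uses -11 as a floor for degenerate completions.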
