
Conversation

hhaAndroid
Collaborator

The loss calculation will convert logits to float32, so for alignment we also need to convert them to float32 here; otherwise the ratio will not equal 1 during on-policy RL training.

logits = F.linear(hidden_states, w, b)
return None, logits
# Note: the loss calculation will convert logits to float32, so for alignment,
# we also need to convert it to float32 here to prevent the ratio from being 1 during rl training
Contributor

Suggested change
- # we also need to convert it to float32 here to prevent the ratio from being 1 during rl training
+ # we also need to convert it to float32 to prevent the ratio from not being equal to 1 during on-policy rl training
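
For illustration, a minimal sketch (not the PR's actual code) of the cast under discussion; hidden_states, w, and b follow the quoted snippet, while head_forward and the .float() call are assumptions:

import torch
import torch.nn.functional as F

def head_forward(hidden_states, w, b=None):
    # Project hidden states with the (bf16) lm-head weights, as in the quoted snippet.
    logits = F.linear(hidden_states, w, b)
    # Assumed alignment cast: upcast so the logprobs behind the importance ratio
    # match the float32 logprobs used in the loss; without it the on-policy ratio
    # exp(new_logprob - old_logprob) can drift away from 1.
    return None, logits.float()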

@pppppM
Collaborator

pppppM commented Sep 18, 2025

Because chunked forward is not used when computing old_logprobs, this may cause OOM (out of memory).
From the inference perspective, the returned logits need to stay in bf16.
It is best to convert to float32 near where the logprobs are computed.
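
A hedged sketch of this proposal, using hypothetical names (chunked_logprobs, lm_head_weight, chunk_size): the forward keeps logits in bf16, and the float32 upcast happens per chunk right next to the logprob computation, so full-vocabulary float32 logits are never materialized at once:

import torch
import torch.nn.functional as F

def chunked_logprobs(hidden_states, lm_head_weight, labels, chunk_size=1024):
    # hidden_states: (B, T, H) bf16, lm_head_weight: (V, H) bf16, labels: (B, T) int64.
    flat_h = hidden_states.flatten(0, 1)   # (B*T, H)
    flat_y = labels.flatten()              # (B*T,)
    out = []
    for start in range(0, flat_h.size(0), chunk_size):
        h = flat_h[start:start + chunk_size]
        y = flat_y[start:start + chunk_size]
        logits = F.linear(h, lm_head_weight)                   # stays bf16, (chunk, V)
        logprobs = torch.log_softmax(logits.float(), dim=-1)   # upcast only here
        out.append(logprobs.gather(-1, y.unsqueeze(-1)).squeeze(-1))
    return torch.cat(out).view_as(labels)

This way inference-side callers still see bf16 logits from the model forward, while old_logprobs and new logprobs are both computed in float32, keeping the on-policy ratio at 1.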
