
Numerical discrepancy in reduced precision operations causes WER degradation for custom models #1845

JacobAndersson opened this issue Jan 14, 2025 · 6 comments


@JacobAndersson

I first posted this issue on faster-whisper, but they suggested I post it here since it is likely an issue with CTranslate2.

Description

When using custom fine-tuned models, faster-whisper's implementation shows significant WER degradation compared to OpenAI's reference implementation (13.5% vs 8.2% WER). Through investigation, I've traced this to numerical differences starting from the very first Conv1D operation in the encoder; the discrepancy is also present with the original large-v3 weights. In both cases the weights are stored in float16, suggesting a numerical or algorithmic issue.

Here are some benchmark results on a custom dataset.

Implementation    Model       WER
openai            large-v3    0.124165
faster-whisper    large-v3    0.0992092
openai            custom      0.0819845
faster-whisper    custom      0.135567

After comparing the logits of the two implementations and narrowing down the root cause, I found that the difference appears as early as the first conv1d operation in the encoder.

Steps to reproduce

Modify the WhisperEncoder operator function so that it only applies the first conv1d operation, i.e.:

void WhisperEncoder::operator()(const StorageView& features, StorageView& output) {
      PROFILE("WhisperEncoder");

      if (features.rank() != 3)
        throw std::invalid_argument("Expected input features to have 3 dimensions, but got "
                                    + std::to_string(features.rank())
                                    + " dimension(s) instead");

      if (features.dim(1) != input_size() || features.dim(2) > max_input_time())
        throw std::invalid_argument("Invalid input features shape: expected an input with shape ("
                                    + std::to_string(features.dim(0))
                                    + ", "
                                    + std::to_string(input_size())
                                    + ", "
                                    + std::to_string(std::min(features.dim(2), max_input_time()))
                                    + "), but got an input with shape ("
                                    + std::to_string(features.dim(0))
                                    + ", "
                                    + std::to_string(features.dim(1))
                                    + ", "
                                    + std::to_string(features.dim(2))
                                    + ") instead");

      // intermediate buffer from the original implementation; unused once the later ops are stripped
      StorageView input(output_type(), features.device());

      // apply only the first convolution and write its result directly to the output
      _conv1(features, output);
    }

Then running the following two scripts shows the difference between the implementations.

faster-whisper:

from faster_whisper import WhisperModel
import numpy as np
import ctranslate2

model = WhisperModel('large-v3', device="cuda", cpu_threads=32, num_workers=1, compute_type="bfloat16")

audio = np.ones((1, 128, 3000), dtype=np.float32)
enc = model.encode(audio)
enc_numpy = np.array(enc.to_device(ctranslate2.Device.cpu).to(ctranslate2.DataType.float32))

for i in range(5):
    print(enc_numpy[0, 0, -5 + i])

The PyTorch equivalent is:

import numpy as np
import torch
import torch.nn.functional as F
import torch.nn as nn

audio = torch.ones((1, 128, 3000), dtype=torch.bfloat16, device='cuda:0', requires_grad=False)

weights = torch.load('./large-v3.pt')
weight = weights['model_state_dict']['encoder.conv1.weight'].to(torch.bfloat16).to('cuda:0')
bias = weights['model_state_dict']['encoder.conv1.bias'].to(torch.bfloat16).to('cuda:0')

out = F.conv1d(audio, weight, bias, padding=1)

enc_numpy_2 = out.detach().cpu().to(torch.float32).numpy()

for i in range(5):
    print(enc_numpy_2[0, 0, -5 + i])

Running these two scripts gives the following outputs:

faster-whisper:

0.029418945
0.029418945
0.029418945
0.029418945
-0.003112793

PyTorch:

0.029296875
0.029296875
0.029296875
0.029296875
-0.0029296875
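
To compare more than the last five values, here is a small sketch that quantifies the discrepancy over the full tensors; it assumes both scripts above are extended to save their outputs with np.save (the filenames below are placeholders):

import numpy as np

# placeholder filenames; save the arrays from the two scripts above with np.save first
a = np.load('enc_numpy_ct2.npy').astype(np.float32)    # faster-whisper / CTranslate2 output
b = np.load('enc_numpy_torch.npy').astype(np.float32)  # PyTorch output

# make sure both arrays use the same layout; transpose one of them if the
# channel/time axes are swapped between the two implementations
assert a.shape == b.shape, (a.shape, b.shape)

diff = a - b
print('max abs diff:', np.abs(diff).max())
print('RMSD:', np.sqrt(np.mean(diff ** 2)))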

Environment

python 3.10.10
CUDA Version: 12.3
All running inside a docker container: nvcr.io/nvidia/pytorch:23.10-py3
GPU: 3090

Additional findings

Precision behavior:

  • float32: Values match, but performance still degrades when I run the full benchmark
  • bfloat16: Values differ
  • float16: Values differ

Input dependency (bfloat16):

  • Zero inputs: Match perfectly
  • Unit inputs (1.0): Small difference
  • Larger inputs (2.0): Larger differences (see the sketch after this list)
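
A rough sketch of that trend, using random stand-in weights with the same shapes as encoder.conv1 (not the Whisper checkpoint weights): the absolute gap between a bfloat16 conv1d and a float32 reference grows roughly with the input scale, since bfloat16 rounding error is relative to the magnitude of the values.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
device = 'cuda:0'

# random stand-in weights with the same shapes as Whisper's encoder.conv1 (not the real weights)
weight = torch.randn(1280, 128, 3, device=device)
bias = torch.randn(1280, device=device)

for level in (1.0, 2.0, 4.0):
    x = torch.full((1, 128, 3000), level, device=device)
    ref = F.conv1d(x, weight, bias, padding=1)  # float32 reference
    low = F.conv1d(x.bfloat16(), weight.bfloat16(), bias.bfloat16(), padding=1).float()
    print(level, (ref - low).abs().max().item())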
@vakkov
Contributor

vakkov commented Jan 15, 2025

I am also fighting accuracy problems after conversion and have been doing some experiments as well, but I get exactly the same numbers for the convolutions (I modified my CTranslate2 Whisper implementation as well and am adding back specific operations). I believe you should explicitly specify the parameters for the convolutions in your torch experiments:


import numpy as np
import torch
import torch.nn.functional as F
import torch.nn as nn

audio = torch.ones((1, 128, 3000), dtype=torch.bfloat16, device='cuda:0', requires_grad=False)

weights = torch.load('/home/gpu/.cache/whisper/large-v3.pt')
weight1 = weights['model_state_dict']['encoder.conv1.weight'].to(torch.bfloat16).to('cuda:0')
bias1   = weights['model_state_dict']['encoder.conv1.bias'].to(torch.bfloat16).to('cuda:0')

weight2 = weights['model_state_dict']['encoder.conv2.weight'].to(torch.bfloat16).to('cuda:0')
bias2 = weights['model_state_dict']['encoder.conv2.bias'].to(torch.bfloat16).to('cuda:0')

# out = F.conv1d(audio, weight, bias, padding=1)
# out = F.gelu(out)
# out = F.conv1d(out, weight2, bias2, padding=1)


conv1 = nn.Conv1d(
    in_channels=128,
    out_channels=1280,
    kernel_size=3,
    stride=1,
    padding=1,
    bias=True
).to(audio.device, dtype=torch.bfloat16)


conv2 = nn.Conv1d(
    in_channels=1280,
    out_channels=1280,
    kernel_size=3,
    stride=2,
    padding=1,
    bias=True
).to(audio.device, dtype=torch.bfloat16)

# Copy the weights and biases into the conv layers
with torch.no_grad():
    conv1.weight.copy_(weight1)
    conv1.bias.copy_(bias1)
    conv2.weight.copy_(weight2)
    conv2.bias.copy_(bias2)

# Forward pass through conv1 -> GELU -> conv2 -> GELU
x = conv1(audio)
x = F.gelu(x)


y = x.detach().cpu().to(torch.float32).numpy()
for i in range(5):
    print(y[0, 0, -5 + i])

print("Output shape after conv1:", x.shape)

x = conv2(x)
x = F.gelu(x)

print("Output shape after conv1+conv2:", x.shape)

enc_numpy_2 = x.detach().cpu().to(torch.float32).numpy()

#enc_numpy_2 = np.transpose(enc_numpy_2, (0, 2, 1))

print("Torch: ", enc_numpy_2.shape)


for i in range(5):
    print(enc_numpy_2[0, 0, -5 + i])

#save enc_numpy_2 to a file
np.save('enc_numpy_torch.npy', enc_numpy_2)

@sssshhhhhh

This is just how BLAS libraries are: they take shortcuts and treat floating-point operations as associative. I ran your torch equivalent on my CPU and actually got the same 5 numbers as your ct2 result. Comparing CUDA and CPU with torch for fp32/fp16/bf16, they were all different (fp32 had an RMSD of ~1e-7).
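
The associativity point is easy to see in isolation; a tiny sketch where the same bfloat16 values summed in two different orders, as a tiled kernel would do, give slightly different results:

import torch

torch.manual_seed(0)
x = torch.randn(4096).bfloat16()

# strict left-to-right accumulation in bfloat16
acc = torch.zeros((), dtype=torch.bfloat16)
for v in x:
    acc = acc + v

# block-wise accumulation, the kind of reordering a tiled kernel performs
blocked = x.view(64, 64).sum(dim=1).sum(dim=0)

print(float(acc), float(blocked))  # usually differ slightly, even though the inputs are identical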

I've been fine-tuning my own Whisper models and have never had problems with conversion (that weren't PEBKAC).

@JacobAndersson
Author

@sssshhhhhh yes, it might just be BLAS shortcuts, and I would expect some instability, but 1e-4 feels like too large a difference.

I have compared the weights that are loaded, and from what I can see they are the same. I convert my fine-tuned model from pt format with the following steps:

  1. Use the pt-to-HF conversion script from Hugging Face to get a custom HF model.
  2. Convert the Hugging Face model to CTranslate2 with the packaged converter, without quantization:
    ct2-transformers-converter --model ./custom-hf --output_dir ./custom-ct2 --copy_files tokenizer.json preprocessor_config.json

It still might be PEBKAC; can you see anything incorrect with the above steps?
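
In case it helps, a minimal sketch of the kind of weight comparison described above (the paths are placeholders; attribute names follow the transformers Whisper implementation and the checkpoint layout from the scripts earlier in this thread):

import torch
from transformers import WhisperForConditionalGeneration

# placeholder paths for the original OpenAI-format checkpoint and the converted HF model
pt_state = torch.load('./custom.pt', map_location='cpu')['model_state_dict']
hf_model = WhisperForConditionalGeneration.from_pretrained('./custom-hf')

# compare the layer under discussion: the first encoder convolution
pt_w = pt_state['encoder.conv1.weight']
hf_w = hf_model.model.encoder.conv1.weight.detach()
print(torch.equal(pt_w.float(), hf_w.float()))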

@sssshhhhhh

I don't think it's that bad considering bf16 only has 7 bits of mantissa. In the range [0.015625, 0.03125), 1e-4 is the spacing between representable values. Even within torch with bf16 I get an RMSD of 1e-3.
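
Concretely, the two values printed earlier in this thread (0.029296875 from PyTorch, 0.029418945 from CTranslate2) are exactly one bfloat16 step apart in that range; a quick check:

import torch

lo = torch.tensor(0.029296875, dtype=torch.bfloat16)  # value printed by the PyTorch script
hi = torch.tensor(0.029418945, dtype=torch.bfloat16)  # value printed by faster-whisper
step = 2.0 ** -13                                      # bfloat16 spacing in [2**-6, 2**-5)
print(float(hi) - float(lo), step)                     # both are 0.0001220703125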

Your steps look right, and I'm not saying your issue is PEBKAC, since there are lots of ways to train which might hit some edge case. But I do think focusing on these numerical deviations is barking up the wrong tree.

@JacobAndersson
Author

Yes, for a single layer this might be fine. But the error grows as we go through the encoder, so the final encoded features are quite far off.

The start of this investigation was that for some files the correct token would not even be in the top 5 (while being the most activated in the OpenAI implementation), and I worked backwards and found this to be the earliest diff. I can create an example of this too if you want. But it might just be that our fine-tuned model is very sensitive to the precision used.

@sssshhhhhh

An example where tokens are completely different might help. I tested a random finetune (jlvdoorn/whisper-large-v3-atco2-asr) and had no problems with it either.

WER (fp16)  atco2                  large-v3
openai      0.11663807890222985    0.09148084619782733
hf          0.11377930245854774    0.09090909090909091
ct2         0.11663807890222985    0.09090909090909091

I compared the fp16 encoder output of a random sample. RMSD between oai and ct2 was 5e-3 but the decoded output was identical.
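
For anyone who wants to run the same kind of side-by-side check, a rough sketch comparing decoded text from openai-whisper and faster-whisper on one file (the audio path is a placeholder):

import whisper
from faster_whisper import WhisperModel

audio_path = 'sample.wav'  # placeholder audio file

# reference implementation in fp16
oai = whisper.load_model('large-v3')
oai_text = oai.transcribe(audio_path, fp16=True)['text']

# faster-whisper / CTranslate2 in fp16
fw = WhisperModel('large-v3', device='cuda', compute_type='float16')
segments, _ = fw.transcribe(audio_path)
fw_text = ''.join(s.text for s in segments)

print(oai_text)
print(fw_text)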
