Max audio length PC vs Mac #1234
Maybe, since the M2 chip offers strong performance and has an integrated, soldered GPU that can perform like an NVIDIA RTX 3070. Could you share the GPU and CPU of your PC? Thanks.
RTX 4090 + 11th-gen i6. Just to clarify, this is about memory allocation.
Check.
@Purfview Seems both machines have av 14.1.0.
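As an aside, a quick way to dump the versions in question on both machines. A minimal sketch using importlib.metadata (ships with Python 3.8+); the package list is just the ones discussed in this thread:

import importlib.metadata as md

# Print the versions most likely to differ between the two machines.
for pkg in ("faster-whisper", "numpy", "av"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")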
Post the full error trace.
Traceback (most recent call last):
Check.
@Purfview thank you for your suggestions. I already checked that and noticed the numpy on the Mac was older, 1.26.4.
File "C:\Users...\miniconda3\envs\speak\Lib\site-packages\faster_whisper\feature_extractor.py", line 189, in stft
Same here. I think it's a regression introduced somewhere between faster-whisper==1.0.3 and 1.1.1, because these stack traces are new to me since the upgrade. numpy==2.0.2
The system has >10 GB RAM available, so there should be enough to allocate that memory. Maybe the numpy data type changed? "complex128" seems resource-hungry to me: it uses 16 bytes per element, twice that of complex64.
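To put rough numbers on that, assuming Whisper's usual feature-extraction parameters (16 kHz audio, hop_length=160, n_fft=400; these are assumptions, check feature_extractor.py for the actual values), an 11-hour file yields about 4 million STFT frames, and the dtype alone decides whether the array fits in 10 GB:

# Back-of-the-envelope STFT memory estimate; parameters are assumed defaults.
hours = 11
sample_rate, hop_length, n_fft = 16000, 160, 400
frames = hours * 3600 * sample_rate // hop_length  # ~3.96 million frames
bins = n_fft // 2 + 1                              # 201 frequency bins
for dtype, nbytes in (("complex64", 8), ("complex128", 16)):
    print(dtype, round(frames * bins * nbytes / 2**30, 1), "GiB")
# complex64: ~5.9 GiB; complex128: ~11.9 GiB -- just past the reported >10 GB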
@saddy001 Something changed, but it might not be numpy. On my machine, transcribing a file that has more than 11 hours of content:
Faster-whisper version:
Tested with:
Tested more combinations, but I think these are the most relevant ones.
Observation:
At the moment I suspect it's the transition from soundfile==0.12.1 to 0.13.0, as this brings in a new numpy dependency version.
For the last working faster-whisper version, 1.0.3, was it working with both the old and the new numpy and av versions?
Not 100% sure exactly what combinations I tried, but numpy 2.2.2 seems to work.
Name: faster-whisper
Name: av
Name: numpy
Though the results I am getting are quite poor compared to f-w 1.1.1.
Feature extraction has changed since that old version; it's probably tuned more for performance than for memory efficiency, so it's expected to consume more RAM than the old version did. As the issue is about "PC vs Mac", I would focus on that, or on optimizing feature_extractor.
Can we help with that?
To save you some tokens and time, I have run this through an LLM (I know chances are good that this is complete garbage, but maybe we can get some interesting ideas). Here is the output:
import numpy as np

def __call__(self, audio, chunk_length=None):
    if chunk_length is not None:
        # Process the audio in fixed-size pieces to bound peak memory usage.
        max_memory_frames = 100000  # samples per piece; adjust based on available memory
        total_samples = chunk_length * self.sample_rate
        num_chunks = (total_samples + max_memory_frames - 1) // max_memory_frames
        processed_audio = []
        for i in range(num_chunks):
            start = i * max_memory_frames
            end = min(start + max_memory_frames, total_samples)
            chunk = audio[..., start:end]
            # Zero-pad the time axis of the last chunk only, so every
            # piece has the same length.
            if i == num_chunks - 1 and (end - start) < max_memory_frames:
                pad_width = [(0, 0)] * (chunk.ndim - 1) + [(0, max_memory_frames - (end - start))]
                chunk = np.pad(chunk, pad_width)
            processed_audio.append(self.process_chunk(chunk))
        # Note: chunks do not overlap, so frames straddling a chunk boundary
        # are lost and seams may introduce artifacts.
        return np.concatenate(processed_audio, axis=-1)
    else:
        # Original processing without chunking.
        pad = (self.n_fft - self.hop_length) // 2
        audio_padded = np.pad(audio, [(0, 0), (pad, pad)], mode="reflect")
        D = self.stft(audio_padded)
        return self.griffin_lim(D)

def process_chunk(self, chunk):
    # Process each chunk with the same padding, STFT, and Griffin-Lim as above.
    pad = (self.n_fft - self.hop_length) // 2
    chunk_padded = np.pad(chunk, [(0, 0), (pad, pad)], mode="reflect")
    D = self.stft(chunk_padded)
    return self.griffin_lim(D)
I'm a C guy and Python is not my strong suit, but something still feels off. I understand your reasoning on smaller chunks, but we can do an 11-hour file using only a few GB of memory, yet with 12 hours it grows by a factor of 40?
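For what it's worth, the spike can be bounded regardless of file length by batching the FFT over frames instead of slicing the raw audio. A minimal sketch, not faster-whisper's actual code, assuming the usual Whisper parameters (n_fft=400, hop_length=160) and NumPy >= 1.20 for sliding_window_view:

import numpy as np

def stft_batched(audio, n_fft=400, hop_length=160, batch_frames=65536):
    # Hypothetical helper: STFT computed batch-by-batch and stored as
    # complex64, so peak memory stays bounded for arbitrarily long audio.
    audio = np.asarray(audio, dtype=np.float32)
    window = np.hanning(n_fft).astype(np.float32)
    n_frames = 1 + (len(audio) - n_fft) // hop_length
    out = np.empty((n_frames, n_fft // 2 + 1), dtype=np.complex64)
    # Zero-copy view of all frames; slicing it below stays a view.
    frames = np.lib.stride_tricks.sliding_window_view(audio, n_fft)[::hop_length]
    for start in range(0, n_frames, batch_frames):
        batch = frames[start:start + batch_frames] * window
        # np.fft.rfft upcasts float32 to complex128 on NumPy < 2.0;
        # the assignment into `out` casts back down to complex64.
        out[start:start + batch_frames] = np.fft.rfft(batch, axis=-1)
    return out

Because the frames are views into the one full signal and only the FFT is batched, there are no seam artifacts between batches, unlike the per-chunk padding approach above.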
On an M2 with 64 GB I am able to transcribe a 16-hour audio file. Not fast, but it works.
On a PC with 128 GB RAM and a 24 GB GPU I can transcribe 11 hours at most; anything longer than that throws an allocation error.
It doesn't matter whether I use the CPU or the GPU.
I have no insight into the code, just wondering how come there is a difference?
I would preferably run these tasks on the PC.