Max audio length PC vs Mac #1234

Open · Sharplinger opened this issue Jan 29, 2025 · 17 comments

@Sharplinger

On an M2 with 64 GB I am able to transcribe a 16-hour audio file; it's not fast, but it works.

On a PC with 128 GB of RAM and a 24 GB GPU I can transcribe at most 11 hours; anything longer throws an allocation error. It doesn't matter whether I use the CPU or the GPU.

I have no insight into the code, I'm just wondering why there is a difference. I would prefer to run these tasks on the PC.

@JenuelDev commented Jan 30, 2025

Maybe it's because the M2 chip offers strong performance and has an integrated, soldered-on GPU that can perform roughly like an NVIDIA 3070.

Can you share the GPU and CPU of your PC? Thanks.

@Sharplinger (Author)

RTX 4090 + 11th gen i6.

Just to clarify, this is about memory allocation.
A 2.5 GB WAV file (16-bit, mono, 22050 Hz) contains about 16 hours of audio.
On a Mac M2 Ultra with 64 GB of memory I can transcribe that file.
On the PC I get an allocation error ("could not allocate 17 GB for..."), even if I skip the GPU and use the CPU with its 128 GB of memory.
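
(A quick sanity check of those numbers, assuming a plain PCM WAV with a negligible header:)

bytes_per_hour = 22050 * 2 * 3600        # 16-bit mono at 22050 Hz -> ~158.8 MB per hour
print(2.5 * 1024**3 / bytes_per_hour)    # -> ~16.9, so a 2.5 GB file holds roughly 16 hours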

@Purfview (Contributor)

Check `pip show av` on both computers.

@Sharplinger (Author)

@Purfview Both machines seem to have av 14.1.0.

@Purfview (Contributor)

Post the full error trace.

@Sharplinger (Author)

@Purfview

Traceback (most recent call last):
  File "C:\AI\whisperx\transcribeonly.py", line 39, in <module>
    segments, info = model.transcribe(audio, word_timestamps=True)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users...\miniconda3\envs\speak\Lib\site-packages\faster_whisper\transcribe.py", line 874, in transcribe
    features = self.feature_extractor(audio, chunk_length=chunk_length)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users...\miniconda3\envs\speak\Lib\site-packages\faster_whisper\feature_extractor.py", line 215, in __call__
    stft = self.stft(
           ^^^^^^^^^^
  File "C:\Users...\miniconda3\envs\speak\Lib\site-packages\faster_whisper\feature_extractor.py", line 189, in stft
    output = np.fft.rfft(input_array, n=n_fft, axis=-1, norm=norm)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users...\miniconda3\envs\speak\Lib\site-packages\numpy\fft\_pocketfft.py", line 411, in rfft
    output = _raw_fft(a, n, axis, True, True, norm, out=out)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users...\miniconda3\envs\speak\Lib\site-packages\numpy\fft\_pocketfft.py", line 94, in _raw_fft
    return ufunc(a, fct, axes=[(axis,), (), (axis,)], out=out)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
numpy._core._exceptions._ArrayMemoryError: Unable to allocate 17.1 GiB for an array with shape (1, 5698179, 201) and data type complex128
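
(The shape in that error is informative: with Whisper's standard feature settings — 16 kHz audio, n_fft=400, hop_length=160, hence 400 // 2 + 1 = 201 frequency bins — this array is the complex STFT of the entire file held in memory at once. A quick check of the numbers:)

print(5698179 * 201 * 16 / 2**30)    # complex128 is 16 bytes/element -> ~17.07 GiB, as reported
print(5698179 * 160 / 16000 / 3600)  # frames * hop / sample rate -> ~15.8 hours of audio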

@Purfview (Contributor)

Check `pip show numpy`.

@Sharplinger (Author)

@Purfview thank you for your suggestions.

I already checked that and noticed that numpy on the Mac was older, 1.26.4.
I tried downgrading it on the PC; the error message is less informative and it throws on another line, but I would guess the underlying problem is the same.

File "C:\Users...\miniconda3\envs\speak\Lib\site-packages\faster_whisper\feature_extractor.py", line 189, in stft
output = np.fft.rfft(input_array, n=n_fft, axis=-1, norm=norm)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users...\miniconda3\envs\speak\Lib\site-packages\numpy\fft_pocketfft.py", line 409, in rfft
output = _raw_fft(a, n, axis, True, True, inv_norm)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users...\miniconda3\envs\speak\Lib\site-packages\numpy\fft_pocketfft.py", line 70, in _raw_fft
r = pfi.execute(a, is_real, is_forward, fct)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
MemoryError

@saddy001

Same here. I think it's a regression introduced somewhere between faster-whisper==1.0.3 and 1.1.1, because these stack traces are new to me since the upgrade.

numpy==2.0.2
av==12.3.0

Traceback (most recent call last):
  File "/opt/test/model.py", line 35, in get_iterator
    segments, info = model.transcribe(
                     ^^^^^^^^^^^^^^^^^
  File "/opt/py/lib/python3.12/site-packages/faster_whisper/transcribe.py", line 874, in transcribe
    features = self.feature_extractor(audio, chunk_length=chunk_length)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/py/lib/python3.12/site-packages/faster_whisper/feature_extractor.py", line 215, in __call__
    stft = self.stft(
           ^^^^^^^^^^
  File "/opt/py/lib/python3.12/site-packages/faster_whisper/feature_extractor.py", line 189, in stft
    output = np.fft.rfft(input_array, n=n_fft, axis=-1, norm=norm)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/py/lib/python3.12/site-packages/numpy/fft/_pocketfft.py", line 414, in rfft
    output = _raw_fft(a, n, axis, True, True, norm, out=out)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/py/lib/python3.12/site-packages/numpy/fft/_pocketfft.py", line 94, in _raw_fft
    return ufunc(a, fct, axes=[(axis,), (), (axis,)], out=out)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
numpy._core._exceptions._ArrayMemoryError: Unable to allocate 2.49 GiB for an array with shape (1, 832517, 201) and data type complex128

The system has more than 10 GB of RAM available, so there should be enough to allocate that much memory. Maybe the numpy data type changed? complex128 seems resource-hungry to me.
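
(complex128 is indeed 16 bytes per element, and that alone accounts for the reported figure; a quick check:)

print(832517 * 201 * 16 / 2**30)   # -> ~2.49 GiB, exactly the failed allocation
# complex64 would halve that, and float32 magnitudes would quarter it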

@Sharplinger (Author) commented Jan 30, 2025

@saddy001 Something changed, but it might not be numpy.
cc: @Purfview

On my machine, transcribing a file that has more than 11 hours of content:

faster-whisper version:
1.0.0 Working
1.0.2 Working
1.0.3 Working
1.1.0 Not working
1.1.1 Not working

Tested with
Numpy 1.26.4 and Numpy 2.2.2
av 11.0.0 and av 12.3.0

I tested more combinations, but I think these are the most relevant ones.

Observations:
v1.1.1 - throws while trying to allocate 17.1 GB, using complex128.
v1.0.3 - allocates about 5 GB in total on the GPU for the same use case.

@saddy001

At the moment I suspect it's the transition from soundfile==0.12.1 to 0.13.0, as this brings in a new numpy dependency version.

@saddy001

> Tested with
> Numpy 1.26.4 and Numpy 2.2.2
> av 11.0.0 and av 12.3.0

For the last working faster-whisper version, 1.0.3: was it working with both the old and the new numpy and av versions?
I mean, was faster-whisper==1.0.3 working with numpy==2.2.2 and av==12.3.0?

@Sharplinger (Author)

@saddy001

Not 100% sure exactly which combinations I tried, but numpy 2.2.2 seems to work.
Currently running with:

Name: faster-whisper
Version: 1.0.3

Name: av
Version: 14.1.0

Name: numpy
Version: 2.2.2

The results I am getting are quite poor compared to faster-whisper 1.1.1, though.

@Purfview (Contributor)

> Version: 1.0.3

Feature extraction has changed since that old version; it's probably tuned more for performance than for memory efficiency, so it's expected to consume more RAM than the old version did.

As the issue is about "PC vs Mac", I would focus on that, or on optimizing the feature_extractor.

@saddy001

> optimization of feature_extractor

Can we help with that?

@saddy001

To save you some tokens and time, I ran this through an LLM (I know chances are good that this is complete garbage, but maybe we can get some interesting ideas out of it). Here is the output:

Okay, so I'm trying to figure out why this error is happening when processing audio. The error message says something about not being able to allocate 2.49 GiB for an array with a specific shape and data type. Hmm, that sounds
like a memory issue.

Looking at the traceback, it points to the stft method in feature_extractor.py. Specifically, it's failing during the np.fft.rfft call. So maybe the problem is that when we're computing the FFT, we're creating an array that's
too large and using up too much memory.

Let me check the code around there. In the stft function, after doing some striding and applying the window, it calls either fft or rfft depending on whether the input is complex or not. Since in this case, it's using rfft
because the input isn't complex, right?

Wait, but in the __call__ method of the feature extractor, when chunk_length is set, self.n_samples is updated to chunk_length multiplied by the sampling rate. Then nb_max_frames is n_samples divided by hop_length. Maybe if the chunk is too long, n_fft (which might be 512 or something) isn't enough? Or maybe the stride trick is causing a huge array.

Looking at how input_array is created using as_strided. The shape becomes (batch, n_frames, n_fft). If n_frames is really large because the chunk_length is big, then when we do FFT on each frame, it's creating an output that's
way too big in memory.

So maybe the problem is with the as_strided function creating a huge array. Alternatively, perhaps the window application isn't handled properly, leading to an unnecessarily large array.

Wait, let me think about the as_strided part. The input_array is reshaped into (batch, n_frames, n_fft). For each frame, we're taking n_fft samples and stepping by hop_length. If the chunk_length is very long, n_frames could
be huge. So when we do FFT on axis=-1, for each of those frames, it's creating a complex array with size (n_fft//2 + 1) which might be manageable per frame, but multiplied by the number of frames and batch size could be too
much.

But in this error, the shape is (1, 832517, 201). So that's 1 batch, 832k frames, each frame has 201 frequency bins. Each is a complex128 (16 bytes), so total memory would be 1 * 832517 * 201 * 16 = which is around 2.68 GiB.
That's close to the error message of 2.49 GiB. Hmm.

So why is this happening? Maybe the chunk_length is set too large, causing n_frames to be extremely high. Alternatively, perhaps the padding added in __call__ is increasing the size beyond what can be handled.

Another angle: maybe using rfft is more memory intensive than necessary. Or perhaps there's a way to compute the magnitude squared without keeping all the complex data.

Wait, looking at the code after stft, it takes the magnitudes as the square of the absolute value. So maybe instead of computing the full FFT and then taking the magnitude, we could compute a more memory-efficient
representation.

Alternatively, perhaps processing in chunks rather than all at once would help, but I'm not sure if that's feasible here.

Wait, another thought: when using as_strided, it creates views into the original array. So maybe the input_array after as_strided is a huge array that doesn't fit into memory because each frame is being treated as a separate
chunk, leading to high memory usage during FFT.

So possible fixes:

  1. Reduce n_frames by decreasing the chunk_length or padding. Maybe the current chunk_length is too large for the available memory.

  2. Optimize the way we handle the input_array in stft. Perhaps instead of processing all frames at once, process them in smaller batches to reduce memory usage.

  3. Use a different FFT algorithm that's more memory efficient, but I don't think that's directly applicable here.

  4. Maybe change how the striding is done or adjust the window application to avoid creating such a large temporary array.

Wait, perhaps using a smaller n_fft would help. But in the code, n_fft is fixed based on the class initialization, so changing it might not be straightforward.

Alternatively, maybe the problem arises when chunk_length is provided because it increases n_samples beyond what can be handled. So perhaps setting a maximum for n_frames or ensuring that the processing doesn't try to handle
extremely long audio chunks in one go.

Another thought: The error occurs during the rfft call. Maybe using an out-of-place FFT could help, but I'm not sure if that's possible here.

Wait, looking at numpy's FFT functions, maybe using the norm parameter correctly or ensuring that the data types are optimized could reduce memory usage.

Alternatively, perhaps the code should handle cases where n_frames is too large by processing in smaller blocks. For example, instead of processing all frames at once, process them in batches to prevent the array from becoming
too big.

But modifying the stft function to process in smaller chunks might complicate things, especially with overlapping windows and ensuring that the entire signal is covered without missing any frames.

Hmm, another angle: perhaps the as_strided function isn't necessary if we can reshape the array differently or use a different approach to extract the frames. Maybe using a sliding window approach with a generator could help
manage memory better by not creating such a large temporary array all at once.

Wait, but the way it's written now uses as_strided which is efficient for extracting overlapping frames without copying data. So if that's causing memory issues because of the size, perhaps changing how we handle the frames
would be necessary.

Alternatively, maybe the problem isn't with stft itself, but with the way the input array is being padded in the __call__ method. Adding padding could increase the length beyond what can be handled.

So to fix this, perhaps:

  • Cap the maximum number of frames processed at once.
  • Process the audio in smaller chunks if it's too long.
  • Optimize memory usage during FFT by only keeping necessary data.
  • Adjust how as_strided is used or find an alternative method that doesn't create such large arrays.

Wait, another idea: maybe using a different algorithm for computing STFT that's more memory efficient. But I'm not sure about alternatives in this context.

Alternatively, perhaps the batch size can be reduced. If it's set to 1, which is common for audio processing, then maybe it's manageable, but if it's higher, that could cause issues.

Wait, looking back at the __call__ method:

In __call__, self.n_samples = chunk_length * sample_rate. Then nb_max_frames = self.n_samples // hop_length. So if chunk_length is too large, n_frames becomes huge, leading to the stft function trying to process a massive number of frames.

So maybe adding a check in the call method to ensure that n_frames doesn't exceed a certain limit based on available memory could prevent this issue.

Alternatively, perhaps processing without using as_strided and instead iterating through the audio with a loop, computing each frame's FFT individually. This would be less efficient but might use less memory at once.

But that could significantly slow down the processing since it's not vectorized anymore.

Hmm, so perhaps the best approach is to process in smaller chunks when chunk_length is too large. So if the user provides a chunk_length that results in n_frames exceeding a certain threshold, we split it into smaller chunks
and process each part separately.

That way, each stft call doesn't try to handle an impractically large number of frames at once, preventing the memory error.

So, putting this together, I think the main issue is that when chunk_length is set too high, n_frames becomes too large, causing the FFT array to be too big. The solution would involve modifying the call method to either
cap the chunk length or process it in smaller chunks.

Another possibility is to adjust how the as_strided function is used or switch to a different method that doesn't create such large arrays all at once.

To address the memory error when processing long audio chunks, we can modify the code to handle smaller segments of the audio data. This approach ensures that each segment's FFT doesn't exceed available memory.

Step-by-step Explanation:

  1. Problem Identification: The memory error occurs because processing a very long audio chunk creates an excessively large array during the FFT computation.

  2. Solution Approach: Instead of processing the entire chunk at once, split it into smaller segments that fit within memory constraints.

  3. Implementation: Modify the __call__ method to check the chunk length and process it in manageable parts if necessary.

Modified Code:

def __call__(self, audio, chunk_length=None):
    if chunk_length is not None:
        # Ensure chunk_length doesn't cause excessive memory usage
        max_memory_frames = 100000  # Adjust based on available memory
        num_chunks = (chunk_length * self.sample_rate + max_memory_frames - 1) // max_memory_frames
        
        processed_audio = []
        for i in range(num_chunks):
            start = i * max_memory_frames
            end = min(start + max_memory_frames, chunk_length * self.sample_rate)
            chunk = audio[..., start:end]
            
            # Apply padding only to the last chunk if necessary
            if i == num_chunks - 1 and (end - start) < max_memory_frames:
                chunk = np.pad(chunk, [(0, max_memory_frames - (end - start)), (0, 0)])
            
            processed_audio.append(self.process_chunk(chunk))
        
        return np.concatenate(processed_audio)
    else:
        # Original processing without chunking
        n = audio.shape[-1]
        hop_length = self.hop_length
        win_length = self.win_length
        
        # Padding
        pad = (self.n_fft - hop_length) // 2
        audio_padded = np.pad(audio, [(0, 0), (pad, pad)], mode='reflect')
        
        D = self.stft(audio_padded)
        return self.griffin_lim(D)

def process_chunk(self, chunk):
    # Process each chunk using the existing STFT and Griffin-Lim
    n_fft = self.n_fft
    hop_length = self.hop_length
    win_length = self.win_length
    
    pad = (n_fft - hop_length) // 2
    chunk_padded = np.pad(chunk, [(0, 0), (pad, pad)], mode='reflect')
    
    D = self.stft(chunk_padded)
    y = self.griffin_lim(D)
    
    return y

Explanation:

  • Chunking: The audio is divided into smaller chunks to prevent processing a massive array at once.
  • Memory Management: Each chunk is processed individually, ensuring each FFT operation remains within memory limits.
  • Padding: Only the last chunk may need padding if its length isn't sufficient, maintaining the integrity of each segment's STFT computation.

This method effectively balances computational efficiency with memory constraints, allowing for stable processing of long audio signals.

@Sharplinger (Author)

@saddy001

I'm a C guy and Python is not my knife, but something still feels off.

I understand your reasoning about smaller chunks, but we can process an 11-hour file using only a few GB of memory, yet with 12 hours it grows by a factor of 40?
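
(For what it's worth, here is a minimal sketch of the bounded-memory approach being discussed. This is not faster-whisper's actual code: it assumes Whisper-style parameters, n_fft=400 and hop_length=160, block_frames is an arbitrary knob, and padding/centering is omitted. The point is that the complex FFT output only ever covers one block of frames, never the whole file:)

import numpy as np

def stft_magnitude_blocked(audio, n_fft=400, hop_length=160, block_frames=4096):
    """Compute |STFT|^2 a fixed-size block of frames at a time."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    # Zero-copy view with one row per frame, stepping by hop_length
    frames = np.lib.stride_tricks.sliding_window_view(audio, n_fft)[::hop_length]
    n_frames, n_bins = frames.shape[0], n_fft // 2 + 1
    out = np.empty((n_frames, n_bins), dtype=np.float32)
    for start in range(0, n_frames, block_frames):
        stop = min(start + block_frames, n_frames)
        # Only this block is windowed and transformed, so the temporary
        # complex array stays block-sized no matter how long the file is
        spec = np.fft.rfft(frames[start:stop] * window, axis=-1)
        out[start:stop] = spec.real ** 2 + spec.imag ** 2
    return out

With block_frames=4096 the per-block complex temporary is about 4096 * 201 * 16 bytes, roughly 13 MB even at complex128, independent of the audio length.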
