CQT, iCQT, and VQT implementations and testing #3804
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/audio/3804
Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Awesome contribution! A bunch of |
Hey, here's to addressing the feedback ☝️
SAMPLE_RATE = 16000
HOP_LENGTH = 256
F_MIN = 32.703
N_BINS = 672
BINS_PER_OCTAVE = 96
Increasing the
Here's an example of the high-frequency artifacts/aliasing(?) in the reconstruction that I can't get rid of (using your implementation without my adjustments):
sample_rate = 44100
hop_length = 512
f_min = 20
n_bins = 1280
bins_per_octave = 128
Original: Recon:
Even if I apply a low-pass filter to remove frequencies above 8000 Hz before passing the signal into the above, I still get distortion when the bass beats:
recon.mp4
Hey, thank you for your patience - busy week! I managed to get a decent reconstruction, without too many audible artefacts, using the following parameters and without any transformations to the original signal:
SAMPLE_RATE = 44100
HOP_LENGTH = 256
F_MIN = 32.703
N_BINS = 1728
BINS_PER_OCTAVE = 192
There are two issues with the parameters you are using:
Of course, feel free to play around with lower resolutions!
icqt.mp4
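For anyone comparing the parameter sets quoted in this thread, here is a small sketch that derives a few quantities from them: the number of octaves, the centre frequency of the top bin versus Nyquist, and the hop length left in the lowest octave. It assumes a librosa-style octave recursion in which the sample rate and hop length are halved once per octave; the function name and returned keys are made up for illustration and are not part of the PR.

```python
import math


def summarize_cqt_params(sample_rate, hop_length, f_min, n_bins, bins_per_octave):
    """Derive basic quantities from a CQT parameter set (illustrative helper)."""
    # Octaves spanned by the filterbank, rounded up to a whole octave.
    n_octaves = math.ceil(n_bins / bins_per_octave)
    # Centre frequency of the highest bin under geometric spacing.
    f_top = f_min * 2.0 ** ((n_bins - 1) / bins_per_octave)
    nyquist = sample_rate / 2.0
    # Assumption: each octave below the top one halves the sample rate and hop,
    # so the lowest octave has been halved (n_octaves - 1) times.
    halvings = n_octaves - 1
    return {
        "n_octaves": n_octaves,
        "f_top": f_top,
        "nyquist": nyquist,
        "top_bin_below_nyquist": f_top < nyquist,
        "lowest_octave_hop": hop_length / 2 ** halvings,
        "hop_divisible": hop_length % (2 ** halvings) == 0,
    }


# The three parameter sets quoted in this conversation.
thread_params = {
    "16 kHz, 96 bins/octave": (16000, 256, 32.703, 672, 96),
    "44.1 kHz, 128 bins/octave": (44100, 512, 20.0, 1280, 128),
    "44.1 kHz, 192 bins/octave": (44100, 256, 32.703, 1728, 192),
}

for name, params in thread_params.items():
    print(name, summarize_cqt_params(*params))
```

This only reports derived numbers; it does not say which settings reconstruct well, which still has to be judged by listening to or measuring the iCQT output.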
Thanks! However, this has a size of 1728 * 256, whereas a normal spectrogram can store enough info for a perfect strong COLA reconstruction in only 512 * 256 numbers (or 1024 * 512, etc). Shouldn't a CQT be able to do the same without significant artifacting?
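As a reference point for the STFT side of this comparison, the sketch below checks the basic constant-overlap-add (COLA) condition in plain Python. It assumes the "512 * 256" above means a window length of 512 with a hop of 256 and a periodic Hann window; both are readings of the comment, not something stated in the PR. It tests only the plain COLA sum, not any stronger variant of the condition.

```python
import math


def periodic_hann(n_fft):
    """Periodic Hann window of length n_fft."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_fft) for n in range(n_fft)]


def overlap_add_profile(window, hop_length):
    """Sum of all hop-shifted window copies over one hop period (steady state)."""
    n_fft = len(window)
    return [
        sum(window[i] for i in range(n, n_fft, hop_length))
        for n in range(hop_length)
    ]


def is_cola(window, hop_length, tol=1e-10):
    """True if the shifted windows add up to a constant."""
    profile = overlap_add_profile(window, hop_length)
    return max(profile) - min(profile) < tol


print(is_cola(periodic_hann(512), 256))   # 50% overlap
print(is_cola(periodic_hann(1024), 512))  # 50% overlap
print(is_cola(periodic_hann(512), 384))   # 25% overlap
```

A CQT filterbank has a different window length per bin, so no single check of this kind covers it directly; this only illustrates the STFT property being referred to.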
Hey everyone,
I am happy to propose the addition of the CQT, iCQT, and VQT. The first two have been requested by issue 588. Since the CQT is a VQT with parameter gamma=0, I figured the VQT should be added to the package too. It also figures quite prominently in the research community, even as a time-frequency representation for neural networks. Here are a few important details.
General
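To make the gamma=0 relationship concrete, here is a minimal sketch using one common definition of the variable-Q filterbank, in which bin k has bandwidth alpha * f_k + gamma. The helper names and the choice alpha = 2^(1/bins_per_octave) - 1 are assumptions for illustration and are not taken from the proposed code.

```python
import math


def center_frequencies(f_min, n_bins, bins_per_octave):
    """Geometrically spaced centre frequencies starting at f_min."""
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]


def q_factors(freqs, bins_per_octave, gamma=0.0):
    """Q = f / bandwidth, with bandwidth = alpha * f + gamma (assumed definition)."""
    alpha = 2.0 ** (1.0 / bins_per_octave) - 1.0
    return [f / (alpha * f + gamma) for f in freqs]


freqs = center_frequencies(f_min=32.703, n_bins=24, bins_per_octave=12)

cqt_q = q_factors(freqs, bins_per_octave=12, gamma=0.0)   # constant across bins
vqt_q = q_factors(freqs, bins_per_octave=12, gamma=20.0)  # lower Q at low frequencies

print(cqt_q[0], cqt_q[-1])
print(vqt_q[0], vqt_q[-1])
```

With gamma=0 every bin has the same Q, which is the constant-Q case. A positive gamma widens the low-frequency filters, lowering their Q and shortening them in time.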
The proposed transforms follow and test against the librosa implementations. Note that, since the algorithms are based on recursive sub-sampling, the results of the proposed transforms and librosa gradually diverge as the number of resampling iterations increases, because the resampling algorithms differ. The librosa comparison test thresholds are adapted accordingly. The implementation being matched is the following:
The <ARGUMENTS> (similar throughout all three transforms) are the controllable ones in the proposed code. The others are "hard-coded". In my opinion, they should stay that way to avoid unnecessary complexity. Future iterations of the transform could incorporate some of these arguments, however, if requested by the community!
Tests
I was unable to make the transforms torch-scriptable. Maybe this should be the focus of a future PR. For the rest, I was able to test on CPU but not GPU for installation reasons. Feel free to let me know if any are lacking.
Speed
On the audio snippet from here, over 100 iterations, with dtype=torch.float64:
Sanity Check
Here's an image of the CQT-gram generated using the following parameters:
The results are pretty much identical! Feel free to request changes or ask me any questions on this PR. I'll be happy to answer, and am excited to get these transforms to the package 🫡