Studies have shown that humans do not perceive frequencies on a linear scale. We are better at detecting differences between lower frequencies than between higher frequencies. For example, we can easily tell the difference between 500 and 1000 Hz, but we can hardly tell the difference between 10,000 and 10,500 Hz, even though the distance within each pair is the same. In 1937, Stevens, Volkmann, and Newman proposed a unit of pitch such that equal distances in pitch sound equally distant to the listener. This is called the mel scale. A mathematical operation converts frequencies in Hz to the mel scale.
- reference: understanding the mel spectrogram
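
As a minimal sketch of that conversion, the snippet below uses one widely used formula, m = 2595 * log10(1 + f / 700). This is an assumption for illustration: the notes do not say which variant was used, and libraries may default to a different one. The function names `hz_to_mel` and `mel_to_hz` are hypothetical.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert Hz to mel with the common 2595 * log10(1 + f / 700) formula."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


# Both pairs are 500 Hz apart, but the low pair is much further apart in mel.
low_gap = hz_to_mel(1000) - hz_to_mel(500)
high_gap = hz_to_mel(10500) - hz_to_mel(10000)
print(f"500 -> 1000 Hz:     {low_gap:.1f} mel")
print(f"10000 -> 10500 Hz:  {high_gap:.1f} mel")
```

The same 500 Hz step covers several times more mel at the low end than at the high end, which matches the perceptual example above.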
Preprocessing steps:
- load audio
- convert to mel spectrogram (n_mels=256, fmin=0, fmax=14000)
- convert the frequency map to decibel format (from np.abs(stft))
- resize to (256, 512) with "from skimage.transform import resize"
- stack to 3 channels (for cnn): np.stack((stft_db, stft_db, stft_db)), giving shape (3, 256, 512)
- append each sample to a list; final shape (nsamples, 3, 256, 512)
all columns
| 'ID' | 'Sex' | 'Age' | 'Narrow pitch range' |
| 'Decreased volume' | 'Fatigue' | 'Dryness' | 'Lumping' |
| 'heartburn' | 'Choking' | 'Eye dryness' | 'PND' |
| 'Smoking' | 'PPD' | 'Drinking' | 'frequency' |
| 'Diurnal pattern' | 'Onset of dysphonia' | 'Noise at work' | 'Occupational vocal demand' |
| 'Head injury' | 'CVA' | 'Voice handicap index - 10' | 'Disease category' |
- 'Disease category' is the classification column
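
A minimal sketch of separating the features from the classification column with pandas. Dropping 'ID' from the features and one-hot encoding the label are assumptions made for illustration; the toy rows are placeholders, not real data, and `split_features_and_label` is a hypothetical helper.

```python
import pandas as pd

LABEL_COL = "Disease category"
ID_COL = "ID"  # assumption: the identifier is not used as a feature


def split_features_and_label(df):
    """Return (features, one-hot label) from the full table."""
    features = df.drop(columns=[ID_COL, LABEL_COL])
    # One-hot labels match a categorical cross entropy loss.
    label = pd.get_dummies(df[LABEL_COL], prefix="class")
    return features, label


# Placeholder rows with only a few of the columns, for illustration.
toy = pd.DataFrame(
    {
        "ID": ["a", "b", "c"],
        "Sex": [1, 2, 1],
        "Age": [30, 45, 60],
        "Disease category": [1, 2, 1],
    }
)
X, y = split_features_and_label(toy)
print(list(X.columns), y.shape)
```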
The model summary for the custom CNN model.
- optimizer: nadam
- minimum lr: 1e-8
- loss: categorical cross entropy
- validation ratio: 5%
- callbacks: reduce lr based on validation loss; early stopping after 10 epochs
- save checkpoint: True
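
The settings above might map onto Keras roughly as sketched below. The framework itself is an assumption (the notes do not name it), as are the reduction factor, the plateau patience, the reading of "10 epochs" as early-stopping patience, the checkpoint file name, and the epoch count. `make_callbacks` and `compile_and_fit` are hypothetical helpers, and the CNN architecture is not shown because the notes do not give it.

```python
from tensorflow import keras


def make_callbacks(checkpoint_path="cnn_best.h5"):
    """Callbacks matching the listed settings (unlisted values are assumptions)."""
    return [
        # Lower the learning rate when validation loss stalls, never below 1e-8.
        keras.callbacks.ReduceLROnPlateau(
            monitor="val_loss", factor=0.5, patience=3, min_lr=1e-8
        ),
        # Assumption: stop after 10 epochs without validation-loss improvement.
        keras.callbacks.EarlyStopping(monitor="val_loss", patience=10),
        # Keep the best model seen so far on disk.
        keras.callbacks.ModelCheckpoint(
            checkpoint_path, monitor="val_loss", save_best_only=True
        ),
    ]


def compile_and_fit(model, x_train, y_train, epochs=100):
    """Compile with nadam and categorical cross entropy; hold out 5% for validation."""
    model.compile(optimizer="nadam", loss="categorical_crossentropy")
    return model.fit(
        x_train,
        y_train,
        validation_split=0.05,
        epochs=epochs,
        callbacks=make_callbacks(),
    )


callbacks = make_callbacks()
```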
The model summary for the DNN model.
- hidden layers: 3
- activation function: sigmoid (hidden nodes), softmax (categorical prediction)
- loss: categorical cross entropy
- optimizer: adam
- metrics: accuracy
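
To make the listed pieces concrete, here is a NumPy-only sketch of the forward pass: three sigmoid hidden layers, a softmax output, categorical cross entropy, and accuracy. It is not the trained model. The layer widths, the input size of 22, the class count of 3, and the random inputs are all assumptions for illustration, and training with adam is left out.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    # Subtract the row max for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def init_params(n_in, hidden=(64, 32, 16), n_classes=3, seed=0):
    """Random weights for 3 hidden layers plus the output layer (sizes are hypothetical)."""
    rng = np.random.default_rng(seed)
    sizes = (n_in, *hidden, n_classes)
    return [(rng.normal(0, 0.1, (a, b)), np.zeros(b)) for a, b in zip(sizes[:-1], sizes[1:])]


def forward(x, params):
    h = x
    for w, b in params[:-1]:
        h = sigmoid(h @ w + b)      # hidden layers use sigmoid
    w, b = params[-1]
    return softmax(h @ w + b)       # output layer uses softmax


def categorical_cross_entropy(y_onehot, probs, eps=1e-12):
    return float(-np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1)))


def accuracy(y_onehot, probs):
    return float(np.mean(probs.argmax(axis=1) == y_onehot.argmax(axis=1)))


# Random placeholder inputs and labels, only to exercise the functions.
rng = np.random.default_rng(1)
x_demo = rng.normal(size=(5, 22))
y_demo = np.eye(3)[rng.integers(0, 3, size=5)]
params = init_params(n_in=22)
probs = forward(x_demo, params)
loss = categorical_cross_entropy(y_demo, probs)
print(probs.shape, round(loss, 3), accuracy(y_demo, probs))
```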
| testing data | UAR |
|---|---|
| public | 0.687 |
| private | 0.543 |
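
UAR (unweighted average recall) is the mean of the per-class recalls, so every class counts equally regardless of how many samples it has. A minimal sketch follows; the function name and the toy labels are made up for illustration and are unrelated to the scores in the table.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Imbalanced toy labels: six samples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1]
uar = unweighted_average_recall(y_true, y_pred)
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(uar, acc)  # recall is 1.0 for class 0 and 0.5 for class 1
```

Here plain accuracy is 0.875 while UAR is 0.75, because the error on the small class weighs as much as the whole large class. This is the same quantity as macro-averaged recall.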
