Automated Chorus Detection (Status: Active)

[Figure: Chorus prediction example]

Overview

A hierarchical convolutional recurrent neural network designed for musical structure analysis, specifically optimized for detecting choruses in music recordings. The model was initially trained on 332 annotated songs from electronic music genres and achieved an F1 score of 0.864 (Precision: 0.831, Recall: 0.900) on unseen test data. For more details, scroll down to the Project Technical Summary section.

Quick Installation

# Clone repository
git clone https://github.com/dennisvdang/chorus-detection.git
cd chorus-detection

# Set up environment
conda env create -f environment.yml
conda activate chorus-detection
pip install -r requirements.txt

# With conda environment activated
# Run the web-app locally (recommended)
streamlit run web/app.py

# Or run the CLI
python cli/cli_app.py

Project Structure

chorus-detection/
│
├── core/                 # Core functionality
│   ├── audio_processor.py   # Audio processing and feature extraction
│   ├── model.py             # Model loading and prediction
│   ├── utils.py             # Utility functions
│   └── visualization.py     # Plotting and visualization
│
├── cli/                  # Command-line interface
│   └── cli_app.py           # CLI application
│
├── web/                  # Web interface
│   └── app.py               # Streamlit web application
│
├── models/               # Pre-trained models
├── input/                # Input audio files
├── output/               # Output files and visualizations
│
├── setup.py              # Package setup
├── requirements.txt      # Package requirements
├── Dockerfile            # Docker configuration
└── docker-compose.yml    # Docker Compose configuration

Project Technical Summary

Data

The dataset consists of 332 manually labeled songs, predominantly from electronic music genres. Data preparation involved:

  1. Audio preprocessing: Converting songs to a uniform format, resampling to a consistent sampling rate, trimming silence, and extracting metadata using Spotify's API (a minimal sketch of the loading and trimming steps follows this list). Link to preprocessing notebook

  2. Manual Chorus Labeling: Labeling the start and end timestamps of choruses following a set of guidelines. More details on the annotation process can be found in the Annotation Guide PDF.
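
The snippet below is a minimal sketch of the loading and trimming part of step 1, assuming librosa; the 22050 Hz sample rate and 30 dB trim threshold are illustrative defaults rather than the project's documented settings.

import librosa

def load_and_trim(path, sr=22050, top_db=30):
    """Load a song at a consistent sampling rate and strip leading/trailing silence."""
    y, _ = librosa.load(path, sr=sr, mono=True)             # resample to sr, downmix to mono
    y_trimmed, _ = librosa.effects.trim(y, top_db=top_db)   # drop quiet leading/trailing audio
    return y_trimmed, sr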

Model Preprocessing

  • Features such as root mean square (RMS) energy, key-invariant chromagrams, Mel spectrograms, MFCCs, and tempograms were extracted. These features were then decomposed with Non-negative Matrix Factorization (NMF), using the number of components identified in our exploratory analysis (an illustrative sketch follows this list).

  • Songs were segmented into timesteps based on musical meters, with positional encoding applied to every audio frame and grid encoding applied to every meter. Songs and labels were uniformly padded, split into train/validation/test sets, and batched (batch size 32) with a custom generator.
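
As an illustration of the feature bullet above, the sketch below extracts the listed features with librosa and reduces each one with scikit-learn's NMF. The hop length and component count are placeholders (the project derives its component counts from the exploratory analysis), and key invariance of the chromagram is handled separately in the project.

import librosa
import numpy as np
from sklearn.decomposition import NMF

def extract_nmf_features(y, sr, hop_length=512, n_components=4):
    """Extract the listed features and reduce each one to per-frame NMF activations."""
    raw = [
        librosa.feature.rms(y=y, hop_length=hop_length),                    # RMS energy
        librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=hop_length),      # chromagram
        librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop_length),  # Mel spectrogram
        librosa.feature.mfcc(y=y, sr=sr, hop_length=hop_length),            # MFCCs
        librosa.feature.tempogram(y=y, sr=sr, hop_length=hop_length),       # tempogram
    ]
    n_frames = min(f.shape[1] for f in raw)   # align frame counts defensively
    activations = []
    for feat in raw:
        feat = feat[:, :n_frames] - feat.min()        # NMF needs non-negative input (MFCCs can be negative)
        k = min(n_components, feat.shape[0])          # cannot exceed the number of feature bins
        nmf = NMF(n_components=k, max_iter=500)
        activations.append(nmf.fit_transform(feat.T)) # (frames, k) activations per feature
    return np.hstack(activations)                     # (frames, total components)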

Below are examples of audio feature visualizations of a song with 3 choruses (highlighted in green). The gridlines represent the musical meters, which are used to divide the song into segments; these segments then serve as the timesteps for the CRNN input.

[Figures: HPSS, beat-synced RMS, chromagram, tempogram]

Model Architecture

The model employs a two-tier architecture that respects the hierarchical structure of music (frames → meters → song).

Input Features:

The model receives as input a song's feature vector (NMF-activated features derived from RMS, mel spectrogram, chromagram, tempogram, and MFCCs). These are computed per frame and grouped by musical meter.
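
For intuition about the input tensor, here is a shape-only sketch of how per-frame features might be grouped by meter and zero-padded into a fixed-size array; the sizes and meter boundaries are made up for illustration and are not the trained model's values.

import numpy as np

# Hypothetical sizes, for shape illustration only.
n_features, max_frames_per_meter, max_meters = 12, 128, 64

def group_by_meter(frame_features, meter_frames):
    """Split per-frame features at meter boundaries and zero-pad to a fixed shape."""
    song = np.zeros((max_meters, max_frames_per_meter, n_features), dtype=np.float32)
    segments = np.split(frame_features, meter_frames[1:-1])   # one array of frames per meter
    for i, seg in enumerate(segments[:max_meters]):
        seg = seg[:max_frames_per_meter]                       # truncate overly long meters
        song[i, :len(seg), :] = seg                            # zero padding elsewhere
    return song                                                # (max_meters, max_frames_per_meter, n_features)

frame_features = np.random.rand(5000, n_features).astype(np.float32)  # one song's frame features
meter_frames = np.arange(0, 5001, 100)                                # fake meter boundaries
x = group_by_meter(frame_features, meter_frames)
print(x.shape)   # (64, 128, 12)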

CNN Layers:

The CNN layers (3 Conv1D + MaxPooling1D layers) apply a series of learnable filters to the input features, sliding across the time (frame) dimension within each meter segment, and output a single feature vector (embedding) that summarizes the temporal information within that meter. Note that no information is shared between meters at this stage; each meter's frames are processed in isolation.

LSTM Layer:

After the CNN layers, the sequence of meter embeddings that make up the input song is passed to a bidirectional LSTM, allowing the model to capture both past and future context across the song’s structure. The LSTM outputs a sequence of hidden states, one for each meter, which are then used for final classification.

Output Layer:

A TimeDistributed dense layer with a sigmoid activation is applied to the LSTM outputs, producing a probability for each meter indicating the likelihood that it corresponds to a chorus section. The model is trained using a custom binary cross-entropy loss that masks out padded values, allowing the model to learn from variable-length songs.

from tensorflow.keras import layers
from tensorflow.keras.models import Model


def create_crnn_model(max_frames_per_meter, max_meters, n_features):
    """Build the hierarchical CRNN.

    Args:
        max_frames_per_meter (int): Maximum number of frames per meter.
        max_meters (int): Maximum number of meters.
        n_features (int): Number of features per frame.
    """
    # Frame-level CNN: encodes the frames of a single meter into one embedding.
    frame_input = layers.Input(shape=(max_frames_per_meter, n_features))
    conv1 = layers.Conv1D(filters=128, kernel_size=3, activation='relu', padding='same')(frame_input)
    pool1 = layers.MaxPooling1D(pool_size=2, padding='same')(conv1)
    conv2 = layers.Conv1D(filters=256, kernel_size=3, activation='relu', padding='same')(pool1)
    pool2 = layers.MaxPooling1D(pool_size=2, padding='same')(conv2)
    conv3 = layers.Conv1D(filters=256, kernel_size=3, activation='relu', padding='same')(pool2)
    pool3 = layers.MaxPooling1D(pool_size=2, padding='same')(conv3)
    frame_features = layers.Flatten()(pool3)
    frame_feature_model = Model(inputs=frame_input, outputs=frame_features)

    # Meter-level model: applies the frame encoder to every meter, then models the
    # meter sequence with a bidirectional LSTM and classifies each meter.
    meter_input = layers.Input(shape=(max_meters, max_frames_per_meter, n_features))
    time_distributed = layers.TimeDistributed(frame_feature_model)(meter_input)
    masking_layer = layers.Masking(mask_value=0.0)(time_distributed)
    lstm_out = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(masking_layer)
    output = layers.TimeDistributed(layers.Dense(1, activation='sigmoid'))(lstm_out)

    model = Model(inputs=meter_input, outputs=output)
    # custom_binary_crossentropy and custom_accuracy mask out padded meters
    # (defined elsewhere in the repository).
    model.compile(optimizer='adam', loss=custom_binary_crossentropy, metrics=[custom_accuracy])
    return model
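
The custom_binary_crossentropy and custom_accuracy functions referenced above are defined elsewhere in the repository; the sketch below shows one plausible padding-masked implementation, assuming padded meters are labeled with a sentinel value of -1 (an assumption, not the project's documented convention).

import tensorflow as tf

def custom_binary_crossentropy(y_true, y_pred):
    """Binary cross-entropy averaged over non-padded meters only (sketch)."""
    mask = tf.cast(tf.not_equal(y_true, -1.0), tf.float32)       # 1 for real meters, 0 for padding
    y_true_clipped = tf.clip_by_value(y_true, 0.0, 1.0)          # sentinel -> harmless value
    bce = tf.keras.losses.binary_crossentropy(y_true_clipped, y_pred)
    bce = tf.expand_dims(bce, -1)                                # match mask shape (batch, meters, 1)
    return tf.reduce_sum(bce * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)

def custom_accuracy(y_true, y_pred):
    """Accuracy computed over non-padded meters only (sketch)."""
    mask = tf.cast(tf.not_equal(y_true, -1.0), tf.float32)
    matches = tf.cast(tf.equal(tf.clip_by_value(y_true, 0.0, 1.0),
                               tf.round(y_pred)), tf.float32)
    return tf.reduce_sum(matches * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)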

Training

  • Custom loss and accuracy functions mask out padded values (see the sketch above)
  • Callbacks save the best model on minimum validation loss, reduce the learning rate on plateau, and stop training early (a hedged example follows this list)
  • Trained for up to 50 epochs (early stopping triggered after 18). [Figure: Training/validation loss and accuracy history]
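
The callbacks described above map onto standard Keras callbacks. The snippet below is a hedged example; the file name, patience values, and learning-rate factor are placeholders rather than the project's actual settings.

from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping

callbacks = [
    ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True),  # keep lowest-val-loss model
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3),               # shrink LR on plateau
    EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),    # stop when no improvement
]

# model.fit(train_generator, validation_data=val_generator, epochs=50, callbacks=callbacks)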

Results

The model achieved strong results on the held-out test set as shown in the summary table. Visualizations of the predictions on sample test songs are also provided and can be found in the test_predictions folder.

Metric     Score
Loss       0.278
Accuracy   0.891
Precision  0.831
Recall     0.900
F1 Score   0.864
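
For reference, metrics like those in the table can be reproduced from the model's per-meter probabilities roughly as follows; model, X_test, and y_test are assumed to come from the training pipeline above, and the 0.5 threshold and -1 padding sentinel are assumptions rather than documented settings.

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

y_prob = model.predict(X_test)                 # (songs, max_meters, 1) chorus probabilities
y_pred = (y_prob >= 0.5).astype(int).ravel()   # threshold sigmoid outputs
y_true = y_test.ravel()
valid = y_true != -1                           # drop padded meters before scoring

print("Precision:", precision_score(y_true[valid], y_pred[valid]))
print("Recall:   ", recall_score(y_true[valid], y_pred[valid]))
print("F1:       ", f1_score(y_true[valid], y_pred[valid]))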

[Figure: Confusion matrix]

Work in progress

  • PyTorch implementation using the same CRNN architecture
  • Additional training data for other musical segments (e.g., intro, pre-chorus, bridge, verse)
  • Music data labeling interface for contributions

Contributing

If you found this project interesting or informative, feel free to star the repository! Issues, pull requests, and feedback are welcome.
