This project analyzes several neural network architectures in depth, with a focus on those built for image recognition. It covers artificial neural networks (ANNs), convolutional neural networks (CNNs), and residual neural networks (ResNets), and applies them to the GTZAN dataset. This dataset, composed of songs categorized by genre and represented as spectrograms, presents an interesting challenge: applying computer vision models to data that originates in a non-visual medium. The experiments investigate how transforming audio into visual spectrograms can reveal patterns that are obscured in raw audio, highlighting the ability of neural networks to discern intricate features imperceptible to the human eye or ear. The goal is to develop a robust classification model using techniques such as transfer learning and data augmentation. On the original dataset, a simple ANN and a simple CNN achieve test accuracies of 50% and 63%, respectively. The best ResNet models reach a test accuracy of 75% on the original data and 89% when the dataset is expanded. Beyond assessing the performance of these models, this work offers insights into both the historical development of neural network technologies and contemporary advances in them.
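To make the audio-to-image step concrete, here is a minimal sketch of how a log-magnitude spectrogram can be computed with NumPy alone. It is an illustration, not the project's actual preprocessing: the function name `log_spectrogram`, the frame length, the hop size, the Hann window, and the 22050 Hz sample rate of the synthetic test tone are all assumptions chosen for the example.

```python
import numpy as np


def log_spectrogram(signal, frame_len=1024, hop=512):
    """Return a log-magnitude spectrogram, shape (frame_len // 2 + 1, n_frames).

    Hypothetical helper for illustration; parameters are assumed defaults.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")

    # Slice the signal into overlapping frames and taper each with a Hann window.
    window = np.hanning(frame_len)
    n_frames = 1 + (signal.size - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )

    # Magnitude of the real FFT of each frame, converted to decibels.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    log_mag = 20.0 * np.log10(magnitude + 1e-10)  # small offset avoids log(0)

    # Transpose so that rows are frequency bins and columns are time frames.
    return log_mag.T


# Synthetic one-second 440 Hz tone (sample rate is an assumption for the demo).
sample_rate = 22050
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)

spec = log_spectrogram(tone)

# The strongest frequency bin, averaged over time, should sit near 440 Hz.
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sample_rate / 1024
```

The resulting 2D array can be rendered or saved as an image, which is what lets image-recognition architectures such as CNNs and ResNets operate on audio.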
View the full paper HERE
View the presentation slides HERE
Download the data from https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification and place it in the data folder.
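After downloading, a quick sanity check can confirm that the files landed where expected. The sketch below counts files per class subfolder. The helper name `count_files_per_class` is hypothetical, and the one-subfolder-per-genre layout and `.png` suffix are assumptions about the extracted archive, so adjust them to match what is actually on disk.

```python
from pathlib import Path


def count_files_per_class(root, suffix=".png"):
    """Count files with the given suffix in each immediate subfolder of root.

    Hypothetical helper: assumes one subfolder per genre under root.
    """
    root = Path(root)
    if not root.is_dir():
        raise FileNotFoundError(f"data folder not found: {root}")

    counts = {}
    # Visit subfolders in sorted order so the result is deterministic.
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() == suffix
        )
    return counts
```

For example, calling `count_files_per_class("data/images_original")` (the subfolder name is an assumption) returns a dictionary mapping each genre folder to its file count, which makes missing or unevenly sized classes easy to spot before training.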