This project implements a sophisticated system that processes both image (MNIST) and audio (FSDD) data using separate autoencoders, performs clustering on the latent representations, and analyzes relationships between modalities. The architecture is fully decoupled, allowing for independent processing of each modality while still enabling joint analysis.
- Separate Processing Modules: Independent modules for image and audio data processing
- Unsupervised Learning: No labeled data required for training
- Advanced Encoding: Autoencoder architectures for both image and audio data
- Multimodal Clustering: Clustering techniques for individual and joint modalities
- Relationship Analysis: Establishes one-to-one and one-to-many relationships between modalities
- Convergence/Divergence Analysis: Analyzes zones of similarity and difference between modalities
- Visualization: Comprehensive visualization of results including reconstructions, latent spaces, and clustering
The architecture consists of the following main components:
1. Data Preprocessing and Encoding
   - MNIST image dataset processing
   - FSDD audio dataset processing
   - Robust encoding generators for both datasets
   - Autoencoder architectures for each data type
2. Clustering and Analysis Modules
   - Visual clustering mechanism
   - One-to-one and one-to-many relationship tables
   - Brain module for intelligent processing
   - Config controller for dynamic parameter management
   - Convergence and divergence zone analysis
3. Visualization
   - Image and audio reconstruction visualizations
   - Latent space visualizations using t-SNE
   - Clustering result visualizations
   - Relationship heatmaps
   - Convergence and divergence analysis plots
```mermaid
graph TD
subgraph Visual_Stream
RVD[Raw Visual Data] <--> VAE[Autoencoder]
VAE <--> VENC[Encoding]
VENC <--> VNOD[Nodes]
VNOD <--> VCLU[Clusters]
end
subgraph Reward_Stream
RRD[Raw Reward Data] <--> RNOD[Nodes]
RNOD <--> RCLU[Clusters]
end
subgraph Audio_Stream
RAD[Raw Audio Data] <--> AAE[Autoencoder]
AAE <--> AENC[Encoding]
AENC <--> ANOD[Nodes]
ANOD <--> ACLU[Clusters]
end
VCLU <--> CDZ[Convergence-Divergence Zone]
RCLU <--> CDZ
ACLU <--> CDZ
classDef raw fill:#cccccc,stroke:#333,stroke-width:2px,color:black
classDef processing fill:#e6f2ff,stroke:#b3d7ff,stroke-width:2px,color:black
class RVD,RRD,RAD raw
class VAE,VENC,VNOD,VCLU,RNOD,RCLU,AAE,AENC,ANOD,ACLU,CDZ processing
```
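In the diagram above, the Convergence-Divergence Zone (CDZ) is where the per-modality cluster streams meet. The sketch below shows one plausible way to build a one-to-one / one-to-many relationship table from paired cluster labels and to flag convergent versus divergent cluster pairs. The function names and the threshold rule are assumptions for illustration only, not the logic in modules/multimodal_clustering.py or modules/brain_module.py:

```python
import numpy as np

def relationship_table(image_labels, audio_labels, n_clusters=10):
    """Co-occurrence counts between image clusters (rows) and audio clusters (columns)."""
    table = np.zeros((n_clusters, n_clusters), dtype=int)
    for i, a in zip(image_labels, audio_labels):
        table[i, a] += 1
    return table

def convergence_mask(table, threshold=0.5):
    """Flag image clusters whose mass concentrates in a single audio cluster.

    A row dominated by one column suggests a one-to-one (convergent) relationship;
    a row spread over many columns suggests a one-to-many (divergent) relationship.
    """
    row_totals = table.sum(axis=1, keepdims=True)
    proportions = table / np.maximum(row_totals, 1)
    return proportions.max(axis=1) >= threshold

# Random labels stand in for real cluster assignments in this example.
rng = np.random.default_rng(0)
img = rng.integers(0, 10, size=500)
aud = rng.integers(0, 10, size=500)
print(convergence_mask(relationship_table(img, aud)))
```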
```
Multimodal-Generative-Architecture/
├── config/
│   └── config.json
├── data/
│   ├── mnist/
│   └── fsdd/
├── modules/
│   ├── __init__.py
│   ├── brain_module.py
│   ├── config_controller.py
│   ├── data_loader.py
│   ├── image_autoencoder.py
│   ├── audio_autoencoder.py
│   ├── multimodal_clustering.py
│   ├── utils.py
│   └── visualization.py
├── outputs/
│   ├── models/
│   │   ├── image_autoencoder/
│   │   └── audio_autoencoder/
│   ├── clusters/
│   ├── synthetic/
│   └── plots/
├── main.py
├── train.py
├── evaluate.py
└── visualize.py
```
```mermaid
flowchart TD
%% Global entities
ConfigFile["Configuration File"]:::config
Train["Training Workflow"]:::workflow
Eval["Evaluation Workflow"]:::workflow
Visual["Visualization Workflow"]:::workflow
%% Input Layer
subgraph "Input Layer"
DataLoader["Data Loader"]:::input
MNIST["MNIST Images"]:::data
FSDD["FSDD Audio"]:::data
end
%% Processing Layer
subgraph "Processing Layer"
ImgAE["Image Autoencoder"]:::image
AudioAE["Audio Autoencoder"]:::audio
ImgLatent["Image Latent Space"]:::latent
AudioLatent["Audio Latent Space"]:::latent
end
%% Analysis Layer
subgraph "Analysis Layer"
MultiCluster["Multimodal Clustering"]:::analysis
Brain["Brain Module"]:::analysis
RelAnalysis["Relationship Analysis"]:::analysis
ConvDiv["Convergence/Divergence"]:::analysis
end
%% Output Layer
subgraph "Output Layer"
Visualize["Visualization"]:::output
Models["Model Outputs"]:::output
Plots["Plots"]:::output
Synth["Synthetic Data"]:::output
end
%% Data Flow
MNIST -->|"image data"| DataLoader
FSDD -->|"audio data"| DataLoader
DataLoader -->|"images"| ImgAE
DataLoader -->|"audio"| AudioAE
ImgAE -->|"latent rep"| ImgLatent
AudioAE -->|"latent rep"| AudioLatent
ImgLatent --> MultiCluster
AudioLatent --> MultiCluster
MultiCluster --> Brain
Brain --> RelAnalysis
RelAnalysis --> ConvDiv
ConvDiv --> Visualize
Visualize --> Models
Visualize --> Plots
Visualize --> Synth
%% Configuration Flow
ConfigFile -->|"config"| DataLoader
ConfigFile -->|"config"| ImgAE
ConfigFile -->|"config"| AudioAE
ConfigFile -->|"config"| MultiCluster
ConfigFile -->|"config"| Brain
ConfigFile -->|"config"| Visualize
%% Workflow Connections
Train --> DataLoader
Eval --> MultiCluster
Visual --> Visualize
%% Click Events
click DataLoader "https://github.com/blshaw/multimodal-generative-architecture/blob/main/modules/data_loader.py"
click ImgAE "https://github.com/blshaw/multimodal-generative-architecture/blob/main/modules/image_autoencoder.py"
click AudioAE "https://github.com/blshaw/multimodal-generative-architecture/blob/main/modules/audio_autoencoder.py"
click MultiCluster "https://github.com/blshaw/multimodal-generative-architecture/blob/main/modules/multimodal_clustering.py"
click Brain "https://github.com/blshaw/multimodal-generative-architecture/blob/main/modules/brain_module.py"
click ConfigFile "https://github.com/blshaw/multimodal-generative-architecture/blob/main/config/config.json"
click Visualize "https://github.com/blshaw/multimodal-generative-architecture/blob/main/modules/visualization.py"
click Train "https://github.com/blshaw/multimodal-generative-architecture/blob/main/train.py"
click Eval "https://github.com/blshaw/multimodal-generative-architecture/blob/main/evaluate.py"
click Visual "https://github.com/blshaw/multimodal-generative-architecture/blob/main/visualize.py"
%% Styles
classDef input fill:#D5E8D4,stroke:#82B366,color:#000
classDef data fill:#F8CECC,stroke:#B85450,color:#000
classDef image fill:#DAE8FC,stroke:#6C8EBF,color:#000
classDef audio fill:#E1D5E7,stroke:#9673A6,color:#000
classDef latent fill:#FFF2CC,stroke:#D6B656,color:#000
classDef analysis fill:#F5F5F5,stroke:#666666,color:#000
classDef output fill:#D4EDF7,stroke:#6DA9CF,color:#000
classDef config fill:#FCE4D6,stroke:#D79B00,color:#000
classDef workflow fill:#E2F0D9,stroke:#70AD47,color:#000
```
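In the flow above, the Data Loader hands the audio autoencoder fixed-length FSDD waveforms (the configuration below uses an 8192-sample input at an 8 kHz sample rate). A minimal loading sketch using Librosa, which is among the listed dependencies; the padding-to-8192 step and the function name are assumptions, and the project's actual logic lives in modules/data_loader.py:

```python
import numpy as np
import librosa

def load_fixed_length_clip(path, sample_rate=8000, duration=1.0, target_len=8192):
    """Load an FSDD-style recording, resample to 8 kHz, and pad or trim to a fixed length."""
    waveform, _ = librosa.load(path, sr=sample_rate, duration=duration, mono=True)
    if len(waveform) < target_len:
        waveform = np.pad(waveform, (0, target_len - len(waveform)))
    else:
        waveform = waveform[:target_len]
    # Scale to [-1, 1], matching the "normalize": true setting in the config.
    peak = np.max(np.abs(waveform))
    return waveform / peak if peak > 0 else waveform
```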
- Clone the repository:

  ```bash
  git clone https://github.com/BLShaw/Multimodal-Generative-Architecture.git
  cd Multimodal-Generative-Architecture
  ```

- Create a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```

The project is designed to be run in three separate stages:
Train the autoencoders and brain module:

```bash
python train.py --config config/config.json
```

This will:
- Download and preprocess the MNIST and FSDD datasets
- Train the image and audio autoencoders (see the sketch after this list)
- Train the brain module for multimodal processing
- Save all trained models and latent representations
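The autoencoder architectures themselves are defined in modules/image_autoencoder.py and modules/audio_autoencoder.py. For reference only, a dense Keras autoencoder matching the default image configuration (28x28 inputs, encoder layers 128/64/32, a 16-dimensional latent space, Adam at 0.001) might look like the sketch below; the project's actual implementation may differ:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_image_autoencoder(input_shape=(28, 28, 1), encoder_layers=(128, 64, 32),
                            latent_dim=16, decoder_layers=(32, 64, 128), lr=0.001):
    inputs = layers.Input(shape=input_shape)
    x = layers.Flatten()(inputs)
    for units in encoder_layers:
        x = layers.Dense(units, activation="relu")(x)
    latent = layers.Dense(latent_dim, activation="relu", name="latent")(x)
    x = latent
    for units in decoder_layers:
        x = layers.Dense(units, activation="relu")(x)
    x = layers.Dense(28 * 28, activation="sigmoid")(x)
    outputs = layers.Reshape(input_shape)(x)
    autoencoder = models.Model(inputs, outputs)
    encoder = models.Model(inputs, latent)  # exposes the latent representation
    autoencoder.compile(optimizer=tf.keras.optimizers.Adam(lr), loss="mse")
    return autoencoder, encoder

# Usage (batch_size and epochs follow the config defaults):
# autoencoder, encoder = build_image_autoencoder()
# autoencoder.fit(x_train, x_train, batch_size=32, epochs=50, validation_split=0.1)
# latent_codes = encoder.predict(x_train)
```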
Evaluate the trained models and perform clustering:

```bash
python evaluate.py --config config/config.json
```

This will:
- Load the trained models
- Perform clustering on individual modalities (a minimal sketch follows this list)
- Perform joint clustering on multimodal data
- Establish relationships between modalities
- Analyze convergence and divergence zones
- Save clustering results and metrics
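Clustering parameters come from the clustering block of the configuration (k-means with 10 clusters by default). The sketch below shows one way to cluster a modality's latent vectors and compute the silhouette score reported among the evaluation metrics; it uses scikit-learn directly and is not the code in modules/multimodal_clustering.py:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_latents(latents, n_clusters=10, seed=0):
    """Cluster one modality's latent vectors and report clustering quality."""
    kmeans = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    labels = kmeans.fit_predict(latents)
    score = silhouette_score(latents, labels)
    return labels, score

# Joint clustering on concatenated (paired) latent vectors:
# joint = np.concatenate([image_latents, audio_latents], axis=1)
# joint_labels, joint_score = cluster_latents(joint)
```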
Generate visualizations of the results:

```bash
python visualize.py --config config/config.json
```

This will:
- Load the trained models and results
- Generate reconstruction visualizations
- Visualize latent spaces using t-SNE (see the sketch after this list)
- Plot clustering results
- Visualize attention weights
- Generate synthetic samples
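The t-SNE plots could be produced along these lines. This is an illustrative scikit-learn/Matplotlib sketch rather than the code in modules/visualization.py, and the output path is only an example consistent with outputs/plots/:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_latent_space(latents, labels, out_path="outputs/plots/latent_tsne.png"):
    """Project latent vectors to 2-D with t-SNE and color points by cluster label."""
    embedding = TSNE(n_components=2, init="pca", random_state=0).fit_transform(latents)
    plt.figure(figsize=(6, 6))
    plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=5)
    plt.colorbar(label="cluster")
    plt.title("Latent space (t-SNE)")
    plt.savefig(out_path, dpi=150)
    plt.close()
```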
The system behavior can be customized through the config/config.json file:
```json
{
  "model": {
    "image_autoencoder": {
      "input_shape": [28, 28, 1],
      "encoder_layers": [128, 64, 32],
      "latent_dim": 16,
      "decoder_layers": [32, 64, 128],
      "learning_rate": 0.001,
      "batch_size": 32,
      "epochs": 50
    },
    "audio_autoencoder": {
      "input_shape": [8192],
      "encoder_layers": [1024, 512, 256],
      "latent_dim": 32,
      "decoder_layers": [256, 512, 1024],
      "learning_rate": 0.001,
      "batch_size": 16,
      "epochs": 50
    },
    "clustering": {
      "n_clusters": 10,
      "algorithm": "kmeans",
      "convergence_threshold": 0.001
    }
  },
  "data": {
    "mnist": {
      "resize_to": [28, 28],
      "normalize": true
    },
    "fsdd": {
      "sample_rate": 8000,
      "duration": 1.0,
      "normalize": true
    }
  },
  "paths": {
    "data_dir": "data",
    "output_dir": "outputs",
    "model_dir": "outputs/models",
    "cluster_dir": "outputs/clusters",
    "synthetic_dir": "outputs/synthetic",
    "plot_dir": "outputs/plots"
  }
}
```
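A minimal sketch of reading this file and handing each component its own section, using the standard json module; the actual interface of modules/config_controller.py may differ:

```python
import json

def load_config(path="config/config.json"):
    """Read the JSON configuration and return it as a nested dict."""
    with open(path, "r") as f:
        return json.load(f)

config = load_config()
image_cfg = config["model"]["image_autoencoder"]   # e.g. latent_dim == 16
audio_cfg = config["model"]["audio_autoencoder"]   # e.g. input_shape == [8192]
cluster_cfg = config["model"]["clustering"]        # e.g. n_clusters == 10
```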
The system generates several outputs:

- Trained Models: Saved in outputs/models/
- Clustering Results: Saved in outputs/clusters/
- Visualizations: Saved in outputs/plots/
- Synthetic Samples: Saved in outputs/synthetic/
Key metrics include:
- Silhouette scores for clustering quality
- Reconstruction losses for autoencoder performance (a minimal sketch follows this list)
- Relationship matrices between modalities
- Convergence and divergence zone analysis
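Reconstruction loss here means the error between an input and its autoencoder reconstruction. The sketch below computes a mean-squared version with a trained Keras model; it is an illustration, not necessarily the exact metric used in evaluate.py:

```python
import numpy as np

def reconstruction_mse(autoencoder, x, batch_size=32):
    """Mean squared error between inputs and their reconstructions."""
    reconstructions = autoencoder.predict(x, batch_size=batch_size, verbose=0)
    return float(np.mean(np.square(x - reconstructions)))
```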
- TensorFlow
- NumPy
- Scikit-learn
- Matplotlib
- Seaborn
- Librosa
- Soundfile
- Requests
See requirements.txt for the complete list.
- Fork the repository
- Create a feature branch:

  ```bash
  git checkout -b feature/new-feature
  ```

- Commit your changes:

  ```bash
  git commit -am 'Add some new feature'
  ```

- Push to the branch:

  ```bash
  git push origin feature/new-feature
  ```

- Open a Pull Request
- MNIST Dataset - Yann LeCun
- FSDD Dataset - Jakobovski
- TensorFlow Team
- Keras Team