This project focuses on generating realistic synthetic Electronic Nose (E-Nose) time-series data using a Transformer-based model. The goal is to improve data availability, generalization, and downstream classification performance for coffee quality assessment.
This work was carried out as part of an internship project.
Objectives:
- Generate realistic synthetic E-Nose sensor data
- Capture and distinguish the characteristic sensor response patterns of different coffee quality grades
- Evaluate the usefulness of synthetic data in downstream machine learning tasks
The dataset is based on E-Nose measurements for Colombian coffee quality control, aimed at detecting defects during cup tests.
Key details:
- 58 coffee samples
- 3 quality labels:
- High Quality (HQ)
- Average Quality (AQ)
- Low Quality (LQ)
- Time-series data sampled at 1 Hz for 300 seconds
- 8 gas sensors per sample:
- SP-12A, SP-31, TGS-813, TGS-842
- SP-AQ3, TGS-823, ST-31, TGS-800
- Sensor readings are resistance values (kΩ)
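Given the dataset description above, each sample is a 300-step, 8-channel sequence. A minimal sketch of how the data could be shaped and normalized for training follows; the tensor here is a random placeholder, and per-sensor min-max scaling is one plausible implementation of the normalization step (the project does not specify which scheme was used).

```python
import torch

# Shapes taken from the dataset description:
# 58 samples, 300 time steps (1 Hz for 300 s), 8 gas sensors.
n_samples, seq_len, n_sensors = 58, 300, 8

# Placeholder for real resistance readings (kΩ); replace with actual data.
data = torch.rand(n_samples, seq_len, n_sensors)

# Per-sensor min-max normalization to [0, 1]: one plausible
# implementation of the "data normalization" step.
mins = data.amin(dim=(0, 1), keepdim=True)   # shape (1, 1, 8)
maxs = data.amax(dim=(0, 1), keepdim=True)
normalized = (data - mins) / (maxs - mins + 1e-8)
```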
A Transformer Encoder–Decoder architecture was used for time-series reconstruction and synthetic data generation.
Key components:
- Self-Attention to capture long-range temporal dependencies
- Multi-Head Attention for diverse feature learning
- Positional Encoding to preserve time-step order
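Of the components above, positional encoding is the piece specific to preserving time-step order. A sketch of the standard sinusoidal variant in PyTorch (the dimensions are illustrative; the project's actual `d_model` is not stated):

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal encoding, added to the input embeddings so
    self-attention can distinguish time-step order."""
    position = torch.arange(seq_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

# One encoding per time step of a 300-second recording.
pe = sinusoidal_positional_encoding(seq_len=300, d_model=64)
```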
- Framework: PyTorch
- Environment: Google Colab
- Version Control: GitHub
Preprocessing & Training:
- Data normalization
- Loss function: Mean Squared Error (MSE)
- Train/validation split for generalization testing
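The training setup described above can be sketched as a minimal Transformer autoencoder trained with MSE. This is an illustrative reconstruction of the pipeline, not the project's actual code: the model name, layer counts, `d_model`, batch size, and learning rate are all assumptions.

```python
import torch
from torch import nn

class TinyTransformerAE(nn.Module):
    """Toy Transformer encoder-decoder that reconstructs sensor windows."""
    def __init__(self, n_sensors: int = 8, d_model: int = 64):
        super().__init__()
        self.proj_in = nn.Linear(n_sensors, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True)
        self.proj_out = nn.Linear(d_model, n_sensors)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.proj_in(x)
        out = self.transformer(h, h)   # reconstruct the same sequence
        return self.proj_out(out)

model = TinyTransformerAE()
x = torch.rand(4, 300, 8)              # a batch of normalized windows
loss_fn = nn.MSELoss()                 # the MSE objective from above
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step.
recon = model(x)
loss = loss_fn(recon, x)
opt.zero_grad()
loss.backward()
opt.step()
```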
Results:
- Low MSE between real and synthetic data
- Strong sensor-wise similarity
- Real vs. Synthetic plots show close alignment
- Smooth decrease in training and validation loss
- No overfitting observed
- Stable and well-regularized training process
Synthetic data was evaluated using an LSTM-based classifier.
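A minimal sketch of such an LSTM-based classifier over (time, sensors) windows; the hidden size and class head are illustrative assumptions, with three outputs matching the HQ/AQ/LQ labels.

```python
import torch
from torch import nn

class ENoseLSTMClassifier(nn.Module):
    """Sketch of an LSTM classifier for E-Nose windows (illustrative sizes)."""
    def __init__(self, n_sensors: int = 8, hidden: int = 32, n_classes: int = 3):
        super().__init__()
        self.lstm = nn.LSTM(n_sensors, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(x)     # final hidden state summarizes the sequence
        return self.head(h_n[-1])      # logits for HQ / AQ / LQ

clf = ENoseLSTMClassifier()
logits = clf(torch.rand(4, 300, 8))    # batch of 4 windows -> (4, 3) logits
```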
Classification Accuracy:
- Real data only: 0.6667
- Synthetic data only: 0.7500
- Hybrid (real + synthetic): 0.7083
→ Models trained on synthetic or hybrid data outperformed the real-data-only baseline.
Challenges:
- Imbalanced dataset
- Overfitting risks
- Long training times
- Handling time-series data with Transformer models
Key learnings:
- Importance of synthetic data in machine learning
- Transformers are effective for time-series generation
- Visualization and downstream evaluation are critical
- Synthetic data enhances model robustness
Conclusion:
- Successfully generated realistic synthetic E-Nose data
- Demonstrated usefulness for coffee quality classification
- Future directions:
- Statistical comparisons (PCA, t-SNE) between real and synthetic data
- Exploring GANs and Informer models for improved performance
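The proposed PCA comparison could be sketched as follows: project flattened real and synthetic windows into a shared 2-D PCA space and inspect how well the two clouds overlap. The arrays here are random placeholders for the actual data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholders for real and generated data, flattened to one row per sample
# (58 samples x 300 time steps x 8 sensors in the real dataset).
real = np.random.rand(58, 300 * 8)
synthetic = np.random.rand(58, 300 * 8)

# Fit PCA on both sets jointly so they share one 2-D coordinate system,
# then project each set for a side-by-side scatter comparison.
pca = PCA(n_components=2).fit(np.vstack([real, synthetic]))
real_2d = pca.transform(real)
synth_2d = pca.transform(synthetic)
```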