Synthetic E-Nose Data Generation Using Transformer

📌 Project Overview

This project focuses on generating realistic synthetic Electronic Nose (E-Nose) time-series data using a Transformer-based model. The goal is to improve data availability, generalization, and downstream classification performance for coffee quality assessment.

This work was carried out as part of an internship project.


🎯 Objectives

  • Generate realistic synthetic E-Nose sensor data
  • Capture and distinguish spectral patterns of different coffee quality types
  • Evaluate the usefulness of synthetic data in downstream machine learning tasks

📊 Dataset Description

The dataset is based on E-Nose measurements for Colombian coffee quality control, aimed at detecting defects during cup tests.

Key details:

  • 58 coffee samples
  • 3 quality labels:
    • High Quality (HQ)
    • Average Quality (AQ)
    • Low Quality (LQ)
  • Time-series data sampled at 1 Hz for 300 seconds
  • 8 gas sensors per sample:
    • SP-12A, SP-31, TGS-813, TGS-842
    • SP-AQ3, TGS-823, ST-31, TGS-800
  • Sensor readings are resistance values (kΩ)
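The dataset described above fits naturally into a 3-D array of shape (samples, timesteps, sensors). A minimal sketch with NumPy, using placeholder random readings since the actual data files are not part of this README; the array names are illustrative:

```python
import numpy as np

# Dataset dimensions from the description above:
# 58 samples, 300 s sampled at 1 Hz, 8 gas sensors.
N_SAMPLES, N_TIMESTEPS, N_SENSORS = 58, 300, 8
SENSORS = ["SP-12A", "SP-31", "TGS-813", "TGS-842",
           "SP-AQ3", "TGS-823", "ST-31", "TGS-800"]
LABELS = ["HQ", "AQ", "LQ"]  # High / Average / Low quality

rng = np.random.default_rng(0)
# Placeholder resistance readings in kOhm; real data would be loaded from files.
X = rng.uniform(1.0, 100.0, size=(N_SAMPLES, N_TIMESTEPS, N_SENSORS))
y = rng.integers(0, len(LABELS), size=N_SAMPLES)

print(X.shape)  # (58, 300, 8)
```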

🧠 Model Architecture

A Transformer Encoder–Decoder architecture was used for time-series reconstruction and synthetic data generation.

Key components:

  • Self-Attention to capture long-range temporal dependencies
  • Multi-Head Attention for diverse feature learning
  • Positional Encoding to preserve time-step order
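The components above can be combined into a compact PyTorch model. This is a sketch under assumptions (layer sizes, head count, and the class name `ENoseTransformer` are illustrative, not the project's actual configuration); it wires sinusoidal positional encoding into `nn.Transformer` for 8-channel sequence reconstruction:

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding to preserve time-step order."""
    def __init__(self, d_model, max_len=300):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):            # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]

class ENoseTransformer(nn.Module):
    """Encoder-decoder Transformer that reconstructs 8-channel sensor sequences."""
    def __init__(self, n_sensors=8, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.in_proj = nn.Linear(n_sensors, d_model)
        self.pos = PositionalEncoding(d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=128, batch_first=True)
        self.out_proj = nn.Linear(d_model, n_sensors)

    def forward(self, src, tgt):
        src = self.pos(self.in_proj(src))
        tgt = self.pos(self.in_proj(tgt))
        return self.out_proj(self.transformer(src, tgt))

x = torch.randn(4, 300, 8)           # (batch, timesteps, sensors)
model = ENoseTransformer()
out = model(x, x)                    # reconstruction: sequence as both src and tgt
print(out.shape)                     # torch.Size([4, 300, 8])
```

Multi-head self-attention lives inside the encoder and decoder layers of `nn.Transformer`, so long-range temporal dependencies are captured without recurrence.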

βš™οΈ Training Setup

  • Framework: PyTorch
  • Environment: Google Colab
  • Version Control: GitHub

Preprocessing & Training:

  • Data normalization
  • Loss function: Mean Squared Error (MSE)
  • Train–validation split for generalization testing
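The preprocessing and training steps above can be sketched as follows. This is a minimal illustration with a tiny stand-in model (the real project used the Transformer); the min-max normalization scheme and hyperparameters are assumptions, not documented project settings:

```python
import torch
import torch.nn as nn

# Toy stand-in for the E-Nose tensor: (samples, timesteps, sensors).
torch.manual_seed(0)
data = torch.rand(58, 300, 8) * 100.0

# Per-sensor min-max normalization to [0, 1] (one plausible choice).
lo = data.amin(dim=(0, 1), keepdim=True)
hi = data.amax(dim=(0, 1), keepdim=True)
data = (data - lo) / (hi - lo + 1e-8)

# Train-validation split for generalization testing.
n_train = int(0.8 * len(data))
train, val = data[:n_train], data[n_train:]

# Placeholder autoencoder; substitute the Transformer encoder-decoder here.
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # reconstruction loss

for epoch in range(3):
    model.train()
    opt.zero_grad()
    loss = loss_fn(model(train), train)
    loss.backward()
    opt.step()
    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(val), val)
    print(f"epoch {epoch}: train={loss.item():.4f} val={val_loss.item():.4f}")
```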

📈 Results

Reconstruction Quality

  • Low MSE between real and synthetic data
  • Strong sensor-wise similarity
  • Real vs. Synthetic plots show close alignment
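The overall and sensor-wise MSE comparisons above can be computed as below. A sketch with placeholder arrays (the real evaluation would use the actual real and generated sequences):

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.uniform(0, 1, size=(58, 300, 8))          # placeholder real data
synthetic = real + rng.normal(0, 0.01, real.shape)   # placeholder close reconstruction

# Overall MSE and per-sensor MSE between real and synthetic sequences.
overall_mse = np.mean((real - synthetic) ** 2)
per_sensor_mse = np.mean((real - synthetic) ** 2, axis=(0, 1))  # shape (8,)

print(f"overall MSE: {overall_mse:.6f}")
for i, m in enumerate(per_sensor_mse):
    print(f"sensor {i}: MSE = {m:.6f}")
```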

Training Behavior

  • Smooth decrease in training and validation loss
  • No overfitting observed
  • Stable and well-regularized training process

πŸ” Downstream Evaluation

Synthetic data was evaluated using an LSTM-based classifier.

Classification Accuracy:

  • Real data only: 0.6667
  • Synthetic data only: 0.7500
  • Hybrid (real + synthetic): 0.7083
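An LSTM classifier for this task can be sketched as below; the hidden size and class name are illustrative assumptions, not the project's documented configuration. The last hidden state summarizes the 300-step sequence before a linear head produces the three quality logits:

```python
import torch
import torch.nn as nn

class ENoseLSTMClassifier(nn.Module):
    """LSTM over the 300-step, 8-sensor sequence, classifying HQ / AQ / LQ."""
    def __init__(self, n_sensors=8, hidden=64, n_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(n_sensors, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):             # x: (batch, timesteps, sensors)
        _, (h_n, _) = self.lstm(x)    # h_n: (1, batch, hidden) - last hidden state
        return self.head(h_n[-1])     # logits: (batch, n_classes)

x = torch.randn(4, 300, 8)
logits = ENoseLSTMClassifier()(x)
print(logits.shape)                  # torch.Size([4, 3])
```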

✅ Training on synthetic data alone yielded the highest accuracy (0.7500), outperforming both real-only (0.6667) and hybrid (0.7083) training, suggesting the synthetic data improves classification performance.


⚠️ Challenges

  • Imbalanced dataset
  • Overfitting risks
  • Long training times
  • Handling time-series data with Transformer models

📚 Key Learnings

  • Importance of synthetic data in machine learning
  • Transformers are effective for time-series generation
  • Visualization and downstream evaluation are critical
  • Synthetic data enhances model robustness

🚀 Conclusion & Future Work

  • Successfully generated realistic synthetic E-Nose data
  • Demonstrated usefulness for coffee quality classification
  • Future directions:
    • Statistical comparisons (PCA, t-SNE) between real and synthetic data
    • Exploring GANs and Informer models for improved performance
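The PCA comparison mentioned in the future directions could look like the following sketch: fit principal components on the real set (via SVD), project both sets, and compare their centroids. All data here is placeholder; a small centroid gap in PC space would indicate distributional similarity:

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(size=(58, 300 * 8))              # flattened placeholder sequences
synthetic = real + rng.normal(0, 0.1, real.shape)  # placeholder synthetic set

# Fit PCA on the real data and project both sets onto the top 2 components.
mean = real.mean(axis=0)
_, _, vt = np.linalg.svd(real - mean, full_matrices=False)
proj_real = (real - mean) @ vt[:2].T               # (58, 2)
proj_syn = (synthetic - mean) @ vt[:2].T           # (58, 2)

# Similar distributions should give nearby centroids in PC space.
gap = np.linalg.norm(proj_real.mean(axis=0) - proj_syn.mean(axis=0))
print(f"centroid distance in PC space: {gap:.4f}")
```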
