The objective of this project is to develop a deep learning pipeline that converts spoken descriptions into corresponding images in real time. The project is a Speech-to-Image Generator that takes audio input, transcribes it using a Whisper model, analyzes the sentiment of the transcription, and generates an image using Stable Diffusion. The application is built using Python and Streamlit.
- Audio Recording: Record audio through your microphone.
- Transcription: Convert recorded audio to text using the Whisper model.
- Sentiment Analysis: Classify the sentiment of the transcribed text.
- Image Generation: Generate an image based on the transcription if the sentiment is positive or neutral.
The application follows these steps (a minimal end-to-end code sketch follows the list):
- Audio Input:
  - The user sets the recording duration.
  - The application records audio using the `sounddevice` library.
  - The recorded audio is saved as a `.wav` file.
- Transcription (Whisper Model):
  - The saved audio is loaded and preprocessed.
  - The Whisper model transcribes the audio into text.
- Sentiment Analysis:
  - The transcribed text is passed to a sentiment analysis pipeline.
  - The sentiment is classified as Positive, Neutral, or Negative.
- Image Generation (Stable Diffusion):
  - If the sentiment is Positive or Neutral, the transcription is passed to the Stable Diffusion pipeline, which generates a matching image.
  - If the sentiment is Negative, no image is generated and a warning is displayed to the user.
- Display Results:
  - The transcription and sentiment analysis results are displayed.
  - If the sentiment is Positive or Neutral, the generated image is also displayed alongside the results.
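To make the flow concrete, here is a minimal end-to-end sketch of these steps. The model IDs (`openai/whisper-base`, `runwayml/stable-diffusion-v1-5`), the default sentiment model, the 16 kHz sample rate, and the file names are illustrative assumptions, not necessarily what `speech_to_image.py` uses. Note also that the default `sentiment-analysis` pipeline returns only Positive/Negative labels, so a three-class model would be needed to distinguish Neutral as described above.

```python
# Minimal sketch of the record -> transcribe -> classify -> generate flow.
# All model IDs, the sample rate, and file names below are assumptions.
import sounddevice as sd
import torch
from scipy.io.wavfile import write
from transformers import pipeline
from diffusers import StableDiffusionPipeline

SAMPLE_RATE = 16_000   # Whisper models expect 16 kHz mono audio
DURATION = 5           # seconds; chosen by the user in the real app

# 1. Record audio from the microphone and save it as a .wav file
audio = sd.rec(int(DURATION * SAMPLE_RATE), samplerate=SAMPLE_RATE, channels=1)
sd.wait()  # block until the recording finishes
write("recording.wav", SAMPLE_RATE, audio)

device = 0 if torch.cuda.is_available() else -1  # transformers device index

# 2. Transcribe the recording with a Whisper checkpoint
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base", device=device)
text = asr("recording.wav")["text"]

# 3. Classify the sentiment of the transcription
classifier = pipeline("sentiment-analysis", device=device)
sentiment = classifier(text)[0]["label"]  # e.g. "POSITIVE" or "NEGATIVE"

# 4. Generate an image unless the sentiment is negative
if sentiment != "NEGATIVE":
    sd_pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    sd_pipe = sd_pipe.to("cuda" if torch.cuda.is_available() else "cpu")
    image = sd_pipe(text).images[0]
    image.save("generated.png")
else:
    print("Negative sentiment detected; skipping image generation.")
```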
```mermaid
graph TD
    A[Start: User Opens the App] --> B[Set Recording Duration]
    B --> C[Record Audio Using Microphone]
    C --> D[Save Audio as WAV File]
    D --> E[Transcribe Audio Using Whisper]
    E --> F[Perform Sentiment Analysis]
    F -->|Positive/Neutral| G[Generate Image Using Stable Diffusion]
    F -->|Negative| H[Skip Image Generation]
    G --> I[Display Transcription, Sentiment, and Image]
    H --> J[Display Transcription and Sentiment Only]
    I --> K[End]
    J --> K[End]
```
- Clone the repository:

  ```bash
  git clone https://github.com/AabidMK/Speech-to-Image-Live-Conversion-using-Deep-Learning_Infosys_Internship_Oct2024.git
  cd Speech-to-Image-Live-Conversion-using-Deep-Learning_Infosys_Internship_Oct2024
  ```

- Create a virtual environment (optional but recommended):

  ```bash
  python -m venv venv
  venv\Scripts\activate       # On Windows
  source venv/bin/activate    # On macOS/Linux
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
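For reference, a plausible (unpinned) dependency set for this stack is shown below. This list is an assumption inferred from the libraries mentioned in this README; the pinned `requirements.txt` shipped in the repository is authoritative.

```text
streamlit
sounddevice
scipy
torch
transformers
diffusers
accelerate
```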
- Start the Streamlit application:

  ```bash
  streamlit run speech_to_image.py
  ```
- Interact with the app (a minimal Streamlit sketch of this flow appears below):
  - Set the recording duration using the slider.
  - Click Start Recording 🎙️ to record your audio.
  - View the transcription and sentiment analysis.
  - If the sentiment is positive/neutral, view the generated image.
- The application will not generate an image if the sentiment is classified as negative.
- Ensure your microphone is functioning correctly for the audio recording.
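To illustrate how this interaction maps onto Streamlit widgets, here is a minimal UI sketch. The widget labels and the `run_pipeline` stub are hypothetical; the actual layout in `speech_to_image.py` may differ.

```python
import streamlit as st

def run_pipeline(duration: int):
    # Hypothetical stand-in for the record -> transcribe -> classify ->
    # generate steps from the earlier sketch; returns dummy values here.
    return "a sunny beach at dawn", "POSITIVE", None

st.title("Speech-to-Image Generator")

# Recording duration is chosen with a slider, as described above
duration = st.slider("Recording duration (seconds)", min_value=1, max_value=30, value=5)

if st.button("Start Recording 🎙️"):
    text, sentiment, image = run_pipeline(duration)
    st.write("**Transcription:**", text)
    st.write("**Sentiment:**", sentiment)
    if image is not None:
        st.image(image, caption=text)
    else:
        st.warning("Negative sentiment detected; no image was generated.")
```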
- CUDA Error:
  - If a GPU is not available, the application falls back to the CPU, which can slow down inference considerably.
  - Verify that your PyTorch installation supports CUDA (see the diagnostic sketch after this list).
- Model Loading Issues:
  - Confirm that the paths for the Whisper and Stable Diffusion models are correct.
  - Ensure the models are downloaded and accessible.
- Dependencies:
  - Use the exact versions of dependencies listed in `requirements.txt` to avoid compatibility issues.
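As a quick diagnostic for the CUDA and model-loading issues above, the following sketch checks GPU availability and loads both models with an explicit CPU fallback. The model IDs are the same illustrative ones used earlier, not necessarily the repository's choices.

```python
import torch
from transformers import pipeline
from diffusers import StableDiffusionPipeline

# Confirm whether PyTorch was built with CUDA support and can see a GPU
print("CUDA available:", torch.cuda.is_available())

device = "cuda" if torch.cuda.is_available() else "cpu"

# Loading errors here usually mean a wrong model path/ID or missing weights
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base",
               device=0 if device == "cuda" else -1)
sd_pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
print("Models loaded on:", device)
```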
Feel free to fork this repository and contribute. Open a pull request with your changes for review.
This project is licensed under the MIT License. See LICENSE for details.