This project is meant to eventually be a webapp, but the general idea is to take an .mp3 (or other audio format) file of a lecture or a meeting, generate a text transcript of that audio file, and create notes for that meeting or lecture.
Both the OpenAI and open-source versions work well. My main goal now is to build a web interface for both iterations. After that, I can work on improving the format of the notes the model produces, and eventually add a chat feature.
- Ensure you are in the `OpenAI-Notetaker` directory, then install dependencies:

```bash
pip install -r requirements.txt
```
`OpenAI-Notetaker/app.py` takes the following command-line arguments:

- `--api-key`: A key generated for the OpenAI API; required to run the program. (Eventually this should support an environment variable, but it does not at the moment; see the sketch below.)
- `--audio-file`: The audio file you want transcribed and turned into notes. Give either a path relative to the current working directory or an absolute path.
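Since environment-variable support is noted as future work, here is a minimal sketch of how the key lookup could fall back to an `OPENAI_API_KEY` variable; the fallback behavior is an assumption, not what the app currently does:

```python
# Sketch: fall back to the OPENAI_API_KEY environment variable when
# --api-key is omitted. This is proposed behavior, not current behavior.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--api-key", default=os.environ.get("OPENAI_API_KEY"))
parser.add_argument("--audio-file", required=True)
args = parser.parse_args()

if args.api_key is None:
    parser.error("provide --api-key or set OPENAI_API_KEY")
```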
Usage:

```bash
python app.py --api-key "YOUR_API_KEY" --audio-file "path/to/audio/file"
```
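Under the hood, the flow is transcription followed by note generation. A minimal sketch of that two-step pipeline with the `openai` Python package (v1+ client); the prompt wording is illustrative, not the app's actual prompt:

```python
# Sketch: transcribe with the Whisper API, then ask a chat model to turn
# the transcript into notes. The prompt text here is illustrative only.
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY")

with open("path/to/audio/file", "rb") as audio:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio)

notes = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Turn this transcript into organized meeting notes."},
        {"role": "user", "content": transcript.text},
    ],
)
print(notes.choices[0].message.content)
```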
- You must be on a Linux machine with a CUDA-enabled GPU
- The CUDA toolkit must be installed
```bash
export CUDA_HOME="<path/to/cuda>"

# There are redundancies between these commands and requirements.txt,
# but this just ensures a proper setup
pip install packaging
pip install wheel
pip install ninja
pip install torch torchvision torchaudio
pip install setuptools
pip install flash-attn --no-build-isolation
pip install -r requirements.txt
```
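After installation, a quick sanity check can confirm the setup (a minimal sketch; it only verifies that PyTorch sees the GPU and that `flash-attn` imports cleanly):

```python
# Sanity check: confirm PyTorch detects a CUDA GPU and flash-attn built correctly.
import torch
import flash_attn

assert torch.cuda.is_available(), "No CUDA-enabled GPU detected"
print(torch.cuda.get_device_name(0))
print("flash-attn version:", flash_attn.__version__)
```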
`OpenSource-Notetaker/app.py` takes the following command-line arguments:

- `--whisper`: An optional string specifying which Whisper model to load.
- `--audio-file`: The audio file you want transcribed and turned into notes. Give either a path relative to the current working directory or an absolute path.
- `--model`: The Hugging Face path to the LLM you want to use for note-taking. Default is `microsoft/Phi-3-mini-128k-instruct`.
- There is currently no option for customizing the language model to be used; this could be an addition in a later iteration, but `Phi3-mini-128k` is used by default.
- Suggested requirements for the default implementation: an RTX 3080 Ti, or any NVIDIA GPU with >= 12 GB of VRAM.
```bash
# From the root directory (/Notetaker)
python OpenSource-Notetaker/app.py --audio-file "path/to/audio/file"
```
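For reference, a minimal sketch of what this flow looks like, assuming the `openai-whisper` and `transformers` packages (plus `accelerate` for `device_map="auto"`); the prompt and generation settings are illustrative, not the app's actual code:

```python
# Sketch: transcribe locally with open-source Whisper, then generate notes
# with the default Phi-3 model through a transformers text-generation pipeline.
import whisper
from transformers import pipeline

asr = whisper.load_model("base")  # "base" is an example; --whisper selects the size
transcript = asr.transcribe("path/to/audio/file")["text"]

notetaker = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-128k-instruct",
    device_map="auto",
    trust_remote_code=True,
)
prompt = f"Turn this transcript into organized meeting notes:\n{transcript}"
print(notetaker(prompt, max_new_tokens=512)[0]["generated_text"])
```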
I need to figure out how to:

- Split a single audio file into workable batches (see the chunking sketch below)
- Ensure those batches have a dimension of 1
- Figure out model training (none of this is needed with Whisper)

Turns out Whisper is open source: https://github.com/openai/whisper

- Implement open-source Whisper and then use Hugging Face models for the note-taking portion
- Fix issues (see above)
- Create a pipeline for taking the transcription and creating (hopefully formatted) notes
- Turn it into a webapp (Flask)
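For the chunking to-do above (not needed with Whisper, per the note, but kept for reference), a sketch of splitting an audio file into fixed-length pieces with `pydub`; the 60-second length and file naming are arbitrary choices:

```python
# Sketch: split an audio file into 60-second chunks using pydub.
from pydub import AudioSegment

audio = AudioSegment.from_file("path/to/audio/file")
chunk_ms = 60 * 1000  # chunk length in milliseconds; arbitrary choice

for i, start in enumerate(range(0, len(audio), chunk_ms)):
    chunk = audio[start:start + chunk_ms]  # pydub slices by milliseconds
    chunk.export(f"chunk_{i}.mp3", format="mp3")
```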
- Following this tutorial to add auth: https://www.digitalocean.com/community/tutorials/how-to-add-authentication-to-your-app-with-flask-login#step-7-setting-up-the-authorization-function
- Need to create something that encrypts API keys in the database and decrypts them upon login (see the first sketch after this list)
- Need an upload page (see the second sketch after this list)
- Make use of environment variables for encryption secret key, database stuff, etc.
- Need to format output files so they can be utilized by RAG later
- Add a chat interface (implement chat functionality with GPT-3.5 and later open source transformers)
- Implement RAG that accesses current user's transcriptions
- Add GraphRAG from Microsoft
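For the key-encryption and environment-variable items above, a minimal sketch using the `cryptography` package's Fernet; the `NOTETAKER_SECRET_KEY` variable name is hypothetical:

```python
# Sketch: symmetric encryption of stored API keys with Fernet.
# NOTETAKER_SECRET_KEY is a hypothetical env var holding a key
# generated once with Fernet.generate_key().
import os
from cryptography.fernet import Fernet

fernet = Fernet(os.environ["NOTETAKER_SECRET_KEY"])

encrypted = fernet.encrypt(b"sk-the-users-api-key")  # store this in the database
decrypted = fernet.decrypt(encrypted).decode()       # recover it upon login
```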
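And for the upload page, a minimal Flask sketch; the route, form field name, and `uploads/` folder are all assumptions:

```python
# Sketch: a bare-bones Flask upload endpoint for audio files.
import os
from flask import Flask, request, redirect
from werkzeug.utils import secure_filename

app = Flask(__name__)
UPLOAD_DIR = "uploads"  # hypothetical destination folder
os.makedirs(UPLOAD_DIR, exist_ok=True)

@app.route("/upload", methods=["POST"])
def upload():
    f = request.files["audio-file"]  # assumed form field name
    f.save(os.path.join(UPLOAD_DIR, secure_filename(f.filename)))
    return redirect("/")
```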
- TED-LIUM (https://www.openslr.org/51)
- LibriSpeech ASR (https://openslr.org/12)
- Audio-MNIST (https://github.com/soerenab/AudioMNIST)
- Create a virtual environment with Python 3.8 and then enter the following commands:

```bash
brew install ffmpeg  # or use apt on Linux
pip3 install -r requirements.txt
```