This project aims to build a system that can identify and localize specific sub-scenes within a single dense image based on a natural language query describing one of the events occurring in the scene.
# Setup and Installation
Follow these steps to set up the environment and run the project.
First, clone this repository to your local machine:
- git clone https://github.com/prakhar14-op/Scene-Localization-in-Dense-Images-via-Natural-Language-Queries-
- cd Scene-Localization-in-Dense-Images-via-Natural-Language-Queries-
- Create a folder named "weights" in the main project folder.
- Download the model weights from this link:
- Place the downloaded files in the "weights" folder.
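Before continuing, it can help to confirm the folder layout from the project root. This is a minimal sketch, not part of the repository: `check_layout` and `list_weight_files` are hypothetical helpers. Because the weight file names are not given here, it only verifies that a `weights` folder exists and lists whatever files it contains.

```python
from pathlib import Path

# Entries this README mentions. The weight file names are not listed in the
# README, so only the "weights" folder itself is checked (an assumption).
EXPECTED = ["aims.py", "requirements.txt", "GroundingDINO", "weights"]


def check_layout(project_root: str = ".") -> list[str]:
    """Return the expected entries that are missing under project_root."""
    root = Path(project_root)
    return [name for name in EXPECTED if not (root / name).exists()]


def list_weight_files(project_root: str = ".") -> list[str]:
    """List files in the weights folder, sorted by name (empty if the folder is absent)."""
    weights = Path(project_root) / "weights"
    if not weights.is_dir():
        return []
    return sorted(p.name for p in weights.iterdir() if p.is_file())


if __name__ == "__main__":
    print("Missing:", check_layout() or "nothing")
    print("Weight files:", list_weight_files())
```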
This project requires Python 3.9. Using a virtual environment is highly recommended. Create and activate one:
- On Windows, create the environment: py -3.9 -m venv venv
- Then activate it: .\venv\Scripts\activate
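To confirm that the right interpreter is active, the following sketch checks for Python 3.9 and for an active virtual environment. The helper names are hypothetical and only the standard library is used. It relies on the fact that a venv sets `sys.prefix` to the environment folder while `sys.base_prefix` keeps pointing at the base installation.

```python
import sys


def in_virtualenv() -> bool:
    """True when running inside an environment created with `python -m venv`."""
    # Inside a venv, sys.prefix is the environment directory, while
    # sys.base_prefix still points at the base interpreter installation.
    return sys.prefix != sys.base_prefix


def is_python_39() -> bool:
    """True when the interpreter is Python 3.9.x, the version this README requires."""
    return sys.version_info[:2] == (3, 9)


if __name__ == "__main__":
    print(f"Python {sys.version.split()[0]} (3.9: {is_python_39()})")
    print(f"Inside a virtual environment: {in_virtualenv()}")
```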
With the virtual environment activated, install all required packages.
- pip install -r requirements.txt
- pip install -e GroundingDINO/
Note: If you encounter build errors on Windows, you may need to install the Microsoft C++ Build Tools.
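After installing, one way to check that the packages are importable is to look them up with `importlib.util.find_spec`, which locates a top-level module without importing it. The module names in the example are assumptions: `groundingdino` is taken to be the package provided by the editable install above, and `torch` is assumed to be among the requirements. Replace the list with the actual contents of requirements.txt.

```python
import importlib.util


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found in the active environment."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed names: "groundingdino" from `pip install -e GroundingDINO/`,
    # "torch" from requirements.txt. Adjust to the real requirements.
    print("Missing:", missing_modules(["torch", "groundingdino"]) or "none")
```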
Once installation is complete, run the project:
- Make sure your virtual environment is activated.
- Run the main script from your terminal:
- python aims.py
- The script will prompt you to enter the path to an image and your text query.
- The results (an annotated image and a cropped image) will be saved in a new folder named "results".
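The cropped image is presumably the region inside the box selected for the query. The sketch below illustrates that step under stated assumptions and is not taken from aims.py. It assumes the detector returns boxes as normalized (cx, cy, w, h) values, the format used by GroundingDINO's inference utilities, and keeps the highest-scoring box at or above a threshold. `to_pixel_box`, `best_box`, and the 0.35 default threshold are hypothetical choices.

```python
def to_pixel_box(box, width, height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1), clamped to the image."""
    # Assumption: box values are fractions of the image size, as in
    # GroundingDINO's predicted boxes.
    cx, cy, w, h = box
    x0 = max(0, round((cx - w / 2) * width))
    y0 = max(0, round((cy - h / 2) * height))
    x1 = min(width, round((cx + w / 2) * width))
    y1 = min(height, round((cy + h / 2) * height))
    return x0, y0, x1, y1


def best_box(boxes, scores, threshold=0.35):
    """Return the highest-scoring box whose score is at or above threshold, or None."""
    # The 0.35 default is a hypothetical choice, not a value from this project.
    candidates = [(s, b) for s, b in zip(scores, boxes) if s >= threshold]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1]
```

With Pillow, for example, the crop could then be taken as `image.crop(to_pixel_box(box, *image.size))`, since `Image.size` is `(width, height)` and `Image.crop` expects `(left, upper, right, lower)` pixel coordinates.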
Example outputs