Vision Transformer from Scratch using only the C++ Standard Library
ViST is a C++ implementation of the Vision Transformer (ViT) model for image classification that uses only the C++ standard library. It currently classifies images into 3 classes.
- C++17 or newer.
- Clang (or edit CMakeLists.txt to use a different compiler).
- CMake (for building the project).
- stb_image.h (for loading images).
Clone the repository to your local machine:
git clone https://github.com/allanhanan/ViST.git
cd ViST
Ensure you have CMake installed.
From the project root directory, create a build directory and compile:
mkdir build
cd build
cmake ..
make
This will generate the executable `ViT` in the `build` folder.
To train the model, run the following command from the project's root directory:
./ViT
It will start training the model using the images located in the following directory structure:
program_root/
└── train/
    ├── apple/
    │   ├── image1.png
    │   ├── image2.png
    │   └── image3.png
    ├── orange/
    │   ├── image1.png
    │   ├── image2.png
    │   └── image3.png
    └── banana/
        ├── image1.png
        ├── image2.png
        └── image3.png
Note: The image directory path is currently hardcoded in the source code.
After training the model, test it on an image by running:
./ViT /path/to/model_checkpoint.bin /path/to/test_image.png
Example:
./ViT /home/allan/project/viT/vit/build/model_checkpoint.bin /home/allan/project/viT/vit/test.png
Parameters are hardcoded for now, and only CPU training is supported.
It also uses stb_image.h, so technically it isn't using only the standard library, but don't care + L + Ratio.
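The two invocations described above (no arguments to train, a checkpoint plus an image to test) could be told apart by argument count. This is a minimal sketch under that assumption; `RunConfig` and `parse_args` are hypothetical names and do not come from the ViST source.

```cpp
#include <optional>
#include <string>

// Hypothetical description of what one invocation asks for.
struct RunConfig {
    bool train = true;            // true: train; false: classify one image
    std::string checkpoint_path;  // only meaningful when train == false
    std::string image_path;       // only meaningful when train == false
};

// Map the argument vector to a RunConfig.
// Returns std::nullopt unless there are exactly 0 or 2 extra arguments.
std::optional<RunConfig> parse_args(int argc, const char* const argv[]) {
    if (argc == 1) {
        return RunConfig{};  // ./ViT
    }
    if (argc == 3) {         // ./ViT <checkpoint> <image>
        RunConfig cfg;
        cfg.train = false;
        cfg.checkpoint_path = argv[1];
        cfg.image_path = argv[2];
        return cfg;
    }
    return std::nullopt;
}
```

A `main` function could call `parse_args(argc, argv)` and print a usage message when it returns `std::nullopt`, instead of silently starting a training run on a mistyped command.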