
Vision Transformer (ViT) Implementation

This repository provides a complete implementation of the Vision Transformer (ViT) architecture as described in the paper "An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale" (ICLR 2021).

Project Structure

ViT/
├── models/
│   ├── __init__.py
│   ├── patch_embedding.py
│   ├── positional_embedding.py
│   ├── transformer_encoder.py
│   └── vit.py
├── data/
│   ├── __init__.py
│   └── data_loader.py
├── train.py
├── eval.py
├── test.py
└── README.md

Components

Main ViT Architecture (models/vit.py)

  • Defines the overall Vision Transformer model
  • Integrates patch embedding, positional embedding, and a stack of Transformer encoder layers
  • Uses a [CLS] token whose final representation is fed into a classification head
  • Provides factory functions for three variants (see the sketch after this list):
    • ViT-Base: 12 layers, 768-dim hidden size, 12 heads (~86M params)
    • ViT-Large: 24 layers, 1024-dim hidden size, 16 heads (~307M params)
    • ViT-Huge: 32 layers, 1280-dim hidden size, 16 heads (~632M params)
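
The repository's own modules are listed above; as an illustration only, here is a minimal sketch of how the model and a vit_base factory could be wired together. It substitutes PyTorch's built-in nn.TransformerEncoder for the repository's custom encoder, and the class and function names are hypothetical, not the repo's exact API.

import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    """Minimal ViT: patch embedding -> [CLS] + positional embedding -> encoder -> head."""
    def __init__(self, image_size=224, patch_size=16, in_chans=3,
                 embed_dim=768, depth=12, num_heads=12, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding as a strided convolution over non-overlapping 16x16 patches.
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=embed_dim * 4,
                                           activation="gelu", batch_first=True,
                                           norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x)                        # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)               # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                      # classify from the [CLS] token

def vit_base(num_classes=10):
    # 12 layers, 768-dim hidden size, 12 heads, as listed above.
    return SimpleViT(embed_dim=768, depth=12, num_heads=12, num_classes=num_classes)

# Example: logits = vit_base()(torch.randn(2, 3, 224, 224))  -> shape (2, 10)

vit_large and vit_huge would pass the larger depth/width/head configurations listed above in the same way.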

Other Components

  • Patch Embedding (models/patch_embedding.py): Splits input images into fixed-size patches and projects them to the model's latent space.
  • Positional Embedding (models/positional_embedding.py): Adds learnable positional embeddings to retain spatial information.
  • Transformer Encoder (models/transformer_encoder.py): Implements the standard Transformer encoder block with multi-head self-attention and MLP layers (see the sketch after this list).
  • Data Preprocessing (data/data_loader.py): Provides functions to load and preprocess data (using CIFAR-10 as an example).
  • Training Script (train.py): Trains the selected ViT model variant on the dataset.
  • Evaluation Script (eval.py): Loads a saved checkpoint and evaluates the model on the validation set.
  • Test Script (test.py): Runs inference on a sample image and prints the predicted class index.
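
The encoder block referenced above is described but not reproduced here. As a rough sketch only (not the repository's code), a pre-norm block with multi-head self-attention and an MLP, built on PyTorch's nn.MultiheadAttention, might look like:

import torch.nn as nn

class EncoderBlock(nn.Module):
    """Pre-norm encoder block: LN -> MSA -> residual, then LN -> MLP -> residual."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4, dropout=0.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout,
                                          batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # self-attention + residual
        x = x + self.mlp(self.norm2(x))                     # MLP + residual
        return x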

How to Run

Training

Run the training script with your chosen model variant and hyperparameters:

python train.py --model vit_base --epochs 10 --lr 0.001 --batch_size 64

To train ViT-Large or ViT-Huge, change the --model argument accordingly.

Evaluation

To evaluate a pre-trained model:

python eval.py --model vit_base --checkpoint best_model.pth --batch_size 64

Ensure the checkpoint file (best_model.pth) exists in the working directory.

Testing / Inference

To run inference on a sample image:

python test.py --model vit_base --image path_to_image.jpg

This will load the image, preprocess it, run it through the model, and print the predicted class index.

Modifying Hyperparameters and Switching Variants

  • Switching Variants: In all scripts, use the --model argument to select vit_base, vit_large, or vit_huge.
  • Hyperparameters: Adjust learning rate, epochs, and batch size via command-line arguments in train.py.
  • Data Preprocessing: Modify image size, normalization, or dataset selection in data/data_loader.py as needed (a typical loader is sketched below).
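
The exact contents of data/data_loader.py are not reproduced here. Assuming the usual torchvision pattern of resizing CIFAR-10 images to the ViT input resolution and normalizing them (an assumption; check the file for the real transforms and arguments), a loader might be sketched as:

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def get_dataloaders(batch_size=64, image_size=224):
    # Resize CIFAR-10's 32x32 images to the ViT input size and normalize to [-1, 1].
    transform = transforms.Compose([
        transforms.Resize((image_size, image_size)),
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
    ])
    train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
    val_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)
    return (DataLoader(train_set, batch_size=batch_size, shuffle=True),
            DataLoader(val_set, batch_size=batch_size))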

Sample Input and Expected Output

  • Input: A sample image (e.g., a CIFAR-10 image) resized to 224×224.
  • Output: The model outputs a predicted class index. Example:
    Predicted class index: 3
    

Requirements and Installation

  • Python 3.x
  • PyTorch
  • Torchvision
  • Pillow

Install dependencies with:

pip install torch torchvision pillow

Running the Entire Pipeline

  1. Training:

    python train.py --model vit_base --epochs 10 --lr 0.001 --batch_size 64
    
  2. Evaluation:

    python eval.py --model vit_base --checkpoint best_model.pth --batch_size 64
    
  3. Testing/Inference:

    python test.py --model vit_base --image path_to_image.jpg
    

Switch between ViT-Base, ViT-Large, and ViT-Huge by changing the --model argument.

Detailed Process Explanation

Training Process

  • Command Structure: Use --model, --epochs, --lr, and --batch_size arguments.
  • Examples:
    • ViT-Base: python train.py --model vit_base --epochs 10 --lr 0.001 --batch_size 64
    • ViT-Large: python train.py --model vit_large --epochs 10 --lr 0.001 --batch_size 64
    • ViT-Huge: python train.py --model vit_huge --epochs 10 --lr 0.001 --batch_size 64
  • Process: Loads the dataset, instantiates the model, trains with the Adam optimizer, and saves the best checkpoint (sketched below).
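
train.py itself is not reproduced in this README. Under the assumptions stated above (Adam optimizer, best checkpoint saved to best_model.pth), a minimal training loop might look like the following sketch; the helper and argument names are illustrative, not the script's actual ones.

import torch
import torch.nn as nn

def accuracy(model, loader, device):
    """Fraction of correctly classified examples in a data loader."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images.to(device)).argmax(dim=1)
            correct += (preds == labels.to(device)).sum().item()
            total += labels.size(0)
    return correct / total

def train(model, train_loader, val_loader, epochs=10, lr=1e-3, device="cpu"):
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    best_acc = 0.0
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        val_acc = accuracy(model, val_loader, device)
        if val_acc > best_acc:  # keep only the best-performing weights
            best_acc = val_acc
            torch.save(model.state_dict(), "best_model.pth")
        print(f"epoch {epoch + 1}: val_acc={val_acc:.3f}")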

Evaluation Process

  • Command Structure: Use --model, --checkpoint, and --batch_size arguments.
  • Example: python eval.py --model vit_base --checkpoint best_model.pth --batch_size 64
  • Process: Instantiates the model, loads the saved weights, and computes accuracy on the validation set (sketched below).
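
As a sketch only, and assuming the checkpoint stores a plain state_dict (the function name here is illustrative):

import torch

def evaluate_checkpoint(model, checkpoint_path, val_loader, device="cpu"):
    # Restore the saved weights and measure top-1 accuracy on the validation set.
    model.load_state_dict(torch.load(checkpoint_path, map_location=device))
    model = model.to(device).eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in val_loader:
            preds = model(images.to(device)).argmax(dim=1)
            correct += (preds == labels.to(device)).sum().item()
            total += labels.size(0)
    print(f"Validation accuracy: {correct / total:.3f}")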

Testing (Inference) Process

  • Command Structure: Use --model and --image arguments.
  • Example: python test.py --model vit_base --image path_to_image.jpg
  • Process: Loads and preprocesses the image, instantiates the model, and prints the predicted class index (sketched below).
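
A sketch of the inference step, assuming the same 224×224 preprocessing used for training (the actual test.py may differ):

import torch
from PIL import Image
from torchvision import transforms

def predict(model, image_path, device="cpu"):
    # Preprocess exactly as in training: resize to 224x224, convert, normalize.
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
    ])
    image = Image.open(image_path).convert("RGB")
    batch = transform(image).unsqueeze(0).to(device)  # add a batch dimension
    model = model.to(device).eval()
    with torch.no_grad():
        class_index = model(batch).argmax(dim=1).item()
    print(f"Predicted class index: {class_index}")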

Switching Between Variants and Modifying Hyperparameters

  • Switching Variants: Use --model flag with vit_base, vit_large, or vit_huge.
  • Modifying Hyperparameters: Adjust epochs, learning rate, and batch size in train.py. For deeper changes, modify factory functions in models/vit.py.
