
Implement Llama 3.2 Vision (Multimodal) Architecture #2470

@Vivek1106-04

Description


While Keras Hub currently supports the Llama 3 (and 3.1/3.2) text-only backbones via Llama3Backbone, it lacks support for the multimodal Llama 3.2 Vision (11B and 90B) variants.

Users who want to run the official multimodal weights currently have to resort to PyTorch/Transformers or Unsloth, as there is no native Keras 3 / JAX implementation for the Vision Adapter and the specific Cross-Attention mechanism used in these models.

The solution I want to implement
I would like to propose adding full support for the Llama 3.2 Vision architecture. This involves the following (a rough code sketch follows the list):

  1. Vision Encoder: Implementing Llama3VisionEncoder, which wraps the CLIP-like ViT tower used in the 11B/90B models.
  2. Cross-Attention Blocks: Implementing a Llama3VisionBlock (or similar) to handle the gated cross-attention layers that inject visual features into the text decoder at specific intervals (e.g., every 4th layer).
  3. Model Class: Adding Llama3Vision (inheriting from Llama3CausalLM or a new Multimodal base) to handle the image+text inputs.
  4. Weights: Porting the official Meta weights to the Keras Hub format.
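
To make item 2 concrete, here is a minimal sketch of what a gated cross-attention block could look like as a Keras 3 layer. The class name, constructor arguments, and gating details are my assumptions from reading the reference code, not a final API; as far as I can tell, the real block would also need attention masking for padded image tiles and a gated MLP after the attention step.

```python
import keras
from keras import layers, ops


class GatedCrossAttentionBlock(keras.layers.Layer):
    """Hypothetical decoder block that cross-attends to vision features.

    Text hidden states are the queries; projected image features are the
    keys/values. A learnable tanh gate initialized to zero lets the block
    start as a no-op, preserving the pretrained text behavior.
    """

    def __init__(self, hidden_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.cross_attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=hidden_dim // num_heads
        )
        # The real model uses RMSNorm; LayerNormalization stands in here.
        self.norm = layers.LayerNormalization(epsilon=1e-5)
        # Gate starts at 0 so the untrained block does not disturb the text path.
        self.attn_gate = self.add_weight(
            name="attn_gate", shape=(1,), initializer="zeros"
        )

    def call(self, hidden_states, vision_features):
        attn_out = self.cross_attention(
            query=self.norm(hidden_states),
            key=vision_features,
            value=vision_features,
        )
        return hidden_states + ops.tanh(self.attn_gate) * attn_out


# Smoke test with dummy shapes (batch=2, 16 text tokens, 64 image patches).
block = GatedCrossAttentionBlock(hidden_dim=128, num_heads=4)
text = keras.random.normal((2, 16, 128))
image = keras.random.normal((2, 64, 128))
print(block(text, image).shape)  # (2, 16, 128)
```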

Alternatives I've considered
I considered using the existing Llama3Backbone and injecting visual tokens into the embedding layer (early fusion), but that approach does not match Llama 3.2 11B/90B. These models use a gated cross-attention architecture in which the image embeddings bypass the input embedding layer and are attended to directly by specific decoder layers, so a dedicated architectural addition is required.
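
To illustrate the difference, here is a toy functional wiring with placeholder dimensions and layer counts, using plain MultiHeadAttention in place of the real grouped-query/RoPE attention: the image features enter only at the designated cross-attention layers and never pass through the token embedding.

```python
import keras
from keras import layers

# Placeholder dimensions for illustration only.
vocab_size, seq_len, num_patches, hidden_dim = 1000, 32, 64, 128

token_ids = keras.Input(shape=(seq_len,), dtype="int32", name="token_ids")
image_features = keras.Input(
    shape=(num_patches, hidden_dim), name="image_features"
)

# Early fusion would concatenate image tokens here; Llama 3.2 Vision does not.
x = layers.Embedding(vocab_size, hidden_dim)(token_ids)

for layer_index in range(8):
    # Standard self-attention block (simplified: no RoPE, no causal mask).
    x = x + layers.MultiHeadAttention(4, hidden_dim // 4)(x, x)
    if layer_index % 4 == 0:
        # Designated layers cross-attend to the vision features directly.
        x = x + layers.MultiHeadAttention(4, hidden_dim // 4)(x, image_features)

toy_model = keras.Model([token_ids, image_features], x)
toy_model.summary()
```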

Additional context
I am interested in working on this for GSoC 2026 and have started analyzing the PyTorch reference implementation.

I plan to start by drafting the Configuration and Image Preprocessor classes this week to establish the API structure. I will follow up with a Pull Request containing the initial Llama3VisionConfig and architectural skeleton within the next 48 hours.

Technical Detail: The vision adapter injects features into the transformer layers using a gated_cross_attn mechanism. I propose adding a cross_attention_layers list to the Llama3VisionConfig to control which layers receive this injection, keeping the implementation flexible for both 11B and 90B variants.
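
A sketch of that knob, with placeholder layer counts and indices; the actual values would be taken from the Meta checkpoints during weight porting:

```python
# Hypothetical config dicts; only `cross_attention_layers` is the point here.
LLAMA32_VISION_11B = {
    "num_layers": 32,
    "cross_attention_layers": list(range(0, 32, 4)),  # e.g. every 4th layer
}
LLAMA32_VISION_90B = {
    "num_layers": 80,
    "cross_attention_layers": list(range(0, 80, 4)),
}


def build_decoder_plan(config):
    """Shows how the backbone could decide, per layer, whether to insert a
    gated cross-attention block in front of the self-attention block."""
    return [
        "cross+self" if i in config["cross_attention_layers"] else "self"
        for i in range(config["num_layers"])
    ]


print(build_decoder_plan(LLAMA32_VISION_11B)[:8])
```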
