
Implement Llama 3.2 Vision (Multimodal) Architecture #2470

@Vivek1106-04

Description


While Keras Hub currently supports the Llama 3 (and 3.1/3.2) text-only backbones via Llama3Backbone, it lacks support for the multimodal Llama 3.2 Vision (11B and 90B) variants.

Users who want to run the official multimodal weights currently have to resort to PyTorch/Transformers or Unsloth, as there is no native Keras 3 / JAX implementation for the Vision Adapter and the specific Cross-Attention mechanism used in these models.

The solution I want to implement
I would like to propose adding full support for the Llama 3.2 Vision architecture. This involves the following (a rough code sketch follows the list):

  1. Vision Encoder: Implementing Llama3VisionEncoder, which wraps the CLIP-like ViT tower used in the 11B/90B models.
  2. Cross-Attention Blocks: Implementing a Llama3VisionBlock (or similar) to handle the gated cross-attention layers that inject visual features into the text decoder at specific intervals (e.g., every 4th layer).
  3. Model Class: Adding Llama3Vision (inheriting from Llama3CausalLM or a new Multimodal base) to handle the image+text inputs.
  4. Weights: Porting the official Meta weights to the Keras Hub format.
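
To make item 2 concrete, here is a minimal sketch of what a gated cross-attention block could look like as a Keras 3 layer. The class name, constructor arguments, and gating details are my assumptions from reading the reference code, not a final API; as far as I can tell, the real block would also need attention masking for padded image tiles and a gated MLP after the attention step.

```python
import keras
from keras import layers, ops


class GatedCrossAttentionBlock(keras.layers.Layer):
    """Hypothetical decoder block that cross-attends to vision features.

    Text hidden states are the queries; projected image features are the
    keys/values. A learnable tanh gate initialized to zero lets the block
    start as a no-op, preserving the pretrained text behavior.
    """

    def __init__(self, hidden_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.cross_attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=hidden_dim // num_heads
        )
        # The real model uses RMSNorm; LayerNormalization stands in here.
        self.norm = layers.LayerNormalization(epsilon=1e-5)
        # Gate starts at 0 so the untrained block does not disturb the text path.
        self.attn_gate = self.add_weight(
            name="attn_gate", shape=(1,), initializer="zeros"
        )

    def call(self, hidden_states, vision_features):
        attn_out = self.cross_attention(
            query=self.norm(hidden_states),
            key=vision_features,
            value=vision_features,
        )
        return hidden_states + ops.tanh(self.attn_gate) * attn_out


# Smoke test with dummy shapes (batch=2, 16 text tokens, 64 image patches).
block = GatedCrossAttentionBlock(hidden_dim=128, num_heads=4)
text = keras.random.normal((2, 16, 128))
image = keras.random.normal((2, 64, 128))
print(block(text, image).shape)  # (2, 16, 128)
```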

Alternatives I've considered
I considered using the existing Llama3Backbone and injecting visual tokens into the embedding layer (early fusion), but that approach does not match Llama 3.2 11B/90B. These models use a gated cross-attention architecture in which the image embeddings bypass the input embedding layer and are attended to directly by specific decoder layers, so a dedicated architectural addition is required.
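
To illustrate the difference, here is a toy functional wiring with placeholder dimensions and layer counts, using plain MultiHeadAttention in place of the real grouped-query/RoPE attention: the image features enter only at the designated cross-attention layers and never pass through the token embedding.

```python
import keras
from keras import layers

# Placeholder dimensions for illustration only.
vocab_size, seq_len, num_patches, hidden_dim = 1000, 32, 64, 128

token_ids = keras.Input(shape=(seq_len,), dtype="int32", name="token_ids")
image_features = keras.Input(
    shape=(num_patches, hidden_dim), name="image_features"
)

# Early fusion would concatenate image tokens here; Llama 3.2 Vision does not.
x = layers.Embedding(vocab_size, hidden_dim)(token_ids)

for layer_index in range(8):
    # Standard self-attention block (simplified: no RoPE, no causal mask).
    x = x + layers.MultiHeadAttention(4, hidden_dim // 4)(x, x)
    if layer_index % 4 == 0:
        # Designated layers cross-attend to the vision features directly.
        x = x + layers.MultiHeadAttention(4, hidden_dim // 4)(x, image_features)

toy_model = keras.Model([token_ids, image_features], x)
toy_model.summary()
```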

Additional context
I am interested in working on this for GSoC 2026 and have started analyzing the PyTorch reference implementation.

I plan to start by drafting the Configuration and Image Preprocessor classes this week to establish the API structure. I will follow up with a Pull Request containing the initial Llama3VisionConfig and architectural skeleton within the next 48 hours.

Technical Detail: The vision adapter injects features into the transformer layers using a gated_cross_attn mechanism. I propose adding a cross_attention_layers list to the Llama3VisionConfig to control which layers receive this injection, keeping the implementation flexible for both 11B and 90B variants.
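
A sketch of that knob, with placeholder layer counts and indices; the actual values would be taken from the Meta checkpoints during weight porting:

```python
# Hypothetical config dicts; only `cross_attention_layers` is the point here.
LLAMA32_VISION_11B = {
    "num_layers": 32,
    "cross_attention_layers": list(range(0, 32, 4)),  # e.g. every 4th layer
}
LLAMA32_VISION_90B = {
    "num_layers": 80,
    "cross_attention_layers": list(range(0, 80, 4)),
}


def build_decoder_plan(config):
    """Shows how the backbone could decide, per layer, whether to insert a
    gated cross-attention block in front of the self-attention block."""
    return [
        "cross+self" if i in config["cross_attention_layers"] else "self"
        for i in range(config["num_layers"])
    ]


print(build_decoder_plan(LLAMA32_VISION_11B)[:8])
```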
