AMoE

Agglomerative Mixture-of-Experts Vision Foundation Model
Sofian Chaybouti, Sanath Narayan, Yasser Dahou, Phúc H. Lê Khac, Ankit Singh, Ngoc Dung Huynh, Wamiq Reyaz Para, Hilde Kuehne, Hakim Hacid

Overview

AMoE Vision Foundation Model

FIG. 1: A Mixture-of-Experts student is distilled from multiple frozen vision teachers (SigLIP2 and DINOv3).

Vision foundation models trained via multi-teacher distillation offer a promising path toward unified visual representations. We introduce AMoE (Agglomerative Mixture-of-Experts), which distills knowledge from SigLIP2 and DINOv3 simultaneously into a single Mixture-of-Experts student.

AMoE sets a new state of the art on global representation and retrieval benchmarks while training on significantly fewer tokens than competitors such as RADIOv2.5.

AMoE Framework

// State-of-the-Art Comparison

AMoE (0.3B active / 0.6B total parameters) outperforms RADIOv2.5-H (0.6B parameters) on global representation tasks while using more than 4x fewer training tokens.

Method         Image-Text Avg (Top-1)   kNN Avg (Top-1)
RADIOv2.5-H    82.26                    84.42
AMoE           84.13                    87.44

// Key Contributions

01. OpenLVD200M Dataset

We introduce OpenLVD200M, a curated 200M-image corpus constructed via hierarchical clustering. It provides balanced coverage of visual concepts, and we demonstrate its effectiveness for multi-teacher distillation.

02. MoE Architecture

We use a Mixture-of-Experts (MoE) backbone to learn complementary signals from different teachers.
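A minimal sketch of the idea behind a top-k routed MoE layer, assuming a standard softmax router and ReLU MLP experts; the expert count, router design, and gating details of AMoE's actual backbone are not specified here, so all names (`num_experts`, `top_k`) are illustrative.

```python
import numpy as np

def moe_layer(x, w_router, experts, top_k=2):
    """Route each token to its top-k experts and mix their outputs.

    x:        (tokens, dim) input features
    w_router: (dim, num_experts) routing weights
    experts:  list of (w_in, w_out) ReLU-MLP expert weights
    """
    logits = x @ w_router                            # per-expert routing scores
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)            # softmax over experts
    idx = np.argsort(-probs, axis=-1)[:, :top_k]     # top-k experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gates = probs[t, idx[t]]
        gates = gates / gates.sum()                  # renormalize selected gates
        for g, e in zip(gates, idx[t]):
            w_in, w_out = experts[e]
            h = np.maximum(x[t] @ w_in, 0.0)         # ReLU MLP expert
            out[t] += g * (h @ w_out)
    return out
```

With top-k routing, only the selected experts' parameters are "active" per token, which is how a 0.6B-total model can run with 0.3B active parameters.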

03. Token-Balanced Batching

We use token-balanced batching to stabilize training across varying resolutions, packing images to a fixed token budget.
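A greedy packing sketch of the batching idea: images of varying native resolution are grouped into sequences whose total patch-token count stays within a fixed budget. This is an illustration of the principle, not AMoE's exact packing algorithm.

```python
def pack_to_budget(token_counts, budget):
    """Greedily pack images into sequences of at most `budget` tokens.

    token_counts: per-image patch-token counts (native resolution)
    Returns a list of sequences, each a list of image indices.
    """
    seqs, cur, cur_tokens = [], [], 0
    for i, n in enumerate(token_counts):
        if cur and cur_tokens + n > budget:
            seqs.append(cur)                 # sequence full: start a new one
            cur, cur_tokens = [], 0
        cur.append(i)
        cur_tokens += n
    if cur:
        seqs.append(cur)
    return seqs
```

Because every batch carries roughly the same number of tokens regardless of image resolution, the compute per step stays constant and low-resolution images are not crowded out.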

04. Asymmetric RKD

We introduce asymmetric relational knowledge distillation (ARKD) to preserve the pairwise geometry of the teacher embeddings.
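A relational-KD sketch of the underlying objective: instead of matching embeddings point-wise, the student matches the teacher's pairwise similarity structure within a batch. The specific asymmetric weighting that distinguishes ARKD is not reproduced here; this shows only the symmetric relational baseline it builds on.

```python
import numpy as np

def pairwise_cos(z):
    """Pairwise cosine-similarity matrix of a batch of embeddings."""
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)
    return z @ z.T

def relational_kd_loss(student, teacher):
    """RKD-style loss: mean squared error between the student's and the
    (frozen) teacher's pairwise cosine-similarity matrices. In training,
    no gradient flows through the teacher branch."""
    s = pairwise_cos(student)
    t = pairwise_cos(teacher)
    return float(np.mean((s - t) ** 2))
```

Matching pairwise similarities rather than raw features lets the student preserve the teacher's embedding geometry even when the two live in different feature spaces.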

Token Balanced Batching

FIG. 2: Packing multiple native-resolution images per sequence up to a fixed token budget prevents low-res forgetting and improves performance.

OpenLVD200M Dataset

// Construction

We construct OpenLVD200M to mitigate the long-tail distribution inherent in web-scale data. Inspired by the hierarchical clustering and sampling technique from Vo et al. (2024), we process a 2.3B-image pool (combining DFN and LAION) to learn a semantic hierarchy of visual concepts. By sampling uniformly across these semantic clusters rather than the raw data distribution, we ensure balanced coverage of both common and rare concepts.
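The cluster-balanced sampling step can be sketched as follows: given cluster assignments from the learned hierarchy, draw up to a fixed number of images per cluster instead of sampling from the raw (long-tailed) distribution. The function name and `per_cluster` cap are illustrative, not the paper's exact procedure.

```python
import numpy as np

def balanced_sample(cluster_ids, per_cluster, rng=None):
    """Sample up to `per_cluster` items uniformly from each cluster,
    flattening the long-tail distribution of raw web-scale data."""
    if rng is None:
        rng = np.random.default_rng(0)
    picked = []
    for c in np.unique(cluster_ids):
        members = np.flatnonzero(cluster_ids == c)
        k = min(per_cluster, members.size)        # rare clusters keep all items
        picked.extend(rng.choice(members, size=k, replace=False))
    return sorted(picked)
```

Head clusters are capped while tail clusters survive intact, so rare visual concepts are no longer drowned out by near-duplicate common ones.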

// Ablation: OpenLVD vs. Random Sampling

Method           Image-Text Avg   kNN Avg   T2I     I2T
Random (200M)    74.96            82.66     57.63   75.12
OpenLVD200M      79.11            85.08     59.14   76.43

In-Depth Analysis

1. RoPE Impact: Golden vs Axial

We investigate the impact of the Rotary Positional Embedding (RoPE) strategy on the student's ability to generalize to unseen high resolutions. We compare standard Axial RoPE, which uses absolute integer patch indices, against Golden RoPE [Xiong, 2025], which normalizes the input coordinates by the image aspect ratio (mapping them roughly to [-1, 1]).


FIG. 3: Comparison of feature map consistency across resolutions. Golden RoPE (normalized coordinates) maintains strong scale invariance and feature coherence even at unseen resolutions ($2048 \times 2048$), whereas standard Axial RoPE degrades.
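The normalized-coordinate idea can be sketched directly: map patch grid positions to a fixed, aspect-ratio-aware range so the RoPE phases seen at test-time high resolutions stay inside the range seen during training. This shows only the coordinate mapping; the Golden RoPE frequency schedule of Xiong (2025) is not reproduced here.

```python
import numpy as np

def normalized_coords(h, w):
    """Aspect-ratio-normalized patch coordinates in roughly [-1, 1].

    Unlike absolute integer indices, these stay in a fixed range at any
    resolution, so rotary phases extrapolate to unseen image sizes. The
    longer side spans [-1, 1]; the shorter side is scaled proportionally.
    """
    s = max(h, w)
    ys = np.linspace(-1.0, 1.0, h) * (h / s)
    xs = np.linspace(-1.0, 1.0, w) * (w / s)
    # (h, w, 2) grid of (y, x) coordinates for the patch tokens
    return np.stack(np.meshgrid(ys, xs, indexing="ij"), axis=-1)
```

With absolute indices, a 2048x2048 input produces positions far outside the training range; with normalized coordinates, it only samples the same [-1, 1] interval more densely, which is consistent with the scale invariance seen in FIG. 3.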

2. PCA Visualizations

We provide a qualitative comparison of the learned student representations against the original teacher features.


FIG. 4: PCA maps showing the student (AMoE) closely reconstructing the teacher distributions. The student retains SigLIP2's text-aware features and DINOv3's geometric consistency.
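Maps like those in FIG. 4 are conventionally produced by projecting per-patch features onto their top-3 principal components and rendering them as RGB; a minimal sketch of that standard recipe (not necessarily the authors' exact visualization code):

```python
import numpy as np

def pca_rgb(features):
    """Project (h, w, d) patch features onto their top-3 principal
    components and rescale to [0, 1] for display as an RGB map."""
    h, w, d = features.shape
    x = features.reshape(-1, d)
    x = x - x.mean(axis=0)                       # center before PCA
    # right singular vectors of the centered data = principal directions
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    proj = x @ vt[:3].T                          # top-3 components per patch
    span = proj.max(axis=0) - proj.min(axis=0)
    proj = (proj - proj.min(axis=0)) / (span + 1e-8)
    return proj.reshape(h, w, 3)
```

Similar colors in the student's and teacher's maps then indicate that the student reproduces the teacher's dominant feature directions.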

3. Register PHI-S Impact

We analyze the effectiveness of PHI-S (Ranzinger et al., 2024) normalization on different token types. While effective for global and patch tokens, we find that the first register token in DINOv3 exhibits a multi-modal distribution, which prevents PHI-S from correctly estimating the normalization transform.

PHI-S PCA Maps

FIG. 5: The first register token (Row 4) exhibits a multi-modal distribution that standard PHI-S fails to normalize correctly, leading to training instability.
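A simplified sketch of the PHI-S idea helps explain the failure mode: the features are rotated with a Hadamard matrix and then divided by a single isotropic scale, so that all dimensions end up with approximately unit variance. For brevity this applies the Hadamard rotation directly (which equalizes variances when dimensions are uncorrelated); see Ranzinger et al. (2024) for the exact estimator. A single global scale implicitly assumes a unimodal distribution, which is exactly what the first register token violates.

```python
import numpy as np

def hadamard(n):
    """Sylvester-construction Hadamard matrix (n must be a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def phi_s(x):
    """Simplified PHI-S-style normalization: center, rotate with an
    orthonormal Hadamard matrix (spreading variance evenly across
    dimensions), then divide by one global scale."""
    d = x.shape[-1]
    H = hadamard(d) / np.sqrt(d)         # orthonormal rotation
    xc = x - x.mean(axis=0)
    sigma = np.sqrt(np.mean(xc ** 2))    # single isotropic scale estimate
    return (xc @ H.T) / sigma
```

For a unimodal token distribution the output variance is close to 1 in every dimension; for a multi-modal one, a single `sigma` cannot match all modes at once, consistent with the instability noted above.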