FIG. 1: A Mixture-of-Experts student is distilled from multiple frozen vision teachers (SigLIP2 and DINOv3).
Vision foundation models trained via multi-teacher distillation offer a promising path toward unified visual representations. We introduce AMoE (Agglomerative Mixture-of-Experts), which distills knowledge from SigLIP2 and DINOv3 simultaneously into a single Mixture-of-Experts student.
Instantiated in a Mixture-of-Experts backbone, AMoE sets a new state of the art on global representation and retrieval benchmarks while using significantly fewer training tokens than competitors such as RADIOv2.5.
AMoE (0.3B active / 0.6B total parameters) outperforms RADIOv2.5-H (0.6B parameters) on global representation tasks while using more than 4× fewer training tokens.
| Method | Image-Text Avg (Top-1) | kNN Avg (Top-1) |
|---|---|---|
| RADIOv2.5-H | 82.26 | 84.42 |
| AMoE | 84.13 | 87.44 |
We introduce OpenLVD200M, a curated 200M-image corpus constructed via hierarchical clustering. It provides balanced coverage of visual concepts, and we demonstrate its effectiveness for multi-teacher distillation.
We use a Mixture-of-Experts (MoE) backbone to learn complementary signals from different teachers.
We use token-balanced batching to stabilize training across varying resolutions, packing images to a fixed token budget.
We introduce ARKD to preserve the pairwise geometry of teacher embeddings.
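The exact ARKD objective is not spelled out on this page, so the snippet below is only a minimal sketch of a pairwise-geometry-preserving distillation loss under our own assumptions: student and teacher token embeddings are L2-normalized, their N x N cosine-similarity (Gram) matrices are computed per image, and the student is trained to match the teacher's matrix with an MSE penalty. The function name `arkd_loss` is ours.

```python
import torch
import torch.nn.functional as F

def arkd_loss(student_tokens: torch.Tensor, teacher_tokens: torch.Tensor) -> torch.Tensor:
    """Pairwise-geometry distillation: match the token-token cosine-similarity
    (Gram) matrix of the student to that of a frozen teacher.

    student_tokens: (B, N, C_s), teacher_tokens: (B, N, C_t); channel widths
    may differ because only the relational N x N structure is compared.
    """
    s = F.normalize(student_tokens, dim=-1)
    t = F.normalize(teacher_tokens, dim=-1)
    gram_s = s @ s.transpose(-1, -2)   # (B, N, N) student pairwise similarities
    gram_t = t @ t.transpose(-1, -2)   # (B, N, N) teacher pairwise similarities
    return F.mse_loss(gram_s, gram_t)
```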
FIG. 2: Packing multiple native-resolution images per sequence up to a fixed token budget prevents low-res forgetting and improves performance.
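As a concrete illustration of the token-balanced batching above, here is a minimal greedy packer that groups native-resolution images into sequences of at most a fixed token budget. The first-fit-decreasing heuristic, the 16-pixel patch size, and the 4096-token default are illustrative assumptions, not the exact training configuration.

```python
from typing import List, Tuple

def num_tokens(height: int, width: int, patch: int = 16) -> int:
    """Token count of a native-resolution image under a ViT-style patchifier."""
    return (height // patch) * (width // patch)

def pack_to_token_budget(sizes: List[Tuple[int, int]], budget: int = 4096,
                         patch: int = 16) -> List[List[int]]:
    """Greedy first-fit-decreasing packing: group image indices so that each
    group's total token count stays within `budget`. Images that alone exceed
    the budget get their own group (they would be resized upstream)."""
    groups: List[List[int]] = []
    totals: List[int] = []
    order = sorted(range(len(sizes)),
                   key=lambda i: -num_tokens(sizes[i][0], sizes[i][1], patch))
    for i in order:
        h, w = sizes[i]
        n = num_tokens(h, w, patch)
        for g, total in enumerate(totals):
            if total + n <= budget:
                groups[g].append(i)
                totals[g] += n
                break
        else:
            groups.append([i])
            totals.append(n)
    return groups
```

With these defaults, a 1024×1024 image (4,096 tokens) fills a sequence by itself, while five 512×384 images (768 tokens each) can share one.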
We construct OpenLVD200M to mitigate the long-tail distribution inherent in web-scale data. Inspired by the hierarchical clustering and sampling technique from Vo et al. (2024), we process a 2.3B-image pool (combining DFN and LAION) to learn a semantic hierarchy of visual concepts. By sampling uniformly across these semantic clusters rather than the raw data distribution, we ensure balanced coverage of both common and rare concepts.
| Method | Image-Text Avg | kNN Avg | T2I Retrieval | I2T Retrieval |
|---|---|---|---|---|
| Random (200M) | 74.96 | 82.66 | 57.63 | 75.12 |
| OpenLVD200M | 79.11 | 85.08 | 59.14 | 76.43 |
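To make the cluster-balanced sampling concrete, the sketch below draws a fixed-size subset with roughly equal probability mass per cluster instead of sampling the raw data distribution. It assumes cluster assignments have already been produced by the hierarchical clustering step; the function name and the equal-quota policy are illustrative simplifications.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence

def cluster_balanced_sample(cluster_ids: Sequence[int], target_size: int,
                            seed: int = 0) -> List[int]:
    """Draw roughly `target_size` image indices with equal quota per cluster,
    so rare concepts are not drowned out by large head clusters."""
    rng = random.Random(seed)
    by_cluster: Dict[int, List[int]] = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)
    per_cluster = max(1, target_size // len(by_cluster))
    sample: List[int] = []
    for members in by_cluster.values():
        if len(members) <= per_cluster:
            sample.extend(members)                           # keep all of a tail cluster
        else:
            sample.extend(rng.sample(members, per_cluster))  # subsample a head cluster
    return sample
```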
We investigate the impact of the Rotary Positional Embedding (RoPE) strategy on the student's ability to generalize to unseen high resolutions. Specifically, we compare standard Axial RoPE, which indexes patches with absolute integer coordinates, against Golden RoPE [Xiong, 2025], which normalizes the input coordinates by the image aspect ratio so that they map roughly to [-1, 1].
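The snippet below contrasts the two coordinate parameterizations. It reflects our reading of the setup (absolute integer patch indices for Axial RoPE versus aspect-ratio-normalized coordinates in roughly [-1, 1]), not a reference implementation of [Xiong, 2025]; the rotary frequencies themselves are applied to these coordinates as usual, only the coordinates change.

```python
import torch

def integer_coords(h_patches: int, w_patches: int) -> torch.Tensor:
    """Absolute integer (row, col) indices, as used by standard Axial RoPE."""
    ys, xs = torch.meshgrid(torch.arange(h_patches), torch.arange(w_patches),
                            indexing="ij")
    return torch.stack([ys, xs], dim=-1).float()   # (H, W, 2)

def normalized_coords(h_patches: int, w_patches: int) -> torch.Tensor:
    """Aspect-ratio-aware coordinates mapped roughly to [-1, 1]: the longer
    side spans the full range and the shorter side a proportional sub-range,
    so the same content receives similar rotary phases at any resolution."""
    longest = max(h_patches, w_patches)
    ys = torch.linspace(-h_patches / longest, h_patches / longest, h_patches)
    xs = torch.linspace(-w_patches / longest, w_patches / longest, w_patches)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack([gy, gx], dim=-1)           # (H, W, 2)
```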
FIG. 3: Comparison of feature map consistency across resolutions. Golden RoPE (normalized coordinates) maintains strong scale invariance and feature coherence even at unseen resolutions ($2048 \times 2048$), whereas standard Axial RoPE degrades.
We provide a qualitative comparison of the learned student representations against the original teacher features.
FIG. 4: PCA maps showing the student (AMoE) closely reconstructing the teacher distributions. The student retains SigLIP2's text-aware features and DINOv3's geometric consistency.
We analyze the effectiveness of PHI-S (Ranzinger et al., 2024) normalization on different token types. While effective for global and patch tokens, we find that the first register token in DINOv3 exhibits a multi-mode distribution that prevents PHI-S from correctly estimating the standardization transform.
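For reference, here is a minimal sketch of a PHI-S-style standardization, following our reading of Ranzinger et al. (2024): rotate the features by a Hadamard matrix composed with the covariance eigenbasis so variance is spread evenly across channels, then divide by a single global scalar. The function name, the scipy Hadamard construction (which requires a power-of-two channel count), and the exact composition are our assumptions, not the paper's reference code.

```python
from typing import Tuple

import numpy as np
from scipy.linalg import hadamard

def phi_s_transform(feats: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Estimate a PHI-S-style standardization from teacher token features.

    feats: (N, C) features with C a power of two (so a Hadamard matrix exists).
    Returns (mean, A); standardized features are computed as (x - mean) @ A.T.
    The rotation H @ U.T spreads variance evenly over channels, so one global
    scalar (sqrt of the mean eigenvalue) standardizes every dimension at once.
    """
    _, c = feats.shape
    mean = feats.mean(axis=0)
    cov = np.cov(feats, rowvar=False)                # C x C covariance
    eigvals, eigvecs = np.linalg.eigh(cov)           # cov = U diag(eigvals) U.T
    h = hadamard(c) / np.sqrt(c)                     # orthonormal Hadamard matrix
    scale = np.sqrt(eigvals.mean())                  # single global scale
    a = (h @ eigvecs.T) / scale
    return mean, a
```

A single global scale of this kind implicitly assumes a roughly unimodal feature distribution, which is exactly what the multi-mode first register violates (Fig. 5).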
FIG. 5: The first register (Row 4) exhibits multi-mode distributions which standard PHI-S fails to normalize correctly, leading to training instability.