Corgi-VLA: Cross-Domain Training for Generalist Vision-Language-Action Models

Anonymous
2025


Abstract

Vision-Language-Action (VLA) modeling for embodied navigation (e.g., VLN) remains constrained by a critical limitation: the disconnect between the continuous, high-dimensional nature of perceptual-language understanding and the discrete, low-dimensional action spaces employed by most existing models. This paradigm forces models to learn a constrained, task-specific action vocabulary (e.g., turn left, move forward), severely hindering generalization to novel instructions, unseen environments, and broader embodied tasks. Consequently, even powerful VL models forfeit their rich semantic and compositional reasoning capabilities when applied to action generation. In this paper, we introduce Corgi-VLA, an autoregressive generalist model that rethinks this paradigm to achieve strong generalization in navigation tasks. Our approach is built on three innovations. First, we propose a hybrid-modal training framework that integrates large-scale, diverse datasets across domains (pure language, image-text pairs, and vision-language-action trajectories); this mixture yields a more robust and transferable representation of the physical world. Second, and most significantly, we unify the action and text generation spaces: Corgi-VLA bypasses the traditional constrained action head and directly generates actionable commands as text tokens (e.g., "rotate 90 degrees" or "move towards the red chair") within its autoregressive output stream, creating a flexible interface between high-level instruction understanding and low-level control. Third, our method preserves the innate generalization power of pre-trained vision-language models: by avoiding a separate action prediction head, the model retains its original linguistic and visual reasoning strengths and applies them directly to action generation. We demonstrate that Corgi-VLA achieves state-of-the-art performance on standard VLN benchmarks while exhibiting strong zero-shot generalization to novel instructions and unseen environments, paving the way toward general-purpose embodied agents.

Model Architecture

[Figure: Corgi-VLA model architecture]

Key Innovations

  1. Hybrid-Modal Training: Integrates diverse datasets across language, vision, and action.
  2. Unified Generation Space: Generates actionable commands as text tokens within the output stream (see the parser sketch after this list).
  3. Preserved Generalization: Keeps reasoning strengths of pre-trained VL models.
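
Because actions are emitted as ordinary text tokens, a downstream controller only needs a parser over the generated command strings. The sketch below illustrates this with a hypothetical two-primitive grammar (rotate N degrees, move towards the <target>) built from the example commands quoted in the abstract; Corgi-VLA's actual command vocabulary is not specified on this page.

```python
import re

# Hypothetical command grammar; the real Corgi-VLA vocabulary may differ.
_ROTATE = re.compile(r"rotate\s+(-?\d+(?:\.\d+)?)\s+degrees")
_MOVE_TO = re.compile(r"move towards the\s+(.+)")

def parse_action_text(action_text: str):
    """Map one generated text command to a (primitive, argument) pair."""
    action_text = action_text.strip().lower()
    if m := _ROTATE.match(action_text):
        return ("rotate", float(m.group(1)))       # signed angle in degrees
    if m := _MOVE_TO.match(action_text):
        return ("move_to", m.group(1))             # open-vocabulary target
    raise ValueError(f"unrecognized action command: {action_text!r}")

# The two commands quoted in the abstract:
print(parse_action_text("rotate 90 degrees"))           # ('rotate', 90.0)
print(parse_action_text("move towards the red chair"))  # ('move_to', 'red chair')
```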

Technical Details

  • Transformer backbone with cross-attention mechanisms
  • Multi-domain training with adaptive weighting (see the sketch after this list)
  • Textual action representation for seamless integration with the language stream
  • Autoregressive action generation
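
As a concrete illustration of the adaptive-weighting bullet above, here is a minimal sketch in which per-domain sampling weights are derived from per-domain validation losses via a softmax, so that underperforming domains are sampled more often. The weighting rule, temperature, and domain names are assumptions for illustration; the paper's actual scheme is not described on this page.

```python
import math
import random

# Hypothetical domain mixture; names mirror the data sources listed above.
DOMAINS = ["language", "image_text", "vla_trajectories"]

def reweight(val_losses: dict, temperature: float = 1.0) -> dict:
    """Softmax over per-domain validation losses: lossier domains are
    sampled more often. An assumed rule, not the paper's actual scheme."""
    exps = {d: math.exp(val_losses[d] / temperature) for d in DOMAINS}
    z = sum(exps.values())
    return {d: e / z for d, e in exps.items()}

def sample_domain(weights: dict) -> str:
    """Pick the domain that supplies the next training batch."""
    return random.choices(DOMAINS, [weights[d] for d in DOMAINS])[0]

weights = reweight({"language": 1.2, "image_text": 0.9, "vla_trajectories": 2.1})
print(weights)                 # vla_trajectories gets the largest share
print(sample_domain(weights))
```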

Results & Comparisons

Qualitative Comparisons

Our method demonstrates robust and accurate navigation across different scenarios.

Indoor Navigation

Instruction: Walk up to the white cabinet bearing a "大" character printed in very small type.

Instruction: Walk to the girl in the distance sitting farthest to the right and stop in front of her.

Outdoor Navigation

Instruction: Walk to the trash can on the other side of the road.

Long-Task Navigation

Instruction: Find the 徐记猪脚饭 (Xuji pork-trotter rice) shop and stop at its entrance.

Instruction: Find the 东森烟酒店 (Dongsen tobacco and liquor) shop, walk to its entrance, then walk inside and stop.

Insights

✨ Our model demonstrates strong generalization capabilities and can accurately understand the intent behind language instructions, achieving efficient and reliable navigation across diverse scenarios.

Performance Metrics

Benchmark Results

Model              Success Rate (%)   SPL    Path Length (m)   Generalization Score
Baseline A         45.2               0.38   12.4              52.1
Baseline B         52.7               0.42   11.8              58.3
Previous SOTA      63.5               0.51   10.2              68.9
Corgi-VLA (Ours)   76.8               0.62   9.1               82.4

Corgi-VLA achieves state-of-the-art performance across all metrics, improving success rate by 13.3 points (about a 21% relative gain) over the previous best model. The high SPL (Success weighted by Path Length) score indicates that our model not only completes more tasks but does so efficiently.
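
For readers unfamiliar with the metric, SPL (Anderson et al., 2018) weights each successful episode by the ratio of the shortest-path length to the path length the agent actually traveled. A direct implementation:

```python
def spl(successes, shortest_lengths, taken_lengths):
    """Success weighted by Path Length (Anderson et al., 2018):
    SPL = (1/N) * sum_i S_i * l_i / max(p_i, l_i),
    where S_i is the binary success flag, l_i the shortest-path length,
    and p_i the length of the path the agent actually took."""
    n = len(successes)
    return sum(
        s * l / max(p, l)
        for s, l, p in zip(successes, shortest_lengths, taken_lengths)
    ) / n

# Toy example: two episodes, one success with a near-optimal path.
print(spl([1, 0], [8.0, 10.0], [9.1, 12.0]))  # ≈ 0.4396
```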

Scaling Law Analysis

[Figure: Performance vs. training data size]

The scaling law analysis demonstrates that Corgi-VLA benefits significantly from increased training data, showing a consistent logarithmic improvement in success rate as training data scales. This suggests that our approach can continue to improve with access to more diverse training data.
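
The trend described above corresponds to a log-linear model, success ≈ a + b · ln(data size). The sketch below fits such a curve; the data points are hypothetical placeholders, not numbers reported by the paper, and serve only to show the fitting procedure.

```python
import numpy as np

# Hypothetical data points for illustration only; NOT reported results.
sizes = np.array([1e4, 1e5, 1e6, 1e7])        # training samples per run
success = np.array([41.0, 55.0, 67.0, 76.8])  # success rate (%)

# Fit success ≈ a + b * ln(size); np.polyfit returns [slope, intercept].
b, a = np.polyfit(np.log(sizes), success, deg=1)
print(f"success ≈ {a:.1f} + {b:.2f} * ln(size)")

# Extrapolate one decade further (a rough what-if, not a prediction).
print(f"at 1e8 samples: {a + b * np.log(1e8):.1f}%")
```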

Zero-Shot Generalization Performance

In zero-shot settings, Corgi-VLA maintains strong performance across unseen environments and instructions, significantly outperforming baseline methods. This demonstrates the model's strong generalization capabilities derived from our unified text-action representation.

Citation

@article{corgi2025vla,
  title={Corgi-VLA: Cross-Domain Training for Generalist Vision-Language-Action Models},
  author={Fagang Kh Pan and Zan Mao},
  year={2025}
}