Towards Motion-aware
Referring Image Segmentation

1Seoul National University    2AIM Intelligence
*Equal Contribution   Corresponding Author

Abstract

Referring Image Segmentation (RIS) requires identifying objects in images based on textual descriptions. We observe that existing methods significantly underperform on motion-related queries compared to appearance-based ones. To address this, we first introduce an efficient data augmentation scheme that extracts motion-centric phrases from original captions, exposing models to more motion expressions without additional annotations. Second, since the same object can be described differently depending on the context, we propose Multimodal Radial Contrastive Learning (MRaCL), performed on fused image-text embeddings rather than unimodal representations. For comprehensive evaluation, we introduce a new test split focusing on motion-centric queries, along with a new benchmark, M-Bench, where objects are distinguished primarily by their actions. Extensive experiments show that our method substantially improves performance on motion-centric queries across multiple RIS models while maintaining competitive results on appearance-based descriptions.

Why do RIS models struggle with motion?

Existing RIS models are typically trained on datasets where objects are distinguished by appearance attributes (e.g., color, clothing, position). As a result, they significantly underperform when the referring expression describes an object through its motion or action. We measure this performance gap across multiple state-of-the-art models on the G-Ref UMD test set.

Appearance-based vs. motion-centric queries in RIS
Performance gap between appearance-centric and motion-centric queries

Left: Appearance-based vs. motion-centric queries in RIS. Existing methods handle the former well but struggle with the latter. Right: Performance gap between appearance-centric and motion-centric queries on G-Ref UMD test set. All models show a significant drop on motion-centric queries.

Method

Our approach enhances RIS models' understanding of action-centric expressions through two complementary strategies: (1) data augmentation with motion-centric verb phrase extraction, and (2) Multimodal Radial Contrastive Learning (MRaCL) on fused image-text embeddings using angular distance.

MRaCL architecture overview

Overview of MRaCL. (a) Given an image-text pair, we extract motion-centric verb phrases from the original caption and fuse both representations through a projection layer. (b) We compute angular similarity scores within each mini-batch, filter potential false negatives, and apply our MRaCL loss with margin penalty.

1. Verb Phrase Augmentation

We extract motion-centric verb phrases from original captions using an LLM, and use them as supplementary training examples. The model learns that both the original and verb-phrase descriptions refer to the same target, reinforcing attention to motion semantics.
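The augmentation step can be sketched as follows. Since the paper's LLM prompt is not specified here, `toy_extract` is a hypothetical stand-in (a crude "-ing"-suffix heuristic) for the LLM-based extraction; `augment_with_verb_phrases` only illustrates how the extracted phrase is paired with the original target mask as an extra training example.

```python
def augment_with_verb_phrases(samples, extract_fn):
    """Add a motion-centric variant of each (caption, mask_id) pair,
    sharing the same segmentation target. `extract_fn` stands in for
    the LLM-based verb-phrase extraction."""
    augmented = list(samples)
    for caption, mask_id in samples:
        phrase = extract_fn(caption)
        if phrase and phrase != caption:
            augmented.append((phrase, mask_id))
    return augmented

def toy_extract(caption):
    """Toy stand-in for the LLM: return the suffix starting at the
    first gerund ("-ing" token), if any."""
    tokens = caption.split()
    for i, tok in enumerate(tokens):
        if tok.endswith("ing"):
            return " ".join(tokens[i:])
    return None

pairs = [("the man in the red jacket running fast", 0),
         ("the leftmost chair", 1)]
augmented = augment_with_verb_phrases(pairs, toy_extract)
# adds ("running fast", 0) as an extra example for the same mask
```

Because the verb-phrase variant shares the mask of its source caption, no new annotations are needed.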

2. Multimodal Contrastive Learning

Different expressions can describe the same object only within a specific image context. We perform contrastive learning on fused cross-modal embeddings rather than unimodal representations, with false negative elimination.
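The false-negative elimination can be sketched as an in-batch index filter. Matching on the referred object's category is an illustrative assumption here; the paper's exact filtering criterion may differ.

```python
def negative_indices(target_categories, anchor):
    """In-batch negatives for sample `anchor`: exclude the anchor itself
    and any sample whose referred-object category matches the anchor's,
    since such samples may be false negatives (same kind of object,
    described differently)."""
    a_cat = target_categories[anchor]
    return [j for j, c in enumerate(target_categories)
            if j != anchor and c != a_cat]

# For a batch of targets ["person", "dog", "person", "cat"], the
# anchor at index 0 only contrasts against the "dog" and "cat" samples.
```

The contrastive loss is then applied only over the surviving indices of each mini-batch.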

3. MRaCL Loss

We use angular distance as our similarity metric to overcome similarity saturation and anisotropy. A margin penalty enforces minimum angular separation, making embeddings more discriminative.
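A minimal sketch of the angular-distance idea, in NumPy. The hinge-style margin term and the margin value of 0.5 radians are illustrative assumptions, not the paper's exact formulation; the point is that angles avoid the saturation of raw cosine similarity and the margin enforces a minimum angular separation from negatives.

```python
import numpy as np

def angular_distance(u, v):
    """Angle between two embeddings in radians, in [0, pi]."""
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def mracl_loss(anchor, positive, negatives, margin=0.5):
    """Toy per-anchor loss: pull the positive's angle toward zero and
    hinge-penalize any negative closer than `margin` radians."""
    pull = angular_distance(anchor, positive)
    push = sum(max(0.0, margin - angular_distance(anchor, n))
               for n in negatives)
    return pull + push
```

Note the `np.clip` guard: floating-point error can push the cosine slightly outside [-1, 1], which would make `arccos` return NaN.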

Overall Comparison

MRaCL consistently improves performance across all datasets and all models. The degree of improvement correlates with the ratio of motion-centric queries per dataset — on G-Ref and Ref-ZOM (rich in verb-centric phrases), our method achieves substantial gains, while still providing consistent improvements on verb-sparse datasets like RefCOCO and RefCOCO+.

Overall comparison of RIS models with and without MRaCL

Table 2: Overall comparison of the RIS models. Green cells indicate a statistically significant improvement; gray cells indicate no significant change. There is no case of a significant performance drop (red).

Qualitative Analysis

Qualitative segmentation results on M-Bench

Qualitative results on M-Bench. Our method correctly segments the target by grounding fine-grained action cues such as "running fast," "snowboarding and holding a selfie stick," and "balancing on a bike," while the baseline without MRaCL fails to disambiguate visually similar subjects.

Analysis

Ablation Study

Ablation on proposed components
Ambiguity filtering strategy comparison

Left: Each component—motion-centric augmentation, MRaCL loss, and false negative filtering—contributes to the overall improvement; applying all three achieves the best performance. Right: Stricter ambiguity filtering (retaining samples where the target category appears only once) produces a cleaner training signal and consistently improves results.
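The stricter ambiguity filter described above can be sketched as follows. The data schema (a per-image list of object categories, plus samples referencing an image and a target category) is an illustrative assumption.

```python
def strict_filter(samples, image_objects):
    """Keep only samples whose target category occurs exactly once among
    the objects in their image, so the referring expression stays
    unambiguous even without appearance cues."""
    return [s for s in samples
            if image_objects[s["image"]].count(s["category"]) == 1]

# An image with two people and one dog keeps only its "dog" sample;
# expressions targeting either person are dropped as ambiguous.
```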

Training Objective Analysis

Comparison with various contrastive objectives
Positive alignment vs semi-hard negative mining

Left: MRaCL substantially outperforms L2, SimCSE, and InfoNCE (MCC) alternatives by leveraging angular distance with explicit margin penalties. Right: Semi-hard negative mining consistently degrades performance—since current RIS models have not yet learned to leverage motion cues, hard negatives act as noise rather than useful signal, motivating our positive alignment strategy.

BibTeX

@article{kim2025mracl,
  title={Towards Motion-aware Referring Image Segmentation},
  author={Kim, Chaeyun and Yi, Seunghoon and Kim, Yejin and Jo, Yohan and Lee, Joonseok},
  year={2025}
}