Abstract
Referring Image Segmentation (RIS) requires identifying objects in images based on textual descriptions. We observe that existing methods significantly underperform on motion-related queries compared to appearance-based ones. To address this, we first introduce an efficient data augmentation scheme that extracts motion-centric phrases from original captions, exposing models to more motion expressions without additional annotations. Second, since the same object can be described differently depending on the context, we propose Multimodal Radial Contrastive Learning (MRaCL), performed on fused image-text embeddings rather than unimodal representations. For comprehensive evaluation, we construct a new test split focusing on motion-centric queries, as well as a new benchmark, M-Bench, where objects are distinguished primarily by their actions. Extensive experiments show that our method substantially improves performance on motion-centric queries across multiple RIS models, while maintaining competitive results on appearance-based descriptions.
Why do RIS models struggle with motion?
Existing RIS models are typically trained on datasets where objects are distinguished by appearance attributes (e.g., color, clothing, position). As a result, they significantly underperform when the referring expression describes an object through its motion or action. We measure this performance gap across multiple state-of-the-art models on the G-Ref UMD test set.
Left: Appearance-based vs. motion-centric queries in RIS. Existing methods handle the former well but struggle with the latter. Right: Performance gap between appearance-centric and motion-centric queries on G-Ref UMD test set. All models show a significant drop on motion-centric queries.
Method
Our approach enhances RIS models' understanding of action-centric expressions through two complementary strategies: (1) data augmentation with motion-centric verb phrase extraction, and (2) Multimodal Radial Contrastive Learning (MRaCL) on fused image-text embeddings using angular distance.
Overview of MRaCL. (a) Given an image-text pair, we extract motion-centric verb phrases from the original caption and fuse both representations through a projection layer. (b) We compute angular similarity scores within each mini-batch, filter potential false negatives, and apply our MRaCL loss with margin penalty.
Verb Phrase Augmentation
We extract motion-centric verb phrases from original captions using an LLM, and use them as supplementary training examples. The model learns that both the original and verb-phrase descriptions refer to the same target, reinforcing attention to motion semantics.
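The paper performs this extraction with an LLM; as a minimal, self-contained illustration of the augmentation idea, the sketch below uses a tiny hand-written verb lexicon (`VERBS`, a hypothetical placeholder for the LLM) to cut a caption down to its motion-centric suffix and pair it with the same segmentation target.

```python
# Toy stand-in for the LLM-based verb-phrase extraction. VERBS and
# extract_verb_phrase are illustrative assumptions, not the paper's pipeline.
VERBS = {"running", "holding", "jumping", "riding", "balancing",
         "snowboarding", "throwing", "catching", "walking"}

def extract_verb_phrase(caption):
    """Return the caption suffix starting at the first motion verb,
    e.g. 'a man in red running fast' -> 'running fast', else None."""
    tokens = caption.lower().split()
    for i, tok in enumerate(tokens):
        if tok in VERBS:
            return " ".join(tokens[i:])
    return None  # no motion verb found; skip augmentation for this caption

def augment(samples):
    """Each sample is (caption, mask_id). The extracted verb phrase is added
    as a new training example pointing at the SAME mask, so the model learns
    that both descriptions refer to the same target."""
    out = list(samples)
    for caption, mask_id in samples:
        vp = extract_verb_phrase(caption)
        if vp and vp != caption.lower():
            out.append((vp, mask_id))
    return out
```

Because the augmented phrase reuses the original mask, no additional annotation is required, matching the scheme described above.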
Multimodal Contrastive Learning
Whether two different expressions refer to the same object depends on the image context. We therefore perform contrastive learning on fused cross-modal embeddings rather than unimodal representations, and eliminate potential false negatives within each mini-batch.
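One piece of this, the false-negative elimination, can be sketched as follows: among in-batch negatives, fused embeddings that are too similar to each other likely describe the same object and are dropped from the negative set. The threshold `tau` is a hypothetical hyperparameter, not a value from the paper.

```python
import numpy as np

def false_negative_mask(fused, tau=0.9):
    """Boolean mask over pairs: True where (i, j) is kept as a negative.
    fused: (B, D) array of L2-normalized fused image-text embeddings.
    Pairs with cosine similarity above tau are treated as potential false
    negatives (same object, different wording) and excluded."""
    sim = fused @ fused.T              # (B, B) cosine similarities
    keep = sim < tau                   # drop overly similar pairs
    np.fill_diagonal(keep, False)      # self-pairs are positives, not negatives
    return keep
```

The mask would then zero out the corresponding terms in the contrastive denominator, so near-duplicate descriptions of one object are never pushed apart.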
MRaCL Loss
We use angular distance as our similarity metric to overcome similarity saturation and anisotropy. A margin penalty enforces minimum angular separation, making embeddings more discriminative.
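As a rough numpy sketch of this idea (not the paper's exact formulation; `margin` and `scale` are assumed hyperparameters), the loss below converts cosine similarities of fused embeddings to angles, adds an ArcFace-style margin to each positive pair's angle, and applies an InfoNCE-style softmax over the batch:

```python
import numpy as np

def mracl_loss(fused_a, fused_b, margin=0.1, scale=16.0):
    """Angular contrastive loss sketch over fused embeddings.
    fused_a[i] and fused_b[i] are two fused views of the same target
    (e.g. original caption vs. extracted verb phrase)."""
    a = fused_a / np.linalg.norm(fused_a, axis=1, keepdims=True)
    b = fused_b / np.linalg.norm(fused_b, axis=1, keepdims=True)
    cos = np.clip(a @ b.T, -1.0, 1.0)       # (B, B) cosine similarities
    theta = np.arccos(cos)                  # angular distances in [0, pi]
    B = len(a)
    idx = np.arange(B)
    theta[idx, idx] += margin               # margin penalty on positive angles
    logits = scale * np.cos(np.clip(theta, 0.0, np.pi))
    # InfoNCE over rows; the positive sits on the diagonal
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    return -np.log(p[idx, idx]).mean()
```

Working in angle space keeps gradients informative where cosine similarity saturates near 1, and the margin forces a minimum angular separation between positives and in-batch negatives.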
Overall Comparison
MRaCL consistently improves performance across all datasets and all models. The degree of improvement correlates with the ratio of motion-centric queries per dataset — on G-Ref and Ref-ZOM (rich in verb-centric phrases), our method achieves substantial gains, while still providing consistent improvements on verb-sparse datasets like RefCOCO and RefCOCO+.
Table 2: Overall comparison of the RIS models. Green cells indicate a statistically significant improvement; gray cells indicate a neutral result. No case shows a statistically significant performance drop (red).
Qualitative Analysis
Qualitative results on M-Bench. Our method correctly segments the target by grounding fine-grained action cues such as "running fast," "snowboarding and holding a selfie stick," and "balancing on a bike," while the baseline without MRaCL fails to disambiguate visually similar subjects.
Analysis
BibTeX
@article{kim2025mracl,
  title={Towards Motion-aware Referring Image Segmentation},
  author={Kim, Chaeyun and Yi, Seunghoon and Kim, Yejin and Jo, Yohan and Lee, Joonseok},
  year={2025}
}