Authors: Qun Li; Haijun Zhang; Kai Yang; Yong-Guo Shi; Deqiang Zeng; Wun-She Yap
Dates: 2025-09-24; 2025-09-24; 2025-05
DOI: 10.1109/TCE.2025.3547962
URL: https://dspace-cris.utar.edu.my/handle/123456789/11361
Abstract: Object tracking has advanced significantly with Transformer-based architectures in recent years. However, replacing convolutional layers with global cross-attention in the tracking head of these architectures results in a loss of object-centric inductive bias. Consequently, existing Transformer-based methods often struggle with complex real-life scenarios, such as low resolution, background clutter, and scale variation. To address this issue, we propose a new Vision Transformer-based anchor-free tracking framework named CasCenter. Specifically, the framework features a cascade attention module in the decoder that propagates tracking cues from the previous tracking head to refine object features in a coarse-to-fine manner, enabling the tracker to focus more effectively on the target. Additionally, to further improve tracking stability and accuracy, we incorporate SIoU loss, a multi-scale tracking head, and a Gaussian mask-constrained cross-attention mechanism that emphasizes target regions while suppressing background interference. Extensive experiments demonstrate the superiority of our proposed CasCenter. © 1975-2011 IEEE.
Language: en
Keywords: anchor-free mechanism; cascade mechanism; Vision Transformer; Visual tracking; Clutter (information theory); Prisms; Tracking (position); Anchor-free; Anchor-free mechanism; Background clutter; Cascade mechanism; Coarse to fine; Inductive bias; Lower resolution; Object Tracking; Vision transformer; Visual Tracking; Object detection
Title: CasCenter: A Cascaded Center Network for Visual Trackingjournal-article
Type: journal-article
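Illustrative note: the abstract mentions a Gaussian mask-constrained cross-attention mechanism that emphasizes target regions while suppressing background interference, but gives no formulation. The sketch below shows one plausible reading and is not the authors' implementation. It assumes the mask is a 2D Gaussian centered on an estimated target position and that its logarithm is added to the attention logits before the softmax. The function names (`gaussian_mask`, `masked_cross_attention`), the `sigma` parameter, and the `eps` constant are all hypothetical.

```python
import math


def gaussian_mask(height, width, center, sigma):
    """Return a height x width grid of Gaussian weights in (0, 1].

    `center` is an assumed (row, col) target position; `sigma` is an
    assumed spread. Neither value comes from the paper.
    """
    cy, cx = center
    return [
        [math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_cross_attention(query, keys, values, mask_flat, eps=1e-6):
    """Single-query cross-attention with a Gaussian prior on locations.

    Assumption: log(mask + eps) is added to the scaled dot-product
    logits, so locations far from the mask center are down-weighted.
    """
    scale = 1.0 / math.sqrt(len(query))
    logits = [
        scale * sum(q * k for q, k in zip(query, key)) + math.log(m + eps)
        for key, m in zip(keys, mask_flat)
    ]
    weights = softmax(logits)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values))
              for d in range(dim)]
    return output, weights


# Toy 3x3 search region with identical keys: without the mask the
# attention would be uniform, so any preference comes from the mask.
mask = gaussian_mask(3, 3, center=(1, 1), sigma=0.5)
mask_flat = [m for row in mask for m in row]
keys = [[1.0, 0.0] for _ in range(9)]
values = [[float(i)] for i in range(9)]
output, weights = masked_cross_attention([1.0, 0.0], keys, values, mask_flat)
```

In this toy setup the largest attention weight falls on the center cell, and the corner cells receive the least. Adding the log-mask to the logits is equivalent to multiplying the unnormalized attention weights by the mask, which is a common way to inject a spatial prior; the paper may constrain attention differently.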