[IEEE TPAMI 2026] 3D Hand Pose Estimation via Articulated Anchor-to-Joint 3D Local Regressors

¹Huazhong University of Science and Technology, ²ByteDance, ³CFAR, A*STAR, ⁴University at Buffalo

Abstract

In this paper, we propose to address monocular 3D hand pose estimation from a single RGB or depth image via articulated anchor-to-joint 3D local regressors, in the form of A2J-Transformer+. The key idea is to make the local regressors (i.e., anchor points) in 3D space aware of the hand's local fine details and global articulated context jointly, to facilitate predicting their 3D offsets toward the hand joints, which are aggregated via linear weighting for joint localization. Our intuition is that local fine details help to estimate accurate offsets but may suffer from serious occlusion, confusion caused by similar patterns, and overfitting risk; the hand's global articulated context, on the other hand, provides additional descriptive clues and constraints to alleviate these issues. To set anchor points adaptively in 3D space, A2J-Transformer+ runs in a 2-stage manner. At the first stage, due to the property of the input modality, anchor points are distributed more densely on the X-Y plane, which leads to lower prediction accuracy along the Z direction than along X and Y. To alleviate this, at the second stage anchor points are set near the joints yielded by the first stage, evenly along the X, Y, and Z directions. This treatment brings two main advantages: (1) balancing the prediction accuracy along the X, Y, and Z directions, and (2) ensuring that the anchor-to-joint offsets are of small values that are relatively easy to estimate. Extensive experiments on three RGB hand datasets (InterHand2.6M, HO-3D V2, and RHP) and three depth hand datasets (NYU, ICVL, and HANDS 2017) verify A2J-Transformer+'s superiority and generalization ability across modalities (i.e., RGB and depth) and hand cases (i.e., single hand, interacting hands, and hand-object interaction), even outperforming model-based methods. A test on the ITOP dataset reveals that A2J-Transformer+ can also be applied to the 3D human pose estimation task.
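To make the aggregation concrete, here is a minimal PyTorch sketch of the anchor-to-joint weighted voting described above; the tensor shapes and the names `responses` and `offsets` are illustrative assumptions, not the paper's exact interface.

```python
import torch

def aggregate_joints(anchors, offsets, responses):
    """Linear weighted aggregation of anchor votes into 3D joint positions.

    anchors:   (A, 3)    3D positions of the A anchor points
    offsets:   (A, J, 3) predicted 3D offset from each anchor to each joint
    responses: (A, J)    per-anchor informativeness logits for each joint
    returns:   (J, 3)    estimated 3D joint positions
    """
    weights = torch.softmax(responses, dim=0)       # normalize over anchors
    votes = anchors[:, None, :] + offsets           # (A, J, 3): each anchor's joint votes
    return (weights[..., None] * votes).sum(dim=0)  # weighted average over anchors

# Toy usage: 16 anchors voting for 21 hand joints.
joints = aggregate_joints(torch.randn(16, 3), torch.randn(16, 21, 3), torch.randn(16, 21))
print(joints.shape)  # torch.Size([21, 3])
```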

Figure 1. Comparison of informative anchor point distribution and predicted anchor-to-joint offsets among A2J, A2J-Transformer, and A2J-Transformer+. Essentially, A2J-Transformer+ has the most concentrated informative anchor point distribution near the hand joints, which alleviates the regression difficulty of the anchor-to-joint offsets.

Method Overview

For the first time, we introduce the Transformer's non-local attention mechanism into anchor-to-joint based 3D hand pose estimation from monocular images, making each anchor point (i.e., local regressor) aware of the hand joints' local fine details and global articulation context jointly.
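As a rough illustration, the sketch below applies standard multi-head self-attention over per-anchor features so that each local regressor attends to all others; the feature dimension, head count, and module name are hypothetical stand-ins for the actual A2J-Transformer+ design.

```python
import torch
import torch.nn as nn

class AnchorSelfAttention(nn.Module):
    """Anchor-to-anchor self-attention: each anchor's feature attends to all
    others, injecting global articulation context into the local regressors."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, anchor_feats):          # (B, A, dim): features sampled at anchors
        ctx, _ = self.attn(anchor_feats, anchor_feats, anchor_feats)
        return self.norm(anchor_feats + ctx)  # residual: local detail + global context

feats = torch.randn(2, 768, 256)              # batch of 2, 768 anchors, 256-d features
print(AnchorSelfAttention()(feats).shape)     # torch.Size([2, 768, 256])
```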

Figure 2. [Left] Articulating anchors through the Transformer. [Right] Anchor-to-anchor self-attention visualization. The articulated anchors capture the relationships among hand joints, allowing local anchor points to be aware of the hand's global information.

A 2-stage joint-aware 3D anchor point setting approach is proposed to make the anchor points distribute near the target hand joints, which reduces the regression difficulty of the anchor-to-joint offsets.

Figure 3. A2J-Transformer+ introduces a two-stage pipeline for monocular 3D hand pose estimation, addressing adaptive anchor placement in 3D space. In the first stage, the anchors are preset at fixed positions in 3D space, relatively dense on the X-Y plane but sparse along Z. Based on the joint predictions from the first stage, the second stage sets anchors joint-wise adaptively and evenly along the X, Y, and Z axes. Each stage consists of a feature enhancement module (FEM), an anchor interaction module (AIM), and a prediction module (PM) for joint localization.
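The first-stage preset can be pictured with the following sketch, which builds a grid that is dense on the X-Y plane but uses only a few Z planes; the stride and Z-plane values are assumptions for illustration.

```python
import torch

def stage1_anchor_grid(h, w, stride=16, z_planes=(-60.0, 0.0, 60.0)):
    """Stage-1 fixed anchors: a dense grid on the X-Y (image) plane, replicated
    over a few sparse Z planes -- dense in X-Y but sparse along Z.

    Returns (A, 3) anchors with A = (h // stride) * (w // stride) * len(z_planes).
    """
    ys = torch.arange(stride // 2, h, stride, dtype=torch.float32)
    xs = torch.arange(stride // 2, w, stride, dtype=torch.float32)
    zs = torch.tensor(z_planes)
    z, y, x = torch.meshgrid(zs, ys, xs, indexing="ij")
    return torch.stack([x, y, z], dim=-1).reshape(-1, 3)

anchors = stage1_anchor_grid(256, 256)
print(anchors.shape)  # torch.Size([768, 3]): 16 x 16 x 3 anchors
```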

Figure 4. The network architecture of the anchor interaction module (AIM, left) and the multi-scale feature enhancement module (FEM, right).
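A simplified view of how one stage chains these modules (FEM, then AIM, then PM) is sketched below, with generic layers standing in for the actual FEM and AIM architectures of Figure 4; every dimension and layer choice here is an assumption, not the published design.

```python
import torch
import torch.nn as nn

class A2JStage(nn.Module):
    """One pipeline stage, sketched: FEM enhances per-anchor features, AIM lets
    anchors interact via self-attention, and PM predicts per-anchor responses
    and 3D anchor-to-joint offsets."""

    def __init__(self, dim=256, num_joints=21):
        super().__init__()
        self.fem = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.aim = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.pm_response = nn.Linear(dim, num_joints)    # per-anchor informativeness
        self.pm_offset = nn.Linear(dim, num_joints * 3)  # per-anchor 3D offsets

    def forward(self, anchor_feats):                     # (B, A, dim)
        x = self.aim(self.fem(anchor_feats))
        B, A, _ = x.shape
        return self.pm_response(x), self.pm_offset(x).view(B, A, -1, 3)

responses, offsets = A2JStage()(torch.randn(2, 768, 256))
print(responses.shape, offsets.shape)  # (2, 768, 21) and (2, 768, 21, 3)
```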

We devote particular effort to 3D anchor point setting in a 2-stage manner, making the anchor points distribute near the target joints to alleviate the regression difficulty of the anchor-to-joint offsets. Besides, we identify A2J-Transformer's problem of imbalanced anchor point setting along the X, Y, and Z axes, and address it by setting anchor points around each hand joint with an even distribution over a sphere, as sketched below.
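One simple way to realize such an even spherical placement is a Fibonacci lattice around each predicted joint; in the sketch below, the anchor count per joint and the sphere radius are made-up values, so treat it as an illustration rather than the paper's exact scheme.

```python
import math
import torch

def stage2_anchors(joints, n=32, radius=20.0):
    """Stage-2 joint-aware anchors: for each stage-1 joint prediction, place n
    anchors evenly on a sphere of the given radius via a Fibonacci lattice, so
    that the X, Y, and Z directions are covered with equal density.

    joints: (J, 3) stage-1 joint predictions -> returns (J, n, 3) anchors.
    """
    k = torch.arange(n, dtype=torch.float32)
    golden = math.pi * (3.0 - math.sqrt(5.0))    # golden-angle increment
    z = 1.0 - 2.0 * (k + 0.5) / n                # even spacing along z in [-1, 1]
    r = torch.sqrt(1.0 - z * z)
    theta = golden * k
    sphere = torch.stack([r * torch.cos(theta), r * torch.sin(theta), z], dim=-1)
    return joints[:, None, :] + radius * sphere  # center one sphere on each joint

anchors = stage2_anchors(torch.randn(21, 3))
print(anchors.shape)  # torch.Size([21, 32, 3])
```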

Figure 5. Comparison of the anchor point setting methods of A2J, A2J-Transformer, and A2J-Transformer+. MPJPE results of each model are evaluated on the InterHand2.6M test set. Compared with A2J and A2J-Transformer, A2J-Transformer+'s 3D anchor setting is adaptively joint-aware, with a more concentrated informative anchor distribution near the hand joints in 3D space, leading to A2J-Transformer+'s most accurate joint localization.

Quantitative Results

We evaluate our method on seven datasets spanning different scenarios: RGB-based two-hand datasets (i.e., InterHand2.6M and RHP), an RGB-based hand-object interaction dataset (i.e., HO-3D V2), depth-based single-hand datasets (i.e., NYU, ICVL, and HANDS 2017), and a depth-based human body pose dataset (i.e., ITOP). These datasets ensure a thorough investigation across different 3D pose estimation tasks and visual domains.
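MPJPE (mean per-joint position error), the metric used in Figure 5 and a standard measure on these benchmarks, is sketched minimally below.

```python
import torch

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance (typically in
    mm) between predicted and ground-truth 3D joints. pred, gt: (B, J, 3)."""
    return torch.norm(pred - gt, dim=-1).mean()

print(mpjpe(torch.randn(4, 21, 3), torch.randn(4, 21, 3)))
```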

Figure 6. Quantitative comparison with SOTA methods.

Furthermore, we conduct extensive ablation studies to validate the effectiveness of each component in our proposed framework.

Figure 7. Ablation studies.

Qualitative Results

Figure 8. [Left] Anchor-to-anchor attention visualization of A2J-Transformer+ on the InterHand2.6M, HO-3D V2, NYU, and ITOP datasets. For clarity, attention is shown as an anchor-centered square area in the first stage and as marked points in the second stage. [Right] Anchor-to-joint weight and offset visualization of the two stages in A2J-Transformer+ on the InterHand2.6M, HO-3D V2, NYU, and ITOP datasets.

BibTeX

@article{jiang20253d,
  title={3D Hand Pose Estimation via Articulated Anchor-to-Joint 3D Local Regressors},
  author={Jiang, Changlong and Xiao, Yang and Zheng, Jinghong and Kuang, Haohong and Wu, Cunlin and Zhang, Mingyang and Cao, Zhiguo and Du, Min and Zhou, Joey Tianyi and Yuan, Junsong},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2025},
  publisher={IEEE},
  doi={10.1109/TPAMI.2025.3609907}
}