Dear Reviewer,

Thank you very much for taking the time to review our submission and for your valuable feedback. Your thoughtful comments have helped us improve the work. Below, we respond in detail to each concern you raised.

**Q1. The experiments are conducted only on the authors’ self-collected dataset. While this helps validate the proposed framework under controlled settings, the lack of evaluation on public benchmarks such as Human3.6M or MPI-INF-3DHP limits the generalizability and persuasiveness of the results. The authors mention that “existing public datasets do not provide Euler angle annotations,” which is true; however, Euler angles can be derived from the available rotation matrices or joint orientations provided by these datasets.**

**A1.** We understand the reviewer’s concern and acknowledge the importance of validation on public benchmark datasets. However, we encountered three practical **difficulties**:

(1) In December 2021, we emailed the official dataset team using our institutional research credentials to request access, but we were unable to obtain official authorization. The versions of the dataset currently circulating online contain only **position data**, not the full dataset required for our experiments.

(2) The Human3.6M dataset does not contain a sufficient variety of complex actions, which limits research on reconstructing more challenging and diverse motion patterns.

(3) We also attempted to convert the 3D joint-position ground truth directly into 3D rotation ground truth. However, positional data alone cannot determine a bone’s rotation about its own axis, so the self-rotation information is lost, as discussed in our paper.
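As a minimal illustration of point (3), the sketch below uses a single bone whose rest direction is assumed to be the x-axis. The helper `rotation_about_axis`, the variable names, and the specific angles are hypothetical and are not taken from our pipeline. Two rotations that differ only by a twist about the bone axis place the child joint at the same 3D position, so the twist cannot be recovered from joint positions alone.

```python
import numpy as np

def rotation_about_axis(axis, angle_rad):
    """Rodrigues' formula: rotation matrix for angle_rad about the given axis."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    x, y, z = axis
    K = np.array([[0.0, -z, y],
                  [z, 0.0, -x],
                  [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle_rad) * K + (1.0 - np.cos(angle_rad)) * (K @ K)

# Assumed rest direction of the bone (parent joint at the origin).
bone_rest = np.array([1.0, 0.0, 0.0])

swing = rotation_about_axis([0.0, 0.0, 1.0], np.deg2rad(40.0))  # changes bone direction
twist = rotation_about_axis(bone_rest, np.deg2rad(70.0))        # spins bone about itself

R_a = swing          # swing only
R_b = swing @ twist  # twist about the bone axis first, then the same swing

child_a = R_a @ bone_rest
child_b = R_b @ bone_rest

same_position = np.allclose(child_a, child_b)  # child joint lands in the same place
same_rotation = np.allclose(R_a, R_b)          # but the rotations are different
print(same_position, same_rotation)
```

Here `same_position` is true while `same_rotation` is false, which is the ambiguity we refer to as lost self-rotation.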

Considering all the above factors, we decided to collect our own dataset and plan to make it publicly available, along with our proposed method, to facilitate future research in this area.



**Q2. The potential impact of misclassification in the conditional orientation classifier and its influence on regression stability could be analyzed more thoroughly.**

**A2.** We fully agree that an inaccurate direction-scalar estimate may degrade the accuracy of the subsequent pose prediction. Our original pipeline consists of two stages, and the performance of the first-stage classifier directly influences the second-stage pose estimation. The classifier achieves over 98% accuracy on the test set, which still leaves a small fraction of misclassified samples.

After carefully analyzing the experimental results, we find that most misclassifications occur in rare or extreme poses, such as prone positions, body twists, or other ambiguous postures, as illustrated in Figure 1.

![push up 2](https://java-dgx.oss-cn-beijing.aliyuncs.com/20251203001435479.png)

**Figure 1. Left: ground truth. Middle: prediction by reference methods. Right: prediction by our previous method.**

To address this issue, we explored two directions:

(1) **Replacing the original two-stage pipeline with a unified framework.**

In this new design, we still incorporate the orientation prior during training, but now treat it as part of the network’s output to improve end-to-end learning, as shown in Figure 2.

![inner classifier](https://java-dgx.oss-cn-beijing.aliyuncs.com/20251203001403306.png)

**Figure 2. Our new framework**

As shown in Figure 2, our framework is no longer a two-stage “classifier–pose estimator” process. Instead, the original classification network is integrated into the pose estimation framework. The classifier branch still outputs a 1-dimensional feature, which gradually evolves into a conditional feature during training. With this unified architecture, the orientation information becomes a learnable output of the network.

In the previous method, the two stages used separate losses: the classifier was pre-trained with a binary cross-entropy (BCE) loss, and the pose estimator was trained with an MPJAE loss, so the two optimizations were independent. In the new unified framework, because the network predicts both the **sine–cosine representations of the Euler angles** and the conditional orientation feature, we combine the MPJSCE loss with an additional direction (BCE) loss term weighted by $\lambda$:
$$
\mathcal{L}_{\mathrm{Total}} = \mathcal{L}_{\mathrm{MPJSCE}} + \lambda \cdot \mathcal{L}_{\mathrm{BCE}}
$$
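To make the combined objective concrete, the following NumPy sketch shows one way such a loss could be computed. Several details here are assumptions rather than statements about our implementation: MPJSCE is taken to be a mean squared error over the sine and cosine values, the output layout is assumed to be three sines followed by three cosines per joint, and the names `mpjsce`, `bce`, `total_loss`, and the value `lam=0.1` are purely illustrative.

```python
import numpy as np

def mpjsce(pred_sincos, target_angles):
    """Assumed form of MPJSCE: MSE between predicted and target (sin, cos) values.

    pred_sincos:   (N, J, 6), laid out as [sin x, sin y, sin z, cos x, cos y, cos z]
    target_angles: (N, J, 3), Euler angles in radians
    """
    target = np.concatenate([np.sin(target_angles), np.cos(target_angles)], axis=-1)
    return float(np.mean((pred_sincos - target) ** 2))

def bce(pred_prob, target_label, eps=1e-7):
    """Binary cross-entropy on the orientation output (probabilities in [0, 1])."""
    p = np.clip(pred_prob, eps, 1.0 - eps)
    return float(np.mean(-(target_label * np.log(p)
                           + (1.0 - target_label) * np.log(1.0 - p))))

def total_loss(pred_sincos, pred_cond, target_angles, target_cond, lam=0.1):
    """L_Total = L_MPJSCE + lam * L_BCE (lam is a hypothetical weight)."""
    return mpjsce(pred_sincos, target_angles) + lam * bce(pred_cond, target_cond)

# Toy usage: N=2 samples, J=4 joints, perfect angles, undecided orientation.
rng = np.random.default_rng(0)
angles = rng.uniform(-np.pi, np.pi, size=(2, 4, 3))
perfect = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
cond_true = np.ones((2, 4, 1))
cond_pred = np.full((2, 4, 1), 0.5)
loss = total_loss(perfect, cond_pred, angles, cond_true, lam=0.1)
print(loss)
```

In this toy case the angle term is zero, so the total reduces to the weighted BCE term. This shows how the orientation branch still contributes a gradient signal even when the pose branch is already accurate.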
(2) **To address the discontinuity of Euler angles, we switch to learning their sine–cosine representations.**

In the improved method, we no longer estimate the Euler angles directly. Instead, the network predicts the sine and cosine values of the Euler angles, along with a 1-dimensional conditional feature. As illustrated in Figure 2, the network outputs a 7-dimensional vector: the purple components represent the sine values of the Euler angles, the yellow components represent the cosine values, and the green and red parts correspond to the conditional information. The final Euler angles are recovered with the arctangent function, computed from the predicted sine and cosine values.

![cossin](https://java-dgx.oss-cn-beijing.aliyuncs.com/20251203001410403.png)

**Figure 3. Our new strategy for learning the sine–cosine representation of Euler angles**

To help you follow the improvements, the table below summarizes all the components that have been modified:

|               | **Previous method**                                          | **New method**                                               |
| ------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| Framework     | ![mini view1](https://java-dgx.oss-cn-beijing.aliyuncs.com/20251203101853994.png) | ![mini view2](https://java-dgx.oss-cn-beijing.aliyuncs.com/20251203101857586.png) |
| Training      | Trained independently                                        | Trained jointly                                              |
| Input         | Classifier: 2D keypoints (*N*, *J*, 2) <br />Pose estimator: augmented keypoints (*N*, *J*, 3) | 2D keypoints (*N*, *J*, 2)                                   |
| Output        | Classifier: condition (*N*, 1) <br />Euler angles: (*N*, *J*, 2) | Sine–cosine of Euler angles: (*N*, *J*, 6)<br />Orientation condition scalar: (*N*, *J*, 1) |
| Loss function | MPJAE and BCE computed independently                         | MPJSCE and BCE computed jointly                              |

Corresponding modifications have been made in the manuscript.



**Q3. Lacking the comparisons with learning the continuous representation, like 6D representation. Further comparison with directly learning continuous angular representations can further highlight the value of this research.**

**A3.** Thank you for pointing this out. We fully understand the importance of comparing our method with continuous rotation representations such as the 6D and quaternion representations.

**Regarding the 6D representation**, the paper *“On the Continuity of Rotation Representations in Neural Networks”* demonstrates how a 6D formulation can map the SO(3) manifold into a continuous space and effectively facilitate learning, especially for point cloud-based tasks. This work provides important insights for rotation learning beyond Euler angles.
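For reference, the sketch below shows our understanding of how a 6D vector is mapped to a rotation matrix in that line of work: the six values are read as two 3D vectors and orthonormalized with a Gram-Schmidt step, and the third axis is obtained by a cross product. The function name `rotation_from_6d` and the sample input are illustrative assumptions, not code from the cited paper or from our method.

```python
import numpy as np

def rotation_from_6d(x6):
    """Map a 6D vector to a 3x3 rotation matrix via Gram-Schmidt orthonormalization."""
    x6 = np.asarray(x6, dtype=float)
    a1, a2 = x6[:3], x6[3:]
    b1 = a1 / np.linalg.norm(a1)            # first axis: normalize a1
    a2_orth = a2 - np.dot(b1, a2) * b1      # remove the component of a2 along b1
    b2 = a2_orth / np.linalg.norm(a2_orth)  # second axis
    b3 = np.cross(b1, b2)                   # third axis completes a right-handed frame
    return np.stack([b1, b2, b3], axis=1)   # axes stored as matrix columns

# Any 6D vector whose two halves are not parallel yields a valid rotation.
R = rotation_from_6d([1.0, 0.2, 0.0, 0.1, 1.0, 0.3])
print(np.round(R, 3))
```

The result is orthonormal with determinant +1 by construction, so no constraint has to be imposed on the raw network output.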

Our improved method is conceptually related to this family of approaches. Instead of directly regressing Euler angles along the x, y, and z axes—which inherently suffer from periodic discontinuities—we view the problem from a geometric perspective. Specifically, we reinterpret angle regression as a prediction task on the unit circle by estimating the paired sine–cosine values for each Euler angle. This transformation effectively eliminates the discontinuity issue associated with direct angle regression.
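The effect of the unit-circle view can be seen in a small numerical sketch. The helper names `encode` and `decode` are hypothetical, and the two-argument arctangent is assumed for recovering the angle. Two nearly identical orientations, 179° and -179°, are far apart as raw angle values but close as points on the unit circle, and the angle is still recovered when the predicted pair is not exactly unit length.

```python
import numpy as np

def encode(angle_rad):
    """Represent an angle as a point (sin, cos) on the unit circle."""
    return np.array([np.sin(angle_rad), np.cos(angle_rad)])

def decode(sin_cos):
    """Recover the angle in (-pi, pi] from a (sin, cos) pair using atan2."""
    s, c = sin_cos
    return np.arctan2(s, c)

a = np.deg2rad(179.0)
b = np.deg2rad(-179.0)  # physically only 2 degrees away from a

raw_gap_deg = np.rad2deg(abs(a - b))                # about 358 degrees
circle_gap = np.linalg.norm(encode(a) - encode(b))  # chord length, about 0.035

# atan2 depends only on the ratio of the inputs, so a positive scale
# on the predicted pair does not change the decoded angle.
recovered = decode(1.3 * encode(a))

print(raw_gap_deg, circle_gap, np.rad2deg(recovered))
```

A regression loss on the raw angles would heavily penalize predicting -179° for a 179° target, whereas a loss on the sine-cosine pair treats the two as nearly identical.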

In this sense, our approach can be interpreted as lifting the prediction target from three Euler angles to a higher-dimensional continuous representation: six values (three sine–cosine pairs, each a point on the unit circle) are predicted and later converted back to Euler angles through their trigonometric relationships.

Our formulation therefore shares the motivation of the 6D paradigm described in the above paper. As discussed in our manuscript, predicting the sine–cosine pair of each Euler angle effectively forms a six-dimensional continuous representation of the rotation.

Although **quaternion-based rotation representations** are widely used in motion capture and animation, as shown in Formula (2), they rely heavily on a stable and well-defined rotation axis ***V***. In marker-based mocap systems with dense markers (e.g., 41-point setups), the rotation axis ***V*** can be reliably inferred in real time. However, in our setting this assumption does not hold: the implicit rotation axes derived from human motion exhibit high variability and lack consistent temporal regularity, as they are largely determined by the subject’s spontaneous movements.

![img](https://java-dgx.oss-cn-beijing.aliyuncs.com/20251203093356538.png)

Because the underlying rotation axes are unstable, the quaternion representation is difficult to model directly from the available data. For this reason, we did not adopt a quaternion-based formulation in our method.