Dear Reviewer,

Thank you very much for taking the time to review our submission and for providing valuable feedback. We greatly appreciate your thoughtful comments, which have been highly beneficial for improving our work. Below, we provide detailed responses to each of the concerns you raised.

**Q1. To what extent does the minor parameter overhead of other rotation representations (e.g., 6D or quaternions) justify the decision to predict Euler angles, which are known to suffer from discontinuities and gimbal lock?**

**A1.** Thank you for pointing this out. We fully understand the importance of comparing our method with continuous rotation representations such as the **6D** and **quaternion-based** representations.

**Regarding the 6D representation**, the paper *“On the Continuity of Rotation Representations in Neural Networks”* demonstrates how a 6D formulation can map the SO(3) manifold into a continuous space and effectively facilitate learning, especially for point cloud-based tasks. This work provides important insights for rotation learning beyond Euler angles.

Our improved method is conceptually related to this family of approaches. Instead of directly regressing Euler angles about the x, y, and z axes, which inherently suffer from periodic discontinuities, we view the problem from a geometric perspective. Specifically, we reinterpret angle regression as a prediction task on the unit circle by estimating the paired sine–cosine values of each Euler angle. This removes the wrap-around discontinuity associated with direct angle regression.

In this sense, our approach lifts the prediction target from three Euler angles to a continuous six-dimensional representation: six values (three sine–cosine pairs, each lying on the unit circle) are predicted and later converted back to Euler angles through their trigonometric relationships.

As discussed in our manuscript, this formulation is aligned with the 6D paradigm described in the above paper in that it regresses a continuous six-dimensional target, although our six values are sine–cosine pairs of Euler angles rather than the two rotation-matrix columns used there.
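For illustration, the sine–cosine encoding and its inverse can be sketched as follows. This is a minimal sketch and not code from our implementation: the function names `encode_euler` and `decode_euler` are hypothetical, angles are assumed to be in degrees, and the use of the two-argument arctangent (`atan2`) for decoding is an assumption.

```python
import math

def encode_euler(angles_deg):
    """Map Euler angles (degrees) to a flat list of (sin, cos) pairs."""
    out = []
    for a in angles_deg:
        r = math.radians(a)
        out.extend([math.sin(r), math.cos(r)])
    return out

def decode_euler(sincos):
    """Recover angles in degrees from consecutive (sin, cos) pairs via atan2."""
    angles = []
    for i in range(0, len(sincos), 2):
        s, c = sincos[i], sincos[i + 1]
        angles.append(math.degrees(math.atan2(s, c)))
    return angles

# Angles on either side of the +/-180 degree seam map to nearby 6-D targets,
# even though the raw angle values differ by 358 degrees.
a = encode_euler([179.0, 0.0, 0.0])
b = encode_euler([-179.0, 0.0, 0.0])
seam_gap = max(abs(x - y) for x, y in zip(a, b))

# Encoding followed by decoding returns the original angles.
roundtrip = decode_euler(encode_euler([30.0, -120.0, 179.0]))
```

Because `atan2` depends only on the direction of the (cos, sin) vector, a predicted pair does not need to have exactly unit norm to decode to a well-defined angle, as long as both values are not close to zero.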

Although **quaternion-based rotation representations** are widely used in motion capture and animation, as shown in Formula (2), they rely heavily on a stable and well-defined rotation axis ***V***. In marker-based mocap systems with dense markers (e.g., 41-point setups), the rotation axis ***V*** can be reliably inferred in real time. However, in our setting this assumption does not hold: the implicit rotation axes derived from human motion exhibit high variability and lack consistent temporal regularity, as they are largely determined by the subject’s spontaneous movements.

![img](https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204010438119.png)

Due to this instability of the underlying rotation axes, the quaternion representation becomes difficult to model directly from the available data. For this reason, we did not adopt a quaternion-based formulation in our method.

The **gimbal lock problem** occurs when a local rotation axis aligns with one of the global axes, causing the corresponding rotational degree of freedom to be lost. In our study, all joints are rotated with respect to the global coordinate system following the X-Y-Z rotation order. Therefore, gimbal lock does not occur in our setting.



**Q2. How does the proposed 2D projection and binary conditioning ensure that the predicted rotations remain valid on the SO(3) manifold rather than producing geometrically invalid or impractical rotations?**

**A2.** Thank you for your insightful discussion of the problem from the perspective of geometric manifolds. To address this issue, we made two changes:

(1) **Replacing the original two-stage pipeline with a unified framework.**

In this new design, we still incorporate the orientation prior during training, but now treat it as part of the network’s output to strengthen end-to-end learning, as shown in Figure 2.

![over view5](https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204041505540.png)

**Figure 2. Our new framework**

In Figure 2, our framework is no longer a two-stage “classifier–pose estimator” process. Instead, the original classification network is integrated into the pose estimation framework. The classifier continues to output a 1-dimensional feature, which gradually evolves into a conditional feature during training. With this unified architecture, the orientation information becomes a learnable output of the network.

In the previous method, the two stages used separate losses: the classifier was pre-trained with a binary cross-entropy (BCE) loss and the pose estimator was trained with an MPJAE loss, and the two processes were independent. In the new unified framework, because the network predicts both the **sine–cosine values of the Euler angles** and the conditional orientation feature, we combine MPJASE with an additional orientation (BCE) loss term. The total loss, in which $\lambda$ weights the BCE term, is:
$$
\mathcal{L}_{\mathrm{Total}} = \mathcal{L}_{\mathrm{MPJASE}} + \lambda \cdot \mathcal{L}_{\mathrm{BCE}}
$$

(2) **Learning the sine–cosine form of the Euler angles to address their discontinuity.**

In the improved method, we no longer estimate the Euler angles directly. Instead, the network predicts the sine and cosine values of the Euler angles, along with a 1-dimensional conditional feature. As illustrated in Figure 2, the network outputs a 7-dimensional vector per joint: the purple components are the sine values of the Euler angles, the yellow components are the cosine values, and the green and red parts correspond to the conditional information. The final Euler angles are recovered with the arctangent function applied to the predicted sine and cosine values.

We summarize the modified components below to clarify the improvements:

|               | **Previous method**                                          | **New**                                                      |
| ------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| Framework     | ![mini view1](https://java-dgx.oss-cn-beijing.aliyuncs.com/20251203101853994.png) | ![](https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204044014457.png) |
| Training      | Trained independently                                        | Trained jointly                                              |
| Input         | Classifier: 2D keypoints (*N*, *J*, 2) <br />Pose estimator: augmented keypoints (*N*, *J*, 3) | 2D keypoints (*N*, *J*, 2)                                   |
| Output        | Classifier: condition (*N*, 1) <br />Euler angles: (*N*, *J*, 2) | Sine–cosine of Euler angles: (*N*, *J*, 6)<br />Orientation condition scalar: (*N*, *J*, 1) |
| Loss function | MPJASE and BCE computed independently                        | MPJASE and BCE optimized jointly                             |

Corresponding modifications have been made in the manuscript.
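As an illustration of how the 7-dimensional per-joint output and the combined loss in A2 fit together, a minimal sketch is given below. Several details are assumptions made only for this example and are not taken from the manuscript: the component ordering `[sin x3, cos x3, condition]`, treating the conditional output as a logit, using a mean absolute error on the sine–cosine values as a stand-in for the MPJASE term, and the value `lam=0.1`. The function names are hypothetical.

```python
import math

def split_output(vec7):
    """Split one joint's 7-D output (assumed order: 3 sines, 3 cosines, 1 condition)."""
    return vec7[0:3], vec7[3:6], vec7[6]

def bce_with_logit(logit, target):
    """Binary cross-entropy on a single logit, in a numerically stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def total_loss(pred7, gt_angles_deg, gt_orientation, lam=0.1):
    """Angular term plus lam times the BCE term, mirroring the total-loss formula."""
    sin_p, cos_p, logit = split_output(pred7)
    err = 0.0
    for s, c, a in zip(sin_p, cos_p, gt_angles_deg):
        r = math.radians(a)
        # Assumed stand-in for the MPJASE term: absolute error on sine and cosine.
        err += abs(s - math.sin(r)) + abs(c - math.cos(r))
    angular = err / 6.0
    return angular + lam * bce_with_logit(logit, gt_orientation)

# A prediction that matches the ground truth and is confident about the
# orientation label yields a loss close to zero.
example = total_loss([0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 20.0], [0.0, 0.0, 0.0], 1)
```

In this sketch both terms are computed from the same output vector, which is what allows a single backward pass to train the pose and orientation outputs jointly.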



**Q3. How does the L1 loss on Euler angles handle wrap-around discontinuities and gimbal lock while ensuring that the predicted rotations remain consistent with the SO(3) manifold?**

**A3.** We appreciate you pointing out this issue in our previous method. The **L1** loss cannot fully capture the true rotational error. For example, when a 2D pose is near a critical angle such as 179°, the classifier may fail to determine its interval correctly. If the classifier makes an incorrect decision, the subsequent prediction could be a negative angle, e.g., -170°, giving a computed error of 349°, whereas the actual rotational difference is only 11°.

In our new method, we predict the sine–cosine values of each Euler angle, as described in **A2**. Because sine and cosine are continuous across the ±180° boundary, nearby rotations yield nearby regression targets, which keeps the predictions consistent with the SO(3) manifold.

Corresponding modifications have been made in the manuscript, **page 2, lines 59-63** and **pages 5-6, lines 265-285**.
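The 179° versus -170° example in A3 can be checked numerically. The sketch below is illustrative only; the helper names `naive_l1`, `wrapped_diff`, and `sincos_l1` are hypothetical and angles are assumed to be in degrees.

```python
import math

def naive_l1(a_deg, b_deg):
    """Plain absolute difference, as used by an L1 loss on raw angles."""
    return abs(a_deg - b_deg)

def wrapped_diff(a_deg, b_deg):
    """Smallest absolute difference between two angles, in [0, 180]."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)

def sincos_l1(a_deg, b_deg):
    """L1 distance between the (sin, cos) encodings of two angles."""
    ra, rb = math.radians(a_deg), math.radians(b_deg)
    return abs(math.sin(ra) - math.sin(rb)) + abs(math.cos(ra) - math.cos(rb))

naive = naive_l1(179.0, -170.0)        # 349.0
wrapped = wrapped_diff(179.0, -170.0)  # 11.0
encoded = sincos_l1(179.0, -170.0)     # small, since the two rotations are close
```

The sine–cosine distance stays small for this pair and is zero for angles that differ by a full turn, so it does not exhibit the wrap-around jump of the raw L1 difference.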

​	

**Q4. Could you provide more details about the dataset (e.g., number of subjects and quality control)? Will it be released publicly?**

**A4.** Thank you for this question. We recruited 9 participants for our motion capture data collection (8 male and 1 female), including a martial arts instructor and a professional dancer. Prior to each recording session, the performers were trained on the target actions, which included high-frequency general poses and daily-life motions. All participants wore motion capture suits with 41 markers. The action categories, durations, and data formats are consistent with Human3.6M. The data comprise two datasets:

(1) **Euler angle dataset**: This self-collected dataset consists of 117,325 training frames and 32,219 testing frames. The actions are grouped into the following major classes: *Walk*, *Sit*, *Run*, *Jump*, *Squat*, *Torso*, *Arm*, *Leg*, and *Sports*. The number of frames in each category is summarized in Table 1. The human body in this dataset is represented by 17 joints, with each joint described by Euler angles following the XYZ rotation order.

**Table 1. Number of frames per action category in the training and test sets.**

|              | Walk  | Sit   | Run  | Jump | Squat | Torso | Arm   | Leg   | Sports | Sum    |
| ------------ | ----- | ----- | ---- | ---- | ----- | ----- | ----- | ----- | ------ | ------ |
| Training set | 13993 | 17843 | 2168 | 5253 | 2613  | 11027 | 26031 | 12246 | 20151  | 117325 |
| Test set     | 1229  | 1963  | 1040 | 988  | 275   | 959   | 2429  | 1112  | 22924  | 32219  |

(2) **Orientation dataset**: The orientation label is a binary scalar (0 or 1), with one value per frame. The value is determined by the sign of the root joint’s Euler angle projected onto the 2D horizontal plane: positive angles are assigned 1 and negative angles 0. The detailed method is described in the Methods section of our revised manuscript (Page X, lines XX-XX). The training and testing splits for the orientation data are identical to those of the Euler angle dataset summarized in Table 1.
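The labeling rule for the orientation dataset can be sketched in a few lines. This is only an illustration: the function name `orientation_label` is hypothetical, the projected root angle is assumed to be already computed and given in degrees, and the handling of an angle of exactly zero (mapped to 0 here) is an assumption, since the rule above covers only positive and negative angles.

```python
def orientation_label(projected_root_angle_deg):
    """Return 1 for a positive projected root angle and 0 otherwise."""
    # Assumption: an angle of exactly 0 is grouped with the negative class.
    return 1 if projected_root_angle_deg > 0.0 else 0

# One label per frame, computed from that frame's projected root angle.
frame_angles = (35.0, -12.5, 170.0, -179.0)
labels = [orientation_label(a) for a in frame_angles]  # [1, 0, 1, 0]
```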

We plan to release this dataset to the academic community upon formal publication of the paper, to enrich the diversity of existing datasets, facilitate further research, and enable fair comparisons in the field.

Regarding the **quality of the dataset**, all recordings were captured in a controlled lighting environment using an OptiTrack high-precision optical motion capture system, covering various daily activities and sports sequences. A strict quality control procedure was applied, including frame synchronization and manual data inspection.

Corresponding modifications have been made in the manuscript (**Page 12**).



**Q5. Why were no evaluations conducted on standard benchmarks (e.g., Human3.6M), given that Euler angles can be derived from rotation matrices?**

**A5.** We fully understand the reviewer’s concern regarding validation on public benchmark datasets and acknowledge its importance. However, we encountered three practical **difficulties**:

(1) We contacted the official dataset team by email in December 2021, using our institutional research credentials, to request access, but we were unable to obtain official authorization. At present, the versions of the dataset circulating online contain only position data rather than the full dataset required for our experiments.

(2) The Human3.6M dataset lacks sufficient variety of complex actions, which limits research on reconstructing more challenging and diverse motion patterns.

(3) We also attempted to convert the 3D joint position ground truth directly into 3D rotation ground truth. However, with 3D positional data alone, the self-rotation information is lost, as discussed in our paper.

Considering all the above factors, we decided to collect our own dataset and plan to make it publicly available, along with our proposed method, to facilitate future research in this area.



**Q6. How do improvements in angular metrics (MPJASE) relate to positional metrics like MPJPE?**

**A6.** Thank you for this question. MPJASE (Mean Per Joint Angular Error) directly measures the accuracy of joint orientation estimation. It computes the difference for each Euler angle component, takes the absolute value, and averages across all components, so it effectively functions as an **L1 loss**. In contrast, MPJPE (Mean Per Joint Position Error) measures the accuracy of 3D spatial coordinates by computing the Euclidean distance between corresponding 3D points, which corresponds to an **L2 loss**.
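The difference between the two error types can be illustrated with a small numerical sketch. It is illustrative only: the function names are hypothetical, the angle errors are made-up values in degrees, and wrap-around is ignored for simplicity.

```python
import math

def mean_abs_error(pred, gt):
    """L1-style error: mean absolute difference over the angle components."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)

def euclidean_error(pred, gt):
    """L2-style error: Euclidean norm of the difference, as in MPJPE."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)))

gt = [0.0, 0.0, 0.0]
pred = [1.0, 1.0, 10.0]           # one component has a much larger error

l1 = mean_abs_error(pred, gt)     # 4.0
l2 = euclidean_error(pred, gt)    # sqrt(102), about 10.1

# Share of each error attributable to the largest component.
share_l1 = 10.0 / (1.0 + 1.0 + 10.0)     # about 0.83
share_l2 = 10.0 ** 2 / (1.0 + 1.0 + 100.0)  # about 0.98
```

In this example the largest component accounts for roughly 83% of the L1 error but about 98% of the squared L2 error, which is the amplification effect referred to in this answer.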

​	Using MPJPE directly on Euler angles does not align with the properties of rotation. It disproportionately amplifies the influence of the component with the largest error and cannot accurately reflect angular discrepancies. Our goal is to treat all three angle components equally in the loss function.

Additionally, we found that MPJASE still has certain limitations (as discussed in Question 3). In our improved version, we employ a sine–cosine representation of the Euler angles, which is presented on **pages 5-6, lines 265-285**.



**Q7. When integrating into existing frameworks (e.g., Zhao et al., 2019), were prediction heads modified to output Euler angles?**

**A7.** We appreciate your attention to our study. In the **previous** method, we did not modify the existing prediction head structures. These models natively output a 3-dimensional vector for each joint, with shape 𝑁×𝐽×3. In our experiments, the output shape remained 𝑁×𝐽×3; we simply reinterpreted the three components as the X, Y, and Z Euler angles and applied our loss function on this basis during training.

In the **latest version** of our method, we reformulated the regression targets from Euler angles to their sine and cosine components and additionally output a conditional feature. **Accordingly, the output head was modified to have shape 𝑁×𝐽×7.** This modification led to promising experimental results. The details of the method are presented in **A2**, and the corresponding experimental results are shown in **A9**.

Corresponding modifications have been made in the manuscript (**page 4, lines 176-178**).



**Q8. Have you compared the conditional Euler formulation directly with other rotation representations under identical settings?**

**A8.** Thank you for pointing this out. We fully understand the importance of comparing our method with continuous rotation representations such as the **6D** and **quaternion-based** representations.

**Regarding the 6D representation**, the paper *“On the Continuity of Rotation Representations in Neural Networks”* demonstrates how a 6D formulation can map the SO(3) manifold into a continuous space and effectively facilitate learning, especially for point cloud-based tasks. This work provides important insights for rotation learning beyond Euler angles.

Our improved method is conceptually related to this family of approaches. Instead of directly regressing Euler angles about the x, y, and z axes, which inherently suffer from periodic discontinuities, we view the problem from a geometric perspective. Specifically, we reinterpret angle regression as a prediction task on the unit circle by estimating the paired sine–cosine values of each Euler angle. This removes the wrap-around discontinuity associated with direct angle regression.

In this sense, our approach lifts the prediction target from three Euler angles to a continuous six-dimensional representation: six values (three sine–cosine pairs, each lying on the unit circle) are predicted and later converted back to Euler angles through their trigonometric relationships.

As discussed in our manuscript, this formulation is aligned with the 6D paradigm described in the above paper in that it regresses a continuous six-dimensional target, although our six values are sine–cosine pairs of Euler angles rather than the two rotation-matrix columns used there.

Although **quaternion-based rotation representations** are widely used in motion capture and animation, as shown in Formula (2), they rely heavily on a stable and well-defined rotation axis ***V***. In marker-based mocap systems with dense markers (e.g., 41-point setups), the rotation axis ***V*** can be reliably inferred in real time. However, in our setting this assumption does not hold: the implicit rotation axes derived from human motion exhibit high variability and lack consistent temporal regularity, as they are largely determined by the subject’s spontaneous movements.

![img](https://java-dgx.oss-cn-beijing.aliyuncs.com/20251203221223091.png)

Due to this instability of the underlying rotation axes, the quaternion representation becomes difficult to model directly from the available data. For this reason, we did not adopt a quaternion-based formulation in our method.



**Q9. Can you provide quantitative ablations showing the contribution of the conditional classifier to the final performance?**

**A9**. Thank you for pointing out the shortcomings in our previous manuscript. We have incorporated additional ablation studies based on our latest method to further validate its effectiveness. 

**Ablation Study**

Within VideoPose3D, we conducted comparisons for each method under the following settings: 

**A**: Euler angles are predicted by the pose estimation method alone, without any auxiliary information.

**B**: Euler angles are predicted using the output of an external classifier.

**C**: Euler angles and conditional information are predicted by the improved unified framework.

**D**: The improved unified framework predicts the sine and cosine values of the Euler angles along with the conditional information, and the final Euler angles are computed with the arctangent function.

**(1) Qualitative results:** Our new method improves on the original approaches overall and also correctly estimates ambiguous poses that were previously challenging. Figure 5 compares the different methods on two representative pose types:

| VideoPose3D  | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020303292.png" alt="VideoPose243" style="zoom: 27%;" /> | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020540440.png" alt="VideoPose225" style="zoom:35%;" /> |
| ------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| PoseFormer   | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020440930.png" alt="PoseFormer243" style="zoom:25%;" /> | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020549127.png" alt="PoseFormer225" style="zoom:33%;" /> |
| PoseFormerV2 | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020524021.png" alt="PsoeFormerV2243" style="zoom:33%;" /> | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020839320.png" alt="PoseFormerV2225" style="zoom:33%;" /> |
| SemGCN       | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020530823.png" alt="SemGCN243" style="zoom:30%;" /> | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020845232.png" alt="SemGCN225" style="zoom:33%;" /> |
|              | Torso (push-up)                                              | Sit                                                          |

**Figure 5. Comparison on *Sit* and *Torso* actions under ablation setting D.**

As shown in Figure 5, for **push-up poses**, Methods A and B still produce discontinuous estimates when the human body is in prone or sitting positions, whereas Methods C and D resolve this issue. For **sitting poses**, Methods A, B, and C all show varying degrees of pose jumps; only Method D consistently produces smooth and physically plausible results, eliminating the pose discontinuities.

**(2) Quantitative results:** Table 3 reports the results for the four settings, evaluated with the MPJSCE loss.

**Table 3. Quantitative comparison of MPJASE under ablation settings from A to D on our Euler angle Dataset.**

| Methods      | Setting | Walk     | Sit       | Run      | Jump     | Squat    | Torso    | Arm      | Leg      | Avg      |
| ------------ | ------- | -------- | --------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| VideoPose3D  | A       | 5.73     | 8.00      | 5.99     | 6.50     | 5.46     | 6.47     | 6.00     | 7.26     | 6.43     |
|              | B       | 5.16     | 7.32      | 5.50     | 5.44     | 5.28     | 6.31     | 5.81     | 6.61     | 5.93     |
|              | C       | 5.25     | 7.87      | 5.87     | 5.84     | 5.41     | 6.45     | 5.85     | 6.56     | 6.14     |
|              | D       | **5.08** | **7.22**  | **5.43** | **5.42** | **5.21** | **6.29** | **5.71** | **6.25** | **5.83** |
| PoseFormer   | A       | 6.90     | 8.49      | 7.77     | 7.05     | 5.81     | 8.07     | 6.31     | 8.61     | 7.38     |
|              | B       | 5.15     | 7.41      | 5.32     | 5.18     | 5.60     | 6.58     | 6.13     | 7.10     | 6.06     |
|              | C       | 5.66     | 7.42      | 7.45     | 6.88     | 5.65     | 6.51     | 6.22     | 7.51     | 6.66     |
|              | D       | **4.42** | **7.32**  | **5.21** | **4.98** | **5.19** | **5.71** | **5.66** | **6.85** | **5.67** |
| PoseFormerV2 | A       | 9.19     | 11.44     | 9.72     | 8.23     | 6.77     | 9.01     | 7.52     | 9.65     | 8.94     |
|              | B       | 5.90     | 8.38      | 5.62     | 6.35     | 6.75     | 7.47     | 6.96     | 7.79     | 6.90     |
|              | C       | 6.82     | 9.22      | 6.35     | 7.51     | 6.90     | 9.21     | 6.81     | 8.12     | 7.62     |
|              | D       | **5.68** | **8.21**  | **5.52** | **6.28** | **6.71** | **7.41** | **6.88** | **7.68** | **6.80** |
| SemGCN       | A       | 10.32    | 11.25     | 10.84    | 11.13    | 10.37    | 11.34    | 10.27    | 10.96    | 10.81    |
|              | B       | 8.59     | 10.77     | 8.69     | 9.20     | 9.20     | 9.90     | 8.58     | 9.78     | 9.34     |
|              | C       | 9.51     | 11.58     | 9.51     | 9.90     | 10.15    | 9.54     | 9.33     | 9.91     | 9.93     |
|              | D       | **8.42** | **10.41** | **7.84** | **7.13** | **9.17** | **8.71** | **8.24** | **8.96** | **8.61** |

The ablation study has been added to the manuscript (**pages 8-9, lines 421-487**).
