Dear reviewer:

​	Thank you very much for taking the time to review our submission and for providing valuable feedback. We greatly appreciate your thoughtful comments, which are highly beneficial to improving our work. Below, we provide detailed responses to each of the concerns you raised.

**Q1. The conditional splitting relies on a classifier to assign labels. If the classifier misclassifies a frame, as shown in Figure 6, then assuming the ground truth is 179°, the network might predict a value like -170°, and the proposed MPJASE will assign, while the person only rotated 11°. The L1 loss used in MPJASE may not be the optimal solution for periodic angular data. Why is geodesic loss not used in MPJASE instead of L1 loss to address discontinuity issues?**

**A1.** We thank you for your insightful suggestions. We agree that the geodesic loss can address this issue; however, in our updated paper, we employed an alternative approach to achieve the same effect.

In the original approach, directly estimating Euler angles can lead to large numerical errors near the ±180° boundary. For example, if the ground truth is -179° and the predicted value is 170°, the original MPJAE loss computes an error of 349°, whereas the actual periodic angular difference is only 11°. In our improved method, we no longer predict Euler angles directly. Instead, we predict the (sine, cosine) pair of each Euler angle component and recover the angles using the arctangent function. This changes the output dimension from 3 to 6, mapping each Euler angle onto the unit circle in a 2D plane. In this representation, using the same example, the ground truth and the predicted value correspond to the (sine, cosine) coordinates (-0.01745241, -0.9998477) and (0.17364818, -0.98480775), respectively. The two points lie close together on the unit circle, so the error computed in this space is a more reasonable and physically meaningful measure.
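
To make the numbers in this example easy to check, the following minimal sketch reproduces them with the Python standard library. It is an illustration only, not our training code: the helper names `encode`, `decode`, and `wrapped_diff` are hypothetical, and the L1 distance on the (sine, cosine) pair is an assumed stand-in for the loss.

```python
import math

def encode(angle_deg):
    """Map an angle in degrees to its (sine, cosine) pair on the unit circle."""
    r = math.radians(angle_deg)
    return (math.sin(r), math.cos(r))

def decode(s, c):
    """Recover the angle in degrees from a (sine, cosine) pair via atan2."""
    return math.degrees(math.atan2(s, c))

def wrapped_diff(a_deg, b_deg):
    """Periodic (shortest) angular difference in degrees."""
    return abs((a_deg - b_deg + 180.0) % 360.0 - 180.0)

gt, pred = -179.0, 170.0

naive_error = abs(gt - pred)             # 349.0: error on raw Euler angles
periodic_error = wrapped_diff(gt, pred)  # 11.0: actual angular difference

gt_sc, pred_sc = encode(gt), encode(pred)
# L1 distance in (sine, cosine) space stays small because the points are neighbours.
sincos_l1 = sum(abs(g - p) for g, p in zip(gt_sc, pred_sc))

print(naive_error, periodic_error, gt_sc, pred_sc, sincos_l1)
```

Decoding with `decode(*pred_sc)` returns 170° again, so no information is lost by the change of representation.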

​	Corresponding modifications have been made in the manuscript, **page 5, lines 243-246**.



**Q2. The classifier proposed in the paper uses 2D keypoints as input. How is ResNet18 used as a classifier if it expects image inputs? If 2D keypoints are represented as heatmaps, how is the inference speed still exceptionally high?**

**A2.** Thank you for pointing out this limitation. As part of the improvements made in response to **Q1**, we revised the orientation-scalar module. Below, we explain in more detail how this module works in the previous and current methods, and how it contributes to reliable pose estimation.

In the revised approach, we still employ a ResNet-based network as the orientation-scalar network. However, this network no longer produces an independent output; instead, it is integrated into a unified framework together with the subsequent pose-prediction module.

As shown in Figure 3, we modified the input and output of the official ResNet as follows:

![resnet compare](https://java-dgx.oss-cn-beijing.aliyuncs.com/20251203001416542.png)

​	**Figure 3. The ResNet in our networks**	

(1) **At the input layer**, the original ResNet classifier accepts image inputs of size 𝐵×𝑊×𝐻×3, where 𝐵 denotes the batch size, 𝑊 and 𝐻 are the image width and height, and 3 corresponds to the RGB channels. In contrast, our 2D keypoints have the shape 𝐵×𝑁×𝐽×2, where 𝐵 is the batch size, 𝑁 is the number of frames, 𝐽 is the number of joints, and 2 is the number of coordinates per joint.

​	(2)  **At the output stage**, we modify the dimensionality of the prediction head. In the original ResNet used for binary classification, the network outputs predictions of size 𝐵×1. In our revised architecture, we redesign the output head to predict a conditional feature of size 𝐵×𝑁×1.

The corresponding description has been added to the manuscript.

The improved classification network preserves estimation quality for the following two reasons:

​	(1)  **Unified training with the pose estimation network**. In the original approach, the external classifier could not achieve 100% accuracy, inevitably introducing errors into the subsequent pose estimation. In our revised version, the classification network is trained jointly with the pose estimation network within a single unified framework. This design preserves the conditional information for estimation while avoiding the introduction of errors from an imperfect standalone classifier.

(2) **Modification of the estimation target**. As described in **A1**, directly estimating Euler angles can lead to large numerical errors: with a ground truth of -179° and a predicted value of 170°, the original MPJAE loss computes an error of 349°, whereas the actual periodic angular difference is only 11°. In our improved method, we no longer predict Euler angles directly. Instead, we predict the (sine, cosine) pair of each Euler angle component and recover the angles using the arctangent function. This changes the output dimension from 3 to 6, mapping each Euler angle onto the unit circle in a 2D plane. In this representation, the ground truth and the predicted value of the same example correspond to the coordinates (-0.01745241, -0.9998477) and (0.17364818, -0.98480775), respectively, so the error computed in this space is a more reasonable and physically meaningful measure.

Corresponding modifications have been made in the manuscript, **page 5, lines 243-246**.



**Q3. The work only reports results on the privately recorded dataset, and only MPJASE is presented. It is therefore unknown how the proposed method would affect other metrics, such as the Mean Per Joint Position Error (MPJPE),  on public datasets.**

**A3.** We understand the reviewer’s concern regarding validation on public benchmark datasets and fully acknowledge its importance. However, we encountered three practical **difficulties**:

(1) We contacted the official dataset team by email in December 2021, using our institutional research credentials, to request access, but we were unable to obtain official authorization. At present, the versions of the dataset circulating online contain only **position data**, rather than the full dataset required for our experiments.

(2) The Human3.6M dataset lacks sufficient variety in complex actions, which limits further research on reconstructing more challenging and diverse motion patterns.

(3) We also attempted to convert the 3D joint position ground truth directly into 3D rotation ground truth. However, 3D positional data alone does not preserve self-rotation information, as discussed in our paper.
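
The loss of self-rotation information in point (3) can be illustrated with a small sketch. It is an illustration under stated assumptions, not part of our pipeline: the bone axis, offsets, and the helper `rotate_about_axis` are hypothetical. A twist about the bone's own axis leaves the child joint position unchanged, so the twist cannot be recovered from joint positions alone, while an off-axis point such as a surface marker does move.

```python
import math

def rotate_about_axis(v, axis, angle_deg):
    """Rotate vector v about a unit-length axis using Rodrigues' formula."""
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    ax, ay, az = axis
    vx, vy, vz = v
    dot = ax * vx + ay * vy + az * vz
    cross = (ay * vz - az * vy, az * vx - ax * vz, ax * vy - ay * vx)
    return tuple(c * vi + s * ci + (1.0 - c) * dot * ai
                 for vi, ci, ai in zip(v, cross, axis))

bone_axis = (0.0, 1.0, 0.0)        # hypothetical bone pointing along +Y
child_offset = (0.0, 0.3, 0.0)     # child joint lies on the bone axis
marker_offset = (0.05, 0.15, 0.0)  # hypothetical off-axis point on the limb surface

# Apply a 60 degree self-rotation (twist) about the bone axis.
twisted_child = rotate_about_axis(child_offset, bone_axis, 60.0)
twisted_marker = rotate_about_axis(marker_offset, bone_axis, 60.0)

print(twisted_child)   # numerically identical to child_offset
print(twisted_marker)  # differs from marker_offset
```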

Considering all the above factors, we decided to collect our own dataset and plan to make it publicly available, along with our proposed method, to facilitate future research in this area.

**Q4. The division of the space into two intervals is coarse and might fail in complex poses, such as twisting the upper body.**

**A4.** We fully agree with your concern that inaccurate orientation-scalar estimation may affect the accuracy of the subsequent pose prediction. Our original pipeline indeed consists of two stages, and the performance of the first-stage classifier directly influences the second-stage pose estimation. This classifier achieves over 98% accuracy on the test set, which still falls short of perfect accuracy.

After carefully analyzing the experimental results, we find that most misclassifications occur in rare or extreme poses, such as prone positions, body twists, or other ambiguous human postures, as illustrated in Figure 1.

![push up 2](https://java-dgx.oss-cn-beijing.aliyuncs.com/20251203001435479.png)

​	**Figure 1. Left: Ground Truth, Middle: Prediction by reference methods, Right: Our previous methods**

To address this issue, we pursued two directions:

​	(1) **Replacing the original two-stage pipeline with a unified framework.**

In this new design, we still incorporate the orientation prior during training, but now treat it as part of the network’s output to enhance end-to-end learning effectiveness, as shown in Figure 2.

![over view5](https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204041505540.png)

​	**Figure 2.  Our new framework**

​	In Figure 2, our framework is no longer a “classifier–pose estimator” two-stage process. Instead, the original classification network is integrated into the pose estimation framework. The classifier continues to output a 1-dimensional feature, which gradually evolves into a conditional feature during training. With this unified architecture, the orientation information becomes a learnable output of the network.

In the previous method, the two stages used separate losses: the classifier was pre-trained with a binary cross-entropy (BCE) loss, and the pose estimator was trained with an MPJAE loss, so the two processes were independent. In the new unified framework, because the network predicts both the **sine–cosine representations of Euler angles** and the conditional orientation feature, we combine the MPJSCE loss with an additional direction loss term (BCE), weighted by $\lambda$. The formulation is as follows:
$$
\mathcal{L}_{\mathrm{Total}} = \mathcal{L}_{\mathrm{MPJSCE}} + \lambda \cdot \mathcal{L}_{\mathrm{BCE}}
$$
(2) **To address the discontinuity of Euler angles, we switch to learning their sine–cosine representations.**
In the improved method, we no longer estimate the Euler angles directly. Instead, the network predicts the sine and cosine values of the Euler angles, along with a 1-dimensional conditional feature. As illustrated in Figure 2, the network outputs a 7-dimensional vector: the purple components represent the sine values of the Euler angles, the yellow components represent the cosine values, and the green and red parts correspond to the conditional information. The final Euler angles are recovered by applying the arctangent function to the predicted sine and cosine values.
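
The two changes above can be sketched for a single joint as follows. This is a minimal illustration under stated assumptions rather than our implementation: the component order of the 7-dimensional vector (three sines, three cosines, one condition logit), the use of a mean absolute error on the sine and cosine values as the MPJSCE term, the sigmoid on the condition output, and the value of `lam` are all hypothetical choices made for the example.

```python
import math

def split_output(vec7):
    """Split one joint's 7-D output into (sines, cosines, condition logit)."""
    return vec7[0:3], vec7[3:6], vec7[6]

def recover_euler_deg(sines, cosines):
    """Recover the three Euler angles in degrees with the two-argument arctangent."""
    return [math.degrees(math.atan2(s, c)) for s, c in zip(sines, cosines)]

def sincos_l1(pred6, target6):
    """Assumed MPJSCE-style term: mean absolute error over sine/cosine values."""
    return sum(abs(p - t) for p, t in zip(pred6, target6)) / len(pred6)

def bce_with_logit(logit, label):
    """Binary cross-entropy on the orientation condition (label is 0 or 1)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    eps = 1e-12
    return -(label * math.log(p + eps) + (1 - label) * math.log(1.0 - p + eps))

def total_loss(pred7, target_angles_deg, target_label, lam=0.1):
    """L_Total = L_MPJSCE + lam * L_BCE for one joint (lam is a placeholder)."""
    sines, cosines, logit = split_output(pred7)
    rad = [math.radians(a) for a in target_angles_deg]
    target6 = [math.sin(r) for r in rad] + [math.cos(r) for r in rad]
    return sincos_l1(list(sines) + list(cosines), target6) + lam * bce_with_logit(logit, target_label)

# Example: the first angle is predicted as 170 degrees, the ground truth is -179 degrees.
pred7 = [math.sin(math.radians(170.0)), 0.0, 0.0,
         math.cos(math.radians(170.0)), 1.0, 1.0,
         2.0]
angles = recover_euler_deg(*split_output(pred7)[:2])
loss = total_loss(pred7, [-179.0, 0.0, 0.0], target_label=1, lam=0.1)
print(angles, loss)
```

In this example the recovered first angle is 170° and the loss stays small, reflecting the 11° periodic difference rather than the 349° raw difference.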

​	We summarize all the components that have been modified to help you better understand the improvements we have made:

|               | **Previous method**                                          | **New**                                                      |
| ------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| Framework     | ![mini view1](https://java-dgx.oss-cn-beijing.aliyuncs.com/20251203101853994.png) | ![](https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204044014457.png) |
| Training      | Trained  independently                                       | Trained  jointly                                             |
| Input         | Classifier: 2D keypoints (*N*, *J*, 2) <br />Pose estimator: augmented keypoints (*N*, *J*, 3) | 2D keypoints (*N*, *J*, 2) |
| Output        | Classifier: condition (*N*, 1) <br />Euler angles: (*N*, *J*, 2) | Sine-cosine of Euler angles: (*N*, *J*, 6)<br />Orientation condition scalar: (*N*, *J*, 1) |
| Loss function | MPJAE and BCE computed independently | MPJSCE and BCE computed jointly |

Corresponding modifications have been made in the manuscript, **page 4, lines 170-208**.



**Q5. The work describes converting OptiTrack marker trajectories into Euler angles but does not specify how bone hierarchies were defined or which rotation conventions were used (e.g., ZYX, XYZ).**

**A5.** Thank you for your comment. We apologize for omitting a detailed description of the angles used in our work. The angles we employ follow the **XYZ rotation order**. They are obtained from motion capture skeleton files exported from **Motive**, the motion capture software provided by OptiTrack, and the corresponding rotations are computed in **Blender**.



**Q6. Although Table 2 shows consistent improvement when the proposed method is applied, there is no experiment isolating the effect of the conditional classification alone or the augmented input representation, making it unclear which part of the improvement actually comes from the proposed method.**

**A6.** Thank you for pointing this out. We had omitted a clear definition of the experimental setup and only included some training parameters in the appendix, which caused confusion. Here, we provide a detailed description of the experimental setup for our improved method.

In the improved version of our method, we select several representative 3D human pose estimation approaches as baselines, including VideoPose3D, PoseFormer, PoseFormerV2, and SemGCN. To ensure a fair comparison, all methods follow the same preprocessing steps and use identical input–output formats. They are all optimized with the same loss function (MPJASE). Additionally, all models are trained on the same training set and evaluated on a common test set.

**1. Experimental Setup**

(1) **Euler angle dataset**: We use a self-collected dataset consisting of 117,325 training frames and 32,219 testing frames. The actions are categorized into several major classes: *Walk*, *Sit*, *Run*, *Jump*, *Squat*, *Torso*, *Arm*, *Leg*, and *Sports*. The **number of frames** for each category is **summarized** in Table 1. The human body in this dataset consists of 17 joints, with each joint represented by a single set of Euler angles following the XYZ rotation order.

Table 1. Number of frames per action category in the training and test sets.

|              | Walk  | Sit   | Run  | Jump | Squat | Torso | Arm   | Leg   | Sports | Sum    |
| ------------ | ----- | ----- | ---- | ---- | ----- | ----- | ----- | ----- | ------ | ------ |
| Training set | 13993 | 17843 | 2168 | 5253 | 2613  | 11027 | 26031 | 12246 | 20151  | 117325 |
| Test set     | 1229  | 1963  | 1040 | 988  | 275   | 959   | 2429  | 1112  | 22924  | 32219  |

(2) **Orientation dataset**: The orientation scalar is binary (0 or 1), with one value per frame. The value is determined by the sign of the root joint’s Euler angle projected onto the 2D horizontal plane: positive angles are assigned 1, and negative angles 0. The detailed method is described in the Methods section of our revised manuscript (**page 12, lines 626-630**). The training and testing splits for the orientation data are identical to those of the Euler angle dataset, as summarized in Table 1 above.
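
The labelling rule can be written as a short sketch. The function name is hypothetical, and the handling of an angle of exactly 0° is an assumption made for the example, since the rule above only covers positive and negative angles.

```python
def orientation_label(projected_root_angle_deg):
    """Return 1 for a positive projected root angle and 0 for a negative one."""
    # Assumption: an angle of exactly 0 is grouped with the negative class.
    return 1 if projected_root_angle_deg > 0 else 0

# One label per frame, computed from hypothetical projected root angles.
frame_angles = [35.0, -120.0, 179.0, -1.0]
labels = [orientation_label(a) for a in frame_angles]
print(labels)
```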

(3) **2D keypoint data**: The 2D keypoints for all video frames are obtained by running Detectron2 inference. We explain the method in our revised manuscript (**page 6, lines 316-317**).

(4) **Training setup**: All methods follow the training parameters recommended by their respective authors. The specific training configurations are listed in Table 2.

Table 2. Training configurations for all 3D human pose estimation methods.

| Method               | Core Settings       | Batch Size | Epoch | Special Modules  |
| -------------------- | ------------------- | ---------- | ----- | ---------------- |
| Pavllo et al. (2019) | arc = 3, 3, 3, 3, 3 | 1024       | 80    | -                |
| Zheng et al. (2021)  | f = 27              | default    | 70    | -                |
| Zhao et al. (2023)   | f = 27              | default    | 70    | -                |
| Liu et al. (2020b)   | default             | default    | 40    | -                |
| Liu et al. (2020a)   | arc = 3, 3, 3       | 128        | 60    | -                |
| Zhao et al. (2019)   | default             | default    | 90    | non-local module |

These details have been added to the manuscript, **page 13, lines 662-692**.



**2. Ablation Study**

For each baseline method, we conducted comparisons under the following settings:

**A**: Euler angles are predicted by the pose estimation method alone, without any auxiliary information.

**B**: Euler angles are predicted using the output of an external classifier as the condition.

**C**: Euler angles and the conditional information are predicted by the improved unified framework.

**D**: The improved unified framework predicts the sine and cosine values of the Euler angles along with the conditional information, and the final Euler angles are computed with the arctangent function.

**(1)  Qualitative results:** Our new method achieves significant improvements, not only outperforming the original approaches overall but also correctly estimating previously challenging ambiguous poses. As illustrated in Figure 5, we compare the performance of different methods on two representative pose types:

| VideoPose3D  | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020303292.png" alt="VideoPose243" style="zoom: 27%;" /> | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020540440.png" alt="VideoPose225" style="zoom:35%;" /> |
| ------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| PoseFormer   | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020440930.png" alt="PoseFormer243" style="zoom:25%;" /> | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020549127.png" alt="PoseFormer225" style="zoom:33%;" /> |
| PoseFormerV2 | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020524021.png" alt="PsoeFormerV2243" style="zoom:33%;" /> | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020839320.png" alt="PoseFormerV2225" style="zoom:33%;" /> |
| SemGCN       | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020530823.png" alt="SemGCN243" style="zoom:30%;" /> | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020845232.png" alt="SemGCN225" style="zoom:33%;" /> |
|              | Torso (push up)                                              | Sit                                                          |

**Figure 5. Comparison on *Sit* and *Torso* actions under ablation setting D.**

Figure 5 visualizes complex poses under setting D, with actions selected from the *Sit* and *Torso* categories, specifically a seated posture and a push-up posture. The results demonstrate that predicting Euler angles through their sine and cosine values, followed by recovery with the arctangent function, effectively resolves rotational discontinuities even in these challenging poses.

(2) **Quantitative results**: Table 3 presents the results of the four methods under settings A to D. For evaluation, the MPJSCE loss is computed.

​	**Table 3. Quantitative comparison of MPJASE under ablation settings from A to D on our Euler angle Dataset.**

| Methods      | Setting | Walk  | Sit   | Run   | Jump  | Squat | Torso | Arm   | Leg   | Avg   |
| ------------ | ------- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| VideoPose3D  | A       | 5.73  | 8.00  | 5.99  | 6.50  | 5.46  | 6.47  | 6.00  | 7.26  | 6.43  |
|              | B       | 5.16  | 7.32  | 5.50  | 5.44  | 5.28  | 6.31  | 5.81  | 6.61  | 5.93  |
|              | C       | 5.25  | 7.87  | 5.87  | 5.84  | 5.41  | 6.45  | 5.85  | 6.56  | 6.14  |
|              | D       | 5.08  | 7.22  | 5.43  | 5.42  | 5.21  | 6.29  | 5.71  | 6.25  | 5.83  |
| PoseFormer   | A       | 6.90  | 8.49  | 7.77  | 7.05  | 5.81  | 8.07  | 6.31  | 8.61  | 7.38  |
|              | B       | 5.15  | 7.41  | 5.32  | 5.18  | 5.60  | 6.58  | 6.13  | 7.10  | 6.06  |
|              | C       | 5.66  | 7.42  | 7.45  | 6.88  | 5.65  | 6.51  | 6.22  | 7.51  | 6.66  |
|              | D       | 4.42  | 7.32  | 5.21  | 4.98  | 5.19  | 5.71  | 5.66  | 6.85  | 5.67  |
| PoseFormerV2 | A       | 9.19  | 11.44 | 9.72  | 8.23  | 6.77  | 9.01  | 7.52  | 9.65  | 8.94  |
|              | B       | 5.90  | 8.38  | 5.62  | 6.35  | 6.75  | 7.47  | 6.96  | 7.79  | 6.90  |
|              | C       | 6.82  | 9.22  | 6.35  | 7.51  | 6.90  | 9.21  | 6.81  | 8.12  | 7.62  |
|              | D       | 5.68  | 8.21  | 5.52  | 6.28  | 6.71  | 7.41  | 6.88  | 7.68  | 6.80  |
| SemGCN       | A       | 10.32 | 11.25 | 10.84 | 11.13 | 10.37 | 11.34 | 10.27 | 10.96 | 10.81 |
|              | B       | 8.59  | 10.77 | 8.69  | 9.20  | 9.20  | 9.90  | 8.58  | 9.78  | 9.34  |
|              | C       | 9.51  | 11.58 | 9.51  | 9.90  | 10.15 | 9.54  | 9.33  | 9.91  | 9.93  |
|              | D       | 8.42  | 10.41 | 7.84  | 7.13  | 9.17  | 8.71  | 8.24  | 8.96  | 8.61  |
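
For readers who wish to check the Avg column, the sketch below recomputes it for the VideoPose3D rows under settings A and D. It assumes that Avg is the unweighted mean of the eight per-category values in the row, which matches the reported numbers; the variable names are hypothetical.

```python
# Per-category values (Walk, Sit, Run, Jump, Squat, Torso, Arm, Leg)
# copied from the VideoPose3D rows of Table 3.
rows = {
    "A": [5.73, 8.0, 5.99, 6.5, 5.46, 6.47, 6.0, 7.26],
    "D": [5.08, 7.22, 5.43, 5.42, 5.21, 6.29, 5.71, 6.25],
}

def mean(values):
    """Unweighted mean over the category columns."""
    return sum(values) / len(values)

avg_a = mean(rows["A"])  # 6.42625
avg_d = mean(rows["D"])  # 5.82625

# Relative error reduction from setting A to setting D.
relative_reduction = (avg_a - avg_d) / avg_a
print(avg_a, avg_d, relative_reduction)
```

Under this assumption, setting D lowers the VideoPose3D average by about 9.3% relative to setting A.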

The ablation study has been added to the manuscript, **pages 8-9**.
