Dear reviewer:

​	Thank you very much for taking the time to review our submission and for providing valuable feedback. We greatly appreciate your thoughtful comments, which are highly beneficial to improving our work. Below, we provide detailed responses to each of the concerns you raised.

**Q1. The proposed method is highly dependent on the correctness of the orientation scalar estimated in the first step. If the rough pose estimation is wrong, the result is likely to be completely wrong.**

**A1.** We fully agree that inaccurate orientation-scalar estimation can degrade the subsequent pose prediction. Our original pipeline indeed consists of two stages, and the performance of the first-stage classifier directly influences the second-stage pose estimation. Although this classifier achieves over 98% accuracy on the test set, it is not perfect, and its errors propagate to the second stage.

After carefully analyzing the experimental results, we found that most misclassifications occur in rare or extreme poses, such as prone positions, body twists, or other ambiguous human postures, as illustrated in Figure 1.

![push up 2](https://java-dgx.oss-cn-beijing.aliyuncs.com/20251203001435479.png)

**Figure 1. Left: ground truth. Middle: prediction by reference methods. Right: prediction by our previous method.**

To address this issue, we made two changes:

​	(1) **Replacing the original two-stage pipeline with a unified framework.**

In this new design, we still incorporate the orientation prior during training, but now treat it as part of the network’s output to strengthen end-to-end learning, as shown in Figure 2.

![over view5](https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204041505540.png)

​	**Figure 2.  Our new framework**

​	In Figure 2, our framework is no longer a “classifier–pose estimator” two-stage process. Instead, the original classification network is integrated into the pose estimation framework. The classifier continues to output a 1-dimensional feature, which gradually evolves into a conditional feature during training. With this unified architecture, the orientation information becomes a learnable output of the network.

In the previous method, the two stages used separate losses: the classifier was pre-trained with a binary cross-entropy (BCE) loss, and the pose estimator was trained with an MPJAE loss, so the two processes were independent. In the new unified framework, because the network predicts both the **sine–cosine values of the Euler angles** and the conditional orientation feature, we combine the MPJASE loss with an additional BCE loss term on the orientation output, weighted by $\lambda$. The formulation is as follows:
$$
\mathcal{L}_{\mathrm{Total}} = \mathcal{L}_{\mathrm{MPJASE}} + \lambda \cdot \mathcal{L}_{\mathrm{BCE}}
$$

(2) **To address the discontinuity of Euler angles, we switch to learning their sine–cosine form.**

In the improved method, we no longer estimate the Euler angles directly. Instead, the network predicts the sine and cosine values of the Euler angles, along with a 1-dimensional conditional feature. As illustrated in Figure 2, the network outputs a 7-dimensional vector: the purple components represent the sine values of the Euler angles, the yellow components represent the cosine values, and the green and red parts correspond to the conditional information. Each Euler angle is then recovered from its predicted sine and cosine values with the arctangent function.
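
The decoding step described above can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: the function names `decode_joint_output` and `encode_angles`, and the ordering of the seven values (three sines, three cosines, then the conditional value), are assumptions made for the example.

```python
import math


def decode_joint_output(vec):
    """Decode one joint's 7 values into Euler angles (degrees) and the condition.

    Assumed layout (hypothetical): [sin_x, sin_y, sin_z, cos_x, cos_y, cos_z, cond].
    """
    if len(vec) != 7:
        raise ValueError("expected 7 values per joint")
    sines, cosines, cond = vec[0:3], vec[3:6], vec[6]
    # atan2 uses the signs of both arguments, so the full +/-180 degree range
    # is recovered even if a predicted pair is not exactly unit length.
    angles = [math.degrees(math.atan2(s, c)) for s, c in zip(sines, cosines)]
    return angles, cond


def encode_angles(angles_deg, cond):
    """Helper for the example: build the 7-value vector from known angles."""
    rads = [math.radians(a) for a in angles_deg]
    return [math.sin(r) for r in rads] + [math.cos(r) for r in rads] + [cond]


angles, cond = decode_joint_output(encode_angles([30.0, -120.0, 179.0], 1.0))
print([round(a, 3) for a in angles], cond)
```

Because the two-argument arctangent depends only on the ratio and signs of its inputs, the decoded angle is unchanged if the network's sine and cosine outputs are scaled by the same positive factor.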

​	We summarize all the components that have been modified to help you better understand the improvements we have made:

|               | **Previous method**                                          | **New**                                                      |
| ------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| Framework     | ![mini view1](https://java-dgx.oss-cn-beijing.aliyuncs.com/20251203101853994.png) | ![mini view2 2](https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204044014457.png) |
| Training      | Trained  independently                                       | Trained  jointly                                             |
| Input         | Classifier: 2D keypoints (*N*, *J*, 2) <br />Pose estimator: augmented keypoints (*N*, *J*, 3) | 2D keypoints (*N*, *J*, 2)                                   |
| Output        | Classifier: condition (*N*, 1) <br />Euler angles: (*N*, *J*, 2) | Sine–cosine of Euler angles: (*N*, *J*, 6)<br />Orientation condition scalar: (*N*, *J*, 1) |
| Loss Function | MPJAE and BCE computed independently                         | MPJASE and BCE computed jointly                              |

Corresponding modifications have been made in the manuscript, **page 4, lines 170-208**.



**Q2. The authors did not give a clear description about the orientation scalar network and how to guarantee its estimation quality.** 

**A2.** Thank you for pointing out this limitation. In response to your question **Q1**, we have improved our method and provide here a more detailed explanation of our orientation-scalar module in the previous and current methods, including how it contributes to ensuring reliable pose estimation.

In the revised approach, we still employ a ResNet-based network as the orientation-scalar network. However, this network no longer produces an independent output; instead, it is integrated into a unified framework together with the subsequent pose-prediction module.

As shown in Figure 3, we modified the input and output of the standard ResNet as follows:

![resnet compare](https://java-dgx.oss-cn-beijing.aliyuncs.com/20251203001416542.png)

​	**Figure 3. The ResNet in our networks**	

(1)  **At the input layer**, the original ResNet classifier accepts image inputs of size 𝐵×𝑊×𝐻×3, where 𝐵 denotes the batch size, 𝑊 and 𝐻 are the image width and height, and 3 corresponds to the RGB channels. In contrast, our 2D keypoint input has the shape 𝐵×𝑁×𝐽×2, where 𝐵 is the batch size, 𝑁 is the number of frames, 𝐽 is the number of joints, and 2 is the number of coordinates per joint.

​	(2)  **At the output stage**, we modify the dimensionality of the prediction head. In the original ResNet used for binary classification, the network outputs predictions of size 𝐵×1. In our revised architecture, we redesign the output head to predict a conditional feature of size 𝐵×𝑁×1.

The corresponding description has been added to the manuscript (Page.

In the improved classification network, we can ensure the estimation quality for the following reasons:

​	(1)  **Unified training with the pose estimation network**. In the original approach, the external classifier could not achieve 100% accuracy, inevitably introducing errors into the subsequent pose estimation. In our revised version, the classification network is trained jointly with the pose estimation network within a single unified framework. This design preserves the conditional information for estimation while avoiding the introduction of errors from an imperfect standalone classifier.

(2)  **Modification of the estimation target**. In the original approach, directly estimating Euler angles can lead to large numerical errors near the ±180° boundary. For example, if the ground truth is -179° and the predicted value is 170°, the original MPJAE loss computes an error of 349°, whereas the actual periodic angular difference is only 11°. In our improved method, we no longer predict Euler angles directly. Instead, we predict the (sine, cosine) pair of each Euler angle component and recover the angle with the arctangent function. This changes the output dimension from 3 to 6 and maps each Euler angle onto the unit circle in a 2D plane. In this representation, using the same example, the ground truth and the prediction correspond to the (sine, cosine) coordinates (-0.01745241, -0.9998477) and (0.17364818, -0.98480775), respectively, which lie close together on the circle. Computing the error in this space therefore gives a more reasonable and physically meaningful measure.
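
The -179° versus 170° example can be checked numerically with a few lines of Python. This is only an illustrative sketch: the helper names are hypothetical, and the Euclidean distance between the two (sine, cosine) points is used here as one possible error in that space, not necessarily the exact MPJASE definition.

```python
import math


def naive_error(gt_deg, pred_deg):
    """Absolute difference of the raw angle values, ignoring periodicity."""
    return abs(gt_deg - pred_deg)


def periodic_error(gt_deg, pred_deg):
    """Smallest absolute angular difference, in [0, 180] degrees."""
    diff = (pred_deg - gt_deg + 180.0) % 360.0 - 180.0
    return abs(diff)


def to_sin_cos(angle_deg):
    """Map an angle to its (sine, cosine) point on the unit circle."""
    rad = math.radians(angle_deg)
    return math.sin(rad), math.cos(rad)


def sin_cos_distance(gt_deg, pred_deg):
    """Euclidean distance between the two (sine, cosine) points (assumed metric)."""
    (s1, c1), (s2, c2) = to_sin_cos(gt_deg), to_sin_cos(pred_deg)
    return math.hypot(s1 - s2, c1 - c2)


gt, pred = -179.0, 170.0
print(naive_error(gt, pred))      # 349.0
print(periodic_error(gt, pred))   # 11.0
print(to_sin_cos(gt), to_sin_cos(pred))
print(sin_cos_distance(gt, pred))
```

The distance in (sine, cosine) space stays small for this pair because both points sit near the same side of the unit circle, which is the behavior the periodic difference of 11° reflects.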

Corresponding modifications have been made in the manuscript, **page 5, lines 243-246**.



**Q3. It is not clear what softmax classifier means.**

**A3.** Thank you for pointing this out. We had overlooked providing a detailed description of the classifier used in our experiments. 

​	The Softmax classifier mentioned in the manuscript refers specifically to a lightweight feedforward network that we designed to validate the generality of our method. Its architecture is as follows: the first layer maps the 34-dimensional input feature (corresponding to the 2D coordinates of 17 joints) to 128 dimensions; the second layer reduces it to 64 dimensions; the third layer outputs 2-dimensional logits, which are then passed through a Softmax function.
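
For concreteness, the layer sizes described above (34 → 128 → 64 → 2, followed by Softmax) can be sketched in plain Python. This is a hedged illustration rather than our actual code: the ReLU activations between layers, the random initialization, and all function names are assumptions, since the description above fixes only the layer dimensions.

```python
import math
import random


def linear(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (initialization scheme is an assumption)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_classifier(seed=0):
    rng = random.Random(seed)
    return [init_layer(34, 128, rng), init_layer(128, 64, rng), init_layer(64, 2, rng)]


def classify(keypoints, layers):
    """keypoints: 17 (x, y) pairs, flattened to the 34-dimensional input."""
    x = [coord for joint in keypoints for coord in joint]
    if len(x) != 34:
        raise ValueError("expected 17 joints with 2 coordinates each")
    for i, (weights, bias) in enumerate(layers):
        x = linear(x, weights, bias)
        if i < len(layers) - 1:
            x = relu(x)  # assumed hidden activation
    return softmax(x)


probs = classify([(0.1 * j, 0.2 * j) for j in range(17)], build_classifier())
print(probs)
```

The two output values are class probabilities that sum to one; with untrained random weights they carry no meaning beyond demonstrating the shapes.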

In the revised method, the overall approach has changed and this classifier is no longer needed; we employ only the ResNet-based network. Experiments involving this classifier have therefore been removed from the revised manuscript.



**Q4. It is not clear what dataset the classifiers such as ResNet18 in Table 1 are trained on and tested on.**

**A4.** Thank you for pointing out this issue. We had neglected to provide details about the orientation data used for the classifier. 

​	The ground truth orientation data for the classifier is a 1-dimensional binary vector (0 or 1). For each frame, we assign a value of 1 if the rotation of the root joint in the 3D human pose is greater than 0°, and 0 if it is less than 0°.
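
The labeling rule above amounts to a sign test on the root-joint rotation. A minimal sketch follows; the function names are hypothetical, and mapping an angle of exactly 0° to label 0 is an assumption, since the rule above only covers angles greater than or less than 0°.

```python
def orientation_label(root_angle_deg):
    """Return 1 for a positive root-joint rotation, 0 otherwise.

    Assumption: an angle of exactly 0 degrees is mapped to 0.
    """
    return 1 if root_angle_deg > 0.0 else 0


def label_sequence(root_angles_deg):
    """Label every frame of a sequence of root-joint angles (degrees)."""
    return [orientation_label(a) for a in root_angles_deg]


print(label_sequence([35.0, -10.0, 179.0, -179.0]))  # [1, 0, 1, 0]
```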

​	In the revised manuscript, we add this information in the experimental setup section (**Page 5, lines 243-253**) to provide a clearer description of the orientation dataset.



**Q5. Since the experimental settings are not clearly defined, the experimental results are not convincing.**

**A5.** Thank you for pointing this out. The original manuscript did not clearly define the experimental setup and included only some training parameters in the appendix, which caused confusion. Here, we provide a detailed description of the experimental setup for our improved method.

In the improved version of our method, we select several representative 3D human pose estimation approaches as baselines, including VideoPose3D, PoseFormer, PoseFormerV2, and SemGCN. To ensure a fair comparison, all methods follow the same preprocessing steps, use identical input–output formats, and are optimized with the same loss function (MPJASE). All models are trained on the same training set and evaluated on the same test set.

**1. Experimental Setup**

(1) **Euler angle dataset**: We use a self-collected dataset consisting of 117,325 training frames and 32,219 testing frames. The actions are categorized into several major classes: *Walk*, *Sit*, *Run*, *Jump*, *Squat*, *Torso*, *Arm*, *Leg*, and *Sports*. The number of frames for each category is summarized in Table 1. The human body in this dataset consists of 17 joints, with each joint represented by a single Euler angle following the XYZ rotation order.

**Table 1. Number of frames per action category in the training and test sets.**

|              | Walk  | Sit   | Run  | Jump | Squat | Torso | Arm   | Leg   | Sports | Sum    |
| ------------ | ----- | ----- | ---- | ---- | ----- | ----- | ----- | ----- | ------ | ------ |
| Training set | 13993 | 17843 | 2168 | 5253 | 2613  | 11027 | 26031 | 12246 | 20151  | 117325 |
| Test set     | 1229  | 1963  | 1040 | 988  | 275   | 959   | 2429  | 1112  | 22924  | 32219  |

(2) **Orientation dataset**: The orientation scalar is binary (0 or 1), with one value per frame. The value is determined by the sign of the projected angle of the root joint’s Euler angle on the 2D horizontal plane: positive angles are assigned 1, and negative angles 0. The training and testing splits of the orientation data are identical to those of the Euler angle dataset, as summarized in Table 1 above.

(3) **2D keypoint data**: The 2D keypoints for all video frames are obtained by running inference with Detectron2. The procedure is described in the revised manuscript.

(4) **Training setup**: All methods follow the training parameters recommended by their respective authors. The specific training configurations are listed in Table 2.

​	**Table 2. Training configurations for all 3D human pose estimation methods.**

| Method               | Core Settings       | Batch Size | Epoch | Special Modules  |
| -------------------- | ------------------- | ---------- | ----- | ---------------- |
| Pavllo et al. (2019) | arc = 3, 3, 3, 3, 3 | 1024       | 80    | -                |
| Zheng et al. (2021)  | f = 27              | default    | 70    | -                |
| Zhao et al. (2023)   | f = 27              | default    | 70    | -                |
| Liu et al. (2020b)   | default             | default    | 40    | -                |
| Liu et al. (2020a)   | arc = 3, 3, 3       | 128        | 60    | -                |
| Zhao et al. (2019)   | default             | default    | 90    | non-local module |

These settings are given in the manuscript, **Pages 12-13, Lines 611-692**.



**2. Ablation Study**

For each method, we conducted comparisons under the following settings:

**A**: Euler angles are predicted by the pose estimation method alone, without any auxiliary information.

**B**: Euler angles are predicted using the output of an external classifier as the condition.

**C**: Euler angles and conditional information are predicted via the improved unified framework.

**D**: The improved unified framework predicts the values of sine and cosine of Euler angles along with the conditional information, and the final Euler angles are computed by the arctangent function.

**(1)  Qualitative results:** Our new method achieves clear improvements: it outperforms the original approaches overall and also correctly estimates ambiguous poses that were previously challenging. As illustrated in Figure 4, we compare the performance of different methods on two representative pose types:

| VideoPose3D  | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020303292.png" alt="VideoPose243" style="zoom: 27%;" /> | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020540440.png" alt="VideoPose225" style="zoom:35%;" /> |
| ------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| PoseFormer   | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020440930.png" alt="PoseFormer243" style="zoom:25%;" /> | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020549127.png" alt="PoseFormer225" style="zoom:33%;" /> |
| PoseFormerV2 | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020524021.png" alt="PsoeFormerV2243" style="zoom:33%;" /> | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020839320.png" alt="PoseFormerV2225" style="zoom:33%;" /> |
| SemGCN       | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020530823.png" alt="SemGCN243" style="zoom:30%;" /> | <img src="https://java-dgx.oss-cn-beijing.aliyuncs.com/20251204020845232.png" alt="SemGCN225" style="zoom:33%;" /> |
|              | Torso(push up)                                               | Sit                                                          |

**Figure 4. Comparison on *Sit* and *Torso* actions under ablation setting D.**

Figure 4 visualizes complex poses under setting D, with actions selected from the *Sit* and *Torso* categories, specifically a seated posture and a push-up posture. The results indicate that predicting Euler angles via their sine and cosine values, followed by recovery with the arctangent function, effectively resolves rotational discontinuities even in these challenging poses.

(2)   **Quantitative results**: Table 3 reports the MPJASE of the four methods under ablation settings A to D.

​	**Table 3. Quantitative comparison of MPJASE under ablation settings from A to D on our Euler angle Dataset.**

| Methods      | Setting | Walk  | Sit   | Run   | Jump  | Squat | Torso | Arm   | Leg   | Avg   |
| ------------ | ------- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| VideoPose3D  | A       | 5.73  | 8.00  | 5.99  | 6.50  | 5.46  | 6.47  | 6.00  | 7.26  | 6.43  |
|              | B       | 5.16  | 7.32  | 5.50  | 5.44  | 5.28  | 6.31  | 5.81  | 6.61  | 5.93  |
|              | C       | 5.25  | 7.87  | 5.87  | 5.84  | 5.41  | 6.45  | 5.85  | 6.56  | 6.14  |
|              | D       | 5.08  | 7.22  | 5.43  | 5.42  | 5.21  | 6.29  | 5.71  | 6.25  | 5.83  |
| PoseFormer   | A       | 6.90  | 8.49  | 7.77  | 7.05  | 5.81  | 8.07  | 6.31  | 8.61  | 7.38  |
|              | B       | 5.15  | 7.41  | 5.32  | 5.18  | 5.60  | 6.58  | 6.13  | 7.10  | 6.06  |
|              | C       | 5.66  | 7.42  | 7.45  | 6.88  | 5.65  | 6.51  | 6.22  | 7.51  | 6.66  |
|              | D       | 4.42  | 7.32  | 5.21  | 4.98  | 5.19  | 5.71  | 5.66  | 6.85  | 5.67  |
| PoseFormerV2 | A       | 9.19  | 11.44 | 9.72  | 8.23  | 6.77  | 9.01  | 7.52  | 9.65  | 8.94  |
|              | B       | 5.90  | 8.38  | 5.62  | 6.35  | 6.75  | 7.47  | 6.96  | 7.79  | 6.90  |
|              | C       | 6.82  | 9.22  | 6.35  | 7.51  | 6.90  | 9.21  | 6.81  | 8.12  | 7.62  |
|              | D       | 5.68  | 8.21  | 5.52  | 6.28  | 6.71  | 7.41  | 6.88  | 7.68  | 6.80  |
| SemGCN       | A       | 10.32 | 11.25 | 10.84 | 11.13 | 10.37 | 11.34 | 10.27 | 10.96 | 10.81 |
|              | B       | 8.59  | 10.77 | 8.69  | 9.20  | 9.20  | 9.90  | 8.58  | 9.78  | 9.34  |
|              | C       | 9.51  | 11.58 | 9.51  | 9.90  | 10.15 | 9.54  | 9.33  | 9.91  | 9.93  |
|              | D       | 8.42  | 10.41 | 7.84  | 7.13  | 9.17  | 8.71  | 8.24  | 8.96  | 8.61  |

The ablation study has been added to the manuscript, **Pages 8-9, Lines 421-487**.

