Auxiliary Cross-Modal Representation Learning with Triplet Loss Functions for Online Handwriting Recognition
Abstract: Cross-modal representation learning learns a shared embedding between two or more modalities to improve performance in a given task compared to using only one of the modalities. Cross-modal representation learning from different data types -- such as images and time-series data (e.g., audio or text data) -- requires a deep metric learning loss that minimizes the distance between the modality embeddings. In this paper, we propose to use the triplet loss, which uses positive and negative identities to create sample pairs with different labels, for cross-modal representation learning between image and time-series modalities (CMR-IS). By adapting the triplet loss for cross-modal representation learning, higher accuracy in the main (time-series classification) task can be achieved by exploiting additional information from the auxiliary (image classification) task. Our experiments on synthetic data and handwriting recognition data from sensor-enhanced pens show improved classification accuracy, faster convergence, and better generalizability.
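To illustrate the core idea, the following is a minimal NumPy sketch of a cross-modal triplet loss, where the anchor embedding comes from the time-series modality and the positive/negative embeddings come from the image modality with matching/non-matching class labels. All function and variable names are illustrative assumptions and do not reflect the paper's actual implementation.

```python
import numpy as np

def cross_modal_triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss across modalities: pull the anchor (time-series embedding)
    towards the positive (image embedding, same label) and push it away from
    the negative (image embedding, different label) by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive, axis=-1)  # anchor-positive distance
    d_neg = np.linalg.norm(anchor - negative, axis=-1)  # anchor-negative distance
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()

# Toy example: 4 samples with 128-dimensional embeddings from each modality.
rng = np.random.default_rng(0)
ts_emb  = rng.normal(size=(4, 128))   # time-series (anchor) embeddings
img_pos = rng.normal(size=(4, 128))   # image embeddings with the same labels
img_neg = rng.normal(size=(4, 128))   # image embeddings with different labels
print(cross_modal_triplet_loss(ts_emb, img_pos, img_neg))
```

In practice, this loss would be combined with the main time-series classification loss, so the auxiliary image modality shapes the shared embedding during training.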
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=QDQOZnESCG
Changes Since Last Submission: This paper is a resubmission of our manuscript submitted on 30 Apr 2022.
Submission number: 68
Paper link: https://openreview.net/forum?id=QDQOZnESCG
This paper was rejected, and the reviewers suggested a resubmission with a revised paper (i.e., more focus on prior work on cross-modal metric learning using triplet losses). Again, we would like to thank the reviewers and the action editor for their comments on the first submitted version, and for taking the time to point out ways to improve our manuscript. In the following, we provide a point-by-point response to all reviewer comments and describe the changes in detail.
1. All reviewers gave the feedback that the presentation of the paper was not clear. We revised the writing of the manuscript to make the contributions of our paper clearer.
2. Reviewer 6y6a and reviewer cbhN suggested focusing more on the task of online HWR and its challenges. We rephrased the paper accordingly and focused on the peculiarities and challenges of the application (i.e., we added a subsection on offline and online HWR to the related work chapter).
3. Reviewer 6y6a and reviewer cbhN mentioned that the paper was lacking important related work. We added the suggested papers on cross-modal learning and the triplet loss to the related work chapter.
4. For a better overview of related work, we added a table with state-of-the-art cross-modal, pairwise and triplet learning techniques.
5. As suggested by reviewer 5uxE and reviewer 6y6a, we separated the related work section into deep metric learning and continual learning.
6. We clarified weaknesses W1 and W2 raised by reviewer 6y6a.
7. As suggested by reviewer cbhN, we cited "SphereFace: Deep Hypersphere Embedding for Face Recognition" (CVPR) for the choice of the cross-entropy loss as a deep metric learning loss.
8. We use commonly used variable names (i.e., y for class labels instead of v). We clarified the dimension q × t. We now use the variables h × w for image height and width.
9. We stated a reason for using the CNN+BiLSTM model as the baseline classifier over the InceptionTime model (partly better results), as noted by reviewer cbhN.
10. Suggestion by reviewer cbhN: We named Section 3 "Methodological Background", as it introduces cross-modal representation learning and pairwise learning, and named Section 4 "Methodology", as it introduces our time-series classification and online handwriting recognition methods. We think this structure is most suitable. Furthermore, the methods and datasets in Section 4.2 are required to understand the offline/online cross-modal learning task for OnHW recognition; hence, we omit a separate experiments chapter.
11. We included a table with the exact results for the sinusoidal dataset, but we think the training accuracy curves remain important, as different DML losses lead to different convergence rates.
12. We added more details on the sub-modules of the ScrabbleGAN method. More details on GTR are included in the appendix.
13. As suggested by reviewer 6y6a, we clarified the word "representation" with respect to the model.
14. We ran further experiments with a contrastive learning technique on the online HWR task and added the results to the experiments section. We indicate improvements over the baseline with arrows for better readability.
15. We integrated results for offline handwriting recognition from the appendix into the experiments chapter of the main text. Furthermore, we included results for transfer learning to left-handed writers.
16. As suggested by reviewer 5uxE and reviewer 6y6a, we reduced the number of acronyms, i.e., we replaced the uncommon acronym for common representation learning with cross-modal representation learning and GAF with Gramian angular field.
17. Reviewer 5uxE recommended reducing the level of detail in the architecture overview figure. We omitted network architecture details, added a legend for better readability instead, and describe the details in the main text.
18. We updated the citation style and use \citep for better readability.
19. References: We double-checked all references and updated them to their latest versions. We added the DOI for all references.
Assigned Action Editor: ~Yale_Song1
Submission Number: 487