Enhancing Cross-Language Code Translation via Task-Specific Embedding Alignment in Retrieval-Augmented Generation

Published: 10 Oct 2024 · Last Modified: 19 Nov 2024 · AFM 2024 Poster · CC BY 4.0
Keywords: Large language model, RAG, Code translation, Fortran, CPP, Contrastive learning, RAG Alignment
TL;DR: We improve Fortran-to-C++ code translation by aligning embeddings within a Retrieval-Augmented Generation framework, significantly enhancing translation quality without fine-tuning the language model.
Abstract: We propose a method to improve Fortran-to-C++ code translation by aligning task-specific embeddings within a Retrieval-Augmented Generation (RAG) framework. Unlike traditional retrieval approaches using generic embeddings, we align the retrieval model directly with the goal of maximizing translation quality as measured by the CodeBLEU metric, ensuring that embeddings are semantically and syntactically meaningful for this task. Utilizing 25,000 Fortran code snippets from the Stack-V2 dataset and their C++ translations generated by llama3.1-8b, we compute pairwise CodeBLEU scores to capture fine-grained similarities. These scores serve as supervision in a contrastive learning framework to optimize the embedding model for retrieving the most beneficial Fortran-C++ pairs. Integrating these CodeBLEU-optimized embeddings into the RAG framework significantly enhances both retrieval accuracy and code generation quality. Without fine-tuning the language model, our method improves the average CodeBLEU score from 0.64 to 0.73 (a 14% improvement) on the HPC Fortran2C++ dataset and from 0.52 to 0.60 (a 15% improvement) on the Numerical Recipes dataset, demonstrating the effectiveness and practicality of our approach.
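As a rough illustration of the training recipe the abstract describes, below is a minimal PyTorch sketch of contrastive alignment using pairwise CodeBLEU scores as soft supervision. The abstract does not fix the exact objective, so the KL-based soft-label loss, the toy encoder, and all hyperparameters here are assumptions for illustration; in practice the encoder would be a pretrained code embedding model and the CodeBLEU matrix would be precomputed over the 25,000 Fortran-C++ pairs.

# Minimal sketch (not the authors' released code): align an embedding
# model so that in-batch similarity structure matches pairwise
# CodeBLEU scores. Loss form and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    # Stand-in for the pretrained code embedding model being aligned;
    # in practice this would be a transformer encoder over code tokens.
    def __init__(self, vocab_size=5000, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab_size, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # Unit-normalize so dot products below are cosine similarities.
        return F.normalize(self.net(x), dim=-1)

def codebleu_contrastive_loss(emb, codebleu, temperature=0.07):
    # Soft-label contrastive objective: match the softmax over in-batch
    # embedding similarities to the softmax over pairwise CodeBLEU
    # scores, so snippets whose translations score alike embed alike.
    sims = emb @ emb.t() / temperature            # (B, B) similarities
    log_p = F.log_softmax(sims, dim=-1)
    target = F.softmax(codebleu / temperature, dim=-1)
    return F.kl_div(log_p, target, reduction="batchmean")

encoder = ToyEncoder()
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-4)

# Dummy batch: bag-of-token features for 16 Fortran snippets and a
# random 16x16 matrix standing in for precomputed CodeBLEU scores.
features = torch.rand(16, 5000)
codebleu = torch.rand(16, 16)

for step in range(10):
    loss = codebleu_contrastive_loss(encoder(features), codebleu)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

At retrieval time, the aligned encoder would embed an incoming Fortran snippet and fetch the nearest stored Fortran-C++ pairs as in-context examples for the frozen language model, which is how the abstract's gains are obtained without fine-tuning the generator.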
Submission Number: 56