Abstract: With the rapid development of Large Language Models (LLMs), aligning these models with human preferences and values is critical to ensuring ethical and safe applications. However, existing alignment techniques such as RLHF or DPO typically require direct fine-tuning of LLMs with billions of parameters, resulting in substantial computational costs and inefficiencies. To address this, we propose the Micro token-level Accept-Reject Aligning (MARA) approach, designed to operate independently of the language models. MARA simplifies the alignment process by decomposing sentence-level preference learning into token-level binary classification, where a compact three-layer fully-connected network determines whether candidate tokens are "Accepted" or "Rejected" as part of the response. Extensive experiments across seven different LLMs and three open-source datasets show that MARA achieves significant improvements in alignment performance while reducing computational costs. The source code and implementation details are publicly available at https://github.com/IAAR-Shanghai/MARA, and the trained models are released at https://huggingface.co/IAAR-Shanghai/MARA_AGENTS.
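To make the token-level binary classification concrete, the following is a minimal PyTorch sketch, not the authors' released implementation. It assumes the classifier's input is the concatenation of the LLM's current decoding hidden state and a candidate token's embedding; the class name, dimensions, and the 0.5 acceptance threshold are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch (not the authors' code): a compact three-layer
# fully-connected network that scores a candidate token as
# "Accepted" (prob > threshold) or "Rejected" (otherwise).
import torch
import torch.nn as nn

class TokenAcceptRejectClassifier(nn.Module):
    def __init__(self, input_dim: int = 8192, hidden_dim: int = 1024):
        super().__init__()
        # Three fully-connected layers ending in a single accept/reject logit.
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, context_state: torch.Tensor,
                candidate_embedding: torch.Tensor) -> torch.Tensor:
        # Concatenate the decoding context with the candidate token's
        # embedding; return the probability of accepting the token.
        features = torch.cat([context_state, candidate_embedding], dim=-1)
        return torch.sigmoid(self.net(features)).squeeze(-1)

# Hypothetical usage: at each decoding step, a candidate token proposed
# by the frozen LLM is kept only if the classifier accepts it.
clf = TokenAcceptRejectClassifier()
context = torch.randn(1, 4096)    # assumed LLM hidden-state size
candidate = torch.randn(1, 4096)  # assumed token-embedding size
accepted = clf(context, candidate) > 0.5
```

Because only this small classifier is trained while the LLM itself stays frozen, the sketch reflects how MARA can avoid fine-tuning billions of parameters.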