What Limits Bidirectional Model's Generative Capabilities? A Uni-Bi-Directional Mixture-of-Expert Method For Bidirectional Fine-tuning

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We demonstrate that bidirectional training increases subsequent dependence, and propose a Uni-Bi-directional Mixture-of-Experts Large Language Model.
Abstract: Large Language Models (LLMs) excel in generation tasks, yet their causal attention mechanisms limit performance in embedding tasks. While bidirectional modeling may enhance embeddings, naively fine-tuning unidirectional models bidirectionally severely degrades generative performance. To investigate this trade-off, we analyze attention weights as dependence indicators and find that bidirectional fine-tuning increases subsequent dependence, impairing unidirectional generation. Through systematic Transformer module evaluations, we discover the FFN layer is least affected by such dependence. Leveraging this discovery, we propose UBMoE-LLM, a novel Uni-Bi-directional Mixture-of-Experts LLM, which integrates the original unidirectional FFN with a bidirectionally fine-tuned FFN via unsupervised contrastive learning. This MoE-based approach enhances embedding performance while preserving robust generation. Extensive experiments across diverse datasets and model scales validate our attention dependence metric and demonstrate UBMoE-LLM’s superior generative quality and reduced hallucination. Code is available at: https://github.com/heiyonghua/ubmoe_llm.
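The abstract names two concrete components: an attention-based "subsequent dependence" indicator and an FFN-level mixture of the original unidirectional expert with a bidirectionally fine-tuned one. The sketch below illustrates both ideas in PyTorch under stated assumptions; the names (`subsequent_dependence`, `UniBiFFNMoE`), the soft two-way router, and all shapes are hypothetical illustrations, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch of the two ideas described in the abstract. All specifics
# (router design, normalization, expert wiring) are assumptions for illustration.
import torch
import torch.nn as nn


def subsequent_dependence(attn: torch.Tensor) -> torch.Tensor:
    """Fraction of attention mass placed on future (subsequent) key positions.

    attn: (batch, heads, seq, seq) attention weights from a bidirectional pass,
    each query row summing to 1. A strictly causal model would score exactly 0.
    """
    seq = attn.size(-1)
    # Strictly upper-triangular mask selects key positions after each query.
    future = torch.triu(
        torch.ones(seq, seq, dtype=torch.bool, device=attn.device), diagonal=1
    )
    # Per-query mass on future keys, averaged over batch, heads, and positions.
    return attn.masked_fill(~future, 0.0).sum(dim=-1).mean()


class UniBiFFNMoE(nn.Module):
    """Mixes an original unidirectional FFN with a bidirectionally fine-tuned FFN.

    A lightweight per-token router weights the two experts; this mirrors the
    high-level idea in the abstract, not the exact UBMoE-LLM method.
    """

    def __init__(self, uni_ffn: nn.Module, bi_ffn: nn.Module, hidden_size: int):
        super().__init__()
        self.uni_ffn = uni_ffn    # generation-preserving expert (original weights)
        self.bi_ffn = bi_ffn      # embedding-oriented expert (bidirectionally tuned)
        self.router = nn.Linear(hidden_size, 2)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        gate = torch.softmax(self.router(hidden_states), dim=-1)  # (..., 2)
        out_uni = self.uni_ffn(hidden_states)
        out_bi = self.bi_ffn(hidden_states)
        return gate[..., 0:1] * out_uni + gate[..., 1:2] * out_bi


def make_ffn(d: int) -> nn.Module:
    """Toy stand-in for a Transformer FFN block."""
    return nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))


if __name__ == "__main__":
    d = 64
    moe = UniBiFFNMoE(make_ffn(d), make_ffn(d), hidden_size=d)
    x = torch.randn(2, 10, d)
    print(moe(x).shape)                     # torch.Size([2, 10, 64])

    attn = torch.softmax(torch.randn(2, 8, 10, 10), dim=-1)
    print(subsequent_dependence(attn))      # ~0.45 for random bidirectional attention
```

In this toy setup, routing is a soft mixture of both experts rather than hard top-k selection; how the real model gates between the unidirectional and bidirectional FFNs, and how the unsupervised contrastive objective trains it, is specified only in the paper and code.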
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Model, Bidirectional Modeling, Causal Attention
Submission Number: 15290