[
  {
    "group_id": "pwNIOcr8fU",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "GySIAKEwtZ",
      "title": "Geometry of Long-Tailed Representation Learning: Rebalancing Features for Skewed Distributions",
      "abstract": "Deep learning has achieved significant success by training on balanced datasets. However, real-world data often exhibit long-tailed distributions. Empirical studies have revealed that long-tailed data skew data representations, where head classes dominate the feature space. Many methods have been proposed to empirically rectify the skewed representations. However, a clear understanding of the underlying cause and extent of this skew remains lacking. In this study, we provide a comprehensive theoretical analysis to elucidate how long-tailed data affect feature distributions, deriving the conditions under which centers of tail classes shrink together or even collapse into a single point. This results in overlapping feature distributions of tail classes, making features in the overlapping regions inseparable. Moreover, we demonstrate that merely empirically correcting the skewed representations of the training data is insufficient to separate the overlapping features due to distribution shifts between the training and real data. To address these challenges, we propose a novel long-tailed representation learning method, FeatRecon. It reconstructs the feature space in order to arrange features from different classes into symmetricial and linearly separable regions. This, in turn, enhances the model’s robustness to long-tailed data. We validate the effectiveness of our method through extensive experiments on the CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, and iNaturalist 2018 datasets.",
      "keywords": [
        "contrastive learning",
        "representation learning",
        "long-tail recgonition",
        "theory",
        "neural collapse"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "pwNIOcr8fU",
      "title": "Towards Syn-to-Real IQA: A Novel Perspective on Reshaping Synthetic Data Distributions",
      "abstract": "Blind Image Quality Assessment (BIQA) has advanced significantly through deep learning, but the scarcity of large-scale labeled datasets remains a challenge. While synthetic data offers a promising solution, models trained on existing synthetic datasets often show limited generalization ability. In this work, we make a key observation that representations learned from synthetic datasets often exhibit a discrete and clustered pattern that hinders regression performance: features of high-quality images cluster around reference images, while those of low-quality images cluster based on distortion types. Our analysis reveals that this issue stems from the distribution of synthetic data rather than model architecture. Consequently, we introduce a novel framework SynDR-IQA, which reshapes synthetic data distribution to enhance BIQA generalization. Based on theoretical derivations of sample diversity and redundancy's impact on generalization error, SynDR-IQA employs two strategies: distribution-aware diverse content upsampling, which enhances visual diversity while preserving content distribution, and density-aware redundant cluster downsampling, which balances samples by reducing the density of densely clustered areas. Extensive experiments across three cross-dataset settings (synthetic-to-authentic, synthetic-to-algorithmic, and synthetic-to-synthetic) demonstrate the effectiveness of our method. Additionally, as a data-based approach, SynDR-IQA can be coupled with model-based methods without increasing inference costs. The source code will be publicly available.",
      "keywords": [
        "Blind Image Quality Assessment; Data Distribution Reshaping; Synthetic Data"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "iQoZv77o3g",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "RF3miSqdXa",
      "title": "On Linear Mode Connectivity of Mixture-of-Experts Architectures",
      "abstract": "Linear Mode Connectivity (LMC) is a notable phenomenon in the loss landscapes\nof neural networks, wherein independently trained models have been observed to\nbe connected—up to permutation symmetries—by linear paths in parameter space\nalong which the loss remains consistently low. This observation challenges classical\nviews of non-convex optimization and has implications for model ensembling,\ngeneralization, and our understanding of neural loss geometry. Inspired by recent\nstudies on LMC in standard neural networks, we systematically investigate this\nphenomenon within Mixture-of-Experts (MoE) architectures—a class of models\nknown for their scalability and computational efficiency, which combine traditional\nneural networks—referred to as experts—through a learnable gating mechanism.\nWe begin by conducting a comprehensive analysis of both dense and sparse gating\nregimes, demonstrating that the symmetries inherent to MoE architectures are\nfully characterized by permutations acting on both the expert components and the\ngating function. Building on these foundational findings, we propose a matching\nalgorithm that enables alignment between independently trained MoEs, thereby\nfacilitating the discovery of LMC. Finally, we empirically validate the presence of\nLMC using our proposed algorithm across diverse MoE configurations—including\ndense, sparse, and shared-expert variants—under a wide range of model settings\nand datasets of varying scales and modalities. Our results confirm the existence\nof LMC in MoE architectures and offer fundamental insights into the functional\nlandscape and optimization dynamics of deep learning models.",
      "keywords": [
        "linear mode connectivity",
        "mixture-of-experts"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "iQoZv77o3g",
      "title": "Predicting Functional Brain Connectivity with Context-Aware Deep Neural Networks",
      "abstract": "Spatial location and molecular interactions have long been linked to the connectivity patterns of neural circuits. Yet, at the macroscale of human brain networks, the interplay between spatial position, gene expression, and connectivity remains incompletely understood. Recent efforts to map the human transcriptome and connectome have yielded spatially resolved brain atlases, however modeling the relationship between high-dimensional transcriptomic data and connectivity while accounting for inherent spatial confounds presents a significant challenge. In this paper, we present the first deep learning approaches for predicting whole-brain functional connectivity from gene expression and regional spatial coordinates, including our proposed Spatiomolecular Transformer (SMT). SMT explicitly models biological context by tokenizing genes based on their transcription start site (TSS) order to capture multi-scale genomic organization, and incorporating regional 3D spatial location via a dedicated context [CLS] token within its multi-head self-attention mechanism. We rigorously benchmark context-aware neural networks, including SMT and a single-gene resolution Multilayer-Perceptron (MLP), to established rules-based and bilinear methods. Crucially, to ensure that learned relationships in any model are not mere artifacts of spatial proximity, we introduce novel  spatiomolecular null maps, preserving both spatial and transcriptomic autocorrelation. Context-aware neural networks outperform linear methods, significantly exceed our stringent null shuffle models, and generalize across diverse connectomic datasets and parcellation resolutions. Together, these findings demonstrate a strong, predictable link between the spatial distributions of gene expression and functional brain network architecture, and establish a rigorously validated deep learning framework for decoding this relationship. Code to reproduce our results is available at: github.com/neuroinfolab/GeneEx2Conn.",
      "keywords": [
        "neuroscience",
        "fMRI",
        "connectomics",
        "transcriptomics",
        "attention"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "uUWb5eawL9",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "QJtanJS4T9",
      "title": "Irreducible Loss Floors in Gradient Descent Convergence and Energy Footprint",
      "abstract": "Despite their central role, convergence analyses of the dynamics of loss functions\nduring training require strong assumptions (e.g convexity and smoothness) which\nare non-trivial to prove. In this work, we introduce a framework for deriving\nnecessary convergence conditions that hold without restrictive assumptions on\nthe dataset or the model architecture. By linking microscopic properties such as\nindividual sample losses and their gradient to macroscopic training dynamics, we\nderive tight lower bounds for loss functions, applicable to both full-batch and mini-\nbatch gradient systems. These bounds reveal the presence of irreducible floors\nthat optimizers cannot surpass and beyond theoretical guarantees, this framework offers a practical tool for anticipating convergence speed, and estimating\nminimum training time and energy requirements. Thus, this framework can be\nused to ensure the sustainability and feasibility of large-scale training regimes.",
      "keywords": [
        "gradient descent",
        "convergence",
        "loss bounds",
        "optimization",
        "training dynamics",
        "sustainability",
        "efficiency",
        "feasibility",
        "computational cost",
        "irreducible loss",
        "non-convex optimization",
        "lower bounds"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "uUWb5eawL9",
      "title": "Noise Injection Reveals Hidden Capabilities of Sandbagging Language Models",
      "abstract": "Capability evaluations play a crucial role in assessing and regulating frontier AI systems. The effectiveness of these evaluations faces a significant challenge: strategic underperformance, or ``sandbagging'', where models deliberately underperform during evaluation. \nSandbagging can manifest either through explicit developer intervention or through unintended model behavior, presenting a fundamental obstacle to accurate capability assessment. We introduce a novel sandbagging detection method based on injecting noise of varying magnitudes into model weights. While non-sandbagging models show predictable performance degradation with increasing noise, we demonstrate that sandbagging models exhibit anomalous performance improvements, likely due to disruption of underperformance mechanisms while core capabilities remain partially intact. Through experiments across various model architectures, sizes, and sandbagging techniques, we establish this distinctive response pattern as a reliable, model-agnostic signal for detecting sandbagging behavior. Importantly, we find noise-injection is capable of eliciting the full performance of Mistral Large 120B in a setting where the model underperforms without being instructed to do so. Our findings provide a practical tool for AI evaluation and oversight, addressing a challenge in ensuring accurate capability assessment of frontier AI systems.",
      "keywords": [
        "Large Language Models",
        "Sandbagging",
        "Noise Injection",
        "Deception Detection"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "9GsgCUJtic",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "9GsgCUJtic",
      "title": "When do GFlowNets learn the right distribution?",
      "abstract": "Generative Flow Networks (GFlowNets) are an emerging class of sampling methods for distributions over discrete and compositional objects, e.g., graphs. In spite of their remarkable success in problems such as drug discovery and phylogenetic inference, the question of when and whether GFlowNets learn to sample from the target distribution remains underexplored. To tackle this issue, we first assess the extent to which a violation of the detailed balance of the underlying flow network might hamper the correctness of GFlowNet's sampling distribution. In particular, we demonstrate that the impact of an imbalanced edge on the model's accuracy is influenced by the total amount of flow passing through it and, as a consequence, is unevenly distributed across the network. We also argue that, depending on the parameterization, imbalance may be inevitable. In this regard, we consider the problem of sampling from distributions over graphs with GFlowNets parameterized by graph neural networks (GNNs) and show that the representation limits of GNNs delineate which distributions these GFlowNets can approximate. Lastly, we address these limitations by proposing a theoretically sound and computationally tractable metric for assessing GFlowNets, experimentally showing it is a better proxy for correctness than popular evaluation protocols.",
      "keywords": [
        "GFlowNets"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "Xj66fkrlTk",
      "title": "Optimizing Backward Policies in GFlowNets via Trajectory Likelihood Maximization",
      "abstract": "Generative Flow Networks (GFlowNets) are a family of generative models that learn to sample objects with probabilities proportional to a given reward function. The key concept behind GFlowNets is the use of two stochastic policies: a forward policy, which incrementally constructs compositional objects, and a backward policy, which sequentially deconstructs them. Recent results show a close relationship between GFlowNet training and entropy-regularized reinforcement learning (RL) problems with a particular reward design. However, this connection applies only in the setting of a fixed backward policy, which might be a significant limitation. As a remedy to this problem, we introduce a simple backward policy optimization algorithm that involves direct maximization of the value function in an entropy-regularized Markov Decision Process (MDP) over intermediate rewards. We provide an extensive experimental evaluation of the proposed approach across various benchmarks in combination with both RL and GFlowNet algorithms and demonstrate its faster convergence and mode discovery in complex environments.",
      "keywords": [
        "generative flow networks",
        "gflownets",
        "reinforcement learning",
        "sampling",
        "generative models"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "BOQpRtI4F5",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "BOQpRtI4F5",
      "title": "Towards Bridging Generalization and Expressivity of Graph Neural Networks",
      "abstract": "Expressivity and generalization are two critical aspects of graph neural networks (GNNs). While significant progress has been made in studying the expressivity of GNNs, much less is known about their generalization capabilities, particularly when dealing with the inherent complexity of graph-structured data.\nIn this work, we address the intricate relationship between expressivity and generalization in GNNs. Theoretical studies conjecture a trade-off between the two: highly expressive models risk overfitting, while those focused on generalization may sacrifice expressivity. However, empirical evidence often contradicts this assumption, with expressive GNNs frequently demonstrating strong generalization. We explore this contradiction by introducing a novel framework that connects GNN generalization to the variance in graph structures they can capture. This leads us to propose a $k$-variance margin-based generalization bound that characterizes the structural properties of graph embeddings in terms of their upper-bounded expressive power. Our analysis does not rely on specific GNN architectures, making it broadly applicable across GNN models. We further uncover a trade-off between intra-class concentration and inter-class separation, both of which are crucial for effective generalization. Through case studies and experiments on real-world datasets, we demonstrate that our theoretical findings align with empirical results, offering a deeper understanding of how expressivity can enhance GNN generalization.",
      "keywords": [
        "gnn",
        "expressivity",
        "generalization"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "sZQRUrvLn4",
      "title": "Graph Neural Networks Can (Often) Count Substructures",
      "abstract": "Message passing graph neural networks (GNNs) are known to have limited expressive power in their ability to distinguish some non-isomorphic graphs.\nBecause of this, it is well known that they are unable to detect or count arbitrary graph substructures (i.e., solving the subgraph isomorphism problem), a task that is of great importance for several types of graph-structured data. \nHowever, we observe that GNNs are in fact able to count graph patterns quite accurately across several real-world graph datasets.\nMotivated by this observation, we provide an analysis of the subgraph-counting capabilities of GNNs beyond the worst case, deriving several sufficient conditions for GNNs to be able to count subgraphs and, more importantly, to be able to sample-efficiently learn to count subgraphs. \nMoreover, we develop novel dynamic programming algorithms for solving the subgraph isomorphism problem on restricted classes of pattern and target graphs, and show that message-passing GNNs can efficiently simulate these dynamic programs. \nFinally, we empirically validate that our sufficient conditions for GNNs to count subgraphs hold on many real-world datasets, providing a theoretically-grounded explanation to our motivating observations.",
      "keywords": [
        "graph neural networks",
        "subgraphs",
        "expressivity"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "NltQraRnbW",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "LyJi5ugyJx",
      "title": "Simplifying, Stabilizing and Scaling Continuous-time Consistency Models",
      "abstract": "Consistency models (CMs) are a powerful class of diffusion-based generative models optimized for fast sampling. Most existing CMs are trained using discretized timesteps, which introduce additional hyperparameters and are prone to discretization errors. While continuous-time formulations can mitigate these issues, their success has been limited by training instability. To address this, we propose a simplified theoretical framework that unifies previous parameterizations of diffusion models and CMs, identifying the root causes of instability. Based on this analysis, we introduce key improvements in diffusion process parameterization, network architecture, and training objectives. These changes enable us to train continuous-time CMs at an unprecedented scale, reaching 1.5B parameters on ImageNet 512×512. Our proposed training algorithm, using only two sampling steps, achieves FID scores of 2.06 on CIFAR-10, 1.48 on ImageNet 64×64, and 1.88 on ImageNet 512×512, narrowing the gap in FID scores with the best existing diffusion models to within 10\\%.",
      "keywords": [
        "continuous-time consistency models",
        "diffusion models",
        "fast sampling"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "NltQraRnbW",
      "title": "Conditional Diffusion Models are Minimax-Optimal and Manifold-Adaptive for Conditional Distribution Estimation",
      "abstract": "We consider a class of conditional forward-backward diffusion models for conditional generative modeling, that is, generating new data given a covariate (or control variable). To formally study the theoretical properties of these conditional generative models, we adopt a statistical framework of distribution regression to characterize the large sample properties of the conditional distribution estimators induced by these conditional forward-backward diffusion models. Here, the conditional distribution of data is assumed to smoothly change over the covariate. In particular, our derived convergence rate is minimax-optimal under the total variation metric within the regimes covered by the existing literature. Additionally, we extend our theory by allowing both the data and the covariate variable to potentially admit a low-dimensional manifold structure. In this scenario, we demonstrate that the conditional forward-backward diffusion model can adapt to both manifold structures, meaning that the derived estimation error bound (under the Wasserstein metric) depends only on the intrinsic dimensionalities of the data and the covariate.",
      "keywords": [
        "conditional distribution estimation",
        "diffusion models",
        "distribution regression",
        "generative models",
        "manifold",
        "minimax rate"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "AYcKh0oT3h",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "AYcKh0oT3h",
      "title": "Online Convex Optimization with Heavy Tails: Old Algorithms, New Regrets, and Applications",
      "abstract": "In Online Convex Optimization (OCO), when the stochastic gradient has a finite variance, many algorithms provably work and guarantee a sublinear regret. However, limited results are known if the gradient estimate has a heavy tail, i.e., the stochastic gradient only admits a finite $\\mathsf{p}$-th central moment for some $\\mathsf{p}\\in\\left(1,2\\right]$. Motivated by it, this work examines different old algorithms for OCO (e.g., Online Gradient Descent) in the more challenging heavy-tailed setting. Under the standard bounded domain assumption, we establish new regrets for these classical methods without any algorithmic modification. Remarkably, these regret bounds are fully optimal in all parameters (can be achieved even without knowing $\\mathsf{p}$), suggesting that OCO with heavy tails can be solved effectively without any extra operation (e.g., gradient clipping). Our new results have several applications. A particularly interesting one is the first provable convergence result for nonsmooth nonconvex optimization under heavy-tailed noise without gradient clipping.",
      "keywords": [
        "Online Learning",
        "Online Convex Optimization",
        "Heavy Tails"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "YmbQ0qnQ76",
      "title": "$O(\\sqrt{T})$ Static Regret and Instance Dependent Constraint Violation for Constrained Online Convex Optimization",
      "abstract": "The constrained version of the standard online convex optimization (OCO) framework, called COCO is considered, where on every round, a convex cost function and a convex constraint function are revealed to the learner after it chooses the action for that round.\nThe objective is to simultaneously minimize the static regret and cumulative constraint violation (CCV). \nAn algorithm is proposed that guarantees a static regret of $O(\\sqrt{T})$ and a CCV of $\\min\\{{\\cal V}, O(\\sqrt{T}\\log T) \\}$, where ${\\cal V}$ depends on the distance between the consecutively revealed constraint sets, the shape of constraint sets, dimension of action space and the diameter of the action space. \nWhen constraint sets have additional structure, ${\\cal V}=O(1)$. Compared to the state of the art results, static regret of $O(\\sqrt{T})$ and CCV of $O(\\sqrt{T}\\log T)$, that were universal, the new result on CCV is instance dependent, which is derived by exploiting the geometric properties of the constraint sets.",
      "keywords": [
        "online convex optimization",
        "regret"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "pwNIOcr8fU",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "9bMZ29SPVx",
      "title": "A CLIP-Powered Framework for Robust and Generalizable Data Selection",
      "abstract": "Large-scale datasets have been pivotal to the advancements of deep learning models in recent years, but training on such large datasets inevitably incurs substantial storage and computational overhead. \nMeanwhile, real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.\nData selection has shown promise in identifying the most representative samples from the entire dataset, which aims to minimize the performance gap with reduced training costs. \nExisting works typically rely on single-modality information to assign importance scores for individual samples, which may lead to inaccurate assessments, especially when dealing with noisy or corrupted samples.\nTo address this limitation, we propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection. \nSpecifically, our framework consists of three key modules—dataset adaptation, sample scoring, and selection optimization—that together harness extensive pre-trained multimodal knowledge to comprehensively assess sample influence and optimize the selection results through multi-objective optimization. \nExtensive experiments demonstrate that our approach consistently outperforms existing state-of-the-art baselines on various benchmark datasets. Notably, our method effectively removes noisy or damaged samples from the dataset, enabling it to achieve even higher performance with less data. This indicates that it is not only a way to accelerate training but can also improve overall data quality.\n The implementation is available at https://github.com/Jackbrocp/clip-powered-data-selection.",
      "keywords": [
        "Data selection",
        "generalization",
        "multimodal"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "GySIAKEwtZ",
      "title": "Geometry of Long-Tailed Representation Learning: Rebalancing Features for Skewed Distributions",
      "abstract": "Deep learning has achieved significant success by training on balanced datasets. However, real-world data often exhibit long-tailed distributions. Empirical studies have revealed that long-tailed data skew data representations, where head classes dominate the feature space. Many methods have been proposed to empirically rectify the skewed representations. However, a clear understanding of the underlying cause and extent of this skew remains lacking. In this study, we provide a comprehensive theoretical analysis to elucidate how long-tailed data affect feature distributions, deriving the conditions under which centers of tail classes shrink together or even collapse into a single point. This results in overlapping feature distributions of tail classes, making features in the overlapping regions inseparable. Moreover, we demonstrate that merely empirically correcting the skewed representations of the training data is insufficient to separate the overlapping features due to distribution shifts between the training and real data. To address these challenges, we propose a novel long-tailed representation learning method, FeatRecon. It reconstructs the feature space in order to arrange features from different classes into symmetricial and linearly separable regions. This, in turn, enhances the model’s robustness to long-tailed data. We validate the effectiveness of our method through extensive experiments on the CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, and iNaturalist 2018 datasets.",
      "keywords": [
        "contrastive learning",
        "representation learning",
        "long-tail recgonition",
        "theory",
        "neural collapse"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "G10Y4vrhGF",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "CaSQgef484",
      "title": "Exploring Diffusion Transformer Designs via Grafting",
      "abstract": "Designing model architectures requires decisions such as selecting operators (e.g., attention, convolution) and configurations (e.g., depth, width). However, evaluating the impact of these decisions on model quality requires costly pretraining, limiting architectural investigation.\nInspired by how new software is built on existing code, we ask: can new architecture designs be studied using pretrained models? To this end, we present *grafting*, a simple approach for editing pretrained diffusion transformers (DiTs) to materialize new architectures under small compute budgets. Informed by our analysis of activation behavior and attention locality, we construct a testbed based on the DiT-XL/2 design to study the impact of grafting on model quality. Using this testbed, we develop a family of hybrid designs via grafting: replacing softmax attention with gated convolution, local attention, and linear attention, and replacing MLPs with variable expansion ratio and convolutional variants. Notably, many hybrid designs achieve good quality (FID: 2.38–2.64 vs. 2.27 for DiT-XL/2)\nusing $<2$% pretraining compute. We then graft a text-to-image model (PixArt-$\\Sigma$), achieving a 1.43$\\times$ speedup with less than a 2% drop in GenEval score. Finally, we present a case study that restructures DiT-XL/2 by converting every pair of sequential transformer blocks into parallel blocks via grafting. This reduces model depth by 2$\\times$ and yields better quality (FID: 2.77) than other models of comparable depth. Together, we show that new diffusion model designs can be explored by grafting pretrained DiTs, with edits ranging from operator replacement to architecture restructuring. Code and grafted models: https://grafting.stanford.edu.",
      "keywords": [
        "Diffusion Transformers",
        "Model Grafting",
        "Architectural Editing",
        "Hybrid Models"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "G10Y4vrhGF",
      "title": "FedFree: Breaking Knowledge-sharing Barriers through Layer-wise Alignment in Heterogeneous Federated Learning",
      "abstract": "Heterogeneous Federated Learning (HtFL) enables collaborative learning across clients with diverse model architectures and non-IID data distributions, which are prevalent in real-world edge computing applications. Existing HtFL approaches typically employ proxy datasets to facilitate knowledge sharing or implement coarse-grained model-level knowledge transfer. However, such approaches not only elevate risks of user privacy leakage but also lead to the loss of fine-grained model-specific knowledge, ultimately creating barriers to effective knowledge sharing. To address these challenges, we propose FedFree, a novel data-free and model-free HtFL framework featuring two key innovations. First, FedFree introduces a reverse layer-wise knowledge transfer mechanism that aggregates heterogeneous client models into a global model solely using Gaussian-based pseudo data, eliminating reliance on proxy datasets. Second, it leverages Knowledge Gain Entropy (KGE) to guide targeted layer-wise knowledge alignment, ensuring that each client receives the most relevant global updates tailored to its specific architecture. We provide rigorous theoretical convergence guarantees for FedFree and conduct extensive experiments on CIFAR-10 and CIFAR-100. Results demonstrate that FedFree achieves substantial performance gains, with relative accuracy improving up to 46.3% over state-of-the-art baselines. The framework consistently excels under highly heterogeneous model/data distributions and in large scale settings.",
      "keywords": [
        "Heterogeneous Federated Learning",
        "Public-Data-Free",
        "Knowledge Gain Entropy"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "gkcU26BOml",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "gkcU26BOml",
      "title": "Cross-modal Associations in Vision and Language Models: Revisiting the Bouba-Kiki Effect",
      "abstract": "Recent advances in multimodal models have raised questions about whether vision-and-language models (VLMs) integrate cross-modal information in ways that reflect human cognition. One well-studied test case in this domain is the bouba-kiki effect, where humans reliably associate pseudowords like ‘bouba’ with round shapes and ‘kiki’ with jagged ones. Given the mixed evidence found in prior studies for this effect in VLMs, we present a comprehensive re-evaluation focused on two variants of CLIP, ResNet and Vision Transformer (ViT), given their centrality in many state-of-the-art VLMs. We apply two complementary methods closely modelled after human experiments: a prompt-based evaluation that uses probabilities as a measure of model preference, and we use Grad-CAM as a novel approach to interpret visual attention in shape-word matching tasks. Our findings show that these model variants do not consistently exhibit the bouba-kiki effect. While ResNet shows a preference for round shapes, overall performance across both model variants lacks the expected associations. Moreover, direct comparison with prior human data on the same task shows that the models’ responses fall markedly short of the robust, modality-integrated behaviour characteristic of human cognition. These results contribute to the ongoing debate about the extent to which VLMs truly understand cross-modal concepts, highlighting limitations in their internal representations and alignment with human intuitions.",
      "keywords": [
        "Cross-modal associations",
        "Vision-and-Language Models",
        "bouba-kiki effect",
        "Cognitive science"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "qYkhCah8OZ",
      "title": "Boosting Knowledge Utilization in Multimodal Large Language Models via Adaptive Logits Fusion and Attention Reallocation",
      "abstract": "Despite their recent progress, Multimodal Large Language Models (MLLMs) often struggle in knowledge-intensive tasks due to the limited and outdated parametric knowledge acquired during training. Multimodal Retrieval Augmented Generation addresses this issue by retrieving contextual knowledge from external databases, thereby enhancing MLLMs with expanded knowledge sources. \nHowever, existing MLLMs often fail to fully leverage the retrieved contextual knowledge for response generation. We examine representative MLLMs and identify two major causes, namely, attention bias toward different tokens and knowledge conflicts between parametric and contextual knowledge. To this end, we design Adaptive Logits Fusion and Attention Reallocation (ALFAR), a training-free and plug-and-play approach that improves MLLM responses by maximizing the utility of the retrieved knowledge. Specifically, ALFAR tackles the challenges from two perspectives. First, it alleviates attention bias by adaptively shifting attention from visual tokens to relevant context tokens according to query-context relevance. Second, it decouples and weights parametric and contextual knowledge at output logits, mitigating conflicts between the two types of knowledge. As a plug-and-play method, ALFAR achieves superior performance across diverse datasets without requiring additional training or external tools. Extensive experiments over multiple MLLMs and benchmarks show that ALFAR consistently outperforms the state-of-the-art by large margins. Our code and data are available at https://github.com/Lackel/ALFAR.",
      "keywords": [
        "Multimodal Large Language Models",
        "Multimodal Retrieval Augmented Generation"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "z5KTxW5sJd",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "obXGSmmG70",
      "title": "AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning",
      "abstract": "Large Language Models (LLMs) have demonstrated remarkable capabilities but often face challenges with tasks requiring sophisticated reasoning. While Chain-of-Thought (CoT) prompting significantly enhances reasoning, it indiscriminately generates lengthy reasoning steps for all queries, leading to substantial computational costs and inefficiency, especially for simpler inputs. To address this critical issue, we introduce AdaCoT (Adaptive Chain-of-Thought), a novel framework enabling LLMs to adaptively decide when to invoke CoT. AdaCoT framed adaptive reasoning as a Pareto optimization problem that seeks to balance model performance with the costs associated with CoT invocation (both frequency and computational overhead). We propose a reinforcement learning (RL) based method, specifically utilizing Proximal Policy Optimization (PPO), to dynamically control the CoT triggering decision boundary by adjusting penalty coefficients, thereby allowing the model to determine CoT necessity based on implicit query complexity. A key technical contribution is Selective Loss Masking (SLM), designed to counteract decision boundary collapse during multi-stage RL training, ensuring robust and stable adaptive triggering. Experimental results demonstrate that AdaCoT successfully navigates the Pareto frontier, achieving substantial reductions in CoT usage for queries not requiring elaborate reasoning. For instance, on our production traffic testset, AdaCoT reduced CoT triggering rates to as low as 3.18% and decreased average response tokens by 69.06% on APP, while maintaining high performance on complex tasks. This substantial token decrease directly translates to a significant reduction in inference computational load. AdaCoT pioneers adaptive CoT triggering, offering a practical and principled solution for developing more efficient, responsive, and cost-effective LLMs, particularly crucial for interactive and resource-sensitive applications.",
      "keywords": [
        "Adaptive Reasoning",
        "Chain-of-Thought",
        "Large Language Models"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "z5KTxW5sJd",
      "title": "From Replication to Redesign: Exploring Pairwise Comparisons for LLM-Based Peer Review",
      "abstract": "The advent of large language models (LLMs) offers unprecedented opportunities to reimagine peer review beyond the constraints of traditional workflows.\nDespite these opportunities, prior efforts have largely focused on replicating traditional review workflows with LLMs serving as direct substitutes for human reviewers, while limited attention has been given to exploring new paradigms that fundamentally rethink how LLMs can participate in the academic review process.\nIn this paper, we introduce and explore a novel mechanism that employs LLM agents to perform pairwise comparisons among manuscripts instead of individual scoring. By aggregating outcomes from substantial pairwise evaluations, this approach enables a more accurate and robust measure of relative manuscript quality.\nOur experiments demonstrate that this comparative approach significantly outperforms traditional rating-based methods in identifying high-impact papers. However, our analysis also reveals emergent biases in the selection process, notably a reduced novelty in research topics and an increased institutional imbalance. These findings highlight both the transformative potential of rethinking peer review with LLMs and critical challenges that future systems must address to ensure equity and diversity.",
      "keywords": [
        "Large Language Models",
        "Peer Review Redesign",
        "Comparative Paper Evaluation"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "FZURCro04D",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "Q3qAsZAEZw",
      "title": "Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference",
      "abstract": "Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changing system configuration, such as evaluation batch size, GPU count, and GPU version, can introduce significant differences in the generated responses. \nThis issue is especially pronounced in reasoning models, where minor rounding differences in early tokens can cascade into divergent chains of thought, ultimately affecting accuracy. For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9\\% variation in accuracy and 9,000 tokens difference in response length due to differences in GPU count, type, and evaluation batch size.\nWe trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. \nThis work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge.\nOur analysis reveals that floating-point precision—while critical for reproducibility—is often neglected in evaluation practices.\nInspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32, balancing memory efficiency with numerical stability. Code is available at https://github.com/nanomaoli/llm_reproducibility.",
      "keywords": [
        "Large Language Models (LLMs)",
        "Reproducibility",
        "Numerical precision",
        "Deterministic inference"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "b7uniOw0sZ",
      "title": "Which Data Attributes Stimulate Math and Code Reasoning? An Investigation via Influence Functions",
      "abstract": "Large language models (LLMs) have demonstrated remarkable reasoning capabilities in math and coding, often bolstered by post-training on the chain-of-thoughts (CoTs) generated by stronger models. \nHowever, existing strategies for curating such training data predominantly rely on heuristics, limiting generalizability and failing to capture subtleties underlying in data. \nTo address these limitations, we leverage influence functions to systematically attribute LLMs' reasoning ability on math and coding to individual training examples, sequences, and tokens, enabling deeper insights into effective data characteristics.\nOur Influence-based Reasoning Attribution (Infra) uncovers nontrivial cross-domain effects across math and coding tasks: high-difficulty math examples improve both math and code reasoning, while low-difficulty code tasks most effectively benefit code reasoning.\nBased on these findings, we introduce a simple yet effective dataset reweighting strategy by flipping task difficulty, which doubles AIME24 accuracy from 10\\% to 20\\% and boosts LiveCodeBench accuracy from 33.8\\% to 35.3\\% for Qwen2.5-7B-Instruct.\nMoreover, our fine-grained attribution reveals that the sequence-level exploratory behaviors enhance reasoning performance in both math and code, and the token-level influence patterns are distinct for math and code reasoning: the former prefers natural language logic connectors and the latter emphasizes structural syntax.",
      "keywords": [
        "influence functions",
        "llm reasoning"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "E2PFv7ad3p",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "E2PFv7ad3p",
      "title": "Have the VLMs Lost Confidence? A Study of Sycophancy in VLMs",
      "abstract": "In the study of LLMs, sycophancy represents a prevalent hallucination that poses significant challenges to these models. Specifically, LLMs often fail to adhere to original correct responses, instead blindly agreeing with users' opinions, even when those opinions are incorrect or malicious. However, research on sycophancy in visual language models (VLMs) has been scarce. In this work, we extend the exploration of sycophancy from LLMs to VLMs, introducing the MM-SY benchmark to evaluate this phenomenon. We present evaluation results from multiple representative models, addressing the gap in sycophancy research for VLMs. To mitigate sycophancy, we propose a synthetic dataset for training and employ methods based on prompts, supervised fine-tuning, and DPO. Our experiments demonstrate that these methods effectively alleviate sycophancy in VLMs. Additionally, we probe VLMs to assess the semantic impact of sycophancy and analyze the attention distribution of visual tokens. Our findings indicate that the ability to prevent sycophancy is predominantly observed in higher layers of the model. The lack of attention to image knowledge in these higher layers may contribute to sycophancy, and enhancing image attention at high layers proves beneficial in mitigating this issue.",
      "keywords": [
        "Multi-modal Model",
        "Visual-Language Model",
        "Sycophancy",
        "Hallucination"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "qb2QRoE4W3",
      "title": "LLM-Cite: Cheap Fact Verification with Attribution via URL Generation",
      "abstract": "Hallucinations are one of the main issues with Large Language Models (LLMs). This has led to increased interest in automated ways to verify the factuality of LLMs' responses. Existing methods either rely on: (a) search over a knowledge base (KB), which is costly especially if the KB must be updated frequently to keep up with fresh content, (b) LLM's parametric knowledge to fact-check claims, which is cheaper but does not give attribution and is limited to verifying claims related to knowledge acquired during pretraining. In this work, we present LLM-Cite, a cheap and easy to implement method that does not rely on any external search system while still providing attribution and the ability to verify fresh claims. Our key insight is to leverage an LLM to directly generate potential citation URLs for a given claim, and then use entailment checks to verify the claim against content of the URLs (which are fetched on-the-fly). We benchmark LLM-Cite on three datasets containing fresh and non-fresh claims generated by humans and models. We show that LLM-Cite performs comparable or better than existing methods on all categories of claims --- importantly, without sacrificing attribution, or requiring costly external search --- overall LLM-Cite is more than 45x cheaper than a Google Search based approach.",
      "keywords": [
        "Fact Verification",
        "Attribution",
        "Citation",
        "Factuality"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "XO9fhSZkBh",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "Jsln9ZyMl4",
      "title": "The Cost of Robustness: Tighter Bounds on Parameter Complexity for Robust Memorization in ReLU Nets",
      "abstract": "We study the parameter complexity of robust memorization for ReLU networks: the number of parameters required to interpolate any dataset with $\\epsilon$-separation between differently labeled points, while ensuring predictions remain consistent within a $\\mu$-ball around each training example. We establish upper and lower bounds on the parameter count as a function of the robustness ratio $\\rho = \\mu / \\epsilon$. Unlike prior work, we provide a fine-grained analysis across the entire range $\\rho \\in (0,1)$ and obtain tighter upper and lower bounds that improve upon existing results. Our findings reveal that the parameter complexity of robust memorization matches that of non-robust memorization when $\\rho$ is small, but grows with increasing $\\rho$. As a special case, when the input dimension is comparable to or exceeds the dataset size, our bounds become tight (up to logarithmic factors) across the entire range of $\\rho$.",
      "keywords": [
        "Robust memorization",
        "Memorization",
        "Adversarial training",
        "Parameter Complexity"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "XO9fhSZkBh",
      "title": "Depth-Bounds for Neural Networks via the Braid Arrangement",
      "abstract": "We contribute towards resolving the open question of how many hidden layers are required in ReLU networks for exactly representing all continuous and piecewise linear functions on $\\mathbb{R}^d$. \nWhile the question has been resolved in special cases, the best known lower bound in general is still 2. \nWe focus on neural networks that are compatible with certain polyhedral complexes, more precisely with the braid fan.  \nFor such neural networks, we prove a non-constant lower bound of $\\Omega(\\log\\log d)$ hidden layers required to exactly represent the maximum of $d$ numbers. Additionally, we provide a combinatorial proof that neural networks satisfying this assumption require three hidden layers to compute the maximum of 5 numbers; this had only been verified with an excessive computation so far.\nFinally, we show that a natural generalization of the best known upper bound to maxout networks is not tight, by demonstrating that a rank-3 maxout layer followed by a rank-2 maxout layer is sufficient to represent the maximum of 7 numbers.",
      "keywords": [
        "Neural Networks",
        "Piecewise Linear Functions",
        "Exact Representations",
        "Polyhedral Geometry",
        "Braid Fan",
        "Boolean Lattice"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "7ieS4EYKnB",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "4Ud0pRqFto",
      "title": "MARS-VFL: A Unified Benchmark for Vertical Federated Learning with Realistic Evaluation",
      "abstract": "Vertical Federated Learning (VFL) has emerged as a critical privacy-preserving learning paradigm, enabling collaborative model training by leveraging distributed features across clients. However, due to privacy concerns, there are few publicly available real-world datasets for evaluating VFL methods, which poses significant challenges to related research. To bridge this gap, we propose MARS-VFL, a unified benchmark for realistic VFL evaluation. It integrates data from practical applications involving collaboration across different features, maintaining compatibility with the VFL setting. Based on this, we standardize the evaluation of VFL methods from the mainstream aspects of efficiency, robustness, and security. We conduct comprehensive experiments to assess different VFL approaches, providing references for unified evaluation. Furthermore, we are the first to unify the evaluation of robustness challenges in VFL and introduce a new method for addressing robustness challenges, establishing standard baselines for future research.",
      "keywords": [
        "Vertical Federated Learning",
        "Distrubuted System",
        "Benchmark"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "7ieS4EYKnB",
      "title": "Flick: Empowering Federated Learning with Commonsense Knowledge",
      "abstract": "Federated Learning (FL) has emerged as a privacy-preserving framework for training models on data generated at the edge. However, the heterogeneity of data silos (e.g., label skew and domain shift) often leads to inconsistent learning objectives and suboptimal model performance. Inspired by the data-driven approach, we propose Flick, a novel data generation framework for heterogeneous **F**ederated **L**earning w**i**th **C**ommonsense **K**nowledge from Large Language Models (LLMs). In Flick, the client performs the local data summary to capture client-specific knowledge in textual form. The central server then distills task-relevant, high-quality knowledge from the out-of-the-box LLM -- guided by cross-client-specific insights -- to generate informative text prompts. These prompts direct a generative model in producing synthetic data, enabling global model fine-tuning and local data compensation. This process gradually aligns the label and feature distributions across clients. Extensive results on three datasets demonstrate that Flick improves the global model accuracy by up to 11.43\\%, and accelerates convergence by up to 12.9$\\times$, validating its effectiveness in addressing data heterogeneity.",
      "keywords": [
        "Federated Learning",
        "Data Heterogeneity"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "yThwhNCaZN",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "5h9mS87Pyt",
      "title": "The Atlas of In-Context Learning: How Attention Heads Shape In-Context Retrieval Augmentation",
      "abstract": "Large language models are able to exploit in-context learning to access external knowledge beyond their training data through retrieval-augmentation. While promising, its inner workings remain unclear. In this work, we shed light on the mechanism of in-context retrieval augmentation for question answering by viewing a prompt as a composition of informational components. We propose an attribution-based method to identify specialized attention heads, revealing in-context heads that comprehend instructions and retrieve relevant contextual information, and parametric heads that store entities' relational knowledge. To better understand their roles, we extract function vectors and modify their attention weights to show how they can influence the answer generation process. Finally, we leverage the gained insights to trace the sources of knowledge used during inference, paving the way towards more safe and transparent language models.",
      "keywords": [
        "feature attribution",
        "in-context learning",
        "function vectors",
        "retrieval-augmented language models",
        "transformers",
        "explainable AI",
        "mechanistic Interpretability"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "OkVQJZWGfn",
      "title": "CoT Information: Improved Sample Complexity under Chain-of-Thought Supervision",
      "abstract": "Learning complex functions that involve multi-step reasoning poses a significant challenge for standard supervised learning from input-output examples. Chain-of-thought (CoT) supervision, which augments training data with intermediate reasoning steps to provide a richer learning signal, has driven recent advances in large language model reasoning. This paper develops a statistical theory of learning under CoT supervision. Central to the theory is the *CoT information*, which measures the additional discriminative power offered by the chain-of-thought for distinguishing hypotheses with different end-to-end behaviors. The main theoretical results demonstrate how CoT supervision can yield significantly faster learning rates compared to standard end-to-end supervision, with both upper bounds and information-theoretic lower bounds characterized by the CoT information.",
      "keywords": [
        "chain-of-thought",
        "learning theory",
        "statistical learning theory",
        "PAC learning",
        "sample complexity"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "XoN10bZtR9",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "XoN10bZtR9",
      "title": "Rethinking Joint Maximum Mean Discrepancy for Visual Domain Adaptation",
      "abstract": "In domain adaption (DA), joint maximum mean discrepancy (JMMD), as a famous distribution-distance metric, aims to measure joint probability distribution difference between the source domain and target domain, while it is still not fully explored and especially hard to be applied into a subspace-learning framework as its empirical estimation involves a tensor-product operator whose partial derivative is difficult to obtain. To solve this issue, we deduce a concise JMMD based on the Representer theorem that avoids the tensor-product operator and obtains two essential findings. First, we reveal the uniformity of JMMD by proving that previous marginal, class conditional, and weighted class conditional probability distribution distances are three special cases of JMMD with different label reproducing kernels. Second, inspired by graph embedding, we observe that the similarity weights, which strengthen the intra-class compactness in the graph of Hilbert Schmidt independence criterion (HSIC), take opposite signs in the graph of JMMD, revealing why JMMD degrades the feature discrimination. This motivates us to propose a novel loss JMMD-HSIC by jointly considering JMMD and HSIC to promote discrimination of JMMD. Extensive experiments on several cross-domain datasets could demonstrate the validity of our revealed theoretical results and the effectiveness of our proposed JMMD-HSIC.",
      "keywords": [
        "domain adaptation",
        "JMMD",
        "HSIC",
        "feature discrimination",
        "graph embedding"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "e2WesV6Voe",
      "title": "Sequence Modeling with Spectral Mean Flows",
      "abstract": "A key question in sequence modeling with neural networks is how to represent and learn highly nonlinear and probabilistic state dynamics. Operator theory views such dynamics as linear maps on Hilbert spaces containing mean embedding vectors of distributions, offering an appealing but currently overlooked perspective. We propose a new approach to sequence modeling based on an operator-theoretic view of a hidden Markov model (HMM). Instead of materializing stochastic recurrence, we embed the full sequence distribution as a tensor in the product Hilbert space. A generative process is then defined as maximum mean discrepancy (MMD) gradient flow in the space of sequences. To overcome challenges with large tensors and slow sampling convergence, we introduce spectral mean flows, a novel tractable algorithm integrating two core concepts. First, we propose a new neural architecture by leveraging spectral decomposition of linear operators to derive a scalable tensor network decomposition of sequence mean embeddings. Second, we extend MMD gradient flows to time-dependent Hilbert spaces and connect them to flow matching via the continuity equation, enabling simulation-free learning and faster sampling. We demonstrate competitive results on a range of time-series modeling datasets.",
      "keywords": [
        "sequence modeling",
        "time series",
        "hidden Markov models",
        "mean embeddings",
        "linear operators",
        "maximum mean discrepancy",
        "gradient flows"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "MeOTBs8BQV",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "MeOTBs8BQV",
      "title": "Sync or Sink: Bounds on Algorithmic Collective Action with Noise and Multiple Groups",
      "abstract": "Collective action against algorithmic systems, which enables groups to promote their own interests, is poised to grow. Hence, there will be growth in the size and the number of distinct collectives. Currently, there is no formal analysis of how coordination challenges within a collective can impact downstream outcomes, or how multiple collectives may affect each other's success. In this work, we aim to provide guarantees on the success of collective action in the presence of both coordination noise and multiple groups. Our insight is that data generated by either multiple collectives or by coordination noise can be viewed as originating from multiple data distributions. \nUsing this framing, we derive bounds on the success of collective action. We conduct experiments to study the effects of noise on collective action. We find that sufficiently high levels of noise can reduce the success of collective action. In certain scenarios, large noise can sink a collective success rate from $100$% to just under $60$%. We identify potential trade-offs between collective size and coordination noise; for example, a collective that is twice as big but with four times more noise experiencing worse outcomes than the smaller, more coordinated one. This work highlights the importance of understanding nuanced dynamics of strategic behavior in algorithmic systems.",
      "keywords": [
        "Algorithmic Collective Action",
        "Social Computing",
        "Data Campaigns"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "kNXTUCksnh",
      "title": "The Burden of Interactive Alignment with Inconsistent Preferences",
      "abstract": "From media platforms to chatbots, algorithms shape how people interact, learn, and discover information. Such interactions between users and an algorithm often unfold over multiple steps, during which strategic users can guide the algorithm to better align with their true interests by selectively engaging with content. However, users frequently exhibit inconsistent preferences: they may spend considerable time on content that offers little long-term value, inadvertently signaling that such content is desirable. Focusing on the user side, this raises a key question: what does it take for such users to align the algorithm with their true interests?\n\nTo investigate these dynamics, we model the user’s decision process as split between a rational \"system 2\" that decides whether to engage and an impulsive \"system 1\" that determines how long engagement lasts. We then study a multi-leader, single-follower extensive Stackelberg game, where users, specifically system 2, lead by committing to engagement strategies and the algorithm best-responds based on observed interactions. We define the burden of alignment as the minimum horizon over which users must optimize to effectively steer the algorithm. We show that a critical horizon exists: users who are sufficiently foresighted can achieve alignment, while those who are not are instead aligned to the algorithm’s objective. This critical horizon can be long, imposing a substantial burden. However, even a small, costly signal (e.g., an extra click) can significantly reduce it. Overall, our framework explains how users with inconsistent preferences can align an engagement-driven algorithm with their interests in a Stackelberg equilibrium, highlighting both the challenges and potential remedies for achieving alignment.",
      "keywords": [
        "Human AI Interaction",
        "Alignment",
        "Strategic Behavior",
        "Stackelberg Equilibrium"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "RPRqKhjrr6",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "B6bE2GC71a",
      "title": "EvoLM: In Search of Lost Language Model Training Dynamics",
      "abstract": "Modern language model (LM) training has been divided into multiple stages, making it difficult for downstream developers to evaluate the impact of design choices made at each stage.\nWe present EvoLM, a model suite that enables systematic and transparent analysis of LMs' training dynamics across pre-training, continued pre-training, supervised fine-tuning, and reinforcement learning. \nBy training over 100 LMs with 1B and 4B parameters from scratch, we rigorously evaluate both upstream (language modeling) and downstream (problem-solving) reasoning capabilities, including considerations of both in-domain and out-of-domain generalization. \nKey insights highlight the diminishing returns from excessive pre-training and post-training, the importance and practices of mitigating forgetting during domain-specific continued pre-training, the crucial role of continued pre-training in bridging pre-training and post-training phases, and various intricate trade-offs when configuring supervised fine-tuning and reinforcement learning. \nTo facilitate open research and reproducibility, we release all pre-trained and post-trained models, training datasets for all stages, and our entire training and evaluation pipeline.",
      "keywords": [
        "Language Models",
        "Training Dynamics",
        "Pretraining",
        "Post-training"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "oN5YVZ9JeF",
      "title": "T-SHIRT: Token-Selective Hierarchical Data Selection for Instruction Tuning",
      "abstract": "Instruction tuning is essential for Large Language Models (LLMs) to effectively follow user instructions. To improve training efficiency and reduce data redundancy, recent works use LLM-based scoring functions, e.g., Instruction-Following Difficulty (IFD), to select high–quality instruction-tuning data with scores above a threshold. While these data selection methods often lead to models that can match or even exceed the performance of models trained on the full datasets, we identify two key limitations: (i) they assess quality at the sample level, ignoring token-level informativeness; and (ii) they overlook the robustness of the scoring method, often selecting a sample due to superficial lexical features instead of its true quality. In this work, we propose Token-Selective HIeRarchical Data Selection for Instruction Tuning (T-SHIRT), a novel data selection framework that introduces a new scoring method to include only informative tokens in quality evaluation and also promote robust and reliable samples whose neighbors also show high quality with less local inconsistencies. We demonstrate that models instruction-tuned on a curated dataset (only 5% of the original size) using T-SHIRT can outperform those trained on the entire large-scale dataset by up to 5.48 points on average across eight benchmarks. Across various LLMs and training set scales, our method consistently surpasses existing state-of-the-art data selection techniques, while also remaining both cost-effective and highly efficient. For instance, by using GPT-2 for score computation, we are able to process a dataset of 52k samples in 40 minutes on a single GPU. Our code is available at https://github.com/Dynamite321/T-SHIRT.",
      "keywords": [
        "Large Language Models",
        "Instruction tuning",
        "Data Selection",
        "Token-selective Quality Score",
        "Robust Hierarchical Selection"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "Q3qAsZAEZw",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "Q3qAsZAEZw",
      "title": "Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference",
      "abstract": "Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changing system configuration, such as evaluation batch size, GPU count, and GPU version, can introduce significant differences in the generated responses. \nThis issue is especially pronounced in reasoning models, where minor rounding differences in early tokens can cascade into divergent chains of thought, ultimately affecting accuracy. For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9\\% variation in accuracy and 9,000 tokens difference in response length due to differences in GPU count, type, and evaluation batch size.\nWe trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. \nThis work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge.\nOur analysis reveals that floating-point precision—while critical for reproducibility—is often neglected in evaluation practices.\nInspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32, balancing memory efficiency with numerical stability. Code is available at https://github.com/nanomaoli/llm_reproducibility.",
      "keywords": [
        "Large Language Models (LLMs)",
        "Reproducibility",
        "Numerical precision",
        "Deterministic inference"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "iBFfb6bGOz",
      "title": "Atomic Thinking of LLMs: Decoupling and Exploring Mathematical Reasoning Abilities",
      "abstract": "Large Language Models (LLMs) have demonstrated outstanding performance in mathematical reasoning capabilities. However, we argue that current large-scale reasoning models primarily rely on scaling up training datasets with diverse mathematical problems and long thinking chains, which raises questions about whether LLMs genuinely acquire mathematical concepts and reasoning principles or merely remember the training data. In contrast, humans tend to break down complex problems into multiple fundamental atomic capabilities. Inspired by this, we propose a new paradigm for evaluating mathematical atomic capabilities. Our work categorizes atomic abilities into two dimensions: (1) field-specific abilities across four major mathematical fields, algebra, geometry, analysis, and topology, and (2) logical abilities at different levels, including conceptual understanding, forward multi-step reasoning with formal math language, and counterexample-driven backward reasoning. We propose corresponding training and evaluation datasets for each atomic capability unit, and conduct extensive experiments about how different atomic capabilities influence others, to explore the strategies to elicit the required specific atomic capability. Evaluation and experimental results on advanced models show many interesting discoveries and inspirations about the different performances of models on various atomic capabilities and the interactions between atomic capabilities. Our findings highlight the importance of decoupling mathematical intelligence into atomic components, providing new insights into model cognition and guiding the development of training strategies toward a more efficient, transferable, and cognitively grounded paradigm of \"atomic thinking\".",
      "keywords": [
        "Large Language Models",
        "Mathematical Reasoning",
        "Atomic Thinking"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "dML3XGvWmy",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "K3KrOsR6y9",
      "title": "LLMs Can Plan Only If We Tell Them",
      "abstract": "Large language models (LLMs) have demonstrated significant capabilities in natural language processing and reasoning, yet their effectiveness in autonomous planning has been under debate. While existing studies have utilized LLMs with external feedback mechanisms or in controlled environments for planning, these approaches often involve substantial computational and development resources due to the requirement for careful design and iterative backprompting. Moreover, even the most advanced LLMs like GPT-4 struggle to match human performance on standard planning benchmarks, such as the Blocksworld, without additional support. This paper investigates whether LLMs can independently generate long-horizon plans that rival human baselines. Our novel enhancements to Algorithm-of-Thoughts (AoT), which we dub AoT+, help achieve state-of-the-art results in planning benchmarks out-competing prior methods and human baselines all autonomously.",
      "keywords": [
        "large language models",
        "decision-making",
        "planning"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "YrycTjllL0",
      "title": "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions",
      "abstract": "Task automation has been greatly empowered by the recent advances in Large Language Models (LLMs) via Python code, where the tasks range from software engineering development to general-purpose reasoning. While current benchmarks have shown that LLMs can solve tasks using programs like human developers, the majority of their evaluations are limited to short and self-contained algorithmic tasks or standalone function calls. Solving challenging and practical tasks requires the capability of utilizing **diverse function calls as tools** to efficiently implement functionalities like data analysis and web development. In addition, using multiple tools to solve a task needs compositional reasoning by accurately understanding **complex instructions**. Fulfilling both of these characteristics can pose a great challenge for LLMs. To assess how well LLMs can solve challenging and practical tasks via programs, we introduce BigCodeBench, a benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks. To evaluate LLMs rigorously, each task encompasses 5.6 test cases with an average branch coverage of 99%. In addition, we propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions containing only essential information. Our extensive evaluation of 60 LLMs shows that **LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%**. The results underscore the need for further advancements in this area.",
      "keywords": [
        "Code Generation",
        "Tool Use",
        "Instruction Following",
        "Benchmark"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "n4V3MSqK77",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "Fcs90Rwm8j",
      "title": "Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs",
      "abstract": "Large language models (LLMs) have shown remarkable performance across diverse reasoning and generation tasks, and are increasingly deployed as agents in dynamic environments such as code generation and recommendation systems. However, many real-world applications, such as high-frequency trading and real-time competitive gaming, require decisions under strict latency constraints, where faster responses directly translate into higher rewards. Despite the importance of this latency–quality trade-off, it remains underexplored in the context of LLM-based agents. In this work, we present the first systematic study of this trade-off in real-time decision-making tasks. To support our investigation, we introduce two new benchmarks: HFTBench, a high-frequency trading simulation, and StreetFighter, a competitive gaming platform. Our analysis reveals that optimal latency–quality balance varies by task, and that sacrificing quality for lower latency can significantly enhance downstream performance. To address this, we propose FPX, an adaptive framework that dynamically selects model size and quantization level based on real-time demands. Our method achieves the best performance on both benchmarks, improving win rate by up to 80% in Street Fighter and boosting daily yield by up to 26.52% in trading,  underscoring the need for latency-aware evaluation and deployment strategies for LLM-based agents. These results demonstrate the critical importance of latency-aware evaluation and deployment strategies for real-world LLM-based agents.",
      "keywords": [
        "Machine Learning",
        "Gaming",
        "Model Compression",
        "LLM Agents"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "n4V3MSqK77",
      "title": "Agentic Plan Caching: Test-Time Memory for Fast and Cost-Efficient LLM Agents",
      "abstract": "LLM-based agent applications have shown increasingly remarkable capabilities in complex workflows but incur substantial costs and latency due to extensive planning and reasoning requirements. \nExisting LLM caching techniques (like context caching and semantic caching), primarily designed for serving chatbots, are insufficient for agent applications where outputs depend on external data and environmental contexts. \nWe propose **Agentic Plan Caching (APC)**, a novel **test-time memory** that extracts, stores, adapts, and reuses structured plan templates from planning stages of agent applications across semantically similar tasks to reduce the cost and latency of serving. \nUnlike traditional semantic caching, our system extracts plan templates from completed agent executions at test-time, employs keyword extraction to match new requests against cached plans, and utilizes lightweight models to adapt these templates to task-specific plans with contexts. \nEvaluation across multiple real-world agent applications shows that our system can reduce costs by 50.31\\% and latency by 27.28\\% on average while maintaining performance, offering a more efficient solution for serving LLM-based agents that complements existing LLM serving infrastructures.",
      "keywords": [
        "Caching",
        "Memory",
        "Serving",
        "LLM Agents"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "H8fscnm6Xx",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "7AwFJzgIUW",
      "title": "Dynamical Low-Rank Compression of Neural Networks with Robustness under Adversarial Attacks",
      "abstract": "Deployment of neural networks on resource-constrained devices demands models that are both compact and robust to adversarial inputs. However, compression and adversarial robustness often conflict. In this work, we introduce a dynamical low-rank training scheme enhanced with a novel spectral regularizer that controls the condition number of the low-rank core in each layer. This approach mitigates the sensitivity of compressed models to adversarial perturbations without sacrificing clean accuracy. The method is model- and data-agnostic, computationally efficient, and supports rank adaptivity to automatically compress the network at hand. Extensive experiments across standard architectures, datasets, and adversarial attacks show the regularized networks can achieve over 94 compression while recovering or improving adversarial accuracy relative to uncompressed baselines.",
      "keywords": [
        "Low Rank",
        "Adversarial Robustenss",
        "Adversarial Attacks",
        "Rank Adaptive",
        "Computer Vision",
        "Compression"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "H8fscnm6Xx",
      "title": "Unextractable Protocol Models: Collaborative Training and Inference without Weight Materialization",
      "abstract": "We consider a decentralized setup in which the participants collaboratively train and serve a large neural network, and where each participant only processes a subset of the model. \nIn this setup, we explore the possibility of unmaterializable weights, where a full weight set is never available to any one participant.\nWe introduce Unextractable Protocol Models (UPMs): a training and inference framework that leverages the sharded model setup to ensure model shards (i.e.,, subsets) held by participants are incompatible at different time steps. UPMs periodically inject time-varying, random, invertible transforms at participant boundaries; preserving the overall network function yet rendering cross-time assemblies incoherent.\nOn Qwen-2.5-0.5B and Llama-3.2-1B, 10 000 transforms leave FP32 perplexity unchanged ($\\Delta$PPL$< 0.01$; Jensen–Shannon drift $<4 \\times 10^{-5}$), and we show how to control growth for lower precision datatypes. Applying a transform every 30s adds 3% latency, 0.1% bandwidth, and 10% GPU-memory overhead at inference, while training overhead falls to 1.6% time and < 1% memory. We consider several attacks, showing that the requirements of direct attacks are impractical and easy to defend against, and that gradient-based fine-tuning of stitched partitions consumes $\\geq 60\\%$ of the tokens required to train from scratch. By enabling models to be collaboratively trained yet not extracted, UPMs make it practical to embed programmatic incentive mechanisms in community-driven decentralized training.",
      "keywords": [
        "Decentralized training",
        "LLMs",
        "Distributed training",
        "Open source",
        "Weight secrecy"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "GRMfXcAAFh",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "RoN6M3i7gJ",
      "title": "A Riemannian Framework for Learning Reduced-order Lagrangian Dynamics",
      "abstract": "By incorporating physical consistency as inductive bias, deep neural networks display increased generalization capabilities and data efficiency in learning nonlinear dynamic models. However, the complexity of these models generally increases with the system dimensionality, requiring larger datasets, more complex deep networks, and significant computational effort.\nWe propose a novel geometric network architecture to learn physically-consistent reduced-order dynamic parameters that accurately describe the original high-dimensional system behavior.\nThis is achieved by building on recent advances in model-order reduction and by adopting a Riemannian perspective to jointly learn a non-linear structure-preserving latent space and the associated low-dimensional dynamics.\nOur approach enables accurate long-term predictions of the high-dimensional dynamics of rigid and deformable systems with increased data efficiency by inferring interpretable and physically-plausible reduced Lagrangian models.",
      "keywords": [
        "physics-inspired networks",
        "dynamics learning",
        "model-order reduction",
        "Riemannian geometry",
        "deformable objects"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "WWymYrA48K",
      "title": "Test Time Learning for Time Series Forecasting",
      "abstract": "We propose the use of Test-Time Training (TTT) modules in a cascade architecture to enhance performance in long-term time series forecasting. Through extensive experiments on standard benchmark datasets, we demonstrate that TTT modules consistently outperform state-of-the-art models, including Mamba-based TimeMachine, particularly in scenarios involving extended sequence and prediction lengths. Our results show significant improvements, especially on larger datasets such as Electricity, Traffic, and Weather, underscoring the effectiveness of TTT in capturing long-range dependencies. Additionally, we explore various convolutional architectures within the TTT framework, showing that convolutional blocks as hidden layer architectures can achieve competitive results.",
      "keywords": [
        "Time Series Forecasting",
        "Test-Time Training",
        "Mamba",
        "Expressive Hidden States",
        "Modern CNN"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "N8Oj1XhtYZ",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "N8Oj1XhtYZ",
      "title": "SANA: Efficient High-Resolution Text-to-Image Synthesis with Linear Diffusion Transformers",
      "abstract": "We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096$\\times$4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8$\\times$, we trained an AE that can compress images 32$\\times$, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. (4)  Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024$\\times$1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released upon publication.",
      "keywords": [
        "Efficient AI",
        "Diffusion Models",
        "Text to Image generation"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "mr2icR6dpD",
      "title": "TIGeR: Unifying Text-to-Image Generation and Retrieval with Large Multimodal Models",
      "abstract": "How humans can effectively and efficiently acquire images has always been a perennial question. A classic solution is *text-to-image retrieval* from an existing database; however, the limited database typically lacks creativity. By contrast, recent breakthroughs in *text-to-image generation* have made it possible to produce attractive and counterfactual visual content, but it faces challenges in synthesizing knowledge-intensive images. In this work, we rethink the relationship between text-to-image generation and retrieval, proposing a *unified* framework for both tasks with one single Large Multimodal Model (LMM). Specifically, we first explore the intrinsic discriminative abilities of LMMs and introduce an efficient generative retrieval method for text-to-image retrieval in a training-free manner. Subsequently, we unify generation and retrieval autoregressively and propose an autonomous decision mechanism to choose the best-matched one between generated and retrieved images as the response to the text prompt. To standardize the evaluation of unified text-to-image generation and retrieval, we construct TIGeR-Bench, a benchmark spanning both creative and knowledge-intensive domains. Extensive experiments on TIGeR-Bench and two retrieval benchmarks, *i.e.*, Flickr30K and MS-COCO, demonstrate the superiority of our proposed framework.",
      "keywords": [
        "Multimodal Large Language Models",
        "Text-to-Image Generation",
        "Cross-Modal Retrieval"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "sgbI8Pxwie",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "eks3dGnocX",
      "title": "How Transformers Solve Propositional Logic Problems: A Mechanistic Analysis",
      "abstract": "Large language models (LLMs) have shown amazing performance on tasks that require planning and reasoning. Motivated by this, we investigate the internal mechanisms that underpin a network's ability to perform complex logical reasoning. We first construct a synthetic propositional logic problem that serves as a concrete test-bed for network training and evaluation. Crucially, this problem demands nontrivial planning to solve. We perform our study on two fronts. First, we pursue an understanding of precisely how a three-layer transformer, trained from scratch and attains perfect test accuracy, solves this problem. We are able to identify certain \"planning\" and \"reasoning\" circuits in the network that necessitate cooperation between the attention blocks to implement the desired logic. Second, we study how a pretrained LLM, Mistral 7B, solves this problem. Using activation patching, we characterize internal components that are critical in solving our logic problem. Overall, our work systemically uncovers novel aspects of small and large transformers, and continues the study of how they plan and reason.",
      "keywords": [
        "Mechanistic Interpretability",
        "Language Models",
        "Transformers",
        "Logical Reasoning",
        "Learned Representations"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "sgbI8Pxwie",
      "title": "Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix",
      "abstract": "Large Language Models (LLMs) have shown immense potential in enhancing various aspects of our daily lives, from conversational AI to search and AI assistants. However, their growing capabilities come at the cost of extremely large model sizes, making deployment on edge devices challenging due to memory and computational constraints. This paper introduces a novel approach to LLM weight pruning that directly optimizes for approximating the attention matrix, a core component of transformer architectures. Unlike existing methods that focus on linear approximations, our approach accounts for the non-linear nature of the Softmax attention mechanism. We provide theoretical guarantees for the convergence of our Gradient Descent-based optimization method to a near-optimal pruning mask solution. Our empirical results demonstrate the effectiveness of our non-linear pruning approach in maintaining model performance while significantly reducing computational costs, which is beyond the current state-of-the-art methods, i.e., SparseGPT and Wanda, by a large margin. This work establishes a new theoretical foundation for pruning algorithm design in LLMs, potentially paving the way for more efficient LLM inference on resource-constrained devices.",
      "keywords": [
        "Weights Pruning",
        "Attention Approximation",
        "Gradient Descent Optimization"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "8YniJnJQ0P",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "8YniJnJQ0P",
      "title": "Detecting High-Stakes Interactions with Activation Probes",
      "abstract": "Monitoring is an important aspect of safely deploying Large Language Models (LLMs).\nThis paper examines activation probes for detecting ``high-stakes'' interactions---where the text indicates that the interaction might lead to significant harm---as a critical, yet underexplored, target for such monitoring.\nWe evaluate several probe architectures trained on synthetic data, and find them to exhibit robust generalization to diverse, out-of-distribution, real-world data.\nProbes' performance is comparable to that of prompted or finetuned medium-sized LLM monitors, while offering computational savings of six orders-of-magnitude.\nThese savings are enabled by reusing activations of the model that is being monitored.\nOur experiments also highlight the potential of building resource-aware hierarchical monitoring systems, where probes serve as an efficient initial filter and flag cases for more expensive downstream analysis.\nWe release our novel synthetic dataset and the codebase at\n\\url{https://github.com/arrrlex/models-under-pressure}.",
      "keywords": [
        "linear probes",
        "monitoring",
        "mechanistic interpretability",
        "large language models"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "DwXX8c7xst",
      "title": "Bits Leaked per Query: Information-Theoretic Bounds for Adversarial Attacks on LLMs",
      "abstract": "Adversarial attacks by malicious users that threaten the safety of large language models (LLMs) can be viewed as attempts to infer a target property $T$ that is unknown when an instruction is issued, and becomes knowable only after the model's reply is observed.  \nExamples of target properties $T$ include the binary flag that triggers an LLM's harmful response or rejection, and the degree to which information deleted by unlearning can be restored, both elicited via adversarial instructions.  \nThe LLM reveals an \\emph{observable signal} $Z$ that potentially leaks hints for attacking through a response containing answer tokens, thinking process tokens, or logits.\nYet the scale of information leaked remains anecdotal, leaving auditors without principled guidance and defenders blind to the transparency--risk trade-off.\nWe fill this gap with an information-theoretic framework that computes how much information can be safely disclosed, and enables auditors to gauge how close their methods come to the fundamental limit.\nTreating the mutual information $I(Z;T)$ between the observation $Z$ and the target property $T$ as the leaked bits per query, we show that achieving error $\\varepsilon$ requires at least $\\log(1/\\varepsilon)/I(Z;T)$ queries, scaling linearly with the inverse leak rate and only logarithmically with the desired accuracy.\nThus, even a modest increase in disclosure collapses the attack cost from quadratic to logarithmic in terms of the desired accuracy.\nExperiments on seven LLMs across system-prompt leakage, jailbreak, and relearning attacks corroborate the theory: exposing answer tokens alone requires about a thousand queries; adding logits cuts this to about a hundred; and revealing the full thinking process trims it to a few dozen.\nOur results provide the first principled yardstick for balancing transparency and security when deploying LLMs.",
      "keywords": [
        "Large Language Models",
        "Jailbreak attack",
        "Security"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "F0JzotXYgC",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "F0JzotXYgC",
      "title": "Spectral Perturbation Bounds for Low-Rank Approximation with Applications to Privacy",
      "abstract": "A central challenge in machine learning is to understand how noise or measurement errors affect low-rank approximations, particularly in the spectral norm. This question is especially important in differentially private low-rank approximation, where one aims to preserve the top-$p$ structure of a data-derived matrix while ensuring privacy. Prior work often analyzes Frobenius norm error or changes in reconstruction quality, but these metrics can over- or under-estimate true subspace distortion. The spectral norm, by contrast, captures worst-case directional error and provides the strongest utility guarantees. We establish new high-probability spectral-norm perturbation bounds for symmetric matrices that refine the classical Eckart--Young--Mirsky theorem and explicitly capture interactions between a matrix $A \\in \\mathbb{R}^{n \\times n}$ and an arbitrary symmetric perturbation $E$. Under mild eigengap and norm conditions, our bounds yield sharp estimates for $\\| (A + E)_p - A_p \\|$, where $A_p$ is the best rank-$p$ approximation of $A$, with improvements of up to a factor of $\\sqrt{n}$. As an application, we derive improved utility guarantees for differentially private PCA, resolving an open problem in the literature. Our analysis relies on a novel contour bootstrapping method from complex analysis and extends it to a broad class of spectral functionals, including polynomials and matrix exponentials. Empirical results on real-world datasets confirm that our bounds closely track the actual spectral error under diverse perturbation regimes.",
      "keywords": [
        "Spectral norm",
        "low-rank approximation",
        "differentially private PCA",
        "contour integration",
        "matrix analysis"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "uEFC25uUwU",
      "title": "The $\\varphi$ Curve: The Shape of Generalization through the Lens of Norm-based Capacity Control",
      "abstract": "Understanding how the test risk scales with model complexity is a central question in machine learning. Classical theory is challenged by the learning curves observed for large over-parametrized deep networks. Capacity measures based on parameter count typically fail to account for these empirical observations. To tackle this challenge, we consider norm-based capacity measures and develop our study for random features based estimators, widely used as simplified theoretical models for more complex networks. In this context, we provide a precise characterization of how the estimator’s norm concentrates and how it governs the associated test error. Our results show that the predicted learning curve admits a phase transition from under- to over-parameterization, but no double descent behavior. This confirms that more classical U-shaped behavior is recovered considering appropriate capacity measures based on models norms rather than size. From a technical point of view, we leverage deterministic equivalence as the key tool and further develop new deterministic quantities which are of independent interest.",
      "keywords": [
        "generalization",
        "norm-based capacity",
        "deterministic equivalence"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "XoN10bZtR9",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "e2WesV6Voe",
      "title": "Sequence Modeling with Spectral Mean Flows",
      "abstract": "A key question in sequence modeling with neural networks is how to represent and learn highly nonlinear and probabilistic state dynamics. Operator theory views such dynamics as linear maps on Hilbert spaces containing mean embedding vectors of distributions, offering an appealing but currently overlooked perspective. We propose a new approach to sequence modeling based on an operator-theoretic view of a hidden Markov model (HMM). Instead of materializing stochastic recurrence, we embed the full sequence distribution as a tensor in the product Hilbert space. A generative process is then defined as maximum mean discrepancy (MMD) gradient flow in the space of sequences. To overcome challenges with large tensors and slow sampling convergence, we introduce spectral mean flows, a novel tractable algorithm integrating two core concepts. First, we propose a new neural architecture by leveraging spectral decomposition of linear operators to derive a scalable tensor network decomposition of sequence mean embeddings. Second, we extend MMD gradient flows to time-dependent Hilbert spaces and connect them to flow matching via the continuity equation, enabling simulation-free learning and faster sampling. We demonstrate competitive results on a range of time-series modeling datasets.",
      "keywords": [
        "sequence modeling",
        "time series",
        "hidden Markov models",
        "mean embeddings",
        "linear operators",
        "maximum mean discrepancy",
        "gradient flows"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "z2SGaPIhLT",
      "title": "SGCD: Stain-Guided CycleDiffusion for Unsupervised Domain Adaptation of Histopathology Image Classification",
      "abstract": "The effectiveness of domain translation in addressing image-based problems of Unsupervised Domain Adaptation (UDA) depends on the quality of the translated images and the preservation of crucial discriminative features. However, achieving high-quality and stable translations typically requires paired data, which poses a challenge in scenarios with limited annotations in the target domain. To address this issue, this paper proposes a novel method termed Stain-Guided Cycle Diffusion (SGCD), employing a dual diffusion model with bidirectional generative constraints to synthesize highly realistic data for downstream task fine-tuning. The bidirectional generative constraints ensure that the translated images retain the features critical to the downstream model in properly controlling the generation process. Additionally, a stain-guided consistency loss is introduced to enhance the denoising capability of the dual diffusion model, thereby improving the quality of images translated between different domains using latents from one domain and a diffusion model trained on another. Experiments conducted on four public datasets demonstrate that SGCD can effectively enhance the performance of downstream task models on the target domain.",
      "keywords": [
        "Transfer Learning",
        "Domain Adaptation",
        "Generative Models"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "heJ7NRInjs",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "0Y4gjqdvC6",
      "title": "Fundamental Limits of Game-Theoretic LLM Alignment: Smith Consistency and Preference Matching",
      "abstract": "Nash Learning from Human Feedback (NLHF) is a game-theoretic framework for aligning large language models (LLMs) with human preferences by modeling learning as a two-player zero-sum game. However, using raw preference as the payoff in the game highly limits the potential of the game-theoretic LLM alignment framework.In this paper, we systematically study using what choices of payoff based on the pairwise human preferences can yield desirable alignment properties. We establish necessary and sufficient conditions for Condorcet consistency, diversity through mixed strategies, and Smith consistency. These results provide a theoretical foundation for the robustness of game-theoretic LLM alignment. Further, we show the impossibility of preference matching—i.e., no smooth and learnable mappings of pairwise preferences can guarantee a unique Nash equilibrium that matches a target policy, even under standard assumptions like the Bradley-Terry-Luce (BTL) model. This result highlight the fundamental limitation of game-theoretic LLM alignment.",
      "keywords": [
        "Large Language Models",
        "Preference Alignment",
        "Nash Equilibrium",
        "Nash Learning from Human Feedback"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "heJ7NRInjs",
      "title": "RSafe: Incentivizing proactive reasoning to build robust and adaptive  LLM safeguards",
      "abstract": "Large Language Models (LLMs) continue to exhibit vulnerabilities despite deliberate safety alignment efforts, posing significant risks to users and society. To safeguard against the risk of policy-violating content, system-level moderation via external guard models—designed to monitor LLM inputs and outputs and block potentially harmful content—has emerged as a prevalent mitigation strategy. Existing approaches of training guard models rely heavily on extensive human curated datasets and struggle with out-of-distribution threats, such as emerging harmful categories or jailbreak attacks. To address these limitations, we propose RSafe, an adaptive reasoning-based safeguard that conducts guided safety reasoning to provide robust protection within the scope of specified safety policies. RSafe operates\nin two stages: (1) guided reasoning, where it analyzes safety risks of input content through policy-guided step-by-step reasoning, and (2) reinforced alignment, where rule-based RL optimizes its reasoning paths to align with accurate safety prediction. This two-stage training paradigm enables RSafe to internalize safety principles to generalize safety protection capability over unseen or adversarial safety violation\nscenarios. During inference, RSafe accepts user-specified safety policies to provide enhanced safeguards tailored to specific safety requirements. Experiments demonstrate that RSafe matches state-of-the-art guard models using limited amount of public data in both prompt- and response-level harmfulness detection, while achieving superior out-of-distribution generalization on both emerging harmful category and jailbreak attacks. Furthermore, RSafe provides human-readable explanations for its safety judgments for better interpretability. RSafe offers a robust, adaptive, and interpretable solution for LLM safety moderation, advancing the development of reliable safeguards in dynamic real-world environments. Our code is available at https://anonymous.4open.science/r/RSafe-996D.",
      "keywords": [
        "large language model",
        "safety",
        "moderation"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "6EUtjXAvmj",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "TtUh0TOlGX",
      "title": "Regularization by Texts for Latent Diffusion Inverse Solvers",
      "abstract": "The recent development of diffusion models has led to significant progress in solving inverse problems by leveraging these models as powerful generative priors. However, challenges persist due to the ill-posed nature of such problems, often arising from ambiguities in measurements or intrinsic system symmetries. To address this, we introduce a novel latent diffusion inverse solver, regularization by text (TReg), inspired by the human ability to resolve visual ambiguities through perceptual biases. TReg integrates textual descriptions of preconceptions about the solution during reverse diffusion sampling, dynamically reinforcing these descriptions through null-text optimization, which we refer to as adaptive negation. Our comprehensive experimental results demonstrate that TReg effectively mitigates ambiguity in inverse problems, improving both accuracy and efficiency.",
      "keywords": [
        "Inverse problem",
        "Text regularization",
        "Diffusion model"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "kRBQwlkFSP",
      "title": "Diffusion State-Guided Projected Gradient for Inverse Problems",
      "abstract": "Recent advancements in diffusion models have been effective in learning data priors for solving inverse problems. They leverage diffusion sampling steps for inducing a data prior while using a measurement guidance gradient at each step to impose data consistency. For general inverse problems, approximations are needed when an unconditionally trained diffusion model is used since the measurement likelihood is intractable, leading to inaccurate posterior sampling. In other words, due to their approximations, these methods fail to preserve the generation process on the data manifold defined by the diffusion prior, leading to artifacts in applications such as image restoration. To enhance the performance and robustness of diffusion models in solving inverse problems, we propose Diffusion State-Guided Projected Gradient (DiffStateGrad), which projects the measurement gradient onto a subspace that is a low-rank approximation of an intermediate state of the diffusion process. DiffStateGrad, as a module, can be added to a wide range of diffusion-based inverse solvers to improve the preservation of the diffusion process on the prior manifold and filter out artifact-inducing components. We highlight that DiffStateGrad improves the robustness of diffusion models in terms of the choice of measurement guidance step size and noise while improving the worst-case performance. Finally, we demonstrate that DiffStateGrad improves upon the state-of-the-art on linear and nonlinear image restoration inverse problems. Our code is available at https://github.com/Anima-Lab/DiffStateGrad.",
      "keywords": [
        "Diffusion models",
        "Inverse problems",
        "Robustness",
        "Subspace",
        "Projection",
        "Box inpainting",
        "Phase retrieval"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "hq2CkcEY7h",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "rOuRpOA6pm",
      "title": "Physics-informed machine learning with domain decomposition and global dynamics for three-dimensional intersecting flows",
      "abstract": "Physics-informed neural networks (PINNs) have emerged as a promising framework to develop complex scientific surrogate models, yet their scalability and accuracy often degrade in non-canonical geometries, such as non-rectangular domains or three-dimensional (3D) domains with high aspect ratios. These limitations hinder the broader adoption of vanilla PINNs in real-world, practical systems. In this work, we introduce a multi-domain PINN (MDPINN) framework designed to address the scalability and generalization challenges inherent in 3D non-rectangular domains governed by nonlinear fluid dynamics. The target domain consists of intersecting 3D fluid channels with a high aspect ratio, inducing complex flow features such as deflections, mixing, and recirculations. Our approach is grounded in two key innovations: 1) domain decomposition, which partitions the channel volumes into multiple cubic-like subdomains, each modeled by an individual PINN, 2) enforcement of global dynamics (MDPINN-GD), which ensures that the total mass flow rate entering the domain equals that exiting. These innovations reduce the complexity of the problem imposed on individual PINNs and guide effective network optimization toward physically consistent solutions throughout the domain. We demonstrate that our method achieves: 1) 74.8\\% accuracy improvement over a single-network PINN, and 2) 52.9\\% accuracy improvement over MDPINN that do not enforce global mass conservation. Furthermore, the MDPINN-GD framework exhibits accurate prediction even in highly complex regions-such as the channel intersecting zone and the outlet zone characterized by intense flow mixing and large velocity gradients-achieving maximum normalized mean absolute errors below 14.9\\% for velocity predictions compared to simulation results. This work establishes a path towards scalable, physically grounded surrogate modeling approach that is extensible to multiphysics and high-dimensional scientific problems.",
      "keywords": [
        "Physics-informed neural networks",
        "domain decomposition",
        "global dynamics",
        "three-dimensional fluid dynamics"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "xEzQCFSfPG",
      "title": "Grokking and Generalization Collapse: Insights from HTSR theory",
      "abstract": "Grokking is a surprising phenomenon in neural network training where test accuracy remains low for an extended period despite near-perfect training accuracy, only to suddenly leap to strong generalization. In this work, we study grokking using a depth-3, width-200 ReLU MLP trained on a subset of MNIST. We investigate it's long-term dynamics under both weight-decay and, critically, no-decay regimes—the latter often characterized by increasing $l^2$ weight norms. Our primary tool is the theory of Heavy-Tailed Self-Regularization **HTSR**, where we track the heavy-tailed exponent $\\alpha$. We find that $\\alpha$ reliably predicts both the initial grokking transition and subsequent anti-grokking. We benchmark these insights against four prior approaches: progress measures---Activation Sparsity, Absolute Weight Entropy, and Approximate Local Circuit Complexity ---and weight norm ($l^2$) analysis.\nOur experiments show that while comparative approaches register significant changes, **in this regime of increasing $l^2$ norm, the heavy-tailed exponent $\\alpha$ demonstrates a unique correlation with the ensuing large, long-term dip in test accuracy, a signal not reliably captured by most other measures.**\n\n\n\nExtending our zero weight decay experiment significantly beyond typical timescales ($10^{5}$ to approximately $10^{7}$ optimization steps), **we reveal a late-stage catastrophic generalization collapse (``anti-grokking''), characterized by a dramatic drop in test accuracy (over 25 percentage points) while training accuracy remains perfect**; notably, the heavy-tail metric $\\alpha$ uniquely provides an early warning of this impending collapse. Our results underscore the utility of Heavy-Tailed Self-Regularization theory for tracking generalization dynamics, even in the challenging regimes without explicit weight decay regularization.",
      "keywords": [
        "Grokking",
        "Heavy-Tailed Self-Regularization",
        "Random Matrix Theory",
        "Heavy-Tail Exponent",
        "Spectral Analysis",
        "Generalization Dynamics",
        "Catastrophic Generalization Collapse",
        "Implicit Regularization"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "RD9q5vEe1Q",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "RD9q5vEe1Q",
      "title": "Error-quantified Conformal Inference for Time Series",
      "abstract": "Uncertainty quantification in time series prediction is challenging due to the temporal dependence and distribution shift on sequential data. Conformal prediction provides a pivotal and flexible instrument for assessing the uncertainty of machine learning models through prediction sets. Recently, a series of online conformal inference methods updated thresholds of prediction sets by performing online gradient descent on a sequence of quantile loss functions. A drawback of such methods is that they only use the information of revealed non-conformity scores via miscoverage indicators but ignore error quantification, namely the distance between the non-conformity score and the current threshold. To accurately leverage the dynamic of miscoverage error, we propose Error-quantified Conformal Inference (ECI) by smoothing the quantile loss function. ECI introduces a continuous and adaptive feedback scale with the miscoverage error, rather than simple binary feedback in existing methods. We establish a long-term coverage guarantee for ECI under arbitrary dependence and distribution shift. The extensive experimental results show that ECI can achieve valid miscoverage control and output tighter prediction sets than other baselines.",
      "keywords": [
        "Time Series; Uncertainty Quantification; Conformal Prediction; Distribution Shift"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "WFlLqUmb9v",
      "title": "Efficient Time Series Forecasting via Hyper-Complex Models and Frequency Aggregation",
      "abstract": "Time-series forecasting is a long-standing challenge in statistics and machine learning, with one of the key difficulties being the ability to process sequences with long-range dependencies. A recent line of work has addressed this by applying the short-time Fourier transform (STFT), which partitions sequences into multiple subsequences and applies a Fourier transform to each separately.\nWe propose the Frequency Information Aggregation (FIA-Net), a model that can utilize two backbone architectures: the Window-Mixing MLP (WM-MLP), which aggregates adjacent window information in the frequency domain, and the Hyper-Complex MLP (HC-MLP), which treats the set of STFT windows as hyper-complex (HC) valued vectors. and employ HC algebra to efficiently combine information from all STFT windows altogether. Furthermore, due to the nature of HC operations, the HC-MLP uses up to three times fewer parameters than the equivalent standard window aggre- gation method. We evaluate the FIA-Net on various time-series benchmarks and show that the proposed methodologies outperform existing state-of-the-art meth- ods in terms of both accuracy and efficiency. Our code is publicly available on https://anonymous.4open.science/r/research-1803/",
      "keywords": [
        "time-series forecasting",
        "frequency models",
        "hyper-complex machine learning",
        "short-time Fourier transform"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "lydPkW4lfz",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "lydPkW4lfz",
      "title": "Local Steps Speed Up Local GD for Heterogeneous Distributed Logistic Regression",
      "abstract": "We analyze two variants of Local Gradient Descent applied to distributed logistic regression with heterogeneous, separable data and show convergence at the rate $O(1/KR)$ for $K$ local steps and sufficiently large $R$ communication rounds. In contrast, all existing convergence guarantees for Local GD applied to any problem are at least $\\Omega(1/R)$, meaning they fail to show the benefit of local updates. The key to our improved guarantee is showing progress on the logistic regression objective when using a large stepsize $\\eta \\gg 1/K$, whereas prior analysis depends on $\\eta \\leq 1/K$.",
      "keywords": [
        "optimization",
        "convex optimization",
        "distributed optimization",
        "federated learning",
        "logistic regression"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "mscnV6JZkT",
      "title": "Distributed Gradient Descent with Many Local Steps in Overparameterized Models",
      "abstract": "In distributed training of machine learning models, gradient descent with local iterative steps is a very popular method, variants of which are commonly known as Local-SGD or the Federated Averaging (FedAvg). In this method, gradient steps based on local datasets are taken independently in distributed compute nodes to update  the local models, which are then aggregated intermittently. Although the existing convergence analysis suggests that with heterogeneous data, FedAvg encounters quick performance degradation as the number of local steps increases, it is shown to work quite well in practice, especially in the distributed training of large language models. In this work we try to explain this good performance from a viewpoint of implicit bias in Local Gradient Descent (Local-GD) with a large number of local steps. In overparameterized regime, the gradient descent at each compute node would lead the model to a specific direction locally. We characterize the dynamics of the aggregated global model and compare it to the centralized model trained with all of the data in one place. In particular, we analyze the implicit bias of gradient descent on linear models, for both regression and classification tasks. Our analysis shows that the aggregated global model  converges exactly to the centralized model for regression tasks, and converges (in direction) to the same feasible set as centralized model  for classification tasks. We further propose a Modified Local-GD with a refined aggregation and theoretically show it converges to the centralized model in direction for linear classification. We empirically verified our theoretical findings in linear models and also conducted experiments on distributed fine-tuning of pretrained neural networks to further apply our theory.",
      "keywords": [
        "Distributed Learning",
        "Overparameterization",
        "Optimization",
        "Federated Learning"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "XPe55Uffd7",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "qRPIWtf3SE",
      "title": "Learning single index models via harmonic decomposition",
      "abstract": "We study the problem of learning single-index models, where the label $y \\in \\mathbb{R}$ depends on the input $\\boldsymbol{x} \\in \\mathbb{R}^d$ only through an unknown one-dimensional projection $\\langle \\boldsymbol{w_*}, \\boldsymbol{x} \\rangle$. Prior work has shown that under Gaussian inputs, the statistical and computational complexity of recovering $\\boldsymbol{w}_*$ is governed by the Hermite expansion of the link function. In this paper, we propose a new perspective: we argue that *spherical harmonics*---rather than *Hermite polynomials*---provide the natural basis for this problem, as they capture its intrinsic \\textit{rotational symmetry}. Building on this insight, we characterize the complexity of learning single-index models under arbitrary spherically symmetric input distributions. We introduce two families of estimators---based on tensor-unfolding and online SGD---that respectively achieve either  optimal sample complexity or optimal runtime, and argue that estimators achieving both may not exist in general. When specialized to Gaussian inputs, our theory not only recovers and clarifies existing results but also reveals new phenomena that had previously been overlooked.",
      "keywords": [
        "single-index models",
        "statistical and computational complexity"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "sXpyn3lAb5",
      "title": "Accelerating data-driven algorithm selection for combinatorial partitioning problems",
      "abstract": "Data-driven algorithm selection is a powerful approach for choosing effective heuristics for computational problems. It operates by evaluating a set of candidate algorithms on a collection of representative training instances and selecting the one with the best empirical performance. However, running each algorithm on every training instance is computationally expensive, making scalability a central challenge. In practice, a common workaround is to evaluate algorithms on smaller proxy instances derived from the original inputs. However, this practice has remained largely ad hoc and lacked theoretical grounding. We provide the first theoretical foundations for this practice by formalizing the notion of size generalization: predicting an algorithm's performance on a large instance by evaluating it on a smaller, representative instance, subsampled from the original instance. We provide size generalization guarantees for three widely used clustering algorithms (single-linkage, k-means++, and Gonzalez's k-centers heuristic) and two canonical max-cut algorithms (Goemans-Williamson and Greedy). We characterize the subsample size sufficient to ensure that performance on the subsample reflects performance on the full instance, and our experiments support these findings.",
      "keywords": [
        "data-driven algorithm selection",
        "sub-sampling",
        "clustering",
        "max-cut",
        "Goemans-williamson"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "xZnjIkIzST",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "aLhA7AYLLR",
      "title": "ControlFusion: A Controllable Image Fusion Network with Language-Vision Degradation Prompts",
      "abstract": "Current image fusion methods struggle with real-world composite degradations and lack the flexibility to accommodate user-specific needs. To address this, we propose ControlFusion, a controllable fusion network guided by language-vision prompts that adaptively mitigates composite degradations. On the one hand, we construct a degraded imaging model based on physical mechanisms, such as the Retinex theory and atmospheric scattering principle, to simulate composite degradations and provide a data foundation for addressing realistic degradations. On the other hand, we devise a prompt-modulated restoration and fusion network that dynamically enhances features according to degradation prompts, enabling adaptability to varying degradation levels. To support user-specific preferences in visual quality, a text encoder is incorporated to embed user-defined degradation types and levels as degradation prompts. Moreover, a spatial-frequency collaborative visual adapter is designed to autonomously perceive degradations from source images, thereby reducing complete reliance on user instructions. Extensive experiments demonstrate that ControlFusion outperforms SOTA fusion methods in fusion quality and degradation handling, particularly under real-world and compound degradations.",
      "keywords": [
        "Image fusion",
        "multimodal images",
        "degradation"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "qbVbZWxUib",
      "title": "Efficient Part-level 3D Object Generation via Dual Volume Packing",
      "abstract": "Recent progress in 3D object generation has greatly improved both the quality and efficiency.\nHowever, most existing methods generate a single mesh with all parts fused together, which limits the ability to edit or manipulate individual parts.\nA key challenge is that different objects may have a varying number of parts.\nTo address this, we propose a new end-to-end framework for part-level 3D object generation.\nGiven a single input image, our method generates high-quality 3D objects with an arbitrary number of complete and semantically meaningful parts.\nWe introduce a dual volume packing strategy that organizes all parts into two complementary volumes, allowing for the creation of complete and interleaved parts that assemble into the final object.\nExperiments show that our model achieves better quality, diversity, and generalization than previous image-based part-level generation methods.\nOur project page is at \\url{https://research.nvidia.com/labs/dir/partpacker/}.",
      "keywords": [
        "3D Generation",
        "Part Generation",
        "Image-to-3D"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "I4fBSpDOha",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "I4fBSpDOha",
      "title": "Focus-Then-Reuse: Fast Adaptation in Visual Perturbation Environments",
      "abstract": "Visual reinforcement learning has shown promise in various real-world applications. However, deploying policies in complex real-world environments with visual perturbations remains a significant challenge. We notice that humans tend to filter information at the object level prior to decision-making, facilitating efficient skill transfer across different contexts. Inspired by this, we introduce Focus-Then-Reuse (FTR), a method utilizing a novel object selection mechanism to focus on task-relevant objects, and directly reuse the simulation-trained policy on them. The training of the object selection mechanism integrates prior knowledge from a vision-language model and feedback from the environment. Experimental results on challenging tasks based on DeepMind Control Suite and Franka Emika Robotics demonstrate that FTR enables rapid adaptation in visual perturbation environments and achieves state-of-the-art performance. The source code is available at https://github.com/LAMDA-RL/FTR.",
      "keywords": [
        "reinforcement learning; visual domain adaptation"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "RxkCwOKVKa",
      "title": "Breaking the Performance Ceiling in Reinforcement Learning requires Inference Strategies",
      "abstract": "Reinforcement learning (RL) systems have countless applications, from energy-grid management to protein design. However, such real-world scenarios are often extremely difficult, combinatorial in nature, and require complex coordination between multiple agents. This level of complexity can cause even state-of-the-art RL systems, trained until convergence, to hit a performance ceiling which they are unable to break out of with zero-shot inference. Meanwhile, many digital or simulation-based applications allow for an inference phase that utilises a specific time and compute budget to explore multiple attempts before outputting a final solution. In this work, we show that such an inference phase employed at execution time, and the choice of a corresponding inference strategy, are key to breaking the performance ceiling observed in complex multi-agent RL problems. Our main result is striking: we can obtain up to a 126% and, on average, a 45% improvement over the previous state-of-the-art across 17 tasks, using only a couple seconds of extra wall-clock time during execution. We also demonstrate promising compute scaling properties, supported by over 60k experiments, making it the largest study on inference strategies for complex RL to date. We make all of our experimental data and code available.",
      "keywords": [
        "reinforcement learning",
        "inference strategies",
        "complex decision-making"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "Ceb788Uigr",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "Ceb788Uigr",
      "title": "On the Convergence of Stochastic Smoothed Multi-Level Compositional Gradient Descent Ascent",
      "abstract": "Multi-level compositional optimization is a fundamental framework in machine learning with broad applications. While recent advances have addressed compositional minimization problems, the stochastic multi-level compositional minimax problem introduces significant new challenges—most notably, the biased nature of stochastic gradients for both the primal and dual variables. In this work, we address this gap by proposing a novel stochastic multi-level compositional gradient descent-ascent algorithm, incorporating a smoothing technique under the nonconvex-PL condition. We establish a convergence rate to an $(\\epsilon, \\epsilon/\\sqrt{\\kappa})$-stationary point with improved dependence on the condition number at $O(\\kappa^{3/2})$, where $\\epsilon$ denotes the solution accuracy and $\\kappa$ represents the condition number. Moreover,  we design a novel stage-wise algorithm with variance reduction to address the  biased gradient issue under the two-sided PL condition. This algorithm successfully enables a translation from and $(\\epsilon, \\epsilon/\\sqrt{\\kappa})$-stationary point to an $\\epsilon$-stationary point. Finally, extensive experiments validate the effectiveness of our algorithms.",
      "keywords": [
        "Compositional Optimization",
        "Minimax Optimization"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "WhEPg4mUs6",
      "title": "Analog In-memory Training on General Non-ideal Resistive Elements: The Impact of Response Functions",
      "abstract": "As the economic and environmental costs of training and deploying large vision or language models increase dramatically, analog in-memory computing (AIMC) emerges as a promising energy-efficient solution. However, the training perspective, especially its training dynamic, is underexplored. In AIMC hardware, the trainable weights are represented by the conductance of resistive elements and updated using consecutive electrical pulses.  While the conductance changes by a constant in response to each pulse, in reality, the change is scaled by asymmetric and non-linear response functions, leading to a non-ideal training dynamic. This paper provides a theoretical foundation for gradient-based training on AIMC hardware with non-ideal response functions.  We demonstrate that asymmetric response functions negatively impact Analog SGD by imposing an implicit penalty on the objective. To overcome the issue, we propose residual learning algorithm, which provably converges exactly to a critical point by solving a bilevel optimization problem. We show that the proposed method can be extended to deal with other hardware imperfections like limited response granularity. As far as we know, it is the first paper to investigate the impact of a class of generic non-ideal response functions. The conclusion is supported by simulations validating our theoretical insights.",
      "keywords": [
        "Analog AI; in-memory computing; stochastic gradient descent; stochastic optimization"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "hTbimOuFPM",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "drZzzGUlbG",
      "title": "Quasi-Self-Concordant Optimization with $\\ell_{\\infty}$ Lewis Weights",
      "abstract": "In this paper, we study the problem $\\min_{x\\in R^{d},Nx=v}\\sum\\_{i=1}^{n}f((Ax-b)_{i})$\nfor a quasi-self-concordant function $f: R\\to R$, where $A,N$ are\n$n\\times d$ and $m\\times d$ matrices, $b,v$ are vectors of length\n$n$ and $m$ with $n\\ge d.$ We show an algorithm based on a trust-region\nmethod with an oracle  that can be implemented using $\\widetilde{O}(d^{1/3})$\nlinear system solves, improving the $\\widetilde{O}(n^{1/3})$ oracle\nby [Adil-Bullins-Sachdeva, NeurIPS 2021]. Our implementation of\nthe oracle relies on solving the overdetermined $\\ell\\_{\\infty}$-regression\nproblem $\\min\\_{x\\in R^{d},Nx=v}\\|Ax-b\\|\\_{\\infty}$. We provide an\nalgorithm that finds a $(1+\\epsilon)$-approximate solution to this\nproblem using $O((d^{1/3}/\\epsilon+1/\\epsilon^{2})\\log(n/\\epsilon))$\nlinear system solves. This algorithm leverages $\\ell\\_{\\infty}$ Lewis\nweight overestimates and achieves this iteration complexity via a\nsimple lightweight IRLS approach, inspired by the work of [Ene-Vladu,\nICML 2019]. Experimentally, we demonstrate that our algorithm significantly\nimproves the runtime of the standard CVX solver.",
      "keywords": [
        "Convex optimization",
        "Quasi-Self-Concordant Optimization"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "hTbimOuFPM",
      "title": "An efficient implementation for solving the all pairs minimax path problem in an undirected dense graph",
      "abstract": "We provide an efficient $ O(n^2) $ implementation for solving the all pairs minimax path problem or  widest path problem in an undirected dense graph. The distance matrix is also called the all points path distance (APPD). We conducted experiments to test the implementation and algorithm, compared it with several other algorithms for solving the APPD matrix.  Result shows Algorithm 4 works good for solving the widest path or minimax path APPD matrix.  It can drastically improve the efficiency for computing the APPD matrix.  There are several theoretical outcomes which claim the APPD matrix can be solved accurately in $ O(n^2) $ . However, they are impractical because there is no code implementation of these algorithms. Algorithm 4 is the first algorithm that has an actual code implementation for solving the APPD matrix of minimax path or widest path problem in $ O(n^2) $, in an undirected dense graph.",
      "keywords": [
        "Minimax path problem",
        "Longest-leg path distance",
        "Min-Max-Jump distance",
        "Widest path problem",
        "Maximum capacity path problem",
        "Bottleneck edge query problem",
        "All points path distance",
        "Floyd-Warshall algorithm",
        "Minimum spanning tree"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "w7BGq6ozOL",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "Acvo2RGSCy",
      "title": "DeLLMa: Decision Making Under Uncertainty with Large Language Models",
      "abstract": "The potential of large language models (LLMs) as decision support tools is increasingly being explored in fields such as business, engineering, and medicine, which often face challenging tasks of *decision-making under uncertainty*. In this paper, we show that directly prompting LLMs on these types of decision-making problems can yield poor results, especially as the problem complexity increases. To aid in these tasks, we propose DeLLMa (Decision-making Large Language Model assistant), a framework designed to enhance decision-making accuracy in uncertain environments. DeLLMa involves a multi-step reasoning procedure that integrates recent best practices in scaling *inference-time reasoning*, drawing upon principles from decision theory and utility theory, to provide an accurate and human-auditable decision-making process. We validate our procedure on multiple realistic decision-making environments, demonstrating that DeLLMa can consistently enhance the decision-making performance of leading language models, and achieve up to a 40% increase in accuracy over competing methods. Additionally, we show how performance improves when scaling compute at test time, and carry out human evaluations to benchmark components of DeLLMa.",
      "keywords": [
        "large language models",
        "decision theory",
        "decision making under uncertainty"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "Yqk7EyT52H",
      "title": "MarS: a Financial Market Simulation Engine Powered by Generative Foundation Model",
      "abstract": "Generative models aim to simulate realistic effects of various actions across different contexts, from text generation to visual effects. Despite significant efforts to build real-world simulators, the application of generative models to virtual worlds, like financial markets, remains under-explored. In financial markets, generative models can simulate complex market effects of participants with various behaviors, enabling interaction under different market conditions, and training strategies without financial risk. This simulation relies on the finest structured data in financial market like orders thus building the finest realistic simulation. We propose Large Market Model (LMM), an order-level generative foundation model, for financial market simulation, akin to language modeling in the digital world. Our financial Market Simulation engine (MarS), powered by LMM, addresses the domain-specific need for realistic, interactive and controllable order generation. Key observations include LMM's strong scalability across data size and model complexity, and MarS's robust and practicable realism in controlled generation with market impact. We showcase MarS as a forecast tool, detection system, analysis platform, and agent training environment, thus demonstrating MarS's ``paradigm shift'' potential for a variety of financial applications. We release the code of MarS at https://github.com/microsoft/MarS/.",
      "keywords": [
        "Financial Market Simulation",
        "Generative Foundation Model",
        "Large Market Model (LMM)",
        "Controllable Simulation",
        "Interactive Simulation",
        "Market Impact",
        "Reinforcement Learning",
        "Forecasting",
        "Market Manipulation Detection",
        "Order-Level Data"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "cJd1BgZ9CS",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "HD6bWcj87Y",
      "title": "Data Shapley in One Training Run",
      "abstract": "Data Shapley offers a principled framework for attributing the contribution of data within machine learning contexts. However, the traditional notion of Data Shapley requires re-training models on various data subsets, which becomes computationally infeasible for large-scale models. Additionally, this retraining-based definition cannot evaluate the contribution of data for a specific model training run, which may often be of interest in practice. This paper introduces a novel concept, In-Run Data Shapley, which eliminates the need for model retraining and is specifically designed for assessing data contribution for a particular model of interest. In-Run Data Shapley calculates the Shapley value for each gradient update iteration and accumulates these values throughout the training process. We present several techniques that allow the efficient scaling of In-Run Data Shapley to the size of foundation models. In its most optimized implementation, our method adds negligible runtime overhead compared to standard model training. This dramatic efficiency improvement makes it possible to perform data attribution for the foundation model pretraining stage. We present several case studies that offer fresh insights into pretraining data's contribution and discuss their implications for copyright in generative AI and pretraining data curation.",
      "keywords": [
        "Shapley value",
        "data valuation."
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "cJd1BgZ9CS",
      "title": "Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference",
      "abstract": "This paper introduces *distributed speculative inference (DSI)*, a novel inference algorithm that is provably faster than speculative inference (SI) [leviathan2023, chen2023, miao2024, sun2025, timor2025] and standard autoregressive inference (non-SI). Like other SI algorithms, DSI operates on frozen language models (LMs), requiring no training or architectural modifications, and it preserves the target distribution. Prior studies on SI have demonstrated empirical speedups over non-SI—but rely on sufficiently fast and accurate drafters, which are often unavailable in practice. We identify a gap where SI can be slower than non-SI if drafters are too slow or inaccurate. We close this gap by proving that DSI is faster than both SI and non-SI—given any drafters. DSI is therefore not only faster than SI, but also unlocks the acceleration of LMs for which SI fails. DSI leverages *speculation parallelism (SP)*, a novel type of task parallelism, to orchestrate target and drafter instances that overlap in time, establishing a new foundational tradeoff between computational resources and latency. Our simulations show that DSI is 1.29-1.92x faster than SI in single-node setups for various off-the-shelf LMs and tasks. We open-source all our code.",
      "keywords": [
        "inference algorithms for generative models",
        "LLM inference",
        "speculative decoding"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "reZKq6hjOZ",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "DJSZGGZYVi",
      "title": "Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think",
      "abstract": "Recent studies have shown that the denoising process in (generative) diffusion models can induce meaningful (discriminative) representations inside the model, though the quality of these representations still lags behind those learned through recent self-supervised learning methods. We argue that one main bottleneck in training large-scale diffusion models for generation lies in effectively learning these representations. Moreover, training can be made easier by incorporating high-quality external visual representations, rather than relying solely on the diffusion models to learn them independently. We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders. The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs. For instance, our method can speed up SiT training by over 17.5$\\times$, matching the performance (without classifier-free guidance) of a SiT-XL model trained for 7M steps in less than 400K steps. In terms of final generation quality, our approach achieves state-of-the-art results of FID=1.42 using classifier-free guidance with the guidance interval.",
      "keywords": [
        "Diffusion models",
        "Representation learning"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "reZKq6hjOZ",
      "title": "Broadening Target Distributions for Accelerated Diffusion Models via a Novel Analysis Approach",
      "abstract": "Accelerated diffusion models hold the potential to significantly enhance the efficiency of standard diffusion processes. Theoretically, these models have been shown to achieve faster convergence rates than the standard $\\mathcal O(1/\\epsilon^2)$ rate of vanilla diffusion models, where $\\epsilon$ denotes the target accuracy. However, current theoretical studies have established the acceleration advantage only for restrictive target distribution classes, such as those with smoothness conditions imposed along the entire sampling path or with bounded support. In this work, we significantly broaden the target distribution classes with a new accelerated stochastic DDPM sampler. In particular, we show that it achieves accelerated performance for three broad distribution classes not considered before. Our first class relies on the smoothness condition posed only to the target density $q_0$, which is far more relaxed than the existing smoothness conditions posed to all $q_t$ along the entire sampling path. Our second class requires only a finite second moment condition, allowing for a much wider class of target distributions than the existing finite-support condition. Our third class is Gaussian mixture, for which our result establishes the first acceleration guarantee. Moreover, among accelerated DDPM type samplers, our results specialized for bounded-support distributions show an improved dependency on the data dimension $d$. Our analysis introduces a novel technique for establishing performance guarantees via constructing a tilting factor representation of the convergence error and utilizing Tweedie's formula to handle Taylor expansion terms. This new analytical framework may be of independent interest.",
      "keywords": [
        "generative models",
        "denoising diffusion probabilistic model (DDPM)",
        "convergence analysis",
        "accelerated methods"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "MFZjrTFE7h",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "9GJ6JKoCVp",
      "title": "NaN Pooling and Convolution Accelerate U-Nets",
      "abstract": "Recent advancements in deep learning for neuroimaging have resulted in the development of increasingly complex models designed for a wide range of tasks. Despite significant improvements in hardware, enhancing inference and training times for these models remains crucial. Through a numerical analysis of convolutional neural networks (CNNs) inference, we found that a substantial amount of operations in these models are applied to pure numerical noise, with little to no impact on the final output. As a result, some CNNs consume up to two-thirds of their floating-point operations unnecessarily.\n\nTo address this inefficiency, we introduce NaN Pooling & Convolution---novel variations of PyTorch's max pooling and 2D convolution operations. These techniques identify numerically unstable voxels and replace them with NaNs, allowing  models to bypass operations on irrelevant data. We evaluate NaN Pooling and Convolution on two models: the FastSurfer CNN, a widely used neuroimaging tool, and a CNN designed to classify the MNIST dataset. For FastSurfer, our approach significantly improves computational efficiency, skipping between 33.24% and 69.30\\% of convolutions in certain layers while preserving the model's original accuracy. On MNIST, our approach skips up to 28.38% of convolutions, again without major impact on the accuracy.",
      "keywords": [
        "Pooling",
        "Convolutions",
        "Deep learning",
        "Optimization",
        "Neuroimaging",
        "Convolutional Neural Networks",
        "Numerical Analysis"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "ziw5bzg2NO",
      "title": "Do You Keep an Eye on What I Ask? Mitigating Multimodal Hallucination via Attention-Guided Ensemble Decoding",
      "abstract": "Recent advancements in Large Vision-Language Models (LVLMs) have significantly expanded their utility in tasks like image captioning and visual question answering. However, they still struggle with object hallucination, where models generate descriptions that inaccurately reflect the visual content by including nonexistent objects or misrepresenting existing ones. While previous methods, such as data augmentation and training-free approaches, strive to tackle this issue, they still encounter scalability challenges and often depend on additional external modules. In this work, we propose Ensemble Decoding (ED), a novel strategy that splits the input image into sub-images and combines logit distributions by assigning weights through the attention map. Furthermore, we introduce ED adaptive plausibility constraint to calibrate logit distribution and FastED, a variant designed for speed-critical applications. Extensive experiments across hallucination benchmarks demonstrate that our proposed method achieves state-of-the-art performance, validating the effectiveness of our approach.",
      "keywords": [
        "Hallucination",
        "Multimodal Hallucination",
        "Large Vision-Language Model"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "viQ1bLqKY0",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "XLMAMmowdY",
      "title": "ToolGen: Unified Tool Retrieval and Calling via Generation",
      "abstract": "As large language models (LLMs) advance, their inability to autonomously execute tasks by directly interacting with external tools remains a critical limitation. Traditional methods rely on inputting tool descriptions as context, which is constrained by context length and requires separate, often inefficient, retrieval mechanisms. We introduce ToolGen, a paradigm shift that integrates tool knowledge directly into the LLM’s parameters by representing each tool as a unique token. This enables the LLM to generate tool calls and arguments as part of its next token prediction capabilities, seamlessly blending tool invocation with language generation.  Our framework allows the LLM to access and utilize a vast amount of tools with no additional retrieval step, significantly enhancing both performance and scalability. Experimental results with over 47,000 tools show that ToolGen not only achieves superior results in both tool retrieval and autonomous task completion but also sets the stage for a new era of AI agents that can adapt to tools across diverse domains.  By fundamentally transforming tool retrieval into a generative process, ToolGen paves the way for more versatile, efficient, and autonomous AI systems. ToolGen enables end-to-end tool learning and opens opportunities for integration with other advanced techniques such as chain-of-thought and reinforcement learning, thereby expanding the practical capabilities of LLMs",
      "keywords": [
        "Agent",
        "Tool Learning",
        "Virtual Token"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "viQ1bLqKY0",
      "title": "EXecution-Eval: Can language models execute real-world code?",
      "abstract": "As Large Language Models (LLMs) advance, traditional benchmarks face challenges of dataset saturation and disconnection from real-world performance, limiting our understanding of true model capabilities. We introduce EXecution-Eval (EXE), a benchmark designed to assess LLMs' ability to execute code and predict program states. EXE attempts to address key limitations in existing evaluations: difficulty scaling, task diversity, training data contamination, and cost-effective scalability.\nComprising over 30,000 tasks derived from 1,000 popular Python repositories on GitHub, EXE spans a range of context lengths and algorithmic complexities. Tasks require models to execute code, necessitating various operations including mathematical reasoning, logical inference, bit manipulation, string operations, loop execution, and maintaining multiple internal variable states during computation. Our methodology involves: (a) selecting and preprocessing GitHub repositories, (b) generating diverse inputs for functions, (c) executing code to obtain ground truth outputs, and (d) formulating tasks that require models to reason about code execution. This approach allows for continuous new task generation for as few as 1,200 tokens, significantly reducing the risk of models \"training on the test set.\"\nWe evaluate several state-of-the-art LLMs on EXE, revealing insights into their code comprehension and execution capabilities. Our results show that even the best-performing models struggle with complex, multi-step execution tasks, highlighting specific computational concepts that pose the greatest challenges for today's LLMs. Furthermore, we review EXE's potential for finding and predicting errors to aid in assessing a model's cybersecurity capabilities. We propose EXE as a sustainable and challenging testbed for evaluating frontier models, offering potential insights into their internal mechanistic advancement",
      "keywords": [
        "large language model",
        "evaluation",
        "benchmark",
        "code execution"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "xoXn62FzD0",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "5X5Z7Ffrjb",
      "title": "Steering Large Language Models between Code Execution and Textual Reasoning",
      "abstract": "While a lot of recent research focuses on enhancing the textual reasoning capabilities of Large Language Models (LLMs) by optimizing the multi-agent framework or reasoning chains, several benchmark tasks can be solved with 100\\% success through direct coding, which is more scalable and avoids the computational overhead associated with textual iterating and searching. Textual reasoning has inherent limitations in solving tasks with challenges in math, logics, optimization, and searching, which is unlikely to be solved by simply scaling up the model and data size. The recently released OpenAI GPT Code Interpreter and multi-agent frameworks such as AutoGen have demonstrated remarkable proficiency of integrating code generation and execution to solve complex tasks using LLMs. However, based on our experiments on 7 existing popular methods for steering code/text generation in both single- and multi-turn settings with 14 tasks and 6 types of LLMs (including the new O1-preview), currently there is no optimal method to correctly steer LLMs to write code when needed. We discover some interesting patterns on when models use code vs. textual reasoning with the evolution to task complexity and model sizes, which even result in an astonishingly inverse scaling behavior. We also discover that results from LLM written code are not always better than using textual reasoning, even if the task could be solved through code. To mitigate the above issues, we propose three methods to better steer LLM code/text generation and achieve a notable improvement. The costs of token lengths and runtime are thoroughly discussed for all the methods. We believe the problem of steering LLM code/text generation is critical for future research and has much space for further improvement. Project Page, Datasets, and Codes are available at https://yongchao98.github.io/CodeSteer/.",
      "keywords": [
        "Large Language Models",
        "Code Interpreter",
        "Code/text generation",
        "Agent",
        "Textual reasoning"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "xoXn62FzD0",
      "title": "Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo",
      "abstract": "A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints can be naturally framed as _probabilistic conditioning_, but exact generation from the resulting distribution—which can differ substantially from the LM’s base distribution—is generally intractable. In this work,\nwe develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC). Our SMC framework allows us to flexibly incorporate domain- and problem-specific constraints at inference time, and efficiently reallocate computational resources in light of new information during the course of generation. By comparing to a number of alternatives and ablations on four challenging domains---Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis—we demonstrate that, with little overhead, our approach allows small open-source language models to outperform models over 8$\\times$ larger, as well as closed-source, fine-tuned ones. \nIn support of the probabilistic perspective, we show that these performance improvements are driven by better approximation to the posterior distribution. \n[Our system](https://github.com/probcomp/genlm-control) builds on the framework of Lew et al. (2023) and integrates with its _language model probabilistic programming language_, giving users a simple, programmable way to apply SMC to a broad variety of controlled generation problems.",
      "keywords": [
        "Sequential Monte Carlo",
        "Language Models",
        "Semantic parsing",
        "Bayesian inference",
        "Probabilistic programming",
        "SMC"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "sAFottNlra",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "CCBPSxWOhi",
      "title": "Linguini: A benchmark for language-agnostic linguistic reasoning",
      "abstract": "We propose a new benchmark to measure a language model's linguistic reasoning skills without relying on pre-existing language-specific knowledge. The test covers 894 questions grouped in 160 problems across 75 (mostly) extremely low-resource languages, extracted from the International Linguistic Olympiad corpus. To attain high accuracy on this benchmark, models don't need previous knowledge of the tested language, as all the information needed to solve the linguistic puzzle is presented in the context. We find that, while all analyzed models rank below 25% accuracy, there is a significant gap between open and closed models, with the best-performing proprietary model scoring 24.05% and the best-performing open model 8.84%.",
      "keywords": [
        "linguistic reasoning"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "Q3qAsZAEZw",
      "title": "Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference",
      "abstract": "Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changing system configuration, such as evaluation batch size, GPU count, and GPU version, can introduce significant differences in the generated responses. \nThis issue is especially pronounced in reasoning models, where minor rounding differences in early tokens can cascade into divergent chains of thought, ultimately affecting accuracy. For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9\\% variation in accuracy and 9,000 tokens difference in response length due to differences in GPU count, type, and evaluation batch size.\nWe trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. \nThis work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge.\nOur analysis reveals that floating-point precision—while critical for reproducibility—is often neglected in evaluation practices.\nInspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32, balancing memory efficiency with numerical stability. Code is available at https://github.com/nanomaoli/llm_reproducibility.",
      "keywords": [
        "Large Language Models (LLMs)",
        "Reproducibility",
        "Numerical precision",
        "Deterministic inference"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "Q3qAsZAEZw",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "FZURCro04D",
      "title": "Reasoning on a Spectrum: Aligning LLMs to System 1 and System 2 Thinking",
      "abstract": "Large language models (LLMs) demonstrate remarkable reasoning capabilities, yet their reliance on step-by-step reasoning can make them brittle when tasks do not align with such structured approaches. In contrast, human cognition flexibly alternates between fast, intuitive reasoning (System 1) and slow, analytical reasoning (System 2), depending on context. To bridge this gap, we curate a dataset of 2K examples, each with valid responses from both reasoning styles, and explicitly align LLMs with System 1 and System 2 reasoning. Evaluations across diverse reasoning benchmarks reveal an accuracy-efficiency trade-off: System 2-aligned models excel in arithmetic and symbolic reasoning, while System 1-aligned models perform better in commonsense tasks. A mechanistic analysis of model responses shows that System 1 models employ more definitive answers, whereas System 2 models demonstrate greater uncertainty. Interpolating between these extremes produces a monotonic transition in reasoning accuracy, preserving coherence. This work challenges the assumption that step-by-step reasoning is always optimal and highlights the need for adapting reasoning strategies based on task demands.",
      "keywords": [
        "Alignment",
        "System 1 and System 2 thinking",
        "Cognitive heuristics",
        "LLM",
        "NLP"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "iBFfb6bGOz",
      "title": "Atomic Thinking of LLMs: Decoupling and Exploring Mathematical Reasoning Abilities",
      "abstract": "Large Language Models (LLMs) have demonstrated outstanding performance in mathematical reasoning capabilities. However, we argue that current large-scale reasoning models primarily rely on scaling up training datasets with diverse mathematical problems and long thinking chains, which raises questions about whether LLMs genuinely acquire mathematical concepts and reasoning principles or merely remember the training data. In contrast, humans tend to break down complex problems into multiple fundamental atomic capabilities. Inspired by this, we propose a new paradigm for evaluating mathematical atomic capabilities. Our work categorizes atomic abilities into two dimensions: (1) field-specific abilities across four major mathematical fields, algebra, geometry, analysis, and topology, and (2) logical abilities at different levels, including conceptual understanding, forward multi-step reasoning with formal math language, and counterexample-driven backward reasoning. We propose corresponding training and evaluation datasets for each atomic capability unit, and conduct extensive experiments about how different atomic capabilities influence others, to explore the strategies to elicit the required specific atomic capability. Evaluation and experimental results on advanced models show many interesting discoveries and inspirations about the different performances of models on various atomic capabilities and the interactions between atomic capabilities. Our findings highlight the importance of decoupling mathematical intelligence into atomic components, providing new insights into model cognition and guiding the development of training strategies toward a more efficient, transferable, and cognitively grounded paradigm of \"atomic thinking\".",
      "keywords": [
        "Large Language Models",
        "Mathematical Reasoning",
        "Atomic Thinking"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "aPHHhnZktB",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "aPHHhnZktB",
      "title": "FairDen: Fair Density-Based Clustering",
      "abstract": "Fairness in data mining tasks like clustering has recently become an increasingly important aspect. \nHowever, few clustering algorithms exist that focus on fair groupings of data with sensitive attributes. \nIncluding fairness in the clustering objective is especially hard for density-based clustering, as it does not directly optimize a closed form objective like centroid-based or spectral methods.  \n\nThis paper introduces FairDen, the first fair, density-based clustering algorithm.\nWe capture the dataset's density-connectivity structure in a similarity matrix that we manipulate to encourage a balanced clustering. \nIn contrast to state-of-the-art, FairDen inherently handles categorical attributes, noise, and data with several sensitive attributes or groups.\nWe show that FairDen finds meaningful and fair clusters in extensive experiments.",
      "keywords": [
        "Fair Clustering",
        "Density-based Clustering",
        "Density-Connectivity",
        "Fairness",
        "Unsupervised Learning"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "f4gF6AIHRy",
      "title": "Combatting Dimensional Collapse in LLM Pre-Training Data via Submodular File Selection",
      "abstract": "Selecting high-quality pre-training data for large language models (LLMs) is crucial for enhancing their overall performance under limited computation budget, improving both training and sample efficiency. Recent advancements in file selection primarily rely on using an existing or trained proxy model to assess the similarity of samples to a target domain, such as high quality sources BookCorpus and Wikipedia. However, upon revisiting these methods, the domain-similarity selection criteria demonstrates a diversity dilemma, i.e. dimensional collapse in the feature space, improving performance on the domain-related tasks but causing severe degradation on generic performance.To prevent collapse and enhance diversity, we propose a DiverSified File selection algorithm (DiSF), which selects the most decorrelated text files in the feature space. We approach this with a classical greedy algorithm to achieve more uniform eigenvalues in the feature covariance matrix of the selected texts, analyzing its approximation to the optimal solution under a formulation of $\\gamma$-weakly submodular optimization problem. Empirically, we establish a benchmark and conduct extensive experiments on the TinyLlama architecture with models from 120M to 1.1B parameters. Evaluating across nine tasks from the Harness framework, DiSF demonstrates a significant improvement on overall performance. Specifically, DiSF saves 98.5\\% of 590M training files in SlimPajama, outperforming the full-data pre-training within a 50B training budget, and achieving about 1.5x training efficiency and 5x data efficiency. Source code\nis available at: https://github.com/MediaBrain-SJTU/DiSF.git.",
      "keywords": [
        "file selection",
        "large language model",
        "pre-training",
        "submodular optimization"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "vRvVVb0NAz",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "OZVTqoli2N",
      "title": "A Second-Order Perspective on Model Compositionality and Incremental Learning",
      "abstract": "The fine-tuning of deep pre-trained models has revealed compositional properties, with multiple specialized modules that can be arbitrarily composed into a single, multi-task model. However, identifying the conditions that promote compositionality remains an open issue, with recent efforts concentrating mainly on linearized networks. We conduct a theoretical study that attempts to demystify compositionality in standard non-linear networks through the second-order Taylor approximation of the loss function. The proposed formulation highlights the importance of staying within the pre-training basin to achieve composable modules. Moreover, it provides the basis for two dual incremental training algorithms: the one from the perspective of multiple models trained individually, while the other aims to optimize the composed model as a whole. We probe their application in incremental classification tasks and highlight some valuable skills. In fact, the pool of incrementally learned modules not only supports the creation of an effective multi-task model but also enables unlearning and specialization in certain tasks. Code available at <https://github.com/aimagelab/mammoth>",
      "keywords": [
        "Continual Learning",
        "Model Compositionality",
        "Ensemble Learning",
        "Task Arithmetic"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "TDyE2iuvyc",
      "title": "Efficient Model Editing with Task-Localized Sparse Fine-tuning",
      "abstract": "Task arithmetic has emerged as a promising approach for editing models by representing task-specific knowledge as composable task vectors. However, existing methods rely on network linearization to derive task vectors, leading to computational bottlenecks during training and inference. Moreover, linearization alone does not ensure weight disentanglement, the key property that enables conflict-free composition of task vectors. To address this, we propose TaLoS which allows to build sparse task vectors with minimal interference without requiring explicit linearization and sharing information across tasks. We find that pre-trained models contain a subset of parameters with consistently low gradient sensitivity across tasks, and that sparsely updating only these parameters allows for promoting weight disentanglement during fine-tuning. Our experiments prove that TaLoS improves training and inference efficiency while outperforming current methods in task addition and negation. By enabling modular parameter editing, our approach fosters practical deployment of adaptable foundation models in real-world applications.",
      "keywords": [
        "task arithmetic",
        "parameter-efficient fine-tuning"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "WOzffPgVjF",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "14fFV0chUS",
      "title": "TRACE: Temporal Grounding Video LLM  via Causal Event Modeling",
      "abstract": "Video Temporal Grounding (VTG) is a crucial capability for video understanding models and plays a vital role in downstream tasks such as video browsing and editing. \nTo effectively handle various tasks simultaneously and enable zero-shot prediction, there is a growing trend in employing video LLMs for VTG tasks. However, current video LLM-based methods rely exclusively on natural language generation, lacking the ability to model the clear structure inherent in videos, which restricts their effectiveness in tackling VTG tasks. To address this issue, this paper first formally introduces causal event modeling framework, which represents video LLM outputs as sequences of events, and predict the current event using previous events, video inputs, and textural instructions. Each event consists of three components: timestamps, salient scores, and textual captions. We then propose a novel task-interleaved video LLM called TRACE to effectively implement the causal event modeling framework in practice. \nThe TRACE process visual frames, timestamps, salient scores, and text as distinct tasks, employing various encoders and decoding heads for each. Task tokens are arranged in an interleaved sequence according to the causal event modeling framework's formulation.\nExtensive experiments on various VTG tasks and datasets demonstrate the superior performance of TRACE compared to state-of-the-art video LLMs. Our model and code are avaliable at \\url{https://github.com/gyxxyg/TRACE}.",
      "keywords": [
        "video large language model",
        "video temporal grounding"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "xSOl0s1u77",
      "title": "TC-Bench: Benchmarking Temporal Compositionality in Conditional Video Generation",
      "abstract": "Video generation has many unique challenges beyond those of image generation. The temporal dimension introduces extensive possible variations across frames, over which consistency and continuity may be violated. In this study, we move beyond evaluating simple actions and argue that generated videos should incorporate the emergence of new concepts and their relation transitions like in real-world videos as time progresses. To assess the \\textbf{T}emporal \\textbf{C}ompositionality of video generation models, we propose TC-Bench, a benchmark of meticulously crafted text prompts, corresponding ground truth videos, and robust evaluation metrics. The prompts articulate the initial and final states of scenes, effectively reducing ambiguities for frame development and simplifying the assessment of transition completion. In addition, by collecting aligned real-world videos corresponding to the prompts, we expand TC-Bench's applicability from text-conditional models to image-conditional ones that can perform generative frame interpolation. We also develop new metrics to measure the completeness of component transitions in generated videos, which demonstrate significantly higher correlations with human judgments than existing metrics. Our comprehensive experimental results reveal that most video generators achieve less than ～20% of the compositional changes, highlighting enormous space for future improvement. Our analysis indicates that current video generation models struggle to interpret descriptions of compositional changes and dynamically map varied semantics across different time steps.",
      "keywords": [
        "Video Generation Benchmark; Text-to-Video Generation; Compositional Video Generation"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "SG1R2H3fa1",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "8sSqNntaMr",
      "title": "RouteLLM: Learning to Route LLMs from Preference Data",
      "abstract": "Large language models (LLMs) excel at a wide range of tasks, but choosing the right model often involves balancing performance and cost. Powerful models offer better results but are expensive, while smaller models are more cost-effective but less capable. To address this trade-off, we introduce a training framework for learning efficient router models that dynamically select between a stronger and weaker LLM during inference. Our framework leverages human preference data and employs data augmentation techniques to enhance performance. Evaluations on public benchmarks show that our approach can reduce costs by over 2 times without sacrificing response quality. Moreover, our routers exhibit strong generalization capabilities, maintaining performance even when routing between LLMs not included in training. This highlights the potential of our framework to deliver cost-effective, high-performance LLM solutions.",
      "keywords": [
        "Large language models",
        "query routing"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "SG1R2H3fa1",
      "title": "Revisiting Random Walks for Learning on Graphs",
      "abstract": "We revisit a simple model class for machine learning on graphs, where a random walk on a graph produces a machine-readable record, and this record is processed by a deep neural network to directly make vertex-level or graph-level predictions. We call these stochastic machines random walk neural networks (RWNNs), and through principled analysis, show that we can design them to be isomorphism invariant while capable of universal approximation of graph functions in probability. A useful finding is that almost any kind of record of random walks guarantees probabilistic invariance as long as the vertices are anonymized. This enables us, for example, to record random walks in plain text and adopt a language model to read these text records to solve graph tasks. We further establish a parallelism to message passing neural networks using tools from Markov chain theory, and show that over-smoothing in message passing is alleviated by construction in RWNNs, while over-squashing manifests as probabilistic under-reaching. We empirically demonstrate RWNNs on a range of problems, verifying our theoretical analysis and demonstrating the use of language models for separating strongly regular graphs where 3-WL test fails, and transductive classification on arXiv citation network. Code is available at https://github.com/jw9730/random-walk.",
      "keywords": [
        "Graph machine learning",
        "random walk",
        "invariance",
        "universal approximation",
        "markov chain"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "yRxX01oRIi",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "FZURCro04D",
      "title": "Reasoning on a Spectrum: Aligning LLMs to System 1 and System 2 Thinking",
      "abstract": "Large language models (LLMs) demonstrate remarkable reasoning capabilities, yet their reliance on step-by-step reasoning can make them brittle when tasks do not align with such structured approaches. In contrast, human cognition flexibly alternates between fast, intuitive reasoning (System 1) and slow, analytical reasoning (System 2), depending on context. To bridge this gap, we curate a dataset of 2K examples, each with valid responses from both reasoning styles, and explicitly align LLMs with System 1 and System 2 reasoning. Evaluations across diverse reasoning benchmarks reveal an accuracy-efficiency trade-off: System 2-aligned models excel in arithmetic and symbolic reasoning, while System 1-aligned models perform better in commonsense tasks. A mechanistic analysis of model responses shows that System 1 models employ more definitive answers, whereas System 2 models demonstrate greater uncertainty. Interpolating between these extremes produces a monotonic transition in reasoning accuracy, preserving coherence. This work challenges the assumption that step-by-step reasoning is always optimal and highlights the need for adapting reasoning strategies based on task demands.",
      "keywords": [
        "Alignment",
        "System 1 and System 2 thinking",
        "Cognitive heuristics",
        "LLM",
        "NLP"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "yRxX01oRIi",
      "title": "Evaluating the Inductive Abilities of Large Language Models: Why Chain-of-Thought Reasoning Sometimes Hurts More Than Helps",
      "abstract": "Large Language Models (LLMs) have shown remarkable progress across domains, yet their ability to perform inductive reasoning—inferring latent rules from sparse examples—remains limited. \nIt is often assumed that chain-of-thought (CoT) prompting, as used in Large Reasoning Models (LRMs), enhances such reasoning. \nWe investigate this assumption with creating four controlled, diagnostic game-based tasks—chess, Texas Hold’em, dice games, and blackjack—with hidden human-defined rules. \nWe find that CoT reasoning can degrade inductive performance, with LRMs often underperforming their non-reasoning counterparts.\n\nTo explain this, we present a theoretical framework that reveals how reasoning steps can amplify error through three failure modes: incorrect sub-task decomposition, incorrect sub-task solving, and incorrect final answer summarization. \nBased on our theoretical and empirical analysis, we introduce structured interventions that adapt CoT generation according to our identified failure types. These interventions improve inductive accuracy without retraining. Our findings suggest that effective (CoT) reasoning depends not only on taking more steps but also on ensuring those steps are well-structured.",
      "keywords": [
        "Large Lauange Model",
        "Inductive Abilities",
        "Reasoning"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "RD9q5vEe1Q",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "RD9q5vEe1Q",
      "title": "Error-quantified Conformal Inference for Time Series",
      "abstract": "Uncertainty quantification in time series prediction is challenging due to the temporal dependence and distribution shift on sequential data. Conformal prediction provides a pivotal and flexible instrument for assessing the uncertainty of machine learning models through prediction sets. Recently, a series of online conformal inference methods updated thresholds of prediction sets by performing online gradient descent on a sequence of quantile loss functions. A drawback of such methods is that they only use the information of revealed non-conformity scores via miscoverage indicators but ignore error quantification, namely the distance between the non-conformity score and the current threshold. To accurately leverage the dynamic of miscoverage error, we propose Error-quantified Conformal Inference (ECI) by smoothing the quantile loss function. ECI introduces a continuous and adaptive feedback scale with the miscoverage error, rather than simple binary feedback in existing methods. We establish a long-term coverage guarantee for ECI under arbitrary dependence and distribution shift. The extensive experimental results show that ECI can achieve valid miscoverage control and output tighter prediction sets than other baselines.",
      "keywords": [
        "Time Series; Uncertainty Quantification; Conformal Prediction; Distribution Shift"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "k38Th3x4d9",
      "title": "Root Cause Analysis of Anomalies in Multivariate Time Series through Granger Causal Discovery",
      "abstract": "Identifying the root causes of anomalies in multivariate time series is challenging due to the complex dependencies among the series. In this paper, we propose a comprehensive approach called AERCA that inherently integrates Granger causal discovery with root cause analysis. By defining anomalies as interventions on the exogenous variables of time series, AERCA not only learns the Granger causality among time series but also explicitly models the distributions of exogenous variables under normal conditions. AERCA then identifies the root causes of anomalies by highlighting exogenous variables that significantly deviate from their normal states. Experiments on multiple synthetic and real-world datasets demonstrate that AERCA can accurately capture the causal relationships among time series and effectively identify the root causes of anomalies.",
      "keywords": [
        "root cause analysis",
        "Granger causality",
        "multivariate time series"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "S8XcHutp7Z",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "DwFDfrPsm8",
      "title": "NOVA: A Benchmark for Rare Anomaly Localization and Clinical Reasoning in Brain MRI",
      "abstract": "In many real-world applications, deployed models encounter inputs that differ from the data seen during training. Open-world recognition ensures that such systems remain robust as ever-emerging, previously _unknown_ categories appear and must be addressed without retraining.\nFoundation and vision-language models are pre-trained on large and diverse datasets with the expectation of broad generalization across domains, including medical imaging.\nHowever, benchmarking these models on test sets with only a few common outlier types silently collapses the evaluation back to a closed-set problem, masking failures on rare or truly novel conditions encountered in clinical use.\n\nWe therefore present NOVA, a challenging, real-life _evaluation-only_ benchmark of $\\sim$900 brain MRI scans that span 281 rare pathologies and heterogeneous acquisition protocols. Each case includes rich clinical narratives and double-blinded expert bounding-box annotations. Together, these enable joint assessment of anomaly localisation, visual captioning, and diagnostic reasoning. \nBecause NOVA is never used for training, it serves as an _extreme_ stress-test of out-of-distribution generalisation: models must bridge a distribution gap both in sample appearance and in semantic space.  \nBaseline results with leading vision-language models (GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B) reveal substantial performance drops, with approximately a 65\\% gap in localisation compared to natural-image benchmarks and 40\\% and 20\\% gaps in captioning and reasoning, respectively, compared to resident radiologists. Therefore, NOVA establishes a testbed for advancing models that can detect, localize, and reason about truly unknown anomalies.",
      "keywords": [
        "Vision-Language Models",
        "Zero-shot Learning",
        "Anomaly Detection",
        "Dataset Benchmarking",
        "Medical Imaging",
        "Brain MRI",
        "Multi-modal Data",
        "Rare Diseases"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "nxGSj1xkm3",
      "title": "Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis",
      "abstract": "Recent advances in large vision-language models (LVLMs) have demonstrated strong performance on general-purpose medical tasks. However, their effectiveness in specialized domains such as dentistry remains underexplored. In particular, panoramic X-rays, a widely used imaging modality in oral radiology, pose interpretative challenges due to dense anatomical structures and subtle pathological cues, which are not captured by existing medical benchmarks or instruction datasets. To this end, we introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. MMOral consists of 20,563 annotated images paired with 1.3 million instruction-following instances across diverse task types, including attribute extraction, report generation, visual question answering, and image-grounded dialogue. In addition, we present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. We evaluate 64 LVLMs on MMOral-Bench and find that even the best-performing model, i.e., GPT-4o, only achieves 43.31% accuracy, revealing significant limitations of current models in this domain. To promote the progress of this specific domain, we provide the supervised fine-tuning (SFT) process utilizing our meticulously curated MMOral instruction dataset. Remarkably, a single epoch of SFT yields substantial performance enhancements for LVLMs, e.g., Qwen2.5-VL-7B demonstrates a 24.73% improvement. MMOral holds significant potential as a critical foundation for intelligent dentistry and enables more clinically impactful multimodal AI systems in the dental field.",
      "keywords": [
        "Medical benchmark",
        "Multimodal instruction data",
        "Large vision language models"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "s0JVsx3bx1",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "neZSGqhxDa",
      "title": "Absolute Zero: Reinforced Self-play Reasoning with Zero Data",
      "abstract": "Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from rule-based outcome rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external human or distillation data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability. AZR uses a code executor to both validate self-proposed code reasoning tasks and verify answers, serving as an unified source of verifiable feedback to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.",
      "keywords": [
        "reasoning",
        "language model",
        "reinforcement learning",
        "self-play",
        "LLM"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "oEgybA04dY",
      "title": "Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning",
      "abstract": "The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage paradigm built on Qwen2.5-VL-7B: a massive linguistic cold-start fine-tuning, \nfollowed by multimodal reinforcement learning (RL) spanning nearly 1,000 steps—surpassing all previous open-source efforts in scale.\nThis pioneering work reveals three fundamental insights: 1) Behavior transfer emerges surprisingly early in cold start due to linguistic mental imagery. 2) Cold start broadly memorizes visual behaviors, while RL critically discerns and scales up effective patterns. 3) Transfer strategically favors high-utility behaviors such as visual reflection. Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a suite of reasoning benchmarks, including 95.3% on MATH500, 51.8% on MathVision and 54.6% on MathVerse. We release our model, data, and training dynamics to catalyze the development of more capable, behavior-aligned multimodal reasoners.",
      "keywords": [
        "Multimodal LLM",
        "Visual Reasoning",
        "Cognitive Behavior Transfer"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "EoebmBe9fG",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "AQ21krZgax",
      "title": "Formal Models of Active Learning from Contrastive Examples",
      "abstract": "Machine learning can greatly benefit from providing learning algorithms with pairs of contrastive training examples---typically pairs of instances that differ only slightly, yet have different class labels. Intuitively, the difference in the instances helps explain the difference in the class labels. This paper proposes a theoretical framework in which the effect of various types of contrastive examples on active learners is studied formally. The focus is on the sample complexity of learning concept classes and how it is influenced by the choice of contrastive examples. We illustrate our results with geometric concept classes and classes of Boolean functions. Interestingly, we reveal a connection between learning from contrastive examples and the classical model of self-directed learning.",
      "keywords": [
        "membership queries",
        "self-directed learning",
        "learning boolean functions",
        "learning from contrastive examples"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "vWaMUMrBpF",
      "title": "Inconsistency-Aware Minimization: Improving Generalization with Unlabeled Data",
      "abstract": "Accurately estimating the generalization gap and devising optimization methods that generalize better are crucial for deep learning models, particularly in both theoretical understanding and practical applications. The ability to leverage unlabeled data for these purposes offers significant advantages in real-world scenarios. This paper introduces a novel generalization measure, termed $\\textit{local inconsistency}$, developed from an information-geometric perspective of the neural network's parameter space; a key feature is its computability from unlabeled data. We establish its theoretical underpinnings by connecting local inconsistency to the Fisher Information Matrix (FIM) and the loss Hessian. Empirically, we demonstrate that local inconsistency not only correlates with the generalization gap but also exhibits characteristics comparable to $\\textit{sharpness}$. Based on these findings, we propose Inconsistency-Aware Minimization (IAM), a regularization strategy that incorporates local inconsistency. We demonstrate that in standard supervised learning settings, IAM enhances generalization, achieving performance comparable to existing methods such as Sharpness-Aware Minimization (SAM). Furthermore, IAM exhibits notable efficacy in semi-supervised learning scenarios, where the local inconsistency regularizer is computed from the unlabeled data portion to further improve model performance.",
      "keywords": [
        "Generalization",
        "Regularization",
        "Training Method",
        "Deep Learning",
        "Inconsistency"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "JbJVWljk7r",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "JbJVWljk7r",
      "title": "SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training",
      "abstract": "The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new $\\texttt{FP4}$ Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves $\\textbf{1038}$ $\\texttt{TOPS}$ on $\\texttt{RTX5090}$, which is a $\\textbf{5}\\times$ speedup over the fastest FlashAttention on $\\texttt{RTX5090}$. Experiments show that our $\\texttt{FP4}$ attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient $\\texttt{8-bit}$ attention for both forward and backward propagation. Experiments indicate that $\\texttt{8-bit}$ attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code is available at https://github.com/thu-ml/SageAttention.",
      "keywords": [
        "Attention",
        "Quantization",
        "Efficient Attention",
        "GPU Kernel"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "o9iReV4FGm",
      "title": "Fast attention mechanisms: a tale of parallelism",
      "abstract": "Transformers have the representational capacity to simulate Massively Parallel Computation (MPC) algorithms, but they suffer from quadratic time complexity, which severely limits their scalability. We introduce an efficient attention mechanism called Approximate Nearest Neighbor Attention (ANNA) with sub-quadratic time complexity. We prove that ANNA-transformers (1) retain the expressive power previously established for standard attention in terms of matching the capabilities of MPC algorithms, and (2) can solve key reasoning tasks such as Match2 and $k$-hop with near-optimal depth. Using the MPC framework, we further prove that constant-depth ANNA-transformers can simulate constant-depth low-rank transformers, thereby providing a unified way to reason about a broad class of efficient attention approximations.",
      "keywords": [
        "Transformer theory",
        "representational strength",
        "nearest neighbor search",
        "massively parallel computation"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "w7BGq6ozOL",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "Yqk7EyT52H",
      "title": "MarS: a Financial Market Simulation Engine Powered by Generative Foundation Model",
      "abstract": "Generative models aim to simulate realistic effects of various actions across different contexts, from text generation to visual effects. Despite significant efforts to build real-world simulators, the application of generative models to virtual worlds, like financial markets, remains under-explored. In financial markets, generative models can simulate complex market effects of participants with various behaviors, enabling interaction under different market conditions, and training strategies without financial risk. This simulation relies on the finest structured data in financial market like orders thus building the finest realistic simulation. We propose Large Market Model (LMM), an order-level generative foundation model, for financial market simulation, akin to language modeling in the digital world. Our financial Market Simulation engine (MarS), powered by LMM, addresses the domain-specific need for realistic, interactive and controllable order generation. Key observations include LMM's strong scalability across data size and model complexity, and MarS's robust and practicable realism in controlled generation with market impact. We showcase MarS as a forecast tool, detection system, analysis platform, and agent training environment, thus demonstrating MarS's ``paradigm shift'' potential for a variety of financial applications. We release the code of MarS at https://github.com/microsoft/MarS/.",
      "keywords": [
        "Financial Market Simulation",
        "Generative Foundation Model",
        "Large Market Model (LMM)",
        "Controllable Simulation",
        "Interactive Simulation",
        "Market Impact",
        "Reinforcement Learning",
        "Forecasting",
        "Market Manipulation Detection",
        "Order-Level Data"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "w7BGq6ozOL",
      "title": "Advancing Algorithmic Trading with Large Language Models: A Reinforcement Learning Approach for Stock Market Optimization",
      "abstract": "In the fast-evolving landscape of financial markets, effective decision-making tools are essential for managing complexities driven by economic indicators and market dynamics. Algorithmic trading strategies have gained prominence for their ability to execute trades autonomously, with Deep Reinforcement Learning (DRL) emerging as a key approach for optimizing trading actions through continuous market interaction. However, RL-based systems face significant challenges, particularly in adapting to evolving time series data and incorporating unstructured textual information. In response to these limitations, recent advancements in Large Language Models (LLMs) offer new opportunities. LLMs possess the capacity to analyze vast volumes of data, providing enhanced insights that can complement traditional market analysis. This study proposes a novel approach that integrates six distinct LLMs into algorithmic trading frameworks, developing Stock-Evol-Instruct, an innovative instruction generation algorithm. This algorithm enables RL agents to fine-tune their trading strategies by leveraging LLM-driven insights for daily stock trading decisions. Empirical evaluation using real-world stock data from Silver and JPMorgan demonstrates the significant potential of this approach to outperform conventional trading models. By bridging the gap between LLMs and RL in algorithmic trading, this study contributes to a new frontier in financial technology, setting the stage for future advancements in autonomous trading systems.",
      "keywords": [
        "Algorithmic trading",
        "Stock market",
        "Large language models",
        "Deep reinforcement learning"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "8P3QNSckMp",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "QVheAhJefR",
      "title": "Learning Preferences without Interaction for Cooperative AI: A Hybrid Offline-Online Approach",
      "abstract": "Reinforcement learning (RL) for collaborative agents capable of cooperating with humans to accomplish tasks has long been a central goal in the RL community. While prior approaches have made progress in adapting collaborative agents to diverse human partners, they often focus solely on optimizing task performance and overlook human preferences—despite the fact that such preferences often diverge from the reward-maximization objective of the environment.\nAddressing this discrepancy poses significant challenges: humans typically provide only a small amount of offline, preference-related feedback and are unable to engage in online interactions, resulting in a distributional mismatch between the agent’s online learning process and the offline human data. To tackle this, we formulate the problem as an online&offline reinforcement learning problem that jointly integrates online generalization and offline preference learning, entirely under an offline training regime.\nWe propose a simple yet effective training framework built upon existing RL algorithms that alternates between offline preference learning and online generalization recovery, ensuring the stability and alignment of both learning objectives.\nWe evaluate our approach on a benchmark built upon the Overcooked environment—a standard environment  for human-agent collaboration—and demonstrate remarkable performance across diverse preference styles and cooperative scenarios.",
      "keywords": [
        "Cooperative AI",
        "offline preference learning",
        "online adaptation"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "hguaupzLCU",
      "title": "Horizon Reduction Makes RL Scalable",
      "abstract": "In this work, we study the scalability of offline reinforcement learning (RL) algorithms. In principle, a truly scalable offline RL algorithm should be able to solve any given problem, regardless of its complexity, given sufficient data, compute, and model capacity. We investigate if and how current offline RL algorithms match up to this promise on diverse, challenging, previously unsolved tasks, using datasets up to 1000× larger than typical offline RL datasets. We observe that despite scaling up data, many existing offline RL algorithms exhibit poor scaling behavior, saturating well below the maximum performance. We hypothesize that the horizon is the main cause behind the poor scaling of offline RL. We empirically verify this hypothesis through several analysis experiments, showing that long horizons indeed present a fundamental barrier to scaling up offline RL. We then show that various horizon reduction techniques substantially enhance scalability on challenging tasks. Based on our insights, we also introduce a minimal yet scalable method named SHARSA that effectively reduces the horizon. SHARSA achieves the best asymptotic performance and scaling behavior among our evaluation methods, showing that explicitly reducing the horizon unlocks the scalability of offline RL.",
      "keywords": [
        "reinforcement learning",
        "offline reinforcement learning"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "k3tbMMW8rH",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "GcvLoqOoXL",
      "title": "Rethinking Diffusion Posterior Sampling: From Conditional Score Estimator to Maximizing a Posterior",
      "abstract": "Recent advancements in diffusion models have been leveraged to address inverse problems without additional training, and Diffusion Posterior Sampling (DPS) (Chung et al., 2022a) is among the most popular approaches. Previous analyses suggest that DPS accomplishes posterior sampling by approximating the conditional score. While in this paper, we demonstrate that the conditional score approximation employed by DPS is not as effective as previously assumed, but rather aligns more closely with the principle of maximizing a posterior (MAP). This assertion is substantiated through an examination of DPS on 512$\\times$512 ImageNet images, revealing that: 1) DPS’s conditional score estimation significantly diverges from the score of a well-trained conditional diffusion model and is even inferior to the unconditional score; 2) The mean of DPS’s conditional score estimation deviates significantly from zero, rendering it an invalid score estimation; 3) DPS generates high-quality samples with significantly lower diversity. In light of the above findings, we posit that DPS more closely resembles MAP than a conditional score estimator, and accordingly propose the following enhancements to DPS: 1) we explicitly maximize the posterior through multi-step gradient ascent and projection; 2) we utilize a light-weighted conditional score estimator trained with only 100 images and 8 GPU hours. Extensive experimental results indicate that these proposed improvements significantly enhance DPS's performance. The source code for these improvements is provided in https://github.com/tongdaxu/Rethinking-Diffusion-Posterior-Sampling-From-Conditional-Score-Estimator-to-Maximizing-a-Posterior.",
      "keywords": [
        "Diffusion models",
        "Inverse problem"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "k3tbMMW8rH",
      "title": "Feedback Schrödinger Bridge Matching",
      "abstract": "Recent advancements in diffusion bridges for distribution transport problems have heavily relied on matching frameworks, yet existing methods often face a trade-off between scalability and access to optimal pairings during training. \nFully unsupervised methods make minimal assumptions but incur high computational costs, limiting their practicality. On the other hand, imposing full supervision of the matching process with optimal pairings improves scalability, however, it can be infeasible in most applications.\nTo strike a balance between scalability and minimal supervision, we introduce Feedback Schrödinger Bridge Matching (FSBM), a novel semi-supervised matching framework that incorporates a small portion ($<8$% of the entire dataset) of pre-aligned pairs as state feedback to guide the transport map of non-coupled samples, thereby significantly improving efficiency. This is achieved by formulating a static Entropic Optimal Transport (EOT) problem with an additional term capturing the semi-supervised guidance. The generalized EOT objective is then recast into a dynamic formulation to leverage the scalability of matching frameworks. Extensive experiments demonstrate that FSBM accelerates training and enhances generalization by leveraging coupled pairs' guidance, opening new avenues for training matching frameworks with partially aligned datasets.",
      "keywords": [
        "Diffusion models",
        "Schrödinger bridge",
        "Distribution matching",
        "Semi-Supervised Learning"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "8oFvUBvF1u",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "CNO4rbSV6v",
      "title": "Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning",
      "abstract": "Vision foundation models, particularly the ViT family, have revolutionized image understanding by providing rich semantic features. However, despite their success in 2D comprehension, their abilities on grasping 3D spatial relationships are still unclear.\nIn this work, we evaluate and enhance the 3D awareness of ViT-based models. We begin by systematically assessing their ability to learn 3D equivariant features, specifically examining the consistency of semantic embeddings across different viewpoints. Our findings indicate that improved 3D equivariance leads to better performance on various downstream tasks, including pose estimation, tracking, and semantic transfer. Building on this insight, we propose a simple yet effective finetuning strategy based on 3D correspondences, which significantly enhances the 3D understanding of existing vision models. Remarkably, even finetuning on a single object for just one iteration results in substantial performance gains. Code is available on https://github.com/qq456cvb/3DCorrEnhance.",
      "keywords": [
        "Vision Foundation Models; 3D Representation Learning; Fine-tuning; 3D Equivariance"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "meRCKuUpmc",
      "title": "Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation",
      "abstract": "Current efforts to learn scalable policies in robotic manipulation primarily fall into two categories: one focuses on \"action,\" which involves behavior cloning from extensive collections of robotic data, while the other emphasizes \"vision,\" enhancing model generalization by pre-training representations or generative models, also referred to as world models, using large-scale visual datasets. This paper presents an end-to-end paradigm that predicts actions using inverse dynamics models conditioned on the robot's forecasted visual states, named Predictive Inverse Dynamics Models (PIDM). By closing the loop between vision and action, the end-to-end PIDM can be a better scalable action learner. In practice, we use Transformers to process both visual states and actions, naming the model Seer. It is initially pre-trained on large-scale robotic datasets, such as DROID, and can be adapted to real-world scenarios with a little fine-tuning data. Thanks to large-scale, end-to-end training and the continuous synergy between vision and action at each execution step, Seer significantly outperforms state-of-the-art methods across both simulation and real-world experiments. It achieves improvements of 13% on the LIBERO-LONG benchmark, 22% on CALVIN ABC-D, and 43% in real-world tasks. Notably, it demonstrates superior generalization for novel objects, lighting conditions, and environments under high-intensity disturbances. Code and models will be publicly available.",
      "keywords": [
        "Robotic Manipulation ; Pre-training ; Visual Foresight ; Inverse Dynamics ; Large-scale robot dataset"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "cFu7ze7xUm",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "QWunLKbBGF",
      "title": "Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs",
      "abstract": "Large Language Models (LLMs) are increasingly deployed as chatbots, yet their ability to personalize responses to user preferences remains limited. We introduce PrefEval, a benchmark for evaluating LLMs' ability to infer, memorize and adhere to user preferences in long-context conversational setting.\nPrefEval comprises 3,000 manually curated user preference and query pairs spanning 20 topics. PrefEval contains user personalization or preference information in both explicit and implicit preference forms, and evaluates LLM performance using a generation and a classification task. With PrefEval, we have evaluated 10 open-sourced and\nproprietary LLMs in multi-session conversations with varying context lengths up to 100k tokens. We benchmark with various prompting, iterative feedback, and retrieval-augmented generation methods. \nOur benchmarking effort reveals that state-of-the-art LLMs face significant challenges in following users' preference during conversations. In particular,  in zero-shot settings, preference following accuracy falls below 10\\% at merely 10 turns (~3k tokens) across most evaluated models. Even with advanced prompting and retrieval methods, preference following still deteriorates in long-context conversations. Furthermore, we show that fine-tuning on PrefEval significantly improves performance. We believe PrefEval serves as a valuable resource for measuring, understanding, and enhancing LLMs' proactive preference following abilities, paving the way for personalized conversational agents.",
      "keywords": [
        "personalization",
        "benchmark",
        "Large language models",
        "conversational llm",
        "chatbots"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "cFu7ze7xUm",
      "title": "DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads",
      "abstract": "Deploying long-context large language models (LLMs) is essential but poses significant computational and memory challenges.\nCaching all Key and Value (KV) states across all attention heads consumes substantial memory.\nExisting KV cache pruning methods either damage the long-context capabilities of LLMs or offer only limited efficiency improvements.\nIn this paper, we identify that only a fraction of attention heads, a.k.a, Retrieval Heads, are critical for processing long contexts and require full attention across all tokens.\nIn contrast, all other heads, which primarily focus on recent tokens and attention sinks—referred to as Streaming Heads—do not require full attention.\nBased on this insight, we introduce DuoAttention, a framework that only applies a full KV cache to retrieval heads while using a light-weight, constant-length KV cache for streaming heads, which reduces both LLM's decoding and pre-filling memory and latency without compromising its long-context abilities.\nDuoAttention uses a lightweight, optimization-based algorithm with synthetic data to identify retrieval heads accurately.\nOur method significantly reduces long-context inference memory by up to 2.55$\\times$ for MHA and 1.67$\\times$ for GQA models while speeding up decoding by up to 2.18$\\times$ and 1.50$\\times$ and accelerating pre-filling by up to 1.73$\\times$ and 1.63$\\times$ for MHA and GQA models, respectively, with minimal accuracy loss compared to full attention.\nNotably, combined with quantization, DuoAttention enables Llama-3-8B decoding with 3.33 million context length measured on a single A100 GPU. Code is provided in https://github.com/mit-han-lab/duo-attention.",
      "keywords": [
        "Large Language Models; Long Context; Efficiency;"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "iQoZv77o3g",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "WOXyOiVd4B",
      "title": "FragFM: Hierarchical Framework for Efficient Molecule Generation via Fragment-Level Discrete Flow Matching",
      "abstract": "We introduce FragFM, a novel hierarchical framework via fragment-level discrete flow matching for efficient molecular graph generation. FragFM generates molecules at the fragment level, leveraging a coarse-to-fine autoencoder to reconstruct details at the atom level. Together with a stochastic fragment bag strategy to effectively handle an extensive fragment space, our framework enables more efficient and scalable molecular generation. We demonstrate that our fragment-based approach achieves better property control than the atom-based method and additional flexibility through conditioning the fragment bag. We also propose a Natural Product Generation benchmark (NPGen) to evaluate modern molecular graph generative models' ability to generate natural product-like molecules. Since natural products are biologically prevalidated and differ from typical drug-like molecules, our benchmark provides a more challenging yet meaningful evaluation relevant to drug discovery. We conduct a FragFM comparative study against various models on diverse molecular generation benchmarks, including NPGen, demonstrating superior performance. The results highlight the potential of fragment-based generative modeling for large-scale, property-aware molecular design, paving the way for more efficient exploration of chemical space.",
      "keywords": [
        "Molecular Graph Generation",
        "Discrete Flow Matching",
        "Fragment-Based Drug Discovery",
        "Natural Product"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "iQoZv77o3g",
      "title": "Predicting Functional Brain Connectivity with Context-Aware Deep Neural Networks",
      "abstract": "Spatial location and molecular interactions have long been linked to the connectivity patterns of neural circuits. Yet, at the macroscale of human brain networks, the interplay between spatial position, gene expression, and connectivity remains incompletely understood. Recent efforts to map the human transcriptome and connectome have yielded spatially resolved brain atlases, however modeling the relationship between high-dimensional transcriptomic data and connectivity while accounting for inherent spatial confounds presents a significant challenge. In this paper, we present the first deep learning approaches for predicting whole-brain functional connectivity from gene expression and regional spatial coordinates, including our proposed Spatiomolecular Transformer (SMT). SMT explicitly models biological context by tokenizing genes based on their transcription start site (TSS) order to capture multi-scale genomic organization, and incorporating regional 3D spatial location via a dedicated context [CLS] token within its multi-head self-attention mechanism. We rigorously benchmark context-aware neural networks, including SMT and a single-gene resolution Multilayer-Perceptron (MLP), to established rules-based and bilinear methods. Crucially, to ensure that learned relationships in any model are not mere artifacts of spatial proximity, we introduce novel  spatiomolecular null maps, preserving both spatial and transcriptomic autocorrelation. Context-aware neural networks outperform linear methods, significantly exceed our stringent null shuffle models, and generalize across diverse connectomic datasets and parcellation resolutions. Together, these findings demonstrate a strong, predictable link between the spatial distributions of gene expression and functional brain network architecture, and establish a rigorously validated deep learning framework for decoding this relationship. Code to reproduce our results is available at: github.com/neuroinfolab/GeneEx2Conn.",
      "keywords": [
        "neuroscience",
        "fMRI",
        "connectomics",
        "transcriptomics",
        "attention"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "4xvE6Iy77Y",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "QtnCPZMxYg",
      "title": "Trajectory Graph Learning: Aligning with Long Trajectories in Reinforcement Learning Without Reward Design",
      "abstract": "Reinforcement learning (RL) often relies on manually designed reward functions, which are difficult to specify and can lead to issues such as reward hacking and suboptimal behavior. Alternatives like inverse RL and preference-based RL attempt to infer surrogate rewards from demonstrations or preferences but suffer from ambiguity and distribution mismatch. A more direct approach, inspired by imitation learning, avoids reward modeling by leveraging expert demonstrations. However, most existing methods align actions only at individual states, failing to capture the coherence of long-horizon trajectories.\n\nIn this work, we study the problem of directly aligning policies with expert-labeled trajectories to preserve long-horizon behavior without relying on reward signals. Specifically, we aim to learn a policy that maximizes the probability of generating the expert trajectories. Nevertheless, we prove that, in its general form, this trajectory alignment problem is NP-complete. \nTo address this, we propose Trajectory Graph Learning (TGL), a framework that leverages structural assumptions commonly satisfied in practice—such as bounded realizability of expert trajectories or a tree-structured MDP. These enable a graph-based policy planning algorithm that computes optimal policies in polynomial time under known dynamics. For settings with unknown dynamics, we develop a sample-efficient algorithm based on UCB-style exploration and establish sub-linear regret. Experiments on grid-world tasks demonstrate that TGL substantially outperforms standard imitation learning methods for long-trajectory planning.",
      "keywords": [
        "Reinforcement Learning",
        "Trajectory Alignment",
        "Trajectory Graph Learning"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "mPuOMcN9E7",
      "title": "Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options",
      "abstract": "We study online preference-based reinforcement learning (PbRL) with the goal of improving sample efficiency. While a growing body of theoretical work has emerged—motivated by PbRL’s recent empirical success, particularly in aligning large language models (LLMs)—most existing studies focus only on pairwise comparisons. A few recent works  (Zhu et al., 2023, Mukherjee et al., 2024, Thekumparampil et al., 2024)  have explored using multiple comparisons and ranking feedback, but their performance guarantees fail to improve—and can even deteriorate—as the feedback length increases, despite the richer information available. To address this gap, we adopt the Plackett–Luce (PL) model for ranking feedback over action subsets and propose **M-AUPO**, an algorithm that selects multiple actions by maximizing the average uncertainty within the offered subset. We prove that **M-AUPO** achieves a suboptimality gap of $\\tilde{\\mathcal{O}}\\left( \\frac{d}{T} \\sqrt{ \\sum_{t=1}^T \\frac{1}{|S_t|}} \\right)$, where $T$ is the total number of rounds, $d$ is the feature dimension, and $|S_t|$ is the size of the subset at round $t$. This result shows that larger subsets directly lead to improved performance and, notably, the bound avoids the exponential dependence on the unknown parameter’s norm, which was a fundamental limitation in most previous works. Moreover, we establish a near-matching lower bound of $\\Omega \\left( \\frac{d}{K \\sqrt{T}} \\right)$, where $K$ is the maximum subset size. To the best of our knowledge, this is the first theoretical result in PbRL with ranking feedback that explicitly shows improved sample efficiency as a function of the subset size.",
      "keywords": [
        "Preference-based Reinforcement Learning",
        "Ranking Feedback",
        "Plackett–Luce Model",
        "Reinforcement Learning from Human Feedback",
        "Dueling Bandit"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "G10Y4vrhGF",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "G10Y4vrhGF",
      "title": "FedFree: Breaking Knowledge-sharing Barriers through Layer-wise Alignment in Heterogeneous Federated Learning",
      "abstract": "Heterogeneous Federated Learning (HtFL) enables collaborative learning across clients with diverse model architectures and non-IID data distributions, which are prevalent in real-world edge computing applications. Existing HtFL approaches typically employ proxy datasets to facilitate knowledge sharing or implement coarse-grained model-level knowledge transfer. However, such approaches not only elevate risks of user privacy leakage but also lead to the loss of fine-grained model-specific knowledge, ultimately creating barriers to effective knowledge sharing. To address these challenges, we propose FedFree, a novel data-free and model-free HtFL framework featuring two key innovations. First, FedFree introduces a reverse layer-wise knowledge transfer mechanism that aggregates heterogeneous client models into a global model solely using Gaussian-based pseudo data, eliminating reliance on proxy datasets. Second, it leverages Knowledge Gain Entropy (KGE) to guide targeted layer-wise knowledge alignment, ensuring that each client receives the most relevant global updates tailored to its specific architecture. We provide rigorous theoretical convergence guarantees for FedFree and conduct extensive experiments on CIFAR-10 and CIFAR-100. Results demonstrate that FedFree achieves substantial performance gains, with relative accuracy improving up to 46.3% over state-of-the-art baselines. The framework consistently excels under highly heterogeneous model/data distributions and in large scale settings.",
      "keywords": [
        "Heterogeneous Federated Learning",
        "Public-Data-Free",
        "Knowledge Gain Entropy"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "InyYuWLWHD",
      "title": "LayerGuard: Poisoning-Resilient Federated Learning via Layer-Wise Similarity Analysis",
      "abstract": "In recent years, model poisoning attacks have gradually evolved from conventional global parameter manipulations to more stealthy and strategic Targeted Layer Poisoning (TLP) attacks.These attacks achieve high attack success rates by selectively poisoning only a subset of layers. However, most existing defenses rely on evaluation of the entire network and are thus ineffective against TLP attacks, posing new challenges to the security of Federated Learning (FL).In this paper, we propose \\textbf{LayerGuard}, a comprehensive defense framework featuring dynamic detection and adaptive aggregation to protect FL against advanced model poisoning attacks. Diverging from traditional methods that analyze the entire network collectively, \\textbf{LayerGuard} performs layer-wise similarity analysis to detect anomalous clients and adaptively identifies layers under attack based on the clustering behavior of malicious updates, facilitating more precise threat detection. Building on this, we introduce a joint weighting mechanism in the aggregation process, which evaluates each client's credibility at the layer level from two complementary informational dimensions: inter-layer and intra-layer, balancing attack mitigation and benign contribution retention. Extensive experiments across various datasets and model architectures demonstrate that \\textbf{LayerGuard} successfully reduces the average attack success rate of TLP attacks to around 5\\%. Moreover, when confronted with other advanced model poisoning attacks, \\textbf{LayerGuard} consistently maintains global model accuracy—even under high poisoning rates and severe non-IID conditions—comparable to that of FedAvg under no-attack settings, marking a significant improvement over existing defenses.",
      "keywords": [
        "Federated Learning; Security; Model Poisoning Attacks; Robust Aggregation"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "0Xt7uT04cQ",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "0Xt7uT04cQ",
      "title": "Uni-Sign: Toward Unified Sign Language Understanding at Scale",
      "abstract": "Sign language pre-training has gained increasing attention for its ability to enhance performance across various sign language understanding (SLU) tasks. However, existing methods often suffer from a gap between pre-training and fine-tuning, leading to suboptimal results. To address this, we propose Uni-Sign, a unified pre-training framework that eliminates the gap between pre-training and downstream SLU tasks through a large-scale generative pre-training strategy and a novel fine-tuning paradigm. First, we introduce CSL-News, a large-scale Chinese Sign Language (CSL) dataset containing 1,985 hours of video paired with textual annotations, which enables effective large-scale pre-training. Second, Uni-Sign unifies SLU tasks by treating downstream tasks as a single sign language translation (SLT) task during fine-tuning, ensuring seamless knowledge transfer between pre-training and fine-tuning. Furthermore, we incorporate a prior-guided fusion (PGF) module and a score-aware sampling strategy to efficiently fuse pose and RGB information, addressing keypoint inaccuracies and improving computational efficiency.  Extensive experiments across multiple SLU benchmarks demonstrate that Uni-Sign achieves state-of-the-art performance across multiple downstream SLU tasks. Dataset and code are available at github.com/ZechengLi19/Uni-Sign.",
      "keywords": [
        "Sign language understanding",
        "Pre-training",
        "Large-scale sign language dataset"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "bMC1t7eLRc",
      "title": "Harnessing Diversity for Important Data Selection in Pretraining Large Language Models",
      "abstract": "Data selection is of great significance in  pretraining large language models, given the  variation in quality within the large-scale available training corpora. \nTo achieve this, researchers are currently investigating the use of data influence to measure the importance of data instances, $i.e.,$ a high influence score indicates that incorporating this instance to the training set is likely to enhance the model performance. Consequently, they select the top-$k$ instances with the highest scores.  However, this approach has several limitations. \n(1) Calculating the accurate influence of all available data is time-consuming.\n(2) The selected data instances are not diverse enough, which may hinder the pretrained model's ability to generalize effectively to various downstream tasks.\nIn this paper, we introduce $\\texttt{Quad}$, a data selection approach that considers both quality and diversity by using data influence to achieve state-of-the-art pretraining results.\nTo compute the influence ($i.e.,$ the quality) more accurately and efficiently, we incorporate the attention layers to capture more semantic details, which can be accelerated through the Kronecker product. \nFor the diversity, $\\texttt{Quad}$ clusters the dataset into similar data instances within each cluster and diverse instances across different clusters. For each cluster, if we opt to select data from it, we take some samples to evaluate the influence to prevent processing all instances. Overall, we favor clusters with highly influential instances (ensuring high quality) or clusters that have been selected less frequently (ensuring diversity), thereby well balancing between quality and diversity.  Experiments on Slimpajama and FineWeb over 7B large language models demonstrate that $\\texttt{Quad}$ significantly outperforms other data selection methods with a low FLOPs consumption. Further analysis also validates the effectiveness of our influence calculation.",
      "keywords": [
        "LLMs",
        "data selection",
        "influence function",
        "diversity"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "fZK6AQXlUU",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "LxkgScfHKf",
      "title": "Conformal Training with Reduced Variance",
      "abstract": "Conformal prediction (CP) is a distribution-free framework for achieving probabilistic guarantees on black-box models. {CP} is generally applied to a model post-training. Conformal training is an approach that aims to optimize the CP efficiency during training. In this direction, ConfTr (Stutz et al, 2022) is a technique that seeks to minimize the expected prediction set size of a model by simulating {CP} in-between training updates. Despite its potential, we identify a strong source of sample inefficiency in ConfTr that leads to overly noisy estimated gradients, introducing training instability and limiting practical use. To address this challenge, we propose variance-reduced conformal training (VR-ConfTr), a method that incorporates a variance reduction technique in the gradient estimation of the ConfTr objective function. Through extensive experiments on various benchmark datasets, we demonstrate that VR-ConfTr consistently achieves faster convergence and smaller prediction sets compared to baselines.",
      "keywords": [
        "Conformal Training",
        "Conformal Prediction",
        "Optimization",
        "Quantile",
        "Deep Learning",
        "Uncertainty Quantification"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "xiQNfYl33p",
      "title": "A Generic Framework for Conformal Fairness",
      "abstract": "Conformal Prediction (CP) is a popular method for uncertainty quantification with machine learning models. While conformal prediction provides probabilistic guarantees regarding the coverage of the true label, these guarantees are agnostic to the presence of sensitive attributes within the dataset. In this work, we formalize \\textit{Conformal Fairness}, a notion of fairness using conformal predictors, and provide a theoretically well-founded algorithm and associated framework to control for the gaps in coverage between different sensitive groups. Our framework leverages the exchangeability assumption (implicit to CP) rather than the typical IID assumption, allowing us to apply the notion of Conformal Fairness to data types and tasks that are not IID, such as graph data. Experiments were conducted on graph and tabular datasets to demonstrate that the algorithm can control fairness-related gaps in addition to coverage aligned with theoretical expectations.",
      "keywords": [
        "Fairness",
        "Conformal Prediction",
        "Graph Neural Networks"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "aPHHhnZktB",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "aPHHhnZktB",
      "title": "FairDen: Fair Density-Based Clustering",
      "abstract": "Fairness in data mining tasks like clustering has recently become an increasingly important aspect. \nHowever, few clustering algorithms exist that focus on fair groupings of data with sensitive attributes. \nIncluding fairness in the clustering objective is especially hard for density-based clustering, as it does not directly optimize a closed form objective like centroid-based or spectral methods.  \n\nThis paper introduces FairDen, the first fair, density-based clustering algorithm.\nWe capture the dataset's density-connectivity structure in a similarity matrix that we manipulate to encourage a balanced clustering. \nIn contrast to state-of-the-art, FairDen inherently handles categorical attributes, noise, and data with several sensitive attributes or groups.\nWe show that FairDen finds meaningful and fair clusters in extensive experiments.",
      "keywords": [
        "Fair Clustering",
        "Density-based Clustering",
        "Density-Connectivity",
        "Fairness",
        "Unsupervised Learning"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "nphsoKxlFs",
      "title": "Dynamic Contrastive Learning for Time Series Representation",
      "abstract": "Understanding events in time series is an important task in a variety of contexts. However, human analysis and labeling are expensive and time-consuming. Therefore, it is advantageous to learn embeddings for moments in time series in an unsupervised way, which allows for good performance in classification or detection tasks after later minimal human labeling. In this paper, we propose dynamic contrastive learning (DynaCL), an unsupervised representation learning framework for time series that uses temporal adjacent steps to define positive pairs. DynaCL adopts N-pair loss to dynamically treat all samples in a batch as positive or negative pairs, enabling efficient training and addressing the challenges of complicated sampling of positives. We demonstrate that DynaCL embeds instances from time series into well-defined, semantically meaningful clusters, which allows superior performance on downstream tasks on a variety of public time series datasets. Our findings also reveal that high scores on unsupervised clustering metrics do not guarantee that the representations are useful in downstream tasks.",
      "keywords": [
        "contrastive learning",
        "self-supervised learning",
        "time series analysis",
        "representation learning"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "Tk5nQnTGmP",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "KurYdcCbjv",
      "title": "Generalized Linear Mode Connectivity for Transformers",
      "abstract": "Understanding the geometry of neural network loss landscapes is a central question in deep learning, with implications for generalization and optimization. A striking phenomenon is $\\textit{linear mode connectivity}$ (LMC), where independently trained models can be connected by low- or zero-barrier paths, despite appearing to lie in separate loss basins. However, this is often obscured by symmetries in parameter space—such as neuron permutations—which make functionally equivalent models appear dissimilar. Prior work has predominantly focused on neuron reordering through permutations, but such approaches are limited in scope and fail to capture the richer symmetries exhibited by modern architectures such as Transformers. In this work, we introduce a unified framework that captures four symmetry classes—permutations, semi-permutations, orthogonal transformations, and general invertible maps—broadening the set of valid reparameterizations and subsuming many previous approaches as special cases. Crucially, this generalization enables, for the first time, the discovery of low- and zero-barrier linear interpolation paths between independently trained Vision Transformers and GPT-2 models. Furthermore, our framework extends beyond pairwise alignment, to multi-model and width-heterogeneous settings, enabling alignment across architectures of different sizes. These results reveal deeper structure in the loss landscape and underscore the importance of symmetry-aware analysis for understanding model space geometry.",
      "keywords": [
        "Neural Network Merging",
        "Linear Mode Connectivity",
        "Model Re-basin",
        "Parameter Space Geometry",
        "Transformer",
        "Permutation Invariance",
        "Model Fusion"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "YTbLri0siT",
      "title": "Spike-timing-dependent Hebbian learning as noisy gradient descent",
      "abstract": "Hebbian learning is a key principle underlying learning in biological neural networks. We relate a Hebbian spike-timing-dependent plasticity rule to noisy gradient descent with respect to a non-convex loss function on the probability simplex. Despite the constant injection of noise and the non-convexity of the underlying optimization problem, one can rigorously prove that the considered Hebbian learning dynamic identifies the presynaptic neuron with the highest activity and that the convergence is exponentially fast in the number of iterations. This is non-standard and surprising as typically noisy gradient descent with fixed noise level only converges to a stationary regime where the noise causes the dynamic to fluctuate around a minimiser.",
      "keywords": [
        "Biological neural networks",
        "Hebbian learning",
        "Spike-timing-dependent plasticity",
        "Noisy gradient descent",
        "Mirror descent"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "rMhQBlhh4c",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "U87XyMPrZp",
      "title": "Unlocking hidden biomolecular conformational landscapes in diffusion models at inference time",
      "abstract": "The function of biomolecules such as proteins depends on their ability to interconvert between a wide range of structures or conformations. Researchers have endeavored for decades to develop computational methods to predict the distribution of conformations, which is far harder to determine experimentally than a static folded structure. We present ConforMix, an inference-time algorithm that enhances sampling of conformational distributions using a combination of classifier guidance, filtering, and free energy estimation. Our approach upgrades diffusion models---whether trained for static structure prediction or conformational generation---to enable more efficient discovery of conformational variability without requiring prior knowledge of major degrees of freedom. ConforMix is orthogonal to improvements in model pretraining and would benefit even a hypothetical model that perfectly reproduced the Boltzmann distribution. Remarkably, when applied to a diffusion model trained for static structure prediction, ConforMix captures structural changes including domain motion, cryptic pocket flexibility, and transporter cycling, while avoiding unphysical states. Case studies of biologically critical proteins demonstrate the scalability, accuracy, and utility of this method.",
      "keywords": [
        "protein conformations",
        "biomolecular systems",
        "enhanced sampling",
        "diffusion",
        "importance sampling",
        "protein flexibility"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "WFujqJ5UBV",
      "title": "Path Gradients after Flow Matching",
      "abstract": "Boltzmann Generators have emerged as a promising machine learning tool for generating samples from equilibrium distributions of molecular systems using Normalizing Flows and importance weighting.\nRecently, Flow Matching has helped speed up Continuous Normalizing Flows (CNFs), scale them to more complex molecular systems, and minimize the length of the flow integration trajectories.\n We investigate the benefits of using path gradients to fine-tune CNFs  initially trained by Flow Matching, in the setting where a target energy is known. \n Our experiments show that this hybrid approach yields up to a threefold increase in sampling efficiency for molecular systems,\n all while using the same model, a similar computational budget and without the need for additional sampling.\n Furthermore, by measuring the length of the flow trajectories \n during fine-tuning, we show that path gradients largely preserve the learned structure of the flow.",
      "keywords": [
        "Path Gradients",
        "Boltzmann Generators",
        "Continuous Normalizing Flows",
        "Equivariance",
        "Optimal Transport",
        "Flow Matching",
        "Variational Inference",
        "Molecular Dynamics"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "n33JVwCz38",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "FEugj28qhC",
      "title": "BayeSQP: Bayesian Optimization through Sequential Quadratic Programming",
      "abstract": "We introduce BayeSQP, a novel algorithm for general black-box optimization that merges the structure of sequential quadratic programming with concepts from Bayesian optimization. BayeSQP employs second-order Gaussian process surrogates for both the objective and constraints to jointly model the function values, gradients, and Hessian from only zero-order information. At each iteration, a local subproblem is constructed using the GP posterior estimates and solved to obtain a search direction. Crucially, the formulation of the subproblem explicitly incorporates uncertainty in both the function and derivative estimates, resulting in a tractable second-order cone program for high probability improvements under model uncertainty. A subsequent one-dimensional line search via constrained Thompson sampling selects the next evaluation point. Empirical results show that BayeSQP outperforms state-of-the-art methods in specific high-dimensional settings. Our algorithm offers a principled and flexible framework that bridges classical optimization techniques with modern approaches to black-box optimization.",
      "keywords": [
        "Bayesian optimization",
        "sequential quadratic programming",
        "constrained optimization"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "zdRW39Tc3C",
      "title": "Architectural and Inferential Inductive Biases for Exchangeable Sequence Modeling",
      "abstract": "Autoregressive models have emerged as a powerful framework for modeling exchangeable sequences---i.i.d. observations when conditioned on some latent factor---enabling direct modeling of uncertainty from missing data (rather than a latent). Motivated by the critical role posterior inference plays as a subroutine in decision-making (e.g., active learning, bandits), we study the inferential and architectural inductive biases that are most effective  for exchangeable sequence modeling. For the inference stage, we highlight a fundamental limitation of the prevalent single-step generation approach: its inability to distinguish between epistemic and aleatoric uncertainty. Instead, a long line of works in Bayesian statistics advocates for multi-step autoregressive generation; we demonstrate this \"correct approach\" enables superior uncertainty quantification that translates into better performance on downstream decision-making tasks. This naturally leads to the next question: which architectures are best suited for multi-step inference? We identify a subtle yet important gap between recently proposed Transformer architectures for exchangeable sequences (Müller et al., 2022; Nguyen & Grover, 2022; Ye & Namkoong, 2024), and prove that they in fact cannot guarantee exchangeability despite introducing significant computational overhead.  Through empirical evaluation, we find that these custom architectures can significantly underperform compared to standard causal masking, highlighting the need for new architectural innovations in Transformer-based modeling of exchangeable sequences.",
      "keywords": [
        "Exchangeability",
        "Inductive Bias",
        "Epistemic uncertainty",
        "Multi-step inference",
        "Transformer architecture"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "PYmrUQmMEw",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "BL4WBIfyrz",
      "title": "Lightweight Neural App Control",
      "abstract": "This paper introduces a novel mobile phone control architecture, Lightweight Multi-modal App Control (LiMAC), for efficient interactions and control across various Android apps. LiMAC  takes as input a textual goal and a sequence of past mobile observations, such as screenshots and corresponding UI trees, to generate precise actions. To address the computational constraints inherent to smartphones, we introduce a small Action Transformer (AcT) integrated with a fine-tuned vision-language model (VLM) for real-time decision-making and task execution.  We evaluate LiMAC on two open-source mobile control datasets, demonstrating the superior performance of our small-form-factor approach against fine-tuned versions of open-source VLMs, such as Florence2 and Qwen2-VL. It also significantly outperforms prompt engineering baselines utilising closed-source foundation models like GPT-4o. More specifically, LiMAC increases the overall action accuracy by up to 19% compared to fine-tuned VLMs, and up to 42% compared to prompt-engineering baselines.",
      "keywords": [
        "vision-language model",
        "multi-modal",
        "android control",
        "app agent"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "PYmrUQmMEw",
      "title": "LLaMA-Omni: Seamless Speech Interaction with Large Language Models",
      "abstract": "Models like GPT-4o enable real-time interaction with large language models (LLMs) through speech, significantly enhancing user experience compared to traditional text-based interaction. However, there is still a lack of exploration on how to build speech interaction models based on open-source LLMs. To address this, we propose LLaMA-Omni, a novel model architecture designed for low-latency and high-quality speech interaction with LLMs. LLaMA-Omni integrates a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder. It eliminates the need for speech transcription, and can simultaneously generate text and speech responses directly from speech instructions with extremely low latency. We build our model based on the latest Llama-3.1-8B-Instruct model. To align the model with speech interaction scenarios, we construct a dataset named InstructS2S-200K, which includes 200K speech instructions and corresponding speech responses. Experimental results show that compared to previous speech-language models, LLaMA-Omni provides better responses in both content and style, with a response latency as low as 226ms. Additionally, training LLaMA-Omni takes less than 3 days on just 4 GPUs, paving the way for the efficient development of speech-language models in the future.",
      "keywords": [
        "large language models",
        "speech interaction",
        "speech-to-speech",
        "speech-language models"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "09FiNmvNMw",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "09FiNmvNMw",
      "title": "Divide and Translate: Compositional First-Order Logic Translation and Verification for Complex Logical Reasoning",
      "abstract": "Complex logical reasoning tasks require a long sequence of reasoning, which a large language model (LLM) with chain-of-thought prompting still falls short. To alleviate this issue, neurosymbolic approaches incorporate a symbolic solver. Specifically, an LLM only translates a natural language problem into a satisfiability (SAT) problem that consists of first-order logic formulas, and a sound symbolic solver returns a mathematically correct solution. However, we discover that LLMs have difficulties to capture complex logical semantics hidden in the natural language during translation. To resolve this limitation, we propose a Compositional First-Order Logic Translation. An LLM first parses a natural language sentence into newly defined logical dependency structures that consist of an atomic subsentence and its dependents, then sequentially translate the parsed subsentences. Since multiple logical dependency structures and sequential translations are possible for a single sentence, we also introduce two Verification algorithms to ensure more reliable results. We utilize an SAT solver to rigorously compare semantics of generated first-order logic formulas and select the most probable one. We evaluate the proposed method, dubbed CLOVER, on seven logical reasoning benchmarks and show that it outperforms the previous neurosymbolic approaches and achieves new state-of-the-art results.",
      "keywords": [
        "Logical Reasoning",
        "Large Language Models",
        "Neurosymbolic Approaches",
        "Semantic Decomposition",
        "Formal Language Verification"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "8QTpYC4smR",
      "title": "Systematic Review of Large Language Models: Applications, Limitations, Practical Usages and Future Directions",
      "abstract": "Large Language Models have revolutionized natural language processing with their remarkable ability to understand and generate human-like text. This review explores the various applications of large language models, highlighting their versatility across different domains. The paper begins with an introduction to LLMs, followed by an overview of their types and a detailed literature review. We then examine their limitations before delving into specific applications such as text generation, translation, summarization, and more. Finally, we discuss future directions for research and development, concluding with a summary of key findings and the potential impact of large language models on various industries.",
      "keywords": [
        "Large Language Models",
        "Systematic Review"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "krF62hkrfR",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "fqfYzp4GKi",
      "title": "Martingale Posterior Neural Networks for Fast Sequential Decision Making",
      "abstract": "We introduce scalable algorithms for online learning of neural network parameters and Bayesian sequential decision making.\nUnlike classical Bayesian neural networks,\nwhich induce predictive uncertainty through a posterior over model parameters,\nour methods adopt a predictive-first perspective based on martingale posteriors.\nIn particular, we work directly with the one-step-ahead posterior predictive, which we\nparameterize with a neural network and update sequentially with incoming observations.\nThis decouples Bayesian decision-making from parameter-space inference:\nwe sample from the posterior predictive for decision making,\nand update the parameters of the posterior predictive via fast, frequentist Kalman-filter-like\nrecursions. \nOur algorithms operate in a fully online, replay-free setting, providing principled uncertainty quantification without costly posterior sampling.\nEmpirically, they achieve competitive performance–speed trade-offs in non-stationary contextual bandits and Bayesian optimization,\noffering 10–100 times faster inference than classical Thompson sampling while maintaining comparable or superior decision performance.",
      "keywords": [
        "online learning",
        "neural bandits",
        "sequential decision making",
        "Kalman filtering",
        "frequentist",
        "bayes",
        "martingale posterior"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "rVyBrD8h2b",
      "title": "Preconditioned Langevin Dynamics with Score-based Generative Models for Infinite-Dimensional Linear Bayesian Inverse Problems",
      "abstract": "Designing algorithms for solving high-dimensional Bayesian inverse problems directly in infinite‑dimensional function spaces – where such problems are naturally formulated – is crucial to ensure stability and  convergence as the discretization of the underlying  problem is refined. In this paper, we contribute to this line of work by analyzing a widely used sampler for linear inverse problems: Langevin dynamics driven by score‑based generative models (SGMs) acting as priors, formulated directly in function space. Building on the  theoretical framework for SGMs in Hilbert spaces, we give a rigorous definition of this sampler in the infinite-dimensional setting and derive, for the first time, error estimates that explicitly depend on the approximation error of the score. As a consequence, we obtain sufficient conditions for global convergence in Kullback–Leibler divergence on the underlying function space. Preventing numerical instabilities requires preconditioning of the Langevin algorithm and we prove the existence  and form of an optimal preconditioner. The preconditioner depends on both the score error and the forward operator and guarantees a uniform convergence rate across all posterior modes. Our analysis applies to both Gaussian and a general class of non‑Gaussian priors. Finally, we present examples that illustrate and validate our theoretical findings.",
      "keywords": [
        "theory",
        "score-based generative models",
        "error analysis",
        "hilbert space",
        "langevin dynamics",
        "preconditioning"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "k3y0oyK7sn",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "GjfIZan5jN",
      "title": "Enhancing Pre-trained Representation Classifiability can Boost its Interpretability",
      "abstract": "The visual representation of a pre-trained model prioritizes the classifiability on downstream tasks, while the widespread applications for pre-trained visual models have posed new requirements for representation interpretability. However, it remains unclear whether the pre-trained representations can achieve high interpretability and classifiability simultaneously. To answer this question, we quantify the representation interpretability by leveraging its correlation with the ratio of interpretable semantics within the representations. Given the pre-trained representations, only the interpretable semantics can be captured by interpretations, whereas the uninterpretable part leads to information loss. Based on this fact, we propose the Inherent Interpretability Score (IIS) that evaluates the information loss, measures the ratio of interpretable semantics, and quantifies the representation interpretability. In the evaluation of the representation interpretability with different classifiability, we surprisingly discover that the interpretability and classifiability are positively correlated, i.e., representations with higher classifiability provide more interpretable semantics that can be captured in the interpretations. This observation further supports two benefits to the pre-trained representations. First, the classifiability of representations can be further improved by fine-tuning with interpretability maximization. Second, with the classifiability improvement for the representations, we obtain predictions based on their interpretations with less accuracy degradation. The discovered positive correlation and corresponding applications show that practitioners can unify the improvements in interpretability and classifiability for pre-trained vision models. Codes are available at https://github.com/ssfgunner/IIS.",
      "keywords": [
        "Representation interpretability",
        "vision representations",
        "image understanding"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "k3y0oyK7sn",
      "title": "Predictive Uncertainty Quantification for Bird's Eye View Segmentation: A Benchmark and Novel Loss Function",
      "abstract": "The fusion of raw sensor data to create a Bird's Eye View (BEV) representation is critical for autonomous vehicle planning and control. Despite the growing interest in using deep learning models for BEV semantic segmentation, anticipating segmentation errors and enhancing the explainability of these models remain underexplored. This paper introduces a comprehensive benchmark for predictive uncertainty quantification in BEV segmentation, evaluating multiple uncertainty quantification methods across three popular datasets with three representative network architectures. Our study focuses on the effectiveness of quantified uncertainty in detecting misclassified and out-of-distribution (OOD) pixels while also improving model calibration. Through empirical analysis, we uncover challenges in existing uncertainty quantification methods and demonstrate the potential of evidential deep learning techniques, which capture both aleatoric and epistemic uncertainty. To address these challenges, we propose a novel loss function, Uncertainty-Focal-Cross-Entropy (UFCE), specifically designed for highly imbalanced data, along with a simple uncertainty-scaling regularization term that improves both uncertainty quantification and model calibration for BEV segmentation.",
      "keywords": [
        "Uncertainty Quantification",
        "Evidential Deep Learning",
        "Bird's Eye View (BEV) Segmentation"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "z5KTxW5sJd",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "3k70Vt0YFS",
      "title": "ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code",
      "abstract": "Large language models (LLMs) have shown promise in transforming machine learning research, yet their capability to faithfully implement genuinely novel ideas from recent research papers—ideas unseen during pretraining—remains unclear. We introduce ResearchCodeBench, a benchmark that evaluates LLMs’ ability to translate cutting-edge ML contributions from top 2024-2025 research papers into executable code. We assessed 30+ proprietary and open-source LLMs, finding that even the best models correctly implement less than 40% of the code. We present empirical findings on performance comparison, contamination, and error patterns. By providing a rigorous evaluation platform, ResearchCodeBench enables continuous understanding and advancement of LLM-driven innovation in research code generation.",
      "keywords": [
        "Machine learning benchmarks",
        "Code generation",
        "Large language models",
        "Research automation"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "z5KTxW5sJd",
      "title": "From Replication to Redesign: Exploring Pairwise Comparisons for LLM-Based Peer Review",
      "abstract": "The advent of large language models (LLMs) offers unprecedented opportunities to reimagine peer review beyond the constraints of traditional workflows.\nDespite these opportunities, prior efforts have largely focused on replicating traditional review workflows with LLMs serving as direct substitutes for human reviewers, while limited attention has been given to exploring new paradigms that fundamentally rethink how LLMs can participate in the academic review process.\nIn this paper, we introduce and explore a novel mechanism that employs LLM agents to perform pairwise comparisons among manuscripts instead of individual scoring. By aggregating outcomes from substantial pairwise evaluations, this approach enables a more accurate and robust measure of relative manuscript quality.\nOur experiments demonstrate that this comparative approach significantly outperforms traditional rating-based methods in identifying high-impact papers. However, our analysis also reveals emergent biases in the selection process, notably a reduced novelty in research topics and an increased institutional imbalance. These findings highlight both the transformative potential of rethinking peer review with LLMs and critical challenges that future systems must address to ensure equity and diversity.",
      "keywords": [
        "Large Language Models",
        "Peer Review Redesign",
        "Comparative Paper Evaluation"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "uy31tqVuNo",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "VAvZ4oinpa",
      "title": "Video Generation with Learned Action Prior",
      "abstract": "Long-term stochastic video generation remains challenging, especially with moving cameras. This scenario introduces complex interactions between camera movement and observed pixels, resulting in intricate spatio-temporal dynamics and partial observability issues. Current approaches often focus on pixel-level image reconstruction, neglecting explicit modeling of camera motion dynamics. Our proposed solution incorporates camera motion or action as an extended part of the observed image state, employing a multi-modal learning framework to simultaneously model both image and action. We introduce three models: (i) Video Generation with Learning Action Prior (VG-LeAP) that treats the image-action pair as an augmented state generated from a single latent stochastic process and uses variational inference to learn the image-action latent prior; (ii) Causal-LeAP, which establishes a causal relationship between action and the observed image frame, and learns a seperate action prior, conditioned on the observed image states along with the image prior; and (iii) RAFI, which integrates the augmented image-action state concept with a conditional flow matching framework, demonstrating that this action-conditioned image generation concept can be extended to other transformer-based architectures. Through comprehensive empirical studies on robotic video dataset, RoAM, we highlight the importance of multi-modal training in addressing partially observable video generation problems.",
      "keywords": [
        "Stochastic Video Generation",
        "Variational Inference"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "uy31tqVuNo",
      "title": "Unbounded: A Generative Infinite Game of Character Life Simulation",
      "abstract": "We introduce the concept of a generative infinite game, a video game that transcends the traditional boundaries of finite, hard-coded systems by using generative models. Inspired by James P. Carse's distinction between finite and infinite games, we leverage recent advances in generative AI to create Unbounded: a game of character life simulation that is fully encapsulated in generative models. Specifically, Unbounded draws inspiration from sandbox life simulations and allows you to interact with your autonomous virtual character in a virtual world by feeding, playing with and guiding it - with open-ended mechanics generated by an LLM, some of which can be emergent. In order to develop Unbounded, we propose technical innovations in both the LLM and visual generation domains. Specifically, we present: (1) a specialized, distilled large language model (LLM) that dynamically generates game mechanics, narratives, and character interactions in real-time, and (2) a new dynamic regional image prompt Adapter (IP-Adapter) for vision models that ensures consistent yet flexible visual generation of a character across multiple environments. We evaluate our system through both qualitative and quantitative analysis, showing significant improvements in character life simulation, user instruction following, narrative coherence, and visual consistency for both characters and the environments compared to traditional related approaches.",
      "keywords": [
        "Text-to-Image Generation",
        "Interactive Image Generation"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "OwpLQrpdwE",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "NPSZ7V1CCY",
      "title": "Zero-shot Imputation with Foundation Inference Models for Dynamical Systems",
      "abstract": "Dynamical systems governed by ordinary differential equations (ODEs) serve as models for a vast number of natural and social phenomena. In this work, we offer a fresh perspective on the classical problem of imputing missing time series data, whose underlying dynamics are assumed to be determined by ODEs. Specifically, we revisit ideas from amortized inference and neural operators, and propose a novel supervised learning framework for *zero-shot time series imputation*, through parametric functions satisfying some (hidden) ODEs. Our proposal consists of two components. First, a broad probability distribution over the space of ODE solutions, observation times and noise mechanisms, with which we generate a large, synthetic dataset of (hidden) ODE solutions, along with their noisy and sparse observations. Second, a neural recognition model that is trained *offline*, to map the generated time series onto the spaces of initial conditions and time derivatives of the (hidden) ODE solutions, which we then integrate to impute the missing data. We empirically demonstrate that *one and the same* (pretrained) recognition model can perform zero-shot imputation across 63 distinct time series with missing values, each sampled from widely different dynamical systems. Likewise, we demonstrate that it can perform zero-shot imputation of missing high-dimensional data in 10 vastly different settings, spanning human motion, air quality, traffic and electricity studies, as well as Navier-Stokes simulations — *without requiring any fine-tuning*. What is more, our proposal often outperforms state-of-the-art methods, which are trained on the target datasets.\n\nOur pretrained model, repository and tutorials are available online.",
      "keywords": [
        "Zero-shot imputation",
        "foundation models",
        "time series imputation",
        "dynamical systems",
        "amortized inference",
        "zero-shot interpolation",
        "foundation models for time series"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "OwpLQrpdwE",
      "title": "Learning vector fields of differential equations on manifolds with geometrically constrained operator-valued kernels",
      "abstract": "We address the problem of learning ordinary differential equations (ODEs) on manifolds. Existing machine learning methods, particularly those using neural networks, often struggle with high computational demands. To overcome this issue, we introduce a geometrically constrained operator-valued kernel that allows us to represent vector fields on tangent bundles of smooth manifolds. The construction of the kernel imposes the geometric constraints that are estimated from the data and ensures the computational feasibility for learning high dimensional systems of ODEs. Once the vector fields are estimated, e.g., by the kernel ridge regression, we need an ODE solver that guarantees the solution to stay on (or close to) the manifold. To overcome this issue, we propose a geometry-preserving ODE solver that approximates the exponential maps corresponding to the ODE solutions.  We deduce a theoretical error bound for the proposed solver that guarantees the approximate solutions to lie on the manifold in the limit of large data. We verify the effectiveness of the proposed approach on high-dimensional dynamical systems, including the cavity flow problem, the beating and travelling waves in Kuramoto-Sivashinsky equations, and the reaction-diffusion dynamics.",
      "keywords": [
        "Dynamics on manifolds",
        "Operator-valued kernel",
        "Geometry-preserving time integration",
        "Ordinary differential equations"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "Thnk4ez3wN",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "SctfBCLmWo",
      "title": "A Decade's Battle on Dataset Bias: Are We There Yet?",
      "abstract": "We revisit the ``dataset classification'' experiment suggested by Torralba & Efros (2011) a decade ago, in the new era with large-scale, diverse, and hopefully less biased datasets as well as more capable neural network architectures. Surprisingly, we observe that modern neural networks can achieve excellent accuracy in classifying which dataset an image is from: e.g., we report 84.7% accuracy on held-out validation data for the three-way classification problem consisting of the YFCC, CC, and DataComp datasets. Our further experiments show that such a dataset classifier could learn semantic features that are generalizable and transferable, which cannot be explained by memorization. We hope our discovery will inspire the community to rethink issues involving dataset bias.",
      "keywords": [
        "Vision datasets",
        "Dataset bias",
        "Deep learning"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "c61unr33XA",
      "title": "Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-training of Deep Networks",
      "abstract": "Dataset distillation (DD) generates small synthetic datasets that can efficiently train deep networks with a limited amount of memory and compute. Despite the success of DD methods for supervised learning, DD for self-supervised pre-training of deep models has remained unaddressed. Pre-training on unlabeled data is crucial for efficiently generalizing to downstream tasks with limited labeled data. In this work, we propose the first effective DD method for SSL pre-training. First, we show, theoretically and empirically, that naiive application of supervised DD methods to SSL fails, due to the high variance of the SSL gradient. Then, we address this issue by relying on insights from knowledge distillation (KD) literature. Specifically, we train a small student model to match the representations of a larger teacher model trained with SSL. Then, we generate a small synthetic dataset by matching the training trajectories of the student models. As the KD objective has considerably lower variance than SSL, our approach can generate synthetic datasets that can successfully pre-train high-quality encoders. Through extensive experiments, we show that our distilled sets lead to up to 13% higher accuracy than prior work, on a variety of downstream tasks, in the presence of limited labeled data. Code at https://github.com/BigML-CS-UCLA/MKDT.",
      "keywords": [
        "dataset distillation",
        "self-supervised learning"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "cFu7ze7xUm",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "LlE61BEYpB",
      "title": "FLARE: Fine-tuned Long-context Acceleration with ReLU-enhanced FIRE",
      "abstract": "Deploying large language models (LLMs) on resource-constrained edge devices is challenging due to computational bottlenecks, memory bottlenecks, and -- for long-contexts -- specifically the Softmax operation in the attention mechanism. While using ReLU in place of Softmax has been explored, and FIRE as an alternative to RoPE has been explored for models trained from scratch, there has been little work towards exploring fine-tuning models to utilize these efficient algorithms, or the combination of the two.\n\nIn this paper, we contribute FLARE, a method for fusing Rectified Linear Activations (ReLU) with Relative Encodings (specifically FIRE), and we share a particular recipe which allows these to be fine-tuned effectively into existing models and fused to create efficient long-context inference. Following this recipe yields markedly better validation loss, long-context inference speed, and successfully introduces the property of length-generalization -- the property where the model gains high accuracy for contexts lengths several times larger than trained -- unlike RoPE -- without further fine-tuning.   \n\nOnce FIRE and ReLU are both fine-tuned into a model, we show these can be mathematically fused into a single, more efficient operation, which on average was found to shave 98.9\\% of FIRE operations and produce a Probability matrix with 98.9\\% zeros in its lower-triangle.\n\nFinally, we benchmark inference speed improvements for custom hardware as well with custom CUDA kernels. Using Power, Performance, and Area (PPA) analysis, we show that FLARE operates at eight times the frequency of Softmax while consuming only 0.1\\% of the power and 0.11\\% of the energy per cycle. Our custom CUDA Kernel shows 3.8x faster operation than Softmax FlashAttention. We believe this shows the potential of fine-tuning new algorithms in pre-trained models, and we share our fine-tuning recipes, code and custom hardware designs at \\url{https://anonymous.4open.science/r/nanoGPTBD54}.",
      "keywords": [
        "FIRE",
        "Functional Interpolation for Relative Position Encoding",
        "fine-tune",
        "fine-tuning",
        "ReLU",
        "Softmax",
        "Softplus",
        "Softmax alternatives",
        "long context",
        "transformer",
        "large language model",
        "edge device",
        "Flash Attention"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "cFu7ze7xUm",
      "title": "DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads",
      "abstract": "Deploying long-context large language models (LLMs) is essential but poses significant computational and memory challenges.\nCaching all Key and Value (KV) states across all attention heads consumes substantial memory.\nExisting KV cache pruning methods either damage the long-context capabilities of LLMs or offer only limited efficiency improvements.\nIn this paper, we identify that only a fraction of attention heads, a.k.a, Retrieval Heads, are critical for processing long contexts and require full attention across all tokens.\nIn contrast, all other heads, which primarily focus on recent tokens and attention sinks—referred to as Streaming Heads—do not require full attention.\nBased on this insight, we introduce DuoAttention, a framework that only applies a full KV cache to retrieval heads while using a light-weight, constant-length KV cache for streaming heads, which reduces both LLM's decoding and pre-filling memory and latency without compromising its long-context abilities.\nDuoAttention uses a lightweight, optimization-based algorithm with synthetic data to identify retrieval heads accurately.\nOur method significantly reduces long-context inference memory by up to 2.55$\\times$ for MHA and 1.67$\\times$ for GQA models while speeding up decoding by up to 2.18$\\times$ and 1.50$\\times$ and accelerating pre-filling by up to 1.73$\\times$ and 1.63$\\times$ for MHA and GQA models, respectively, with minimal accuracy loss compared to full attention.\nNotably, combined with quantization, DuoAttention enables Llama-3-8B decoding with 3.33 million context length measured on a single A100 GPU. Code is provided in https://github.com/mit-han-lab/duo-attention.",
      "keywords": [
        "Large Language Models; Long Context; Efficiency;"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "hSX7Dd8dxy",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "10s01YrlKp",
      "title": "metaTextGrad: Automatically optimizing language model optimizers",
      "abstract": "Large language models (LLMs) are increasingly used in learning algorithms, evaluations, and optimization tasks. Recent studies have shown that using LLM-based optimizers to automatically optimize model prompts, demonstrations, predictions themselves, or other components can significantly enhance the performance of AI systems, as demonstrated by frameworks such as DSPy and TextGrad. However, optimizers built on language models themselves are usually designed by humans with manual design choices; optimizers themselves are not optimized. Moreover, these optimizers are general purpose by design, to be useful to a broad audience, and are not tailored for specific tasks. To address these challenges, we propose metaTextGrad, which focuses on designing a meta-optimizer to further enhance existing optimizers and align them to be good optimizers for a given task. Our approach consists of two key components: a meta prompt optimizer and a meta structure optimizer. The combination of these two significantly improves performance across multiple benchmarks, achieving an average absolute performance improvement of up to 6% compared to the best baseline.",
      "keywords": [
        "programming models",
        "prompting techniques",
        "meta-learning",
        "LLM optimizer"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "hSX7Dd8dxy",
      "title": "Inference-Time Reward Hacking in Large Language Models",
      "abstract": "A common paradigm to improve the performance of large language models is optimizing for a reward model. Reward models assign a numerical score to an LLM’s output that indicates, for example, how likely it is to align with user preferences or safety goals. However, reward models are never perfect. They inevitably function as proxies for  complex desiderata such as correctness, helpfulness, and safety. By overoptimizing for a misspecified reward, we can subvert intended alignment goals and reduce overall performance -- a phenomenon commonly referred to as reward hacking. In this work, we characterize reward hacking in inference-time alignment and demonstrate when and how we can mitigate it by hedging on the proxy reward. We study this phenomenon under Best-of-$n$ (BoN) and Soft Best-of-$n$ (SBoN), and we introduce Best-of-Poisson (BoP) that provides an efficient, near-exact approximation of the optimal reward-KL divergence policy at inference time. We show that the characteristic pattern of hacking as observed in practice (where the true reward first increases before declining) is an inevitable property of a broad class of inference-time mechanisms, including BoN and BoP. To counter this effect, we introduce $\\texttt{HedgeTune}$, an efficient algorithm to find the optimal inference-time parameter. We demonstrate that hedging mitigates reward hacking and achieves superior reward-distortion tradeoffs on math, reasoning, and human-preference setups.",
      "keywords": [
        "reward hacking",
        "large language models",
        "inference time alignment",
        "information theory"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "bVTM2QKYuA",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "Njx1NjHIx4",
      "title": "Formation of Representations in Neural Networks",
      "abstract": "Understanding neural representations will help open the black box of neural networks and advance our scientific understanding of modern AI systems. However, how complex, structured, and transferable representations emerge in modern neural networks has remained a mystery. Building on previous results, we propose the Canonical Representation Hypothesis (CRH), which posits a set of six alignment relations to universally govern the formation of representations in most hidden layers of a neural network. Under the CRH, the latent representations (R), weights (W), and neuron gradients (G) become mutually aligned during training. This alignment implies that neural networks naturally learn compact representations, where neurons and weights are invariant to task-irrelevant transformations. We then show that the breaking of CRH leads to the emergence of reciprocal power-law relations between R, W, and G, which we refer to as the Polynomial Alignment Hypothesis (PAH). We present a minimal-assumption theory proving that the balance between gradient noise and regularization is crucial for the emergence of the canonical representation. The CRH and PAH lead to an exciting possibility of unifying major key deep learning phenomena, including neural collapse and the neural feature ansatz, in a single framework.",
      "keywords": [
        "representation learning",
        "neural collapse",
        "neural feature ansatz"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "yVGGtsOgc7",
      "title": "Disentangling Representations through Multi-task Learning",
      "abstract": "Intelligent perception and interaction with the world hinges on internal representations that capture its underlying structure (\"disentangled\" or \"abstract\" representations). Disentangled representations serve as world models, isolating latent factors of variation in the world along approximately orthogonal directions, thus facilitating feature-based generalization. We provide experimental and theoretical results guaranteeing the emergence of disentangled representations in agents that optimally solve multi-task evidence accumulation classification tasks, canonical in the neuroscience literature. The key conceptual finding is that, by producing accurate multi-task classification estimates, a system implicitly represents a set of coordinates specifying a disentangled representation of the underlying latent state of the data it receives. The theory provides conditions for the emergence of these representations in terms of noise, number of tasks, and evidence accumulation time, when the classification boundaries are affine in the latent space. Surprisingly, the theory also produces closed-form expressions for extracting the disentangled representation from the model's latent state $\\mathbf Z(t)$. We experimentally validate these predictions in RNNs trained on multi-task classification, which learn disentangled representations in the form of continuous attractors, leading to zero-shot out-of-distribution (OOD) generalization in predicting latent factors. We demonstrate the robustness of our framework across autoregressive architectures, decision boundary geometries and in tasks requiring classification confidence estimation. We find that transformers are particularly suited for disentangling representations, which might explain their unique world understanding abilities. Overall, our framework establishes a formal link between competence at multiple tasks and the formation of disentangled, interpretable world models in both biological and artificial systems, and helps explain why ANNs often arrive at human-interpretable concepts, and how they both may acquire exceptional zero-shot generalization capabilities.",
      "keywords": [
        "zero-shot generalization",
        "disentanglement",
        "interpretability",
        "world models",
        "multi-task learning",
        "computational neuroscience",
        "neuroAI",
        "evidence accumulation",
        "cognitive maps",
        "continuous attractors",
        "RNNs",
        "transformers"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "uaKBM9sGEm",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "8NdNniulYE",
      "title": "STAMP: Scalable Task- And Model-agnostic Collaborative Perception",
      "abstract": "Perception is a crucial component of autonomous driving systems. However, single-agent setups often face limitations due to sensor constraints, especially under challenging conditions like severe occlusion, adverse weather, and long-range object detection. Multi-agent collaborative perception (CP) offers a promising solution that enables communication and information sharing between connected vehicles. Yet, the heterogeneity among agents—in terms of sensors, models, and tasks—significantly hinders effective and efficient cross-agent collaboration. To address these challenges, we propose STAMP, a scalable task- and model-agnostic collaborative perception framework tailored for heterogeneous agents. STAMP utilizes lightweight adapter-reverter pairs to transform Bird's Eye View (BEV) features between agent-specific domains and a shared protocol domain, facilitating efficient feature sharing and fusion while minimizing computational overhead. Moreover, our approach enhances scalability, preserves model security, and accommodates a diverse range of agents. Extensive experiments on both simulated (OPV2V) and real-world (V2V4Real) datasets demonstrate that STAMP achieves comparable or superior accuracy to state-of-the-art models with significantly reduced computational costs. As the first-of-its-kind task- and model-agnostic collaborative perception framework, STAMP aims to advance research in scalable and secure mobility systems, bringing us closer to Level 5 autonomy. Our project page is at https://xiangbogaobarry.github.io/STAMP and the code is available at https://github.com/taco-group/STAMP.",
      "keywords": [
        "Autonomous Driving",
        "Collaborative Perception",
        "Domain Adaptation"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "uaKBM9sGEm",
      "title": "Towards Off-Road Autonomous Driving via Planner Guided Policy Optimization",
      "abstract": "Off-road autonomous driving poses significant challenges such as navigating diverse terrains, avoiding obstacles, and maneuvering through ditches. Addressing these challenges requires effective planning and adaptability, making it a long-horizon planning and control problem. Traditional model-based control techniques like Model Predictive Path Integral (MPPI) require dense sampling and accurate modeling of the vehicle-terrain interaction, both of which are computationally expensive, making effective long-horizon planning in real-time intractable. Reinforcement learning (RL) methods operate without this limitation and are computationally cheaper at deployment. However, exploration in obstacle-dense and challenging terrains is difficult, and typical RL techniques struggle to navigate in these terrains. To alleviate the limitations of MPPI, we propose a hierarchical autonomy pipeline with a low-frequency high-level MPPI planner and a high-frequency low-level RL controller. To tackle RL's exploration challenge, we propose a teacher-student paradigm to learn an end-to-end RL policy, capable of real-time execution and traversal through challenging terrains. The teacher policy is trained using dense planning information from an MPPI planner while the student policy learns to navigate using visual inputs and sparse planning information. In this framework, we introduce a new policy gradient formulation that extends Proximal Policy Optimization (PPO), leveraging off-policy trajectories for teacher guidance and on-policy trajectories for student exploration. We demonstrate our performance in a realistic off-road simulator against various RL and imitation learning methods.",
      "keywords": [
        "Reinforcement learning",
        "Learning from Demonstrations",
        "Autonomous driving",
        "Off-road driving"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "Jzr9VOiJYd",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "Jzr9VOiJYd",
      "title": "Contimask: Explaining Irregular Time Series via Perturbations in Continuous Time",
      "abstract": "Explaining black-box models for time series data is critical for the wide-scale adoption of deep learning techniques across domains such as healthcare. Recently, explainability methods for deep time series models have seen significant progress by adopting saliency methods that perturb masked segments of time series to uncover their importance towards the prediction of black-box models. Thus far, such methods have been largely restricted to regular time series. Irregular time series, however, sampled at irregular time intervals and potentially with missing values, are the dominant form of time series in various critical domains (e.g., hospital records). In this paper, we conduct the first evaluation of saliency methods for the interpretation of irregular time series models. We first translate techniques for regular time series into the continuous time realm of irregular time series and show under which circumstances such techniques are still applicable. However, existing perturbation techniques neglect the timing and structure of observed data, e.g., informative missingness when data is not missing at random. Thus, we propose Contimask, a simple framework to also apply non-differentiable perturbations, such as simulating that parts of the data had not been observed using NeuroEvolution. Doing so, we successfully detect how structural differences in the data can bias irregular time series models on a real-world sepsis prediction task where 90% of the data is missing. Source code is available on GitHub.",
      "keywords": [
        "Irregular Time Series",
        "Explainability",
        "Interpretability",
        "Explanations",
        "Perturbations"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "jMhRbV47pS",
      "title": "The emergence of sparse attention: impact of data distribution and benefits of repetition",
      "abstract": "Emergence is a fascinating property of large language models and neural networks more broadly: as models scale and train for longer, they sometimes develop new abilities in sudden ways. Despite initial studies, we still lack a comprehensive understanding of how and when these abilities emerge. To address this gap, we study the emergence over training of sparse attention, a critical and frequently observed attention pattern in Transformers. By combining theoretical analysis of a toy model with empirical observations on small Transformers trained on a linear regression variant, we uncover the mechanics driving sparse attention emergence and reveal that emergence timing follows power laws based on task structure, architecture, and optimizer choice. We additionally find that repetition can greatly speed up emergence. Finally, we confirm these results on a well-studied in-context associative recall task. Our findings provide a simple, theoretically grounded framework for understanding how data distributions and model design influence the learning dynamics behind one form of emergence.",
      "keywords": [
        "emergence",
        "sparse attention",
        "in-context learning",
        "induction head"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "Jzr9VOiJYd",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "Jzr9VOiJYd",
      "title": "Contimask: Explaining Irregular Time Series via Perturbations in Continuous Time",
      "abstract": "Explaining black-box models for time series data is critical for the wide-scale adoption of deep learning techniques across domains such as healthcare. Recently, explainability methods for deep time series models have seen significant progress by adopting saliency methods that perturb masked segments of time series to uncover their importance towards the prediction of black-box models. Thus far, such methods have been largely restricted to regular time series. Irregular time series, however, sampled at irregular time intervals and potentially with missing values, are the dominant form of time series in various critical domains (e.g., hospital records). In this paper, we conduct the first evaluation of saliency methods for the interpretation of irregular time series models. We first translate techniques for regular time series into the continuous time realm of irregular time series and show under which circumstances such techniques are still applicable. However, existing perturbation techniques neglect the timing and structure of observed data, e.g., informative missingness when data is not missing at random. Thus, we propose Contimask, a simple framework to also apply non-differentiable perturbations, such as simulating that parts of the data had not been observed using NeuroEvolution. Doing so, we successfully detect how structural differences in the data can bias irregular time series models on a real-world sepsis prediction task where 90% of the data is missing. Source code is available on GitHub.",
      "keywords": [
        "Irregular Time Series",
        "Explainability",
        "Interpretability",
        "Explanations",
        "Perturbations"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "K61Y6cTMRl",
      "title": "Toward Foundation Model for Multivariate Wearable Sensing of Physiological Signals",
      "abstract": "Time-series foundation models excel at tasks like forecasting across diverse data types by leveraging informative waveform representations. Wearable sensing data, however, pose unique challenges due to their variability in patterns and frequency bands, especially for healthcare-related outcomes. The main obstacle lies in crafting generalizable representations that adapt efficiently across heterogeneous sensing configurations and applications. To address this, we propose NormWear, the first multi-modal and ubiquitous foundation model designed to extract generalized and informative representations from wearable sensing data. Specifically, we design a channel-aware attention mechanism with a shared special liaison [CLS] token to detect signal patterns in both intra-sensor and inter-sensors. This helps the model to extract more meaningful information considering both time series themselves and the relationships between input sensors. This helps the model to be widely compatible with various sensors settings. NormWear is pretrained on a diverse set of physiological signals, including PPG, ECG, EEG, GSR, and IMU, from various public datasets. Our model shows exceptional generalizability across 11 public wearable sensing datasets, spanning 18 applications in mental health, body state inference, vital sign estimation, and disease risk evaluation. It consistently outperforms competitive baselines under zero-shot, partial-shot, and full-shot settings, indicating broad applicability in real-world health applications.",
      "keywords": [
        "Foundation Model",
        "Signal Processing",
        "Time Series",
        "Wearable Sensing",
        "Digital Health"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "dGSOn7sdWg",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "dGSOn7sdWg",
      "title": "SyllableLM: Learning Coarse Semantic Units for Speech Language Models",
      "abstract": "Language models require tokenized inputs. However, tokenization strategies for continuous data like audio and vision are often based on simple heuristics such as fixed sized convolutions or discrete clustering, which do not necessarily align with the semantic structure of the data. For speech in particular, the high resolution of waveforms (16,000 samples/second or more) presents a significant challenge as speech-based language models have had to use several times more tokens per word than text-based language models. In this work, we introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units while still preserving semantic information. We do this by 1) extracting noisy boundaries through analyzing correlations in pretrained encoder losses and 2) iteratively improving model representations with a novel distillation technique. Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and achieves SotA in syllabic segmentation and clustering. Using these coarse tokens, we successfully train SyllableLM, a Speech Language Model (SpeechLM) that matches or outperforms current SotA SpeechLMs on a range of spoken language modeling tasks. SyllableLM also achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup. Our code and checkpoints are available at https://www.github.com/alanbaade/SyllableLM",
      "keywords": [
        "Generative Spoken Language Modeling",
        "Audio",
        "Textless NLP",
        "Representation Learning"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "mtSSFiqW6y",
      "title": "Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment",
      "abstract": "The performance of large language models (LLMs) is closely linked to their underlying size, leading to ever-growing networks and hence slower inference. Speculative decoding has been proposed as a technique to accelerate autoregressive generation, leveraging a fast draft model to propose candidate tokens, which are then verified in parallel based on their likelihood under the target model. While this approach guarantees to reproduce the target output, it incurs a substantial penalty: many high-quality draft tokens are rejected, even when they represent objectively valid continuations. Indeed, we show that even powerful draft models such as GPT-4o, as well as human text cannot achieve high acceptance rates under the standard verification scheme. This severely limits the speedup potential of current speculative decoding methods, as an early rejection becomes overwhelmingly likely when solely relying on alignment of draft and target.\nWe thus ask the following question: Can we adapt verification to recognize correct, but non-aligned replies? To this end, we draw inspiration from the LLM-as-a-judge framework, which demonstrated that LLMs are able to rate answers in a versatile way. We carefully design a dataset coined TokenCourt to elicit the same capability in the target model by training a compact module on top of the embeddings to produce ``judgements\" of the current continuation. We showcase our strategy on the Llama-3.1 family, where our 8B/405B-Judge achieves a speedup of $9\\times$ over Llama-405B, while maintaining its quality on a large range of benchmarks. These benefits remain present even in optimized inference frameworks, where our method reaches up to $141$ tokens/s for 8B/70B-Judge and $129$ tokens/s for 8B/405B on $2$ and $8$ H100s respectively.",
      "keywords": [
        "LLM inference",
        "speculative decoding"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "fWXYD0ZCdd",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "1yJP5TVWih",
      "title": "Lambda-Skip Connections: the architectural component that prevents Rank Collapse",
      "abstract": "Rank collapse, a phenomenon where embedding vectors in sequence models\nrapidly converge to a uniform token or equilibrium state, has recently gained at-\ntention in the deep learning literature. This phenomenon leads to reduced expres-\nsivity and potential training instabilities due to vanishing gradients. Empirical ev-\nidence suggests that architectural components like skip connections, LayerNorm,\nand MultiLayer Perceptrons (MLPs) play critical roles in mitigating rank collapse.\nWhile this issue is well-documented for transformers, alternative sequence mod-\nels, such as State Space Models (SSMs), which have recently gained prominence,\nhave not been thoroughly examined for similar vulnerabilities. This paper extends\nthe theory of rank collapse from transformers to SSMs using a unifying frame-\nwork that captures both architectures. We introduce a modification in the skip\nconnection component, termed lambda-skip connections, that provides guaran-\ntees for rank collapse prevention. We present, via analytical results, a sufficient\ncondition to achieve the guarantee for all of the aforementioned architectures. We\nalso study the necessity of this condition via ablation studies and analytical exam-\nples. To our knowledge, this is the first study that provides a general guarantee to\nprevent rank collapse, and that investigates rank collapse in the context of SSMs,\noffering valuable understanding for both theoreticians and practitioners. Finally,\nwe validate our findings with experiments demonstrating the crucial role of archi-\ntectural components in preventing rank collapse.",
      "keywords": [
        "Rank Collapse",
        "Skip Connections",
        "Sequence Modeling Architectures"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "UvfI4grcM7",
      "title": "Biologically Constrained Barrel Cortex Model Integrates Whisker Inputs and Replicates Key Brain Network Dynamics",
      "abstract": "The brain's ability to transform sensory inputs into motor functions is central to neuroscience and crucial for the development of embodied intelligence. Sensory-motor integration involves complex neural circuits, diverse neuronal types, and intricate intercellular connections. Bridging the gap between biological realism and behavioral functionality presents a formidable challenge. In this study, we focus on the columnar structure of the superficial layers of mouse barrel cortex as a model system. We constructed a model comprising 4,218 neurons across 13 neuronal subtypes, with neural distribution and connection strengths constrained by anatomical experimental findings. A key innovation of our work is the development of an effective construction and training pipeline tailored for this biologically constrained model. Additionally, we converted an existing simulated whisker sweep dataset into a spiking-based format, enabling our network to be trained and tested on neural signals that more closely mimic those observed in biological systems. The results of object discrimination utilizing whisker signals demonstrate that our barrel cortex model, grounded in biological constraints, achieves a classification accuracy exceeds classical convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory networks (LSTMs), by an average of 8.6%, and is on par with recent spiking neural networks (SNNs) in performance. Interestingly, a whisker deprivation experiment, designed in accordance with neuroscience practices, further validates the perceptual capabilities of our model in behavioral tasks.\nCritically, it offers significant biological interpretability: post-training analysis reveals that neurons within our model exhibit firing characteristics and distribution patterns similar to those observed in the actual neuronal systems of the barrel cortex. This study advances our understanding of neural processing in the barrel cortex and exemplifies how integrating detailed biological structures into neural network models can enhance both scientific inquiry and artificial intelligence applications. The code is available at https://github.com/fun0515/RSNN_bfd.",
      "keywords": [
        "Barrel cortex",
        "biophysical modeling",
        "sensory-motor integration",
        "recurrent spiking neural networks"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "xPO6fwvldG",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "8enWnd6Gp3",
      "title": "TetSphere Splatting: Representing High-Quality Geometry with Lagrangian Volumetric Meshes",
      "abstract": "We introduce TetSphere Splatting, a Lagrangian geometry representation designed for high-quality 3D shape modeling. TetSphere splatting leverages an underused yet powerful geometric primitive -- volumetric tetrahedral meshes. It represents 3D shapes by deforming a collection of tetrahedral spheres, with geometric regularizations and constraints that effectively resolve common mesh issues such as irregular triangles, non-manifoldness, and floating artifacts. Experimental results on multi-view and single-view reconstruction highlight TetSphere splatting's superior mesh quality while maintaining competitive reconstruction accuracy compared to state-of-the-art methods. Additionally, TetSphere splatting demonstrates versatility by seamlessly integrating into generative modeling tasks, such as image-to-3D and text-to-3D generation.",
      "keywords": [
        "geometry representation",
        "3D modeling"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "xPO6fwvldG",
      "title": "UniRestore3D: A Scalable Framework For General Shape Restoration",
      "abstract": "Shape restoration aims to recover intact 3D shapes from defective ones, such as those that are incomplete, noisy, and low-resolution. Previous works have achieved impressive results in shape restoration subtasks thanks to advanced generative models. While effective for specific shape defects, they are less applicable in real-world scenarios involving multiple defect types simultaneously. Additionally, training on limited subsets of defective shapes hinders knowledge transfer across restoration types and thus affects generalization. In this paper, we address the task of general shape restoration, which restores shapes with various types of defects through a unified model, thereby naturally improving the applicability and scalability. Our approach first standardizes the data representation across different restoration subtasks using high-resolution TSDF grids and constructs a large-scale dataset with diverse types of shape defects. Next, we design an efficient hierarchical shape generation model and a noise-robust defective shape encoder that enables effective impaired shape understanding and intact shape generation. Moreover, we propose a scalable training strategy for efficient model training. The capabilities of our proposed method are demonstrated across multiple shape restoration subtasks and validated on various datasets, including Objaverse, ShapeNet, GSO, and ABO.",
      "keywords": [
        "Shape Restoration",
        "3D Reconstruction",
        "Diffusion Model"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "WOzffPgVjF",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "14fFV0chUS",
      "title": "TRACE: Temporal Grounding Video LLM  via Causal Event Modeling",
      "abstract": "Video Temporal Grounding (VTG) is a crucial capability for video understanding models and plays a vital role in downstream tasks such as video browsing and editing. \nTo effectively handle various tasks simultaneously and enable zero-shot prediction, there is a growing trend in employing video LLMs for VTG tasks. However, current video LLM-based methods rely exclusively on natural language generation, lacking the ability to model the clear structure inherent in videos, which restricts their effectiveness in tackling VTG tasks. To address this issue, this paper first formally introduces causal event modeling framework, which represents video LLM outputs as sequences of events, and predict the current event using previous events, video inputs, and textural instructions. Each event consists of three components: timestamps, salient scores, and textual captions. We then propose a novel task-interleaved video LLM called TRACE to effectively implement the causal event modeling framework in practice. \nThe TRACE process visual frames, timestamps, salient scores, and text as distinct tasks, employing various encoders and decoding heads for each. Task tokens are arranged in an interleaved sequence according to the causal event modeling framework's formulation.\nExtensive experiments on various VTG tasks and datasets demonstrate the superior performance of TRACE compared to state-of-the-art video LLMs. Our model and code are avaliable at \\url{https://github.com/gyxxyg/TRACE}.",
      "keywords": [
        "video large language model",
        "video temporal grounding"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "WOzffPgVjF",
      "title": "Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding",
      "abstract": "Transformer has attracted increasing interest in spatio-temporal video grounding, or STVG, owing to its end-to-end pipeline and promising result. Existing Transformer-based STVG approaches often leverage a set of object queries, which are initialized simply using zeros and then gradually learn target position information via iterative interactions with multimodal features, for spatial and temporal localization. Despite simplicity, these zero object queries, due to lacking target-specific cues, are hard to learn discriminative target information from interactions with multimodal features in complicated scenarios (e.g., with distractors or occlusion), resulting in degradation. Addressing this, we introduce a novel $\\textbf{T}$arget-$\\textbf{A}$ware Transformer for $\\textbf{STVG}$ ($\\textbf{TA-STVG}$), which seeks to adaptively generate object queries via exploring target-specific cues from the given video-text pair, for improving STVG. The key lies in two simple yet effective modules, comprising text-guided temporal sampling (TTS) and attribute-aware spatial activation (ASA), working in a cascade. The former focuses on selecting target-relevant temporal cues from a video utilizing holistic text information, while the latter aims at further exploiting the fine-grained visual attribute information of the object from previous target-aware temporal cues, which is applied for object query initialization. Compared to existing methods leveraging zero-initialized queries, object queries in our TA-STVG, directly generated from a given video-text pair, naturally carry target-specific cues, making them adaptive and better interact with multimodal features for learning more discriminative information to improve STVG. In our experiments on three benchmarks, including HCSTVG-v1/-v2 and VidSTG, TA-STVG achieves state-of-the-art performance and significantly outperforms the baseline, validating its efficacy. Moreover, TTS and ASA are designed for general purpose. When applied to existing methods such as TubeDETR and STCAT, we show substantial performance gains, verifying its generality. Code is released at https://github.com/HengLan/TA-STVG.",
      "keywords": [
        "Spatio-Temporal Video Grounding"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "BSZqpqgqM0",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "BSZqpqgqM0",
      "title": "Why Diffusion Models Don’t Memorize:  The Role of Implicit Dynamical Regularization in Training",
      "abstract": "Diffusion models have achieved remarkable success across a wide range of generative tasks. A key challenge is understanding the mechanisms that prevent their memorization of training data and allow generalization. In this work, we investigate the role of the training dynamics in the transition from generalization to memorization. Through extensive experiments and theoretical analysis, we identify two distinct timescales: an early time $\\tau_\\mathrm{gen}$ at which models begin to generate high-quality samples, and a later time $\\tau_\\mathrm{mem}$ beyond which memorization emerges. Crucially, we find that $\\tau_\\mathrm{mem}$ increases linearly with the training set size $n$, while $\\tau_\\mathrm{gen}$ remains constant. This creates a growing window of training times with $n$ where models generalize effectively, despite showing strong memorization if training continues beyond it. It is only when $n$ becomes larger than a model-dependent threshold that overfitting disappears at infinite training times.\nThese findings reveal a form of implicit dynamical regularization in the training dynamics, which allow to avoid memorization even in highly overparameterized settings. Our results are supported by numerical experiments with standard U-Net architectures on realistic and synthetic  datasets, and by a theoretical analysis using a tractable random features model studied in the high-dimensional limit.",
      "keywords": [
        "Diffusion Models",
        "Deep Learning",
        "Probabilistic Methods"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "X8C20fJXtx",
      "title": "Hierarchical Koopman Diffusion: Fast Generation with Interpretable Diffusion Trajectory",
      "abstract": "Diffusion models have achieved impressive success in high-fidelity image generation but suffer from slow sampling due to their inherently iterative denoising process. While recent one-step methods accelerate inference by learning direct noise-to-image mappings, they sacrifice the interpretability and fine-grained control intrinsic to diffusion dynamics, key advantages that enable applications like editable generation. To resolve this dichotomy, we introduce **Hierarchical Koopman Diffusion**, a novel framework that achieves both one-step sampling and interpretable generative trajectories.  Grounded in Koopman operator theory, our method lifts the nonlinear diffusion dynamics into a latent space where evolution is governed by globally linear operators, enabling closed-form trajectory solutions. This formulation not only eliminates iterative sampling but also provides full access to intermediate states, allowing manual intervention during generation. To model the multi-scale nature of images, we design a hierarchical architecture that disentangles generative dynamics across spatial resolutions via scale-specific Koopman subspaces, capturing coarse-to-fine details systematically. We empirically show that the Hierarchical Koopman Diffusion not only achieves competitive one-step generation performance but also provides a principled mechanism for interpreting and manipulating the generative process through spectral analysis. Our framework bridges the gap between fast sampling and interpretability in diffusion models, paving the way for explainable image synthesis in generative modeling.",
      "keywords": [
        "One-step Generation",
        "Diffusion Models",
        "Koopman Operators",
        "Interpretable Image Synthesis"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "XoN10bZtR9",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "e2WesV6Voe",
      "title": "Sequence Modeling with Spectral Mean Flows",
      "abstract": "A key question in sequence modeling with neural networks is how to represent and learn highly nonlinear and probabilistic state dynamics. Operator theory views such dynamics as linear maps on Hilbert spaces containing mean embedding vectors of distributions, offering an appealing but currently overlooked perspective. We propose a new approach to sequence modeling based on an operator-theoretic view of a hidden Markov model (HMM). Instead of materializing stochastic recurrence, we embed the full sequence distribution as a tensor in the product Hilbert space. A generative process is then defined as maximum mean discrepancy (MMD) gradient flow in the space of sequences. To overcome challenges with large tensors and slow sampling convergence, we introduce spectral mean flows, a novel tractable algorithm integrating two core concepts. First, we propose a new neural architecture by leveraging spectral decomposition of linear operators to derive a scalable tensor network decomposition of sequence mean embeddings. Second, we extend MMD gradient flows to time-dependent Hilbert spaces and connect them to flow matching via the continuity equation, enabling simulation-free learning and faster sampling. We demonstrate competitive results on a range of time-series modeling datasets.",
      "keywords": [
        "sequence modeling",
        "time series",
        "hidden Markov models",
        "mean embeddings",
        "linear operators",
        "maximum mean discrepancy",
        "gradient flows"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "np5NmBQL4F",
      "title": "Isometry pursuit",
      "abstract": "Isometry pursuit is a convex algorithm for identifying orthonormal column-submatrices of wide matrices.\nIt consists of a vector normalization followed by multitask basis pursuit.\nApplied to Jacobians of putative coordinate functions, it helps identify locally isometric embeddings from within interpretable dictionaries.\nWe provide theoretical and experimental results justifying this method, including a proof with realistic assumptions that such isometric submatrices, should they exist, are contained within the obtained support.\nFor problems involving coordinate selection and diversification, it offers a synergistic alternative to greedy and brute force search.",
      "keywords": [
        "Manifold learning",
        "interpretability",
        "sparse coding"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "F0JzotXYgC",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "kXdW2KySK5",
      "title": "Variance-Dependent Regret Lower Bounds for Contextual Bandits",
      "abstract": "Variance-dependent regret bounds for linear contextual bandits, which improve upon the classical $\\tilde{O}(d\\sqrt{K})$ regret bound to $\\tilde{O}(d\\sqrt{\\sum_{k=1}^K\\sigma_k^2})$, where $d$ is the context dimension, $K$ is the number of rounds, and $\\sigma^2_k$ is the noise variance in round $k$, has been widely studied in recent years. However, most existing works focus on the regret upper bounds instead of lower bounds. To our knowledge, the only lower bound is from Jia et al. (2024), which proved that for any eluder dimension $d_{\\textbf{elu}}$ and total variance budget $\\Lambda$, there exists an instance with $\\sum_{k=1}^K\\sigma_k^2\\leq \\Lambda$ for which  any algorithm incurs a variance-dependent lower bound of $\\Omega(\\sqrt{d_{\\textbf{elu}}\\Lambda})$. However, this lower bound has a $\\sqrt{d}$ gap with existing upper bounds. Moreover, it only considers a fixed total variance budget $\\Lambda$ and does not apply to a general variance sequence $\\{\\sigma_1^2,\\ldots,\\sigma_K^2\\}$.\nIn this paper, to overcome the limitations of Jia et al. (2024), we consider the general variance sequence under two settings. For a prefixed sequence, where the entire variance sequence is revealed to the learner at the beginning of the learning process, we establish a variance-dependent lower bound of $\\Omega(d \\sqrt{\\sum_{k=1}^K\\sigma_k^2 }/\\log K)$ for linear contextual bandits. For an adaptive sequence, where an adversary can generate the variance $\\sigma_k^2$ in each round $k$ based on historical observations, we show that when the adversary must generate $\\sigma_k^2$ before observing the decision set, a similar lower bound of $\\Omega(d\\sqrt{ \\sum_{k=1}^K\\sigma_k^2} /\\log^6(dK))$ holds. In both settings, our results match the upper bounds of the SAVE algorithm (Zhao et al. 2023) up to logarithmic factors. Furthermore, if the adversary can generate the variance $\\sigma_k$ after observing the decision set $\\mathcal{D}_k$, we construct a counter-example showing that it is impossible to construct a variance-dependent lower bound if the adversary properly selects variances in collaboration with the learner.\nOur lower bound proofs use a novel peeling technique that groups rounds by variance magnitude. For each group, we construct separate instances and assign the learner distinct decision sets. We believe this proof technique may be of independent interest.",
      "keywords": [
        "Bandit",
        "Reinforcement Learning"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "uEFC25uUwU",
      "title": "The $\\varphi$ Curve: The Shape of Generalization through the Lens of Norm-based Capacity Control",
      "abstract": "Understanding how the test risk scales with model complexity is a central question in machine learning. Classical theory is challenged by the learning curves observed for large over-parametrized deep networks. Capacity measures based on parameter count typically fail to account for these empirical observations. To tackle this challenge, we consider norm-based capacity measures and develop our study for random features based estimators, widely used as simplified theoretical models for more complex networks. In this context, we provide a precise characterization of how the estimator’s norm concentrates and how it governs the associated test error. Our results show that the predicted learning curve admits a phase transition from under- to over-parameterization, but no double descent behavior. This confirms that more classical U-shaped behavior is recovered considering appropriate capacity measures based on models norms rather than size. From a technical point of view, we leverage deterministic equivalence as the key tool and further develop new deterministic quantities which are of independent interest.",
      "keywords": [
        "generalization",
        "norm-based capacity",
        "deterministic equivalence"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "jXvwJ51vcK",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "CRmiX0v16e",
      "title": "Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation",
      "abstract": "Recent works on open-vocabulary 3D instance segmentation show strong promise but at the cost of slow inference speed and high computation requirements. This high computation cost is typically due to their heavy reliance on aggregated clip features from multi-view, which require computationally expensive 2D foundation models like Segment Anything (SAM) and CLIP. Consequently, this hampers their applicability in many real-world applications that require both fast and accurate predictions. To this end, we propose a novel open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D, that efficiently leverages only 2D object detection from multi-view RGB images for open-vocabulary 3D instance segmentation. \n We demonstrate that our proposed Multi-View Prompt Distribution (MVPDist) method makes use of multi-view information to account for misclassification from the object detector to predict a reliable label for 3D instance masks. Furthermore, since projections of 3D object instances are already contained within the 2D bounding boxes, we show that our proposed low granularity label maps, which require only a 2D object detector to construct, are sufficient and very fast to predict prompt IDs for 3D instance masks when used with our proposed MVPDist.\n We validate our Open-YOLO 3D on two benchmarks, ScanNet200 and Replica, \n under two scenarios: (i) with ground truth masks, where labels are required for given object proposals, and (ii) with class-agnostic 3D proposals generated from a 3D proposal network.\n Our Open-YOLO 3D achieves state-of-the-art performance on both datasets while obtaining up to $\\sim$16$\\times$ speedup compared to the best existing method in literature. On ScanNet200 val. set, our Open-YOLO 3D achieves mean average precision (mAP) of 24.7% while operating at 22 seconds per scene. github.com/aminebdj/OpenYOLO3D",
      "keywords": [
        "Open Vocabulary",
        "3D point cloud instance segmentation"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "yXCTDhZDh6",
      "title": "Point-SAM: Promptable 3D Segmentation Model for Point Clouds",
      "abstract": "The development of 2D foundation models for image segmentation has been significantly advanced by the Segment Anything Model (SAM). However, achieving similar success in 3D models remains a challenge due to issues such as non-unified data formats, poor model scalability, and the scarcity of labeled data with diverse masks. To this end, we propose a 3D promptable segmentation model Point-SAM, focusing on point clouds. We employ an efficient transformer-based architecture tailored for point clouds, extending SAM to the 3D domain. We then distill the rich knowledge from 2D SAM for Point-SAM training by introducing a data engine to generate part-level and object-level pseudo-labels at scale from 2D SAM. Our model outperforms state-of-the-art 3D segmentation models on several indoor and outdoor benchmarks and demonstrates a variety of applications, such as interactive 3D annotation and zero-shot 3D instance proposal.",
      "keywords": [
        "3D vision",
        "promptable segmentation",
        "point cloud segmentation"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "I4fBSpDOha",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "I4fBSpDOha",
      "title": "Focus-Then-Reuse: Fast Adaptation in Visual Perturbation Environments",
      "abstract": "Visual reinforcement learning has shown promise in various real-world applications. However, deploying policies in complex real-world environments with visual perturbations remains a significant challenge. We notice that humans tend to filter information at the object level prior to decision-making, facilitating efficient skill transfer across different contexts. Inspired by this, we introduce Focus-Then-Reuse (FTR), a method utilizing a novel object selection mechanism to focus on task-relevant objects, and directly reuse the simulation-trained policy on them. The training of the object selection mechanism integrates prior knowledge from a vision-language model and feedback from the environment. Experimental results on challenging tasks based on DeepMind Control Suite and Franka Emika Robotics demonstrate that FTR enables rapid adaptation in visual perturbation environments and achieves state-of-the-art performance. The source code is available at https://github.com/LAMDA-RL/FTR.",
      "keywords": [
        "reinforcement learning; visual domain adaptation"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "neZSGqhxDa",
      "title": "Absolute Zero: Reinforced Self-play Reasoning with Zero Data",
      "abstract": "Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from rule-based outcome rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external human or distillation data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability. AZR uses a code executor to both validate self-proposed code reasoning tasks and verify answers, serving as an unified source of verifiable feedback to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.",
      "keywords": [
        "reasoning",
        "language model",
        "reinforcement learning",
        "self-play",
        "LLM"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "jXvwJ51vcK",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "jXvwJ51vcK",
      "title": "Multimodality Helps Few-shot 3D Point Cloud Semantic Segmentation",
      "abstract": "Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal annotated support samples. While existing FS-PCS methods have shown promise, they primarily focus on unimodal point cloud inputs, overlooking the potential benefits of leveraging multimodal information. In this paper, we address this gap by introducing a multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality. Under this easy-to-achieve setup, we present the MultiModal Few-Shot SegNet (MM-FSS), a model effectively harnessing complementary information from multiple modalities. MM-FSS employs a shared backbone with two heads to extract intermodal and unimodal visual features, and a pretrained text encoder to generate text embeddings. To fully exploit the multimodal information, we propose a Multimodal Correlation Fusion (MCF) module to generate multimodal correlations, and a Multimodal Semantic Fusion (MSF) module to refine the correlations using text-aware semantic guidance. Additionally, we propose a simple yet effective Test-time Adaptive Cross-modal Calibration (TACC) technique to mitigate training bias, further improving generalization. Experimental results on S3DIS and ScanNet datasets demonstrate significant performance improvements achieved by our method. The efficacy of our approach indicates the benefits of leveraging commonly-ignored free modalities for FS-PCS, providing valuable insights for future research. The code is available at github.com/ZhaochongAn/Multimodality-3D-Few-Shot.",
      "keywords": [
        "3D segmentation",
        "3D point cloud",
        "few-shot segmentation",
        "multimodality",
        "few-shot point cloud semantic segmentation"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "yXCTDhZDh6",
      "title": "Point-SAM: Promptable 3D Segmentation Model for Point Clouds",
      "abstract": "The development of 2D foundation models for image segmentation has been significantly advanced by the Segment Anything Model (SAM). However, achieving similar success in 3D models remains a challenge due to issues such as non-unified data formats, poor model scalability, and the scarcity of labeled data with diverse masks. To this end, we propose a 3D promptable segmentation model Point-SAM, focusing on point clouds. We employ an efficient transformer-based architecture tailored for point clouds, extending SAM to the 3D domain. We then distill the rich knowledge from 2D SAM for Point-SAM training by introducing a data engine to generate part-level and object-level pseudo-labels at scale from 2D SAM. Our model outperforms state-of-the-art 3D segmentation models on several indoor and outdoor benchmarks and demonstrates a variety of applications, such as interactive 3D annotation and zero-shot 3D instance proposal.",
      "keywords": [
        "3D vision",
        "promptable segmentation",
        "point cloud segmentation"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "hUb2At2DsQ",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "B5RrIFMqbe",
      "title": "FormalAlign: Automated Alignment Evaluation for Autoformalization",
      "abstract": "Autoformalization aims to convert informal mathematical proofs into machine-verifiable formats, bridging the gap between natural and formal languages. However, ensuring semantic alignment between the informal and formalized statements remains challenging. Existing approaches heavily rely on manual verification, hindering scalability. To address this, we introduce FormalAlign, a framework for automatically evaluating the alignment between natural and formal languages in autoformalization. FormalAlign trains on both the autoformalization sequence generation task and the representational alignment between input and output, employing a dual loss that combines a pair of mutually enhancing autoformalization and alignment tasks. Evaluated across four benchmarks augmented by our proposed misalignment strategies, FormalAlign demonstrates superior performance. In our experiments, FormalAlign outperforms GPT-4, achieving an Alignment-Selection Score 11.58\\% higher on \\forml-Basic (99.21\\% vs. 88.91\\%) and 3.19\\% higher on MiniF2F-Valid (66.39\\% vs. 64.34\\%). This effective alignment evaluation significantly reduces the need for manual verification.",
      "keywords": [
        "Large Language models",
        "Autoformalization",
        "Lean 4",
        "Formal Math",
        "AI for Math"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "WJaUkwci9o",
      "title": "Self-Improvement in Language Models: The Sharpening Mechanism",
      "abstract": "Recent work in language modeling has raised the possibility of “self-improvement,” where an LLM evaluates and refines its own generations to achieve higher performance without external feedback. It is impossible for this self-improvement to create information that is not already in the model, so why should we expect that this will lead to improved capabilities? We offer a new theoretical perspective on the capabilities of self-improvement through a lens we refer to as “sharpening.” Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training in order to ‘sharpen’ the model to one placing large mass on high-quality sequences, thereby amortizing the expensive inference-time computation of generating good sequences. We begin by introducing a new statistical framework for sharpening in which the learner has sample access to a pre-trained base policy. Then, we analyze two natural families of self improvement algorithms based on SFT and RLHF. We find that (i) the SFT-based approach is minimax optimal whenever the initial model has sufficient coverage, but (ii) the RLHF-based approach can improve over SFT-based self- improvement by leveraging online exploration, bypassing the need for coverage. We view these findings as a starting point toward a foundational understanding that can guide the design and evaluation of self-improvement algorithms.",
      "keywords": [
        "Learning theory",
        "Sample complexity",
        "Self-Improvement",
        "Language Models"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "JbJVWljk7r",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "1b7whO4SfY",
      "title": "Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free",
      "abstract": "Gating mechanisms have been widely utilized, from early models like LSTMs and Highway Networks to recent state space models, linear attention, and also softmax attention.\nYet, existing literature rarely examines the specific effects of gating.\nIn this work, we conduct comprehensive experiments to systematically investigate gating-augmented softmax attention variants.\nSpecifically, we perform a comprehensive comparison over 30 variants of 15B Mixture-of-Experts (MoE) models and 1.7B dense models trained on a 3.5 trillion token dataset.\nOur central finding is that a simple modification—applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA)—consistently improves performance.\nThis modification also enhances training stability, tolerates larger learning rates, and improves scaling properties.\nBy comparing various gating positions and computational variants, we attribute this effectiveness to two key factors: (1) introducing non-linearity upon the low-rank mapping in the softmax attention, and (2) applying query-dependent sparse gating scores to modulate the SDPA output.\nNotably, we find this sparse gating mechanism mitigates `massive activation`, `attention sink` and enhances long-context extrapolation performance. \nWe also release related codes (https://github.com/qiuzh20/gated_attention}) and models (https://huggingface.co/QwQZh/gated_attention) to facilitate future research.\nFurthermore, the most effective SDPA output gating is used in the Qwen3-Next models (https://huggingface.co/collections/Qwen/qwen3-next).",
      "keywords": [
        "Attention",
        "Large Language Model",
        "Gating"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "o9iReV4FGm",
      "title": "Fast attention mechanisms: a tale of parallelism",
      "abstract": "Transformers have the representational capacity to simulate Massively Parallel Computation (MPC) algorithms, but they suffer from quadratic time complexity, which severely limits their scalability. We introduce an efficient attention mechanism called Approximate Nearest Neighbor Attention (ANNA) with sub-quadratic time complexity. We prove that ANNA-transformers (1) retain the expressive power previously established for standard attention in terms of matching the capabilities of MPC algorithms, and (2) can solve key reasoning tasks such as Match2 and $k$-hop with near-optimal depth. Using the MPC framework, we further prove that constant-depth ANNA-transformers can simulate constant-depth low-rank transformers, thereby providing a unified way to reason about a broad class of efficient attention approximations.",
      "keywords": [
        "Transformer theory",
        "representational strength",
        "nearest neighbor search",
        "massively parallel computation"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "jfwe9qNqRi",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "4FVGowGzQb",
      "title": "Learning from negative feedback, or positive feedback or both",
      "abstract": "Existing preference optimization methods often assume scenarios where paired preference feedback (preferred/positive vs. dis-preferred/negative examples) is available. This requirement limits their applicability in scenarios where only unpaired feedback—for example, either positive or negative— is available. To address this, we introduce a novel approach that decouples learning from positive and negative feedback. This decoupling enables control over the influence of each feedback type and, importantly, allows learning even when only one feedback type is present. A key contribution is demonstrating stable learning from negative feedback alone, a capability not well-addressed by current methods. Our approach builds upon the probabilistic framework introduced in (Dayan and Hinton, 1997), which uses expectation-maximization (EM) to directly optimize the probability of positive outcomes (as opposed to classic expected reward maximization). We address a key limitation in current EM-based methods: they solely maximize the likelihood of positive examples, while neglecting negative ones. We show how to extend EM algorithms to explicitly incorporate negative examples, leading to a theoretically grounded algorithm that offers an intuitive and versatile way to learn from both positive and negative feedback. We evaluate our approach for training language models based on human feedback as well as training policies for sequential decision-making problems, where learned value functions are available.",
      "keywords": [
        "Preference Optimization",
        "Policy Optimization",
        "Negative Feedback",
        "Positive feedback",
        "Reinforcement Learning",
        "Probabilistic Inference"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "jfwe9qNqRi",
      "title": "SimPER: A Minimalist Approach to Preference  Alignment without Hyperparameters",
      "abstract": "Existing preference optimization objectives for language model alignment require additional hyperparameters that must be extensively tuned to achieve optimal performance, increasing both the complexity and time required for fine-tuning large language models. In this paper, we propose a simple yet effective hyperparameter-free preference optimization algorithm for alignment. We observe that promising performance can be achieved simply by optimizing inverse perplexity, which is calculated as the inverse of the exponentiated average log-likelihood of the chosen and rejected responses in the preference dataset. The resulting simple learning objective, SimPER, is easy to implement and eliminates the need for expensive hyperparameter tuning and a reference model, making it both computationally and memory efficient. Extensive experiments on widely used real-world benchmarks, including MT-Bench, AlpacaEval 2, and 10 key benchmarks of the Open LLM Leaderboard with 5 base models, demonstrate that SimPER consistently and significantly outperforms existing approaches—even without any hyperparameters or a reference model. For example, despite its simplicity, SimPER outperforms state-of-the-art methods by up to 5.7 points on AlpacaEval 2 and achieves the highest average ranking across 10 benchmarks on the Open LLM Leaderboard. The source code for SimPER is publicly available at: https://github.com/tengxiao1/SimPER.",
      "keywords": [
        "Large Language Model",
        "Alignment",
        "RLHF"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "N5fVv6PZGz",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "N5fVv6PZGz",
      "title": "Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models",
      "abstract": "Large Language Models (LLMs) with the Mixture-of-Experts (MoE) architectures have shown promising performance on various tasks. However, due to the huge model sizes, running them in resource-constrained environments where the GPU memory is not abundant is challenging. Some existing systems propose to use CPU resources to solve that, but they either suffer from the significant overhead of frequently moving data between CPU and GPU, or fail to consider distinct characteristics of CPUs and GPUs. This paper proposes Fiddler, a resource-efficient inference system for MoE models with limited GPU resources. Fiddler strategically utilizes CPU and GPU resources by determining the optimal execution strategy. Our evaluation shows that, unlike state-of-the-art systems that optimize for specific scenarios such as single batch inference or long prefill, Fiddler performs better in all scenarios. Compared against different baselines, Fiddler achieves 1.26 times speed up in single batch inference, 1.30 times in long prefill processing, and 11.57 times in beam search inference. The code of Fiddler is publicly available at https://github.com/efeslab/fiddler.",
      "keywords": [
        "NLP in resource-constrained settings",
        "inference methods"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "ob7UrZOJve",
      "title": "Inheritune: Training Smaller Yet More Attentive Language Models",
      "abstract": "Large Language Models (LLMs) have achieved remarkable performance across various natural language processing tasks, primarily due to the transformer architecture and its self-attention mechanism. However, we observe that in standard decoder-style LLMs attention matrices degenerate to single-column for deeper layers. Layers in this state unable to learn anything meaningful and mostly redundant; we refer to these as lazy layers. The goal of this paper is to train smaller models by eliminating this structural inefficiency without compromising performance.\n\nMotivated by this observation, we propose Inheritune, a simple yet effective training recipe for developing smaller, high-performing language models. Smaller models trained with Inheritune inherits early transformer layers from a larger pre-trained model, then retrains and progressively expands the smaller model until it matches or exceeds the performance of the larger model. We demonstrate that Inheritune enables the training of various sizes of GPT-2 models on datasets like OpenWebText-9B and FineWeb\\_Edu. Models trained with Inheritune, despite having significantly fewer layers, match or even surpass the performance of their larger counterparts. For instance, our 16-layer GPT-2 medium variant achieves comparable performance to the standard 24-layer GPT-2 medium model.",
      "keywords": [
        "Large Language Models",
        "Small Language Models",
        "Attention degeneration",
        "Efficient training",
        "Model Initialization"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "YSA0QeYnDd",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "S9GyQUXzee",
      "title": "GROOT-2: Weakly Supervised Multimodal Instruction Following Agents",
      "abstract": "Developing agents that can follow multimodal instructions remains a fundamental challenge in robotics and AI. Although large-scale pre-training on unlabeled datasets has enabled agents to learn diverse behaviors, these agents often struggle with following instructions. While augmenting the dataset with instruction labels can mitigate this issue, acquiring such high-quality annotations at scale is impractical. \nTo address this issue, we frame the problem as a semi-supervised learning task and introduce \\agent, a multimodal instructable agent trained using a novel approach that combines weak supervision with latent variable models. Our method consists of two key components: constrained self-imitating, which utilizes large amounts of unlabeled demonstrations to enable the policy to learn diverse behaviors, and human intention alignment, which uses a smaller set of labeled demonstrations to ensure the latent space reflects human intentions. \\agent’s effectiveness is validated across four diverse environments, ranging from video games to robotic manipulation, demonstrating its robust multimodal instruction-following capabilities.",
      "keywords": [
        "Reinforcement Learning",
        "Open-world Agent",
        "Weakly Supervised Learning",
        "Goal-Conditioned Policy"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "YSA0QeYnDd",
      "title": "Inference of Evolving Mental States from Irregular Action Events to Understand Human Behaviors",
      "abstract": "Inference of latent human mental processes, such as belief, intention, or desire, is crucial for developing AI with human-like intelligence, enabling more effective and timely collaboration. In this paper, we introduce a versatile encoder-decoder model designed to infer  evolving mental processes based on irregularly observed action events and predict future occurrences. The primary challenges arise from two factors: both actions and mental processes are irregular events, and the observed action data is often limited. To address the irregularity of these events, we leverage a temporal point process model within the encoder-decoder framework, effectively capturing the dynamics of both action and mental events. Additionally, we implement a backtracking mechanism in the decoder to enhance the accuracy of predicting future actions and evolving mental states. To tackle the issue of limited data, our model incorporates logic rules as priors, enabling accurate inferences from just a few observed samples. These logic rules can be refined and updated as needed, providing flexibility to the model. Overall, our approach enhances the understanding of human behavior by predicting when actions will occur and how mental processes evolve. Experiments on both synthetic and real-world datasets demonstrate the strong performance of our model in inferring mental states and predicting future actions, contributing to the development of more human-centric AI systems.",
      "keywords": [
        "temporal point process",
        "logic rule",
        "human-AI collaboration"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "2Ri68h7bD1",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "5YMZfufpfY",
      "title": "Improving Energy Natural Gradient Descent through Woodbury, Momentum, and Randomization",
      "abstract": "Natural gradient methods significantly accelerate the training of Physics-Informed Neural Networks (PINNs), but are often prohibitively costly. We introduce a suite of techniques to improve the accuracy and efficiency of energy natural gradient descent (ENGD) for PINNs. First, we leverage the Woodbury formula to dramatically reduce the computational complexity of ENGD. Second, we adapt the Subsampled Projected-Increment Natural Gradient Descent algorithm from the variational Monte Carlo literature to accelerate the convergence. Third, we explore the use of randomized algorithms to further reduce the computational cost in the case of large batch sizes. We find that randomization accelerates progress in the early stages of training for low-dimensional problems, and we identify key barriers to attaining acceleration in other scenarios. Our numerical experiments demonstrate that our methods outperform previous approaches, achieving the same $L^2$ error as the original ENGD up to $75\\times$ faster.",
      "keywords": [
        "Physics-Informed Neural Networks",
        "Natural Gradient Descent",
        "Woodbury Matrix Identity",
        "Momentum",
        "Nyström Approximation"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "uWj4s7rMnR",
      "title": "Mean Flows for One-step Generative Modeling",
      "abstract": "We propose a principled and effective framework for one-step generative modeling. We introduce the notion of average velocity to characterize flow fields, in contrast to instantaneous velocity modeled by Flow Matching methods. A well-defined identity between average and instantaneous velocities is derived and used to guide neural network training. Our method, termed the \\textit{MeanFlow} model, is self-contained and requires no pre-training, distillation, or curriculum learning. MeanFlow demonstrates strong empirical performance: it achieves an FID of 3.43 with a single function evaluation (1-NFE) on ImageNet 256$\\times$256 trained from scratch, significantly outperforming previous state-of-the-art one-step diffusion/flow models. Our study substantially narrows the gap between one-step diffusion/flow models and their multi-step predecessors, and we hope it will motivate future research to revisit the foundations of these powerful models.",
      "keywords": [
        "Generative Models"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "J2Jyp1SZ0n",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "J2Jyp1SZ0n",
      "title": "MMSearch: Unveiling the Potential of Large Models as Multi-modal Search Engines",
      "abstract": "The advent of Large Language Models (LLMs) has paved the way for AI search engines, e.g., SearchGPT, showcasing a new paradigm in human-internet interaction. However, most current AI search engines are limited to text-only settings, neglecting the multimodal user queries and the text-image interleaved nature of website information. Recently, Large Multimodal Models (LMMs) have made impressive strides. Yet, whether they can function as AI search engines remains under-explored, leaving the potential of LMMs in multimodal search an open question. To this end, we first design a delicate pipeline, MMSearch-Engine, to empower any LMMs with multimodal search capabilities. On top of this, we introduce MMSearch, a comprehensive evaluation benchmark to assess the multimodal search performance of LMMs. The curated dataset contains 300 manually collected instances spanning 14 subfields, which involves no overlap with the current LMMs' training data, ensuring the correct answer can only be obtained within searching. By using MMSearch-Engine, the LMMs are evaluated by performing three individual tasks (requery, rerank, and summarization), and one challenging end-to-end task with a complete searching process. We conduct extensive experiments on closed-source and open-source LMMs. Among all tested models, GPT-4o with MMSearch-Engine achieves the best results, which surpasses the commercial product, Perplexity Pro, in the end-to-end task, demonstrating the effectiveness of our proposed pipeline. We further present error analysis to unveil current LMMs still struggle to fully grasp the multimodal search tasks, and conduct ablation study to indicate the potential of scaling test-time computation for AI search engine. We hope MMSearch may provide unique insights to guide the future development of multimodal AI search engine.",
      "keywords": [
        "Large Multimodal Model",
        "AI Search Engine",
        "Benchmark"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "f4gF6AIHRy",
      "title": "Combatting Dimensional Collapse in LLM Pre-Training Data via Submodular File Selection",
      "abstract": "Selecting high-quality pre-training data for large language models (LLMs) is crucial for enhancing their overall performance under limited computation budget, improving both training and sample efficiency. Recent advancements in file selection primarily rely on using an existing or trained proxy model to assess the similarity of samples to a target domain, such as high quality sources BookCorpus and Wikipedia. However, upon revisiting these methods, the domain-similarity selection criteria demonstrates a diversity dilemma, i.e. dimensional collapse in the feature space, improving performance on the domain-related tasks but causing severe degradation on generic performance.To prevent collapse and enhance diversity, we propose a DiverSified File selection algorithm (DiSF), which selects the most decorrelated text files in the feature space. We approach this with a classical greedy algorithm to achieve more uniform eigenvalues in the feature covariance matrix of the selected texts, analyzing its approximation to the optimal solution under a formulation of $\\gamma$-weakly submodular optimization problem. Empirically, we establish a benchmark and conduct extensive experiments on the TinyLlama architecture with models from 120M to 1.1B parameters. Evaluating across nine tasks from the Harness framework, DiSF demonstrates a significant improvement on overall performance. Specifically, DiSF saves 98.5\\% of 590M training files in SlimPajama, outperforming the full-data pre-training within a 50B training budget, and achieving about 1.5x training efficiency and 5x data efficiency. Source code\nis available at: https://github.com/MediaBrain-SJTU/DiSF.git.",
      "keywords": [
        "file selection",
        "large language model",
        "pre-training",
        "submodular optimization"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "TLgDQ0Rr2Z",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "rPkCVSsoM4",
      "title": "A Causal Lens for Learning Long-term Fair Policies",
      "abstract": "Fairness-aware learning studies the development of algorithms that avoid discriminatory decision outcomes despite biased training data. While most studies have concentrated on immediate bias in static contexts, this paper highlights the importance of investigating long-term fairness in dynamic decision-making systems while simultaneously considering instantaneous fairness requirements. In the context of reinforcement learning, we propose a general framework where long-term fairness is measured by the difference in the average expected qualification gain that individuals from different groups could obtain. Then, through a causal lens, we decompose this metric into three components that represent the direct impact, the delayed impact, as well as the spurious effect the policy has on the qualification gain. We analyze the intrinsic connection between these components and an emerging fairness notion called benefit fairness that aims to control the equity of outcomes in decision-making. Finally, we develop a simple yet effective approach for balancing various fairness notions.",
      "keywords": [
        "long-term fairness",
        "fair reinforcement learning",
        "causal decomposition"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "stUKwWBuBm",
      "title": "Tractable Multi-Agent Reinforcement Learning through Behavioral Economics",
      "abstract": "A significant roadblock to the development of principled multi-agent reinforcement learning (MARL) algorithms is the fact that desired solution concepts like Nash equilibria may be intractable to compute. We show how one can overcome this obstacle by introducing concepts from behavioral economics into MARL. To do so, we imbue agents with two key features of human decision-making: risk aversion and bounded rationality. We show that introducing these two properties into games gives rise to a class of equilibria---risk-averse quantal response equilibria (RQE)---which are tractable to compute in \\emph{all} $n$-player matrix and finite-horizon Markov games.  In particular, we show that they emerge as the endpoint of no-regret learning in suitably adjusted versions of the games. Crucially, the class of computationally tractable RQE is independent of the underlying game structure and only depends on agents' degrees of risk-aversion and bounded rationality.  To validate the expressivity of this class of solution concepts we show that it captures peoples' patterns of play in a number of 2-player matrix games previously studied in experimental economics. Furthermore, we give a first analysis of the sample complexity of computing these equilibria in finite-horizon Markov games when one has access to a generative model. We validate our findings on a simple multi-agent reinforcement learning benchmark. Our results open the doors for to the principled development of new decentralized multi-agent reinforcement learning algorithms.",
      "keywords": [
        "behavioral economics",
        "risk-aversion",
        "multi-agent reinforcement learning",
        "quantal response",
        "bounded rationality"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "jTaxGFy34h",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "ODgWBaErst",
      "title": "Distribution-Aware Tensor Decomposition for Compression of Convolutional Neural Networks",
      "abstract": "Neural networks are widely used for image–related tasks but typically demand considerable computing power. Once a network has been trained, however, its memory‑ and compute‑footprint can be reduced by compression. In this work, we focus on compression through tensorization and low‑rank representations. Whereas classical approaches search for a low‑rank approximation by minimizing an isotropic norm such as the Frobenius norm in weight‑space, we use data‑informed norms that measure the error in function space. Concretely, we minimize the change in the layer’s output distribution, which can be expressed as $\\lVert (W - \\widetilde{W}) \\Sigma^{1/2}\\rVert_F$ where $\\Sigma^{1/2}$ is the square root of the covariance matrix of the layer’s input and $W$, $\\widetilde{W}$ are the original and compressed weights. We propose new alternating least square algorithms for the two most common tensor decompositions (Tucker‑2 and CPD) that directly optimize the new norm. Unlike conventional compression pipelines, which almost always require post‑compression fine‑tuning, our data‑informed approach often achieves competitive accuracy without any fine‑tuning. We further show that the same covariance‑based norm can be transferred from one dataset to another with only a minor accuracy drop, enabling compression even when the original training dataset is unavailable.\nExperiments on several CNN architectures (ResNet‑18/50, and GoogLeNet) and datasets (ImageNet, FGVC‑Aircraft, Cifar10, and Cifar100) confirm the advantages of the proposed method.",
      "keywords": [
        "Compression of neural networks",
        "tensor decomposition",
        "low-rank approximation",
        "distribution-aware norm"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "jTaxGFy34h",
      "title": "Robust Wasserstein  $k$-center Clustering: Algorithms and Acceleration",
      "abstract": "The classical metric $k$-center problem is widely used in data representation tasks. However, real-world datasets often contain noise and exhibit complex structures, making the traditional metric $k$-center problem insufficient for such scenarios. To address these challenges, we present the \\textbf{R}obust \\textbf{W}asserstein \\textbf{C}enter clustering (RWC-clustering)  problem.\nCompared to the classical setting, the main challenge in designing an algorithm for the RWC-clustering problem lies in effectively handling noise in the cluster centers. To this end, we introduce a dedicated purification step to eliminate noise, based on which we develop our clustering algorithm.\nFurthermore, when dealing with large-scale datasets, both storage and computation become highly resource-intensive. To alleviate this, we adopt the \\textit{coreset} technique to improve the computational and storage efficiency by compressing the dataset.  \nRoughly speaking, this coreset method enables us to calculate the objective value on a small-size coreset, while ensuring a close approximation to the value on the original dataset in theory; thus, it substantially saves the storage and computation resources.  \nFinally, experimental results show the effectiveness of our RWC-clustering  problem and the efficiency of the coreset method.",
      "keywords": [
        "clustering; coreset; Wasserstein distance"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "k38Th3x4d9",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "m08aK3xxdJ",
      "title": "CATCH: Channel-Aware Multivariate Time Series Anomaly Detection via Frequency Patching",
      "abstract": "Anomaly detection in multivariate time series is challenging as heterogeneous subsequence anomalies may occur. Reconstruction-based methods, which focus on learning normal patterns in the frequency domain to detect diverse abnormal subsequences, achieve promising results, while still falling short on capturing fine-grained frequency characteristics and channel correlations. To contend with the limitations, we introduce CATCH, a framework based on frequency patching. We propose to patchify the frequency domain into frequency bands, which enhances its ability to capture fine-grained frequency characteristics. To perceive appropriate channel correlations, we propose a Channel Fusion Module (CFM), which features a patch-wise mask generator and a masked-attention mechanism. Driven by a bi-level multi-objective optimization algorithm, the CFM is encouraged to iteratively discover appropriate patch-wise channel correlations, and to cluster relevant channels while isolating adverse effects from irrelevant channels. Extensive experiments on 10 real-world datasets and 12 synthetic datasets demonstrate that CATCH achieves state-of-the-art performance. We make our code and datasets available at https://github.com/decisionintelligence/CATCH.",
      "keywords": [
        "Multivariate Time Series",
        "Anomaly Detection"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "v5BouOktUP",
      "title": "Multivariate Time-series Forecasting with SPACE: Series Prediction Augmented by Causality Estimation",
      "abstract": "The analysis of multivariate time series (MTS) presents a complex yet crucial task with substantial applications in areas such as weather forecasting, policy formulation, and stock market prediction. It is important to highlight three key characteristics of MTS that contribute to the challenging and multifaceted nature of their analysis: (i) their interrelationships are represented through causal relationships rather than mere similarities; (ii) they convey information across multiple independent factors; and (iii) their dynamics often arise from inherent temporal dependencies. While conventional time series analysis frameworks often fail to capture one or more of these aspects, resulting in incomplete or even misleading conclusions, we propose an end-to-end trainable $\\textbf{S}$eries $\\textbf{P}$rediction model $\\textbf{A}$ugmented by $\\textbf{C}$ausality $\\textbf{E}$stimation (SPACE) to address these limitations. This model effectively incorporates temporal dependencies and causal relationships, featuring a temporal embedding and a transfer entropy-based Cross-TE module designed to enhance predictions through causality-augmented mechanisms. Experiments demonstrate that SPACE achieves state-of-the-art results on challenging real-world time series prediction tasks, showing its effectiveness and versatility.",
      "keywords": [
        "Time Series Forecasting",
        "Causal Learning",
        "Transfer Entropy",
        "Graph Based Learning"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "Y6aHdDNQYD",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "9xHlhKLu1h",
      "title": "RobuRCDet: Enhancing Robustness of Radar-Camera Fusion in Bird's Eye View for 3D Object Detection",
      "abstract": "While recent low-cost radar-camera approaches have shown promising results in\nmulti-modal 3D object detection, both sensors face challenges from environmen-\ntal and intrinsic disturbances. Poor lighting or adverse weather conditions de-\ngrade camera performance, while radar suffers from noise and positional ambigu-\nity. Achieving robust radar-camera 3D object detection requires consistent perfor-\nmance across varying conditions, a topic that has not yet been fully explored. In\nthis work, we first conduct a systematic analysis of robustness in radar-camera de-\ntection on five kinds of noises and propose RobuRCDet, a robust object detection\nmodel in bird’s eye view (BEV). Specifically, we design a 3D Gaussian Expan-\nsion (3DGE) module to mitigate inaccuracies in radar points, including position,\nRadar Cross-Section (RCS), and velocity. The 3DGE uses RCS and velocity priors\nto generate a deformable kernel map and variance for kernel size adjustment and\nvalue distribution. Additionally, we introduce a weather-adaptive fusion module,\nwhich adaptively fuses radar and camera features based on camera signal confi-\ndence. Extensive experiments on the popular benchmark, nuScenes, show that\nour RobuRCDet achieves competitive results in regular and noisy conditions. The\nsource codes and trained models will be made available.",
      "keywords": [
        "3D Vision， Radar Camera 3D Object Detection"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "Y6aHdDNQYD",
      "title": "MOS: Model Synergy for Test-Time Adaptation on LiDAR-Based 3D Object Detection",
      "abstract": "LiDAR-based 3D object detection is crucial for various applications but often experiences performance degradation in real-world deployments due to domain shifts. While most studies focus on cross-dataset shifts, such as changes in environments and object geometries, practical corruptions from sensor variations and weather conditions remain underexplored. In this work, we propose a novel online test-time adaptation framework for 3D detectors that effectively tackles these shifts, including a challenging $\\textit{cross-corruption}$ scenario where cross-dataset shifts and corruptions co-occur. By leveraging long-term knowledge from previous test batches, our approach mitigates catastrophic forgetting and adapts effectively to diverse shifts. Specifically, we propose a Model Synergy (MOS) strategy that dynamically selects historical checkpoints with diverse knowledge and assembles them to best accommodate the current test batch. This assembly is directed by our proposed Synergy Weights (SW), which perform a weighted averaging of the selected checkpoints, minimizing redundancy in the composite model. The SWs are computed by evaluating the similarity of predicted bounding boxes on the test data and the independence of features between checkpoint pairs in the model bank. To maintain an efficient and informative model bank, we discard checkpoints with the lowest average SW scores, replacing them with newly updated models. Our method was rigorously tested against existing test-time adaptation strategies across three datasets and eight types of corruptions, demonstrating superior adaptability to dynamic scenes and conditions. Notably, it achieved a 67.3% improvement in a challenging cross-corruption scenario, offering a more comprehensive benchmark for adaptation. Source code: https://github.com/zhuoxiao-chen/MOS.",
      "keywords": [
        "Test-Time Adaptation",
        "3D Object Detection"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "E2PFv7ad3p",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "E2PFv7ad3p",
      "title": "Have the VLMs Lost Confidence? A Study of Sycophancy in VLMs",
      "abstract": "In the study of LLMs, sycophancy represents a prevalent hallucination that poses significant challenges to these models. Specifically, LLMs often fail to adhere to original correct responses, instead blindly agreeing with users' opinions, even when those opinions are incorrect or malicious. However, research on sycophancy in visual language models (VLMs) has been scarce. In this work, we extend the exploration of sycophancy from LLMs to VLMs, introducing the MM-SY benchmark to evaluate this phenomenon. We present evaluation results from multiple representative models, addressing the gap in sycophancy research for VLMs. To mitigate sycophancy, we propose a synthetic dataset for training and employ methods based on prompts, supervised fine-tuning, and DPO. Our experiments demonstrate that these methods effectively alleviate sycophancy in VLMs. Additionally, we probe VLMs to assess the semantic impact of sycophancy and analyze the attention distribution of visual tokens. Our findings indicate that the ability to prevent sycophancy is predominantly observed in higher layers of the model. The lack of attention to image knowledge in these higher layers may contribute to sycophancy, and enhancing image attention at high layers proves beneficial in mitigating this issue.",
      "keywords": [
        "Multi-modal Model",
        "Visual-Language Model",
        "Sycophancy",
        "Hallucination"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "SPS6HzVzyt",
      "title": "Context-Parametric Inversion: Why Instruction Finetuning May Not Actually Improve Context Reliance",
      "abstract": "Large Language Model's are instruction-finetuned to enhance their ability to follow user instructions and better comprehend input context. Still, they often struggle to follow the input context, especially when it contradicts model's parametric knowledge. This manifests as various failures, such as hallucinations where a model inserts outdated or unwarranted facts into its response. In this work, we observe an intriguing phenomenon: the context reliance of the model decreases as instruction finetuning progresses, $\\textit{despite an initial expected increase}$. We call this phenomenon as the $\\textbf{context-parametric inversion}$. This is surprising, as one would expect instruction tuning to improve the model's ability to follow input instructions.  We observe this behavior on multiple general purpose instruction tuning datasets such as TULU, Alpaca and Ultrachat, across multiple model families like Llama, Mistral and Pythia.  We perform various controlled studies to eliminate some simple hypothesis for this observed behavior and isolate what datapoints cause this counter-intuitive behavior. We then analyze the phenomenon theoretically, to explain why context reliance varies across the trajectory of finetuning. \nWe tie the observed context-parametric inversion to the properties of the finetuning data, which provides us with some potential mitigation strategies that provide limited but insightful gains.",
      "keywords": [
        "Instruction finetuning",
        "context-vs-parametric reliance"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "IoSLbwZkal",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "IoSLbwZkal",
      "title": "On the Stability of Graph Convolutional Neural Networks: A Probabilistic Perspective",
      "abstract": "Graph convolutional neural networks (GCNNs) have emerged as powerful tools for analyzing graph-structured data, achieving remarkable success across diverse applications. However, the theoretical understanding of the stability of these models, i.e., their sensitivity to small changes in the graph structure, remains in rather limited settings, hampering the development and deployment of robust and trustworthy models in practice. To fill this gap, we study how small perturbations in the graph topology affect GCNN outputs and propose a novel formulation for analyzing model stability. Unlike prior studies that focus only on worst-case perturbations, our distribution-aware formulation characterizes output perturbations across a broad range of input data. This way, our framework enables, for the first time, a probabilistic perspective on the interplay between the statistical properties of the node data and perturbations in the graph topology. We conduct extensive experiments to validate our theoretical findings and demonstrate their benefits over existing baselines, in terms of both representation stability and adversarial attacks on downstream tasks. Our results demonstrate the practical significance of the proposed formulation and highlight the importance of incorporating data distribution into stability analysis.",
      "keywords": [
        "stability",
        "graph convolutional neural networks",
        "graph signal processing"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "eafIjoZAHm",
      "title": "GnnXemplar: Exemplars to Explanations - Natural Language Rules for Global GNN Interpretability",
      "abstract": "Graph Neural Networks (GNNs) are widely used for node classification, yet their opaque decision-making limits trust and adoption. While local explanations offer insights into individual predictions, global explanation methods—those that characterize an entire class—remain underdeveloped. Existing global explainers rely on motif discovery in small graphs, an approach that breaks down in large, real-world settings where subgraph repetition is rare, node attributes are high-dimensional, and predictions arise from complex structure-attribute interactions. We propose GnnXemplar, a novel global explainer inspired from Exemplar Theory from cognitive science. GnnXemplar identifies representative nodes in the GNN embedding space—exemplars—and explains predictions using natural language rules derived from their neighborhoods. Exemplar selection is framed as a coverage maximization problem over reverse $k$-nearest neighbors, for which we provide an efficient greedy approximation. To derive interpretable rules, we employ a self-refining prompt strategy using large language models (LLMs). Experiments across diverse benchmarks show that GnnXemplar significantly outperforms existing methods in fidelity, scalability, and human interpretability, as validated by a user study with 60 participants.",
      "keywords": [
        "graph neural network",
        "graph machine learning",
        "explainability",
        "xai",
        "global explanation",
        "text-based explanation",
        "exemplar",
        "exemplar theory"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "HD6bWcj87Y",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "BQgAToASdX",
      "title": "Generalized Group Data Attribution",
      "abstract": "Data Attribution (DA) methods quantify the influence of individual training data points on model outputs and have broad applications such as explainability, data selection, and noisy label identification. However, existing DA methods are often computationally intensive, limiting their applicability to large-scale machine learning models. To address this challenge, we introduce the Generalized Group Data Attribution (GGDA) framework, which computationally simplifies DA by attributing to groups of training points instead of individual ones. GGDA is a general framework that subsumes existing attribution methods and can be applied to new DA techniques as they emerge. It allows users to optimize the trade-off between efficiency and fidelity based on their needs. Our empirical results demonstrate that GGDA applied to popular DA methods such as Influence Functions, TracIn, and TRAK results in upto 10x-50x speedups over standard DA methods while gracefully trading off attribution fidelity. For downstream applications such as dataset pruning and noisy label identification, \nwe demonstrate that GGDA significantly improves computational efficiency and maintains effectiveness, enabling practical applications in large-scale machine learning scenarios that were previously infeasible.",
      "keywords": [
        "generalized",
        "group",
        "data attribution",
        "efficiency",
        "training data",
        "influence",
        "tracin",
        "trak"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "Y5LjYI4N6P",
      "title": "Efficient stagewise pretraining via progressive subnetworks",
      "abstract": "Recent developments in large language models have sparked interest in efficient\npretraining methods. Stagewise training approaches to improve efficiency, like\ngradual stacking and layer dropping (Reddi et al., 2023; Zhang & He, 2020), have\nrecently garnered attention. The prevailing view suggests that stagewise dropping\nstrategies, such as layer dropping, are ineffective, especially when compared to\nstacking-based approaches. This paper challenges this notion by demonstrating\nthat, with proper design, dropping strategies can be competitive, if not better, than\nstacking methods. Specifically, we develop a principled stagewise training framework, progressive subnetwork training, which only trains subnetworks within the\nmodel and progressively increases the size of subnetworks during training, until it\ntrains the full network. We propose an instantiation of this framework — Random\nPart Training (RAPTR) — that selects and trains only a random subnetwork (e.g.\ndepth-wise, width-wise) of the network at each step, progressively increasing the\nsize in stages. We show that this approach not only generalizes prior works like\nlayer dropping but also fixes their key issues. Furthermore, we establish a theoretical basis for such approaches and provide justification for (a) increasing complexity of subnetworks in stages, conceptually diverging from prior works on layer\ndropping, and (b) stability in loss across stage transitions in presence of key modern architecture components like residual connections and layer norms. Through\ncomprehensive experiments, we demonstrate that RAPTR can significantly speed\nup training of standard benchmarks like BERT and UL2, up to 33% compared to\nstandard training and, surprisingly, also shows better downstream performance on\nUL2, improving QA tasks and SuperGLUE by 1.5%; thereby, providing evidence\nof better inductive bias.",
      "keywords": [
        "Efficient stagewise training",
        "modular training",
        "language model pretraining",
        "implicit bias",
        "simple-to-complex learning"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "XPe55Uffd7",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "qRPIWtf3SE",
      "title": "Learning single index models via harmonic decomposition",
      "abstract": "We study the problem of learning single-index models, where the label $y \\in \\mathbb{R}$ depends on the input $\\boldsymbol{x} \\in \\mathbb{R}^d$ only through an unknown one-dimensional projection $\\langle \\boldsymbol{w_*}, \\boldsymbol{x} \\rangle$. Prior work has shown that under Gaussian inputs, the statistical and computational complexity of recovering $\\boldsymbol{w}_*$ is governed by the Hermite expansion of the link function. In this paper, we propose a new perspective: we argue that *spherical harmonics*---rather than *Hermite polynomials*---provide the natural basis for this problem, as they capture its intrinsic \\textit{rotational symmetry}. Building on this insight, we characterize the complexity of learning single-index models under arbitrary spherically symmetric input distributions. We introduce two families of estimators---based on tensor-unfolding and online SGD---that respectively achieve either  optimal sample complexity or optimal runtime, and argue that estimators achieving both may not exist in general. When specialized to Gaussian inputs, our theory not only recovers and clarifies existing results but also reveals new phenomena that had previously been overlooked.",
      "keywords": [
        "single-index models",
        "statistical and computational complexity"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "zmlhP8myaT",
      "title": "Stepwise Feature Learning in Self-Supervised Learning",
      "abstract": "Recent advances in self-supervised learning (SSL) have shown remarkable progress in representation learning. However, SSL models often exhibit shortcut learning phenomenon, where they exploit dataset-specific biases rather than learning generalizable features, sometimes leading to severe over-optimization on particular datasets. We present a theoretical framework that analyzes this shortcut learning phenomenon through the lens of $\\textit{extent bias}$ and $\\textit{amplitude bias}$. By investigating the relations among extent bias, amplitude bias, and learning priorities in SSL, we demonstrate that learning dynamics is fundamentally governed by the dimensional properties and amplitude of features rather than their semantic importance. Our analysis reveals how the eigenvalues of the feature cross-correlation matrix influence which features are learned earlier, providing insights into why models preferentially learn shortcut features over more generalizable features.",
      "keywords": [
        "shortcut learning",
        "self-supervised learning",
        "stepwise learning",
        "feature learning",
        "learning dynamics"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "xak8c9l1nu",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "MT3aOfXIbY",
      "title": "Faster Diffusion Sampling with Randomized Midpoints: Sequential and Parallel",
      "abstract": "Sampling algorithms play an important role in controlling the quality and runtime of diffusion model inference. In recent years, a number of works (Chen et al., 2023c;b; Benton et al., 2023; Lee et al., 2022) have analyzed algorithms for diffusion sampling with provable guarantees; these works show that for essentially any data distribution, one can approximately sample in polynomial time given a sufficiently accurate estimate of its score functions at different noise levels. \n\nIn this work, we propose a new scheme inspired by Shen and Lee's randomized midpoint method for log-concave sampling  (Shen & Lee, 2019). We prove that this approach achieves the best known dimension dependence for sampling from arbitrary smooth distributions in total variation distance ($\\widetilde O(d^{5/12})$ compared to $\\widetilde O(\\sqrt{d})$ from prior work). We also show that our algorithm can be parallelized to run in only $\\widetilde O(\\log^2 d)$ parallel rounds, constituting the first provable guarantees for parallel sampling with diffusion models.\n    \nAs a byproduct of our methods, for the well-studied problem of log-concave sampling in total variation distance, we give an algorithm and simple analysis achieving dimension dependence $\\widetilde O(d^{5/12})$ compared to $\\widetilde O(\\sqrt{d})$ from prior work.",
      "keywords": [
        "Diffusion Sampling",
        "Generative Model",
        "Statistical Theory"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "ijbA5swmoK",
      "title": "Second-Order Min-Max Optimization with Lazy Hessians",
      "abstract": "This paper studies second-order methods for convex-concave minimax optimization.  \nMonteiro & Svaiter (2012)  proposed a method to solve the problem with an optimal iteration complexity of \n$\\mathcal{O}(\\epsilon^{-3/2})$ to find an $\\epsilon$-saddle point.  However, it is unclear whether the\ncomputational complexity, $\\mathcal{O}((N+ d^2) d \\epsilon^{-2/3})$, can be improved. In the above, we follow  Doikov et al. (2023) and assume the complexity of obtaining a first-order oracle as $N$ and the complexity of obtaining a second-order oracle as $dN$. \nIn this paper, we show that the computation cost can be reduced by reusing Hessian across iterations. Our methods take the overall computational complexity of $\\tilde{\\mathcal{O}}( (N+d^2)(d+ d^{2/3}\\epsilon^{-2/3}))$, which improves those of previous methods by a factor of $d^{1/3}$. \nFurthermore, we generalize our method to strongly-convex-strongly-concave minimax problems and establish the complexity of $\\tilde{\\mathcal{O}}((N+d^2) (d + d^{2/3} \\kappa^{2/3}) )$ when the condition number of the problem is $\\kappa$, enjoying a similar speedup upon the state-of-the-art method. \nNumerical experiments on both real and synthetic datasets also verify the efficiency of our method.",
      "keywords": [
        "min-max optimization; second-order methods; computational complexity"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "EzjsoomYEb",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "2MqyCIxLSi",
      "title": "TopoTune: A Framework for Generalized Combinatorial Complex Neural Networks",
      "abstract": "Graph Neural Networks (GNNs) excel in learning from relational datasets, processing node and edge features in a way that preserves the symmetries of the graph domain. However, many complex systems---such as biological or social networks---involve multiway complex interactions that are more naturally represented by higher-order topological domains. The emerging field of Topological Deep Learning (TDL) aims to accommodate and leverage these higher-order structures. Combinatorial Complex Neural Networks (CCNNs), fairly general TDL models, have been shown to be more expressive and better performing than GNNs. However, differently from the graph deep learning ecosystem, TDL lacks a principled and standardized framework for easily defining new architectures, restricting its accessibility and applicability. To address this issue, we introduce Generalized CCNNs (GCCNs), a novel simple yet powerful family of TDL models that can be used to systematically transform any (graph) neural network into its TDL counterpart. We prove that GCCNs generalize and subsume CCNNs, while extensive experiments on a diverse class of GCCNs show that these architectures consistently match or outperform CCNNs, often with less model complexity. In an effort to accelerate and democratize TDL, we introduce TopoTune, a lightweight software for defining, building, and training GCCNs with unprecedented flexibility and ease.",
      "keywords": [
        "Topological Deep Learning",
        "Graph Neural Network",
        "Graph Expansion",
        "Combinatorial Complex",
        "Cellular Complex"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "QC2qE1tcmd",
      "title": "Demystifying Topological Message-Passing with Relational Structures: A Case Study on Oversquashing in Simplicial Message-Passing",
      "abstract": "Topological deep learning (TDL) has emerged as a powerful tool for modeling higher-order interactions in relational data. However, phenomena such as oversquashing in topological message-passing remain understudied and lack theoretical analysis. We propose a unifying axiomatic framework that bridges graph and topological message-passing by viewing simplicial and cellular complexes and their message-passing schemes through the lens of relational structures. This approach extends graph-theoretic results and algorithms to higher-order structures, facilitating the analysis and mitigation of oversquashing in topological message-passing networks. Through theoretical analysis and empirical studies on simplicial networks, we demonstrate the potential of this framework to advance TDL.",
      "keywords": [
        "topological deep learning",
        "oversquashing",
        "rewiring",
        "relational graph neural networks",
        "simplicial complexes",
        "relational structures"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "cLtE4qoPlD",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "25kAzqzTrz",
      "title": "Towards Understanding Why FixMatch Generalizes Better Than Supervised Learning",
      "abstract": "Semi-supervised learning (SSL), exemplified by FixMatch (Sohn et al., 2020), has shown significant generalization advantages over supervised learning (SL), particularly in the context of deep neural networks (DNNs). However, it is still unclear, from a theoretical standpoint, why FixMatch-like SSL algorithms generalize  better than SL on DNNs. In this work, we present the first theoretical justification for the enhanced test accuracy observed in  FixMatch-like SSL applied to DNNs by taking  convolutional neural networks (CNNs) on classification tasks as an example. Our theoretical analysis reveals that the semantic feature learning processes in FixMatch and SL are rather different. In particular, FixMatch learns all the discriminative features of each semantic class, while SL only randomly captures a subset of features due to the well-known lottery ticket hypothesis. Furthermore, we show that our analysis framework can be applied to other FixMatch-like SSL methods, e.g., FlexMatch, FreeMatch, Dash, and SoftMatch. Inspired by our theoretical analysis, we develop an improved variant of FixMatch, termed Semantic-Aware FixMatch (SA-FixMatch). Experimental results corroborate our theoretical findings and the enhanced generalization capability of SA-FixMatch.",
      "keywords": [
        "deep semi-supervised learning",
        "generalization error",
        "feature learning"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "cLtE4qoPlD",
      "title": "Find A Winning Sign: Sign Is All We Need to Win the Lottery",
      "abstract": "The Lottery Ticket Hypothesis (LTH) posits the existence of a sparse subnetwork (a.k.a. winning ticket) that can generalize comparably to its over-parameterized counterpart when trained from scratch.\nThe common approach to finding a winning ticket is to preserve the original strong generalization through Iterative Pruning (IP) and transfer information useful for achieving the learned generalization by applying the resulting sparse mask to an untrained network.\nHowever, existing IP methods still struggle to generalize their observations beyond ad-hoc initialization and small-scale architectures or datasets, or they bypass these challenges by applying their mask to trained weights instead of initialized ones.\nIn this paper, we demonstrate that the parameter sign configuration plays a crucial role in conveying useful information for generalization to any randomly initialized network.\nThrough linear mode connectivity analysis, we observe that a sparse network trained by an existing IP method can retain its basin of attraction if its parameter signs and normalization layer parameters are preserved.\nTo take a step closer to finding a winning ticket, we alleviate the reliance on normalization layer parameters by preventing high error barriers along the linear path between the sparse network trained by our method and its counterpart with initialized normalization layer parameters.\nInterestingly, across various architectures and datasets, we observe that any randomly initialized network can be optimized to exhibit low error barriers along the linear path to the sparse network trained by our method by inheriting its sparsity and parameter sign information, potentially achieving performance comparable to the original.\nThe code is available at https://github.com/JungHunOh/AWS_ICLR2025.git.",
      "keywords": [
        "lottery ticket hypothesis",
        "network pruning",
        "linear mode connectivity"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "RPRqKhjrr6",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "oN5YVZ9JeF",
      "title": "T-SHIRT: Token-Selective Hierarchical Data Selection for Instruction Tuning",
      "abstract": "Instruction tuning is essential for Large Language Models (LLMs) to effectively follow user instructions. To improve training efficiency and reduce data redundancy, recent works use LLM-based scoring functions, e.g., Instruction-Following Difficulty (IFD), to select high–quality instruction-tuning data with scores above a threshold. While these data selection methods often lead to models that can match or even exceed the performance of models trained on the full datasets, we identify two key limitations: (i) they assess quality at the sample level, ignoring token-level informativeness; and (ii) they overlook the robustness of the scoring method, often selecting a sample due to superficial lexical features instead of its true quality. In this work, we propose Token-Selective HIeRarchical Data Selection for Instruction Tuning (T-SHIRT), a novel data selection framework that introduces a new scoring method to include only informative tokens in quality evaluation and also promote robust and reliable samples whose neighbors also show high quality with less local inconsistencies. We demonstrate that models instruction-tuned on a curated dataset (only 5% of the original size) using T-SHIRT can outperform those trained on the entire large-scale dataset by up to 5.48 points on average across eight benchmarks. Across various LLMs and training set scales, our method consistently surpasses existing state-of-the-art data selection techniques, while also remaining both cost-effective and highly efficient. For instance, by using GPT-2 for score computation, we are able to process a dataset of 52k samples in 40 minutes on a single GPU. Our code is available at https://github.com/Dynamite321/T-SHIRT.",
      "keywords": [
        "Large Language Models",
        "Instruction tuning",
        "Data Selection",
        "Token-selective Quality Score",
        "Robust Hierarchical Selection"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "obXGSmmG70",
      "title": "AdaCoT: Pareto-Optimal Adaptive Chain-of-Thought Triggering via Reinforcement Learning",
      "abstract": "Large Language Models (LLMs) have demonstrated remarkable capabilities but often face challenges with tasks requiring sophisticated reasoning. While Chain-of-Thought (CoT) prompting significantly enhances reasoning, it indiscriminately generates lengthy reasoning steps for all queries, leading to substantial computational costs and inefficiency, especially for simpler inputs. To address this critical issue, we introduce AdaCoT (Adaptive Chain-of-Thought), a novel framework enabling LLMs to adaptively decide when to invoke CoT. AdaCoT framed adaptive reasoning as a Pareto optimization problem that seeks to balance model performance with the costs associated with CoT invocation (both frequency and computational overhead). We propose a reinforcement learning (RL) based method, specifically utilizing Proximal Policy Optimization (PPO), to dynamically control the CoT triggering decision boundary by adjusting penalty coefficients, thereby allowing the model to determine CoT necessity based on implicit query complexity. A key technical contribution is Selective Loss Masking (SLM), designed to counteract decision boundary collapse during multi-stage RL training, ensuring robust and stable adaptive triggering. Experimental results demonstrate that AdaCoT successfully navigates the Pareto frontier, achieving substantial reductions in CoT usage for queries not requiring elaborate reasoning. For instance, on our production traffic testset, AdaCoT reduced CoT triggering rates to as low as 3.18% and decreased average response tokens by 69.06% on APP, while maintaining high performance on complex tasks. This substantial token decrease directly translates to a significant reduction in inference computational load. AdaCoT pioneers adaptive CoT triggering, offering a practical and principled solution for developing more efficient, responsive, and cost-effective LLMs, particularly crucial for interactive and resource-sensitive applications.",
      "keywords": [
        "Adaptive Reasoning",
        "Chain-of-Thought",
        "Large Language Models"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "jTaxGFy34h",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "ODgWBaErst",
      "title": "Distribution-Aware Tensor Decomposition for Compression of Convolutional Neural Networks",
      "abstract": "Neural networks are widely used for image–related tasks but typically demand considerable computing power. Once a network has been trained, however, its memory‑ and compute‑footprint can be reduced by compression. In this work, we focus on compression through tensorization and low‑rank representations. Whereas classical approaches search for a low‑rank approximation by minimizing an isotropic norm such as the Frobenius norm in weight‑space, we use data‑informed norms that measure the error in function space. Concretely, we minimize the change in the layer’s output distribution, which can be expressed as $\\lVert (W - \\widetilde{W}) \\Sigma^{1/2}\\rVert_F$ where $\\Sigma^{1/2}$ is the square root of the covariance matrix of the layer’s input and $W$, $\\widetilde{W}$ are the original and compressed weights. We propose new alternating least square algorithms for the two most common tensor decompositions (Tucker‑2 and CPD) that directly optimize the new norm. Unlike conventional compression pipelines, which almost always require post‑compression fine‑tuning, our data‑informed approach often achieves competitive accuracy without any fine‑tuning. We further show that the same covariance‑based norm can be transferred from one dataset to another with only a minor accuracy drop, enabling compression even when the original training dataset is unavailable.\nExperiments on several CNN architectures (ResNet‑18/50, and GoogLeNet) and datasets (ImageNet, FGVC‑Aircraft, Cifar10, and Cifar100) confirm the advantages of the proposed method.",
      "keywords": [
        "Compression of neural networks",
        "tensor decomposition",
        "low-rank approximation",
        "distribution-aware norm"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "o7Z8TClGjp",
      "title": "Unifying Proportional Fairness in Centroid and Non-Centroid Clustering",
      "abstract": "Proportional fairness criteria inspired by democratic ideals of proportional representation have received growing attention in the clustering literature. Prior work has investigated them in two separate paradigms. Chen et al. [ICML 2019] study _centroid clustering_, in which each data point's loss is determined by its distance to a representative point (centroid) chosen in its cluster. Caragiannis et al. [NeurIPS 2024] study _non-centroid clustering_, in which each data point's loss is determined by its maximum distance to any other data point in its cluster. \n  \nWe generalize both paradigms to introduce _semi-centroid clustering_, in which each data point's loss is a combination of its centroid and non-centroid losses, and study two proportional fairness criteria---the core and, its relaxation, fully justified representation (FJR). Our main result is a novel algorithm which achieves a constant approximation to the core, in polynomial time, even when the distance metrics used for centroid and non-centroid loss measurements are different. We also derive improved results for more restricted loss functions and the weaker FJR criterion, and establish lower bounds in each case.",
      "keywords": [
        "Proportional Fairness",
        "Clustering",
        "Algorithmic Fairness"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "yRxX01oRIi",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "Q3qAsZAEZw",
      "title": "Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference",
      "abstract": "Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changing system configuration, such as evaluation batch size, GPU count, and GPU version, can introduce significant differences in the generated responses. \nThis issue is especially pronounced in reasoning models, where minor rounding differences in early tokens can cascade into divergent chains of thought, ultimately affecting accuracy. For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9\\% variation in accuracy and 9,000 tokens difference in response length due to differences in GPU count, type, and evaluation batch size.\nWe trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. \nThis work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge.\nOur analysis reveals that floating-point precision—while critical for reproducibility—is often neglected in evaluation practices.\nInspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32, balancing memory efficiency with numerical stability. Code is available at https://github.com/nanomaoli/llm_reproducibility.",
      "keywords": [
        "Large Language Models (LLMs)",
        "Reproducibility",
        "Numerical precision",
        "Deterministic inference"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "yRxX01oRIi",
      "title": "Evaluating the Inductive Abilities of Large Language Models: Why Chain-of-Thought Reasoning Sometimes Hurts More Than Helps",
      "abstract": "Large Language Models (LLMs) have shown remarkable progress across domains, yet their ability to perform inductive reasoning—inferring latent rules from sparse examples—remains limited. \nIt is often assumed that chain-of-thought (CoT) prompting, as used in Large Reasoning Models (LRMs), enhances such reasoning. \nWe investigate this assumption with creating four controlled, diagnostic game-based tasks—chess, Texas Hold’em, dice games, and blackjack—with hidden human-defined rules. \nWe find that CoT reasoning can degrade inductive performance, with LRMs often underperforming their non-reasoning counterparts.\n\nTo explain this, we present a theoretical framework that reveals how reasoning steps can amplify error through three failure modes: incorrect sub-task decomposition, incorrect sub-task solving, and incorrect final answer summarization. \nBased on our theoretical and empirical analysis, we introduce structured interventions that adapt CoT generation according to our identified failure types. These interventions improve inductive accuracy without retraining. Our findings suggest that effective (CoT) reasoning depends not only on taking more steps but also on ensuring those steps are well-structured.",
      "keywords": [
        "Large Lauange Model",
        "Inductive Abilities",
        "Reasoning"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "vM94dZiqx4",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "03OkC0LKDD",
      "title": "Adaptive Gradient Clipping for Robust Federated Learning",
      "abstract": "Robust federated learning aims to maintain reliable performance despite the presence of adversarial or misbehaving workers. While state-of-the-art (SOTA) robust distributed gradient descent (Robust-DGD) methods were proven theoretically optimal, their empirical success has often relied on pre-aggregation gradient clipping.\nHowever, existing static clipping strategies yield inconsistent results: enhancing robustness against some attacks while being ineffective or even detrimental against others.\nTo address this limitation, we propose a principled adaptive clipping strategy, Adaptive Robust Clipping (ARC), which dynamically adjusts clipping thresholds based on the input gradients. We prove that ARC not only preserves the theoretical robustness guarantees of SOTA Robust-DGD methods but also provably improves asymptotic convergence when the model is well-initialized. Extensive experiments on benchmark image classification tasks confirm these theoretical insights, demonstrating that ARC significantly enhances robustness, particularly in highly heterogeneous and adversarial settings.",
      "keywords": [
        "Federated learning",
        "robustness",
        "Byzantine resilience"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "vM94dZiqx4",
      "title": "Long-tailed Adversarial Training with Self-Distillation",
      "abstract": "Adversarial training significantly enhances adversarial robustness, yet superior performance is predominantly achieved on balanced datasets.\n Addressing adversarial robustness in the context of unbalanced or long-tailed distributions is considerably more challenging, mainly due to the scarcity of tail data instances. \n Previous research on adversarial robustness within long-tailed distributions has primarily focused on combining traditional long-tailed natural training with existing adversarial robustness methods.\n In this study, we provide an in-depth analysis for the challenge that adversarial training struggles to achieve high performance on tail classes in long-tailed distributions.\n Furthermore, we propose a simple yet effective solution to advance adversarial robustness on long-tailed distributions through a novel self-distillation technique.\n Specifically, this approach leverages a balanced self-teacher model, which is trained using a balanced dataset sampled from the original long-tailed dataset.\nOur extensive experiments demonstrate state-of-the-art performance in both clean and robust accuracy for long-tailed adversarial robustness, with significant improvements in tail class performance on various datasets.\nWe improve the accuracy against PGD attacks for tail classes by 20.3, 7.1, and 3.8 percentage points on CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively, while achieving the highest robust accuracy.",
      "keywords": [
        "Adversarial Robustness",
        "Adversarial Training",
        "Long-Tail Distribution Learning"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "fWXYD0ZCdd",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "1yJP5TVWih",
      "title": "Lambda-Skip Connections: the architectural component that prevents Rank Collapse",
      "abstract": "Rank collapse, a phenomenon where embedding vectors in sequence models\nrapidly converge to a uniform token or equilibrium state, has recently gained at-\ntention in the deep learning literature. This phenomenon leads to reduced expres-\nsivity and potential training instabilities due to vanishing gradients. Empirical ev-\nidence suggests that architectural components like skip connections, LayerNorm,\nand MultiLayer Perceptrons (MLPs) play critical roles in mitigating rank collapse.\nWhile this issue is well-documented for transformers, alternative sequence mod-\nels, such as State Space Models (SSMs), which have recently gained prominence,\nhave not been thoroughly examined for similar vulnerabilities. This paper extends\nthe theory of rank collapse from transformers to SSMs using a unifying frame-\nwork that captures both architectures. We introduce a modification in the skip\nconnection component, termed lambda-skip connections, that provides guaran-\ntees for rank collapse prevention. We present, via analytical results, a sufficient\ncondition to achieve the guarantee for all of the aforementioned architectures. We\nalso study the necessity of this condition via ablation studies and analytical exam-\nples. To our knowledge, this is the first study that provides a general guarantee to\nprevent rank collapse, and that investigates rank collapse in the context of SSMs,\noffering valuable understanding for both theoreticians and practitioners. Finally,\nwe validate our findings with experiments demonstrating the crucial role of archi-\ntectural components in preventing rank collapse.",
      "keywords": [
        "Rank Collapse",
        "Skip Connections",
        "Sequence Modeling Architectures"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "fWXYD0ZCdd",
      "title": "A New Look at Low-Rank Recurrent Neural Networks",
      "abstract": "Low-rank recurrent neural networks (RNNs) have recently gained prominence as a framework for understanding how neural systems solve complex cognitive tasks. However, fitting and interpreting these networks remains an important open problem.\nHere we address this challenge using a perspective from the ``neural engineering framework'', which shows how to embed an arbitrary ordinary differential equation (ODE) into a low-rank RNN using least-squares regression. Under this perspective, individual neurons in a low-rank RNN provide nonlinear basis functions for representing an ODE of interest. This clarifies limits on the expressivity of low-rank RNNs, such as the fact that with a $\\tanh$ non-linearity they can only capture odd-symmetric functions in the absence of per neuron inputs or biases. Building on this framework, we propose a method for finding the smallest low-rank RNN to implement a given dynamical system using a variant of orthogonal matching pursuit. We also show how to use regression-based fitting to obtain low-rank RNNs with time-varying dynamics. This allows for the rapid training of vastly different dynamical systems that nevertheless produce a given time-varying trajectory. Finally, we highlight the usefulness of our framework by comparing to RNNs trained using backprop-through-time on neuroscience-inspired tasks, showing that our method achieves faster and more accurate learning with smaller networks than gradient-based training.",
      "keywords": [
        "low-rank rnn",
        "computational neuroscience",
        "dynamical systems",
        "neural dynamics"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "NgLFQTBPRR",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "NgLFQTBPRR",
      "title": "An Evidence-Based Post-Hoc Adjustment Framework for Anomaly Detection Under Data Contamination",
      "abstract": "Unsupervised anomaly detection (AD) methods typically assume clean training data, yet real-world datasets often contain undetected or mislabeled anomalies, leading to significant performance degradation. Existing solutions require access to the training pipelines, data or prior knowledge of the proportions of anomalies in the data, limiting their real-world applicability. To address this challenge, we propose EPHAD, a simple yet effective test-time adaptation framework that updates the outputs of AD models trained on contaminated datasets using evidence gathered at test time. Our approach integrates the prior knowledge captured by the AD model trained on contaminated datasets with evidence derived from multimodal foundation models like Contrastive Language-Image Pre-training (CLIP), classical AD methods like the Latent Outlier Factor or domain-specific knowledge. We illustrate the intuition behind EPHAD using a synthetic toy example and validate its effectiveness through comprehensive experiments across eight visual AD datasets, twenty-six tabular AD datasets, and a real-world industrial AD dataset. Additionally, we conduct an ablation study to analyse hyperparameter influence and robustness to varying contamination levels, demonstrating the versatility and robustness of EPHAD across diverse AD models and evidence pairs. To ensure reproducibility, our code is publicly available at https://github.com/sukanyapatra1997/EPHAD.",
      "keywords": [
        "anomaly detection",
        "anomaly",
        "contamination",
        "test-time adaptation"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "pKg5zKIEuV",
      "title": "Quantifying Statistical Significance of Deep Nearest Neighbor Anomaly Detection via Selective Inference",
      "abstract": "In real-world applications, anomaly detection (AD) often operates without access to anomalous data, necessitating semi-supervised methods that rely solely on normal data.\nAmong these methods, deep $k$-nearest neighbor (deep $k$NN) AD stands out for its interpretability and flexibility, leveraging distance-based scoring in deep latent spaces.\nDespite its strong performance, deep $k$NN lacks a mechanism to quantify uncertainty—an essential feature for critical applications such as industrial inspection.\nTo address this limitation, we propose a statistical framework that quantifies the  significance of detected anomalies in the form of $p$-values, thereby enabling control over false positive rates at a user-specified significance level (e.g.,0.05).\nA central challenge lies in managing selection bias, which we tackle using Selective Inference—a principled method for conducting inference conditioned on data-driven selections.\nWe evaluate our method on diverse datasets and demonstrate that it provides reliable AD well-suited for industrial use cases.",
      "keywords": [
        "Anomaly Detection",
        "k-Nearest Neighbors",
        "Statistical Test",
        "Selective Inference",
        "Deep Learning"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "WOzffPgVjF",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "14fFV0chUS",
      "title": "TRACE: Temporal Grounding Video LLM  via Causal Event Modeling",
      "abstract": "Video Temporal Grounding (VTG) is a crucial capability for video understanding models and plays a vital role in downstream tasks such as video browsing and editing. \nTo effectively handle various tasks simultaneously and enable zero-shot prediction, there is a growing trend in employing video LLMs for VTG tasks. However, current video LLM-based methods rely exclusively on natural language generation, lacking the ability to model the clear structure inherent in videos, which restricts their effectiveness in tackling VTG tasks. To address this issue, this paper first formally introduces causal event modeling framework, which represents video LLM outputs as sequences of events, and predict the current event using previous events, video inputs, and textural instructions. Each event consists of three components: timestamps, salient scores, and textual captions. We then propose a novel task-interleaved video LLM called TRACE to effectively implement the causal event modeling framework in practice. \nThe TRACE process visual frames, timestamps, salient scores, and text as distinct tasks, employing various encoders and decoding heads for each. Task tokens are arranged in an interleaved sequence according to the causal event modeling framework's formulation.\nExtensive experiments on various VTG tasks and datasets demonstrate the superior performance of TRACE compared to state-of-the-art video LLMs. Our model and code are avaliable at \\url{https://github.com/gyxxyg/TRACE}.",
      "keywords": [
        "video large language model",
        "video temporal grounding"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "auZZ2gN0ZN",
      "title": "Dense Video Object Captioning from Disjoint Supervision",
      "abstract": "We propose a new task and model for dense video object captioning -- detecting, tracking and captioning trajectories of objects in a video. This task unifies spatial and temporal localization in video, whilst also requiring fine-grained visual understanding that is best described by natural language. We propose a unified model, and demonstrate how our end-to-end approach is more accurate and temporally coherent than a multi-stage pipeline combining state-of-the-art detection, tracking, and captioning models. Moreover, we propose a training strategy based on a mixture of disjoint tasks, which allows us to leverage diverse, large-scale datasets which supervise different parts of our model. Although each pretraining task only provides weak supervision, they are complementary and, when combined, result in noteworthy zero-shot ability and serve as strong initialization for additional finetuning to further improve accuracy. We carefully design new metrics capturing all components of our task, and show how we can repurpose existing video grounding datasets (e.g. VidSTG and VLN) for our new task. We show that our model improves upon a number of strong baselines for this new task. Furthermore, we can apply our model to the task of spatial grounding, outperforming prior state-of-the-art on VidSTG and VLN, without explicitly training for it. Our code is available at https://github.com/google-research/scenic.",
      "keywords": [
        "object captioning",
        "video",
        "tracking"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "2efNHgYRvM",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "Rkpdfia4Sz",
      "title": "Learning Discrete Latent Models from Discrete Observations",
      "abstract": "A central challenge in machine learning is discovering meaningful representations of high-dimensional data, commonly referred to as representation learning. However, many existing methods lack a theoretical foundation, leading to unreliable representations and limited inferential capabilities. In approaches where certain uniqueness of representation is guaranteed, such as nonlinear ICA, variables are typically assumed to be continuous. While recent work has extended identifiability to binarized observed variables, no principled method has been developed for scenarios involving discrete latent variables. In this paper, we show how multi-domain information can be leveraged to achieve identifiability when both latent and observed variables are discrete. We propose general identification conditions that do not depend on specific data distributional assumptions or parametric model forms. The effectiveness of our approach is validated through experiments on both simulated and real-world datasets.",
      "keywords": [
        "Latent Variable Identification",
        "Nonlinear Independent Component Analysis (ICA)"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "lk2Qk5xjeu",
      "title": "Unifying Causal Representation Learning with the Invariance Principle",
      "abstract": "Causal representation learning (CRL) aims at recovering latent causal variables from high-dimensional observations to solve causal downstream tasks, such as predicting the effect of new interventions or more robust classification. \n  A plethora of methods have been developed, each tackling carefully crafted problem settings that lead to different types of identifiability. \n  These different settings are widely assumed to be important because they are often linked to different rungs of Pearl's causal hierarchy, even though this correspondence is not always exact.\n    This work shows that instead of strictly conforming to this hierarchical mapping, *many causal representation learning approaches methodologically align their representations with inherent data symmetries.*\n  Identification of causal variables is guided by invariance principles that are not necessarily causal. \n  This result allows us to unify many existing approaches in a single method that can mix and match different assumptions, including non-causal ones, based on the invariance relevant to the problem at hand. \n  It also significantly benefits applicability, which we demonstrate by improving treatment effect estimation on real-world high-dimensional ecological data. Overall, this paper clarifies the role of causal assumptions in the discovery of causal variables and shifts the focus to preserving data symmetries.",
      "keywords": [
        "Causal representation learning",
        "Identifiability",
        "Invariance"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "VipcVxaTnG",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "VipcVxaTnG",
      "title": "Correlation and Navigation in the Vocabulary Key Representation Space of Language Models",
      "abstract": "Language model (LM) decoding is based on the next-token prediction (NTP) probability distribution. For neural LMs (e.g., Transformer-based), NTP distribution is\nessentially a softmax-regularized dot product between an encoded input context\n(query) and fixed vocabulary representations (keys). In this paper, we study the\neffect of the key distribution on the NTP distribution, with a focus on whether\nthe similarity between keys will trigger spurious correlations in NTP. Through\nknowledge-probing tasks, we show that in the NTP distribution, the few top-ranked\ntokens are typically accurate. However, the middle-ranked prediction is highly biased\ntowards the tokens that are distributionally (not necessarily semantically) similar to\nthese top ones. For instance, if “P” is predicted as the top-1 token, “A”-“Z” will all\nbe ranked high in NTP, no matter whether they can lead to correct decoding results.\nThis hurts the sampling diversity and makes the sampling of correct, long-tail\nresults hopeless and noisy. We attempt to alleviate this issue via a novel in-context\nmethod that iteratively pushes the query representation away from explored regions.\nSpecifically, we include the explored decoding results in the context and prompt\nthe LM to generate something else, which encourages the LM to produce a query\nrepresentation that has small dot products with explored keys. Experiments on\nknowledge-probing tasks show that our method leads to efficient navigation away\nfrom explored keys to correct new keys. We further extend our method to open-ended and chain-of-thought (for reasoning) generation. Experiment results show\nthat ICN contributes to better generation diversity and improved self-consistency\nvoting performance. Finally, we discuss potential training issues caused by the\nfixed key space together with the challenges and possible ways to address them in\nfuture research.",
      "keywords": [
        "Language Modeling",
        "Next Token Prediction",
        "Spurious Correlation",
        "Generation Diversity"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "bFYST1MaGh",
      "title": "Communicating Activations Between Language Model Agents",
      "abstract": "Communication between multiple language model (LM) agents has been shown to scale up the reasoning ability of LMs. While natural language has been the dominant medium for inter-LM communication, it is not obvious this should be the standard: not only does natural language communication incur high inference costs that scale quickly with the number of both agents and messages, but also the decoding process abstracts away too much rich information that could be otherwise accessed from the internal activations. In this work, we propose a simple technique whereby LMs communicate via *activations*; concretely, we pause an LM $B$'s computation at an intermediate layer, combine its current activation with another LM $A$'s intermediate activation via some function $f$, then pass $f$'s output into the next layer of $B$ and continue the forward pass till decoding is complete. This approach scales up LMs on new tasks with *zero* additional parameters and data, and saves a *substantial amount of compute* over natural language communication. We test our method with various functional forms $f$ on two experimental setups—multi-player coordination games and reasoning benchmarks—and find that it achieves up to $27.0$% improvement over natural language communication across datasets with $<$$1/4$ the compute, illustrating the superiority and robustness of activations as an alternative \"language\" for communication between LMs.",
      "keywords": [
        "large language models",
        "multiagent communication",
        "embedding representation",
        "multiagent debate"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "8YniJnJQ0P",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "8YniJnJQ0P",
      "title": "Detecting High-Stakes Interactions with Activation Probes",
      "abstract": "Monitoring is an important aspect of safely deploying Large Language Models (LLMs).\nThis paper examines activation probes for detecting ``high-stakes'' interactions---where the text indicates that the interaction might lead to significant harm---as a critical, yet underexplored, target for such monitoring.\nWe evaluate several probe architectures trained on synthetic data, and find them to exhibit robust generalization to diverse, out-of-distribution, real-world data.\nProbes' performance is comparable to that of prompted or finetuned medium-sized LLM monitors, while offering computational savings of six orders-of-magnitude.\nThese savings are enabled by reusing activations of the model that is being monitored.\nOur experiments also highlight the potential of building resource-aware hierarchical monitoring systems, where probes serve as an efficient initial filter and flag cases for more expensive downstream analysis.\nWe release our novel synthetic dataset and the codebase at\n\\url{https://github.com/arrrlex/models-under-pressure}.",
      "keywords": [
        "linear probes",
        "monitoring",
        "mechanistic interpretability",
        "large language models"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "Q3qAsZAEZw",
      "title": "Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference",
      "abstract": "Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changing system configuration, such as evaluation batch size, GPU count, and GPU version, can introduce significant differences in the generated responses. \nThis issue is especially pronounced in reasoning models, where minor rounding differences in early tokens can cascade into divergent chains of thought, ultimately affecting accuracy. For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9\\% variation in accuracy and 9,000 tokens difference in response length due to differences in GPU count, type, and evaluation batch size.\nWe trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. \nThis work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge.\nOur analysis reveals that floating-point precision—while critical for reproducibility—is often neglected in evaluation practices.\nInspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32, balancing memory efficiency with numerical stability. Code is available at https://github.com/nanomaoli/llm_reproducibility.",
      "keywords": [
        "Large Language Models (LLMs)",
        "Reproducibility",
        "Numerical precision",
        "Deterministic inference"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "NltQraRnbW",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "FwW3jqchtY",
      "title": "Identifying neural dynamics using interventional state space models",
      "abstract": "Neural circuits produce signals that are complex and nonlinear. To facilitate the understanding of neural dynamics, a popular approach is to fit state space models (SSM) to data and analyze the dynamics of the low-dimensional latent variables. Despite the power of SSM in explaining neural circuit dynamics, it has been shown that these models merely capture statistical associations in the data and cannot be causally interpreted. Therefore, an important research problem is to build models that can predict neural dynamics under causal manipulations. Here, we propose interventional state space models (iSSM), a class of causal models that can predict neural responses to novel perturbations. We draw on recent advances in causal dynamical systems and present theoretical results for the identifiability of iSSM. In simulations of the motor cortex, we show that iSSM can recover the true latents and the underlying dynamics. In addition, we illustrate two applications of iSSM in biological datasets. First, we apply iSSM to a dataset of calcium recordings from ALM neurons in mice during photostimulation and uncover dynamical mechanisms underlying short-term memory. Second, we apply iSSM to a dataset of electrophysiological recordings from macaque dlPFC recordings during micro-stimulation and show that it successfully predicts responses to unseen perturbations.",
      "keywords": [
        "Causal dynamical systems",
        "interventions",
        "state space models",
        "photostimulation",
        "micro-stimulation"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "NltQraRnbW",
      "title": "Conditional Diffusion Models are Minimax-Optimal and Manifold-Adaptive for Conditional Distribution Estimation",
      "abstract": "We consider a class of conditional forward-backward diffusion models for conditional generative modeling, that is, generating new data given a covariate (or control variable). To formally study the theoretical properties of these conditional generative models, we adopt a statistical framework of distribution regression to characterize the large sample properties of the conditional distribution estimators induced by these conditional forward-backward diffusion models. Here, the conditional distribution of data is assumed to smoothly change over the covariate. In particular, our derived convergence rate is minimax-optimal under the total variation metric within the regimes covered by the existing literature. Additionally, we extend our theory by allowing both the data and the covariate variable to potentially admit a low-dimensional manifold structure. In this scenario, we demonstrate that the conditional forward-backward diffusion model can adapt to both manifold structures, meaning that the derived estimation error bound (under the Wasserstein metric) depends only on the intrinsic dimensionalities of the data and the covariate.",
      "keywords": [
        "conditional distribution estimation",
        "diffusion models",
        "distribution regression",
        "generative models",
        "manifold",
        "minimax rate"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "peX9zpWgg4",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "WWymYrA48K",
      "title": "Test Time Learning for Time Series Forecasting",
      "abstract": "We propose the use of Test-Time Training (TTT) modules in a cascade architecture to enhance performance in long-term time series forecasting. Through extensive experiments on standard benchmark datasets, we demonstrate that TTT modules consistently outperform state-of-the-art models, including Mamba-based TimeMachine, particularly in scenarios involving extended sequence and prediction lengths. Our results show significant improvements, especially on larger datasets such as Electricity, Traffic, and Weather, underscoring the effectiveness of TTT in capturing long-range dependencies. Additionally, we explore various convolutional architectures within the TTT framework, showing that convolutional blocks as hidden layer architectures can achieve competitive results.",
      "keywords": [
        "Time Series Forecasting",
        "Test-Time Training",
        "Mamba",
        "Expressive Hidden States",
        "Modern CNN"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "peX9zpWgg4",
      "title": "Adaptive Shrinkage Estimation for Personalized Deep Kernel Regression in Modeling Brain Trajectories",
      "abstract": "Longitudinal biomedical studies monitor individuals over time to capture dynamics in brain development, disease progression, and treatment effects. However, estimating trajectories of brain biomarkers is challenging due to biological variability, inconsistencies in measurement protocols (e.g., differences in MRI scanners) as well as scarcity and irregularity in longitudinal measurements. Herein,\nwe introduce a novel personalized deep kernel regression framework for forecasting brain biomarkers, with application to regional volumetric measurements. Our approach integrates two key components: a population model that captures brain trajectories from a large and diverse cohort, and a subject-specific model that captures individual trajectories. To optimally combine these, we propose Adaptive Shrinkage Estimation, which effectively balances population and subject-specific models. We assess our model’s performance through predictive accuracy metrics, uncertainty quantification, and validation against external clinical studies. Benchmarking against state-of-the-art statistical and machine learning models—including linear mixed effects models, generalized additive models, and deep learning methods—demonstrates the superior predictive performance of our approach. Additionally, we apply our method to predict trajectories of composite neuroimaging biomarkers, which highlights the versatility of our approach in modeling the progression of longitudinal neuroimaging biomarkers. Furthermore, validation on three external neuroimaging studies confirms the robustness of our method across different clinical contexts. We make the code available at https://github.com/vatass/AdaptiveShrinkageDKGP.",
      "keywords": [
        "Deep Kernel Regression",
        "Personalization",
        "Posterior Correction",
        "Longitudinal Biomarker Prediction"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "HH4KWP8RP5",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "G6dMvRuhFr",
      "title": "Grounding Video Models to Actions through Goal Conditioned Exploration",
      "abstract": "Large video models, pretrained on massive quantities of amount of Internet video,  provide a rich source of physical knowledge about the dynamics and motions of objects and tasks.\nHowever, video models are not grounded in the embodiment of an agent, and do not describe how to actuate the world to reach the visual states depicted in a video.\nTo tackle this problem, current methods use a separate vision-based inverse dynamic model trained on embodiment-specific data to map image states to actions. \nGathering data to train such a model is often expensive and challenging, and this model is limited to visual settings similar to the ones in which data is available.\nIn this paper, we investigate how to directly  ground video models to continuous actions through self-exploration in the embodied environment -- using generated video states as visual goals for exploration.\nWe propose a framework that uses trajectory level action generation in combination with video guidance to\nenable an agent to solve complex tasks without any external supervision, e.g., rewards, action labels, or segmentation masks.\nWe validate the proposed approach on 8 tasks in Libero, 6 tasks in MetaWorld, 4 tasks in Calvin, and 12 tasks in iThor Visual Navigation. \nWe show how our approach is on par with or even surpasses multiple behavior cloning baselines trained on expert demonstrations while without requiring any action annotations.",
      "keywords": [
        "Embodied AI",
        "Decision Making",
        "Robotics",
        "Video Model"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "HH4KWP8RP5",
      "title": "Towards Improving Exploration through Sibling Augmented GFlowNets",
      "abstract": "Exploration is a key factor for the success of an active learning agent, especially when dealing with sparse extrinsic terminal rewards and long trajectories. We introduce Sibling Augmented Generative Flow Networks (SA-GFN), a novel framework designed to enhance exploration and training efficiency of Generative Flow Networks (GFlowNets). SA-GFN uses a decoupled dual network architecture, comprising of a main Behavior Network and an exploratory Sibling Network, to enable a diverse exploration of the underlying distribution using intrinsic rewards. Inspired by the ideas on exploration from reinforcement learning, SA-GFN provides a general-purpose exploration and learning paradigm that integrates with multiple GFlowNet training objectives and is especially helpful for exploration over a wide range of sparse or low reward distributions and task structures. An extensive set of experiments across a diverse range of tasks, reward structures and trajectory lengths, along with a thorough set of ablations, demonstrate the superior performance of SA-GFN in terms of exploration efficacy and convergence speed as compared to the existing methods. In addition, SA-GFN's versatility and compatibility with different GFlowNet training objectives and intrinsic reward methods underscores its broad applicability in various problem domains.",
      "keywords": [
        "Generative Models",
        "Generative Flow Networks",
        "Exploration"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "8xxEBAtD7y",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "5IWJBStfU7",
      "title": "Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?",
      "abstract": "As AI systems are increasingly deployed in high-stakes applications, ensuring their interpretability is essential. Mechanistic Interpretability (MI) aims to reverse-engineer neural networks by extracting human-understandable algorithms embedded within their structures to explain their behavior. This work systematically examines a fundamental question: for a fixed behavior to explain, and under the criteria that MI sets for itself, are we guaranteed a unique explanation? Drawing an analogy with the concept of identifiability in statistics, which ensures the uniqueness of parameters inferred from data under specific modeling assumptions, we speak about the identifiability of explanations produced by MI.\n\nWe identify two broad strategies to produce MI explanations: (i) \"where-then-what\", which first identifies a subset of the network (a circuit) that replicates the model's behavior before deriving its interpretation, and (ii) \"what-then-where\", which begins with candidate explanatory algorithms and searches in the activation subspaces of the neural model where the candidate algorithm may be implemented, relying on notions of causal alignment between the states of the candidate algorithm and the neural network. \n\nWe systematically test the identifiability of both strategies using simple tasks (learning Boolean functions) and multi-layer perceptrons small enough to allow a complete enumeration of candidate explanations. Our experiments reveal overwhelming evidence of non-identifiability in all cases: multiple circuits can replicate model behavior, multiple interpretations can exist for a circuit, several algorithms can be causally aligned with the neural network, and a single algorithm can be causally aligned with different subspaces of the network.\n\nWe discuss whether the unicity intuition is necessary. One could adopt a pragmatic stance, requiring explanations only to meet predictive and/or manipulability standards. However, if unicity is considered essential, e.g., to provide a sense of understanding, we also discuss less permissive criteria. Finally, we also refer to the inner interpretability framework that demands explanations to be validated by multiple complementary criteria. This work aims to contribute constructively to the ongoing effort to formalize what we expect from explanations in AI.",
      "keywords": [
        "AI interpretability",
        "mechanistic interpretability",
        "causal consistency",
        "explanatory algorithms",
        "circuits"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "Ozo7qJ5vZi",
      "title": "KAN: Kolmogorov–Arnold Networks",
      "abstract": "Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs). While MLPs have fixed activation functions on nodes (\"neurons''), KANs have learnable activation functions on edges (\"weights''). KANs have no linear weights at all -- every weight parameter is replaced by a univariate function parametrized as a spline. We show that this seemingly simple change makes KANs outperform MLPs in terms of accuracy and interpretability, on small-scale AI + Science tasks. For accuracy, smaller KANs can achieve comparable or better accuracy than larger MLPs in function fitting tasks. Theoretically and empirically, KANs possess faster neural scaling laws than MLPs. For interpretability, KANs can be intuitively visualized and can easily interact with human users. Through two examples in mathematics and physics, KANs are shown to be useful ``collaborators'' helping scientists (re)discover mathematical and physical laws. In summary, KANs are promising alternatives for MLPs. Despite the slow training of KANs, their improved accuracy and interpretability show the potential to improve today's deep learning models which rely heavily on MLPs. More research is necessary to make KANs' training more efficient.",
      "keywords": [
        "Kolmogorov-Arnold networks",
        "Kolmogorov-Arnold representation theorem",
        "learnable activation functions",
        "interpretability",
        "AI + Science"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "xPO6fwvldG",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "BoRmf8wDZ7",
      "title": "Gaussian Masked Autoencoders",
      "abstract": "This paper explores Masked Autoencoders (MAE) with Gaussian Splatting. While mainstream self-supervised learning frameworks such as MAE operate on low-level pixels, the image synthesis community has evolved to use latent, mid-level representations for better generative visual data modeling. Our approach, named GMAE, aims to reconcile these two and get the benefits of both worlds. Like MAE, it reconstructs the image end-to-end in the pixel space; however, it also introduces an intermediate, 3D Gaussian-based representation and renders images via splatting. We show that GMAE can enable various zero-shot learning capabilities (e.g figure-ground segmentation, image layering, edge detection, etc) while preserving the high self-supervised representation quality from MAE. Notably, we are the first to employ Gaussian primitives in an image representation learning framework beyond optimization-based single-scene reconstructions. We believe GMAE will inspire further research in this direction and contribute to developing next-generation techniques for modeling high-fidelity visual data.",
      "keywords": [
        "Representation learning",
        "Gaussian Splatting"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "xPO6fwvldG",
      "title": "UniRestore3D: A Scalable Framework For General Shape Restoration",
      "abstract": "Shape restoration aims to recover intact 3D shapes from defective ones, such as those that are incomplete, noisy, and low-resolution. Previous works have achieved impressive results in shape restoration subtasks thanks to advanced generative models. While effective for specific shape defects, they are less applicable in real-world scenarios involving multiple defect types simultaneously. Additionally, training on limited subsets of defective shapes hinders knowledge transfer across restoration types and thus affects generalization. In this paper, we address the task of general shape restoration, which restores shapes with various types of defects through a unified model, thereby naturally improving the applicability and scalability. Our approach first standardizes the data representation across different restoration subtasks using high-resolution TSDF grids and constructs a large-scale dataset with diverse types of shape defects. Next, we design an efficient hierarchical shape generation model and a noise-robust defective shape encoder that enables effective impaired shape understanding and intact shape generation. Moreover, we propose a scalable training strategy for efficient model training. The capabilities of our proposed method are demonstrated across multiple shape restoration subtasks and validated on various datasets, including Objaverse, ShapeNet, GSO, and ABO.",
      "keywords": [
        "Shape Restoration",
        "3D Reconstruction",
        "Diffusion Model"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "cJd1BgZ9CS",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "NI8AUSAc4i",
      "title": "LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy",
      "abstract": "The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), enabling faster inference by storing previously computed KV vectors. However, its memory consumption scales linearly with sequence length and batch size, posing a significant bottleneck in LLM deployment. Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages, which requires extensive parameter tuning thus unsuitable to pre-trained LLMs; (2) KV cache compression at test time, primarily through token eviction policies, which often overlook inter-layer dependencies and can be task-specific.\n\nThis paper introduces an orthogonal approach to KV cache compression. We propose a low-rank approximation of  KV weight matrices, allowing for plug-in integration with existing transformer-based LLMs without model retraining. To effectively compress KV cache at the weight level, we adjust for layerwise sensitivity and introduce a progressive compression strategy, which is supported by our theoretical analysis on how compression errors accumulate in deep networks. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages. Extensive experiments with LLaMA models ranging from 8B to 70B parameters across various tasks show that our approach significantly reduces the GPU memory footprint while maintaining performance.",
      "keywords": [
        "KV Cache Compression",
        "Progressive Compression Strategy"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "cJd1BgZ9CS",
      "title": "Distributed Speculative Inference (DSI): Speculation Parallelism for Provably Faster Lossless Language Model Inference",
      "abstract": "This paper introduces *distributed speculative inference (DSI)*, a novel inference algorithm that is provably faster than speculative inference (SI) [leviathan2023, chen2023, miao2024, sun2025, timor2025] and standard autoregressive inference (non-SI). Like other SI algorithms, DSI operates on frozen language models (LMs), requiring no training or architectural modifications, and it preserves the target distribution. Prior studies on SI have demonstrated empirical speedups over non-SI—but rely on sufficiently fast and accurate drafters, which are often unavailable in practice. We identify a gap where SI can be slower than non-SI if drafters are too slow or inaccurate. We close this gap by proving that DSI is faster than both SI and non-SI—given any drafters. DSI is therefore not only faster than SI, but also unlocks the acceleration of LMs for which SI fails. DSI leverages *speculation parallelism (SP)*, a novel type of task parallelism, to orchestrate target and drafter instances that overlap in time, establishing a new foundational tradeoff between computational resources and latency. Our simulations show that DSI is 1.29-1.92x faster than SI in single-node setups for various off-the-shelf LMs and tasks. We open-source all our code.",
      "keywords": [
        "inference algorithms for generative models",
        "LLM inference",
        "speculative decoding"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "E2PFv7ad3p",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "E2PFv7ad3p",
      "title": "Have the VLMs Lost Confidence? A Study of Sycophancy in VLMs",
      "abstract": "In the study of LLMs, sycophancy represents a prevalent hallucination that poses significant challenges to these models. Specifically, LLMs often fail to adhere to original correct responses, instead blindly agreeing with users' opinions, even when those opinions are incorrect or malicious. However, research on sycophancy in visual language models (VLMs) has been scarce. In this work, we extend the exploration of sycophancy from LLMs to VLMs, introducing the MM-SY benchmark to evaluate this phenomenon. We present evaluation results from multiple representative models, addressing the gap in sycophancy research for VLMs. To mitigate sycophancy, we propose a synthetic dataset for training and employ methods based on prompts, supervised fine-tuning, and DPO. Our experiments demonstrate that these methods effectively alleviate sycophancy in VLMs. Additionally, we probe VLMs to assess the semantic impact of sycophancy and analyze the attention distribution of visual tokens. Our findings indicate that the ability to prevent sycophancy is predominantly observed in higher layers of the model. The lack of attention to image knowledge in these higher layers may contribute to sycophancy, and enhancing image attention at high layers proves beneficial in mitigating this issue.",
      "keywords": [
        "Multi-modal Model",
        "Visual-Language Model",
        "Sycophancy",
        "Hallucination"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "ztzZDzgfrh",
      "title": "ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability",
      "abstract": "Retrieval-Augmented Generation (RAG) models are designed to incorporate external knowledge, reducing hallucinations caused by insufficient parametric (internal) knowledge. However, even with accurate and relevant retrieved content, RAG models can still produce hallucinations by generating outputs that conflict with the retrieved information. Detecting such hallucinations requires disentangling how Large Language Models (LLMs) balance external and parametric knowledge. Current detection methods often focus on one of these mechanisms or without decoupling their intertwined effects, making accurate detection difficult. In this paper, we investigate the internal mechanisms behind hallucinations in RAG scenarios. We discover hallucinations occur when the **Knowledge FFNs** in LLMs overemphasize parametric knowledge in the residual stream, while **Copying Heads** fail to effectively retain or integrate external knowledge from retrieved content. Based on these findings, we propose **ReDeEP**, a novel method that detects hallucinations by decoupling LLM’s utilization of external context and parametric knowledge. Our experiments show that ReDeEP significantly improves RAG hallucination detection accuracy. Additionally, we introduce AARF, which mitigates hallucinations by modulating the contributions of Knowledge FFNs and Copying Heads.",
      "keywords": [
        "Retrieval-Augmented Generation Hallucination",
        "Hallucination Detection",
        "Mechanistic Interpretability"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "YcbE2K3i2E",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "OD1MV7vf41",
      "title": "Deep Random Features for Scalable Interpolation of Spatiotemporal Data",
      "abstract": "The rapid growth of earth observation systems calls for a scalable approach to interpolate remote-sensing observations. These methods in principle, should acquire more information about the observed field as data grows. Gaussian processes (GPs) are candidate model choices for interpolation. However, due to their poor scalability, they usually rely on inducing points for inference, which restricts their expressivity. Moreover, commonly imposed assumptions such as stationarity prevents them from capturing complex patterns in the data. While deep GPs can overcome this issue, training and making inference with them are difficult, again requiring crude approximations via inducing points. In this work, we instead approach the problem through Bayesian deep learning, where spatiotemporal fields are represented by deep neural networks, whose layers share the inductive bias of stationary GPs on the plane/sphere via random feature expansions. This allows one to (1) capture high frequency patterns in the data, and (2) use mini-batched gradient descent for large scale training. We experiment on various remote sensing data at local/global scales, showing that our approach produce competitive or superior results to existing methods, with well-calibrated uncertainties.",
      "keywords": [
        "Random Features",
        "Deep Gaussian Processes",
        "Bayesian Deep Learning",
        "Remote Sensing"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "WOzffPgVjF",
      "title": "Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding",
      "abstract": "Transformer has attracted increasing interest in spatio-temporal video grounding, or STVG, owing to its end-to-end pipeline and promising result. Existing Transformer-based STVG approaches often leverage a set of object queries, which are initialized simply using zeros and then gradually learn target position information via iterative interactions with multimodal features, for spatial and temporal localization. Despite simplicity, these zero object queries, due to lacking target-specific cues, are hard to learn discriminative target information from interactions with multimodal features in complicated scenarios (e.g., with distractors or occlusion), resulting in degradation. Addressing this, we introduce a novel $\\textbf{T}$arget-$\\textbf{A}$ware Transformer for $\\textbf{STVG}$ ($\\textbf{TA-STVG}$), which seeks to adaptively generate object queries via exploring target-specific cues from the given video-text pair, for improving STVG. The key lies in two simple yet effective modules, comprising text-guided temporal sampling (TTS) and attribute-aware spatial activation (ASA), working in a cascade. The former focuses on selecting target-relevant temporal cues from a video utilizing holistic text information, while the latter aims at further exploiting the fine-grained visual attribute information of the object from previous target-aware temporal cues, which is applied for object query initialization. Compared to existing methods leveraging zero-initialized queries, object queries in our TA-STVG, directly generated from a given video-text pair, naturally carry target-specific cues, making them adaptive and better interact with multimodal features for learning more discriminative information to improve STVG. In our experiments on three benchmarks, including HCSTVG-v1/-v2 and VidSTG, TA-STVG achieves state-of-the-art performance and significantly outperforms the baseline, validating its efficacy. Moreover, TTS and ASA are designed for general purpose. When applied to existing methods such as TubeDETR and STCAT, we show substantial performance gains, verifying its generality. Code is released at https://github.com/HengLan/TA-STVG.",
      "keywords": [
        "Spatio-Temporal Video Grounding"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "2efNHgYRvM",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "k03mB41vyM",
      "title": "Identifiable Exchangeable Mechanisms for Causal Structure and Representation Learning",
      "abstract": "Identifying latent representations or causal structures is important for good generalization and downstream task performance. However, both fields developed rather independently.\nWe observe that several structure and representation identifiability methods, particularly those that require multiple environments, rely on \nexchangeable non--i.i.d. (independent and identically distributed) data.\nTo formalize this connection, \nwe propose the Identifiable Exchangeable Mechanisms (IEM) framework to unify key representation and causal structure learning methods. IEM provides a unified probabilistic graphical model encompassing causal discovery, Independent Component Analysis, and Causal Representation Learning.\nWith the help of the IEM model, we generalize the Causal de Finetti theorem of Guo et al., 2022 by relaxing the necessary conditions for causal structure identification in exchangeable data.\nWe term these conditions cause and mechanism variability, and show how they imply a duality condition in identifiable representation learning, leading to new identifiability results.",
      "keywords": [
        "causality",
        "ICA",
        "identifiability",
        "causal representation learning"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "lk2Qk5xjeu",
      "title": "Unifying Causal Representation Learning with the Invariance Principle",
      "abstract": "Causal representation learning (CRL) aims at recovering latent causal variables from high-dimensional observations to solve causal downstream tasks, such as predicting the effect of new interventions or more robust classification. \n  A plethora of methods have been developed, each tackling carefully crafted problem settings that lead to different types of identifiability. \n  These different settings are widely assumed to be important because they are often linked to different rungs of Pearl's causal hierarchy, even though this correspondence is not always exact.\n    This work shows that instead of strictly conforming to this hierarchical mapping, *many causal representation learning approaches methodologically align their representations with inherent data symmetries.*\n  Identification of causal variables is guided by invariance principles that are not necessarily causal. \n  This result allows us to unify many existing approaches in a single method that can mix and match different assumptions, including non-causal ones, based on the invariance relevant to the problem at hand. \n  It also significantly benefits applicability, which we demonstrate by improving treatment effect estimation on real-world high-dimensional ecological data. Overall, this paper clarifies the role of causal assumptions in the discovery of causal variables and shifts the focus to preserving data symmetries.",
      "keywords": [
        "Causal representation learning",
        "Identifiability",
        "Invariance"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "bVTM2QKYuA",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "bVTM2QKYuA",
      "title": "The Geometry of Categorical and Hierarchical Concepts in Large Language Models",
      "abstract": "The linear representation hypothesis is the informal idea that semantic concepts are encoded as linear directions in the representation spaces of large language models (LLMs). Previous work has shown how to make this notion precise for representing binary concepts that have natural contrasts (e.g., {male, female}) as _directions_ in representation space. However, many natural concepts do not have natural contrasts (e.g., whether the output is about an animal). In this work, we show how to extend the formalization of the linear representation hypothesis to represent features (e.g., is_animal) as _vectors_. This allows us to immediately formalize the representation of categorical concepts as polytopes in the representation space. Further, we use the formalization to prove a relationship between the hierarchical structure of concepts and the geometry of their representations. We validate these theoretical results on the Gemma and LLaMA-3 large language models, estimating representations for 900+ hierarchically related concepts using data from WordNet.",
      "keywords": [
        "categorical concepts",
        "hierarchical concepts",
        "linear representation hypothesis",
        "causal inner product",
        "interpretability"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "yVGGtsOgc7",
      "title": "Disentangling Representations through Multi-task Learning",
      "abstract": "Intelligent perception and interaction with the world hinges on internal representations that capture its underlying structure (\"disentangled\" or \"abstract\" representations). Disentangled representations serve as world models, isolating latent factors of variation in the world along approximately orthogonal directions, thus facilitating feature-based generalization. We provide experimental and theoretical results guaranteeing the emergence of disentangled representations in agents that optimally solve multi-task evidence accumulation classification tasks, canonical in the neuroscience literature. The key conceptual finding is that, by producing accurate multi-task classification estimates, a system implicitly represents a set of coordinates specifying a disentangled representation of the underlying latent state of the data it receives. The theory provides conditions for the emergence of these representations in terms of noise, number of tasks, and evidence accumulation time, when the classification boundaries are affine in the latent space. Surprisingly, the theory also produces closed-form expressions for extracting the disentangled representation from the model's latent state $\\mathbf Z(t)$. We experimentally validate these predictions in RNNs trained on multi-task classification, which learn disentangled representations in the form of continuous attractors, leading to zero-shot out-of-distribution (OOD) generalization in predicting latent factors. We demonstrate the robustness of our framework across autoregressive architectures, decision boundary geometries and in tasks requiring classification confidence estimation. We find that transformers are particularly suited for disentangling representations, which might explain their unique world understanding abilities. Overall, our framework establishes a formal link between competence at multiple tasks and the formation of disentangled, interpretable world models in both biological and artificial systems, and helps explain why ANNs often arrive at human-interpretable concepts, and how they both may acquire exceptional zero-shot generalization capabilities.",
      "keywords": [
        "zero-shot generalization",
        "disentanglement",
        "interpretability",
        "world models",
        "multi-task learning",
        "computational neuroscience",
        "neuroAI",
        "evidence accumulation",
        "cognitive maps",
        "continuous attractors",
        "RNNs",
        "transformers"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "Q3qAsZAEZw",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "WJIDorHiuZ",
      "title": "CoRe: Benchmarking LLMs’ Code Reasoning Capabilities through Static Analysis Tasks",
      "abstract": "Large language models (LLMs) have been widely adopted across diverse domains of software engineering, such as code generation, program repair, and vulnerability detection. These applications require understanding beyond surface-level code patterns: value propagation, control flow, and interdependence between program elements. However, existing benchmarks primarily evaluate end-to-end outcomes, such as whether code is correctly repaired or generated, \nleaving the models' ability of program semantic reasoning underexplored.\nThis work presents CoRe, a high-quality, human-verified benchmark designed to evaluate LLMs on fundamental static analysis tasks. CoRe includes 12,553 task instances spanning data dependency, control dependency, and information flow across programs written in C/C++, Java, and Python. \nTo ensure semantic diversity and reasoning complexity, we propose a semantics-aware diverse sampling strategy that selects targets and task instances based on structural coverage and dependency depth. \nWe evaluate 10 state-of-the-art LLMs and show that, while they perform well at identifying dependencies, models still struggle with tasks that require deeper semantic understanding and multi-step reasoning.\nWe further conduct qualitative analyses to uncover key challenges, such as complex control structures and backward dependency patterns, offering insights into improving LLMs’ code reasoning capabilities.",
      "keywords": [
        "Large Language Models",
        "Code Reasoning",
        "Program Analysis",
        "Code Understanding",
        "Static Analysis",
        "Benchmarking"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "iBFfb6bGOz",
      "title": "Atomic Thinking of LLMs: Decoupling and Exploring Mathematical Reasoning Abilities",
      "abstract": "Large Language Models (LLMs) have demonstrated outstanding performance in mathematical reasoning capabilities. However, we argue that current large-scale reasoning models primarily rely on scaling up training datasets with diverse mathematical problems and long thinking chains, which raises questions about whether LLMs genuinely acquire mathematical concepts and reasoning principles or merely remember the training data. In contrast, humans tend to break down complex problems into multiple fundamental atomic capabilities. Inspired by this, we propose a new paradigm for evaluating mathematical atomic capabilities. Our work categorizes atomic abilities into two dimensions: (1) field-specific abilities across four major mathematical fields, algebra, geometry, analysis, and topology, and (2) logical abilities at different levels, including conceptual understanding, forward multi-step reasoning with formal math language, and counterexample-driven backward reasoning. We propose corresponding training and evaluation datasets for each atomic capability unit, and conduct extensive experiments about how different atomic capabilities influence others, to explore the strategies to elicit the required specific atomic capability. Evaluation and experimental results on advanced models show many interesting discoveries and inspirations about the different performances of models on various atomic capabilities and the interactions between atomic capabilities. Our findings highlight the importance of decoupling mathematical intelligence into atomic components, providing new insights into model cognition and guiding the development of training strategies toward a more efficient, transferable, and cognitively grounded paradigm of \"atomic thinking\".",
      "keywords": [
        "Large Language Models",
        "Mathematical Reasoning",
        "Atomic Thinking"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "R73ybUciQF",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "DfHcKzmHpp",
      "title": "Can We Partially Rewrite Transformers in Natural Language?",
      "abstract": "The greatest ambition of mechanistic interpretability is to completely rewrite deep neural networks in a format that is more amenable to human understanding, while preserving their behavior and performance. In this paper we evaluate whether sparse autoencoders (SAEs) and transcoders can be used for this purpose. We use an automated pipeline to generate explanations for each of the sparse coder latents. We then simulate the activation of each latent on a number of different inputs using an LLM prompted with the explanation we generated in the previous step, and \"partially rewrite'' the original model by patching the simulated activations into its forward pass. We find that current sparse coding techniques and automated interpretability pipelines are not up to the task of rewriting even a single layer of a transformer: the model is severely degraded by patching in the simulated activations. We believe this approach is the most thorough way to assess the quality of SAEs and transcoders, despite its high computational cost.",
      "keywords": [
        "Sparse autoencoders",
        "interpretability",
        "language models"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "p8lKcNkJRi",
      "title": "Dense SAE Latents Are Features, Not Bugs",
      "abstract": "Sparse autoencoders (SAEs) are designed to extract interpretable features from language models by enforcing a sparsity constraint. Ideally, training an SAE would yield latents that are both sparse and semantically meaningful. However, many SAE latents activate frequently (i.e., are *dense*), raising concerns that they may be undesirable artifacts of the training procedure. In this work, we systematically investigate the geometry, function, and origin of dense latents and show that they are not only persistent but often reflect meaningful model representations. We first demonstrate that dense latents tend to form antipodal pairs that reconstruct specific directions in the residual stream, and that ablating their subspace suppresses the emergence of new dense features in retrained SAEs---suggesting that high density features are an intrinsic property of the residual space. We then introduce a taxonomy of dense latents, identifying classes tied to position tracking, context binding, entropy regulation, letter-specific output signals, part-of-speech, and principal component reconstruction. Finally, we analyze how these features evolve across layers, revealing a shift from structural features in early layers, to semantic features in mid layers, and final to output-oriented signals in the last layers of the model. Our findings indicate that dense latents serve functional roles in language model computation and should not be dismissed as training noise.",
      "keywords": [
        "sae",
        "sparse autoencoder",
        "interpretability",
        "mechanistic interpretability",
        "language model"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "xak8c9l1nu",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "CrOHzVtWmH",
      "title": "Relative-Translation Invariant Wasserstein Distance",
      "abstract": "In many real-world applications, data distributions are often subject to translation shifts caused by various factors such as changes in environmental conditions, sensor settings, or shifts in data collection practices. These distribution shifts pose a significant challenge for measuring the similarity between probability distributions, particularly in tasks like domain adaptation or transfer learning. To address this issue, we introduce a new family of distances, relative-translation invariant Wasserstein distances ($RW_p$), to measure the similarity of two probability distributions under distribution shift. Generalizing it from the classical optimal transport model, we show that $RW_p$ distances are also real distance metrics defined on the quotient set $\\mathcal{P}_p(\\mathbb{R}^n)/\\sim$ and invariant to distribution translations, which forms a family of new metric spaces. When $p=2$, the $RW_2$ distance enjoys more exciting properties, including decomposability of the optimal transport model and translation-invariance of the $RW_2$ distance. Based on these properties, we show that a distribution shift, measured by $W_2$ distance, can be explained in the bias-variance perspective. In addition, we propose two algorithms: one algorithm is a two-stage optimization algorithm for computing the general case of $RW_p$ distance, and the other is a variant of the Sinkhorn algorithm, named $RW_2$ Sinkhorn algorithm, for efficiently calculating $RW_2$ distance, coupling solutions, as well as $W_2$ distance. We also provide the analysis of numerical stability and time complexity for the proposed algorithms. Finally, we validate the $RW_p$ distance metric and the algorithm performance with two experiments. We conduct one numerical validation for the $RW_2$ Sinkhorn algorithm and demonstrate the effectiveness of using $RW_p$ under distribution shift for similar thunderstorm detection. The experimental results report that our proposed algorithm significantly improves the computational efficiency of Sinkhorn in practical applications, and the $RW_p$ distance is robust to distribution translations.",
      "keywords": [
        "Optimal transport theory",
        "Wasserstein distance",
        "Distribution shift"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "MT3aOfXIbY",
      "title": "Faster Diffusion Sampling with Randomized Midpoints: Sequential and Parallel",
      "abstract": "Sampling algorithms play an important role in controlling the quality and runtime of diffusion model inference. In recent years, a number of works (Chen et al., 2023c;b; Benton et al., 2023; Lee et al., 2022) have analyzed algorithms for diffusion sampling with provable guarantees; these works show that for essentially any data distribution, one can approximately sample in polynomial time given a sufficiently accurate estimate of its score functions at different noise levels. \n\nIn this work, we propose a new scheme inspired by Shen and Lee's randomized midpoint method for log-concave sampling  (Shen & Lee, 2019). We prove that this approach achieves the best known dimension dependence for sampling from arbitrary smooth distributions in total variation distance ($\\widetilde O(d^{5/12})$ compared to $\\widetilde O(\\sqrt{d})$ from prior work). We also show that our algorithm can be parallelized to run in only $\\widetilde O(\\log^2 d)$ parallel rounds, constituting the first provable guarantees for parallel sampling with diffusion models.\n    \nAs a byproduct of our methods, for the well-studied problem of log-concave sampling in total variation distance, we give an algorithm and simple analysis achieving dimension dependence $\\widetilde O(d^{5/12})$ compared to $\\widetilde O(\\sqrt{d})$ from prior work.",
      "keywords": [
        "Diffusion Sampling",
        "Generative Model",
        "Statistical Theory"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "k38Th3x4d9",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "3fl1SENSYO",
      "title": "DiffPuter: Empowering Diffusion Models for Missing Data Imputation",
      "abstract": "Generative models play an important role in missing data imputation in that they aim to learn the joint distribution of full data. However, applying advanced deep generative models (such as Diffusion models) to missing data imputation is challenging due to 1) the inherent incompleteness of the training data and 2) the difficulty in performing conditional inference from unconditional generative models. To deal with these challenges, this paper introduces DiffPuter, a tailored diffusion model combined with the Expectation-Maximization (EM) algorithm for missing data imputation. DiffPuter iteratively trains a diffusion model to learn the joint distribution of missing and observed data and performs an accurate conditional sampling to update the missing values using a tailored reversed sampling strategy. Our theoretical analysis shows that DiffPuter's training step corresponds to the maximum likelihood estimation of data density (M-step), and its sampling step represents the Expected A Posteriori estimation of missing values (E-step). Extensive experiments across ten diverse datasets and comparisons with 17 different imputation methods demonstrate DiffPuter's superior performance. Notably, DiffPuter achieves an average improvement of 8.10\\% in MAE and 5.64\\% in RMSE compared to the most competitive existing method.",
      "keywords": [
        "Diffusion models",
        "missing data imputation"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "m08aK3xxdJ",
      "title": "CATCH: Channel-Aware Multivariate Time Series Anomaly Detection via Frequency Patching",
      "abstract": "Anomaly detection in multivariate time series is challenging as heterogeneous subsequence anomalies may occur. Reconstruction-based methods, which focus on learning normal patterns in the frequency domain to detect diverse abnormal subsequences, achieve promising results, while still falling short on capturing fine-grained frequency characteristics and channel correlations. To contend with the limitations, we introduce CATCH, a framework based on frequency patching. We propose to patchify the frequency domain into frequency bands, which enhances its ability to capture fine-grained frequency characteristics. To perceive appropriate channel correlations, we propose a Channel Fusion Module (CFM), which features a patch-wise mask generator and a masked-attention mechanism. Driven by a bi-level multi-objective optimization algorithm, the CFM is encouraged to iteratively discover appropriate patch-wise channel correlations, and to cluster relevant channels while isolating adverse effects from irrelevant channels. Extensive experiments on 10 real-world datasets and 12 synthetic datasets demonstrate that CATCH achieves state-of-the-art performance. We make our code and datasets available at https://github.com/decisionintelligence/CATCH.",
      "keywords": [
        "Multivariate Time Series",
        "Anomaly Detection"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "B6bE2GC71a",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "FZURCro04D",
      "title": "Reasoning on a Spectrum: Aligning LLMs to System 1 and System 2 Thinking",
      "abstract": "Large language models (LLMs) demonstrate remarkable reasoning capabilities, yet their reliance on step-by-step reasoning can make them brittle when tasks do not align with such structured approaches. In contrast, human cognition flexibly alternates between fast, intuitive reasoning (System 1) and slow, analytical reasoning (System 2), depending on context. To bridge this gap, we curate a dataset of 2K examples, each with valid responses from both reasoning styles, and explicitly align LLMs with System 1 and System 2 reasoning. Evaluations across diverse reasoning benchmarks reveal an accuracy-efficiency trade-off: System 2-aligned models excel in arithmetic and symbolic reasoning, while System 1-aligned models perform better in commonsense tasks. A mechanistic analysis of model responses shows that System 1 models employ more definitive answers, whereas System 2 models demonstrate greater uncertainty. Interpolating between these extremes produces a monotonic transition in reasoning accuracy, preserving coherence. This work challenges the assumption that step-by-step reasoning is always optimal and highlights the need for adapting reasoning strategies based on task demands.",
      "keywords": [
        "Alignment",
        "System 1 and System 2 thinking",
        "Cognitive heuristics",
        "LLM",
        "NLP"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "ySFDPoiANu",
      "title": "Execution Guided Line-by-Line Code Generation",
      "abstract": "We present a novel approach to neural code generation that incorporates real-time execution signals into the language model generation process. While large language models (LLMs) have demonstrated impressive code generation capabilities, they typically do not utilize execution feedback during inference, a critical signal that human programmers regularly leverage. Our method, Execution-Guided Classifier-Free Guidance EG-CFG, dynamically incorporates execution signals as the model generates code, providing line-by-line feedback that guides the generation process toward executable solutions.\nEG-CFG employs a multi-stage process: first, we conduct beam search to sample candidate program completions for each line; second, we extract execution signals by executing these candidates against test cases; and finally, we incorporate these signals into the prompt during generation. By maintaining consistent signals across tokens within the same line and refreshing signals at line boundaries, our approach provides coherent guidance while preserving syntactic structure. Moreover, the method naturally supports native parallelism at the task level in which multiple agents operate in parallel, exploring diverse reasoning paths and collectively generating a broad set of candidate solutions.\nOur experiments across diverse coding tasks demonstrate that EG-CFG significantly improves code generation performance compared to standard approaches, achieving state-of-the-art results across various levels of complexity, from foundational problems to challenging competitive programming and data science tasks.",
      "keywords": [
        "Program Induction",
        "Problem Solving",
        "Reasoning",
        "Natural Language Processing"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "EzjsoomYEb",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "EzjsoomYEb",
      "title": "Topological Blindspots: Understanding and Extending Topological Deep Learning Through the Lens of Expressivity",
      "abstract": "Topological deep learning (TDL) is a rapidly growing field that seeks to leverage topological structure in data and facilitate learning from data supported on topological objects, ranging from molecules to 3D shapes. Most TDL architectures can be unified under the framework of higher-order message-passing (HOMP), which generalizes graph message-passing to higher-order domains. In the first part of the paper, we explore HOMP's expressive power from a topological perspective, demonstrating the framework's inability to capture fundamental topological and metric invariants such as diameter, orientability, planarity, and homology. In addition, we demonstrate HOMP's limitations in fully leveraging lifting and pooling methods on graphs. To the best of our knowledge, this is the first work to study the expressivity of TDL from a topological perspective. In the second part of the paper, we develop two new classes of architectures -- multi-cellular networks (MCN) and scalable MCN (SMCN) -- which draw inspiration from expressive GNNs. MCN can reach full expressivity, but scaling it to large data objects can be computationally expansive. Designed as a more scalable alternative, SMCN still mitigates many of HOMP's expressivity limitations. Finally, we design new benchmarks for evaluating models based on their ability to learn topological properties of complexes. We then evaluate SMCN on these benchmarks as well as on real-world graph datasets, demonstrating improvements over both HOMP baselines and expressive graph methods, highlighting the value of expressively leveraging topological information.",
      "keywords": [
        "Topological Deep Learning",
        "Message Passing",
        "Higher Order Message Passing",
        "Expressivity",
        "Graph Neural Networks",
        "GNNs",
        "Topology",
        "Homology",
        "Symmetry"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "QC2qE1tcmd",
      "title": "Demystifying Topological Message-Passing with Relational Structures: A Case Study on Oversquashing in Simplicial Message-Passing",
      "abstract": "Topological deep learning (TDL) has emerged as a powerful tool for modeling higher-order interactions in relational data. However, phenomena such as oversquashing in topological message-passing remain understudied and lack theoretical analysis. We propose a unifying axiomatic framework that bridges graph and topological message-passing by viewing simplicial and cellular complexes and their message-passing schemes through the lens of relational structures. This approach extends graph-theoretic results and algorithms to higher-order structures, facilitating the analysis and mitigation of oversquashing in topological message-passing networks. Through theoretical analysis and empirical studies on simplicial networks, we demonstrate the potential of this framework to advance TDL.",
      "keywords": [
        "topological deep learning",
        "oversquashing",
        "rewiring",
        "relational graph neural networks",
        "simplicial complexes",
        "relational structures"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "B6bE2GC71a",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "B6bE2GC71a",
      "title": "EvoLM: In Search of Lost Language Model Training Dynamics",
      "abstract": "Modern language model (LM) training has been divided into multiple stages, making it difficult for downstream developers to evaluate the impact of design choices made at each stage.\nWe present EvoLM, a model suite that enables systematic and transparent analysis of LMs' training dynamics across pre-training, continued pre-training, supervised fine-tuning, and reinforcement learning. \nBy training over 100 LMs with 1B and 4B parameters from scratch, we rigorously evaluate both upstream (language modeling) and downstream (problem-solving) reasoning capabilities, including considerations of both in-domain and out-of-domain generalization. \nKey insights highlight the diminishing returns from excessive pre-training and post-training, the importance and practices of mitigating forgetting during domain-specific continued pre-training, the crucial role of continued pre-training in bridging pre-training and post-training phases, and various intricate trade-offs when configuring supervised fine-tuning and reinforcement learning. \nTo facilitate open research and reproducibility, we release all pre-trained and post-trained models, training datasets for all stages, and our entire training and evaluation pipeline.",
      "keywords": [
        "Language Models",
        "Training Dynamics",
        "Pretraining",
        "Post-training"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "ySFDPoiANu",
      "title": "Execution Guided Line-by-Line Code Generation",
      "abstract": "We present a novel approach to neural code generation that incorporates real-time execution signals into the language model generation process. While large language models (LLMs) have demonstrated impressive code generation capabilities, they typically do not utilize execution feedback during inference, a critical signal that human programmers regularly leverage. Our method, Execution-Guided Classifier-Free Guidance EG-CFG, dynamically incorporates execution signals as the model generates code, providing line-by-line feedback that guides the generation process toward executable solutions.\nEG-CFG employs a multi-stage process: first, we conduct beam search to sample candidate program completions for each line; second, we extract execution signals by executing these candidates against test cases; and finally, we incorporate these signals into the prompt during generation. By maintaining consistent signals across tokens within the same line and refreshing signals at line boundaries, our approach provides coherent guidance while preserving syntactic structure. Moreover, the method naturally supports native parallelism at the task level in which multiple agents operate in parallel, exploring diverse reasoning paths and collectively generating a broad set of candidate solutions.\nOur experiments across diverse coding tasks demonstrate that EG-CFG significantly improves code generation performance compared to standard approaches, achieving state-of-the-art results across various levels of complexity, from foundational problems to challenging competitive programming and data science tasks.",
      "keywords": [
        "Program Induction",
        "Problem Solving",
        "Reasoning",
        "Natural Language Processing"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "WhEPg4mUs6",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "WhEPg4mUs6",
      "title": "Analog In-memory Training on General Non-ideal Resistive Elements: The Impact of Response Functions",
      "abstract": "As the economic and environmental costs of training and deploying large vision or language models increase dramatically, analog in-memory computing (AIMC) emerges as a promising energy-efficient solution. However, the training perspective, especially its training dynamic, is underexplored. In AIMC hardware, the trainable weights are represented by the conductance of resistive elements and updated using consecutive electrical pulses.  While the conductance changes by a constant in response to each pulse, in reality, the change is scaled by asymmetric and non-linear response functions, leading to a non-ideal training dynamic. This paper provides a theoretical foundation for gradient-based training on AIMC hardware with non-ideal response functions.  We demonstrate that asymmetric response functions negatively impact Analog SGD by imposing an implicit penalty on the objective. To overcome the issue, we propose residual learning algorithm, which provably converges exactly to a critical point by solving a bilevel optimization problem. We show that the proposed method can be extended to deal with other hardware imperfections like limited response granularity. As far as we know, it is the first paper to investigate the impact of a class of generic non-ideal response functions. The conclusion is supported by simulations validating our theoretical insights.",
      "keywords": [
        "Analog AI; in-memory computing; stochastic gradient descent; stochastic optimization"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "tDG6bY48ch",
      "title": "PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation",
      "abstract": "Recently, large vision-language models (LVLMs) have rapidly gained popularity for their strong generation and reasoning capabilities given diverse multimodal inputs. However, these models incur significant computational and memory overhead during inference, which greatly hinders the efficient deployment in practical scenarios. The extensive key-value (KV) cache, necessitated by the lengthy input and output sequences, notably contributes to the high inference cost. Based on this, recent works have investigated ways to reduce the KV cache size for higher efficiency. Although effective, they generally overlook the distinct importance distributions of KV vectors across layers and maintain the same cache size for each layer during the next token prediction. This results in the significant contextual information loss for certain layers, leading to notable performance decline. To address this, we present PrefixKV. It reframes the challenge of determining KV cache sizes for all layers into the task of searching for the optimal global prefix configuration. With an adaptive layer-wise KV retention recipe based on binary search, the maximum contextual information can thus be preserved in each layer, facilitating the generation. Extensive experiments demonstrate that our method achieves the state-of-the-art performance compared with others. It exhibits superior inference efficiency and generation quality trade-offs, showing promising potential for practical applications. Code is available at\nhttps://github.com/THU-MIG/PrefixKV.",
      "keywords": [
        "large vision-language models",
        "KV cache compression"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "WrYWolqKh3",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "eR8raBLZW7",
      "title": "BriLLM: Brain-inspired Large Language Model",
      "abstract": "This paper reports the brain-inspired large language model (BriLLM). This is a non-Transformer, non-GPT, non-traditional machine learning input-output controlled generative language model. The model is based on the Signal Fully-connected flowing (SiFu) definition on the directed graph in terms of the neural network, and has the interpretability of all nodes on the graph of the whole model, instead of the traditional machine learning model that only has limited interpretability at the input and output ends. In the language model scenario, the token is defined as a node in the graph. A randomly shaped or user-defined signal flow flows between nodes on the principle of \"least resistance\" along paths. The next token or node to be predicted or generated is the target of the signal flow. As a language model, BriLLM theoretically supports infinitely long $n$-gram models when the model size is independent of the input and predicted length of the model. The model's working signal flow provides the possibility of recall activation and innate multi-modal support similar to the cognitive patterns of the human brain. At present, we released the first BriLLM versions in Chinese and English, with 4000 tokens, 32-dimensional node size, 32-token sequence prediction ability, model sizes around 2B and 1B respectively, bringing language model prediction performance comparable to GPT-1.",
      "keywords": [
        "LLM"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "stpe7UeETz",
      "title": "Corrector Sampling in Language Models",
      "abstract": "Autoregressive language models accumulate errors due to their fixed, irrevocable left-to-right token generation. To address this, we propose a new sampling method called Resample-Previous-Tokens (RPT). RPT mitigates error accumulation by iteratively revisiting and potentially replacing tokens in a window of previously generated text. Fine-tuning a pretrained 8B parameter model with RPT for only 100B resulted in ~10% relative improvements on reasoning and coding benchmarks compared to the standard sampling.",
      "keywords": [
        "Language modeling",
        "Sampling"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "XPe55Uffd7",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "XPe55Uffd7",
      "title": "Agnostic Active Learning Is Always Better Than Passive Learning",
      "abstract": "We sharply characterize the optimal first-order query complexity of agnostic active learning for all concept classes, and propose a new general active learning algorithm which achieves it. Remarkably, the optimal query complexity admits a leading term which is always strictly smaller than the sample complexity of passive supervised learning (by a factor proportional to the best-in-class error rate). This was not previously known to be possible in the agnostic setting. For comparison, in all previous general analyses, the leading term exhibits an additional factor, such as the disagreement coefficient or related complexity measure, and therefore only provides improvements over passive learning in restricted cases. The present work completely removes such factors from the leading term, implying that $\\textit{every}$ concept class benefits from active learning in the non-realizable case. The results established in this work resolve an important long-standing open question central to the past two decades of research on the theory of agnostic active learning.",
      "keywords": [
        "Active learning",
        "Agnostic learning",
        "PAC learning",
        "Query complexity",
        "Minimax analysis",
        "VC dimension",
        "Star number",
        "Disagreement coefficient"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "qRPIWtf3SE",
      "title": "Learning single index models via harmonic decomposition",
      "abstract": "We study the problem of learning single-index models, where the label $y \\in \\mathbb{R}$ depends on the input $\\boldsymbol{x} \\in \\mathbb{R}^d$ only through an unknown one-dimensional projection $\\langle \\boldsymbol{w_*}, \\boldsymbol{x} \\rangle$. Prior work has shown that under Gaussian inputs, the statistical and computational complexity of recovering $\\boldsymbol{w}_*$ is governed by the Hermite expansion of the link function. In this paper, we propose a new perspective: we argue that *spherical harmonics*---rather than *Hermite polynomials*---provide the natural basis for this problem, as they capture its intrinsic \\textit{rotational symmetry}. Building on this insight, we characterize the complexity of learning single-index models under arbitrary spherically symmetric input distributions. We introduce two families of estimators---based on tensor-unfolding and online SGD---that respectively achieve either  optimal sample complexity or optimal runtime, and argue that estimators achieving both may not exist in general. When specialized to Gaussian inputs, our theory not only recovers and clarifies existing results but also reveals new phenomena that had previously been overlooked.",
      "keywords": [
        "single-index models",
        "statistical and computational complexity"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "dGSOn7sdWg",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "OW332Wh9S5",
      "title": "DC-Spin: A Speaker-invariant Speech Tokenizer For Spoken Language Models",
      "abstract": "Spoken language models (SLMs) have gained increasing attention with advancements in text-based, decoder-only language models. SLMs process text and speech, enabling simultaneous speech understanding and generation. This paper presents Double-Codebook Speaker-invariant Clustering (DC-Spin), which aims to improve speech tokenization by bridging audio signals and SLM tokens. DC-Spin extracts speaker-invariant tokens rich in phonetic information and resilient to input variations, enhancing zero-shot SLM tasks and speech resynthesis. We propose a chunk-wise approach to enable streamable DC-Spin without retraining and degradation. Comparisons of tokenization methods (self-supervised and neural audio codecs), model scalability, and downstream task proxies show that tokens easily modeled by an n-gram LM or aligned with phonemes offer strong performance, providing insights for designing speech tokenizers for SLMs.",
      "keywords": [
        "speech tokenizer",
        "self-supervised learning",
        "spoken language model",
        "speech language model",
        "speech resynthesis",
        "audio codec"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "dGSOn7sdWg",
      "title": "SyllableLM: Learning Coarse Semantic Units for Speech Language Models",
      "abstract": "Language models require tokenized inputs. However, tokenization strategies for continuous data like audio and vision are often based on simple heuristics such as fixed sized convolutions or discrete clustering, which do not necessarily align with the semantic structure of the data. For speech in particular, the high resolution of waveforms (16,000 samples/second or more) presents a significant challenge as speech-based language models have had to use several times more tokens per word than text-based language models. In this work, we introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units while still preserving semantic information. We do this by 1) extracting noisy boundaries through analyzing correlations in pretrained encoder losses and 2) iteratively improving model representations with a novel distillation technique. Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and achieves SotA in syllabic segmentation and clustering. Using these coarse tokens, we successfully train SyllableLM, a Speech Language Model (SpeechLM) that matches or outperforms current SotA SpeechLMs on a range of spoken language modeling tasks. SyllableLM also achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup. Our code and checkpoints are available at https://www.github.com/alanbaade/SyllableLM",
      "keywords": [
        "Generative Spoken Language Modeling",
        "Audio",
        "Textless NLP",
        "Representation Learning"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "J2Jyp1SZ0n",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "J2Jyp1SZ0n",
      "title": "MMSearch: Unveiling the Potential of Large Models as Multi-modal Search Engines",
      "abstract": "The advent of Large Language Models (LLMs) has paved the way for AI search engines, e.g., SearchGPT, showcasing a new paradigm in human-internet interaction. However, most current AI search engines are limited to text-only settings, neglecting the multimodal user queries and the text-image interleaved nature of website information. Recently, Large Multimodal Models (LMMs) have made impressive strides. Yet, whether they can function as AI search engines remains under-explored, leaving the potential of LMMs in multimodal search an open question. To this end, we first design a delicate pipeline, MMSearch-Engine, to empower any LMMs with multimodal search capabilities. On top of this, we introduce MMSearch, a comprehensive evaluation benchmark to assess the multimodal search performance of LMMs. The curated dataset contains 300 manually collected instances spanning 14 subfields, which involves no overlap with the current LMMs' training data, ensuring the correct answer can only be obtained within searching. By using MMSearch-Engine, the LMMs are evaluated by performing three individual tasks (requery, rerank, and summarization), and one challenging end-to-end task with a complete searching process. We conduct extensive experiments on closed-source and open-source LMMs. Among all tested models, GPT-4o with MMSearch-Engine achieves the best results, which surpasses the commercial product, Perplexity Pro, in the end-to-end task, demonstrating the effectiveness of our proposed pipeline. We further present error analysis to unveil current LMMs still struggle to fully grasp the multimodal search tasks, and conduct ablation study to indicate the potential of scaling test-time computation for AI search engine. We hope MMSearch may provide unique insights to guide the future development of multimodal AI search engine.",
      "keywords": [
        "Large Multimodal Model",
        "AI Search Engine",
        "Benchmark"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "n0OtGl6VGb",
      "title": "ThinK: Thinner Key Cache by Query-Driven Pruning",
      "abstract": "Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. \nHowever, their increased computational and memory demands present significant challenges, especially when handling long sequences.\nThis paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. \nUnlike existing approaches that optimize the memory based on the sequence length, we identify substantial redundancy in the channel dimension of the KV cache, as indicated by an uneven magnitude distribution and a low-rank structure in the attention weights.\nIn response, we propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels. Our approach not only maintains or enhances model accuracy but also achieves a reduction in KV cache memory costs by over 20\\% compared with vanilla KV cache eviction and quantization methods. For instance, ThinK integrated with KIVI can achieve $2.8\\times$ peak memory reduction while maintaining nearly the same quality, enabling a batch size increase from 4$\\times$ (with KIVI alone) to 5$\\times$ when using a single GPU. Extensive evaluations on the LLaMA and Mistral models across various long-sequence datasets verified the efficiency of ThinK.  Our code has been made available at https://github.com/SalesforceAIResearch/ThinK.",
      "keywords": [
        "Large Language Models; KV Cache Compression; KV Cache Pruning"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "sPafJfwI2I",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "4Ud0pRqFto",
      "title": "MARS-VFL: A Unified Benchmark for Vertical Federated Learning with Realistic Evaluation",
      "abstract": "Vertical Federated Learning (VFL) has emerged as a critical privacy-preserving learning paradigm, enabling collaborative model training by leveraging distributed features across clients. However, due to privacy concerns, there are few publicly available real-world datasets for evaluating VFL methods, which poses significant challenges to related research. To bridge this gap, we propose MARS-VFL, a unified benchmark for realistic VFL evaluation. It integrates data from practical applications involving collaboration across different features, maintaining compatibility with the VFL setting. Based on this, we standardize the evaluation of VFL methods from the mainstream aspects of efficiency, robustness, and security. We conduct comprehensive experiments to assess different VFL approaches, providing references for unified evaluation. Furthermore, we are the first to unify the evaluation of robustness challenges in VFL and introduce a new method for addressing robustness challenges, establishing standard baselines for future research.",
      "keywords": [
        "Vertical Federated Learning",
        "Distrubuted System",
        "Benchmark"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "GkB1ZUNz83",
      "title": "FedRTS: Federated Robust Pruning via Combinatorial Thompson Sampling",
      "abstract": "Federated Learning (FL) enables collaborative model training across distributed clients without data sharing, but its high computational and communication demands strain resource-constrained devices. While existing methods use dynamic pruning to improve efficiency by periodically adjusting sparse model topologies while maintaining sparsity, these approaches suffer from issues such as **greedy adjustments**, **unstable topologies**, and **communication inefficiency**, resulting in less robust models and suboptimal performance under data heterogeneity and partial client availability. To address these challenges, we propose **Fed**erated **R**obust pruning via combinatorial **T**hompson **S**ampling (FedRTS),a novel framework designed to develop robust sparse models. FedRTS enhances robustness and performance through its Thompson Sampling-based Adjustment (TSAdj) mechanism, which uses probabilistic decisions informed by stable and farsighted information,  instead of deterministic decisions reliant on unstable and myopic information in previous methods. Extensive experiments demonstrate that FedRTS achieves state-of-the-art performance in computer vision and natural language processing tasks while reducing communication costs, particularly excelling in scenarios with heterogeneous data distributions and partial client participation. Our codes are available at: https://github.com/Little0o0/FedRTS.",
      "keywords": [
        "Federared Learning",
        "Neural Network Pruning",
        "Combinatorial Thompson Sampling"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "F20AfNqMq9",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "X9diEuva9R",
      "title": "AREAL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning",
      "abstract": "Reinforcement learning (RL) has become a trending paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and poses an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous by alternating generation and training in a batch setting, where the rollouts in each training batch are generated by the same (or latest) model. This stabilizes RL training but suffers from severe system-level inefficiency. Generation must wait until the longest output in the batch is completed before model update, resulting in GPU underutilization. We present AReaL, a fully asynchronous RL  system that completely decouples generation from training. Rollout workers in AReaL continuously generate new outputs without waiting, while training workers update the model whenever a batch of data is collected. AReaL also incorporates a collection of system-level optimizations, leading to substantially higher GPU utilization. To stabilize RL training, AReaL balances the workload of rollout and training workers to control data staleness, and adopts a staleness-enhanced PPO variant to better handle outdated training samples. Extensive experiments on math and code reasoning benchmarks show that AReaL achieves up to 2.77x training speedup compared to synchronous systems with the same number of GPUs and matched or even improved final performance. The code of AReaL is available at https://github.com/inclusionAI/AReaL/.",
      "keywords": [
        "distributed system"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "ph1V6n7BSv",
      "title": "EDELINE: Enhancing Memory in Diffusion-based World Models via Linear-Time Sequence Modeling",
      "abstract": "World models represent a promising approach for training reinforcement learning agents with significantly improved sample efficiency. While most world model methods primarily rely on sequences of discrete latent variables to model environment dynamics, this compression often neglects critical visual details essential for reinforcement learning. Recent diffusion-based world models condition generation on a fixed context length of frames to predict the next observation, using separate recurrent neural networks to model rewards and termination signals. Although this architecture effectively enhances visual fidelity, the fixed context length approach inherently limits memory capacity.\nIn this paper, we introduce EDELINE, a unified world model architecture that integrates state space models with diffusion models. Our approach outperforms existing baselines across visually challenging Atari 100k tasks, memory-demanding Crafter benchmark, and 3D first-person ViZDoom environments, demonstrating superior performance in all these diverse challenges. Code is available at https://github.com/LJH-coding/EDELINE.",
      "keywords": [
        "Model-based Reinforcement Learning",
        "Atari 100k",
        "Doom",
        "Crafter",
        "MAMBA",
        "Diffusion",
        "World Model"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "G10Y4vrhGF",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "G10Y4vrhGF",
      "title": "FedFree: Breaking Knowledge-sharing Barriers through Layer-wise Alignment in Heterogeneous Federated Learning",
      "abstract": "Heterogeneous Federated Learning (HtFL) enables collaborative learning across clients with diverse model architectures and non-IID data distributions, which are prevalent in real-world edge computing applications. Existing HtFL approaches typically employ proxy datasets to facilitate knowledge sharing or implement coarse-grained model-level knowledge transfer. However, such approaches not only elevate risks of user privacy leakage but also lead to the loss of fine-grained model-specific knowledge, ultimately creating barriers to effective knowledge sharing. To address these challenges, we propose FedFree, a novel data-free and model-free HtFL framework featuring two key innovations. First, FedFree introduces a reverse layer-wise knowledge transfer mechanism that aggregates heterogeneous client models into a global model solely using Gaussian-based pseudo data, eliminating reliance on proxy datasets. Second, it leverages Knowledge Gain Entropy (KGE) to guide targeted layer-wise knowledge alignment, ensuring that each client receives the most relevant global updates tailored to its specific architecture. We provide rigorous theoretical convergence guarantees for FedFree and conduct extensive experiments on CIFAR-10 and CIFAR-100. Results demonstrate that FedFree achieves substantial performance gains, with relative accuracy improving up to 46.3% over state-of-the-art baselines. The framework consistently excels under highly heterogeneous model/data distributions and in large scale settings.",
      "keywords": [
        "Heterogeneous Federated Learning",
        "Public-Data-Free",
        "Knowledge Gain Entropy"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "b7waOsMnq8",
      "title": "Sharp Gaussian approximations for Decentralized Federated Learning",
      "abstract": "Federated Learning has gained traction in privacy-sensitive collaborative environments, with local SGD emerging as a key optimization method in decentralized settings. While its convergence properties are well-studied, asymptotic statistical guarantees beyond convergence remain limited. In this paper, we present two generalized Gaussian approximation results for local SGD and explore their implications. First, we prove a Berry-Esseen theorem for the final local SGD iterates, enabling valid multiplier bootstrap procedures. Second, motivated by robustness considerations, we introduce two distinct time-uniform Gaussian approximations for the entire trajectory of local SGD. The time-uniform approximations support Gaussian bootstrap-based tests for detecting adversarial attacks. Extensive simulations are provided to support our theoretical results.",
      "keywords": [
        "Federated Learning",
        "Distributed Systems",
        "SGD",
        "Gaussian approximation"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "M4Laq0Y5WG",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "Lf0W2gmNBg",
      "title": "EAG3R: Event-Augmented 3D Geometry Estimation for Dynamic and Extreme-Lighting Scenes",
      "abstract": "Robust 3D geometry estimation from videos is critical for applications such as autonomous navigation, SLAM, and 3D scene reconstruction. Recent methods like DUSt3R demonstrate that regressing dense pointmaps from image pairs enables accurate and efficient pose-free reconstruction. However, existing RGB-only approaches struggle under real-world conditions involving dynamic objects and extreme illumination, due to the inherent limitations of conventional cameras. In this paper, we propose \\textbf{EAG3R}, a novel geometry estimation framework that augments pointmap-based reconstruction with asynchronous event streams. Built upon the MonST3R backbone, EAG3R introduces two key innovations: (1) a retinex-inspired image enhancement module and a lightweight event adapter with SNR-aware fusion mechanism that adaptively combines RGB and event features based on local reliability; and (2) a novel event-based photometric consistency loss that reinforces spatiotemporal coherence during global optimization. Our method enables robust geometry estimation in challenging dynamic low-light scenes without requiring retraining on night-time data. Extensive experiments demonstrate that EAG3R significantly outperforms state-of-the-art RGB-only baselines across monocular depth estimation, camera pose tracking, and dynamic reconstruction tasks.",
      "keywords": [
        "Event Camera",
        "3D Vision",
        "Neuromorphic Computing"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "M4Laq0Y5WG",
      "title": "Jasmine: Harnessing Diffusion Prior for Self-supervised Depth Estimation",
      "abstract": "In this paper, we propose \\textbf{Jasmine}, the first Stable Diffusion (SD)-based self-supervised framework for monocular depth estimation, which effectively harnesses SD’s visual priors to enhance the sharpness and generalization of unsupervised prediction. Previous SD-based methods are all supervised since adapting diffusion models for dense prediction requires high-precision supervision. In contrast, self-supervised reprojection suffers from inherent challenges (\\textit{e.g.}, occlusions, texture-less regions, illumination variance), and the predictions exhibit blurs and artifacts that severely compromise SD's latent priors. To resolve this, we construct a novel surrogate task of mix-batch image reconstruction. Without any additional supervision, it preserves the detail priors of SD models by reconstructing the images themselves while preventing depth estimation from degradation. Furthermore, to address the inherent misalignment between SD's scale and shift invariant estimation and self-supervised scale-invariant depth estimation, we build the Scale-Shift GRU. It not only bridges this distribution gap but also isolates the fine-grained texture of SD output against the interference of reprojection loss. Extensive experiments demonstrate that Jasmine achieves SoTA performance on the KITTI benchmark and exhibits superior zero-shot generalization across multiple datasets.",
      "keywords": [
        "depth estimation",
        "self-supervision",
        "diffusion models",
        "generative model"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "kVz9uvqUna",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "MVYz4GmcUH",
      "title": "Ambient Diffusion Omni: Training Good Models with Bad Data",
      "abstract": "We show how to use low-quality, synthetic, and out-of-distribution images to improve the quality of a diffusion model. Typically, diffusion models are trained on curated datasets that emerge from highly filtered data pools from the Web and other sources. We show that there is immense value in the lower-quality images that are often discarded. We present Ambient Diffusion Omni, a simple, principled framework to train diffusion models that can extract signal from arbitrarily images during training. Our framework exploits two properties of natural images -- spectral power law decay and locality. We first validate our framework by successfully training diffusion models with images synthetically corrupted by Gaussian blur, JPEG compression, and motion blur. We use our framework to achieve state-of-the-art ImageNet FID and we show significant improvements in both image quality and diversity for text-to-image generative modeling. The core insight is that noise dampens the initial skew between the desired high-quality distribution and the mixed distribution we actually observe. We provide rigorous theoretical justification for our approach by analyzing the trade-off between learning from biased data versus limited unbiased data across diffusion times.",
      "keywords": [
        "ambient diffusion",
        "diffusion models",
        "corrupted data",
        "generative AI"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "XF4JM2MTSF",
      "title": "CDFlow: Building Invertible Layers with Circulant and Diagonal Matrices",
      "abstract": "Normalizing flows are deep generative models that achieve efficient likelihood estimation and sampling through invertible transformations. A key challenge is designing linear layers that enhance expressiveness while enabling efficient computation of the Jacobian determinant and inverse. In this work, we introduce a novel invertible linear layer based on the product of circulant and diagonal matrices. This decomposition provides a parameter- and computation-efficient formulation, reducing the parameter complexity from $\\mathcal{O}(n^2)$ to $\\mathcal{O}(mn)$ by using $m$ diagonal matrices together with $m-1$ circulant matrices, while approximating arbitrary linear transformations.Furthermore, leveraging the Fast Fourier Transform (FFT), our method reduces the time complexity of matrix inversion from $\\mathcal{O}(n^{3})$ to $\\mathcal{O}(mn \\log n)$ and matrix log-determinant from $\\mathcal{O}(n^{3})$ to $\\mathcal{O}(mn)$, where $n$ is the input dimension. Building upon this, we introduce a novel normalizing flow model called Circulant-Diagonal Flow (CDFlow). Empirical results demonstrate that CDFlow excels in density estimation for natural image datasets and effectively models data with inherent periodicity. In terms of computational efficiency, our method speeds up the matrix inverse and log-determinant computations by $1.17\\times$ and $4.31\\times$, respectively, compared to the general dense matrix, when the number of channels is set to 96.",
      "keywords": [
        "Normalizing Flows; Fast Fourier Transform (FFT); Matrix Inversion; Jacobian Determinant"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "2rgYVFiWPL",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "C7BIQRM57T",
      "title": "Momentum Multi-Marginal Schrödinger Bridge Matching",
      "abstract": "Understanding complex systems by inferring trajectories from sparse sample snapshots is a fundamental challenge in a wide range of domains, e.g., single-cell biology, meteorology, and economics. Despite advancements in Bridge and Flow matching frameworks, current methodologies rely on pairwise interpolation between adjacent snapshots. This hinders their ability to capture long-range temporal dependencies and potentially affects the coherence of the inferred trajectories. To address these issues, we introduce Momentum Multi-Marginal Schrödinger Bridge Matching (3MSBM), a novel matching framework that learns smooth measure-valued splines for stochastic systems that satisfy multiple positional constraints. This is achieved by lifting the dynamics to phase space and generalizing stochastic bridges to be conditioned on several points, forming a multi-marginal conditional stochastic optimal control problem. The underlying dynamics are then learned by minimizing a variational objective, having fixed the path induced by the multi-marginal conditional bridge. As a matching approach, 3MSBM learns transport maps that preserve intermediate marginals throughout training, significantly improving convergence and scalability. Extensive experimentation in a series of real-world applications validates the superior performance of 3MSBM compared to existing methods in capturing complex dynamics with temporal dependencies, opening new avenues for training matching frameworks in multi-marginal settings.",
      "keywords": [
        "Diffusion models",
        "Schrödinger bridge",
        "Distribution matching",
        "Trajectory Inference"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "rMhQBlhh4c",
      "title": "Adjoint Schrödinger Bridge Sampler",
      "abstract": "Computational methods for learning to sample from the Boltzmann distribution—where the target distribution is known only up to an unnormalized energy function—have advanced significantly recently. Due to the lack of explicit target samples, however, prior diffusion-based methods, known as _diffusion samplers_, often require importance-weighted estimation or complicated learning processes. Both trade off scalability with extensive evaluations of the energy and model, thereby limiting their practical usage. In this work, we propose **Adjoint Schrödinger Bridge Sampler (ASBS)**, a new diffusion sampler that employs simple and scalable matching-based objectives yet without the need to estimate target samples during training. ASBS is grounded on a mathematical model—the Schrödinger Bridge—which enhances sampling efficiency via kinetic-optimal transportation. Through a new lens of stochastic optimal control theory, we demonstrate how SB-based diffusion samplers can be learned at scale via Adjoint Matching and prove convergence to the global solution. Notably, ASBS generalizes the recent Adjoint Sampling (Havens et al., 2025) to arbitrary source distributions by relaxing the so-called memoryless condition that largely restricts the design space. Through extensive experiments, we demonstrate the effectiveness of ASBS on sampling from classical energy functions, amortized conformer generation, and molecular Boltzmann distributions. Codes are available at https://github.com/facebookresearch/adjoint_samplers",
      "keywords": [
        "Boltzmann distribution",
        "diffusion sampler",
        "Schrödinger bridge"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "reZKq6hjOZ",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "reZKq6hjOZ",
      "title": "Broadening Target Distributions for Accelerated Diffusion Models via a Novel Analysis Approach",
      "abstract": "Accelerated diffusion models hold the potential to significantly enhance the efficiency of standard diffusion processes. Theoretically, these models have been shown to achieve faster convergence rates than the standard $\\mathcal O(1/\\epsilon^2)$ rate of vanilla diffusion models, where $\\epsilon$ denotes the target accuracy. However, current theoretical studies have established the acceleration advantage only for restrictive target distribution classes, such as those with smoothness conditions imposed along the entire sampling path or with bounded support. In this work, we significantly broaden the target distribution classes with a new accelerated stochastic DDPM sampler. In particular, we show that it achieves accelerated performance for three broad distribution classes not considered before. Our first class relies on the smoothness condition posed only to the target density $q_0$, which is far more relaxed than the existing smoothness conditions posed to all $q_t$ along the entire sampling path. Our second class requires only a finite second moment condition, allowing for a much wider class of target distributions than the existing finite-support condition. Our third class is Gaussian mixture, for which our result establishes the first acceleration guarantee. Moreover, among accelerated DDPM type samplers, our results specialized for bounded-support distributions show an improved dependency on the data dimension $d$. Our analysis introduces a novel technique for establishing performance guarantees via constructing a tilting factor representation of the convergence error and utilizing Tweedie's formula to handle Taylor expansion terms. This new analytical framework may be of independent interest.",
      "keywords": [
        "generative models",
        "denoising diffusion probabilistic model (DDPM)",
        "convergence analysis",
        "accelerated methods"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "vNZIePda08",
      "title": "Sparse-to-Sparse Training of Diffusion Models",
      "abstract": "Diffusion models (DMs) are a powerful type of generative models that have achieved state-of-the-art results in various image synthesis tasks and have shown  potential in other domains, such as natural language processing and temporal data modeling. Despite their stable training dynamics and ability to produce diverse high-quality samples, DMs are notorious for requiring significant computational resources, both in the training and inference stages. Previous work has focused mostly on increasing the efficiency of model inference. This paper introduces, for the first time, the paradigm of sparse-to-sparse training to DMs, with the aim of improving both training and inference efficiency. We focus on unconditional generation and train sparse DMs from scratch (Latent Diffusion and ChiroDiff) on six datasets using three different methods (Static-DM, RigL-DM, and MagRan-DM) to study the effect of sparsity in model performance. Our experiments show that sparse DMs are able to match and sometimes outperform their Dense counterparts, while substantially reducing the number of trainable parameters and FLOPs. We also identify safe and effective values to perform sparse-to-sparse training of DMs.",
      "keywords": [
        "Diffusion Models",
        "Sparse-to-Sparse Training",
        "Static Sparse Training",
        "Dynamic Sparse Training"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "NgLFQTBPRR",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "dCWSVAbWXM",
      "title": "Fence off Anomaly Interference: Cross-Domain Distillation for Fully Unsupervised Anomaly Detection",
      "abstract": "Fully Unsupervised Anomaly Detection (FUAD) is a practical extension of Unsupervised Anomaly Detection (UAD), aiming to detect anomalies without any labels even when the training set may contain anomalous samples. To achieve FUAD,  we pioneer the introduction of Knowledge Distillation (KD) paradigm based on teacher–student framework into the FUAD setting. However, due to the presence of anomalies in the training data, traditional KD methods risk enabling the student to learn the teacher’s representation of anomalies under FUAD setting, thereby resulting in poor anomaly detection performance. To address this issue, we propose a novel Cross-Domain Distillation (CDD) framework based on the widely studied reverse distillation (RD) paradigm. Specifically, we design a Domain-Specific Training, which divides the training set into multiple domains with lower anomaly ratios and train a domain-specific student for each. Cross-Domain Knowledge Aggregation is then performed, where pseudo-normal features generated by domain-specific students collaboratively guide a global student to learn generalized normal representations across all samples. Experimental results on noisy versions of the MVTec AD and VisA datasets demonstrate that our method achieves significant performance improvements over the baseline, validating its effectiveness under FUAD setting.",
      "keywords": [
        "Anomaly Detection",
        "Unsupervised Learning",
        "Knowledge Distillation"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "pKg5zKIEuV",
      "title": "Quantifying Statistical Significance of Deep Nearest Neighbor Anomaly Detection via Selective Inference",
      "abstract": "In real-world applications, anomaly detection (AD) often operates without access to anomalous data, necessitating semi-supervised methods that rely solely on normal data.\nAmong these methods, deep $k$-nearest neighbor (deep $k$NN) AD stands out for its interpretability and flexibility, leveraging distance-based scoring in deep latent spaces.\nDespite its strong performance, deep $k$NN lacks a mechanism to quantify uncertainty—an essential feature for critical applications such as industrial inspection.\nTo address this limitation, we propose a statistical framework that quantifies the  significance of detected anomalies in the form of $p$-values, thereby enabling control over false positive rates at a user-specified significance level (e.g.,0.05).\nA central challenge lies in managing selection bias, which we tackle using Selective Inference—a principled method for conducting inference conditioned on data-driven selections.\nWe evaluate our method on diverse datasets and demonstrate that it provides reliable AD well-suited for industrial use cases.",
      "keywords": [
        "Anomaly Detection",
        "k-Nearest Neighbors",
        "Statistical Test",
        "Selective Inference",
        "Deep Learning"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "MFZjrTFE7h",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "MFZjrTFE7h",
      "title": "D-FINE: Redefine Regression Task of DETRs as Fine-grained Distribution Refinement",
      "abstract": "We introduce D-FINE, a powerful real-time object detector that achieves outstanding localization precision by redefining the bounding box regression task in DETR models. D-FINE comprises two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD). FDR transforms the regression process from predicting fixed coordinates to iteratively refining probability distributions, providing a fine-grained intermediate representation that significantly enhances localization accuracy. GO-LSD is a bidirectional optimization strategy that transfers localization knowledge from refined distributions to shallower layers through self-distillation, while also simplifying the residual prediction tasks for deeper layers. Additionally, D-FINE incorporates lightweight optimizations in computationally intensive modules and operations, achieving a better balance between speed and accuracy. Specifically, D-FINE-L / X achieves 54.0% / 55.8% AP on the COCO dataset at 124 / 78 FPS on an NVIDIA T4 GPU. When pretrained on Objects365, D-FINE-L / X attains 57.1% / 59.3% AP, surpassing all existing real-time detectors. Furthermore, our method significantly enhances the performance of a wide range of DETR models by up to 5.3% AP with negligible extra parameters and training costs. Our code and models: https://github.com/Peterande/D-FINE.",
      "keywords": [
        "Object Detection",
        "Real-Time",
        "Detection Transformer",
        "Knowledge Distillation"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "ziw5bzg2NO",
      "title": "Do You Keep an Eye on What I Ask? Mitigating Multimodal Hallucination via Attention-Guided Ensemble Decoding",
      "abstract": "Recent advancements in Large Vision-Language Models (LVLMs) have significantly expanded their utility in tasks like image captioning and visual question answering. However, they still struggle with object hallucination, where models generate descriptions that inaccurately reflect the visual content by including nonexistent objects or misrepresenting existing ones. While previous methods, such as data augmentation and training-free approaches, strive to tackle this issue, they still encounter scalability challenges and often depend on additional external modules. In this work, we propose Ensemble Decoding (ED), a novel strategy that splits the input image into sub-images and combines logit distributions by assigning weights through the attention map. Furthermore, we introduce ED adaptive plausibility constraint to calibrate logit distribution and FastED, a variant designed for speed-critical applications. Extensive experiments across hallucination benchmarks demonstrate that our proposed method achieves state-of-the-art performance, validating the effectiveness of our approach.",
      "keywords": [
        "Hallucination",
        "Multimodal Hallucination",
        "Large Vision-Language Model"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "XO9fhSZkBh",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "Jsln9ZyMl4",
      "title": "The Cost of Robustness: Tighter Bounds on Parameter Complexity for Robust Memorization in ReLU Nets",
      "abstract": "We study the parameter complexity of robust memorization for ReLU networks: the number of parameters required to interpolate any dataset with $\\epsilon$-separation between differently labeled points, while ensuring predictions remain consistent within a $\\mu$-ball around each training example. We establish upper and lower bounds on the parameter count as a function of the robustness ratio $\\rho = \\mu / \\epsilon$. Unlike prior work, we provide a fine-grained analysis across the entire range $\\rho \\in (0,1)$ and obtain tighter upper and lower bounds that improve upon existing results. Our findings reveal that the parameter complexity of robust memorization matches that of non-robust memorization when $\\rho$ is small, but grows with increasing $\\rho$. As a special case, when the input dimension is comparable to or exceeds the dataset size, our bounds become tight (up to logarithmic factors) across the entire range of $\\rho$.",
      "keywords": [
        "Robust memorization",
        "Memorization",
        "Adversarial training",
        "Parameter Complexity"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "wh3p37VYm2",
      "title": "Mechanistic Insights into Grokking from the Embedding Layer",
      "abstract": "Grokking, a delayed generalization in neural networks after perfect training performance, has been observed in Transformers and MLPs, but the components driving it remain underexplored. We show that embeddings are central to grokking: introducing them into MLPs induces delayed generalization in modular arithmetic tasks, whereas MLPs without embeddings can generalize immediately. Our analysis identifies two key mechanisms: (1) Embedding update dynamics, where rare tokens stagnate due to sparse gradient updates and weight decay, and (2) Bilinear coupling, where the interaction between embeddings and downstream weights introduces saddle points and increases sensitivity to initialization.  \nTo confirm these mechanisms, we investigate frequency-aware sampling, which balances token updates by minimizing gradient variance, and embedding-specific learning rates, derived from the asymmetric curvature of the bilinear loss landscape. We prove that an adaptive learning rate ratio, \\(\\frac{\\eta_E}{\\eta_W} \\propto \\frac{\\sigma_{\\max}(E)}{\\sigma_{\\max}(W)} \\cdot \\frac{f_W}{f_E}\\), mitigates bilinear coupling effects, accelerating convergence. Our methods not only improve grokking dynamics but also extend to broader challenges in Transformer optimization, where bilinear interactions hinder efficient training.",
      "keywords": [
        "Embedding learning",
        "Token frequencey",
        "Coupled system"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "pTeOOKnjGM",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "8enWnd6Gp3",
      "title": "TetSphere Splatting: Representing High-Quality Geometry with Lagrangian Volumetric Meshes",
      "abstract": "We introduce TetSphere Splatting, a Lagrangian geometry representation designed for high-quality 3D shape modeling. TetSphere splatting leverages an underused yet powerful geometric primitive -- volumetric tetrahedral meshes. It represents 3D shapes by deforming a collection of tetrahedral spheres, with geometric regularizations and constraints that effectively resolve common mesh issues such as irregular triangles, non-manifoldness, and floating artifacts. Experimental results on multi-view and single-view reconstruction highlight TetSphere splatting's superior mesh quality while maintaining competitive reconstruction accuracy compared to state-of-the-art methods. Additionally, TetSphere splatting demonstrates versatility by seamlessly integrating into generative modeling tasks, such as image-to-3D and text-to-3D generation.",
      "keywords": [
        "geometry representation",
        "3D modeling"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "pTeOOKnjGM",
      "title": "TEASER: Token Enhanced Spatial Modeling for Expressions Reconstruction",
      "abstract": "3D facial reconstruction from a single in-the-wild image is a crucial task in human-centered computer vision tasks. While existing methods can recover accurate facial shapes, there remains significant space for improvement in fine-grained expression capture.  Current approaches struggle with irregular mouth shapes, exaggerated expressions, and asymmetrical facial movements. We present TEASER (Token EnhAnced Spatial modeling for Expressions Reconstruction), which addresses these challenges and enhances 3D facial geometry performance⁠⁠. TEASER tackles two main limitations of existing methods: insufficient photometric loss for self-reconstruction and inaccurate localization of subtle expressions. We introduce a multi-scale tokenizer to extract facial appearance information. Combined with a neural renderer, these tokens provide precise geometric guidance for expression reconstruction. Furthermore, TEASER incorporates a pose-dependent landmark loss to further improve geometric performance⁠. Our approach not only significantly enhances expression reconstruction quality but also offers interpretable tokens suitable for various downstream applications, such as photorealistic facial video driving, expression transfer, and identity swapping. Quantitative and qualitative experimental results across multiple datasets demonstrate that TEASER achieves state-of-the-art performance in precise expression reconstruction.",
      "keywords": [
        "Expression reconstruction",
        "Hybrid parameters"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "UWdPsY7agk",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "UWdPsY7agk",
      "title": "Efficient Causal Decision Making with One-sided Feedback",
      "abstract": "We study a class of decision-making problems with one-sided feedback, where outcomes are only observable for specific actions. A typical example is bank loans, where the repayment status is known only if a loan is approved and remains undefined if rejected. In such scenarios, conventional approaches to causal decision evaluation and learning from observational data are not directly applicable. In this paper, we introduce a novel value function to evaluate decision rules that addresses the issue of undefined counterfactual outcomes. Without assuming no unmeasured confounders, we establish the identification of the value function using shadow variables. Furthermore, leveraging semiparametric theory, we derive the efficiency bound for the proposed value function and develop efficient methods for decision evaluation and learning. Numerical experiments and a real-world data application demonstrate the empirical performance of our proposed methods.",
      "keywords": [
        "semiparametric efficiency",
        "one-sided feedback",
        "causal decision making"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "stUKwWBuBm",
      "title": "Tractable Multi-Agent Reinforcement Learning through Behavioral Economics",
      "abstract": "A significant roadblock to the development of principled multi-agent reinforcement learning (MARL) algorithms is the fact that desired solution concepts like Nash equilibria may be intractable to compute. We show how one can overcome this obstacle by introducing concepts from behavioral economics into MARL. To do so, we imbue agents with two key features of human decision-making: risk aversion and bounded rationality. We show that introducing these two properties into games gives rise to a class of equilibria---risk-averse quantal response equilibria (RQE)---which are tractable to compute in \\emph{all} $n$-player matrix and finite-horizon Markov games.  In particular, we show that they emerge as the endpoint of no-regret learning in suitably adjusted versions of the games. Crucially, the class of computationally tractable RQE is independent of the underlying game structure and only depends on agents' degrees of risk-aversion and bounded rationality.  To validate the expressivity of this class of solution concepts we show that it captures peoples' patterns of play in a number of 2-player matrix games previously studied in experimental economics. Furthermore, we give a first analysis of the sample complexity of computing these equilibria in finite-horizon Markov games when one has access to a generative model. We validate our findings on a simple multi-agent reinforcement learning benchmark. Our results open the doors for to the principled development of new decentralized multi-agent reinforcement learning algorithms.",
      "keywords": [
        "behavioral economics",
        "risk-aversion",
        "multi-agent reinforcement learning",
        "quantal response",
        "bounded rationality"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "xVI8g50Qfk",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "KkOMqJQiWU",
      "title": "Meta-learning local learning rules for structured credit assignment with sparse feedback",
      "abstract": "Biological neural networks can learn complex behaviors from sparse, delayed feedback using local synaptic plasticity, yet the mechanisms enabling structured credit assignment remain elusive. In contrast, artificial recurrent networks solving similar tasks typically rely on biologically implausible global learning rules or hand-crafted local updates. The space of local plasticity rules capable of supporting learning from delayed reinforcement remains largely unexplored. Here, we present a meta-learning framework that discovers local learning rules for structured credit assignment in recurrent networks trained with sparse feedback. Our approach interleaves local neo-Hebbian-like updates during task execution with an outer loop that optimizes plasticity parameters via **backpropagation through learning**. The resulting three-factor learning rules enable long-timescale credit assignment using only local information and delayed rewards, offering new insights into biologically grounded mechanisms for learning in recurrent circuits.",
      "keywords": [
        "Biologically Plausible Deep Networks",
        "Plasticity and Adaptation",
        "Recurrent Networks",
        "Reinforcement Learning (Cognitive/Neuroscience)"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "bd8kppxyB3",
      "title": "Revisiting Glorot Initialization for Long-Range Linear Recurrences",
      "abstract": "Proper initialization is critical for Recurrent Neural Networks (RNNs), particularly in long-range reasoning tasks, where repeated application of the same weight matrix can cause vanishing or exploding signals.\nA common baseline for linear recurrences is Glorot initialization, designed to ensure stable signal propagation---but derived under the infinite-width, fixed-length regime—an unrealistic setting for RNNs processing long sequences. In this work, we show that Glorot initialization is in fact unstable: small positive deviations in the spectral radius are amplified through time and cause the hidden state to explode. Our theoretical analysis demonstrates that sequences of length $t = O(\\sqrt{n})$, where $n$ is the hidden width, are sufficient to induce instability. To address this, we propose a simple, dimension-aware rescaling of Glorot that shifts the spectral radius slightly below one, preventing rapid signal explosion or decay. These results suggest that standard initialization schemes may break down in the long-sequence regime, motivating a separate line of theory for stable recurrent initialization.",
      "keywords": [
        "Recurrent Networks",
        "Initialization",
        "Signal Propagation"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "EoebmBe9fG",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "AQ21krZgax",
      "title": "Formal Models of Active Learning from Contrastive Examples",
      "abstract": "Machine learning can greatly benefit from providing learning algorithms with pairs of contrastive training examples---typically pairs of instances that differ only slightly, yet have different class labels. Intuitively, the difference in the instances helps explain the difference in the class labels. This paper proposes a theoretical framework in which the effect of various types of contrastive examples on active learners is studied formally. The focus is on the sample complexity of learning concept classes and how it is influenced by the choice of contrastive examples. We illustrate our results with geometric concept classes and classes of Boolean functions. Interestingly, we reveal a connection between learning from contrastive examples and the classical model of self-directed learning.",
      "keywords": [
        "membership queries",
        "self-directed learning",
        "learning boolean functions",
        "learning from contrastive examples"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "EoebmBe9fG",
      "title": "Optimal Mistake Bounds for Transductive Online Learning",
      "abstract": "We resolve a 30-year-old open problem concerning the power of unlabeled data in online learning by tightly quantifying the gap between transductive and standard online learning. We prove that for every concept class $\\mathcal{H}$ with Littlestone dimension $d$, the transductive mistake bound is at least $\\Omega(\\sqrt{d})$. This establishes an exponential improvement over previous lower bounds of $\\Omega(\\log \\log d)$, $\\Omega(\\sqrt{\\log d})$, and $\\Omega(\\log d)$, respectively due to Ben-David, Kushilevitz, and Mansour (1995, 1997) and Hanneke, Moran, and Shafer (2023). We also show that our bound is tight: for every $d$, there exists a class of Littlestone dimension $d$ with transductive mistake bound $O(\\sqrt{d})$. Our upper bound also improves the previous best known upper bound of $(2/3) \\cdot d$ from Ben-David et al. (1997). These results demonstrate a quadratic gap between transductive and standard online learning, thereby highlighting the benefit of advanced access to the unlabeled instance sequence. This stands in stark contrast to the PAC setting, where transductive and standard learning exhibit similar sample complexities.",
      "keywords": [
        "Online Learning"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "zbIS2r0t0F",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "60i0ksMAhd",
      "title": "BlendRL: A Framework for Merging Symbolic and Neural Policy Learning",
      "abstract": "Humans can leverage both symbolic reasoning and intuitive responses. In contrast, reinforcement learning policies are typically encoded in either opaque systems like neural networks or symbolic systems that rely on predefined symbols and rules. This disjointed approach severely limits the agents’ capabilities, as they often lack either the flexible low-level reaction characteristic of neural agents or the interpretable reasoning of symbolic agents. \n\nTo overcome this challenge, we introduce *BlendRL*, a neuro-symbolic RL framework that harmoniously integrates both paradigms. \nWe empirically demonstrate that BlendRL agents outperform both neural and symbolic baselines in standard Atari environments, and showcase their robustness to environmental changes. Additionally, we analyze the interaction between neural and symbolic policies, illustrating how their hybrid use helps agents overcome each other's limitations.",
      "keywords": [
        "Neuro-Symbolic AI",
        "Differentiable Reasoning",
        "Reinforcement Learning",
        "Interpretable AI",
        "First-order logic"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "We5z3UEnUY",
      "title": "Stable Hadamard Memory: Revitalizing Memory-Augmented Agents for Reinforcement Learning",
      "abstract": "Effective decision-making in partially observable environments demands robust memory management. Despite their success in supervised learning, current deep-learning memory models struggle in reinforcement learning environments that are partially observable and long-term. They fail to efficiently capture relevant past information, adapt flexibly to changing observations, and maintain stable updates over long episodes. We theoretically analyze the limitations of existing memory models within a unified framework and introduce the Stable Hadamard Memory, a novel memory model for reinforcement learning agents. Our model dynamically adjusts memory by erasing no longer needed experiences and reinforcing crucial ones computationally efficiently. To this end, we leverage the Hadamard product for calibrating and updating memory, specifically designed to enhance memory capacity while mitigating numerical and learning challenges. Our approach significantly outperforms state-of-the-art memory-based methods on challenging partially observable benchmarks, such as meta-reinforcement learning, long-horizon credit assignment, and POPGym, demonstrating superior performance in handling long-term and evolving contexts.",
      "keywords": [
        "Reinforcement Learning",
        "Memory",
        "POMDP"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "Tk5nQnTGmP",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "YTbLri0siT",
      "title": "Spike-timing-dependent Hebbian learning as noisy gradient descent",
      "abstract": "Hebbian learning is a key principle underlying learning in biological neural networks. We relate a Hebbian spike-timing-dependent plasticity rule to noisy gradient descent with respect to a non-convex loss function on the probability simplex. Despite the constant injection of noise and the non-convexity of the underlying optimization problem, one can rigorously prove that the considered Hebbian learning dynamic identifies the presynaptic neuron with the highest activity and that the convergence is exponentially fast in the number of iterations. This is non-standard and surprising as typically noisy gradient descent with fixed noise level only converges to a stationary regime where the noise causes the dynamic to fluctuate around a minimiser.",
      "keywords": [
        "Biological neural networks",
        "Hebbian learning",
        "Spike-timing-dependent plasticity",
        "Noisy gradient descent",
        "Mirror descent"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "vWaMUMrBpF",
      "title": "Inconsistency-Aware Minimization: Improving Generalization with Unlabeled Data",
      "abstract": "Accurately estimating the generalization gap and devising optimization methods that generalize better are crucial for deep learning models, particularly in both theoretical understanding and practical applications. The ability to leverage unlabeled data for these purposes offers significant advantages in real-world scenarios. This paper introduces a novel generalization measure, termed $\\textit{local inconsistency}$, developed from an information-geometric perspective of the neural network's parameter space; a key feature is its computability from unlabeled data. We establish its theoretical underpinnings by connecting local inconsistency to the Fisher Information Matrix (FIM) and the loss Hessian. Empirically, we demonstrate that local inconsistency not only correlates with the generalization gap but also exhibits characteristics comparable to $\\textit{sharpness}$. Based on these findings, we propose Inconsistency-Aware Minimization (IAM), a regularization strategy that incorporates local inconsistency. We demonstrate that in standard supervised learning settings, IAM enhances generalization, achieving performance comparable to existing methods such as Sharpness-Aware Minimization (SAM). Furthermore, IAM exhibits notable efficacy in semi-supervised learning scenarios, where the local inconsistency regularizer is computed from the unlabeled data portion to further improve model performance.",
      "keywords": [
        "Generalization",
        "Regularization",
        "Training Method",
        "Deep Learning",
        "Inconsistency"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "ziLHIExi1j",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "QtnCPZMxYg",
      "title": "Trajectory Graph Learning: Aligning with Long Trajectories in Reinforcement Learning Without Reward Design",
      "abstract": "Reinforcement learning (RL) often relies on manually designed reward functions, which are difficult to specify and can lead to issues such as reward hacking and suboptimal behavior. Alternatives like inverse RL and preference-based RL attempt to infer surrogate rewards from demonstrations or preferences but suffer from ambiguity and distribution mismatch. A more direct approach, inspired by imitation learning, avoids reward modeling by leveraging expert demonstrations. However, most existing methods align actions only at individual states, failing to capture the coherence of long-horizon trajectories.\n\nIn this work, we study the problem of directly aligning policies with expert-labeled trajectories to preserve long-horizon behavior without relying on reward signals. Specifically, we aim to learn a policy that maximizes the probability of generating the expert trajectories. Nevertheless, we prove that, in its general form, this trajectory alignment problem is NP-complete. \nTo address this, we propose Trajectory Graph Learning (TGL), a framework that leverages structural assumptions commonly satisfied in practice—such as bounded realizability of expert trajectories or a tree-structured MDP. These enable a graph-based policy planning algorithm that computes optimal policies in polynomial time under known dynamics. For settings with unknown dynamics, we develop a sample-efficient algorithm based on UCB-style exploration and establish sub-linear regret. Experiments on grid-world tasks demonstrate that TGL substantially outperforms standard imitation learning methods for long-trajectory planning.",
      "keywords": [
        "Reinforcement Learning",
        "Trajectory Alignment",
        "Trajectory Graph Learning"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "dAAz7afWJR",
      "title": "Robust and Scalable Autonomous Reinforcement Learning in Irreversible Environments",
      "abstract": "Reinforcement learning (RL) typically assumes repetitive resets to provide an agent with diverse and unbiased experiences. These resets require significant human intervention and result in poor training efficiency in real-world settings. Autonomous RL (ARL) addresses this challenge by jointly training forward and reset policies. While recent ARL algorithms have shown promise in reducing human intervention, they assume narrow support over the distributions of initial or goal states and rely on task-specific knowledge to identify irreversible states. In this paper, we propose a robust and scalable ARL algorithm, called RSA, that enables an agent to handle diverse initial and goal states and to avoid irreversible states without task-specific knowledge. RSA generates a curriculum by identifying informative states based on the learning progress of an agent. We hypothesize that informative states are neither overly difficult nor trivially easy for the agent being trained. To detect and avoid irreversible states without task-specific knowledge, RSA encodes the behaviors exhibited in those states rather than the states themselves. Experimental results demonstrate that RSA outperforms existing ARL algorithms with fewer manual resets in both reversible and irreversible environments.",
      "keywords": [
        "Reinforcement Learning",
        "Autonomous Reinforcement Learning"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "k38Th3x4d9",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "k38Th3x4d9",
      "title": "Root Cause Analysis of Anomalies in Multivariate Time Series through Granger Causal Discovery",
      "abstract": "Identifying the root causes of anomalies in multivariate time series is challenging due to the complex dependencies among the series. In this paper, we propose a comprehensive approach called AERCA that inherently integrates Granger causal discovery with root cause analysis. By defining anomalies as interventions on the exogenous variables of time series, AERCA not only learns the Granger causality among time series but also explicitly models the distributions of exogenous variables under normal conditions. AERCA then identifies the root causes of anomalies by highlighting exogenous variables that significantly deviate from their normal states. Experiments on multiple synthetic and real-world datasets demonstrate that AERCA can accurately capture the causal relationships among time series and effectively identify the root causes of anomalies.",
      "keywords": [
        "root cause analysis",
        "Granger causality",
        "multivariate time series"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "m08aK3xxdJ",
      "title": "CATCH: Channel-Aware Multivariate Time Series Anomaly Detection via Frequency Patching",
      "abstract": "Anomaly detection in multivariate time series is challenging as heterogeneous subsequence anomalies may occur. Reconstruction-based methods, which focus on learning normal patterns in the frequency domain to detect diverse abnormal subsequences, achieve promising results, while still falling short on capturing fine-grained frequency characteristics and channel correlations. To contend with the limitations, we introduce CATCH, a framework based on frequency patching. We propose to patchify the frequency domain into frequency bands, which enhances its ability to capture fine-grained frequency characteristics. To perceive appropriate channel correlations, we propose a Channel Fusion Module (CFM), which features a patch-wise mask generator and a masked-attention mechanism. Driven by a bi-level multi-objective optimization algorithm, the CFM is encouraged to iteratively discover appropriate patch-wise channel correlations, and to cluster relevant channels while isolating adverse effects from irrelevant channels. Extensive experiments on 10 real-world datasets and 12 synthetic datasets demonstrate that CATCH achieves state-of-the-art performance. We make our code and datasets available at https://github.com/decisionintelligence/CATCH.",
      "keywords": [
        "Multivariate Time Series",
        "Anomaly Detection"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "XO9fhSZkBh",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "Jsln9ZyMl4",
      "title": "The Cost of Robustness: Tighter Bounds on Parameter Complexity for Robust Memorization in ReLU Nets",
      "abstract": "We study the parameter complexity of robust memorization for ReLU networks: the number of parameters required to interpolate any dataset with $\\epsilon$-separation between differently labeled points, while ensuring predictions remain consistent within a $\\mu$-ball around each training example. We establish upper and lower bounds on the parameter count as a function of the robustness ratio $\\rho = \\mu / \\epsilon$. Unlike prior work, we provide a fine-grained analysis across the entire range $\\rho \\in (0,1)$ and obtain tighter upper and lower bounds that improve upon existing results. Our findings reveal that the parameter complexity of robust memorization matches that of non-robust memorization when $\\rho$ is small, but grows with increasing $\\rho$. As a special case, when the input dimension is comparable to or exceeds the dataset size, our bounds become tight (up to logarithmic factors) across the entire range of $\\rho$.",
      "keywords": [
        "Robust memorization",
        "Memorization",
        "Adversarial training",
        "Parameter Complexity"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "bP5cU0OYSn",
      "title": "Fast Projection-Free Approach (without Optimization Oracle) for Optimization over Compact Convex Set",
      "abstract": "Projection-free first-order methods, e.g., the celebrated Frank-Wolfe (FW) algorithms, have emerged as powerful tools for optimization over simple convex sets such as polyhedra, because of their scalability, fast convergence, and iteration-wise feasibility without costly projections. \nHowever, extending these methods effectively to general compact convex sets remains challenging and largely open, as FW methods rely on expensive linear optimization oracles (LOO), while penalty-based methods often struggle with poor feasibility. \nWe tackle this open challenge by presenting **Hom-PGD**, a novel projection-free method without expensive (optimization) oracles. \nOur method constructs a homeomorphism between the convex constraint set and a unit ball, transforming the original problem into an equivalent ball-constrained formulation, thus enabling efficient gradient-based optimization while preserving the original problem structure. \nWe prove that Hom-PGD attains *optimal* convergence rates matching gradient descent with constant step-size to find an $\\epsilon$-approximate (stationary) solution: $\\mathcal{O}(\\log (1/\\epsilon))$ for strongly convex objectives, $\\mathcal{O}(\\epsilon^{-1})$ for convex objectives, \nand $\\mathcal{O}(\\epsilon^{-2})$ for non-convex objectives. \nMeanwhile, Hom-PGD enjoys a low per-iteration complexity of $\\mathcal{O}(n^2)$, without expensive oracles like LOO or projection, where $n$ is the input size. \nOur framework further extends to certain non-convex sets, broadening its applicability in practical optimization scenarios with complex constraints. Extensive numerical experiments demonstrate that Hom-PGD achieves comparable convergence rates to state-of-the-art projection-free methods, while significantly reducing per-iteration runtime  (up to 5 orders of magnitude faster) and thus the total problem-solving time.",
      "keywords": [
        "projection-free",
        "homeomorphism",
        "gauge mapping",
        "constrained optimization",
        "convex set"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "cR5GTis5II",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "cR5GTis5II",
      "title": "eQMARL: Entangled Quantum Multi-Agent Reinforcement Learning for Distributed Cooperation over Quantum Channels",
      "abstract": "Collaboration is a key challenge in distributed multi-agent reinforcement learning (MARL) environments. Learning frameworks for these decentralized systems must weigh the benefits of explicit player coordination against the communication overhead and computational cost of sharing local observations and environmental data. Quantum computing has sparked a potential synergy between quantum entanglement and cooperation in multi-agent environments, which could enable more efficient distributed collaboration with minimal information sharing. This relationship is largely unexplored, however, as current state-of-the-art quantum MARL (QMARL) implementations rely on classical information sharing rather than entanglement over a quantum channel as a coordination medium. In contrast, in this paper, a novel framework dubbed entangled QMARL (eQMARL) is proposed. The proposed eQMARL is a distributed actor-critic framework that facilitates cooperation over a quantum channel and eliminates local observation sharing via a quantum entangled split critic. Introducing a quantum critic uniquely spread across the agents allows coupling of local observation encoders through entangled input qubits over a quantum channel, which requires no explicit sharing of local observations and reduces classical communication overhead. Further, agent policies are tuned through joint observation-value function estimation via joint quantum measurements, thereby reducing the centralized computational burden. Experimental results show that eQMARL with $\\Psi^{+}$ entanglement converges to a cooperative strategy up to $17.8\\\\%$ faster and with a higher overall score compared to split classical and fully centralized classical and quantum baselines. The results also show that eQMARL achieves this performance with a constant factor of $25$-times fewer centralized parameters compared to the split classical baseline.",
      "keywords": [
        "quantum machine learning",
        "multi-agent reinforcement learning",
        "quantum entanglement"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "kBybSUskz7",
      "title": "Reinforcement Learning and Heuristics for Hardware-Efficient Constrained Code Design",
      "abstract": "Constrained codes enhance reliability in high-speed communication systems and optimize bit efficiency when working with non-binary data representations (e.g., three-level ternary symbols).  A key challenge in their design is minimizing the hardware complexity of the translation logic that encodes and decodes data. We introduce a reinforcement learning (RL)-based framework, augmented by a custom L1 similarity-based heuristic, to design hardware-efficient translation logic, navigating the vast solution space of codeword assignments. By modeling the task as a bipartite graph matching problem and using logic synthesis tools to evaluate hardware complexity, our RL approach outperforms human-derived solutions and generalizes to various code types. Finally, we analyze the learned policies to extract insights into high-performing strategies.",
      "keywords": [
        "reinforcement learning",
        "bipartite matching",
        "GNN",
        "combinatorial optimization",
        "feature engineering",
        "hardware design optimization",
        "logic synthesis"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "RcNzwKrjTo",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "Nfd7z9d6Bb",
      "title": "Probabilistic Conformal Prediction with Approximate Conditional Validity",
      "abstract": "We develop a new method for generating prediction sets that combines the flexibility of conformal methods with an estimate of the conditional distribution $\\textup{P}_{Y \\mid X}$. Existing methods, such as conformalized quantile regression and probabilistic conformal prediction, usually provide only a marginal coverage guarantee. In contrast, our approach extends these frameworks to achieve approximately conditional coverage, which is crucial for many practical applications. Our prediction sets adapt to the behavior of the predictive distribution, making them effective even under high heteroscedasticity. While exact conditional guarantees are infeasible without assumptions on the underlying data distribution, we derive non-asymptotic bounds that depend on the total variation distance of the conditional distribution and its estimate. Using extensive simulations, we show that our method consistently outperforms existing approaches in terms of conditional coverage, leading to more reliable statistical inference in a variety of applications.",
      "keywords": [
        "Conformal Prediction",
        "Conditional coverage",
        "Probabilistic method",
        "Uncertainty Quantification"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "RcNzwKrjTo",
      "title": "Conformal Prediction Sets with Improved Conditional Coverage using Trust Scores",
      "abstract": "Standard conformal prediction offers a marginal guarantee on coverage, but for prediction sets to be truly useful, they should ideally ensure coverage conditional on each test point. However, it is impossible to achieve exact, distribution-free conditional coverage in finite samples. In this work, we propose an alternative conformal prediction algorithm that targets coverage where it matters most---in instances where a classifier is overconfident in its incorrect predictions. We start by dissecting miscoverage events in marginally-valid conformal prediction, and show that miscoverage rates vary based on the classifier's confidence and its deviation from the Bayes optimal classifier. Motivated by this insight, we develop a variant of conformal prediction that targets coverage conditional on a reduced set of two variables: the classifier's confidence in a prediction and a nonparametric trust score that measures its deviation from the Bayes classifier. Empirical evaluation on multiple image datasets shows that our method generally improves conditional coverage properties compared to standard conformal prediction, including class-conditional coverage, coverage over arbitrary subgroups, and coverage over demographic groups.",
      "keywords": [
        "uncertainty quantification",
        "conformal prediction",
        "conditional guarantees"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "g2vViuEVDS",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "3a18D8IeQ1",
      "title": "Quantization-Free Autoregressive Action Transformer",
      "abstract": "Current transformer-based imitation learning approaches introduce discrete action representations and train an autoregressive transformer decoder on the resulting latent code. However, the initial quantization breaks the continuous structure of the action space thereby limiting the capabilities of the generative model. We propose a quantization-free method instead that leverages Generative Infinite-Vocabulary Transformers (GIVT) as a direct, continuous policy parametrization for autoregressive transformers. This simplifies the imitation learning pipeline while achieving state-of-the-art performance on a variety of popular simulated robotics tasks. We enhance our policy roll-outs by carefully studying sampling algorithms, further improving the results.",
      "keywords": [
        "Imitation Learning",
        "Reinforcement Learning",
        "Transformers"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "g2vViuEVDS",
      "title": "Intrinsic Goals for Autonomous Agents: Model-Based Exploration in Virtual Zebrafish Predicts Ethological Behavior and Whole-Brain Dynamics",
      "abstract": "Autonomy is a hallmark of animal intelligence, enabling adaptive and intelligent behavior in complex environments without relying on external reward or task structure. Existing reinforcement learning approaches to exploration in reward-free environments, including a class of methods known as *model-based intrinsic motivation*, exhibit inconsistent exploration patterns and do not converge to an exploratory policy, thus failing to capture robust autonomous behaviors observed in animals. Moreover, systems neuroscience has largely overlooked the neural basis of autonomy, focusing instead on experimental paradigms where animals are motivated by external reward rather than engaging in ethological, naturalistic and task-independent behavior. To bridge these gaps, we introduce a novel model-based intrinsic drive explicitly designed after the principles of autonomous exploration in animals. Our method (3M-Progress) achieves animal-like exploration by tracking divergence between an online world model and a fixed prior learned from an ecological niche. To the best of our knowledge, we introduce the first autonomous embodied agent that predicts brain data entirely from self-supervised optimization of an intrinsic goal—without any behavioral or neural training data—demonstrating that 3M-Progress agents capture the explainable variance in behavioral patterns and whole-brain neural-glial dynamics recorded from autonomously behaving larval zebrafish, thereby providing the first goal-driven, population-level model of neural-glial computation. Our findings establish a computational framework connecting model-based intrinsic motivation to naturalistic behavior, providing a foundation for building artificial agents with animal-like autonomy.",
      "keywords": [
        "NeuroAI",
        "intrinsic motivation",
        "zebrafish",
        "neural-glial",
        "embodied agents",
        "reinforcement learning",
        "autonomy."
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "g2vViuEVDS",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "1IpHkK5Q8F",
      "title": "Real-Time Hyper-Personalized Generative AI Should Be Regulated to Prevent the Rise of \"Digital Heroin\"",
      "abstract": "This position paper argues that real-time generative AI has the potential to become the next wave of addictive digital media, creating a new class of digital content akin to ``digital heroin'' with severe implications for mental health and youth development. By shortening the content-generation feedback loop to mere seconds, these advanced models will soon be able to hyper-personalize outputs on the fly. When paired with misaligned incentives (e.g., maximizing user engagement), this will fuel unprecedented compulsive consumption patterns with far-reaching consequences for mental health, cognitive development, and social stability. Drawing on interdisciplinary research, from clinical observations of social media addiction to neuroscientific studies of dopamine-driven feedback, we illustrate how real-time tailored content generation may erode user autonomy, foment emotional distress, and disproportionately endanger vulnerable groups, such as adolescents. Due to the rapid advancement of generative AI and its potential to induce severe addiction-like effects, we call for strong government oversight akin to existing controls on addictive substances, particularly for minors. We further urge the machine learning community to act proactively by establishing robust design guidelines, collaborating with public health experts, and supporting targeted policy measures to ensure responsible and ethical deployment, rather than paving the way for another wave of unregulated digital dependence.",
      "keywords": [
        "generative ai",
        "real-time personalization",
        "behavioral addiction",
        "digital media",
        "public health",
        "policy interventions",
        "machine learning ethics"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "g2vViuEVDS",
      "title": "Intrinsic Goals for Autonomous Agents: Model-Based Exploration in Virtual Zebrafish Predicts Ethological Behavior and Whole-Brain Dynamics",
      "abstract": "Autonomy is a hallmark of animal intelligence, enabling adaptive and intelligent behavior in complex environments without relying on external reward or task structure. Existing reinforcement learning approaches to exploration in reward-free environments, including a class of methods known as *model-based intrinsic motivation*, exhibit inconsistent exploration patterns and do not converge to an exploratory policy, thus failing to capture robust autonomous behaviors observed in animals. Moreover, systems neuroscience has largely overlooked the neural basis of autonomy, focusing instead on experimental paradigms where animals are motivated by external reward rather than engaging in ethological, naturalistic and task-independent behavior. To bridge these gaps, we introduce a novel model-based intrinsic drive explicitly designed after the principles of autonomous exploration in animals. Our method (3M-Progress) achieves animal-like exploration by tracking divergence between an online world model and a fixed prior learned from an ecological niche. To the best of our knowledge, we introduce the first autonomous embodied agent that predicts brain data entirely from self-supervised optimization of an intrinsic goal—without any behavioral or neural training data—demonstrating that 3M-Progress agents capture the explainable variance in behavioral patterns and whole-brain neural-glial dynamics recorded from autonomously behaving larval zebrafish, thereby providing the first goal-driven, population-level model of neural-glial computation. Our findings establish a computational framework connecting model-based intrinsic motivation to naturalistic behavior, providing a foundation for building artificial agents with animal-like autonomy.",
      "keywords": [
        "NeuroAI",
        "intrinsic motivation",
        "zebrafish",
        "neural-glial",
        "embodied agents",
        "reinforcement learning",
        "autonomy."
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "aPHHhnZktB",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "5ZEbpBYGwH",
      "title": "COPER: Correlation-based Permutations for Multi-View Clustering",
      "abstract": "Combining data from different sources can improve data analysis tasks such as clustering. However, most of the current multi-view clustering methods are limited to specific domains or rely on a suboptimal and computationally intensive two-stage process of representation learning and clustering. We propose an end-to-end deep learning-based multi-view clustering framework for general data types (such as images and tables). Our approach involves generating meaningful fused representations using a novel permutation-based canonical correlation objective. We provide a theoretical analysis showing how the learned embeddings approximate those obtained by supervised linear discriminant analysis (LDA). Cluster assignments are learned by identifying consistent pseudo-labels across multiple views. Additionally, we establish a theoretical bound on the error caused by incorrect pseudo-labels in the unsupervised representations compared to LDA. Extensive experiments on ten multi-view clustering benchmark datasets provide empirical evidence for the effectiveness of the proposed model.",
      "keywords": [
        "clustering",
        "canonical correlation analysis",
        "self supervision",
        "multiview"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "aPHHhnZktB",
      "title": "FairDen: Fair Density-Based Clustering",
      "abstract": "Fairness in data mining tasks like clustering has recently become an increasingly important aspect. \nHowever, few clustering algorithms exist that focus on fair groupings of data with sensitive attributes. \nIncluding fairness in the clustering objective is especially hard for density-based clustering, as it does not directly optimize a closed form objective like centroid-based or spectral methods.  \n\nThis paper introduces FairDen, the first fair, density-based clustering algorithm.\nWe capture the dataset's density-connectivity structure in a similarity matrix that we manipulate to encourage a balanced clustering. \nIn contrast to state-of-the-art, FairDen inherently handles categorical attributes, noise, and data with several sensitive attributes or groups.\nWe show that FairDen finds meaningful and fair clusters in extensive experiments.",
      "keywords": [
        "Fair Clustering",
        "Density-based Clustering",
        "Density-Connectivity",
        "Fairness",
        "Unsupervised Learning"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "QwXpn5IPKk",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "BSZqpqgqM0",
      "title": "Why Diffusion Models Don’t Memorize:  The Role of Implicit Dynamical Regularization in Training",
      "abstract": "Diffusion models have achieved remarkable success across a wide range of generative tasks. A key challenge is understanding the mechanisms that prevent their memorization of training data and allow generalization. In this work, we investigate the role of the training dynamics in the transition from generalization to memorization. Through extensive experiments and theoretical analysis, we identify two distinct timescales: an early time $\\tau_\\mathrm{gen}$ at which models begin to generate high-quality samples, and a later time $\\tau_\\mathrm{mem}$ beyond which memorization emerges. Crucially, we find that $\\tau_\\mathrm{mem}$ increases linearly with the training set size $n$, while $\\tau_\\mathrm{gen}$ remains constant. This creates a growing window of training times with $n$ where models generalize effectively, despite showing strong memorization if training continues beyond it. It is only when $n$ becomes larger than a model-dependent threshold that overfitting disappears at infinite training times.\nThese findings reveal a form of implicit dynamical regularization in the training dynamics, which allow to avoid memorization even in highly overparameterized settings. Our results are supported by numerical experiments with standard U-Net architectures on realistic and synthetic  datasets, and by a theoretical analysis using a tractable random features model studied in the high-dimensional limit.",
      "keywords": [
        "Diffusion Models",
        "Deep Learning",
        "Probabilistic Methods"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "hlpRj222RG",
      "title": "PixPerfect: Seamless Latent Diffusion Local Editing with Discriminative Pixel-Space Refinement",
      "abstract": "Latent Diffusion Models (LDMs) have markedly advanced the quality of image inpainting and local editing. However, the inherent latent compression often introduces pixel-level inconsistencies, such as chromatic shifts, texture mismatches, and visible seams along editing boundaries. Existing remedies, including background-conditioned latent decoding and pixel-space harmonization, usually fail to fully eliminate these artifacts in practice and do not generalize well across different latent representations or tasks. We introduce PixPerfect, a pixel‐level refinement framework that delivers seamless, high-fidelity local edits across diverse LDM architectures and tasks. PixPerfect leverages (i) a differentiable discriminative pixel space that amplifies and suppresses subtle color and texture discrepancies, (ii) a comprehensive artifact simulation pipeline that exposes the refiner to realistic local editing artifacts during training, and (iii) a direct pixel-space refinement scheme that ensures broad applicability across diverse latent representations and tasks. Extensive experiments on inpainting, object removal, and insertion benchmarks demonstrate that PixPerfect substantially enhances perceptual fidelity and downstream editing performance, establishing a new standard for robust and high-fidelity localized image editing.",
      "keywords": [
        "image editing",
        "image refinement",
        "diffusion model"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "2rgYVFiWPL",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "C7BIQRM57T",
      "title": "Momentum Multi-Marginal Schrödinger Bridge Matching",
      "abstract": "Understanding complex systems by inferring trajectories from sparse sample snapshots is a fundamental challenge in a wide range of domains, e.g., single-cell biology, meteorology, and economics. Despite advancements in Bridge and Flow matching frameworks, current methodologies rely on pairwise interpolation between adjacent snapshots. This hinders their ability to capture long-range temporal dependencies and potentially affects the coherence of the inferred trajectories. To address these issues, we introduce Momentum Multi-Marginal Schrödinger Bridge Matching (3MSBM), a novel matching framework that learns smooth measure-valued splines for stochastic systems that satisfy multiple positional constraints. This is achieved by lifting the dynamics to phase space and generalizing stochastic bridges to be conditioned on several points, forming a multi-marginal conditional stochastic optimal control problem. The underlying dynamics are then learned by minimizing a variational objective, having fixed the path induced by the multi-marginal conditional bridge. As a matching approach, 3MSBM learns transport maps that preserve intermediate marginals throughout training, significantly improving convergence and scalability. Extensive experimentation in a series of real-world applications validates the superior performance of 3MSBM compared to existing methods in capturing complex dynamics with temporal dependencies, opening new avenues for training matching frameworks in multi-marginal settings.",
      "keywords": [
        "Diffusion models",
        "Schrödinger bridge",
        "Distribution matching",
        "Trajectory Inference"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "K0FbK2GOGj",
      "title": "Instance-Optimality for Private KL Distribution Estimation",
      "abstract": "We study the fundamental problem of estimating an unknown discrete distribution $p$ over $d$ symbols, given $n$ i.i.d. samples from the distribution. We are interested in minimizing the KL divergence between the true distribution and the algorithm's estimate. We first construct minimax optimal private estimators. Minimax optimality however fails to shed light on an algorithm's performance on individual (non-worst-case) instances $p$ and simple minimax-optimal DP estimators can have poor empirical performance on real distributions. We then study this problem from an instance-optimality viewpoint, where the algorithm's error on $p$ is compared to the minimum achievable estimation error over a small local neighborhood of $p$. Under natural notions of local neighborhood, we propose algorithms that achieve instance-optimality up to constant factors, with and without a differential privacy constraint. Our upper bounds rely on (private) variants of the Good-Turing estimator. Our lower bounds use additive local neighborhoods that more precisely captures the hardness of distribution estimation in KL divergence, compared to ones considered in prior works.",
      "keywords": [
        "Differential Privacy",
        "Distribution Estimation",
        "Instance-Optimality"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "WrYWolqKh3",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "NM8Apk61NA",
      "title": "HyperET: Efficient Training in Hyperbolic Space for Multi-modal Large Language Models",
      "abstract": "Multi-modal large language models (MLLMs) have emerged as a transformative approach for aligning visual and textual understanding. They typically require extremely high computational resources (e.g., thousands of GPUs) for training to achieve cross-modal alignment at multi-granularity levels. We argue that a key source of this inefficiency lies in the vision encoders they widely equip with, e.g., CLIP and SAM, which lack the alignment with language at multi-granularity levels. To address this issue, in this paper, we leverage hyperbolic space, which inherently models hierarchical levels and thus provides a principled framework for bridging the granularity gap between visual and textual modalities at an arbitrary granularity level. Concretely, we propose an efficient training paradigm for MLLMs, dubbed as \\blg, which can optimize visual representations to align with their textual counterparts at an arbitrary granularity level through dynamic hyperbolic radius adjustment in hyperbolic space. \\alg employs learnable matrices with M\\\"{o}bius multiplication operations, implemented via three effective configurations: diagonal scaling matrices, block-diagonal matrices, and banded matrices, providing a flexible yet efficient parametrization strategy. Comprehensive experiments across multiple MLLM benchmarks demonstrate that \\alg consistently improves both existing pre-training and fine-tuning MLLMs clearly with less than 1\\% additional parameters. Code is available at \\url{https://github.com/godlin-sjtu/HyperET}.",
      "keywords": [
        "Efficient Training",
        "Multi-modal Large Language Models",
        "Granularity Levels",
        "Hyperbolic Space"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "stpe7UeETz",
      "title": "Corrector Sampling in Language Models",
      "abstract": "Autoregressive language models accumulate errors due to their fixed, irrevocable left-to-right token generation. To address this, we propose a new sampling method called Resample-Previous-Tokens (RPT). RPT mitigates error accumulation by iteratively revisiting and potentially replacing tokens in a window of previously generated text. Fine-tuning a pretrained 8B parameter model with RPT for only 100B resulted in ~10% relative improvements on reasoning and coding benchmarks compared to the standard sampling.",
      "keywords": [
        "Language modeling",
        "Sampling"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "xak8c9l1nu",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "MT3aOfXIbY",
      "title": "Faster Diffusion Sampling with Randomized Midpoints: Sequential and Parallel",
      "abstract": "Sampling algorithms play an important role in controlling the quality and runtime of diffusion model inference. In recent years, a number of works (Chen et al., 2023c;b; Benton et al., 2023; Lee et al., 2022) have analyzed algorithms for diffusion sampling with provable guarantees; these works show that for essentially any data distribution, one can approximately sample in polynomial time given a sufficiently accurate estimate of its score functions at different noise levels. \n\nIn this work, we propose a new scheme inspired by Shen and Lee's randomized midpoint method for log-concave sampling  (Shen & Lee, 2019). We prove that this approach achieves the best known dimension dependence for sampling from arbitrary smooth distributions in total variation distance ($\\widetilde O(d^{5/12})$ compared to $\\widetilde O(\\sqrt{d})$ from prior work). We also show that our algorithm can be parallelized to run in only $\\widetilde O(\\log^2 d)$ parallel rounds, constituting the first provable guarantees for parallel sampling with diffusion models.\n    \nAs a byproduct of our methods, for the well-studied problem of log-concave sampling in total variation distance, we give an algorithm and simple analysis achieving dimension dependence $\\widetilde O(d^{5/12})$ compared to $\\widetilde O(\\sqrt{d})$ from prior work.",
      "keywords": [
        "Diffusion Sampling",
        "Generative Model",
        "Statistical Theory"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "xak8c9l1nu",
      "title": "Computational Explorations of Total Variation Distance",
      "abstract": "We investigate some previously unexplored (or underexplored) computational aspects of total variation (TV) distance.\nFirst, we give a simple deterministic polynomial-time algorithm for checking equivalence between mixtures of product distributions, over arbitrary alphabets.\nThis corresponds to a special case, whereby the TV distance between the two distributions is zero.\nSecond, we prove that unless $\\mathsf{NP} \\subseteq \\mathsf{RP}$ it is impossible to efficiently estimate the TV distance between arbitrary Ising models, even in a bounded-error randomized setting.",
      "keywords": [
        "total variation distance",
        "TV distance",
        "mixtures of products",
        "equivalence checking",
        "Ising models",
        "computational complexity",
        "FPRAS"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "S8XcHutp7Z",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "nMUpDatZBh",
      "title": "VICON: Vision In-Context Operator Networks for Multi-Physics Fluid Dynamics",
      "abstract": "In-Context Operator Networks (ICONs) have demonstrated the ability to learn operators across diverse partial differential equations using few-shot, in-context learning. However, existing ICONs process each spatial point as an individual token, severely limiting computational efficiency when handling dense data in higher spatial dimensions. We propose \\textit{Vision In-Context Operator Networks} (VICON), which integrates vision transformer architectures to efficiently process 2D data through patch-wise operations while preserving ICON's adaptability to multiphysics systems and varying timesteps. Evaluated across three fluid dynamics benchmarks, VICON significantly outperforms state-of-the-art baselines: DPOT and MPP, reducing the averaged last-step rollout error by 37.9\\% compared to DPOT and 44.7\\% compared to MPP, while requiring only 72.5\\% and 34.8\\% of their respective inference times. VICON naturally supports flexible rollout strategies with varying timestep strides, enabling immediate deployment in \\textit{imperfect measurement systems} where sampling frequencies may differ or frames might be dropped—common challenges in real-world settings—without requiring retraining or interpolation. In these realistic scenarios, VICON exhibits remarkable robustness, experiencing only 24.41\\% relative performance degradation compared to 71.37\\%-74.49\\% degradation in baseline methods, demonstrating its versatility for depolying in realistic applications.",
      "keywords": [
        "AI4Science",
        "Learning PDE",
        "Fluid Dynamics",
        "In-Context Learning"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "nxGSj1xkm3",
      "title": "Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis",
      "abstract": "Recent advances in large vision-language models (LVLMs) have demonstrated strong performance on general-purpose medical tasks. However, their effectiveness in specialized domains such as dentistry remains underexplored. In particular, panoramic X-rays, a widely used imaging modality in oral radiology, pose interpretative challenges due to dense anatomical structures and subtle pathological cues, which are not captured by existing medical benchmarks or instruction datasets. To this end, we introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. MMOral consists of 20,563 annotated images paired with 1.3 million instruction-following instances across diverse task types, including attribute extraction, report generation, visual question answering, and image-grounded dialogue. In addition, we present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. We evaluate 64 LVLMs on MMOral-Bench and find that even the best-performing model, i.e., GPT-4o, only achieves 43.31% accuracy, revealing significant limitations of current models in this domain. To promote the progress of this specific domain, we provide the supervised fine-tuning (SFT) process utilizing our meticulously curated MMOral instruction dataset. Remarkably, a single epoch of SFT yields substantial performance enhancements for LVLMs, e.g., Qwen2.5-VL-7B demonstrates a 24.73% improvement. MMOral holds significant potential as a critical foundation for intelligent dentistry and enables more clinically impactful multimodal AI systems in the dental field.",
      "keywords": [
        "Medical benchmark",
        "Multimodal instruction data",
        "Large vision language models"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "TkbjqexD8w",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "bwOndfohRK",
      "title": "Neural networks on Symmetric Spaces of Noncompact Type",
      "abstract": "Recent works have demonstrated promising performances of neural networks on hyperbolic spaces and symmetric positive definite (SPD) manifolds. These spaces belong to a family of Riemannian manifolds referred to as symmetric spaces of noncompact type. In this paper, we propose a novel approach for developing neural networks on such spaces. Our approach relies on a unified formulation of the distance from a point to a hyperplane on the considered spaces. We show that some existing formulations of the point-to-hyperplane distance can be recovered by our approach under specific settings. Furthermore, we derive a closed-form expression for the point-to-hyperplane distance in higher-rank symmetric spaces of noncompact type equipped with G-invariant Riemannian metrics. The derived distance then serves as a tool to design fully-connected (FC) layers and an attention mechanism for neural networks on the considered spaces. Our approach is validated on challenging benchmarks for image classification, electroencephalogram (EEG) signal classification, image generation, and natural language inference.",
      "keywords": [
        "geometric deep learning",
        "symmetric spaces",
        "hyperbolic spaces",
        "SPD manifolds"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "fGdF8Bq1FV",
      "title": "Generalization Guarantees for Representation Learning via Data-Dependent Gaussian Mixture Priors",
      "abstract": "We establish in-expectation and tail bounds on the generalization error of representation learning type algorithms. The bounds are in terms of the relative entropy between the distribution of the representations extracted from the training and \"test'' datasets and a data-dependent symmetric prior, i.e., the Minimum Description Length (MDL) of the latent variables for the training and test datasets. Our bounds are shown to reflect the \"structure\" and \"simplicity'' of the encoder and significantly improve upon the few existing ones for the studied model. We then use our in-expectation bound to devise a suitable data-dependent regularizer; and we investigate thoroughly the important question of the selection of the prior. We propose a systematic approach to simultaneously learning a data-dependent Gaussian mixture prior and using it as a regularizer. Interestingly, we show that a weighted attention mechanism emerges naturally in this procedure. Our experiments show that our approach outperforms the now popular Variational Information Bottleneck (VIB) method as well as the recent Category-Dependent VIB (CDVIB).",
      "keywords": [
        "Representation learning algorithm",
        "Gaussian-Mixture",
        "regularizer",
        "rate-disotortion"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "VRlihVklCL",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "SMK0f8JoKF",
      "title": "Language Models Are Implicitly Continuous",
      "abstract": "Language is typically modelled with discrete sequences. However, the most successful approaches to language modelling, namely neural networks, are continuous and smooth function approximators.\nIn this work, we show that Transformer-based language models implicitly learn to represent sentences as continuous-time functions defined over a continuous input space. \nThis phenomenon occurs in most state-of-the-art Large Language Models (LLMs), including Llama2, Llama3, Phi3, Gemma, Gemma2, and Mistral, and suggests that LLMs reason about language in ways that fundamentally differ from humans.\nOur work formally extends Transformers to capture the nuances of time and space continuity in both input and output space.\nOur results challenge the traditional interpretation of how LLMs understand language, with several linguistic and engineering implications.",
      "keywords": [
        "llm",
        "continuity",
        "spatiotemporal transformers",
        "linguistics"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "VRlihVklCL",
      "title": "MaD-Scientist: AI-based Scientist solving Convection-Diffusion-Reaction Equations Using Massive PINN-Based Prior Data",
      "abstract": "Large language models (LLMs), like ChatGPT, have shown that even trained with noisy prior data,  they can generalize effectively to new tasks through in-context learning (ICL) and pre-training techniques.\nMotivated by this, we explore whether a similar approach can be applied to scientific foundation models (SFMs). Our methodology is structured as follows: (i) we collect low-cost physics-informed neural network (PINN)-based approximated prior data in the form of solutions to partial differential equations (PDEs) constructed through an arbitrary linear combination of mathematical dictionaries; (ii) we utilize Transformer architectures with self and cross-attention mechanisms to predict PDE solutions without knowledge of the governing equations in a zero-shot setting; (iii) we provide experimental evidence on the one-dimensional convection-diffusion-reaction equation, which demonstrate that pre-training remains robust even with approximated prior data, with only marginal impacts on test accuracy. Notably, this finding opens the path to pre-training SFMs with realistic, low-cost data instead of (or in conjunction with) numerical high-cost data. These results support the conjecture that SFMs can improve in a manner similar to LLMs, where fully cleaning the vast set of sentences crawled from the Internet is nearly impossible.",
      "keywords": [
        "in-context learning",
        "scientific foundation model",
        "zero-shot",
        "PINN-prior"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "hUb2At2DsQ",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "B5RrIFMqbe",
      "title": "FormalAlign: Automated Alignment Evaluation for Autoformalization",
      "abstract": "Autoformalization aims to convert informal mathematical proofs into machine-verifiable formats, bridging the gap between natural and formal languages. However, ensuring semantic alignment between the informal and formalized statements remains challenging. Existing approaches heavily rely on manual verification, hindering scalability. To address this, we introduce FormalAlign, a framework for automatically evaluating the alignment between natural and formal languages in autoformalization. FormalAlign trains on both the autoformalization sequence generation task and the representational alignment between input and output, employing a dual loss that combines a pair of mutually enhancing autoformalization and alignment tasks. Evaluated across four benchmarks augmented by our proposed misalignment strategies, FormalAlign demonstrates superior performance. In our experiments, FormalAlign outperforms GPT-4, achieving an Alignment-Selection Score 11.58\\% higher on \\forml-Basic (99.21\\% vs. 88.91\\%) and 3.19\\% higher on MiniF2F-Valid (66.39\\% vs. 64.34\\%). This effective alignment evaluation significantly reduces the need for manual verification.",
      "keywords": [
        "Large Language models",
        "Autoformalization",
        "Lean 4",
        "Formal Math",
        "AI for Math"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "hUb2At2DsQ",
      "title": "Rethinking and Improving Autoformalization: Towards a Faithful Metric and a Dependency Retrieval-based Approach",
      "abstract": "As a central component in formal verification, statement autoformalization has been widely studied including the recent efforts from machine learning community, but still remains a widely-recognized difficult and open problem. In this paper, we delve into two critical yet under-explored gaps: 1) absence of faithful and universal automated evaluation for autoformalization results; 2) agnosia of contextual information, inducing severe hallucination of formal definitions and theorems. To address the first issue, we propose **BEq** (_**B**idirectional **E**xtended Definitional E**q**uivalence_), an automated neuro-symbolic method to determine the equivalence between two formal statements, which is formal-grounded and well-aligned with human intuition. For the second, we propose **RAutoformalizer** (_**R**etrieval-augmented **Autoformalizer**_), augmenting statement autoformalization by _Dependency Retrieval_, retrieving potentially dependent objects from formal libraries. We parse the dependencies of libraries and propose to _structurally informalise_ formal objects by the topological order of dependencies. To evaluate OOD generalization and research-level capabilities, we build a novel benchmark, _Con-NF_, consisting of 961 informal-formal statement pairs from frontier mathematical researches. Experiments validate the effectiveness of our approaches: BEq is evaluated on 200 diverse formal statement pairs with expert-annotated equivalence label, exhibiting significantly improved accuracy ($82.50\\\\% \\mapsto 90.50\\\\%$) and precision ($70.59\\\\% \\mapsto 100.0\\\\%$). For dependency retrieval, a strong baseline is devised. Our RAutoformalizer substantially outperforms SOTA baselines in both in-distribution ProofNet benchmark ($12.83\\\\% \\mapsto 18.18\\\\%$, BEq@8) and OOD Con-NF scenario ($4.58\\\\%\\mapsto 16.86\\\\%$, BEq@8).",
      "keywords": [
        "Large Language Model",
        "Formal Verification",
        "Autoformalization"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "VipcVxaTnG",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "VipcVxaTnG",
      "title": "Correlation and Navigation in the Vocabulary Key Representation Space of Language Models",
      "abstract": "Language model (LM) decoding is based on the next-token prediction (NTP) probability distribution. For neural LMs (e.g., Transformer-based), NTP distribution is\nessentially a softmax-regularized dot product between an encoded input context\n(query) and fixed vocabulary representations (keys). In this paper, we study the\neffect of the key distribution on the NTP distribution, with a focus on whether\nthe similarity between keys will trigger spurious correlations in NTP. Through\nknowledge-probing tasks, we show that in the NTP distribution, the few top-ranked\ntokens are typically accurate. However, the middle-ranked prediction is highly biased\ntowards the tokens that are distributionally (not necessarily semantically) similar to\nthese top ones. For instance, if “P” is predicted as the top-1 token, “A”-“Z” will all\nbe ranked high in NTP, no matter whether they can lead to correct decoding results.\nThis hurts the sampling diversity and makes the sampling of correct, long-tail\nresults hopeless and noisy. We attempt to alleviate this issue via a novel in-context\nmethod that iteratively pushes the query representation away from explored regions.\nSpecifically, we include the explored decoding results in the context and prompt\nthe LM to generate something else, which encourages the LM to produce a query\nrepresentation that has small dot products with explored keys. Experiments on\nknowledge-probing tasks show that our method leads to efficient navigation away\nfrom explored keys to correct new keys. We further extend our method to open-ended and chain-of-thought (for reasoning) generation. Experiment results show\nthat ICN contributes to better generation diversity and improved self-consistency\nvoting performance. Finally, we discuss potential training issues caused by the\nfixed key space together with the challenges and possible ways to address them in\nfuture research.",
      "keywords": [
        "Language Modeling",
        "Next Token Prediction",
        "Spurious Correlation",
        "Generation Diversity"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "mtSSFiqW6y",
      "title": "Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment",
      "abstract": "The performance of large language models (LLMs) is closely linked to their underlying size, leading to ever-growing networks and hence slower inference. Speculative decoding has been proposed as a technique to accelerate autoregressive generation, leveraging a fast draft model to propose candidate tokens, which are then verified in parallel based on their likelihood under the target model. While this approach guarantees to reproduce the target output, it incurs a substantial penalty: many high-quality draft tokens are rejected, even when they represent objectively valid continuations. Indeed, we show that even powerful draft models such as GPT-4o, as well as human text cannot achieve high acceptance rates under the standard verification scheme. This severely limits the speedup potential of current speculative decoding methods, as an early rejection becomes overwhelmingly likely when solely relying on alignment of draft and target.\nWe thus ask the following question: Can we adapt verification to recognize correct, but non-aligned replies? To this end, we draw inspiration from the LLM-as-a-judge framework, which demonstrated that LLMs are able to rate answers in a versatile way. We carefully design a dataset coined TokenCourt to elicit the same capability in the target model by training a compact module on top of the embeddings to produce ``judgements\" of the current continuation. We showcase our strategy on the Llama-3.1 family, where our 8B/405B-Judge achieves a speedup of $9\\times$ over Llama-405B, while maintaining its quality on a large range of benchmarks. These benefits remain present even in optimized inference frameworks, where our method reaches up to $141$ tokens/s for 8B/70B-Judge and $129$ tokens/s for 8B/405B on $2$ and $8$ H100s respectively.",
      "keywords": [
        "LLM inference",
        "speculative decoding"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "R73ybUciQF",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "R73ybUciQF",
      "title": "A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders",
      "abstract": "Sparse Autoencoders (SAEs) aim to decompose the activation space of large language models (LLMs) into human-interpretable latent directions or features. As we increase the number of features in the SAE, hierarchical features tend to split into finer features (“math” may split into “algebra”, “geometry”, etc.), a phenomenon referred to as feature splitting. However, we show that sparse decomposition and splitting of hierarchical features is not robust. Specifically, we show that seemingly monosemantic features fail to fire where they should, and instead get “absorbed” into their children features. We coin this phenomenon feature absorption, and show that it is caused by optimizing for sparsity in SAEs whenever the underlying features form a hierarchy. We introduce a metric to detect absorption in SAEs, and validate our findings empirically on hundreds of LLM SAEs. Our investigation suggests that varying SAE sizes or sparsity is insufficient to solve this issue. We discuss the implications of feature absorption in SAEs and some potential approaches to solve the fundamental theoretical issues before SAEs can be used for interpreting LLMs robustly and at scale.",
      "keywords": [
        "sparse autoencoders",
        "SAEs",
        "interpretability",
        "NLP"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "p8lKcNkJRi",
      "title": "Dense SAE Latents Are Features, Not Bugs",
      "abstract": "Sparse autoencoders (SAEs) are designed to extract interpretable features from language models by enforcing a sparsity constraint. Ideally, training an SAE would yield latents that are both sparse and semantically meaningful. However, many SAE latents activate frequently (i.e., are *dense*), raising concerns that they may be undesirable artifacts of the training procedure. In this work, we systematically investigate the geometry, function, and origin of dense latents and show that they are not only persistent but often reflect meaningful model representations. We first demonstrate that dense latents tend to form antipodal pairs that reconstruct specific directions in the residual stream, and that ablating their subspace suppresses the emergence of new dense features in retrained SAEs---suggesting that high density features are an intrinsic property of the residual space. We then introduce a taxonomy of dense latents, identifying classes tied to position tracking, context binding, entropy regulation, letter-specific output signals, part-of-speech, and principal component reconstruction. Finally, we analyze how these features evolve across layers, revealing a shift from structural features in early layers, to semantic features in mid layers, and final to output-oriented signals in the last layers of the model. Our findings indicate that dense latents serve functional roles in language model computation and should not be dismissed as training noise.",
      "keywords": [
        "sae",
        "sparse autoencoder",
        "interpretability",
        "mechanistic interpretability",
        "language model"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "n4V3MSqK77",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "n4V3MSqK77",
      "title": "Agentic Plan Caching: Test-Time Memory for Fast and Cost-Efficient LLM Agents",
      "abstract": "LLM-based agent applications have shown increasingly remarkable capabilities in complex workflows but incur substantial costs and latency due to extensive planning and reasoning requirements. \nExisting LLM caching techniques (like context caching and semantic caching), primarily designed for serving chatbots, are insufficient for agent applications where outputs depend on external data and environmental contexts. \nWe propose **Agentic Plan Caching (APC)**, a novel **test-time memory** that extracts, stores, adapts, and reuses structured plan templates from planning stages of agent applications across semantically similar tasks to reduce the cost and latency of serving. \nUnlike traditional semantic caching, our system extracts plan templates from completed agent executions at test-time, employs keyword extraction to match new requests against cached plans, and utilizes lightweight models to adapt these templates to task-specific plans with contexts. \nEvaluation across multiple real-world agent applications shows that our system can reduce costs by 50.31\\% and latency by 27.28\\% on average while maintaining performance, offering a more efficient solution for serving LLM-based agents that complements existing LLM serving infrastructures.",
      "keywords": [
        "Caching",
        "Memory",
        "Serving",
        "LLM Agents"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "q2VpjD7k1V",
      "title": "WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch",
      "abstract": "LLM‑based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent's ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. These instructions span three major categories and thirteen minor categories, encompassing nearly all important types of web applications.\nTo assess the quality of the generated websites, we generate test cases targeting each functionality described in the instructions. These test cases are then manually filtered, refined, and organized to ensure accuracy, resulting in a total of 647 test cases. Each test case specifies an operation to be performed on the website and the expected outcome of the operation.\nTo automate testing and improve reproducibility, we employ a powerful web-navigation agent to execute test cases on the generated websites and determine whether the observed responses align with the expected results.\nWe evaluate three high-performance code-agent frameworks—Bolt.diy, OpenHands, and Aider—using multiple proprietary and open-source LLMs as engines. The best-performing combination, Bolt.diy powered by DeepSeek-R1, achieves only 27.8\\% accuracy on the test cases, highlighting the challenging nature of our benchmark.\nAdditionally, we construct WebGen-Instruct, a training set consisting of 6,667 website-generation instructions. Training Qwen2.5-Coder-32B-Instruct on Bolt.diy trajectories generated from a subset of the training set achieves an accuracy of 38.2\\%, surpassing the performance of the best proprietary model.\nWe release our data-generation, training, and testing code, along with both the datasets and model weights at https://github.com/mnluzimu/WebGen-Bench.",
      "keywords": [
        "Code Agent",
        "Website Generation"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "cFu7ze7xUm",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "FDnZFpHmU4",
      "title": "Determine-Then-Ensemble: Necessity of Top-k Union for Large Language Model Ensembling",
      "abstract": "Large language models (LLMs) exhibit varying strengths and weaknesses across different tasks, prompting recent studies to explore the benefits of ensembling models to leverage their complementary advantages. However, existing LLM ensembling methods often overlook model compatibility and struggle with inefficient alignment of probabilities across the entire vocabulary. In this study, we empirically investigate the factors influencing ensemble performance, identifying model performance, vocabulary size, and response style as key determinants, revealing that compatibility among models is essential for effective ensembling. This analysis leads to the development of a simple yet effective model selection strategy that identifies compatible models. Additionally, we introduce the \\textsc{Uni}on \\textsc{T}op-$k$ \\textsc{E}nsembling (\\textsc{UniTE}), a novel approach that efficiently combines models by focusing on the union of the top-k tokens from each model, thereby avoiding the need for full vocabulary alignment and reducing computational overhead. Extensive evaluations across multiple benchmarks demonstrate that \\textsc{UniTE} significantly enhances performance compared to existing methods, offering a more efficient framework for LLM ensembling.",
      "keywords": [
        "Model ensembling",
        "LLM"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "cFu7ze7xUm",
      "title": "DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads",
      "abstract": "Deploying long-context large language models (LLMs) is essential but poses significant computational and memory challenges.\nCaching all Key and Value (KV) states across all attention heads consumes substantial memory.\nExisting KV cache pruning methods either damage the long-context capabilities of LLMs or offer only limited efficiency improvements.\nIn this paper, we identify that only a fraction of attention heads, a.k.a, Retrieval Heads, are critical for processing long contexts and require full attention across all tokens.\nIn contrast, all other heads, which primarily focus on recent tokens and attention sinks—referred to as Streaming Heads—do not require full attention.\nBased on this insight, we introduce DuoAttention, a framework that only applies a full KV cache to retrieval heads while using a light-weight, constant-length KV cache for streaming heads, which reduces both LLM's decoding and pre-filling memory and latency without compromising its long-context abilities.\nDuoAttention uses a lightweight, optimization-based algorithm with synthetic data to identify retrieval heads accurately.\nOur method significantly reduces long-context inference memory by up to 2.55$\\times$ for MHA and 1.67$\\times$ for GQA models while speeding up decoding by up to 2.18$\\times$ and 1.50$\\times$ and accelerating pre-filling by up to 1.73$\\times$ and 1.63$\\times$ for MHA and GQA models, respectively, with minimal accuracy loss compared to full attention.\nNotably, combined with quantization, DuoAttention enables Llama-3-8B decoding with 3.33 million context length measured on a single A100 GPU. Code is provided in https://github.com/mit-han-lab/duo-attention.",
      "keywords": [
        "Large Language Models; Long Context; Efficiency;"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "6HcnC3pPkp",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "6HcnC3pPkp",
      "title": "Token-Supervised Value Models for Enhancing Mathematical Problem-Solving Capabilities of Large Language Models",
      "abstract": "With the rapid advancement of test-time compute search strategies to improve the mathematical problem-solving capabilities of large language models (LLMs), the need for building robust verifiers has become increasingly important. However, all these inference strategies rely on existing verifiers originally designed for Best-of-N search, which makes them sub-optimal for tree search techniques at test time. During tree search, existing verifiers can only offer indirect and implicit assessments of partial solutions or under-value prospective intermediate steps, thus resulting in the premature pruning of promising intermediate steps. To overcome these limitations, we propose token-supervised value models (TVMs) -- a new class of verifiers that assign each token a probability that reflects the likelihood of reaching the correct final answer. This new token-level supervision enables TVMs to directly and explicitly evaluate partial solutions, effectively distinguishing between promising and incorrect intermediate steps during tree search at test time. Experimental results demonstrate that combining tree-search-based inference strategies with TVMs significantly improves the accuracy of LLMs in mathematical problem-solving tasks, surpassing the performance of existing verifiers.",
      "keywords": [
        "Large Language Models",
        "Mathematical Problem-Solving",
        "Verifiers"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "7tOc6h8bea",
      "title": "Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation",
      "abstract": "Inference-time computation is a powerful paradigm to enhance the performance of large language models (LLMs), with Best-of-N sampling being a widely used technique. However, this method is computationally expensive, requiring both (1) an external reward model and (2) the generation of multiple samples. In this work, we introduce a new generative self-evaluation scheme designed to adaptively reduce the number of generated samples while maintaining or even improving performance. We use a generative reward model formulation, allowing the LLM to predict mid-generation the probability that restarting the generation will yield a better response. These predictions are obtained without an external reward model and can be used to decide whether or not to generate more samples, prune unpromising samples early on, or to pick the best sample. This capability is very inexpensive as it involves generating a single predefined token. Trained using a dataset constructed with real unfiltered LMSYS user prompts, Llama 3.1 8B's win rate against GPT-4 on AlpacaEval increases from 21\\% to 34\\% with 16 samples and math performance on GSM8K improves from 84\\% to 91\\%. By sampling only when the LLM determines that it is beneficial to do so and adaptively adjusting temperature annealing, we demonstrate that 74\\% of the improvement from using 16 samples can be achieved with only 1.2 samples on average. We further demonstrate that 50–75\\% of samples can be pruned early in generation with minimal degradation in performance. Overall, our methods enable more efficient and scalable compute utilization during inference for LLMs.",
      "keywords": [
        "LLMs",
        "inference-time",
        "inference-time efficiency",
        "Best-of-N",
        "self-evaluation"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "vRvVVb0NAz",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "DzKdjWe59v",
      "title": "Hint Marginalization for Improved Reasoning in Large Language Models",
      "abstract": "Large Language Models (LLMs) have exhibited an impressive capability to perform reasoning tasks, especially if they are encouraged to generate a sequence of intermediate steps. Reasoning performance can be improved by suitably combining multiple LLM responses, generated either in parallel in a single query, or via sequential interactions with LLMs throughout the reasoning process. Existing strategies for combination, such as self-consistency and progressive-hint-prompting, make inefficient usage of the LLM responses. We present Hint Marginalization, a novel and principled algorithmic framework to enhance the reasoning capabilities of LLMs. Our approach can be viewed as an iterative sampling strategy for forming a Monte Carlo approximation of an underlying distribution of answers, with the goal of identifying the mode the most likely answer. Empirical evaluation on several benchmark datasets for arithmetic reasoning demonstrates the superiority of the proposed approach.",
      "keywords": [
        "reasoning",
        "large language models"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "TDyE2iuvyc",
      "title": "Efficient Model Editing with Task-Localized Sparse Fine-tuning",
      "abstract": "Task arithmetic has emerged as a promising approach for editing models by representing task-specific knowledge as composable task vectors. However, existing methods rely on network linearization to derive task vectors, leading to computational bottlenecks during training and inference. Moreover, linearization alone does not ensure weight disentanglement, the key property that enables conflict-free composition of task vectors. To address this, we propose TaLoS which allows to build sparse task vectors with minimal interference without requiring explicit linearization and sharing information across tasks. We find that pre-trained models contain a subset of parameters with consistently low gradient sensitivity across tasks, and that sparsely updating only these parameters allows for promoting weight disentanglement during fine-tuning. Our experiments prove that TaLoS improves training and inference efficiency while outperforming current methods in task addition and negation. By enabling modular parameter editing, our approach fosters practical deployment of adaptable foundation models in real-world applications.",
      "keywords": [
        "task arithmetic",
        "parameter-efficient fine-tuning"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "s0JVsx3bx1",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "oEgybA04dY",
      "title": "Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning",
      "abstract": "The remarkable reasoning capability of large language models (LLMs) stems from cognitive behaviors that emerge through reinforcement with verifiable rewards. This work investigates how to transfer this principle to Multimodal LLMs (MLLMs) to unlock advanced visual reasoning. We introduce a two-stage paradigm built on Qwen2.5-VL-7B: a massive linguistic cold-start fine-tuning, \nfollowed by multimodal reinforcement learning (RL) spanning nearly 1,000 steps—surpassing all previous open-source efforts in scale.\nThis pioneering work reveals three fundamental insights: 1) Behavior transfer emerges surprisingly early in cold start due to linguistic mental imagery. 2) Cold start broadly memorizes visual behaviors, while RL critically discerns and scales up effective patterns. 3) Transfer strategically favors high-utility behaviors such as visual reflection. Our resulting model, Open-Vision-Reasoner (OVR), achieves state-of-the-art performance on a suite of reasoning benchmarks, including 95.3% on MATH500, 51.8% on MathVision and 54.6% on MathVerse. We release our model, data, and training dynamics to catalyze the development of more capable, behavior-aligned multimodal reasoners.",
      "keywords": [
        "Multimodal LLM",
        "Visual Reasoning",
        "Cognitive Behavior Transfer"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "s0JVsx3bx1",
      "title": "1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities",
      "abstract": "Scaling up self-supervised learning has driven breakthroughs in language and vision, yet comparable progress has remained elusive in reinforcement learning (RL). In this paper, we study building blocks for self-supervised RL that unlock substantial improvements in scalability, with network depth serving as a critical factor. Whereas most RL papers in recent years have relied on shallow architectures (around 2 -- 5 layers), we demonstrate that increasing the depth up to 1024 layers can significantly boost performance.\nOur experiments are conducted in an unsupervised goal-conditioned setting, where no demonstrations or rewards are provided, so an agent must explore (from scratch) and learn how to maximize the likelihood of reaching commanded goals.\nEvaluated on simulated locomotion and manipulation tasks, our approach increases performance on the self-supervised contrastive RL algorithm by $2\\times$ -- $50\\times$, outperforming other goal-conditioned baselines.\nIncreasing the model depth not only increases success rates but also qualitatively changes the behaviors learned.",
      "keywords": [
        "Reinforcement Learning",
        "Self-Supervised Learning",
        "Contrastive RL",
        "Goal-conditioned RL",
        "Scaling"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "GEBkyKZOc4",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "GEBkyKZOc4",
      "title": "Rational Decision-Making Agent with Learning Internal Utility Judgment",
      "abstract": "With remarkable advancements, large language models (LLMs) have attracted significant efforts to develop LLM-based agents capable of executing intricate multi-step decision-making tasks. Existing approaches predominantly build upon the external performance measure to guide the decision-making process but the reliance on the external performance measure as prior is problematic in real-world scenarios, where such prior may be unavailable, flawed, or even erroneous. For genuine autonomous decision-making for LLM-based agents, it is imperative to develop rationality from their posterior experiences to judge the utility of each decision independently. In this work, we propose RaDAgent (Rational Decision-Making Agent), which fosters the development of its rationality through an iterative framework involving Experience Exploration and Utility Learning. Within this framework, Elo-based Utility Learning is devised to assign Elo scores to individual decision steps to judge their utilities via pairwise comparisons. Consequently, these Elo scores guide the decision-making process to derive optimal outcomes. Experimental results on the Game of 24, WebShop, ToolBench and RestBench datasets demonstrate RaDAgent’s superiority over baselines, achieving about 7.8% improvement on average. Besides, RaDAgent also can reduce costs (ChatGPT API calls), highlighting its effectiveness and efficiency.",
      "keywords": [
        "Decision Making",
        "Autonomous Agent",
        "Large Lanugage Model",
        "Elo Rating"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "jwGPmIqE99",
      "title": "STRIDE: A Tool-Assisted LLM Agent Framework for Strategic and Interactive Decision-Making",
      "abstract": "Large Language Models (LLMs) have revolutionized natural language processing, showing remarkable linguistic proficiency and reasoning capabilities. However, their application in strategic multi-agent decision-making environments is hampered by significant limitations including poor mathematical reasoning, difficulty in following instructions, and a tendency to generate incorrect information. These deficiencies hinder their performance in strategic and interactive tasks that demand adherence to nuanced game rules, long-term planning, exploration in unknown environments, and anticipation of opponents' moves. To overcome these obstacles, this paper presents a novel LLM agent framework equipped with memory and specialized tools to enhance their strategic decision-making capabilities. We deploy the tools in a number of economically important environments, in particular bilateral bargaining and multi-agent and dynamic mechanism design. We employ quantitative metrics to assess the framework's performance in various strategic decision-making problems. Our findings establish that our enhanced framework significantly improves the strategic decision-making capability of LLMs. While we highlight the inherent limitations of current LLM models, we demonstrate the improvements through targeted enhancements, suggesting a promising direction for future developments in LLM applications for interactive environments.",
      "keywords": [
        "LLM Agent",
        "Strategic Decision Making",
        "Markov Decision Making Process"
      ],
      "decision": "Reject",
      "year": "2025"
    }
  },
  {
    "group_id": "SDhOClkyqC",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "BSZqpqgqM0",
      "title": "Why Diffusion Models Don’t Memorize:  The Role of Implicit Dynamical Regularization in Training",
      "abstract": "Diffusion models have achieved remarkable success across a wide range of generative tasks. A key challenge is understanding the mechanisms that prevent their memorization of training data and allow generalization. In this work, we investigate the role of the training dynamics in the transition from generalization to memorization. Through extensive experiments and theoretical analysis, we identify two distinct timescales: an early time $\\tau_\\mathrm{gen}$ at which models begin to generate high-quality samples, and a later time $\\tau_\\mathrm{mem}$ beyond which memorization emerges. Crucially, we find that $\\tau_\\mathrm{mem}$ increases linearly with the training set size $n$, while $\\tau_\\mathrm{gen}$ remains constant. This creates a growing window of training times with $n$ where models generalize effectively, despite showing strong memorization if training continues beyond it. It is only when $n$ becomes larger than a model-dependent threshold that overfitting disappears at infinite training times.\nThese findings reveal a form of implicit dynamical regularization in the training dynamics, which allow to avoid memorization even in highly overparameterized settings. Our results are supported by numerical experiments with standard U-Net architectures on realistic and synthetic  datasets, and by a theoretical analysis using a tractable random features model studied in the high-dimensional limit.",
      "keywords": [
        "Diffusion Models",
        "Deep Learning",
        "Probabilistic Methods"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "cIGfKdfy3N",
      "title": "Learning Diffusion Models with Flexible Representation Guidance",
      "abstract": "Diffusion models can be improved with additional guidance towards more effective representations of input. Indeed, prior empirical work has already shown that aligning internal representations of the diffusion model with those of pre-trained models improves generation quality. In this paper, we present a systematic framework for incorporating representation guidance into diffusion models. We provide alternative decompositions of denoising models along with their associated training criteria, where the decompositions determine when and how the auxiliary representations are incorporated. Guided by our theoretical insights, we introduce two new strategies for enhancing representation alignment in diffusion models. First, we pair examples with target representations either derived from themselves or arisen from different synthetic modalities, and subsequently learn a joint model over the multimodal pairs. Second, we design an optimal training curriculum that balances representation learning and data generation. Our experiments across image, protein sequence, and molecule generation tasks demonstrate superior performance as well as accelerated training. In particular, on the class-conditional ImageNet $256\\times 256$ benchmark, our guidance results in $23.3$ times faster training than the original SiT-XL as well as four times speedup over the state-of-the-art method REPA.",
      "keywords": [
        "Diffusion Models",
        "Representation Learning",
        "Image Generation",
        "Biomolecule Generation",
        "Theoretical Analysis of Generative Model"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "3b9SKkRAKw",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "3b9SKkRAKw",
      "title": "LeFusion: Controllable Pathology Synthesis via Lesion-Focused Diffusion Models",
      "abstract": "Patient data from real-world clinical practice often suffers from data scarcity and long-tail imbalances, leading to biased outcomes or algorithmic unfairness. This study addresses these challenges by generating lesion-containing image-segmentation pairs from lesion-free images. Previous efforts in medical imaging synthesis have struggled with separating lesion information from background, resulting in low-quality backgrounds and limited control over the synthetic output. Inspired by diffusion-based image inpainting, we propose LeFusion, a lesion-focused diffusion model. By redesigning the diffusion learning objectives to focus on lesion areas, we simplify the learning process and improve control over the output while preserving high-fidelity backgrounds by integrating forward-diffused background contexts into the reverse diffusion process. Additionally, we tackle two major challenges in lesion texture synthesis: 1) multi-peak and 2) multi-class lesions. We introduce two effective strategies: histogram-based texture control and multi-channel decomposition, enabling the controlled generation of high-quality lesions in difficult scenarios. Furthermore, we incorporate lesion mask diffusion, allowing control over lesion size, location, and boundary, thus increasing lesion diversity. Validated on 3D cardiac lesion MRI and lung nodule CT datasets, LeFusion-generated data significantly improves the performance of state-of-the-art segmentation models, including nnUNet and SwinUNETR.",
      "keywords": [
        "data synthesis",
        "diffusion models",
        "cardiac MRI",
        "lung nodule CT",
        "segmentation"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "NJxCpMt0sf",
      "title": "Dynamic Modeling of Patients, Modalities and Tasks via Multi-modal Multi-task Mixture of Experts",
      "abstract": "Multi-modal multi-task learning holds significant promise in tackling complex diagnostic tasks and many significant medical imaging problems. It fulfills the needs in real-world diagnosis protocol to leverage information from different data sources and simultaneously perform mutually informative tasks. However, medical imaging domains introduce two key challenges: dynamic modality fusion and modality-task dependence. The quality and amount of task-related information from different modalities could vary significantly across patient samples, due to biological and demographic factors. Traditional fusion methods apply fixed combination strategies that fail to capture this dynamic relationship, potentially underutilizing modalities that carry stronger diagnostic signals for specific patients. Additionally, different clinical tasks may require dynamic feature selection and combination from various modalities, a phenomenon we term “modality-task dependence.” To address these issues, we propose M4oE, a novel Multi-modal Multi-task Mixture of Experts framework for precise Medical diagnosis. M4oE comprises Modality-Specific (MSoE) modules and a Modality-shared Modality-Task MoE (MToE) module. With collaboration from both modules, our model dynamically decomposes and learns distinct and shared information from different modalities and achieves dynamic fusion. MToE provides a joint probability model of modalities and tasks by using experts as a link and encourages experts to learn modality-task dependence via conditional mutual information loss. By doing so, M4oE offers sample and population-level interpretability of modality contributions. We evaluate M4oE on four public multi-modal medical benchmark datasets for solving two important medical diagnostic problems including breast cancer screening and retinal disease diagnosis. Results demonstrate our method's superiority over state-of-the-art methods under different metrics of classification and segmentation tasks like Accuracy, AUROC, AUPRC, and DICE.",
      "keywords": [
        "Multimodal Learning",
        "Medical Imaging"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "J2Jyp1SZ0n",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "8QTpYC4smR",
      "title": "Systematic Review of Large Language Models: Applications, Limitations, Practical Usages and Future Directions",
      "abstract": "Large Language Models have revolutionized natural language processing with their remarkable ability to understand and generate human-like text. This review explores the various applications of large language models, highlighting their versatility across different domains. The paper begins with an introduction to LLMs, followed by an overview of their types and a detailed literature review. We then examine their limitations before delving into specific applications such as text generation, translation, summarization, and more. Finally, we discuss future directions for research and development, concluding with a summary of key findings and the potential impact of large language models on various industries.",
      "keywords": [
        "Large Language Models",
        "Systematic Review"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "J2Jyp1SZ0n",
      "title": "MMSearch: Unveiling the Potential of Large Models as Multi-modal Search Engines",
      "abstract": "The advent of Large Language Models (LLMs) has paved the way for AI search engines, e.g., SearchGPT, showcasing a new paradigm in human-internet interaction. However, most current AI search engines are limited to text-only settings, neglecting the multimodal user queries and the text-image interleaved nature of website information. Recently, Large Multimodal Models (LMMs) have made impressive strides. Yet, whether they can function as AI search engines remains under-explored, leaving the potential of LMMs in multimodal search an open question. To this end, we first design a delicate pipeline, MMSearch-Engine, to empower any LMMs with multimodal search capabilities. On top of this, we introduce MMSearch, a comprehensive evaluation benchmark to assess the multimodal search performance of LMMs. The curated dataset contains 300 manually collected instances spanning 14 subfields, which involves no overlap with the current LMMs' training data, ensuring the correct answer can only be obtained within searching. By using MMSearch-Engine, the LMMs are evaluated by performing three individual tasks (requery, rerank, and summarization), and one challenging end-to-end task with a complete searching process. We conduct extensive experiments on closed-source and open-source LMMs. Among all tested models, GPT-4o with MMSearch-Engine achieves the best results, which surpasses the commercial product, Perplexity Pro, in the end-to-end task, demonstrating the effectiveness of our proposed pipeline. We further present error analysis to unveil current LMMs still struggle to fully grasp the multimodal search tasks, and conduct ablation study to indicate the potential of scaling test-time computation for AI search engine. We hope MMSearch may provide unique insights to guide the future development of multimodal AI search engine.",
      "keywords": [
        "Large Multimodal Model",
        "AI Search Engine",
        "Benchmark"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "QN0E0KX2LM",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "QN0E0KX2LM",
      "title": "Learning Linear Attention in Polynomial Time",
      "abstract": "Previous research has explored the expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the efficient learnability of Transformers from data has remained an open question.  Our study addresses this gap by providing the first polynomial-time learnability results (specifically strong, agnostic PAC learning) for single-layer Transformers with linear attention.  We show that learning the optimal multi head linear attention can be recast as finding the optimal kernel predictor in a suitably defined RKHS.  Moving to generalization, we construct an algorithm that, given a dataset, checks in polynomial time whether the set of best fit multi head linear attention networks on this data all perform an identical computation--a powerful notion for out of distribution generalization.  We empirically validate our theoretical findings on several canonical tasks: learning random linear attention networks, key--value associations, and learning to execute finite automata. Our findings bridge a critical gap between theoretical expressivity and learnability of Transformer models.",
      "keywords": [
        "Transformers",
        "Learning Theory",
        "PAC learning"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "RBWnyDEBKf",
      "title": "Constant Bit-size Transformers Are Turing Complete",
      "abstract": "We prove that any Turing machine running on inputs of arbitrary length can be simulated by a constant bit-size transformer, as long as the context window is sufficiently long. This improves previous works, which require scaling up either the model's precision or the number of parameters on longer inputs. Furthermore, we prove that the complexity class SPACE$[s(n)]$ exactly characterizes the expressive power of a constant bit-size transformer with a context window of length $s(n)$. Our approach relies on simulating Post machines, a Turing-complete computational model. Post machines can be modeled as automata equipped with a queue, exhibiting computational behaviors naturally aligned with those of transformers. The behavioral similarity between transformers and Post machines may offer new insights into the mechanisms underlying the reasoning abilities of transformers.",
      "keywords": [
        "Transformer",
        "Turing complete",
        "Post machine",
        "context window length",
        "space complexity"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "Ceb788Uigr",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "2Ri68h7bD1",
      "title": "Dale's Law Meets Geometric Brownian Motion: Multiplicative Updates for Sampling",
      "abstract": "Gradient descent has proven to be a powerful and effective technique for optimization in numerous machine learning applications. Recent advances in computational neuroscience have shown that learning in standard gradient descent optimization formulation is not consistent with learning in biological systems. This has opened up interesting avenues for building biologically inspired learning techniques. One such approach is inspired by Dale's law, which states that inhibitory and excitatory synapses do not swap roles during the course of learning. The resulting exponential gradient descent optimization scheme leads to log-normally distributed synaptic weights. Interestingly, the density that satisfies the Fokker-Planck equation corresponding to the stochastic differential equation (SDE) with geometric Brownian motion (GBM) is the log-normal density. Leveraging this connection, we start with the SDE governing geometric Brownian motion, and show that discretizing the corresponding reverse-time SDE yields a multiplicative update rule, which surprisingly, coincides with the sampling equivalent of the exponential gradient descent update founded on Dale's law. Proceeding further, we propose a new formalism for multiplicative denoising score-matching, which subsumes the loss function proposed by Hyvaerinen for non-negative data. Indeed, log-normally distributed data is positive and the proposed score-matching formalism turns out to be a natural fit. This allows for training of score-based models for image data and results in a novel multiplicative update scheme for sample generation starting from a log-normal density. Experimental results on MNIST, Fashion MNIST, and Kuzushiji datasets demonstrate generative capability of the new scheme. To the best of our knowledge, this is the first instance of a biologically inspired generative model employing multiplicative updates, founded on geometric Brownian motion.",
      "keywords": [
        "Dale's Law",
        "Geometric Brownian Motion",
        "Stochastic Differential Equations",
        "Score-matching"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "Ceb788Uigr",
      "title": "On the Convergence of Stochastic Smoothed Multi-Level Compositional Gradient Descent Ascent",
      "abstract": "Multi-level compositional optimization is a fundamental framework in machine learning with broad applications. While recent advances have addressed compositional minimization problems, the stochastic multi-level compositional minimax problem introduces significant new challenges—most notably, the biased nature of stochastic gradients for both the primal and dual variables. In this work, we address this gap by proposing a novel stochastic multi-level compositional gradient descent-ascent algorithm, incorporating a smoothing technique under the nonconvex-PL condition. We establish a convergence rate to an $(\\epsilon, \\epsilon/\\sqrt{\\kappa})$-stationary point with improved dependence on the condition number at $O(\\kappa^{3/2})$, where $\\epsilon$ denotes the solution accuracy and $\\kappa$ represents the condition number. Moreover,  we design a novel stage-wise algorithm with variance reduction to address the  biased gradient issue under the two-sided PL condition. This algorithm successfully enables a translation from and $(\\epsilon, \\epsilon/\\sqrt{\\kappa})$-stationary point to an $\\epsilon$-stationary point. Finally, extensive experiments validate the effectiveness of our algorithms.",
      "keywords": [
        "Compositional Optimization",
        "Minimax Optimization"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "6VoDizmIoY",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "6VoDizmIoY",
      "title": "H3D-DGS: Exploring Heterogeneous 3D Motion Representation for Deformable 3D Gaussian Splatting",
      "abstract": "Dynamic scene reconstruction poses a persistent challenge in 3D vision. Deformable 3D Gaussian Splatting has emerged as an effective method for this task, offering real-time rendering and high visual fidelity.\nThis approach decomposes a dynamic scene into a static representation in a canonical space and time-varying scene motion.\nScene motion is defined as the collective movement of all Gaussian points, and for compactness, existing approaches commonly adopt implicit neural fields or sparse control points. \nHowever, these methods predominantly rely on gradient-based optimization for all motion information. Due to the high degree of freedom, they struggle to converge on real-world datasets exhibiting complex motion.\nTo preserve the compactness of motion representation and address convergence challenges, this paper proposes heterogeneous 3D control points, termed \\textbf{H3D control points}, whose attributes are obtained using a hybrid strategy combining optical flow back-projection and gradient-based methods. \nThis design decouples directly observable motion components from those that are geometrically occluded.\nSpecifically, components of 3D motion that project onto the image plane are directly acquired via optical flow back projection, while unobservable portions are refined through gradient-based optimization.\nExperiments on the Neu3DV and CMU-Panoptic datasets demonstrate that our method achieves superior performance over state-of-the-art deformable 3D Gaussian splatting techniques. Remarkably, our method converges within just 100 iterations and achieves a per-frame processing speed of 2 seconds on a single NVIDIA RTX 4070 GPU.",
      "keywords": [
        "3DGS",
        "3D motion representation",
        "dynamic scene reconstruction",
        "streaming reconstruction"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "shFhW4zqd6",
      "title": "EF-3DGS: Event-Aided Free-Trajectory 3D Gaussian Splatting",
      "abstract": "Scene reconstruction from casually captured videos has wide real-world applications. Despite recent progress, existing methods relying on traditional cameras tend to fail in high-speed scenarios due to insufficient observations and inaccurate pose estimation. Event cameras, inspired by biological vision, record pixel-wise intensity changes asynchronously with high temporal resolution and low latency, providing valuable scene and motion information in blind inter-frame intervals. In this paper, we introduce the event cameras to aid scene construction from a casually captured video for the first time, and propose Event-Aided Free-Trajectory 3DGS, called EF-3DGS, which seamlessly integrates the advantages of event cameras into 3DGS through three key components. First, we leverage the Event Generation Model (EGM) to fuse events and frames, enabling continuous supervision between discrete frames. Second, we extract motion information through Contrast Maximization (CMax) of warped events, which calibrates camera poses and provides gradient-domain constraints for 3DGS. Third, to address the absence of color information in events, we combine photometric bundle adjustment (PBA) with a Fixed-GS training strategy that separates structure and color optimization, effectively ensuring color consistency across different views. We evaluate our method on the public Tanks and Temples benchmark and a newly collected real-world dataset, RealEv-DAVIS. Our method achieves up to 3dB higher PSNR and 40% lower Absolute Trajectory Error (ATE) compared to state-of-the-art methods under challenging high-speed scenarios.",
      "keywords": [
        "3D Gaussian Splatting (3DGS); Novel View Synthesis (NVS)"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "qXAABCxYQ2",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "5aeD5UbiLv",
      "title": "GIST: Greedy Independent Set Thresholding for Max-Min Diversification with Submodular Utility",
      "abstract": "This work studies a novel subset selection problem called *max-min diversification with monotone submodular utility* (MDMS), which has a wide range of applications in machine learning, e.g., data sampling and feature selection.\nGiven a set of points in a metric space,\nthe goal of MDMS is to maximize $f(S) = g(S) + \\lambda \\cdot \\text{div}(S)$\nsubject to a cardinality constraint $|S| \\le k$,\nwhere\n$g(S)$ is a monotone submodular function\nand\n$\\text{div}(S) = \\min_{u,v \\in S : u \\ne v} \\text{dist}(u,v)$ is the *max-min diversity* objective.\nWe propose the `GIST` algorithm, which gives a $\\frac{1}{2}$-approximation guarantee for MDMS\nby approximating a series of maximum independent set problems with a bicriteria greedy algorithm.\nWe also prove that it is NP-hard to approximate within a factor of $0.5584$.\nFinally, we show in our empirical study that `GIST` outperforms state-of-the-art benchmarks\nfor a single-shot data sampling task on ImageNet.",
      "keywords": [
        "approximation algorithm",
        "submodular maximization",
        "max-min diversification",
        "data sampling",
        "subset selection"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "qXAABCxYQ2",
      "title": "Which Algorithms Have Tight Generalization Bounds?",
      "abstract": "We study which machine learning algorithms have tight generalization bounds with respect to a given collection of population distributions. Our results build on and extend the recent work of Gastpar et al. (2023). First, we present conditions that preclude the existence of tight generalization bounds. Specifically, we show that algorithms that have certain inductive biases that cause them to be unstable do not admit tight generalization bounds. Next, we show that algorithms that are sufficiently loss-stable do have tight generalization bounds.  We conclude with a simple characterization that relates the existence of tight generalization bounds to the conditional variance of the algorithm's loss.",
      "keywords": [
        "learning theory"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "MsUhByb3CM",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "MsUhByb3CM",
      "title": "Extracting Symbolic Sequences from Visual Representations via Self-Supervised Learning",
      "abstract": "In this paper, we explore the potential of abstracting complex visual information into discrete, structured symbolic sequences using self-supervised learning (SSL). Inspired by how language abstracts and organizes information to enable better reasoning and generalization, we propose a novel approach for generating symbolic representations from visual data. To learn these sequences, we extend the DINO framework to handle both visual and symbolic information. Initial experiments suggest that the generated symbolic sequences capture a meaningful level of abstraction, though further refinement is required. An advantage of our method is its interpretability: the sequences are produced by a decoder transformer using cross-attention, allowing attention maps to be linked to specific symbols and offering insight into how these representations correspond to image regions. This approach lays the foundation for creating interpretable symbolic representations with potential applications in high-level scene understanding.",
      "keywords": [
        "Self-Supervised Learning",
        "Symbolic Representations",
        "Information Theory",
        "Knowledge Distillation",
        "Visual Abstraction",
        "Interpretability"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "md9qolJwLl",
      "title": "From Tokens to Lattices: Emergent Lattice Structures in Language Models",
      "abstract": "Pretrained masked language models (MLMs) have demonstrated an impressive capability to comprehend and encode conceptual knowledge, revealing a lattice structure among concepts. This raises a critical question: how does this conceptualization emerge from MLM pretraining? In this paper, we explore this problem from the perspective of Formal Concept Analysis (FCA), a mathematical framework that derives concept lattices from the observations of object-attribute relationships. We show that the MLM's objective implicitly learns a formal context that describes objects, attributes, and their dependencies, which enables the reconstruction of a concept lattice through FCA. We propose a novel framework for concept lattice construction from pretrained MLMs and investigate the origin of the inductive biases of MLMs in lattice structure learning. Our framework differs from previous work because it does not rely on human-defined concepts and allows for discovering \"latent\" concepts that extend beyond human definitions. We create three datasets for evaluation, and the empirical results verify our hypothesis.",
      "keywords": [
        "Masked Language Models",
        "Formal Concept Analysis",
        "Interpretability"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "SG1R2H3fa1",
    "difficulty": "difficult",
    "pair_type": "reject-poster",
    "paper_a": {
      "paper_id": "0Th6bCZwKt",
      "title": "Gaussian Mixture Models Based Augmentation Enhances GNN Generalization",
      "abstract": "Graph Neural Networks (GNNs) have shown great promise in many learning tasks, notably including node and graph classification, but they face difficulties when tested on new or unseen data. These challenges are exacerbated when training data is limited in size or diversity. To address this issue, we introduce a theoretical framework using Rademacher complexity to compute a regret bound on the generalization error and then characterize the effect of data augmentation. This framework informs the design of GMM-GDA, a new, efficient graph data augmentation (GDA) algorithm leveraging the capability of Gaussian Mixture Models (GMMs) to approximate any distribution. Our approach not only outperforms existing augmentation techniques but also offers improved time complexity, making it highly suitable for real-world applications.",
      "keywords": [
        "Graph Neural Networks",
        "Data Augmentation"
      ],
      "decision": "Reject",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "8sSqNntaMr",
      "title": "RouteLLM: Learning to Route LLMs from Preference Data",
      "abstract": "Large language models (LLMs) excel at a wide range of tasks, but choosing the right model often involves balancing performance and cost. Powerful models offer better results but are expensive, while smaller models are more cost-effective but less capable. To address this trade-off, we introduce a training framework for learning efficient router models that dynamically select between a stronger and weaker LLM during inference. Our framework leverages human preference data and employs data augmentation techniques to enhance performance. Evaluations on public benchmarks show that our approach can reduce costs by over 2 times without sacrificing response quality. Moreover, our routers exhibit strong generalization capabilities, maintaining performance even when routing between LLMs not included in training. This highlights the potential of our framework to deliver cost-effective, high-performance LLM solutions.",
      "keywords": [
        "Large language models",
        "query routing"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "FZURCro04D",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "853SwC2dMZ",
      "title": "Understanding LLM Behaviors via Compression: Data Generation, Knowledge Acquisition and Scaling Laws",
      "abstract": "Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous tasks, yet principled explanations for their underlying mechanisms and several phenomena, such as scaling laws, hallucinations, and related behaviors, remain elusive. In this work, we revisit the classical relationship between compression and prediction, grounded in Kolmogorov complexity and Shannon information theory, to provide deeper insights into LLM behaviors. By leveraging the Kolmogorov Structure Function and interpreting LLM compression as a two-part coding process, we offer a detailed view of how LLMs acquire and store information across increasing model and data scales -- from pervasive syntactic patterns to progressively rarer knowledge elements. Motivated by this theoretical perspective and natural assumptions inspired by Heap’s and Zipf’s laws, we introduce a simplified yet representative hierarchical data-generation framework called the Syntax-Knowledge model. Under the Bayesian setting, we show that prediction and compression within this model naturally lead to diverse learning and scaling behaviors of LLMs. In particular, our theoretical analysis offers intuitive and principled explanations for both data and model scaling laws, the dynamics of knowledge acquisition during training and fine-tuning, factual knowledge hallucinations in LLMs. The experimental results validate our theoretical predictions.",
      "keywords": [
        "Large Language Model",
        "Scaling Law",
        "Information Theory"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "b7uniOw0sZ",
      "title": "Which Data Attributes Stimulate Math and Code Reasoning? An Investigation via Influence Functions",
      "abstract": "Large language models (LLMs) have demonstrated remarkable reasoning capabilities in math and coding, often bolstered by post-training on the chain-of-thoughts (CoTs) generated by stronger models. \nHowever, existing strategies for curating such training data predominantly rely on heuristics, limiting generalizability and failing to capture subtleties underlying in data. \nTo address these limitations, we leverage influence functions to systematically attribute LLMs' reasoning ability on math and coding to individual training examples, sequences, and tokens, enabling deeper insights into effective data characteristics.\nOur Influence-based Reasoning Attribution (Infra) uncovers nontrivial cross-domain effects across math and coding tasks: high-difficulty math examples improve both math and code reasoning, while low-difficulty code tasks most effectively benefit code reasoning.\nBased on these findings, we introduce a simple yet effective dataset reweighting strategy by flipping task difficulty, which doubles AIME24 accuracy from 10\\% to 20\\% and boosts LiveCodeBench accuracy from 33.8\\% to 35.3\\% for Qwen2.5-7B-Instruct.\nMoreover, our fine-grained attribution reveals that the sequence-level exploratory behaviors enhance reasoning performance in both math and code, and the token-level influence patterns are distinct for math and code reasoning: the former prefers natural language logic connectors and the latter emphasizes structural syntax.",
      "keywords": [
        "influence functions",
        "llm reasoning"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "RPRqKhjrr6",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "RPRqKhjrr6",
      "title": "Checklists Are Better Than Reward Models For Aligning Language Models",
      "abstract": "Language models must be adapted to understand and follow user instructions. Reinforcement learning is widely used to facilitate this —typically using fixed criteria such as \"helpfulness\" and \"harmfulness\". In our work, we instead propose using flexible, instruction-specific criteria as a means of broadening the impact that reinforcement learning can have in eliciting instruction following. We propose \"Reinforcement Learning from Checklist Feedback\" (RLCF). From instructions, we extract checklists and evaluate how well responses satisfy each item—using both AI judges and specialized verifier programs—then combine these scores to compute rewards for RL. We compare RLCF with other alignment methods on top of a strong instruction following model (Qwen2.5-7B-Instruct) on five widely-studied benchmarks — RLCF is the only method to help on every benchmark, including a 4-point boost in hard satisfaction rate on FollowBench, a 6-point increase on InFoBench, and a 3-point rise in win rate on Arena-Hard. We show that RLCF can also be used off-policy to improve Llama 3.1 8B Instruct and OLMo 2 7B Instruct. These results establish rubrics as a key tool for improving language models' support of queries that express a multitude of needs. We release our our dataset of rubrics (WildChecklists), models, and code to the public.",
      "keywords": [
        "alignment",
        "rubrics",
        "instruction following",
        "RLHF"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "oN5YVZ9JeF",
      "title": "T-SHIRT: Token-Selective Hierarchical Data Selection for Instruction Tuning",
      "abstract": "Instruction tuning is essential for Large Language Models (LLMs) to effectively follow user instructions. To improve training efficiency and reduce data redundancy, recent works use LLM-based scoring functions, e.g., Instruction-Following Difficulty (IFD), to select high–quality instruction-tuning data with scores above a threshold. While these data selection methods often lead to models that can match or even exceed the performance of models trained on the full datasets, we identify two key limitations: (i) they assess quality at the sample level, ignoring token-level informativeness; and (ii) they overlook the robustness of the scoring method, often selecting a sample due to superficial lexical features instead of its true quality. In this work, we propose Token-Selective HIeRarchical Data Selection for Instruction Tuning (T-SHIRT), a novel data selection framework that introduces a new scoring method to include only informative tokens in quality evaluation and also promote robust and reliable samples whose neighbors also show high quality with less local inconsistencies. We demonstrate that models instruction-tuned on a curated dataset (only 5% of the original size) using T-SHIRT can outperform those trained on the entire large-scale dataset by up to 5.48 points on average across eight benchmarks. Across various LLMs and training set scales, our method consistently surpasses existing state-of-the-art data selection techniques, while also remaining both cost-effective and highly efficient. For instance, by using GPT-2 for score computation, we are able to process a dataset of 52k samples in 40 minutes on a single GPU. Our code is available at https://github.com/Dynamite321/T-SHIRT.",
      "keywords": [
        "Large Language Models",
        "Instruction tuning",
        "Data Selection",
        "Token-selective Quality Score",
        "Robust Hierarchical Selection"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  },
  {
    "group_id": "GqGoa44obw",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "WWXjMYZxfH",
      "title": "MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions",
      "abstract": "Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it challenging for the model to discern which actions contributed to preferred outcomes. This hinders learning efficiency and slows convergence.In this paper, we propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions --- sequences of tokens or higher-level language constructs --- into the learning process. By operating at higher level of abstraction, our approach reduces the temporal distance between actions and rewards, facilitating faster and more accurate credit assignment. This results in more stable policy gradient estimates and enhances learning efficiency within each episode, all without increasing computational complexity during training or inference. We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. Our method achieves substantial performance improvements over standard RLHF, with performance gains of up to 30\\% in text summarization and code generation, 18\\% in dialogue, and 8\\% in question answering tasks. Notably, our approach reaches parity with vanilla RLHF $1.7 \\sim 2$ times faster in terms of training time and continues to outperform it with further training. We make our code and data publicly available at \\url{https://github.com/ernie-research/MA-RLHF}.",
      "keywords": [
        "Human Alignment",
        "Large Language Models",
        "Reinforcement Learning"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "cfKZ5VrhXt",
      "title": "Online Preference Alignment for Language Models via Count-based Exploration",
      "abstract": "Reinforcement Learning from Human Feedback (RLHF) has shown great potential in fine-tuning Large Language Models (LLMs) to align with human preferences. Existing methods perform preference alignment from a fixed dataset, which can be limited in data coverage and the resulting reward model is hard to generalize in out-of-distribution responses. Thus, online RLHF is more desirable to empower the LLM to explore outside the support of the initial dataset by iteratively collecting the prompt-response pairs. In this paper, we study the fundamental problem in online RLHF, i.e., how to explore for LLM. We give a theoretical motivation in linear reward assumption to show that an optimistic reward with an upper confidence bound (UCB) term leads to a provably efficient RLHF policy. Then, we reformulate our objective to direct preference optimization with an exploration term, where the UCB-term can be converted to a count-based exploration bonus. We further propose a practical algorithm, named Count-based Online Preference Optimization (COPO), which leverages a simple coin-flip counting module to estimate the pseudo-count of a prompt-response pair in previously collected data. COPO encourages LLMs to balance exploration and preference optimization in an iterative manner, which enlarges the exploration space and the entire data coverage of iterative LLM policies. We conduct online RLHF experiments on Zephyr and Llama-3 models. The results on instruction-following and standard academic benchmarks show that COPO significantly increases performance.",
      "keywords": [
        "Reinforcement Learning from Human Feedback",
        "RLHF",
        "Preference Alignment",
        "Exploration",
        "LLMs"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    }
  },
  {
    "group_id": "00SnKBGTsz",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "9RCT0ngvZP",
      "title": "Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning",
      "abstract": "Synthetic data has been widely used to train large language models, but their generative nature inevitably introduces noisy, non-informative, and misleading learning signals. In this paper, we propose Montessori-Instruct, a novel data synthesis framework that tailors the data synthesis ability of the teacher language model toward the student language model's learning process. Specifically, we utilize local data influence of synthetic training data points on students to characterize students' learning preferences. Then, we train the teacher model with Direct Preference Optimization (DPO) to generate synthetic data tailored toward student learning preferences. Experiments with Llama3-8B-Instruct (teacher) and Llama3-8B (student) on Alpaca Eval and MT-Bench demonstrate that Montessori-Instruct significantly outperforms standard synthesis methods by 18.35\\% and 46.24\\% relatively. Our method also beats data synthesized by a stronger teacher model, GPT-4o. Further analysis confirms the benefits of teacher's learning to generate more influential training data in the student's improved learning, the advantages of local data influence in accurately measuring student preferences, and the robustness of Montessori-Instruct across different student models. Our code and data are open-sourced at https://github.com/cxcscmu/Montessori-Instruct.",
      "keywords": [
        "synthetic data",
        "data influence",
        "instruction tuning"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "YUYJsHOf3c",
      "title": "ReGenesis: LLMs can Grow into Reasoning Generalists via Self-Improvement",
      "abstract": "Post-training Large Language Models (LLMs) with explicit reasoning trajectories can enhance their reasoning abilities. However, acquiring such high-quality trajectory data typically demands meticulous supervision from humans or superior models, which can be either expensive or license-constrained. In this paper, we explore how far an LLM can improve its reasoning by self-synthesizing reasoning paths as training data without any additional supervision. Existing self-synthesizing methods, such as STaR, suffer from poor generalization to out-of-domain (OOD) reasoning tasks. We hypothesize it is due to that their self-synthesized reasoning paths are too task-specific, lacking general task-agnostic reasoning guidance. To address this, we propose **Reasoning Generalist via Self-Improvement (ReGenesis)**, a method to *self-synthesize reasoning paths as post-training data by progressing from abstract to concrete*. More specifically, ReGenesis self-synthesizes reasoning paths by converting general reasoning guidelines into task-specific ones, generating reasoning structures, and subsequently transforming these structures into reasoning paths, without the need for human-designed task-specific examples used in existing methods. We show that ReGenesis achieves superior performance on all in-domain and OOD settings tested compared to existing methods. For six OOD tasks specifically, while previous methods exhibited an average performance decrease of approximately 4.6% after post training, ReGenesis delivers around 6.1% performance improvement. We also conduct an in-depth analysis of our framework and show ReGenesis is effective across various language models and design choices.",
      "keywords": [
        "LLM",
        "reasoning",
        "generalization",
        "self-improvement"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "GEBkyKZOc4",
    "difficulty": "difficult",
    "pair_type": "poster-oral",
    "paper_a": {
      "paper_id": "GEBkyKZOc4",
      "title": "Rational Decision-Making Agent with Learning Internal Utility Judgment",
      "abstract": "With remarkable advancements, large language models (LLMs) have attracted significant efforts to develop LLM-based agents capable of executing intricate multi-step decision-making tasks. Existing approaches predominantly build upon the external performance measure to guide the decision-making process but the reliance on the external performance measure as prior is problematic in real-world scenarios, where such prior may be unavailable, flawed, or even erroneous. For genuine autonomous decision-making for LLM-based agents, it is imperative to develop rationality from their posterior experiences to judge the utility of each decision independently. In this work, we propose RaDAgent (Rational Decision-Making Agent), which fosters the development of its rationality through an iterative framework involving Experience Exploration and Utility Learning. Within this framework, Elo-based Utility Learning is devised to assign Elo scores to individual decision steps to judge their utilities via pairwise comparisons. Consequently, these Elo scores guide the decision-making process to derive optimal outcomes. Experimental results on the Game of 24, WebShop, ToolBench and RestBench datasets demonstrate RaDAgent’s superiority over baselines, achieving about 7.8% improvement on average. Besides, RaDAgent also can reduce costs (ChatGPT API calls), highlighting its effectiveness and efficiency.",
      "keywords": [
        "Decision Making",
        "Autonomous Agent",
        "Large Lanugage Model",
        "Elo Rating"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "mMPMHWOdOy",
      "title": "WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct",
      "abstract": "Large language models (LLMs), such as GPT-4, have shown remarkable performance in natural language processing (NLP) tasks, including challenging mathematical reasoning. However, most existing open-source models are only pre-trained on large-scale internet data and without math-related optimization. In this paper, we present WizardMath, which enhances the mathematical reasoning abilities of LLMs, by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math. Through extensive experiments on two mathematical reasoning benchmarks, namely GSM8k and MATH, we reveal the extraordinary capabilities of our model. Remarkably, WizardMath-Mistral 7B surpasses all other open-source LLMs by a substantial margin. Furthermore, WizardMath 70B even outperforms ChatGPT-3.5, Claude Instant, Gemini Pro and Mistral Medium. Additionally, our preliminary exploration highlights the pivotal role of instruction evolution and process supervision in achieving exceptional math performance.",
      "keywords": [
        "Mathematical Reasoning",
        "Evol-Instruct",
        "Reinforcement Learning"
      ],
      "decision": "Accept (Oral)",
      "year": "2025"
    }
  },
  {
    "group_id": "cR5GTis5II",
    "difficulty": "difficult",
    "pair_type": "poster-spotlight",
    "paper_a": {
      "paper_id": "Dzh0hQPpuf",
      "title": "Student-Informed Teacher Training",
      "abstract": "Imitation learning with a privileged teacher has proven effective for learning complex control behaviors from high-dimensional inputs, such as images. In this framework, a teacher is trained with privileged task information, while a student tries to predict the actions of the teacher with more limited observations, e.g., in a robot navigation task, the teacher might have access to distances to nearby obstacles, while the student only receives visual observations of the scene. However, privileged imitation learning faces a key challenge: the student might be unable to imitate the teacher's behavior due to partial observability. This problem arises because the teacher is trained without considering if the student is capable of imitating the learned behavior. To address this teacher-student asymmetry, we propose a framework for joint training of the teacher and student policies, encouraging the teacher to learn behaviors that can be imitated by the student despite the latters' limited access to information and its partial observability. Based on the performance bound in imitation learning, we add (i) the approximated action difference between teacher and student as a penalty term to the reward function of the teacher, and (ii) a supervised teacher-student alignment step. We motivate our method with a maze navigation task and demonstrate its effectiveness on complex vision-based quadrotor flight and manipulation tasks.",
      "keywords": [
        "Reinforcement Learning",
        "Imitation Learning",
        "Robotics"
      ],
      "decision": "Accept (Spotlight)",
      "year": "2025"
    },
    "paper_b": {
      "paper_id": "cR5GTis5II",
      "title": "eQMARL: Entangled Quantum Multi-Agent Reinforcement Learning for Distributed Cooperation over Quantum Channels",
      "abstract": "Collaboration is a key challenge in distributed multi-agent reinforcement learning (MARL) environments. Learning frameworks for these decentralized systems must weigh the benefits of explicit player coordination against the communication overhead and computational cost of sharing local observations and environmental data. Quantum computing has sparked a potential synergy between quantum entanglement and cooperation in multi-agent environments, which could enable more efficient distributed collaboration with minimal information sharing. This relationship is largely unexplored, however, as current state-of-the-art quantum MARL (QMARL) implementations rely on classical information sharing rather than entanglement over a quantum channel as a coordination medium. In contrast, in this paper, a novel framework dubbed entangled QMARL (eQMARL) is proposed. The proposed eQMARL is a distributed actor-critic framework that facilitates cooperation over a quantum channel and eliminates local observation sharing via a quantum entangled split critic. Introducing a quantum critic uniquely spread across the agents allows coupling of local observation encoders through entangled input qubits over a quantum channel, which requires no explicit sharing of local observations and reduces classical communication overhead. Further, agent policies are tuned through joint observation-value function estimation via joint quantum measurements, thereby reducing the centralized computational burden. Experimental results show that eQMARL with $\\Psi^{+}$ entanglement converges to a cooperative strategy up to $17.8\\\\%$ faster and with a higher overall score compared to split classical and fully centralized classical and quantum baselines. The results also show that eQMARL achieves this performance with a constant factor of $25$-times fewer centralized parameters compared to the split classical baseline.",
      "keywords": [
        "quantum machine learning",
        "multi-agent reinforcement learning",
        "quantum entanglement"
      ],
      "decision": "Accept (Poster)",
      "year": "2025"
    }
  }
]