[
    {
        "id": "theme_054",
        "theme": "Weakly-Supervised Transition Alignment for Test-Time Emotion Recognition",
        "elaboration": "This research proposes a novel framework that merges principles from emotion transition learning (Concept 1) and test-time correlation alignment (Concept 2) to enhance emotion recognition in dynamic, real-world scenarios. By leveraging weakly supervised strategies to model emotional transitions between gestures (as in Concept 1), the framework addresses the challenge of aligning test instances with source data during inference (as in Concept 2). Specifically, we introduce a temporal alignment mechanism that encodes emotional transitions as correlation patterns, enabling the model to align high-certainty source instances with test instances during adaptation. This approach overcomes the limitations of instance-wise alignment in TTA by incorporating weak supervision through emotion mixture labels (Concept 1's innovation) and using tensor-based alignment (Concept 2's theoretical foundation). The method combines weakly supervised training for gesture transitions with test-time correlation alignment, achieving state-of-the-art performance while maintaining computational efficiency. The code and datasets are available at https://xingqunqi-lab.github.io/Emo-Transition-Gesture/ and https://github.com/youlj109/TCA.",
        "concept_original_list": [
            "Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation: Generating vivid and emotional 3D co-speech gestures is crucial for virtual avatar animation in human-machine interaction applications. While the existing methods enable generating the gestures to follow a single emotion label they overlook that long gesture sequence modeling with emotion transition is more practical in real scenes. In addition the lack of large-scale available datasets with emotional transition speech and corresponding 3D human gestures also limits the addressing of this task. To fulfill this goal we first incorporate the ChatGPT-4 and an audio inpainting approach to construct the high-fidelity emotion transition human speeches. Considering obtaining the realistic 3D pose annotations corresponding to the dynamically inpainted emotion transition audio is extremely difficult we propose a novel weakly supervised training strategy to encourage authority gesture transitions. Specifically to enhance the coordination of transition gestures w.r.t. different emotional ones we model the temporal association representation between two different emotional gesture sequences as style guidance and infuse it into the transition generation. We further devise an emotion mixture mechanism that provides weak supervision based on a learnable mixed emotion label for transition gestures. Last we present a keyframe sampler to supply effective initial posture cues in long sequences enabling us to generate diverse gestures. Extensive experiments demonstrate that our method outperforms the state-of-the-art models constructed by adapting single emotion-conditioned counterparts on our newly defined emotion transition task and datasets. Our code and dataset will be released on the project page: https://xingqunqi-lab.github.io/Emo-Transition-Gesture/",
            "Test-time Correlation Alignment: Deep neural networks often degrade under distribution shifts. Although domain adaptation offers a solution, privacy constraints often prevent access to source data, making Test-Time Adaptation (TTA)—which adapts using only unlabeled test data—increasingly attractive. However, current TTA methods still face practical challenges: (1) a primary focus on instance-wise alignment, overlooking CORrelation ALignment (CORAL) due to missing source correlations; (2) complex backpropagation operations for model updating, resulting in overhead computation and (3) domain forgetting. To address these challenges, we provide a theoretical analysis to investigate the feasibility of **T**est-time **C**orrelation **A**lignment (**TCA**), demonstrating that correlation alignment between high-certainty instances and test instances can enhance test performances with a theoretical guarantee. Based on this, we propose two simple yet effective algorithms: LinearTCA and LinearTCA+. LinearTCA applies a simple linear transformation to achieve both instance and correlation alignment without additional model updates, while LinearTCA+ serves as a plug-and-play module that can easily boost existing TTA methods. Extensive experiments validate our theoretical insights and show that TCA methods significantly outperforms baselines across various tasks, benchmarks and backbones. Notably, LinearTCA achieves higher accuracy with only 4\\% GPU memory and 0.6\\% computation time compared to the best TTA baseline. It also outperforms existing methods on CLIP over 1.86\\%. Code: https://github.com/youlj109/TCA"
        ]
    },
    {
        "id": "theme_091",
        "theme": "Geometric Manifold Integration for Real-Time SLAM and Data Modeling",
        "elaboration": "This research proposes integrating the geometric structure of symmetric positive definite (SPD) matrices with real-time SLAM systems to enhance robustness and efficiency. By leveraging the wrapped Gaussian distribution on the SPD manifold, we extend Co-SLAM's hybrid representation (hash-grid + one-blob encoding) to incorporate geometric constraints. The SPD manifold's inherent structure allows for efficient, low-dimensional parameterization of scene features, enabling real-time bundle adjustment while preserving surface coherence. The wrapped Gaussian's probabilistic framework ensures geometric consistency, addressing limitations in traditional SLAM's handling of high-frequency local features. This approach merges the efficiency of hash-grid representations with the geometric fidelity of SPD manifolds, enabling novel SLAM algorithms that dynamically adapt to complex, structured environments while maintaining high reconstruction accuracy and tracking robustness.",
        "concept_original_list": [
            "Co-SLAM: Joint Coordinate and Sparse Parametric Encodings for Neural Real-Time SLAM: We present Co-SLAM, a neural RGB-D SLAM system based on a hybrid representation, that performs robust camera tracking and high-fidelity surface reconstruction in real time. Co-SLAM represents the scene as a multi-resolution hash-grid to exploit its high convergence speed and ability to represent high-frequency local features. In addition, Co-SLAM incorporates one-blob encoding, to encourage surface coherence and completion in unobserved areas. This joint parametric-coordinate encoding enables real-time and robust performance by bringing the best of both worlds: fast convergence and surface hole filling. Moreover, our ray sampling strategy allows Co-SLAM to perform global bundle adjustment over all keyframes instead of requiring keyframe selection to maintain a small number of active keyframes as competing neural SLAM approaches do. Experimental results show that Co-SLAM runs at 10-17Hz and achieves state-of-the-art scene reconstruction results, and competitive tracking performance in various datasets and benchmarks (ScanNet, TUM, Replica, Synthetic RGBD). Project page: https://hengyiwang.github.io/projects/CoSLAM",
            "Wrapped Gaussian on the manifold of Symmetric Positive Definite Matrices: Circular and non-flat data distribution are prevalent across diverse domains of data science, yet their specific geometric structures often remain underutilized in machine learning frameworks.\nA principled approach to accounting for the underlying geometry of such data is pivotal, particularly when extending statistical models, like the pervasive Gaussian distribution.\nIn this work, we tackle those issue by focusing on the manifold of symmetric positive definite matrices, a key focus in information geometry.\nWe introduced a non-isotropic wrapped Gaussian by leveraging the exponential map, we derive theoretical properties of this distribution and propose a maximum likelihood framework for parameter estimation. Furthermore, we reinterpret established classifiers on SPD through a probabilistic lens and introduce new classifiers based on the wrapped Gaussian model.\nExperiments on synthetic and real-world datasets demonstrate the robustness and flexibility of this geometry-aware distribution, underscoring its potential to advance manifold-based data analysis.\nThis work lays the groundwork for extending classical machine learning and statistical methods to more complex and structured data."
        ]
    },
    {
        "id": "theme_063",
        "theme": "Decomposition-Driven Neural Representations for Adaptive Simulations in PDEs and Molecular Design",
        "elaboration": "This research proposes a unified framework that combines decomposition principles with neural spatial representations to enable adaptive, high-fidelity simulations of time-dependent PDEs and structured molecular systems. By decomposing complex domains (e.g., spatial fields in PDEs or molecular architectures in drug design) into modular components (e.g., spatial arms in INSR or scaffold in DecompDiff), we leverage neural networks to implicitly encode the intrinsic structure of each decomposition. This approach allows for dynamic adaptation of spatial and molecular representations through time-integration schemes, enabling efficient computation while preserving physical consistency. The framework would integrate operator-splitting techniques from PDE solvers with decomposed neural priors, allowing for hierarchical modeling of multi-scale phenomena. By treating each decomposition unit as a separate neural subnetwork, the model can dynamically adjust its spatial and molecular parameters during simulation, achieving both accuracy and adaptivity. This synthesis of decomposition and neural representation could revolutionize computational science by enabling scalable, physics-aware simulations in diverse domains, from fluid dynamics to drug discovery.",
        "concept_original_list": [
            "Implicit Neural Spatial Representations for Time-dependent PDEs: Implicit Neural Spatial Representation (INSR) has emerged as an effective representation of spatially-dependent vector fields. This work explores solving time-dependent PDEs with INSR. Classical PDE solvers introduce both temporal and spatial discretizations. Common spatial discretizations include meshes and meshless point clouds, where each degree-of-freedom corresponds to a location in space. While these explicit spatial correspondences are intuitive to model and understand, these representations are not necessarily optimal for accuracy, memory usage, or adaptivity. Keeping the classical temporal discretization unchanged (e.g., explicit/implicit Euler), we explore INSR as an alternative spatial discretization, where spatial information is implicitly stored in the neural network weights. The network weights then evolve over time via time integration. Our approach does not require any training data generated by existing solvers because our approach is the solver itself. We validate our approach on various PDEs with examples involving large elastic deformations, turbulent fluids, and multi-scale phenomena. While slower to compute than traditional representations, our approach exhibits higher accuracy and lower memory consumption. Whereas classical solvers can dynamically adapt their spatial representation only by resorting to complex remeshing algorithms, our INSR approach is intrinsically adaptive. By tapping into the rich literature of classic time integrators, e.g., operator-splitting schemes, our method enables challenging simulations in contact mechanics and turbulent flows where previous neural-physics approaches struggle. Videos and codes are available on the project page: http://www.cs.columbia.edu/cg/INSR-PDE/",
            "DecompDiff: Diffusion Models with Decomposed Priors for Structure-Based Drug Design: Designing 3D ligands within a target binding site is a fundamental task in drug discovery. Existing structured-based drug design methods treat all ligand atoms equally, which ignores different roles of atoms in the ligand for drug design and can be less efficient for exploring the large drug-like molecule space. In this paper, inspired by the convention in pharmaceutical practice, we decompose the ligand molecule into two parts, namely arms and scaffold, and propose a new diffusion model, DecompDiff, with decomposed priors over arms and scaffold. In order to facilitate the decomposed generation and improve the properties of the generated molecules, we incorporate both bond diffusion in the model and additional validity guidance in the sampling phase. Extensive experiments on CrossDocked2020 show that our approach achieves state-of-the-art performance in generating high-affinity molecules while maintaining proper molecular properties and conformational stability, with up to $-8.39$ Avg. Vina Dock score and $24.5\\%$ Success Rate. The code is provided at https://github.com/bytedance/DecompDiff"
        ]
    },
    {
        "id": "theme_199",
        "theme": "Sparse Embedding Integration with Pre-trained Models for Enhanced Adversarial Robustness",
        "elaboration": "This research proposes a novel framework that merges sparse dimensionality reduction techniques (Concept 1) with pre-trained model guided adversarial fine-tuning (Concept 2) to enhance robustness against adversarial examples. By leveraging the sparsity of sparse Johnson-Lindenstrauss transforms (Concept 1), we aim to create a more efficient and interpretable embedding space for pre-trained vision-language models. The sparsity allows for reduced computational overhead in adversarial training, enabling the model to preserve critical features while mitigating the risk of overfitting. Specifically, we design an auxiliary branch in the pre-trained model that aligns with the sparse embedding structure, ensuring adversarial examples are embedded in a space that retains the original model's generalization capabilities. This approach bridges the gap between efficient dimensionality reduction and robustness, offering a scalable solution for zero-shot adversarial robustness in large-scale pre-trained models. The novelty lies in integrating sparsity-driven embedding strategies with pre-trained model supervision to achieve both computational efficiency and adversarial resilience, addressing the challenge of balancing generalization and robustness in complex, real-world scenarios.",
        "concept_original_list": [
            "Sparse Dimensionality Reduction Revisited: The sparse Johnson-Lindenstrauss transform is one of the central techniques in dimensionality reduction. It supports embedding a set of $n$ points in $\\mathbb{R}^d$ into $m=O(\\varepsilon^{-2} \\ln n)$ dimensions while preserving all pairwise distances to within $1 \\pm \\varepsilon$. Each input point $x$ is embedded to $Ax$, where $A$ is an $m \\times d$ matrix having $s$ non-zeros per column, allowing for an embedding time of $O(s \\|x\\|_0)$. Since the sparsity of $A$ governs the embedding time, much work has gone into improving the sparsity $s$. The current state-of-the-art by Kane and Nelson (2014) shows that $s = O(\\varepsilon^{-1} \\ln n)$ suffices. This is almost matched by a lower bound of $s = \\Omega(\\varepsilon^{-1} \\ln n/\\ln(1/\\varepsilon))$ by Nelson and Nguyen (2013) for $d=\\Omega(n)$. Previous work thus suggests that we have near-optimal embeddings. In this work, we revisit sparse embeddings and present a sparser embedding for instances in which $d = n^{o(1)}$, which in many applications is realistic. Formally, our embedding achieves $s = O(\\varepsilon^{-1}(\\ln n/\\ln(1/\\varepsilon)+\\ln^{2/3}n \\ln^{1/3} d))$. We also complement our analysis by strengthening the lower bound of Nelson and Nguyen to hold also when $d \\ll n$, thereby matching the first term in our new sparsity upper bound. Finally, we also improve the sparsity of the best oblivious subspace embeddings for optimal embedding dimensionality.",
            "Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness: Large-scale pre-trained vision-language models like CLIP have demonstrated impressive performance across various tasks and exhibit remarkable zero-shot generalization capability while they are also vulnerable to imperceptible adversarial examples. Existing works typically employ adversarial training (fine-tuning) as a defense method against adversarial examples. However direct application to the CLIP model may result in overfitting compromising the model's capacity for generalization. In this paper we propose Pre-trained Model Guided Adversarial Fine-Tuning (PMG-AFT) method which leverages supervision from the original pre-trained model by carefully designing an auxiliary branch to enhance the model's zero-shot adversarial robustness. Specifically PMG-AFT minimizes the distance between the features of adversarial examples in the target model and those in the pre-trained model aiming to preserve the generalization features already captured by the pre-trained model. Extensive Experiments on 15 zero-shot datasets demonstrate that PMG-AFT significantly outperforms the state-of-the-art method improving the top-1 robust accuracy by an average of 4.99%. Furthermore our approach consistently improves clean accuracy by an average of 8.72%."
        ]
    },
    {
        "id": "theme_191",
        "theme": "Structural Representation Learning for Symptom Mitigation via Probabilistic Similarity Matrices",
        "elaboration": "This research proposes a novel framework that integrates principles from deep learning and probabilistic modeling to address the challenges of symptom mitigation in migraine treatment. By leveraging the structured similarity matrices used in the Pure Square Euclidean Distance (PSED) loss function, we aim to develop a method that quantifies the discriminability of prodromal symptoms (e.g., photophobia, fatigue) in real-time. The PSED's ability to mitigate random consistency—akin to the variability in symptom resolution observed in the PRODROME trial—suggests a pathway to model symptom relationships as a tensor network, where each node represents a symptom and edges encode their interdependencies. This approach would enable dynamic, adaptive interventions, such as ubrogepant, by predicting symptom alleviation probabilities based on pre-dose symptom profiles. The novelty lies in combining PSED's theoretical guarantees with the clinical relevance of symptom dynamics, creating a bridge between machine learning-driven representation learning and real-world medical outcomes. By formalizing symptom similarity as a probabilistic matrix, the framework could enable personalized treatment strategies, ensuring that interventions like ubrogepant are optimized not just for efficacy but for their temporal and structural alignment with patient symptoms.",
        "concept_original_list": [
            "Ubrogepant for the treatment of migraine prodromal symptoms: an exploratory analysis from the randomized phase 3 PRODROME trial: PRODROME was a phase 3, placebo-controlled, double-blind crossover trial evaluating whether ubrogepant 100 mg, a calcitonin gene-related peptide receptor antagonist, dosed during the premonitory (prodromal) phase of migraine, prevented development of headache and resolved prodromal symptoms. Qualifying prodromal events were defined as attacks with symptoms in which the participant was confident headache would follow within 1–6 h. Of 1,087 screened participants, 477 formed the efficacy analysis population. Outcomes were collected across 48 h showing, for example, at 2 h post-dose, absence of photophobia in 19.5% and 12.5% of ubrogepant- and placebo-treated events, respectively (odds ratio (OR) = 1.72 (95% confidence interval (CI) = 1.13–2.61)); at 3 h post-dose, absence of fatigue occurred in 27.3% and 16.8% (OR = 1.85 (95% CI = 1.17–2.92)) and absence of neck pain in 28.9% and 15.9% (OR = 2.04 (95% CI = 1.25–3.32)) of events; at 4 h post-dose, absence of phonophobia in 50.7% and 35.8% (OR = 1.97 (95% CI = 1.38–2.80)) of events; and at 24 h post-dose, absence of dizziness in 88.5% and 82.3% (OR = 1.82 (95% CI = 1.00–3.30)) of events. At 1 h and 6 h post-dose, respectively, absence of difficulty concentrating occurred in 8.7% and 2.1% (OR = 4.26 (95% CI = 1.17–15.54)) and absence of difficulty thinking occurred in 56.9% and 41.8% (OR = 2.05 (95% CI = 1.14–3.71)) of events. Treatment with ubrogepant during the prodromal phase may ameliorate common prodromal symptoms, with improvements possibly as early as 1 h post-dose. Analysis of the phase 3 PRODROME trial reveals that treatment with ubrogepant during the prodromal phase of migraine may ameliorate common prodromal symptoms, with improvements as early as 1 hour after dose administration.",
            "Stabilizing Sample Similarity in Representation via Mitigating Random Consistency: Deep learning excels at capturing complex data representations, yet quantifying the discriminative quality of these representations remains challenging. While unsupervised metrics often assess pairwise sample similarity, classification tasks fundamentally require class-level discrimination. To bridge this gap, we propose a novel loss function that evaluates representation discriminability via the Euclidean distance between the learned similarity matrix and the true class adjacency matrix.\nWe identify random consistency—an inherent bias in Euclidean distance metrics—as a key obstacle to reliable evaluation, \naffecting \nboth fairness and discrimination. To address this, we derive the expected Euclidean distance under uniformly distributed label permutations and introduce its closed-form solution, the Pure Square Euclidean Distance (PSED), which provably eliminates random consistency. Theoretically, we demonstrate that PSED satisfies heterogeneity and unbiasedness guarantees, and establish its generalization bound via the exponential Orlicz norm, confirming its statistical learnability.\nEmpirically, our method surpasses conventional loss functions across multiple benchmarks, achieving significant improvements in accuracy, $F_1$ score, and class-structure differentiation. (Code is published in https://github.com/FeijiangLi/ICML2025-PSED)"
        ]
    },
    {
        "id": "theme_097",
        "theme": "Integrating Multi-Modal Therapies with Consistent Explanations for Enhanced Treatment Outcomes",
        "elaboration": "This research proposal merges the clinical trial framework of Concept 1 (immunochemotherapy for ES-SCLC) with the explanatory rigor of Concept 2 (self-consistent visual grounding). By synthesizing the dual goals of efficacy and interpretability, we propose a novel trial design that combines immunotherapy (benmelstobart + anlotinib + chemotherapy) with a systematic approach to validating treatment mechanisms. The theme centers on leveraging consistent, data-driven explanations (akin to SelfEQ's visual explanation maps) to optimize therapeutic strategies. The proposal would involve: 1) a phase 3 trial of benmelstobart + anlotinib + EC in ES-SCLC, with a focus on safety and survival outcomes, and 2) integrating a 'explanation engine' (e.g., a language model) to generate self-consistent, region-specific insights into treatment effects. This dual approach would address the clinical challenge of translating complex immunotherapy mechanisms into actionable, interpretable outcomes, while ensuring safety profiles are rigorously validated. The novelty lies in applying Concept 2's principles of self-consistency to medical trials, enabling real-time validation of treatment efficacy through explainable AI, thereby bridging the gap between clinical data and actionable insights.",
        "concept_original_list": [
            "Benmelstobart, anlotinib and chemotherapy in extensive-stage small-cell lung cancer: a randomized phase 3 trial: Immunochemotherapy is the first-line standard for extensive-stage small-cell lung cancer (ES-SCLC). Combining the regimen with anti-angiogenesis may improve efficacy. ETER701 was a multicenter, double-blind, randomized, placebo-controlled phase 3 trial that investigated the efficacy and safety of benmelstobart (a novel programmed death-ligand 1 (PD-L1) inhibitor) with anlotinib (a multi-target anti-angiogenic small molecule) and standard chemotherapy in treatment-naive ES-SCLC. The ETER701 trial assessed two primary endpoints: Independent Review Committee-assessed progression-free survival per RECIST 1.1 and overall survival (OS). Here the prespecified final progression-free survival and interim OS analysis is reported. Patients randomly received benmelstobart and anlotinib plus etoposide/carboplatin (EC; n = 246), placebo and anlotinib plus EC (n = 245) or double placebo plus EC (‘EC alone’; n = 247), followed by matching maintenance therapy. Compared with EC alone, median OS was prolonged with benmelstobart and anlotinib plus EC (19.3 versus 11.9 months; hazard ratio 0.61; P = 0.0002), while improvement of OS was not statistically significant with anlotinib plus EC (13.3 versus 11.9 months; hazard ratio 0.86; P = 0.1723). The incidence of grade 3 or higher treatment-related adverse events was 93.1%, 94.3% and 87.0% in the benmelstobart and anlotinib plus EC, anlotinib plus EC, and EC alone groups, respectively. This study of immunochemotherapy plus multi-target anti-angiogenesis as first-line treatment achieved a median OS greater than recorded in prior randomized studies in patients with ES-SCLC. The safety profile was assessed as tolerable and manageable. Our findings suggest that the addition of anti-angiogenesis therapy to immunochemotherapy may represent an efficacious and safe approach to the management of ES-SCLC. 
ClinicalTrials.gov identifier: NCT04234607 . In this triple-arm, placebo-controlled phase 3 trial, first-line treatment of patients with extensive-stage small-cell lung cancer with the anti-PD-L1 benmelstobart, tyrosine kinase inhibitor anlotinib and chemotherapy (CT) showed improved survival outcomes compared with anlotinib and CT or CT alone.",
            "Improved Visual Grounding through Self-Consistent Explanations: Vision-and-language models trained to match images with text can be combined with visual explanation methods to point to the locations of specific objects in an image. Our work shows that the localization --\"grounding'\"-- abilities of these models can be further improved by finetuning for self-consistent visual explanations. We propose a strategy for augmenting existing text-image datasets with paraphrases using a large language model and SelfEQ a weakly-supervised strategy on visual explanation maps for paraphrases that encourages self-consistency. Specifically for an input textual phrase we attempt to generate a paraphrase and finetune the model so that the phrase and paraphrase map to the same region in the image. We posit that this both expands the vocabulary that the model is able to handle and improves the quality of the object locations highlighted by gradient-based visual explanation methods (e.g. GradCAM). We demonstrate that SelfEQ improves performance on Flickr30k ReferIt and RefCOCO+ over a strong baseline method and several prior works. Particularly comparing to other methods that do not use any type of box annotations we obtain 84.07% on Flickr30k (an absolute improvement of 4.69%) 67.40% on ReferIt (an absolute improvement of 7.68%) and 75.10% 55.49% on RefCOCO+ test sets A and B respectively (an absolute improvement of 3.74% on average)."
        ]
    },
    {
        "id": "theme_076",
        "theme": "Dual-Stage Encoder-Driven Cascade Ranking with Surrogate Loss Alignment: Integrating VideoMAC's Encoder Design and LCRON's End-to-End Training Paradigm",
        "elaboration": "This research proposes a novel framework that merges VideoMAC's encoder design with LCRON's cascade ranking paradigm to address the limitations of traditional top-k selection systems. VideoMAC's dual encoder architecture (online and exponential moving average target encoder) ensures inter-frame consistency in video processing, while LCRON's surrogate loss function aligns stage objectives through a lower-bound probability of ground-truth selection. By adapting VideoMAC's sparse convolutional encoder principles to cascade ranking stages, we introduce a dual-stage encoder-driven approach that maintains consistency across cascading decisions. The surrogate loss function is optimized to reduce the lower-bound probability, enabling end-to-end training of the entire cascade system. This integration leverages VideoMAC's resource-efficient ConvNets to enhance model robustness while ensuring alignment with the system's overall goal of maximizing end-to-end recall. The proposed method addresses challenges in cascade ranking by combining the benefits of self-supervised learning (VideoMAC) with interaction-aware training (LCRON), offering a scalable solution for large-scale top-k selection in recommendation and advertising systems.",
        "concept_original_list": [
            "VideoMAC: Video Masked Autoencoders Meet ConvNets: Recently the advancement of self-supervised learning techniques like masked autoencoders (MAE) has greatly influenced visual representation learning for images and videos. Nevertheless it is worth noting that the predominant approaches in existing masked image / video modeling rely excessively on resource-intensive vision transformers (ViTs) as the feature encoder. In this paper we propose a new approach termed as VideoMAC which combines video masked autoencoders with resource-friendly ConvNets. Specifically VideoMAC employs symmetric masking on randomly sampled pairs of video frames. To prevent the issue of mask pattern dissipation we utilize ConvNets which are implemented with sparse convolutional operators as encoders. Simultaneously we present a simple yet effective masked video modeling (MVM) approach a dual encoder architecture comprising an online encoder and an exponential moving average target encoder aimed to facilitate inter-frame reconstruction consistency in videos. Additionally we demonstrate that VideoMAC empowering classical (ResNet) / modern (ConvNeXt) convolutional encoders to harness the benefits of MVM outperforms ViT-based approaches on downstream tasks including video object segmentation (+5.2% / 6.4% \\mathcal J &\\mathcal F ) body part propagation (+6.3% / 3.1% mIoU) and human pose tracking (+10.2% / 11.1% PCK@0.1).",
            "Learning Cascade Ranking as One Network: Cascade Ranking is a prevalent architecture in large-scale top-k selection systems like recommendation and advertising platforms. Traditional training methods focus on single-stage optimization, neglecting interactions between stages. Recent advances have introduced interaction-aware training paradigms, but still struggle to 1) align training objectives with the goal of the entire cascade ranking (i.e., end-to-end recall of ground-truth items) and 2) learn effective collaboration patterns for different stages. To address these challenges, we propose LCRON, which introduces a novel surrogate loss function derived from the lower bound probability that ground truth items are selected by cascade ranking, ensuring alignment with the overall objective of the system. According to the properties of the derived bound, we further design an auxiliary loss for each stage to drive the reduction of this bound, leading to a more robust and effective top-k selection. LCRON enables end-to-end training of the entire cascade ranking system as a unified network. Experimental results demonstrate that LCRON achieves significant improvement over existing methods on public benchmarks and industrial applications, addressing key limitations in cascade ranking training and significantly enhancing system performance."
        ]
    },
    {
        "id": "theme_112",
        "theme": "Real-Time Predictive Modeling in AI Systems and Genetic Risk Prediction: Bridging Creativity and Genomics",
        "elaboration": "This research proposes a unified framework that merges the real-time predictive capabilities of AI systems like MagicQuill with the genetic risk prediction methodologies of preimplantation embryo analysis. By leveraging multimodal large language models (MLLMs) to anticipate user intentions in MagicQuill, we draw parallels to how genome-wide statistical models combine common and rare variants to predict polygenic risk scores. The system would integrate a two-branch plug-in module, analogous to MagicQuill's diffusion prior, to process genetic data with precision, enabling real-time risk assessment and personalized health predictions. The novelty lies in applying AI-driven predictive frameworks to both creative workflows and genetic analysis, creating a bridge between computational creativity and biological risk modeling. This approach could revolutionize fields like medical genetics by enabling dynamic, context-aware predictions while enhancing user interaction in AI systems through intuitive, real-time feedback.",
        "concept_original_list": [
            "MagicQuill: An Intelligent Interactive Image Editing System: As a highly practical application, image editing encounters a variety of user demands and thus prioritizes excellent ease of use. In this paper, we unveil MagicQuill, an integrated image editing system designed to support users in swiftly actualizing their creativity. Our system starts with a streamlined yet functionally robust interface, enabling users to articulate their ideas (e.g., inserting elements, erasing objects, altering color, etc.) with just a few strokes. These interactions are then monitored by a multimodal large language model (MLLM) to anticipate user intentions in real time, bypassing the need for prompt entry. Finally, we apply the powerful diffusion prior, enhanced by a carefully learned two-branch plug-in module, to process the editing request with precise control. Please visit https://magic-quill.github.io to try out our system.",
            "Whole-genome risk prediction of common diseases in human preimplantation embryos: Preimplantation genetic testing (PGT) of in-vitro-fertilized embryos has been proposed as a method to reduce transmission of common disease; however, more comprehensive embryo genetic assessment, combining the effects of common variants and rare variants, remains unavailable. Here, we used a combination of molecular and statistical techniques to reliably infer inherited genome sequence in 110 embryos and model susceptibility across 12 common conditions. We observed a genotype accuracy of 99.0–99.4% at sites relevant to polygenic risk scoring in cases from day-5 embryo biopsies and 97.2–99.1% in cases from day-3 embryo biopsies. Combining rare variants with polygenic risk score (PRS) magnifies predicted differences across sibling embryos. For example, in a couple with a pathogenic BRCA1 variant, we predicted a 15-fold difference in odds ratio (OR) across siblings when combining versus a 4.5-fold or 3-fold difference with BRCA1 or PRS alone. Our findings may inform the discussion of utility and implementation of genome-based PGT in clinical practice. A computational approach combining whole-genome sequencing of parental genomes and genotyping of preimplantation embryos allows accurate prediction of the inherited genomes of embryos and calculation of polygenic risk scores."
        ]
    },
    {
        "id": "theme_177",
        "theme": "Unified Multi-Agent Collaboration for Dynamic Task Adaptation in Learning Systems",
        "elaboration": "TarViS's unified approach for task-specific video segmentation demonstrates the potential of modular, flexible architectures to handle diverse tasks through abstract queries. This concept can be extended to multi-agent systems by integrating principles of socialized learning, where agents collaborate to learn new tasks while maintaining expertise in original domains. By modeling agents as 'queries' in a collaborative framework, TarViS's modular design can be adapted to multi-agent systems where each agent specializes in a task, shares knowledge, and dynamically updates its capabilities. This would involve introducing collective collaboration modules (e.g., shared memory for task-specific knowledge) and reciprocal altruism mechanisms (e.g., incentivizing agents to retain expertise in original tasks). The novelty lies in merging TarViS's task-agnostic flexibility with socialized learning's multi-agent dynamics, enabling systems to adapt to new tasks without retraining while preserving foundational knowledge. This approach would bridge the gap between static task specialization and dynamic, collaborative learning, offering a scalable framework for AI systems operating in evolving, real-world environments.",
        "concept_original_list": [
            "TarViS: A Unified Approach for Target-Based Video Segmentation: The general domain of video segmentation is currently fragmented into different tasks spanning multiple benchmarks. Despite rapid progress in the state-of-the-art, current methods are overwhelmingly task-specific and cannot conceptually generalize to other tasks. Inspired by recent approaches with multi-task capability, we propose TarViS: a novel, unified network architecture that can be applied to any task that requires segmenting a set of arbitrarily defined 'targets' in video. Our approach is flexible with respect to how tasks define these targets, since it models the latter as abstract 'queries' which are then used to predict pixel-precise target masks. A single TarViS model can be trained jointly on a collection of datasets spanning different tasks, and can hot-swap between tasks during inference without any task-specific retraining. To demonstrate its effectiveness, we apply TarViS to four different tasks, namely Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), Video Object Segmentation (VOS) and Point Exemplar-guided Tracking (PET). Our unified, jointly trained model achieves state-of-the-art performance on 5/7 benchmarks spanning these four tasks, and competitive performance on the remaining two. Code and model weights are available at: https://github.com/Ali2500/TarViS",
            "Socialized Learning: Making Each Other Better Through Multi-Agent Collaboration: Learning new knowledge frequently occurs in our dynamically changing world, e.g., humans culturally evolve by continuously acquiring new abilities to sustain their survival, leveraging collective intelligence rather than a large number of individual attempts. The effective learning paradigm during cultural evolution is termed socialized learning (SL). Consequently, a straightforward question arises: Can multi-agent systems acquire more new abilities like humans? In contrast to most existing methods that address continual learning and multi-agent collaboration, our emphasis lies in a more challenging problem: we prioritize the knowledge in the original expert classes, and as we adeptly learn new ones, the accuracy in the original expert classes stays superior among all in a directional manner. Inspired by population genetics and cognitive science, leading to unique and complete development, we propose Multi-Agent Socialized Collaboration (MASC), which achieves SL through interactions among multiple agents. Specifically, we introduce collective collaboration and reciprocal altruism modules, organizing collaborative behaviors, promoting information sharing, and facilitating learning and knowledge interaction among individuals. We demonstrate the effectiveness of multi-agent collaboration in an extensive empirical study. Our code will be publicly available at https://github.com/yxjdarren/SL."
        ]
    },
    {
        "id": "theme_186",
        "theme": "Adapting Blur Transformation Principles for Cross-Device Omnidirectional Localization",
        "elaboration": "This research proposal merges the principles of Blur2Blur's unpaired data-driven blur transformation with 360Loc's challenges of cross-device visual localization. By leveraging the concept of transforming blur patterns into more amenable states (as in Blur2Blur), we aim to address the domain gap in 360Loc's cross-device queries. Specifically, we propose a framework where 360-degree camera captures (e.g., fisheye or ultra-wide FoV) are transformed into lower-FoV reference frames using a virtual camera approach, analogous to Blur2Blur's blur-to-blur conversion. This transformation would enable more robust feature matching and pose regression in 360Loc, mitigating performance losses due to cross-device domain shifts. The novelty lies in integrating blur pattern abstraction as a universal bridge for cross-device mapping, ensuring both efficiency and interpretability in omnidirectional visual localization tasks. The proposed method would validate the hypothesis that blur transformation principles can harmonize diverse camera modalities, paving the way for scalable, cross-device visual localization systems.",
        "concept_original_list": [
            "Blur2Blur: Blur Conversion for Unsupervised Image Deblurring on Unknown Domains: This paper presents an innovative framework designed to train an image deblurring algorithm tailored to a specific camera device. This algorithm works by transforming a blurry input image which is challenging to deblur into another blurry image that is more amenable to deblurring. The transformation process from one blurry state to another leverages unpaired data consisting of sharp and blurry images captured by the target camera device. Learning this blur-to-blur transformation is inherently simpler than direct blur-to-sharp conversion as it primarily involves modifying blur patterns rather than the intricate task of reconstructing fine image details. The efficacy of the proposed approach has been demonstrated through comprehensive experiments on various benchmarks where it significantly outperforms state-of-the-art methods both quantitatively and qualitatively. Our code and data are available at https://github.com/VinAIResearch/Blur2Blur",
            "360Loc: A Dataset and Benchmark for Omnidirectional Visual Localization with Cross-device Queries: Portable 360^\\circ cameras are becoming a cheap and efficient tool to establish large visual databases. By capturing omnidirectional views of a scene these cameras could expedite building environment models that are essential for visual localization. However such an advantage is often overlooked due to the lack of valuable datasets. This paper introduces a new benchmark dataset 360Loc composed of 360^\\circ images with ground truth poses for visual localization. We present a practical implementation of 360^\\circ mapping combining 360^\\circ images with lidar data to generate the ground truth 6DoF poses. 360Loc is the first dataset and benchmark that explores the challenge of cross-device visual positioning involving 360^\\circ reference frames and query frames from pinhole ultra-wide FoV fisheye and 360^\\circ cameras. We propose a virtual camera approach to generate lower-FoV query frames from 360^\\circ images which ensures a fair comparison of performance among different query types in visual localization tasks. We also extend this virtual camera approach to feature matching-based and pose regression-based methods to alleviate the performance loss caused by the cross-device domain gap and evaluate its effectiveness against state-of-the-art baselines. We demonstrate that omnidirectional visual localization is more robust in challenging large-scale scenes with symmetries and repetitive structures. These results provide new insights into 360-camera mapping and omnidirectional visual localization with cross-device queries. Project Page and dataset: https://huajianup.github.io/research/360Loc/."
        ]
    },
    {
        "id": "theme_023",
        "theme": "Multimodal Knowledge Transfer for Geometry-Aware Texture Synthesis",
        "elaboration": "TextureDreamer and Source-Free Domain Adaptation (SFDA) both address challenges in knowledge transfer but operate in distinct domains. TextureDreamer leverages geometry-aware diffusion to synthesize textures for 3D shapes, while SFDA adapts multimodal models to new domains using frozen pre-trained knowledge. This research proposes integrating SFDA's multimodal distillation framework with TextureDreamer's geometry-aware score distillation (PGSD) to create a novel system that transfers contextual and geometric knowledge across arbitrary domains. By freezing a ViL model (e.g., CLIP) and using its multimodal (text/visual) knowledge to guide texture synthesis, the system can adapt to new 3D shapes while preserving geometric constraints. The integration of DIFO's prompt learning for mutual information maximization and predictive consistency regularization ensures that the distilled knowledge aligns with both visual and structural properties of the target geometry, enabling realistic, semantic-texture transfer. This approach bridges the gap between domain adaptation and geometry-aware synthesis, offering a scalable solution for democratizing texture creation in 3D modeling.",
        "concept_original_list": [
            "TextureDreamer: Image-Guided Texture Synthesis Through Geometry-Aware Diffusion: We present TextureDreamer a novel image-guided texture synthesis method to transfer relightable textures from a small number of input images (3 to 5) to target 3D shapes across arbitrary categories. Texture creation is a pivotal challenge in vision and graphics. Industrial companies hire experienced artists to manually craft textures for 3D assets. Classical methods require densely sampled views and accurately aligned geometry while learning-based methods are confined to category-specific shapes within the dataset. In contrast TextureDreamer can transfer highly detailed intricate textures from real-world environments to arbitrary objects with only a few casually captured images potentially significantly democratizing texture creation. Our core idea personalized geometry-aware score distillation (PGSD) draws inspiration from recent advancements in diffuse models including personalized modeling for texture information extraction score distillation for detailed appearance synthesis and explicit geometry guidance with ControlNet. Our integration and several essential modifications substantially improve the texture quality. Experiments on real images spanning different categories show that TextureDreamer can successfully transfer highly realistic semantic meaningful texture to arbitrary objects surpassing the visual quality of previous state-of-the-art. Project page: https://texturedreamer.github.io",
            "Source-Free Domain Adaptation with Frozen Multimodal Foundation Model: Source-Free Domain Adaptation (SFDA) aims to adapt a source model for a target domain with only access to unlabeled target training data and the source model pretrained on a supervised source domain. Relying on pseudo labeling and/or auxiliary supervision conventional methods are inevitably error-prone. To mitigate this limitation in this work we for the first time explore the potentials of off-the-shelf vision-language (ViL) multimodal models (e.g. CLIP) with rich whilst heterogeneous knowledge. We find that directly applying the ViL model to the target domain in a zero-shot fashion is unsatisfactory as it is not specialized for this particular task but largely generic. To make it task specific we propose a novel Distilling multImodal Foundation mOdel (DIFO) approach. Specifically DIFO alternates between two steps during adaptation: (i) Customizing the ViL model by maximizing the mutual information with the target model in a prompt learning manner (ii) Distilling the knowledge of this customized ViL model to the target model. For more fine-grained and reliable distillation we further introduce two effective regularization terms namely most-likely category encouragement and predictive consistency. Extensive experiments show that DIFO significantly outperforms the state-of-the-art alternatives. Code is here."
        ]
    },
    {
        "id": "theme_082",
        "theme": "Integrating Group Theory Principles for Enhanced Model Interpretability and Diversity in Neural Networks",
        "elaboration": "This research proposes a novel framework that merges principles from group theory and neural network interpretability to address the limitations of current models. By leveraging the structure of permutation groups (e.g., S₅ and S₆) and their subgroups, we aim to develop a systematic method for decomposing complex arithmetic operations into modular, interpretable components. This approach would enable neural networks to not only perform tasks like language modeling but also provide transparent, model-based explanations for their decisions. For instance, similar to how reverse-engineering neural circuits decomposes group arithmetic, we could design loss functions and temperature adjustments that align with group-theoretic properties, ensuring that model diversity is optimized through structured, mathematically grounded adjustments. This would bridge the gap between abstract group theory and practical machine learning, offering a new pathway to enhance both precision (through controlled temperature) and coverage (through subgroup-aware loss functions), while maintaining interpretability. The work builds on the first concept's focus on subgroup decomposition and the second's exploration of temperature's role in diversity, proposing a unified approach to model tuning that prioritizes both mathematical rigor and practical performance.",
        "concept_original_list": [
            "Grokking Group Multiplication with Cosets: The complex and unpredictable nature of deep neural networks prevents their safe use in many high-stakes applications. There have been many techniques developed to interpret deep neural networks, but all have substantial limitations. Algorithmic tasks have proven to be a fruitful test ground for interpreting a neural network end-to-end. Building on previous work, we completely reverse engineer fully connected one-hidden layer networks that have ``grokked'' the arithmetic of the permutation groups $S_5$ and $S_6$. The models discover the true subgroup structure of the full group and converge on neural circuits that decompose the group arithmetic using the permutation group's subgroups. We relate how we reverse engineered the model's mechanisms and confirmed our theory was a faithful description of the circuit's functionality. We also draw attention to current challenges in conducting interpretability research by comparing our work to Chughtai et al. (2023) which alleges to find a different algorithm for this same problem.",
            "Improving Diversity in Language Models: When Temperature Fails, Change the Loss: Increasing diversity in language models is a challenging yet essential objective. A common approach is to raise the decoding temperature. In this work, we investigate this approach through a simplistic yet common case to provide insights into why decreasing temperature can improve quality (Precision), while increasing it often fails to boost coverage (Recall). Our analysis reveals that for a model to be effectively tunable through temperature adjustments, it must be trained toward coverage. To address this, we propose rethinking loss functions in language models by leveraging the Precision-Recall framework. Our results demonstrate that this approach achieves a substantially better trade-off between Precision and Recall than merely combining negative log-likelihood training with temperature scaling. These findings offer a pathway toward more versatile and robust language modeling techniques."
        ]
    },
    {
        "id": "theme_004",
        "theme": "Masking-Based Regularization for Medical Treatment Optimization: Bridging Self-Supervised Learning and Therapeutic Efficacy",
        "elaboration": "This research proposes a novel framework that applies principles from self-supervised learning (MaskSub) to medical treatment development. By leveraging masking augmentations—similar to how MaskSub stabilizes training through relaxed loss functions—this approach aims to enhance the efficacy of therapeutic interventions. The concept draws on the idea that structured regularization (e.g., masking) can systematically improve model performance, analogously to how pegozafermin's mechanism reduces hypertriglyceridemia. The study would investigate whether a masking-inspired regularization strategy could be integrated into drug development pipelines to optimize treatment outcomes. For example, by simulating the dynamic interactions between drug targets and biological systems, masking-based regularization could accelerate the discovery of effective therapies while ensuring robustness. This work would bridge the gap between computational methods (e.g., training models with masked inputs) and clinical outcomes, offering a novel approach to both model training and drug design. The methodology would involve validating the masking strategy's efficacy in simulating biological processes, akin to how MaskSub improves model stability, and demonstrating its potential to enhance therapeutic efficacy in trials. The results could lead to a new paradigm where computational regularization techniques are applied to medical treatments, enabling faster, more reliable drug development.",
        "concept_original_list": [
            "Masking meets Supervision: A Strong Learning Alliance: Pre-training with random masked inputs has emerged as a novel trend in self-supervised training. However, supervised learning still faces a challenge in adopting masking augmentations, primarily due to unstable training. In this paper, we propose a novel way to involve masking augmentations dubbed Masked Sub-branch (MaskSub). MaskSub consists of the main-branch and sub-branch, the latter being a part of the former. The main-branch undergoes conventional training recipes, while the sub-branch merits intensive masking augmentations, during training. MaskSub tackles the challenge by mitigating adverse effects through a relaxed loss function similar to a self-distillation loss. Our analysis shows that MaskSub improves performance, with the training loss converging faster than in standard training, which suggests our method stabilizes the training process. We further validate MaskSub across diverse training scenarios and models, including DeiT-III training, MAE finetuning, CLIP finetuning, BERT training, and hierarchical architectures (ResNet and Swin Transformer). Our results show that MaskSub consistently achieves impressive performance gains across all the cases. MaskSub provides a practical and effective solution for introducing additional regularization under various training recipes. Code available at https://github.com/naver-ai/augsub",
            "The FGF21 analog pegozafermin in severe hypertriglyceridemia: a randomized phase 2 trial: Pegozafermin, a long-acting glycopegylated analog of human fibroblast growth factor 21, is in development for the treatment of severe hypertriglyceridemia (SHTG) and nonalcoholic steatohepatitis. Here we report the results of a phase 2, double-blind, randomized, five-arm trial testing pegozafermin at four different doses (n = 67; 52 male) versus placebo (n = 18; 12 male) for 8 weeks in patients with SHTG (triglycerides (TGs), ≥500 mg dl−1 and ≤2,000 mg dl−1). Treated patients showed a significant reduction in median TGs for the pooled pegozafermin group versus placebo (57.3% versus 11.9%, difference versus placebo −43.7%, 95% confidence interval (CI): −57.1%, −30.3%; P &lt; 0.001), meeting the primary endpoint of the trial. Reductions in median TGs ranged from 36.4% to 63.4% across all treatment arms and were consistent regardless of background lipid-lowering therapy. Results for secondary endpoints included significant decreases in mean apolipoprotein B and non-high-density lipoprotein cholesterol concentrations (−10.5% and −18.3% for pooled doses compared to 1.1% and −0.6% for placebo (95% CI: −21.5%, −2.0%; P = 0.019 and 95% CI: −30.7%, −5.1%; P = 0.007, respectively), as well as a significant decrease in liver fat fraction for pooled treatment (n = 17) versus placebo (n = 6; −42.2% pooled pegozafermin, −8.3% placebo; 95% CI: −60.9%, −8.7%; P = 0.012), as assessed in a magnetic resonance imaging sub-study. No serious adverse events were observed to be related to the study drug. If these results are confirmed in a phase 3 trial, pegozafermin could be a promising treatment for SHTG (ClinicalTrials.gov registration: NCT0441186). 
In a phase 2, randomized clinical trial in patients with severe hypertriglyceridemia, pegozafermin, a long-acting analog of human fibroblast growth factor 21, was safe and met the primary endpoint of the trial for reducing serum triglyceride levels."
        ]
    },
    {
        "id": "theme_144",
        "theme": "Fusion Frame-Driven Borda Regret Minimization for Efficient Transformer Optimization",
        "elaboration": "This research proposes a novel framework that merges principles from FrameQuant's fusion frame quantization with Borda Regret minimization in dueling bandits. By leveraging fusion frames, we transform the problem of optimizing Borda scores into a quantized space, enabling efficient computation while maintaining robustness to adversarial perturbations. FrameQuant's use of fusion frames to represent model weights as a structured, low-bit approximation can be analogously applied to Borda Regret, where the decision space is quantized to reduce dimensionality and computational overhead. This approach would allow for dynamic, adaptive algorithms that balance exploration (exploring Borda scores) and exploitation (committing to high-scoring options) while minimizing cumulative regret. The novelty lies in integrating quantization techniques from FrameQuant into the Borda Regret framework, enabling efficient, explainable decision-making in complex, high-dimensional settings. Empirical validation would demonstrate how this fusion reduces regret while maintaining interpretability, bridging the gap between model efficiency and optimal decision-making in dueling bandit tasks.",
        "concept_original_list": [
            "FrameQuant: Flexible Low-Bit Quantization for Transformers: Transformers are the backbone of powerful foundation models for many Vision and Natural Language Processing tasks. But their compute and memory/storage footprint is large, and so, serving such models is expensive often requiring high-end hardware. To mitigate this difficulty, Post-Training Quantization seeks to modify a pre-trained model and quantize it to eight bits or lower, significantly boosting compute/memory/latency efficiency. Such models have been successfully quantized to four bits with some performance loss. In this work, we outline a simple scheme to quantize Transformer-based models to just two bits (plus some overhead) with only a small drop in accuracy. Key to our formulation is a concept borrowed from Harmonic analysis called Fusion Frames. Our main finding is that the quantization must take place not in the original weight space, but instead in the Fusion Frame representations. If quantization is interpreted as the addition of noise, our casting of the problem allows invoking an extensive body of known consistent recovery and noise robustness guarantees. Further, if desired, de-noising filters are known in closed form. We show empirically, via a variety of experiments, that (almost) two-bit quantization for Transformer models promises sizable efficiency gains. The code is available at https://github.com/vsingh-group/FrameQuant",
            "Borda Regret Minimization for Generalized Linear Dueling Bandits: Dueling bandits are widely used to model preferential feedback prevalent in many applications such as recommendation systems and ranking. In this paper, we study the Borda regret minimization problem for dueling bandits, which aims to identify the item with the highest Borda score while minimizing the cumulative regret. We propose a rich class of generalized linear dueling bandit models, which cover many existing models. We first prove a regret lower bound of order $\\Omega(d^{2/3} T^{2/3})$ for the Borda regret minimization problem, where $d$ is the dimension of contextual vectors and $T$ is the time horizon. To attain this lower bound, we propose an explore-then-commit type algorithm for the stochastic setting, which has a nearly matching regret upper bound $\\tilde{O}(d^{2/3} T^{2/3})$. We also propose an EXP3-type algorithm for the adversarial linear setting, where the underlying model parameter can change in each round. Our algorithm achieves an $\\tilde{O}(d^{2/3} T^{2/3})$ regret, which is also optimal. Empirical evaluations on both synthetic data and a simulated real-world environment are conducted to corroborate our theoretical analysis."
        ]
    },
    {
        "id": "theme_200",
        "theme": "Sensitivity-Driven Surrogate Models for Genetic Risk Prediction in Diabetes",
        "elaboration": "Concept 1's focus on regulating surrogate model sensitivity offers a novel framework to integrate genetic factors (e.g., G6PDdef) into diabetes risk prediction. By developing a sensitivity-informed regularizer, we can quantify how genetic variants like rs1050828-T influence glucose levels and diabetes complications. This approach bridges Concept 2's genetic epidemiology with Concept 1's optimization techniques, enabling adaptive surrogate models that prioritize genetic heterogeneity. For instance, a sensitivity-aware optimizer could dynamically adjust its training process to account for G6PDdef's impact on glucose dynamics, improving prediction accuracy for African ancestry populations. The synergy between sensitivity regulation and genetic data allows for personalized diabetes management strategies, addressing racial disparities in complications. This research combines offline optimization principles with genetic epidemiology to create a scalable, interpretable model for predicting and mitigating diabetes complications through genotype-guided interventions.",
        "concept_original_list": [
            "Boosting Offline Optimizers with Surrogate Sensitivity: Offline optimization is an important task in numerous material engineering domains where online experimentation to collect data is too expensive and needs to be replaced by an in silico maximization of a surrogate of the black-box function. Although such a surrogate can be learned from offline data, its prediction might not be reliable outside the offline data regime, which happens when the surrogate has narrow prediction margin and is (therefore) sensitive to small perturbations of its parameterization. This raises the following questions: (1) how to regulate the sensitivity of a surrogate model; and (2) whether conditioning an offline optimizer with such less sensitive surrogate will lead to better optimization performance. To address these questions, we develop an optimizable sensitivity measurement for the surrogate model, which then inspires a sensitivity-informed regularizer that is applicable to a wide range of offline optimizers. This development is both orthogonal and synergistic to prior research on offline optimization, which is demonstrated in our extensive experiment benchmark.",
            "Adaptive selection at G6PD and disparities in diabetes complications: Diabetes complications occur at higher rates in individuals of African ancestry. Glucose-6-phosphate dehydrogenase deficiency (G6PDdef), common in some African populations, confers malaria resistance, and reduces hemoglobin A1c (HbA1c) levels by shortening erythrocyte lifespan. In a combined-ancestry genome-wide association study of diabetic retinopathy, we identified nine loci including a G6PDdef causal variant, rs1050828-T (Val98Met), which was also associated with increased risk of other diabetes complications. The effect of rs1050828-T on retinopathy was fully mediated by glucose levels. In the years preceding diabetes diagnosis and insulin prescription, glucose levels were significantly higher and HbA1c significantly lower in those with versus without G6PDdef. In the Action to Control Cardiovascular Risk in Diabetes (ACCORD) trial, participants with G6PDdef had significantly higher hazards of incident retinopathy and neuropathy. At the same HbA1c levels, G6PDdef participants in both ACCORD and the Million Veteran Program had significantly increased risk of retinopathy. We estimate that 12% and 9% of diabetic retinopathy and neuropathy cases, respectively, in participants of African ancestry are due to this exposure. Across continentally defined ancestral populations, the differences in frequency of rs1050828-T and other G6PDdef alleles contribute to disparities in diabetes complications. Diabetes management guided by glucose or potentially genotype-adjusted HbA1c levels could lead to more timely diagnoses and appropriate intensification of therapy, decreasing the risk of diabetes complications in patients with G6PDdef alleles. A combined-ancestry GWAS of diabetic retinopathy, comprising 68,169 cases and 129,188 controls, revealed nine previously unreported loci associated with the condition, including an evolutionarily adaptive genetic variant alongside a potential functional mechanism that influences racial disparities in diabetes complications among individuals of non-Hispanic African ancestry."
        ]
    },
    {
        "id": "theme_006",
        "theme": "Integrating Adversarial Prompt Tuning with Multi-Task Collaboration for Robust Vision-Language Models",
        "elaboration": "This research proposes a novel framework that combines adversarial prompt tuning (TAPT) with multi-task collaboration (WeakMCN) to enhance the robustness and performance of vision-language models. By leveraging the adversarial training principles of TAPT, we design defensive prompts that dynamically adapt to task-specific requirements, while WeakMCN's dual-branch architecture ensures collaborative learning between weakly supervised tasks (WREC and WRES). The integration of adversarial prompts in a multi-task setting allows the model to simultaneously optimize for task-specific objectives (e.g., grounding in WREC) and robustness against adversarial perturbations (e.g., visual attacks in TAPT). Key innovations include dynamic visual feature enhancement (DVFE) to adaptively combine pre-trained visual knowledge and a collaborative consistency module (CCM) to enforce cross-task alignment during optimization. This approach not only improves performance on benchmarks like RefCOCO but also ensures generalization in semi-supervised settings, demonstrating a novel synergy between adversarial defense and multi-task learning.",
        "concept_original_list": [
            "WeakMCN: Multi-task Collaborative Network for Weakly Supervised Referring Expression Comprehension and Segmentation: Weakly supervised referring expression comprehension (WREC) and segmentation (WRES) aim to learn object grounding based on a given expression using weak supervision signals like image-text pairs. While these tasks have traditionally been modeled separately, we argue that they can benefit from joint learning in a multi-task framework. To this end, we propose WeakMCN, a novel multi-task collaborative network that effectively combines WREC and WRES with a dual-branch architecture. Specifically, the WREC branch is formulated as anchor-based contrastive learning, which also acts as a teacher to supervise the WRES branch. In WeakMCN, we propose two innovative designs to facilitate multi-task collaboration, namely Dynamic Visual Feature Enhancement (DVFE) and Collaborative Consistency Module (CCM). DVFE dynamically combines various pre-trained visual knowledge to meet different task requirements, while CCM promotes cross-task consistency from the perspective of optimization. Extensive experimental results on three popular REC and RES benchmarks, i.e., RefCOCO, RefCOCO+, and RefCOCOg, consistently demonstrate performance gains of WeakMCN over state-of-the-art single-task alternatives, e.g., up to 3.91% and 13.11% on RefCOCO for WREC and WRES tasks, respectively. Furthermore, experiments also validate the strong generalization ability of WeakMCN in both semi-supervised REC and RES settings against existing methods, e.g., +8.94% for semi-REC and +7.71% for semi-RES on 1% RefCOCO.",
            "TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models: Large pre-trained Vision-Language Models (VLMs) such as CLIP have demonstrated excellent zero-shot generalizability across various downstream tasks. However, recent studies have shown that the inference performance of CLIP can be greatly degraded by small adversarial perturbations, especially its visual modality, posing significant safety threats. To mitigate this vulnerability, in this paper, we propose a novel defense method called Test-Time Adversarial Prompt Tuning (TAPT) to enhance the inference robustness of CLIP against visual adversarial attacks. TAPT is a test-time defense method that learns defensive bimodal (textual and visual) prompts to robustify the inference process of CLIP. Specifically, it is an unsupervised method that optimizes the defensive prompts for each test sample by minimizing a multi-view entropy and aligning adversarial-clean distributions. We evaluate the effectiveness of TAPT on 11 benchmark datasets, including ImageNet and 10 other zero-shot datasets, demonstrating that it enhances the zero-shot adversarial robustness of the original CLIP by at least 48.9% against AutoAttack (AA), while largely maintaining performance on clean examples. Moreover, TAPT outperforms existing adversarial prompt tuning methods across various backbones, achieving an average robustness improvement of at least 36.6%. Code is available at https://github.com/xinwong/TAPT."
        ]
    },
    {
        "id": "theme_098",
        "theme": "Leveraging Statistical Stability Analysis for Enhanced Interpretability and Generalization in Sparse Autoencoders (SAEs)",
        "elaboration": "SAEBench's focus on evaluating SAEs across diverse metrics (e.g., interpretability, disentanglement) aligns with the statistical stability and generalization analysis in stochastic compositional gradient descent (SCGD) algorithms. By applying the compositional uniform stability framework from the second concept, researchers can quantify how SAEs' performance scales with training parameters and data diversity. This would bridge the gap between proxy metrics (e.g., reconstruction error) and practical performance (e.g., unlearning efficacy), enabling a deeper understanding of SAEs' stability and generalization. The proposed approach would integrate statistical learning theory to validate SAEs' robustness under varying conditions, ensuring that improvements in interpretability and disentanglement metrics translate to real-world applicability. This synergy would provide a standardized framework for assessing SAEs' scalability and reliability, addressing the limitations of prior work that prioritize proxy metrics over practical performance.",
        "concept_original_list": [
            "SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability: Sparse autoencoders (SAEs) are a popular technique for interpreting language model activations, and there is extensive recent work on improving SAE effectiveness. However, most prior work evaluates progress using unsupervised proxy metrics with unclear practical relevance. We introduce SAEBench, a comprehensive evaluation suite that measures SAE performance across eight diverse metrics, spanning interpretability, feature disentanglement and practical applications like unlearning. To enable systematic comparison, we open-source a suite of over 200 SAEs across seven recently proposed SAE architectures and training algorithms. Our evaluation reveals that gains on proxy metrics do not reliably translate to better practical performance. For instance, while Matryoshka SAEs slightly underperform on existing proxy metrics, they substantially outperform other architectures on feature disentanglement metrics; moreover, this advantage grows with SAE scale. By providing a standardized framework for measuring progress in SAE development, SAEBench enables researchers to study scaling trends and make nuanced comparisons between different SAE architectures and training methodologies. Our interactive interface enables researchers to flexibly visualize relationships between metrics across hundreds of open-source SAEs at www.neuronpedia.org/sae-bench",
            "Stability and Generalization of Stochastic Compositional Gradient Descent Algorithms: Many machine learning tasks can be formulated as a stochastic compositional optimization (SCO) problem such as reinforcement learning, AUC maximization and meta-learning, where the objective function involves a nested composition associated with an expectation. Although many studies have been devoted to studying the convergence behavior of SCO algorithms, there is little work on understanding their generalization, that is, how these learning algorithms built from training data would behave on future test examples. In this paper, we provide the stability and generalization analysis of stochastic compositional gradient descent algorithms in the framework of statistical learning theory. Firstly, we introduce a stability concept called *compositional uniform stability* and establish its quantitative relation with generalization for SCO problems. Then, we establish the compositional uniform stability results for two notable stochastic compositional gradient descent algorithms, namely SCGD and SCSC. Finally, we derive *dimension-independent* excess risk bounds for SCGD and SCSC by balancing stability results and optimization errors. To the best of our knowledge, these are the first-ever known results on stability and generalization analysis of stochastic compositional gradient descent algorithms."
        ]
    },
    {
        "id": "theme_042",
        "theme": "Iterative Generative Modeling with Programmable Transformers: Combining Dynamic Latent Refinement and Loop-Based Computation for Versatile Data Generation",
        "elaboration": "This research proposes a novel framework that merges the iterative refinement of latent variables in 'Learning to Jump' with the programmable computation of Looped Transformers. By treating transformer networks as dynamic computational engines, we design a system where iterative processes (like thinning and thickening in Learning to Jump) are encoded as looped operations. The transformer's looped structure enables continuous refinement of latent representations, allowing it to dynamically adjust to sparse, skewed, or heavy-tailed data distributions. For example, a transformer loop could simulate a forward count thinning process to initialize latent variables and a reverse thickening process to iteratively refine them, mimicking the dual-phase training of Learning to Jump. This approach would allow models to adapt to diverse data types (e.g., counts, non-negative continuous data) by embedding programmable computation in the loop, enabling both generative modeling and computational tasks. The novelty lies in integrating transformer's attention mechanisms with iterative refinement, creating a hybrid system that combines the flexibility of generative models with the programmability of computing systems, opening new avenues for applications in sparse data analysis and real-time computation.",
        "concept_original_list": [
            "Learning to Jump: Thinning and Thickening Latent Counts for Generative Modeling: Learning to denoise has emerged as a prominent paradigm to design state-of-the-art deep generative models for natural images. How to use it to model the distributions of both continuous real-valued data and categorical data has been well studied in recently proposed diffusion models. However, it is found in this paper to have limited ability in modeling some other types of data, such as count and non-negative continuous data, that are often highly sparse, skewed, heavy-tailed, and/or overdispersed. To this end, we propose learning to jump as a general recipe for generative modeling of various types of data. Using a forward count thinning process to construct learning objectives to train a deep neural network, it employs a reverse count thickening process to iteratively refine its generation through that network. We demonstrate when learning to jump is expected to perform comparably to learning to denoise, and when it is expected to perform better. For example, learning to jump is recommended when the training data is non-negative and exhibits strong sparsity, skewness, heavy-tailedness, and/or heterogeneity.",
            "Looped Transformers as Programmable Computers: We present a framework for using transformer networks as universal computers by programming them with specific weights and placing them in a loop. Our input sequence acts as a punchcard, consisting of instructions and memory for data read/writes. We demonstrate that a constant number of encoder layers can emulate basic computing blocks, including lexicographic operations, non-linear functions, function calls, program counters, and conditional branches. Using this framework, we emulate a computer using a simple instruction-set architecture, which allows us to map iterative algorithms to programs that can be executed by a constant depth looped transformer network. We show how a single frozen transformer, instructed by its input, can emulate a basic calculator, a basic linear algebra library, and even a full backpropagation, in-context learning algorithm. Our findings reveal the potential of transformer networks as programmable compute units and offer insight into the mechanics of attention."
        ]
    },
    {
        "id": "theme_052",
        "theme": "Integrating Inference-Time Steering with Multi-Level Geometry Control for Enhanced 3D Human Reconstruction",
        "elaboration": "This research proposes a framework that combines FK steering's inference-time control mechanisms with multi-level geometry learning to address the challenges of reconstructing 3D clothed human bodies from monocular images. FK steering's use of reward functions and potential-based resampling can be adapted to guide diffusion models toward specific geometric details, such as wrinkles and joint configurations, by treating these as reward targets. The multi-level geometry learning framework (Skel-Enhance, Joint-Augment, Wrinkle-Refine) would leverage FK steering's principles to dynamically adjust the diffusion model's trajectory during inference, prioritizing geometric accuracy at each hierarchical level. For example, a reward function for wrinkle fidelity could steer the model to generate smoother textures, while a joint depth optimization reward would refine skeletal structure. This integration would enable scalable, training-free control over both global geometry and fine-grained details, overcoming the ambiguity of single-view input. The approach would bridge the gap between diffusion model flexibility and precise geometric reconstruction, offering a novel solution for 3D human body modeling with enhanced controllability and fidelity.",
        "concept_original_list": [
            "A General Framework for Inference-time Scaling and Steering of Diffusion Models: Diffusion models have demonstrated remarkable performance in generative modeling, but generating samples with specific desiderata remains challenging. Existing solutions, such as fine-tuning, best-of-n sampling, and gradient-based guidance, are expensive, inefficient, or limited in applicability. In this work, we propose FK steering, a framework that applies Feynman-Kac interacting particle systems to the inference-time steering of diffusion models with arbitrary reward functions. FK steering works by generating multiple trajectories, called particles, and resampling particles at intermediate steps based on scores computed using functions called potentials. Potentials are defined using rewards for intermediate states and are chosen such that a high score indicates the particle will yield a high-reward sample. We explore various choices of potentials, rewards, and samplers. Steering text-to-image models with a human preference reward, we find that FK steering outperforms fine-tuned models with just 2 particles. Moreover, FK steering a 0.8B parameter model outperforms a 2.6B model, achieving state-of-the-art performance on prompt fidelity. We also steer text diffusion models with rewards for text quality and rare attributes such as toxicity, and find that FK steering generates lower perplexity text and enables gradient-free control. Overall, inference-time scaling and steering of diffusion models, even training-free, provides significant quality and controllability benefits. Code is available at https://github.com/zacharyhorvitz/FK-Diffusion-Steering.",
            "MultiGO: Towards Multi-level Geometry Learning for Monocular 3D Textured Human Reconstruction: This paper investigates the research task of reconstructing the 3D clothed human body from a monocular image. Due to the inherent ambiguity of single-view input, existing approaches leverage pre-trained SMPL(-X) estimation models or generative models to provide auxiliary information for human reconstruction. However, these methods capture only the general human body geometry and overlook specific geometric details, leading to inaccurate skeleton reconstruction, incorrect joint positions, and unclear cloth wrinkles. In response to these issues, we propose a multi-level geometry learning framework. Technically, we design three key components: skeleton-level enhancement, joint-level augmentation, and wrinkle-level refinement modules. Specifically, we effectively integrate the projected 3D Fourier features into a Gaussian reconstruction model, introduce perturbations to improve joint depth estimation during training, and refine the human coarse wrinkles by resembling the de-noising process of the diffusion model. Extensive quantitative and qualitative experiments on two test sets show the superior performance of our approach compared to state-of-the-art (SOTA) methods."
        ]
    },
    {
        "id": "theme_162",
        "theme": "Adaptive Hybrid Mechanisms for Efficient Vision-Based Robot Pose Estimation via Mechanistic Design and Self-Supervised Learning",
        "elaboration": "This research proposes a novel framework that integrates the mechanistic design principles of hybrid architectures with the self-supervised learning paradigm of RoboPEPP to create a scalable, explainable, and robust system for robot pose estimation. By leveraging synthetic token manipulation tasks (e.g., masking, compression) from mechanistic design, we develop a hybrid encoder-predictor architecture that dynamically adapts to real-world constraints. The framework combines the physical modeling of RoboPEPP with the scalability of hybrid architectures, enabling efficient inference through computational primitives like sparsity and hybrid layers. Key innovations include: (1) a mechanistic design pipeline that generates synthetic occlusion scenarios to train the encoder-predictor model, ensuring robustness to real-world occlusions; (2) a hybrid architecture that balances computational efficiency with model accuracy, validated through compute-optimal and state-optimal scaling laws; (3) a self-supervised embedding-predictive architecture that merges physical constraints with neural learning, reducing reliance on labeled data. This approach bridges the gap between symbolic reasoning and neural computation, enabling real-time pose estimation in collaborative robotics with minimal latency and high generalization across occluded environments.",
        "concept_original_list": [
            "RoboPEPP: Vision-Based Robot Pose and Joint Angle Estimation through Embedding Predictive Pre-Training: Vision-based pose estimation of articulated robots with unknown joint angles has applications in collaborative robotics and human-robot interaction tasks. Current frameworks use neural network encoders to extract image features and downstream layers to predict joint angles and robot pose. While images of robots inherently contain rich information about the robot's physical structures, existing methods often fail to leverage it fully, thereby limiting performance under occlusions and truncations. To address this, we introduce RoboPEPP, a method that fuses information about the robot's physical model into the encoder using a masking-based self-supervised embedding-predictive architecture. Specifically, we mask the robot's joints and pre-train an encoder-predictor model to infer the joints' embeddings from surrounding unmasked regions, enhancing the encoder's understanding of the robot's physical model. The pre-trained encoder-predictor pair, along with joint angle and keypoint prediction networks, is then fine-tuned for pose and joint angle estimation. Random masking of input during fine-tuning and keypoint filtering during evaluation further improve robustness. Our method, evaluated on several datasets, achieves the best results in robot pose and joint angle estimation while being the least sensitive to occlusions and requiring the lowest execution time.",
            "Mechanistic Design and Scaling of Hybrid Architectures: The development of deep learning architectures is a resource-demanding process, due to a vast design space, long prototyping times, and high compute costs associated with at-scale model training and evaluation. We set out to simplify this process by grounding it in an end-to-end mechanistic architecture design (MAD) pipeline, encompassing small-scale capability unit tests predictive of scaling laws. Through a suite of synthetic token manipulation tasks such as compression and recall, designed to probe capabilities, we identify and test new hybrid architectures constructed from a variety of computational primitives. We experimentally validate the resulting architectures via an extensive compute-optimal and a new state-optimal scaling law analysis, training over 500 language models between 70M and 7B parameters. Surprisingly, we find MAD synthetics to correlate with compute-optimal perplexity, enabling accurate evaluation of new architectures via isolated proxy tasks. The new architectures found via MAD, based on simple ideas such as hybridization and sparsity, outperform state-of-the-art Transformer, convolutional, and recurrent architectures (Transformer++, Hyena, Mamba) in scaling, both at compute-optimal budgets and in overtrained regimes. Overall, these results provide evidence that performance on curated synthetic tasks can be predictive of scaling laws, and that an optimal architecture should leverage specialized layers via a hybrid topology."
        ]
    },
    {
        "id": "theme_048",
        "theme": "Adaptive Interpolation for Hyperparameter Optimization in Gaussian Processes via F-Divergence Minimization",
        "elaboration": "This research proposes a novel framework that merges interpolation techniques from particle distribution optimization (Concept 1) with hyperparameter learning in Gaussian Processes (Concept 2). By treating hyperparameters as particles in a high-dimensional space, we employ interpolation-based velocity fields to dynamically navigate the hyperparameter landscape, minimizing f-divergences between approximate posteriors and true distributions. This approach addresses the limitations of traditional VI and EP by leveraging a hybrid training procedure: conjugate-computation VI for inference and EP-like marginal likelihood approximations for hyperparameter optimization. The core innovation lies in applying non-parametric interpolation to hyperparameter spaces, similar to how velocity fields are derived for particle motion, to achieve more robust and interpretable hyperparameter learning. The methodology is validated on domain adaptation and missing data imputation tasks, demonstrating its efficacy in balancing accuracy and computational efficiency.",
        "concept_original_list": [
            "Minimizing $f$-Divergences by Interpolating Velocity Fields: Many machine learning problems can be seen as approximating a *target* distribution using a *particle* distribution by minimizing their statistical discrepancy. Wasserstein Gradient Flow can move particles along a path that minimizes the $f$-divergence between the target and particle distributions. To move particles, we need to calculate the corresponding velocity fields derived from a density ratio function between these two distributions. Previous works estimated such density ratio functions and then differentiated the estimated ratios. These approaches may suffer from overfitting, leading to a less accurate estimate of the velocity fields. Inspired by non-parametric curve fitting, we directly estimate these velocity fields using interpolation techniques. We prove that our estimators are consistent under mild conditions. We validate their effectiveness using novel applications on domain adaptation and missing data imputation. The code for reproducing our results can be found at https://github.com/anewgithubname/gradest2.",
            "Improving Hyperparameter Learning under Approximate Inference in Gaussian Process Models: Approximate inference in Gaussian process (GP) models with non-conjugate likelihoods gets entangled with the learning of the model hyperparameters. We improve hyperparameter learning in GP models and focus on the interplay between variational inference (VI) and the learning target. While VI's lower bound to the marginal likelihood is a suitable objective for inferring the approximate posterior, we show that a direct approximation of the marginal likelihood as in Expectation Propagation (EP) is a better learning objective for hyperparameter optimization. We design a hybrid training procedure to bring the best of both worlds: it leverages conjugate-computation VI for inference and uses an EP-like marginal likelihood approximation for hyperparameter learning. We compare VI, EP, Laplace approximation, and our proposed training procedure and empirically demonstrate the effectiveness of our proposal across a wide range of data sets."
        ]
    },
    {
        "id": "theme_099",
        "theme": "Vector-Based Temporal Dynamics in Point Cloud Analysis: Integrating Spatial Abstraction with Temporal State Modeling",
        "elaboration": "This research proposes a novel framework that merges the spatial vector abstraction of PointVector with temporal state modeling to generate dynamic, high-fidelity 3D objects. By extending PointVector's vector-based feature aggregation to temporal dimensions, we design a model that encodes both spatial geometry and temporal evolution in a compact, structured format. The core innovation lies in transforming scalar temporal states (e.g., bloom cycles in a rose) into vector representations using rotation-based transformations, enabling efficient aggregation of temporal features while preserving spatial coherence. This approach leverages the efficiency of vector operations to handle complex temporal dynamics, similar to PointVector's parameter efficiency, while introducing a temporal vector network (TVN) to capture sequence dependencies. The method enables the generation of realistic, controllable 3D objects with temporal consistency, bridging the gap between point cloud analysis and temporal dynamics. By combining the vector abstraction of PointVector with temporal state distillation, this work addresses the challenge of generating evolving 3D assets, offering a scalable solution for dynamic object rendering and simulation.",
        "concept_original_list": [
            "PointVector: A Vector Representation in Point Cloud Analysis: In point cloud analysis, point-based methods have rapidly developed in recent years. These methods have recently focused on concise MLP structures, such as PointNeXt, which have demonstrated competitiveness with Convolutional and Transformer structures. However, standard MLPs are limited in their ability to extract local features effectively. To address this limitation, we propose a Vector-oriented Point Set Abstraction that can aggregate neighboring features through higher-dimensional vectors. To facilitate network optimization, we construct a transformation from scalar to vector using independent angles based on 3D vector rotations. Finally, we develop a PointVector model that follows the structure of PointNeXt. Our experimental results demonstrate that PointVector achieves state-of-the-art performance of 72.3% mIOU on S3DIS Area 5 and 78.4% mIOU on S3DIS (6-fold cross-validation) with only 58% of the model parameters of PointNeXt. We hope our work will help the exploration of concise and effective feature representations. The code will be released soon.",
            "Birth and Death of a Rose: We study the problem of generating temporal object intrinsics--temporally evolving sequences of object geometry, reflectance, and texture, such as a blooming rose--from pre-trained 2D foundation models. Unlike conventional 3D modeling and animation techniques that require extensive manual effort and expertise, we introduce a method that generates such assets with signals distilled from pretrained 2D diffusion models. To ensure the temporal consistency of object intrinsics, we propose Neural Templates for temporal-state-guided distillation, derived automatically from image features from self-supervised learning. Our method can generate high-quality temporal object intrinsics for several natural phenomena and enable the sampling and controllable rendering of these dynamic objects from any viewpoint, under any environmental lighting conditions, at any time of their lifespan. Project website: https://chen-geng.com/rose4d."
        ]
    },
    {
        "id": "theme_149",
        "theme": "Graph-Based Frequency Domain Clustering for Robust Factor Analysis",
        "elaboration": "This research proposes a novel framework that integrates principles from latent factor analysis with frequency domain signal processing. By translating the problem of learning factor analysis structures into a graph-based clique search on a frequency-domain correlation graph, we leverage the scalability of clique algorithms (as in Concept 1) to identify rotationally identifiable structures in high-dimensional data. The frequency domain is modeled as a graph where cliques represent coherent frequency components, enabling the CT algorithm to simultaneously learn the number of factors and their spatial/temporal correlations. This approach addresses the defects in convolutional decoders (Concept 2) by ensuring high-frequency components are preserved through structured graph constraints, while the finite-sample error bounds and consistency guarantees from Concept 1 ensure robustness. The framework combines graph theory's ability to capture complex dependencies with frequency domain analysis to create a model that is both interpretable and resilient to violations of its assumptions, offering a unified solution to structure learning and frequency representation challenges.",
        "concept_original_list": [
            "Structure Learning of Latent Factors via Clique Search on Correlation Thresholded Graphs: Despite the widespread application of latent factor analysis, existing methods suffer from the following weaknesses: requiring the number of factors to be known, lack of theoretical guarantees for learning the model structure, and nonidentifiability of the parameters due to rotation invariance properties of the likelihood. We address these concerns by proposing a fast correlation thresholding (CT) algorithm that simultaneously learns the number of latent factors and a rotationally identifiable model structure. Our novel approach translates this structure learning problem into the search for so-called independent maximal cliques in a thresholded correlation graph that can be easily constructed from the observed data. Our clique analysis technique scales well up to thousands of variables, while competing methods are not applicable in a reasonable amount of running time. We establish a finite-sample error bound and high-dimensional consistency for the structure learning of our method. Through a series of simulation studies and a real data example, we show that the CT algorithm is an accurate method for learning the structure of factor analysis models and is robust to violations of its assumptions.",
            "Defects of Convolutional Decoder Networks in Frequency Representation: In this paper, we prove the representation defects of a cascaded convolutional decoder network, considering the capacity of representing different frequency components of an input sample. We conduct the discrete Fourier transform on each channel of the feature map in an intermediate layer of the decoder network. Then, we extend the 2D circular convolution theorem to represent the forward and backward propagations through convolutional layers in the frequency domain. Based on this, we prove three defects in representing feature spectrums. First, we prove that the convolution operation, the zero-padding operation, and a set of other settings all make a convolutional decoder network more likely to weaken high-frequency components. Second, we prove that the upsampling operation generates a feature spectrum, in which strong signals repetitively appear at certain frequencies. Third, we prove that if the frequency components in the input sample and frequency components in the target output for regression have a small shift, then the decoder usually cannot be effectively learned."
        ]
    },
    {
        "id": "theme_072",
        "theme": "Discrete Diffusion Models for Novel View Synthesis: Bridging Continuous Generative Models with Discrete Markov Principles",
        "elaboration": "This research proposes a novel framework that integrates discrete Markov probabilistic models (DMPM) with pixel-space diffusion architectures to address the limitations of traditional novel view synthesis (NVS) methods. By leveraging the discrete, score-based framework of DMPM, we aim to enhance the stability and convergence of diffusion models when generating novel views from single-image inputs. The DMPM's ability to model discrete structures and its sharp convergence bounds under minimal assumptions can be adapted to encode geometric and structural information in a discrete space, enabling efficient and robust synthesis of novel views. This approach would address the challenges of out-of-domain content generalization and computational efficiency in NVS, while ensuring theoretical guarantees through the discrete score function's alignment with generative models. The framework would combine the continuous diffusion dynamics of pixel-space models with the discrete, model-based explanations of DMPM, creating a hybrid system that balances flexibility, efficiency, and interpretability for complex scene generation tasks.",
        "concept_original_list": [
            "Novel View Synthesis with Pixel-Space Diffusion Models: Synthesizing a novel view from a single input image is a challenging task. Traditionally, this task was approached by estimating scene depth, warping, and inpainting, with machine learning models enabling parts of the pipeline. More recently, generative models are being increasingly employed in novel view synthesis (NVS), often encompassing the entire end-to-end system. In this work, we adapt a modern diffusion model architecture for end-to-end NVS in the pixel space, substantially outperforming previous state-of-the-art (SOTA) techniques. We explore different ways to encode geometric information into the network. Our experiments show that while these methods may enhance performance, their impact is minor compared to utilizing improved generative models. Moreover, we introduce a novel NVS training scheme that utilizes single-view datasets, capitalizing on their relative abundance compared to their multi-view counterparts. This leads to improved generalization capabilities to scenes with out-of-domain content.",
            "Discrete Markov Probabilistic Models: An Improved Discrete Score-Based Framework with sharp convergence bounds under minimal assumptions: This paper introduces the Discrete Markov Probabilistic Model (DMPM), a novel algorithm for discrete data generation. The algorithm operates in discrete space, where the noising process is a continuous-time Markov chain that can be sampled exactly via a Poissonian clock that flips labels uniformly at random. The time-reversal process, like the forward noise process, is a jump process, with its intensity governed by a discrete analogue of the classical score function. Crucially, this intensity is proven to be the conditional expectation of a function of the forward process, strengthening its theoretical alignment with score-based generative models while ensuring robustness and efficiency. We further establish convergence bounds for the algorithm under minimal assumptions and demonstrate its effectiveness through experiments on low-dimensional Bernoulli-distributed datasets and high-dimensional binary MNIST data. The results highlight its strong performance in generating discrete structures. This work bridges theoretical foundations and practical applications, advancing the development of effective and theoretically grounded discrete generative modeling."
        ]
    },
    {
        "id": "theme_150",
        "theme": "Adaptive Sequential Policy Learning for Pre-Trained Vision Models: Integrating Sequential Neural Posterior Score Estimation into Motor Control Frameworks",
        "elaboration": "This research proposes a novel framework that merges the sequential training paradigm of Sequential Neural Posterior Score Estimation (SNPSE) with pre-trained vision models for motor control. By leveraging SNPSE's ability to iteratively refine posterior approximations through sequential steps, we aim to enhance the adaptability of pre-trained vision models to diverse control policies. The approach involves training a policy learning module using a sequential training procedure that dynamically adjusts the model's posterior distribution based on real-time simulation feedback, reducing the need for extensive offline data collection. This integration addresses the limitations of conventional RL, BC, and VRF methods, which often struggle with generalization across control policies. The novelty lies in applying SNPSE's sequential inference techniques to policy learning, enabling more robust and interpretable models that can adapt to varying motor control tasks. By validating this approach on 21 tasks across three environments, we demonstrate its potential to bridge the gap between pre-training efficacy and real-world applicability in motor control.",
        "concept_original_list": [
            "Sequential Neural Score Estimation: Likelihood-Free Inference with Conditional Score Based Diffusion Models: We introduce Sequential Neural Posterior Score Estimation (SNPSE), a score-based method for Bayesian inference in simulator-based models. Our method, inspired by the remarkable success of score-based methods in generative modelling, leverages conditional score-based diffusion models to generate samples from the posterior distribution of interest. The model is trained using an objective function which directly estimates the score of the posterior. We embed the model into a sequential training procedure, which guides simulations using the current approximation of the posterior at the observation of interest, thereby reducing the simulation cost. We also introduce several alternative sequential approaches, and discuss their relative merits. We then validate our method, as well as its amortised, non-sequential, variant on several numerical examples, demonstrating comparable or superior performance to existing state-of-the-art methods such as Sequential Neural Posterior Estimation (SNPE).",
            "For Pre-Trained Vision Models in Motor Control, Not All Policy Learning Methods are Created Equal: In recent years, increasing attention has been directed to leveraging pre-trained vision models for motor control. While existing works mainly emphasize the importance of this pre-training phase, the arguably equally important role played by downstream policy learning during control-specific fine-tuning is often neglected. It thus remains unclear if pre-trained vision models are consistent in their effectiveness under different control policies. To bridge this gap in understanding, we conduct a comprehensive study on 14 pre-trained vision models using 3 distinct classes of policy learning methods, including reinforcement learning (RL), imitation learning through behavior cloning (BC), and imitation learning with a visual reward function (VRF). Our study yields a series of intriguing results, including the discovery that the effectiveness of pre-training is highly dependent on the choice of the downstream policy learning algorithm. We show that conventionally accepted evaluation based on RL methods is highly variable and therefore unreliable, and further advocate for using more robust methods like VRF and BC. To facilitate more universal evaluations of pre-trained models and their policy learning methods in the future, we also release a benchmark of 21 tasks across 3 different environments alongside our work."
        ]
    },
    {
        "id": "theme_143",
        "theme": "Unlearning in Federated Learning via Function Space Equivalence: Leveraging LR Principles to Efficiently Remove Data Influence",
        "elaboration": "This research proposes a novel framework that integrates principles from the Lookahead-Replicate (LR) algorithm in reinforcement learning with federated machine unlearning (FMU) to address the inefficiencies of traditional unlearning methods. The LR algorithm's focus on maintaining function-space equivalence between online and target networks inspires a paradigm shift in FMU, where the goal is to preserve the global model's performance while removing specific data subsets. By applying LR's function-space equivalence principles, we design a method to iteratively refine local models (via Nemytskii operators) while ensuring their functional outputs remain close to the global model. This approach enables efficient unlearning by avoiding sequential training/retraining cycles, leveraging the global Lipschitz constant of the Nemytskii operator to bound gradient differences. The proposed FFMU algorithm maintains certified guarantees on unlearning quality while achieving faster, data-efficient updates through function-space alignment, bridging the gap between decentralized learning and data privacy requirements.",
        "concept_original_list": [
            "Learning the Target Network in Function Space: We focus on the task of learning the value function in the reinforcement learning (RL) setting. This task is often solved by updating a pair of online and target networks while ensuring that the parameters of these two networks are equivalent. We propose Lookahead-Replicate (LR), a new value-function approximation algorithm that is agnostic to this parameter-space equivalence. Instead, the LR algorithm is designed to maintain an equivalence between the two networks in the function space. This value-based equivalence is obtained by employing a new target-network update. We show that LR leads to a convergent behavior in learning the value function. We also present empirical results demonstrating that LR-based target-network updates significantly improve deep RL on the Atari benchmark.",
            "Fast Federated Machine Unlearning with Nonlinear Functional Theory: Federated machine unlearning (FMU) aims to remove the influence of a specified subset of training data upon request from a trained federated learning model. Despite achieving remarkable performance, existing FMU techniques suffer from inefficiency due to two sequential operations of training and retraining/unlearning on large-scale datasets. Our prior study, PCMU, was proposed to improve the efficiency of centralized machine unlearning (CMU) with certified guarantees, by simultaneously executing the training and unlearning operations. This paper proposes a fast FMU algorithm, FFMU, for improving the FMU efficiency while maintaining the unlearning quality. The PCMU method is leveraged to train a local machine unlearning (MU) model on each edge device. We propose to employ nonlinear functional analysis techniques to refine the local MU models as output functions of a Nemytskii operator. We conduct theoretical analysis to derive that the Nemytskii operator has a global Lipschitz constant, which allows us to bound the difference between two MU models regarding the distance between their gradients. Based on the Nemytskii operator and average smooth local gradients, the global MU model on the server is guaranteed to achieve close performance to each local MU model with the certified guarantees."
        ]
    },
    {
        "id": "theme_049",
        "theme": "Integrating Topological Invariance with Mechanistic Synergy: A Novel Approach to Enhanced Treatment Response and Image Segmentation Accuracy",
        "elaboration": "This research proposal merges the principles of topological invariance and mechanistic synergy to address critical challenges in both clinical oncology and image segmentation. By drawing inspiration from the dual-targeted therapeutic strategy of tislelizumab plus zanubrutinib (combining PD-1 inhibition and BTK inhibition) in Richter transformation, we propose a framework that integrates topological matching of persistence barcodes with mechanistic optimization. The theme centers on leveraging topological fidelity (from the second concept) to ensure spatial accuracy in segmentation while applying mechanistic synergies (from the first concept) to enhance treatment efficacy. The novelty lies in applying topological matching to image segmentation to preserve spatial relationships, akin to how combination therapies in oncology target multiple pathways to improve outcomes. By defining a Betti matching error as a differentiable loss function, we aim to bridge the gap between topological accuracy and spatial fidelity, enabling robust segmentation models that adapt to complex anatomical structures. This approach would address limitations in traditional pixel-based segmentation while advancing personalized treatment strategies, demonstrating a unified methodology for integrating topological and mechanistic principles in both biomedical and computational domains.",
        "concept_original_list": [
            "Tislelizumab plus zanubrutinib for Richter transformation: the phase 2 RT1 trial: In patients with chronic lymphocytic leukemia, Richter transformation (RT) reflects the development of an aggressive lymphoma that is associated with poor response to chemotherapy and short survival. We initiated an international, investigator-initiated, prospective, open-label phase 2 study in which patients with RT received a combination of the PD-1 inhibitor tislelizumab plus the BTK inhibitor zanubrutinib for 12 cycles. Patients responding to treatment underwent maintenance treatment with both agents. The primary end point was overall response rate after six cycles. Of 59 enrolled patients, 48 patients received at least two cycles of treatment and comprised the analysis population according to the study protocol. The median observation time was 13.9 months, the median age was 67 (range 45–82) years. Ten patients (20.8%) had received previous RT-directed therapy. In total, 28 out of 48 patients responded to induction therapy with an overall response rate of 58.3% (95% confidence interval (CI) 43.2–72.4), including 9 (18.8%) complete response and 19 (39.6%) partial response, meeting the study’s primary end point by rejecting the predefined null hypothesis of 40% (P = 0.008). Secondary end points included duration of response, progression-free survival and overall survival. The median duration of response was not reached, the median progression-free survival was 10.0 months (95% CI 3.8–16.3). Median overall survival was not reached with a 12-month overall survival rate of 74.7% (95% CI 58.4–91.0). The most common adverse events were infections (18.0%), gastrointestinal disorders (13.0%) and hematological toxicities (11.4%). These data suggest that combined checkpoint and BTK inhibition by tislelizumab plus zanubrutinib is an effective and well-tolerated treatment strategy for patients with RT. ClinicalTrials.gov Identifier: NCT04271956. In a large single-arm phase 2 trial, the anti-PD-1 inhibitor tislelizumab combined with the next-generation BTK inhibitor zanubrutinib had an overall response rate of 58.3% and was well tolerated in patients with Richter’s transformation.",
            "Topologically Faithful Image Segmentation via Induced Matching of Persistence Barcodes: Segmentation models predominantly optimize pixel-overlap-based loss, an objective that is actually inadequate for many segmentation tasks. In recent years, their limitations fueled a growing interest in topology-aware methods, which aim to recover the topology of the segmented structures. However, so far, existing methods only consider global topological properties, ignoring the need to preserve topological features spatially, which is crucial for accurate segmentation. We introduce the concept of induced matchings from persistent homology to achieve a spatially correct matching between persistence barcodes in a segmentation setting. Based on this concept, we define the Betti matching error as an interpretable, topologically and feature-wise accurate metric for image segmentations, which resolves the limitations of the Betti number error. Our Betti matching error is differentiable and efficient to use as a loss function. We demonstrate that it improves the topological performance of segmentation networks significantly across six diverse datasets while preserving the performance with respect to traditional scores. Our code is publicly available (https://github.com/nstucki/Betti-matching/)."
        ]
    },
    {
        "id": "theme_025",
        "theme": "Causal Inference and Optimization: Bridging Latent Structure and Convergence Guarantees",
        "elaboration": "This research proposes a unified framework that integrates causal representation learning with probabilistic optimization theory to address the challenges of learning from heterogeneous, non-stationary data and ensuring convergence in learned optimization algorithms. By leveraging the sparsity constraints of causal graphs (from Concept 1) and probabilistic convergence guarantees (from Concept 2), we aim to develop a method that simultaneously reconstructs latent causal variables and ensures robust optimization. The core innovation lies in applying causal inference techniques to model the underlying structure of complex systems, which then inform the design of optimization algorithms that can converge to critical points even in non-smooth, non-convex loss landscapes. This approach would enable applications in domains like autonomous systems, where causal relationships are critical for decision-making, and optimization algorithms must adapt to dynamic data distributions. The novelty lies in merging causal discovery with probabilistic convergence analysis, offering a novel pathway to reliable learning and optimization in high-dimensional, uncertain environments.",
        "concept_original_list": [
            "Causal Representation Learning from Multiple Distributions: A General Setting: In many problems, the measured variables (e.g., image pixels) are just mathematical functions of the latent causal variables (e.g., the underlying concepts or objects). For the purpose of making predictions in changing environments or making proper changes to the system, it is helpful to recover the latent causal variables $Z_i$ and their causal relations represented by graph $\\mathcal{G}_Z$. This problem has recently been known as causal representation learning. This paper is concerned with a general, completely nonparametric setting of causal representation learning from multiple distributions (arising from heterogeneous data or nonstationary time series), without assuming hard interventions behind distribution changes. We aim to develop general solutions in this fundamental case; as a by-product, this helps see the unique benefit offered by other assumptions such as parametric causal models or hard interventions. We show that under the sparsity constraint on the recovered graph over the latent variables and suitable sufficient change conditions on the causal influences, interestingly, one can recover the moralized graph of the underlying directed acyclic graph, and the recovered latent variables and their relations are related to the underlying causal model in a specific, nontrivial way. In some cases, most latent variables can even be recovered up to component-wise transformations. Experimental results verify our theoretical claims.",
            "A Generalization Result for Convergence in Learning-to-Optimize: Learning-to-optimize leverages machine learning to accelerate optimization algorithms. While empirical results show tremendous improvements compared to classical optimization algorithms, theoretical guarantees are mostly lacking, such that the outcome cannot be reliably assured. Especially, convergence is hardly studied in learning-to-optimize, because conventional convergence guarantees in optimization are based on geometric arguments, which cannot be applied easily to learned algorithms. Thus, we develop a probabilistic framework that resembles classical optimization and allows for transferring geometric arguments into learning-to-optimize. Based on our new proof-strategy, our main theorem is a generalization result for parametric classes of potentially non-smooth, non-convex loss functions and establishes the convergence of learned optimization algorithms to critical points with high probability. This effectively generalizes the results of a worst-case analysis into a probabilistic framework, and frees the design of the learned algorithm from using safeguards."
        ]
    },
    {
        "id": "theme_040",
        "theme": "Systems-Level Integration of Genetic Regulatory Networks and Generative Inference for Efficient, Interpretable Model Training",
        "elaboration": "This research proposes a novel framework that merges the multi-omics insights from cerebral arachnoid cysts (ACs) with generative modeling techniques to develop a stable, interpretable model for complex data. By leveraging the regulatory networks and epigenomic dysregulation identified in ACs (Concept 1), we aim to design a generative model (Concept 2) that integrates dynamic regulatory constraints into its training process. Specifically, we will apply the principles of systems-level analysis from ACs—such as identifying causal relationships between de novo variants and developmental pathways—to engineer a Moment Matching Self-Distillation (MMSD) model that incorporates structured regulatory information. This approach would enable faster inference (via reduced steps) while maintaining distribution-level convergence, similar to how multi-omics data reveals coordinated developmental programs. The novelty lies in merging the epigenomic and genetic regulatory frameworks of ACs with the efficiency gains of MMSD, creating a hybrid model that balances interpretability and performance. This work would address the challenge of training stable generative models for high-dimensional data by harnessing the structured, causal insights from multi-omics studies, offering a new paradigm for both biological and computational systems.",
        "concept_original_list": [
            "Multiomic analyses implicate a neurodevelopmental program in the pathogenesis of cerebral arachnoid cysts: Cerebral arachnoid cysts (ACs) are one of the most common and poorly understood types of developmental brain lesion. To begin to elucidate AC pathogenesis, we performed an integrated analysis of 617 patient–parent (trio) exomes, 152,898 human brain and mouse meningeal single-cell RNA sequencing transcriptomes and natural language processing data of patient medical records. We found that damaging de novo variants (DNVs) were highly enriched in patients with ACs compared with healthy individuals (P = 1.57 × 10⁻³³). Seven genes harbored an exome-wide significant DNV burden. AC-associated genes were enriched for chromatin modifiers and converged in midgestational transcription networks essential for neural and meningeal development. Unsupervised clustering of patient phenotypes identified four AC subtypes and clinical severity correlated with the presence of a damaging DNV. These data provide insights into the coordinated regulation of brain and meningeal development and implicate epigenomic dysregulation due to DNVs in AC pathogenesis. Our results provide a preliminary indication that, in the appropriate clinical context, ACs may be considered radiographic harbingers of neurodevelopmental pathology warranting genetic testing and neurobehavioral follow-up. These data highlight the utility of a systems-level, multiomics approach to elucidate sporadic structural brain disease. In a cohort of patients with cerebral arachnoid cysts, multiomic analyses reveal de novo variants causing genetic neurodevelopmental conditions in up to 16% of cases, suggesting that surgery in these cases may not improve non-mass effect-related symptoms.",
            "Inductive Moment Matching: Diffusion models and Flow Matching generate high-quality samples but are slow at inference, and distilling them into few-step models often leads to instability and extensive tuning. To resolve these trade-offs, we propose Moment Matching Self-Distillation (MMSD), a new class of generative models for one- or few-step sampling with a single-stage training procedure. Unlike distillation, MMSD does not require pre-training initialization and optimization of two networks; and unlike Consistency Models, MMSD guarantees distribution-level convergence and remains stable under various hyperparameters and standard model architectures. MMSD surpasses diffusion models on ImageNet-256x256 with 2.13 FID using only 8 inference steps and achieves state-of-the-art 2-step FID of 2.05 on CIFAR-10 for a model trained from scratch."
        ]
    },
    {
        "id": "theme_087",
        "theme": "Physically Guided Orthogonal Annotation for Efficient Embodied AI Scene Synthesis",
        "elaboration": "This research bridges the gap between physical interactivity and annotation efficiency by integrating principles from PhyScene's physics-guided scene generation with orthogonal annotation techniques. By leveraging the structured constraints of physical interactions (e.g., object collision, reachability) in 3D environments, we propose a novel annotation framework where orthogonal slices (akin to orthogonal annotation in medical imaging) are used to capture critical spatial relationships. This approach reduces the annotation burden while maintaining physical fidelity, enabling efficient training of embodied AI agents. The dual-layer architecture—combining dense pseudo-labels for early-stage scene reconstruction with sparse labels for later stages—mirrors DeSCO's co-training paradigm, ensuring consistency between physical realism and annotation accuracy. This synergy allows for scalable, physically plausible scene generation with minimal manual annotation, paving the way for more interactive and interpretable AI environments.",
        "concept_original_list": [
            "PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI: With recent developments in Embodied Artificial Intelligence (EAI) research, there has been a growing demand for high-quality, large-scale interactive scene generation. While prior methods in scene synthesis have prioritized the naturalness and realism of the generated scenes, the physical plausibility and interactivity of scenes have been largely left unexplored. To address this disparity, we introduce PhyScene, a novel method dedicated to generating interactive 3D scenes characterized by realistic layouts, articulated objects, and rich physical interactivity tailored for embodied agents. Based on a conditional diffusion model for capturing scene layouts, we devise novel physics- and interactivity-based guidance mechanisms that integrate constraints from object collision, room layout, and object reachability. Through extensive experiments, we demonstrate that PhyScene effectively leverages these guidance functions for physically interactable scene synthesis, outperforming existing state-of-the-art scene synthesis methods by a large margin. Our findings suggest that the scenes generated by PhyScene hold considerable potential for facilitating diverse skill acquisition among agents within interactive environments, thereby catalyzing further advancements in embodied AI research.",
            "Orthogonal Annotation Benefits Barely-Supervised Medical Image Segmentation: Recent trends in semi-supervised learning have significantly boosted the performance of 3D semi-supervised medical image segmentation. Compared with 2D images, 3D medical volumes involve information from different directions, e.g., transverse, sagittal, and coronal planes, so as to naturally provide complementary views. These complementary views and the intrinsic similarity among adjacent 3D slices inspire us to develop a novel annotation way and its corresponding semi-supervised model for effective segmentation. Specifically, we firstly propose the orthogonal annotation by only labeling two orthogonal slices in a labeled volume, which significantly relieves the burden of annotation. Then, we perform registration to obtain the initial pseudo labels for sparsely labeled volumes. Subsequently, by introducing unlabeled volumes, we propose a dual-network paradigm named Dense-Sparse Co-training (DeSCO) that exploits dense pseudo labels in early stage and sparse labels in later stage and meanwhile forces consistent output of two networks. Experimental results on three benchmark datasets validated our effectiveness in performance and efficiency in annotation. For example, with only 10 annotated slices, our method reaches a Dice up to 86.93% on KiTS19 dataset."
        ]
    },
    {
        "id": "theme_064",
        "theme": "Adaptive Feature Selection with Differential Privacy: Integrating Log-Concave Sampling for Efficient Anomaly Detection in Privately Optimized Feature Spaces",
        "elaboration": "This research proposes a novel framework that merges principles from differential privacy and online convex optimization with adaptive feature selection for anomaly detection. By leveraging the concentration properties of log-concave densities (as used in Concept 1), we design a feature selection network that dynamically balances privacy guarantees and computational efficiency. The core innovation is to apply rejection sampling from differential privacy (Concept 1) to adaptive feature selection in RealNet (Concept 2), enabling the network to prioritize discriminative features while maintaining privacy. This approach addresses the challenge of synthesizing realistic anomalies (Concept 2) by using SDAS (Strength-controllable Diffusion Anomaly Synthesis) to generate diverse anomaly samples, combined with AFS (Anomaly-aware Features Selection) to refine feature subsets. The integration of log-concave sampling ensures that the feature selection process adheres to privacy constraints while optimizing for anomaly detection performance. By treating the feature space as a privacy-protected optimization problem, the algorithm achieves both privacy guarantees and high detection accuracy, offering a scalable solution for real-world applications in industrial anomaly detection.",
        "concept_original_list": [
            "Improved Differentially Private and Lazy Online Convex Optimization: Lower Regret without Smoothness Requirements: We design differentially private regret-minimizing algorithms in the online convex optimization (OCO) framework. Unlike recent results, our algorithms and analyses do not require smoothness, thus yielding the first private regret bounds with an optimal leading-order term for non-smooth loss functions. Additionally, even for smooth losses, the resulting regret guarantees improve upon previous results in terms of their dependence on the dimension. Our results provide the best known rates for DP-OCO in all practical regimes of the privacy parameter, barring when it is exceptionally small. The principal innovation in our algorithm design is the use of sampling from strongly log-concave densities which satisfy the Log-Sobolev Inequality. The resulting concentration of measure allows us to obtain a better trade-off for the dimension factors than prior work, leading to improved results. Following previous works on DP-OCO, the proposed algorithm explicitly limits the number of switches via rejection sampling. Thus, independently of privacy constraints, the algorithm also provides improved results for online convex optimization with a switching budget.",
            "RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection: Self-supervised feature reconstruction methods have shown promising advances in industrial image anomaly detection and localization. Despite this progress, these methods still face challenges in synthesizing realistic and diverse anomaly samples, as well as addressing the feature redundancy and pre-training bias of pre-trained features. In this work, we introduce RealNet, a feature reconstruction network with realistic synthetic anomaly and adaptive feature selection. It is incorporated with three key innovations: First, we propose Strength-controllable Diffusion Anomaly Synthesis (SDAS), a diffusion process-based synthesis strategy capable of generating samples with varying anomaly strengths that mimic the distribution of real anomalous samples. Second, we develop Anomaly-aware Features Selection (AFS), a method for selecting representative and discriminative pre-trained feature subsets to improve anomaly detection performance while controlling computational costs. Third, we introduce Reconstruction Residuals Selection (RRS), a strategy that adaptively selects discriminative residuals for comprehensive identification of anomalous regions across multiple levels of granularity. We assess RealNet on four benchmark datasets, and our results demonstrate significant improvements in both Image AUROC and Pixel AUROC compared to the current state-of-the-art methods. The code, data, and models are available at https://github.com/cnulab/RealNet."
        ]
    },
    {
        "id": "theme_185",
        "theme": "Integrating Timescale Decoupling with Trajectory Embedding for Robust Dynamical Similarity Analysis",
        "elaboration": "This research proposes a novel framework that combines the timescale decoupling strategy from KoopSTD with learned trajectory embedding techniques to address the challenge of dynamic system similarity analysis. By leveraging KoopSTD's ability to decompose the Koopman spectrum into temporally decoupled components, we design a trajectory embedding framework that explicitly captures the underlying dynamical structure of high-dimensional motion data. This approach ensures that trajectory embeddings are invariant to coordinate transformations, similar to KoopSTD's invariance properties, while preserving the geometric relationships between trajectories. The integration of timescale decoupling enables the embedding space to distinguish subtle dynamical differences across multiple temporal scales, enhancing clustering accuracy and robustness. This synergy allows for efficient, interpretable analysis of complex systems, such as neural dynamics or large language models, by mapping trajectories into a space where similarity metrics (e.g., KoopSTD's spectral residuals) can be directly computed. The framework addresses the limitations of traditional methods by combining the scalability of trajectory embedding with the temporal resolution of KoopSTD, offering a unified approach to dynamic similarity analysis and motion parameter recovery.",
        "concept_original_list": [
            "KoopSTD: Reliable Similarity Analysis between Dynamical Systems via Approximating Koopman Spectrum with Timescale Decoupling: Determining the similarity between dynamical systems remains a long-standing challenge in both machine learning and neuroscience. Recent works based on Koopman operator theory have proven effective in analyzing dynamical similarity by examining discrepancies in the Koopman spectrum. Nevertheless, existing similarity metrics can be severely constrained when systems exhibit complex nonlinear behaviors across multiple temporal scales. In this work, we propose **KoopSTD**, a dynamical similarity measurement framework that precisely characterizes the underlying dynamics by approximating the Koopman spectrum with explicit timescale decoupling and spectral residual control. We show that KoopSTD maintains invariance under several common representation-space transformations, which ensures robust measurements across different coordinate systems. Our extensive experiments on physical and neural systems validate the effectiveness, scalability, and robustness of KoopSTD compared to existing similarity metrics. We also apply KoopSTD to explore two open-ended research questions in neuroscience and large language models, highlighting its potential to facilitate future scientific and engineering discoveries. Code is available at [link](https://github.com/ZhangShimin1/KoopSTD).",
            "Learned Trajectory Embedding for Subspace Clustering: Clustering multiple motions from observed point trajectories is a fundamental task in understanding dynamic scenes. Most motion models require multiple tracks to estimate their parameters, hence identifying clusters when multiple motions are observed is a very challenging task. This is aggravated further for high-dimensional motion models. The starting point of our work is that the high dimensionality of the motion model can actually be leveraged to our advantage, as sufficiently long trajectories identify the underlying motion uniquely in practice. Consequently, we propose to learn a mapping from trajectories to embedding vectors that represent the generating motion. The obtained trajectory embeddings are useful for clustering multiple observed motions, but are also trained to contain sufficient information to recover the parameters of the underlying motion by utilizing a geometric loss. We are therefore able to use only weak supervision from given motion segmentation to train this mapping. The entire algorithm, consisting of trajectory embedding, clustering, and motion parameter estimation, is highly efficient. We conduct experiments on the Hopkins155, Hopkins12, and KT3DMoSeg datasets and show state-of-the-art performance of our proposed method for trajectory-based motion segmentation on full sequences and its competitiveness on the occluded sequences. Project page: https://ylochman.github.io/trajectory-embedding."
        ]
    },
    {
        "id": "theme_122",
        "theme": "Adaptive Semantic Guidance for Secure Text Generation in Autonomous Systems",
        "elaboration": "This research proposes a unified framework that merges adaptive watermarking techniques from large language models with goal-driven trajectory generation in autonomous systems. By leveraging semantic embeddings (as in Concept 1's adaptive logit scaling) and goal-point selection (as in Concept 2's Flow Matching), the framework ensures secure, high-quality text generation while maintaining robustness. The adaptive watermarking strategy, which dynamically adjusts token distributions based on entropy measurements, parallels GoalFlow's constraint-based trajectory generation, where semantic guidance (via goal points) ensures consistency between scene information and generated trajectories. This synergy addresses challenges in both domains—secure text authentication and reliable multimodal navigation—by integrating adaptive constraints with semantic fidelity, enabling systems to balance security, performance, and interpretability in complex, real-world scenarios.",
        "concept_original_list": [
            "Adaptive Text Watermark for Large Language Models: The advancement of Large Language Models (LLMs) has led to increasing concerns about the misuse of AI-generated text, and watermarking LLM-generated text has emerged as a potential solution. However, it is challenging to generate high-quality watermarked text while maintaining robustness, security, and the ability to detect watermarks without prior knowledge of the prompt and model. This paper proposes an adaptive text watermarking strategy to address this challenge. To improve the text quality and maintain robustness, we adaptively add watermarking to token distributions with high entropy measured by an auxiliary model and keep the low-entropy token distributions untouched. For the sake of security and to further minimize the watermark's impact on text quality, instead of using a fixed green/red list generated from a random secret key, which can be vulnerable to decryption and forgery, we adaptively scale up the output logits based on the semantic embedding of previously generated text using a well-designed semantic mapping model. Our experiments involving various LLMs demonstrate that our approach achieves comparable robustness performance to existing watermark methods. Additionally, the text generated by our method has perplexity comparable to that of *un-watermarked* LLMs while maintaining sufficient security.",
            "GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving: We propose GoalFlow, an end-to-end autonomous driving method for generating high-quality multimodal trajectories. In autonomous driving scenarios, there is rarely a single suitable trajectory. Recent methods have increasingly focused on modeling multimodal trajectory distributions. However, they suffer from trajectory selection complexity and reduced trajectory quality due to high trajectory divergence and inconsistencies between guidance and scene information. To address these issues, we introduce GoalFlow, a novel method that effectively constrains the generative process to produce high-quality, multimodal trajectories. To resolve the trajectory divergence problem inherent in diffusion-based methods, GoalFlow constrains the generated trajectories by introducing a goal point. GoalFlow establishes a novel scoring mechanism that selects the most appropriate goal point from the candidate points based on scene information. Furthermore, GoalFlow employs an efficient generative method, Flow Matching, to generate multimodal trajectories, and incorporates a refined scoring mechanism to select the optimal trajectory from the candidates. Our experimental results, validated on Navsim, demonstrate that GoalFlow achieves state-of-the-art performance, delivering robust multimodal trajectories for autonomous driving. GoalFlow achieves a PDMS of 90.3, significantly surpassing other methods. Compared with other diffusion-policy-based methods, our approach requires only a single denoising step to obtain excellent performance. The code is available at https://github.com/YvanYin/GoalFlow."
        ]
    },
    {
        "id": "theme_124",
        "theme": "Adaptive Data Augmentation for Efficient Estimation Protocols",
        "elaboration": "This research proposes a novel framework that merges the principles of data-driven augmentation from AugSeg with the optimization of communication efficiency in distinct element estimation. By leveraging the adaptive nature of AugSeg—where data transformations are dynamically adjusted based on model confidence—we introduce a protocol that balances computational complexity with accuracy. Specifically, we adapt the concept of 'confidence-based labeling' from AugSeg to the problem of distributed distinct element estimation, where the number of collisions (C) determines the protocol's complexity. By integrating a simplified intensity-based augmentation mechanism, similar to AugSeg's approach, we dynamically adjust the number of collisions and communication bits in the estimation protocol. This allows the protocol to optimize for both accuracy and efficiency, breaking previous lower bounds when C is small, while maintaining robustness under varying network conditions. The key innovation lies in treating data augmentation as a dynamic, context-aware process, enabling efficient estimation with minimal communication overhead, thus bridging the gap between interpretability and practical scalability in distributed systems.",
        "concept_original_list": [
            "Augmentation Matters: A Simple-Yet-Effective Approach to Semi-Supervised Semantic Segmentation: Recent studies on semi-supervised semantic segmentation (SSS) have seen fast progress. Despite their promising performance, current state-of-the-art methods tend toward increasingly complex designs at the cost of introducing more network components and additional training procedures. In contrast, in this work, we follow a standard teacher-student framework and propose AugSeg, a simple and clean approach that focuses mainly on data perturbations to boost the SSS performance. We argue that various data augmentations should be adjusted to better adapt to the semi-supervised scenarios instead of directly applying these techniques from supervised learning. Specifically, we adopt a simplified intensity-based augmentation that selects a random number of data transformations with distortion strengths sampled uniformly from a continuous space. Based on the estimated confidence of the model on different unlabeled samples, we also randomly inject labeled information to augment the unlabeled samples in an adaptive manner. Without bells and whistles, our simple AugSeg can readily achieve new state-of-the-art performance on SSS benchmarks under different partition protocols.",
            "On Fine-Grained Distinct Element Estimation: We study the problem of distributed distinct element estimation, where $\\alpha$ servers each receive a subset of a universe $[n]$ and aim to compute a $(1+\\varepsilon)$-approximation to the number of distinct elements using minimal communication. While prior work establishes a worst-case bound of $\\Theta\\left(\\alpha\\log n+\\frac{\\alpha}{\\varepsilon^2}\\right)$ bits, these results rely on assumptions that may not hold in practice. We introduce a new parameterization based on the number $C = \\frac{\\beta}{\\varepsilon^2}$ of pairwise collisions, i.e., instances where the same element appears on multiple servers, and design a protocol that uses only $O\\left(\\alpha\\log n\\log\\log n+\\frac{\\sqrt{\\beta}}{\\varepsilon^2} \\log n\\right)$ bits, breaking previous lower bounds when $C$ is small. We further improve our algorithm under assumptions on the number of distinct elements or collisions and provide matching lower bounds in all regimes, establishing $C$ as a tight complexity measure for the problem. Finally, we consider streaming algorithms for distinct element estimation parameterized by the number of items with frequency larger than $1$. Overall, our results offer insight into why statistical problems with known hardness results can be efficiently solved in practice."
        ]
    },
    {
        "id": "theme_009",
        "theme": "Robust Hypertension Care Cascades with Invariant Representation Learning",
        "elaboration": "This research proposes integrating principles from vector quantization and self-attention (from Concept 2) into hypertension care cascades to address socioeconomic disparities in cardiovascular disease (CVD) outcomes. By modeling the healthcare system as a cascading process (Concept 1), we aim to simulate how targeted interventions (diagnosis and treatment) affect CVD risk across wealth quintiles. The challenge lies in ensuring these interventions are robust to systemic inequities, such as varying access to care or data-driven disparities. Drawing from Concept 2, we introduce a framework where the hypertension care cascade is represented as a tensor network, with discrete vector quantization (VQ) modules to eliminate redundancy in recognition (e.g., identifying effective interventions) and self-attention to enhance feature cohesion. This would allow the system to adapt to heterogeneous data (e.g., varying baseline hypertension management in low- and middle-income countries) while maintaining invariant, equitable outcomes. The novelty lies in applying invariant representation learning to healthcare systems, ensuring that improvements in diagnosis and treatment are robust to socioeconomic disparities, thereby achieving both precision in CVD risk prediction and fairness in health outcomes across wealth quintiles.",
        "concept_original_list": [
            "Hypertension care cascades and reducing inequities in cardiovascular disease in low- and middle-income countries: Improving hypertension control in low- and middle-income countries has uncertain implications across socioeconomic groups. In this study, we simulated improvements in the hypertension care cascade and evaluated the distributional benefits across wealth quintiles in 44 low- and middle-income countries using individual-level data from nationally representative, cross-sectional surveys. We raised diagnosis (diagnosis scenario) and treatment (treatment scenario) levels for all wealth quintiles to match the best-performing country quintile and estimated the change in 10-year cardiovascular disease (CVD) risk of individuals initiated on treatment. We observed greater health benefits among bottom wealth quintiles in middle-income countries and in countries with larger baseline disparities in hypertension management. Lower-middle-income countries would see the greatest absolute benefits among the bottom quintiles under the treatment scenario (29.1 CVD cases averted per 1,000 people living with hypertension in the bottom quintile (Q1) versus 17.2 in the top quintile (Q5)), and the proportion of total CVD cases averted would be largest among the lowest quintiles in upper-middle-income countries under both diagnosis (32.0% of averted cases in Q1 versus 11.9% in Q5) and treatment (29.7% of averted cases in Q1 versus 14.0% in Q5) scenarios. Targeted improvements in hypertension diagnosis and treatment could substantially reduce socioeconomic-based inequalities in CVD burden in low- and middle-income countries. Simulation of improvements in hypertension care across wealth quintiles in 44 low- and middle-income countries demonstrates that targeted improvements in diagnosis and treatment could considerably reduce within-country, socioeconomic-based inequalities in cardiovascular disease burden.",
            "Vector Quantization With Self-Attention for Quality-Independent Representation Learning: Recently, the robustness of deep neural networks has drawn extensive attention due to the potential distribution shift between training and testing data (e.g., deep models trained on high-quality images are sensitive to corruption during testing). Many researchers attempt to make the model learn invariant representations from multiple corrupted data through data augmentation or image-pair-based feature distillation to improve the robustness. Inspired by sparse representation in image restoration, we opt to address this issue by learning image-quality-independent feature representation in a simple plug-and-play manner, that is, to introduce discrete vector quantization (VQ) to remove redundancy in recognition models. Specifically, we first add a codebook module to the network to quantize deep features. Then we concatenate them and design a self-attention module to enhance the representation. During training, we enforce the quantization of features from clean and corrupted images in the same discrete embedding space so that an invariant quality-independent feature representation can be learned to improve the recognition robustness of low-quality images. Qualitative and quantitative experimental results show that our method achieved this goal effectively, leading to a new state-of-the-art result of 43.1% mCE on ImageNet-C with ResNet50 as the backbone. On other robustness benchmark datasets, such as ImageNet-R, our method also has an accuracy improvement of almost 2%."
        ]
    },
    {
        "id": "theme_002",
        "theme": "Transformers for Efficient Human Interaction: Integrating Contextual Awareness and Scalability in Gaze Prediction and Human Reconstruction",
        "elaboration": "Gazeformer and FeatER both leverage transformers to address challenges in human interaction tasks, but they diverge in their focus: Gazeformer uses natural language models (NLMs) to encode target-specific contextual cues for gaze prediction, while FeatER optimizes transformer architectures for feature map processing in human reconstruction. The theme explores how these two approaches can be unified by designing a transformer framework that dynamically adapts to both gaze prediction and human structural data. By integrating NLM-based contextual encoding with feature map preservation in FeatER, the proposed model would enable efficient, scalable processing of complex human interaction tasks. The novelty lies in creating a transformer architecture that balances contextual awareness (via NLMs) with structural efficiency (via feature map processing), addressing the limitations of existing models in computational cost and interpretability. This synthesis would advance human-computer interaction by enabling real-time, high-accuracy gaze prediction and human reconstruction in diverse scenarios, such as augmented reality or autonomous systems.",
        "concept_original_list": [
            "Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention: Predicting human gaze is important in Human-Computer Interaction (HCI). However, to practically serve HCI applications, gaze prediction models must be scalable, fast, and accurate in their spatial and temporal gaze predictions. Recent scanpath prediction models focus on goal-directed attention (search). Such models are limited in their application due to a common approach relying on trained target detectors for all possible objects, and the availability of human gaze data for their training (both not scalable). In response, we pose a new task called ZeroGaze, a new variant of zero-shot learning where gaze is predicted for never-before-searched objects, and we develop a novel model, Gazeformer, to solve the ZeroGaze problem. In contrast to existing methods using object detector modules, Gazeformer encodes the target using a natural language model, thus leveraging semantic similarities in scanpath prediction. We use a transformer-based encoder-decoder architecture because transformers are particularly useful for generating contextual representations. Gazeformer surpasses other models by a large margin (19% - 70%) on the ZeroGaze setting. It also outperforms existing target-detection models on standard gaze prediction for both target-present and target-absent search tasks. In addition to its improved performance, Gazeformer is more than five times faster than the state-of-the-art target-present visual search model.",
            "FeatER: An Efficient Network for Human Reconstruction via Feature Map-Based TransformER: Recently, vision transformers have shown great success in a set of human reconstruction tasks such as 2D human pose estimation (2D HPE), 3D human pose estimation (3D HPE), and human mesh reconstruction (HMR) tasks. In these tasks, feature map representations of the human structural information are often extracted first from the image by a CNN (such as HRNet), and then further processed by a transformer to predict the heatmaps (which encode each joint's location into a feature map with a Gaussian distribution) for HPE or HMR. However, existing transformer architectures are not able to process these feature map inputs directly, forcing an unnatural flattening of the location-sensitive human structural information. Furthermore, much of the performance benefit in recent HPE and HMR methods has come at the cost of ever-increasing computation and memory needs. Therefore, to simultaneously address these problems, we propose FeatER, a novel transformer design which preserves the inherent structure of feature map representations when modeling attention while reducing the memory and computational costs. Taking advantage of FeatER, we build an efficient network for a set of human reconstruction tasks including 2D HPE, 3D HPE, and HMR. A feature map reconstruction module is applied to improve the performance of the estimated human pose and mesh. Extensive experiments demonstrate the effectiveness of FeatER on various human pose and mesh datasets. For instance, FeatER outperforms the SOTA method MeshGraphormer while requiring 5% of Params (total parameters) and 16% of MACs (Multiply-Accumulate Operations) on the Human3.6M and 3DPW datasets. Code will be publicly available."
        ]
    },
    {
        "id": "theme_051",
        "theme": "Integrating Temporal Dynamics and Overfitting Resistance in Multi-Modal Geospatial Learning for Vegetation Forecasting",
        "elaboration": "This research proposes a novel framework that merges the temporal context-awareness of Contextformer with the robustness of overfitting-resistant neural networks to address the challenges of geospatial vegetation forecasting. By leveraging the multi-modal nature of satellite imagery (e.g., Sentinel-2) and meteorological data, the model captures spatial patterns and temporal dynamics simultaneously. The Contextformer architecture, which efficiently extracts spatial context through a transformer backbone, is adapted to incorporate temporal dependencies via localized patches and meteorological time series. To address overfitting in high-dimensional, correlated data (e.g., vegetation greenness across Europe), the model employs a parameter-efficient training strategy that balances complexity with generalization. This approach not only improves prediction accuracy for anomalies beyond seasonal cycles but also demonstrates that over-parameterized neural networks can learn complex, non-linear patterns (like vegetation dynamics) with minimal overfitting, even in the presence of noise. The GreenEarthNet dataset, designed for precise vegetation forecasting, serves as a foundation for evaluating these techniques, while the proposed method bridges the gap between spatial-temporal modeling and robust learning, enabling scalable, real-world applications in agriculture, climate modeling, and disaster response.",
        "concept_original_list": [
            "Multi-modal Learning for Geospatial Vegetation Forecasting: Precise geospatial vegetation forecasting holds potential across diverse sectors, including agriculture, forestry, humanitarian aid, and carbon accounting. To leverage the vast availability of satellite imagery for this task, various works have applied deep neural networks for predicting multispectral images in photorealistic quality. However, the important area of vegetation dynamics has not been thoroughly explored. Our study introduces GreenEarthNet, the first dataset specifically designed for high-resolution vegetation forecasting, and Contextformer, a novel deep learning approach for predicting vegetation greenness from Sentinel 2 satellite images with fine resolution across Europe. Our multi-modal transformer model Contextformer leverages spatial context through a vision backbone and predicts the temporal dynamics on local context patches, incorporating meteorological time series in a parameter-efficient manner. The GreenEarthNet dataset features a learned cloud mask and an appropriate evaluation scheme for vegetation modeling. It also maintains compatibility with the existing satellite imagery forecasting dataset EarthNet2021, enabling cross-dataset model comparisons. Our extensive qualitative and quantitative analyses reveal that our methods outperform a broad range of baseline techniques. This includes surpassing previous state-of-the-art models on EarthNet2021, as well as adapted models from time series forecasting and video prediction. To the best of our knowledge, this work presents the first models for continental-scale vegetation modeling at fine resolution able to capture anomalies beyond the seasonal cycle, thereby paving the way for predicting vegetation health and behaviour in response to climate variability and extremes. We provide open source code and pre-trained weights to reproduce our experimental results under https://github.com/vitusbenson/greenearthnet.",
            "Benign Overfitting in Two-Layer ReLU Convolutional Neural Networks for XOR Data: Modern deep learning models are usually highly over-parameterized so that they can overfit the training data. Surprisingly, such overfitting neural networks can usually still achieve high prediction accuracy. To study this \"benign overfitting\" phenomenon, a line of recent works has theoretically studied the learning of linear models and two-layer neural networks. However, most of these analyses are still limited to very simple learning problems where the Bayes-optimal classifier is linear. In this work, we investigate a class of XOR-type classification tasks with label-flipping noise. We show that, under a certain condition on the sample complexity and signal-to-noise ratio, an over-parameterized ReLU CNN trained by gradient descent can achieve near Bayes-optimal accuracy. Moreover, we also establish a matching lower bound result showing that when the previous condition is not satisfied, the prediction accuracy of the obtained CNN is an absolute constant away from the Bayes-optimal rate. Our result demonstrates that CNNs have a remarkable capacity to efficiently learn XOR problems, even in the presence of highly correlated features."
        ]
    },
    {
        "id": "theme_027",
        "theme": "Dynamic Probabilistic Modeling for Enhanced Data Quality in 3D and Graph Domains",
        "elaboration": "GaussianAvatar's use of animatable 3D Gaussians to model dynamic human appearances and GraphCleaner's focus on detecting and correcting mislabelled graph nodes share a common goal of improving data quality through dynamic, probabilistic frameworks. By leveraging the principles of dynamic property modeling (e.g., motion in 3D Gaussians and label dependencies in graphs), this research proposes a unified approach to enhance data fidelity. The theme explores how probabilistic, adaptive models can simultaneously optimize both the realism of 3D avatars and the accuracy of graph datasets, addressing challenges in motion estimation and label error detection. The novelty lies in integrating dynamic optimization techniques from 3D modeling into graph data processing, enabling robust, interpretable models for high-stakes applications such as virtual humans and graph-based machine learning. This bridges the gap between 3D and graph domains by treating both as dynamic systems requiring iterative refinement, ensuring that the optimization process accounts for both structural and label-related uncertainties.",
        "concept_original_list": [
            "GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians: We present GaussianAvatar, an efficient approach to creating realistic human avatars with dynamic 3D appearances from a single video. We start by introducing animatable 3D Gaussians to explicitly represent humans in various poses and clothing styles. Such an explicit and animatable representation can fuse 3D appearances more efficiently and consistently from 2D observations. Our representation is further augmented with dynamic properties to support pose-dependent appearance modeling, where a dynamic appearance network along with an optimizable feature tensor is designed to learn the motion-to-appearance mapping. Moreover, by leveraging the differentiable motion condition, our method enables a joint optimization of motions and appearances during avatar modeling, which helps to tackle the long-standing issue of inaccurate motion estimation in monocular settings. The efficacy of GaussianAvatar is validated on both the public dataset and our collected dataset, demonstrating its superior performance in terms of appearance quality and rendering efficiency.",
            "GraphCleaner: Detecting Mislabelled Samples in Popular Graph Learning Benchmarks: Label errors have been found to be prevalent in popular text, vision, and audio datasets, which heavily influence the safe development and evaluation of machine learning algorithms. Despite increasing efforts towards improving the quality of generic data types, such as images and texts, the problem of mislabel detection in graph data remains underexplored. To bridge the gap, we explore mislabelling issues in popular real-world graph datasets and propose GraphCleaner, a post-hoc method to detect and correct these mislabelled nodes in graph datasets. GraphCleaner combines the novel ideas of 1) Synthetic Mislabel Dataset Generation, which seeks to generate realistic mislabels; and 2) Neighborhood-Aware Mislabel Detection, where neighborhood dependency is exploited in both labels and base classifier predictions. Empirical evaluations on 6 datasets and 6 experimental settings demonstrate that GraphCleaner outperforms the closest baseline, with an average improvement of $0.14$ in F1 score, and $0.16$ in MCC. On real-data case studies, GraphCleaner detects real and previously unknown mislabels in popular graph benchmarks: PubMed, Cora, CiteSeer and OGB-arxiv; we find that at least 6.91% of PubMed data is mislabelled or ambiguous, and simply removing these mislabelled data can boost evaluation performance from 86.71% to 89.11%."
        ]
    },
    {
        "id": "theme_062",
        "theme": "Tensor-Based Optimization for Efficient Survival Analysis: Leveraging FlashTP's Principles to Enhance Doubly Protected Estimators for Censored Data",
        "elaboration": "This research proposes integrating the tensor-product optimization strategies from FlashTP into survival analysis frameworks to address computational bottlenecks and bias in doubly protected estimators. By applying kernel fusion and sparse computation techniques, we aim to accelerate the training of models that estimate treatment-specific restricted mean survival time differences, similar to how FlashTP optimizes tensor-product layers in MLIPs. The framework would leverage tensor network architectures to represent high-dimensional survival data, enabling efficient contraction algorithms (e.g., DMRG) while mitigating biases from external controls. This approach would combine FlashTP's memory-efficient tensor operations with doubly robust estimation methods, ensuring both speed and accuracy in survival analysis. The novelty lies in applying tensor-based optimization principles to statistical estimation, enabling scalable, bias-minimized models for rare disease studies and real-world clinical trials where external controls are critical but may introduce heterogeneity. The research would explore how these techniques can be adapted to handle censored data, reducing computational overhead while maintaining the robustness of doubly protected estimators.",
        "concept_original_list": [
            "FlashTP: Fused, Sparsity-Aware Tensor Product for Machine Learning Interatomic Potentials: Machine Learning Interatomic Potentials (MLIPs) enable efficient molecular dynamics (MD) simulations with high accuracy. While equivariant MLIPs achieve state-of-the-art accuracy, they face significant computational bottlenecks centered around their Tensor-Product layer, which account for up to 75\\% of training time and cause substantial memory overhead. We present FlashTP, a highly optimized tensor-product library that addresses these inefficiencies through kernel fusion, sparse computation, and path-aggregated execution. FlashTP achieves up to 41.6$\\times$ and 60.8$\\times$ kernel speedups over _e3nn_ and NVIDIA cuEquivariance, respectively. For SevenNet-l3i5, it delivers 4.2$\\times$ and 3.5$\\times$ speedup while reducing peak memory usage by 6.3$\\times$ and 6.2$\\times$ for inference and training, respectively. The code is available at https://github.com/SNU-ARC/flashTP.",
            "Doubly Protected Estimation for Survival Outcomes Utilizing External Controls for Randomized Clinical Trials: Censored survival data are common in clinical trials, but small control groups can pose challenges, particularly in rare diseases or where balanced randomization is impractical. Recent approaches leverage external controls from historical studies or real-world data to strengthen treatment evaluation for survival outcomes. However, using external controls directly may introduce biases due to data heterogeneity. We propose a doubly protected estimator for the treatment-specific restricted mean survival time difference that is more efficient than trial-only estimators and mitigates biases from external data. Our method adjusts for covariate shifts via doubly robust estimation and addresses outcome drift using the DR-Learner for selective borrowing. The approach can incorporate machine learning to approximate survival curves and detect outcome drifts without strict parametric assumptions, borrowing only comparable external controls. Extensive simulation studies and a real-data application evaluating the efficacy of Galcanezumab in mitigating migraine headaches have been conducted to illustrate the effectiveness of our proposed framework."
        ]
    },
    {
        "id": "theme_105",
        "theme": "Symmetry-Driven Disentangled Networks for Robust Multi-Agent and Class Incremental Learning",
        "elaboration": "This research proposes a unified framework that integrates E(3)-equivariant actor-critic methods with disentangled manifold learning to address challenges in cooperative multi-agent reinforcement learning (MARL) and class incremental learning (CIL). By leveraging the inherent symmetries of E(3) in MARL, the framework ensures that symmetric optimal policies and generalized strategies are preserved across diverse environments, enabling zero-shot learning and transferability. In CIL, disentangled manifolds are employed to separate class-specific features, mitigating feature overlap and class-wise confusion. The integration of these principles involves designing neural architectures that enforce symmetry constraints while using disentangled latent representations to distinguish between classes. This dual approach enhances model robustness, reduces catastrophic forgetting, and improves generalization across unseen tasks, offering a novel solution for both cooperative MARL and incremental learning scenarios.",
        "concept_original_list": [
            "${\\rm E}(3)$-Equivariant Actor-Critic Methods for Cooperative Multi-Agent Reinforcement Learning: Identification and analysis of symmetrical patterns in the natural world have led to significant discoveries across various scientific fields, such as the formulation of gravitational laws in physics and advancements in the study of chemical structures. In this paper, we focus on exploiting Euclidean symmetries inherent in certain cooperative multi-agent reinforcement learning (MARL) problems and prevalent in many applications. We begin by formally characterizing a subclass of Markov games with a general notion of symmetries that admits the existence of symmetric optimal values and policies. Motivated by these properties, we design neural network architectures with symmetric constraints embedded as an inductive bias for multi-agent actor-critic methods. This inductive bias results in superior performance in various cooperative MARL benchmarks and impressive generalization capabilities such as zero-shot learning and transfer learning in unseen scenarios with repeated symmetric patterns.",
            "Reducing Class-wise Confusion for Incremental Learning with Disentangled Manifolds: Class incremental learning (CIL) aims to enable models to continuously learn new classes without catastrophically forgetting old ones. A promising direction is to learn and use prototypes of classes during incremental updates. Despite simplicity and intuition, we find that such methods suffer from inadequate representation capability and unsatisfied feature overlap. These two factors cause class-wise confusion and limited performance. In this paper, we develop a Confusion-REduced AuTo-Encoder classifier (CREATE) for CIL. Specifically, our method employs a lightweight auto-encoder module to learn compact manifold for each class in the latent subspace, constraining samples to be well reconstructed only on the semantically correct auto-encoder. Thus, the representation stability and capability of class distributions are enhanced, alleviating the potential class-wise confusion problem. To further distinguish the overlapped features, we propose a confusion-aware latent space separation loss that ensures samples are closely distributed in their corresponding low-dimensional manifold while keeping away from the distributions of drifted features from other classes. Our method demonstrates stronger representational capacity and discrimination ability by learning disentangled manifolds and reduces class confusion. Extensive experiments on multiple datasets and settings show that CREATE outperforms other state-of-the-art methods up to 5.41%."
        ]
    },
    {
        "id": "theme_118",
        "theme": "Global Structural Modeling for Data Integrity and Accuracy: Integrating Homography-Based Context Preservation with Topographic Inference for Robust Environmental Monitoring",
        "elaboration": "This research proposes a unified framework that leverages the principles of homography-based context preservation (as in HomoGen) and topographic structure inference (as in canopy height estimation) to enhance data integrity and accuracy in environmental monitoring. By treating both satellite-derived topographic data and video inpainting tasks as problems involving global structural consistency, the framework integrates homography propagation to preserve semantic coherence across spatial scales with a diffusion-based model that prioritizes structural integrity over pixel-level noise. The core innovation lies in developing a hybrid architecture where homography-induced global constraints are dynamically adapted to scale with video dynamics, while topographic features are encoded as structured tensors to mitigate geolocation errors. This approach enables simultaneous estimation of canopy height and video inpainting with reduced error propagation, offering a scalable solution for high-resolution environmental analysis and autonomous video restoration. The novelty resides in merging these two domains' methodologies into a single framework that preserves both spatial and semantic fidelity, addressing limitations of isolated techniques in handling complex, heterogeneous data.",
        "concept_original_list": [
            "Estimating Canopy Height at Scale: We propose a framework for global-scale canopy height estimation based on satellite data. Our model leverages advanced data preprocessing techniques, resorts to a novel loss function designed to counter geolocation inaccuracies inherent in the ground-truth height measurements, and employs data from the Shuttle Radar Topography Mission to effectively filter out erroneous labels in mountainous regions, enhancing the reliability of our predictions in those areas. A comparison between predictions and ground-truth labels yields an MAE/RMSE of 2.43 / 4.73 (meters) overall and 4.45 / 6.72 (meters) for trees taller than five meters, which depicts a substantial improvement compared to existing global-scale products. The resulting height map as well as the underlying framework will facilitate and enhance ecological analyses at a global scale, including, but not limited to, large-scale forest and biomass monitoring.",
            "HomoGen: Enhanced Video Inpainting via Homography Propagation and Diffusion: In this paper, we present HomoGen, an enhanced video inpainting method based on homography propagation and diffusion models. HomoGen leverages homography registration to propagate contextual pixels as priors for generating missing content in corrupted videos. Unlike previous flow-based propagation methods, which introduce local distortions due to point-to-point optical flows, homography-induced artifacts are typically global structural distortions that preserve semantic integrity. To effectively utilize these priors for generation, we employ a video diffusion model that inherently prioritizes semantic information within the priors over pixel-level details. A content-adaptive control mechanism is proposed to scale and inject the priors into intermediate video latents during iterative denoising. In contrast to existing transformer-based networks that often suffer from artifacts within priors, leading to error accumulation and unrealistic results, our denoising diffusion network can smooth out artifacts and ensure natural outputs. Extensive experiments demonstrate the effectiveness of the proposed method qualitatively and quantitatively."
        ]
    },
    {
        "id": "theme_026",
        "theme": "Quantum-Inspired Optimization for Personalized Cancer Treatment: Leveraging Quantum Algorithms to Model Immune Response Dynamics in NSCLC",
        "elaboration": "This research proposal merges quantum computing principles with clinical oncology to develop novel algorithms for optimizing neoadjuvant therapy in non-small cell lung cancer (NSCLC). By drawing parallels between the complex, high-dimensional optimization problems in quantum control (e.g., finding optimal control variables for Schrödinger equation solutions) and the biological optimization challenges in cancer treatment (e.g., selecting drug combinations to maximize pathologic responses), the study proposes a quantum-inspired framework. The clinical trial's success in enhancing immune cell infiltration and tertiary lymphoid structures (as observed in the Ipi+Nivo+CT arm) can be analogously modeled using quantum algorithms to simulate and optimize treatment protocols. Key innovations include: (1) integrating quantum gradient-estimation techniques with immune cell profiling data to identify optimal drug combinations, (2) leveraging fault-tolerant quantum computing to handle the high computational complexity of biological systems, and (3) applying error analysis from quantum algorithms to quantify biological variability in treatment outcomes. This approach bridges computational biology and quantum physics, enabling the development of adaptive, personalized treatment strategies that prioritize both efficacy and safety, while addressing the limitations of classical optimization methods in complex, dynamic biological systems.",
        "concept_original_list": [
            "Neoadjuvant chemotherapy plus nivolumab with or without ipilimumab in operable non-small cell lung cancer: the phase 2 platform NEOSTAR trial: Neoadjuvant ipilimumab + nivolumab (Ipi+Nivo) and nivolumab + chemotherapy (Nivo+CT) induce greater pathologic response rates than CT alone in patients with operable non-small cell lung cancer (NSCLC). The impact of adding ipilimumab to neoadjuvant Nivo+CT is unknown. Here we report the results and correlates of two arms of the phase 2 platform NEOSTAR trial testing neoadjuvant Nivo+CT and Ipi+Nivo+CT with major pathologic response (MPR) as the primary endpoint. MPR rates were 32.1% (7/22, 80% confidence interval (CI) 18.7–43.1%) in the Nivo+CT arm and 50% (11/22, 80% CI 34.6–61.1%) in the Ipi+Nivo+CT arm; the primary endpoint was met in both arms. In patients without known tumor EGFR/ALK alterations, MPR rates were 41.2% (7/17) and 62.5% (10/16) in the Nivo+CT and Ipi+Nivo+CT groups, respectively. No new safety signals were observed in either arm. Single-cell sequencing and multi-platform immune profiling (exploratory endpoints) underscored immune cell populations and phenotypes, including effector memory CD8+ T, B and myeloid cells and markers of tertiary lymphoid structures, that were preferentially increased in the Ipi+Nivo+CT cohort. Baseline fecal microbiota in patients with MPR were enriched with beneficial taxa, such as Akkermansia, and displayed reduced abundance of pro-inflammatory and pathogenic microbes. Neoadjuvant Ipi+Nivo+CT enhances pathologic responses and warrants further study in operable NSCLC. (ClinicalTrials.gov registration: NCT03158129.) The combination of neoadjuvant nivolumab, ipilimumab and chemotherapy showed promising efficacy in patients with resectable non-small cell lung cancer, with higher tumor immune cell infiltration and tertiary lymphoid structures after treatment compared with neoadjuvant nivolumab plus chemotherapy.",
            "Efficient Quantum Algorithms for Quantum Optimal Control: In this paper, we present efficient quantum algorithms that are exponentially faster than classical algorithms for solving the quantum optimal control problem. This problem involves finding the control variable that maximizes a physical quantity at time $T$, where the system is governed by a time-dependent Schrödinger equation. This type of control problem also has an intricate relation with machine learning. Our algorithms are based on a time-dependent Hamiltonian simulation method and a fast gradient-estimation algorithm. We also provide a comprehensive error analysis to quantify the total error from various steps, such as the finite-dimensional representation of the control function, the discretization of the Schrödinger equation, the numerical quadrature, and optimization. Our quantum algorithms require fault-tolerant quantum computers."
        ]
    },
    {
        "id": "theme_188",
        "theme": "Adjoint-Equivariant Progressive Transformation Networks for Lie Algebra-Based Image Synthesis",
        "elaboration": "This research proposes a novel framework that merges adjoint-equivariant neural networks (from Concept 1) with progressive transformation learning (from Concept 2) to synthesize realistic images in Lie algebra spaces. By leveraging the mathematical structure of semi-simple Lie algebras, the framework ensures that transformations preserve geometric invariance (adjoint-equivariance) while progressively augmenting training datasets with virtual images. The core innovation involves integrating a progressive transformation generator that iteratively refines virtual images using adjoint-equivariant operations, addressing domain gaps through dynamic scaling of feature spaces. This approach combines the geometric invariance of Lie Neurons with the iterative refinement of PTL, enabling robust training of equivariant models for tasks like shape classification and system dynamics in Lie algebraic domains. The novelty lies in harmonizing Lie theory's symmetry properties with progressive learning, offering a scalable solution for generating high-fidelity data in complex, non-Euclidean spaces.",
        "concept_original_list": [
            "Lie Neurons: Adjoint-Equivariant Neural Networks for Semisimple Lie Algebras: This paper proposes an equivariant neural network that takes data in any finite-dimensional semi-simple Lie algebra as input. The corresponding group acts on the Lie algebra as adjoint operations, making our proposed network adjoint-equivariant. Our framework generalizes the Vector Neurons, a simple $\\mathrm{SO}(3)$-equivariant network, from 3-D Euclidean space to Lie algebra spaces, building upon the invariance property of the Killing form. Furthermore, we propose novel Lie bracket layers and geometric channel mixing layers that extend the modeling capacity. Experiments are conducted for the $\\mathfrak{so}(3)$, $\\mathfrak{sl}(3)$, and $\\mathfrak{sp}(4)$ Lie algebras on various tasks, including fitting equivariant and invariant functions, learning system dynamics, point cloud registration, and homography-based shape classification. Our proposed equivariant network shows wide applicability and competitive performance in various domains.",
            "Progressive Transformation Learning for Leveraging Virtual Images in Training: To effectively interrogate UAV-based images for detecting objects of interest, such as humans, it is essential to acquire large-scale UAV-based datasets that include human instances with various poses captured from widely varying viewing angles. As a viable alternative to laborious and costly data curation, we introduce Progressive Transformation Learning (PTL), which gradually augments a training dataset by adding transformed virtual images with enhanced realism. Generally, a virtual2real transformation generator in the conditional GAN framework suffers from quality degradation when a large domain gap exists between real and virtual images. To deal with the domain gap, PTL takes a novel approach that progressively iterates the following three steps: 1) select a subset from a pool of virtual images according to the domain gap, 2) transform the selected virtual images to enhance realism, and 3) add the transformed virtual images to the training set while removing them from the pool. In PTL, accurately quantifying the domain gap is critical. To do that, we theoretically demonstrate that the feature representation space of a given object detector can be modeled as a multivariate Gaussian distribution from which the Mahalanobis distance between a virtual object and the Gaussian distribution of each object category in the representation space can be readily computed. Experiments show that PTL results in a substantial performance increase over the baseline, especially in the small data and the cross-domain regime."
        ]
    },
    {
        "id": "theme_168",
        "theme": "Integrating Dynamic Motion Modeling with Pharmacological Response Prediction for Personalized T Cell Lymphoma Therapy",
        "elaboration": "This research proposes a novel framework that combines principles from generative image dynamics (Concept 2) with pharmacological modeling (Concept 1) to predict and simulate the complex interactions between drug combinations and biological systems. By leveraging spectral volumes from image dynamics to represent the temporal evolution of inflammatory responses and toxicity in T cell lymphomas, the study aims to develop a predictive model that can simulate how PI3K inhibitor-drug combinations (e.g., duvelisib + romidepsin) modulate cellular dynamics. The approach would treat the body’s response to therapy as a dynamic system, where spectral volumes encode the interplay between drug effects, inflammatory mediators, and tissue-level changes. This would enable real-time prediction of toxicity risks and efficacy, similar to how diffusion models generate motion in images, but adapted to model biological processes. The integration would allow for the development of personalized treatment strategies by simulating patient-specific responses, addressing the limitations of current clinical trials in capturing the nuanced interactions between drugs and the body’s immune system. The project would bridge computational modeling and clinical pharmacology, offering a new tool to optimize drug combinations while mitigating adverse effects.",
        "concept_original_list": [
            "Duvelisib plus romidepsin in relapsed/refractory T cell lymphomas: a phase 1b/2a trial: PI3K-δ inhibitors have shown impressive activity in lymphoid malignancies but have been hampered by autoimmune and infectious toxicities, leading to market withdrawals. We previously demonstrated activity of the PI3K-δγ inhibitor duvelisib in T cell lymphomas (TCLs) that was associated with inflammatory adverse events. As reported here, we conducted a phase 1b/2a study of duvelisib in combination with either romidepsin (n = 66) or bortezomib (n = 32) in patients with relapsed/refractory TCL and found that the addition of romidepsin, but not bortezomib, appeared to increase efficacy while attenuating PI3K inhibitor-driven toxicity. The primary endpoint of the study was to determine the safety and maximum tolerated dose of duvelisib, which was 75 mg twice daily when combined with romidepsin versus 25 mg twice daily when combined with bortezomib. The most common adverse events were neutropenia (42%, 25/59) and fatigue (37%, 22/59) in patients treated with duvelisib and romidepsin and diarrhea (48%, 11/23) and neutropenia (30%, 7/23) in patients treated with duvelisib and bortezomib. Duvelisib and romidepsin resulted in less grade 3/4 hepatotoxicity (14%, 8/59) compared to 40% (14/35) in our previous study with duvelisib monotherapy. This was associated with reductions in circulating inflammatory mediators and myeloid cell inflammatory gene expression. Secondary endpoints of overall and complete response rates were 55% (35/64) and 34% (22/64) for patients treated with duvelisib and romidepsin and 34% (11/32) and 13% (4/32) for patients treated with duvelisib and bortezomib. Among patients with peripheral T cell lymphomas (PTCLs), overall and complete response rates of duvelisib and romidepsin were 56% (27/48) and 44% (21/48), respectively, with exploratory analyses showing increased response rates in patients with a follicular helper T cell subtype. These findings support further development of combined PI3K and histone deacetylase (HDAC) inhibition in TCLs and suggest a unique strategy to enable PI3K inhibitor-based combinations for additional patient populations. ClinicalTrials.gov identifier: NCT02783625. In a phase 1b/2a trial, the combination of the oral PI3K inhibitor duvelisib and romidepsin had limited toxicity and exhibited encouraging clinical activity in patients with relapsed or refractory T cell lymphoma, suggesting an approach whereby PI3K inhibitors can be safely used in this patient population.",
            "Generative Image Dynamics: We present an approach to modeling an image-space prior on scene motion. Our prior is learned from a collection of motion trajectories extracted from real video sequences depicting natural oscillatory dynamics of objects such as trees, flowers, candles, and clothes swaying in the wind. We model dense long-term motion in the Fourier domain as spectral volumes, which we find are well-suited to prediction with diffusion models. Given a single image, our trained model uses a frequency-coordinated diffusion sampling process to predict a spectral volume, which can be converted into a motion texture that spans an entire video. Along with an image-based rendering module, the predicted motion representation can be used for a number of downstream applications, such as turning still images into seamlessly looping videos or allowing users to interact with objects in real images, producing realistic simulated dynamics (by interpreting the spectral volumes as image-space modal bases). See our project page for more results: generative-dynamics.github.io"
        ]
    },
    {
        "id": "theme_053",
        "theme": "Precision-Driven ML Frameworks for Interpretable Motion Control in Scientific Applications",
        "elaboration": "MotionPro's region-wise trajectory and motion mask approach exemplifies a precision-driven methodology that aligns with the need for interpretable machine learning (ML) in scientific domains. Concept 2 highlights the tension between ML's ontology (data-centric design) and epistemology (performance on training data), which often leads to biases in scientific applications. MotionPro's framework addresses this by decomposing complex motion dynamics into localized regions, enabling precise control while maintaining transparency. This mirrors ML's need for structured ontologies (e.g., explicit motion masks) to avoid confounding factors, akin to how causal models in physics (Concept 2) use expressive ML to represent confounders. By integrating region-wise trajectories with denoising techniques, MotionPro demonstrates how precise, data-driven control can mitigate ML's epistemological risks, offering a blueprint for deploying ML in scientific workflows while preserving interpretability. The proposed research would formalize this principle, creating a scalable framework for motion control in scientific simulations, thereby bridging the gap between ML's computational power and the scientific community's demand for trustworthiness.",
        "concept_original_list": [
            "MotionPro: A Precise Motion Controller for Image-to-Video Generation: Animating images with interactive motion control has garnered popularity for image-to-video (I2V) generation. Modern approaches typically rely on large Gaussian kernels to extend motion trajectories as condition without explicitly defining movement region, leading to coarse motion control and failing to disentangle object and camera moving. To alleviate these, we present MotionPro, a precise motion controller that novelly leverages region-wise trajectory and motion mask to regulate fine-grained motion synthesis and identify target motion category (i.e., object or camera moving), respectively. Technically, MotionPro first estimates the flow maps on each training video via a tracking model, and then samples the region-wise trajectories to simulate inference scenario. Instead of extending flow through large Gaussian kernels, our region-wise trajectory approach enables more precise control by directly utilizing trajectories within local regions, thereby effectively characterizing fine-grained movements. A motion mask is simultaneously derived from the predicted flow maps to capture the holistic motion dynamics of the movement regions. To pursue natural motion control, MotionPro further strengthens video denoising by incorporating both region-wise trajectories and motion mask through feature modulation. More remarkably, we meticulously construct a benchmark, i.e., MC-Bench, with 1.1K user-annotated image-trajectory pairs, for the evaluation of both fine-grained and object-level I2V motion control. Extensive experiments conducted on WebVid-10M and MC-Bench demonstrate the effectiveness of MotionPro. Please refer to our project page for more results: https://zhw-zhang.github.io/MotionPro-page/.",
            "Position: Is machine learning good or bad for the natural sciences?: Machine learning (ML) methods are having a huge impact across all of the sciences. However, ML has a strong ontology — in which only the data exist — and a strong epistemology — in which a model is considered good if it performs well on held-out training data. These philosophies are in strong conflict with both standard practices and key philosophies in the natural sciences. Here we identify some locations for ML in the natural sciences at which the ontology and epistemology are valuable. For example, when an expressive machine learning model is used in a causal inference to represent the effects of confounders, such as foregrounds, backgrounds, or instrument calibration parameters, the model capacity and loose philosophy of ML can make the results more trustworthy. We also show that there are contexts in which the introduction of ML introduces strong, unwanted statistical biases. For one, when ML models are used to emulate physical (or first-principles) simulations, they amplify confirmation biases. For another, when expressive regressions are used to label datasets, those labels cannot be used in downstream joint or ensemble analyses without taking on uncontrolled biases. The question in the title is being asked of all of the natural sciences; that is, we are calling on the scientific communities to take a step back and consider the role and value of ML in their fields; the (partial) answers we give here come from the particular perspective of physics."
        ]
    },
    {
        "id": "theme_041",
        "theme": "Uncertainty-Driven Reconstruction in Dynamic Scenes: Applying Conformal Prediction to Compositional Video Reconstruction",
        "elaboration": "ShowMak3r's challenge of reconstructing dynamic, occluded, and shot-changing TV scenes mirrors the uncertainty quantification problems in continual learning. By integrating conformal prediction (CP) into ShowMak3r's pipeline, we can dynamically quantify and mitigate uncertainties in actor poses, expressions, and scene reconstruction. The 3DLocator module's depth-based pose estimation and ShotMatcher's shot-tracking could be augmented with CPCL to generate prediction intervals for reconstructed elements, ensuring reliability even under ambiguous or incomplete data. This approach would enable applications like synthetic shot-making with confidence, where uncertainty-aware editing preserves scene integrity. The novelty lies in merging CP's theoretical guarantees with compositional video reconstruction, addressing both the dynamic challenges of TV shows and the uncertainty in continual learning tasks. The project would bridge computer vision and machine learning, offering a framework for robust, uncertainty-aware scene reconstruction in real-world applications.",
        "concept_original_list": [
            "ShowMak3r: Compositional TV Show Reconstruction: Reconstructing dynamic radiance fields from video clips is challenging, especially when entertainment videos like TV shows are given. Many challenges make the reconstruction difficult due to (1) actors occluding with each other and having diverse facial expressions, (2) cluttered stages, and (3) small baseline views or sudden shot changes. To address these issues, we present ShowMak3r, a comprehensive reconstruction pipeline that allows the editing of scenes like how video clips are made in a production control room. In ShowMak3r, a 3DLocator module locates recovered actors on the stage using depth prior, and estimates unseen human poses via interpolation. The proposed ShotMatcher module then tracks the actors under shot changes. Furthermore, ShowMak3r introduces a face-fitting network that dynamically recovers the actors' expressions. Experiments on Sitcoms3D dataset show that our pipeline can reassemble TV show scenes with new cameras at different timestamps. We also demonstrate that ShowMak3r enables interesting applications such as synthetic shot-making, actor relocation, insertion, deletion, and pose manipulation. Project page: https://nstar1125.github.io/showmak3r",
            "Model Uncertainty Quantification by Conformal Prediction in Continual Learning: Continual learning has attracted increasing research attention in recent years due to its promising experimental results in real-world applications. In this paper, we study the issue of calibration in continual learning which reliably quantifies the uncertainty of model predictions. Conformal prediction (CP) provides a general framework for model calibration, which outputs prediction intervals or sets with a theoretical high coverage guarantee as long as the samples are exchangeable. However, the tasks in continual learning are learned in sequence, which violates the principle that data should be exchangeable. Meanwhile, the model learns the current task with limited or no access to data from previous tasks, which is not conducive to constructing the calibration set. To address these issues, we propose a CP-based method for model uncertainty quantification in continual learning (CPCL), which also reveals the connection between prediction interval length and forgetting. We analyze the oracle prediction interval in continual learning and theoretically prove the asymptotic coverage guarantee of CPCL. Finally, extensive experiments on simulated and real data empirically verify the validity of our proposed method."
        ]
    },
    {
        "id": "theme_024",
        "theme": "Synthetic Data Integration for Effective Antiviral Drug Development",
        "elaboration": "This research proposes leveraging synthetic data generation techniques to optimize the development of antiviral drugs like obeldesivir. By creating high-fidelity synthetic datasets that mimic the complex biological interactions and variability observed in real-world clinical trials, we can simulate diverse viral challenges and exposure scenarios. This approach addresses the critical limitations of traditional real-world data, which often suffer from scarcity and bias, while enabling rigorous testing of drug efficacy and safety. The integration of synthetic data with experimental findings from animal models (as demonstrated in Concept 1) would allow for accelerated drug development by identifying optimal dosing regimens and exposure windows. Furthermore, the use of data-centric frameworks like Profile2Gen (Concept 2) could refine synthetic data generation to better reflect clinical outcomes, ensuring that drug candidates meet stringent safety and effectiveness criteria. This synergy would bridge the gap between preclinical animal studies and real-world application, offering a scalable solution for developing post-exposure prophylaxis against highly pathogenic viruses like Marburg.",
        "concept_original_list": [
            "Oral obeldesivir provides postexposure protection against Marburg virus in nonhuman primates: The recent outbreak of Marburg virus (MARV) in Rwanda underscores the need for effective countermeasures against this highly fatal pathogen, with case fatality rates reaching 90%. Currently, no vaccines or approved treatments exist for MARV infection, distinguishing it from related viruses such as Ebola. Our study demonstrates that the oral drug obeldesivir (ODV), a nucleoside analog prodrug, shows promising antiviral activity against filoviruses in vitro and offers significant protection in animal models. Here with cynomolgus macaques (n = 6), a 10 day regimen of once-daily ODV, initiated 24 h after exposure, provided 80% protection against a thousandfold lethal MARV challenge, delaying viral replication and disease onset. Transcriptome analysis revealed that early adaptive responses correlated with successful outcomes. Compared with intravenous options, oral antivirals such as ODV offer logistical advantages in outbreak settings, enabling easier administration and broader contact coverage. Our findings support the potential of ODV as a broad-spectrum, oral postexposure prophylaxis for filoviruses. The oral drug obeldesivir confers 80% survival when given after lethal Marburg virus exposure in cynomolgus macaques and also delays viral replication and disease onset, suggesting a potential treatment option for an infection with no approved therapies.",
            "Breaking the Barrier of Hard Samples: A Data-Centric Approach to Synthetic Data for Medical Tasks: Data scarcity and quality issues remain significant barriers to developing robust predictive models in medical research. Traditional reliance on real-world data often leads to biased models with poor generalizability across diverse patient populations. Synthetic data generation has emerged as a promising solution, yet challenges related to these sample's representativeness and effective utilization persist. This paper introduces Profile2Gen, a novel data-centric framework designed to guide the generation and refinement of synthetic data, focusing on addressing hard-to-learn samples in regression tasks. We conducted approximately 18,000 experiments to validate its effectiveness across six medical datasets, utilizing seven state-of-the-art generative models. Results demonstrate that refined synthetic samples can reduce predictive errors and enhance model reliability. Additionally, we generalize the DataIQ framework to support regression tasks, enabling its application in broader contexts. Statistical analyses confirm that our approach achieves equal or superior performance compared to models trained exclusively on real data."
        ]
    },
    {
        "id": "theme_119",
        "theme": "Active Inference-Driven Cross-Domain Image Generation for Enhanced Visual Localization",
        "elaboration": "This research integrates active statistical inference principles with cross-domain image generation to address the limitations of existing visual localization methods. By leveraging active inference, we propose a framework where machine learning models prioritize data points with high uncertainty in cross-domain scenes, enabling more efficient data collection and generation. This approach combines the strengths of text-guided image editing (from Concept 1) with active inference's ability to optimize sample selection, ensuring that the generated cross-domain datasets focus on critical, underrepresented scenarios. The method introduces a dynamic prioritization mechanism that adapts to the long-tail distribution of cross-domain data, using active inference to balance between diversity and relevance. This synergy enhances both data quality and model robustness, achieving state-of-the-art accuracy while reducing the need for extensive labeled data. The novelty lies in merging active inference's adaptive data collection with cross-domain generation, creating a scalable solution for real-world applications where diverse, dynamic scenes are common.",
        "concept_original_list": [
            "Enhancing Visual Localization with Cross-Domain Image Generation: Visual localization aims to predict the absolute camera pose for a single query image. However, predominant methods focus on single-camera images and scenes with limited appearance variations, limiting their applicability to cross-domain scenes commonly encountered in real-world applications. Furthermore, the long-tail distribution of cross-domain datasets poses additional challenges for visual localization. In this work, we propose a novel cross-domain data generation method to enhance visual localization methods. To achieve this, we first construct a cross-domain 3DGS to accurately model photometric variations and mitigate the interference of dynamic objects in large-scale scenes. We introduce a text-guided image editing model to enhance data diversity for addressing the long-tail distribution problem and design an effective fine-tuning strategy for it. Then, we develop an anchor-based method to generate high-quality datasets for visual localization. Finally, we introduce positional attention to address data ambiguities in cross-camera images. Extensive experiments show that our method achieves state-of-the-art accuracy, outperforming existing cross-domain visual localization methods by an average of 59\\% across all domains. Project page: https://yzwang-sjtu.github.io/CDG-Loc.",
            "Active Statistical Inference: Inspired by the concept of active learning, we propose active inference---a methodology for statistical inference with machine-learning-assisted data collection. Assuming a budget on the number of labels that can be collected, the methodology uses a machine learning model to identify which data points would be most beneficial to label, thus effectively utilizing the budget. It operates on a simple yet powerful intuition: prioritize the collection of labels for data points where the model exhibits uncertainty, and rely on the model's predictions where it is confident. Active inference constructs valid confidence intervals and hypothesis tests while leveraging any black-box machine learning model and handling any data distribution. The key point is that it achieves the same level of accuracy with far fewer samples than existing baselines relying on non-adaptively-collected data. This means that for the same number of collected samples, active inference enables smaller confidence intervals and more powerful tests. We evaluate active inference on datasets from public opinion research, census analysis, and proteomics."
        ]
    },
    {
        "id": "theme_077",
        "theme": "Synchronizing Immune Memory with 4D Video Generation: A Framework for Boosted Antigenic Response in Dynamic Immunological Systems",
        "elaboration": "This research proposal merges principles from immunological memory dynamics and 4D video generation to develop a novel framework for modeling antigenic response in vaccinated individuals. By analogizing the immune system's ability to maintain memory B cell responses (like a 4D video's synchronized time-viewpoint coordination), we propose a structured approach to optimize booster dose efficacy. The theme leverages the concept of 'synchronization'—critical in both immune memory (where booster doses ensure sustained antibody titers) and 4D video generation (where temporal and viewpoint streams are aligned)—to create a computational model that integrates dynamic antigen exposure with structured data flow. The framework would use a 'synchronization layer' inspired by 4Real-Video's tensor network architecture, where immune responses (time-axis) and antigenic diversity (viewpoint-axis) are dynamically coordinated. This would enable predictive modeling of breakthrough infections, similar to how 4D video's temporal and viewpoint consistency ensures visual coherence, while maintaining the immune system's ability to adapt to evolving viral variants. The novelty lies in applying video-generation synchronization principles to immunological data, enabling real-time monitoring of antigenic memory and optimizing booster dose efficacy through structured, high-fidelity data exchange between time and viewpoint streams, thereby bridging immunological and computational paradigms.",
        "concept_original_list": [
            "Immune response to SARS-CoV-2 after a booster of mRNA-1273: an open-label phase 2 trial: Rising breakthrough infections of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in previously immunized individuals have raised concerns for the need for a booster vaccine dose to combat waning antibody levels and new variants. Here we report the results of the open-label, non-randomized part B of a phase 2 trial in which we evaluated the safety and immunogenicity of a booster injection of 50 µg of the coronavirus disease 2019 (COVID-19) vaccine mRNA-1273 in 344 adult participants immunized 6–8 months earlier with a primary series of two doses of 50 µg or 100 µg of mRNA-1273 ( NCT04405076 ). Neutralizing antibody (nAb) titers against wild-type SARS-CoV-2 at 1 month after the booster were 1.7-fold (95% confidence interval (CI): 1.5, 1.9) higher than those at 28 days after the second injection of the primary series, which met the pre-specified non-inferiority criterion (primary immunogenicity objective) and might indicate a memory B cell response. The nAb titers against the Delta variant (B.1.617.2) (exploratory objective) at 1 month after the booster were 2.1-fold (95% CI: 1.8, 2.4) higher than those at 28 days after the second injection of the primary series. The seroresponse rate (95% CI (four-fold rise from baseline)) was 100% (98.7, 100.0) at 28 days after the booster compared to 98.3% (96.0, 99.4) after the primary series. The higher antibody titers at 28 days after the booster dose compared to 28 days after the second dose in the phase 3 COVE study were also observed in two assays for anti-spike IgG antibody measured by ELISA and by Meso Scale Discovery (MSD) Multiplex. 
The frequency of solicited local and systemic adverse reactions after the booster dose was similar to that after the second dose in the primary two-dose series of mRNA-1273 (50 µg or 100 µg); no new signals were observed in the unsolicited adverse events; and no serious adverse events were reported in the 1-month follow-up period. These results show that a booster injection of mRNA-1273 more than 6 months after completing the primary two-dose series is safe and elicited nAb titers that were statistically significantly higher than the peak titers detected after the primary vaccination series, suggesting that a booster dose of mRNA-1273 might result in increased vaccine effectiveness against infection and disease caused by SARS-CoV-2. A third dose of the COVID-19 vaccine mRNA-1273 is safe and boosts SARS-CoV-2 neutralizing antibody titers almost two-fold higher than the peak levels observed after completion of a two-dose series, highlighting the potential clinical benefit of a booster dose.",
            "4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion: We propose 4Real-Video, a novel framework for generating 4D videos, organized as a grid of video frames with both time and viewpoint axes.  In this grid, each row contains frames sharing the same timestep, while each column contains frames from the same viewpoint. One stream performs viewpoint updates on columns, and the other stream performs temporal updates on rows. After each diffusion transformer layer, a newly designed synchronization layer exchanges information between the two token streams. We propose two implementations of the synchronization layer, using either hard or soft synchronization.This feedforward architecture improves upon previous work in three ways: higher inference speed, enhanced visual quality (measured by FVD, CLIP, and VideoScore), and improved temporal and viewpoint consistency (measured by VideoScore, GIM-Confidence, and Dust3R-Confidence)."
        ]
    },
    {
        "id": "theme_160",
        "theme": "Adapting Foundation Models for Open-Vocabulary 3D Semantic Segmentation via Language Distribution Balancing",
        "elaboration": "This research proposes integrating principles from language distribution balancing (as seen in BabelCode) with foundation models to enhance open-vocabulary 3D semantic segmentation. By treating 3D objects as 'languages' in a dynamic, open-world environment, the framework leverages the universal knowledge of foundation models to map 3D point features to textual descriptions. Similar to BabelCode's approach of balancing programming language distributions to improve model performance, OV3D trains on a diverse, balanced dataset of 3D objects, ensuring equitable representation of both high- and low-resource categories. This balances the model's capacity to generalize across unseen objects, achieving state-of-the-art performance while addressing the limitations of fixed-category 3D segmentation. The novelty lies in applying linguistic diversity balancing to 3D semantics, enabling models to dynamically adapt to open-vocabulary tasks without pre-defined labels, while maintaining robustness across diverse domains.",
        "concept_original_list": [
            "Open-Vocabulary 3D Semantic Segmentation with Foundation Models: In dynamic 3D environments the ability to recognize a diverse range of objects without the constraints of predefined categories is indispensable for real-world applications. In response to this need we introduce OV3D an innovative framework designed for open-vocabulary 3D semantic segmentation. OV3D leverages the broad open-world knowledge embedded in vision and language foundation models to establish a fine-grained correspondence between 3D points and textual entity descriptions. These entity descriptions are enriched with contextual information enabling a more open and comprehensive understanding. By seamlessly aligning 3D point features with entity text features OV3D empowers open-vocabulary recognition in the 3D domain achieving state-of-the-art open-vocabulary semantic segmentation performance across multiple datasets including ScanNet Matterport3D and nuScenes.",
            "Measuring the Impact of Programming Language Distribution: Current benchmarks for evaluating neural code models focus on only a small subset of programming languages, excluding many popular languages such as Go or Rust. To ameliorate this issue, we present the BabelCode framework for execution-based evaluation of any benchmark in any language. BabelCode enables new investigations into the qualitative performance of models' memory, runtime, and individual test case results. Additionally, we present a new code translation dataset called Translating Python Programming Puzzles (TP3) from the Python Programming Puzzles (Schuster et al., 2021) benchmark that involves translating expert-level python functions to any language. With both BabelCode and the TP3 benchmark, we investigate if balancing the distributions of 14 languages in a training dataset improves a large language model's performance on low-resource languages. Training a model on a balanced corpus results in, on average, 12.34% higher $pass@k$ across all tasks and languages compared to the baseline. We find that this strategy achieves 66.48% better $pass@k$ on low-resource languages at the cost of only a 12.94% decrease to high-resource languages. In our three translation tasks, this strategy yields, on average, 30.77% better low-resource $pass@k$ while having 19.58% worse high-resource $pass@k$."
        ]
    }
]