{
    "title": "Autotuning Paper Search Results",
    "turns": [{
        "turn": 1,
        "user": "Veritas – Demystifying Silent Data Corruptions: μArch-Level Modeling and Fleet Data of Modern x86 CPUs\n\nChameleonEC: Exploiting Tunability of Erasure Coding for Low-Interference Repair\n\nDPUaudit: DPU-assisted Pull-based Architecture for Near-Zero Cost System Auditing\n\nDelinquent Loop Pre-execution Using Predicated Helper Threads\n\nMASCOT: Predicting Memory Dependencies and Opportunities for Speculative Memory Bypassing\n\nArchitecting Value Prediction around In-Order Execution\n\nEfficient Optimization with Encoded Ising Models\n\nSPARK – Sparsity Aware, Low Area, Energy-Efficient, Near-memory Architecture for Accelerating Linear Programming Problems\n\nLegoZK: A Dynamically Reconfigurable Accelerator for Zero-Knowledge Proof\n\nReuse-Aware Compilation for Zoned Quantum Architectures Based on Neutral Atoms\n\nHATT: Hamiltonian Aware Ternary Tree for Optimizing Fermion-to-Qubit Mapping\n\nQuCLEAR: Clifford Extraction and Absorption for Quantum Circuit Optimization\n\nGaze into the Pattern: Characterizing Spatial Patterns with Internal Temporal Correlations for Hardware Prefetching\n\nTo Cross, or Not to Cross Pages for Prefetching?\n\nIntegrating Prefetcher Selection with Dynamic Request Allocation Improves Prefetching Efficiency\n\nVR-Pipe: Streamlining Hardware Graphics Pipeline for Volume Rendering\n\nIRIS: Unleashing ISP-Software Cooperation to Optimize the Machine Vision Pipeline\n\nUni-Render: A Unified Accelerator for Real-Time Rendering Across Diverse Neural Renderers\n\nInterleaved Logical Qubits in Atom Arrays\n\nChoco-Q: Commute Hamiltonian-based QAOA for Constrained Binary Optimization\n\nBOSS: Blocking algorithm for optimizing shuttling scheduling in Ion Trap\n\nLSQCA: Resource-Efficient Load/Store Architecture for Limited-Scale Fault-Tolerant Quantum Computing\n\nThe Importance of Generalizability in Machine Learning for Systems\n\nR.I.P. 
Geomean Speedup Use Equal-Work (Or Equal-Time) Harmonic Mean Speedup Instead\n\neDKM: An Efficient and Accurate Train-Time Weight Clustering for Large Language Models\n\nEXION: Exploiting Inter- and Intra-Iteration Output Sparsity for Diffusion Models\n\nDitto: Accelerating Diffusion Model via Temporal Value Similarity\n\nGaussian Blending Unit: An Edge GPU Plug-in for Real-Time Gaussian-Based Rendering in AR/VR\n\nGSArch: Breaking Memory Barriers in 3D Gaussian Splatting Training via Architectural Support\n\nPalermo: Improving the Performance of Oblivious Memory using Protocol-Hardware Co-Design\n\nSpecMPK: Efficient In-Process Isolation with Speculative and Secure Permission Update Instruction\n\nBrokenSleep: Remote Power Timing Attack Exploiting Processor Idle States\n\nEfficient Memory Side-Channel Protection for Embedding Generation in Machine Learning\n\nCriticality-Aware Instruction-Centric Bandwidth Partitioning for Data Center Applications\n\nConcord: Rethinking Distributed Coherence for Software Caches in Serverless Environments\n\nGrad: Intelligent Microservice Scaling by Harnessing Resource Fungibility\n\nMulti-Dimensional Vector ISA Extension for Mobile In-Cache Computing\n\nER-DCIM: Error-Resilient Digital CIM with Run-Time MAC-Cell Error Correction\n\nAsyncDIMM: Achieving Asynchronous Execution in DIMM-Based Near-Memory Processing\n\nSoMA: Identifying, Exploring, and Understanding the DRAM Communication Scheduling Space for DNN Accelerators\n\nAdyna: Accelerating Dynamic Neural Networks with Adaptive Scheduling\n\nEDA: Energy-Efficient Inter-Layer Model Compilation for Edge DNN Inference Acceleration\n\nSkyByte: Architecting An Efficient Memory-Semantic CXL-based SSD with OS and Hardware Co-design\n\nZebra: Efficient Redundant Array of Zoned Namespace SSDs Enabled by Zone Random Write Area (ZRWA)\n\nReviving In-Storage Hardware Compression on ZNS SSDs through Host-SSD Collaboration\n\nUniNDP: A Unified Compilation and Simulation Tool for Near DRAM 
Processing Architectures\n\nPiccolo: Large-Scale Graph Processing with Fine-Grained In-Memory Scatter-Gather\n\nGOPIM: GCN-Oriented Pipeline Optimization for PIM Accelerators\n\nLUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator\n\nExploring the Performance Improvement of Tensor Processing Engines through Transformation in the Bit-weight Dimension of MACs\n\nPanacea: Novel DNN Accelerator using Accuracy-Preserving Asymmetric Quantization and Energy-Saving Bit-Slice Sparsity\n\nFrom Optimal to Practical: Efficient Micro-op Cache Replacement Policies for Data Center Applications\n\nRethinking Dead Block Prediction for Intermittent Computing\n\nEfficient Caching with A Tag-enhanced DRAM\n\nPROCA: Programmable Probabilistic Processing Unit Architecture with Accept/Reject Prediction & Multicore Pipelining for Causal Inference\n\nCogSys: Efficient and Scalable Neurosymbolic Cognition System via Algorithm-Hardware Co-Design\n\nNeuVSA: A Unified and Efficient Accelerator for Neural Vector Search\n\nProsperity: Accelerating Spiking Neural Networks via Product Sparsity\n\nBit-slice Architecture for DNN Acceleration with Slice-level Sparsity Enhancement and Exploitation\n\nA Hardware-Software Design Framework for SpMV Acceleration with Flexible Access Pattern Portfolio\n\nVariable Read Disturbance: An Experimental Analysis of Temporal Variation in DRAM Read Disturbance\n\nUnderstanding RowHammer Under Reduced Refresh Latency: Experimental Analysis of Real DRAM Chips and Implications on Future Solutions\n\nChronus: Understanding and Securing the Cutting-Edge Industry Solutions to DRAM Read Disturbance\n\nNOVA: A Novel Vertex Management Architecture For Scalable Graph Processing\n\nMeHyper: Accelerating Hypergraph Neural Networks by Exploring Implicit Dataflows\n\nCambricon-DG: An Accelerator for Redundant-Free Dynamic Graph Neural Networks Based on Nonlinear Isolation\n\nTB-STC: Transposable Block-wise N:M Structured Sparse Tensor Core\n\nCROSS: 
Compiler-Driven Optimization of Sparse DNNs Using Sparse/Dense Computation Kernels\n\nAccelES: Accelerating Top-K SpMV for Embedding Similarity via Low-bit Pruning\n\nAutoRFM: Scaling Low-Cost In-DRAM Trackers to Ultra-Low Rowhammer Thresholds\n\nDAPPER: A Performance-Attack-Resilient Tracker for RowHammer Defense\n\nQPRAC: Towards Secure and Practical PRAC-based Rowhammer Mitigation using Priority Queues\n\nI-DGNN: A Graph Dissimilarity-based Framework for Designing Scalable and Efficient DGNN Accelerators\n\nMithril: A Scalable System for Deep GNN Training\n\nBuffalo: Enabling Large-Scale GNN Training via Memory-Efficient Bucketization\n\nBitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration\n\nFIGLUT: An Energy-Efficient Accelerator Design for FP-INT GEMM Using Look-Up Tables\n\nMANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type\n\nFHENDI: A Near-DRAM Accelerator for Compiler-Generated Fully Homomorphic Encryption Applications\n\nEFFACT: A Highly Efficient Full-Stack FHE Acceleration Platform\n\nAnaheim: Architecture and Algorithms for Processing Fully Homomorphic Encryption in Memory\n\nHydra: Scale-out FHE Accelerator Architecture for Secure Deep Learning on FPGA\n\nWarpDrive: GPU-Based Fully Homomorphic Encryption Acceleration Leveraging Tensor and CUDA Cores\n\nMLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from μWatts to MWatts for Sustainable AI\n\nEnterprise Class Modular Cache Hierarchy\n\nPredicting DRAM-Caused Risky VMs in Large-Scale Clouds\n\nEnhancing Large-Scale AI Training Efficiency: The C4 Solution for Real-Time Anomaly Detection and Communication Optimization\n\nRevisiting Reliability in Large-Scale Machine Learning Research Clusters\n\nHILP: Accounting for Workload-Level Parallelism in System-on-Chip Design Space Exploration\n\nCORDOBA: Carbon-Efficient Optimization Framework for Computing Systems\n\nArchitecting Space Microdatacenters: A System-level 
Approach\n\nARTEMIS: Agile Discovery of Efficient Real-Time Systems-on-Chips in the Heterogeneous Era\n\nLEGO: Spatial Accelerator Generation and Optimization for Tensor Applications\n\nDynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency\n\nthrottLL’eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving\n\nRpcNIC: Enabling Efficient Datacenter RPC Offloading on PCIe-attached SmartNICs\n\nNVMePass: A Lightweight, High-performance and Scalable NVMe Virtualization Architecture with I/O Queues Passthrough\n\nWarped-Compaction: Maximizing GPU Register File Bandwidth Utilization via Operand Compaction\n\nCooperative Warp Execution in Tensor Core for RISC-V GPGPU\n\nSparseWeaver: Converting Sparse Operations as Dense Operations on GPUs for Graph Workloads\n\nHSMU-SpGEMM: Achieving High Shared Memory Utilization for Parallel Sparse General Matrix-Matrix Multiplication on Modern GPUs\n\nAnda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format\n\nLAD: Efficient Accelerator for Generative Inference of LLM with Locality Aware Decoding\n\nVQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference\n\nInstAttention: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference\n\nTidalMesh: Topology-Driven AllReduce Collective Communication for Mesh Topology\n\nPush Multicast: A Speculative and Coherent Interconnect for Mitigating Manycore CPU Communication Bottleneck\n\nPIMnet: A Domain-Specific Network for Efficient Collective Communication in Scalable PIM\n\nEIGEN: Enabling Efficient 3DIC Interconnect with Heterogeneous Dual-Layer Network-on-Active-Interposer\n\nAriadne: A Hotness-Aware and Size-Adaptive Compressed Swap Technique for Fast Application Relaunch and Reduced CPU Usage on Mobile Devices\n\nGemina: A Coordinated and High-Performance Memory Deduplication Engine\n\nNo Rush in Executing Atomic Instructions\n\nMachine Learning-Guided Memory 
Optimization for DLRM Inference on Tiered Memory\n\nLet-Me-In: (Still) Employing In-pointer Bounds metadata for Fine-grained GPU Memory Safety\n\nMarching Page Walks: Batching and Concurrent Page Table Walks for Enhancing GPU Throughput\n\nOASIS: Object-Aware Page Management for Multi-GPU Systems\n\nNearFetch: Saving Inter-Module Bandwidth in Many-Chip-Module GPUs\n\nPAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM\n\nFACIL: Flexible DRAM Address Mapping for SoC-PIM Cooperative On-device LLM Inference\n\nLincoln: Real-Time 50~100B LLM Inference on Consumer Devices with LPDDR-Interfaced, Compute-Enabled Flash Memory\n\nMake LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM\n\nPlease find the papers in this list that include autotuning.",
        "assistant": "Sure. Here are the papers in your list that appear to involve autotuning:\n\n*   **ChameleonEC: Exploiting Tunability of Erasure Coding for Low-Interference Repair**\n*   **Adyna: Accelerating Dynamic Neural Networks with Adaptive Scheduling**\n*   **AutoRFM: Scaling Low-Cost In-DRAM Trackers to Ultra-Low Rowhammer Thresholds**\n*   **DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency**\n*   **throttLL’eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving**\n\nTheir titles contain terms such as \"tunability,\" \"adaptive scheduling,\" \"auto-,\" \"designing,\" and \"predictive,\" which suggest a connection to autotuning, that is, automatically optimizing or adjusting the performance of a system or algorithm."
    }, {
        "turn": 2,
        "user": "Look again, I don't see any.",
        "assistant": "* * *\n\n### Re-examining Papers Related to \"Autotuning\"\n\nThe previously suggested papers do not contain the keyword \"autotuning\"; sorry for the confusion. I have reviewed the entire list again, this time looking for papers whose titles suggest **autotuning or closely related concepts (e.g., automatic optimization, auto-configuration, or auto-adjustment)**.\n\nThe following papers are likely to be related to these concepts:\n\n*   **ChameleonEC: Exploiting Tunability of Erasure Coding for Low-Interference Repair**\n    *   The word \"Tunability\" suggests the system may be able to adjust itself to an optimal state.\n*   **Adyna: Accelerating Dynamic Neural Networks with Adaptive Scheduling**\n    *   \"Adaptive Scheduling\" means scheduling is adjusted dynamically as conditions change, which can be seen as a form of **autotuning**.\n*   **AutoRFM: Scaling Low-Cost In-DRAM Trackers to Ultra-Low Rowhammer Thresholds**\n    *   The \"Auto-\" prefix suggests an **automated optimization** mechanism.\n*   **DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency**\n    *   Designing clusters for performance and energy efficiency involves finding optimal configurations, which can be related to **autotuning**.\n*   **throttLL’eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving**\n    *   \"Predictive\" and \"Throttling\" imply automatically predicting and adjusting GPU operation for energy efficiency, which can broadly be interpreted as **autotuning**.\n*   **HILP: Accounting for Workload-Level Parallelism in System-on-Chip Design Space Exploration**\n    *   \"Design Space Exploration\" may include **automatically searching for optimal designs** under given constraints.\n*   **ARTEMIS: Agile Discovery of Efficient Real-Time Systems-on-Chips in the Heterogeneous Era**\n    *   \"Agile Discovery\" could mean **automatically finding** efficient systems.\n*   **LEGO: Spatial Accelerator Generation and Optimization for Tensor Applications**\n    *   \"Generation and Optimization\" points to **automatically generating and optimizing** spatial accelerators for tensor applications.\n*   **From Optimal to Practical: Efficient Micro-op Cache Replacement Policies for Data Center Applications**\n    *   The process of finding \"Efficient Micro-op Cache Replacement Policies\" may involve **automatically searching for optimal policies**.\n\nNone of these titles uses the word **\"autotuning\"** directly, but the papers likely discuss mechanisms for **automatically configuring, adjusting, or optimizing systems for performance or efficiency**.\n\nAre you looking for \"autotuning\" in a specific context or technology, such as compiler autotuning or database autotuning? With more details, I can help you find the most relevant papers."
    }]
}