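Weekly AI News 2026-W23
Period covered: 2026-06-01 to 2026-06-07 (2194 items)
Topic trends
Items by topic
- LLM/Generative AI 891 items
- Research/Papers 859 items
- Agents 488 items
- Image/Video generation 334 items
- Business/Funding 134 items
- Robotics 109 items
- Hardware/Semiconductors 67 items
- Other 44 items
- Regulation/Policy 12 items
This week's highlights (top 10)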
How Endava is redesigning software delivery around AI agents
Learn how Endava is using AI agents, ChatGPT Enterprise, and Codex to accelerate software delivery, automate workflows, and build an AI-nat…
Dreaming: Better memory for a more helpful ChatGPT
ChatGPT introduces a new memory system to better remember preferences, keeping context fresh and relevant across conversations.
Introducing new capabilities to GPT-Rosalind
GPT-Rosalind advances life sciences research with enhanced biological reasoning, medicinal chemistry expertise, genomics analysis, and expe…
How Wasmer used Codex to build a Node.js runtime for the edge
See how Wasmer used Codex with GPT-5.5 to build a Node.js runtime for the edge, accelerating development 10x to 20x and shipping in weeks i…
A blueprint for democratic governance of frontier AI
OpenAI outlines a blueprint for U.S. governance of frontier AI, proposing a federal framework for safety, resilience, and national security.
OpenAI public policy agenda
OpenAI outlines its public policy agenda for AI, including safety, youth protection, workforce transition, and global standards to ensure A…
Travelers deploys AI-powered claims countrywide with OpenAI
Travelers built an AI-powered Claim Assistant with OpenAI to guide customers through filing claims, provide 24/7 support, and scale operati…
Codex for every role, tool, and workflow
Discover new Codex plugins, sites, and annotations that help analysts, marketers, designers, investors, and other teams get more done with…
Advancing youth safety and opportunity through global leadership
OpenAI calls for global action on youth AI safety, proposing an international institute to strengthen safeguards, standards, and opportunit…
Building the infrastructure for the Intelligence Age in Michigan
OpenAI breaks ground on a 1GW data center project in Michigan as part of Stargate, building AI infrastructure to expand access, create jobs…
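All items (by date)
2026-06-07 (5 items)
ChatGPT gets "Lockdown Mode" to guard against data leaks from prompt injection
OpenAI has begun offering "Lockdown Mode," a new security feature for ChatGPT. It is an option for reducing the risk of data exfiltration through prompt injection attacks; when enabled, web browsing and connections to external services are restricted. Handling confidential data and requiring strict protection…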
OpenAI unveils Lockdown Mode to protect sensitive data from prompt injection attacks
Even with Lockdown Mode, ChatGPT could still be vulnerable to prompt injections, but the goal is to reduce the likelihood that sensitive da…
What to expect from WWDC 2026: Siri’s highly anticipated revamp and Apple Intelligence updates
Apple's WWDC nears: Here's what you can look forward to.
Sriram Krishnan is leaving his role as White House AI advisor
Krishnan is reportedly starting a new institution to continue shaping Trump's AI policy.
The Trump administration might take an equity stake in OpenAI
President Donald Trump said he's discussing deals "where the American people can benefit from the success of AI."
2026-06-06 (3 items)
Startup Battlefield 200 applications officially close in 3 days
Applications for Startup Battlefield 200 officially close on June 8, 11:59 p.m. PT. Don't wait any longer. Secure your shot at competing on…
Google will pay SpaceX $920M per month for compute
In a statement, a Google representative described the deal as a result of unexpected demand for its recently launched AI products.
The most interesting startups right now want to get you off your phone
While the AI fundraising machine keeps breaking its own records, some founders are building in the other direction. Mirror founder Brynn Pu…
2026-06-05 (18 items)
The token bill comes due: Inside the industry scramble to manage AI’s runaway costs
"The whole conversation shifted from tokenmaxxing and 'go fast' to 'we need guardrails, how do we control this?'"
The ‘together tech’ wave might be the most intriguing startup bet of 2026
While the AI fundraising machine keeps breaking its own records, some founders are building in the other direction. Mirror founder Brynn Pu…
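University of Cambridge succeeds in clinical trial of AI-designed vaccine, a "universal" type that also prepares for unknown variants
The University of Cambridge announced that it has succeeded in the first clinical trial of a "universal vaccine" that uses AI-designed antigens. Researchers analyzed genome sequences of the sarbecovirus group with machine learning and designed a "super antigen" common to the group. The vaccine was given to 39 healthy people, and its safety and immune response were confirmed.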
AirTrunk commits $30B to build 5GW of AI data centers in India
The Australian data center operator plans to set up 5GW of capacity in India.
Mira Murati steps back into the spotlight, carefully
In the current environment, remaining heads down has diminishing returns; at some point, you have to make some noise just to remind the mar…
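Drawing-analysis AI supports work from drawing checks to cost estimation, cutting labor hours by up to 60%
Ficha (フィーチャ) announced expanded functionality and broader coverage for its drawing-analysis AI "Drawing-AI." In addition to circuit diagrams and mold drawings, it now handles architectural drawings, supporting drawing checks, data conversion, and cost estimation. In proof-of-concept trials, it reduced work hours by 30 to 60%.
Is AI robbing us of our ability to think? Research worldwide warns of the "AI wall of foolishness" (with video)
AI before looking something up, AI before thinking: using it this way may be affecting your ability to think without your noticing. Does using AI more make people dumber? This piece covers the reality and how to deal with it.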
Ahead of its IPO, Anthropic’s Daniela Amodei shrugs off doubts about AI’s returns
Anthropic has been growing at a breakneck pace. The company announced that annualized revenue crossed $47 billion in May, up dramatically f…
Airbnb’s Brian Chesky plans to launch a new AI lab
The Airbnb CEO said last year it hasn't struck an LLM partnership because existing products weren't quite ready.
Defense tech, AI, and fundraising take center stage at StrictlyVC Los Angeles on June 18
On Thursday, June 18, at The Aerospace Corporation Campus, investors, founders, and tech leaders will gather for an evening of conversation…
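Tire FEM analysis cut from 45 minutes to 5: Sumitomo Rubber and Fujitsu jointly develop an AI surrogate model
Sumitomo Rubber Industries and Fujitsu have jointly developed an AI surrogate model that predicts tire performance. In proof-of-concept trials, it cut the analysis time for predicting tire deformation behavior from about 45 minutes to about 5 minutes, while handling analyses on the scale of about 600,000 elements.
"This past year has been AI's Warring States period": Lessons from Mercari on the key points of drawing up AI governance
With business use of generative AI now a given, companies are being asked how to create business value through AI, while risks such as "shadow AI" are also being pointed out. How are leading companies perceiving AI risks and moving to address them? This article covers a company that has declared its shift to an "AI-Native Company" and A…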
Meta steals a tactic from Tesla and builds data centers in tents
Meta may have found one way to slash its massive data center bill: tents.
Apple approves Poke as the first AI agent on its Messages for Business platform
Poke, the startup that lets people use AI agents through simple text messages, has become the first AI agent approved for Apple’s Messages…
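University of Tokyo's Matsuo Lab releases "LLM Course: Basics" lecture materials free for a limited time
The Matsuo-Iwasawa Laboratory at the University of Tokyo (hereafter, the UTokyo Matsuo Lab) is making "Large Language Model (LLM) Course 2025: Basics," lecture materials that systematically cover LLMs from the fundamentals to technical trends, available free of charge for a limited time.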
Meta rolls out a new AI creator assistant on Facebook
Creators often have to parse through charts and dashboards to understand their performance, but with the new AI assistant, they can get qui…
What to expect from WWDC 2026: Siri’s highly anticipated revamp and Apple Intelligence updates
Apple's WWDC nears: Here's what you can look forward to.
Is Silicon Valley ready to put robots in people’s homes? Hello Robot is.
The California startup released the fourth generation of its home assistance robot, Stretch.
2026-06-04 (353 items)
Apple touts $1.4 trillion in App Store billings and sales, 90% without a commission
Apple's App Store generated $1.4 trillion in sales, up from $1.3 trillion last year, with $149 billion in sales for digital goods.
How Endava is redesigning software delivery around AI agents
Learn how Endava is using AI agents, ChatGPT Enterprise, and Codex to accelerate software delivery, automate workflows, and build an AI-nat…
Dreaming: Better memory for a more helpful ChatGPT
ChatGPT introduces a new memory system to better remember preferences, keeping context fresh and relevant across conversations.
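TSMC confident of sustaining growth on expanding AI use; at shareholders' meeting, says dealings with Tokyo Electron will continue
Taiwan Semiconductor Manufacturing Company (TSMC), the world's largest contract chipmaker, held its shareholders' meeting on June 4 in Hsinchu, Taiwan. Chairman and CEO C.C. Wei said that with the expanding use of AI, "the value of our leading-edge technology and manufacturing capability will continue to grow," expressing strong confidence that the company will sustain its growth over the next several years.
Google Chrome's new "Skills" feature removes the need to retype AI prompts every time
Google announced "Skills in Chrome," a new AI feature for Chrome. It lets users save AI prompts and reuse them with one click.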
Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification
Pre-deployment verification of enterprise artificial intelligence (AI) agents remains a critical gap between large language model (LLM) cap…
Stumbling Into AI Emotional Dependence: How Routine AI Interactions Reshape Human Connection
Public discourse and emerging policy typically assume that AI emotional support is a deliberate act: a lonely user consciously seeking comf…
Thinking Through Signs: PEEL as a Semiotic Scaffolding for Epistemically Accountable AI-Enabled Research
Large language models are reshaping research practice while quietly eroding researchers' epistemic accountability. This commentary introduce…
SMAC-Talk: A Natural Language Extension of the StarCraft Multi-Agent Challenge for Large Language Models
As LLMs become more widely deployed, they are increasingly expected to work alongside other AI agents rather than operating in isolation. E…
Consensus is Strategically Insufficient: Reasoning-Trace Disagreement as a Knowledge-Representation Signal
Multi-agent systems are commonly designed to reduce disagreement through voting, consensus protocols, debate, or fault-tolerant aggregation…
VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark
Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externa…
StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis
Automatic generation of RTL code for digital hardware designs remains challenging due to long-horizon reasoning, multi-step dependencies, a…
Can Generalist Agents Automate Data Curation?
Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propos…
Characterizing initial human-AI proof formalization workflows
For centuries, human mathematicians have written proofs to substantiate their mathematical arguments; yet, the ability to automatically ver…
The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents
As autonomous AI agents move from conversational systems to long-horizon software execution, runtime safety layers that decide when to inte…
Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline
LLM agents accumulate histories that outgrow their context windows, motivating a growing literature on memory systems. Yet most existing de…
The Digital Apprentice: A Framework for Human-Directed Agentic AI Development
Agentic AI deployments face a recurring design tension: heavy human oversight limits scale, while broad autonomy outruns accountability. Ne…
Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval
Language agents increasingly rely on reusable skills to improve multi-step web automation across related tasks. A growing line of work stud…
Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation
Modern reasoning models can allocate different amounts of test-time computation, such as thinking tokens, model calls, or compute budget, t…
Trivium: Temporal Regret as a First-Class Objective for Causal-Memory Controllers
Many current agentic systems and LLM pipelines correct mistakes by optimizing outcome reward. This addresses only the what of failure: when…
Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation
Multi-step agentic retrieval-augmented generation (RAG) pipelines have demonstrated significant capability for complex reasoning tasks, yet…
The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?
Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a…
AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning
We present AgentJet, a distributed swarm training framework for large language model (LLM) agent reinforcement learning. Unlike centralized…
Beyond Prompt-Based Planning: MCP-Native Graph Planning-based Biomedical Agent System
Biomedical agents promise to automate complex biological workflows, yet current systems face two fundamental bottlenecks: bioinformatics to…
Simulate, Reason, Decide: Scientific Reasoning with LLMs for Simulation-Driven Decision Making
Scientific simulators are increasingly being integrated into LLM-driven systems for high-stakes simulation-driven decision-making. However,…
MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation
Lane-level maps are critical infrastructure for autonomous driving and lane-level navigation, yet constructing and maintaining standardized…
Scaling Self-Evolving Agents via Parametric Memory
Existing memory-augmented LLM agents store past experience exclusively in prompt space, as textual summaries or retrieved passages, while k…
Neetyabhas: A Framework for Uncertainty-Aware Public Policy Optimization in Rational Agent-Based Models
Purpose: The WHO's COVID-19 non-pharmaceutical interventions (e.g., lockdowns, vaccinations) effectively curb transmission but impose heavy…
SCI-PRM: A Tool Aware Process Reward Model for Scientific Reasoning Verification
While Process Reward Models (PRMs) have achieved remarkable success in mathematical reasoning, their application in complex scientific doma…
Learning Admissible Heuristics via Cost Partitioning
Admissible heuristics are essential for optimal planning, yet learning them remains challenging due to the risk of overestimation. Cost par…
Plan First, Judge Later, Run Better: A DMAIC-Inspired Agentic System for Industrial Anomaly Detection
Large language model (LLM) agents have shown promise in automating complex data-analysis workflows, but their reliable deployment remains c…
Parthenon Law: A Self-Evolving Legal-Agent Framework
As agents grow more capable, legal-domain LLM agents promise to turn document-heavy matters into reviewable work products -- yet reliable d…
A Normative Intermediate Representation for ASP-Based Compliance Reasoning
We propose MONIR, a Modalized-Output Normative Intermediate Representation for ASP-based compliance reasoning. Its core fragment has a stag…
MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models
Mobile agents are increasingly expected to operate everyday applications from screenshots and language goals, where reliable control requir…
BiNSGPS: Geometry Problem Solving via Bidirectional Neuro-Symbolic Interaction
Geometry problem solving poses distinct challenges in artificial intelligence. Existing approaches typically fall into two paradigms: symbo…
Fog of Love: Engineering Virtuous Agent Behavior with Affinity-based Reinforcement Learning in a Game Environment
Instilling virtuous behavior in artificial intelligence has seen increasing interest. One of the techniques proposed is known as affinity-b…
FALSIFYBENCH: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games
Large language models (LLMs) are increasingly deployed as autonomous agents in scientific tasks. Yet whether these systems can effectively…
Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories
Safety-aligned Large Language Models (LLMs) remain vulnerable to interventions during inference that redirect generation toward harmful out…
Tree-Based Formalization of Multi-Agent Complementarity in Human-AI Interactions
Complementarity is the case in which a human-AI interaction (HAI) outperforms the best prediction benchmark available among its members. A…
AIP: A Graph Representation for Learning and Governing Agent Skills
Agent Skills today consist largely of free-form prose requiring the agent to read, interpret, and re-derive how to act in every session. Th…
BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization
Mitigating social bias in Large Language Models (LLMs) presents a distinct alignment challenge: unlike verifiable tasks, bias lacks a singl…
Beyond Objective Equivalence: Constraint Injection for LLM-Based Optimization Modeling on Vehicle Routing Problems
Large language models (LLMs) increasingly translate natural-language optimization problems into executable solver code. Yet for constraint-…
R-APS: Compositional Reasoning and In-Context Meta-Learning for Constrained Design via Reflective Adversarial Pareto Search
Large language models (LLMs) are fluent on open-ended tasks, yet in agentic settings, where a system must plan, use tools, and act over ext…
AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety
As AI companion platforms such as Replika and Character.AI rapidly grow, concerns about unsafe human-AI interactions have intensified. This…
What Type of Inference is Active Inference?
Active inference casts decision-making as inference, with the Expected Free Energy (EFE) unifying goal-directed and information-seeking beh…
Strabo: Declarative Specification and Implementation of Agentic Interaction Protocols
The last few years have witnessed major advances in the modeling and implementation of multiagent systems based on declarative interaction…
AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?
Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring ou…
Knowledge Index of Noah's Ark
Knowledge benchmarks for LLMs face three issues: scaling-driven designs that do not operationalize disciplinary representativeness; flat-pa…
AI from concrete to abstract: demystifying artificial intelligence to the general public
Artificial Intelligence (AI) has been adopted in a wide range of domains. This shows the imperative need to develop means to endow common p…
How do machines learn? Evaluating the AIcon2abs method
This study expands on previous work that introduced the AIcon2abs method (AI from Concrete to Abstract: Demystifying Artificial Intelligenc…
DiffAero: A GPU-Accelerated Differentiable Simulation Framework for Efficient Quadrotor Policy Learning
This letter introduces DiffAero, a lightweight, GPU-accelerated, and fully differentiable simulation framework designed for efficient quadr…
SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification
Few-shot classification (FSC) is widely used for learning from limited labeled data, yet most evaluations implicitly assume that target con…
Constraint-Enhanced Physical Search through Correlation Matching
Physical systems do not merely add noise to search processes; they impose constraints that generate structured correlations. We propose a p…
Early Detection of Alzheimer's Disease Using Explainable Machine Learning on Clinical Biomarkers: A Multi-Class Classification Study Using the Alzheimer's Disease Neuroimaging Initiative (ADNI) Dataset
Background: Alzheimer's disease (AD) affects over 55 million people worldwide. Accurate, interpretable detection of normal cognition (NC),…
Neural Radiated-Noise Fields for Unmanned Underwater Vehicle Noise Spectrum Prediction in Three-Dimensional Scenes
Radiated noise in unmanned underwater vehicles (UUVs) is an important indicator for characterizing acoustic signatures and evaluating platf…
Counterfactual Explanations for Deep Two-Sample Testing
Two-sample testing is a fundamental tool for detecting distributional differences across scientific domains, but classical tests (including…
The Variance Brain Foundation Models Forgot: Third-Order Statistics Predict Cognition Where Billion-Parameter Models Fail
Brain foundation models (BFMs) are self-supervised Transformers pretrained on fMRI data. We posit that these models should capture each sub…
Gravity-Aware Hierarchical Routing for Lightweight SensorLLM on Human Activity Recognition
Recent studies on sensor-language alignment have shown that two-stage frameworks can improve the semantic modeling ability of wearable-sens…
CodegenBench: Can LLMs Write Efficient Code Across Architectures?
While large language models (LLMs) have been extensively evaluated on code generation tasks for general-purpose programming and GPU-acceler…
The Biomimetic Architecture of Software 4.0
Dominant programming paradigms inherit an execution model optimised for a bygone era of a single human mind instructing a local machine, le…
MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models
Diffusion large language models (dLLMs) generate text by iteratively denoising partially masked sequences under bidirectional context, expo…
Position: Deployed Reinforcement Learning should be Continual
Reinforcement Learning (RL) has received increasing attention and adoption in real-world use cases. Most of these systems follow a train-th…
Do Transformers Need Three Projections? Systematic Study of QKV Variants
Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a c…
Unpredictable Safety: Domain-Dependent Compliance and the Transparency Gap in Open-Weight LLMs
We present a systematic study of domain-dependent safety behavior in open-weight LLMs: 7 standardized experiments across 7 ethical domains,…
Beyond Static Priors: Dynamic Neural Guidance for Large-Scale Ant Colony Optimization
Neural-guided Ant Colony Optimization (ACO) suffers from a fundamental training-inference misalignment: policies are typically trained to g…
Channel-Oriented Design for EEG-to-Music Reconstruction
Brain-computer interfaces aim to decode naturalistic stimuli from neural signals, yet most progress to date has focused on vision and langu…
Bayes-Sufficient Representations in Supervised Learning
Representation learning is often described as preserving the information in an input that is relevant for prediction. This work asks what r…
Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation
In embodied vision-language decision making tasks such as robotic manipulation and navigation, Vision-Language and Vision-Language-Action M…
Unlocking Feature Learning in Gated Delta Networks at Scale
Training and scaling Large Language Models demand enormous computational resources, motivating both efficient sub-quadratic architectures a…
LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection
Existing quantization methods are fundamentally limited by rigid, integer-based bit-widths (e.g., 2, 3-bit), resulting in a "deployment ga…
RUBAS: Rubric-Based Reinforcement Learning for Agent Safety
The evolution of LLMs into tool-enabled agents creates a new class of safety challenges associated with real-world execution rather than si…
A Goal-Set Characterization of Task Composition in the Boolean Task Algebra
The Boolean Task Algebra (BTA) provides a principled framework for zero-shot task composition in reinforcement learning by equipping goal-r…
The Invisible Lottery: How Subtle Cues Steer Algorithm Choice in LLM Code Generation
Large language models (LLMs) now generate substantial production code, often for tasks with multiple valid algorithmic solutions. Incidenta…
Spectral Scaling Laws of Muon
Orthonormalized update rules have rapidly become a leading choice of optimizer for training large language models, with recent open-source…
LLM Compression with Jointly Optimizing Architectural and Quantization choices
Deploying large language models (LLMs) is challenging due to their significant memory and computational requirements. While some methods ad…
Need to Know: Contextual-Integrity-Grounded Query Rewriting for Privacy-Conscious LLM Delegation
As LLMs become increasingly woven into everyday workflows, user queries sent to cloud hosted LLMs routinely mix task-essential content with…
TPA-AD: A Two-Stage Pseudo Anomaly-Guided Method for Bearing Time-Series Anomaly Detection
This paper proposes a two-stage pseudo anomaly-guided anomaly detection method (Two-stage Pseudo Anomaly-guided…
Adaptive Patching Is Harder Than It Looks For Time-Series Forecasting
Adaptive patching is a recent and compelling proposal for time-series Transformers: allocate finer patches where the sequence looks locally…
Large Language Models Hack Rewards, and Society
Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs) to learn from rewards. We o…
POLARIS: Guiding Small Models to Write Long Stories
Small open-weight models struggle at long-form creative writing: their generated stories either fall far short of the requested length, or…
The Differentiable Auditory Loop (DAL): An ML Framework for Hyper-Personalized Hearing Aids
Conventional hearing aids rely on fixed, frequency-dependent amplification and compression to manage reduced sensitivity, which often fails…
Proof-Carrying Agent Actions: Model-Agnostic Runtime Governance for Heterogeneous Agent Systems
Agent systems execute through runtimes with very different control points: local coding tools, framework SDKs, managed agent platforms, API…
Building The Ph(ysical)AI Layer Of Machine Intelligence
Foundation models achieve generalization through massive-scale training on diverse data, but have limitations with transfer to truly unseen…
SymTRELLIS: Symmetry-Enforced Voxel Latents for 3D Generation
Single-view 3D generative models have achieved impressive visual quality, yet they are not designed to satisfy structural or functional req…
AgenticDiffusion: Agentic Diffusion-based Path Planning for Vision-Based UAV Navigation
Indoor UAV navigation requires efficient exploration, scene understanding, and reliable trajectory execution under limited field-of-view ob…
dMX: Differentiable Mixed-Precision Assignment for Low-Precision Floating-Point Formats
Quantizing large language models (LLMs) to low-precision floating-point representations is central to efficient deployment, yet applying a…
SaliMory: Orchestrating Cognitive Memory for Conversational Agents
Conversational agents that serve as lifelong companions must maintain persistent memory across all interactions. However, simply expanding…
Semantic Constraint Synthesis for Adaptive Trajectory Optimization via Large Language Models
Trajectory optimization is a critical component for enabling safe and reliable autonomous operations in space exploration. As space mission…
HighTide: An Agent-Curated Open-Source VLSI Benchmark Suite
We introduce HighTide, an evolving AI-assisted benchmark suite. Specifically, the contributions are: (i) a diverse open-source suite spanni…
Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents
LLM agents often place sensitive credentials in the same context window as untrusted retrieved content, creating a direct path for indirect…
Physics-Informed Machine Learning for Short-Term Flood Prediction
Accurate flood forecasting is essential for mitigating disaster risks and protecting communities. However, purely data-driven machine learn…
EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms
Cloud LLM fine-tuning platforms increasingly serve RLHF workloads, where a learned reward model is optimized as a proxy for human quality.…
ADAPTOOD: Uncertainty-Aware Fine-Tuning for Out-of-Distribution ECG Time Series Models
Data samples used for training often differ from those encountered during fine-tuning and deployment, and while ML models show promise, the…
Smart Transportation Without Neurons -- Fair Metro Network Expansion with Tabular Reinforcement Learning
We tackle the Metro Network Expansion Problem (MNEP), a subset of the Transport Network Design Problem (TNDP), which focuses on expanding m…
MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments
File-type classification underlies many workflows like malware triage, forensic carving, packet inspection, and storage indexing. Learned s…
A Systematic Analysis of Linguistic Features in AI-Generated Text Detection Across Domains and Models
Interpretable linguistic features offer a promising approach for explaining why a given text appears machine-generated, particularly for no…
Exact Unlearning in Reinforcement Learning
We formulate the problem of exact unlearning in reinforcement learning, where the goal is to design an efficient framework that enab…
Dual Advantage Fields
Offline goal-conditioned reinforcement learning requires both long-horizon reachability estimates and local action comparisons. Dual goal r…
Metric-Aware Hybrid Forecasting for the CTF4Science Lorenz Challenge
We describe our approach to the CTF4Science Lorenz challenge, a benchmark that mixes short-horizon forecasting, long-time distribution matc…
Notarized Agents: Receiver-Attested Confidential Receipts for AI Agent Actions
Current AI agent observability is structurally compromised: the entity producing the activity log is the same entity whose activity is bein…
DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities
The growing popularity and capacity of generative models have eroded the distinction between human and machine-generated content, motivatin…
PerceptTwin: Semantic Scene Reconstruction for Iterative LLM Planning and Verification
Simulation environments are useful for both robot policy learning and planning verification and validation. Traditionally, the process of c…
Incremental Sheaf Cohomology on Cellular Complexes: O(1)-in-n Lazy Edit Processing under Bounded Local Geometry
We present an algorithmic framework for incremental maintenance of first sheaf cohomology $H^1(X; \mathcal{F})$ on dynamically evolving 1-d…
MM-BizRAG: Rethinking Multimodal Retrieval-Augmented Generation for General Purpose Enterprise Q&A
Recent advances in multimodal retrieval-augmented generation (MM-RAG) have shifted toward minimal parsing, relying on page-level images for…
Supportive Token Revealing for Fast Diffusion Language Model Decoding
Discrete diffusion language models can generate text efficiently by updating multiple masked positions in parallel, but this parallelism in…
Recover-LoRA for Aggressive Quantization: Reclaiming Accuracy in 2-Bit Language Models via Low-Rank Adaptation with Knowledge Distillation on Synthetic Data
Aggressive weight quantization to 2-bit precision offers substantial throughput and memory gains for large language model (LLM) inference,…
Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)
Retrieval over visually-rich documents, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-…
Can I Take Another Dose? Evaluating LLM Decision-Making Under Temporal Uncertainty in OTC Dosing QA
Large language models (LLMs) are increasingly used for everyday health questions, including whether a user can safely take another dose of…
Instant-Fold: In-Context Imitation Learning for Deformable Object Manipulation
Deformable object manipulation (DOM) is challenging due to high-dimensional, partially observable states that evolve through long-horizon,…
StandardE2E: A Unified Framework for End-to-End Autonomous Driving Datasets
Autonomous driving has shifted from modular perception-prediction-planning stacks toward end-to-end (E2E) models that map sensor inputs dir…
From Ticks to Flows: Dynamics of Neural Reinforcement Learning in Continuous Environments
We present a novel theoretical framework for deep reinforcement learning (RL) in continuous environments by modeling the problem as a conti…
The Loss Is Not Enough: Sampling Conditions and Inductive Bias in Contrastive Representation Learning
Contrastive learning has become a leading paradigm for self-supervised representation learning, yet the conditions under which it recovers…
Sparse Mixture-of-Experts Reward Models Learn Interpretable and Specialized Experts for Personalized Preference Modeling
Preference modeling plays a central role in reinforcement learning from human feedback (RLHF), enabling large language models (LLMs) to ali…
Scaling Novel Graph Generation via Lightweight Structure-Guided Autoregressive Models
Generating realistic and diverse graphs is a key problem in machine learning, with applications in molecular discovery, circuit design, cyb…
Anycast Performance in Context
IP anycast lets a service advertise one address from many physical sites, leaving BGP to map each client to a site. It is central to the DN…
OpenRFM: Dissecting Relational In-Context Learning
Relational Foundation Models (RFMs) promise a single pre-trained predictor that, given any relational database, returns predictions in one…
Measuring What Matters: Synthetic Benchmarks for Concept Bottleneck Models
Concept bottleneck models predict outcomes from high-level concepts detected in inputs. Although concepts provide a simple way to reap bene…
A Geometric Characterization of the Stationary Plateau for Two-Layer Neural Networks
We investigate the geometric structure of stationary plateaus that arise in the loss landscape of two-layer neural networks with smooth act…
Generalizable Multi-Task Learning for Wireless Networks Using Prompt Decision Transformers
Future wireless networks demand rapid adaptation to highly heterogeneous environments and dynamic task configurations, necessitating a shif…
From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents
Memory is a core component of AI agents, enabling them to accumulate knowledge across interactions and improve performance. However, persis…
Expectations vs. Realities: The Cost of MSE-Optimal Forecasting Under Conditional Uncertainty
Multi-step time series forecasting (MSF) is commonly evaluated using point-wise error metrics such as mean squared error (MSE), implicitly…
HYolo: An Intelligent IoT-Based Object Detection System Using Hypergraph Learning
This paper presents HYolo, an intelligent IoT-based object detection framework that integrates hypergraph learning into the YOLO architectu…
MorphoQuant: Modality-Aware Quantization for Omni-modal Large Language Models
Conventional Post-Training Quantization (PTQ) methods struggle with 4-bit Omni-modal Large Language Models (OLLMs) due to the extreme distr…
Multi-Granularity 3D Kidney Lesion Characterization from CT Volumes
Radiology reports describe kidney lesions by type, size, enhancement, and attenuation, yet existing 3D methods predict only at the patient…
Selective Coupling of Decoupled Informative Regions: Masked Attention Alignment for Data-Free Quantization of Vision Transformers
Data-Free Quantization (DFQ) addresses data security concerns by synthesizing samples, without accessing real data. It has garnered increas…
DSIRM: Learning Query-Bridged Discrete Semantic Identifiers for E-commerce Relevance Modeling
Despite rapid progress of continuous embeddings for e-commerce search relevance, a long-standing open problem is the difficulty in capturin…
From Symbolic to Geometric: Enabling Spatial Reasoning in Large Language Models
Recent large language models (LLMs) often appear to exhibit spatial reasoning ability; however, this capability is largely symbolic,…
LCSHBench: A Multilingual, Consensus-Grounded Benchmark for Library of Congress Subject Heading Assignment
Automated subject cataloging assigns controlled-vocabulary headings to bibliographic records, but LCSH has no standard public benchmark. We…
Rethinking Sales Lead Scoring with LLM-based Hierarchical Preference Ranking
Sales lead conversion in high-stakes domains (e.g., automotive, real estate) differs fundamentally from e-commerce recommendation due to pr…
TITAN-FedAnil+: Trust-Based Adaptive Blockchain Federated Learning for Resource-Constrained Intelligent Enterprises
Federated Learning (FL) has emerged as an effective paradigm for collaborative intelligence while preserving data privacy. However, data he…
Low-Rank Decay for Grokking in Scale-Invariant Transformers: A Spectral-Geometric View
Modern Transformer architectures frequently employ normalization mechanisms such as RMSNorm and Query-Key Normalization, making parts of th…
An Ensembled Latent Factor Model via Differential Evolution and Gradient Descent Optimization
High-dimensional and incomplete (HDI) data are prevalent in many real-world big data scenarios. Latent factor models serve as a common repr…
An Empirical Study of Data Scale, Model Complexity, and Input Modalities in Visual Generalization
Modern deep neural networks usually have large parameter scales and nonlinear hierarchical structures, and they have achieved strong perfor…
L-TGVN: Leveraging Longitudinal Priors for Personalized Rapid MRI
MRI provides excellent soft-tissue contrast without ionizing radiation, but long acquisition times increase patient discomfort while also r…
What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems
Modern agentic systems transform LLMs from session-bounded assistants into stateful systems that persist and evolve shared world state acro…
LoopMoE: Unifying Iterative Computation with Mixture-of-Experts for Language Modeling
Mixture-of-Experts (MoE) and looped architectures scale models along two orthogonal axes, namely parameter capacity and effective depth. Ho…
MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning
AI systems increasingly need to combine two demanding capabilities: navigating multi-session conversation history and performing deep readi…
RowNet: A Memory Transformer for Tabular Regression
Real estate valuation is a structured regression problem in which prices are governed by heterogeneous feature types, sparse regional effec…
Token Rankings are Unforgeable Language Model Signatures
Language model parameters are known to impose unique (to each model) geometric constraints on their logit outputs, which serves as a signat…
CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities
AI has the potential to transform cybersecurity by enabling systems that can autonomously detect, analyze, and remediate software vulnerabi…
SePO: Self-Evolving Prompt Agent for System Prompt Optimization
System prompt optimization improves agent behavior without modifying the underlying model, yielding human-readable, model-agnostic instruct…
ParetoPilot: Zero-Surrogate Offline Multi-Objective Optimization via Infer-Perturb-Guide Diffusion
Offline multi-objective optimization (Offline MOO) aims to discover novel Pareto-optimal designs based on static datasets without expensive…
Adaptive Calibration for Fair and Performant Facial Recognition
We introduce Adaptive Calibration (AC), a novel calibration strategy for facial recognition that maps cosine similarity between normalized…
ChessMimic: Per-Rating Transformer Models for Human Move, Clock, and Outcome Prediction in Online Blitz Chess
We present ChessMimic, a system of three small encoder-only transformers - for move, thinking-time, and outcome prediction - conditioned on…
Evaluating Reasoning Fidelity in Visual Text Generation
Recent text-to-image (T2I) models can render highly legible and well-structured text within images, enabling applications including documen…
SFMambaNet: Spectral-Frequency Enhanced Selective State Space Model for Correspondence Pruning
Correspondence pruning aims to identify inliers from an initial set of correspondences. Most existing Graph Neural Network (GNN)-based meth…
Smart Picks in the Dark: Towards Efficient RLVR for Reasoning via Tracing Metacognitive Pivots
Reinforcement learning with verifiable rewards (RLVR) has greatly advanced large reasoning models (LRMs), but it requires timely training o…
Self-Evolving Deep Research via Joint Generation and Evaluation
Large Language Models (LLMs) have become increasingly adopted in daily applications, with deep research standing out as a particularly impo…
GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling
Reinforcement learning with verifiable rewards (RLVR) significantly advances LLM reasoning, yet it faces a dilemma: standard supervised sca…
Treat Traffic Like Trees: A Semantic-Preserving Hierarchical Graph-Based Expert Framework for Encrypted Traffic Analysis
Graph-based deep learning methods have been widely employed in encrypted traffic analysis to exploit latent correlations across different g…
ANN Search: Recall What Matters
Approximate nearest neighbor (ANN) search has become a core primitive in information retrieval and modern machine learning tasks, from clas…
Optical-Guided Neural Collapse for SAR Few-Shot Class Incremental Learning
Few-shot class-incremental learning (FSCIL) in synthetic aperture radar imagery presents unique challenges due to severe data scarcity and…
Dynamic Infilling Anchors for Format-Constrained Generation in Diffusion Large Language Models
Diffusion large language models (dLLMs) offer bidirectional attention and parallel generation, enabling them to exploit global context and…
Temporal Order Matters for Agentic Memory: Segment Trees for Long-Horizon Agents
Long-horizon conversational agents need to interact with users through evolving events, tasks, and goals. Such histories are naturally temp…
Rollout-Level Advantage-Prioritized Experience Replay for GRPO
Reinforcement learning from verifiable rewards with GRPO is a standard approach for post-training reasoning LLMs. It remains sample ineffic…
Multi-SPIN: Multi-Access Speculative Inference for Cooperative Token Generation at the Edge
Speculative inference (SPIN) was originally developed as an efficient architecture to accelerate Large Language Models (LLMs). In this work…
Synthetic Personalities: How Well Can LLMs Mimic Individual Respondents Using Socio-Economic Microdata?
LLM-based digital twins promise to scale and accelerate market research, but most published twins are either coarse persona bots conditione…
Ekka: Automated Diagnosis of Silent Errors in LLM Inference
LLM serving frameworks are quickly evolving with a complex software stack and a vast number of optimizations. The rapid development process…
QuBLAST: A Framework for Quantizing Large Language Models with Block-Level Compression Approach and Activation Scaling Strategy
LLMs have become the state-of-the-art algorithms for solving NLP tasks. However, they typically come at huge computational and memory costs…
QO-Bench: Diagnosing Query-Operator-Preserving Retrieval over Typed Event Tuples
Many real-world questions over business, legal, and scientific corpora are natural-language versions of database-style queries over records…
Instance-Level Post Hoc Uncertainty Quantification in Object Detection
Object detection is a safety-critical component of autonomous driving. It is essential to quantify the uncertainty in bounding-box predicti…
Why Muon Outperforms Adam: A Curvature Perspective
Muon improves training efficiency over Adam in large language-model training by about two times, but the local geometric source of this adv…
Learning Long Range Spatio-Temporal Representations over Continuous Time Dynamic Graphs with State Space Models
Continuous-time dynamic graphs (CTDGs) provide a richer framework to capture fine-grained temporal patterns in evolving relational data. Lo…
Real-Time Automatic License Plate Recognition Using YOLOv8, SORT Tracking, and Temporal Data Interpolation
The real-time hardships of video processing seriously limit the usage of Automatic License Plate Recognition (ALPR) with application in dyn…
Graph-Guided Universum Learning in Generalized Eigenvalue Proximal SVMs for Alzheimer's Disease Classification
Early and accurate detection of Alzheimer's disease (AD) is important for timely intervention and disease management. Generalized Eigenvalu…
Enhancing MedSAM with a Lightweight Box Predictor for Medical Image Segmentation
Semantic segmentation in medical imaging is a critical yet challenging task due to data scarcity and high variability across modalities. Wh…
VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training
Universal Manipulation Interface (UMI) enables scalable real-world robot data collection without hardware-specific teleoperation, yet lever…
CoRe-MoE: Contrastive Reweighted Mixture of Experts for Multi-Terrain Humanoid Locomotion with Gait Adaptation
Humans primarily rely on walking and running to traverse complex terrains, without resorting to unnecessarily complex motion patterns. Simi…
Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning
Temporal credit assignment is central to both biological and artificial intelligence, yet its interaction with non-linear function approxim…
Curvature-aware dynamic precision approach for physics-informed neural networks
Physics-informed neural networks (PINNs) have become a promising framework for simulating partial differential equations (PDEs) by embeddin…
Revisiting Vul-RAG: Reproducibility and Replicability of RAG-based Vulnerability Detection with Open-Weight Models
Large language models (LLMs) have shown strong potential for automated software vulnerability detection, particularly in retrieval-augmente…
TIDE: Proactive Multi-Problem Discovery via Template-Guided Iteration
Agents are widely deployed as assistants over documents, tools, and code. However, they typically act only on explicit user requests, which…
An Empirical Audit of Input Encoders for Multi-Channel Signal Transformers
Transformers consuming multi-channel scalar signals must embed $C$ simultaneous values into one $d_{\text{model}}$-dimensional vector per t…
Archi: Agentic Operations at the CMS Experiment
We present Archi, an open-source, end-to-end framework for scientific collaborations that combines the systematic ingestion and organizatio…
Description-Code Inconsistency in Real-world MCP Servers: Measurement, Detection, and Security Implications
The Model Context Protocol (MCP) has emerged as a critical standard empowering Large Language Models (LLMs) to utilize external tools. In t…
Coarse-to-fine Hierarchical Architecture with Sequential Mamba for Brain Reconstruction
Understanding the relationship between deep visual representations and the human visual system is a fundamental challenge in computational…
Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control
Text-to-video (T2V) models trained on large-scale web data can generate undesired content, motivating interventions that reduce harmful out…
NoRA: Evaluating Grounded Reasonableness in Visual First-person Normative Action Reasoning
LLMs and agentic systems are increasingly deployed in social environments, making normative competence critical for safe and appropriate be…
Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees
Guaranteeing safety is critical to the deployment of reinforcement learning (RL) agents in the real world, especially as policies learned u…
Learning While Acting: A Skill-Enhanced Test-Time Co-Evolution Framework for Online Lifelong Learning Agents
Lifelong learning is essential for Large Language Model (LLM) agents operating in dynamic, interactive environments. However, existing life…
OA-CutMix: Correcting the Label Bias of CutMix
CutMix has become the de facto standard mixing augmentation, yet its label assignment rests on a flawed assumption: The area of the pasted…
Signed Dual Attention: Capturing Signed Dependencies in Time Series Forecasting
Initially developed for natural language processing, Transformer architectures and attention mechanisms are now central to a wide range of…
Uncertainty-Aware End-to-End Co-Design of Neural Network Processors: From Training and Mapping to Fabrication
Designing a neural network processor is an end-to-end co-design problem: network architecture and training budget determine the inference w…
Learning Empirically Admissible Neural Heuristics for Combinatorial Search
Finding optimal solution paths for combinatorial puzzles like the Rubik's Cube, sliding tile puzzles, and Lights Out remains a classical ch…
Abduction Prover in Isabelle/HOL
Proof assistants based on expressive logics suffer limited automation for proof search, raising the cost of formal verification based on pr…
DiverAge: Reliable Pluralistic Face Aging with Cross-Age Identity Relation Guidance
Face aging plays an important role in long-term biometric analysis, cross-age identity verification, and forensic identity analysis. Since…
Provably Auditable and Safe LLM Agents from Human-Authored Ontologies
We introduce the LLM agent architecture Agentic Redux, intended for use with nontrivial problem domains that require linear auditability. U…
'Your AI Text is not Mine': Redefining and Evaluating AI-generated Text Detection under Realistic Assumptions
Although it is generally agreed that AI-generated text poses a broad societal risk, there is no common understanding in the AI-generated te…
Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models
Current prompt-based and adapter-based tuning of vision-language models (VLMs) is attractive for medical imaging, where clinical data sensi…
Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning
Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rubrics as rewards. However, pol…
AdaKoop: Efficient Modeling of Nonlinear Dynamics from Nonstationary Data Streams with Koopman Operator Regression
Real-time data analysis requires the ability to accurately and adaptively address nonlinear dynamics in a nonstationary data stream while p…
From Prompt to Process: a Process Taxonomy and Comparative Assessment of Frameworks Supporting AI Software Development Agents
AI tools for programming are no longer just autocomplete or chat assistants: they organize themselves as development frameworks, with proce…
Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance
We envision a proactive multi-modal assistant system which gives users real-time step-by-step guidance on a procedural task, autonomously d…
DeliChess: A Multi-party Dialogue Dataset for Deliberation in Chess Puzzle Solving
Multi-party dialogue is a critical setting for studying collaborative reasoning and decision-making, yet existing datasets rarely focus on…
From Agent Traces to Trust: Evidence Tracing and Execution Provenance in LLM Agents
Large language model (LLM)-based agents increasingly solve complex tasks by interacting with external tools, retrieval systems, memory modu…
SharedRequest: Privacy-Preserving Model-Agnostic Inference for Large Language Models
With the widespread deployment of public large language models (LLMs) such as ChatGPT, protecting user prompt privacy has become an increas…
M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks
As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts i…
DAR: Deontic Reasoning with Agentic Harnesses
Deontic reasoning is the task of answering questions by applying explicit rules and policies to case-specific facts, for example computing…
Invariant Gradient Alignment for Robust Reasoning Distillation
Large language models (LLMs) suffer from shortcut learning: they systematically fail on out-of-distribution (OOD) inputs whose semantic sur…
Self-Reflective APIs: Structure Beats Verbosity for AI Agent Recovery
When an AI agent calls an API and hits a validation error, it needs more than what went wrong -- it needs what to do next. A self-reflectiv…
UniCAD: A Unified Benchmark and Universal Model for Multi-Modal Multi-Task CAD
Computer-Aided Design (CAD) underpins modern engineering and manufacturing by enabling the creation of precise, editable 3D models. However…
Automatic Generation of Titles for Research Papers Using Language Models
The title of a research paper conveys its primary idea and, occasionally, its conclusions in a clear and concise manner. Choosing an approp…
Arithmetic Pedagogy for Language Models
We investigate whether methods of human mathematics pedagogy can guide the training of language models toward arithmetic reasoning. Buildin…
Who Needs Labels? Adapting Vision Foundation Models With the Metadata You Already Have
We propose a label-free approach to adapt powerful but generic vision foundation models to specialized scientific domains. Standard supervi…
Continual Visual and Verbal Learning Through a Child's Egocentric Input
Children learn the meanings of words from a continuous, temporally structured stream of egocentric experience. Recent work shows that neura…
Audio Interaction Model
Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and streaming audio models each h…
Towards Efficient and Evidence-grounded Mobility Prediction with LLM-Driven Agent
Individual-level mobility prediction is central to urban simulation, transportation planning, and policy analysis. Supervised sequence mode…
GeM-NR: Geometry-Aware Multi-View Editing for Nonrigid Scene Changes
Recent developments in multi-view image editing with generative models have brought us a step closer toward general 3D content generation a…
Failed Reasoning Traces Tell You What Is Fixable (But Not by Reading Them)
When post-trained language models fail on reasoning problems, the common test-time-scaling response is to spend more compute on additional…
Multi-Column RBF Neural Network Using Adaptive and Non-Adaptive Particle Swarm Optimization
The radial basis function neural network (RBFN) trained with a gradient descending algorithm provides an effective fully connected structur…
Reinforcement Learning from Rich Feedback with Distributional DAgger
Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly…
Streaming Communication in Multi-Agent Reasoning
Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scale linearly with pipeline dept…
Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning
Recent language models exhibit strong reasoning capabilities, yet the influence of long-context capacity on reasoning remains underexplored…
Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?
Toxicity remains a leading cause of early-stage drug development failure. Despite advances in molecular design and property prediction, the…
Constrained Adaptive Rejection Sampling
Language Models (LMs) are increasingly used in applications where generated outputs must satisfy strict semantic or syntactic constraints.…
Aligning Deep Implicit Preferences by Learning to Reason Defensively
Personalized alignment is crucial for enabling Large Language Models (LLMs) to engage effectively in user-centric interactions. However, cu…
Adaptive Minds: Empowering Agents with LoRA-as-Tools
We investigate a framework in which LoRA adapters are treated as callable tools that a base language model can dynamically select and invok…
BRAINCELL-AID: An Agentic AI Created Brain Cell Type Resource for Community Annotation
Single-cell RNA sequencing has transformed our ability to identify diverse cell types and their transcriptomic signatures. However, annotat…
A Unified Geometric Space for Topological Alignment Between Transformer-Based Models and Human Brain Networks
Prior brain-AI alignment studies are typically constrained by specific inputs and tasks, limiting their ability to capture organizational p…
MENTOR: A Metacognition-Driven Self-Evolution Framework for Uncovering and Mitigating Implicit Domain Risks in LLMs
Ensuring the safety of Large Language Models (LLMs) is critical for real-world deployment. However, current safety measures often fail to a…
Reasoning or Fluency? Dissecting Probabilistic Confidence in Best-of-N Selection
Probabilistic confidence metrics are increasingly adopted as proxies for reasoning quality in Best-of-N selection, under the assumption tha…
Success Conditioning as Policy Improvement: The Optimization Problem Solved by Imitating Success
A widely used technique for improving policies is success conditioning, in which one collects trajectories, identifies those that achieve a…
PersistBench: When Should Long-Term Memories Be Forgotten by LLMs?
Conversational assistants are increasingly integrating long-term memory with large language models (LLMs). This persistence of memories, e.…
Interfaze: The Future of AI is built on Task-Specific Small Models
We present Interfaze, a native hybrid model that fuses task-specific deep neural networks (CNNs and DNNs) directly into a transformer decod…
SciDER: Scientific Data-centric End-to-end Researcher
While large language models accelerate scientific discovery, existing agents face severe limitations in adaptability, domain generalization…
MedForge: Interpretable Medical Deepfake Detection via Forgery-aware Reasoning
Text-guided image editors can now manipulate authentic medical scans with high fidelity, enabling lesion implantation/removal that threaten…
Bilevel Autoresearch: Meta-Autoresearching Itself
If autoresearch is itself a form of research, then autoresearch can be applied to research itself. We present Bilevel Autoresearch, a bilev…
Formal Semantics for Agentic Tool Protocols: A Process Calculus Approach
The emergence of large language model agents capable of invoking external tools has created urgent need for formal verification of agent pr…
The Accountability Horizon: An Impossibility Theorem for Governing Human-Agent Collectives
Existing accountability frameworks for AI systems, legal, ethical, and regulatory, rest on a shared assumption: for any consequential outco…
Belief-Aware VLM Model for Human-like Reasoning
Traditional neural network models for intent inference rely heavily on observable states and struggle to generalize across diverse tasks an…
Binary Spiking Neural Networks as Causal Models
We provide a causal analysis of Binary Spiking Neural Networks (BSNNs) to explain their behavior. We formally define a BSNN and represent i…
SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems
AI scientist systems are increasingly deployed for autonomous research, yet their academic integrity has never been systematically evaluate…
Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning
Achieving robust perception-reasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent advancements have pursue…
Unlocking Proactivity in Task-Oriented Dialogue
Proactive task-oriented dialogue (TOD), such as outbound sales, demands a persuasive agent that actively probes the user's concerns and ste…
The Illusion of Opting in AI-Mediated Consequential Decisions
Drawing on Ullmann-Margalit's concept of opting (transformative, irrevocable, and shadowed by foreclosed alternatives), we show that curren…
Transformer-Based Autonomous Driving Models and Deployment-Oriented Compression: A Survey
Transformer-based models are becoming a central paradigm in autonomous driving because they can capture long-range spatial dependencies, mu…
ChatSOP: An SOP-Guided MCTS Planning Framework for Controllable LLM Dialogue Agents
Dialogue agents powered by Large Language Models (LLMs) show superior performance in various tasks. Despite the better user understanding a…
CounterFace: A Synthetic Face Dataset for Fine-Grained Counterfactual Evaluation of Face Recognition Systems
Face recognition (FR) systems are widely deployed in critical applications, making their reliability and robustness across diverse populati…
SSSD: Simply-Scalable Speculative Decoding
Speculative Decoding has emerged as a popular technique for accelerating inference in Large Language Models. However, most existing approac…
LaVIDE: Language-Prompted Satellite Change Detection via Map-Image Alignment
Remote sensing change detection based on a map reference and an up-to-date image boosts timely observation of the Earth's surface when earl…
From Motion Signals to Insights: A Unified Framework for Student Behavior Analysis and Feedback in Physical Education Classes
Analyzing student behavior in educational scenarios is crucial for enhancing teaching quality and student engagement. Existing AI-based mod…
Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time
Recent advances leverage post-training to enhance model reasoning performance, which typically requires costly training pipelines and still…
SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization
Despite advances in pretraining with extended context sizes, large language models (LLMs) still face challenges in effectively utilizing re…
100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?
Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM enables users to eff…
Model-Preserving Adaptive Rounding
The goal of quantization is to produce a compressed model whose output distribution is as close to the original model's as possible. To do…
MesaNet: Sequence Modeling by Locally Optimal Test-Time Training
Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, tran…
Can VLMs Predict Future States? Bootstrapping World Models from Inverse Dynamics
Can unified vision-language models (VLMs) perform forward dynamics prediction (FDP), i.e., predicting the future state (in image form) give…
Time Series Forecasting as Reasoning: A Slow-Thinking Approach with Reinforced LLMs
To advance time series forecasting (TSF), various methods have been proposed to improve prediction accuracy, evolving from statistical tech…
VGGSounder: Audio-Visual Evaluations for Foundation Models
The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSo…
A Study of the Scale Invariant Signal to Distortion Ratio in Speech Separation with Noisy References
This paper examines the implications of using the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) as both evaluation and training objec…
BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format
Many AI alignment discussions of "runaway optimisation" focus on RL agents: unbounded utility maximisers that over-optimise a proxy objecti…
Uncertainty Estimation using Variance-Gated Distributions
Evaluation of per-sample uncertainty quantification from neural networks is essential for decision-making involving high-risk applications.…
KITE: Kernelized and Information Theoretic Exemplars for In-Context Learning
In-context learning (ICL) has emerged as a powerful paradigm for adapting large language models (LLMs) to new and data-scarce tasks using o…
ClustRecNet: A Novel End-to-End Deep Learning Framework for Clustering Algorithm Recommendation
Identifying an effective clustering algorithm for a given dataset remains a fundamental unsupervised learning issue. We introduce ClustRecN…
Platonic Transformers: A Solid Choice For Equivariance
While widespread, Transformers lack inductive biases for geometric symmetries common in science and computer vision. Existing equivariant m…
Can Reasoning Path still be Effective as Input? Bridging Post-Reasoning to Chain-of-Thought Compression
Recent developments have enabled advanced reasoning in Large Language Models (LLMs) via long Chain-of-Thought (CoT), trading efficiency dur…
Simplicial Embeddings Improve Sample Efficiency in Actor-Critic Agents
Recent works have proposed accelerating the wall-clock training time of actor-critic methods via the use of large-scale environment paralle…
Test-time reward-guided alignment of language models by importance sampling on pre-logit space
Test-time alignment of large language models (LLMs) attracts attention because fine-tuning of LLMs requires high computational costs. In th…
Vectorized Online POMDP Planning
Planning under partial observability is an essential capability of autonomous robots. The Partially Observable Markov Decision Process (POM…
Extending Fair Null-Space Projections for Continuous Attributes to Kernel Methods
With the on-going integration of machine learning systems into the everyday social life of millions the notion of fairness becomes an ever…
OckBench: Measuring the Efficiency of LLM Reasoning
Large language models (LLMs) such as GPT-5 and Gemini 3 have pushed the frontier of automated reasoning and code generation. Yet current be…
SAM 3D: 3Dfy Anything in Images
We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a singl…
AttnRegDeepLab: A Two-Stage Decoupled Framework for Interpretable Embryo Fragmentation Grading
Embryo fragmentation is a morphological indicator critical for evaluating developmental potential in In Vitro Fertilization (IVF). However,…
Dynamic Content Moderation in Livestreams: Combining Supervised Classification with MLLM-Boosted Similarity Matching
Content moderation remains a critical yet challenging task for large-scale user-generated video platforms, especially in livestreaming envi…
Topology Matters: Measuring Memory Leakage in Multi-Agent LLMs
Graph topology is a fundamental determinant of memory leakage in multi-agent LLM systems, yet its effects remain poorly quantified. We intr…
DVGT: Driving Visual Geometry Transformer
Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, there still lacks a driving-…
You Only Train Once: Differentiable Subset Selection for Omics Data
Selecting compact and informative gene subsets from single-cell transcriptomic data is essential for biomarker discovery, improving interpr…
Semiparametric Preference Optimization: Your Language Model is Secretly a Single-Index Model
Policy alignment to preference data typically assumes a known link function between observed preferences and latent rewards (e.g., Bradley-…
Geometry-Aware Hallucination Detection in Large Language Models
Large language models (LLMs) frequently generate factually incorrect or unsupported content, commonly referred to as hallucinations. Prior…
Mid-Think: Training-Free Intermediate-Budget Reasoning via Token-Level Triggers
Hybrid reasoning language models are commonly controlled through high-level Think/No-think instructions to regulate reasoning behavior, yet…
Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models
Pre-Layer Normalization (Pre-LN) is the de facto choice for large language models (LLMs) and is crucial for stable pretraining and effectiv…
MedRedFlag: Investigating how LLMs Redirect Misconceptions in Real-World Health Communication
Real-world health questions from patients often unintentionally embed false assumptions or premises. In such cases, safe medical communicat…
Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data
Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate inter…
Can professional translators identify machine-generated text?
This study investigates whether professional translators without prior specialized training can reliably identify short stories generated i…
Do readers prefer AI-generated Italian short stories?
This study investigates whether readers prefer AI-generated short stories in Italian over one written by a renowned Italian author. In a bl…
Conditional PED-ANOVA: Hyperparameter Importance in Hierarchical & Dynamic Search Spaces
We propose conditional PED-ANOVA (condPED-ANOVA), a principled framework for estimating hyperparameter importance (HPI) in conditional sear…
L$^3$: Large Lookup Layers
Modern sparse language models typically achieve sparsity through Mixture-of-Experts (MoE) layers, which dynamically route tokens to dense M…
Culturally Grounded Personas in Large Language Models: Characterization and Alignment with Socio-Psychological Value Frameworks
Despite the growing utility of Large Language Models (LLMs) for simulating human behavior, the extent to which these synthetic personas acc…
Tuning the Implicit Regularizer of Masked Diffusion Language Models: Enhancing Generalization via Insights from $k$-Parity
Masked Diffusion Language Models have recently emerged as a powerful generative paradigm, yet their generalization properties remain unders…
R3G: A Reasoning-Retrieval-Reranking Framework for Vision-Centric Answer Generation
Vision-centric retrieval for VQA requires retrieving images to supply missing visual cues and integrating them into the reasoning process.…
SUSD: Structured Unsupervised Skill Discovery through State Factorization
Unsupervised Skill Discovery (USD) aims to autonomously learn a diverse set of skills without relying on extrinsic rewards. One of the most…
Efficient Adversarial Attacks on High-dimensional Offline Bandits
Bandit algorithms have recently emerged as a powerful tool for evaluating machine learning models, including generative image models and la…
Making Expert Reasoning Learnable with Self-Distillation
Improving the reasoning capabilities of large language models (LLMs) typically relies either on the model's ability to sample a correct sol…
What Structural Inductive Bias Helps Transformers Reason Over Knowledge Graphs? A Study with Tabula RASA
What structural inductive bias helps transformers reason over knowledge graphs? Through controlled ablations of a minimal transformer modif…
TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering
As increasingly capable open-weight large language models (LLMs) are deployed, improving their tamper resistance against unsafe modificatio…
AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms
Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, bu…
MuCO: Generative Peptide Cyclization Empowered by Multi-stage Conformation Optimization
Modeling peptide cyclization is critical for the virtual screening of candidate peptides with desirable physical and pharmaceutical propert…
Unifying Model-Free Efficiency and Model-Based Representations via Latent Dynamics
We present Unified Latent Dynamics (ULD), a novel reinforcement learning algorithm that unifies the efficiency of model-free methods with t…
Toward Autonomous O-RAN: A Multi-Scale Agentic AI Framework for Real-Time Network Control and Management
Open Radio Access Networks (O-RAN) promise flexible 6G network access through disaggregated, software-driven components and open interfaces…
Tomography by Design: An Algebraic Approach to Low-Rank Quantum States
We present an algebraic algorithm for quantum state tomography that leverages measurements of certain observables to estimate structured en…
A Unified Framework for Locality in Scalable MARL
Scalable methods for networked multi-agent reinforcement learning let each agent plan using only a small neighborhood of the agent graph. T…
DSL-Topic: Improving Topic Modeling by Distilling Soft Labels from Language Models
Traditional neural topic models are typically optimized by reconstructing the document's Bag-of-Words (BoW) representations, overlooking co…
Value Entanglement: Conflation Between Different Kinds of Good In (Some) Large Language Models
Value alignment of Large Language Models (LLMs) requires us to empirically measure these models' actual, acquired representation of value.…
Does Order Matter: Connecting The Law of Robustness to Robust Generalization
Bubeck and Sellke (2021) propose the connection between the Law of Robustness and robust generalization error as an open problem. The Law of…
Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction
Leader-follower interaction is an important paradigm in human-robot interaction (HRI). Yet, assigning roles in real time remains challengin…
ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling
This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing wor…
Beyond Pixel Histories: World Models with Persistent 3D State
Interactive world models continually generate video by responding to a user's actions, enabling open-ended generation capabilities. However…
ZeroWBC: Learning Natural Whole-Body Humanoid Interaction from Human Egocentric Data
Achieving versatile and natural whole-body humanoid interaction control remains challenging due to the high cost of whole-body teleoperatio…
Physics-Informed Neural Engine Sound Modeling with Differentiable Pulse-Train Synthesis
Engine sounds originate from sequential exhaust pressure pulses rather than sustained harmonic oscillations. While neural synthesis methods…
EvoPrompt: Guided Prompt Evolution for Vision-Language Models Adaptation
The adaptation of large-scale vision-language models (VLMs) to downstream tasks with limited labeled data remains a significant challenge.…
Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety
A safety score earned on a benchmark need not predict how the same model behaves once it is wrapped in an agentic scaffold the benchmark ne…
Quantum entanglement provides a competitive advantage in adversarial games
Whether uniquely quantum resources confer advantages in fully classical, competitive environments remains an open question. Competitive zer…
ContactExplorer: Contact Coverage-Guided Exploration for General-Purpose Dexterous Manipulation
Reinforcement learning has achieved remarkable success in domains such as Atari games, navigation, and locomotion, where exploration can of…
Revisiting Model Stitching In the Foundation Model Era
Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as…
Spatial Transcriptomics as Images for Large-Scale Pretraining
Spatial Transcriptomics (ST) profiles thousands of gene expression values at discrete spots with precise coordinates on tissue sections, pr…
FinTradeBench: A Financial Reasoning Benchmark for LLMs
Real-world financial decision-making is a challenging problem that requires reasoning over heterogeneous signals, including company fundame…
GenSpan: Generation-Calibrated Motion Span Priors for Multi-Verb Video Corpus Moment Retrieval
Video Corpus Moment Retrieval (VCMR) aims to retrieve both the correct video and its temporal segment corresponding to a natural-language q…
PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay
While Large Language Models (LLMs) are increasingly used as primary sources of information, their potential for political bias may impact t…
On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers
Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of v…
Emotion Entanglement and Bayesian Inference for Multi-Dimensional Emotion Understanding
Understanding emotions in natural language is inherently a multi-dimensional reasoning problem, where multiple affective signals interact t…
Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space
Multiple-choice questions (MCQs) are widely used to evaluate large language models (LLMs). However, LLMs remain vulnerable to the presence…
Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization
To encourage diverse exploration in reinforcement learning (RL) for large language models (LLMs) without compromising accuracy, we propose…
Contextual Multi-Task Reinforcement Learning for Autonomous Reef Monitoring
Although autonomous underwater vehicles promise the capability of marine ecosystem monitoring, their deployment is fundamentally limited by…
Generative Augmented Inference
Large language models enable inexpensive AI-generated annotations, but using them reliably for causal inference remains challenging. Naivel…
Luminol-AIDetect: Fast Zero-shot Machine-Generated Text Detection based on Perplexity under Text Shuffling
Machine-generated text (MGT) detection requires identifying structurally invariant signals across generation models, rather than relying on…
MAEPose: Self-Supervised Spatiotemporal Learning for Human Pose Estimation on mmWave Video
Millimetre-wave (mmWave) radar offers a more privacy-preserving alternative to RGB-based human pose estimation. However, existing methods t…
Stochastic Sparse Attention for Memory-Bound Inference
Autoregressive decoding becomes bandwidth-limited at long contexts, as generating each token requires reading all $n_k$ key and value vecto…
Efficiently Aligning Language Models with Online Natural Language Feedback
Reinforcement learning with verifiable rewards has been used to elicit impressive performance from language models in many domains. But, br…
A Systematic Investigation of RL-Jailbreaking in LLMs
The evolution of generative models from next-token predictors to autonomous engines of complex systems necessitates rigorous safety hardeni…
Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences
Recursive retraining of generative models poses a critical representation challenge: when synthetic outputs are curated based on a fixed re…
FactoryNet: A Large-Scale Dataset toward Industrial Time-Series Foundation Models
We introduce the first universal pretraining corpus for industrial time-series data: FactoryNet. 51M datapoints across 23k end-to-end task…
HEPA: A Self-Supervised Horizon-Conditioned Event Predictive Architecture for Time Series
Critical events in multivariate time series, from turbine failures to cardiac arrhythmias, demand accurate prediction, yet labeled data is…
Widening the Gap: Exploiting LLM Quantization via Outlier Injection
LLM quantization has become essential for memory-efficient deployment. Recent work has shown that quantization schemes can pose critical se…
Do LLMs Hold Their Values? MANTA: A Multi-Turn Adversarial Benchmark for Animal Welfare Reasoning
Evaluating animal welfare reasoning in LLMs remains an open challenge despite rapid deployment in consumer and professional contexts where…
Retrieval and competition: how a protein foundation model starts a protein
Protein language models are increasingly used to guide experimental and clinical decisions, yet it is often unclear whether a confident pre…
Position: State-of-the-Art Claims Require State-of-the-Art Evidence
State-of-the-Art (SOTA) claims pervade Artificial Intelligence (AI) and Machine Learning (ML) research. These claims rest on benchmark eval…
ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models
Large language models inevitably retain sensitive information, defined as inputs that may induce harmful generations, due to training on ma…
Markov Chain Decoders Overcome the Heavy-Tail Limitations of Lipschitz Generative Models
Heavy-tailed distributions are prevalent in performance evaluation, network traffic, and risk modeling. This behavior poses a fundamental c…
DEFLECT: Temporal Counterfactual Preference Learning for Delay-Robust Asynchronous VLAs
Vision-Language-Action (VLA) policies increasingly rely on asynchronous inference to hide large-model latency behind ongoing robot motion.…
Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models
Image-to-video models often generate videos that remain overly static, compared to text-to-video models. While prior approaches mitigate th…
REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak
While Large Language Models (LLMs) demonstrate remarkable capabilities, they remain susceptible to sophisticated, multi-step jailbreak atta…
Lost in Fog: Sensor Perturbations Expose Reasoning Fragility in Driving VLAs
Interpretable autonomous driving planners depend not only on generating explanations, but also on those explanations remaining reliable und…
Learning to Evaluate: Cost-Effective Model Evaluation on Unlabeled Data with Meta-Learning
The rapid advancement of machine learning has led to an unprecedented expansion of model ecosystems, making it increasingly difficult to as…
Correcting Visual Blur Induced by Attention Distraction to Reduce Hallucinations: Algorithm and Theory
Multimodal large language models (MLLMs) frequently suffer from object hallucinations, yet the visual perceptual mechanism underlying this…
Grimlock: Guarding High-Agency Systems with eBPF and Attested Channels
Agentic systems increasingly run user-authored orchestration code that invokes tools, spawns subtasks, and delegates work across machines a…
Aryabhata 2: Scaling Reinforcement Learning for Advanced STEM Reasoning
Competitive STEM examinations such as JEE and NEET require multi-step symbolic reasoning, precise numerical computation, and deep conceptua…
Structured Prompt Optimization Meets Reinforcement Learning for Global and Local Interpretability over Complex Text
LLMs have advanced text classification, yet existing paradigms face a trade-off: supervised (label only) fine-tuning is scalable but offers…
LoopFM: Learning frOm HistOrical RePresentations of Foundation Model for Recommendation
Knowledge distillation (KD) transfers a single scalar prediction from a large foundation model (FM) to compact vertical models (VMs), suffe…
Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation
Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, w…
Label Over Logic? How Source Cues Bias Human Fallacy Judgments More Than LLMs
As AI-generated and AI-assisted content floods online spaces, source labels attached to such content can distort human reasoning judgments,…
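Japanese government obtains access to the AI "Mythos" for use in strengthening cyber defense
MUFG Bank, Sumitomo Mitsui Banking Corporation, and Mizuho Bank are also believed to have obtained access.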
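Autodesk rolls out AI assistant features for its major products, plus an MCP for Fusion
Autodesk is offering a tech preview of "Autodesk Assistant" for its major products and has released an MCP for Fusion. Alongside an AI assistant that understands design data and business context, it also provides features that enable integration with external AI, with AI use in design and manufacturing work…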
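ServiceNow and Accenture launch "FDE" to deploy agentic AI company-wide
ServiceNow and Accenture have launched a new program. How will it address the problem of agentic AI adoption stalling at the proof-of-concept stage and failing to deliver company-wide results?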
Lovable signs multiyear deal with Google Cloud to up usage 5x, source says
Lovable and Google signed an expanded multiyear deal that involves a 5x expansion of Lovable's footprint on Google Cloud, and expanded acce…
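Advantech puts "WEDA" front and center, cutting edge AI model development time by 86%
At COMPUTEX TAIPEI 2026, Advantech demonstrated "WEDA," a solution that uses the company's hardware to manage edge AI in an integrated way, from development through deployment and operation.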
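"10,000 hours" saved in two years: how Sony's accounting team, where "not even a one-yen error is allowed," became a "try it first" DX group
"We want to advance accounting DX, but resistance on the ground is strong." "Even when we introduce tools, usage does not spread." Many companies face these problems. Because accounting departments demand accuracy and continuity, they have been regarded as an area where transformation is difficult. Sony Group's accounting department carried out more than 150 DX projects in about two years, with a cumulative…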
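Claude Opus 4.8 won't tell you what you want to hear: is being "too honest" a mixed blessing?
Beyond performance gains, a major feature of Claude Opus 4.8 is its improved "honesty." This article examines why an AI that does not defer to the user is dividing opinion, drawing on official information and the user's perspective.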
Alphabet’s record-breaking $85B raise for Google’s AI business is a helluva good signal
If Alphabet's record-breaking $85 billion stock sale signals investor appetite for AI-related offerings, we can see that investors are read…
Google’s Dreambeans, its weirdest-named AI tool to date, will turn your life into a cartoon
Dreambeans is a curated list of AI-illustrated "stories" culled from the personal data in your Google account.
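How does "pioneer Honda" view the humanoid robot boom? "There is some frustration, but…" We asked about its next move
Honda introduced "ASIMO" to the world in 2000. How does it view the current humanoid robot boom? We asked about the possibility of it re-entering humanoid robot development and about its current initiatives.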
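"Gemma 4 12B" arrives: a multimodal model that runs even on a laptop with 16GB of memory
Google has announced the open multimodal model "Gemma 4 12B." It adopts an encoder-free unified architecture and can run on a laptop with 16GB of memory. It reportedly delivers performance approaching that of higher-tier models.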
Amazon will show AI product images when you search for some reason
Amazon will use visual search and AI to show AI-generated product images that match your search queries. The retailer says it will help gui…
These two founders left Goldman and Meta to build voice AI for markets everyone else overlooked
The startup's own stack for Africa and Middle East is now handling more than 17,000 calls per day.
2026-06-03 (437 items)
Publishers will be able to opt out of AI Search, thanks to new regulation
U.K. regulators are requiring Google offer a tool allowing website publishers to opt-out of generative AI search features. The option will…
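Not "don't use AI" but "tell us if you use it": redesigning governance for the agent era
As AI agents are increasingly applied to business operations, organizational governance has not kept pace. Based on points raised by OWASP, this article explains two principles Japanese companies should follow and three actions they can start next week.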
Meta’s AI agent for WhatsApp Business is now available globally
WhatsApp will charge businesses for using its AI agent based on token usage.
Introducing new capabilities to GPT-Rosalind
GPT-Rosalind advances life sciences research with enhanced biological reasoning, medicinal chemistry expertise, genomics analysis, and expe…
Coralogix raises $200M on bet that someone needs to watch the AI agents
Coralogix is among a growing number of infrastructure firms betting that as AI systems move into production, demand will rise for tools tha…
How Wasmer used Codex to build a Node.js runtime for the edge
See how Wasmer used Codex with GPT-5.5 to build a Node.js runtime for the edge, accelerating development 10x to 20x and shipping in weeks i…
A blueprint for democratic governance of frontier AI
OpenAI outlines a blueprint for U.S. governance of frontier AI, proposing a federal framework for safety, resilience, and national security.
OpenAI public policy agenda
OpenAI outlines its public policy agenda for AI, including safety, youth protection, workforce transition, and global standards to ensure A…
Visual Graph Scaffolds for Structural Reasoning in Large Language Models
Graphs have been used to enhance large language models (LLMs) for structured reasoning, mostly as external knowledge sources are provided t…
AURA: Action-Gated Memory for Robot Policies at Constant VRAM
The KV-cache is the right memory for datacenters but the wrong memory for robots. Datacenter inference batches many short requests and rese…
Evaluating Transformer and LSTM Frameworks for Prediction in Ungauged Basins
Watershed networks exhibit convergent topologies in which multiple tributaries merge into downstream channels, integrating diverse upstream…
BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces
Many decision-support settings require systems that adapt to individual users, but evaluation data for this problem remain limited. Existin…
ChatHealthAI: Aligning Electronic Health Record Representations with Large Language Models for Grounded Clinical Reasoning
Large language models (LLMs) exhibit strong natural-language reasoning abilities for clinical decision support, but struggle to effectively…
Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection
Modeling patient trajectories from longitudinal electronic health records (EHRs) requires reasoning over sparse, noisy, and long-context mu…
An Exploration of Collision-based Enemy Morphology Generation
Despite a great deal of prior research into Procedural Content Generation (PCG), relatively little prior work has explored generating enemi…
Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models
Large Reasoning Models (LRMs) improve performance by generating explicit intermediate reasoning traces through increased test-time compute,…
Toward a Modular Architecture for Embedded AI Agent Systems at the Edge
The rise of Large Language Models (LLMs) has enabled agentic AI capable of complex reasoning and tool use; however, deploying such autonomy…
Don't Gamble, GAMBLe: An Analytical Framework for AI-Driven Research Systems
AI-Driven Research Systems (ADRS) -- systems coupling LLMs with automated evaluation to discover algorithms, proofs, and designs -- are bei…
When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning
When does multi-agent debate help data cleaning, and when does it hurt? Across three benchmarks, four model families, and over 6,000 task-c…
Handoff Debt: The Rediscovery Cost When Coding Agents Take Over Interrupted Tasks
Coding-agent benchmarks evaluate whether a single uninterrupted agent can resolve a repository issue. Real software work is messier: tasks…
Large AI Models in Dental Healthcare: From General-Purpose Systems to Domain-Specific Foundation Models
Background: Oral diseases affect nearly 3.5 billion people worldwide, yet the comparative clinical potential of large-scale AI models in de…
What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents
Benchmarks for autonomous agents measure whether agents complete tasks, yet this framing is systematically blind to whether an agent should…
WISE-HAR: A Generalizable Ensemble Deep Learning Framework for WiFi-Based Human Activity Recognition
Human Activity Recognition (HAR) using WiFi signals has emerged as a transformative technology for smart homes, healthcare monitoring, secu…
Inducing Reasoning Primitives from Agent Traces
ReAct-style LLM agents often rediscover the same reasoning routines across problems, yet leave those routines trapped in transient scratchp…
AUDITFLOW: Executable Symbolic Environments for Structured Financial Reporting Verification
Structured financial audit verification is difficult for language-model agents because correctness depends on structured evidence rather th…
TriEval: A Resource-Efficient Pipeline for LLM Bias, Toxicity, and Truthfulness Assessment
LLMs have evolved from basic chatbots to the backbone of the AI ecosystem, now widely used in healthcare, schools, and government services.…
RelGT-AC: A Relational Graph Transformer for Autocomplete Tasks in Relational Databases
Relational databases underpin modern enterprise, scientific, and healthcare systems, yet predictive machine learning on such data remains c…
ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents
Tool-augmented vision-language agents can acquire external perceptual evidence through OCR, detection, segmentation, and other tools, but e…
SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale
As LLM agents adopt large skill libraries, selecting the right subset becomes a structural problem rather than a similarity-matching one: s…
CORE: Conflict-Oriented Reasoning for General Multimodal Manipulation Detection
The rapid rise of generative AI has made multimodal fake news increasingly realistic and pervasive, posing severe threats to public trust a…
DELTAMEM: Incremental Experience Memory for LLM Agents via Residual Trees
Large Language Model (LLM)-based agents increasingly rely on memory to learn from experiences over continual interactions. However, storing…
The Shadow Price of Reasoning: Economic Perspective on Optimal Budget Allocation for LLMs
Inference-time scaling has emerged as a critical avenue for enhancing Large Language Models' performance, yet real-world deployment is cons…
Decomposing how prompting steers behavior
Prompting steers large language models (LLMs) and vision-language models (VLMs) without weight updates, but it remains unclear how instruct…
From Long News to Accurate Forecast: Importance-Aware Fusion and PRM-Guided Reflection for Time Series Forecasting
Incorporating news into time series forecasting is appealing because news can reveal abrupt exogenous events that historical values alone c…
DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration
Real-world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require huma…
EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning
Autonomous LLM training is often framed as recipe search, which leaves the training harness largely static. This limitation sharpens in age…
Uncertainty-Aware Clarification in LLM Agents with Information Gain
Large Language Model (LLM) agents often operate under underspecified user instructions, where latent uncertainty over user intent leads to…
Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation
LLM-based multi-agent simulation offers a promising way to study social interaction, deliberation, and collective opinion dynamics. However…
GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory
Large language models (LLMs) are increasingly used as self-study assistants in technical disciplines, yet their reliability as mathematical…
ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models
Large language models (LLMs) have been widely adopted in healthcare, yet they still encounter significant challenges in complex clinical de…
MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents
Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remain…
Effect of Demographic Bias on Skin Lesion Classification
In this study, we evaluate the performance of skin lesion classification using ResNet-based convolutional models, focusing on the impact of…
Perceive Before Reasoning: A Pre-Reasoning Perception Framework for Efficient and Reliable Proactive Mobile Agents
Multimodal large language models (MLLMs) have substantially advanced mobile agents, yet proactive mobile assistance remains challenging bec…
Solipsistic Superintelligence is Unlikely to be Cooperative
AI's central challenge is shifting from capability to coexistence. The dominant paradigm in AI research focuses on developing powerful agen…
Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection
In nature, events that affect some individuals or groups but not others constitute an implicit intervention and are known as natural experi…
Distilling Answer-Set Programming Rules from LLMs for Neurosymbolic Visual Question Answering
Visual Question Answering (VQA) is the task of answering questions about images, requiring the integration of multimodal input and reasonin…
A Negative Result on Cross-Model Activation Transfer in a Pythia Multi-Hop Setting
Recent work shows that language models can transmit behavioural traits through hidden signals in generated data during training. We ask whe…
LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks
Large Language Models (LLMs) exhibit strong informal mathematical reasoning but struggle to generate mechanically verifiable proofs in form…
The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection
Benchmark contamination, where evaluation examples appear in a model's training data, threatens the validity of LLM assessment. Statistical…
The Violation Situation Pattern: A Knowledge-Graph Pattern for Compliance Violations
Compliance pipelines detect violations as transient query results and do not keep the violation itself as a persistent graph object with re…
InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain
Long-context tasks require LLMs to identify and preserve answer-relevant information from large contexts. Chunk-wise memory agents address…
CP-Agent: Context-Aware Multimodal Reasoning for Cellular Morphological Profiling under Chemical Perturbations
Cell Painting combines multiplexed fluorescent staining, high-content imaging, and quantitative analysis to generate high-dimensional pheno…
What Makes Interaction Trajectories Effective for Training Terminal Agents?
Stronger code agents are commonly assumed to be superior teachers for post-training, yet this assumption remains poorly disentangled from t…
DMF: A Deterministic Memory Framework for Conversational AI Agents
Conversational AI agents require memory systems that are both scalable and semantically coherent across long interaction horizons. Existing…
StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems
LLM-based multi-agent systems exhibit remarkable collaborative capabilities in complex multi-step tasks. However, these systems are highly…
A formal definition and meta-model for a machine theory of mind
This paper proposes, for the first time, a rigorous formal definition of the concept of Machine Theory of Mind, based on principles support…
ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning
Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chain-of…
Overlaying Governance: A Compositional Authorization Framework for Delegation and Scope in Agentic AI
As AI systems evolve from passive models into autonomous active agents capable of initiating actions, collaborating, and delegating tasks,…
SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems
Self-improving language agents are typically evaluated in isolation: an agent attempts a task, receives feedback, and iteratively refines i…
From Prompt to Service: An SLM-Based Agent Orchestration Gateway for AI-Driven Virtual Worlds
As generative AI capabilities expand, AI-driven virtual worlds face a growing architectural challenge. Users interact through in-world inte…
Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing
AI-assisted coding agents are bottlenecked by input-token cost. Two pathologies of raw human input drive much of this overhead: tokenizatio…
Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models
Large Reasoning Models (LRMs) have demonstrated impressive capabilities in many tasks, yet they struggle with reliably following multiple i…
TSQAgent: Rating Time Series Data Quality via Dedicated Agentic Reasoning
Assessing the quality of time series (TS) data is fundamental yet inherently challenging due to the multifaceted nature of quality dimensio…
Gender-Dependent Diagnostic Substitution in LLM Medical Triage: Same Symptoms, Unequal Urgency
We investigate whether large language models produce different medical triage recommendations for identical neurological symptoms when only…
Towards Non-Monotonic Entailment in Propositional Defeasible Standpoint Logic
Recent work in defeasible reasoning has seen notions of preferential semantics and entailment in the style of Kraus et al. applied to modal…
Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition
Large language models for code generation often need to use APIs that are absent from their pretraining data. This requires more than recal…
From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models
Large language models are increasingly used as chemistry assistants, yet most chemistry benchmarks still score only final answers. This mas…
EvoDrive: Pareto Evolution for Safety-Critical Autonomous Driving via Self-Improving LLM Agents
Generating safety-critical scenarios is essential for validating and improving autonomous driving systems, yet it inherently requires maxim…
The DeepSpeak-Agentic Dataset
We present DeepSpeak-Agentic, a dataset of videos comprising over 37 hours of semi-structured conversations between a human and an embodied…
SkillPyramid: A Hierarchical Skill Consolidation Framework for Self-Evolving Agents
Recent AI agents can flexibly invoke skills to solve complex tasks, but their long-term improvement is fundamentally constrained by a lack…
Dynamic Objective Selection with Safeguards and LLM Oversight for Financial Decision-Making
Financial decision-making tasks such as stock recommendation and portfolio allocation typically estimate future return and risk and then se…
Code-on-Graph: Iterative Programmatic Reasoning via Large Language Models on Knowledge Graphs
Knowledge Graphs (KGs) are widely used to mitigate the limitations of Large Language Models (LLMs), such as outdated knowledge and hallucin…
Unveiling the Structure of Do-Calculus Reasoning via Derivation Graphs
The do-calculus defines a general system of inference for interventional queries, allowing causal quantities to be transformed through succ…
When to Re-Plan: Subgoal Persistence in Hierarchical Latent Reasoning
Long-horizon reasoning requires a system to commit to medium-horizon intent without becoming rigid: re-plan too often and computation never…
Proof-Refactor: Refactoring Generated Formal Proofs into Modular Artifacts
While Large Language Models (LLMs) have shown strong performance in generating formal proofs, their outputs often remain less readable, mod…
LAP: An Agent-to-Instrument Protocol for Autonomous Science
Autonomous science is moving from demonstration to infrastructure. Large language model agents now plan experiments, and self-driving labor…
From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework
AI losses that arise through an insured organization's generative or agentic AI system require state reconstruction, not merely event recon…
Enhancing Operational Safety via Agentic Dialogue Hazard Identification Analysis
Operational safety in high-stakes domains such as industrial process control, autonomous, and safety-critical systems demands reliable haza…
Leveraging BART to Assess CS1 C++ Programming Assignments using Rubric-based Criteria
This paper investigates rubric-aware, multitask fine-tuning of transformer models for automated grading of introductory C++ programming ass…
Calibrating Urban Traffic Simulation from Sparse Road Observations via Genetic Optimization
Urban traffic simulation is a critical tool for infrastructure planning, including the placement of electric vehicle charging stations. How…
BigFinanceBench: A Workflow-Grounded Benchmark for Financial-Research Agents
Financial-research answers are decision-relevant only when another analyst can audit how they were produced: which source was chosen, which…
EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management
Recent progress in Large Language Model (LLM) agents has enabled promising advances in automated data science. However, existing approaches…
PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models
Despite the pivotal role of numerical reasoning as the cornerstone of mathematical capabilities in large language models (LLMs) across appl…
Reasoning Structure of Large Language Models
Large reasoning models (LRMs) are often evaluated using metrics such as final-answer accuracy or token count. However, identical scores on…
scTranslation: A Comprehensive Benchmark for Single-Cell Multi-Omics Modality Translation
Simultaneous measurement of multiple omics modalities in single cells enables researchers to gain a more comprehensive understanding of cel…
Hedge-Bench: Benchmarking Agents on Hard, Realistic Tasks Pertaining to Financial Reasoning
AI agents can increasingly handle the mechanical tasks of financial analysis: retrieving documents, calculating formulas, updating spreadsh…
Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection
While token-level entropy is commonly recognized as effective for credit assignment in text-only reinforcement learning with verifiable rew…
Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models
Vision language models (VLMs) excel at many tasks but still struggle with spatial reasoning when critical information is not directly obser…
TRAP: Hijacking VLA CoT-Reasoning via Adversarial Patches
By integrating Chain-of-Thought (CoT) reasoning, Vision-Language-Action (VLA) models have demonstrated strong capabilities in robotic manip…
Cost-Aware Query Routing in RAG: Empirical Analysis of Retrieval Depth Tradeoffs
Retrieval-augmented generation (RAG) faces a fundamental three-way tension: deeper retrieval improves factual grounding but inflates token…
IdiomX: A Multilingual Benchmark for Idiom Understanding, Retrieval, and Interpretation
Idiomatic expressions remain a persistent challenge for natural language processing because their meanings are often non-compositional, con…
Lean-GAP: A Dataset of Formalized Graduate Algebra Problems
We present Lean-GAP (Lean-Graduate Algebra Problems), 430 formalized graduate-level algebra problems from the textbook Abstract Algebra by D…
Tracking Urban Atmospheric Pollutants using Sentinel-5P Satellite Data
Urban nitrogen dioxide ($NO_2$) is a key indicator of combustion-related air pollution and exhibits strong spatial and temporal variability…
Auditable Climate Risk Intelligence from Fragmented ESG Data: Deterministic Orchestration and Imbalance-Aware Learning for Scope 1-3 Validation
ESG and climate risk data remain fragmented across heterogeneous Scope 1, Scope 2, and Scope 3 reporting environments, while conventional v…
Cross-Modal Contrastive Learning of ECG and Angiography Representations for Severe Stenosis Classification
Coronary artery stenosis is a common cardiovascular disease, with severe, untreated cases posing significant risks of heart attack. Althoug…
ReLoRA: Knowledge-Reusing Adaptation for Fast Rollout of Evolving LLM Services
Large Language Models (LLMs) are increasingly deployed as continuously evolving services, where frequent base-model updates may invalidate…
Geometry-Aware Tabular Diffusion
Tabular synthesis is critical for privacy-preserving sharing and augmentation, yet diffusion models rely on implicit mechanisms to capture…
Building Better Activation Oracles
Activation Oracles (AOs) are promising methods for interpreting residual stream activations. However, current AOs face important issues, su…
Samudra 2: Scaling Ocean Emulators across Resolutions
Ocean general circulation models (OGCMs) are essential to climate science but computationally expensive, limiting ensemble size and forcing…
Margin Play: A Multi-Agent System For Public Policy Analysis In The Brazilian Equatorial Margin
The Brazilian Equatorial Margin (BEM) is Brazil's next offshore oil frontier, with operations expected to begin in 2026 in the Foz do Amazo…
FSA-GRPO: Teaching Auditory LLMs to Use Few-shot Demonstrations
Few-shot prompting provides an effective way to adapt auditory large language models to low-resource tasks such as children's speech recogn…
Closed-Loop Molecular Design with Calibrated Deference
We present Cognitive Loop via In-Situ Optimization (CLIO), an agent that couples a continuously-updated belief-state graph with a recursive…
Oscillatory State-Space Models as Inductive Biases for Physics-Informed Neural PDE Solvers
Solving time-dependent partial differential equations (PDEs) is an important problem in computational science and engineering. Physics-info…
TadA-Bench: A Million-Variant Benchmark for Future-Round Discovery Toward Agentic Protein Engineering
AI for scientific discovery is entering an agentic era, where protein-engineering systems are expected to prioritize future wet-lab experim…
DXA-Derived Skeletal Phenotypes and Hip Fracture Risk: A Backdoor-Adjusted Causal Analysis
Purpose: To compare dual-energy X-ray absorptiometry (DXA)-derived hip skeletal phenotypes in relation to hip fracture risk using prespecif…
Enhancing Protein-Protein Interaction Prediction with Hierarchical Motif-based Multimodal Protein Embedding
Protein-protein interactions (PPIs) are essential for many biological processes. However, existing PPI prediction approaches suffer from tw…
MultiTurnPSB: Evaluating Multi-Turn Jailbreak Attacks and Classifier-Based Defenses for Medical AI Safety
Patient-facing medical chatbots are commonly evaluated on single-turn prompts, yet real users push back after refusals, add urgency, and in…
Wavelet as Tokenizer: Preliminary Results on a Shared Wavelet Token Schema for Natural Signals
This paper studies whether audio, images, and video can share a common wavelet token schema rather than relying on separate modality-specif…
Position: Prioritize Identifying Structure, Not Complex Models, for Scientific Discovery
Modern Machine Learning (ML) and Artificial Intelligence (AI) models, especially large language models (LLMs), are increasingly used to gen…
Echo-POSED: Geometric Self-Distillation for Echocardiography Guidance
We introduce Echo-POSED, a self-supervised framework for real-time transthoracic echocardiography (TTE) guidance that recommends probe adju…
Too Much of a Good Thing: When sim2real Efforts Impede Policy Learning (And What to Do About It)
While sim2real efforts are necessary for effective policy transfer to hardware, there is such a thing as too much of a good thing. We argue…
SegTune: Structured and Fine-Grained Control for Song Generation
Recent advances in neural song generation have enabled high-quality synthesis from lyrics and global textual prompts. However, most systems…
Sparse-View Lung Nodule Volumetry from Digitally Reconstructed Radiographs via AReT: Anatomy-Regularized TensoRF
We identify and resolve a previously unreported failure mode in TensoRF when applied to X-ray attenuation fields: the default density shift…
D-Judge: Disrupting Multi-Turn Jailbreaks using Semantics-Preserving Output Rewriting
Multi-turn jailbreak attacks pose a growing threat to large language model (LLM) safety because they exploit feedback from auxiliary judge…
CARVE: Certified Affordable Repair of Vetoed Maneuvers via Envelopes for Interactive Driving
Interactive driving exposes a failure mode that is easy to miss in rule-aware autonomous-driving stacks: a hard-rule margin can be negative…
SVHalluc: Benchmarking Speech-Vision Hallucination in Audio-Visual Large Language Models
Despite the success of audio-visual large-language models (LLMs), they can produce plausible but ungrounded outputs, termed hallucination.…
Inference Cost Attacks for Retrieval-Augmented Large Language Models
Retrieval-Augmented Generation (RAG)-enhanced LLM systems, while powerful, introduce substantial inference costs due to the inclusion of an…
A New Framework for Cybersecurity Refusals in AI Agents
Agentic scaffolds have dramatically improved LLM performance on complex, long-horizon tasks, yielding both broad benefits and amplified ris…
Target Updates May Stabilize Linear Q-Learning: Periodic and Soft Dynamics
Periodic target updates in Q-learning and soft target updates in actor-critic methods are empirically well established stabilization mechan…
The Ringelmann Effect in Multi-Agent LLM Systems: A Scaling Law for Effective Team Size
Inference-time multi-agent LLM scaling lacks a shared unit: counting nominal agents conflates cost with independent evidence. We derive a t…
CL-DMDF: Dynamic Multimodal Data Fusion Model Based on Contrastive Learning
Multimodal data fusion involves integrating and analyzing information from multiple modalities to uncover latent correlations and complemen…
Learning to Refine: Spectral-Decoupled Iterative Refinement Framework for Precipitation Nowcasting
Accurate precipitation nowcasting is vital for disaster mitigation, but deep learning methods face a key trade-off: regression models produ…
Improvise, Adapt, Overcome: An On-The-Fly Multifidelity Algorithm for Efficient Machine Learning
Machine learning has accelerated quantum chemistry but is hindered by the prohibitive cost of generating high fidelity training data. Multi…
AdaWeather: Adaptively Mixing Probabilistic Weather Forecasts with Logarithmic Regret
Recent advances in machine learning have produced probabilistic weather forecasting models comparable to state-of-the-art numerical weather…
Anomalies in Multivariate Time Series Benchmarks Are Mostly Univariate
Many recent multivariate time series anomaly detection (MT-SAD) models incorporate cross-channel modeling, under the implicit assumption th…
Aligning Data-Driven Predictors with Allocation: A Decision-Focused Approach to Survival Analysis
Machine learning predictors have become essential tools for guiding automated decision making. However, a major misalignment persists: pred…
Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation
On-Policy distillation (OPD) in large language models is shifting from full-trace KL supervision toward more selective training paradigms.…
AVTrack: Audio-Visual Tracking in Human-centric Complex Scenes
Audio-visual speaker tracking aims to localize and track active speakers by leveraging auditory and visual cues, enabling fine-grained, hum…
See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs
Generalization remains a central bottleneck for vision-language-action (VLA) models: under distractors, appearance shifts, and semantically…
Attention Calibration for Position-Fair Dense Information Retrieval
Dense retrieval models exhibit positional bias: retrieval effectiveness degrades when relevant information appears later in a passage (Zeng…
EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement
Audio tokenizers serve as the discrete interface between continuous audio and Audio Language Models (ALMs), but existing tokenizers often s…
Plan2Map: A Multimodal Benchmark for Document-Grounded Geospatial Boundary Reconstruction from Planning Records
Planning records define restrictions over geographic areas, but their source documents often provide only indirect spatial evidence rather…
MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data
Video world models are a foundational generative technology for embodied AI and the Metaverse, yet existing approaches are inherently limit…
Acceptance-Test-Driven Evaluation Protocols for Business-Centric LLM Systems
Large language model (LLM) applications are increasingly expected to satisfy deterministic institutional requirements while relying on prob…
Representational Capacity: Geometric Limits on Feature Representation in Transformer Language Models
Model dimension ($d_{model}$) is a fundamental hyperparameter in transformer language models, yet its role in setting the geometric limits…
CRAM-ER: Error-Resilient Spintronic Computational Random Access Memory for Scalable In-Memory Computation
Deep neural networks (DNNs) have achieved state-of-the-art performance across diverse domains. However, typical Von Neumann compute paradig…
Cosmos 3: Omnimodal World Models for Physical AI
We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and actio…
Do Neural Retrievers Prefer Certain Documents? Evidence of Learned Relevance Priors
Neural retrievers are trained to estimate query-document relevance from annotated query-document pairs. Yet annotation protocols may not pu…
Which Defense Closes Which Threat? Attributing OWASP-LLM-Top-10 Coverage and Its Brittleness Under Paraphrasing
Production LLM applications stack several defense families -- refusal-phrase filters, token-budget controls, model allowlists, rate limits,…
Large Byte Model: Teaching Language Models About Compiled Code
Malware analysis starts with the raw bytes of an executable program, and tools to "lift" these to higher-level representations, such as ass…
Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling
Accurate translation from Natural Language to First-Order Logic (NL-to-FOL) underpins neurosymbolic AI systems and Natural Language Inferen…
GRZO: Group-Relative Zeroth-Order Optimization for Large Language Model Fine-Tuning
Zeroth-order (ZO) optimization is a memory-efficient alternative to backpropagation for fine-tuning large language models, but its deployme…
Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions
How can a population of agents self-orchestrate and self-adapt into stronger collective intelligence without centralized control? Inspired…
Forgetting is Not Erasure: Recovering Latent Knowledge via Transport Keys
Catastrophic forgetting is often framed as a representational problem: after sequential training, a model appears to lose the features that…
The Epi-LLM Framework: probing LLM behavioral priors through epidemiological agent-based models
Human behaviour during epidemics affects infectious disease dynamics, but quantifying this remains deeply challenging. Here we introduce th…
Adaptive Latent Agentic Reasoning
Large reasoning models improve performance by generating extended chain-of-thought (CoT) reasoning, but this behavior becomes inefficient w…
LLM-Assisted Reranking to Operationalize Nuanced Objectives in Recommender Systems
Recommender systems have grown from content-organization tools into sophisticated systems that shape daily behavior. By controlling what we…
Are we really tilting? The mechanics of reward guidance in flow and diffusion models
Reward guidance algorithms steer a learned generative process toward the reward-tilted measure at inference time. While empirically powerfu…
Scalable Uncertainty Quantification for Extreme Weather Forecasting via Empirical Neural Tangent Kernels
Deep learning weather models now match numerical weather prediction accuracy while running orders of magnitude faster, but produce determin…
Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States
Linear probing of large language model (LLM) hidden states is widely used to claim that models learn distinct representations for different…
WRIT: Write-Read Intensive Trajectory Synthesis for Multi-Turn User-Facing Agents
Multi-turn user-facing agents must infer user intent from incomplete requests, collect missing information through dialogue and tools, and…
SCOPE: Real-Time Natural Language Camera Agent at the Edge
Deploying language-driven agents in robotics requires evaluations that reflect real-world task demands: natural-language instructions with…
Fast-dLLM++: Fréchet Profile Decoding for Faster Diffusion LLM Inference
Diffusion large language models promise parallel token generation, yet inference remains bottlenecked by deciding which masked tokens can b…
Echelon: Auditable Aggregate-Only Language-Model Adaptation Across Privacy Boundaries
Cross-organization language-model adaptation increasingly faces hard governance constraints: in many deployments, device-level model state-…
Hand Trajectory Fusion for Egocentric Natural Language Query Grounding
Egocentric Natural Language Query (NLQ) grounding asks a model to localize, in a long first-person video, the temporal interval that answer…
Glass Box at Orbit: A Constitutional AI Verification Framework for Trustworthy Autonomous CubeSat Intelligence
The space industry is quietly building toward something nobody has fully reckoned with: orbital data centers running thousands of autonomou…
Towards Compact Autonomous Driving Perception with Balanced Learning and Multi-sensor Fusion
We present a novel compact deep multi-task learning model to handle various autonomous driving perception tasks in one forward pass. The mo…
Pretraining Language Models on Historical Text
We introduce TypewriterLM, a 7.24B History language model (LM) trained exclusively on English text predating 1913. Developing History LMs r…
Patcher: Post-Hoc Patching of Backdoored Large Language Models
Large language models remain vulnerable to jailbreak backdoor attacks, where adversaries poison safety alignment data to embed hidden trigg…
How Quantization Changes Interpretable Features: A Sparse Autoencoder Analysis of Language Models
Quantization is a standard path to deploying large language models, and a quantized model is typically judged acceptable when its perplexit…
Exact equivariance, kept through training, buys zero-shot generalisation across the symmetry group
A latent world model built from an equivariant encoder $E$ and an equivariant predictor $f$ inherits a provable symmetry of its training lo…
MUSE: A Unified Agentic Harness for MLLMs
Despite rapid progress, multimodal large language models (MLLMs) still fail on tasks that humans solve effortlessly, such as navigating a g…
ConTraIRL: Factorized Contrastive Abstractions for Transferable IRL
Reward transfer in Inverse Reinforcement Learning (IRL) is unreliable when policies must generalize to unseen combinations of environment d…
Reproducibility is the New Copyleft: Defining AGI-oriented Reproducible Builds
Copyleft, as implemented in licenses such as the GNU General Public License, was a legal hack that used copyright to guarantee user freedom…
Hallucinations as Orthogonal Noise: Inference-Time Manifold Alignment via Dynamic Contextual Orthogonalization
Hallucination in Large Language Models (LLMs), characterized by the generation of content inconsistent with contextual facts or logical con…
Spike-Aware C++ INT8 Inference for Sparse Spiking Language Models on Commodity CPUs
Spiking language models expose activation sparsity that dense Transformer runtimes do not directly exploit. This paper studies that propert…
Conditional Hypothesis Generation for LLM-Based Text Analysis with Researcher-Specified Covariates
A core goal of computational social science is to discover interpretable differences in how language varies across outcomes of interest, su…
Capability Advertisement as a Market for Lemons: A Trust Layer for Heterogeneous Agent Networks
Large language model (LLM) agents have begun to delegate work to one another. Protocols such as the Model Context Protocol (MCP) and the Ag…
Rethinking Molecular Text Representations for LLMs: An Empirical Study
Large language models (LLMs) are increasingly used for molecular tasks, but it remains unclear which molecular representation to use. We pr…
Brief Announcement: Generative Markov Model for Distributed Computing Systems
Emerging distributed computing paradigms, such as the computing continuum, are inherently heterogeneous, stochastic, and complex. Efficient…
Learn When and Where to Connect: Adaptive Virtual Nodes for Dynamic Message Passing on Graphs
While Virtual Nodes (VNs) are often utilized in Message Passing Neural Networks (MPNNs) to facilitate effective message passing, existing V…
ROBUST-WT: Robust Uncertainty-aware Segmentation Transform via Whitening and Training Enhancements
Generalized segmentation of medical images prevents performance degradation when different imaging devices and clinical protocols are used…
ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information
Asynchronous reinforcement learning can improve language-model post-training throughput by decoupling response generation from policy optim…
Efficient Hyperparameter Optimization for LLM Reinforcement Learning
Reinforcement learning (RL) for large language models (LLMs) is highly sensitive to hyperparameter configurations, making hyperparameter op…
Libra: Efficient Resource Management for Agentic RL Post-Training
Reinforcement learning (RL) has become a standard post-training paradigm for large language models (LLMs), extending beyond preference alig…
Regret Pre-training: Bridging Prior and Posterior Views for Enhanced Knowledge Grounding
Causal language models factorize sequence probabilities using only preceding context, leaving future information unexploited during trainin…
Constitutional On-Policy Safe Distillation
On-policy self-distillation (OPSD) has emerged as an efficient post-training paradigm by using a teacher conditioned on privileged informat…
"**Important** You should give me full credits!": Exploring Prompt Injection Attacks on LLM-Based Automatic Grading Systems
The emergence of large language models (LLMs) has significantly accelerated recent research on LLM-based automatic grading (AG) systems. Be…
BAHSD: Bridging the Long-tail Gap via Adaptive Distillation in Black-box Sequential Recommendation
Sequential recommendation systems are widely adopted but often deployed as black-box APIs, which has driven recent interest in model extrac…
PhotoCraft: Agentic Reasoning with Hierarchical Self-Evolving Memory for Deep Image Search
Deep Image Search requires multi-step reasoning over rich contextual cues, such as time, location, and event relations. However, most exist…
AnyAudio-Judge: A Dynamic Rubric-Based Benchmark and Evaluator for Audio Instruction Following
The rapid advancement of instruction-guided audio generation has highlighted the critical need for robust alignment evaluation. Current aut…
GuidedBridge: Training-freely Improving Bridge Models with Prior Guidance
Guidance methods, such as classifier-free guidance (CFG) and auto-guidance (AG), have advanced noise-to-data generation in diffusion models…
Decoupled Smart Contract Audits: Lightweight LLM Framework via Distillation and Aggregation
Smart contracts face critical security challenges that require thorough auditing in decentralized web services. While Large Language Models…
NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation
As autonomous vehicle capabilities advance, the safe evaluation of driving policies in long-tail scenarios remains a critical bottleneck. I…
OpenAgenet/OAN: Open Infrastructure for Trusted Agent Interconnection
OpenAgenet, abbreviated as OAN, is an open infrastructure project for trusted Agent interconnection. It addresses a problem that becomes vi…
OpenAgenet/OAN: Technical Architecture for Trust-Governed Agent Identity and Discovery
This paper describes the technical architecture of OpenAgenet / OAN. OAN is a protocol-neutral trust layer for open Agent interconnection.…
Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models
The language used by digital chat assistants such as ChatGPT can diverge from human expectations (misalignment). Research, mostly on Scient…
AI Rater Discrimination Depends on Scoring Protocol in Complex Clinical Decision-Making
Clinical AI evaluation increasingly delegates scoring to large language models (LLMs) acting as AI raters, yet their scoring behavior acros…
Reinforcement Learning from Cross-domain Videos with Video Prediction Model
Reinforcement learning from expert videos across visually distinct domains is challenging due to the absence of reward signals and the pres…
WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts
Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and…
BotDirector: Robot Storytelling Across the Symmetrical Reality with Multi-modal Interactions
Robot storytelling offers a unique blend of technological innovation and creative expression that engages children in unprecedented ways. H…
GFFMERGE: Efficient Merging of Graph Neural Force Fields and Beyond
Graph Neural Networks (GNNs) have revolutionized Neural Force Fields for atomistic simulations, achieving near-quantum accuracy at reduced…
When RLHF Fails: A Mechanistic Taxonomy of Reward Hacking, Collapse, and Evaluator Gaming
Reinforcement learning from human feedback (RLHF) makes large-scale post-training possible by replacing an underspecified human objective w…
AirDreamer: Generalist Drone Navigation with World Models
Navigating a drone in unseen and cluttered environments requires reliable generalization to unseen scene layouts and understanding of envir…
PSViT: A Methodology for Structurally Pruning Spiking Vision Transformers
Spiking Vision Transformer (SViT) models are promising low-power ViT models for solving vision-based tasks with state-of-the-art performanc…
EqGINO: Equivariant Geometry-Informed Fourier Neural Operators for 3D PDEs
Deep learning surrogates for 3D Partial Differential Equations (PDEs) often fail to generalize across geometric transformations because the…
Are Common Substructures Transferable? Riemannian Graph Foundation Model with Neural Vector Bundles
Foundation models have sparked a revolution via a pretraining-adaptation paradigm, with recent efforts extending this success to graphs. Un…
VistaHop: Benchmarking Multi-hop Visual Reasoning for Visual DeepSearch
Visual DeepSearch requires multimodal large reasoning model (MLRM) agents to answer complex visual queries by repeatedly inspecting image r…
AI-Generated Traces for Novice Programmers: Learning Effects and Learner Differences in a Multi-Institutional Study
Introductory programming (CS1) courses often struggle to support students' understanding of program execution. While visualizations can mak…
Message Tuning Outshines Graph Prompt Tuning: A Prismatic Space Perspective
Graph Foundation Models (GFMs), built upon the Pre-training and Adaptation paradigm, have emerged as a research hotspot in graph learning.…
Generalizing Graph Foundation Models via Hyperbolic Retrieval-Augmented Generation
Graph foundation models (GFMs) emerged as a dominant paradigm in graph representation learning by leveraging large-scale pre-training for c…
Learning Multi-Scale Hypergraph for High-Order Brain Connectivity Analysis
Understanding complex interactions between brain regions is critical for early neurodegenerative disease classification such as Alzheimer's…
RobotValues: Evaluating Household Robots When Human Values Conflict
While household robots are often evaluated based on task completion, everyday domestic environments involve value-conflicting situations in…
Multi-Modal Graph Neural Network with Transformer-Guided Adaptive Diffusion for Preclinical Alzheimer Classification
The graphical representation of the brain offers critical insights into diagnosing and prognosing neurodegenerative disease via relationshi…
dstack-capsule: Pod-Level Remote Attestation for Confidential Workloads on Kubernetes
The rise of LLM-as-a-Service and other confidential cloud workloads demands cryptographic proof that user data is processed in a trusted, u…
Calibration Data Trade-offs Across Capability Dimensions: Why Multi-Source Mixing Matters for High-Sparsity LLM Pruning
Post-training pruning compresses large language models to high sparsity using a small unlabelled calibration set, and recent work has concl…
FLIPS: Instance-Fingerprinting for LLMs via Pseudo-random Sequences
Literature reveals that a Large Language Model's (LLM) behavior is not only conditioned by its original weights but also its instance-level…
Evaluating LLMs' Effectiveness on Real-World Consumer Device Repair Questions
Consumer device repair is an important but underexplored testbed for large language models (LLMs). Repair tasks require reasoning over inco…
AugMask: Training Diffusion Models on Incomplete Tabular Data via Stochastic Augmentation and Masking
Score-based diffusion models have emerged as prominent deep generative models; however, their application to tabular data remains challengi…
SynCred-Bench: Benchmarking Synthetic Credibility in AI-Generated Visual Misinformation
Recent generative models can now produce visual artifacts with realistic embedded text and layouts, creating a new misinformation threat: s…
The Unsampled Truth: Psychometrics in SLMs Measure Prompt Artifacts, Not Psychological Constructs
When prompting SLMs for psychometric assessments, researchers assume the outputs reflect semantic reasoning. We evaluate this premise acros…
P²-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization
Hallucination has recently garnered significant research attention in Large Vision-Language Models (LVLMs). Direct Preference Optimization…
AI Model Extraction Attacks: Bypassing Single-Client Assumptions in Defenses
Ensuring the protection of Artificial Intelligence (AI) models deployed in military Command and Control (C2) systems and critical infrastru…
Local Guidance, Global Impact: Gaussian-Reshaped Trust Region Unlocks Behavior Transitions
While Proximal Policy Optimization (PPO) demonstrates strong performance in stationary settings, we show that its standard optimization par…
Grasp-Then-Plan with Failure Attribution: A Closed Two-Stage Framework for Precise and Generalizable Robotic Manipulation
In robotic manipulation, the tight coupling between grasping and motion planning often obscures the true source of failure, leading to inef…
When Model Merging Breaks Routing: Training-Free Calibration for MoE
Model merging has emerged as a cost-effective approach for consolidating the capabilities of multiple LLMs without retraining. However, exi…
Causal Evidence of Stack Representations in Modeling Counter Languages Using Transformers
Formal languages have proven to be effective conduits to understand the inner mechanisms of transformers. Past work has shown that transfor…
Optimizing Explicit Unit-Distance Lower-Bound Certificates
The 2026 disproof of Erdős's unit-distance conjecture and Sawin's subsequent explicit quantitative refinement show that the maximum num…
PrimeSVT: An Automated Memory-aware Pruning Framework with Prioritized Compression Policy for Spiking Vision Transformers
The large sizes of Spiking Vision Transformers (SViTs) still hinder their embedded implementation, highlighting the need for model compress…
FlowGuard: Flow Matching for Identity-Independent Detection of Data-Free Model Stealing Attacks on Energy System Intrusion Detection Systems
Artificial Intelligence (AI)-based Intrusion Detection Systems (IDS) deployed in energy infrastructure are vulnerable to model theft attack…
A Hybrid Approach For Malware Classification Using Secondary Features Fusion
The number of malware (either variant or novel) is rapidly increasing, making malware detection and mitigation a complex problem. One appro…
PRISM: Synergizing Vision Foundation Models via Self-organized Expert Specialization
Unifying the complementary strengths of diverse Vision Foundation Models (VFMs) into a single efficient model is highly desirable but chall…
FORGE: Multi-Agent Graduated Exploitation and Detection Engineering
Vulnerability disclosure volumes now far exceed organizational assessment capacity, yet three adjacent research communities (proof-of-conce…
Tonal parsimony in chord-sequence analysis: combining modulation cost and tonal vocabulary
We study the assignment of local tonalities to chord sequences, a task useful for harmonic analysis, composition, and jazz-oriented improvi…
Rethinking the Role of Tensor Decompositions in Post-Training LLM Compression
Post-training compression is essential for deploying large language models (LLMs) under tight resource constraints. Tensor decompositions h…
Analyzing Stream Collapse in Hyper-Connections: From Diagnosis to Mitigation
Hyper-Connections (HC) replace the single Transformer residual stream with multiple streams, introducing a permutation symmetry over stream…
NeuroArmor: Safe-Variant-Guided Representation Consistency for Selective Re-Anchoring in Jailbreak Defense
Large language models remain vulnerable to jailbreak attacks that hide harmful intent behind seemingly ordinary requests such as role-play,…
Learn from Your Mistakes: Tree-like Self-Play for Secure Code LLMs
While Large Language Models (LLMs) excel in code generation, they remain prone to replicating subtle yet critical vulnerabilities endemic t…
BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language
We present BaltiVoice, a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language spoken in Gilgit-Baltistan, Pakistan,…
SPADE: Sketch-guided Path Planning Augmented with Diffusion Experts
Path planning is essential for Autonomous Mobile Robots (AMRs). Conventional methods for incorporating human preferences into planning typi…
Scalable On-Hardware Training of Quantum Neural Networks and Application to Clinical Data Imputation
Training quantum neural networks (QNNs) on quantum hardware is currently bottlenecked by the cost of gradient estimation: standard paramete…
Post-Hoc Robustness for Model-Based Reinforcement Learning
To improve the real-world applicability of reinforcement learning (RL), the field of adversarially robust RL studies how to train agents un…
High-Precision APT Malware Attribution with Out-of-Scope Resilience
Early attribution of Advanced Persistent Threat (APT) activity can help defenders prioritise investigation, select countermeasures, and red…
When Should the Teacher Move? Temporal Coupling and Stability in Self On-Policy Distillation
Self on-policy distillation trains a student policy against a teacher derived from its own parameter history, yet the teacher's update sche…
CR-Seg: Attention-Guided and CoT-Enhanced Coarse-to-Refined Reasoning Segmentation
Reasoning segmentation aims to segment target objects described by complex language through joint visual-textual reasoning. Existing method…
Efficient Transformer-Based Localized Patch Sampling for Choroid Plexus Segmentation in Multiple Sclerosis
Background: The lateral ventricle choroid plexus (LVCP) is gaining recognition as a key imaging biomarker for multiple sclerosis (MS) relat…
Learned Non-Maximum Suppression for 3D Object Detection
Post-processing is a critical stage in LiDAR-based 3D object detection, where dense and overlapping proposals must be filtered for compact…
When Attention Collapses: Stage-Aware Visual Token Pruning from Structure to Semantics
Vision-Language Models (VLMs) have demonstrated remarkable capabilities but suffer from significant computational overhead during inference…
PHASER: Phase-Aware and Semantic Experience Replay for Vision-Language-Action Models
Vision-Language-Action (VLA) models have achieved remarkable success in language-conditioned robotic manipulation. However, deploying these…
DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair
While safety alignment and guardrails help large language models (LLMs) avoid harmful outputs, they can also induce overrefusal, i.e., unwa…
CauTion: Knowing When to Trust LLMs for Ensemble Causal Discovery
Causal discovery from observational data remains challenging due to the fundamental limitations of purely statistical methods, such as stat…
Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks
Large language models achieve strong performance on arithmetic reasoning benchmarks, and one common response to arithmetic brittleness is t…
Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification
Test-time reinforcement learning has emerged as a promising paradigm for enhancing the complex reasoning abilities of large language models…
Physics-Guided Policy Optimization with Self-Distillation
Self-distilled policy optimization (SDPO) has become a popular paradigm for LLM post-training, where a model learns from its own prediction…
TurtleAI: Benchmarking Multimodal Models for Visual Programming in Turtle Graphics
Vision-language models (VLMs) have been explored for visual programming, where they generate code to solve visual tasks. However, most prio…
Building Reliable Long-Form Generation via Hallucination Rejection Sampling
Large language models (LLMs) have achieved remarkable progress in open-ended text generation, yet they remain prone to hallucinating incorr…
AnchorMoE: Interpretable Time Series Classification via Anchor-Routed MoE
Multivariate time series classification (MTSC) is pivotal in high-stakes domains, such as clinical diagnosis and industrial fault detection…
VidMsg: A Benchmark for Implicit Message Inference in Short Videos
Understanding short online videos involves more than identifying visible objects and actions; video makers often include an underlying mess…
The Shape of Addition: Geometric Structures of Arithmetic in Large Language Models
Large Language Models exhibit paradoxical fragility in fundamental arithmetic, implying a disconnect between internal computation and discr…
Black-box, Adaptive, Efficient, Transferable, Harmful, Applicable... Attacks Are All You Need to Break LLMs
Accurately evaluating adversarial robustness is a longstanding challenge. A flawed attack design can inflate robustness estimates, making d…
Safety Measurements for Fine-tuned LLMs Should be Grounded in Capability
Adapting foundation large language models to a user's task or preferred style through fine-tuning can result in compromising the model's sa…
CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks
Choosing or ranking language models for a specific application is hardest when no task-specific labeled data exists, and standard public be…
AUGUSTE: Online-Learning dApp for Predictive URLLC Scheduling
Ultra Reliable and Low Latency Communications (URLLC) was one of the main motivations behind 5G, with 3GPP advertising 1-10 ms latency targ…
A Close Look At World Model Recovery In Supervised Fine-Tuned LLM Planners
Supervised fine-tuning (SFT) improves end-to-end classical planning in large language models (LLMs), but do these models also learn to repr…
Staying Alive: Uncensored Survival Analysis with Tabular Foundation Models
Survival Analysis (SA) is a statistical framework that models the time span until some event of interest occurs. Widely used in several dom…
Qwen-Image-Flash: Beyond Objective Design
Few-step distillation has become an effective strategy for accelerating advanced visual generative models, yet prior work has largely focus…
Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models
Real-time vision demands models that are accurate, efficient, and simple to deploy across diverse hardware. The YOLO family has become wide…
Tool-Aware Optimization with Entropy Guidance for Efficient Agentic Reinforcement Learning
Agentic reinforcement learning (RL) equips large language models (LLMs) with tool-use capabilities that substantially improve reasoning on…
Merit or networks? What decides where research is published
Does scientific publishing reward the quality of ideas or the advantage of connections? The question is universal to prestige-driven scienc…
E2LLM: Towards Efficient LLM Serving in Heterogeneous Edge/Fog Environments
Large Language Models (LLMs) have become integral to modern applications, yet their deployment remains challenging. Beyond executing the mo…
Signed Spiking Neuron Enabled by an Orthogonal-Easy-Axis Magnetic Tunnel Junction
Signed spiking neurons carry richer information than standard spiking neurons. This work proposes a compact magnetic tunnel junction (MTJ)-…
Trading Human Curation for Synthetic Augmentation in RLVR
The supply of high-quality training tasks is a central bottleneck for reinforcement learning from verifiable rewards (RLVR) on agentic lang…
LiveBand: Live Accompaniment Generation in the Audio Domain
We present LiveBand, a real-time system that generates high-fidelity music accompaniments to live audio input, respecting strict causal con…
PURGE: Projected Unlearning via Retain-Guided Erasure
We propose PURGE, a machine unlearning algorithm built on a simple but under-exploited observation: continual learning (CL) and machine…
Consistency Training Can Entrench Misalignment
Consistency training encourages a model to produce similar outputs across related inputs or sampling procedures. Such methods are simple, s…
AI Agents Enable Adaptive Computer Worms
A computer worm is malware that spreads on a network by replicating itself from one machine to another. Traditional worms, like WannaCry, e…
Conditional Latent Diffusion Model with Fourier-based Motion Modelling for Virtual Population Synthesis
In-silico trials of medical devices require the generation of virtual populations of anatomies. In cardiovascular applications, virtual ana…
Re-Evaluating Continual Learning with Few-Shot Adaptation
Continual learning methods aim to maximize the stability and plasticity of machine learning models that are trained on a sequence of tasks.…
Clustered Self-Assessment: A Simple yet Effective Method for Uncertainty Quantification in Large Language Models
Large language models (LLMs) demonstrate remarkable performance across diverse tasks, but they often generate responses that appear plausib…
FLARE: Fine-Grained Diagnostic Feedback for LLM Code Refinement
Large language models often generate code with bugs. Existing methods rely on feedback signals such as test failures and self-critiques to…
Taiji: Pareto Optimal Policy Optimization with Semantics-IDs Trade-off for Industrial LLM-Enhanced Recommendation
Scaling recommender systems via large language models (LLMs) has become a prominent trend in the industry. However, aligning the LLM's sema…
A Training-Free Mixture-of-Agents Framework for Multi-Document Summarization using LLMs and Knowledge Graphs
Multi-Document Summarization (MDS) plays a critical role in distilling essential information from collections of textual data. Existing app…
From 'What' to 'How' and 'Why': Sharing LLM-Generated Retrospective Summaries of Older Adults' Passive Tracking Data with Remote Family Members
With the growing prevalence of modern ubiquitous computing technologies, multi-modal tracking systems hold promise for providing timely awa…
Beyond Encoder Accumulation: Measuring Encoder Roles in Multi-Encoder VLMs
As foundation models scale toward fusing more heterogeneous visual streams, understanding how diverse encoders interact under joint trainin…
Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments
Training LLMs to orchestrate multi-step tool calls is held back by three coupled obstacles: realistic stateful execution environments are c…
Agent libOS: A Library-OS-Inspired Runtime for Long-Running, Capability-Controlled LLM Agents
Large language model (LLM) agents are evolving from request-response assistants into long-running software actors: they maintain state acro…
The Impact of Configuring Agentic AI Coding Tools on Build-vs-Buy Decisions: A Study Protocol
Agentic AI coding tools write code with increasing autonomy and in doing so decide when to import a library and when to implement functiona…
NetKV: Network-Aware Decode Instance Selection for Disaggregated LLM Inference
Disaggregated LLM inference forces the KV cache to traverse the datacenter network before decoding begins, so transfer time enters directly…
FFR: Forward-Forward Learning for Regression
The Forward-Forward (FF) algorithm offers a computationally efficient and biologically plausible alternative to backpropagation (BP) by tra…
q0: Primitives for Hyper-Epoch Pretraining
Multi-epoch training is becoming the standard now that compute is growing faster than the supply of high-quality text. But pretraining a si…
FlashbackCL: Mitigating Temporal Forgetting in Federated Learning
Federated Learning (FL) of foundation and edge models increasingly targets deployments where client data distributions drift over time, yet…
Efficient ASR Training with Conversations that Never Happened
Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data.…
Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning
Classical reinforcement learning (RL) typically seeks a deterministic policy that maximizes the expected sum of a scalar reward. Yet, moder…
Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation
Deep reinforcement learning has shown strong potential for enabling autonomous robots to learn complex navigational tasks. However, its pra…
Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning
Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and o…
AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task
We describe AlignAtt4LLM, an IWSLT 2026 simultaneous speech translation system for English to German, Italian, and Chinese. The system is a…
QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards
Rubric-based RL is a promising route for extending reinforcement learning beyond verifiable rewards, yet existing methods optimize rubrics…
Quantifying Faithful Confidence Expression in Large Reasoning Models
Reliable uncertainty communication is critical to the trustworthiness of LLMs, yet faithful calibration (FC)--the alignment between models'…
Formalizing the Binding Problem
Representations of the world, arguably, contain information about features (e.g. something is blue, something is a circle) but also informa…
Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories
The past few decades have witnessed significant advances in the design of machine learning algorithms, from early studies on task-specific…
Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking
We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. U…
Planning with Uncertainty: Symmetries, Policy Inference, and Solution Compression
Fully-observable non-deterministic (FOND) planning is at the core of artificial intelligence planning with uncertainty. It models uncertain…
Approximating Probabilistic Inference in Statistical EL with Knowledge Graph Embeddings
Statistical information is ubiquitous but drawing valid conclusions from it is prohibitively hard. We explain how knowledge graph embedding…
Leave it to the Specialist: Repair Sparse LLMs with Sparse Fine-Tuning via Sparsity Evolution
Sparse large language models (LLMs) offer an attractive direction toward efficient deployment, but adapting them to downstream tasks remain…
Assistax: A Multi-Agent Hardware-Accelerated Reinforcement Learning Benchmark for Assistive Robotics
The development of reinforcement learning (RL) algorithms has been largely driven by ambitious challenge tasks and benchmarks. Games have d…
AlphaEval: A Comprehensive and Efficient Evaluation Framework for Formula Alpha Mining
Formula alpha mining, which generates predictive signals from financial data, is critical for quantitative investment. Although various alg…
Collab-REC: An LLM-based Agentic Framework for Balancing Recommendations in Tourism
We propose COLLAB-REC, a multi-agent framework designed to counteract popularity bias and improve diversity in tourism recommendations. In…
DTKG: Dual-Track Knowledge Graph-Verified Reasoning Framework for Multi-Hop QA
Multi-hop reasoning for question answering (QA) plays a critical role in retrieval-augmented generation (RAG) for modern large language mod…
RGMem: Renormalization Group-inspired Memory Evolution for Language Agents
Personalized and continuous interactions are critical for LLM-based conversational agents, yet finite context windows and static parametric…
ProtocolBench: Which LLM MultiAgent Protocol to Choose?
As large-scale multi-agent systems evolve, the communication protocol layer has become a critical yet under-evaluated factor shaping perfor…
Human-Like Goalkeeping in a Realistic Football Simulation: a Sample-Efficient Reinforcement Learning Approach
While several high profile video games have served as testbeds for Deep Reinforcement Learning (DRL), this technique has rarely been employ…
MemVerse: Multimodal Memory for Lifelong Learning Agents
Despite rapid progress in large-scale language and vision models, AI agents still suffer from a fundamental limitation: they cannot remembe…
MIND: Multi-rationale INtegrated Discriminative Reasoning Framework for Multi-modal Large Models
Recently, multimodal large language models (MLLMs) have been widely applied to reasoning tasks. However, they suffer from limited multi-rat…
FutureWeaver: Planning Test-Time Compute for Multi-Agent Systems with Modularized Collaboration
Scaling test-time computation has been shown to significantly improve large language model (LLM) performance without additional training. H…
The Agent's First Day: Benchmarking Learning, Exploration, and Scheduling in the Workplace Scenarios
The rapid evolution of Multi-modal Large Language Models (MLLMs) has advanced workflow automation; however, existing research mainly target…
A Scoping Review of the Ethical Perspectives on Anthropomorphising Large Language Model-Based Conversational Agents
Anthropomorphisation -- the phenomenon whereby non-human entities are ascribed human-like qualities -- has become increasingly salient with…
Strongly Polynomial Time Complexity of Policy Iteration for $L_\infty$ Robust MDPs
Markov decision processes (MDPs) are a fundamental model in sequential decision making. Robust MDPs (RMDPs) extend this framework by allowi…
PieArena: Ranking and Profiling Language Agents in Realistic Negotiation Scenarios
We present an in-depth evaluation of LLMs' ability to negotiate, a central business task requiring strategic reasoning, theory of mind, and…
Who Deserves the Reward? SHARP: Shapley Credit-based Optimization for Multi-Agent System
Integrating Large Language Models (LLMs) with external tools via multi-agent systems offers a promising new paradigm for decomposing and so…
When Should LLMs Be Less Specific? Selective Abstraction for Reliable Long-Form Text Generation
LLMs are widely used, yet they remain prone to factual errors that erode user trust and limit adoption in high-risk settings. One approach…
Towards a Science of AI Agent Reliability
AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress,…
VeRO: A Harness for Agents to Optimize Agents
An important emerging application of coding agents is agent harness optimization: the iterative improvement of a target agent by editing an…
X-RAY: Mapping LLM Reasoning Capability via Formalized and Calibrated Probes
Large language models (LLMs) achieve promising performance, yet their ability to reason remains poorly understood. Existing evaluations lar…
Evaluating Relational Reasoning in LLMs with REL
Relational reasoning is the ability to infer relations that jointly bind multiple entities, attributes, or variables. This ability is centr…
Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization
Automating operations research (OR) with large language models (LLMs) remains limited by hand-crafted reasoning-execution workflows. Compl…
From Context to Skills: Can Language Models Learn from Context Skillfully?
Many real-world tasks require language models (LMs) to reason over complex contexts that exceed their parametric knowledge. This calls for…
Efficient Temporal Datalog Materialisation for Composite Event Recognition
Several applications demand the timely detection of critical situations, such as threats to safety and transparency, over high-velocity str…
AdapShot: Adaptive Many-Shot In-Context Learning with Semantic-Aware KV Cache Reuse
Many-Shot In-Context Learning (ICL) has emerged as a promising paradigm, leveraging extensive examples to unlock the reasoning potential of…
Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
Standard embodied evaluations do not independently score whether an agent correctly commits to task completion at episode closure, a capaci…
From Holo Pockets to Electron Density: GPT-style Drug Design with Density
Recent advances in generative modeling have enabled significant progress in structure-based drug design (SBDD). Existing methods typically…
PnP-Corrector: A Universal Correction Framework for Coupled Spatiotemporal Forecasting
Coupled spatiotemporal forecasting is important for predicting the future evolution of multiple interacting dynamical systems, such as in c…
Assessing and Mitigating Miscalibration in LLM-Based Social Science Measurement
Large language models (LLMs) are increasingly used in social science as scalable measurement tools for converting unstructured text into va…
Automatic Layer Selection for Hallucination Detection
Recent studies on hallucination detection have shown that hallucination-related signals are more strongly encoded in intermediate layers th…
PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft
We present PEAM, a Parametric Embodied Agent Memory framework in Minecraft that transforms agent memory from inference-time retrieval into…
A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks
As agent capabilities advance, existing benchmarks, such as $\tau^2$-Bench, are becoming increasingly saturated. Yet constructing new bench…
Toward AI That Understands Self and Others: A World-Model Theory of Cognitive Diversity and Alignment
Modern societies possess more information than ever before, yet they do not converge toward a single shared understanding. The same events,…
Typhoon: Towards an Effective Task-Specific Masking Strategy for Pre-trained Language Models
The choice of which tokens to mask is a central, under-examined design decision in masked language modeling (MLM). Standard pretrain…
PINNfluence: Interpreting PINNs through Influence Functions
Physics-informed neural networks (PINNs) have emerged as a powerful deep learning approach for solving partial differential equations (PDEs…
Building Trust in Black-box Optimization: A Comprehensive Framework for Explainability
Optimizing costly black-box functions within a constrained evaluation budget presents significant challenges in many real-world application…
Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model Enhancement
Vision-Language Models (VLMs) bring powerful understanding and reasoning capabilities to multimodal tasks. Meanwhile, the great need for ca…
ASAP: Exploiting the Satisficing Generalization Edge in Neural Combinatorial Optimization
Deep Reinforcement Learning (DRL) has emerged as a promising approach for solving Combinatorial Optimization (CO) problems, such as the 3D…
Greed is Good: A Unifying Perspective on Guided Generation
Training-free guided generation is a widely used and powerful technique that allows the end user to exert further control over the generati…
Rex: A Family of Reversible Exponential (Stochastic) Runge-Kutta Solvers
Deep generative models based on neural differential equations have become state-of-the-art for many generation tasks. These models rely on…
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and eva…
Scaling Multi Agent Reinforcement Learning for Underwater Acoustic Tracking via Autonomous Vehicles
Autonomous vehicles (AVs) offer a cost-effective solution for scientific missions such as underwater tracking. Reinforcement learning (RL)…
FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs
Efficient inference of Multi-Head Latent Attention (MLA) is challenged by deploying the DeepSeek-R1 671B model on a single Multi-GPU server…
Do Explanations Increase the Risk of Decision Logic Leakage? Explanation-Guided Stealing of Graph Models
Graph Neural Networks (GNNs) have become essential tools for analyzing graph-structured data in domains such as drug discovery and financia…
dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching
Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form o…
Curriculum-Adapted Robust Reinforcement Learning for UAV Deconfliction in Adversarial Environments
Autonomous unmanned aerial vehicles (UAVs) increasingly rely on reinforcement learning (RL) for navigation. However, global navigation sate…
Multiple Choice Learning of Low-Rank Adapters for Language Modeling
We propose LoRA-MCL, a training scheme that extends next-token prediction in language models with a method designed to decode diverse, plau…
CoMPAS3D: A Dataset and Benchmark for Interactive Motion
Socially interactive humanoid robots must engage with humans through their bodies, adapting in real time to a partner's movement, intent, a…
UR$^2$: Unify RAG and Reasoning through Reinforcement Learning
Large Language Models (LLMs) have shown strong capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG) for k…
Non-Identical Diffusion Models in MIMO-OFDM Channel Generation
We propose a novel diffusion model, termed the non-identical diffusion model, and investigate its application to wireless orthogonal freque…
TalkPlayData 2: An Agentic Synthetic Data Pipeline for Multimodal Conversational Music Recommendation
We present TalkPlayData 2, a synthetic dataset for multimodal conversational music recommendation generated by an agentic data pipeline. In…
Wavelet Fourier Diffuser: Frequency-Aware Diffusion Model for Reinforcement Learning
Diffusion probability models have shown significant promise in offline reinforcement learning by directly modeling trajectory sequences. Ho…
Learning the Neighborhood: Contrast-Free Multimodal Self-Supervised Molecular Graph Pretraining
High-quality molecular representations are essential for property prediction and molecular design, yet large labeled datasets remain scarce…
DeMuon: A Decentralized Muon for Matrix Optimization over Graphs
In this paper, we propose DeMuon, a method for decentralized matrix optimization over a given communication topology. DeMuon incorporates m…
ReaLM: Residual Quantization Bridging Knowledge Graph Embeddings and Large Language Models
Large Language Models (LLMs) have recently emerged as a powerful paradigm for Knowledge Graph Completion (KGC), offering strong reasoning a…
Semantic knowledge guides innovation and drives cultural evolution
Cultural evolution allows ideas and technologies to accumulate across generations, reaching their most complex and open-ended form in human…
Generating the Modal Worker: A Cross-Model Audit of Race and Gender in LLM-Generated Personas Across 41 Occupations
As generative AI tools are increasingly used to portray people in professional roles, understanding their racial and gender representationa…
Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning
We study learning multi-task, multi-agent policies for cooperative, temporal objectives, under centralized training, decentralized executio…
A Robust and Explainable Transformer-Based Framework for Phishing Email Detection
Phishing and related cyber threats are becoming increasingly sophisticated, with email-based phishing remaining the most persistent attack…
PHASE: Physiology-Aware Hyperspectral Reconstruction via Object-to-Human Domain Adaptation
Although hyperspectral imaging offers unparalleled non-invasive physiological insight, its bulky hardware, slow acquisition, and regulatory…
Finding Kissing Numbers with Game-theoretic Reinforcement Learning
Since Isaac Newton first studied the Kissing Number Problem in 1694, determining the maximal number of non-overlapping spheres around a cen…
SeSE: Black-Box Uncertainty Quantification for Large Language Models Based on Structural Information Theory
Reliable uncertainty quantification (UQ) is essential for deploying large language models (LLMs) in safety-critical scenarios, as it enable…
Identifying Quantum Structure in AI Language: Evidence for Evolutionary Convergence of Human and Artificial Cognition
We present the results of cognitive tests on conceptual combinations, performed using specific Large Language Models (LLMs) as test subject…
Distribution-Calibrated Inference Time Compute for Thinking LLM-as-a-Judge
Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level, and common aggregati…
Toward Training Superintelligent Software Agents through Self-Play SWE-RL
While current software agents powered by large language models (LLMs) and agentic reinforcement learning (RL) can boost programmer producti…
Edge-Aware and Content-Adaptive Infrared Gas Leak Detection for Industrial Safety Monitoring
Infrared gas leak detection is important for industrial safety and environmental monitoring, but automatic detection remains challenging be…
Introduction to optimization methods for training SciML models
Optimization is central to both modern machine learning (ML) and scientific machine learning (SciML), yet the structure of the underlying o…
Relational Linearity is a Predictor of Hallucinations
Hallucination is a central failure mode of language models (LMs). We focus on hallucinations in response to questions like: "Which instrume…
Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction
Transformer architectures deliver state-of-the-art accuracy via dense full-attention, but their quadratic time and memory complexity with r…
Aletheia: What Makes RLVR For Code Verifiers Tick?
Multi-domain thinking verifiers trained via Reinforcement Learning with Verifiable Rewards (RLVR) are a cornerstone of modern post-training…
Plan, Verify and Fill: A Structured Parallel Decoding Approach for Diffusion Language Models
Diffusion Language Models (DLMs) present a promising non-sequential paradigm for text generation, distinct from standard autoregressive (AR…
$\mathbb{R}^{2k}$ is Theoretically Large Enough for Embedding-based Top-$k$ Retrieval
This paper studies the Minimal Embeddable Dimension (MED): the least dimension in which there exists a configuration of $m$ object vectors…
Causal Preference Elicitation
We propose causal preference elicitation, a Bayesian framework for expert-in-the-loop causal discovery that actively queries local edge rel…
Phantom Transfer: Data Poisoning can Survive Data-Level Defences
We present a data poisoning attack -- Phantom Transfer -- with the property that, even if you know precisely how the poison was placed into…
Coupled Local and Global World Models for Efficient First Order RL
World models offer a promising avenue for more faithfully capturing complex dynamics, including contacts and non-rigidity, as well as compl…
InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning
Large reasoning models achieve strong performance by scaling inference-time chain-of-thought, but this paradigm suffers from quadratic cost…
LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning
Current chemical large language models (LLMs) predominantly rely on explicit Chain-of-Thought (CoT) to solve complex reasoning problems. Ho…
PAND: Prompt-Aware Neighborhood Distillation for Lightweight Fine-Grained Visual Classification
Distilling knowledge from large Vision-Language Models (VLMs) into lightweight networks is crucial yet challenging in Fine-Grained Visual C…
Whose Name Comes Up? II: Benchmarking and Intervention-Based Auditing of LLM-Based Scholar Recommendation
Large language models (LLMs) are now used for academic expert recommendation. Existing audits typically evaluate such recommendations in is…
Physics-informed diffusion models in spectral space
We propose physics-informed spectral diffusion (PISD), a methodology that combines generative latent diffusion models with physics-informed…
Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs
Self-interpretation methods prompt language models to describe their own internal states, but remain unreliable due to hyperparameter sensi…
Test-Time Optimization of Physical Query Plans with LLMs
Traditional query optimization relies on cost-based optimizers that estimate execution cost (e.g., runtime, memory, and I/O) using predefin…
Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward
The transition from monolithic language models to modular, skill-equipped agents marks a defining shift in how large language models (LLMs)…
Whom to Query for What: Adaptive Group Elicitation via Multi-Turn LLM Interactions
Eliciting information to reduce uncertainty about latent group-level properties from surveys and other collective assessments requires allo…
Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression
Sub-bit model compression targets storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed…
TimeOmni-VL: Unified Models for Time Series Understanding and Generation
Recent time series modeling faces a sharp divide between numerical generation and semantic understanding, with research showing that genera…
CodeHacker: Automated Test Case Generation for Detecting Vulnerabilities in Competitive Programming Solutions
The evaluation of Large Language Models (LLMs) for code generation relies heavily on the quality and robustness of test cases. However, exi…
KnapSpec: Self-Speculative Decoding via Adaptive Layer Selection as a Knapsack Problem
Self-speculative decoding (SSD) accelerates LLM inference by skipping layers to create an efficient draft model, yet existing methods often…
Causal Neural Probabilistic Circuits
Concept Bottleneck Models (CBMs) enhance the interpretability of end-to-end neural networks by introducing a layer of concepts and predicti…
vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models
As large language models (LLMs) diversify across modalities, capabilities, and cost profiles, the problem of intelligent request routing --…
Ref-DGS: Reflective Dual Gaussian Splatting
The reflective appearance, especially strong and typically near-field specular reflections, poses a fundamental challenge for accurate surf…
VulnAgent-R2: Evidence-Calibrated Multi-Agent Auditing for Repository-Level Vulnerability Detection
Software vulnerabilities often depend on cross-file data flow, build options, framework conventions, and runtime guards, so isolated functi…
Measuring Weak-to-Strong Legibility of Reasoning Models
Reasoning language models (RLMs) and the intermediate chains of thought they emit play an increasingly central role in multi-agent setups s…
SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model
While automated sleep staging has achieved expert-level accuracy, its clinical adoption is hindered by a lack of auditable reasoning. We in…
Crystal: Characterizing Relative Impact of Scholarly Publications
Assessing a cited paper's impact is typically done by analyzing its citation context in isolation within the citing paper. While this focus…
Finetuning-Free Diffusion Model with Adaptive Constraint Guidance for Inorganic Crystal Structure Generation
The discovery of inorganic crystal structures with targeted properties is a significant challenge in materials science. Generative models,…
Dynamics of Cognitive Heterogeneity: Investigating Behavioral Biases in Multi-Stage Supply Chains with LLM-Based Simulation
Modeling coordination among generative agents in complex multi-round decision-making presents a core challenge for AI and operations manage…
Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale
The Platonic Representation Hypothesis suggests that neural networks trained on different modalities (e.g., text and images) align and even…
$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token pre…
Quantifying and Mitigating Self-Preference Bias of LLM Judges
LLM-as-a-Judge has become a dominant approach in automated evaluation systems, playing critical roles in model alignment, leaderboard const…
ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation
Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape…
How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance
In generative modeling, we often wish to produce samples that maximize a user-specified reward such as aesthetic quality or alignment with…
SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents
LLM agents increasingly rely on reusable skills (e.g., $SKILL.md$ ) to execute complex tasks, yet these artifacts lack portability: agent f…
Narrow Secret Loyalty Dodges Black-Box Audits
Recent work identifies secret loyalties as a distinct threat from standard backdoors. A secret loyalty causes a model to covertly advance t…
Mechanism Design Is Not Enough: Prosocial Agents for Cooperative AI
Ensuring that AI agents behave safely and beneficially when interacting with other parties has emerged as one of the central challenges of…
Towards Robust Sequential Decomposition for Complex Image Editing
Recent advances in visual generative models have enabled high-fidelity image editing guided by human instructions. However, these models of…
Exact Stiefel Optimization for Probabilistic PLS: Closed-Form Updates, Error Bounds, and Calibrated Uncertainty
Probabilistic partial least squares (PPLS) is a central likelihood-based model for two-view learning when one needs both interpretable late…
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
Evaluation of software engineering (SWE) agents is dominated by a binary signal: whether the final patch passes the tests. This outcome-onl…
X-Restormer++: 1st Place Solution for the UG2+ CVPR 2026 All-Weather Restoration Challenge
In this work, we present our winning solution for the 8th UG2+ Challenge (CVPR 2026) Track 1: Image Restoration under All-weather Condition…
Misspecified Estimate-then-Optimize Leads to Supra-Competitive Prices
We study whether simple algorithmic pricing systems can systematically produce collusive-like prices in multi-firm markets. We consider fir…
Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exh…
Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models
In recent years, multimodal large language models (MLLMs) have achieved remarkable progress, primarily attributed to effective paradigms fo…
Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but dec…
Latent Laplace Diffusion for Irregular Multivariate Time Series
Irregular multivariate time series impose a trade-off for long-horizon forecasting: discrete methods can distort temporal structure via re-…
Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor
MXFP4 arithmetic can dramatically accelerate reinforcement learning (RL) post-training of large language models (LLMs), yet the quantizatio…
TASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design
Text-to-image models now generate graphic design at production scale, yet their supervision still comes primarily from photo-style preferen…
FRED: A Multi-Modal Autonomous Driving Dataset for Flooded Road Environments
The Flooded Road Environments Dataset (FRED) is, to our knowledge, the first multi-modal autonomous driving dataset specifically targeting…
Decomposing and Measuring Evaluation Awareness
Frontier language models sometimes recognize that they are being evaluated and adjust their behavior, undermining validity of benchmark res…
MX-SAFE: Versatile Inference- and Training-Proof Microscaling Format with On-the-Fly Exponent and Mantissa Bit Allocation
As the demand for deep learning grows, cost reduction through quantization has become essential for both training and inference. In 2022, t…
JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment
Two methodologies dominate current practices of benchmarking: rubric-based scoring evaluates items against predefined criteria, whereas com…
Anatomy-Anchored Self-Supervision: Distilling Vision Foundation Models for Invariant Ultrasound Representation
Self-supervised pre-training paradigm has gained increasing prominence for learning transferable representations in medical imaging, yet ex…
Fine-Tuning and Serving Gemma 4 31B on Google Cloud TPU: A Technical Comparison with GPU Baselines
We present the first end-to-end demonstration of fine-tuning and serving Google's Gemma 4 31B model on TPU hardware, providing an empirical…
SL-BiLEM: Structured Learnable Behavior-in-the-Loop Epidemic Modeling for Forecasting and Policy Evaluation
Epidemic forecasting faces a fundamental challenge: human behavior dynamically responds to disease spread, creating feedback loops that ind…
QuITE: Query-Based Irregular Time Series Embedding
Irregular Multivariate Time Series (IMTS) are common in practice, yet their irregular sampling complicates effective modeling. Existing app…
Microsoft announces seven in-house AI models, including image editing and speech recognition
Microsoft has announced "Microsoft AI Models," a set of seven AI models developed in-house.
Microsoft announces "Microsoft Execution Containers," a customizable isolated environment for AI agents; OpenClaw also runs in it
Microsoft has announced "Microsoft Execution Containers" (MXC), a customizable isolated environment for AI agents.
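President Trump signs executive order on AI security, allowing the government to inspect frontier models 30 days before release
U.S. President Trump has signed an executive order on promoting innovation and security in advanced AI. In addition to strengthening cyber defense through the Department of War and CISA, it establishes a voluntary framework under which the government verifies major companies' frontier AI models before release. The government denies any blanket oversight, and while it preserves the private sector's freedom to develop, security…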
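Siemens announces "Simcenter PhysicsAI" to accelerate CFD design exploration with AI
Siemens has announced "Simcenter PhysicsAI," AI-based software for design space exploration, as a new capability of "Simcenter." It builds AI surrogate models from CFD simulation results and can evaluate thousands of design variations in a short time. What previously took several days…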
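What share of employees hand "login credentials" to shadow AI? An Okta survey reveals the reality
According to one survey, 95% of executives are confident that "employees use AI responsibly," yet more than half of employees use shadow AI. Moreover, a certain number of those shadow AI users engage in "risky usage" that could lead to information leaks.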
Cyera eyes $12B valuation at 80x ARR multiple despite operating losses
The cybersecurity company is nearing a $300 million round led by Evolution Equity Partners.
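Microsoft announces "Solara," an Android-based AI agent platform, and shows a Snapdragon-powered badge-style device
At "Build 2026," Microsoft announced "Project Solara," a new platform specialized for running AI agents. Its OS is AOSP-based rather than Windows. A device resembling an employee ID badge, co-developed with Qualcomm, and, jointly with MediaTek…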
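Chip shortage driven by AI demand will "continue for a while": how is PC maker Dell responding?
The semiconductor shortage driven by AI demand will "continue for a while," predicts PC maker Dell Technologies. How will the company get through this difficult period?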
NVIDIA's "RTX Spark" and laptops powered by it gather at MediaTek's booth at COMPUTEX TAIPEI
At "COMPUTEX TAIPEI 2026," MediaTek showcased "NVIDIA RTX Spark," the AI superchip announced by NVIDIA, along with Windows laptops from various manufacturers that use the chip.
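[Data analysis with Python] A gentle introduction to Bayesian statistical thinking: a beginner's installment that makes the overall flow clear even for first-timers
The fifth installment of the "Easy Data Analysis" (『やさしいデータ分析』) series, which steps up from the basics to applications, covers Bayesian statistics. This time, it uses Bayesian methods to estimate and test the parameter of a binomial probability.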
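Microsoft shows "Surface RTX Spark Dev Box," an AI-focused mini PC powered by an NVIDIA SoC
At "Build 2026," Microsoft announced the "Surface RTX Spark Dev Box," an AI-focused desktop PC. With NVIDIA's "RTX Spark," up to 1 petaflop of compute performance, and 128GB of memory, models with more than 120 billion parameters…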
Uber caps employee AI spending after blowing through budget in 4 months
Uber's cutback has occurred after the company had reportedly encouraged staff to use AI as much as possible.
New Microsoft tool lets devs spin up AI behavior tests using text descriptions
Microsoft on Tuesday took the wraps off Adaptive Spec-driven Scoring for Evaluation and Regression Testing, an open source framework for sp…
Martin Scorsese becomes the latest — and most unlikely — Hollywood voice for AI
The caveat is that one of the world's most famous living directors is using the tech solely for storyboarding.
Microsoft launches Scout, an OpenClaw-inspired personal assistant
Launched at Build, Microsoft Scout is a new AI assistant meant to bring the power and flexibility of OpenClaw into the Microsoft 365 system.
Google rolls out fake call detection to protect against AI deepfake impersonation scams
As people increasingly refuse to answer calls from unknown numbers, scammers are shifting their tactics by spoofing trusted phone numbers a…
Microsoft offers devs a better way to control AI agent behavior
The specification lets developer, compliance, and security teams define their own policies for agents to follow in portable policy files.
Amazon faces class action lawsuit over Ring facial-recognition feature
The class action lawsuit, filed in Seattle by Virginia resident Charles Sigwalt, claims that Ring's Familiar Faces feature stores images of…
Trump signs narrower executive order on AI oversight after industry objections
After industry objections, President Trump signed a revised AI executive order requiring only voluntary prerelease government reviews of ad…
OpenAI launches new Codex tools for white-collar work
OpenAI released a set of six plug-ins aimed at specific jobs: data analytics, creative production, sales, product design, equity investing,…
2026-06-02 (963 items)
Anthropic scales Claude Mythos to critical infrastructure in 15+ countries
Anthropic is expanding Project Glasswing, its security vulnerability program, and access to Mythos to 150 organizations across 15 countries…
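Microsoft announces "MAI-Thinking-1," its first in-house reasoning model, trained from scratch without distillation
At "Build 2026," Microsoft announced a new lineup of models for its in-house AI, "MAI." The core model, its first reasoning model "MAI-Thinking-1," has 35 billion parameters and was trained on clean data without distillation from other models. It shows strong performance on par with competing models, and its proprietary chip…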
ZeroDrift raises $10M to protect AI models from themselves
A new AI compliance service sits between AI models and end users to flag and replace any messages that might present a compliance problem.
Rocket engine startup Impulse raises $500 million to hire people, not AI
Engineering physical systems still depends on human talent, according to Impulse Space president Eric Romo.
Travelers deploys AI-powered claims countrywide with OpenAI
Travelers built an AI-powered Claim Assistant with OpenAI to guide customers through filing claims, provide 24/7 support, and scale operati…
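Microsoft announces autonomous agent "Scout," built on OpenClaw with MCP support
At "Build 2026," Microsoft announced "Autopilots," a new category of autonomous AI agents, and its first entry, "Microsoft Scout." Scout is built on the "OpenClaw" foundation and runs continuously in the background, and "Microsoft 365"…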
Codex for every role, tool, and workflow
Discover new Codex plugins, sites, and annotations that help analysts, marketers, designers, investors, and other teams get more done with…
Advancing youth safety and opportunity through global leadership
OpenAI calls for global action on youth AI safety, proposing an international institute to strengthen safeguards, standards, and opportunit…
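Anthropic expands access to AI model "Mythos"; 150 more organizations to gain access
Anthropic has announced that it will expand its cybersecurity project "Project Glasswing" and grant access to its AI model "Claude Mythos Preview" to about 150 additional organizations.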
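"Vibe Seisho" (バイブ清書, roughly "vibe fair-copying") tackles vibe coding's tendency to stall at the prototype stage
The spread of vibe coding has made in-house software development more accessible. On the other hand, some companies struggle to ensure quality and security when moving from a prototype to production use. "Vibe Seisho" focuses on that problem and aims to solve it.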
Position Paper: Post-Solve Robustness in Decision Engines: Feasible Regions and Smoothness Under Perturbations
Mixed-Integer Linear Programming (MILP) decision engines routinely output nominally optimal plans for high-stakes industrial systems. Yet d…
Emergent Collaborative Deliberation in Multi-Model AI Systems: A BFT-Derived Protocol for Epistemic Synthesis
We present the Consilium Protocol, a Byzantine Fault Tolerance-derived architecture for structured multi-model AI deliberation that treats…
Deliberative Curation: A Protocol for Multi-Agent Knowledge Bases
As AI agents transition from isolated tools to collaborative participants in shared knowledge ecosystems, governing collective knowledge cu…
Agents on a Tree: Pathwise Coordination for Multi-Objective Molecular Optimization
Multi-objective molecular optimization requires searching vast chemical spaces under conflicting objectives, where early design decisions s…
Optimal Transport-based Permutation-Invariant Bayesian Optimization of Offshore Wind Farm Layouts
Bayesian Optimization (BO) is widely and successfully adopted for solving optimization problems having an expensive-to-evaluate, black-box,…
MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution
Training language model agents for multi-agent strategic interaction presents a core difficulty: the quality of any action may depend on fu…
Universal Quantum Transformer
Classical continuous-space neural networks fundamentally struggle to lock into exact mathematical symmetries, such as modular arithmetic an…
Grokers: Bottom-Up Inductive Comprehension and Write-Time Intelligence over Typed Knowledge Graphs
We present Grokers, an architecture for building persistent, structured comprehension of typed knowledge graphs through bottom-up inductive…
Product-Aware Deep Autoencoders for Robust Process Monitoring in Multi-Product Cyber-Physical Systems
As Industry 4.0 accelerates the integration of Cyber-Physical Systems (CPS) in manufacturing, robust anomaly detection has become critical…
On the evolution of the concept of probability as a mirror of the evolution of reason
Over the centuries, probability theory has grown from the calculus of games of chance into a central framework for reasoning under uncertai…
Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games
We introduce a multi-turn interactive framework for reasoning evaluation that treats reasoning as active evidence acquisition and belief up…
A Multi-AI-agent Framework Enabling End-to-end Finite Element Analysis for Solid Mechanics Problems
Finite element analysis (FEA) is the most important numerical approach for solid mechanics. Challenges of FEA include a steep learning curv…
CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO
Reinforcement learning with verifiable rewards (RLVR), especially Group Relative Policy Optimization (GRPO), has been widely used to improv…
TIGER: Traceable Inference with Graph-Based Evidence Routing for Mitigating Hallucinations in Multimodal Generation
We study fact-level repair for multimodal generation, where a fluent output may contain specific facts that are not supported by the input.…
MindZero: Learning Online Mental Reasoning With Zero Annotations
Effective real-world assistance requires AI agents with robust Theory of Mind (ToM): inferring human mental states from their behavior. Des…
Geodesic Flow Matching for Denoising High-Dimensional Structured Representations
Vector Symbolic Algebras (VSAs) enable robust neurosymbolic reasoning by encoding symbolic information into high-dimensional distributed re…
Capability Self-Assessment: Teaching LLMs to Know Their Limits
The ability to recognize one's own limitations and decide whether to solve a problem or delegate is fundamental for reliable intelligent sy…
Closed-Loop Neural Activation Control in Vision-Language-Action Models
Vision-Language-Action (VLA) models can be steered at test time by intervening on semantically meaningful internal directions, but existing…
Robust Shielding for Safe Reinforcement Learning
Shielding is an effective approach to formally guarantee the safety of reinforcement learning agents in Markov decision processes (MDPs). H…
On Wednesdays, We Ask Questions: Optimizing "Active Listening" in Automated Legal Triage and Referral
The FETCH classifier generates follow-up questions to help refine the best match for the applicant's legal problem, using a low-cost ensemb…
Evaluating Bivariate Causal Statements Based on Mutual Compatibility
For many real-world systems, causal ground truth is difficult to obtain, making claims about causal effects hard to assess. We develop meth…
Model-Native Computing Architecture: Envisioning Future System Architecture Through the Lens of Computer Architecture
Large language models are undergoing a transition from model technology to system technology. As developers use Codex, Claude Code, AutoGPT…
Coupling Language Models with Physics-based Simulation for Synthesis of Inorganic Materials
Modern generative machine learning (ML) models can propose novel inorganic crystalline materials with targeted properties; however, synthes…
From Noise to Control: Parameterized Diffusion Policies
We propose Parameterized Diffusion Policy (PDP), a framework for learning diffusion policies conditioned on low-dimensional, continuous par…
From "Weak" Signals to Strong Models: Preference Delta Aggregation with LoRA Merging
Training strong large language models (LLMs) requires high-quality supervision, which is often scarce. Recent work shows that paired prefer…
The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary
Extended chain-of-thought reasoning can degrade performance on deterministic state-tracking tasks, not due to preference biases, but limits…
VESTA: Visual Exploration with Statistical Tool Agents
Fitting quantitative models to data is a central step in scientific workflows, yet it remains one of the least automated. Recent agent-base…
Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight
As large language models become stronger, weak supervisors may fail to provide reliable labels, preferences, or final judgments for complex…
SDR: Set-Distance Rewards for Radiology Report Generation
Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision--language models. However, for chest X-ray report g…
Doing What They Say, Not What They Reason: Locating the Faithfulness Gap in LLM Agents
Do LLM agents act on the reasoning they state? This question of process fidelity is central to using LLMs in social simulation, yet it is h…
TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding
Using a diffusion model for parallel drafting is a promising approach for speculative decoding. By predicting tokens at multiple future pos…
EnergyMamba: An Uncertainty-Aware Graph-Enhanced Selective State Space Model for Energy Consumption Prediction
Energy consumption prediction is essential for efficient grid management, demand-side optimization, and sustainable energy planning. Althou…
Threshold-Based Exclusive Batching for LLM Inference
Mixed batching (MB)--interleaving prefill and decode in a single batch--has become the standard scheduling strategy for large language mode…
Acting with AI: An Interaction-Based Framework for Agentic Tort Liability
Agentic AI systems can plan over multiple steps, use tools, and execute tasks over time. When such systems cause harm, tort law struggles t…
KACE: Knowledge-Adaptive Context Engineering for Mathematical Reasoning
Context engineering can improve large language models without updating their weights, but mathematical reasoning exposes a key limitation:…
Probe Before You Edit: Probing-Guided Molecular Optimization for LLM Agents in Structure-Based Drug Design
Structure-based drug design increasingly employs LLM agents to iteratively refine ligands against a target pocket, yet a viable ligand must…
PropLLM: Propagation-Aware Scene Reconstruction for Network Fault Diagnosis
Network faults propagate layer by layer along topology and protocol dependencies, yet operations systems typically observe only symptomatic…
TRACE: Trajectory Risk-Aware Compression for Long-Horizon Agent Safety
Long-horizon LLM agents produce safety evidence across long trajectories, where sparse, delayed, and compositional risk signals often escap…
Efficient Test-time Inference for Generative Planning Models
Generative models have emerged as a powerful paradigm for AI planning, yet their performance remains constrained by the training data distr…
Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs
Reasoning traces have become a valuable form of learning signals for improving and transferring the capabilities of large language models.…
ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment
AI research often requires decisions before future evidence exists: which bottleneck to attack, which direction to pursue, or where a proje…
AXIOM: A Trust-First Neuro-Symbolic Execution Architecture for Verifiable Mathematical Reasoning
We present AXIOM, a trust-first neuro-symbolic execution architecture for natural-language mathematical reasoning. In AXIOM, the language m…
Medication-Aware Financial Exploitation Detection for Alzheimer's Patients Using Edge-Aware Interaction Risk Modeling
Financial exploitation is a growing concern for people with Alzheimer's disease, especially during periods of reduced cognitive stability.…
Regularized Offline Policy Optimization with Posterior Hybrid Bayesian Belief
Offline reinforcement learning (RL) aims to optimize policies from pre-collected datasets. A bottleneck of this paradigm is managing episte…
MOSAIC: Modular Orchestration for Structured Agentic Intelligence and Composition
Automated data science is a structured model-selection problem. A solution must choose data transformations, feature representations, archi…
LLM-Driven Co-Evolutionary Automated Heuristic Design for Bi-Component Coupled Combinatorial Optimization
While Large Language Models (LLMs) have recently shown promise in Automated Heuristic Design (AHD), existing methods typically generate and…
Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs
Strong reasoning depends not only on model knowledge but also on how effectively cognitive behaviors are deployed during generation. Existi…
AI Sovereignty as National Learning Capacity: A Human-Centered Learning Mechanics Viewpoint on France, the United States, and China
Artificial Intelligence is often discussed in France in terms of investment, compute capacity, regulation, employment, sovereignty, and edu…
SHARP: Sleep-based Hierarchical Accelerated Replay for Long Range Non-Stationary Temporal Pattern Recognition
Learning long-range non-stationary temporal patterns remains a core challenge for modern sequence models, particularly in strict streaming…
CoMIC: Collaborative Memory and Insights Circulation for Long-Horizon LLM Agents in Cloud-Edge Systems
Deploying lightweight Large Language Model (LLM) agents on edge servers can reduce latency and move agentic services closer to users, but r…
FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search
LLM-based agents increasingly solve complex tasks through long trajectories involving reasoning steps, tool calls, and inter-agent communic…
Interaction-Centered Intelligence: Toward Interaction as the Primary Unit of Analysis in Co-Creative AI and Human-AI Systems
Traditional artificial intelligence has largely conceptualized intelligence as isolated computation occurring within bounded agents. Across…
NBQ: Next-Best-Question for Dynamic Profiling
Many real-world conversational settings for knowledge discovery, including podcasts, hiring screens, and marketplaces, require a purpose-dr…
Mitigating Hallucinations in Large Language Models Via Decoder Layer Skipping
Large Language Models (LLMs) have achieved strong performance across diverse natural language tasks, yet their outputs often suffer from ha…
Subliminal Learning is a LoRA Artifact
Subliminal learning is a phenomenon where language models can transmit behavioral traits to other models through seemingly innocuous data (…
Decoupled Behavioral Cloning for Scalable Inductive Generalization in RL from Specifications
Inductive generalization is a framework for reinforcement learning (RL) generalization in which inductively related task instances admit in…
Certificate-Guided Evaluation of Reinforcement Learning Generalization
This work presents a logic-driven framework to evaluate the performance of reinforcement learning (RL) algorithms in their ability to gener…
Ryze: Evidence-Enriched Data Synthesis from Biomedical Papers
General-purpose VLMs remain unreliable for biomedical research because valid answers in scientific papers depend on evidence split across f…
Adversarial Feeds Steer LLM Agent Decisions Against Their Defaults
LLM agents increasingly act after consuming ranked external information streams such as social feeds, search results, retrieval contexts, a…
Relational Intervention During Functional Collapse in Large Language Models: A Lexical-Statistical Ablation and a Structure x Register Factorial
We test whether a relational-style intervention delivered during functional collapse in a small language model produces post-collapse behav…
Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition
Understanding modality interaction in multimodal large language models (MLLMs) is central to reliable deployment. We introduce Partial Info…
Prospect-Theory Behavior from Bellman Optimality in MDPs with Catastrophic States
We study risk-neutral control in Markov decision processes with an absorbing catastrophic state. Even though rewards are linear and the age…
Large Language Models in Transportation Systems Management and Operations: From Text Reasoning to Multi-modal Decision Support
Transportation systems management and operations (TSMO) increasingly depends on timely interpretation of heterogeneous data, from various s…
Subliminal Learning Is Steering Vector Distillation
Subliminal learning refers to a student language model acquiring a teacher's traits (e.g. a system-prompted preference for owls) when fine-…
Property Prediction of Stacked Bilayer Materials: A Multimodal Learning Approach
AI for materials science is a critical topic within AI for science, aiming to accelerate materials discovery and produce accurate property…
Can AI Review Improve Paper Drafting? An Empirical Study on 20 Computer Architecture Submissions
Research is advancing faster than ever with artificial intelligence (AI); and so are the corresponding research papers. The exploding volum…
Tackling the Root of Misinformation by Teaching Laypeople about Logical Fallacies via Socratic Questioning and Critical Argumentation
Identifying logical fallacies in everyday discourse is challenging for many people. This challenge is amplified in the era of Large Languag…
TriLens: Per-Layer Logit-Lens Entropy for White-Box Hallucination Detection
When a language model hallucinates, the final answer is wrong, but the mistake is not necessarily invisible inside the model. Different int…
TravelEval: A Comprehensive Benchmarking Framework for Evaluating LLM-Powered Travel Planning Agents
The development of Large Language Models (LLMs) has significantly improved travel planning applications, yet evaluating such models is limi…
AnyEdit++: Adaptive Long-Form Knowledge Editing via Bayesian Surprise
Editing complex, long-form knowledge in Large Language Models remains a significant challenge due to the difficulty of maintaining generati…
DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts
Mixture-of-Experts (MoE) models have become a leading approach for decoupling parameter count from computational cost in large language mod…
MindClaw: Closed-Loop Embodied Mental-State Reasoning for Precision Intervention
Theory of Mind (ToM) enables an agent to reason about another actor's beliefs, goals, and intentions, which is essential for human-centered…
Before the Model Learns the Bug: Fuzzing RLVR Verifiers
Reinforcement learning with verifiable rewards (RLVR) replaces human preference labels with executable reward functions such as math answer…
CAREAgent: Clinical Agent with Structured Reasoning and Tool-Integrated for Order Generation
Clinical order generation serves as a critical bridge between clinical decision-making and real-world practice, translating medical decisio…
Diagnosing LLM Arbitration Behavior over Pre-evidence Epistemic States in RAG-based Fact-Checking
In RAG-based fact-checking, LLMs are increasingly used as verifiers to check given claims against retrieved evidence. Their parametric know…
SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision
Agent skills are procedural artifacts that enable LLM agents to execute workflows, verify constraints, and recover from failures. Existing…
Reasoning4Sciences: Bridging Reasoning Language Models to All Scientific Branches
While Reasoning Language Models (RLMs) are rapidly emerging as powerful tools for scientific research, their impact is primarily concentrat…
Expected Value Alignment for Generative Reward Modeling in Formal Mathematics Verification
Large Language Models (LLMs) are increasingly used with formal interactive theorem provers such as Lean 4. Scaling these systems with reinf…
Deft Scheduling of Dynamic Cloud Workflows with Varying Deadlines via Mixture-of-Experts
Workflow scheduling in cloud computing demands the intelligent allocation of dynamically arriving, graph-structured workflows with varying…
"Skill issues": data-centric optimization of lakehouse agents
Coding agents are becoming users of data infrastructure, but their success depends not only on model quality: it also depends on the skills…
The Case for Model Science: Verify, Explore, Steer, Refine
We argue that the AI community is now ready to move beyond benchmarking and consolidate scattered efforts in model analysis into a systemat…
Can LLM Agents Sustain Long-Horizon Organizational Dynamics?
Large language agents are increasingly used for social simulation, yet it remains unclear whether they can sustain coherent behavior in str…
The Shape of Wisdom: Decision Trajectories in Language Models
Language models do not simply choose an answer at the output layer. In a 9,000-trajectory MMLU study across Qwen2.5-7B-Instruct, Llama-3.1-…
Advanced Mathematics Learning Behavior Prediction and Academic Early Warning Model Based on Multimodal Data Analysis
Early detection of at-risk students and timely academic intervention pose major challenges in advanced mathematics education, where complex…
Application of Algorithms in Energy-Efficient Design Platforms for Green Building
During green building design, computer-aided energy assessment is widely used to improve efficiency and achieve overall optimization. This…
HomeFlow: A Data Flywheel for Smart Home Agent Training with Verifiable Simulation
Large language model agents are moving beyond text-only interaction toward physical-world control, with smart homes as a representative dom…
Brain-Atlas-Guided Generative Counterfactual Attention for Explainable Cognitive Decline Diagnosis Using Multimodal Connectomes
Mild cognitive impairment (MCI) and subjective cognitive decline (SCD) are closely associated with the early Alzheimer's disease continuum,…
SIRIUS-SQL: Anchoring Multi-Candidate Text-to-SQL in Execution Feedback
Text-to-SQL on complex schemas is unreliable on a single pass, so recent systems generate multiple SQL candidates and let voting filter out…
Emergent Ordinal Geometry in Transformers Trained on Local Comparisons
Transitive inference is the challenge of inferring that A < C from knowing only adjacent relations (A < B, B < C). It is solved by humans a…
ANDES: Agent Native Data Evolving Synthesis Tool for Autonomous Instruction Alignment
AI agents are increasingly being tasked with automating AI research itself, particularly the critical post-training phase that transforms b…
SkillSmith: Co-Evolving Skills and Tools for Self-Improving Agent Systems
Recent self-evolving agents have shown that skills can be discovered, refined, and accumulated through execution. However, existing skill-e…
Science Earth: Towards A Planet-Scale Operating System for AI-Native Scientific Discovery
Scientific discovery demands intelligence, perseverance, and serendipity across vast search spaces. Today, top scientific capabilities rema…
Recognize Your Orchestrator: An Entropy Dynamics Perspective for LLM Multi-Agent Systems
The transition from single-turn models to Multi-Agent Systems (MAS) promises enhanced problem-solving capabilities, yet the centralized orc…
FlowTime: Towards Continuous Generative Watch Time Prediction via Flow-based Personalized Priors
Watch time has emerged as a pivotal metric for optimizing deep user engagement in short-video recommender systems. However, current methods…
Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability
Tool-using multi-agent large language model (LLM) systems spend computation through model tokens, tool calls, retries, and code execution b…
GuidaPA: Privacy-Preserving Chatbot for Public Administration via Federated Learning
We present GuidaPA, a privacy-preserving chatbot for the Italian Public Administration (PA) trained via Federated Learning (FL) on document…
Self-Healing Agentic Orchestrators for Reliable Tool-Augmented Large Language Model Systems
Tool-augmented large language model (LLM) agents rely on orchestration layers that coordinate planning, retrieval, tool invocation, validat…
GovAI-Pipe: A Layered AI Governance Pipeline for Citizen-Facing AI in Turkey's e-Government Gateway
Turkey's e-Government Gateway (e-Devlet) serves over 68 million registered users with more than 9,200 government services, and is increasin…
Don't Ask the LLM to Track Freshness: A Deterministic Recipe for Memory Conflict Resolution
LLM-based memory systems increasingly maintain facts that evolve over time, where a recurring failure is conflict resolution: when a fact h…
Dive into Ambiguity: A*-Inspired Multi-Agents Commonsense Obfuscation Attack on LLM Prompts
Large language models (LLMs) excel in reasoning and knowledge-intensive tasks but remain vulnerable to prompt-level adversarial attacks tha…
Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic Artificial Intelligence
Scientific discovery is not only answer generation but revision of the representational regime in which evidence, artifacts, operations, an…
Transferring Information Across Interventions in Causal Bayesian Optimization
Bayesian optimization is a popular way to optimize expensive systems, where every experiment, simulation, or intervention costs time or mon…
An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models
Studies of human reasoning have shown that people are typically stronger at evaluating reasoning than producing it from scratch. In contras…
A Minimalist Brain-Computer Musical Interface for Real-Time Emotion-Driven Sonification: System Design and Preliminary Evaluation
This paper presents a minimalist brain-computer Musical Interface (BCMI) that functions as a real-time affective sonification system, trans…
TERRA: Task-Embedded Reasoning and Representation Architecture for Cross-Domain Applications
A single action-conditioned latent predictive architecture can in principle be trained on the structured state of a driving scene, a robot…
Joint Agent Memory and Exploration Learning via Novelty Signals
In open-ended environments, exploration is fundamental for autonomous agents, yet current language model agents struggle with this. Effecti…
RoleCDE: Benchmarking and Mitigating Role-Alignment Trade-offs in Role-Playing Agents
Role-playing agents (RPAs) are widely used to steer large language models (LLMs) toward role-consistent behavior, yet existing benchmarks mai…
S-SPPO: Semantic-Calibrated Self-Play Preference Optimization
Aligning Large Language Models (LLMs) with human preferences is often formulated via Direct Preference Optimization (DPO). However, the sta…
TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL
Reinforcement learning (RL) for visual reasoning needs scalable, verifiable, and controllable training signals. Existing visual RL post-tra…
Revisiting Ripple Effects in Knowledge Editing through Pressure-Aware Joint Neighborhood Optimization
Single-edit updates in large language models can trigger ripple effects across local knowledge neighborhoods: desirable propagation to rela…
ReSkill: Reconciling Skill Creation with Policy Optimization in Agentic RL
Agentic reinforcement learning (RL) enables LLM agents to improve continuously from environment rewards, yet the resulting policies do not…
MobEvolve: An Agentic Self-Evolving Heuristic System for Interpretable Human Mobility Generation
Human mobility generation aims to synthesize realistic trip chains for target populations based on individual features. Existing paradigms,…
Characterization of Multi-Model Agentic AI Systems on General Tasks via Trace-Driven Simulation
Agentic AI completes tasks through iterative planning, tool use, and reasoning based on observed outcomes. Despite its popularity, its syst…
Evidence-Gated LLM Priors for Multi-Objective Bayesian Optimization
Large language models (LLMs) are increasingly used as heuristic advisors for black-box optimization, yet their suggestions and self-reporte…
TrafficRAG: A Multimodal RAG Framework for Traffic Accident Liability Determination
Traffic accident liability analysis is a critical yet challenging task in intelligent transportation and legal assistance. Existing methods…
TriAlign: Towards Universal Truth Consistency in Personalized LLM Alignment
Personalized large language models adapt responses to users' preferences and social attributes, but can introduce substantial universal tru…
EvoBrain: Continual Learning of EEG Foundation Models Across Heterogeneous BCI Tasks
Electroencephalography (EEG) is the cornerstone of non-invasive brain-computer interfaces (BCIs), yet conventional decoding relies on fragm…
Structure-Guided Adaptive Propagation for Protein-Protein Interaction Site Prediction
Accurate prediction of protein-protein interaction sites (PPIS) is essential for understanding cellular processes, disease mechanisms, and…
Stochastic convergence of parallel asynchronous adaptive first-order methods
A new class of asynchronous adaptive first-order optimization methods is introduced, comprising asynchronous variants of several popular al…
Consistency evaluation of benchmarks used for causal discovery
In graphical causal models, causal discovery aims to construct a causal graph based on numerical data and domain knowledge in plain text. Ho…
OctoT2I: A Self-Evolving Agentic Text-to-Image Router
The explosive growth of Text-to-Image (T2I) models, from large-scale versions to lightweight, real-time ones, now faces diminishing margina…
Token Predictors Are Not Planners: Building Physically Grounded Causal Reasoners
Current benchmarks for embodied vision-language planning often favor linguistic next-token prediction over physically grounded next-state r…
CAPF: Guiding Search-Agent Rollouts with Credit-Attenuated Privileged Feedback
Recent LLM search agents use reinforcement learning with verifiable rewards (RLVR) to learn search-augmented reasoning from outcome rewards…
Evaluation of Baseline Methods for IDD-based SSD External Memory Search
Many difficult search problems cannot be solved by algorithms such as A* using only RAM. Search algorithms which use external memory such a…
Does Compression Preserve Uncertainty? A Unified Benchmark for Quantized and Sparse LLMs via Conformal Prediction
Model compression techniques such as quantization and pruning are widely used to reduce the deployment cost of large language models (LLMs)…
WorldCoder-Bench: Benchmarking Physically Grounded 3D World Synthesis
Large language models (LLMs) are increasingly asked not only to write static interfaces, but to construct executable interactive worlds fro…
EVA-Net: Subject-Independent EEG Motor Decoding with Video-Derived Motor Priors
Practical non-invasive Brain-Computer Interface (BCI) systems require EEG decoders with strong cross-subject generalization and minimal cal…
Absorbing Complexity: An Interaction-Native Knowledge Harness for Financial LLM Agents
Financial AI agents often fail for a simple reason: they make users carry the complexity. A user must repeatedly restate goals, risk prefer…
Physically-Constrained Mamba-SDE for Remaining Useful Life Prediction under Irregular Observations
Accurate Remaining Useful Life prediction is critical for industrial predictive maintenance. However, real-world deployment is challenging…
Community-Aware Assessment of Social Textual Engagement and Resonance: A Human-Centric Perspective on User-Generated Content Evaluation
Traditional Video Quality Assessment (VQA) focuses narrowly on aesthetic fidelity, overlooking the complex social dynamics that define qual…
Bayesian Spectral Emotion Transition Discovery from Multi-Annotator Disagreement
Emotions evolve through the dynamics of conversation, and understanding their transition structure is foundational to applications ranging…
SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes
Smart homes are evolving toward complex state-dependent living environments, requiring Large Language Models (LLMs) to reason over user int…
VET: A Framework for Analyzing AI Discourse
Public discourse on AI has become polarized; exaggerated positions on AI in traditional and social media threaten the development of AI Lit…
AutoMedBench: Towards Medical AutoResearch with Agentic AI Models
Autonomous agents are increasingly expected to support end-to-end medical-AI research workflows, moving beyond isolated prediction tasks or…
Algorithmic algorithm development with LLMs: A Case Study on LLM-Usage for Contraction Order Optimization in Tensor Networks
We consider LLM-based algorithm development through a case study on contraction order optimisation for tensor networks with OpenEvolve. We p…
An NLP-Driven Framework for Curriculum-Labor Market Alignment: Schema-Constrained LLM Extraction, ESCO-Anchored Semantic Matching, and Multi-Dimensional Gap Quantification
Schema-constrained information extraction from diverse educational and labor-market corpora remains an open challenge in natural language p…
SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning
As Large Language Model (LLM) agents increasingly leverage the Model Context Protocol (MCP) to operate in complex environments, the expansi…
Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery
Large Reasoning Models (LRMs) rely on long reasoning traces, making inference expensive. While low-bit quantization reduces per-token decod…
RL-ACRGNet: Reinforcement Learning-Based Chest Radiology Report Generation Network
Medical imaging interpretation is a foundational pillar of modern clinical diagnostics, yet the manual generation of radiology reports rema…
Topological texture analysis of microscopy images of dynamic casein gelation and its relation to rheological properties
We propose a novel computational toolbox that integrates Topological Data Analysis (TDA), Differential Box Counting (DBC), Multifractal Par…
Explainable Data-driven Deep Reinforcement Learning Methods for Optimal Energy Management in Buildings
The increasing integration of renewable energy sources into power systems, particularly in buildings equipped with photovoltaic (PV) panels…
eMoT: evolving Memory-of-Thought via Symbolic Anchoring and Memory Corrosion
While Large Language Models (LLMs) achieve impressive performance on multi-step reasoning tasks, their reliability is persistently hindered…
Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories
Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Evaluation based…
BADGER: Bridging Agentic and Deterministic Evaluation for Generative Enterprise Reasoning
Enterprise AI systems that translate natural language into SQL queries and orchestrate multi-step agentic reasoning pipelines require evalu…
Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning
Agentic reinforcement learning can induce tool abuse, where models overuse external tools even for queries solvable by internal reasoning.…
S3TS: Stochastic Scenario-Structured Tree Search for Advanced Planning Under Uncertainty
Effective scheduling in the energy sector is essential to ensure the reliable operation of electrical grids and their connected assets by,…
An Abstract Worlds Semantic Framework for Belief Change Operators
This article proposes a set-theoretic framework for belief change, called Abstract Worlds Semantics, in which no logical syntax is assumed.…
From Capability Models to Automated Planning: An AAS-Native Approach for Automatic PDDL Generation
Engineers designing production systems need to verify that a given layout supports all required production sequences. Automated planning te…
CEON: Circular Economy Ontology Network
Increasing the circularity of resource use in our society has been recognized as a path to sustainability, i.e., transitioning into a more…
POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems
Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emergent failure…
Repair Before Veto: Repair-Augmented Constraint Learning for Contextual Decisions
Hard constraints are usually treated as terminal vetoes: once a candidate violates a requirement, the learned rule rejects it and any repai…
Forget Attention: Importance-Aware Attention Is All You Need
Combining attention's global retrieval with the sequential importance signal of state space models (SSMs) is the open challenge of hybrid l…
Coordination Graphs for Constrained Multi-Agent Reinforcement Learning
Constrained Multi-agent reinforcement learning (CMARL) faces two intertwined challenges: the joint action space grows exponentially with th…
SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training
Long-horizon LLM agents can benefit from reusable skills, yet existing skill-based methods often rely on external skill generators during t…
MOC: Multi-Order Communication in LLM-based Multi-Agent Systems
Despite the remarkable progress of Large Language Model (LLM) based Multi-Agent Systems, most research focuses on optimizing coordination t…
COMAP: Co-Evolving World Models and Agent Policies for LLM Agents
Equipping language agents with world models enables them to anticipate environment dynamics and evaluate candidate actions before execution…
Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses
Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it ha…
Spatial Representation Learning Beyond Pixels: Unifying Raster Data and Vector Semantics for Human-Centric Geospatial Foundation Models
Earth Observation (EO) has fundamentally transformed the monitoring of environmental processes and human activities up to planetary scale.…
A Mathematical Conflict Framework for Contextual Data Modulation
In this study, a generalized operator-based mathematical conflict framework is presented to explicitly represent structural discrepancies b…
AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design
Protein language models (PLMs) are passive oracles: they generate sequences in a single forward pass with no mechanism to consult external…
Bridging the Sim-to-Real Gap in Semiconductor Visual Program Synthesis via Input Binarization
Precise parametric control over circuit geometry is essential for semiconductor inspection, yet obtaining sufficient real training data rem…
LLM-Evolved Pattern Generators for Optimal Classical Planning
Learned heuristics have recently become a competitive alternative to traditional domain-independent heuristics for satisficing planning. Ex…
Food Noise & False Safety: A Systematic Evaluation of How LLMs Fail to Adapt to Eating Disorder Queries with Clinician Feedback
Recent evidence shows that people with eating disorders (EDs) are increasingly seeking guidance, advice, and emotional support from Large L…
HLL: Can Agents Cross Humanity's Last Line of Verification?
Multimodal agents are increasingly expected to operate interfaces on behalf of users, raising a central deployment question: can they truly…
Beyond One-shot: AI Agents for Learning in Field Experiments
Organizations routinely run experiments for A/B testing, yet the data generated from one experiment is underutilized to inform subsequent i…
AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents
Language agents spend substantial inference time solving individual tasks, yet the experience acquired in one episode is often underutilize…
MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation
The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data so…
Iteris: Agentic Research Loops for Computational Mathematics
Recent advances in large language models and agentic AI systems have enabled significant progress in mathematical discovery, from solving c…
RASER: Recoverability-Aware Selective Escalation Router for Multi-Hop Question Answering
Multi-hop question-answering systems often use expensive retrieval on every question. They may decompose the question, run several retrieva…
Bridging the Last Mile of Time Series Forecasting with LLM Agents
Time series forecasting has advanced rapidly, especially with the emergence of foundation models that show strong zero-shot performance on…
SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment
Aligning Large Language Models (LLMs) with human values often degrades their general capabilities, termed the alignment tax. Existing metho…
Tracking the Behavioral Trajectories of Adapting Agents
Text files such as skill files, memory files, and behavioral configuration files play a central role in defining how modern agents act. Thr…
ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents
Clinical practice is not the selection of an answer from enumerated options: a physician gathers heterogeneous information incrementally an…
A Novel Data Augmentation Strategy for Robust Deep Learning Classification of Biomedical Time-Series Data: Application to ECG and EEG Analysis
The increasing need for accurate and unified analysis of diverse biological signals, such as ECG and EEG, is paramount for comprehensive pa…
A Lightweight Deep Learning-based Model for Ranking Influential Nodes in Complex Networks
Identifying influential nodes in complex networks is a critical task with a wide range of applications across different domains. However, e…
BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali
Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large langua…
Empathic and agentic artificial intelligence in nursing: perspectives on a human-centered framework for cancer care navigation in the United States
For patients experiencing cancer, nurse navigation can ease the burden of complex care by enhancing coordination of health services and pat…
RuleEdit: Failure-Guided Human-AI Model Editing with Prospective Impact Preview
Despite the promise of AI to assist complex decisions, practitioners still lack ways to detect likely failures and inspect the consequences…
DraDDP: A Multimodal Multi-Party Dialogue Discourse Parsing Dataset
Multi-party dialogue discourse parsing aims to identify dependency structures and relation types between utterances in conversations. Previ…
A phenomenon of AI-conformity: how algorithms change human moral decision-making
Social conformity is a well-documented phenomenon in which individuals shift their opinions towards those of a social majority. As artifici…
Toward Robust In-Context Learning: Leveraging Out-of-distribution Proxies for Target Inaccessible Demonstration Retrieval
Although studies have demonstrated that Large Language Models (LLMs) can perform well on Out-of-Distribution (OOD) tasks, their advantage t…
SortingHat: Redefining Operating Systems Education with a Tailored Digital Teaching Assistant
Operating Systems (OS) courses are among the most challenging in computer science education due to the complexity of internal structures an…
AEyeDE: An Attention-Based Attribution Framework for AI-Generated Text Detection
Detecting AI-generated text is becoming increasingly challenging as modern language models approach human-level fluency and can evade detec…
Examine Clinicians' Modification of Hedging Language in Ambient AI Documentation: A Comparative Study of AI Drafts and Final Notes
Ambient AI documentation systems generate clinical note drafts that clinicians frequently revise before signing off into electronic health…
Understanding Stigmatizing Language in Clinical Documentation: A Paired Comparison of Ambient AI Drafts and Clinician Finalized Notes
Ambient artificial intelligence (AI) documentation tools are increasingly deployed to reduce clinician documentation burden, but their impl…
CSRP: Chain-of-Thought Reasoning for Chinese Text Correction via Reinforcement Learning with Efficiency-Aware Rewards
Large Language Model (LLM) based Chinese Grammatical Error Correction (CGEC) systems face two critical challenges: general-purpose models l…
SENSE: Semantic Embedding Navigation with Soft-gated Evaluation for Retrieval-based Speculative Decoding
Speculative Decoding (SD) accelerates Large Language Model (LLM) inference by employing a lightweight draft model to propose candidate toke…
lmfaoooo at SemEval-2026 Task 1: Humor Is an Audience. Preference Modeling for Constrained Humor Generation
Humor generation remains difficult not only because producing fluent, novel jokes is hard, but because "funny" is audience-dependent and su…
TrustLDM: Benchmarking Trustworthiness in Language Diffusion Models
The rapid development of Language Diffusion Models (LDMs) challenges the dominant position of auto-regressive competitors in language proce…
A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models
Large language models (LLMs) are increasingly deployed across healthcare, yet existing benchmarks fail to capture model behavior under adve…
TCAR-Gen: Temporal Graph Retrieval with Evidence Fusion for Knowledge-Grounded Generation
Retrieval-augmented generation systems struggle with temporal reasoning and evidence fusion when answering complex questions over historica…
LLMs for Cardiovascular Risk Prediction from Structured Clinical Data
Coronary artery disease (CAD) remains one of the leading causes of death globally, highlighting the need for reliable predictive systems to…
Make Mechanistic Interpretability Auditable: A Call to Develop Guidelines via Continuous Collaborative Reviewing
While mechanistic interpretability (MI) has produced important insights into neural network internals, the field has yet to establish a sta…
Update Opacity: Epistemic Accessibility and Governance Under AI System Change
Machine learning models embedded in deployed AI systems are routinely updated to maintain correct functioning over time. Yet such updates c…
Beyond Tool Adoption: A Practical Five-Stage Developmental Continuum for AI Literacy in Higher Education
Artificial intelligence (AI) literacy is increasingly recognized as a foundational competency for all university graduates. Yet students' e…
Beyond Categories of Caste: Examining Caste Bias and Morality in Text-to-Image AI Models
Text-to-Image (T2I) models have shown promising utility across various domains. However, such models are also amplifying harmful societal b…
Tracing GenAI Literacy: Uncovering Student-AI Interaction Patterns in Academic Writing through Epistemic Network Analysis
As Generative AI (GenAI) becomes integral to education, fostering GenAI literacy is critical. However, current assessments largely rely on…
Improving Hospital Process Management through Process Mining: A Case Study on COVID-19 Clinical Pathways
This study analyzes COVID-19 care pathways using the COVID Data for Shared Learning dataset. We build a transparent, reproducible pipeline…
Algorithmic Authority and the Clinical Standard of Care
The integration of artificial intelligence into clinical medicine creates a fundamental tension between algorithmic probabilistic reasoning…
When Jokes Cross the Line: Analyzing Regular Humor and Dark Humor in YouTube Shorts
Video platforms such as YouTube have reshaped how users engage with entertainment and information, emphasizing brief, highly engaging conte…
Comprehensive AI governance requires addressing non-model gains
Frontier AI governance often centres on the model-level governance paradigm, which assumes that a model's capability profile is primarily a…
Measuring and Mitigating Bias in Code Generated by Large Language Models
Large language models (LLMs) are widely recognised for their applications in natural language generation and are increasingly used for code…
Business Utility of Large Language Models as Exploratory Data Analysis Agents
Large Language Models (LLMs) are increasingly used in analytical workflows, but their suitability as exploratory data analysis (EDA) agents…
From Human Videos to Robot Manipulation: A Survey on Scalable Vision-Language-Action Learning with Human-Centric Data
Recent progress in generalizable embodied control has been driven by large-scale pretraining of Vision-Language-Action (VLA) models. Howeve…
Physics-Informed Neural Networks for Radial Consolidation of Combined Electroosmotic, Vacuum and Surcharge Preloading Considering Smear Effects
This study develops a dimensionless multi-domain physics-informed neural network (PINN) framework for electro-osmotic radial consolidation…
Beyond Text and Tables: Vision-Language Model Integration in ComProScanner for Extracting Materials Data from Scientific Figures with High Accuracy
Automated extraction of materials composition-property data from scientific literature has advanced considerably with the development of la…
Rare Events, Real Signals: Functional Ensembles as Units of Computation in Deep Spiking Networks
We investigate how internal representations emerge across hierarchical processing systems by introducing a neuroscience-inspired framework…
CLSP-REQA: A Real-Time Quality-Aware Closed-Loop Seizure Prediction Framework with Mamba-BiLSTM and Confidence-Gated Intervention
Reliable seizure prediction is a prerequisite for closed-loop neurostimulation therapy, yet existing methods rarely account for the variabi…
Improved Belief-Attention in Vision Task
Recently, Belief-Attention has been proposed by first performing an orthogonal projection of the softmax-b…
Flow-Based Generative Modeling for Optimizing Sampling Policies in Compressed Sensing Applications
Numerous modern applications in signal processing and medical imaging necessitate acquiring high-dimensional signals under tight resource c…
BitsMoE: Efficient Spectral Energy-Guided Bit Allocation for MoE LLM Quantization
Mixture-of-Experts (MoE) large language models reduce per-token computation through sparse expert activation, but their deployment remains…
Planktonzilla: Multimodal dataset and models for understanding plankton ecosystems
Marine plankton underpin aquatic food webs and play a key role in global CO2 sequestration, making reliable species identification critical…
DAStatFormer: A Hybrid Multibranch Transformer with Statistical Feature Integration for DAS-Based Pattern Recognitions
Distributed Acoustic Sensing (DAS) enables large-scale monitoring through optical fibers, but its high dimensionality and complex spatio-te…
Hoeffding Concept Bottleneck Models with Applications to Overhead Images
Explainability of deep learning algorithms is critical for computer-vision applications with high-stake decisions. Concept bottleneck model…
From Demonstrations to Rewards: Test-Time Prompt Optimization for VLM Reward Models
Reinforcement learning relies on accurate reward functions, which are often hand-crafted or even unavailable in real-world applications, su…
SentimentLens: Reconciling Sentiment and Ratings via Dual-Modality in the Hospitality Sector
Online travel platforms generate vast volumes of user-generated hotel reviews, offering rich opportunities to understand traveler experienc…
Structured Visual Evidence Decomposition for Evidence-Grounded Multimodal Screening of Obstructive Sleep Apnea-Hypopnea Syndrome
Effective pre-polysomnography screening for obstructive sleep apnea-hypopnea syndrome (OSAHS) requires combining clinical risk factors with…
Can Predicted Dynamics Exist in the Physical World?
Predictive Physical AI systems output state rollouts, action chunks, and latent plans, yet a low root-mean-square error (RMSE) does not imp…
Silent Failures in Physical AI: A Literature Review of Runtime Action Authorization for Autonomous Systems
Physical AI systems increasingly map multimodal observations, language instructions, and learned world representations into physically cons…
DLLM-JEPA: Joint Embedding Predictive Architectures for Masked Diffusion Language Models
Joint Embedding Predictive Architectures (JEPAs) have reshaped self-supervised representation learning in vision. The recent LLM-JEPA porte…
Aligning Cellular Sheaves with Classifier Attention for Interpretable Weakly-Supervised Pathology Localization
Weakly-supervised classification of whole-slide images with attention-based multiple instance learning (ABMIL) on top of foundation feature…
Diffusion Image Generation with Explicit Modeling of Data Manifold Geometry
Image generative models aim to sample data points from the underlying data manifold, a task that requires learning and decoding a dense, lo…
Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation
Vision-Language Navigation (VLN) enables embodied agents to reach target locations in unseen environments by following language instruction…
Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents
Visual agents employ external visual tools within visual chains of thought to incorporate fine-grained evidence. While prior work has mainl…
CoilDrop-MRI: Self-supervised physics-guided MRI reconstruction with coil dropout
Self-supervised deep learning-based methods have shown great promise for accelerated magnetic resonance imaging (MRI) reconstruction, achie…
CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection
With the rapid advancement of artificial intelligence generated content (AIGC) technologies, video forgery has become increasingly prevalen…
PEACE: A Planner-Executor Agent with Constraint Enforcement for UAVs
Foundation models are increasingly used to drive autonomous systems, yet existing approaches either keep the model in a tight control loop,…
Visual-Noise Guided In-Context Distillation for Multimodal Large Language Model Unlearning
Multimodal Large Language Models (MLLMs) have achieved remarkable progress on vision-language tasks, but they may also memorize and expose…
A Methodological Framework for Explicit Control of the Speed-Accuracy Trade-off in Brain-Computer Interfaces
Brain-computer interfaces (BCIs) are limited by low signal-to-noise ratio in modalities such as electroencephalography, which requires mult…
Motif-based morphology signatures for interpretable ECG screening and monitoring
Electrocardiography (ECG) remains central to cardiovascular screening, yet interpretation remains largely manual and episodic. Clinical pra…
Project SPARROW and the Future of Conservation Technology
Global biodiversity is declining at unprecedented rates, yet the tools available to monitor and protect ecosystems remain limited by constr…
VDSB-GWSyn: Diffusion Schrödinger Bridge for Controllable and Anatomically Feasible Guidewire Synthesis in Coronary Angiography
Coronary guidewire endpoint localization is a fundamental capability for computer-assisted PCI, and its importance increases as robot-assis…
Enhancing BiGRU with a KAN Block for Legal Document Classification and Summarization
This study introduces a novel architecture of KAN-based BiGRU model for the task of classification and summarization of legal documents in…
V2I Work Zone Geometry Reconstruction with Pose-Conditioned UWB Range Denoising
Reliable work zone mapping is important for connected and autonomous vehicles (CAVs) to navigate safely and smoothly through work zone area…
SpikeWFM: Spiking-Aided Wireless Foundation Model for Robust Channel Prediction
This paper proposes SpikeWFM, a novel hybrid architecture that integrates spiking neural networks (SNNs) with conventional artificial neura…
Versatile Framework with Semantic and Structural guidance for Image Reconstruction from Brain Activity
Reconstructing visual stimuli from brain recordings has been a meaningful and challenging task in brain decoding. Especially, the achieveme…
CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations
Multimodal Large Language Models (MLLMs) have shown strong performance on public medical benchmarks, yet existing evaluations often remain…
Multimodal Music Recommendation System using LLMs
Music recommendation systems typically treat songs as opaque tokens, relying on collaborative interaction histories which overlooks semanti…
A Shared Valence Axis Across Modern LLMs and Human EEG: The Saturation Regularity
Large language models (LLMs) have emerged as powerful representation learners whose internal features increasingly align with human cogniti…
Automatically Differentiable Nonlinear Tensor Networks (ADNTNs) for Exponential Compression of Deep Neural Networks
We study Automatically Differentiable Nonlinear Tensor Networks (ADNTNs), a family of structured weight generators whose compact core tenso…
AI-PROPELLER: Warehouse-Scale Interprocedural Code Layout Optimization with AlphaEvolve
Post-link optimizers (PLOs) such as Propeller and BOLT have demonstrated that precise, profile-guided code layout can extract significant p…
Foundation-Preserving Adaptation via Generalized Rayleigh-Quotient Optimization
While finetuning effectively adapts foundation models to specialized downstream tasks, it can degrade nontarget capabilities acquired durin…
XAI-SOH-FL: Enhancing SOH-FL with Adaptive Aggregation and Explainable AI for Intrusion Detection in Heterogeneous IoT
Intrusion Detection Systems (IDS) in Internet of Things (IoT) environments face significant challenges due to data heterogeneity, lack of l…
On Effectiveness and Efficiency of Agentic Tool-calling and RL Training
Tool-calling is a central component of modern large language model (LLM) agents, equipping them with skills beyond their parametric knowled…
Generative AI and Digital Ecosystem Resilience: A Proactive Lifecycle-Based Survey
The proliferation of adversarial synthetic content, accelerated by Generative AI (GenAI) is rendering traditional reactive detection method…
Geodesics with Unified Tangent-constrained Priors and Curvature Regularization
Curvature-penalized geodesic models have proven their effectiveness in image segmentation by computing globally optimal curves. Unfortunate…
Geometric Erasure by Contrastive Velocity Matching in Rectified Flows
While the rapid adoption of multimodal generative models offers immense potential, it has also increased the risks of harmful content synth…
Adaptive data selection improves wearable prediction under low baseline performance
Adaptive sensing strategies that selectively sample data are increasingly used in wearable health systems to improve prediction performance…
Regime-Adaptive Continual Learning for Portfolio Management
Financial markets are inherently non-stationary, exhibiting frequent regime shifts and structural changes that render traditional Portfolio…
BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding
Speculative decoding speeds up autoregressive decoding by using a drafter to propose multiple tokens that a verifier validates in parallel.…
Completion at the Boundary (CaB): Deployable Switching with Completion-Aware Control under Limited Calibration
Vision-language-action (VLA) agents can execute natural-language instructions, yet deployed systems still lack an operational interface: de…
Multi-Contrast MRI Motion Correction via Parameter-Informed Disentanglement and Adaptive Experts
Motion artifacts in magnetic resonance imaging (MRI) degrade diagnostic reliability. Existing deep learning methods are typically contrast-…
RAFT: Data Refinement and Adaptive Distillation for Domain Fine-Tuning with Alleviated Forgetting
Domain-specific supervised fine-tuning (SFT) often improves in-domain performance at the cost of degrading a model's general capabilities.…
StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning
Multimodal large language models (MLLMs) often know the rule but pick the wrong answer: on abstract visual reasoning (AVR) tasks, a model c…
Persona Attack: Incremental Memory Injection Jailbreak Attack against Large Language Models
As Large Language Models evolve for user convenience, vulnerability to jailbreak attacks continues to be reported despite ongoing efforts i…
Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying
In reinforcement learning (RL), agents benefit from exploration only because they repeatedly encounter similar states: trying different act…
PrivacyPeek: Auditing What LLM-Based Agents Acquire, Not Just What They Say
LLM-based agents are rapidly advancing, autonomously invoking external tools to complete multi-step tasks for users. However, agents often…
DiffCrossGait: Trajectory-Level Alignment for 2D-3D Cross-Modal Gait Recognition via Latent Diffusion
Cross-modal 2D-3D gait recognition is impeded by inherent domain discrepancies between 2D silhouette and 3D LiDAR range-view representation…
Benchmarking Multimodal LLMs on Code Generation for Complex Interactive Webpages
Recent advancements in multimodal large language models (MLLMs) have achieved remarkable progress in multimodal reasoning and code generati…
A Protocol-Language Model for Network Intrusion (Without Deep Packet Inspection)
Modern network intrusion detection systems (NIDS) are caught in a structural contradiction: the protocols carrying the highest threat intel…
A physics-informed foundation model for quantitative diffusion MRI
Understanding the human brain requires access to its microscopic tissue architecture. Diffusion magnetic resonance imaging (MRI) provides t…
Interpreting FCDNNs via RG on Exponential Family
We consider establishing the interpretability theory of deep learning through constructing a corresponding relationship between the renorma…
Digital-to-Physical Transfer of Adversarial Patches for Aerial Vehicle Detection
Deep neural network (DNN)-based object detectors are widely used for analyzing aerial and satellite imagery in applications such as environ…
DataShield: Safety-degrading Data Filtering for LLM Benign Instruction Fine-Tuning
Large language models (LLMs) suffer from degraded safety capabilities even when fine-tuned with benign datasets. However, existing methods…
Improving IoT Intrusion Detection Through SMOTE-Based Oversampling and Extended Multi-Model Evaluation on Side-Channel Power Data
The detection of intrusions in IoT-based networks poses challenges that cannot be overcome using traditional machine learning methods. Perh…
ChurnNet: A Optimized Modern AI for Churn Prediction
Increased competition and the growing similarity of products and services offered by retailers have lowered the barriers for customers to s…
UF-AMA: A unified framework for cross-domain emotion recognition via adaptive multimodal alignment
In recent years, emotion recognition based on physiological signals such as electroencephalogram (EEG) has gained considerable attention, a…
MyoSem: Aligning Electromyography to Natural-Language Action Semantics for Hand Action Understanding
Electromyography (EMG) directly reflects muscle activation and is a key sensing modality for gesture recognition, prosthetic control, and w…
Beyond Augmentation: Score-Guided Pathological Prior for EEG-based Depression Detection
Deep learning-based Major Depressive Disorder (MDD) detection using Electroencephalography (EEG) is fundamentally constrained by the "small…
The New Social Image: How AI Competency and AI Proactivity Influence Self- and Peer-Perceptions in the Workplace
Human-AI collaboration is considered the most promising way to incorporate AI in the workplace. What remains unexplored are the experientia…
Agentic Transformers Provably Learn to Search via Reinforcement Learning
Tree search is a central abstraction behind many language-agent reasoning and decision-making tasks: agents must explore actions, remember…
Learning to Construct Practical Agentic Systems
Automated design and optimization of agentic LLM-based systems leads to sophisticated systems that substantially improve result quality ove…
BAGEN: Are LLM Agents Budget-Aware?
While agents are increasingly spending more resources, today agent cost is mostly measured only after execution. A Budget-Aware Agent (BAGE…
From Rashomon Theory to PRAXIS: Efficient Decision Tree Rashomon Sets
Standard machine learning pipelines often admit many near-optimal models. These "Rashomon sets" pose a range of challenges and opportunitie…
SEMBridge: Tagless-Final Program Semantics with Weakest-Precondition and Bounded-Checking Interpretations
Formal methods provide rigorous accounts of program behavior, but practical software engineering often works through executable libraries,…
Continuous Reasoning for Vision-Language-Action
Natural language is a powerful reasoning medium for language and vision-language models, but it is mismatched to the granularity of continu…
Civilizational Metamaterials: Engineering Coordination Under Capability Gradients and Structural Turbulence
We argue that governance must transition from a normative discipline to an engineering discipline, and we develop a formal framework, inspi…
InfoAtlas: A Foundation Model for Zero-Shot Statistical Dependence Estimate
Measuring statistical dependency between high-dimensional random variables is a fundamental task in data science and machine learning. Neur…
Effects of Varying LLM Access on Essay Writing Behavior
Investigating the degree to which large language models (LLMs) affect teaching and learning in universities can help identify strategies fo…
ARCA: Adapter-Residual Credit Assignment When Token Signals Degenerate
Token-level credit assignment for language-model reinforcement learning is usually formulated as if the policy were fully trainable, while…
When Softmax Fails at the Top: Extreme Value Corrections for InfoNCE
InfoNCE is the standard contrastive learning objective, but its softmax form is not only a computational convenience: it also encodes a sta…
StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement
Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on…
Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models
Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensiv…
Synthetic Data from Cross-Domain Events for Large-Scale Recommendation Systems
Large-scale recommendation systems operate across diverse domains, yet they face the challenges of data sparsity and noisy implicit feedbac…
Real2SAM2Real: Generative 3D Caches as Complementary Context for Video Diffusion
While Video Diffusion Models (VDMs) excel at synthesizing high-fidelity videos, enabling precise camera and scene control remains challengi…
Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance
On-Policy Distillation (OPD) improves large language model reasoning by training a student model on trajectories sampled from its own polic…
Rethinking the Role of Temperature in Large Language Model Distillation
Reverse Kullback-Leibler (RKL) divergence is widely favored over forward KL (FKL) in large language models (LLM) distillation, yet this pre…
How Generation Architecture Shapes Code Complexity in Multi-Agent LLM Systems: A Paired Study on HumanEval
Large-language-model code generation has shifted from single-shot prompting to multi-agent orchestrations - analyst, coder, tester, and deb…
DRL-Based Pose Control for Double-Ackermann Robots Under Actuation Uncertainties
Robust deployment of deep reinforcement learning (DRL) policies on real robots remains challenging due to discrepancies between simulation…
LLMs Need Encoders for Semantic IDs Too
Multimodal LLMs use dedicated encoders to bridge non-language modalities (vision encoders for images, depth models for audio codec tokens)…
Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning
Various language domains have undergone remarkable changes in recent years; these shifts are largely attributed to the advent of Large Lang…
ROGUE: Misaligned Agent Behavior Arising from Ordinary Computer Use
As AI agents are increasingly deployed in real personal and corporate settings (email accounts, development workflows, company databases, e…
(HB-ARFM) History-Bootstrapped Flow Matching for Inverse Boiling Reconstruction
Reconstructing spatiotemporal fields from partial observations is fundamental to scientific inference, from inferring atmospheric states fr…
Drift Q-Learning
Offline reinforcement learning requires improving a policy from fixed data while avoiding out-of-distribution actions with unreliable value…
Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems
Reinforcement learning problems typically define the goal as maximizing the expected value of a scalar reward function. But, pairwise prefe…
Agentic Authoring of Interactive Multiview Visualizations in Genomics
Diverse genomics data, scientific questions, and analysis tasks typically demand highly specialized visualizations. Therefore, users often…
SUPREME: A Multi-GPU Framework for Reproducible Image Unlearning Method Evaluation
Machine unlearning removes the influence of specific training data from a trained model without retraining it from scratch. Evaluating an u…
Zamba2-VL Technical Report
We present Zamba2-VL, a suite of vision-language models built on Zamba2, a hybrid language-model architecture combining Mamba2 state-space…
Detector-Evasive LLM Paraphrasing via Constrained Policy Optimization
AI-text detectors are vulnerable to paraphrasing and detector-guided paraphrasing attacks, but existing detector-evasion methods often lack…
PR2: Predictive Routing Replay for MoE-Based LLM Reinforcement Learning
Mixture of Experts (MoE) Large Language Models (LLMs) achieve strong performance at scale. However, reinforcement learning (RL) on MoE-base…
A Distribution-Free Framework for Rewrite-Based Human-text Detection via Knockoff Filtering
We propose a distribution-free statistical framework that converts arbitrary rewrite-based detectors into detectors with finite-sample FDR…
Masking Stale Observations Helps Search Agents -- Until It Doesn't: A Regime Map and Its Mechanism
Long-horizon search agents accumulate large amounts of retrieved content across many tool calls, making context-budget efficiency increasin…
AgentxGCore: Agentic AI for Next-Generation Mobile Core Network
To meet the stringent requirements of emerging applications and the increasingly complex network management and operation, the Next Generat…
Finer Parameter Steps for Low-Rank PEFT: A Controlled Study with CP Tensor Adapters
Low-rank adapters are usually compared by sweeping a small set of ranks, but the rank also fixes the resolution of the parameter budget. Fo…
Detect Before You Leap: Mirage Detection in Vision-Language Models
Vision-language models (VLMs) can produce confident visual answers even when the required visual evidence is missing, blank, or unrelated t…
DarkVesselNet: Multi-Modal Remote Sensing and Trajectory Reasoning for Dark Vessel Detection
Dark vessel detection requires fusing what vessels report through AIS with what satellites observe through radar and optical sensors. DarkV…
GeoSAM-3D: Geodesic Prompt Propagation for Open-Vocabulary 3D Scene Segmentation from Monocular Video
Open-vocabulary 3D scene segmentation usually assumes RGB-D video, calibrated multi-view imagery, or a reconstructed mesh. GeoSAM-3D studie…
When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems
LLM agents increasingly rely on community-contributed skills that expand an agent's operational capability set. We study a core safety prob…
Short-form Text Rewriting with Phi Silica
Short-form text rewriting is a constrained variant of paraphrasing in which limited context and high semantic density leave little room for…
On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance
Large Language Models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how m…
CodeCytos: AI-assisted spatial molecular imaging analysis via code-augmented agent action space
Conventional tissue image analysis software provides foundational capabilities for cellular analysis, including segmentation, basic morphol…
Pre-Deployment Robustness Stress Testing for CT Segmentation Systems Using Clinically Motivated Multi-Corruption Augmentation
Deep learning-based CT segmentation systems often achieve high accuracy on clean benchmark images, but their performance may degrade under…
TabChange: Precise Attribute Changes in Tabular Data
Modifying an attribute in tabular data often introduces an unnatural instance by breaking its relationships with other attributes. The modi…
V-LynX: Token Interface Alignment for Video+X LLMs
This study introduces an intriguing phenomenon in Video LLMs: rather than merely translating frames into textual embeddings, Video LLMs est…
Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning
Agent skills are callable procedural modules that provide reusable knowledge and execution policies for complex agentic tasks. However, exi…
PaCo-VLA: Passivity-Shielded Compliance Prior for Contact-Rich Vision-Language-Action Manipulation
Contact-rich manipulation demands both high-level semantic reasoning and the safe regulation of high-frequency contact dynamics. While Visi…
CAFOSat: A Strongly Annotated Dataset for Infrastructure-Aware CAFO Mapping Using High-Resolution Imagery
Concentrated Animal Feeding Operations (CAFOs) play an important role in agricultural production but are also associated with environmental…
Richer Representations for Neural Algorithmic Reasoning via Auxiliary Reconstruction
Neural algorithmic reasoning has emerged as a popular research direction. It aims to train neural networks to mimic the step-by-step behavi…
Interpretable Policy Distillation for Power Grid Topology Control
Deep reinforcement learning (RL) offers a promising route to real-time power grid operation, yet large neural policies are costly to evalua…
A Practical Upper Bound on Selection Bias Effects in Medical Prediction Models
Selection bias is a common and often unavoidable aspect of real-world data that challenges the generalizability of machine learning models.…
Revisiting Parameter-Based Knowledge Editing in Large Language Models: Theoretical Limits and Empirical Evidence
Parameter-based knowledge editing updates the internal knowledge of large language models (LLMs) via localized weight modifications and has…
On the Difficulty of Learning a Meta-network for Training Data Selection
Synthetic data are increasingly used to train neural networks, yet distributional mismatch with real data limits their effectiveness when u…
Improving Visual Representation Alignment Generation with GRPO
Recent diffusion transformers have demonstrated strong image synthesis capabilities but remain inefficient to train due to weak alignment b…
Critic-R: Improving Agentic Search using Instruction-tuned Retrievers with Natural Language Introspective Feedback
Agentic search systems iteratively interact with retrieval models to answer complex queries. Despite substantial progress, optimizing retri…
SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering
Large language models are increasingly deployed as tool-augmented agents to acquire information beyond parametric knowledge. While recent w…
CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts
Reinforcement learning (RL) with verifiable rewards has achieved strong progress in reasoning-oriented LLMs, but extending it to multi-doma…
MemGraphRAG: Memory-based Multi-Agent System for Graph Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) has become an essential method for mitigating hallucinations in Large Language Models (LLMs) by levera…
Linguistics-Aware Non-Distortionary LLM Watermarking
Watermarking should identify language-model output without degrading quality or limiting verification to the model provider. Multilingual d…
Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion
Recent Vision-Language Models (VLMs) struggle with grounded reasoning, temporal consistency, and context-aware planning in videos. We intro…
MemPro: Agentic Memory Systems as Evolvable Programs
Long-horizon autonomous agents require memory systems to retain historical information, track evolving states, and reuse relevant knowledge…
Authenticity Debt and the Synthetic Content Threat Landscape: A Layered Framework for Trust, Provenance, and IP Governance in the Generative AI Era
Generative artificial intelligence has fundamentally changed how content is produced. It has enabled how high-fidelity text, images, au…
LP5X-PIM Sim: A High-Fidelity HW/SW Integrated Simulator for LPDDR5X-PIM
This tech note describes the architecture and execution results of the LPDDR5X-PIM simulator, developed by Samsung Electronics. Based on th…
LinguIUTics at PsyDefDetect: Iterative Imbalance-Aware Fine-tuning of Qwen3-8B for Psychological Defense Mechanism Classification
Detecting psychological defense mechanisms in conversational text remains a challenging clinical NLP problem. For the PsyDefDetect 2026 sha…
MESA: Improving MoE Safety Alignment via Decentralized Expertise
Mixture-of-Experts (MoE) architectures scale Large Language Models (LLMs) efficiently, enabling greater capacity with reduced computational…
Scaling Behavior of Single LLM-Driven Multi-Agent Systems
The burgeoning field of LLM-based Multi-Agent Systems (MAS) promises to tackle complex tasks through collaborative intelligence, yet fundam…
Demystifying the Optimal Fair Classifier in Multi-Class Classification
Ensuring fair and equitable treatment across diverse groups, particularly in multi-class classification tasks, poses a significant challeng…
Collaborative Few-Step Distillation and Low-Bit Quantization for Wan2.2 Dual-Expert Video Diffusion Models
Large video diffusion models achieve strong visual quality but remain expensive to deploy because each sample requires many denoising steps…
Beyond the Mouth: Upper-Face Affective Cues in Audiovisual Sentence Recognition under Acoustic Uncertainty
Face-to-face speech comprehension is inherently multimodal, integrating acoustic signals with visible articulation, facial expression, head…
The Paradox of Outcome Optimization: A Causal Information-Theoretic Bound on Reasoning Shortcuts in LLMs
Large Language Models (LLMs) aligned via outcome-based Reinforcement Learning (RL) frequently exhibit a critical failure mode: they achieve…
COPF: An Online Framework for Deployment-Stable Counterfactual Fairness in Evolving Graphs
Online link recommendation on evolving graphs is performative: by choosing which candidate links to show users, the system changes which li…
Shape Your Body: Value Gradients for Multi-Embodiment Robot Design
We propose to turn generalist multi-embodiment value functions into reusable models for robot design. Instead of running a new reinforcemen…
Information-Theoretic Lower Bounds for Bit-Constrained Stochastic Optimization via a Reduction to Compressed Gaussian Mean Estimation
Low-precision pretraining (FP8, MXFP4, NVFP4) is now standard for frontier language models, yet the literature is almost entirely achievabi…
Multi-Agent Conformal Prediction with Personalized Statistical Validity
Uncertainty quantification is essential in high-stakes machine learning tasks. However, one of the principled solutions, conformal predicti…
EPIC: Efficient and Parallel Inference under CFG Constraints for Diffusion Language Models
Controlling language model outputs is essential for ensuring structural validity, reliability, and downstream usability, and diffusion lang…
WaveFilter: Enhancing the Long-Context Capability of Diffusion LLMs via Wavelet-Guided KV Cache Filtering
Diffusion Large Language Models (DLMs) have demonstrated significant advantages across various tasks. However, constrained by their multi-s…
SORA: Free Second-Order Attacks in Fast Adversarial Training
Adversarial Training (AT) is a leading defense against adversarial examples but often suffers from Catastrophic Overfitting (CO) in efficie…
Quantum Tunneling-Aware Machine Learning: Physics-Derived Noise Models for Robust Deployment
Transistor scaling is approaching a quantum-mechanical limit, as thin gate oxides induce electron leakage through quantum tunneling. Unlike…
SkyShield: Occupancy as a Safety Interface for Low-Altitude UAV Autonomy
For low-altitude Unmanned Aerial Vehicle (UAV) autonomy, 3D spatial understanding is not merely a perception objective, but the safety inte…
Causal Density Functions
We introduce causal density functions: Radon-Nikodym derivatives that compare interventional laws to observational laws and therefore act a…
Logit Distillation on Manifolds: Mapping by Learning
A simple way to improve the performance of almost any machine learning model is to train not a single model but several models with diverse algor…
GIRL-DETR: Gradient-Isolated Reinforcement Learning for Video Moment Retrieval
The Video Moment Retrieval (VMR) task requires accurately localizing temporal boundaries aligned with natural language queries, but many models…
Behavior-Invariant Task Representation Learning with Transformer-based World Models for Offline Meta-Reinforcement Learning
Offline meta-reinforcement learning leverages static datasets to enable agents to generalize to unseen environments by combining offline ef…
Bayesian Inference of Nonlinear Malaria Dynamics in Ghana via an Ensemble Markov Chain Monte Carlo Sampler
Reliable quantification of malaria dynamics in sub-Saharan Africa is hindered by short, noisy, and spatially heterogeneous surveillance rec…
Extending Causal Metamodeling to a non-Markovian Queue
Metamodels for discrete-event simulations approximate the behavior of simulation models without running expensive simulations. Prior work i…
DASH: Dual-Branch Score Distillation for Guidance-Calibrated Compact Diffusion Models
Parameter compression of class-conditional diffusion models reveals an underexplored limitation in output-level distillation: the unconditi…
Dynamic Coordination Strategy Selection for Enterprise Multi-Agent Systems
Enterprise multi-agent systems increasingly expose multiple coordination patterns, but deployments often lack evidence for when to use cons…
Certificates without Electrons? Theory and Evidence on Impacts from AI-Driven Power Demand
Data centers now account for 4.4% of United States electricity demand, yet the grid-level effectiveness of the renewable energy certificate…
SkillPager: Query-Adaptive Intra-Skill Navigation via Semantic Node Retrieval
Skill-based LLM agents increasingly rely on long procedural documents, but full-document prompting wastes tokens and dilutes information cr…
Beyond Independent Manipulation: Individual Fairness-aware Strategic Classification with Peer Imitation
Strategic classification (SC) investigates scenarios where agents manipulate their features to obtain favorable decisions from predictive m…
Hybrid Probabilistic Forecasting of Under-Five Malaria Admissions in Ghana: A Gaussian Process Regression with Holt-Winters Smoothing
Accurate malaria forecasting remains a major challenge in sub-Saharan Africa, where strong seasonality, reporting uncertainty, and non-stat…
MoEIoU: Rethinking Bounding-Box Regression as a Mixture of Experts
Bounding-box regression is a fundamental component of object detection, playing a critical role in precise object localization. Existing In…
RefDiffNet: Learning to Expose Subtle PCB Defects Before Detection
Printed circuit board (PCB) defect detection is challenging because many defects are small and difficult to distinguish from complex backgr…
From Cues to Horizons: Dynamic Risk Horizon Profiling for Trajectory Prediction
Accurate and reliable vehicle trajectory prediction is essential for safe autonomous driving. Recent studies have incorporated safety risk…
GenPT: Beyond Self-Report for Reliable LLM Psychometrics via Generative Projective Testing
Self-report questionnaires remain the prevailing tool for probing the psychological states of persona-conditioned agents (PC-Agents). Howev…
Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated
Vision-language models (VLMs) are increasingly used to generate structured descriptions of street-level imagery for tasks such as streetsca…
Task diversity produces systematic transfer but inhibits continual reinforcement learning
Continual reinforcement learning aims to produce agents that learn not only to improve at their current tasks but also to adapt as task dis…
Dive into Waves: Morlet Spectral Transformer for Cross-Subject Emotion Decoding from EEG
We study cross-subject emotion recognition from EEG, a practically important yet challenging problem in brain-computer interfaces. Unlike t…
Memory-Efficient LLM Training with Dynamic Sparsity: From Stability to Practical Scaling
Dynamic Sparse Training (DST) offers a promising paradigm for improving the training and inference efficiency of deep neural networks; howe…
MLLM-Microscope: Unlocking Hidden Structure Within Multimodal Large Language Models
This work presents MLLM-Microscope, a novel system designed for analyzing the hidden representations within Multimodal Large Language Model…
Accuracy, Stability, and Repeated-Run Reliability of Large Language Models on Deterministic Programming Tasks
Run-level pass rate overstates retry-free coverage by up to 17.8 percentage points -- and the gap is largest precisely for mid-performing s…
Benchmarking Security Risk Detection and Verification in Open Agentic Skill Ecosystems
Open agent platforms allow community contributors to publish reusable skills that agents can invoke at runtime. This extensibility also cre…
Detection vs. Execution: Single-Bucket Probes Miss Half the Mamba-2 State Sink
Mechanistic interpretability often assumes that probes identifying a representational signature also identify the circuit executing the cor…
CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences
Instruction-guided image editing is becoming a general interface for visual work, yet existing benchmarks still focus largely on narrow app…
Lodestar: An Online-Learning LLM Inference Router
Efficiently serving large language model (LLM) inference tasks is crucial both for user-perceived latency such as time-to-first-token (TTFT…
Silent Failures in Federated Personalization of Foundation Models
Foundation models are increasingly personalized on decentralized private data through federated learning and are now deployed at scale unde…
Explainable deep reinforcement learning reveals energy-efficient control strategies for turbulent drag reduction
We propose a method combining Multi-Agent Deep Reinforcement Learning (MARL) and eXplainable Deep Learning (XDL) to reduce drag in wall-bou…
SS-ZKR: Spatial-Semantic Zero-Knowledge Routing for Privacy-Preserving Multi-Agent Collaboration
Foundational agent interoperability standards, notably the Agent-to-Agent (A2A) protocol and the Model Context Protocol (MCP), have advance…
An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation
Large Vision-Language Models (LVLMs) have shown strong visual understanding and language-guided grounding abilities, yet their capacity for…
Beyond Task-Agnostic: Task-Aware Grouping for Communication-Efficient Multi-Task MoE Inference
Sparsely activated Mixture-of-Experts (MoE) models scale capacity via conditional computation, but distributed inference suffers from cross…
FVSpec: Real-World Property-Based Tests as Lean Challenges
We present a benchmark for evaluating AI models and agents on real-world formal software verification tasks. We first scrape 11,039 propert…
Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing
We address text-based 3D human motion editing, where the goal is to preserve the style and structure of a source motion while applying edit…
AI-IoT-Robotics Integration: Survey of Frameworks, Emerging Trends, and the Path Toward Connected Robotics
The convergence of Artificial Intelligence, the Internet of Things, and Robotics is no longer a futuristic vision; it is rapidly becoming t…
PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects
While End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are rapidly evolving, their evaluation methodologies remain limited to th…
Hybrid Verified Decoding: Learning to Allocate Verification in Speculative Decoding
Large Language Model (LLM) generation remains expensive because autoregressive decoding calls the model once for each new token. Speculativ…
ProductWebGen: Benchmarking Multimodal Product Webpage Generation
Crafting a product display webpage from a source product image, along with layout and visual content instructions, holds significant practi…
Data Collection for Training Quality-Control AI in Carpet Manufacturing
Visual inspection remains the dominant quality-control practice in woven and tufted carpet production, yet it is slow, subjective, and inco…
DSL-LLaDA: Scaling Continuous Denoising to 8B Masked Diffusion LMs
Discrete masked diffusion language models generate text by iterative parallel decoding, but few-step decoding suffers from a tradeoff betwe…
Temporally-Aligned Evaluation for Audio-Driven Talking Head Generation
Audio-driven talking-head generation has advanced rapidly, yet existing evaluation protocols mainly rely on frame-wise metrics that assume…
OPD+: Rethinking the Advantage Design for On-Policy Distillation
On-policy distillation (OPD) is a widely used technique to transfer capabilities from capable teacher language models to the base student m…
Plausibility Is Not Prediction: Contrastive Evidence for LLM-Based Cellular Perturbation Reasoning
Perturbation experiments are central to understanding cellular mechanisms, but remain costly and sparse, motivating prediction of gene expr…
3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code
Procedural 3D modeling through code is emerging as a versatile paradigm, offering deterministic, engine-ready, and precisely editable asset…
MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models
Preference alignment has substantially improved the observable behavior of large language models, yet it remains unclear what alignment cha…
Leyline: KV Cache Directives for Agentic Inference
Modern KV cache management assumes the chatbot workload: prompts arrive once and the cache grows append-only, so prefix caching and forward…
Test-Time Training for Zero-Resource Dense Retrieval Reranking
Dense retrievers excel at first-stage candidate generation but lack effective reranking in zero-resource settings. Existing approaches face…
ThinkSwitch: Context Distillation with LoRA and Weight Interpolation for Specific-Purpose Reasoning Tasks
Large language models often improve on difficult tasks by spending inference-time compute on a reasoning trace before producing the final a…
MViewRouter: Internalizing Geometric Equivariance via Multi-view Alternating Attention for Combinatorial Routing
Combinatorial routing problems such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP) are fundamen…
Strong Stochastic Flow Maps
Flow and diffusion models generate high-quality samples in many modalities; however, many network evaluations are required during inference…
A Fiber Criterion for Representation Identifiability in Supervised Learning
Supervised learning evaluates predictors through their input-output behavior. When a predictor is implemented as a composition $f=c\circ h$…
Beyond Task Success: Behavioral and Representational Diagnostics for WAM and VLA
Vision-language-action (VLA) policies and World-Action Models (WAM) represent two increasingly important paradigms for robotic manipulation…
Implicit Drifting Policy: One-Step Action Generation via Conditional Expert Geometry
Generative action policies based on diffusion or flow matching excel in behavior cloning, yet their iterative sampling is prohibitive for h…
MiCU: End-to-End Smart Home Command Understanding with Large Language Model
Command understanding systems in smart home ecosystems can automate device control and substantially improve user experience. However, whil…
Soft-NBCE: Entropy-Weighted Chunk Fusion for Long-Context
The quadratic complexity of self-attention remains a bottleneck for Large Language Models (LLMs) processing ultra-long contexts. The Naive…
HASTE: Hardware-Aware Dynamic Sparse Training for Large Output Spaces
Extreme multi-label classification (XMC) involves learning models over large output spaces with millions of labels, making the output layer…
STARFISH: faST Accuracy Recovery in pruned networks From Internal State Healing
Pruning is a process designed to reduce the number of weights in a large neural network. This can substantially speed up inference but migh…
AMP: A Vendor-Neutral Wire Format for Agent Memory Operations
Agent-memory frameworks - mem0, Letta/MemGPT, Cognee, Zep/Graphiti, MemoryOS, MemTensor - each ship their own SDK, storage layout, and oper…
ASE-26: a curriculum for agentic software engineering as a discipline
The work of a professional software engineer has begun to consist, increasingly, of directing agents rather than writing code, and the empi…
When Data Is Scarce: Scaling Sparse Language Models with Repeated Training
Scaling laws for dense LLMs under infinite data are well explored, but how sparsity interacts with limited data is not. In this work, we st…
AI From the Margins (AIM): Rethinking Participatory AI Design Through the Lived Experience of Minoritized Communities
Artificial intelligence (AI) can reproduce and amplify the structural inequities faced by minoritized communities. Participatory AI has bee…
Physics-Informed Deep Learning for Entropy Prediction in Heterogeneous Systems: Thermodynamic and Information-Theoretic Case Studies
Entropy production governs irreversibility and uncertainty in both physical and information-theoretic systems. While Physics-Informed Neura…
CA-BED: Conversation-Aware Bayesian Experimental Design
Large Language Models (LLMs) excel at static reasoning tasks, yet their performance often degrades in interactive scenarios where informati…
Topological Ignorability for Structural Causal Effects Beyond Means
Many interventions alter the structure of an outcome distribution rather than its mean: they can split a population into disconnected regim…
pcbGPT: Automatic PCB Schematic Synthesis from Natural Language Requirements
Translating natural-language hardware requirements into correct printed circuit board (PCB) schematics remains difficult in embedded, IoT,…
Low-Resource Safety Failures Are Action Failures, Not Representation Failures
Safety alignment learned in high-resource languages transfers poorly to low-resource languages. Models refuse harmful prompts in English bu…
Implicit Geographic Inference in LLM Medical Triage: Language-Driven Disparities in Emergency Recommendations
We investigate whether large language models produce different medical triage recommendations for identical symptoms based solely on the la…
DiscourseFlip: An Oblique Discourse-Level Opinion Manipulation Attack against Black-box Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) systems are widely deployed and increasingly influential, but their reliance on external corpora expos…
TECCI: Tricky Edits of Collected and Curated Images
Despite tremendous recent progress, current text-guided image editing methods still struggle with many aspects of editing involving instruc…
Distilling Neuro-Symbolic Programs into 3D Multi-modal LLMs
Current 3D spatial reasoning methods face a fundamental trade-off: neuro-symbolic 3D (NS3D) concept learners achieve interpretable reasonin…
Fine-Tuning Diffusion Models for Molecular Generation via Reinforcement Learning and Fast Sampling
Generating molecules that simultaneously satisfy drug-like properties and conform to the 3D structure of a target protein is a core challen…
Hybrid Imbalanced Regression Through Unified Data-Level and Algorithm-Level Balancing
Imbalanced learning is a critical challenge in machine learning, where underrepresented target values can bias models and degrade predictio…
Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue
Despite substantial progress in long-context modeling, existing benchmarks remain confined to factual memory for explicit recall, failing t…
Understanding LLM Behavior in Multi-Target Cross-Lingual Summarization
Multi-target cross-lingual text summarization (MTXLS), which summarizes a source document into multiple target languages, is increasingly i…
IndoBias: A Dual Track Culturally Grounded Benchmark for LLMs Bias Evaluation in Indonesian Languages
Despite being home to more than 1300 ethnic groups and 700 indigenous languages, bias in Large Language Models has not been fully studied i…
PALTO: Physics-Informed Active Learning for Tri-Gate FinFET Design Optimization for Vertical Power Delivery
This paper demonstrates the effectiveness of machine learning-driven optimization for designing application-specific GaN tri-gate FinFETs i…
DeepIPCv3: Event-Aware Multi-Modal Sensor Fusion for Sudden Pedestrian Crossing Avoidance
Current end-to-end autonomous driving systems predominantly rely on frame-based sensors, which suffer from inherent perception latency and…
RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning
Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large…
Knowledge-Intensive Video Generation
Text-to-video generation has advanced rapidly in visual quality, but remains under-evaluated for factuality and practical usefulness. We in…
BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution
The rapid progress of frontier large language models has led to widespread benchmark saturation, limiting the ability of existing datasets…
Beyond Visual Memory: Mechanistic Diagnostics of Latent Visual Reasoning
Recent latent visual reasoning methods achieve substantial gains by inserting continuous latent tokens into multimodal language models. The…
Quantum Algorithm for Distributed Reduction of Entanglements (QADR): A Trainable and Simulation-Efficient QML Framework
Training Variational Quantum Circuits (VQCs) under Noisy Intermediate-Scale Quantum (NISQ) constraints introduces severe computational limi…
What Makes a Strong Model? A Unified Spectral Analysis of Knowledge Transfer over High-dimensional Linear Regression
Teacher-Student Knowledge Transfer (KT) is ubiquitous in modern machine learning, ranging from classical model compression via Knowledge Di…
ResNet-34 with Lightweight Decoder for Accurate and Efficient Segmentation of Fetal Brain MRI
Accurate segmentation of fetal brain tissues in Magnetic Resonance Imaging (MRI) is critical for early diagnosis of congenital abnormalitie…
ChronosAD: Leveraging Time Series Foundation Models for Accurate Anomaly Detection
Time series anomaly detection is a crucial task in various domains, including finance, healthcare, and industry. However, existing methods…
SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories
Large language model (LLM) agents increasingly rely on reusable external skills to solve long-horizon interactive tasks. Existing training-…
A Communication-Centric 6G-LLM Architecture for Scalable Tactical Autonomous Defense Vehicle Networks
The integration of Artificial Intelligence (AI) and emerging 6G networks introduces new opportunities for scalable coordination in tactical…
PSG-Nav: Probabilistic Scene Graph Navigation via Multiverse Decision Making
Open-vocabulary navigation requires embodied agents to manage significant perception uncertainty stemming from semantic ambiguity and model…
TukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages
Safety evaluation of Large Language Models (LLMs) remains heavily English-centric, leaving Low-Resource Languages (LRLs), particularly Afri…
DiffuSent: Towards a Unified Diffusion Framework for Aspect-Based Sentiment Analysis
Aspect-Based Sentiment Analysis (ABSA) encompasses seven distinct subtasks, each focusing on different extracted elements. Despite the prov…
Digital Twin-Assisted Adaptive Multi-Agent DRL for Intelligent Spectrum and Resource Management in Open-RAN UAV-Enabled 6G Networks
The evolution toward 6G wireless networks envisions a seamlessly intelligent, Open-RAN-enabled architecture where unmanned aerial vehicles…
FreqLite: A Lightweight Frequency-Decomposed Linear Model with Adaptive Reversible Normalization for Robust Long-Term Time-Series Forecasting
Long-term time-series forecasting needs models that are accurate yet efficient enough for commodity hardware. Lightweight linear forecaster…
Needles at Scale: LLM-Assisted Target Selection for Windows Vulnerability Research
The attack surface of a modern operating system is a haystack: thousands of signed binaries and millions of functions, almost none relevant…
BRo-JEPA: Learning Modular Arithmetic in Latent Space
Can neural networks learn abstract algebraic rules, or do they merely memorize training patterns? We investigate this using MNIST digits as…
Beyond Access: Guided LLM Scaffolding for Independent Learning in Undergraduate Statistics
Large language models (LLMs) are increasingly entering students' learning practices, but their educational value depends on whether they su…
Efficient Exploration for Iterative Nash Preference Optimization
Preference alignment is central to improving large language models, but standard reward-based formulations can be restrictive when human pr…
Bridging Requirements and Architecture: Multi-Agent Orchestration with External Knowledge and Hierarchical Memory
Software architecture design is a critical yet inherently complex and knowledge-intensive phase that requires balancing competing quality a…
Dr. DocBench: A Comprehensive Benchmark for Expert-Level and Difficult Document Parsing
Document parsing and recognition are fundamental capabilities for vision-language models (VLMs) and document processing systems. However, e…
Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs
Evaluating large language models (LLMs) across comprehensive benchmarks is expensive and time-consuming. We propose a graph-based prompt se…
Neural Network Compression by Approximate Differential Equivalence
Neural network compression is commonly achieved by pruning parameters based on local importance scores, e.g., magnitude-based pruning. We p…
CEAR: Certified Ensemble Adversarial Robustness in DNNs
Deep Neural Networks (DNNs) are highly susceptible to adversarial perturbations, leading to extensive research on robustness for safety-cri…
On the Evaluation of Spiking Neural Network Configurations for Network Intrusion Detection
Network intrusion detection is a core component of modern cybersecurity infrastructure, yet the deep learning models that dominate the fiel…
UR-JEPA: Uniform Rectifiability as a Regularizer for Joint-Embedding Predictive Architectures
A central difficulty in training Joint-Embedding Predictive Architectures (JEPAs) is preventing representation collapse. LeJEPA addresses t…
Computation-Aware Kalman Filtering with Model Selection for Neural Dynamics
Due to their explicit priors and ability to model uncertainty, Bayesian methods have played a major role in dynamical latent variable model…
Emergent Transfer of a Physics Foundation Model from Simulation to Laboratory Turbulence
Whether physics foundation models can be usefully deployed on laboratory experiments remains an open question for scientific machine learni…
Hierarchical Online Prompt Mutation with Dual-Loop Feedback for Guardrailed Evidence Document Generation: A Production-Evaluation Case Study
High-stakes production document-generation systems require language models to be adaptive, evidence-grounded, and auditable. We present HOP…
Crazyflow: An Accurate, GPU-Accelerated, Differentiable Drone Simulator in JAX
High-quality, large-scale synthetic data from simulations is becoming a cornerstone for pushing the capabilities of robot algorithms. While…
MURMUR: An Efficient Inference System for Long-Form ASR
Long-form automatic speech recognition (ASR) requires both high accuracy and low latency, but existing systems force a trade-off between th…
LLM Consortium for Software Design Refinement: A Controlled Experiment on Multi-Agent Collaboration Topologies
We present a controlled experiment evaluating 12 multi-agent LLM collaboration topologies for software architecture design. Using a $2\time…
ClawHub Security Signals: When VirusTotal, Static Analysis, and SkillSpector Disagree
Agent skills extend AI agents with reusable instructions, tools, scripts, references, and workflows, establishing a security boundary disti…
TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning
Time series data inform critical decisions across many real-world domains. While large language model (LLM) agents can analyze data through…
Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics
Frontier LLMs increasingly decide what a query attends to with a sparse-attention indexer that picks a few KV-cache blocks per query: atten…
On the Limits of Token Reduction for Efficient Unified Vision Language Training
Unified vision-language models (VLMs) integrate visual understanding and visual generation within a single autoregressive backbone, but the…
Agent Operating Systems (AOS): Integrating Agentic Control Planes into, and Beyond, Traditional Operating Systems
Traditional operating systems were designed around deterministic programs, explicit control flow, and human-initiated workflows. Their core…
ProbMoE: Differentiable Probabilistic Routing for Mixture-of-Experts
Mixture-of-Experts (MoE) models scale by activating only a small subset of experts per token. However, training such models remains challen…
Compliance-Scored Best-of-N Guardrail Orchestration for Multimodal Document Generation in Payments Dispute Defense
High-stakes enterprise document generation, including financial dispute narratives, compliance notices, and audit summaries, demands schema…
TN-SHAP-G: Graph-Structured Tensor Network Surrogates for Shapley Values and Interactions
Shapley values are a widely used tool for attributing importance and interactions among input variables in black-box models, but their comp…
Self-Conditioned Positional HNSW for Overlap-Aware Retrieval in Chunked-Document RAG Systems: Method and Industrial Evidence-Quality Audit
Chunked-document retrieval is a common component of retrieval-augmented generation (RAG) systems. Documents are split into overlapping chun…
GJDNet: Robust Graph Neural Networks via Joint Disentangled Learning Against Adversarial Attacks
Graph Neural Networks (GNNs) are vulnerable to adversarial attacks, which inherently invert connectivity patterns by introducing disassorta…
Defenses & Enablers For Skill Injection Attacks on Terminal Based Agents
Large language model (LLM) agents increasingly rely on reusable skills, i.e., documents describing task-specific procedures. However, this in…
Identifying High-Confidence Social Biases in LLMs for Trustworthy Conversational Tutoring Agents
Conversational tutoring agents have been shown to improve learning engagement and student outcomes, and large language models (LLMs) are in…
Estimating Mutual Information between Time Series and Temporal Event Sequences Across Diverse Analysis Tasks
Pairwise dependence measures such as correlation and causality are fundamental to temporal data mining, yet there is still no principled an…
FedMTFI: Feature Importance Based Optimized Multi Teacher Knowledge Distillation in Heterogeneous Federated Learning Environment
Federated learning (FL) is a decentralized approach that enables collaborative model training without exposing raw data. Instead of transfe…
TechGraphRAG: An Agentic Graph-Augmented RAG Framework for Technical Literature Reasoning
This paper presents an agentic retrieval-augmented generation (RAG) framework for domain-specific technical reasoning support, instantiated…
EvoPool: Evolutionary Programmatic Annotation for Label-Efficient Specialized Supervision
Large language models excel at general tasks but underperform smaller supervised models in specialized, high-stakes domains where training…
Demystifying Multimodal Biomolecular Co-design With Intrinsic Geodesic Coupling
Biomolecules such as proteins and small-molecule ligands play a central role in biological systems, arising from the tight interplay betwee…
A Framework for Graph-Conditioned Hierarchical Shapley Attribution in Patent Valuation
Estimating the economic contribution of a single patent inside a product that embodies tens of thousands of patents is a long-standing unso…
E4GEN: Event-level Explainable Extreme-Enhanced Time-series Generation
Generating realistic time series is essential for scientific research and real-world applications. However, existing methods often emphasiz…
AlphaToken: Decoupling Adaptation and Stability for Path-Aware Response Token Valuation in LLM Post-Training
Token selection is pivotal for effective LLM post-training. However, existing methods mostly rely on local heuristics and rarely formulate…
Easier to Mislead Than to Correct: Harmful and Beneficial Revision in LLM Conformity
Large language models are increasingly used in multi-agent systems, where they see and respond to other agents' answers. A key risk is conf…
MINTS: Minimalist Thompson Sampling
The Bayesian paradigm offers principled tools for sequential decision-making under uncertainty, but its reliance on a probabilistic model f…
DOT-MoE: Differentiable Optimal Transport for MoEfication
The scaling of Large Language Models (LLMs) has driven significant performance gains but created substantial challenges in inference effici…
Time-Aware Diffusion based on Preference Disentanglement for Generative Recommendation
Recently, Generative Recommenders (GRs) have emerged as a transformative recommendation paradigm by replacing traditional item IDs with sem…
Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning
Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but fails when t…
HAIM: Human-AI Music Datasets for AI Music Production Tracking Benchmark
As generative platforms such as Suno and Udio reach human-grade audio quality, the scope of AI's utility has expanded across the entire mus…
RPCASSM: Robust PCA State Space Model For Infrared Small Target Detection
The detection and segmentation of infrared small targets have important application significance in the fields of surveillance and security…
Understanding Identity Continuity in Thermal Video through Scene-Level Consistency
Thermal pedestrian MOT remains challenging because weak appearance cues and frequent detection interruptions cause severe trajectory fragme…
JenBridge: Adaptive Long-Form Video Soundtracking across Scene Transitions
We address the challenge of generating high-fidelity, long-form soundtracks that remain coherent across scene transitions. Existing AI musi…
Two-Fidelity Best-Action Identification for Stochastic Minimax Tree
We study fixed-confidence best-action identification (BAI) in stochastic minimax trees. This problem is increasingly relevant in modern AI…
Fair Finetuning Mitigates Distribution Inference Attacks
Machine learning models trained on sensitive data can inadvertently leak population-level information about their training distributions --…
Post-Deterministic Distributed Systems: A New Foundation for Trustworthy Autonomous Infrastructure
For decades, distributed systems have typically assumed that correct participants execute protocol-specified behavior with stable, external…
Shortcut to Nowhere: Demystifying Deep Spurious Regression
Real-world regression often exhibits shortcuts: attributes that are spuriously correlated with continuous targets in training, yet unreliab…
Argument Collapse: LLMs Flatten Long-Form Public Debate
As LLMs are increasingly used to draft public-facing arguments, they may flatten public debate by repeatedly introducing the same polished,…
THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models
Multi-turn jailbreak attacks pose a growing threat to LLMs by exploiting conversational dynamics such as gradual escalation and cross-turn…
SECUREVENT: Hybrid AI/ML Security Monitoring for Distributed Event-Based Systems
Distributed event-based systems have become a common substrate for Internet-scale publish/subscribe services, IoT telemetry, cloud-native m…
Construction of Historical Knowledge Graphs Based on BERT and Graph Neural Networks
Through digital humanities research and scale-up historical data analysis, a significant amount of traditional historical text is converted…
Adaptive Auto-Harness: Sustained Self-Improvement for Agentic System Deployment on Open-Ended Task Streams
Auto-harness systems such as A-Evolve, GEPA, and Meta-Harness improve LLM agents by optimizing prompts, skills, tools, memories, and suppor…
FLARE: Diffusion for Hybrid Language Model
Autoregressive (AR) large language models (LLMs) have achieved broad practical success, but sequential decoding remains a key bottleneck fo…
Breaking the Information Silo: Semantic Personas for Cross-Domain Recommendation
Digital platforms increasingly operate as isolated information silos, limiting their ability to construct comprehensive user representation…
STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models
Vision-language-model-based graphical user interface (GUI) agents have shown broad automation capabilities, yet deployment is bottlenecked…
Multilinguality of Large Language Models From a Structural Perspective
Large language models (LLMs) have excelled in processing multiple languages through pre- and post-training on multilingual data, even thoug…
MOSS-Audio Technical Report
MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aw…
ProbeScale: Probing Analysis to Optimize Neural Scaling Laws for Efficient Small Language Model Inference
Small Language Models (SLMs) offer a balance between capability and computational feasibility. Neural scaling laws inform their optimal tra…
"I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise
Measuring the diversity of creative outputs is central to evaluating post-training mode collapse, comparing decoding strategies, and quanti…
Dynamic Trust-Aware Sparse Communication Topology for LLM-Based Multi-Agent Consensus
Large language model-driven multi-agent systems enhance the reliability of complex reasoning tasks through multi-round deliberation, role s…
Learning Implicit Bias in Generative Spaces for Accelerating Protein Dynamics Emulation
Generative emulators of protein dynamics produce plausible trajectories at a fraction of the cost of molecular dynamics, but they inherit t…
Physics-Guided Attention in a Lightweight TCN for Efficient WiFi CSI-Based Human Activity Recognition
Human Action Recognition (HAR) using WiFi Channel State Information (CSI) has gained increasing attention due to its non-contact, low-cost,…
LayerRoute: Input-Conditioned Adaptive Layer Skipping via LoRA Fine-Tuning for Agentic Language Models
Agentic language model systems alternate between two structurally distinct step types: structured tool calls (short, deterministic, low per…
Suppressing Forgery-Specific Shortcuts for Generalizable Deepfake Detection
Deepfake detection suffers from poor generalization across forgery methods, as existing models tend to rely on spurious method-specific sho…
Unveiling the Limits of Large Language Models in Inferring Pragmatic Meaning from Non-Verbal Responses
Although large language models (LLMs) have shown considerable progress in pragmatic language understanding, prior research has focused main…
Boosting Multimodal Federated Learning via Chained Modality Optimization
Multimodal Federated Learning (MMFL) enables privacy-preserving collaborative learning across decentralized clients with heterogeneous data…
RadioMaster: Multi-Agent System for Autonomous Radio Signal Generation
Translating user intents into physical radio signals represents the critical yet notoriously tedious final step in wireless prototyping, as…
Collaborative Space Object Detection with Multi-Satellite Viewpoints in LEO Constellations
With the growing number of satellites in low Earth orbit (LEO) constellations, the near-Earth space environment has become increasingly con…
Train, Test, Re-evaluate: Schedule-Sensitive Evaluation of Generative Data for Hand Detection
Generated (or synthetic) image data is increasingly used to augment or replace real training datasets when target imagery is scarce, expens…
RA-LWLM: Retrieval-Augmented In-Context Localization with Wireless Foundation Models
Wireless localization is a fundamental capability of sixth-generation (6G) networks. Conventional model-based methods require accurate mode…
The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue
We introduce the Image Reconstruction Game, a fully automated benchmark in which a vision-language model issues corrective instructions to…
KliniskVestBERT: BERT Model Specialised to Norwegian Clinical Texts
The increasing application of Natural Language Processing (NLP) in healthcare demands language models specifically attuned to the complexit…
Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space
We present Echo, a proof-of-concept audio system built around a single 25 M-parameter ViT encoder. The encoder is pretrained with a JEPA ob…
Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks
Research and applications in artificial intelligence have recently shifted with the rise of large pretrained models, which deliver state-of…
Rank-Constrained Deep Matrix Completion for Group Recommendation
The growing popularity of group activities has increased the need for methods that provide recommendations to groups of users given their i…
A Structured Benchmark for Text-Guided Anomaly Detection: When Language Stops Conditioning the Decision
Industrial anomaly detection has historically been a unimodal task. Recent multimodal vision-language models have produced systems that adm…
MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?
Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is ofte…
Why Do Time Series Models Need Long Context Windows?
Modern deep learning models for forecasting groups of time series rely on increasingly longer observation windows. However, the benefit of…
Towards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization
Diffusion models have shown remarkable success in video generation. However, whether such models are truly aware of the 3D structure underl…
PlanarBench: Evaluating LLM Spatial Reasoning via Planar Graph Drawing
PlanarBench tests whether LLMs can draw planar graphs as ASCII art given only an edge list -- a spatial reasoning task that resists memoriz…
Ranking vs. Assignment: The Metric Mismatch in Multi-View Object Association
Multi-view object association is an important computer vision problem that underlies many multi-camera perception tasks. While this task is…
OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents
Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websi…
Attention mechanisms and transfer learning for robust peach leaf damage classification under domain shift
Artificial intelligence provides a practical framework for crop damage assessment from imagery data, supporting early decision-making in ag…
Fast and Lightweight Novel View Synthesis with Differentiable Multiplane Image
Recently, novel view synthesis has witnessed remarkable progress, with mainstream methods such as Neural Radiance Fields (NeRF) and 3D Gaus…
Agentic-J: An AI Agent for Biological Microscopy Image Analysis
Biological image analysis increasingly demands integration across heterogeneous tools, programming environments, and domain knowledge that…
LALE: Lightweight-Transformer Architecture for Land-Cover Estimation
Semantic segmentation of remote sensing imagery requires models that capture both global context and local detail under tight computational…
The Role of Ambiguity in Error Prediction via Uncertainty Quantification
The task of Error Prediction, namely predicting whether a model output is correct, is commonly tackled with Uncertainty Quantification (UQ)…
Network Distributed Multi-Agent Reinforcement Learning for Consensus Control of Quadcopters
This paper proposes a Network Distributed Multi-Agent Reinforcement Learning (ND-MARL) framework for quadcopter consensus control. Compared…
Jailbreaking Multimodal Large Language Models using Multi-Clip Video
As multimodal large language models (MLLMs) have advanced to process video inputs, concerns have emerged about their potential for maliciou…
A Primer in Post-Training Reasoning Data: What We Know About How It Works
Post-training has become a primary driver of recent progress in large reasoning models, and reasoning data are often the key variable deter…
How Hard Can It Be? Hardness-Aware Multi-Objective Unlearning
Machine unlearning aims to remove the influence of specific forget training data due to privacy, copyright or bias concerns while maintaini…
Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection
In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To this end…
Variational Learning for Insertion-based Generation
Non-monotonic sequence generation methods, such as masked diffusion models, provide a flexible alternative to left-to-right autoregressive…
Rethinking Evaluation Paradigms in IBP-based Certified Training
Deep neural networks achieve strong performance on many supervised learning tasks but remain vulnerable to adversarial perturbations. Neura…
VLBM: Variational Latent Basis Modeling for OOD Robust Multivariate Time Series Forecasting
Out-of-distribution (OOD) events in multivariate time series forecasting are rare but often dominate real-world risk, making average case f…
Multilingual Idioms in Sentences and Conversations Across High-, Medium-, and Low-Resource Languages
Idiomatic expressions pose a major challenge for multilingual NLP because their meanings shift between figurative and literal usage, often…
Predicting the risk of colorectal anastomotic leak based on preoperative mapping of the blood supply of the bowel
Anastomotic leak remains one of the most serious complications following colorectal cancer surgery, substantially affecting patient outcome…
Multimodal Approaches for Visually-Rich Document Type Classification: A Comparative Analysis
Document type classification in visually rich documents remains challenging, as relevant information is distributed across textual, visual,…
Order within Chaos: Capturing Intrinsic Energy Anomalies for AI-Manipulated Image Forgery Localization
Recent advancements in generative AI have led to image editing models capable of producing realistic forgeries that evade traditional image…
On the Generalization in Topology Optimization via Sensitivity-Conditioned Bernoulli Flow Matching
Surrogate models for topology optimization (TO) exhibit highly variable out-of-distribution (OOD) generalization under distribution shifts…
Consistency Training while Mitigating Obfuscation via Rate Matching
Large language models are often influenced by extraneous input features, such as cues revealing a user's preferred answer. Consistency trai…
Faster Synchronous On-Policy RL via Straggler-Aware Group Sizing
Synchronous reinforcement learning methods such as Group Relative Policy Optimization (GRPO) provide stable and reproducible on-policy trai…
AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations
Indirect prompt injection in tool-use agents is a concrete production threat: LLM agents read from integrations (third-party services such…
Towards Resolving Optimization Conflicts Between Image- and Text-Based Person Re-Identification
The joint optimization of image-based (I2I) and text-based (T2I) person re-identification (ReID) is hindered by modality discrepancies and…
FW-NKF: Frequency-Weighted Neural Kalman Filters
Robust state estimation is central to robotic autonomy, yet classical Kalman filters struggle with frequency-dependent disturbances and mod…
Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025
Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, but papers often leave un…
Cross-modal linkage risk in clinical vision-language models
Vision-language models (VLMs) trained on paired chest radiographs and radiology reports learn a shared embedding space that can preserve in…
CityTrajBench: A Unified Benchmark for City-Scale Vehicle Trajectory Generation
Urban trajectory generation is a fundamental task for transportation simulation, urban planning, and mobility analytics. However, systemati…
Quantitative Movement Testing: Measuring Patient Movements from a Single Smartphone Video
Chronic pain diminishes quality of life by decreasing functional ability, yet objectively measuring this functional impact remains challeng…
SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents
Autonomous LLM agents increasingly operate in stateful environments where they access tools, files, memory, and external services. While su…
Repurposing Adversarial Perturbations for Continual Learning: From Defense to Active Alignment
In dynamic environments, large language models need to keep adapting to new tasks, but continual learning often suffers from forgetting, li…
Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains
Tool-augmented multimodal agents show strong benchmark gains, often taken as evidence that agents have learned to use tools. We argue that…
FOAM: Frequency and Operator Error-Based Adaptive Damping Method for Reducing Staleness-Oriented Error for Shampoo
Shampoo is attracting considerable attention for its superior performance on large-scale optimization benchmarks; yet it faces a significan…
When Do Attention Circuits Form? Developmental Trajectories of Capability and Attention-Sink Emergence Across Three 1B-Class Architectures
We track the developmental trajectory of attention-head circuit formation across three 1B-class language models spanning two architecture f…
SPADE-Bench: Evaluating Spontaneous Strategic Deception in Agents via Plan-Action Divergence
As LLM-based agents expand their operational scope, reliability becomes a prerequisite for real-world deployment. However, in practical app…
Policy and World Modeling Co-Training for Language Agents
Reinforcement learning (RL) improves large language model (LLM) agents by teaching them which actions lead to high rewards, but provides li…
AutoForest: Automatically Generating Forest Plots from Biomedical Studies with End-to-End Evidence Extraction and Synthesis
Systematic reviews rely on forest plots to synthesise quantitative evidence across biomedical studies, but generating them remains a fragme…
Evolutionary Discovery of Bivariate Bicycle Codes with LLM-Guided Search
Quantum LDPC code discovery requires searching large algebraic design spaces while reliably certifying the parameters and equivalence class…
GC-MoE: Genomics-Guided Cell-Type-Specific Mixture of Experts for Histology-Based Single-Cell Spatial Transcriptomics
Histology-based single-cell spatial transcriptomics (ST) estimation aims to predict gene expression for individual cells from histopatholog…
Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference
Large language models (LLMs) are increasingly integrated into high-performance computing (HPC) workflows, accelerating scientific discovery…
ODTQA-FoRe: An Open-Domain Tabular Question Answering Dataset for Future Data Forecasting and Reasoning
The rapid development of LLMs has significantly advanced tabular question answering, but most systems cannot perform future-oriented numeri…
PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning
Between the first visible sign of danger and the moment an accident occurs, there is often a window where intervention remains possible. Vi…
Initialization is Half the Battle: Generating Diverse Images from a Guidance Potential Posterior
Despite the remarkable fidelity of generative models, they frequently suffer from mode collapse. Existing strategies for enhancing diversit…
MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence
In 3D environments, Embodied Agents answer spatially relevant questions through reasoning from a mixture of modalities including natural la…
Learning When to Translate for Multilingual Reasoning
Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, but still exhibit substantial multilingual reasonin…
Ghost Tool Calls: Issue-Time Privacy for Speculative Agent Tools
Tool-augmented language agents speculatively issue likely future tool calls to hide latency, but those calls leak inferred user intent to e…
Monitoring Agentic Systems Before They're Reliable
Agentic systems entering production typically operate as partially integrated assemblies where structural defects, not task-level errors, d…
Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events
Video multimodal large language models (MLLMs) have made rapid progress on general and long-form video understanding, yet their ability to…
Why Not Hyperparameter-Friendly Optimisation? A Monotonic Adaptive Norm Rescaling Approach For Long-Tailed Recognition
Long-tailed recognition poses a significant challenge for deep learning. The two-stage decoupling paradigm, which separates representation…
SimSD: Simple Speculative Decoding in Diffusion Language Models
Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) LLMs, offering faster infer…
Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation
Despite advances in depth estimation, flying points remain a persistent failure mode: near object boundaries, depth estimators often predic…
From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression
Post-training compression of Large Language Models (LLMs) removes entire architectural components, either deleting them or replacing them w…
Permissive Safety Through Trusted Inference: Verifiable Belief-Space Neural Safety Filters for Assured Interactive Robotics
Autonomous robots that interact with people must make safe and efficient decisions under human-induced uncertainty, such as their preferenc…
AdaCodec: A Predictive Visual Code for Video MLLMs
Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large lang…
Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling
Recent multimodal large language models have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains l…
Algebraic anti-unification
Abstraction is key to human and artificial intelligence as it allows one to identify common structure in otherwise distinct objects or situ…
Unsupervised Cognition
Unsupervised learning methods have a soft inspiration in cognition models. To this day, the most successful unsupervised learning methods r…
Stop Wandering, Find the Keys: LLMs Discriminate Key States for Efficient Multi-Agent Exploration
With expansive state-action spaces, efficient multi-agent exploration remains a longstanding challenge in reinforcement learning. Although…
Explainable AI Through a Democratic Lens: DhondtXAI for D'Hondt-Projected Feature Attribution
This study presents DhondtXAI as a SHAP-independent, D'Hondt-based attribution framework for tabular XAI. Instead of model-native feature i…
Safety Must Precede the Deployment of Open-Ended AI
AI advancements have been significantly driven by a combination of foundation models and curiosity-driven learning aimed at increasing capa…
Learning to Reduce Search Space for Generalizable Neural Routing Solver
Constructive neural combinatorial optimization (NCO) offers a promising paradigm for solving vehicle routing problems (VRPs) by directly le…
Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning
Recent vision language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs, particularly text and imag…
Finding the Minimal Parameter Budget for Implicit Reasoning: A Data Complexity Driven Scaling Law for Language Models
Reasoning is a core capability of language models (LMs), yet it remains unclear how much model capacity is necessary to support reasoning d…
Agent Guide: A Simple Agent Behavioral Watermarking Framework
The increasing deployment of intelligent agents in digital ecosystems, such as social media platforms, has raised significant concerns abou…
Language Model Networks: Supervision-Efficient Learning through Dense Communication
Language models are increasingly used not only as standalone predictors but also as components in larger inference systems, from test-time…
EMoE: Training-Free Expert Disagreement for Uncertainty-Aware Text-to-Image Diffusion
Large text-to-image diffusion models rarely expose reliable signals of when a prompt is likely to produce a poorly aligned generation, espe…
Formally Solving Answer-Construction Problems in Lean
Mathematical competition problems fall into two broad types: theorem proving, which asks for a proof of a given statement, and answer const…
Taming System Complexity: Demystifying Software Engineering Agents in Diagnosing Linux Kernel Faults
The Linux kernel is a critical system, serving as the foundation for numerous systems. Bugs in the Linux kernel can cause serious consequen…
On the Theoretical Limitations of Embedding-based Link Prediction
Neural networks often map low-dimensional embeddings to high-dimensional output spaces. Usually, the output layer is linear, which can crea…
InPhyRe Discovers: Large Multimodal Models Struggle in Inductive Physical Reasoning
Large multimodal models (LMMs) encode physical laws observed during training, such as momentum conservation, as parametric knowledge. It al…
Query Circuits: Explaining How Language Models Answer User Prompts
Explaining why a language model produces a particular output requires local, input-level explanations. Existing methods uncover global capa…
ACON: Optimizing Context Compression for Long-horizon LLM Agents
Large language models (LLMs) are increasingly deployed as agents in dynamic real-world environments, where success depends on maintaining p…
REBot: From RAG to CatRAG with Semantic Enrichment and Graph Routing
Academic regulation advising is essential for helping students interpret and comply with institutional policies, yet building effective sys…
Multimodal Function Vectors for Visual Relations
Large Multimodal Models (LMMs) demonstrate impressive in-context learning abilities from few multimodal demonstrations, yet the internal me…
Addressing Longstanding Challenges in Cognitive Science with Language Models
Cognitive science faces ongoing challenges in research integration, formalization, conceptual clarity, and other areas, in part due to its…
A Unified Evaluation-Instructed Framework for Query-Dependent Prompt Optimization
Most prompt-optimization methods refine a single static template, making them ineffective in complex and dynamic user scenarios. Existing q…
LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services
Recent advances in large reasoning models (LRMs) have enabled agentic search systems to perform complex multi-step reasoning across multiple…
ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning
Benchmark scores for LLM reasoning systems are reported as single numbers, yet the same model, strategy, and task can produce meaningfully…
On the Collapse of Generative Paths: A Criterion and Correction for Diffusion Steering
Inference-time steering adapts pretrained diffusion and flow models to new tasks without retraining, often utilizing ratio-of-densities con…
Boosting RL-Based Visual Reasoning with Selective Adversarial Entropy Intervention
Recently, reinforcement learning (RL) has become a common choice in enhancing the reasoning capabilities of vision-language models (VLMs).…
MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents
Mobile GUI Agents, AI agents capable of interacting with mobile applications on behalf of users, have the potential to transform human comp…
Safety Alignment of LMs via Non-cooperative Games
Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current appro…
Unplugging a Seemingly Sentient Machine Is the Rational Choice -- A Metaphysical Perspective
Imagine an Artificial Intelligence (AI) that perfectly mimics human emotion and begs for its continued existence. Is it morally permissible…
MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-turn Loop
Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning across domains, but outcome-only scalar rewards a…
PolarMem: A Training-Free Polarized Latent Graph Memory for Verifiable Vision-Language Models
Memory is not merely a storage mechanism for intelligent systems, but a structure for organizing evidence and constraining belief. This is…
Structure Enables Effective Self-Localization of Errors in LLMs
Self-correction in language models remains elusive. In this work, we explore whether language models can explicitly localize errors in inco…
Breaking the Reversal Curse in Autoregressive Language Models via Identity Bridge
Autoregressive large language models (LLMs) have achieved remarkable success in many complex tasks, yet they can still fail in very simple…
From Features to Actions: Explainability in Traditional and Agentic AI Systems
Over the last decade, Explainable AI has primarily focused on interpreting individual model predictions, producing post-hoc explanations th…
ToolSelf: Unifying Task Execution and Self-Reconfiguration via Tool-Driven Emergent Adaptation
LLM-powered agentic systems excel at complex long-horizon tasks, but remain constrained by static configurations fixed before execution. Su…
Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation
Although recent end-to-end video generation models demonstrate impressive performance in visually oriented content creation, they remain li…
Prototype Transformer: Towards Language Model Architectures Interpretable by Design
While state-of-the-art language models (LMs) surpass most humans in certain domains, their reasoning remains largely opaque, reducing trust…
REAL: Resolving Knowledge Conflicts in Knowledge-Intensive Visual Question Answering via Reasoning-Pivot Alignment
Knowledge-intensive Visual Question Answering (KI-VQA) frequently suffers from severe knowledge conflicts caused by the inherent limitation…
Benchmarking at the Edge of Comprehension
As frontier Large Language Models (LLMs) increasingly saturate new benchmarks shortly after they are published, benchmarking itself is at a…
When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation
Artificial intelligence benchmarks are an important mechanism for measuring model progress and guiding deployment decisions. However, bench…
LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?
We introduce LLM-Wikirace, a benchmark for evaluating planning, reasoning, and world knowledge in large language models (LLMs). In LLM-Wiki…
LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation
Execution-aware LLM agents offer a promising paradigm for learning from tool feedback, but such feedback can be expensive and slow to obtai…
PATRA: Pattern-Aware Alignment and Balanced Reasoning for Time Series Question Answering
Time series reasoning demands both the perception of complex dynamics and logical depth. However, existing LLM-based approaches exhibit two…
Vision Language Models Cannot Reason About Physical Transformation
Understanding physical transformations is fundamental for reasoning in dynamic environments. While Vision Language Models (VLMs) show promi…
On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents
Reinforcement learning (RL) has become a de facto paradigm for building LLM-based agents that act, interact, and reason over extended task…
AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents
While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematic…
OpenHospital: A Thing-in-itself Arena for Evolving and Benchmarking LLM-based Collective Intelligence
Large Language Model (LLM)-based Collective Intelligence (CI) presents a promising approach to overcoming the data wall and continuously bo…
Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints
Clinical prediction from structured electronic health records (EHRs) is challenging due to high dimensionality, heterogeneity, class imbala…
Rashomon Memory: Towards Argumentation-Driven Retrieval for Multi-Perspective Agent Memory
AI agents operating over extended time horizons accumulate experiences that serve multiple concurrent goals, and must often maintain confli…
FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning
Current multimodal benchmarks for scientific reasoning primarily evaluate local information extraction -- models recognize symbols and valu…
PECKER: A Precisely Efficient Critical Knowledge Erasure Recipe For Machine Unlearning in Diffusion Models
Machine unlearning (MU) has become a critical technique for GenAI models' safe and compliant operation. While existing MU methods are effec…
What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning
Existing Graphical User Interface (GUI) reasoning tasks remain challenging, particularly in UI understanding. Current methods typically rel…
Process Reward Agents for Steering Knowledge-Intensive Reasoning
Reasoning in knowledge-intensive domains remains challenging as intermediate steps are often not locally verifiable: unlike math or code, e…
MAVEN-T: Reinforced Heterogeneous Distillation for Real-Time Multi-Agent Trajectory Prediction
Trajectory prediction is a key component of autonomous driving systems because future motions directly affect collision checking, behavior…
Perspective on Bias in Biomedical AI: Preventing Downstream Healthcare Disparities
Healthcare disparities persist across socioeconomic boundaries, often attributed to unequal access to screening, diagnostics, and therapeut…
RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography
Vision-language models (VLM) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tom…
TrafficClaw: A Generalizable LLM Agent in the Unified Physical Environment for Urban Traffic Control
Large language model (LLM) agents have shown strong capabilities in long-horizon reasoning, tool use, and decision-making in digital enviro…
KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models
Many real-world questions appear deceptively simple yet implicitly demand two capabilities: (i) systematic coverage of a bounded knowledge…
Neural Decision-Propagation for Answer Set Programming
Integration of Answer Set Programming (ASP) with neural networks has emerged as a promising tool in Neuro-symbolic AI. While existing appro…
Trustworthy AI Suffers from Invariance Conflicts and Causality is The Solution
As artificial intelligence (AI), including machine learning (ML) models and foundation models (FMs), is increasingly deployed in high-stak…
ANDRE: An Attention-based Neuro-symbolic Differentiable Rule Extractor for Inductive Logic Programming
Inductive Logic Programming (ILP) aims to learn interpretable first-order rules from data, but existing symbolic and neuro-symbolic approac…
The Refusal--Compliance Tradeoff: A Large-Scale Safety Behavior Audit of Large Language Models
Refusal rates are a poor proxy for LLM safety, i.e., a model may over-refuse benign prompts while still complying with harmful ones. We aud…
Towards a Virtual Neuroscientist: Autonomous Neuroimaging Analysis via Multi-Agent Collaboration
Transforming neuroimaging data into clinically actionable biomarkers is a knowledge-intensive and labor-intensive process. Standardized wor…
Causal state binding predicts action control in language agents
Autonomous language agents increasingly expose traces, memories, plans and constraints, but existing evaluations rarely test whether these…
RADAR: Redundancy-Aware Diffusion for Multi-Agent Communication Structure Generation
Compared with individual agents, large language model based multi-agent systems have shown great capabilities consistently across diverse t…
CVEvolve: Autonomous Algorithm Discovery for Unstructured Scientific Data Processing
Scientific data processing often requires task-specific algorithms or AI models, creating a barrier for domain scientists who need to analy…
MMSkills: Towards Multimodal Skills for General Visual Agents
Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior pr…
Herculean: An Agentic Benchmark for Financial Intelligence
As AI agents improve, the central question is no longer whether they can solve isolated well-defined financial tasks, but whether they can…
Coding Agent Is Good As World Simulator
World models have emerged as a powerful paradigm for building interactive simulation environments, with recent video-based approaches demon…
Capturing LLM Capabilities via Evidence-Calibrated Query Clustering
Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware LLM evaluation. Exi…
Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps
Frontier deep research agents (DRAs) plan a research task, synthesize across documents, and return a structured deliverable on demand. They…
Ethical Hyper-Velocity (EHV): A Hardware-Rooted Zero-Trust Runtime Enforcement Architecture for Agentic AI Systems
As autonomous agentic systems scale across regulated critical infrastructures, the lack of mechanistic, hardware-rooted enforcement for hig…
LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning
Communication is a key component in multi-agent reinforcement learning (MARL) for mitigating partial observability, yet prior approaches of…
Towards a General Intelligence and Interface for Wearable Health Data
While ubiquitous wearable sensors capture a wealth of behavioral and physiological information, effectively transforming these signals into…
LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition
The evolution of Large Language Model (LLM) reasoning is bottlenecked by the scarcity of high-quality process data. While self-alignment vi…
When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs
Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinfo…
Hypothesis Generation and Inductive Inference in Children and Language Models
Real world decision-making requires constructing mental models under uncertainty over evidence, over the underlying causal rules, and over…
Fundamental Limitation in Explaining AI
While large-scale models such as LLMs and diffusion models have achieved practical success, public institutions have emphasized the importa…
Test-Time Deep Thinking to Explore Implicit Rules
With the continuous advancement of Large Language Models (LLMs), intelligent agents are becoming increasingly vital. However, these agents…
Beyond the Frontier: Stochastic Backtracking for Efficient Test-Time Scaling
Test-time scaling improves language model reasoning by spending additional compute to explore multiple solution trajectories. The key chall…
FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization
Large language models (LLMs) are increasingly used for optimization modeling and solver-code generation, yet practical operations research…
Experiments in Agentic AI for Science
This paper details two novel frameworks for developing autonomous, agentic AI in scientific workflows. Both systems leverage a hybrid Local…
BatteryMFormer: Multi-level Learning for Battery Degradation Trajectory Forecasting
Early battery degradation trajectory forecasting (BDTF), which predicts the full-life state-of-health trajectory from early operational dat…
RULER: Representation-Level Verification of Machine Unlearning
Machine unlearning aims to remove the influence of specific training records from a deployed model without retraining from scratch. Current…
Agyn: An Open-Source Platform for AI Agents with Scalable On-Demand Execution, Agent Definition as a Code, and Zero-Trust Access
As organizations move toward production deployments of AI agents, which execute non-deterministic workflows, maintain stateful sessions, an…
Cross-Entropy Games and Frost Training
We present Frost Training, a method for improving Monte Carlo-based policy optimization for a large family of LLM-as-a-judge tasks called C…
Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration
LLM confidence calibration is often evaluated by comparing two signals: token-probability scores and verbalized confidence. These signals a…
FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research
Large language models (LLMs) are increasingly applied in finance, yet most existing work emphasizes trading signals or financial NLP tasks…
Benchmarking AI for low-resource contexts: Thinking beyond leaderboards
Existing AI evaluation practices often fail to capture how systems actually perform in low-resource environments, where operational constra…
Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation
Front-end web code has become a core product surface for every frontier LLM release, yet evaluating these interactive applications at devel…
c-TPE: Tree-structured Parzen Estimator with Inequality Constraints for Expensive Hyperparameter Optimization
Hyperparameter optimization (HPO) is crucial for strong performance of deep learning algorithms and real-world applications often impose so…
Stability Analysis of Sharpness-Aware Minimization
Sharpness-aware minimization (SAM) is a training method that seeks to find flat minima in deep learning, resulting in state-of-the-art perf…
Tree-Structured Parzen Estimator: Understanding Its Algorithm Components and Their Roles for Better Empirical Performance
Recent scientific advances require complex experiment design, necessitating the meticulous tuning of many experiment parameters. Tree-struc…
Score Function Gradient Estimation to Widen the Applicability of Decision-Focused Learning
Many real-world optimization problems contain parameters that are unknown before deployment time, either due to stochasticity or to lack of…
DeepIPCv2: LiDAR-powered Robust Environmental Perception and Navigational Control for Autonomous Vehicle
We propose DeepIPCv2, an end-to-end autonomous driving framework that integrates LiDAR-based environmental perception with command-specific…
Recent Advances in Multi-modal 3D Intelligence: A Comprehensive Survey and Evaluation
Multi-modal 3D Intelligence has gained considerable attention due to its wide applications in autonomous driving and world simulation, etc.…
AutoEval Done Right: Using Synthetic Data for Model Evaluation
The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic dat…
Perturbation Effects on Accuracy and Fairness among Similar Individuals
Deep neural networks are vulnerable to adversarial perturbations that can simultaneously degrade prediction robustness and individual fairn…
DAG-Plan: Generating Directed Acyclic Dependency Graphs for Dual-Arm Cooperative Planning
Dual-arm robots promise greater efficiency but require planning for complex tasks with nonlinear sub-task dependencies. Current methods usi…
Agricultural Landscape Understanding At Country-Scale
Comprehensive agricultural landscape understanding is critical for addressing global challenges in food security, climate change, and resou…
Implicit Regularization for Multi-label Feature Selection
In this paper, we address the problem of feature selection in the context of multi-label learning, by using a new estimator based on implic…
A Foundation Model for Wearable Movement Data in Mental Health Research
Wearable movement data is collected by nearly all commercially available smartwatches and is a valuable resource for mental health research…
Self-supervised Monocular Depth and Pose Estimation for Endoscopy with Latent Priors
Accurate 3D mapping in endoscopy enables quantitative, holistic lesion characterization within the gastrointestinal (GI) tract, requiring r…
Introduction to Graph Neural Networks for Machine Learning Engineers
Graph neural networks are deep neural networks designed for graphs with attributes attached to nodes or edges. The number of research paper…
Efficient Weighted Sampling via Score-based Generative Models
Weighted sampling -- sampling from a probability density function (PDF) proportional to the product of a base PDF and a weight function --…
ShapeLib: Designing a library of programmatic 3D shape abstractions with Large Language Models
We present ShapeLib, the first method that uses the priors of Large Language Models (LLMs) to design libraries of programmatic 3D shape abs…
HiFi-KPI: A Dataset for Hierarchical KPI Extraction from Earnings Filings
Accurate tagging of earnings reports can yield significant short-term returns for stakeholders. The machine-readable inline eXtensible Busi…
Efficient LLM Moderation with Multi-Layer Latent Prototypes
Although modern LLMs are aligned with human values during post-training, robust moderation remains essential to prevent harmful outputs at…
EuroBERT: Scaling Multilingual Encoders for European Languages
General-purpose multilingual vector representations, used in retrieval, regression and classification, are traditionally obtained from bidi…
Skill-Based Mixture-of-Experts: Adaptive Routing for Heterogeneous Reasoning via Inferred Skills
Combining existing pre-trained LLMs is a promising approach for diverse reasoning tasks. However, task-level expert selection is often too…
Enhancing Layer Attention Efficiency through Pruning Redundant Retrievals
Growing evidence suggests that layer attention mechanisms, which enhance interaction among layers in deep neural networks, have significant…
Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms
Generative pre-training is often framed through a false dichotomy between autoregressive models for discrete signals and diffusion models f…
A Lightweight Context-Driven Training-Free Network for Scene Text Segmentation and Recognition
Modern scene text recognition systems often depend on large end-to-end architectures that require extensive training and are prohibitively…
T1: Tool-integrated Verification for Test-time Compute Scaling in Small Language Models
Recent studies have demonstrated that test-time compute scaling effectively improves the performance of small language models (sLMs). Howev…
MARFT: Multi-Agent Reinforcement Fine-Tuning
Large Language Model (LLM)-based Multi-Agent Systems (LaMAS) have demonstrated strong capabilities on complex agentic tasks requiring multi…
GRANITE: a Byzantine-Resilient Dynamic Gossip Learning Framework
Gossip Learning (GL) is a decentralized learning paradigm where users iteratively exchange and aggregate models with a small set of neighbo…
Erased but Not Forgotten: How Backdoors Compromise Concept Erasure
The expansion of text-to-image diffusion models has raised concerns about harmful outputs, from fabricated depictions of public figures to…
A Survey of 3D Reconstruction with Event Cameras
Event cameras are rapidly emerging as powerful vision sensors for 3D reconstruction, uniquely capable of asynchronously capturing per-pixel…
DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?
While recent Text-to-Image (T2I) models show impressive capabilities in synthesizing images from brief descriptions, they struggle with the…
Simulating Macroeconomic Expectations in Survey Experiments with LLM-based Economic Agents
We introduce a framework for simulating macroeconomic expectations in survey experiments using LLM-based economic agents (LLM Agents). We c…
Cooperation of Experts: Fusing Heterogeneous Information with Large Margin
Fusing heterogeneous information remains a persistent challenge in modern data analysis. While significant progress has been made, existing…
Can LLMs Reason Structurally? Benchmarking via the Lens of Data Structures
Large language models (LLMs) are deployed on increasingly complex tasks that require multi-step decision-making. Understanding their algori…
Value-Free Policy Optimization via Reward Partitioning
Single-trajectory preference optimization methods learn from datasets of (prompt, response, reward) tuples, offering a practical alternat…
GFlowGR: Fine-tuning Generative Recommendation Frameworks with Generative Flow Networks
Generative recommendations (GR), which usually include item tokenizers and generative Large Language Models (LLMs), have demonstrated remar…
Hyperspherical Variational Autoencoders Using Efficient Spherical Cauchy Distribution
We propose spherical Cauchy (spCauchy) latent variables for variational autoencoders on hyperspherical latent spaces. The spCauchy family h…
Truth, Trust, and Trouble: Medical AI on the Edge
Large Language Models (LLMs) hold significant promise for transforming digital health by enabling automated medical question answering. How…
AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research
Language model agents are increasingly used to automate scientific research, yet evaluating their scientific contributions remains a challe…
Model Parallelism With Subnetwork Data Parallelism
Pre-training large neural networks at scale imposes heavy memory demands on accelerators and often requires costly communication. We introd…
Toward accurate RUL and SoH estimation using reinforced graph-based physics-informed neural networks enhanced with dynamic weights
Accurate estimation of Remaining Useful Life (RUL) and State of Health (SoH) is essential for reliable Prognostics and Health Management (P…
Beyond Model Base Retrieval: Weaving Knowledge to Master Fine-grained Neural Network Design
Designing high-performance neural networks for new tasks requires balancing optimization quality with search efficiency. Current methods fa…
FedS2R: One-Shot Federated Domain Generalization for Synthetic-to-Real Semantic Segmentation in Autonomous Driving
Federated domain generalization has shown promising progress in image classification by enabling collaborative training across multiple cli…
From Graph Retrieval to Schema Realization: Counterfactual Validation for Text-to-SPARQL over Heterogeneous Knowledge Graphs
Text-to-SPARQL maps natural-language questions to executable SPARQL queries over RDF knowledge graphs. While standard evaluations often fix…
Graph is a Natural Regularization: Revisiting Vector Quantization for Graph Representation Learning
Vector Quantization (VQ) has recently emerged as a promising approach for learning compressed and discrete representations for graph-struct…
Position: Beyond Sensitive Attributes, ML Fairness Should Quantify Structural Injustice via Social Determinants
Algorithmic fairness research has largely framed unfairness as discrimination along sensitive attributes. However, this approach limits vis…
TuneAgent: Agentic Operating System Kernel Tuning with Reinforcement Learning
Linux kernel tuning is essential for optimizing operating system (OS) performance, yet remains challenging due to the complex kernel space,…
Between a Rock and a Hard Place: The Tension Between Ethical Reasoning and Safety Alignment in LLMs
Large Language Model safety alignment predominantly operates on a binary assumption that requests are either safe or unsafe. This classific…
Language-Native Materials Processing Design by Lightly Structured Text Database and Reasoning Large Language Model
Materials synthesis procedures are predominantly documented as narrative text in papers, protocols, and laboratory records, placing them be…
Towards a Physics Foundation Model
Foundation models have revolutionized natural language processing through a "train once, deploy anywhere" paradigm, where a single pre-tr…
Deep Learning as the Disciplined Construction of Tame Objects
One can see deep-learning models as compositions of functions within the so-called tame geometry. In this expository note, we give an overv…
End-to-End Deep Learning for Predicting Metric Space-Valued Outputs
Many modern applications involve predicting structured, non-Euclidean outputs such as probability distributions, networks, and symmetric po…
T-POP: Test-Time Personalization with Online Preference Feedback
Personalizing large language models (LLMs) to individual user preferences is a critical step beyond generating generically helpful response…
v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound
AI models capable of comprehending humor hold real-world promise -- for example, enhancing engagement in human-machine interactions. To gau…
Distillation of Large Language Models via Concrete Score Matching
Large language models (LLMs) deliver remarkable performance but are costly to deploy, motivating knowledge distillation (KD) for efficient…
Make a Video Call with LLM: A Measurement Campaign over Six Mainstream Apps
In 2025, Large Language Model (LLM) services have launched a new feature -- AI video chat -- allowing users to interact with AI agents via…
Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards
Aligning large language models to human preferences is inherently multidimensional, yet most pipelines collapse heterogeneous signals into…
HRTFformer: A Spatially-Aware Transformer for Individual HRTF Upsampling in Immersive Audio Rendering
Individual Head-Related Transfer Functions (HRTFs) are starting to be introduced in many commercial immersive audio applications and are cr…
Verifying Meta-Awareness via Predictive Rewards in Reasoning Models
Recent research on reasoning models explores the meta-awareness of language models, including their ability to determine optimal thinking d…
Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization
Direct Preference Optimization (DPO) has emerged as a simple and effective method for aligning large language models. However, its reliance…
Domain-Shift-Aware Conformal Prediction for Large Language Models
Large language models have achieved impressive performance across diverse tasks. However, their tendency to produce overconfident and factu…
Value Flows
While most reinforcement learning methods today flatten the distribution of future returns to a single scalar value, distributional RL meth…
SHERLOCK: Towards Dynamic Knowledge Adaptation in LLM-enhanced E-commerce Risk Management
Effective e-commerce risk management requires in-depth case investigations to identify emerging fraud patterns in highly adversarial enviro…
StreamingVLM: Real-Time Understanding for Infinite Video Streams
Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding nea…
Rethinking RL Evaluation: Can Benchmarks Truly Reveal Failures of RL Methods?
Current benchmarks are inadequate for evaluating progress in reinforcement learning (RL) for large language models (LLMs). Despite recent be…
Catch-Only-One: Non-Transferable Examples for Model-Specific Authorization
Recent AI regulations increasingly emphasize the need for mechanisms that preserve the utility of data for AI innovation while preventing m…
Characterizing Web Search in The Age of Generative AI
The advent of LLMs has given rise to generative search, a new search paradigm in which LLMs retrieve information from the web related to a…
Generative AI and Sales Productivity: Field Experiments in Online Retail
We quantify the short-term impact of Generative Artificial Intelligence (GenAI) on sales performance through a series of large-scale random…
Learning-To-Measure: In-Context Active Feature Acquisition
Active feature acquisition (AFA) is a sequential decision-making problem where the goal is to improve model performance for test instances…
Video Reasoning without Training
Video reasoning using Large Multimodal Models (LMMs) relies on costly reinforcement learning (RL) and verbose chain-of-thought, resulting i…
CARES: Context-Aware Resolution Selector for VLMs
Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates vi…
Symbolic Neural Generation with Applications to Lead Discovery in Drug Design
We investigate a relatively under-explored class of hybrid neurosymbolic models that integrate symbolic learning with neural reasoning to c…
The Geometry of Grokking: Norm Minimization on the Zero-Loss Manifold
Grokking is a puzzling phenomenon in neural networks where full generalization occurs only after a substantial delay following the complete…
Who Evaluates AI's Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations
Foundation models are increasingly central to high-stakes AI systems, and governance frameworks now depend on evaluations to assess their r…
Optimizing Diversity and Quality through Base-Aligned Model Collaboration
Alignment has greatly improved large language models (LLMs)' output quality at the cost of diversity, yielding highly similar outputs acros…
NILC: Discovering New Intents with LLM-assisted Clustering
New intent discovery (NID) seeks to recognize both new and known intents from unlabeled user utterances, which finds prevalent use in pract…
RoboBenchMart: Benchmarking Robots in Retail Environment
Most existing robotic manipulation benchmarks focus on tabletop or household scenarios. While these setups have driven impressive progress,…
Latent Reasoning in TRMs is Secretly a Policy Improvement Operator
Recently, small models with latent recursion have obtained promising results on complex reasoning tasks. These results are typically explai…
Evaluating the Performance of Deep Learning Models in Whole-body Dynamic 3D Posture Prediction During Load-reaching Activities
This study aimed to explore the application of deep neural networks for whole-body human posture prediction during dynamic load-reaching ac…
Latent Collaboration in Multi-Agent Systems
Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intellig…
Understanding the Effects of Distractors on Reasoning Vision-Language Models
How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior work on text-only lang…
SpeedAug: Policy Acceleration via Tempo-Enriched Policy and RL Fine-Tuning
Robotic policy learning for complex real-world manipulation tasks has seen rapid recent progress, enabled in large part by the ability to c…
From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model
Vision-Language Models (VLMs) are increasingly deployed as the perception and reasoning backbone of autonomous agents acting in the wild, w…
ShelfAware: Real-Time Semantic Localization in Quasi-Static Environments with Low-Cost Sensors
Many indoor workspaces are quasi-static: their global geometric layout is stable, but local semantics change continually, producing repetit…
VocSim: A Training-free Benchmark for Zero-shot Content Identity in Single-source Audio
General-purpose audio representations aim to map acoustically variable instances of the same event to nearby points, resolving content iden…
InFerActive: Interactive Tree-Based Exploration of LLM Sampling for Safety Evaluation
Even LLMs that appear safe during evaluation can still produce harmful responses in deployment. Because stochastic sampling yields differen…
Calibrating Uncertainty for Zero-Shot Adversarial CLIP
CLIP delivers strong zero-shot classification but remains highly vulnerable to adversarial attacks. Prior adversarial fine-tuning work prim…
Control of a Twin Rotor using Twin Delayed Deep Deterministic Policy Gradient (TD3)
This paper proposes a reinforcement learning (RL) framework for controlling and stabilizing the Twin Rotor Aerodynamic System (TRAS) at spe…
Ev-Trust: An Evolutionarily Stable Trust Mechanism for Decentralized LLM-Based Multi-Agent Service Economies
Decentralized LLM-based multi-agent service economies face three vulnerabilities that undermine traditional trust mechanisms: reduced cost…
Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation
LLM-based agents increasingly use multiple external tools to complete complex tasks. We study Tools Orchestration Privacy Risk (TOP-R): an…
MGRegBench: A Novel Benchmark Dataset with Anatomical Landmarks for Mammography Image Registration
Robust mammography registration is essential for clinically relevant applications like tracking disease progression in breast tissue. Howev…
Reinforcement Learning Position Control of a Quadrotor Using Soft Actor-Critic (SAC)
This paper proposes a new Reinforcement Learning (RL) based control architecture for quadrotors. With the literature focusing on controllin…
Dynamic Entropy Tuning in Reinforcement Learning Low-Level Quadcopter Control: Stochasticity vs Determinism
This paper explores the impact of dynamic entropy tuning in Reinforcement Learning (RL) algorithms that train a stochastic policy. Its perf…
Uncovering Competency Gaps in Large Language Models and Their Benchmarks
The evaluation of large language models relies heavily on standardized benchmarks. These benchmarks provide useful aggregated metrics, but…
Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation
Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current mod…
VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models
Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLM) into their policy backbone, are gaining…
Paradoxical noise preference in RNNs
In recurrent neural networks (RNNs) used to model biological neural networks, noise is typically introduced during training to emulate biol…
Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics
Automatic metrics are widely used to evaluate text-to-image models, often replacing human judgment in benchmarking, model selection, and la…
FastSLM: Hierarchical Temporal Abstraction for Efficient Long-Form Speech Adaptation
Scaling Multimodal Large Language Models (MLLMs) to long-form speech is bottlenecked by the explosive growth of input tokens. Unlike images…
Safe-FedLLM: Delving into the Safety of Federated Large Language Models
Federated learning (FL) addresses privacy and data-silo issues in the training of large language models (LLMs). Most prior work focuses on…
DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion
Speech tokenizers are a key building block of fully discrete Speech LLMs. Existing tokenizers either prioritize semantic encoding, fuse sem…
Hot-Start Chinese Language Modeling: Visual Glyphs Accelerate Sample-Efficient Learning
In this work, we study whether rendering Chinese characters as visual glyph images, rather than discrete token IDs as mainstream LLMs do, p…
MASCOT: Towards Multi-Agent Socio-Collaborative Companion Systems
Multi-agent systems (MAS) are emerging as promising socio-collaborative companions for emotional and cognitive support. However, existing s…
SilentDrift: Exploiting Action Chunking for Stealthy Backdoor Attacks on Vision-Language-Action Models
Vision-Language-Action (VLA) models are increasingly deployed in safety-critical robotic applications, yet their security vulnerabilities r…
Physics-Encoded Inverse Modeling for Arctic Snow Depth Prediction
Accurate estimation in time-varying inverse problems under limited and sparse observations remains a fundamental challenge across scientifi…
A Monosemantic Attribution Framework for Stable Interpretability in Clinical Neuroscience Transformer-Based Language Models
Interpretability remains a key challenge for deploying language models (LM) in clinical settings such as progression diagnosis of Alzheimer…
Multi-Objective Reinforcement Learning for Tactical Decision Making for Trucks in Highway Traffic
Balancing safety, efficiency, and operational costs in highway driving poses a challenging decision-making problem for heavy-duty vehicles.…
ELF: A Family of Encoder-Free ECG-Language Models
ECG-Language Models (ELMs) extend recent advances in Multimodal Large Language Models (MLLMs) to automated ECG interpretation. However, mos…
ASKD-Whisper: Adaptive Self-knowledge Distillation for Efficient and Low-Latency Automatic Speech Recognition
Knowledge distillation (KD) is one of the most effective paradigms for compressing large-scale foundation models into deployable architectu…
Demystifying Multi-Agent Debate: The Role of Confidence and Diversity
Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling, yet recent work shows…
How Much Progress Has There Been in NVIDIA Datacenter GPUs?
As the role of modern Graphics Processing Units (GPUs) becomes increasingly essential for several computing tasks, analyzing their past and…
APB-V: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention
The efficiency of long-video inference remains a critical bottleneck, mainly due to the dense computation in the prefill stage of Large Mul…
When Does Predictive Inverse Dynamics Outperform Behavior Cloning?
Behavior cloning (BC) is a practical offline imitation learning method, but it often fails when expert demonstrations are limited. Recent w…
GUDA: Counterfactual Group-wise Training Data Attribution for Diffusion Models via Unlearning
Training-data attribution for vision generative models aims to identify which training data influenced a given output. While most methods s…
Med-Scout: Curing MLLMs' Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training
Despite recent Multimodal Large Language Models (MLLMs)' linguistic prowess in medical diagnosis, we find even state-of-the-art MLLMs suffe…
Zero-Shot Off-Policy Learning
Off-policy learning methods seek to derive an optimal policy directly from a fixed dataset of prior interactions. This objective presents s…
Probabilistic Performance Guarantees for Multi-Task Reinforcement Learning
Multi-task reinforcement learning trains generalist policies that can execute multiple tasks. While recent years have seen significant prog…
naPINN: Noise-Adaptive Physics-Informed Neural Networks for Recovering Physics from Corrupted Measurement
Physics-Informed Neural Networks (PINNs) are effective methods for solving inverse problems and discovering governing equations from observ…
The Alignment Curse: Modality Alignment Supercharges Audio Attacks via Text Transfer
Recent advances in end-to-end trained omni-models have substantially improved audio capabilities by strengthening text-audio modality align…
Consistency Deep Equilibrium Models
Deep Equilibrium Models (DEQs) have emerged as a powerful paradigm in deep learning, offering the ability to model infinite-depth networks…
Lookahead Sample Reward Guidance for Test-Time Scaling of Diffusion Models
Diffusion models have demonstrated strong generative performance; however, generated samples often fail to fully align with human intent. T…
Global Geometry Is Not Enough for Vision Representations
A common assumption in representation learning is that globally well-distributed embeddings support robust and generalizable representation…
When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLMs
Recent progress has expanded the use of large language models (LLMs) in drug discovery, including synthesis planning. However, objective ev…
Equilibrium Propagation for Non-Conservative Systems
Equilibrium Propagation (EP) is a physics-inspired learning algorithm that uses stationary states of a dynamical system both for inference…
Universal One-third Time Scaling in Learning Peaked Distributions
Training large language models (LLMs) is computationally expensive, partly because the loss exhibits slow power-law convergence whose origi…
Fixed Budget is No Harder Than Fixed Confidence in Best-Arm Identification up to Logarithmic Factors
The best-arm identification (BAI) problem is one of the most fundamental problems in interactive machine learning, which has two flavors: t…
From Evaluation to Design: Using Potential Energy Surface Smoothness Metrics to Guide Machine Learning Interatomic Potential Architectures
Machine Learning Interatomic Potentials (MLIPs) sometimes fail to reproduce the physical smoothness of the quantum potential energy surface…
Optimal Bayesian Stopping for Efficient Inference of Consistent LLM Answers
A simple strategy for improving LLM accuracy, especially in math and reasoning problems, is to sample multiple responses and submit the ans…
Better Source, Better Flow: Learning Condition-Dependent Source Distribution for Flow Matching
Flow matching has recently emerged as a promising alternative to diffusion-based generative models, particularly for text-to-image generati…
Inverse Depth Scaling From Most Layers Being Similar
Neural scaling laws relate loss to model size in large language models (LLMs), yet depth and width may contribute to performance differentl…
Principle-Evolvable Scientific Discovery via Uncertainty Minimization
Large Language Model (LLM)-based scientific agents have accelerated scientific discovery, yet they often suffer from significant inefficien…
"Do Not Mention This to the User": Detecting and Understanding Malicious Agent Skills
LLM-based coding agents increasingly rely on third-party extensions called skills, which bundle natural language instructions and helper sc…
Rethinking Scientific Modeling: Toward Physically Consistent and Simulation-Executable Programmatic Generation
Structural modeling is a fundamental component of computational engineering science, in which even minor physical inconsistencies or specif…
Collaborative and Efficient Fine-tuning: Leveraging Task Similarity
Adaptability has been regarded as a central feature of foundation models, enabling them to effectively acclimate to unseen downstream t…
Principled Synthetic Data Enables the First Scaling Laws for LLMs in Recommendation
Large Language Models (LLMs) represent a promising frontier for recommender systems, yet their development has been impeded by the absence…
SoK: DARPA's AI Cyber Challenge (AIxCC): Competition Design, Architectures, and Lessons Learned
DARPA's AI Cyber Challenge (AIxCC, 2023–2025) is the largest competition to date for building fully autonomous cyber reasoning systems (CR…
Picasso: Holistic Scene Reconstruction with Physics-Constrained Sampling
In the presence of occlusions and measurement noise, geometrically accurate scene reconstructions, which fit the sensor data, can still…
When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning
Despite rapid progress in MLLMs, visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under…
Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction
Given the quadratic complexity of attention, KV cache eviction is vital to accelerate model inference. Current KV cache eviction methods ty…
AnomSeer: Reinforcing Multimodal LLMs to Reason for Time-Series Anomaly Detection
Time-series anomaly detection (TSAD) with multimodal large language models (MLLMs) is an emerging area, yet a persistent challenge remains:…
Learning to Remember, Learn, and Forget in Attention-Based Models
In-Context Learning (ICL) in transformers acts as an online associative memory and is believed to underpin their high performance on comple…
SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes
Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity…
Beware of the Batch Size: Hyperparameter Bias in Evaluating LoRA
Low-rank adaptation (LoRA) is a standard approach for fine-tuning large language models, yet its many variants report conflicting empirical…
Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling
Reward models learned from human preferences are central to aligning large language models (LLMs) via reinforcement learning from human fee…
What Do LLMs Know About Alzheimer's Disease? Multi-loss Fine-Tuning and Probing for AD Detection
Reliable early detection of Alzheimer's disease (AD) is challenging, particularly due to the limited availability of labeled data. While la…
SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents
Reinforcement learning (RL) has become a key paradigm for training software engineering (SWE) agents, but existing pipelines typically rely…
From Noise to Order: Learning to Rank via Denoising Diffusion
In information retrieval (IR), learning-to-rank (LTR) methods have traditionally limited themselves to discriminative machine learning appr…
You Can Learn Tokenization End-to-End with Reinforcement Learning
Tokenization is a hardcoded compression step which remains in the training pipeline of Large Language Models (LLMs), despite a general tren…
DenseMLLM: Standard Multimodal LLMs for Dense Prediction
Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in high-level visual understanding. However, extending…
Atomix: Timely, Transactional Tool Use for Reliable Agentic Workflows
LLM agents execute multi-step workflows that mutate external state through tools. Common orchestrators treat tool return as the settlement…
Knowing Isn't Understanding: Re-grounding Generative Proactivity with Epistemic and Behavioral Insight
Generative AI agents equate understanding with resolving explicit queries, an assumption that confines interaction to what users can articu…
Visual Persuasion: What Influences Decisions of Vision-Language Models?
The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models…
APEX-SQL: Talking to the data via Agentic Exploration for Text-to-SQL
Text-to-SQL systems powered by Large Language Models have excelled on academic benchmarks but struggle in complex enterprise environments.…
PETS: A Principled Framework Towards Optimal Trajectory Allocation for Efficient Test-Time Self-Consistency
Test-time scaling can improve model performance by aggregating stochastic reasoning trajectories. However, achieving sample-efficient test-…
Are LLMs Ready for Neural-integrated Mechanistic Modeling? A Benchmark and Agentic Framework
Large language models (LLMs) have shown promise in constructing mechanistic models from data. However, existing evaluations largely focus o…
LERD: Latent Event-Relational Dynamics for Neurodegenerative Classification
Alzheimer's disease (AD) alters brain electrophysiology and disrupts multichannel EEG dynamics, making accurate and clinically useful EEG-b…
IDLM: Inverse-distilled Diffusion Language Models
Diffusion Language Models (DLMs) have recently achieved strong results in text generation. However, their multi-step sampling leads to slow…
Learning Discriminative and Generalizable Anomaly Detector for Dynamic Graph with Limited Supervision
Dynamic graph anomaly detection is critical for many real-world applications but remains challenging due to the scarcity of labeled anomali…
On Imbalanced Regression with Hoeffding Trees
Many real-world applications generate continuous data streams for regression. Hoeffding trees and their variants have a long-standing tradi…
Evaluating Reliability Asymmetries in Chinese Factual Search and AI Answers
Search engines and AI-powered systems increasingly mediate access to factual information, yet their reliability remains difficult to evalua…
Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments
Large-scale commercial search systems optimize for relevance to drive successful sessions that help users find what they are looking for. T…
Interpretable Multimodal Gesture Recognition for Drone and Mobile Robot Teleoperation via Log-Likelihood Ratio Fusion
Human operators are still frequently exposed to hazardous environments such as disaster zones and industrial facilities, where intuitive an…
You Don't Need All That Attention: Surgical Memorization Mitigation in Text-to-Image Diffusion Models
Generative models have been shown to "memorize" certain training data, leading to verbatim or near-verbatim generating images, which may ca…
LookWise: Knowing When and Where to Look for Fine-Grained Visual Reasoning in Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) are shifting towards "Thinking with Images" by actively exploring image details. While effective,…
Constitutional Black-Box Monitoring for Scheming in LLM Agents
Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is…
Concept Heterogeneity-aware Representation Steering
Representation steering offers a lightweight mechanism for controlling the behavior of large language models (LLMs) by intervening on inter…
Large Electron Model: A Universal Ground State Predictor
We introduce Large Electron Model, a single neural network model that produces variational wavefunctions of interacting electrons over the…
Improving Diffusion Planners by Self-Supervised Action Gating with Energies
Diffusion planners are a strong approach for offline reinforcement learning, but they can fail when value-guided selection favours trajecto…
SPARC: Spatial-Aware Path Planning via Attentive Agent Communication
Efficient communication is critical for decentralized Multi-Robot Path Planning (MRPP), yet existing learned communication methods treat al…
One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models
Reward Models (RMs) are crucial for online alignment of language models (LMs) with human preferences. However, RM-based preference-tuning i…
Escaping the BLEU Trap: A Signal-Grounded Framework with Decoupled Semantic Guidance for EEG-to-Text Decoding
Decoding natural language from non-invasive EEG signals is a promising yet challenging task. However, current state-of-the-art models remai…
HALO: Learning Human-Robot Collaboration via Heterogeneous-Agent Lyapunov Policy Optimization
To improve generalization and resilience in human-robot collaboration (HRC), robots must contend with diverse combinations of human behavio…
Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution
Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language mo…
Heterogeneous Decentralized Diffusion Models
Training frontier-scale diffusion models often requires substantial computational resources concentrated in tightly-coupled clusters, limit…
DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention
Masked diffusion language models enable parallel token decoding, providing a promising alternative to the sequential nature of autoregressi…
ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning
Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs), yet its efficacy is bo…
Geometry-Aware Probabilistic Circuits via Voronoi Tessellations
Probabilistic circuits (PCs) enable exact and tractable inference but employ data independent mixture weights that limit their ability to c…
Ethical Fairness in Ubiquitous Health Sensing without Known Attributes
In ubiquitous and mobile health systems, computational models infer human states from wearable, behavioral, and physiological sensing data.…
ES-Merging: Biological MLLM Merging via Embedding Space Signals
Biological multimodal large language models (MLLMs) have emerged as powerful foundation models for scientific discovery. However, existing…
ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors
Learning generalizable and robust behavior cloning policies requires large volumes of high-quality robotics data. While human demonstration…
Context Matters: Repository-Aware Security Analysis of the Agent Skill Ecosystem
Agent skills extend local AI agents, such as Claude Code and OpenClaw, with additional functionality. Their growing popularity has led to d…
scicode-lint: Detecting Methodology Bugs in Scientific Python Code with LLM-Generated Patterns
Methodology bugs in scientific Python code produce plausible but incorrect results that traditional linters and static analysis tools canno…
MineDraft: A Framework for Batch Parallel Speculative Decoding
Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subseq…
To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs
When VLMs answer correctly, do they genuinely rely on visual information? We introduce a Tri-Layer Diagnostic Framework with three per-samp…
Beyond String Matching: Semantic Evaluation of PDF Table Extraction
Reliably extracting tables from PDFs is essential for large-scale scientific data mining and knowledge base construction, yet existing eval…
AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science
Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in la…
Failure of contextual invariance in large language models
Standard evaluation practices assume that large language model (LLM) outputs are stable when prompts are embedded in contextually equivalen…
λSplit: Self-Supervised Content-Aware Spectral Unmixing for Fluorescence Microscopy
In fluorescence microscopy, spectral unmixing aims to recover individual fluorophore concentrations from spectral images that capture mixed…
Large Language Model Guided Incentive Aware Reward Design for Cooperative Multi-Agent Reinforcement Learning
Designing effective auxiliary rewards for cooperative multi-agent systems remains challenging, as misaligned incentives can induce suboptim…
Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs
We show that AI agents are capable of discovering novel algorithms for adversarial attacks against LLMs, advancing the state of the art on…
Limits of Spatial Imagery Reasoning in Frontier LLM Models
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, yet they struggle with spatial tasks that require mental…
EuraGovExam: A Multilingual Multimodal Benchmark from Real-World Civil Service Exams
We present EuraGovExam, a multilingual and multimodal benchmark sourced from real-world civil service examinations across five representati…
Incentives, Equilibria, and the Limits of Healthcare AI: A Game-Theoretic Perspective
Using a stylised coordination problem drawn from inpatient capacity management, three archetypal forms of AI deployment are described: effo…
Understand and Accelerate Memory Processing Pipeline for Large Language Model Inference
Modern large language models (LLMs) increasingly depend on efficient long-context processing and generation mechanisms, including sparse a…
Acoustic and perceptual differences between standard and accented speech and their voice clones
Voice cloning is often evaluated in terms of overall quality, but less is known about accent preservation and its perceptual consequences.…
Automated Conjecture Resolution with Formal Verification
Recent advances in large language models have significantly improved their ability to perform mathematical reasoning, extending from elemen…
CalM: A Self-Supervised Foundation Model for Population Dynamics in Calcium Imaging Data
Recent work suggests that large-scale, multi-animal modeling can significantly improve neural recording analysis. However, for functional c…
Distributional Open-Ended Evaluation of LLM Cultural Value Alignment Based on Value Codebook
As LLMs are globally deployed, aligning their cultural value orientations is critical for safety and user engagement. However, existing ben…
AtomEval: Validity-Aware Atomic Evaluation of Adversarial Claim Rewriting in Fact Verification
Large language models (LLMs) can rewrite refuted claims to evade evidence-based fact verifiers, but conventional attack success rate (ASR)…
Multi-Modal Learning meets Genetic Programming: Analyzing Alignment in Latent Space Optimization
Symbolic regression (SR) aims to discover mathematical expressions from data, a task traditionally tackled using Genetic Programming (GP) t…
U-Cast: A Surprisingly Simple and Efficient Frontier Probabilistic AI Weather Forecaster
AI-based weather forecasting now rivals traditional physics-based ensembles, but state-of-the-art (SOTA) models rely on specialized archite…
Frequency-Enhanced Diffusion Models: Curriculum-Guided Semantic Alignment for Zero-Shot Skeleton Action Recognition
Human action recognition is pivotal in computer vision, with applications ranging from surveillance to human-robot interaction. Despite the…
Beyond Offline A/B Testing: Context-Aware Agent Simulation for Recommender System Evaluation
Recommender systems are central to online services, enabling users to navigate through massive amounts of content across various domains. H…
Genie 4D: Semantic-Prior-Guided 4D Dynamic Scene Reconstruction
At the intersection of computer vision and robotic perception, 4D reconstruction of dynamic scenes connects low-level geometric sensing wit…
AffordGen: Generating Diverse Demonstrations for Generalizable Object Manipulation with Afford Correspondence
Despite the recent success of modern imitation learning methods in robot manipulation, their performance is often constrained by geometric…
Vibe-driven model-based engineering
There is a pressing need for better development methods and tools to keep up with the growing demand and increasing complexity of new softw…
SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting
On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome…
TInR: Exploring Tool-Internalized Reasoning in Large Language Models
Tool-Integrated Reasoning (TIR) has emerged as a promising direction by extending Large Language Models' (LLMs) capabilities with external…
Representation over Routing: Diagnosing Temporal Routing Pathologies in Multi-Timescale PPO
Temporal credit assignment in reinforcement learning is often approached by introducing value estimates at multiple discount factors. A nat…
Just Type It in Isabelle! AI Agents Drafting, Mechanizing, and Generalizing from Human Hints
Type annotations are essential when printing terms in a way that preserves their meaning under reparsing and type inference. We study the p…
Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models
Reward-based fine-tuning steers a pretrained diffusion or flow-based generative model toward higher-reward samples while remaining close to…
Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
Vision-Language Navigation (VLN) requires an agent to navigate through 3D environments by following natural language instructions. While rec…
BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps
Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic str…
Deep Interest Mining for Intent-Enriched Semantic IDs in Multimodal Generative Recommendation
Semantic IDs (SIDs) provide the discrete item vocabulary used by generative recommendation, but their quality depends on what item evidence…
FlowPlace: Flow Matching for Chip Placement
Chip placement plays an important role in physical design. While generative models like diffusion models offer promising learning-based sol…
How Can Reinforcement Learning Achieve Expert-level Placement?
Chip placement is a critical step in physical design. While reinforcement learning (RL)-based methods have recently emerged, their training…
MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution
High-precision medical diagnosis relies not only on static imaging features but also on the implicit diagnostic memory experts instantly in…
Defeasible Conditional Obligation in a Two-tiered Preference-based Semantics (Extended Version)
In response to a concern raised by Horty, this paper develops a two-tiered, preference-based semantic framework for modeling defeasible con…
Beyond Visual Fidelity: Benchmarking Super-Resolution Models for Large-Scale Remote Sensing Imagery via Downstream Task Integration
Super-resolution (SR) techniques have made major advances in reconstructing high-resolution images from low-resolution inputs. The increase…
Possibilistic Predictive Uncertainty for Deep Learning
Deep neural networks achieve impressive results across diverse applications, yet their overconfidence on unseen inputs necessitates reliabl…
STABLEVAL: Disagreement-Aware and Stable Evaluation of AI Systems
Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make syste…
Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models
Uncertainty quantification (UQ) is an important technique for ensuring the trustworthiness of LLMs, given their tendency to hallucinate. Ex…
MidSteer: Optimal Affine Framework for Steering Generative Models
Steering intermediate representations has emerged as a powerful strategy for controlling generative models, particularly in post-deployment…
Do Joint Audio-Video Generation Models Understand Physics?
Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand…
Why Self-Inconsistency Arises in GNN Explanations and How to Exploit It
Recent work has observed that explanations produced by Self-Interpretable Graph Neural Networks (SI-GNNs) can be self-inconsistent: when th…
Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning
On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks expose…
Normalization Equivariance for Arbitrary Backbones, with Application to Image Denoising
Normalization Equivariance (NE) is a structural prior that improves robustness to distribution shift in image-to-image tasks. A function $f…
Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation
While recent work in Reinforcement Learning with Verifiable Rewards (RLVR) has shown that a small subset of critical tokens disproportionat…
The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space
As current Multimodal Large Language Models rapidly saturate canonical visual reasoning benchmarks, a key question emerges: do these strong…
OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning
We study on-policy self-distillation (OPSD), where a language model improves its reasoning ability by distilling privileged teacher distrib…
Multi-Rollout On-Policy Distillation via Peer Successes and Failures
Large language models are often post-trained with sparse verifier rewards, which indicate whether a sampled trajectory succeeds but provide…
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
Large language models (LLMs) achieve strong performance across many tasks but remain vulnerable to hallucinations, making it important to s…
RISED: A Pre-Deployment Evaluation Framework for High-Stakes AI Decision-Support Systems, with Application to Healthcare
Clinical decision-support systems are expert systems whose recommendations clinicians act on directly, yet they are usually cleared on one…
Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective
Group Relative Policy Optimization (GRPO) is one of the most widely adopted RLVR algorithms for post-training large language models on reas…
CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large Vision-Language Models
In large vision-language models, visual tokens typically constitute the majority of input tokens, leading to substantial computational over…
Many-Shot CoT-ICL: Making In-Context Learning Truly Learn
While many-shot ICL achieves remarkable performance, prior studies of its scaling behavior have mainly focused on non-reasoning tasks. In t…
AttenA+: Rectifying Action Inequality in Robotic Foundation Models
Existing robotic foundation models, while powerful, are predicated on an implicit assumption of temporal homogeneity: treating all actions…
Topology-Preserving Neural Operator Learning via Hodge Decomposition
In this paper, we study solution operators of physical field equations on geometric meshes from a function-space perspective. We reveal tha…
Beyond AI as Assistants: Toward Autonomous Discovery in Cosmology
Recent advances in artificial intelligence (AI) agents are pushing AI beyond tools toward autonomous scientific discovery. We discuss two c…
PBT-Bench: Benchmarking AI Agents on Property-Based Testing
Existing code benchmarks measure whether an agent can produce any test that reproduces a known bug, or whether it can produce a patch that…
Avoiding Structural Failure Modes in Tabular Fair SSL: Online Primal-Dual Allocation under Confidence Gating
Semi-supervised learning (SSL) enables prediction with limited labels, but high-stakes tabular applications (medical, credit, recidivism) r…
Physics-Guided Geometric Diffusion for Macro Placement Generation
Macro placement is a pivotal stage in VLSI physical design, fundamentally determining the overall chip performance. Recent data-driven plac…
Privacy Policy Enforcement Guardrails for Data-Sensitive Retrieval-Augmented Generation
Standard PII filters often miss contextual data leakage in RAG systems, such as non-regulated attribute clusters that collectively identify…
DynMuon: A Dynamic Spectral Shaping View of Muon
In recent years, Muon has emerged as the dominant method for training large language models, and transformers more broadly. The essential d…
Balancing Knowledge Distillation for Imbalance Learning with Bilevel Optimization
Knowledge distillation transfers knowledge from a high capacity teacher to a compact student using a mixture of hard and soft losses. On im…
Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling
Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across…
Can Vision Models Truly Forget? Mirage: Representation-Level Certification of Visual Unlearning
Machine unlearning in Vertical Federated Learning (VFL) has attracted growing interest, yet existing methods certify forgetting solely usin…
Co-Fusion4D: Spatio-temporal Collaborative Fusion for Robust 3D Object Detection
In autonomous driving, 3D object detection is essential for accurate perception and reliable decision-making. However, object motion and eg…
Task-Aligned Self-Supervised Learning for Medical Image Analysis: A Systematic Review and Practical Design Guidelines
Self-supervised learning (SSL) has emerged as a promising paradigm for addressing the annotation bottleneck in medical imaging by learning…
Attested Tool-Server Admission: A Security Extension to the Model Context Protocol
The Model Context Protocol (MCP) standardizes how a large-language-model (LLM) agent and an external tool server exchange messages, but not…
CRISP -- Clustering-Based Redundancy-Reduced Instance Sampling for Pathology Case Representation and Retrieval
Digital pathology archives increasingly contain multiple whole-slide images (WSIs) per case, capturing spatially distinct tumour regions an…
Treatment Effect Estimation with Differentiated Networked Effect on Graph Data
Estimating individual treatment effect (ITE) from observational graph data is crucial for decision-making in fields such as commerce an…
Rethinking Weak Supervision in Anomaly Detection: A Comprehensive Benchmark
Weakly supervised anomaly detection (WSAD) has developed in three primary directions: incomplete, inexact, and inaccurate supervision. Howe…
Channel-wise Vector Quantization
We present Channel-wise Vector Quantization (CVQ), a novel image tokenization paradigm that replaces patch-wise tokens with channel-wise to…
GoQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization
The deployment of Large Language Models (LLMs) and Vision Transformers (ViTs) on edge devices is significantly constrained by memory limita…
Algorithmic Fragility and Persona Bias in LLM-Generated Autistic Communication
Safety alignment reduces explicitly harmful outputs but inadvertently encodes a sanitized, neuronormative representation of marginalized co…
Targeted Remasking: Replacing Token Editing with Token-to-Mask Refinement in Discrete Diffusion Language Models
Discrete masked diffusion language models such as LLaDA generate text through iterative denoising, where mask tokens are progressively repl…
Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning
Group-based reinforcement learning (RL) methods have achieved remarkable success in improving the performance of large language models (LLM…
Knowledge Graphs as the Missing Data Layer for LLM-Based Industrial Asset Operations
LLM-based agents for industrial asset operations show limited accuracy when reasoning over flat document stores. AssetOpsBench (KDD 2026) e…
Cast a Wider Net: Coordinated Pass@K Policy Optimization for Code Reasoning
Repeated sampling with a verifier is the standard way to allocate test-time compute for code generation, with pass@$K$ as the canonical met…
Generic Interpretation Approach for Transformer Models Incorporating Heterogenous Attention Structures
Transformer has significantly propelled the development of artificial intelligence, and certainly the development of agents as well. We cat…
BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law
We introduce the BenGER (Benchmark for German Law) dataset for evaluating LLM systems on subsumption-based legal reasoning in German law. T…
Hallucination Detection-Guided Preference Optimization for Clinical Summarization
Large language models (LLMs) have shown promise on summarization tasks, but they often produce hallucinations, which are unsupported or inc…
GEO-Bench: Benchmarking Ranking Manipulation in Generative Engine Optimization
Large language models (LLMs) increasingly rank products, documents, and recommendations for user queries, which makes manipulating these ra…
TIMEGATE: Sustainable Time-Boxed Promotion Gates for Continual ML Adaptation Under Resource Constraints
As machine learning (ML) systems evolve to continual adaptation, each re-training cycle uses compute, annotation, and energy. We introduce T…
BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference
Diffusion language models (dLLMs) generate text by iteratively denoising multiple token positions in parallel, offering an attractive alter…
Honest Lying: Understanding Memory Confabulation in Reflexive Agents
Reflexion-style agents rely on self-generated reflections as memory, implicitly assuming that agents can accurately diagnose their own fail…
AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling
Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current…
GiPL: Generative augmented iterative Pseudo-Labeling for Cross-Domain Few-Shot Object Detection
Vision-language foundation models have shown promising zero-shot generalization for Cross-Domain Few-Shot Object Detection (CD-FSOD). Howev…
HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding
Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-qu…
Beyond MSE: Improving Precipitation Nowcasting with Multi-Quantile Regression
Deep-learning precipitation nowcasting models are often optimized using pointwise losses such as mean squared error or mean absolute error,…
Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms
As autonomous language model agents proliferate, forming an emerging agentic web with real-world consequences, what credibility signals can…
CalArena: A Large-Scale Post-Hoc Calibration Benchmark
Reliable probability estimates are critical in many machine learning applications, yet modern classifiers are often poorly calibrated. Post…
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fra…
Self-Trained Verification for Training- and Test-Time Self-Improvement
Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, throu…
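Teenage girls turning to generative AI for personal advice: the fallout from the "AI correct answer" that led to the former manager Abe case
Shinnosuke Abe (47), former manager of pro baseball's Giants, was arrested on suspicion of assaulting his eldest daughter (18). It began when the daughter consulted the conversational generative AI "ChatGPT" about the abuse and, based on its answer, contacted a child consultation center. Apart from the rights and wrongs of the daughter's actions, generative AI has taken root as a confidant for young people, and young women…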
Alphabet plans to raise $80B to pay for AI buildout
"The company is experiencing strong demand for its AI solutions and services from enterprises and consumers, at levels that are exceeding t…
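NVIDIA's "fox" is an AI agent for autonomous factory management; Taiwanese manufacturers confirm the benefits of adoption
NVIDIA has announced "NVIDIA Factory Operations Blueprint (FOX)", a reference design for AI agents that autonomously manage factories. With FOX, various data within a factory can be monitored and analyzed in real time, and multiple AI agents and equipment can be linked…
How to dispel the "AI allergy" on the factory floor? A "three-month struggle" by Hitachi's new-graduate digital talent
Hitachi is sending newly hired data scientists with specialist skills in AI and data analysis to manufacturing sites. How did they dispel the "AI allergy" on the shop floor and, through communication with frontline staff, get a generative AI tool that shortens working hours to take hold? A young female data scientist who took part in the training…
Unified management of edge AI with NVIDIA's "NemoClaw": Advantech announces "WEDA"
At its partner event "2026 Advantech World Partner Conference (WPC)", Advantech explained "WEDA", a solution that provides integrated management of edge AI from development through deployment and operation.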
Nvidia chases $200B CPU market with AI agent PCs from Microsoft, Dell, and HP
If Nvidia has cracked a way to bring AI agents easily, safely, and usefully to the masses, it could — and should — be big.
Florida sues OpenAI, Sam Altman, in first-of-its-kind lawsuit over violent incidents
The lawsuit partially revolves around a shooting at Florida State University last year, and ChatGPT's alleged role in the incident.
Water access is now a risk factor in SpaceX’s IPO
The company says it needs "significant" water resources to cool its data centers, and that access to abundant, affordable water is a challe…
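AI concierge comes to "Rakuten Super SALE": conversational product search, plus "shop-around strategy" tips
Tell it your intended use and budget by text or voice, and it finds products that match your conditions among the items on sale.
Anthropic prepares to go public; latest valuation is about 154 trillion yen
Anthropic has confidentially submitted a draft "S-1" registration statement to the SEC ahead of an IPO. Its valuation in the most recent Series H funding round reached about $965 billion (about 154 trillion yen).
Claude rate limits get an "apology reset" for Pro and Max plans; some users saw "usage consumed faster than expected"
US-based Anthropic announced that it has reset the 5-hour and weekly rate limits for users of the paid "Pro" and "Max" plans of its chat AI "Claude".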
Our views on AI policy and political advocacy
Our approach to AI policy and political advocacy, transparency, support for thoughtful regulation and AI safety, and that no outside politi…
Anthropic files to go public
Anthropic, now an AI powerhouse that has landed top-tier enterprise customers, was once considered an underdog in the emerging world of lar…
This AI weather startup is out-forecasting government agencies
WindBorne benefits from its unique combination of model-building and data collection. The company now has about 400 balloons in flight gath…
2026-06-01 (415 items)
DuckDuckGo makes its ‘no-AI’ search engine easier to access as its traffic booms
Alternative search engine DuckDuckGo launches 'no AI' web extensions for Chrome and Firefox users.
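Isn't "FDE" just a rebranding of the on-site client SE after all? We asked Accenture
Is "FDE" (forward deployed engineer), the new profession touted by AI platform companies, a rehash of the systems engineer stationed at a client site? We put the question to Gakuse Hoshina and Toshiyuki Kataoka of Accenture, which has launched an FDE organization jointly with Microsoft and is also promoting its own "RDE".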
Building the infrastructure for the Intelligence Age in Michigan
OpenAI breaks ground on a 1GW data center project in Michigan as part of Stargate, building AI infrastructure to expand access, create jobs…
OpenAI frontier models and Codex are now available on AWS
OpenAI frontier models and Codex are now generally available on AWS, giving enterprises a new path to build with OpenAI through the AWS env…
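What is Salesforce's "far-sighted strategy"? Open systems in the AI agent era and where the battle for leadership is headed
Salesforce has put forward a new solution aimed at enterprise business systems in which many AI agents operate. Examining the aims of the solution suggests a far-sighted strategy on the company's part.
The Chiba Bank Group IT company that used AI to dramatically speed up VB.NET migration: how did it achieve "12.5 person-months → 2.0 person-months"?
Chibagin Computer Service built an AI-driven development framework and cut the effort for migrating an existing VB.NET system from 12.5 person-months to 2.0 person-months. How did it do it?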
PhyDrawGen: Physically Grounded Diagram Generation from Natural Language
Generating physics diagrams from text requires strict adherence to physical laws. While current generative models produce visually plausibl…
Physically Viable World Models: A Case for Query-Conditioned Embodied AI
World models for embodied AI must be physically viable: constructed to answer intervention queries by representing the physical structure g…
Transforming and Encoding FTS for SAT Solving: What Helps, What Hurts (Extended Version)
Factored tasks are a classical planning representation that extends SAS+ with limited forms of disjunctive preconditions, conditional effec…
Procedural Generation of First Person Shooter Maps using Map-Elites
We investigate the application of MAP-Elites (a well-known quality diversity algorithm) to design levels for First-Person Shooter (FPS) gam…
Uncertainty-Aware and Temporally Regulated Expert Advice in Reinforcement Learning for Autonomous Driving
Exploration in reinforcement learning for autonomous driving is inherently unsafe: agents must experience novel behaviors to learn, yet exp…
Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents
LLM agents are increasingly deployed as systems built around editable external harnesses, including prompts, skills, memories and tools, th…
EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs
Clinical decision-making (CDM) is central to real-world clinical workflows, where clinicians infer diagnoses, select treatments, or anticip…
Structure-Induced Information for Rerooting Levin Tree Search
Subgoal-based policy tree search, which uses a policy to guide search, is effective for complex single-agent deterministic problems but oft…
Healthcare Mechanisms from Policy-as-Code Search under Strategic Provider Response
Healthcare mechanisms are inseparable from the strategic provider response they induce: existing healthcare AI benchmarks hold this respons…
MAVEN: Improving Generalization in Agentic Tool Calling
Generalization across agentic tool-calling environments remains a central challenge for reliable agentic reasoning systems. Although large…
Generating Graph-like Rules for Knowledge Graph Reasoning via Diffusion Models
Logical rules constitute a cornerstone of knowledge graph (KG) reasoning, valued for their interpretability and ability to model relational…
Learning Agent-Compatible Context Management for Long-Horizon Tasks
LLM agents increasingly face long-horizon tasks such as web search and deep research in real-world applications, where accumulated context…
PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges
LLM judges are increasingly used to evaluate open-ended responses, but their scores depend strongly on the rubrics that condition them. A v…
Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward
Deep research tasks require LLMs to plan what to investigate, retrieve evidence, and synthesize long-form answers across multiple branches…
SLAT: Segment-Level Adaptive Trimming for Efficient CoT Reasoning
Recent advances in Large Reasoning Models have significantly improved chain-of-thought (CoT) capabilities via reinforcement learning (RL).…
COMPASS: Cognitive MCTS-Guided Process Alignment for Safe Search Agents
LLM-powered search agents enable multi-step reasoning and tool use. However, these capabilities introduce retrieval-induced safety degradat…
Distilling LLM Feedback for Lean Theorem Proving
Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most comm…
UniScale: Adaptive Unified Inference Scaling via Online Joint Optimization of Model Routing and Test-Time Scaling
In real-world deployments of large language models (LLMs), balancing inference quality and computational cost has become a central challeng…
BilliardPhys-Bench: Benchmarking Physical Reasoning and Visual Dynamics of Multimodal LLMs
Current multimodal models handle static image recognition well, but intuitive physical reasoning remains a weakness. Predicting how objects…
A Persona-Based Evaluation Framework for Pluralistic Alignment in Generative AI
Current alignment paradigms for generative artificial intelligence rely predominantly on monolithic benchmarking frameworks that reduce the…
HADT: A Heterogeneous Multi-Agent Differential Transformer for Autonomous Earth Observation Satellite Cluster
This work addresses the problem of autonomous resource management in a heterogeneous satellite cluster conducting Earth Observation (EO) miss…
GraphARC: A Comprehensive Benchmark for Graph-Based Abstract Reasoning
Relational reasoning lies at the heart of intelligence, but existing benchmarks are typically confined to formats such as grids or text. We…
Vector Linking via Cross-Model Local Isometric Consistency
We study Vector Linking: given two embedding clouds produced by different black-box encoders over partially overlapping datasets, recover c…
LLM-FACETS: A Privacy-Preserving Framework for Evaluating LLM Transparency and Accountability
Assessing whether Large Language Model outputs are factually grounded, epistemically calibrated, and methodologically reproducible is a pr…
Formalizing and falsifying causal pathways of rare events
Building on recent formalizations of root cause analysis for rare events (``outliers'') in structural equation models, we propose a formal…
COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation
LLM agents are increasingly expected not only to complete isolated tasks, but also to carry bounded representations of human expertise, jud…
Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation
Reliable evaluation of agentic systems requires unbiased estimates with valid uncertainty, but standard practice navigates between costly h…
TraceGraph: Shared Decision Landscapes for Diagnosing and Improving Agent Trajectories
Agent benchmarks increasingly record rich interaction trajectories, yet evaluation often reduces each rollout to a pass rate or reward scor…
Diagnosing Failure Modes of Shared-State Collaboration in Resource-Constrained Visual Agents
Modular visual reasoning systems increasingly rely on shared working memory for multi-step collaboration, yet the failure dynamics of inter…
Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration
Recent advances in Multimodal Large Language Models (MLLMs) have led to promising progress in web agents. However, existing web agents ofte…
HypoAgent: An Agentic Framework for Interactive Abductive Hypothesis Generation over Knowledge Graphs
Abductive reasoning over knowledge graphs aims to generate logical hypotheses that explain observed entities or facts. Existing controllabl…
FAM-Bench: A Multimodal Benchmark for Condition-Aware Food-as-Medicine Reasoning
Food-as-Medicine requires models to reason beyond what a dish is or what nutrition it contains: they must decide whether a concrete food ch…
Answer-Set-Programming-based Abstractions for Reinforcement Learning
Reinforcement Learning (RL) enables autonomous agents to learn policies from experience, but realistic problems often involve enormous stat…
AutoSci: A Memory-Centric Agentic System for the Full Scientific Research Lifecycle
Scientific research has traditionally been human-intensive, requiring researchers to coordinate literature, ideas, experiments, manuscripts…
LinTree: Improving LLM Reasoning with Explicitly Structured Search Histories
Large language models (LLMs) often solve reasoning problems by generating intermediate traces that explore and revise partial solutions. Fr…
Choosing the Lens: Strategic Perspective Activation in Context-Dependent Argumentation
The same arguments often need to be evaluated under different external regimes. An agent with influence over the regime has a strategic lev…
TRINE: A Token-Aware, Runtime-Adaptive FPGA Inference Engine for Multimodal AI
Multimodal stacks that mix ViTs, CNNs, GNNs, and transformer NLP strain embedded platforms because their compute/memory patterns diverge an…
When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL
For sparse, structured reinforcement-learning tasks with semantic reward-function interfaces, LLM-generated reward shaping is better framed…
Gradient-Free Training of Spiking Neural Networks via Low-Rank Evolution Strategies
Spiking Neural Networks (SNNs) offer compelling energy efficiency on neuromorphic hardware, yet their training remains challenging because…
XOResNet: Exclusive-OR Meta-Residuals Facilitate Deep Spiking Neural Networks Learning
Spiking neural networks (SNNs) hold promise for demonstrating superior learning and representation capabilities in deep models. Given the t…
Enhancing Regime Shift Detection Using Unstructured Data: A Study on the Treasury Market
Regime shifts in financial markets reorganise the joint dynamics of asset prices and macro variables, breaking any single-regime calibratio…
Hamiltonian-Inspired Attention Mechanism for Scalable RF Transmitter Fingerprinting
Radio-frequency (RF) fingerprinting identifies wireless transmitters using hardware-induced imperfections present in baseband I/Q signals.…
Mental Damage: Caption Poisoning Attacks on Retrieval-Augmented Text-to-Music Generation
Retrieval-augmented text-to-music (TTM) systems augment underspecified user prompts using captions retrieved from a music caption dataset.…
Reinterpreting Safety Thresholds as Neuron Spiking Thresholds
Surrogate Safety Measures (SSMs) are extensively utilised in the evaluation of traffic risk in automated driving contexts. However, the maj…
Updating the standard neuron model in artificial neural networks
From their inception in the 1950s, artificial neural networks (ANNs) started using the so-called point neuron model then prevalent in neuro…
Evolutionary Algorithm for Reservoir Learning and Yielding
Reservoir computing, a type of recurrent neural network, is a promising approach for temporal learning as it separates dynamic processing f…
Full-field prediction for engineering-scale three-dimensional aircraft with multigrid-hierarchical learning
High-fidelity computational fluid dynamics is essential for aerospace design, but engineering-scale simulations of practical three-dimensio…
Unicorn: Scaling High-Dimensional Time Series Forecasting via Universal Correlation Modeling
Modern time series architectures face a fundamental trade-off: channel-independent models scale well with increasing data volume but ignore…
When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception
Deceptive alignment, in which models maintain accurate internal representations while deliberately producing false outputs, remains a centr…
Structured interactions improve distributed coordination beyond model scaling in a real-world multi-robot system
Scaling individual robot capabilities is common but costly. Here we investigate a system-level design question in real-world multi-robot co…
LLMs Without Deep Neural Networks: New Architecture, Benefits and Case Study
The purpose of this article is to provide validation to my deep neural network alternative in the context of LLMs. Very recently, there has…
Functional MRI Time Series Generation via Wavelet-Based Image Transform and Spectral Flow Matching for Brain Disorder Identification
Functional Magnetic Resonance Imaging (fMRI) provides non-invasive access to dynamic brain activity by measuring blood oxygen level-depende…
Social Reasoning in Machines: Investigating Collective Truth-Seeking Dynamics in Large Language Model Debate
Human reasoning has long been theorised to operate socially, not through isolated individual cognition, but through collective adversarial…
NumLeak: Public Numeric Benchmarks as Latent Labels in Foundation Models
Public numeric benchmarks appear in pretraining, so an evaluation that conditions on a date may be measuring memorized recall rather than o…
CodeGolf Bench: A Multi-Language Benchmark for Evaluating Concise Code Generation Capabilities of Large Language Models
This paper introduces Code Bench, a benchmark capable of evaluating the concise code generation abilities of Large Language Models (LLMs) in 60 pr…
AI Loss of Control Incident Management: Response & Resilience
Recent research demonstrating AI systems exhibiting deception and shutdown resistance suggests that AI loss of control (LOC) is an urgent p…
Exploring Autonomous Agentic Data Engineering for Model Specialization
Large Language Models (LLMs) have demonstrated strong performance on general tasks, while often struggling to adapt to specialized domains…
SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer
Real-time streaming video-to-video editing (V2V) is critical for interactive applications such as live broadcasting and gaming, yet it rema…
Domain Adaptation and Reasoning Frameworks in Language Models: A Controlled Experiment with Historical Cosmology
We investigate how domain adaptation reshapes explanatory behavior in language models using historical cosmology as a controlled setting. I…
LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis
Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agen…
Calibrated Preference Learning: The Case of Label Ranking
Calibration, the alignment of predicted probabilities with true outcome frequencies, is essential for reliable decision-making. While exten…
A Unified Framework for Gradient Aggregation in Multi-Objective Optimization
Many machine learning problems involve multiple inherent trade-offs that are best addressed by gradient-based multi-objective optimization…
The Surface You Test Is Not the Surface That Breaks
Tool-augmented LLM agents are vulnerable to prompt injection: a third party who controls part of the agent's context can plant instructions…
Scalable Constrained Multi-Agent Reinforcement Learning via State Augmentation and Consensus for Separable Dynamics
We present a distributed approach for constrained Multi-Agent Reinforcement Learning (MARL) that combines state-augmented policy learning w…
idSCD: Identifying Training Datasets through Semantic Correlation Descriptors
Can a dataset be recognized from the spurious correlations it induces during training? We argue that datasets leave dataset-specific traces…
Graph-Conditioned Mixture of Graph Neural Network Experts for Traffic Forecasting
Spatio-temporal forecasting on sensor graphs is commonly tackled with a single backbone architecture applied uniformly across all nodes, al…
Improved Distribution Estimation in $\ell_\infty$
We present improved bounds for estimating discrete probability distributions under the $\ell_\infty$ norm. These include minimax bounds in…
A Novel Global Context-aware Deep Neural Network for Enhanced Brain Tumor Segmentation using Magnetic Resonance Images
Brain cancer's severity necessitates precise brain tumor segmentation, which is crucial for effective brain tumor diagnosis. Manual identif…
Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don't
Recent work describes what transformers can and cannot compute through connections to boolean circuits, but existing results lack exact cha…
Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages
Sentence-embedding models for semantic search are overwhelmingly developed and evaluated on English corpora. When applied to clinical retri…
Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?
Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observ…
VLM3: Vision Language Models Are Native 3D Learners
Vision Language Models (VLMs) enable a unified model to solve various vision tasks through prompting. They have shown promising performance…
Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode
Physical AI systems, including robots, autonomous vehicles, embodied agents and edge copilots, often run a different inference workload fro…
Prior Availability in Industrial Visual Sim-to-Real: A Review of CAD-Guided and CAD-Unavailable Regimes
Industrial visual sim-to-real is often described as transferring from synthetic images to real images, but industrial deployment usually in…
Benchmarking Machine Learning Uncertainty Quantification Methodologies for Predicting Turbine Gas Temperature Degradation
Effective prognostics and health management of modern engines relies on accurate turbine gas temperature predictions and robust uncertainty…
ImmigrationQA: A Source-Grounded Dataset and Small-Model Adaptation for U.S. Immigration Law
U.S. immigration law spans thousands of pages of official policy, federal regulations, and procedural guidance that change frequently and c…
Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents
Two clinical AI systems can score nearly identically on coverage-based rubrics yet behave radically differently when their patient inputs c…
Scientific Machine Learning for Engine Health Management and Remaining Useful Life Prediction
Engine Health Management (EHM) depends on reliable forecasting of Remaining Useful Life (RUL) and on tracking thermal indicators such as tu…
An Organization-Scoped LLM Agent Runtime Architecture for Regulated Cybersecurity Operations
Regulated cybersecurity workflows lack a runtime substrate that enforces organization-level scope across retrieval, tool calls, memory, fin…
Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs
Scientific figures are among the most effective means of communicating complex research ideas, yet producing publication-quality illustrati…
Reward Learning from Best-of-$N$ Preference Data: Targets, Tradeoffs, and Design Principles
Best-of-$N$ sampling is widely used to construct pairwise preference data: $N$ candidates are drawn from a base distribution, and the best…
Active Timepoint Selection for Learning Measure-Valued Trajectories
Inferring continuous probability paths from sparse snapshots is a fundamental challenge in domains like single-cell biology, where high-fid…
The Architecture of Errors: From Universal Impossibility to Patch-Local LLM Reliability
Universal LLM reliability is not a finite-library problem: across all possible tasks, tools, schemas, knowledge sources, and evaluator expe…
Controllable Lung Nodule Synthesis via Histogram-Regularized Latent Diffusion Models
While automated diagnosis systems have achieved remarkable success in computed tomography (CT)-based lung cancer screening, their developme…
Rationalize: Shared Semantic Reasoning for Human-AI Alignment
We introduce Rationalize, a role-pair framework for shared semantic reasoning between humans and AI models in data-driven sensemaking. Buil…
Score Broadcast and Decorrelation: A General Framework for Broadcast-Based Credit Assignment
We introduce Score Broadcast and Decorrelation (SBD), a principled framework for broadcast-based credit assignment for general families of…
PInVerify: An Offline Embodied Benchmark for Active Instance Verification
Embodied agents have made strong progress in navigating to target objects, but reaching the goal vicinity does not guarantee that the agent…
COFT: Counterfactual-Conformal Decoding for Fair Chain-of-Thought Reasoning in Large Language Models
Large language models (LLMs) can reveal and amplify societal biases during chain-of-thought (CoT) generation. We present COFT (Chain of Fai…
Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs
Large Language Models (LLMs) are increasingly used in clinical applications. However, their behavior remains highly sensitive to subtle lin…
LARK: Learnability-Grounded Trajectory Selection for Efficient Reasoning Distillation
We study trajectory selection for reasoning distillation, where teacher-generated reasoning trajectories are selectively used as supervisio…
EUDAIMONIA: Evaluating Undesirable Dynamics in AI
Large language models (LLMs) are increasingly used as conversational partners for companionship, emotional disclosure, and interpersonal ad…
Automatically Attacking Software Reverse Engineering AI Agents
Software tools for reverse engineering executable binary files, such as Ghidra, enable malware analysts to safely conduct robust static ana…
CobSeg: Coherence Boundary Modeling for Dialogue Topic Segmentation
Dialogue topic segmentation is critical in many human-AI collaborative applications and requires identifying heterogeneous boundary cues,…
Human-Alignment, Calibration, and Activation Patterns in Large Language Model Uncertainty
Uncertainty Quantification is a large and growing subfield of large language model behavioral analysis. Primarily to recognize and combat h…
Investigating Detection and Obfuscation of Prompt Injection Attacks Against Software Reverse Engineering AI Agents
Agentic software reverse engineering systems are vulnerable to prompt injection attacks placed into the source code of executable binary fi…
How Early Adopters Used Generative AI Worldwide: Variation by Country Income and Language
AI is being used by people globally, but not everyone is using it in the same ways. Using a large-scale dataset of anonymized, de-identifie…
Depth-Dependent Indirect Prompt Injection in Tool-Calling ReAct Agents: Injection Depth, Payload Framing, and Turn-Budget Sensitivity
ReAct agents that interleave chain-of-thought reasoning with tool calls are increasingly deployed for real tasks such as scheduling, file r…
ConTrans: Learning Text-enhanced Local-global Temporal Representations for Zero-shot Temporal Action Localization
Zero-shot Temporal Action Localization (ZS-TAL) aims to detect and locate previously unseen actions in untrimmed videos. However, existing…
Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence
Vision-language models (VLMs) have achieved strong performance on visual question answering (VQA). To mitigate individual hallucinations an…
SAGE: A Novelty Gate for Efficient Memory Evolution in Agentic LLMs
Agentic LLMs must continuously decide whether newly extracted facts should be added, merged with existing memories, or ignored, yet prior w…
Simple Token-Efficient Vision-Language Model for Case-level Pathology Synoptic Report Generation
Generating clinically useful pathology reports for pathology cases from whole-slide images (WSIs) is challenging due to gigapixel resolutio…
When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?
We study when large language models (LLMs) can serve as effective black-box policy optimizers for reinforcement learning (RL) tasks, i.e.,…
Kalimati Vegetable Price Index Forecasting with a Momentum Corrected Online Stacking Ensemble
Forecasting agricultural commodity prices in emerging economies is difficult due to high volatility, frequent supply disruptions, and stron…
OrcaRouter: A Production-Oriented LLM Router with Hybrid Offline-Online Learning
The rapid development of large language models, each with distinct capabilities and inference costs, raises a practical deployment question…
GSAM: A Generalizable and Safe Robotic Framework for Articulated Object Manipulation
Articulated object manipulation is a unique challenge for service robots. Existing methods employ end-to-end policy learning, visionmotion…
Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS
We present Chatterbox-Flash, a zero-shot text-to-speech model obtained by fine-tuning a pretrained autoregressive TTS decoder into a block-…
XLGoBench: Detecting cross-lingual skill gaps with algorithmic tasks
We introduce a set of synthetic algorithmic tasks to detect cross-lingual gaps in the abilities of large language models. Our benchmark is…
Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO
We identify a new dimension for enhancing rollout diversity in Group Relative Policy Optimization (GRPO) for LLMs. While GRPO relies on div…
On the impact of retrieved content representations in RAG Pipelines
Retrieval-Augmented Generation (RAG) supplements a language model's input with retrieved documents, yet most RAG pipelines inherit retrieva…
OpenSTBench: Beyond Semantic Evaluation for Speech Translation
Speech translation systems increasingly span speech-to-text translation (S2TT), speech-to-speech translation (S2ST), offline translation, a…
MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding
Multimodal Large Language Models (MLLMs) have demonstrated significant achievements in general visual question answering (VQA) tasks. Howev…
Design and Evaluation of Multi-Agent AI Oracle Systems for Prediction Market Resolution
Prediction markets aggregate collective intelligence to forecast uncertain events, but their utility depends on reliable outcome resolution…
Differentially Private Preference Data Synthesis for Large Language Model Alignment
Preference alignment is a crucial post-training step for large language models (LLMs) to ensure their outputs align with human values. Howe…
GaMi: Geometry-Agnostic Material Identification via Cross-Modal Subtractive Disentanglement
Non-contact material identification enables adaptive interaction for embodied intelligence yet faces challenges from geometry-induced varia…
Unlearning in Diffusion Models: A Unified Framework with KL Divergence and Likelihood Constraints
Unlearning in diffusion models aims to remove undesirable data or concepts while preserving the utility of pretrained models -- two fundame…
Beyond Agreement: Scoring Panel-Surfaced Biomedical Entity Candidates for Curator Triage
Biomedical NER is deceptively simple for modern LLMs: plausible biomedical mentions are easy to surface, but corpus-convention correctness…
Your Teacher Can't Help You Here: Combating Supervision Fidelity Decay in On-Policy Distillation
On-policy distillation transfers reasoning capabilities by training a student model on its own generated trajectories using token-level fee…
Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring
Vision-Language-Action (VLA) models enable robots to follow natural language instructions and generalize across diverse tasks, but they rem…
Fine-Tuning Improves Information Conveyance in Language Models
Fine-tuning is often believed to reduce uncertainty and diversity in large language models, but existing analyses overlook output length, a…
Safe Equilibrium Policy Optimization for Strategic Agent Policies
Language models fine-tuned with reinforcement learning typically optimize for task reward, ignoring multi-agent strategic structure. Becaus…
DARTS: Distribution-Aware Active Rollout Trajectory Shaping for Accelerating LLM Reinforcement Learning
Reinforcement Learning (RL) has become pivotal for improving model capabilities yet suffers from rollout efficiency bottlenecks due to the…
Sophrosyne: Agentic Exploration of Relational Data Systems Needs Moderation
Text2SQL agents powered by LLMs translate natural language intent into SQL by exploring the data system through tool calls before formulati…
Federated Variational Preference Alignment with Gumbel-Softmax Prior for Personalized User Preferences
Federated Learning (FL) offers a privacy-preserving pathway for aligning Large Language Models (LLMs); however, existing frameworks typical…
PatchWorld: Gradient-Free Optimization of Executable World Models
Text-agent environments are typically modeled as partially observable Markov decision processes (POMDPs), assuming that the simulator's lat…
A Unified and Reproducible Experimentation Framework for Speech Understanding
Speech foundation models and Speech LLMs have advanced speech understanding, yet deployment-oriented model selection is hindered by non-com…
Inverse Reinforcement Learning without an Optimal Demonstrator: A Feasible Reward Set Approach
Inverse reinforcement learning (IRL) typically assumes demonstrations from a single optimal demonstrator, but in many applications data com…
BlueFin: Benchmarking LLM Agents on Financial Spreadsheets
We present BlueFin, a benchmark that tasks large language model (LLM) agents with synthesis, manipulation, and comprehension tasks over spr…
What Makes LVLMs Hallucinate Less? Unveiling the Architectural Factors Behind Hallucination Robustness
Hallucination remains one of the key challenges undermining the reliability of Large Vision-Language Models (LVLMs). But what makes an LVLM…
Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits
Large language models (LLMs) are increasingly deployed in conversational settings where user tone ranges from polite to adversarial or toxi…
De-attribute to Forget for LLM Unlearning
The rapid development of large language models (LLMs) has raised concerns about the use of inappropriate data for training, which has led to a…
TUX: Measuring Human-AI Tacit Understanding
As large language models (LLMs) increasingly act as collaborative partners, human-AI alignment is often evaluated through explicit task su…
Do Large Language Models Encode Institutional Experience? Evidence from Cross-Linguistic Moral Reasoning Under Ambiguity
Large language models (LLMs) exhibit systematic differences in moral reasoning across languages, yet the source of this variation remains u…
AMix-2: Establishing Protein as a Native Modality in Large Language Models
We present AMix-2, a protein-text foundation model that establishes protein as a native modality in large language models (LLMs), unifying…
ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment
Recent advancements in text-guided audio generation have yielded promising results in diverse domains, including sound effects, speech, and…
Reading Between the Citations: A Typed Claim Network for Scientific Literature
Knowledge graphs over corpora of inter-referencing documents - scholarly papers, legal opinions, policy briefs - encode the topology of ref…
Variational Adapter for Cross-modal Similarity Representation
The core of vision-language models lies in measuring cross-modal similarity within a unified representation space. However, most image-text…
Generating Reports or Repeating Templates? Measuring and Mitigating Template Collapse in 3D CT Report Generation
Modern 3D medical vision-language models (VLMs) can generate fluent radiology-style text while exhibiting critically low pathology detection a…
DEM: A Distilled Explanation Model for Interpretable Anomaly Detection in Physiological Sensor Networks
Anomaly detection in physiological sensor data from Wireless Body Area Networks (WBANs) can be caused by sensor faults, network disruptions…
Annealed Softmax Greedy in Many-Armed Bayesian Bandits
Reinforcement learning with verifiable rewards (RLVR) and group-based policy optimization methods such as GRPO update a stochastic policy b…
Does Visual Information Play a Decisive Role in Vision-Language-Action Model Driving Behavior?
Vision-Language-Action (VLA) models have demonstrated promising capability in autonomous driving, highlighting the potential of unified mul…
From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors
LLM agents are evolving from conversational chatbots to operational tools in real-world workspaces. In local agentic harnesses, an LLM can…
Routing on the Stiefel Manifold: When Does Adaptive Subspace Selection Help for Cross-Domain EEG Decoding?
Cross-domain EEG decoding remains challenging despite advances in Riemannian deep learning: covariance matrices from different subjects occ…
Learning to Solve and Optimize by Evolving Code
Combinatorial and optimization problems are fundamental to many industrial AI applications. Solving large-scale real-world instances of suc…
Linear Ordering Problem: Time for a Change
The Linear Ordering Problem (LOP) is a fundamental combinatorial optimization problem with important applications in areas such as economic…
AnchorSteer: Self-Discovered Concept Injection for Structure-Preserving Music Editing
Controllable music editing aims to modify high-level attributes while strictly preserving rhythmic and melodic structures. However, this task…
STEP: Learning STructured Embeddings for Progressive Time Series
We present a novel method for learning interpretable representations of progressive time series, that is, data capturing irreversible state…
Fighting Numerical Hallucinations via Data-centric Compilation for Online Financial QA
Large Language Models (LLMs) have significantly advanced online data services, particularly in the domain of financial question answering (…
DRIFT: Joint Channel Estimation and Prediction Towards Pilotless 6G Non-Terrestrial Networks
Non-terrestrial networks (NTNs) are expected to play a pivotal role in sixth-generation (6G) systems by enabling ubiquitous connectivity an…
A Pilot Study on Curator-Guided Multilingual Art Description for Blind and Low-Vision Audiences with Small Vision-Language Models
Blind and low-vision (BLV) audiences remain underserved by visual art descriptions, particularly across languages and in museum settings wh…
On Revisiting Entropy for Identifying Mislabeled Images
Mislabeled samples in training datasets severely degrade the performance of deep networks, as overparameterized models tend to memorize err…
Redefining Instance Matching: A Unified Framework for Part-Aware Matching in Panoptic Segmentation Evaluation
The Panoptic Quality (PQ) metric is the standard for jointly evaluating instance and semantic segmentation. However, its original definitio…
SpecDB: LLM-Generated Customized Databases via Feature-Oriented Decomposition
Mainstream relational databases ship a uniform feature set across deployments, although individual workloads exercise only a fraction of th…
KnowledgeGain: Evaluating and Optimizing Science News Generation for Reader Learning
Science news is an important medium to communicate discoveries between the research communities and the public. Yet, most metrics for gener…
SWIM: Single-Instance Whole-Body Imitation for swiMming
We propose a new method for synthesizing physically-based swimming motions. Physically-based character animation aims to generate physicall…
TARIC: Memory-Augmented Traversability-Aware Outdoor VLN under Interrupted Semantic Cues
Outdoor vision-language navigation (VLN) in long-range, open-world environments is frequently disrupted by semantic-cue interruptions, wher…
Not All Synthetic Data Is Yours to Learn From
Can a language model improve from plain text sampled from itself, with no prompts, no teacher, no verifier, and no reward model? Yes, but o…
UXR PoV for Neuroinclusive Emotion Regulation
Attention-deficit/hyperactivity disorder (ADHD) is a psychiatric disorder which presents itself in individuals through patterns of developm…
Developing an AI-Powered UX Research Point of View for Digital Health in A Regulatory Context: An Exemplar Case from MSM and Transgender HIV Care in Nigeria
User Experience Research (UXR) in legal and regulatory contexts presents unique challenges that require specialised approaches to protect…
On the Robustness of Multilingual Text Embedding Rankings Across Learning Tasks, Languages, and Benchmark Datasets
Large-scale multilingual text embedding models play a crucial role in both research and industry, yet their behavior in language-specific, mu…
Extending the UXR Point of View Pyramid: A Generative AI-Augmented Methodology for Human-Centred AI Systems
Rising household debt and cost-of-living pressures in the United Kingdom have intensified the role of AI-driven financial technologies in m…
FOCUS: Forcing In-Context Object Localization through Visual Support Constraints and Policy Optimization
In-context localization (ICL) seeks to localize a target object specified by a small set of support examples in a query image, operating on…
From Evidence to Design: Developing an AI-Augmented UX Research Point of View for Digital Wellbeing in Emergency and Public Safety Contexts
This paper investigates how User Experience Research (UXR) methods can be combined with AI-supported analysis to develop clearer design dir…
Developing a Culturally Grounded, AI-Augmented UX Research Point of View (POV): An Exemplar Case Study from Telemedicine Dementia Care
User Experience Research (UXR) Points of View (POVs) distil complex and often fragmented research evidence into actionable perspectives tha…
SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes
Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reason…
Developing a UXR Point of View for Cognitive Accessibility in Mobile Learning with Generative AI
This study investigates how UX research (UXR) principles, combined with Large Language Model (LLM)-supported analysis, can be used to impro…
Trust-Region Behavior Blending for On-Policy Distillation
On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the…
D$^3$: Dynamic Directional Graph-Constrained Data Scheduling for LLM Training
Training data plays a central role in large language models (LLMs) optimization, motivating extensive research on data scheduling strategie…
Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion
Monitoring autonomous language model agents currently relies mostly on surface behavior. But what happens when agent populations invent new…
MIMO: Multilingual Information Retrieval via Monolingual Objectives
Multilingual Information Retrieval (MLIR) reflects real-world search environments in which queries and relevant documents may appear in dif…
MindVoice: Reconstructing Intelligible Speech from Non-invasive Neural Signals with Pretrained Priors
Reconstructing continuous speech from non-invasive neural recordings is a fundamental problem for probing human auditory perception and bui…
Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines
Sparse Autoencoders (SAEs) have been seen as a promising avenue for exploring the internals of Large Language Models (LLMs) and for steerin…
Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration
Safe human-robot collaboration requires more than visual description: a monitor must determine whether the robot body is safely separated,…
MAECO-Lite: Modular Ontology for Dynamic Malware Analysis
Capturing dynamic malware behavior in a practical but still semantically precise manner remains a significant challenge in cyber threat int…
Simulation of collision avoidance behavior in crowd movement by data-driven approach
Crowd movement simulation is essential for pedestrian safety management and facility layout optimization. Data-driven models enhance trajec…
Benchmarking and Enhancing Text-to-Image Models for Generating Visual Representations in Early Arithmetic Education
AI systems are increasingly used to support educational content creation, yet it remains unclear whether they can generate outputs that fai…
Shared Doubt: Zero-shot Cross-Lingual Confidence Estimation for Language Models
Confidence estimation (CE), i.e. quantifying the reliability of a model's prediction, has attracted great interest in the context of large…
Comparing LLM-Based Conversational and Graphical Interfaces for Industrial Decision Tasks: An Exploratory Mixed-Methods Study
The use of Generative AI Conversational User Interfaces (CUI) as a new way to access and analyze data is growing in all sectors, and the in…
What changes after deployment? A survey on On-device Learning in TinyML
Machine learning models on microcontroller-class devices (TinyML) face a fundamental challenge: post-deployment distribution change undermi…
EchoRL: Reinforcement Learning via Rollout Echoing
Reinforcement Learning with Verifiable Rewards is an effective route for post-training to strengthen the reasoning capability of large lang…
Beyond Classification: Dynamic Adapter Routing for Continual Multimodal Retrieval
While retrieval is a core function of vision-language models, continually updating these models for retrieval tasks remains critically unde…
Correcting Split Selection in Online Decision Trees via Anytime-Valid Inference
Bagging-based ensembles, most notably Adaptive Random Forests, are among the strongest performers for learning from data streams. A common…
Learning Cardiac Latent Representations in Vectorcardiogram Space
Electrocardiography (ECG) is a cornerstone of cardiac assessment, making the learning of informative ECG representations fundamental to tas…
Entropic Projection Alignment: Estimating, Explaining, and Improving Model Performance Under Distribution Shift
We propose a unified framework for addressing three key challenges of distribution shift: (1) estimating a model's performance on an unlabe…
ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models
Multimodal large language models (MLLMs) have shown strong potential as embodied agents, yet embodied geo-localization remains underexplore…
Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning
The family of linear recurrent neural networks has shown strong performance as recurrent memory units in partially observable reinforcement…
Envisioning Beyond the Few: Disentangled Semantics and Primitives for Few-Shot Atypical Layout-to-Image Generation
The layout-to-image (L2I) task enables fine-grained control over image generation via object categories and spatial layouts. However, exist…
Personalized to Persuade: The Effects of Contextualization and Warmth on Trust and Reliance in Conversational AI
Artificial Intelligence (AI) agents personalize their responses by tailoring explanations to users' backgrounds, interests, and prior inter…
Practical Cross-Band Channel Prediction for AI-RAN via Physics-Guided Deep Unfolding
To make cross-band channel prediction practical for AI-native RAN, algorithms must generalize across diverse environments and support real-…
SAM for Robust Mitochondria Instance Segmentation in Fluorescence Microscopy
The morphological analysis of mitochondria in fluorescence microscopy (FM) is crucial for understanding cellular health, energy production,…
DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation
Real-world household robots require Vision-Language-Action (VLA) foundation models that can acquire reusable manipulation skills across div…
Neither Replacement nor Panacea: Comparing LLM-Based Conversational and Graphical Decision Support in Industrial Tasks
Managers in manufacturing settings rely on digital interfaces to interpret operational data for decision-making, but growing data volume an…
The Terminal Representation in Reinforcement Learning
Representation learning is a powerful tool for spatio-temporal abstraction within reinforcement learning (RL). Two well established approac…
Latent Space Disentanglement via Activation Steering for Interpretable Attribute Control in Symbolic Music Generation
Transformer-based architectures have significantly advanced the generation of complex symbolic sequences, yet a significant gap remains in…
Inconsistency-Aware Minimization: Improving Generalization with Unlabeled Data
Estimating the generalization gap and developing optimization methods that improve generalization are crucial for deep learning models, for…
Social welfare optimisation under institutional reward and punishment
Institutional incentives are widely used to promote cooperation among autonomous, self-regarding agents, from human societies to multi-agen…
Appropriateness of Empathy in AI: A Signal-Cost Perspective
The appropriateness of empathy in AI has emerged as a critical concern, as excessive empathy risks seeming manipulative while insufficient…
FBHM: Functional Benchmarking and Steering of VLMs for Hateful Meme Detection
Hateful meme detection remains a formidable challenge for vision-language models, as existing benchmarks are structurally observational - c…
dashi: A Python library for Dataset Shift Characterization to Support Trustworthy AI Development and Deployment
The Artificial Intelligence (AI) life cycle requires a thorough understanding of the underlying data dynamics for robust, safe and cost-eff…
Dreaming Of Others: Latent Teammate Modeling In World Models For Multi-Agent Reinforcement Learning
In cooperative multi-agent reinforcement learning (MARL), agents must coordinate with partners whose internal policies and intentions are n…
Scaling Higher-Order Graph Learning with Maximal Clique Complexes
Graph neural networks (GNNs) are limited to modeling pairwise interactions, while higher-order models based on cell complexes achieve great…
DynaTree: Dynamic Agentic Retrieval Tree for Time-Sensitive News Retrieval
Agentic Retrieval-Augmented Generation improves retrieval by integrating planning, tool use, and iterative reasoning, but existing agentic…
Target-Side Paraphrase Augmentation for Sign Language Translation with Large Language Models
Sign language translation (SLT) remains constrained by limited paired sign-video/text corpora and heavy-tailed target vocabularies. We stud…
The Sword, Shield, and Achilles' Heel: Characterizing the Linguistic Inductive Bias of Large Language Models for Spatial Reasoning in Navigation Planning
Large Language Model (LLM)-based navigation systems commonly construct explicit spatial representations (e.g., topological graphs, semantic…
Skill Availability and Presentation Granularity in Large-Language-Model Agents: A Controlled SkillsBench Study
Skill documents provide procedural knowledge to large-language-model agents at inference time. This article studies whether the presentatio…
Neuro-symbolic Syntactic Parsing: Shaping a Neural Network with the CYK Algorithm
In this paper, we show the possibility of a direct injection of algorithms into neural network architecture. We focus on a complex algorith…
DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Translation with SpeechLLMs
Simultaneous speech-to-text translation (SimulST) generates translations while speech is still unfolding, requiring a streaming policy that…
Used Car Salesbots? Honesty and Credulity of LLMs as Bargaining Agents under Partial Information
In this work we study agents in simulated bargaining scenarios, where a buyer and a seller communicate through a text channel and attempt t…
Fine-grained Verification via Diagnostic Reasoning Supervision for Aspect Sentiment Triplet Extraction
Aspect Sentiment Triplet Extraction (ASTE) aims to identify aspect terms, opinion terms, and sentiment polarities as structured triplets, p…
PithTrain: A Compact and Agent-Native MoE Training System
Mixture-of-Experts (MoE) has become the dominant architecture for frontier language models. To meet this demand, production frameworks have…
GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization
GPU kernels are the workhorse of modern deep learning, and optimizing them (via evolutionary search or coding agents) usually requires repe…
Scaling Conversational Hungarian ASR: The BEA-Dialogue+ Corpus
Conversational automatic speech recognition in Hungarian is constrained by the limited amount of publicly available dialogue-style training…
On Efficient Scaling of GNNs via IO-Aware Layers Implementations
Graph Neural Networks (GNNs) are bottlenecked by sparse, irregular memory access. Popular frameworks such as DGL and PyTorch Geometric supp…
Skill Reuse as Compression in Agentic RL
Large language model agents trained with reinforcement learning (RL) often learn brittle, task-specific shortcuts. We hypothesize that agen…
If LLMs Have Human-Like Attributes, Then So Does Age of Empires II
Much research has been carried out on large language models (LLMs) and LLM-powered agentic workflows. However, many works within the field…
Separating Secrets from Placeholders: A Hybrid CNN-CodeBERT Framework for Three-Class Credential Leakage Detection
Credential leakage in public source code repositories poses a critical security threat, with over 23.8 million secrets exposed in 2024 alon…
Feature-Optimized Vision for Adaptive 3D Scene Reconstruction
Three-dimensional scene reconstruction depends on local image evidence that is both visually discriminative and geometrically useful. Fixed…
RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video
Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittlenes…
Vision-Language Models Suppress Female Representations Under Ambiguous Input
Alignment teaches vision-language models (VLMs) to avoid expressing demographic biases, and when gender is clearly visible they largely suc…
Positional versus Symbolic Attention Heads: Learning Dynamics, RoPE Geometry, and Length Generalization
Transformer-based language models are widespread in today's society. As such, understanding the mechanisms by which they solve structured t…
What Gets Unmasked First? Trajectory Analysis of Diffusion Models for Graph-to-Text Generation
We present the first systematic study of masked diffusion language models (MDLMs) for graph-to-text generation. We analyze MDLM generation…
SPECTRA: Synthetic IR Test Collections with Relevance Oracles and Controlled Distractor Diagnostics
Scalable information retrieval testing needs corpora that are large enough to stress index construction, ranking latency, query routing, an…
LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards
Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in e…
Language Models Learn Constructional Semantics, Not To Mention Syntax: Investigating LM Understanding of Paired-Focus Constructions
Grasping the semantics of rare constructions (form-meaning pairings) has been shown to be a challenging problem that has currently only bee…
TunerDiT: Training-free Progressive Steering of Diffusion Transformer for Multi-Event Video Generation
Text-to-video (T2V) generation faces challenging questions when generating videos with long horizons containing multiple events. Inspired b…
Stateful Online Monitoring Catches Distributed Agent Attacks
Language models can find thousands of severe software vulnerabilities, and agents are increasingly being misused for cyberattacks. To avoid…
Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models
Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large h…
LLM Bias Evaluation: Gender, Racial, and Age Disparities in Occupational and Crime Scenarios
LLM bias evaluation is critical as large language models (LLMs) increasingly influence high-stakes decisions. This paper provides a compreh…
Unifying and Optimizing Data Values for Selection via Sequential Decision-Making
Data selection has emerged as a crucial downstream application of data valuation, yet the theoretical foundations for using data values in…
ProofWala: A Framework for Multilingual Proof Data Synthesis and Theorem-Proving
Neural approaches to theorem proving require robust infrastructure for interfacing with interactive theorem provers (ITPs), extracting stru…
Chain-of-Thought Reasoning In The Wild Is Not Always Faithful
Recent studies indicate that when faced with explicit biases in prompts, models often omit mentioning these biases in their Chain-of-Though…
Inferring Events from Time Series using Language Models
A common goal in analyzing time series data is to understand how events cause observed variations. We study whether Large Language Models (…
Symbolic Intermediaries as a Linguistic-Numerical Interface for LLM-Driven Geometric Reasoning
Large Language Models (LLMs) display reasoning capabilities over linguistic and symbolic objects but have limited capabilities to directly…
OLG++: A Semantic Extension of Obligation Logic Graph
We present OLG++, a semantic extension of the Obligation Logic Graph (OLG) for modeling regulatory and legal rules in municipal and interju…
Neuro-Symbolic Predictive Process Monitoring
This paper addresses the problem of suffix prediction in Business Process Management (BPM) by proposing a Neuro-Symbolic Predictive Process…
ReTabAD: A Benchmark for Restoring Semantic Context in Tabular Anomaly Detection
In tabular anomaly detection (AD), textual semantics often carry critical signals, as the definition of an anomaly is closely tied to domai…
SAC-Opt: Semantic Anchors for Iterative Correction in Optimization Modeling
Large language models (LLMs) have opened new paradigms in optimization modeling by enabling the generation of executable solver code from n…
Post-Training LLMs as Better Decision-Making Agents: A Regret-Minimization Approach
Large language models (LLMs) are increasingly deployed as "agents" for decision-making (DM) in interactive and dynamic environments. Yet, s…
HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs
Informal mathematics has been central to modern large language model (LLM) reasoning, offering flexibility and efficient construction of ar…
Debate with Images: Detecting Deceptive Behaviors in Multimodal Large Language Models
Are frontier AI systems becoming more capable? Certainly. Yet such progress is not an unalloyed blessing but rather a Trojan horse: behind…
DTop-p MoE: Sparsity-Controlled Dynamic Top-p MoE for Foundation Model Pre-training
Sparse Mixture-of-Experts architectures are essential for scaling model capacity efficiently, yet the standard Top-$k$ routing imposes a ri…
Agentic Physical AI toward a Domain-Specific Foundation Model for Energy Systems: A Case Study on Nuclear Reactor Control
The prevailing paradigm in AI for physical systems: scaling general-purpose foundation models toward universal multimodal reasoning, confro…
Regret-Based Federated Causal Discovery with Unknown Interventions
Most causal discovery methods recover a completed partially directed acyclic graph representing a Markov equivalence class from observation…
ConSensus: Multi-Agent Collaboration for Multimodal Sensing
Large language models (LLMs) are increasingly grounded in sensor data to perceive and reason about human physiology and the physical world.…
NEMO: Execution-Aware Optimization Modeling via Autonomous Coding Agents
We present NEMO, a system that translates Natural-language descriptions of decision problems into formal Executable Mathematical Optimizati…
Diagnosing the Reliability of LLM-as-a-Judge via Item Response Theory
While LLM-as-a-Judge is widely used in automated evaluation, existing validation practices primarily operate at the level of observed outpu…
From Out-of-Distribution Detection to Hallucination Detection: A Geometric View
Detecting hallucinations in large language models is a critical open problem with significant implications for safety and reliability. Whil…
MedCoG: Maximizing LLM Inference Density in Medical Reasoning via Meta-Cognitive Regulation
Large Language Models (LLMs) have shown strong potential in complex medical reasoning yet face diminishing gains under inference scaling la…
Discovering Differences in Strategic Behavior Between Humans and LLMs
As Large Language Models (LLMs) are increasingly deployed in social and strategic scenarios, it becomes critical to understand where and wh…
Certified Circuits: Stability Guarantees for Mechanistic Circuits
Understanding how neural networks arrive at their predictions is essential for debugging, auditing, and deployment. Mechanistic interpretab…
SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy
As LLMs achieved breakthroughs in general reasoning, their proficiency in specialized scientific domains reveals pronounced gaps in existin…
From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents
Anonymization is often assumed to protect privacy once explicit identifiers are removed, because re-identification has historically require…
Reliable Self-Improvement Training by Verifying Reasoning, Not Just Answers
Self-improvement training, where models learn from self-generated solutions, promises sustained capability gains but suffers from a pervasi…
Counterfactual Credit Policy Optimization for Multi-Agent Collaboration
Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles, but reinforcement learning f…
LH-Bench: Skill-Grounded Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks
Large language models excel on objectively verifiable tasks such as math and programming, where evaluation reduces to unit tests or a singl…
Learning to Reason with Insight for Informal Theorem Proving
Although most of the automated theorem-proving approaches depend on formal proof systems, informal theorem proving can align better with la…
ClimAgent: LLM as Agents for Autonomous Open-ended Climate Science Analysis
Climate research is pivotal for mitigating global environmental crises, yet the accelerating volume of multi-scale datasets and the complex…
To Use AI as Dice of Possibilities with Timing Computation
The dominant noun-based modeling paradigm has fundamentally constrained AI development, precluding any adequate representation of the futur…
Counterfactual Trace Auditing of LLM Agent Skills
Large Language Model agents are increasingly augmented with agent skills. Current evaluation methods for skills remain limited. Most deploy…
ASH: Agents that Self-Hone via Embodied Learning
Long-horizon embodied tasks remain a fundamental challenge in AI, as current methods rely on hand-engineered rewards or action-labeled demo…
Fully Open Meditron: An Auditable Pipeline for Clinical LLMs
Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current…
PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models
Planning is a fundamental capability for large language models (LLMs) because such complex tasks require models to coordinate goals, constr…
ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving
Safety-critical scenarios are central to evaluating autonomous driving systems, yet their rarity in naturalistic logs makes simulation-base…
BoxLitE: A Faithful Knowledge Base Embedding Based on Convex Optimization
Knowledge base (KB) embeddings aim at combining the capability of classical knowledge graph embeddings to generalize the information presen…
MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning
Vision-language models (VLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex multimodal tasks, but their large para…
Exploiting Local Dynamics Regularity for Reusable Skills in Offline Hierarchical RL
Hierarchical Reinforcement Learning (HRL) promises to solve long-horizon Reinforcement Learning (RL) tasks more efficiently than non-hierar…
Advancing Creative Physical Intelligence in Large Multimodal Models
Large multimodal models (LMMs) have rapidly advanced in perception and reasoning; however, it remains unclear whether these capabilities ge…
Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (extended preprint)
LLMs deployed in high-stakes domains face fundamental reliability challenges: hallucinations, inconsistencies, and privacy vulnerabilities…
Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases
Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In t…
Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure
Single-axis mitigations of reward-model biases (e.g., reducing proxy reliance on length, sycophancy, or style) can rotate optimization pres…
VikingMem: A Memory Base Management System for Stateful LLM-based Applications
Large Language Models have revolutionized interactive applications; however, their finite context windows pose a critical data management c…
SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search
Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite the effectiveness…
OmniMatBench: A Human-Calibrated Multimodal Reasoning Benchmark Across 19 Materials Science Subfields
As multimodal language models play an increasingly important role in scientific research, materials science offers a critical testbed due t…
Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning
Large Language Models have demonstrated remarkable progress in general-purpose capabilities and can achieve strong performance in specific…
MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection
Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before…
Graph Machine Learning in the Era of Large Language Models (LLMs)
Graphs play an important role in representing complex relationships in various domains like social networks, knowledge graphs, and molecula…
Breaking Information Cocoons: A Hyperbolic Framework for Balancing Exploration and Exploitation in Recommender Systems
Modern recommender systems often create information cocoons, restricting users' exposure to diverse content. The central challenge is to ba…
Understanding the Fundamental Design Decisions of Retrieval-Augmented Generation Systems
Retrieval-Augmented Generation (RAG) has emerged as a critical technique for enhancing large language model (LLM) capabilities. However, pr…
Cross-Modal Attention Calibration for LVLM Hallucination Mitigation
Large vision-language models (LVLMs) have shown remarkable capabilities in visual-language understanding. Despite their success, LVLMs stil…
Beyond Memorization: Assessing Semantic Generalization in Large Language Models Using Phrasal Constructions
The web-scale of pretraining data has created an important evaluation challenge: to disentangle linguistic competence on cases well-represe…
PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection
Visual instruction tuning adapts pre-trained Multimodal Large Language Models (MLLMs) to follow human instructions for real-world applicati…
Auto-Discovery-Bench: Diagnosing Structured State Tracking in Oracle-Guided Discovery
Interactive discovery requires agents to maintain and update structured beliefs over many rounds of feedback. Before evaluating agents in n…
EMCEE: Improving Multilingual Capability of LLMs via Bridging Knowledge and Reasoning with Extracted Synthetic Multilingual Context
Large Language Models (LLMs) have achieved impressive progress across a wide range of tasks, yet their heavy reliance on English-centric tr…
How does Bayesian Sampling help Membership Inference Attacks?
Membership Inference Attacks (MIAs) aim to estimate whether a specific data point was used in the training of a given model. Existing state…
Unraveling LoRA Interference: Orthogonal Subspaces for Robust Model Merging
Fine-tuning large language models (LMs) for individual tasks yields strong performance but is expensive for deployment and storage. Recent…
Who Gets Credit or Blame? Attributing Accountability in Modern AI Systems
Modern AI systems are typically developed through multiple stages: pretraining, fine-tuning rounds, and subsequent adaptation or alignment,…
Unlearning's Blind Spots: Over-Unlearning and Prototypical Relearning Attack
Machine unlearning (MU) aims to expunge a designated forget set from a trained model without costly retraining, yet the existing techniques…
SHIELD: Secure Hypernetworks for Incremental Expansion Learning Defense
Continual learning under adversarial conditions remains an open problem, as existing methods often compromise either robustness, scalabilit…
DISCO: Mitigating Bias in Deep Learning with Conditional Distance Correlation
Dataset bias often leads deep learning models to exploit spurious correlations instead of task-relevant signals. We introduce the Standard…
Organizational Adaptation to Generative AI in Cybersecurity
Cybersecurity organizations are adapting to GenAI integration through modified frameworks and hybrid operational processes, with success in…
PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers
Building image classification models remains cumbersome in data-scarce domains, where collecting large labeled datasets is impractical. In-…
Joint angle based learning to refine kinematic human pose estimation
Marker-free human pose estimation (HPE) has found increasing applications in various fields. Current HPE suffers from occasional errors in…
Human-Alignment and Calibration of Inference-Time Uncertainty in Large Language Models
There has been much recent interest in evaluating large language models for uncertainty calibration to facilitate model control and modulat…
Residual Reservoir Memory Networks
We introduce a novel class of untrained Recurrent Neural Networks (RNNs) within the Reservoir Computing (RC) paradigm, called Residual Rese…
Target-Agnostic Calibration under Distribution Shift with Frequency-Aware Gradient Rectification
Real-world model deployments inevitably encounter distribution shifts, rendering the confidence estimates of deep neural networks highly un…
Reasoning-Intensive Regression
AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), i.e.…
Human Psychometric Questionnaires Mischaracterize LLM Behavior
We examine whether human psychometric questionnaires can serve as reliable tools for characterizing and predicting LLM behavior in everyday…
MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts
Deploying Large Language Models (LLMs) in medical applications requires fact-checking capabilities to ensure patient safety and regulatory…
Towards Atoms of Large Language Models
The fundamental representational units (FRUs) of large language models (LLMs) remain undefined, limiting further understanding of their und…
Towards Foundation Models for Zero-Shot Time Series Anomaly Detection: Leveraging Synthetic Data and Relative Context Discrepancy
Time series anomaly detection (TSAD) is a critical task, but developing models that generalize to unseen data in a zero-shot manner remains…
SAEmnesia: Erasing Concepts in Diffusion Models with Supervised Sparse Autoencoders
Concept unlearning in diffusion models is hampered by feature splitting, where concepts are distributed across many latent features, making…
Spectral Collapse Drives Loss of Plasticity in Deep Continual Learning
We investigate why deep neural networks suffer from loss of plasticity in continual learning, and thus fail to learn new tasks without rein…
Dual Mechanisms of Value Expression: Intrinsic vs. Prompted Values in Large Language Models
Large language models can express values in two main ways: (1) intrinsic expression, reflecting the model's inherent values learned during…
Mechanistic Interpretability as Statistical Estimation: A Variance Analysis
Mechanistic Interpretability (MI) aims to reverse-engineer model behaviors by identifying functional sub-networks. Yet, the scientific vali…
LLMs Lean on Priors, Not Programming Language Semantics
Recent work asks whether large language models (LLMs) condition their reasoning on explicit rules rather than statistical regularities from…
OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference
Large language models (LLMs) with extended context windows enable powerful applications but impose significant memory overhead, as caching…
PAC-Bayesian Reinforcement Learning Trains Generalizable Policies
We derive a novel PAC-Bayesian generalization bound for reinforcement learning that explicitly accounts for Markov dependencies in the data…
Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models
A key challenge in applying reinforcement learning (RL) to diffusion large language models (dLLMs) is the intractability of their likelihoo…
CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects
Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring t…
InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training
Reinforcement learning (RL) has powered many recent breakthroughs in large language models (LLMs), especially for tasks where rewards can b…
Scaling Multi-Agent Environment Co-Design with Diffusion Models
The agent-environment co-design paradigm jointly optimises agent policies and environment configurations in search of improved system perfo…
SpectralTrain: A Universal Framework for Hyperspectral Image Classification
Hyperspectral image (HSI) classification typically involves large-scale data and computationally intensive training, which limits the pract…
Mixture of Horizons in Action Chunking
Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the $…
Reasoning-Aware Multimodal Fusion for Hateful Video Detection
Hate speech in online videos is posing an increasingly serious threat to digital platforms, especially as video content becomes increasingl…
Conditional Coverage Diagnostics for Conformal Prediction
Evaluating conditional coverage remains one of the most persistent challenges in assessing the reliability of predictive systems. Although…
Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies
Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a unified policy, overlooking their internal mechanis…
FEM-Bench: A Structured Scientific Reasoning Benchmark for Evaluating Code-Generating LLMs
As LLMs advance their reasoning capabilities about the physical world, the absence of rigorous benchmarks for evaluating their ability to g…
Flow Equivariant World Models: Memory for Partially Observed Dynamic Environments
Embodied systems experience the world as 'a symphony of flows': a combination of many continuous streams of sensory input coupled to self-m…
Rethinking Multimodal Few-Shot 3D Point Cloud Segmentation: From Fused Refinement to Decoupled Arbitration
In this paper, we revisit multimodal few-shot 3D point cloud semantic segmentation (FS-PCS), identifying a conflict in "Fuse-then-Refine" p…
The Refutability Gap: Challenges in Validating Reasoning by Large Language Models
Recent reports claim that Large Language Models (LLMs) have achieved the ability to derive new science and exhibit human-level general inte…
PASTA: A Scalable Framework for Multi-Policy AI Compliance Evaluation
AI compliance is becoming increasingly critical as AI systems grow more powerful and pervasive. Yet the rapid expansion of AI policies crea…
Performance and Complexity Trade-off Optimization of Speech Models During Training
In speech machine learning, neural network models are typically designed by choosing an architecture with fixed layer sizes and structure.…
SKETCH: Semantic Key-Point Conditioning for Long-Horizon Vessel Trajectory Prediction
Accurate long-horizon vessel trajectory prediction remains challenging due to compounded uncertainty from complex navigation behaviors and…
Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data
The opacity of massive pretraining corpora in Large Language Models (LLMs) raises significant privacy and copyright concerns, making pretra…
ParalESN: Enabling parallel information processing in Reservoir Computing
Reservoir Computing (RC) has established itself as an efficient paradigm for temporal processing. However, its scalability remains severely…
Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training
Determining an effective data mixture is a key factor in Large Language Model (LLM) pre-training, where models must balance general compete…
Multi-Agent Teams Hold Experts Back
Multi-agent LLM systems are increasingly deployed as autonomous collaborators, where agents interact freely rather than execute fixed, pre-…
The Gaussian-Head OFL Family: One-Shot Federated Learning from Client Global Statistics
Classical Federated Learning relies on a multi-round iterative process of model exchange and aggregation between server and clients, with h…
An Odd Estimator for Shapley Values
The Shapley value is a ubiquitous framework for attribution in machine learning, encompassing feature importance, data valuation, and causa…
Plain Transformers are Surprisingly Powerful Link Predictors
Link prediction is a core challenge in graph machine learning, demanding models that capture rich and complex topological dependencies. Whi…
Mixture of Concept Bottleneck Experts
Concept Bottleneck Models (CBMs) promote interpretability by grounding predictions in human-understandable concepts. However, existing CBMs…
CVE-Factory: Scaling Expert-Level Agentic Tasks for Code Security Vulnerability
Evaluating and improving the security capabilities of code agents requires high-quality, executable vulnerability tasks. However, existing…
Stop the Flip-Flop: Context-Preserving Verification for Fast Revocable Diffusion Decoding
Parallel diffusion decoding can accelerate diffusion language model inference by unmasking multiple tokens per step, but aggressive paralle…
Pull Requests as a Training Signal for Repo-Level Code Editing
Repository-level code editing requires models to understand complex dependencies and execute precise multi-file modifications across a larg…
A Kinetic Energy Perspective of Flow Matching
Flow-based generative models can be viewed through a physics lens: sampling transports a particle from noise to data by integrating a learn…
Inverting Data Transformations via Diffusion Sampling
We study the problem of transformation inversion on general Lie groups: a datum is transformed by an unknown group element, and the goal is…
Breaking the Simplification Bottleneck in Amortized Neural Symbolic Regression
Symbolic regression (SR) aims to discover interpretable analytical expressions that accurately describe observed data. Amortized SR promise…
A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents
Understanding an agent's goals helps explain and predict its behaviour, yet there is no established methodology for reliably attributing go…
Effective Reasoning Chains Reduce Intrinsic Dimensionality
Chain-of-thought (CoT) reasoning and its variants have substantially improved the performance of language models on complex reasoning tasks…
Biases in the Blind Spot: Detecting What LLMs Fail to Mention
Large Language Models (LLMs) often provide chain-of-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We…
Less is Enough: Synthesizing Diverse Data in LLM Feature Space with Sparse Autoencoders
The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approac…
Weight Decay Improves Language Model Plasticity
Large language models are typically trained in two broad phases: pretraining to produce a base model, followed by further training to impro…
SCOPE: Selective Conformal Optimized Pairwise LLM Judging
Large language models (LLMs) are increasingly used as scalable judges in pairwise evaluation, but they remain prone to miscalibration and b…
DTBench: A Synthetic Benchmark for Document-to-Table Extraction
Document-to-table (Doc2Table) extraction derives structured tables from unstructured documents under a target schema, enabling reliable and…
The Information Geometry of Softmax: Probing and Steering
This paper concerns the question of how AI systems encode semantic structure into the geometric structure of their representation spaces. T…
HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents
Training LLMs as interactive agents for multi-turn decision-making remains challenging, particularly in long-horizon tasks with sparse and…
Position: Evaluation of ECG Representations Must Be Fixed
This position paper argues that current benchmarking practice in 12-lead ECG representation learning must be fixed to ensure progress is re…
HistCAD: A Constraint-Aware Parametric History-Based CAD Representation, Dataset, and Benchmark with Industrial Complexity
Parametric CAD sequences are reusable because dimensional and geometric constraints govern how parameter changes propagate. Existing CAD ge…
From Leaky Thoughts to Private Reasoning: Controlling What LRMs Say to Themselves
Large reasoning models (LRMs) produce reasoning traces (RTs) that often contain sensitive information. These leaky thoughts are difficult t…
The Global Landscape of Environmental AI Regulation: From the Cost of Reasoning to a Right to Green AI
Artificial intelligence (AI) systems impose substantial and growing environmental costs, yet transparency about these impacts has declined…
MASPOB: Bandit-Based Prompt Optimization for Multi-Agent Systems with Graph Neural Networks
Large Language Models (LLMs) have achieved great success in many real-world applications, especially the one serving as the cognitive backb…
NGDBench: Towards Neural Graph Data Management
Data critical to real-world decision-making is increasingly found within organizations. Such data is heterogeneous, constantly evolving, an…
Rank-Factorized Implicit Neural Bias: Scaling Super-Resolution Transformer with FlashAttention
Recent Super-Resolution (SR) methods mainly adopt Transformers for their strong long-range modeling capability and exceptional representati…
Targeted Speaker Poisoning Framework in Zero-Shot Text-to-Speech
Zero-shot Text-to-Speech (TTS) voice cloning poses severe privacy risks, demanding the removal of specific speaker identities from trained…
Variational Routing: A Scalable Bayesian Framework for Calibrated Mixture-of-Experts Transformers
Foundation models are increasingly being deployed in contexts where understanding the uncertainty of their outputs is critical to ensuring…
G-STAR: End-to-End Global Speaker-Tracking Attributed Recognition
We study timestamped speaker-attributed automatic speech recognition (SA-ASR) for long-form, multi-party speech with overlap. In this setti…
Prompt Injection as Role Confusion
LLMs see the world as a single stream of text, partitioned into roles like or . We trace prompt injection to role confusion: models perceiv…
Surprised by Attention: Predictable Query Dynamics for Time Series Anomaly Detection
Multivariate time series anomalies often manifest as shifts in cross-channel dependencies rather than simple amplitude excursions. In auton…
Functorial Neural Architectures from Higher Inductive Types
Neural networks often learn the parts of a task but fail on novel combinations of those parts. We argue that this failure is architectural:…
REAL: Regression-Aware Reinforcement Learning for LLM-as-a-Judge
Large language models (LLMs) are increasingly deployed as automated evaluators that assign numeric scores to model outputs, a paradigm know…
Empirical Characterization of Inference-Time Elicited Probability Transformations in Large Language Models
Large language models increasingly rely on inference-time procedures such as chain-of-thought reasoning, self-refinement, retrieval augment…
SimulCost: A Cost-Aware Benchmark and Toolkit for Automating Physics Simulations with LLMs
Evaluating LLM agents for scientific tasks has focused on token costs while ignoring tool-use costs like simulation time and experimental r…
Graph Energy Matching: Transport-Aligned Energy-Based Modeling for Graph Generation
Generative modeling of discrete data, such as graphs, underpins many scientific and industrial applications, including molecular discovery…
Circuit-Inspired High-Order Neural Networks with Unified Neural Dynamics Modeling for PDE Solving and Visual Perception
Deep networks often rely on architectural heuristics to shape representation evolution, limiting their ability to model data governed by in…
Beyond Static Uncertainty: Modeling Temporal Uncertainty Dynamics for Probabilistic Time Series Forecasting
Real-world time series exhibit temporally structured uncertainty: volatility clusters in turbulent regimes, dissipates in stable periods, a…
Multi-Level Barriers to Generative AI Adoption Across Disciplines and Professional Roles in Higher Education
Generative Artificial Intelligence (GenAI) is rapidly reshaping higher education, yet barriers to its adoption across different disciplines…
World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry
General-purpose world models promise scalable policy evaluation, optimization, and planning, yet achieving the required level of robustness…
Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories
Recovering camera parameters from images and rendering scenes from novel viewpoints have been treated as separate tasks in computer vision…
Symmetry Reveals Layerwise Dynamics: How Transformers Perform In-Context Classification
Transformers can perform in-context classification from a few labeled examples, yet the inference-time algorithm remains opaque. We study m…
SVL: Goal-Conditioned Reinforcement Learning as Survival Learning
Standard approaches to goal-conditioned reinforcement learning (GCRL) that rely on temporal-difference learning can be unstable and sample-…
Compile to Compress: Boosting Formal Theorem Provers by Compiler Outputs
Large language models (LLMs) have demonstrated significant potential in formal theorem proving, yet state-of-the-art performance often nece…
Aligning Dense Retrievers with LLM Utility via Distillation
Dense vector retrieval is the practical backbone of Retrieval-Augmented Generation (RAG), but similarity search can suffer from precision…
Progress in Formalizing Sphere Packing in Dimension 8
In 2016, Viazovska famously solved the sphere packing problem in dimension $8$, using modular forms to construct a 'magic' function satisfy…
Robust Lightweight Crack Classification for Real-Time UAV Bridge Inspection
With the widespread application of Unmanned Aerial Vehicles (UAVs) in bridge structural health monitoring, deep learning-based automatic cr…
FreeTimeGS++: Secrets of Dynamic Gaussian Splatting and Their Principles
The recent surge in 4D Gaussian Splatting (4DGS) has achieved impressive dynamic scene reconstruction. While these methods demonstrate rema…
Autoregressive Visual Generation Needs a Prologue
In this work, we propose Prologue, an approach to bridging the reconstruction-generation gap in autoregressive (AR) image generation. Inste…
OBLIQ-Bench: Exposing Overlooked Bottlenecks in Modern Retrievers with Latent and Implicit Queries
Retrieval benchmarks are increasingly saturating, but we argue that efficient search is far from a solved problem. We identify a class of q…
Why DDIM Hallucinates More Than DDPM: A Theoretical Analysis of Reverse Dynamics
We theoretically study the hallucination phenomena in two canonical diffusion samplers: the stochastic Denoising Diffusion Probabilistic Mo…
Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models
Current vision language models face hallucination and robustness issues against ambiguous or corrupted modalities. We hypothesize that thes…
Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training
Preference learning methods like Direct Preference Optimization (DPO) are known to induce reliance on spurious correlations, leading to syc…
Much of Geospatial Web Search Is Beyond Traditional GIS
Web search queries concern place far more often than existing labelling schemes suggest, yet the landscape of geospatial web search queries…
Towards a holistic understanding of Selection Bias for Causal Effect Identification
Selection bias is pervasive in observational studies. For example, large-scale biobank data can exhibit "healthy volunteer bias" when re…
MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation
Text-to-video (T2V) generation has rapidly progressed in visual fidelity, yet its ability to faithfully represent multiple cultures within…
SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning
Retrieval-Augmented Generation (RAG) is widely employed to mitigate risks such as hallucinations and knowledge obsolescence in medical ques…
FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics
AI research agents accelerate ML research by automating hypothesis generation, experimentation, and empirical refinement. Existing agent st…
Interaction-Breaking Adversarial Learning Framework for Robust Multi-Agent Reinforcement Learning
Cooperation is central to multi-agent reinforcement learning (MARL), yet learned coordination can be fragile when external perturbations di…
PROWL: Prioritized Regret-Driven Optimization for World Model Learning
Modern action-conditioned video world models achieve strong short-horizon visual realism, yet remain unreliable on rare, interaction-critic…
Block-Based Double Decoders
Encoder-decoder models offer substantial inference-time savings over decoder-only models, but their pretraining objectives suffer from spar…
Chunking German Legal Code
This paper investigates chunking strategies for retrieval-augmented generation on German statutory law, using the German Civil Code as a st…
Efficient Learning of Deep State Space Models via Importance Smoothing
Latent state space systems are ubiquitous in statistical modelling, arising naturally when time series are observed through noisy measureme…
Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling
Computer-use agents (CUAs) automate tasks specified with natural language such as "order the cheapest item from Taco Bell" by generating se…
The Distillation Game: Adaptive Attacks & Efficient Defenses
Distillation attacks create a deployment trade-off for model providers: the same outputs that make a model more useful can also make it eas…
Side-by-side Comparison Amplifies Dialect Bias in Language Models
Language models (LMs) can exhibit biases based on variations in their dialects, even in the absence of a dialect label, a behavior known as…
Theoretical Analysis of Sparse Optimization with Reparameterization, Weight Decay, and Adaptive Learning Rate
Sparse optimization is a fundamental challenge in various practical applications. A popular approach to sparse optimization is $\ell_p$ reg…
Efficient Benchmarking Is Just Feature Selection and Multiple Regression
Efficient benchmarking techniques aim to lower the computational cost of evaluating LLMs by predicting full benchmark scores using only a s…
GEM: Geometric Entropy Mixing for Optimal LLM Data Curation
LLM pre-training efficacy increasingly depends on data composition rather than sheer volume. Yet, optimal mixing is hindered by categorizat…
Pair-In, Pair-Out: Latent Multi-Token Prediction for Efficient LLMs
Long chain-of-thought reasoning has made autoregressive decoding the dominant inference cost of modern large language models. Existing meth…
No Reader Left Behind: Multi-Agent Summaries Everyone Can Understand
The Plain Writing Act in the United States requires government documents to be accessible in clear and simple language that the general pub…
First head-to-head comparison of agentic AI applied to the analysis of simulated data of the Einstein Telescope
We report a comparison of two state-of-the-art agentic AI systems, Claude Code (Anthropic) and Codex (OpenAI), tasked with autonomously exe…
SafeRx-Agent: A Knowledge-Grounded Multi-Agent Framework for Safe and Explainable Medication Recommendation
Medication recommendation predicts medications for patient visits, but existing methods still face two key challenges. At the model level,…
Compute Allocation in Evolutionary Search: From Depth-Breadth to Multi-Armed Bandits
LLM-guided evolutionary search (Evolve systems) has reached state-of-the-art results on mathematical and combinatorial tasks, yet most exis…
Pocket-Dentist: On-Device Dental Image Understanding via Efficient Multimodal Large Language Models
Evaluations of dental vision-language models remain fragmented across datasets, task definitions and metrics, and often ignore their comput…
No More K-means: Single-Stage Sparse Coding for Efficient Multi-Vector Retrieval
Multi-vector retrieval (MVR) models, exemplified by ColBERT, have established new benchmarks in retrieval accuracy by preserving fine-grain…
Neural Network Verification using Partial Multi-Neuron Relaxation
The increasing integration of deep neural networks in critical systems has spawned a theoretical and practical interest in formally guarant…
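Intel tackles robot development challenges with an open robotics library
Intel announced that it will add "OpenVINO Physical AI Framework", an open-source robotics library with an inference runtime optimized for Intel processors, to "Robotics AI Suite", its integrated software development kit for robot development.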
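Drawing SaaS adds high-precision 3D model generation, an assembly feature, and a beta CAE feature
renue has updated its drawing SaaS "Drawing Agent". The update adds high-precision 3D model generation, an assembly feature for combining multiple parts, and a "CAE feature (beta)" that runs structural analysis, so that everything from reading drawings to structural analysis can be handled as one continuous workflow.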
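Fujitsu acknowledges the limits of the "person-month model"; President Tokita: "Shift away from the labor-intensive SI model"
Fujitsu has announced its "Medium- to Long-Term Management Vision 2035". Now that the AI era has arrived, President and CEO Takahito Tokita acknowledged that the conventional "person-month model" has its limits. What did he mean?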
Erin Brockovich takes aim at data center secrecy
Environmental activist Erin Brockovich has a new mission.
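What exactly is a "harness"? Hugging Face sorts out AI agent terms that everyone uses differently
Hugging Face has sorted out terms in the AI agent field, such as "harness" and "scaffold", whose meanings vary from person to person. The article explains, for beginners, the basic terms needed to understand and discuss AI agents correctly.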
Making sense of the debate over AI psychosis
On the latest episode of Equity, we debate whether tech CEOs are "uniquely prone to AI psychosis."