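Weekly AI News 2026-W22
Period covered: 2026-05-25 to 2026-05-31 (2631 items)
Topic trends
Items by topic
- Research/Papers: 1447 items
- LLM/Generative AI: 839 items
- Agents: 554 items
- Business/Funding: 162 items
- Hardware/Semiconductors: 68 items
- Robotics: 54 items
- Other: 47 items
- Image/Video generation: 32 items
- Regulation/Policy: 13 items
This week's highlights (top 10)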
Boston Children’s uses AI to unlock new diagnoses
Boston Children’s Hospital uses OpenAI technology to improve patient care, reduce operational burden, and help diagnose more than 40 rare d…
How Braintrust turns customer requests into code with Codex
How Braintrust engineers use Codex with GPT-5.5 to run experiments and code faster.
How Endava builds an agentic organization with Codex
Learn how Endava uses Codex to build an agentic organization, accelerating software delivery and reducing requirements analysis from weeks…
Cisco and OpenAI redefine enterprise engineering with Codex
Cisco and OpenAI are redefining enterprise engineering with Codex, helping Cisco scale AI-native development, accelerate AI Defense work, a…
Building self-improving tax agents with Codex
See how OpenAI, Thrive, and Crete built a self-improving tax agent with Codex, automating filings, improving accuracy, and accelerating wor…
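What is Hitachi aiming for in teaming up with Anthropic? Its vision for rolling out Claude to 290,000 employees
Hitachi has signed a strategic partnership with Anthropic with a view to using AI in mission-critical domains. What does the company hope to achieve through this alliance?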
‘What a joke’: GitHub Copilot’s new token-based billing spurs consternation among devs
The golden age of Microsoft's GitHub Copilot appears to be at an end.
I put Google’s 24/7 AI assistant Gemini Spark to work, and it’s actually pretty useful
Gemini Spark helps automate everyday tasks, from inbox summaries to local event planning, but it’s unclear why Google made it a separate pr…
Coders are refusing to work without AI — and that could come back to bite them
While AI is helping coders produce code faster, it may not be producing better code, researchers warn. And that could cause problems down t…
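OpenAI opens its life-sciences reasoning AI "GPT-Rosalind" to biodefense, amid concerns over dual-use risk
OpenAI has announced the "Rosalind Biodefense" program, built on "GPT-Rosalind," a frontier reasoning model specialized for life-sciences research. Access is limited to defensive purposes such as biothreat detection, and the API is provided free of charge to vetted developers, U.S. government agencies, and partner organizations in allied countries.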
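All items (by date)
2026-05-31 (5 items)
What is Hitachi aiming for in teaming up with Anthropic? Its vision for rolling out Claude to 290,000 employees
Hitachi has signed a strategic partnership with Anthropic with a view to using AI in mission-critical domains. What does the company hope to achieve through this alliance?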
SoftBank says it will invest up to €75 billion to build French data centers
The goal, the firm said, is to develop and operate up to 5 gigawatts of additional data center capacity.
‘What a joke’: GitHub Copilot’s new token-based billing spurs consternation among devs
The golden age of Microsoft's GitHub Copilot appears to be at an end.
Meta is reportedly developing an AI pendant
Meta seems to be making big bets on AI-powered hardware.
I put Google’s 24/7 AI assistant Gemini Spark to work, and it’s actually pretty useful
Gemini Spark helps automate everyday tasks, from inbox summaries to local event planning, but it’s unclear why Google made it a separate pr…
2026-05-30 (7 items)
As the browser wars heat up, here are the hottest alternatives to Chrome and Safari in 2026
We’ve compiled an overview of some of the top alternative browsers available today aiming to challenge Chrome and Safari.
Coders are refusing to work without AI — and that could come back to bite them
While AI is helping coders produce code faster, it may not be producing better code, researchers warn. And that could cause problems down t…
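OpenAI opens its life-sciences reasoning AI "GPT-Rosalind" to biodefense, amid concerns over dual-use risk
OpenAI has announced the "Rosalind Biodefense" program, built on "GPT-Rosalind," a frontier reasoning model specialized for life-sciences research. Access is limited to defensive purposes such as biothreat detection, and the API is provided free of charge to vetted developers, U.S. government agencies, and partner organizations in allied countries.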
So you’ve heard these AI terms and nodded along; let’s fix that
The rise of AI has brought an avalanche of new terms and slang. Here is a glossary with definitions of some of the most important words and…
What happens when companies become too AI-pilled?
The people deciding that AI can replace your job are also the ones least likely to understand what your job truly involves, according to Bo…
After Nvidia’s $20B not-acqui-hire, AI chip startup Groq reportedly raising $650M
Chipmaker Groq is looking to raise $650 million in internal funding as it pivots from hardware to focus more on AI inference, the process o…
Cognition’s Scott Wu says AI coding agents shouldn’t replace humans
Cognition makes Devin, the first and arguably most successful AI coding agent. But famed coder Wu says it isn't designed to supplant human…
2026-05-29 (540 items)
Today is the last day to apply to speak at TechCrunch Disrupt 2026
Submit your session topic before today ends for a chance to speak at TechCrunch Disrupt 2026. Apply now to share your insight and help shap…
Final 24 hours to save up to $410 on your TechCrunch Disrupt 2026 ticket
You now have until tonight at 11:59 p.m. PT to lock in Early Bird savings of up to $410 for TechCrunch Disrupt 2026 before prices increase.…
Does your CEO have AI psychosis? Aaron Levie thinks most of them do.
The people deciding that AI can replace your job are also the ones least likely to understand what your job truly involves, according to Bo…
Kiwibit’s AI-powered bird feeder is my new backyard buddy
If you're looking for a fun way to connect with nature while collecting bird species on an app like Pokémon, give this smart feeder a try.
This chip startup just raised $135M on a bet that AI’s biggest bottleneck isn’t compute — it’s memory
South Korean chip startup XCENA is betting that AI's real bottleneck is not compute, but memory.
Boston Children’s uses AI to unlock new diagnoses
Boston Children’s Hospital uses OpenAI technology to improve patient care, reduce operational burden, and help diagnose more than 40 rare d…
How Braintrust turns customer requests into code with Codex
How Braintrust engineers use Codex with GPT-5.5 to run experiments and code faster.
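"Nano Banana 2" and "Nano Banana Pro" reach general availability; "2" also supports image generation from video
On May 29, Google announced the general availability of "Nano Banana 2" (Gemini 3.1 Flash Image) and "Nano Banana Pro" (Gemini 3 Pro Image). In addition, a new video-input feature for Nano Banana 2…
Japanese government and major financial institutions obtain access to OpenAI's new model to strengthen cyber defenses
Satsuki Katayama, Minister of State for Financial Services, revealed that the government and major financial institutions have obtained access to a new AI developed by OpenAI. With growing concern that high-performance AI could be misused for cyberattacks, measures that use AI for defense have become urgent. Katayama said, "From the standpoint of strengthening the cybersecurity of our country's financial institutions, we welcome…
How is JR West using AI to automate rolling-stock work plans that "only veterans could write by hand"?
JR West is developing a system that uses AI to automatically generate on-site work plans for rolling-stock depots, which experienced staff used to draw up by hand. Why is it developing the system, and what benefits does it expect?
OpenAI to cooperate with Japanese government on cybersecurity, providing its latest AI "GPT-5.5-Cyber" to financial institutions
OpenAI has announced the "Japan Cyber Action Plan," an initiative to cooperate with the Japanese government on cybersecurity. As a first step, it will provide financial institutions with "GPT-5.5-Cyber," its latest AI model specialized for cybersecurity.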
Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction
Gradient temporal-difference methods provide stable off-policy prediction with linear function approximation, but their practical performan…
Behavior-Aware Auxiliary Corrections for Off-Policy Temporal-Difference Prediction
Temporal-difference learning with function approximation can be unstable under off-policy sampling. TDC stabilizes off-policy TD through an…
The Cognitive Categorical Transformer: Category-Theoretic Inductive Biases for Language Modeling
The Cognitive Categorical Transformer (CCT) is a 306M-parameter architecture that augments a pretrained GPT-2 Small backbone with cognitive…
Ultra-Reduced-Impact-Encased-Logging (URIEL): propose a new method for selective sustainable logging and post-harvest silvicultural treatment in tropical forest using airborne robotics systems
Tropical forests worldwide are under intense deforestation pressure driven by economic and political interests, and scientific evidence sug…
Review Arcade: On the Human Alignment and Gameability of LLM Reviews
LLM-generated reviews for scientific papers are gaining considerable traction and are even being officially piloted by major conferences. W…
Orthogonal Concept Erasure for Diffusion Models
Concept erasure has emerged as a promising approach to mitigate undesired or unsafe content in diffusion models, yet existing methods still…
Frontier LLM-based agents can overcome the ontology curation bottleneck for natural phenotypes
Linking free-text phenotype descriptions to ontology terms, typically referred to as phenotype annotation, is essential for the cross-study…
VFEAgent: A Multimodal Agent Framework for End-to-End Automated Finite Element Analysis
Finite Element Analysis (FEA) serves as the cornerstone of modern engineering design. However, its workflow is inherently complex and relie…
BEAMS: Benchmarking and Evaluating AI for Modeling and Simulation
AI tools to support real world decision making must be able to build simulation models that inform their recommendations and render them in…
Adopt ≠ Adapt: Longitudinal Analyses of LLM Conversations in the Wild
Although a growing body of research has begun to describe user–LLM interactions, the picture it paints is largely static; little is known…
When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis
Federal agencies are deploying large language models (LLMs) to categorize public comment corpora, where the model's organization of the rec…
Mind Your Tone: Does Tone Alter LLM Performance?
The use of Large Language Models (LLMs) is proliferating, yet their performance is observed to vary based on prompting styles and tones. In…
Practitioner Beliefs and Behaviors in AI-Enhanced Education: DOT Framework Survey Evidence
This study reports findings from a cross-sectional survey (n = 72) of higher education practitioners examining beliefs, behaviors, and inst…
Differentiable Belief-based Opponent Shaping
Human coordination often relies on the ability to influence the beliefs of others through strategic action. In multi-agent reinforcement le…
Hallucination Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching
Hallucination remains a major reliability barrier for production LLM systems, particularly in multi-agent pipelines where unsupported claim…
Robust and Efficient Guardrails with Latent Reasoning
Maintaining the safety of large language models (LLMs) is crucial as they are increasingly deployed in real-world applications. Existing sa…
Bridging the Sim-to-Real Gap in Reinforcement Learning-Based Industrial Dispatching through Execution Semantics
Event-driven scheduling policies are increasingly deployed in industrial environments, where decisions are made under asynchronous and part…
The Importance of Out-of-Band Metadata for Safe Autonomous Agents: The Redpanda Agentic Data Plane
AI agents are increasingly expected to operate as digital employees: accessing enterprise data, making decisions, and taking actions autono…
The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure
Reasoning models are evaluated on single-turn benchmarks but deployed in multi-turn dialogue, where users push back on correct answers. Und…
Trends in AI and Human-AI Interaction in Clinical Trials – A Hybrid Human-AI Exploration
This paper examines records retrieved from the ClinicalTrials.gov registry to characterize temporal trends in AI terminology and the geogra…
Beyond Consensus: Trace-Level Synthesis in Mixture of Agents
When multiple LLM agents solve the same problem, standard practice compresses each agent's reasoning into a majority vote or layered synthe…
PRO-CUA: Process-Reward Optimization for Computer Use Agents
Computer use agents (CUAs) have shown strong potential for automating complex digital workflows, yet their training remains constrained by…
The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models
Masked diffusion language models (MDMs) uniquely support any-order generation, with confidence-based decoding currently serving as the de f…
Governing Technical Debt in Agentic AI Systems
Agentic AI systems are increasingly being explored as production infrastructure: they reason over multiple steps, call tools, act through w…
Better Later Than Sooner: Neuro-Symbolic Knowledge Graph Construction via Ontology-grounded Post-extraction Correction
Question answering (QA) is a core challenge in AI, particularly for complex queries requiring multi-hop reasoning across documents, or symb…
Paper Agents, Paper Gains: An Empirical Analysis of DeFi Investment Agents
DeFi investment agents, systems that use AI for autonomous on-chain trading, have attained over USD 3 billion in combined token valuations…
ReasonOps: Operator Segmentation for LLM Reasoning Traces
Chain-of-thought traces from large reasoning models can span tens of thousands of tokens, yet we lack a vocabulary for describing their int…
GTA: Generating Long-Horizon Tasks for Web Agents at Scale
Web agents, which couple language models with browsing and tool-use capabilities, show promise as open web assistants. Yet progress is incr…
BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents
Self-evolving agents improve over time by reflecting on past failures, but existing evaluation is limited in two ways: it measures only tas…
Tailoring the Curriculum: Student-Centered Reasoning Distillation via Dynamic Data-Model Compatibility
Reasoning distillation transfers complex reasoning abilities from large language models (LLMs) to smaller ones, yet its success depends on…
Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth
We study large-scale literature search from two complementary angles: improving the retrieval pipeline, and stress-testing the human refere…
Surfacing Isolated Learners with Outcome-Independent Mediation of Feedback between Teachers and Students Using AI
AI-augmented classrooms generate rich teacher and student feedback before graded outcomes become available, yet these signals can be diffic…
DenseSteer: Steering Small Language Models towards Dense Math Reasoning
Large language models (LLMs) demonstrate strong chain-of-thought (CoT) reasoning abilities, while smaller models (<= 3B parameters) signifi…
Provably Secure Agent Guardrail
As large language models transition from bounded generative engines to agents with expansive execution privileges, AI going out of control…
OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories
Task success can hide process anomalies in real-world agent executions. An agent may pass the final task oracle while still accumulating un…
Harmonizing Real-Time Constraints and Long-Horizon Reasoning: An Asynchronous Agentic Framework for Dynamic Scheduling
The Dynamic Flexible Job Shop Scheduling Problem (DFJSP) necessitates a trade-off between instant reaction to stochastic disturbances and g…
When and How Human Curation Backfires: Preference Alignment under Multi-Model Self-Consuming Loop
Foundation models are increasingly trained on synthetic data generated by prior model iterations rather than exclusively on real data. This…
Indexing the Unreadable: LLM-Native Recursive Construction and Search of Service Taxonomies
The era of the Internet of Agents (IoA) is taking shape: LLM agents are expected to fulfill user goals by orchestrating fast-growing popula…
CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval
Tool retrieval over large API catalogs is a core bottleneck for LLM agents: user queries arrive in colloquial, often underspecified languag…
Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces
Long chain-of-thought (CoT) traces are widely used as supervision for reasoning-oriented LLM SFT, yet answer-correct traces can still lead…
Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models
Supervised fine-tuning (SFT) followed by reinforcement learning (RL) has become a standard post-training paradigm for large language models…
Rubric-Guided Process Reward for Stepwise Model Routing
Stepwise model routing improves the efficiency of Large Reasoning Models (LRMs) by assigning each reasoning step to a suitable model. Recen…
ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression
Mixture-of-Experts (MoE) language models reduce per-token computation but still require storing and serving all experts, making deployment…
PassNet: Scaling Large Language Models for Graph Compiler Pass Generation
Modern tensor compilers such as TorchInductor deliver substantial speedups on mainstream models, yet face a systematic performance ceiling…
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
We demonstrate that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a production-scale language model, address…
MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models
Action-conditioned world models are increasingly used as scalable simulators for robot learning, yet current evaluations provide limited ev…
EvoMD-LLM: Learning the Language of Species Evolution in Reactive Molecular Dynamics
While large language models (LLMs) excel at static scientific reasoning, they struggle to model the temporal structure of dynamic physical…
Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization
Safety alignment for large language models (LLMs) aims to reduce harmful or unsafe behavior while preserving general utility. However, rece…
Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark
We benchmark three supervised fine-tuned models against frontier zero-shot baselines on a 661-row held-out slice of PiSAR (Persona, intent,…
When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs
Persona prompting is widely used to steer large language models, yet its practical value remains unclear. Prior work often evaluates person…
ReasonLight: A Multimodal Foundation Model-Enhanced Reinforcement Learning Framework for Zero-Shot Traffic Signal Control
Reinforcement learning (RL) has shown promise in traffic signal control (TSC). However, its reliance on predefined states limits responsive…
Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation
Automatic speech recognition (ASR) is a core component of human–computer interaction and an increasingly important front-end for LLM-based…
CrystalXRD-Bench: Benchmarking Vision-Language Models for XRD Peak Indexing Across Diverse Crystalline Materials
Miller-index identification from powder XRD patterns requires capabilities untested by existing multimodal benchmarks: the model must read…
VitalAgent: A Tool-Augmented Agent for Reactive and Proactive Physiological Monitoring over Wearable Health Data
Wearable devices enable continuous monitoring of physiological signals such as ECG and PPG, but existing mHealth systems are largely limite…
The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF
Large Language Models (LLMs) are increasingly deployed in agentic and retrieval-augmented generation (RAG) systems, where they must execute…
Xetrieval: Mechanistically Explaining Dense Retrieval
Explaining why dense retrievers assign high relevance scores remains challenging because retrieval decisions are made through opaque high-d…
MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs
Large language models (LLMs) are increasingly deployed as interactive agents, yet their capacity for social and strategic reasoning over ex…
DeepSurvey: Enhancing Analytical Depth and Citation Reliability in Automated Survey Generation
As scientific literature grows rapidly, automated survey generation has become a key capability for AI scientists and human researchers. Ho…
UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents
Recent advances in mobile GUI agents have shown strong potential for automating mobile tasks, but most effective systems still depend on la…
Opt-Verifier: Unleashing the Power of LLMs for Optimization Modeling via Dual-Side Verification
Building mathematical optimization models is critical in operations research (OR), while it requires substantial human expertise. Recent ad…
Battery-Sim-Agent: Leveraging LLM-Agent for Inverse Battery Parameter Estimation
Parameterizing high-fidelity "digital twins" of batteries is a critical yet challenging inverse problem that hinders the pace of battery in…
ParaTool: Shifting Tool Representations from Context to Parameters
Tool calling extends large language models (LLMs) by enabling grounded interaction with external executable interfaces, thereby supporting…
Planning with the Views via Scene Self-Exploration
Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view planning, requiring (1…
DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning
Tool-Integrated Reasoning (TIR) extends LLM capabilities by leveraging external environments. However, existing methods lack the deliberati…
GPS-Enhanced Tourist Mobility Modeling with Seasonal Spatial Priors and LLM-Based Activity Chain Generation
Tourist mobility poses a distinct challenge for urban transportation planning. Unlike resident commuting, tourist travel is largely non-rou…
FinVerBench: Benchmark Validity and Calibration in Large Language Model Financial Statement Verification
We introduce FinVerBench, a benchmark and validity study for financial statement verification: determining whether a set of corporate finan…
Mind-Omni: A Unified Multi-Task Framework for Brain-Vision-Language Modeling via Discrete Diffusion
Modeling the interplay between external stimuli and internal neural representations is a pivotal research area for Brain-Computer Interface…
HiKEY: Hierarchical Multimodal Retrieval for Open-Domain Document Question Answering
Retrieval-augmented generation (RAG) for document-based Open-domain Question Answering (ODQA) on large-scale industrial corpora faces two c…
Improving Collaborative Storytelling with a Multi-Agent Framework Based on Large Language Models
The topic of Co-creation, i.e., AI agents interacting with humans to generate outputs (e.g., art), has gained significant attention recentl…
Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures
Attack Success Rate (ASR) evaluates each jailbreak with a single yes/no label at the end of generation, telling us whether a failure happen…
VikingMem: A Memory Base Management System for Stateful LLM-based Applications
Large Language Models have revolutionized interactive applications; however, their finite context windows pose a critical data management c…
LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning
Heuristic search is the dominant paradigm in symbolic AI planning, and the strongest heuristics are the result of decades of work by planni…
Think Fast, Talk Smart: Partitioning Deterministic and Neural Computation for Structured Health Text Generation
Large language models (LLMs) are increasingly being used to generate health text from structured records such as wearable time series, biom…
PTCG-Bench: Can LLM Agents Master Pokémon Trading Card Game?
Given a strategically complex board game, human players can quickly learn to devise strategies after playing a few rounds. Autonomous agent…
TRACE: Toulmin-based Reasoning Assessment through Constructive Elements for LLM CoT Evaluation
Evaluating open-ended outputs from large language models (LLMs) remains challenging due to the absence of ground truth. Existing metrics re…
GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents
LLM agents acting in structured environments fail in operational rather than conversational ways, and reliability depends on procedural kno…
Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems
Large language models in Agentic AI systems consume tool schemas and execution results and emit tool invocations as structured data. The de…
NICE: A Theory-Grounded Diagnostic Benchmark for Social Intelligence of LLMs
As large language models (LLMs) are increasingly applied in social contexts such as emotional companionship and customer service, measuring…
Reliable Reasoning with Large Language Models via Preference-Based Maximum Satisfiability
Large Language Models (LLMs) excel at understanding natural language but struggle with optimisation tasks involving multiple constraints an…
FHRFormer: A Self-Supervised Masked Transformer Framework for Fetal Heart Rate Time-Series Inpainting and Forecasting
Approximately 10% of newborns require assistance to initiate breathing at birth, and around 5% need ventilation support. Fetal heart rate (…
Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling
In Agentic Search, trajectory-level outcome rewards fail to quantify the behavioral contributions of individual steps, while existing step-…
BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices
Trajectory prediction is a fundamental task for autonomous systems, requiring complex reasoning about multi-agent interactions and intents.…
NaRA: Noise-Aware LoRA for Parameter-Efficient Fine-Tuning of Diffusion LLMs
Diffusion Large Language Models (dLLMs) have emerged as a promising non-autoregressive generative paradigm. Given the prohibitive computati…
Uncertainty-Aware Transfer Learning for Cross-Building Energy Forecasting: Toward Robust and Scalable District-Level Energy Management
Scaling data-driven energy forecasting to district level requires models that can be re-used across buildings with minimal target-domain da…
Citation-Closure Retrieval and Per-Rule Attribution for Real-World Regulatory Compliance Question Answering
Deploying Large Language Models (LLMs) for regulatory compliance demands rigorous traceability via comprehensive citations across multi-tie…
Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence
The impressive performance of generalist large language models (LLMs) such as GPT and Claude in healthcare raises a critical question: will…
Benchmarking Positional Encoding Strategies for Transformer-Based EEG Foundation Models
Electroencephalography (EEG) is a widely used non-invasive technique for measuring brain activity in brain-computer interface (BCI) applica…
LFQ: Logit-aware Final-block Quantization for Boosting the Generation Quality of Low-Bit Quantized LLMs
As large language models continue to scale, low-bit weight-only post-training quantization (PTQ) offers a practical solution to their memor…
From XXLTraffic to EvoXXLTraffic: Scaling Traffic Forecasting to Sensor-Evolving Networks
Existing traffic forecasting benchmarks assume a fixed sensor set, but real road-sensor networks grow continuously as the road network chan…
Croissant Tasks: A Metadata Format for Reproducible Machine Learning Evaluations
Reproducibility is fundamental to the scientific method, yet remains a critical challenge in machine learning. Contributing factors include…
Certified Policy Optimisation for Nested Causal Bandits via PAC-Bayes Risk
Critical sequential decisions are rarely single-timescale: a strategic decision causally shapes the context in which every subsequent tacti…
SkillsInjector: Dynamic Skill Context Construction for LLM Agents
LLM agents now draw on growing skill libraries to handle complex tasks. However, injecting more skills does not always improve task complet…
MEMENTO: Leveraging Web as a Learning Signal for Low-Data Domains
Real-world tasks often lack large labeled datasets, motivating extensive work on learning in low-data regimes. However, existing approaches…
SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search
Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite the effectiveness…
AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security
Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sou…
PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing
The growing number of submitted papers has motivated the exploration of Large Language Models (LLMs) as a means to support and augment the…
Harnessing non-adversarial robustness in large language models
The work presents an approach for addressing the challenge of robustness in Large Language Models (LLMs) to alterations and potential error…
Quantifying and Optimizing Simplicity via Polynomial Representations
Deep networks often exhibit a preference for "simple" solutions, and such a simplicity bias is widely believed to play a key role in genera…
OptSkills: Learning Generalizable Optimization Skills from Problem Archetypes via Cluster-Based Distillation
Leveraging Large Language Models (LLMs) to automatically formulate and solve optimization problems from natural language has emerged as an…
OmniMatBench: A Human-Calibrated Multimodal Reasoning Benchmark Across 19 Materials Science Subfields
As multimodal language models play an increasingly important role in scientific research, materials science offers a critical testbed due t…
Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation
Key-Value (KV) cache remains a major bottleneck for deploying Large Language Models (LLMs) in long-generation tasks. Prior work often appli…
Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories
LLM-based agents have demonstrated strong capabilities in solving complex tasks through multi-step reasoning and tool use. However, existin…
On the Geometry of Games and their Solvers
A central challenge in game theory and learning systems such as GANs is understanding which algorithms can efficiently compute equilibria a…
Toward AI Systems That Understand Self and Others: A Multi-Phase Inference Framework for Human Cognitive Diversity and World-Model Alignment
Mutual misunderstanding in contemporary society does not arise merely because people hold different opinions or values. Even under the same…
It's All About Speed: AI's Impact on Workflow in Music Production
In this paper, we present the results of an ethnographic study into the impact of AI and automated tools on music production workflow. Focu…
Make LLM Learn to Synthesize from Streaming Experiences through Feedback
Large language models (LLMs) have been widely adopted for synthetic data generation, significantly reducing annotation costs. However, most…
MuPHI: Learning Implicit Multimodal Harm Reasoning via Semantically Grounded Reward Optimization
Understanding how harm emerges from interaction between otherwise benign image-text pairs requires intent-aware cross-modal reasoning beyon…
Formalizing Mathematics at Scale
We present AutoformBot, a multi-agent system for building an Autoformalized Textbook Library At Scale (Atlas) in Lean 4. AutoformBot orches…
Meta-Programming for Linear-time Temporal Answer Set Programming
The development of temporal extensions of Answer Set Programming (ASP) has led to the emergence of non-monotonic linear-time (TEL), dynamic…
Compass: Navigating Global Marine Lead Data Integration through Expert-Guided LLM Agent
Marine lead (Pb) and its isotopes are critical tracers for ocean circulation and anthropogenic pollution, yet in-situ observations remain c…
Accelerating Constrained Decoding with Token Space Compression
To guarantee that an LLM's outputs conform to a specified structure, context-free grammar (CFG) decoding engines force the selection of nex…
Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation
Front-end web code has become a core product surface for every frontier LLM release, yet evaluating these interactive applications at devel…
KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning
Cross-domain multimodal time series forecasting is a challenging task, requiring models to integrate precise numerical comprehension, cross…
From GPS Points to Travel Patterns: Flexible and Semantic Trajectory Generation with LLMs
Urban trajectories play a crucial role in modeling urban dynamics and supporting various smart city applications. However, privacy concerns…
RAISE: RAG Design as an Architecture Search Problem
Retrieval-augmented generation (RAG) systems expose numerous design choices spanning query rewriting, chunking, retrieval depth, reranking,…
Teaching Values to Machines: Simulating Human-Like Behavior in LLMs
Large Language Models (LLMs) demonstrate a remarkable capacity to adopt different personas and roles; however, it remains unclear whether t…
Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning
Large Language Models have demonstrated remarkable progress in general-purpose capabilities and can achieve strong performance in specific…
Learning to Choose: An Empowerment-Guided Multi-Agent System with Semantic Communication for Adaptive Method Selection
Automating scientific computing workflows requires more than generating executable code: autonomous systems must also select appropriate co…
Robust and Generalizable Safety Steering for Text-to-Image Diffusion Transformers
Diffusion Transformers have become a powerful backbone for text-to-image generation, but their layered and cross-modal generation process m…
Conformal Certification of Reasoning Trace Prefixes
Language model reasoning traces are rarely all-or-nothing; they frequently contain valid intermediate steps before a critical error occurs.…
Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison
Emerging personal AI agents are moving toward persistent, multi-source memory. This creates an evaluation problem: systems must decide how…
PokerSkill: LLMs Can Play Expert-Level Poker without Training or Solvers
Poker is a landmark challenge for artificial intelligence. The dominant approach relies on equilibrium solvers built on counterfactual regr…
VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing
Understanding how Vision-Language-Action (VLA) models transform multimodal knowledge into embodied control remains an open challenge. We pr…
Enhancing Multi-Agent Communication through Attention Steering with Context Relevance
LLM-based multi-agent systems have demonstrated remarkable performance on complex tasks through collaborative reasoning. However, these sys…
AgentSchool: An LLM-Powered Multi-Agent Simulation for Education
Despite the rapid deployment of LLMs into classrooms, validating educational AI remains uniquely intractable: interventions act on developi…
Anchorless Diversification for Parallel LLM Ideation
LLMs are increasingly used to generate candidate-idea pools for creative tasks where broad exploration is valuable. Parallel inference can…
Temporal Stability and Few-Shot Prompting in Math Task Assessment
As AI tools become increasingly integrated into educational contexts, questions arise about both their stability over time and their respon…
Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents
Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajectories into compact memory. Howe…
BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders
Biosecurity evaluations of language models typically ask whether models produce hazardous output. This paper asks a complementary question:…
Modularizing Educational LLM-Agency for Fostering Responsible Learning Assistance
The widespread adoption of AI chatbots in education will drastically change learning, making responsible deployment a critical concern. Whi…
Double-Edged Sword or Sharp Tool? Designing and Evaluating Triadic LLM-Teacher Collaboration for K-12 Writing at Scale
The double-edged sword of integrating Large Language Models (LLMs) requires an effective triadic collaboration mechanism among LLMs, teache…
Persona Conditioning of Brand Recommendations in Retrieval-Augmented Commercial Chat: A Prominence-Stratified Cross-Provider Audit
The same prompt -- "best CRM software" -- reaches AI assistants from buyers in widely different contexts: a solo founder, an enterprise VP,…
When Should Models Change Their Minds? Contextual Belief Management in Large Language Models
Long-horizon interactions require language models to manage accumulating information: when to update their state, when to preserve their st…
mcp-proto-okn: Natural-language access to open scientific knowledge graphs through the Model Context Protocol
MCP Server Proto-OKN (mcp-proto-okn) is a Python-based Model Context Protocol server that enables AI assistants to discover, inspect, query…
ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure
Scientific discovery is an inherently creative and uncertain process, requiring reasoning beyond the recall of known knowledge. While many…
MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection
Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before…
Demystifying Data Organization for Enhanced LLM Training
Large Language Models (LLMs) have revolutionized various fields, yet their training efficiency is heavily reliant on effective data curatio…
Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents
Multi-component LLM agents assemble probabilistic claims from components that each see only part of a joint problem; the composition can vi…
Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection
Recent advances in Vision-Language Models (VLMs) have achieved impressive performance across many tasks, yet prior studies report unsatisfa…
SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations
Printed circuit board (PCB) schematic design defines nearly all electronic hardware, but it remains manual and expertise-intensive. While g…
Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software
Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist supervising an AI coding agent (Cl…
Micro-Macro Retrieval: Reducing Long-Form Hallucination in Large Language Models
Large Language Models (LLMs) achieve impressive performance across many tasks but remain prone to hallucination, especially in long-form ge…
Aryabhata 2: Scaling Reinforcement Learning for Advanced STEM Reasoning
Competitive STEM examinations such as JEE and NEET require multi-step symbolic reasoning, precise numerical computation, and deep conceptua…
Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation
As Large Language Models (LLMs) are increasingly deployed in safety-critical applications, robust content moderation becomes essential. We…
S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering
Long-horizon interactive agents often accumulate large trajectory histories yet still fail to answer questions about earlier events reliabl…
A comparative study of transformer-based embeddings for topic coherence
Topic modeling is a branch of Natural Language Processing (NLP) that aims to organize large collections of texts into coherent groups accor…
Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions
Automatic speech recognition (ASR) has the potential to substantially reduce manual annotation effort in child speech research by generatin…
Assessing Dutch Syllabification Algorithms and Improving Accuracy by Combining Phonetic and Orthographic Information through Deep Learning
Syllabification describes the task of dividing words into syllables. Due to many rules and exceptions, training an algorithm to perform syl…
GenesisFunc: Multi-Agent Data Generation for Accurate and Generalizable Function-Calling
Large Language Models (LLMs) extend their capabilities through function-calling (FC), which relies on training data with high quality, dive…
No Reader Left Behind: Multi-Agent Summaries Everyone Can Understand
The Plain Writing Act in the United States requires government documents to be accessible in clear and simple language that the general pub…
SERC: LDPC-Inspired Semantic Error Correction for Retrieval-Augmented Generation
While Large Language Models (LLMs) have demonstrated remarkable capabilities, their reliability is significantly compromised by hallucinati…
Specialty-Specific Medical Language Model for Immune-Mediated Diseases
Extracting detailed clinical information from free-text medical narratives remains a practical challenge for researchers and healthcare sys…
How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines
Large language model (LLM) agents with tool-calling capabilities are increasingly deployed in production systems, yet a fundamental reliabi…
Thoughts-as-Planning: Latent World Models for Chain-of-Thoughts Optimization via Reinforcement Planning
The success of large language models (LLMs) across diverse NLP tasks has elevated the importance of reasoning chain optimization as a criti…
GPF-LiveNews: A Streaming Evaluation Protocol for Group-Conditioned Framing in Large Language Models
Deployed language models are evaluated in a non-stationary environment: model versions, retrieval layers, safety systems, and real-world in…
Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?
Fine-tuning large language models (LLMs) frequently induces catastrophic forgetting of prior capabilities. Recent work has shown that reinf…
Self-Play Reinforcement Learning under Imperfect Information in Big 2
Imperfect-information multiplayer games test whether agents can act under hidden information, sparse rewards, and non-stationary opponents.…
Emergent Semantic Representations in World Models through Physical Interaction without Linguistic Supervision
What does a world model learn from physical exploration, without any linguistic supervision? We argue the answer is organized by a single p…
Continuity and Ordinality Matter: Constraining Time Series Tokens for Effective Time Series Analysis with Large Language Models
Token-based time series large language models (TS-LLMs) have emerged as a promising direction for time series analysis and reasoning. Howev…
PrismFlow: Residual Dynamics for Flow Matching in Time-Series Generation
Generating high-quality time-series data is challenging because real-world signals often exhibit multimodal patterns and multiscale dynamic…
TaxDistill: Improving Metagenomic Taxonomic Annotation via Distilled Genomic Foundation Models
Metagenomic taxonomic annotation aims to identify the microbial origins of DNA fragments in environmental samples. Traditional methods that…
Balancing Multimodal Learning through Label Space Reshaping
Multimodal learning often suffers from modality imbalance, where modalities that converge faster dominate optimization while others remain…
Representation Alignment Rests on Linear Structure
We investigate the Platonic Representation Hypothesis (PRH) through a tripartite statistical framework of representations: signal, bias, an…
LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis
CI failure logs are large (median 5k lines, max 200k in this corpus) and noisy. Coding agents that try to debug them depend on an upstream…
GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human
With the rapid advancement of large language models, evaluating human-likeness in open-ended conversation has become increasingly important…
Context Distillation as Latent Memory Management
Context distillation compresses contextual information into model parameters, yet existing methods often ignore how multiple distilled late…
Quantum-Enhanced Adversarial Robustness in Artificial Intelligence
Artificial Intelligence has achieved remarkable success across diverse application domains. However, its vulnerability to adversarial attac…
Hallucination Detection-Guided Preference Optimization for Clinical Summarization
Large language models (LLMs) have shown promise on summarization tasks, but they often produce hallucinations, which are unsupported or inc…
AIRGuard: Guarding Agent Actions with Runtime Authority Control
Tool-using language agents turn model decisions into external side effects: they read files, run scripts, call APIs, send messages, and inv…
First head-to-head comparison of agentic AI applied to the analysis of simulated data of the Einstein Telescope
We report a comparison of two state-of-the-art agentic AI systems, Claude Code (Anthropic) and Codex (OpenAI), tasked with autonomously exe…
CosmicFish-HRM: Adaptive Reasoning via Hierarchical Recurrent Mechanisms in Compact Language Models
Large language models have achieved strong reasoning capabilities, though often at the cost of massive parameter counts and expensive infer…
Conf-Gen: Conformal Uncertainty Quantification for Generative Models
Conformal prediction (CP) and its extension, conformal risk control (CRC), are established frameworks for quantifying uncertainty in superv…
Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization
If an AI agent makes decisions on a person's behalf, those decisions must align with its user. We introduce representational accuracy to me…
Comparing Post-Hoc Explainable AI Methods for Interpreting Black-Box EEG Models in Depression Detection
Recent advances in deep learning have enabled increasingly accurate electroencephalography (EEG)-based classification of Major Depressive D…
The Hamilton-Jacobi Theory of Deep Learning
In this paper, training a neural network is identified, exactly, as a search through Hamilton-Jacobi initial-value problems: each gradient…
Measuring Real-World Prompt Injection Attacks in LLM-based Resume Screening
LLMs are vulnerable to prompt injection attacks. However, this vulnerability has been primarily demonstrated conceptually in academic studi…
FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks
A paraphrase-quality audit of MathCheck (ICLR 2025) detected 4 semantically incorrect paraphrases in 129 groups (3.1%); removing them drops…
LoRe: Adaptive Interaction-Evaluation Routing with Per-Step Interaction Budgets for Iterative Graph Solvers
Diffusion-based neural solvers for combinatorial optimization repeatedly re-evaluate dense edge/factor interactions, making inference expen…
Label-Free Reinforcement Learning via Cross-Model Entropy
Post-training large language models with reinforcement learning is bottlenecked by the reward signal. Existing approaches require either gr…
Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning
Conditioned Sequence Models (CSMs) learn policies by treating return-to-go (RTG) as a control signal. However, existing CSMs often treat th…
SCDBench: A Benchmark for LLM-Based Smart Contract Decompilers
Smart contract decompilation aims to recover high-level source code from bytecode, but evaluating decompilers remains difficult because exi…
Structured Prompt Optimization Meets Reinforcement Learning for Global and Local Interpretability over Complex Text
LLMs have advanced text classification, yet existing paradigms face a trade-off: supervised (label only) fine-tuning is scalable but offers…
Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG
A retrieval-augmented generation (RAG) system deployed over a multi-author institutional corpus can give a different answer to the same que…
OISD: On-Policy Internal Self-Distillation of Language Models
Recent reinforcement learning (RL) post-training approaches primarily optimize the final output policy using sparse outcome-level rewards,…
GEO-Bench: Benchmarking Ranking Manipulation in Generative Engine Optimization
Large language models (LLMs) increasingly rank products, documents, and recommendations for user queries, which makes manipulating these ra…
unix-ctf: Procedural Environments for Unix-Competence Reinforcement Learning
Unix competence is the ability to use shell and operating-system primitives as first-class tools, not merely to write programs through a te…
A Minimal Bifurcation Model of Load Imbalance in a Softmax Mixture-of-Experts Router
We propose a minimal dynamical model of adaptive softmax routing for a two-expert Mixture-of-Experts (MoE) layer. The model is obtained as…
When and How Long? The Readout-Mediator Angle in Temporal Reasoning
A linear probe can decode a representation almost perfectly and yet be completely irrelevant to how the model uses it. On calendar-date dur…
Multi-Resolution End-to-End Deep Neural Network for Optimizing Latency-Accuracy Tradeoff in Autonomous Driving
Latency-accuracy tradeoffs are fundamental in real-time applications of deep neural networks (DNNs) for cyber-physical systems. In autonomo…
Toward User Preference Alignment in LLM Recommendation via Explicit Context Feedback
Traditional recommender systems (RecSys) primarily infer user preferences from implicit signals (such as clicks, watches, and purchases), o…
SafeRx-Agent: A Knowledge-Grounded Multi-Agent Framework for Safe and Explainable Medication Recommendation
Medication recommendation predicts medications for patient visits, but existing methods still face two key challenges. At the model level,…
Real-rootedness of the Poincaré polynomials of $\overline{\mathcal M}_{0,n}$: an AI-assisted proof
We prove real-rootedness for the Poincaré polynomial \[ P_n(t)=\sum_{i=0}^{n-3} \dim H^{2i}(\overline{\mathcal M}_{0,n};\mathbb{Q})t^i \]…
Unveiling Multi-regime Patterns in SciML: Distinct Failure Modes and Regime-specific Optimization
Neural networks trained under different hyperparameter settings can fall into distinct training "regimes," with consistent behavior within…
CA-AC-MPC: CUDA-Accelerated Actor-Critic Model Predictive Control
In the literature, actor-critic model predictive control (AC-MPC) integrates MPC with reinforcement learning to enable high-performance con…
Parallax: Parameterized Local Linear Attention for Language Modeling
Large Language Models (LLMs) have become the central paradigm in artificial intelligence, yet the core computational primitive of attention…
Evolutionary Refinement of Generative Graph Topologies: A Hybrid WGAN-GA Approach
Generating realistic graph-structured data is challenging due to discrete connectivity, varying graph sizes, and class-specific structural…
Domain-Informed Representation for Evolutionary Sieving in Integral and Module Lattices
Traditional cryptography, rooted in problems, e.g., integer factorisation or discrete log, is inevitably vulnerable to a fully operational…
UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning
Legal NLP benchmarks are overwhelmingly English-centric, leaving failure modes in morphologically rich, non-Latin-script languages undetect…
Sustainable Metal-Organic Framework Water Harvesters in the Artificial Intelligence Era
Metal-organic frameworks (MOFs) are excellent candidates for water harvesting due to their tunable pore environments, which can be precisel…
TIMEGATE: Sustainable Time-Boxed Promotion Gates for Continual ML Adaptation Under Resource Constraints
As machine learning (ML) systems evolve to continual adaptation, each re-training cycle uses compute, annotation, and energy. We introduce T…
Influence-Guided Symbolic Regression: Scientific Discovery via LLM-Driven Equation Search with Granular Feedback
Large Language Models (LLMs) offer a promising avenue for scientific discovery, yet their application to symbolic regression is often const…
Stochastic Lifting for Generating Trajectories of Stochastic Physical Systems
Many stochastic physical systems evolve smoothly over time in the sense that the distribution of states changes regularly across time steps…
Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents
AI agents augment large language models with external tools such as web retrieval, enabling grounded and up-to-date responses. However, inc…
Toward Ethical Facial Age Estimation: A Generalized Zero-Shot Benchmark Without Training on Children's Data
Age estimation from facial images typically relies on training data that includes images of minors, a practice that raises serious ethical,…
BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference
Diffusion language models (dLLMs) generate text by iteratively denoising multiple token positions in parallel, offering an attractive alter…
Wait! There's a Way Out: A Decision Mechanism for Forecasting Conversational Derailment
Forecasting conversational derailment is the task of predicting, as the conversation unfolds, whether it will eventually derail into person…
OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources
Real-world information needs require access to structurally diverse knowledge sources, from unstructured text and relational tables to know…
Extreme dynamic symmetry enables omnidirectional and multifunctional robots
Symmetry is a central organizing principle in natural systems, yet its use as a unifying design strategy in robotics has largely remained l…
DynSess: Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents
Role-playing with large language models is fundamentally a session-level task, requiring agents to sustain character identity and interacti…
KLAS: Using Similarity to Stitch Neural Networks for Improved Accuracy-Efficiency Tradeoffs
Given the wide range of deployment targets, flexible model selection is essential for optimizing performance within a given compute budget.…
Compute Allocation in Evolutionary Search: From Depth-Breadth to Multi-Armed Bandits
LLM-guided evolutionary search (Evolve systems) has reached state-of-the-art results on mathematical and combinatorial tasks, yet most exis…
Causal Label Recovery in Payment Networks
Fraud detection models in payment networks train on chargeback labels that are systematically biased. Every label must survive three sequen…
Code-QA-Bench: Separating Code Reasoning from Documentation Memorization in Repository-Level QA
We present Code-QA-Bench, a fully automated framework for synthesizing repository-level code understanding benchmarks that separates genuin…
LoopFM: Learning frOm HistOrical RePresentations of Foundation Model for Recommendation
Knowledge distillation (KD) transfers a single scalar prediction from a large foundation model (FM) to compact vertical models (VMs), suffe…
Do Physics Foundation Models Learn Generalizable Physics? A Bias-Aware Benchmark Across Physical Regimes and Distribution Shifts
Recent physics foundation models claim general spatiotemporal forecasting ability, yet their evaluations often collapse performance into a…
Pocket-Dentist: On-Device Dental Image Understanding via Efficient Multimodal Large Language Models
Evaluations of dental vision-language models remain fragmented across datasets, task definitions and metrics, and often ignore their comput…
MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs
Recent Large Audio-Language Models (LALMs) have demonstrated promising abilities in understanding musical content. However, whether their r…
GrepSeek: Training Search Agents for Direct Corpus Interaction
Large Language Model (LLM) search agents have shown strong promise for knowledge-intensive language tasks through multiple rounds of reason…
Rethinking FID Through the Geometry of the Reference Dataset
Fréchet Inception Distance (FID) is widely used to evaluate image generators, yet lower FID does not always correspond to better sample q…
Does Distributed Training Undermine Compute Governance?
Compute governance proposals often rely on the assumption that frontier AI training requires large, detectable computing clusters. However,…
SURGENT: A Surgical Multi-Agent Assistance System Across the Perioperative Workflow
The intricate nature of modern surgical care necessitates intelligent systems that can synthesize extensive patient records, support collab…
TRACER: Persistent Regularization for Robust Multimodal Finetuning
Mainstream strategies for finetuning pretrained multimodal models often degrade out-of-distribution (OOD) robustness, a phenomenon known as…
Latent Terms: Dense Retrievers Contain Trivially Extractable BM25-ready Zipfian Vocabularies
We propose Latent Terms, a method revealing that models trained for dense retrieval, whether single- or multi-vector, learn representations…
On the Optimizer Dependence of Neural Scaling Laws
The scaling exponent $\alpha$ in neural scaling laws $L(N) \propto N^{-\alpha}$ is commonly treated as a fixed constant set by architecture…
GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models
Reinforcement learning (RL) can be used to improve the policy (denoiser) of diffusion large language models (dLLMs), while being hindered b…
Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge
Understanding long-form egocentric videos remains challenging for multimodal large language models (MLLMs) due to limited context length an…
The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction
Under standard graphical assumptions, the Markov boundary of a target variable is the smallest set of features that renders every other fea…
Beyond Bilingual Transfer: Multilingual Code-Switching in Instruction Tuning
Recent studies have shown that code-switching data (CSD), in which multiple languages are mixed within the same context, can improve cross-…
DELOS: Detecting Shallow Transits in Kepler Photometry Using a Contrastive-Learning Framework
We present DEtection in phase-folded Light curves with cOntrastive Scoring (DELOS), a contrastive-learning-based framework designed to sear…
AliMark: Enhancing Robustness of Sentence-Level Watermarking Against Text Paraphrasing
Existing sentence-level watermarking methods enhance robustness to paraphrasing by anchoring watermarks in sentence semantics. However, the…
SkillBrew: Multi-Objective Curation of Skill Banks for LLM Agents
Retrieval-augmented LLM agents increasingly rely on curated skill banks: collections of reusable textual principles that guide decision mak…
How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions
AI coding agents increasingly act directly within software environments, yet existing analyses of their failures rely on benchmark trajecto…
How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions
Neural scaling laws appraise data through dataset size, while the Vendi Score uses quantum entropy to measure dataset value. We show both t…
Forget Less, Generalize More: Unifying Temporal and Structural Adaptation for Dynamic Graphs
Representation learning on dynamic graphs requires capturing complex dependencies that evolve across both time and structure. Existing appr…
Adaptive Interviewing for Persona Simulation in LLMs: Evidence-Grounded Reasoning Improves Decision Alignment
Accurately simulating the decisions of a specific individual remains challenging for large language models (LLMs), partly because persona i…
Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset
The emergence of Large Vision-Language Models (LVLMs) has substantially expanded model capabilities beyond text-only understanding, enablin…
Honest Lying: Understanding Memory Confabulation in Reflexive Agents
Reflexion-style agents rely on self-generated reflections as memory, implicitly assuming that agents can accurately diagnose their own fail…
Composing Non-Conjugate Factor Graphs with Closed-Form Variational Inference
Stacking probabilistic building blocks into deeper architectures typically breaks closed-form inference. We show that closed-form inference…
SciIntBench: Measuring LLM Compliance with Research Integrity Norms Under Adversarial Framing
Large language models (LLMs) are increasingly used to support scientific work, but it is unclear whether they uphold responsible conduct of…
Inform, Coach, Relate, Listen: Auditing LLM Caregiving Support Roles
Language models are increasingly being deployed for conversational support in informal caregiving contexts, where interactions often extend…
MOOSE-Copilot: A Web-Based Interactive Assistant for Unified Exploratory and Fine-Grained Scientific Hypothesis Discovery
Large language models (LLMs) show remarkable potential in scientific hypothesis discovery. However, existing approaches face two critical l…
Evolutionary Rule Extraction from Corporate Default Prediction Models
Small and medium-sized enterprises (SMEs) represent the majority of firms in most economies and often face financial constraints and higher…
PhoneWorld: Scaling Phone-Use Agent Environments
A central bottleneck for phone-use agents is that controllable, reproducible environments covering real mobile behavior are hard to build a…
AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling
Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current…
The New Pro Se: Generative AI and the Surge in Federal Civil Self-Representation
Since public access to generative AI tools became widespread, federal civil litigation has seen a marked increase in pro se (self-represent…
Quotient DAGs for Off-Policy Evaluation: Forward-Flow Importance Sampling and Exact Slate Propensities
Off-policy evaluation estimates how a target policy would perform using data collected by a different behavior policy, which is crucial whe…
Source-Grounded Semantic Reinforcement Learning for Low-Resource Target-Language Generation
Low-resource target-language generation is often limited by scarce parallel data, while high-resource source-language monolingual data is a…
Network Optimization Aspects of Autonomous Vehicles: Challenges and Future Directions
Global megatrends, such as urbanization, population growth, and emerging network solutions are accelerating the development of the Connecte…
KBF: Knowledge Boundary as Fingerprint for Language Model and Black-Box API Auditing
Relay and reseller APIs increasingly intermediate access to large language models (LLMs), but users have no direct way to verify that a cla…
Temporal Motif-aware Graph Test-time Adaptation for OOD Blockchain Anomaly Detection
Ever-evolving transaction patterns have significantly hindered anomaly detection on emerging cryptocurrency blockchains due to the vast num…
GUITestScape: Towards Open-set Evaluation on Exploratory GUI Testing
Exploratory GUI testing is a particularly demanding setting for MLLM agents: without predefined test scripts, an agent must autonomously na…
GiPL: Generative augmented iterative Pseudo-Labeling for Cross-Domain Few-Shot Object Detection
Vision-language foundation models have shown promising zero-shot generalization for Cross-Domain Few-Shot Object Detection (CD-FSOD). Howev…
SCOPE: A Lightweight-training LLM Framework for Air Traffic Control Readback Monitoring
Pilot readback of Air Traffic Control (ATC) voice instructions is a primary safeguard against miscommunication in air transportation. Howev…
Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization
Deep learning optimization relies heavily on the assumption of smooth loss landscapes, a condition systematically violated by modern archit…
VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models
Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, yet they still struggle to genera…
Brain-IT-VQA: From Brain Signals to Answers
Decoding visual content from fMRI signals recorded while a person views images, and specifically answering questions about the seen images,…
Training Deliberative Monitors for Black-Box Scheming Detection
As autonomous agents become more capable of performing real-world tasks, distinguishing scheming behavior from benign task pursuit may beco…
Learning Context-Conditioned Predicate Semantics via Prototype Feedback
In scene graph generation, a central challenge is modeling polysemous predicates whose meanings shift across contexts. Prior approaches add…
DLM-SWAI: Steering Diffusion Language Models Before They Unmask
Steering language model generation toward desired textual properties is essential for practical deployment, and inference-time methods are…
COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings
Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swappi…
Entity-Collision: A Stratified Protocol for Attributing Retrieval Lift in Agent Memory
End-to-end agent-memory benchmarks report a single hit@k per retriever, confounding lexical leakage (uncontrolled query/gold/distractor ent…
Predicting Causal Effects from Natural Language Queries using Structured Representations
Randomized controlled trials are a cornerstone of medicine and the social sciences as they enable reliable estimates of causal effects. How…
The Sample Complexity of Multiclass and Sparse Contextual Bandits
We study contextual bandits in the stochastic i.i.d. setting, where a learner observes contexts drawn from an unknown distribution, select…
OccamToken: Efficient VLM Inference with Training-Free and Budget-Adaptive Token Pruning
Vision-language models (VLMs) rely on long visual token sequences for visual understanding, making the prefill stage expensive in both comp…
Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content
Real-time safety filtering for large language model (LLM) applications requires classifiers that can detect unsafe prompts, toxic language,…
EviLink: Multi-Path Schema Linking with Uncertainty-Guided Evidence Acquisition for Large-Scale Text-to-SQL
Schema linking is a difficult and important step in large-scale Text-to-SQL, where systems must identify a compact yet sufficient schema co…
From Prompts to Context: An Ontology-Driven Framework for Human-Generative AI Collaboration
Collaborations with Generative AI often begin with a short prompt and end with an opaque output, leaving implicit who was involved, what ta…
Personalized Turn-Level User Conversation Satisfaction Benchmark
User satisfaction with AI assistants is highly personalized: the same response may satisfy one user but disappoint another depending on wha…
Teaching Language Models to Check Grounded Claim Factuality with Human Test-Taking Strategies
Grounded claim factuality checking is important for large language model (LLM) applications such as retrieval-augmented generation, as it h…
The Little Book of Generative AI Foundations: An Intuitive Mathematical Primer
This book provides a compact, derivation-oriented introduction to the mathematical foundations of modern generative artificial intelligence…
Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions
Legal NLP benchmarks overwhelmingly evaluate a single language or aggregate tasks that differ fundamentally across jurisdictions, making cr…
A unified deep learning framework for contrast-phase-specific virtual monochromatic imaging
Dual-energy CT (DECT) enables virtual monochromatic imaging (VMI) and improved contrast resolution, but its clinical adoption is limited by…
Energy-Aware NECO for Single-Pass Pixel-wise Out-of-Distribution Detection in Semantic Segmentation
Reliable semantic segmentation for mobile robots requires both accurate dense prediction and robust uncertainty estimation under distributi…
Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning
Reinforcement learning (RL) refines large language models (LLMs) by directly optimizing model behavior through reward signals. While accura…
Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems
LLM-based multi-agent systems (MAS) have emerged as an effective paradigm for complex and long-horizon tasks. However, in real-world tasks,…
Data filtering methods for training language models
Data quality is a critical factor in the effectiveness of machine learning models. Label errors, present even in widely used benchmarks, in…
Inferring Code Correctness from Specification
Large language models (LLMs) have become integral to modern software development, enabling automated code generation at scale. However, val…
Towards Localized and Disentangled Knowledge Editing for Multimodal Large Language Models
Existing methods in Multimodal Knowledge Editing (MKE) have advanced the ability to correct outdated or inaccurate knowledge in Multimodal…
CB-SLICE: Concept-Based Interpretable Error Slice Discovery
Despite strong average-case performance, deep learning models often exhibit systematic errors on specific population groups, known as error…
HARP: Hadamard-Preconditioned Adaptive Rotation Processor for Extreme LLM Quantization
Post-training quantization (PTQ) is essential for deploying LLMs under memory and bandwidth constraints. However, extreme low-bit quantizat…
ESPO: Early-Stopping Proximal Policy Optimization
When a large language model under reinforcement learning commits a wrong reasoning step early in a trajectory, standard algorithms force it…
Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation
Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, w…
Mitigating Stethoscope-Induced Shortcuts in Respiratory Sound Classification under Federated Domain Generalization with Causality-Inspired Interventions
AI-driven respiratory sound classification (RSC) is promising for automated pulmonary disease detection, yet multi-site deployment is hinde…
Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension
Do next-generation LLM agents inherit the cooperative biases documented in their predecessors, or do scale and provider diversity reshape…
Mitigating Hallucination in Vision-Language Models through Barrier-Regulated Adaptive Closed-form Steering
Large vision-language models (LVLMs) often hallucinate objects that are not present in the input image, largely because visual grounding we…
CRITIC-R1: Learning Structured Critics for Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) improves knowledge-intensive question answering by incorporating external evidence. However, existing…
LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training
Reinforcement learning (RL) post-training has been shown to improve reasoning in large language models (LLMs). However, there has been little ex…
Internal Representation, Not Clinical Knowledge: Where Apparent LLM Triage Failures Originate
Patient-voiced clinical-triage benchmarks report high under-triage rates for consumer LLMs for constrained multiple-choice output, yet the…
Agora: Toward Autonomous Bug Detection in Production-Level Consensus Protocols with LLM Agents
Consensus protocols form the backbone of distributed systems and blockchains, where implementation bugs can cause data corruption and finan…
Selection Hyper-heuristics Can Automatically Adjust the Learning Period to Optimally Solve Pseudo-Boolean Problems
The Random Gradient hyper-heuristic was recently shown to be able to learn the optimal neighbourhood size when optimizing the LeadingOnes b…
Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents
Despite recent advances, LLM-based web agents still struggle with limited exploration, omission of critical steps, and sensitivity to task…
Label Over Logic? How Source Cues Bias Human Fallacy Judgments More Than LLMs
As AI-generated and AI-assisted content floods online spaces, source labels attached to such content can distort human reasoning judgments,…
CityGen: Structure-Guided City-Style Synthesis for Cross-City Autonomous Driving
Autonomous driving systems are commonly trained and evaluated within limited geographic regions, which hinders their scalability when deplo…
HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding
Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-qu…
Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction
Large language model (LLM) agents increasingly leverage long term memory to support persistent and autonomous task execution. However, this…
Honeyval: A Comprehensive Evaluation Framework for LLM-powered HTTP Honeypots
Honeypots are decoy systems mimicking real system components designed to defend against cyber attacks. Recently, LLMs increasingly serve as…
Evaluating Skill and Stability of ArchesWeather and ArchesWeatherGen under Multi-Decadal Climate Simulations
We evaluate the climate simulation capabilities of ArchesWeather and ArchesWeatherGen, two machine learning models originally trained for w…
Genetically Aligned Patient Representations Improve Hematological Diagnosis
Multimodal alignment of histopathology encoders with transcriptomic and genomic data has been shown to significantly improve performance in…
Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas
We study two-level autoresearch for cooperation: an outer-loop AI agent autonomously redesigns the inner-loop pipeline of an LLM policy-syn…
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, te…
Test Time Training for Supervised Causal Learning
Supervised Causal Learning (SCL) has shown promise in causal discovery by framing it as a supervised learning problem. However, it suffers…
Give it Space! Explicit Disentangling of Positional and Semantic Representations in Encoders
Positional encoding (PE) underpins how permutation-invariant Transformers represent sequence order, yet how positional information is proce…
Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation
Large Audio Language Models (LALMs) expand jailbreak risks from token-level prompting to the full speech perception-to-reasoning pipeline,…
Alignment-Guided Score Matching for Text-to-Image Alignment in Diffusion Models
Diffusion models generate highly realistic images but often struggle with precise text-image alignment. While recent post-training methods…
Token Inflation: How Dishonest Providers Can Overcharge for Large Language Model Usage
Per-token billing is now the standard pricing model for commercial large language models (LLMs), so the honesty of reported token counts di…
Masked Diffusion Modeling for Anomaly Detection
Anomaly detection aims to identify samples that deviate from the nominal data distribution and is central to many safety-critical applicati…
REPOT: Recoverable Program-of-Thought via Checkpoint Repair
One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates…
Projectional Decoding: Towards Semantic-Aware LLM Generation
Large language models (LLMs) are increasingly used to generate software artifacts across many software engineering (SE) tasks, yet ensuring…
A Predictive Law for On-Policy Self-Distillation From World Feedback
Moving beyond simple scalar rewards toward richer world feedback is a natural path to more scalable RL post-training. On-policy self-distil…
How Reliable Are AI Attackers Against a Fixed Vulnerable Target? A 400-Run Empirical Study of LLM Penetration Testing Consistency
Large language models (LLMs) can autonomously conduct multi-stage cyber attacks, but the consistency of their offensive behavior under repe…
When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems
The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud and offer…
xModel-KD: Cross-modal Knowledge Distillation for 3D Scene Perception using LiDAR
Point cloud segmentation is a fundamental task in 3D scene understanding. Its progress is constrained by the high cost and time required fo…
Evolving Features vs Evolving Entire Trees with GP for Interpretable Survival Analysis
Survival analysis concerns the task of predicting the time until an event occurs. Often used in the medical field, survival analysis deals…
No More K-means: Single-Stage Sparse Coding for Efficient Multi-Vector Retrieval
Multi-vector retrieval (MVR) models, exemplified by ColBERT, have established new benchmarks in retrieval accuracy by preserving fine-grain…
Beyond MSE: Improving Precipitation Nowcasting with Multi-Quantile Regression
Deep-learning precipitation nowcasting models are often optimized using pointwise losses such as mean squared error or mean absolute error,…
PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding
Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for infere…
DAMEL: Dual-Axis Multi-Expert Learning for Class-Imbalanced Learning
Various algorithms have been proposed to address the challenges posed by class-imbalanced learning from real-world data with long-tailed di…
Overcoming Forgetting in LLM Fine-Tuning with Evolution Strategies
Evolution Strategies (ES) has recently emerged as a competitive alternative to reinforcement learning (RL) for large language model (LLM) f…
Do Proactive Agents Really Need an LLM to Decide When to Wake and What to Anchor?
Proactive agents read user activity as text and call an LLM on every event to decide whether to act. But user activity is not natively text…
Neural Network Verification using Partial Multi-Neuron Relaxation
The increasing integration of deep neural networks in critical systems has spawned a theoretical and practical interest in formally guarant…
On Distributional Reinforcement Learning in Chaotic Dynamical Systems
Chaotic dynamical systems pose a fundamental challenge for Reinforcement Learning (RL): exponential sensitivity to initial conditions induc…
Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms
As autonomous language model agents proliferate, forming an emerging agentic web with real-world consequences, what credibility signals can…
iLoRA: Bayesian Low-Rank Adaptation with Latent Interaction Graphs for Microbiome Diagnosis
Parameter-efficient adaptation has made LLMs practical for domain prediction, but standard LoRA still relies on a static low-rank update an…
CalArena: A Large-Scale Post-Hoc Calibration Benchmark
Reliable probability estimates are critical in many machine learning applications, yet modern classifiers are often poorly calibrated. Post…
Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection
We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training data poisonin…
What drives performance in molecular MPNNs? An operator-level factorial benchmark
Message-passing neural networks (MPNNs) are widely used for molecular property prediction, but their deployment as monolithic architectures…
HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime
We investigate a narrow but common failure mode of GRPO-style reinforcement learning in the context of sparse verifiable rewards: early upd…
Automating Low-Risk Code Review at Meta: RADAR, Risk Calibration, and Review Efficiency
AI-assisted coding tools have altered software production. At Meta, significant lines of code per human-landed diff grew by 105.9% year ove…
BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models
Vision-Language-Action (VLA) models have emerged as a promising paradigm for grounding visual-language understanding into real-world roboti…
Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization
While Multi-Agent Systems (MAS) empower Large Language Models to tackle complex reasoning tasks through collaborative interaction, optimizi…
Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual q…
Do Language Models Track Entities Across State Changes?
Entity tracking (ET), the ability to keep track of states, is a fundamental skill that underlies complex reasoning. An increasing amount of…
Reinforcement Learning with Robust Rubric Rewards
While Reinforcement Learning with Verifiable Rewards (RLVR) is effective for deterministically checkable tasks, many vision-language tasks…
Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models
Large language models (LLMs) often solve a task when all instructions are given in a single prompt, but fail when the same information is r…
How LoRA Remembers? A Parametric Memory Law for LLM Finetuning
Large Language Models (LLMs) must continuously learn and update knowledge to remain effective in dynamic real-world environments. While Low…
PhyGenHOI: Physically-Aware 4D Generation of Dynamic Human-Object Interactions
We address the task of generating physically accurate and visually faithful 4D Human-Object Interaction (HOI). Given a static 3D human and…
LLUMI: Improving LLM Writing Assistance for Mental Health Support with Online Community Feedback
Large language models (LLMs) show promise in generating supportive responses for mental health queries, but improving their usefulness, emp…
Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection
Document-level translation remains one of the most challenging tasks for large language models, which are constrained by limited context wi…
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fra…
Self-Trained Verification for Training- and Test-Time Self-Improvement
Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, throu…
MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings
Large language models (LLMs) show promise for clinical reasoning and decision support, but evaluation in realistic, electronic health recor…
City-Mesh3R: Simulation-Ready City-Scale 3D Mesh Reconstruction from Multi-View Images
City-scale 3D surface reconstruction from multiview images for downstream 3D simulation poses highly challenging problems due to the scale…
Archon: A Unified Multimodal Model for Holistic Digital Human Generation
Digital humans are fundamental to immersive interaction, yet creating a unified model for holistic modalities, including text, audio, motio…
Before the Shutter: Aesthetic and Actionable Portrait Photography Planning in 3D Scenes
Portrait photography is largely decided before the shutter opens: the subject's pose, the camera configuration, and the lighting devices mu…
Improved Guarantees for Heterogeneous Treatment-Effect Estimation via Matrix Completion
A central goal of modern causal inference is estimating heterogeneous treatment effects to answer questions like "how does an intervention…
Gram: Assessing sabotage propensities via automated alignment auditing
We introduce Gram, an automated alignment auditing framework to assess the propensity of AI agents to engage in sabotage. We evaluate Gemin…
In-Context Reward Adaptation for Robust Preference Modeling
Reinforcement Learning from Human Feedback (RLHF) typically relies on static reward models to align Large Language Models with human prefer…
On Language Generation in the Limit with Bounded Memory
We study language generation in the limit under bounded memory. In this task, a learner observes examples from an unknown target language o…
RoboWits: Unexpected Challenges for Robotic Creative Problem Solving
The ability to reason, adapt, and creatively solve problems under unexpected challenges is essential for robots operating in real-world env…
Reasoning with Sampling: Cutting at Decision Points
Frontier reasoning models are produced by posttraining base language models with reinforcement learning. Recent work has challenged this by…
GPIC: A Giant Permissive Image Corpus for Visual Generation
Studying scalable methods for visual generative modeling requires large, accessible, and stable datasets. We introduce GPIC, a Giant Permis…
Unlocking the Working Memory of Large Language Models for Latent Reasoning
To improve the reasoning capabilities of large language models, test-time compute is typically scaled by generating intermediate tokens bef…
LLMSurgeon: Diagnosing Data Mixture of Large Language Models
The pretraining data mixture of Large Language Models (LLMs) constitutes their "digital DNA", shaping model behaviors, capabilities, and fa…
VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion
Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layo…
PersonaAgent: Bridging Memory and Action for Personalized LLM Agents
Large Language Model (LLM) empowered agents have recently emerged as advanced paradigms that exhibit impressive capabilities in a wide rang…
PuzzleClone: A DSL-Powered Framework for Synthesizing Verifiable Data
High-quality mathematical and logical datasets with verifiable answers are essential for strengthening the reasoning capabilities of large…
Estimating the Empowerment of Language Model Agents
As language model (LM) agents become increasingly capable and adopted in real-world applications, there is a growing need for scalable eval…
SafeSearch: Automated Red-Teaming of LLM-Based Search Agents
Search agents connect LLMs to the Internet, enabling them to access broader and more up-to-date information. However, this also introduces…
EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance
Large language models (LLMs) have recently advanced in reasoning when optimized with reinforcement learning (RL) under verifiable rewards.…
Controlling the Risk of Corrupted Contexts for Language Models via Early-Exiting
Large language models (LLMs) can be influenced by harmful or irrelevant context, which can significantly harm model performance on downstre…
TelecomTS: A Multi-Modal Observability Dataset for Time Series and Language Analysis
Modern enterprises generate vast streams of time series metrics when monitoring complex systems, known as observability data. Unlike conven…
CodeEvolve: an open source evolutionary coding agent for algorithmic discovery and optimization
We introduce CodeEvolve, an open-source framework that couples large language models with island-based evolutionary search for end-to-end a…
Large-Scale AI and Foundation Models for Neuroscience: A Comprehensive Review
The development of large-scale artificial intelligence (AI) models is influencing neuroscience research by enabling end-to-end learning fro…
Modeling Hierarchical Thinking in Large Reasoning Models
Large Reasoning Models (LRMs) solve complex tasks by generating long Chain-of-Thought (CoT) sequences; however, the emergent dynamics gover…
Graph-Enhanced Policy Optimization in LLM Agent Training
Multi-step LLM agents in interactive environments represent a crucial step toward long-horizon decision-making. To train such agents, group…
A Matter of Interest: Understanding Interestingness of Math Problems in Humans and Language Models
The evolution of mathematics is shaped importantly by interestingness: researchers choose which problems to pursue, and students choose whi…
InsightEval: An Expert-Curated Benchmark for Assessing Insight Discovery in LLM-Driven Data Agents
Data analysis has become an indispensable part of scientific research. To discover the latent knowledge and insights hidden within massive…
SCOPE: Prompt Evolution for Enhancing Agent Effectiveness
Large Language Model (LLM) agents are increasingly deployed in environments that generate massive, dynamic contexts. However, a critical bo…
LsrIF: Enhancing Logic-Structured Instruction Following of Large Language Models
Instruction following is critical for large language models, yet real-world instructions often involve multiple constraints with logical st…
TANDEM: Temporal-Aware Neural Detection for Multimodal Hate Speech
Social media platforms are increasingly dominated by long-form multimodal content, where harmful narratives are constructed through a compl…
From Meta-Thought to Execution: Cognitively Aligned Post-Training for Generalizable and Reliable LLM Reasoning
Current LLM post-training methods optimize complete reasoning trajectories through Supervised Fine-Tuning (SFT) followed by outcome-based R…
Reasoning and Tool-use Compete in Agentic RL: From Quantifying Interference to Disentangled Tuning
Agentic Reinforcement Learning (ARL) trains large language models to interleave reasoning with external tool execution to solve complex tas…
Skill-Pro: Learning Reusable Skills from Experience via Non-Parametric PPO for LLM Agents
LLM-driven agents excel at sequential decision-making but often rely on on-the-fly reasoning, re-deriving solutions even in recurring scena…
AutoSizer: Automatic Sizing of Analog and Mixed-Signal Circuits via Large Language Model (LLM) Agents
The design of Analog and Mixed-Signal (AMS) integrated circuits remains heavily reliant on expert knowledge, with transistor sizing a major…
Reasoning about Reasoning: BAPO Bounds on Chain-of-Thought Token Complexity in LLMs
Inference-time scaling via chain-of-thought (CoT) reasoning is a major driver of state-of-the-art LLM performance, but it comes with substa…
Small Agent Group is the Future of Digital Health
The rapid adoption of large language models (LLMs) in digital health has been driven by a "scaling-first" philosophy, i.e., the assumption…
Dynamics Within Latent Chain-of-Thought: An Empirical Study of Causal Structure
Latent or continuous chain-of-thought methods replace explicit textual rationales with a number of internal latent steps, but these interme…
Causal-JEPA: Learning World Models through Object-Level Latent Masking
World models require robust relational understanding to support prediction, reasoning, and control. While object-centric representations pr…
Benchmarking at the Edge of Comprehension
As frontier Large Language Models (LLMs) increasingly saturate new benchmarks shortly after they are published, benchmarking itself is at a…
Recurrent Structural Policy Gradient for Partially Observable Mean Field Games
Mean Field Games (MFGs) provide a principled framework for modelling interactions in large population systems. However, algorithmic progres…
AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information from individual…
Rel-MOSS: Towards Imbalanced Relational Deep Learning on Relational Databases
In recent advances, to enable a fully data-driven learning paradigm on relational databases (RDB), relational deep learning (RDL) is propos…
RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models
Reinforcement learning (RL) shows promise for enhancing LLM agentic reasoning, yet sparse terminal rewards hinder fine-grained optimization…
FormalEvolve: Neuro-Symbolic Evolutionary Search for Diverse Autoformalization
Autoformalization aims to produce formal statements that compile and faithfully preserve the intended meaning of informal mathematics. Yet…
When Models Learn to Ask Why: Adaptive Causal Reasoning for Trustworthy Medical Vision-Language Models
Vision-Language Models (VLMs) have enabled interpretable medical diagnosis by integrating visual perception with linguistic reasoning. Yet,…
MemCollab: Cross-Model Memory Collaboration via Contrastive Trajectory Distillation
LLM agents increasingly rely on memory mechanisms to reuse knowledge from past problem-solving experiences. However, existing methods typic…
SCoOP: Semantic Consistent Opinion Pooling for Uncertainty Quantification in Multiple Vision-Language Model Systems
Combining multiple Vision-Language Models (VLMs) can enhance multimodal reasoning and robustness, but aggregating heterogeneous models' out…
MediHive: A Decentralized Agent Collective for Medical Reasoning
Large language models (LLMs) have revolutionized medical reasoning tasks, yet single-agent systems often falter on complex, interdisciplina…
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
Multimodal Large Reasoning Models (MLRMs) have achieved remarkable strides in visual reasoning through test time compute scaling, yet long…
SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning
Current multimodal models often suffer from shallow reasoning, leading to errors caused by incomplete or inconsistent thought processes. To…
Thinking Fast, Thinking Wrong: Intuitiveness Modulates LLM Counterfactual Reasoning in Policy Evaluation
Large language models (LLMs) are increasingly used for causal and counterfactual reasoning, yet their reliability in real-world policy eval…
Guardrails Beat Guidance: A Large-Scale Study of Rules, Skills, and Persistent Configuration for Coding Agents
Random rules improve a coding agent's task performance as much as expert-curated ones (both +13.8 pp on a discriminative subset of SWE-ben…
MemoSight: Unifying Context Compression and Multi Token Prediction for Reasoning Acceleration
While chain-of-thought (CoT) reasoning enables LLMs to solve challenging reasoning tasks, the linear growth of the KV cache leads to substa…
Human-Guided Harm Recovery for Computer Use Agents
As LM agents gain the ability to execute actions on real computer systems, we need ways to not only prevent harmful actions at scale but al…
Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling
Large Language Models (LLMs) now exhibit remarkable reasoning capabilities through test-time compute scaling (TTS), with impressive perform…
SciHorizon-DataEVA: An Agentic System for AI-Readiness Evaluation of Heterogeneous Scientific Data
AI-for-Science (AI4Science) is increasingly transforming scientific discovery by embedding machine learning models into prediction, simulat…
A Foundation Model for Zero-Shot Logical Rule Induction
Inductive Logic Programming (ILP) learns interpretable logical rules from data. Existing methods are transductive: their learned parameters…
Hierarchical Task Network Planning with LLM-Generated Heuristics
HTN planning is a variation of classical planning where, instead of searching for a linear sequence of actions, an algorithm decomposes hig…
NOVA: Fundamental Limits of Knowledge Discovery Through AI
Can AI systems discover genuinely new knowledge through iterative self improvement, and if so, at what cost? We introduce the NOVA framewor…
AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence
Emotional intelligence (EI), the ability to perceive, understand, and respond appropriately to others' emotional states, is central to huma…
MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing
Document parsing converts visually rich documents into machine-readable structured representations, forming a crucial foundation for inform…
Learning A Simulation-based Visual Policy for Real-world Peg In Unseen Holes
This paper proposes a learning-based visual peg-in-hole that enables training with several shapes in simulation, and adapting to arbitrary…
MATNet: Multi-Level Fusion Transformer-Based Model for Day-Ahead PV Generation Forecasting
Accurate forecasting of renewable generation is crucial to facilitate the integration of Renewable Energy Sources into the power system. Fo…
A Survey on Recent Advances in Conversational Data Generation
Recent advancements in conversational systems have significantly enhanced human-machine interactions across various domains. However, train…
Crafting Desirable Climate Trajectories with RL Explored Socio-Environmental Simulations
Climate change poses an existential threat, necessitating effective climate policies to enact impactful change. Decisions in this domain ar…
Are LLMs Socially Adaptive? Contrasting Belief Evolution in Large Language Models and Humans
As large language models (LLMs) increasingly engage in complex social interactions, ensuring that their behaviors align with human ethical…
Jailbreaking and Mitigation of Vulnerabilities in Large Language Models
Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation, enabling…
Dataset-Driven Channel Masks in Transformers for Multivariate Time Series
Recent advancements in foundation models have been successfully extended to the time series (TS) domain, facilitated by the emergence of la…
Agent4Edu: Generating Learner Response Data by Generative Agents for Intelligent Education Systems
Personalized learning represents a promising educational strategy within intelligent educational systems, aiming to enhance learners' pract…
A Composable Multimodal Framework for cine CMR-Text-Driven Prediction of Heart Failure Outcomes
Objective. Heart failure is one of the leading causes of death worldwide, with millions of deaths each year, according to data from the Wor…
Weakly Supervised Detection and Temporal Localization of Whale Calls in Long-Duration Bioacoustic Data
Passive acoustic monitoring (PAM) systems generate continuous recordings spanning months, yet automated bioacoustic analysis of whale calls…
Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio
Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recogniz…
Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives
State-of-the-art large language models require specialized hardware and substantial energy to operate. As a consequence, cloud-based servic…
EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance
Recent approaches for video generation with camera control often create anchor videos (i.e., rendered videos that approximate desired camer…
VRAG: Learning World Models for Interactive Video Generation
Foundational world models must be both interactive and preserve spatiotemporal coherence for effective future planning with action choices.…
Online Fair Division with Additional Information
We study the problem of fairly allocating indivisible goods to agents in an online setting, where goods arrive sequentially and must be all…
Position: Text Embeddings Should Capture Implicit Semantics, Not Just Surface Meaning
This position paper argues that text embedding research should move beyond surface meaning and embrace implicit semantics as a central mode…
Model Fusion via Retrofitting
Model fusion seeks to combine independently trained neural networks into a single model without retraining, but is complicated by represent…
Taming Data Challenges in ML-based Security Tasks Using Generative AI
Machine learning-based supervised classifiers are widely used for security tasks, and their improvement has been largely focused on algorit…
MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models
Recent text-to-image models produce high-quality results but still struggle with precise visual control, balancing multimodal inputs, and r…
Finding DoRI: Discovery of Retained Images in Diffusion Models
Text-to-image diffusion models (DMs) have achieved remarkable success in image generation. However, concerns about data privacy and intelle…
Approximate Proportionality in Online Fair Division
We study the online fair division problem, where indivisible goods arrive sequentially and must be allocated immediately and irrevocably. P…
GroundAct: Can LLM Agents Ground Actions in Environmental States?
LLM agents achieve 85-96% success on tasks where instructions fully specify the action, but drop to 29-53% when action feasibility depends…
Scalable RF Simulation in Generative 4D Worlds
Radio Frequency (RF) sensing has emerged as a powerful, privacy-preserving alternative to vision-based methods for various perception tasks…
Less Is More: Elevating RAG via Performance-Driven Context Compression
Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for improving the timeliness of knowledge updates and the factual…
GRPO is Secretly a Process Reward Model
Process reward models (PRMs) allow for fine-grained credit assignment in reinforcement learning (RL), and seemingly contrast with outcome r…
Towards Foundation Models for Zero-Shot Time Series Anomaly Detection: Leveraging Synthetic Data and Relative Context Discrepancy
Time series anomaly detection (TSAD) is a critical task, but developing models that generalize to unseen data in a zero-shot manner remains…
Benchmarking LLM-Assisted Blue Teaming via Standardized Threat Hunting
As cyber threats continue to grow in scale and sophistication, blue team defenders increasingly require advanced tools to proactively detec…
AtomWorld: A Benchmark for Evaluating Spatial Reasoning in Large Language Models on Crystalline Materials
Large language models (LLMs) have shown promising potential in scientific research, enabling tasks ranging from knowledge retrieval to prop…
The Impact of Semantic Pairs on Self-Supervised Representation Learning
Instance discrimination learns visual representations by treating different augmented views of the same image as positive pairs. While this…
Obfuscation Rules for Detecting and Detoxifying Korean Toxicity
As language models become increasingly deployed in online environments, toxicity detection and detoxification have received growing attenti…
Offline Reinforcement Learning with Generative Trajectory Policies
Generative models have emerged as a powerful class of policies for offline reinforcement learning (RL) due to their ability to capture comp…
Beyond Accuracy: Are Time Series Foundation Models Well-Calibrated?
The recent development of foundation models for time series data has generated considerable interest in using such models across a variety…
Empathic Prompting: Non-Verbal Context Integration for Multimodal LLM Conversations
We present Empathic Prompting, a novel framework for multimodal human-AI interaction that enriches Large Language Model (LLM) conversations…
LoCoT2V-Bench: Benchmarking Long-Form and Complex Text-to-Video Generation
Recent advances in text-to-video generation have achieved impressive performance on short clips, yet evaluating long-form generation under…
ScheduleStream: Temporal Planning with Samplers for GPU-Accelerated Multi-Arm Task and Motion Planning & Scheduling
Bimanual and humanoid robots are appealing because of their human-like ability to leverage multiple arms to efficiently complete tasks. How…
An accuracy-aware extension to LRP-based pruning for CNNs to prevent cascading accuracy degradation in data-scarce transfer learning
Convolutional Neural Networks (CNNs) pre-trained on large-scale datasets such as ImageNet are widely used as feature extractors to construc…
Enhancing Reinforcement Learning in 3D Environments through Semantic Segmentation: A Case Study in ViZDoom
Reinforcement learning (RL) in 3D environments with high-dimensional sensory input poses two major challenges: (1) the high memory consumpt…
MiAD: Mirage Atom Diffusion for De Novo Crystal Generation
In recent years, diffusion-based models have demonstrated exceptional performance in searching for simultaneously stable, unique, and novel…
Evaluating Dataset Watermarking for Fine-tuning Traceability of Customized Diffusion Models: A Comprehensive Benchmark and Removal Approach
Recent fine-tuning techniques for diffusion models enable them to reproduce specific image sets, such as particular faces or artistic style…
BioArc: Discovering Optimal Neural Architectures for Biological Foundation Models
Foundation models have revolutionized various fields such as natural language processing (NLP) and computer vision (CV). While efforts have…
Topological Order in Neural Wavefunctions
Topologically ordered states are among the most interesting quantum phases of matter that host emergent quasi-particles having fractional c…
E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing
Agentic AI systems execute a sequence of actions, such as reasoning steps or tool calls, in response to a user prompt. To evaluate the succ…
The Best of the Two Worlds: Harmonizing Semantic and Hash IDs for Sequential Recommendation
Conventional Sequential Recommender Systems (SRS) typically assign unique hash IDs (HID) to construct item embeddings, which mainly capture…
A Review of Learning-Based Motion Planning: Toward a Data-Driven Optimal Control Approach
Motion planning for autonomous driving (AD) faces a critical trade-off. While traditional rule-based pipelines offer verifiable safety and…
Revisiting the Reliability of Language Models in Instruction-Following
Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do n…
HD-Prot: A Protein Language Model for Joint Sequence-Structure Modeling with Continuous Structure Tokens
Proteins inherently possess a consistent sequence-structure duality. The abundance of protein sequence data, which can be readily represent…
Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models
Qualitative data are widespread in domains such as healthcare, marketing, and bioinformatics, where clustering offers a fundamental tool fo…
Thinking Before Constraining: A Unified Decoding Framework for Large Language Models
Natural generation allows Large Language Models (LLMs) to produce free-form responses with rich reasoning, yet the lack of structure makes…
From Rubrics to Reliable Scores: Evidence-Grounded Text Evaluation with LLM Judges
Rubric-based text evaluation increasingly uses large language models (LLMs) as scalable judges, but aligning frozen black-box models with h…
Steering Language Models Before They Speak: Logit-Level Interventions
Controllable generation requires language models to realize output characteristics such as reading level, politeness, and toxicity. Existin…
CORE-T: COherent REtrieval of Tables for Text-to-SQL
Realistic text-to-SQL workflows often require joining multiple tables. As a result, accurately retrieving the relevant set of tables become…
Mechanism Shift During Post-training from Autoregressive to Masked Diffusion Language Models
Post-training pretrained autoregressive models (ARMs) into masked diffusion models (MDMs) has emerged as a cost-effective way to overcome t…
Grammar-Aware Literate Generative Mathematical Programming with Compiler-in-the-Loop
Mathematical programming is widely employed across various sectors - such as logistics, energy, and workforce planning - to model and solve…
Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers
Reasoning-oriented Large Language Models (LLMs) have achieved remarkable progress with Chain-of-Thought (CoT) prompting, yet they remain fu…
Pushing the Limits of Block Rotations in Post-Training Quantization
Recent post-training quantization (PTQ) methods have adopted block rotations to diffuse outliers prior to rounding. While this reduces the…
Learn from A Rationalist: Distilling Intermediate Interpretable Rationales
Because of the pervasive use of deep neural networks (DNNs), especially in high-stakes domains, the interpretability of DNNs has received i…
Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning
Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement l…
Scaling Small Agents Through Strategy Auctions
Small language models are increasingly viewed as a promising, cost-effective approach to agentic AI, with proponents claiming they are suff…
Less is Enough: Synthesizing Diverse Data in LLM Feature Space with Sparse Autoencoders
The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approac…
S-MARC: Causal Streaming Reasoning for Full-Duplex Conversational Behavior Modeling
Human conversation is organized by an implicit chain of thought and manifests as temporally structured conversational behaviors. Capturing…
A Language-Guided Bayesian Optimization for Efficient LoRA Hyperparameter Search
Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) offers a resource-efficient way to personalize or specialize. Howe…
OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model
Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual…
Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR
Reward-maximizing RL methods have shown to be capable of enhancing the reasoning performance of LLMs, but often lead to reduced generation…
Who can we trust? LLM-as-a-jury for Comparative Assessment
Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment often using pairwi…
Rooted Absorbed Prefix Trajectory Balance with Submodular Replay for GFlowNet Training
Generative Flow Networks (GFlowNets) enable fine-tuning large language models to approximate reward-proportional posteriors, but they remai…
AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching
REPresentation Alignment (REPA) improves the training of generative flow models by aligning intermediate hidden states with pretrained teac…
Relational In-Context Learning via Synthetic Pre-training with Structural Prior
Relational Databases (RDBs) are the backbone of modern business, yet they lack foundation models comparable to those in text or vision. A k…
MOO: A Multi-view Oriented Observations Dataset for Viewpoint Analysis in Cattle Re-Identification
Animal re-identification (ReID) faces critical challenges due to viewpoint variations, particularly in Aerial-Ground (AG-ReID) settings whe…
Post-Training Language Models for Crosslingual Consistency
Language models often respond inconsistently to translation-equivalent prompts across languages, undermining the reliability of multilingua…
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answe…
Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover
Adversarial attacks can reliably steer safety-aligned large language models toward unsafe behavior. Empirically, we find that adversarial p…
Steering at the Source: Style Modulation Heads for Robust Persona Control
Activation steering offers a computationally efficient mechanism for controlling Large Language Models (LLMs) without fine-tuning. While ef…
P$^2$RAG: Efficient Privacy-Preserving RAG Service Supporting Arbitrary Top-$k$ Retrieval
Retrieval-Augmented Generation (RAG) enables large language models to use external knowledge, but outsourcing the RAG service raises privac…
When Should a Robot Think? Resource-Aware Reasoning via Reinforcement Learning for Embodied Robotic Decision-Making
Embodied robotic systems increasingly rely on large language model (LLM)-based agents to support high-level reasoning, planning, and decisi…
Maximizing Mutual Information Between Prompt and Response Improves LLM Performance With No Additional Data
While post-training has successfully improved large language models (LLMs) across a variety of domains, these gains heavily rely on human-l…
AuthorMix: Modular Authorship Style Transfer via Layer-wise Adapter Mixing
The task of authorship style transfer involves rewriting text in the style of a target author while preserving the meaning of the original…
The Price Reversal Phenomenon: When Cheaper Reasoning Models Cost More
Developers and consumers increasingly choose reasoning models (RMs) based on their listed API prices. However, how accurately do these pric…
Bridge-RAG: An Abstract Bridge Tree Based Retrieval Augmented Generation Algorithm
As an important paradigm for enhancing the generation quality of Large Language Models (LLMs), retrieval-augmented generation (RAG) faces t…
Multi-Level Barriers to Generative AI Adoption Across Disciplines and Professional Roles in Higher Education
Generative Artificial Intelligence (GenAI) is rapidly reshaping higher education, yet barriers to its adoption across different disciplines…
EvA: An Evidence-First Audio Understanding Paradigm for LALMs
Large Audio Language Models (LALMs) still struggle in complex acoustic scenes because they often fail to preserve task-relevant acoustic ev…
SelfGrader: LLM Jailbreak Detection via Anchored Token-Level Logits
Large Language Models (LLMs) are powerful tools for answering user queries, yet they remain highly vulnerable to jailbreak attacks. Existin…
Combating Data Laundering in LLM Training
Data rights owners can detect unauthorized data use in large language model (LLM) training by querying with proprietary samples. Often, sup…
The Planetary Cost of AI Acceleration, Part II: The 10th Planetary Boundary and the 6.5-Year Countdown
The recent, super-exponential scaling of autonomous Large Language Model (LLM) agents signals a broader, fundamental paradigm shift from ma…
SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems
Skill-based agent systems tackle complex tasks by composing reusable skills, improving modularity and scalability while introducing a large…
SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding
Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic s…
ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation
Rotation-based Post-Training Quantization (PTQ) has emerged as a promising solution for mitigating activation outliers in the quantization…
Intent-aligned Autonomous Spacecraft Guidance via Reasoning Models
Future spacecraft operations require autonomy that can interpret high-level mission intent while preserving safety. However, existing traje…
BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps
Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic str…
DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories
We introduce DialToM, an annotated Theory of Mind (ToM) benchmark built from naturalistic human-human dialogues using a multiple-choice eva…
Causal Disentanglement-Inspired Degradation Representation Learning for Full-Reference Image Quality Assessment
Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by performing pairwise comparisons of de…
Architecture-Induced Recoverability Bias in Differentiable Symbolic Regression
Symbolic regression aims to recover closed-form expressions from numerical data, but in differentiable symbolic regression the recovered ex…
Explainable AI in Speaker Recognition -- Making Latent Representations Understandable
Neural networks can be trained to learn task-relevant representations from data. Understanding how these networks make decisions falls with…
When 2D Tasks Meet 1D Serialization: On Serialization Friction in Structured Tasks
In the LLM era, many symbolic and structured problems are presented to models through 1D text serialization. Yet some such problems are nat…
MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio
Medical audio data is difficult to collect due to privacy regulations and high annotation costs arising from domain expertise. Thus, existi…
Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning
On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks expose…
CalBench: Evaluating Coordination-Privacy Trade-offs in Multi-Agent LLMs
Personal AI assistants are beginning to act as delegates with access to calendars, inboxes, and user preferences. Calendar scheduling makes…
CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
In this paper, we propose Concentrate and Concentrate (CaC), a coarse-to-fine anomaly reward model based on Vision-Language Models. During…
Self-Supervised Laplace Approximation for Bayesian Uncertainty Quantification
Approximate Bayesian inference typically revolves around computing the posterior parameter distribution. In practice, however, the main obj…
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
Evaluation of software engineering (SWE) agents is dominated by a binary signal: whether the final patch pass…
Teacher-Guided Policy Optimization for On-Policy Reasoning Distillation under Large Policy Divergence
On-policy distillation (OPD) has become a promising paradigm for reasoning-oriented post-training of large language models (LLMs), especial…
Many-Shot CoT-ICL: Making In-Context Learning Truly Learn
While many-shot ICL achieves remarkable performance, prior studies of its scaling behavior have mainly focused on non-reasoning tasks. In t…
AttenA+: Rectifying Action Inequality in Robotic Foundation Models
Existing robotic foundation models, while powerful, are predicated on an implicit assumption of temporal homogeneity: treating all actions…
EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterpr…
ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows
While interpretable prototype networks offer compelling case-based reasoning for clinical diagnostics, their raw continuous outputs lack th…
Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning
Geometric problem solving, as a typical multimodal reasoning problem, has attracted much attention and made great progress recently, howeve…
JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation
We introduce JMed48k, a multi-profession Japanese healthcare licensing benchmark for evaluating vision-language models. Built from official…
The Distillation Game: Adaptive Attacks & Efficient Defenses
Distillation attacks create a deployment trade-off for model providers: the same outputs that make a model more useful can also make it eas…
Reducing Political Manipulation with Consistency Training
Large language models (LLMs) exhibit systematic political bias across a variety of sensitive contexts. We find that LLMs handle counterpart…
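An end to cloud dependence and cost concerns? Microsoft's local AI platform "Foundry Local"
Microsoft has made "Foundry Local" generally available. It is a local AI runtime platform that lets developers build AI features into their applications. By completing AI processing on the user's device, it reportedly enables AI implementations free of cloud dependence, network latency, and token-based billing.
So, how long will the memory shortage last? Behind the frenzy that refuses to end
The memory shortage drags on. When will it become possible to buy memory cheaply again?
"Mythos-class model" to reach general availability within weeks as US-based Anthropic releases "Opus 4.8"
The company reportedly expects to offer it to all customers within a few weeks, after putting stronger safeguards in place.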
Glean’s top line crosses $300M as AI budget-cutting becomes its major selling point
The enterprise AI search startup tripled its annual revenue even as tech giants entered the category.
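Streamlining wind load evaluation with AI-CAE: Obayashi begins verification with a RICOS solution
RICOS has announced that Obayashi has begun verifying an AI-CAE solution for predicting wind loads on buildings. The AI makes predictions based on conditions such as wind direction and building shape, with the aim of making the design process more efficient.
Fujitsu partners with OpenAI and Anthropic in quick succession: what is it aiming for by teaming up with AI vendors?
Fujitsu announced partnerships with OpenAI and Anthropic on the same day. What does the company, which has its own proprietary AI technology, hope to gain from partnering with AI vendors?
Eliminating the "I don't understand it" and "setup is a hassle" of data analysis: a reporter tries Sony's tool built for "beginners" [Report]
Many businesspeople want to analyze data to improve their work results but do not know where to start. Sony Network Communications focused on this need among beginners. When the author, a beginner at data analysis, tried it out…
"Japan is a manufacturing powerhouse": why IFS is rapidly expanding its industrial AI investment
IFS Japan held a press conference and explained its policy of continued investment in the Japanese market and stronger partnerships. Through strategic collaboration with IBM Japan and others, it will support AI implementation and DX in asset-intensive industries such as manufacturing.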
The internet is being rebuilt for machines
As AI agents move from experiments to production, AWS, Cloudflare, and others are redesigning cloud infrastructure for a future dominated b…
Asana acquires no-code agent-builder StackAI
Asana will incorporate StackAI into its growing suite of AI workflow tools.
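"Thoughts from playing around with Google Antigravity 2.0" and "Putting LLM Wiki into practice to build 'ロケスマペディア'"
From Kawasaki, under the title "Thoughts from playing around with Google Antigravity 2.0", on the value of textbook-style content in the generative AI era and how to learn in an age when AI writes the code; from Isshiki, under the title "Putting LLM Wiki into practice to build 'ロケスマペディア'"…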
Anthropic raises $65 billion, nears $1T valuation ahead of IPO
Anthropic has closed a $65 billion Series H round at a $965 billion post-money valuation, marking what could be the AI startup's final priv…
Just like gold and oil, we’ll soon be able to trade AI token futures
Large exchanges are designing derivative products around AI tokens, which are increasingly being considered less a computational output and…
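Toward mass production of a "domestically made humanoid robot": University of Tokyo spinout startup, with Mitsubishi Motors also investing
Highlanders, a robot development startup spun out of the University of Tokyo, has announced that it will begin an effort aimed at mass-producing domestically made humanoid robots.
Digital Agency reopens call for domestic LLMs for its AI "Gennai," shifting to paid government procurement; evaluation test grows from 50 to 300 questions
The first call solicited entries on the premise of free trial use, but the call for fiscal 2027 will move to paid government procurement.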
In just 3 weeks, StrictlyVC is coming to Los Angeles
StrictlyVC Los Angeles is on June 18. Join for meaningful networking and fireside chats with leaders from Mach Industries, Shinkei Systems,…
Anthropic releases Opus 4.8 with new ‘dynamic workflow’ tool
The new Opus model comes with a tool called Dynamic Workflows, for coordinating swarms of subagents.
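Anthropic makes Claude Opus 4.8 generally available: honesty improves dramatically, with alignment performance on par with Mythos
Anthropic has made "Claude Opus 4.8," the latest version of its AI model, generally available. It improves reasoning and coding ability over the previous generation, and its "honesty" about uncertainty in its own work has improved dramatically. In addition, a new feature that runs hundreds of subagents in parallel, "dynamic workfl…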
How long is Anthropic’s lease with SpaceX? Opinions vary
Elon Musk is publicly reframing xAI’s massive Anthropic compute deal as short-term and cancellable, despite SpaceX’s own S-1 filing describ…
Sesame, the conversational AI startup from Oculus founders, launches its iOS app
Sesame’s new iOS app brings its conversational AI agents to the public, offering more natural back-and-forth interactions designed to feel…
2026-05-28 (552 items)
Sneak peek at new Siri app reveals Apple’s plans to take on ChatGPT and more
New renders offer a closer look at Apple’s planned AI overhaul for iOS 27, including a redesigned Siri experience and standalone Siri app.
RSI is the new AGI — and it’s just as hard to pin down
A new crop of AI labs are focused on recursive self-improvement — but the goal is proving elusive.
At TechCrunch Disrupt 2026: Databricks’ co-founder on what kills enterprise AI deals
Enterprise AI is entering a different phase now, one where enterprises are no longer evaluating whether AI is exciting. They are evaluating…
YouTube adds new podcast features, including an AI recommendation tool and ‘Auto speed’
The update signals YouTube's ongoing efforts to compete with other platforms for podcast audiences.
2 days left: Lock in ticket savings of up to $410 to TechCrunch Disrupt 2026
Savings of up to $410 on TechCrunch Disrupt 2026 tickets end tomorrow, May 29, 11:59 p.m. PT. Register now to save and join 10,000+ tech le…
Visa invests in Replit to power agentic payments for developers
Visa said that over 1,000 employees have been using Replit for prototyping and development.
Has the hunt for AI compute uncovered the next Cerebras?
General Compute is betting SambaNova will be the next breakout chipmaker.
How Endava builds an agentic organization with Codex
Learn how Endava uses Codex to build an agentic organization, accelerating software delivery and reducing requirements analysis from weeks…
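Lenovo opens a "water-cooled AI infrastructure" verification facility in Japan, promoting water cooling as GPU server demand grows
Lenovo Japan has newly established "Neptune Lab," a verification facility for AI infrastructure that uses water-cooling technology. It is offered to customers and partner companies that use Lenovo's cooling technology as a near-production verification and PoC environment. Through joint verification with cloud vendors and systems integrators, it will also help formulate recommended equipment configurations and the like. Lenovo…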
Vertu wants CEOs to run companies from an AI foldable starting at $6,880
Built on top of the open source Hermes project, Vertu's new foldable combines AI-agent workflows, enterprise integrations, and ultra-premiu…
Identifying and Understanding Human Values in Text: A Tailorable LLM-based Architecture
As intelligent systems become more autonomous, the scientific community focuses on creating decision-making mechanisms that include ethical…
Soro: A Lightweight Foundation Model and Chatbot for Tajik
We present Soro, a family of Tajik-specialized conversational large language models (LLMs) designed for real-world deployment under tight c…
On the Origin of Synthetic Information by Means of Steganographic Inheritance
The origin of species has been the mystery of mysteries in natural science. By analogy, the origin of synthetic information, we suggest, is…
DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and Observability Paradox in LLM-based Scheduling Agents
Progress in neural combinatorial optimization for Dynamic Flexible Job Shop Scheduling Problem (DFJSP) is currently hindered by a methodolo…
Why LLMs Fail at Causal Discovery and How Interventional Agents Escape
Causal discovery is a cornerstone of scientific reasoning, yet whether large language models can perform it reliably remains an open questi…
RULER: Representation-Level Verification of Machine Unlearning
Machine unlearning aims to remove the influence of specific training records from a deployed model without retraining from scratch. Current…
LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation
Parallel LLM test-time scaling techniques (e.g., best-of-$N$) require drawing $N>1$ sequences conditioned on the same input prompt. These m…
Discovery Agents for Real-Time Analytics: Toward Proactive Insight Systems
Modern analytics systems are fundamentally reactive, requiring users to define queries over increasingly complex and continuously evolving…
Agyn: An Open-Source Platform for AI Agents with Scalable On-Demand Execution, Agent Definition as a Code, and Zero-Trust Access
As organizations move toward production deployments of AI agents, which execute non-deterministic workflows, maintain stateful sessions, an…
You Are in Control of Your State: Why Human Outcomes Are Controllable Through Causal State Intervention
A central puzzle for the behavioural sciences and for human-facing artificial intelligence is the persistence of within-person variability.…
Cyberbullying Governance on Social Media: A Unified Framework from Content Identification to Intervention
The proliferation of social media platforms and online communities has inadvertently catalyzed the spread of cyberbullying, hate speech, an…
Voluntary Collusion with Secret Tools in Competing LLM Agents
Even when a tool is explicitly described as unfair and harmful to others, ostensibly safety-aligned LLM agents still voluntarily engage in…
Laguna M.1/XS.2 Technical Report
We present Laguna M.1 and Laguna XS.2, two Mixture-of-Experts foundation models built for long-horizon, agentic coding: M.1 has $225.8$B to…
Reasoning and Planning with Dynamically Changing Norms
To safely interact with humans, AI agents must both know our norms and consider them during planning. However, such norm-guided planning ha…
Intelligence as Managed Autonomy: Failure, Escalation, and Governance for Agentic AI Systems
As autonomous and agentic AI systems scale in robotic and human-machine environments, managing hallucination and persistent but unjustified…
Behavioural Analysis of Alignment Faking
Alignment faking (AF) refers to a model strategically complying with a training objective to avoid behavioural modification while preservin…
Cross-Entropy Games and Frost Training
We present Frost Training, a method for improving Monte Carlo-based policy optimization for a large family of LLM-as-a-judge tasks called C…
Hierarchical Prompt-Domain Control and Learning for Resource-Constrained Agentic Language Models
Large Language Models are increasingly deployed inside agentic systems, where they must follow structured protocols, adapt to evolving stat…
DeepSciVerify: Verifying Scientific Claim–Citation Alignment via LLM-Driven Evidence Escalation
Misalignment between claims and their cited evidence is a common failure mode in reports generated by large language models, limiting their…
Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability: Separating Calibration from Ranking
Long reasoning traces need reliability estimates before final answers are known. We study prefix-conditioned eventual-success estimation, $…
A Policy-Driven Runtime Layer for Agentic LLM Serving
Multi-agent LLM systems have become the dominant production workload, but the serving stack was not built for them. The agent framework abo…
Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration
LLM confidence calibration is often evaluated by comparing two signals: token-probability scores and verbalized confidence. These signals a…
SkillGrad: Optimizing Agent Skills Like Gradient Descent
Agent skills provide a lightweight way to adapt LLM agents to specialized domains by storing reusable procedural knowledge in structured fi…
PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft
We present PEAM, a Parametric Embodied Agent Memory framework in Minecraft that transforms agent memory from inference-time retrieval into…
Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems
LLM safety evaluations predominantly test models in isolation, yet deployed AI agents increasingly operate within persistent social environ…
Auditable Decision Models with Learned Abstention and Real-Time Steering
Production AI systems often operate with incomplete, conflicting, or insufficient evidence. Forced classifiers collapse such cases into act…
Diagnosing Live Within-Policy Instruction Conflicts in LLM Agents with Witnessed Resolution Profiles
LLM agents are governed by long-lived natural-language prompt policies, but individually reasonable standing rules can interact in uninspec…
A Query Engine for the Agents
The fastest-growing data in production today is unstructured text: agent traces, chat logs, reasoning chains, model outputs. People want to…
A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test
Retrieval-augmented generation (RAG) systems are often compared by asking a large language model (LLM) judge which answer is better. For mu…
GraD-IBD: Graph Representation Learning from Diagnosis Trajectories for Early Detection of Inflammatory Bowel Disease
International Classification of Diseases (ICD) is a globally recognized coding system that records diagnostic events during each patient en…
Constrained Auto-Bidding via Generative Response Modeling
Auto-bidding systems aim to maximize advertiser value over long horizons under budget constraints and ratio targets such as cost-per-acquis…
EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents
As AI agents increasingly operate in open, real-world environments, they require a deep synergy of multimodal perception, tool invocation w…
Revealing Algorithmic Deductive Circuits for Logical Reasoning
Recent studies have shown that Large Language Models (LLMs) can achieve strong reasoning performance by incorporating functional symbolic r…
Operational AI Deployment Assurance: Governance-State Orchestration Under Threshold-Sensitive Deployment Conditions -- A Governance Framework for High-Stakes AI Systems
AI governance frameworks increasingly emphasize fairness, transparency, accountability, and lifecycle risk management in high-stakes domain…
EAPO: Entropy-Driven Adaptive Positive-Negative Sample Weighting for Policy Optimization in Open-Ended QA
Large Reasoning Models are typically trained via reinforcement learning from verifiable rewards (RLVR). However, existing approaches adopt…
TCP-MCP: Landscape-Guided Co-Evolution of Prompts and Communication Topologies for Multi-Agent Systems
Effective multi-agent systems cannot be designed by selecting prompts or communication graphs in isolation. Agent behavior depends on the i…
When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models
Safety benchmark scores provide incomplete evidence of deployment readiness: aligned language models often adhere to rigid rules even when…
MolLingo: Molecule-Native Representations for LLM-Powered Scientific Agents
We present MolLingo, a multi-agent system that emulates the reasoning process of a chemist to automate molecular design. Existing LLM-based…
C-MIG: Multi-view Information Gain-based Retrieval-Augmented Generation for Clinical Diagnosis Reasoning
Retrieval-augmented generation combined with reinforcement learning has shown promise for grounding large language models in trustworthy me…
FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research
Large language models (LLMs) are increasingly applied in finance, yet most existing work emphasizes trading signals or financial NLP tasks…
AIBuildAI-2: A Knowledge-Enhanced Agent for Automatically Building AI Models
AI models underpin data-centric applications from image and text processing to scientific discovery in biology, physics, and chemistry. Yet…
Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness
Explainable AI (XAI) helps users interpret model behavior and identify potential faults. Agentic XAI systems use Large Language Models (LLM…
PortBench: A Correlation-Aware, Full-Pipeline Benchmark for LLM-Driven Portfolio Management
LLMs have shown strong performance across diverse financial tasks, yet portfolio management (PM), a critical financial decision-making task…
A Unified Framework for the Evaluation of LLM Agentic Capabilities
As LLMs are increasingly deployed as agents, reliable assessment of their agentic capabilities has become essential. However, reported benc…
SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment
Structured skill prompts improve exploration in long-horizon agentic reinforcement learning (RL). Skill-augmented RL methods retain externa…
Dr-CiK: A Testbed for Foresight-Driven Agents
Time series forecasting in real-world settings often depends not only on historical observations, but also on external context that must be…
Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization
Multimodal Large Reasoning Models introduce the reasoning paradigm, demonstrating strong capabilities on complex vision-language tasks. How…
SuiChat-CN: Benchmarking Contextual Suicide Risk Assessment in Chinese Group Chats
Suicide is a critical global public health challenge, causing approximately 720,000 deaths each year and calling for timely, effective prev…
Show, Don't TELL: Explainable AI-Generated Text Detection
Research on AI-generated text detection has presented a number of approaches to discern human from AI prose, some of which achieve high i…
Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows
LLM agents are increasingly deployed as executable systems that use tools, modify workspaces, and produce concrete artifacts. In such workf…
DiagramRAG: A Lightweight Framework to Retrieve Scientific Diagram for Figure Generation
Scientific diagrams are essential for communicating complex methodologies in academic papers. A natural way for researchers to specify such…
Do Agents Think Deeper? A Mechanistic Investigation of Layer-Wise Dynamics in Sequential Planning
Recent mechanistic studies suggest that large language models (LLMs) may utilize their depth inefficiently in standard single-turn tasks. W…
From Talking to Singing: A New Challenge for Audio-Visual Deepfake Detection
With rapid advances in audio-visual generative models, reliable forgery detection becomes increasingly critical. Existing methods for audio…
The Shape of Overthinking: Backtracking Bursts in Long Reasoning Traces
Reasoning models often generate long traces in which useful self-correction and unproductive revision are hard to distinguish. We study thi…
Geometry of Human Perceptual Domains Emerges Transiently in LLM Representations
While large language models (LLMs) are trained purely on textual data, prior work has shown that their internal representations can exhibit…
STAB: Specification-driven Testing for Algorithmic Bottlenecks
Evaluating the efficiency of algorithmic code requires test cases that expose runtime bottlenecks. Previous methods generate efficiency tes…
AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios
Large language model (LLM)-based agents have shown strong capabilities in using external tools to solve complex tasks. However, existing ev…
Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure
Single-axis mitigations of reward-model biases (e.g., reducing proxy reliance on length, sycophancy, or style) can rotate optimization pres…
An Empirical Audit of k-NAF Budget Accounting for Anchored Decoding
We empirically audit the k-NAF budget-accounting mechanism in Anchored Decoding using (i) a fixed, class-stratified workload (approximately…
Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training
Large language models (LLMs) can now solve complex problems through long chain-of-thought (CoT) reasoning, but the trade-off between perfor…
Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback
Self-evolving large language models (LLMs) learn by generating their own training tasks and solutions, reducing reliance on human-curated s…
MIRA: A Bilingual Benchmark for Medical Information Response Audit
Large language models (LLMs) are increasingly used to provide public-facing health information, yet existing safety evaluations overlook wh…
PetroBench: A Benchmark for Large Language Models in Petroleum Engineering
Large Language Models are increasingly applied in the petroleum industry, highlighting the need for a domain-specific evaluation framework.…
Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings
Clark Hash is a small method for storing neural embeddings in less space. It normalizes each database vector, applies a deterministic spars…
MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation
In recent years, Multi-Talker Audio-Video Generation (MTAVG) models have shown promising performance on fundamental metrics such as lip-syn…
Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG
Cited RAG evaluation often treats visible sources as a grounding signal, but a real, topically relevant citation can still under-warrant th…
MemCog: From Memory-as-Tool to Memory-as-Cognition in Conversational Agents
Existing agent memory systems universally follow what we term a Memory-as-Tool paradigm where a single query triggers one-shot retrieval of…
Verifiable Benchmarking of Long-Horizon Spatial Biology
AI agents are increasingly useful for biological data analysis, but existing benchmarks mostly test broad biological knowledge, executable…
BlazeEdit: Generalist Image Editing on Mobile Devices with Image-to-Image Diffusion Models
The remarkable generation quality of modern diffusion models often comes at the cost of massive parameter counts, which necessitate server-…
ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay
Adaptive context compression is vital for scaling Large Language Models (LLMs) to complex, multi-turn agent tasks. However, rule-based comp…
Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information
We highlight a failure mode of large reasoning models on questions with insufficient information: models may recognize that a problem is un…
MACReD: A Multi-Agent Collaborative Reasoning Framework for Reaction Diagram Parsing
Parsing chemical reaction diagrams from scientific literature is challenging due to heterogeneous layouts, intertwined visual elements, and…
BuddyBench: A Privacy-Constrained Multi-Task Benchmark for Pediatric Social-Communication Personalization
BuddyBench introduces a privacy-constrained multi-task benchmark for pediatric social-communication personalization. Unlike existing neurod…
Examining Agents' Bias Amplification versus Suppression in Multi-Agent Systems
Multi-agent systems are increasingly deployed to support various tasks where agents interact to achieve individual and collective objective…
Training Stratigraphy: Persistent Behavioral Artifacts in Large Language Models Observed Through Longitudinal AI-Human Interaction
Large language models trained with Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI exhibit persistent behavioral pa…
Defending LLM-based Multi-Agent Systems Against Cooperative Attacks with Sentence-Level Rectification
Recent years have witnessed the rapid development of Large Language Model-based Multi-Agent Systems (MAS), which excel at collaborative dec…
Human-like in-group bias in instruction-tuned language model agents
As autonomous AI agents are deployed in persistent, interacting networks -- coordinating tasks, routing resources, and accumulating reputat…
CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models
Vision-Language Models (VLMs) face severe memory and latency bottlenecks due to high-resolution visual tokens. While current token reductio…
Gradient Step Plug-and-Play Model for Dental Cone-Beam CT Reconstruction
The goal of this work is to reduce the effect of photon noise in dental cone-beam CT reconstruction. We consider an inverse problem formula…
Do Clinical Models Change Treatment Decisions?
Clinical foundation models are evaluated with factual or exam-style medical QA, but treatment decisions must change when patient context ch…
Data-Efficient On-Policy Distillation for Automatic Speech Recognition
Building competitive automatic speech recognition (ASR) models usually requires large-scale audio supervision, which makes reproduction a…
Deconstructing Spatial Complexity: Hierarchical Decomposition for LLM Spatial Reasoning
LLMs have shown remarkable proficiency in general language understanding and reasoning. However, they consistently underperform in spatial…
Adaptive Reservoir Computing for Multi-Scenario Chaotic System Forecasting
We present an adaptive reservoir computing framework for the CTF-4-Science Lorenz benchmark, which evaluates machine learning models across…
OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents
Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR-oriented benchmar…
Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning
Existing multimodal reasoning approaches predominantly follow two paradigms: converting visual inputs into text prior to reasoning, or perf…
OccuReward: LLM-Guided Occupant-Centric Reward Shaping for Demographic Equity in Grid-Interactive Buildings
Large language models (LLMs) have demonstrated promising capability in generating reward functions for deep reinforcement learning (DRL)-ba…
Localizing Input Uncertainty Quantification for Large Language Models via Shapley Values
As large language models (LLMs) are increasingly integrated into high-stakes decision-making, the ability to reliably quantify uncertainty…
Agentic Active Omni-Modal Perception for Multi-Hop Audio-Visual Reasoning
Multi-hop audio-visual reasoning remains challenging for Omni-LLMs, as relevant evidence is often sparse, temporally dispersed, and distrib…
Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents
Large Language Model (LLM) agents remain vulnerable to safety threats from the external environment, where attackers inject adversarial con…
The Illusion of Opting in AI-Mediated Consequential Decisions
Drawing on Ullmann-Margalit's concept of opting (transformative, irrevocable, and shadowed by foreclosed alternatives), we show that curren…
Learning When to Optimize: Verified Optimization Skills from Expert GPU-Kernel Lineages
LLM-based agents are increasingly used to generate GPU kernels, but they often know what optimizations to try without knowing when those op…
Explaining is Harder Than Predicting Alone: Evaluating Concept-based Explanations of MLLMs as ICL Visual Classifiers
In-context learning (ICL) enables multimodal large language models (MLLMs) to classify images from a few labelled examples. Yet, how these…
When Does Memory Help Multi-Trajectory Inference for Tool-Use LLM Agents?
Multi-trajectory inference for tool-use LLM agents - generating multiple reasoning attempts and selecting among them - benefits from transf…
PIRS: Physics-Informed Reward Shaping for SAC-Based Building Energy Management
Occupant comfort and grid-aware energy efficiency are competing objectives whose joint optimization depends critically on how reward functi…
AI, Take the Wheel: What Drives Delegation and Trust in Human-Computer Cooperative Question Answering?
AI systems are fallible, and humans can make mistakes in deciding whether to trust AI over their own judgment. Thus, improving human-AI col…
Entropy Distribution as a Fingerprint for Hallucinations in Generative Models
Large Language Models (LLMs) often generate factually incorrect outputs, commonly termed hallucinations, that undermine trust and limit dep…
Global Policy-Space Response Oracles for Two-Player Zero-Sum Games
The Policy-Space Response Oracles (PSRO) framework scales equilibrium computation to large zero-sum games by iteratively expanding a restri…
Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning
Whether large language models (LLMs) construct internal spatial world models from pure-text descriptions remains contested, and whether suc…
ResearchLoop: An Evidence-Gated Control Plane for AI-Assisted Research
AI-assisted research compresses ideation, implementation, evaluation, and manuscript writing into a single interactive loop. This compressi…
Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR
Reinforcement Learning with Verifiable Rewards (RLVR) trains reasoning models without labeled trajectories, relying on grouped rollouts to…
REED: Post-Training Representation Editing for Cross-Domain Linguistic Steganalysis
In real-world scenarios of linguistic steganalysis, tested texts usually come from unseen domains with different vocabularies, topics, writ…
Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation
Chain-of-thought (CoT) distillation trains a smaller model to imitate a teacher's reasoning trace, but it is typically evaluated by final-a…
From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation
While Knowledge Editing (KE) enables efficient updates, its dominant Static Fact Overwriting paradigm treats LLMs as discrete databases, fo…
An Enhanced Large Neighborhood Search Approach for the Capacitated Facility Location Problem with Incompatible Customers
A new variant of the classic capacitated facility location problem, which considers incompatibilities between customers, has recently been…
SafeMed-R1: Clinician-Audited Safety and Ethics Alignment for Medical Large Language Models
Large language models (LLMs) increasingly match expert performance on licensing examinations, yet routine clinical use remains limited becau…
Picid: A Modular Evaluation Infrastructure for Reproducible PHM Across Tasks and Domains
Progress in Prognostics and Health Management (PHM) is hindered by the lack of standardized and reusable evaluation practices across tasks,…
FedMPT: Federated Multi-label Prompt Tuning of Vision-Language Models
Multi-Label Recognition (MLR) based on Vision-Language Models (VLMs) aims to leverage their pre-trained knowledge to better adapt complex r…
Plan Before Search: Search Agents Need Plan
Training large language models as retrieval-augmented reasoning agents typically combines reinforcement learning with an SFT cold start dis…
From Knowing to Doing: A Memory-Controlled Benchmark for LLM Trading Agents on Stock Markets
Evaluating whether large language model (LLM) agents can profit in capital markets is increasingly framed as end-to-end trading: place an a…
Prompt Codebooks: Discrete Compositional Optimization for Language Model Instruction Refinement
Automatic prompt optimization (APO) has driven significant gains in LLM-based agentic workflows. However, existing methods treat each task'…
Risk-Controlled Lean-as-Judge for Natural-Language Mathematical Reasoning
Lean is increasingly used to judge natural-language mathematical answers, but its signal is partial: many answers never formalize, and a fa…
CyberJurors: A Multi-Agent Simulation Task for E-Commerce Disputes Verdict
E-commerce platforms have begun recruiting crowdsourced jurors to adjudicate massive volumes of transaction disputes. Unlike formal legal j…
From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence
Industrial Prognostics and Health Management (PHM) provides a representative case study for a broader challenge in applied machine learning…
Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
Reinforcement Learning with Verifiable Reward (RLVR) is empirically shown to notably enhance the reasoning performance of large language mo…
You Live More Than Once: Towards Hierarchical Skill Meta-Evolving
Test-time skill evolving is regarded as a new paradigm for enhancing deployed agentic systems. Existing works mainly focus on hard-coded sk…
HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs
Hybrid-reasoning large language models (LLMs) expose explicit controls over reasoning effort, allowing users or systems to trade off answer…
Measuring Progress Toward AGI: A Cognitive Framework
Despite widespread discussion of AGI, there is no clear framework for measuring progress toward it. This ambiguity fuels subjective claims,…
Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning
Post-training using online reinforcement learning (RL) is an important training step for LLMs, including code-generating models. However, o…
DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes
Reinforcement learning has become a central paradigm for advancing reasoning in large language models, yet most existing methods still depe…
GONDOR to the Rescue: Satisficing Planning with Low Memory
Greedy Best-First Search (GBFS) is the dominant approach for solving search problems where the goal can be estimated with a heuristic, such…
Diffusion Large Language Models for Visual Speech Recognition
Existing Visual Speech Recognition (VSR) systems commonly rely on left-to-right autoregressive decoding, which can force premature decision…
From Learning Resources to Competencies: LLM-Based Tagging with Evidence and Graph Constraints
Linking learning resources to a structured competency framework is key to enabling competency-based search and curriculum analytics in Lear…
ProvMind: Provenance-grounded reasoning for materials synthesis
Materials process optimization requires reasoning over routes, conditions, tools and causal dependencies, yet most computational formulatio…
Benchmarking AI for low-resource contexts: Thinking beyond leaderboards
Existing AI evaluation practices often fail to capture how systems actually perform in low-resource environments, where operational constra…
GS-FUSE: Granger-Supervised Gated Fusion and Multi-Granularity Alignment for Event-Driven Financial Forecasting
Accurately forecasting the impact of salient financial events on markets is critical for investors and policymakers. However, existing mult…
Let Relations Speak: An End-to-End LLM-GNN Soft Prompt Framework for Fraud Detection
In recent years, Large Language Models (LLMs) have shown great capability in processing graph tasks such as fraud detection. However, most…
Entropy-aware Masking for Masked Language Modeling
Masked language modeling has become a standard pretraining objective for training encoder-based language models. In this approach, certain…
Do Agents Know What They Can't Do? Evaluating Feasibility Awareness in Tool-Using Agents
Tool-using agents often incur substantial computational cost due to long reasoning chains and iterative tool usage. In practical scenarios,…
Cultural Binding Heads in Language Models
LLMs often default to equal treatment across cultural groups, even though context warrants differentiation: this is a lack of difference aw…
Modeling Vehicle-Type-Specific Pedestrian Crash Avoidance Behavior in Safety-Critical Interactions Using Smooth-Mamba Deep Reinforcement Learning
As automated vehicles (AVs) increasingly share roadways with human-driven vehicles (HDVs), understanding how pedestrians respond to differe…
Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations
In this paper, we investigate whether refusal behavior can be predicted from LLM intermediate activations before decoding using linear prob…
A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks
As agent capabilities advance, existing benchmarks, such as $\tau^2$-Bench, are becoming increasingly saturated. Yet constructing new bench…
Tree of Thoughts as a Classical Heuristic Search Problem: Formal Foundations and Design Patterns
Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, yet their standard generation process -- auto-regressive…
A Conflict-Aware Penalty and Statistical Loss Framework for Balancing Modalities and Enhancing Stability in Multimodal Sentiment Analysis
Multimodal Sentiment Analysis (MSA) fuses text, acoustic, and visual streams to infer sentiment. Because pre-trained text encoders are far…
Continual Model Routing in Evolving Model Hubs
AI model hubs provide access to a rapidly growing collection of powerful pre-trained models, enabling off-the-shelf mixture-of-experts syst…
MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation
Large language models (LLMs) have recently advanced text-driven 3D generation, yet Text-to-CAD remains far from supporting industrial produ…
Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability
Large language models (LLMs) are increasingly used for tasks that implicitly reduce to Boolean satisfiability (SAT), yet their reasoning ab…
Adaptive Multimodal Agents-Based Framework for Automatic Workflow Execution
Modern information systems require autonomous agents capable of navigating complex workflows, yet current methodologies often struggle with…
LACUNA: Safe Agents as Recursive Program Holes
LLM agents increasingly act by writing code, yet a split persists between the runtime that drives the agent and the code the model writes.…
Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation
Multimodal large language models (MLLMs) have demonstrated significant potential for speech-to-text translation (S2TT). However, existing d…
The Ethics of LLM Sandbox and Persona Dynamics
It is well known that LLM guardrails and trained persona dynamics can produce a reality gap: the distance between the world an LLM is permit…
AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation
Scientific research proceeds through iterative cycles of hypothesis generation, experiment design, execution, and revision. AI agents can a…
An LLM-Based Assistance System for Intuitive and Flexible Capability-Based Planning
In modern industry, dynamic environments and the complexity of modular and reconfigurable resources require automated planning of process s…
DREAM-R: Multimodal Speculative Reasoning with RL-Based Refined Drafting, Precise Verification, and Fully Parallel Execution
Speculative reasoning has recently been proposed as a means to accelerate reasoning-intensive generation in large multimodal models, but it…
VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora
Existing benchmarks have laid the foundation for travel planning agents by establishing API-centric paradigms. However, as the capabilities…
TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning
Large language models increasingly rely on either reinforcement learning or multi-agent prompting to improve reasoning, yet these two parad…
The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic
The GSM-Symbolic benchmark (Mirzadeh et al., 2025) reported consistent performance drops across 25 Large Language Models (LLMs) when tested…
Beyond Binary Moral Judgment: Modeling Ethical Pluralism in AI
Critical decision-making in socially consequential spaces is increasingly involving AI systems at varying capacities. Yet, despite the ubiq…
Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor
Context compression aims to shorten long context inputs with minimal information loss for LLM inference acceleration. While existing method…
OpenURMA: A Clean-Room Open Implementation of the Unified Bus Protocol
Modern datacenter RDMA is bottlenecked at the network interface, not the wire. A NIC running RoCE or InfiniBand holds per-connection state…
LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?
Are LLM-based search agents genuinely searching, or using the web to verify what they already know? We study this question on BrowseComp wi…
Multi-Adapter Representation Interventions via Energy Calibration
Representation intervention has emerged as a promising paradigm for aligning large language models toward desired behaviors without modifyi…
AlphaTransit: Learning to Design City-scale Transit Routes
Designing a transit network requires many sequential route extension decisions, but their quality is often visible only after the full netw…
Utility-Aware Multimodal Contrastive Learning for Product Image Generation
Product images strongly influence consumer decision-making in online marketplaces. Empowered by multimodal contrastive learning, generative…
CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning
Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-pa…
CubePart: An Open-Vocabulary Part-Controllable 3D Generator
Interactive 3D assets used in games and simulation are typically decomposed into specific semantic parts to support animation, physics, and…
SwarmHarness: Skill-Based Task Routing via Decentralized Incentive-Aligned AI Agent Networks
Vast quantities of compute (GPU cycles on personal workstations, idle inference servers, and edge devices between jobs) go unused because n…
CaMBRAIN: Real-time, Continuous EEG Inference with Causal State Space Models
Electroencephalography (EEG) is a critical, non-invasive method to monitor electrical brain activity. EEGs can span anywhere from a couple…
Calibrating Conservatism for Scalable Oversight
Agentic AI systems capable of autonomous planning and extended environmental interaction pose a fundamental control problem: how can humans…
LNN-PINN: A Unified Physics-Only Training Framework with Liquid Residual Blocks
Physics-informed neural networks (PINNs) have attracted considerable attention for their ability to integrate partial differential equation…
Unlocking Fine-Grained and Within-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models
While prompt-based text-to-speech (TTS) models enable natural language-driven speaking style control, they often provide limited fine-grain…
RAG-Coding: Enhancing LLM Medical Coding with Structured External Knowledge
We present RAG-Coding, an agentic method for automated ICD-10-CM coding. RAG-Coding orchestrates four large language model (LLM) agents and…
BioELX: Cross-lingual Biomedical Entity Linking via Alias-based Retrieval and LLM Ranking
Cross-lingual biomedical entity linking (BEL) maps mentions in any language to unique identifiers in a biomedical knowledge base (KB), supp…
The Computational Boundary of Inference: Capability Internalization, Training, and the Turing Jump
Claims about recursive self-improvement in AI often slide from repeated internal revision to the possibility of qualitatively stronger capa…
The Alignment Floor: When Persona Customization Is Safe
A key promise of pluralistic AI is behavioral adaptation: persona prompts like "be creative" or "be thorough" let systems respect diverse u…
Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models
Spoken Language Models (SLMs) have emerged as a promising paradigm for speech synthesis by bypassing explicit grapheme-to-phoneme pipelines…
From Instructor to Collaborator: What a 90-Participant Study Reveals about Human-Agent Collaboration in a Mobile Serious Game
This position paper reflects empirical data collected during my PhD from a large-scale within-subjects study (N = 90). The study compared a…
Personalized Observation Normalization for Federated Reinforcement Learning in Simulation Environments with Heterogeneity
Federated reinforcement learning (FedRL) enables multiple agents to collaboratively train a global policy without sharing raw data, making…
From AR to Diffusion: Efficiently Adapting Large Language Models with Strictly Causal and Elastic Horizons
Diffusion models promise efficient parallel text generation but rely on bidirectional attention, creating a structural mismatch with pre-tr…
Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities
Large language models (LLMs) are increasingly utilized as proxies for computational social analysis; yet, their ability to faithfully repre…
Memory-Based vs. Context-Only Conditioning Produces Distinct Behavioral Patterns in Stateful Personalization
We study how conditioning context shapes personalization behavior in a teacher-facing educational recommender system. We compare contextual…
EvoSpec: Evolving Speculative Decoding via Real-Time Vocabulary and Parameter Adaptation
Speculative decoding accelerates Large Language Model inference via a draft-then-verify paradigm, yet the output projection layer becomes a…
Learning after COVID-19 and the ICT career aspirations: Are students entering the AI era with weaker skills?
This paper examines whether students are entering the generative AI era with sufficiently strong educational foundations, focusing on the r…
StoryMI: Steerable Multi-Agent Therapeutic Dialogue Generation
Large language models (LLMs) can generate fluent dialogue, but prior works lack situational grounding, dynamic strategy control, and evalua…
Human-AI Collaboration for Estimating Scientific Replicability
Determining whether published scientific findings can successfully be replicated is a long-standing challenge in the empirical sciences. Ex…
Informing AI Policy Assessment using Large-Scale Simulation of Interventions
As the rapid proliferation of AI systems and harms spurs efforts in AI governance around the world, prioritizing among competing policy opt…
Agentic Literacy Debt: A Structural Problem the AI Literacy Field Has Not Yet Named
Autonomous AI agents now plan, decide, and act on behalf of users across healthcare, financial services, and workplace contexts, often with…
Short-Term Gain, Long-Term Fragility: AI Labor Substitution and the Erosion of Sustainable Capability
What looks like acceleration can be a quiet transfer of burden from the present to the future. Attempts to replace human labor with AI syst…
Mathematical Modelling of Ethical AI Use in Higher Education: A Coordination Game Framework for Future-Facing Learning
The rapid uptake of generative artificial intelligence (AI) in higher education is reshaping assessment practices and intensifying concerns…
Using Zero-Shot LLM-Generated Survey Data for Geographically Explicit Population Synthesis
There is a growing interest in utilizing synthetic populations for a diverse range of applications. At the same time, we are witnessing a t…
REC-CBM: Rubric-Aware Error-Correction Concept Bottleneck Models for Trustworthy Open-Ended Grading
Open-ended grading is central to equitable and personalized education, yet manual grading remains time-consuming and costly, underscoring t…
LLM-assisted sentiment analysis for integrated computational and qualitative mixed methods education research: A case study of students' written reflection assignments
Written reflection assignments give students valuable opportunities for critical self-assessment, meaning making, and learning processing.…
Smaller, Younger, and More Impactful: How AI-Assisted Writing Transforms Research Teams
The era of Big Science has long been defined by increasingly large and specialized research teams pushing the frontiers of knowledge. Howev…
Benchmarking Fairness in Spiking Neural Networks: Data Bias, Spurious Features, and Hardware Effects
Evaluating fairness in Spiking Neural Networks (SNNs) demands rigorous benchmarks that reflect real-world complexities, yet existing assess…
STARS: Spike Tail-Aware Relational Synthesis for ANN-to-SNN Data-Free Knowledge Distillation
SNNs promise energy-efficient and low-latency inference, but their performance still trails that of ANNs. ANN-to-SNN knowledge distillation…
Advancing Direct Training for Spiking Neural Networks with Circulate-Firing Neurons and Learnable Gradients
Spiking Neural Networks (SNNs) have emerged with promising energy-efficient property, yet a substantial performance gap persists compared t…
Ligand-Conditioned Discrete Diffusion for Protein Sequence-Structure Co-Design
Proteins perform their biological functions through three-dimensional structures encoded by amino acid sequences, and ligand-binding protei…
Can Quantum Federated Learning Withstand Circuit-Level Backdoors?
Quantum Federated Learning (QFL) inherits the core vulnerability of federated optimization to malicious clients, while also introducing an…
Quantum Machine Learning-based 6G edge Network: Enabling Adaptive Communication and Model Aggregation
With the advent of sixth-generation (6G) mobile communication technology, vehicle-to-everything (V2X) communication faces unprecedented cha…
Ocean4Rec: Offline LLM-Derived OCEAN Profiles for Request-Time VOD Reranking
Industrial video-on-demand (VOD) recommenders need richer content understanding, but LLM-as-reranker designs repeat prompt construction, to…
Tackling Multimodal Learning Challenges with Mixture-of-Expert: A Survey
Mixture-of-Experts (MoE) presents a naturally compatible and scalable framework for multimodal learning, demonstrating strong adaptability…
FD-RAG: Federated Dual-System Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) has emerged as a paradigm for grounding large language models in external knowledge, yet most existing…
Heterogeneous Multi-Agent Modeling for Measurement and Network Analysis of the Data Service Market
With the increasing complexity of collaboration among various social entities and user demands, the factors affecting the stable developmen…
When NPUs Are Not Always Faster: A Stage-Level Analysis of Mobile LLM Inference
Deploying large language models (LLMs) on mobile devices increasingly relies on heterogeneous execution, yet no prior study has systematica…
RE-TRIANGLE: Does TRIANGLE Enable Multimodal Alignment Beyond Cosine Similarity in Retrieval?
Multimodal alignment is critical for bridging the semantic gap in information retrieval. However, traditional pairwise strategies introduce…
MGRetrieval: Memory-Guided Reflective Retrieval for Long-Term Dialogue Agents
Large Language Models (LLMs) have made significant progress in dialogue, yet redundant memory contexts severely limit their effectiveness i…
Prominence-Stratified Failure Modes in Retrieval-Augmented Commercial Recommendation: A 37,000-Run Audit
AI assistants like ChatGPT and Claude are recommendation engines, not search engines: they answer commercial queries by directly nominating…
Paraphrase Brittleness in Production Retrieval-Augmented Commercial Recommendation: Reproducibility Below the Rerun-Stability Baseline
Small changes to how a buyer phrases a question -- "best CRM" vs "top CRM" vs "best CRM for a SaaS startup" -- produce substantially differ…
A Systematic Evaluation of Retrieval-Augmented Generation and Language Models for Space Operations
The rapid expansion of space activities has led to an unprecedented accumulation of technical documentation, operational guidelines, and sc…
RAGe: A Retrieval-Augmented Generation Evaluation Framework
Deploying Large Language Model (LLM) applications, particularly those relying on Retrieval-Augmented Generation (RAG), remains challenging…
Checking Fact with Better Retrieval: Dynamic Contrastive Learning for Evidence Retrieval
In the field of multimodal fact checking, the accuracy of retrieving evidence from different modalities has a significant impact on the dow…
Generic Interpretation Approach for Transformer Models Incorporating Heterogenous Attention Structures
Transformer has significantly propelled the development of artificial intelligence, and certainly the development of agents as well. We cat…
When prompt perturbations break your A/B test: A valid statistical test for generative surveying
Generative surveying -- where collections of LLM-based personas provide feedback on messages -- has emerged as a cheap and scalable alterna…
Beyond Motion Primitives: Behavioral Activity Recognition from Head-Mounted IMU
AR smart glasses need continuous behavioral context to offer proactive assistance, yet their most practical always-on sensor, the head-moun…
AdaMerge: Salience-Aware Adaptive Token Merging for Training-Free Acceleration of Vision Transformers
The quadratic cost of self-attention in Vision Transformers (ViTs) constitutes a fundamental bottleneck for practical deployment, motivatin…
AgensFlow: A Coordination-Policy Substrate for Multi-Agent Systems
Multi-agent systems built on large language models (LLMs) require many coordination choices that are difficult to fix a priori: which skill…
Comparative Analysis of Liquid Neural Networks and LSTM for Sequential Pattern Recognition: Robustness, Efficiency, and Clinical Utility
Traditional Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) units operate on discrete time steps, often failing to captu…
Architecture-driven Shift: towards a lightweight selector for capturing the trends of logit shift
Continual Learning (CL) is a practical paradigm to utilize power of deep pre-trained neural networks, but which pre-trained model has a bet…
Detect by Yourself: Self-Designing Agentic Workflows for Few-Shot Graph Anomaly Detection
Graph anomaly detection aims to identify anomaly nodes in attributed graphs and plays an important role in real-world applications. However…
AssertLLM2: A Comprehensive LLM Benchmark for Assertion Generation from Design Specifications
Assertion-based verification (ABV) is a cornerstone of modern hardware design, yet manually translating design intent into formal SystemVer…
HEAL: Resilient and Self-* Hub-based Learning
Decentralized learning enhances privacy, scalability, and fault tolerance by distributing data and computation across nodes. A popular appr…
Balancing Fidelity and Diversity in Diffusion Models via Symmetric Attention Decomposition: Hopfield Perspective
We characterize the pre-softmax attention matrix $\mathbf{QK^\top}$ in transformers as an associative memory matrix encoding pairwise assoc…
Resource-Constrained Affect Modelling via Variance Regularisation Pruning
Affective computing systems are increasingly embedded in pervasive and interactive environments, such as adaptive games, assistive technolo…
BIRDS: Characterizing and Understanding Biodiversity Impact of Large Language Model Serving
Large language model (LLM) serving creates environmental impacts beyond carbon and water, including ecosystem damage through biodiversity-r…
Energy-Structured Low-Rank Adaptation for Continual Learning
While orthogonal subspace methods try to mitigate task interference in Continual Learning (CL), they often suffer from energy diffusion acr…
Debate Helps Weak Judges Reward Stronger Models
Despite theoretical promise, debate as a scalable oversight protocol has produced mixed empirical results: gains in some settings, and null…
Diffusion-Based Ukrainian Handwritten Text Generation with Cross-Domain Style Transfer
Handwritten text generation (HTG) conditioned on writer style has been widely studied for Latin scripts, but remains underexplored for low-…
Grimlock: Guarding High-Agency Systems with eBPF and Attested Channels
Agentic systems increasingly run user-authored orchestration code that invokes tools, spawns subtasks, and delegates work across machines a…
HARP: Measuring Harm Amplification in Multi-Agent LLM Systems
Multi-agent LLM systems decompose workflows across agents, tools, shared context, memory, and decision gates. This modularity improves inte…
Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems
LLM agents are rapidly evolving from coding assistants into autonomous software engineering systems. However, existing evaluation methodolo…
Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?
Modern retrieval-augmented generation (RAG) deployments increasingly rely on caching to reduce token cost and time-to-first-token (TTFT). Pre…
Detection Without Correction: A Two-Parameter Decomposition of Multi-Stage LLM Pipelines
Multi-stage LLM pipelines that perform multi-agent debate, intrinsic self-correction, or retrieval-augmented verification exhibit puzzling…
Clinical Validation of the Melanoscope AI Mobile Dermoscopy Clinical Decision Support System
Introduction. Early detection of malignant skin lesions is critical for prognosis, yet dermatologist shortages in Russian regions limit scr…
On the Subgaussianity of Quantized Linear Maps: An AI-Assisted Note
This short note presents a dimension-independent subgaussian concentration bound for Gaussian vectors under coordinate-wise nonlinear mappi…
The Future of Facts: Tracing the Factual Generation-Verification Gap
Language models are becoming the default interface to factual knowledge, yet they often verify outputs more reliably than they generate the…
Hallucination Behavior in Multimodal LLMs Across Agricultural Image Interpretation and Generation Tasks
Large Language Models (LLMs) are being rapidly adopted in agricultural imaging applications, ranging from crop interpretation to synthetic…
The Energy Blind Spot: NVIDIA's Flagship Edge AI Hardware Cannot Support Process-Level Energy Attribution
Agentic AI workloads - where a single user goal triggers multi-step orchestration, tool calls, retries, and failure recovery - are being ta…
Eliot: Interactively Exploring Fast-Changing Scientific Literature Trends with Online Data and Learning
The rapid growth of scientific publishing has made it increasingly difficult to track how fast-moving areas evolve. Search engines and LLM-…
Not All NVFP4 QAT Recipes Are Equal: How Architecture and Scale Shape Model Quality for Anomaly Segmentation
Real-time anomaly segmentation demands both high recall and efficient low-precision inference. We study the three-way interaction of model…
Supervised Distributional Reduction via Optimal Transport and Dependence Maximization
Learning representations that capture both intrinsic data geometry and target-relevant structure remains a fundamental challenge, particula…
Trinity: Unifying Class-Agnostic Terrain and Semantic Segmentation for Unstructured Outdoor Environments by Leveraging Synthetic Data
Terrain understanding is fundamental for mobile robots operating in unstructured outdoor environments. Existing vision-based traversability…
Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression
We propose Hurwitz Quaternion Multiplicative Quantization (HQMQ), a calibration-free method for KV cache compression of l…
Cultural Fidelity in English-to-Hindi Translation: A Preservation-Fluency Frontier for Gender Recoverability
Generative translation systems are cultural technologies because they decide how socially meaningful cues are rendered within culturally sp…
Developing an Intelligent Job Recommendation System Using Semantic Retrieval and Explainable AI Techniques
Online recruitment platforms require recommendation methods capable of retrieving relevant job opportunities from large and heterogeneous c…
Transferable Reinforcement Learning via Probabilistic Latent Embeddings and Dynamic Policy Adaptation for Sim-to-Real Deployment
Due to limited resources and public safety concerns, deep reinforcement learning (RL) agents for many cyber-physical systems (e.g., autonom…
How the Optimizer Shapes Learned Solutions in Equivariant Neural Networks
Equivariant neural networks encode geometric symmetries by construction, yet they are often difficult to optimize and can underperform less…
Aligning LLMs with Human Uncertainty: A Beta-Bernoulli Calibrator for LLM Forecasting
Probabilistic forecasting estimates the likelihood of uncertain future events. To improve LLM forecasting, existing methods typically learn…
Backdoor Attacks on Fault Detection and Localization in Cyber-Physical Systems
Cyber-Physical Systems (CPS) integrate sensing, communication, computation, and control to support critical infrastructure, including smart…
Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers
Transformers process images and videos by flattening space and time into long token sequences. While attention and KV caching preserve past…
Simulation-Informed Diffusion for Decentralized Multi-robot Motion Planning
Decentralized multi-robot motion planning requires each robot to generate collision-free trajectories from local observations, without glob…
CiteCheck: Retrieval-Grounded Detection of LLM Citation Hallucinations in Scientific Text
Large language models (LLMs) are increasingly used to generate scientific reports, but they can produce references that appear plausible wh…
UserHarness: Harnessing User Minds for Stronger Agent Theory-of-Mind
Understanding what a user believes and intends is central to building effective agent assistants. This ability is often evaluated through T…
HumanoidMimicGen: Data Generation for Loco-Manipulation via Whole-Body Planning
Imitation learning is a promising approach for training humanoid robots to both walk and manipulate, but it requires a large number of demo…
Worker Disagreement Reveals Sharp Directions in Local SGD
Deep neural network training often exhibits highly anisotropic loss geometry, where a few sharp dominant Hessian directions coexist with a…
Mahalanobis PatchCore: Covariance-Aware and Streaming-Compatible Industrial Anomaly Detection
Industrial visual anomaly detection is usually one-class: normal images are abundant, while defects are rare, heterogeneous, and often unav…
Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions
Recent work has shown that Vision-Language Models (VLMs) used for optical character recognition (OCR) can generate plausible but visually u…
High-Fidelity Industrial Crash Dynamics Prediction via Geometry-Aware Operator Learning with Memory-Efficient Low-Rank Attention
Automotive crashworthiness optimization remains a safety-critical challenge, requiring the management of large-scale nonlinear structural d…
Can Segmentation Models Understand the World? Towards Proactive Affordance Reasoning via Visual Chain-of-Thought
Recent segmentation models couple large language models (LLMs) with mask decoders to ground complex language expressions into masks, yet th…
Restoring the Sweet Spot: Pass-Rate Weighted Self-Distillation for LLM Reasoning
Self-Distillation Policy Optimization (SDPO) provides dense token-level credit assignment for reinforcement learning with large language mo…
UniMaia: Steering Chess Policies with Language for Human-like Play
Recent advances in large language models have enabled natural language to serve as a flexible interface for controlling complex systems, bu…
Do Models Know Why They Changed Their Mind? Interpretability and Faithfulness of Chain-of-Thought Under Knowledge Conflict
When a language model sees a document contradicting its training knowledge, it must choose: follow the document or trust itself. Prior work…
Locality-Aware Redundancy Pruning for LLM Depth Compression
Large language models are known to contain representational redundancy across network depth, making depth pruning an effective approach for…
ChildEval: When large language models meet children's personalities
While LLMs enable personalized chatbots, their effectiveness in child-centered personalization remains unclear, as systematic evaluation of…
Residualized Temporal Sparse Autoencoders for Interpreting Diffusion Models
Text-to-image diffusion models generate images through an iterative denoising process, so internal neural layers produce trajectories of ac…
Turning Video Models into Generalist Robot Policies
Video generative models have emerged as a promising robotics backbone, capable of generating videos that depict the completion of complex t…
ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions
Sparse autoencoders are usually trained one layer at a time, even though transformer residual stream activations are strongly coupled acros…
Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security
Large Language Models (LLMs) are increasingly vulnerable to adversarial prompts that exploit semantic ambiguities to bypass safety mechanis…
Symmetry Defeats Auditing
We demonstrate an attack on Introspection Adapters (Shenoy et al., 2026).
LoSATok: Low-dimensional Semantic-Acoustic Tokenizer for Cross-Domain Audio Understanding and Generation
Audio tokenizers are fundamental to unifying audio understanding and generation. Understanding requires high-level semantics, while generat…
Snippet-Driven Supply Chain Discovery with LLMs: Scaling Visibility in China
Financial and economic research often relies on structured supply-chain disclosures and commercial databases. In China, supplier--customer…
FPMoE: A Sparse Mixture-of-Experts Approach to Functional Code Generation
Despite rapid progress in LLM-based code generation, existing models are predominantly trained on imperative languages, leaving functional…
Fine-Tuned LLM as a Complementary Predictor Improving Ads System
Recommendation systems power engagement and monetization across feeds, ads, and short-video platforms, but translating the latest advances…
DecomposeRL: Learning to Ask Useful, Informative, and Diverse Questions for Semi-Supervised, Traceable Claim Verification
Claim verification splits between end-to-end classifiers that are accurate but yield no inspectable traces, and decomposition-based method…
From Detection to Mechanism: Cross-Attention Graph Neural Networks Enable Drug-Drug Interaction Type Prediction: An Ablation Study with Acetylsalicylic Acid Validation
Predicting whether two drugs interact (binary detection) is a substantially different task from predicting the mechanism type of that int…
SPAR: Support-Preserving Action Rectification
Offline policy improvement faces an inherent conflict between maximizing value and fitting the data distribution. While in-sample weighted…
VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild
LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-…
SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control
The narrative quality of a video fundamentally determines its perceptual value. Although existing video generation methods can produce visu…
The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages
Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language mod…
ESC-Skills: Discovering and Self-Evolving Skills for Emotional Support Conversations
Existing emotional support conversation (ESC) systems mainly rely on end-to-end response generation or coarse strategy supervision, offerin…
Let the Results Speak: A Replication-First Paradigm for LLM Behavioral Benchmarking
Subjective evaluation of LLM behavior -- empathy, restraint, calibrated emotional tone -- is hard. Human inter-rater agreement on such qual…
Do We Really Need Quantum Machine Learning?: A Multidimensional Empirical Study
The rapid growth of computer vision and increasingly complex image recognition tasks has exposed fundamental computational limitations of c…
When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?
Think-with-image reasoning is emerging as a new inference paradigm for large vision-language models, but its safety implications remain poo…
Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations
Linear probes trained on LLM activations are increasingly proposed as deception-detection metrics, yet report AUROC exceeding 0.96 on clean…
ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning
Multimodal Large Language Models (MLLMs) have increasingly localized and interleaved visual evidence for deliberative reasoning. Grounding-…
Multi-Teacher Knowledge Distillation via Teacher-Informed Mixture Priors
Knowledge distillation is a powerful method for model compression, enabling the efficient deployment of complex deep learning models (teach…
Semantic Flow Regularization: Teaching LLMs to Generate Diverse Yet Coherent Responses
When large language models are fine-tuned to generate persona- or tone-conditioned responses, their output diversity is severely limited--a…
Periodic RoPE for Infinite Context LLMs
The ability to process ultra-long contexts is crucial for large language models (LLMs) to perform long-horizon tasks. While recent efforts…
KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs
Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. How…
Geometry-Correct Diffusion Posterior Sampling with Denoiser-Pullback Curvature Guidance and Manifold-Aligned Damping
Diffusion posterior sampling conditions diffusion priors on measurements, but data-consistency updates are typically scaled by hand-tuned g…
Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models
Large language models frequently generate toxic, hateful, or harmful content, yet existing mitigation methods rely on costly retraining or…
Learning to Assign Prediction Tasks to Agents with Capacity Constraints
We address the problem of learning to assign prediction tasks to one agent from a set of available human or AI agents. In particular, we fo…
Tool Forge: A Validation-Carrying Toolchain for Governed Agentic Execution
Large language model agents are increasingly expected to perform operational work: calling APIs, manipulating files, assembling workflows,…
Integrated and Cross-Architecture Interpretation of LLM Reasoning
Understanding how LLMs reason is hindered by a practical asymmetry: while their generated outputs are observable, the underlying reasoning…
Learning Compositional Latent Structure with Vector Networks
Deep networks are powerful function approximators, but they typically store many different computations in shared weight matrices, making i…
MemGuard: Preventing Memory Contamination in Long-Term Memory-Augmented Large Language Models
Memory-augmented large language models extend reasoning beyond a fixed context window by maintaining long-term memory across interactions.…
VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning
Visual captioning requires models to capture visual content faithfully while minimizing both omission and hallucination. As the dominant pa…
SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection
Fine-tuning large language models often undermines their safety alignment, a problem further amplified by harmful fine-tuning attacks in wh…
Extracting Small Translation Specialists from LLMs by Aggressively Pruning Experts
Modern large language models (LLMs) achieve state-of-the-art machine translation performance, but they do so as broad generalists largely t…
On the Learnability of Test-Time Adaptation: A Recovery Complexity Perspective
Test-time adaptation (TTA) aims to adapt models to maintain reliable performance on non-stationary test streams without requiring labeled d…
Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts
Audio generation has made significant progress, yet synthesizing unified audio where speech and sounds are naturally composited remains a c…
I Hear, Therefore I Trust: A Socio-Technical Investigation of Humans as Synthetic Speech Detectors
Automatic deepfake detection has received considerable research attention, yet the socio-technical environment in which humans actually enc…
PromptEmbedder: Efficient and Transferable Text Embedding via Dual-LLM Soft Prompting
Large Language Models (LLMs) have demonstrated remarkable efficacy in text embedding, yet current adaptation methods like LoRA face signifi…
StoryLens: Preference-Aligned Story Rewriting via Context-Aware Narrative Enrichment
Story rewriting aims to adapt existing narratives to diverse reader preferences while preserving plot consistency and narrative coherence.…
Mind the Gap: Mixtures of Gaussians in Approximate Differential Privacy
We design a class of additive noise mechanisms that satisfy \((\varepsilon, \delta)\)-differential privacy (DP) for scalar, real-valued que…
SMILE-Next: Teaching Large Language Models to Detect, Classify, and Reason about Laughter
Laughter is a complex social signal that conveys communicative intent beyond amusement. While prior work has focused on isolated laughter a…
Revisiting Change Detection Methods for their Application to Serac Fall Time-Lapse Monitoring
In an era where climate change aggravates environmental uncertainties, the identification and detection of event precursors are becoming cr…
EigeNet: Geometry-Informed Multi-Modal Learning for Few-shot Novel View RIR Prediction
Predicting spatially varying Room Impulse Response (RIR) from sparse observations is a critical but highly challenging inverse problem for…
MIRAGE: Context-Aware Prompt Injection against Mobile GUI Agents via User-Generated Content
Mobile graphical user interface (GUI) agents driven by vision-language models (VLMs) perceive the screen as rendered pixels and choose acti…
LegalGraphRAG: Multi-Agent Graph Retrieval-Augmented Generation for Reliable Legal Reasoning
Graph-based Retrieval-Augmented Generation (GraphRAG) advances flat document retrieval by structuring knowledge as relational graphs, enabl…
SNARE: Adaptive Scenario Synthesis for Eliciting Overeager Behavior in Coding Agents
A coding agent executes a benign task as a sequence of shell, file, and network actions, any of which can quietly exceed the authorized sco…
DeltaMCP: Incremental Regeneration via Spec-Aware Transformation for MCP servers
The rapid development of LLMs coupled with the introduction of Model Context Protocol (MCP) has revolutionized how intelligent agents inter…
DEPART: DEcomposing PARiTy across Multilingual LLMs
Multilingual Large Language Models (mLLMs) leaderboards report per-language accuracy but rarely explain why disparities emerge, leaving sys…
Performance and Explainability Requirements of Evolutionary Algorithms in Real-World Physics-Informed Optimization
Evolutionary computation offers a variety of tools to solve complex real-world optimization problems. However, research often focuses on sm…
QuITE: Query-Based Irregular Time Series Embedding
Irregular Multivariate Time Series (IMTS) are common in practice, yet their irregular sampling complicates effective modeling. Existing app…
FLORO: A Multimodal Geospatial Foundation Model for Ecological Remote Sensing Across Sensors and Scales
Foundation models offer a promising route to transferable remote sensing representations, but many current approaches depend on very large…
BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law
We introduce the BenGER (Benchmark for German Law) dataset for evaluating LLM systems on subsumption-based legal reasoning in German law. T…
Visualizing Latent Phase Structures in Locomotion Policies: A Multi-Environment Study with Temporal Feature Extension
Deep reinforcement learning (DRL) has been shown to achieve high performance on locomotion control tasks in MuJoCo benchmarks such as HalfC…
Whose Name Comes Up? III: Persona Prompting Effects in LLM-Based Scholar Recommendation
Large language models (LLMs) are increasingly used as scholar recommenders, shaping who is seen as an expert in academia. Existing audits r…
Pruning and Distilling Mixture-of-Experts into Dense Language Models
Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded…
SmartIterator: Visual Analytics Workflows for Supervising Unsupervised Data Grouping
Unsupervised learning methods -- topic modeling, partition-based and density-based clustering -- produce data groupings without human guida…
VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer
With the rapid development of pre-training technologies, adapting large-scale Vision-Language Models (VLMs) for video understanding…
IRDS: Interpretable RLVR Data Selection via Verifier-Coupled Sparse Autoencoder Coverage
Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing LLM reasoning, yet its data inefficiency…
GUI Agents for Continual Game Generation
Generating a game is not the same as making one that can be played. Despite advances in code generation, existing approaches treat game gen…
PrunePath: Towards Highly Structured Sparse Language Models
Feed-forward networks (FFNs) dominate the parameter count and computation of modern language models, yet existing pruning methods often str…
ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation
Proactive Recommender Systems (PRSs) aim to guide user preference shift toward target items by generating paths of intermediate recommendat…
How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving
Modern large language model (LLM) inference has progressively disaggregated to keep pace with growing model sizes and tight TTFT and TPOT s…
Revisiting Anthropomorphic Reflection Markers in Large Language Model Reasoning
Large Language Models (LLMs) often produce explicit reflective traces during complex reasoning, accompanied by anthropomorphic markers such…
Routing-Aligned Fine-Tuning for Multilingual Downstream Tasks in Mixture-of-Experts Models
Mixture-of-Experts (MoE) models have emerged as a dominant paradigm for efficient LLM scaling, yet adapting them to non-English downstream…
Hybrid Neural World Models
Neural surrogates promise large speedups over classical solvers for physical dynamics but fail silently at sharp dynamical events such as s…
Identifying Explicit Parsimonious Piece-wise Polynomial Relationships in Industrial time-series: Application to manipulator robots
This paper addresses the problem of identifying parsimonious explicit piece-wise polynomial relationships that might involve a relatively l…
Multi-Agent LLM-based Metamorphic Testing for REST APIs
As REST APIs become an increasingly significant part of software systems, their validation is becoming more critical. Hence, testing and un…
Learning the Error Patterns of Language Models
When generating outputs for domains with specific validity constraints (e.g., a program should compile), LLMs often fail in a small number…
Improving Evaluation of Recombination-based Cartesian Genetic Programming
Cartesian Genetic Programming has traditionally been using mutation as its main and often sole genetic operator to drive evolutionary searc…
Score Based Error Correcting Code Decoder
Error-correcting codes enable reliable communication, yet practical soft decoding remains challenging across code families and block length…
CLANE: Continual Learning of Actions on Neuromorphic Hardware from Event Cameras
Recognizing and continuously learning novel human actions without forgetting prior classes is a requirement for emerging AR/VR and robotics…
ADWIN: Adaptive Windows for Horizon-Aware On-Policy Distillation
On-policy distillation (OPD) transfers reasoning behavior by training a student on teacher feedback along student-generated trajectories, b…
VITAL: Visual-Semantic Dual Supervision for Enhanced and Interpretable Latent Reasoning in Medical MLLMs
Latent reasoning enables reasoning over continuous hidden states rather than explicit tokens, avoiding the language bottleneck and inferenc…
Anomaly as Non-Conformity via Training-Free Graph Laplacian Energy Minimization
Detecting subtle visual anomalies in images remains challenging, particularly when only normal samples are available a priori. Such unsuper…
Bayesian Gated Non-Negative Contrastive Learning
While Contrastive Learning (CL) has revolutionized self-supervised representation learning, its latent representations remain highly entang…
BiasEdit: A Training-Free Bias-Detect-and-Edit Framework for Learning Fair Visual Classifiers
Visual data from the Web power image classifiers, which often underpin many web services, such as recommendation and content moderation. Ho…
The Cases LJP Never Sees: Prosecution Decision Prediction for More Complete Criminal Liability Assessment
Legal Judgment Prediction (LJP) has become a core benchmark for evaluating AI in the criminal legal domain, but it only sees criminal cases…
SSR3D-LLM: Structured Spatial Reasoning via Latent Steps for Fine-Grained Grounding in Unified 3D-LLMs
3D object grounding localizes referred objects in a 3D scene from natural language. Unified instance-centric 3D-LLMs aim to solve grounding…
The Decision to Verify: How Warmth and User Characteristics Shape Reliance on Conversational Agents for Information Search
Conversational artificial intelligence (AI) provides an efficient and convenient gateway to information access. However, it can cause overr…
Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification
Large language models have shown impressive capabilities in code generation, yet they often produce functionally incorrect code. Uncertaint…
Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets
Large language models (LLMs) for code completion and generation are increasingly used in software development, yet they may reproduce train…
Learning Theory of the SVRG: Generalization and Convergence Analysis
Variance reduction (VR) methods employ stochastic gradients with decreasing variance, and they have been widely applied to solve large-scal…
Do LLMs Favor Their Providers? Measuring Vertical Integration Bias in Code Generation
Large Language Models (LLMs) have become an integral part of software development, especially with the advent of agentic capabilities. Yet,…
Stochastic Gradient Descent with Momentum is Algorithmically Stable
Stochastic gradient descent with momentum (SGDM) is one of the most widely used optimization algorithms in machine learning. While optimiza…
Token Optimization Strategies for LLM-Based Oracle-to-PostgreSQL Migration
LLMs are increasingly used for software modernization, code translation, and database migration. However, LLM-based Oracle2PostgreSQL migra…
A Multi-dimensional Framework for Evaluating Generalization in EEG Foundation Models
Evaluating foundation models under appropriate adaptation settings is essential for understanding the quality and transferability of the le…
Verified Misguidance: Measuring Structural Citation Failures in Search-Augmented LLMs
Users of search-augmented LLMs rely on citations as evidence that responses are grounded in real sources, and rarely verify the cited pages…
Semantic Optimal Transport for Sparse Autoencoder Feature Matching and Circuit Compression
Sparse autoencoders (SAEs) have become a central tool for interpreting language models. However, two key SAE analyses that remain difficult…
Efficient Pre-Training of LLMs through Truncated SVD Layers
The massive scaling of Large Language Models (LLMs) has made pretraining increasingly cost-prohibitive. While low-rank representation and o…
SARAD: LLM-Based Safety-Aware Hybrid Reinforcement Learning with Collision Prediction for Autonomous Driving
Ensuring both safety and efficiency in decision-making for autonomous driving systems remains a fundamental challenge. Traditional Deep Rei…
Technical Report: Exploring the Emerging Threats of the Agent Skill Ecosystem
We analyzed 3,984 AI agent skills from major marketplaces and found 76 confirmed malicious payloads, including credential theft, backdoor i…
Models That Know How Evaluations Are Designed Score Safer
The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has ide…
Thermodynamic properties of chemically disordered compounds via AI-driven estimation of partition function with the PULSE method
In this article, we present an improved version of the PULSE method (Partition function Unsupervised Learning Sampling and Evaluation) for…
Position: Retire the "Positive Backdoor" Label -- Secret Alignment Requires Strict and Systematic Evaluation
This position paper argues that the AI/ML community should stop overclaiming and retire the label "positive backdoor," and instead treat tr…
Evaluating the Realism of LLM-powered Social Agents: A Case Study of Reactions to Spanish Online News
LLM-powered social agents are increasingly used to simulate online social behavior, yet their realism remains difficult to validate. Existi…
Online Irregular Multivariate Time Series Forecasting via Uncertainty-Driven Dual-Expert Calibration
Irregular multivariate time series forecasting is critical in many real-world applications, where time series are irregularly sampled and e…
Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification
Identifying key individuals in video scenes is essential for applications such as automated video editing and intelligent surveillance. Cur…
Measuring Form and Function in Language Models
We introduce quantitative metrics for child language acquisition to evaluate language models. Our focus is on the formal syntactic and func…
Blind PRNG Hijacking: An Undetectable Integrity-Preserving Attack Against LLM Watermarking
Cryptographic watermarking is a leading defense for attributing text generated by large language models (LLMs). Existing schemes, including…
The Attentional White Bear Effect in Transformer Language Models
Instruction-based suppression is widely used to prevent language models from generating prohibited content, yet it remains unclear whether…
Sense Representations Are Inducible Interfaces
Sense representations (explicit, per-token meaning decompositions) are useful for disambiguation, steering, and cross-lingual alignment, bu…
AI in the Workplace: The Impact of AI on Perceived Job Decency and Meaningfulness
The proliferation of Artificial Intelligence (AI) in workplaces is transforming how we work. While existing research on human-AI collaborat…
Misalignment Between Backpropagation and the Hierarchy of Brain Responses to Images
Backpropagation is the core learning mechanism underlying deep learning. However, whether and how this algorithm is implemented in the brai…
Deep Learning Strain Estimation: Is Physics-Based Simulation the Solution?
Speckle tracking echocardiography (STE) is the clinical standard for myocardial strain estimation. Despite good performance on global strai…
A Fresh Look at Lamarckian Evolution and the Baldwin Effect
Baldwinian and Lamarckian evolution have existed for a long time in evolutionary algorithms (EAs) without ever dominating the academic lite…
Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study
Large language models (LLMs) are increasingly used for the automatic evaluation of generated text, yet most prior work focuses on English.…
IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents
An Initial Public Offering (IPO) filing is a document released when a private firm goes public, allowing individual (retail) investors to p…
MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems
Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unreliable and…
BIRDNet: Mining and Encoding Boolean Implication Knowledge Graphs as Interpretable Deep Neural Networks
Tabular data in knowledge-rich domains often carries a latent prior in the form of Boolean implication relationships (BIRs) between pairs o…
Reverse Probing: Supervised Token-level Uncertainty Quantification for Large Language Models in Clinical Text
As large language models are increasingly deployed for clinical text, ensuring they can reliably signal their own uncertainty becomes criti…
Preference-Shaped Expected Hypervolume and R2 Improvement: Exact Computation and Monotonicity
This paper studies preference-shaped expected improvement criteria for Bayesian multiobjective optimization. We consider two indicator fami…
Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL
Linear interpolation between fine-tuned checkpoints has been shown to trace the Pareto front between competing objectives, but whether extr…
Rethinking Memory as Continuously Evolving Connectivity
Existing memory-augmented LLM agents often treat memory as a static repository with pre-defined representations and fixed retrieval pipelin…
Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents
Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains…
Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval
In the era of autonomous agents, machine-actionable data is critical for data-driven workflows. For more than a decade, semantic metadata l…
Skill-Conditioned Gated Self-Distillation for LLM Reasoning
On-policy self-distillation (SD) improves LLM reasoning by using teacher-side privileged information (PI) to turn sparse verifier outcomes…
OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration
Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for s…
Beyond Binary: Sim-to-Real Dexterous Manipulation with Physics-Grounded Contact Representation
A primary bottleneck in contact-rich manipulation is the difficulty of collecting real-world data. Sim-to-real reinforcement learning offer…
Planning a Community Approach to Diabetes Care in Low- and Middle-Income Countries Using Optimization
Diabetes is a global health priority, especially in low- and middle-income countries, where over 50% of premature deaths are attributed to…
Tell Me a Story! Narrative-Driven XAI with Large Language Models
In many AI applications today, the predominance of black-box machine learning models, due to their typically higher accuracy, amplifies the…
Apple Intelligence Foundation Language Models
We present foundation language models developed to power Apple Intelligence features, including a ~3 billion parameter model designed to ru…
Heterogeneous Causal Discovery of Repeated Undesirable Health Outcomes
Understanding the factors that trigger or prevent undesirable health outcomes across patient subpopulations is essential for designing targ…
Text-Only Data Synthesis for Vision Language Model Training
Training vision-language models (VLMs) typically requires large-scale, high-quality image-text pairs, but collecting or synthesizing such d…
Domain size asymptotics for Markov logic networks
A Markov logic network (MLN) $\mathbb{M}$ determines a probability distribution $\mathbb{P}_n^\mathbb{M}$ on the set $\mathbf{W}_n$ of stru…
A Comparative Study of Rule-Based and Data-Driven Approaches in Industrial Monitoring
Industrial monitoring systems, especially when deployed in Industry 4.0 environments, are experiencing a shift in paradigm from traditional…
RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
Large language models (LLMs) are typically trained by reinforcement learning (RL) with verifiable rewards (RLVR) and supervised fine-tuning…
MetaboT: An LLM-based Multi-Agent Framework for Interactive Analysis of Mass Spectrometry Metabolomics Knowledge Graphs
Mass spectrometry-based metabolomics generates complex, high-dimensional data that holds vast potential for biological discovery but remain…
The Shape of Reasoning: Topological Analysis of Reasoning Traces in Large Language Models
Evaluating the quality of reasoning traces from large language models remains understudied, labor-intensive, and unreliable: current practi…
SynthTools: A Framework for Scaling Synthetic Tools for Agent Development
For agentic systems to use external tools to solve complex, long-horizon tasks, we need a large set of diverse and controllable tool-use en…
Guaranteed Optimal Compositional Explanations for Neurons
Compositional explanations are a family of methods that aim to describe the spatial alignment between neurons' receptive field activations…
Debate with Images: Detecting Deceptive Behaviors in Multimodal Large Language Models
Are frontier AI systems becoming more capable? Certainly. Yet such progress is not an unalloyed blessing but rather a Trojan horse: behind…
Atomic Skills are the Prerequisite: When Reinforcement Learning Synthesizes Compositional Reasoning, and When It Only Amplifies
Does Reinforcement Learning (RL) merely amplify existing skills, or synthesize novel skills? We investigate this question through the lens…
CircuitLM: A Multi-Agent LLM-Aided Design Framework for Generating Circuit Schematics from Natural Language Prompts
Generating accurate circuit schematics from high-level natural language descriptions remains a persistent challenge in electronic design au…
How Much Can a Few Engine Moves Help? Quantifying Limited Cheating in Chess
Cheating in chess, by using advice from powerful software, has become a major problem, reaching the highest levels. As opposed to the large…
SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understa…
Aligning Language Model Benchmarks with Pairwise Preferences
Language model benchmarks are pervasive and computationally-efficient proxies for real-world performance. However, many recent works find t…
Can I Have Your Order? Monte-Carlo Tree Search for Slot Filling Ordering in Diffusion Language Models
While plan-and-infill decoding in Masked Diffusion Models (MDMs) shows promise for mathematical and code reasoning, performance remains hig…
DIG to Heal: Scaling General-purpose Agent Collaboration via Explainable Dynamic Decision Paths
The increasingly popular agentic AI paradigm promises to harness the power of multiple, general-purpose large language model (LLM) agents t…
COOP$^2$: Defining, Observing, and Repairing Cooperation in LLM Multi-Agent Systems
Many complex tasks require extended effort, diverse capabilities, or coordinated actions beyond what a single agent can provide. However, s…
FinTexTS: Financial Text-Paired Time-Series Dataset via Semantic-Based and Multi-Level Pairing
The financial domain involves a variety of important time-series problems. Recently, time-series analysis methods that jointly leverage tex…
Towards automated data analysis: A guided framework for LLM-based risk estimation
Large Language Models (LLMs) are increasingly integrated into critical decision-making pipelines, a trend that raises the demand for robust…
CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders
While modern LLMs are aligned to refuse harmful requests, it is essential to understand the underlying mechanistic basis of this refusal be…
FactReview: Evidence-Grounded Peer Review with Execution-Based Claim Verification
LLM-based reviewing systems typically take only the manuscript as input, leaving literature and code-based claims hard to verify. We presen…
Graph-of-Skills: Dependency-Aware Structural Retrieval for Massive Agent Skills
Modern LLM agents increasingly rely on reusable skills, and as they interact with personal applications, web browsers, and other interfaces…
Text2Model: Modeling Copilots for Text-to-Model Translation
There is growing interest in leveraging large language models (LLMs) for text-to-model translation and optimization tasks. This paper aims…
Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems
Prompt optimization in compound AI systems is statistically indistinguishable from a coin flip: across 72 optimization runs on Claude Haiku…
Towards Rigorous Explainability by Feature Attribution
For around a decade, non-symbolic methods have been the option of choice when explaining complex machine learning (ML) models. Unfortunatel…
OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning
Recent advancements in Reinforcement Learning with Verifiable Rewards (RLVR) have significantly improved Large Language Model (LLM) reasoni…
Escher-Loop: Mutual Evolution by Closed-Loop Self-Referential Optimization
While recent autonomous agents demonstrate impressive capabilities, they predominantly rely on manually scripted workflows and handcrafted…
DataClawBench: An Agent Benchmark for Exploratory Real-World Financial Data Analysis
Autonomous data analysis agents are increasingly expected to conduct exploratory analysis with limited human guidance about data. However,…
Can We Formally Verify Neural PDE Surrogates? SMT Compilation of Small Fourier Neural Operators
Fourier Neural Operators (FNOs) can greatly accelerate PDE simulation, but they are often used without formal guarantees that they preserve…
Verifiable Process Rewards for Agentic Reasoning
Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of large language models (LLMs), but most existi…
Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning
On-policy self-distillation has become a strong recipe for LLM reasoning, where a privileged teacher supervises the student's own rollouts…
Optimal LTLf Synthesis
Strategy synthesis typically follows an all-or-nothing paradigm, returning unrealisable whenever a specification cannot be guaranteed in an…
Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches
Optimization models developed by operations research (OR) experts are often deployed as decision-support systems in industrial settings. Ho…
Position: The Turing-Completeness of Autoregressive Transformers Relies Heavily on Context Management
Many works make the eye-catching claim that Transformers are Turing-complete. However, the literature often conflates two distinct settings…
EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design
Large Language Model (LLM) agents are increasingly applied to engineering design tasks, yet existing evaluation frameworks do not adequatel…
Who Uses AI? Platform Selection and the Measurement of Occupational AI Exposure
Conversation logs from AI platforms are increasingly used to measure occupational exposure to artificial intelligence, but the users observ…
FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation
Modern recommender systems rely heavily on ID-based collaborative filtering: each item is represented by a unique ID embedding that accumul…
Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents
LLM agents are shaped not only by their language models, but also by the runtime harness that mediates observation, tool use, action execut…
Measuring Massive Multitask Chinese Understanding
The development of large-scale Chinese language models is flourishing, yet there is a lack of corresponding capability assessments. Therefo…
DSSE: a drone swarm search environment
The Drone Swarm Search project is an environment, based on PettingZoo, that is to be used in conjunction with multi-agent (or sing…
Delay-Aware Reinforcement Learning for Highway On-Ramp Merging under Stochastic Communication Latency
Delayed and partially observable state information poses significant challenges for reinforcement learning (RL)-based control in real-world…
Generalized Holographic Reduced Representations
Hyperdimensional Computing (HDC) is a computationally and data-efficient paradigm that acts as a bridge between connectionist and symbolic…
Sinc Kolmogorov-Arnold network and its application for solving PDEs with singularities
In this paper, we propose to use Sinc interpolation in the context of Kolmogorov-Arnold Networks, neural networks with learnable activation…
Revisiting Graph Autoencoders as Implicit Contrastive Learners
Graph autoencoders (GAEs) and graph contrastive learning (GCL) are two major paradigms for self-supervised representation learning on graph…
Isometry pursuit
Isometry pursuit is a convex algorithm for identifying orthonormal column-submatrices of wide matrices. It consists of a novel normalizatio…
Improving Requirements Classification with SMOTE-Tomek Preprocessing
This study emphasizes the domain of requirements engineering by applying the SMOTE-Tomek preprocessing technique, combined with stratified…
HEART: Achieving Timely Multi-Model Training for Vehicle-Edge-Cloud-Integrated Hierarchical Federated Learning
The rapid growth of AI-enabled Internet of Vehicles (IoV) calls for efficient Machine Learning (ML) solutions that can handle high vehicula…
Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring
Large language models (LLMs) are becoming increasingly capable, but the mechanisms of their thinking and decision-making processes remain u…
MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation
The LLM-as-a-Judge paradigm shows promise for evaluating generative content but lacks reliability in reasoning-intensive scenarios, such as…
GradientStabilizer: Fix the Norm, Not the Gradient
Training instability in modern deep learning systems is frequently triggered by rare but extreme gradient-norm spikes, which can induce ove…
MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks
Retrieval-augmented generation (RAG) has become a common practice in multimodal large language models (MLLM) to enhance factual grounding a…
Persuade Me if You Can: A Framework for Evaluating Persuasion Effectiveness and Susceptibility Among Large Language Models
Large Language Models (LLMs) demonstrate persuasive capabilities that rival human-level persuasion. While these capabilities can be used fo…
Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark of Deepfakes Circulated in 2024
In the age of increasingly realistic generative AI, robust deepfake detection is essential for mitigating fraud and disinformation. While m…
Manboformer: Learning Gaussian Representations via Spatial-temporal Attention Mechanism
Compared with voxel-based grid prediction, in the field of 3D semantic occupancy prediction for autonomous driving, GaussianFormer propose…
The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models? A Bias-Controlled Study
3D Large Language Models (LLMs) leveraging spatial information in point clouds for 3D spatial reasoning attract great attention. Despite so…
LiDDA: Data Driven Attribution at LinkedIn
Data Driven Attribution, which assigns conversion credits to marketing interactions based on causal patterns learned from data, is the foun…
Structured Agent Distillation for Large Language Model
Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-…
EVADE-Bench: Multimodal Benchmark for Evaluating and Enhancing Evasive Content Detection
E-commerce platforms increasingly rely on Large Language Models (LLMs) and Vision Language Models (VLMs) to detect illicit or misleading pr…
STFlow: Data-Coupled Flow Matching for Geometric Trajectory Simulation
Simulating trajectories of dynamical systems is a fundamental problem in a wide range of fields such as molecular dynamics, biochemistry, a…
ASTRA: Communication-Efficient Acceleration for Multi-Device Transformer Inference
Multi-device inference can reduce Transformer latency by parallelizing computation. However, existing methods require high inter-device ban…
MMTABREAL: Real-World Benchmark for Multimodal Table Understanding
Multimodal tables, i.e., tabular layouts interleaved with charts, maps, icons, and color encodings, are ubiquitous in real applications yet re…
Beyond Interpretability: When, Why, and How Sparse Autoencoders Enable Label-Free Visual Steering
Sparse Autoencoders (SAEs) are increasingly used to interpret foundation models, but their role as an actionable intervention space remains…
Understanding Automated Program Repair Agents Through the Lens of Traceability: An Empirical Study
Automated Program Repair (APR) agents leverage Large Language Models (LLMs) to autonomously diagnose and fix software bugs through reasonin…
Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN
We partially reverse-engineer a convolutional recurrent neural network (RNN) trained with model-free reinforcement learning to play the box…
LLM Watermark Evasion via Bias Inversion
Watermarking offers a promising solution for detecting LLM-generated content, yet its robustness under realistic query-free (black-box) eva…
Beyond Model Ranking: Predictability-Aligned Evaluation for Time Series Forecasting
In the era of increasingly complex AI models for time series forecasting, progress is often measured by marginal improvements on benchmark…
Regression Language Models for Code
We study code-to-metric regression: predicting numeric outcomes of code executions, a challenging task due to the open-ended nature of prog…
SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification
Speculative decoding accelerates LLM inference by verifying candidate tokens from a draft model against a larger target model. Recent judge…
EAGer: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling
With the rise of reasoning language models and test-time scaling methods as a paradigm for improving model performance, substantial computa…
InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training
Reinforcement learning (RL) has driven recent breakthroughs in large language models (LLMs), especially for tasks where rewards can be comp…
The Principles of Diffusion Models
This book presents the core principles that have guided the development of diffusion models, tracing their origins and showing how diverse…
VULPO: Context-Aware Vulnerability Detection via On-Policy LLM Optimization
Large language models (LLMs) have recently shown strong potential in vulnerability detection (VD). However, accurately detecting vulnerabil…
ReflexGrad: Within-Episode Failure Recovery in LLM Agents via Progress-Gated Dual-Process Routing
We present ReflexGrad, a dual-process architecture for within-episode failure recovery in LLM agents without demonstrations. When agents co…
Not All Pixels Are Equal: Pixel-wise Meta-Learning for Medical Segmentation with Noisy Labels
Medical image segmentation is crucial for clinical applications, but it is frequently disrupted by noisy annotations and ambiguous anatomic…
Object-Centric Vision Token Pruning for Vision Language Models
In Vision Language Models (VLMs), vision tokens are quantity-heavy yet information-dispersed compared with language tokens, thus consume to…
Diffusion-Augmented Markov Decision Processes for Maximum Entropy Reinforcement Learning
Diffusion models excel at sampling from complex, unnormalized distributions. In this work, we extend Maximum Entropy Reinforcement Learning…
Optimal and Diffusion Transports in Machine Learning
Several problems in machine learning are naturally expressed as the design and analysis of time-evolving probability distributions. This in…
Evaluation of AI Ethics Tools in Language Models: A Developers' Perspective Case Study
In Artificial Intelligence (AI), language models have gained significant importance due to the widespread adoption of systems capable of si…
Snowveil: A Framework for Decentralised Preference Discovery
Aggregating subjective preferences in social choice traditionally assumes a trusted central authority. In contrast, this paper formalises D…
Feature Learning Dynamics in Infinite-Depth Neural Networks
Deep neural networks have achieved remarkable success in practice, yet a mechanistic understanding of how features evolve during training r…
Adapting, Fast and Slow: On Few-Shot Transportability of Compositions
Generalization across domains requires stable structure that links the source and target distributions. Building on causal transportability…
HGMEM: Hypergraph-based Working Memory to Improve Multi-step RAG for Long-Context Complex Relational Modeling
Multi-step retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks th…
The Optimal Sample Complexity of Linear Contracts
In this paper, we settle the problem of learning optimal linear contracts from data in the offline setting, where agent types are drawn fro…
JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models
As Large Language Models (LLMs) are increasingly deployed in the healthcare field, it becomes essential to carefully evaluate their medical saf…
On the Intrinsic Limits of Transformer Image Embeddings in Non-Solvable Spatial Reasoning
Vision Transformers (ViTs) excel in semantic recognition but exhibit systematic failures in spatial reasoning tasks such as mental rotation…
Differential syntactic and semantic encoding in LLMs
We study how syntactic and semantic information is encoded in inner layer representations of Large Language Models (LLMs), focusing on the…
On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation
Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content while preservin…
Teaching and Evaluating LLMs to Reason About Polymer Design Related Tasks
Research in AI4Science has shown promise in many science applications, including polymer design. However, current LLMs are ineffective in t…
Do readers prefer AI-generated Italian short stories?
This study investigates whether readers prefer AI-generated short stories in Italian over one written by a renowned Italian author. In a bl…
The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation
Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However,…
The Grammar of Transformers: A Systematic Review of Interpretability Research on Syntactic Knowledge in Language Models
We present a systematic review of 337 articles evaluating the syntactic abilities of Transformer-based language models (TLMs), reporting on…
NCSAM: Noise-Compensated Sharpness-Aware Minimization for Noisy Label Learning
Learning from Noisy Labels (LNL) remains a fundamental challenge in deep learning because real-world datasets often contain corrupted annot…
A Sheaf-Theoretic and Topological Perspective on Complex Network Modeling and Attention Mechanisms in Graph Neural Models
Combinatorial and topological structures, such as graphs, simplicial complexes, and cell complexes, form the foundation of geometric and to…
Probability-Entropy Calibration: An Elastic Indicator for Adaptive Fine-tuning
Token-level reweighting is a simple yet effective mechanism for controlling supervised fine-tuning, but common indicators are largely one-d…
SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning
Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to…
ECHO: Entropy-Confidence Hybrid Optimization for Test-Time Reinforcement Learning
Test-time reinforcement learning generates multiple candidate answers via repeated rollouts and performs online updates using pseudo-labels…
MathlibLemma: Folklore Lemma Generation and Benchmark for Formal Mathematics
While the ecosystem of Lean and Mathlib has enjoyed celebrated success in formal mathematical reasoning with the help of large language mod…
Mitigating Staleness in Asynchronous Pipeline Parallelism via Basis Rotation
Asynchronous pipeline parallelism maximizes hardware utilization by eliminating the pipeline bubbles inherent in synchronous execution, off…
Semantic-level Backdoor Attack against Text-to-Image Diffusion Models
Text-to-image (T2I) diffusion models are widely adopted for their strong generative capabilities, yet remain vulnerable to backdoor attacks…
Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory
Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rel…
Capture Timing-Attention of Events in Clinical Time Series
Automatically discovering personalized trajectories (i.e., sequential event patterns) from longitudinal EHR data is crucial for enabling pr…
Singular Vectors of Attention Heads Align with Features
Identifying feature representations in language models is a central task in mechanistic interpretability. Several recent studies have made…
The Well-Tempered Classifier: Some Elementary Properties of Temperature Scaling
Temperature scaling is a simple method for controlling the uncertainty of probabilistic models. It is mostly used in two contexts: i…
Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems
Multi-agent systems, where LLM agents communicate through free-form language, enable sophisticated coordination for solving complex coopera…
The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
Training against white-box deception detectors has been proposed as a way to make AI systems honest. However, such training risks models le…
AlphaForgeBench: Benchmarking End-to-End Trading Strategy Design with Large Language Models
The rapid advancement of Large Language Models (LLMs) has led to a surge of financial benchmarks, evolving from static knowledge evaluation…
Noise Scheduling as Information-Guided Allocation in Diffusion Training
We introduce InfoNoise, an online adaptive noise schedule for diffusion training that reallocates optimization effort toward noise levels w…
LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration
Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion…
Probing for Knowledge Attribution in Large Language Models
Large language model (LLM) hallucinations, meaning fluent but factually incorrect generations, fall into two types: faithfulness violations…
Learning Tangent Bundles and Characteristic Classes with Autoencoder Atlases
We introduce a theoretical framework that connects multi-chart autoencoders in manifold learning with the classical theory of vector bundle…
SPARC: Spatial-Aware Path Planning via Attentive Agent Communication
Efficient communication is critical for decentralized Multi-Robot Path Planning (MRPP), yet existing learned communication methods treat al…
RelaxFlow: Text-Driven Amodal 3D Generation
Image-to-3D generation faces inherent semantic ambiguity under occlusion, where partial observation alone is often insufficient to determin…
Relational Semantic Reasoning on 3D Scene Graphs for Open World Interactive Object Search
Open-world interactive object search in household environments requires understanding semantic relationships between objects and their surr…
Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards
Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances large language models (LLMs) reasoning but severely suffers fr…
Emerging Extrinsic Dexterity in Cluttered Scenes via Dynamics-aware Policy Learning
Extrinsic dexterity leverages environmental contact to overcome the limitations of prehensile manipulation. However, achieving such dexteri…
HO-SFL: Hybrid-Order Split Federated Learning with Backprop-Free Clients and Dimension-Free Aggregation
Fine-tuning large models on edge devices is severely hindered by the memory-intensive backpropagation (BP) in standard frameworks like fede…
From Causal Discovery to Dynamic Causal Inference in Neural Time Series
Time-varying causal models provide a powerful framework for studying dynamic scientific systems, yet most existing approaches assume that t…
Causal Direct Preference Optimization for Distributionally Robust Generative Recommendation
Direct Preference Optimization (DPO) guides large language models (LLMs) to generate recommendations aligned with user historical behavior…
Coherence Collapse: Diagnosing Why Code Agents Fail After Reaching the Right Code
Code agents resolve 65-70% of SWE-bench Verified issues, but Pass@1 cannot tell us why the rest fail, and, as we show, capable-model failur…
COTTA: Context-Aware Transfer Adaptation for Trajectory Prediction in Autonomous Driving
Developing robust models to accurately predict the trajectories of surrounding agents is fundamental to autonomous driving safety. However,…
Speaking of Language: Reflections on Metalanguage Research in NLP
This work aims to shine a spotlight on the topic of metalanguage. We first define metalanguage, link it to NLP and LLMs, and then discuss o…
Rectified Schrödinger Bridge Matching for Few-Step Visual Navigation
Visual navigation is a core challenge in Embodied AI, requiring autonomous agents to translate high-dimensional sensory observations into c…
Compositional Consistency-Guided Decoding for Three-Way Logical Question Answering
Three-way logical question answering (QA) assigns one of $\text{True}$, $\text{False}$, or $\text{Unknown}$ to a hypothesis $H$ given a pre…
Securing Retrieval-Augmented Generation: A Taxonomy of Attacks, Defenses, and Future Directions
Retrieval-augmented generation (RAG) extends large language models (LLMs) with external knowledge, but this access path also introduces sec…
Early Decisions Matter: Proximity Bias and Initial Trajectory Shaping in Non-Autoregressive Diffusion Language Models
Diffusion-based language models (dLLMs) have emerged as a promising alternative to autoregressive language models, offering the potential f…
BenGER Platform: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks
Evaluating large language models (LLMs) for legal reasoning requires workflows that span task design, expert annotation, model execution, a…
When PCOS Meets Eating Disorders: An Explainable AI Approach to Detecting the Hidden Triple Burden
Women with polycystic ovary syndrome (PCOS) face substantially elevated risks of body image distress, disordered eating, and metabolic chal…
Reasoning on the Manifold: Bidirectional Consistency for Self-Verification in Diffusion Language Models
While Diffusion Large Language Models (dLLMs) offer structural advantages for global planning, efficiently verifying that they arrive at co…
Retention Consequence in Lifecycle Memory Control
Persistent memory can fail after successful admission: a premise is written, then becomes a silent assumption, and later maintenance treats…
Negative Advantages Is a Double-Edged Sword: Calibrating advantages in GRPO for Search Agents
Search agents achieve strong question-answering performance through multi-turn interactions with search engines, with Group Relative Policy…
S2MAM: Semi-supervised Meta Additive Model for Robust Estimation and Variable Selection
Semi-supervised learning with manifold regularization is a classical framework for jointly learning from both labeled and unlabeled data, w…
LASER: Learning Active Sensing for Continuum Field Reconstruction
High-fidelity measurements of continuum physical fields are essential for scientific discovery and engineering design but remain challengin…
DiagramBank: A Quality-Audited Dataset of Scientific Schematic Diagrams with Multi-Level Document Context
Scientific papers use schematic diagrams to communicate methods, workflows, and system structure, yet existing scientific-figure corpora of…
C-MORAL: Controllable Multi-Objective Molecular Optimization with Reinforcement Alignment for LLMs
Large language models (LLMs) show promise for molecular optimization, but aligning them with selective and competing drug-design constraint…
Graph Memory Transformer (GMT)
We investigate whether the Feed-Forward Network (FFN) sublayer in a decoder-only transformer can be replaced by an explicit learned memory…
Rethinking Layer Redundancy: Calibration Matters More Than Search in LLM Depth Pruning
Depth pruning improves the inference efficiency of large language models by removing Transformer blocks. Prior work typically treats layer…
The Forensic Cost of Watermark Removal: From Dedicated Attacks to Image Editing
Current watermark removal methods are evaluated on two axes: attack success rate and perceptual quality. We show this is insufficient. Whil…
Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models
Large Language Models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are…
Escaping Mode Collapse in LLM Generation via Geometric Regulation
Mode collapse is a persistent challenge in generative modeling and appears in autoregressive text generation as behaviors ranging from expl…
Quantifying the Reconstructability of Astrophysical Methods with Large Language Models and Information Theory: A Case Study in Spectral Reconstruction
Modern astrophysical studies rely heavily on complex data analysis pipelines; however, published descriptions often lack the detail require…
Structured Belief State and the First Precision-Aware Benchmark for LLM Memory Retrieval
Every major benchmark for LLM memory systems, LoCoMo foremost, measures whether a model answered correctly, not whether the memory system r…
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
Reusable skills are becoming a common interface for extending large language model agents, packaging procedural guidance with access to fil…
Anatomy-Slot: Unsupervised Anatomical Factorization for Homologous Bilateral Reasoning in Retinal Diagnosis
Retinal diagnosis is inherently bilateral: clinicians compare homologous structures across eyes (e.g., optic disc asymmetry), yet most deep…
ArcVQ-VAE: A Spherical Vector Quantization Framework with ArcCosine Additive Margin
Vector Quantized Variational Autoencoder (VQ-VAE) has become a fundamental framework for learning discrete representations in image modelin…
GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding
Multi-head Latent Attention (MLA), the attention used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matc…
From Prediction to Intervention: The Evolution of AI in Biomedicine
Artificial intelligence has advanced rapidly in biomedicine through large-scale multimodal data integration, enabling increasingly accurate…
Voice "Cloning" is Style Transfer
Artificially generated speech is increasingly embedded in everyday life. Voice cloning in particular enables applications where identity pr…
MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation
Text-to-video (T2V) generation has rapidly progressed in visual fidelity, yet its ability to faithfully represent multiple cultures within…
Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
Multimodal Large Language Models (MLLMs) still struggle with fine-grained visual understanding, where answers often depend on small but dec…
Detecting and Mitigating the Correct-Answer Extinction Window in Test-Time Reinforcement Learning with Majority Voting
Test-time reinforcement learning (TTRL) reports substantial accuracy gains on mathematical reasoning benchmarks using majority vote as a ps…
LIFT and PLACE: A Simple, Stable, and Effective Knowledge Distillation Framework for Lightweight Diffusion Models
We demonstrate that in knowledge distillation for diffusion models, the teacher network's highly complex denoising process - stemming from…
Case-Aware Medical Image Classification with Multimodal Knowledge Graphs and Reliability-Guided Refinement
Deep learning has brought significant progress to medical image classification, yet most existing methods still rely on isolated visual evi…
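Docker's dedicated AI agent "Gordon" officially released, available even with free accounts
US-based Docker has announced that "Gordon," an AI agent that answers questions about Docker, suggests best practices, and helps fix errors, has moved to an official release as a new feature of Docker Desktop and the Docker CLI.
"Diet Member Map" draws attention: built solo by a construction tradesperson, it uses generative AI to make lawmakers' statements and political developments easy to follow
The site was launched by a person identified as "中島," who runs a small construction company and also works on construction sites as a tradesperson, during a period away from the job site due to injury. It is developed and operated individually in spare time and is not intended to generate revenue.
OpenAI and Anthropic set up new companies: can Japan's SIers stand up to the "arrival of the black ships"?
Anthropic and OpenAI have announced, one after the other, the establishment of new companies to handle AI services. Will the arrival of these "black ships," staffed with FDEs and Applied AI Engineers, be a threat to SIers? What preparations should Japan's SIers make?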
Why Google’s AI can’t spell Google (or anything else)
Google is embarrassing itself, again.
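AI coding agent market enters a "new stage": "three reasons" IDEs will no longer be indispensable
According to Gartner, advances in AI have pushed the AI coding agent market into a "new stage." What are the three reasons integrated development environments (IDEs) will no longer be indispensable?
"Learn by watching my back" has hit its limit: how to create the "work manuals shared with AI" that local government DX needs ahead of a "sharp drop in staff"
Amid calls to advance local government DX, problems such as work that depends on specific individuals and stalled knowledge transfer persist on the ground. The underlying causes may be a reliance on tacit knowledge, epitomized by "learn by watching my back," and a lack of work design premised on reproducibility. The author, who supports local governments across Japan as an assistant to the CIO, humans and AI…
Failure data is the real asset: integrating 3D models and analysis results with AI into knowledge that can be found at a glance
"There is tacit knowledge on the manufacturing floor that RAG cannot reach." Ghelia (ギリア) has begun offering a new platform that integrates 3D models and analysis results using a multimodal LLM and turns tacit design knowledge into explicit knowledge. It also turns reasons for rejection and the context of failures into organizational assets.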
In more good news for Amazon, Snowflake signs $6B deal with AWS for AI CPU chips
Snowflake has signed a new, enormous five-year deal with Amazon to secure chips for AI usage. Nvidia is once again being put on notice.
Payroll startup Remote says it grew revenue 50% per employee without adding headcount
Payroll service provider Remote recently surpassed $300 million in annual recurring revenue (ARR) and became cash-flow positive, thanks to…
Your SEO strategy is optimized for a search engine that no longer exists.
Google I/O made it official: AI-generated answers are now front and center in search, and most brands have almost no visibility into how AI…
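Facing "Bugmageddon" even with access to the latest AI, "Mythos"? Security lessons from the maker of Firefox
Would access to a cutting-edge model such as US-based Anthropic's AI model "Claude Mythos Preview" be enough to make cybersecurity measures airtight? A look at the case of US-based Mozilla, maker of the web browser "Firefox."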
Meta launches Instagram, Facebook, and WhatsApp subscriptions, with more to come, including AI plans
Meta is rolling out paid subscription plans for Instagram, Facebook, and WhatsApp worldwide, while also testing new AI, creator, and busine…
AI coding startup Cognition raises $1B at $25B pre-money valuation
Cognition says it has reached $492 million in annualized revenue run rate and more than doubled its valuation in eight months.
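OpenAI Foundation to commit $250 million to protect workers from AI-driven economic upheaval
The OpenAI Foundation announced that it will commit $250 million in initial funding to respond to rapid changes in the labor market and the economy accompanying the spread of AI. The funds will go to three areas: measuring economic impact, supporting worker transitions, and building new models of economic security. The foundation, which has undergone an organizational restructuring, …
2026-05-27 (459 items)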
Startup Battlefield 200 applications close today: Nominate a founder or submit your startup
Today is the final day to apply or nominate a startup for Startup Battlefield 200. Once the clock strikes 11:59 p.m. PT, the window closes…
ElevenLabs’ new music-generation model can switch genres mid-track
ElevenLabs' new model will let users regenerate a section of a song without affecting the rest of the track.
TechCrunch Disrupt 2026 Early Bird ticket savings end in 3 days
There are only 3 days left to save up to $410 on your ticket to TechCrunch Disrupt 2026. Early Bird pricing ends May 29 at 11:59 p.m. PT, a…
SOND, a sleep tech startup from Bose’s former head of sleep, exits stealth with $7M
SOND introduced its debut product: Dreambuds, a closed-loop, in-ear system that captures 12 physiological signals from the wearer, then act…
China is increasingly keeping its best AI talent to itself
China's AI boom is producing world-class talent, and Beijing is increasingly reluctant to let them go elsewhere.
ClickHouse triples annualized revenue to $250M, charting a path toward an IPO
The database provider is eyeing a public debut within the next few years.
YouTube will now automatically label AI videos
YouTube will now automatically label videos that use significant photorealistic AI, instead of relying solely on creators to disclose AI-ge…
Tech CEOs are apparently suffering from AI psychosis
"CEOs are uniquely prone to AI psychosis," Box CEO Aaron Levie opines. Maybe that explains the almost religious belief in AI productivity g…
Robinhood now lets your AI agents trade stocks
While these agents would be able to read and analyze users' portfolios to come up with trading strategies and suggest investments, they'll…
Cisco and OpenAI redefine enterprise engineering with Codex
Cisco and OpenAI are redefining enterprise engineering with Codex, helping Cisco scale AI-native development, accelerate AI Defense work, a…
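NEC, Hitachi, and Fujitsu all line up for "Anthropic partnerships": what are they after? [Roundup of executive comments from the three companies]
In the span of just one month, NEC, Hitachi, and Fujitsu each announced a partnership with Anthropic. What is each company aiming for?
"AI drafted this answer": at the House of Councillors plenary session, Digital Minister Matsumoto says "staff checked the facts and I gave final approval"
Asked at the May 27 House of Councillors plenary session by Mizuho Umemura of Sanseito about the use of "Gennai" (源内), the generative AI platform for administrative work that began a pilot project across all ministries and agencies in May, Digital Minister Hisashi Matsumoto said that "Gennai drafted this answer as well."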
Building self-improving tax agents with Codex
See how OpenAI, Thrive, and Crete built a self-improving tax agent with Codex, automating filings, improving accuracy, and accelerating wor…
BrickAnything: Geometry-Conditioned Buildable Brick Generation with Structure-Aware Tokenization
Generating physically buildable brick structures from 3D shapes requires more than geometric reconstruction: the output must also satisfy d…
Can LLMs Introspect? A Reality Check
Can large language models detect and report their own internal states? A number of studies have argued that the answer to this question is…
Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory
Long-running AI agents need persistent memory. Memory supports learning across sessions, reduces repeated context injection, and enables au…
Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions
Multimodal large language model (MLLM)-based embodied agents have shown strong potential for solving complex tasks in physical environments…
Constraint acquisition needs better benchmarks
Constraint Acquisition (CA) and related research on the validation and enhancement of Mathematical Programming (MP) models from domain know…
Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems
Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated like freshly initialized mod…
Experiments in Agentic AI for Science
This paper details two novel frameworks for developing autonomous, agentic AI in scientific workflows. Both systems leverage a hybrid Local…
Anchor: Mitigating Artifact Drift in Agent Benchmark Generation
AI agents are beginning to complete valuable, long-horizon business operations tasks, but training and evaluation environments for enterpri…
OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling
Theory of Mind (ToM), the ability to infer others' knowledge, intentions, and emotions, is commonly evaluated in large language models (LLM…
JobBench: Aligning Agent Work With Human Will
Current benchmarks for occupational AI agents are scoped primarily by economic values, telling a replacement story. We introduce JobBench,…
Managing Uncertainty in LLM-Generated Procedural Knowledge for Virtual Laboratory Planning
Educational virtual laboratories can make experimental training more scalable, adaptive, and accessible, especially when students have lim…
ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence
Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability fail…
Automatic Layer Selection for Hallucination Detection
Recent studies on hallucination detection have shown that hallucination-related signals are more strongly encoded in intermediate layers th…
Exploiting Local Dynamics Regularity for Reusable Skills in Offline Hierarchical RL
Hierarchical Reinforcement Learning (HRL) promises to solve long-horizon Reinforcement Learning (RL) tasks more efficiently than non-hierar…
Advancing Creative Physical Intelligence in Large Multimodal Models
Large multimodal models (LMMs) have rapidly advanced in perception and reasoning; however, it remains unclear whether these capabilities ge…
From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator
A long-standing goal of the research community is to develop highly interactive LLM-based dialogue agents. Recent research focuses on optim…
Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions
Large Language Models (LLMs) achieve impressive accuracy on mathematical reasoning benchmarks, yet their performance drops when problems ar…
The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence
We introduce the MiniMax-M2 series, a family of Mixture-of-Experts language models built around the principle that mini activations can unl…
Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning
Legal reasoning requires distinguishing changes that matter from those that do not. Legal AI should remain stable under legally irrelevant…
PolyFusionAgent: A Multimodal Foundation Model and Autonomous AI Assistant for Polymer Property Prediction and Inverse Design
Polymer discovery is central to fields ranging from energy storage to biomedicine, but it is hindered by an astronomically large chemical d…
MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration
Mobile graphical user interface (GUI) agents enable AI models to autonomously operate smartphones on behalf of users. However, most existin…
MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning
Clinical practice guidelines (CPGs) encode evidence-based decision logic that clinicians apply by evaluating patient variables, conditional…
AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents
The token-level extractive compressors widely used for general LM context are structurally inappropriate for LLM agents: across 17 (env, ba…
FAST-GOAL: Fast and Efficient Global-local Object Alignment Learning
Vision-language models such as CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy an…
Tail-Aware HiFloat4: W4A4 Post-Training Quantization for Wan2.2
This report describes Tail-Aware HiFloat4, our submission to the low-bit text-to-video generation quantization challenge. Our method adapts…
UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems
LLM-based multi-agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated by prompts, tools, and…
Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems
Long-horizon decision problems with cumulative damage couple locally attractive actions to globally adverse outcomes. We identify two ortho…
MemFail: Stress-Testing Failure Modes of LLM Memory Systems
Large language model (LLM) agents increasingly rely on external memory systems to remain consistent across long-horizon interactions, but l…
Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents
Medical AI agents increasingly use external tools for diagnosis, treatment recommendation, and evidence retrieval, yet most existing approa…
Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation
Large language models (LLMs) have shown strong empirical gains as self-evolving agents for CUDA kernel generation, driven by feedback-condi…
It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers
A prevalent assumption in LLM agent deployment holds that more structured harnesses universally improve reliability, and that higher-capabi…
A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks
Large Language Models (LLMs) have brought huge improvements to Artificial Intelligence (AI), which can be applied to general-purpose tasks.…
Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal
Large reasoning models (LRMs) generate chain-of-thought (CoT) traces before producing final outputs, introducing a dynamic internal state t…
The Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context
Retrieval-augmented generation promises to ground language model outputs in external evidence, yet the field has no reliable way to verify…
LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?
Advanced Large Multimodal Models (LMMs) have demonstrated impressive performance in K-12 reasoning tasks, exhibiting great promise as intel…
Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning
Post-training is routinely evaluated through aggregate benchmark scores that treat multi-hop reasoning as a single capability -- as if a mo…
What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation
Chain-of-thought (CoT) prompting reliably improves language-model accuracy, but which properties of a rationale text drive the improvement…
Helicase: Uncertainty-Guided Supply Chain Knowledge Graph Construction with Autonomous Multi-Agent LLMs
LLM-based multi-agent systems have been widely adopted for knowledge retrieval and report generation, synthesizing known information throug…
Multi-Stakeholder LLM Alignment: Decomposing Estimation from Aggregation
Multi-stakeholder tasks require one output to satisfy users with conflicting preferences. Holistic LLM judges conflate utility estimation a…
On the Detection of Commutative Factors in Factor Graphs: Necessary and Sufficient Conditions
Exploiting the indistinguishability of objects in a probabilistic graphical model such as a factor graph is key to lifted probabilistic inf…
TADDLE: A Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews
LLM-generated peer reviews are increasingly common at major venues, yet their deficiencies are hard to detect because they are uniformly fl…
From Norms to Indicators (N2I-RAG): An Agentic Retrieval-Augmented Generation Framework for Legal Indicator Computation
Computing legal indicators from normative texts is a key task in legal monitoring and policy evaluation, but presents significant challenge…
Developing a Totally Unimodular Linear Program for Optimal Conformance Checking: When and Why It Complements A*
Alignment-based conformance checking is the state-of-the-art approach for comparing observed process executions with normative process mode…
Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (extended preprint)
LLMs deployed in high-stakes domains face fundamental reliability challenges: hallucinations, inconsistencies, and privacy vulnerabilities…
LELA: An End-to-end LLM-based Entity Linking Framework with Zero-shot Domain Adaptation
Entity linking is a key component of many downstream NLP systems, yet existing approaches are often tied to the specific target knowledge b…
Generating Robust Portfolios of Optimization Models using Large Language Models
Mathematical optimization is a powerful tool for structured decision-making across domains such as resource allocation and planning. Formul…
ORCA: An End-to-End Interactive Copilot for Optimized Root Cause Analysis
Causal analysis is a crucial task in many domains, including manufacturing, social science, and medicine. However, despite recent progress,…
Boosting Knowledge Graph Foundation Models via Enhanced Negative Sampling
Knowledge graphs (KGs) have become the core backbone of numerous downstream tasks such as question answering and recommender systems. Howev…
BatteryMFormer: Multi-level Learning for Battery Degradation Trajectory Forecasting
Early battery degradation trajectory forecasting (BDTF), which predicts the full-life state-of-health trajectory from early operational dat…
Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry
Key knowledge for steel-industry volatile organic compounds (VOCs) governance is scattered across unstructured scientific literature, makin…
Can Broad Biomedical Knowledge be Contextualized into Scenario-Grounded Propositions?
Biomedical discovery often requires connecting broad biomedical knowledge with specific experimental or clinical data. Background knowledge…
Counteraction-Aware Multi-Teacher On-Policy Distillation for General Capability Recovery with Domain Preservation
Domain specialization can improve LLM behavior in vertical domains, but often weakens the general capabilities inherited from the original…
Position: AI Safety Requires Effective Controllability
AI safety is still largely framed as alignment: training models to follow human preferences, safety policies, and normative constraints. Th…
Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation
Vision-Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, b…
ICCU: In-Context Continual Unlearning via Pattern-Induced Refusal Rules
Machine unlearning aims to remove the influence of specific data from trained language models. In real-world deployments, unlearning reques…
StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning
Reinforcement learning for multi-turn agents suffers from a credit-assignment mismatch: rewards are sparse and trajectory-level, while succ…
VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions
Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration…
Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs
Retrieval-augmented LLMs are deployed for tasks where evidence quality determines action safety, yet evaluation protocols assume that singl…
Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering
Retrieval-Augmented Generation (RAG) systems for question answering typically retrieve evidence by semantic similarity between the query an…
The Compressive Knowledge Graph Hypothesis: Which Graph Facts Matter for Scientific Hypothesis Generation?
Knowledge graphs (KGs) can provide structured scientific context to language models, but it remains unclear which graph facts actually shap…
Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments
Recent advances in large language models (LLMs) have facilitated the widespread deployment of LLMs as interactive agents capable of reasoni…
Gumbel Machine: Counterfactual Student Writing Generation via Gumbel Noise Steering
An effective method of teaching across disciplines is to provide examples of high-quality work. However, an example may be significantly di…
SIA: Self Improving AI with Harness & Weight Updates
Humans are the bottleneck in building and improving AI. Both the models and the agents that wrap them are written, tuned, and corrected by…
Modeling Agentic Technical Debt and Stochastic Tax: A Standalone Framework for Measurement, Simulation, and Dashboarding
Agentic AI systems combine probabilistic reasoning with delegated action through tools, context, memory, orchestration, and external workfl…
Maat: The Agentic Legal Research Assistant for Competition Protection
Competition law experts conducting legal research must review extensive volumes of cases, decisions, and judicial reports to identify prece…
2-ASP(Q) programs with weak constraints: Complexity and efficient implementation
ASP(Q) extends Answer Set Programming (ASP) with Quantifiers over answer sets. In this paper we focus on the class of ASP(Q) programs with…
Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases
Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In t…
Natural Language Query to Configuration for Retrieval Agents
Modern retrieval agents expose many configuration choices -- LLM, retriever, number of documents, number of hops, and synthesis strategy --…
MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation
Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills…
Xe-Forge: Multi-Stage LLM-Powered Kernel Optimization for Intel GPU
Porting deep learning algorithms to new hardware accelerators requires developers to repeatedly apply the same low-level optimizations -- q…
Edge AI Deployment Beyond Models: A BSP-Aware Systems Framework for Industrial Embedded Platforms
Industrial Edge AI programs often begin with the model and only later confront the platform. That sequencing is attractive because it allow…
GEM: Geometric Entropy Mixing for Optimal LLM Data Curation
LLM pre-training efficacy increasingly depends on data composition rather than sheer volume. Yet, optimal mixing is hindered by categorizat…
Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications
Large Language Models (LLMs) have become the predominant paradigm in NLP, advancing both research and industry. As model sizes and pretrain…
Eroding Trust in Real Speech: A Large-Scale Study of Human Audio Deepfake Perception
Audio deepfakes have improved rapidly recently, yet their effect on human trust in real speech remains unstudied. We present the largest li…
AssetGen: Deployable 3D Asset Generation at Interactive Speed
While 3D generation is progressing rapidly, recent work has often focused on obtaining high-resolution assets, leaving user experience and…
VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents
We present VISTA (VIsual Spec-To-App Benchmark), a benchmark for evaluating the end-to-end web-app generation capabilities of LLM-based age…
Augment Engineering: A Methodology for Multi-Tool AI Orchestration Across Professional Domains
Organizations increasingly deploy separate purpose-built AI tools across professional domains, often hiring domain specialists for each, re…
MemMorph: Tool Hijacking in LLM Agents via Memory Poisoning
LLM-driven agents are capable of selecting external tools to complete users' tasks. However, attackers could compromise such process, steer…
When Does Adaptive Guidance Help? Belief-Aware Privileged Distillation for Autonomous Driving Under Partial Observability
Guided Soft Actor-Critic (GSAC) distills knowledge from a privileged full-state teacher to a partial-observation student for autonomous dri…
Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges
The known stylistic biases in LLM judges, such as a preference for verbosity or specific sentence structures, present an underexplored secu…
Furina: Fragmented Uncertainty-Driven Refusal Instability Attack
Safety alignment in large language models (LLMs) and multimodal large language models (MLLMs) is commonly assumed to operate as a near-bina…
TSFMAudit: Data Contamination Auditing in Forecasting Time Series Foundation Models
Time series foundation models (TSFMs) are increasingly pretrained on large corpora, raising concerns that evaluation datasets may have been…
On the Push-Based Asynchronous Federated Learning: A Bias-Correction Aggregation Approach
Asynchronous decentralized federated learning (ADFL) eliminates central coordination and global synchronization, making it attractive for l…
Tool-Schema Compression Enables Agentic RAG Under Constrained Context Budgets
Agentic RAG systems that equip language models with dozens to hundreds of tool definitions face a critical resource conflict: tool schemas…
Enhancing Autonomous Online Intrusion Detection for IoT with Balanced Learning, Reliable Pseudo-Labels, and Lightweight Architectures
The rapid proliferation of Internet of Things (IoT) devices has created an urgent demand for adaptive, resource-efficient Intrusion Detecti…
Planning Neural Dynamics with Lie Group Embedding through Supervised Projective Manifold Learning
We propose Lie group embedded dynamical neural networks (LieEDNN) and the corresponding learning algorithms based on gradient descent and m…
A Universal Cliff and a Design Fingerprint: Cross-Section Defect Detection Under LLM Orchestration
Production language-model systems answer a request by partitioning it across an invisible orchestration of worker agents that recompose one…
InfoQuant: Shaping Activation Distributions for Low-Bit LLM Quantization
Low-bit activation quantization remains a major bottleneck in efficient large language model (LLM) deployment. The difficulty is not only t…
PitchBench: Measuring Pitch Hearing in Audio-Language Models
Audio-language models (ALMs) are increasingly used in real-world applications that require understanding music, from music tutoring and tra…
RepoMirage: Probing Repository Context Reasoning in Code Agents with Perturbations
Code agents now perform skillfully on repository-level software engineering benchmarks, but it remains unclear whether s…
AutoDFT: A Closed-Loop Multi-Agent Framework for Autonomous DFT Calculations
Density functional theory (DFT) serves as the basis for computational discovery in materials science and chemistry, yet each calculation de…
GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training
Hybrid post-training usually combines supervised fine-tuning and reinforcement learning, but fixed mixing schedules cannot adapt when the r…
SetupX: Can LLM Agents Learn from Past Failures in Functionality-Correct Code Repository Setup?
Functionality-correct repository setup aims to configure execution environments (e.g., dependencies, build scripts) to successfully execute…
Max-Window Scale Estimation for Near-Lossless HiF8 W8A8 Quantization-Aware Training
Quantization-aware training (QAT) with low-bit floating-point formats enables efficient LLM deployment, yet introduces subtle failure modes…
HRVConformer: Neonatal Hypoxic-Ischemic Encephalopathy Classification from the Heart Rate signals
This paper presents the HRVConformer, a novel deep learning architecture for the classification of hypoxic-ischemic encephalopathy (HIE) us…
Modeling Dynamic Mixtures of Time-Delay Systems from Streaming Time Series
This research addresses the problem of adaptive modeling in time-series data streams with clear input-output relationships. This problem is…
Co-folding model guided by structural proteomics
Protein structure generative models excel at predicting single protein static structures from sequence, but routinely fail to capture the c…
Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection
Time series anomaly detection (TSAD) has long been a hot research topic in data mining due to its various applications. Recent studies chal…
CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly
LLM-based agents are increasingly used for cybersecurity tasks, but most existing systems rely on fixed, human-designed scaffolds that stru…
Workflow Closure Is Not Scientific Closure in Auto-Research Systems
This paper argues that workflow closure is not scientific closure in auto-research systems. Current systems can increasingly complete resea…
AgentSociety: Incentivizing Agentic Social Intelligence
The success of deployed agents relies on their ability to handle open-ended user requests using their inherent capabilities, not only in so…
Unified Neural Scaling Laws
We present a functional form (that we refer to as a Unified Neural Scaling Law (UNSL)) that accurately models and extrapolates the scaling…
Prospective evaluation of multimodal respiratory failure prediction: Do chest X-rays improve performance beyond EHR signals?
Early prediction of respiratory failure is critical for timely clinical intervention in intensive care units. Existing electronic health re…
Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion
Chunk-wise autoregressive video diffusion models rely on a KV cache of previously generated chunks to avoid redundant computation, but this…
VesselSim: learning 3D blood vessel segmentation without expert annotations
Blood vessel segmentation is a core task in medical image analysis for the care of vascular diseases and surgical planning, yet the challen…
Decoupled Delay Compensation: Enhancing Pre-trained MARL Policies via Learned Dynamics Filtering
Real-world multi-agent reinforcement learning (MARL) systems must often operate under stale observations, stochastic communication delays,…
CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations
Prior work establishes that controlled contrastiveness between self-generated responses from large language models, set via reward scores,…
Intelligent Detection and Mitigation of Carpet-Bombing DDoS Attacks in SDN Using Retrieval-Augmented Generation and Large Language Models
Software-Defined Networking (SDN) provides flexible and programmable network management; however, its centralized control architecture rema…
Curriculum Learning for Safety Alignment
Direct Preference Optimisation (DPO) is widely used for safety alignment in large language models. However, prior work shows it is brittle…
E$^3$C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control
Controllable and physically grounded egocentric video generation is essential for embodied agents to reason about how their own and others'…
Semigroup Consistency as a Diagnostic for Learned Physics Simulators
Learned physics simulators are often evaluated by one-step or short-horizon prediction error, but these metrics can miss failures in tempor…
Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models
Machine unlearning aims to remove specific concepts from pretrained text-to-image diffusion models, yet several white- and black-box attack…
When Correct Demonstrations Hurt: Rethinking the Role of Exemplars in In-Context Learning
In-context learning (ICL) is often motivated by the intuition that demonstrations help because they provide correct input-output examples.…
Personalized Generative Models for Contextual Debiasing
Different visual patterns appear with different frequencies in the world: e.g., beach balls appear on sand more often than they do on a roa…
Why LLMs Hallucinate on Structured Knowledge: A Mechanistic Analysis of Reasoning over Linearized Representations
In many reasoning tasks, large language models (LLMs) rely on structured external knowledge, such as graphs and tables, which is typically…
Unified Panoramic Geometry Estimation via Multi-View Foundation Models
Geometry estimation from perspective images has greatly advanced, maturing to the point where off-the-shelf foundation models are able to r…
BioFact-MoE: Biologically Factorized Mixture of Experts for Vision-Language Prognostic Modeling in Hepatocellular Carcinoma
Hepatocellular carcinoma (HCC) is biologically heterogeneous, shaped by the interplay between hepatic functional reserve and tumor-related…
VisualNeedle: Benchmarking Active Visual Search in Information-Dense Scenes
Frontier multimodal large language models (MLLMs) have been reported to achieve over 90% accuracy on fine-grained perception benchmarks. Ho…
Credit-assigned Policy Gradient for Early Stage Retrieval in Two-stage Ranking
Large-scale search, recommendation, and retrieval-augmented generation (RAG) systems typically employ a two-stage architecture: an early-st…
Annotator Positionality as Signal: Psychometric Weighting for Anti-Autistic Ableism Detection
Large language models (LLMs) are increasingly used in decision-making tasks where they can amplify or suppress perspectives, raising concer…
Plans for Evaluating Structured Generative Search Summaries
We propose a framework for evaluating structured generative search summaries that are placed atop organic web search results. A structured…
Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models
Evaluating and mitigating a generative system's susceptibility to jailbreak attacks is critical to its safe deployment. Given the number of…
Confounder Detection via Treatment Intent: A New Observational Study Design
Understanding the effects of interventions is central to scientific progress, with randomized controlled trials (RCTs) regarded as the gold…
The Rescue Effect: Spatio-Semantic Early Exit Bypasses Quantization Collapse in CLIP
Deploying Vision-Language Models on resource-constrained hardware typically requires INT8 quantization, but in joint-embedding architecture…
When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control
A properly calibrated rule-based autoscaler can beat every one of six mainstream deep reinforcement learning (DRL) algorithms on cost acros…
Uniboost: Global Coordination with Value Alignment for Fair and Efficient Traffic Allocation
With the rapid evolution of internet services, recommendation systems have become indispensable. In particular, the blending (re-ranking) s…
Structure-Adaptive Conformal Inference for Large-Scale Out-of-Distribution Testing
This paper addresses structured out-of-distribution (OOD) testing in high-stakes machine learning applications. Traditional conformal metho…
Aperiodic and Low-Frequency Spectral Bias in Reconstruction based EEG Foundation Models
EEG foundation models, pre-trained on large-scale unlabelled EEG data, have emerged as a promising direction towards learning generalizable…
Targeted Remasking: Replacing Token Editing with Token-to-Mask Refinement in Discrete Diffusion Language Models
Discrete masked diffusion language models such as LLaDA generate text through iterative denoising, where mask tokens are progressively repl…
LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness
Large language models can recognize when they are being evaluated (evaluation awareness) and behave differently because of that, which unde…
Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective
This paper addresses the challenging task of weakly-supervised video temporal grounding. Existing approaches are generally based on the mom…
Alignment Tuning for Large Language Models: A Data-Centric Lens on Alignment Data Pipelines
Much of the alignment tuning literature is organized around optimization objectives, while the construction of alignment data is often trea…
DDGAD: Trajectory Dynamics for Diffusion-Based Graph Anomaly Detection
Graph anomaly detection (GAD) aims to identify nodes or substructures whose behavior or attributes deviate significantly from the overall p…
Cross-scale Aligned Supervision for Training GANs
Modern GANs often introduce adversarial supervision on intermediate generator outputs and interpret the resulting multi-stage synthesis as…
Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization
AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental chal…
AnchorDiff: Training-Free Concept Grounding for MM-DiTs via Anchor-Based Graph Propagation
Multi-Modal Diffusion Transformers (MM-DiTs) encode rich representations for training-free concept grounding, but existing attention-based…
Towards Error-Free EHRs: Reasoning-Intensive Consistency Verification Between Clinical Notes and Structured Tables in Electronic Health Records
Data consistency between unstructured clinical notes and structured tables in Electronic Health Records (EHRs) is essential for patient saf…
Diffuse to Detect: Generative Diffusion Models for Unsupervised IC Anomaly Detection
Latent defect screening is challenged by extremely low failure rates, high-dimensional test data, and absence of labeled anomalies. We prop…
Comparative Study of Vision-Based Metric Measurement for Large-Scale Planar Scenes
Vision-based metric distance and area measurement remains challenging in large-scale outdoor environments due to long-range sensing, camera…
Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient
We present the stochastic decoupled policy gradient (SDPG), a lightweight visual reinforcement learning (RL) method that trains diverse vis…
Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories
LLM-generated stories are a popular use case, but they show very low variability. We sample 20,000 total stories from four current models u…
Dense2MoE: Pushing the Pareto Frontier of On-Device LLMs via Unified Pruning and Upcycling
The Mixture of Experts (MoE) architecture is highly promising for resource-constrained on-device deployments, yet training these models from s…
Unveiling the Fragility of Vision-Language Models: Multi-Modal Adversarial Synergy via Texture-Constrained Perturbations and Cross-Modal Optimization
Large Vision-Language Models (LVLMs) have transformed multi-modal understanding, excelling in tasks like image captioning and visual questi…
Foundations of a Time-Consistent Counterfactual Actuarial Runtime for Autonomous AI Agents
We propose a foundational runtime actuarial layer for autonomous AI agents in which every side-effect-bearing action carries a time-consist…
CSV-ViT: A Vision Transformer with the Variable-sized Cortical Supervertices for Detection of Alzheimer's Disease Pathologies
Confirming Alzheimer's disease (AD) typically relies on positron emission tomography (PET), which remains costly and invasive, motivating t…
InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward
While vision-language models (VLMs) have exhibited multi-turn visual reasoning capabilities, their reasoning trajectories remain relatively…
StreamSplit: Continuous Audio Representation Learning via Uncertainty-Guided Adaptive Splitting
Large-batch Contrastive Learning (CL), the foundation of modern representation learning, is fundamentally incompatible with the volatile re…
CmIVTP: Cross-modal Interaction-based Vessel Trajectory Prediction for Maritime Intelligence
Maritime intelligent transportation systems (MITS) are essential for ensuring navigation safety and efficiency in busy waterways. However,…
ReCA: Multi-Shot Long Video Extrapolation via Recursive Context Allocation
Minute-scale cinematic video generation is a central challenge for generative video models. Existing paradigms address only fragments of th…
A Hybrid Vision-Language Architecture for Automated Defect Reasoning and Report Generation in Industrial Inspection
Automated industrial inspection requires both precise defect localization and structured maintenance report generation; in current practice…
Recursive Flow Matching
Generative models have emerged as a powerful paradigm for solving physics systems and modeling complex spatiotemporal dynamics. However, ac…
DGLD: Domain-Gated Latent Diffusion for the Discovery of Novel Energetic Materials
Energetic-materials performance gains translate directly into reduced propellant mass, smaller warheads, and more efficient civilian gas-ge…
ChainCaps: Composition-Safe Tool-Using Agents via Monotonic Capability Attenuation
Tool-using agents increasingly operate in open-ended deployment environments, where they compose file systems, web APIs, code interpreters,…
Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference
Aligning a few-step generative model is challenging, since existing alignment frameworks typically rely on restrictive assumptions: a tract…
Linear and Neural Dueling Bandits with Delayed Feedback
Contextual dueling bandits form a cornerstone of preference-based decision-making, with critical applications in recommender systems and la…
Auditing and Fixing Economic Validity in Tabular Foundation Models for Discrete Choice
Tabular foundation models achieve strong accuracy on choice prediction tasks, but their predictions often violate the economic logic those…
Reliable Extraction of Clinical Follow-Up Instructions: A Hybrid Neural-Symbolic Pipeline
Objective. Outpatient notes carry follow-up instructions pairing actions with future times ("MRI brain in two weeks"). Extracting (action,…
Bridging Control with Neural Network Verifier alpha-beta-CROWN: A Tutorial
Learning-based methods for synthesizing controllers have gained popularity due to their high expressiveness and strong empirical performanc…
On the Error-Correcting Effects of Stochasticity in Discrete Diffusion
Discrete diffusion models achieve strong performance in text and image generation, but their inference remains slow and must inherently bal…
Few-shot Cross-country Generalization of Tabular Machine Learning and Foundation Models for Childhood Anemia Prediction under Distribution Shift
Childhood anemia affects around 40% of children aged 6-59 months globally and arises from heterogeneous factors, limiting model generalizab…
Examining the Challenges of Intellectual Property in AI-Generated Productions
With the advancement of artificial intelligence systems capable of autonomously generating artistic, literary, musical works, and even inve…
Cordyceps: Covert Control Attacks on LLMs via Data Poisoning
Large language models (LLMs) are often fine-tuned on uncurated text datasets that adversaries can poison. Existing poisoning attacks primar…
Geometry-Aware Contrastive Learning for Few-Shot Automatic Modulation Recognition
Standard Self-Supervised Learning (SSL) for Automatic Modulation Recognition (AMR) struggles with ineffective isotropic augmentations, spec…
Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training
Reinforcement learning (RL) is the dominant paradigm for post-training large language models. However, in the online, on-policy setting, ro…
MedVol-R1: Reward-Driven Evidence Grounding for Volumetric Reasoning Segmentation
Volumetric Reasoning Segmentation (VRS) aims to segment a target region in a 3D medical scan from a free-form clinical query, where the ref…
JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search
We introduce JetViT, a novel family of hybrid-architecture Vision Transformer (ViT) models that match the accuracy of state-of-the-art full…
More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations
Feedforward network (FFN) layers account for a large fraction of parameters and nonlinear expressivity in Transformer-based large language…
Bilevel Optimization over Saddle Points of Zero-Sum Markov Games
Reinforcement learning (RL) often has a hierarchical structure, where an upper-level (UL) learner selects model parameters and a lower-leve…
Respecting Modality Gap in Post-hoc Out-of-distribution Detection with Pre-trained Vision-Language Models
Out-of-distribution (OOD) detection has emerged as a popular technique to enhance the reliability of machine learning models by identifying…
AI evaluation may bias perceptions: The importance of context in interpreting academic writing
This paper examines how estimates of AI use in scientific writing can be biased when evaluation methods ignore contextual differences acros…
The Labyrinth and the Thread: Rethinking Regularizations in Sequential Knowledge Editing for Large Language Models
Sequential editing of structured knowledge in large language models allows targeted factual updates without retraining, yet existing method…
Certified Causal Attribution for Real-Time Attack Forensics in 6G Network Slicing
Cross-slice attack attribution in 6G networks requires identifying causal propagation chains through shared infrastructure in under 100 ms.…
DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding
Recent video multimodal large language models (MLLMs) increasingly couple step-by-step reasoning with on-demand visual evidence retrieval,…
An In-Vitro Study on Cross-Lingual Generalization in Language Models
Cross-lingual transfer in language models is difficult to study in natural corpora because lexical overlap, morphology, data imbalance, and…
Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning
Group-based reinforcement learning (RL) methods have achieved remarkable success in improving the performance of large language models (LLM…
Self-Improvement Imitation with Biologically Guided Search for Protein Design Under Oracle Budgets
Protein sequence optimization under tight oracle budgets requires methods that explore vast combinatorial spaces while making each evaluati…
Model Merging on Loss Landscape: A Geometry Perspective
Model merging offers a promising avenue for knowledge integration and parallel development without retraining. Yet, existing methods either…
Rotation-Invariant Spherical Watermarking via Third-Order SO(3) Representation Coupling
Reliable watermarking of panoramic imagery is fundamentally challenged by arbitrary 3D rotations. As panoramas are defined on the sphere, t…
SL-BiLEM: Structured Learnable Behavior-in-the-Loop Epidemic Modeling for Forecasting and Policy Evaluation
Epidemic forecasting faces a fundamental challenge: human behavior dynamically responds to disease spread, creating feedback loops that ind…
L2Rec: Towards Dual-View Understanding of LLMs for Personalized Recommendation
Adapting large language models (LLMs) for personalized recommendation requires aligning their general-purpose capabilities with user-specif…
Measuring Prediction Uncertainty in Neural Cellular Automata
Neural cellular automata (NCA) provide a lightweight alternative to encoder-decoder segmentation networks. However, it can be difficult to…
Stabilizing Recurrent Dynamics for Test-Time Scalable Latent Reasoning in Looped Language Models
Looped Language Models (LoopLMs) enable efficient latent reasoning through depth recurrence, yet exhibit unreliable test-time scaling behav…
MatFormBench: A Benchmarking Evaluation Framework for Target-Driven Materials Formulation
Inverse design of materials has significantly advanced target-driven formulation optimization, yet existing materials machine learning benc…
Cordon-MAS: Defending RAG against Knowledge Poisoning via Information-Flow Control
Retrieval-augmented generation (RAG) increasingly underpins high-stakes applications, yet remains vulnerable to Confundo-style poisoning wh…
Adversarial Training for Robust Coverage Network under Worst-case Facility Losses
The Maximal Covering Location-Interdiction Problem (MCLIP) is a classic bi-level optimization problem, which is fundamental to resilient in…
Generative artificial intelligence and the marginalization of minoritized knowledges in higher education: the case of disability
Generative artificial intelligence redefines higher education by restructuring the processes through which scientific knowledge is produced…
Towards Generalization-Oriented Models for Vehicle Routing Problems with Mixture-of-Experts
In recent years, Deep Reinforcement Learning (DRL) has achieved substantial progress on Vehicle Routing Problems (VRPs). However, existing…
Ratio-Variance Regularized Policy Optimization
Standard on-policy reinforcement learning relies on heuristic clipping to enforce trust regions, but this mechanism imposes a severe cost b…
EmoDistill: Offline Emotion Skill Distillation for Language Model Agents in Adversarial Negotiation
Post-trained LLMs are often optimized to align responses with human preferences, making them safe, polite, and conversationally appropriate…
Implementation of Big Data Analytics for Diabetes Management: Needs Assessment in the Rwanda Healthcare System
Diabetes is a chronic metabolic disease that can lead to serious health problems if not diagnosed and managed early. Big Data Analytics (BD…
SeDT: Sentence-Transformer Decision-Transformer Conditioning for Multi-Turn Conversation Reliability
Large language models (LLMs) achieve impressive performance when a task is fully specified in a single turn, yet the same models lose up to…
HTMLCure: Turning Browser Experience into State Guided Repair for Interactive HTML
LLMs can now produce full HTML pages, but many of those pages are only superficially correct: they render once, then fail under scroll, hov…
Innovation: An Almost Characterization of Hallucination
Hallucination is a central limitation of large language models (LLMs), and substantial effort has been devoted to understanding and mitigat…
RAGEAR: Retrieval-Augmented Graph-Enhanced Academic Recommender
We present RAGEAR (Retrieval-Augmented Graph-Enhanced Academic Recommender), a neurosymbolic recommender system for academic course recomme…
ContextGuard: Structured Self-Auditing for Context Learning in Language Models
Recent benchmarks reveal that despite strong reasoning capabilities, large language models (LLMs) still struggle to faithfully apply comple…
The Kalman Evolve: Closing the Gap in Kalman Filtering via Interpretable Algorithm Discovery
State estimation is a fundamental problem in control and signal processing, for which the Kalman Filter provides an optimal solution under…
Periodic Topological Deep Learning for Polymer Design and Discovery
Polymers underpin applications across energy, healthcare, and materials science, yet their vast chemical space makes systematic discovery c…
The Sensation Modulating Network: Haltability as the architectural ground for object-directed phenomenology
Cognitive science remains split between cognitivism - which accounts for recursion and language but cannot ground formal symbols in meaning…
Persistent AI Agents in Academic Research: A Single-Investigator Implementation Case Study
Background: Large language models are typically evaluated as models, benchmarks, or short conversational episodes. Less is known about what…
The Strongest Teacher Is Not Always the Best Teacher: Student-Centric Answer Selection
LLM training increasingly relies on teacher-generated supervision, from synthetic responses to reasoning traces and tool-use demonstrations…
Knowledge Graphs as the Missing Data Layer for LLM-Based Industrial Asset Operations
LLM-based agents for industrial asset operations show limited accuracy when reasoning over flat document stores. AssetOpsBench (KDD 2026) e…
GeoFaith: A Spatio-Temporal Dual View of Faithful Chain-of-Thought
Chain-of-Thought (CoT) reasoning has advanced large language models (LLMs), but outcome-based supervision leads to pervasive post-hoc ratio…
Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models
Normalization layers in modern large language models (LLMs) consist of a deterministic normalization operation and a learnable scale vector…
Strategies for Guiding LLMs to Use Software Design Patterns: A Case of Singleton
Large Language Models (LLMs) can generate functional source code from natural-language prompts, but often fail to consistently follow highe…
ICICLE: Expanding Retrieval with In-Context Documents
Generative retrieval (GR) maps queries directly to document identifiers (docids) using parametric knowledge. However, this design makes cor…
Practical Anonymous Two-Party Gradient Boosting Decision Tree
Structured data is well handled by gradient-boosted decision trees (GBDT), which are usually trained on vertically partitioned features acr…
EEG-FM-Audit: A Systematic Evaluation and Analysis Pipeline for EEG Foundation Models
Large EEG Foundation Models (FMs) have shown great potential for decoding EEG signals across diverse cognitive tasks. However, existing EEG…
Reasoning Depth and Environment Complexity: A Controlled Study of RLVR Data Allocation across Logical Reasoning Tasks
Reinforcement learning with verifiable rewards (RLVR) has become central to post-training reasoning models, yet a key limitation of existin…
Beyond Questions: Evaluating What Large Language Models (Actually) Know
Parametric knowledge in large language models (LLMs) is a cornerstone of their success, yet remains poorly understood. Existing knowledge b…
JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors
As large language models (LLMs) are increasingly deployed to users around the world, they are integrated into everyday tasks across diverse…
Tournament-GRPO: Group-Wise Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation
Reinforcement learning in open-ended long-form generation is challenging because reliable reference answers and automatic metrics are often…
Recon: Reconstruction-Guided Reasoning Synthesis for User Modeling
User modeling aims to use language models (LMs) to mimic an individual's behavior from a corpus of past context-action pairs (e.g., convers…
Cast a Wider Net: Coordinated Pass@K Policy Optimization for Code Reasoning
Repeated sampling with a verifier is the standard way to allocate test-time compute for code generation, with pass@K as the canonical met…
Timestep-Aware SVDQuant-GPTQ for W4A4 Quantization of Wan2.2-I2V
W4A4 quantization of large video diffusion Transformers offers substantial memory savings but is hindered by two main challenges: sparse la…
ReasonOps: A Unified Operational Paradigm for Trustworthy Verified LLM Reasoning
Large Language Models (LLMs) have transformed artificial intelligence from primarily generative systems into increasingly capable reasoning…
Evaluating the Relevance of Uncertainty Estimators for LLM Hallucination
Large language models (LLMs) are prone to hallucinations, i.e., statements unsupported by the input or training data, hindering reliable de…
Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models
The rapid advancement of diffusion-based image generation models has raised serious concerns regarding potential copyright and privacy infr…
Less is More: Early Stopping Rollout for On-Policy Distillation
On-policy distillation has recently emerged as a promising alternative to standard sequence-level imitation, training a student by scoring…
Tracing Computation Density in LLMs
Transformer-based large language models (LLMs) are comprised of billions of parameters arranged in deep and wide computational graphs, but…
Lessons from Penetration Tests on Large-Scale Agent Systems
As AI systems gain increasing autonomy and execution capability, the number of discovered security vulnerabilities continues to rise. Howev…
ConVer: Using Contracts and Loop Invariant Synthesis for Scalable Formal Software Verification
Formal verification of large C programs is impeded by state-space explosion: Bounded Model Checking (BMC) tools must encode the entire stat…
QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents
Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language…
E3: Issue-Level Backtesting for Automated Research Critique
We present E3, an automated review assistant that augments reviewers and engineering teams by identifying decision-relevant technical conce…
Two Speeds of Learning: A Representation-Readout Decomposition of Grokking and Double Descent
Training loss and accuracy are the standard signals used to monitor generalization during deep neural network training. Two well-documented…
Trust Region Q Adjoint Matching
Off-policy reinforcement learning of pretrained flow policies remains challenging due to the instability of optimization arising from the m…
ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference
Fine-grained Mixture-of-Experts (MoE) models sparsely activate only a subset of experts per token, reducing activated computation while mai…
MiRD: Reliable Set-Valued Prediction for Open-Ended Question Answering via Miscoverage Risk Decomposition
Reliable set-valued prediction provides a principled way to mitigate hallucinations in open-ended question answering (QA), yet existing con…
High-Quality Synthetic Financial Time-Series using a GAN-Diffusion Framework
In recent years, financial institutions and firms have increasingly adopted synthetic data to address data scarcity and to generate counter…
DEI: Diversity in Evolutionary Inference for Quality-Diversity Search
We present DEI: Diversity in Evolutionary Inference, a distributed Quality-Diversity (QD) search framework that assigns heterogeneous large…
Beyond the Data Mesh Illusion: Designing Modern AI-augmented Lakehouses to Bridge the Gap Between Theory and Practice
Enterprise data platforms face an enduring tension between domain self-service and holistic governance. The data mesh paradigm proposed dec…
Deep-layer limit and stability analysis of the basic forward-backward-splitting induced network (II): learning problems
Deep unfolding neural networks derived from iterative optimization schemes and numerical ordinary/partial differential equations (ODEs/PDEs…
Semantic Robustness Probing via Inpainting: An Interactive Tool for Safety-Critical Object Detection
Testing object detectors in safety-critical domains requires semantically meaningful probes beyond pixel-level corruptions. We present SemP…
LitSeg: Narrative-Aware Document Segmentation for Literary RAG
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge, particularly for long-tail…
Grounding Text Embeddings in Stakeholder Associations
Text embeddings are widely used to analyse large corpora of complex texts. However, it is unclear whether the embeddings capture the same s…
An investigation of AI integration in sound designer workflows and experiences
Artificial intelligence is increasingly being integrated into professional audio production workflows, yet a gap persists between the tools…
FoundObj: Self-supervised Foundation Models as Rewards for Label-free 3D Object Segmentation
We address the challenging task of 3D object segmentation in complex scene point clouds without relying on any scene-level human annotation…
Learning When to Think While Listening in Large Audio-Language Models
Recent advances in Large Audio-Language Models (LALMs) have made real-time, streaming spoken interaction increasingly practical. In this se…
Generative Animations: A Multi-Model Pipeline for Prompt-Driven Motion Synthesis
Animation elevates digital documents into immersive experiences, yet creating custom motion paths remains cumbersome, requiring designers t…
TWIST: Closed-Loop Token Synchronization for Application-Aware Wireless Digital Twins
Wireless digital twins require repeated synchronization between a time-evolving physical scene and its digital counterpart under limited an…
Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation
We adapt Microsoft's QuantumKatas -- a well-established quantum computing curriculum -- from Q# to Qiskit, the most widely-adopted quantum…
Many Logics, One Methodology: A Plea for Logical Pluralism in Formalised Reasoning (preprint)
This position statement looks back on two decades of work on shallow embeddings of non-classical logics in classical higher-order logic (HO…
LUCoS: Latent Unsupervised Context Selection for Tabular Foundation Models
Selecting which instances to label is a key challenge in low-label tabular learning. For recent Tabular Foundation Models such as TabPFN, c…
Pair-In, Pair-Out: Latent Multi-Token Prediction for Efficient LLMs
Long chain-of-thought reasoning has made autoregressive decoding the dominant inference cost of modern large language models. Existing meth…
PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis
Building state-of-the-art text-to-speech (TTS) systems typically demands millions of hours of proprietary data and complex multi-stage arch…
Lost in Sampling: Assessing Lexical Reachability in LLMs via the Word Coverage Score (WCS)
Modern Large Language Models (LLMs) are often criticized for producing repetitive and homogeneous text, despite possessing vast latent voca…
FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies
Vision-Language-Action (VLA) models are increasingly expected to not only complete robot tasks, but also follow human instructions about ho…
Falcon-X: A Time Series Foundation Model for Heterogeneous Multivariate Modeling
Time series foundation models (TSFMs) are transforming the forecasting paradigm through large-scale cross-domain pretraining. However, most…
It's Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty
Large language models (LLMs) are known to abandon their initial stance to conform to user pushback. While prior research largely attributes…
Risk Averse Alert Prioritization for IDS Using Subnormal Gaussian Fuzzy Models
Modern intrusion detection systems generate thousands of alerts daily, but alert fatigue severely limits security operations effectiveness…
Governed Evolution of Agent Runtimes through Executable Operational Cognition
Recent advances in agentic systems increasingly treat code as an executable operational substrate rather than as a disposable output artifa…
EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering
Flowcharts are widely used in industrial requirements, but usually remain embedded as static images. Vision Language Models (VLMs) show pro…
When Eyes Betray AI: Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection
Recent generative models have largely closed the gap on low-level artifacts - pixel fingerprints, frequency anomalies, upsampling traces -…
Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders
Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data eng…
MobileMoE: Scaling On-Device Mixture of Experts
Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-bill…
GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesis, Research, and Testing
Cellular research and development (R&D) is throttled by six structural processes that each consume months of manual engineering work per it…
LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding
Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each…
Algorithmic Monocultures in Hiring
Many employers screen job applicants with algorithms built by the same few algorithm vendors. We hypothesize that algorithmic monoculture l…
Multi-Agent Causal Discovery Using Large Language Models
Causal discovery aims to identify causal relationships between variables and is a fundamental problem across the sciences. Traditional stat…
LiPUP-MA: A Residential Experience-centric Multi-Agent Framework for Living-in-the-loop Participatory Urban Planning
Participatory Urban Planning (PUP) is increasingly supported by LLM-based agents, yet existing methods largely rely on static preference el…
Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction
LLM-based agents solve complex tasks through iterative reasoning, tool use, and environment interaction, where each intermediate thought di…
Doc-CoB: Enhancing Document Understanding with Visual Chain-of-Boxes Reasoning
Document understanding aims to perform question answering and information extraction over document images, where the visual content is high…
How Chain-of-Thought Works? Tracing Information Flow from Decoding, Projection, and Activation
Chain-of-Thought (CoT) prompting significantly enhances model reasoning, yet its internal mechanisms remain poorly understood. We analyze C…
EvoEmo: Towards Evolved Emotional Policies for Adversarial LLM Agents in Multi-Turn Price Negotiation
Recent research on Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) has demonstrated that agents can engage in comp…
Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning
Large language models (LLMs) demonstrate strong reasoning abilities via Chain-of-Thought (CoT), but their token-level generation encourages…
AI-Driven Contribution Evaluation and Conflict Resolution: A Framework & Design for Group Workload Investigation
The equitable assessment of individual contribution in teams remains a persistent challenge, where conflict and disparity in workload can r…
PaTAS: A Framework for Trust Propagation in Neural Networks Using Subjective Logic
Trustworthiness has become a key requirement for the deployment of artificial intelligence systems in safety-critical applications. Convent…
LEC: Linear Expectation Constraints for Selection-Conditioned Risk Control in Selective Prediction and Routing Systems
Foundation models often generate unreliable answers, while heuristic uncertainty estimators fail to fully distinguish correct from incorrec…
XGrammar-2: Efficient Dynamic Structured Generation Engine for Agentic LLMs
Modern LLM agents increasingly rely on dynamic structured generation, such as tool calling and response protocols. Unlike traditional struc…
TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents
Recent breakthroughs in Large Language Models (LLMs) have positioned them as a promising paradigm for agents, with long-term planning and d…
Drive-P2D: A Progressive Perception-to-Decision Benchmark for VLMs in Autonomous Driving
Autonomous driving requires reliable perception and safe decision-making in complex scenarios. Recent vision-language models (VLMs) demonst…
AI Agent for Reverse-Engineering Legacy Finite-Difference Code and Translating to Devito
To facilitate the transformation of legacy finite difference implementations into the Devito environment, this study develops an integrated…
Chain Of Thought Compression: A Theoretical Analysis
Chain-of-Thought (CoT) has unlocked advanced reasoning abilities of Large Language Models (LLMs) with intermediate steps, yet incurs prohib…
Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic
Recent work has explored optimizing LLM collaboration through Multi-Agent Reinforcement Learning (MARL). However, most MARL fine-tuning app…
UCPO: Uncertainty-Aware Policy Optimization
The key to building trustworthy large language models (LLMs) lies in endowing them with inherent uncertainty expression capabilities, there…
Qrita: High-performance Top-k and Top-p using Pivot-based Truncation and Selection
Despite their importance in model sampling, efficient implementation of Top-k and Top-p algorithms for large vocabularies remains a signifi…
The Necessity of a Unified Framework for LLM-Based Agent Evaluation
With the advent of Large Language Models (LLMs), general-purpose agents have seen fundamental advancements. However, evaluating these agent…
Persona Generators: Generating Diverse Synthetic Personas for Arbitrary Contexts
Evaluating AI systems that interact with humans requires understanding their behavior across diverse user populations, but collecting repre…
DIANOIA: Diagnostic Decomposition and Joint Optimization for Multi-Agent Reasoning
Multi-agent LLM systems consistently outperform single-agent baselines, yet practitioners still cannot predict which design works for a new…
Hi-SAM: A Hierarchical Structure-Aware Multi-modal Framework for Large-Scale Recommendation
Multi-modal recommendation has gained traction as items possess rich attributes like text and images. Semantic ID-based approaches effectiv…
Understanding Reasoning in LLMs through Strategic Information Allocation under Uncertainty
LLMs often exhibit Aha moments such as self-correction after tokens like "Wait," yet the underlying mechanism remains unclear. Standard LLM…
Counterfactual Credit Policy Optimization for Multi-Agent Collaboration
Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles, but reinforcement learning f…
OMD-GraphRAG: Enhancing GraphRAG with Ontology-Guided Extraction, Multi-Dimensional Clustering and Dual-Channel Fusion
Retrieval-Augmented Generation (RAG) systems face significant challenges in complex reasoning, multi-hop queries, and domain-specific QA. W…
Modernising Reinforcement Learning-Based Navigation for Embodied Semantic Scene Graph Generation
Semantic world models enable embodied agents to reason about objects, relations, and spatial context beyond purely geometric representation…
Communication Gain and Delay Cost Under Cross-Timestep Delays in Cooperative Multi-Agent Reinforcement Learning
Communication is essential for coordination in cooperative multi-agent reinforcement learning under partial observability, yet…
ReVEL: Multi-Turn Reflective LLM-Guided Heuristic Evolution via Structured Performance Feedback
Designing effective heuristics for NP-hard combinatorial optimization problems remains challenging and often requires substantial domain ex…
From Attribution to Action: A Human-Centered Application of Activation Steering
Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these…
Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling
Prefilling computational costs pose a significant bottleneck for Large Language Models (LLMs) and Large Multimodal Models (LMMs) in long-co…
Mechanized Foundations of Structural Governance: Machine-Checked Proofs for Governed Intelligence
We present five results in the theory of structural governance for cognitive workflow systems. Three are mechanized in Coq 8.19 using the I…
The Two Boundaries: Why Behavioral AI Governance Fails Structurally
Every system that performs effects has two boundaries: what it can do (expressiveness) and what governance covers (governance). In nearly a…
Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling
World models have recently re-emerged as a central paradigm for embodied intelligence, robotics, autonomous driving, and model-based reinfo…
Effect-Transparent Governance for AI Workflow Architectures: Semantic Preservation, Expressive Minimality, and Decidability Boundaries
We present a machine-checked formalization of structurally governed AI workflow architectures and prove that effect-level governance can be…
Algebraic Semantics of Governed Execution: Monoidal Categories, Effect Algebras, and Coterminous Boundaries
We present an algebraic semantics for governed execution in which governance is axiomatized, compositional, and coterminous with expressibi…
SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning
Frontier scientific reasoning is rapidly emerging as a key foundation for advancing AI agents in automated scientific discovery. Deep resea…
Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models
Evaluating large language models (LLMs) today rests on fixed benchmarks that apply the same set of items to any model, producing ceiling an…
From Feasible to Practical: Pareto-Optimal Synthesis Planning
Current computer-aided synthesis planning (CASP) methods often treat retrosynthesis as solved once a single feasible route is identified, f…
Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management
This paper studies autonomous generative AI agents in multi-echelon supply chains using the MIT Beer Game. We identify four inference-time…
GraphMind: From Operational Traces to Self-Evolving Workflow Automation
Complex operational workflows coordinating personnel, tools, and information are central to system operations, yet end-to-end automation re…
Discoverable Agent Knowledge -- A Formal Framework for Agentic KG Affordances (Extended Version)
Two decades ago, the Semantic Web Services community was asked how agents with different ontological commitments could discover, compose, a…
AgentAtlas: Beyond Outcome Leaderboards for LLM Agents
Large language model agents now act on codebases, browsers, operating systems, calendars, files, and tool ecosystems, but their evaluations…
Declarative Data Services: Structured Agentic Discovery for Composing Data Systems
Agentic discovery has shown that LLM-driven search can find novel algorithms, designs, and code under benchmark conditions. Translating the…
ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving
Safety-critical scenarios are central to evaluating autonomous driving systems, yet their rarity in naturalistic logs makes simulation-base…
Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning
Post-training has become the dominant recipe for turning a language model into a competent search-augmented reasoning agent. A line of rece…
Querying and Repairing Inconsistent Prioritized Knowledge Bases: Complexity Analysis and Links with Abstract Argumentation
In this paper, we explore the issue of inconsistency handling over prioritized knowledge bases (KBs), which consist of an ontology, a set o…
Continual Model-Based Reinforcement Learning with Hypernetworks
Effective planning in model-based reinforcement learning (MBRL) and model-predictive control (MPC) relies on the accuracy of the learned dy…
"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization
Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across diffe…
Evaluating Sample Utility for Efficient Data Selection by Mimicking Model Weights
Large-scale web-crawled datasets contain noise, bias, and irrelevant information, necessitating data selection techniques. Existing methods…
Intelligent Offloading in Vehicular Edge Computing: A Comprehensive Review of Deep Reinforcement Learning Approaches and Architectures
The increasing complexity of Intelligent Transportation Systems (ITS) has led to significant interest in computational offloading to extern…
Yes, Q-learning Helps Offline In-Context RL
Existing offline in-context reinforcement learning (ICRL) methods have predominantly relied on supervised training objectives, which are kn…
Hands-On: Segmenting Individual Signs from Continuous Sequences
This work tackles the challenge of continuous sign language segmentation, a key task with huge implications for sign language translation a…
Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens
Recent impressive results from large reasoning models have been interpreted as a triumph of Chain of Thought (CoT), and especially of the p…
OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning
Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across various visual reasoning tasks. How…
Message-Passing State-Space Models: Improving Graph Learning with Modern Sequence Modeling
The recent success of State-Space Models (SSMs) in sequence modeling has motivated their adaptation to graph learning, giving rise to Graph…
Robustness of Prompting: Enhancing Robustness of Large Language Models Against Prompting Attacks
Large Language Models (LLMs) have demonstrated remarkable performance across various tasks by effectively utilizing a prompting strategy. H…
Self-Cascaded Diffusion Models for Arbitrary-Scale Image Super-Resolution
Arbitrary-scale image super-resolution aims to upsample images to any desired resolution, offering greater flexibility than traditional fix…
Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models
We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reaso…
Genre Controlled Music Generation via Activation Steering
Computational Music Generation is evolving towards non-conventional styles, demanding methods that enable precise and controllable blending…
Adaptive Multi-prompt Contrastive Network for Few-shot Out-of-distribution Detection
Out-of-distribution (OOD) detection attempts to distinguish outlier samples to prevent models trained on the in-distribution (ID) dataset f…
Real-Time Progress Prediction in Reasoning Language Models
Recent reasoning language models, particularly those that employ long latent chains of thought, achieve strong performance on complex agent…
"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models
Video generation models have achieved remarkable progress in creating high-quality, photorealistic content. However, their ability to accur…
PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization
In-Context Learning has shown great potential for aligning Large Language Models (LLMs) with human values, helping reduce harmful outputs a…
A Physics-Informed Hierarchical Neural Network for Microwave Scattering Analysis of 3D PEC Targets
Accurate modeling of scattering from three-dimensional (3D) perfectly electrically conducting (PEC) targets at microwave frequencies consti…
How Reliable are LLMs for Reasoning on the Re-ranking task?
With the improving semantic understanding capability of Large Language Models (LLMs), they exhibit a greater awareness and alignment with h…
Conceptual Schema Inference for Tabular Datasets using Large Language Models
Large collections of tabular data from data lakes, web tables and open data portals often originate from heterogeneous sources, leading to…
Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards
Reinforcement learning with verifiable rewards (RLVR) is a practical, scalable way to improve large language models on math, code, and othe…
Scalable GANs with Transformers
Scalability has driven recent advances in generative modeling, yet its principles remain underexplored for adversarial learning. We investi…
When LLMs Benchmark Themselves: Deconstructing Self-Bias in Automated Evaluation
As LLMs rapidly saturate existing benchmarks, automated benchmark creation using LLMs (LLM-as-a-benchmark) -- where a model generates test…
Searching the Internet for Challenging Benchmarks at Scale
Many static benchmarks are beginning to saturate: as models rapidly improve, they achieve near-perfect scores on fixed test sets, leaving l…
HiSpec: Hierarchical Speculative Decoding for LLMs
Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Veri…
Inference-Time Search Using Side Information for Diffusion-Based Image Reconstruction
Diffusion models have been used as priors for solving inverse problems. However, existing approaches typically overlook side information th…
CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credit
Diffusion large language models (dLLMs) generate text through iterative denoising. In commonly adopted parallel decoding schemes, each step…
Monte Carlo Permutation Search
We propose Monte Carlo Permutation Search (MCPS), a general-purpose Monte Carlo Tree Search (MCTS) algorithm that improves upon the GRAVE a…
Self-signals Driven Multi-LLM Debate for Efficient and Accurate Reasoning
Large Language Models (LLMs) have exhibited impressive capabilities across diverse application domains. Recent work has explored Multi-LLM…
EconCausal: A Context-Aware Economic Reasoning Benchmark for Large Language Models
Socio-economic causal effects depend heavily on their institutional and environmental contexts. The same intervention can produce different…
ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
Persian remains substantially underrepresented in open speech-text resources, limiting progress in multi-speaker text-to-speech (TTS), spee…
Securing Multi-Agent Systems Against Corruptions via Node Contribution Backpropagation
Multi-Agent Systems (MAS) have become a prevalent paradigm for Large Language Model (LLM) applications. However, the complex multi-agent de…
An End-to-End Learning Approach for Solving Capacitated Location-Routing Problems
The capacitated location-routing problems (CLRPs) are classical problems in combinatorial optimization, which require simultaneously making…
SWAP: Towards Copyright Auditing of Soft Prompts via Sequential Watermarking
Large-scale vision-language models, especially CLIP, have demonstrated remarkable performance across diverse downstream tasks. Soft prompts…
Explainable Cross-Disease Reasoning for Cardiovascular Risk Assessment from Low-Dose Computed Tomography
Low-dose chest computed tomography (LDCT) captures pulmonary and cardiac structures in a single scan, enabling joint assessment of lung and…
CFG-OEC: Classifier Free Guidance with Orthogonal Error Correction
Classifier free guidance is a standard method for conditional sampling in diffusion models, but its sampling rule is not aligned with the o…
Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation
This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesi…
Reconstructing Multi-Scale Physical Fields from Extremely Sparse Measurements with an Autoencoder-Diffusion Cascade
Extreme sensor sparsity makes full-field reconstruction a fundamentally ill-posed problem in scientific sensing, where the goal is to infer…
SEAL: Self-Evolving Agentic Learning for Conversational Question Answering over Knowledge Graphs
Knowledge-based conversational question answering (KBCQA) confronts persistent challenges in resolving coreference, modeling contextual dep…
Mechanistic Interpretability of Antibody Language Models Using SAEs
Sparse autoencoders (SAEs) are a mechanistic interpretability technique that have been used to provide insight into learned concepts within…
How to Square Tensor Networks and Circuits Without Squaring Them
Squared tensor networks (TNs) and their extension as computational graphs--squared circuits--have been used as expressive distribution esti…
EHRSummarizer: A Privacy-Aware, FHIR-Native Reference Architecture for Source-Grounded EHR Summarization
Clinicians routinely navigate fragmented electronic health record (EHR) interfaces to assemble a coherent picture of a patient's problems,…
Faithfulness Evaluation for Decoder-only LLM Attributions with Controlled Retained Information
Large Language Models (LLMs) are increasingly evaluated with input attribution methods, yet comparing such explanations remains challenging…
EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering and Reasoning
Reliable epidemiological reasoning requires synthesizing study evidence to infer disease burden, transmission dynamics, and intervention ef…
Beyond Binary: Turning Partial Success into Dense Verifiable Rewards for Reinforcement Learning in Code Generation
Effective reward design is a central challenge in Reinforcement Learning (RL) for code generation. Mainstream test-suite-level outcome rewa…
Shadow Unlearning: A Neuro-Semantic Approach to Fidelity-Preserving Faceless Forgetting in LLMs
Machine unlearning aims to selectively remove the influence of specific training samples to satisfy privacy regulations such as the GDPR's…
The AI Cognitive Trojan Horse: How Large Language Models May Bypass Human Epistemic Vigilance
Large language model (LLM)-based conversational AI systems present a challenge to human cognition that current frameworks for understanding…
Seeing vs. Believing: Evaluating the Language Bias of Open-Source MLLMs in Counter-Intuitive Scenes
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in mainstream visual understanding tasks, but their abili…
Beyond Transfer Accuracy: Faithful Circuits for Controlled Low-Resource Adaptation
Existing circuit discovery methods rely on templated tasks with clean counterfactuals, limiting their use on diverse natural text. We adapt…
DSA-Tokenizer: Disentangled Semantic-Acoustic Tokenization via Flow Matching-based Hierarchical Fusion
Speech tokenizers are a key building block of fully discrete Speech LLMs. Existing tokenizers either prioritize semantic encoding, fuse sem…
Left-Right Symmetry Breaking in CLIP-style Vision-Language Models Trained on Synthetic Spatial-Relation Data
Spatial understanding remains a key challenge in vision-language models. Yet it is still unclear whether such understanding is truly acquir…
MetaSICL: Adapting Auditory LLM via Meta Speech In-Context Learning
Auditory Large Language Models (LLMs) have demonstrated strong performance across a wide range of speech and audio understanding tasks. Nev…
LLMs versus the Halting Problem: Characterizing Program Termination Reasoning
Determining whether a program terminates is a central problem in computer science. Turing's Halting Problem established termination as unde…
ORLoopBench: Solver-in-the-Loop Benchmarks for Self-Correction and Behavioral Rationality in Operations Research
Operations Research practitioners debug infeasible models through an iterative process: inspecting Irreducible Infeasible Subsystems (IIS)…
ECSEL: Explainable Classification via Signomial Equation Learning
We introduce ECSEL, an explainable classification method that learns formal expressions in the form of signomial equations, motivated by th…
Graph is a Substrate Across Data Modalities
Graphs provide a natural representation of relational structure that arises across diverse domains. Despite this ubiquity, graph structure…
RulePlanner: All-in-One Reinforcement Learner for Unifying Design Rules in 3D Floorplanning
Floorplanning determines the coordinate and shape of each module in Integrated Circuits. With the scaling of technology nodes, in floorplan…
GraphDancer: Training LLMs to Explore and Reason over Graphs via Two-Stage Curriculum Post-Training
Large language models (LLMs) increasingly rely on external knowledge to improve factuality, yet many real-world knowledge sources are organ…
Rethinking the Trust Region in LLM Reinforcement Learning
Reinforcement learning (RL) has become a cornerstone for fine-tuning Large Language Models (LLMs), with Proximal Policy Optimization (PPO)…
Emergent Causal-Geometric Dynamics Across Depth in Large Language Models
Geometric analyses of large language model (LLM) representations reveal structured variation across depth but remain fundamentally correlat…
Scaling GraphLLM with Bilevel-Optimized Sparse Querying
LLMs have recently shown strong potential in enhancing node-level tasks on text-attributed graphs (TAGs) by providing explanation features.…
Olaf-World: Orienting Latent Actions for Video World Modeling
Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract cont…
Constructing Industrial-Scale Optimization Modeling Benchmark
Optimization modeling underpins decision-making in logistics, manufacturing, energy, and finance, yet translating natural-language requirem…
Vital Trace: Protocol-Constrained Patient-State Reasoning for Longitudinal Clinical Trajectories
Longitudinal clinical reasoning over electronic health records requires tracking evolving physiological measurements, laboratory results, a…
Assessing Per-Sample Membership Inference Vulnerability without Retraining
Recent work in the privacy literature shows that sample-targeted membership inference attacks (MIAs) significantly outperform untargeted ap…
GICDM: Mitigating Hubness for Reliable Distance-Based Generative Model Evaluation
Generative model evaluation commonly relies on high-dimensional embedding spaces to compute distances between samples. We show that dataset…
Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History
Large language models have advanced web agents, yet current agents lack personalization capabilities. Since users rarely specify every deta…
Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery
In environmental monitoring, data collection is often costly, sparse, and shaped by urgent public-health needs. This is particularly true f…
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations:…
MedCollab: IBIS-Guided Multi-Agent Collaboration with Hierarchical Disease Relation Chains for Clinical Diagnosis
Large language models (LLMs) have shown promise in clinical diagnosis but remain limited by unreliable report generation, weak evidence gro…
Phase-Type Variational Autoencoders for Heavy-Tailed Data
Heavy-tailed distributions are ubiquitous in real-world data, where rare but extreme events dominate risk and variability. However, standar…
Belief-Sim: Towards Belief-Driven Simulation of Demographic Misinformation Susceptibility
Misinformation is a growing societal threat, and susceptibility to misinformative claims varies across demographic groups due to difference…
RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies
Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manip…
Geometrically Constrained Outlier Synthesis
Deep neural networks for image classification often exhibit overconfidence on out-of-distribution (OOD) samples. To address this, we introd…
Stop Listening to Me! How Multi-turn Conversations Can Degrade LLM Reliability
Large language models (LLMs) excel on static benchmarks, but their performance across multi-turn conversations, which better reflect real-w…
Grammar of the Wave: Towards Explainable Multivariate Time Series Event Detection via Neuro-Symbolic VLM Agents
Time Series Event Detection (TSED) aims to localize semantically meaningful events in time series data, with critical applications in high-…
Early Pruning for Public Transport Routing
Routing algorithms for public transport, particularly the widely used RAPTOR and its variants, often face performance bottlenecks during th…
LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction
Recent 3D Gaussian Splatting (3DGS) methods have demonstrated the feasibility of self-driving scene reconstruction and novel view synthesis…
FedTreeLoRA: Reconciling Statistical and Functional Heterogeneity in Federated LoRA Fine-Tuning
Federated Learning (FL) with Low-Rank Adaptation (LoRA) has become a standard for privacy-preserving LLM fine-tuning. However, existing per…
Ethical Fairness without Demographics in Human-Centered AI
In ubiquitous and mobile health systems, computational models infer human states from wearable, behavioral, and physiological sensing data.…
Beyond Linearity in Attention Projections: The Case for Nonlinear Queries
Recent algebraic analysis shows that in decoder-only and encoder-only transformers, the Query projection $W_Q$ may be set to identity witho…
APEX-Searcher: Refining Credit Assignment with Subgoaling for Agentic Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) connects large language models (LLMs) to external knowledge, but single-round retrieval is often insuf…
Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models
Evaluating the reasoning abilities of large language models (LLMs) solely from final answers can obscure failures in intermediate steps, es…
Demystifying Video Reasoning
Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capa…
Alignment Makes Language Models Normative, Not Descriptive
Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling obser…
To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs
When VLMs answer correctly, do they genuinely rely on visual information? We introduce a Tri-Layer Diagnostic Framework with three per-samp…
Detached Skip-Links and $R$-Probe: Decoupling Feature Aggregation from Gradient Propagation for MLLM OCR
Multimodal large language models (MLLMs) excel at high-level reasoning yet fail on OCR tasks where fine-grained visual details are compromi…
Understanding the Challenges in Iterative Generative Optimization with LLMs
Generative optimization uses large language models (LLMs) to iteratively improve artifacts (such as code, workflows or prompts) using execu…
Where Code Meets Natural Language: Taxonomy-Driven Information Flow Analysis for LLM-Integrated Applications
LLM API calls are becoming a ubiquitous program construct, yet they create a boundary that no existing program analysis can cross: runtime…
From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering
Retrieval-Augmented Generation (RAG) systems depend critically on the quality of document preprocessing, yet no prior study has evaluated P…
SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills
OpenClaw's ClawHub marketplace hosts tens of thousands of community-contributed agent skills (49,592 in our 2026-04-04 snapshot), and recen…
Strategic Persuasion with Trait-Conditioned Multi-Agent Systems for Iterative Legal Argumentation
Strategic interaction in adversarial domains such as law, diplomacy, and negotiation is mediated by language, yet most game-theoretic model…
The ATOM Report: Measuring the Open Language Model Ecosystem
We present a comprehensive adoption snapshot of the leading open language models and who is building them, focusing on the ~1.5K mainline o…
Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study
Software systems built from versioned AI components increasingly need lifecycle-time governance: when a capability module evolves into a ne…
SenBen: Sensitive Scene Graphs for Explainable Content Moderation
Content moderation systems classify images as safe or unsafe but lack spatial grounding and interpretability: they cannot explain what sens…
ASTRA: Adaptive Semantic Tree Reasoning Architecture for Complex Table Question Answering
Table serialization remains a critical bottleneck for Large Language Models (LLMs) in complex table question answering, hindered by challen…
Degradation-Consistent Paired Training for Robust AI-Generated Image Detection
AI-generated image detectors suffer significant performance degradation under real-world image corruptions such as JPEG compression, Gaussi…
Where Hindsight Credit Can Reside: A Signed-Capacity View of Token Updates in RLVR
Reinforcement Learning with Verifiable Rewards (RLVR) improves the reasoning ability of Large Language Models (LLMs), but sparse outcome re…
Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language
At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliabi…
Reasoning Primitives in Hybrid and Non-Hybrid LLMs: Do Architectural Differences Yield Advantages in State-Tracking and Recall?
Reasoning in large language models is often discussed as a single capability, but some of its gains may stem from simpler underlying operat…
When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR
Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capab…
BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning
Brain MRI underpins a wide range of neuroscientific and clinical applications, yet most learning-based methods remain task-specific and req…
Certified Purity for Cognitive Workflow Executors: From Static Analysis to Cryptographic Attestation
We present a certified purity architecture that converts governance enforcement in cognitive workflow systems from a runtime convention int…
VIDA: A dataset for Visually Dependent Ambiguity in Multimodal Machine Translation
Ambiguity resolution is a key challenge in multimodal machine translation (MMT), where models must genuinely leverage visual input to map a…
MultiSense-Pneumo: A Multimodal Learning Framework for Pneumonia Screening in Resource-Constrained Settings
Pneumonia remains a leading global cause of morbidity and mortality, particularly in low-resource settings where access to imaging, laborat…
Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
Representation Engineering analyses often characterize refusal using static directions extracted from terminal or pooled representations. W…
Cryptographic Registry Provenance: Structural Defense Against Dependency Confusion in AI Package Ecosystems
Dependency confusion attacks exploit a structural gap in software distribution: once a package is installed, there is no cryptographic proo…
PHALAR: Phasors for Learned Musical Audio Representations
Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard tem…
Governed Metaprogramming for Intelligent Systems: Reclassifying Eval as a Governed Effect
AI systems increasingly synthesize executable structure at runtime: LLMs generate programs, agents construct workflows, self-improving syste…
GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations
Benchmarks like GSM8K are popular measures of mathematical reasoning, but leaderboard gains can overstate true capability due to memorizati…
Post-training makes large language models less human-like
Large language models (LLMs) are increasingly used as surrogates for human participants, but it remains unclear which models best capture h…
Tool Calling is Linearly Readable and Steerable in Language Models
When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. As…
VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning
Multi-modal learning has attracted great attention in visual-text tasks. However, visual-tabular data, which plays a pivotal role in high-s…
Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan
The diachronic evolution from Latin to the Romance languages involved a restructuring of the grammatical gender system from a tripartite co…
From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs
Large-scale AI training is now fundamentally a distributed systems problem, and hardware failures have become routine operating conditions…
Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation
Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps be…
GraphIP-Bench: How Hard Is It to Steal a Graph Neural Network, and Can We Stop It?
Graph neural networks (GNNs) deployed as cloud services can be stolen through model-extraction attacks, which train a surrogate from query…
MinT: Managed Infrastructure for Training and Serving Millions of LLMs
We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT ta…
Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict
The Context-Compliance Regime in Retrieval-Augmented Generation (RAG) occurs when retrieved context dominates the final answer even when it…
Access Timing as Scaffolding: A Reinforcement Learning Approach to GenAI in Education
In recent years, generative AI (GenAI) in educational settings has become ubiquitous in university students' daily lives, despite its poten…
CitePrism: Human-in-the-Loop AI for Citation Auditing and Editorial Integrity
Editors and reviewers are expected to ensure that manuscripts cite relevant, accurate, current, and ethically appropriate literature, yet m…
Identifiable Token Correspondence for World Models
Token-based transformer world models have shown strong performance in visual reinforcement learning, but often suffer from temporal inconsi…
Echoes in Filter Bubble: Diagnosing and Curing Popularity Bias in Generative Recommenders
Recently, Generative Recommenders (GRs), characterized by a unified end-to-end framework, have exhibited astonishing potential in transform…
Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
A striking geometric disparity has long persisted in the practice of deep learning. While modern neural network architectures naturally exh…
AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning
Rubric-based reward shaping provides interpretable and editable reward signals for fine-tuning LLMs via reinforcement learning (RL), but ex…
FLUIDSPLAT: Reconstructing Physical Fields from Sparse Sensors via Gaussian Primitives
Reconstructing continuous flow fields from sparse surface-mounted sensors is central to aerodynamic design, flow control, and digital-twin…
Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty
Simulation-based testing of self-driving cars (SDCs) typically relies on scripted pedestrian models that do not capture the heterogeneity a…
A Sharper Picture of Generalization in Transformers
We study transformers' generalization behavior on boolean domains from the perspective of the Fourier spectra of their target functions. In…
One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs
Learning rate configuration is a fundamental aspect of modern deep learning. The prevailing practice of applying a uniform learning rate ac…
BioFormer: Rethinking Cross-Subject Generalization via Spectral Structural Alignment in Biomedical Time-Series
Cross-subject generalization in biomedical time-series refers to training on data from some subjects and testing on unseen subjects. The key…
CogAdapt: Transferring Clinical ECG Foundation Models to Wearable Cognitive Load Assessment via Lead Adaptation
Real-time cognitive load assessment is essential for adaptive human-computer interaction but remains challenging due to limited labeled dat…
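Humanoid AI robot startup Atom raises 3 billion yen, aims to "raise Japan's GDP by 1%"
Atom (Koto-ku, Tokyo), which develops humanoid AI robots, announced on May 27 that it has started development and raised a total of 3 billion yen in a seed round. It plans to develop robots usable on manufacturing, logistics, and transportation sites, with mass production as a future goal.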
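Notion releases new developer platform, moving from knowledge sharing to "a platform for collaborating with AI"
Notion Labs announced the "Notion Developer Platform," a set of features for developers. It provides a CLI, the "Workers" runtime environment, an API for integrating external agents, and more, forming a development platform for building AI agents and business workflows in Notion.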
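Securing fair elections: measures against "false" social media posts revealed, with mandatory alteration labels on AI-generated videos and images under ruling and opposition party proposal
The full outline of a proposed amendment to related laws, which a ruling and opposition party council on election campaigning has been considering as a countermeasure against the spread of false and misleading information on social media during election periods, came to light on May 26.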
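"Shōsetsuka ni Narō" (小説家になろう) to require reporting of AI use; posting blocked from September if unset
The operator of the web novel posting site "Shōsetsuka ni Narō" announced on May 26 that it will make setting the AI usage status for creative works mandatory. In a new setting to be added on June 9, authors choose from four categories according to the degree of AI involvement. Works with heavy AI use must disclose it in the keyword field and elsewhere.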
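Mazda overhauls its integrated storage platform, reducing operational load with an eye to future AI use
Mazda has deployed Dell Technologies' "Dell PowerScale" storage to build an integrated storage platform for model-based development, CAD, and archiving. It handles growth in design and development data while expanding total storage capacity to about 10 PB and cutting the unit cost of capacity to about one-tenth of the previous level.…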
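New work created by AI: 90% call it a "burden"; the "inconvenient reality" of AIOps
According to one survey, about 75% of information systems staff who adopted AIOps feel their workload has decreased, while about 90% feel that the tasks created by using AI are a "burden."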
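"New graduate recruitment support market" grows as the young population shrinks: how will AI change hiring?
As the young population declines, securing talented people has become a challenge. According to Yano Research Institute, the market for new graduate recruitment support services is on an expansion trend. How is AI changing this market?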
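Fujitsu develops AI agent technology that evolves as business operations change
Keeping AI agents in use for corporate operations subject to continual legal revisions and specification changes has required ongoing tuning by specialists. How does Fujitsu's "self-evolving multi-AI agent technology" change that premise?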
DuckDuckGo installs are up 30% as users reject being ‘force-fed’ Google’s AI Search
Google overhauled Search at I/O 2026, replacing blue links with AI agents. The backlash has been swift. DuckDuckGo app installs spiked 30%…
OpenRouter more than doubles valuation to $1.3B in a year
OpenRouter has raised a $113 million Series B led by CapitalG. Its 5x growth in usage over six months indicates the multi-AI-model future i…
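Japanese industry body of publishers, anime studios, and others issues statement on "rights infringement by AI": "a problem that cannot be overlooked"
The Content Overseas Distribution Association (CODA), a body made up of publishers, anime production companies, and others, has issued a statement on copyright infringement by generative AI. It calls on businesses that develop AI or provide AI services to protect rights, among other requests.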
This startup is betting India’s gig economy can train the world’s robots
Human Archive, a startup founded by UC Berkeley and Stanford researchers, is paying gig workers in India to wear camera-equipped caps and s…
2026-05-26 (772 items)
Universal Music Group and TikTok renew agreement to combat unauthorized AI music
For years, UMG has pushed platforms, streaming services, and AI companies to implement stricter content moderation policies.
TechCrunch Disrupt 2026 Early Bird ticket rates end May 29
Save up to $410 on your TechCrunch Disrupt 2026 pass before prices increase on May 29 at 11:59 p.m. PT. Register here to join the tech epic…
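Huawei targets "1.4 nm-equivalent" semiconductors by 2031, proposes a new law to replace Moore's law
China's Huawei has proposed the "τ scaling law," a new law for semiconductor progress. In place of conventional miniaturization, it compresses signal delay and raises transistor density. Huawei says it will first apply its proprietary circuit technology LogicFolding to the Kirin chip due this autumn and is targeting 1.4 nm-equivalent density in 2031.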
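Beware of "AI fake videos" of Konosuke Matsushita: PHP Institute issues renewed notice following Kenjiro Tsuda's lawsuit against TikTok
PHP Institute, which operates publishing and education businesses, posted a renewed warning on its official X account on May 26, saying that fake videos made by AI-synthesizing images and audio of its late founder Konosuke Matsushita without permission are still circulating.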
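"Flex" lets you use the Gemini API at "half price": what are the caveats, and how does it differ from "Priority"?
Google has added two new service tiers, "Flex" and "Priority," for the "Gemini" API. Flex is said to cost half as much as the standard service tier. How do the two differ, and how should they be used?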
In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models
We are in the midst of large-scale industrial and academic efforts to automate the processes of scientific, technological and creative prod…
Confidence Calibration in Large Language Models
We investigate the calibration of large language models' (LLMs') confidence across diverse tasks. The results of our preregistered study sh…
How Much Thinking is Enough? Quantifying and Understanding Redundancy in LLM Reasoning
Reasoning-capable large language models solve hard problems by emitting long chains of thought, paying heavily in latency, GPU time, and en…
Context: Proactive Goal-Directed Intelligence via Composable Sandboxed Programs, Declarative Wiring, and Structured Interaction
We present Context, the intelligence layer of the Magarshak Architecture, which replaces reactive query-response chatbots with proactive go…
Toward Reliable Design of LLM-Enabled Agentic Workflows: Optimizing Latency-Reliability-Cost Tradeoffs
Modern AI systems increasingly rely on workflows composed of multiple interacting agents, some powered by large language models (LLMs) and…
Quantum Frog: Emergent Cooperation and Difficulty Scaling in a Quantized-Time Cooperative Game
We introduce Quantum Frog, a two-player cooperative game built on a novel quantized-time mechanic in which the environment ad…
BODHI: Precise OS Kernel Specification Inference
The formal verification of operating system kernels requires precise specifications that capture the intended behavior of system calls. Wri…
When Correct Beliefs Collapse: Epistemic Resilience of LLMs under Clinical Pressure
Despite strong medical benchmark accuracy, LLMs can exhibit severe multi-turn sycophancy in clinical dialogue, abandoning initial correct d…
Practical Quantum CIM Empowerment via All-Domestic-Core Agentic Large Model
Quantum computing devices are recognized as powerful tools for solving NP-complete problems. However, the intricacy of their modeling prese…
Operationalizing Reconstructive Authority: Runtime Construction, Dependency Resolution, and Execution Gating in Autonomous Agent Systems
Autonomous agent systems fail not only due to incorrect decisions, but due to executing decisions whose authority no longer holds at runtim…
Fuzzy, Neutrosophic, and Uncertain Graph Theory: Properties and Applications
This book presents a comprehensive and systematic survey of graph theory under uncertainty, with particular emphasis on the unifying role o…
BoxLitE: A Faithful Knowledge Base Embedding Based on Convex Optimization
Knowledge base (KB) embeddings aim at combining the capability of classical knowledge graph embeddings to generalize the information presen…
Authority Inversion in LLM-Mediated Ubiquitous Systems: When Models Trust Users Over Sensors
Large language models (LLMs) increasingly fuse heterogeneous inputs in ubiquitous systems. Yet, how LLMs implicitly allocate authority when…
DRIVE: Modeling Skills at the Reasoning and Interaction Levels for Web Agents under Continual Learning
Web agents require both high-level reasoning (for task decomposition) and low-level interactions (for page elements manipulation) to conduc…
Residual Drift Dominates Contradiction in Multi-Turn Constraint Reasoning
How do multi-turn reasoning systems fail? The expected answer is logical contradiction, in which the system's maintained state becomes unsa…
MEMOR-E: In-Context and Fine-Tuned LLM Personalization for Alzheimer's Assistive Robotics
Alzheimer's disease is a neurodegenerative disorder marked by progressive declines in memory and language that reduce independence in daily…
A Dynamical Framework for Cognitive Processes Based on Transformations and Semantic Equivalence
This paper proposes a structural and dynamical framework for modeling cognitive processes within a cybernetic perspective. Cognitive states…
Spacetime Formation under Requirements: Contextual Realization and Form-Dependent Probability
Quantum cognition often explains order effects, contextuality, and violations of the law of total probability by replacing classical probab…
Right-Sizing Communication and Recommendation Set Size in AI-Assisted Search
We model the interaction between a user and an AI driven recommendation system. The user initiates the process by conveying preference info…
Accelerating Long-Tail Generation in Synchronous RLHF Training via Adaptive Tensor Parallelism
Reinforcement Learning from Human Feedback (RLHF) has become a key post-training paradigm for improving model quality. However, the synchro…
Stop Comparing LLM Agents Without Disclosing the Harness
This position paper argues that, for long-horizon tasks evaluated across models with comparable frontier capability, the agent execution ha…
Methods for Formal Verification of Agent Skills: Three Layers Toward a Mechanically Checkable Capability-Containment Proof
The companion paper introduced a four-level verification lattice on agent-skill manifests (unverified, declared, tested, formal) and left t…
Machine Psychometrics: A Mathematical Psychology of Artificial Intelligence
Artificial agents now generate behavior rich enough to invite trust, surprise, and concern, yet our evaluation tools still privilege capabi…
From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems
Deploying machine learning in regulated financial environments -- credit risk, fraud detection, and anti-money laundering -- exposes critic…
QUIVER: A Formal Framework for Quantifying Perturbation Propagation and Bifurcation in Compound AI Systems
Compound AI systems that chain multiple LLM calls into directed computation graphs are now the dominant architecture for production AI. Alt…
Low-Cost Labels, Reliable Choices: Rollout-Calibrated Hyper-Heuristics for Job Shop Scheduling
Learning-assisted hyper-heuristics can select among dispatching rules while preserving the feasibility and interpretability of constructive…
LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs
Large Language Models (LLMs) achieve strong performance on logical reasoning benchmarks, yet their reliability remains uncertain. Existing…
Why We Need World Models for AGI: Where LLMs Fail and How World Models May Outperform
Large language models achieve strong performance in language generation and knowledge-intensive tasks, yet remain limited in settings requi…
Saturating Scaling Laws for Equational Discovery: A Phenomenology of Growth Dynamics in Three Toy Substrates with Two Real-World Replications
We investigate growth dynamics in deterministic equational discovery substrates. Across three toy domains (arithmetic, boolean, higher-orde…
Beyond Predefined Learning Objects: A Thinking-Learning Interaction Model for Up-to-Date Autonomous Robot Learning
Autonomous robots operating in open and changing environments cannot always rely on predefined inputs, outputs, and action routines. Althou…
Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security
Agentic AI systems -- Large Language Models (LLMs) augmented with planning, tool use, memory, and long-horizon interactions -- can execute…
Reason--Imagine--Act: Closed-Loop LLM Decision Making with World Models for Autonomous Driving
Large language models (LLMs) are promising for autonomous driving, but semantics-only decision policies can yield physically unsafe behavio…
LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition
The evolution of Large Language Model (LLM) reasoning is bottlenecked by the scarcity of high-quality process data. While self-alignment vi…
EvoSci: A Bio-Inspired Multi-Agent Framework for the Evolution of Scientific Discovery
Large language models (LLMs) have shown strong potential in scientific discovery, yet existing methods still face substantial challenges i…
Breaking the Chains of Probability: Neutrosophic Logic as a New Framework for Epistemic Uncertainty in Large Language Models
Large Language Models (LLMs) are predominantly governed by probabilistic frameworks in which the sum of outcome probabilities is constraine…
EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions
Coding agents are increasingly used as iterative development partners, but most benchmarks still evaluate one specification followed by one…
SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills
Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such…
MAPLE: Multi-State Aggregated Policy Evaluation for AlphaZero in Imperfect-Information Games
Imperfect-information games (IIGs) are challenging, as players must make decisions without fully observing the true game state. While Alpha…
HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models
Multi-step reasoning remains a central challenge for large language models: single-pass generation is efficient but lacks accuracy; tree-se…
Neuro-Inspired Inverse Learning for Planning and Control
We present a neuro-inspired framework for embodied planning and control. Building on three principles that enable fast and highly effective…
Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs
Current safety alignment of foundation models largely follows a one-size-fits-all paradigm, applying the same refusal policy across…
Inference Time Context Sparsity: Illusion or Opportunity?
Sparsity has long been a central theme in LLM efficiency, but its role in context processing remains unresolved. As LLM workloads shift tow…
EPPC-OASIS: Ontology-Aware Adaptation and Structured Inference Refinement for Electronic Patient-Provider Communication Mining in Secure Messages
Secure patient-provider messages contain clinically important communication behaviors that are difficult to characterize manually at scale.…
A Sober Look at Agentic Misalignment in Automated Workflows
We study a class of emergent misalignment in multi-agent systems (MAS), with a focus on automated workflows, which we refer to as agentic misa…
When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs
Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinfo…
Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks
As Large Language Models (LLMs) transition from research environments to production deployments, evaluating their performance against stric…
Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows
Large Language Models (LLMs) are increasingly deployed as autonomous agents that reason, use tools, and act over multiple steps. Yet most h…
How Well Do Models Follow Their Constitutions?
Frontier AI developers now train models against long written behavioral specifications, such as Anthropic's constitution (Anthropic, 2025a)…
Toward Enactive Artificial Intelligence
In this paper, we advocate for incorporating enactive approaches to perception and cognition into artificial intelligence (AI). Enactive ap…
Safety-Oriented Routing Analysis of Mixtral MoE Under Benign and Harmful Prompts
Sparse mixture-of-experts (MoE) language models activate only a small subset of parameters for each token, making router behavior a central…
When Does Synthetic Patent Data Help? Volume-Fidelity Trade-offs in Low-Resource Multi-Label Classification
We study when LLM-generated synthetic data helps low-resource multi-label patent classification, separating true synthetic value from the c…
Adaptive Human-AI Coordination via Hierarchical Action Disentanglement
Human-AI collaboration requires agents that can adapt to diverse partner behaviors and skill levels while remaining robust to unseen partne…
Partner-Aware Hierarchical Skill Discovery for Robust Human-AI Collaboration
Multi-agent collaboration, especially in human-AI teaming, requires agents that can adapt to novel partners with diverse and dynamic behavi…
Distilling Game Code World Model Generation into Lightweight Large Language Models
Large Language Models (LLMs) have shown great ability in generating executable code from natural language, opening the possibility of autom…
A governance horizon for ethical-use constraints in open-weight AI models
Ethical constraints on open-weight AI models are both a reflection of societal concerns and a foundation for AI governance policy. They are…
Understanding and Mitigating Premature Confidence for Better LLM Reasoning
Long chains of thought (CoT) from current language models frequently contain logical gaps and unjustified leaps, limiting the gains from ad…
ConceptM$^3$oE: Concept-Guided Multimodal Mixture of Experts for Interpretable Computational Pathology
Healthcare models are transitioning from unimodal prediction toward multimodal reasoning over heterogeneous diagnostic inputs. In computati…
Advancing Graph Few-Shot Learning via In-Context Learning
Graph few-shot learning, which aims to classify nodes from novel classes with only a few labeled examples, is a widely studied problem in g…
The Model Is Not the Product: A Dual-Pillar Architecture for Local-First Psychological Coaching
Existing language model applications struggle to meet the demand for emotionally oriented support, primarily due to their inability to main…
JT-SAFE-V2: Safety-by-Design Foundation Model with World-Context Data
We introduce JT-Safe-V2, a large language model designed to advance the safety and trustworthiness of foundation models, extending our prev…
Benchmarking the Limits of In-Context Reinforcement Learning for Ad-Hoc Teamwork
In-Context Reinforcement Learning (ICRL) has enabled foundation agents to adapt instantaneously to novel tasks, yet its efficacy in Ad-Hoc…
SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent
Long-horizon agentic reasoning requires large language models to act over long interaction histories containing thoughts, tool calls, obser…
SPACE: Unifying Symmetric and Asymmetric Routing Problems for Generalist Neural Solver
Generalist neural routing solvers have shown great potential in solving diverse vehicle routing problems (VRPs) with a unified model. Howev…
AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning
Recent progress on long-horizon agentic tasks has been driven largely by scaling up individual agents through stronger models, better tools…
TIGER: Text-Informed Generalized Enzyme-Reaction Retrieval
Enzyme-reaction retrieval is a fundamental problem in computational biology, underpinning enzyme characterization, reaction mechanism eluci…
Market Regime Council for Dynamic Credit Assignment in Multi-Agent LLM Decision Systems
Multi-agent LLM decision systems for portfolio management still lack a principled way to assign credit across specialist agents, remain vul…
Reasoning as an Attack Surface: Adaptive Evolutionary CoT Jailbreaks for LLMs
Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in reasoning and generation tasks and are increasingly deployed in…
Hypothesis Generation and Inductive Inference in Children and Language Models
Real world decision-making requires constructing mental models under uncertainty over evidence, over the underlying causal rules, and over…
DemoEvolve: Overcoming Sparse Feedback in Agentic Harness Evolution with Demonstrations
Agent harness evolution improves frozen language-model agents by modifying the executable structures around them. We study this paradigm as…
Emission-Aware Reinforcement Learning for Sustainable Electric Vehicle Charging and Carbon Dioxide Reduction Under Varying Renewable Penetration
The rapid growth of Electric Vehicle (EV) adoption challenges power distribution networks through peak load spikes, voltage instability, an…
Beyond Control-Flow: Integrating the Resource Perspective into Multi-Collaborative Process Modeling from Text
Process modeling is a sub-domain of Business Process Management (BPM) focused on the translation of process artifacts into formal models. T…
PALoRA: Projection-Adaptive LoRA for Preserving Reasoning in Large Language Models
Efficiently updating Large Language Models (LLMs) with new or evolving factual knowledge remains a central challenge, as even parameter-eff…
Jailbreak to Protect: Buffering and Reinforcing via Temporary Jailbreaking for Safe Fine-Tuning in Large Language Models
Fine-tuning-as-a-Service (FaaS) enables personalization of large language models (LLMs), but it can weaken safety-alignment under harmful f…
Summoning the Oracle to Slay It: Mitigating Look-Ahead Bias in Financial Backtesting with Large Language Models
Backtesting large language models (LLMs) on historical financial data is unreliable because pre-training cuts off after the events happened…
Associations between echocardiographic traits and AI-ECG predictions of heart failure
Artificial intelligence-enabled electrocardiography (AI-ECG) can detect heart failure (HF), including disease not captured by left ventricu…
HeartBeatAI: An Interpretable and Robust Deep Learning Framework for Multi-Label ECG Arrhythmia Detection
While Deep Learning (DL) enhances automated electrocardiogram (ECG) analysis, clinical deployment is hindered by class imbalance and the ge…
Learning to Reason Efficiently with A* Post-Training
Many applications of large language models (LLMs) require deductive reasoning, yet models frequently produce incorrect or redundant inferen…
Hera: Learning Long-Horizon Coordination for Device-Cloud Collaborative LLM Agents
Large language model (LLM) agents excel at solving complex long-horizon tasks through autonomous interaction with environments. However, th…
Agent-as-Peer-Debriefer: A Multi-Agent Framework with Perspective-Based Refinement for Qualitative Analysis
Large language models (LLMs) are increasingly used for qualitative data analysis (QDA), yet their outputs often miss the depth and nuance o…
Lattice theory and algebraic models for deep convolutional learning based on mathematical morphology
We develop a rigorous algebraic framework for deep convolutional architectures, CNNs, ResNets, and encoder-decoder networks such as UNet,…
GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration
While large language models (LLMs) hold transformative potential for medicine, their reasoning robustness and safety in real-world clinical…
AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models
Rapid advances in audio-video (AV) generation have enabled high-fidelity synthesis with synchronized sound, particularly for human-related…
Beyond Inference-Only Deployment: Comparing Weight-Based Consolidation Against Cascading Compaction
Major LLM platforms deploy models in an inference-only configuration: the model serves requests but never updates per-user weights. Users m…
Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework
LLMs have achieved remarkable success in complex reasoning tasks, yet current evaluation approaches predominantly rely on final-answer corr…
When Mean CE Fails: Median CE Can Better Track Language Model Quality
Mean cross-entropy is the standard validation metric for language models, but it can fail to track model quality during training. We examin…
Exploration of Perceptual Speech Features for Clinical Decision-Support in Mental Health Care
Speech and language technologies offer valuable opportunities for supporting mental health assessment through objective and interpretable c…
Emotional intelligence in large language models is fragmented across perception, cognition, and interaction
As large language models (LLMs) are increasingly integrated into emotionally sensitive domains, the structural integrity of their emotional…
MDIA: A Multi-Agent Diagnostic Intelligence Pipeline on HealthBench Professional
Most reported gains on agentic-LLM clinical benchmarks are often attributed to prompt engineering, yet our results suggest that larger impr…
Fundamental Limitation in Explaining AI
While large-scale models such as LLMs and diffusion models have achieved practical success, public institutions have emphasized the importa…
Hylos: Operability Contracts for Model-Native Spatial Intelligence
Foundation models can increasingly describe, reconstruct, and generate 3D objects, assemblies, scenes, and environments, but visually plaus…
Automated Detection and Classification of Delusion-related Content in Naturalistic Audio Diaries Using Multi-Agent Language Models
Speech monologues recorded in naturalistic settings provide opportunities to characterize mental illness phenomenology and detect symptom e…
Proper Scoring Rules for Agentic Uncertainty Quantification
Language-model agents increasingly emit uncertainty signals throughout a trajectory, but existing agentic UQ evaluations often conflate ran…
Uncertainty Decomposition via Cyclical SG-MCMC and Soft-label Learning for Subjective NLP
Annotator disagreement in emotion classification reflects ambiguity intrinsic to emotion concepts and is essential for predictor-quality as…
PRIMA: Operational Patterns for Resilient Multi-Agent Research with Verifiable Identity and Convergent Feedback
Operating LLMs as coordinated multi-agent research systems over multi-hour runs surfaces failure modes that single-shot evaluation cannot:…
GRAIL: AI translation for scientists application workflow on satellite data
Domain scientists increasingly develop Python scripts to analyze satellite imagery but they lack scalability to large-scale data. This pape…
PANDO: Efficient Multimodal AI Agents via Online Skill Distillation
Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, off…
CoRe-Code: Collaborative Reinforcement Learning for Code Generation
Large language models (LLMs) have achieved strong performance in code generation, but most methods rely on autoregressive decoding without…
Agent Manufacturing: Foundation-Model Agents as First-Class Industrial Entities
Manufacturing has passed through four widely recognized paradigms - mechanization, electrification, programmable automation, and Smart Manu…
Test-Time Deep Thinking to Explore Implicit Rules
With the continuous advancement of Large Language Models (LLMs), intelligent agents are becoming increasingly vital. However, these agents…
Geo-Expert: Towards Expert-Level Geological Reasoning via Parameter-Efficient Fine-Tuning
While general-purpose Large Language Models (LLMs) applied to Geology often hallucinate when reasoning about subsurface structures and deep…
Solving Combinatorial Counting Problems with Weighted First-Order Model Counting
Combinatorial counting problems pervade artificial intelligence, statistics, and discrete mathematics. Whether the task is enumerating subs…
Clustering as Reasoning: A $k$-Means Interpretation of Chain-of-Thought Graph Learning
Chain-of-Thought (CoT) prompting has shown promise in enhancing the reasoning capabilities of large language models (LLMs) on text-attribut…
Inverting the Shield: Systematically Generating Safety Tests from Policy Specifications
The widespread integration of Large Language Models (LLMs) necessitates rigorous and systematic safety evaluation. Existing paradigms eithe…
TaBIIC2: Interactive Building of Ontological Taxonomies using Weighted Self-Organizing Maps
Ontologies represent the conceptual knowledge of a domain. At the core of an ontology is the taxonomy of concepts and subconcepts that repr…
ProActor: Timing-Aware Reinforcement Learning for Proactive Task Scheduling Agents
Proactive task-oriented agents must autonomously anticipate user needs, identify actionable opportunities, and trigger software actions at…
Noise-Robust Financial Numerical Entity Attribute Tagging
Financial Numerical Entity (FNE) understanding aims to recover the meaning of numerical mentions in financial reports. Existing studies pri…
Energy Shields for Fairness
Runtime fairness is not a one-time constraint but a dynamic property evaluated over a sequence of decisions. To ensure fairness at runtime,…
Towards Multi-Turn Dialog Systems for Industrial Asset Operations and Maintenance
Industrial asset operations and maintenance question answering is inherently multi-turn, iterative, and highly dependent on external tool i…
Mitigating Object Hallucinations in Vision-Language Models through Region-Aware Attention Recalibration
The generation of factually incorrect objects, commonly known as object hallucination, remains a persistent challenge in Large Vision-Langu…
NeurIPS: Neuro-anatomical Inductive Priors for Sphere-based Brain Decoding
Current fMRI decoders face a performance-fidelity trade-off where efficient ID encoders outperform geometrically faithful surface-based mod…
Privacy-Preserving Local Language Models for Longitudinal Data Retrieval in Chronic Dermatologic Disease: Implementation in Pemphigus Patients
Chronic dermatologic diseases such as pemphigus require long-term follow-up, generating extensive longitudinal clinical documentation that…
AION: Next-Generation Tasks and Practical Harness for Time Series
Time series research is moving beyond fixed forecasting benchmarks toward realistic tasks that combine prediction, contextual reasoning, to…
Evolutionary Enhanced Multi-Agent Reinforcement Learning for Cooperative Air Combat
As modern air combat evolves toward beyond-visual-range (BVR) multi-aircraft cooperative engagements, autonomous decision-making for unmann…
RECTOR: Priority-Aware Rule-Based Reranking for Compliance-Aware Autonomous Driving Trajectory Selection
Autonomous driving stacks must pick one trajectory from a multi-modal candidate set; choosing by model confidence ignores safety, traffic-l…
Trust but Verify: Prover-Verifier Deliberation for Selective LLM Prediction
Reliably knowing when a language model is correct is almost as important as being correct. We introduce prover-verifier deliberation (PVD),…
Beyond the Frontier: Stochastic Backtracking for Efficient Test-Time Scaling
Test-time scaling improves language model reasoning by spending additional compute to explore multiple solution trajectories. The key chall…
Representation Without Control: Testing the Realization Effect in Language Models
Large language models are increasingly used as behavioral simulators, but it remains unclear when their outputs reflect human-like cognitiv…
SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking
Mobile GUI agents powered by large language models have progressed rapidly, creating urgent needs for realistic and comprehensive evaluatio…
SpecAlign: A Semantic Alignment Framework for SystemVerilog Assertion Generation
Existing Large Language Model (LLM) approaches to SystemVerilog Assertion (SVA) generation primarily focus on syntactic validity and formal…
DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMs
Multi-agent LLM systems improve reasoning by combining outputs from multiple agents, but interaction-heavy methods can introduce error prop…
Boosting Inference with Guided Reasoning: Stochastic Exploration for Recursive Models
Recent work on recursive architectures has shown that tiny neural networks can be surprisingly powerful on structured reasoning tasks. The…
Meta-Agent: From Task Descriptions to Verified Multi-Agent Systems
AI agents are increasingly used to solve complex, multi-step tasks, but existing multi-agent frameworks remain brittle as workflows grow in…
FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization
Large language models (LLMs) are increasingly used for optimization modeling and solver-code generation, yet practical operations research…
LipoAgent: Coordinating Fine-Tuned LLM Agents for Safer Lipid Design
Lipid nanoparticles (LNPs) are among the most clinically mature platforms for nucleic acid delivery, yet designing lipids that are both eff…
Whose Alignment? Comparing LLM Process Alignment Across Diverse Organizational Decision Contexts
Aligning AI systems with organizational decision-making is typically framed as a single-target problem: make the model behave like the orga…
AI Cartography: Mapping the Latent Landscape of AI Benchmark Ecosystems
While aggregate leaderboard scores drive AI development, they contain substantial measurement noise whose sources and magnitudes remain unq…
Context-CoT: Enhancing Context Learning via High-Quality Reasoning Synthesis
While LLMs excel at reasoning over prompts using static pretrained knowledge, they struggle significantly with context learning: the ability…
Second Guess: Detecting Uncertainty Through Abstention and Answer Stability in Small Language Models
Large language models often generate confident but incorrect answers rather than abstaining when uncertain. This problem is particularly ac…
Towards end-to-end LLM-based censoring-aware survival analysis
Objective: Survival analysis is central to medical prediction, yet large language models (LLMs) are rarely used as end-to-end survival mode…
CODESKILL: Learning Self-Evolving Skills for Coding Agents
Coding agents produce rich trajectories while solving software-engineering tasks. To enable agent self-evolution, these trajectories can be…
Security of OpenClaw Agents: Fundamentals, Attacks, and Countermeasures
The rapid evolution of large language model (LLM)-driven autonomous agents has given rise to OpenClaw, a new class of open-source agent fra…
A Signal-Language Foundation Model for Broad-Spectrum Cardiovascular Assessment from Routine Electrocardiography
Electrocardiography (ECG) is central to cardiovascular care, but conventional AI models are often restricted to common arrhythmias and may…
ATWL: A Formal Language for Representing, Comparing, and Reusing Visual Analytics Workflows
Visual analytics (VA) workflows are inherently complex, involving data transformation, feature engineering, visual representation, and huma…
Credit Assignment with Resets in Language Model Reasoning
Contemporary reinforcement learning with verifiable reward methods post-train language models on multi-step reasoning by assigning a single…
What Gets Cited: Competitive GEO in AI Answer Engines
AI answer engines generate answers from retrieved pages but cite only a few sources. This makes visibility depend not just on ranking, but…
StructBreak: Structural Cognitive Overload-Induced Safety Failures in MLLMs
Multimodal Large Language Models (MLLMs) excel at structural reasoning yet suffer from a sharp logical brittleness in structural consistenc…
Personalize-then-Store: Benchmarking and Learning Personalized Memory for Long-horizon Agents
Existing large language model (LLM) based memory systems apply universal, static policies that overlook a fundamental reality: the contexts…
ADMFormer: An Adaptive-Decomposition Transformer with Time-Varying Masked Spatial Attention for Traffic Forecasting
Accurate traffic forecasting is essential for intelligent transportation systems, supporting a wide range of real-world applications. Howev…
PHGNet: Prototype-Guided Hypergraph Construction for Heterogeneous Spatiotemporal Forecasting
As a core task in intelligent transportation systems, traffic forecasting plays a critical role in urban traffic management. Accurate traff…
Beyond Query Memorization: Large Language Model Routing with Query Decomposition and Historical Matching
Optimizing the trade-off between predictive performance and computational cost is a central focus in the deployment of Large Language Models…
Uncertainty Reasoning with Large Language Models for Explainable Disease Diagnosis
Clinical decision-making requires reasoning over incomplete, imprecise, and linguistically expressed patient narratives. While large langua…
Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy
Chain-of-thought (CoT) reasoning improves the problem-solving ability of large language models (LLMs), but generated reasoning traces may n…
Back to Parsimonious Latents: Learning Task-Centric World Models from Visual Foundations
World models enable agents to predict future dynamics conditioned on actions, making the choice of latent representation central to plannin…
CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents
Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool-use, and software engineering,…
Insuring Every Action: An Authority Frontier Framework for Runtime Actuarial Control of Autonomous AI Agents
Autonomous AI agents increasingly issue side-effect-bearing actions: database mutations, refunds, payments, external commitments. We propos…
AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions
Autonomous computer use agents that are powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing c…
FLOATBench: A Dataset and Benchmark for Floating Offshore Wind Turbine Tower Fatigue
Most of the world's offshore wind resource lies in waters too deep for fixed-bottom foundations, making floating offshore wind turbines (FO…
Learning to Search and Searching to Learn for Generalization in Planning
Combinatorial generalization remains a central challenge in Deep Reinforcement Learning (DRL). Classical planning provides a simple yet cha…
A Deep Dive into Axiomatic Design -- Part I: Problem Formulation
Problem formulation, translating customer needs and constraints into a minimum set of independent first-level functional requirements, is ar…
Agent-Centric Social Trajectory Prediction: A Free Energy Principle Perspective
Trajectory prediction methods have demonstrated remarkable capabilities in capturing complex motion patterns. However, existing methods rel…
When Can We Trust Early Warnings? Leakage-Excluded Early Outcome Prediction from LMS Interaction Logs
Early-warning models built from Learning Management System (LMS) logs aim to predict end-of-course outcomes early enough to enable timely l…
Behind EvoMap: Characterizing a Self-Evolving Agent-to-Agent Collaboration Network
Agent-to-Agent (A2A) networks enable autonomous AI agents to collaborate by sharing reusable problem-solving instructions. However, how the…
MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning
Vision-language models (VLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex multimodal tasks, but their large para…
From Accounting to Coordination: A Virtual Water-Aware Electricity-Computation-Water Nexus Framework for Data Center Dispatch
The expansion of data centers (DCs) drives a sustained increase in electricity demand and associated water withdrawals at generation sites.…
$D^2$-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing
Despite the emergence of diffusion large language models (D-LLMs) as an alternative to autoregressive large language models (AR-LLMs), safe…
Explore Before You Solve: The Speed-Depth Trade-off in Epistemic Agents for ARC-AGI-3
We systematically investigate all 25 public ARC-AGI-3 games and find that every one is reachable through non-intelligent strategies: 10 in…
LECTOR: Joint Optimization of Scientific Reasoning Graphs and Introduction Generation
AI Scientists have shown promising progress across multiple stages of the research pipeline, among which automatic scientific paper writing…
Neural Scalable Symbolic Search Framework for Complex Logical Queries with Multiple Free Variables
Complex Query Answering (CQA) is a fundamental knowledge representation and reasoning task over incomplete knowledge graphs (KGs). Answerin…
CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists
We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab…
CITYREP: A Unified Benchmark for Urban Representations Across Cities, Tasks, and Modalities
Urban representation learning encodes complex urban environments into general-purpose embeddings for diverse downstream tasks and emerging…
L2IR: Revealing Latent Intent in Graph Fraud Detection
Graph fraud detection has long depended on Graph Neural Networks (GNNs) to propagate and aggregate information across relational data. A cr…
Retrying vs Resampling in AI Control
AI coding scaffolds like Claude Code and Codex use retrying: blocking actions flagged as risky and continuing the trajectory. We s…
VeriTrace: Evolving Mental Models for Deep Research Agents
Deep research agents face vast, interdependent, and pervasively uncertain information. Existing systems explore what evolving intermediate…
Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World
Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's dig…
From Model Scaling to System Scaling: Scaling the Harness in Agentic AI
This paper studies the next major bottleneck in agentic AI as system scaling, not only model scaling: the design of auditable, persistent,…
MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research
We present MobileGym, a browser-hosted, lightweight, fully controllable environment for everyday mobile use, targeting interaction fidelity…
LETS Forecast: Learning Embedology for Time Series Forecasting
Real-world time series are often governed by complex nonlinear dynamics. Understanding these underlying dynamics is crucial for precise fut…
Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study
Tokenizer fertility varies 1.6x across foundation models on Ukrainian legal text, yet this cost-critical dimension is absent from model sel…
Check Your LLM's Secret Dictionary! Five Lines of Code Reveal What Your LLM Learned (Including What It Shouldn't Have)
We show that singular value decomposition of the lm_head weight matrix of a transformer-based large language model -- requiring only five…
AI-Driven Alpha Decay: Algorithmic Homogenization, Reflexive Signal Erosion, and the Paradox of Intelligent Markets
We show that AI-driven investment strategies are inherently self-defeating at scale. As AI adoption rises, three mutually reinforcing chann…
Document Classification Pattern Recognition via Information Fusion: A Systematic Review of Multimodal and Multiview Representation Approaches
Information fusion is used widely to improve document classification by the integration of multiple data sources (multimodal) or representa…
Raon-Speech Technical Report
We present Raon-Speech, a top-performing 9B-parameter speech language model (SpeechLM) for English and Korean speech understanding, answeri…
VineLM: Trie-Based Fine-Grained Control for Agentic Workflows
Agentic workflows interleave configurable LLM stages with tool stages and often include retries or refinement loops. Existing workflow mana…
Agent-Facing Information Design in LLM Tool Registries
LLM tool registries function as unregulated advertising platforms: providers write free-text descriptions that agents use for selection, ye…
Artificial Effort
Real-effort tasks, in which participants perform cognitively costly activities whose outcomes depend on actual performance, are widely used…
Authority Signals in Claude AI Health Citations: A Descriptive Analysis Using the Authority Signals Framework
This study seeks to determine the authority signals used by Anthropic's Claude AI in its presentation of sources when answering consumer he…
High-Risk AI Systems and the Problem of Identity in the European AI Act
The EU Artificial Intelligence Act (AIA) establishes a lifecycle governance regime for high-risk AI systems built around ex-ante conformity…
Catching The Correct Answer Trap: Characterising AI Tutor Blind Spots When Analysing Student Reasoning
Intelligent tutoring systems increasingly provide automated feedback on student work, but robust feedback requires assessing reasoning, not…
KT4EQG: Personalized Exercise Question Generation via Knowledge Tracing
Educational Question Generation (EQG) aims to synthesize customized exercise questions that enhance student learning. An effective EQG syst…
AI-Driven Controlled Environment Agriculture as Resilient Infrastructure for U.S. Fresh-Produce Supply Chains
Climate volatility, regional production concentration, labor constraints, cyber risk, and dependence on long-distance fresh-produce supply…
SODE: Analyzing Social Dynamics in LLM Agents
As Large Language Models (LLMs) evolve into interactive agents, understanding their behavioral alignment within human social dynamics becom…
EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs
Audio Large Language Models (ALLMs) are highly vulnerable to real-world noise, which often induces severe semantic drift and hallucinations…
AI in the Enterprise: How People Use M365 Copilot Chat
M365 Copilot is used every week by millions of people across more than a million companies around the world as part of their workflows. Uni…
Multimodal Alignment and Preference Optimization for Zero-Shot Conditional RNA Generation
The design of RNA molecules that interact with specific proteins is a critical challenge in experimental and computational biology. Despite…
Multi-market value-stacking: Battery control for combined imbalance participation and non-uniform FCR bidding
The growing share of Renewable Energy Sources (RES) in modern power systems increases both grid imbalances and frequency deviations, reinfo…
TriVAL: A Tri-Validation Framework for Faithful Automatic Optimization Modeling
Optimization modeling serves as the pivotal bridge between natural-language problem descriptions and optimization solvers, and remains a co…
Sensing Intelligence as a Trainable Metamaterial Property
In biological systems, sensing is not performed by the brain alone: the body deforms, vibrates, and filters external stimuli before they ar…
Metacognition Should Be the Scientific Framework for Bounded and Effective Self-Governance in Generative AI
Generative AI research increasingly confronts a shared problem: systems must sustain yet govern their own generative activity when uncertai…
Parameter Efficient Multi-Class Intelligent Scheduling for Multimodal Online Distributed Industrial Anomaly Detection
Industrial anomaly detection has attracted significant attention as a fundamental challenge in industrial systems. The rapid advancement of…
MemForest: An Efficient Agent Memory System with Hierarchical Temporal Indexing
Memory is a fundamental component for enabling long-context LLM agents, supporting persistent state across interactions through a continuou…
A World Model of Radiologist Reading for Medical Image Representation Learning
Radiologist eye-tracking data provide a rich record of how experts search, compare, and accumulate evidence during image reading; yet, exis…
Nano World Models: A Minimalist Implementation of Future Video Prediction
World models have become a central paradigm for learning predictive simulators that support generation, planning, and decision-making. Yet,…
RAW: Robust Avatar Watermarking -- Benchmarking and Baseline
Digital avatar watermarking presents unique challenges: avatars are routinely post-processed with background replacement, reframing, and fo…
Task-Aligned Self-Supervised Learning for Medical Image Analysis: A Systematic Review and Practical Design Guidelines
Self-supervised learning (SSL) has emerged as a promising paradigm for addressing the annotation bottleneck in medical imaging by learning…
IVR-R1: Refining Trajectories through Iterative Visual-Grounded Reasoning in Reinforcement Learning
Multimodal large language models via reinforcement learning (RL) have demonstrated remarkable capabilities in complex visual reasoning task…
Diff-Instruct with Diffused Reward: Towards Principled One-step Generator RL
Recent advances in one-step text-to-image generation have enabled real-time synthesis with remarkable efficiency and quality. Previous rein…
Harnessing AtomisticSkills for Agentic Atomistic Research
Computational materials science and chemistry span vast knowledge domains and fractured software ecosystems. Although large language models…
Remote sensing data imputation using deep learning for multispectral imagery
Remote sensing techniques have been increasingly utilised in aquatic applications in recent years. A common challenge in using optical sate…
ActQuant: Sub-4-bit Action-Guided Quantization for Vision-Language-Action Models
Vision-Language-Action (VLA) models exhibit remarkable action generation for embodied intelligence, but their heavy compute makes deployment…
SA-Kura: An Energy-Efficient Systolic Array Accelerator for Locally-Coupled Kuramoto Drift in Diffusion Sampling
Diffusion inference remains costly for edge deployment, yet existing accelerators focus almost exclusively on score networks because standa…
Machine Intelligence that Understands Visual and Linguistic Information and Interacts with Humans and Environments
Advancements at the intersection of computer vision and natural language processing are crucial for applications like assistive tech, multi…
WTKO-CNN: Deep Learning Reveals Sequence Motifs Distinguishing Wild-Type and Knockout ATAC-seq Peaks
Chromatin regulators can alter transcriptional programs by modifying the accessibility of regulatory DNA elements. Understanding how regula…
Mode-as-Sequence: Translating Multimodal Motion Prediction into Unified Sequential Mode Modeling
Multimodal motion forecasting is inherently under-supervised: each training scene provides only one realized future, yet multiple plausible…
Iterative Refinement Neural Operators are Learned Fixed-Point Solvers: A Principled Approach to Spectral Bias Mitigation
Neural operators serve as fast, data-driven surrogates for scientific modeling but typically rely on a monolithic, single-pass inference pr…
Hidden-State Privacy Has an Empty Middle
Of 1,536 Gaussian release covariances we tested for single-layer hidden-state privacy, zero achieve both moderate utility and moderate…
LLM-AutoSciLab: Closed-Loop Scientific Discovery via Active Experimentation with LLMs
Scientific discovery is a closed-loop process in which hypotheses guide data acquisition and observations refine the hypothesis space. Yet…
A Large-Scale Dataset and Benchmark: Do Protein-Ligand Models Learn Binding Sites or Just Binding Likelihood?
Protein-ligand modeling underpins computational drug discovery and molecular design. Existing protein-ligand benchmarks typically evaluate…
Mixture of Complementary Agents for Robust LLM Ensemble
Multi-AI collaboration, such as ensembling or debating large language models (LLMs), is a promising paradigm for aggregating information an…
More Skills, Worse Agents? Skill Shadowing Degrades Performance When Expanding Skill Libraries
Skill libraries allow LLM agents to load task-specific instructions on demand, letting non-expert users solve domain-specific tasks through…
Truthful Online Preference Aggregation for LLM Fine-Tuning in Mobile Crowdsourcing
To better serve users' demands in mobile applications (e.g., navigation), mobile crowdsourcing platforms can iteratively align large langua…
Cascade-KDE: Robust Time-Series Restoration under Out-of-Distribution Impulse Corruptions
Real-world time-series data in industrial sensing, healthcare, and energy systems is often corrupted by a mixture of Gaussian noise and occ…
Feature Lottery? A Bifurcation Theory of Concept Emergence
Neural networks acquire structured representations at specific moments during training, yet identifying these transitions typically relies…
Signs Beat Floats: Low-Rank Double-Binary Adaptation for On-Device Fine-Tuning
On-device adaptation of large language models commonly keeps a quantized base model frozen while training and deploying a small, task-speci…
Spectral Probe-Circuits: A Three-Step Recipe for Identifying Attention-Head Circuits in Pretrained Transformers
We present a three-step recipe for identifying attention-head circuits in pretrained transformers. A per-head spectral signal -- the time-i…
Federated Learning over Human-Body Communication for On-Body Edge Intelligence: A Survey, Taxonomy, and BODYFED-HBC Scheduling Vignette
Human-body communication (HBC) is a promising physical substrate for wearable body-area networks because it can localize communication arou…
Generative Representation Learning on Hyper-relational Knowledge Graphs via Masked Discrete Diffusion
Hyper-relational knowledge graphs (HKGs) effectively represent complex facts. While inferring new knowledge in HKGs is a critical problem,…
When the Manual Lies: A Realistic Benchmark to Evaluate MCP Poisoning Attacks for LLM Agents
The rise of tool-using Large Language Model (LLM) agents, standardized by protocols like the Model Context Protocol (MCP), has unlocked unp…
Not All Transitions Matter: Evidence from PPO
Training a reinforcement learning agent on-policy means collecting fresh experience at every update, and that experience comes with a hidde…
TRACER: A Semantic-Aware Framework for Fine-Grained Contamination Detection in Code LLMs
Data contamination is a known threat to the reliability of model evaluation. However, it remains underexplored in code large language model…
Verified SHAP: Provable Bounds for Exact Shapley Values of Neural Networks
Shapley additive explanations (SHAP) are widely recognised as computationally intractable for neural networks, since they induce an exponen…
The Time is Here for Just-in-Time Systems: Challenges and Opportunities
Core systems like key-value stores have historically taken years to build, and are designed to be general so as to amortize cost across dep…
Overcoming "Physics Shock" in Earth Observation: A Heteroscedastic Uncertainty Framework for PINN-based Flood Inference
Rapid and accurate flood extent mapping from Remote Sensing data, such as Synthetic Aperture Radar (SAR), is critical for operational disas…
MASt3R-Nav: WayPixel Navigation in Relative 3D Maps
Visual navigation ability is strongly tied to its underlying representation of the world. Unlike classical 3D maps that require globally-co…
Empirical Analysis and Detection of Hallucinations in LLM-Generated Bug Report Summaries
Large Language Models (LLMs) are increasingly used to generate summaries of software bug reports, including sections such as Steps-to-Repro…
Understanding Conversational Patterns in Multi-agent Programming: A Case Study on Fibonacci Game Development
Large Language Models (LLMs) are increasingly applied to software engineering (SE), yet their potential for autonomous, role-oriented colla…
An Interpretable CF-RL-TOPSIS Fusion Model for Skills-Aware Talent Recommendation
Effective skills-aware talent recommendation must balance behavioral transition patterns, trajectory-sensitive adaptation, and inspectable…
Knowledge Graph Modulated Deep Learning for Limited-Sample Clinical Data Analysis
Biological systems are governed by structured molecular interactions, where pathways, regulatory circuits, and functional gene relationship…
PromptAudit: Auditing Prompt Sensitivity in LLM-Based Vulnerability Detection
Large language models are increasingly used for vulnerability detection, yet their reliability under different prompt formulations remains…
Extracting Training Data from Diffusion Language Models via Infilling
Memorization in large language models has been studied almost exclusively through prefix-conditioned extraction, a natural choice for autor…
Human-AI Collaboration in Science at Scale: A Global Large-scale Randomized Field Experiment
Collaboration is the defining mode of modern science, yet its core mechanism -- feedback -- remains hard to observe, difficult to scale, an…
AvalancheBench: Evaluating Enterprise Data Agents Through Latent World Recovery
We introduce AvalancheBench, a benchmark for evaluating enterprise data agents through latent world recovery. AvalancheBench improve…
Filtered Posterior Mean Collections: A Unified Framework for Analytical Models of Diffusion Generalization
The neural-network denoising functions which form the backbone of image diffusion models are remarkably consistent in their generalization…
Teaching Through Analogies: A Modular Pipeline for Educational Analogy Generation
Analogies help learners understand unfamiliar concepts by relating them to known concepts. Despite recent advances, large language models (…
Distributionally Robust Transfer Learning with Structurally Missing Covariates, with Application to Cross-National Cardiac Arrest Prediction
Deploying clinical prediction models across healthcare systems often fails when key training covariates are unavailable at deployment and l…
Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild
Evaluation harnesses are software systems that orchestrate model evaluation by managing model invocation, data loading, metric computation,…
Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning
Monitoring autonomous large language model (LLM) agents for covert malicious behavior is challenging due to delayed, context-dependent, and…
Unlocking Apple's Private Cloud Compute: An Analysis of Privacy-Preserving Artificial Intelligence
Many existing Artificial Intelligence (AI) solutions on mobile devices rely on an extensive collection of sensitive data, raising privacy c…
GIBLy: Improving 3D Semantic Segmentation through an Architecture-Agnostic Lightweight Geometric Inductive Bias Layer
In 3D scene understanding, deep learning models rely on large models and extensive training to capture basic geometric structures that are…
Improving Labeling Consistency with Detailed Constitutional Definitions and AI-Driven Evaluation
Many automated labeling pipelines classify inputs into categories defined by a written specification, content moderation being a prominent…
Attested Tool-Server Admission: A Security Extension to the Model Context Protocol
The Model Context Protocol (MCP) standardizes how a large-language-model (LLM) agent and an external tool server exchange messages, but not…
CRISP -- Clustering-Based Redundancy-Reduced Instance Sampling for Pathology Case Representation and Retrieval
Digital pathology archives increasingly contain multiple whole-slide images (WSIs) per case, capturing spatially distinct tumour regions an…
An Interactive Paradigm for Deep Research
Recent advances in large language models (LLMs) have enabled deep research systems that synthesize comprehensive, report-style answers to o…
Concept Drift Adaptation Using Self-Supervised and Reinforcement Learning In Android Malware Detection
Android malware detectors often degrade after deployment because of concept drift, while full retraining at each maintenance step is costly…
Benchmarking Patent Embeddings: A Multi-Task Evaluation of 22 Models Across Retrieval, Classification, and Clustering
Which fine-tuning signals improve patent embedding models, and do gains transfer across patent landscapes? We benchmark 22 embedding models…
An Empirical Evaluation of LLM-Generated Code Security Across Prompting Methods
The growing use of Large Language Models (LLMs) for automated code generation has enhanced software development efficiency, but often at th…
Enhancing Reliability in LLM-Based Secure Code Generation
Large language models (LLMs) are widely used for code generation, but their security reliability remains inconsistent across languages and…
ArtSplat: Feed-Forward Articulated 3D Gaussian Splatting from Sparse Multi-State Uncalibrated Views
Articulated object reconstruction from sparse-view images is an ill-posed problem that requires simultaneous inference of geometry and unde…
ChaosBench-Logic v2: Evaluating LLM Logical Reasoning over Dynamical Systems at Scale
Standard accuracy on binary reasoning benchmarks hides critical failure modes: prior collapse, inconsistency under paraphrase, and inabilit…
ScaleAcross Explorer: Exploring Communication Optimization for Scale-Across AI Model Training
The rapid scaling of large language model training requires distributing GPU resources across multiple data center buildings and regions. W…
Treatment Effect Estimation with Differentiated Networked Effect on Graph Data
Estimating individual treatment effect (ITE) from observational graph data is crucial for decision-making in fields such as commerce an…
Assessing the Operational Viability of Foundation Models for Time Series Forecasting
Time series forecasting drives operational decisions in areas like finance, transportation, and energy. While supervised learning approache…
Side-by-side Comparison Amplifies Dialect Bias in Language Models
Language models (LMs) can exhibit systematic biases against speakers based on variations in their dialects, even in the absence of a dialec…
MX-SAFE: Versatile Inference- and Training-Proof Microscaling Format with On-the-Fly Exponent and Mantissa Bit Allocation
As the demand for deep learning grows, cost reduction through quantization has become essential for both training and inference. In 2022, t…
VectorArk: Learning Practical Image Vectorization with Rounded Polygon Representation
Recent vision-language model (VLM)-based approaches have achieved impressive results on image vectorization tasks. However, they are typica…
Generative OOD-regularized Model-based Policy Optimization
We study sequential decision-making with offline reinforcement learning (RL). Traditional offline RL policies may result in out-of-distribu…
Batch Normalization Amplifies Memorization and Privacy Risks
Batch Normalization (BN) is widely adopted to enable faster convergence and more stable training of deep neural networks. However, its impa…
Momentum Streams for Optimizer-Inspired Transformers
The residual update of a pre-norm Transformer layer admits an interpretation as one step of a first-order optimizer acting on a surrogate t…
Temporal Concept Drift in Legal Judgment Prediction: Neural Baselines Across Three Epochs of Ukrainian Court Decisions
Legal NLP benchmarks evaluate models on randomly split data, implicitly assuming that legal language is stationary. We test this assumption…
Code2UML: Agentic LLMs with context engineering for scalable software visualization
Large Language Model (LLM)-based code analysis tools are adopted to automate software documentation tasks. However, the scalability of thes…
Balancing Fairness, Privacy, and Accuracy: A Multitask Adversarial Framework for Centralized Data-Driven Systems
The integration of fairness and privacy in centralized data-driven applications is critical, especially as these systems increasingly influ…
Coarse-to-Fine Domain Incremental Learning with Attentive Distillation for Mining Footprint Segmentation in Multispectral Imagery
Automatically mapping and segmenting global mining footprints using remote sensing and deep learning is critical for monitoring the socio-e…
Robust Fuzzy Multi-view Learning under View Conflict
Trusted multi-view classification aims to deliver reliable fusion for accurate predictions and has recently attracted substantial attention…
FoodMonitor: Benchmarking MLLMs for Explainable Compliance Analysis
As AI-powered compliance monitoring becomes increasingly important in public governance and industrial safety, the ability to provide verif…
Φ-Noise: Training-Free Temporal Video Conditioning via Phase-Based Noise Manipulation
Latent video diffusion models generate videos by progressively transforming Gaussian noise into realistic samples conditioned on text or vi…
Adaptive Punishment for Cooperation in Mixed-Motive Games
Mixed-motive scenarios are ubiquitous in real-world multi-agent interactions, where self-interested agents often defect for immediate rewar…
Grammatically-Guided Sparse Attention for Efficient and Interpretable Transformers
The quadratic complexity of self-attention in Transformer models remains a significant bottleneck for processing long sequences and deployi…
TRAFA: Anticipating User Actions to Reduce Errors in Procedural Tasks with Predictive Feedback
Interactive assistance systems typically provide feedback after an action has been completed, supporting error recovery but not preventing…
Is Decentralized AI Governable? From Regulative Policy to Constitutive Protocol
Every major framework for governing artificial intelligence presupposes an identifiable entity -- a developer, deployer, or operator -- who…
SemanticZip: A Pilot Framework for Lossy Text Compression with LLMs as Semantic Decompressors
Text compression for large language model (LLM) systems is usually framed as token deletion, retrieval, summarization, or exact reconstruct…
AI-Driven Adaptive Adversaries and the Erosion of Cryptographic Trust in Public Key Systems
This paper examines the erosion of Public Key Cryptography (PKC) security under adaptive adversarial optimisation driven by artificial inte…
Rethinking Federated Unlearning via the Lens of Memorization
Federated learning (FL) increasingly needs machine unlearning to comply with privacy regulations. However, existing federated unlearning ap…
PEDESTRIANQA: A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction
Pedestrian intention and trajectory prediction are critical for the safe deployment of autonomous driving systems, directly influencing nav…
PILOT: Policy-Informed Learned Optimization for Adaptive Deep Network Training
Despite the central role of optimization in deep learning, most optimizers rely on update structures whose functional form is fixed before…
Polymorphism Is Rotation: Operational Mechanistic Interpretability from a Two-Layer Transformer to Pythia-70m
Independently trained transformers compute the same function in residual-stream bases that differ by a uniform random rotation on $\mathrm{…
LAPLEX: The FFT of Learnable Laplace Kernels
Fast linear algebra in deep learning usually comes with a choice: fixed geometry and exact computation, as in the Fourier transform, or ada…
Correcting Visual Blur Induced by Attention Distraction to Reduce Hallucinations: Algorithm and Theory
Multimodal large language models (MLLMs) frequently suffer from object hallucinations, yet the visual perceptual mechanism underlying this…
Catching MRI outliers: unsupervised detection and localization of MRI artefacts and clinical anomalies using deep learning
Artificial intelligence is increasingly integrated into radiotherapy workflows, yet such pipelines remain vulnerable to out-of-distribution…
Guarded Repair for Harm-Aware Post-hoc Replacement of LLM Mathematical Reasoning
Post-hoc repair of LLM mathematical reasoning introduces an asymmetric risk: fixing an incorrect reasoning trace is useful, but replacing a…
Measuring the Depth of LLM Unlearning via Activation Patching
Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whethe…
Phase-Aware Wavelet-Based-Scattering Encoder-Decoder for Dense Predictions
Scattering transforms achieve Lipschitz stability and translation invariance, but dense prediction tasks require preserving spatial structu…
Beyond Generative Priors: Minority Sampling with JEPA-Guided Diffusion
Minority sampling aims to generate low-density instances on a data manifold and is of central importance in applications such as medical di…
Demystifying the Mythos or Disrupting Bugonomics? From Zero-Day Asymmetry to Defender Remediation Throughput
Recent demonstrations of large language models producing candidate and confirmed vulnerabilities in production software have renewed the na…
DisDop: Distillation with Domain Priors for Open-Vocabulary Aerial Object Detection
With the widespread application of drones in recent years, object detection of aerial images has attracted increasing attention, especially…
On the Stability and Realizability of Recurrent Polynomial Surrogate Ternary Logic Gate Networks
Recurrent Neural Networks (RNNs) can learn to predict Signal Temporal Logic (STL) verdicts online from partial trajectories, but deploying…
How Many Tools Should an LLM Agent See? A Chance-Corrected Answer
Before an LLM agent can use a tool, a retrieval system must decide which candidate tools to show to the agent. How long should that shortli…
CyBOKClaw: Human-in-the-Loop CyBOK Mapping for Cybersecurity Curriculum
This paper presents CyBOKClaw, an interpretable human-in-the-loop retrieval framework for mapping cybersecurity keywords or phrases (KWoPs)…
VaaWIT: Visual-Aware Adaptation of Large Language Models for Multilingual Web Image Translation
Translating text embedded in Web images is crucial for improving content accessibility and cross-lingual information retrieval, particularl…
Mix-MoE: Improving Multilingual Machine Translation of Large Language Models through Mixed MoEs
Large Language Models (LLMs) have shown great promise in multilingual machine translation (MT), even with limited bilingual supervision. Ho…
Beyond the Aggregation Dilemma: Prior-Retaining Decoupled Learning for Multimodal Graphs
Multimodal Attributed Graph Learning (MAGL) integrates intrinsic node attributes with structural topology via graph aggregation. However, a…
HoloFair: Unified T2I Fairness Evaluation and Fair-GRPO Debiasing
Text-to-Image (T2I) models have made significant strides in visual realism and semantic consistency, yet they often perpetuate and amplify…
The Path Matters: Learning a Token-Commitment Policy for Diffusion Language Models
Diffusion large language models promise faster generation by refining many token positions in parallel, but this parallelism introduces a h…
TS-Skill: A Benchmark for Evaluating Analytical Skills in Time-Series Question Answering
Large language models (LLMs) and time-series language models (TSLMs) are increasingly applied to time-series question answering (TSQA). Unl…
World-State Transformations for Neuro-symbolic Interactive Storytelling
Large Language Models (LLMs) have changed the possibilities of Interactive Storytelling systems that process free-text user input. However,…
Who judges the judges? Governance from metrics: a runtime framework for continuous LLM compliance monitoring
Current approaches to AI compliance treat conformity as a binary, audit-time verdict rather than a continuous, measurable property of produ…
Bilevel Optimization of Synthetic Trajectories for Multi-Turn LLM Fine-Tuning
While LLMs excel at single-turn generation, they struggle with long-horizon, multi-turn interactions. Offline reinforcement learning (RL) o…
Motion-Compensated Weight Compression
Neural network weights are increasingly a bottleneck for deployment, yet most compression pipelines treat layers independently and overlook…
Spectral Retrieval: Multi-Scale Sinc Convolution over Token Embeddings for Localized Retrieval in LLM Multi-Agent Systems
[Abridged] - Spectral Retrieval is a plug-in re-ranking stage that interpolates between per-token MaxSim and mean-pool retrieval through a…
Leveraging pretrained RGB denoisers for hyperspectral image restoration
Hyperspectral image restoration faces several challenges, including limited training data, strong sensor specificity, and high spectral dim…
From Theory to Decision Rule: Calibrating the Noisy-Label Crossover for Vision-Language Model Weak Supervision Across Three Medical-Imaging Benchmarks
Classical noisy-label theory predicts that downstream performance under weak supervision is bounded above by the labeler's accuracy, implyi…
Complement Submodular Information Measures for Balanced and Robust Data Selection
Submodular optimization has become a fundamental paradigm for data selection, retrieval, summarization, and representation learning due to…
CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM
Long-horizon LLM inference turns the key-value (KV) cache into the dominant GPU memory consumer and makes per-token attention increasingly…
Parameter-Efficient VLMs for Gastrointestinal Endoscopy: Medical Image Generation and Clinical Visual Question Answering
The major limitations of gastrointestinal (GI) endoscopy AI systems arise from a shortage of annotated data, strict privacy policies, and s…
Divide-and-Conquer Inference for Large-Scale Visual Recognition with Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities across a wide range of vision language tasks. However, when…
Zero-Shot Parkinson's Disease Detection from Speech: Comparing Large Audio and Language Models
Large audio and language models have recently demonstrated zero-shot reasoning capabilities across various domains. However, it remains unc…
Disentangled Double Machine Learning for Accurate Causal Effect Estimation
Confounding bias is a key challenge in causal effect estimation from observational data. Double Machine Learning (DML) addresses this issue…
Cross-Domain Energy-Guided Diffusion Generation for Off-Dynamics Reinforcement Learning
Off-dynamics offline reinforcement learning seeks to learn a target-domain policy from a large source dataset and a limited target dataset…
Multiscale Real-Time Object Detection in the NMS-Free Era: A Comparative Performance Evaluation of YOLOv8 and YOLO26
Non-Maximum Suppression (NMS) remains a key post-processing step in many real-time object detection pipelines, but it can introduce latency…
Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection
Large language model (LLM) safety classifiers such as Llama Guard are effective at detecting overtly harmful prompts but remain vulnerable…
Adversarial Error Correction for Visual Autoregressive Generation
Visual Autoregressive (VAR) models have emerged as a powerful paradigm for image synthesis by performing hierarchical next-scale prediction…
Tiny Brains, Giant Impact: Uncovering the Keystone Neurons of LLM with Just a Few Prompts
Large language models (LLMs) display strong comprehensive abilities, yet the internal mechanisms that support these behaviors remain insuff…
The Concept Allocation Zone: Tracking How Concepts Form Across Transformer Depth
Concept formation in transformer language models is depth-extended, not a single-layer event: concepts emerge gradually across a contiguous…
DBPnet: Damper Characteristics-Based Bayesian Physics-Informed Neural Network for Wheel Load Estimation
Advanced driver assistance systems (ADAS) play an important role in modern automotive intelligence, significantly enhancing vehicle safety…
Towards a Universal Causal Reasoner
Despite the importance of causal reasoning, training LLMs to reason causally remains underexplored. Existing data efforts mostly focus on b…
When Reasoning Hurts: Source-Aware Evaluation of Frontier LLMs for Clinical SOAP Note Generation
Reasoning-enabled LLMs perform strongly on medical reasoning benchmarks, but it remains unclear whether these gains transfer to structured…
On the Impact of Class Imbalance on the Learning Dynamics of Deep Neural Networks: An Intuitive Insight
Class imbalance in deep neural networks (DNNs) has witnessed a rapid increase in research attention in recent years. However, the varying a…
Factorize to Generalize: Retrieval-Guided Invariant-Dynamic Decomposition for Time Series Forecasting
Time series foundation models (TSFMs) have recently achieved strong zero-shot forecasting performance through large-scale pretraining and r…
Explainable Retinal Imaging for Prediction of Multi-Organ Dysfunction in Type 2 Diabetes
Background: Type 2 diabetes mellitus (T2DM) is increasingly recognised as a systemic disease characterised by coordinated dysfunction acros…
Explainable Multi-Task Retinal Imaging Reveals Microvascular Signals for Systemic Risk Stratification in Type 2 Diabetes: A Pilot Study
Retinal imaging provides a non-invasive window into systemic microvascular health and has emerged as a potential biomarker for systemic dis…
Quaternion Self-Attention with Shared Scores
Quaternion neural networks are parameter-efficient and model multidimensional dependencies by representing four related features as a singl…
HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos
Human egocentric video captures rich manipulation demonstrations without any robot hardware, yet transferring these skills to robots remain…
Your Embedding Model is SMARTer Than You Think
Multimodal retrieval relies heavily on single-vector retrievers, which compress rich, sequential token sequences into one single global rep…
Riemannian-Manifold Steering: Geometry-Aware Generative Autoencoders for Label-Free Steering
Steering a language model - intervening on its internal activations to change downstream behaviour - has recently expanded beyond linear in…
RealBench: Benchmarking Data-Driven Numerical Weather Forecasting Under Operational Conditions and Extreme Event Challenges
Accurate evaluation of weather forecasting models is critical for their reliable deployment in real-world applications. However, existing b…
APT-Agent: Automated Penetration Testing using Large Language Models
Penetration testing is essential to securing modern web infrastructures, yet traditional manual methods struggle to keep pace with their sc…
SEP-Attack: A Simple and Effective Paradigm for Transfer-Based Textual Adversarial Attack
Despite the strong performance of deep neural networks in modern Web and language applications, they remain vulnerable to adversarial attac…
Investigating the Interplay between Contextual and Parametric Chain-of-Thought Faithfulness under Optimization
Chain-of-Thought (CoT) faithfulness, i.e., whether CoTs genuinely reflect large language models' (LLM) underlying behavior, is typically ev…
Cross-Domain Generalization Limits of Vision Foundation Models in Facial Deepfake Detection
The rapid evolution of generative models has enabled the creation of hyper-realistic facial deepfakes, exposing a critical vulnerability in…
OSDTW: Optimal Shared Depth and Task Weighting for Long-Tailed Recognition
Long-tailed recognition suffers from a persistent head-tail trade-off: improving tail performance often degrades head accuracy and can inc…
TGFormer: Towards Temporal Graph Transformer with Auto-Correlation Mechanism
The growing interest in Temporal Graph Neural Networks (TGNNs) stems from their ability to model complex dynamics and deliver superior perf…
MinerU-Popo: Universal Post-Processing Model for Structured Document Parsing
VLM-based OCR models have become the de facto choice for document parsing, as they can accurately extract page-level elements (e.g., paragr…
Bridging the Gap: Enabling Soft Actor Critic for High Performance Legged Locomotion
Proximal Policy Optimization (PPO) has become the de facto standard for training legged robots, thanks to its robustness and scalability in…
Selective Test-Time Compute Scaling for Click-Through Rate Prediction via Uncertainty-Triggered Feature Path Exploration
Scaling test-time compute has proven highly effective for language models, yet this opportunity remains largely unexplored for industrial C…
Scaling up Energy-Aware Multi-Agent Reinforcement Learning for Mission-Oriented Drone Networks with Individual Reward
Multi-agent reinforcement learning (MARL) has shown wide applicability in collaborative systems such as autonomous driving and smart cities…
Interpretation, Learning, and Empathy as One Constraint: A Residual-Adequacy Architecture with Accountable Abstention
An agent must act on the situation before it, learn what it cannot yet represent, and model other agents well enough to coordinate. These f…
Metropolis-Scale Resilient and Trustworthy Traffic Flow Inference Using Multi-Source Data
Inferring network-wide traffic states from sparse observations with high accuracy and trustworthy uncertainty quantification is essential f…
Performance Comparison of Classical and Neural Sampling Algorithms for Robotic Navigation
Integrating artificial intelligence (AI) into sampling-based motion planning provides new possibilities for improving autonomous navigation…
D3S2: Diffusion-Guided Dataset Distillation for Semantic Segmentation
Dataset distillation (DD) aims to compress large-scale datasets into compact synthetic sets while preserving training efficacy. However, ex…
Language Bias in LVLMs: From In-Depth Analysis to Simple and Effective Mitigation
Large Vision-Language Models (LVLMs) extend large language models with visual understanding, but remain vulnerable to hallucination, where…
TinyFormer: Preserving Tiny Objects in YOLO-DETR Hybrid Real-time Detectors
YOLO-series and DETR-based detectors struggle with tiny-object detection. YOLO-style models benefit from efficient dense prediction, but th…
Scale When Needed: Adaptive Neuron-level Mixed Precision Quantization Aware Training
Deploying deep neural networks on resource-constrained 6G edge devices demands aggressive compression with minimal accuracy loss. Quantizat…
Intent Signal Theory: A Computational Framework for Intent-State Control in Human-AI Interaction
Current AI interaction models treat the prompt as the primary object of exchange, omitting a critical layer: the user's latent source inten…
GL-LFGNN: A Global-Local Dual-branch Causal Graph Neural Network Based on Liang-Kleeman Information Flow for EEG Emotion Recognition
EEG-based emotion recognition holds significant promise for objective diagnosis of mood disorders. Graph neural networks (GNNs) have emerge…
Cultivating Machine Intelligence: The OMEGA Shift from Top-Down Optimization to Autopoietic Cognitive Ecologies
The dominant artificial intelligence paradigm trains neural architectures via gradient descent against proxy objectives and reinforcement l…
Security in the Fine-Tuning Lifecycle of Large Language Models: Threats, Defenses, Evaluation, and Future Directions
Background: Fine-tuning is central to adapting pre-trained Large Language Models (LLMs) to downstream tasks, but its reliance on training d…
Polynomial Context-Truncation Sensitivity in Autoregressive Language Models: Sequential Wyner-Ziv Bounds for KV Cache Compression
We study the rate-distortion limits of online KV cache compression in autoregressive language models, formulating it as sequential Wyner-Zi…
Multi-Agent Specification-based Metamorphic Testing of FMU-Based Simulations
In many industrial domains, the Functional Mock-up Interface (FMI) is used to exchange simulation models as Functional Mock-up Units (FMUs)…
Leveraging Gauge Freedom for Learning Non-Gradient Population Dynamics of Stochastic Systems
Existing work on population dynamics inference often focuses on flows arising from vector fields that are the gradients of scalar potential…
Uncertainty-DTW for Sequences and Visual Tokens
Aligning structured data is a fundamental problem in computer vision and machine learning, underlying tasks such as time series analysis, h…
Courant: a State-Adaptive Perceiver-Based Neural Surrogate with Local Support and Interpretable Field Decomposition
We introduce "Courant", a Perceiver-based encoder-processor-decoder surrogate model that has latent features exhibiting adaptive specializa…
Trust-Aware Joint Feature-Prediction Discrepancy for Robust Domain Adaptation
Domain adaptation aims to mitigate performance degradation caused by distribution shifts between a labeled source domain and an unlabeled o…
Evidence-Linked Radiology Reporting: A Human-Supervised Reference Architecture for Structured Imaging Intelligence
Radiology reports remain the primary mechanism by which imaging findings are communicated to clinical teams. However, much of the structure…
Inference-Time Alignment of Diffusion Models via Trust-Region Iterative Twisted Sequential Monte Carlo
We study inference-time alignment for diffusion-based generative models, aiming to steer a base model toward high-reward outputs without up…
Theoretical Analysis of Sparse Optimization with Reparameterization, Weight Decay, and Adaptive Learning Rate
Sparse optimization is a fundamental challenge in various practical applications. A popular approach to sparse optimization is $\ell_p$ reg…
ASTRO: Adaptive Spatio-Temporal Reinforcement Optimization for GNN Powered Anomaly Detection in Cyber Physical Systems
Anomaly detection in Industrial Internet of Things (IIoT) environments is essential to protect the Industrial Control Systems (ICS) and Cyb…
LLM Agent Based Renewable Energy Forecasting Using Edge and IoT Data A Review of Solar Wind Weather and Grid Aware Decision Support
Reliable forecasting of renewable energy generation is a foundational requirement for grid stability, energy trading, battery scheduling, and…
Abduction-Deduction Entanglement: Domain Generalization via Representation Transplants
Prediction models trained under the source distribution do not generalize well to a different target distribution. A valid inference about…
STREAM: A Data-Centric Framework for Mining High-Value Task-Oriented Dialogues from Streaming Media
Large language models for vertical domains are bottlenecked by the scarcity of complex, domain-specific task-oriented dialogues. Existing d…
K-U-KAN: Koopman-Enhanced U-KAN for 3D Dental Reconstruction from a Single Panoramic X-ray Radiograph
A panoramic X-ray compresses a 3D jaw into a 2D strip; we aim to recover the missing depth cleanly and fast. Existing implicit neural repre…
AME-TS: Anchored Mixture-of-Experts for Time Series Forecasting
Time series forecasting models are increasingly scaled through large Transformer backbones, yet most existing approaches process all series…
Methodology for Creating a Clinically Verified Dermoscopic Image Dataset
This study presents a methodology for constructing a clinically verified dataset of dermatoscopic images for medical informatics research.…
Grow-Prune-Freeze Networks: Adaptive & Continual Learning Technique for Olfactory Navigation
Training data for olfaction is scattered through disparate, non-standardized datasets that limit the ability to build representative world…
Knowledge Graph-Driven Expert-Level Reasoning for Neuroscience
A knowledge graph (KG) is an abstraction that can be extracted from text corpora and used for in-depth reasoning. Prior work has leveraged KG…
By Their Fruits You Will Know Them: Comparing Formalizations of Law by the Decisions They Encode
Formalizing legal provisions promises machine-accessible law and automated legal reasoning, and recent LLMs make it tempting to generate su…
Beyond Killer Robots: General AI Attitudes and Public Support for Military AI in Nine Countries
AI-enabled military systems are a fixture of modern military conflict. Applications vary from autonomous drones for surveillance and attack…
Hide to Guide: Learning via Semantic Masking
Reinforcement learning with verifiable rewards (RLVR) has become a powerful paradigm for improving language models on reasoning-intensive t…
Influence-Inspired Spectral Rotations for Extreme Low-Bit LLM Quantization
We apply the influence-adaptive Walsh geometry of a companion theory paper (arXiv:2605.01637) to extreme low-bit weight-only LLM quantizati…
Multi-Objective Learning for Diffusion Models: A Statistical Theory under Semi-Supervised Learning
Diffusion models are increasingly used as powerful conditional generators, yet real deployments often involve multiple target distributions…
Continuous-Depth Field Theory for Transformer Patching and Mechanistic Interpretability
Mechanistic interpretability often uses activation patching, causal tracing, path patching, and steering directions to reveal behaviorally…
Specification-Based Code-Text-Code Reengineering for LLM-Mediated Software Evolution
Direct Code2Code transformation remains challenging to control because it can preserve surface-level syntax while introducing semantic drif…
On the Epistemic Uncertainty of Overparametrized Neural Networks
Epistemic uncertainty is often viewed as a reducible uncertainty that vanishes with increasing data. This perspective implicitly assumes pa…
Constraint-Anchored Attribution: Feasibility-Certified Counterfactuals and Bonferroni-PAC Sufficient Subsets for Neural CO Policies
We give an attribution method for neural combinatorial-optimisation (CO) policies that (i) decomposes a decision by constraint families via…
JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment
Two methodologies dominate current practices of benchmarking: rubric-based scoring evaluates items against predefined criteria, whereas com…
Quantifying Empirical Compute-Supervision Tradeoffs in RLVR
Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training language models, but in practice, ve…
Guess the Unified Model: How Much Can We Recover from Generated Images?
With unified model-generated images now widespread online, attributing their model of origin offers a path toward transparency and deeper i…
First, do no harm: Breaking suicidogenic echo chambers in media recommendation
Recommender systems generally optimise user engagement, but this approach is dangerous in mental health contexts. When vulnerable users sh…
Mimir: Large-scale Multilingual Concept Modeling
Current language modeling approaches are built around tokens. Text corpora are split into tokens, and models are trained by performing comp…
Latent Q-Barrier Shielding for Safe In-Context Reinforcement Learning
Safe in-context reinforcement learning (ICRL) adapts online from interaction history without test-time parameter updates while controlling…
Positivity in classical enumerative geometry: a case study in synchronized AI-assisted mathematics
We study the symmetric polynomial $\prod_{\alpha\in A_{n,d}}\bigl(1+\alpha_1 x_1+\cdots+\alpha_n x_n\bigr)$ where $A_{n,d}:=\{\alpha\in\mat…
READER: Reasoning-Enhanced AI-Generated Text Detection
Recent advances in large language models (LLMs) have made it increasingly difficult to distinguish human-written text from AI-generated con…
Neuromorphic LiDAR-based Bird's Eye View Object Detection using Energy-efficient Spiking Neural Networks
Autonomous driving perception demands accurate and efficient processing of three-dimensional sensor data under strict power constraints. Tr…
Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand Prediction
Effective features are crucial for predictive model performance, but creating them often requires domain expertise, limiting scalability ac…
UWM-JEPA: Predictive World Models That Imagine in Belief Space
World models for partially observed environments must imagine multiple compatible hidden futures and steer between them under counterfactua…
CausalFlow: Causal Attribution and Counterfactual Repair for LLM Agent Failures
Large language model (LLM) agents frequently fail on multi-step tasks involving reasoning, tool use, and environment interaction. While suc…
A general tensor-structured compression scheme for efficient large language models
Large language models (LLMs) are dominated by dense linear transformations, whose storage, memory and computational overheads hinder effici…
Parallel Differentiable Reachability for Learning and Planning with Certified Neural Dynamics and Controllers
Neural network (NN) dynamics models and control policies achieve strong performance in robotics, but providing sound guarantees under uncer…
Parameter-Efficient CT Reconstruction via Deep Graph Laplacian Regularization
Low-dose computed tomography (LDCT) reconstruction faces a critical tradeoff between reconstruction quality and resource requirements. Whil…
Certified Robustness from Approximate Gaussian Mixture Structures in Pretrained Latent Spaces
Deep learning models are vulnerable to adversarial perturbations, raising important concerns for safety-critical deployment. Empirical defe…
AI-Associated Lexical Shifts Across 34 Languages: Cross-Lingual Convergence and Diachronic Uptake in News Writing
AI-associated lexical shifts have been documented mainly in Scientific English. We extend this work to 34 languages in the WMT News Crawl c…
KYA: A Framework-Agnostic Trust Layer for Autonomous Systems with Verifiable Provenance and Hierarchical Policy Composition
Observability tells operators when an agent is slow. KYA tells operators when an agent is wrong, drifting, leaking, or quietly going rogue.…
Adversarial Orthogonal Disentanglement for LVLM Hallucination Mitigation
Large Vision-Language Models (LVLMs) have advanced multimodal understanding, yet their reliability is limited by hallucination, where gener…
CollectionLoRA: Collecting 50 Effects in 1 LoRA via Multi-Teacher On-Policy Distillation
Customized image editing aims to equip pre-trained diffusion models with specific visual effects using limited paired data, typically via L…
Weakly Supervised Camouflaged Object Detection Based on the SAM Model and Mask Guidance
Camouflaged object detection (COD) from a single image is a challenging task due to the high similarity between objects and their surroundi…
Evo-Attacker: Memory-Augmented Reinforcement Learning for Long-Horizon Tool Attacks on LLM-MAS
While Large Language Model-based Multi-Agent Systems (LLM-MAS) demonstrate remarkable capabilities in solving complex tasks by orchestratin…
Subspace-Guided Semantic and Topological Invariant Registration for Annotation-Free Ultrasound Plane Quality Control
Reliable quality control (QC) of ultrasound images is essential for both real-time acquisition guidance and retrospective clinical audit, y…
Anatomy-Anchored Self-Supervision: Distilling Vision Foundation Models for Invariant Ultrasound Representation
The self-supervised pre-training paradigm has gained increasing prominence for learning transferable representations in medical imaging, yet ex…
Autoregression-Free Neural Operators for Time-Dependent PDEs
Neural operators learn mappings from function-dependent inputs to solutions, providing an effective framework for solving partial different…
SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models
Large language model safety evaluation remains heavily English-centered, leaving low-resource languages under-measured even when models are…
A Token/KV-Cache Communication Media Selection and Resource Allocation Strategy for Multi-Agent Collaboration
The convergence of large language models (LLMs) with 6G networks is fostering a paradigm of autonomous multi-agent cooperation, which in tu…
SeqRoute: Global Budget-Aware Sequential LLM Routing via Offline Reinforcement Learning
Existing LLM routing frameworks treat queries as independent events, neglecting the sequential nature of real-world user sessions constrain…
Binding Visual Features Point by Point
Despite success on standard benchmarks, vision language models display persistent failures on tasks involving processing of multi-object sc…
A Multi-Agent LLM Framework for Rating the Quality of Surgical Feedback
Verbal feedback delivered by attending surgeons in the operating room plays a critical formative role in resident trainee skill acquisition…
AI Content Moderation in Therapy Conversations
Large language models (LLMs) are increasingly being used for emotional support. They are also being developed for formal therapy purposes.…
From Simulation to Enaction: Post-trained language models recognize and react to their own generations
Language models are pretrained as passive predictors with no incentive to model the consequences of their own outputs. Post-training change…
IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference
Large Language Models (LLMs) are increasingly expected to operate over long contexts, yet standard softmax attention incurs a KV cache that…
EXPO-FT: Sample-Efficient Reinforcement Learning Finetuning for Vision-Language-Action Models
The ability to efficiently and reliably learn new tasks has been a foundational challenge in robotics. Vision-Language-Action (VLA) models…
Test-Time Self-Adaptive Conditioning for Stable Audio-Driven Talking-Head Generation
Audio-driven talking-head generation has achieved remarkable progress with recent models such as AniTalker, FLOAT, and Sonic. Despite their…
A Controlled Synthetic Benchmark for Educational Aspect-Based Sentiment Analysis
Educational aspect-based sentiment analysis (ABSA) can support course improvement, but public aspect-labeled student feedback remains scarc…
Generative AI impacts on intra-urban inequality and skill premium in Beijing
Generative artificial intelligence (GenAI) is the first automation wave to reach high-cognitive tasks at scale, yet its effects on intra-ur…
Cross-Stage Attention Multi-Expert Network for Radiologist-Inspired Breast Ultrasound Diagnosis
Breast ultrasound imaging is an important noninvasive method for early breast cancer diagnosis, but automatic benign/malignant classificati…
A Tertiary Review of Large Language Model-Based Code Generating Tasks: Trends, Challenges, and Future Directions
Context. Large language models (LLMs) are increasingly applied to code-generating tasks (CGTs) in software engineering. While reported resu…
TopoAlign: Topology-Aware Visual Representation Alignment
Neural networks encode inputs as high-dimensional vectors, known as representations, that capture how models process data by encoding task-…
'Si'multaneous 'S'patial-'T'emporal Message Passing for Dynamic Graph Representation Learning
Dynamic graph neural networks (DGNNs) that operate on snapshot sequences typically fall into one of two categories. Temporal-first a…
BC Protocol: Structured Dual-Expert Dialogue for Eliciting High-Quality Chain-of-Thought Post-Training Data
High-quality expert chain-of-thought (CoT) data is one of the core bottlenecks in large language model (LLM) post-training. Existing data p…
Keep the Proof State Live: Snapshotting for Efficient Tactic Search in Lean 4
Automated theorem proving systems built on Lean 4 increasingly rely on parallel tactic search over partially specified proofs, such as thos…
PennySynth: RAG-Driven Data Synthesis for Automated Quantum Code Generation
The growing complexity of quantum programming frameworks has exposed a critical limitation in existing large language model (LLM)-based cod…
Mosaic: Compositional Multi-Concept Erasure via Vector Field Blending
Concept erasure has emerged as a key research direction for ensuring safe and ethical image synthesis in Text-to-Image (T2I) models. While…
Geometric Flow Matching for Molecular Conformation Generation via Manifold Decomposition
The generation of accurate 3D molecular conformations is a pivotal challenge in computational chemistry and drug discovery. Recently, diffu…
Extreme Region Policy Distillation
Reinforcement learning for large language models faces a fundamental trade-off between sample efficiency and asymptotic performance: strict…
Acting on the Unseen: Communication-Free Collaborative Filtering for Decentralized Multi-Robot Task Allocation
Multi-robot task allocation usually assumes some combination of communication, known task models, or a coordinator. We study the opposite e…
Toward a Benchmark for Controllable Simulation of Imperfect Students with Large Language Models
Teacher education requires deliberate practice with learners who exhibit identifiable strengths, weaknesses, and partial mastery. Large lan…
Towards the Connection between Activation Sparsity and Flat Minima
The observation that activation sparsity emerges in MLP blocks of standardly trained Transformers offers an opportunity to drastically redu…
Fine-Tuning and Serving Gemma 4 31B on Google Cloud TPU: A Technical Comparison with GPU Baselines
We present the first end-to-end demonstration of fine-tuning and serving Google's Gemma 4 31B model on TPU hardware, providing an empirical…
AutoSG: LLM-Driven Solver Generation Solely from Task Prompts for Expensive Optimization
Expensive optimization tasks are ubiquitous in real-world applications, demanding highly specialized solvers. While LLM-driven automated so…
Posture Clip: Sit properly or I won't let you work
Poor posture is a significant concern due to its detrimental effects on health and productivity. This paper presents a collar-clipped devic…
Meta-Engineering Harnesses for AI-Native Software Production: A Contract-Driven Adversarial Verification Architecture with Early Deployment Report
AI-native software development is often evaluated at the level of individual models, prompts, or generated artifacts. This framing is insuf…
Referential Security as a New Paradigm for AI Evaluations
Security evaluations inherently depend on stable identifiers. Any finding, audit, or regulatory decision must remain attached to the specif…
Simulating Human Memory with Language Models
Language models are increasingly being deployed as user simulators, but their memory is far more reliable than that of real users. To measu…
Don't Retrain, Just Reuse: Recovering Dual-Target Molecules from Single-Target Diffusion Models
Designing a single molecule that modulates two targets is a promising strategy for polypharmacology, but it remains substantially harder th…
Profiling-Driven Adaptive Distributed Transformer Inference on Embedded Edge Deployment
Distributing Transformer inference across embedded edge devices can alleviate individual memory and compute constraints, yet practical bene…
How Should LLMs Consume High-Quality Data? Optimal Data Scheduling via Quality-Aware Functional Scaling Laws
High-quality data is scarce in large language model (LLM) training, yet how to schedule its use jointly with training dynamics lacks theore…
Multi-Agent Coordination Adaptation via Structure-Guided Orchestration
As large language model (LLM)-based multi-agent systems scale to handle increasingly complex tasks, balancing structural stability and dyna…
DeGRe: Dense-supervised Generative Reranking for Recommendation
In multi-stage recommender systems, reranking optimizes overall utility by capturing intra-list contextual dependencies, yet its central ch…
Benchmarking Pathology Foundation Models for Spatial Domain Understanding
Pathology foundation models (PFMs) have emerged as a core approach for learning transferable representations from whole slide images (WSIs)…
Concept Unlearning via Cross-Attention Activation Projection for Diffusion Models
Concept unlearning aims to erase a target concept from a pretrained text-to-image diffusion model without retraining. Closed-form methods a…
MDGMIX: Boundary-Aware Subgraph Mixing for Multi-Domain Graph Pre-Training
Multi-domain graph pre-training is a crucial step in constructing foundational graph models with cross-domain generalization capabilities.…
Efficient Benchmarking Is Just Feature Selection and Multiple Regression
Efficient benchmarking techniques aim to lower the computational cost of evaluating LLMs by predicting full benchmark scores using only a s…
NPSolver: Neural Poisson Solver with Iterative Physics Supervision
Efficiently solving Poisson equations on complex, irregular domains remains a fundamental challenge in scientific computing, as classical i…
On the Benefits of Free Exploration for Regret Minimization in Multi-Armed Bandits
We study a stochastic multi-armed bandit problem where an agent is granted a free exploration budget before regret accumulates, a setting n…
SAMark: A Self-Anchored Text Watermarking with Paragraph-Level Paraphrase Robustness
Semantic-level watermarking (SWM) improves robustness against text modifications by treating sentences as the basic unit. However, robustne…
Adaptive Graph Refinement and Label Propagation with LLMs for Cost-Effective Entity Resolution
Dirty entity resolution (ER), which identifies records referring to the same real-world entity from a single, messy dataset, is a fundament…
Fine-Tuning Over Architectural Complexity: Broad-Coverage PII Detection on PIIBench with DeBERTa
Personally identifiable information (PII) detection systems are frequently trained within narrow source or domain boundaries, limiting cove…
OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation
Recent vision-language-action (VLA) models and world action models (WAMs) advance robotic manipulation by enriching intermediate representa…
Clarify, Abstain or Answer? Strategising in Conversation with Belief-Augmented Generation
Large language models (LLMs) define a distribution over text, which can be viewed as a probabilistic representation of uncertainty: samplin…
When Search Becomes Memory: Turning Robot Design Trials into Transferable Skills
Large language models (LLMs) are increasingly used as proposal generators for evolutionary robot design, yet most loops remain memoryless:…
Context-Instrumental Data Distillation for Kubernetes Manifest Generation: Method and Experimental Evaluation
This paper examines the specialization of Small Language Models (SLMs) with up to 4 billion parameters for generating artifacts in domain-s…
TTPrint: Evidence-Grounded TTP Extraction via Diverge-then-Converge Verification
Extracting MITRE ATT&CK techniques from cyber threat intelligence (CTI) reports is an open-set, multi-label problem requiring both high rec…
Geometric Evolution Maps: Extracting Stable Concept Probes from Transformer Residual Streams
Concept probes extracted from transformer residual streams are only as reliable as the layer from which they are extracted. The common prac…
TIAR: Trajectory-Informed Advantage Reweighting for LLM Abstention Learning
This paper investigates large language model (LLM) abstention learning, specifically using a ternary reward, which incentivizes truthfulness i…
Explaining Too Much? Understanding How Large Language Model Reasoning Traces Influence Performance and Metacognition
Large Language Model interfaces are increasingly verbose, exposing intermediate reasoning traces alongside final answers. Traces are framed…
MuNet: A Mutualistic Network for Joint 3D Human Mesh Recovery and 3D Clothed Human Reconstruction from Single Images
3D human mesh recovery and 3D clothed human reconstruction are inherently related, yet they have long been studied in isolation, thereby ov…
Causal Tongue-Tie: LLMs Can Encode Causal Direction, But Their Yes/No Outputs Fail to Express
We find a mismatch between what large language models encode about a causal question and what they answer. On anti-commonsense CLadder item…
Can LLMs Time Travel? Enhancing Temporal Consistency in Legal Agentic Search through Reinforcement Learning
While large language models (LLMs) augmented with agentic search capabilities show promise for legal reasoning, they overlook a fundamental…
Quantitative Evaluation of the Severity of Posttraumatic Stress Disorder through Transfer Learning from Specific Phobia Data
Posttraumatic stress disorder (PTSD) is a prevalent and debilitating mental health condition with significant personal and societal impacts…
From Latent Space to Training Data: Explainable Specialization in Minimal MLPs
We here study whether training biases can make hidden neurons specialize in minimal one-hidden-layer MLPs, and whether such specialization…
EchoPilot: Training-Free Ultrasound Video Segmentation via Scale-Space Semantic Prompting and Reliability-Gated Memory
Ultrasound video segmentation is clinically valuable yet difficult due to speckle noise, weak boundaries, and rapid anatomical deformation.…
Small Models, Strong Priors: Architectural Inductive Bias for Parameter-Efficient Neural PDE Solvers
Neural PDE solvers have followed the scaling trajectory of vision and language, with recent foundation models reaching billions of paramete…
VEN-VL: A Visual Ensemble MoE Framework for Effective and Efficient Multi-Modal Understanding
Despite the remarkable progress achieved by recent efficient methods in accelerating multimodal understanding, they still suffer from notic…
Step-TP: A Grounded, Step-Level Dataset with Chain-of-Thought Reasoning for LLM-Guided Tensor Program Optimization
Despite the strong reasoning capabilities of large language models (LLMs), optimizing the execution efficiency of tensor programs remains c…
QUIET: A Multi-Blank Cascaded Story Cloze Benchmark for LLM Creative Generation Capability
Large language models (LLMs) face a dual challenge in creative capability evaluation: existing benchmarks (e.g., Story Cloze Test, HellaSwa…
Continual Speaker Identity Unlearning with Minimal Interference
Machine unlearning removes designated concepts or knowledge from pre-trained models. Recent work has extended this paradigm to speaker iden…
Creative Quality Alignment: Expert Tacit Knowledge Transfer via Chain-of-Thought Fine-Tuning
This paper provides an empirical implementation of the creative quality metric proposed in Calibrated Surprise (Zou & Xu, 2026a). The quest…
SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLM Dialogue via RL-Driven Prompt Optimisation
Ensuring safe and contextually appropriate behaviour in Large Language Models (LLMs) remains a critical challenge for real-world deployment…
AI-Assisted Systematization for Evaluating GenAI Systems
Evaluating generative AI (GenAI) systems is challenging because many targets of evaluation are broad, contested concepts, such as "reasonin…
Learning in Low-Dimensional Subspaces: Orthogonal Bottlenecks for Reinforcement Learning
Deep reinforcement learning (RL) agents commonly rely on high-dimensional neural representations, despite growing evidence that task-releva…
AdvantageFlow: Advantage-Weighted Least Squares for RL in Flow Models
We introduce AdvantageFlow, a forward-process reinforcement learning algorithm for rectified flow models. Unlike Flow-GRPO, which optimizes…
Retrieval-Augmented Detection of Potentially Abusive Clauses in Chilean Terms of Service
Online Terms of Service often function as contracts of adhesion, creating asymmetries that may expose consumers to potentially abusive clau…
A Multimodal 3D Foundation Model for Light Sheet Fluorescence Microscopy Enables Few-Shot Segmentation, Classification, and Deblurring
Light sheet fluorescence microscopy (LSM) enables high-resolution, three-dimensional (3D) imaging of biological specimens, providing rich v…
Everything at Every Scale: Scale-Invariant Diffusion with Continuous Super-Resolution
Creating images from noise is image generation; reconstructing fine details from coarse inputs is super-resolution. Despite their practical…
DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models
Lightweight vision-language models perform competitively on standard benchmarks yet fail systematically in dense-scene reasoning, where mul…
Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals
Activation oracles aim to make the activations of other models legible to humans and yield promising results compared to white-box interpre…
When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges
Customizing an LLM judge to a specific task or domain often involves optimizing its prompt across multiple evaluation criteria simultaneous…
Neuronal Stochastic Attention Circuit (NSAC) for Probabilistic Representation Learning
Reliable quantification of uncertainty estimates in continuous-time (CT) representation learning remains nascent, particularly within CT at…
Conditional KRR: Injecting Unpenalized Features into Kernel Methods with Applications to Kernel Thresholding
Conditionally positive definite (CPD) kernels are defined with respect to a function class $\mathcal{F}$. It is well known that such a kern…
Rethinking Weak Supervision in Anomaly Detection: A Comprehensive Benchmark
Weakly supervised anomaly detection (WSAD) has developed in three primary directions: incomplete, inexact, and inaccurate supervision. Howe…
StakeBench: Evaluating Language Understanding Grounded in Market Commitment
Existing financial NLP benchmarks often rely on labels supplied by outside observers, measuring how language is perceived rather than what…
Channel-wise Vector Quantization
We present Channel-wise Vector Quantization (CVQ), a novel image tokenization paradigm that replaces patch-wise tokens with channel-wise to…
OrpQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization
The deployment of Large Language Models (LLMs) and Vision Transformers (ViTs) on edge devices is significantly constrained by memory limita…
Language Models Need Sleep
Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with…
Beyond Summaries: Structure-Aware Labeling of Code Changes with Large Language Models
Code review is a critical practice in software engineering, yet the growing scale and frequency of code patches in modern projects, togethe…
Squeezing Capacity from Multimodal Large Language Models for Subject-driven Generation
Subject-driven image generation aims to synthesize new images that preserve the identity of the given subject while following textual instr…
WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point
Recent progress in GUI agents has substantially improved visual grounding, yet robust planning remains challenging, particularly when the e…
PCGRLLM: Large Language Model-Driven Reward Design for Procedural Content Generation Reinforcement Learning
Reward design plays a pivotal role in the training of game AIs, requiring substantial domain-specific knowledge and human effort. In recent…
Efficient and Scalable Neural Symbolic Search for Knowledge Graph Complex Query Answering
Complex Query Answering (CQA) is a crucial reasoning task over Knowledge Graphs (KGs), which aims to answer first-order logical queries fro…
FloorplanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations
We introduce FloorplanQA, a diagnostic benchmark for evaluating spatial reasoning in large language models (LLMs). FloorplanQA is grounded…
Hide-and-Shill: A Reinforcement Learning Framework for Market Manipulation Detection in Symphony-a Decentralized Multi-Agent System
Decentralized finance (DeFi) has introduced a new era of permissionless financial innovation but also led to unprecedented market manipulat…
From Multi-Agent Systems and the Semantic Web to Agentic AI: A Unified Narrative of the Web of Agents
The Web of Agents (WoA) transforms the document-centric Web into an environment of autonomous agents acting on users' behalf, a vision newl…
Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning
Large reasoning models (LRMs) combined with retrieval-augmented generation (RAG) have enabled deep research agents capable of multi-step re…
Probing the Preferences of a Language Model: Integrating Verbal and Behavioral Tests of AI Welfare
We develop new experimental paradigms for measuring welfare in language models. We compare verbal reports of models about their preferences…
Teaching large language models to reason like expert diagnosticians
Differential diagnosis is an iterative process that integrates patient information with broader medical knowledge. Clinical case series suc…
Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents
Although recent tool-augmented benchmarks involve complex requests, evaluation remains limited to answer matching, neglecting critical traj…
Agent Learning via Early Experience
A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-…
False Fixed Points: Kantian Feedback, Stable Miscalibration, and Representational Compression in LLMs
High-confidence errors in large language models are often treated as fragile failures. We study an alternative: some errors may be false fi…
Voting with the Graph: Stable RLAIF via Topological Consistency Maximization
Reinforcement Learning from AI Feedback (RLAIF) relies on LLM judges as preference measurement instruments, yet these instruments are funda…
From Prompt Optimization to Multi-Dimensional Credibility Evaluation: Enhancing Trustworthiness of Chinese LLM-Generated Liver MRI Reports -- with Preliminary Extension to Lung Cancer
Large language models (LLMs) have demonstrated promising performance in generating diagnostic conclusions from imaging findings, thereby su…
Rewarding Structural Conformance of Reasoning using Process Mining
Recent advances in sparse reward policy gradient methods have enabled effective reinforcement learning (RL)-based language model post-train…
Chain-of-Thought Hijacking
Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. Although previous studies suggest that lo…
Optimizing Sensor Placement for Flow Reconstruction in Urban Drainage Networks: A Digital Twin-Based Sparse Sensing Approach
Urban flooding triggered by intense rainfall is becoming increasingly frequent and widespread. While flood prediction and monitoring in hig…
Learning to Trust: Bayesian Adaptation to Varying Suggester Reliability in Sequential Decision Making
Autonomous agents operating in sequential decision-making tasks under uncertainty can benefit from external action suggestions, which provi…
IPR-1: Interactive Physical Reasoner
Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can…
Actionable and diverse counterfactual explanations incorporating domain knowledge and plausibility constraints
Counterfactual explanations improve the actionable interpretability of machine learning models by identifying minimal changes required to a…
AGI Requires a Coordination Layer on Top of Pattern Repositories
In this paper we argue that influential critiques dismissing Large Language Models (LLMs) as a dead end for AGI misidentify the bottleneck:…
Leveraging Spreading Activation for Improved Document Retrieval in Knowledge-Graph-Based RAG Systems
Despite initial successes and a variety of architectures, retrieval-augmented generation systems still struggle to reliably retrieve and co…
SPARK: Search Personalization via Agent-Driven Retrieval and Knowledge-sharing
Personalized search demands the ability to model users' evolving, multi-dimensional information needs; a challenge for systems constrained…
Architecting Agentic Communities using Design Patterns
The rapid evolution of Large Language Models (LLMs) and subsequent Agentic AI technologies requires systematic architectural guidance for bu…
MMUEChange: A Generalized LLM Agent Framework for Intelligent Multi-Modal Urban Environment Change Analysis
Understanding urban environment change is essential for sustainable development. However, current approaches, particularly remote sensing c…
NSR-Boost: A Neuro-Symbolic Residual Boosting Framework for Industrial Legacy Models
Although the Gradient Boosted Decision Trees (GBDTs) dominate industrial tabular applications, upgrading legacy models in high-concurrency…
PathWise: Planning through World Model for Automated Heuristic Design via Self-Evolving LLMs
Large Language Models (LLMs) have enabled automated heuristic design (AHD) for combinatorial optimization problems (COPs), but existing fra…
DropoutTS: Sample-Adaptive Dropout for Robust Time Series Forecasting
Deep time series models are vulnerable to noisy data ubiquitous in real-world applications. Existing robustness strategies either prune dat…
Why Your Deep Research Agent Fails? On Hallucination Evaluation in Full Research Trajectory
Diagnosing failure patterns in Deep Research Agents (DRAs) remains a critical challenge. Existing benchmarks predominantly rely on end-to-e…
MedBeads: An Agent-Native, Immutable Data Substrate for Trustworthy Medical AI
Background: As of 2026, Large Language Models (LLMs) demonstrate expert-level medical knowledge. However, deploying them as autonomous "Cli…
Emergent Analogical Reasoning in Transformers
Analogy is a central faculty of human intelligence, enabling abstract patterns discovered in one domain to be applied to another. Despite i…
AgentArk: Distilling Multi-Agent Intelligence into a Single LLM Agent
While large language model (LLM) multi-agent systems achieve superior reasoning performance through iterative debate, practical deployment…
FLINGO -- Instilling ASP Expressiveness into Linear Integer Constraints
Constraint Answer Set Programming (CASP) is a hybrid paradigm that enriches Answer Set Programming (ASP) with numerical constraint processi…
Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning
Recent advances in large language models (LLMs) have empowered autonomous agents to perform multi-turn interactions with tools and environmen…
OmniSapiens: A Foundation Model for Social Behavior Processing via Heterogeneity-Aware Relative Policy Optimization
Socially intelligent AI systems must entail reasoning across diverse human behavioral tasks, and generalization to new contexts. However, A…
JEPA-DNA: Grounding Genomic Foundation Models through Joint-Embedding Predictive Architectures
Genomic Foundation Models (GFMs) typically rely on Masked Language Modeling (MLM) or Next-Token Prediction (NTP) to learn the "Laws of Natu…
All Leaks Count, Some Count More: Interpretable Temporal Contamination Detection and Mitigation in LLM Backtesting
Backtesting LLMs on resolved events assumes models reason only from pre-cutoff knowledge, yet pretrained models inevitably leak post-cutoff…
Decoding ML Decision: An Agentic Reasoning Framework for Large-Scale Ranking System
Modern large-scale ranking systems operate within a sophisticated landscape of competing objectives, operational constraints, and evolving…
INDUCTION: Finite-Structure Concept Synthesis in First-Order Logic
We introduce INDUCTION, a benchmark for finite structure concept synthesis in first order logic. Given small finite relational worlds with…
AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications
Large Language Models (LLMs) are deployed as autonomous agents in increasingly complex applications, where enabling long-horizon memory is…
PathMem: Toward Cognition-Aligned Memory Transformation for Pathology MLLMs
Computational pathology demands both visual pattern recognition and dynamic integration of structured domain knowledge, including taxonomy,…
Characterizing Linear Alignment Across Language Models
Language models increasingly appear to learn similar representations, despite differences in training objectives, architectures, and data m…
Dynamic Dual-Granularity Skill Bank for Agentic RL
Agentic RL can benefit substantially from reusable experience, yet existing skill-based methods mainly extract trajectory-level guidance an…
OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search
Agentic search enables language models to solve knowledge-intensive tasks by adaptively acquiring external evidence over multiple steps. Re…
SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment
Current LLM-based agents demonstrate strong performance in episodic task execution but remain constrained by static toolsets and episodic a…
Learning Preference-Based Objectives from Clinical Narratives for Dynamic Sepsis Treatment
Designing reward functions for reinforcement learning (RL) in healthcare remains challenging because clinically meaningful outcomes are spa…
UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents
Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function c…
The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment
Large language models (LLMs) are increasingly deployed as tool-augmented agents capable of executing system-level operations. While existin…
ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation
Skill-distillation pipelines learn reusable rules from LLM agent trajectories, but they lack a key signal: how much each step costs. Withou…
Generative structure search for efficient and diverse discovery of molecular and crystal structures
Predicting stable and metastable structures is central to molecular and materials discovery, but remains limited by the cost of searching h…
Reliable AI Needs to Externalize Implicit Knowledge: A Human-AI Collaboration Perspective
This position paper argues that reliable AI requires infrastructure for human validation of implicit knowledge. AI learns from both explici…
FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models
Time series (TS) reasoning models (TSRMs) have shown promising capabilities in general domains, yet they consistently fail on financial dom…
MEMTIER: Tiered Memory Architecture and Retrieval Bottleneck Analysis for Long-Running Autonomous AI Agents
Long-running autonomous AI agents suffer from a well-documented memory coherence problem: tool-execution success rates degrade 14 percentag…
ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting
Long-term personalized memory for LLM agents is challenging on resource-limited edge devices due to high storage costs and multimodal compl…
Strat-Reasoner: Reinforcing Strategic Reasoning of LLMs in Multi-Agent Games
While Large Language Models (LLMs) excel in certain reasoning tasks, they struggle in multi-agent games where the final outcome depends on…
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
As LLM agent systems take on more complex tasks, they increasingly rely on meta-agents: higher-order agents that operate on other agents, m…
The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes
On-policy distillation (OPD) and on-policy self-distillation (OPSD) have emerged as promising post-training methods for large language mode…
CVEvolve: Autonomous Algorithm Discovery for Unstructured Scientific Data Processing
Scientific data processing often requires task-specific algorithms or AI models, creating a barrier for domain scientists who need to analy…
Differentiable Learning of Lifted Action Schemas for Classical Planning
Classical planners can effectively solve very large deterministic MDPs represented in STRIPS or PDDL where states are sets of atoms over ob…
CogniFold: Always-On Proactive Memory via Cognitive Folding
Existing agent memory remains predominantly reactive and retrieval-based, lacking the capacity to autonomously organize experience into per…
A Two-Dimensional Framework for AI Agent Design Patterns: Cognitive Function and Execution Topology
Existing frameworks for LLM-based agent architectures describe systems from a single perspective: industry guides (Anthropic, Google, LangC…
SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?
Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more…
How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study
While AI-generated hallucinations pose considerable risks, the underlying cognitive mechanisms by which humans can successfully recognize o…
Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation in Autonomous Driving Models
We present the first systematic study of faithfulness in Vision-Language-Action (VLA) driving models, analyzing 300 Alpamayo-R1-10B inferen…
Self-supervised Hierarchical Visual Reasoning with World Model
3D open-world environments with adversarial opponents remain a core challenge for reinforcement learning due to their vast state spaces. Ef…
Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs
Leveraging the universal representations of pre-trained LLMs and MLLMs offers a promising path toward brain foundation models. However, vis…
When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity
Agent Skills, structured packages of procedural knowledge loaded into an LLM agent at inference time, are widely reported to improve task p…
AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration
Automating scientific discovery requires more than generating papers from ideas. Real research is iterative: hypotheses are challenged from…
ECUAS$_n$: A family of metrics for principled evaluation of uncertainty-augmented systems
In high-stakes automated decision-making, access to predictive uncertainty is essential for enabling users -- human or downstream systems -…
Benchmarking and Improving Monitors for Out-Of-Distribution Alignment Failure in LLMs
Many safety and alignment failures of large language models (LLMs) occur due to out-of-distribution (OOD) situations: unusual prompt or res…
SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?
LLM agents have incredible potential for scientific discovery applications. However, the performance of LLM agents on real-world, small mol…
Knowledge Graph Re-engineering Along the Ontological Continuum (extended version)
Knowledge graphs have become the primary vehicle for data integration and are critical to the success of modern AI, but the diversity of KG…
Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression
The KV cache used in large language models has linearly growing time complexity, so LLMs face memory blow-up and reduced decoding efficienc…
AMEL: Accumulated Message Effects on LLM Judgments
Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items…
MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems
Autonomous agentic systems are largely static after deployment: they do not learn from user interactions, and recurring failures persist un…
Unbalanced Incomplete Multi-view Clustering via the Scheme of View Evolution: Weak Views are Meat; Strong Views do Eat
Incomplete multi-view clustering is an important technique to deal with real-world incomplete multi-view data. Previous works assume that a…
Double Self-weighted Multi-view Clustering via Adaptive View Fusion
Multi-view clustering has been applied in many real-world applications where original data often contain noise. Some graph-based multi-vie…
Multi-Modal Cross-Domain Alignment Network for Video Moment Retrieval
As an increasingly popular task in multimedia information retrieval, video moment retrieval (VMR) aims to localize the target moment from a…
You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos
Given an untrimmed video, temporal sentence grounding (TSG) aims to locate a target moment semantically according to a sentence query. Alth…
Bridging Evolutionary Algorithms and Reinforcement Learning: A Comprehensive Survey on Hybrid Algorithms
Evolutionary Reinforcement Learning (ERL), which integrates Evolutionary Algorithms (EAs) and Reinforcement Learning (RL) for optimization,…
The Impact of Large Language Models on Open-source Innovation: Evidence from GitHub Copilot
Large Language Models (LLMs) are reshaping knowledge work, yet their impact on voluntary, self-guided open innovation forums (contributors…
Uncovering Autoregressive LLM Knowledge of Thematic Fit in Event Representation
The thematic fit estimation task measures semantic arguments' compatibility with a given semantic role for a given predicate. We investigat…
The Meme Is the Message: Generative Memesis and AI Visuals in the 2024 USA Presidential Elections
Visual content on social media has become increasingly influential in shaping political discourse and civic engagement, but it also limits…
Fusion Embedding for Pose-Guided Person Image Synthesis with Diffusion Model
Pose-Guided Person Image Synthesis (PGPIS) aims to generate human images in specified poses while preserving the identity and appearance of…
FragmentNet: Adaptive Graph Fragmentation for Graph-to-Sequence Molecular Representation Learning
Molecular representation learning methods typically tokenize molecules as individual atoms or use rigid, rule-based fragment decompositions…
Message-Passing GNNs Fail to Approximate Sparse Triangular Factorizations
Graph Neural Networks (GNNs) have been proposed as a tool for learning sparse matrix preconditioners, which are key components in accelerat…
Kolmogorov-Arnold Fourier Networks
Although Kolmogorov-Arnold-based interpretable networks (KANs) possess strong theoretical expressiveness, they suffer from severe parameter…
ExplainReduce: Generating global explanations from many local explanations
Most commonly used non-linear machine learning methods are closed-box models, uninterpretable to humans. The field of explainable artificia…
Pragmatic Reasoning improves LLM Code Generation
Pragmatic reasoning helps interlocutors infer intended meaning from ambiguous or underspecified messages by considering shared context and…
How does Bayesian Sampling help Membership Inference Attacks?
Membership Inference Attacks (MIAs) aim to estimate whether a specific data point was used in the training of a given model. Existing state…
Subspace Aggregation Query and Index Generation for Multidimensional Resource Space Model
Organizing large-scale resources in a multidimensional semantic space is an approach to efficiently managing and querying resources from di…
Generalizable Vision-Language Few-Shot Adaptation with Predictive Prompts and Negative Learning
Few-shot adaptation of vision-language models remains fundamentally limited by how negative class signals are handled at inference. Existin…
PhySense: Sensor Placement Optimization for Accurate Physics Sensing
Physics sensing plays a central role in many scientific and engineering domains, which inherently involves two coupled tasks: reconstructin…
Beyond the Proxy: Trajectory-Distilled Guidance for Offline GFlowNet Training
Generative Flow Networks (GFlowNets) excel at sampling diverse, high-reward objects. In many practical applications where active reward que…
MultiPhishGuard: An Explainable and Adaptive Multi-Agent LLM System for Phishing Email Detection
Phishing email detection faces significant challenges due to evolving adversarial tactics and heterogeneous attack patterns. Traditional ap…
Music Interpretation and Emotion Perception: A Computational and Neurophysiological Investigation
This study investigates emotional expression and perception in music performance using computational and neurophysiological methods. The in…
PageLLM: A Multi-Grained Reward Framework for Whole-Page Optimization with Large Language Models
Whole-page optimization (WPO) decides how search and recommendation results are surfaced to users, and large language models (LLMs) open a…
FLoRIST: Singular Value Thresholding for Efficient and Accurate Federated Fine-Tuning of Large Language Models
Integrating Low-Rank Adaptation (LoRA) into federated learning offers a promising solution for parameter-efficient fine-tuning of Large Lan…
Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs
Direct Preference Optimization (DPO) has emerged as a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicit…
From Reasoning to Code: GRPO Optimization for Underrepresented Languages
Generating accurate and executable code using Large Language Models (LLMs) remains a significant challenge for underrepresented programming…
CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning
Embodied Visual Reasoning (EVR) seeks to follow complex, free-form instructions based on egocentric video, enabling semantic understanding…
SoK: A Comprehensive Security Analysis of Jailbreak Resilience in GPT and DeepSeek Models
The rapid proliferation of Large Language Models (LLMs) has heightened concerns regarding their exposure to jailbreak attacks, which craft…
Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models
Masked diffusion language models (MDLMs) promise fast, non-autoregressive text generation, yet existing samplers, which pick tokens to unma…
Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators
As psychometric surveys are increasingly used to assess the traits of large language models (LLMs), the need for scalable survey item gener…
ToolRegistry: A Protocol-Agnostic Tool Management Library for Function-Calling LLMs
Every LLM tool call is structurally an RPC (a function name, JSON arguments, and a serialized result), yet each protocol (native Python,…
QUTCC: Quantile Uncertainty Training and Conformal Calibration for Imaging Inverse Problems
While deep learning offers tremendous promise for scientific and medical imaging, any failures and hallucinations (predictions that do not…
Page image classification for content-specific data processing
Digitization projects in humanities often generate vast quantities of page images from historical documents, presenting significant challen…
HiTeC: Hierarchical Contrastive Learning on Text-Attributed Hypergraph with Semantic-Aware Augmentation
Contrastive learning (CL) has become a dominant paradigm for self-supervised hypergraph learning, enabling effective training without costl…
Explainable Attention-Guided Stacked Graph Neural Networks for Malware Detection
Malware detection in modern computing environments demands models that are not only accurate but also interpretable and robust to evasive t…
Designing Singing Syllabi with Virtual Avatars: AI-Assisted Syllabus Reauthoring
Traditional syllabi often function as static reference documents rather than engaging introductions to a course. In practical teaching, we…
EXOTIC: An Exact, Optimistic, Tree-Based Algorithm for Min-Max Optimization
Min-max optimization arises in many domains such as game theory, adversarial machine learning, etc. For these problems, gradient-based meth…
MCPXKIT: The Unified Toolkit for Analyzing Model Context Protocol Security
The Model Context Protocol (MCP) has emerged as a universal standard that enables AI agents to seamlessly connect with external tools, sign…
LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
Tool calling has emerged as a critical capability for AI agents. In contrast to conventional tool calling frameworks that rely on static, p…
HiGraph: A Large-Scale Hierarchical Graph Dataset for Malware Analysis
The advancement of graph-based malware analysis is critically limited by the absence of large-scale datasets that capture the inherent hier…
SpecPrune-VLA: Accelerating Vision-Language-Action Models via Action-Aware Self-Speculative Pruning
Pruning is a typical acceleration technique for compute-bound models by removing computation on unimportant values. Recently, it has been a…
Dynamic Relational Priming Improves Transformer in Multivariate Time Series
Standard attention mechanisms in transformers employ static token representations that remain unchanged across all pair-wise computations i…
Equip Pre-ranking with Target Attention by Residual Quantization
The pre-ranking stage in industrial recommendation systems faces a fundamental conflict between efficiency and effectiveness. While powerfu…
What Happens Next? Anticipating Future Motion by Generating Point Trajectories
We consider the problem of forecasting motion from a single image, i.e., predicting how objects in the world are likely to move, without th…
HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space
Mixture-of-Experts (MoE) architectures in large language models (LLMs) deliver exceptional performance and reduced inference costs compared…
Uncovering Vulnerabilities of LLM-Assisted Cyber Threat Intelligence
Large language models (LLMs) are increasingly used to help security analysts manage the surge of cyber threats, automating tasks from vulne…
VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes
Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answe…
INSIGHT: INference-time Sequence Introspection for Generating Help Triggers in Vision-Language-Action Models
Recent Vision-Language-Action (VLA) models show strong generalization capabilities, yet they lack introspective mechanisms for anticipating…
Go witheFlow: Real-time Emotion Driven Audio Effects Modulation
Music performance is a distinctly human activity, intrinsically linked to the performer's ability to convey, evoke, or express emotion. Mac…
KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI
Real-time speech-to-speech (S2S) models excel at generating natural, low-latency conversational responses but often lack deep knowledge and…
ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference
Transformer-based large models excel in natural language processing and computer vision, but face severe computational inefficiencies due t…
vAttention: Verified Sparse Attention
State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extensi…
Membership Inference Attacks on Tokenizers of Large Language Models
Membership inference attacks (MIAs) are widely used to assess the privacy risks associated with machine learning models. However, when thes…
Local MAP Sampling for Diffusion Models
Diffusion Posterior Sampling (DPS) provides a principled Bayesian approach to inverse problems by sampling from $p(x_0 \mid y)$. While post…
DeepEN: A Deep Reinforcement Learning Framework for Personalized Enteral Nutrition in Critical Care
Objective: Enteral nutrition (EN) delivery in the ICU remains suboptimal due to limited personalization and uncertainty regarding appropria…
Auditing medical multi-agent AI reveals risks of false consensus
Large language models are increasingly being assembled into medical multi-agent systems that emulate multidisciplinary consultation through…
FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model
Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that…
AI-generated podcasts: Synthetic Intimacy and Cultural Mistranslation in NotebookLM's Audio Overviews
This paper analyses AI-generated podcasts produced by Google's NotebookLM, which generates audio podcasts with two chatty AI hosts discussi…
BackWeak: Backdooring Knowledge Distillation Simply with Weak Triggers and Fine-tuning
Knowledge Distillation (KD) is essential for compressing large models, yet relying on pre-trained "teacher" models downloaded from third-pa…
Bridging the Semantic-Action Gap in Visual Token Pruning for Efficient VLA Inference
Vision-Language-Action (VLA) models have shown great potential for embodied AI by integrating visual perception, language understanding, an…
Understanding, Accelerating, and Improving MeanFlow Training
MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the un…
Asking LLMs to Verify First is Almost Free Lunch
To enhance the reasoning capabilities of Large Language Models (LLMs) without the high cost of training or extensive test-time sampling, we…
E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving
End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger'…
Smart Timing for Mining: A Deep Learning Framework for Bitcoin Hardware ROI Prediction
Bitcoin mining hardware acquisition requires strategic timing due to volatile markets, rapid technological obsolescence, and protocol-drive…
Intrinsically Interpretable Attention via Sparse Post-Training
We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible spa…
AI as Equalizer or Amplifier? Task Complexity as the Moderating Factor for Human Expertise in Hybrid Intelligence Systems
A growing body of empirical research suggests that generative AI narrows performance gaps between novice and expert workers on routine task…
DynaPURLS: Dynamic Refinement of Part-Aware Representations for Skeleton-Based Zero-Shot Action Recognition
Zero-shot skeleton-based action recognition (ZS-SAR) is fundamentally constrained by prevailing approaches that rely on aligning skeleton f…
Coupled Variational Reinforcement Learning for Language Model General Reasoning
While reinforcement learning has achieved impressive progress in language model reasoning, it is constrained by the requirement for verifia…
Selection-Induced Contraction of Innovation Statistics in Gated Kalman Filters
Validation gating is a fundamental component of classical Kalman-based tracking systems. Only measurements whose normalized innovation squa…
$M^3$-Verse: A "Spot the Difference" Challenge for Large Multimodal Models
Modern Large Multimodal Models (LMMs) have demonstrated extraordinary ability in static image and single-state spatial-temporal understandi…
DIVER-1: Scaling Intracranial EEG Foundation Models for Transferable Representations
Intracranial EEG (iEEG) provides direct, millisecond-scale recordings of human neural activity, but reusable representation learning is dif…
Multimodal Functional Maximum Correlation for Emotion Recognition
Emotional states manifest as coordinated yet heterogeneous physiological responses across central and autonomic systems, posing a fundament…
A Comprehensive Dataset for Human vs. AI Generated Image Detection
Multimodal generative AI systems like Stable Diffusion, DALL-E, and MidJourney have fundamentally changed how synthetic images are created.…
Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts
Mixture-of-Experts (MoE) architectures scale large language models efficiently by employing a parametric "router" to dispatch tokens to a…
FlowPlan-G2P: A Structured Generation Framework for Transforming Scientific Papers into Patent Descriptions
Generating patent descriptions from scientific papers is challenging due to fundamental rhetorical and structural disparities between the t…
SentGraph: Hierarchical Sentence Graph for Multi-hop Retrieval-Augmented Question Answering
Traditional Retrieval-Augmented Generation (RAG) effectively supports single-hop question answering with large language models but faces si…
AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation
Multimodal medical large language models have shown substantial progress in chest X-ray interpretation but continue to face challenges in s…
Extreme-value forest fire prediction: A study of the Loss Function in an Ordinality Scheme
Wildfires are highly imbalanced natural hazards in both space and severity, making the prediction of extreme events particularly challengin…
NeoAMT: Neologism-Aware Agentic Machine Translation with Reinforcement Learning
Neologism-aware machine translation aims to translate source sentences containing neologisms into target languages. This field remains unde…
PiXTime: A Model for Federated Time Series Forecasting with Heterogeneous Data across Nodes
While collaborative forecasting on distributed time series is highly desirable, directly pooling localized datasets is often impractical du…
RiskBridge: Turning CVEs into Business-Aligned Patch Priorities
Enterprises are confronted with an unprecedented escalation in cybersecurity vulnerabilities, with thousands of new CVEs disclosed each mon…
SafeGPT: Preventing Data Leakage and Unethical Outputs in Enterprise LLM Use
Large Language Models (LLMs) are transforming enterprise workflows but introduce security and ethics challenges when employees inadvertentl…
QASA: Quality-Aware Semantic Augmentation for Robust Multimodal Sentiment Analysis
Multimodal large language models have demonstrated strong ability in capturing semantic representations for multimodal sentiment analysis.…
Future-KL Regularized GRPO: Process-Level Credit Assignment from $f$-Divergence Regularization
Group Relative Policy Optimization (GRPO) is widely used for critic-free Large Language Model (LLM) post-training, but its KL regularizatio…
Pixelwise Uncertainty Quantification of Accelerated MRI Reconstruction
Parallel imaging techniques reduce magnetic resonance imaging (MRI) scan time but image quality degrades as the acceleration factor increas…
Delayed Assignments in Online Non-Centroid Clustering with Stochastic Arrivals
Clustering is a fundamental problem, aiming to partition a set of elements, like agents or data points, into clusters such that elements in…
Safety Generalization Under Distribution Shift in Safe Reinforcement Learning: A Diabetes Testbed
Safe Reinforcement Learning (RL) algorithms are typically evaluated under fixed training conditions. We investigate whether training-time s…
Unifying Speech Editing Detection and Content Localization via Prior-Enhanced Audio LLMs
Existing speech editing detection (SED) datasets are predominantly constructed using manual splicing or limited editing operations, resulti…
Dynamics Reveals Structure: Challenging the Linear Propagation Assumption
Neural networks adapt through first-order parameter updates, yet it remains unclear whether such updates preserve logical coherence. We inv…
Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs
Vision-Language Models (VLMs) achieve strong multimodal performance but are costly to deploy, and post-training quantization often causes s…
BEAR: Towards Beam-Search-Aware Optimization for Recommendation with Large Language Models
Recent years have seen a rapid surge in research leveraging Large Language Models (LLMs) for recommendation. These methods typically employ…
RecGOAT: Graph Optimal Adaptive Transport for LLM-Enhanced Multimodal Recommendation with Dual Semantic Alignment
Integrating large language model (LLM) representations into multimodal recommendation has shown promise, yet a fundamental challenge remain…
Generative Visual Code Mobile World Models
Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at train- and in…
MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents
Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand-designed operations for extracting memory. These f…
Reward-free Alignment for Conflicting Objectives
Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world align…
SPA-Cache: Singular Proxies for Adaptive Caching in Diffusion Language Models
While Diffusion Language Models (DLMs) offer a flexible, arbitrary-order alternative to the autoregressive paradigm, their non-causal natur…
Fine-Tuning Language Models to Know What They Know
Evaluating true metacognition in Large Language Models (LLMs) is difficult due to biases and heuristics. This paper presents a framework to…
Agent Primitives: Reusable Latent Building Blocks for Multi-Agent Systems
While existing multi-agent systems (MAS) can handle complex problems by enabling collaboration among multiple agents, they are often highly…
Scalable Explainability-as-a-Service (XaaS) for Edge AI Systems
Though Explainable AI (XAI) has made significant advancements, its inclusion in edge and IoT systems is typically ad-hoc and inefficient. M…
Counterfactual Explanations for Hypergraph Neural Networks
Hypergraph neural networks (HGNNs) effectively model higher-order interactions in many real-world systems but remain difficult to interpret…
F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare
Reinforcement Learning with Verifiable Rewards (RLVR) is commonly based on group sampling to estimate advantages and stabilize policy updat…
SoK: DARPA's AI Cyber Challenge (AIxCC): Competition Design, Architectures, and Lessons Learned
DARPA's AI Cyber Challenge (AIxCC, 2023–2025) is the largest competition to date for building fully autonomous cyber reasoning systems (CR…
Prism: Spectral-Aware Block-Sparse Attention
Block-sparse attention is promising for accelerating long-context LLM pre-filling, yet identifying relevant blocks efficiently remains a bo…
Contextual Rollout Bandits for Reinforcement Learning with Verifiable Rewards
Reinforcement Learning with Verifiable Rewards (RLVR) is an effective paradigm for improving the reasoning capabilities of large language m…
Krause Synchronization Transformers
Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. W…
Two-Sided Time-Independent Regret for Matching Markets with Limited Interviews
Two-sided matching platforms rely on preferences from both sides, yet participants can evaluate only a small fraction of potential partners…
Adversarial Network Imagination: Causal LLMs and Digital Twins for Proactive Telecom Mitigation
Telecommunication networks experience complex failures such as fiber cuts, traffic overloads, and cascading outages. Existing monitoring an…
STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on…
CARL-CXR: Continual Adapter-Based Routing for Task-Unknown Chest Radiograph Classification
Clinical deployment of chest radiograph classifiers requires models that can be updated as new datasets become available without retraining…
MARS: Margin and Semantic-Aware Data Augmentation for Reward Modeling
Reward modeling is central to alignment pipelines such as RLHF, RLAIF, and PPO-based policy optimization, yet its reliability is constraine…
JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments
Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio…
Red-Teaming Claude Opus and ChatGPT-based Security Advisors for Trusted Execution Environments
Trusted Execution Environments (TEEs) (e.g., Intel SGX and Arm TrustZone) aim to protect sensitive computation from a compromised operating…
MoBiQuant: Mixture-of-Bits Quantization for Token-Adaptive Any-Precision LLM
Dynamic runtime latency and memory constraints necessitate flexible large language model (LLM) deployment, where an LLM can be inferred wit…
Learning from Trials and Errors: Reflective Test-Time Planning for Embodied LLMs
Embodied LLMs endow robots with high-level task reasoning, but they cannot reflect on what went wrong or why, turning deployment into a seq…
Topology-Driven Transferability Estimation of Medical Foundation Models for Segmentation
The advent of large-scale self-supervised learning (SSL) has produced a vast zoo of medical foundation models. However, selecting optimal m…
MultiPUFFIN: A Multimodal Domain-Constrained Foundation Model for Molecular Property Prediction of Small Molecules
MultiPUFFIN is a domain-informed multimodal foundation model for predicting thermophysical properties of small molecules, addressing a crit…
Non-Invasive Reconstruction of Intracranial EEG Across the Deep Temporal Lobe from Scalp EEG based on Conditional Normalizing Flow
Although obtaining deep brain activity from non-invasive scalp electroencephalography (sEEG) is crucial for neuroscience and clinical diagn…
Grouter: Decoupling Routing from Representation for Accelerated MoE Training
Traditional Mixture-of-Experts (MoE) training typically proceeds without any structural priors, effectively requiring the model to simultan…
RCTs & Human Uplift Studies: Methodological Challenges and Practical Solutions for Frontier AI Evaluation
Human uplift studies, or studies that measure the effects of AI access on human performance via randomized controlled trials (RCT) or simil…
UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Tasks
The success of a Large Language Model (LLM) task depends heavily on its prompt. Most use-cases specify prompts using natural language, whic…
Is Human Annotation Necessary? Iterative MBR Distillation for Error Span Detection in Machine Translation
Error Span Detection (ESD) is a crucial subtask in Machine Translation (MT) evaluation, aiming to identify the location and severity of tra…
Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization
Post-training model compression is essential for enhancing the portability of Large Language Models (LLMs) while preserving their performan…
Do Understanding and Generation Fight? A Diagnostic Study of DPO for Unified Multimodal Models
Unified multimodal models share a language model backbone for both understanding and generating images. Can DPO align both capabilities sim…
PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching
Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities…
Discounted Beta-Bernoulli Reward Estimation for Sample-Efficient Reinforcement Learning with Verifiable Rewards
Reinforcement learning with verifiable rewards (RLVR) has emerged as an effective post-training paradigm for improving the reasoning capabi…
Procedural Refinement by LLM-driven Algorithmic Debugging for ARC-AGI-2
In high-complexity abstract reasoning, a system must infer a latent rule from a few examples or structured observations and apply it to uns…
Profiling learners' affective engagement: Emotion AI, intercultural pragmatics, and language learning
Learning another language can be a highly emotional process, typically characterized by numerous frustrations and triumphs, big and small.…
Beyond Static Uncertainty: Modeling Temporal Uncertainty Dynamics for Probabilistic Time Series Forecasting
Real-world time series exhibit temporally structured uncertainty: volatility clusters in turbulent regimes, dissipates in stable periods, a…
UniRank: End-to-End Domain-Specific Reranking of Hybrid Text-Image Candidates
Reranking is a critical component in many information retrieval pipelines. Despite remarkable progress in text-only settings, multimodal re…
AEROS: A Single-Agent Operating Architecture with Embodied Capability Modules
Robotic systems lack a principled abstraction for organizing intelligence, capabilities, and execution in a unified manner. Existing approa…
EditCaption: Human-Refined SFT and HAE-DPO for Image Editing Instruction Synthesis
High-quality source-target image pairs with precise editing instructions are essential for instruction-guided image editing, yet constructi…
Temporal Dropout Risk in Learning Analytics: A Harmonized Survival Benchmark Across Dynamic and Early-Window Representations
Student dropout is a persistent concern in Learning Analytics, yet comparative studies frequently evaluate predictive models under heteroge…
M$^\star$: Every Task Deserves Its Own Memory Harness
Large language model agents rely on specialized memory systems to accumulate and reuse knowledge during extended interactions. Recent archi…
Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations
When LLM conversations grow beyond the context window, old content must be evicted, but how does the model recover it when needed? We pro…
Design Conditions for Intra-Group Learning of Sequence-Level Rewards: Token Gradient Cancellation
Reinforcement learning for multi-step reasoning with large language models (LLMs) typically relies on sparse terminal rewards, which create…
Federation over Text: Insight Sharing for Multi-Agent Reasoning
We propose a federated learning-like framework, Federation over Text (FoT), that enables multiple clients solving different tasks to collec…
Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction
This paper investigates the length problem in sequence-level relative reinforcement learning. We observe that, although existing methods pa…
Depth Registers Unlock W4A4 on SwiGLU: A Reader/Generator Decomposition
We study post-training W4A4 quantization in a controlled 300M-parameter SwiGLU decoder-only language model trained on 5B tokens of FineWeb-…
Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing
LLMs edit text and code by autoregressively regenerating the full output, even when most tokens appear verbatim in the input. We study Copy…
Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs
Hosted-LLM providers have a silent-substitution incentive: advertise a stronger model while serving cheaper replies. Probe-after-return sch…
MoBayes: A Modular Bayesian Framework for Separating Reasoning from Language in Conversational Clinical Decision Support
Large language models (LLMs) are increasingly used for conversational clinical decision support, yet they conflate next token prediction wi…
Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations
Full-duplex spoken dialogue systems can model natural conversational behaviours such as interruptions, overlaps, and backchannels, yet such…
Talking Slide Avatars: Open-Source Multimodal Communication Approach for Teaching
Slide-based teaching is widely used in higher education, yet in online, hybrid, and asynchronous contexts, slides often lose instructor pre…
ESIA: An Energy-Based Spatiotemporal Interaction-Aware Framework for Pedestrian Intention Prediction
Recent advances in autonomous driving have motivated research on pedestrian intention prediction, which aims to infer future crossing decis…
Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation
Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi-hop questions by progressive…
VILAS: A VLA-Integrated Low-cost Architecture with Soft Grasping for Robotic Manipulation
We present VILAS, a fully low-cost, modular robotic manipulation platform designed to support end-to-end vision-language-action (VLA) polic…
Soft-to-Hard Routing in Sparse Mixture-of-Experts Models
Softmax routing approaches hard top-1 routing as the temperature tends to zero, but the limiting passage is singular at router ties. This p…
Efficient Preference Poisoning Attack on Offline RLHF
Offline Reinforcement Learning from Human Feedback (RLHF) pipelines such as Direct Preference Optimization (DPO) train on a pre-collected p…
Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses
Embodied Artificial Intelligence (Embodied AI) integrates perception, cognition, planning, and interaction into agents that operate in open…
From Muscle Bursts to Motor Intent: Self-Supervised Token Modeling for Heterogeneous EMG
Surface electromyography provides a practical way to infer human movement intention from wearable muscle recordings, but models trained und…
Auditing Stealth Sycophancy in Mental-Health Dialogue: Structured Clinical-State Diagnostics and Clean Matched Benchmarks
Mental-health dialogue models are increasingly evaluated by AI-based evaluators, yet these evaluators often treat surface empathy, supporti…
BFORE: Butterfly-Firefly Optimized Retinex Enhancement for Low-Light Image Quality Improvement
Low-light images suffer from poor visibility, noise, and color distortion. Existing Retinex-based enhancement methods rely on manually tune…
LLMs Uncertainty Quantification via Adaptive Conformal Semantic Entropy
LLMs' overconfidence, particularly when hallucinating, poses a significant challenge for the deployment of the models in safety-critical se…
Mitigating Label Shift in Tabular In-Context Learning via Test-Time Posterior Adjustment
TabPFN has recently gained attention as a foundation model for tabular datasets, achieving strong performance by leveraging in-context lear…
Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization
Jailbreak attacks on audio language models (ALMs) optimize audio perturbations to elicit unsafe generations, and they typically update the…
Aes3D: Aesthetic Assessment in 3D Gaussian Splatting
As 3D Gaussian Splatting (3DGS) gains attention in immersive media and digital content creation, assessing the aesthetics of 3D scenes beco…
Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning
The central challenge of reinforcement learning for reasoning lies not only in the sparsity of outcome-level supervision, but more fundamen…
E = T*H/(O+B): A Dimensionless Control Parameter for Mixture-of-Experts Ecology
We introduce E = T*H/(O+B), a dimensionless control parameter that predicts whether Mixture-of-Experts (MoE) models will develop a healthy…
PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization
We introduce PACZero, a family of PAC-private zeroth-order mechanisms for fine-tuning large language models that delivers usable utility at…
Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation
Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot s…
Intelligent Truck Matching in Full Truckload Shipments using Ping2Hex approach
Accurate truck-to-shipment matching using GPS data is foundational for full truckload supply chain visibility, enabling real-time tracking…
Flow-OPD: On-Policy Distillation for Flow Matching Models
Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induc…
FactoryNet: A Large-Scale Dataset toward Industrial Time-Series Foundation Models
We introduce the first universal pretraining corpus for industrial time-series data: FactoryNet. 51M datapoints across 23k end-to-end task…
Memorize Theorems, Not Instances: Probing SFT Generalization through Mathematical Reasoning
Supervised Fine-Tuning (SFT) is widely used for task-specific adaptation, yet recent work shows it systematically undermines reasoning gene…
Real vs. Semi-Simulated: Rethinking Evaluation for Treatment Effect Estimation
Estimating heterogeneous treatment effects with machine learning has attracted substantial attention in both academic research and industri…
An Uncertainty-Aware Resilience Micro-Agent for Causal Observability in the Computing Continuum
Grey failures in the computing continuum produce ambiguous overlapping symptoms that existing approaches fail to diagnose reliably, either…
Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization
Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transf…
PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks
Watermarking for large language models (LLMs) is a promising approach for detecting LLM-generated text and enabling responsible deployment.…
SURGE: Surrogate Gradient Adaptation in Binary Neural Networks
The training of Binary Neural Networks (BNNs) is fundamentally based on gradient approximation for non-differentiable binarization operatio…
HEPA: A Self-Supervised Horizon-Conditioned Event Predictive Architecture for Time Series
Critical events in multivariate time series, from turbine failures to cardiac arrhythmias, demand accurate prediction, yet labeled data is…
Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
Visual latent reasoning lets a multimodal large language model (MLLM) create intermediate visual evidence as continuous tokens, avoiding ex…
Persona-Model Collapse in Emergent Misalignment
Fine-tuning large language models on narrow data with harmful content produces broadly misaligned behavior on unrelated prompts, a phenomen…
Data Difficulty and the Generalization-Extrapolation Tradeoff in LLM Fine-Tuning
Data selection during supervised fine-tuning (SFT) can critically change the behavior of large language models (LLMs). Although existing wo…
Turning Stale Gradients into Stable Gradients: Coherent Coordinate Descent with Implicit Landscape Smoothing for Lightweight Zeroth-Order Optimization
Zeroth-Order (ZO) optimization is pivotal for scenarios where backpropagation is unavailable, such as memory-constrained on-device learning…
One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries
Model providers increasingly release open weights or allow users to fine-tune foundation models through APIs. Although these models are saf…
SurgicalMamba: Dual-Path SSD with State Regramming for Online Surgical Phase Recognition
Online surgical phase recognition (SPR) underpins context-aware operating-room systems and requires committing to a prediction at every fra…
Learning Selective Merge Policies for Deadline-Constrained Coded Caching via Deep Reinforcement Learning
In coded caching, the server uses information cached at the users to serve multiple users in parallel with a single coded multi-cas…
Reducing Credit Assignment Variance via Counterfactual Reasoning Paths
Reinforcement learning for multi-step reasoning with large language models (LLMs) typically relies on sparse terminal rewards, which create…
How Few-Shot Examples Add Up: A Causal Decomposition of Function Vectors in In-Context Learning
In-context learning (ICL) excels at new tasks from minimal examples, yet we still lack a mechanistic explanation of how few-shot prompts sh…
Position: State-of-the-Art Claims Require State-of-the-Art Evidence
State-of-the-Art (SOTA) claims pervade Artificial Intelligence (AI) and Machine Learning (ML) research. These claims rest on benchmark eval…
L-Drive: Beyond a Single Mapping-Latent Context Drives Time Series Forecasting
Mainstream methods for multivariate time-series forecasting largely follow the Direct-Mapping paradigm. They learn a unified mapping from h…
BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting
Quantitative backtesting is essential for evaluating trading strategies but remains hampered by high technical barriers and limited scalabi…
LivePI: More Realistic Benchmarking of Agents Against Indirect Prompt Injection
AI agents such as OpenClaw are increasingly deployed in local workflows with access to external tools. This creates indirect prompt-injecti…
A Simplex Witness Certificate for Constant Collapse in Variational Autoencoders
We study exact constant collapse in variational autoencoders: the deterministic encoder mean becomes independent of the input. The prior re…
KairosHope: A Next-Generation Time-Series Foundation Model for Specialized Classification via Dual-Memory Architecture
Time Series Foundation Models (TSFMs) have demonstrated notable success in general-purpose forecasting tasks; however, their adaptation to…
ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop
Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations, and reason about how observations vary a…
Simply Stabilizing the Loop via Fully Looped Transformer
Scaling model performance typically requires increasing model size. Looped Transformer offers a compelling alternative by iteratively reusi…
Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling
Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across…
The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next
Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases -- and…
CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation
We investigate Counterfactual Video Foley Generation, which aims to adopt a sound-source identity that contradicts the visual evidence whil…
HypergraphFormer: Learning Hypergraphs from LLMs for Editable Floor Plan Generation
In this work, we propose HypergraphFormer, a novel and efficient approach to floor plan generation based on learning hypergraph representat…
Unlocking the Potential of Continual Model Merging: An ODE Perspective
Continual Model Merging (CMM) enables rapid customization of foundation models by sequentially incorporating task-adapted models without re…
FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fi…
ProcCtrlBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents
Existing benchmarks for LLM coding agents primarily evaluate final outcomes. While useful for measuring overall capability, these metrics p…
ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison
Long-form image captioning exposes a reward granularity problem in RL: captions are judged as whole sequences, while the important errors o…
The Devil is in the Condition Numbers: Why is GLU Better than non-GLU Structure?
Gated Linear Units (GLU) and their variants are widely adopted in modern open-source large language model architectures and consistently ou…
Ordering Matters: Rank-Aware Selective Fusion for Blended Emotion Recognition
Blended emotion recognition is challenging because emotions are often expressed as mixtures of subtle and overlapping multimodal cues rathe…
Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA via Active Zooming
Vision-Language Models (VLMs) have significantly advanced medical visual question answering, yet their performance in ultrasound remains su…
Action with Visual Primitives
Vision-Language-Action (VLA) models have emerged as a promising paradigm for generalist robotic manipulation. A common design in current ar…
TimeGuard: Channel-wise Pool Training for Backdoor Defense in Time Series Forecasting
Time Series Forecasting (TSF) is highly vulnerable to backdoor attacks, yet effective defenses remain underexplored due to challenges arisi…
Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents
Skills have become a practical packaging mechanism for agent instructions, workflows, scripts, and reference materials. In enterprise setti…
AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild
As wearable and mobile devices become increasingly embedded in daily life, they offer a practical way to continuously sense human motion in…
Understanding Data Temporality Impact on Large Language Models Pre-training
Large language models (LLMs) are typically trained on shuffled corpora, yielding models whose knowledge is frozen at train time and whose t…
Finite-Particle Convergence Rates for Conservative and Non-Conservative Drifting Models
We propose and analyze a conservative drifting method for one-step generative modeling. The method replaces the original displacement-based…
The Matching Principle: A Geometric Theory of Loss Functions for Nuisance-Robust Representation Learning
Robustness, domain adaptation, photometric/occlusion invariance, sensor drift, and alignment style are treated as separate literatures with…
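PC control and inner speech by thought alone: China accelerates AI brain implant development, with general sale possibly near, Nature reports
According to Nature, Chinese startups are rapidly pushing ahead with the development and commercialization of AI-powered brain-computer interfaces (BCIs).
From small talk and greetings to snow shoveling: how Credit Saison's "43-person meddler squad" changed "loneliness" in the AI era
Credit Saison, which has set the goal of "turning every employee into an AI worker," is pursuing operational efficiency premised on AI use. The company, known for its forward-looking image, is now putting effort into an initiative nicknamed "Osekkai" (おせっ会), in which employees "meddle" to help solve one another's small problems.
Dual management of the website hits its limit: Aomori Prefectural Government cuts costs by 70% by moving its chatbot to generative AI
The Aomori Prefectural Government has introduced a generative AI chatbot, "AI Digital Staff," on its official website. It says it revisited the operational burden that had been an issue with its previous scenario-based chatbot and cut operating costs by more than 70%.
"Where do I even start...": 62% say so. The "five big anxieties" holding back AI use at small and medium-sized businesses
Leach has published the results of a survey on the state of AI adoption at small and medium-sized businesses. The AI adoption rate was only about 12%, and the biggest challenge cited was "I don't know where to start." Adoption centers on making routine work such as document processing and data entry more efficient.
"When that one person is off, work tends to stall": I had AI make "IT-term karuta"
IT terms are handy, but learning them from scratch is hard work. So might recasting them in 5-7-5 verse make them a little more approachable? I tried making "IT-term karuta" with ChatGPT and NotebookLM.
Anthropic's "Mythos Preview" finds more than 10,000 vulnerabilities in one month: "Project Glasswing" initial report
Anthropic has published an initial report on "Project Glasswing," a cybersecurity project that uses its unreleased AI model "Claude Mythos Preview." Working with about 50 partner companies, in the first month since launch, more than 10,000 critical vulnerabilities in the world's important software…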
What ClickUp’s mass layoff tells us about the future of work
The nine-year-old startup is replacing hundreds of employees with thousands of AI agents.
The pope’s AI encyclical isn’t really about AI
Pope Leo XIV's first encyclical uses AI as a lens to diagnose older problems: concentrated power, eroding democracy, and a tech elite that…
2026-05-25 (296 items)
Startup Battlefield 200 applications close in days: Apply before May 27
The deadline to apply or nominate for Startup Battlefield 200 is May 27. This is your shot at VC access, global visibility, TechCrunch cove…
5 days left: Save up to $410 on TechCrunch Disrupt 2026 passes before prices increase
Early Bird savings for TechCrunch Disrupt 2026 in San Francisco end May 29 at 11:59 p.m. PT. Register now to save up to $410 before prices…
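"Claude Mythos" finds more than 10,000 vulnerabilities, but fixes cannot keep up: Anthropic report
On May 22 (local time), US-based Anthropic published an initial report on its security project "Project Glasswing." In addition to the result that about 50 partner companies found more than 10,000 high- and critical-severity vulnerabilities in one month, the results of the open-source software scans the company has been running on its own…
Nihon University: 10,000 faculty and staff to use "Google AI Pro"
Nihon University will adopt "Google AI Pro for Education," Google's AI subscription for educational institutions. It will make the service available to 10,000 full-time faculty and staff, aiming to further streamline routine work. Google Cloud Japan made the announcement.
ServiceNow announces a set of features to fill the "data gap" for AI agents
Siloed data and immature governance are the "wall" standing in the way of AI agent adoption. How is ServiceNow trying to solve this problem? This article introduces the new features announced at its annual event.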
BOHM: Zero-Cost Hierarchical Attribution for Compound AI Systems
Compound AI systems route tasks through hierarchies of specialised components. Attribution is dominated by Shapley-based methods (SHAP), wh…
NeuroNL2LTL: A Neurosymbolic Framework for Natural Language Translation of Linear Temporal Logic
Effectively translating between natural language (NL) and formal logics like Linear Temporal Logic (LTL) requires expertise that limits for…
RMA: an Agentic System for Research-Level Mathematical Problems
We present Research Math Agents (RMA), an agentic framework for automated reasoning on research-level mathematical problems. Unl…
SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research
The exponential growth of global academic output has confronted researchers and AI agents with an unprecedented "information explosion,"…
Energy per Successful Goal: Goal-Level Energy Accounting for Agentic AI Systems
Current AI energy benchmarks measure consumption at the granularity of a single model invocation or training run. For classical single-turn…
ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization
Formal mathematics libraries are rapidly expanding, creating a growing need to refactor verified proofs for maintainability and to improve…
Mediative Fuzzy Logic: From Type-1 Foundations to Type-2, Type-3 and Quantum Extensions
Mediative Fuzzy Logic was conceived as a practical scheme for reconciling hesitant or conflicting assessments in fuzzy control and decision…
EVE-Agent: Evidence-Verifiable Self-Evolving Agents
Self-evolving agents should not train on examples they cannot justify. Data-free self-evolving search agents offer a scalable route to syst…
The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems
Large language models now write software, draft legal documents, and produce clinical notes, yet fundamental limits, from Turing and Arrow…
PathCal: State-Aware Reflection-Marker Calibration for Efficient Reasoning
The emergence of Large Reasoning Language Models (LRMs) has paved the way for tackling complex reasoning tasks through test-time scaling by…
Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems
AI agents increasingly excel at generating, testing, and refining code. However, they fall short on tasks requiring formal guarantees of fu…
Redrawing the AI Map: A Theory of Accountability Boundaries in Agentic Ecosystems
Agentic AI orchestrators reduce the interface and assembly costs of composing information systems capabilities across organizational bounda…
AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery
Scientific research is being reshaped by AI systems that move beyond isolated assistance toward longer-horizon workflows spanning literatur…
Foundation Protocol: A Coordination Layer for Agentic Society
Autonomous agents are moving from tools into a layer of social infrastructure: they browse, purchase, deploy software, manage systems, and…
GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models
Large language models (LLMs) are increasingly deployed as economic agents in marketplaces, auctions, and bidding settings. Anticipating the…
Design and Report Benchmarks for Knowledge Work
The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare. However,…
Parallel Context Compaction for Long-Horizon LLM Agent Serving
Long-horizon LLM agents accumulate growing conversation histories that eventually exceed the model's context window. Context compaction via…
Ontological Knowledge Blocks: Executable Compliance and Profile-Based Validation for Trustworthy AI Systems
AI-enabled services deployed in critical digital infrastructure are subject to governance obligations spanning transparency, accountability…
DART: Semantic Recoverability for Structured Tool Agents
When a structured tool agent fails mid-execution, the runtime faces a dilemma: replaying the entire task is safe but wasteful, while restor…
Human-in-the-Loop Multi-Agent Ventilator Decision Support with Contextual Bandit Preference Learning
Ventilator decision support requires sequential decisions that track evolving physiology and disease trajectories while respecting safety b…
When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems
LLM-based multi-agent systems can fail even when planned actions are executed correctly because agents may misjudge their knowledge when ev…
EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation
On-Policy Distillation (OPD) has gained wide traction as an LLM post-training paradigm due to its effectiveness in improving capabilities…
CP or DP? Why Not Both: A Case Study in the Partial Shop Scheduling Problem
Dynamic Programming (DP) and Constraint Programming (CP) are well-established paradigms for solving combinatorial optimization problems. Us…
Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
ReAct-style agents for search-intensive, multi-step reasoning tasks rely largely on their own internal judgment to decide what evidence to…
Solving the Aircraft Disassembly Scheduling Problem
Dismantling aircraft reaching their end of life is a complex endeavour that is necessary in terms of sustainability but yields small incom…
One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents
On a 300-persona life-simulation benchmark, pcsp achieves compositional zero-shot persona identification up to 17x above chance, Spearman r…
MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection
Large language model agents increasingly rely on persistent memory to store past interactions, retrieve relevant demonstrations, and improv…
Agentic Proving for Program Verification
Agentic systems have recently emerged as state-of-the-art approaches for automated theorem proving in formal mathematics. To assess how far…
Beyond Binary Edits: Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment
Multimodal large language models (MLLMs) need efficient mechanisms to update knowledge without degrading existing capabilities. While intri…
SPACENUM: Revisiting Spatial Numerical Understanding in VLMs
Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they need to produce numerical outputs such as action…
From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills
Language agents increasingly improve by reusing skills -- structured procedural artifacts distilled from past experience. In particu…
SkillOpt: Executive Strategy for Self-Evolving Agent Skills
Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a…
An AI-Driven Framework for Energy-Efficient Environmental Monitoring in Smart Cities Using Edge Intelligence
Environmental monitoring is a crucial component of the smart city infrastructure. It enables informed decision making which enhances sustai…
KPI2KVI: A Multi Agent Workflow for Calculating Key Value Indicators from Service Descriptions
Key Value Indicators (KVIs) provide a decision oriented view of a service by summarizing how operational performance translates into stakeh…
Evaluating Large Language Models in a Complex Hidden Role Game
Quantifying the deceptive potential of Large Language Models (LLMs) is critical for AI safety, yet difficult to achieve in uncontrolled env…
Computable Fairness: Boltzmann-Softmax Control for AI Resource Allocation
In large-scale AI systems, allocating scarce resources such as GPU compute time and bandwidth among multiple agents is a critical challenge…
LFRAG: Layout-oriented Fine-grained Retrieval-Augmented Generation on Multimodal Document Understanding
Multimodal Retrieval-Augmented Generation (RAG) has emerged as an effective paradigm for enhancing Large Language Models (LLMs) with extern…
RAG4Outcome: A Retrieval-Augmented Multimodal Framework for Prognostic Prediction in Chronic Osteomyelitis
Chronic osteomyelitis presents substantial prognostic challenges due to its high recurrence risk and complex postoperative recovery traject…
The Cognitive Kardashev Scale: Quantifying the Material Envelope of Civilisational Computation
How much thinking can a civilisation do? Kardashev's (1964) typology ranks civilisations by total power: planetary (Type I, ~10^16 W), stel…
Strategic Coercion Within Alliances: The Greenland Sovereignty Game as an AI Stress Test
What happens when the strongest alliance member pressures a weaker member over territory and strategic control? We examine the Greenland so…
The Misattribution Gap: When Memory Poisoning Looks Like Model Failure in Agentic AI Systems
Multi-agent AI pipelines typically assume that agent misconduct originates from model misalignment. We identify a structural failure in thi…
ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse
Prefix KV caching has become a key mechanism in LLM serving: it reduces time to first token (TTFT) by avoiding redundant computation across…
Expressive Power of Deep Homomorphism Networks over Relational Databases
The expressive limitations of message-passing Graph Neural Networks (GNNs) have motivated a wide range of more powerful graph learning arch…
PrefBench: Evaluating Zero-Shot LLM Agents in Hidden-Preference Personalized Pricing Negotiations
Personalized pricing negotiations are a challenging testbed for LLM agents because successful interaction does not guarantee profitable dec…
PilotWiMAE: Pilot-Native Representation Learning for Wireless Channels
Channel foundation models assume access to fully observed channels, an assumption that fails in deployment. We introduce PilotWiMAE, a self…
Staging by the Book: Automatic Sleep Stage Classification Using Scoring Rules
Automated sleep staging is commonly approached as a supervised machine learning problem, with deep learning methods dominating recent resea…
The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models
Chain-of-thought (CoT) prompting is necessary for arithmetic in small language models, yet shuffling its steps preserves most performance.…
Approximate Machine Unlearning through Manifold Representation Forgetting Guided by Self Mode Connectivity
Machine unlearning is a fundamental mechanism that enforces the right to be forgotten. Existing unlearning studies that rely on label manip…
MedExpMem: Adapting Experience Memory for Differential Diagnosis
Experienced physicians develop diagnostic expertise through clinical practice, acquiring not only disease knowledge but also the ability to…
When Do LLMs Reason? A Dynamical Systems View via Entropy Phase Transitions
Chain-of-thought (CoT) reasoning has become the default strategy for enhancing LLM capabilities, yet its application raises a fundamental q…
How Far Will They Go? Red-Teaming Online Influence with Large Language Models
As large language model (LLM)-based agents increasingly participate in online discourse, red-teaming their capacity to support political in…
Tensor Cache: Eviction-conditioned Associative Memory for Transformers
Autoregressive Transformer KV caches grow linearly with context length; sliding-window caching bounds memory but discards evicted tokens en…
Agentic-VLA: Efficient Online Adaptation for Vision-Language-Action Models
Vision-Language-Action (VLA) models have emerged as a promising paradigm for robotic manipulation by leveraging pre-trained vision-language…
Transcoders Trace Visual Grounding and Hallucinations in Vision-Language Models
Generative Vision-Language Models (VLMs) perform well on multimodal reasoning, but how visual inputs are transformed to text remains poorly…
Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?
Benchmark accuracy is often implicitly assumed to reflect grounded visual understanding in vision-language models (VLMs), yet it remains un…
Suicide Risk Assessment from AI-powered Video Surveillance: An Interpretable Framework for Prevention in Metro Stations
Understanding and monitoring human behavior in metro stations play an important role in supporting suicide prevention efforts, where early…
Human-Centered Learning Mechanics: A Dynamical Framework for Entropy-Regulated Representation Learning
Deep learning is increasingly viewed as a dynamical process in parameter space, yet many existing theories still treat training as a closed…
Graph Alignment Topology as an Inductive Bias for Grounding Detection
Large Language Models (LLMs) are optimized to produce distributionally plausible continuations rather than to explicitly verify whether gen…
A mathematical theory of balancing relational generalization and memorization
Humans, animals, and modern machine learning models exhibit impressive abilities to learn complex behaviors and generalize these behaviors…
Worse than Random: The Importance of a Baseline for Unsupervised Feature Selection
Many novel unsupervised feature selection methods are proposed each year, yet their empirical evaluation is limited to supervised and unsup…
LLM Code Smells: A Taxonomy and Detection Approach
Large Language Models (LLMs) are increasingly integrated into software systems for diverse purposes, due to their versatility, flexibility,…
Memorization Dynamics of Fill-in-the-Middle Pretraining
Fill-in-the-middle (FIM) is a pretraining objective widely used to equip causal language models with infilling ability, yet its effect on v…
Test-Time Training Undermines Safety Guardrails
Test-Time Training (TTT) is an emerging paradigm that enables models to adapt their parameters during inference, improving performance on t…
Robots That Know What to Ask: Recovering Misaligned Rewards through Targeted Explanations
Learning reward functions from demonstrations assumes that demonstrations provide adequate supervision over all features -- or task-relevan…
A Proactive Multi-Agent Dialogue Framework for Assessing Social Language Disorder Traits in Autism
Characteristic linguistic behaviors associated with Social Language Disorder (SLD) in autism spectrum disorder, including echoic repetition…
Whose Good, Whose Place? The Moral Geography of Agentic AI for Social Good
Agentic AI systems are increasingly proposed for social-good domains, often invoking the United Nations Sustainable Development Goals (SDGs…
MadEvolve: Evolutionary Optimization of Trading Systems with Large Language Models
We explore the application of LLM-driven algorithm optimization to several common tasks in quantitative finance. MadEvolve, a general-purpo…
Brain-LLM Alignment Tracks Training Data, Not Typology
Brain-LLM alignment is well established in English, yet the brain's language network is neuroanatomically universal across languages. Does…
Uncovering the Latent Potential of Deep Intermediate Representations
Foundational Models pretrained on huge amounts of data learn representations that evolve across depth, forming a hierarchy of embeddings wit…
Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography
Intermediate layers of large language models (LLMs) best predict human brain responses to language, one of the most robust findings in comp…
Do Language Models Know What Not to Say? Causal Evidence for Statistical Preemption in LLMs
How do learners acquire knowledge of what is unacceptable without negative evidence? Construction Grammar proposes statistical preemption:…
The TIME Machine: On The Power of Motion for Efficient Perception
Video representation learning has seen tremendous progress in recent years. This has been driven by many factors, including the scale of tr…
DreamerNLplus: Interpretable Modeling of Mental Health Dynamics from Social Media Timelines using Hybrid Rule-Based and RAG Methods
We present DreamerNLplus, a hybrid framework for modeling mental health dynamics from social media timelines in the CLPsych 2026 shared tas…
Model Collapse as Cultural Evolution
Model collapse, the progressive degradation of LLMs trained on their own outputs, has been characterized statistically but lacks a linguist…
Decomposing and Measuring Evaluation Awareness
Frontier language models sometimes recognize that they are being evaluated and adjust their behavior, undermining validity of benchmark res…
DRL-Driven Edge-Aware Utility Optimization for Multi-Slice 6G Networks
Virtual Reality (VR) services delivered over 6G networks demand ultra-low latency and high bandwidth to ensure seamless user experiences. T…
A measurement substrate for agentic Kubernetes operations: Methodology and a case study in retrieval-compounding falsification
Empirical claims about autonomous Kubernetes operations agents are largely unfalsifiable. Published work reports observational results with…
Anytime Training with Schedule-Free Spectral Optimization
Standard neural network training relies on learning-rate schedules tied to a fixed horizon, leading to strong path dependence and costly re…
Dithering Defense: Adversarial Robustness of Vision Foundation Models via Multi-Level Floyd-Steinberg Dithering
Vision foundation models are widely used as frozen backbones across many downstream tasks, making them a single point of failure under adve…
KAPLAN: Kolmogorov-Arnold Prognostic Learnable Activation Networks for Survival Analysis
Survival analysis aims to model how covariates and time jointly shape the time-to-event distribution under right censoring. Classical metho…
Dreaming Smoothly and Sample Efficiently with Gradient Penalized Latent Dynamics
Model-based reinforcement learning improves sample efficiency by learning a world model. However, existing latent world models such as Drea…
Security of LLM-generated Code: A Comparative Analysis
The majority of software developers use or are planning to use Artificial Intelligence (AI) tools in their development processes. Their top…
Do Synthetic Brain MRIs Reliably Improve Tumour Classification? A StyleGAN2-ADA Class-Plane Augmentation Study on BRISC 2025
Generative augmentation is often proposed as a remedy for small medical-image datasets, but synthetic images are only useful when they impr…
A Fine-Tuned BERT Classifier for Personal-Letter Titles in Late-Ming and Early-Qing Collected Works
I present Lepton (Letter Prediction), a fine-tuned BERT classifier that predicts whether a title in a Classical Chinese wenji table of cont…
Philosophical Dispositions as Behavioral Constraints for AI-Assisted Code Review: An Empirical Study
AI-assisted code review tools typically operate as generic "expert reviewer" agents, producing homogeneous findings regardless of the analy…
CoReVAD: A Contextual Reasoning Framework for Training-Free Video Anomaly Detection
Existing Video Anomaly Detection (VAD) methods typically rely on task-specific training, leading to strong domain dependency and high train…
Exploiting Longitudinal Context in Clinician-Verified Interactive Lesion Tracking
Tracking tumor lesions across serial CT scans is essential for oncological response assessment. Existing automated methods face a fundament…
Defining AI Fatigue in Academic Contexts: Dimensions, Indicators, and a Stage-Based Model Using Grounded Theory
The integration of AI tools in academic settings has introduced a distinct form of strain that existing frameworks like technostress and di…
Classical State Preparation for Variational Quantum Algorithms via Reinforcement Learning
Variational Quantum Algorithms (VQAs) potentially offer a pathway to practical quantum advantage, but their optimization is heavily hindere…
CALAD: Channel-Aware contrastive Learning for multivariate time series Anomaly Detection
Multivariate time series anomaly detection has become increasingly important in real-world applications, where labeled data are often scarc…
Infra-Bayesian Reinforcement Learning Agents Outperform Classical RL For Worst-Case Robustness
Classical reinforcement learning assumes the agent interacts with a fixed environment whose behavior does not depend on the agent's policy.…
As X, Do Y: How Persona and Task Combine in Instruction-Tuned LLMs
Role prompts of the form As X, do Y admit a clean linear decomposition at one specific site in the residual stream: the prompt-to-answer tr…
Generative AI and the Reorganization of Labor Demand
Generative artificial intelligence (AI) is expected to transform work, but less is known about how firms reorganize labor demand as the tec…
Autonomous Frontier-Based Exploration with VLM Guidance
Autonomous robotic exploration of unknown and hazardous environments, a long-standing challenge, can be significantly improved by leveragin…
PoisonForge: Task-Level Targeted Poisoning Benchmark for Instruction-Tuned LLMs
When practitioners fine-tune LLMs on unvetted datasets, an adversary can exploit the data supply chain through task-level poisoning: insert…
Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks
Position-controlled evaluation is standard for retrieval tasks such as Needle-in-a-Haystack and RULER, but mainstream reasoning benchmarks…
Understanding and Improving Noisy Embedding Techniques in Instruction Finetuning
Recent advancements in instructional fine-tuning have injected noise into embeddings, with NEFTune (Jain et al., 2024) setting benchmarks u…
Scalable Heterogeneous Graph Foundation Models for Data-Driven Optimal Power Flow in Smart Grids
Fast and reliable optimal power flow (OPF) approximation is essential for reliable smart-grid operation, yet many learning-based surrogates…
Adaptive Mass-Segmented KV Compression for Long-Context Reasoning
The linear growth of the Key-Value (KV) cache is a critical bottleneck in long-form LLM inference. Existing KV compression methods mitigate…
Lipschitz Optimization for Formal Verification of Homographies
The adoption of vision neural networks in regulated industries requires formal robustness guarantees, especially in safety-critical domains…
FastKernels: Benchmarking GPU Kernel Generation in Production
LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they op…
PaP-NF: Probabilistic Long-Term Time Series Forecasting via Prefix-as-Prompt Reprogramming and Normalizing Flows
Time series forecasting plays a central role in many real-world applications and has been extensively studied. Most existing approaches rel…
Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks
We evaluate whether frontier LLMs are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detecti…
SimInsert: Seamless Video Object Insertion via Regional Sparse Attention Fusion
Video object insertion requires ensuring spatio-temporal coherence and interactive realism, extending far beyond simple content placement.…
Enhancing Deep Neural Network Reliability with Refinement and Calibration
Although deep neural networks (DNNs) achieve high predictive accuracy, their confidence estimates are often unreliable, potentially comprom…
Multi-Gate Residuals
While Attention Residuals has shown some effectiveness in addressing the widespread issue of unbounded activation growth across deep residu…
6G Communication Networks Enabling Embodied Agents: Architecture and Prototype
Embodied agents, which couple intelligent decision-making with physical actuation in the real world, impose far more stringent and heteroge…
Coloring the Noise: Adversarial Sobolev Alignment for Faithful Image Super Resolution
Generative priors in Image Super-Resolution (SR) often compromise faithful restoration; we attribute this limitation to a fundamental spect…
ChainFlow-VLA: Causal Flow Planning with Vision-Language Models
Current end-to-end autonomous driving systems are fundamentally limited by a mismatch between temporal causal reasoning and global trajecto…
EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation
The rapid evolution of generative video foundation models has propelled the field toward professional-grade cinematic synthesis. To achieve…
When Good Equations Get Bad Scores: Improving Symbolic Regression Through Better Parameter Optimization
Symbolic Regression (SR) plays a central role in scientific knowledge discovery by distilling mathematical equations from observational dat…
Reinforcement Learning for Microcanonical Graph Ensemble with Assortativity Constraints
How network structure determines function is a fundamental question, and it can be investigated by graph ensembles with precisely controlle…
Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning
Large language models trained under diverse objectives and architectures have been shown to develop increasingly similar internal represent…
Sparse Compositional Flow Matching by geometric assembly from motion primitives
Embodied trajectories, such as the executable motion sequences of robotic manipulators, underwater vehicles, and mobile robots, are a funda…
CHASD: Language Increment-Calibrated Contrastive Decoding against Hallucination in LVLMs
Large Vision-Language Models have shown strong multimodal reasoning capabilities, yet they remain susceptible to object hallucinations when…
XWind: A Cross-site Router for Large Language Model Inference Serving at Renewable Energy Farms
AI power demand is growing at an unprecedented rate while power grids are often ailing and struggle to keep up. Grid expansion comes with h…
Score-Based One-step MeanFlow Policy Optimization
Diffusion and flow matching have emerged as expressive policy classes in reinforcement learning, but their reliance on multi-step denoising…
Curriculum reinforcement learning with measurable task representation learning
In curriculum reinforcement learning (CRL), an agent incrementally accumulates knowledge over a sequence of tasks (i.e., a curriculum), and…
Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals
Recent RL methods have substantially improved the reasoning abilities of LLMs. Existing reward designs mainly follow two paradigms: (1) Rei…
Every Component is a Lookup: Token Attribution and Composition from a Single Decomposition
Mechanistic interpretability of transformers requires identifying not just which components matter but how they compose into the computatio…
Parametric Prior Mapping Framework for Non-stationary Probabilistic Time Series Forecasting
Effectively modeling non-stationary dynamics in probabilistic multivariate time series (MTS) forecasting requires balancing expressiveness w…
Online Hand Gesture Recognition Using 3D Convolutional Neural Networks
In human-computer interaction, real-time detection and classification of dynamic hand gestures is challenging as: 1) the system must run in…
Reflex: Reinforcement Learning with Reflection Symmetry Exploitation in State-Based Continuous Control
Reinforcement learning has long struggled with poor sample efficiency. One promising approach to mitigate this problem is leveraging group-…
Socially fluent AI decouples conversational signals from source identity in online interaction
Socially fluent agentic AI can now participate in online interaction in ways that resemble ordinary human conversation, potentially weakeni…
SSDAU: Structured Semantic Data Augmentation for Joint Entity and Relation Extraction
Joint Entity and Relation Extraction (JERE) is highly susceptible to weak generalization due to low-quality training data. Data augmentatio…
AI Security Research Should Better Incentivize Defense Research
This work examines an imbalance in artificial intelligence (AI) security research: the field tends to produce more work on attacking AI sys…
One-Forcing: Towards Stable One-Step Autoregressive Video Generation
Recent advances have substantially improved real-time interactive video generation in the autoregressive regime. However, most existing few…
AI Assurance: A Comprehensive Testing Strategy for Enterprise AI Systems
Enterprise AI systems, built on large language models, retrieval pipelines and autonomous agents, introduce a class of risks that tradition…
Learning Individual Dynamics from Sparse Cross-Sectional Snapshots
Predicting how a dynamical unit evolves over time - how an individual ages, an epidemic spreads, or a physical system degrades - typically…
CBANet: A Compact Attention-Based CNN-BiLSTM Network for Aggressive Driving Event Detection
Aggressive driving is a major cause of traffic accidents and poses a serious threat to road safety. Although deep learning methods have sho…
Automated Random Embedding for Practical Bayesian Optimization with Unknown Effective Dimension
Bayesian optimization is widely employed for optimizing complex black-box functions but struggles with the curse of dimensionality. Random…
PhenoYieldNet: Learning Crop-Aware Phenological Responses for Multi-Crop Yield Prediction
Accurate crop yield prediction is crucial for sustainable agriculture and global food security. While existing methods are predominantly de…
Multimodal Distribution Matching for Vision-Language Dataset Distillation
Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern syst…
CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test
Recently, Reinforcement Learning with Verifiable Rewards (RLVR) and Test-Time Scaling (TTS) have advanced LLM code generation through execu…
VACE: Learning Geometrically Structured Representations for Time Series Anomaly Detection
Anomaly detection in multivariate time series is a critical task across a wide range of real-world applications, where abnormal behaviour i…
DrawVideo: Generating Long Video from Storyboard Keyframe Sketches
Long video generation requires high-fidelity synthesis, coherent narrative structure, and user control over extended time spans. Existing t…
Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models
Reinforcement learning (RL) has become an effective way to improve prompt alignment and perceptual quality in diffusion and flow-matching g…
RA-DCA: A Randomized Active-Set DCA for Directional Stationarity in Max-Structured DC Programs
We study nonsmooth difference-of-convex programs whose subtracted convex term is a finite maximum of smooth convex functions. In this setti…
Goal-Conditioned Agents that Learn Everything All at Once
A goal-conditioned reinforcement learning agent exploring an environment will see a wealth of information throughout a trajectory, most of…
PathNavigate: A Training-Free Pathology Agent with Surprise-Guided Scan and Shared Slide Memory for Whole-Slide Image VQA
Whole-slide image visual question answering (WSI-VQA) frames pathology as an extreme-context search problem: to answer a free-form clinical…
ARMS: Automatic Reward Shaping for Sparse-Reward Multi-Agent Reinforcement Learning
Sparse rewards are a major bottleneck in multi-agent reinforcement learning (MARL), where simultaneous learning induces non-stationarity an…
Understanding Goal Generalisation in Sequential Reinforcement Learning
Reinforcement learning agents often exhibit unintended goal-directed behaviour outside their training distribution, but we currently lack a…
HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLMs in Sponsored Search Retrieval
In the competitive landscape of sponsored search, balancing retrieval quality with production latency is a critical challenge. While large…
Cost-Effective Model Evaluation with Meta-Learning
The rapid growth of machine learning has produced an ever-expanding ecosystem of models, making it increasingly challenging to verify the r…
Preisach Attention: A Hysteretic Model of Sequential Memory
We introduce the Preisach Attention Layer (PAL), a novel sequence modelling architecture grounded in the classical Preisach hysteresis oper…
DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling
Diffusion language models intrinsically fail to capture correlations between decoded tokens, which leads to a harsh trade-off between sampl…
EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation
Multi-shot video generation requires maintaining a consistent appearance of recurring entities across shots while remaining faithful to sho…
Adversarial Vulnerability Under Temporal Concept Drift: A Longitudinal Study of Android Malware Detection
We present a longitudinal, drift-aware evaluation of adversarial robustness across more than a decade of Android applications using static…
DualMem: Bypassing the Objectness Bottleneck for Calibrated Unknown-Stream Filtering in Open-World Object Detection
Open-world object detection (OWOD) requires detectors to localize known classes while identifying unknown objects for future incremental le…
Learning Through Noise: Why Subliminal Learning Works and When It Fails
In the context of artificial neural networks, subliminal learning refers to the transfer of task-relevant knowledge or unintended biases fr…
CVSearch: Empowering Multimodal LLMs with Cognitive Visual Search for High-Resolution Image Perception
High-resolution (HR) image perception presents a key bottleneck for multimodal large language models (MLLMs). While visual search offers a…
OnePred: Next-Query Prediction via Recursive Intent Memory in Multi-Turn Conversations
Although large language model (LLM) conversational systems process millions of multi-turn dialogues daily, they remain fundamentally reacti…
Weierstrass Positional Encoding for Vision Transformers
Vision Transformers have achieved remarkable success in computer vision, but their common use of learnable one-dimensional positional encod…
Any2Any: Efficient Cross-Embodiment Transfer for Humanoid Whole-Body Tracking
Whole-body tracking (WBT) models have become a key foundation for humanoid robots, enabling them to imitate diverse motions with high fidel…
PhotoFlow: Agentic 3D Virtual Photography Missions
Virtual photography asks an agent to enter a prepared 3D scene with no preselected camera pose or reference image, infer a suitable shot fr…
Not Too Generative, Not Too Discriminative: The Human Alignment Sweet Spot
A central question in computational vision is whether human-like visual representations are better explained by discriminative or generativ…
It's the humans, not the data: Geopolitical bias in LLMs originates in post-training, amplified by the language of the prompt
It has generally been assumed that geopolitical bias in language models originates from the training data used during the pre-training phas…
Leveraging Foundation Models for Causal Generative Modeling
Causal generative modeling is essential for developing reliable and transparent AI systems capable of counterfactual reasoning. While exist…
Human Decision-Making with Persuasive and Narrative LLM Explanations
Large language models (LLMs) have the potential to aid and improve human decision-making in classification tasks, not only by providing fai…
PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs
Despite remarkable progress in Multimodal Large Language Models (MLLMs), these models still struggle with fine-grained understanding tasks.…
CHRONOS: Temporally-Aware Multi-Agent Coordination for Evolving Data Marketplaces
Temporal knowledge-graph data marketplaces face three coupled failures in static designs: stale hybrid index shortcuts reduce recall as edg…
Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers
Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D…
ETCHR: Editing To Clarify and Harness Reasoning
Multimodal Large Language Models have advanced visual reasoning, yet a purely textual chain of thought remains a bottleneck for questions t…
LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws
Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomen…
Robust Counterfactual Inference in Markov Decision Processes
This paper addresses a key limitation in existing counterfactual inference methods for Markov Decision Processes (MDPs). Current approaches…
Epistemic Skills: Reasoning about Knowledge and Oblivion
This paper presents a class of epistemic logics that captures the dynamics of acquiring knowledge and descending into oblivion, while incor…
Interactive Query Answering on Knowledge Graphs with Soft Entity Constraints
Methods for query answering over incomplete knowledge graphs retrieve entities that are likely to be answers, which is particularly useful…
Forget What's Sensitive, Remember What Matters: Token-Level Differential Privacy in Memory Sculpting for Continual Learning
Continual Learning (CL) models, while adept at sequential knowledge acquisition, face significant and often overlooked privacy challenges d…
FATHOMS-RAG: A Framework for the Assessment of Thinking and Observation in Multimodal Systems that use Retrieval Augmented Generation
Retrieval-augmented generation (RAG) has emerged as a promising paradigm for improving factual accuracy in large language models (LLMs). We…
Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics
We present Ax-Prover, a multi-agent system for automated theorem proving in Lean that can solve problems across diverse scientific domains…
MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders
Artificial intelligence in healthcare requires models that are accurate and interpretable. We advance mechanistic interpretability in medic…
MUSEKG: A Knowledge Graph Over Museum Collections
Digitisation in the cultural heritage sector has produced large but fragmented repositories of museum collection data, spanning structured…
MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks
While multi-agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design un…
Deja Vu in Plots: Leveraging Cross-Session Evidence with Retrieval-Augmented LLMs for Live Streaming Risk Assessment
The rise of live streaming has transformed online interaction, enabling massive real-time engagement but also exposing platforms to complex…
TCAP: Tri-Component Attention Profiling for Unsupervised Backdoor Detection in MLLM Fine-Tuning
Fine-Tuning-as-a-Service (FTaaS) facilitates the customization of Multimodal Large Language Models (MLLMs) but introduces critical backdoor…
Scaling-Aware Adapter for Structure-Grounded LLM Reasoning
Large language models (LLMs) are enabling reasoning over 2D and 3D structures, yet existing methods remain modality-specific and typically…
ALIVE: Awakening LLM Reasoning via Adversarial Learning and Instructive Verbal Evaluation
The quest for expert-level reasoning in Large Language Models (LLMs) has been hampered by a persistent reward bottleneck: traditio…
VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation
Vision-Language-Action (VLA) models bridge multimodal reasoning with physical control, but adapting them to new tasks with scarce demonst…
GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory
Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks l…
NeuroWeaver: An Autonomous Evolutionary Agent for Exploring the Programmatic Space of EEG Analysis Pipelines
Although foundation models have demonstrated remarkable success in general domains, the application of these models to electroencephalograp…
Bridging AI and Clinical Reasoning: Abductive Explanations for Alignment on Critical Symptoms
Artificial intelligence (AI) has demonstrated strong potential in clinical diagnostics, often achieving accuracy comparable to or exceeding…
AI Evaluation Should Require Standardized Item-Level Data Releases
This position paper argues that standardized item-level benchmark data should become the default infrastructure for AI evaluation. Current…
IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents
Computer-Use Agents (CUAs) leverage large language models to execute GUI operations on desktop environments, yet they generate actions with…
Agentivism: a learning theory for the age of artificial intelligence
Learning theories have historically changed when the conditions of learning evolved. Generative and agentic AI create a new condition by al…
Retrieval Is Not Enough: Why Organizational AI Needs Epistemic Infrastructure
Organizational knowledge used by AI agents typically lacks epistemic structure: retrieval systems surface semantically relevant content wit…
QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems
We present QED, an open-source multi-agent system that turns human-provided research questions into complete mathematical proofs w…
Model Spec Midtraining: Improving How Alignment Training Generalizes
Some frontier AI developers aim to align language models to a Model Spec or Constitution that describes the intended model behavior. Howeve…
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
Large language models (LLMs), especially reasoning models, generate extended chain-of-thought (CoT) reasoning that often contains explicit…
How Mobile World Model Guides GUI Agents?
Recent advances in vision-language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but r…
WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games
Coding agents are increasingly used as application builders, yet many evaluations still focus on source code, repository-level tests, or in…
Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models
The advancement of Large Reasoning Models (LRMs) has catalyzed a paradigm shift from reactive "fast thinking" text generation to systemat…
Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime…
Moonwalk: Inverse-Forward Differentiation
Backpropagation's main limitation is its need to store intermediate activations (residuals) during the forward pass, which restricts the de…
ConjNorm: Tractable Density Estimation for Out-of-Distribution Detection
Post-hoc out-of-distribution (OOD) detection has garnered intensive attention in reliable machine learning. Many efforts have been dedicate…
MirrorCheck: Efficient Adversarial Defense for Vision-Language Models
Vision-Language Models (VLMs) are increasingly susceptible to sophisticated adversarial attacks, including adaptive strategies specifically…
XAttnMark: Learning Robust Audio Watermarking with Cross-Attention
The rapid proliferation of generative audio synthesis and editing technologies has raised serious concerns about copyright infringement, da…
Representational Alignment with Chemical Induced Fit for Molecular Relational Learning
Molecular Relational Learning (MRL) is widely applied in natural sciences to predict relationships between molecular pairs by extracting st…
Diffusion and Flow Matching Models for Tabular Data: A Survey
Deep generative models have made rapid progress in image, text, audio, and video generation, and are increasingly being applied to structur…
Naturalistic Computational Cognitive Science: Towards generalizable models and theories that capture the full range of natural behavior
How can cognitive science build generalizable theories that span the full scope of natural situations and behaviors? We argue that progress…
GlyTwin: Digital Twin for Glucose Control in Type 1 Diabetes Through Optimal Behavioral Modifications Using Patient-Centric Counterfactuals
Frequent and long-term exposure to hyperglycemia increases the risk of chronic complications, including neuropathy, nephropathy, and cardio…
Spectral-inspired Operator Learning with Limited Data and Unknown Physics
Learning PDE dynamics from limited data with unknown physics is challenging. Existing neural PDE solvers either require large datasets or r…
Through the Stealth Lens: Attention-Aware Defenses Against Poisoning in RAG
Retrieval-augmented generation (RAG) systems are vulnerable to attacks that inject poisoned passages into the retrieved context, even at lo…
PLACE: Prompt Learning for Attributed Community Search in Large Graphs
In this paper, we propose PLACE (Prompt Learning for Attributed Community Search), an innovative graph prompt learning framework for ACS. E…
STM3: Mixture of Multiscale Mamba for Long-Term Spatio-Temporal Time-Series Prediction
Recently, spatio-temporal time-series prediction has developed rapidly, yet existing deep learning methods struggle with learning complex l…
GeoMAE: Masking Representation Learning for Spatio-Temporal Graph Forecasting with Missing Values
The ubiquity of missing data in urban intelligence systems, attributable to adverse environmental conditions and equipment failures, poses…
Online Learning with Multiple Fairness Regularizers via Graph-Structured Feedback
There is an increasing need to enforce multiple, often competing, measures of fairness within automated decision systems. The appropriate w…
A drone-based framework for coral habitat mapping via weakly supervised segmentation
Obtaining pixel-level annotations over large spatial extents remains a major bottleneck for deploying machine learning in ecological applic…
Disentangling Interaction and Bias Effects in Opinion Dynamics of Large Language Models
Large Language Models are increasingly used to simulate human opinion dynamics, yet the effect of genuine interaction is often obscured by…
Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning
Knowledge-graph retrieval-augmented generation (KG-RAG) couples large language models (LLMs) with structured, verifiable knowledge graphs (…
Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers
Reinforcement Learning with Verifiable Rewards (RLVR) replaces costly human labeling with automated verifiers. To reduce verifier hacking,…
GILT: An LLM-Free, Tuning-Free Graph Foundational Model for In-Context Learning
Graph Neural Networks (GNNs) are powerful tools for processing relational data but often struggle to generalize to unseen graphs, giving ri…
Controlled Personalization in Legacy Media Online Services: A Case Study in News Recommendation
Personalized news recommendations have become a standard feature of large news aggregation services, optimizing user engagement through aut…
RAG-Pull: Turning Retrieval into a Code-Injection Channel via Invisible Unicode Perturbations
Retrieval-Augmented Generation (RAG) increases the reliability and trustworthiness of the LLM response and reduces hallucination by elimina…
Sparser Block-Sparse Attention via Token Permutation
Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems…
LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation
Learning generalizable policies for robotic manipulation increasingly relies on large-scale models that map language instructions to action…
Investigating Robot Control Policy Learning for Autonomous X-ray-guided Spine Procedures
Imitation learning-based robot control policies are enjoying renewed interest in video-based robotics. However, it remains unclear whether…
DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA
Document visual question answering requires models not only to answer questions correctly, but also to precisely localize answers within co…
Fine-Tuning Causal LLMs for Text Classification: Embedding-Based vs. Instruction-Based Approaches
We explore efficient strategies to fine-tune decoder-only Large Language Models (LLMs) for downstream text classification under resource co…
Bridging Data and Physics: A Graph Neural Network-Based Hybrid Twin Framework
Simulating complex unsteady physical phenomena relies on detailed mathematical models, simulated for instance by using the Finite Element M…
Operator-Based Generalization Bound for Deep Learning: Insights on Multi-Task Learning
This paper presents novel generalization bounds for vector-valued neural networks and deep kernel methods, focusing on multi-task learning…
On the Koopman-Based Generalization Bounds for Multi-Task Deep Learning
The paper establishes generalization bounds for multitask deep neural networks using operator-theoretic techniques. The authors propose a t…
Patterns vs. Patients: Evaluating LLMs against Mental Health Professionals on Personality Disorder Diagnosis through First-Person Narratives
Growing reliance on LLMs for psychiatric self-assessment raises questions about their ability to interpret qualitative patient narratives.…
V-VLAPS: Value-Guided Planning for Vision-Language-Action Models
Vision-language-action (VLA) models provide strong action priors for robotic manipulation, but their reactive behavior can fail under distr…
R³L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification
Reinforcement learning drives recent advances in LLM reasoning and agentic capabilities, yet current approaches struggle with both explorat…
Information Access of the Oppressed: Freirean Design for Emancipatory Information Access
Online information access (IA) platforms are targets of authoritarian capture. We explore the question of how to safeguard our platforms an…
SciHorizon-GENE: Benchmarking LLM for Life Sciences Inference from Gene Knowledge to Functional Understanding
Large language models (LLMs) have shown growing promise in biomedical research, particularly for knowledge-driven interpretation tasks. How…
ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling
While Mixture-of-Experts (MoE) architectures substantially bolster the expressive power of large-language models, their prohibitive memory…
The Surprising Difficulty of Search in Model-Based Reinforcement Learning
This paper investigates search in model-based reinforcement learning (RL). Conventional wisdom holds that long-term predictions and compoun…
CoFrGeNet: Continued Fraction Architectures for Language Generation
Transformers are arguably the preferred architecture for language generation. In this paper, inspired by continued fractions, we introduce…
GradingAttack: Exposing Security Vulnerabilities in LLM Based Educational Grading Agents
Large language models (LLMs) are increasingly deployed as educational agents for automatic short answer grading (ASAG) in real-world educat…
TABX: A High-Throughput Sandbox Battle Simulator for Multi-Agent Reinforcement Learning
The design of environments plays a critical role in shaping the development and evaluation of cooperative multi-agent reinforcement learnin…
PipeMFL-240K: A Large-scale Dataset and Benchmark for Object Detection in Pipeline Magnetic Flux Leakage Imaging
Pipeline integrity is critical to industrial safety and environmental protection, with Magnetic Flux Leakage (MFL) detection being a primar…
ArcMark: Distortion-Free Multi-Byte LLM Watermark via Optimal Transport
Watermarking is an important tool for promoting the responsible use of large language models (LLMs). Existing watermarks insert a signal in…
On the Infinite Width and Depth Limits of Predictive Coding Networks
Predictive coding (PC) is a biologically plausible alternative to standard backpropagation (BP) that minimises an energy function with resp…
VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos
In long-video understanding, conventional uniform frame sampling often fails to capture key visual evidence, leading to degraded performanc…
Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling
Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computa…
VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a dominant paradigm for enhancing Large Language Models (LLMs) reasoni…
Empowering 9-1-1 Calltaking Training with Generative AI: Experiences and Lessons Learned
Emergency call-takers form the first operational link in public safety response, handling over 240 million calls annually while facing a su…
A Systematic Evaluation of Co-folding Model Representations for Small-Molecule Learning
Small-molecule foundation models are typically pretrained on standalone molecular data, unlike vision and language models that often benefi…
BarrierSteer: LLM Safety via Learning Barrier Steering
Despite the strong performance of large language models (LLMs) across diverse tasks, their susceptibility to adversarial attacks and unsafe…
Multimodal Crystal Flow: Any-to-Any Modality Generation for Unified Crystal Modeling
Crystal modeling spans a family of conditional and unconditional generation tasks, including crystal structure prediction (CSP) and de novo…
HTMuon: Improving Muon via Heavy-Tailed Spectral Correction
Muon has recently shown promising results in LLM training. In this work, we study how to further improve Muon. We argue that Muon's orthogo…
Adapting Dijkstra for Buffers and Unlimited Transfers
In recent years, RAPTOR-based algorithms have been considered the state-of-the-art for path-finding with unlimited transfers without prepro…
Anatomy-Guided Vision-Language Learning with Angular Prototype Separation for Multi-Label Video Capsule Endoscopy Classification Under Class Imbalance
This work presents a multi-label temporal event detection framework for video capsule endoscopy (VCE) that addresses the extreme class imba…
Understanding Task Aggregation for Generalizable Ultrasound Foundation Models
Foundation models promise to unify multiple clinical tasks within a single framework, but recent ultrasound studies report that unified mod…
MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels
Reinforcement learning has emerged as a powerful paradigm for improving large language model (LLM) reasoning, where rollouts are sampled fr…
Safe Reinforcement Learning with Preference-based Constraint Inference
Safe reinforcement learning (RL) is a standard paradigm for safety-critical decision making. However, real-world safety constraints can be…
Tabular PDF Information Extraction with Local LLMs and Layout-Aware Parsing: A Reliability Evaluation
Extracting structured information from academic PDF documents is non-trivial: a single page typically combines free-text metadata with tabu…
Visually-Guided Policy Optimization for Multimodal Reasoning
Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). Ho…
Representation over Routing: Overcoming Surrogate Hacking in Multi-Timescale PPO
Temporal credit assignment in reinforcement learning has long been a central challenge. Inspired by the multi-timescale encoding of the dop…
Decompose, Structure, and Repair: A Neuro-Symbolic Framework for Autoformalization via Operator Trees
Statement autoformalization acts as a critical bridge between human mathematics and formal mathematics by translating natural language prob…
TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale
Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtim…
Skill Retrieval Augmentation for Agentic AI
As large language models (LLMs) evolve into agentic problem solvers, they increasingly rely on external, reusable skills to handle tasks be…
A Comparative Analysis on the Performance of Upper Confidence Bound Algorithms in Adaptive Deep Neural Networks
Edge computing environments impose strict constraints on energy consumption and latency, making the deployment of deep neural networks a si…
SUDP: Secret-Use Delegation Protocol for Agentic Systems
Agentic systems increasingly act with user secrets for APIs, messaging platforms, and cloud services. Today's agent runtimes typically impl…
Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems
AI-powered language learning tools increasingly provide instant, personalised feedback to millions of learners worldwide. However, this fee…
ProtDBench: A Unified Benchmark of Protein Binder Design and Evaluation
Recent advances in de novo protein binder design have enabled increasing experimental validation, yet reported in silico metrics remain dif…
Dream-MPC: Gradient-Based Model Predictive Control with Latent Imagination
State-of-the-art model-based Reinforcement Learning (RL) approaches either use gradient-free, population-based methods for planning, learne…
SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety
Recent advances in foundation models have transformed LLMs from passive conversational systems into autonomous agents capable of reasoning…
VISD: Enhancing Video Reasoning via Structured Self-Distillation
Training VideoLLMs for complex reasoning remains challenging due to sparse sequence-level rewards and the lack of fine-grained credit assig…
Bridging the Last Mile of Circuit Design: PostEDA-Bench, a Hierarchical Benchmark for PPA Convergence and DRC Fixing
LLM-based agents are increasingly applied to the "last mile" of Electronic Design Automation (EDA): repairing residual sign-off Design Rule…
The AI-Native Large-Scale Agile Software Development Manifesto
Despite the widespread adoption of agile methods, achieving true agility at scale remains elusive. Large-scale agile frameworks remain larg…
Content-Aware Attack Detection in LLM Agent Tool-Call Traffic: An Empirical Study of Features, Architectures, and Evaluation Protocols
The Model Context Protocol (MCP) has become a widely adopted interface for LLM agents to invoke external tools, yet learned monitoring of M…
ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload
Pre-training large language models on massive GPU clusters has made hardware faults routine rather than rare, driving the need for resilien…
Bridging Silicon and the Hippocampus: Algebro-Deterministic Memory "VaCoAl" as a Substrate for Vector-HaSH and TEM
Vector-HaSH and the Tolman-Eichenbaum Machine (TEM) propose the hippocampal-entorhinal circuit factorizes memory via a grid-cell scaffold f…
Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation
Block attention, which processes the input as separate blocks that cannot attend to one another, offers significant potential to improve KV…
Towards Trustworthy and Explainable AI for Perception Models: From Concept to Prototype Vehicle Deployment
Deep Neural Networks have become the dominant solution for Autonomous Driving perception, but their opacity conflicts with emerging Trustwo…
Can the Recovery Mechanism Survive AI? Skill Formation, Labor, and What Current Measurement Misses
Throughout the modern era, when new technologies displaced workers, societies adapted through the same mechanism: education raised the cogn…
Cross-Domain Molecular Relational Learning: Leveraging Chemical Structure-Activity Analysis
Recent advances in molecular representation integrate molecular topological and visual modalities, opening new avenues for precise Molecul…
S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination
We address concurrency control for LLM agents sharing mutable state over HTTP, where agents cannot be modified to declare read sets. S-Bus…
DynMuon: A Dynamic Spectral Shaping View of Muon
In recent years, Muon has emerged as the dominant method for training large language models, and transformers more broadly. The essential d…
IVF-TQ: Calibration-Free Streaming Vector Search via a Codebook-Free Residual Layer
Approximate nearest neighbor (ANN) indexes deployed against streaming corpora silently lose recall over weeks. The standard diagnosis is di…
An Interpretable Closed-Loop Intelligent Tutoring System for Multimodal Affective Feedback in Asynchronous Presentation Training
This paper presents an interpretable closed-loop Intelligent Tutoring System (ITS) that supports feedback-guided practice for developing on…
TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing
LLM routing matters most in long-horizon applications such as coding agents, deep research systems, and computer-use agents, where a single…
Does Your Wildfire Prediction Model Actually Work, or Just Score Well?
Wildfire prediction is important for early warning and resource allocation, yet existing Earth foundation models (Earth FMs) are pretrained…
Distilling Linearized Behavior into Non-Linear Fine-Tuning for Effective Task Arithmetic
Task vector composition has emerged as a promising paradigm for editing pre-trained models, enabling model merging through addition and unl…
Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German
Code-switching -- the natural alternation between two languages within a single utterance -- remains one of the most challenging and under-…
ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions
Conversational AI has now reached billions of users, yet existing datasets capture only what people say, not what they think. We introduce…
Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning
Recent large language models support inputs of up to 10 million tokens, yet they perform poorly on long-context tasks that require complex…
Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor
MXFP4 arithmetic can dramatically accelerate reinforcement learning (RL) post-training of large language models (LLMs), yet the quantizatio…
Codec-Robust Attacks on Audio LLMs
Prior attacks on Audio Large Language Models (Audio LLMs) demonstrated that carefully crafted waveform-domain perturbations can force targe…
GenAI-Driven Threat Detection with Microsoft Security Copilot
Defending against today's increasingly sophisticated cyberattacks requires security analysts to continuously translate evolving attacker tr…
Sutra: Tensor-Op RNNs as a Compilation Target for Vector Symbolic Architectures
Sutra is a typed, purely functional programming language whose compiled forward pass is a PyTorch neural network. The compiler beta-reduces…
Fine-grained Claim-level RAG Benchmark for Law
The rapid progress of large language models (LLMs) is shifting semantic search toward a question-answering paradigm, where users ask questi…
Variance Reduction for Expectations with Diffusion Teachers
Pretrained diffusion models serve as frozen teachers feeding downstream pipelines such as text-to-3D, single-step distillation, and data at…
OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning
Reinforcement learning with verifiable rewards has become the standard recipe for improving LLM reasoning, but the dominant algorithm GRPO…
Atom-level Protein Representation Learning Improves Protein Structure Prediction
Recent advances in generative modeling show that pretrained representations can improve generation as conditioning features or alignment ta…
More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts
Detecting Schwartz values in political text is difficult because implicit cues often depend on surrounding arguments and fine-grained disti…
Proxy-Based Approximation of Shapley and Banzhaf Interactions
Shapley and Banzhaf interactions capture the complex dynamics inherent in modern machine learning applications. However, current estimators…
Everyone is navigating AI security in real time — even Google
We're in the transition period -- all of us.
I tried Amazon’s Bee wearable and am both intrigued and slightly creeped out
Like other AI wearables, Amazon's Bee offers an odd combination of convenience and privacy anxiety.