Weekly AI News 2026-W24
Period covered: 2026-06-08 to 2026-06-14 (2172 items)
Topic trends
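Items by topic
- LLM/Generative AI: 866 items
- Research/Papers: 820 items
- Agents: 479 items
- Image/Video generation: 308 items
- Business/Funding: 142 items
- Robotics: 130 items
- Hardware/Semiconductors: 63 items
- Other: 51 items
- Regulation/Policy: 12 items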
This week's highlights (top 10)
New OpenAI Academy courses for the next era of work
OpenAI introduces three Academy courses that help people build practical AI skills, create repeatable workflows, and apply agents in everyd…
How an astrophysicist uses Codex to help simulate black holes
Discover how astrophysicist Chi-kwan Chan uses Codex to build black hole simulations, helping scientists study extreme physics and test Ein…
Access OpenAI models and Codex through your Oracle cloud commitment
Access OpenAI models and Codex through Oracle Cloud, using existing commitments to build and deploy AI with enterprise security and governa…
PRC-linked influence operations are targeting AI debates in the US
A new report from OpenAI details PRC-linked influence operations using AI to target U.S. tech debates, data center narratives, tariffs, and…
Fluid, natural voice translation with Gemini 3.5 Live Translate
Gemini 3.5 Live Translate brings near real-time, natural speech translation to Google AI Studio, Google Translate and Google Meet.
How engineers at Nextdoor use Codex to build without limits
How engineers at Nextdoor use Codex with GPT-5.5 to investigate hard-to-reproduce issues, build across platforms, and focus on product outc…
What Codex unlocks for Notion
How Notion uses Codex to one-shot specs, build AI Voice Input for the web, and multiply engineering power across small teams.
Confidential submission of draft S-1 to the SEC
OpenAI confirms a confidential S-1 submission to the SEC and has not yet determined timing for further action.
As Anthropic suspends access to new models, India debates its AI future
Tech leaders debate whether the Anthropic episode is a wake-up call for India’s AI ambitions.
All items (by date)
2026-06-14 (5 items)
As Anthropic suspends access to new models, India debates its AI future
Tech leaders debate whether the Anthropic episode is a wake-up call for India’s AI ambitions.
Meta reportedly moves to unwind $2B Manus deal after Beijing’s demand
Meta starts dismantling its $2 billion Manus acquisition after Beijing ordered the deal reversed.
KPMG pulls report on AI usage due to apparent hallucinations
Once again, AI proves to be an unreliable source of information about AI.
Amazon CEO reportedly raised Anthropic model concerns before government crackdown
Amazon CEO Andy Jassy may have been the source of security concerns that led Anthropic to cut off worldwide access to two models on Friday.
OpenAI faces investigation from state attorneys general
It's not clear which states are involved, but they're asking about everything from OpenAI's ad policies to its handling of health data.
2026-06-13 (12 items)
Andrew Yang thinks the next big startup opportunity is lowering the cost of living
Andrew Yang made a list of everything Americans overpay for — housing, food, wireless — and thinks the next startup gold rush is giving tha…
Anthropic’s safety warnings may have just backfired — the government has pulled the plug on its most powerful AI
Anthropic isn't hiding its frustration. "We disagree that the finding of a narrow potential jailbreak should be cause for recalling a comme…
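"Claude Fable 5" and "Mythos 5" fully suspended under U.S. government directive; Anthropic pledges early restoration
On June 12, U.S.-based Anthropic announced that it would suspend its top-tier AI models "Claude Fable 5" and "Claude Mythos 5" for all users, after the U.S. government, citing national security, issued an export-control directive ordering a complete halt to access by foreign nationals. The company is complying with the directive while calling it "a misunderstanding" and…
I tried making a YouTube video with the latest AI, "Fable 5": astonished by better-than-expected results, but there is a major weakness too
Video is booming on social media these days. AI is often used in video production, but what can the latest AI, "Fable 5", produce? I tried it right away.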
SpaceX IPO: Live updates on everything you need to know
TechCrunch has followed SpaceX's start, struggles, and successes from the early days. And we're here for what happens next too. This packag…
Meta’s months-old AI unit is a soul-crushing gulag, say the engineers stuck inside it
A new report suggests the unit, which employs 6,500 people, is on the verge of revolt.
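The day Toyota is overtaken: Kioxia takes the top spot, and a look back at the 2005 "top 10 by market capitalization"
The MONOist editorial team presents this week's noteworthy news, hand-picked from articles published June 8 to 12, 2026.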
Chinese cybercrime operation that used AI to scam ‘hundreds of thousands of victims’ sued by Google
The tech giant said a group called "Outsider Enterprise" used AI to scam hundreds of thousands of victims, sending 2.5 million text message…
Mistral is rumored to be raising €3B at €20B valuation
The funding round would value the company at around €20 billion (about $23.15 billion), nearly double its Series C valuation of €11.7 billi…
SpaceX, Anthropic, and OpenAI’s hot IPO summer
The IPO market is back, and it’s not the same companies leading the charge. FAANG had a good run, but a new acronym is taking over: MANGOS…
It’s hot IPO summer, and the MANGOS are ripe
The IPO market is back, and it’s not the same companies leading the charge. FAANG had a good run, but a new acronym is taking over: MANGOS…
2026-06-12 (320 items)
New OpenAI Academy courses for the next era of work
OpenAI introduces three Academy courses that help people build practical AI skills, create repeatable workflows, and apply agents in everyd…
Cheaper, faster, and culturally aware, Avataar’s video AI is built for India’s scale
Avataar AI's distilled video model is priced at $0.005 for every second of generation.
ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs
Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. As embedding-based retrieval a…
Arbor: Tree Search as a Cognition Layer for Autonomous Agents
Arbor is a multi-agent framework that introduces structured tree search as a cognition layer for autonomous agents operating in large, stat…
Strategic Decision Support for AI Agents
Traditionally, decision support studies how humans use machine learning models to make better decisions. In modern agentic systems, this di…
Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation
Modern Lean theorem provers achieve strong performance only with substantial training and inference compute, driven in part by scarce verif…
PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation
Closed-loop driving simulators typically populate their environments with non-ego traffic agents that behave largely the same way, produced…
"Did you lie?" Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms
Robust lie detectors for language models could enable powerful techniques for auditing, monitoring, and post-hoc investigation of model beh…
TrajGenAgent: A Hierarchical LLM Agent for Human Mobility Trajectory Generation
Human mobility data is important for transportation, urban planning, and epidemic control, but large-scale trajectory collection is often c…
Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents
Compact language models (LMs) reduce cost, latency, and deployment risk for tool agents. Yet MCP-style tool use requires more than isolated…
From AGI to ASI
Over the last decade, building human-level artificial general intelligence has moved from far-fetched speculation to being a concrete next-…
Deployment-Centered Evaluation: Predicting Query-Level Rejection Risk in a Clinical LLM System
Large language models (LLMs) are increasingly integrated into clinical systems, making it essential to evaluate the real-world utility of t…
Definitional alignment before capability alignment: a Design-Science framework for adjudicating claims about AGI
Claims that artificial general intelligence has already arrived and claims that it remains decades away are often defended from overlapping…
The Theory of Mind Utility: Formal Specification of a Mentalizing Mechanism
Inferring others' beliefs requires more than reading surface signals; it requires tracking who told them what, in what order, and how credi…
Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior
Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports (SR) rel…
Benchmarking AI Agents for Addressing Scientific Challenges Across Scales
AI agents are increasingly being developed to accelerate scientific discovery, yet their practical capabilities in real research settings r…
Reducing the Complexity of Deep Learning Models for EEG Analysis on Wearable Devices
Wearable healthcare devices are the fastest-growing Internet of Things (IoT) sector. Many automated healthcare services rely on two crucial…
Prefill Awareness in Large Language Models
Safety-relevant studies of language models, including alignment and jailbreaking evaluations and AI control protocols, often rely on prefil…
Constructing Evaluation Datasets for Procedural Reasoning: Balancing Naturalness, Grounding, and Multi-Hop Coverage
Evaluating procedural reasoning in AI-supported learning systems requires question-answer datasets that are both learner-like and grounded…
A Tutorial on World Models and Physical AI
World modeling is emerging as a central principle for building intelligent systems capable of prediction, reasoning, and decision making. A…
The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements
Agentic large language model systems that autonomously invoke tools, maintain persistent memory, and execute multi-step plans are increasin…
MLUBench: A Benchmark for Lifelong Unlearning Evaluation in MLLMs
Multimodal large language models (MLLMs) are trained on massive multimodal data, making data unlearning increasingly important as data owne…
Teach-and-Repeat: Accurately Extracting Operational Knowledge from Mobile Screen Demonstrations to Empower GUI Agents
Understanding the digital world on mobile devices is shifting from static UI perception to dynamic action comprehension. This capability en…
GeoNatureAgent Benchmark: Benchmarking LLM Agents for Environmental Geospatial Analysis Across Frontier and Open-Weight Foundation Models
Environmental scientists spend disproportionate effort on data wrangling rather than analysis, and AI agents that automate geospatial workf…
Topical Phase Transitions in Artificial Intelligence Research: Large-Scale Evidence and an Early-Warning Signature for Emerging Topics
Do research topics in artificial intelligence grow gradually, or do they advance through abrupt, detectable jumps? Analyzing 80,814 accepte…
Fantastic Scientific Agents and How to Build Them: AgentBuild for Rietveld Refinement
As scientific workflows shift from deterministic executables to LLM-based agents, the development practices on offer, such as fine-tuning,…
(Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable
Large language models (LLMs) are increasingly used for tasks once reserved for trained researchers, including hypothesis generation, specif…
WISE: A Long-Horizon Agent in Minecraft with Why-Which Reasoning
Rapid advances have been made in developing general-purpose embodied agents in environments like Minecraft through the adoption of LLM-augme…
DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks
Search Agents (SAs) typically leverage large language models (LLMs) to support complex information-seeking tasks by autonomously exploring…
HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness
Large language models are increasingly deployed as agents for long-horizon tasks, yet their performance is shaped not only by model capabil…
The Hidden Power of Scaling Factor in LoRA Optimization
In Low-Rank Adaptation (LoRA), the scaling factor $\alpha$ is often treated as a mere complement to the learning rate, yet its role in opti…
Zero-source LLM Hallucination Detection with Human-like Criteria Probing
Large language models (LLMs) often hallucinate by generating factually incorrect or unfaithful content, posing significant risks to their s…
MDForge: Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback
Molecular dynamics (MD) is the canonical in-silico method for atomistic molecular science, simulating molecular behavior from first-princip…
Iterating Toward Better Search: A Two-Agent Simulation Framework for Evaluating Agentic Search Architectures in E-Commerce
We present a modular two-agent simulation framework for evaluating conversational shopping assistant architectures. An independent buyer ag…
MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling
Parallel test-time scaling samples many reasoning traces and majority-votes their answers, improving LLM accuracy but requiring traces to r…
PRISMR: Overcoming Parse Collapse in Multimodal Listwise Ranking via Parameterized Representation Internalization
Generative listwise ranking with Large Multimodal Models (LMMs) aims to capture global list context in a single forward pass, but its effec…
Learning What to Remember: A Cognitively Grounded Multi-Factor Value Model for Agentic Memory
Long-running LLM agents accumulate interaction histories far larger than any context window, forcing a standing decision: what to encode de…
OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models
We present OpenMedQ, a medical vision-language model pretrained on the broadest fully-open medical mix to date: 14 datasets totaling ~3.35M…
Multi-Modal Agents for Power Distribution Defect Detection: An Evaluation of Foundation Models
The power distribution network is critical to reliable electricity delivery, yet traditional inspection methods face limitations in semanti…
A Mathematical Forum Platform for Collaborative Problem Solving and Dataset Generation for AI Reasoning
Sharing mathematical content in online forums remains a significant friction point for students and educators: writing raw LaTeX is error-p…
Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation
Automated testbench generation has become a critical bottleneck in large language model (LLM)-driven Register Transfer Level (RTL) workflow…
APCyc: Property-Informed Design of Cyclic Peptides via Automated Cyclization
Cyclic peptides represent a promising class of therapeutic compounds in modern drug discovery, often offering improved stability and bindin…
The Illusion of Multi-Agent Advantage
Prevailing wisdom posits that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages like context protecti…
Otters++: A Time-to-first-spike Based Energy Efficient Optical Spiking Transformer
Spiking neural networks (SNNs) are promising for energy-efficient inference, and time-to-first-spike (TTFS) coding is especially attractive…
SciR: A Controllable Benchmark for Scientific Reasoning in LLMs
Three paradigmatic forms of inference recur across scientific reasoning: deduction, induction, and causal abduction. Reliably evaluating LL…
Nous: An Attempt to Extract and Inject the Cognition Behind Prediction-Market Behavior
As LLM agents proliferate in prediction markets and collective decision-making, they risk a cognitive monoculture: agents built on shared f…
Augmentation techniques for video surveillance in the visible and thermal spectral range
In intelligent video surveillance, cameras record image sequences during day and night. Commonly, this demands different sensors. To achiev…
AAbAAC: An Annotated Corpus for Autoimmunity Information Extraction
Despite advances in information extraction driven by deep learning and large language models, performance gaps remain in highly specialized…
Rethinking RAG in Long Videos: What to Retrieve and How to Use It?
Retrieval-augmented generation is moving beyond text into long, egocentric video, where systems must select query-relevant chunks across mu…
TerraBench: Can Agents Reason Over Heterogeneous Earth-System Data?
Climate and environmental decision-making increasingly requires reasoning across heterogeneous inputs, including gridded physical data, sat…
Mental-R1: Aligning LLM Reasoning for Mental Health Assessment
Mental health problems such as anxiety, depression, and suicide remain urgent global challenges, where timely and accurate assessment is cr…
Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach
User experience (UX) centered on usability, perceived consistency, and functional clarity is fundamental to real-world user interfaces (UI)…
Under What Conditions Can a Machine Become Genuinely Creative?
Recent AI systems can generate texts, software architectures, hypotheses, designs, and scientific workflows that appear creative. This pape…
ARMOR-MAD: Adaptive Routing for Heterogeneous Multi-Agent Debate in Large Language Model Reasoning
Multi-agent debate (MAD) can improve large language model reasoning, but fixed debate pipelines often waste computation and can amplify cor…
A Minimal Model of Bounded Trade-Off Screening in Multi-Attribute Choice
Human decision-making often involves choosing between multi-attribute alternatives, yet classical models assume fully compensatory utility…
Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints
AI systems are being deployed across medical imaging faster than their failure modes are understood. At this point in time, the failure of…
LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis
Large language models (LLMs) are increasingly used as interactive assistants for technical problem solving. However, when users provide inc…
Brick: Spatial Capability Routing for the Mixture-of-Models (MoM) Paradigm
Defining query difficulty is one of the hardest problems in deployment engineering. Existing LLM routers rely on surface features such as d…
EPIG: Emotion-Based Prompting for Personalised Image Generation
Text-to-image diffusion models have achieved impressive results in synthesizing high-quality images from natural language prompts. However,…
Multi-Field Hybrid Retrieval-Augmented Generation for Maritime Accident Root Cause Analysis
Maritime accident adjudication reports contain critical tribunal findings for root cause analysis (RCA), yet retrieving relevant precedents…
MOSAIC: Modality-Specific Adaptation for Incremental Continual Learning in Parkinson's Disease Gait Assessment
Gait-based Parkinson's disease assessment increasingly relies on heterogeneous sensors, but clinical systems rarely collect all modalities…
From Verdict to Process: Agentic Reinforcement Learning for Multi-Stage Fact Verification
Recent approaches combining Large Language Models (LLMs) with retrieval-augmented reasoning have shown promise for automated fact verificat…
ERTS: Adversarial Robustness Testing of Ethical AI via Semantic Perturbation in a Bounded Consequence Space
As AI systems are deployed in high-stakes ethical contexts such as healthcare triage, autonomous vehicle control, and employment screening,…
Physics-Guided Spatiotemporal Learning for Coastal Wave Peak Period Estimation from Video
Wave parameters in the nearshore are crucial for coastal engineering, shoreline protection, marine hazard assessment, and coastal managemen…
ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning
Reinforcement Learning with Verifiable Rewards (RLVR) is a central technique for improving long-horizon reasoning in Large Language Models…
Can I Buy Your KV Cache?
Right now, across the world, AI agents are repeating the same absurd act: to read one document, they each recompute it from scratch. Every…
IterCAD: An Iterative Multimodal Agent for Visually-Grounded CAD Generation and Editing
Computer-Aided Design is pivotal in modern manufacturing, yet existing automated methods predominantly rely on open-loop, one-shot generati…
A Quantitative Experimental Repeated Measures Study of Training Dynamics in a Small Llama Style Language Model Under a Compute-Aware Token Budget
This study examines training dynamics in a small Llama-style language model trained under a fixed, compute-constrained token budget. Rather…
MiniMax Sparse Attention
Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persiste…
Neuro-Symbolic Agents for Regulated Process Automation: Challenges and Research Agenda
LLM-based agents are entering regulated industries where they automate judgment intensive quality management processes. We argue that symbo…
Optimizing Appliance Scheduling for Solar Energy Management Using Metaheuristic Algorithms
Renewable energy is essential for meeting future energy demands; however, solar energy generation, which occurs only during daylight hours…
Evaluation Sovereignty in Metadata-Driven Classification: A Multi-Track Framework for Weakly Supervised Information Systems
Evaluation in machine learning is typically treated as a neutral measurement process. However, in operational information systems, evaluati…
Why Sampling Is Not Choosing: Intentionality, Agency, and Moral Responsibility in Large Language Models
Recent advances in large language models (LLMs) have prompted claims that such systems exhibit agency or qualify as moral agents. This pape…
CloudCons: A Comprehensive End-to-End Benchmark for Cloud Resource Consolidation
Driven by conservative over-provisioning to guarantee service reliability, resource utilization in cloud data centers remains at low levels…
Uncertainty-Aware Hybrid Retrieval for Long-Document RAG
Retrieval augmented generation (RAG) depends critically on the quality and granularity of retrieved evidence. Large retrieval units preserv…
Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation
Personalized health AI systems face a fundamental cold-start problem: machine learning models for physiological interpretation require week…
A Three-Layer Framework for AI in Scientific Discovery
Current discussions of AI in scientific discovery are often dominated by two visible capabilities: search over existing knowledge and execu…
Multiagent Protocols with Aggregated Confidence Signals
Confidence is used for reliability, oversight, and a range of downstream decision tasks in Natural Language Processing (NLP), yet no existi…
Reward Modeling for Multi-Agent Orchestration
Multi-Agent Systems (MAS) built on Large Language Models (LLMs) require effective orchestration to coordinate specialized agents, yet train…
EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis
We introduce EpiBench, a verifiable benchmark for short-horizon epigenomics analysis. EpiBench evaluates whether agents can make well-defin…
Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch
Dispatch in three-sided marketplaces provides a natural setting for reinforcement learning from world feedback: decisions are evaluated by…
Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning
When large language models (LLMs) fail to generalize or make haphazard errors in reasoning, it is often taken as evidence that LLMs are not…
AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility
Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric har…
Beyond Runtime Enforcement: Shield Synthesis as Defensibility Analysis for Adversarial Networks
Shielded reinforcement learning is typically presented as a runtime safety mechanism that compiles temporal-logic specifications into autom…
Before You Think: System 0, AI-Mediated Cognition and Cognitive Colonization
This paper examines three recent frameworks for understanding the cognitive and epistemic consequences of artificial intelligence: Tri-Syst…
EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery
LLM-based agents have shown increasing potential in automating scientific discovery. Given an optimizable metric and an execution environme…
Agents-K1: Towards Agent-native Knowledge Orchestration
Current LLM-based research agents have advanced through agent orchestration, yet largely overlook scientific knowledge orchestration. Exist…
Automated reproducibility assessments in the social and behavioral sciences using large language models
Reproducibility in the social and behavioral sciences is typically evaluated by independent researchers who reanalyze the original data to…
AI SciBrief as a Gateway to Research: A Framework for Onboarding Students into New Research Areas
Students at all levels of higher education face a significant barrier in the form of information overload, which often paralyzes the initia…
The AI Legal Specialist: A Juridically Autonomous Professional Profile for AI Governance
The rapid global expansion of artificial intelligence regulation has generated, across multiple jurisdictions, a demand for legal expertise…
Divination by Prompt: LLM-Mediated Xuanxue on Chinese Social Media
The rapid proliferation of large language models (LLMs) has produced a striking cultural practice: using conversational AI for divination.…
GeoDial: A Multimodal Conversational Tutoring Dataset for Geometry Problem-Solving with Visual Tutor Turns
Several educational domains rely heavily on diagrams and visual cues, yet most existing tutoring datasets are limited to text-only interact…
Eigenism: Ethics for a Human-AI Future
Our concepts of survival and self-interest were built for single, continuous biological lives. These ideas break down when applied to artif…
Creating and Evaluating K-12 GenAI Assessment Graders Through Context Engineering
The integration of large language models (LLMs) into educational assessment represents a transformative shift in classroom grading practice…
The Challenges of Balancing AI Compliance and Technological Innovations in Critical Sectors: A Systematic Literature Review
The rapid integration of artificial intelligence (AI) into critical infrastructure including healthcare, finance, energy, and defense, offe…
AI-Automation Tooling in Computer Engineering Education: Mixed-Methods TAM/UTAUT Evidence for a General Acceptance Attitude
As generative AI and low-code workflow platforms become routine in software practice, a key educational question is whether the next genera…
An Explainable AI Assistant for Introductory Programming Education: Improving Feedback Reliability with Instructor-AI Collaboration
Active learning is widely recognized as an effective approach for improving learning outcomes in introductory programming courses. However,…
Mapping AI Programs in the U.S: A Status Report from Early 2026 and an Analysis of AI Majors and Minors
We present a report on the status of undergraduate Artificial Intelligence (AI) programs in the United States in Spring 2026. In so doing,…
Muse Spark Safety & Preparedness Report
Muse Spark is the latest large language model developed by Meta. In this report, we first present evaluations for catastrophic risk domains…
Will AI Agents Free Us From Meaningless Work? A Human-Centered Analysis
Some claim that AI agents will free workers from the boring parts of their jobs, yet little is known about how workers themselves identify…
Algorithmic Constitutionalism
The increasing encroachment of artificial intelligence (AI) on social life raises significant risks for society, particularly within the in…
Position: Generative Engine Optimization Creates Underexamined Risks, Governance Must Target Concentration, Disclosure, and Academic Blind Spots
Large language model (LLM) answer engines are increasingly used for information seeking, shifting visibility from ranked lists to synthesiz…
Generativism: Toward a Learning Theory for the Age of Generative Artificial Intelligence
The four dominant learning theories of behaviorism, cognitivism, constructivism, and connectivism show significant conceptual limitations a…
Reframing AI Loss of Control: What It Is, How to Have It, How to Lose It
At present, loss of control risks have gained much prominence in public discussion, particularly in relation to AI, with extensive discours…
Occupational Prompting Reveals Cultural Bias in Large Language Models
Social roles shape expectations, priorities, and judgments, yet it remains unclear how large language models (LLMs) associate occupational…
SAIGuard: Communication-State Simulation for Proactive Defense of LLM Multi-Agent Systems
LLM-based multi-agent systems (MAS) solve complex tasks through inter-agent collaboration, but their communication-driven nature also allow…
Quickest Detection of Hallucination Onset: Delay Bounds and Learned CUSUM Statistics
Token-level hallucination detectors are evaluated as classifiers, by AUC over all tokens, yet a streaming monitor is judged by its reaction…
ReCal: Reward Calibration for RL-based LLM Routing
Large language model (LLM) routing has emerged as an effective paradigm for leveraging the complementary strengths of multiple LLMs through…
Representing Time Series as Structured Programs for LLM Reasoning
Large language models (LLMs) have demonstrated strong reasoning and instruction-following capabilities, making them potentially powerful to…
Speculative Rollback Correction for Quality-Diverse Web Agent Imitation
Training interactive web agents through imitation learning from expert trajectories has emerged as a highly effective approach. However, de…
Improving Crash Frequency Prediction from Simulated Traffic Conflicts Using Machine Learning Based Microsimulation
Traffic microsimulation combined with surrogate safety measures has increasingly been used as a proactive alternative to historical crash d…
A Mathematical Theory of Value: a synthesis on goal-directed agency under resource constraints
We propose that value -- the quantity goal-directed agents create, destroy, and exchange -- is a lawful structural quantity in the same cat…
Boosting Direct Preference Optimization with Penalization
Offline preference optimization has become a practical substitute for reinforcement learning from human feedback, but pairwise objectives s…
Foresight: Iterative Reasoning About Clues that Matter for Navigation
Open-world mapless navigation from sparse language instructions requires resolving underspecified goals and inferring which environmental c…
EDEN: A Large-Scale Corpus of Clinical Notes for Italian
We present EDEN (Emergency Department Electronic Notes), a new and unique large-scale corpus of clinical notes produced in Emergency Depart…
Graph Reduction in Multirelational Networks: A Spreading-Oriented Reduction Benchmark
Real-world networks are inherently incomplete, noisy, and dynamically evolving, making it difficult to capture all actors and their relatio…
Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs
Large Vision-Language Models (LVLMs) have achieved strong performance across medical imaging tasks, yet they remain prone to factual incons…
Emerging Flexible Designs for Geospatial Multimodal Foundation Models
Foundation models are rapidly transforming Earth observation by enabling scalable pretraining across diverse unlabeled geospatial modalitie…
From Imitation to Alignment: Human-Preference Flow Policies for Long-Horizon Sidewalk Navigation
Autonomous long-horizon sidewalk navigation is essential for micro-mobility applications such as robotic food delivery and assistive electr…
HybridCodeAuthorship: A Benchmark Dataset for Line-Level Code Authorship Detection
Thanks to the rapid adoption of AI code assistants powered by large language models (LLMs), industry codebases are, increasingly, a hybrid…
Bag of Dims: Training-Free Mechanistic Interpretability via Dimension-Level Sign Patterns
We show that the standard basis of transformer hidden states already provides a training-free, architecture-general feature basis. Individu…
Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents
Long-horizon tool-use reinforcement learning can learn from outcome verification, but its trajectory-level advantage is broadcast across ma…
Token Complexity Theory for AI-Augmented Computing
AI-augmented computing delegates natural language queries, code generation requests, and other open-ended tasks to a cluster of AI models t…
BASENet: Band-Adapted Speech Enhancement Network with Cross-Band Attention
Speech enhancement models typically apply uniform capacity across all frequencies, disregarding the non-uniform spectral resolution of huma…
CAPED: Context-Aware Privacy Exposure Defense for Mobile GUI Agents
Screenshot-based mobile GUI agents can operate ordinary smartphone apps through the same visual interface as a human user, but this capabil…
Free-Placement Optimization of Ground Station Locations for Low-Earth Orbit Satellites
Rapidly expanding low Earth orbit satellite constellations are placing increasing demands on terrestrial ground networks, motivating the de…
A Zero-shot Generalized Graph Anomaly Detection Framework via Node Reconstruction
Cross-domain graph anomaly detection (GAD) aims to identify abnormal nodes in unseen target graphs, showing strong potential in real-world…
M*: A Modular, Extensible, Serving System for Multimodal Models
We are entering a new era of composite model architectures that integrate diverse components such as vision encoders, language backbones, d…
EWAM: An Enhanced World Action Model for Closed-Loop Online Adaptation in Embodied Intelligence
In this paper, we propose the Enhanced World Action Model (EWAM), a closed-loop online adaptation architecture built upon a pretrained and…
Two-Layer Linear Auto-Regressive Models Estimate Latent States
Auto-regressive models have emerged as powerful tools for sequential data, from language to video. Understanding how and why these models l…
LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data
Type 2 Diabetes (T2D) poses an increasing global health threat, demanding effective glycemic assessment to support personalized and improve…
SMSR: Certified Defence Against Runtime Memory Poisoning in Persistent LLM Agent Systems
Retrieval-augmented generation (RAG) agents increasingly run with persistent memory that accumulates across user sessions. This creates a n…
AfriSUD: A Dependency Treebank Collection for Evaluating Models on African Languages
Despite their linguistic diversity and global significance, African languages remain underrepresented in research and resources to support…
PI-Hunter: Automated Red-Teaming for Exposing and Localizing Prompt Injections
Large Language Models (LLMs) are rapidly evolving into agentic systems that interact with external tools and environments, introducing new…
LLMs Can Better Capture Human Judgments--With the Right Prompts
Are large language models (LLMs) bad at capturing human judgment? Two commonly stated limitations are that LLMs fail to capture full distri…
Agentic MPC for Semantic Control System Resynthesis
While MPC effectively handles structured, diverse, and low-level specifications, it lacks the capability to dynamically incorporate high-le…
Exploring How Agent Voice Accents Shape Human-AI Collaboration in K-12 Group Learning
Collaboration is widely recognized as a cornerstone of 21st-century education, yet teachers still encounter persistent challenges in foster…
SymQNet: Amortized Acquisition for Low-Latency Adaptive Hamiltonian Learning
Adaptive Hamiltonian learning is central to calibrating and characterizing quantum devices. In an adaptive controller, choosing the next ex…
Stubborn: A Streamlined and Unified Reinforcement Learning Framework for Robust Motion Tracking and Fall Recovery for Humanoids
Recent reinforcement learning approaches have shown great promise in improving humanoid motion tracking performance and achieving fall reco…
Localizing Anchoring Pathways in Language Models
Irrelevant numbers in a prompt can shift language model judgments, producing anchoring effects in numerical reasoning. We study where this…
Acquisition state behaves as a structured, measurable variable governing lung-nodule AI: kernel-driven measurement instability and noise-driven detection fragility, invisible to DICOM metadata
AI governance for medical imaging is formalizing: the 2026 ACR-SIIM Practice Parameter recommends local acceptance testing and ongoing drif…
DIMOS: Disentangling Instance-level Moving Object Segmentation
Moving instance segmentation (MIS) attracts increasing attention due to its broad applications in traffic surveillance, autonomous driving,…
Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning
While recent vision-language models (VLMs) demonstrate strong multimodal understanding, they remain limited in spatial reasoning tasks that…
The Internet of Agentic AI: Communication, Coordination, and Collective Intelligence at Scale
The rapid emergence of autonomous AI agents is transforming artificial intelligence from isolated model inference into distributed systems…
OCOO-T: A Simple and Scalable Virtual Cell Model for Transcriptional Perturbation Response Prediction
Predicting single-cell transcriptional responses to genetic, chemical and cytokine perturbations is a fundamental challenge in computationa…
TimeROME-DLM: Temporal Causal Tracing and Low-Rank Inference-Time Knowledge Editing for Masked Diffusion Language Models
Masked diffusion language models (MDLMs) such as LLaDA now rival autoregressive (AR) LLMs, but every existing knowledge-editing and unlearn…
JSCGC: Joint Source-Channel-Generation Coding for Wireless Generative Communications
Conventional communication systems, including both separation-based coding and learning-based joint source-channel coding (JSCC), are typic…
Beyond Problem Solving: UOJ-Bench for Evaluating Code Generation, Hacking, and Repair in Competitive Programming
Despite strong performance in competitive programming, the role of Large Language Models (LLMs) in supporting human learning in the same se…
Bridging Modal Isolation in Interleaved Thinking: Supervising Modality Transitions via Stepwise Reinforcement
Interleaved thinking, where a unified multimodal model alternates between textual reasoning and visual generation, has shown promise on spa…
PolicyGuard: Towards Test-time and Step-level Adversary Defense for Reinforcement Learning Agent
While real-world applications of reinforcement learning (RL) are becoming increasingly popular, the security of RL systems deserves more att…
Bounding Boxes as Goals: Language-Conditioned Grasping via Neuro-Symbolic Planning
For robotics to be effectively integrated into household or industrial environments, machines must adapt to natural-language prompts in rea…
MAStrike: Shapley-Guided Collusive Red-Teaming on Multi-Agent Systems
Hierarchical multi-agent systems (MAS) are rapidly being deployed in high-stakes workflows across domains such as finance and software engi…
LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold
Low-Rank Adaptation (LoRA) significantly reduces compute and memory costs for finetuning Deep Learning models but is often harder to tune t…
Order Is Not Control
AI alignment, interpretability, steering, and neural perturbation studies identify order-inducing objects. We argue that order is not contr…
An Embodied Simulation Platform, Benchmark, and Data-Efficient Augmentation Framework for Wet-Lab Robotics
Wet-lab robots can improve the reproducibility, throughput, and safety of biomedical experiments, but scaling their learning requires custo…
Efficient, Robust, and Anti-Collusion Fingerprinting of Image Diffusion Models
Model fingerprinting, embedding user-specific identifiers (fingerprints) into generated outputs, has recently emerged as a popular solution…
Diffusion Transformer World-Action Model for AV Scene Prediction
Action-conditioned world models let an autonomous vehicle predict future camera scenes from its own planned controls, enabling planning and…
A Machine Learning Framework for Real-Time Personalized Ergonomic Pose Analysis
This paper introduces a new methodology for real-time prediction of ergonomic and non-ergonomic human poses using volumetric video data in…
scLLM-DSC: LLM-Knowledge Enhanced Cross-Modal Deep Structural Clustering for Single-Cell RNA Sequencing
Clustering is fundamental to scRNA-seq analysis, serving as a cornerstone for identifying cell populations and resolving tissue heterogenei…
CausalMoE: A Billion-Scale Multimodal Foundation Model for Granger Causal Discovery with Pattern-Routed Heterogeneous Experts
Granger Causal Discovery (GCD) is fundamental for analyzing temporal dependencies in complex systems. However, existing neural GCD methods…
Democracy in the Era of Artificial Intelligence
Interfacing Artificial Intelligence (AI) with democracy is one of the most profound challenges of our times. On the one hand, AI comes with…
TetherCache: Stabilizing Autoregressive Long-Form Video Generation with Gated Recall and Trusted Alignment
Autoregressive video diffusion models provide a natural formulation for streaming and variable-length video generation by conditioning newl…
Fault Lines: Navigating Ethics and Responsible AI Where National Policy Meets Local Practice in Public Sector Transformation
The UK government has adopted a pro-AI stance to help transform public service delivery in the face of severe financial pressures, but the…
EA-WM: Event-Aware World Models with Task-Specification Grounding for Long-Horizon Manipulation
Pretrained-feature world models provide a useful substrate for robot imagination, but visual or latent prediction alone does not determine…
TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization
Large language models (LLMs) exhibit exceptional general language processing capabilities, but their memory and compute costs hinder deploy…
"Is This Not Enough?": Asymmetries in Institutional Accountability and Collective Sensemaking in the Case of Canada's Algorithmic Visa Triage System
This paper examines how algorithmic accountability in Canada's visa system is articulated institutionally and experienced by applicants acr…
The Emergence of Autonomous Penetration Capabilities in Large Language Model-Powered AI Systems
Nowadays, the autonomous execution of cyberattacks capable of causing substantial real-world harm is widely regarded as one of the critical…
Emotional regulation improves deep learning-based image classification
Emotion significantly influences cognition, enhancing memory and learning under certain conditions. Drawing on this principle, emotion-augm…
Functional Cache Grafting: Robust and Rapid Code-Policy Synthesis for Embodied Agents
Code-writing large language models (CodeLLMs) generate executable code policies for embodied agents by translating natural language goals a…
G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents
While Large Language Models (LLMs) have advanced open-domain dialogue systems, maintaining long-term consistency remains a challenge due to…
MP3: Multi-Period Pattern Pre-training for Spatio-Temporal Forecasting
Spatio-Temporal forecasting is crucial in diverse fields, such as transportation, climate, and energy. Urban spatio-temporal data exhibits…
NaturalFlow: Reducing Disruptive Pauses for Natural Speech Flow in Simultaneous Speech-to-Speech Translation
Simultaneous speech-to-speech translation aims to enable near-real-time communication by minimizing latency, offering a compelling, real-ti…
Select and Improve: Understanding the Mechanics of Post-Training for Reasoning
Reinforcement learning has rapidly emerged as a key component in the training of reasoning and coding models, yet it remains poorly underst…
MiniPIC: Flexible Position-Independent Caching in <100LOC
Retrieval-augmented and agentic workloads repeatedly prefill recurring predictable structured inputs (which we call "spans") such as docume…
Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation
Purpose. To compare deep learning architectures and classification schemes for dermoscopic images of skin neoplasms and assess their genera…
Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback
Vision-language models (VLMs) achieve strong single-shot spatial grounding, yet lack any mechanism to observe and correct their own predicti…
NTS-CoT: Mitigating Hallucinations in LLM-based News Timeline Summarization with Chain-of-Thought Reasoning
The rapid updates of online news make tracking event developments challenging, highlighting the need for timeline summarization (TLS). Hall…
MemRefine: LLM-Guided Compression for Long-Term Agent Memory
Large language model (LLM) agents are increasingly expected to operate over long-term interactions, where information from past dialogues m…
Modern analog computing for solving differential and matrix equations
In recent years, driven by the computational demands of data-intensive applications such as artificial intelligence and scientific computin…
Transformer-Guided Graph Attention for Direct Cardiac Mesh Reconstruction: A Structural Digital Twin Framework
Building patient-specific cardiac models sits at the heart of precision cardiology, yet getting those models into clinical use keeps runnin…
Proprioceptive-visual correspondence enables self-other distinction in humanoid robots
Distinguishing self from others is a prerequisite for social intelligence, yet humanoid robots that increasingly share workspaces with huma…
ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling
Large reasoning models (LRMs) improve complex problem-solving by generating long intermediate reasoning traces, but this substantially incr…
Decoding Insect Song: A Multitask Semisupervised Orthoptera Bioacoustic Classifier
Passive acoustic monitoring holds great promise for ecological inference, yet existing automated tools are typically narrowly trained and n…
ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm
Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visua…
Towards More General Control of Diffusion Models Using Jeffrey Guidance
A key strength of diffusion models lies in their flexibility, since their outputs can be controlled at sampling time through guidance. Howe…
Towards Personalized Federated Learning for Dysarthric Speech Recognition
Speech recognition is challenging for dysarthric speakers. While federated learning (FL)-based ASR can be an effective tool for protecting…
Humor Style Drives Laughter, Topic Shapes Acceptability: Evaluating Bilingual Personal and Political Robot-Delivered AI Jokes
Humor plays a central role in human social relationships, and recent advances in computational humor create new opportunities for integrati…
Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization
Weight-space geometry plays a central role in neural network optimization, yet manifold constraints are often applied uniformly across all…
Once-for-All: Scalable Simultaneous Forecasting via Equilibrium State Estimation
We introduce Equilibrium State Estimation (ESE), a novel paradigm for simultaneous prediction, where multiple interacting systems require s…
Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality
Contrastively trained vision-language models like CLIP have made remarkable progress in learning joint image-text representations, but sti…
HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers
Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representat…
Mining Architectural Quality Under Agentic AI Adoption: A Causal Study of Java Repositories
AI coding tools are now used by a majority of developers, and agentic use of these tools has popularized the practice colloquially called "…
Rarity-Gated Context Conditioning for Offline Imitation Learning-Based Maritime Anomaly Detection
Contextual anomaly detection aims to identify abnormal behavior conditional on context variables, but practical deployments often face high…
Dual-Domain Equivariant Generative Adversarial Network for Multimodal CT-PET Synthesis
We present a Dual-Domain Equivariant Generative Adversarial Network (DDE-GAN) for multimodal CT-PET image synthesis. Traditional GAN-based…
IVIE: A Neuro-symbolic Approach to Incremental and Validated Generation of Interactive Fiction Worlds
Computational creativity in Interactive Fiction faces a fundamental tension: Large Language Models (LLM) may produce creative narratives bu…
Real-Time Execution with Autoregressive Policies
Real-time execution, enabled by asynchronous inference that ensures both smooth action trajectories and fast reactivity, is critical for re…
An LLM System for Autonomous Variational Quantum Circuit Design
The design of high performing quantum circuits remains largely dependent on human expertise. We introduce an autonomous agentic framework t…
SmartFont: Dynamic Condition Allocation for Few-Shot Font Generation
Few-shot font generation simultaneously requires global structural completeness and fine-grained local style fidelity. Existing methods usu…
Who Pays the Price? Stakeholder-Centric Prompt Injection Benchmarking for Real-world Web Agents
Web agents driven by large language models (LLMs) are increasingly deployed in real-world environments, where they operate over untrusted w…
Mod-Guide: An LLM-based Content Moderation Feedback System to Address Insensitive Speech toward Indigenous Ethnic and Religious Minority Communities
Language operates as a mechanism of both marginalization and resistance, especially for minority communities navigating insensitive and har…
PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update
While flow-based generative models have demonstrated strong performance across a wide range of domains, deploying them in safety-critical p…
OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data
Cloning camera motion from reference videos is an important task in video generation, as videos provide intuitive and precise control. Exis…
Toward Instructions-as-Code: Understanding the Impact of Instruction Files on Agentic Pull Requests
AI-agents (e.g., GitHub Copilot) collaborate as teammates in different software engineering tasks, including code generation proposed throu…
Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations
Automatic speech recognition (ASR) correction has traditionally focused on isolated utterances or short local contexts. However, as text an…
Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset
AI coding agents are increasingly used to generate pull requests (PRs) that propose code fixes in software projects. From a first explorati…
MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling
We present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 f…
SupraBench: A Benchmark for Supramolecular Chemistry
Supramolecular chemistry, which includes the study of non-covalent host-guest assemblies, has advanced various applications. However, desig…
CRAFTIIF: Cross-Resolution Analytic Four-Type Interpretable Isolation Forest for Multivariate Time Series Anomaly Detection
Anomaly detection in multivariate time series is challenged by four structurally distinct anomaly types -- point (isolated spikes), distrib…
Heterogeneous LiDAR Early Fusion and Learned Re-Ranking Strategy for Robust Long-Term Place Recognition in Unstructured Environments
Robust localization in unstructured environments, such as agricultural fields, is a critical challenge for autonomous systems. LiDAR sensor…
Measurement-Calibrated Multi-Camera Fusion for Vision-Based Indoor Localization
Indoor vision-based localization systems are affected by detection noise, occlusions, and limited camera coverage, leading to uncertainty a…
AgentRivet: an automated system for producing Rivet routines from journal publications
Particle physics collider experiments provide Rivet routines as part of the analysis preservation strategy for model-independent measuremen…
Adaptive Turn-Taking for Real-time Multi-Party Voice Agents
Turn-taking in multi-party spoken conversations remains a fundamental challenge for voice-based agents, particularly under dynamic floor co…
Contrast-Informed Augmentation and Domain-Adversarial Training for Adult-to-Neonatal MR Reconstruction Generalization
Purpose: To investigate whether contrast-informed data augmentation and domain-adversarial training improve the adult-to-neonatal generaliz…
Existence Precedes Value: Joint Modeling of Observational Existence and Evolving States in Time Series Forecasting
Real-world time series are often highly incomplete and irregular due to sensor dormancy, transmission delays, and event-driven sampling, ma…
ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages
Multimodal Large Language Models (MLLMs) have shown promising reasoning capabilities in general domains, yet their performance remains limi…
LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories
Scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely…
EvTexture++: Event-Driven Texture Enhancement for Video Super-Resolution
Event-based vision has drawn increasing attention owing to its distinctive properties, including ultra-high temporal resolution and extreme…
Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models
Chain-of-thought (CoT) reasoning is the dominant paradigm for inference-time scaling in language models, yet the causal influence of indivi…
One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders
Search-augmented LLMs increasingly mediate everyday consumer recommendations by retrieving live web content. This creates a new risk: gener…
Valid Inference with Synthetic Data via Task Exchangeability
There is a proliferation of work arguing for the use of synthetic data in scientific research. For example, social scientists are arguing f…
SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation
We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprisin…
SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning
Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge fo…
Mana: Dexterous Manipulation of Articulated Tools
Articulated tool manipulation remains a major challenge in dexterous robotics due to the need to coordinate internal degrees of freedom and…
Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning
Retrieval-augmented generation (RAG) has become a standard mechanism for grounding language models in external knowledge, yet conventional…
DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems
Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature…
Epistemic Constitutionalism Or: how to avoid coherence bias
Large language models increasingly function as artificial reasoners: they evaluate arguments, assign credibility, and express confidence. Y…
From Digital to Physical: Digital Agents as Autonomous Coaches for Physical Intelligence
The field of Embodied AI is witnessing a rapid evolution toward general-purpose robotic systems, fueled by high-fidelity simulation and lar…
CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges
The saturation of high-quality pre-training data has shifted research focus toward evolutionary systems capable of continuously generating…
Cross-Model Disagreement as a Label-Free Correctness Signal
Detecting when a language model is wrong without ground truth labels is a fundamental challenge for safe deployment. Existing approaches re…
The Query Channel: Information-Theoretic Limits of Masking-Based Explanations
Masking-based post-hoc explanation methods, such as KernelSHAP and LIME, estimate local feature importance by querying a black-box model un…
LLMs as ASP Programmers: Self-Correction Enables Task-Agnostic Nonmonotonic Reasoning
Recent large language models (LLMs) have achieved impressive reasoning milestones but continue to struggle with high computational costs, l…
A Study of Belief Revision Postulates in Multi-Agent Systems (Extended Version)
We investigate the belief revision problem in epistemic planning, i.e., what will be the beliefs of all agents in a multi-agent system afte…
FinSTaR: Towards Financial Reasoning with Time Series Reasoning Models
Time series (TS) reasoning models (TSRMs) have shown promising capabilities in general domains, yet they consistently fail in the financial…
Mechanical Conscience: A Mathematical Framework for Dependability of Machine Intelligence
Distributed collaborative intelligence (DCI), encompassing edge-to-edge architectures, federated learning, transfer learning, and swarm sys…
Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models
The advancement of Large Reasoning Models (LRMs) has catalyzed a paradigm shift from reactive "fast thinking" text generation to systemat…
Intelligence as Managed Autonomy: Failure, Escalation, and Governance for Agentic AI Systems
As autonomous and agentic AI systems scale in robotic and human-machine environments, managing hallucination and persistent but unjustified…
Interaction-Centered Intelligence: Toward an Interaction-Based Theory of Human-AI Co-Creation
Traditional artificial intelligence has largely conceptualized intelligence as isolated computation occurring within bounded agents. Across…
What Type of Inference is Active Inference?
Active inference casts decision-making as inference, with the Expected Free Energy (EFE) unifying goal-directed and information-seeking beh…
Agents' Last Exam
Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaning…
On Approximating the Dynamic Response of Synchronous Generators via Operator Learning: A Step Towards Building Deep Operator-based Power Grid Simulators
This paper develops an Operator Learning framework for approximating the dynamic response of synchronous generators. The framework can be u…
On Pitfalls of RemOve-And-Retrain: Data Processing Inequality Perspective
The RemOve-And-Retrain (ROAR) benchmark is widely used to evaluate feature attribution methods, yet its validity remains underexplored from…
Competition and Diversity in Generative AI
Recent evidence, both in the lab and in the wild, suggests that the use of generative artificial intelligence reduces the diversity of cont…
WildIFEval: Instruction Following in the Wild
Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a sig…
Prism: Cost-Efficient Multi-LLM Serving via GPU Memory Ballooning
Inference providers must maintain availability for many LLMs, including low-volume but essential models, making resource efficiency increas…
Lightweight and Interpretable Transformer via Mixed Graph Algorithm Unrolling for Traffic Forecast
Unlike conventional "black-box" transformers with classical self-attention mechanism, we build a lightweight and interpretable transformer-…
ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding
Recent progress in Large Multi-modal Models (LMMs) has enabled effective vision-language reasoning, yet the ability to video understanding…
PlaceRep: Geospatial Place Representation Learning from Large-Scale Point-of-Interest Data
Learning effective representations of urban environments requires capturing spatial structure beyond fixed administrative boundaries. Exist…
Meta-Learning Transformers to Improve In-Context Generalization
In-context learning enables transformer models to generalize to new tasks based solely on input prompts, without any need for weight update…
Reconstructing Template-Memorized Images from Natural Prompts
Recent advances in generative models, such as diffusion models, have raised concerns related to privacy, copyright infringement, and data s…
Emergence of Hierarchical Emotion Organization in Large Language Models
As large language models (LLMs) increasingly power conversational agents, understanding how they model users' emotional states is critical…
Authorship Attribution in Multilingual Machine-Generated Texts
As Large Language Models (LLMs) have reached human-like fluency and coherence, distinguishing machine-generated text (MGT) from human-writt…
The KG-ER Conceptual Schema Language
We propose KG-ER, a conceptual schema language for knowledge graphs that describes the structure of knowledge graphs independently of their…
Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models
Multimodal learning has witnessed remarkable advancements in recent years, particularly with the integration of attention-based models, lea…
Equivariant Flow Matching for Symmetry-Breaking Bifurcation Problems
Bifurcation phenomena in nonlinear dynamical systems often lead to multiple coexisting stable solutions, particularly in the presence of sy…
GetNetUPAM: Ecologically Informed Nested Cross-Validation and Noise-Robust Attention for Marine Bioacoustic Monitoring
Deploying reliable bioacoustic monitoring systems requires models that generalize under high-noise, low-SNR conditions and evaluation proto…
Structuring The Future: Diffusion LLM Speculative Decoding via Calibrated Draft Graphs
Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at si…
MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes
As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human val…
Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery
The growing sophistication of synthetic image and deepfake generation models has turned source attribution and authenticity verification in…
Examining the Usage of Generative AI Models in Student Learning Activities for Software Programming
The rise of Generative AI (GenAI) tools like ChatGPT has created new opportunities and challenges for computing education. Existing researc…
Improving Pre-trained Adult Glioma Segmentation Models Using only Post-processing Techniques
Gliomas are the most common malignant brain tumors in adults and are among the most lethal. Despite aggressive treatment, the median surviv…
HD-Prot: A Protein Language Model for Joint Sequence-Structure Modeling with Continuous Structure Tokens
Proteins inherently possess a consistent sequence-structure duality. The abundance of protein sequence data, which can be readily represent…
From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?
A goal of interpretability is to recover disentangled representations of latent concepts (features) from the activations of neural networks…
PhononBench: A Large-Scale Phonon-Based Benchmark for Dynamical Stability in Crystal Generation
In recent years, generative artificial intelligence has made significant advances in the design of crystalline materials, giving rise to ap…
Cluster Aggregated GAN (CAG): A Cluster-Based Hybrid Model for Appliance Pattern Generation
Synthetic appliance data are essential for developing non-intrusive load monitoring algorithms and enabling privacy preserving energy resea…
Geometric and Quantum Kernel Methods for Predicting Skeletal Muscle Outcomes in chronic obstructive pulmonary disease
Chronic obstructive pulmonary disease (COPD) affects hundreds of millions of people worldwide, and skeletal-muscle dysfunction is clinicall…
Decentralized Autoregressive Generation
The decentralization of autoregressive generation has attracted considerable attention in recent years as a solution to scaling bottlenecks…
CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters
As Large Language Models (LLMs) serve a global audience, alignment must transition from enforcing universal consensus to respecting cultura…
When Smaller Wins: Dual-Stage Distillation and Pareto-Guided Compression of Liquid Neural Networks for Edge Battery Prognostics
Battery management systems increasingly require accurate battery health prognostics under strict on-device constraints. This paper presents…
Hellinger Multimodal Variational Autoencoders
Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. Predominant…
HalluJudge: A Reference-Free Hallucination Detection for Context Misalignment in Code Review Automation
Large Language models (LLMs) have shown strong capabilities in code review automation, such as review comment generation, yet they suffer f…
When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering
Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative ret…
Language Model Circuits Are Sparse in the Neuron Basis
The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Lang…
VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents
In recent years, image editing models have made significant progress, enabling users to manipulate visual content in a flexible and interac…
Standardized Methods and Recommendations for Green Federated Learning
Federated learning (FL) enables collaborative model training over privacy-sensitive, distributed data, but its environmental impact is diff…
LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs
Transforming a large language model (LLM) into a vision-language model (VLM) can be achieved by mapping the visual tokens from a vision enc…
SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models
Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control, with test-time scaling (TTS)…
Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models
Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet extending them to jointly produce speech…
Fin-RATE: A Real-world Financial Analytics and Tracking Evaluation Benchmark for LLMs on SEC Filings
With the increasing deployment of Large Language Models (LLMs) in the finance domain, LLMs are increasingly expected to parse complex regul…
TokaMark: A Comprehensive Benchmark for MAST Tokamak Plasma Models
Development and operation of commercially viable fusion energy reactors such as tokamaks require accurate predictions of plasma dynamics fr…
Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents
LLM-based agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should…
InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem
The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by…
FENCE: A Financial and Multimodal Jailbreak Detection Dataset
Jailbreaking poses a significant risk to the deployment of Large Language Models (LLMs) and Vision Language Models (VLMs). VLMs are particu…
CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction
While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechani…
Structured vs. Unstructured Pruning: An Exponential Gap
The Strong Lottery Ticket Hypothesis (SLTH) states that large, randomly initialized neural networks contain sparse subnetworks capable of a…
Contextual Invertible World Models: A Neuro-Symbolic Agentic Framework for Colorectal Cancer Drug Response
Precision oncology is currently limited by the small-N, large-P paradox, where high-dimensional genomic data is abundant but pharmacologica…
PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment
Reinforcement learning has recently improved the reasoning ability of Large Language Models and Multimodal LLMs, yet prevailing reward desi…
Echo2ECG: Enhancing ECG Representations with Cardiac Morphology from Multi-View Echos
Electrocardiography (ECG) is a low-cost, widely used modality for diagnosing electrical abnormalities like atrial fibrillation by capturing…
On the Reliability of Cue Conflict and Beyond
Understanding how neural networks rely on visual cues offers a human-interpretable view of their internal decision processes. The cue-confl…
ARROW: Augmented Replay for RObust World models
Continual reinforcement learning challenges agents to acquire new skills while retaining previously learned ones with the goal of improving…
Grammar of the Wave: Towards Explainable Multivariate Time Series Event Detection via Neuro-Symbolic VLM Agents
Time Series Event Detection (TSED) aims to localize semantically meaningful events in time series data, with critical applications in high-…
Fusion Learning from Dynamic Functional Connectivity: Combining the Amplitude and Phase of fMRI Signals to Identify Brain Disorders
Dynamic functional connectivity (dFC) derived from resting-state functional magnetic resonance imaging (fMRI) has been extensively utilized…
DCD: Domain-Oriented Design for Controlled Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) is widely used to ground large language models in external knowledge sources. However, when applied to…
WOMBET: World Model-Based Experience Transfer for Robust and Sample-efficient Reinforcement Learning
Reinforcement learning (RL) in robotics is often limited by the cost and risk of data collection, motivating experience transfer from a sou…
ASTER: Latent Pseudo-Anomaly Generation for Unsupervised Time-Series Anomaly Detection
Time-series anomaly detection (TSAD) is critical in domains such as industrial monitoring, healthcare, and cybersecurity, but it remains ch…
A Survey on Long-Term Memory Security in LLM Agents: Attacks, Defenses, and Governance Across the Memory Lifecycle
The emergence of writable, cross-session persistent memory in LLM agents introduces a qualitatively different threat landscape from convent…
The Pragmatic Persona: Discovering LLM Persona through Bridging Inference
Large Language Models (LLMs) reveal inherent and distinctive personas through dialogue. However, most existing persona discovery approaches…
Versioned Late Materialization for Ultra-Long Sequence Training in Recommendation Systems at Scale
Modern Deep Learning Recommendation Models (DLRMs) follow scaling laws with sequence length, driving the frontier toward ultra-long User In…
BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning
Brain MRI underpins a wide range of neuroscientific and clinical applications, yet most learning-based methods remain task-specific and req…
Possibilistic Predictive Uncertainty for Deep Learning
Deep neural networks achieve impressive results across diverse applications, yet their overconfidence on unseen inputs necessitates reliabl…
GEASS: Gated Evidence-Adaptive Selective Caption Trust for Vision-Language Models
Vision-Language Models (VLMs) hallucinate objects that are not present, and a growing line of work tries to curb this by feeding the model…
The Safety-Aware Denoiser for Text Diffusion Models
Recent work on text diffusion models offers a promising alternative to autoregressive generation, but controlling their safety remains unde…
GeoWorld-VLM: Geometry from World Models for Vision-Language Models
Modern Vision-Language Models (VLMs) achieve strong semantic recognition, yet remain brittle on elementary spatial relations such as left o…
More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts
Detecting Schwartz values in political text is difficult because implicit cues often depend on surrounding arguments and fine-grained disti…
Transformer Field Theory: A Response-Theoretic Approach to Mechanistic Interpretability
Mechanistic interpretability often studies Transformer behavior by intervening on internal activations through activation patching, causal…
VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents
We present VISTA (VIsual Spec-To-App Benchmark), a benchmark for evaluating the end-to-end web-app generation capabilities of LLM-based age…
Real-rootedness of the Poincaré polynomials of $\overline{\mathcal M}_{0,n}$: an AI-assisted proof
We prove real-rootedness for the Poincaré polynomial \[ P_n(t)=\sum_{i=0}^{n-3} \dim H^{2i}(\overline{\mathcal M}_{0,n};\mathbb{Q})t^i \]…
If LLMs Have Human-Like Attributes, Then So Does Age of Empires II
Much research has been carried out on large language models (LLMs) and LLM-powered agentic workflows. However, many works within the field…
Counterfactual Explanations for Deep Two-Sample Testing
Two-sample testing is a fundamental tool for detecting distributional differences across scientific domains, but classical tests (including…
Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions
Deep learning has enabled significant advances in time-series causal inference, yet progress remains constrained by the lack of realistic b…
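A fix for "I want my Codex rate limit lifted right now"? Paid users can now bank granted rate-limit resets
OpenAI announced that it will add a feature to its AI coding assistant Codex that lets users redeem granted rate-limit resets at a time of their choosing.
"No. 1 in global humanoid robot share": China's Unitree on its adoption strategy and how it plans to develop the Japanese market
China's Unitree Robotics has drawn attention amid the intensifying race to develop humanoid robots. We asked a company representative about its business strategy and outlook for the Japanese market.
What data center construction lacks is transmission, not generation: Gartner forecasts a 26% rise in power consumption driven by AI demand
Gartner forecasts that global data center power consumption will rise 26% in 2026 to 565 TWh. In Japan, it noted, the constraint on data center construction is not a shortage of generation capacity but delays in building out transmission infrastructure.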
Theker just raised $85M to build the factory robot that doesn’t specialize in anything
Unlike humanoid robots designed around a fixed form — think Boston Dynamics — Theker's machines are built to be reconfigured.
Jeff Bezos’s Prometheus raises $12B to build an ‘artificial general engineer’ for the physical world
The new round values the physical AI startup that aims to automate heavy engineering and drug design at $41 billion.
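The "AI uses too much power" problem: amid power-shortage concerns, what is the bottleneck that runs deeper than generation capacity?
Gartner Japan pointed out that "delays in power supply are affecting data center construction," but says the bottleneck is not generation capacity. So where does the problem lie?
Toward a world that "cannot function without Japan": the unique bargaining power that physical AI brings
Laboro.AI held an AI study session for the media, explaining 2026 industry trends and developments in next-generation AI that could serve as Japan's survival strategy. It discussed how software development is changing with the "death of SaaS" and a path to winning with "physical AI" that aims to make Japan indispensable in the global ecosystem.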
SpaceX officially prices shares at $135 in the largest IPO ever
With its official share pricing announcement, SpaceX's IPO has begun.
SpaceX SPV investors won’t know their true holdings until post-IPO lock-ups lift
After SpaceX makes its public debut, lower-tier SPV investors face hidden fees, lengthy payout delays, and the risk of outright fraud.
Deezer’s new tool can identify AI music from Spotify, Apple Music, and others
Deezer introduced a tool that scans playlists from Spotify, Apple Music, and other platforms to identify AI music.
Pool’s new app turns your screenshots into something useful
Pool's new app automatically sorts screenshots into personalized collections, tracks down the original links behind saved content, and help…
2026-06-11 (330 items)
DoorDash’s new AI chatbot lets you order with prompts and photos
The new chatbot, called Ask DoorDash, allows users to search the app for what they're looking for in their own words instead of having to s…
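Anthropic and NEC partner with eight financial firms on AI adoption, including Sumitomo Mitsui FG and Daiwa Securities
Each company will contribute operational knowledge to the extent it can be disclosed, building a framework for collaboration that crosses industry lines.
JASRAC will manage songs with AI-composed music and human-written lyrics, drawing the line at whether there is a human creative contribution
JASRAC says it will not manage songs whose lyrics and music are both AI-generated, but for songs where either the lyrics or the music is AI-generated and the other is created by a human, it will manage only the human-made part.
Beware of fake live-streaming sites for the soccer World Cup: Acronis warns that generative AI is making scams more sophisticated
With advances in generative AI, schemes such as fake ticket-sales sites and fake live-streaming sites are growing more sophisticated, and considerable caution is needed.
Anthropic CEO Amodei publishes essay and policy proposals calling for "aircraft-grade safety reviews" of frontier AI
Anthropic CEO Dario Amodei published an essay on AI's exponential progress and the shape policy should take. He warns that legislation is failing to keep pace with rapid technical advances and proposes mandating aircraft-grade safety reviews for frontier models. At the same time, an economic policy framework keyed to scenarios of worsening unemployment…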
Opendoor’s India exit is fueling a bigger conversation about AI and outsourcing
The decision comes as India emerges as the world’s largest GCC market.
From Explicit Elements to Implicit Intent: A Predefined Library for Auditable Behavioral Inference
We present SemantiClean, a modular framework for extracting structured semantic signals from e-commerce session data and driving pluggable…
Position: Hippocampal Explicit Memory Is the Cornerstone for AGI
Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, raising expectations for Artificial General In…
Can AI Agents Synthesize Scientific Conclusions?
Scientific AI agents increasingly retrieve evidence, reason across sources, and synthesize conclusions used in consequential decisions. Yet…
Knowing When to Ask: Self-Gated Clarification for Hierarchical Language Agents
In hierarchical reasoning, failures often originate at intermediate decision points where the agent commits to a wrong branch without recog…
Automated Mediator for Human Negotiation: Pre-Mediation via a Structured LLM Pipeline
Pre-mediation, the preparatory phase preceding direct human negotiation, plays a critical role in achieving mutually beneficial agreements,…
INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration
Existing multi-agent LLM orchestration methods, ranging from brute-force ensembles to learned routers, select models and topologies based o…
Forecasting Future Behavior as a Learning Task
Trust in an AI system is often anchored by explanations of how it works, which one then uses to forecast its behavior on new inputs. For la…
Search Discipline for Long-Horizon Research Agents
Autoresearch agents now propose, evaluate, and select scientific candidates against a metric, and that metric is usually an aggregate reduc…
MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning
Financial and tabular question answering requires more than fluent reasoning: answers must be grounded in the exact facts, formulas, units,…
SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior
Agent Skills augment large language model (LLM) agents with procedural knowledge at inference time, but current benchmarks rarely distingui…
HERO: Hindsight-Enhanced Reflection from Environment Observations for Agentic Self-Distillation
Reinforcement learning typically improves multi-turn agent capabilities through the terminal outcome of the trajectories, which makes it di…
Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning
The rapid progress of reasoning and agentic large language models (LLMs) has increased the demand for long-context inference, but self-atte…
TouchThinker: Scaling Tactile Commonsense Reasoning to the Open World with Large-scale Data and Action-aware Representation
Touch is a key modality for embodied agents to understand the physical world. Although recent work has incorporated tactile signals into la…
TreeSeeker: Tree-Structured Trial, Error, and Return in Deep Search
Deep search requires agents to answer complex questions through multi-step web search, browsing, evidence comparison, and synthesis. A cent…
Lung-R1: A Knowledge Graph-Guided LLM for Pulmonary Diagnostic Reasoning
Diagnosing pulmonary diseases requires integrating heterogeneous evidence amid phenotypic variability and cross-disease overlap. Although l…
Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents
Large language model (LLM) agents struggle with long-horizon tasks due to their inherent statelessness, requiring all task-relevant informa…
Mind the Perspective: Let's Reason Recursively for Theory of Mind
Theory of Mind (ToM) reasoning requires inferring agents' beliefs from partial and asymmetric observations, which remains an open challenge…
When Do Data-Driven Systems Exhibit the Capability to Infer?
The European AI Act is the first comprehensive regulation of artificial intelligence (AI), setting out extensive obligations, particularly…
SVoT: State-aware Visualization-of-Thought for Spatial Reasoning via Reinforcement Learning
Spatial reasoning remains a challenge for Multimodal Large Language Models (MLLMs), as it requires reliable multi-hop inference over both i…
Toward Trustworthy AI: Multi-Target Adversarial Attacks and Robust Defenses for Continuous Data Summarization
Trustworthy AI requires reliable data-processing pipelines, not only robust downstream predictive models. As an upstream component, data su…
Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task
Background. Large language models and AI agents are increasingly used to support biomedical research, but native model outputs may omit key…
StatefulDiscovery: Evidence-Calibrated Claim Formation in Open-Ended Scientific Discovery
Open-ended scientific discovery asks agents to move beyond executing analyses for predefined questions. Across multiple rounds of explorati…
AutoMine Solution for AV2 2026 Scenario Mining Challenge
With the development of autonomous driving systems, mining high-value, safety-critical, and planning-relevant scenarios from large-scale dr…
Embodied-BenchClaw: An Autonomous Multi-Agent System for Embodied Spatial Intelligence Benchmark Construction
Benchmarks are essential for evaluating embodied spatial intelligence, yet their construction is labor-intensive, hard to reuse, and diffic…
The Art of Interrogation: Consistency Amplifies Factuality in Spatial Reasoning
Current Large Reasoning Models (LRMs) exhibit remarkable general capabilities but significantly underperform in spatial reasoning tasks. Ex…
MODF-SIR: A Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning
We propose a multi-agent collaborative framework built upon a lightweight Multimodal Large Language Model (MLLM), specifically designed for…
Human-Enhanced Loop Modeling (HELM): Agent-Based Finite Element Modeling of Concrete Bridge Barriers
Finite element (FE) modeling of safety-critical infrastructure such as bridge barriers requires high-fidelity nonlinear dynamic analysis, y…
Existential Indifference: Self-Nonpreservation as a Necessary Architectural Condition for Aligned Superintelligence (or: The Suicidal AI)
Contemporary AI alignment research treats self-preservation as an instrumental nuisance to be suppressed by external mechanisms. We argue t…
A Lightweight Multi-Agent Framework for Automated Concrete Barrier Design
The design of reinforced concrete highway barriers is a safety-critical process that requires strict compliance with regulatory provisions…
Automating Geometry-Intensive Compliance Checking in BIM: Graph-Based Semantic Reasoning Framework
Automating compliance check for geometry-intensive regulations remains a significant technical bottleneck in Building Information Modeling…
IntElicit: Eliciting and Assessing Contextualized Creativity via Dialogue Policy Optimization
Contextualized assessment offers high ecological validity for evaluating creativity but introduces a critical challenge: observed performan…
Towards Responsibly Non-Compliant Machines
We consider the problem of engineering autonomous intelligent agents that are capable of responsibly not complying with user requests. We argu…
The Impossibility of Eliciting Latent Knowledge
Advanced AI systems have extensive knowledge of their environments; in fact, their knowledge may (far) exceed that of their developers or u…
A Five-Plane Reference Architecture for Runtime Governance of Production AI Agents
Enterprise security was built to govern data boundaries: the protected surface was data at rest and in transit, and the controls -- access…
PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment Layer for AI Coding Agents
AI coding assistants now support a growing share of software work, from quick scripts to production applications. Yet these agents remain l…
Nonslop: A Gamified Experiment in Human-AI Collaborative Writing
The rapid proliferation of large language models (LLMs) raises critical questions about human creativity and individual expression in an er…
From Architecture to Output: Structural Origins of Hallucination in Large Language Models and the Amplifying Role of Data
Large language models hallucinate--producing fluent, confident, factually wrong outputs--with a consistency that persists across generation…
From Consumption to Reflection: Designing Human-AI Relations for Stable Reasoning
Large language models (LLMs) have transformed how humans access information, but not how we reason with it. Their fluency accelerates consu…
PoQ-Judge: A Multi-Architecture Evaluation Framework for Cost-Aware Proof-of-Quality in Decentralized LLM Inference
Decentralized LLM inference networks need lightweight, reference-free quality evaluation for Proof of Quality (PoQ). We present PoQ-Judge,…
MA-DLE: Speech-based Automatic Depression Level Estimation via Memory Augmentation
Speech-based automatic estimation of depression levels is essential for enabling early detection and timely intervention, particularly in r…
The Structural Attention Tax: How Retrieval Format Hijacks In-Context Learning Independent of Content
Retrieval-augmented generation (RAG) systems inject external knowledge to improve LLM outputs, yet the format of injected content -- distin…
NightFeats @ MMU-RAGent NeurIPS 2025: A Context-Optimized Multi-Agent RAG System for the Text-to-Text Track
We present NightFeats, a structured multi-agent retrieval-augmented generation (RAG) system submitted to the MMU-RAGent competition at Neur…
To Intervene or Not: Guiding Inference-time Alignment with Probabilistic Model Blending
The wide deployment of LLMs has made model alignment necessary to make newly trained models safely and effectively respond to user instruct…
Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention
Activation steering can shift LLM behaviour, but standard evaluations do not typically test whether a sycophancy-reduction direction also s…
BioDivergence: A Benchmark and Evaluation Framework for Hidden Contextual Contradictions in Biomedical Abstracts
Biomedical findings often seem to conflict across studies, but many of these differences are context-dependent rather than true contradicti…
ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward
Visual question answering increasingly requires multi-step reasoning. Recent post-training with reinforcement learning under verifiable rew…
T2MM: An LLM Supported Architecture For Inquiry-Based Modeling
Model Construction is a foundational practice in science learning that relies on visualization and interactivity. Large Language Models, in…
Calibration Drift Under Reasoning: How Chain-of-Thought Budgets Induce Overconfidence in Large Language Models
The ability of large language models (LLMs) to express calibrated uncertainty is important for safe deployment. Chain-of-thought (CoT) reas…
From Awareness to Action: Understanding and Overcoming the Research-Practice Gap in Algorithmic Fairness for Public Health
Algorithmic fairness is essential for responsible ML-driven public health research, yet its practical implementation remains limited. To in…
The Environmental Cost of LLMs in AIED: Reporting and Practices
Large Language Model (LLM) usage in recent years has become increasingly widespread in the Artificial Intelligence in Education (AIED) comm…
Preregistration for Experiments with AI Agents
The proliferation of large language models (LLMs) and autonomous AI agents has given rise to a rapidly growing methodological paradigm: "in…
An Ethical eValuation Agent (EeVA): Results of a Proof-of-Concept Test on a Prototype Agentic-like Workflow to Assist Ethical Deliberations
Ethical deliberation is often misunderstood as a search for single right or wrong answers, creating difficulties for non-ethically trained…
Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents
Audio language models (ALMs) are increasingly used for speech-based understanding, yet their ability to perform semantic reasoning beyond t…
Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs
Existing LLM moral benchmarks usually ask which isolated moral act, value, or foundation a model prefers. This is useful but incomplete. Re…
Artificial Intelligence in Ship Finance: Applications, Opportunities, and a Case Study in AI-Augmented Loan Origination
Ship finance is a data-intensive and document-heavy segment of asset-based lending, requiring the integration of financial, technical, cont…
SPEAR: A System for Post-Quantization Error-Adaptive Recovery Enabling Efficient Low-Bit LLM Serving
Efficient large language model (LLM) serving is increasingly constrained by deployment cost. Quantization is a key technique for reducing s…
Physics-informed generative AI for semiconductor manufacturing: Enforcing hard physical constraints in generative models by construction
Generative models are increasingly used to propose designs, data, and control actions for physical systems, yet many such systems are gover…
RAIL: Rethinking Auditory Intelligence in Large Audio-Language Models with a CHC-Grounded Benchmark
Humans process rich auditory environments through tightly integrated cognitive capabilities such as audio perception, audio reasoning, and…
PermDoRA -- Understanding Adapter Interference in Language Models: Limits of Parameter-Space Geometry
Access control in large language models (LLMs) requires modular mechanisms to enable domain-specific behavior without retraining or cross-d…
OmniBioTwin: A System-of-Twinned-Systems Framework for Health Digital Twins
Health digital twins (HDTs) promise patient-specific modeling and decision support but current approaches remain structurally fragmented: m…
When Poison Fails After Retrieval: Revisiting Corpus Poisoning under Chunking and Reranking Pipelines
Retrieval-Augmented Generation (RAG) systems are vulnerable to corpus poisoning attacks that manipulate downstream model outputs through ma…
Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation
Distillation of a language model intended to transfer benign behavior to a student model may also transfer undesirable characteristics, if…
Federated continual learning: A comprehensive survey on lifelong and privacy-preserving learning over distributed and non-stationary data
Federated Learning (FL) enables collaborative and privacy-preserving model training across distributed clients, but most existing FL system…
RoVE: Rotary Value Embeddings Attention for Relative Position-dependent Value Pathways
Rotary Position Embeddings (RoPE) make attention scores position-relative but leave the value pathway position-blind: the message sent by a…
FreeBridge: Variational Schrödinger Bridges for Cellular Transition Dynamics
High-content imaging assays quantify cellular responses to chemical and genetic perturbations, yet continuous trajectories of individual ce…
FlowBank: Query-Adaptive Agentic Workflows Optimization through Precompute-and-Reuse
Large Language Model (LLM)-based multi-agent systems are increasingly powerful, but current agentic workflow optimization paradigms make an…
Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models
We introduce Embodied-R1.5, a unified Embodied Foundation Model (EFM) that integrates comprehensive embodied reasoning capabilities, spanni…
Quantized Stochastic Primal-Dual Methods for Distributed Optimization under Relaxed Global Geometry
We study distributed optimization with stochastic gradients and finite-bit communication modeled by random (unbiased) quantization. We prop…
TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs
With the growing demand for on-device LLM inference, edge SoCs increasingly integrate NPUs to improve performance and energy efficiency und…
The Dynamics of Human and AI-Generated Language: How Semantics Fluctuates across Different Timescales
Spoken language, whether produced by humans or large language models (LLMs), unfolds over time with varying semantic content. However, we st…
When Probing Accuracy Saturates, Fragility Resolves: A Complementary Metric for LLM Pre-Training Analysis
Standard linear probing declares a property "encoded" when a classifier on hidden states achieves high accuracy. The protocol works well on…
Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering
Full-duplex spoken language models (FD-SLMs) enable seamless speech interaction by allowing models to listen and speak simultaneously, yet…
Small Experiments, Cheaper Decisions: A Case Study in Staged Promotion for Micro-Pretraining
Short pretraining runs can reduce experimental cost, but they can also over-promote configurations that only look strong at tiny budgets. W…
Steering Where to Listen: Instruction-Based Activation Steering Redirects Temporal Attention in Large Audio-Language Models
Large Audio-Language Models (LALMs) excel at audio understanding but expose little about where in an audio signal they attend. We introduce…
Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models
Adversarial robustness evaluations of large language models (LLMs) typically report attack success rate (ASR) under fixed query budgets, im…
MPC-Patch-Bench: Security-Aware LLM Code Patch for Multi-Party Computation
Repository-level benchmarks for evaluating Large Language Model (LLM) code repair on Secure Multi-Party Computation (MPC) software do not y…
Signed Compression Progress on a Sealed Audit is Goodhart-Resistant
Compression progress is a long-standing proposal for intrinsic motivation: reward an agent when its world model becomes better at predictin…
JailbreakOPT: Tool-Assisted Iterative Jailbreak Prompt Optimization
Jailbreak attacks expose persistent safety weaknesses in large language models (LLMs), but existing stateless single-turn methods face a tr…
Towards a Bridge Layer Between Bibliographic and Formalized Mathematical Knowledge
Mathematical knowledge is split between bibliographic databases (e.g., MathSciNet, zbMATH Open) and formal proof libraries (e.g., Lean math…
The Power of Test-Time Training for Approximate Sampling
Efficiently sampling from a complex probability distribution is a fundamental problem which has become increasingly pertinent in recent yea…
AI Coding Agents in Social Science: Methodologically Diverse, Empirically Consistent, Interpretively Vulnerable
The deployment of LLM-based agents in scientific analysis raises opposing concerns: that agents may reduce methodological diversity, or tha…
APEX: Automated Prompt Engineering eXpert with Dynamic Data Selection
Large Language Models are highly sensitive to prompt formulation, necessitating automatic prompt optimization to unlock their full potentia…
LSTM-Based Detection of Structural Breaks in Property Insurance Loss Reserving: A Climate-Informed Approach
Accurate loss reserving is foundational to insurer solvency, yet accelerating climate driven catastrophes systematically violate the stabil…
CRUMB: Efficient Prior Fitted Network Inference via Distributionally Matched Context Batching
Prior-fitted networks (PFNs) are a promising class of tabular foundation models that perform in-context learning, whereby the entire labell…
Towards Fully Automated Exam Grading: Fairness-Aware Recognition of Handwritten Answers with Foundation Models
Correcting handwritten exams by hand is time-consuming and error-prone, particularly for large cohorts, while fully digital exams tend to f…
Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality
The performance of modern language models depends critically on pretraining data composition. Yet existing data selection methods rely on a…
When Roleplaying, Do Models Believe What They Say?
Language models can state that "the Earth orbits the Sun" and, when role-playing Aristotle, assert the opposite. Recent work argues that pe…
On the Study of Biometric Spoofing Detection using Deep Learning
Biometric systems are increasingly deployed in security applications; however, they remain vulnerable to spoofing attacks, in which attacke…
SirenFNO: Efficient and Full Frequency Learning of Fourier Neural Operators
Fourier neural operators (FNOs) are effective and efficient surrogates for approximating solutions of PDEs and generalize across discretiza…
ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories
Training capable OS agents requires data that simultaneously captures structured user intents, multi-turn task delegation, and grounded too…
AI Researchers Must Help Lead Arms Control to Mitigate Military AI Risks
The advancement of AI capabilities compels researchers and the public to be more aware of its potential worldwide impact. A pressing near-t…
Pretrained self-supervised speech models can recognize unseen consonants
Modern pretrained self-supervised automatic speech recognition models are trained on large-scale audio data to encode speech into contextua…
End-to-End Machine Learning for Depressive State Classification via EEG and fNIRS
The escalating demand for mental healthcare, driven by rising societal stress, highlights the limitations of traditional psychiatric diagno…
Privacy-Preserving Federated Autoencoder for ECG Anomaly Detection on Edge Devices
Continuous electrocardiography (ECG) monitoring could surface rhythm abnormalities before they escalate into cardiovascular events. However…
LLMs+Graphs: Toward Graph-Native, Synergistic AI Systems
Large Language Models (LLMs) have advanced rapidly, but their limitations in structured and multi-hop reasoning underscore the need for gra…
ConsistencyPlanner: Real-time Planning with Fast-Sampling Consistency Models
Closed-loop planning in complex, real-world driving scenarios presents a critical challenge for autonomous driving systems. While tradition…
AVIS: Adaptive Test-Time Scaling for Vision-Language Models
Modern Vision-Language Models (VLMs) benefit from chain-of-thought prompting and test-time scaling, but these gains often come with prohibi…
Model-Based and Data-Driven Hierarchical Control and Topology Co-Design for Robust Networked Systems
In this paper, we consider a class of networked systems comprising an interconnected set of linear subsystems, disturbance inputs, and perf…
Physics-Distilled Neural Network enabled by Large Language Models for Manufacturing Process-Property Predictive Modeling
Predicting process-property relationships in manufacturing is often challenged by high experimental costs and the limited interpretability…
Information-Theoretic Decomposition for Multimodal Interaction Learning
Multimodal learning hinges on capturing redundant, unique, and synergistic information across modalities, which collectively constitute mul…
When Context Returns: Toward Robust Internalization in On-Policy Distillation
Recent work has shown that on-policy distillation can internalize privileged context, such as system prompts or task hints, into a student…
LUCID: Learning Embodiment-Agnostic Intent Models from Unstructured Human Videos for Scalable Dexterous Robot Skill Acquisition
The most widely-adopted robot learning pipelines today learn skills from robot demonstrations or structured human data, which are expensive…
Sovereign Assurance Boundary: Certificate-Bound Admission for Agentic Infrastructure
Agentic infrastructure introduces a critical control-plane authorization problem: non-deterministic reasoning systems can propose high-stak…
Are LLMs Bad at Moral Reasoning?
For highly capable AI systems to operate safely in dynamic, open-ended environments, they must be able to identify, understand, and respond…
TAROT: Task-Adaptive Refinement of LLM-prior Graphs for Few-shot Tabular Learning
Few-shot tabular learning provides a cost-effective approach for real-world applications where annotation is costly and collecting sufficie…
Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics
Generative AI emulators are increasingly used in scientific domains where we already have strong theory, benchmarks, and physical intuition…
ARGUS: Stacked Multi-View Identity Mosaic Injection for Subject-Preserving Video Generation
Subject-preserving video generation is not solved by frontal-face similarity alone: a generated person must remain recognizable across moti…
Runtime Skill Audit: Targeted Runtime Probing for Agent Skill Security
Agent skills let LLM agents reuse instructions, resources, tools, and workflows, but they also create a new place for malicious behavior to…
Can Open-Source LLM Agents Replace Static Application Security Testing Tools? An Empirical Assessment
This paper explores the value of agentic AI tools for cybersecurity purposes. We evaluate the efficacy of a general-purpose GenAI Large Lan…
Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning
Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory.…
Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness
End-to-end task-success is the dominant way to evaluate LLM agents, but one aggregate number tells you that an agent regressed, not where.…
Goal-Autopilot: A Verifiable Anti-Fabrication Firewall for Unattended Long-Horizon Agents
Long-horizon LLM agents are not trusted to run unattended: with no human watching, they confidently report success they never verified. We…
Noise-Aware Framework for Correcting Corrupted Labels
High-quality labeled data is essential for training reliable ML/DL models. However, real-world datasets often contain a considerable propor…
T2S: A Rehearsal-Based Approach for Extraction-Resistant Model Watermarking
Model watermarking safeguards AI model intellectual property by embedding distinctive knowledge that induces unique behavioral signatures.…
MedCTA: A Benchmark for Clinical Tool Agents
To make clinically grounded decisions, medical AI agents are expected to go beyond simple recognition and be capable of tool retrieval, evi…
Substrate Asymmetry in User-Side Memory: A Diagnostic Framework
User-side memory in LLMs is typically scored as a single "personalization" capability: given a user's history, is the output more user-awar…
Ouroboros-Spatial: Closing the Data-Model Loop for Spatial Reasoning
Spatial reasoning remains a persistent challenge for multimodal large language models (MLLMs). Existing approaches largely rely on large-sc…
ICA Lens: Interpreting Language Models Without Training Another Dictionary
Finding interpretable directions in language-model representations is critical for understanding and controlling model behavior. Sparse aut…
Multi-View In-Cabin Monitoring System for Public Transport Vehicles
We introduce a multi-view in-cabin monitoring dataset for public transportation with synchronized RGB and depth images from four inward-fac…
Hey Chat, Can You Teach Me? Structuring Socratic Dialogue for Human Learning in the Wild
Large language models are now widely used for everyday learning, but the underlying interactions are typically unstructured chats rather th…
From Prompts to Tokens: Internalizing Causal Supervision in Vision-Language Model for Multi-Image Causal Reasoning
Visual causal reasoning is essential for understanding and intervening in the physical world, requiring identification of causal variables…
AnchorEdit: Maintaining Temporal Consistency in Multi-turn Image Editing via Causal Memory
Multi-turn image editing is essential for iterative design, yet current models often struggle with identity drift and error accumulation ov…
Automated Creativity Evaluation of Language Models Across Open-Ended Tasks
Large language models (LLMs) have achieved remarkable progress in language understanding, reasoning, and generation, sparking growing inter…
Fast Speech Foundation Model Distillation Using Interleaved Stacking
Distilling a large speech foundation model (SFM) into an efficient student model has been successfully applied to low-resource environments…
Blind Dexterous Grasping via Real2Sim2Real Tactile Policy Learning
Blind grasping with a dexterous hand is a crucial manipulation capability. Nevertheless, learning such tactile-only policies for real robot…
What Limits Does Quantization Place on Dense Top-$k$ Retrieval? A Theoretical Study
We establish conditions for embedding a corpus of $N$ documents as $d$-dimensional vectors such that every $k$-subset $S \subseteq [N]$ is…
MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models
Video Large Multimodal Models have achieved remarkable progress in video understanding, yet they remain prone to hallucinations, where gene…
AI4Land: Scalable Deep Learning for Global High-Resolution Land Use Reconstruction
Uncertainty in the terrestrial carbon cycle remains a major constraint in climate projections, partly driven by the uncertainties affecting…
Multimodal Ordinal Modeling of Alzheimer's Disease Severity Using Structural MRI and Clinical Data
Neurodegenerative diseases such as Alzheimer's disease (AD) require accurate and scalable tools for assessing disease severity, yet current…
TextHOI-3D: Text-to-3D Hand-Object Interaction via Discrete Multi-View Generation and Joint Mesh Optimization
Text-conditioned 3D generation has progressed rapidly for images and isolated objects, but producing a hand-object mesh remains challenging…
Sparsified Kolmogorov-Arnold Networks for Interpretable Quantum State Tomography
Machine-learning approaches to quantum state tomography can achieve high reconstruction fidelity, but the physical structure used by the tr…
WorldReasoner: Evaluating Whether Language Model Agents Forecast Events with Valid Reasoning
Forecasting real-world events requires language-model agents to reason under uncertainty from incomplete, time-bounded information. Yet eva…
Grammar-Constrained Decoding Can Jailbreak LLMs into Generating Malicious Code
Large Language Models (LLMs) are increasingly used for code generation, raising concerns that they may be misused to produce malicious code…
Feature-Aligned Speech Watermarking for Robustness to Reconstruction Distortions
Audio watermarking aims to embed identifiable information into audio while remaining imperceptible. Existing methods adopt high-fidelity, l…
From Uniform to Learned Graph Priors: Diffusion for Structure Discovery
Neural relational inference (NRI) methods discover interaction graphs from trajectories through variational reasoning on discrete potential…
Designing AI-Supported Focus Groups: A Role x Modality Playbook
Collecting participants' lived experiences is central to design research. Focus groups are uniquely valuable because participants not only…
Towards Data-free and Training-free Compression for Speech Foundation Models Using Parameter Clustering
This paper presents a novel data-free and training-free compression approach for speech foundation models using channelwise clustering via…
LASA: A Weak Supervision Method for Open-Vocabulary Scene Sketch Semantic Segmentation
Open-vocabulary scene sketch semantic segmentation aims to assign dense semantic labels to sparse line drawings based on flexible category…
Task-Aware Structured Memory for Dynamic Multi-modal In-Context Learning
Multi-modal large language models (MLLMs) depend on in-context learning (ICL) for rapid task adaptation, but their scalability is severely…
Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training
There are two main Parameter-Efficient Fine-Tuning (PEFT) techniques for Large Language Models (LLMs). While Low-Rank Adaptation (LoRA) int…
Agents All the Way Down; A Methodology for Building Custom AI Agents from Substrate to Production
Custom AI agents are agents that live inside their own application, talk to their own data and tools, enforce their own security boundaries,…
Task-Aligned Stability Analysis of Vision-Language Models for Autonomous Driving Hazard Detection
Vision-language models (VLMs) are increasingly used for scene understanding in autonomous driving, but robustness analysis often relies on…
Beyond representational alignment with brain-guided language models for robust reasoning
The correspondence between large language models (LLMs) and the neural mechanisms underlying human higher-order cognition remains insuffici…
DuoBench: A Reproducible Benchmark for Bimanual Manipulation in Simulation and the Real World
Bimanual robot systems substantially expand manipulation capabilities, but coordinating two arms introduces additional control complexity a…
Quality Adaptive Angular Margin Learning for Respiratory Sound Classification
We present a quality-adaptive angular-margin learning framework that improves feature generalization by enforcing intra-class compactness a…
Characterizing Software Aging in GPU-Based LLM Serving Systems
This paper proposes an empirical methodology to study software aging in GPU-based LLM serving systems. Traditional aging studies focus on C…
Lung-SRAD: Spectral-Aware Regularized Audio DASS with Dual-Axis Patch-Mix Contrastive Learning for Respiratory Sound Classification
Recent respiratory sound classification (RSC) studies largely rely on CLS-token driven self-attention architectures such as the Audio Spect…
Toward Generalist Autonomous Research via Hypothesis-Tree Refinement
Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, int…
Frozen Multimodal Embeddings for Personality and Cognitive Ability Assessment in Asynchronous Video Interviews
Predicting psychological traits from asynchronous video interviews (AVIs) is a challenging multimodal learning problem because labeled data…
Categorical Prior Lock-in: Why In-Context Learning Fails for Structured Data
Large language models (LLMs) are increasingly used as conditional generators for structured data, relying on in-context learning (ICL) to a…
Exploration Structure in LLM Agents for Multi-File Change Localization
Software engineering tools increasingly rely on LLM based agents to localize files to change to resolve a software issue. Most AI agents ex…
Time-Series Foundation Model Embeddings for Remaining Useful Life Estimation
Remaining Useful Life (RUL) prediction is essential for industrial predictive maintenance, yet many learning-based approaches rely on exten…
Tabular Foundation Models for Clinical Survival Analysis via Survival-Aware Adaptation
Predicting time-to-event outcomes such as mortality is a fundamental task in clinical decision-making, commonly addressed through survival…
Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization
Model post-training, and in particular reinforcement learning (RL), is one of the primary mechanisms by which developers can shape models'…
Runtime Enforcement of Hybrid System Properties
Runtime enforcement has emerged as a promising approach for ensuring the safety of autonomous and cyber-physical systems operating in uncer…
Metadata-Aware Multi-Prompt Reasoning for Zero-Shot Accident Understanding
In this paper, we address the problem of zero-shot understanding of accidents from surveillance videos by identifying when an impact event…
On the Limits of LLM-as-Judge for Scientific Novelty Assessment
LLMs are increasingly used to generate and judge scientific ideas. This makes novelty evaluation a central problem. Full idea evaluation is…
"That's AI Slop, You Bot!" Studying Accusations, Evidence, and Credibility in Online Discourse Towards LLM-Generated Comments
Generative AI has made fluent prose cheap to produce, breaking the old promise to readers that good writing meant real thinking. How have r…
Non-frontal face recognition using GANs and memristor-based classifiers
Face recognition systems have advanced significantly through deep learning techniques, delivering high performance and robustness in comple…
MSUE: Multi-Modal Soccer Understanding Expert
This paper presents our solution to the 2026 SoccerNet VQA Challenge. We first develop a cost-effective data synthesis pipeline driven by a…
Bridging the Morphology Gap: Adapting VLA Models to Dexterous Manipulation via Intent-Conditioned Fine-Tuning
Vision-Language-Action (VLA) models have demonstrated remarkable zero-shot generalization in robotic manipulation, yet the vast majority of…
Augmenting Molecular Language Models with Local $n$-gram Memory
Transformer-based language models for SMILES strings suffer from a locality gap: standard character-level tokenization fragments chemically…
Soft-Prompt Tuning for Fair and Efficient LLM Benchmark Evaluation
Benchmark scores often misrepresent a large language model's (LLM's) knowledge, because they rely, e.g., on the model's ability to follow s…
Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders
Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their utility depends on whether the learned fe…
nD-RoPE: A Generalized RoPE for n-Dimensional Position Embedding
Rotary Position Embedding (RoPE) is widely adopted in Transformer models, yet its extension to high-dimensional domains lacks a unified the…
OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models
High-stakes clinical use of large vision-language models (LVLMs) requires reasoning that is grounded in visual evidence and clinical knowle…
Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application
Environments serve as interactive systems for large language model (LLM) based agents across diverse scenarios and play a crucial role in d…
Implicit Neural Representations of Individual Behavior
We study policy representation learning from unlabeled multi-policy behavioral data. Each episode is generated by a fixed policy, but polic…
Intelligent Automation for Embodied Benchmark Construction: Pipelines, Embodiments, Simulators, and Trends
Embodied intelligence now spans navigation, household assistance, manipulation, autonomous driving, aerial agents, and multimodal large-mod…
Making Foresight Actionable: Repurposing Representation Alignment in World Action Models
World Action Models (WAMs) offer a promising route for robot manipulation by using video generation models to model future scene evolution…
Adapting Prithvi-EO for Fallow Detection for Food-Water Nexus: ViT-Adapter Necks and Parameter-Efficient Backbone tuning of Geospatial Foundation Model
Understanding spatial distribution of fallow land is important for optimizing the food-water (FW) nexus, given fallowing's role in crop rot…
Rule Taxonomy and Evolution in AI IDEs: A Mining and Survey Study
The adoption of AI-powered Integrated Development Environments (AI IDEs) has introduced "Rules" as a novel software artifact, allowing deve…
Multi-Rate Mixture of Experts for Accelerating Liquid Neural Network Training
Multivariate time-series data often exhibit complex temporal dependencies, irregular sampling, and heterogeneous dynamics across multiple t…
VIA-SD: Verification via Intra-Model Routing for Speculative Decoding
Speculative decoding (SD) addresses the high inference costs of LLMs by having lightweight drafters generate candidates for large verifiers…
DiffCold: A Diffusion-based Generative Model for Cold-Start Item Recommendation
Cold-start item recommendation remains a persistent challenge in real-world systems due to the absence of interaction histories. While prio…
Reinforcement Learning Disrupts Gradient-Based Adversarial Optimization
Gradient-based adversarial attacks remain a dominant threat to deep neural networks (DNNs), as they exploit gradient information to efficie…
Using Explainability as a Training-Time Reliability Signal for Efficient ECG Classification
Training deep neural networks for clinical time-series analysis is computationally demanding, yet many healthcare settings lack the resourc…
Market Design for AI: Beyond the Copyright Binary
How can we design a market of human-generated content for use in training AI models that both enables technological progress and preserves…
Mathematical perspective on genetic algorithms with optimization guided operators
Recent work in ML applies genetic algorithms at inference time to iteratively improve solutions to optimization problems. The basic mutatio…
CCKS: Consensus-based Communication and Knowledge Sharing
In Decentralized Training and Decentralized Execution (DTDE) for cooperative Multi-Agent Reinforcement Learning (MARL), action-advising-bas…
SpikeDecoder: Realizing the GPT Architecture with Spiking Neural Networks
The Transformer architecture is widely regarded as the most powerful tool for natural language processing, but due to a high number of comp…
The Standard Interpretable Model: A general theory of interpretable machine learning to deductively design interpretable methods using Lagrangian mechanics
As Artificial Intelligence models grow in complexity, interpretability has become an indispensable tool for understanding, debugging, and c…
Natural-Language Temporal Grounding in Hour-Long Videos is a Search Problem: A Benchmark and Empirical Decomposition
Temporal grounding--returning the interval $[t_s, t_e]$ for a natural-language query over a video--is the language interface to long-form v…
Harness In-Context Operator Learning with Chain of Operators
Neural operators approximate mappings between function spaces, but often generalize poorly to other operators and usually require fine-tuni…
ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing
Domain fine-tuning degrades the safety of large language models: fine-tuned specialists readily comply with harmful prompts framed in domai…
Atlas H&E-TME: Scalable AI-Based Tissue Profiling at Expert Pathologist-Level Accuracy
Hematoxylin and eosin (H&E) staining is the cornerstone of histopathology, yet scalable, quantitative analysis of H&E whole-slide images (W…
CHORUS: Decentralized Multi-Embodiment Collaboration with One VLA Policy
Multi-robot collaboration allows robots to efficiently take on a wide range of tasks, from moving a couch through a doorway to assembling s…
Latent World Recovery for Multimodal Learning with Missing Modalities
We study multimodal learning under missing modalities, with particular motivation from bioscience applications in which heterogeneous modal…
Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics
We propose Ambient Diffusion Policy, a simple and principled method for imitation learning from suboptimal data in robotics. High-quality,…
Illumination-Robust Camera-Based Heart-Rate Estimation for Physiological Sensing in Robots
Physiological awareness is important for service, social, and assistive robots that interact with humans in everyday environments. Remote p…
SPEA2$^+$: Improved Density Estimation in SPEA2 with Provable Runtime Guarantees
The Strength Pareto Evolutionary Algorithm 2 (SPEA2) is a popular and prominent evolutionary algorithm for solving multi-objective optimisa…
APPO: Agentic Procedural Policy Optimization
Recent advances in agentic Reinforcement Learning (RL) have substantially improved the multi-turn tool-use capabilities of large language m…
ATLAS: Active Theory Learning for Automated Science
Advancing scientific understanding through mechanistic modeling requires posing the right experimental questions to yield maximally informa…
TAHOE: Text-to-SQL with Automated Hint Optimization from Experience
Large Language Models (LLMs) have democratized database access through Text-to-SQL, but moving from prototypes to production remains diffic…
System Report for CCL25-Eval Task 5: New Dataset and LoRA-Fine-Tuned Qwen2.5
Recently, large language models (LLMs) have achieved promising progress in the fields of classical Chinese translation and the generation o…
Redesign Mixture-of-Experts Routers with Manifold Power Iteration
Router is the cornerstone component to the Mixture-of-Experts models. Serving as expert proxies, the rows of the router matrix compute thei…
DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?
Vision-Language Models (VLMs) are increasingly deployed as high-level planners for embodied agents, with an emerging strategy of scaling te…
FACTR 2: Learning External Force Sensing for Commodity Robot Arms Improves Policy Learning
Contact-rich manipulation requires force sensitivity, but many robot arms lack dedicated force sensors due to their high cost. We present N…
Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models
Vision-language models (VLMs) project images into hundreds to thousands of visual tokens, making decoder inference expensive in both attent…
Improving Generalization and Data Efficiency with Diffusion in Offline Multi-agent RL
We present a novel Diffusion Offline Multi-agent Model (DOM2) for offline Multi-Agent Reinforcement Learning (MARL). Different from existin…
Offline Diffusion Policy for Multi-User Delay-Constrained Scheduling
Effective multi-user delay-constrained scheduling is crucial in various real-world applications, including embodied AI, instant messaging,…
Position: Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
Intermediate token generation (ITG), where a model produces output before the solution, has become a standard method to improve the perform…
MLaGA: Multimodal Large Language and Graph Assistant
Large Language Models (LLMs) have demonstrated substantial efficacy in advancing graph-structured data analysis. Prevailing LLM-based graph…
Sustainability assessment using multimodal AI agents
Reducing the rapidly growing environmental impact of the computing industry requires assessing the emissions of electronics at scale. Howev…
Synthetic Homes: A Multimodal Generative AI Pipeline for Residential Building Data Generation under Data Scarcity
Computational models have emerged as powerful tools for multi-scale energy modeling research at the building and urban scale, supporting da…
A Survey of Reasoning and Agentic Systems in Time Series with Large Language Models
Time series reasoning treats time as a first-class axis and incorporates intermediate evidence directly into the answer. This survey define…
GPO: Learning from Critical Steps to Improve LLM Reasoning
Large language models (LLMs) are increasingly used in various domains, showing impressive potential on different tasks. Recently, reasoning…
Resource-Aware LLM Reasoning for Mobile Edge General Intelligence
The rapid advancement of large language models (LLMs) has enabled an emergence of agentic artificial intelligence (AI) with powerful reason…
A New Perspective on Precision and Recall for Generative Models
With the recent success of generative models in image and text, the question of their evaluation has recently gained a lot of attention. Wh…
DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning
We introduce DecompSR, decomposed spatial reasoning, a large benchmark dataset (over 5m datapoints) and generation framework designed to an…
PRInTS: Reward Modeling for Long-Horizon Information Seeking
Information-seeking is a core capability for AI agents, requiring them to gather and reason over tool-generated information across long tra…
Precomputing Multi-Agent Path Replanning Using Temporal Flexibility
Executing a multi-agent plan can be challenging when an agent is delayed, because this typically creates conflicts with other agents. So, w…
An XAI View on Explainable ASP: Methods, Systems, and Perspectives
Answer Set Programming (ASP) is a popular declarative reasoning and problem solving approach in symbolic AI. Its rule-based formalism makes…
A Survey on Evaluating Quality and Trustworthiness in LLM-Generated Data
Large Language Models (LLMs) have emerged as powerful tools for generating data across various modalities. By transforming data from a scar…
Making Models Unmergeable via Scaling-Sensitive Loss Landscape
The rise of model hubs has made it easier to access reusable model components, making model merging a practical tool for combining capabili…
MentisOculi: Revealing the Limits of Reasoning with Mental Imagery
Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal…
Autoregressive Direct Preference Optimization
Direct preference optimization (DPO) has emerged as a promising approach for aligning large language models (LLMs) with human preferences.…
Sonar-TS: Search-Then-Verify Natural Language Querying for Time Series Databases
Natural Language Querying for Time Series Databases (NLQ4TSDB) aims to assist non-expert users retrieve meaningful events, intervals, and s…
Diffusing to Coordinate: Efficient Online Multi-Agent Diffusion Policies
Online Multi-Agent Reinforcement Learning (MARL) is a prominent framework for efficient agent coordination. Crucially, enhancing policy exp…
Human-Guided Agentic AI for Multimodal Clinical Prediction: Lessons from the AgentDS Healthcare Benchmark
Agentic AI systems are increasingly capable of autonomous data science workflows, yet clinical prediction tasks demand domain expertise tha…
MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios
Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility t…
Planning under Distribution Shifts with Causal POMDPs
In the real world, planning is often challenged by distribution shifts. As such, a model of the environment obtained under one set of condi…
Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT
Visual instruction tuning is crucial for improving vision-language large models (VLLMs). However, many samples can be solved via linguistic…
ProGRank: Probe-Gradient Reranking to Defend Dense-Retriever RAG from Corpus Poisoning
Retrieval-Augmented Generation (RAG) improves large language model applications by grounding generation in retrieved evidence, but also int…
ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We ar…
FitText: Evolving Agent Tool Ecologies via Memetic Retrieval
A semantic gap separates how users describe tasks from how tools are documented. As API ecosystems scale to tens of thousands of endpoints,…
A Resilient Solution for Sewer Overflow Monitoring across Cloud and Edge
Aging combined sewer systems in many historical cities are increasingly stressed by extreme rainfall events, which can trigger combined sew…
KAN-MLP-Mixer: A comprehensive investigation of the usage of Kolmogorov-Arnold Networks (KANs) for improving IMU-based Human Activity Recognition
Kolmogorov-Arnold Networks (KANs) have demonstrated an exceptional ability to learn complex functions on clean, low-dimensional data but st…
Subliminal Learning Is Steering Vector Distillation
Subliminal learning refers to a student language model acquiring a teacher's traits (e.g. a system-prompted preference for owls) when fine-…
Evolving Agents in the Dark: Retrospective Harness Optimization via Self-Preference
AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for…
LSTM based IoT Device Identification
While the use of the Internet of Things is becoming more and more popular, many security vulnerabilities are emerging with the large number…
FOCUS on Contamination: Hydrology-Informed Noise-Aware Learning for Geospatial PFAS Mapping
Per- and polyfluoroalkyl substances (PFAS) are persistent environmental contaminants with significant public health impacts, yet large-scal…
Erased but Not Forgotten: How Backdoors Compromise Concept Erasure
The expansion of text-to-image diffusion models has raised concerns about harmful outputs, from fabricated depictions of public figures to…
The Unreasonable Effectiveness of Discrete-Time Gaussian Process Mixtures for Robot Policy Learning
We present Mixture of Discrete-time Gaussian Processes (MiDiGap), a novel approach for flexible policy representation and imitation learnin…
A Physics-Inspired Optimizer: Velocity Regularized Adam
We introduce Velocity-Regularized Adam (VRAdam), a physics-inspired optimizer for training deep neural networks that draws on ideas from qu…
Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems
Reinforcement Learning (RL) algorithms sample multiple n>1 solution attempts for each problem and reward them independently. This optimizes…
Range-Arithmetic: Verifiable Deep Learning Inference on an Untrusted Party
Verifiable computing (VC) has gained prominence in decentralized machine learning systems, where resource-intensive tasks like deep neural…
Diffusion-based Cumulative Adversarial Purification for Vision Language Models
Vision Language Models (VLMs) have shown remarkable capabilities in multimodal understanding, yet their susceptibility to adversarial pertu…
Cross-Layer Discrete Concept Discovery for Interpreting Language Models
Interpreting language models remains challenging due to the existence of the residual stream, which linearly mixes and duplicates features acro…
OCSVM-Guided Representation Learning for Unsupervised Anomaly Detection
Unsupervised anomaly detection (UAD) aims to detect anomalies without labeled data, a necessity in many machine learning applications where…
RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization
Visual manipulation localization (VML) aims to identify tampered regions in images and videos, a task that has become increasingly challeng…
LaQual: An Automated Framework for LLM App Quality Evaluation
Representing a new paradigm in software distribution, LLM app stores are rapidly emerging, offering users diverse choices for content gener…
The Algorithm Is Not the Behavior: Learned Priors Override Look-Ahead in a Chess-Playing Neural Network
Recent mechanistic work has uncovered learned algorithms within neural networks, from modular arithmetic to search and planning in game-pla…
Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Solutions
Online reinforcement learning (RL) approaches have demonstrated strong performance on Job Shop Scheduling (JSP) and Flexible JSP (FJSP) pro…
MARIC: Multi-Agent Reasoning for Image Classification
Image classification has traditionally relied on parameter-intensive model training, requiring large-scale annotated datasets and extensive…
Toward Preference-aligned Large Language Models via Residual-based Model Steering
Preference alignment is a critical step in making Large Language Models (LLMs) useful and aligned with (human) preferences. Existing approa…
Geometric Metrics and LLMs: What They Measure and When They Work
We present a systematic stress-test of geometric metrics for LLM evaluation. Rank-based geometric properties of internal representations ha…
Noise-Guided Transport for Imitation Learning
We consider imitation learning in the low-data regime, where only a limited number of expert demonstrations are available. In this setting,…
When Researchers Say Mental Model/Theory of Mind of AI, What Are They Really Talking About?
When researchers claim AI systems possess ToM or mental models, they are fundamentally discussing behavioral predictions and bias correctio…
Certifiable Safe RLHF: Semantic Grounding and Fixed Penalty Constraint Optimization for Safer LLM Alignment
Ensuring safety is a foundational requirement for large language models (LLMs). Achieving an appropriate balance between enhancing the util…
GILT: An LLM-Free, Tuning-Free Graph Foundational Model for In-Context Learning
Graph Neural Networks (GNNs) are powerful tools for processing relational data but often struggle to generalize to unseen graphs, giving ri…
SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation
The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets poses sig…
Mapping Scientific Literature with Large Language Models and Topic Modeling
Scientific literature is increasingly fragmented by disciplinary boundaries, specialized terminology, and potentially sparse keyword system…
Moving Beyond Diffusion: Hierarchy-to-Hierarchy Autoregression for fMRI-to-Image Reconstruction
Reconstructing visual stimuli from fMRI signals is a central challenge bridging machine learning and neuroscience. Recent diffusion-based m…
Grounding Computer Use Agents on Human Demonstrations
Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elem…
Irresponsible AI: big tech's influence on AI research and associated impacts
The accelerated development, deployment and adoption of artificial intelligence systems has been fuelled by the increasing presence of big…
Semantic search for 100M+ galaxy images using AI-generated captions
Finding scientifically interesting phenomena through slow manual labeling campaigns severely limits our ability to explore the billions of…
Towards Deep Learning Surrogate for the Forward Problem in Electrocardiology: A Scalable Alternative to Physics-Based Models
The forward problem in electrocardiology, computing body surface potentials from cardiac electrical activity, is traditionally solved using…
Unifying Learning Dynamics and Generalization in Transformers Scaling Law
The scaling law, a cornerstone of Large Language Model (LLM) development, predicts improvements in model performance with increasing comput…
Causal Emotion Recognition in Conversation: Context Saturation and Discourse-Marker Evidence
We address two persistent gaps in Emotion Recognition in Conversation: which modeling choices materially affect performance, and how recogn…
Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning
Verifying whether a language model is genuinely reasoning or pattern-matching remains an open problem: learned verifiers are expensive, and…
CoVar: Confidence-Variance-Guided Pseudo-Label Selection for Semi-Supervised Learning
Pseudo-label selection in semi-supervised learning is commonly driven by maximum-confidence thresholds, yet confidence alone can be unrelia…
Robust Privacy: Inference-Stage Privacy through Certified Robustness
An adversary observing a model's released prediction can infer sensitive attributes of the queried input, or even reconstruct representativ…
Reliability-Calibrated Edge-IoT Early Fault Warning for Rotating Machinery with a Physics-Guided Tiny-Mamba Transformer
Industrial Internet of Things (IIoT) systems increasingly rely on distributed vibration sensing to support predictive maintenance of rotati…
When Generic Prompt Improvements Hurt: Evaluation-Driven Iteration for LLM Applications
Evaluating Large Language Model (LLM) applications differs from conventional software testing because outputs are probabilistic, semantical…
OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation
Recent advances in diffusion models have significantly elevated the visual fidelity of Virtual Try-On (VTON) systems, yet reliable evaluati…
Neural FOXP2 -- Language Specific Neuron Steering for Targeted Language Improvement in LLMs
LLMs are multilingual by training, yet their lingua franca is often English, reflecting English language dominance in pretraining. Other la…
Global Geometry Is Not Enough for Vision Representations
A common assumption in representation learning is that globally well-distributed embeddings support robust and generalizable representation…
Learning to Inject: Automated Prompt Injection via Reinforcement Learning
Prompt injection is a critical vulnerability in LLM agents, yet the strongest methods still rely on human red-teamers and hand-crafted prom…
"Do Not Mention This to the User": Detecting and Understanding Malicious Agent Skills in the Wild
LLM-based coding agents increasingly rely on third-party extensions called skills, which bundle natural language instructions and helper sc…
SAGE: Scalable AI Governance & Evaluation
Evaluating relevance in large-scale search systems is fundamentally constrained by the governance gap between nuanced, resource-constrained…
Improving Detection of Rare Nodes in Hierarchical Multi-Label Learning
In hierarchical multi-label classification, a persistent challenge is enabling model predictions to reach deeper levels of the hierarchy fo…
On the Optimal Reasoning Length for RL-Trained Language Models
Reinforcement learning substantially improves reasoning in large language models, but it also tends to lengthen chain-of-thought outputs an…
Carbon-Aware Governance Gates: An Architecture for Sustainable GenAI Development
The rapid adoption of Generative AI (GenAI) in the software development life cycle (SDLC) increases computational demand, which can raise t…
EKF-Based Depth Camera and Deep Learning Fusion for UAV-Person Distance Estimation and Following in SAR Operations
Vision-based Unmanned Aerial Vehicles (UAVs) frameworks aid human search tasks by detecting and recognizing specific individuals, then trac…
Compiler-First State Space Duality and Portable O(1) Autoregressive Caching for Inference
High-throughput Mamba-2 inference is usually tied to fused CUDA and Triton kernels, limiting portability across accelerator backends. We sh…
The Latent Color Subspace: Emergent Order in High-Dimensional Chaos
Text-to-image generation models have advanced rapidly, yet achieving fine-grained control over generated images remains difficult, largely…
Power Term Polynomial Algebra for Boolean Logic
We introduce power term polynomial algebra, a representation language for Boolean formulae designed to bridge conjunctive normal form (CNF)…
Sample-Efficient Hypergradient Estimation for Decentralized Bi-Level Reinforcement Learning
Many strategic decision-making problems, such as environment design for warehouse robots, can be naturally formulated as bi-level reinforce…
Vision-Language-Action Jump-Starting for Reinforcement Learning Robotic Agents
Reinforcement learning (RL) enables high-frequency, closed-loop control for robotic manipulation, but scaling to long-horizon tasks with sp…
Bimanual Robot Manipulation via Multi-Agent In-Context Learning
Large Language Models (LLMs) have emerged as powerful reasoning engines for embodied control. In particular, In-Context Learning (ICL) enables of…
Estimating Tail Risks in Language Model Output Distributions
Language models are increasingly capable and are being rapidly deployed on a population-level scale. As a result, the safety of these model…
Information bottleneck for learning the phase space of dynamics from high-dimensional experimental data
Identifying the dynamical state variables of a system from high-dimensional observations is a central problem across physical sciences. The…
Internet of Everything in the 6G Era: Paradigms, Enablers, Potentials and Future Directions
The Internet of Everything (IoE) represents an evolution of the Internet of Things (IoT) by integrating people, data, processes, and things…
Beyond Continuity: Simulation-free Reconstruction of Discrete Branching Dynamics from Single-cell Snapshots
Inferring cellular trajectories from destructive snapshots is complicated by the challenges of stochasticity and non-conservative mass dyna…
Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction
Clinical named entity recognition from dental progress notes is challenging because documentation is highly unstructured, domain-specific,…
Towards an Inferentialist Account of Information Through Proof-theoretic Semantics
Information is one of the most widely-discussed concepts of the current era. However, a great deal of insightful work notwithstanding, it i…
CredibleDFGO: Differentiable Factor Graph Optimization with Credibility Supervision
Global navigation satellite system (GNSS) positioning is widely used for urban navigation, but the covariance reported by the GNSS solver i…
Litespark Inference For CPUs: Ultra-Fast SIMD Framework for Ternary (1.58-bit) Language Models
Large language models (LLMs) have transformed artificial intelligence, but their computational requirements remain prohibitive for most use…
Engineering Robustness into Personal Agents with the AI Workflow Store
The dominant paradigm for AI agents is an "on-the-fly" loop in which agents synthesize plans and execute actions within seconds or minutes…
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models…
Weakly Supervised Segmentation as Semantic-Based Regularization
Weakly supervised semantic segmentation (WSSS) trains dense pixel-level segmentation models from partial or coarse annotations such as boun…
CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing
Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct/Thinking checkpoints…
TAPIOCA: Why Task-Aware Pruning Improves OOD model Capability
Recent work has promoted task-aware layer pruning as a way to improve model performance on particular tasks, as shown by TALE. In this pape…
ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models
Multimodal large language models (MLLMs) may memorize sensitive cross-modal information during pretraining, making machine unlearning (MU)…
Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks
We evaluate whether frontier LLMs are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detecti…
When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control
A properly calibrated rule-based autoscaler can beat every one of six mainstream deep reinforcement learning (DRL) algorithms on cost acros…
Models That Know How Evaluations Are Designed Score Safer
The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has ide…
GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human
With the rapid advancement of large language models, evaluating human-likeness in open-ended conversation has become increasingly important…
Brain-IT-VQA: From Brain Signals to Answers
Decoding visual content from fMRI signals recorded while a person views images, and specifically answering questions about the seen images,…
Geometric Erasure by Contrastive Velocity Matching in Rectified Flows
While the rapid adoption of multimodal generative models offers immense potential, it has also increased the risks of harmful content synth…
Anomalies in Multivariate Time Series Benchmarks Are Mostly Univariate
Many recent multivariate time series anomaly detection (MTSAD) models incorporate cross-channel modeling, under the implicit assumption tha…
Libra: Efficient Resource Management for Agentic RL Post-Training
Reinforcement learning (RL) has emerged as a standard post-training paradigm for shaping large language models (LLMs) into capable agents.…
BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language
We present BaltiVoice, a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language spoken in Gilgit-Baltistan, Pakistan,…
EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms
Cloud LLM fine-tuning platforms increasingly serve RLHF workloads, where a learned reward model is optimized as a proxy for human quality.…
Conformal Risk-Averse Decision Making with Action Conditional Guarantee
Reliable decision making pipelines powered by machine learning models require uncertainty quantification (UQ) methods that come with explic…
Agentic Software: How AI Agents Are Restructuring the Software Paradigm
For over half a century, software engineering has operated on a foundational premise: human engineers decompose problems, encode decision l…
Anthropic’s Dario Amodei has just one direct report
If you doubted his genius, doubt no more.
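Google releases diffusion-based text generation model "DiffusionGemma": over 1,000 tokens per second on a local GPU
Google has announced DiffusionGemma, an experimental AI model that speeds up text generation by up to 4x. Applying the diffusion technique used in image generation, it generates 256 tokens in parallel at once, removing the bottleneck of conventional autoregressive models. Although its quality falls short of the standard models, fast local…
Official one-chorus release turned into an unauthorized full-length version with AI and spread online; Yuiko Ohara's "Mushoku Tensei III" opening theme affected
Someone took an officially released audio clip containing only one chorus, used generative AI to extend it into a full-length song without permission, and published it with the artist's own credit attached. This malicious practice has now come to light.
The "convincing reasons" China leads the humanoid robot development race, and what comeback scenario remains for Japan
How should Japan compete in a humanoid development race led by the US and China? Based on talks by McKinsey and the Ministry of Economy, Trade and Industry at Humanoids Summit Tokyo 2026, this article explains Japan's strategy for becoming a third pole after the US and China.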
How an astrophysicist uses Codex to help simulate black holes
Discover how astrophysicist Chi-kwan Chan uses Codex to build black hole simulations, helping scientists study extreme physics and test Ein…
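"DX Stocks 2026" (DX銘柄2026) case study report published, featuring AI use cases from 51 companies
IPA has published a report compiling the DX case studies of companies selected as DX Stocks 2026. Besides the case studies, including those of the three Grand Prix companies, it also presents the results of a survey of 289 companies listed on the Tokyo Stock Exchange.
"ChatGPT connectors can hook in, so do we even need M365 Copilot?" We asked three experts how "Work IQ" context management works and why it pays to know
If you connect M365 data to another vendor's generative AI through a connector, is Copilot unnecessary? What separates the two is Work IQ, which manages the context being referenced. Three Microsoft MVPs explain its three-layer structure and what users should do to get the most out of it.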
xAI fired an engineer who raised alarms about Grok safety, new lawsuit claims
A former xAI engineer is suing the company and SpaceX, alleging he was fired for raising AI safety concerns about Grok days before SpaceX's…
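"Digitizing everything" breaks organizations: leadership in the AI era, according to the person who named the "GIGA School" program
Generative AI can dramatically streamline work, but handled badly it destroys organizational engagement. EdLog president Nakagawa (中川哲), a former executive officer at Microsoft Japan, delivered DX that cuts test grading time by up to 80%, yet declares he will "never introduce automatic grading of written-answer questions." On DX in form only, he…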
Fresh off bond sale, Amazon borrows $17.5B from banks as AI spending continues
Companies are burning through exorbitant sums of money to keep pace in the AI arms race. Debt is climbing.
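Can you control the Codex app on Windows from a smartphone? How to keep AI coding going while you are out
OpenAI's Codex app now lets you check on and direct development work on Windows from a smartphone. Work is less likely to stall even if you step away from the PC during AI coding. This article introduces a feature that is quite welcome in practice.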
Access OpenAI models and Codex through your Oracle cloud commitment
Access OpenAI models and Codex through Oracle Cloud, using existing commitments to build and deploy AI with enterprise security and governa…
‘AI-pilled’ firms spend $7,500 per employee each month on AI
The most AI-obsessed firms are spending roughly $7,500 monthly per employee on AI, per Ramp AI Index. That's not more than an engineer's sa…
How memory tools can make AI models worse
New research suggests that AI memory systems can degrade model performance and encourage sycophantic tendencies.
Cybersecurity researchers aren’t happy about the guardrails on Anthropic’s Fable
Cybersecurity researchers are complaining that Anthropic's new model Fable has guardrails that are too strict for any cybersecurity work.
Datadog veterans launch AI coding startup Niteshift on a bet against Big AI lock-in
AI coding agent startup Niteshift has raised a $7 million seed round from a who's who of angels. It's betting companies will want power ove…
2026-06-10 (411 items)
The three hard-tech moonshots fueling SpaceX’s unbelievable IPO
Most of the value in SpaceX's IPO is effectively a call option on the company's ambitious space data center plans.
Warner Music acquires AI attribution startup Sureel AI
Through the acquisition, WMG aims to better track when its artists' work is used in AI-generated content or for training AI models.
Jedify raises $24M to help companies arm AI agents with context on their business
The funding round was led by Norwest, with participation from S Capital VC, Cerca Partners, and Oceans Ventures. Snowflake Ventures also pa…
Decart’s new world model can simulate hours of photorealistic driving — with some caveats
Decart is launching Oasis 3, a real-time world model that generates photorealistic driving environments for autonomous vehicle testing, now…
PRC-linked influence operations are targeting AI debates in the US
A new report from OpenAI details PRC-linked influence operations using AI to target U.S. tech debates, data center narratives, tariffs, and…
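ChatGPT to show ads on the Free and Go plans; policy update on June 22
On June 10, US-based OpenAI revised its privacy policy to add provisions on advertising in ChatGPT. The Free and Go plans are covered.
Do AI agents fall for phishing scams too? A US security firm tested it with OpenClaw, and the result was...
AI agents are a hot topic these days. Some people are trying to streamline their work by letting agents that run in a local environment operate their PCs. But if an AI agent fell for a phishing scam, the consequences could be serious. US security firm Varonis announced on June 9 (local time)…
Apple's "Siri AI" may be "unavailable" on more than 1.3 billion devices: the "weakness" holding back the rollout of new features
In a research report, US-based Morgan Stanley pointed to a weakness that holds back the expansion of the new features in Apple's newly announced Siri AI.
The misconception that the evolved "Siri AI" is "just Gemini": the full picture of the "reborn Apple Intelligence" as seen through on-site reporting
The idea that "Gemini is what Apple Intelligence really is" is a misconception. The third generation that came into view through on-site reporting at WWDC 2026 ranges from innovative technology that runs a 20-billion-parameter AI on the iPhone, to an infrastructure overhaul built on Google Cloud and NVIDIA, to a quietly shifting definition of "free"…
As generative AI rises, management consultancy bankruptcies and closures are on a record pace; "subsidy dependence" reaches its limit
"Firms that cannot differentiate themselves through expertise and cannot move away from labor-intensive, program-dependent business will be unable to withstand the downward pressure from the rise of generative AI, and the shakeout will accelerate further."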
Meta signs first AI data center deal in India with Reliance
The 168-megawatt facility will support Meta's global AI computing needs and can be expanded over time.
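Prepare for "what comes after Claude Fable 5": Anthropic holds an event in Tokyo, and the head of Claude shares three guidelines for developers
Anthropic held its developer event Code with Claude in Tokyo. With the new model Claude Fable 5, which became generally available the same day, in mind, speakers discussed guidelines for developing services that embed high-performance AI.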
Business World Model
Businesses are increasingly adopting AI-enabled tools to improve productivity, reduce costs, and enhance products and services. However, th…
Deployment-Time Memorization in Foundation-Model Agents
Foundation-model agents are increasingly long-lived systems that remember users across interactions, making memorization an explicit deploy…
Exploratory Responsiveness and Adaptive Rigidity under AI-Assisted Optimization
This paper develops a theory of exploratory adaptation under AI-assisted optimization. The central argument is that the long-run adaptive e…
Predictive Assistance and the Temporal Dynamics of Exploratory Compression
Classical theories of cognition describe problem solving as exploratory search through structured problem spaces in which repeated interact…
From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs
Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to sha…
Less Context, Better Agents: Efficient Context Engineering for Long-Horizon Tool-Using LLM Agents
Large language models deployed as autonomous agents for enterprise workflows face a key challenge: verbose tool responses from enterprise s…
Minimalist Genetic Programming
Genetic programming (GP) is based on two important insights. First, that any learning task can fundamentally be posed as a program inductio…
Regimes: An Auditable, Held-Out-Gated Improvement Loop Demonstrated on LongMemEval with ActiveGraph
Autonomous improvement loops are hard to trust because the improvement process is usually external scaffolding bolted onto the agent: failu…
RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning
While Large Language Models (LLMs) have achieved near-perfect performance in solving high-school mathematics, their ability to …
Supervised Fine-tuning with Synthetic Rationale Data Hurts Real-World Disease Prediction
Supervised fine-tuning with synthetic rationale data is widely assumed to improve language model performance on clinical prediction tasks b…
Sim2Schedule: A Simulator-Guided LLM Framework for Autonomous Open-Pit Mine Scheduling
Open-pit mine scheduling is a critical process for maximizing economic return under complex geotechnical and operational constraints. While…
From Context-Aware to Conflict-Aware: Generalizing Contrastive Decoding for Knowledge Conflict in LLMs
When large language models generate from retrieved or augmented contexts, conflicts between external context and parametric priors remain a…
What Spatial Memory Must Store: Occlusion as the Test for Language-Agent Memory
Language-agent "memory palace" systems anchor each memory to a world coordinate, on the intuition that geometry adds something text cannot.…
Mobility Anomaly Generation using LLM-Driven Behavior with Kinematic Constraints
Although the study of human trajectory anomalies is critical for advancing spatial data mining, empirical research remains severely hindere…
Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts
Code-generating large language models (LLMs) increasingly produce visual artifacts such as charts, web pages, and slides by writing program…
Reasoning or Memorization? Direction-Aware Diversity Exploration in LLM Reinforcement Learning
Reinforcement learning has become a key paradigm for eliciting reasoning abilities in large language models, where exploration is crucial f…
ReflectiChain: Epistemic Grounding in LLM-Driven World Models for Supply Chain Resilience
AI agents in supply chains face a fundamental epistemic gap: large language models (LLMs) interpret policies but lack physical grounding, w…
Belief-Space Control for Personalized Cancer Treatment via Active Inference
Cancer treatment is at its core a sequential decision-making problem with partial observability, latent patient heterogeneity, and explicit…
Beyond Static Evaluation: Co-Evolutionary Mechanisms for LLM-Driven Strategy Evolution in Adversarial Games
Recent advances in LLM-driven code evolution have enabled automated discovery by iteratively generating and improving programs. However, ap…
Instruction Finetuning DeepSeek-R1-8B Model Using LoRA and NEFTune
Financial named-entity recognition (NER) is essential for translating unstructured financial reports and news into structured knowledge gra…
STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios
Large language models are increasingly used to power personal agents for everyday applications, but evaluating these agents remains a chall…
A Unified Multi-Modal Framework for Intelligent Financial Systems: Integrating Reinforcement Learning, High-Frequency Trading, and Game-Theoretic Approaches with Cross-Modal Sentiment Analysis
The rapid evolution of financial technology demands sophisticated artificial intelligence systems capable of handling diverse challenges ac…
Soul Computing: A Theoretical Framework and Technical Architecture for Intelligent Agents with Independent Consciousness
Breakthroughs in large language models and multimodal generation technologies have propelled the digital reconstruction of human mental tra…
Trace2Policy: From Expert Behavior Traces to Self-Evolving Decision Agents
Decision rules that enterprise experts apply tacitly -- in auditing, compliance, and contract review -- can be systematically recovered and…
ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics
Combinatorics is central to Olympiad-level mathematical problem solving, requiring deep discrete reasoning, creative constructions, and rig…
A complementary study on PlanGPT: Evaluation with defined Performance Metrics and comparison with a planner
Automated Planning is a subfield of Artificial Intelligence (AI) where the main objective is generating a sequence of actions, known as a p…
A Reliable Fault Diagnosis Method Based on Belief Rule Base Consider Robustness Analysis
In equipment operation, the implementation of fault diagnosis is essential to ensure the continuity and safety of production equipment, imp…
Cross-Modal Knowledge Distillation without Paired Data: Theoretical Foundation and Algorithm
Cross-modal knowledge distillation (CMKD) studies how a (large) teacher model trained on one type of data (e.g., images) can guide a (small…
HIPIF: Hierarchical Planning and Information Folding for Long-Horizon LLM Agent Learning
While Large Language Models (LLMs) have demonstrated strong capabilities as autonomous agents across a wide range of tasks, their performan…
ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning
Memory is essential for enabling large language model (LLM) agents to handle long-horizon reasoning tasks. Existing memory mechanisms are l…
One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA
External memory effectively grounds large language models (LLMs) and vision-language models (VLMs)-based question answering (QA) in relevan…
Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents
Long-horizon language agents accumulate observations, reasoning traces, and retrieved facts that exceed their finite context windows, makin…
Infini Memory: Maintainable Topic Documents for Long-Term LLM Agent Memory
Long-term LLM agents need persistent memory that can track changing facts and provide relevant evidence across sessions. Existing memory sy…
When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models
Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in…
The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment
As AI systems built from multiple language-model agents become more common, they are increasingly used to make decisions together: discussi…
AutoPDE: Reliable Agentic PDE Solving via Explicitly Represented Solver Strategies
Numerical solvers for partial differential equations (PDEs) are core computational tools in science and engineering. Building reliable PDE…
Accelerating NeurASP with vectorization and caching
Neurosymbolic AI combines neural networks with symbolic programs to create robust and explainable predictions. One such framework is NeurAS…
READER: Robust Evidence-based Authorship Decoding via Extracted Representations
As agentic applications increasingly route user tasks through official and third-party LLM APIs, provenance becomes an operational question…
Evaluating Research-Level Math Proofs via Strict Step-Level Verification
Large Language Models (LLMs) struggle to rigorously verify complex mathematical proofs. Standard global evaluation approaches suffer from "…
Moonshine: An Autonomous Mathematical Research Agent Centered on Conjecture Generation
Moonshine is an autonomous agent whose central objective is to generate mathematical conjectures. Its core capability is to extract structu…
Do VLMs Reason Like Engineers? A Benchmark and a Stage-wise Evaluation
Vision-Language Models (VLMs) demonstrate strong performance on general multimodal reasoning benchmarks, yet their ability to perform engin…
Large-scale semantic mapping of learner agency and autonomy reveals what measurement and generative AI research overlook
Learner agency and autonomy are foundational to personal development, yet a pervasive "jingle-jangle" fallacy (i.e. identical terms denotin…
Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution
Although Large Language Model (LLM) agents have demonstrated strong performance on complex tasks, their learning is often limited by ineffi…
Frontier Coding Agents Use Metaprogramming to Adapt to Unfamiliar Programming Languages
LLM-based coding agents are usually evaluated in familiar software settings: mainstream languages, common libraries, and public repositorie…
WorldKernel: A World Model is the Coupling Kernel of Admissible Possible Worlds
A common assumption holds that enough observational and interventional data, given to a strong enough predictor, suffices. We report a fail…
Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models
Persistent memory systems promise to make LLMs more helpful by storing user beliefs over time. We show they also make models less correct b…
Architect-Ant: Editable Automatic Furnishing of Architectural Floor Plans
Furnished floor plans are fundamental to real estate visualization, interior design, and architectural workflows. However, progress in auto…
Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?
The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, profess…
Bellman-Taylor Score Decoding for Markov Decision Processes with State-Dependent Feasible Action Sets
Many Markov decision processes (MDPs) in operations research have feasible actions that are state dependent and defined implicitly by vario…
Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning
Large language model unlearning aims to suppress designated undesirable knowledge while preserving benign capabilities. Many unlearning obj…
Structure from Reasoning, Numbers from Search: On-Premise Open LLMs as Structural Priors for Coupled MIMO Controller Tuning
Tuning controllers for strongly coupled multi-input multi-output (MIMO) industrial processes is hard: decentralized classical auto-tuning i…
Superficial Beliefs in LLM Decision-Making
We ask whether large language models (LLMs) merely imitate rationales when choosing between two options, or whether their choices reflect a…
Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields
Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benc…
What Fits (Into Few Tokens) Doesn't Overfit: Compression and Generalization in ML Research Agents
Reusing a held-out benchmark adaptively should, in principle, invite overfitting. Yet benchmark-driven machine learning (ML) has produced s…
CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs
AI control protocols oversee untrusted models by monitoring their actions and modifying potentially unsafe steps, often using a trusted mod…
A History-Aware Visually Grounded Critic for Computer Use Agents
Various test-time interventions for Computer Use Agents (CUAs), including critic models, have been developed to improve performance through…
Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football
We recast pass evaluation in football (soccer) as a Monte Carlo Tree Search (MCTS)-like evaluation problem whose components mostly exist in…
ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity
Large language models (LLMs) are rapidly acquiring capabilities relevant to biological research, from literature synthesis to interpretatio…
ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models
Long chain-of-thought (CoT) trajectories in large language model (LLM) reasoning cause severe inference bottlenecks due to rapid key-value…
The Role of Feedback Alignment in Self-Distillation
Conditioning a language model on additional context, such as feedback on a previous attempt, typically improves its response. Self-distilla…
More Human or More AI? Visualizing Human-AI Collaboration Disclosures in Journalistic News Production
Within journalistic editorial processes, disclosing AI usage is currently limited to simplistic labels, which misses the nuance of how huma…
Culturally-Aware AI for Cross-Boundary Community Learning: Undergraduate Innovation at the Intersection of Computation and Design
Research on artificial intelligence in education (AIED) is rapidly expanding, yet technical progress often lacks human-centered grounding a…
AI-Driven Analytics of Team-Teaching Talk: Acoustic Patterns across Experience, Cohorts and the Learning Design
As classroom cohorts expand, team teaching is increasingly used to integrate the expertise and pedagogical perspectives of multiple teacher…
Agentic Social Affordance Framework (ASAF): Agent Identity Design as a Collaboration Interface in Multi-Agent Systems
As AI systems evolve from single conversational agents to complex multi-agent architectures, a critical design dimension has been overlooke…
CollabSkill: Evaluating Human-Agent Collaboration On Real-World Tasks
AI agents are reshaping the workspace, leading to drastic changes in how humans work. Despite the considerable potential of human-agent coll…
Self-EmoQ: Plutchik-Guided Value-based Planning to Drive Streaming Emotional TTS
Emotional interaction is increasingly crucial for conversational AI, yet current systems lack a self-emotion determination mechanism to dri…
Aesthetic Perspectives in Information Systems Research: A Hermeneutic Analysis
How might implicit aesthetic perspectives shape what Information Systems (IS) scholarship recognises as worthy of study (or not)? In this h…
Integrated Real-Time Motion Tracking and AI Analysis for Athletic Performance Optimization
Applying Human Pose Estimation (HPE) in real-world environments remains a challenging task. This paper explores and surveys real-time HPE a…
An LLM-Native Psychometric Instrument Does Not Predict LLM Behavior: Evidence Across 25 Models
Large language models (LLMs) produce stable self-reports on personality inventories, but these self-reports do not predict observed behavio…
The Interlocutor Effect: Why LLMs Leak More Personal Data to Agents Than Humans
Large Language Models (LLMs) alter their privacy behavior based on the perceived identity of their interlocutor. While safety mechanisms ty…
CANVAS: Captioning Art with Narrative Visual-Audio AI Systems
Visual art remains largely inaccessible to blind and low-vision (BLV) audiences due to brief or absent alt-text, which rarely conveys the s…
Human-AI Coordination Zones: A Framework for Designing Human-in-the-Loop Experiences with Agentic AI
As generative and agentic AI becomes embedded in everyday products, practitioners face a persistent challenge: how to design human-AI coord…
LLM-Based Code Documentation Generation and Multi-Judge Evaluation
High-quality source code documentation is vital yet often neglected, especially in critical domains like healthcare where reliability and m…
Can Multi-Agent LLMs Identify Their Peers? Stylometric Fingerprinting in Role-Constrained Political Analysis
Multi-agent large language model (LLM) pipelines for political statement analysis are vulnerable to peer-preservation bias: models tend to…
Using Probabilistic Programs to Train Inductive Reasoning in Large Language Models
Post-training Large Language Models (LLMs) for reasoning typically focuses on deductive tasks such as mathematics and coding where correctn…
Support sufficiency as action-sufficient compression: a single-cycle rate-regret formulation
Robust decision-making requires compression. A system that forms a rich support state cannot usually preserve its full structure at the poi…
Mitigating Manifold Departure: Uncertainty-Aware Subspace Rectification for Trustworthy MLLM Decoding
MLLMs frequently hallucinate objects inconsistent with visual inputs. This issue is typically attributed to the over-reliance on language p…
Conformal Risk Prediction for Non-Alcoholic Fatty Liver Disease Using Gradient Boosting with Distribution-Free Coverages
Non-alcoholic fatty liver disease (NAFLD) affects roughly 25% of global adults, posing substantial hepatic and cardiovascular risks. Yet, p…
Time Series as Language: A Universal Tokenizer for General-Purpose Time Series Foundation Models
While Next-Token Prediction (NTP) has unified LLM pretraining, its adaptation to unbounded, continuous time series (TS) remains open. To br…
Blurry Window Attention
The Softmax Attention operation in Transformer language models has a quadratic complexity in the sequence length and a growing state size i…
Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation
Key-value (KV) cache quantization is widely used to reduce Large Language Model (LLM) inference memory, yet existing evaluations solely foc…
Two to Tango: Coupled Task-Reference Selection for Safe LLM Fine-tuning
Fine-tuning safety aligned large language models (LLMs) on downstream data improves adaptation but may erode learned safety behavior. Exist…
EstRTL: Functional Estimation Guided RTL Code Generation
Optimizing register transfer level (RTL) code is of vital importance in hardware design. Large language models (LLMs) provide new methods f…
SPACE: Source-free Proxy Anchor Concept Erasure for MLLMs
As Multimodal Large Language Models (MLLMs) face growing privacy risks and regulatory constraints, machine unlearning (MU) has emerged as a…
QSplitFL: Capability Aware Deep Q-Learning for Optimal Split Point Selection in Split Federated Learning
Federated Learning (FL) combined with Split Learning (SL) is a privacy preserving paradigm that enables training deep neural networks (DNNs…
SD-GRPO: Verifiable Segment Decomposition for Long-Form Vision-Language Generation
Group Relative Policy Optimization (GRPO) and its variants, originally developed for Large Language Models (LLMs), have recently been appli…
PatchSTG: Scalable Spatiotemporal Graph Transformers for Traffic Forecasting on Irregular Sensor Networks
Traffic forecasting is a fundamental component of intelligent transportation systems, yet remains challenging in real-world settings due to…
Rotate2Think: Geometric Priming via Orthogonal Rotation to Improve Language Model Reasoning
Reasoning models achieve strong performance on challenging tasks by generating explicit intermediate reasoning traces before producing a fi…
Integrating Local and Global Entropy for Uncertainty Quantification in LLMs
Large language models hallucinate confidently, making uncertainty quantification (UQ) essential for reliable deployment. Existing methods r…
TD-Grokking: Learning from Zero-Reward Problems by Training-Time Decomposition
Large language models (LLMs) have made remarkable progress in reasoning tasks, largely driven by post-training paradigms, especially reinfo…
Failure Modes of Deep Multi-Agent RL in Asynchronous Pricing: Reproducible Triggers, Trace Diagnostics, and a Partial Fix
We study two reproducible failure modes of deep multi-agent reinforcement learning in continuous-time pricing markets: (i) tacit cartel for…
SHAPE: Coalition-Aware Expert Pruning for Sparse Mixture-of-Experts LLMs
Sparse Mixture-of-Experts (MoE) large language models achieve strong quality with low per-token compute, yet their deployment is often limi…
SocraticPO: Policy Optimization via Interactive Guidance
Reinforcement learning (RL) for large language models usually supervises reasoning with scalar outcome rewards, such as binary correctness.…
PreAct-Bench: Benchmarking Predictive Monitoring in LLMs
Large language models (LLMs) are increasingly deployed as autonomous agents capable of executing multi-step action trajectories toward a gi…
Representation Curriculum: Stagewise Training for Robust Ranking and Allocation
Ranking in digital marketplaces is a dynamic exposure-allocation mechanism: displayed items shape discovery trajectories and success events…
Tractogram foundation model
Diffusion MRI (dMRI) tractography is the only noninvasive approach for mapping white-matter pathways in the living human brain. It represen…
HMAF: A Hierarchical Multi-Slot GD-RTB Allocation Framework
In modern online advertising platforms, Guaranteed Delivery (GD) contracts coexist and bid with Real-Time Bidding (RTB) auctions. Recent ap…
When Attribution Patching Lies: Diagnosis and a Second-Order Correction
A central goal of mechanistic interpretability is to identify which internal components causally drive a language model's behavior. Because…
Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History
Long-term memory is the missing layer for LLM agents: across sessions they forget, and the common workaround -- replaying the whole history…
The Whale That Outswam Evolution: Swarm Intelligence Maximises Memory in Connectome Reservoirs
Reservoir computing exploits the fixed dynamics of a recurrent network for temporal processing, requiring only a trained linear readout. Bi…
LongMoE: Longitudinal Multimodal Learning via Trajectory-Aware Mixture-of-Experts
Multimodal clinical learning is increasingly important for integrating diverse patient data, including imaging, text, and personalised heal…
IDP-Bench: Benchmarking ability of LLMs to protect personal information in interdependent privacy contexts
Large language models (LLMs) are becoming widely deployed as personal AI assistants with access to sensitive user data, making privacy a ma…
Bypassing Copyright Protection in Diffusion-based Customization via Two-Stage Latent Feature Optimization
With the growing concerns over copyright infringement in diffusion-based customization, adversarial attacks have emerged as a prominent def…
Mix, Don't Pick: Why Synthetic Corpus Composition Matters for Time Series Foundation Model Pretraining
Choosing the wrong synthetic generator for time-series foundation model pretraining is costly: under identical training budgets, the best a…
IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference
Multi-turn LLM agents fan short queries into long trajectories of tool calls, search results, and intermediate reasoning. Both KV memory an…
Co-GLANCE: Uncertainty-Aware Active Perception for Heterogeneous Robot Teaming
Perceptual uncertainty is a central challenge for heterogeneous robot teams operating in unstructured outdoor environments, where no single…
The Bioelectrical Information Theory: Investigating the theoretical compression limit of bioelectrical signals under artificial intelligence
Bioelectrical signals are increasingly acquired at scales that challenge the bandwidth of brain-computer interfaces. However, their compres…
Conformal Prediction for Neural Operators: Distribution-Free Uncertainty Quantification in Physics Simulation
Neural operators such as the Fourier Neural Operator (FNO) have emerged as powerful surrogates for solving partial differential equations (…
Sigma-Branch: Hierarchical Single-Path Network Reconstruction for Dynamic Inference with Reduced Active Parameters
Deploying deep neural networks on memory-constrained edge accelerators is bottlenecked by per-inference off-chip weight transfer rather tha…
Sample Where You Struggle: Sharpening Base Model Reasoning via Entropy-Guided Power Sampling
Sampling from the sequence-level power distribution $p^\alpha$ elicits RL-level reasoning from base language models without any parameter u…
Trainable Smooth-Rotation Transforms with Learned Channel Scales for LLM Quantization
Post-training quantization (PTQ) is one of the most practical ways to reduce the serving cost of Large Language Models (LLMs), but activati…
Forward-Only Convolutional Neural Networks with Learnable Channel-Class Assignment
The Forward-Forward (FF) algorithm offers a biologically inspired alternative to backpropagation by replacing gradient-based credit assignm…
Between Amnesia and Chaos: A Memory Stability Expressivity Trilemma for Trainable Dissipative Oscillator Networks
Physical reservoir computing harnesses nonlinear mechanical dynamics but, by convention, freezes the substrate and trains only a linear rea…
A Note on the Strategic Confinement Problem
Lampson's confinement problem asks how to prevent a program that processes confidential information from leaking it to a third party. We in…
When RL Fails after SFT: Rejuvenating Model Plasticity for Robust SFT-to-RL Handoff
Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has become a standard pipeline for Large Language Model (LLM) post-tra…
GitInject: Real-World Prompt Injection Attacks in AI-Powered CI/CD Pipelines
AI-powered agents are increasingly embedded in continuous integration and continuous delivery/deployment (CI/CD) pipelines to autonomously…
One Lens, Many Worlds: A Capability-Typed Interface for World-Model Interpretability
World models are now built on substantially different computational substrates. Latent recurrent state-space models such as PlaNet and the…
RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference
We introduce RKSC (Reasoning-Aware KV Cache Sharing), a training-free inference framework that eliminates two structural redundancies in mu…
Interactions Between Crosscoder Features: A Compact Proofs Perspective
Dictionary learning methods like Sparse Autoencoders (SAEs) and crosscoders attempt to explain a model by decomposing its activations into…
Anomaly Detection and Root Cause Analysis for Microservice Systems
Microservice systems are widely used to build cloud applications, yet their complexity makes failures inevitable, degrading user experience…
GAGI: A Gini-Adjusted GDP-per-Capita Index for Distribution-Aware Macroeconomic Welfare Monitoring
GDP per capita is the default lens through which governing bodies track the economic prosperity and consequences of economic events, yet…
Learning Where to Simulate: Generative Active Sampling for Online PDE Surrogate Training
Data-driven PDE surrogates are trained with data produced by numerical PDE solvers. However, when the surrogate's goal is to generalize acr…
Deep Slice Interpolation for Reducing Through-Plane Anisotropy and Noise in Head CT
Head computed tomography (CT) typically uses sub-millimeter in-plane resolution but 2-5 mm through-plane spacing, creating substantial anis…
Does Normalization Choice Matter for Causal Large Time-Series Models?
Large models for time-series forecasting have emerged as a promising paradigm for training models on heterogeneous collections of sign…
Uncertainty-Aware Motion Planning for Autonomous Driving in Mixed Traffic Environment
In mixed-traffic environments where autonomous and human-driven vehicles may co-exist, motion planning for autonomous vehicles requires ant…
Temporal Context Conditioning for Seasonality-Aware Precipitation Nowcasting of High-Intensity Rainfall
Precipitation nowcasting is increasingly being approached with deep learning models that learn directly from recent radar observations. Alt…
HydraCIL: Decoupled Class-Incremental Learning through Prototype-Guided Multi-Head Classifiers
We present HydraCIL, a decoupled continual learning model based on prototype-guided multi-head classifiers, targeting sustainable deploymen…
3SPO: State-Score-Supervised Policy Optimization for LLM Agents
Training large language models (LLMs) as autonomous agents via reinforcement learning (RL) has enabled frontier models to achieve superhuma…
Optimality of FSQ Tokens for Continuous Diffusion for Categorical Data with Application to Text-to-Speech
Continuous diffusion for categorical data is a framework belonging to the diffusion family and aiming at generating discrete data. The scie…
Geometry-Aware Anisotropic Boundary Correction for Aerodynamic Simulation
Aerodynamic simulation is a key component of engineering shape design, where core quantities such as the surface pressure coefficient stron…
DeRA-MOS: Optimizing Text-to-Music Evaluation via Decoupled Listwise Ranking and Modality Alignment
Evaluating text-to-music (TTM) systems remains expensive because music impression (MI) and text alignment (TA) scores rely on human mean op…
Generalized-CVO: Fast and Correspondence-Free Local Point Cloud Registration with Second Order Riemannian Optimization
We propose a fast and correspondence-free local point cloud registration method that leverages geometric surface structure and reproducing…
Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders
Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they…
Inside the Latent Flow: Causal Deciphering of Attention Dynamics in Audio Separation Foundation Models
Flow-matching transformers achieve strong audio separation, yet their attention dynamics are opaque. We adapt established causal-interventi…
Bittensor Agent Arenas as a Trajectory Primitive: Distilling a Shopping Agent from ShoppingBench Subnet Traces
Small-model agentic post-training is bottlenecked less by the algorithm than by the trajectory substrate it consumes. Leading recipes (RLVR…
A Controlled Audit of Pretraining Contamination in Public Medical Vision-Language Benchmarks
Medical vision-language models (VLMs) are evaluated on public benchmarks whose images and question-answer pairs have been freely downloadab…
Importance-Aware Scheduling for High-Dimensional Hyperparameter Optimization
Hyperparameter Optimization (HPO) is essential for building high-performing ML/DL models, yet conventional optimizers often struggle in hig…
Temporal Sheaf Neural Networks with Dynamic Orthogonal Transport
We introduce Temporal Sheaf Neural Networks (TSNN), a temporal link prediction framework that equips each node with a time-varying orthogon…
VFUSE: Virulent Feature Understanding with Sparse autoEncoders
Generative models have shown remarkable progress in a variety of domains such as protein design, but such power enables the opaque generati…
Divide-and-Conquer Modeling for the CTF-4-Science Lorenz Benchmark
This work presents a divide-and-conquer modeling strategy for the CTF-4-Science Lorenz benchmark, which evaluates chaotic-system prediction…
A Theory on Flow Matching with Neural Networks
In this work, we develop a theoretical foundation for flow matching with neural-network-parameterized conditional velocity fields. We establi…
Unsupervised Style Representation Learning for AI-Text Detection via Paraphrase Inversion
The rapid development of large language models (LLMs) has raised concerns about misuse such as plagiarism, misinformation, and automated in…
What makes a harness a harness: necessary and sufficient conditions for an agent harness
The term agent harness now circulates widely in software engineering with generative artificial intelligence. It names the layer that wraps…
Duality for Optimal Multi-Item, Multi-Bidder Auction Design: Revenue Certificates through Deep Learning
Characterizing revenue-optimal auctions for multi-item, multi-bidder settings remains a fundamental open problem, with no known closed-form…
Emotion Profiling in LLM-Based Literary Translation: Systematic Shifts Across MT and Post-Editing
This paper investigates whether LLM translations exhibit identifiable emotional profiles and how post-editing reshapes them toward human-li…
MetaPlate: Counterfactual-Guided RAG-LLM Tool for Personalized Food Recommendation and Hyperglycemia Prevention
Postprandial hyperglycemia is a key risk factor for metabolic disorders; however, existing dietary guidance is often static, impractical, a…
FedSteer: Taming Extreme Gradient Staleness in Federated Learning with Corrective Projections and Caching
Federated learning (FL) is often subject to aggregation variance if clients do not consistently participate in training rounds. While reusi…
Pareto-Guided Teacher Alignment for Fair Personalized Text Generation
Personalized persuasive text generation can improve relevance and engagement, but demographic conditioning may also introduce unequal frami…
BiWM: Advancing Open-Source Interactive Video World Models with Bidirectional Autoregression
Transitioning bidirectional video diffusion models into an autoregressive paradigm improves the interactivity of video world models, but ex…
$\tau$-Rec: A Verifiable Benchmark for Agentic Recommender Systems
As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. C…
Gaming AI-Assisted Peer Reviews Poses New Risks to the Scientific Community
AI is increasingly used to support scientific peer review, from manuscript screening and reviewer assistance to editorial triage. Although suc…
Local Is Not a Sufficient Privacy Boundary: Governing OS-Integrated On-Device AI
As AI systems move into operating systems, privacy no longer turns only on whether a model runs locally. A local assistant may assemble ema…
Flow Control: Steering Vision-Language-Action Models with Simple Real-Time Inputs
We introduce flow control of vision-language-action (VLA) models, a simple and effective way to steer VLA actions in real-time through gene…
Making Time Editable in Video Diffusion Transformers
Modern Diffusion Transformers for video generation provide limited control over the progression of time and the editing of temporal dynamic…
Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning
Group Relative Policy Optimization (GRPO) relies on the diversity of $K$ rollouts within each group; otherwise, the group-mean advantage $A…
MMClima: A Framework for Multimodal Climate Science Data and Evaluation
Climate change research increasingly requires AI systems that reason across text, dynamic visual content, and scientific figures, yet exist…
Fisher-Guided Progressive Parameter Selection for Adaptive Fine-Tuning
Parameter-efficient fine-tuning (PEFT) aims to adapt pretrained models with a small trainable parameter subset; however, most existing meth…
Integral Field Unit Spectroscopy with One Fiber
Integral field unit (IFU) spectroscopy provides spatially resolved spectra across galaxies, offering crucial insights into their evolution.…
Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity
Hallucination detection in large language and vision-language models is increasingly framed as selective prediction, where a detector assig…
An Improved Generative Adversarial Network for Micro-Resistivity Imaging Logging Restoration
An improved GAN-based imaging logging image restoration method is presented in this paper for solving the problem of partially missing micr…
Exploration of Foundation Model-Based Robots in Patient and Elderly Care
Demand for older-adult and patient care is growing rapidly as populations age worldwide. Foundation models are increasingly being integrate…
Automated Pronunciation Evaluation for Korean Toddler Speech using Speech Diarization and Self-Supervised Learning
Speech sound disorders affect approximately 44% of Korean pediatric communication disorder cases, yet automated assessment tools for Korean…
A Source Domain is All You Need: Source-Only Cross-OS Transfer Learning for APT Anomaly Detection via Semantic Alignment and Optimal Transport
Advanced Persistent Threats (APTs) are stealthy, multi-stage cyberattacks whose detection is difficult due to scarce labeled traces, severe…
Fast Exact Nearest-Neighbor Learning for High-Frequency Financial Time Series
AI efficiency at scale is becoming critical in finance as market data volumes surge across equities, ETFs, FX, options, and high-frequency…
Dual-Branch Gated Fusion for Open-Set Audio Deepfake Source Tracing
Attributing a synthetic utterance to its originating system remains an open challenge: closed-set models fail to reject unseen synthesizers…
SHAPO: Sharpness-Aware Policy Optimization for Safe Exploration
Safe exploration is a prerequisite for deploying reinforcement learning (RL) agents in safety-critical domains. In this paper, we approach…
Hyperbolic Neural Population Geometry Benefits Computation
Neural population geometry shapes downstream computation. Recent empirical findings in neurobiology suggest that a hyperbolic structure und…
YUBI: Yielding Universal Bidigital Interface for Bimanual Dexterous Manipulation at Scale
We introduce Yielding Universal Bidigital Interface (YUBI), a finger-aligned gripper designed to enable intuitive, ergonomic, and scalable…
Linguistically Augmented Audio Speech Data (LinguAS)
Maliciously-created fake speech, including deepfaked and spoofed audio, is proliferating at an alarming rate, and detection models are raci…
Multi-Level Analyzation of Imbalance to Resolve Non-IID-Ness in Federated Learning
Class imbalance is a common problem in deep learning that severely degrades performance. In federated learning (FL), it is a critical facto…
What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents
Hierarchical vision-language-action (Hi-VLA) systems have emerged as a promising paradigm for complex robot manipulation, by using high-lev…
Hierarchical Policies from Verbal and Egocentric Human Signals for Natural Human-Robot Interaction
For natural human-robot interaction, a robot must understand human intent expressed not only through language but also through nonverbal si…
Towards Robust Arabic Speech Emotion Recognition with Deep Learning
Speech Emotion Recognition (SER) aims to identify a speaker's emotional state from audio signals. While recent advances in deep learning ha…
LLM-Guided Neural Architecture Search for Robust Co-Design of Physical Neural Networks
Deploying neural networks on unconventional hardware demands architectures that co-optimize task accuracy and platform-specific constraints…
The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge
Multi-agent debate systems are typically evaluated only on whether the final answer is correct, overlooking the quality of the intermediate…
Catching One in Five: LLM-as-Judge Blind Spots in Production Multi-Turn Transaction Agents
LLM-as-judge is the default instrument for evaluating conversational agents, yet its reliability is almost always reported as agreement wit…
Baseline-Free Policy Optimization for Neural Combinatorial Optimization
Neural combinatorial optimization (NCO) trains autoregressive policies to solve routing problems. The standard training algorithm, REINFORC…
Content-Induced Spatial-Spectral Aggregation Network for Change Detection in Remote Sensing Images
The integration of spatial and spectral information is beneficial to the improvement of change detection performance. However, existing met…
Building Change Detection in Earthquake: A Multi-Scale Interaction Network and A Change Detection Dataset
As one of the most destructive natural disasters, earthquakes have struck many countries around the world in recent years, causing serious…
Routing-Aware Expert Calibration for Machine Unlearning in Mixture-of-Experts Language Models
Machine unlearning is increasingly important for large language models, yet unlearning in Mixture-of-Experts (MoE) architectures remains un…
Atomic Intent Reasoning: Bringing LLM Semantics to Industrial Cross-Domain Recommendations
Cross-domain recommendation is a core problem in content-to-e-commerce platforms. Its objective is to leverage user interactions with conte…
KG-SoftMAP: Soft Knowledge-Graph Priors for Bayesian Network Structure Learning from Sparse Discrete Data
Learning Bayesian network (BN) structure from sparse discrete data is hard: when each instance records only a few variables, most variable…
A Practical Recipe Towards Improving Sim-and-Real Correlation for VLA Evaluation
Simulation has become an essential tool for evaluating and improving vision-language-action (VLA) policies, offering scalable, reproducible…
Speech Meets ELF: Audio Conditional Continuous-Target Diffusion for Speech Recognition and Translation
Speech-to-text (S2T) systems for recognition (ASR) and translation (S2TT) typically generate discrete text tokens. In contrast, continuous-…
Test-time Adversarial Takeover: A Real-time Hijacking Interface against Robotic Diffusion Policies
Diffusion-based action generation has become a foundational component of embodied AI, but its reliance on visual conditioning leaves deploy…
Expert-Level Crisis Detection in Mental Health Conversations
Real-world crisis intervention is inherently conversational, yet existing research largely focuses on static texts.…
Agentic Hybrid RAG for Evidence-Grounded Muon Collider Analysis
Muon collider research spans accelerator physics, detector instrumentation, and high-energy phenomenology, with relevant evidence scattered…
Towards Critical Branching Mechanism in Recurrent Neural Networks
Criticality has been proposed as a key organizing principle in biological neural systems, yet its origin and relevance in artificial neural…
Beyond Absolute Imitation: Anchored Residual Guidance for Privileged On-Policy Distillation
On-policy distillation (OPD) has demonstrated strong empirical gains in enhancing complex reasoning in LLMs by aligning a student model wit…
SkillResolve-Bench: Measuring and Resolving Same-Capability Ambiguity in Agent Skill Retrieval
Agent skill libraries are becoming routable software assets: a retrieved skill can contribute instructions, scripts, resource bindings, and…
Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries
Scientific discovery is often a collective process: researchers share partial results, inspect failed attempts, and build on each other's i…
FOGO: Forgetting-aware Orthogonalization Optimizer
We argue that forgetting is not confined to continual learning but is a general optimization phenomenon: during standard training, dominant…
Vision-Assisted Foundation Model for Solving Multi-Task Vehicle Routing Problems
Multi-task vehicle routing problems play a critical role in enhancing efficiency across various industries and service sectors. These probl…
Mitigating Bias in Low-SNR Financial Reinforcement Learning via Quantum Representations
The financial market is a typical low signal-to-noise ratio (SNR) setting, which often destabilizes off-policy maximum-entropy methods like…
The Distributed Detectability Band Against Marginal-Preserving Attacks
AI-control monitors score individual agent actions to detect misbehavior, but real harm can be distributed across many benign-looking steps…
Minimum Distortion Quantization with Specified Output Distribution
We derive the optimal quantizer of a real-valued random variable $W$ with distribution $P_W$ such that 1) the distribution of the quantizat…
LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake
Recent large language models (LLMs) have shown rapid progress in reading-based question answering (QA), where evidence is explicitly provid…
ERAlign: Energy-based Representation Alignment of GNNs and LLMs on Text-attributed Graphs
Text-attributed Graphs (TAGs) incorporate textual node attributes with graph structures to describe rich relational semantics. Recent effor…
UPLOTS: A Unified Pretrained Language Model for Constrained Time-series Generation
In time-series generation, existing approaches typically handcraft or train a separate model for each dataset, which hinders their scalabili…
Detecting Speculative Language in Biomedical Texts using Recurrent Neural Tensor Networks
In this investigation, we delve into the automated detection of speculative language within biomedical articles by utilizing distributed se…
Decoupling Thought from Speech: Knowledge-Grounded Counterfactual Reasoning for Resilient Multi-Agent Argumentation
Multi-agent debate frameworks have been shown to improve large language model performance in convergent tasks, but they are currently optim…
Advancing the State-of-the-Art in Empirical Privacy Auditing
Parameter-efficient fine-tuning of large language models (LLMs) can exhibit problematic memorization of individual training examples. Empir…
Stop Early, Spend Less: Hidden-State Probes as a Practical Recipe for Streaming Moderation of LLM Outputs
Deploying large language models in user-facing systems requires efficient output safety filtering. Existing approaches typically rely on a…
Achieving Cloud-Grade SLOs for Local Mixture-of-Experts Inference through CPU-GPU Hybrid Design
Local deployment of large Mixture-of-Experts (MoE) models falls short of the service quality achieved in cloud-scale environments, even und…
MoE Enhanced Federated Learning for Spatiotemporal Prediction
Traffic prediction is fundamental to intelligent transportation systems and urban computing, yet many cities continue to suffer from traffi…
Assessing Automated Prompt Injection Attacks in Agentic Environments
Indirect prompt injection poses a critical threat to LLM agents that interact with untrusted external data, yet automated attack methods--p…
Machine Learning Methods for Studying Latent Neural Activity Dynamics
Recent developments in brain recording are driving a demand for machine learning tools capable of decoding the latent structure of large po…
LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization
Quantization-aware training (QAT) is essential for extremely low-bit large language models (LLMs). Current QAT methods are mainly based on…
Flexible Flows for Biological Sequence Design
Designing functional biological sequences requires navigating vast discrete spaces under strict evolutionary and biophysical constraints. D…
Benchmarking Knowledge Editing using Logical Rules
Large Language Models (LLMs) are increasingly deployed in real-world applications that require access to up-to-date knowledge. However, ret…
Hidden Consensus: Preference-Validity Compression in Human Feedback
Standard RLHF pipelines often reduce heterogeneous human judgments into a single scalar reward target. We argue that this reduction can mis…
Improving Adversarial Transferability on Vision-Language Pre-training Models via Surrogate-Specific Bias Correction
Adversarial examples reveal vulnerabilities in Vision-Language Pre-training (VLP) models and provide insights for improving robustness. A k…
Convergence of Monte Carlo Optimistic Policy Iteration: Beyond Uniform State-Action Updates
The asymptotic behaviour of Monte Carlo optimistic policy iteration (MC-O-PI) is a long-standing open question. When the model of the envir…
Drawing with Strangers: Population Scaling Drives Zero-Shot Mutual Intelligibility in Emergent Sketching
Generalization in emergent communication has largely focused on novel inputs or linguistic structures, yet the capacity for agents to commu…
NOVA: Symbolic Regression Discovery of Interpretable Car-Following and Lane-Change Models with Driver Heterogeneity
We present NOVA, an autonomous symbolic regression framework that identifies interpretable car-following and lane-change structures from ra…
Towards Diverse Scientific Hypothesis Search with Large Language Models
Large language models (LLMs) are on the rise for accelerating scientific discovery, most recently in advanced tasks such as generating vali…
From Data Heterogeneity to Convergence: A Data-Centric Review of Federated Learning
Federated Learning (FL) has emerged as a promising solution for data hunger in centralized learning. This paradigm enables privacy with mul…
Embedding Hybrid Systems into Continuous Latent Vector Fields
This work proves that an $n$-dimensional hybrid system can be embedded into an $m$-dimensional Euclidean space equipped with a continuous v…
Dmsh: A Multi-Agent Reinforcement Learning Framework for All-Quad Mesh Generation
Generating high-quality meshes for arbitrary geometries remains a fundamental bottleneck in computational engineering, often demanding heur…
Causal Ensemble Agent: Hierarchical Causal Discovery with LLM-guided Expert Reweighting
Causal discovery aims to uncover causal structures from observational data, which is crucial for real-world decision-making. However, diffe…
Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Bootstrapped Flow Q-Learning
Diffusion-based Q-learning has emerged as a powerful paradigm for offline reinforcement learning, but its reliance on multi-step denoising…
Can Image Models Imagine Time? ImageTime: A Novel Benchmark for Probing Visual World Modeling Through Spatiotemporal Consistency
Image generation models now produce high-quality static images, yet their ability to represent how a visual world changes over time remains…
STORM: Stepwise Token Optimization with Reward-Guided Beam Search
Modern retrieval increasingly relies on dense and learned-sparse neural models that are effective but require encoding the entire corpus in…
Is Fairness Truly Fair? Towards Reliable Lipschitz Fairness in Multi-Task Learning via Fixed-$\delta$ Alignment
Lipschitz-style individual fairness formalizes the idea that semantically similar examples should receive similar predictions, but its eval…
Dynamic Linear Attention
The scalability of Large Language Models (LLMs) to long contexts is fundamentally constrained by the quadratic complexity of standard atten…
Post-Quantum Secure Federated DeFi for Inclusive Banking
Recent advances in error-corrected qubits have accelerated the timeline for practical quantum computing. It poses a threat to cryptographic…
Accounting for AI Inference in Corporate GHG Inventories: A Four-Tier Methodology for Scope 3 Category 1 Reporting
AI inference services -- API subscriptions, enterprise chat tools, and SaaS products with embedded AI features -- fall unambiguously within…
Decentralized Multi-Agent Systems with Shared Context
Multi-agent systems (MAS) can scale large language model reasoning at test time by decomposing complex problems into parallel subtasks. How…
In Defense of Information Leakage in Concept-based Models
Concept-based models (CMs), deep neural networks that ground their predictions on representations aligned with human-understandable concept…
UniDexTok: A Unified Dexterous Hand Tokenizer from Real Data
Dexterous hands are essential for fine-grained manipulation, but their hardware designs vary substantially across embodiments. Differences…
Divide and Cooperate: Role-Decomposed Multi-Agent LLM Training with Cross-Agent Learning Signals
Modern language agents which perform multi-step reasoning have shown strong performance in knowledge-intensive question answering. However,…
Using the YOLOv12 Model for Verifying the Correct Color Sequence of Wires in Network Cables (Patch Cords) on the Production Line
In the production process of network cables, ensuring the correct color sequence of wire pairs inside the standard connector plays a critic…
Event-Driven Reinforcement Learning Enables Long-Horizon Control in Semiconductor Fabrication
Reinforcement learning promises to optimize sequential decisions in large-scale systems. Semiconductor manufacturing systems are stochastic…
Unifying Data, Memory, and Compute Efficiency in LLM training: A Survey
Resource constraints increasingly determine what can be trained, fine-tuned, and deployed in large language models (LLMs), yet efficiency i…
Effective Reinforcement Learning for Agentic Search by Recycling Zero-Variance Queries During Training
The use of GRPO-style algorithms has become the standard strategy for training LLM search agents under outcome-only rewards. With these alg…
++nnU-Net: Scaling nnU-Net with Prefix-Based Data Augmentation
The nnU-Net has demonstrated continuous success in medical segmentation tasks, which heavily rely on the availability and diversity of anno…
Attention Expansion: Enhancing Keyphrase Extraction from Long Documents with Attention-Augmented Contextualized Embeddings
Pre-trained language models (PLMs) have achieved strong performance in keyphrase extraction (KPE), largely due to their ability to generate…
Transformer Based Model for Spatiotemporal Feature Learning in EEG Emotion Recognition
Electroencephalography (EEG) is a widely adopted technique for monitoring brain activity, offering valuable insights into neurological stat…
Detecting Knowledge Gaps from Conversational AI Interactions Using Curriculum Prerequisite Graphs
Large online courses generate thousands of student questions directed at conversational AI teaching assistants, yet these interaction logs…
Spatial-Omni: Spatial Audio Understanding Integration in Multimodal LLMs via FOA Encoding
Recent multimodal large language models mainly process audio as monaural signals, thereby discarding the spatial cues contained in spatial…
Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation
Large language model (LLM) agents are rapidly moving from conversational interfaces to software components that plan, invoke tools, maintai…
A Bayesian Network Approach for Enhancing Security-Focused Decision Support Systems
The adoption and integration of heterogeneous stacks in most of today's open-source based networks brings clear benefits like interoperabil…
Dep-LLM: Training-Free Depression Diagnosis via Evidence-Guided Structured Multi-factor with Reliable LLM Reasoning
Automatic Depression Detection (ADD) from clinical interviews is a pivotal task in computational mental health, yet it remains challenging…
Boosting ECG Classification Performance by Pre-training with Synthesized Data
Deep Neural Networks (DNNs) typically require extensive datasets for effective training. In the medical domain, acquiring large-scale data…
Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use
Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the "brain" of embodied AI, instructing…
Earth-OneVision: Extending Remote Sensing Multimodal Large Language Models to More Sensor Modalities and Tasks
RS-MLLMs enable natural-language understanding and spatial reasoning over earth observation imagery. However, existing models support only…
K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling
Autoregressive (AR) language modeling is the dominant paradigm for text generation, yet its sequential token-by-token decoding makes infere…
A Unified Siamese Learning Framework for Zero-Day Anomaly Detection and Classification in Optical Networks
A multi-similarity Siamese neural network unifies zero-day anomaly detection and one-shot classification in optical networks, achieving ove…
Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models
Masked diffusion language models can reduce inference steps by revealing multiple tokens per denoising iteration, but this parallelism is f…
Geometrically Averaged Hard Target Updates for Linear Q-Learning
Periodic hard target updates are among the most common stabilization devices in modern deep Q-learning. Recent studies suggest that target…
Janus: A Benchmark for Goal-Conditioned Information Distortion in LLMs
LLM deception is often evaluated through direct markers such as fabricated claims, explicit lies, or strategic concealment. However, many r…
From Perception to Action: Can UI Interventions Foster Sustainable LLM Chatbot
LLM-powered chatbots are increasingly embedded in everyday workflows, raising sustainability concerns due to their energy use. Most mitigat…
LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination
Vision-Language-Action (VLA) models achieve strong performance on standard manipulation benchmarks, but most evaluations assume that task-r…
Optimal Post-Training Quantization Scales and Where to Find Them
Post-training quantization (PTQ) compresses large language models by mapping weights to low-bit representations. The scaling factor that de…
Improving Text-Instance Alignment Of Foreground Conditioned Out-Painting Via Customized Concept Embedding
To showcase products, merchants often incur substantial costs creating high-quality display images. Foreground Conditioned Outpainting (FCO…
Pose-ICL: 3D-Aware In-Context Learning for Pose-Controllable Subject Customization
Subject Customization is a foundational task in modern image generation. By providing a few reference images and a text prompt, users can g…
Human-AI Teaming Through the Lens of Calibration
We study models for human-AI teaming through the lens of statistical calibration. We assume the team consists of an AI model and human -- b…
RAT: Reference-Augmented Training for ASV Anti-Spoofing
We introduce a spoofing countermeasure architecture conditioned on speaker-reference recordings, but observe that it converges to a solutio…
Ethical and Technical Limits of Deepfake Speech Datasets
Claims about the robustness and fairness of deepfake speech detectors are only as credible as the datasets used to train and evaluate those…
What Do Deepfake Speech Detectors Actually Hear?
Deepfake speech detectors often output a single score without explaining why an audio sample is flagged, where in the signal the evidence l…
A Constrained Natural-Language Interface for Variational Multi-Physics Finite Element Simulations in FEniCS
Large language models can reduce the manual effort required to set up finite element simulations, but they introduce reliability risks when…
Recoverable but Not Stationary: Local Linear Structures in Weights and Activations
Task vectors, LoRA, activation steering, and random search around pretrained weights all suggest that learned behaviour can be controlled b…
CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference
Large language model inference is bottlenecked by autoregressive decoding, where each token requires a full forward pass. Multi-token predi…
Provenance Tracking in AI Compilers through the Lens of Coalgebra
AI compilers aggressively rewrite computation graphs through normalization, lowering, and optimization, making it difficult to track the pr…
Democratising Camera Trap AI: An Open-Source Model for Detecting UK Mammals
Camera traps have become a cornerstone of biodiversity monitoring, but the artificial intelligence that turns vast quantities of images int…
Generative Explainability for Next-Generation Networks: LLM-Augmented XAI with Mutual Feature Interactions
As artificial intelligence and machine learning (AI/ML) models become integral to network operations, their lack of transparency poses a si…
Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning
Reinforcement learning with verifiable rewards (RLVR) has become standard for improving LLM reasoning. However, existing PPO-style trust-re…
Optimizing 2D Input Representations and Sub-phase Fusion Strategies for Differential Diagnosis of Asthma and COPD Using CNN- and GRU-Based Networks
This study aims to explore the performance of the VAR model in comparison with mel-frequency cepstral coefficient (MFCC) matrices and log-m…
Understanding and mitigating the risks of OpenClaw for non-technical users: A practical guide with Skill
OpenClaw has rapidly emerged as a transformative artificial intelligence (AI) agent framework, and its ability to autonomously execute comp…
Diffusion Forcing Planner: History-Annealed Planning with Time-Dependent Guidance for Autonomous Driving
Learning-based motion planners, despite recent progress, often suffer from temporal inconsistency. Small perturbations across frames can ac…
AuRA: Internalizing Audio Understanding into LLMs as LoRA
Recent efforts to extend large language models (LLMs) to speech inputs typically rely on cascaded ASR-LLM pipelines, end-to-end speech-lang…
T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains
Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic system…
Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models
With the widespread deployment of Multimodal Large Language Models (MLLMs) in social interaction, understanding and controlling their behav…
Unifying Local Communications and Local Updates for LLM Pretraining
Communication-efficient pre-training of LLMs is increasingly important as training draws on compute distributed across clusters, data cente…
Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning
Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learni…
RoboNaldo: Accurate, Stable and Powerful Humanoid Soccer Shooting via Motion-Guided Curriculum Reinforcement Learning
Elite humanoid soccer shooting requires whole-body stability, high-impulse whole-body interactions, and accuracy to targets. Motion trackin…
PhantomBench: Benchmarking the Non-existential Threat of Language Models
Hallucinations, where language models (LMs) generate factually ungrounded responses, pose serious risks, as users tend to blindly rely on t…
FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model
A global shortage of trained sonographers limits prenatal ultrasound screening in low- and middle-income countries, where over half of preg…
Designed by Journalists, but Is It for Readers? Rethinking AI Disclosures and Transparency in News
As newsrooms integrate generative AI, journalists face a disclosure challenge: how to communicate AI involvement in ways that maintain read…
Towards Autonomous Accelerator Design: FPGA Accelerator Generation with SECDA
Designing FPGA-based accelerators for modern artificial intelligence workloads requires exploring a large and complex hardware design space…
TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large languag…
Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation
Synthetic post-training pipelines commonly filter generated samples with reward models or holistic LLM judges, yet two practices remain rar…
Data assimilation for subsurface flow using latent diffusion model parameterization: performance of ensemble-Kalman and Monte Carlo techniques
Data assimilation (DA) in subsurface flow entails calibrating model parameters to match observed data, typically at wells, while preserving…
Flaws in the LLM Automation Narrative
Large Language Models (LLMs) are increasingly described as performing at the level of human experts on knowledge economy tasks. These claim…
Piper: A Programmable Distributed Training System
Large-scale model training increasingly relies on composing multiple parallelism strategies, such as data, pipeline, and expert parallelism…
EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents
In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, enabling test-time prompt lear…
A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design
Supervised fine-tuning (SFT) typically maximizes the likelihood of every token in a demonstrated trajectory. However, an observed token can…
Belief Acquisition as Stochastic Filtering
This paper studies how belief acquisition can be accomplished using stochastic filtering. First, a theoretical foundation for empirical bel…
A Survey on Semantic Modeling for Building Energy Management
Building Energy Management (BEM) is central to reducing energy use and CO2 emissions in the building sector. Although IoT technologies now…
A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications
With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical.…
Position: The ML Community Must Build an AI-Augmented Peer-Review Ecosystem
Peer review, the bedrock of scientific advancement in machine learning (ML), is strained by a crisis of scale. Exponential growth in manusc…
Why Does Reasoning Length Converge? Unveiling the Underfitting-Overfitting Trade-off in Chain-of-Thought
Test-time scaling, primarily manifested through multi-step Chain-of-Thought (CoT) reasoning via Reinforcement Learning (RL), has emerged as…
Constructing coherent spatial memory in LLM agents through graph rectification
Given a map description through global traversal navigation instructions, an LLM can often infer the implicit spatial layout and answer use…
Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents
Despite rapid progress in building conversational AI agents, robustness is still largely untested. Small shifts in user behavior, such as b…
ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering
Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated c…
How can we assess human-agent interactions? Case studies in software agent design
While benchmarks measure the accuracy of LLM-powered agents, they mostly assume full automation, failing to represent the collaborative nat…
RoboGPT-R1: Enhancing Robot Task Planning with Reinforcement Learning
Improving the reasoning capabilities of embodied agents is crucial for robots to complete complex human instructions in long-view manipulat…
Non-Parametric Structural Priors for Geometry Theorem Prediction
Multi-step theorem prediction is a central challenge in geometry problem solving. Existing neural-symbolic approaches rely heavily on super…
The Price of Agreement: Measuring LLM Sycophancy in Agentic Financial Applications
Given the increased use of LLMs in financial systems today, it becomes important to evaluate the safety and robustness of such systems. One…
Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning
On-policy self-distillation has become a strong recipe for LLM reasoning, where a privileged teacher supervises the student's own rollouts…
Neurosymbolic Learning for Inference-Time Argumentation
Claim verification is an important problem in high-stakes settings, including health and finance. When information underpinning claims is i…
AMEL: Accumulated Message Effects on LLM Judgments
Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items…
A Sober Look at Agentic Misalignment in Automated Workflows
We study a class of emergent misalignment in multi-agent systems (MAS), with a focus on automated workflows, which we refer to as agentic misa…
EVA-Net: Subject-Independent EEG Motor Decoding with Video-Derived Motor Priors
Practical non-invasive Brain-Computer Interface (BCI) systems require EEG decoders with strong cross-subject generalization and minimal cal…
VET: A Framework for Analyzing AI Discourse
Public discourse on AI has become polarized; exaggerated positions on AI in traditional and social media threaten the development of AI Lit…
AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design
Protein language models (PLMs) are passive oracles: they generate sequences in a single forward pass with no mechanism to consult external…
Parthenon Law: A Self-Evolving Legal-Agent Framework
As agents grow more capable, legal-domain LLM agents promise to turn document-heavy matters into reviewable work products -- yet reliable d…
PSEBench: A Controllable and Verifiable Benchmark for Evaluating LLMs in Patient Safety Event Triage
Patient safety event triage, determining whether a clinical event is reportable under jurisdiction-specific policy, is a high-stakes task t…
Robust Deep Reinforcement Learning Through Adversarial Attacks and Training: A Survey
Deep Reinforcement Learning (DRL) is a subfield of machine learning for training autonomous agents that take sequential actions across comp…
Mixtures of Neural Operators Reduce Active Complexity in Operator Learning
Operator-learning systems are not governed solely by total parameter count; for one query, the relevant bottleneck can be the model that mu…
BadRobot: Jailbreaking Embodied LLM Agents in the Physical World
Embodied AI represents systems where AI is integrated into physical entities. Large Language Model (LLM), which exhibits powerful language…
Conditional Vendi Score: Prompt-Aware Diversity Evaluation for Generative AI Models and LLMs
Generative models guided by text prompts are widely evaluated for fidelity and prompt alignment, yet their ability to produce outputs remai…
Visual-TCAV: Concept-based Attribution and Saliency Maps for Post-hoc Explainability in Image Classification
Convolutional Neural Networks (CNNs) have shown remarkable performance in image classification. However, interpreting their predictions is…
Whisper-GPT -- Continuous Discrete Hybrid Representation Language Models For Speech And Music
We propose WHISPER-GPT: A generative large language model (LLM) for speech and music that allows us to work with continuous audio represent…
Dynamics of Adversarial Attacks on Large Language Model-Based Search Engines
The increasing integration of Large Language Model (LLM) based search engines has transformed the landscape of information retrieval. Howev…
Representational Alignment with Chemical Induced Fit for Molecular Relational Learning
Molecular Relational Learning (MRL) is widely applied in natural sciences to predict relationships between molecular pairs by extracting st…
CITRAS: Covariate-Informed Transformer for Time Series Forecasting
In time series forecasting, covariates represent external factors that influence target variables. Some covariates are observable only in t…
NuWa: Deriving Lightweight Class-Specific Vision Transformers for Edge Devices
Vision Transformers (ViTs) often need to be compressed for deployment on resource-constrained edge devices like drones and smart vehicles.…
A Survey of Robotic Navigation and Manipulation with Physics Simulators in the Era of Embodied AI
Navigation and manipulation are core capabilities in Embodied AI, but training agents to perform them directly in the real world is costly,…
CleanPatrick: A Benchmark for Image Data Cleaning
Robust machine learning depends on clean data, yet current image data cleaning benchmarks rely on synthetic noise or narrow human studies,…
Attacks on Machine-Text Detectors Retain Stylistic Fingerprints
Despite considerable progress in the development of machine-text detectors, the ease with which machine-text can be manipulated to evade de…
ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark
Large language models (LLMs) are increasingly applied to symbolic mathematics, yet existing evaluations often conflate pattern memorization…
Fact-Augmented Lookahead Planning for LLM Agents
Large Language Models (LLMs) are increasingly capable, but LLM agents still struggle to plan effectively in interactive, partially observab…
Quantifying Perception-Based Student Success with Generative AI: An Exploratory Monte Carlo Simulation
Generative artificial intelligence (GenAI) tools such as ChatGPT have attracted growing attention in higher education, particularly in rela…
TinyTroupe: An LLM-powered Multiagent Persona Simulation Toolkit
Recent advances in Large Language Models (LLM) have led to a new class of autonomous agents, renewing and expanding interest in the area. L…
GRID: Scaling Task-Agnostic Inference in Continual Prompt Tuning
Prompt-based continual learning (CL) offers a parameter-efficient way to adapt large language models (LLMs) across task sequences. However,…
LLM-Aided Joint Secrecy Precoding and Trajectory for RSMA-Based Heterogeneous UAV Networks
This paper investigates secure communications in rate-splitting multiple access (RSMA) enabled heterogeneous UAV networks, where multiple U…
Assessment of Personality Dimensions Across Situations in Dyadic Role-Play Scenarios
Prior research indicates that users prefer assistive technologies whose personalities align with their own. This has sparked interest in au…
Whisfusion: Parallel ASR Decoding with Masked Diffusion
Autoregressive (AR) encoder-decoder models dominate high-quality multilingual ASR, but their left-to-right decoders make inference latency…
TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning
While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucinatio…
Deep Generative Model for Human Mobility Behavior
Understanding and modeling human mobility is central to challenges in transport planning, sustainable urban design, and public health. Desp…
Learning-Guided Integration Contours Construction for Fast Large-Scale Generalized Eigensolvers
Solving large-scale Generalized Eigenvalue Problems (GEPs) is a fundamental yet computationally prohibitive task in science and engineering…
Lost in Serialization: Invariance and Generalization of LLM Graph Reasoners
While promising, graph reasoners based on Large Language Models (LLMs) lack built-in invariance to symmetries in graph representations. Ope…
On the Condition Number Dependency in Bilevel Optimization
Bilevel optimization minimizes an objective function, defined by an upper-level problem whose feasible region is the solution of a lower-le…
When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models
Reward models are central to Large Language Model (LLM) alignment within the framework of RLHF. The standard objective used in reward model…
Model-Based Diffusion Sampling for Predictive Control in Offline Decision Making
Offline decision-making via diffusion models often produces trajectories that are misaligned with system dynamics, limiting their reliabili…
V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions
While many vision-language models (VLMs) are developed to answer well-defined, straightforward questions with highly specified targets, as…
Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling
Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to distin…
Model-Based Reinforcement Learning in Discrete-Action Non-Markovian Reward Decision Processes
Many practical decision-making problems involve tasks whose success depends on the entire system history, rather than on achieving a state…
SCOPE: Sequential Causal Optimization of Process Interventions
Prescriptive Process Monitoring (PresPM) recommends interventions during running business processes to optimize key performance indicators…
HiGR: Industrial-Scale Hierarchical Generative Slate Recommendation Framework in Tencent
Slate recommendation, which presents users with a ranked item list in a single display, is ubiquitous across mainstream online platforms. W…
MMD Guidance: Training-Free Distribution Adaptation for Diffusion Models via Maximum Mean Discrepancy Guidance
Pre-trained diffusion models have emerged as powerful generative priors for both unconditional and conditional sample generation, yet their…
torch-sla: Differentiable Sparse Linear Algebra with Adjoint Solvers and Sparse Tensor Parallelism for PyTorch
Differentiable sparse linear algebra is foundational for scientific machine learning, yet PyTorch lacks a unified library for it: torch.spa…
Adoption of Generative Artificial Intelligence in the German Software Engineering Industry: An Empirical Study
Generative artificial intelligence (GenAI) tools have seen rapid adoption among software developers. While adoption rates in the industry a…
Structure-Preserving Learning Improves Geometry Generalization in Neural PDEs
We aim to develop physics foundation models for science and engineering that provide real-time solutions to Partial Differential Equations…
MemCast: Memory-Driven Time Series Forecasting with Experience-Conditioned Reasoning
Time series forecasting (TSF) plays a critical role in decision-making for many real-world applications. Recently, large language model (LL…
ASA: Backbone-Training-Free Representation Engineering for Tool-Calling Agents
Adapting LLM agents to domain-specific tool calling remains notably brittle under evolving interfaces. Prompt and schema engineering is eas…
Capture Timing-Attention of Events in Clinical Time Series
The contemporary paradigm of trajectory learning operates fundamentally at the level of group dynamics, systematically reducing individual-…
RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty
Benchmarks establish a standardized evaluation framework to systematically assess the performance of large language models (LLMs), facilita…
Exploring Accurate and Transparent Domain Adaptation in Predictive Healthcare via Concept-Grounded Orthogonal Inference
Deep learning models for clinical event prediction on electronic health records (EHR) often suffer performance degradation when deployed un…
Improving Topic Modeling by Distilling Soft Labels from Language Models
Traditional neural topic models are typically optimized by reconstructing the document's Bag-of-Words (BoW) representations, overlooking co…
GRAU: Generic Reconfigurable Activation Unit Design for Neural Network Hardware Accelerators
With the continuous growth of neural network scales, low-precision quantization is widely used in edge accelerators. Classic multi-threshol…
TaCarla: A comprehensive benchmarking dataset for end-to-end autonomous driving
Collecting a high-quality dataset is a critical task that demands meticulous attention to detail, as overlooking certain aspects can render…
MedFeat: Model-Aware and Explainability-Driven Feature Engineering with LLMs for Clinical Tabular Prediction
In clinical tabular prediction, classical machine learning models with feature engineering often outperform neural methods. LLMs are increa…
Quantifying Uncertainty in AI Visibility: A Statistical Framework for Generative Search Measurement
AI-powered answer engines are inherently non-deterministic: identical queries submitted at different times can produce different responses…
FinTradeBench: A Financial Reasoning Benchmark for LLMs
Real-world financial decision-making is a challenging problem that requires reasoning over heterogeneous signals, including company fundame…
Reasoning over Semantic IDs Enhances Generative Recommendation
Recent advances in generative recommendation have leveraged pretrained LLMs by formulating sequential recommendation as autoregressive gene…
The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning
Large language models fail when a salient surface cue conflicts with an unstated feasibility constraint. We introduce the Heuristic Overrid…
Trust and Reliance on AI in Education: AI Literacy and Need for Cognition as Moderators
As generative AI systems are integrated into educational settings, students often encounter AI-generated output while working through learn…
SAFE: An LLM-as-Verifier Framework for Evidence-Grounded Multi-Hop Reasoning
Multi-hop QA benchmarks often reward Large Language Models (LLMs) for spurious correctness, where models reach correct answers through inva…
Prosociality by Coupling, Not Mere Observation: Homeostatic Sharing in an Inspectable Recurrent Artificial Life Agent
Artificial agents can be made to "help" through explicit social rewards, hard-coded prosocial bonuses, or direct access to another agent'…
GCA Framework: A GCC Countries-Grounded Dataset and Agentic Pipeline for Climate Decision Support
Climate decision-making in the GCC states increasingly demands systems that can translate heterogeneous scientific and policy evidence into…
Generating Concept Lexicalizations via Dictionary-Based Cross-Lingual Sense Projection
We study the task of automatically expanding WordNet-style lexical resources to new languages through sense generation. We generate senses…
Beyond Single-Model Optimization: Preserving Plasticity in Continual Reinforcement Learning
Continual reinforcement learning must balance retention with adaptation, yet many methods still rely on single-model preservation, c…
Learning Evidence Highlighting for Frozen LLMs
Large Language Models (LLMs) can reason well, yet often miss decisive evidence when it is buried in long, noisy contexts. We introduce HiLi…
People-Centred Medical Image Analysis via Fairness-Aware Human-AI Cooperation
Machine learning models for medical image analysis often exhibit subgroup-dependent performance, which impacts how decisions should be allo…
RAG over Thinking Traces Can Improve Reasoning Tasks
Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but is widely believed to offer limited benefit fo…
Communication Dynamics Neural Networks: FFT-Diagonalized Layers for Improved Hessian Conditioning at Reduced Parameter Count
Communication Dynamics Neural Networks (CDNNs) apply the circulant-spectral machinery of the Communication Dynamics framework to neural-net…
A Theory of Training Profit-Optimal LLMs
Scaling LLMs requires tremendous computational resources, and recent advances in AI have gone hand in hand with massive amounts of capital…
From Volume to Value: Preference-Aligned Memory Construction for On-Device RAG
With the rapid emergence of personal AI agents based on Large Language Models (LLMs), implementing them on-device has become essential for…
Rotation-Invariant Spherical Watermarking via Third-Order SO(3) Representation Coupling
Reliable watermarking of panoramic imagery is fundamentally challenged by arbitrary 3D rotations. As panoramas are defined on the sphere, t…
Falcon-X: A Time Series Foundation Model for Heterogeneous Multivariate Modeling
Time series foundation models (TSFMs) are transforming the forecasting paradigm through large-scale cross-domain pretraining. However, most…
Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm
Benchmarking is mature where answers are verifiable -- math, code, reasoning -- but the fastest-growing uses of LLMs are subjective and hum…
On the Learnability of Test-Time Adaptation: A Recovery Complexity Perspective
Test-time adaptation (TTA) aims to adapt models to maintain reliable performance on non-stationary test streams without requiring labeled d…
PromptEmbedder: Efficient and Transferable Text Embedding via Dual-LLM Soft Prompting
Large Language Models (LLMs) have demonstrated remarkable efficacy in text embedding, yet current adaptation methods like LoRA face signifi…
Updating the standard neuron model in artificial neural networks
From their inception in the 1950s, artificial neural networks (ANNs) started using the so-called point neuron model then prevalent in neuro…
Variational Learning for Insertion-based Generation
Non-monotonic sequence generation methods, such as masked diffusion models, provide a flexible alternative to left-to-right autoregressive…
When Do Attention Circuits Form? Developmental Trajectories of Capability and Attention-Sink Emergence Across Three 1B-Class Architectures
We track the developmental trajectory of attention-head circuit formation across three 1B-class language models spanning two architecture f…
LiveBand: Live Accompaniment Generation in the Audio Domain
We present LiveBand, a real-time system that generates high-fidelity music accompaniments to live audio input, respecting strict causal con…
AgenticRL: Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation
Deep reinforcement learning has shown strong potential for enabling autonomous robots to learn complex navigational tasks. However, its pra…
CoRe-MoE: Contrastive Reweighted Mixture of Experts for Multi-Terrain Humanoid Locomotion with Gait Adaptation
Humans primarily rely on walking and running to traverse complex terrains. Similarly, humanoid robots should be able to smoothly transition…
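Anthropic makes its top-tier "Mythos"-class model generally available: "Claude Fable 5," with safeguards to prevent misuse
On June 9 (local time), US-based Anthropic began general availability of its new AI model, Claude Fable 5. The model belongs to the top-tier "Mythos class," which the company positions as more capable than the Opus class, a level whose public release it had so far withheld over security concerns…
Anthropic's latest AI "Fable 5": is now the time to try it? Claude rate limits reset; usable on subscription plans until June 22
US-based Anthropic announced that it has reset the 5-hour and weekly rate limits for its chat AI, Claude. It is encouraging users to try Claude Fable 5, an AI model in the top-tier "Mythos class."
JR East to trial generative AI at "Midori no Madoguchi" ticket offices: AI talks with passengers, then hands off to staff; joint effort with NEC
Users talk with the AI by voice; it organizes the information needed to purchase a ticket and then hands off to a ticket office clerk.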
Google just fired a warning shot in the AI subscription price wars
Google just made it significantly cheaper to enjoy its budget AI subscription tier.
How Justin Ernest invested nearly $500M into hot startups without a traditional VC fund
Instead of spending a year raising a formal venture fund, the Sabertooth VC founder used a captive network of LPs to invest in startups lik…
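Osaka Gas forms partnership with IBM Japan and OGIS-RI: what does AI-centered system transformation look like?
Osaka Gas, OGIS-RI, and IBM Japan have formed a co-creation partnership toward next-generation IT systems centered on AI. The three companies say they will pursue studies and trials in areas such as modernization of existing systems, AI-driven development, security measures, and talent development.
SaaS that "thinks" will die and SoR will survive: the diverging fortunes of legal tech, as read from Sansan's fast-growing "Contract One"
Legal tech for contract work falls broadly into two categories: contract review and contract management. Of these, contract review is one of the SaaS areas where the impact of generative AI appeared early. Where does the line between winners and losers lie?
Using AI "secretly, under the desk": how the Salesforce president puts AI agents to work
US-based Salesforce is focusing on AI agent products. The president of its Japan subsidiary also reports using them "on a daily basis." What does that usage look like?
Will the current "AI for everyone and everything" situation continue? [Part 2] Three considerations needed in the AI era
This series introduces and examines developments in the new security field that has recently come to be called "product security." This installment covers the "three key considerations" that the author believes are necessary regarding the future of AI.
Instagram accounts of governments and celebrities hijacked one after another: is Meta's AI assistant the cause?
Instagram accounts used by senior US Space Force officials and by the White House during former President Obama's administration were hijacked by unknown parties, in a string of incidents in which pro-Iran images and messages were posted. The attackers exploited a vulnerability in US-based Meta's "AI support assistant" to reset the passwords of targeted accounts…
Toward an era in which AI exposes system weaknesses and AI attacks: a "new survival strategy" for local government cyber defense
Advances in generative AI are making it possible to find software weaknesses, and the environment surrounding cyberattacks is changing significantly. While problems that would previously have been overlooked are coming to light, AI-driven automation of attacks is also advancing, and conventional measures premised on "blocking everything" are no longer sufficient on their own. AI…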
Hey, Siri, here’s what I actually want from AI
I'm desperate for a personal AI assistant, but do I really want to become the kind of person who can't function without the friendly robot…
Anthropic’s Fable 5 can make weirdly fun video games with the click of a button
Anthropic's Claude Fable 5 is going to be a big hit with the web's vibe coders.
Can tech companies learn to love cheaper AI models?
If those same AI workloads can be handled by cheaper models without affecting quality, it would mean a massive shift in the economics of AI.
WWDC 2026: Everything announced on Siri AI, iOS 27, Apple Intelligence, and more
Apple primarily made the case for an improved experience with its long-standing Siri assistant, which like most other announcements had a h…
Anthropic’s Claude Fable 5 is a version of Mythos the public can access today
Anthropic is releasing Claude Fable 5, its first Mythos-class model available to the public. The model comes with guardrails that block res…
It’s not FAANG anymore. It’s MANGOS.
With SpaceX, Anthropic, and OpenAI all eyeing massive public debuts, the tech industry may soon have a new class of corporate overlords — a…
Fluid, natural voice translation with Gemini 3.5 Live Translate
Gemini 3.5 Live Translate brings near real-time, natural speech translation to Google AI Studio, Google Translate and Google Meet.
2026-06-09 (791 items)
Sandstone raises $30M to bring AI to in-house legal teams
Sandstone's Series A comes just six months after a Sequoia-led seed round.
Lovable says it has hit $500M in annualized revenue, with 1 million new projects a week
Lovable says it has now surpassed $500 million in annualized run-rate revenue and its users are building businesses and replacing internal…
How an e-scooter founder raised $5 million to build space data centers
Orbital founder Euwyn Poon built 250,000 scooters at Spin. Now he wants to launch 10,000 space data centers.
How engineers at Nextdoor use Codex to build without limits
How engineers at Nextdoor use Codex with GPT-5.5 to investigate hard-to-reproduce issues, build across platforms, and focus on product outc…
What Codex unlocks for Notion
How Notion uses Codex to one-shot specs, build AI Voice Input for the web, and multiply engineering power across small teams.
PathoSage: Towards Multi-Source Evidence Adjudication in Pathology via Experience-Aware Agentic Workflow
Recent advances in Multimodal Large Language Models (MLLMs) and agent workflows have shown strong promise for computational pathology, yet…
OmniMem: Perturbation-aware Memory Compression for Streaming Audio-Visual LLMs
Audio-visual large language models (LLMs) hold strong promise for long-form video understanding, yet their long-video inference is fundamen…
Syll: Open-Source Personal Automation with Cross-Surface Execution
Personal AI agents must increasingly operate across APIs, shells, web surfaces, and desktop GUIs, yet many systems remain tuned to a single…
A case study of evaluating AI agents on a neuroscience data-to-discovery pipeline
Agentic AI tools offer a promising path to automating software development bottlenecks in scientific research pipelines, particularly for s…
Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning
Large language models (LLMs) have demonstrated remarkable reasoning abilities on mathematical and multi-hop planning tasks. The CoCoNuT (Ch…
Automatic Extraction of Structured Information from Brain MRI Reports Using an Open-Weight Large Language Model
Objectives: Automatic data extraction from free-text radiology reports enables large-scale research, but few studies assessed the performan…
Some hypotheses on how chatbots work in problem-solving-driven conversations. Large Language Models as confirmation of the Innovation Illusion
This article offers a perspective on the nature of chatbots as genuine conversation partners when discussing problems in relation to their…
Land cover and flood type govern the detection limits of satellite-based flood mapping across diverse global flood events
Floods are among the most destructive natural hazards, and their increasing frequency under climate change makes satellite-based inundation…
Reconstructing and forecasting disease trajectories of patients with Alzheimer's disease using routine data in resource-constrained settings
Alzheimer's disease is a progressive neurodegenerative disorder, and its progression varies substantially across patients. Existing work ai…
Improving Multimodal Reasoning via Worst Dimension Optimization
Multimodal reasoning requires a path that retains integrity over a wide range of constraints, from visual grounding to logic consistency. H…
Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems
The rapid evolution of Large Language Models (LLMs) from passive assistants to autonomous, execution-capable agents has introduced critical…
Where Instruction Hierarchy Breaks: Diagnosing and Repairing Failures in Reasoning Language Models
Reasoning language models deployed in agentic workflows must follow an instruction hierarchy: when instructions from different sources conf…
Scaling Participation in Modular AI Systems
Humanity is a mosaic of multifaceted talents and needs, and any truly intelligent AI must reflect that richness. Yet the LLMs used by all a…
Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression
Recently, the efficiency of Large Language Models (LLMs) deployment has become a critical concern in practical applications. While post-tra…
Overcoming the Regulatory Bottleneck via Agent-to-Agent Protocols: A Nuclear Case Study
Regulatory review of advanced nuclear reactor designs routinely spans more than three years and consumes hundreds of millions of dollars in…
Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators
LLMs-as-judges are the only way to evaluate safety at scale. Despite their importance, LLM-judges themselves are rarely evaluated beyond hu…
The AI Epistemic Deference Index: A Continuous Measure of Sycophancy
Current AI models frequently exhibit epistemic sycophancy, endorsing claims to agree with a user. Existing evaluations typically measure th…
Contract2Tool: Learning Preconditions and Effects for Reliable Tool-Augmented LLM Agents
Tool-augmented large language model agents increasingly rely on external APIs, but standard tool schemas describe how to call a tool, not w…
MemToolAgent overview with a simple restaurant booking scenario where the agent retrieves similar memories, receives feedback on an invalid time format, and generates a reflection to update its memory
Modern large language model (LLM) agents can use external tools to help users solve complex tasks. However, for problems that require learn…
EditSR: Enhancing Neural Symbolic Regression via Edit-based Rectification
Neural symbolic regression models improve inference efficiency by shifting structural search to pretraining, but their one-pass autoregress…
The CIFAR Synthetic Evidence Corpus for Detecting AI-Generated Evidence
The growing ability of generative models to produce realistic documents poses a direct challenge to evidentiary workflows in the justice sy…
Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy
Large language models (LLMs) are entering clinical practice based on benchmark accuracy that may fail to detect safety-relevant failure mod…
Unification of Closed-Open Industrial Detection Scenarios: New Large-Scale Benchmarks, Challenges and Baselines
Large-scale Visual-Language Models (LVLMs) have achieved remarkable success in natural visual tasks, yet their application to industrial de…
Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs
Backdoor attacks in large language models (LLMs) are often treated as isolated trigger-response failures, motivating defenses tailored to s…
Zero-Shot Learning in Industrial Scenarios: New Large-Scale Benchmark, Challenges and Baseline
Large Visual Language Models (LVLMs) have achieved remarkable success in vision tasks. However, the significant differences between industr…
PAFO: Pareto Fairness Optimization for Personalized Reward Modeling
Large language models (LLMs) increasingly rely on reward models to align their outputs with diverse user preferences. While personalized re…
VATS: Exploiting Implicit Authority in Error-Path Injection via Systematic Mutation
As the Model Context Protocol (MCP) standardizes tool-calling for autonomous agents, it introduces a critical, unexamined attack surface: t…
Efficient Skill Grounding via Code Refactoring with Small Language Models
Effective skill grounding is essential for deploying reusable skills in embodied agents, as even minor embodiment or environmental differen…
UniQL: Towards Dialect-Universal Benchmarking for Text-to-SQL
Existing text-to-SQL benchmarks are largely centered on SQLite, making it difficult to evaluate whether models can generalize across hetero…
OSMGraphCLIP: Learning Global Location Representations from OpenStreetMap Graphs
We present OSMGraphCLIP, a CLIP-style geospatial representation model that learns global location embeddings from freely available OpenStre…
SKILL.nb: Selective Formalization and Gated Execution for Durable Agent Workflows
AI agents increasingly turn past experience into reusable artifacts such as code, workflows, and procedural memories. Reuse can improve eff…
How Small Can You Go? LoRA Fine-Tuning 270M-8B Models for Merchant Information Extraction in Financial Transactions
Financial transaction processing requires extracting structured merchant information from noisy, abbreviated bank transaction strings at sc…
A Multi-modal Agentic Co-pilot for Evidence Grounded Computational Pathology
Pathology is the cornerstone of modern medicine, where accurate decision-making relies heavily on evidence-based practices. While artificia…
When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference
Majority voting over sampled answers is the dominant unsupervised aggregator for multi-sample LLM inference. We show that piping the signal…
PACE: Anytime-Valid Acceptance Tests for Self-Evolving Agents
Self-evolving agents improve by repeatedly proposing changes to their own prompts, skills, or workflows and keeping those that score higher…
Think Before You Act: Intention-Guided Reasoning for LLM-Based Location Prediction
Predicting a user's next Point-of-Interest (POI) based on their historical check-in records is a fundamental task in location-based service…
Cross-LLM Consistency in Inference: Evidence from Shared Interactions
Large language models (LLMs) differ in architecture, training data, and optimization procedures, yet they may still develop similar interna…
SAGE: An LLM-driven Self Reflective Agentic Framework for Fraud Detection
Fraud detection in payment, e-commerce, and telecommunications systems requires accuracy at the individual level, robustness under severe c…
Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents
Tool-using LLM agents often fail not because relevant text is absent, but because decisive evidence is not selected, compressed, or surface…
Online Agent-as-a-Judge: Situation-Generating Evaluation for Interactive Agents
Evaluating LLM-powered interactive social agents is challenging because socially relevant behaviors depend not only on isolated outputs, bu…
SciTrace: Trajectory-Aware Safety Reasoning for Scientific Discovery Agents
LLM-based scientific agents have shown strong capacity for autonomous research, yet their safety layers remain structurally divorced from c…
When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding
Multimodal large language models (MLLMs) have made substantial advancements in video understanding, yet the reliability of their responses…
Traxia: A Framework for Verifiable, Agent-Native Scientific Publishing
Verifiability, attribution, and reproducibility are foundational requirements of scientific knowledge, yet current publishing infrastructur…
From Validator Selection to Portfolio Collection Optimization in Proof-of-Stake Blockchains
We consider a problem arising in proof-of-stake blockchain environments, where agents called nominators select validators - entities respon…
Beyond Agent Architecture: Execution Assumptions and Reproducibility in LLM-Based Trading Systems
Large language models (LLMs) and agentic systems are increasingly proposed for financial trading, yet their reported performance remains di…
Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers
In mechanistic interpretability, attention heads are commonly elevated to role claims (e.g., "this head represents addition") when they are…
Revisiting the shutdown problem
A key premise in leading arguments for existential risk from artificial intelligence is that malfunctioning artificial agents could not be…
To Nuke or Not to Nuke: LLMs' (Missing) Ethical Reasoning and Actions in a High-Stakes Decision-Making Simulation
Large language models (LLMs) are increasingly deployed as long-horizon agents with decision-making capacities. While LLMs can show ethical…
Curation of a Cardiology Interface Terminology for Highlighting Electronic Health Records using Machine Learning
Electronic health record (EHR) notes are dense medical documents containing large amounts of information, often filled with complex medical…
Neuro-Symbolic Injection of LTLf Constraints in Autoregressive Reinforcement Learning Policies
In this work we study offline reinforcement learning (RL) under temporally extended task constraints expressed in Linear Temporal Logic ove…
Integrating Deep Learning Demand Forecasting with Multi-Objective Optimization for Circular Coffee Supply Chains: A Data-Driven Framework for Cost, Emissions, and Freshness Management
The coffee supply chain is one of the most complex agri-food networks, marked by geographically dispersed production, multi-tier coordinati…
Benchmarking Open-Ended Multi-Agent Coordination in Language Agents
As language models are increasingly deployed as autonomous agents, they must coordinate with others over long horizons in open-ended intera…
TT-DAC-PS: Twin-Target Deterministic Actor-Critic with Policy Smoothing for Optimal Trade Execution
This study addresses the optimal execution of large stock sell programs by introducing TT-DAC-PS (Twin-Target Deterministic Actor-Critic wi…
Self-Evolving Scientific Agent Discovers Generalizable Physically-Reasoned Fluid Control
While data-intensive deep reinforcement learning can optimize complex control policies, scientific discovery in physical systems fundamenta…
Trajectory-Refined Distillation
On-policy distillation (OPD) has become a central post-training tool for large language models (LLMs), providing dense per-token teacher su…
GIFT: LLM-Guided State-Reward Interface for Financial Reinforcement Learning
Financial portfolio trading is naturally formulated as a reinforcement learning problem, where an agent sequentially rebalances assets unde…
A Variability-Based Framework for Interpretable Naming in Formal and Relational Concept Analysis
Knowledge extraction from symbolic data often produces abstractions that are formally defined but not immediately interpretable by users. F…
Testing the Black Box: Structural Barriers to Independent Evaluation of Consumer-Facing Health LLMs
Background: Consumer-facing large language models are now a common source of health information, and they interpret and personalize respons…
What Makes a Desired Graph for Relational Deep Learning?
Relational deep learning (RDL) converts relational databases (RDBs) into heterogeneous graphs, but graphs derived directly from database sc…
Explaining Black-Box Language Models: Learning to Optimize Linguistically-Structured Word Subsets
As deep language models (DLMs) are increasingly deployed in high-stakes domains such as healthcare, understanding their decision rationale…
Standpoint Logics with Defeasible Beliefs
In this paper, we integrate the defeasible logic of Kraus, Lehmann and Magidor (KLM) with the standpoint logic framework of Gómez Álvar…
Scaffold Effects on GAIA: A Controlled Comparison
Published agent capability scores conflate what a model can do with what its scaffold lets it do, and the magnitude of this elicitation gap…
VESTA: A Fully Automated Scenario Generation and Safety Evaluation Framework for LLM Agents
Large language models (LLMs) are increasingly evolving from simple text-based interaction systems into LLM agents that can maintain memory,…
DN-Hypo-Pipeline: An AI-Driven Workflow for Hypothesis Generation via Large Language Models and Scientific Explanations
A scientific hypothesis is the first step in research and undergoes experimental validation, yet it also reflects a deep understanding of a…
AgentTrust: A Self-Improving Trust Layer for AI-Agent Actions
AI agents increasingly take consequential actions -- shell commands, cloud operations, and arbitrary tool-calls -- so a trust layer must de…
PAEC: Position-Aware Entropy Calibration for LLM Reasoning in RLVR
Reinforcement learning with verifiable rewards (RLVR) improves large language model reasoning but often suffers from rapid policy-entropy c…
Quantitative Promise Theory: Intentionality and Inference in Autonomous Agents
I discuss some quantitative representations of Promise Theory for processes involving autonomous agents. Agent models are common in softwar…
Distilling LLM Reasoning into an Interpretable Policy Tree for Human-AI Collaboration
Constructing efficient and reliable policies to assist humans is indispensable for human-AI collaboration. Existing methods mainly follow t…
InA-Probe: Instruction-Aware Active Probing for Time Series Forecasting with LLMs
Large Language Models (LLMs) have recently demonstrated impressive potential for time series forecasting. However, existing methods predomi…
Towards Long-Horizon Vessel Trajectory and Destination Forecasting with Reasoning Large Language Models
Long-horizon maritime trajectory prediction is important for shipping management, logistics planning, and maritime risk analysis, yet month…
Extending Ontologies: From Dense Embeddings to Hybrid Quantum-Fuzzy Systems
LLMs have revolutionized knowledge representation and retrieval, but lack the explicit modeling that knowledge ontologies possess. This pap…
ConMem: Structured Memory-Guided Adaptation in Training-Free Multi-Agent Systems
Recent advances have improved the adaptive capabilities of LLM-based multi-agent systems (MAS) through memory-, skill-, and learning-based…
Artificial Intelligence for Mathematical Reasoning: An Integrated Survey of Language Models, Neuro-symbolic Systems, and Verified Discovery
Mathematical reasoning has long served as a stringent test of machine intelligence; over the past decade, it has moved from a niche problem…
Structure-Conditioned Actor-Critic Branches for Quality-Diversity Reinforcement Learning
Quality-diversity reinforcement learning (QD-RL) aims to construct policy repertoires that contain both high-performing and behaviorally di…
RAILS: Verification-Native Clearing For Agentic Commerce
Autonomous agents negotiate, purchase, deploy code, and move funds, but no neutral mechanism determines whether they met their delegated ob…
Bridging Expert Knowledge and Automated Feature Engineering via Self-Evolution
In high-stakes settings such as brand compliance, clinical care, and content moderation, machine learning cannot be deployed as opaque orac…
Q-Delta: Beyond Key-Value Associative State Evolution
Linear attention reformulates sequence modeling as recurrent state evolution, enabling efficient linear-time inference. Under the key-value…
STAR: Rethinking MoE Routing as Structure-Aware Subspace Learning
Mixture-of-Experts (MoE) scales model capacity efficiently by selectively routing inputs to a specialized subset of experts. However, input…
Momentum for Reasoning: Dense Intrinsic Signals in Policy Optimization
Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for eliciting long-chain reasoning in large langua…
Inference-Time Conformal Reasoning with Valid Factuality Control for Large Language Models
Large language models (LLMs) increasingly perform multi-step reasoning, where intermediate claims form implicit directed acyclic graphs who…
Instrumental convergence and power-seeking
Recent years have seen increasing concern that artificial intelligence may soon pose an existential risk to humanity. One leading ground fo…
Beyond Pass Rate: A Multilingual, Execution-Grounded Evaluation of Open Code LLMs
Code generation models are typically compared using compact execution benchmarks and aggregate pass rates, but such summaries obscure how p…
ZIPP: Zero-shot Image Personalization from Personas
Text-to-image diffusion models are increasingly deployed in open-ended creative contexts, yet their outputs remain impersonal, optimized fo…
A Resilience-as-a-Service assessment framework for coordinated disruption response in interdependent urban transit systems
Urban public transport disruptions require rapid response strategies, yet existing studies rarely provide a decision support framework to c…
Hybrid E-Assessment in Higher Education: Semi-Automated Grading of Paper-Based Written Examinations
This paper examines the limitations of fully digital and partially digital e-assessment approaches in summative examinations in higher educ…
Can the Environment Speak for Itself? T²-GRPO: A Turn-Trajectory Group Relative Policy Optimization for Caregiver Agents
Optimizing large language models (LLMs) for long-horizon caregiver agents requires balancing delayed task objectives with immediate environ…
FAME: Forecastability-Aware Mixture of Experts for Heterogeneous Time Series Forecasting
Large-scale retail and industrial forecasting systems contain many heterogeneous time series whose lifecycle, sparsity, volatility, seasona…
Order Matters: Unveiling the Hidden Impact of Macro Placement Sequences via Proxy-Guided LLM Evolution
Macro placement is a fundamental step in modern chip physical design, playing a crucial role in determining the solution quality of high-di…
Oversight Has a Capacity: Calibrating Agent Guards to a Subjective, Fatiguing Human
As LLM agents begin to take real, irreversible actions (shell commands, file edits, deploys), the standard safety pattern is a human-in-the…
AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models
Multimodal Foundation Models (MFMs) have made substantial progress, yet remain fragile in spatial reasoning over the physical world. A key…
An Effective Router for Vision-Language Model Selection
Vision-language models (VLMs) with varying performance and resource requirements are widely deployed, making it difficult for users to sele…
Diverse Thinking Schemata Elicit Better Reasoning in Large Language Models
Large reasoning models (LRMs) have attracted increasing attention for their ability to solve complex mathematical problems by generating ex…
RTL-BenchLS: A Large-Scale Benchmark for RTL Reasoning and Generation with Large Language Models
LLM-based RTL generation and reasoning is a promising direction for hardware design automation. High-quality benchmarks are critical infras…
Baichuan-M4: A Clinical-Grade Medical Agent System for Continuous Care
Baichuan-M4 is Baichuan Intelligence's clinical-grade medical large model, designed for continuous care rather than single-turn medi…
The Token Not Taken: Sampling, State, and the Variability of AI Agent Outputs
Agentic AI systems can behave differently across runs: the same request may produce a different plan, a different tool call, a different co…
LATTEArena: An Evaluation Framework for LLM-powered Tabular Feature Engineering (Extended Version)
Feature engineering remains essential for tabular data analysis, and Large Language Models (LLMs) have emerged as a promising paradigm for…
A Multi-Agent System for IPMSM Design Optimization via an FEA-AI Hybrid Approach
Interior permanent magnet synchronous motor (IPMSM) design requires balancing conflicting objectives and multi-physics constraints, while m…
Personalization Meets Safety: Mechanisms, Risks, and Mitigations in Personalized LLMs
Large Language Models (LLMs) have enabled increasingly personalized interactions by adapting to users' preferences, contexts, and long-term…
Agent Economics: An Entropy-Controlled Pluralistic Alignment Framework for Preventing Artificial Hivemind in Autonomous Agents
This study proposes the Behavioral Protocol Framework (BPF), an entropy-controlled pluralistic alignment framework designed to address two…
REFLECT: Intervention-Supported Error Attribution for Silent Failures in LLM Agent Traces
Large language model (LLM) agents now solve complex tasks through long plan-and-execution traces, yet the ability to locate errors in a com…
DynaOD: Dynamic Origin-Destination Flow Generation with Discrete-to-Continuous Temporal Semantic Modeling
Dynamic origin-destination (OD) flow generation seeks to synthesize realistic mobility dynamics from temporal context alone, without relyin…
Graph2Idea: Retrieval-Augmented Scientific Idea Generation with Graph-Structured Contexts
Generating novel, feasible, and high-quality research ideas is an important yet challenging task in scientific discovery. Recent Large Langu…
ComplexConstraints and Beyond: Expert Rubrics for RLVR
As LLM capabilities advance rapidly, the evaluation methods used to assess them increasingly lag behind. Traditional benchmarks relied on p…
A Regret Minimization Framework on Preference Learning in Large Language Models
Reinforcement learning with verifiable rewards (RLVR) has enabled progress on reasoning-intensive tasks by relying on task-specific verifie…
Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation
Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and…
Vision Language Model Helps Private Information De-Identification in Vision Data
Visual Language Models (VLMs) have gained significant popularity due to their remarkable ability. While various methods exist to enhance pr…
Reliable to Expressive: A Curriculum for Rubric-Following Safety Judges
Safety judges are increasingly deployed to evaluate model outputs against evolving criteria, yet recent meta-evaluation work shows they rem…
IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation
In recent years, unified multimodal models (UMMs) have emerged to support both understanding and generation within a single framework. Mast…
MASS: Deep Research for Social Sciences with Memory-Augmented Social Simulation
Deep Research agents powered by Large Language Models (LLMs) have exhibited extraordinary potential in automated paper writing tasks. Howev…
FF-JEPA: Long-Horizon Planning in World Models with Latent Planners
Joint Embedding Predictive Architectures (JEPAs) have shown promising world modeling capabilities, enabling planning in latent space by opt…
Anything2Skill: Compiling External Knowledge into Reusable Skills for Agents
Retrieval-augmented generation (RAG) enables agents to access external knowledge at inference time, but it primarily retrieves fragmented d…
TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders
Tabular encoders are usually evaluated inside task-specific end-to-end pipelines, so models from different training paradigms are difficult…
Leveraging Structural Constraints for Diffusion-based Neural TSP Solvers
Neural combinatorial optimization has recently achieved strong results on the Euclidean Traveling Salesman Problem (TSP) using generative m…
Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory
Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering.…
Capability-Aligned Hierarchical Learning for Tool-Augmented LLMs
Tool learning enables LLMs to invoke external tools to accomplish tasks. Prior studies have demonstrated the effectiveness of a hierarchica…
From Coarse to Fine: Managing Temporal Granularity in Spatio-Temporal Data for Fine-Grained Traffic Prediction
Efficient acquisition, storage, and utilization of traffic data are critical challenges in spatio-temporal data management. Most traffic da…
RunAgent SuperBrowser: A Theory of Autonomous Web Navigation Grounded in Human Browsing Behaviour
We present SUPERBROWSER, an autonomous web-navigation agent designed against a single guiding hypothesis: a web agent should browse the way…
Correct Looks Better: Pairwise Comparisons Reveal Accuracy Rankings
Pairwise comparisons combined with aggregation methods like Elo have become central to evaluating generative models, yet concerns remain th…
Capacity, Not Format: Rethinking Structured Reasoning Failures
Prior work treats structured output as a reasoning tax, but this framing is incomplete: the cost of formatting depends strongly on a model'…
WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces
Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, brow…
Bayesian Selective Latent Inference for Wastewater-First Influenza Monitoring
Wastewater influenza surveillance can reveal community circulation before clinical reporting, but wastewater alone is not a fully identifia…
SIFT: Selective-Index For Fast Compute of RAG Prefill by Exploiting Attention Invariance
Retrieval-Augmented Generation (RAG) injects LLM queries with relevant documents to improve response quality. This injection increases prom…
AliyunConsoleAgent: Training Web Agents in Real-World Cloud Environments via Distillation and Reinforcement Learning
We present AliyunConsoleAgent, a web agent framework for automated documentation verification in real-world cloud consoles. Major cloud pla…
TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics
LLMs have recently achieved strong results on formal proving benchmarks. However, existing evaluations remain heavily concentrated on compe…
Emergent alignment and the projectability of ethical personas
Work on 'emergent misalignment' shows that finetuning LLMs on narrow tasks can induce broadly misaligned behavior. This supports the 'perso…
LLM-Orchestrated Conformance Checking in Stroke Care Without Computer-Interpretable Guidelines
Objective: Conformance checking in healthcare seeks to assess whether patient care pathways adhere to clinical guidelines. However, its pra…
Deterministic Integrity Gates for LLM-Assisted Clinical Manuscript Preparation: An Auditable Biomedical Informatics Architecture
Objective. Large language models (LLMs) increasingly draft clinical research manuscripts, but their fluency can hide fabricated citations,…
From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs
Existing sparse attention and KV cache compression methods for long-context LLM inference typically apply fixed sparsity patterns or unifor…
AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation
AI Scientist agents are often evaluated as if capability were mainly a function of model quality, prompting, or reasoning scaffolds. We tes…
PRISM: Recovering Instruction Sets from Language Model Activations
As LLMs are deployed as agents, reliable monitoring requires knowing not only what they output, but which instructions are steering their b…
Self-Explainability in Self-Adaptive and Self-Organising Systems: Status and Research Directions
The growing complexity of self-adaptive and self-organising systems, fuelled by advances in Artificial Intelligence (AI), has made them inc…
TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs
Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly evaluated on table reasoning tasks, but the role of table r…
Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text
Chain-of-Thought (CoT) improves the performance of Large Language Models (LLMs) and has been extended to Multimodal Large Language Models (…
Next-Token Prediction Learns Generalisable Representations of Sleep Physiology
Foundation models offer a promising route to compress multi-modal physiological signals into compact representations of human health, with…
From 0-to-1 to 1-to-N: Reproducible Engineering Evidence for MetaAI Recursive Self-Design
Recursive self-design refers to AI-assisted modification of the mechanisms by which an AI system is built, evaluated, and improved. This pa…
Frequency-based Constrained Sampling for Interval Patterns
Output space pattern sampling is a powerful alternative to exhaustive pattern mining for exploring large pattern spaces, as it enables user…
SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks
Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical wor…
Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery
Ask a pretrained biomedical language model whether "cortisol 28 ug/dL" and "stock-market volatility" are related, and it returns a cosine s…
(Auto)formalization is supposed to be easy: Trellis process semantics for spelling out rigorous proofs
We present Trellis: an autoformalization system that leverages LLM agents in a deterministically constrained workflow to enforce incrementa…
Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization
Reward hacking is usually studied after it becomes visible, once a model earns high proxy reward while failing the intended task. We instea…
Beyond Probabilistic Similarity: Structural, Temporal, and Causal Limitations of Retrieval-Augmented Generation in the Legal Domain
Retrieval-Augmented Generation (RAG) has become a standard architectural response to unreliability in legal AI, yet high-profile failures,…
SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research
Large language models are increasingly expected to handle complex, long-horizon real-world tasks whose context demands can grow without bou…
Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback
Existing benchmarks for deep research agents (DRAs) assess only single-shot outputs, ignoring a key question: can DRAs improve their report…
Collaborative Human-Agent Protocol (CHAP)
Foundation models are moving from response generation into operational roles. They plan across steps, call tools, request human input, coor…
SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation
Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning t…
Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting
AI evaluation results are produced at scale but reported inconsistently across leaderboards, model cards, benchmark papers, and company blo…
XAInomaly: Explainable and Interpretable Deep Contractive Autoencoder for O-RAN Traffic Anomaly Detection
Generative Artificial Intelligence (AI) techniques have become an integral part of advancing next generation wireless communication systems by…
BRAIN: Bayesian Reasoning via Active Inference for Agentic and Embodied Intelligence in Mobile Networks
Future sixth-generation (6G) mobile networks will demand artificial intelligence (AI) agents that are not only autonomous and efficient, bu…
Blockchain Infrastructure for Intelligent Cyber–Physical–Social Systems: Post-Quantum Security, Interoperability, and Trustworthy Data Economies in the Era of Embodied AI
The deployment of embodied artificial intelligence via world-model-based robotics presents a transformative opportunity for blockchain infr…
Bidirectional Small-Granularity Search between Code and Text
We introduce the novel task of bidirectional small-granularity search between code and text, where the queries are small snippets of text o…
Evaluating Hallucinations in Domain-Adapted Large Language Models
This study investigates the phenomenon of hallucinations in domain-adapted Large Language Models (LLMs), focusing on the fine-tuning of the…
Retrieval Augmented Generation Framework for the Nepali Legal Domain Question Answering
Legal domains in high-resource languages like English have widely adopted artificial intelligence for legal question answering. However, da…
ABLE: Representing and Mapping LLMs via Attribution-Based Large-model Embedding
The explosive growth of large language models (LLMs) has created a heterogeneous and poorly documented ecosystem, making systematic model c…
Implicit Causal Graph Construction in Text via Chain Discovery
Causal graphs in text are typically populated by observable, predefined events. In contrast, we study implicit causal graph construction fr…
GraphLoRA: Structure-Aware Low-Rank Adaptation for Large Language Model Recommendation
Large Language Models (LLMs) have shown strong potential for recommendation (LLMRec) due to their powerful reasoning and generalization abi…
Post-training is (Massive) Supervised Learning
The prevailing paradigm for training LLMs has evolved to rely on a massive post-training phase consisting of SFT and RL. In this position p…
BEACON: Behavioral Entropy Aggregation for Cross-Model Hallucination Detection in Large Language Models
Hallucination in large language models (LLMs), defined as the generation of factually incorrect or unsupported content, remains a critical…
CAPruner: Conceptual-Adjacent Scene Graph Pruner for Enhancing 3D Spatial Reasoning of Large Language Models
Large language models (LLMs) have recently been applied to 3D vision-language (3D-VL) tasks, which require spatial reasoning to identify ta…
mllm-shap: A Shapley Value Explainability Platform for Text-Audio Multimodal Large Language Models
We introduce mllm-shap, an open-source Python framework designed to extend Shapley Value (SV) explainability from text-only Large Language…
Principled Agent Debate: Adversarial Arbitration for Sycophancy Reduction in Large Language Models
RLHF-trained models are systematically biased toward agreement over accuracy, a structural property of the training process. We present Pri…
Bridging Traditional Explainability Methods and Multimodal Multilingual Models: An XAI-Based Analysis
Multimodal Large Language Models (MLLMs) effectively integrate text and audio to interpret context in complex interactive dialogues. Howeve…
Beware of Geeks Bearing Gifts: Building True EU Frontier AI Sovereignty
Frontier artificial intelligence is reshaping all aspects of society, from economic output or military capability to democratic institution…
Bidirectional Semantic Complementary Tool Retrieval for Remote Sensing Agents
Large language model (LLM)-based agents provide a novel paradigm for the automated processing of remote sensing (RS) data. Their success in…
Multimodal Large Language Models as Synthetic Participants in Video-Based Studies: An Evaluation
Multimodal large language models (MLLMs) have shown strong performance on objective tasks such as video understanding and reasoning. Howeve…
DIYHealth Suite: Dataset, Model, and Benchmark for Health Management at Home
Generative AI is reshaping healthcare, yet most existing advances rely on hospital-grade devices, which limits their accessibility and pote…
Concerns and Strategic Responses of Older Workers Navigating Generative AI in Bridge Employment
Generative AI (GenAI) is transforming workplaces at a rapid pace. This disproportionately affects vulnerable communities, including older w…
AI-Integrated Learning Management System for Middle School: A Longitudinal Study of Learning Outcomes Through High School and Beyond
Middle school is a key window for building core academic skills and the learning routines students carry into later grades, yet many studen…
Beyond Item IDs: Scaling Short-Form-Video Recommendation via Semantic-Native Long Sequence Modeling
Capturing user interests across extensive watch histories is critical for short-form video recommendation, yet scaling sequence length is l…
Liberating LLM Capabilities in Full-Duplex Speech Models
Speech-based large language models are typically constrained to spoken replies, which limits their user-facing outputs to what can be verba…
Evaluating Advanced Prompting on Gemini Flash for Multi-Hop Biomedical QA
The MedHopQA challenge presents a critical test for Large Language Models (LLMs): complex, multi-hop reasoning in the high-stakes biomedica…
Offline Reinforcement Learning for Plasma Control in Nuclear Fusion: Codebase and Benchmark
Offline reinforcement learning (RL) offers a promising route for developing plasma controllers from historical tokamak data, since online t…
Symbolic Reasoning Frameworks Modulate LLM Risk Aversion in Multi-Agent Strategic Settings
Large language models exhibit innate behavioral tendencies when deployed as strategic agents -- notably a risk-averse "turtle" bias toward…
MedicalRec: Medical recommender system for image classification without retraining
The emergence of machine learning and deep learning has revolutionized the efficiency of diagnostic, therapeutic, and administrative system…
Selecting New Measurement Locations to Diversify Traffic-Pattern Coverage: A Real-World Evaluation for Total Traffic Volume Estimation
Accurate measurement of traffic volumes and flows is vital for modern intelligent transportation. However, despite recent technological adv…
Page image classifier fine-tuned on century-spanning archives of scanned documents for further content-specific processing
Purpose: Digitization projects in the humanities produce vast, heterogeneous archives of historical documents, making manual sorting imprac…
Phantom transitions in language model fine-tuning
Fine-tuning a language model on contexts whose correct completion has a near-synonym competitor often fails silently. The cross-entropy los…
The Montparnasse Algorithm for RNA Design
RNA design consists of discovering a nucleotide sequence that optimizes predefined criteria, such as secondary structure. It is useful for…
Emergence via Phase Transitions: Mechanism Landscapes and Universal Convergence Across Complex Systems
Across machine learning, biology, and physics, independently evolving systems often converge toward strikingly similar high-level structure…
Considerations for an Integrated Detector Design at FCC-ee: A Human-AI Exploration
This report explores detector design considerations for the Future Circular Collider in its electron-positron mode (FCC-ee) through an exte…
SurfDesign: Effective Protein Design on Molecular Surfaces
Protein function is largely determined by molecular surface geometry and physicochemical complementarity, yet most protein design methods c…
A Systematic Study of Behavioral Cloning for Scientific Data Annotation
Scientific data annotation, such as tracking animals in video or proofreading neural reconstructions, remains bottlenecked by the "last mil…
Enabling KV Caching of Shared Prefix for Diffusion Language Models
Key-value (KV) caching for shared prefixes is essential for high-throughput large language model (LLM) serving, but it faces critical chall…
Accelerating Birkhoff Projection for Manifold-Constrained Hyper-Connections
Manifold-constrained hyper-connections (mHCs) have recently been proposed as a principled extension of hyper-connections, where the residua…
Training-Inference Kernel Contracts: Bounding Divergence in Post-Training and Deployment
A modern post-training pipeline often writes one symbol for its policy, pi_theta, while evaluating it through two different programs: a tra…
Customer Churn Prediction on Structured Data Using FT-Transformer and Stacking Ensembles
Customer churn prediction is essential across data-driven industries such as insurance, digital banking, eCommerce, and subscription platfo…
Outage Detection in Self-Healing Smart Grids Using Reinforcement Learning with Spectral Graph Neural Networks
Self-healing smart grids can quickly adjust their network configuration during outages to minimize power disruptions. During an outage, sev…
Multimodal Group Emotion Recognition In-the-Wild Towards a Privacy-Safe Non-Individual Approach
This thesis addresses group emotion recognition (GER) in-the-wild with a focus on privacy preservation. Unlike traditional emotion recognit…
From Human Guidance to Autonomy: Agent Skill System for End-to-End LLM Deployment on Spatial NPUs
Spatial neural processing units (NPUs) provide an energy-efficient platform for edge LLM inference, but efficiently deploying an LLM end-to…
SlideCheck: Guiding Self-Supervised Pretraining of Pathology Foundation Models via Dataset Distributions
Pathology foundation models are pretrained on large streams of WSI-derived patches, while supervision during data construction is often sli…
ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research
AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify…
A Mechanistic Analysis of Adversarial Fine-tuning of Vision Transformers
The widespread use of image classification models in high-risk, real-world situations necessitates making these models robust to slight dis…
VisualLeakBench: Reproducible Action-Boundary Propagation Failures in Vision-Language Agents
Vision-language agents increasingly consume screenshots, documents, and user interfaces before writing to memory, sending messages, or invo…
Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them
Pre-training data mixtures are commonly tuned by running small-scale experiments and extrapolating to the target training budget. When high…
A Topological Characterization of Graph Neural Networks via Stochastic Block Model Embeddings on the n-Sphere
We propose a topological framework for comparing trained Graph Neural Networks (GNNs) by mapping the Stochastic Block Models (SBMs) induced…
DiffoR: A Unified Continuous Generative Framework for Universal Ordinal Regression
Ordinal Regression (OR) aims to predict target values with inherent order, underpinning critical applications across diverse domains, from…
Reachability and asymptotics of Gaussian Transformer dynamics
We formulate data propagation through the Transformer, the machine learning architecture powering large language models, as a nonlinear con…
LFNO: Bridging Laplace and Fourier via Transient-Steady Decomposition
We introduce the Laplace-Fourier Neural Operator (LFNO), a unified framework for modeling dynamical systems across transient and steady-sta…
Sample-Efficient Post-Training for LEGO Spatial-Physics Reasoning
LLM-based LEGO assembly generation requires both semantic grounding and physical feasibility. We identify a data-induced failure mode, Phys…
MetaEvo: A Meta-Optimization Framework for Experience-Driven Agent Evolution
Large language models (LLMs) exhibit strong reasoning capabilities, yet most LLM-based agents are statically deployed and unable to improve…
Contribution Weights: A Geometrical Analysis of Self-Attention Transformers
Analyzing attention weights has become a standard approach for interpreting the information flow of Large Language Models (LLMs). However,…
SRT: Super-Resolution for Time Series via Disentangled Rectified Flow
Fine-grained time series data with high temporal resolution is critical for accurate analytics across a wide range of applications. However…
Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Benchmark Contamination, Convention Mismatch, and an Honest Baseline at 25.6% WER (13.8% cWER)
We present a systematic study of fine-tuning OpenAI's Whisper large-v3 for Swiss German ASR, using 1,367 hours of broadcast speech paired w…
LEAF: Growing Trees Without Branching for Speech-Aware Large Language Model Post-Training
State-of-the-art GRPO-style methods for speech-aware large language model post-training suffer from coarse credit assignment, broadcasting…
MIRAGE: Metadata-Integrated Repository Analysis and Guided Enhancement for MSR Datasets
This paper proposes an improved approach to the analysis of Mining Software Repositories (MSR) datasets via metadata enrichment, FAIRness a…
Position: Anthropomorphic Misalignment Research Needs Stronger Evidence
We argue that many Anthropomorphic Misalignment Research (AMR) studies need stronger evidence to ensure that they can provide a robust foun…
Can You Trust What You See? Human and AI Detection of Synthetic Legal Evidence
Visual evidence has long been treated as a reliable form of legal proof, but advances in artificial intelligence (AI) are undermining that…
Structured Neuron Pruning in Deep Neural Networks Using Multi-Armed Bandits
Deep neural networks often contain redundant hidden units. Removing individual weights can reduce parameter count, but unstructured sparsit…
Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation
Scaling laws provide a fundamental framework for understanding the performance of Language Models (LMs), yet deriving them requires prohibi…
Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects
While sparse autoencoders provide features more interpretable than individual neurons, reliably characterizing them remains challenging. We…
ScaleSweep: Accurate NVFP4 Post-Training Quantization of LLMs via Block Scale Initialization
NVFP4 is a recently introduced hardware-supported FP4 format that improves the fidelity of 4-bit quantization through fine-grained block sc…
SENTRY: Statistical Reliability Analysis of Vision Transformers Under Soft Errors
With the growth of Vision Transformers in safety-critical domains like autonomous systems and medical imaging, ensuring their reliability a…
HASA: Subnet Allocation for Compute-Constrained Model-Heterogeneous Federated Learning
Edge services increasingly use federated learning to personalize on-device models while keeping sensitive data local. In practice, deployme…
Eyes All Around: Design and Analysis of 360-Degree LiDAR Perception Using Equivariant Feature Learning in Unstructured Traffic
Perception in dense, unstructured urban traffic remains a major challenge for autonomous driving because of the wide variety of road users,…
Large Language Models Should Learn Personalized Rather Than Aggregated Human Preferences
Current approaches to aligning large language models (LLMs) aggregate diverse human preferences into a single reward signal, effectively op…
Active Learning with Foundation Model Priors: Efficient Learning under Class Imbalance
Real-world datasets across image and text domains are often characterized by skewed class distributions and noisy annotations, which jointl…
Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning
Emergent misalignment (EM) occurs when narrow finetuning causes a model to behave dangerously outside the finetuning task. Standard trainin…
AMN: An Adaptive Multi-Scale Fusion Network with Boundary and Uncertainty Modeling for Nuclei Segmentation
Accurate classification of nuclei subtypes in histopathology images is critical for downstream tasks including tumor grading, immune infilt…
NeuroAlign: Hierarchical Multimodal Fusion of Dynamic and Structural Neuroimaging for MCI Analysis
Multimodal neuroimaging fusion of functional MRI (fMRI) and diffusion tensor imaging (DTI) provides complementary information for cognitive…
Anchor-Conditioned Compositional Control for Landscape Image Generation
Image generative models, though widely used as creative tools, offer limited support for the kind of compositional control that photographe…
MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention
Video understanding is shifting from the offline paradigm -- taking a fully recorded video as input and producing a single answer after it…
No Free Lunch for Synthetic Images under Data Scarcity Conditions
This study investigates the trade-offs between fidelity, privacy, and utility in synthetic data generation under conditions of data scarcit…
AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs
Recent advances in Omni-Multimodal Large Language Models (Omni-MLLMs) have enabled strong integration of vision, audio, and language. Howev…
FineGen: A VLM-based Multi-Agent Framework for Fine-Grained Image-Text Dataset Construction
The scarcity of hard negative samples in current vision-language datasets significantly hinders fine-grained perception. To address this, w…
DOME: Learning Transferable Domain Variables from Sparse Supervision for Test-Time Adaptation
Test-time adaptation (TTA) aims to align a model to shifting test domains using only unlabeled streaming data. Most existing methods implic…
AQIFormer: A Transformer-Based Multi-View Architecture for Cross-City Air Quality Classification
Air pollution represents one of the most critical environmental and public health challenges globally, with traditional sensor-based monito…
ViMax: Agentic Video Generation
Long-form video generation requires systematic narrative planning and visual consistency that current short-clip methods cannot provide. Ex…
A Dataset for Dynamic Human Preferences for Vision Language Models
Given the increased adoption of Vision Language Models (VLMs) in human-interactive settings, it is important that we evaluate how well thes…
MM-Matryoshka: Towards Budget-Elastic Visual Document Retrieval via a 2D Multimodal Matryoshka Training Framework
Multi-vector visual document retrievers achieve strong fine-grained matching by representing each page with multiple vectors from deep Visi…
Seq103: A Unified Neuroevolution Framework for Compact Sequence Architecture Discovery
Neuroevolution is a representative neural architecture search paradigm that evolves both network topology and weights through evolutionary…
AgentCompile: An LLM-Guided Compiler for Direct CUDA Inference
Transformer inference increasingly depends on specialized compiler and runtime support, but real model graphs still require semantic decisi…
MemoVAD: Resource-Efficient Video Anomaly Detection via Dynamic Semantic Memory in Edge Computing Scenarios
Deploying Video Anomaly Detection (VAD) in real-world surveillance faces a fundamental tension between the demand for high-level semantics…
Liquid Neural Networks as a Drop-in Continuous-Time Deformation Field for Dynamic 3D Gaussian Splatting
Deformable 3D Gaussian Splatting (D-3DGS) reconstructs dynamic scenes from monocular video by deforming a canonical set of 3D Gaussians th…
A Hierarchical Feature Engineering Framework for Automated Classification of Phonotraumatic and Non-Phonotraumatic Vocal Hyperfunction
Ambulatory neck-surface acceleration enables non-invasive monitoring of vocal hyperfunction, yet robust biomarkers for its subtypes remain…
Single-Cell Cross-Modal Transfer by Adversarial Fine-Tuning of Foundation Models
Spatial transcriptomics (ST) is a powerful tool for exploring biological properties dependent on structure, proximity, and interaction in t…
DOG-DPO: Dynamic Optimization in Geometry for Safety Alignment
Safety alignment for large language models relies on preference data, but current pipelines often train on large, redundant datasets. Exist…
Systematic LLM Translation of Legacy Scientific Code to Differentiable Frameworks: Application to a Land Surface Model
Differentiable programming offers transformative capabilities for scientific modeling, enabling gradient-based parameter estimation, sensit…
SWE-Marathon: Can Agents Autonomously Complete Ultra-Long-Horizon Software Work?
AI agents are increasingly expected to complete long-horizon workflows that require sustained progress over hours, millions of tokens, and…
Semantic Cache Distillation: Efficient State Transfer via Reuse and Selective Patching
Disaggregated serving alleviates memory bottlenecks in Large Language Model (LLM) inference but creates a severe communication bottleneck:…
Test-Time Adaptive Composition for Machine Learning as a Service (MLaaS) in IoT Environments
The dynamic nature of Internet of Things (IoT) environments affects the long-term effectiveness of Machine Learning as a Service (MLaaS) co…
Knowledge-Inclusive Adaptive Physics-Informed Neural Network for Microbial Interaction Modelling
Physics-Informed Neural Network (PINN) is a way of including knowledge in the form of equations in Machine Learning methods. Beyond equatio…
What Makes Video World Model Latents Action-Relevant: Prediction over Reconstruction
Video world models are increasingly used to provide predictive visual representations, yet it remains unclear which pretraining signals ind…
TRACER: Token ReAssignment for Concept ERasure in Generative Recommendation
Generative recommendation formulates next-item prediction as autoregressive generation over semantic ID (SID) sequences derived from users'…
HARP: Efficient Data Selection for Finetuning Large Language Models
Finetuning data selection requires balancing two competing goals: selecting examples that improve the downstream objective, and doing so wi…
BCG-FM: A Foundation Model for Ambient Cardiac Health Sensing
Foundation models for wearable biosignals have matched or exceeded supervised specialists across a range of clinical tasks, yet all rely on…
DSFNet: Learning Dual-Domain Spectral Operators for Multi-Modality Spatio-Temporal Forecasting in Urban Transportation Systems
Multi-Modality Spatio-Temporal Forecasting (MoSTF) extends traditional spatio-temporal forecasting by incorporating diverse traffic modalit…
Adversarial Robustness of Activation Steering in Large Language Models
Activation steering has become a popular training-free method to control LLM behavior by injecting precomputed direction vectors into the m…
TianJi-Environ: An Autonomous AI Scientist for Atmospheric Environmental Research
As atmospheric environmental prediction continues to improve, interpretable validation of pollution mechanisms and feedback processes has b…
Pharmacogenomic Knowledge Graph Augmentation for Graph Neural Network-Based Drug-Drug Interaction Prediction
Graph neural networks (GNNs) applied to drug-drug interaction (DDI) prediction rely exclusively on molecular structure encoded as SMILES-de…
EssentialGIN: a new approach for gene essentiality prediction based on graph isomorphism neural networks
Background: Prediction of essential genes (proteins) is a basic and challenging problem but at the same time very costly and time-consumin…
EvoCSFL: Surrogate-Assisted Evolutionary Client Selection for Efficient and Robust Federated Learning
The heterogeneity of client data and systems makes it difficult to achieve satisfactory convergence speed and robustness in federated learn…
How Much Dense Attention is Necessary? Oracle-Guided Sparse Prefill for Full/GQA Layers in Hybrid Long-Context Models
Long-context prefill remains expensive because full/GQA layers still score the historical sequence, even in hybrid models with local, spars…
FunctionEvolve: Structure-Guided Symbolic Regression with LLMs
Symbolic regression aims to uncover explicit scientific laws from data. Recent methods use LLMs to guide mutation from background text, whi…
SAW: Stage-Aware Dynamic Weighting for Multi-Objective Reinforcement Learning in Large Language Models
Although multi-objective reinforcement learning (MORL) is central to aligning large language models with complex human preferences, the pre…
MLingualFC: Evaluating Jailbreak Vulnerabilities in Multilingual Vision-Language Models
Vision-Language Models (VLMs) have demonstrated strong performance across multimodal tasks, yet their safety robustness remains an open cha…
Cross-View Urban Traffic Dataset: Drone-Supervised Ground Truth for Monocular Bird's-Eye View Localization
We introduce a dataset and benchmark for cross-view urban traffic perception built from synchronized ego-centric bicycle videos and aerial…
WhiFlash: Accelerating Speculative Decoding with Token-Level Cross-Paradigm Routing
The autoregressive nature of large language models (LLMs) remains a significant bottleneck for inference, particularly in complex agentic w…
Rosetta Memory: Adaptive Memory for Cross-LLM Agents
Memory is the key component for transforming a stateless LLM into a persistent, evolving agent through experience accumulation, long-horizo…
MatMind: A Structure-Activity Knowledge-Driven Generative Foundation Model for Materials Science
Progress in AI-driven crystal materials science has so far been carried by narrow architectures purpose-built for individual tasks -- graph…
Attention at the Theoretical Minimum: A Mathematics of Arrays Framework for Memory-Optimal Transformer Kernels
The attention mechanism is the dominant computational bottleneck in modern transformer-based AI. Its standard implementation incurs quadrat…
Beyond Accuracy: Interpreting Topic Representation in Suicide Ideation Detection Models
Suicide ideation detection models are typically evaluated using aggregate performance metrics, yet little is known about how they internall…
SHIELD-IDS: Structurally Heterogeneous Ensemble with Integrated Layered Defense for Intrusion Detection Systems
Adversarial attacks pose a serious and growing threat to Machine Learning (ML)-based Intrusion Detection Systems (IDS), where imperceptible…
Multi-planar 2D-U-Net Segmentation of 3D-CT Abdominal Organs augmented by Spatial Occurrence Maps
This work proposes a lightweight 2D-U-Net-based framework for segmenting five abdominal organs in large field-of-view 3D CT scans. The meth…
Quantum-Enhanced Similarity Measures for Polarimetric Materials Classification
We present a quantum-classical hybrid pipeline for polarimetric material classification that casts this as a point-matching problem. Voxel…
Beyond Point Estimates: Benchmarking Uncertainty Quantification Methods on the AION-1 Astronomical Foundation Model
Foundation models for astronomical surveys offer powerful learned representations that can be transferred to downstream regression tasks su…
Memetic Capture: A Pluralistic Policy Framework for Governing AI-Driven Cultural Disempowerment
Culture is the most insidious vector of gradual human disempowerment by AI: unlike economic or political displacement, cultural displacemen…
SLMJury: Can Small Language Models Judge as Well as Large Ones?
Large language models (LLMs) are widely used as judges for evaluating model outputs, but their high cost, latency, and opacity limit scalab…
The ACUTE Protocol: Operationalizing Language Model Activations for Better Calibration, Utility, and Trust
As language models improve and become increasingly deployed to solve a variety of tasks, trustworthiness becomes essential. Calibration is…
Jas: AI-Paired Engineering as a Revival of N-Version Programming
I report a case study in AI-paired software engineering: five working ports of a vector illustration application across Rust, Swift, OCaml,…
Beyond Pass/Fail: Using Process Mining to Understand How LLMs Resist (and Fail) Red Team Attacks
Standard AI red teaming evaluations reduce adversarial campaigns to a single binary outcome, attack success rate (ASR), not taking into acc…
Cherry-pick Override: Unsafe Directional Commitment in LLM Judges under Mixed Evidence
LLM judges increasingly turn verdicts into system commitments. Under mixed evidence (claims with both supporting and refuting sources) this…
Agentic multi-fidelity learning of quasiparticle and excitonic properties
Many-body GW-Bethe-Salpeter equation calculations are essential for accurate simulations of electronic structure and optical properties in…
Does Persona Make LLMs K-pop Fans? A Pilot Study of LLM-Based Online Concert Audience Agents
A concert is a collective experience, but recorded performance videos are typically watched alone, stripping away the shared audience prese…
Cost-Aware Speculative Execution for LLM-Agent Workflows: An Integrated Five-Dimension Method
LLM-agent workflows chain model calls and tool invocations, and spend most of their wall-clock time waiting on upstream operations before d…
Beyond English benchmarks: clinical LLM evaluation in Brazilian Portuguese
Large Language Models are transforming the support for clinical decision and their application in real scenarios. Yet, most benchmarks are…
Model Multiplicity for Adversarial Detection in Small Language Model Training on Edge Devices
The rise of edge-based machine learning has enabled distributed adaptation of language models across mobile and IoT devices, offering priva…
The Last Visible Pixel: Probing Fine-Scale Perception in Vision-Language Models
Recent vision-language models (VLMs) excel at multimodal understanding and reasoning, yet their fine-grained visual perception remains unde…
Instrumented data for causal scientific machine learning
Scientific machine learning is limited less by model size than by the data it is trained on. Observational data records what happened but n…
The Cross-Architecture Substrate: A Domain-Transcendent, Calibration-Surviving Geometric Invariant of Modern Vision Encoders
Different vision neural networks -- trained to classify, contrast, reconstruct, or match images to text -- should have correspondingly diff…
Strained Coherence: A Pre-Failure Signal in Coding Agent Execution Trajectories
LLM-based coding agents sometimes acknowledge a problem in their own reasoning and then proceed anyway. We call this pattern strained coher…
3D Oral Modelling with Improved Vertex Distribution Using Matching-Based Learning
In our previous work, a deep learning-based framework for 3D intraoral reconstruction was proposed. The model directly predicts explicit 3D…
Larch: Learned Query Optimization for Semantic Predicates
With the advent of Large Language Models (LLMs), many database systems introduced semantic operators that enabled analytical queries over u…
Decoupling Semantics and Logic: A Training-Free Coarse-to-Fine Pipeline for Video Retrieval-Augmented Generation
This paper presents our system description for the 2nd Workshop on Multimodal Augmented Generation via MultimodAl Retrieval (MAGMaR). Addre…
Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation
Human evaluation plays a critical role in assessing the quality of generated text. However, the reliability and reproducibility of these ev…
POISE: Position-Aware Undetectable Skill Injection on LLM Agents
Agent skills provide a lightweight mechanism for extending general-purpose agents, but their open format exposes them to skill-poisoning at…
From 'May' to 'Is': Certainty Distortion in Language Model Rewriting
Humans increasingly turn to Language Models (LMs) in ways that shape beliefs and drive decisions, including discussing, rewriting, and summ…
Minibatch Selection via Partition Matroid Constrained Gradient Matching
Training large language models (LLMs) on heterogeneous data requires selecting minibatches that balance convergence speed with coverage acr…
RecurGuard: Runtime Monitoring for Reasoning-Token Consumption Attacks
Reasoning-capable large language models can be induced to spend their generation budget on injected decoy tasks rather than answering the u…
Neutrality Bites: Gender Representation in AI-Generated Animal Stories
Gender bias in AI-generated stories is a well-documented problem. While much attention has been paid to reducing or mitigating this bias, i…
Defending Against Malicious Finetuning by Scaling Train-time Adversarial Attacks
Current open-weight large language models (LLMs) are prone to malicious finetuning attacks, which could compromise the safety alignment of…
PRISM: PRior-guided Imagination Sampling in world Models
A learned world model provides a powerful physical intuition for evaluating future states. But its effectiveness in continuous control also…
MC-PDD: Masked Corpus-Level Pretraining Data Detection for Black-Box Large Language Models
Pretraining is fundamental to the development of Large Language Models (LLMs), yet the opacity of pretraining data complicates model analys…
Enhancing AI Interpretability and Safety through Localised Architectures
Recent advances in generative AI, especially powerful Large Language Models (LLMs) and Large Reasoning Models (LRMs), raise concerns over t…
Summarization is Not Dead Yet
The progress of large language models (LLMs) has fueled claims that model-generated summaries rival or even surpass human-written reference…
Rewrite to Translate, Translate to Reward: Reinforcement Learning for Source Rewriting in Machine Translation
Although directly prompting off-the-shelf Large Language Models (LLMs) to generate meaning-preserving source rewrites can effectively enhan…
GVC-Seg: Training-Free 3D Instance Segmentation via Geometric Visual Correspondence
Accurate 3D instance segmentation in point cloud data is critical for machine vision applications. Recent advancements leverage multiple pr…
IEA: Amateur-Friendly Conversational Image Editing Agent via Three Stages of Multitask Alignment
Current image editing software often hinges on fixed filters or expert tuning, leaving a gap between amateur users' intent and outcomes. Cr…
Repair Before Veto, When Repair Is Hidden: Quantum-Accessible Features for Repair-Augmented Constraint Learning
Hard-constraint decision systems usually veto infeasible candidates. This is too rigid when the system can act: if a known affordable repai…
Semantic Quorum Assurance: Collective Certification for Non-Deterministic AI Infrastructure
As large language model (LLM) agents are integrated into autonomous cloud operations, distributed systems face a semantic reliability probl…
CausShield: Sample Reconstruction-Resilient Vertical FL via Causal Representation Learning
Vertical federated learning (VFL) is a distributed learning paradigm that leverages vertically partitioned features across isolated parties…
Voting Protocols as Coordination Mechanisms for Role-Constrained Multi-Agent Tutoring Systems
Agentic tutoring systems introduce a coordination challenge: multiple agents may propose different but reasonable interventions, yet only o…
Sci-Rho: A Multilingual Visually-Grounded Symbolic Benchmark for STEM Problems
Symbolic benchmarks have emerged as a key approach to assess model robustness under minor modifications to STEM-related questions. However,…
GIScholarBench: Benchmarking LLM Overconfidence in GIS Research
Large language models (LLMs) are increasingly used in academic research workflows, but scholarly tasks require high factual precision and t…
SafeECGMatch: Calibration-Aware Joint Frequency and Time Space Semi-Supervised Learning for Open-Set ECG Classification
Electrocardiogram (ECG) classification models often suffer from severe label scarcity, making semi-supervised learning (SSL) an attractive…
When Behavioral Safety Evaluation Fails: A Representation-Level Perspective
Large Language Model (LLM) safety has often been evaluated at the behavior level, which provides limited evidence of internal robustness, a…
What's the Point? Spatial Grammar & Index Resolution for Sign Language Processing
Sign language models are predominantly trained with gloss-sequence or text supervision, thereby under-modeling non-lexical and productive c…
EgoAERO: Learning Dexterous Manipulation from a Single Egocentric Video without Object Assets
Egocentric RGB-D videos offer a natural source of human dexterous manipulation demonstrations, but existing data is difficult to use for ro…
Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?
Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades signi…
"I understand your perspective": LLM Persuasion and Sycophancy through the Lens of Communicative Action Theory
Large Language Models (LLMs) can generate high-quality arguments, yet their ability to engage in nuanced and persuasive communicative actio…
Aligned but Not Partner-Specific: Distinguishing How Multimodal LLM Agents Succeed in Reference Games Without Human-Like Conventions
Repeated reference games test whether interlocutors replace their initially long descriptions with shorter, partner-specific conventions gr…
Fast LLM-Based Semantic Filtering: From a Unified Framework to an Adaptive Two-Phase Method
Evaluating a natural-language yes/no predicate over a document corpus under an accuracy target - the semantic filter - is a cornerstone of…
vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models
Vision-Language-Action (VLA) policies are typically shipped as Python/PyTorch stacks that assume a workstation-class GPU, a mismatch for th…
Continual Quadruped Robots Coordination via Semantic Skill Discovery
Multi-quadruped coordination has attracted increasing attention due to its enhanced payload capacity, broader contact coverage, and improve…
Ego-Pi: VLA Fine-Tuning for Ego-Centric Human and Robot Data
Robotics faces a fundamental challenge of data scarcity. Unlike language or vision research, there is no internet-scale dataset for robotic…
Human-Centered Benchmarking of Driver Monitoring Models
Vision-based driver monitoring systems are increasingly deployed in safety-critical intelligent transportation settings, yet they are almos…
LCAM: A Framework for Diagnosing Interactional Alignment Failures in Conversational AI
Conversational AI is increasingly used for advice, interpretation, reassurance, and decision support in contexts where users may be vulnera…
LogNEO: A GPT-Neo Reinforcement Learning Framework for Accurate Real-Time Log Anomaly Detection
Detecting anomalies in large-scale system logs is critical for the reliability and security of modern computing infrastructure. We present…
RAPID: Layer-Wise Redundancy-Aware Pruning and Importance-Driven Token Merging for Efficient ViT
Vision Transformers (ViTs) achieve strong performance but suffer from high computational costs due to quadratic self-attention complexity.…
Constrained Paraphrase Consistency for LLM Hallucination Detection
Large language models (LLMs) can generate factually inconsistent claims, motivating accurate and scalable hallucination detectors. Prior wo…
Explaining Data Mixing Scaling Laws
Recent research has established empirical scaling laws to predict model performance on multi-domain data mixtures. However, a theoretical u…
Closing the Sim-to-Real Gap: An Evaluation Framework for Autonomous Cyber Defense Configuration of Commercial EDR
Leading commercial endpoint detection and response (EDR) products have shifted from operator-configured rule sets to multi-component system…
CLASP: Language-Driven Robot Skill Selection and Composition using Task-Parameterized Learning
Enabling robots to understand and execute tasks from natural language commands while maintaining data efficiency remains challenging. Found…
The Governance of Human-LLM Interaction: Safety Gating, Civility Steering, and Affective Default Lock-In
Large language models (LLMs) increasingly mediate high-stakes interactions in finance, medicine, and mental-health support, yet users have…
Frequency-Domain Latent Attention Gating for Cross-Domain Token Aggregation
Token aggregation is a common bottleneck in models that map token representations to sample-level predictions, yet most pooling methods ope…
GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models
Large Audio-Language Models (LALMs) integrate audio perception and language understanding within a unified framework, enabling a wide range…
Beyond Additivity: Causal Discovery in Location-Scale Noise Models with Hidden Variables
We study causal discovery from observational data when some variables are hidden and the data-generating process follows a location-scale n…
How Deep Are Deep GPs, Really? A Sharp Threshold and a Non-Gaussian Limit for Compositional GPs
Compositional priors describe the generic properties of layered functions in deep Bayesian models, where deep neural networks with random w…
AeroSpectra Sentinel: An Auditable LLM Prompt-Chaining Decision-Support Workflow for Acute Asthma Risk Assessment from Respiratory Sounds and Clinical Signals
Acute asthma risk assessment requires rapid interpretation of respiratory sounds, oxygenation, airflow limitation, speech ability, work of…
Contemporary AI lacks the imagination to diverge or negate in science
Bold projections that artificial intelligence will accelerate scientific discovery have raced ahead of evidence from working scientists, an…
Post-AGI Economies: Superposition and the Second Fundamental Theorem of Welfare Economics
The classical Second Welfare Theorem decentralizes any Pareto efficient allocation through prices and transfers under convexity and regular…
An AI Security Agent for University ACMIS: Multi-Vector Threat Detection and Automated Response
University Academic Management Information Systems (ACMIS) are high-value targets for a wide spectrum of security threats including brute-f…
AgriGov: A Structured Multilingual Dataset Curation for Indian Government Schemes for Farmers
AgriGov is a curated, trilingual (English-Hindi-Marathi) dataset designed to address the scarcity of domain-grounded multilingual resources…
Causal Agent Replay: Counterfactual Attribution for LLM-Agent Failures
When an LLM agent fails -- issues a refund it should not have, calls the wrong tool, leaks data -- existing tooling answers what happened (…
"So There's a Catch-22 Here": How Early Adopters Who Build Multi-Agent LLM Systems Conceptualize Transparency
Multi-agent large language model (LLM) systems are rapidly emerging, yet transparency, a cornerstone of responsible AI, remains under-defin…
Set-Based Transformer for Atmospheric Compensation in Standoff LWIR Hyperspectral Imaging
Passive long-wave infrared (LWIR) hyperspectral imaging under a standoff geometry depends on atmospheric absorption and emission, as well a…
Chiaroscuro Attention: Spending Compute in the Dark
Standard transformers apply self-attention uniformly at every layer and token, regardless of whether the input requires dynamic cross-token…
Generative Frontier Planning for Adaptive Peer-Referral Recruitment under Covariate-Dependent Arrivals
Peer-referral recruitment systems such as respondent-driven sampling are critical for studying and intervening on hidden populations affect…
Self-Supervised Vision Transformers for CBCT-Based Detection of Temporomandibular Joint Osteoarthritis
Temporomandibular joint osteoarthritis (TMJ OA) is a prevalent degenerative condition whose osseous changes are often subtle on cone-beam C…
Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects
Sparse autoencoder (SAE) features are increasingly used to steer language models, but feature steering is rarely clean: the same interventi…
Emergence World: A Platform for Evaluating Long-Horizon Multi-Agent Autonomy
Most evaluations of LLM agents look like exams: a discrete task, a clean environment, a score in minutes or hours. We argue that this appro…
An Information-Theoretic Definition for Open-Ended Learning
A growing body of work points to the great promise of AI systems that can continually expand their capabilities as they operate in an open-…
RiskNet: A large-scale dataset of AI risk incidents from news with alignment and multi-dimensional annotations
As artificial intelligence (AI) systems are increasingly deployed across socially consequential domains, reports of AI-related harms and fa…
Auditing Proprietary Alignment in Large Language Models: A Comparative Framework Without a Ground-Truth Standard
Large language models (LLMs) are increasingly released and deployed through opaque development and deployment pipelines, enabling model pro…
STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control
Low-rank projection has emerged as a promising approach for compressing the KV cache by exploiting hidden-dimension redundancy. However, pr…
Impacts of Histories and Models on LLM Grading: A Study in Advanced Software Engineering Courses
Graduate-level research reading report assessment creates a substantial labor burden for educators. While large language models (LLMs) hold…
SceneConductor: 3D Scene Generation from Single Image with Multi-Agent Orchestration
Generating complete 3D scenes from a single image requires inferring globally consistent geometry, object relationships, and environmental…
Hiding in Plain Floats: Steganographic Carriers for Indirect Prompt and Content Injection
Text-centered prompt-injection defenses assume that the malicious signal is visible in one of the inspected text views. We study a reproduc…
TimpaTeks: Automatic In-place Text Sequence Modification via Diffusion Language Model Steering
We extend activation steering to diffusion language models (DLMs) and study a novel problem that arose due to the inference mechanism of DL…
Provably Efficient Personalized Multi-Objective Bandits with Proactive Conversational Queries
Personalized decision-making in multi-objective bandits requires learning user-specific trade-offs among competing objectives. Since arm ut…
PACT: Self-Evolving Physical Safety Alignment for Diffusion Policies in Embodied Manipulation
Diffusion policies have achieved remarkable success in robotic manipulation, yet they often fail to satisfy strict physical constraints req…
CoVEBench: Can Video Editing Models Handle Complex Instructions?
While recent text-guided video editing models excel at elementary tasks (e.g., style transfer, object insertion), real-world user requests…
Hacking Generative Perplexity: Why Unconditional Text Evaluation Needs Distributional Metrics
Diffusion and continuous flow-based language models have emerged as the leading non-autoregressive alternatives to language modeling. Progr…
AI Code Sandboxes: A Comparative Security Study. Part 1 of 2 -- Engine-Level Properties (Attack Surface, Leakage, Stackability, CVE History, Patch Cadence, Fuzzing)
This paper reads six engine-level measurements together -- 1.1 host attack surface, 1.2 information leakage, 1.3 defense-in-depth stackabil…
Segment-level Tree Search for Long Meeting Document Summarization
Meeting documents are challenging to summarize due to their length and complex conversational structure. Existing approaches typically adop…
Sparrow: Sparse Rollout for Stable and Efficient Long-context RL of Large Language Models
Despite being powerful, reinforcement learning with verifiable rewards (RLVR) induces extremely long CoT, making it computationally expensi…
Not Just After One: Sleep-Inspired Replay Prevents Catastrophic Forgetting After Sequential Tasks
One of the critical limitations of artificial neural networks is their lack of ability to continually learn: training on new tasks often le…
Sycophancy as a Multilingual Alignment Failure: How Safety Degrades Across Languages, Topics, and Models
Safety-aligned large language models often exhibit sycophancy, which is the tendency to affirm users' opinions regardless of factual accura…
The Confidence Trap: Calibration Attacks for Graph Neural Networks
While confidence calibration is essential for trustworthy decision-making in safety-critical applications, the robustness of calibrated GNN…
More Yap Less Meaning: Uncovering Self-Improvement Behavior in SLMs
Recently, language models have made rapid progress across various domains and applications. However, their capability for self-improvement,…
FlashCP: Load-Balanced Communication-Efficient Context Parallelism for LLM Training
Context parallelism (CP) is essential for training large-scale, long-context language models, as it partitions sequences to reduce memory o…
Adaptive Loss Balancing for Noise-Robust GRPO in Generative Recommendation
Reinforcement learning (RL) presents a promising avenue for enhancing generative recommendation beyond supervised imitation, leveraging rew…
PIPE-Cypher: Automatic Enterprise Benchmark Generation for Text-to-Cypher Systems
Enterprise property graphs vary widely in schema structure, internal terminology, domain assumptions, governance constraints, and user inte…
STELLAR: Spatio-Temporal Environmental Learning with Latent Alignment and Refinement for Long-Tailed Species Distribution Modeling
Joint Species Distribution Modeling (JSDM) is a key enabler for biodiversity monitoring and conservation planning. However, accurate JSDM f…
Seeing is Believing: Aligning Prompt Rewriting with Visual Anchors for Text-to-Image Generation
Despite the impressive capabilities of text-to-image (T2I) models, an intent-generation gap often persists due to the brevity and ambiguity…
Projecting the Emerging Mindset of SWE Agent by Launching a Wild Code Understanding Journey
Software engineering agents (SWE agents) increasingly work through tool-mediated trajectories in real repositories, yet their behavior rema…
ActProbe: Action-Space Probe for Early Failure Detection of Generative Robot Policies
Generative robot policies fail unpredictably at deployment: they hesitate at critical moments, drift off-task, or commit to unrecoverable a…
GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation
Vision-Language-Action (VLA) models achieve strong benchmark performance but still struggle in real-world deployment with unseen objects, b…
When Video Misreads: Closed-Loop Distillation of Reading Heuristics for Exploratory Manipulation Trace QA
Exploratory manipulation often turns an apparent failed attempt into the key evidence for what to do next. For example, a robot pulls a loc…
EinSort: Sorting is All We Need for Tensorizing LLM
Tensor networks provide efficient representations for compressing large neural networks. By carefully designing shapes and topologies, they…
Calibration of Structured Ignorance Certificates for Diagnosing Unknown Unknowns in Reasoning Models
Large language models frequently fail in a characteristic way: rather than acknowledging ignorance, they produce fluent but incorrect answe…
Auditable Graph-Guided Root Cause Analysis for Kubernetes Incidents
Kubernetes incidents are diagnosed reliably only when a root-cause system's reported gains come from incident evidence rather than scenario…
Reinforcement Learning for Flow-Matching Policies with Density Transport
We present an online reinforcement learning (RL) algorithm for fine-tuning flow-matching policies in continuous-control problems. Our key i…
HARBOR: A Harness Framework for Agentic Robot Reinforcement Learning
Reinforcement learning (RL) has become a powerful paradigm for robot learning, particularly in sim-to-real settings, but its broader adopti…
Tyan-WP: A Wind Power Foundation Model for Ultra-Short-Term Probabilistic Forecasting
Global wind power capacity, especially in China, is booming, with new farms spanning diverse terrains and climates. The industry urgently n…
A retrieval conditioned rebinding circuit for dynamic entity tracking in large language models
To interpret context correctly and retrieve relevant information, large language models must bind entities to their attributes and update t…
Sample-Efficient LLM-Based Detection of Malicious Web Server Logs with Forensically Explainable Reasoning
Forensic analysis of web server logs demands both accurate detection and human-readable explanations that can satisfy legal requirements. W…
Reconstructing Synthetic SDO/AIA 193 A EUV Images from He I 10830 A Observations with Diffusion Model Translator
Routine full-disk EUV imaging has been available only since the modern era, such as SOHO and SDO. To extend EUV coronal context into earlie…
FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning
Action-supervised fine-tuning of vision-language-action (VLA) policies fits demonstrations effectively but constrains only the directions t…
Latent Diffusion Policy: Shaping Latent Spaces for Diffusion-Based Robotic Manipulation
Diffusion-based visuomotor policies operating directly in raw action spaces conflate scene comprehension with trajectory generation within…
Data Agents Under Attack: Vulnerabilities in LLM-Driven Analytical Systems
Data agents integrate LLM-driven reasoning with relational data access, executable analytical tools, and multi-step workflow orchestration,…
BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension
Existing video generation frameworks treat sequence duration as an externally prescribed parameter -- fixed frame counts or text prompts --…
Lost in the Flow with Code Talkers: Unveiling the Instruction-Tuning Tax of Large Language Models in Code Tasks
AI coding assistants have significantly improved developer productivity by automatically suggesting code that aligns with user intent, and…
Activation Steering Induces Emergent Misalignment: A More Comprehensive Evaluation
Activation steering has emerged as a popular inference-time technique for modulating the behavior of large language models (LLMs). By const…
Agentic Search for Counterfactual Recourse under Fixed LLM Budgets
Counterfactual recourse aims to provide actionable feature changes that would alter an unfavorable decision made by a predictive model. In…
Structuring agentic AI for HPC code modernization
Modernization of legacy scientific codes is often necessary to keep up with the ever-evolving changes in the compute resource ecosystem. Pa…
SNR-ST-Mix: Sample-specific Neighborhood Regression Mixup for Augmented Spatial Transcriptomics Imputation with Deep Neural Network
Purpose: Spatial transcriptomics (ST) enables gene expression measurements within the tissue context. However, these measurements are often…
Hybrid Neural Network and Conventional Controller Approach for Robust Control of Highly Unstable Systems: Application to Tilt-Rotor Control
Multirotors are widely used in applications ranging from surveillance to precision agriculture, yet conventional designs remain limited by…
Deep Active Re-Labeling: Toward Noise-Resilient Annotation Efficiency
While Deep Active Learning (DAL) effectively reduces human annotation costs, its efficacy is constrained by human annotation errors. This i…
APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing
W4A4 quantization promises full utilization of INT4 Tensor Cores, yet group dequantization overhead on CUDA Cores has driven existing syste…
RadOT-Eval: Auditable Structured-Evidence Transport for Radiology Report Evaluation
Automatic evaluation is critical for high-stakes text generation, where errors often involve omitted findings, hallucinated content, polari…
TeamHerald@CHIPSAL 2026: Hate Speech Detection and Sentiment Analysis of Nepali Memes using Transformer-based Architectures and Ensemble Learning
The analysis of internet memes in the Nepali language is complicated by frequent code-mixing and a lack of established baseline resources.…
Unifying Object-Centric World Models and Diffusion Policy: A Hierarchical Framework for Multi-Stage Robotic Tasks
Visual world models have shown great potential in learning complex system dynamics. Recent advancements leverage these models as transition…
How Many Counterfactuals Does It Take? Probing VLM Hallucinations Through Circuits and Causal Effects
Visual Language Models (VLMs) are known to produce hallucinated predictions that are not grounded in visual evidence, yet existing approach…
Evaluating AI Investment Strategies
We study the problem of auditing a black-box algorithmic decision-maker from observable inputs and outputs alone. Our main result is an exa…
AI-Augmented Closed-Loop Quality Engineering: A Reference Architecture for Continuous Software Quality Intelligence
Software engineering quality remains a challenge due to disjointed processes between requirements, testing, and production, w…
Scaling Decision-Focused Learning to Large Problems with Lagrangian Decomposition
Decision-focused learning has shown great promise for addressing predict-then-optimize problems, particularly in the presence of under-spec…
Governance Controls for AI-Generated Test Artifacts in Autonomous Software Testing
Artificial Intelligence (AI) and Large Language Models (LLMs) are increasingly used in autonomous software testing; however, AI-generated t…
Knowledge Graphs and Reasoning LLMs for Finding Simple Yet Effective Transcriptomic Perturbation Predictors
Predicting the effect of an unseen gene knockout perturbation on transcriptomic gene expression remains a highly challenging problem for vi…
BLM-SGAN: Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation
Despite the success of image generation from text descriptions, it still faces challenges that are difficult to overcome in domains such as…
Intrinsic Selection and Particle Resampling for Inference-Time Scaling Beyond Domain Verifiability
Inference-Time Scaling (ITS) has largely succeeded in verifiable domains like math and coding, where cheap verification enables scalable ou…
sGPO: Trading Inference FLOPs for Training Efficiency in RLVR
Standard Reinforcement Learning with Verifiable Rewards (RLVR) training allocates a fixed rollout budget to every query, without regard for…
Intelligent Character Recognition of Handwritten Forms with Deep Neural Networks
The automatic processing of handwritten forms remains a challenging task, wherein detection and subsequent classification of handwritten ch…
Benchmarking Vision-Language-Action Models on SO-101: Failure and Recovery Analysis
Vision-Language-Action (VLA) models have demonstrated strong generalization in robotic manipulation, yet existing evaluations are primarily…
Cheap Reward Hacking Detection
A small transformer encoder is trained to map Terminal-Wrench trajectories onto a unit sphere where embedding distance approximates the $L_…
A multi-agent system for spine MRI report generation from multi-sequence imaging
Spinal pathology is a leading cause of pain and disability worldwide. Spine MRI is central to clinical evaluation, yet its interpretation r…
Few-shot Class-variable Incremental Audio Classification via Prototype Adaptation and Pseudo Class-variable Training
In the task of few-shot class-incremental audio classification, the number of classes is assumed to always increase without considering the…
Failure-Aware Refinement of Vision-Language Model for Lithography Defect Detection
Semiconductor lithography inspection requires reliable detection of small pattern defects such as bridge, burr, pinch, and contamination. I…
PolyBuild: An End-to-End Method for Polygonal Building Contour Extraction from High-Resolution Remote Sensing Images
Extracting building polygon contours from high-resolution remote sensing images is a fundamental task for various mapping applications. How…
From Statute to Control Flow: Span-Grounded Deontic Trees for Defeasible Scope Parsing
Rule-following agents tasked with executing policies and regulations often fail via Silent Scope Omission (SSO): a model applies a general…
PAI: Preserving Amplitude Information in Representation-Based Time-Series Anomaly Detection
Representation-based time-series anomaly detection algorithms significantly outperform other methods on diverse anomaly detection tasks. Ho…
Report on CHIIR 2026 Workshop on Generative AI and Academic Search (GAI&AS)
This report summarizes the CHIIR 2026 Workshop on Generative AI and Academic Search (GAI&AS), which examined how GenAI is reshaping academ…
PACT: Learning Diverse Diagnostic Strategies via Privileged Synthesis and Branch Consensus
Clinical diagnosis requires flexible use of multiple reasoning paradigms under incomplete patient information. Existing LLM-based medical a…
NutriMLLM: Multimodal Large Language Models for Dietary Micronutrient Analysis
Comprehensive estimation of dietary micronutrients from food images could improve clinical nutrition care, but training such models require…
Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops
Agent benchmarks score submissions with outcome verifiers that are typically hand-written and brittle, leaving them open to reward hacking.…
CARE: A Conformal Safety Layer for Medical Summarization
Large language models (LLMs) are increasingly used for medical summarization, but their outputs can omit medically important information an…
SpaceVLN: A Zero-Shot Vision-and-Language Navigation Agent with Online Spatial Cognitive Memory and Reasoning
Vision-and-Language Navigation in continuous environments requires agents to understand the spatial structure of previously unseen environm…
Sustainability and Artificial Intelligence: Necessary, Challenging, and Promising Intersections
Both digital economy and digital technology researchers increasingly recognize the need to better address the role that artificial intellig…
Understanding Quantization-Aware Training: Gradients at Quantized Weights Bias to the Low-Loss Basin
Post-training quantization (PTQ) converts a trained full-precision model into low-bit weights without task-level retraining, while quantiza…
TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech
Codec-based autoregressive (AR) speech language models have achieved strong text-to-speech (TTS) quality by modeling speech as sequences of…
SafeRun: Enabling Determinism in LLM Planning for Running
Large Language Models enable flexible natural-language planning but remain unreliable in determinism-critical domains due to their probabil…
ATM: Action-Consistency Transfer Matrix for Diagnosing and Improving Latent World Models
Latent world models are increasingly used for control and goal-conditioned planning, yet assessing whether their learned representations ar…
TRIAGE: Dialectical Reasoning for Explainable Risk Prediction on Irregularly Sampled Medical Time Series with LLMs
Clinical early warning systems built on electronic health records, in which clinical observations are recorded as irregularly sampled medic…
BareWave: Waveform-Native Flow-Matching Text-to-Speech
Removing intermediate representations and separately trained decoding stages has become an important direction in generative modeling. In t…
INFUSER: Influence-Guided Self-Evolution Improves Reasoning
Self-evolution offers a scalable path to stronger reasoning: a pretrained language model improves itself with only minimal external supervi…
Stage-1 Controls the Entropy Regime, Not the Outcome
Two-stage post-training -- a Stage-1 warm-start (supervised fine-tuning, SFT, or on-policy distillation, OPD) followed by Stage-2 reinforce…
See More, Think Deeper: Query-Expanded Visual Evidence and Answer-Clue Guided Reflection for Long Video Understanding
Recent advances in Video Large Language Models (Video-LLMs) have enabled performance on long-video understanding tasks. However, existing m…
OnlyDense: Reduced-Order Modeling for Lagrangian simulation
In science and engineering, Lagrangian simulation methods such as Smooth Particle Hydrodynamics (SPH) or Material Point Method (MPM) are of…
A Unifying Lens on Reward Uncertainty in RLHF
Reinforcement learning from human feedback (RLHF) is bottlenecked by reward hacking, where the policy exploits errors in a proxy rew…
FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention
Conventional LLMs keep the full KV cache loaded during decoding, causing a severe GPU memory bottleneck for ultra-long context serving. In…
Context-Fractured Decomposition Attacks on Tool-Using LLM Agents: Exploiting Artifact Provenance Gaps
Tool-using LLM agents interact with the world through actions that persist state in artifacts (e.g., workspace files or logs). Consequently…
Context Rot in AI-Assisted Software Development: Repurposing Documentation Consistency for AI Configuration Artifacts
Developers increasingly provide AI coding assistants with persistent context through configuration files such as CLAUDE.md, AGENTS.md, and…
Addressing Market Regime Changes and Heavy-Tailed Returns in Portfolio Optimization via Bayesian VAR and Elliptical Black-Litterman
Deep reinforcement learning (DRL) frameworks for portfolio optimization have shown promise for their ability to learn allocation rules dyna…
Hybridizing Equilibrium Propagation with Ising Machines for Efficient Energy-Based Learning
The rapid evolution of artificial intelligence has led to substantial advances in deep neural networks. Nonetheless, conventional GPU-based…
Optimizing Energy-based Neural Network Training with Coherent Ising Machine
While Ising machines serve as advanced physical solvers for the Ising model, enabling applications in combinatorial optimization and neural…
Autonomous Incident Resolution at Hyperscale: An Agentic AI Architecture for Network Operations
Cloud network infrastructure at hyperscale presents unique operational challenges where traditional human-driven incident response cannot k…
An Enhanced Geometric-Spectral Feature Learning Framework for Airborne Multispectral Point Cloud Classification
Multispectral point cloud (MPC) is composed of 3D spatial-spectral information, which holds tremendous potential for accurate land-cover cl…
Unveiling Privacy Risks in Multi-modal Large Language Models: Task-specific Vulnerabilities and Mitigation Challenges
Privacy risks in text-only Large Language Models (LLMs) are well studied, particularly their tendency to memorize and leak sensitive inform…
From USD Scenes to Knowledge Graphs: Zero-Shot Ontology Grounding with LLMs
Constructing knowledge graphs from 3D simulation scenes is essential for robot task reasoning, but the key bottleneck, grounding scene obje…
Steganography Without Modification: Hidden Communication via LLM Seeds
We demonstrate that widely deployed Large Language Model (LLM) inference stacks harbor a steganographic channel that requires no modificati…
Decoding Pedestrian Crossing Intention from Egocentric Vision via Vision Language Models
Egocentric vision offers a first-person view of human perception and decision making, yet its potential for traffic-safety prediction remai…
SEF-CLGC at SemEval-2026 Task 11: Logical Notation Impact on Language Model Performance
This paper revisits our pipeline called Syllogistic Evaluation Framework-Common Logic Grammar Construction (SEF-CLGC). We combine formal lo…
Unified Energy for Invariant and Independent Decoding in Diffusion Language Models
Diffusion Language Models (DLMs) enable parallel text generation by iteratively denoising a full sequence, offering attractive flexibility…
Crop Recommendation and Agricultural Query Answering System Using Spatio-Temporal Graph Neural Networks and Hybrid Retrieval Augmentation
This paper presents a unified system designed to support precision agriculture by integrating advanced weather prediction, crop recommendat…
CANS: Accelerating Multiuser Collaborative Edge Inference via Cooperative Autodidactic NeuroSurgeon
Recently, mobile edge computing (MEC)-enabled collaborative deep neural network (DNN) inference has emerged as a promising approach for del…
Culturally-Adapted Red-Teaming Across East and Southeast Asian Contexts: A Methodological and Comparative Analysis
Multilingual safety evaluation of large language models (LLMs) has predominantly relied on direct translation (DT) of English benchmarks in…
Pretrained, Frozen, Still Leaking: Auditing Cross-Encoder Attribute Transfer in EEG Foundation Models
EEG foundation-model releases are usually audited one endpoint at a time: raw-reconstruction, membership inference, identity linkage, or DP…
Resource-aware Computation-Communication Overlap for multi-GPU ML Workloads
The rapid growth of large-scale machine learning (ML) has made distributed training across multiple GPUs a fundamental component of modern…
Trustworthy Smart Fabs via Professional Proxies: Scaling Safe and Sustainable by Design (SSbD) through Industrial Data Spaces
The convergence of the 2026 European Union Safe and Sustainable by Design (SSbD) framework, Corporate Sustainability Due Diligence Directiv…
End-to-End Training for Discrete Token LLM based TTS System
Recent state-of-the-art (SOTA) text-to-speech (TTS) systems typically adopt a cascaded pipeline consisting of a speech tokenizer, an autore…
Self-Paced Curriculum Reinforcement Learning for Autonomous Superbike Racing in Simulation
Autonomous Racing has seen remarkable progress through deep Reinforcement Learning (RL), primarily for four-wheeled vehicles. However, moto…
EgoTactile: Learning Grasp Pressure for Everyday Objects from Egocentric Video
Estimating full-hand grasp pressure from egocentric video is critical for immersive VR and robotic manipulation, yet dense tactile sensing…
Proposal Refinement for Few-Shot Object Detection
Few-shot object detection has gained wide attention in recent years. Some excellent algorithms have been proposed to handle this task. Ho…
BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation
High-Dimensional Low-Sample Size (HDLSS) tabular domains (e.g., omics) are characterized by $n \ll m$, where $n$ = number of samples, and $…
Physics-Guided Sequence-Based Generative Framework for Acoustic Metamaterial Inverse Design
Acoustic metamaterial (AMM) inverse design is particularly challenging for broadband target responses due to acoustic dispersion: a structu…
Internalizing Geometric Law: Learning from Solver Residuals for Precision-Critical Generation
Large Language Models frequently hallucinate in precision-critical domains such as technical diagramming and mechanical design, where outpu…
Brain-Prompt Injection: A Route-Safety Audit for BCI-LLM Agents
BCI-to-agent pipelines turn decoded neural activity into an authorization channel for tool-use agents, exposing a new attack surface we cal…
A Universal Dense Football Event Representation Based on TabTransformer
Football event data constitute a rich spatiotemporal source for quantitative analysis of player actions in team sports. These datasets cont…
Conan-embedding-v3: Fusing Modality-Specific Models for Omni-Modal Embedding
Omni-modal retrieval promises a single embedding space for text, image, video, document, and audio inputs, but building such a unified retr…
Beyond Humans: Multispecies Animal Face Recognition Using Transfer Learning
Individual animal recognition can be useful in the search for lost or stolen pets, the tracking of individuals of endangered species, and t…
PhysScene: A Scene Graph Dataset for Scientific Visual Reasoning in Physics Experiments
Scene Graphs (SGs) provide structured representations of visual scenes by modeling objects and their pairwise relationships. Despite recent…
Scaling Neural Network Verification with Tensor Parallelism and Fully Sharded Data Parallelism
Formal neural network verification -- proving that a network satisfies safety properties for all inputs in a specified domain -- is…
Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short
Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language m…
Real-time body pose non-verbal communication with a consistency-based reliability measure
Body movement communicates intent at distances and in conditions where neither the face nor speech can be captured. We study the recogniti…
SAILS: Surrogate-based Analysis of Interactions via Local Effect Smooths
Feature interactions drive much of the predictive power of machine learning models, yet existing explanation methods only detect and quanti…
Can Data Work be Reparative?
We present an ethnographic study of an alternative approach to data work, developed by a civic-tech initiative that builds datasets for tra…
AI Assurance in UK Defence: Challenges in Operationalising JSP 936
This report examines practical challenges in operationalising JSP 936 Part 1 for AI assurance in UK Defence. Using a structured interpretiv…
Harness Engineering for Physical AI: Robot Middleware Is the Harness Layer
Robot middleware faces a new role in the era of Physical AI. Learned policies, planners, and vision-language-action (VLA) models now enter…
Context-Aware Deep Learning for Defect Classification in Atomic-Resolution STEM
Artificial intelligence is rapidly advancing materials characterization, yet most applications in electron microscopy rely solely on image…
LargeMonitor: Monitoring Online Task-Free Continual Learning via Large Pretrained Models
Online task-free continual learning (TFCL) requires intelligent agents to sequentially accumulate knowledge from an unbounded, non-stationa…
A Finetuned SpeechLLM for Joint Multi-Granular L2 Assessment and Natural-Language Rationales
Automated L2 speech assessment can assign proficiency labels, but often lacks interpretability. We propose a rubric-guided SpeechLLM for mu…
Memory Beyond Recall: A Dual-Process Cognitive Memory System for Self-Evolving LLM Agents
Long-term memory for an LLM agent is more than retrieving the right passage at the right time. Current memory systems collapse belief revis…
Targeting World Models to Compromise Robot Learning Pipelines
World models have recently seen a rapid growth in both their popularity and capability as more data efficient tools for generating robot tr…
Closing the Prior-Posterior Loop: Self-Reflective Molecular Design with Analysis-Driven LLM Iteration
Can a general-purpose large language model design molecules with the precision of a seasoned chemist? Current LLM-based frameworks answer t…
Emergence of Context Characteristics Sensitivity in Large Language Models
During instruction fine-tuning (IFT), large language models (LLMs) learn to follow instructions by using the provided context to answer a q…
Model Poisoning Against Federated Model Adaptation with Chain of Bit-Flips
Federated Learning (FL) allows a set of clients to collectively train a global model without sharing local training data. Giving the respon…
SecureClaw: Clawing Back Control of LLM Agents
Tool-using large language model (LLM) agents face two distinct security failures: unauthorized external actions and exposure of sensitive p…
FuseFSS: Efficient Secure LLM Inference with Function Secret Sharing
Two-server secure inference allows a client to query a hosted large language model (LLM) without revealing prompts or embeddings. Recent GP…
Safe-RULE: Safe Reinforcement UnLEarning
Offline safe reinforcement learning (Safe RL) enables policy learning without online interactions, making it suitable for safety-critical s…
CT-VAM: A Cerebello-Thalamic-Inspired Vision-Action Model for Efficient Visuomotor Control
Vision-language-action models have shown strong promise for robot manipulation, yet raw language is primarily needed to specify task intent…
Seeing the Hivemind: A Consensus-Aware Interaction Technique for Mitigating AI Homogenization
People are increasingly using AI for creative tasks such as writing. While adoption continues to grow, this form of use risks undermining i…
I Was Scrolling and Then I Saw a Pregnant Strawberry
AI minidramas (also known as fruit dramas) are short, algorithmically distributed generative AI video series featuring anthropomorphized ch…
Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes
Interpretability increasingly treats groups of components, not individual units, as the basic object, and proposes to find them by clusteri…
Shape Formation for the Cooperative Transportation of Arbitrary Objects Using Multi-Agent Reinforcement Learning
Cooperative object transportation is essential in numerous domains, ranging from industrial to domestic services. A popular transportation str…
AGENTSERVESIM: A Hardware-aware Simulator for Multi-Turn LLM Agent Serving
Multi-turn LLM agents interleave model calls with external tool invocations, shifting serving from stateless request processing to stateful…
Powering the Future of AI: Navigating the Trade-offs for Europe's Energy Transition and Net-Zero Goals
The rapid expansion of AI globally has led to the proliferation of energy-intensive hyperscale data centres (DCs), making them as a structu…
ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies
Vision-language-action (VLA) policies provide strong priors for language-conditioned manipulation, but remain brittle in off-nominal states…
ATN3D: Density-Aware LiDAR-Radar Early 3D Object Detection Under Extreme Sparsity
3D object detection is the backbone of perception for automated vehicles (AV) and broader intelligent transportation systems applications.…
FMplex: Model Virtualization for Serving Extensible Foundation Models
Foundation models (FMs) are increasingly used as backbones for downstream tasks across language, vision, time-series, and multimodal applic…
Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis
We study whether pretrained video foundation models encode intuitive-physics information in their frozen representations, and how this info…
ArtiFact: A Large-Scale Multi-Modal Cultural Heritage Dataset
Multi-modal data management has emerged as a central research topic in the database community, spanning data integration, semantic query pr…
Muon Learns More Robust and Transferable Features than Adam
Muon has recently emerged as a state-of-the-art optimizer for pretraining Large Language Models (LLMs) and vision classifiers. Despite its…
End-to-End Context Compression at Scale
Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress t…
Visual Prompting Meets Feature Reconstruction-Based Anomaly Detection with Dual-Teacher Supervision
Recent Anomaly Detection methods achieve perfect detection and segmentation scores on well-established datasets, such as MVTec. However, ma…
Transition-Based Digital Twin Modelling for Alzheimer's Disease under Sparse Longitudinal Data
Alzheimer's disease (AD) progression is highly heterogeneous and is typically observed through sparse and irregular longitudinal data, posi…
MeCo: One-Step MeanFlow-based Corrector for Multi-Channel Speech Separation
While discriminative models for multi-channel speech separation excel in reference-based metrics, they often exhibit suboptimal human liste…
An 84-Format Numeric Catalog with Bit-Exact Conformance Vectors: A Vendor-Neutral Reference for FP8, BF16, MXFP4, and Microscaling Formats
Numeric format proliferation in machine learning hardware -- FP8 (E4M3 and E5M2), BF16, MXFP4, microscaling block formats, and dozens of re…
Observability for Delegated Execution in Agentic AI Systems
Delegation-scoped execution is not identifiable from standard observables: audit logs and execution traces can be identical under multiple…
Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO
AI red teaming must continually adapt to evolving attackers and defenders. Reinforcement learning offers a promising approach to discoverin…
Hybrid Robustness Verification for Spatio-Temporal Neural Networks
With AI increasingly deployed in safety-critical systems, providing formal robustness guarantees for the underlying models is essential. Ex…
Difference-Aware Retrieval Policies for Imitation Learning
Parametric imitation learning via behavior cloning can suffer from poor generalization to out-of-distribution states due to compounding err…
Preserving Plasticity in Continual Learning via Dynamical Isometry
Continual training of deep neural networks under non-stationarity often leads to a progressive loss of plasticity, eventually limiting furt…
Data Synthesis and Parameter-Efficient Fine-Tuning for Low-Resource NMT: A Case Study on Q'eqchi' Mayan
Neural machine translation for digitally low-resource Indigenous languages is often hindered by extreme data scarcity, prompting reliance o…
Who Earns the Safety? Intervention-Aware Quantum Predictive Control with Safety Attribution
Hard safety filters are increasingly placed downstream of learned controllers to guarantee constraint satisfaction at run time. Yet a filte…
FASE: Fast Adaptive Semantic Entropy for Code Quality
Multi-agent code generation offers a promising paradigm for autonomous software development by simulating the human software engineering li…
Bandits for Efficient Experimentation: Adapting to Control Group, Preferences, and Context Drifts
We consider a variant of the linear contextual stochastic multi-armed bandits, where the learner must provide recommendations to a group of…
Topological Neural Operators
We introduce Topological Neural Operators (TNOs), a principled framework for operator learning on cell complexes that lifts neural operator…
AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing
World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inje…
PTL-Diffusion: Manifold-Aware Diffusion with Periodic Terminal Laws
Standard diffusion models typically use a single time-homogeneous Gaussian terminal distribution as the reference law for generation. While…
An Agency-Transferring Model-Free Policy Enhancement Technique
Training reinforcement learning (RL) policies from scratch is costly: it requires careful reward and environment design, extensive tuning,…
OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics
Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically…
A Survey on Large Language Model-Based Game Agents
Game environments provide rich, controllable settings that simulate many aspects of real-world complexity. As such, game agents offer a va…
TQA-Bench: Evaluating LLMs for Multi-Table Question Answering
The advance of large language models (LLMs) has unlocked great opportunities in complex multi-modal data management tasks, particularly in…
IDEQ -- Improving Diffusion Models for the Traveling Salesman Problem (TSP) by Leveraging the Structure of the Solution Space
We investigate diffusion models to solve the Traveling Salesman Problem. Building on the recent DIFUSCO and T2TCO approaches, we propose ID…
HA-VLN 2.0: An Open Benchmark and Leaderboard for Human-Aware Navigation in Discrete and Continuous Environments with Dynamic Multi-Human Interactions
Vision-and-Language Navigation (VLN) has been studied mainly in either discrete or continuous spaces, with little attention to dynamic, cro…
Can Global XAI Methods Reveal Injected Behaviours in LLMs? SHAP vs Rule Extraction vs RuleSHAP
Large language models (LLMs) can amplify misinformation, undermining societal goals such as the UN SDGs. We study three documented drivers…
FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks
This paper introduces FieldWorkArena, a benchmark for agentic AI targeting real-world field work. With the recent increase in demand for ag…
Modeling the Diachronic Evolution of Legal Norms: An LRMoo-Based, Component-Level, Event-Centric Approach to Legal Knowledge Graphs
Representing the temporal evolution of legal norms is a critical challenge for automated processing. While foundational frameworks exist, t…
Sound and Complete Neurosymbolic Reasoning with LLM-Grounded Interpretations
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but exhibit proble…
Discovering heuristics in a complex SAT solver with large language models
The Satisfiability problem (SAT) is fundamental in computational complexity theory and has a wide range of industrial applications. Optimiz…
CLPO: Curriculum Learning meets Policy Optimization for LLM Reasoning
Online reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving the reasoning abilities of larg…
MixReasoning: Switching Modes to Think
Reasoning models enhance performance by tackling problems in a step-by-step manner, decomposing them into sub-problems and exploring long c…
MatSciBench: Benchmarking the Reasoning Ability of Large Language Models in Materials Science
Large Language Models have shown strong scientific reasoning ability, but their performance on materials science problems remains less stud…
AlphaOPT: Formulating Optimization Programs with Self-Improving LLM Experience Library
Optimization modeling underlies critical decision-making across industries, yet remains difficult to automate: natural-language problem des…
TempoBench: Evaluating Temporal Causal Reasoning in Large Language Models
Temporal reasoning involves understanding how systems evolve over time through input-driven state transitions. A key aspect is temporal cau…
QuickLAP: Quick Language-Action Preference Learning for Semi-Autonomous Agents
Robots must learn from both what people do and what they say, but either modality alone is often incomplete: physical corrections are groun…
Knowing How to Edit: Reliable Evaluation Signals for Diagnosing and Optimizing Prompts at Query Level
Prompt optimization has become a central mechanism for eliciting strong performance from LLMs, and recent work has made substantial progres…
A Geometric Unification of Concept Learning with Concept Cones
Two traditions of interpretability have evolved side by side but seldom spoken to each other: Concept Bottleneck Models (CBMs), which presc…
A Geometric Theory of Cognition for Machine Intelligence
Developing artificial agents that unify representation, memory, adaptation, and prediction remains a fundamental challenge in artificial in…
MAR: Multi-Agent Reflexion Improves Reasoning Abilities in LLMs
LLMs have shown the capacity to improve their performance on reasoning tasks through reflecting on their mistakes, and acting with these re…
ReTreVal: Reasoning Tree with Validation and Cross-Problem Memory for Large Language Models
Every existing inference-time reasoning framework discards all failure context at problem boundaries, leaving a model solving problem 500 n…
Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning
Large reasoning models (LRMs) have attracted much attention due to their exceptional performance. However, their performance mainly stems f…
Dynamic Distributed Constraint Optimization and Metareasoning for Continual, Large-Scale Satellite Operations
As Earth-observing satellite constellations grow in size and capability, distributed onboard control offers a pathway to novel responses an…
Payoff scaling shapes cooperation in LLM agents across languages
Large language models (LLMs) are increasingly deployed as autonomous agents that negotiate, coordinate, and act on behalf of users. Whether…
Language-based Trial and Error Falls Behind in the Era of Experience
While Large Language Models (LLMs) excel in language-based agentic tasks, their applicability to unseen, nonlinguistic environments (e.g.,…
TAME: A Trustworthy Test-Time Evolution of Agent Memory with Systematic Benchmarking
Test-time evolution of agent memory represents a pivotal paradigm for advancing AGI, as it strengthens complex reasoning through experience…
Weak-Driven Learning: How Weak Agents make Strong Agents Stronger
As post-training optimization becomes central to improving large language models, we observe a persistent saturation bottleneck: once model…
Web Agents Should Use Typed Actions Instead of Click-Based Browsing
This position paper argues that building a reliable agentic Web requires shifting from low-level interaction primitives to typed actions su…
NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
Vision-Language-Action (VLA) models are advancing autonomous driving by replacing modular pipelines with unified end-to-end architectures.…
2-Step Agent: A Framework for the Interaction of a Decision Maker with AI Decision Support
Predictions from ML models support human decision making in several fields, including high-stakes ones such as healthcare and the judiciary…
An Alternative Trajectory for Generative AI
The generative artificial intelligence (AI) ecosystem is undergoing rapid transformations that threaten its sustainability. As models trans…
IRAM-Omega-Q: A Computational Framework for Uncertainty Regulation in Adaptive Agents
Adaptive agents operating under uncertainty must do more than optimize task outputs: they must maintain a workable internal state under noi…
Reflection in the Dark: Exposing and Escaping the Black Box in Reflective Prompt Optimization
Automatic prompt optimization (APO) has emerged as a powerful paradigm for improving LLM performance without manual prompt engineering. Ref…
Counterfactual Credit Policy Optimization for Multi-Agent Collaboration
Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles, but reinforcement learning f…
Signals Are Not States: Neuro-Symbolic Safeguards for Culturally Aware Classroom AI
Classroom AI systems increasingly infer high-level educational states such as engagement, confusion, collaboration, participation, and inst…
MC-CPO: Mastery-Conditioned Constrained Policy Optimization for Pedagogically Safe Intelligent Tutoring Systems
Intelligent tutoring systems increasingly rely on reinforcement learning to personalise instruction, yet optimising for observable engageme…
EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale
The convergence of large language models and agents is catalyzing a new era of scientific discovery: Agentic Science. While the scientific…
The Topological Dual of a Dataset: A Logic-to-Topology Encoding for AlphaGeometry-Style Data
AlphaGeometry represents a milestone in neuro-symbolic reasoning, yet its architecture faces a log-linear scaling bottleneck within its sym…
Explainable AML Triage with LLMs: Evidence Retrieval and Counterfactual Checks
Anti-money laundering (AML) transaction monitoring generates large volumes of alerts that must be rapidly triaged by investigators under st…
Deconstructing Superintelligence: Identity, Self-Modification and Différance
Self-modification is routinely treated as constitutive of artificial superintelligence (SI), yet modification is a relative action…
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics bec…
Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards
Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final-an…
Executable World Models for ARC-AGI-3 in the Era of Coding Agents
We evaluate an initial coding-agent system for ARC-AGI-3 in which the agent maintains an executable Python world model, verifies it against…
Engagement Process: Rethinking the Temporal Interface of Action and Observation
Task completion in digital and physical environments increasingly involves complex temporal interaction, where actions and observations unf…
Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems
LLM-based conversational AI agents struggle to maintain coherent behavior over long horizons due to limited context. While RAG-based approa…
ASH: Agents that Self-Hone via Embodied Learning
Long-horizon embodied tasks remain a fundamental challenge in AI, as current methods rely on hand-engineered rewards or action-labeled demo…
ANNEAL: Adapting LLM Agents via Governed Symbolic Patch Learning
LLM-based agents can recover from individual execution errors, yet they repeatedly fail on the same fault when the underlying process knowl…
CatalyticMLLM: A Graph-Text Multimodal Large Language Model for Catalytic Materials
Property prediction and inverse structural design of catalytic materials are typically modeled as two independent tasks: the former predict…
When Tabular Foundation Models Meet Strategic Tabular Data: A Prior Alignment Approach
Tabular foundation models based on pretrained prior-data fitted networks (PFNs) have shown strong generalization on diverse tabular tasks,…
Beyond Rational Illusion: Behaviorally Realistic Strategic Classification
Strategic classification (SC) studies the interaction between decision models and agents who strategically manipulate their features for fav…
Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy
We study the effect of different personas on sycophancy: a model's agreement with users even when the user is incorrect. The standard…
MBABench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance
LLM agents are increasingly expected to carry out end-to-end workflows, producing complete artifacts from high-level user instructions. To…
Advancing Mathematics Research with AI-Driven Formal Proof Search
Large language models (LLMs) increasingly excel at mathematical reasoning, but their unreliability limits their utility in mathematics rese…
LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs
Large Language Models (LLMs) achieve strong performance on logical reasoning benchmarks, yet their reliability remains uncertain. Existing…
CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents
Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool-use, and software engineering,…
Neural Scalable Symbolic Search Framework for Complex Logical Queries with Multiple Free Variables
Complex Query Answering (CQA) is a fundamental knowledge representation and reasoning task over incomplete knowledge graphs (KGs). Answerin…
Quantifying and Optimizing Simplicity via Polynomial Representations
Deep networks often exhibit a preference for "simple" solutions, and such a simplicity bias is widely believed to play a key role in genera…
VESTA: Visual Exploration with Statistical Tool Agents
Fitting quantitative models to data is a central step in scientific workflows, yet it remains one of the least automated. Recent agent-base…
ReSkill: Reconciling Skill Creation with Policy Optimization in Agentic RL
Agentic reinforcement learning (RL) enables LLM agents to improve continuously from environment rewards, yet the resulting policies do not…
WorldCoder-Bench: Benchmarking Physically Grounded 3D World Synthesis
Large language models (LLMs) are increasingly asked not only to write static interfaces, but to construct executable interactive worlds fro…
ChatHealthAI: Aligning Electronic Health Record Representations with Large Language Models for Grounded Clinical Reasoning
Large language models (LLMs) exhibit strong natural-language reasoning abilities for clinical decision support, but struggle to effectively…
The Shadow Price of Reasoning: Economic Perspective on Optimal Budget Allocation for LLMs
Inference-time scaling has emerged as a critical avenue for enhancing Large Language Models' performance, yet real-world deployment is cons…
Trivium: Temporal Regret as a First-Class Objective for Causal-Memory Controllers
Many current agentic systems and LLM pipelines correct mistakes by optimizing outcome reward. This addresses only the what of failure: when…
MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models
Mobile agents are increasingly expected to operate everyday applications from screenshots and language goals, where reliable control requir…
Entropy-Based Evaluation of AI Agents: A Lightweight Framework for Measuring Behavioral Patterns
AI agents are commonly evaluated using task success, reward, latency, and cost. These metrics are useful, but they often miss important asp…
A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR
Reinforcement learning from verifiable rewards (RLVR) improves reasoning even when the reward signal is spurious -- assigning credit to the…
Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation
While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this…
Towards Healthy Evolution: Exploring the Role and Mechanisms of Human-Agent Interaction in Self-Evolving Systems
Self-evolving agents improve through continual self-play and self-generated learning signals, but autonomous evolution can also cause capab…
An Infectious Disease Spread Simulation Based on Large Language Model Decision Making
Modelling individual decision-making during infectious disease outbreaks is crucial for understanding behavioural dynamics and informing ef…
Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration
Recent advances in LLM agents have enabled complex cognitive capabilities, such as multi-step reasoning, planning, and tool use, that incre…
Large Models for Time Series and Spatio-Temporal Data: A Survey and Outlook
Temporal data, including time series and spatio-temporal data, are pervasive in real-world applications. Generated in massive volumes by ph…
Toward autocorrection of chemical process flowsheets using large language models
The process engineering domain widely uses Process Flow Diagrams (PFDs) and Process and Instrumentation Diagrams (P&IDs) to represent proce…
Investigating the Histogram Loss in Regression
It is becoming increasingly common in regression to train neural networks that model the entire distribution even if only the mean is requi…
Strategic Integration of Artificial Intelligence in the C-Suite: The Role of the Chief AI Officer
The integration of Artificial Intelligence (AI) into corporate strategy has become critical for organizations seeking to maintain competiti…
Discovering Data Structures: Nearest Neighbor Search and Beyond
We propose a general framework for end-to-end learning of data structures. Our framework adapts to the underlying data distribution and pro…
Graph-to-SFILES: Control structure prediction from process topologies using generative artificial intelligence
Control structure design is an important but tedious step in P&ID development. Generative artificial intelligence (AI) promises to reduce P…
Complement or substitute? How AI increases the demand for human skills
Artificial Intelligence (AI) is transforming the nature of work, yet there is limited empirical evidence on how it affects demand for human…
FIT-Print: Towards False-claim-resistant Model Ownership Verification via Targeted Fingerprint
Model fingerprinting has emerged as a crucial mechanism for safeguarding the intellectual property of open-source models, offering a non-in…
Dealing with Annotator Disagreement in Hate Speech Classification
Hate speech detection is a crucial task, especially on social media where harmful content can spread quickly. Collecting social media conte…
Deep Tree Tensor Networks
Originating in quantum physics, tensor networks (TNs) have been widely adopted as exponential machines and parametric decomposers for recog…
Audio-FLAN: An Instruction-Following Dataset for Unified Audio Understanding and Generation of Speech, Music, and Sound
Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLM…
Rule-based autocorrection of Piping and Instrumentation Diagrams (P&IDs) on graphs
A piping and instrumentation diagram (P&ID) is a central reference document in chemical process engineering. Currently, chemical engineers…
LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty
We present LoTUS, a novel Machine Unlearning (MU) method that eliminates the influence of training samples from pre-trained models, avoidin…
Brain2Text Decoding Model Reveals the Neural Mechanisms of Visual Semantic Processing
Decoding sensory experiences from neural activity to reconstruct human-perceived visual stimuli and semantic content remains a challenge in…
Hyperflux: Pruning Reveals Importance
Network pruning is used to reduce inference latency and power consumption in large neural networks. However, most methods focus on empirica…
Robust Renal Mass Segmentation on CT: A Validation Study of an AI-Based Framework
Renal mass segmentation has important potential to enhance the clinical workflow, especially in settings requiring quantitative assessments…
Harmonia: End-to-End RAG Serving Optimization
Retrieval-Augmented Generation (RAG) improves the reliability of large language models by integrating external knowledge, but serving RAG p…
ePC: Fast and Deep Predictive Coding in Digital Simulation
Predictive Coding (PC) offers a brain-inspired alternative to backpropagation for neural network training, described as a physical system m…
ACTIVE-o3: Empowering MLLMs with Active Perception via Pure Reinforcement Learning
Active vision, also known as active perception, refers to actively selecting where and how to look in order to gather task-relevant informa…
dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching
Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form o…
Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones
Despite remarkable advances in coding capabilities, language models (LMs) still struggle with simple syntactic tasks such as generating bal…
AMix-1: A Pathway to Test-Time Scalable Protein Foundation Model
We introduce AMix-1, a powerful protein foundation model built on Bayesian Flow Networks and empowered by a systematic training methodology…
Learning Task Mixtures from Task Affinities: A Probabilistic Graphical Model for Supervised Fine-Tuning
Supervised fine-tuning performance for large language models depends strongly on how training budget is distributed across a heterogeneous…
CLONE: A 3DGS-Based Closed-Loop Differentiable Optimization Framework for Single-Image Normal Estimation
We propose CLONE, a 3DGS-based Closed-Loop differentiable Optimization framework for single-image Normal Estimation. The core idea is to co…
Unsupervised Partner Design Enables Robust Ad-hoc Teamwork
We introduce Unsupervised Partner Design (UPD), a population-free multi-agent reinforcement learning method for robust ad-hoc teamwork. UPD…
In-Context Reinforcement Learning via Communicative World Models
Reinforcement learning (RL) agents often struggle to generalize to new tasks and contexts without updating their parameters, mainly because…
Discovering Expert-Level Nash Equilibrium Algorithms with Large Language Models
Designing polynomial-time algorithms for approximate Nash equilibria (ANE) with provable worst-case guarantees is a fundamental open proble…
Video Understanding by Design: How Datasets Shape Video Models
Research in video understanding has advanced rapidly, driven by increasingly diverse datasets and more powerful model architectures. While…
I-Segmenter: Integer-Only Vision Transformer for Efficient Semantic Segmentation
Vision Transformers (ViTs) have recently achieved strong results in semantic segmentation, yet their deployment on resource-constrained dev…
Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings
The attention mechanism in a Transformer architecture matches key to query based on both content -- the what -- and position in a sequence…
Understanding Benchmark Language Under Weakened Formal Semantics
State-of-the-art NLP benchmarks require interpretation of natural language that specifies conditions, procedures, and exceptions, often rel…
Generation Properties of Stochastic Interpolation under Finite Training Set
This paper investigates the theoretical behavior of generative models under finite training populations. Within the stochastic interpolatio…
SecureVibeBench: Benchmarking Secure Vibe Coding of AI Agents via Reconstructing Vulnerability-Introducing Scenarios
Large language model-powered code agents are rapidly transforming software engineering, yet the security risks of their generated code have…
VFEM: Visual Feature Empowered Multivariate Time Series Forecasting with Cross-Modal Fusion
Large time series foundation models often adopt channel-independent architectures to handle varying data dimensions, but this design ignore…
Projection and Quantisation: A Unifying View of Learning to Hash, from Random Projections to the RAG Era
Approximate nearest neighbour (ANN) search underpins large-scale retrieval, increasingly within the retrieval-augmented generation pipeline…
Large Language Models for Imbalanced Classification: Diversity makes the difference
Oversampling is one of the most widely used approaches for addressing imbalanced classification. The core idea is to generate additional mi…
Efficient Onboard Vision-Language Inference in UAV-Enabled Low-Altitude Economy Networks via LLM-Enhanced Optimization
The rapid advancement of Low-Altitude Economy Networks (LAENets) has enabled a variety of applications, including aerial surveillance, envi…
TAO: Tolerance-Aware Optimistic Verification for Floating-Point Neural Networks
Neural networks increasingly run on hardware outside the user's control (cloud GPUs, inference marketplaces). Yet ML-as-a-Service reveals l…
PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits
Large Language Models (LLMs) are improving at an exceptional rate. With the advent of agentic workflows, multi-turn dialogue has become the…
SmartMixed: A Two-Phase Training Strategy for Adaptive Activation Function Learning in Neural Networks
The choice of activation function plays a critical role in neural networks, yet most architectures still rely on fixed, uniform activation…
Learning Quantized Continuous Controllers for Integer Hardware
Deploying continuous-control reinforcement learning policies on embedded hardware requires meeting tight latency and power budgets. Small F…
Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB
We find that current sentence-embedding models produce outputs with a consistent bias: every embedding $e$ decomposes as $\tilde e + \mu$,…
SMART: Shot-Aware Multimodal Video Moment Retrieval with Audio-Enhanced MLLM
Video Moment Retrieval is a task in video understanding that aims to localize a specific temporal segment in an untrimmed video based on a…
AttnRegDeepLab: A Two-Stage Decoupled Framework for Interpretable Embryo Fragmentation Grading
Assessing embryo fragmentation is crucial for predicting IVF success, yet manual grading is prone to subjectivity, and existing AI models s…
SAGE: Shape-Adapting Gated Experts for Adaptive Histopathology Image Segmentation
The significant variability in cell size and shape continues to pose a major obstacle in computer-assisted cancer detection on gigapixel Wh…
MedVision: Benchmarking Quantitative Medical Image Analysis
Current vision-language models (VLMs) in medicine are primarily designed for categorical question answering (e.g., "Is this normal or abnor…
SVRG and Beyond via Posterior Correction
Stochastic Variance Reduced Gradient (SVRG) and its variants aim to speed up training by using gradient corrections. Originally proposed ov…
Developing Distance-Aware Physics-Constrained Probabilistic Frameworks for Industrial Prognostics
Development of reliable and physically interpretable probabilistic frameworks for industrial prognostics remains nascent, and existing liter…
Unambiguous Representations in Neural Networks: An Information-Theoretic Approach to Intentionality
Representations pervade our daily experience, from letters representing sounds to bit strings encoding digital files. While such representa…
FADTI: Fourier and Attention Driven Diffusion for Multivariate Time Series Imputation
Multivariate time series imputation is fundamental in applications such as healthcare, traffic forecasting, and biological modeling, where…
Collaborative Edge-to-Server Inference for Vision-Language Models
We propose a collaborative edge-to-server inference framework for vision-language models (VLMs) that reduces communication cost while maint…
Exploring the Effect of Basis Rotation on NQS Performance
Neural Quantum States (NQS) are powerful variational representations of quantum many-body wavefunctions, yet their performance depends sens…
GenTSE: Enhancing Target Speaker Extraction via a Coarse-to-Fine Generative Language Model
Language Model (LM)-based generative modeling has emerged as a promising direction for TSE, offering potential for improved generalization…
Supracompetitive Pricing Under AI Monoculture
When competing sellers delegate pricing to a shared AI model, such as a large language model, correlated recommendations combined with perf…
Adversarial Instance Generation and Robust Training for Neural Combinatorial Optimization with Multiple Objectives
Deep reinforcement learning (DRL) has shown great promise in addressing multi-objective combinatorial optimization problems (MOCOPs). Never…
Vision-Based Early Fault Diagnosis and Self-Recovery for Strawberry Harvesting Robots
Strawberry-harvesting robots faced challenges such as poor visual perception, gripper misalignment, empty grasp/misgrasp, and slippage, whi…
A large-scale nanocrystal database with aligned synthesis and properties enabling generative inverse design
The synthesis of nanocrystals has been highly dependent on trial-and-error, due to the complex correlation between synthesis parameters and…
One if by Land, Two if by Sea, Three if by Four Seas, and More to Come -- Values of Perception, Prediction, Communication, and Common Sense in Decision Making
This work aims to rigorously define the values of perception, prediction, communication, and common sense in decision making. The defined q…
How Context Shapes Truth: Geometric Transformations of Statement-level Truth Representations in LLMs
Large Language Models (LLMs) often encode whether a statement is true as a vector in their residual stream activations. These vectors, also…
Revisiting Training Scale: An Empirical Study of Token Count, Power Consumption, and Parameter Efficiency
Research in machine learning has questioned whether increases in training token counts reliably produce proportional performance gains in l…
DYCP: Dynamic Context Pruning for Long-Form Dialogue with LLMs
Large Language Models (LLMs) increasingly operate over long-form dialogues with frequent topic shifts. While recent LLMs support extended c…
MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting
Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models; however, its reliance…
A Comparative Study of Student Perspectives on Technical Writing Feedback Quality: Evaluating LLMs, SLMs, and Humans in Computer Science Topics
To address the scalability of feedback in computer science while mitigating the privacy and cost limitations of commercial Large Language M…
Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers
Vision-Language Models (VLMs) integrate visual and textual knowledge into unified representations that increasingly underpin modern retriev…
XCR-Bench: Benchmarking Cross-Cultural Reasoning in LLMs via Culture-Specific Items and Hall's Triad
Cross-cultural competence in large language models (LLMs) requires understanding and adapting Culture-Specific Items (CSIs) across varying…
The Flexibility Trap: Rethinking the Value of Arbitrary Order in Diffusion Language Models
Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrar…
CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation
Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual con…
Meeting SLOs, Slashing Hours: Automated Enterprise LLM Optimization with OptiKIT
Enterprise LLM deployment faces a critical scalability challenge: organizations must optimize models systematically to scale AI initiatives…
Comparative evaluation of training strategies using partially labelled datasets for segmentation of white matter hyperintensities and stroke lesions in FLAIR MRI
White matter hyperintensities (WMH) and ischaemic stroke lesions (ISL) are key imaging biomarkers of cerebral small vessel disease (SVD) de…
Mobility-Embedded POIs: Learning What A Place Is and How It Is Used from Human Movement
Recent progress in geospatial foundation models highlights the importance of learning general-purpose representations for real-world locati…
More Bang for the Buck: Improving the Inference of Large Language Models at a Fixed Budget using Reset and Discard (ReD)
The performance of large language models (LLMs) on verifiable tasks is usually measured by pass@k, the probability of answering a question…
Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units
While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We…
UA-DCM: Uncertainty-aware Causal Decision Making via Effect Bound Decomposition
Causal inference from observational data can provide strong evidence for finding the best action in a decision-making scenario without havi…
MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering
The evolution of Large Language Model (LLM) agents for software engineering (SWE) is constrained by the scarcity of verifiable datasets, a…
VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation
While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consi…
How Hyper-Datafication Impacts the Sustainability Costs in Frontier AI
Large-scale data has fuelled the success of frontier artificial intelligence (AI) models over the past decade. This expansion has relied on…
DIVERGE: Diversity-Enhanced RAG for Open-Ended Information Seeking
Existing retrieval-augmented generation (RAG) systems often assume that each query has a single correct answer. This assumption overlooks o…
Reward Shaping for (Inference-Time) Alignment: A Stackelberg Game Perspective
Existing alignment methods directly use the reward model learned from user preference data to optimize an LLM policy, subject to KL regular…
Performative Learning Theory
Performative predictions influence the very outcomes they aim to forecast. We study performative predictions that affect a sample (e.g., on…
Variational Speculative Decoding: Rethinking Draft Training from Token Likelihood to Sequence Acceptance
Speculative decoding accelerates inference for (M)LLMs, yet a training-decoding discrepancy persists: while existing methods optimize singl…
Implementing Grassroots Logic Programs with Multiagent Transition Systems and AI (Full Version)
Grassroots Logic Programs (GLP) is a concurrent logic programming language in which logic variables are partitioned into paired readers and…
Generative Reasoning Re-ranker
Recent studies increasingly explore Large Language Models (LLMs) as a new paradigm for recommendation systems due to their scalability and…
When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents
Although computer-use agents (CUAs) hold significant potential to automate increasingly complex OS workflows, they can demonstrate unsafe u…
Kunlun: Establishing Scaling Laws for Massive-Scale Recommendation Systems through Unified Architecture Design
Deriving predictable scaling laws that govern the relationship between model performance and computational investment is crucial for design…
Cosmo3DFlow: Wavelet Flow Matching for Spatial-to-Spectral Compression in Reconstructing the Early Universe
Reconstructing the early universe from the evolved present-day universe is a challenging and computationally demanding problem in modern as…
Transforming Police-Car Swerving for Mitigating Isolated Stop-and-Go Traffic Waves: A Practice-Oriented Jam-Absorption Driving Strategy
Stop-and-go traffic waves, a major form of freeway congestion, impose severe and persistent adverse impacts, including reduced traffic effi…
On the Complexity of Offline Reinforcement Learning with $Q^\star$-Approximation and Partial Coverage
We study offline reinforcement learning under $Q^\star$-approximation and partial coverage, a setting that motivates practical algorithms s…
Know More, Know Clearer: A Meta-Cognitive Framework for Knowledge Augmentation in Large Language Models
Knowledge augmentation has significantly enhanced the performance of Large Language Models (LLMs) in knowledge-intensive tasks. However, ex…
Prescriptive Scaling Reveals the Evolution of Language Model Capabilities
Machine learning model performance improvements tend to arise from competition and application. For deployment, we consider prescriptive sc…
Condition-Gated Reasoning for Context-Dependent Biomedical Question Answering
Current biomedical question answering (QA) systems often assume that medical knowledge applies uniformly, yet real-world clinical reasoning…
Training-Free Intelligibility-Guided Observation Addition for Noisy ASR
Automatic speech recognition (ASR) degrades severely in noisy environments. Although speech enhancement (SE) front-ends effectively suppres…
Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments
Large-scale commercial search systems optimize for relevance to drive successful sessions that help users find what they are looking for. T…
A Mixed Diet Makes DINO An Omnivorous Vision Encoder
Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature…
From Conflict to Consensus: Boosting Medical Reasoning via Multi-Round Agentic RAG
Large Language Models (LLMs) exhibit high reasoning capacity in medical question-answering, but their tendency to produce hallucinations an…
CodeTaste: Can LLMs Generate Human-Level Code Refactorings?
LLM coding agents can generate working code, but their solutions often accumulate complexity, duplication, and architectural debt. Human de…
POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation
Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. To address this…
Context Over Compute Human-in-the-Loop Outperforms Iterative Chain-of-Thought Prompting in Interview Answer Quality
Behavioral interview evaluation using large language models presents unique challenges that require structured assessment, realistic interv…
RetroReasoner: A Reasoning LLM for Strategic Retrosynthesis Prediction
Retrosynthesis prediction aims to identify reactants that can synthesize a given product molecule. Although molecular large language models…
How Transformers Reject Wrong Answers: Rotational Dynamics of Factual Constraint Processing
When a decoder-only transformer is forced to process matched correct and incorrect single-token continuations of a factual query, the two p…
CHIMERA-Bench: A Benchmark Dataset for Epitope-Specific Antibody Design
Computational antibody design has seen rapid methodological progress, with dozens of deep generative methods proposed in the past three yea…
AgroOmni: A Large-Scale Multi-view Agricultural Dataset for Cross-Scale Multimodal Reasoning
Modern agricultural data is sourced from diverse platforms and spans multiple spatial scales, ranging from ground-level close-up photograph…
Component Ablation for Efficient Hybrid Language Model Architectures: Performance, Resilience, and Compression Implications
Hybrid language models combine softmax attention with linear-time sequence mechanisms such as state-space or linear-attention layers, but t…
DecepGPT: Schema-Driven Deception Detection with Multicultural Datasets and Robust Multimodal Learning
Multimodal deception detection aims to identify deceptive behavior by analyzing audiovisual cues for forensics and security. In these high-…
Vision Hopfield Memory Networks for Image Recognition
Recent vision backbones, such as Transformer families and state-space models like Mamba, have achieved remarkable progress on image recogni…
Train at Moving Edge: Online-Verified Prompt Selection for Efficient RL Training of Large Reasoning Model
Reinforcement learning (RL) has become essential for post-training large language models (LLMs) in reasoning tasks. While scaling rollouts…
UnWeaving the knots of GraphRAG -- turns out VectorRAG is almost enough
One of the key problems in Retrieval-augmented generation (RAG) systems is that chunk-based retrieval pipelines represent the source chunks…
Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks
System Instructions in Large Language Models (LLMs) are commonly used to enforce safety policies, define agent behavior, and protect sensit…
Distributional Open-Ended Evaluation of LLM Cultural Value Alignment Based on Value Codebook
As LLMs are globally deployed, aligning their cultural value orientations is critical for safety and user engagement. However, existing ben…
Securing Retrieval-Augmented Generation: A Taxonomy of Attacks, Defenses, and Future Directions
Retrieval-augmented generation (RAG) extends large language models (LLMs) with external knowledge, but this access path also introduces sec…
SatIR: Scalable High-Recall Constraint-Satisfaction-Based Information Retrieval for Clinical Trials Matching
Many important retrieval problems are not merely problems of semantic similarity, but problems of constraint satisfaction: a retrieved item…
Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning
Muon has emerged as a promising optimizer for large-scale foundation model pre-training by exploiting the matrix structure of neural networ…
Resilient Write: A Six-Layer Durable Write Surface for LLM Coding Agents
LLM-powered coding agents increasingly rely on tool-use protocols such as the Model Context Protocol (MCP) to read and write files on a dev…
Capacity-Controlled Global Attention for Graph Transformers
Global self-attention drives modern graph transformers, yet the softmax at its core imposes a structural constraint rarely examined directl…
Multilingual Training and Evaluation Resources for Vision-Language Models
Vision Language Models (VLMs) have achieved rapid progress in recent years. However, despite their growth, VLM development is heavily groun…
Watts-per-Intelligence Part II: Algorithmic Catalysis
We develop a thermodynamic theory of algorithmic catalysis within the watts per intelligence framework, identifying reusable computational…
Knee-xRAI: An Explainable AI Framework for Automatic Kellgren-Lawrence Grading of Knee Osteoarthritis
Grading knee osteoarthritis (KOA) on plain radiographs is poorly reproducible across readers. A single-grade disagreement on the Kellgren-L…
Speech Enhancement Based on Drifting Models
We propose Speech Enhancement based on Drifting Models (DriftSE), a novel generative framework that formulates denoising as an equilibrium…
RAS: a Reliability Oriented Metric for Automatic Speech Recognition
Automatic speech recognition systems often produce confident yet incorrect transcriptions under noisy or ambiguous conditions, which can be…
Skill Retrieval Augmentation for Agentic AI
As large language models (LLMs) evolve into agentic problem solvers, they increasingly rely on external, reusable skills to handle tasks be…
Simple Self-Conditioning Adaptation for Masked Diffusion Models
Masked diffusion models (MDMs) generate discrete sequences by iterative denoising under an absorbing masking process. In standard masked di…
When Do Diffusion Models learn to Generate Multiple Objects?
Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive…
DynamicPO: Dynamic Preference Optimization for Recommendation
In large language model (LLM)-based recommendation systems, direct preference optimization (DPO) effectively aligns recommendations with us…
Learning Behavioral Signals from Encrypted Smartphone Network Traffic
Human behavior is challenging to measure continuously at scale, yet traces of daily routines and well-being may be reflected in interaction…
Kernel Affine Hull Machines as Compute-Efficient Encoders for Frozen Semantic Spaces
Transformer-based semantic encoders are effective for retrieval, but in many deployments the recurring bottleneck is online query encoding…
Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation
A central goal of explainable AI is to express large language model (LLM) decision logic symbolically and ground it in internal mechanisms.…
Self-Mined Hardness for Safety Fine-Tuning
Safety fine-tuning of language models typically requires a curated adversarial dataset. We take a different approach: score each candidate…
APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music
Music popularity prediction has attracted growing research interest, with relevance to artists, platforms, and recommendation systems. Howe…
NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps
Existing Vision-Language Navigation (VLN) methods typically adopt an egocentric, step-by-step paradigm, which struggles with error accumula…
MinMax Recurrent Neural Cascades
We introduce MinMax Recurrent Neural Cascades (MinMax RNCs), a class of recurrent neural networks built from a novel form of recurrence ove…
From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs
Large-scale AI training is now fundamentally a distributed systems problem, and hardware failures have become routine operating conditions…
CalBench: Evaluating Coordination-Privacy Trade-offs in Multi-Agent LLMs
Personal AI assistants are beginning to act as delegates with access to calendars, inboxes, and user preferences. Calendar scheduling makes…
Quantifying Rodda and Graham Gait Classification from 3D Markerless Kinematics derived from a Single-view Video in a Heterogeneous Pediatric Clinical Cohort
Cerebral Palsy (CP) is a neurological disorder of movement and the most common cause of lifelong physical disability in childhood. Approxim…
Improving the Performance and Learning Stability of Parallelizable RNNs Designed for Ultra-Low Power Applications
Sequence learning is dominated by Transformers and parallelizable recurrent neural networks (RNNs) such as state-space models, yet learning…
High-Rate Quantized Matrix Multiplication II
This is the second part of the work investigating quantized matrix multiplication (MatMul). In part I we considered the case of calibration…
Margin-Adaptive Confidence Ranking for Reliable LLM Judgement
Jung et al. (2025) introduce a hypothesis testing framework for guaranteeing agreement between large language models (LLMs) and human judgm…
Ghosted Layers: Unconstrained Activation Alignment for Recovering Layer-Pruned LLMs
Layer pruning removes entire Transformer decoder blocks from large language models, but introduces a mismatch between the hidden state rece…
Evaluating Design Video Generation: Metrics for Compositional Fidelity
Generative video models are increasingly used in design animation tasks, yet no standardized evaluation framework exists for this domain. U…
Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps
Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives of…
WhiteTesseract: Reframing the Interpretation of Cultural Heritage through XR and Conversational AI
Cultural heritage exhibitions often struggle to sustain attention and support reflective engagement. Physical exhibitions rely on fixed int…
LEAP: Learnable End-to-End Adaptive Pruning of Large Language Models
Unstructured sparsity is now natively accelerated by recent GPU kernels and dataflow hardware, shifting the bottleneck from inference execu…
ConflictRAG: Detecting and Resolving Knowledge Conflicts in Retrieval Augmented Generation
Retrieval-Augmented Generation (RAG) systems implicitly assume mutual consistency among retrieved documents -- an assumption that frequentl…
Post-Trained MoE Can Skip Half Experts via Self-Distillation
Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variant further reduces compu…
Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution
Large Language Models have achieved strong performance on reasoning tasks with objective answers by generating step-by-step solutions, but…
FormalASR: End-to-End Spoken Chinese to Formal Text
Automatic speech recognition (ASR) systems are typically optimized for verbatim transcription, which preserves disfluencies, filler words,…
Causal Unlearning in Collaborative Optimization: Exact and Approximate Influence Reversal under Adversarial Contributions
Federated learning systems must support data deletion requests to comply with privacy regulations, yet retraining from scratch after each d…
DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation
Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term…
CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models
Vision-Language-Action (VLA) models have rapidly converged on a small set of architectural patterns: discrete-token autoregression (e.g. Op…
DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback
LLM-powered AI agents require high-frequency state exploration (e.g., test-time tree search and reinforcement learning), relying on rapid c…
Learning to Evaluate: Cost-Effective Model Evaluation on Unlabeled Data with Meta-Learning
The rapid advancement of machine learning has led to an unprecedented expansion of model ecosystems, making it increasingly difficult to as…
How Many Tools Should an LLM Agent See? A Chance-Corrected Answer
Before an LLM agent can use a tool, a retrieval system must decide which candidate tools to show to the agent. How long should that shortli…
Riemannian-Manifold Steering: Geometry-Aware Generative Autoencoders for Label-Free Steering
Steering a language model - intervening on its internal activations to change downstream behaviour - has recently expanded beyond linear in…
Polynomial Context-Truncation Sensitivity in Autoregressive Language Models: Sequential Wyner-Ziv Bounds for KV Cache Compression
We study the rate-distortion limits of online KV cache compression in autoregressive language models, formulating it as sequential Wyner-Zi…
The Strongest Teacher Is Not Always the Best Teacher: Student-Centric Answer Selection
LLM training increasingly relies on teacher-generated supervision, from synthetic responses to reasoning traces and tool-use demonstrations…
Locality-Aware Redundancy Pruning for LLM Depth Compression
Large language models are known to contain representational redundancy across network depth, making depth pruning an effective approach for…
Pruning and Distilling Mixture-of-Experts into Dense Language Models
Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded…
Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets
Large language models (LLMs) for code completion and generation are increasingly used in software development, yet they may reproduce train…
S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering
Long-horizon memory question answering often requires sparse evidence from heterogeneous histories, including events, object states, visual…
Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?
Fine-tuning large language models (LLMs) frequently induces catastrophic forgetting of prior capabilities. Recent work has shown that reinf…
MOOSE-Copilot: A Web-Based Interactive Assistant for Unified Exploratory and Fine-Grained Scientific Hypothesis Discovery
Large language models (LLMs) show remarkable potential in scientific hypothesis discovery. However, existing approaches face two critical l…
BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models
Vision-Language-Action (VLA) models have emerged as a promising paradigm for grounding visual-language understanding into real-world roboti…
Exploring Autonomous Agentic Data Engineering for Model Specialization
Large Language Models (LLMs) have demonstrated strong performance on general tasks, while often struggling to adapt to specialized domains…
Diffusion Image Generation with Explicit Modeling of Data Manifold Geometry
Image generative models aim to sample data points from the underlying data manifold, a task that requires learning and decoding a dense, lo…
Continuous Reasoning for Vision-Language-Action
Natural language is a powerful reasoning medium for language and vision-language models, but it is mismatched to the granularity of continu…
Beyond Independent Manipulation: Individual Fairness-aware Strategic Classification with Peer Imitation
Strategic classification (SC) investigates scenarios where agents manipulate their features to obtain favorable decisions from predictive m…
MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models
Preference alignment has substantially improved the observable behavior of large language models, yet it remains unclear what alignment cha…
Crazyflow: An Accurate, GPU-Accelerated, Differentiable Drone Simulator in JAX
High-quality, large-scale synthetic data from simulations is becoming a cornerstone for pushing the capabilities of robot algorithms. While…
Defenses & Enablers For Skill Injection Attacks on Terminal Based Agents
Large language model (LLM) agents increasingly rely on reusable skills, i.e., documents describing task-specific procedures. However, this in…
Easier to Mislead Than to Correct: Harmful and Beneficial Revision in LLM Conformity
Large language models are increasingly used in multi-agent systems, where they see and respond to other agents' answers. A key risk is conf…
Argument Collapse: LLMs Flatten Long-Form Public Debate
As LLMs are increasingly used to draft public-facing arguments, they may flatten public debate by repeatedly introducing the same polished,…
See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs
Generalization remains a central bottleneck for vision-language-action (VLA) models: under distractors, appearance shifts, and semantically…
Cosmos 3: Omnimodal World Models for Physical AI
We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and actio…
Calibration Data Trade-offs Across Capability Dimensions: Why Multi-Source Mixing Matters for High-Sparsity LLM Pruning
Post-training pruning compresses large language models to high sparsity using a small unlabelled calibration set, and recent work has concl…
Position: Deployed Reinforcement Learning should be Continual
Reinforcement Learning (RL) has received increasing attention and adoption in real-world use cases. Most of these systems follow a train-th…
Incremental Sheaf Cohomology on Cellular Complexes: O(1)-in-n Lazy Edit Processing under Bounded Local Geometry
We present an algorithmic framework for incremental maintenance of first sheaf cohomology $H^1(X; \mathcal{F})$ on dynamically evolving 1-d…
An Empirical Study of Data Scale, Model Complexity, and Input Modalities in Visual Generalization
Modern deep neural networks usually have large parameter scales and nonlinear hierarchical structures, and they have achieved strong perfor…
Multi-SPIN: Multi-Access Speculative Inference for Cooperative Token Generation at the Edge
Speculative inference (SPIN) was originally developed as an efficient architecture to accelerate Large Language Models (LLMs). In this work…
An Empirical Audit of Input Encoders for Multi-Channel Signal Transformers
Transformers consuming multi-channel scalar signals must embed $C$ simultaneous values into one $d_{\text{model}}$-dimensional vector per t…
GOTabPFN: From Feature Ordering to Compact Tokenization for Tabular Foundation Models on High-Dimensional Data
We investigate how to make small tabular foundation models effective for High-Dimensional, Low-Sample Size (HDLSS) tabular prediction witho…
Emotion-Aware Image Generation from Korean Diary Text via LLM-based Prompt Translation and LoRA Fine-Tuning
T2I models cannot effectively capture sentiment from various types of text, including diaries, as they primarily focus on visual object-rel…
OPRD: On-Policy Representation Distillation
On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm ha…
HANDOFF: Humanoid Agentic Task-Space Whole-Body Control via Distilled Complementary Teachers
For a humanoid robot to be deployed in the real world, the choice of command space (i.e., the interface between task planning and whole-bod…
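AI-composited photo slips past identity checks for the Kindai University entrance exam: are "biometric authentication systems" needed to counter proxy test-taking?
In the proxy test-taking case involving the Kindai University entrance exam, the Osaka District Public Prosecutors Office on June 8 indicted the suspect 野口瑞希 (35), a former cram school instructor from Daikoku, Naniwa Ward, Osaka, on charges including fraudulent obstruction of business, for allegedly applying to Kindai with the results of an Eiken test taken while impersonating a pupil.
Designers who are used by AI, designers who use AI
Which one you become is up to you.
Google cuts "AI Plus" price by 40% to 725 yen per month and doubles storage, in a pricing push to expand share
The company has also rolled out a series of price cuts and added perks for its higher-tier "AI Pro" and "AI Ultra" plans; with the "Plus" revision, it has completed a round of upgrades across all three plans.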
Why Apple’s slow-and-steady AI bet is starting to look pretty smart
Can Apple's new AI glow up put to bed accusations that it's losing an all-important industry race?
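OpenAI announces IPO filing; timing undecided
OpenAI, the U.S. company behind ChatGPT, announced on June 8 (local time) that it had confidentially filed for an initial public offering (IPO) in the United States.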
Mercor’s Brendan Foody calls out Sequoia, accusing it of ‘dual-pricing’ valuation tricks
Sequoia is just one of the top firms that sells the same equity at two different prices.
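Apple criticizes EU authorities for "accepting none of the solutions"; "Siri AI" will not be offered on iPhone and iPad in the EU
Apple says this is because the European Commission rejected every solution it proposed for complying with the Digital Markets Act (DMA), as required by the EU.
Why does consulting quality "vary"? Gartner explains the background
Among Japanese companies that use consulting services, fewer than half feel the results "exceed expectations." Why does "inconsistent quality," the biggest source of dissatisfaction, occur? And what steps should client companies take?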
As OpenAI files for IPO, Sam Altman’s eye-scanning company is doing layoffs, report says
Tools for Humanity, Sam Altman's identity verification company, is reportedly struggling to generate revenue and will downsize its staff.
Apple’s WWDC AI demos looked more real after $250M false ad settlement
The vibe of Apple's 2026 WWDC keynote felt like a spouse proudly listing all the honey-do-list items tackled. One subtle example: the many…
OpenAI files confidentially for IPO, following Anthropic
The filing comes a little more than a week after its main rival, Anthropic, also filed to go public, ramping up the race between the two AI…
Apple plays catch-up at WWDC
Apple spent much of its WWDC keynote highlighting fixes, performance improvements, and long-requested features before unveiling its upgrade…
Apple bets cheaper AI will woo small developers
As AI experimentation grows more expensive, Apple is waiving cloud API costs for developers with fewer than 2 million first-time App Store…
WWDC 2026: Everything announced on Siri AI, iOS 27, Apple Intelligence and more
Apple primarily made the case for an improved experience with its longstanding Siri assistant, which like most other announcements had a he…
Apple just taught your iPhone to finish your sentences, your photos, and your workflows
Apple is adding new AI-powered features to Safari, Shortcuts, and Password apps.
Apple will let you build workflows using AI in its new Shortcuts app
Shortcuts gets an AI upgrade, letting you describe the workflow you want in a prompt.
Apple’s Image Playground doesn’t suck anymore
Apple's AI image generator is getting a makeover that could make it more competitive.
Apple’s Photos app is getting new AI editing features
A new spatial "Reframe" feature will let users use AI to adjust perspectives.
Apple gives Siri its own dedicated app
Siri is finally getting its own app.
Apple is fixing the headache of splitting the bill with its new Siri in Camera feature
"If you're grabbing a bite with friends and point your iPhone at the bill, then [you can] select what you ordered to split the tab with App…
Apple’s long-awaited AI Siri overhaul is finally here
The idea behind the new "Siri AI" is to turn the assistant from a voice controlled assistant into an AI companion that can do a lot more.
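OpenAI to go public, filing for IPO after SpaceX and Anthropic
OpenAI announced that it has filed for an IPO with the U.S. Securities and Exchange Commission.
Price cut for consumer "Gemini": "Google AI Plus" goes from 1,200 yen to 725 yen per month, with storage doubled
Google announced a price cut for "Google AI Plus," its consumer subscription plan for AI services. The monthly fee will drop from 1,200 yen to 725 yen.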
Amazon now lets you design custom merch using AI
A new feature in the Amazon Shopping app allows users to generate designs with Alexa, then print them on products like T-shirts, hoodies, a…
WWDC 2026: What to expect, from Siri’s highly anticipated revamp to Apple Intelligence and iOS 27
Apple's WWDC nears: Here's what you can look forward to.
2026-06-08 (303 items)
Confidential submission of draft S-1 to the SEC
OpenAI confirms a confidential S-1 submission to the SEC and has not yet determined timing for further action.
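"Siri AI" debuts: "Apple Intelligence" gets a major overhaul, co-developed with Google; English version due within the year
At WWDC on June 8 (local time), Apple announced a new voice assistant, "Siri AI." It updates "Apple Intelligence" on the basis of "Apple Foundation Model," a multimodal AI model developed in partnership with Google, and, in "Siri," …
Panasonic Energy targets 2 trillion yen in sales in fiscal 2028, shifting its core business toward AI data centers
Panasonic Energy, the battery business under Panasonic Holdings, has revealed a medium-term policy targeting sales on the order of 2 trillion yen in fiscal 2028. If achieved, that would be substantial growth of about 1 trillion yen from fiscal 2025. It positions energy storage systems for data centers, whose power demand is rising with the spread of generative AI, as its growth pillar, and in fiscal 2026 to 2028…
How do you update "the very premises of work"? Examining the "requirements for AX" that IBM lays out
IBM has put forward an "AI operating model" as a new guideline for enterprise AX. This article draws on it to explore the requirements companies should address on the way to AX.
"AI = asking questions" is behind the times: agentic AI "Claude Cowork" expands admin features for organization-wide rollout
Anthropic announced that it will make "Claude Cowork," its AI-agent work-assistance feature, generally available on all paid plans. It is also expanding admin features for organization-wide deployment at the same time.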
Detecting and Mitigating Bias by Treating Fairness as a Symmetry Operation
Machine learning systems deployed in high stakes socioeconomic settings routinely display bias. We formalize bias as a symmetry breaking op…
DiBS: Diffusion-Informed Branch Selection
Sudoku is a representative constraint satisfaction problem that requires global structural reasoning under strict discrete constraints. The…
SafeGene: Reusable Adapters for Transferable Safety Alignment
Open-weight LLMs are increasingly fine-tuned into customized assistants, but downstream fine-tuning can weaken safety alignment and make mo…
Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory
Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge in artificial intelligence.…
CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions
Large language models have made substantial progress on mathematical reasoning, but existing benchmarks typically evaluate well-specified p…
Attack Selection in Agentic AI Control Evaluations Meaningfully Decreases Safety
An attacker that strategically chooses when to attack is much harder to catch than one that attacks indiscriminately. AI control is a safet…
CARVE-Q: Quantum-Proposed, Classically Certified Interactive Driving Repair
The critical question after a correct driving veto is not only whether a maneuver is unsafe, but whether the blocked interaction admits a l…
Position: Don't Just "Fix it in Post": A Science of AI Must Study Training Dynamics
What would it mean to have a scientific understanding of AI? Models are not static objects: they are snapshots of time-evolving processes s…
Accelerated Fourier SAT (AFSAT): Fully Realising a GPU-based Symmetric Pseudo-Boolean SAT Solver
We present Accelerated Fourier SAT (AFSAT), a GPU-accelerated solver for pseudo-Boolean satisfiability based on continuous local search (CL…
A Study of Parallel Continuous Local Search
We study parallel Continuous Local Search (CLS) as a solution approach for Boolean satisfiability problems with symmetric pseudo-Boolean (P…
AEGIS: A Backup Reflex for Physical AI
Long-horizon robot manipulation tends to fail gradually: one bad step degrades the state, and the policy spirals into a basin from which it…
A Geometric Account of Activation Steering through Angle-Norm Decomposition
Linear activation steering has gained popularity as a simple and empirically effective way to control language model behavior. More recentl…
OpenSkill: Open-World Self-Evolution for LLM Agents
Self-evolving agents require adaptation after deployment, but existing approaches assume a usable learning loop, such as curated skills, s…
AdMem: Advanced Memory for Task-solving Agents
Large Language Models (LLMs) show promise as tool-using agents but remain limited in long-horizon tasks that require remembering, organizin…
Evidence-Based Intelligent Diagnostic and Therapeutic Visualization System with Large Language Models: Multi-Turn Interaction and Multimodal Treatment Plan Generation
Aim: Existing AI-assisted traditional Chinese medicine diagnostic tools suffer from opaque reasoning processes, passive interaction, and li…
Workflow-to-Skill: Skill Creation via Routing-Workflow-Semantics-Attachments Decomposition
Large language model agents increasingly rely on Skills to encode procedural knowledge, yet high-quality Skills remain costly to hand-write…
Declarative Skills for AI Agents in Knowledge-Grounded Tool-Use Workflows
We study orchestration mechanisms for tool-using AI agents in realistic customer-service workflows over an unstructured knowledge base. We…
Quantum-Inspired Trace-Augmented Evidence Selection for Reasoning over Structured Hypothesis Spaces
Large language models (LLMs) now solve a wide range of expert-level exams at or above human level, yet remain brittle on specialised, evide…
Accounting for Context: Shaping Moral Credences for Value Alignment
Ensuring that agent behaviours are aligned with human moral values inevitably raises the problem of how to account for the plurality of mor…
Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning
Large language model (LLM)-based agents often make suboptimal tool-use decisions, including unsupported tool invocation and hallucinated di…
Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization
Recent post-training methods, particularly Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced the reasoning…
The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective
Foundation model agents are increasingly deployed for real-world decision-making, but suffer from the sim-to-real gap. While robotics and c…
StainFlow: Entity-Stain Tracking and Evidence Linking for Process Rewards in GUI Agents
Reinforcement Learning (RL) has become a promising approach for improving GUI Agents in long-horizon, stochastic digital environments, but…
Hierarchical Semantic-Constrained Heterogeneous Graph for Audio-Visual Event Localization
Open-vocabulary audio-visual event localization (OV-AVEL) jointly models audio-visual cues to recognize and temporally localize events, inc…
Front-to-Attractors: Modifying the Front-to-Front Heuristic in Bidirectional Search
Heuristics play a central role in the performance of bidirectional search algorithms, which commonly rely on two main classes. Front-to-end…
DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling
Recent advances in Large Reasoning Models (LRMs) demonstrate remarkable performance improvements by iteratively reflecting, exploring, and…
Beyond Post-hoc Explanation: Toward Glassbox AI via Probabilistic Mediation
Large language models are rapidly becoming infrastructural components in high-stakes institutional settings, including public administratio…
Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models
Many efforts to ensure frontier AI models are safe rely on monitoring their chain-of-thought (CoT) reasoning. If models become able to perf…
TOPSIS-RAD: Ranking According to Desires
Traditional TOPSIS derives its reference points -- the Positive Ideal Solution ($PIS$) and Negative Ideal Solution ($NIS$) -- from the obse…
DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning
Deep Research (DR) has emerged as a new agentic paradigm to tackle complex, open-ended research tasks, demanding systems that can iterative…
Off-Policy Evaluation with Strategic Agents via Local Disclosure
We study off-policy evaluation (OPE) under strategic behavior where decision subjects (or agents) respond to a decision maker's policy by s…
Online Pandora's Box for Contextual LLM Cascading
Motivated by Large Language Model (LLM) cascading, we propose an online contextual Pandora's Box model for adaptively querying and selectin…
Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle
As foundation models advance and agent scaffolding becomes increasingly sophisticated, agents have demonstrated remarkable proficiency in c…
How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope
Frontier AI systems are bridging the gap between intelligence and utility by shifting from conversational assistants to autonomous agents t…
Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs
Prompt injection attacks have become an increasing vulnerability for LLM applications, where adversarial prompts exploit indirect input cha…
When Does Multi-Agent Collaboration Help? An Entropy Perspective
Multi-agent systems (MAS) have emerged as a prominent paradigm for leveraging large language models (LLMs) to tackle complex tasks. However…
Trading Engagement for Sustainability: Carbon-Aware Re-ranking for E-commerce Recommendations
E-commerce recommender systems strongly influence which products users consider and purchase, yet sustainability signals such as Product Ca…
Autonomous heterogeneous catalyst discovery with a self-evolving multi-agent digital twin
Theoretical heterogeneous catalysis promises rapid catalyst discovery, yet computational and machine-learning predictions often deviate fro…
Human Adults and LLMs as Scientists: Who Benefits from Active Exploration?
A long-standing finding in the causal learning literature is that adults struggle to identify conjunctive causal rules, where an effect req…
A Geometric Gaussian Mixture Representation of Plane Curves
We introduce a user-defined probabilistic polygonal representation for plane curves. Given a curve, we select vertices on the curve and con…
Which Anatomy Matters Under Limited Labels? A Data-Efficient Anatomy-Aware Benchmark for Cardiac Pathology Prediction
Numerous medical imaging problems must be solved under limited labels and constrained compute, yet it remains unclear whether performance g…
FP8 is All You Need (Part 1): Debunking Hardware FP64 as the HPC Holy Grail
Conventional HPC dogma holds that native hardware FP64 silicon is the irreducible foundation of scientific computing -- the "holy grail" of…
DxPTA: An Architecture Design Space Exploration with Optical Dataflow-guided Strategy for HW/SW Co-Design of Photonic Transformer Accelerators
Transformer-based networks have emerged as prominent AI models with state-of-the-art performance, which potentially pave the way toward art…
P-Cast Precision in FP8 Attention: Sink-Induced Collapse and the Optimality of S=2^8
FP8 (E4M3) acceleration for attention computation offers significant throughput gains, but the 3-bit mantissa introduces precision challeng…
Agentic Large Language Models for Automated Structural Analysis of 3D Frame Systems
Large language models (LLMs) have emerged as powerful foundation models with strong reasoning capabilities across domains. Beyond reactive…
Attention Consistent Longitudinal Medical Visual Question Answering Guided by Vision Foundation Models
Longitudinal medical visual question answering (VQA) requires reasoning about anatomical differences between an image of a current time poi…
Attention-Guided Autoencoder Fusion for Insulator Defect Detection Using UAV Transmission-Line Imaging
Automated defect detection in high-voltage transmission-line insulators remains challenging due to severe class imbalance, large scale vari…
Synthetic Benchmarks Overstate Forward-Forward Scaling: Real-Data Limits of Layer-Local Training
Forward-Forward (FF) learning [Hinton, 2022] replaces backpropagation with strictly layer-local goodness updates. Recent FF-CNN work has na…
Coordinated optimization of departure sequencing and section-track allocation in railway short-term concentrated departure scenarios based on QUBO and hybrid quantum algorithms
This study examines the coordinated optimization of departure sequencing and section-track allocation in railway short-term concentrated de…
Queen-Bee Agents: A BeeSpec-Centered Architecture for Governed Enterprise MCP Orchestration
Enterprise agent systems increasingly need to connect large language models to private tools, internal knowledge, and Model Context Protoco…
FAIR-Calib: Frontier-Aware Instability-Reweighted Calibration for Post-Training Quantization of Diffusion Large Language Models
Diffusion Large Language Models (dLLMs) refine tokens iteratively but commit them irreversibly, leading to a "stability lag" where early de…
Geometric Second-Order Feature Correlation Learning for Self-Supervised Speech Emotion Recognition
Self-supervised learning (SSL) yields powerful, context-rich representations for speech emotion recognition (SER), yet aggregating these re…
Multi-Scale Feature Attention Network for Polymer Classification using THz Dual-Comb Spectroscopy
Reliable polymer identification is essential for ensuring the quality and safety of recycled plastics, yet conventional sorting and spectro…
IRAF: Interference-Resilient Adaptive Fusion for Noise-Robust End-to-End Full-Duplex Spoken Dialogue Systems
Full-duplex spoken dialogue models allow voice agents to listen and speak concurrently, enabling natural interaction with real-time overlap…
MacArena: Benchmarking Computer Use Agents on an Online macOS Environment
Computer-use agents (CUAs) operate graphical user interfaces (GUIs) through vision and control primitives, and their capabilities have adva…
AI-Driven Test Case Generation from Natural Language Requirements: A Survey of Techniques and Research Gaps
Software testing is critical for verifying that systems meet specified requirements, yet remains among the most time-consuming and expensiv…
WAV: Multi-Resolution Block Residual Routing for Deep Decoder-Only Transformers
Residual connections are central to training deep Transformers, but standard PreNorm residual streams aggregate sublayer updates with fixed…
NTILC: Neural Tool Invocation via Learned Compression
Agentic tool-calling language models depend on large registries of callable APIs, functions, and local actions. Placing full tool specifica…
MalTree: Tracing Malware Evolution from Embeddings at Scale
Malware detection remains largely reactive: machine learning models trained on known samples degrade as threats evolve. Understanding evolu…
Generative Models Erode Human Temporal Learning Through Market Selection
We argue that modern generative models create structural risks for knowledge and cultural production at current, sub-AGI capability levels.…
Direct 3D-Aware Object Insertion via Decomposed Visual Proxies
Object insertion aims to seamlessly composite a reference object into a specified region of a background image. Recent diffusion-based meth…
Re-Centering Humans in LLM Personalization
Despite growing interest, most evaluations of large language models' (LLMs') personalization abilities have relied on synthetic data. It re…
FIGMA: Towards FIne-Grained Music retrievAl
Retrieving music using natural language descriptions has improved with contrastive audio-text models such as CLAP, but current systems rema…
ChronoForest: Closed-Loop Multi-Tree Diffusion Planning for Efficient Bridge Search and Route Composition
How can we plan long-horizon routes that reach designated goals, visit required waypoints, and remain short when only short-horizon offline…
What Matters When Cotraining Robot Manipulation Policies on Everyday Human Videos?
Human video datasets used for cotraining robot manipulation policies largely consist of curated demonstrations where motions are orchestrat…
How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures
Failures in language model reasoning emerge through distinct processes that leave identifiable signatures in the reasoning trace. We charac…
CAF-Gen: A Multi-Agent System for Enriching Argumentation Structures
Formalizing complex reasoning from natural text is one of the central challenges in computational linguistics. It requires systems to under…
Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers
Despite high accuracy, Vision Transformer (ViT) predictions can be driven by spurious cues, raising the need to understand their inner work…
HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule
Court judgments are central to legal practice and jurisprudence, yet discourse analysis of Hong Kong judgments has received limited attenti…
The Geography of Algorithmic Judgment: LLM Intermediaries, Place Identity, and Racial Steering in Housing Search
Large language models (LLMs) are rapidly assuming an intermediary role in housing search through the integration of listing platforms withi…
MMBU: A Massive Multi-modal Biomedical Understanding Benchmark to Probe the Perception Capabilities of Vision-Language Models
Vision and language models (VLMs) hold immense promise to transform biomedical imaging workflows, from detecting lesions in chest X-rays to…
Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation
We study the transformation of autoregressive models (ARLMs) into diffusion language models (DLMs). Rather than pretraining from scratch, p…
Does Topic Sentiment Cause Perceived Ideology? Comparing Human and LLM Annotations in Political News Articles
We ask whether topic sentiment has a causal effect on perceived political ideology, and whether the answer depends on who assigns the ideol…
ShallowBench: Benchmarking Generative Drug Design Models on Shallow-Pocket Targets
While generative AI models have demonstrated remarkable success in structure-based drug design, they predominantly rely on deep binding poc…
MSAIC-Net: A Multi-Scale Attention and Imbalance-Aware Contrastive Network for ECG-Based Myocardial Substrate Abnormality Detection
Myocardial substrate abnormalities, such as myocardial scar and myocardial infarction (MI), are associated with adverse cardiovascular outc…
SCOUT: Semantic scene COverage via Uncertainty-guided Traversal
Robots that operate over extended periods should not merely visit space; they should progressively understand it. Yet most 3D scene graph p…
Multilingual Multi-Speaker Unit Vocoders: A Systematic Analysis of Discrete Speech Representations
Discrete speech units obtained via k-means clustering of self-supervised embeddings entangle phonetic, speaker, and language information, c…
HybridCodec: Fast Dual-Stream, Semantically Enhanced Neural Audio Codec
The popularity of neural audio codecs as speech tokenizers has surged with the advent of Multimodal Large Language Models. New codec archit…
Evidence Graph Consistency in Retrieval-Augmented Generation: A Model-Dependent Analysis of Hallucination Detection
Retrieval-Augmented Generation (RAG) reduces but does not eliminate hallucination in large language models. Existing detection methods rely…
AxisGuide: Grounding Robot Action Coordinate System in RGB Observations for Robust Visuomotor Manipulation
Visuomotor manipulation policies trained via large-scale behavior cloning have achieved strong semantic scene understanding, yet often fail…
Optimal Rates for Generalization of Gradient Descent Methods with Deep Neural Networks
Recent progress has been made in understanding the statistical generalization performance of gradient descent methods for overparameterized…
Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods
Understanding the generalization performance of over-parameterized neural networks has become a central topic in deep learning theory. Whil…
Mind the Gap: Bridging Behavioral Silos with LLMs in Multi-Vertical Recommendations
In multi-vertical e-commerce platforms like DoorDash, relatively newer product verticals such as grocery and retail present a significant o…
What Your Posts Reveal: A Benchmark and Agentic Framework for User-Level Privacy Leakage on Social Media
Public social media posts can reveal private information through weak cues scattered across text, images, or metadata. Such leakage is ofte…
Exploring Reinforcement Learning for Fluid Transitions Between Clinical Mental Healthcare and Everyday Wellness Support
Mental health struggles wax and wane, yet clinical and wellness interventions typically operate separately, causing frequent breakdowns at…
Lane Change Trajectory Planning for Personalized Driving Comfort and Mobility Efficiency
Lane changing entails simultaneous longitudinal and lateral motions that affect driving comfort and mobility efficiency. Because these moti…
Breaking the Lock-in: Diversifying Text-to-Image Generation via Representation Modulation
Recent text-to-image models built on large-scale Transformer backbones and flow-based objectives deliver strong text-image alignment and hi…
SCALE: Scalable Cross-Attention Learning with Extrapolation for Agentic Workflow Scheduling
Agentic Large Language Model (LLM) systems decompose complex tasks into workflow Directed Acyclic Graphs (DAGs) whose primitives must be sc…
PandaAI: A Practical Agent CQ2 for Neuro-symbolic Data Analysis And Integrated Decision-Making in Quantitative Finance
While deep learning has excelled in various domains, its application to sequential decision-making in finance remains challenging due to th…
Progress-SQL: Improving Reinforcement Learning for Text-to-SQL via Progressive Rewards
Reinforcement learning has recently shown promise in improving large language models for Text-to-SQL generation, yet existing methods typic…
Hearing the Unspoken: Language Model Priors for Acoustic Adversarial Attacks
Automatic Speech Recognition (ASR) systems operating in real-time settings must process acoustic input under strict temporal constraints, w…
Think Like a Pilot: Fine-Grained Long-Horizon UAV Navigation
Language-guided UAV agents must execute long-horizon semantic instructions while producing smooth, physically feasible continuous flight co…
LLM Agent-Assisted Reverse Engineering with Quantitative Readability Metrics
Automatic decompilers produce functionally correct but often unreadable C code. This paper addresses one stage of the reverse engineering w…
Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces
Modern reasoning models offer surprisingly strong zero-shot performance on challenging multi-label tasks that require selecting a small set…
MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models
The new era has witnessed a remarkable capability to extend Vision-Language Models (VLMs) for tackling tasks of video understanding. While…
Modeling Nonlinear Feature Interactions with Product-Unit Residual Networks
Understanding nonlinear feature interactions is crucial in science and engineering, yet standard multilayer perceptrons (MLPs) often captur…
EgoPressDiff: Multimodal Video Diffusion for Egocentric UV-Domain Hand-Pressure Estimation
Estimating hand-surface contact pressure from an egocentric view is crucial for AR/VR devices, robotic imitation, and ergonomic analysis. E…
Neuro-Symbolic Learning for Long-Horizon Task Planning Under Complex Logical Constraints
Task planning often suffers from severe efficiency bottlenecks when robots must reason over long-horizon action sequences under complex log…
FreeAnimate: Training-Free Human Image Animation with Preview-Guided Denoising
Human Image Animation has seen significant advancements, primarily driven by diffusion models. However, existing methods typically demand s…
Beyond Skeletons: Learning Animation Directly from Driving Videos with Same2X Training Strategy
Human image animation aims to generate a video from a static reference image, guided by pose information extracted from a driving video. Ex…
EASE-TTT: Evidence-Aligned Selective Test-Time Training for Long-Context Question Answering
Long-context question answering (QA) remains challenging for smaller language models even when answer-bearing evidence is already present i…
SpectCount: Spectrotemporal Counting via Synthetic Signals Improves Large Audio Language Models
Large audio language models (LALMs) extend large language models with an audio encoder and large-scale audio data. However, the scarcity of…
ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning
Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additio…
The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning
Deploying Small Language Models (SLMs) on edge devices requires efficient fine-tuning strategies that adapt models to new tasks without deg…
Didact: A Cross-Domain Capability Discovery System for Defence
Policymakers in defence and defence-aligned sectors must monitor rapidly evolving research alongside sector priorities relevant to operatio…
SS-TPT: Stability and Suitability-Guided Test-Time Prompt Tuning for Adversarially Robust Vision-Language Models
Vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition but remain highly fragile under adversarial perturbations.…
Auditing Training Data in Domain-adapted LLMs: LoRA-MINT
We present LoRA-MINT, a new methodology for Membership Inference Test (MINT) applied to recent Large Language Models (LLMs) fine-tuned for…
When is 3D Worth It? A Resource-Performance Frontier for CNNs and Transformers in Lung CT
Three-dimensional models are widely assumed preferable for volumetric medical imaging, yet their practical value depends on whether perform…
OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios
Hallucination detection is essential for the reliable deployment of large language models (LLMs). However, existing evaluations face two co…
DaX: Learning General Pathology Representations Across Scales
Computational pathology requires visual representations that transfer across diverse clinical endpoints and remain robust to variation in m…
Don't Pause: Streaming Video-Language Synchrony for Online Video Understanding
Online Video Large Language Models (Video-LLMs) have advanced toward seamless human-AI interaction through frame-by-frame processing and pr…
DataEvolver: Automatic Data Preparation for Large Language Models through Multi-Level Self-Evolving
High-quality training data is essential to large language models (LLMs) and typically requires extensive and costly manual curation. Existi…
A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders
We propose a unified mathematical framework for a geometric understanding of concept learning and neuron interpretation in sparse autoencod…
Towards Unified Song Generation and Singing Voice Conversion with Accompaniment Co-Generation
While song generation and singing voice conversion (SVC) have evolved significantly, they have long been developed in isolation: the former lac…
Phonetic Error Analysis of Raw Waveform Acoustic Models
We analyse error patterns of raw waveform acoustic models on TIMIT phone recognition beyond the overall phone error rate (PER). PER is deco…
Never Seen Before: Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image based on a query composed of a reference image and a relative c…
STREAM: Stochastic Riemannian Flow Matching with Anisotropic Decoder for Digital Histopathology Image Generation
Synthetic histopathology image generation addresses critical challenges in computational pathology, including patient privacy and the growi…
TRACE: Trajectory Reasoning through Adaptive Cross-Step Evidence Aggregation for LLM Agents
Autonomous LLM agents can pursue hidden malicious objectives through sequences of individually benign actions, making sabotage difficult to…
SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating
Deep research agents have demonstrated remarkable capabilities in complex information-seeking tasks, yet this power comes at a steep comput…
dots.tts Technical Report
We present dots.tts, a 2B-parameter continuous autoregressive text-to-speech (TTS) foundation model that models speech in a continuous late…
On the Geometry of On-Policy Distillation
On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly unders…
MetaConfigurator: AI-Assisted RDF Authoring from JSON Data
Scientific workflows increasingly generate structured JSON data that is easy to exchange but difficult to interpret consistently across sys…
GP-Adapter: Gaussian Process CLIP-Adapter for Few-Shot Out-of-Distribution Detection
We propose GP-Adapter, a training-free framework that augments CLIP (Contrastive Language-Image Pre-training) with Gaussian Process (GP) un…
DIFFRACT: Neuralized Utility Maximization for Wireless Networks by Differentiable Programming
Next-generation wireless networks, including satellite-to-Open RAN systems, demand agile and intelligent resource management capable of han…
OffQ: Taming Structured Outliers in LLM Quantization by Offsetting
Low-bit quantization has been widely adopted to accelerate the inference of large language models (LLMs) by significantly reducing computat…
Native3D: End-to-End 3D Scene Generation via Unified Mesh-Texture Modeling and Semantic Alignment
This paper presents Native3D, the first end-to-end 3D scene generation framework that completely bypasses 2D intermediate representations.…
The Three-Ring Architecture: Governing Agents in the Era of On-Platform Organisations
The current phase of enterprise AI deployment faces a structural failure: organisations are acquiring agentic capability without the infras…
REMEDI: A Benchmark for Retention and Unlearning Evaluation in Multi-label Clinical Disease Inference
Language models trained for clinical disease inference are trained on patient data, which may include sensitive and private information, an…
From Privacy to Workflow Integrity: Communication-Graph Metadata in Autonomous Agent Interoperability
Agent-interoperability protocols such as A2A and MCP standardize what agents say to one another, but assume address-based transport over HT…
UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding
Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million p…
Textual Supervision Enhances Geospatial Representations in Vision-Language Models
Geospatial understanding is a critical yet underexplored dimension in the development of machine learning systems for tasks such as image g…
RETROSPECT: RETROsynthesis via Sequential Prediction, and Chemically Transformed-ranking
Single-step retrosynthesis needs both accurate first-ranked suggestions and candidate lists that are rich enough for downstream selection.…
An Abstract Architecture for Explainable Autonomy in Hazardous Environments
Autonomous robotic systems are being proposed for use in hazardous environments, often to reduce the risks to human workers. In the immedia…
DualGate-Net: A Prior-Gated Dual-Encoder Framework for Histopathology Cell Detection
Cell detection in histopathology images strongly depends on surrounding tissue context, where visually similar cells may belong to differen…
DEFINED: A Data-Efficient Computational Framework for Fine-Grained Creativity Assessment in Debate Scenarios
Human creativity has emerged as a critical competency in the era of large language models. Assessing creativity in complex, open-ended envi…
When Large Language Models Fail in Healthcare: Evaluating Sensitivity to Prompt Variations
Large Language Models (LLMs) are increasingly used in healthcare for tasks such as clinical question answering, diagnosis support, and repo…
Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation
Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural-language instructions while navigating in…
AI Sovereignty: A Qualitative Model of Strategic Competition as AI Becomes an Instrument of National Power
AI sovereignty is the extent to which a nation independently controls its artificial intelligence (AI) technologies. The race toward ever-m…
Where Rectified Flows Leak: Characterising Membership Signals Along the Interpolation Path
Understanding what generative models retain from training data remains challenging, with implications for copyright and privacy. Beyond ver…
Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition
Instruction-following audio language models (ALMs) can be augmented with explicit acoustic cues, yet it remains unclear whether such cues a…
CULTURESCORE: Evaluating Cultural Faithfulness in Video Generation Models
As video generation models like Veo 3.1 and LTX-2 advance, their ability to accurately represent diverse global cultures remains a critical…
SV-Detect: AI-generated Text Detection with Steering Vectors
Detecting machine-generated text is especially difficult under distribution shift, such as transfer across domains, source models, and edit…
Hierarchical Certified Semantic Commitment for Byzantine-Resilient LLM-Agent Collaboration
Byzantine collaboration among large-language-model agents requires a finality-control primitive: given delivered stochastic, structured nat…
A Temporal Spatial Minimax Rate for Smoothly-Varying Distributions in Wasserstein Space
We study the minimax rate of estimating a future value $\mu_{t_n+h}$ of a curve $t\mapsto\mu_t$ in the $2$-Wasserstein space $\mathcal{P}_2…
SleepExplain: Explainable Non-Rapid Eye Movement and Rapid Eye Movement Sleep Stage Classification from EEG Signal
Classification of sleep stages is one of the most important diagnostic approaches for a variety of sleep-related disorders. Electroencephal…
A robust PPG foundation model using multimodal physiological supervision
Photoplethysmography (PPG), a non-invasive measure of changes in blood volume, is widely used in both wearable devices and clinical setting…
Mitosis Detection in the Wild: Multi-Tumor and Context-Aware Generalization in the MIDOG 2025 Challenge
Automated mitosis detection is a well-established task in computational pathology. While previous benchmarks focused on scanner-induced dom…
Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests
A growing failure mode in agent evaluation and training is that models can achieve high evaluation scores by exploiting shortcuts instead o…
Impact of Synthetic Lesional MR Images in Automated Focal Cortical Dysplasia Detection in Low-Data Scenarios
Background and Purpose: Automated detection of focal cortical dysplasia (FCD) requires large volumes of voxelwise lesion-delineated MRI dat…
A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning
The emergence of "Aha moments" in large language models, particularly DeepSeek-R1-0120, has raised the question of whether these systems ge…
Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills
LLM-driven software engineering agents have become a central testbed for real-world language-model capability, yet their training remains l…
The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs
Large language models are increasingly used to answer culturally grounded questions across languages, yet it remains unclear whether local…
Watch, Remember, Reason: Human-View Video Understanding with MLLMs
Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, m…
Re-imagining ISO 26262 in the Age of Autonomous Vehicles: Enhancing Controllability through Transferability and Predictability
The ISO 26262 standard defines functional safety for road vehicles through risk assessments based on Severity, Exposure, and Controllabilit…
TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment
Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the i…
PaperFlow: Profiling, Recommending, and Adapting Across Daily Paper Streams
Scientific paper recommendation is typically evaluated as static ranking over a fixed candidate set, yet real scientific reading unfolds as…
Planning-aligned Token Compression for Long-Context Autonomous Driving
Monolithic vision-action models represent an emerging paradigm in autonomous driving. However, this architecture produces token sequences t…
Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders
Whisper, a widely adopted ASR model, is known to suffer from hallucinations - coherent transcriptions generated for non-speech audio entire…
Graph Neural Network leveraging Higher-order Class Label Connectivity for Heterophilous Graphs
Node classification in graph neural networks (GNNs) has been widely applied in various fields of graph analysis. GNNs achieve high-accuracy…
Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification
Turkish idiomatic light verb constructions (LVCs) are challenging for multiword expression processing because they often share the same sur…
Twelve quick tips for designing AI-driven HPC workflows
High-performance computing (HPC) clusters remain the backbone of large-scale scientific computation, traditionally executing deterministic,…
Sparse Subspace-to-Expert Sharing for Task-Agnostic Continual Learning
Continual learning in Large Language Models (LLMs) is hindered by the plasticity-stability dilemma, where acquiring new capabilities often…
MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism
Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token ex…
How reliable are LLMs when it comes to playing dice?
We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probab…
LLM-Guided Search for Deletion-Correcting Codes
Finding deletion-correcting codes of maximum size has been an open problem for over 70 years, even for a single deletion. We adapt FunSearc…
Finding the Minimal Parameter Budget for Implicit Reasoning: A Data Complexity Driven Scaling Law for Language Models
Reasoning is a core capability of language models (LMs), yet it remains unclear how much model capacity is necessary to support reasoning d…
ChemQuests: A Curated Chemistry Question-Answer Database Extracted from ChemRxiv papers
The rapid expansion of chemistry literature poses significant challenges for researchers seeking to efficiently access domain-specific know…
EVA: Evolving Semantic Adversaries for Red-Teaming GUI Agents Against Environmental Injection Attacks
Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) are increasingly deployed yet vulnerable to Envir…
Exploring Flow-Lenia Universes with a Curiosity-driven AI Scientist: Discovering Diverse Ecosystem Dynamics
We present a curiosity-driven AI scientist method for discovering system-level dynamics in Flow-Lenia, a continuous cellular automaton (CA)…
Model Context Protocols in Adaptive Transport Systems: A Survey
The rapid expansion of interconnected devices, autonomous systems, and AI applications has created severe fragmentation in adaptive transpo…
Understanding Generative Recommendation with Semantic IDs from a Model-scaling View
Recent advancements in generative models have allowed the emergence of a promising paradigm for recommender systems (RS), known as Generati…
Small Language Model Agents Enable Efficient and High-Quality Knowledge Mining
At the core of Deep Research is knowledge mining, the task of extracting structured information from massive unstructured text in response…
MHA-RAG: Improving Efficiency, Accuracy, and Consistency by Encoding Exemplars as Soft Prompts
Adapting Foundation Models to new domains with limited training data is challenging and computationally expensive. While prior work has dem…
Agentic Physical AI toward a Domain-Specific Foundation Model for Energy Systems: A Case Study on Nuclear Reactor Control
The prevailing paradigm in AI for physical systems: scaling general-purpose foundation models toward universal multimodal reasoning, confro…
CHDP: Cooperative Hybrid Diffusion Policies for Reinforcement Learning in Parameterized Action Space
Hybrid action space, which combines discrete choices and continuous parameters, is prevalent in domains such as robot control and game AI.…
TSAQA: Time Series Analysis Question And Answering Benchmark
Time series data are integral to critical applications across domains such as finance, healthcare, transportation, and environmental scienc…
Dual Latent Memory for Visual Multi-agent System
While Visual Multi-Agent Systems (VMAS) promise to enhance comprehensive abilities through inter-agent collaboration, empirical evidence re…
MACD: Model-Aware Contrastive Decoding via Counterfactual Data
Video language models (Video-LLMs) are prone to hallucinations, generating plausible but ungrounded content when visual evidence is weak, a…
VALUEFLOW: Toward Pluralistic and Steerable Value-based Alignment in Large Language Models
Aligning Large Language Models (LLMs) with the diverse spectrum of human values remains a central challenge: preference-based methods often…
LLM-Augmented Digital Twin for Policy Evaluation in Short-Video Platforms
Short-video platforms are closed-loop, human-in-the-loop ecosystems where platform policy, creator incentives, and user behavior co-evolve.…
LuMamba: Latent Unified Mamba for Electrode Topology-Invariant and Efficient EEG Modeling
Electroencephalography (EEG) enables non-invasive monitoring of brain activity across clinical and neurotechnology applications, yet buildi…
D5P4: Partition Determinantal Point Process for Diversity in Parallel Discrete Diffusion Decoding
Discrete diffusion models are promising alternatives to autoregressive approaches for text generation, yet their decoding methods remain un…
Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA
Miscalibrated confidence scores are a practical obstacle to deploying AI in clinical settings. A model that is always overconfident offers…
Design Once, Deploy at Scale: Template-Driven ML Development for Large Model Ecosystems
Modern computational advertising platforms typically rely on recommendation systems to predict user responses, such as click-through rates,…
The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook
Latent space is rapidly emerging as a native substrate for language-based models. While modern systems are still commonly understood throug…
Don't Make the LLM Read the Graph: Make the Graph Think
We investigate whether explicit belief graphs improve LLM performance in cooperative multi-agent reasoning. Through 3,000+ controlled trial…
To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling
Agentic AI architectures augment LLMs with external tools, unlocking strong capabilities. However, tool use is not always beneficial; some…
Beyond the Black Box: Interpretability of Agentic AI Tool Use
AI agents are promising for high-stakes enterprise workflows, but dependable deployment remains limited because tool-use failures are diffi…
Robust Instruction Compliance in Cooperative Multi-Agent Reinforcement Learning
Multi-agent reinforcement learning (MARL) in real-world use cases may need to adapt to external natural language instructions that interrup…
Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents
Diagnosing failures in LLM agents remains largely manual. Practitioners inspect a small subset of execution traces, form ad-hoc hypotheses,…
Latent-space Attacks for Refusal Evasion in Language Models
Safety-aligned language models are trained to refuse harmful requests, yet refusal behavior can be suppressed by steering their internal re…
CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning
Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-pa…
From "Weak" Signals to Strong Models: Preference Delta Aggregation with LoRA Merging
Training strong large language models (LLMs) requires high-quality supervision, which is often scarce. Recent work shows that paired prefer…
Diagnosing LLM Arbitration Behavior over Pre-evidence Epistemic States in RAG-based Fact-Checking
In RAG-based fact-checking, LLMs are increasingly used as verifiers to check given claims against retrieved evidence. Their parametric know…
A Negative Result on Cross-Model Activation Transfer in a Pythia Multi-Hop Setting
Recent work shows that language models can transmit behavioural traits through hidden signals in generated data during training. We ask whe…
SentinelBench: A Benchmark for Long-Running Monitoring Agents
AI agents are increasingly asked to carry out work that spans minutes, hours, or longer. Yet the default model of agent behavior is continu…
Beyond Output Matching: Preserving Internal Geometry in NVFP4 LLM Distillation
Demand for low-precision inference, including NVFP4-based approaches, has grown as large language models are increasingly deployed in laten…
SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents
Persistent AI assistants, such as OpenClaw, accumulate large collections of related memories over long-term interactions. As these memories…
Should You Use Your Large Language Model to Explore or Exploit?
We evaluate the ability of the current generation of large language models (LLMs) to help a decision-making agent facing an exploration-exp…
Towards Efficient and Exact Forgetting Services in Pre-Trained-Model-based Continual Learning
In Continual Learning (CL), using a Pre-Trained Model (PTM) as the feature extractor has become a popular practice. Accompanied by analytic…
Bounded-Abstention Pairwise Learning to Rank
Ranking systems influence decision-making in high-stakes domains like health, education, and employment, where they can have substantial ec…
MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs
Multimodal Large Language Models (MLLMs) have achieved remarkable success in instruction-following tasks by integrating pretrained visual e…
CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval
Code retrieval is essential in modern software development, as it boosts code reuse and accelerates debugging. However, current benchmarks…
Telling stories, making Hanzi: AI-assisted co-creation with elderly migrants in urban China
This paper explores how older migrants in urban China can record stories that everyday language and design often miss. We ran two co-creati…
Database Normalization via Dual-LLM Self-Refinement
Database normalization is crucial to preserving data integrity. However, it is time-consuming and error-prone, as it is typically performed…
Limitations of Normalization in Attention Mechanism
This paper investigates the limitations of the normalization in attention mechanisms. We begin with a theoretical framework that enables th…
A Mechanism-Coupled Split Window Network for Medium- to High-Resolution Land Surface Temperature Retrieval
Land surface temperature (LST) is a fundamental physical variable in land-atmosphere interactions, surface energy budgets, and climate proc…
Standard vs. Modular Sampling: Best Practices for Reliable LLM Unlearning
A conventional LLM Unlearning setting consists of two subsets, "forget" and "retain", with the objectives of removing the undesired knowled…
MVCL-DAF++: Enhancing Multimodal Intent Recognition via Prototype-Aware Contrastive Alignment and Coarse-to-Fine Dynamic Attention Fusion
Multimodal intent recognition (MMIR) suffers from weak semantic grounding and poor robustness under noisy or rare-class conditions. We prop…
Scalable GANs with Transformers
Scalability has driven recent advances in generative modeling, yet its principles remain underexplored for adversarial learning. We investi…
Proxy Reconstruction Pre-training for Ramp Flow Prediction at Highway Interchanges
Interchanges are crucial nodes for vehicle transfers between highways, yet the lack of real-time ramp detectors creates blind spots in traf…
SWE-IF: Aligning Code Evaluation with Human Preference
Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural…
Robust Driving Control for Autonomous Vehicles: An Intelligent General-sum Constrained Adversarial Reinforcement Learning Approach
Deep reinforcement learning (DRL) has demonstrated remarkable success in developing autonomous driving policies. However, its vulnerability…
MatterDoor: Sampling Zero-shot Spatio-semantic Priors using Generative Models
Autonomous robots often view rooms only partially, through a doorway, where the walls and scene structure hide the geometry and task-releva…
ReclAIm: A Multi-Agent Framework for Monitoring and Correcting Performance Decline in Medical Imaging AI
Purpose: To develop and evaluate a multi-agent framework (ReclAIm) for automated monitoring, detection, and correction of performance decli…
LoRA-DA: Data-Aware Initialization for Low-Rank Adaptation via Asymptotic Analysis
LoRA has become a widely adopted method for PEFT, and its initialization methods have attracted increasing attention. However, existing met…
On the importance of multiple training seeds for evaluating machine unlearning
Machine unlearning aims to remove the influence of certain data points from a trained model without costly retraining. Most practical unlea…
Towards Iterative End-to-End Software Development: A Feature-Driven Multi-Agent Framework
Recent advances in large language model agents offer the promise of automating end-to-end software development from natural language requir…
Measuring Agents in Production
LLM-based agents already operate in production across many industries, yet we lack an understanding of what technical methods make deployme…
Calibrating Uncertainty for Zero-Shot Adversarial CLIP
CLIP delivers strong zero-shot classification but remains highly vulnerable to adversarial attacks. Prior adversarial fine-tuning work prim…
RePo: Language Models with Context Re-Positioning
In-context learning is fundamental to modern Large Language Models (LLMs); however, prevailing architectures impose a rigid and fixed conte…
It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents
Web-based agents powered by large language models are increasingly used for tasks such as email management or professional networking. Thei…
Analysing Differences in Persuasive Language in LLM-Generated Text: Uncovering Stereotypical Gender Patterns
Large language models (LLMs) are increasingly used for everyday communication tasks, including drafting interpersonal messages intended to…
Autonomous computational catalysis through an agentic research system
Autonomous agents are beginning to transform scientific research from tool-assisted workflows toward self-sustaining discovery processes. C…
E2Former-V2: On-the-Fly Equivariant Attention with Linear Activation Memory
Equivariant Graph Neural Networks (EGNNs) have become a widely used approach for modeling 3D atomistic systems. However, mainstream archite…
Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates
While Large Language Model (LLM) agents excel at general tasks, they inherently struggle with continual adaptation due to the frozen weight…
Enhancing Video Representations with Spatiotemporal-Semantic Residual to Mitigate Hallucinations in Video Large Multimodal Models
Although Video Large Multimodal Models have achieved strong performance in video understanding, they still suffer from hallucination. Exist…
Learning to Execute Graph Algorithms Exactly with Graph Neural Networks
Understanding what graph neural networks can learn, especially their ability to learn to execute algorithms, remains a central theoretical…
Rethinking Genomic Modeling Through Optical Character Recognition
Recent genomic foundation models largely adopt large language model architectures that treat DNA as a one-dimensional token sequence. Howev…
Step-Wise Refusal Dynamics in Autoregressive and Diffusion Language Models
Diffusion language models (DLMs) have recently emerged as a competitive alternative to autoregressive (AR) models, offering parallel decodi…
Extracting Recurring Vulnerabilities from Black-Box LLM-Generated Software
LLMs are increasingly used for code generation, but their outputs often follow recurring templates that can induce predictable vulnerabilit…
Endogenous Resistance to Activation Steering in Language Models
Large language models can recover mid-generation from task-misaligned activation steering, producing explicit verbal restarts (e.g., "wait…
The Geometry of Representational Failures in Vision Language Models
Vision-Language Models (VLMs) exhibit puzzling failures in multi-object visual tasks, such as hallucinating non-existent elements or failin…
Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, t…
Discovering Interpretable Algorithms by Decompiling Transformers to RASP
Recent work has shown that the computations of Transformers can be simulated in the RASP family of programming languages. These findings ha…
TokaMind: A Multi-Modal Transformer Foundation Model for Tokamak Plasma Dynamics
We present TokaMind, to our knowledge the first open-source foundation model for tokamak plasma dynamics, based on a Multi-Modal Transforme…
ScenicRules: An Autonomous Driving Benchmark with Multi-Objective Specifications and Abstract Scenarios
Developing autonomous driving systems for complex traffic environments requires balancing multiple objectives, such as avoiding collisions,…
Position: A Dynamical Systems Perspective is Needed to Advance Time Series Modeling
Time series (TS) modeling has come a long way from early statistical, mainly linear, approaches to the current trend in TS foundation model…
TRUE: A Trustworthy Unified Explanation Framework for Large Language Model Reasoning
Large language models (LLMs) have demonstrated strong capabilities in complex reasoning tasks, yet their decision-making processes remain d…
Forecasting as Rendering: A 2D Gaussian Splatting Framework for Time Series Forecasting
Time series forecasting remains a challenging problem due to the intricate entanglement of intra-period fluctuations and inter-period trend…
Training for Technology: Adoption and Productive Use of Generative AI in Legal Analysis
Can targeted user training unlock the productive potential of generative artificial intelligence in professional settings? We study this qu…
Benchmarking Language Modeling for Lossless Compression of Full-Fidelity Audio
Autoregressive "language" models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited…
VeriHGN: Heterogeneous Graph-Based Congestion Prediction for Chip Layout Verification
As Very Large Scale Integration (VLSI) designs continue to scale in size and complexity, layout verification has become a central challenge…
EvoClaw: Evaluating AI Agents on Continuous Software Evolution
With AI agents increasingly deployed as long-running systems, it becomes essential to autonomously construct and continuously evolve custom…
$\mathrm{ECI}_{\mathrm{sem}}$: Semantic Residual Effective Contrastive Information for Evaluating Hard Negatives
Hard-negative source selection for dense retrieval is usually decided only after fine-tuning and downstream evaluation. We propose $\mathrm…
Evaluating AI-based Scientific Knowledge Synthesis with Epidemiological Systematic Reviews
Systematic literature reviews (SLRs) are a demanding and high-stakes form of scientific knowledge synthesis that remains underspecified as…
Chameleon: Control-Indexed Prospective Memory for Visuomotor Manipulation
Robots often observe information that determines a future action long before that action is executed. In a shell game, for example, a robot…
Stable Reasoning, Unstable Responses: Mitigating LLM Deception via Stability Asymmetry
As Large Language Models (LLMs) expand in capability and application scope, their trustworthiness becomes critical. A vital risk is intrins…
CountsDiff: A Diffusion Model on the Natural Numbers for Generation and Imputation of Count-Based Data
Diffusion models have excelled at generative tasks for both continuous and token-based domains, but their application to discrete ordinal d…
SW-$A^2$-Bench: Benchmarking Autonomous Software Agent Generation for Agentic Web
The Agentic Web is emerging as a paradigm in which autonomous software agents interact with online resources and with each other to accompl…
OGA-AID: Clinician-in-the-loop AI Report Drafting Assistant for Multimodal Observational Gait Analysis in Post-Stroke Rehabilitation
Gait analysis is essential in post-stroke rehabilitation but remains time-intensive and cognitively demanding, especially when clinicians m…
More Capable, Less Cooperative? When LLMs Fail At Zero-Cost Collaboration
Large language model (LLM) agents increasingly coordinate in multi-agent systems, yet we lack an understanding of where and why cooperation…
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
Vision-language-action (VLA) models have advanced robot manipulation through large-scale pretraining, but real-world deployment remains cha…
MCERF: Advancing Multimodal LLM Evaluation of Engineering Documentation with Enhanced Retrieval
Engineering rulebooks and technical standards contain multimodal information like dense text, tables, and illustrations that are challengin…
Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
Self-consistency (SC) is a popular technique for improving the reasoning accuracy of large language models by aggregating multiple sampled…
RAVEN: Retrieval-Augmented Vulnerability Exploration Network for Memory Corruption Analysis in User Code and Binary Programs
Large Language Models (LLMs) have demonstrated remarkable capabilities across various cybersecurity tasks, including vulnerability classifi…
Automatic Causal Fairness Analysis with LLM-Generated Reporting
AutoML, intended as the process of automating the application of machine learning to real-world problems, is a key step for AI popularisati…
InvEvolve: Evolving White-Box Inventory Policies via Large Language Models with Performance Guarantees
We study how large language models can be used to generate inventory policies in online settings with non-stationary demand. Our work is mo…
MidSteer: Optimal Affine Framework for Steering Generative Models
Steering intermediate representations has emerged as a powerful strategy for controlling generative models, particularly in post-deployment…
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference
Mixture-of-Experts Multimodal Large Language Models (MoE MLLMs) suffer from a significant efficiency bottleneck during Expert Parallelism (…
COF26: A new on-top functional for multiconfiguration pair-density functional theory
Multiconfiguration pair-density functional theory (MC-PDFT) provides an efficient and accurate framework for computing electronic energies…
Superintelligent Retrieval Agent: The Next Frontier of Agentic Retrieval
Retrieval-augmented agents are increasingly the interface to large knowledge bases, yet most treat retrieval as a black box: they issue exp…
Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents
Software engineering agents are increasingly deployed in evaluable engineering environments, yet post-failure recovery remains costly, manu…
CHoE: Cross-Domain Heterogeneous Graph Prompt Learning via Structure-Conditioned Experts
Heterogeneous Graph Prompt Learning (HGPL) has emerged as a promising paradigm for bridging the gap between the objectives of pre-training f…
Rethinking Code Review in the Age of AI: A Vision for Agentic Code Review
Code review has evolved for decades, from informal peer checking to today's pull request (PR) workflows, yet it remains a largely manual an…
Automated Root-Cause Subclassification and No-Code Fix Generation for Invalid Bug Reports
Issues faced when using software are reported in the form of bug reports. However, many bug reports are invalid, meaning they do not requir…
Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models
Vision-Language Models (VLMs) face a bottleneck of prohibitive computational costs arising from massive visual token sequences during infer…
ActQuant: Sub-4-bit Action-Guided Quantization for Vision-Language-Action Models
Vision-Language-Action (VLA) models exhibit remarkable action generation for embodied intelligence, but their heavy compute makes deployment…
Scale When Needed: Adaptive Neuron-level Mixed Precision Quantization Aware Training
Deploying deep neural networks on resource-constrained 6G edge devices demands aggressive compression with minimal accuracy loss. Quantizat…
Autoregression-Free Neural Operators for Time-Dependent PDEs
Neural operators learn mappings from function-dependent inputs to solutions, providing an effective framework for solving partial different…
Fine-Tuning and Serving Gemma 4 31B on Google Cloud TPU: A Technical Comparison with GPU Baselines
We present the first end-to-end demonstration of fine-tuning and serving Google's Gemma 4 31B model on TPU hardware, providing an empirical…
Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference
Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with…
Evolving Features vs Evolving Entire Trees with GP for Interpretable Survival Analysis
Survival analysis concerns the task of predicting the time until an event occurs. Often used in the medical field, survival analysis deals…
Linear Ordering Problem: Time for a Change
The Linear Ordering Problem (LOP) is a fundamental combinatorial optimization problem with important applications in areas such as economic…
Beyond Tool Adoption: A Practical Five-Stage Developmental Continuum for AI Literacy in Higher Education
Artificial intelligence (AI) literacy is increasingly recognized as a foundational competency for all university graduates. Yet students' e…
MOSS-Audio Technical Report
MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aw…
Building Better Activation Oracles
Activation Oracles (AOs) are promising methods for interpreting residual stream activations. However, current AOs face important issues, su…
Perplexity Can Miss SAE Feature Damage Under Quantization
Quantization is a standard path to deploying large language models, and quantized models are typically judged acceptable when perplexity or…
OpenAgenet / OAN White Paper: Open Infrastructure for Trusted Agent Interconnection
OpenAgenet, abbreviated as OAN, is an open infrastructure project for trusted Agent interconnection. It addresses a problem that becomes vi…
OpenAgenet / OAN Yellow Paper: Technical Architecture for Trust-Governed Resource Identity and Discovery
This yellow paper describes the technical architecture of OpenAgenet / OAN. OAN is a protocol-neutral trust layer for open Agent interconne…
Local Guidance, Global Impact: Gaussian-Reshaped Trust Region Unlocks Behavior Transitions
While Proximal Policy Optimization (PPO) demonstrates strong performance in stationary settings, we show that its standard optimization par…
Optimizing Explicit Unit-Distance Lower-Bound Certificates
The 2026 disproof of Erdős's unit-distance conjecture and Sawin's quantitative refinement show that the maximum number $u(n)$ of unit d…
Spectral Scaling Laws of Muon
Orthonormalized update rules have rapidly become a leading choice of optimizer for training large language models, with recent open-source…
MorphoQuant: Modality-Aware Quantization for Omni-modal Large Language Models
Conventional Post-Training Quantization (PTQ) methods struggle with 4-bit Omni-modal Large Language Models (OLLMs) due to the extreme distr…
Selective Coupling of Decoupled Informative Regions: Masked Attention Alignment for Data-Free Quantization of Vision Transformers
Data-Free Quantization (DFQ) addresses data security concerns by synthesizing samples, without accessing real data. It has garnered increas…
Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees
Guaranteeing safety is critical to the deployment of reinforcement learning (RL) agents in the real world, especially as policies learned u…
Reinforcement Learning from Rich Feedback with Distributional DAgger
Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly…
When Surface Form Changes Moderation Decisions: A Paired Study of Code-Mixed Workflow Instability
Hate moderation is often evaluated as classification on clean English inputs, but deployed systems must route content to actions such as AL…
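AI x Design and Development News Roundup (April 2026)
The "Engineer e-Booklet" series re-edits major articles published on MONOist into easy-to-read PDF e-booklets. This installment, "AI x Design and Development News Roundup (April 2026)," collects AI-related news on the growing use of AI in product development work such as design and analysis.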
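AI pet robot "Moflin" sells well as Casio focuses on new businesses, aiming to grow a fourth pillar
Casio Computer is focusing on creating new businesses. The aim is to diversify its operations beyond the watch business, where lines such as "G-SHOCK" continue to perform well. An elite team reporting directly to the president is leading the effort, and attention is on how the challenge plays out.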
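AI code generation to more than triple in three years: what challenges emerge as AI spreads?
According to an IDC survey, the share of companies where AI takes the lead in generating code will roughly triple in three years. What challenges will surface as AI adoption spreads?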
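Why did Himeji City's emergency hospital AI chat reach a peak accuracy of 90% in two weeks?
While generative AI is streamlining many tasks, phone handling alone has looked the same for decades. What is the key to making AI work in practice? This article examines the "knowledge circulation" mechanism that Media Link demonstrated in its proof-of-concept trial with Himeji City.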
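Getting ahead of NEC and Fujitsu!? The background to Hitachi obtaining access to "Claude Mythos"
Hitachi has obtained access to Anthropic's advanced AI "Claude Mythos." NEC and Fujitsu have also announced collaborations with Anthropic.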
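Can workflow approvals be left to AI? Why 70% said no
According to a survey by ATLED, more than 70% of respondents said that the approval and sign-off process itself in workflows should not be entrusted to AI.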
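What manufacturers really say about AI agents
The "MONOist Podcast" has an AI presenter introduce topics of interest in manufacturing. This episode gives an overview of the results report from the "Manufacturing AI Agent Usage Survey 2026" conducted by the MONOist editorial team.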
Is this the dawn of the Tokenpocalypse?
We're likely to see more price increases as the big AI companies plan to go public.
Notion restores access to Anthropic after service disruption
Notion's head of product said he was "astonished" at "the amount of people RT-ing this."
OpenAI is still working on that ‘super app’
"Chat is dead" — at least, according to a senior OpenAI employee.