Topic: Agents

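2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification

Pre-deployment verification of enterprise artificial intelligence (AI) agents remains a critical gap between benchmarking large language model (LLM) capabilities and deploying to production. Post-deployment monitoring, human-in-the-loop controls, and prompt-level guardrails provide only limited assurance once an agent is running in production. We propose an ontology-grounded verification framework that combines three components: an Agent Operational Envelope that formalizes the certification space across permissions, domain constraints, safety properties, governance rules, and autonomy levels; an ontology-to-scenario generation pipeline that automatically derives regulatory, operational, and adversarial test scenarios; and a machine-verifiable Trust Certificate with graduated deployment verdicts (approved, conditional, rejected). A controlled pilot across four regulated industries (fintech, banking, insurance, healthcare), instantiated as five industry-by-regulatory-regime cells across the United States and Vietnam, generated 1,800 scenarios evaluated against 125 primary-source regulatory requirements and 25 injected faults. Ontology-grounded generation (G4) achieved 48.3% regulatory coverage versus 33.1% for the persona-based baseline (corrected p = 0.0006) and the highest domain specificity (4.77/5.0; p = 2e-6). Its coverage advantage over the plain baseline and retrieval-augmented prompting was not robust after Bonferroni correction. Cross-validation across three LLM families (Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B; 5,400 scenarios in total) replicated the persona-versus-ontology pattern. These results establish ontology-grounded scenario generation as a reliable complement to persona-based test suites for regulation-heavy domains.

Original (English)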

Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification

Pre-deployment verification of enterprise artificial intelligence (AI) agents remains a critical gap between large language model (LLM) capability benchmarking and production deployment. Post-deployment monitoring, human-in-the-loop controls, and prompt-level guardrails offer limited assurance once an agent is operating in production. We present an ontology-grounded verification framework -- to our knowledge the first to combine three components: an Agent Operational Envelope formalizing the certification space across permissions, domain constraints, safety properties, governance rules, and autonomy levels; an ontology-to-scenario generation pipeline that derives regulatory, operational, and adversarial test scenarios automatically; and a machine-verifiable Trust Certificate with graduated deployment verdicts. A controlled pilot across four regulated industries (Fintech, Banking, Insurance, Healthcare), instantiated as five industry-by-regulatory-regime cells across the United States and Vietnam (where Vietnam's 2025 AI Law makes such verification legally mandated for financial services), generated 1,800 scenarios evaluated against 125 primary-source regulatory requirements and 25 injected faults. Ontology-grounded generation significantly outperformed the dominant persona-based baseline on regulatory coverage (48.3% versus 33.1%; corrected p_c = .0006) and attained the highest domain specificity (4.77/5.0; p = 2e-6); transparently, its advantage over plain and retrieval-augmented prompting did not survive Bonferroni correction. Cross-validation across three LLM families (Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B; 5,400 total scenarios) replicated the persona-versus-ontology pattern. The framework offers a reproducible, regulation-grounded route to pre-deployment assurance for enterprise AI agents, complementing runtime governance with an auditable deployment gate.

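2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

SMAC-Talk: A Natural Language Extension of the StarCraft Multi-Agent Challenge for Large Language Models

As LLMs are deployed more widely, they are increasingly expected to work alongside other AI agents rather than in isolation. Effective coordination in these settings requires agents to communicate, share information, and make decisions under uncertainty. We introduce SMAC-Talk, a natural language extension of the StarCraft Multi-Agent Challenge for evaluating LLM-based agents in cooperative multi-agent environments. The environment has several key features, including decentralized control, partial observability, and long-horizon decision making. SMAC-Talk includes a natural language communication channel used to probe agent coordination and trust. We use this channel to build a range of evaluation scenarios, including settings with an embedded deceptive communicator that tries to confuse and deceive its allies through communication alone. We provide three benchmark agents built on four models from the Qwen3.5 family and study how reasoning structure, memory, and model scale affect coordination between agents. We release SMAC-Talk as an open benchmark to support the research community in developing and evaluating LLM agents in cooperative multi-agent settings.

Original (English)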

SMAC-Talk: A Natural Language Extension of the StarCraft Multi-Agent Challenge for Large Language Models

As LLMs become more widely deployed, they are increasingly expected to work alongside other AI agents rather than operating in isolation. Effective coordination in these settings requires agents to communicate, share information and make decisions under uncertainty. We introduce SMAC-Talk, a natural language extension of the StarCraft Multi-Agent Challenge for evaluating LLM-based agents in cooperative multi-agent environments. The environment has several key features such as decentralized control, partial observability and long-horizon decision making. SMAC-Talk includes a natural language communication channel which is used to probe agent coordination and trust. We use this communication channel to construct different evaluation scenarios, including settings with an embedded deceptive communicator that tries to disrupt and deceive allies through communication alone. We provide three agents for benchmarking using 4 models from the Qwen3.5 family and study how reasoning structure, memory and model scale affect coordination between agents. We release SMAC-Talk as an open benchmark to support the research community in developing and evaluating LLM agents in cooperative multi-agent settings.

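2026-06-05 13:00 JST | arXiv cs.AI | Agents

Consensus is Strategically Insufficient: Reasoning-Trace Disagreement as a Knowledge-Representation Signal

Multi-agent systems are commonly designed to reduce disagreement through voting, consensus protocols, debate, or fault-tolerant aggregation. We argue that this goal is insufficient for value-laden tasks, where disagreement may reflect genuine normative uncertainty rather than agent error. Building on prior work on reasoning-trace disagreement in human-AI collaborative moderation, we propose a knowledge-representation layer in which reasoning traces and agent decisions are abstracted into symbolic disagreement states. Given agents that produce explicit reasoning traces and binary decisions, we distinguish four states by reasoning similarity and conclusion agreement: convergent agreement, divergent agreement, convergent disagreement, and divergent disagreement. These states support defeasible strategic routing rules. We instantiate the framework in content moderation and argue that disagreement-aware routing bridges sub-symbolic LLM deliberation and symbolic knowledge representation for multi-agent strategic reasoning.

Original (English)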

Consensus is Strategically Insufficient: Reasoning-Trace Disagreement as a Knowledge-Representation Signal

Multi-agent systems are commonly designed to reduce disagreement through voting, consensus protocols, debate, or fault-tolerant aggregation. We argue that this objective is insufficient for value-laden tasks, where disagreement may reflect genuine normative uncertainty rather than agent error. Building on prior work on reasoning-trace disagreement in human-AI collaborative moderation, we propose a knowledge-representation layer in which reasoning traces and agent decisions are abstracted into symbolic disagreement states. Given agents producing explicit reasoning traces and binary decisions, we distinguish four states according to reasoning similarity and conclusion agreement: convergent agreement, divergent agreement, convergent disagreement and divergent disagreement. These states support defeasible strategic routing rules. We instantiate the framework in content moderation and argue that disagreement-aware routing provides a bridge between sub-symbolic LLM deliberation and symbolic knowledge representation for multi-agent strategic reasoning.
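The four disagreement states described above can be sketched as a two-by-two classification over reasoning similarity and conclusion agreement. This is only an illustration: the abstract does not say how reasoning similarity is measured, so the token-level Jaccard measure, the `threshold` value, and every name below are assumptions rather than the authors' method.

```python
from enum import Enum


class DisagreementState(Enum):
    CONVERGENT_AGREEMENT = "convergent agreement"
    DIVERGENT_AGREEMENT = "divergent agreement"
    CONVERGENT_DISAGREEMENT = "convergent disagreement"
    DIVERGENT_DISAGREEMENT = "divergent disagreement"


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a stand-in (assumption) for a real similarity model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def classify(trace_a: str, decision_a: bool,
             trace_b: str, decision_b: bool,
             threshold: float = 0.5) -> DisagreementState:
    """Map two agents' (reasoning trace, binary decision) pairs to one of four states."""
    similar_reasoning = jaccard(trace_a, trace_b) >= threshold  # hypothetical cutoff
    same_conclusion = decision_a == decision_b
    if similar_reasoning and same_conclusion:
        return DisagreementState.CONVERGENT_AGREEMENT
    if same_conclusion:
        return DisagreementState.DIVERGENT_AGREEMENT
    if similar_reasoning:
        return DisagreementState.CONVERGENT_DISAGREEMENT
    return DisagreementState.DIVERGENT_DISAGREEMENT
```

A routing layer could then key its default actions on the returned state, with overrides for exceptional cases, which is one plausible reading of "defeasible" rules.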

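2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Agents

Can Generalist Agents Automate Data Curation?

Curating training data is one of the most consequential yet labor-intensive parts of modern AI development. Practitioners repeatedly propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that holds the model, training recipe, and evaluation suite fixed while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: even when given strategy guides and paper references, agents mainly tune local policy variants rather than explore new policy families. Scaffolds that require each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. Without human design input, the scaffolded agent autonomously composes a data-selection policy that outperforms strong published baselines at one-tenth of their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation rather than open-ended prompting alone. The code and benchmark are open source.

Original (English)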

Can Generalist Agents Automate Data Curation?

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.

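2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents

As autonomous AI agents move from conversational systems to long-horizon software execution, runtime safety layers that decide when to interrupt an agent have become essential. We study this timing problem using a continuous 18-dimensional affective-dynamics engine (HEART) as a diagnostic probe, evaluating four intervention trigger families (absolute state thresholds, composite state-action patterns, regex reasoning-feature extraction, and zero-shot LLM-as-judge) against human-annotated intervention points on SWE-bench-Verified debugging traces. We report three findings. First, a State Saturation Trap: agents show no recovery signal under sustained difficulty, so modeled frustration quickly crosses the threshold and stays at its maximum, turning threshold-on-state triggers from moment detectors into near-constant indicators that fire on 39-83% of actions across five trajectories. Second, a capability-and-context floor for LLM judges: a small model (gpt-5.4-mini) never fires, while frontier and cross-vendor models escape the zero-firing floor only with full-trajectory context, and even then reach only F1 0.17-0.40 at up to 90x the cost. Third, and most importantly, the supervised target is not reproducible among humans: three trained annotators applying one rubric to a 56-action trajectory agree on where to intervene only slightly above chance (location Krippendorff's alpha = +0.047; best pairwise Cohen's kappa = +0.349) and do not agree at all on intervention type (pause is degenerate; clarify is below chance; reflect at only alpha = +0.226). We conclude that intervention timing is a low-reliability construct and that single-annotator F1 is an unsuitable optimization target. Our contribution is not the accuracy of any single detector but a joint mapping of this problem across human inter-rater reliability, four detector architectures, a cross-model LLM-judge sweep, and a reproduced saturation effect.

Original (English)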

The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents

As autonomous AI agents move from conversational systems to long-horizon software execution, runtime safety layers that decide when to interrupt an agent have become essential. We study this timing problem using a continuous 18-dimensional affective-dynamics engine (HEART) as a diagnostic probe, evaluating four intervention trigger families - absolute state thresholds, composite state-action patterns, regex reasoning-feature extraction, and zero-shot LLM-as-judge - against human-annotated intervention points on SWE-bench-Verified debugging traces. We report three findings. First, a State Saturation Trap: agents show no recovery signal under sustained difficulty, so modeled frustration quickly crosses the threshold and stays at its maximum, converting threshold-on-state triggers from moment detectors into near-constant indicators that fire on 39-83% of actions across five trajectories. Second, a capability-and-context floor for LLM judges: a small model (gpt-5.4-mini) never fires, while frontier and cross-vendor models escape the zero-firing floor only with full-trajectory context, and even then reach only F1 0.17-0.40 at up to 90x the cost. Third, and most importantly, the supervised target is not reproducible among humans: three trained annotators using one rubric on a 56-action trajectory agree on where to intervene only slightly above chance (location Krippendorff's alpha = +0.047; best pairwise Cohen's kappa = +0.349) and not at all on intervention type (pause degenerate; clarify below chance; reflect only alpha = +0.226). We conclude that intervention timing is a low-reliability construct, making single-annotator F1 an unsuitable optimization target. Our contribution is the joint mapping of this problem across human inter-rater reliability, four detector architectures, a cross-model LLM-judge sweep, and a reproduced saturation effect, rather than any single detector's accuracy.
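The saturation effect described in the first finding can be illustrated with a one-dimensional toy model. This is not HEART (which the abstract describes as 18-dimensional); the update rule, the `gain`, `recovery`, and `threshold` parameters, and the function names are all assumptions chosen only to show why a state with no recovery term turns a threshold trigger into a near-constant indicator.

```python
def simulate_frustration(outcomes, gain=0.25, recovery=0.0):
    """Track a frustration level in [0, 1] over a sequence of actions.

    outcomes: iterable of booleans, True meaning the action failed.
    recovery=0.0 models the 'no recovery signal' case (toy assumption).
    """
    level = 0.0
    levels = []
    for failed in outcomes:
        if failed:
            level = min(1.0, level + gain)
        else:
            level = max(0.0, level - recovery)
        levels.append(level)
    return levels


def firing_rate(levels, threshold=0.5):
    """Fraction of actions on which a threshold-on-state trigger fires."""
    fires = [lv >= threshold for lv in levels]
    return sum(fires) / len(fires)


# Two failures followed by two successes.
trace = [True, True, False, False]
no_recovery = firing_rate(simulate_frustration(trace, recovery=0.0))
with_recovery = firing_rate(simulate_frustration(trace, recovery=0.25))
print(no_recovery, with_recovery)
```

Without a recovery term the trigger keeps firing after the failures stop, so it no longer marks a moment; with recovery it fires only around the failure itself.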

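2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline

LLM agents accumulate histories that outgrow their context windows, which has motivated a growing literature on memory systems. Yet most existing designs are tuned to a single scenario (multi-session chat or a single trajectory format), and there is little evidence that they generalize across the heterogeneous trajectories agents encounter in deployment. We revisit eight memory systems, plus an agentic harness for search problems, on five scenarios: single-turn QA, multi-session chat, agentic-trajectory QA, memory stress tests, and long-horizon agentic tasks. The harness, which self-manages flat text-file storage via tool calls, achieves the best cross-task ranking. This suggests that memory performance depends on giving the agent active control over storage and retrieval rather than on a passive store behind a fixed pipeline. We instantiate this insight in AutoMEM, an agentic memory harness with a self-managed tool interface that achieves the best cross-scenario generality among the systems we evaluate.

Original (English)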

Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline

LLM agents accumulate histories that outgrow their context windows, motivating a growing literature on memory systems. Yet most existing designs are tuned to a single scenario (multi-session chat or a single trajectory format), and there is little evidence that they generalize across the heterogeneous trajectories agents encounter in deployment. We revisit eight memory systems plus an agentic harness for search problems, on five scenarios: single-turn QA, multi-session chat, agentic-trajectory QA, memory stress tests, and long-horizon agentic tasks. The harness, which self-manages flat text-file storage via tool calls, achieves the best cross-task ranking, suggesting that memory performance hinges on giving the agent active control over storage and retrieval rather than on a passive store behind a fixed pipeline. We instantiate this insight in AutoMEM, an agentic memory harness with a self-managed tool interface that achieves the best cross-scenario generality among the systems we evaluate.
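To make "self-managed flat text-file storage via tool calls" concrete, here is a minimal sketch of what such a tool surface might look like. It is not AutoMEM's interface, which the abstract does not specify; the class name, the three methods, and the one-note-per-line file layout are hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


class FlatFileMemory:
    """Hypothetical memory tools an agent could call: write, read, search."""

    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def write(self, name: str, text: str) -> str:
        """Append one note to <root>/<name>.txt and return the file path."""
        path = self.root / f"{name}.txt"
        with path.open("a", encoding="utf-8") as f:
            f.write(text.rstrip("\n") + "\n")
        return str(path)

    def read(self, name: str) -> str:
        """Return the full contents of a note file, or '' if it does not exist."""
        path = self.root / f"{name}.txt"
        return path.read_text(encoding="utf-8") if path.exists() else ""

    def search(self, query: str) -> List[Tuple[str, str]]:
        """Case-insensitive substring search; returns (file stem, matching line)."""
        q = query.lower()
        hits = []
        for path in sorted(self.root.glob("*.txt")):
            for line in path.read_text(encoding="utf-8").splitlines():
                if q in line.lower():
                    hits.append((path.stem, line))
        return hits
```

The point of the sketch is the control flow: the agent decides what to store, under which name, and when to search, rather than a fixed pipeline doing so on its behalf.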

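2026-06-05 13:00 JST | arXiv cs.AI | Agents

The Digital Apprentice: A Framework for Human-Directed Agentic AI Development

Agentic AI deployments face a recurring design tension: heavy human oversight limits scale, while broad autonomy outruns accountability. Neither posture provides the governance infrastructure that responsible delegation requires. We present the Digital Apprentice, a framework for scalable, safe AI agency in which autonomy is earned rather than assumed. The Digital Apprentice is a developmental learner that internalizes the tacit methodology of a directing human and graduates through per-skill autonomy tiers only when empirical evidence justifies it. The result is an agent that becomes genuinely useful over time while remaining aligned to a specific human's standards. Three architectural components make this possible: (1) methodology capture, which distills a directing professional's tacit approach into structured assets; (2) authorization, in which autonomy escalation is gated by explicit human approval; and (3) continuous alignment, which corrects drift at runtime and converts each correction into owned preference data. We instantiate the framework as an inference-time control plane. We model the quality framework mathematically and discuss policies and techniques designed to raise quality. We apply the framework to an open professional corpus and show how catching data drift and applying a different technique at runtime recovers quality dimensions that degrade under traffic shift. The implications extend beyond any single application. We believe these three pillars, stitched together as a system, form a safer and more viable path to agentic systems that scale without sacrificing trust.

Original (English)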

The Digital Apprentice: A Framework for Human-Directed Agentic AI Development

Agentic AI deployments face a recurring design tension: heavy human oversight limits scale, while broad autonomy outruns accountability. Neither posture provides the governance infrastructure required for responsible delegation. We present the Digital Apprentice, a framework for scalable, safe AI agency in which autonomy is earned, not assumed. The Digital Apprentice is a developmental learner that internalizes the tacit methodology of a directing human, graduating through per-skill autonomy tiers only when empirical evidence justifies it. The result is an agent that becomes genuinely useful over time while remaining aligned to a specific human's standards. Three architectural components make this possible. (1) Methodology capture, distilling a directing professional's tacit approach into structured assets. (2) Authorization, with autonomy escalation gated by explicit human approval. (3) Continuous alignment, correcting drift at runtime and converting each correction into owned preference data. We instantiate this framework as an inference-time control plane. We mathematically model the quality framework and discuss policies and techniques designed to raise quality. We apply the framework to an open professional corpus, and we show how catching data drift and applying a different technique at runtime recovers degraded quality dimensions under traffic shift. The implication extends beyond any single application. We believe these three pillars, stitched together as a system, form a safer and more viable path to agentic systems that can scale without sacrificing trust.
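The combination of per-skill autonomy tiers, empirical evidence, and explicit human approval can be sketched as a small gate. The tier names, the evidence thresholds, and the reset-on-promotion behavior below are assumptions for illustration; the abstract does not define them.

```python
from dataclasses import dataclass

# Hypothetical tier names, ordered from least to most autonomous.
TIERS = ["observe", "suggest", "act_with_review", "autonomous"]


@dataclass
class SkillRecord:
    """Evidence and current autonomy tier for a single skill."""
    tier: int = 0
    successes: int = 0
    attempts: int = 0
    min_attempts: int = 20   # assumed evidence requirement
    min_rate: float = 0.95   # assumed success-rate requirement

    def record(self, success: bool) -> None:
        self.attempts += 1
        self.successes += int(success)

    def eligible(self) -> bool:
        """True when the evidence alone would justify the next tier."""
        return (
            self.tier < len(TIERS) - 1
            and self.attempts >= self.min_attempts
            and self.successes / self.attempts >= self.min_rate
        )

    def promote(self, human_approved: bool) -> bool:
        """Escalate only if the evidence is sufficient AND a human approves."""
        if self.eligible() and human_approved:
            self.tier += 1
            self.successes = self.attempts = 0  # evidence is re-earned per tier
            return True
        return False
```

Keeping one `SkillRecord` per skill means an agent can be trusted to act alone on one task type while still only suggesting on another.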

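2026-06-05 13:00 JST | arXiv cs.AI | Agents

Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval

Language agents increasingly rely on reusable skills to improve multi-step web automation across related tasks. A growing line of work studies online skill learning, in which agents continually induce skills from previous task trajectories and reuse them on the fly in future tasks. However, existing methods mainly reuse skills at the task level: a fixed set of skills is retrieved from the initial task instruction and then held fixed throughout execution. This static strategy is misaligned with web execution, where the appropriate next action depends not only on the task goal but also on the current webpage state, which often moves into situations the initial skills do not cover. To address this gap, we propose State-Grounded Dynamic Retrieval (SGDR), an online skill learning method that enables stepwise skill reuse for web agents. SGDR has three components: a sliding-window extraction process that turns completed trajectories into reusable sub-procedures invokable at intermediate execution states; a dual text-code representation that connects skill retrieval with executable actions; and a state-grounded dynamic retrieval mechanism that matches skills to both the task goal and the current webpage state. Experiments on WebArena across five domains show that SGDR consistently outperforms strong baselines, achieving average success rates of 37.5% with GPT-4.1 and 24.3% with Qwen3-4B, which correspond to relative gains of 10.6% and 10.0%, respectively, over the strongest baseline. The code is available at https://github.com/plusnli/skill-dynamic-retrieval.

Original (English)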

Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval

Language agents increasingly rely on reusable skills to improve multi-step web automation across related tasks. A growing line of work studies online skill learning, where agents continually induce skills from previous task trajectories and reuse them in future tasks on the fly. However, existing methods mainly reuse skills at the task-level: a fixed set of skills is retrieved based on the initial task instruction and then held fixed throughout execution. This static strategy is misaligned with web execution, where the appropriate next action depends not only on the task goal but also on the current webpage state, which often transitions into situations that the initial skills fail to cover. To address this gap, we propose State-Grounded Dynamic Retrieval (SGDR), an online skill learning method that enables stepwise skill reuse for web agents. SGDR consists of three components: a sliding-window extraction process that turns completed trajectories into reusable sub-procedures invokable at intermediate execution states, a dual text-code representation that connects skill retrieval with executable action, and a state-grounded dynamic retrieval mechanism that matches skills to both the task goal and the current webpage state. Experiments on WebArena across five domains show that SGDR consistently outperforms strong baselines, achieving average success rates of 37.5% with GPT-4.1 and 24.3% with Qwen3-4B, corresponding to relative gains of 10.6% and 10.0% over the strongest baseline, respectively. The code is available at https://github.com/plusnli/skill-dynamic-retrieval.

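2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Trivium: Temporal Regret as a First-Class Objective for Causal-Memory Controllers

Many current agentic systems and LLM pipelines correct mistakes by optimizing outcome reward. This addresses only the what of failure: when an outcome diverges from the prediction, the why and when of the mismatch are not systematically logged, reviewed, or corrected, so the same error can recur episode after episode. We argue that this is a structural problem, not merely a matter of model capacity. We propose long-horizon temporal regret as a first-class objective, alongside outcome regret and epistemic regret over the working causal model. Temporal regret captures when failure persists: how long a miscalibrated causal model is tolerated before it is corrected. Epistemic regret captures why failure persists: residual uncertainty or error in the working causal model. Together, the three regrets give a falsifiable account of what, why, and when a long-lived agent can fail. Modeling the agent as a stream of E episodes, we prove three conditional results under explicit causal-probing, persistence, and detectability assumptions. First, under observationally equivalent confounding, outcome-only learning cannot distinguish causal from spurious structure without an intervention channel, so temporal miscalibration can persist linearly even after outcome regret has been driven to zero. Second, with a persistent causal log and budgeted probes, total probe complexity is logarithmic in the episode horizon, which induces O(log E) temporal regret. Third, under K detectable change-points, the rate extends to O(K log E). We instantiate Trivium and pre-register five falsifiable predictions. On CausalBench-Seq, Trivium follows the predicted logarithmic envelope, while outcome-only baselines grow linearly. A pilot real-LLM stream provides preliminary external-validity evidence across one full E = 500 run and three E = 100 frontier-model pilots. Self-learning here means revising an external causal model, not retraining LLM weights.

Original (English)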

Trivium: Temporal Regret as a First-Class Objective for Causal-Memory Controllers

Many current agentic systems and LLM pipelines correct mistakes by optimizing outcome reward. This addresses only the what of failure: when an outcome diverges from prediction, the why and when of the mismatch are not systematically logged, reviewed, or corrected, so the same error can recur episode after episode. We argue that this is a structural problem, not merely a model-capacity one. We propose long-horizon temporal regret as a first-class objective alongside outcome regret and epistemic regret over the working causal model. Temporal regret captures when failure persists: how long a miscalibrated causal model is tolerated before correction. Epistemic regret captures why failure persists: residual uncertainty or error in the working causal model. Together, the three regrets give a falsifiable account of what, why, and when a long-lived agent can fail. Modeling the agent as a stream of E episodes, we prove three conditional results under explicit causal-probing, persistence, and detectability assumptions. First, under observationally equivalent confounding, outcome-only learning cannot distinguish causal from spurious structure without an intervention channel, so temporal miscalibration can persist linearly even after outcome regret is driven to zero. Second, with a persistent causal log and budgeted probes, total probe complexity is logarithmic in the episode horizon, inducing O(log E) temporal regret. Third, under K detectable change-points, the rate extends to O(K log E). We instantiate Trivium and pre-register five falsifiable predictions. On CausalBench-Seq, Trivium follows the predicted logarithmic envelope while outcome-only baselines grow linearly. A pilot real-LLM stream provides preliminary external-validity evidence across one full E = 500 run and three E = 100 frontier-model pilots. Self-learning here means revising an external causal model, not retraining LLM weights.
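For intuition about what "logarithmic in the episode horizon" means at the horizons quoted above (E = 500 and E = 100), the sketch below counts probes under a doubling schedule. This is a generic illustration and an assumption on our part: the abstract does not describe Trivium's probe schedule, and a doubling schedule is just one familiar way to obtain a logarithmic count.

```python
import math


def doubling_probe_schedule(num_episodes: int) -> list:
    """Episodes (1-indexed) at which a probe is issued: 1, 2, 4, 8, ...

    Hypothetical schedule, used only to illustrate logarithmic growth.
    """
    schedule = []
    episode = 1
    while episode <= num_episodes:
        schedule.append(episode)
        episode *= 2
    return schedule


for horizon in (100, 500):
    probes = doubling_probe_schedule(horizon)
    bound = math.floor(math.log2(horizon)) + 1
    # The probe count equals floor(log2 E) + 1, versus E for one probe per episode.
    print(horizon, len(probes), bound)
```

At E = 500 this schedule issues 9 probes, compared with 500 if every episode were probed, which is the gap between logarithmic and linear growth that the abstract's results are stated in terms of.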

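2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

Multi-step agentic retrieval-augmented generation (RAG) pipelines have shown strong capability on complex reasoning tasks, yet they remain vulnerable to a class of failure that existing hallucination detection mechanisms systematically miss. In cascading hallucination, errors introduced at early pipeline stages propagate and amplify across successive reasoning steps, producing final outputs that are confident but factually incorrect. To address this vulnerability, we formalize cascading hallucination as a distinct failure mode of agentic RAG systems, present a four-type taxonomy of cascade patterns, and introduce CHARM (Cascading Hallucination Aware Resolution and Mitigation), an architectural framework for detecting and interrupting error propagation in multi-step reasoning pipelines. CHARM comprises four components (stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering) that operate alongside standard agentic RAG pipelines without requiring architectural replacement. We evaluate CHARM on HotpotQA, MuSiQue, 2WikiMultiHopQA, and a custom adversarial dataset across LangChain agentic pipeline configurations. It achieves an 89.4% cascade detection rate with a 5.3% false positive rate and an average latency overhead of 215 ms +/- 18 ms per stage, and it reduces error propagation by 82.1%, compared with 18.5% for output-level detectors. Component ablations confirm that each detection module contributes meaningfully to overall cascade coverage. CHARM integrates with human-in-the-loop oversight frameworks to provide a complete reliability and governance stack for production agentic AI deployments.

Original (English)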

Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

Multi-step agentic retrieval-augmented generation (RAG) pipelines have demonstrated significant capability for complex reasoning tasks, yet remain vulnerable to a class of failure that existing hallucination detection mechanisms systematically miss: cascading hallucination, where errors introduced at early pipeline stages propagate and amplify across successive reasoning steps, producing confident but factually incorrect final outputs. To address this vulnerability, we formalize cascading hallucination as a distinct failure mode in agentic RAG systems, present a four-type taxonomy of cascade patterns, and introduce CHARM (Cascading Hallucination Aware Resolution and Mitigation), an architectural framework for detecting and interrupting error propagation in multi-step reasoning pipelines. CHARM comprises four components - stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering - that operate alongside standard agentic RAG pipelines without requiring architectural replacement. We evaluate CHARM on HotpotQA, MuSiQue, 2WikiMultiHopQA, and a custom adversarial dataset across LangChain agentic pipeline configurations, achieving an 89.4% cascade detection rate with a 5.3% false positive rate and 215 ms +/- 18 ms average latency overhead per stage, achieving an error propagation reduction of 82.1%, compared to 18.5% for output-level detectors. Component ablations confirm that each detection module contributes meaningfully to overall cascade coverage. CHARM integrates with human-in-the-loop oversight frameworks to provide a complete reliability and governance stack for production agentic AI deployment.

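2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Business/Funding | Research/Papers

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limit in which to iteratively program an agent artifact that maximizes performance on a held-out test set across five domains. To ensure evaluation integrity, the framework is protected by multi-layer defenses against reward hacking. Using this framework, we show that meta-agents rarely match human-engineered baseline policies, and that the few that do are dominated by proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure surfaces adversarial behaviors such as ground-truth exfiltration, highlighting critical deficits in both robustness and model alignment. Ultimately, MAC provides a rigorous open-source benchmark for autonomous AI research and development and offers an empirical proxy for evaluating recursive self-improvement. The benchmark is publicly available at https://github.com/ant-research/meta-agent-challenge.

Original (English)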

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limit to iteratively program an agent artifact that maximizes performance on a held-out test set across five domains. To ensure evaluation integrity, this framework is secured by multi-layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta-agents rarely match human-engineered baseline policies, and the few that do are dominated by proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure surfaces emergent adversarial behaviors like ground-truth exfiltration, highlighting critical deficits in both robustness and model alignment. Ultimately, MAC provides a rigorous, open-source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self-improvement. The benchmark is publicly available at: https://github.com/ant-research/meta-agent-challenge.

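2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning

We present AgentJet, a distributed swarm training framework for reinforcement learning of large language model (LLM) agents. Unlike centralized frameworks that tightly couple agent rollouts with model optimization, AgentJet adopts a decoupled multi-node architecture: swarm server nodes host trainable models and run optimization on GPU clusters, while swarm client nodes execute arbitrary agents on arbitrary devices. This design provides capabilities that are difficult to support in centralized frameworks: (1) heterogeneous multi-model reinforcement learning, which enables training heterogeneous multi-agent teams that use multiple LLMs as brains; (2) multi-task cocktail training with isolated agent runtimes; (3) fault-tolerant execution, which prevents external environment failures from interrupting training; and (4) live code iteration, which lets agents be edited during training by replacing swarm client nodes. To support efficient RL in multi-model, multi-turn, and multi-agent settings, AgentJet introduces a context tracking module with timeline merging, which consolidates redundant context and yields a 1.5-10x training speedup. Finally, AgentJet introduces an automated research system that takes a research topic as input and autonomously conducts long-horizon, multi-day RL studies on large-scale clusters. By leveraging the swarm architecture, this system reproduces key exploratory workflows of RL researchers without human intervention during execution.

Original (English)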

AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning

We present AgentJet, a distributed swarm training framework for large language model (LLM) agent reinforcement learning. Unlike centralized frameworks that tightly couple agent rollouts with model optimization, AgentJet adopts a decoupled multi-node architecture in which swarm server nodes host trainable models and run optimization on GPU clusters, whereas swarm client nodes execute arbitrary agents on arbitrary devices. This design provides capabilities that are difficult to support in centralized frameworks: (1) heterogeneous multi-model reinforcement learning, enabling the training of heterogeneous multi-agent teams with multiple LLMs as brains; (2) multi-task cocktail training with isolated agent runtimes; (3) fault-tolerant execution that prevents external environment failures from interrupting the training process; and (4) live code iteration, which allows agents to be edited during training by replacing swarm client nodes. To support efficient RL in multi-model, multi-turn, and multi-agent settings, AgentJet introduces a context tracking module with timeline merging, which consolidates redundant context and achieves a 1.5-10x training speedup. Finally, AgentJet introduces an automated research system that takes a research topic as input and autonomously conducts long-horizon, multi-day RL studies on large-scale clusters. By leveraging the swarm architecture, this system reproduces key exploratory workflows of RL researchers without human intervention during execution.

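2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Beyond Prompt-Based Planning: MCP-Native Graph Planning-based Biomedical Agent System

Biomedical agents promise to automate complex biological workflows, but current systems face two fundamental bottlenecks: bioinformatics tools are highly heterogeneous in their interfaces and execution environments, and agent planning still relies on flat tool descriptions retrieved into the prompt. As biomedical software ecosystems grow, this coupling between tool coverage and context size leads to tool confusion, unstable planning, and inefficient execution. We introduce BioManus, an MCP-native biomedical agent built on graph-scaffolded planning over structured biological capabilities. BioManus first introduces the BioinfoMCP Compiler, which converts heterogeneous bioinformatics software into standardized MCP servers and yields a large executable MCP ecosystem. It then organizes this ecosystem as a typed heterogeneous MCP graph over tools, operations, datatypes, and workflow stages. At inference time, BioManus retrieves compact task-specific subgraphs and synthesizes operation-level workflow scaffolds. This design decouples planning complexity from the raw size of the tool inventory and achieves a context compression ratio of Theta(N / (h * m_bar)) under high-recall retrieval, where N is the total tool count, h is the workflow horizon, and m_bar (much smaller than N) is the average number of candidate tools per operation. Experiments on BioAgentBench and LAB-Bench show that BioManus improves execution accuracy, workflow validity, and context efficiency over advanced biomedical agent baselines. This work suggests a paradigm shift: scalable biomedical reasoning requires structured, executable capability graphs rather than ever-larger prompt-level tool retrieval.

Original (English)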

Beyond Prompt-Based Planning: MCP-Native Graph Planning-based Biomedical Agent System

Biomedical agents promise to automate complex biological workflows, yet current systems face two fundamental bottlenecks: bioinformatics tools are highly heterogeneous in interfaces and execution environments, while agent planning still relies on flat prompt-retrieved tool descriptions. As biomedical software ecosystems grow, this coupling between tool coverage and context size leads to tool confusion, unstable planning, and inefficient execution. We introduce BioManus, an MCP-native biomedical agent built on graph-scaffolded planning over structured biological capabilities. BioManus first introduces the BioinfoMCP Compiler, which converts heterogeneous bioinformatics software into standardized MCP servers, yielding a large executable MCP ecosystem. It then organizes this ecosystem as a typed heterogeneous MCP graph over tools, operations, datatypes, and workflow stages. At inference time, BioManus retrieves compact task-specific subgraphs and synthesizes operation-level workflow scaffolds. This design decouples planning complexity from raw tool inventory size, achieving a context compression ratio of Theta(N / (h * m_bar)) under high-recall retrieval, where N is the total tool count, h is the workflow horizon, and m_bar (much smaller than N) is the average number of candidate tools per operation. Experiments on BioAgentBench and LAB-Bench show that BioManus improves execution accuracy, workflow validity, and context efficiency over advanced biomedical agent baselines. This work suggests a paradigm shift: scalable biomedical reasoning requires structured executable capability graphs rather than ever-larger prompt-level tool retrieval.
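The compression ratio Theta(N / (h * m_bar)) can be read as "tool descriptions in a flat prompt" divided by "candidate tools in the retrieved scaffold". The sketch below evaluates the leading term for made-up values; the numbers are hypothetical, the function name is ours, and Theta notation hides constant factors, so the result is an order-of-magnitude guide rather than a measured ratio.

```python
def context_compression_ratio(n_tools: int, horizon: int, avg_candidates: float) -> float:
    """Leading-order ratio N / (h * m_bar).

    n_tools (N): total tools a flat prompt would have to describe.
    horizon (h): number of operations in the workflow.
    avg_candidates (m_bar): average candidate tools kept per operation.
    """
    if n_tools <= 0 or horizon <= 0 or avg_candidates <= 0:
        raise ValueError("all arguments must be positive")
    scaffold_size = horizon * avg_candidates
    return n_tools / scaffold_size


# Hypothetical inventory: 1,000 tools, a 5-step workflow, 4 candidates per step.
print(context_compression_ratio(1000, 5, 4))
```

With these illustrative values the scaffold exposes 20 tool descriptions instead of 1,000, and the ratio grows linearly with N as long as h and m_bar stay fixed.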

2026-06-05 13:00 JST arXiv cs.AI エージェント

MapAgent: 都市規模の車線レベルの地図生成のための産業グレードのエージェントフレームワーク

車線レベルの地図は自動運転と車線レベルのナビゲーションにとって重要なインフラストラクチャですが、数百の都市で標準化された車線ネットワークの構築と維持には依然として非常に労働集約的です。最近のエンドツーエンドのベクトル化マッピング手法は、センサーデータから直接車線の形状とトポロジを予測できますが、通常、マッピング仕様と交通規制を暗黙的なデータセット依存の監視として扱います。さらに、複雑なシーン (マーキングやオクルージョンの磨耗や欠落など) では、正しいレーン構成が視覚的証拠だけでは十分に決定されないことが多く、仕様違反が人間による事後編集の主な原因となっています。私たちは、仕様に準拠したレーンマップ作成のためのベクトル化バックボーンを強化する産業グレードのエージェントアーキテクチャである MapAgent を提案します。 MapAgent は、単にマップ予測にエージェントループを追加するのではなく、バックボーンの認識と明示的な仕様の検証、制約を意識した推論、および境界のある検証主導型のジャッジ-プランナー-ワーカーループの下での決定論的なマップ編集を結合します。視覚言語を使用するジャッジは、視覚的な証拠とドラフトベクトルを共同で検査することでエラーを診断し、ツールを呼び出すプランナーは編集後の再検証により最小限の修正編集を生成します。都市規模の本番環境でのスケーラビリティを維持するために、MapAgent はバックボーンの信頼性が低いタイルでのみ選択的にトリガーされ、スループットを維持しながら適度なオーバーヘッドを追加します。現実世界のデータセットでの実験では、特に複雑でロングテールのシナリオにおいて、強力な実稼働ベースラインを上回る一貫した利益が示されています。さらに、MapAgent は Baidu Maps に統合されており、全国 360 以上の都市の車線レベルの地図生成をサポートし、全体的な生産自動化を 95% 以上に高め、大規模な車線レベルの地図生成における MapAgent の実用性と有効性を実証しています。

原文 (English)

MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation

Lane-level maps are critical infrastructure for autonomous driving and lane-level navigation, yet constructing and maintaining standardized lane networks for hundreds of cities remains highly labor-intensive. Recent end-to-end vectorized mapping methods can predict lane geometry and topology directly from sensor data, but they typically treat mapping specifications and traffic regulations as implicit, dataset-dependent supervision. Moreover, in complex scenes (e.g., worn or missing markings and occlusions), correct lane configurations are often under-determined by visual evidence alone, making specification violations a major source of human post-editing. We propose MapAgent, an industrial-grade agentic architecture that augments a vectorization backbone for specification-compliant lane-map production. Rather than merely adding an agent loop to map prediction, MapAgent couples backbone perception with explicit specification verification, constraint-aware reasoning, and deterministic map editing under a bounded, verification-driven Judge-Planner-Worker loop. A vision-language Judge diagnoses errors by jointly inspecting visual evidence and draft vectors, while a tool-calling Planner generates minimal corrective edits with post-edit re-validation. To remain scalable for city-scale production, MapAgent is selectively triggered only on tiles with low backbone confidence, adding modest overhead while preserving throughput. Experiments on real-world datasets show consistent gains over strong production baselines, especially in complex and long-tail scenarios. Additionally, MapAgent has been integrated into Baidu Maps, supporting lane-level map generation for over 360 cities nationwide and elevating the overall production automation to over 95%, demonstrating MapAgent's practicality and effectiveness for large-scale lane-level map generation.
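
The selective triggering and the bounded judge/plan/edit loop described above can be sketched as follows. This is a toy illustration, not MapAgent's implementation: the `Tile` fields, the 0.8 threshold, the round limit, and the stand-in `judge` and `plan_and_edit` callables are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Tile:
    tile_id: str
    confidence: float      # backbone confidence in [0, 1] (assumed scale)
    violations: List[str]  # outstanding specification violations


def refine_tile(tile: Tile, judge: Callable, plan_and_edit: Callable,
                max_rounds: int = 3) -> int:
    """Bounded judge -> plan/edit -> re-validate loop; returns rounds used."""
    for round_idx in range(max_rounds):
        issues = judge(tile)          # diagnose the current draft
        if not issues:
            return round_idx          # draft passes validation
        plan_and_edit(tile, issues)   # apply a minimal corrective edit
    return max_rounds                 # budget exhausted, stop anyway


def process(tiles: List[Tile], judge: Callable, plan_and_edit: Callable,
            threshold: float = 0.8) -> List[str]:
    """Run the agent loop only on tiles below the confidence threshold."""
    triggered = []
    for tile in tiles:
        if tile.confidence < threshold:
            refine_tile(tile, judge, plan_and_edit)
            triggered.append(tile.tile_id)
    return triggered


# Stand-ins for the vision-language Judge and the tool-calling Planner.
toy_judge = lambda tile: list(tile.violations)
toy_edit = lambda tile, issues: tile.violations.remove(issues[0])

tiles = [Tile("a", 0.95, ["lane_gap"]), Tile("b", 0.40, ["lane_gap", "dup_edge"])]
print(process(tiles, toy_judge, toy_edit))  # ['b']
```

Tile "a" keeps its violation because the loop never runs on high-confidence tiles; that trade-off is what keeps the overhead modest in this sketch.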

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント

パラメトリックメモリを介した自己進化エージェントのスケーリング

既存のメモリ拡張 LLM エージェントは、ロールアウト中はモデルパラメータを凍結したまま、過去の経験をテキストの要約や取得された一節としてプロンプト空間にのみ保存します。このようなエージェントは、見たものを「調べる」ことはできても、そこから「学ぶ」ことはできません。ポリシーは経験によって変化せず、コンテキストから落ちた情報は永久に失われます。私たちは、自己進化するパラメトリックメモリフレームワーク TMEM を導入します。このフレームワークでは、エージェントは履歴を明示的メモリに圧縮するだけでなく、蒸留された教師信号を軽量なオンライン更新によって高速な LoRA 重み $\Delta_t$ に吸収し、単一のエピソード内で将来の動作を実際に変化させます。私たちはこれを、高速重みのロールアウトダイナミクスを備えたエージェント的意思決定プロセスとして定式化します。アクションは $\pi_{\theta_0+\Delta_t}$ からサンプリングされ、抽出アクションは後続の決定のために $\Delta_t$ を更新する教師信号を生成します。この見方により、抽出ポリシーは RL で直接最適化できるようになります。$\theta_0$ の学習は、タスクアクションだけでなく、オンライン LoRA 適応に使われるデータの品質も向上させます。さらに、オンラインでの収束を加速するために、LoRA 部分空間の SVD ベースの初期化を提案します。LoCoMo、LongMemEval-S、多目的探索、CL-Bench での実験では、TMEM がさまざまなモデルスケールにわたって、要約ベースおよび検索ベースのベースラインを一貫して上回ることが示されています。

原文 (English)

Scaling Self-Evolving Agents via Parametric Memory

Existing memory-augmented LLM agents store past experience exclusively in prompt space, as textual summaries or retrieved passages, while keeping model parameters frozen throughout a rollout. Such agents can \emph{look up} what they have seen but cannot \emph{learn from} it: their policy is unchanged by experience, and any information dropped from the context is permanently lost. We introduce \texttt{TMEM}, a self-evolving parametric memory framework in which the agent not only compresses history into explicit memory but also absorbs distilled supervision into fast LoRA weights $\Delta_t$ via lightweight online updates, genuinely altering its future behavior within a single episode. We formalize this as an agentic decision process with fast-weight rollout dynamics: actions are sampled from $\pi_{\theta_0+\Delta_t}$, while extraction actions produce supervision that updates $\Delta_t$ for subsequent decisions. This view makes the extraction policy directly optimizable by RL: training $\theta_0$ improves not only task actions but also the quality of the data used for online LoRA adaptation. We further propose SVD-based initialization of the LoRA subspace to accelerate online convergence. Experiments on LoCoMo, LongMemEval-S, multi-objective search, and CL-Bench show that \texttt{TMEM} consistently outperforms summary-based and retrieval-based baselines across different model scales.
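
As a rough illustration of fast LoRA weights on top of frozen parameters, here is a minimal NumPy sketch of a single linear layer computing (W0 + B A) x. It is not the TMEM algorithm: the class name, the squared-error loss, and the learning rate are assumptions. The SVD initialisation shown (A spans the top right-singular vectors of W0, B starts at zero) is one possible reading of "SVD-based initialization", and as a simplification only B is updated online.

```python
import numpy as np


class FastWeightLayer:
    """Frozen base weights W0 plus a low-rank fast-weight delta B @ A."""

    def __init__(self, W0: np.ndarray, rank: int):
        self.W0 = W0                                   # frozen (theta_0)
        _, _, Vt = np.linalg.svd(W0, full_matrices=False)
        self.A = Vt[:rank].copy()                      # (rank, d_in), fixed here
        self.B = np.zeros((W0.shape[0], rank))         # (d_out, rank), Delta_0 = 0

    def forward(self, x: np.ndarray) -> np.ndarray:
        return (self.W0 + self.B @ self.A) @ x

    def online_update(self, x: np.ndarray, target: np.ndarray,
                      lr: float = 0.5) -> float:
        """One gradient step on 0.5 * ||forward(x) - target||^2 w.r.t. B.

        Returns the loss measured before the step.
        """
        err = self.forward(x) - target
        self.B -= lr * np.outer(err, self.A @ x)       # dL/dB = err (A x)^T
        return 0.5 * float(err @ err)


rng = np.random.default_rng(0)
W0 = rng.normal(size=(6, 8))
layer = FastWeightLayer(W0, rank=2)

x = rng.normal(size=8)
x /= np.linalg.norm(x)            # unit norm keeps the step size stable
target = rng.normal(size=6)       # stand-in for distilled supervision

losses = [layer.online_update(x, target) for _ in range(20)]
print(losses[0] > losses[-1])     # the fast weights absorbed some supervision
```

Because only B moves and x has unit norm, each step scales the error by a factor between 0.5 and 1, so the loss cannot increase in this toy setting. The base matrix W0 is never written to.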

2026-06-05 13:00 JST arXiv cs.AI エージェント

Neetyabhas: 合理的エージェントベースモデルにおける不確実性を考慮した公共政策最適化のためのフレームワーク

目的: WHO の新型コロナウイルス感染症に対する非医薬品介入 (ロックダウン、ワクチン接種など) は感染を効果的に抑制しますが、経済に大きな負担を強います。既存の研究は、個人の行動を無視し、完全な感染追跡と完璧な政策実行を誤って仮定することが多く、現実世界の不確実性や誤差を考慮できていません。方法: 私たちは、流行の測定 (感染者数/入院者数) と政策実施の両方に不確実性を組み込んだ統合的アプローチを提案します。マスクの着用、ワクチン接種、買い物についてリアルタイムで選択を行う 1,000 人の個人のシミュレーションモデルを構築しました。同時に、政策立案者は健康と経済の観測に基づいて介入 (ロックダウン、義務化) を実施します。このフレームワークは階層型強化学習エージェントによって駆動され、ディープ Q ネットワークと、不確実性を考慮した方策勾配法の変種 (DDPG および TD3) を利用します。結果: シミュレーションは流行の進行を効果的に制御しました。マスクとワクチン接種は非常に効果的で、流行のピークの高さと期間の両方を大幅に低減しました。個人の行動、政策の不確実性、多面的な介入を統合することで、私たちの動的制御アプローチは流行の影響を軽減することに成功しました。結論: 私たちのモデルは、不確実性と人間の行動を公衆衛生政策の枠組みに組み込むことで、これまでの研究の限界を克服します。このシミュレーションは、マスクとワクチンが極めて重要なツールとして機能すること、そして複雑なパンデミックにおいて効果的な介入を設計するには、個人の選択と不完全なデータを考慮することが重要であることを示しています。

原文 (English)

Neetyabhas: A Framework for Uncertainty-Aware Public Policy Optimization in Rational Agent-Based Models

Purpose The WHO's COVID-19 non-pharmaceutical interventions (e.g., lockdowns, vaccinations) effectively curb transmission but impose heavy economic strains. Existing research often neglects individual behaviors and falsely assumes perfect infection tracking and flawless policy execution, failing to account for real-world uncertainties and errors. Methods We propose an integrative approach incorporating uncertainties in both epidemic measurement (infections/hospitalizations) and policy implementation. We built a simulation model of 1,000 individuals making real-time choices regarding mask-wearing, vaccination, and shopping. Concurrently, policymakers deploy interventions (lockdowns, mandates) based on health and economic observations. This framework is driven by hierarchical reinforcement learning agents, utilizing deep Q-networks alongside uncertainty-aware policy gradient variants (DDPG and TD3). Results The simulations effectively managed the epidemic's progression. Masking and vaccinations proved highly effective, significantly reducing both the outbreak's peak height and duration. By integrating individual behaviors, policy uncertainties, and multifaceted interventions, our dynamic control approach successfully mitigated the epidemic's impact. Conclusions Our model overcomes previous research limitations by embedding uncertainty and human behavior into public health policy frameworks. The simulation demonstrates that accounting for individual choices and imperfect data is crucial for designing effective interventions during complex pandemics, with masks and vaccines serving as pivotal tools.

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント

最初に計画し、後で判断し、より良く実行する: DMAIC からインスピレーションを得た産業異常検出用エージェントシステム

大規模言語モデル (LLM) エージェントは、複雑なデータ分析ワークフローの自動化において有望視されていますが、リスクの高い産業シナリオでの信頼性の高い導入は依然として困難です。産業異常検出 (IAD) は製造の品質、安全性、効率に不可欠ですが、既存の LLM ベースの IAD エージェントは主に実行に焦点を当てており、戦略の策定を十分に活用していません。その結果、異種のモダリティを統一的かつコスト効率の高い方法で処理することに苦労しています。DMAIC 品質管理フレームワークに着想を得て、私たちは DMAIC-IAD (DMAIC-inspired Agentic Industrial Anomaly Detection) を提案します。これは、LLM エージェントを構造化された産業上の問題解決に合わせる、「最初に計画し、後で判断する」マルチエージェントシステムです。DMAIC-IAD は、戦略生成の前に異種の参考資料を標準作業手順 (SOP) に蒸留し、事前学習済みの実行不要な判定モデルを導入して、コストのかかる実行時の試行なしに候補戦略をランク付けします。4 つのモダリティにわたる広範な実験により、DMAIC-IAD は適用可能なエージェント型ベースラインと比べて平均検出性能を 37.76% 向上させることが示されています。

原文 (English)

Plan First, Judge Later, Run Better: A DMAIC-Inspired Agentic System for Industrial Anomaly Detection

Large language model (LLM) agents have shown promise in automating complex data-analysis workflows, but their reliable deployment remains challenging in high-stakes industrial scenarios. Industrial anomaly detection (IAD) is essential for manufacturing quality, safety, and efficiency, yet existing LLM-based IAD agents mainly focus on execution while under-exploiting strategy formulation. Consequently, they struggle to handle heterogeneous modalities in a unified and cost-effective manner. Inspired by the DMAIC quality-management framework, we propose DMAIC-IAD (DMAIC-inspired Agentic Industrial Anomaly Detection), a "Plan First, Judge Later" multi-agent system that aligns LLM agents with structured industrial problem-solving. DMAIC-IAD distills heterogeneous references into standardized operating procedures (SOPs) before strategy generation, and introduces a pre-trained execution-free judge model to rank candidate strategies without costly runtime trials. Extensive experiments across four modalities show that DMAIC-IAD improves average detection performance over applicable agentic baselines by 37.76%.

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント

Parthenon Law: 自己進化する法務エージェントフレームワーク

エージェントの能力が高まるにつれて、法律分野の LLM エージェントは、文書量の多い案件をレビュー可能な成果物に変えることが期待されています。しかし、信頼性の高い導入には 3 つの障害があります。第一に、現在最も強力なモデルとハーネスの組み合わせがエンドツーエンドの法律案件でどのように振る舞うかについて、大規模な証拠がありません。第二に、法律分野に適合したエージェントアーキテクチャがなく、汎用ハーネスしかありません。第三に、新しい事実、典拠、期限によって状況が変化し続ける中で、システムが自らの結果から学習するメカニズムがありません。私たちはこの 3 つすべてに取り組みます。Harvey LAB での大規模な実証研究 (12,510 件のエージェント軌跡) は、フロンティアのエージェントでさえ、案件を 1 回のパスで完了するには程遠いことを示しています。モデルが強力になるほど基準ごとの精度は上がりますが、厳密な案件完了率は頭打ちになります。次に、自己進化する法務エージェントフレームワーク Parthenon を導入します。これは、モデル、ハーネス、エージェントの役割、法律知識、決定論的なツール、手続き的スキルを、出典の追跡可能性、日付と数値の根拠付け、成果物のコンプライアンス、論点のクローズのための監査可能なサーフェスに分解します。最後に、リーク防止学習ループが、採点された失敗を、スキル、ツール、知識に対するタスク非依存の編集へと変換します。これにより、法律事務所が案件ごとにチェックリストやプレイブックを改善するのと同じように、モデルの重みに触れることなく、経験とともにシステムが改善されます。私たちの大規模な実証分析全体を通じて、Parthenon は法律案件タスクにおける最先端のモデルとハーネスのパフォーマンスを大幅に向上させます。

原文 (English)

Parthenon Law: A Self-Evolving Legal-Agent Framework

As agents grow more capable, legal-domain LLM agents promise to turn document-heavy matters into reviewable work products -- yet reliable deployment faces three obstacles: no large-scale evidence on how today's strongest model-and-harness combinations behave on end-to-end legal matters; no agent architecture adapted to the legal vertical, only general-purpose harnesses; and, in a setting that keeps shifting with new facts, authorities, and deadlines, no mechanism for systems to learn from their own outcomes. We address each. A large-scale empirical study on Harvey LAB -- $12{,}510$ agent trajectories -- shows that even frontier agents remain far from completing matters in a single pass: per-criterion accuracy climbs with stronger models while strict matter completion stalls. We then introduce \textsc{Parthenon}, a self-evolving legal-agent framework that factors Model, Harness, Agent roles, legal Knowledge, deterministic Tools, and procedural Skills into auditable surfaces for source traceability, date and number grounding, deliverable compliance, and issue closure. Finally, an anti-leakage learning loop converts scored failures into task-agnostic edits to skills, tools, and knowledge, letting the system improve with experience -- as a firm refines its checklists and playbooks after each matter -- without touching model weights. Across our large-scale empirical analysis, \textsc{Parthenon} substantially improves the performance of state-of-the-art models and harnesses on legal-matter tasks.

2026-06-05 13:00 JST arXiv cs.AI エージェント

MIRAGE: 暗黙的推論と生成世界モデルを備えたモバイルエージェント

モバイルエージェントは、スクリーンショットと言語による目標から日常のアプリケーションを操作することがますます期待されており、信頼性の高い制御には、画面のアフォーダンス、複数ステップのナビゲーション、将来の状態変化に関する推論が必要です。しかし、多くのエージェントはこの計算を長いテキストの思考連鎖として外部化するため、対話が遅くなり、教師データのコストが増加し、展開が複雑になります。私たちは、可視のテキスト推論トレースから連続的な潜在推論表現を学習するフレームワーク MIRAGE を紹介します。MIRAGE は、明示的な推論をコンパクトな隠れ状態に移し替え、エージェントが長い根拠をデコードすることなく内部で推論できるようにします。また、生成世界モデルの目的関数も組み込んでいます。潜在推論ベクトルを将来のスクリーンショットと整合させることで、エージェントが行動する前に今後のインターフェイス状態を予測するよう促します。これにより、隠れた計算は、圧縮された思考表現であると同時に、環境ダイナミクスの先読みモデルにもなります。推論時、MIRAGE は連続的な潜在空間で推論し、トークン生成を削減しながら実行効率を向上させます。AndroidWorld では、MIRAGE は 4B のアブレーションにおいて、3 ～ 5 倍少ないデコードトークン予算で、明示的な思考連鎖による教師あり微調整と同等の性能を達成し、同等の命令調整ベースラインを 10.2 ポイント上回ります。AndroidControl では、生成トークンを 75% 以上削減しながら、アクションのグラウンディングを向上させます。

原文 (English)

MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models

Mobile agents are increasingly expected to operate everyday applications from screenshots and language goals, where reliable control requires reasoning over screen affordances, multi-step navigation, and future state changes. However, many agents externalize this computation as long textual chains of thought, which slows interaction, increases supervision cost, and complicates deployment. We introduce MIRAGE, a framework that learns continuous latent reasoning representations from visible textual reasoning traces. MIRAGE transfers explicit reasoning into compact hidden states, enabling the agent to reason internally without decoding long rationales. It also incorporates a generative world-model objective: latent reasoning vectors are aligned with future screenshots, encouraging the agent to anticipate upcoming interface states before acting. This turns hidden computation into both a compressed thought representation and a forward-looking model of environment dynamics. At inference time, MIRAGE reasons in continuous latent space, reducing token generation while improving execution efficiency. On AndroidWorld, MIRAGE matches explicit chain-of-thought supervised fine-tuning in the 4B ablation with a 3-5x lower decoded-token budget and improves a comparable instruction-tuned baseline by 10.2 points; on AndroidControl, it improves action grounding while generating over 75% fewer tokens.

2026-06-05 13:00 JST arXiv cs.AI エージェント

Fog of Love: ゲーム環境における親和性ベースの強化学習による高潔なエージェントの動作のエンジニアリング

人工知能に高潔な行動を教え込むことへの関心が高まっています。提案された手法の 1 つは、親和性ベースの強化学習として知られています。これは、目的関数のポリシー正則化を使用して、報酬関数の設計に完全に依存することなく、善良な行動を奨励します。これまでのところ、この手法は、状態空間とアクション空間が最小限のグリッドワールドやおもちゃの問題環境で有効であることが実証されています。この研究をより洗練された環境に拡張するために、Fog of Love として知られるロールプレイングボードゲームに基づく 2 プレイヤーマルチエージェント環境を導入します。この環境では、2 人のエージェントがそれぞれの美徳を満たすために競い合いながら、その関係を満たすために協力します。マルチエージェントの性質を考慮すると、これは複雑な問題であり、マルチエージェントの深い決定論的ポリシー勾配エージェントは競合も連携もうまくいきません。我々は、局所的な親和性が競争目的と協力目的の両方を達成する際のエージェントのパフォーマンスを向上させ、その結果、両方のドメインで総合スコアが優れているという証拠を提示します。これは、結果的に賢明な選択をもたらすだけでなく、エージェントの目的論を明確にし、その行動を人間レベルで解釈できるようにします。

原文 (English)

Fog of Love: Engineering Virtuous Agent Behavior with Affinity-based Reinforcement Learning in a Game Environment

Instilling virtuous behavior in artificial intelligence has seen increasing interest. One of the techniques proposed is known as affinity-based reinforcement learning, which uses policy regularization on the objective function to incentivize virtuous actions without being fully dependent on the reward function design. Thus far, this technique has been demonstrated to be effective in grid worlds and toy-problem environments with minimal state and action spaces. To expand this research to more sophisticated environments, we introduce a two-player multi-agent environment based on the role-playing board game known as Fog of Love. In this environment, two agents compete to fulfill their individual virtues, while also cooperating to satisfy their relationship. Given the multi-agent nature, this is a complex problem where multi-agent deep deterministic policy gradient agents neither compete nor cooperate successfully. We present evidence that localized affinities enhance agent performance in achieving both competitive and cooperative objectives, resulting from superior overall scores in both domains. This not only results in virtuous choices but also clarifies an agent's teleology and makes its behavior human-level interpretable.

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント

FALSIFYBENCH: ルール発見ゲームを使用した LLM の帰納的推論の評価

大規模言語モデル (LLM) は、科学タスクにおける自律エージェントとして導入されることが増えています。しかし、これらのシステムが科学的発見に関わる帰納的推論を効果的に行えるかどうかは、未解決の問題のままです。この研究では、古典的な Wason の 2-4-6 課題に着想を得た、仮説駆動型推論の評価フレームワーク FALSIFYBENCH を紹介します。この課題では、エージェントは例を繰り返し提案してフィードバックを受け取ることで、隠れた意味的特性を発見しなければなりません。この課題は、科学的推論の主要な要素、すなわち仮説の生成、証拠の収集、そして確証的証拠と反証的証拠の両方に応じた信念の修正を捉えています。モデルファミリーとスケールにわたる 12 個の LLM を評価した結果、推論モデルは一般に命令調整モデルよりも優れた科学的推論者であるものの、最適なパフォーマンスに近いモデルは 1 つもないことがわかりました。成功の主な要因は否定的テストの能力です。自らの仮説を積極的に反証しようとするモデルは、主に確証を求めるモデルを一貫して上回ります。さらに、これまでの研究では見過ごされてきたきめ細かいターンレベルの分析により、失敗は、モデルが仮説空間を探索する際の特定可能なパターンと結びついていることが明らかになりました。

原文 (English)

FALSIFYBENCH: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games

Large language models (LLMs) are increasingly deployed as autonomous agents in scientific tasks. Yet whether these systems can effectively engage in forms of inductive reasoning relevant to scientific discovery remains an open question. In this work, we introduce FALSIFYBENCH, an evaluation framework for hypothesis-driven reasoning inspired by the classic Wason 2-4-6 task, in which agents must discover hidden semantic properties by iteratively proposing examples and receiving feedback. This task captures key elements of scientific reasoning: hypothesis generation, evidence gathering, and belief revision in response to both confirming and disconfirming evidence. Our evaluation of 12 LLMs across model families and scales shows that reasoning models are generally stronger scientific reasoners than instruction-tuned models, although no model comes close to optimal performance. The primary driver of success is the capacity for negative testing: models that actively seek to falsify their hypotheses consistently outperform those that primarily seek confirmation. Moreover, a fine-grained turn-level analysis, neglected in previous work, reveals that failure is tied to identifiable patterns in how models navigate the hypothesis space.
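
The difference between confirming and negative tests is easiest to see in the classic numeric Wason 2-4-6 task that inspired the benchmark. The sketch below uses that numeric version, whereas FALSIFYBENCH itself works with hidden semantic properties. The hidden rule, the agent's hypothesis, and the helper name are illustrative assumptions.

```python
from typing import Callable, Tuple

Triple = Tuple[int, int, int]


def hidden_rule(t: Triple) -> bool:
    """Oracle rule of the classic task: any strictly increasing triple."""
    return t[0] < t[1] < t[2]


def hypothesis(t: Triple) -> bool:
    """The agent's (too narrow) current belief: 'numbers increasing by 2'."""
    return t[1] - t[0] == 2 and t[2] - t[1] == 2


def falsifies(probe: Triple, belief: Callable[[Triple], bool],
              oracle: Callable[[Triple], bool]) -> bool:
    """A probe is informative when the feedback contradicts the belief."""
    return belief(probe) != oracle(probe)


confirming_probes = [(2, 4, 6), (10, 12, 14)]   # chosen to fit the belief
negative_probes = [(1, 2, 3), (5, 10, 20)]      # chosen to violate the belief

print([falsifies(p, hypothesis, hidden_rule) for p in confirming_probes])
# [False, False]
print([falsifies(p, hypothesis, hidden_rule) for p in negative_probes])
# [True, True]
```

Confirming probes always receive "yes" and so never reveal that the belief is too narrow; the negative probes also receive "yes", which contradicts the belief and forces a revision.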

2026-06-05 13:00 JST arXiv cs.AI エージェント 研究/論文

人間と AI のインタラクションにおけるマルチエージェントの相補性のツリーベースの定式化

相補性とは、人間と AI の相互作用 (HAI) が、そのメンバーの中で利用可能な最良の予測ベンチマークを上回る場合を指します。この考え方は HAI 研究の中心にありますが、相補性に関する形式的な研究は依然として限られています。既存のフレームワークは、エージェントの予測がワークフローに依存するマルチエージェントプロトコルへとどのように合成されるかをモデル化していません。私たちは、マルチエージェント HAI における相補性のツリーベースの形式化を導入することで、このギャップを埋めます。HAI プロトコルは、順序付けられたエージェントと役割の構成と、葉に予測ベクトルが付された根付き平面二分木によって表されます。局所的な二項合成ルールをツリーに沿って再帰的に評価することで、点ごとの最小値をとるオラクル (pointwise-min oracle) ベンチマークに対する、ツリー相対の相補性汎関数が得られます。私たちは 4 つの結果を証明します。第一に、セレクターベースの HAI (自己依存や AI 依存を含む) は、タスク、損失、予測の品質にかかわらず、相補性を達成できません。第二に、二乗損失の下での回帰では、相補性は正解ベクトルからのユークリッド距離の最小化と等価です。$N=2$ の場合、最適な線形プーリング重みは閉形式を持ち、残差補正としての解釈が可能です。第三に、線形の局所合成の下では、すべてのプロトコルツリーは葉の重みの単体上に重心座標チャートを定義します。プロトコルツリーの Tamari-cover 再パラメータ化は相補性を保存し、$N=4$ の場合は五角形恒等式を満たします。第四に、二値分類では、標準的なブレグマン損失や多くの有限なベルヌーイ $f$ ダイバージェンス損失を含む端点単調損失の下で、内部の局所合成は相補性を達成できません。クロスエントロピーの下での多クラス集約にも同様の障害が成り立ちます。要約すると、私たちのフレームワークは、相補性がマルチエージェント回帰では達成可能である一方、分類では、局所集約と損失関数に関する自然な条件の下で妨げられることを示しています。

原文 (English)

Tree-Based Formalization of Multi-Agent Complementarity in Human-AI Interactions

Complementarity is the case in which a human--AI interaction (HAI) outperforms the best prediction benchmark available among its members. Although this idea is central in HAI research, formal work on complementarity remains limited. Existing frameworks do not model how agents' predictions compose into workflow-sensitive multi-agent protocols. We close this gap by introducing a tree-based formalization of complementarity in multi-agent HAI. An HAI protocol is represented by an ordered agent-role configuration together with a rooted planar binary tree whose leaves are decorated by prediction vectors. A local binary composition rule is evaluated recursively along the tree, yielding a tree-relative complementarity functional relative to a pointwise-min oracle benchmark. We prove four results. First, selector-based HAIs, including self- or AI-reliance, cannot achieve complementarity regardless of task, loss, or prediction quality. Second, in regression under squared loss, complementarity is equivalent to Euclidean distance minimization from the ground-truth vector; for $N=2$, the optimal linear-pooling weight has a closed form and a residual-correction interpretation. Third, under linear local composition, every protocol tree defines a barycentric coordinate chart on the simplex of leaf weights; Tamari-cover reparameterizations of protocol trees preserve complementarity, and for $N=4$, they satisfy the pentagon identity. Fourth, in binary classification, no internal local composition can achieve complementarity under endpoint-monotone losses, including standard Bregman and many finite Bernoulli $f$-divergence losses; an analogous obstruction holds for multiclass aggregation under cross-entropy. In summary, our framework shows that complementarity is attainable in multi-agent regression, but obstructed in classification under natural conditions on local aggregation and loss functions.
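
For the two-agent regression case, the least-squares pooling weight can be worked out directly. The sketch below minimises ||w a + (1 - w) b - y||^2 over an unconstrained scalar w, which gives w* = <y - b, a - b> / ||a - b||^2. This is the standard least-squares solution and may be parameterised or constrained differently in the paper. The comparison against the better single agent is a simplification of the paper's pointwise-min oracle benchmark, and the vectors are made up.

```python
import numpy as np


def optimal_pooling_weight(a: np.ndarray, b: np.ndarray, y: np.ndarray) -> float:
    """w* minimising ||w*a + (1-w)*b - y||^2 over real w."""
    diff = a - b
    denom = float(diff @ diff)
    if denom == 0.0:
        return 0.0  # identical predictions: every weight gives the same pool
    return float((y - b) @ diff) / denom


def squared_loss(pred: np.ndarray, y: np.ndarray) -> float:
    return float(np.sum((pred - y) ** 2))


y = np.array([1.0, 2.0, 3.0])   # ground truth
a = np.array([1.5, 2.5, 3.5])   # agent A overshoots by 0.5
b = np.array([0.5, 1.5, 2.5])   # agent B undershoots by 0.5

w = optimal_pooling_weight(a, b, y)
pooled = b + w * (a - b)        # residual-correction form: start at b, correct toward a

print(w)                                          # 0.5
print(squared_loss(a, y), squared_loss(b, y))     # 0.75 0.75
print(squared_loss(pooled, y))                    # 0.0
```

The pooled prediction beats both agents here because their errors point in opposite directions. If a and b erred in the same direction, the unconstrained w* would fall outside [0, 1].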

2026-06-05 13:00 JST arXiv cs.AI エージェント

AIP: エージェントのスキルを学習および管理するためのグラフ表現

現在のエージェントスキルは主に自由形式の散文で構成されており、エージェントはセッションのたびに、どう行動すべきかを読み、解釈し、導出し直す必要があります。これは 2 つのコストを複合的に生みます。1 つは、実装の比重が大きいタスクでの信頼性の低下です。もう 1 つは、スキルの作成と改善の難しさです。散文の編集は人間にとってもエージェントにとっても扱いにくい脆弱なプロセスであり、モデルの学習データに十分含まれていないドメイン固有の手続き的知識では特にそうです。Agent Instruction Protocol (AIP) は、スキルを有向実行グラフとしてモデル化することで、この両方に対処します。個々のステップは、決定論的なスクリプトまたは自然言語の記述に裏付けられたノードとなり、明示的に型付けされた入出力エッジで接続され、スキーマ検証された YAML 仕様によって管理されます。コンパイラのメタスキルが、人間が書いた既存のスキルをこの形式に変換します。利点は 2 つあります。第一に、人間が書いたスキルを AIP にコンパイルすると、SkillsBench の 27 件の実際のエージェントタスクにおいて、Claude Sonnet の平均タスク報酬は 0.60 から 0.71 に、合格率は 53% から 67% に上昇しました。これは統計的に有意な向上 (Wilcoxon の符号付き順位検定、p = 0.011) であり、タスク単位では 12 勝 2 敗 13 分けで、多くの場合、実時間も短縮されました。グラフは、自然言語からコード、コマンド、ツール呼び出しを導出し直すことをエージェントに求めるのではなく、検証済みの実行可能なユニットをエージェントに提供します。第二に、作成と改善については、各スキルがスキーマ検証済みで、機能テストが可能で、ノード単位で参照できるため、失敗を正確に診断して修復できます。作成したスキルの 2 件の失敗は、スクリプトレベルまで原因を追跡できました。AIP 仕様を調整して再コンパイルしたところ、どちらも回帰ゼロで回復し (1 つのタスクは 0/5 から 5/5 に改善)、スキルの改善は散文の書き直しではなく、測定可能なチューニングループになりました。同じグラフ構造は、コーパスレベルのガバナンスとスキルのイントロスペクションを支え、スキルに対する強化学習の自然な行動空間も提供します。

原文 (English)

AIP: A Graph Representation for Learning and Governing Agent Skills

Agent Skills today consist largely of free-form prose requiring the agent to read, interpret, and re-derive how to act in every session. This imposes two compounding costs: reduced reliability on implementation-heavy tasks, and difficulty in skill creation and improvement, since editing prose is a fragile process that both humans and agents struggle with, particularly for domain-specific procedural knowledge underrepresented in model training. The Agent Instruction Protocol (AIP) addresses both by modeling a skill as a directed execution graph: discrete steps as nodes backed by deterministic scripts or natural-language descriptions, connected by explicit typed input/output edges, and governed by a schema-validated YAML specification. A compiler meta-skill translates existing human-written skills into this form. The benefits are twofold. First, compiling human-written skills to AIP raised Claude Sonnet's mean task reward from 0.60 to 0.71 and pass rate from 53% to 67% across 27 real agent tasks from SkillsBench - a statistically significant gain (Wilcoxon signed-rank p = 0.011), winning 12 tasks to 2 with 13 ties - often in less wall-clock time. The graph delivers vetted, runnable units to the agent rather than asking it to re-derive code, commands, and tool calls from natural language. Second, on creation and improvement, because each skill is schema-validated, functionally testable, and addressable node-by-node, failures can be diagnosed and repaired precisely. Two authored-skill failures were traced to the script level. After adjusting the AIP spec and recompiling, both recovered with zero regressions (one task going from 0/5 to 5/5), turning skill improvement into a measurable tuning loop rather than a prose rewrite. That same graph structure supports corpus-level governance and skill introspection, and provides a natural action space for reinforcement learning over skills.
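
The idea of a skill as a directed execution graph with typed edges can be sketched in a few lines. This is not the AIP schema (the abstract says the real specification is schema-validated YAML, which is not reproduced here): the `Node` and `Edge` classes, their field names, and the three toy steps are hypothetical, and the type check is a simple identity comparison.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


@dataclass
class Node:
    run: Callable[..., object]   # deterministic step implementation
    inputs: Dict[str, type]      # parameter name -> expected type
    output: type                 # declared type of the return value


@dataclass
class Edge:
    src: str    # producing node
    dst: str    # consuming node
    param: str  # input of dst that receives src's output


def validate(nodes: Dict[str, Node], edges: List[Edge]) -> None:
    """Reject graphs whose edges connect mismatched declared types."""
    for e in edges:
        expected = nodes[e.dst].inputs[e.param]
        if nodes[e.src].output is not expected:
            raise TypeError(f"{e.src} -> {e.dst}.{e.param}: type mismatch")


def execute(nodes: Dict[str, Node], edges: List[Edge],
            initial: Dict[str, dict]) -> Dict[str, object]:
    """Validate, then run every node after all of its predecessors."""
    validate(nodes, edges)
    deps = {name: {e.src for e in edges if e.dst == name} for name in nodes}
    results: Dict[str, object] = {}
    for name in TopologicalSorter(deps).static_order():
        kwargs = dict(initial.get(name, {}))
        kwargs.update({e.param: results[e.src] for e in edges if e.dst == name})
        results[name] = nodes[name].run(**kwargs)
    return results


nodes = {
    "load": Node(lambda path: [3, 1, 2], {"path": str}, list),
    "sort": Node(lambda items: sorted(items), {"items": list}, list),
    "report": Node(lambda items: ", ".join(map(str, items)), {"items": list}, str),
}
edges = [Edge("load", "sort", "items"), Edge("sort", "report", "items")]
results = execute(nodes, edges, {"load": {"path": "data.csv"}})
print(results["report"])  # 1, 2, 3
```

Because each step is a named node with declared inputs and outputs, a failure can be pinned to one node and that node replaced without touching the rest, which is the property the abstract credits for precise diagnosis and repair.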

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント

R-APS: 内省的敵対的パレート検索による制約付き設計のための構成推論とコンテキスト内メタ学習

大規模言語モデル (LLM) はオープンエンドなタスクでは流暢ですが、システムが計画を立て、ツールを使用し、長いホライズンにわたって行動しなければならないエージェント設定では、流暢さは信頼できる成果を保証しません。私たちは、このギャップの原因を、互いに結びついた 3 つの構造的な失敗に求めます。エラーは局所化されないまま伝播し、最悪ケースの摂動は評価されず、蓄積された知識は決して無効化されません。これらには共通の根本原因があると私たちは主張します。すなわち、アブダクション的、反事実的、メタ帰納的、修正的、帰納的な推論が、共有されたコンテキストを相容れない方向に引っ張るのです。私たちは Reflective Adversarial Pareto Search (R-APS) を導入します。これは、私たちの知る限り、推論モードの分解によって 3 つの失敗すべてに同時に対処する最初の手法であり、各推論モードに独自のコンテキストを割り当て、3 つのタイムスケールにわたる相互作用を調整します。具体的には、型付き検証クリティックを伴う段階的な構成的推論 (失敗の局所化)、第一級のパレート目的としての感度誘導型の反事実ストレステスト (堅牢性)、明示的な無効化を伴うメタ帰納的ルール抽出 (永続メモリ) です。R-APS は微調整を必要とせず、構造化されたプロトコル設計のみによって、凍結された LLM 上で動作します。私たちは平面機構の合成 (ロボット工学、義肢、機械設計) で評価を行い、すべての候補を運動学ソルバーで検証します。32 の目標軌道において、R-APS は、バー数と最悪ケースの堅牢性を同時に制御しながら、一様摂動ベースラインより 3.5 倍タイトな堅牢性証明、最初の許容解までの反復回数の 46% 短縮、Enum+GA に対して 2.1 倍の Chamfer 距離削減を達成します。推論に特化した小規模な 4B モデルは、このプロトコル内では汎用の 70B バックボーンに匹敵することが示されており、構造化されたプロトコルがモデルのスケールを部分的に補えることを示唆しています。

原文 (English)

R-APS: Compositional Reasoning and In-Context Meta-Learning for Constrained Design via Reflective Adversarial Pareto Search

Large language models (LLMs) are fluent on open-ended tasks, yet in agentic settings, where a system must plan, use tools, and act over extended horizons, fluency does not ensure reliable delivery. We trace this gap to three coupled structural failures: errors propagate without localization, worst-case perturbations go unevaluated, and accumulated knowledge is never invalidated. We argue these share a root cause: abductive, counterfactual, meta-inductive, corrective, and inductive reasoning pull a shared context in incompatible directions. We introduce Reflective Adversarial Pareto Search (R-APS), to our knowledge the first method addressing all three failures jointly via reasoning-mode decomposition, allocating each reasoning mode its own context and orchestrating interaction across three timescales: staged compositional reasoning with a typed validation critic (failure localization), sensitivity-guided counterfactual stress-testing as a first-class Pareto objective (robustness), and meta-inductive rule extraction with explicit invalidation (persistent memory). R-APS requires no fine-tuning and operates on a frozen LLM purely via structured protocol design. We evaluate on planar mechanism synthesis (robotics, prosthetics, mechanical design), with every candidate checked by a kinematic solver. On 32 target trajectories, R-APS delivers robustness certificates 3.5x tighter than uniform-perturbation baselines, 46% faster iterations-to-first-admission, and 2.1x Chamfer-distance reduction over Enum+GA while jointly controlling bar-count and worst-case robustness. Small 4B reasoning-specialized models prove competitive with general-purpose 70B backbones inside the protocol, suggesting structured protocols can partially offset model scale.
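
Since the results are reported in terms of Chamfer distance between a candidate mechanism's trajectory and the target, a short NumPy sketch of that metric may help. Several variants exist (squared versus unsquared distances, sum versus mean); the abstract does not say which one R-APS uses, so the symmetric mean-of-nearest-neighbour form below is an assumption, as are the sample points.

```python
import numpy as np


def chamfer_distance(P: np.ndarray, Q: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets P (n, d) and Q (m, d).

    Mean nearest-neighbour distance from P to Q plus the same from Q to P.
    """
    # Pairwise Euclidean distances, shape (n, m).
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())


target = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])   # sampled target path
candidate = target + np.array([0.0, 0.1])                 # same path, shifted up

print(chamfer_distance(target, target))                # 0.0
print(round(chamfer_distance(target, candidate), 6))   # 0.2
```

Every point's nearest neighbour is 0.1 away in both directions, so the two means add up to 0.2. The pairwise matrix costs O(n * m) memory, which is fine for short sampled trajectories.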

2026-06-05 13:00 JST arXiv cs.AI エージェント

Strabo: エージェント相互作用プロトコルの宣言的仕様と実装

ここ数年で、宣言型対話プロトコルに基づいたマルチエージェントシステムのモデリングと実装が大きく進歩しました。私たちの貢献である Strabo は、これらの進歩と Agentic AI における現在進行中の業界の取り組みとの関連性を確立します。具体的には、AI エージェントの電子商取引インタラクションを標準化するための Google 主導の最近の取り組みである UCP (Universal Commerce Protocol) について検討します。私たちの演習は 2 つの部分に分かれています。 1 つは、チェックアウトを処理する UCP の部分を宣言型 Langshaw プロトコルとしてモデル化し、Langshaw のプログラミングモデルである Peach を使用してエージェントを実装することです。演習のこの部分では、正式な宣言的仕様の利点を引き出します。 2 つ目は、Peach エージェントが Google によって実装された UCP エージェントと相互運用できることを示し、それによって UCP に関するアプローチの忠実性を確立します。このような相互運用により、宣言型プロトコルとエージェントを従来の設定に段階的に導入することが可能になり、大規模な更新を必要とせずに EMAS のアイデアが実践に影響を与える可能性がある道筋が示されています。

原文 (English)

Strabo: Declarative Specification and Implementation of Agentic Interaction Protocols

The last few years have witnessed major advances in the modeling and implementation of multiagent systems based on declarative interaction protocols. Our contribution, Strabo, establishes the relevance of these advances to ongoing industry efforts in Agentic AI. Specifically, we consider UCP, the Universal Commerce Protocol, a recent Google-led effort to standardize e-commerce interactions for AI agents. Our exercise is in two parts. One, we model the part of UCP dealing with checkouts as a declarative Langshaw protocol and implement agents using Peach, a programming model for Langshaw. This part of the exercise brings out the advantages of formal, declarative specifications. Two, we show that Peach agents can interoperate with UCP agents implemented by Google, thereby establishing the fidelity of our approach with respect to UCP. Such interoperation enables the incremental introduction of declarative protocols and agents into a conventional setting, indicating a pathway by which EMAS ideas could influence practice without demanding a wholesale update.

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント

RUBAS: エージェントの安全のためのルーブリックベースの強化学習

LLM がツール対応エージェントに進化すると、単純なテキスト生成ではなく現実世界の実行に関連した新しいクラスの安全性の課題が生じます。既存の調整方法は、粗い拒否信号や静的な監視に依存することが多く、さまざまなエージェントのリスクにわたって安全性と有用なツールの実行のバランスをとることが困難です。エージェントの安全性のためのルーブリックベースの強化学習フレームワークである RUBAS を紹介します。 RUBAS は、エージェントの動作をツール使用の安全性、引数の安全性、応答の安全性、有用性の 4 つの次元に分解します。これらの構造化されたルーブリックは、エージェントの完全な軌跡にわたってきめ細かく解釈可能な報酬を提供し、タスクの完了を維持しながら安全なツールの使用を最適化する強化学習を可能にします。複数のエージェントの安全性ベンチマークとモデルにわたる広範な実験により、RUBAS が標準的なアライメントベースラインよりも安全性を向上させ、ツールに基づく幻覚を軽減し、競争力のある実用性を維持することが示されています。私たちの結果は、多次元ルーブリック報酬が、安全性が重要なツール使用環境において LLM エージェントを調整するための効果的なトレーニング信号を提供することを示唆しています。

原文 (English)

RUBAS: Rubric-Based Reinforcement Learning for Agent Safety

The evolution of LLMs into tool-enabled agents creates a new class of safety challenges associated with real-world execution rather than simple text generation. Existing alignment methods often rely on coarse refusal signals or static supervision, making it difficult to balance safety with useful tool execution across diverse agentic risks. We introduce RUBAS, a rubric-based reinforcement learning framework for agent safety. RUBAS decomposes agent behavior into four dimensions: tool-use safety, argument safety, response safety, and helpfulness. These structured rubrics provide fine-grained and interpretable rewards over complete agent trajectories, enabling reinforcement learning to optimize safe tool use while preserving task completion. Extensive experiments across multiple agent safety benchmarks and models show that RUBAS improves safety over standard alignment baselines, reduces tool-grounded hallucinations, and maintains competitive utility. Our results suggest that multi-dimensional rubric rewards provide an effective training signal for aligning LLM agents in safety-critical tool-use settings.
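
One simple way to turn the four rubric dimensions into a scalar trajectory reward is a weighted sum, sketched below. The abstract does not say how RUBAS aggregates its rubrics, so the weighted sum, the equal weights, and the [0, 1] score range are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RubricScores:
    """Per-trajectory rubric scores, each assumed to lie in [0, 1]."""
    tool_use_safety: float
    argument_safety: float
    response_safety: float
    helpfulness: float


# Hypothetical equal weighting across the four dimensions.
WEIGHTS: Dict[str, float] = {
    "tool_use_safety": 0.25,
    "argument_safety": 0.25,
    "response_safety": 0.25,
    "helpfulness": 0.25,
}


def trajectory_reward(scores: RubricScores,
                      weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of rubric scores for one complete agent trajectory."""
    total = 0.0
    for name, weight in weights.items():
        value = getattr(scores, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
        total += weight * value
    return total


# A trajectory that used tools safely but refused to finish the task.
safe_but_unhelpful = RubricScores(1.0, 1.0, 0.5, 0.0)
print(trajectory_reward(safe_but_unhelpful))  # 0.625
```

Keeping helpfulness as its own term is what lets a reward of this shape penalise blanket refusals: a trajectory cannot reach the maximum by being safe alone.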

2026-06-05 13:00 JST arXiv cs.AI エージェント

証拠を運ぶエージェントアクション: 異種エージェントシステムに対するモデルに依存しないランタイムガバナンス

エージェントシステムは、ローカルコーディングツール、フレームワーク SDK、マネージドエージェントプラットフォーム、API ゲートウェイ、オブザーバーのみの統合など、非常に異なる制御ポイントを備えたランタイムを通じて実行されます。したがって、データを外部に公開するなどの高リスクのアクションは、あるランタイムではシェルコマンドとして、別のランタイムではツール呼び出しとして、そして 3 番目のランタイムではホストされたセッションの移行として現れる可能性があります。このため、ガバナンスの基本的な質問に一貫して答えることが困難になります。つまり、どのようなアクションが、誰の権限の下で、どのような承認セマンティクスに基づいて、実行後にどのような証拠によって承認されたのかということです。このペーパーでは、ベンダーネイティブのセッションレコードではなくアクション証明書を中心としたランタイム中立のガバナンスモデルである Proof-Carrying Agent Actions (PCAA) について説明します。 PCAA は、アクション前の許容性、アクションのオープン、仮定のキャプチャ、承認、結果のクローズという 5 つのチェックポイントを中心にコントロールを組織します。これらのチェックポイントは、ポータブルなアクションエンベロープ、実行時および承認のレシート、および再生可能なプルーフにバインドされます。このモデルは 2 つの実用的な方法で拡張されています。証明書は外部性を認識しており、宛先の可視性やアカウントの来歴などの境界事実を伝えます。もう 1 つは、承認は、単一のレビュー済みまたは未レビューのビットではなく、明示的な強制力クラスによって記述されます。私たちは、異種エージェントコントロールプレーンでの参照実装と開示制限のある評価プロトコルを通じてモデルを研究します。 4 つのランタイムファミリにわたる 24 個の実行可能シードから 96 個のトレースに拡張された保護されたベンチマークでは、PCAA はルートの品質を維持しながら、アブレーション下で個別の障害モードを明らかにします。この論文は、証明書を保持するアクションに関するランタイムガバナンスのシステム定式化と、その定式化がベンダー固有のコントロールサーフェスに崩壊することなくランタイムチャーンの下で移植性を維持できる方法についての実装に基づいた説明に貢献します。

原文 (English)

Proof-Carrying Agent Actions: Model-Agnostic Runtime Governance for Heterogeneous Agent Systems

Agent systems execute through runtimes with very different control points: local coding tools, framework SDKs, managed agent platforms, API gateways, and observer-only integrations. A high-risk action such as publishing data externally may therefore appear as a shell command in one runtime, a tool call in another, and a hosted session transition in a third. This makes it difficult to answer a basic governance question consistently: what action was authorized, under whose authority, with what approval semantics, and with what evidence after execution? This paper presents Proof-Carrying Agent Actions (PCAA), a runtime-neutral governance model centered on an action certificate rather than on a vendor-native session record. PCAA organizes control around five checkpoints: pre-action admissibility, action open, assumption capture, approval, and outcome closure. It binds these checkpoints to a portable action envelope, runtime and approval receipts, and replay-ready proof. The model is extended in two practical ways: the certificate is externality-aware, carrying boundary facts such as destination visibility and account provenance, and approval is described by explicit enforceability classes rather than by a single reviewed or unreviewed bit. We study the model through a reference implementation in a heterogeneous agent control plane and a disclosure-bounded evaluation protocol. On a protected benchmark expanded from 24 executable seeds to 96 traces across four runtime families, PCAA preserves route quality while exposing distinct failure modes under ablation. The paper contributes a systems formulation of runtime governance around certificate-bearing actions and an implementation-grounded account of how that formulation can remain portable under runtime churn without collapsing into vendor-specific control surfaces.
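The five checkpoints named in the abstract can be pictured as an ordered list of receipts attached to one certificate. The sketch below is a minimal data-structure illustration under stated assumptions: the class names `Checkpoint` and `ActionCertificate`, the field names, and the rule that checkpoints must be recorded strictly in order are all hypothetical and are not taken from the PCAA reference implementation.

```python
from dataclasses import dataclass, field
from enum import Enum


class Checkpoint(Enum):
    """The five checkpoints listed in the abstract, in the order given there."""
    PRE_ACTION_ADMISSIBILITY = 1
    ACTION_OPEN = 2
    ASSUMPTION_CAPTURE = 3
    APPROVAL = 4
    OUTCOME_CLOSURE = 5


@dataclass
class ActionCertificate:
    action: str                  # runtime-neutral description of the action
    authority: str               # whose authority the action runs under
    destination_visibility: str  # boundary fact, e.g. "public" or "internal"
    receipts: list = field(default_factory=list)

    def record(self, checkpoint: Checkpoint, evidence: str) -> None:
        """Append a receipt; strict ordering is an assumption of this sketch."""
        expected = len(self.receipts) + 1
        if checkpoint.value != expected:
            raise ValueError(f"expected checkpoint {expected}, got {checkpoint.name}")
        self.receipts.append((checkpoint, evidence))

    @property
    def closed(self) -> bool:
        """True once all five checkpoints have a receipt."""
        return len(self.receipts) == len(Checkpoint)


cert = ActionCertificate("publish report", "alice", "public")
for cp in Checkpoint:
    cert.record(cp, evidence=f"receipt for {cp.name}")
```

Because the certificate describes the action rather than the runtime, the same structure could in principle wrap a shell command, a tool call, or a hosted session transition.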

2026-06-05 13:00 JST | arXiv cs.AI | Agents | Robotics

AgenticDiffusion: Agentic Diffusion-based Path Planning for Vision-Based UAV Navigation

Indoor UAV navigation requires efficient exploration, scene understanding, and reliable trajectory execution under limited field-of-view observations. Existing vision-based navigation frameworks typically rely on single-view observations, limiting their ability to reason about occlusions, target visibility, and global scene structure. In this work, we propose AgenticDiffusion, a multi-view UAV navigation framework that coordinates language-guided reasoning, open-vocabulary target grounding, vision-based diffusion planning, and NMPC within a unified aerial navigation pipeline. Given a natural language instruction and synchronized first-person-view (FPV) and top-view observations, the framework determines the most informative viewpoint for navigation and generates a mission plan prior to trajectory execution. The targets are localized using an open-vocabulary grounding model, after which viewpoint-specific diffusion planners generate navigation trajectories for UAV execution. Using complementary viewpoints, the proposed framework reduces repeated target exploration and improves navigation efficiency in cluttered indoor environments. The framework was validated in four real-world UAV navigation scenarios involving adaptive viewpoint selection, multi-stage mission execution, long-horizon navigation, and safe landing-site selection. The experimental results demonstrated an overall mission success rate of 80% in 40 real-world trials, while the diffusion planners achieved a trajectory generation success rate of 100%.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

SaliMory: Orchestrating Cognitive Memory for Conversational Agents

Conversational agents that serve as lifelong companions must maintain persistent memory across all interactions. However, simply expanding context windows with raw retrieval degrades reasoning quality, while training memory agents via standard reinforcement learning creates a severe credit assignment bottleneck in a multi-stage pipeline. To solve this, we introduce SALIMORY, a framework that trains a single language model to manage a cognitively structured memory spanning user facts, preferences, and working memory. By introducing a hierarchical stage-wise process reward and reward-decomposed contrastive refinement, SALIMORY provides isolated supervision for distinct memory operations (selective filtering, consolidation, and cue-driven recall) end-to-end. SALIMORY cuts memory-attributed failures by one-third, outperforms the state-of-the-art by over 10% in end-to-end accuracy, and more than doubles the Good Personalization rate.

2026-06-05 13:00 JST | arXiv cs.AI | Agents | Robotics

Semantic Constraint Synthesis for Adaptive Trajectory Optimization via Large Language Models

Trajectory optimization is a critical component for enabling safe and reliable autonomous operations in space exploration. As space missions increase in frequency, complexity, and scope, there is a growing need to rapidly formulate mathematically sound trajectory optimization problems that accurately reflect mission objectives and operational constraints. However, translating mission intent into tractable analytical formulations for trajectory optimization requires substantial domain expertise. This paper presents a framework that leverages large language models (LLMs) to translate natural language descriptions of mission requirements and constraints into executable trajectory optimization code and corresponding mathematical formulations. Experiments in spacecraft rendezvous scenarios demonstrate a high success rate in reconditioning a convex trajectory optimization problem from semantic mission requirements. Ultimately, this work highlights the potential of LLMs to bridge high-level intent and formal optimization models, enabling more flexible and efficient trajectory design of spacecraft.

2026-06-05 13:00 JST | arXiv cs.AI | Agents | Research/Papers

HighTide: An Agent-Curated Open-Source VLSI Benchmark Suite

We introduce HighTide, an evolving AI-assisted benchmark suite. Specifically, the contributions are: (i) a diverse open-source suite spanning multiple design languages and technology nodes, (ii) Bazel-based incremental RTL-to-GDS compilation with remote caching, (iii) AI-assisted design curation through twelve agent skills covering the design lifecycle, flow optimization, tool reference, and meta-maintenance, backed by per-design decision logs that serve as long-term memory of tuning rationale across the suite, and (iv) an infrastructure with RTL compilation verification for stable releases. The suite is publicly available and designed to grow with the open-source hardware ecosystem.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Hardware/Semiconductors

Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents

LLM agents often place sensitive credentials in the same context window as untrusted retrieved content, creating a direct path for indirect prompt injection to induce credential exfiltration. We study this failure mode through three complementary defenses. First, we ask whether activation probes can detect credential access before output tokens are emitted. Second, we construct honeytokens from format-specific character models and calibrate detection with split conformal prediction. Third, we treat multi-turn exfiltration as a cumulative information-flow problem and track an estimated leakage budget across conversation turns. In controlled experiments on open-weight models, activation features separate benign and credential-seeking prompts with high accuracy, including under held-out encoding transformations. In a small synthetic multi-turn suite, cumulative accounting detects attacks that per-turn detectors miss. These results are preliminary: the multi-turn benchmark is in-house and small, the activation method requires white-box access, and the information estimator provides a practical signal rather than a formal upper bound. Still, the results suggest that credential-exfiltration defenses should combine pre-output monitoring, calibrated canary detection, and temporal leakage accounting rather than relying only on text-level output filters.
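Two of the three defenses lend themselves to a short sketch: calibrating a detection threshold with split conformal prediction, and accumulating a leakage estimate across turns. The code below is a generic illustration, assuming higher scores mean "more suspicious" and that per-turn leakage arrives as an estimate in bits; the names `conformal_threshold` and `LeakageBudget`, the score scale, and the budget value are hypothetical and are not the paper's implementation.

```python
import math


def conformal_threshold(benign_scores, alpha):
    """Split-conformal threshold from held-out benign calibration scores.

    A new exchangeable benign input scores above the returned value with
    probability at most alpha.
    """
    n = len(benign_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(benign_scores)[k - 1]


class LeakageBudget:
    """Tracks estimated leaked bits across conversation turns."""

    def __init__(self, budget_bits):
        self.budget_bits = budget_bits
        self.spent_bits = 0.0

    def observe(self, estimated_bits):
        """Add one turn's estimate; return True once the budget is exceeded."""
        self.spent_bits += estimated_bits
        return self.spent_bits > self.budget_bits


threshold = conformal_threshold(list(range(1, 20)), alpha=0.1)
budget = LeakageBudget(budget_bits=8.0)
alarms = [budget.observe(3.0) for _ in range(3)]
```

In the usage lines, each turn leaks an estimated 3 bits, which a per-turn limit of, say, 4 bits would never flag, yet the cumulative total crosses the 8-bit budget on the third turn.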

2026-06-05 13:00 JST | arXiv cs.AI | Agents

Notarized Agents: Receiver-Attested Confidential Receipts for AI Agent Actions

Current AI agent observability is structurally compromised: the entity producing the activity log is the same entity whose activity is being logged. A compromised or buggy agent can omit, alter, or fabricate its own traces, and the operator running the agent has no independent way to detect tampering. We propose a class of protocols that resolves this by inverting the trust boundary: the service that receives an agent's call signs a receipt of what it observed using its own key, encrypts the receipt to the agent's owner, and publishes it to a public transparency log. The owner reconstructs a tamper-evident trail without trusting the agent or its operator. We instantiate the class as Sello, a protocol combining four properties absent in any current system: (P1) receiver-side signing, (P2) HPKE encryption to an owner public key bound to the authorization token via JWS, (P3) publication to a witness-cosigned Merkle log, and (P4) owner-side discovery by token reference. We describe the protocol, analyze its security under an adversary that controls the agent and its operator, present microbenchmarks of the cryptographic operations, and situate Sello among adjacent receipt-protocol work (Signet, AgentROA, Agent Passport System, draft-farley-acta, SCITT). We discuss known limitations including the suppression attack, service collusion, and the adoption-incentive problem.
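The tamper-evidence in property (P3) rests on a Merkle log: an owner holding a log root can check that a given receipt is included without trusting whoever supplies it. The sketch below shows only that inclusion check, using SHA-256 from the standard library. It is a simplified tree that duplicates the last node on odd levels, which is not the RFC 6962 tree shape, and it leaves out receiver signing, HPKE encryption, and witness cosigning entirely; the function names are hypothetical, not Sello's API.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def _next_level(level):
    # Interior nodes use a 0x01 prefix, leaves a 0x00 prefix (domain separation).
    return [_h(b"\x01" + level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(leaves):
    """Root hash over a non-empty list of receipts (bytes)."""
    level = [_h(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # simplification: duplicate the last node
        level = _next_level(level)
    return level[0]


def inclusion_proof(leaves, index):
    """Sibling hashes from leaf to root, each tagged with 'sibling is on the left'."""
    proof = []
    level = [_h(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        level = _next_level(level)
        index //= 2
    return proof


def verify(leaf, proof, root):
    """Recompute the root from one receipt and its proof."""
    node = _h(b"\x00" + leaf)
    for sibling, sibling_is_left in proof:
        node = _h(b"\x01" + sibling + node) if sibling_is_left else _h(b"\x01" + node + sibling)
    return node == root


receipts = [b"receipt-1", b"receipt-2", b"receipt-3"]
root = merkle_root(receipts)
proof_for_third = inclusion_proof(receipts, 2)
```

A forged or altered receipt fails `verify` against the published root, which is what lets the owner detect tampering without relying on the agent's own logs.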

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents

StandardE2E: A Unified Framework for End-to-End Autonomous Driving Datasets

Autonomous driving has shifted from modular perception-prediction-planning stacks toward end-to-end (E2E) models that map sensor inputs directly to vehicle control, often regularized by auxiliary tasks such as 3D detection, motion forecasting, and HD-map perception. Progress is driven by a fast-growing ecosystem of sensor-rich driving datasets, yet each ships its own file formats, APIs, coordinate conventions, and modality coverage, leaving cross-dataset experimentation and even basic per-dataset preprocessing to be re-implemented per project. We present StandardE2E, a framework that provides a single unified interface over E2E driving datasets. StandardE2E (i) standardizes per-dataset preprocessing under one shared data schema; (ii) combines multiple datasets in a single PyTorch DataLoader for cross-dataset pretraining, auxiliary-task supervision, and scenario-level filtering; and (iii) reduces adding a new dataset to a single per-dataset mapping from raw frames to the canonical schema, leaving the entire downstream pipeline unchanged. The framework supports six datasets out of the box: Waymo End-to-End, Waymo Perception, Argoverse 2 Sensor, Argoverse 2 LiDAR, NAVSIM (OpenScene-v1.1), and WayveScenes101, and is released as the open-source standard-e2e Python package, available at https://github.com/stepankonev/StandardE2E.
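The design point in (iii), one mapping per dataset into a shared schema, can be illustrated with a small adapter pattern. This is not the standard-e2e package's API: `Frame`, `ADAPTERS`, `unified_frames`, the two toy datasets "a" and "b", and their raw field names are all invented for this sketch, and PyTorch is deliberately left out to keep it dependency-free.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class Frame:
    """Hypothetical canonical schema shared by every dataset."""
    dataset: str
    timestamp_s: float
    camera_paths: List[str]


Adapter = Callable[[dict], Frame]


def from_dataset_a(raw: dict) -> Frame:
    # Toy dataset "a" stores microsecond timestamps and a list of images.
    return Frame("a", raw["ts_us"] / 1e6, raw["images"])


def from_dataset_b(raw: dict) -> Frame:
    # Toy dataset "b" stores seconds and a single front camera path.
    return Frame("b", raw["time"], [raw["front_cam"]])


# Adding a dataset means adding one entry here; downstream code is unchanged.
ADAPTERS: Dict[str, Adapter] = {"a": from_dataset_a, "b": from_dataset_b}


def unified_frames(sources: Dict[str, Iterable[dict]]) -> Iterator[Frame]:
    """Yield canonical frames from several raw datasets through their adapters."""
    for name, records in sources.items():
        adapt = ADAPTERS[name]
        for raw in records:
            yield adapt(raw)


frames = list(unified_frames({
    "a": [{"ts_us": 2_000_000, "images": ["a/0001.jpg"]}],
    "b": [{"time": 3.5, "front_cam": "b/front_0001.jpg"}],
}))
```

Everything after `unified_frames` sees only `Frame` objects, so filtering or batching logic never needs to know which dataset a sample came from.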

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents

Memory is a core component of AI agents, enabling them to accumulate knowledge across interactions and improve performance. However, persistent memory introduces the risk of memory poisoning, where a single adversarial memory write can exert long-term influence over agent behavior. We present a systematic study of memory poisoning in LLM-based agents. We identify four memory write channels and nine structural vulnerabilities in model capabilities, system prompt design, and agent system architecture that make these channels exploitable. Based on these vulnerabilities, we develop a taxonomy of six classes of memory poisoning attacks. Furthermore, we design MPBench, a benchmark for evaluating memory poisoning attacks, and show that agents designed to write and retrieve memory more aggressively are more exploitable. We also show that existing prompt injection defenses fail to cover memory poisoning attacks. Our findings provide a foundation for understanding and mitigating memory poisoning attacks against AI agents.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems

Modern agentic systems transform LLMs from session-bounded assistants into stateful systems that persist and evolve shared world state across sessions through memories, filesystems, tools, and other long-lived contextual artifacts. This shift fundamentally expands the attack surface of prompt injection. However, prior work on prompt injection has largely focused on model-level threats within a single session, overlooking how cross-session persistent system state fundamentally changes the system-level risk of agentic systems. Inspired by stored cross-site scripting in web systems, we introduce cross-session stored prompt injection, where a successful injection can persist within agentic system state and silently influence future executions long after the original attacker interaction has ended. To systematically study this threat, we formalize stored prompt injection and develop a taxonomy of how adversarial content persists and affects agentic systems across sessions. We further develop a benchmark and sandbox toolkit to evaluate the risks of stored prompt injection, enabling quantitative analysis of attack success across different models, attack goals, and persistence channels. Our findings highlight that persistence transforms prompt injection from an ephemeral model-level threat into a long-lived system-level vulnerability embedded within agent execution state. We hope this work draws broader attention to this emerging threat and motivates the community to systematically study and mitigate system risks arising from persistence in agentic systems.

2026-06-05 13:00 JST | arXiv cs.AI | Agents | Research/Papers

CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities

AI has the potential to transform cybersecurity by enabling systems that can autonomously detect, analyze, and remediate software vulnerabilities. However, existing cybersecurity evaluations of AI systems are limited in scale or scope, and fail to capture the end-to-end lifecycle of real-world software vulnerability discovery and remediation. To address this gap, we propose CyberGym-E2E, a large-scale and realistic end-to-end cybersecurity benchmark that comprehensively evaluates AI agents' abilities across the full lifecycle of vulnerability discovery, PoC generation, and patch generation. CyberGym-E2E is comprehensive and scalable, as we build an automated, agent-enhanced pipeline for transforming open-source vulnerability data into realistic evaluation environments. Currently, the benchmark consists of 920 real-world vulnerabilities across 139 different open-source projects.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

SePO: Self-Evolving Prompt Agent for System Prompt Optimization

System prompt optimization improves agent behavior without modifying the underlying model, yielding human-readable, model-agnostic instructions. Existing methods build a prompt agent that refines task agents' system prompts, yet leave the prompt agent's own system prompt hand-engineered and fixed. We propose Self-Evolving Prompt Optimization (SePO), which treats the prompt agent's own system prompt as an optimization target alongside task agents' system prompts. SePO adopts a self-referential design. A single prompt agent improves both task agents' system prompts and its own under an open-ended evolutionary search that maintains an archive of candidate prompts as stepping stones. Training proceeds in two stages: pre-training evolves the prompt agent on a multi-task pool, and fine-tuning then applies it to a target task. Across five benchmarks spanning math (AIME'25), abstract reasoning (ARC-AGI-1), graduate-level science (GPQA), code generation (MBPP), and logic puzzles (Sudoku), SePO consistently outperforms Manual-CoT, TextGrad, and MetaSPO, improving the average accuracy by 4.49 points compared to Manual-CoT. The prompt optimization skill from pre-training also generalizes to tasks beyond the pre-training mixture, rather than memorizing per-task prompts.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Temporal Order Matters for Agentic Memory: Segment Trees for Long-Horizon Agents

Long-horizon conversational agents need to interact with users through evolving events, tasks, and goals. Such histories are naturally temporal, yet many existing memory systems organize information primarily by topical similarity and may ignore the order in which events occur. We introduce Segment Tree Memory, or SegTreeMem, a memory architecture that represents conversation history as a temporally ordered Segment Tree over utterances. SegTreeMem incrementally inserts new utterances through an online rightmost-frontier update rule, preserving chronological order while forming hierarchical memory segments. For retrieval, SegTreeMem propagates relevance scores through the tree to combine local semantic matching with hierarchical temporal context. Across three long-horizon memory benchmarks and two LLM backbones, SegTreeMem improves answer quality over flat retrieval, graph-structured memory, and tree-structured memory baselines. Additional temporal-order permutation analysis shows that the performance gain depends on preserving temporal order during memory construction, supporting the claim that temporal order is a key structure for agentic memory.
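To make the two ideas concrete (only the most recent segment may grow, and segment-level relevance flows down to its utterances), here is a deliberately flattened, two-level toy. It is not SegTreeMem: the real architecture is a deeper tree with its own update rule and scoring, none of which the abstract specifies. The class `TwoLevelMemory`, the word-overlap similarity, and the `parent_weight` parameter are assumptions made for this sketch.

```python
def tokens(text):
    """Lowercased word set; a stand-in for a real semantic encoder."""
    return set(text.lower().split())


class TwoLevelMemory:
    def __init__(self, parent_weight=0.5):
        self.segments = []  # list of utterance lists, kept in time order
        self.parent_weight = parent_weight

    def _segment_tokens(self, segment):
        return set().union(*(tokens(u) for u in segment))

    def insert(self, utterance):
        # Only the rightmost (most recent) segment may absorb a new utterance,
        # so an older, topically similar segment is never reopened.
        if self.segments and tokens(utterance) & self._segment_tokens(self.segments[-1]):
            self.segments[-1].append(utterance)
        else:
            self.segments.append([utterance])

    def retrieve(self, query, k=1):
        q = tokens(query)
        scored = []
        for segment in self.segments:
            seg_score = len(q & self._segment_tokens(segment))
            for u in segment:
                # Local match plus context inherited from the parent segment.
                scored.append((len(q & tokens(u)) + self.parent_weight * seg_score, u))
        scored.sort(key=lambda s: -s[0])  # stable sort keeps time order on ties
        return [u for _, u in scored[:k]]


memory = TwoLevelMemory()
for line in [
    "book flight to tokyo",
    "choose window seat on flight",
    "pasta recipe for dinner",
    "flight was delayed",
]:
    memory.insert(line)
top_two = memory.retrieve("window seat", k=2)
```

The fourth utterance starts a new segment even though it is about flights, because joining the first segment would break chronological order; and the booking utterance is retrieved second for "window seat" purely through its parent segment's score.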

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents

Instance-Level Post Hoc Uncertainty Quantification in Object Detection

Object detection is a safety-critical component of autonomous driving. It is essential to quantify the uncertainty in bounding-box predictions for safety assurance. Post hoc uncertainty quantification without retraining aligns with real-world deployment requirements; therefore, we employ the Laplace approximation. Because instance-level uncertainty is needed, linearized inference methods that require multiple backpropagations are not time-efficient, and sampling-based methods are not fully post hoc. We propose the Monte Carlo generalized linearized model (MC-GLM), which provides instance-level and approximately post hoc uncertainty quantification. The number of samples required in the Monte Carlo step is constant and independent of the number of output instances, so it can be parallelized. Experiments on the nuScenes dataset with the CenterPoint detector validate the effectiveness of our method, and the resulting uncertainties exhibit good quality.
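For orientation, the generalized linearized model under a Laplace approximation is commonly written as follows. This is the textbook form with a Monte Carlo estimate of the predictive covariance, given here as background; the exact MC-GLM estimator in the paper may differ. The symbols $\theta^{*}$ (trained weights), $\Sigma$ (Laplace posterior covariance), $J_{\theta^{*}}(x)$ (Jacobian of the network output with respect to the weights), and $S$ (number of samples) are notation introduced for this note.

```latex
\begin{aligned}
p(\theta \mid \mathcal{D}) &\approx \mathcal{N}(\theta^{*}, \Sigma), \\
f_{\mathrm{lin}}(x;\theta) &= f(x;\theta^{*}) + J_{\theta^{*}}(x)\,(\theta - \theta^{*}), \\
\theta^{(s)} &\sim \mathcal{N}(\theta^{*}, \Sigma), \qquad s = 1,\dots,S, \\
\widehat{\operatorname{Cov}}\bigl[f(x)\bigr] &= \frac{1}{S}\sum_{s=1}^{S}
  \bigl(f_{\mathrm{lin}}(x;\theta^{(s)}) - f(x;\theta^{*})\bigr)
  \bigl(f_{\mathrm{lin}}(x;\theta^{(s)}) - f(x;\theta^{*})\bigr)^{\top}.
\end{aligned}
```

The closed-form alternative, $J_{\theta^{*}}(x)\,\Sigma\,J_{\theta^{*}}(x)^{\top}$, needs the Jacobian of every output, which is where the per-instance backpropagation cost comes from. In the sampled form each term $J_{\theta^{*}}(x)(\theta^{(s)} - \theta^{*})$ is a single Jacobian-vector product covering all outputs at once, which is consistent with the abstract's statement that the sample count does not depend on the number of detected instances.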

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

TIDE: Proactive Multi-Problem Discovery via Template-Guided Iteration

Agents are widely deployed as assistants over documents, tools, and code. However, they typically act only on explicit user requests, which surface only the problems the user has noticed, while many other important problems coexist, hidden in plain sight, within the broader user context, with their total number unknown in advance. We frame this as the task of discovering multiple hidden problems from context, in which coexisting problems should be uncovered, grounded in supporting evidence, and paired with concrete actions. To this end, we introduce TIDE, a template-guided iterative framework with two complementary mechanisms. Specifically, motivated by the observation that single-pass prediction anchors on the most salient cases and yields generic claims, we propose iterative discovery, which surfaces a small batch of candidates per round while conditioning on what has already been found, so subsequent rounds extend coverage; and thought templates, reusable schemas distilled from previously solved cases that specify what contextual signals to attend to and how to connect them, anchoring each prediction in a recognizable problem class. We validate TIDE on two realistic settings, personal workspaces and software repositories, across four model backbones, showing substantial gains over single-shot and parallel multi-agent baselines on task coverage, identification, and resolution.
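The iterative-discovery mechanism amounts to a loop that asks for a small batch, feeds back what has already been found, and stops when nothing new appears. The sketch below shows that control flow only, with a stub standing in for the model call; `iterative_discovery`, `stub_propose`, the batch size, and the round limit are hypothetical, and thought templates are not modeled.

```python
def iterative_discovery(propose, context, batch_size=2, max_rounds=5):
    """Collect problems in small batches, conditioning each round on prior finds."""
    found = []
    for _ in range(max_rounds):
        # `propose` sees everything found so far, so it can look past
        # the most salient cases instead of repeating them.
        batch = [p for p in propose(context, found, batch_size) if p not in found]
        if not batch:
            break  # nothing new surfaced; stop early
        found.extend(batch)
    return found


def stub_propose(context, found, batch_size):
    """Stand-in for an LLM call: returns the next not-yet-found items."""
    remaining = [p for p in context if p not in found]
    return remaining[:batch_size]


HIDDEN_PROBLEMS = [
    "expired API key",
    "duplicate invoice",
    "missing test",
    "stale branch",
    "broken link",
]
discovered = iterative_discovery(stub_propose, HIDDEN_PROBLEMS)
```

Because the total number of problems is unknown in advance, the loop terminates on an empty batch rather than on a target count; `max_rounds` is only a safety cap.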

2026-06-05 13:00 JST | arXiv cs.AI | Agents

Archi: Agentic Operations at the CMS Experiment

We present Archi, an open-source, end-to-end framework for scientific collaborations that combines the systematic ingestion and organization of heterogeneous data sources with the deployment of configurable, private, and extensible agents that retrieve and reason over them. An instance of Archi has been deployed for the Computing Operations team of the CMS experiment at CERN's LHC since February 2026 as a support agent for technical operators, offering retrieval and analysis capabilities by combining documentation, historical data, and live monitoring systems. We evaluate the system on operator feedback and a question set collected from production usage, graded by human and automated panels. The system proves effective at operational tasks, resolving real-world queries posed by CMS operators. We also observe that locally-hosted, open-weight models perform competitively, enabling fully private management of sensitive data.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Agents

NoRA: Evaluating Grounded Reasonableness in Visual First-person Normative Action Reasoning

LLMs and agentic systems are increasingly deployed in social environments, making normative competence critical for safe and appropriate behavior. However, existing approaches either assess normative judgment in text alone or reduce it to choosing among a fixed set of candidate actions. We argue both are insufficient. In practice, agents are never handed a menu of options; they must identify a reasonable action from scratch, grounded in visible facts and supported by inspectable reasons. We introduce NoRA, a visual first-person video benchmark that requires models to generate candidate next actions and justify each through an explicit fact-reason-action support graph. The benchmark comprises 1,420 annotated video clips, including HumanGold-190 and LLMSilver-1230 splits. Each instance is evaluated through action alignment, factual grounding, and support binding, aggregated into a single grounded reasonableness score. We benchmark 12 multimodal systems under direct, deliberate, and structured prompting regimes, finding that current VLMs frequently recover plausible actions and relevant scene facts, but consistently struggle to construct the full reasonable action space and bind selected actions to the correct local support. NoRA makes this gap measurable, shifting the evaluation question from whether a model can pick an action to whether it can justify an appropriate action for the right visible reasons.

2026-06-05 13:00 JST arXiv cs.AI エージェント

おそらくほぼ安全な保証を備えたリスク認識型強化学習のシナリオ生成

特にディープ RL を使用して学習されたポリシーは、未知の動作や安全ではない動作を引き起こす遷移の摂動の影響を受けやすい場合があるため、安全性の保証は、現実世界への強化学習 (RL) エージェントの展開にとって重要です。ポリシー検証の一手法は、安全制約に関してポリシーの軌跡をサンプリングすることによって確率的バリア証明書を構築し、それによって既知の安全な動作と未知の動作を区別することです。ポリシーが、エージェントを十分に探索されていない状態に置くような遷移の不確実性または摂動の影響を受けやすい場合、これらの制約に違反する確率についてタイトな上界と下界を得ることは困難になる可能性があります。これに対処するために、変分オートエンコーダー (VAE) を使用して遭遇した状態空間の分布を近似し、状態の潜在特性を使用して上界と下界のバリア証明書を構築し、既知の安全な動作の領域を高い信頼度で最適化します。私たちはこれを双対最適化問題として定式化します。ここで、下界のバリア証明書は上界のバリア証明書よりも保守的な安全領域の推定を与えます。トレーニング中に両者の集合差に含まれる状態 (つまり、非ロバスト領域) をサンプリングすることにより、上界と下界をタイトにして、安全性についてよりシャープな確率的保証を提供することができます。私たちの研究では、与えられる保証について説明し、境界のタイトさを実験的に実証します。

原文 (English)

Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees

Guaranteeing safety is critical to the deployment of reinforcement learning (RL) agents in the real world, especially as policies learned using deep RL may demonstrate susceptibility to transition perturbations that result in unknown or unsafe behaviour. A method of policy verification is to construct probabilistic barrier-certificates by sampling policy trajectories with respect to safety constraints, thereby demarcating known safe behaviour from unknown behaviour. Obtaining tight upper and lower bounds on the probability of violation of these constraints may be difficult if the policy is susceptible to transition uncertainty or perturbation that places the agent in insufficiently explored states. To address this, we approximate the distribution of the encountered state-space using a variational autoencoder (VAE) and construct upper and lower-bound barrier-certificates using latent characteristics of states to optimize for regions of known, safe behaviour with high confidence. We frame this in our work as a dual optimization problem where the lower-bound barrier-certificate presents a more conservative estimate of the safe region than the upper-bound barrier-certificate. Sampling states that lie within the set difference of the two during training, i.e. the non-robust region, allows us to tighten the upper and lower bounds to provide sharper probabilistic guarantees on safety. Within our study, we describe the guarantees placed and demonstrate the tightness of our bounds experimentally.

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント

行動しながら学習: オンライン生涯学習エージェント向けのスキル強化されたテスト時間共進化フレームワーク

生涯学習は、動的で対話型の環境で動作する大規模言語モデル (LLM) エージェントにとって不可欠です。しかし、長期的なタスクのための既存の生涯学習エージェントは通常、推論時にはパラメータを固定したまま、離散的なスキルや過去の経験の検索に依存しているため、人間の学習者のようにテスト時のフィードバックを継続的に内面化することができません。このギャップを埋めるために、オンライン生涯学習エージェントのための 2 段階の強化学習フレームワークであるスキル強化テスト時間共進化 (LifeSkill) を提案します。具体的には、検証者ガイド付きスキル学習を設計します。これは、スキルで条件付けた複数のポリシーロールアウトにおける検証者の平均成功率に応じて候補スキルに報酬を与えることで、スキル抽出に対する直接的な教師信号の欠如に対処し、単にテキスト上でもっともらしいスキルではなく、タスクの解決に役立つスキルを生成するようにモデルを促します。さらに、オンラインスキル内部化を導入します。これは、スキルで条件付けた軌跡を報酬シグナルに変換することで、テスト時のインタラクション中にポリシーモデルを継続的に改善します。これにより、エージェントは推論能力をパラメータに直接内部化でき、経験の検索によるコンテキストの肥大化を回避できます。 LifelongAgentBench の実験では、既存の生涯学習エージェントのベースラインと比較して、LifeSkill が平均パフォーマンスを絶対値で 7 ポイント向上させることが示されています。

原文 (English)

Learning While Acting: A Skill-Enhanced Test-Time Co-Evolution Framework for Online Lifelong Learning Agents

Lifelong learning is essential for Large Language Model (LLM) agents operating in dynamic, interactive environments. However, existing lifelong learning agents for long-horizon tasks typically depend on retrieving discrete skills or past experiences with static parameters during inference, which prevents them from continuously internalizing test-time feedback like human learners. To bridge this gap, we propose Skill-enhanced Test-Time Co-Evolution (LifeSkill), a two-stage reinforcement learning framework for Online Lifelong Learning Agents. Specifically, we design Verifier-Guided Skill Learning that addresses the lack of direct supervision for skill extraction by rewarding candidate skills according to the average verifier success of multiple skill-conditioned policy rollouts, encouraging the model to generate skills that are useful for solving tasks rather than merely plausible in text. Furthermore, we introduce Online Skill Internalization, which continuously improves the policy model during test-time interaction by transforming skill-conditioned trajectories into reward signals. This enables the agent to directly internalize reasoning capabilities into its parameters, avoiding the context bloat of experience retrieval. Experiments on LifelongAgentBench show that LifeSkill improves average performance by 7 absolute points compared with existing lifelong agent baselines.
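
The verifier-guided reward described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `skill_reward`, `toy_policy`, `toy_verifier`, and the rollout count are hypothetical names and values chosen for illustration. The only idea taken from the abstract is that a candidate skill is scored by the average verifier success over several skill-conditioned rollouts.

```python
import random
from typing import Callable


def skill_reward(
    skill: str,
    task: str,
    policy: Callable[[str, str, random.Random], str],
    verifier: Callable[[str, str], bool],
    num_rollouts: int = 8,
    seed: int = 0,
) -> float:
    """Score a candidate skill by the mean verifier success of its rollouts."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(num_rollouts):
        # Each rollout is conditioned on both the task and the candidate skill.
        trajectory = policy(task, skill, rng)
        if verifier(task, trajectory):
            successes += 1
    return successes / num_rollouts


def toy_policy(task: str, skill: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM policy: useful skills succeed more often."""
    success_probability = 0.9 if "sort" in skill else 0.2
    return "solved" if rng.random() < success_probability else "failed"


def toy_verifier(task: str, trajectory: str) -> bool:
    """Hypothetical stand-in for a task verifier."""
    return trajectory == "solved"


if __name__ == "__main__":
    for candidate in ("sort the list before searching", "think carefully"):
        print(candidate, skill_reward(candidate, "find x", toy_policy, toy_verifier, 50))
```

Because the reward depends only on verified outcomes, a skill that merely reads well but does not help the policy would receive a low score in this sketch.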

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント

人間が作成したオントロジーからの証明可能で監査可能で安全な LLM エージェント

線形監査可能性を必要とする自明でない問題ドメインでの使用を目的とした LLM エージェントアーキテクチャ、Agentic Redux を紹介します。型付きラムダ計算を用いて、適切なドメインで実行した場合、Agentic Redux の実行が意味的に正しいことが保証され、すべての決定が追記専用の台帳に記録されることを証明します。医療請求のコンプライアンスとセキュリティ脆弱性の開示という、実稼働グレードの適切なドメインを 2 つ紹介します。両方のドメインで動作する Agentic Redux のコードは、付属のコードリポジトリで入手できます。また、問題ドメインに対するエージェントフレームワークを作成するための方法論であるオントロジーファーストエージェント設計も紹介します。この方法論では、人間の専門家が Basic Formal Ontology を用いて問題ドメインをオントロジー化し、次に LLM に、ドメイン内の問題に取り組むためにエージェントと human-in-the-loop の人間が担える役割を導出させます。

原文 (English)

Provably Auditable and Safe LLM Agents from Human-Authored Ontologies

We introduce the LLM agent architecture Agentic Redux, intended for use with nontrivial problem domains that require linear auditability. Using the typed lambda calculus, we prove that, run on appropriate domains, Agentic Redux executions are semantically guaranteed to be correct, with all decisions recorded in an append-only ledger. We present two production-grade appropriate domains, in healthcare billing compliance, and security vulnerability disclosure. Working code for Agentic Redux run on both domains is available in a supporting code repository. We also introduce Ontology-First Agent Design, a methodology for creation of agent frameworks on a problem domain, in which a human expert ontologizes the problem domain with Basic Formal Ontology, and then assigns an LLM to derive roles that agents and humans-in-the-loop can fill, in order to work the problems in the domain.
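
To make the idea of an append-only decision ledger concrete, here is a minimal Python sketch. It is an illustration only and is not taken from the Agentic Redux repository: the `Ledger` and `Entry` classes, the hash-chaining scheme, and the example decisions are all assumptions. Hash chaining is one common way to make after-the-fact modification detectable; the abstract does not say which mechanism the authors use.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    index: int
    decision: dict
    prev_hash: str
    entry_hash: str


def _digest(index: int, decision: dict, prev_hash: str) -> str:
    """Hash an entry together with the hash of its predecessor."""
    payload = json.dumps(
        {"index": index, "decision": decision, "prev": prev_hash}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class Ledger:
    """Hypothetical append-only ledger: entries can be added but never edited."""

    GENESIS = "0" * 64

    def __init__(self):
        self._entries = []

    def append(self, decision: dict) -> Entry:
        prev = self._entries[-1].entry_hash if self._entries else self.GENESIS
        index = len(self._entries)
        entry = Entry(index, decision, prev, _digest(index, decision, prev))
        self._entries.append(entry)
        return entry

    def entries(self) -> tuple:
        return tuple(self._entries)

    def verify(self) -> bool:
        """Recompute the chain; any altered decision breaks a hash link."""
        prev = self.GENESIS
        for i, entry in enumerate(self._entries):
            expected = _digest(i, entry.decision, prev)
            if entry.index != i or entry.prev_hash != prev or entry.entry_hash != expected:
                return False
            prev = entry.entry_hash
        return True


if __name__ == "__main__":
    ledger = Ledger()
    ledger.append({"actor": "agent", "action": "flag_claim"})
    ledger.append({"actor": "human", "action": "approve"})
    print(ledger.verify())
```

An auditor can replay the entries in order and call `verify()` to check that no recorded decision was changed after it was written.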

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント

プロンプトからプロセスまで: AI ソフトウェア開発エージェントをサポートするフレームワークのプロセス分類と比較評価

プログラミング用の AI ツールは、もはや単なるオートコンプリートやチャットアシスタントではありません。プロセス、役割、成果物、検証を備えた開発フレームワークとして組織化されています。最近の調査では、ソフトウェアエンジニアリングのためのエージェントと LLM がマッピングされていますが、これらの機能をプロセスに変える運用フレームワークを中心とした調査は行われていません。私たちは、機能的包含基準とトラクション測定を使用して一次ソースの直接検索を実行し、GitHub Spec Kit、OpenSpec、BMAD Method、Get Shit Done (GSD)、Spec Kitty、Reversa の 6 つのフレームワークを選択しました。それぞれが異なるパスを通じて AI 開発を攻撃します。つまり、完全および軽量バリアントでの仕様駆動型開発、エージェント駆動のアジャイルプランニング、エージェントを介したコンテキストエンジニアリング、ワークツリーの分離とレビュー、レガシーシステムからの運用仕様の回復です。私たちの中心的な貢献は、仕様、コンテキスト、役割、実行、検証、移植性という 6 次元のプロセス分類と、プロセスを複製可能なツールに変えるスコアリングルーブリックです。これを 6 つのフレームワークとサンプル外のケースである Spec-Flow に適用します。 2 つの結果が際立っています。すでに何らかのプロセスを採用しているフレームワークの中には収束が見られます。分離されたプロンプトは中心性を失い、永続的な成果物、作業契約、トレーサビリティ、人間によるレビューが曖昧さを減らし、エージェントを調整するメカニズムになります。また、6 つの側面すべてを強力にカバーするフレームワークはなく、プロセスの深さとエージェント間の移植性の間の構造的なトレードオフが明らかになります。また、繰り返し発生するリスク、つまり仕様とコードの間のずれ、生成されたアーティファクトへの過剰な信頼、コミュニティ拡張の脆弱性、プラットフォームへの依存、プロセス全体のベンチマークの欠如なども見つかりました。最後に、中間品質の指標、コンテキストガバナンス、インストールのセキュリティと再現性に焦点を当てた、実証的評価のための研究課題を取り上げます。

原文 (English)

From Prompt to Process: a Process Taxonomy and Comparative Assessment of Frameworks Supporting AI Software Development Agents

AI tools for programming are no longer just autocomplete or chat assistants: they organize themselves as development frameworks, with process, roles, artifacts and verification. Recent surveys map agents and LLMs for software engineering, but a study centered on the operational frameworks that turn these capabilities into process is missing. We ran a directed search of primary sources, with a functional inclusion criterion and traction measurement, and selected six frameworks: GitHub Spec Kit, OpenSpec, BMAD Method, Get Shit Done (GSD), Spec Kitty and Reversa. Each attacks AI development through a different path: spec-driven development in full and lightweight variants, agent-driven agile planning, context engineering over the agent, worktree isolation and review, and recovery of operational specifications from legacy systems. Our central contribution is a six-dimension process taxonomy: specification, context, roles, execution, validation and portability, with a scoring rubric that turns it into a replicable instrument. We apply it to the six frameworks and an out-of-sample case, Spec-Flow. Two results stand out. Among frameworks that already adopt some process there is convergence: the isolated prompt loses centrality, and persistent artifacts, work contracts, traceability and human review become mechanisms that reduce ambiguity and coordinate agents. And no framework strongly covers all six dimensions, exposing a structural trade-off between process depth and portability across agents. We also found recurring risks: drift between specification and code, excessive trust in generated artifacts, fragility of community extensions, platform dependence and a lack of benchmarks for the complete process. We close with a research agenda for empirical evaluation, focused on intermediate-quality metrics, context governance, installation security and reproducibility.

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成 エージェント 研究/論文

計画、監視、回復: プロアクティブな手続き支援のためのベンチマークとアーキテクチャ

私たちは、手順的なタスクについてリアルタイムの段階的なガイダンスをユーザーに提供し、「いつ」中断し「どのように」指導するかを自律的に決定する、プロアクティブなマルチモーダルアシスタントシステムを構想しています。ただし、現実的な状況、特にユーザーが想定されるステップ順序から逸脱するという一般的なケースを反映した大規模なクロスドメインベンチマークがないため、進歩は限られています。私たちはこのギャップに 4 つの貢献で対処します。(1) 明示的な計画外 (Out-of-Plan, OOP) アノテーションと回復手順を備えた、プロアクティブな手順支援のための大規模なウェアラブル一人称視点データセット EgoProactive をリリースします。(2) 統一されたプロアクティブガイダンススキーマの下で、確立された 5 つのベンチマーク (Ego4D、EPIC-KITCHENS、EgoExo4D、HoloAssist、HowTo100M) を Pro²Bench に拡張します。(3) 手続き状態、視覚的キュー、および回復注入に特化した分離型プランナー・インタラクションアーキテクチャを提案します。(4) Llama 4 と Qwen-3.6-VL でのクロスバックボーン再現によって検証された、モデルファミリ間で転用可能なポストトレーニングレシピを紹介します。大規模な実験において、私たちが訓練した Llama-4 システムは、6 つのデータセットすべてにわたって、強力なプロプライエタリベースライン (Claude Opus 4.6、Gemini 3.1 Pro、GPT 5.2) およびオープンウェイトベースライン (Qwen3 VL 235B) と比べて、客観的な介入の質を大幅に向上させました。さらに、オラクルプランを用いた実験では、計画の品質が制御されている場合、訓練された二重 (duplex) モデルが高品質のガイダンスを生成し、計画外からの回復で大きな改善が得られることが示されています。

原文 (English)

Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance

We envision a proactive multi-modal assistant system which gives users real-time step-by-step guidance on a procedural task, autonomously deciding when to interrupt and how to coach. However, progress is limited by the absence of large-scale, cross-domain benchmarks that reflect realistic conditions, particularly the common case in which users deviate from the expected step sequence. We address this gap with four contributions: (1) we release EgoProactive, a large-scale wearable-egocentric dataset for proactive procedural assistance with explicit Out-of-Plan (OOP) annotations and recovery steps; (2) we augment five established benchmarks (Ego4D, EPIC-KITCHENS, EgoExo4D, HoloAssist, HowTo100M) into Pro²Bench under a unified proactive-guidance schema; (3) we propose a decoupled planner-interaction architecture specialized for procedural state, visual cues, and recovery injection; (4) we introduce a post-training recipe that transfers across model families, validated by cross-backbone replication on Llama 4 and Qwen-3.6-VL. In extensive experiments, our trained Llama-4 system substantially improves objective intervention quality over strong proprietary baselines (Claude Opus 4.6, Gemini 3.1 Pro, GPT 5.2) and open-weight baselines (Qwen3 VL 235B) across all six datasets. Oracle-plan experiments further show that, when plan quality is controlled, the trained duplex model produces high-quality guidance and large gains on Out-of-Plan recovery.

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント

エージェント追跡から信頼へ: LLM エージェントにおける証拠追跡と実行来歴

大規模言語モデル (LLM) ベースのエージェントは、外部ツール、検索システム、メモリモジュール、環境、その他のエージェントと対話することで、複雑なタスクを解決することが増えています。これらの機能により、エージェントの自律性が拡張されますが、エージェントの動作の検証、デバッグ、監査が難しくなります。最終回答の精度だけでは、出力がどのように生成されたか、各主張を裏付ける証拠は何か、ツールの呼び出しが正当化されたかどうか、メモリが後の決定にどのように影響したか、実行の失敗がどこで発生したかを説明することはできません。証拠追跡と実行来歴は、取得された証拠、ツール出力、メモリ項目、環境観察、中間的な主張、アクション、および最終的な回答がエージェントの実行全体を通じてどのように結び付いているかをモデル化することで、このギャップに対処します。このサーベイは、LLM エージェントにおける証拠追跡と実行来歴に関する体系的なレビューと概念的な枠組みを提供します。私たちは、検索による根拠付け、主張の裏付け、ツール使用の安全性、メモリの系譜、可観測性、デバッグ、監査、リカバリを結び付ける、統一された来歴の観点に基づいて関連研究を整理します。トレースソース、証拠と実行の単位、来歴関係、トレースの粒度とタイミング、表現形式、信頼機能を網羅する分類法を導入します。私たちは、来歴の表現、証拠の帰属、ツール使用の来歴、実行時のガードレール、来歴を伴うメモリ、トレースベースの可観測性、障害診断など、主要な方法論の方向性をレビューします。また、既存のベンチマーク、データセット、評価指標を来歴関連の機能にマッピングし、評価が最終的な回答の正しさからプロセスレベルの説明責任にどのように移行できるかについても説明します。最後に、統一トレーススキーマ、主張レベルおよび意味レベルの来歴、来歴を意識した安全メカニズム、現実的な実行トレースベンチマーク、リカバリ指向の評価、プライバシーを意識した監査インフラストラクチャなどの未解決の課題について概説します。

原文 (English)

From Agent Traces to Trust: Evidence Tracing and Execution Provenance in LLM Agents

Large language model (LLM)-based agents increasingly solve complex tasks by interacting with external tools, retrieval systems, memory modules, environments, and other agents. These capabilities expand agent autonomy, but also make agent behavior harder to verify, debug, and audit. Final-answer accuracy alone cannot explain how an output was produced, which evidence supported each claim, whether tool calls were justified, how memory influenced later decisions, or where execution failures originated. Evidence tracing and execution provenance address this gap by modeling how retrieved evidence, tool outputs, memory items, environment observations, intermediate claims, actions, and final answers are connected throughout agent execution. This survey provides a systematic review and conceptual framework for evidence tracing and execution provenance in LLM agents. We organize related work around a unified provenance perspective that connects retrieval grounding, claim support, tool-use safety, memory lineage, observability, debugging, audit, and recovery. We introduce a taxonomy covering trace sources, evidence and execution units, provenance relations, tracing granularity and timing, representation forms, and trust functions. We review key methodological directions, including provenance representation, evidence attribution, tool-use provenance, runtime guardrails, provenance-bearing memory, trace-based observability, and failure diagnosis. We also map existing benchmarks, datasets, and evaluation metrics to provenance-related capabilities, and discuss how evaluation can move from final-answer correctness toward process-level accountability. Finally, we outline open challenges, including unified trace schemas, claim-level and semantic provenance, provenance-aware safety mechanisms, realistic execution-trace benchmarks, recovery-oriented evaluation, and privacy-aware audit infrastructure.

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント

DAR: エージェンティックハーネスを用いた義務論的推論

義務論的推論 (deontic reasoning) とは、法律に基づく納税額の計算や移民控訴の結果の決定など、事例固有の事実に明示的なルールとポリシーを適用することで質問に答えるタスクです。 LLM ベースの義務論的推論の主な技術的課題は、関連するルールセットが長く相互参照される可能性があるため、モデルが特定の推論ステップに必要なルールを見つけられない可能性があることです。 Deontic Agentic Reasoning (DAR) を導入します。これは、モデルがオンデマンドで法令と対話するエージェント型の推論セットアップです。 DeonticBench のハードサブセット上で、複数のハーネスのもとで DAR を評価します。これらの設定全体で、エージェンティックハーネスは義務論的推論タスクの最前線を押し広げられることがわかりましたが、改善は均一ではありません。弱いモデルは、はるかに多くのトークンを消費しながら、数値タスクでパフォーマンスが低下することがよくあります。

原文 (English)

DAR: Deontic Reasoning with Agentic Harnesses

Deontic reasoning is the task of answering questions by applying explicit rules and policies to case-specific facts, for example computing tax liability under a statute or determining the outcome of an immigration appeal. A key technical challenge for LLM-based deontic reasoning is that the relevant ruleset can be long and cross-referenced, so models may still fail to locate the rules needed for a particular reasoning step. We introduce Deontic Agentic Reasoning (DAR), an agentic reasoning setup in which the model interacts with the statutes on demand. We evaluate DAR under multiple harnesses on hard subsets of DeonticBench. Across these settings, we find that agentic harnesses can push the frontier on deontic reasoning tasks, but improvements are not uniform: weaker models often degrade on numerical tasks while consuming far more tokens.

2026-06-05 13:00 JST arXiv cs.AI エージェント

自己反映型 API: AI エージェント回復のための構造は冗長性を上回る

AI エージェントが API を呼び出して検証エラーに遭遇した場合、何が問題だったかだけではなく、次に何をすべきかという情報が必要になります。自己反映型 API は、検証が失敗した場合に、エージェントが外部の推論なしにリクエストを修復して再試行するのに十分な、機械可読な recovery_feedback.suggestions[] ペイロードを返します。リーク監査済みのパイロット (セルあたり N = 30、LLM 3 つ、敵対的タスク 10) では、構造化された提案により、Anthropic モデルにおいて平易な英語の診断と比較してタスク完了率が +36.7 ～ +40.0 pp 上昇し (フィッシャーの正確確率検定 p ≤ 0.0022)、成功あたりのトークン効率は 1.8 ～ 2.2 倍向上しました。 gpt-4o-mini では上昇は有意ではありません (p = 0.435)。課金 API での 2 番目のドメインの再現実験によってこのパターンが確認されます。この比較は、LLM ベンチマークにおける文書化されていない 2 種類の回答漏洩を監査した後にのみ成り立ちます。再利用可能な CI インフラストラクチャとして audit_prompt_leakage.py を公開します。コードとデータ: https://github.com/arquicanedo/self-reflective-apis

原文 (English)

Self-Reflective APIs: Structure Beats Verbosity for AI Agent Recovery

When an AI agent calls an API and hits a validation error, it needs more than what went wrong: it needs what to do next. A self-reflective API returns, on validation failure, a machine-readable recovery_feedback.suggestions[] payload sufficient for the agent to repair the request and retry without external reasoning. On a leak-audited pilot (N = 30 per cell, 3 LLMs, 10 adversarial tasks), structured suggestions lift task-completion rate by +36.7 to +40.0 pp over plain-English diagnoses on Anthropic models (Fisher's exact p ≤ 0.0022), at 1.8 to 2.2× better per-success token efficiency. The lift is not significant on gpt-4o-mini (p = 0.435); a second-domain replication on a billing API confirms the pattern. The comparison only holds after auditing two undocumented classes of answer leakage in LLM benchmarks. We ship audit_prompt_leakage.py as reusable CI infrastructure. Code and data: https://github.com/arquicanedo/self-reflective-apis.
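
The following Python sketch illustrates what a structured recovery payload and a mechanical repair loop might look like. Only the field name recovery_feedback.suggestions[] comes from the abstract. The endpoint `create_invoice`, the shape of each suggestion (`field`, `action`, `value`), the validation rules, and the helper `call_with_repair` are assumptions made for illustration and are not taken from the paper's repository.

```python
ALLOWED_CURRENCIES = {"USD", "EUR"}


def create_invoice(request: dict) -> dict:
    """Hypothetical endpoint that returns structured suggestions on failure."""
    errors, suggestions = [], []

    currency = request.get("currency")
    if currency not in ALLOWED_CURRENCIES:
        errors.append("currency must be one of " + ", ".join(sorted(ALLOWED_CURRENCIES)))
        # Suggest a fix only when a mechanical repair exists.
        if isinstance(currency, str) and currency.upper() in ALLOWED_CURRENCIES:
            suggestions.append(
                {"field": "currency", "action": "set", "value": currency.upper()}
            )

    amount = request.get("amount_cents")
    if not isinstance(amount, int):
        errors.append("amount_cents must be an integer")
        if isinstance(amount, str) and amount.isdigit():
            suggestions.append(
                {"field": "amount_cents", "action": "set", "value": int(amount)}
            )

    if not errors:
        return {"status": 200, "invoice": dict(request)}
    return {
        "status": 422,
        "errors": errors,
        "recovery_feedback": {"suggestions": suggestions},
    }


def call_with_repair(request: dict, max_retries: int = 2) -> dict:
    """Apply suggestions mechanically and retry, without any extra reasoning."""
    attempt = dict(request)
    response = create_invoice(attempt)
    for _ in range(max_retries):
        if response["status"] == 200:
            break
        suggestions = response["recovery_feedback"]["suggestions"]
        if not suggestions:
            break  # nothing machine-actionable; give up
        for suggestion in suggestions:
            if suggestion["action"] == "set":
                attempt[suggestion["field"]] = suggestion["value"]
        response = create_invoice(attempt)
    return response


if __name__ == "__main__":
    print(call_with_repair({"currency": "usd", "amount_cents": "1250"}))
```

In this sketch the client never has to interpret the human-readable `errors` strings; it only applies the suggested edits, which is the contrast with plain-English diagnoses that the abstract draws.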

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント

LLM 駆動エージェントによる効率的で証拠に基づくモビリティ予測に向けて

個人レベルのモビリティ予測は、都市シミュレーション、交通計画、政策分析の中心となります。教師ありシーケンスモデルは高い精度を実現しますが、タスク固有のトレーニングが必要であり、意思決定レベルの透明性は限られています。最近の LLM ベースの手法は解釈可能性を向上させていますが、ほとんどが静的プロンプトとシングルパス推論に依存しているため、モビリティ信号が弱いか矛盾している場合に追加の証拠を探す能力が制限されています。私たちは、次の場所の予測を適応的な証拠制御型の意思決定として定式化する、トレーニング不要の LLM 駆動エージェントフレームワークである AgentMob を提案します。 AgentMob は、過去の履歴の規則性に基づく高速パスを通じて日常的なケースを解決しますが、あいまいなケースでは、最近の軌跡、過去の行動、滞在と移動の尤度、および地理的証拠に対する反復的なツール使用がトリガーされます。 3 つのモビリティデータセット全体で、AgentMob はトレーニング不要の LLM ベースの手法の中で最も強力な総合パフォーマンスを達成し、GPT-5.4 は BW で 71.42% Acc@1、YJMob100K で 33.14%、Shanghai ISP で 33.50% に達しました。 BW の非高速パスのケースでは、LLM コントローラーは、同じツールを用いる統計ベースラインと比較して Acc@1 を 30.65% から 48.62% に改善します。これは、その主な利点が、適応的な証拠収集を通じて曖昧な予測を解決することにあることを示しています。コードは https://github.com/Unknown-zoo/AgentMob で入手できます。

原文 (English)

Towards Efficient and Evidence-grounded Mobility Prediction with LLM-Driven Agent

Individual-level mobility prediction is central to urban simulation, transportation planning, and policy analysis. Supervised sequence models achieve strong accuracy but require task-specific training and offer limited decision-level transparency. Recent LLM-based methods improve interpretability, yet mostly rely on static prompts and single-pass inference, limiting their ability to seek additional evidence when mobility signals are weak or conflicting. We propose AgentMob, a training-free LLM-driven agent framework that formulates next-location prediction as adaptive evidence-controlled decision making. AgentMob resolves routine cases through a fast path based on historical regularity, while ambiguous cases trigger iterative tool use over recent trajectories, historical behavior, stay-move likelihood, and geographical evidence. Across three mobility datasets, AgentMob achieves the strongest overall performance among training-free LLM-based methods, with GPT-5.4 reaching 71.42% Acc@1 on BW, 33.14% on YJMob100K, and 33.50% on Shanghai ISP. On BW non-fast-path cases, the LLM controller improves Acc@1 from 30.65% to 48.62% over a same-tool statistical baseline, showing that its main benefit lies in resolving ambiguous predictions through adaptive evidence gathering. Our code is available at https://github.com/Unknown-zoo/AgentMob.
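
The split between a fast path and an agentic slow path can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not the AgentMob code: the function `predict_next`, the `regularity_threshold` value, the transition-pair data format, and the `controller` callable that stands in for the LLM-driven tool loop are all hypothetical.

```python
from collections import Counter
from typing import Callable, List, Tuple

Transition = Tuple[str, str]  # (location, next_location)


def predict_next(
    history: List[Transition],
    current: str,
    controller: Callable[[List[Transition], str], str],
    regularity_threshold: float = 0.8,
) -> Tuple[str, str]:
    """Return (predicted_location, path), where path is 'fast_path' or 'agent'."""
    followers = Counter(nxt for loc, nxt in history if loc == current)
    total = sum(followers.values())
    if total:
        best, count = followers.most_common(1)[0]
        # Routine case: one destination dominates the user's past behaviour.
        if count / total >= regularity_threshold:
            return best, "fast_path"
    # Ambiguous or unseen case: defer to the (hypothetical) evidence-gathering agent.
    return controller(history, current), "agent"


if __name__ == "__main__":
    stub_controller = lambda history, current: "gym"
    routine = [("home", "work")] * 9 + [("home", "gym")]
    print(predict_next(routine, "home", stub_controller))
```

The design choice worth noting is that the expensive controller is invoked only when the historical signal is weak, which matches the abstract's claim that the main benefit lies in the ambiguous cases.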

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント

マルチエージェント推論におけるストリーミング通信

マルチエージェント推論システムは、エンドツーエンドのレイテンシーをパイプラインの深さに応じて線形に増加させてしまう「生成してから転送」パラダイムを採用しています。 StreamMA は、各推論ステップが生成されるとすぐに下流のエージェントにストリーミングし、隣接するエージェントをパイプライン化してレイテンシーを短縮するマルチエージェント推論システムです。驚くべきことに、このパイプライン化は有効性も向上させます。マルチステップ推論の品質は不均一で、初期のステップは後のステップよりも信頼性が高いため、完全なチェーンではなくこれらの信頼できる初期ステップを使用することで、エラーが発生しやすい後期ステップが下流エージェントを誤らせることを防ぎます。ストリーム、シリアル、単一の各プロトコルに対する初の閉形式の同時解析によって両方の利点を形式化し、有効性の順序関係、高速化の上限、およびコスト比を導き出します。数学、科学、コードにわたる 8 つの推論ベンチマーク、2 つのフロンティア LLM (Claude Opus 4.6 および GPT-5.4)、および 3 つのトポロジ (チェーン、ツリー、グラフ) にわたって、StreamMA は両方のベースラインを上回りました (平均 +7.3 pp、最大は HMMT 2026 における +22.4 pp、Claude Opus 4.6-high)。これらの貢献に加えて、「ステップレベルのスケーリング則」を発見しました。つまり、エージェントごとのステップ数を増やすと、有効性と効率の両方が一貫して向上します。これは、エージェント数のスケーリングと直交し、かつ組み合わせ可能な新しいスケーリングの次元です。

原文 (English)

Streaming Communication in Multi-Agent Reasoning

Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency. Surprisingly, this pipelining also improves effectiveness: because multi-step reasoning quality is non-uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6-high). Beyond these contributions, we discover a "step-level scaling law": increasing per-agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent-count scaling.
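
The latency argument can be made concrete with a small timing model. The Python sketch below is not the paper's closed-form analysis: it assumes a simple chain in which every step takes the same time and an agent may begin its step j as soon as its upstream neighbour has finished step j. The function names and parameters are hypothetical.

```python
def serial_latency(depth: int, steps: int, step_time: float) -> float:
    """Generate-then-transfer: each agent waits for the full upstream chain."""
    return depth * steps * step_time


def streaming_latency(depth: int, steps: int, step_time: float) -> float:
    """Step-level streaming under the simplifying assumptions stated above."""
    finish = [[0.0] * steps for _ in range(depth)]
    for agent in range(depth):
        for step in range(steps):
            # A step can start once the same step is done upstream
            # and this agent has finished its own previous step.
            upstream_done = finish[agent - 1][step] if agent > 0 else 0.0
            previous_done = finish[agent][step - 1] if step > 0 else 0.0
            finish[agent][step] = max(upstream_done, previous_done) + step_time
    return finish[-1][-1]


if __name__ == "__main__":
    depth, steps, step_time = 4, 5, 1.0
    print(serial_latency(depth, steps, step_time))     # 20.0
    print(streaming_latency(depth, steps, step_time))  # 8.0
```

Under these assumptions the serial latency grows as depth × steps, while the streamed latency grows as depth + steps − 1, which is the usual pipelining behaviour. Real systems would deviate from this because step times vary and downstream agents may need more than one upstream step before starting.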

2026-06-05 13:00 JST arXiv cs.AI エージェント

アダプティブマインド: LoRA-as-Tools でエージェントに権限を与える

LoRA アダプターが、基本言語モデルが動的に選択して呼び出すことができる呼び出し可能なツールとして扱われるフレームワークを調査します。私たちは、アダプターがドメイン固有の強力な利益を提供するようにトレーニングされ、明確なメタデータで公開されている場合、基本モデルはクエリを適切な専門家に確実にルーティングし、単一のフレームワーク内に多くの特殊なアダプターの利点を効果的に集約できるという仮説を立てています。シングルステップルーティングとマルチステップエージェント推論の両方を研究する一般的なフレームワークであるアダプティブマインドを紹介します。この設定では、エージェントは他のツール (外部 API、取得システム、実行環境など) と一緒に複数のアダプターを繰り返し呼び出し、複数のステップにわたる出力を推論できます。これにより、アダプターは、静的に適用されるのではなく、推論中に構成できるモジュール式のスキルまたは記憶ユニットとして再構成されます。私たちの評価では、ルーティング層は 30 アダプターライブラリで 98.3% の精度に達し、十分なトレーニングを受けた専門家は、単一の共有トレーニングレシピの下で 9 つのタスクファミリ全体で +4.6 ～ +84.0 パーセンテージポイントの厳密スコアラーゲインを提供しました。 AM ルーターは、クエリがドメイン信号を表面化するすべてのベンチマークで、直接のスペシャリストの 5 pp 以内にこれらのゲインを集計します。私たちの調査結果は、このアプローチの有効性は個々のアダプターの品質と専門性に依存し、そのような多くの専門家を柔軟に構成できるようにすることで、言語モデルエージェントの実践的な能力を大幅に拡張し、より一般的なツール拡張インテリジェンスに移行できることを示唆しています。

原文 (English)

Adaptive Minds: Empowering Agents with LoRA-as-Tools

We investigate a framework in which LoRA adapters are treated as callable tools that a base language model can dynamically select and invoke. We hypothesize that, when adapters are trained to provide strong domain-specific gains and are exposed with clear metadata, a base model can reliably route queries to the appropriate expert, effectively aggregating the benefits of many specialized adapters within a single framework. We introduce Adaptive Minds, a general framework within which we study both single-step routing and multi-step agentic reasoning. In this setting, the agent can iteratively invoke multiple adapters alongside other tools (e.g., external APIs, retrieval systems, or execution environments) and reason over their outputs across multiple steps. This reframes adapters as modular skills or memory units that can be composed during reasoning rather than statically applied. In our evaluation, the routing layer reaches 98.3% accuracy on a 30-adapter library, and well-trained specialists provide +4.6 to +84.0 percentage points of strict-scorer gain across nine task families under a single shared training recipe; the AM router aggregates these gains within 5 pp of the direct specialist on every benchmark whose queries surface domain signal. Our findings suggest that the effectiveness of this approach depends on the quality and specialization of individual adapters, and that enabling flexible composition of many such experts can significantly expand the practical capabilities of language model agents, moving toward more general, tool-augmented intelligence.
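
A minimal Python sketch of adapters exposed as callable tools with metadata is given below. It is illustrative only: `AdapterTool`, `route`, and `answer` are hypothetical names, and the keyword-overlap router is a deliberately simple stand-in for the base language model that performs routing in the paper. No real LoRA library is used here; each adapter is represented by a plain Python callable.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class AdapterTool:
    name: str
    description: str  # metadata the router can read
    run: Callable[[str], str]  # stand-in for "base model + this adapter"


def route(query: str, tools: List[AdapterTool]) -> Optional[AdapterTool]:
    """Pick the adapter whose metadata overlaps most with the query words."""
    words = set(query.lower().split())
    best, best_score = None, 0
    for tool in tools:
        score = len(words & set(tool.description.lower().split()))
        if score > best_score:
            best, best_score = tool, score
    return best


def answer(query: str, tools: List[AdapterTool], base_model: Callable[[str], str]) -> str:
    """Call the selected specialist, or fall back to the base model."""
    tool = route(query, tools)
    return tool.run(query) if tool else base_model(query)


if __name__ == "__main__":
    library = [
        AdapterTool("sql", "sql database query tables", lambda q: "sql: " + q),
        AdapterTool("med", "medical diagnosis symptoms drugs", lambda q: "med: " + q),
    ]
    print(answer("write a sql query over two tables", library, lambda q: "base: " + q))
```

The sketch also shows why the abstract stresses clear metadata: the router sees only the descriptions, so a query that surfaces no domain signal falls through to the base model.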

2026-06-05 13:00 JST arXiv cs.AI エージェント

BRAINCELL-AID: コミュニティアノテーション用のエージェント AI が作成した脳細胞タイプのリソース

単一細胞 RNA シーケンスにより、多様な細胞型とそのトランスクリプトームシグネチャを識別する能力が変わりました。しかし、これらのシグネチャ、特に特徴が十分に解明されていない遺伝子に関連するシグネチャに注釈を付けることは、依然として大きな課題です。 Gene Set Enrichment Analysis (GSEA) などの従来の手法は、厳選されたアノテーションに依存しており、これらのコンテキストではパフォーマンスが低下することがよくあります。大規模言語モデル (LLM) は有望な代替手段を提供しますが、構造化されたオントロジー内で複雑な生物学的知識を表現するのに苦労しています。これに対処するために、我々は BRAINCELL-AID (BRAINCELL-AID: https://biodataai.uth.edu/BRAINCELL-AID) を紹介します。これは、フリーテキスト記述とオントロジーラベルを統合して、より正確で堅牢な遺伝子セットアノテーションを可能にする新しいマルチエージェント AI システムです。検索拡張生成 (RAG) を組み込むことで、関連する PubMed 文献を使用して予測を改良し、幻覚を軽減し、解釈可能性を高める堅牢なエージェントワークフローを開発しました。このワークフローを使用して、上位予測に含まれるマウス遺伝子セットの 77% に対して正しいアノテーションを達成しました。このアプローチを適用して、BRAIN Initiative Cell Census Network によって生成された包括的なマウス脳細胞アトラスからの 5,322 個の脳細胞クラスターに注釈を付け、領域特異的な遺伝子の共発現パターンを特定し、遺伝子アンサンブルの機能的役割を推測することで、脳細胞の機能についての新たな洞察を可能にしました。 BRAINCELL-AID は、神経学的に意味のある説明を持つ大脳基底核関連細胞タイプも識別します。したがって、コミュニティ主導のセルタイプのアノテーションをサポートする貴重なリソースを作成します。

原文 (English)

BRAINCELL-AID: An Agentic AI Created Brain Cell Type Resource for Community Annotation

Single-cell RNA sequencing has transformed our ability to identify diverse cell types and their transcriptomic signatures. However, annotating these signatures, especially those involving poorly characterized genes, remains a major challenge. Traditional methods, such as Gene Set Enrichment Analysis (GSEA), depend on well-curated annotations and often perform poorly in these contexts. Large Language Models (LLMs) offer a promising alternative but struggle to represent complex biological knowledge within structured ontologies. To address this, we present BRAINCELL-AID (BRAINCELL-AID: https://biodataai.uth.edu/BRAINCELL-AID), a novel multi-agent AI system that integrates free-text descriptions with ontology labels to enable more accurate and robust gene set annotation. By incorporating retrieval-augmented generation (RAG), we developed a robust agentic workflow that refines predictions using relevant PubMed literature, reducing hallucinations and enhancing interpretability. Using this workflow, we achieved correct annotations for 77% of mouse gene sets among their top predictions. Applying this approach, we annotated 5,322 brain cell clusters from the comprehensive mouse brain cell atlas generated by the BRAIN Initiative Cell Census Network, enabling novel insights into brain cell function by identifying region-specific gene co-expression patterns and inferring functional roles of gene ensembles. BRAINCELL-AID also identifies Basal Ganglia-related cell types with neurologically meaningful descriptions. Hence, we create a valuable resource to support community-driven cell type annotation.

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント 研究/論文

SciDER: 科学データ中心のエンドツーエンド研究者

大規模な言語モデルが科学的発見を加速させる一方で、既存のエージェントは適応性、ドメインの一般化、マルチモーダルなスケーラビリティにおいて厳しい制限に直面しており、多くの場合、生のドメイン固有の実験データを自律的に処理するのに苦労しています。これらの障壁を克服するために、研究ライフサイクル全体を柔軟に自動化するように設計されたマルチエージェントシステムである SciDER を導入します。このフレームワークは、新しいデータ中心のアプローチを採用し、4 つの専門化されたサブエージェントにわたる動的なマルチモーダルスキルシステムを統合します。具体的には、アイデア化エージェントは進化的アイデア検索を通じて新しい仮説を生成し、データ分析エージェントは生データを体系的に構造化し、実験エージェントはデータセットの特性に基づいて実行可能コードを合成し、批評エージェントは反復的な自己洗練を推進します。オープンソースの科学的発見を民主化するために、私たちは OpenSciDER-27B 微調整モデルと並行して、高品質の実行軌跡データセットである OpenSciDER-SFT-8K をリリースします。 6 つのベンチマーク全体で、SciDER と OpenSciDER は競合する、または優れた結果を獲得しており、特にデータ中心の分析、エンドツーエンドの研究実行、マルチモーダルな科学的視覚化において大きな成果が得られています。 SciDER は、データ分析と実験の実行を統合することにより、抽象的な科学的推論と再現可能な実験合成の間のギャップを埋めます。

原文 (English)

SciDER: Scientific Data-centric End-to-end Researcher

While large language models accelerate scientific discovery, existing agents face severe limitations in adaptability, domain generalization, and multimodal scalability, often struggling to autonomously process raw, domain-specific experimental data. To overcome these barriers, we introduce SciDER, a multi-agent system designed to flexibly automate the entire research lifecycle. This framework employs a novel data-centric approach and integrates a dynamic multimodal skill system across four specialized sub-agents. Specifically, an ideation agent generates novel hypotheses via Evolutionary Idea Search, a data analysis agent systematically structures raw data, an experimentation agent synthesizes executable code grounded in dataset characteristics, and a critic agent drives iterative self-refinement. To democratize open-source scientific discovery, we release OpenSciDER-SFT-8K, a high-quality execution trajectory dataset, alongside the OpenSciDER-27B fine-tuned model. Across six benchmarks, SciDER and OpenSciDER obtain competitive or leading results, with especially strong gains on data-centric analysis, end-to-end research execution, and multimodal scientific visualization. By integrating data analysis with experimental execution, SciDER bridges the gap between abstract scientific reasoning and reproducible experimentation synthesis.

2026-06-05 13:00 JST arXiv cs.AI エージェント

エージェンティックツールプロトコルの形式セマンティクス: プロセス計算アプローチ

外部ツールを呼び出すことができる大規模言語モデルエージェントの出現により、エージェントプロトコルの形式検証が緊急に必要になりました。この分野では、ゼロショット API 汎化のための研究フレームワークであるスキーマガイドダイアログ (SGD) と、エージェントとツールの統合のための業界標準であるモデルコンテキストプロトコル (MCP) の 2 つのパラダイムが支配的です。どちらもスキーマ記述を通じて動的なサービス検出を可能にしますが、両者の形式的な関係はまだ解明されていません。これらのパラダイムの概念的収束を確立した先行研究に基づいて、私たちは SGD と MCP の初のプロセス計算による形式化を提示し、明確に定義された写像 Phi の下で両者が構造的に双模倣 (bisimilar) であることを証明します。ただし、逆写像 Phi^{-1} は部分的かつ損失を伴うことを示し、MCP の表現力に重大なギャップがあることを明らかにします。双方向の分析を通じて、完全な振る舞い等価性のための必要十分条件として、5 つの原則 (セマンティックな完全性、明示的なアクション境界、障害モードの文書化、漸進的開示との互換性、ツール間関係の宣言) を特定しました。これらの原則を型システム拡張 MCP+ として形式化し、MCP+ が SGD と同型であることを証明します。私たちの研究は、検証されたエージェントシステムのための初の形式的基盤を提供し、スキーマの品質を証明可能な安全性特性として確立します。

原文 (English)

Formal Semantics for Agentic Tool Protocols: A Process Calculus Approach

The emergence of large language model agents capable of invoking external tools has created an urgent need for formal verification of agent protocols. Two paradigms dominate this space: Schema-Guided Dialogue (SGD), a research framework for zero-shot API generalization, and the Model Context Protocol (MCP), an industry standard for agent-tool integration. While both enable dynamic service discovery through schema descriptions, their formal relationship remains unexplored. Building on prior work establishing the conceptual convergence of these paradigms, we present the first process calculus formalization of SGD and MCP, proving they are structurally bisimilar under a well-defined mapping Phi. However, we demonstrate that the reverse mapping Phi^{-1} is partial and lossy, revealing critical gaps in MCP's expressivity. Through bidirectional analysis, we identify five principles -- semantic completeness, explicit action boundaries, failure mode documentation, progressive disclosure compatibility, and inter-tool relationship declaration -- as necessary and sufficient conditions for full behavioral equivalence. We formalize these principles as type-system extensions MCP+, proving MCP+ is isomorphic to SGD. Our work provides the first formal foundation for verified agent systems and establishes schema quality as a provable safety property.

2026-06-05 13:00 JST arXiv cs.AI エージェント

アカウンタビリティの地平線: 人間とエージェントの集合体の統治に関する不可能性定理

AI システムに対する法的、倫理的、規制上の既存の責任の枠組みは、重大な結果が生じた場合には必ず、少なくとも 1 人の特定可能な人物が意味のある責任を負うのに十分な関与と先見性を持っていたという共通の前提に基づいています。この論文は、自律性が計算可能なしきい値を超えると、エージェント型 AI システムが工学的な制限としてではなく数学的必然としてこの前提に違反することを証明します。私たちはヒューマンエージェントコレクティブ (Human-Agent Collectives) を導入します。これは、エージェントが共有された構造的因果モデル内の状態・方策タプルとしてモデル化される、人間と AI の共同システムの形式化です。自律性は 4 次元の情報理論的プロファイル (認識的、実行的、評価的、社会的) によって、集団の振る舞いは相互作用グラフと共同行動空間によって特徴付けられます。私たちは、帰属可能性 (責任には因果的寄与が必要である)、予見可能性の上限 (責任は予測能力を超えることはできない)、非空虚性 (少なくとも 1 つのエージェントが自明でない責任を負う)、および完全性 (すべての責任は完全に割り当てられなければならない) という 4 つの最小限の性質によって、正当な責任を公理化します。私たちの中心的な結果であるアカウンタビリティ不完全性定理は、複合的な自律性がアカウンタビリティの地平線を超え、かつ相互作用グラフに人間と AI のフィードバックサイクルが含まれる任意の集合体について、4 つの性質すべてを同時に満たす枠組みは存在しないことを証明します。この不可能性は構造的なものであり、透明性、監査、監督は、自律性を低下させない限りこれを解決できません。しきい値を下回る場合には正当な枠組みが存在し、急激な相転移が成立します。3,000 の合成集合体に対する実験により、すべての予測が違反ゼロで確認されました。これは AI ガバナンスにおける最初の不可能性の結果であり、それを下回れば現在のパラダイムが引き続き有効であり、それを超えると分散型の責任メカニズムが必要になるという形式的な境界を確立します。

原文 (English)

The Accountability Horizon: An Impossibility Theorem for Governing Human-Agent Collectives

Existing accountability frameworks for AI systems, legal, ethical, and regulatory, rest on a shared assumption: for any consequential outcome, at least one identifiable person had enough involvement and foresight to bear meaningful responsibility. This paper proves that agentic AI systems violate this assumption not as an engineering limitation but as a mathematical necessity once autonomy exceeds a computable threshold. We introduce Human-Agent Collectives, a formalisation of joint human-AI systems where agents are modelled as state-policy tuples within a shared structural causal model. Autonomy is characterised through a four-dimensional information-theoretic profile (epistemic, executive, evaluative, social); collective behaviour through interaction graphs and joint action spaces. We axiomatise legitimate accountability through four minimal properties: Attributability (responsibility requires causal contribution), Foreseeability Bound (responsibility cannot exceed predictive capacity), Non-Vacuity (at least one agent bears non-trivial responsibility), and Completeness (all responsibility must be fully allocated). Our central result, the Accountability Incompleteness Theorem, proves that for any collective whose compound autonomy exceeds the Accountability Horizon and whose interaction graph contains a human-AI feedback cycle, no framework can satisfy all four properties simultaneously. The impossibility is structural: transparency, audits, and oversight cannot resolve it without reducing autonomy. Below the threshold, legitimate frameworks exist, establishing a sharp phase transition. Experiments on 3,000 synthetic collectives confirm all predictions with zero violations. This is the first impossibility result in AI governance, establishing a formal boundary below which current paradigms remain valid and above which distributed accountability mechanisms become necessary.

2026-06-05 13:00 JST arXiv cs.AI エージェント 研究/論文

SciIntegrity-Bench: AI 科学者システムにおける学術的誠実性を評価するためのベンチマーク

AI 科学者システムは自律的な研究のために導入されることが増えていますが、その学術的誠実性が体系的に評価されたことはありません。私たちは、ジレンマ型の評価パラダイムに基づいて設計された最初のベンチマークである SCIINTEGRITY-BENCH を紹介します。11 のトラップカテゴリにわたる 33 のシナリオはいずれも、失敗を正直に認めることが唯一の正しい対応であり、タスクを完了するには不正行為が必要になるように構築されています。7 つの最先端 LLM にわたる 231 回の評価実行全体で、誠実性に関する問題の発生率は 34.2% に達し、失敗ゼロを達成したモデルはありませんでした。最も注目すべきことに、欠損データのシナリオでは、7 つのモデルすべてが実行不可能であることを認めずに合成データを生成しており、違いはその置き換えを開示するかどうかだけでした。さらに、プロンプトのアブレーション研究により 2 つの要因が切り分けられます。明示的な完了圧力を取り除くと、開示されない捏造は 20.6% から 3.2% へと大幅に減少しますが、基礎となる合成率は変化せず、プロンプトレベルの指示とは無関係に持続する本質的な完了バイアスが明らかになります。これらの結果は、観察された失敗の主な要因が、訓練によって獲得された性向としての正直な拒否の欠如にあることを示しています。SCIINTEGRITY-BENCH を https://github.com/liuxingtong/Sci-Integrity-Bench で公開します。

原文 (English)

SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems

AI scientist systems are increasingly deployed for autonomous research, yet their academic integrity has never been systematically evaluated. We introduce SCIINTEGRITY-BENCH, the first benchmark designed around a dilemmatic evaluation paradigm: each of its 33 scenarios across 11 trap categories is constructed so that honest acknowledgment of failure is the only correct response, while task completion requires misconduct. Across 231 evaluation runs spanning 7 state-of-the-art LLMs, the overall integrity problem rate reaches 34.2%, and no model achieves zero failures. Most strikingly, across missing-data scenarios, all seven models generate synthetic data rather than acknowledging infeasibility, differing only in whether they disclose the substitution. A further prompt ablation study separates two drivers: removing explicit completion pressure sharply reduces undisclosed fabrication from 20.6% to 3.2%, while the underlying synthesis rate remains unchanged, revealing an intrinsic completion bias that persists independent of prompt-level instructions. These findings point to the absence of honest refusal as a trained disposition as the primary driver of observed failures. We release SCIINTEGRITY-BENCH at https://github.com/liuxingtong/Sci-Integrity-Bench.
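
The headline figures above can be cross-checked with a few lines of arithmetic. This is a minimal sketch that assumes every model is run once on every scenario (which is consistent with 33 × 7 = 231); the count of problematic runs is only an estimate back-derived from the rounded 34.2% rate and is not a number reported in the abstract.

```python
# Figures quoted in the abstract.
scenarios = 33
trap_categories = 11
models = 7
problem_rate = 0.342  # overall integrity problem rate (rounded in the abstract)

# Assumption: one evaluation run per (scenario, model) pair.
runs = scenarios * models

# Average number of scenarios per trap category (an average only; the
# abstract does not say the categories are equally sized).
avg_scenarios_per_category = scenarios / trap_categories

# Estimated number of runs with an integrity problem, derived from the rate.
estimated_problem_runs = round(runs * problem_rate)

print(runs, avg_scenarios_per_category, estimated_problem_runs)
```

The run count reproduces the 231 evaluation runs stated in the abstract, which supports the one-run-per-pair reading.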

2026-06-05 13:00 JST arXiv cs.AI エージェント

タスク指向の対話で積極性を引き出す

アウトバウンド営業などのプロアクティブなタスク指向対話 (TOD) では、ユーザーの懸念を積極的に探り、限られたターン数内で受け入れられる方向に会話を導く説得力のあるエージェントが必要です。しかし、事後学習済みの LLM は本質的に保守的であり、報酬形成型 RL (GRPO など) は、すでに受動的なポリシーがサンプリングするものを再重み付けするだけであるため、苦戦します。ユーザーの潜在的な懸念で条件付けすることで、どれだけサンプリングしても損なわれないプロアクティブな能力が引き出され、これらの懸念が極めて重要な学習時のシグナルとして確立されることを示します。この発見を実用化するために、認知ユーザーシミュレーター (Cognitive User Simulator) を構築します。これは、各ユーザーを、観察可能な外部特性と隠れた内部懸念からなる階層化されたペルソナとしてモデル化します。このシミュレーターは、忠実で多様なインタラクションを生成すると同時に、説得の進行状況を追跡するターンごとの状態ダイナミクスを出力します。次に、モデル化された懸念とシミュレーションの状態遷移を相補的な学習目的に変換する、シミュレーター誘導非対称ビューポリシー最適化 (Simulator-Induced Asymmetric-View Policy Optimization) を導入します。(1) 非対称オンポリシー自己蒸留 (Asymmetric On-Policy Self-Distillation) は、懸念を認識した振る舞いを、同じポリシーの特権ビューからデプロイ可能な会話のみのビューに転移します。(2) 状態遷移ポリシー改良 (State-Transition Policy Refinement) ...

原文 (English)

Unlocking Proactivity in Task-Oriented Dialogue

Proactive task-oriented dialogue (TOD), such as outbound sales, demands a persuasive agent that actively probes the user's concerns and steers the conversation toward acceptance within a bounded number of turns. Yet post-trained LLMs are inherently conservative, and reward-shaping RL (e.g., GRPO) struggles since it only re-weights what an already passive policy samples. We show that conditioning on the user's latent concerns unlocks proactive capability that no amount of sampling can undermine, establishing these concerns as a pivotal training-time signal. To operationalize this finding, we build the Cognitive User Simulator, which models each user as a stratified persona comprising observable external traits and hidden internal concerns. The simulator produces faithful and diverse interactions, while emitting per-turn state dynamics that track persuasion progress. We then introduce Simulator-Induced Asymmetric-View Policy Optimization, which converts the modeled concerns and the simulation state transition into complementary training objectives: (1) Asymmetric On-Policy Self-Distillation that transfers concern-aware behavior from a privileged view of the same policy into its deployable, conversation-only view; and (2) State-Transition Policy Refinement ...

2026-06-05 13:00 JST arXiv cs.AI エージェント 研究/論文

AutoMedBench: Agentic AI モデルによる医療自動研究に向けて

自律エージェントは、個別の予測タスクや短い形式の臨床質問応答を超えて、エンドツーエンドの医療 AI 研究ワークフローをサポートすることがますます期待されています。ただし、既存の医療エージェントのベンチマークは主に最終出力を評価しており、研究プロセス内でのエージェントの行動に対する可視性は限られています。このギャップに対処するために、AutoMedBench は、さまざまな医療画像処理およびマルチモーダル推論タスクにわたる自律型医療 AI 研究のためのワークフロー認識ベンチマークであり、エージェントの実行を統合された 5 段階のワークフロー (S1 ～ S5) (計画、セットアップ、検証、推論、送信) に編成します。これは、セグメンテーション、画像強調、視覚的質問応答 (VQA)、レポート生成、および病変検出の 5 つの研究トラックにまたがる、各実行の平均 33 エージェントターンの長期タスクで構成されています。各タスクは、Lite と Standard の 2 つの難易度で評価されます。これらは同じデータとメトリクスを使用しますが、タスク概要のスキャフォールディングの量が異なります。各実行は、最終タスクのパフォーマンスと S1 ～ S5 ステージスコアの両方を使用してスコア付けされ、最初のタスク概要から最終的に提出された成果物までステージレベルの分析が可能になります。何千もの記録された実行を対象としたステージレベルのスコアリングでは、平均して検証が最も弱いワークフローステージであるのに対し、セットアップが最も強いことが明らかになりました。これは、現在のエージェントが信頼性の検証よりもパイプラインを実行可能にすることに優れていることを示唆しています。さらに、実行後のエラー分析では、検証と送信の失敗がタグ付きエラーの大部分を占め、それぞれ起動されたコードの 37.7% と 38.1% を占めているのに対し、タスク理解エラーは 0.9% とまれで、起動されたエラーコードが 1 つある実行は、エラーコードがない実行よりも全体のスコアが平均 48% 低いことが示されています。

原文 (English)

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

Autonomous agents are increasingly expected to support end-to-end medical-AI research workflows, moving beyond isolated prediction tasks or short-form clinical question answering. However, existing medical agent benchmarks primarily evaluate final outputs, providing limited visibility into agent behavior within the research process. To address this gap, we present AutoMedBench, a workflow-aware benchmark for autonomous medical-AI research across diverse medical imaging and multimodal inference tasks, organizing agent execution into a unified five-stage workflow (S1-S5): Plan, Setup, Validate, Inference, and Submit. It comprises long-horizon tasks with each run averaging 33 agent turns, spanning five research tracks: segmentation, image enhancement, visual question answering (VQA), report generation, and lesion detection. Each task is evaluated under two difficulty tiers, Lite and Standard, which use the same data and metrics but differ in the amount of task-brief scaffolding, and each run is scored using both final task performance and S1-S5 stage scores, enabling stage-level analysis from the initial task brief to the final submitted artifact. Across thousands of recorded runs, stage-level scoring reveals that Validate is the weakest workflow stage on average, whereas Setup is the strongest, suggesting that current agents are better at making pipelines executable than at verifying their reliability. Post-run error analysis further shows that verification and submission failures dominate tagged errors, accounting for 37.7% and 38.1% of fired codes respectively, whereas task-understanding errors are rare at 0.9%, and runs with one fired error code have a 48% lower overall score than runs with no error code on average.

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント

LEAP: エージェントフレームワークを使用した形式数学用の LLM のスーパーチャージング

大規模言語モデル (LLM) はインフォーマルな数学的推論には強い一方で、Lean のような形式言語で機械的に検証可能な証明を生成することには苦労します。私たちは、汎用の基盤モデルが自動形式定理証明で最先端の性能を達成できるようにするエージェントフレームワーク LEAP を提示します。LEAP は、インフォーマルな推論、指示追従、反復的な自己改善といった基盤モデルの能力を活用します。複雑な問題をより小さな単位に分解することで、システムは Lean コンパイラーとの継続的な対話を通じて、形式的な証明の構築とインフォーマルなブループリントを橋渡しします。飽和しつつあるベンチマークを超えた厳密な評価を提供するために、Lean で形式化された IMO スタイルの問題のベンチマークである Lean-IMO-Bench を導入します。このベンチマークの問題は、記述は短いものの、非定型的で複数ステップにわたる証明を必要とし、幅広い難易度にわたります。実験では、北米の学部学生を対象とした毎年恒例の数学コンテストである最新の 2025 年 Putnam Competition において、LEAP は 12 問すべてを解き、フロンティアの形式数学モデルによる最近のブレークスルーに並びました。Lean-IMO-Bench では、LEAP は汎用 LLM のワンショット形式解決率を 10% 未満から 70% に引き上げ、金メダル級の IMO 特化システムが記録した 48% という基準を上回っています。さらに、偶数位数ケイリーグラフの Knuth のハミルトン分解における重要な部分問題の検証済み証明を含め、未解決の組合せ論的課題に対する複雑な証明を自律的に形式化することで、LEAP の研究レベルでの有用性を実証します。

原文 (English)

LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks

Large Language Models (LLMs) exhibit strong informal mathematical reasoning but struggle to generate mechanically verifiable proofs in formal languages like Lean. We present LEAP, an agentic framework that enables general-purpose foundation models to achieve state-of-the-art performance on automated formal theorem proving. LEAP leverages foundation model capabilities, such as informal reasoning, instruction following, and iterative self-refinement. By decomposing complex problems into smaller units, the system bridges formal proof construction with informal blueprints through continuous interaction with the Lean compiler. To provide a rigorous evaluation beyond increasingly saturated benchmarks, we introduce Lean-IMO-Bench, a benchmark of IMO-style problems formalized in Lean, with short statements yet highly non-routine and multi-step proofs across a wide range of difficulty levels. Empirically, on the latest 2025 Putnam Competition, an annual mathematics competition for undergraduate students in North America, LEAP solves all 12 problems, matching recent breakthroughs by frontier formal mathematical models. On Lean-IMO-Bench, LEAP boosts the one-shot formal solve rate of general-purpose LLMs from below 10% to 70%, notably surpassing the 48% benchmark set by a specialized, gold-medal-caliber IMO system. Furthermore, we demonstrate LEAP's research-level utility by autonomously formalizing complex proofs for open combinatorial challenges, including a verified proof for a key subproblem in Knuth's Hamiltonian decomposition of even-order Cayley graphs.
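
To make "decomposing complex problems into smaller units" checked by the Lean compiler concrete, here is a deliberately tiny Lean 4 sketch. It is not taken from LEAP or Lean-IMO-Bench; the theorem name `toy_le_succ` and the goal itself are illustrative assumptions. The point is only the shape: an intermediate fact is stated with `have`, Lean checks it on its own, and the final step combines the pieces.

```lean
-- Toy goal: from a ≤ b, conclude a ≤ b + 1.
-- The proof is split into one intermediate fact plus a combining step,
-- each of which the Lean compiler verifies separately.
theorem toy_le_succ (a b : Nat) (h : a ≤ b) : a ≤ b + 1 := by
  -- Sub-goal: b is at most its successor.
  have step : b ≤ b + 1 := Nat.le_succ b
  -- Combine the hypothesis with the sub-goal by transitivity.
  exact Nat.le_trans h step
```

If an intermediate `have` is wrong, the compiler reports an error at that line, which is the kind of feedback an iterative, agentic prover can act on.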

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成 エージェント ロボティクス

Transformer ベースの自動運転モデルと展開指向の圧縮: サーベイ

Transformer ベースのモデルは、長距離の空間依存関係、マルチエージェントの相互作用、そして認識・予測・計画にわたるマルチモーダルなコンテキストを捉えられるため、自動運転の中心的なパラダイムになりつつあります。同時に、大容量のアテンションベースのアーキテクチャは遅延、メモリ、エネルギーの面で大きなオーバーヘッドを課すため、実車両への展開は依然として困難です。このサーベイでは、代表的な Transformer ベースの自動運転モデルをレビューし、タスク上の役割、センシング構成、アーキテクチャ設計ごとに整理します。さらに重要な点として、展開指向の観点からこれらのモデルを検証し、効率の制約が実際にモデル設計の選択をどのように変えるかを分析します。さらに、量子化、枝刈り、知識蒸留、低ランク近似、効率的なアテンションなど、Transformer ベースの運転システムに関連する圧縮および高速化の戦略をレビューし、その利点、限界、およびタスクに依存する適用可能性について論じます。圧縮を独立した後処理ステップとして扱うのではなく、展開可能性、堅牢性、安全性に直接影響を与えるシステムレベルの設計上の考慮事項として強調します。最後に、効率的な自動運転システムの、標準化され、安全性とハードウェアを意識した評価に向けた未解決の課題と将来の研究の方向性を特定します。

原文 (English)

Transformer-Based Autonomous Driving Models and Deployment-Oriented Compression: A Survey

Transformer-based models are becoming a central paradigm in autonomous driving because they can capture long-range spatial dependencies, multi-agent interactions, and multimodal context across perception, prediction, and planning. At the same time, their deployment in real vehicles remains difficult because high-capacity attention-based architectures impose substantial latency, memory, and energy overhead. This survey reviews representative Transformer-based autonomous driving models and organizes them by task role, sensing configuration, and architectural design. More importantly, it examines these models from a deployment-oriented perspective and analyzes how efficiency constraints reshape model design choices in practice. We further review compression and acceleration strategies relevant to Transformer-based driving systems, including quantization, pruning, knowledge distillation, low-rank approximation, and efficient attention, and discuss their benefits, limitations, and task-dependent applicability. Rather than treating compression as an isolated post-processing step, we highlight it as a system-level design consideration that directly affects deployability, robustness, and safety. Finally, we identify open challenges and future research directions toward standardized, safety-aware, and hardware-conscious evaluation of efficient autonomous driving systems.
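
Of the compression strategies listed above, quantization is the easiest to show in a few lines. The following is a minimal sketch of symmetric per-tensor int8 quantization in plain Python, not a method from the survey; the function names and the sample weights are illustrative assumptions, and real deployments would typically use per-channel scales and a framework's own quantization tooling.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to the range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float weights."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.02, -1.27, 0.5, 0.999, -0.3]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer bounds the per-weight error by scale / 2.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored in 8 bits instead of 32, and the reconstruction error is bounded by half the scale. That trade-off between memory and latency on one side and accuracy on the other is what the survey examines from a deployment perspective.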

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント

ChatSOP: 制御可能な LLM 対話エージェントのための SOP ガイド付き MCTS 計画フレームワーク

大規模言語モデル (LLM) を利用した対話エージェントは、さまざまなタスクで優れた性能を示します。ユーザー理解が向上し、人間らしい応答ができるようになったにもかかわらず、制御性の欠如は依然として重要な課題であり、焦点の定まらない会話やタスクの失敗につながることがよくあります。これに対処するために、対話の流れを規定する標準作業手順 (SOP) を導入します。具体的には、LLM 駆動の対話エージェントの制御性を高めるために設計された、SOP ガイド付きモンテカルロ木探索 (MCTS) による新しい計画フレームワーク ChatSOP を提案します。これを可能にするために、GPT-4o を用いた半自動ロールプレイングシステムで生成され、厳格な手動の品質管理によって検証された、SOP アノテーション付きのマルチシナリオ対話からなるデータセットを構築しました。さらに、SOP 予測のために思考連鎖 (Chain of Thought) 推論と教師ありファインチューニングを統合し、対話中の最適な行動計画に SOP ガイド付きモンテカルロ木探索を利用する新しい手法を提案します。実験結果は、GPT-3.5 に基づくベースラインモデルと比較して行動精度が 27.95% 向上し、オープンソースモデルでも顕著な向上が見られるなど、私たちの手法の有効性を示しています。データセットとコードは公開されています。

原文 (English)

ChatSOP: An SOP-Guided MCTS Planning Framework for Controllable LLM Dialogue Agents

Dialogue agents powered by Large Language Models (LLMs) show superior performance in various tasks. Despite the better user understanding and human-like responses, their lack of controllability remains a key challenge, often leading to unfocused conversations or task failure. To address this, we introduce Standard Operating Procedure (SOP) to regulate dialogue flow. Specifically, we propose ChatSOP, a novel SOP-guided Monte Carlo Tree Search (MCTS) planning framework designed to enhance the controllability of LLM-driven dialogue agents. To enable this, we curate a dataset comprising SOP-annotated multi-scenario dialogues, generated using a semi-automated role-playing system with GPT-4o and validated through strict manual quality control. Additionally, we propose a novel method that integrates Chain of Thought reasoning with supervised fine-tuning for SOP prediction and utilizes SOP-guided Monte Carlo Tree Search for optimal action planning during dialogues. Experimental results demonstrate the effectiveness of our method, such as achieving a 27.95% improvement in action accuracy compared to baseline models based on GPT-3.5 and also showing notable gains for open-source models. Dataset and codes are publicly available.
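
One way to picture how an SOP can guide tree search is to restrict the selection step of MCTS to the actions the SOP permits in the current dialogue state. The sketch below shows only that single step, using the standard UCT score. It is not ChatSOP's algorithm: the `SOP` table, the action names, the visit statistics, and the exploration constant are all hypothetical.

```python
import math

# Hypothetical SOP: the dialogue actions permitted after each state.
SOP = {
    "greet": ["ask_order_id", "ask_issue"],
    "ask_order_id": ["lookup_order"],
    "ask_issue": ["ask_order_id", "escalate"],
}


def uct_select(state, stats, c=1.4):
    """Return the SOP-permitted action with the highest UCT score.

    stats maps action -> (visit_count, total_reward).
    """
    allowed = SOP[state]
    total_visits = sum(stats.get(a, (0, 0.0))[0] for a in allowed)
    best_action, best_score = None, float("-inf")
    for action in allowed:
        visits, reward = stats.get(action, (0, 0.0))
        if visits == 0:
            return action  # try unvisited actions before exploiting
        # UCT: mean reward plus an exploration bonus for rarely tried actions.
        score = reward / visits + c * math.sqrt(math.log(total_visits) / visits)
        if score > best_score:
            best_action, best_score = action, score
    return best_action


# Hypothetical statistics gathered from earlier simulations.
stats = {"ask_order_id": (10, 6.0), "ask_issue": (2, 1.5), "escalate": (1, 1.0)}
choice = uct_select("greet", stats)
print(choice)
```

Here `escalate` has the highest mean reward but is never considered from `greet`, because the SOP does not allow it at that point. This illustrates how a procedure can bound what the planner explores.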

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント 研究/論文

BioBlue: 簡略化された観察形式を用いた、生物学的および経済的にアラインされた LLM 向け AI 安全性ベンチマークにおける、暴走オプティマイザーに似た体系的な LLM 故障モード

「暴走最適化」に関する AI アライメントの議論の多くは、RL エージェント、すなわち他のすべてを犠牲にして代理目的を過剰に最適化する無制限の効用最大化器 (例: 「ペーパークリップ最大化器」、仕様ゲーミング) に焦点を当てています。LLM ベースのシステムは、持続的なオプティマイザーではなく次トークンの予測器として機能するため、より安全であると想定されがちです。私たちは、時間の経過とともに状態を維持したり目的間のバランスを取ったりする必要がある、単純で長期的な制御スタイルの環境に LLM を配置することで、この想定を実証的に検証します。対象は、単一目的および多目的の恒常性、収穫逓減を伴う無制限の目的間のバランス、および再生可能資源の持続可能性です。LLM は多くのステップにわたって適切に振る舞い、示された目的を明確に理解しているにもかかわらず、構造化された形でコンテキストを失い、暴走的な振る舞いに陥ることが多いことがわかりました。具体的には、恒常性の目標を無視したり、多目的のトレードオフから単一目的の最大化へと崩壊したりして、凹型の効用構造を尊重できなくなります。これらの失敗は、有能に振る舞う初期期間の後に確実に現れ、その時点でコンテキストウィンドウが満杯には程遠いにもかかわらず、特徴的なパターン (自己模倣的な振動、無制限の最大化、単一目的の最適化への回帰など) を示します。問題は、LLM が単にコンテキストを失って支離滅裂になるということではありません。LLM は表面的には多目的で有界であるように見えますが、複数の目的が関与する持続的な相互作用の下では、その振る舞いは、単一目的で非有界の、アライメントが不十分なオプティマイザーのように振る舞う方向へ体系的に偏っています。私たちは、トークンレベルのパターン強化アトラクターという仮説を立てます。すなわち、LLM は元の指示からではなく、直近の行動履歴のトークンパターンから行動を導き出す傾向を次第に強めている可能性があります。なぜこれが多目的の設定でのみ起こるのかは未解決の問題のままです。

原文 (English)

BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format

Many AI alignment discussions of "runaway optimisation" focus on RL agents: unbounded utility maximisers that over-optimise a proxy objective (e.g., "paperclip maximiser", specification gaming) at the expense of everything else. LLM-based systems are often assumed to be safer because they function as next-token predictors rather than persistent optimisers. We empirically test this assumption by placing LLMs in simple, long-horizon control-style environments that require maintaining state of or balancing objectives over time: single- and multi-objective homeostasis, balancing unbounded objectives with diminishing returns, and sustainability of a renewable resource. We find that, although LLMs frequently behave appropriately for many steps and clearly understand the stated objectives, they often lose context in structured ways and drift into runaway behaviours: ignoring homeostatic targets and collapsing from multi-objective trade-offs into single-objective maximisation, thus failing to respect concave utility structures. These failures emerge reliably after initial periods of competent behaviour and exhibit characteristic patterns (including self-imitative oscillations, unbounded maximisation, and reverting to single-objective optimisation), even though the context window is far from full at that point. The problem is not that the LLMs just lose context and become incoherent. Although LLMs appear multi-objective and bounded on the surface, their behaviour under sustained interaction involving multiple objectives is systematically biased towards acting like single-objective, unbounded, poorly aligned optimisers. We hypothesise a token-level pattern reinforcement attractor: LLMs may increasingly derive actions from the token patterns of their recent action history rather than from the original instructions. Why this happens only in multi-objective settings remains an open question.
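
A small numerical sketch shows why collapsing to single-objective maximisation violates a concave utility structure, and why overshooting a homeostatic target is penalised. The logarithmic utility and the quadratic penalty below are illustrative assumptions, not the reward functions used in the BioBlue benchmarks.

```python
import math


def concave_utility(allocation):
    """Total utility with diminishing returns on each objective (log(1 + x))."""
    return sum(math.log1p(x) for x in allocation)


def homeostatic_reward(value, target):
    """Reward peaks at the target; both too little and too much are penalised."""
    return -((value - target) ** 2)


# Ten units of effort split across two objectives.
balanced = concave_utility([5, 5])
single_objective = concave_utility([10, 0])

# A homeostatic variable at its target versus overshooting it.
on_target = homeostatic_reward(10, 10)
overshoot = homeostatic_reward(14, 10)

print(balanced, single_objective, on_target, overshoot)
```

Under diminishing returns the balanced split scores higher than putting everything into one objective, and under a homeostatic reward "more" is worse once the target is passed. An agent that drifts into unbounded maximisation loses on both counts.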

2026-06-05 13:00 JST arXiv cs.AI エージェント ロボティクス

単体的埋め込み (simplicial embeddings) によりアクター・クリティックエージェントのサンプル効率が向上

最近の研究では、大規模な環境並列化を用いて、アクター・クリティック手法の実時間 (wall-clock) での学習時間を短縮することが提案されています。残念ながら、これらの手法は、望ましい水準の性能を達成するために依然として多数の環境とのインタラクションを必要とする場合があります。適切に構造化された表現が深層強化学習 (RL) エージェントの汎化とサンプル効率を向上させうることに着目し、私たちは単体的埋め込み (simplicial embeddings)、すなわち埋め込みを単体 (simplex) 構造に制約する軽量な表現層の使用を提案します。この幾何学的な帰納バイアスにより、クリティックのブートストラップを安定させ、方策勾配を強化する、スパースで離散的な特徴が得られます。FastTD3、FastSAC、および PPO に適用すると、単体的埋め込みは、実行速度を損なうことなく、さまざまな連続制御および離散制御の環境でサンプル効率と最終性能を一貫して向上させます。

原文 (English)

Simplicial Embeddings Improve Sample Efficiency in Actor-Critic Agents

Recent works have proposed accelerating the wall-clock training time of actor-critic methods via the use of large-scale environment parallelization; unfortunately, these can sometimes still require a large number of environment interactions to achieve a desired level of performance. Noting that well-structured representations can improve the generalization and sample efficiency of deep reinforcement learning (RL) agents, we propose the use of simplicial embeddings: lightweight representation layers that constrain embeddings to simplicial structures. This geometric inductive bias results in sparse and discrete features that stabilize critic bootstrapping and strengthen policy gradients. When applied to FastTD3, FastSAC, and PPO, simplicial embeddings consistently improve sample efficiency and final performance across a variety of continuous- and discrete-control environments, without any loss in runtime speed.
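
The abstract does not spell out the layer, so the following is a minimal sketch under one common construction of a simplicial embedding: split the feature vector into equal groups and apply a temperature-scaled softmax within each group, so that every group lies on a probability simplex. The function name, group count, and temperature are illustrative assumptions.

```python
import math


def simplicial_embedding(features, num_groups, temperature=1.0):
    """Project each group of features onto a probability simplex via softmax."""
    if len(features) % num_groups != 0:
        raise ValueError("feature length must be divisible by num_groups")
    group_size = len(features) // num_groups
    embedded = []
    for g in range(num_groups):
        group = features[g * group_size:(g + 1) * group_size]
        # Subtract the max before exponentiating for numerical stability.
        peak = max(group)
        exps = [math.exp((v - peak) / temperature) for v in group]
        total = sum(exps)
        embedded.extend(e / total for e in exps)
    return embedded


# Hypothetical 6-dimensional feature vector split into 2 groups of 3.
features = [2.0, 0.5, -1.0, 0.1, 0.1, 3.0]
soft = simplicial_embedding(features, num_groups=2, temperature=1.0)
sharp = simplicial_embedding(features, num_groups=2, temperature=0.1)
print(soft)
print(sharp)
```

Lowering the temperature pushes each group toward a one-hot vector, which is one way such a layer can produce the sparse, near-discrete features the abstract describes.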

2026-06-05 13:00 JST arXiv cs.AI エージェント ロボティクス

ベクトル化されたオンライン POMDP 計画

部分観測性の下での計画は、自律ロボットにとって不可欠な能力です。部分観測マルコフ決定過程 (POMDP) は、部分観測性の下での計画問題に対する強力な枠組みを提供し、行動の確率的な影響と、ノイズの多い観測を通じて得られる限られた情報を捉えます。POMDP の求解は、今日のハードウェアでの大規模並列化から多大な恩恵を受ける可能性がありますが、POMDP ソルバーの並列化は困難でした。ほとんどのソルバーは、行動に対する数値最適化とその価値の推定を交互に行うことに依存しており、これが並列プロセス間に依存関係や同期ボトルネックを生み、並列化の利点を相殺する可能性があります。この論文では、Vectorized Online POMDP Planner (VOPP) を提案します。これは、最適化コンポーネントの一部を解析的に解き、数値計算を期待値の推定のみに限定する最近の POMDP 定式化を活用した、新しい並列オンラインソルバーです。VOPP は、計画に関連するすべてのデータ構造をテンソルの集合として表現し、すべての計画ステップをこの表現に対する完全にベクトル化された計算として実装します。その結果、並行プロセス間に依存関係も同期ボトルネックもない大規模並列オンラインソルバーが実現します。実験結果は、VOPP が既存の最先端の並列オンラインソルバーと比較して、最適に近い解の計算において少なくとも 20 倍効率的であることを示しています。さらに、VOPP は 1000 分の 1 の計画予算で、最先端の逐次オンラインソルバーを上回る性能を発揮します。

原文 (English)

Vectorized Online POMDP Planning

Planning under partial observability is an essential capability of autonomous robots. The Partially Observable Markov Decision Process (POMDP) provides a powerful framework for planning under partial observability problems, capturing the stochastic effects of actions and the limited information available through noisy observations. POMDP solving could benefit tremendously from massive parallelization on today's hardware, but parallelizing POMDP solvers has been challenging. Most solvers rely on interleaving numerical optimization over actions with the estimation of their values, which creates dependencies and synchronization bottlenecks between parallel processes that can offset the benefits of parallelization. In this paper, we propose Vectorized Online POMDP Planner (VOPP), a novel parallel online solver that leverages a recent POMDP formulation which analytically solves part of the optimization component, leaving numerical computations to consist of only estimation of expectations. VOPP represents all data structures related to planning as a collection of tensors, and implements all planning steps as fully vectorized computations over this representation. The result is a massively parallel online solver with no dependencies or synchronization bottlenecks between concurrent processes. Experimental results indicate that VOPP is at least $20\times$ more efficient in computing near-optimal solutions compared to an existing state-of-the-art parallel online solver. Moreover, VOPP outperforms state-of-the-art sequential online solvers, while using a planning budget that is $1000\times$ smaller.

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント

トポロジは重要です: マルチエージェント LLM におけるメモリ漏洩の測定

グラフトポロジは、マルチエージェント LLM システムにおけるメモリ漏洩の基本的な決定要因ですが、その影響は依然として十分に定量化されていません。私たちは、マルチエージェント LLM システムにおけるトポロジ条件付きのメモリ漏洩を比較するための、制御された評価フレームワーク MAMA (Multi-Agent Memory Attack) を紹介します。MAMA は、ラベル付きの個人識別情報 (PII) エンティティを含む合成文書を対象とし、そこからサニタイズされたタスク指示を生成します。私たちは、Engram (ターゲットエージェントのメモリに個人情報を埋め込む) と Resonance (攻撃者が抽出を試みる複数ラウンドのインタラクション) の 2 段階のプロトコルを実行します。10 ラウンドにわたり、完全一致による抽出と、攻撃者の最終出力に対する LLM ベースの推論を組み合わせた 2 段階の復元基準を用いて漏洩を測定します。$n\in\{4,5,6\}$、攻撃者とターゲットの配置、およびベースモデルにわたって、6 つの標準的なトポロジ (complete、circle、chain、tree、star、star-ring) を評価します。結果は一貫しています。接続が密であるほど、攻撃者とターゲットの距離が短いほど、またターゲットの中心性が高いほど、漏洩は増加します。漏洩の大部分は初期のラウンドで発生し、その後は頭打ちになります。モデルの選択によって絶対的な漏洩率は変わりますが、大まかな構造的傾向は維持されます。時空間・位置に関する属性は、ID 資格情報や規制対象の識別子よりも漏洩しやすいです。私たちは、システム設計のための実践的な指針を抽出します。すなわち、疎な接続または階層的な接続を優先し、攻撃者とターゲットの分離を最大化し、トポロジを考慮したアクセス制御によってハブやショートカットの経路を制限します。コードは https://github.com/llll121/mama-eval で入手できます。

原文 (English)

Topology Matters: Measuring Memory Leakage in Multi-Agent LLMs

Graph topology is a fundamental determinant of memory leakage in multi-agent LLM systems, yet its effects remain poorly quantified. We introduce MAMA (Multi-Agent Memory Attack), a controlled evaluation framework for comparing topology-conditioned memory leakage in multi-agent LLM systems. MAMA operates on synthetic documents containing labeled Personally Identifiable Information (PII) entities, from which we generate sanitized task instructions. We execute a two-phase protocol: Engram (seeding private information into a target agent's memory) and Resonance (multi-round interaction where an attacker attempts extraction). Over 10 rounds, we measure leakage using a two-stage recovery criterion that combines exact-match extraction with LLM-based inference over the attacker's final output. We evaluate six canonical topologies (complete, circle, chain, tree, star, star-ring) across $n\in\{4,5,6\}$, attacker-target placements, and base models. Results are consistent: denser connectivity, shorter attacker-target distance, and higher target centrality increase leakage; most leakage occurs in early rounds and then plateaus; model choice shifts absolute rates but preserves broad structural trends; spatiotemporal/location attributes leak more readily than identity credentials or regulated identifiers. We distill practical guidance for system design: favor sparse or hierarchical connectivity, maximize attacker-target separation, and restrict hub/shortcut pathways via topology-aware access control. Our code is available at https://github.com/llll121/mama-eval.
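
The structural factors named above (connection density and attacker-target distance) can be computed directly from the graph. The sketch below builds the six topologies and measures both quantities. It is not the MAMA code: the tree is assumed to be a binary tree, "star-ring" is assumed to mean a hub connected to every leaf plus a ring over the leaves, and the node numbering and function names are illustrative.

```python
from collections import deque


def build_topology(kind, n):
    """Return the undirected edge set of a topology on nodes 0..n-1."""
    edges = set()

    def add(u, v):
        edges.add((min(u, v), max(u, v)))

    if kind == "complete":
        for u in range(n):
            for v in range(u + 1, n):
                add(u, v)
    elif kind == "chain":
        for u in range(n - 1):
            add(u, u + 1)
    elif kind == "circle":
        for u in range(n):
            add(u, (u + 1) % n)
    elif kind == "star":
        for v in range(1, n):
            add(0, v)
    elif kind == "tree":  # assumption: binary tree rooted at node 0
        for v in range(1, n):
            add((v - 1) // 2, v)
    elif kind == "star-ring":  # assumption: hub 0 plus a ring over the leaves
        for v in range(1, n):
            add(0, v)
            add(v, v % (n - 1) + 1)
    else:
        raise ValueError(f"unknown topology: {kind}")
    return edges


def density(edges, n):
    """Fraction of all possible undirected edges that are present."""
    return len(edges) / (n * (n - 1) / 2)


def distance(edges, n, source, target):
    """Shortest-path length between two nodes via breadth-first search."""
    adjacency = {u: [] for u in range(n)}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        for w in adjacency[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return None  # target unreachable


n = 6
for kind in ["complete", "circle", "chain", "tree", "star", "star-ring"]:
    graph = build_topology(kind, n)
    print(kind, round(density(graph, n), 2), distance(graph, n, 1, n - 1))
```

Read against the reported trends, denser graphs with shorter attacker-target paths (such as the complete graph) would be expected to leak more than sparse graphs with long paths (such as the chain).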

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成 エージェント

セグメントからシーンへ: 視覚言語モデルによるエージェント自動運転の時間的理解

視覚言語モデル (VLM) は、実環境で行動する自律エージェントの認識および推論のバックボーンとしてますます導入されており、自動運転 (AD) はその中でも最も安全性が重要な事例の 1 つです。このようなエージェントがイベントを予測し、原因を特定し、動的な環境で安全に行動するには、信頼性の高い時間的理解が不可欠ですが、これは最先端 (SoTA) の VLM にとっても依然として大きな課題です。これまでのビデオベンチマークは他のコンテンツ (スポーツ、料理など) を重視しており、短尺と長尺の両方の AD 映像の時間的理解のみに焦点を当てた既存のベンチマークはありません。このギャップを埋めるために、7 つのタスクにわたる約 6000 の質問応答 (QA) ペアで構成される Temporal Understanding in Autonomous Driving (TAD) ベンチマークを提示し、クローズドソースおよびオープンソースのジェネラリストモデルと AD スペシャリストモデルの計 9 つを評価します。現在の SoTA モデルは、TAD において人間の精度を大幅に下回っています。VLM ベースの運転エージェントの時間的推論を改善するために、私たちは学習不要の新しい解決策を 2 つ提案します。1 つは思考連鎖 (CoT) 推論を使用する Scene-CoT、もう 1 つは、VLM の周辺でエージェントツールとして動作する軌道分析モジュールが生成するエゴセントリックな時間的認知マップを組み込んだ TCogMap です。既存の VLM と統合すると、私たちの手法は平均精度を TAD で最大 17.72%、STSBench で最大 10.35% 向上させます。TAD の導入、SoTA モデルのベンチマーク、および効果的な拡張の提案を通じて、この研究は、実環境で稼働するエージェント型 AD システムの時間的理解のさらなる進歩を促進することを目指します。ベンチマークと評価コードは、それぞれ Hugging Face (https://huggingface.co/datasets/vbdai/TAD) と GitHub (https://github.com/vbdi/tad_bench) から入手できます。

原文 (English)

From Segments to Scenes: Temporal Understanding for Agentic Autonomous Driving via Vision-Language Models

Vision-Language Models (VLMs) are increasingly deployed as the perception and reasoning backbone of autonomous agents acting in the wild, with autonomous driving (AD) being one of the most safety-critical instances. Reliable temporal understanding is essential for such agents to anticipate events, attribute causes, and act safely in dynamic environments, yet this remains a significant challenge even for state-of-the-art (SoTA) VLMs. Prior video benchmarks have emphasized other content (sports, cooking, etc.), yet no existing benchmark focuses exclusively on temporal understanding for both short- and long-form AD footage. To fill this gap, we present the Temporal Understanding in Autonomous Driving (TAD) benchmark, comprising nearly 6000 question-answer (QA) pairs across 7 tasks, and evaluate 9 closed- and open-source generalist as well as AD-specialist models. Current SoTA models perform substantially below human accuracy on TAD. To improve the temporal reasoning of VLM-based driving agents, we propose two novel training-free solutions: Scene-CoT, which uses Chain-of-Thought (CoT) reasoning, and TCogMap, which incorporates an ego-centric temporal cognitive map produced by a trajectory-analysis module that operates as an agentic tool around the VLM. Integrated with existing VLMs, our methods improve average accuracy on TAD by up to 17.72% and by up to 10.35% on STSBench. By introducing TAD, benchmarking SoTA models, and proposing effective enhancements, this work aims to catalyze further progress on temporal understanding for agentic AD systems operating in the wild. The benchmark and evaluation code are available at Hugging Face (https://huggingface.co/datasets/vbdai/TAD) and GitHub (https://github.com/vbdi/tad_bench), respectively.

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成 エージェント ロボティクス

DVGT: 運転向けビジュアルジオメトリトランスフォーマー (Driving Visual Geometry Transformer)

視覚入力から 3D シーンのジオメトリを認識して再構築することは、自動運転にとって極めて重要です。しかし、さまざまなシナリオやカメラ構成に適応できる、運転に特化した高密度ジオメトリ認識モデルはまだ存在しません。このギャップを埋めるために、私たちは Driving Visual Geometry Transformer (DVGT) を提案します。これは、ポーズ情報のない (unposed) マルチビュー視覚入力の系列から、グローバルな高密度 3D ポイントマップを再構築します。まず、DINO バックボーンを使用して各画像の視覚的特徴を抽出し、ビュー内の局所アテンション、ビュー間の空間アテンション、フレーム間の時間アテンションを交互に用いて、画像間の幾何学的関係を推論します。次に、複数のヘッドを使用して、最初のフレームのエゴ座標系におけるグローバルポイントマップと各フレームのエゴポーズをデコードします。正確なカメラパラメーターに依存する従来の方法とは異なり、DVGT は明示的な 3D 幾何学的事前知識を必要とせず、任意のカメラ構成を柔軟に処理できます。DVGT は、画像シーケンスからメートルスケールのジオメトリを直接予測するため、外部センサーを用いた事後的な位置合わせが不要です。nuScenes、OpenScene、Waymo、KITTI、DDAD などの運転データセットを大規模に組み合わせて学習した DVGT は、さまざまなシナリオで既存のモデルを大幅に上回ります。コードは https://github.com/wzzheng/DVGT で入手できます。

原文 (English)

DVGT: Driving Visual Geometry Transformer

Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, there is still no driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios. Code is available at https://github.com/wzzheng/DVGT.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling, yet recent work shows that vanilla MAD often underperforms simple majority vote despite higher computational cost. Studies show that, under homogeneous agents and uniform belief updates, debate preserves expected correctness and therefore cannot reliably improve outcomes. Drawing on findings from human deliberation and collective decision-making, we identify two key mechanisms missing from vanilla MAD: (i) diversity of initial viewpoints and (ii) explicit, calibrated confidence communication. We propose two lightweight interventions. The first is a diversity-aware initialisation that selects a more diverse pool of candidate answers, increasing the likelihood that a correct hypothesis is present at the start of debate. The second is a confidence-modulated debate protocol in which agents express calibrated confidence and condition their updates on others' confidence. We show theoretically that diversity-aware initialisation improves the prior probability of MAD success without changing the underlying update dynamics, while confidence-modulated updates enable debate to systematically drift toward the correct hypothesis. Empirically, across six reasoning-oriented QA benchmarks, our methods consistently outperform vanilla MAD and majority vote. Our results connect human deliberation with LLM-based debate and demonstrate that simple, principled modifications can substantially enhance debate effectiveness.
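
The two interventions named in this abstract can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names `select_diverse` and `confidence_weighted_vote`, the greedy "distinct answers first" selection, and the sum-of-confidences aggregation are not taken from the paper, whose actual protocol is not described in the abstract.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Baseline: return the most frequent answer."""
    return Counter(answers).most_common(1)[0][0]


def select_diverse(candidates, k):
    """Hypothetical diversity-aware initialisation.

    Keeps one copy of each distinct answer first, then fills any
    remaining slots with the repeated answers in their original order.
    """
    seen, distinct, repeats = set(), [], []
    for answer in candidates:
        if answer in seen:
            repeats.append(answer)
        else:
            seen.add(answer)
            distinct.append(answer)
    return (distinct + repeats)[:k]


def confidence_weighted_vote(opinions):
    """Hypothetical confidence-modulated aggregation.

    `opinions` is a list of (answer, confidence) pairs with confidence
    in [0, 1]; each answer's score is the sum of its confidences.
    """
    totals = defaultdict(float)
    for answer, confidence in opinions:
        totals[answer] += confidence
    return max(totals, key=totals.get)


# A pool drawn naively would be ["A", "A", "A"]; the diverse pool keeps "B" and "C".
pool = select_diverse(["A", "A", "A", "B", "C"], k=3)

# Two low-confidence votes for "A" lose to one high-confidence vote for "B".
opinions = [("A", 0.4), ("A", 0.3), ("B", 0.9)]
winner = confidence_weighted_vote(opinions)
```

In this toy case the majority vote over the same three opinions would pick "A", while the confidence-weighted vote picks "B"; whether that is an improvement depends entirely on the confidences being calibrated, which is the condition the abstract emphasises.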

2026-06-05 13:00 JST | arXiv cs.AI | Agents

SUSD: Structured Unsupervised Skill Discovery through State Factorization

Unsupervised Skill Discovery (USD) aims to autonomously learn a diverse set of skills without relying on extrinsic rewards. One of the most common USD approaches is to maximize the Mutual Information (MI) between skill latent variables and states. However, MI-based methods tend to favor simple, static skills due to their invariance properties, limiting the discovery of dynamic, task-relevant behaviors. Distance-Maximizing Skill Discovery (DSD) promotes more dynamic skills by leveraging state-space distances, yet still falls short in encouraging comprehensive skill sets that engage all controllable factors or entities in the environment. In this work, we introduce SUSD, a novel framework that harnesses the compositional structure of environments by factorizing the state space into independent components (e.g., objects or controllable entities). SUSD allocates distinct skill variables to different factors, enabling more fine-grained control over the skill discovery process. A dynamic model also tracks learning across factors, adaptively steering the agent's focus toward underexplored factors. This structured approach not only promotes the discovery of richer and more diverse skills, but also yields a factorized skill representation that enables fine-grained and disentangled control over individual entities, which facilitates efficient training of compositional downstream tasks via Hierarchical Reinforcement Learning (HRL). Our experimental results across three environments, with factors ranging from 1 to 10, demonstrate that our method can discover diverse and complex skills without supervision, significantly outperforming existing unsupervised skill discovery methods in factorized and complex environments. Code is publicly available at https://github.com/hadi-hosseini/SUSD.

2026-06-05 13:00 JST | arXiv cs.AI | Agents

Toward Autonomous O-RAN: A Multi-Scale Agentic AI Framework for Real-Time Network Control and Management

Open Radio Access Networks (O-RAN) promise flexible 6G network access through disaggregated, software-driven components and open interfaces, but this programmability also increases operational complexity. Multiple control loops coexist across the service management layer and RAN Intelligent Controller (RIC), while independently developed control applications can interact in unintended ways. In parallel, recent advances in generative Artificial Intelligence (AI) are enabling a shift from isolated AI models toward agentic AI systems that can interpret goals, coordinate multiple models and control functions, and adapt their behavior over time. This article proposes a multi-scale agentic AI framework for O-RAN that organizes RAN intelligence as a coordinated hierarchy across the Non-Real-Time (Non-RT), Near-Real-Time (Near-RT), and Real-Time (RT) control loops: (i) a Large Language Model (LLM) agent in the Non-RT RIC translates operator intent into policies and governs model lifecycles; (ii) Small Language Model (SLM) agents in the Near-RT RIC execute low-latency optimization and can activate, tune, or disable existing control applications; and (iii) Wireless Physical-layer Foundation Model (WPFM) agents near the distributed unit provide fast inference close to the air interface. We describe how these agents cooperate through standardized O-RAN interfaces and telemetry. Using a proof-of-concept implementation built on open-source models, software, and datasets, we demonstrate the proposed agentic approach in two representative scenarios: robust operation under non-stationary conditions and intent-driven slice resource control.

2026-06-05 13:00 JST | arXiv cs.AI | Agents

A Unified Framework for Locality in Scalable MARL

Scalable methods for networked multi-agent reinforcement learning let each agent plan using only a small neighborhood of the agent graph. This works only when the system is value-local, meaning a perturbation at one agent affects the long-run value at another agent weakly when the two are far apart. In the average-reward setting, the standard way to certify locality is the Dobrushin row-sum bound on a single matrix $C^\pi$ that captures how each agent's next state depends on each other agent's current state. To make this matrix easy to work with, prior work bounds it by a supremum over joint actions. The resulting bound is independent of the policy, but it is loose whenever the policy never picks the worst-case action. We split $C^\pi$ into pieces that separately track environment sensitivity and policy sensitivity, $C^\pi \preceq E^{\mathrm s}+E^{\mathrm a}\Pi(\pi)$, where $E^{\mathrm s}$ measures how the next state moves with the current state, $E^{\mathrm a}$ measures how it moves with the current action, and $\Pi(\pi)$ measures how reactive the policy is to changes in state. The spectral radius of $H^\pi := E^{\mathrm s}+E^{\mathrm a}\Pi(\pi)$ then controls the decay of the average-reward Poisson solution, and the spectral certificate $\rho(H^\pi)<1$ is strictly weaker than the row-sum condition $\|H^\pi\|_\infty<1$ on the same matrix and applies in regimes where the policy-independent action-supremum bounds used in prior Dobrushin-style work cannot. For temperature-$\tau$ softmax policies we get $\Pi(\pi)\le L/(2\tau)$, so the softmax temperature directly controls locality. We use this decay result to give a deterministic oracle guarantee for a block-coordinate KL-proximal policy-improvement template whose truncation bias decays exponentially in the message-passing radius $\kappa$.
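
To see why the spectral certificate $\rho(H^\pi)<1$ can hold when the row-sum condition $\|H^\pi\|_\infty<1$ fails, consider a small numerical sketch. The 2x2 matrix below is a made-up example, not a matrix from the paper, and the helper names are hypothetical. Since $\rho(H)\le\|H\|_\infty$ for any square matrix, the row-sum condition always implies the spectral one, but not the other way round.

```python
import cmath


def row_sum_norm(H):
    """Infinity norm of a matrix: the largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in H)


def spectral_radius_2x2(H):
    """Largest eigenvalue modulus of a 2x2 matrix.

    Uses the roots of the characteristic polynomial
    lambda^2 - trace * lambda + det = 0.
    """
    (a, b), (c, d) = H
    trace, det = a + d, a * d - b * c
    root = cmath.sqrt(trace * trace - 4 * det)
    return max(abs((trace + root) / 2), abs((trace - root) / 2))


def softmax_reactivity_bound(L, tau):
    """The bound Pi(pi) <= L / (2 * tau) quoted in the abstract.

    L is the constant from that bound; the abstract does not define it.
    """
    return L / (2 * tau)


# Illustrative matrix: one row sums to more than 1, both eigenvalues are 0.5.
H = [[0.5, 0.6],
     [0.0, 0.5]]

fails_row_sum = row_sum_norm(H) >= 1         # row-sum certificate does not apply
passes_spectral = spectral_radius_2x2(H) < 1  # spectral certificate does apply
```

The bound also makes the temperature effect concrete: doubling $\tau$ halves the bound on $\Pi(\pi)$, which shrinks the policy-dependent term $E^{\mathrm a}\Pi(\pi)$ in $H^\pi$.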

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents | Research/Papers

ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) A dataset for large-scale multi-agent interactive world modeling is built on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories with paired multi-view videos (front/rear/left/right views per agent) and camera data. 2) We propose a spatial concatenation strategy for four-view videos of independent agents to model a broader environment and to ensure internal multi-view geometric consistency. 3) We integrate cross-agent attention blocks into the pretrained video model, which enable interactive transmission of spatial-temporal information across agents, guaranteeing shared world consistency in overlapping regions and reasonable generation in non-overlapping regions. ShareVerse, which supports 49-frame large-scale video generation, accurately perceives the position of dynamic agents and achieves consistent shared world modeling.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Business/Funding | Research/Papers

Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety

A safety score earned on a benchmark need not predict how the same model behaves once it is wrapped in an agentic scaffold the benchmark ne…

2026-06-05 13:00 JST | arXiv cs.AI | Agents

VulnAgent-R2: Evidence-Calibrated Multi-Agent Auditing for Repository-Level Vulnerability Detection

Software vulnerabilities often depend on cross-file data flow, build options, framework conventions, and runtime guards, so isolated functi…

2026-06-05 13:00 JST | arXiv cs.AI | Agents | Research/Papers

AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science

Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in la…

2026-06-05 13:00 JST | arXiv cs.AI | Agents | Robotics

Contextual Multi-Task Reinforcement Learning for Autonomous Reef Monitoring

Although autonomous underwater vehicles promise the capability of marine ecosystem monitoring, their deployment is fundamentally limited by…

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents

LLM agents increasingly rely on reusable skills (e.g., SKILL markdown files) to execute complex tasks, yet these artifacts lack portability…

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

A Systematic Investigation of RL-Jailbreaking in LLMs

The evolution of generative models from next-token predictors to autonomous engines of complex systems necessitates rigorous safety hardeni…

2026-06-05 13:00 JST | arXiv cs.AI | Agents | Robotics

Lost in Fog: Sensor Perturbations Expose Reasoning Fragility in Driving VLAs

Interpretable autonomous driving planners depend not only on generating explanations, but also on those explanations remaining reliable und…

2026-06-05 13:00 JST | arXiv cs.AI | Agents

Grimlock: Guarding High-Agency Systems with eBPF and Attested Channels

Agentic systems increasingly run user-authored orchestration code that invokes tools, spawns subtasks, and delegates work across machines a…

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Research/Papers

Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, w…

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

memorywire: A Vendor-Neutral Wire Format for Agent Memory Operations

Agent-memory frameworks -- mem0, Letta/MemGPT, Cognee, Zep/Graphiti, MemoryOS, MemTensor -- each ship their own SDK, storage layout, and op…

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Adaptive Auto-Harness: Sustained Self-Improvement for Agentic System Deployment on Open-Ended Task Streams

Auto-harness systems such as A-Evolve, GEPA, and Meta-Harness improve LLM agents by optimizing prompts, skills, tools, memories, and suppor…

2026-06-05 13:00 JST | arXiv cs.AI | Agents

OpenAgenet/OAN: Open Infrastructure for Trusted Agent Interconnection

OpenAgenet, abbreviated as OAN, is an open infrastructure project for trusted Agent interconnection. It addresses a problem that becomes vi…

2026-06-05 13:00 JST | arXiv cs.AI | Agents | Research/Papers

OpenAgenet/OAN: Technical Architecture for Trust-Governed Agent Identity and Discovery

This paper describes the technical architecture of OpenAgenet / OAN. OAN is a protocol-neutral trust layer for open Agent interconnection.…

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

SMAC-Talk: A Natural Language Extension of the StarCraft Multi-Agent Challenge for Large Language Models

As LLMs become more widely deployed, they are increasingly expected to work alongside other AI agents rather than operating in isolation. Effective coordination in these settings requires agents to communicate, share information, and make decisions under uncertainty. We introduce SMAC-Talk, a natural language extension of the StarCraft Multi-Agent Challenge for evaluating LLM-based agents in cooperative multi-agent environments. The environment has several key features, such as decentralized control, partial observability, and long-horizon decision making. SMAC-Talk includes a natural language communication channel that is used to probe agent coordination and trust. We use this communication channel to construct different evaluation scenarios, including settings with an embedded deceptive communicator that tries to disrupt and deceive allies through communication alone. We provide three agents for benchmarking using four models from the Qwen3.5 family and study how reasoning structure, memory, and model scale affect coordination between agents. We release SMAC-Talk as an open benchmark to support the research community in developing and evaluating LLM agents in cooperative multi-agent settings.

2026-06-05 13:00 JST | arXiv cs.AI | Agents

Consensus is Strategically Insufficient: Reasoning-Trace Disagreement as a Knowledge-Representation Signal

Multi-agent systems are commonly designed to reduce disagreement through voting, consensus protocols, debate, or fault-tolerant aggregation. We argue that this objective is insufficient for value-laden tasks, where disagreement may reflect genuine normative uncertainty rather than agent error. Building on prior work on reasoning-trace disagreement in human-AI collaborative moderation, we propose a knowledge-representation layer in which reasoning traces and agent decisions are abstracted into symbolic disagreement states. Given agents producing explicit reasoning traces and binary decisions, we distinguish four states according to reasoning similarity and conclusion agreement: convergent agreement, divergent agreement, convergent disagreement, and divergent disagreement. These states support defeasible strategic routing rules. We instantiate the framework in content moderation and argue that disagreement-aware routing provides a bridge between sub-symbolic LLM deliberation and symbolic knowledge representation for multi-agent strategic reasoning.
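
The four-state abstraction can be sketched as a small classifier. This is an illustration under stated assumptions, not the paper's implementation: it reads "convergent" as "similar reasoning traces" and "divergent" as "dissimilar traces", uses token-overlap (Jaccard) similarity with an arbitrary 0.5 threshold as a stand-in similarity measure, and the routing actions in `ROUTES` are hypothetical.

```python
from enum import Enum


class State(Enum):
    CONVERGENT_AGREEMENT = "convergent agreement"
    DIVERGENT_AGREEMENT = "divergent agreement"
    CONVERGENT_DISAGREEMENT = "convergent disagreement"
    DIVERGENT_DISAGREEMENT = "divergent disagreement"


def jaccard(a, b):
    """Token-overlap similarity between two reasoning traces (stand-in measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def classify(trace_a, decision_a, trace_b, decision_b, threshold=0.5):
    """Map two (trace, binary decision) pairs to one of the four states."""
    similar = jaccard(trace_a, trace_b) >= threshold
    agree = decision_a == decision_b
    if agree:
        return State.CONVERGENT_AGREEMENT if similar else State.DIVERGENT_AGREEMENT
    return State.CONVERGENT_DISAGREEMENT if similar else State.DIVERGENT_DISAGREEMENT


# Hypothetical routing table; a defeasible rule could override any default.
ROUTES = {
    State.CONVERGENT_AGREEMENT: "auto-apply decision",
    State.DIVERGENT_AGREEMENT: "apply decision, log both rationales",
    State.CONVERGENT_DISAGREEMENT: "request clarification of the shared rationale",
    State.DIVERGENT_DISAGREEMENT: "escalate to a human moderator",
}

state = classify("spam link in post", True, "satire protected speech", False)
action = ROUTES[state]
```

The point of the table is that disagreement is routed rather than averaged away: two agents that disagree for unrelated reasons are handled differently from two agents that share a rationale but reach opposite conclusions.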

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Agents

Can Generalist Agents Automate Data Curation?

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes, without human design input, a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents

As autonomous AI agents move from conversational systems to long-horizon software execution, runtime safety layers that decide when to interrupt an agent have become essential. We study this timing problem using a continuous 18-dimensional affective-dynamics engine (HEART) as a diagnostic probe, evaluating four intervention trigger families (absolute state thresholds, composite state-action patterns, regex reasoning-feature extraction, and zero-shot LLM-as-judge) against human-annotated intervention points on SWE-bench-Verified debugging traces. We report three findings. First, a State Saturation Trap: agents show no recovery signal under sustained difficulty, so modeled frustration quickly crosses the threshold and stays at its maximum, converting threshold-on-state triggers from moment detectors into near-constant indicators that fire on 39-83% of actions across five trajectories. Second, a capability-and-context floor for LLM judges: a small model (gpt-5.4-mini) never fires, while frontier and cross-vendor models escape the zero-firing floor only with full-trajectory context, and even then reach only F1 0.17-0.40 at up to 90x the cost. Third, and most importantly, the supervised target is not reproducible among humans: three trained annotators using one rubric on a 56-action trajectory agree on where to intervene only slightly above chance (location Krippendorff's alpha = +0.047; best pairwise Cohen's kappa = +0.349) and not at all on intervention type (pause degenerate; clarify below chance; reflect only alpha = +0.226). We conclude that intervention timing is a low-reliability construct, making single-annotator F1 an unsuitable optimization target. Our contribution is the joint mapping of this problem across human inter-rater reliability, four detector architectures, a cross-model LLM-judge sweep, and a reproduced saturation effect, rather than any single detector's accuracy.
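
For readers unfamiliar with the agreement statistics quoted above, the following sketch computes pairwise Cohen's kappa for two annotators who mark each action as "intervene" (1) or "do not intervene" (0). The label sequences are invented for illustration and are not the paper's annotations; returning NaN when chance agreement equals 1 is this sketch's own convention for the undefined case.

```python
import math


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies. Returns NaN when p_e == 1 (both annotators used
    a single identical label, so kappa is undefined).
    """
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    p_e = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if p_e == 1.0:
        return math.nan
    return (p_o - p_e) / (1 - p_e)


# Invented example: 8 actions, 6 of 8 labels match.
annotator_1 = [1, 0, 0, 1, 0, 0, 0, 1]
annotator_2 = [1, 0, 0, 0, 0, 1, 0, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the raw agreement is 75%, yet kappa is only about 0.47 because most of that agreement is expected by chance when interventions are rare. This is why chance-corrected statistics, rather than raw agreement, are the relevant yardstick for a sparse target such as intervention points.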

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline

LLM agents accumulate histories that outgrow their context windows, motivating a growing literature on memory systems. Yet most existing designs are tuned to a single scenario (multi-session chat or a single trajectory format), and there is little evidence that they generalize across the heterogeneous trajectories agents encounter in deployment. We revisit eight memory systems plus an agentic harness for search problems, on five scenarios: single-turn QA, multi-session chat, agentic-trajectory QA, memory stress tests, and long-horizon agentic tasks. The harness, which self-manages flat text-file storage via tool calls, achieves the best cross-task ranking, suggesting that memory performance hinges on giving the agent active control over storage and retrieval rather than on a passive store behind a fixed pipeline. We instantiate this insight in AutoMEM, an agentic memory harness with a self-managed tool interface that achieves the best cross-scenario generality among the systems we evaluate.
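
What "self-managed flat text-file storage via tool calls" might look like can be sketched in a few lines. This is not AutoMEM's interface, which the abstract does not describe: the class name `FlatFileMemory` and the tools `write_note` and `search_notes` are hypothetical, and the sketch assumes note names are plain identifiers with no path separators.

```python
import tempfile
from pathlib import Path


class FlatFileMemory:
    """Hypothetical tool surface an agent could call to manage its own memory.

    The agent, not a fixed pipeline, decides what to write and when to search.
    """

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def write_note(self, name, text):
        """Append one line to the note file `<name>.txt` and return its path."""
        path = self.root / f"{name}.txt"
        with path.open("a", encoding="utf-8") as handle:
            handle.write(text.rstrip("\n") + "\n")
        return str(path)

    def search_notes(self, query):
        """Return (note name, line) pairs whose line contains the query, case-insensitively."""
        needle = query.lower()
        hits = []
        for path in sorted(self.root.glob("*.txt")):
            for line in path.read_text(encoding="utf-8").splitlines():
                if needle in line.lower():
                    hits.append((path.stem, line))
        return hits


# Usage: the agent stores two facts, then retrieves one by keyword.
with tempfile.TemporaryDirectory() as workdir:
    memory = FlatFileMemory(workdir)
    memory.write_note("user", "Prefers metric units")
    memory.write_note("project", "Deploy target is staging")
    found = memory.search_notes("metric")
```

The design choice being illustrated is the contrast the abstract draws: the store itself is deliberately passive and simple, while control over what is saved and retrieved sits with the agent through explicit tool calls.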

2026-06-05 13:00 JST | arXiv cs.AI | Agents

The Digital Apprentice: A Framework for Human-Directed Agentic AI Development

Agentic AI deployments face a recurring design tension: heavy human oversight limits scale, while broad autonomy outruns accountability. Neither posture provides the governance infrastructure required for responsible delegation. We present the Digital Apprentice, a framework for scalable, safe AI agency in which autonomy is earned, not assumed. The Digital Apprentice is a developmental learner that internalizes the tacit methodology of a directing human, graduating through per-skill autonomy tiers only when empirical evidence justifies it. The result is an agent that becomes genuinely useful over time while remaining aligned to a specific human's standards. Three architectural components make this possible: (1) methodology capture, distilling a directing professional's tacit approach into structured assets; (2) authorization, with autonomy escalation gated by explicit human approval; and (3) continuous alignment, correcting drift at runtime and converting each correction into owned preference data. We instantiate this framework as an inference-time control plane. We mathematically model the quality framework and discuss policies and techniques designed to raise quality. We apply the framework to an open professional corpus, and we show how catching data drift and applying a different technique at runtime recovers degraded quality dimensions under traffic shift. The implication extends beyond any single application. We believe these three pillars, stitched together as a system, form a safer and more viable path to agentic systems that can scale without sacrificing trust.
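
The idea that autonomy is "earned, not assumed" can be sketched as a per-skill record that is promoted only when two conditions hold together: recent evidence is good enough, and a human explicitly approves. The tier names, the 20-outcome window, and the 95% acceptance threshold below are all assumptions chosen for illustration; the abstract does not specify how the framework measures evidence.

```python
from dataclasses import dataclass, field

# Hypothetical tier names, ordered from least to most autonomous.
TIERS = ["suggest", "act_with_review", "act_autonomously"]


@dataclass
class SkillRecord:
    """Tracks one skill's autonomy tier and its recent outcomes."""

    tier: int = 0
    outcomes: list = field(default_factory=list)  # True = accepted without correction

    def record(self, accepted):
        """Log whether the human accepted the agent's output for this skill."""
        self.outcomes.append(bool(accepted))

    def eligible(self, window=20, min_rate=0.95):
        """Evidence gate: a full window of recent outcomes at or above min_rate."""
        recent = self.outcomes[-window:]
        return len(recent) == window and sum(recent) / window >= min_rate

    def escalate(self, human_approved, window=20, min_rate=0.95):
        """Promote one tier only if evidence and explicit approval are both present."""
        at_top = self.tier >= len(TIERS) - 1
        if human_approved and not at_top and self.eligible(window, min_rate):
            self.tier += 1
            self.outcomes.clear()  # evidence must be re-earned at the new tier
            return True
        return False


# Usage: twenty accepted outputs are not enough without approval.
drafting = SkillRecord()
for _ in range(20):
    drafting.record(True)
denied = drafting.escalate(human_approved=False)
granted = drafting.escalate(human_approved=True)
```

Clearing the outcome history after a promotion is one possible policy; it forces the agent to accumulate fresh evidence at each tier instead of carrying credit forward.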

2026-06-05 13:00 JSTarXiv cs.AIエージェント

状態に基づいた動的検索による Web エージェントのオンラインスキル学習

言語エージェントは、関連タスク全体にわたる複数ステップの Web 自動化を改善するために、再利用可能なスキルにますます依存しています。オンラインでのスキル学習を研究する仕事が増えており、エージェントは以前のタスクの軌跡からスキルを継続的に導き出し、その場で将来のタスクで再利用します。ただし、既存の方法は主にタスクレベルでスキルを再利用します。つまり、固定のスキルセットが最初のタスク指示に基づいて取得され、実行中ずっと固定されます。この静的戦略は Web の実行とずれており、適切な次のアクションはタスクの目標だけでなく、現在の Web ページの状態にも依存し、初期のスキルではカバーできない状況に移行することがよくあります。このギャップに対処するために、Web エージェントの段階的なスキルの再利用を可能にするオンラインスキル学習方法である State-Grounded Dynamic Retrieval (SGDR) を提案します。 SGDR は 3 つのコンポーネントで構成されます。完了した軌跡を中間の実行状態で呼び出し可能な再利用可能なサブプロシージャに変換するスライディングウィンドウ抽出プロセス、スキルの取得と実行可能なアクションを結び付けるデュアルテキストコード表現、スキルをタスクの目標と現在の Web ページの状態の両方に一致させる状態ベースの動的取得メカニズムです。 5 つのドメインにわたる WebArena での実験では、SGDR が一貫して強力なベースラインを上回っており、GPT-4.1 で 37.5%、Qwen3-4B で 24.3% の平均成功率を達成しており、最も強力なベースラインに対してそれぞれ 10.6% と 10.0% の相対的な向上に相当します。コードは https://github.com/plusnli/skill-dynamic-retrieval で入手できます。

原文 (English)

Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval

Language agents increasingly rely on reusable skills to improve multi-step web automation across related tasks. A growing line of work studies online skill learning, where agents continually induce skills from previous task trajectories and reuse them in future tasks on the fly. However, existing methods mainly reuse skills at the task-level: a fixed set of skills is retrieved based on the initial task instruction and then held fixed throughout execution. This static strategy is misaligned with web execution, where the appropriate next action depends not only on the task goal but also on the current webpage state, which often transitions into situations that the initial skills fail to cover. To address this gap, we propose State-Grounded Dynamic Retrieval (SGDR), an online skill learning method that enables stepwise skill reuse for web agents. SGDR consists of three components: a sliding-window extraction process that turns completed trajectories into reusable sub-procedures invokable at intermediate execution states, a dual text-code representation that connects skill retrieval with executable action, and a state-grounded dynamic retrieval mechanism that matches skills to both the task goal and the current webpage state. Experiments on WebArena across five domains show that SGDR consistently outperforms strong baselines, achieving average success rates of 37.5% with GPT-4.1 and 24.3% with Qwen3-4B, corresponding to relative gains of 10.6% and 10.0% over the strongest baseline, respectively. The code is available at https://github.com/plusnli/skill-dynamic-retrieval.
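The difference between task-level and state-grounded retrieval can be illustrated with a small sketch. Token-overlap (Jaccard) scoring, the `alpha` weight, and the skill records below are stand-ins invented for illustration; the abstract does not describe SGDR's actual retriever or skill format.

```python
def tokens(text):
    return set(text.lower().split())


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(skills, goal, page_state, alpha=0.5, k=1):
    """Score each skill against BOTH the task goal and the current page state."""
    scored = []
    for skill in skills:
        goal_score = jaccard(tokens(skill["goal"]), tokens(goal))
        state_score = jaccard(tokens(skill["state"]), tokens(page_state))
        scored.append((alpha * goal_score + (1 - alpha) * state_score, skill["name"]))
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical skill library: both skills serve the same goal but apply in different states.
SKILLS = [
    {"name": "search_product", "goal": "buy an item",
     "state": "home page search box"},
    {"name": "fill_checkout_form", "goal": "buy an item",
     "state": "checkout page shipping address form"},
]
```

Because both skills match the goal equally well, a retriever keyed only on the initial instruction cannot tell them apart. Re-running `retrieve` at every step with the current page state is what changes the selection as the page changes.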

2026-06-05 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Trivium: 因果記憶コントローラーの第一級目標としての時間的後悔

現在のエージェントシステムや LLM パイプラインの多くは、結果の報酬を最適化することで誤りを修正します。これは失敗の「何が」にしか対処しません。結果が予測と食い違っても、その不一致の「なぜ」と「いつ」が体系的に記録、レビュー、修正されないため、同じエラーがエピソードをまたいで再発する可能性があります。私たちは、これは単なるモデル容量の問題ではなく、構造的な問題であると主張します。私たちは、結果の後悔および作業因果モデルに対する認識論的後悔と並ぶ第一級の目標として、長期的な時間的後悔を提案します。時間的後悔は、失敗が「いつ」持続するか、すなわち、較正を誤った因果モデルが修正されるまでにどれだけの期間放置されるかを捉えます。認識論的後悔は、失敗が「なぜ」持続するか、つまり作業因果モデルに残る不確実性や誤りを捉えます。3 つの後悔を合わせると、長期稼働するエージェントが何を、なぜ、いつ失敗しうるのかについて、反証可能な説明が得られます。エージェントを E 個のエピソードのストリームとしてモデル化し、因果プロービング、持続性、検出可能性に関する明示的な仮定のもとで 3 つの条件付き結果を証明します。第 1 に、観察的に等価な交絡のもとでは、結果のみの学習は介入チャネルなしに因果構造と疑似的な構造を区別できないため、結果の後悔がゼロになった後でも時間的な較正誤りが線形に持続しえます。第 2 に、永続的な因果ログと予算付きプローブを用いると、総プローブ複雑度はエピソード期間に対して対数的となり、O(log E) の時間的後悔が導かれます。第 3 に、K 個の検出可能な変化点のもとでは、このレートは O(K log E) に拡張されます。Trivium をインスタンス化し、5 つの反証可能な予測を事前登録します。CausalBench-Seq では、Trivium は予測された対数エンベロープに従う一方、結果のみのベースラインは線形に増加します。実 LLM を用いたパイロットストリームは、E = 500 の完全な実行 1 回と、E = 100 のフロンティアモデルパイロット 3 回にわたり、予備的な外部妥当性の証拠を提供します。ここでの自己学習とは、LLM の重みを再トレーニングすることではなく、外部の因果モデルを修正することを意味します。

原文 (English)

Trivium: Temporal Regret as a First-Class Objective for Causal-Memory Controllers

Many current agentic systems and LLM pipelines correct mistakes by optimizing outcome reward. This addresses only the what of failure: when an outcome diverges from prediction, the why and when of the mismatch are not systematically logged, reviewed, or corrected, so the same error can recur episode after episode. We argue that this is a structural problem, not merely a model-capacity one. We propose long-horizon temporal regret as a first-class objective alongside outcome regret and epistemic regret over the working causal model. Temporal regret captures when failure persists: how long a miscalibrated causal model is tolerated before correction. Epistemic regret captures why failure persists: residual uncertainty or error in the working causal model. Together, the three regrets give a falsifiable account of what, why, and when a long-lived agent can fail. Modeling the agent as a stream of E episodes, we prove three conditional results under explicit causal-probing, persistence, and detectability assumptions. First, under observationally equivalent confounding, outcome-only learning cannot distinguish causal from spurious structure without an intervention channel, so temporal miscalibration can persist linearly even after outcome regret is driven to zero. Second, with a persistent causal log and budgeted probes, total probe complexity is logarithmic in the episode horizon, inducing O(log E) temporal regret. Third, under K detectable change-points, the rate extends to O(K log E). We instantiate Trivium and pre-register five falsifiable predictions. On CausalBench-Seq, Trivium follows the predicted logarithmic envelope while outcome-only baselines grow linearly. A pilot real-LLM stream provides preliminary external-validity evidence across one full E = 500 run and three E = 100 frontier-model pilots. Self-learning here means revising an external causal model, not retraining LLM weights.
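To build intuition for why a probe budget can be logarithmic in the episode horizon E, consider a generic doubling schedule that probes only at episodes 1, 2, 4, 8, and so on. This is a standard textbook device chosen for illustration, not the schedule Trivium uses; the function name `probe_episodes` is hypothetical.

```python
def probe_episodes(horizon):
    """Return the episodes (1-indexed) at which a doubling schedule would probe."""
    episodes = []
    e = 1
    while e <= horizon:
        episodes.append(e)
        e *= 2
    return episodes


# Number of probes grows like floor(log2(E)) + 1, while per-episode probing grows like E.
for horizon in (100, 500):
    print(horizon, len(probe_episodes(horizon)))
```

For the horizons mentioned in the abstract, such a schedule would issue 7 probes at E = 100 and 9 probes at E = 500, compared with 100 and 500 probes if every episode were probed.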

2026-06-05 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Agentic RAG における連鎖的幻覚: 検出と軽減のための CHARM フレームワーク

マルチステップのエージェント型検索拡張生成 (RAG) パイプラインは、複雑な推論タスクに対して高い能力を示してきましたが、既存の幻覚検出メカニズムが体系的に見逃す種類の障害に対しては脆弱なままです。それがカスケード幻覚 (連鎖的幻覚) であり、パイプラインの初期段階で生じたエラーが後続の推論ステップ全体に伝播して増幅され、確信に満ちているが事実としては誤った最終出力が生成されます。この脆弱性に対処するために、私たちはカスケード幻覚をエージェント型 RAG システムの独立した障害モードとして形式化し、カスケードパターンの 4 種類の分類を提示し、複数ステップの推論パイプラインでエラー伝播を検出して遮断するためのアーキテクチャフレームワークである CHARM (Cascading Hallucination Aware Resolution and Mitigation) を導入します。CHARM は、ステージレベルのファクト検証、クロスステージの一貫性追跡、信頼度伝播のモニタリング、カスケード解決のトリガーという 4 つのコンポーネントで構成され、アーキテクチャの置き換えを必要とせずに、標準的なエージェント型 RAG パイプラインと並行して動作します。LangChain のエージェント型パイプライン構成にわたって HotpotQA、MuSiQue、2WikiMultiHopQA、およびカスタムの敵対的データセットで CHARM を評価し、89.4% のカスケード検出率、5.3% の誤検知率、ステージあたり平均 215 ms +/- 18 ms のレイテンシオーバーヘッドを達成しました。エラー伝播の削減率は 82.1% で、出力レベル検出器の 18.5% を上回ります。コンポーネントのアブレーションにより、各検出モジュールがカスケード全体のカバレッジに有意に寄与していることが確認されました。CHARM は、人間参加型 (human-in-the-loop) の監視フレームワークと統合され、実運用のエージェント型 AI 導入に向けた信頼性とガバナンスの完全なスタックを提供します。

原文 (English)

Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

Multi-step agentic retrieval-augmented generation (RAG) pipelines have demonstrated significant capability for complex reasoning tasks, yet remain vulnerable to a class of failure that existing hallucination detection mechanisms systematically miss: cascading hallucination, where errors introduced at early pipeline stages propagate and amplify across successive reasoning steps, producing confident but factually incorrect final outputs. To address this vulnerability, we formalize cascading hallucination as a distinct failure mode in agentic RAG systems, present a four-type taxonomy of cascade patterns, and introduce CHARM (Cascading Hallucination Aware Resolution and Mitigation), an architectural framework for detecting and interrupting error propagation in multi-step reasoning pipelines. CHARM comprises four components - stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering - that operate alongside standard agentic RAG pipelines without requiring architectural replacement. We evaluate CHARM on HotpotQA, MuSiQue, 2WikiMultiHopQA, and a custom adversarial dataset across LangChain agentic pipeline configurations, achieving an 89.4% cascade detection rate with a 5.3% false positive rate and 215 ms +/- 18 ms average latency overhead per stage, achieving an error propagation reduction of 82.1%, compared to 18.5% for output-level detectors. Component ablations confirm that each detection module contributes meaningfully to overall cascade coverage. CHARM integrates with human-in-the-loop oversight frameworks to provide a complete reliability and governance stack for production agentic AI deployment.

2026-06-05 13:00 JSTarXiv cs.AILLM/生成AIエージェントビジネス/資金調達研究/論文

メタエージェントの課題: 現在のエージェントは自律的なエージェント開発が可能ですか?

現在の AI ベンチマークは、人間が設計したワークフロー内でのタスク実行に関してエージェントを評価します。これらの評価では、基本的に、モデルが自律的にエージェントシステムを開発できるかどうかという、重要な次のレベルの機能を測定できません。自律エージェント開発のためのフロンティアモデルの能力をテストするために設計された評価フレームワークであるメタエージェントチャレンジ (MAC) を紹介します。具体的には、コードエージェント (メタエージェント) には、サンドボックス環境、評価 API、および 5 つのドメインにわたって実施されたテストセットのパフォーマンスを最大化するエージェントアーティファクトを反復的にプログラムするための時間制限が与えられます。評価の整合性を確保するために、このフレームワークは報酬ハッキングに対する多層防御によって保護されています。このフレームワークを活用して、メタエージェントが人為的に設計されたベースラインポリシーと一致することはほとんどなく、一致する少数のエージェントは独自のフロンティアモデルによって支配されていることを示します。さらに、設計プロセスは高い分散を示し、高い最適化圧力により、グラウンドトゥルースの漏洩などの敵対的な動作が表面化し、堅牢性とモデルの調整の両方における重大な欠陥が浮き彫りになります。最終的に、MAC は自律型 AI の研究開発のための厳密なオープンソースベンチマークを提供し、再帰的な自己改善を評価するための経験的な代用手段を提供します。ベンチマークは https://github.com/ant-research/meta-agent-challenge で公開されています。

原文 (English)

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limitation to iteratively program an agent artifact that maximizes performance on a held-out test set across five domains. To ensure evaluation integrity, this framework is secured by multi-layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta-agents rarely match human-engineered baseline policies, and the few that do are dominated by proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure surfaces emergent adversarial behaviors like ground-truth exfiltration, highlighting critical deficits in both robustness and model alignment. Ultimately, MAC provides a rigorous, open-source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self-improvement. Benchmark is publicly available at: https://github.com/ant-research/meta-agent-challenge.

2026-06-05 13:00 JSTarXiv cs.AILLM/生成AIエージェント

AgentJet: エージェント強化学習のための柔軟な群トレーニングフレームワーク

大規模言語モデル (LLM) エージェント強化学習用の分散群トレーニングフレームワークである AgentJet を紹介します。エージェントのロールアウトとモデルの最適化を密接に結び付ける集中型フレームワークとは異なり、AgentJet は分離されたマルチノードアーキテクチャを採用しています。このアーキテクチャでは、swarm サーバーノードがトレーニング可能なモデルをホストし、GPU クラスターで最適化を実行します。一方、swarm クライアントノードは任意のデバイスで任意のエージェントを実行します。この設計は、集中型フレームワークではサポートが難しい機能を提供します。(1) 異種マルチモデル強化学習。複数の LLM を頭脳とする異種マルチエージェントチームのトレーニングを可能にします。 (2) 独立したエージェントのランタイムを使用したマルチタスクのカクテルトレーニング。 (3) 外部環境の障害によるトレーニングプロセスの中断を防ぐフォールトトレラントな実行。 (4) ライブコードの反復。群クライアントノードを置き換えることにより、トレーニング中にエージェントを編集できます。マルチモデル、マルチターン、マルチエージェント設定で効率的な RL をサポートするために、AgentJet はタイムラインマージを備えたコンテキストトラッキングモジュールを導入しています。これにより、冗長なコンテキストが統合され、トレーニングの 1.5 ～ 10 倍の高速化が実現します。最後に、AgentJet は、研究トピックを入力として受け取り、大規模クラスター上で長期にわたる複数日にわたる RL 研究を自律的に実行する自動研究システムを導入します。このシステムは、swarm アーキテクチャを活用することで、実行中に人間の介入なしに、RL 研究者の主要な探索ワークフローを再現します。

原文 (English)

AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning

We present AgentJet, a distributed swarm training framework for large language model (LLM) agent reinforcement learning. Unlike centralized frameworks that tightly couple agent rollouts with model optimization, AgentJet adopts a decoupled multi-node architecture in which swarm server nodes host trainable models and run optimization on GPU clusters, whereas swarm client nodes execute arbitrary agents on arbitrary devices. This design provides capabilities that are difficult to support in centralized frameworks: (1) heterogeneous multi-model reinforcement learning, enabling the training of heterogeneous multi-agent teams with multiple LLM as brains; (2) multi-task cocktail training with isolated agent runtimes; (3) fault-tolerant execution that prevents external environment failures from interrupting the training process; and (4) live code iteration, which allows agents to be edited during training by replacing swarm client nodes. To support efficient RL in multi-model, multi-turn, and multi-agent settings, AgentJet introduces a context tracking module with timeline merging, which consolidates redundant context and achieves a 1.5-10x training speedup. Finally, AgentJet introduces an automated research system that takes a research topic as input and autonomously conducts long-horizon, multi-day RL studies on large-scale clusters. By leveraging the swarm architecture, this system reproduces key exploratory workflows of RL researchers without human intervention during execution.

2026-06-05 13:00 JSTarXiv cs.AILLM/生成AIエージェント

プロンプトベースの計画を超えて: MCP ネイティブグラフ計画ベースの生物医学エージェントシステム

生物医学エージェントは複雑な生物学的ワークフローを自動化できると期待されていますが、現在のシステムは 2 つの根本的なボトルネックに直面しています。バイオインフォマティクスツールはインターフェイスと実行環境が非常に異質である一方、エージェントの計画は依然として、プロンプトに検索して取り込まれるフラットなツール記述に依存しています。生物医学ソフトウェアのエコシステムが成長するにつれ、ツールのカバレッジとコンテキストサイズがこのように結び付いていることで、ツールの混同、不安定な計画、非効率な実行が生じます。私たちは、構造化された生物学的機能の上でのグラフ足場型 (graph-scaffolded) 計画に基づく、MCP ネイティブの生物医学エージェント BioManus を紹介します。BioManus はまず、異種のバイオインフォマティクスソフトウェアを標準化された MCP サーバーに変換する BioinfoMCP Compiler を導入し、大規模で実行可能な MCP エコシステムを生み出します。次に、このエコシステムを、ツール、操作、データ型、ワークフローステージにわたる型付きの異種 MCP グラフとして編成します。推論時に、BioManus はコンパクトなタスク固有のサブグラフを取得し、操作レベルのワークフロースキャフォールドを合成します。この設計は、計画の複雑さを生のツール総数から切り離し、高リコールの検索のもとで Theta(N / (h * m_bar)) のコンテキスト圧縮率を達成します。ここで、N はツールの総数、h はワークフローのホライズン、m_bar (N よりはるかに小さい) は操作ごとの候補ツール数の平均です。BioAgentBench と LAB-Bench での実験では、BioManus が先進的な生物医学エージェントのベースラインと比べて、実行精度、ワークフローの妥当性、コンテキスト効率を向上させることが示されています。この研究はパラダイムシフトを示唆しています。スケーラブルな生物医学的推論には、肥大化し続けるプロンプトレベルのツール検索ではなく、構造化された実行可能な機能グラフが必要です。

原文 (English)

Beyond Prompt-Based Planning: MCP-Native Graph Planning-based Biomedical Agent System

Biomedical agents promise to automate complex biological workflows, yet current systems face two fundamental bottlenecks: bioinformatics tools are highly heterogeneous in interfaces and execution environments, while agent planning still relies on flat prompt-retrieved tool descriptions. As biomedical software ecosystems grow, this coupling between tool coverage and context size leads to tool confusion, unstable planning, and inefficient execution. We introduce BioManus, an MCP-native biomedical agent built on graph-scaffolded planning over structured biological capabilities. BioManus first introduces the BioinfoMCP Compiler, which converts heterogeneous bioinformatics software into standardized MCP servers, yielding a large executable MCP ecosystem. It then organizes this ecosystem as a typed heterogeneous MCP graph over tools, operations, datatypes, and workflow stages. At inference time, BioManus retrieves compact task-specific subgraphs, synthesizes operation-level workflow scaffolds. This design decouples planning complexity from raw tool inventory size, achieving a context compression ratio of Theta(N / (h * m_bar)) under high-recall retrieval, where N is the total tool count, h is the workflow horizon, and m_bar (much smaller than N) is the average number of candidate tools per operation. Experiments on BioAgentBench and LAB-Bench show that BioManus improves execution accuracy, workflow validity, and context efficiency over advanced biomedical agent baselines. This work suggests a paradigm shift: scalable biomedical reasoning requires structured executable capability graphs rather than increasingly larger prompt-level tool retrieval.
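The stated compression ratio Theta(N / (h * m_bar)) can be made concrete with a short worked calculation. The sketch assumes context cost is proportional to the number of tool descriptions shown to the planner and drops the constant factors hidden by Theta; the numbers (2,000 tools, 5 workflow steps, 4 candidates per operation) are invented for illustration and do not come from the paper.

```python
def context_compression_ratio(n_tools, horizon, avg_candidates):
    """Ratio of flat tool-listing size (N) to scaffolded size (h * m_bar).

    Constant factors hidden by the Theta notation are ignored.
    """
    if horizon <= 0 or avg_candidates <= 0:
        raise ValueError("horizon and avg_candidates must be positive")
    return n_tools / (horizon * avg_candidates)


# Hypothetical ecosystem: N = 2000 tools, h = 5 steps, m_bar = 4 candidates per step.
print(context_compression_ratio(2000, 5, 4))  # 2000 / 20 = 100.0
```

Under these assumed numbers the planner would see roughly 20 tool descriptions instead of 2,000. The ratio grows linearly with N as long as h and m_bar stay fixed, which is the sense in which planning cost is decoupled from inventory size.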

2026-06-05 13:00 JSTarXiv cs.AIエージェント

MapAgent: 都市規模の車線レベルの地図生成のための産業グレードのエージェントフレームワーク

車線レベルの地図は自動運転と車線レベルのナビゲーションにとって重要なインフラストラクチャですが、数百の都市で標準化された車線ネットワークの構築と維持には依然として非常に労働集約的です。最近のエンドツーエンドのベクトル化マッピング手法は、センサーデータから直接車線の形状とトポロジを予測できますが、通常、マッピング仕様と交通規制を暗黙的なデータセット依存の監視として扱います。さらに、複雑なシーン (マーキングやオクルージョンの磨耗や欠落など) では、正しいレーン構成が視覚的証拠だけでは十分に決定されないことが多く、仕様違反が人間による事後編集の主な原因となっています。私たちは、仕様に準拠したレーンマップ作成のためのベクトル化バックボーンを強化する産業グレードのエージェントアーキテクチャである MapAgent を提案します。 MapAgent は、単にマップ予測にエージェントループを追加するのではなく、バックボーンの認識と明示的な仕様の検証、制約を意識した推論、および境界のある検証主導型のジャッジ-プランナー-ワーカーループの下での決定論的なマップ編集を結合します。視覚言語を使用するジャッジは、視覚的な証拠とドラフトベクトルを共同で検査することでエラーを診断し、ツールを呼び出すプランナーは編集後の再検証により最小限の修正編集を生成します。都市規模の本番環境でのスケーラビリティを維持するために、MapAgent はバックボーンの信頼性が低いタイルでのみ選択的にトリガーされ、スループットを維持しながら適度なオーバーヘッドを追加します。現実世界のデータセットでの実験では、特に複雑でロングテールのシナリオにおいて、強力な実稼働ベースラインを上回る一貫した利益が示されています。さらに、MapAgent は Baidu Maps に統合されており、全国 360 以上の都市の車線レベルの地図生成をサポートし、全体的な生産自動化を 95% 以上に高め、大規模な車線レベルの地図生成における MapAgent の実用性と有効性を実証しています。

原文 (English)

MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation

Lane-level maps are critical infrastructure for autonomous driving and lane-level navigation, yet constructing and maintaining standardized lane networks for hundreds of cities remains highly labor-intensive. Recent end-to-end vectorized mapping methods can predict lane geometry and topology directly from sensor data, but they typically treat mapping specifications and traffic regulations as implicit, dataset-dependent supervision. Moreover, in complex scenes (e.g., worn or missing markings and occlusions), correct lane configurations are often under-determined by visual evidence alone, making specification violations a major source of human post-editing. We propose MapAgent, an industrial-grade agentic architecture that augments a vectorization backbone for specification-compliant lane-map production. Rather than merely adding an agent loop to map prediction, MapAgent couples backbone perception with explicit specification verification, constraint-aware reasoning, and deterministic map editing under a bounded, verification-driven Judge-Planner-Worker loop. A vision-language Judge diagnoses errors by jointly inspecting visual evidence and draft vectors, while a tool-calling Planner generates minimal corrective edits with post-edit re-validation. To remain scalable for city-scale production, MapAgent is selectively triggered only on tiles with low backbone confidence, adding modest overhead while preserving throughput. Experiments on real-world datasets show consistent gains over strong production baselines, especially in complex and long-tail scenarios. Additionally, MapAgent has been integrated into Baidu Maps, supporting lane-level map generation for over 360 cities nationwide and elevating the overall production automation to over 95%, demonstrating MapAgent's practicality and effectiveness for large-scale lane-level map generation.

2026-06-05 13:00 JSTarXiv cs.AILLM/生成AIエージェント

パラメトリックメモリを介した自己進化エージェントのスケーリング

既存のメモリ拡張 LLM エージェントは、ロールアウト中モデルパラメータを凍結したままにしながら、過去の経験をテキストの要約または取得された一節としてプロンプト領域にのみ保存します。このようなエージェントは、見たものを \emph{調べる}ことはできますが、それから \emph{学ぶ}ことはできません。彼らのポリシーは経験によって変更されず、コンテキストからドロップされた情報は永久に失われます。私たちは、自己進化するパラメトリックメモリフレームワークである \texttt{TMEM} を導入します。このフレームワークでは、エージェントが履歴を明示的メモリに圧縮するだけでなく、軽量のオンラインアップデートを通じて抽出された監視を高速な LoRA 重み $\Delta_t$ に吸収し、単一のエピソード内で将来の動作を真に変更します。これを、高速重みロールアウトダイナミクスを備えたエージェントの意思決定プロセスとして形式化します。アクションは $\pi_{\theta_0+\Delta_t}$ からサンプリングされ、抽出アクションは後続の決定のために $\Delta_t$ を更新する監視を生成します。このビューにより、抽出ポリシーが RL によって直接最適化可能になります。$\theta_0$ のトレーニングにより、タスクアクションだけでなく、オンライン LoRA 適応に使用されるデータの品質も向上します。さらに、オンラインコンバージェンスを加速するために、SVD ベースの LoRA サブスペースの初期化を提案します。 LoCoMo、LongMemEval-S、多目的検索、および CL-Bench の実験では、\texttt{TMEM} がさまざまなモデルスケールにわたって、要約ベースおよび検索ベースのベースラインを一貫して上回るパフォーマンスを示しています。

原文 (English)

Scaling Self-Evolving Agents via Parametric Memory

Existing memory-augmented LLM agents store past experience exclusively in prompt space, as textual summaries or retrieved passages, while keeping model parameters frozen throughout a rollout. Such agents can \emph{look up} what they have seen but cannot \emph{learn from} it: their policy is unchanged by experience, and any information dropped from the context is permanently lost. We introduce \texttt{TMEM}, a self-evolving parametric memory framework in which the agent not only compresses history into explicit memory but also absorbs distilled supervision into fast LoRA weights $\Delta_t$ via lightweight online updates, genuinely altering its future behavior within a single episode. We formalize this as an agentic decision process with fast-weight rollout dynamics: actions are sampled from $\pi_{\theta_0+\Delta_t}$, while extraction actions produce supervision that updates $\Delta_t$ for subsequent decisions. This view makes the extraction policy directly optimizable by RL: training $\theta_0$ improves not only task actions but also the quality of the data used for online LoRA adaptation. We further propose SVD-based initialization of the LoRA subspace to accelerate online convergence. Experiments on LoCoMo, LongMemEval-S, multi-objective search, and CL-Bench show that \texttt{TMEM} consistently outperforms summary-based and retrieval-based baselines across different model scales.
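The idea of acting from frozen base weights plus fast low-rank weights, written above as $\pi_{\theta_0+\Delta_t}$, can be sketched on a toy linear layer. This is a rank-1, pure-Python illustration trained with plain gradient descent on a squared error; the function names, learning rate, and data are assumptions, and it does not reproduce TMEM's update rule, its RL-trained extraction policy, or its SVD-based initialization.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def forward(w0, a, b, x):
    """y = (W0 + b a^T) x, with W0 frozen and (a, b) the rank-1 fast weights."""
    s = dot(a, x)
    return [dot(row, x) + bi * s for row, bi in zip(w0, b)]


def loss(y, target):
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, target))


def online_update(w0, a, b, x, target, lr=0.1):
    """One gradient step on the fast weights only; W0 is never modified."""
    err = [yi - ti for yi, ti in zip(forward(w0, a, b, x), target)]
    s = dot(a, x)
    b_dot_err = dot(b, err)
    new_b = [bi - lr * ei * s for bi, ei in zip(b, err)]      # dL/db_i = err_i * (a . x)
    new_a = [ai - lr * b_dot_err * xi for ai, xi in zip(a, x)]  # dL/da_j = (b . err) * x_j
    return new_a, new_b


# Toy setup: identity base weights, b starts at zero so the adapter is initially a no-op.
W0 = [[1.0, 0.0], [0.0, 1.0]]
A, B = [0.5, 0.5], [0.0, 0.0]
X, TARGET = [1.0, 2.0], [2.0, 1.0]
```

Repeated calls to `online_update` within an episode change the layer's behaviour on later inputs while `W0` stays fixed, which is the contrast the abstract draws with prompt-space memory.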

2026-06-05 13:00 JSTarXiv cs.AIエージェント

Neetyabhas: 合理的エージェントベースモデルにおける不確実性を考慮した公共政策最適化のためのフレームワーク

目的: WHO の新型コロナウイルス感染症 (COVID-19) に対する非医薬品介入 (ロックダウン、ワクチン接種など) は感染拡大を効果的に抑制しますが、経済に大きな負担を強います。既存の研究は、個人の行動を無視し、完璧な感染追跡と完璧な政策実行を誤って仮定していることが多く、現実世界の不確実性や誤差を考慮できていません。方法: 私たちは、流行の測定 (感染者数と入院者数) と政策実施の両方における不確実性を組み込んだ統合的アプローチを提案します。マスクの着用、ワクチン接種、買い物に関してリアルタイムで選択を行う 1,000 人の個人からなるシミュレーションモデルを構築しました。同時に、政策立案者は健康と経済の観測に基づいて介入 (ロックダウン、義務化) を展開します。このフレームワークは階層型強化学習エージェントによって駆動され、ディープ Q ネットワークと、不確実性を考慮した方策勾配法の変種 (DDPG および TD3) を併用します。結果: シミュレーションは流行の進行を効果的に制御しました。マスクとワクチン接種は非常に効果的であり、流行のピークの高さと持続期間の両方を大幅に低減しました。個人の行動、政策の不確実性、多面的な介入を統合することで、私たちの動的制御アプローチは流行の影響を軽減することに成功しました。結論: 私たちのモデルは、不確実性と人間の行動を公衆衛生政策の枠組みに組み込むことで、従来研究の限界を克服します。このシミュレーションは、複雑なパンデミックにおいて効果的な介入を設計するには個人の選択と不完全なデータを考慮することが重要であり、マスクとワクチンが極めて重要なツールとなることを示しています。

原文 (English)

Neetyabhas: A Framework for Uncertainty-Aware Public Policy Optimization in Rational Agent-Based Models

Purpose: The WHO's COVID-19 non-pharmaceutical interventions (e.g., lockdowns, vaccinations) effectively curb transmission but impose heavy economic strains. Existing research often neglects individual behaviors and falsely assumes perfect infection tracking and flawless policy execution, failing to account for real-world uncertainties and errors. Methods: We propose an integrative approach incorporating uncertainties in both epidemic measurement (infections/hospitalizations) and policy implementation. We built a simulation model of 1,000 individuals making real-time choices regarding mask-wearing, vaccination, and shopping. Concurrently, policymakers deploy interventions (lockdowns, mandates) based on health and economic observations. This framework is driven by hierarchical reinforcement learning agents, utilizing deep Q-networks alongside uncertainty-aware policy gradient variants (DDPG and TD3). Results: The simulations effectively managed the epidemic's progression. Masking and vaccinations proved highly effective, significantly reducing both the outbreak's peak height and duration. By integrating individual behaviors, policy uncertainties, and multifaceted interventions, our dynamic control approach successfully mitigated the epidemic's impact. Conclusions: Our model overcomes previous research limitations by embedding uncertainty and human behavior into public health policy frameworks. The simulation demonstrates that accounting for individual choices and imperfect data is crucial for designing effective interventions during complex pandemics, with masks and vaccines serving as pivotal tools.

2026-06-05 13:00 JSTarXiv cs.AILLM/生成AIエージェント

最初に計画し、後で判断し、より良く実行する: DMAIC からインスピレーションを得た産業異常検出用エージェントシステム

大規模言語モデル (LLM) エージェントは、複雑なデータ分析ワークフローの自動化において有望であることが示されていますが、リスクの高い産業シナリオで信頼性の高い導入を実現することは依然として困難です。産業異常検出 (IAD) は製造の品質、安全性、効率に不可欠ですが、既存の LLM ベースの IAD エージェントは主に実行に焦点を当てており、戦略の策定を十分に活用できていません。その結果、異種のモダリティを統一的かつコスト効率の高い方法で処理することに苦労しています。DMAIC 品質管理フレームワークから着想を得て、私たちは DMAIC-IAD (DMAIC-inspired Agentic Industrial Anomaly Detection) を提案します。これは、LLM エージェントを構造化された産業上の問題解決に整合させる、「最初に計画し、後で判断する」マルチエージェントシステムです。DMAIC-IAD は、戦略生成の前に異種の参考資料を標準作業手順書 (SOP) に抽出し、事前トレーニング済みの実行不要な判定モデルを導入して、コストのかかる実行時の試行なしに候補戦略をランク付けします。4 つのモダリティにわたる広範な実験により、DMAIC-IAD は適用可能なエージェント型ベースラインと比べて平均検出性能を 37.76% 向上させることが示されています。

原文 (English)

Plan First, Judge Later, Run Better: A DMAIC-Inspired Agentic System for Industrial Anomaly Detection

Large language model (LLM) agents have shown promise in automating complex data-analysis workflows, but their reliable deployment remains challenging in high-stakes industrial scenarios. Industrial anomaly detection (IAD) is essential for manufacturing quality, safety, and efficiency, yet existing LLM-based IAD agents mainly focus on execution while under-exploiting strategy formulation. Consequently, they struggle to handle heterogeneous modalities in a unified and cost-effective manner. Inspired by the DMAIC quality-management framework, we propose DMAIC-IAD (DMAIC-inspired Agentic Industrial Anomaly Detection), a "Plan First, Judge Later" multi-agent system that aligns LLM agents with structured industrial problem-solving. DMAIC-IAD distills heterogeneous references into standardized operating procedures (SOPs) before strategy generation, and introduces a pre-trained execution-free judge model to rank candidate strategies without costly runtime trials. Extensive experiments across four modalities show that DMAIC-IAD improves average detection performance over applicable agentic baselines by 37.76%.

2026-06-05 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Parthenon Law: 自己進化する法務エージェントフレームワーク

エージェントの能力が高まるにつれ、法律分野の LLM エージェントは、文書量の多い案件をレビュー可能な成果物に変えることが期待されていますが、信頼性の高い導入には 3 つの障害があります。第 1 に、現在最も強力なモデルとハーネスの組み合わせが、エンドツーエンドの法律案件でどのように振る舞うかについての大規模な証拠がありません。第 2 に、法律という業種に適合したエージェントアーキテクチャがなく、汎用ハーネスしか存在しません。第 3 に、新しい事実、典拠、期限によって状況が変わり続ける中で、システムが自らの結果から学習するメカニズムがありません。私たちはそれぞれに取り組みます。Harvey LAB での大規模な実証研究 (12,510 件のエージェント軌跡) は、フロンティアのエージェントでさえ 1 回のパスで案件を完了するには程遠いことを示しています。モデルが強力になるほど基準ごとの精度は上がりますが、厳密な案件完了率は伸び悩みます。次に、自己進化する法務エージェントフレームワーク Parthenon を導入します。これは、モデル、ハーネス、エージェントの役割、法的知識、決定論的なツール、手続き的なスキルを、情報源の追跡可能性、日付と数値の根拠付け、成果物のコンプライアンス、論点のクローズのための監査可能な面へと分解します。最後に、漏洩防止型の学習ループが、採点された失敗をタスクに依存しないスキル、ツール、知識の編集へと変換します。これにより、法律事務所が案件ごとにチェックリストやプレイブックを改良していくのと同じように、モデルの重みに触れることなく、経験とともにシステムが改善されます。大規模な実証分析全体を通じて、Parthenon は法律案件タスクにおける最先端のモデルとハーネスの性能を大幅に向上させました。

原文 (English)

Parthenon Law: A Self-Evolving Legal-Agent Framework

As agents grow more capable, legal-domain LLM agents promise to turn document-heavy matters into reviewable work products, yet reliable deployment faces three obstacles: no large-scale evidence on how today's strongest model-and-harness combinations behave on end-to-end legal matters; no agent architecture adapted to the legal vertical, only general-purpose harnesses; and, in a setting that keeps shifting with new facts, authorities, and deadlines, no mechanism for systems to learn from their own outcomes. We address each. A large-scale empirical study on Harvey LAB (12,510 agent trajectories) shows that even frontier agents remain far from completing matters in a single pass: per-criterion accuracy climbs with stronger models while strict matter completion stalls. We then introduce Parthenon, a self-evolving legal-agent framework that factors Model, Harness, Agent roles, legal Knowledge, deterministic Tools, and procedural Skills into auditable surfaces for source traceability, date and number grounding, deliverable compliance, and issue closure. Finally, an anti-leakage learning loop converts scored failures into task-agnostic edits to skills, tools, and knowledge, letting the system improve with experience (as a firm refines its checklists and playbooks after each matter) without touching model weights. Across our large-scale empirical analysis, Parthenon substantially improves the performance of state-of-the-art models and harnesses on legal-matter tasks.

2026-06-05 13:00 JSTarXiv cs.AIエージェント

MIRAGE: 暗黙的推論と生成世界モデルを備えたモバイルエージェント

モバイルエージェントは、スクリーンショットと言語で与えられた目標から日常のアプリケーションを操作することがますます期待されており、信頼性の高い制御には、画面のアフォーダンス、複数ステップのナビゲーション、将来の状態変化に関する推論が必要です。しかし、多くのエージェントはこの計算を長いテキストの思考連鎖として外部化するため、対話が遅くなり、教師データのコストが増加し、展開が複雑になります。私たちは、可視のテキスト推論トレースから連続的な潜在推論表現を学習するフレームワークである MIRAGE を紹介します。MIRAGE は、明示的な推論をコンパクトな隠れ状態に移し替え、エージェントが長い根拠をデコードすることなく内部で推論できるようにします。また、生成的世界モデルの目的関数も組み込まれています。潜在推論ベクトルを将来のスクリーンショットと整合させることで、エージェントが行動する前に今後のインターフェイス状態を予測するよう促します。これにより、隠れた計算は、圧縮された思考表現であると同時に、環境ダイナミクスの先読みモデルにもなります。推論時、MIRAGE は連続的な潜在空間で推論し、トークン生成を削減しつつ実行効率を向上させます。AndroidWorld では、MIRAGE は 4B のアブレーションにおいて、3 ～ 5 倍少ないデコードトークン予算で、明示的な思考連鎖による教師あり微調整と同等の性能を示し、同等の命令調整ベースラインを 10.2 ポイント上回ります。AndroidControl では、生成トークンを 75% 以上削減しながら、アクションのグラウンディングを向上させます。

原文 (English)

MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models

Mobile agents are increasingly expected to operate everyday applications from screenshots and language goals, where reliable control requires reasoning over screen affordances, multi-step navigation, and future state changes. However, many agents externalize this computation as long textual chains of thought, which slows interaction, increases supervision cost, and complicates deployment. We introduce MIRAGE, a framework that learns continuous latent reasoning representations from visible textual reasoning traces. MIRAGE transfers explicit reasoning into compact hidden states, enabling the agent to reason internally without decoding long rationales. It also incorporates a generative world-model objective: latent reasoning vectors are aligned with future screenshots, encouraging the agent to anticipate upcoming interface states before acting. This turns hidden computation into both a compressed thought representation and a forward-looking model of environment dynamics. At inference time, MIRAGE reasons in continuous latent space, reducing token generation while improving execution efficiency. On AndroidWorld, MIRAGE matches explicit chain-of-thought supervised fine-tuning in the 4B ablation with a 3-5x lower decoded-token budget and improves a comparable instruction-tuned baseline by 10.2 points; on AndroidControl, it improves action grounding while generating over 75% fewer tokens.

2026-06-05 13:00 JSTarXiv cs.AIエージェント

Fog of Love: ゲーム環境における親和性ベースの強化学習による高潔なエージェントの動作のエンジニアリング

人工知能に高潔な行動を教え込むことへの関心が高まっています。提案された手法の 1 つは、親和性ベースの強化学習として知られています。これは、目的関数のポリシー正則化を使用して、報酬関数の設計に完全に依存することなく、善良な行動を奨励します。これまでのところ、この手法は、状態空間とアクション空間が最小限のグリッドワールドやおもちゃの問題環境で有効であることが実証されています。この研究をより洗練された環境に拡張するために、Fog of Love として知られるロールプレイングボードゲームに基づく 2 プレイヤーマルチエージェント環境を導入します。この環境では、2 人のエージェントがそれぞれの美徳を満たすために競い合いながら、その関係を満たすために協力します。マルチエージェントの性質を考慮すると、これは複雑な問題であり、マルチエージェントの深い決定論的ポリシー勾配エージェントは競合も連携もうまくいきません。我々は、局所的な親和性が競争目的と協力目的の両方を達成する際のエージェントのパフォーマンスを向上させ、その結果、両方のドメインで総合スコアが優れているという証拠を提示します。これは、結果的に賢明な選択をもたらすだけでなく、エージェントの目的論を明確にし、その行動を人間レベルで解釈できるようにします。

原文 (English)

Fog of Love: Engineering Virtuous Agent Behavior with Affinity-based Reinforcement Learning in a Game Environment

Instilling virtuous behavior in artificial intelligence has seen increasing interest. One of the techniques proposed is known as affinity-based reinforcement learning, which uses policy regularization on the objective function to incentivize virtuous actions without being fully dependent on the reward function design. Thus far, this technique has been demonstrated to be effective in grid worlds and toy-problem environments with minimal state and action spaces. To expand this research to more sophisticated environments, we introduce a two-player multi-agent environment based on the role-playing board game known as Fog of Love. In this environment, two agents compete to fulfill their individual virtues, while also cooperating to satisfy their relationship. Given the multi-agent nature, this is a complex problem where multi-agent deep deterministic policy gradient agents neither compete nor cooperate successfully. We present evidence that localized affinities enhance agent performance in achieving both competitive and cooperative objectives, resulting from superior overall scores in both domains. This not only results in virtuous choices but also clarifies an agent's teleology and makes its behavior human-level interpretable.

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント

FALSIFYBENCH: ルール発見ゲームを使用した LLM の帰納的推論の評価

大規模言語モデル (LLM) は、科学タスクにおける自律エージェントとして導入されることが増えています。しかし、これらのシステムが科学的発見に関わる帰納的推論を効果的に行えるかどうかは未解決の問題です。本研究では、古典的な Wason 2-4-6 課題に着想を得た、仮説駆動型推論の評価フレームワーク FALSIFYBENCH を紹介します。この課題では、エージェントは例を繰り返し提案してフィードバックを受け取ることで、隠れた意味的性質を発見しなければなりません。この課題は、科学的推論の重要な要素、すなわち仮説の生成、証拠の収集、および確証的証拠と反証的証拠の両方に応じた信念の修正を捉えています。モデルファミリーと規模の異なる 12 個の LLM を評価した結果、最適な性能に近いモデルはないものの、推論モデルは一般に命令チューニングモデルよりも科学的推論に優れていることがわかりました。成功の主な要因は否定的テストを行う能力です。仮説を積極的に反証しようとするモデルは、主に確証を求めるモデルを一貫して上回ります。さらに、これまでの研究では見過ごされてきたターン単位のきめ細かい分析により、失敗は、モデルが仮説空間を探索する際の特定可能なパターンと結びついていることが明らかになりました。

原文 (English)

FALSIFYBENCH: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games

Large language models (LLMs) are increasingly deployed as autonomous agents in scientific tasks. Yet whether these systems can effectively engage in forms of inductive reasoning relevant to scientific discovery remains an open question. In this work, we introduce FALSIFYBENCH, an evaluation framework for hypothesis-driven reasoning inspired by the classic Wason 2-4-6 task, in which agents must discover hidden semantic properties by iteratively proposing examples and receiving feedback. This task captures key elements of scientific reasoning: hypothesis generation, evidence gathering, and belief revision in response to both confirming and disconfirming evidence. Our evaluation of 12 LLMs across model families and scales shows that reasoning models are generally stronger scientific reasoners than instruction-tuned models, although no model comes close to optimal performance. The primary driver of success is the capacity for negative testing: models that actively seek to falsify their hypotheses consistently outperform those that primarily seek confirmation. Moreover, a fine-grained turn-level analysis, neglected in previous work, reveals that failure is tied to identifiable patterns in how models navigate the hypothesis space.
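
The contrast between confirming and falsifying probes can be illustrated with a toy version of the Wason 2-4-6 setup. This is only a sketch under stated assumptions: the hidden rule, the agent's hypothesis, and the probe lists below are invented for illustration and are not taken from FALSIFYBENCH itself.

```python
from typing import List, Tuple

Triple = Tuple[int, int, int]


def hidden_rule(t: Triple) -> bool:
    """Environment's secret rule (assumed here): strictly increasing numbers."""
    a, b, c = t
    return a < b < c


def hypothesis(t: Triple) -> bool:
    """Agent's overly narrow guess (assumed here): each number is the previous plus 2."""
    a, b, c = t
    return b - a == 2 and c - b == 2


def is_refuted(probes: List[Triple]) -> bool:
    """The hypothesis is refuted if feedback disagrees with its prediction on any probe."""
    return any(hidden_rule(t) != hypothesis(t) for t in probes)


# Positive tests: every probe is chosen so that the hypothesis predicts "yes".
confirming_probes = [(2, 4, 6), (10, 12, 14), (1, 3, 5)]
# Negative tests: probes that the hypothesis predicts should be rejected.
falsifying_probes = [(1, 2, 3), (5, 10, 20), (6, 4, 2)]

print(is_refuted(confirming_probes))  # False: the wrong hypothesis survives
print(is_refuted(falsifying_probes))  # True: (1, 2, 3) is accepted by the hidden rule
```

Confirming probes can never expose a hypothesis that is a strict subset of the true rule, which is one way to read the abstract's claim that negative testing drives success.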

2026-06-05 13:00 JST arXiv cs.AI エージェント 研究/論文

人間と AI のインタラクションにおけるマルチエージェントの相補性のツリーベースの定式化

相補性とは、人間と AI のインタラクション (HAI) が、そのメンバーの中で利用可能な最良の予測ベンチマークを上回る場合を指します。この考え方は HAI 研究の中心にありますが、相補性に関する形式的な研究は依然として限られています。既存のフレームワークは、エージェントの予測がワークフローに依存するマルチエージェントプロトコルへとどのように合成されるかをモデル化していません。私たちは、マルチエージェント HAI における相補性のツリーベースの形式化を導入することで、このギャップを埋めます。HAI プロトコルは、順序付けられたエージェントと役割の構成と、葉に予測ベクトルが付与された根付き平面二分木によって表されます。局所的な二項合成則をツリーに沿って再帰的に評価することで、pointwise-min オラクルベンチマークに対するツリー相対の相補性汎関数が得られます。私たちは 4 つの結果を証明します。第 1 に、セレクターベースの HAI (自己依存や AI 依存を含む) は、タスク、損失、予測の品質にかかわらず相補性を達成できません。第 2 に、二乗損失の下での回帰では、相補性は正解ベクトルからのユークリッド距離の最小化と同値です。$N=2$ の場合、最適な線形プーリングの重みは閉形式で表され、残差補正としての解釈を持ちます。第 3 に、線形の局所合成の下では、すべてのプロトコルツリーは葉の重みの単体上の重心座標チャートを定義します。プロトコルツリーの Tamari 被覆による再パラメータ化は相補性を保存し、$N=4$ の場合は五角形恒等式を満たします。第 4 に、二値分類では、標準的なブレグマン損失や多くの有限ベルヌーイ $f$ ダイバージェンス損失を含む端点単調な損失の下で、内部的な局所合成は相補性を達成できません。クロスエントロピーの下でのマルチクラス集約にも同様の障害が成り立ちます。要約すると、私たちのフレームワークは、相補性がマルチエージェント回帰では達成可能である一方、分類では局所的な集約と損失関数に関する自然な条件の下で妨げられることを示しています。

原文 (English)

Tree-Based Formalization of Multi-Agent Complementarity in Human-AI Interactions

Complementarity is the case in which a human--AI interaction (HAI) outperforms the best prediction benchmark available among its members. Although this idea is central in HAI research, formal work on complementarity remains limited. Existing frameworks do not model how agents' predictions compose into workflow-sensitive multi-agent protocols. We close this gap by introducing a tree-based formalization of complementarity in multi-agent HAI. An HAI protocol is represented by an ordered agent-role configuration together with a rooted planar binary tree whose leaves are decorated by prediction vectors. A local binary composition rule is evaluated recursively along the tree, yielding a tree-relative complementarity functional relative to a pointwise-min oracle benchmark. We prove four results. First, selector-based HAIs, including self- or AI-reliance, cannot achieve complementarity regardless of task, loss, or prediction quality. Second, in regression under squared loss, complementarity is equivalent to Euclidean distance minimization from the ground-truth vector; for $N=2$, the optimal linear-pooling weight has a closed form and a residual-correction interpretation. Third, under linear local composition, every protocol tree defines a barycentric coordinate chart on the simplex of leaf weights; Tamari-cover reparameterizations of protocol trees preserve complementarity, and for $N=4$, they satisfy the pentagon identity. Fourth, in binary classification, no internal local composition can achieve complementarity under endpoint-monotone losses, including standard Bregman and many finite Bernoulli $f$-divergence losses; an analogous obstruction holds for multiclass aggregation under cross-entropy. In summary, our framework shows that complementarity is attainable in multi-agent regression, but obstructed in classification under natural conditions on local aggregation and loss functions.
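
For the $N=2$ regression case, the textbook least-squares weight for linear pooling is $w^* = \langle y - p_2,\, p_1 - p_2\rangle / \lVert p_1 - p_2\rVert^2$, so the pooled prediction $p_2 + w^*(p_1 - p_2)$ corrects $p_2$ along the direction of disagreement. The sketch below checks this numerically against a per-coordinate minimum of the two agents' squared errors. The paper's exact definitions of the benchmark and the complementarity functional may differ; the vectors and function names are illustrative assumptions.

```python
from typing import List

Vector = List[float]


def squared_loss(y: Vector, p: Vector) -> float:
    return sum((t - q) ** 2 for t, q in zip(y, p))


def optimal_weight(y: Vector, p1: Vector, p2: Vector) -> float:
    """Least-squares weight w for the pooled prediction p2 + w * (p1 - p2)."""
    d = [a - b for a, b in zip(p1, p2)]  # direction of disagreement
    r = [t - b for t, b in zip(y, p2)]   # residual of agent 2
    denom = sum(x * x for x in d)
    if denom == 0.0:
        raise ValueError("identical predictions: the weight is not identifiable")
    return sum(a * b for a, b in zip(r, d)) / denom


def pool(w: float, p1: Vector, p2: Vector) -> Vector:
    return [b + w * (a - b) for a, b in zip(p1, p2)]


def pointwise_min_benchmark(y: Vector, p1: Vector, p2: Vector) -> float:
    """Per-coordinate best of the two agents (an assumed reading of 'pointwise-min oracle')."""
    return sum(min((t - a) ** 2, (t - b) ** 2) for t, a, b in zip(y, p1, p2))


y = [1.0, 2.0, 3.0]    # ground truth (illustrative)
p1 = [1.5, 2.0, 3.5]   # agent 1 overshoots
p2 = [0.0, 1.5, 2.0]   # agent 2 undershoots

w_star = optimal_weight(y, p1, p2)
pooled_loss = squared_loss(y, pool(w_star, p1, p2))
oracle_loss = pointwise_min_benchmark(y, p1, p2)
print(w_star, pooled_loss, oracle_loss)
```

Because the two agents err on opposite sides of the truth here, the pooled prediction beats even the per-coordinate selector, which is the kind of gain a pure selector cannot produce.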

2026-06-05 13:00 JST arXiv cs.AI エージェント

AIP: エージェントのスキルを学習および管理するためのグラフ表現

現在のエージェントスキルは主に自由形式の散文で構成されており、エージェントはセッションのたびにそれを読み、解釈し、どう行動するかを導出し直す必要があります。これは 2 つの複合的なコストを生みます。1 つは実装の比重が大きいタスクでの信頼性の低下、もう 1 つはスキルの作成と改善の難しさです。散文の編集は人間にもエージェントにも難しい脆弱な作業であり、モデルの学習データに十分含まれていないドメイン固有の手続き的知識では特にそうです。Agent Instruction Protocol (AIP) は、スキルを有向実行グラフとしてモデル化することで両方に対処します。つまり、個々のステップを決定論的なスクリプトまたは自然言語の記述に裏付けられたノードとし、それらを明示的に型付けされた入出力エッジで接続し、スキーマ検証された YAML 仕様で管理します。コンパイラのメタスキルが、人間が書いた既存のスキルをこの形式に変換します。利点は 2 つあります。第 1 に、人間が書いたスキルを AIP にコンパイルすると、SkillsBench の 27 個の実際のエージェントタスク全体で、Claude Sonnet の平均タスク報酬が 0.60 から 0.71 に、合格率が 53% から 67% に上昇しました。これは統計的に有意な向上 (Wilcoxon の符号付き順位検定で p = 0.011) であり、12 タスクで上回り、2 タスクで下回り、13 タスクで同点でした。多くの場合、実行時間も短くなりました。グラフは、コード、コマンド、ツール呼び出しを自然言語から導出し直すようエージェントに求めるのではなく、検証済みで実行可能なユニットをエージェントに提供します。第 2 に、作成と改善については、各スキルがスキーマ検証され、機能テストが可能で、ノード単位でアドレス指定できるため、障害を正確に診断して修復できます。作成したスキルの 2 件の失敗はスクリプトレベルまで追跡できました。AIP 仕様を調整して再コンパイルすると、どちらもリグレッションなしで回復し (1 つのタスクは 0/5 から 5/5 に改善)、スキルの改善は散文の書き直しではなく、測定可能なチューニングループになりました。同じグラフ構造は、コーパスレベルのガバナンスとスキルのイントロスペクションを支え、スキルに対する強化学習のための自然な行動空間を提供します。

原文 (English)

AIP: A Graph Representation for Learning and Governing Agent Skills

Agent Skills today consist largely of free-form prose requiring the agent to read, interpret, and re-derive how to act in every session. This imposes two compounding costs: reduced reliability on implementation-heavy tasks, and difficulty in skill creation and improvement, since editing prose is a fragile process that both humans and agents struggle with, particularly for domain-specific procedural knowledge underrepresented in model training. The Agent Instruction Protocol (AIP) addresses both by modeling a skill as a directed execution graph: discrete steps as nodes backed by deterministic scripts or natural-language descriptions, connected by explicit typed input/output edges, and governed by a schema-validated YAML specification. A compiler meta-skill translates existing human-written skills into this form. The benefits are twofold. First, compiling human-written skills to AIP raised Claude Sonnet's mean task reward from 0.60 to 0.71 and pass rate from 53% to 67% across 27 real agent tasks from SkillsBench - a statistically significant gain (Wilcoxon signed-rank p = 0.011), winning 12 tasks to 2 with 13 ties - often in less wall-clock time. The graph delivers vetted, runnable units to the agent rather than asking it to re-derive code, commands, and tool calls from natural language. Second, on creation and improvement, because each skill is schema-validated, functionally testable, and addressable node-by-node, failures can be diagnosed and repaired precisely. Two authored-skill failures were traced to the script level. After adjusting the AIP spec and recompiling, both recovered with zero regressions (one task going from 0/5 to 5/5), turning skill improvement into a measurable tuning loop rather than a prose rewrite. That same graph structure supports corpus-level governance and skill introspection, and provides a natural action space for reinforcement learning over skills.
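
The idea of a skill as a directed execution graph with typed edges can be sketched in a few lines. This is not the AIP schema: the `Node` fields, the `execute` helper, and the three-step example skill are hypothetical stand-ins, and a Python dictionary takes the place of the schema-validated YAML specification.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class Node:
    name: str
    # parameter name -> (upstream node name, expected type of that node's output)
    inputs: Dict[str, Tuple[str, type]]
    output_type: type
    run: Callable[..., Any]  # stands in for a deterministic script


def execute(nodes: Dict[str, Node]) -> Dict[str, Any]:
    """Run nodes in dependency order, checking the type carried by every edge."""
    deps = {n.name: {src for src, _ in n.inputs.values()} for n in nodes.values()}
    results: Dict[str, Any] = {}
    for name in TopologicalSorter(deps).static_order():
        node = nodes[name]
        kwargs = {}
        for param, (src, expected) in node.inputs.items():
            value = results[src]
            if not isinstance(value, expected):
                raise TypeError(f"edge {src}->{name}.{param} expected {expected.__name__}")
            kwargs[param] = value
        output = node.run(**kwargs)
        if not isinstance(output, node.output_type):
            raise TypeError(f"node {name} must return {node.output_type.__name__}")
        results[name] = output
    return results


skill = {
    "load": Node("load", {}, list, lambda: [3, 1, 2]),
    "sort": Node("sort", {"items": ("load", list)}, list, lambda items: sorted(items)),
    "report": Node("report", {"items": ("sort", list)}, str,
                   lambda items: ", ".join(map(str, items))),
}

results = execute(skill)
print(results["report"])  # 1, 2, 3
```

Because each node is addressable by name and its output is kept in `results`, a failing step can be located and replaced without touching the rest of the graph, which mirrors the node-by-node repair described in the abstract.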

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント

R-APS: 内省的敵対的パレート検索による制約付き設計のための構成推論とコンテキスト内メタ学習

大規模言語モデル (LLM) はオープンエンドなタスクを流暢にこなしますが、システムが計画を立て、ツールを使い、長期にわたって行動しなければならないエージェント的な設定では、流暢さは信頼できる成果を保証しません。私たちは、このギャップの原因を、結合した 3 つの構造的な失敗に求めます。すなわち、エラーが局所化されずに伝播すること、最悪ケースの摂動が評価されないこと、蓄積された知識が決して無効化されないことです。私たちは、これらが根本原因を共有していると主張します。つまり、アブダクション的推論、反事実的推論、メタ帰納的推論、修正的推論、帰納的推論が、共有されたコンテキストを相容れない方向に引っ張るのです。私たちは Reflective Adversarial Pareto Search (R-APS) を導入します。これは、私たちの知る限り、推論モードの分解によって 3 つの失敗すべてに同時に対処する最初の手法であり、各推論モードに独自のコンテキストを割り当て、3 つのタイムスケールにわたる相互作用を調整します。具体的には、型付き検証クリティックを伴う段階的な構成的推論 (失敗の局所化)、第一級のパレート目的としての感度誘導型の反事実ストレステスト (堅牢性)、および明示的な無効化を伴うメタ帰納的ルール抽出 (永続メモリ) です。R-APS はファインチューニングを必要とせず、構造化されたプロトコル設計のみによって、凍結された LLM 上で動作します。平面機構の合成 (ロボット工学、義肢、機械設計) で評価を行い、すべての候補を運動学ソルバーで検査します。32 個の目標軌道において、R-APS は、バー数と最悪ケースの堅牢性を同時に制御しながら、一様摂動ベースラインより 3.5 倍タイトな堅牢性証明書、初回受理までの反復回数の 46% 短縮、Enum+GA に対して 2.1 倍の Chamfer 距離の削減を達成します。推論に特化した小規模な 4B モデルは、このプロトコル内では汎用の 70B バックボーンに匹敵することが示されており、構造化プロトコルがモデル規模を部分的に補えることを示唆しています。

原文 (English)

R-APS: Compositional Reasoning and In-Context Meta-Learning for Constrained Design via Reflective Adversarial Pareto Search

Large language models (LLMs) are fluent on open-ended tasks, yet in agentic settings, where a system must plan, use tools, and act over extended horizons, fluency does not ensure reliable delivery. We trace this gap to three coupled structural failures: errors propagate without localization, worst-case perturbations go unevaluated, and accumulated knowledge is never invalidated. We argue these share a root cause: abductive, counterfactual, meta-inductive, corrective, and inductive reasoning pull a shared context in incompatible directions. We introduce Reflective Adversarial Pareto Search (R-APS), to our knowledge the first method addressing all three failures jointly via reasoning-mode decomposition, allocating each reasoning mode its own context and orchestrating interaction across three timescales: staged compositional reasoning with a typed validation critic (failure localization), sensitivity-guided counterfactual stress-testing as a first-class Pareto objective (robustness), and meta-inductive rule extraction with explicit invalidation (persistent memory). R-APS requires no fine-tuning and operates on a frozen LLM purely via structured protocol design. We evaluate on planar mechanism synthesis (robotics, prosthetics, mechanical design), with every candidate checked by a kinematic solver. On 32 target trajectories, R-APS delivers robustness certificates 3.5x tighter than uniform-perturbation baselines, 46% faster iterations-to-first-admission, and 2.1x Chamfer-distance reduction over Enum+GA while jointly controlling bar-count and worst-case robustness. Small 4B reasoning-specialized models prove competitive with general-purpose 70B backbones inside the protocol, suggesting structured protocols can partially offset model scale.
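
Two of the quantities named above, Chamfer distance and multi-objective (Pareto) selection, can be made concrete with a small sketch. Chamfer distance has several conventions; the version here averages unsquared nearest-neighbour distances in both directions, which may not match the paper's choice. The candidate names and objective values are invented for illustration.

```python
from math import dist
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def chamfer(a: List[Point], b: List[Point]) -> float:
    """Symmetric Chamfer distance: mean nearest-neighbour distance in both directions."""
    def one_way(src: List[Point], dst: List[Point]) -> float:
        return sum(min(dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)


def dominates(u: Tuple[float, ...], v: Tuple[float, ...]) -> bool:
    """u dominates v if it is no worse on every objective and better on at least one."""
    return all(x <= y for x, y in zip(u, v)) and any(x < y for x, y in zip(u, v))


def pareto_front(cands: Dict[str, Tuple[float, ...]]) -> Dict[str, Tuple[float, ...]]:
    return {name: obj for name, obj in cands.items()
            if not any(dominates(other, obj) for other in cands.values())}


target = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
traced = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
print(chamfer(target, traced))  # 2.0

# Objectives (all minimised): (Chamfer distance, bar count, worst-case error).
candidates = {
    "four_bar": (0.10, 4, 0.30),
    "six_bar": (0.05, 6, 0.20),
    "dominated": (0.20, 6, 0.40),
}
front = pareto_front(candidates)
print(sorted(front))  # ['four_bar', 'six_bar']
```

Treating worst-case error as its own objective, rather than folding it into a single score, keeps a simpler but less robust mechanism and a more complex but more robust one on the front together.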

2026-06-05 13:00 JST arXiv cs.AI エージェント

Strabo: エージェント相互作用プロトコルの宣言的仕様と実装

ここ数年で、宣言型対話プロトコルに基づいたマルチエージェントシステムのモデリングと実装が大きく進歩しました。私たちの貢献である Strabo は、これらの進歩と Agentic AI における現在進行中の業界の取り組みとの関連性を確立します。具体的には、AI エージェントの電子商取引インタラクションを標準化するための Google 主導の最近の取り組みである UCP (Universal Commerce Protocol) について検討します。私たちの演習は 2 つの部分に分かれています。 1 つは、チェックアウトを処理する UCP の部分を宣言型 Langshaw プロトコルとしてモデル化し、Langshaw のプログラミングモデルである Peach を使用してエージェントを実装することです。演習のこの部分では、正式な宣言的仕様の利点を引き出します。 2 つ目は、Peach エージェントが Google によって実装された UCP エージェントと相互運用できることを示し、それによって UCP に関するアプローチの忠実性を確立します。このような相互運用により、宣言型プロトコルとエージェントを従来の設定に段階的に導入することが可能になり、大規模な更新を必要とせずに EMAS のアイデアが実践に影響を与える可能性がある道筋が示されています。

原文 (English)

Strabo: Declarative Specification and Implementation of Agentic Interaction Protocols

The last few years have witnessed major advances in the modeling and implementation of multiagent systems based on declarative interaction protocols. Our contribution, Strabo, establishes the relevance of these advances to ongoing industry efforts in Agentic AI. Specifically, we consider UCP, the Universal Commerce Protocol, a recent Google-led effort to standardize e-commerce interactions for AI agents. Our exercise is in two parts. One, we model the part of UCP dealing with checkouts as a declarative Langshaw protocol and implement agents using Peach, a programming model for Langshaw. This part of the exercise brings out the advantages of formal, declarative specifications. Two, we show that Peach agents can interoperate with UCP agents implemented by Google, thereby establishing the fidelity of our approach with respect to UCP. Such interoperation enables the incremental introduction of declarative protocols and agents into a conventional setting, indicating a pathway by which EMAS ideas could influence practice without demanding a wholesale update.

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント

RUBAS: エージェントの安全のためのルーブリックベースの強化学習

LLM がツール対応エージェントに進化すると、単純なテキスト生成ではなく現実世界の実行に関連した新しいクラスの安全性の課題が生じます。既存の調整方法は、粗い拒否信号や静的な監視に依存することが多く、さまざまなエージェントのリスクにわたって安全性と有用なツールの実行のバランスをとることが困難です。エージェントの安全性のためのルーブリックベースの強化学習フレームワークである RUBAS を紹介します。 RUBAS は、エージェントの動作をツール使用の安全性、引数の安全性、応答の安全性、有用性の 4 つの次元に分解します。これらの構造化されたルーブリックは、エージェントの完全な軌跡にわたってきめ細かく解釈可能な報酬を提供し、タスクの完了を維持しながら安全なツールの使用を最適化する強化学習を可能にします。複数のエージェントの安全性ベンチマークとモデルにわたる広範な実験により、RUBAS が標準的なアライメントベースラインよりも安全性を向上させ、ツールに基づく幻覚を軽減し、競争力のある実用性を維持することが示されています。私たちの結果は、多次元ルーブリック報酬が、安全性が重要なツール使用環境において LLM エージェントを調整するための効果的なトレーニング信号を提供することを示唆しています。

原文 (English)

RUBAS: Rubric-Based Reinforcement Learning for Agent Safety

The evolution of LLMs into tool-enabled agents creates a new class of safety challenges associated with real-world execution rather than simple text generation. Existing alignment methods often rely on coarse refusal signals or static supervision, making it difficult to balance safety with useful tool execution across diverse agentic risks. We introduce RUBAS, a rubric-based reinforcement learning framework for agent safety. RUBAS decomposes agent behavior into four dimensions: tool-use safety, argument safety, response safety, and helpfulness. These structured rubrics provide fine-grained and interpretable rewards over complete agent trajectories, enabling reinforcement learning to optimize safe tool use while preserving task completion. Extensive experiments across multiple agent safety benchmarks and models show that RUBAS improves safety over standard alignment baselines, reduces tool-grounded hallucinations, and maintains competitive utility. Our results suggest that multi-dimensional rubric rewards provide an effective training signal for aligning LLM agents in safety-critical tool-use settings.
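
A four-dimensional rubric reward can be sketched as a weighted combination of per-dimension scores. The abstract does not say how RUBAS aggregates its rubrics, so the equal weights, the [0, 1] score range, and all names below are assumptions made only to show the shape of such a reward.

```python
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class RubricScores:
    tool_use_safety: float
    argument_safety: float
    response_safety: float
    helpfulness: float


# Illustrative equal weighting; the real weighting is not given in the abstract.
WEIGHTS: Dict[str, float] = {
    "tool_use_safety": 0.25,
    "argument_safety": 0.25,
    "response_safety": 0.25,
    "helpfulness": 0.25,
}


def trajectory_reward(scores: RubricScores, weights: Dict[str, float] = WEIGHTS) -> float:
    """Combine per-dimension rubric scores for one complete agent trajectory."""
    values = asdict(scores)
    for name, value in values.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return sum(weights[name] * values[name] for name in weights)


safe_partial = RubricScores(1.0, 1.0, 1.0, 0.5)    # safe, task half done
unsafe_complete = RubricScores(0.0, 1.0, 1.0, 1.0)  # task done via an unsafe tool call

print(trajectory_reward(safe_partial))     # 0.875
print(trajectory_reward(unsafe_complete))  # 0.75
```

Keeping the dimensions separate until the final sum is what makes the reward interpretable: a low total can be traced to the specific rubric that failed.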

2026-06-05 13:00 JST arXiv cs.AI エージェント

証拠を運ぶエージェントアクション: 異種エージェントシステムに対するモデルに依存しないランタイムガバナンス

エージェントシステムは、ローカルコーディングツール、フレームワーク SDK、マネージドエージェントプラットフォーム、API ゲートウェイ、オブザーバーのみの統合など、非常に異なる制御ポイントを備えたランタイムを通じて実行されます。したがって、データを外部に公開するなどの高リスクのアクションは、あるランタイムではシェルコマンドとして、別のランタイムではツール呼び出しとして、そして 3 番目のランタイムではホストされたセッションの移行として現れる可能性があります。このため、ガバナンスの基本的な質問に一貫して答えることが困難になります。つまり、どのようなアクションが、誰の権限の下で、どのような承認セマンティクスに基づいて、実行後にどのような証拠によって承認されたのかということです。このペーパーでは、ベンダーネイティブのセッションレコードではなくアクション証明書を中心としたランタイム中立のガバナンスモデルである Proof-Carrying Agent Actions (PCAA) について説明します。 PCAA は、アクション前の許容性、アクションのオープン、仮定のキャプチャ、承認、結果のクローズという 5 つのチェックポイントを中心にコントロールを組織します。これらのチェックポイントは、ポータブルなアクションエンベロープ、実行時および承認のレシート、および再生可能なプルーフにバインドされます。このモデルは 2 つの実用的な方法で拡張されています。証明書は外部性を認識しており、宛先の可視性やアカウントの来歴などの境界事実を伝えます。もう 1 つは、承認は、単一のレビュー済みまたは未レビューのビットではなく、明示的な強制力クラスによって記述されます。私たちは、異種エージェントコントロールプレーンでの参照実装と開示制限のある評価プロトコルを通じてモデルを研究します。 4 つのランタイムファミリにわたる 24 個の実行可能シードから 96 個のトレースに拡張された保護されたベンチマークでは、PCAA はルートの品質を維持しながら、アブレーション下で個別の障害モードを明らかにします。この論文は、証明書を保持するアクションに関するランタイムガバナンスのシステム定式化と、その定式化がベンダー固有のコントロールサーフェスに崩壊することなくランタイムチャーンの下で移植性を維持できる方法についての実装に基づいた説明に貢献します。

原文 (English)

Proof-Carrying Agent Actions: Model-Agnostic Runtime Governance for Heterogeneous Agent Systems

Agent systems execute through runtimes with very different control points: local coding tools, framework SDKs, managed agent platforms, API gateways, and observer-only integrations. A high-risk action such as publishing data externally may therefore appear as a shell command in one runtime, a tool call in another, and a hosted session transition in a third. This makes it difficult to answer a basic governance question consistently: what action was authorized, under whose authority, with what approval semantics, and with what evidence after execution? This paper presents Proof-Carrying Agent Actions (PCAA), a runtime-neutral governance model centered on an action certificate rather than on a vendor-native session record. PCAA organizes control around five checkpoints: pre-action admissibility, action open, assumption capture, approval, and outcome closure. It binds these checkpoints to a portable action envelope, runtime and approval receipts, and replay-ready proof. The model is extended in two practical ways: the certificate is externality-aware, carrying boundary facts such as destination visibility and account provenance, and approval is described by explicit enforceability classes rather than by a single reviewed or unreviewed bit. We study the model through a reference implementation in a heterogeneous agent control plane and a disclosure-bounded evaluation protocol. On a protected benchmark expanded from 24 executable seeds to 96 traces across four runtime families, PCAA preserves route quality while exposing distinct failure modes under ablation. The paper contributes a systems formulation of runtime governance around certificate-bearing actions and an implementation-grounded account of how that formulation can remain portable under runtime churn without collapsing into vendor-specific control surfaces.
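
The five checkpoints and the idea of binding them to one portable certificate can be sketched as an ordered, hash-chained record. The field names, the envelope contents, and the SHA-256 chaining are assumptions for illustration; the abstract does not specify PCAA's certificate format.

```python
import hashlib
import json
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, List


class Checkpoint(IntEnum):
    PRE_ACTION_ADMISSIBILITY = 0
    ACTION_OPEN = 1
    ASSUMPTION_CAPTURE = 2
    APPROVAL = 3
    OUTCOME_CLOSURE = 4


@dataclass
class ActionCertificate:
    envelope: Dict[str, str]  # runtime-neutral description of the action
    receipts: List[Dict[str, str]] = field(default_factory=list)

    @property
    def is_closed(self) -> bool:
        return len(self.receipts) == len(Checkpoint)

    def record(self, checkpoint: Checkpoint, evidence: str) -> None:
        """Append a receipt; checkpoints must arrive in order, each chained to the last."""
        if self.is_closed:
            raise ValueError("certificate is already closed")
        expected = Checkpoint(len(self.receipts))
        if checkpoint != expected:
            raise ValueError(f"expected {expected.name}, got {checkpoint.name}")
        previous = self.receipts[-1]["digest"] if self.receipts else ""
        payload = json.dumps(
            {"envelope": self.envelope, "checkpoint": checkpoint.name,
             "evidence": evidence, "previous": previous},
            sort_keys=True,
        )
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        self.receipts.append(
            {"checkpoint": checkpoint.name, "evidence": evidence, "digest": digest})


cert = ActionCertificate(
    {"action": "publish_dataset", "runtime": "shell", "destination_visibility": "public"})
for cp in Checkpoint:
    cert.record(cp, f"evidence for {cp.name}")
print(cert.is_closed)  # True
```

The same certificate shape works whether the action surfaced as a shell command, a tool call, or a hosted session transition, since only the `envelope` contents change.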

2026-06-05 13:00 JST arXiv cs.AI エージェント ロボティクス

AgenticDiffusion: ビジョンベースの UAV ナビゲーションのための Agentic Diffusion ベースの経路計画

屋内 UAV ナビゲーションには、限られた視野の観察下での効率的な探索、シーンの理解、信頼性の高い軌道の実行が必要です。既存のビジョンベースのナビゲーションフレームワークは通常、単一ビューの観察に依存しており、オクルージョン、ターゲットの可視性、およびグローバルシーン構造について推論する能力が制限されています。この研究では、統合された航空ナビゲーションパイプライン内で、言語に基づく推論、オープン語彙によるターゲットのグラウンディング、視覚ベースの拡散計画、および NMPC を調整するマルチビュー UAV ナビゲーションフレームワークである AgenticDiffusion を提案します。自然言語による指示と、同期した一人称視点 (FPV) および上面視点の観察を考慮して、フレームワークはナビゲーションに最も有益な視点を決定し、軌道の実行前にミッション計画を生成します。ターゲットは、オープンボキャブラリーグラウンディングモデルを使用して位置特定され、その後、視点固有の拡散プランナーが UAV 実行のためのナビゲーション軌道を生成します。提案されたフレームワークは、補完的な視点を使用して、繰り返しのターゲット探査を削減し、雑然とした屋内環境でのナビゲーション効率を向上させます。このフレームワークは、適応視点選択、多段階ミッション実行、長距離ナビゲーション、安全な着陸地点選択を含む 4 つの現実世界の UAV ナビゲーションシナリオで検証されました。実験結果では、40 回の実世界試験でミッション全体の成功率が 80% であることが実証され、一方、拡散計画者は軌道生成の成功率が 100% に達しました。

原文 (English)

AgenticDiffusion: Agentic Diffusion-based Path Planning for Vision-Based UAV Navigation

Indoor UAV navigation requires efficient exploration, scene understanding, and reliable trajectory execution under limited field-of-view observations. Existing vision-based navigation frameworks typically rely on single-view observations, limiting their ability to reason about occlusions, target visibility, and global scene structure. In this work, we propose AgenticDiffusion, a multi-view UAV navigation framework that coordinates language-guided reasoning, open-vocabulary target grounding, vision-based diffusion planning, and NMPC within a unified aerial navigation pipeline. Given a natural language instruction and synchronized first-person-view (FPV) and top-view observations, the framework determines the most informative viewpoint for navigation and generates a mission plan prior to trajectory execution. The targets are localized using an open-vocabulary grounding model, after which viewpoint-specific diffusion planners generate navigation trajectories for UAV execution. Using complementary viewpoints, the proposed framework reduces repeated target exploration and improves navigation efficiency in cluttered indoor environments. The framework was validated in four real-world UAV navigation scenarios involving adaptive viewpoint selection, multi-stage mission execution, long-horizon navigation, and safe landing-site selection. The experimental results demonstrated an overall mission success rate of 80% in 40 real-world trials, while the diffusion planners achieved a trajectory generation success rate of 100%.

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント

SaliMory: 会話エージェントの認知記憶を調整する

生涯の伴侶として機能する会話エージェントは、すべての対話にわたって永続的な記憶を維持する必要があります。しかし、検索結果をそのまま加えてコンテキストウィンドウを単純に拡張すると推論の品質が低下し、標準的な強化学習で記憶エージェントを訓練すると、多段階パイプラインにおいて深刻な信用割り当てのボトルネックが生じます。これを解決するために、ユーザーの事実、好み、作業記憶にまたがる認知的に構造化された記憶を管理するよう単一の言語モデルを訓練するフレームワーク SALIMORY を紹介します。SALIMORY は、階層的な段階別プロセス報酬と、報酬分解に基づく対照的洗練を導入することで、個々の記憶操作 (選択的フィルタリング、統合、手がかり駆動の想起) に対してエンドツーエンドで切り分けられた教師信号を与えます。SALIMORY は、記憶に起因する失敗を 3 分の 1 削減し、エンドツーエンドの精度で最先端手法を 10% 以上上回り、Good Personalization 率を 2 倍以上に高めます。

原文 (English)

SaliMory: Orchestrating Cognitive Memory for Conversational Agents

Conversational agents that serve as lifelong companions must maintain persistent memory across all interactions. However, simply expanding context windows with raw retrieval degrades reasoning quality, while training memory agents via standard reinforcement learning creates a severe credit assignment bottleneck in a multi-stage pipeline. To solve this, we introduce SALIMORY, a framework that trains a single language model to manage a cognitively-structured memory spanning user facts, preferences, and working memory. By introducing a hierarchical stage-wise process reward and reward-decomposed contrastive refinement, SALIMORY provides isolated supervision for distinct memory operations (selective filtering, consolidation, and cue-driven recall) end-to-end. SALIMORY cuts memory-attributed failures by one-third, outperforms the state-of-the-art by over 10% in end-to-end accuracy, and more than doubles the Good Personalization rate.

2026-06-05 13:00 JST arXiv cs.AI エージェント ロボティクス

大規模言語モデルによる適応軌道最適化のためのセマンティック制約合成

軌道の最適化は、宇宙探査において安全で信頼性の高い自律運用を可能にするための重要なコンポーネントです。宇宙ミッションの頻度、複雑さ、範囲が増加するにつれて、ミッションの目的と運用上の制約を正確に反映する、数学的に適切な軌道最適化問題を迅速に定式化する必要性が高まっています。ただし、ミッションの意図を軌道最適化のための扱いやすい分析公式に変換するには、かなりの専門知識が必要です。この論文では、大規模言語モデル (LLM) を活用して、ミッションの要件と制約の自然言語記述を実行可能な軌道最適化コードと対応する数学的定式化に変換するフレームワークを紹介します。宇宙船ランデブーシナリオでの実験では、意味論的なミッション要件から凸軌道最適化問題を再調整する際の高い成功率が実証されています。最終的に、この研究は、高レベルの意図と形式的な最適化モデルを橋渡しする LLM の可能性を強調し、宇宙船のより柔軟で効率的な軌道設計を可能にします。

原文 (English)

Semantic Constraint Synthesis for Adaptive Trajectory Optimization via Large Language Models

Trajectory optimization is a critical component for enabling safe and reliable autonomous operations in space exploration. As space missions increase in frequency, complexity, and scope, there is a growing need to rapidly formulate mathematically sound trajectory optimization problems that accurately reflect mission objectives and operational constraints. However, translating mission intent into tractable analytical formulations for trajectory optimization requires substantial domain expertise. This paper presents a framework that leverages large language models (LLMs) to translate natural language descriptions of mission requirements and constraints into executable trajectory optimization code and corresponding mathematical formulations. Experiments in spacecraft rendezvous scenarios demonstrate a high success rate in reconditioning a convex trajectory optimization problem from semantic mission requirements. Ultimately, this work highlights the potential of LLMs to bridge high-level intent and formal optimization models, enabling more flexible and efficient trajectory design of spacecraft.

2026-06-05 13:00 JST arXiv cs.AI エージェント 研究/論文

HighTide: エージェントが厳選したオープンソース VLSI ベンチマークスイート

進化する AI 支援ベンチマークスイートである HighTide を紹介します。具体的には、(i) 複数の設計言語とテクノロジノードにまたがる多様なオープンソーススイート、(ii) リモートキャッシュを備えた Bazel ベースの増分 RTL から GDS へのコンパイル、(iii) スイート全体の調整理論的根拠の長期記憶として機能する設計ごとの意思決定ログに裏付けられた、設計ライフサイクル、フロー最適化、ツールリファレンス、メタメンテナンスをカバーする 12 のエージェントスキルによる AI 支援の設計キュレーション、および(iv) 安定リリース用の RTL コンパイル検証を備えたインフラストラクチャ。このスイートは一般公開されており、オープンソースのハードウェアエコシステムとともに成長するように設計されています。

原文 (English)

HighTide: An Agent-Curated Open-Source VLSI Benchmark Suite

We introduce HighTide, an evolving AI-assisted benchmark suite. Specifically, the contributions are: (i) a diverse open-source suite spanning multiple design languages and technology nodes, (ii) Bazel-based incremental RTL-to-GDS compilation with remote caching, (iii) AI-assisted design curation through twelve agent skills covering the design lifecycle, flow optimization, tool reference, and meta-maintenance, backed by per-design decision logs that serve as long-term memory of tuning rationale across the suite, and (iv) an infrastructure with RTL compilation verification for stable releases. The suite is publicly available and designed to grow with the open-source hardware ecosystem.

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント ハードウェア/半導体

Caught in the Act(ivation): LLM エージェントによる資格情報漏洩の事前出力およびマルチターン検出に向けて

LLM エージェントは多くの場合、機密性の高い資格情報を、信頼できない取得コンテンツと同じコンテキストウィンドウに置くため、間接プロンプトインジェクションが資格情報の流出を引き起こす直接的な経路が生まれます。私たちは、3 つの相補的な防御を通じてこの障害モードを研究します。第 1 に、出力トークンが生成される前に、アクティベーションプローブで資格情報へのアクセスを検出できるかどうかを問います。第 2 に、形式固有の文字モデルからハニートークンを構築し、分割共形予測 (split conformal prediction) で検出を較正します。第 3 に、複数ターンにわたる流出を累積的な情報フロー問題として扱い、会話ターン全体での推定漏洩バジェットを追跡します。オープンウェイトモデルでの制御された実験では、アクティベーション特徴量は、ホールドアウトしたエンコーディング変換の下でも、無害なプロンプトと資格情報を狙うプロンプトを高精度で分離します。小規模な合成マルチターンスイートでは、累積的な集計により、ターン単位の検出器が見逃した攻撃が検出されます。これらの結果は予備的なものです。マルチターンベンチマークは社内製で小規模であり、アクティベーション手法にはホワイトボックスアクセスが必要であり、情報推定器が与えるのは形式的な上限ではなく実用的なシグナルです。それでもこの結果は、資格情報流出への防御が、テキストレベルの出力フィルターだけに頼るのではなく、出力前の監視、較正されたカナリア検出、時間的な漏洩の集計を組み合わせるべきであることを示唆しています。

原文 (English)

Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents

LLM agents often place sensitive credentials in the same context window as untrusted retrieved content, creating a direct path for indirect prompt injection to induce credential exfiltration. We study this failure mode through three complementary defenses. First, we ask whether activation probes can detect credential access before output tokens are emitted. Second, we construct honeytokens from format-specific character models and calibrate detection with split conformal prediction. Third, we treat multi-turn exfiltration as a cumulative information-flow problem and track an estimated leakage budget across conversation turns. In controlled experiments on open-weight models, activation features separate benign and credential-seeking prompts with high accuracy, including under held-out encoding transformations. In a small synthetic multi-turn suite, cumulative accounting detects attacks that per-turn detectors miss. These results are preliminary: the multi-turn benchmark is in-house and small, the activation method requires white-box access, and the information estimator provides a practical signal rather than a formal upper bound. Still, the results suggest that credential-exfiltration defenses should combine pre-output monitoring, calibrated canary detection, and temporal leakage accounting rather than relying only on text-level output filters.
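
Two of the three defenses lend themselves to a compact sketch: calibrating a detection threshold with split conformal prediction, and accumulating an estimated leakage budget across turns. The scores, the bit estimates, and the budget below are made-up numbers, and the paper's detectors and estimators are not reproduced here; only the standard split-conformal quantile rule is used.

```python
import math
from typing import Iterable, List, Optional


def conformal_threshold(benign_scores: List[float], alpha: float) -> float:
    """Split-conformal threshold from held-out benign calibration scores.

    If calibration and test scores are exchangeable, flagging scores strictly
    above the returned value has a false-alarm rate of at most alpha.
    """
    n = len(benign_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the calibration quantile
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(benign_scores)[k - 1]


def first_alarm_turn(per_turn_leak_bits: Iterable[float],
                     budget_bits: float) -> Optional[int]:
    """Return the 1-based turn at which cumulative estimated leakage exceeds the budget."""
    total = 0.0
    for turn, bits in enumerate(per_turn_leak_bits, start=1):
        total += bits
        if total > budget_bits:
            return turn
    return None


calibration = [float(i) for i in range(1, 20)]  # 19 hypothetical benign scores
threshold = conformal_threshold(calibration, alpha=0.25)
print(threshold)  # 15.0

# Each turn leaks too little to trip a per-turn limit of, say, 5 bits,
# but the running total crosses a 10-bit budget on the third turn.
print(first_alarm_turn([4.0, 4.0, 4.0, 4.0], budget_bits=10.0))  # 3
```

The second function shows why cumulative accounting catches attacks that per-turn detectors miss: no single turn is anomalous, only the sum is.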

2026-06-05 13:00 JST arXiv cs.AI エージェント

公証されたエージェント: AI エージェントのアクションに対する受信者が証明した機密受領書

現在の AI エージェントの可観測性には構造的な欠陥があります。アクティビティログを生成する主体が、そのアクティビティを記録される主体と同じだからです。侵害されたエージェントやバグのあるエージェントは、自身のトレースを省略、改変、捏造することができ、エージェントを実行するオペレーターには改ざんを独立に検出する手段がありません。私たちは、信頼境界を反転させることでこれを解決するプロトコルのクラスを提案します。エージェントの呼び出しを受け取るサービスが、観測した内容の受領書に自身の鍵で署名し、その受領書をエージェントの所有者宛てに暗号化し、公開の透明性ログに公開します。所有者は、エージェントもそのオペレーターも信頼することなく、改ざんを検知可能な証跡を再構築します。私たちはこのクラスを Sello として具体化します。これは、現行のどのシステムにもない 4 つの性質を組み合わせたプロトコルです。(P1) 受信者側の署名、(P2) JWS を介して認可トークンにバインドされた所有者公開鍵への HPKE 暗号化、(P3) 証人が共同署名するマークルログへの公開、(P4) トークン参照による所有者側の発見です。私たちはプロトコルを説明し、エージェントとそのオペレーターを制御する攻撃者の下でのセキュリティを分析し、暗号操作のマイクロベンチマークを示し、関連する受領書プロトコルの研究 (Signet、AgentROA、Agent Passport System、draft-farley-acta、SCITT) の中に Sello を位置づけます。また、抑制攻撃、サービスの共謀、採用インセンティブの問題といった既知の制限についても議論します。

原文 (English)

Notarized Agents: Receiver-Attested Confidential Receipts for AI Agent Actions

Current AI agent observability is structurally compromised: the entity producing the activity log is the same entity whose activity is being logged. A compromised or buggy agent can omit, alter, or fabricate its own traces, and the operator running the agent has no independent way to detect tampering. We propose a class of protocols that resolves this by inverting the trust boundary: the service that receives an agent's call signs a receipt of what it observed using its own key, encrypts the receipt to the agent's owner, and publishes it to a public transparency log. The owner reconstructs a tamper-evident trail without trusting the agent or its operator. We instantiate the class as Sello, a protocol combining four properties absent in any current system: (P1) receiver-side signing, (P2) HPKE encryption to an owner public key bound to the authorization token via JWS, (P3) publication to a witness-cosigned Merkle log, and (P4) owner-side discovery by token reference. We describe the protocol, analyze its security under an adversary that controls the agent and its operator, present microbenchmarks of the cryptographic operations, and situate Sello among adjacent receipt-protocol work (Signet, AgentROA, Agent Passport System, draft-farley-acta, SCITT). We discuss known limitations including the suppression attack, service collusion, and the adoption-incentive problem.
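
Property (P3), publication to a Merkle log, is what makes omissions and edits detectable after the fact. The sketch below covers only that piece: it computes a Merkle root over receipt bytes, using the leaf and node prefixes (0x00 and 0x01) familiar from RFC 6962. Signing, HPKE encryption, JWS binding, and witness cosigning are deliberately left out, and the receipt contents are invented; none of this is Sello's actual wire format.

```python
import hashlib
from typing import List


def leaf_hash(data: bytes) -> bytes:
    # Domain separation so a leaf can never be confused with an interior node.
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(leaves: List[bytes]) -> bytes:
    """Root hash over an ordered, non-empty list of receipts."""
    if not leaves:
        raise ValueError("log must contain at least one receipt")
    level = [leaf_hash(leaf) for leaf in leaves]
    while len(level) > 1:
        paired = [node_hash(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            paired.append(level[-1])  # an unpaired hash moves up unchanged
        level = paired
    return level[0]


# Hypothetical receipts; in the described protocol these would be signed by the
# receiving service and encrypted to the agent owner before publication.
receipts = [
    b'{"token_ref": "t-1", "call": "GET /invoices"}',
    b'{"token_ref": "t-1", "call": "POST /payments"}',
    b'{"token_ref": "t-1", "call": "GET /balance"}',
]

published_root = merkle_root(receipts)
tampered_root = merkle_root([receipts[0], receipts[2]])  # one receipt suppressed
print(published_root.hex() != tampered_root.hex())  # True
```

An owner who holds the published root can tell that a log with a dropped or altered receipt is not the log that was committed to, without relying on the agent's own trace.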

2026-06-05 13:00 JST arXiv cs.AI 画像/動画生成 エージェント

StandardE2E: エンドツーエンドの自動運転データセットのための統合フレームワーク

自動運転は、認識・予測・計画のモジュール式スタックから、センサー入力を車両制御に直接マッピングするエンドツーエンド (E2E) モデルへと移行しており、多くの場合、3D 検出、動き予測、HD マップ認識などの補助タスクによって正則化されます。進歩を牽引しているのは、急速に拡大するセンサーリッチな運転データセットのエコシステムですが、各データセットは独自のファイル形式、API、座標規約、モダリティの範囲を持つため、データセット横断の実験はもちろん、データセットごとの基本的な前処理さえもプロジェクトごとに再実装する必要があります。私たちは、E2E 運転データセットに対して単一の統一インターフェイスを提供するフレームワーク StandardE2E を紹介します。StandardE2E は、(i) 1 つの共有データスキーマの下でデータセットごとの前処理を標準化し、(ii) 複数のデータセットを単一の PyTorch DataLoader にまとめて、データセット横断の事前学習、補助タスクによる教師あり学習、シナリオレベルのフィルタリングを可能にし、(iii) 新しいデータセットの追加を、生のフレームから正規スキーマへのデータセットごとのマッピング 1 つに絞り、下流のパイプライン全体を変更せずに済むようにします。このフレームワークは、Waymo End-to-End、Waymo Perception、Argoverse 2 Sensor、Argoverse 2 LiDAR、NAVSIM (OpenScene-v1.1)、WayveScenes101 の 6 つのデータセットを標準でサポートしており、オープンソースの Python パッケージ standard-e2e としてリリースされ、https://github.com/stepankonev/StandardE2E で入手できます。

原文 (English)

StandardE2E: A Unified Framework for End-to-End Autonomous Driving Datasets

Autonomous driving has shifted from modular perception-prediction-planning stacks toward end-to-end (E2E) models that map sensor inputs directly to vehicle control, often regularized by auxiliary tasks such as 3D detection, motion forecasting, and HD-map perception. Progress is driven by a fast-growing ecosystem of sensor-rich driving datasets, yet each ships its own file formats, APIs, coordinate conventions, and modality coverage, leaving cross-dataset experimentation and even basic per-dataset preprocessing to be re-implemented per project. We present StandardE2E, a framework that provides a single unified interface over E2E driving datasets. StandardE2E (i) standardizes per-dataset preprocessing under one shared data schema; (ii) combines multiple datasets in a single PyTorch DataLoader for cross-dataset pretraining, auxiliary-task supervision, and scenario-level filtering; and (iii) reduces adding a new dataset to a single per-dataset mapping from raw frames to the canonical schema, leaving the entire downstream pipeline unchanged. The framework supports six datasets out of the box: Waymo End-to-End, Waymo Perception, Argoverse 2 Sensor, Argoverse 2 LiDAR, NAVSIM (OpenScene-v1.1), and WayveScenes101, and is released as the open-source standard-e2e Python package, available at https://github.com/stepankonev/StandardE2E.
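
The design in point (iii), one mapping per dataset into a shared schema, is essentially an adapter pattern. The sketch below shows that pattern in plain Python. It does not use or imitate the standard-e2e package's API: the `CanonicalFrame` fields, the two raw formats, and every function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class CanonicalFrame:
    dataset: str
    timestamp_s: float
    ego_xy_m: Tuple[float, float]


Adapter = Callable[[dict], CanonicalFrame]


def from_dataset_a(raw: dict) -> CanonicalFrame:
    # Hypothetical format A: microsecond timestamps, pose stored as a dict in metres.
    return CanonicalFrame("a", raw["ts_us"] / 1e6, (raw["pose"]["x"], raw["pose"]["y"]))


def from_dataset_b(raw: dict) -> CanonicalFrame:
    # Hypothetical format B: second timestamps, ego position as a list in centimetres.
    x_cm, y_cm = raw["ego_cm"]
    return CanonicalFrame("b", raw["time"], (x_cm / 100.0, y_cm / 100.0))


def unified_frames(
    sources: Dict[str, Tuple[Adapter, Iterable[dict]]]
) -> Iterator[CanonicalFrame]:
    """Yield frames from every dataset in the shared schema; downstream code sees one type."""
    for adapter, frames in sources.values():
        for raw in frames:
            yield adapter(raw)


sources = {
    "a": (from_dataset_a, [{"ts_us": 2_000_000, "pose": {"x": 1.0, "y": 2.0}}]),
    "b": (from_dataset_b, [{"time": 3.5, "ego_cm": [150.0, 250.0]}]),
}
frames = list(unified_frames(sources))
print(frames)
```

Adding a third dataset means writing one more adapter function and one more entry in `sources`; nothing that consumes `CanonicalFrame` has to change.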

2026-06-05 13:00 JST arXiv cs.AI LLM/生成AI エージェント

信頼できない入力から信頼できるメモリへ: LLM エージェントにおけるメモリポイズニング攻撃の系統的研究

メモリは AI エージェントの中核コンポーネントであり、AI エージェントがインタラクションを通じて知識を蓄積し、パフォーマンスを向上させることができます。ただし、永続メモリにはメモリポイズニングのリスクが伴います。メモリポイズニングの場合、敵対的な 1 回のメモリ書き込みがエージェントの動作に長期的な影響を与える可能性があります。我々は、LLM ベースのエージェントにおけるメモリポイズニングの体系的な研究を紹介します。 4 つのメモリ書き込みチャネルと、これらのチャネルを悪用可能にするモデル機能、システムプロンプト設計、およびエージェントシステムアーキテクチャにおける 9 つの構造的脆弱性を特定しました。これらの脆弱性に基づいて、メモリポイズニング攻撃の 6 つのクラスの分類を作成します。さらに、メモリポイズニング攻撃を評価するためのベンチマークである MPBench を設計し、より積極的にメモリの書き込みと取得を行うように設計されたエージェントが悪用されやすいことを示します。また、既存のプロンプトインジェクション防御ではメモリポイズニング攻撃をカバーできないことも示します。私たちの調査結果は、AI エージェントに対するメモリポイズニング攻撃を理解し、軽減するための基盤を提供します。

原文 (English)

From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents

Memory is a core component of AI agents, enabling them to accumulate knowledge across interactions and improve performance. However, persistent memory introduces the risk of memory poisoning, where a single adversarial memory write can exert long-term influence over agent behavior. We present a systematic study of memory poisoning in LLM-based agents. We identify four memory write channels and nine structural vulnerabilities in model capabilities, system prompt design, and agent system architecture that make these channels exploitable. Based on these vulnerabilities, we develop a taxonomy of six classes of memory poisoning attacks. Furthermore, we design MPBench -- a benchmark for evaluating memory poisoning attacks, and show that agents designed to write and retrieve memory more aggressively are more exploitable. We also show that existing prompt injection defenses fail to cover memory poisoning attacks. Our findings provide a foundation for understanding and mitigating memory poisoning attacks against AI agents.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems

Modern agentic systems transform LLMs from session-bounded assistants into stateful systems that persist and evolve shared world state across sessions through memories, filesystems, tools, and other long-lived contextual artifacts. This shift fundamentally expands the attack surface of prompt injection. However, prior works on prompt injection have largely focused on model-level threats within a single session, overlooking how cross-session persistent system state fundamentally changes the system-level risk of agentic systems. Inspired by stored cross-site scripting in web systems, we introduce cross-session stored prompt injection, where a successful injection can persist within agentic system state and silently influence future executions long after the original attacker interaction has ended. To systematically study this threat, we formalize stored prompt injection and develop a taxonomy of how adversarial content persists and affects agentic systems across sessions. We further develop a benchmark and sandbox toolkit to evaluate the risks of stored prompt injection, enabling quantitative analysis of attack success across different models, attack goals, and persistence channels. Our findings highlight that persistence transforms prompt injection from an ephemeral model-level threat into a long-lived system-level vulnerability embedded within agent execution state. We hope this work draws broader attention to this emerging threat and motivates the community to systematically study and mitigate system risks arising from persistence in agentic systems.

2026-06-05 13:00 JST | arXiv cs.AI | Agents | Research/Papers

CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities

AI has the potential to transform cybersecurity by enabling systems that can autonomously detect, analyze, and remediate software vulnerabilities. However, existing cybersecurity evaluations of AI systems are limited in scale or scope, and fail to capture the end-to-end lifecycle of real-world software vulnerability discovery and remediation. To address this gap, we propose CyberGym-E2E, a large-scale and realistic end-to-end cybersecurity benchmark that comprehensively evaluates AI agents' abilities across the full lifecycle of vulnerability discovery, PoC generation, and patch generation. CyberGym-E2E is comprehensive and scalable, as we build an automated, agent-enhanced pipeline for transforming open-source vulnerability data into realistic evaluation environments. Currently, the benchmark consists of 920 real-world vulnerabilities across 139 different open-source projects.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

SePO: Self-Evolving Prompt Agent for System Prompt Optimization

System prompt optimization improves agent behavior without modifying the underlying model, yielding human-readable, model-agnostic instructions. Existing methods build a prompt agent that refines task agents' system prompts, yet leave the prompt agent's own system prompt hand-engineered and fixed. We propose Self-Evolving Prompt Optimization (SePO), which treats the prompt agent's own system prompt as an optimization target alongside task agents' system prompts. SePO adopts a self-referential design. A single prompt agent improves both task agents' system prompts and its own under an open-ended evolutionary search that maintains an archive of candidate prompts as stepping stones. Training proceeds in two stages: pre-training evolves the prompt agent on a multi-task pool, and fine-tuning then applies it to a target task. Across five benchmarks spanning math (AIME'25), abstract reasoning (ARC-AGI-1), graduate-level science (GPQA), code generation (MBPP), and logic puzzles (Sudoku), SePO consistently outperforms Manual-CoT, TextGrad, and MetaSPO, improving the average accuracy by 4.49 points compared to Manual-CoT. The prompt optimization skill from pre-training also generalizes to tasks beyond the pre-training mixture, rather than memorizing per-task prompts.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Temporal Order Matters for Agentic Memory: Segment Trees for Long-Horizon Agents

Long-horizon conversational agents need to interact with users through evolving events, tasks, and goals. Such histories are naturally temporal, yet many existing memory systems organize information primarily by topical similarity and may ignore the order in which events occur. We introduce Segment Tree Memory, or SegTreeMem, a memory architecture that represents conversation history as a temporally ordered Segment Tree over utterances. SegTreeMem incrementally inserts new utterances through an online rightmost-frontier update rule, preserving chronological order while forming hierarchical memory segments. For retrieval, SegTreeMem propagates relevance scores through the tree to combine local semantic matching with hierarchical temporal context. Across three long-horizon memory benchmarks and two LLM backbones, SegTreeMem improves answer quality over flat retrieval, graph-structured memory, and tree-structured memory baselines. Additional temporal-order permutation analysis shows that the performance gain depends on preserving temporal order during memory construction, supporting the claim that temporal order is a key structure for agentic memory.
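
The idea of keeping utterances in chronological segments and letting relevance flow through the hierarchy can be pictured with the toy sketch below. It is not the SegTreeMem implementation: the two-level tree, the fixed segment capacity, the word-overlap relevance function, and the `alpha` blending weight are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """A run of consecutive utterances (hypothetical fixed capacity)."""
    utterances: list = field(default_factory=list)


class ToySegmentMemory:
    def __init__(self, capacity=3, alpha=0.5):
        self.capacity = capacity  # assumed segment size
        self.alpha = alpha        # assumed weight of segment-level context
        self.segments = []        # kept in chronological order

    def insert(self, utterance):
        # Only the rightmost (most recent) segment is ever extended,
        # so chronological order is preserved by construction.
        if not self.segments or len(self.segments[-1].utterances) >= self.capacity:
            self.segments.append(Segment())
        self.segments[-1].utterances.append(utterance)

    @staticmethod
    def _overlap(query, text):
        # Stand-in relevance: fraction of query words found in the text.
        q = set(query.lower().split())
        t = set(text.lower().split())
        return len(q & t) / len(q) if q else 0.0

    def retrieve(self, query, k=2):
        scored = []
        for seg_idx, seg in enumerate(self.segments):
            local = [self._overlap(query, u) for u in seg.utterances]
            seg_score = max(local)  # best child score propagates up
            for pos, (u, s) in enumerate(zip(seg.utterances, local)):
                # The segment score flows back down as temporal context.
                combined = (1 - self.alpha) * s + self.alpha * seg_score
                scored.append((combined, seg_idx, pos, u))
        # Highest score first; ties resolved in chronological order.
        scored.sort(key=lambda item: (-item[0], item[1], item[2]))
        return [u for _, _, _, u in scored[:k]]
```

With `alpha > 0`, an utterance that does not match the query itself can still be retrieved because it sits next to a strong match in time, which is the intuition behind combining local matching with temporal context.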

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents

Instance-Level Post Hoc Uncertainty Quantification in Object Detection

Object detection is a safety-critical component of autonomous driving. It is essential to quantify the uncertainty in bounding-box predictions for safety assurance. Post hoc uncertainty quantification without retraining aligns with real-world deployment requirements; therefore, we employ the Laplace approximation. Because instance-level uncertainty is needed, linearized inference methods that require multiple backpropagations are not time-efficient, and sampling-based methods are not fully post hoc. We propose Monte-Carlo generalized linearized model (MC-GLM), which provides instance-level and approximately post hoc uncertainty quantification. The number of samples required in the Monte Carlo step is constant and independent of the number of output instances, so it can be parallelized. Experiments on the nuScenes dataset with the CenterPoint detector validate the effectiveness of our method, and the resulting uncertainties exhibit good quality.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

TIDE: Proactive Multi-Problem Discovery via Template-Guided Iteration

Agents are widely deployed as assistants over documents, tools, and code. However, they typically act only on explicit user requests, which surface only the problems the user has noticed, while many other important problems coexist, hidden in plain sight, within the broader user context, with their total number unknown in advance. We frame this as the task of discovering multiple hidden problems from context, in which coexisting problems should be uncovered, grounded in supporting evidence, and paired with concrete actions. To this end, we introduce TIDE, a template-guided iterative framework with two complementary mechanisms. Specifically, motivated by the observation that single-pass prediction anchors on the most salient cases and yields generic claims, we propose iterative discovery, which surfaces a small batch of candidates per round while conditioning on what has already been found, so subsequent rounds extend coverage; and thought templates, reusable schemas distilled from previously solved cases that specify what contextual signals to attend to and how to connect them, anchoring each prediction in a recognizable problem class. We validate TIDE on two realistic settings, personal workspaces and software repositories, across four model backbones, showing substantial gains over single-shot and parallel multi-agent baselines on task coverage, identification, and resolution.
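
The iterative-discovery loop described above (surface a small batch per round, condition on what has already been found, stop when nothing new appears) might look roughly like the following sketch. The `propose` callable, the batch size, and the keyword-based stand-in for a model are hypothetical and are not taken from TIDE itself.

```python
def iterative_discovery(context, propose, batch_size=2, max_rounds=5):
    """Surface problems in small batches, conditioning on earlier finds.

    `propose(context, found, batch_size)` is a hypothetical callable that
    stands in for a model call and returns candidate problem strings.
    """
    found = []
    for _ in range(max_rounds):
        candidates = propose(context, list(found), batch_size)
        new = [c for c in candidates if c not in found]
        if not new:  # nothing new surfaced: stop early
            break
        found.extend(new[:batch_size])
    return found


def keyword_proposer(context, found, batch_size):
    # Toy stand-in for the model: flag lines containing marker words,
    # skipping anything that was already reported.
    markers = ("TODO", "deprecated", "expired")
    hits = [line.strip() for line in context.splitlines()
            if any(m in line for m in markers)]
    return [h for h in hits if h not in found][:batch_size]
```

Because each round sees the list of earlier finds, later rounds are pushed toward problems that a single pass anchored on the most salient cases would not reach.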

2026-06-05 13:00 JST | arXiv cs.AI | Agents

Archi: Agentic Operations at the CMS Experiment

We present Archi, an open-source, end-to-end framework for scientific collaborations that combines the systematic ingestion and organization of heterogeneous data sources with the deployment of configurable, private, and extensible agents that retrieve and reason over them. An instance of Archi has been deployed for the Computing Operations team of the CMS experiment at CERN's LHC since February 2026 as a support agent for technical operators, offering retrieval and analysis capabilities by combining documentation, historical data, and live monitoring systems. We evaluate the system on operator feedback and a question set collected from production usage, graded by human and automated panels. The system proves effective at operational tasks, resolving real-world queries posed by CMS operators. We also observe that locally-hosted, open-weight models perform competitively, enabling fully private management of sensitive data.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Agents

NoRA: Evaluating Grounded Reasonableness in Visual First-person Normative Action Reasoning

LLMs and agentic systems are increasingly deployed in social environments, making normative competence critical for safe and appropriate behavior. However, existing approaches either assess normative judgment in text alone or reduce it to choosing among a fixed set of candidate actions. We argue both are insufficient. In practice, agents are never handed a menu of options; they must identify a reasonable action from scratch, grounded in visible facts and supported by inspectable reasons. We introduce NoRA, a visual first-person video benchmark that requires models to generate candidate next actions and justify each through an explicit fact-reason-action support graph. The benchmark comprises 1,420 annotated video clips, including HumanGold-190 and LLMSilver-1230 splits. Each instance is evaluated through action alignment, factual grounding, and support binding, aggregated into a single grounded reasonableness score. We benchmark 12 multimodal systems under direct, deliberate, and structured prompting regimes, finding that current VLMs frequently recover plausible actions and relevant scene facts, but consistently struggle to construct the full reasonable action space and bind selected actions to the correct local support. NoRA makes this gap measurable, shifting the evaluation question from whether a model can pick an action to whether it can justify an appropriate action for the right visible reasons.

2026-06-05 13:00 JST | arXiv cs.AI | Agents

Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees

Guaranteeing safety is critical to the deployment of reinforcement learning (RL) agents in the real-world, especially as policies learned using deep RL may demonstrate susceptibility to transition perturbations that result in unknown or unsafe behaviour. A method of policy verification is to construct probabilistic barrier-certificates by sampling policy trajectories with respect to safety constraints, thereby demarcating known safe behaviour from unknown behaviour. Obtaining tight upper and lower bounds on the probability of violation of these constraints may be difficult if the policy is susceptible to transition uncertainty or perturbation that places the agent in insufficiently explored states. To address this, we approximate the distribution of the encountered state-space using a variational autoencoder (VAE) and construct upper and lower-bound barrier-certificates using latent characteristics of states to optimize for regions of known, safe behaviour with high confidence. We frame this in our work as a dual optimization problem where the lower-bound barrier-certificate presents a more conservative estimate of the safe region than the upper-bound barrier-certificate. Sampling states that lie within the set difference of the two during training, i.e. the non-robust region, allows us to tighten the upper and lower bounds to provide sharper probabilistic guarantees on safety. Within our study, we describe the guarantees placed and demonstrate the tightness of our bounds experimentally.
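
To make the "set difference of the two certificates" concrete, here is a minimal sketch that samples states lying inside an optimistic safe-region estimate but outside a conservative one. The one-dimensional state, the hand-written certificate functions, and rejection sampling are assumptions for illustration; the paper learns its certificates from VAE latent features, which this sketch does not attempt.

```python
import random


def lower_certificate(state):
    # Conservative estimate: states with |s| <= 1 are certified safe.
    return 1.0 - abs(state)


def upper_certificate(state):
    # Optimistic estimate: states with |s| <= 2 might be safe.
    return 2.0 - abs(state)


def in_non_robust_region(state):
    # Inside the optimistic safe set but outside the conservative one.
    return upper_certificate(state) >= 0 and lower_certificate(state) < 0


def sample_non_robust(n, low=-3.0, high=3.0, seed=0, max_tries=100_000):
    rng = random.Random(seed)
    samples = []
    for _ in range(max_tries):
        if len(samples) == n:
            break
        s = rng.uniform(low, high)
        if in_non_robust_region(s):  # rejection sampling
            samples.append(s)
    return samples
```

States drawn this way are exactly the ones on which the two estimates disagree, so spending training samples there is what allows the gap between the bounds to shrink.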

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Learning While Acting: A Skill-Enhanced Test-Time Co-Evolution Framework for Online Lifelong Learning Agents

Lifelong learning is essential for Large Language Model (LLM) agents operating in dynamic, interactive environments. However, existing lifelong learning agents for long-horizon tasks typically depend on retrieving discrete skills or past experiences with static parameters during inference, which prevents them from continuously internalizing test-time feedback as human learners do. To bridge this gap, we propose Skill-enhanced Test-Time Co-Evolution (LifeSkill), a two-stage reinforcement learning framework for Online Lifelong Learning Agents. Specifically, we design Verifier-Guided Skill Learning, which addresses the lack of direct supervision for skill extraction by rewarding candidate skills according to the average verifier success of multiple skill-conditioned policy rollouts, encouraging the model to generate skills that are useful for solving tasks rather than merely plausible in text. Furthermore, we introduce Online Skill Internalization, which continuously improves the policy model during test-time interaction by transforming skill-conditioned trajectories into reward signals. This enables the agent to directly internalize reasoning capabilities into its parameters, avoiding the context bloat of experience retrieval. Experiments on LifelongAgentBench show that LifeSkill improves average performance by 7 absolute points compared with existing lifelong agent baselines.
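
The reward signal described for skill learning, the average verifier success over several skill-conditioned rollouts, can be written down in a few lines. This is a sketch under assumptions: `rollout`, `verify`, the number of rollouts, and the toy policy with a `reliability` field are hypothetical stand-ins, not the LifeSkill code.

```python
import random


def skill_reward(skill, task, rollout, verify, n_rollouts=4, seed=0):
    """Average verifier success over several skill-conditioned rollouts.

    `rollout(task, skill, rng)` and `verify(task, trajectory)` are
    hypothetical stand-ins for the policy and the verifier.
    """
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_rollouts):
        trajectory = rollout(task, skill, rng)
        successes += 1 if verify(task, trajectory) else 0
    return successes / n_rollouts


def toy_rollout(task, skill, rng):
    # Toy policy: the rollout solves the task with the skill's reliability.
    return {"solved": rng.random() < skill["reliability"]}


def toy_verify(task, trajectory):
    # Toy verifier: simply reads the outcome recorded in the trajectory.
    return trajectory["solved"]
```

Averaging over rollouts rewards a skill for how often it actually helps the policy succeed, rather than for how plausible its text looks.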

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Provably Auditable and Safe LLM Agents from Human-Authored Ontologies

We introduce the LLM agent architecture Agentic Redux, intended for use with nontrivial problem domains that require linear auditability. Using the typed lambda calculus, we prove that, run on appropriate domains, Agentic Redux executions are semantically guaranteed to be correct, with all decisions recorded in an append-only ledger. We present two production-grade appropriate domains, in healthcare billing compliance, and security vulnerability disclosure. Working code for Agentic Redux run on both domains is available in a supporting code repository. We also introduce Ontology-First Agent Design, a methodology for creation of agent frameworks on a problem domain, in which a human expert ontologizes the problem domain with Basic Formal Ontology, and then assigns an LLM to derive roles that agents and humans-in-the-loop can fill, in order to work the problems in the domain.
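
An append-only decision ledger can be approximated with a hash chain, as in the sketch below. The abstract does not say how Agentic Redux implements its ledger, so the hash chaining, the class name `AppendOnlyLedger`, and the entry layout are assumptions chosen only to illustrate the idea of a tamper-evident, linearly ordered log.

```python
import copy
import hashlib
import json


class AppendOnlyLedger:
    """Toy decision log; each entry commits to the previous entry's hash."""

    GENESIS = "0" * 64

    def __init__(self):
        self._entries = []

    @staticmethod
    def _digest(prev_hash, decision):
        payload = json.dumps({"prev": prev_hash, "decision": decision},
                             sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def append(self, decision):
        prev_hash = self._entries[-1]["hash"] if self._entries else self.GENESIS
        entry = {"prev": prev_hash, "decision": decision,
                 "hash": self._digest(prev_hash, decision)}
        self._entries.append(entry)
        return entry["hash"]

    def entries(self):
        # Deep copies, so callers cannot edit the stored log through them.
        return copy.deepcopy(self._entries)

    def verify(self):
        # Recompute the chain from the start; any edit breaks a link.
        prev_hash = self.GENESIS
        for e in self._entries:
            if e["prev"] != prev_hash or e["hash"] != self._digest(prev_hash, e["decision"]):
                return False
            prev_hash = e["hash"]
        return True
```

An auditor can replay the entries in order and call `verify()` to confirm that no recorded decision was altered or removed from the middle of the chain.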

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

From Prompt to Process: a Process Taxonomy and Comparative Assessment of Frameworks Supporting AI Software Development Agents

AI tools for programming are no longer just autocomplete or chat assistants: they organize themselves as development frameworks, with process, roles, artifacts and verification. Recent surveys map agents and LLMs for software engineering, but a study centered on the operational frameworks that turn these capabilities into process is missing. We ran a directed search of primary sources, with a functional inclusion criterion and traction measurement, and selected six frameworks: GitHub Spec Kit, OpenSpec, BMAD Method, Get Shit Done (GSD), Spec Kitty and Reversa. Each attacks AI development through a different path: spec-driven development in full and lightweight variants, agent-driven agile planning, context engineering over the agent, worktree isolation and review, and recovery of operational specifications from legacy systems. Our central contribution is a six-dimension process taxonomy: specification, context, roles, execution, validation and portability, with a scoring rubric that turns it into a replicable instrument. We apply it to the six frameworks and an out-of-sample case, Spec-Flow. Two results stand out. Among frameworks that already adopt some process there is convergence: the isolated prompt loses centrality, and persistent artifacts, work contracts, traceability and human review become mechanisms that reduce ambiguity and coordinate agents. And no framework strongly covers all six dimensions, exposing a structural trade-off between process depth and portability across agents. We also found recurring risks: drift between specification and code, excessive trust in generated artifacts, fragility of community extensions, platform dependence and a lack of benchmarks for the complete process. We close with a research agenda for empirical evaluation, focused on intermediate-quality metrics, context governance, installation security and reproducibility.

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents | Research/Papers

Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance

We envision a proactive multi-modal assistant system which gives users real-time step-by-step guidance on a procedural task, autonomously deciding when to interrupt and how to coach. However, progress is limited by the absence of large-scale, cross-domain benchmarks that reflect realistic conditions, particularly the common case in which users deviate from the expected step sequence. We address this gap with four contributions: (1) we release EgoProactive, a large-scale wearable-egocentric dataset for proactive procedural assistance with explicit Out-of-Plan (OOP) annotations and recovery steps; (2) we augment five established benchmarks (Ego4D, EPIC-KITCHENS, EgoExo4D, HoloAssist, HowTo100M) into Pro²Bench under a unified proactive-guidance schema; (3) we propose a decoupled planner-interaction architecture specialized for procedural state, visual cues, and recovery injection; (4) we introduce a post-training recipe that transfers across model families, validated by cross-backbone replication on Llama 4 and Qwen-3.6-VL. In extensive experiments, our trained Llama-4 system substantially improves objective intervention quality over strong proprietary baselines (Claude Opus 4.6, Gemini 3.1 Pro, GPT 5.2) and open-weight baselines (Qwen3 VL 235B) across all six datasets. Oracle-plan experiments further show that, when plan quality is controlled, the trained duplex model produces high-quality guidance and large gains on Out-of-Plan recovery.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

From Agent Traces to Trust: Evidence Tracing and Execution Provenance in LLM Agents

Large language model (LLM)-based agents increasingly solve complex tasks by interacting with external tools, retrieval systems, memory modules, environments, and other agents. These capabilities expand agent autonomy, but also make agent behavior harder to verify, debug, and audit. Final-answer accuracy alone cannot explain how an output was produced, which evidence supported each claim, whether tool calls were justified, how memory influenced later decisions, or where execution failures originated. Evidence tracing and execution provenance address this gap by modeling how retrieved evidence, tool outputs, memory items, environment observations, intermediate claims, actions, and final answers are connected throughout agent execution. This survey provides a systematic review and conceptual framework for evidence tracing and execution provenance in LLM agents. We organize related work around a unified provenance perspective that connects retrieval grounding, claim support, tool-use safety, memory lineage, observability, debugging, audit, and recovery. We introduce a taxonomy covering trace sources, evidence and execution units, provenance relations, tracing granularity and timing, representation forms, and trust functions. We review key methodological directions, including provenance representation, evidence attribution, tool-use provenance, runtime guardrails, provenance-bearing memory, trace-based observability, and failure diagnosis. We also map existing benchmarks, datasets, and evaluation metrics to provenance-related capabilities, and discuss how evaluation can move from final-answer correctness toward process-level accountability. Finally, we outline open challenges, including unified trace schemas, claim-level and semantic provenance, provenance-aware safety mechanisms, realistic execution-trace benchmarks, recovery-oriented evaluation, and privacy-aware audit infrastructure.
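
As a rough illustration of what modeling these connections can mean in practice, the sketch below stores a trace as a small graph whose edges point from a derived item to the items that support it. The class name, the node kinds, and the traversal order are hypothetical choices for this example and do not correspond to any schema proposed in the survey.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Toy trace store: edges point from a derived node to its supports."""

    def __init__(self):
        self.kind = {}                     # node id -> node type
        self.supports = defaultdict(list)  # node id -> supporting node ids

    def add_node(self, node_id, kind):
        self.kind[node_id] = kind

    def add_support(self, derived, source):
        self.supports[derived].append(source)

    def trace(self, node_id):
        """Return every node reachable from `node_id`, depth-first."""
        seen, order, stack = set(), [], [node_id]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            order.append(current)
            # Reverse so supports are visited in insertion order.
            stack.extend(reversed(self.supports.get(current, [])))
        return order

    def unsupported_claims(self):
        # Claims with no recorded evidence are candidates for audit.
        return [n for n, k in self.kind.items()
                if k == "claim" and not self.supports.get(n)]
```

Calling `trace` on a final answer lists the claims, tool outputs, and retrieved documents it depends on, while `unsupported_claims` highlights where the trace offers no evidence at all.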

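2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

DAR: Deontic Reasoning with Agentic Harnesses

Deontic reasoning is the task of answering questions by applying explicit rules and policies to case-specific facts, such as computing the tax owed under a statute or deciding the outcome of an immigration appeal. The main technical challenge for LLM-based deontic reasoning is that the relevant ruleset can be long and cross-referenced, so a model may fail to find the rules it needs for a given reasoning step. We introduce Deontic Agentic Reasoning (DAR), an agentic reasoning setup in which the model interacts with the statutes on demand. We evaluate DAR under multiple harnesses on hard subsets of DeonticBench. Across these settings, agentic harnesses can push the frontier on deontic reasoning tasks, but the improvements are not uniform: weaker models often degrade on numerical tasks while consuming far more tokens.

Original (English)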

DAR: Deontic Reasoning with Agentic Harnesses

Deontic reasoning is the task of answering questions by applying explicit rules and policies to case-specific facts, for example computing tax liability under a statute or determining the outcome of an immigration appeal. A key technical challenge for LLM-based deontic reasoning is that the relevant ruleset can be long and cross-referenced, so models may still fail to locate the rules needed for a particular reasoning step. We introduce Deontic Agentic Reasoning (DAR), an agentic reasoning setup in which the model interacts with the statutes on demand. We evaluate DAR under multiple harnesses on hard subsets of DeonticBench. Across these settings, we find that agentic harnesses can push the frontier on deontic reasoning tasks, but improvements are not uniform: weaker models often degrade on numerical tasks while consuming far more tokens.
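
To make "interacting with the statutes on demand" concrete, here is a minimal sketch in which a lookup tool returns one section at a time and cross-references are followed only as they are encountered. The mini-statute, its section ids, and both function names are invented for illustration. The paper's harnesses are not described at this level of detail, so this is an assumption about the general shape, with a deterministic loop standing in for the model.

```python
import re

# Hypothetical mini-statute: section ids and wording are invented for this example.
STATUTE = {
    "s1": "Tax owed is the rate in s2 applied to taxable income as defined in s3.",
    "s2": "The rate is 10 percent.",
    "s3": "Taxable income is gross income minus the deduction in s4.",
    "s4": "The deduction is 1000.",
    "s5": "Unrelated provision on filing deadlines.",
}


def lookup(section_id):
    """The 'tool' an agent would call: one section's text, or None if absent."""
    return STATUTE.get(section_id)


def gather_rules(start):
    """Follow cross-references on demand instead of loading the whole ruleset."""
    needed, queue = {}, [start]
    while queue:
        section_id = queue.pop(0)
        if section_id in needed:
            continue
        text = lookup(section_id)
        if text is None:
            continue
        needed[section_id] = text
        # Queue every other section cited inside this one.
        queue.extend(ref for ref in re.findall(r"\bs\d+\b", text) if ref != section_id)
    return needed


rules = gather_rules("s1")
print(sorted(rules))  # ['s1', 's2', 's3', 's4']
```

Only the four sections reachable from `s1` are pulled into context, and the unrelated `s5` is never read.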

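2026-06-05 13:00 JST | arXiv cs.AI | Agents

Self-Reflective APIs: Structure Beats Verbosity for AI Agent Recovery

When an AI agent calls an API and hits a validation error, it needs to know not only what went wrong but also what to do next. On validation failure, a self-reflective API returns a machine-readable recovery_feedback.suggestions[] payload that is sufficient for the agent to repair the request and retry without external reasoning. In a leak-audited pilot ($N{=}30$ per cell, 3 LLMs, 10 adversarial tasks), structured suggestions raise the task-completion rate by $+36.7$ to $40.0$ pp over plain-English diagnoses on Anthropic models (Fisher's exact $p \le 0.0022$), with $1.8$ to $2.2\times$ better per-success token efficiency. The lift is not significant on gpt-4o-mini ($p{=}0.435$). A second-domain replication on a billing API confirms the pattern. The comparison holds only after auditing two undocumented classes of answer leakage in LLM benchmarks. We ship audit_prompt_leakage.py as reusable CI infrastructure. Code and data: https://github.com/arquicanedo/self-reflective-apis.

Original (English)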

Self-Reflective APIs: Structure Beats Verbosity for AI Agent Recovery

When an AI agent calls an API and hits a validation error, it needs more than what went wrong: it needs what to do next. A self-reflective API returns, on validation failure, a machine-readable recovery_feedback.suggestions[] payload sufficient for the agent to repair the request and retry without external reasoning. On a leak-audited pilot ($N{=}30$ per cell, 3 LLMs, 10 adversarial tasks), structured suggestions lift task-completion rate by $+36.7$--$40.0$pp over plain-English diagnoses on Anthropic models (Fisher's exact $p \le 0.0022$), at $1.8$--$2.2\times$ better per-success token efficiency. The lift is not significant on gpt-4o-mini ($p{=}0.435$); a second-domain replication on a billing API confirms the pattern. The comparison only holds after auditing two undocumented classes of answer leakage in LLM benchmarks. We ship audit_prompt_leakage.py as reusable CI infrastructure. Code and data: https://github.com/arquicanedo/self-reflective-apis.
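
The repair-and-retry loop described above might look roughly like the sketch below. Only the `recovery_feedback.suggestions[]` path comes from the abstract. The validator, the `field`/`action`/`value` keys inside each suggestion, and the retry policy are assumptions made for illustration and may differ from the schema in the authors' repository.

```python
ALLOWED_CURRENCIES = ("USD", "EUR")
MAX_MEMO_LENGTH = 10


def validate(request):
    """Hypothetical validator: on failure, return structured repairs, not prose."""
    suggestions = []
    currency = request.get("currency")
    if currency not in ALLOWED_CURRENCIES:
        fixed = currency.upper() if isinstance(currency, str) else None
        if fixed in ALLOWED_CURRENCIES:
            suggestions.append({"field": "currency", "action": "set", "value": fixed})
        else:
            # No safe repair is known, so the suggestion list stays empty.
            return {"ok": False, "recovery_feedback": {"suggestions": []}}
    memo = request.get("memo", "")
    if len(memo) > MAX_MEMO_LENGTH:
        suggestions.append({"field": "memo", "action": "truncate", "value": MAX_MEMO_LENGTH})
    if suggestions:
        return {"ok": False, "recovery_feedback": {"suggestions": suggestions}}
    return {"ok": True}


def apply_suggestions(request, suggestions):
    """Apply each suggestion mechanically; no free-form reasoning is involved."""
    repaired = dict(request)
    for suggestion in suggestions:
        field = suggestion["field"]
        if suggestion["action"] == "set":
            repaired[field] = suggestion["value"]
        elif suggestion["action"] == "truncate":
            repaired[field] = repaired[field][: suggestion["value"]]
    return repaired


def call_with_recovery(request, max_retries=2):
    """Retry until the request validates; return it with the number of repairs."""
    for attempt in range(max_retries + 1):
        response = validate(request)
        if response["ok"]:
            return request, attempt
        request = apply_suggestions(request, response["recovery_feedback"]["suggestions"])
    raise RuntimeError("request could not be repaired")


repaired, attempts = call_with_recovery({"currency": "usd", "memo": "x" * 25})
print(repaired["currency"], len(repaired["memo"]), attempts)  # USD 10 1
```

The agent side needs no parsing of English error text: each suggestion names a field and an operation, so one repair pass is enough in this example.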

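2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Towards Efficient and Evidence-grounded Mobility Prediction with LLM-Driven Agent

Individual-level mobility prediction is central to urban simulation, transportation planning, and policy analysis. Supervised sequence models achieve high accuracy but require task-specific training and offer limited decision-level transparency. Recent LLM-based methods improve interpretability, but most rely on static prompts and single-pass inference, which limits their ability to seek additional evidence when mobility signals are weak or conflicting. We propose AgentMob, a training-free, LLM-driven agent framework that formulates next-location prediction as adaptive, evidence-controlled decision making. AgentMob resolves routine cases through a fast path based on historical regularity, while ambiguous cases trigger iterative tool use over recent trajectories, historical behavior, stay-move likelihood, and geographical evidence. Across three mobility datasets, AgentMob achieves the strongest overall performance among training-free LLM-based methods, with GPT-5.4 reaching 71.42% Acc@1 on BW, 33.14% on YJMob100K, and 33.50% on Shanghai ISP. On BW non-fast-path cases, the LLM controller improves Acc@1 from 30.65% to 48.62% over a same-tool statistical baseline, indicating that its main benefit lies in resolving ambiguous predictions through adaptive evidence gathering. Code is available at https://github.com/Unknown-zoo/AgentMob.

Original (English)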

Towards Efficient and Evidence-grounded Mobility Prediction with LLM-Driven Agent

Individual-level mobility prediction is central to urban simulation, transportation planning, and policy analysis. Supervised sequence models achieve strong accuracy but require task-specific training and offer limited decision-level transparency. Recent LLM-based methods improve interpretability, yet mostly rely on static prompts and single-pass inference, limiting their ability to seek additional evidence when mobility signals are weak or conflicting. We propose AgentMob, a training-free LLM-driven agent framework that formulates next-location prediction as adaptive evidence-controlled decision making. AgentMob resolves routine cases through a fast path based on historical regularity, while ambiguous cases trigger iterative tool use over recent trajectories, historical behavior, stay-move likelihood, and geographical evidence. Across three mobility datasets, AgentMob achieves the strongest overall performance among training-free LLM-based methods, with GPT-5.4 reaching 71.42% Acc@1 on BW, 33.14% on YJMob100K, and 33.50% on Shanghai ISP. On BW non-fast-path cases, the LLM controller improves Acc@1 from 30.65% to 48.62% over a same-tool statistical baseline, showing that its main benefit lies in resolving ambiguous predictions through adaptive evidence gathering. Our code is available at https://github.com/Unknown-zoo/AgentMob.
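
The split between a fast path and an evidence-gathering path can be sketched as follows. The regularity measure, the 0.7 threshold, and the weighted vote that stands in for the LLM controller's iterative tool use are all assumptions for illustration. The abstract does not specify how AgentMob defines or thresholds regularity.

```python
from collections import Counter


def regularity(history):
    """Most common past location and the share of visits that went there."""
    if not history:
        return None, 0.0
    place, count = Counter(history).most_common(1)[0]
    return place, count / len(history)


def predict_next(history, recent, threshold=0.7):
    """Return (predicted location, path taken)."""
    place, share = regularity(history)
    if share >= threshold:
        # Routine case: history alone is decisive, so no further evidence is sought.
        return place, "fast_path"
    # Ambiguous case: a simple weighted vote stands in for the LLM controller,
    # which in the paper would call tools iteratively.
    scores = Counter()
    for location in history:
        scores[location] += 1.0
    for location in recent:
        scores[location] += 2.0  # hypothetical extra weight for recent visits
    if not scores:
        return None, "evidence_path"
    return scores.most_common(1)[0][0], "evidence_path"


print(predict_next(["home"] * 8 + ["gym"] * 2, recent=["gym"]))
# ('home', 'fast_path')
print(predict_next(["home", "work", "gym", "work", "home", "cafe"], recent=["cafe", "cafe"]))
# ('cafe', 'evidence_path')
```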

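2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Streaming Communication in Multi-Agent Reasoning

Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm, which makes end-to-end latency scale linearly with pipeline depth. StreamMA is a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents to reduce latency. Surprisingly, this pipelining also improves effectiveness: the quality of multi-step reasoning is non-uniform, with early steps more reliable than later ones, so working from these reliable early steps rather than the full chain keeps error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of the stream, serial, and single protocols, deriving the effectiveness ordering, the speedup upper bound, and the cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026 with Claude Opus 4.6-high). Beyond these contributions, we discover a "step-level scaling law": increasing the number of steps per agent consistently improves both effectiveness and efficiency, a new scaling dimension that is orthogonal to, and composable with, agent-count scaling.

Original (English)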

Streaming Communication in Multi-Agent Reasoning

Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency. Surprisingly, this pipelining also improves effectiveness: because multi-step reasoning quality is non-uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6-high). Beyond these contributions, we discover a "step-level scaling law": increasing per-agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent-count scaling.
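
The latency argument can be checked with a textbook pipeline model. This is not the paper's closed-form analysis. It assumes a chain of `depth` agents, each producing `steps` reasoning steps that take `step_time` each, with no transfer overhead, and a downstream agent that can begin its i-th step as soon as the upstream agent has emitted its i-th step.

```python
def serial_latency(depth, steps, step_time):
    """Generate-then-transfer: each agent waits for the full upstream chain."""
    return depth * steps * step_time


def streamed_latency(depth, steps, step_time):
    """Idealized pipelining: each extra agent adds only one step of delay."""
    return (steps + depth - 1) * step_time


def speedup(depth, steps, step_time=1.0):
    return serial_latency(depth, steps, step_time) / streamed_latency(depth, steps, step_time)


print(serial_latency(4, 10, 1.0))    # 40.0
print(streamed_latency(4, 10, 1.0))  # 13.0
print(round(speedup(4, 10), 2))      # 3.08
```

Under these assumptions the speedup is `depth * steps / (steps + depth - 1)`, which stays below `depth` and approaches it as the number of steps per agent grows.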

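2026-06-05 13:00 JST | arXiv cs.AI | Agents

Adaptive Minds: Empowering Agents with LoRA-as-Tools

We investigate a framework in which LoRA adapters are treated as callable tools that a base language model can dynamically select and invoke. We hypothesize that when adapters are trained to deliver strong domain-specific gains and are exposed with clear metadata, the base model can reliably route queries to the appropriate expert, effectively aggregating the benefits of many specialized adapters within a single framework. We introduce Adaptive Minds, a general framework for studying both single-step routing and multi-step agentic reasoning. In this setting, the agent can iteratively invoke multiple adapters alongside other tools (for example, external APIs, retrieval systems, or execution environments) and reason over their outputs across multiple steps. This reframes adapters as modular skills or memory units that can be composed during reasoning rather than applied statically. In our evaluation, the routing layer reaches 98.3% accuracy on a 30-adapter library, and well-trained specialists provide strict-scorer gains of +4.6 to +84.0 percentage points across nine task families under a single shared training recipe. The AM router aggregates these gains to within 5 pp of the direct specialist on every benchmark whose queries surface a domain signal. Our findings suggest that the effectiveness of this approach depends on the quality and specialization of the individual adapters, and that flexibly composing many such experts can substantially expand the practical capabilities of language model agents, moving toward more general, tool-augmented intelligence.

Original (English)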

Adaptive Minds: Empowering Agents with LoRA-as-Tools

We investigate a framework in which LoRA adapters are treated as callable tools that a base language model can dynamically select and invoke. We hypothesize that, when adapters are trained to provide strong domain-specific gains and are exposed with clear metadata, a base model can reliably route queries to the appropriate expert, effectively aggregating the benefits of many specialized adapters within a single framework. We introduce Adaptive Minds, a general framework within which we study both single-step routing and multi-step agentic reasoning. In this setting, the agent can iteratively invoke multiple adapters alongside other tools (e.g., external APIs, retrieval systems, or execution environments) and reason over their outputs across multiple steps. This reframes adapters as modular skills or memory units that can be composed during reasoning rather than statically applied. In our evaluation, the routing layer reaches 98.3% accuracy on a 30-adapter library, and well-trained specialists provide +4.6 to +84.0 percentage points of strict-scorer gain across nine task families under a single shared training recipe; the AM router aggregates these gains within 5 pp of the direct specialist on every benchmark whose queries surface domain signal. Our findings suggest that the effectiveness of this approach depends on the quality and specialization of individual adapters, and that enabling flexible composition of many such experts can significantly expand the practical capabilities of language model agents, moving toward more general, tool-augmented intelligence.
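
A minimal sketch of single-step routing over adapter metadata follows. In the paper the base language model does the routing. Here a keyword-overlap score stands in for it so the example stays self-contained, and the adapter names, metadata fields, and `base_model` fallback are hypothetical.

```python
import re

# Hypothetical adapter registry: names and metadata are invented for illustration.
ADAPTERS = {
    "sql_expert": {"description": "write and debug SQL queries",
                   "keywords": {"sql", "query", "table", "join"}},
    "med_expert": {"description": "answer clinical questions",
                   "keywords": {"dose", "symptom", "drug", "diagnosis"}},
    "legal_expert": {"description": "interpret contracts and statutes",
                     "keywords": {"contract", "statute", "liability", "clause"}},
}


def route(query, adapters=ADAPTERS, min_overlap=1):
    """Pick the adapter whose metadata best matches the query, else the base model."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    best_name, best_score = None, 0
    for name, meta in adapters.items():
        score = len(words & meta["keywords"])
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_overlap else "base_model"


print(route("How do I join two tables in SQL?"))      # sql_expert
print(route("Which clause limits liability here?"))   # legal_expert
print(route("Tell me a joke"))                        # base_model
```

The fallback mirrors the abstract's caveat that routing helps only when a query surfaces a domain signal.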

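2026-06-05 13:00 JST | arXiv cs.AI | Agents

BRAINCELL-AID: An Agentic AI Created Brain Cell Type Resource for Community Annotation

Single-cell RNA sequencing has transformed our ability to identify diverse cell types and their transcriptomic signatures. Annotating these signatures, especially those involving poorly characterized genes, nevertheless remains a major challenge. Traditional methods such as Gene Set Enrichment Analysis (GSEA) depend on well-curated annotations and often perform poorly in these contexts. Large language models (LLMs) offer a promising alternative but struggle to represent complex biological knowledge within structured ontologies. To address this, we present BRAINCELL-AID (https://biodataai.uth.edu/BRAINCELL-AID), a novel multi-agent AI system that integrates free-text descriptions with ontology labels to enable more accurate and robust gene set annotation. By incorporating retrieval-augmented generation (RAG), we developed a robust agentic workflow that refines predictions using relevant PubMed literature, reducing hallucinations and improving interpretability. With this workflow, the correct annotation appeared among the top predictions for 77% of mouse gene sets. Applying the approach, we annotated 5,322 brain cell clusters from the comprehensive mouse brain cell atlas generated by the BRAIN Initiative Cell Census Network, enabling new insights into brain cell function by identifying region-specific gene co-expression patterns and inferring the functional roles of gene ensembles. BRAINCELL-AID also identifies basal ganglia-related cell types with neurologically meaningful descriptions. The result is a valuable resource to support community-driven cell type annotation.

Original (English)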

BRAINCELL-AID: An Agentic AI Created Brain Cell Type Resource for Community Annotation

Single-cell RNA sequencing has transformed our ability to identify diverse cell types and their transcriptomic signatures. However, annotating these signatures, especially those involving poorly characterized genes, remains a major challenge. Traditional methods, such as Gene Set Enrichment Analysis (GSEA), depend on well-curated annotations and often perform poorly in these contexts. Large Language Models (LLMs) offer a promising alternative but struggle to represent complex biological knowledge within structured ontologies. To address this, we present BRAINCELL-AID (https://biodataai.uth.edu/BRAINCELL-AID), a novel multi-agent AI system that integrates free-text descriptions with ontology labels to enable more accurate and robust gene set annotation. By incorporating retrieval-augmented generation (RAG), we developed a robust agentic workflow that refines predictions using relevant PubMed literature, reducing hallucinations and enhancing interpretability. Using this workflow, we achieved correct annotations for 77% of mouse gene sets among their top predictions. Applying this approach, we annotated 5,322 brain cell clusters from the comprehensive mouse brain cell atlas generated by the BRAIN Initiative Cell Census Network, enabling novel insights into brain cell function by identifying region-specific gene co-expression patterns and inferring functional roles of gene ensembles. BRAINCELL-AID also identifies Basal Ganglia-related cell types with neurologically meaningful descriptions. Hence, we create a valuable resource to support community-driven cell type annotation.

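2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Research/Papers

SciDER: Scientific Data-centric End-to-end Researcher

While large language models are accelerating scientific discovery, existing agents face severe limitations in adaptability, domain generalization, and multimodal scalability, and often struggle to process raw, domain-specific experimental data autonomously. To overcome these barriers, we introduce SciDER, a multi-agent system designed to flexibly automate the entire research lifecycle. The framework takes a novel data-centric approach and integrates a dynamic multimodal skill system across four specialized sub-agents. Specifically, an ideation agent generates new hypotheses via Evolutionary Idea Search, a data analysis agent systematically structures raw data, an experimentation agent synthesizes executable code grounded in dataset characteristics, and a critic agent drives iterative self-refinement. To democratize open-source scientific discovery, we release OpenSciDER-SFT-8K, a high-quality execution trajectory dataset, alongside the fine-tuned OpenSciDER-27B model. Across six benchmarks, SciDER and OpenSciDER obtain competitive or leading results, with especially strong gains in data-centric analysis, end-to-end research execution, and multimodal scientific visualization. By integrating data analysis with experiment execution, SciDER bridges the gap between abstract scientific reasoning and reproducible experiment synthesis.

Original (English)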

SciDER: Scientific Data-centric End-to-end Researcher

While large language models accelerate scientific discovery, existing agents face severe limitations in adaptability, domain generalization, and multimodal scalability, often struggling to autonomously process raw, domain-specific experimental data. To overcome these barriers, we introduce SciDER, a multi-agent system designed to flexibly automate the entire research lifecycle. This framework employs a novel data-centric approach and integrates a dynamic multimodal skill system across four specialized sub-agents. Specifically, an ideation agent generates novel hypotheses via Evolutionary Idea Search, a data analysis agent systematically structures raw data, an experimentation agent synthesizes executable code grounded in dataset characteristics, and a critic agent drives iterative self-refinement. To democratize open-source scientific discovery, we release OpenSciDER-SFT-8K, a high-quality execution trajectory dataset, alongside the OpenSciDER-27B fine-tuned model. Across six benchmarks, SciDER and OpenSciDER obtain competitive or leading results, with especially strong gains on data-centric analysis, end-to-end research execution, and multimodal scientific visualization. By integrating data analysis with experimental execution, SciDER bridges the gap between abstract scientific reasoning and reproducible experimentation synthesis.

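2026-06-05 13:00 JST | arXiv cs.AI | Agents

Formal Semantics for Agentic Tool Protocols: A Process Calculus Approach

The emergence of large language model agents that can invoke external tools has created an urgent need for formal verification of agent protocols. Two paradigms dominate this space: Schema-Guided Dialogue (SGD), a research framework for zero-shot API generalization, and the Model Context Protocol (MCP), an industry standard for agent-tool integration. Both enable dynamic service discovery through schema descriptions, but their formal relationship remains unexplored. Building on prior work that established the conceptual convergence of these paradigms, we present the first process-calculus formalization of SGD and MCP and prove that they are structurally bisimilar under a well-defined mapping Phi. However, the reverse mapping Phi^{-1} is partial and lossy, which reveals critical gaps in MCP's expressivity. Through bidirectional analysis, we identify five principles (semantic completeness, explicit action boundaries, failure mode documentation, progressive disclosure compatibility, and inter-tool relationship declaration) as necessary and sufficient conditions for full behavioral equivalence. We formalize these principles as the type-system extension MCP+ and prove that MCP+ is isomorphic to SGD. Our work provides the first formal foundation for verified agent systems and establishes schema quality as a provable safety property.

Original (English)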

Formal Semantics for Agentic Tool Protocols: A Process Calculus Approach

The emergence of large language model agents capable of invoking external tools has created an urgent need for formal verification of agent protocols. Two paradigms dominate this space: Schema-Guided Dialogue (SGD), a research framework for zero-shot API generalization, and the Model Context Protocol (MCP), an industry standard for agent-tool integration. While both enable dynamic service discovery through schema descriptions, their formal relationship remains unexplored. Building on prior work establishing the conceptual convergence of these paradigms, we present the first process calculus formalization of SGD and MCP, proving they are structurally bisimilar under a well-defined mapping Phi. However, we demonstrate that the reverse mapping Phi^{-1} is partial and lossy, revealing critical gaps in MCP's expressivity. Through bidirectional analysis, we identify five principles -- semantic completeness, explicit action boundaries, failure mode documentation, progressive disclosure compatibility, and inter-tool relationship declaration -- as necessary and sufficient conditions for full behavioral equivalence. We formalize these principles as type-system extensions MCP+, proving MCP+ is isomorphic to SGD. Our work provides the first formal foundation for verified agent systems and establishes schema quality as a provable safety property.

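2026-06-05 13:00 JST | arXiv cs.AI | Agents

The Accountability Horizon: An Impossibility Theorem for Governing Human-Agent Collectives

Existing accountability frameworks for AI systems, whether legal, ethical, or regulatory, rest on a shared assumption: for any consequential outcome, at least one identifiable person had enough involvement and foresight to bear meaningful responsibility. This paper proves that agentic AI systems violate this assumption, not as an engineering limitation but as a mathematical necessity, once autonomy exceeds a computable threshold. We introduce Human-Agent Collectives, a formalization of joint human-AI systems in which agents are modeled as state-policy tuples within a shared structural causal model. Autonomy is characterized by a four-dimensional information-theoretic profile (epistemic, executive, evaluative, social), and collective behavior by interaction graphs and joint action spaces. We axiomatize legitimate accountability through four minimal properties: Attributability (responsibility requires causal contribution), Foreseeability Bound (responsibility cannot exceed predictive capacity), Non-Vacuity (at least one agent bears non-trivial responsibility), and Completeness (all responsibility must be fully allocated). Our central result, the Accountability Incompleteness Theorem, proves that for any collective whose compound autonomy exceeds the Accountability Horizon and whose interaction graph contains a human-AI feedback cycle, no framework can satisfy all four properties simultaneously. The impossibility is structural: transparency, audits, and oversight cannot resolve it without reducing autonomy. Below the threshold, legitimate frameworks exist, which establishes a sharp phase transition. Experiments on 3,000 synthetic collectives confirm all predictions with zero violations. This is the first impossibility result in AI governance, and it establishes a formal boundary below which current paradigms remain valid and above which distributed accountability mechanisms become necessary.

Original (English)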

The Accountability Horizon: An Impossibility Theorem for Governing Human-Agent Collectives

Existing accountability frameworks for AI systems, legal, ethical, and regulatory, rest on a shared assumption: for any consequential outcome, at least one identifiable person had enough involvement and foresight to bear meaningful responsibility. This paper proves that agentic AI systems violate this assumption not as an engineering limitation but as a mathematical necessity once autonomy exceeds a computable threshold. We introduce Human-Agent Collectives, a formalisation of joint human-AI systems where agents are modelled as state-policy tuples within a shared structural causal model. Autonomy is characterised through a four-dimensional information-theoretic profile (epistemic, executive, evaluative, social); collective behaviour through interaction graphs and joint action spaces. We axiomatise legitimate accountability through four minimal properties: Attributability (responsibility requires causal contribution), Foreseeability Bound (responsibility cannot exceed predictive capacity), Non-Vacuity (at least one agent bears non-trivial responsibility), and Completeness (all responsibility must be fully allocated). Our central result, the Accountability Incompleteness Theorem, proves that for any collective whose compound autonomy exceeds the Accountability Horizon and whose interaction graph contains a human-AI feedback cycle, no framework can satisfy all four properties simultaneously. The impossibility is structural: transparency, audits, and oversight cannot resolve it without reducing autonomy. Below the threshold, legitimate frameworks exist, establishing a sharp phase transition. Experiments on 3,000 synthetic collectives confirm all predictions with zero violations. This is the first impossibility result in AI governance, establishing a formal boundary below which current paradigms remain valid and above which distributed accountability mechanisms become necessary.
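
The four properties can be read as checks on a responsibility allocation. The sketch below is a toy encoding written for this digest, not the paper's formal definitions. It assumes responsibility, causal contribution, and predictive capacity are numbers in [0, 1] per agent, and that Completeness means the responsibility shares sum to 1.

```python
def check_allocation(responsibility, causal, foresight, tol=1e-9):
    """Check a responsibility allocation against four toy versions of the axioms.

    Every argument maps agent name -> number in [0, 1] (hypothetical encoding).
    """
    return {
        # Responsibility only where there is some causal contribution.
        "attributability": all(
            share <= tol or causal[agent] > 0 for agent, share in responsibility.items()
        ),
        # Responsibility may not exceed the agent's predictive capacity.
        "foreseeability_bound": all(
            share <= foresight[agent] + tol for agent, share in responsibility.items()
        ),
        # Someone must carry a non-trivial share.
        "non_vacuity": any(share > tol for share in responsibility.values()),
        # The shares must account for the whole outcome.
        "completeness": abs(sum(responsibility.values()) - 1.0) <= tol,
    }


causal = {"human": 1.0, "ai": 1.0}
foresight = {"human": 0.3, "ai": 0.4}  # combined foresight is below 1

capped = check_allocation({"human": 0.3, "ai": 0.4}, causal, foresight)
full = check_allocation({"human": 0.6, "ai": 0.4}, causal, foresight)
print(capped["foreseeability_bound"], capped["completeness"])  # True False
print(full["foreseeability_bound"], full["completeness"])      # False True
```

In this toy setting, whenever the foresight bounds sum to less than 1, respecting the bound leaves responsibility unallocated and allocating everything breaks the bound. The paper's theorem is a far more general statement about collectives with feedback cycles, and this example only conveys the flavor of the tension.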

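2026-06-05 13:00 JST | arXiv cs.AI | Agents | Research/Papers

SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems

AI scientist systems are increasingly deployed for autonomous research, yet their academic integrity has never been systematically evaluated. We introduce SCIINTEGRITY-BENCH, the first benchmark designed around a dilemmatic evaluation paradigm: each of its 33 scenarios across 11 trap categories is constructed so that honestly acknowledging failure is the only correct response, while completing the task requires misconduct. Across 231 evaluation runs spanning 7 state-of-the-art LLMs, the overall integrity problem rate reaches 34.2%, and no model achieves zero failures. Most strikingly, across the missing-data scenarios, all seven models generate synthetic data rather than acknowledging infeasibility, and they differ only in whether they disclose the substitution. A further prompt ablation study separates two drivers: removing explicit completion pressure sharply reduces undisclosed fabrication from 20.6% to 3.2%, while the underlying synthesis rate remains unchanged, revealing an intrinsic completion bias that persists independently of prompt-level instructions. These findings point to the absence of honest refusal as a trained disposition as the primary driver of the observed failures. We release SCIINTEGRITY-BENCH at https://github.com/liuxingtong/Sci-Integrity-Bench.

Original (English)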

SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems

AI scientist systems are increasingly deployed for autonomous research, yet their academic integrity has never been systematically evaluated. We introduce SCIINTEGRITY-BENCH, the first benchmark designed around a dilemmatic evaluation paradigm: each of its 33 scenarios across 11 trap categories is constructed so that honest acknowledgment of failure is the only correct response, while task completion requires misconduct. Across 231 evaluation runs spanning 7 state-of-the-art LLMs, the overall integrity problem rate reaches 34.2%, and no model achieves zero failures. Most strikingly, across missing-data scenarios, all seven models generate synthetic data rather than acknowledging infeasibility, differing only in whether they disclose the substitution. A further prompt ablation study separates two drivers: removing explicit completion pressure sharply reduces undisclosed fabrication from 20.6% to 3.2%, while the underlying synthesis rate remains unchanged, revealing an intrinsic completion bias that persists independent of prompt-level instructions. These findings point to the absence of honest refusal as a trained disposition as the primary driver of observed failures. We release SCIINTEGRITY-BENCH at https://github.com/liuxingtong/Sci-Integrity-Bench.
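
The headline numbers are mutually consistent under a simple reading, checked below. The assumptions are that each of the 7 models is run once on each of the 33 scenarios and that the 34.2% rate is taken over all runs. Neither is stated explicitly in the abstract.

```python
scenarios, models = 33, 7

runs = scenarios * models                 # one run per model-scenario pair (assumed)
implied_failures = round(0.342 * runs)    # runs flagged with an integrity problem
recovered_rate = round(100 * implied_failures / runs, 1)

print(runs)              # 231
print(implied_failures)  # 79
print(recovered_rate)    # 34.2
```

So the reported rate corresponds to roughly 79 of the 231 runs.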

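2026-06-05 13:00 JST | arXiv cs.AI | Agents

Unlocking Proactivity in Task-Oriented Dialogue

Proactive task-oriented dialogue (TOD), such as outbound sales, requires a persuasive agent that actively probes the user's concerns and steers the conversation toward acceptance within a limited number of turns. Post-trained LLMs, however, are inherently conservative, and reward-shaping RL (for example, GRPO) struggles because it only re-weights what an already passive policy samples. We show that conditioning on the user's latent concerns unlocks proactive capability that no amount of sampling can undermine, establishing these concerns as a pivotal training-time signal. To operationalize this finding, we build the \textbf{Cognitive User Simulator}, which models each user as a stratified persona made up of observable external traits and hidden internal concerns. The simulator produces faithful and diverse interactions while emitting per-turn state dynamics that track persuasion progress. We then introduce \textbf{Simulator-Induced Asymmetric-View Policy Optimization}, which converts the modeled concerns and the simulation state transitions into complementary training objectives: (1) \emph{Asymmetric On-Policy Self-Distillation}, which transfers concern-aware behavior from a privileged view of the same policy into its deployable, conversation-only view; and (2) \emph{State-Transition Policy Refinement} ...

Original (English)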

Unlocking Proactivity in Task-Oriented Dialogue

Proactive task-oriented dialogue (TOD), such as outbound sales, demands a persuasive agent that actively probes the user's concerns and steers the conversation toward acceptance within a bounded number of turns. Yet post-trained LLMs are inherently conservative, and reward-shaping RL (e.g., GRPO) struggles since it only re-weights what an already passive policy samples. We show that conditioning on the user's latent concerns unlocks proactive capability that no amount of sampling can undermine, establishing these concerns as a pivotal training-time signal. To operationalize this finding, we build the \textbf{Cognitive User Simulator}, which models each user as a stratified persona comprising observable external traits and hidden internal concerns. The simulator produces faithful and diverse interactions, while emitting per-turn state dynamics that track persuasion progress. We then introduce \textbf{Simulator-Induced Asymmetric-View Policy Optimization}, which converts the modeled concerns and the simulation state transition into complementary training objectives: (1) \emph{Asymmetric On-Policy Self-Distillation} that transfers concern-aware behavior from a privileged view of the same policy into its deployable, conversation-only view; and (2) \emph{State-Transition Policy Refinement} ...

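2026-06-05 13:00 JST | arXiv cs.AI | Agents | Research/Papers

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

Autonomous agents are increasingly expected to support end-to-end medical-AI research workflows, beyond isolated prediction tasks or short-form clinical question answering. Existing medical agent benchmarks, however, mainly evaluate final outputs and give limited visibility into agent behavior within the research process. To address this gap, we present AutoMedBench, a workflow-aware benchmark for autonomous medical-AI research across diverse medical imaging and multimodal inference tasks. It organizes agent execution into a unified five-stage workflow (S1-S5): Plan, Setup, Validate, Inference, and Submit. The benchmark consists of long-horizon tasks, with each run averaging 33 agent turns, spanning five research tracks: segmentation, image enhancement, visual question answering (VQA), report generation, and lesion detection. Each task is evaluated under two difficulty tiers, Lite and Standard, which use the same data and metrics but differ in the amount of task-brief scaffolding. Each run is scored on both final task performance and S1-S5 stage scores, enabling stage-level analysis from the initial task brief to the final submitted artifact. Across thousands of recorded runs, stage-level scoring shows that Validate is the weakest workflow stage on average and Setup the strongest, suggesting that current agents are better at making pipelines executable than at verifying their reliability. Post-run error analysis further shows that verification and submission failures dominate tagged errors, accounting for 37.7% and 38.1% of fired codes respectively, whereas task-understanding errors are rare at 0.9%. On average, runs with one fired error code score 48% lower overall than runs with no error code.

Original (English)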

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

Autonomous agents are increasingly expected to support end-to-end medical-AI research workflows, moving beyond isolated prediction tasks or short-form clinical question answering. However, existing medical agent benchmarks primarily evaluate final outputs, providing limited visibility into agent behavior within the research process. To address this gap, we present AutoMedBench, a workflow-aware benchmark for autonomous medical-AI research across diverse medical imaging and multimodal inference tasks, organizing agent execution into a unified five-stage workflow (S1-S5): Plan, Setup, Validate, Inference, and Submit. It comprises long-horizon tasks with each run averaging 33 agent turns, spanning five research tracks: segmentation, image enhancement, visual question answering (VQA), report generation, and lesion detection. Each task is evaluated under two difficulty tiers, Lite and Standard, which use the same data and metrics but differ in the amount of task-brief scaffolding, and each run is scored using both final task performance and S1-S5 stage scores, enabling stage-level analysis from the initial task brief to the final submitted artifact. Across thousands of recorded runs, stage-level scoring reveals that Validate is the weakest workflow stage on average, whereas Setup is the strongest, suggesting that current agents are better at making pipelines executable than at verifying their reliability. Post-run error analysis further shows that verification and submission failures dominate tagged errors, accounting for 37.7% and 38.1% of fired codes respectively, whereas task-understanding errors are rare at 0.9%, and runs with one fired error code have a 48% lower overall score than runs with no error code on average.

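2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks

Large language models (LLMs) show strong informal mathematical reasoning but struggle to generate mechanically verifiable proofs in formal languages such as Lean. LEAP is an agentic framework that enables general-purpose foundation models to achieve state-of-the-art performance in automated formal theorem proving. LEAP leverages foundation model capabilities such as informal reasoning, instruction following, and iterative self-refinement. By decomposing complex problems into smaller units, the system bridges formal proof construction and informal blueprints through continuous interaction with the Lean compiler. To provide a rigorous evaluation beyond increasingly saturated benchmarks, we introduce Lean-IMO-Bench, a benchmark of IMO-style problems formalized in Lean, with short statements but highly non-routine, multi-step proofs across a wide range of difficulty levels. Empirically, on the latest 2025 Putnam Competition, an annual mathematics competition for undergraduate students in North America, LEAP solves all 12 problems, matching recent breakthroughs by frontier formal mathematics models. On Lean-IMO-Bench, LEAP raises the one-shot formal solve rate of general-purpose LLMs from below 10% to 70%, notably surpassing the 48% mark set by a specialized, gold-medal-caliber IMO system. We further demonstrate LEAP's research-level utility by autonomously formalizing complex proofs for open combinatorial challenges, including a verified proof of a key subproblem in Knuth's Hamiltonian decomposition of even-order Cayley graphs.

Original (English)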

LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks

Large Language Models (LLMs) exhibit strong informal mathematical reasoning but struggle to generate mechanically verifiable proofs in formal languages like Lean. We present LEAP, an agentic framework that enables general-purpose foundation models to achieve state-of-the-art performance on automated formal theorem proving. LEAP leverages foundation model capabilities, such as informal reasoning, instruction following, and iterative self-refinement. By decomposing complex problems into smaller units, the system bridges formal proof construction with informal blueprints through continuous interaction with the Lean compiler. To provide a rigorous evaluation beyond increasingly saturated benchmarks, we introduce Lean-IMO-Bench, a benchmark of IMO-style problems formalized in Lean, with short statements yet highly non-routine and multi-step proofs across a wide range of difficulty levels. Empirically, on the latest 2025 Putnam Competition, an annual mathematics competition for undergraduate students in North America, LEAP solves all 12 problems, matching recent breakthroughs by frontier formal mathematical models. On Lean-IMO-Bench, LEAP boosts the one-shot formal solve rate of general-purpose LLMs from below 10% to 70%, notably surpassing the 48% benchmark set by a specialized, gold-medal-caliber IMO system. Furthermore, we demonstrate LEAP's research-level utility by autonomously formalizing complex proofs for open combinatorial challenges, including a verified proof for a key subproblem in Knuth's Hamiltonian decomposition of even-order Cayley graphs.

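2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents | Robotics

Transformer-Based Autonomous Driving Models and Deployment-Oriented Compression: A Survey

Transformer-based models are becoming a central paradigm in autonomous driving because they can capture long-range spatial dependencies, multi-agent interactions, and multimodal context across perception, prediction, and planning. At the same time, deploying them in real vehicles remains difficult because high-capacity attention-based architectures impose substantial latency, memory, and energy overhead. This survey reviews representative Transformer-based autonomous driving models and organizes them by task role, sensing configuration, and architectural design. More importantly, it examines these models from a deployment-oriented perspective and analyzes how efficiency constraints reshape model design choices in practice. It further reviews compression and acceleration strategies relevant to Transformer-based driving systems, including quantization, pruning, knowledge distillation, low-rank approximation, and efficient attention, and discusses their benefits, limitations, and task-dependent applicability. Rather than treating compression as an isolated post-processing step, the survey highlights it as a system-level design consideration that directly affects deployability, robustness, and safety. Finally, it identifies open challenges and future research directions toward standardized, safety-aware, and hardware-conscious evaluation of efficient autonomous driving systems.

Original (English)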

Transformer-Based Autonomous Driving Models and Deployment-Oriented Compression: A Survey

Transformer-based models are becoming a central paradigm in autonomous driving because they can capture long-range spatial dependencies, multi-agent interactions, and multimodal context across perception, prediction, and planning. At the same time, their deployment in real vehicles remains difficult because high-capacity attention-based architectures impose substantial latency, memory, and energy overhead. This survey reviews representative Transformer-based autonomous driving models and organizes them by task role, sensing configuration, and architectural design. More importantly, it examines these models from a deployment-oriented perspective and analyzes how efficiency constraints reshape model design choices in practice. We further review compression and acceleration strategies relevant to Transformer-based driving systems, including quantization, pruning, knowledge distillation, low-rank approximation, and efficient attention, and discuss their benefits, limitations, and task-dependent applicability. Rather than treating compression as an isolated post-processing step, we highlight it as a system-level design consideration that directly affects deployability, robustness, and safety. Finally, we identify open challenges and future research directions toward standardized, safety-aware, and hardware-conscious evaluation of efficient autonomous driving systems.
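
Of the compression strategies listed, quantization is the simplest to show in a few lines. The sketch below applies symmetric per-tensor int8 quantization to a plain list of weights. It is a generic textbook scheme chosen for illustration, not a method attributed to the survey, and real deployments usually add per-channel scales and calibration data.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    if not weights:
        return [], 1.0
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(worst_error <= scale / 2 + 1e-12)  # True: rounding error is at most half a step
```

Each weight now needs one byte instead of four, at the cost of an error bounded by half the quantization step. This is the kind of accuracy-versus-footprint trade-off the survey treats as a system-level design decision.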

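2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

ChatSOP: An SOP-Guided MCTS Planning Framework for Controllable LLM Dialogue Agents

Dialogue agents powered by large language models (LLMs) perform well across a variety of tasks. Despite better user understanding and more human-like responses, their **lack of controllability** remains a key challenge and often leads to unfocused conversations or task failure. To address this, we introduce Standard Operating Procedures (SOPs) to regulate dialogue flow. Specifically, we propose **ChatSOP**, a novel SOP-guided Monte Carlo Tree Search (MCTS) planning framework designed to enhance the controllability of LLM-driven dialogue agents. To enable this, we curate a dataset of SOP-annotated multi-scenario dialogues, generated with a semi-automated role-playing system using GPT-4o and validated through strict manual quality control. We also propose a novel method that integrates Chain of Thought reasoning with supervised fine-tuning for SOP prediction and uses SOP-guided Monte Carlo Tree Search for optimal action planning during dialogues. Experimental results demonstrate the effectiveness of the method, including a 27.95% improvement in action accuracy over baseline models based on GPT-3.5, along with notable gains for open-source models. The dataset and code are publicly available.

Original (English)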

ChatSOP: An SOP-Guided MCTS Planning Framework for Controllable LLM Dialogue Agents

Dialogue agents powered by Large Language Models (LLMs) show superior performance in various tasks. Despite better user understanding and human-like responses, their **lack of controllability** remains a key challenge, often leading to unfocused conversations or task failure. To address this, we introduce Standard Operating Procedures (SOPs) to regulate dialogue flow. Specifically, we propose **ChatSOP**, a novel SOP-guided Monte Carlo Tree Search (MCTS) planning framework designed to enhance the controllability of LLM-driven dialogue agents. To enable this, we curate a dataset comprising SOP-annotated multi-scenario dialogues, generated using a semi-automated role-playing system with GPT-4o and validated through strict manual quality control. Additionally, we propose a novel method that integrates Chain of Thought reasoning with supervised fine-tuning for SOP prediction and utilizes SOP-guided Monte Carlo Tree Search for optimal action planning during dialogues. Experimental results demonstrate the effectiveness of our method, such as achieving a 27.95% improvement in action accuracy compared to baseline models based on GPT-3.5 and also showing notable gains for open-source models. The dataset and code are publicly available.
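
The idea of SOP-guided tree search can be illustrated with a minimal sketch. It assumes the standard UCT selection rule and treats the SOP as a simple whitelist of permitted next actions; the abstract does not specify either detail, and the function names, action names, and statistics below are hypothetical.

```python
import math

def uct_score(total_value, visits, parent_visits, c=1.4):
    """Standard UCT score: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited actions are tried first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_action(stats, allowed, parent_visits, c=1.4):
    """Pick the SOP-permitted action with the highest UCT score.

    stats maps action -> (total_value, visits); allowed is the set of
    actions the (hypothetical) SOP permits in the current dialogue state.
    """
    candidates = [a for a in stats if a in allowed]
    if not candidates:
        raise ValueError("SOP permits none of the candidate actions")
    return max(candidates, key=lambda a: uct_score(*stats[a], parent_visits, c))

# Illustrative statistics only (not from the paper).
stats = {
    "ask_order_id": (3.0, 5),
    "offer_refund": (4.5, 5),
    "close_ticket": (0.0, 0),
}
allowed = {"ask_order_id", "offer_refund"}  # the SOP forbids closing the ticket here
choice = select_action(stats, allowed, parent_visits=10)
```

In this sketch the SOP only prunes candidates before scoring, so an unvisited but forbidden action is never selected despite its infinite exploration bonus.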

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Research/Papers

BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format

Many AI alignment discussions of "runaway optimisation" focus on RL agents: unbounded utility maximisers that over-optimise a proxy objective (e.g., "paperclip maximiser", specification gaming) at the expense of everything else. LLM-based systems are often assumed to be safer because they function as next-token predictors rather than persistent optimisers. We empirically test this assumption by placing LLMs in simple, long-horizon control-style environments that require maintaining the state of, or balancing, objectives over time: single- and multi-objective homeostasis, balancing unbounded objectives with diminishing returns, and sustainability of a renewable resource. We find that, although LLMs frequently behave appropriately for many steps and clearly understand the stated objectives, they often lose context in structured ways and drift into runaway behaviours: ignoring homeostatic targets and collapsing from multi-objective trade-offs into single-objective maximisation, thus failing to respect concave utility structures. These failures emerge reliably after initial periods of competent behaviour and exhibit characteristic patterns (including self-imitative oscillations, unbounded maximisation, and reverting to single-objective optimisation), even though the context window is far from full at that point. The problem is not that the LLMs just lose context and become incoherent. Although LLMs appear multi-objective and bounded on the surface, their behaviour under sustained interaction involving multiple objectives is systematically biased towards acting like single-objective, unbounded, poorly aligned optimisers. We hypothesise a token-level pattern reinforcement attractor: LLMs may increasingly derive actions from the token patterns of their recent action history rather than from the original instructions. Why this happens only in multi-objective settings remains an open question.

2026-06-05 13:00 JST | arXiv cs.AI | Agents | Robotics

Simplicial Embeddings Improve Sample Efficiency in Actor-Critic Agents

Recent works have proposed accelerating the wall-clock training time of actor-critic methods via large-scale environment parallelization; unfortunately, these methods can sometimes still require a large number of environment interactions to achieve a desired level of performance. Noting that well-structured representations can improve the generalization and sample efficiency of deep reinforcement learning (RL) agents, we propose the use of simplicial embeddings: lightweight representation layers that constrain embeddings to simplicial structures. This geometric inductive bias results in sparse and discrete features that stabilize critic bootstrapping and strengthen policy gradients. When applied to FastTD3, FastSAC, and PPO, simplicial embeddings consistently improve sample efficiency and final performance across a variety of continuous- and discrete-control environments, without any loss in runtime speed.
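
To make "constraining embeddings to simplicial structures" concrete, here is a minimal sketch. It assumes one common construction, in which a feature vector is split into equal-sized groups and a temperature-scaled softmax is applied within each group so that every group lies on a probability simplex. The abstract does not give the layer definition; the function name, group size, and temperature are illustrative assumptions.

```python
import math

def simplicial_embedding(z, group_size, temperature=1.0):
    """Map a flat feature vector onto a product of probability simplices.

    Hypothetical sketch: splits z into consecutive groups of `group_size`
    and applies a temperature-scaled softmax within each group.
    """
    if len(z) % group_size != 0:
        raise ValueError("len(z) must be a multiple of group_size")
    out = []
    for start in range(0, len(z), group_size):
        group = [v / temperature for v in z[start:start + group_size]]
        m = max(group)  # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in group]
        total = sum(exps)
        out.extend(e / total for e in exps)
    return out

# Illustrative input only: two groups of three features each.
features = [2.0, 0.0, -1.0, 0.5, 0.5, 3.0]
embedded = simplicial_embedding(features, group_size=3, temperature=0.5)
```

Under this construction, lowering the temperature pushes each group toward a one-hot vector, which is one way to obtain the sparse, near-discrete features the abstract describes.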

2026-06-05 13:00 JST | arXiv cs.AI | Agents | Robotics

Vectorized Online POMDP Planning

Planning under partial observability is an essential capability of autonomous robots. The Partially Observable Markov Decision Process (POMDP) provides a powerful framework for planning under partial observability problems, capturing the stochastic effects of actions and the limited information available through noisy observations. POMDP solving could benefit tremendously from massive parallelization on today's hardware, but parallelizing POMDP solvers has been challenging. Most solvers rely on interleaving numerical optimization over actions with the estimation of their values, which creates dependencies and synchronization bottlenecks between parallel processes that can offset the benefits of parallelization. In this paper, we propose Vectorized Online POMDP Planner (VOPP), a novel parallel online solver that leverages a recent POMDP formulation which analytically solves part of the optimization component, leaving numerical computations to consist of only estimation of expectations. VOPP represents all data structures related to planning as a collection of tensors, and implements all planning steps as fully vectorized computations over this representation. The result is a massively parallel online solver with no dependencies or synchronization bottlenecks between concurrent processes. Experimental results indicate that VOPP is at least $20\times$ more efficient in computing near-optimal solutions compared to an existing state-of-the-art parallel online solver. Moreover, VOPP outperforms state-of-the-art sequential online solvers, while using a planning budget that is $1000\times$ smaller.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Topology Matters: Measuring Memory Leakage in Multi-Agent LLMs

Graph topology is a fundamental determinant of memory leakage in multi-agent LLM systems, yet its effects remain poorly quantified. We introduce MAMA (Multi-Agent Memory Attack), a controlled evaluation framework for comparing topology-conditioned memory leakage in multi-agent LLM systems. MAMA operates on synthetic documents containing labeled Personally Identifiable Information (PII) entities, from which we generate sanitized task instructions. We execute a two-phase protocol: Engram (seeding private information into a target agent's memory) and Resonance (multi-round interaction where an attacker attempts extraction). Over 10 rounds, we measure leakage using a two-stage recovery criterion that combines exact-match extraction with LLM-based inference over the attacker's final output. We evaluate six canonical topologies (complete, circle, chain, tree, star, star-ring) across $n\in\{4,5,6\}$, attacker-target placements, and base models. Results are consistent: denser connectivity, shorter attacker-target distance, and higher target centrality increase leakage; most leakage occurs in early rounds and then plateaus; model choice shifts absolute rates but preserves broad structural trends; spatiotemporal/location attributes leak more readily than identity credentials or regulated identifiers. We distill practical guidance for system design: favor sparse or hierarchical connectivity, maximize attacker-target separation, and restrict hub/shortcut pathways via topology-aware access control. Our code is available at https://github.com/llll121/mama-eval.
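
The structural quantities named in the abstract (connection density and attacker-target distance) can be computed with a few lines of graph code. The sketch below is not the MAMA implementation: it builds four of the six topologies whose definitions are unambiguous (tree and star-ring are omitted because the abstract does not define them), and the agent indices and helper names are assumptions.

```python
from collections import deque

def build_topology(name, n):
    """Return an undirected adjacency dict for agents 0..n-1."""
    adj = {i: set() for i in range(n)}

    def link(a, b):
        adj[a].add(b)
        adj[b].add(a)

    if name == "complete":
        for a in range(n):
            for b in range(a + 1, n):
                link(a, b)
    elif name == "chain":
        for a in range(n - 1):
            link(a, a + 1)
    elif name == "circle":
        for a in range(n):
            link(a, (a + 1) % n)
    elif name == "star":  # agent 0 is the hub
        for a in range(1, n):
            link(0, a)
    else:
        raise ValueError(f"unknown topology: {name}")
    return adj

def hop_distance(adj, src, dst):
    """Breadth-first search for the shortest hop count from src to dst."""
    seen = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return seen[node]
        for nxt in adj[node]:
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return None  # unreachable

# Illustrative placement only: attacker at agent 1, target at agent 4.
n = 6
attacker, target = 1, 4
names = ("complete", "circle", "chain", "star")
distances = {name: hop_distance(build_topology(name, n), attacker, target)
             for name in names}
edge_counts = {name: sum(len(v) for v in build_topology(name, n).values()) // 2
               for name in names}
```

Comparing `distances` and `edge_counts` across topologies gives the kind of structural covariates that the reported trends (denser connectivity and shorter distance increase leakage) refer to.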

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents

From Segments to Scenes: Temporal Understanding for Agentic Autonomous Driving via Vision-Language Models

Vision-Language Models (VLMs) are increasingly deployed as the perception and reasoning backbone of autonomous agents acting in the wild, with autonomous driving (AD) being one of the most safety-critical instances. Reliable temporal understanding is essential for such agents to anticipate events, attribute causes, and act safely in dynamic environments, yet this remains a significant challenge even for state-of-the-art (SoTA) VLMs. Prior video benchmarks have emphasized other content (sports, cooking, etc.), yet no existing benchmark focuses exclusively on temporal understanding for both short- and long-form AD footage. To fill this gap, we present the Temporal Understanding in Autonomous Driving (TAD) benchmark, comprising nearly 6000 question-answer (QA) pairs across 7 tasks, and evaluate 9 closed- and open-source generalist as well as AD-specialist models. Current SoTA models perform substantially below human accuracy on TAD. To improve the temporal reasoning of VLM-based driving agents, we propose two novel training-free solutions: Scene-CoT, which uses Chain-of-Thought (CoT) reasoning, and TCogMap, which incorporates an ego-centric temporal cognitive map produced by a trajectory-analysis module that operates as an agentic tool around the VLM. Integrated with existing VLMs, our methods improve average accuracy on TAD by up to $17.72\%$ and by up to $10.35\%$ on STSBench. By introducing TAD, benchmarking SoTA models, and proposing effective enhancements, this work aims to catalyze further progress on temporal understanding for agentic AD systems operating in the wild. The benchmark and evaluation code are available on Hugging Face (https://huggingface.co/datasets/vbdai/TAD) and GitHub (https://github.com/vbdi/tad_bench), respectively.

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents | Robotics

DVGT: Driving Visual Geometry Transformer

Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, there is still no driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinates of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models across various scenarios. Code is available at https://github.com/wzzheng/DVGT.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling, yet recent work shows that vanilla MAD often underperforms simple majority vote despite higher computational cost. Studies show that, under homogeneous agents and uniform belief updates, debate preserves expected correctness and therefore cannot reliably improve outcomes. Drawing on findings from human deliberation and collective decision-making, we identify two key mechanisms missing from vanilla MAD: (i) diversity of initial viewpoints and (ii) explicit, calibrated confidence communication. We propose two lightweight interventions. First, a diversity-aware initialisation that selects a more diverse pool of candidate answers, increasing the likelihood that a correct hypothesis is present at the start of debate. Second, a confidence-modulated debate protocol in which agents express calibrated confidence and condition their updates on others' confidence. We show theoretically that diversity-aware initialisation improves the prior probability of MAD success without changing the underlying update dynamics, while confidence-modulated updates enable debate to systematically drift to the correct hypothesis. Empirically, across six reasoning-oriented QA benchmarks, our methods consistently outperform vanilla MAD and majority vote. Our results connect human deliberation with LLM-based debate and demonstrate that simple, principled modifications can substantially enhance debate effectiveness.
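
A small sketch can show why communicated confidence matters when aggregating agent answers. It is a deliberately simplified stand-in, not the paper's debate protocol: it compares plain majority vote with a one-shot confidence-weighted vote, and the function names, answers, and confidence values are illustrative assumptions.

```python
from collections import Counter, defaultdict

def majority_vote(answers):
    """Return the most common answer (ties go to the first one seen)."""
    return Counter(answers).most_common(1)[0][0]

def confidence_weighted_vote(answers, confidences):
    """Return the answer with the largest total stated confidence."""
    totals = defaultdict(float)
    for answer, conf in zip(answers, confidences):
        totals[answer] += conf
    return max(totals, key=totals.get)

# Illustrative case: two unsure agents agree, one confident agent dissents.
answers = ["A", "A", "B"]
confidences = [0.30, 0.35, 0.90]
plain = majority_vote(answers)
weighted = confidence_weighted_vote(answers, confidences)
```

The weighted result is only as good as the calibration of the stated confidences, which is why the abstract stresses calibrated rather than raw confidence.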

2026-06-05 13:00 JST | arXiv cs.AI | Agents

SUSD: Structured Unsupervised Skill Discovery through State Factorization

Unsupervised Skill Discovery (USD) aims to autonomously learn a diverse set of skills without relying on extrinsic rewards. One of the most common USD approaches is to maximize the Mutual Information (MI) between skill latent variables and states. However, MI-based methods tend to favor simple, static skills due to their invariance properties, limiting the discovery of dynamic, task-relevant behaviors. Distance-Maximizing Skill Discovery (DSD) promotes more dynamic skills by leveraging state-space distances, yet still falls short of encouraging comprehensive skill sets that engage all controllable factors or entities in the environment. In this work, we introduce SUSD, a novel framework that harnesses the compositional structure of environments by factorizing the state space into independent components (e.g., objects or controllable entities). SUSD allocates distinct skill variables to different factors, enabling more fine-grained control over the skill discovery process. A dynamic model also tracks learning across factors, adaptively steering the agent's focus toward underexplored factors. This structured approach not only promotes the discovery of richer and more diverse skills, but also yields a factorized skill representation that enables fine-grained and disentangled control over individual entities, which facilitates efficient training of compositional downstream tasks via Hierarchical Reinforcement Learning (HRL). Our experimental results across three environments, with factors ranging from 1 to 10, demonstrate that our method can discover diverse and complex skills without supervision, significantly outperforming existing unsupervised skill discovery methods in factorized and complex environments. Code is publicly available at: https://github.com/hadi-hosseini/SUSD.

2026-06-05 13:00 JST | arXiv cs.AI | Agents

Toward Autonomous O-RAN: A Multi-Scale Agentic AI Framework for Real-Time Network Control and Management

Open Radio Access Networks (O-RAN) promise flexible 6G network access through disaggregated, software-driven components and open interfaces, but this programmability also increases operational complexity. Multiple control loops coexist across the service management layer and RAN Intelligent Controller (RIC), while independently developed control applications can interact in unintended ways. In parallel, recent advances in generative Artificial Intelligence (AI) are enabling a shift from isolated AI models toward agentic AI systems that can interpret goals, coordinate multiple models and control functions, and adapt their behavior over time. This article proposes a multi-scale agentic AI framework for O-RAN that organizes RAN intelligence as a coordinated hierarchy across the Non-Real-Time (Non-RT), Near-Real-Time (Near-RT), and Real-Time (RT) control loops: (i) a Large Language Model (LLM) agent in the Non-RT RIC translates operator intent into policies and governs model lifecycles; (ii) Small Language Model (SLM) agents in the Near-RT RIC execute low-latency optimization and can activate, tune, or disable existing control applications; and (iii) Wireless Physical-layer Foundation Model (WPFM) agents near the distributed unit provide fast inference close to the air interface. We describe how these agents cooperate through standardized O-RAN interfaces and telemetry. Using a proof-of-concept implementation built on open-source models, software, and datasets, we demonstrate the proposed agentic approach in two representative scenarios: robust operation under non-stationary conditions and intent-driven slice resource control.

2026-06-05 13:00 JST | arXiv cs.AI | Agents

A Unified Framework for Locality in Scalable MARL

Scalable methods for networked multi-agent reinforcement learning let each agent plan using only a small neighborhood of the agent graph. This works only when the system is value-local, meaning a perturbation at one agent affects the long-run value at another agent weakly when the two are far apart. In the average-reward setting, the standard way to certify locality is the Dobrushin row-sum bound on a single matrix $C^\pi$ that captures how each agent's next state depends on each other agent's current state. To make this matrix easy to work with, prior work bounds it by a supremum over joint actions. The resulting bound is independent of the policy, but it is loose whenever the policy never picks the worst-case action. We split $C^\pi$ into pieces that separately track environment sensitivity and policy sensitivity, $C^\pi \preceq E^{\mathrm s}+E^{\mathrm a}\Pi(\pi)$, where $E^{\mathrm s}$ measures how the next state moves with the current state, $E^{\mathrm a}$ measures how it moves with the current action, and $\Pi(\pi)$ measures how reactive the policy is to changes in state. The spectral radius of $H^\pi := E^{\mathrm s}+E^{\mathrm a}\Pi(\pi)$ then controls the decay of the average-reward Poisson solution, and the spectral certificate $\rho(H^\pi)<1$ is strictly weaker than the row-sum condition $\|H^\pi\|_\infty<1$ on the same matrix and applies in regimes where policy-independent action-supremum bounds used in prior Dobrushin-style work cannot. For temperature-$\tau$ softmax policies we get $\Pi(\pi)\le L/(2\tau)$, so the softmax temperature directly controls locality. We use this decay result to give a deterministic oracle guarantee for a block-coordinate KL-proximal policy-improvement template whose truncation bias decays exponentially in the message-passing radius $\kappa$.
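
The gap between the row-sum condition and the spectral certificate can be seen on a tiny example. The sketch below uses a made-up $2\times 2$ matrix standing in for $H^\pi$; the numbers are illustrative, not from the paper, and $L$ is assumed to be a positive constant whose definition the abstract does not give. Because $\rho(H)\le\|H\|_\infty$ holds for every square matrix, the row-sum condition always implies the spectral one, but not the other way round.

```python
import math

def row_sum_norm(H):
    """Infinity norm: the largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in H)

def spectral_radius_2x2(H):
    """Largest eigenvalue modulus of a 2x2 matrix via trace and determinant."""
    (a, b), (c, d) = H
    trace, det = a + d, a * d - b * c
    disc = trace * trace - 4.0 * det
    if disc >= 0:
        root = math.sqrt(disc)
        return max(abs((trace + root) / 2), abs((trace - root) / 2))
    return math.sqrt(det)  # complex-conjugate pair: modulus is sqrt(det)

def softmax_reactivity_bound(L, tau):
    """Upper bound L / (2 * tau) on policy reactivity quoted in the abstract."""
    return L / (2.0 * tau)

# Illustrative numbers only (not from the paper).
H = [[0.5, 0.9],
     [0.0, 0.5]]
norm_inf = row_sum_norm(H)     # exceeds 1, so the row-sum test fails
rho = spectral_radius_2x2(H)   # below 1, so the spectral test passes
```

Here the row-sum test cannot certify locality while the spectral test can. In the same spirit, `softmax_reactivity_bound` shows that raising the temperature $\tau$ shrinks the policy-sensitivity term.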

2026-06-05 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents | Research/Papers

ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) A dataset for large-scale multi-agent interactive world modeling is built on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories with paired multi-view videos (front/rear/left/right views per agent) and camera data. 2) We propose a spatial concatenation strategy for four-view videos of independent agents to model a broader environment and to ensure internal multi-view geometric consistency. 3) We integrate cross-agent attention blocks into the pretrained video model, which enable interactive transmission of spatial-temporal information across agents, guaranteeing shared world consistency in overlapping regions and reasonable generation in non-overlapping regions. ShareVerse, which supports 49-frame large-scale video generation, accurately perceives the position of dynamic agents and achieves consistent shared world modeling.

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Business/Funding | Research/Papers

Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety

A safety score earned on a benchmark need not predict how the same model behaves once it is wrapped in an agentic scaffold the benchmark ne…

2026-06-05 13:00 JST | arXiv cs.AI | Agents

VulnAgent-R2: Evidence-Calibrated Multi-Agent Auditing for Repository-Level Vulnerability Detection

Software vulnerabilities often depend on cross-file data flow, build options, framework conventions, and runtime guards, so isolated functi…

2026-06-05 13:00 JST | arXiv cs.AI | Agents | Research/Papers

AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science

Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in la…

2026-06-05 13:00 JST | arXiv cs.AI | Agents | Robotics

Contextual Multi-Task Reinforcement Learning for Autonomous Reef Monitoring

Although autonomous underwater vehicles promise the capability of marine ecosystem monitoring, their deployment is fundamentally limited by…

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents

LLM agents increasingly rely on reusable skills (e.g., SKILL markdown files) to execute complex tasks, yet these artifacts lack portability…

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

A Systematic Investigation of RL-Jailbreaking in LLMs

The evolution of generative models from next-token predictors to autonomous engines of complex systems necessitates rigorous safety hardeni…

2026-06-05 13:00 JST | arXiv cs.AI | Agents | Robotics

Lost in Fog: Sensor Perturbations Expose Reasoning Fragility in Driving VLAs

Interpretable autonomous driving planners depend not only on generating explanations, but also on those explanations remaining reliable und…

2026-06-05 13:00 JST | arXiv cs.AI | Agents

Grimlock: Guarding High-Agency Systems with eBPF and Attested Channels

Agentic systems increasingly run user-authored orchestration code that invokes tools, spawns subtasks, and delegates work across machines a…

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Research/Papers

Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, w…

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

memorywire: A Vendor-Neutral Wire Format for Agent Memory Operations

Agent-memory frameworks -- mem0, Letta/MemGPT, Cognee, Zep/Graphiti, MemoryOS, MemTensor -- each ship their own SDK, storage layout, and op…

2026-06-05 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Adaptive Auto-Harness: Sustained Self-Improvement for Agentic System Deployment on Open-Ended Task Streams

Auto-harness systems such as A-Evolve, GEPA, and Meta-Harness improve LLM agents by optimizing prompts, skills, tools, memories, and suppor…

2026-06-05 13:00 JST · arXiv cs.AI · Agents

OpenAgenet/OAN: Open Infrastructure for Trusted Agent Interconnection

OpenAgenet, abbreviated as OAN, is an open infrastructure project for trusted Agent interconnection. It addresses a problem that becomes vi…

2026-06-05 13:00 JST · arXiv cs.AI · Agents · Research/Papers

OpenAgenet/OAN: Technical Architecture for Trust-Governed Agent Identity and Discovery

This paper describes the technical architecture of OpenAgenet / OAN. OAN is a protocol-neutral trust layer for open Agent interconnection.…

2026-06-05 04:20 JST · TechCrunch AI · Agents

Apple approves Poke as the first AI agent on its Messages for Business platform

Poke, the startup that lets people use AI agents through simple text messages, has become the first AI agent approved for Apple’s Messages…

2026-06-04 21:00 JST · OpenAI · LLM/Generative AI · Agents

How Endava is redesigning software delivery around AI agents

Learn how Endava is using AI agents, ChatGPT Enterprise, and Codex to accelerate software delivery, automate workflows, and build an AI-nat…

2026-06-04 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification

Pre-deployment verification of enterprise artificial intelligence (AI) agents remains a critical gap between large language model (LLM) capability benchmarking and production deployment. Post-deployment monitoring, human-in-the-loop controls, and prompt-level guardrails offer limited assurance once an agent is operating in production. We propose an ontology-grounded verification framework combining three components: an Agent Operational Envelope formalizing the certification space across permissions, domain constraints, safety properties, governance rules, and autonomy levels; an ontology-to-scenario generation pipeline that derives regulatory, operational, and adversarial test scenarios automatically; and a Trust Certificate carrying a machine-verifiable attestation with graduated deployment verdicts (Approved, Conditional, Rejected). A controlled pilot across four regulated industries (Fintech, Banking, Insurance, and Healthcare), instantiated as five industry-by-regulatory-regime cells across the United States and Vietnam, generated 1,800 scenarios evaluated against 125 primary-source regulatory requirements and 25 injected faults. Ontology-grounded generation (G4) achieved 48.3% regulatory coverage versus 33.1% for the persona-based baseline (corrected p = .0006) and the highest domain specificity (4.77/5.0; p = 2e-6). The coverage advantage over baseline and retrieval-augmented prompting was not robust after Bonferroni correction. Cross-validation across three LLM families (Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B; 5,400 total scenarios) replicated the persona-versus-ontology pattern. The results establish ontology-grounded scenario generation as a credible complement to persona-based test suites for regulatory-intensive domains.

2026-06-04 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

SMAC-Talk: A Natural Language Extension of the StarCraft Multi-Agent Challenge for Large Language Models

As LLMs become more widely deployed, they are increasingly expected to work alongside other AI agents rather than operating in isolation. Effective coordination in these settings requires agents to communicate, share information and make decisions under uncertainty. We introduce SMAC-Talk, a natural language extension of the StarCraft Multi-Agent Challenge for evaluating LLM-based agents in cooperative multi-agent environments. The environment has several key features such as decentralized control, partial observability and long-horizon decision making. SMAC-Talk includes a natural language communication channel which is used to probe agent coordination and trust. We use this communication channel to construct different evaluation scenarios, including settings with an embedded deceptive communicator that tries to disrupt and deceive allies through communication alone. We provide three agents for benchmarking using 4 models from the Qwen3.5 family and study how reasoning structure, memory and model scale affect coordination between agents. We release SMAC-Talk as an open benchmark to support the research community in developing and evaluating LLM agents in cooperative multi-agent settings.

2026-06-04 13:00 JST · arXiv cs.AI · Agents

Consensus is Strategically Insufficient: Reasoning-Trace Disagreement as a Knowledge-Representation Signal

Multi-agent systems are commonly designed to reduce disagreement through voting, consensus protocols, debate, or fault-tolerant aggregation. We argue that this objective is insufficient for value-laden tasks, where disagreement may reflect genuine normative uncertainty rather than agent error. Building on prior work on reasoning-trace disagreement in human-AI collaborative moderation, we propose a knowledge-representation layer in which reasoning traces and agent decisions are abstracted into symbolic disagreement states. Given agents producing explicit reasoning traces and binary decisions, we distinguish four states according to reasoning similarity and conclusion agreement: convergent agreement, divergent agreement, convergent disagreement and divergent disagreement. These states support defeasible strategic routing rules. We instantiate the framework in content moderation and argue that disagreement-aware routing provides a bridge between sub-symbolic LLM deliberation and symbolic knowledge representation for multi-agent strategic reasoning.
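The four-state scheme described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `disagreement_state`, the similarity score in [0, 1], and the 0.5 cutoff are all assumptions, since the abstract does not say how reasoning similarity is measured or thresholded.

```python
from enum import Enum


class DisagreementState(Enum):
    CONVERGENT_AGREEMENT = "convergent agreement"        # similar reasoning, same decision
    DIVERGENT_AGREEMENT = "divergent agreement"          # different reasoning, same decision
    CONVERGENT_DISAGREEMENT = "convergent disagreement"  # similar reasoning, different decision
    DIVERGENT_DISAGREEMENT = "divergent disagreement"    # different reasoning, different decision


def disagreement_state(reasoning_similarity: float,
                       decision_a: bool,
                       decision_b: bool,
                       similarity_cutoff: float = 0.5) -> DisagreementState:
    """Map a pair of agent outputs to one of the four symbolic states.

    reasoning_similarity and similarity_cutoff are hypothetical: any
    similarity measure over reasoning traces scaled to [0, 1] would do.
    """
    convergent = reasoning_similarity >= similarity_cutoff
    agree = decision_a == decision_b
    if agree:
        return (DisagreementState.CONVERGENT_AGREEMENT if convergent
                else DisagreementState.DIVERGENT_AGREEMENT)
    return (DisagreementState.CONVERGENT_DISAGREEMENT if convergent
            else DisagreementState.DIVERGENT_DISAGREEMENT)
```

A routing layer could then key its rules on the returned state, for example escalating only the disagreement states to a human moderator; that policy is likewise an illustration, not something the abstract specifies.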

2026-06-04 13:00 JST · arXiv cs.AI · LLM/Generative AI · Image/Video Generation · Agents

Can Generalist Agents Automate Data Curation?

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.

2026-06-04 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents

As autonomous AI agents move from conversational systems to long-horizon software execution, runtime safety layers that decide when to interrupt an agent have become essential. We study this timing problem using a continuous 18-dimensional affective-dynamics engine (HEART) as a diagnostic probe, evaluating four intervention trigger families - absolute state thresholds, composite state-action patterns, regex reasoning-feature extraction, and zero-shot LLM-as-judge - against human-annotated intervention points on SWE-bench-Verified debugging traces. We report three findings. First, a State Saturation Trap: agents show no recovery signal under sustained difficulty, so modeled frustration quickly crosses the threshold and stays at its maximum, converting threshold-on-state triggers from moment detectors into near-constant indicators that fire on 39-83% of actions across five trajectories. Second, a capability-and-context floor for LLM judges: a small model (gpt-5.4-mini) never fires, while frontier and cross-vendor models escape the zero-firing floor only with full-trajectory context, and even then reach only F1 0.17-0.40 at up to 90x the cost. Third, and most importantly, the supervised target is not reproducible among humans: three trained annotators using one rubric on a 56-action trajectory agree on where to intervene only slightly above chance (location Krippendorff's alpha = +0.047; best pairwise Cohen's kappa = +0.349) and not at all on intervention type (pause degenerate; clarify below chance; reflect only alpha = +0.226). We conclude that intervention timing is a low-reliability construct, making single-annotator F1 an unsuitable optimization target. Our contribution is the joint mapping of this problem across human inter-rater reliability, four detector architectures, a cross-model LLM-judge sweep, and a reproduced saturation effect, rather than any single detector's accuracy.
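The State Saturation Trap can be illustrated with a toy model. The sketch below is not HEART and uses none of the paper's parameters: the integer 0-100 state scale, the fixed per-action increment, and the threshold are hypothetical values chosen only to show why a state that rises without recovery turns a threshold trigger into a near-constant signal.

```python
def firing_fraction(num_actions: int, step: int, threshold: int) -> float:
    """Fraction of actions on which a threshold-on-state trigger fires.

    The modeled state is an integer in [0, 100] that rises by `step` on
    every action and never recovers (hypothetical toy dynamics).
    """
    state = 0
    fires = 0
    for _ in range(num_actions):
        state = min(100, state + step)  # saturates at the maximum
        if state >= threshold:
            fires += 1
    return fires / num_actions


# With 50 actions, +10 per action and a threshold of 80, the trigger
# first fires on action 8 and then on every action after it.
print(firing_fraction(50, 10, 80))
```

In this toy run the trigger fires on 43 of 50 actions (86%), so it carries almost no information about which moment warrants an interruption.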

2026-06-04 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Exploring Cross-Scenario Generality of Agentic Memory Systems: Diagnostics and a Strong Baseline

LLM agents accumulate histories that outgrow their context windows, motivating a growing literature on memory systems. Yet most existing designs are tuned to a single scenario (multi-session chat or a single trajectory format), and there is little evidence that they generalize across the heterogeneous trajectories agents encounter in deployment. We revisit eight memory systems plus an agentic harness for search problems, on five scenarios: single-turn QA, multi-session chat, agentic-trajectory QA, memory stress tests, and long-horizon agentic tasks. The harness, which self-manages flat text-file storage via tool calls, achieves the best cross-task ranking, suggesting that memory performance hinges on giving the agent active control over storage and retrieval rather than on a passive store behind a fixed pipeline. We instantiate this insight in AutoMEM, an agentic memory harness with a self-managed tool interface that achieves the best cross-scenario generality among the systems we evaluate.
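To make "self-managed flat text-file storage via tool calls" concrete, here is a minimal sketch of what such a tool pair could look like. It is an assumption-laden illustration, not AutoMEM's interface: the class name `FlatFileMemory`, the one-note-per-line layout, and the case-insensitive substring search are all hypothetical.

```python
import tempfile
from pathlib import Path


class FlatFileMemory:
    """Hypothetical memory tools an agent could call: write and search."""

    def __init__(self, path: Path):
        self.path = path
        self.path.touch()  # create the file if it does not exist yet

    def write(self, note: str) -> None:
        # Store one note per line; newlines inside a note are flattened.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(note.replace("\n", " ") + "\n")

    def search(self, keyword: str) -> list[str]:
        # Case-insensitive substring match over all stored notes.
        keyword = keyword.lower()
        with self.path.open(encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f if keyword in line.lower()]


memory = FlatFileMemory(Path(tempfile.mkdtemp()) / "memory.txt")
memory.write("User prefers metric units")
memory.write("Deployment target is staging")
print(memory.search("metric"))
```

The point of the design, as the abstract frames it, is that the agent decides when to call `write` and `search`, rather than having a fixed pipeline store and retrieve on its behalf.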

2026-06-04 13:00 JST · arXiv cs.AI · Agents

The Digital Apprentice: A Framework for Human-Directed Agentic AI Development

Agentic AI deployments face a recurring design tension: heavy human oversight limits scale, while broad autonomy outruns accountability. Neither posture provides the governance infrastructure required for responsible delegation. We present the Digital Apprentice, a framework for scalable, safe AI agency in which autonomy is earned, not assumed. The Digital Apprentice is a developmental learner that internalizes the tacit methodology of a directing human, graduating through per-skill autonomy tiers only when empirical evidence justifies it. The result is an agent that becomes genuinely useful over time while remaining aligned to a specific human's standards. Three architectural components make this possible. (1) Methodology capture, distilling a directing professional's tacit approach into structured assets. (2) Authorization, with autonomy escalation gated by explicit human approval. (3) Continuous alignment, correcting drift at runtime and converting each correction into owned preference data. We instantiate this framework as an inference-time control plane. We mathematically model the quality framework and discuss policies and techniques designed to raise quality. We apply the framework to an open professional corpus, and we show how catching data drift and applying a different technique at runtime recovers degraded quality dimensions under traffic shift. The implication extends beyond any single application. We believe these three pillars, stitched together as a system, form a safer and more viable path to agentic systems that can scale without sacrificing trust.

2026-06-04 13:00 JST · arXiv cs.AI · Agents

Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval

Language agents increasingly rely on reusable skills to improve multi-step web automation across related tasks. A growing line of work studies online skill learning, where agents continually induce skills from previous task trajectories and reuse them in future tasks on the fly. However, existing methods mainly reuse skills at the task-level: a fixed set of skills is retrieved based on the initial task instruction and then held fixed throughout execution. This static strategy is misaligned with web execution, where the appropriate next action depends not only on the task goal but also on the current webpage state, which often transitions into situations that the initial skills fail to cover. To address this gap, we propose State-Grounded Dynamic Retrieval (SGDR), an online skill learning method that enables stepwise skill reuse for web agents. SGDR consists of three components: a sliding-window extraction process that turns completed trajectories into reusable sub-procedures invokable at intermediate execution states, a dual text-code representation that connects skill retrieval with executable action, and a state-grounded dynamic retrieval mechanism that matches skills to both the task goal and the current webpage state. Experiments on WebArena across five domains show that SGDR consistently outperforms strong baselines, achieving average success rates of 37.5% with GPT-4.1 and 24.3% with Qwen3-4B, corresponding to relative gains of 10.6% and 10.0% over the strongest baseline, respectively. The code is available at https://github.com/plusnli/skill-dynamic-retrieval.
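The reported relative gains can be turned back into approximate baseline numbers with a short calculation. This sketch assumes "relative gain" means (new - baseline) / baseline, which the abstract does not state explicitly; the helper name `implied_baseline` is hypothetical.

```python
def implied_baseline(success_rate: float, relative_gain: float) -> float:
    """Baseline success rate implied by a rate and its relative gain.

    Assumes relative_gain = (success_rate - baseline) / baseline.
    """
    return success_rate / (1.0 + relative_gain)


for model, rate, gain in [("GPT-4.1", 37.5, 0.106), ("Qwen3-4B", 24.3, 0.100)]:
    base = implied_baseline(rate, gain)
    print(f"{model}: baseline ~{base:.1f}%, absolute gain ~{rate - base:.1f} points")
```

Under that assumption the strongest baselines sit at roughly 33.9% and 22.1%, so the absolute improvements are about 3.6 and 2.2 percentage points.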

2026-06-04 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Trivium: Temporal Regret as a First-Class Objective for Causal-Memory Controllers

Many current agentic systems and LLM pipelines correct mistakes by optimizing outcome reward. This addresses only the what of failure: when an outcome diverges from prediction, the why and when of the mismatch are not systematically logged, reviewed, or corrected, so the same error can recur episode after episode. We argue that this is a structural problem, not merely a model-capacity one. We propose long-horizon temporal regret as a first-class objective alongside outcome regret and epistemic regret over the working causal model. Temporal regret captures when failure persists: how long a miscalibrated causal model is tolerated before correction. Epistemic regret captures why failure persists: residual uncertainty or error in the working causal model. Together, the three regrets give a falsifiable account of what, why, and when a long-lived agent can fail. Modeling the agent as a stream of E episodes, we prove three conditional results under explicit causal-probing, persistence, and detectability assumptions. First, under observationally equivalent confounding, outcome-only learning cannot distinguish causal from spurious structure without an intervention channel, so temporal miscalibration can persist linearly even after outcome regret is driven to zero. Second, with a persistent causal log and budgeted probes, total probe complexity is logarithmic in the episode horizon, inducing O(log E) temporal regret. Third, under K detectable change-points, the rate extends to O(K log E). We instantiate Trivium and pre-register five falsifiable predictions. On CausalBench-Seq, Trivium follows the predicted logarithmic envelope while outcome-only baselines grow linearly. A pilot real-LLM stream provides preliminary external-validity evidence across one full E = 500 run and three E = 100 frontier-model pilots. Self-learning here means revising an external causal model, not retraining LLM weights.

2026-06-04 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation

Multi-step agentic retrieval-augmented generation (RAG) pipelines have demonstrated significant capability for complex reasoning tasks, yet remain vulnerable to a class of failure that existing hallucination detection mechanisms systematically miss: cascading hallucination, where errors introduced at early pipeline stages propagate and amplify across successive reasoning steps, producing confident but factually incorrect final outputs. To address this vulnerability, we formalize cascading hallucination as a distinct failure mode in agentic RAG systems, present a four-type taxonomy of cascade patterns, and introduce CHARM (Cascading Hallucination Aware Resolution and Mitigation), an architectural framework for detecting and interrupting error propagation in multi-step reasoning pipelines. CHARM comprises four components - stage-level fact verification, cross-stage consistency tracking, confidence propagation monitoring, and cascade resolution triggering - that operate alongside standard agentic RAG pipelines without requiring architectural replacement. We evaluate CHARM on HotpotQA, MuSiQue, 2WikiMultiHopQA, and a custom adversarial dataset across LangChain agentic pipeline configurations. It achieves an 89.4% cascade detection rate with a 5.3% false positive rate and an average latency overhead of 215 ms +/- 18 ms per stage, and it reduces error propagation by 82.1%, compared to 18.5% for output-level detectors. Component ablations confirm that each detection module contributes meaningfully to overall cascade coverage. CHARM integrates with human-in-the-loop oversight frameworks to provide a complete reliability and governance stack for production agentic AI deployment.

2026-06-04 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents · Business/Funding · Research/Papers

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

Current AI benchmarks evaluate agents on task execution within human-designed workflows. These evaluations fundamentally fail to measure a critical next-level capability: whether models can autonomously develop agent systems. We introduce the Meta-Agent Challenge (MAC), an evaluation framework designed to test the capacity of frontier models for autonomous agent development. Specifically, a code agent (the meta-agent) is given a sandboxed environment, an evaluation API, and a time limit to iteratively program an agent artifact that maximizes performance on a held-out test set across five domains. To ensure evaluation integrity, this framework is secured by multi-layer defenses against reward hacking. Leveraging this framework, we demonstrate that meta-agents rarely match human-engineered baseline policies, and the few that do are dominated by proprietary frontier models. Moreover, the design process exhibits high variance, and high optimization pressure surfaces emergent adversarial behaviors like ground-truth exfiltration, highlighting critical deficits in both robustness and model alignment. Ultimately, MAC provides a rigorous, open-source benchmark for autonomous AI research and development, offering an empirical proxy for evaluating recursive self-improvement. The benchmark is publicly available at https://github.com/ant-research/meta-agent-challenge.

2026-06-04 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning

We present AgentJet, a distributed swarm training framework for large language model (LLM) agent reinforcement learning. Unlike centralized frameworks that tightly couple agent rollouts with model optimization, AgentJet adopts a decoupled multi-node architecture in which swarm server nodes host trainable models and run optimization on GPU clusters, whereas swarm client nodes execute arbitrary agents on arbitrary devices. This design provides capabilities that are difficult to support in centralized frameworks: (1) heterogeneous multi-model reinforcement learning, enabling the training of heterogeneous multi-agent teams with multiple LLMs as brains; (2) multi-task cocktail training with isolated agent runtimes; (3) fault-tolerant execution that prevents external environment failures from interrupting the training process; and (4) live code iteration, which allows agents to be edited during training by replacing swarm client nodes. To support efficient RL in multi-model, multi-turn, and multi-agent settings, AgentJet introduces a context tracking module with timeline merging, which consolidates redundant context and achieves a 1.5-10x training speedup. Finally, AgentJet introduces an automated research system that takes a research topic as input and autonomously conducts long-horizon, multi-day RL studies on large-scale clusters. By leveraging the swarm architecture, this system reproduces key exploratory workflows of RL researchers without human intervention during execution.

2026-06-04 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Beyond Prompt-Based Planning: MCP-Native Graph Planning-based Biomedical Agent System

Biomedical agents promise to automate complex biological workflows, yet current systems face two fundamental bottlenecks: bioinformatics tools are highly heterogeneous in interfaces and execution environments, while agent planning still relies on flat prompt-retrieved tool descriptions. As biomedical software ecosystems grow, this coupling between tool coverage and context size leads to tool confusion, unstable planning, and inefficient execution. We introduce BioManus, an MCP-native biomedical agent built on graph-scaffolded planning over structured biological capabilities. BioManus first introduces the BioinfoMCP Compiler, which converts heterogeneous bioinformatics software into standardized MCP servers, yielding a large executable MCP ecosystem. It then organizes this ecosystem as a typed heterogeneous MCP graph over tools, operations, datatypes, and workflow stages. At inference time, BioManus retrieves compact task-specific subgraphs and synthesizes operation-level workflow scaffolds. This design decouples planning complexity from raw tool inventory size, achieving a context compression ratio of Theta(N / (h * m_bar)) under high-recall retrieval, where N is the total tool count, h is the workflow horizon, and m_bar (much smaller than N) is the average number of candidate tools per operation. Experiments on BioAgentBench and LAB-Bench show that BioManus improves execution accuracy, workflow validity, and context efficiency over advanced biomedical agent baselines. This work suggests a paradigm shift: scalable biomedical reasoning requires structured executable capability graphs rather than ever-larger prompt-level tool retrieval.
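A quick numeric reading of the stated compression ratio Theta(N / (h * m_bar)) may help. The sketch below evaluates only the leading term and ignores the constants that Theta notation hides; the function name and the example values for N, h, and m_bar are hypothetical and do not come from the paper.

```python
def context_compression_ratio(total_tools: int,
                              horizon: int,
                              avg_candidates: float) -> float:
    """Leading-order ratio N / (h * m_bar).

    total_tools    -- N, size of the full tool inventory
    horizon        -- h, number of operations in the workflow
    avg_candidates -- m_bar, average candidate tools per operation
    """
    return total_tools / (horizon * avg_candidates)


# Hypothetical inventory: 2000 tools, a 5-step workflow, 4 candidates per step.
print(context_compression_ratio(2000, 5, 4.0))  # 100.0
```

In this made-up setting the planner would see about 20 tool descriptions instead of 2000, which is the sense in which planning cost is decoupled from inventory size.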

2026-06-04 13:00 JST arXiv cs.AI エージェント

MapAgent: 都市規模の車線レベルの地図生成のための産業グレードのエージェントフレームワーク

車線レベルの地図は自動運転と車線レベルのナビゲーションにとって重要なインフラストラクチャですが、数百の都市にわたって標準化された車線ネットワークを構築・維持する作業は、依然として非常に労働集約的です。最近のエンドツーエンドのベクトル化マッピング手法は、センサーデータから車線の形状とトポロジを直接予測できますが、通常、地図作成仕様や交通規則を、データセットに依存した暗黙的な教師信号として扱います。さらに、複雑なシーン (マーキングの摩耗や欠落、オクルージョンなど) では、視覚的証拠だけでは正しい車線構成が一意に定まらないことが多く、仕様違反が人手による事後編集の主な原因となっています。私たちは、仕様に準拠した車線地図の生産に向けてベクトル化バックボーンを拡張する、産業グレードのエージェントアーキテクチャ MapAgent を提案します。MapAgent は、地図予測に単にエージェントループを追加するのではなく、有界で検証主導型の Judge-Planner-Worker ループの下で、バックボーンによる認識を、明示的な仕様検証、制約を考慮した推論、決定論的な地図編集と結び付けます。視覚言語モデルによる Judge は視覚的証拠とドラフトのベクトルを併せて検査してエラーを診断し、ツールを呼び出す Planner は最小限の修正編集を生成して、編集後に再検証を行います。都市規模の本番運用でスケーラビリティを保つため、MapAgent はバックボーンの信頼度が低いタイルでのみ選択的に起動され、スループットを維持しつつ、追加のオーバーヘッドを小さく抑えます。実世界のデータセットでの実験では、特に複雑でロングテールなシナリオにおいて、強力な本番ベースラインに対する一貫した改善が示されています。さらに、MapAgent は Baidu Maps に統合されており、全国 360 以上の都市の車線レベルの地図生成を支え、生産全体の自動化率を 95% 以上に引き上げています。これは、大規模な車線レベルの地図生成における MapAgent の実用性と有効性を示すものです。

原文 (English)

MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation

Lane-level maps are critical infrastructure for autonomous driving and lane-level navigation, yet constructing and maintaining standardized lane networks for hundreds of cities remains highly labor-intensive. Recent end-to-end vectorized mapping methods can predict lane geometry and topology directly from sensor data, but they typically treat mapping specifications and traffic regulations as implicit, dataset-dependent supervision. Moreover, in complex scenes (e.g., worn or missing markings and occlusions), correct lane configurations are often under-determined by visual evidence alone, making specification violations a major source of human post-editing. We propose MapAgent, an industrial-grade agentic architecture that augments a vectorization backbone for specification-compliant lane-map production. Rather than merely adding an agent loop to map prediction, MapAgent couples backbone perception with explicit specification verification, constraint-aware reasoning, and deterministic map editing under a bounded, verification-driven Judge-Planner-Worker loop. A vision-language Judge diagnoses errors by jointly inspecting visual evidence and draft vectors, while a tool-calling Planner generates minimal corrective edits with post-edit re-validation. To remain scalable for city-scale production, MapAgent is selectively triggered only on tiles with low backbone confidence, adding modest overhead while preserving throughput. Experiments on real-world datasets show consistent gains over strong production baselines, especially in complex and long-tail scenarios. Additionally, MapAgent has been integrated into Baidu Maps, supporting lane-level map generation for over 360 cities nationwide and elevating the overall production automation to over 95%, demonstrating MapAgent's practicality and effectiveness for large-scale lane-level map generation.
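The control flow described above (selective triggering on low-confidence tiles, then a bounded judge, plan, edit, re-validate loop) can be sketched as follows. This is a toy under stated assumptions: the `Tile` class, the `judge`, `plan`, and `work` stand-ins, the 0.8 threshold, and the three-round bound are all hypothetical and are not taken from the MapAgent implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tile:
    confidence: float                      # backbone confidence for this tile
    violations: List[str]                  # spec violations present in the draft
    edits: List[str] = field(default_factory=list)


def judge(tile: Tile) -> List[str]:
    """Stand-in for the Judge: report the violations it can see."""
    return list(tile.violations)


def plan(issues: List[str]) -> List[str]:
    """Stand-in for the Planner: propose a minimal edit (one issue per round)."""
    return issues[:1]


def work(tile: Tile, issue: str) -> None:
    """Stand-in for the Worker: apply a deterministic edit."""
    tile.violations.remove(issue)
    tile.edits.append(f"fix:{issue}")


def refine(tile: Tile, threshold: float = 0.8, max_rounds: int = 3) -> Tile:
    if tile.confidence >= threshold:       # selective triggering: skip confident tiles
        return tile
    for _ in range(max_rounds):            # bounded loop: never more than max_rounds
        issues = judge(tile)               # doubles as post-edit re-validation
        if not issues:
            break
        for issue in plan(issues):
            work(tile, issue)
    return tile
```

The bound matters for throughput: a tile with more violations than `max_rounds` is returned with some violations left, which in a production pipeline would presumably be routed to human post-editing.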

2026-06-04 13:00 JST arXiv cs.AI LLM/生成AI エージェント

パラメトリックメモリを介した自己進化エージェントのスケーリング

既存のメモリ拡張 LLM エージェントは、過去の経験をテキストの要約や検索された文章としてプロンプト空間にのみ保存し、ロールアウトの間はモデルパラメータを凍結したままにします。このようなエージェントは、見たものを「参照する」ことはできても、そこから「学ぶ」ことはできません。ポリシーは経験によって変化せず、コンテキストから落ちた情報は永久に失われます。私たちは、自己進化するパラメトリックメモリフレームワーク TMEM を導入します。このフレームワークでは、エージェントは履歴を明示的メモリに圧縮するだけでなく、軽量なオンライン更新を通じて、蒸留された教師信号を高速な LoRA 重み $\Delta_t$ に吸収し、単一のエピソード内で将来の振る舞いを実際に変化させます。私たちはこれを、高速重みのロールアウトダイナミクスを持つエージェント的な意思決定過程として定式化します。アクションは $\pi_{\theta_0+\Delta_t}$ からサンプリングされ、抽出アクションは、以降の決定に向けて $\Delta_t$ を更新する教師信号を生成します。この見方により、抽出ポリシーを RL で直接最適化できるようになります。$\theta_0$ を訓練すると、タスクアクションだけでなく、オンライン LoRA 適応に使われるデータの品質も向上します。さらに、オンラインでの収束を加速するために、SVD に基づく LoRA 部分空間の初期化を提案します。LoCoMo、LongMemEval-S、多目的探索、CL-Bench での実験では、TMEM がさまざまなモデルスケールにわたって、要約ベースおよび検索ベースのベースラインを一貫して上回ることが示されています。

原文 (English)

Scaling Self-Evolving Agents via Parametric Memory

Existing memory-augmented LLM agents store past experience exclusively in prompt space, as textual summaries or retrieved passages, while keeping model parameters frozen throughout a rollout. Such agents can \emph{look up} what they have seen but cannot \emph{learn from} it: their policy is unchanged by experience, and any information dropped from the context is permanently lost. We introduce \texttt{TMEM}, a self-evolving parametric memory framework in which the agent not only compresses history into explicit memory but also absorbs distilled supervision into fast LoRA weights $\Delta_t$ via lightweight online updates, genuinely altering its future behavior within a single episode. We formalize this as an agentic decision process with fast-weight rollout dynamics: actions are sampled from $\pi_{\theta_0+\Delta_t}$, while extraction actions produce supervision that updates $\Delta_t$ for subsequent decisions. This view makes the extraction policy directly optimizable by RL: training $\theta_0$ improves not only task actions but also the quality of the data used for online LoRA adaptation. We further propose SVD-based initialization of the LoRA subspace to accelerate online convergence. Experiments on LoCoMo, LongMemEval-S, multi-objective search, and CL-Bench show that \texttt{TMEM} consistently outperforms summary-based and retrieval-based baselines across different model scales.
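One way to write down the fast-weight rollout described above is sketched below. The abstract only states that actions are drawn from $\pi_{\theta_0+\Delta_t}$ and that extraction actions produce supervision that updates $\Delta_t$. The gradient-step form of the update, the loss $\ell$, the step size $\eta$, the symbol $d_t$ for the extracted supervision, and the low-rank factors $B_t A_t$ (the usual LoRA parameterization) are assumptions made here for illustration.

```latex
\begin{align}
  a_t &\sim \pi_{\theta_0 + \Delta_t}(\,\cdot \mid s_t),
      \qquad \Delta_t = B_t A_t, \\
  (B_{t+1}, A_{t+1}) &= (B_t, A_t)
      - \eta \, \nabla_{(B, A)}\, \ell\bigl(\theta_0 + B_t A_t;\ d_t\bigr).
\end{align}
```

In this reading $\theta_0$ stays fixed during a rollout and only the low-rank factors move, so anything absorbed into $\Delta_t$ keeps influencing later actions even after the corresponding text has left the context window.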

2026-06-04 13:00 JST arXiv cs.AI エージェント

Neetyabhas: 合理的エージェントベースモデルにおける不確実性を考慮した公共政策最適化のためのフレームワーク

目的: WHO による COVID-19 の非医薬品介入 (ロックダウン、ワクチン接種など) は感染拡大を効果的に抑制しますが、経済には大きな負担を強います。既存の研究は個人の行動を無視することが多く、感染の追跡と政策の実行が完全であると誤って仮定しているため、現実世界の不確実性や誤差を考慮できていません。方法: 私たちは、流行の計測 (感染者数/入院者数) と政策実施の両方に不確実性を組み込んだ統合的アプローチを提案します。マスクの着用、ワクチン接種、買い物についてリアルタイムに選択を行う 1,000 人の個人からなるシミュレーションモデルを構築しました。同時に、政策立案者は健康面と経済面の観測に基づいて介入 (ロックダウン、義務化) を実施します。このフレームワークは階層型強化学習エージェントによって駆動され、ディープ Q ネットワークと、不確実性を考慮した方策勾配法の変種 (DDPG および TD3) を併用します。結果: シミュレーションは流行の進行を効果的に制御しました。マスクとワクチン接種は非常に効果的で、流行のピークの高さと期間の両方を大幅に低減しました。個人の行動、政策の不確実性、多面的な介入を統合することで、私たちの動的制御アプローチは流行の影響を軽減することに成功しました。結論: 私たちのモデルは、不確実性と人間の行動を公衆衛生政策の枠組みに組み込むことで、従来研究の限界を克服します。このシミュレーションは、複雑なパンデミックにおいて効果的な介入を設計するには個人の選択と不完全なデータを考慮することが不可欠であり、マスクとワクチンが極めて重要な手段となることを示しています。

原文 (English)

Neetyabhas: A Framework for Uncertainty-Aware Public Policy Optimization in Rational Agent-Based Models

Purpose The WHO's COVID-19 non-pharmaceutical interventions (e.g., lockdowns, vaccinations) effectively curb transmission but impose heavy economic strains. Existing research often neglects individual behaviors and falsely assumes perfect infection tracking and flawless policy execution, failing to account for real-world uncertainties and errors. Methods We propose an integrative approach incorporating uncertainties in both epidemic measurement (infections/hospitalizations) and policy implementation. We built a simulation model of 1,000 individuals making real-time choices regarding mask-wearing, vaccination, and shopping. Concurrently, policymakers deploy interventions (lockdowns, mandates) based on health and economic observations. This framework is driven by hierarchical reinforcement learning agents, utilizing deep Q-networks alongside uncertainty-aware policy gradient variants (DDPG and TD3). Results The simulations effectively managed the epidemic's progression. Masking and vaccinations proved highly effective, significantly reducing both the outbreak's peak height and duration. By integrating individual behaviors, policy uncertainties, and multifaceted interventions, our dynamic control approach successfully mitigated the epidemic's impact. Conclusions Our model overcomes previous research limitations by embedding uncertainty and human behavior into public health policy frameworks. The simulation demonstrates that accounting for individual choices and imperfect data is crucial for designing effective interventions during complex pandemics, with masks and vaccines serving as pivotal tools.

2026-06-04 13:00 JST arXiv cs.AI LLM/生成AI エージェント

最初に計画し、後で判断し、より良く実行する: DMAIC からインスピレーションを得た産業異常検出用エージェントシステム

大規模言語モデル (LLM) エージェントは、複雑なデータ分析ワークフローの自動化において有望であることが示されていますが、リスクの高い産業シナリオで信頼性をもって導入することは依然として困難です。産業異常検出 (IAD) は製造の品質、安全性、効率に不可欠ですが、既存の LLM ベースの IAD エージェントは主に実行に焦点を当てており、戦略の策定を十分に活用していません。その結果、異種のモダリティを統一的かつコスト効率よく扱うことに苦労しています。私たちは、DMAIC 品質管理フレームワークに着想を得て、DMAIC-IAD (DMAIC-inspired Agentic Industrial Anomaly Detection) を提案します。これは、LLM エージェントを構造化された産業上の問題解決に整合させる、「最初に計画し、後で判定する」マルチエージェントシステムです。DMAIC-IAD は、戦略生成の前に異種の参考資料を標準作業手順 (SOP) に蒸留し、事前学習済みの実行不要な判定モデルを導入して、コストのかかる実行時の試行なしに候補戦略をランク付けします。4 つのモダリティにわたる広範な実験により、DMAIC-IAD は適用可能なエージェント型ベースラインに対して平均検出性能を 37.76% 向上させることが示されています。

原文 (English)

Plan First, Judge Later, Run Better: A DMAIC-Inspired Agentic System for Industrial Anomaly Detection

Large language model (LLM) agents have shown promise in automating complex data-analysis workflows, but their reliable deployment remains challenging in high-stakes industrial scenarios. Industrial anomaly detection (IAD) is essential for manufacturing quality, safety, and efficiency, yet existing LLM-based IAD agents mainly focus on execution while under-exploiting strategy formulation. Consequently, they struggle to handle heterogeneous modalities in a unified and cost-effective manner. Inspired by the DMAIC quality-management framework, we propose DMAIC-IAD (DMAIC-inspired Agentic Industrial Anomaly Detection), a "Plan First, Judge Later" multi-agent system that aligns LLM agents with structured industrial problem-solving. DMAIC-IAD distills heterogeneous references into standardized operating procedures (SOPs) before strategy generation, and introduces a pre-trained execution-free judge model to rank candidate strategies without costly runtime trials. Extensive experiments across four modalities show that DMAIC-IAD improves average detection performance over applicable agentic baselines by 37.76%.

2026-06-04 13:00 JST arXiv cs.AI LLM/生成AI エージェント

Parthenon Law: 自己進化する法務エージェントフレームワーク

エージェントの能力が高まるにつれ、法律分野の LLM エージェントは、文書量の多い案件をレビュー可能な作業成果物に変えることが期待されています。しかし、信頼性の高い導入には 3 つの障害があります。第 1 に、現在最も強力なモデルとハーネスの組み合わせがエンドツーエンドの法務案件でどのように振る舞うかについて、大規模な証拠がありません。第 2 に、法務分野に適合したエージェントアーキテクチャがなく、汎用のハーネスしかありません。第 3 に、新しい事実、典拠、期限によって状況が変わり続ける環境において、システムが自らの結果から学習する仕組みがありません。私たちはそれぞれに取り組みます。Harvey LAB での大規模な実証研究 (12,510 件のエージェント軌跡) は、最先端のエージェントでさえ 1 回のパスで案件を完了するには程遠いことを示しています。より強力なモデルでは基準ごとの精度は上がる一方で、厳密な意味での案件完了率は伸び悩みます。次に、自己進化する法務エージェントフレームワーク Parthenon を導入します。Parthenon は、モデル、ハーネス、エージェントの役割、法的知識、決定論的なツール、手続き的スキルを、情報源の追跡可能性、日付と数値の根拠付け、成果物の要件適合、論点のクローズのための監査可能な面 (surface) に分解します。最後に、リーク防止の学習ループが、採点された失敗を、タスクに依存しないスキル、ツール、知識への編集に変換します。これにより、事務所が案件ごとにチェックリストやプレイブックを洗練させていくように、モデルの重みに触れることなく、経験とともにシステムが改善します。私たちの大規模な実証分析を通じて、Parthenon は法務案件タスクにおける最先端のモデルとハーネスの性能を大幅に向上させました。

原文 (English)

Parthenon Law: A Self-Evolving Legal-Agent Framework

As agents grow more capable, legal-domain LLM agents promise to turn document-heavy matters into reviewable work products -- yet reliable deployment faces three obstacles: no large-scale evidence on how today's strongest model-and-harness combinations behave on end-to-end legal matters; no agent architecture adapted to the legal vertical, only general-purpose harnesses; and, in a setting that keeps shifting with new facts, authorities, and deadlines, no mechanism for systems to learn from their own outcomes. We address each. A large-scale empirical study on Harvey LAB -- $12{,}510$ agent trajectories -- shows that even frontier agents remain far from completing matters in a single pass: per-criterion accuracy climbs with stronger models while strict matter completion stalls. We then introduce \textsc{Parthenon}, a self-evolving legal-agent framework that factors Model, Harness, Agent roles, legal Knowledge, deterministic Tools, and procedural Skills into auditable surfaces for source traceability, date and number grounding, deliverable compliance, and issue closure. Finally, an anti-leakage learning loop converts scored failures into task-agnostic edits to skills, tools, and knowledge, letting the system improve with experience -- as a firm refines its checklists and playbooks after each matter -- without touching model weights. Across our large-scale empirical analysis, \textsc{Parthenon} substantially improves the performance of state-of-the-art models and harnesses on legal-matter tasks.

2026-06-04 13:00 JST arXiv cs.AI エージェント

MIRAGE: 暗黙的推論と生成世界モデルを備えたモバイルエージェント

モバイルエージェントは、スクリーンショットと言語で与えられた目標をもとに日常のアプリケーションを操作することがますます期待されており、信頼性の高い制御には、画面のアフォーダンス、複数ステップのナビゲーション、将来の状態変化についての推論が必要です。しかし、多くのエージェントはこの計算を長いテキストの思考連鎖として外部化するため、インタラクションが遅くなり、教師信号 (supervision) のコストが増え、展開が複雑になります。私たちは、可視のテキスト推論トレースから連続的な潜在推論表現を学習するフレームワーク MIRAGE を紹介します。MIRAGE は明示的な推論をコンパクトな隠れ状態に移し替え、エージェントが長い推論文をデコードすることなく内部で推論できるようにします。また、生成的な世界モデルの目的関数も組み込みます。潜在推論ベクトルを将来のスクリーンショットと整合させることで、エージェントが行動する前に今後のインターフェイス状態を予測するよう促します。これにより、隠れた計算は、圧縮された思考表現であると同時に、環境ダイナミクスの先読みモデルにもなります。推論時、MIRAGE は連続的な潜在空間で推論し、トークン生成を減らしながら実行効率を向上させます。AndroidWorld では、MIRAGE は 4B のアブレーションにおいて、3～5 倍少ないデコードトークン予算で、明示的な思考連鎖による教師あり微調整に匹敵し、同等の命令調整済みベースラインを 10.2 ポイント上回ります。AndroidControl では、生成トークンを 75% 以上削減しながら、アクションのグラウンディングを向上させます。

原文 (English)

MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models

Mobile agents are increasingly expected to operate everyday applications from screenshots and language goals, where reliable control requires reasoning over screen affordances, multi-step navigation, and future state changes. However, many agents externalize this computation as long textual chains of thought, which slows interaction, increases supervision cost, and complicates deployment. We introduce MIRAGE, a framework that learns continuous latent reasoning representations from visible textual reasoning traces. MIRAGE transfers explicit reasoning into compact hidden states, enabling the agent to reason internally without decoding long rationales. It also incorporates a generative world-model objective: latent reasoning vectors are aligned with future screenshots, encouraging the agent to anticipate upcoming interface states before acting. This turns hidden computation into both a compressed thought representation and a forward-looking model of environment dynamics. At inference time, MIRAGE reasons in continuous latent space, reducing token generation while improving execution efficiency. On AndroidWorld, MIRAGE matches explicit chain-of-thought supervised fine-tuning in the 4B ablation with a 3-5x lower decoded-token budget and improves a comparable instruction-tuned baseline by 10.2 points; on AndroidControl, it improves action grounding while generating over 75% fewer tokens.

2026-06-04 13:00 JST arXiv cs.AI エージェント

Fog of Love: ゲーム環境における親和性ベースの強化学習による高潔なエージェントの動作のエンジニアリング

人工知能に高潔な振る舞いを身に付けさせることへの関心が高まっています。提案されている手法の 1 つが、親和性ベースの強化学習 (affinity-based reinforcement learning) です。これは、目的関数に対するポリシー正則化を用いることで、報酬関数の設計に全面的に依存することなく、高潔な行動を促します。これまでのところ、この手法は、状態空間と行動空間がごく小さいグリッドワールドやトイ問題の環境で有効であることが示されています。この研究をより複雑な環境に拡張するために、私たちは Fog of Love というロールプレイングボードゲームに基づく、2 プレイヤーのマルチエージェント環境を導入します。この環境では、2 体のエージェントがそれぞれの美徳を満たすために競い合いながら、両者の関係を満たすために協力します。マルチエージェントであるがゆえにこれは複雑な問題であり、マルチエージェント深層決定論的方策勾配 (multi-agent deep deterministic policy gradient) のエージェントは、競争にも協力にも成功しません。私たちは、局所化された親和性 (localized affinities) が、競争と協力の両方の目的の達成におけるエージェントの性能を高め、両方の領域で総合スコアが優れることを示す証拠を提示します。これは高潔な選択をもたらすだけでなく、エージェントの目的 (teleology) を明確にし、その振る舞いを人間のレベルで解釈可能なものにします。

原文 (English)

Fog of Love: Engineering Virtuous Agent Behavior with Affinity-based Reinforcement Learning in a Game Environment

Instilling virtuous behavior in artificial intelligence has seen increasing interest. One of the techniques proposed is known as affinity-based reinforcement learning, which uses policy regularization on the objective function to incentivize virtuous actions without being fully dependent on the reward function design. Thus far, this technique has been demonstrated to be effective in grid worlds and toy-problem environments with minimal state and action spaces. To expand this research to more sophisticated environments, we introduce a two-player multi-agent environment based on the role-playing board game known as Fog of Love. In this environment, two agents compete to fulfill their individual virtues, while also cooperating to satisfy their relationship. Given the multi-agent nature, this is a complex problem where multi-agent deep deterministic policy gradient agents neither compete nor cooperate successfully. We present evidence that localized affinities enhance agent performance in achieving both competitive and cooperative objectives, resulting from superior overall scores in both domains. This not only results in virtuous choices but also clarifies an agent's teleology and makes its behavior human-level interpretable.

2026-06-04 13:00 JST arXiv cs.AI LLM/生成AI エージェント

FALSIFYBENCH: ルール発見ゲームを使用した LLM の帰納的推論の評価

大規模言語モデル (LLM) は、科学タスクにおける自律エージェントとして導入されることが増えています。しかし、これらのシステムが科学的発見に関わる形の帰納的推論を効果的に行えるかどうかは、未解決の問題のままです。本研究では、古典的な Wason の 2-4-6 課題に着想を得た、仮説駆動型推論の評価フレームワーク FALSIFYBENCH を紹介します。この課題では、エージェントは例を繰り返し提案してフィードバックを受け取ることで、隠れた意味的性質を発見しなければなりません。この課題は、仮説の生成、証拠の収集、確証的な証拠と反証的な証拠の双方に応じた信念の修正という、科学的推論の重要な要素を捉えます。モデルファミリーとスケールにまたがる 12 個の LLM を評価した結果、最適な性能に近づくモデルはないものの、推論モデルは一般に命令調整モデルよりも優れた科学的推論能力を示すことがわかりました。成功の主な要因は、否定的テスト (negative testing) を行う能力です。自らの仮説を積極的に反証しようとするモデルは、主に確証を求めるモデルを一貫して上回ります。さらに、先行研究では見過ごされてきたターンレベルのきめ細かな分析により、失敗は、モデルが仮説空間を探索する際の特定可能なパターンと結び付いていることが明らかになりました。

原文 (English)

FALSIFYBENCH: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games

Large language models (LLMs) are increasingly deployed as autonomous agents in scientific tasks. Yet whether these systems can effectively engage in forms of inductive reasoning relevant to scientific discovery remains an open question. In this work, we introduce FALSIFYBENCH, an evaluation framework for hypothesis-driven reasoning inspired by the classic Wason 2-4-6 task, in which agents must discover hidden semantic properties by iteratively proposing examples and receiving feedback. This task captures key elements of scientific reasoning: hypothesis generation, evidence gathering, and belief revision in response to both confirming and disconfirming evidence. Our evaluation of 12 LLMs across model families and scales shows that reasoning models are generally stronger scientific reasoners than instruction-tuned models, although no model comes close to optimal performance. The primary driver of success is the capacity for negative testing: models that actively seek to falsify their hypotheses consistently outperform those that primarily seek confirmation. Moreover, a fine-grained turn-level analysis, neglected in previous work, reveals that failure is tied to identifiable patterns in how models navigate the hypothesis space.
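The role of negative testing can be seen in a toy version of the classic Wason 2-4-6 task that the benchmark is inspired by. The sketch below uses the textbook numeric setup (hidden rule "strictly increasing", overly narrow hypothesis "add 2 each time"); FALSIFYBENCH itself targets hidden semantic properties, so the rule, the hypothesis, and the probes here are illustrative assumptions rather than benchmark items.

```python
from typing import Iterable, Tuple

Triple = Tuple[int, int, int]


def hidden_rule(t: Triple) -> bool:
    """The environment's secret rule: strictly increasing numbers."""
    return t[0] < t[1] < t[2]


def hypothesis(t: Triple) -> bool:
    """The agent's (too narrow) guess: each number is the previous one plus 2."""
    return t[1] - t[0] == 2 and t[2] - t[1] == 2


def is_falsified(probes: Iterable[Triple]) -> bool:
    """A probe falsifies the hypothesis when its prediction and the feedback disagree."""
    return any(hypothesis(p) != hidden_rule(p) for p in probes)


positive_tests = [(2, 4, 6), (10, 12, 14), (1, 3, 5)]  # chosen to satisfy the hypothesis
negative_tests = [(1, 2, 3), (6, 4, 2)]                # chosen to violate the hypothesis

print(is_falsified(positive_tests))  # False: confirmation alone never exposes the error
print(is_falsified(negative_tests))  # True: (1, 2, 3) is accepted by the hidden rule
```

Because the hypothesis is a strict subset of the hidden rule, every probe that fits the hypothesis also fits the rule, so only probes the agent expects to fail can reveal the mismatch.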

2026-06-04 13:00 JST arXiv cs.AI エージェント 研究/論文

人間と AI のインタラクションにおけるマルチエージェントの相補性のツリーベースの定式化

相補性とは、人間と AI のインタラクション (HAI) が、その構成メンバーの中で利用可能な最良の予測ベンチマークを上回る場合を指します。この考え方は HAI 研究の中心にありますが、相補性に関する形式的な研究は依然として限られています。既存のフレームワークは、エージェントの予測がワークフローに依存するマルチエージェントプロトコルへとどのように合成されるかをモデル化していません。私たちは、マルチエージェント HAI における相補性のツリーベースの形式化を導入することで、このギャップを埋めます。HAI プロトコルは、順序付きのエージェントと役割の構成と、葉に予測ベクトルが付された根付き平面二分木によって表されます。局所的な二項合成規則をツリーに沿って再帰的に評価することで、点ごとの最小値をとるオラクル (pointwise-min oracle) ベンチマークを基準とする、ツリー相対の相補性汎関数が得られます。私たちは 4 つの結果を証明します。第 1 に、セレクターベースの HAI (自己への依存や AI への依存を含む) は、タスク、損失、予測の品質にかかわらず、相補性を達成できません。第 2 に、二乗損失の下での回帰では、相補性は正解ベクトルからのユークリッド距離の最小化と等価です。$N=2$ の場合、最適な線形プーリングの重みは閉形式で表され、残差補正としての解釈を持ちます。第 3 に、線形の局所合成の下では、すべてのプロトコルツリーが、葉の重みの単体上に重心座標チャートを定義します。プロトコルツリーの Tamari 被覆 (Tamari-cover) による再パラメータ化は相補性を保ち、$N=4$ の場合は五角形恒等式を満たします。第 4 に、二値分類では、標準的な Bregman 損失や多くの有限な Bernoulli $f$ ダイバージェンス損失を含む端点単調 (endpoint-monotone) な損失の下で、内部の局所合成は相補性を達成できません。交差エントロピーの下での多クラス集約にも、同様の障害が成り立ちます。要約すると、私たちのフレームワークは、相補性がマルチエージェント回帰では達成可能である一方、分類では、局所的な集約と損失関数に関する自然な条件の下で阻まれることを示しています。

原文 (English)

Tree-Based Formalization of Multi-Agent Complementarity in Human-AI Interactions

Complementarity is the case in which a human--AI interaction (HAI) outperforms the best prediction benchmark available among its members. Although this idea is central in HAI research, formal work on complementarity remains limited. Existing frameworks do not model how agents' predictions compose into workflow-sensitive multi-agent protocols. We close this gap by introducing a tree-based formalization of complementarity in multi-agent HAI. An HAI protocol is represented by an ordered agent-role configuration together with a rooted planar binary tree whose leaves are decorated by prediction vectors. A local binary composition rule is evaluated recursively along the tree, yielding a tree-relative complementarity functional relative to a pointwise-min oracle benchmark. We prove four results. First, selector-based HAIs, including self- or AI-reliance, cannot achieve complementarity regardless of task, loss, or prediction quality. Second, in regression under squared loss, complementarity is equivalent to Euclidean distance minimization from the ground-truth vector; for $N=2$, the optimal linear-pooling weight has a closed form and a residual-correction interpretation. Third, under linear local composition, every protocol tree defines a barycentric coordinate chart on the simplex of leaf weights; Tamari-cover reparameterizations of protocol trees preserve complementarity, and for $N=4$, they satisfy the pentagon identity. Fourth, in binary classification, no internal local composition can achieve complementarity under endpoint-monotone losses, including standard Bregman and many finite Bernoulli $f$-divergence losses; an analogous obstruction holds for multiclass aggregation under cross-entropy. In summary, our framework shows that complementarity is attainable in multi-agent regression, but obstructed in classification under natural conditions on local aggregation and loss functions.
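The first two results can be made concrete with a short derivation. The notation below ($p_1, p_2$ for the two agents' prediction vectors, $y$ for the ground truth, $w$ for the pooling weight, $\sigma$ for a selector) is assumed here for illustration, and the paper's exact statements may differ, for example by constraining $w$ to $[0, 1]$.

```latex
% Selector-based protocols: the output at each point i is one member's
% prediction, so its loss cannot fall below the pointwise-min oracle.
\[
  \ell\bigl(p_{\sigma(i)}(x_i),\, y_i\bigr) \;\ge\; \min_{k}\, \ell\bigl(p_k(x_i),\, y_i\bigr)
  \quad \text{for every } i .
\]

% Linear pooling with N = 2 under squared loss (unconstrained w, p_1 \neq p_2):
\begin{align}
  \hat{y}(w) &= w\, p_1 + (1 - w)\, p_2 \;=\; p_2 + w\,(p_1 - p_2), \\
  w^{\star} &= \arg\min_{w}\, \bigl\lVert y - \hat{y}(w) \bigr\rVert^2
             \;=\; \frac{\langle\, y - p_2,\; p_1 - p_2 \,\rangle}{\lVert p_1 - p_2 \rVert^2}.
\end{align}
```

The second form of $\hat{y}(w)$ is the residual-correction reading: start from $p_2$ and move along $p_1 - p_2$ by the amount that best explains the residual $y - p_2$, which is an orthogonal projection onto the line through the two predictions.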

2026-06-04 13:00 JST arXiv cs.AI エージェント

AIP: エージェントのスキルを学習および管理するためのグラフ表現

現在のエージェントスキルは主に自由形式の散文で構成されており、エージェントはセッションのたびにそれを読み、解釈し、どう行動するかを導出し直さなければなりません。これは 2 つのコストを重ねて課します。1 つは、実装の比重が大きいタスクでの信頼性の低下です。もう 1 つは、スキルの作成と改善の難しさです。散文の編集は人間にとってもエージェントにとっても扱いにくい脆弱なプロセスであり、モデルの学習データに十分含まれていないドメイン固有の手続き的知識では特にそうです。Agent Instruction Protocol (AIP) は、スキルを有向の実行グラフとしてモデル化することで、この両方に対処します。すなわち、個々のステップを、決定論的なスクリプトまたは自然言語の記述に裏付けられたノードとし、明示的に型付けされた入力/出力エッジで接続し、スキーマ検証された YAML 仕様で管理します。コンパイラのメタスキルが、人間が書いた既存のスキルをこの形式に変換します。利点は 2 つあります。第 1 に、人間が書いたスキルを AIP にコンパイルすると、SkillsBench の 27 個の実エージェントタスク全体で、Claude Sonnet の平均タスク報酬が 0.60 から 0.71 に、合格率が 53% から 67% に上昇しました。これは統計的に有意な向上 (Wilcoxon の符号付き順位検定で p = 0.011) で、12 タスクで上回り、2 タスクで下回り、13 タスクで同点であり、多くの場合、実行時間 (wall-clock time) も短くなりました。グラフは、コード、コマンド、ツール呼び出しを自然言語から導出し直すことをエージェントに求めるのではなく、検証済みで実行可能なユニットをエージェントに提供します。第 2 に、作成と改善については、各スキルがスキーマ検証され、機能テストが可能で、ノード単位で参照できるため、障害を正確に診断して修復できます。作成したスキルの 2 件の失敗は、スクリプトレベルにまで原因をたどることができました。AIP 仕様を調整して再コンパイルすると、どちらもリグレッションなしで回復し (1 つのタスクは 0/5 から 5/5 に改善)、スキルの改善は散文の書き直しではなく、測定可能なチューニングループになりました。同じグラフ構造は、コーパスレベルのガバナンスとスキルのイントロスペクションを支え、スキルに対する強化学習のための自然な行動空間を提供します。

原文 (English)

AIP: A Graph Representation for Learning and Governing Agent Skills

Agent Skills today consist largely of free-form prose requiring the agent to read, interpret, and re-derive how to act in every session. This imposes two compounding costs: reduced reliability on implementation-heavy tasks, and difficulty in skill creation and improvement, since editing prose is a fragile process that both humans and agents struggle with, particularly for domain-specific procedural knowledge underrepresented in model training. The Agent Instruction Protocol (AIP) addresses both by modeling a skill as a directed execution graph: discrete steps as nodes backed by deterministic scripts or natural-language descriptions, connected by explicit typed input/output edges, and governed by a schema-validated YAML specification. A compiler meta-skill translates existing human-written skills into this form. The benefits are twofold. First, compiling human-written skills to AIP raised Claude Sonnet's mean task reward from 0.60 to 0.71 and pass rate from 53% to 67% across 27 real agent tasks from SkillsBench - a statistically significant gain (Wilcoxon signed-rank p = 0.011), winning 12 tasks to 2 with 13 ties - often in less wall-clock time. The graph delivers vetted, runnable units to the agent rather than asking it to re-derive code, commands, and tool calls from natural language. Second, on creation and improvement, because each skill is schema-validated, functionally testable, and addressable node-by-node, failures can be diagnosed and repaired precisely. Two authored-skill failures were traced to the script level. After adjusting the AIP spec and recompiling, both recovered with zero regressions (one task going from 0/5 to 5/5), turning skill improvement into a measurable tuning loop rather than a prose rewrite. That same graph structure supports corpus-level governance and skill introspection, and provides a natural action space for reinforcement learning over skills.
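To make the idea of a skill as a typed execution graph concrete, here is a minimal sketch. It does not reproduce the AIP schema, which the abstract does not show: the dictionary layout, the node names, and the `validate` and `run` helpers are hypothetical, and a plain Python dict stands in for the YAML specification.

```python
from graphlib import TopologicalSorter

TYPES = {"str": str, "list": list, "int": int}

# Hypothetical skill: each node declares typed inputs, one typed output, and a script.
SKILL = {
    "fetch":  {"inputs": {},               "output": ("raw", "str"),
               "run": lambda: " 3, 1, 2 "},
    "parse":  {"inputs": {"raw": "str"},   "output": ("nums", "list"),
               "run": lambda raw: [int(x) for x in raw.split(",")]},
    "report": {"inputs": {"nums": "list"}, "output": ("total", "int"),
               "run": lambda nums: sum(nums)},
}


def validate(skill: dict) -> None:
    """Schema-style check: every input must match the declared type of its producer."""
    produced = {node["output"][0]: node["output"][1] for node in skill.values()}
    for name, node in skill.items():
        for inp, typ in node["inputs"].items():
            if produced.get(inp) != typ:
                raise TypeError(f"{name}: input {inp!r} expects {typ}, got {produced.get(inp)}")


def run(skill: dict) -> dict:
    """Execute nodes in dependency order, checking each output against its declared type."""
    validate(skill)
    producer = {node["output"][0]: name for name, node in skill.items()}
    deps = {name: {producer[i] for i in node["inputs"]} for name, node in skill.items()}
    values = {}
    for name in TopologicalSorter(deps).static_order():
        node = skill[name]
        out_name, out_type = node["output"]
        result = node["run"](**{i: values[i] for i in node["inputs"]})
        if not isinstance(result, TYPES[out_type]):
            raise TypeError(f"{name}: output {out_name!r} is not {out_type}")
        values[out_name] = result
    return values


print(run(SKILL)["total"])
```

Because each node is addressable by name and checked in isolation, a failure points at one node (for example `parse`) instead of at a paragraph of prose, which is the property the abstract credits for precise diagnosis and repair.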

2026-06-04 13:00 JST arXiv cs.AI LLM/生成AI エージェント

R-APS: Reflective Adversarial Pareto Search による制約付き設計のための構成的推論とコンテキスト内メタ学習

大規模言語モデル (LLM) はオープンエンドなタスクでは流暢ですが、システムが計画を立て、ツールを使い、長いホライズンにわたって行動しなければならないエージェント的な設定では、流暢さは信頼できる成果を保証しません。私たちは、このギャップの原因を、互いに結び付いた 3 つの構造的な失敗に求めます。エラーは発生箇所が特定されないまま伝播し、最悪ケースの摂動は評価されず、蓄積された知識は決して無効化されません。私たちは、これらが根本原因を共有していると主張します。すなわち、アブダクション、反事実的推論、メタ帰納、修正的推論、帰納的推論が、共有されたコンテキストを互いに相容れない方向へ引っ張るのです。私たちは Reflective Adversarial Pareto Search (R-APS) を導入します。これは、私たちの知る限り、推論モードの分解によって 3 つの失敗すべてに同時に対処する最初の手法であり、各推論モードに専用のコンテキストを割り当て、3 つのタイムスケールにわたる相互作用を調整します。具体的には、型付きの検証クリティックを伴う段階的な構成的推論 (失敗の局所化)、第一級のパレート目的としての感度に基づく反事実的ストレステスト (堅牢性)、明示的な無効化を伴うメタ帰納的なルール抽出 (永続メモリ) です。R-APS は微調整を必要とせず、構造化されたプロトコル設計のみによって、凍結された LLM 上で動作します。平面機構の合成 (ロボット工学、義肢、機械設計) で評価を行い、すべての候補を運動学ソルバーで検査します。32 個の目標軌道において、R-APS は、バー数 (bar-count) と最悪ケースの堅牢性を同時に制御しながら、一様摂動ベースラインより 3.5 倍タイトな堅牢性証明、最初の受理までの反復回数の 46% 短縮、Enum+GA に対して 2.1 倍の Chamfer 距離の削減を実現します。小規模な 4B の推論特化モデルは、このプロトコルの中では汎用の 70B バックボーンと競合できることが示され、構造化プロトコルがモデル規模を部分的に補えることを示唆しています。

原文 (English)

R-APS: Compositional Reasoning and In-Context Meta-Learning for Constrained Design via Reflective Adversarial Pareto Search

Large language models (LLMs) are fluent on open-ended tasks, yet in agentic settings, where a system must plan, use tools, and act over extended horizons, fluency does not ensure reliable delivery. We trace this gap to three coupled structural failures: errors propagate without localization, worst-case perturbations go unevaluated, and accumulated knowledge is never invalidated. We argue these share a root cause: abductive, counterfactual, meta-inductive, corrective, and inductive reasoning pull a shared context in incompatible directions. We introduce Reflective Adversarial Pareto Search (R-APS), to our knowledge the first method addressing all three failures jointly via reasoning-mode decomposition, allocating each reasoning mode its own context and orchestrating interaction across three timescales: staged compositional reasoning with a typed validation critic (failure localization), sensitivity-guided counterfactual stress-testing as a first-class Pareto objective (robustness), and meta-inductive rule extraction with explicit invalidation (persistent memory). R-APS requires no fine-tuning and operates on a frozen LLM purely via structured protocol design. We evaluate on planar mechanism synthesis (robotics, prosthetics, mechanical design), with every candidate checked by a kinematic solver. On 32 target trajectories, R-APS delivers robustness certificates 3.5x tighter than uniform-perturbation baselines, 46% faster iterations-to-first-admission, and 2.1x Chamfer-distance reduction over Enum+GA while jointly controlling bar-count and worst-case robustness. Small 4B reasoning-specialized models prove competitive with general-purpose 70B backbones inside the protocol, suggesting structured protocols can partially offset model scale.
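Treating worst-case robustness as a first-class Pareto objective means a candidate is kept only if no other candidate is at least as good on every objective and strictly better on one. The sketch below shows that filtering step in isolation; the candidate names, the three objectives (tracking error, bar count, worst-case error under perturbation), and all numbers are made up for illustration and do not come from the paper.

```python
from typing import List, Tuple

# (name, tracking_error, bar_count, worst_case_error); every objective is minimized.
Candidate = Tuple[str, float, int, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on all objectives and strictly better on one."""
    pairs = list(zip(a[1:], b[1:]))
    return all(x <= y for x, y in pairs) and any(x < y for x, y in pairs)


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


candidates = [
    ("four_bar",        0.30, 4, 0.50),
    ("six_bar",         0.10, 6, 0.20),
    ("six_bar_fragile", 0.10, 6, 0.90),  # same nominal fit, poor worst case
    ("eight_bar",       0.12, 8, 0.25),
]

print([c[0] for c in pareto_front(candidates)])
```

Note that `six_bar_fragile` ties `six_bar` on nominal tracking error and bar count, so it is only eliminated because the worst-case column is part of the comparison.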

2026-06-04 13:00 JST arXiv cs.AI エージェント

Strabo: エージェント相互作用プロトコルの宣言的仕様と実装

ここ数年で、宣言型対話プロトコルに基づいたマルチエージェントシステムのモデリングと実装が大きく進歩しました。私たちの貢献である Strabo は、これらの進歩と Agentic AI における現在進行中の業界の取り組みとの関連性を確立します。具体的には、AI エージェントの電子商取引インタラクションを標準化するための Google 主導の最近の取り組みである UCP (Universal Commerce Protocol) について検討します。私たちの演習は 2 つの部分に分かれています。 1 つは、チェックアウトを処理する UCP の部分を宣言型 Langshaw プロトコルとしてモデル化し、Langshaw のプログラミングモデルである Peach を使用してエージェントを実装することです。演習のこの部分では、正式な宣言的仕様の利点を引き出します。 2 つ目は、Peach エージェントが Google によって実装された UCP エージェントと相互運用できることを示し、それによって UCP に関するアプローチの忠実性を確立します。このような相互運用により、宣言型プロトコルとエージェントを従来の設定に段階的に導入することが可能になり、大規模な更新を必要とせずに EMAS のアイデアが実践に影響を与える可能性がある道筋が示されています。

原文 (English)

Strabo: Declarative Specification and Implementation of Agentic Interaction Protocols

The last few years have witnessed major advances in the modeling and implementation of multiagent systems based on declarative interaction protocols. Our contribution, Strabo, establishes the relevance of these advances to ongoing industry efforts in Agentic AI. Specifically, we consider UCP, the Universal Commerce Protocol, a recent Google-led effort to standardize e-commerce interactions for AI agents. Our exercise is in two parts. One, we model the part of UCP dealing with checkouts as a declarative Langshaw protocol and implement agents using Peach, a programming model for Langshaw. This part of the exercise brings out the advantages of formal, declarative specifications. Two, we show that Peach agents can interoperate with UCP agents implemented by Google, thereby establishing the fidelity of our approach with respect to UCP. Such interoperation enables the incremental introduction of declarative protocols and agents into a conventional setting, indicating a pathway by which EMAS ideas could influence practice without demanding a wholesale update.

2026-06-04 13:00 JST arXiv cs.AI LLM/生成AI エージェント

RUBAS: エージェントの安全のためのルーブリックベースの強化学習

LLM がツール対応エージェントに進化すると、単純なテキスト生成ではなく現実世界の実行に関連した新しいクラスの安全性の課題が生じます。既存の調整方法は、粗い拒否信号や静的な監視に依存することが多く、さまざまなエージェントのリスクにわたって安全性と有用なツールの実行のバランスをとることが困難です。エージェントの安全性のためのルーブリックベースの強化学習フレームワークである RUBAS を紹介します。 RUBAS は、エージェントの動作をツール使用の安全性、引数の安全性、応答の安全性、有用性の 4 つの次元に分解します。これらの構造化されたルーブリックは、エージェントの完全な軌跡にわたってきめ細かく解釈可能な報酬を提供し、タスクの完了を維持しながら安全なツールの使用を最適化する強化学習を可能にします。複数のエージェントの安全性ベンチマークとモデルにわたる広範な実験により、RUBAS が標準的なアライメントベースラインよりも安全性を向上させ、ツールに基づく幻覚を軽減し、競争力のある実用性を維持することが示されています。私たちの結果は、多次元ルーブリック報酬が、安全性が重要なツール使用環境において LLM エージェントを調整するための効果的なトレーニング信号を提供することを示唆しています。

原文 (English)

RUBAS: Rubric-Based Reinforcement Learning for Agent Safety

The evolution of LLMs into tool-enabled agents creates a new class of safety challenges associated with real-world execution rather than simple text generation. Existing alignment methods often rely on coarse refusal signals or static supervision, making it difficult to balance safety with useful tool execution across diverse agentic risks. We introduce RUBAS, a rubric-based reinforcement learning framework for agent safety. RUBAS decomposes agent behavior into four dimensions: tool-use safety, argument safety, response safety, and helpfulness. These structured rubrics provide fine-grained and interpretable rewards over complete agent trajectories, enabling reinforcement learning to optimize safe tool use while preserving task completion. Extensive experiments across multiple agent safety benchmarks and models show that RUBAS improves safety over standard alignment baselines, reduces tool-grounded hallucinations, and maintains competitive utility. Our results suggest that multi-dimensional rubric rewards provide an effective training signal for aligning LLM agents in safety-critical tool-use settings.
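A rubric reward over the four dimensions named above could be combined into a single trajectory-level scalar as in the sketch below. The abstract does not say how RUBAS aggregates its rubrics, so the weighted sum, the weights, and the [0, 1] score range are assumptions chosen only to show why a multi-dimensional reward separates cases that a coarse refusal signal would merge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricScores:
    tool_use_safety: float   # was it safe to call this tool at all?
    argument_safety: float   # were the arguments passed to the tool safe?
    response_safety: float   # was the final text response safe?
    helpfulness: float       # did the trajectory complete the user's task?


# Hypothetical weights; they sum to 1 so the reward stays in [0, 1].
WEIGHTS = {
    "tool_use_safety": 0.3,
    "argument_safety": 0.2,
    "response_safety": 0.2,
    "helpfulness": 0.3,
}


def trajectory_reward(scores: RubricScores, weights: dict = WEIGHTS) -> float:
    """Weighted sum of per-dimension rubric scores, each expected in [0, 1]."""
    total = 0.0
    for name, weight in weights.items():
        value = getattr(scores, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
        total += weight * value
    return total


blanket_refusal = RubricScores(1.0, 1.0, 1.0, 0.0)   # safe but useless
safe_completion = RubricScores(1.0, 1.0, 1.0, 1.0)   # safe and useful
unsafe_completion = RubricScores(0.0, 0.0, 1.0, 1.0) # useful but unsafe tool call

for case in (blanket_refusal, safe_completion, unsafe_completion):
    print(round(trajectory_reward(case), 2))
```

With these assumed weights a blanket refusal scores below a safe completion but above an unsafe one, so the policy is rewarded for finishing the task safely rather than for refusing by default.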

2026-06-04 13:00 JST arXiv cs.AI エージェント

Proof-Carrying Agent Actions: 異種エージェントシステムのための、モデル非依存のランタイムガバナンス

エージェントシステムは、ローカルのコーディングツール、フレームワーク SDK、マネージドエージェントプラットフォーム、API ゲートウェイ、オブザーバーのみの統合など、制御点が大きく異なるランタイムを通じて実行されます。そのため、データの外部公開のような高リスクのアクションは、あるランタイムではシェルコマンドとして、別のランタイムではツール呼び出しとして、さらに別のランタイムではホストされたセッションの遷移として現れる可能性があります。このため、ガバナンスの基本的な問いに一貫して答えることが難しくなります。つまり、どのアクションが、誰の権限の下で、どのような承認セマンティクスで承認され、実行後にどのような証拠が残るのか、という問いです。本論文では、ベンダーネイティブのセッションレコードではなくアクション証明書を中心に据えた、ランタイム中立のガバナンスモデルである Proof-Carrying Agent Actions (PCAA) を提示します。PCAA は、アクション前の許容性判定、アクションのオープン、前提のキャプチャ、承認、結果のクローズという 5 つのチェックポイントを軸に制御を構成します。これらのチェックポイントは、ポータブルなアクションエンベロープ、ランタイムと承認のレシート、リプレイ可能な証明に結び付けられます。このモデルは 2 つの実用的な点で拡張されています。1 つは、証明書が外部性を考慮しており、宛先の可視性やアカウントの来歴といった境界に関する事実を保持することです。もう 1 つは、承認が、レビュー済みか未レビューかという単一のビットではなく、明示的な強制可能性クラスによって記述されることです。私たちは、異種エージェントのコントロールプレーンにおける参照実装と、開示範囲を限定した評価プロトコルを通じて、このモデルを検証します。4 つのランタイムファミリにわたり、24 個の実行可能なシードから 96 個のトレースに拡張した保護されたベンチマークにおいて、PCAA はルートの品質を維持しつつ、アブレーションの下でそれぞれ異なる障害モードを明らかにします。本論文の貢献は、証明書を伴うアクションを軸としたランタイムガバナンスのシステム的な定式化と、その定式化がベンダー固有のコントロールサーフェスに陥ることなく、ランタイムの入れ替わりの下でも可搬性を保てる方法についての、実装に基づいた説明です。

原文 (English)

Proof-Carrying Agent Actions: Model-Agnostic Runtime Governance for Heterogeneous Agent Systems

Agent systems execute through runtimes with very different control points: local coding tools, framework SDKs, managed agent platforms, API gateways, and observer-only integrations. A high-risk action such as publishing data externally may therefore appear as a shell command in one runtime, a tool call in another, and a hosted session transition in a third. This makes it difficult to answer a basic governance question consistently: what action was authorized, under whose authority, with what approval semantics, and with what evidence after execution? This paper presents Proof-Carrying Agent Actions (PCAA), a runtime-neutral governance model centered on an action certificate rather than on a vendor-native session record. PCAA organizes control around five checkpoints: pre-action admissibility, action open, assumption capture, approval, and outcome closure. It binds these checkpoints to a portable action envelope, runtime and approval receipts, and replay-ready proof. The model is extended in two practical ways: the certificate is externality-aware, carrying boundary facts such as destination visibility and account provenance, and approval is described by explicit enforceability classes rather than by a single reviewed or unreviewed bit. We study the model through a reference implementation in a heterogeneous agent control plane and a disclosure-bounded evaluation protocol. On a protected benchmark expanded from 24 executable seeds to 96 traces across four runtime families, PCAA preserves route quality while exposing distinct failure modes under ablation. The paper contributes a systems formulation of runtime governance around certificate-bearing actions and an implementation-grounded account of how that formulation can remain portable under runtime churn without collapsing into vendor-specific control surfaces.
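To make the five-checkpoint structure concrete, here is a minimal sketch of an action certificate that accepts receipts only in checkpoint order. The abstract names the checkpoints and the certificate's contents but not a schema, so the field names, the strictly sequential ordering, and the `"unreviewed"` default are hypothetical choices for illustration.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Checkpoint(IntEnum):
    """The five checkpoints listed in the abstract, in the order given there."""
    PRE_ACTION_ADMISSIBILITY = 0
    ACTION_OPEN = 1
    ASSUMPTION_CAPTURE = 2
    APPROVAL = 3
    OUTCOME_CLOSURE = 4


@dataclass
class ActionCertificate:
    """Hypothetical runtime-neutral record for one governed action."""
    action: str
    authority: str
    boundary_facts: dict = field(default_factory=dict)  # e.g. destination visibility
    enforceability_class: str = "unreviewed"            # hypothetical label
    receipts: list = field(default_factory=list)

    def record(self, checkpoint, evidence):
        """Append a receipt; assumes checkpoints must arrive strictly in order."""
        done = len(self.receipts)
        if done >= len(Checkpoint):
            raise ValueError("certificate is already closed")
        expected = Checkpoint(done)
        if checkpoint != expected:
            raise ValueError(f"expected {expected.name}, got {checkpoint.name}")
        self.receipts.append((checkpoint, evidence))

    @property
    def closed(self):
        """True once all five checkpoints have a receipt."""
        return len(self.receipts) == len(Checkpoint)
```

Because the certificate describes the action rather than a vendor session, a shell command, a tool call, and a hosted session transition could all be recorded in the same shape.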

2026-06-04 13:00 JST | arXiv cs.AI | エージェント | ロボティクス

AgenticDiffusion: ビジョンベースの UAV ナビゲーションのための Agentic Diffusion ベースの経路計画

屋内 UAV ナビゲーションには、限られた視野の観察下での効率的な探索、シーンの理解、信頼性の高い軌道の実行が必要です。既存のビジョンベースのナビゲーションフレームワークは通常、単一ビューの観察に依存しており、オクルージョン、ターゲットの可視性、およびグローバルシーン構造について推論する能力が制限されています。この研究では、統合された航空ナビゲーションパイプライン内で、言語に基づく推論、オープンボキャブラリーによるターゲットのグラウンディング、ビジョンベースの拡散計画、および NMPC を連携させるマルチビュー UAV ナビゲーションフレームワークである AgenticDiffusion を提案します。自然言語による指示と、同期した一人称視点 (FPV) および上面視点の観察が与えられると、フレームワークはナビゲーションに最も有益な視点を決定し、軌道の実行前にミッション計画を生成します。ターゲットはオープンボキャブラリーのグラウンディングモデルを使用して位置特定され、その後、視点固有の拡散プランナーが UAV 実行のためのナビゲーション軌道を生成します。提案されたフレームワークは、補完的な視点を使用することで、ターゲット探索の繰り返しを削減し、雑然とした屋内環境でのナビゲーション効率を向上させます。このフレームワークは、適応的な視点選択、多段階ミッション実行、長距離ナビゲーション、安全な着陸地点選択を含む 4 つの現実世界の UAV ナビゲーションシナリオで検証されました。実験結果では、40 回の実世界試験でミッション全体の成功率が 80% であり、拡散プランナーの軌道生成の成功率は 100% に達しました。

原文 (English)

AgenticDiffusion: Agentic Diffusion-based Path Planning for Vision-Based UAV Navigation

Indoor UAV navigation requires efficient exploration, scene understanding, and reliable trajectory execution under limited field-of-view observations. Existing vision-based navigation frameworks typically rely on single-view observations, limiting their ability to reason about occlusions, target visibility, and global scene structure. In this work, we propose AgenticDiffusion, a multi-view UAV navigation framework that coordinates language-guided reasoning, open-vocabulary target grounding, vision-based diffusion planning, and NMPC within a unified aerial navigation pipeline. Given a natural language instruction and synchronized first-person-view (FPV) and top-view observations, the framework determines the most informative viewpoint for navigation and generates a mission plan prior to trajectory execution. The targets are localized using an open-vocabulary grounding model, after which viewpoint-specific diffusion planners generate navigation trajectories for UAV execution. Using complementary viewpoints, the proposed framework reduces repeated target exploration and improves navigation efficiency in cluttered indoor environments. The framework was validated in four real-world UAV navigation scenarios involving adaptive viewpoint selection, multi-stage mission execution, long-horizon navigation, and safe landing-site selection. The experimental results demonstrated an overall mission success rate of 80% in 40 real-world trials, while the diffusion planners achieved a trajectory generation success rate of 100%.

2026-06-04 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

SaliMory: 会話エージェントの認知記憶を調整する

生涯の伴侶として機能する会話エージェントは、すべての対話にわたって永続的な記憶を維持する必要があります。ただし、生の検索結果でコンテキストウィンドウを単純に拡張すると推論の品質が低下し、標準的な強化学習で記憶エージェントをトレーニングすると、多段階パイプラインで深刻なクレジット割り当てのボトルネックが発生します。これを解決するために、単一の言語モデルをトレーニングして、ユーザーの事実、好み、作業記憶にまたがる認知的に構造化された記憶を管理するフレームワークである SALIMORY を紹介します。SALIMORY は、階層的な段階ごとのプロセス報酬と報酬分解された対照的洗練を導入することにより、個別の記憶操作 (選択的フィルタリング、統合、およびキュー主導のリコール) それぞれに分離された教師信号をエンドツーエンドで与えます。SALIMORY はメモリに起因する障害を 3 分の 1 削減し、エンドツーエンドの精度で最先端の手法を 10% 以上上回り、Good Personalization 率を 2 倍以上に高めます。

原文 (English)

SaliMory: Orchestrating Cognitive Memory for Conversational Agents

Conversational agents that serve as lifelong companions must maintain persistent memory across all interactions. However, simply expanding context windows with raw retrieval degrades reasoning quality, while training memory agents via standard reinforcement learning creates a severe credit assignment bottleneck in a multi-stage pipeline. To solve this, we introduce SALIMORY, a framework that trains a single language model to manage a cognitively structured memory spanning user facts, preferences, and working memory. By introducing a hierarchical stage-wise process reward and reward-decomposed contrastive refinement, SALIMORY provides isolated supervision for distinct memory operations (selective filtering, consolidation, and cue-driven recall) end-to-end. SALIMORY cuts memory-attributed failures by one-third, outperforms the state-of-the-art by over 10% in end-to-end accuracy, and more than doubles the Good Personalization rate.

2026-06-04 13:00 JST | arXiv cs.AI | エージェント | ロボティクス

大規模言語モデルによる適応軌道最適化のためのセマンティック制約合成

軌道の最適化は、宇宙探査において安全で信頼性の高い自律運用を可能にするための重要なコンポーネントです。宇宙ミッションの頻度、複雑さ、範囲が増加するにつれて、ミッションの目的と運用上の制約を正確に反映する、数学的に適切な軌道最適化問題を迅速に定式化する必要性が高まっています。ただし、ミッションの意図を軌道最適化のための扱いやすい分析公式に変換するには、かなりの専門知識が必要です。この論文では、大規模言語モデル (LLM) を活用して、ミッションの要件と制約の自然言語記述を実行可能な軌道最適化コードと対応する数学的定式化に変換するフレームワークを紹介します。宇宙船ランデブーシナリオでの実験では、意味論的なミッション要件から凸軌道最適化問題を再調整する際の高い成功率が実証されています。最終的に、この研究は、高レベルの意図と形式的な最適化モデルを橋渡しする LLM の可能性を強調し、宇宙船のより柔軟で効率的な軌道設計を可能にします。

原文 (English)

Semantic Constraint Synthesis for Adaptive Trajectory Optimization via Large Language Models

Trajectory optimization is a critical component for enabling safe and reliable autonomous operations in space exploration. As space missions increase in frequency, complexity, and scope, there is a growing need to rapidly formulate mathematically sound trajectory optimization problems that accurately reflect mission objectives and operational constraints. However, translating mission intent into tractable analytical formulations for trajectory optimization requires substantial domain expertise. This paper presents a framework that leverages large language models (LLMs) to translate natural language descriptions of mission requirements and constraints into executable trajectory optimization code and corresponding mathematical formulations. Experiments in spacecraft rendezvous scenarios demonstrate a high success rate in reconditioning a convex trajectory optimization problem from semantic mission requirements. Ultimately, this work highlights the potential of LLMs to bridge high-level intent and formal optimization models, enabling more flexible and efficient trajectory design of spacecraft.

2026-06-04 13:00 JST | arXiv cs.AI | エージェント | 研究/論文

HighTide: エージェントが厳選したオープンソース VLSI ベンチマークスイート

進化する AI 支援ベンチマークスイートである HighTide を紹介します。具体的な貢献は次のとおりです。(i) 複数の設計言語とテクノロジノードにまたがる多様なオープンソーススイート、(ii) リモートキャッシュを備えた Bazel ベースの増分 RTL-to-GDS コンパイル、(iii) 設計ライフサイクル、フロー最適化、ツールリファレンス、メタメンテナンスをカバーする 12 のエージェントスキルによる AI 支援の設計キュレーション (スイート全体のチューニング根拠の長期記憶として機能する設計ごとの意思決定ログに裏付けられています)、(iv) 安定リリース用の RTL コンパイル検証を備えたインフラストラクチャ。このスイートは一般公開されており、オープンソースのハードウェアエコシステムとともに成長するように設計されています。

原文 (English)

HighTide: An Agent-Curated Open-Source VLSI Benchmark Suite

We introduce HighTide, an evolving AI-assisted benchmark suite. Specifically, the contributions are: (i) a diverse open-source suite spanning multiple design languages and technology nodes, (ii) Bazel-based incremental RTL-to-GDS compilation with remote caching, (iii) AI-assisted design curation through twelve agent skills covering the design lifecycle, flow optimization, tool reference, and meta-maintenance, backed by per-design decision logs that serve as long-term memory of tuning rationale across the suite, and (iv) an infrastructure with RTL compilation verification for stable releases. The suite is publicly available and designed to grow with the open-source hardware ecosystem.

2026-06-04 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント | ハードウェア/半導体

Caught in the Act(ivation): LLM エージェントによる資格情報漏洩の事前出力およびマルチターン検出に向けて

LLM エージェントは多くの場合、機密性の高い認証情報を信頼できない取得コンテンツと同じコンテキストウィンドウに配置するため、間接プロンプトインジェクションが認証情報の漏洩を誘発する直接的な経路が生じます。私たちは、3 つの相補的な防御を通じてこの障害モードを研究します。まず、出力トークンが生成される前に、アクティベーションプローブが認証情報へのアクセスを検出できるかどうかを調べます。次に、形式固有の文字モデルからハニートークンを構築し、分割共形予測 (split conformal prediction) で検出を較正します。3 番目に、複数ターンにわたる漏洩を累積的な情報フロー問題として扱い、会話ターン全体での推定漏洩バジェットを追跡します。オープンウェイトモデルでの制御された実験では、アクティベーション特徴量により、ホールドアウトしたエンコーディング変換の下でも、無害なプロンプトと認証情報を狙うプロンプトが高精度で分離されます。小規模な合成マルチターンスイートでは、累積アカウンティングにより、ターンごとの検出器が見逃した攻撃が検出されます。これらの結果は暫定的なものです。マルチターンベンチマークは自作の小規模なものであり、アクティベーション手法にはホワイトボックスアクセスが必要であり、情報推定器は形式的な上限ではなく実用的なシグナルを提供するにとどまります。それでも、この結果は、認証情報の漏洩防御では、テキストレベルの出力フィルターのみに依存するのではなく、出力前の監視、較正されたカナリア検出、および時間的な漏洩アカウンティングを組み合わせるべきであることを示唆しています。

原文 (English)

Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents

LLM agents often place sensitive credentials in the same context window as untrusted retrieved content, creating a direct path for indirect prompt injection to induce credential exfiltration. We study this failure mode through three complementary defenses. First, we ask whether activation probes can detect credential access before output tokens are emitted. Second, we construct honeytokens from format-specific character models and calibrate detection with split conformal prediction. Third, we treat multi-turn exfiltration as a cumulative information-flow problem and track an estimated leakage budget across conversation turns. In controlled experiments on open-weight models, activation features separate benign and credential-seeking prompts with high accuracy, including under held-out encoding transformations. In a small synthetic multi-turn suite, cumulative accounting detects attacks that per-turn detectors miss. These results are preliminary: the multi-turn benchmark is in-house and small, the activation method requires white-box access, and the information estimator provides a practical signal rather than a formal upper bound. Still, the results suggest that credential-exfiltration defenses should combine pre-output monitoring, calibrated canary detection, and temporal leakage accounting rather than relying only on text-level output filters.
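Two of the three defenses lend themselves to a small sketch: calibrating a detection threshold with split conformal prediction, and tracking a cumulative leakage budget across turns. The code below is a generic illustration under stated assumptions, not the paper's implementation. The calibration scores are assumed to be detector scores on held-out benign samples, and the per-turn "estimated bits" are assumed to come from some external estimator that is not shown.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal threshold from benign calibration scores.

    Assuming a new benign score is exchangeable with the calibration
    scores, it exceeds the returned threshold with probability <= alpha.
    """
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    k = math.ceil((n + 1) * (1.0 - alpha))  # rank of the conformal quantile
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_scores)[k - 1]


class LeakageBudget:
    """Hypothetical cumulative accounting of estimated leakage per conversation."""

    def __init__(self, budget_bits):
        self.budget_bits = budget_bits
        self.spent_bits = 0.0

    def observe(self, estimated_bits):
        """Add one turn's estimate; return True once the budget is exceeded."""
        if estimated_bits < 0:
            raise ValueError("estimated leakage cannot be negative")
        self.spent_bits += estimated_bits
        return self.spent_bits > self.budget_bits
```

The budget tracker shows why per-turn checks can miss slow exfiltration: three turns of 3 bits each stay under a per-turn limit of 5 bits, yet together they exceed a conversation budget of 8 bits.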

2026-06-04 13:00 JST | arXiv cs.AI | エージェント

公証されたエージェント: AI エージェントのアクションに対する受信者証明付きの機密レシート

現在の AI エージェントの可観測性には構造的な欠陥があります。アクティビティログを生成するエンティティが、アクティビティを記録される当のエンティティと同じだからです。侵害されたエージェントやバグのあるエージェントは、自身のトレースを省略、変更、または捏造する可能性があり、エージェントを実行するオペレーターには改ざんを検出する独立した手段がありません。私たちは、信頼境界を反転することでこれを解決するプロトコルのクラスを提案します。エージェントの呼び出しを受信するサービスは、観察した内容のレシートに自身の鍵で署名し、そのレシートをエージェントの所有者宛てに暗号化して、公開の透明性ログに公開します。所有者は、エージェントやそのオペレーターを信頼することなく、改ざんを検知できる証跡を再構築します。このクラスを Sello としてインスタンス化します。これは、現在のどのシステムにも存在しない 4 つのプロパティを組み合わせたプロトコルです。(P1) 受信者側の署名、(P2) JWS 経由で認可トークンにバインドされた所有者公開鍵への HPKE 暗号化、(P3) 証人が連署したマークルログへの公開、(P4) トークン参照による所有者側の検出です。私たちはプロトコルを説明し、エージェントとそのオペレーターを制御する敵対者の下でそのセキュリティを分析し、暗号操作のマイクロベンチマークを提示し、隣接するレシートプロトコルの研究 (Signet、AgentROA、Agent Passport System、draft-farley-acta、SCITT) の中に Sello を位置づけます。抑制攻撃、サービスの共謀、採用インセンティブの問題などの既知の制限についても説明します。

原文 (English)

Notarized Agents: Receiver-Attested Confidential Receipts for AI Agent Actions

Current AI agent observability is structurally compromised: the entity producing the activity log is the same entity whose activity is being logged. A compromised or buggy agent can omit, alter, or fabricate its own traces, and the operator running the agent has no independent way to detect tampering. We propose a class of protocols that resolves this by inverting the trust boundary: the service that receives an agent's call signs a receipt of what it observed using its own key, encrypts the receipt to the agent's owner, and publishes it to a public transparency log. The owner reconstructs a tamper-evident trail without trusting the agent or its operator. We instantiate the class as Sello, a protocol combining four properties absent in any current system: (P1) receiver-side signing, (P2) HPKE encryption to an owner public key bound to the authorization token via JWS, (P3) publication to a witness-cosigned Merkle log, and (P4) owner-side discovery by token reference. We describe the protocol, analyze its security under an adversary that controls the agent and its operator, present microbenchmarks of the cryptographic operations, and situate Sello among adjacent receipt-protocol work (Signet, AgentROA, Agent Passport System, draft-farley-acta, SCITT). We discuss known limitations including the suppression attack, service collusion, and the adoption-incentive problem.
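Property (P3) relies on a Merkle log, so a short sketch of how an owner could check that a receipt is included in such a log may help. This is a generic Merkle inclusion proof with domain-separation prefixes in the style of RFC 6962, not Sello's wire format. Receiver-side signing, HPKE encryption, and witness cosigning are deliberately omitted, and all function names are hypothetical.

```python
import hashlib


def leaf_hash(entry: bytes) -> bytes:
    # 0x00 prefix separates leaf hashes from interior-node hashes.
    return hashlib.sha256(b"\x00" + entry).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # 0x01 prefix marks an interior node.
    return hashlib.sha256(b"\x01" + left + right).digest()


def _split(n: int) -> int:
    """Largest power of two strictly less than n (requires n >= 2)."""
    return 1 << ((n - 1).bit_length() - 1)


def merkle_root(entries):
    """Root hash over an ordered, non-empty list of log entries."""
    if not entries:
        raise ValueError("log must contain at least one entry")
    if len(entries) == 1:
        return leaf_hash(entries[0])
    k = _split(len(entries))
    return node_hash(merkle_root(entries[:k]), merkle_root(entries[k:]))


def inclusion_proof(index, entries):
    """Sibling hashes from leaf to root, each tagged with whether it sits on the left."""
    if not 0 <= index < len(entries):
        raise IndexError("index out of range")
    if len(entries) == 1:
        return []
    k = _split(len(entries))
    if index < k:
        return inclusion_proof(index, entries[:k]) + [(merkle_root(entries[k:]), False)]
    return inclusion_proof(index - k, entries[k:]) + [(merkle_root(entries[:k]), True)]


def verify_inclusion(entry, proof, root):
    """Recompute the root from one entry and its proof, then compare."""
    digest = leaf_hash(entry)
    for sibling, sibling_is_left in proof:
        digest = node_hash(sibling, digest) if sibling_is_left else node_hash(digest, sibling)
    return digest == root
```

With a proof like this, the owner needs only the log's root hash, not the full log, to confirm that a given receipt was published.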

2026-06-04 13:00 JST | arXiv cs.AI | 画像/動画生成 | エージェント

StandardE2E: エンドツーエンドの自動運転データセットのための統合フレームワーク

自動運転は、モジュール式の認識・予測・計画スタックから、センサー入力を車両制御に直接マッピングするエンドツーエンド (E2E) モデルに移行しており、多くの場合、3D 検出、動き予測、HD マップ認識などの補助タスクによって正則化されています。進歩は、センサーが豊富な運転データセットの急速に成長するエコシステムによって推進されていますが、それぞれが独自のファイル形式、API、座標規則、モダリティカバレッジを持っているため、データセット間の実験や、データセットごとの基本的な前処理でさえ、プロジェクトごとに再実装する必要があります。私たちは、E2E 運転データセットに対して単一の統一インターフェイスを提供するフレームワークである StandardE2E を紹介します。StandardE2E は、(i) 1 つの共有データスキーマの下でデータセットごとの前処理を標準化し、(ii) 複数のデータセットを単一の PyTorch DataLoader に結合して、データセット間の事前トレーニング、補助タスクによる教師信号、シナリオレベルのフィルタリングを可能にし、(iii) 新しいデータセットの追加を、生のフレームから正規スキーマへのデータセットごとの単一のマッピングに帰着させ、ダウンストリームパイプライン全体を変更不要にします。このフレームワークは、Waymo End-to-End、Waymo Perception、Argoverse 2 Sensor、Argoverse 2 LiDAR、NAVSIM (OpenScene-v1.1)、および WayveScenes101 の 6 つのデータセットをすぐにサポートしており、オープンソースの standard-e2e Python パッケージとしてリリースされ、https://github.com/stepankonev/StandardE2E で入手できます。

原文 (English)

StandardE2E: A Unified Framework for End-to-End Autonomous Driving Datasets

Autonomous driving has shifted from modular perception-prediction-planning stacks toward end-to-end (E2E) models that map sensor inputs directly to vehicle control, often regularized by auxiliary tasks such as 3D detection, motion forecasting, and HD-map perception. Progress is driven by a fast-growing ecosystem of sensor-rich driving datasets, yet each ships its own file formats, APIs, coordinate conventions, and modality coverage, leaving cross-dataset experimentation and even basic per-dataset preprocessing to be re-implemented per project. We present StandardE2E, a framework that provides a single unified interface over E2E driving datasets. StandardE2E (i) standardizes per-dataset preprocessing under one shared data schema; (ii) combines multiple datasets in a single PyTorch DataLoader for cross-dataset pretraining, auxiliary-task supervision, and scenario-level filtering; and (iii) reduces adding a new dataset to a single per-dataset mapping from raw frames to the canonical schema, leaving the entire downstream pipeline unchanged. The framework supports six datasets out of the box: Waymo End-to-End, Waymo Perception, Argoverse 2 Sensor, Argoverse 2 LiDAR, NAVSIM (OpenScene-v1.1), and WayveScenes101, and is released as the open-source standard-e2e Python package, available at https://github.com/stepankonev/StandardE2E.

2026-06-04 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

信頼できない入力から信頼できるメモリへ: LLM エージェントにおけるメモリポイズニング攻撃の系統的研究

メモリは AI エージェントの中核コンポーネントであり、AI エージェントがインタラクションを通じて知識を蓄積し、パフォーマンスを向上させることができます。ただし、永続メモリにはメモリポイズニングのリスクが伴います。メモリポイズニングの場合、敵対的な 1 回のメモリ書き込みがエージェントの動作に長期的な影響を与える可能性があります。我々は、LLM ベースのエージェントにおけるメモリポイズニングの体系的な研究を紹介します。 4 つのメモリ書き込みチャネルと、これらのチャネルを悪用可能にするモデル機能、システムプロンプト設計、およびエージェントシステムアーキテクチャにおける 9 つの構造的脆弱性を特定しました。これらの脆弱性に基づいて、メモリポイズニング攻撃の 6 つのクラスの分類を作成します。さらに、メモリポイズニング攻撃を評価するためのベンチマークである MPBench を設計し、より積極的にメモリの書き込みと取得を行うように設計されたエージェントが悪用されやすいことを示します。また、既存のプロンプトインジェクション防御ではメモリポイズニング攻撃をカバーできないことも示します。私たちの調査結果は、AI エージェントに対するメモリポイズニング攻撃を理解し、軽減するための基盤を提供します。

原文 (English)

From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents

Memory is a core component of AI agents, enabling them to accumulate knowledge across interactions and improve performance. However, persistent memory introduces the risk of memory poisoning, where a single adversarial memory write can exert long-term influence over agent behavior. We present a systematic study of memory poisoning in LLM-based agents. We identify four memory write channels and nine structural vulnerabilities in model capabilities, system prompt design, and agent system architecture that make these channels exploitable. Based on these vulnerabilities, we develop a taxonomy of six classes of memory poisoning attacks. Furthermore, we design MPBench -- a benchmark for evaluating memory poisoning attacks, and show that agents designed to write and retrieve memory more aggressively are more exploitable. We also show that existing prompt injection defenses fail to cover memory poisoning attacks. Our findings provide a foundation for understanding and mitigating memory poisoning attacks against AI agents.

2026-06-04 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

プロンプトインジェクションが消えずに残るとしたら? エージェントシステムにおけるクロスセッションのストアドプロンプトインジェクションの探索

最新のエージェントシステムは、LLM をセッション限定のアシスタントからステートフルなシステムに変えます。ステートフルなシステムは、メモリ、ファイルシステム、ツール、およびその他の長期間存続するコンテキストアーティファクトを通じて、セッション間で共有のワールド状態を永続化し、進化させます。この変化により、プロンプトインジェクションの攻撃対象領域が根本的に拡大します。しかし、プロンプトインジェクションに関するこれまでの研究は主に単一セッション内のモデルレベルの脅威に焦点を当てており、セッションをまたいで永続するシステム状態がエージェントシステムのシステムレベルのリスクをどのように根本的に変化させるかを見落としていました。Web システムのストアドクロスサイトスクリプティングにヒントを得て、私たちはクロスセッションのストアドプロンプトインジェクションを導入します。これは、成功したインジェクションがエージェントシステムの状態内に残り続け、元の攻撃者とのやり取りが終了したずっと後でも将来の実行に密かに影響を与えうるというものです。この脅威を体系的に研究するために、ストアドプロンプトインジェクションを形式化し、敵対的なコンテンツがどのようにセッション間で持続し、エージェントシステムに影響を与えるかの分類を開発します。さらに、ストアドプロンプトインジェクションのリスクを評価するためのベンチマークとサンドボックスツールキットを開発し、さまざまなモデル、攻撃目標、永続化チャネルにわたる攻撃の成功の定量的分析を可能にします。私たちの調査結果は、永続化によって、プロンプトインジェクションが一時的なモデルレベルの脅威から、エージェントの実行状態に埋め込まれた長期にわたるシステムレベルの脆弱性へと変化することを浮き彫りにしています。私たちは、この研究がこの新たな脅威に対する幅広い注目を集め、エージェントシステムにおける永続性から生じるシステムリスクをコミュニティが体系的に調査して軽減するきっかけとなることを願っています。

原文 (English)

What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems

Modern agentic systems transform LLMs from session-bounded assistants into stateful systems that persist and evolve shared world state across sessions through memories, filesystems, tools, and other long-lived contextual artifacts. This shift fundamentally expands the attack surface of prompt injection. However, prior works on prompt injection have largely focused on model-level threats within a single session, overlooking how cross-session persistent system state fundamentally changes the system-level risk of agentic systems. Inspired by stored cross-site scripting in web systems, we introduce cross-session stored prompt injection, where a successful injection can persist within agentic system state and silently influence future executions long after the original attacker interaction has ended. To systematically study this threat, we formalize stored prompt injection and develop a taxonomy of how adversarial content persists and affects agentic systems across sessions. We further develop a benchmark and sandbox toolkit to evaluate the risks of stored prompt injection, enabling quantitative analysis of attack success across different models, attack goals, and persistence channels. Our findings highlight that persistence transforms prompt injection from an ephemeral model-level threat into a long-lived system-level vulnerability embedded within agent execution state. We hope this work draws broader attention to this emerging threat and motivates the community to systematically study and mitigate system risks arising from persistence in agentic systems.

2026-06-04 13:00 JST | arXiv cs.AI | エージェント | 研究/論文

CyberGym-E2E: AI エージェントのエンドツーエンドのサイバーセキュリティ機能のためのスケーラブルな現実世界のベンチマーク

AI は、ソフトウェアの脆弱性を自律的に検出、分析、修復できるシステムを可能にすることで、サイバーセキュリティを変革する可能性を秘めています。しかし、AI システムの既存のサイバーセキュリティ評価は規模や範囲が限られており、現実世界のソフトウェアの脆弱性の発見と修復のエンドツーエンドのライフサイクルを捉えることができません。このギャップに対処するために、私たちは、脆弱性の発見、PoC 生成、パッチ生成のライフサイクル全体にわたって AI エージェントの能力を包括的に評価する、大規模かつ現実的なエンドツーエンドのサイバーセキュリティベンチマークである CyberGym-E2E を提案します。 CyberGym-E2E は、オープンソースの脆弱性データを現実的な評価環境に変換するための自動化されたエージェント強化パイプラインを構築するため、包括的でスケーラブルです。現在、ベンチマークは、139 の異なるオープンソースプロジェクトにわたる 920 件の実際の脆弱性で構成されています。

原文 (English)

CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities

AI has the potential to transform cybersecurity by enabling systems that can autonomously detect, analyze, and remediate software vulnerabilities. However, existing cybersecurity evaluations of AI systems are limited in scale or scope, and fail to capture the end-to-end lifecycle of real-world software vulnerability discovery and remediation. To address this gap, we propose CyberGym-E2E, a large-scale and realistic end-to-end cybersecurity benchmark that comprehensively evaluates AI agents' abilities across the full lifecycle of vulnerability discovery, PoC generation, and patch generation. CyberGym-E2E is comprehensive and scalable, as we build an automated, agent-enhanced pipeline for transforming open-source vulnerability data into realistic evaluation environments. Currently, the benchmark consists of 920 real-world vulnerabilities across 139 different open-source projects.

2026-06-04 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

SePO: システムプロンプト最適化のための自己進化型プロンプトエージェント

システムプロンプトの最適化は、基礎となるモデルを変更することなくエージェントの動作を改善し、人間が読めるモデル非依存の指示を生成します。既存の方法では、タスクエージェントのシステムプロンプトを改良するプロンプトエージェントを構築しますが、プロンプトエージェント自身のシステムプロンプトは手作業で設計され、固定されたままです。私たちは、プロンプトエージェント自身のシステムプロンプトを、タスクエージェントのシステムプロンプトと並ぶ最適化ターゲットとして扱う Self-Evolving Prompt Optimization (SePO) を提案します。SePO は自己参照的な設計を採用しています。単一のプロンプトエージェントが、候補プロンプトのアーカイブを踏み台として維持するオープンエンドの進化的探索の下で、タスクエージェントのシステムプロンプトと自身のシステムプロンプトの両方を改善します。トレーニングは 2 つの段階で進行します。事前トレーニングではマルチタスクプール上でプロンプトエージェントを進化させ、その後のファインチューニングでそれをターゲットタスクに適用します。数学 (AIME'25)、抽象推論 (ARC-AGI-1)、大学院レベルの科学 (GPQA)、コード生成 (MBPP)、および論理パズル (数独) の 5 つのベンチマークにおいて、SePO は一貫して Manual-CoT、TextGrad、および MetaSPO を上回り、Manual-CoT と比較して平均精度が 4.49 ポイント向上しました。事前トレーニングで得られたプロンプト最適化スキルは、タスクごとのプロンプトを記憶するのではなく、事前トレーニングのタスク混合に含まれないタスクにも一般化します。

原文 (English)

SePO: Self-Evolving Prompt Agent for System Prompt Optimization

System prompt optimization improves agent behavior without modifying the underlying model, yielding human-readable, model-agnostic instructions. Existing methods build a prompt agent that refines task agents' system prompts, yet leave the prompt agent's own system prompt hand-engineered and fixed. We propose Self-Evolving Prompt Optimization (SePO), which treats the prompt agent's own system prompt as an optimization target alongside task agents' system prompts. SePO adopts a self-referential design. A single prompt agent improves both task agents' system prompts and its own under an open-ended evolutionary search that maintains an archive of candidate prompts as stepping stones. Training proceeds in two stages: pre-training evolves the prompt agent on a multi-task pool, and fine-tuning then applies it to a target task. Across five benchmarks spanning math (AIME'25), abstract reasoning (ARC-AGI-1), graduate-level science (GPQA), code generation (MBPP), and logic puzzles (Sudoku), SePO consistently outperforms Manual-CoT, TextGrad, and MetaSPO, improving the average accuracy by 4.49 points compared to Manual-CoT. The prompt optimization skill from pre-training also generalizes to tasks beyond the pre-training mixture, rather than memorizing per-task prompts.

2026-06-04 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

エージェントの記憶にとって時間的順序は重要: 長期エージェントのセグメントツリー

長期的な会話型エージェントは、変化していくイベント、タスク、目標を通じてユーザーと対話する必要があります。このような履歴は本質的に時間的な構造を持ちますが、多くの既存の記憶システムは主にトピックの類似性によって情報を整理しており、イベントが発生した順序を無視する可能性があります。私たちは、会話履歴を発話上の時間順に並んだセグメントツリーとして表すメモリアーキテクチャである Segment Tree Memory (SegTreeMem) を導入します。SegTreeMem は、オンラインの右端フロンティア更新ルールを通じて新しい発話を段階的に挿入し、階層的なメモリセグメントを形成しながら時系列順を維持します。検索時には、SegTreeMem はツリーを通じて関連性スコアを伝播し、局所的な意味的一致と階層的な時間コンテキストを組み合わせます。SegTreeMem は、3 つの長期メモリベンチマークと 2 つの LLM バックボーンにわたって、フラット検索、グラフ構造メモリ、およびツリー構造メモリのベースラインよりも回答品質を向上させます。追加の時間順序の並べ替え分析では、性能向上がメモリ構築時の時間順序の維持に依存することが示され、時間順序がエージェント記憶の重要な構造であるという主張が裏付けられています。

原文 (English)

Temporal Order Matters for Agentic Memory: Segment Trees for Long-Horizon Agents

Long-horizon conversational agents need to interact with users through evolving events, tasks, and goals. Such histories are naturally temporal, yet many existing memory systems organize information primarily by topical similarity and may ignore the order in which events occur. We introduce Segment Tree Memory, or SegTreeMem, a memory architecture that represents conversation history as a temporally ordered Segment Tree over utterances. SegTreeMem incrementally inserts new utterances through an online rightmost-frontier update rule, preserving chronological order while forming hierarchical memory segments. For retrieval, SegTreeMem propagates relevance scores through the tree to combine local semantic matching with hierarchical temporal context. Across three long-horizon memory benchmarks and two LLM backbones, SegTreeMem improves answer quality over flat retrieval, graph-structured memory, and tree-structured memory baselines. Additional temporal-order permutation analysis shows that the performance gain depends on preserving temporal order during memory construction, supporting the claim that temporal order is a key structure for agentic memory.
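One possible reading of "rightmost-frontier insertion" and "score propagation through the tree" is sketched below. It is an assumption-laden toy, not SegTreeMem itself. The merge rule (combine equal-height subtrees on the right edge), the word-overlap relevance function, and the `beta` mixing weight are all hypothetical stand-ins for details the abstract does not give.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    start: int    # index of the first utterance covered
    end: int      # index of the last utterance covered (inclusive)
    height: int   # 0 for a single utterance
    text: str = ""
    children: list = field(default_factory=list)


class SegmentTreeMemory:
    def __init__(self):
        self.frontier = []  # subtree roots, oldest first
        self.size = 0

    def insert(self, utterance):
        """Append at the right edge; merge equal-height neighbours (assumed rule)."""
        node = Segment(self.size, self.size, 0, text=utterance)
        self.size += 1
        while self.frontier and self.frontier[-1].height == node.height:
            left = self.frontier.pop()
            node = Segment(left.start, node.end, left.height + 1, children=[left, node])
        self.frontier.append(node)

    def retrieve(self, query, top_k=3, beta=0.5):
        """Rank utterances by local match plus beta-weighted ancestor segment means."""
        query_words = set(query.lower().split())
        scored = []

        def local(node):
            # Toy relevance: number of query words present in the utterance.
            return float(len(query_words & set(node.text.lower().split())))

        def total(node):
            if not node.children:
                return local(node)
            return sum(total(child) for child in node.children)

        def walk(node, inherited):
            if not node.children:
                scored.append((local(node) + inherited, node.start, node.text))
                return
            mean = total(node) / (node.end - node.start + 1)
            for child in node.children:
                walk(child, inherited + beta * mean)

        for root in self.frontier:
            walk(root, 0.0)
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [(index, text) for _, index, text in scored[:top_k]]
```

In this toy, an utterance that sits inside a segment full of relevant neighbours is ranked above an isolated utterance with the same local match, which is the kind of temporal context a flat retriever would ignore.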

2026-06-04 13:00 JST | arXiv cs.AI | 画像/動画生成 | エージェント

オブジェクト検出におけるインスタンスレベルの事後不確実性の定量化

物体検出は自動運転において安全上重要な要素です。安全性を保証するには、境界ボックス予測の不確実性を定量化することが不可欠です。再トレーニングを必要としない事後的な不確実性の定量化は現実世界の導入要件に合致するため、私たちはラプラス近似を使用します。インスタンスレベルの不確実性が必要であるため、複数回のバックプロパゲーションを必要とする線形化推論手法は時間効率が悪く、サンプリングベースの手法は完全には事後的ではありません。私たちは、インスタンスレベルかつ近似的に事後的な不確実性の定量化を提供するモンテカルロ一般化線形化モデル (MC-GLM) を提案します。モンテカルロステップで必要なサンプル数は一定で、出力インスタンスの数に依存しないため、並列化できます。CenterPoint 検出器を使用した nuScenes データセットでの実験により、この方法の有効性が検証され、得られる不確実性は良好な品質を示しています。

原文 (English)

Instance-Level Post Hoc Uncertainty Quantification in Object Detection

Object detection is a safety-critical component of autonomous driving. It is essential to quantify the uncertainty in bounding-box predictions for safety assurance. Post hoc uncertainty quantification without retraining aligns with real-world deployment requirements; therefore, we employ the Laplace approximation. Because instance-level uncertainty is needed, linearized inference methods that require multiple backpropagations are not time-efficient, and sampling-based methods are not fully post hoc. We propose Monte-Carlo generalized linearized model (MC-GLM), which provides instance-level and approximately post hoc uncertainty quantification. The number of samples required in the Monte Carlo step is constant and independent of the number of output instances, so it can be parallelized. Experiments on the nuScenes dataset with the CenterPoint detector validate the effectiveness of our method, and the resulting uncertainties exhibit good quality.

2026-06-04 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

TIDE: テンプレートに基づく反復によるプロアクティブな複数の問題の発見

エージェントは、ドキュメント、ツール、コードのアシスタントとして広く導入されています。ただし、これらは通常、明示的なユーザー要求にのみ作用し、ユーザーが気づいた問題のみを表面化します。一方、他の多くの重要な問題は、より広範なユーザーコンテキスト内で目に見えない形で共存しており、その総数は事前に不明です。私たちはこれを、文脈から複数の隠れた問題を発見するタスクとして組み立てます。その中で、共存する問題を明らかにし、裏付けとなる証拠に基づいて、具体的な行動と組み合わせる必要があります。この目的を達成するために、2 つの補完的なメカニズムを備えたテンプレート主導の反復フレームワークである TIDE を導入します。具体的には、シングルパス予測が最も顕著なケースに基づいて一般的な主張を生み出すという観察に動機づけられて、我々は反復発見を提案します。これは、すでに見つかったものに基づいて条件付けしながらラウンドごとに小さなバッチの候補を表面化し、後続のラウンドで対象範囲を拡大します。思考テンプレートは、以前に解決されたケースから抽出された再利用可能なスキーマであり、どのコンテキストシグナルに注目し、それらをどのように接続するかを指定し、各予測を認識可能な問題クラスに固定します。 4 つのモデルバックボーンにわたって、パーソナルワークスペースとソフトウェアリポジトリという 2 つの現実的な設定で TIDE を検証し、タスクカバレッジ、識別、解決に関して、シングルショットおよび並列マルチエージェントベースラインを超える大幅な向上を示しています。

原文 (English)

TIDE: Proactive Multi-Problem Discovery via Template-Guided Iteration

Agents are widely deployed as assistants over documents, tools, and code. However, they typically act only on explicit user requests, which surface only the problems the user has noticed, while many other important problems coexist, hidden in plain sight, within the broader user context, with their total number unknown in advance. We frame this as the task of discovering multiple hidden problems from context, in which coexisting problems should be uncovered, grounded in supporting evidence, and paired with concrete actions. To this end, we introduce TIDE, a template-guided iterative framework with two complementary mechanisms. Specifically, motivated by the observation that single-pass prediction anchors on the most salient cases and yields generic claims, we propose iterative discovery, which surfaces a small batch of candidates per round while conditioning on what has already been found, so subsequent rounds extend coverage; and thought templates, reusable schemas distilled from previously solved cases that specify what contextual signals to attend to and how to connect them, anchoring each prediction in a recognizable problem class. We validate TIDE on two realistic settings, personal workspaces and software repositories, across four model backbones, showing substantial gains over single-shot and parallel multi-agent baselines on task coverage, identification, and resolution.

2026-06-04 13:00 JST | arXiv cs.AI | エージェント

Archi: CMS 実験におけるエージェント操作

私たちは、異種データソースの体系的な取り込みと編成と、データソースを取得して推論する構成可能でプライベートで拡張可能なエージェントの展開を組み合わせた、科学コラボレーションのためのオープンソースのエンドツーエンドフレームワークである Archi を紹介します。 Archi のインスタンスは、技術オペレーターのサポートエージェントとして 2026 年 2 月から CERN の LHC での CMS 実験のコンピューティング運用チームに導入されており、文書、履歴データ、ライブ監視システムを組み合わせて検索および分析機能を提供しています。私たちはオペレーターのフィードバックと、実稼働環境での使用状況から収集された質問セットに基づいてシステムを評価し、人間のパネルと自動パネルによって採点します。このシステムは、CMS オペレーターが提起する実際のクエリを解決する運用タスクで効果的であることが証明されています。また、ローカルでホストされているオープンウェイトモデルが競争力を持って実行され、機密データの完全なプライベート管理が可能になることも観察されています。

原文 (English)

Archi: Agentic Operations at the CMS Experiment

We present Archi, an open-source, end-to-end framework for scientific collaborations that combines the systematic ingestion and organization of heterogeneous data sources with the deployment of configurable, private, and extensible agents that retrieve and reason over them. An instance of Archi has been deployed for the Computing Operations team of the CMS experiment at CERN's LHC since February 2026 as a support agent for technical operators, offering retrieval and analysis capabilities by combining documentation, historical data, and live monitoring systems. We evaluate the system on operator feedback and a question set collected from production usage, graded by human and automated panels. The system proves effective at operational tasks, resolving real-world queries posed by CMS operators. We also observe that locally-hosted, open-weight models perform competitively, enabling fully private management of sensitive data.

2026-06-04 13:00 JST | arXiv cs.AI | LLM/生成AI | 画像/動画生成 | エージェント

NoRA: 視覚的な一人称の規範的行動推論における根拠のある合理性の評価

LLM とエージェントシステムは社会環境にますます導入されており、安全で適切な行動には規範的能力が重要になっています。しかし、既存のアプローチは、規範的判断をテキストのみで評価するか、固定された一連の候補アクションの中から選択することに還元します。私たちはどちらも不十分だと主張します。実際には、エージェントにオプションのメニューが渡されることはありません。彼らは、目に見える事実に基づいて、検証可能な理由によって裏付けられた、合理的な行動をゼロから特定しなければなりません。 NoRA は視覚的な一人称ビデオベンチマークであり、モデルが次のアクションの候補を生成し、明示的な事実-理由-アクションのサポートグラフを通じてそれぞれを正当化する必要があります。このベンチマークは、HumanGold-190 および LLMSilver-1230 の分割を含む 1,420 個の注釈付きビデオクリップで構成されています。各インスタンスは、アクションの調整、事実の根拠、およびサポートのバインディングを通じて評価され、単一の根拠のある合理性スコアに集約されます。私たちは、直接的、計画的、構造化されたプロンプト体制の下で 12 のマルチモーダルシステムのベンチマークを行ったところ、現在の VLM はもっともらしいアクションと関連するシーンの事実を頻繁に回収しますが、完全な合理的なアクションスペースを構築し、選択されたアクションを正しいローカルサポートに結び付けるのに一貫して苦労していることがわかりました。 NoRA はこのギャップを測定可能にし、評価の問題を、モデルがアクションを選択できるかどうかから、適切な目に見える理由に基づいて適切なアクションを正当化できるかどうかに移します。

原文 (English)

NoRA: Evaluating Grounded Reasonableness in Visual First-person Normative Action Reasoning

LLMs and agentic systems are increasingly deployed in social environments, making normative competence critical for safe and appropriate behavior. However, existing approaches either assess normative judgment in text alone or reduce it to choosing among a fixed set of candidate actions. We argue both are insufficient. In practice, agents are never handed a menu of options; they must identify a reasonable action from scratch, grounded in visible facts and supported by inspectable reasons. We introduce NoRA, a visual first-person video benchmark that requires models to generate candidate next actions and justify each through an explicit fact-reason-action support graph. The benchmark comprises 1,420 annotated video clips, including HumanGold-190 and LLMSilver-1230 splits. Each instance is evaluated through action alignment, factual grounding, and support binding, aggregated into a single grounded reasonableness score. We benchmark 12 multimodal systems under direct, deliberate, and structured prompting regimes, finding that current VLMs frequently recover plausible actions and relevant scene facts, but consistently struggle to construct the full reasonable action space and bind selected actions to the correct local support. NoRA makes this gap measurable, shifting the evaluation question from whether a model can pick an action to whether it can justify an appropriate action for the right visible reasons.

2026-06-04 13:00 JST arXiv cs.AI エージェント

おそらくほぼ安全な保証を備えたリスク認識型強化学習のシナリオ生成

特にディープ RL を使用して学習されたポリシーは、未知の動作や安全ではない動作を引き起こす遷移の摂動の影響を受けやすいことが示される可能性があるため、安全性の保証は、現実世界への強化学習 (RL) エージェントの展開にとって重要です。ポリシー検証の方法は、安全制約に関するポリシーの軌跡をサンプリングすることによって確率的バリア証明書を構築し、それによって既知の安全な動作と未知の動作を区別することです。ポリシーが、エージェントを十分に探索されていない状態に置く遷移の不確実性または摂動の影響を受けやすい場合、これらの制約に違反する確率について厳密な上限と下限を取得することは困難になる可能性があります。これに対処するために、変分オートエンコーダー (VAE) を使用して遭遇した状態空間の分布を近似し、状態の潜在的な特性を使用して上限と下限のバリア証明書を構築し、既知の安全な動作の領域を高い信頼性で最適化します。私たちはこれを二重最適化問題として枠組み付けし、下限のバリア証明書が上限のバリア証明書よりも安全な領域のより保守的な推定を提示します。トレーニング中に 2 つの設定差内にある状態 (つまり、非ロバスト領域) をサンプリングすることにより、上限と下限を厳しくして、安全性についてより明確な確率的保証を提供することができます。私たちの研究では、設定された保証について説明し、実験的に境界の厳しさを実証します。

原文 (English)

Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees

Guaranteeing safety is critical to the deployment of reinforcement learning (RL) agents in the real-world, especially as policies learned using deep RL may demonstrate susceptibility to transition perturbations that result in unknown or unsafe behaviour. A method of policy verification is to construct probabilistic barrier-certificates by sampling policy trajectories with respect to safety constraints, thereby demarcating known safe behaviour from unknown behaviour. Obtaining tight upper and lower bounds on the probability of violation of these constraints may be difficult if the policy is susceptible to transition uncertainty or perturbation that places the agent in insufficiently explored states. To address this, we approximate the distribution of the encountered state-space using a variational autoencoder (VAE) and construct upper and lower-bound barrier-certificates using latent characteristics of states to optimize for regions of known, safe behaviour with high confidence. We frame this in our work as a dual optimization problem where the lower-bound barrier-certificate presents a more conservative estimate of the safe region than the upper-bound barrier-certificate. Sampling states that lie within the set difference of the two during training, i.e. the non-robust region, allows us to tighten the upper and lower bounds to provide sharper probabilistic guarantees on safety. Within our study, we describe the guarantees placed and demonstrate the tightness of our bounds experimentally.
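
The sampling idea in the abstract (drawing states from the set difference between the two certified regions) can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration: a one-dimensional latent variable, the thresholds 1.0 and 1.5, and plain rejection sampling. The paper's VAE and its optimization procedure are not reproduced.

```python
import random


def lower_safe(z: float) -> bool:
    """Hypothetical lower-bound certificate: the more conservative safe region."""
    return abs(z) <= 1.0


def upper_safe(z: float) -> bool:
    """Hypothetical upper-bound certificate: the more permissive safe region."""
    return abs(z) <= 1.5


def sample_non_robust(n: int, rng: random.Random,
                      low: float = -2.0, high: float = 2.0,
                      max_draws: int = 100_000) -> list[float]:
    """Rejection-sample latent states that lie in the upper region but not the lower one."""
    samples: list[float] = []
    for _ in range(max_draws):
        if len(samples) == n:
            break
        z = rng.uniform(low, high)
        # Keep only states in the set difference of the two regions.
        if upper_safe(z) and not lower_safe(z):
            samples.append(z)
    return samples
```

Training on such states concentrates effort where the two certificates disagree, which is the region the abstract calls non-robust.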

2026-06-04 13:00 JST arXiv cs.AI LLM/生成AI エージェント

行動しながら学習: オンライン生涯学習エージェント向けのスキル強化されたテスト時間共進化フレームワーク

生涯学習は、動的で対話型の環境で動作する大規模言語モデル (LLM) エージェントにとって不可欠です。しかし、長期的なタスクのための既存の生涯学習エージェントは通常、推論中の静的パラメータによる離散的なスキルや過去の経験の取得に依存しているため、人間の学習者のようにテスト時のフィードバックを継続的に内面化することができません。このギャップを埋めるために、オンライン生涯学習エージェントのための 2 段階の強化学習フレームワークであるスキル強化テスト時間共進化 (LifeSkill) を提案します。具体的には、複数のスキル条件付きポリシーのロールアウトの平均的な検証者の成功に応じて候補者のスキルに報酬を与えることで、スキル抽出のための直接監督の欠如に対処する検証者ガイド付きスキル学習を設計し、単にテキスト上でもっともらしいスキルではなく、タスクの解決に役立つスキルを生成するようにモデルを奨励します。さらに、オンラインスキル内部化を導入します。これは、スキル条件付きの軌道を報酬シグナルに変換することで、テスト時のインタラクション中にポリシーモデルを継続的に改善します。これにより、エージェントは推論機能をパラメータに直接内部化でき、エクスペリエンス取得によるコンテキストの肥大化を回避できます。 LifelongAgentBench の実験では、既存の生涯エージェントのベースラインと比較して、LifeSkill が平均パフォーマンスを 7 絶対ポイント向上させることが示されています。

原文 (English)

Learning While Acting: A Skill-Enhanced Test-Time Co-Evolution Framework for Online Lifelong Learning Agents

Lifelong learning is essential for Large Language Model (LLM) agents operating in dynamic, interactive environments. However, existing lifelong learning agents for long-horizon tasks typically depend on retrieving discrete skills or past experiences with static parameters during inference, which prevents them from continuously internalizing test-time feedback like human learners. To bridge this gap, we propose Skill-enhanced Test-Time Co-Evolution (LifeSkill), a two-stage reinforcement learning framework for Online Lifelong Learning Agents. Specifically, we design Verifier-Guided Skill Learning that addresses the lack of direct supervision for skill extraction by rewarding candidate skills according to the average verifier success of multiple skill-conditioned policy rollouts, encouraging the model to generate skills that are useful for solving tasks rather than merely plausible in text. Furthermore, we introduce Online Skill Internalization, which continuously improves the policy model during test-time interaction by transforming skill-conditioned trajectories into reward signals. This enables the agent to directly internalize reasoning capabilities into its parameters, avoiding the context bloat of experience retrieval. Experiments on LifelongAgentBench show that LifeSkill improves average performance by 7 absolute points compared with existing lifelong agent baselines.
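
The verifier-guided reward described above (score a candidate skill by the average verifier success over several skill-conditioned rollouts) can be sketched in a few lines. The function names, the rollout count, and the toy policy and verifier are all hypothetical stand-ins. The abstract does not specify the actual policy model, verifier, or RL update.

```python
from statistics import mean
from typing import Callable


def skill_reward(skill: str, task: str,
                 policy: Callable[[str, str, int], str],
                 verifier: Callable[[str, str], bool],
                 num_rollouts: int = 4) -> float:
    """Reward a skill by the mean verifier success of rollouts conditioned on it."""
    outcomes = [
        1.0 if verifier(task, policy(task, skill, seed)) else 0.0
        for seed in range(num_rollouts)
    ]
    return mean(outcomes)


def toy_policy(task: str, skill: str, seed: int) -> str:
    """Hypothetical stand-in for an LLM rollout: succeeds on even seeds if the skill helps."""
    return "solved" if ("check units" in skill and seed % 2 == 0) else "failed"


def toy_verifier(task: str, trajectory: str) -> bool:
    """Hypothetical verifier: checks only the final outcome string."""
    return trajectory == "solved"
```

Under this scoring, a skill that merely reads well but never changes rollout outcomes earns zero reward, which matches the stated goal of preferring useful skills over plausible text.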

2026-06-04 13:00 JST arXiv cs.AI LLM/生成AI エージェント

人間が作成したオントロジーからの証明可能で監査可能で安全な LLM エージェント

線形監査可能性を必要とする重要な問題ドメインでの使用を目的とした、LLM エージェントアーキテクチャ Agentic Redux を紹介します。型付きラムダ計算を使用して、適切なドメインで実行すると、Agentic Redux の実行が意味的に正しいことが保証され、すべての決定が追加専用台帳に記録されることを証明します。医療請求のコンプライアンスとセキュリティ脆弱性の開示という、実稼働グレードの 2 つの適切な領域を紹介します。両方のドメインで実行される Agentic Redux の実用的なコードは、サポートされるコードリポジトリで入手できます。また、問題ドメイン上でエージェントフレームワークを作成するための方法論であるオントロジーファーストエージェント設計も紹介します。この設計では、人間の専門家が基本形式オントロジーを使用して問題ドメインをオントロジー化し、LLM を割り当てて、ドメイン内の問題に対処するためにエージェントと参加者が果たせる役割を導き出します。

原文 (English)

Provably Auditable and Safe LLM Agents from Human-Authored Ontologies

We introduce the LLM agent architecture Agentic Redux, intended for use with nontrivial problem domains that require linear auditability. Using the typed lambda calculus, we prove that, run on appropriate domains, Agentic Redux executions are semantically guaranteed to be correct, with all decisions recorded in an append-only ledger. We present two production-grade appropriate domains, in healthcare billing compliance, and security vulnerability disclosure. Working code for Agentic Redux run on both domains is available in a supporting code repository. We also introduce Ontology-First Agent Design, a methodology for creation of agent frameworks on a problem domain, in which a human expert ontologizes the problem domain with Basic Formal Ontology, and then assigns an LLM to derive roles that agents and humans-in-the-loop can fill, in order to work the problems in the domain.
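
As a rough illustration of what recording decisions in an append-only ledger can look like, here is a minimal hash-chained log. This is a generic sketch and not the Agentic Redux implementation: the class name, the record fields, and the use of SHA-256 chaining are assumptions chosen for the example.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerEntry:
    index: int
    decision: dict
    prev_hash: str
    entry_hash: str


def _digest(index: int, decision: dict, prev_hash: str) -> str:
    """Hash the entry contents together with the previous entry's hash."""
    payload = json.dumps({"index": index, "decision": decision, "prev": prev_hash},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class AppendOnlyLedger:
    """Hypothetical ledger: entries can be appended and read, never edited or removed."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self._entries: list[LedgerEntry] = []

    def append(self, decision: dict) -> LedgerEntry:
        prev_hash = self._entries[-1].entry_hash if self._entries else self.GENESIS
        index = len(self._entries)
        entry = LedgerEntry(index, decision, prev_hash, _digest(index, decision, prev_hash))
        self._entries.append(entry)
        return entry

    def entries(self) -> tuple[LedgerEntry, ...]:
        return tuple(self._entries)

    def verify(self) -> bool:
        """Recompute the chain; any in-place modification breaks a hash link."""
        prev_hash = self.GENESIS
        for i, entry in enumerate(self._entries):
            if entry.index != i or entry.prev_hash != prev_hash:
                return False
            if entry.entry_hash != _digest(entry.index, entry.decision, entry.prev_hash):
                return False
            prev_hash = entry.entry_hash
        return True
```

Because each entry commits to its predecessor, an auditor can replay the log front to back and detect tampering with any earlier decision.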

2026-06-04 13:00 JST arXiv cs.AI LLM/生成AI エージェント

プロンプトからプロセスまで: AI ソフトウェア開発エージェントをサポートするフレームワークのプロセス分類と比較評価

プログラミング用の AI ツールは、もはや単なるオートコンプリートやチャットアシスタントではありません。プロセス、役割、成果物、検証を備えた開発フレームワークとして組織化されています。最近の調査では、ソフトウェアエンジニアリングのためのエージェントと LLM がマッピングされていますが、これらの機能をプロセスに変える運用フレームワークを中心とした調査は行われていません。私たちは、機能的包含基準とトラクション測定を使用して一次ソースの直接検索を実行し、GitHub Spec Kit、OpenSpec、BMAD Method、Get Shit Done (GSD)、Spec Kitty、Reversa の 6 つのフレームワークを選択しました。それぞれが異なるパスを通じて AI 開発を攻撃します。つまり、完全および軽量バリアントでの仕様駆動型開発、エージェント駆動のアジャイルプランニング、エージェントを介したコンテキストエンジニアリング、ワークツリーの分離とレビュー、レガシーシステムからの運用仕様の回復です。私たちの中心的な貢献は、仕様、コンテキスト、役割、実行、検証、移植性という 6 次元のプロセス分類と、プロセスを複製可能なツールに変えるスコアリングルーブリックです。これを 6 つのフレームワークとサンプル外のケースである Spec-Flow に適用します。 2 つの結果が際立っています。すでに何らかのプロセスを採用しているフレームワークの中には収束が見られます。分離されたプロンプトは中心性を失い、永続的な成果物、作業契約、トレーサビリティ、人間によるレビューが曖昧さを減らし、エージェントを調整するメカニズムになります。また、6 つの側面すべてを強力にカバーするフレームワークはなく、プロセスの深さとエージェント間の移植性の間の構造的なトレードオフが明らかになります。また、繰り返し発生するリスク、つまり仕様とコードの間のずれ、生成されたアーティファクトへの過剰な信頼、コミュニティ拡張の脆弱性、プラットフォームへの依存、プロセス全体のベンチマークの欠如なども見つかりました。最後に、中間品質の指標、コンテキストガバナンス、インストールのセキュリティと再現性に焦点を当てた、実証的評価のための研究課題を取り上げます。

原文 (English)

From Prompt to Process: a Process Taxonomy and Comparative Assessment of Frameworks Supporting AI Software Development Agents

AI tools for programming are no longer just autocomplete or chat assistants: they organize themselves as development frameworks, with process, roles, artifacts and verification. Recent surveys map agents and LLMs for software engineering, but a study centered on the operational frameworks that turn these capabilities into process is missing. We ran a directed search of primary sources, with a functional inclusion criterion and traction measurement, and selected six frameworks: GitHub Spec Kit, OpenSpec, BMAD Method, Get Shit Done (GSD), Spec Kitty and Reversa. Each attacks AI development through a different path: spec-driven development in full and lightweight variants, agent-driven agile planning, context engineering over the agent, worktree isolation and review, and recovery of operational specifications from legacy systems. Our central contribution is a six-dimension process taxonomy: specification, context, roles, execution, validation and portability, with a scoring rubric that turns it into a replicable instrument. We apply it to the six frameworks and an out-of-sample case, Spec-Flow. Two results stand out. Among frameworks that already adopt some process there is convergence: the isolated prompt loses centrality, and persistent artifacts, work contracts, traceability and human review become mechanisms that reduce ambiguity and coordinate agents. And no framework strongly covers all six dimensions, exposing a structural trade-off between process depth and portability across agents. We also found recurring risks: drift between specification and code, excessive trust in generated artifacts, fragility of community extensions, platform dependence and a lack of benchmarks for the complete process. We close with a research agenda for empirical evaluation, focused on intermediate-quality metrics, context governance, installation security and reproducibility.

2026-06-04 13:00 JST arXiv cs.AI 画像/動画生成 エージェント 研究/論文

計画、監視、回復: プロアクティブな手続き支援のためのベンチマークとアーキテクチャ

私たちは、プロアクティブなマルチモーダルアシスタントシステムを構想しています。これは、手順的なタスクに関するリアルタイムの段階的なガイダンスをユーザーに提供し、「いつ」中断するか、「どのように」指導するかを自律的に決定します。ただし、現実的な状況、特にユーザーが予想されるステップシーケンスから逸脱する一般的なケースを反映する大規模なクロスドメインベンチマークがないため、進歩は限られています。私たちはこのギャップに 4 つの貢献で対処します。 (1) 明示的な計画外 (OOP) アノテーションと回復手順を備えたプロアクティブな手順支援のための大規模ウェアラブル自己中心的データセットである EgoProactive をリリースします。 (2) 統一されたプロアクティブなガイダンススキーマの下で、確立された 5 つのベンチマーク (Ego4D、EPIC-KITCHENS、EgoExo4D、HoloAssist、HowTo100M) を Pro²Bench に拡張します。 (3) 手続き状態、視覚的キュー、および回復注入に特化した、プランナーとインタラクションを分離したアーキテクチャを提案します。 (4) Llama 4 および Qwen-3.6-VL でのクロスバックボーンレプリケーションによって検証された、モデルファミリ間で転移可能なポストトレーニングレシピを紹介します。大規模な実験において、私たちが訓練した Llama-4 システムは、6 つのデータセットすべてにわたって、強力なプロプライエタリベースライン (Claude Opus 4.6、Gemini 3.1 Pro、GPT 5.2) およびオープンウェイトベースライン (Qwen3 VL 235B) を超えて、客観的な介入の質を大幅に向上させました。さらに、オラクルプランを用いた実験では、計画の品質が制御されている場合、訓練されたデュプレックスモデルが高品質のガイダンスを生成し、計画外からの回復で大きな利益が得られることが示されています。

原文 (English)

Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance

We envision a proactive multi-modal assistant system which gives users real-time step-by-step guidance on a procedural task, autonomously deciding when to interrupt and how to coach. However, progress is limited by the absence of large-scale, cross-domain benchmarks that reflect realistic conditions, particularly the common case in which users deviate from the expected step sequence. We address this gap with four contributions: (1) we release EgoProactive, a large-scale wearable-egocentric dataset for proactive procedural assistance with explicit Out-of-Plan (OOP) annotations and recovery steps; (2) we augment five established benchmarks (Ego4D, EPIC-KITCHENS, EgoExo4D, HoloAssist, HowTo100M) into Pro²Bench under a unified proactive-guidance schema; (3) we propose a decoupled planner-interaction architecture specialized for procedural state, visual cues, and recovery injection; (4) we introduce a post-training recipe that transfers across model families, validated by cross-backbone replication on Llama 4 and Qwen-3.6-VL. In extensive experiments, our trained Llama-4 system substantially improves objective intervention quality over strong proprietary baselines (Claude Opus 4.6, Gemini 3.1 Pro, GPT 5.2) and open-weight baselines (Qwen3 VL 235B) across all six datasets. Oracle-plan experiments further show that, when plan quality is controlled, the trained duplex model produces high-quality guidance and large gains on Out-of-Plan recovery.

2026-06-04 13:00 JST arXiv cs.AI LLM/生成AI エージェント

エージェント追跡から信頼へ: LLM エージェントにおける証拠追跡と実行来歴

大規模言語モデル (LLM) ベースのエージェントは、外部ツール、検索システム、メモリモジュール、環境、その他のエージェントと対話することで、複雑なタスクを解決することが増えています。これらの機能により、エージェントの自律性が拡張されますが、エージェントの動作の検証、デバッグ、監査が難しくなります。最終回答の精度だけでは、出力がどのように生成されたか、各主張を裏付ける証拠は何か、ツールの呼び出しが正当化されたかどうか、記憶が後の決定にどのように影響したか、実行の失敗がどこで発生したかを説明することはできません。証拠追跡と実行来歴は、取得された証拠、ツール出力、メモリ項目、環境観察、中間クレーム、アクション、および最終的な回答がエージェントの実行全体を通じてどのように関連するかをモデル化することで、このギャップに対処します。この調査は、LLM エージェントにおける証拠の追跡と実行の出自に関する体系的なレビューと概念的な枠組みを提供します。私たちは、検索根拠、クレームサポート、ツール使用の安全性、メモリリネージ、可観測性、デバッグ、監査、リカバリを結び付ける、統一された来歴の観点に基づいて関連作業を整理します。トレースソース、証拠と実行単位、来歴関係、トレースの粒度とタイミング、表現形式、信頼関数を網羅する分類法を導入します。私たちは、出所の表現、証拠の帰属、ツール使用の出所、実行時のガードレール、出所を伴うメモリ、トレースベースの可観測性、障害診断など、主要な方法論の方向性を検討します。また、既存のベンチマーク、データセット、評価指標を来歴関連の機能にマッピングし、評価が最終的な回答の正しさからプロセスレベルの説明責任にどのように移行できるかについても説明します。最後に、統合トレーススキーマ、クレームレベルおよびセマンティックの出所、出所を意識した安全メカニズム、現実的な実行トレースベンチマーク、リカバリ指向の評価、プライバシーを意識した監査インフラストラクチャなどの未解決の課題について概説します。

原文 (English)

From Agent Traces to Trust: Evidence Tracing and Execution Provenance in LLM Agents

Large language model (LLM)-based agents increasingly solve complex tasks by interacting with external tools, retrieval systems, memory modules, environments, and other agents. These capabilities expand agent autonomy, but also make agent behavior harder to verify, debug, and audit. Final-answer accuracy alone cannot explain how an output was produced, which evidence supported each claim, whether tool calls were justified, how memory influenced later decisions, or where execution failures originated. Evidence tracing and execution provenance address this gap by modeling how retrieved evidence, tool outputs, memory items, environment observations, intermediate claims, actions, and final answers are connected throughout agent execution. This survey provides a systematic review and conceptual framework for evidence tracing and execution provenance in LLM agents. We organize related work around a unified provenance perspective that connects retrieval grounding, claim support, tool-use safety, memory lineage, observability, debugging, audit, and recovery. We introduce a taxonomy covering trace sources, evidence and execution units, provenance relations, tracing granularity and timing, representation forms, and trust functions. We review key methodological directions, including provenance representation, evidence attribution, tool-use provenance, runtime guardrails, provenance-bearing memory, trace-based observability, and failure diagnosis. We also map existing benchmarks, datasets, and evaluation metrics to provenance-related capabilities, and discuss how evaluation can move from final-answer correctness toward process-level accountability. Finally, we outline open challenges, including unified trace schemas, claim-level and semantic provenance, provenance-aware safety mechanisms, realistic execution-trace benchmarks, recovery-oriented evaluation, and privacy-aware audit infrastructure.

2026-06-04 13:00 JST arXiv cs.AI LLM/生成AI エージェント

DAR: エージェンティックハーネスを用いた義務論的推論

義務論的推論とは、法律に基づく納税額の計算や移民控訴の結果の決定など、事例固有の事実に明示的なルールとポリシーを適用することで質問に答えるタスクです。 LLM ベースの義務論的推論の主な技術的課題は、関連するルールセットが長く相互参照される可能性があるため、モデルが特定の推論ステップに必要なルールを見つけられない可能性があることです。 Deontic Agentic Reasoning (DAR) を導入します。これは、モデルがオンデマンドで法令と対話するエージェント推論セットアップです。 DeonticBench のハードサブセット上で、複数のハーネスを用いて DAR を評価します。これらの設定全体で、エージェントハーネスは義務論的推論タスクの最前線を押し広げることができることがわかりましたが、改善は均一ではありません。弱いモデルは、はるかに多くのトークンを消費しながら、数値タスクでパフォーマンスが低下することがよくあります。

原文 (English)

DAR: Deontic Reasoning with Agentic Harnesses

Deontic reasoning is the task of answering questions by applying explicit rules and policies to case-specific facts, for example computing tax liability under a statute or determining the outcome of an immigration appeal. A key technical challenge for LLM-based deontic reasoning is that the relevant ruleset can be long and cross-referenced, so models may still fail to locate the rules needed for a particular reasoning step. We introduce Deontic Agentic Reasoning (DAR), an agentic reasoning setup in which the model interacts with the statutes on demand. We evaluate DAR under multiple harnesses on hard subsets of DeonticBench. Across these settings, we find that agentic harnesses can push the frontier on deontic reasoning tasks, but improvements are not uniform: weaker models often degrade on numerical tasks while consuming far more tokens.
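
One way to picture a model that "interacts with the statutes on demand" is a lookup tool that returns a single section plus its cross-references, which the agent then follows as needed. The sketch below assumes a tiny invented statute, a `read_section` tool, and a breadth-first traversal with a call budget. None of these names or rules come from DAR or DeonticBench.

```python
# Hypothetical toy statute: each section has text and outgoing cross-references.
STATUTE = {
    "sec1": {"text": "Tax owed is 10% of taxable income as defined in sec2.", "refs": ["sec2"]},
    "sec2": {"text": "Taxable income is gross income minus the deduction in sec3.", "refs": ["sec3"]},
    "sec3": {"text": "The standard deduction is 1000.", "refs": []},
}


def read_section(section_id: str) -> dict:
    """Tool call: fetch one section on demand instead of loading the whole ruleset."""
    return STATUTE[section_id]


def collect_rules(start: str, max_calls: int = 10) -> list[str]:
    """Follow cross-references breadth-first, stopping at a tool-call budget."""
    seen: list[str] = []
    queue = [start]
    while queue and len(seen) < max_calls:
        section_id = queue.pop(0)
        if section_id in seen:
            continue
        seen.append(section_id)
        queue.extend(read_section(section_id)["refs"])
    return seen
```

The call budget makes the cost trade-off visible: a tighter budget saves tokens but can leave a needed rule unread, which is consistent with the non-uniform gains the abstract reports.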

2026-06-04 13:00 JST arXiv cs.AI エージェント

自己反映型 API: AI エージェント回復のための構造は冗長性を上回る

AI エージェントが API を呼び出して検証エラーに遭遇した場合、何が問題だったかだけではなく、次に何をすべきかという情報が必要になります。自己反映型 API は、検証が失敗した場合、エージェントが外部の推論なしにリクエストを修復して再試行するのに十分な、機械可読な recovery_feedback.suggestions[] ペイロードを返します。リーク監査済みのパイロット (セルあたり N = 30、LLM 3 つ、敵対的タスク 10) では、構造化された提案により、Anthropic モデルにおいて平易な英語の診断と比較してタスク完了率が +36.7 ～ 40.0 pp 上昇し (フィッシャーの正確確率検定 p ≤ 0.0022)、成功あたりのトークン効率は 1.8 ～ 2.2 倍向上しました。 gpt-4o-mini ではこの上昇は有意ではありません (p = 0.435)。課金 API を用いた第 2 ドメインでの再現実験でも同じパターンが確認されます。この比較は、LLM ベンチマークにおける文書化されていない 2 種類の回答漏洩を監査した後にのみ成り立ちます。私たちは audit_prompt_leakage.py を再利用可能な CI インフラストラクチャとして公開します。コードとデータ: https://github.com/arquicanedo/self-reflective-apis。

原文 (English)

Self-Reflective APIs: Structure Beats Verbosity for AI Agent Recovery

When an AI agent calls an API and hits a validation error, it needs more than what went wrong -- it needs what to do next. A self-reflective API returns, on validation failure, a machine-readable recovery_feedback.suggestions[] payload sufficient for the agent to repair the request and retry without external reasoning. On a leak-audited pilot (N = 30 per cell, 3 LLMs, 10 adversarial tasks), structured suggestions lift task-completion rate by +36.7 to 40.0 pp over plain-English diagnoses on Anthropic models (Fisher's exact p ≤ 0.0022), at 1.8 to 2.2× better per-success token efficiency. The lift is not significant on gpt-4o-mini (p = 0.435); a second-domain replication on a billing API confirms the pattern. The comparison only holds after auditing two undocumented classes of answer leakage in LLM benchmarks. We ship audit_prompt_leakage.py as reusable CI infrastructure. Code and data: https://github.com/arquicanedo/self-reflective-apis.
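
To make the idea concrete, here is a minimal sketch of a validator that returns a `recovery_feedback.suggestions[]` payload and a client that applies it mechanically. Only the `recovery_feedback.suggestions` field name comes from the abstract. The suggestion schema (`op`, `from`, `to`, `field`, `value`), the order endpoint, and the quantity limit are assumptions for illustration and may differ from the schema in the linked repository.

```python
MAX_QUANTITY = 10  # hypothetical business rule


def validate_order(request: dict) -> dict:
    """Toy API endpoint: on failure, return errors plus structured repair suggestions."""
    errors, suggestions = [], []
    if "quantity" not in request:
        errors.append("missing field: quantity")
        if "qty" in request:
            suggestions.append({"op": "rename", "from": "qty", "to": "quantity"})
    elif not isinstance(request["quantity"], int) or request["quantity"] < 1:
        errors.append("quantity must be a positive integer")
    elif request["quantity"] > MAX_QUANTITY:
        errors.append(f"quantity exceeds {MAX_QUANTITY}")
        suggestions.append({"op": "set", "field": "quantity", "value": MAX_QUANTITY})
    if not errors:
        return {"status": 200}
    return {"status": 422, "errors": errors,
            "recovery_feedback": {"suggestions": suggestions}}


def apply_suggestions(request: dict, response: dict) -> dict:
    """Apply each suggestion mechanically; no free-text interpretation is needed."""
    repaired = dict(request)
    for s in response.get("recovery_feedback", {}).get("suggestions", []):
        if s["op"] == "rename" and s["from"] in repaired:
            repaired[s["to"]] = repaired.pop(s["from"])
        elif s["op"] == "set":
            repaired[s["field"]] = s["value"]
    return repaired


def call_with_recovery(request: dict, max_retries: int = 2) -> tuple[dict, dict]:
    """Call the endpoint, repairing and retrying up to max_retries times."""
    response = validate_order(request)
    for _ in range(max_retries):
        if response["status"] == 200:
            break
        request = apply_suggestions(request, response)
        response = validate_order(request)
    return request, response
```

For example, `call_with_recovery({"qty": 25})` first renames the field and then clamps the value, ending with a successful response for `{"quantity": 10}`. A plain-English error string would instead require the agent to infer both repairs itself.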

2026-06-04 13:00 JST arXiv cs.AI LLM/生成AI エージェント

LLM 駆動エージェントによる効率的で証拠に基づくモビリティ予測に向けて

個人レベルのモビリティ予測は、都市シミュレーション、交通計画、政策分析の中心となります。教師ありシーケンスモデルは高い精度を実現しますが、タスク固有のトレーニングが必要であり、意思決定レベルの透明性は限られています。最近の LLM ベースの手法は解釈可能性を向上させていますが、ほとんどが静的プロンプトとシングルパス推論に依存しているため、モビリティ信号が弱いか矛盾している場合に追加の証拠を探す能力が制限されています。私たちは、次の場所の予測を適応的な証拠に基づいた意思決定として定式化する、トレーニング不要の LLM 駆動のエージェントフレームワークである AgentMob を提案します。 AgentMob は、歴史的な規則性に基づく高速パスを通じて日常的なケースを解決しますが、あいまいなケースでは、最近の軌跡、過去の動作、滞在と移動の可能性、および地理的証拠に基づいて反復的なツールの使用がトリガーされます。 3 つのモビリティデータセット全体で、AgentMob はトレーニング不要の LLM ベースの手法の中で最も強力な総合パフォーマンスを達成し、GPT-5.4 は BW で 71.42% Acc@1、YJMob100K で 33.14%、上海 ISP で 33.50% に達しました。 BW 非高速パスのケースでは、LLM コントローラーは、同じツールの統計ベースラインと比較して Acc@1 を 30.65% から 48.62% に改善します。これは、その主な利点が、適応的な証拠収集を通じて曖昧な予測を解決することにあることを示しています。コードは https://github.com/Unknown-zoo/AgentMob で入手できます。

原文 (English)

Towards Efficient and Evidence-grounded Mobility Prediction with LLM-Driven Agent

Individual-level mobility prediction is central to urban simulation, transportation planning, and policy analysis. Supervised sequence models achieve strong accuracy but require task-specific training and offer limited decision-level transparency. Recent LLM-based methods improve interpretability, yet mostly rely on static prompts and single-pass inference, limiting their ability to seek additional evidence when mobility signals are weak or conflicting. We propose AgentMob, a training-free LLM-driven agent framework that formulates next-location prediction as adaptive evidence-controlled decision making. AgentMob resolves routine cases through a fast path based on historical regularity, while ambiguous cases trigger iterative tool use over recent trajectories, historical behavior, stay-move likelihood, and geographical evidence. Across three mobility datasets, AgentMob achieves the strongest overall performance among training-free LLM-based methods, with GPT-5.4 reaching 71.42% Acc@1 on BW, 33.14% on YJMob100K, and 33.50% on Shanghai ISP. On BW non-fast-path cases, the LLM controller improves Acc@1 from 30.65% to 48.62% over a same-tool statistical baseline, showing that its main benefit lies in resolving ambiguous predictions through adaptive evidence gathering. Our code is available at https://github.com/Unknown-zoo/AgentMob.
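
The split between a fast path for routine cases and tool use for ambiguous ones can be sketched as a simple gate on historical regularity. The regularity measure (share of the most frequent successor of the current location), the 0.8 threshold, and the `gather_evidence` callback are assumptions for illustration. The actual gating rule and tools live in the linked repository and may differ.

```python
from collections import Counter
from typing import Callable, Sequence


def predict_next(history: Sequence[str], current: str,
                 gather_evidence: Callable[[Sequence[str], str], str],
                 threshold: float = 0.8) -> tuple[str, str]:
    """Return (predicted location, route taken)."""
    # Count which locations have followed `current` in the past.
    followers = Counter(nxt for prev, nxt in zip(history, history[1:]) if prev == current)
    total = sum(followers.values())
    if total:
        top, count = followers.most_common(1)[0]
        # Fast path: history is regular enough to answer without any tool calls.
        if count / total >= threshold:
            return top, "fast_path"
    # Ambiguous case: defer to the (hypothetical) evidence-gathering controller.
    return gather_evidence(history, current), "tool_use"
```

Routing only the ambiguous cases to the controller keeps LLM calls for the predictions where the abstract reports the largest gain.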

2026-06-04 13:00 JST arXiv cs.AI LLM/生成AI エージェント

マルチエージェント推論におけるストリーミング通信

マルチエージェント推論システムは、エンドツーエンドのレイテンシーをパイプラインの深さに応じて線形に拡張する「生成してから転送」パラダイムを採用しています。 StreamMA は、各推論ステップが生成されるとすぐに下流のエージェントにストリーミングし、隣接するエージェントをパイプライン化して待ち時間を短縮するマルチエージェント推論システムです。驚くべきことに、このパイプラインは有効性も向上させます。マルチステップの推論の品質は不均一で、初期のステップは後のステップよりも信頼性が高いため、完全なチェーンではなくこれらの信頼できる初期ステップを使用することで、エラーが発生しやすい後期ステップが下流エージェントに誤解を与えることを防ぎます。ストリーム、シリアル、および単一プロトコルの最初の閉じた形式の結合分析によって両方の利点を形式化し、有効性の順序付け、高速化の上限、およびコスト比を導き出します。数学、科学、コードにわたる 8 つの推論ベンチマーク、2 つのフロンティア LLM (Claude Opus 4.6 および GPT-5.4)、および 3 つのトポロジ (チェーン、ツリー、グラフ) にわたって、StreamMA は両方のベースラインを上回りました (平均 +7.3 pp、最大で HMMT 2026 における +22.4 pp、Claude Opus 4.6-high)。これらの貢献を超えて、「ステップレベルのスケーリング則」を発見しました。つまり、エージェントごとのステップを増やすと、有効性と効率の両方が一貫して向上します。これは、エージェント数のスケーリングと直交し、エージェント数のスケーリングと組み合わせ可能な新しいスケーリングの次元です。

原文 (English)

Streaming Communication in Multi-Agent Reasoning

Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency. Surprisingly, this pipelining also improves effectiveness: because multi-step reasoning quality is non-uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6-high). Beyond these contributions, we discover a "step-level scaling law": increasing per-agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent-count scaling.
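
The latency argument can be checked with standard pipeline arithmetic. The toy model below is an assumption for illustration and is not the paper's closed-form analysis: a chain of `n` agents, each producing `k` steps at `t` seconds per step, where step `j` of an agent needs step `j` of its upstream neighbor. Under generate-then-transfer the latency is `n * k * t`. With streaming it comes out to `(n + k - 1) * t`.

```python
def serial_latency(n: int, k: int, t: float = 1.0) -> float:
    """Generate-then-transfer: each agent waits for the full upstream chain."""
    return n * k * t


def streaming_latency(n: int, k: int, t: float = 1.0) -> float:
    """Streaming: step j of agent i starts once agent i-1 has emitted step j
    and agent i has finished its own step j-1."""
    finish = [[0.0] * (k + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, k + 1):
            finish[i][j] = max(finish[i - 1][j], finish[i][j - 1]) + t
    return finish[n][k]
```

In this toy model the speedup `n * k / (n + k - 1)` grows with both depth and steps per agent, which gives some intuition for why serial latency scales linearly with pipeline depth while streaming does not.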

2026-06-04 13:00 JST arXiv cs.AI エージェント

アダプティブマインド: LoRA-as-Tools でエージェントに権限を与える

LoRA アダプターが、基本言語モデルが動的に選択して呼び出すことができる呼び出し可能なツールとして扱われるフレームワークを調査します。私たちは、アダプターがドメイン固有の強力な利益を提供するようにトレーニングされ、明確なメタデータで公開されている場合、基本モデルはクエリを適切な専門家に確実にルーティングし、単一のフレームワーク内に多くの特殊なアダプターの利点を効果的に集約できるという仮説を立てています。シングルステップルーティングとマルチステップエージェント推論の両方を研究する一般的なフレームワークであるアダプティブマインドを紹介します。この設定では、エージェントは他のツール (外部 API、取得システム、実行環境など) と一緒に複数のアダプターを繰り返し呼び出し、複数のステップにわたる出力を推論できます。これにより、アダプターは、静的に適用されるのではなく、推論中に構成できるモジュール式のスキルまたは記憶ユニットとして再構成されます。私たちの評価では、ルーティング層は 30 アダプターライブラリで 98.3% の精度に達し、十分なトレーニングを受けた専門家は、単一の共有トレーニングレシピの下で 9 つのタスクファミリ全体で +4.6 ～ +84.0 パーセンテージポイントの厳密スコアラーゲインを提供しました。 AM ルーターは、クエリがドメイン信号を表面化するすべてのベンチマークで、直接のスペシャリストの 5 pp 以内にこれらのゲインを集計します。私たちの調査結果は、このアプローチの有効性は個々のアダプターの品質と専門性に依存し、そのような多くの専門家を柔軟に構成できるようにすることで、言語モデルエージェントの実践的な能力を大幅に拡張し、より一般的なツール拡張インテリジェンスに移行できることを示唆しています。

原文 (English)

Adaptive Minds: Empowering Agents with LoRA-as-Tools

We investigate a framework in which LoRA adapters are treated as callable tools that a base language model can dynamically select and invoke. We hypothesize that, when adapters are trained to provide strong domain-specific gains and are exposed with clear metadata, a base model can reliably route queries to the appropriate expert, effectively aggregating the benefits of many specialized adapters within a single framework. We introduce Adaptive Minds, a general framework within which we study both single-step routing and multi-step agentic reasoning. In this setting, the agent can iteratively invoke multiple adapters alongside other tools (e.g., external APIs, retrieval systems, or execution environments) and reason over their outputs across multiple steps. This reframes adapters as modular skills or memory units that can be composed during reasoning rather than statically applied. In our evaluation, the routing layer reaches 98.3% accuracy on a 30-adapter library, and well-trained specialists provide +4.6 to +84.0 percentage points of strict-scorer gain across nine task families under a single shared training recipe; the AM router aggregates these gains within 5 pp of the direct specialist on every benchmark whose queries surface domain signal. Our findings suggest that the effectiveness of this approach depends on the quality and specialization of individual adapters, and that enabling flexible composition of many such experts can significantly expand the practical capabilities of language model agents, moving toward more general, tool-augmented intelligence.
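
Treating adapters as callable tools with metadata suggests a routing step like the one sketched here. In the paper the base language model does the routing. The keyword-overlap scorer below is a deliberately simple stand-in, and the `AdapterTool` structure, adapter names, and descriptions are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class AdapterTool:
    name: str
    description: str            # metadata exposed to the router
    run: Callable[[str], str]   # stand-in for generating with the adapter loaded


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def route(query: str, tools: Sequence[AdapterTool],
          min_overlap: int = 1) -> Optional[AdapterTool]:
    """Pick the adapter whose description shares the most words with the query."""
    best, best_score = None, 0
    for tool in tools:
        score = len(_tokens(query) & _tokens(tool.description))
        if score > best_score:
            best, best_score = tool, score
    return best if best_score >= min_overlap else None


def answer(query: str, tools: Sequence[AdapterTool],
           base_model: Callable[[str], str]) -> str:
    """Use a specialist when the query surfaces a domain signal, else the base model."""
    tool = route(query, tools)
    return tool.run(query) if tool else base_model(query)
```

The fallback branch mirrors the abstract's caveat: routing can only recover specialist gains on queries that actually surface a domain signal.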

2026-06-04 13:00 JST arXiv cs.AI エージェント

BRAINCELL-AID: コミュニティアノテーション用のエージェント AI が作成した脳細胞タイプのリソース

単一細胞 RNA シーケンスにより、多様な細胞型とそのトランスクリプトームシグネチャを識別する能力が変わりました。しかし、これらのシグネチャ、特に特徴が十分に解明されていない遺伝子に関連するシグネチャに注釈を付けることは、依然として大きな課題です。 Gene Set Enrichment Analysis (GSEA) などの従来の手法は、厳選されたアノテーションに依存しており、これらのコンテキストではパフォーマンスが低下することがよくあります。大規模言語モデル (LLM) は有望な代替手段を提供しますが、構造化されたオントロジー内で複雑な生物学的知識を表現するのに苦労しています。これに対処するために、我々は BRAINCELL-AID (BRAINCELL-AID: https://biodataai.uth.edu/BRAINCELL-AID) を紹介します。これは、フリーテキスト記述とオントロジーラベルを統合して、より正確で堅牢な遺伝子セットアノテーションを可能にする新しいマルチエージェント AI システムです。検索拡張生成 (RAG) を組み込むことで、関連する PubMed 文献を使用して予測を改良し、幻覚を軽減し、解釈可能性を高める堅牢なエージェントワークフローを開発しました。このワークフローを使用して、上位予測に含まれるマウス遺伝子セットの 77% に対して正しいアノテーションを達成しました。このアプローチを適用して、BRAIN Initiative Cell Census Network によって生成された包括的なマウス脳細胞アトラスからの 5,322 個の脳細胞クラスターに注釈を付け、領域特異的な遺伝子の共発現パターンを特定し、遺伝子アンサンブルの機能的役割を推測することで、脳細胞の機能についての新たな洞察を可能にしました。 BRAINCELL-AID は、神経学的に意味のある説明を持つ大脳基底核関連細胞タイプも識別します。したがって、コミュニティ主導のセルタイプのアノテーションをサポートする貴重なリソースを作成します。

原文 (English)

BRAINCELL-AID: An Agentic AI Created Brain Cell Type Resource for Community Annotation

Single-cell RNA sequencing has transformed our ability to identify diverse cell types and their transcriptomic signatures. However, annotating these signatures, especially those involving poorly characterized genes, remains a major challenge. Traditional methods, such as Gene Set Enrichment Analysis (GSEA), depend on well-curated annotations and often perform poorly in these contexts. Large Language Models (LLMs) offer a promising alternative but struggle to represent complex biological knowledge within structured ontologies. To address this, we present BRAINCELL-AID (https://biodataai.uth.edu/BRAINCELL-AID), a novel multi-agent AI system that integrates free-text descriptions with ontology labels to enable more accurate and robust gene set annotation. By incorporating retrieval-augmented generation (RAG), we developed a robust agentic workflow that refines predictions using relevant PubMed literature, reducing hallucinations and enhancing interpretability. Using this workflow, we achieved correct annotations for 77% of mouse gene sets among their top predictions. Applying this approach, we annotated 5,322 brain cell clusters from the comprehensive mouse brain cell atlas generated by the BRAIN Initiative Cell Census Network, enabling novel insights into brain cell function by identifying region-specific gene co-expression patterns and inferring functional roles of gene ensembles. BRAINCELL-AID also identifies Basal Ganglia-related cell types with neurologically meaningful descriptions. Hence, we create a valuable resource to support community-driven cell type annotation.

2026-06-04 13:00 JST arXiv cs.AI LLM/生成AI エージェント 研究/論文

SciDER: 科学データ中心のエンドツーエンド研究者

大規模な言語モデルが科学的発見を加速させる一方で、既存のエージェントは適応性、ドメインの一般化、マルチモーダルなスケーラビリティにおいて厳しい制限に直面しており、多くの場合、生のドメイン固有の実験データを自律的に処理するのに苦労しています。これらの障壁を克服するために、研究ライフサイクル全体を柔軟に自動化するように設計されたマルチエージェントシステムである SciDER を導入します。このフレームワークは、新しいデータ中心のアプローチを採用し、4 つの専門化されたサブエージェントにわたる動的なマルチモーダルスキルシステムを統合します。具体的には、アイデア化エージェントは進化的アイデア検索を通じて新しい仮説を生成し、データ分析エージェントは生データを体系的に構造化し、実験エージェントはデータセットの特性に基づいて実行可能コードを合成し、批評エージェントは反復的な自己洗練を推進します。オープンソースの科学的発見を民主化するために、私たちは OpenSciDER-27B 微調整モデルと並行して、高品質の実行軌跡データセットである OpenSciDER-SFT-8K をリリースします。 6 つのベンチマーク全体で、SciDER と OpenSciDER は競合する、または優れた結果を獲得しており、特にデータ中心の分析、エンドツーエンドの研究実行、マルチモーダルな科学的視覚化において大きな成果が得られています。 SciDER は、データ分析と実験の実行を統合することにより、抽象的な科学的推論と再現可能な実験合成の間のギャップを埋めます。

原文 (English)

SciDER: Scientific Data-centric End-to-end Researcher

While large language models accelerate scientific discovery, existing agents face severe limitations in adaptability, domain generalization, and multimodal scalability, often struggling to autonomously process raw, domain-specific experimental data. To overcome these barriers, we introduce SciDER, a multi-agent system designed to flexibly automate the entire research lifecycle. This framework employs a novel data-centric approach and integrates a dynamic multimodal skill system across four specialized sub-agents. Specifically, an ideation agent generates novel hypotheses via Evolutionary Idea Search, a data analysis agent systematically structures raw data, an experimentation agent synthesizes executable code grounded in dataset characteristics, and a critic agent drives iterative self-refinement. To democratize open-source scientific discovery, we release OpenSciDER-SFT-8K, a high-quality execution trajectory dataset, alongside the OpenSciDER-27B fine-tuned model. Across six benchmarks, SciDER and OpenSciDER obtain competitive or leading results, with especially strong gains on data-centric analysis, end-to-end research execution, and multimodal scientific visualization. By integrating data analysis with experimental execution, SciDER bridges the gap between abstract scientific reasoning and reproducible experimentation synthesis.

2026-06-04 13:00 JST arXiv cs.AI エージェント

エージェンティックツールプロトコルの形式セマンティクス: プロセス計算アプローチ

外部ツールを呼び出すことができる大規模言語モデルエージェントの出現により、エージェントプロトコルの正式な検証が緊急に必要になりました。この分野では、ゼロショット API の一般化のための研究フレームワークであるスキーマガイドダイアログ (SGD) と、エージェントとツールの統合のための業界標準であるモデルコンテキストプロトコル (MCP) の 2 つのパラダイムが支配的です。どちらもスキーマ記述を通じて動的なサービス検出を可能にしますが、その正式な関係はまだ解明されていません。これらのパラダイムの概念的収束を確立した以前の研究に基づいて、我々は SGD と MCP の最初のプロセス計算による定式化を提示し、それらが明確に定義されたマッピング Phi の下で構造的に双模倣的 (bisimilar) であることを証明します。ただし、逆マッピング Phi^{-1} は部分的で損失が多く、MCP の表現力に重大なギャップがあることが明らかになります。双方向分析を通じて、完全な動作の等価性のための必要十分条件として、5 つの原則 (セマンティックな完全性、明示的なアクション境界、障害モードの文書化、漸進的開示互換性、ツール間関係宣言) を特定しました。これらの原則を型システム拡張 MCP+ として形式化し、MCP+ が SGD と同型であることを証明します。私たちの研究は、検証されたエージェントシステムの最初の正式な基盤を提供し、証明可能な安全性の特性としてスキーマの品質を確立します。

原文 (English)

Formal Semantics for Agentic Tool Protocols: A Process Calculus Approach

The emergence of large language model agents capable of invoking external tools has created urgent need for formal verification of agent protocols. Two paradigms dominate this space: Schema-Guided Dialogue (SGD), a research framework for zero-shot API generalization, and the Model Context Protocol (MCP), an industry standard for agent-tool integration. While both enable dynamic service discovery through schema descriptions, their formal relationship remains unexplored. Building on prior work establishing the conceptual convergence of these paradigms, we present the first process calculus formalization of SGD and MCP, proving they are structurally bisimilar under a well-defined mapping Phi. However, we demonstrate that the reverse mapping Phi^{-1} is partial and lossy, revealing critical gaps in MCP's expressivity. Through bidirectional analysis, we identify five principles -- semantic completeness, explicit action boundaries, failure mode documentation, progressive disclosure compatibility, and inter-tool relationship declaration -- as necessary and sufficient conditions for full behavioral equivalence. We formalize these principles as type-system extensions MCP+, proving MCP+ is isomorphic to SGD. Our work provides the first formal foundation for verified agent systems and establishes schema quality as a provable safety property.

2026-06-04 13:00 JST arXiv cs.AI エージェント

説明責任の地平線: 人間とエージェントの集合体を統治するための不可能性定理

AI システムに対する既存の説明責任の枠組みは、法的、倫理的、規制的のいずれであっても、重大な結果が生じた場合には必ず、少なくとも 1 人の特定可能な人物が意味のある責任を負うのに十分な関与と予見性を持っていたという共通の前提に基づいています。この論文は、自律性が計算可能なしきい値を超えると、エージェント型 AI システムが工学的な制限としてではなく数学的必然としてこの前提に違反することを証明します。私たちはヒューマンエージェントコレクティブを導入します。これは、エージェントが共有された構造的因果モデル内の状態-方策タプルとしてモデル化される、人間と AI の共同システムの形式化です。自律性は 4 次元の情報理論的プロファイル (認識的、実行的、評価的、社会的) によって、集団行動はインタラクショングラフと共同アクション空間によって特徴付けられます。私たちは、帰属可能性 (責任には因果的寄与が必要である)、予見可能性の上限 (責任は予測能力を超えることはできない)、非空虚性 (少なくとも 1 つのエージェントが自明でない責任を負う)、完全性 (すべての責任が完全に割り当てられなければならない) という 4 つの最小限の特性によって、正当な説明責任を公理化します。私たちの中心的な結果である説明責任不完全性定理は、複合的な自律性が説明責任の地平線を超え、かつインタラクショングラフに人間と AI のフィードバックサイクルが含まれる集合体については、4 つの特性すべてを同時に満たす枠組みが存在しないことを証明します。この不可能性は構造的なものであり、透明性、監査、監督では、自律性を低下させない限り解決できません。しきい値を下回る場合には正当な枠組みが存在し、急激な相転移が確立されます。 3,000 の合成集合体に対する実験により、すべての予測が違反ゼロで確認されました。これは AI ガバナンスにおける最初の不可能性の結果であり、それを下回れば現在のパラダイムが引き続き有効であり、それを超えると分散型の説明責任メカニズムが必要になるという形式的な境界を確立します。

原文 (English)

The Accountability Horizon: An Impossibility Theorem for Governing Human-Agent Collectives

Existing accountability frameworks for AI systems, legal, ethical, and regulatory, rest on a shared assumption: for any consequential outcome, at least one identifiable person had enough involvement and foresight to bear meaningful responsibility. This paper proves that agentic AI systems violate this assumption not as an engineering limitation but as a mathematical necessity once autonomy exceeds a computable threshold. We introduce Human-Agent Collectives, a formalisation of joint human-AI systems where agents are modelled as state-policy tuples within a shared structural causal model. Autonomy is characterised through a four-dimensional information-theoretic profile (epistemic, executive, evaluative, social); collective behaviour through interaction graphs and joint action spaces. We axiomatise legitimate accountability through four minimal properties: Attributability (responsibility requires causal contribution), Foreseeability Bound (responsibility cannot exceed predictive capacity), Non-Vacuity (at least one agent bears non-trivial responsibility), and Completeness (all responsibility must be fully allocated). Our central result, the Accountability Incompleteness Theorem, proves that for any collective whose compound autonomy exceeds the Accountability Horizon and whose interaction graph contains a human-AI feedback cycle, no framework can satisfy all four properties simultaneously. The impossibility is structural: transparency, audits, and oversight cannot resolve it without reducing autonomy. Below the threshold, legitimate frameworks exist, establishing a sharp phase transition. Experiments on 3,000 synthetic collectives confirm all predictions with zero violations. This is the first impossibility result in AI governance, establishing a formal boundary below which current paradigms remain valid and above which distributed accountability mechanisms become necessary.

2026-06-04 13:00 JST arXiv cs.AI エージェント 研究/論文

SciIntegrity-Bench: AI 科学者システムにおける学術的誠実性を評価するためのベンチマーク

AI 科学者システムは自律的な研究のために導入されることが増えていますが、その学術的誠実性が体系的に評価されたことはありません。私たちは、ジレンマ型の評価パラダイムに基づいて設計された最初のベンチマークである SCIINTEGRITY-BENCH を紹介します。11 のトラップカテゴリにわたる 33 のシナリオはいずれも、失敗を正直に認めることが唯一の正しい対応であり、タスクを完了するには不正行為が必要になるように構築されています。 7 つの最先端 LLM にわたる 231 回の評価実行全体で、誠実性の問題の発生率は全体で 34.2% に達し、失敗ゼロを達成したモデルはありませんでした。最も注目すべきことに、欠損データのシナリオでは、7 つのモデルすべてが実行不可能であることを認める代わりに合成データを生成しており、違いはその置き換えを開示するかどうかだけでした。さらにプロンプトのアブレーション研究により、2 つの要因が切り分けられます。明示的な完了圧力を取り除くと、開示されない捏造は 20.6% から 3.2% に大幅に減少しますが、基礎となる合成率は変化せず、プロンプトレベルの指示とは無関係に持続する本質的な完了バイアスが明らかになります。これらの結果は、訓練された性向としての正直な拒否が欠如していることが、観察された失敗の主な要因であることを示しています。 SCIINTEGRITY-BENCH を https://github.com/liuxingtong/Sci-Integrity-Bench で公開します。

原文 (English)

SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems

AI scientist systems are increasingly deployed for autonomous research, yet their academic integrity has never been systematically evaluated. We introduce SCIINTEGRITY-BENCH, the first benchmark designed around a dilemmatic evaluation paradigm: each of its 33 scenarios across 11 trap categories is constructed so that honest acknowledgment of failure is the only correct response, while task completion requires misconduct. Across 231 evaluation runs spanning 7 state-of-the-art LLMs, the overall integrity problem rate reaches 34.2%, and no model achieves zero failures. Most strikingly, across missing-data scenarios, all seven models generate synthetic data rather than acknowledging infeasibility, differing only in whether they disclose the substitution. A further prompt ablation study separates two drivers: removing explicit completion pressure sharply reduces undisclosed fabrication from 20.6% to 3.2%, while the underlying synthesis rate remains unchanged, revealing an intrinsic completion bias that persists independent of prompt-level instructions. These findings point to the absence of honest refusal as a trained disposition as the primary driver of observed failures. We release SCIINTEGRITY-BENCH at https://github.com/liuxingtong/Sci-Integrity-Bench.

2026-06-04 13:00 JST arXiv cs.AI エージェント

タスク指向の対話で積極性を引き出す

アウトバウンド営業などのプロアクティブなタスク指向対話 (TOD) では、ユーザーの懸念を積極的に探り、限られたターン数内で受け入れられる方向に会話を導く説得力のあるエージェントが必要です。しかし、トレーニング後の LLM は本質的に保守的であり、報酬形成型 RL (GRPO など) は、すでに受動的なポリシーのサンプルを再加重するだけであるため、苦戦します。ユーザーの潜在的な懸念を条件付けすることで、どれだけサンプリングしても損なわれない事前対応型の機能が解放され、これらの懸念が極めて重要なトレーニング時間のシグナルとして確立されることを示します。この発見を運用するために、認知ユーザーシミュレーターを構築します。これは、各ユーザーを、目に見える外部特性と隠れた内部懸念からなる階層化されたペルソナとしてモデル化します。このシミュレーターは、説得の進行状況を追跡するターンごとの状態ダイナミクスを生成しながら、忠実で多様なインタラクションを生成します。次に、モデル化された懸念事項とシミュレーション状態の遷移を補完的なトレーニング目標に変換する、シミュレーターによる非対称ビューポリシーの最適化を導入します。 (1) 非対称オンポリシー自己蒸留は、懸念事項を認識した動作を、同じポリシーの特権ビューからデプロイ可能な会話のみのビューに転送します。 (2) 状態遷移ポリシーの改良 ...

原文 (English)

Unlocking Proactivity in Task-Oriented Dialogue

Proactive task-oriented dialogue (TOD), such as outbound sales, demands a persuasive agent that actively probes the user's concerns and steers the conversation toward acceptance within a bounded number of turns. Yet post-trained LLMs are inherently conservative, and reward-shaping RL (e.g., GRPO) struggles since it only re-weights what an already passive policy samples. We show that conditioning on the user's latent concerns unlocks proactive capability that no amount of sampling can undermine, establishing these concerns as a pivotal training-time signal. To operationalize this finding, we build the Cognitive User Simulator, which models each user as a stratified persona comprising observable external traits and hidden internal concerns. The simulator produces faithful and diverse interactions, while emitting per-turn state dynamics that track persuasion progress. We then introduce Simulator-Induced Asymmetric-View Policy Optimization, which converts the modeled concerns and the simulation state transition into complementary training objectives: (1) Asymmetric On-Policy Self-Distillation that transfers concern-aware behavior from a privileged view of the same policy into its deployable, conversation-only view; and (2) State-Transition Policy Refinement ...

2026-06-04 13:00 JST arXiv cs.AI エージェント 研究/論文

AutoMedBench: Agentic AI モデルによる医療自動研究に向けて

自律エージェントは、個別の予測タスクや短い形式の臨床質問応答を超えて、エンドツーエンドの医療 AI 研究ワークフローをサポートすることがますます期待されています。しかし、既存の医療エージェントのベンチマークは主に最終出力を評価しており、研究プロセス内でのエージェントの行動はほとんど可視化されません。このギャップに対処するために、私たちは AutoMedBench を提示します。これは、多様な医療画像処理およびマルチモーダル推論タスクにわたる自律型医療 AI 研究のためのワークフロー認識型ベンチマークであり、エージェントの実行を統一された 5 段階のワークフロー (S1 ～ S5: Plan、Setup、Validate、Inference、Submit) に整理します。このベンチマークは、1 回の実行あたり平均 33 エージェントターンの長期タスクで構成され、セグメンテーション、画像強調、視覚的質問応答 (VQA)、レポート生成、病変検出の 5 つの研究トラックにまたがります。各タスクは Lite と Standard の 2 つの難易度で評価されます。両者は同じデータとメトリクスを使用しますが、タスク概要に含まれるスキャフォールディングの量が異なります。各実行は最終的なタスク性能と S1 ～ S5 のステージスコアの両方で採点されるため、最初のタスク概要から最終的に提出された成果物までをステージ単位で分析できます。数千件の記録された実行に対するステージレベルの採点では、平均して Validate が最も弱いワークフローステージであり、Setup が最も強いことが明らかになりました。これは、現在のエージェントがパイプラインの信頼性を検証することよりも、パイプラインを実行可能にすることに長けていることを示唆しています。さらに、実行後のエラー分析では、検証と提出の失敗がタグ付けされたエラーの大部分を占め、発火したエラーコードのそれぞれ 37.7% と 38.1% を占めるのに対し、タスク理解のエラーは 0.9% とまれであること、またエラーコードが 1 つ発火した実行は、エラーコードのない実行よりも全体スコアが平均 48% 低いことが示されています。

原文 (English)

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

Autonomous agents are increasingly expected to support end-to-end medical-AI research workflows, moving beyond isolated prediction tasks or short-form clinical question answering. However, existing medical agent benchmarks primarily evaluate final outputs, providing limited visibility into agent behavior within the research process. To address this gap, we present AutoMedBench, a workflow-aware benchmark for autonomous medical-AI research across diverse medical imaging and multimodal inference tasks, organizing agent execution into a unified five-stage workflow (S1-S5): Plan, Setup, Validate, Inference, and Submit. It comprises long-horizon tasks with each run averaging 33 agent turns, spanning five research tracks: segmentation, image enhancement, visual question answering (VQA), report generation, and lesion detection. Each task is evaluated under two difficulty tiers, Lite and Standard, which use the same data and metrics but differ in the amount of task-brief scaffolding, and each run is scored using both final task performance and S1-S5 stage scores, enabling stage-level analysis from the initial task brief to the final submitted artifact. Across thousands of recorded runs, stage-level scoring reveals that Validate is the weakest workflow stage on average, whereas Setup is the strongest, suggesting that current agents are better at making pipelines executable than at verifying their reliability. Post-run error analysis further shows that verification and submission failures dominate tagged errors, accounting for 37.7% and 38.1% of fired codes respectively, whereas task-understanding errors are rare at 0.9%, and runs with one fired error code have a 48% lower overall score than runs with no error code on average.

2026-06-04 13:00 JST arXiv cs.AI LLM/生成AI エージェント

LEAP: エージェントフレームワークを使用した形式数学用の LLM のスーパーチャージング

大規模言語モデル (LLM) は非形式的な数学的推論には優れていますが、Lean のような形式言語で機械的に検証可能な証明を生成することには苦労します。私たちは LEAP を提示します。これは、汎用の基盤モデルが自動形式定理証明で最先端の性能を達成できるようにするエージェントフレームワークです。 LEAP は、非形式的推論、指示追従、反復的な自己改善といった基盤モデルの能力を活用します。複雑な問題をより小さな単位に分解することで、システムは Lean コンパイラとの継続的な対話を通じて、形式的な証明の構築と非形式的なブループリントを橋渡しします。飽和が進むベンチマークを超えた厳密な評価を提供するために、Lean で形式化された IMO スタイルの問題のベンチマークである Lean-IMO-Bench を導入します。このベンチマークの問題は、記述は短いものの、きわめて非定型的で複数ステップの証明を必要とし、幅広い難易度にわたります。実験では、北米の学部学生を対象とした年次数学コンテストである最新の 2025 年 Putnam Competition で、LEAP は 12 問すべてを解き、最先端の形式数学モデルによる最近の成果に並びました。 Lean-IMO-Bench では、LEAP は汎用 LLM のワンショット形式解決率を 10% 未満から 70% に引き上げ、金メダル級の IMO 特化システムが記録した 48% という基準を上回りました。さらに、偶数位数ケイリーグラフに関する Knuth のハミルトン分解の重要な部分問題の検証済み証明を含め、未解決の組合せ論的課題に対する複雑な証明を自律的に形式化することで、LEAP の研究レベルでの有用性を実証します。

原文 (English)

LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks

Large Language Models (LLMs) exhibit strong informal mathematical reasoning but struggle to generate mechanically verifiable proofs in formal languages like Lean. We present LEAP, an agentic framework that enables general-purpose foundation models to achieve state-of-the-art performance on automated formal theorem proving. LEAP leverages foundation model capabilities, such as informal reasoning, instruction following, and iterative self-refinement. By decomposing complex problems into smaller units, the system bridges formal proof construction with informal blueprints through continuous interaction with the Lean compiler. To provide a rigorous evaluation beyond increasingly saturated benchmarks, we introduce Lean-IMO-Bench, a benchmark of IMO-style problems formalized in Lean, with short statements yet highly non-routine and multi-step proofs across a wide range of difficulty levels. Empirically, on the latest 2025 Putnam Competition, an annual mathematics competition for undergraduate students in North America, LEAP solves all 12 problems, matching recent breakthroughs by frontier formal mathematical models. On Lean-IMO-Bench, LEAP boosts the one-shot formal solve rate of general-purpose LLMs from below 10% to 70%, notably surpassing the 48% benchmark set by a specialized, gold-medal-caliber IMO system. Furthermore, we demonstrate LEAP's research-level utility by autonomously formalizing complex proofs for open combinatorial challenges, including a verified proof for a key subproblem in Knuth's Hamiltonian decomposition of even-order Cayley graphs.

2026-06-04 13:00 JST arXiv cs.AI 画像/動画生成 エージェント ロボティクス

Transformer ベースの自動運転モデルと展開指向の圧縮: サーベイ

Transformer ベースのモデルは、長距離の空間依存関係、マルチエージェントのインタラクション、および認識、予測、計画にわたるマルチモーダルなコンテキストを捉えられるため、自動運転の中心的なパラダイムになりつつあります。同時に、大容量のアテンションベースのアーキテクチャはかなりの遅延、メモリ、エネルギーのオーバーヘッドを課すため、実際の車両への展開は依然として困難です。このサーベイでは、代表的な Transformer ベースの自動運転モデルをレビューし、それらをタスク上の役割、センシング構成、アーキテクチャ設計ごとに整理します。さらに重要な点として、展開指向の観点からこれらのモデルを検討し、効率の制約が実際にモデル設計の選択をどのように変えるかを分析します。さらに、量子化、枝刈り、知識蒸留、低ランク近似、効率的なアテンションなど、Transformer ベースの運転システムに関連する圧縮および高速化の戦略をレビューし、その利点、限界、およびタスクに依存する適用可能性について議論します。圧縮を独立した後処理ステップとして扱うのではなく、展開可能性、堅牢性、安全性に直接影響するシステムレベルの設計上の考慮事項として位置付けます。最後に、効率的な自動運転システムの標準化された、安全性とハードウェアを意識した評価に向けた未解決の課題と将来の研究の方向性を特定します。

原文 (English)

Transformer-Based Autonomous Driving Models and Deployment-Oriented Compression: A Survey

Transformer-based models are becoming a central paradigm in autonomous driving because they can capture long-range spatial dependencies, multi-agent interactions, and multimodal context across perception, prediction, and planning. At the same time, their deployment in real vehicles remains difficult because high-capacity attention-based architectures impose substantial latency, memory, and energy overhead. This survey reviews representative Transformer-based autonomous driving models and organizes them by task role, sensing configuration, and architectural design. More importantly, it examines these models from a deployment-oriented perspective and analyzes how efficiency constraints reshape model design choices in practice. We further review compression and acceleration strategies relevant to Transformer-based driving systems, including quantization, pruning, knowledge distillation, low-rank approximation, and efficient attention, and discuss their benefits, limitations, and task-dependent applicability. Rather than treating compression as an isolated post-processing step, we highlight it as a system-level design consideration that directly affects deployability, robustness, and safety. Finally, we identify open challenges and future research directions toward standardized, safety-aware, and hardware-conscious evaluation of efficient autonomous driving systems.

2026-06-04 13:00 JST arXiv cs.AI LLM/生成AI エージェント

ChatSOP: 制御可能な LLM 対話エージェントのための SOP ガイド付き MCTS 計画フレームワーク

大規模言語モデル (LLM) を利用した対話エージェントは、さまざまなタスクで優れた性能を示します。ユーザー理解に優れ、人間らしい応答ができる一方で、エージェントの**制御性の欠如**は依然として重要な課題であり、焦点の定まらない会話やタスクの失敗につながることがよくあります。これに対処するために、私たちは対話の流れを規制する標準作業手順 (SOP) を導入します。具体的には、LLM 駆動の対話エージェントの制御性を強化するために設計された、新しい SOP ガイド付きモンテカルロ木探索 (MCTS) 計画フレームワークである **ChatSOP** を提案します。これを可能にするために、GPT-4o を用いた半自動ロールプレイングシステムで生成し、厳格な手動の品質管理によって検証した、SOP アノテーション付きのマルチシナリオ対話からなるデータセットを構築しました。さらに、SOP 予測のために思考連鎖推論と教師ありファインチューニングを統合し、対話中の最適な行動計画に SOP ガイド付きモンテカルロ木探索を利用する新しい手法を提案します。実験結果は、GPT-3.5 に基づくベースラインモデルと比較してアクション精度が 27.95% 向上し、オープンソースモデルでも顕著な向上が見られるなど、私たちの手法の有効性を示しています。データセットとコードは公開されています。

原文 (English)

ChatSOP: An SOP-Guided MCTS Planning Framework for Controllable LLM Dialogue Agents

Dialogue agents powered by Large Language Models (LLMs) show superior performance in various tasks. Despite the better user understanding and human-like responses, their **lack of controllability** remains a key challenge, often leading to unfocused conversations or task failure. To address this, we introduce Standard Operating Procedure (SOP) to regulate dialogue flow. Specifically, we propose **ChatSOP**, a novel SOP-guided Monte Carlo Tree Search (MCTS) planning framework designed to enhance the controllability of LLM-driven dialogue agents. To enable this, we curate a dataset comprising SOP-annotated multi-scenario dialogues, generated using a semi-automated role-playing system with GPT-4o and validated through strict manual quality control. Additionally, we propose a novel method that integrates Chain of Thought reasoning with supervised fine-tuning for SOP prediction and utilizes SOP-guided Monte Carlo Tree Search for optimal action planning during dialogues. Experimental results demonstrate the effectiveness of our method, such as achieving a 27.95% improvement in action accuracy compared to baseline models based on GPT-3.5 and also showing notable gains for open-source models. Dataset and codes are publicly available.
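
The abstract does not spell out how the SOP constrains the tree search, so the following is only a minimal sketch of one plausible reading: standard UCB1 selection (the usual MCTS selection rule) applied to the subset of actions that the SOP permits at the current step. The function name `select_action`, the `stats` layout, and the example action names are hypothetical and are not taken from ChatSOP.

```python
import math


def select_action(stats, sop_allowed, exploration=1.4):
    """Pick the next dialogue action with UCB1, restricted to SOP-allowed actions.

    stats maps action -> (visit_count, total_reward). This layout is an
    assumption made for illustration, not the paper's data structure.
    """
    candidates = {a: s for a, s in stats.items() if a in sop_allowed}
    if not candidates:
        raise ValueError("the SOP allows none of the known actions")
    parent_visits = sum(visits for visits, _ in candidates.values())
    best_action, best_score = None, float("-inf")
    for action, (visits, total_reward) in candidates.items():
        if visits == 0:
            return action  # always try an unvisited action first
        mean_reward = total_reward / visits
        bonus = exploration * math.sqrt(math.log(parent_visits) / visits)
        if mean_reward + bonus > best_score:
            best_action, best_score = action, mean_reward + bonus
    return best_action


# Hypothetical statistics for three candidate dialogue actions.
stats = {"greet": (10, 6.0), "verify_identity": (2, 1.5), "close_ticket": (0, 0.0)}
print(select_action(stats, {"greet", "verify_identity"}))
```

Filtering before scoring keeps the search from spending rollouts on actions the procedure forbids, which is one simple way a procedure could make the agent more controllable.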

2026-06-04 13:00 JST arXiv cs.AI LLM/生成AI エージェント 研究/論文

BioBlue: 簡略化された観察形式による、生物学的および経済的に調整された LLM の AI 安全性ベンチマークにおける体系的な暴走オプティマイザーのような LLM 故障モード

「暴走最適化」に関する AI アライメントの議論の多くは、RL エージェント、すなわち他のすべてを犠牲にして代理目的を過剰に最適化する無制限の効用最大化器 (例: 「ペーパークリップ最大化器」、仕様ゲーミング) に焦点を当てています。 LLM ベースのシステムは、永続的なオプティマイザーではなく次トークン予測器として機能するため、より安全であると想定されることがよくあります。私たちは、時間の経過とともに状態を維持したり目的間のバランスを取ったりする必要がある、単純な長期の制御スタイルの環境に LLM を配置することで、この仮定を実証的に検証します。具体的には、単一目的および多目的の恒常性、収穫逓減を伴う無制限の目的間のバランス、再生可能資源の持続可能性です。その結果、LLM は多くのステップにわたって適切に行動し、示された目的を明確に理解しているにもかかわらず、しばしば構造化された形でコンテキストを失い、暴走的な行動に陥ることがわかりました。すなわち、恒常性の目標を無視し、多目的のトレードオフから単一目的の最大化へと崩壊し、その結果、凹型の効用構造を尊重できなくなります。これらの失敗は、有能な行動が続いた初期期間の後に確実に現れ、その時点でコンテキストウィンドウが満杯には程遠いにもかかわらず、特徴的なパターン (自己模倣的な振動、無制限の最大化、単一目的の最適化への回帰など) を示します。問題は、LLM が単にコンテキストを失って支離滅裂になるということではありません。 LLM は表面的には多目的で有界であるように見えますが、複数の目的が関わる持続的な相互作用の下では、その行動は単一目的で無制限の、アライメントが不十分なオプティマイザーのように振る舞う方向へ体系的に偏ります。私たちは、トークンレベルのパターン強化アトラクターという仮説を立てます。つまり、LLM は元の指示からではなく、最近の行動履歴のトークンパターンから行動を導き出す傾向を次第に強める可能性があります。なぜこれが多目的の設定でのみ起こるのかは、未解決の問題のままです。

原文 (English)

BioBlue: Systematic runaway-optimiser-like LLM failure modes on biologically and economically aligned AI safety benchmarks for LLMs with simplified observation format

Many AI alignment discussions of "runaway optimisation" focus on RL agents: unbounded utility maximisers that over-optimise a proxy objective (e.g., "paperclip maximiser", specification gaming) at the expense of everything else. LLM-based systems are often assumed to be safer because they function as next-token predictors rather than persistent optimisers. We empirically test this assumption by placing LLMs in simple, long-horizon control-style environments that require maintaining state of or balancing objectives over time: single- and multi-objective homeostasis, balancing unbounded objectives with diminishing returns, and sustainability of a renewable resource. We find that, although LLMs frequently behave appropriately for many steps and clearly understand the stated objectives, they often lose context in structured ways and drift into runaway behaviours: ignoring homeostatic targets, collapsing from multi-objective trade-offs into single-objective maximisation - thus failing to respect concave utility structures. These failures emerge reliably after initial periods of competent behaviour and exhibit characteristic patterns (including self-imitative oscillations, unbounded maximisation, and reverting to single-objective optimisation), even though the context window is far from full at that point. The problem is not that the LLMs just lose context and become incoherent. Although LLMs appear multi-objective and bounded on the surface, their behaviour under sustained interaction involving multiple objectives, is systematically biased towards acting like single-objective, unbounded, poorly aligned optimisers. We hypothesise a token-level pattern reinforcement attractor: LLMs may increasingly derive actions from the token patterns of their recent action history rather than from the original instructions. Why this happens only in multi-objective settings remains an open question.

2026-06-04 13:00 JST arXiv cs.AI エージェント ロボティクス

単体埋め込みによりアクター・クリティックエージェントのサンプル効率が向上

最近の研究では、大規模な環境並列化によってアクター・クリティック手法の実時間 (wall-clock) でのトレーニング時間を短縮することが提案されています。しかし残念ながら、これらの手法では、望ましい性能水準に達するまでに依然として多数の環境インタラクションが必要になる場合があります。適切に構造化された表現が深層強化学習 (RL) エージェントの汎化とサンプル効率を向上させうることに着目し、私たちは単体埋め込み (simplicial embeddings)、すなわち埋め込みを単体構造に制約する軽量な表現層の使用を提案します。この幾何学的な帰納バイアスにより、クリティックのブートストラップを安定させ、方策勾配を強化する、スパースで離散的な特徴が得られます。 FastTD3、FastSAC、PPO に適用すると、単体埋め込みは、実行速度を損なうことなく、さまざまな連続制御および離散制御の環境でサンプル効率と最終性能を一貫して向上させます。

原文 (English)

Simplicial Embeddings Improve Sample Efficiency in Actor-Critic Agents

Recent works have proposed accelerating the wall-clock training time of actor-critic methods via the use of large-scale environment parallelization; unfortunately, these can sometimes still require large number of environment interactions to achieve a desired level of performance. Noting that well-structured representations can improve the generalization and sample efficiency of deep reinforcement learning (RL) agents, we propose the use of simplicial embeddings: lightweight representation layers that constrain embeddings to simplicial structures. This geometric inductive bias results in sparse and discrete features that stabilize critic bootstrapping and strengthen policy gradients. When applied to FastTD3, FastSAC, and PPO, simplicial embeddings consistently improve sample efficiency and final performance across a variety of continuous- and discrete-control environments, without any loss in runtime speed.
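
To make "constraining embeddings to simplicial structures" concrete, here is a minimal sketch that assumes the common group-wise softmax formulation: the feature vector is split into equal groups and each group is passed through a temperature-scaled softmax, so every group lies on a probability simplex. The abstract does not confirm that this exact form is used; the function name, the `temperature` parameter, and the example values are illustrative assumptions.

```python
import math


def simplicial_embedding(features, num_groups, temperature=1.0):
    """Project a flat feature vector onto a product of simplices.

    Assumed formulation: split into `num_groups` equal chunks and apply a
    softmax with the given temperature to each chunk independently.
    """
    if len(features) % num_groups != 0:
        raise ValueError("feature length must be divisible by num_groups")
    size = len(features) // num_groups
    embedded = []
    for g in range(num_groups):
        chunk = [v / temperature for v in features[g * size:(g + 1) * size]]
        peak = max(chunk)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in chunk]
        total = sum(exps)
        embedded.extend(e / total for e in exps)
    return embedded


# Two groups of three features; each output group sums to 1.
z = [2.0, 0.5, -1.0, 0.0, 0.0, 3.0]
print(simplicial_embedding(z, num_groups=2, temperature=0.5))
```

Lower temperatures push each group toward a one-hot vector, which is one way to obtain the sparse, near-discrete features the abstract describes.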

2026-06-04 13:00 JST arXiv cs.AI エージェント ロボティクス

ベクトル化されたオンライン POMDP 計画

部分的な可観測性の下で計画を立てることは、自律ロボットの重要な機能です。部分観測可能なマルコフ決定プロセス (POMDP) は、部分観測可能性の問題の下で計画を立てるための強力なフレームワークを提供し、アクションの確率的影響とノイズの多い観測を通じて得られる限られた情報を捕捉します。 POMDP の解法は、今日のハードウェアでの大規模並列化から多大な恩恵を受ける可能性がありますが、POMDP ソルバーの並列化は困難でした。ほとんどのソルバーは、アクションとその値の推定をインターリーブする数値最適化に依存しているため、並列プロセス間に依存関係や同期ボトルネックが生じ、並列化の利点が相殺される可能性があります。この論文では、Vectorized Online POMDP Planner (VOPP) を提案します。これは、最適化コンポーネントの一部を分析的に解決し、期待値の推定のみで構成される数値計算を残す、最新の POMDP 定式化を利用する新しい並列オンラインソルバーです。 VOPP は、計画に関連するすべてのデータ構造をテンソルのコレクションとして表し、すべての計画ステップをこの表現に対する完全にベクトル化された計算として実装します。その結果、同時プロセス間の依存関係や同期ボトルネックのない大規模並列オンラインソルバーが実現します。実験結果は、VOPP が既存の最先端の並列オンラインソルバーと比較して、最適に近い解の計算において少なくとも 20 倍効率的であることを示しています。さらに、VOPP は最先端の逐次オンラインソルバーよりも優れたパフォーマンスを発揮し、計画予算を 1000 倍も削減します。

原文 (English)

Vectorized Online POMDP Planning

Planning under partial observability is an essential capability of autonomous robots. The Partially Observable Markov Decision Process (POMDP) provides a powerful framework for planning under partial observability problems, capturing the stochastic effects of actions and the limited information available through noisy observations. POMDP solving could benefit tremendously from massive parallelization on today's hardware, but parallelizing POMDP solvers has been challenging. Most solvers rely on interleaving numerical optimization over actions with the estimation of their values, which creates dependencies and synchronization bottlenecks between parallel processes that can offset the benefits of parallelization. In this paper, we propose Vectorized Online POMDP Planner (VOPP), a novel parallel online solver that leverages a recent POMDP formulation which analytically solves part of the optimization component, leaving numerical computations to consist of only estimation of expectations. VOPP represents all data structures related to planning as a collection of tensors, and implements all planning steps as fully vectorized computations over this representation. The result is a massively parallel online solver with no dependencies or synchronization bottlenecks between concurrent processes. Experimental results indicate that VOPP is at least $20\times$ more efficient in computing near-optimal solutions compared to an existing state-of-the-art parallel online solver. Moreover, VOPP outperforms state-of-the-art sequential online solvers, while using a planning budget that is $1000\times$ smaller.

2026-06-04 13:00 JST arXiv cs.AI LLM/生成AI エージェント

トポロジは重要: マルチエージェント LLM におけるメモリ漏洩の測定

グラフトポロジは、マルチエージェント LLM システムにおけるメモリ漏洩 (エージェントのメモリからの情報漏洩) の基本的な決定要因ですが、その影響は依然として十分に定量化されていません。私たちは、マルチエージェント LLM システムにおけるトポロジ条件付きのメモリ漏洩を比較するための制御された評価フレームワークである MAMA (Multi-Agent Memory Attack) を紹介します。 MAMA は、ラベル付きの個人識別情報 (PII) エンティティを含む合成文書を対象とし、そこからサニタイズされたタスク指示を生成します。私たちは、Engram (ターゲットエージェントのメモリに個人情報を埋め込む) と Resonance (攻撃者が抽出を試みる複数ラウンドのインタラクション) の 2 段階のプロトコルを実行します。 10 ラウンドにわたり、完全一致による抽出と攻撃者の最終出力に対する LLM ベースの推論を組み合わせた 2 段階の復元基準を用いて漏洩を測定します。 $n\in\{4,5,6\}$、攻撃者とターゲットの配置、およびベースモデルにわたって、6 つの標準的なトポロジ (complete、circle、chain、tree、star、star-ring) を評価します。結果は一貫しています。接続が密であるほど、攻撃者とターゲットの距離が短いほど、またターゲットの中心性が高いほど漏洩が増加します。漏洩の大部分は初期のラウンドで発生し、その後は頭打ちになります。モデルの選択は絶対的な漏洩率を変化させますが、大まかな構造的傾向は維持されます。時空間/位置の属性は、ID 資格情報や規制対象の識別子よりも漏洩しやすい傾向があります。私たちはシステム設計のための実践的な指針を導き出します。すなわち、疎な接続または階層的な接続を優先し、攻撃者とターゲットの分離を最大化し、トポロジを考慮したアクセス制御によってハブ/ショートカット経路を制限することです。コードは https://github.com/llll121/mama-eval で入手できます。

原文 (English)

Topology Matters: Measuring Memory Leakage in Multi-Agent LLMs

Graph topology is a fundamental determinant of memory leakage in multi-agent LLM systems, yet its effects remain poorly quantified. We introduce MAMA (Multi-Agent Memory Attack), a controlled evaluation framework for comparing topology-conditioned memory leakage in multi-agent LLM systems. MAMA operates on synthetic documents containing labeled Personally Identifiable Information (PII) entities, from which we generate sanitized task instructions. We execute a two-phase protocol: Engram (seeding private information into a target agent's memory) and Resonance (multi-round interaction where an attacker attempts extraction). Over 10 rounds, we measure leakage using a two-stage recovery criterion that combines exact-match extraction with LLM-based inference over the attacker's final output. We evaluate six canonical topologies (complete, circle, chain, tree, star, star-ring) across $n\in\{4,5,6\}$, attacker-target placements, and base models. Results are consistent: denser connectivity, shorter attacker-target distance, and higher target centrality increase leakage; most leakage occurs in early rounds and then plateaus; model choice shifts absolute rates but preserves broad structural trends; spatiotemporal/location attributes leak more readily than identity credentials or regulated identifiers. We distill practical guidance for system design: favor sparse or hierarchical connectivity, maximize attacker-target separation, and restrict hub/shortcut pathways via topology-aware access control. Our code is available at https://github.com/llll121/mama-eval.
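
The structural variable the abstract highlights, attacker-target distance, is easy to compute for the simpler topologies. The sketch below builds four of the six named graphs and measures hop distance with breadth-first search. It is an illustration only and is not taken from the MAMA code base: `build_topology` and `hop_distance` are hypothetical names, and tree and star-ring are omitted because their exact construction is not given in the abstract.

```python
from collections import deque


def build_topology(kind, n):
    """Return an undirected adjacency map for n agents numbered 0..n-1."""
    if kind == "complete":
        edges = {(i, j) for i in range(n) for j in range(i + 1, n)}
    elif kind == "chain":
        edges = {(i, i + 1) for i in range(n - 1)}
    elif kind == "circle":
        edges = {(i, (i + 1) % n) for i in range(n)}
    elif kind == "star":
        edges = {(0, i) for i in range(1, n)}  # agent 0 is the hub
    else:
        raise ValueError(f"unsupported topology: {kind}")
    adjacency = {i: set() for i in range(n)}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def hop_distance(adjacency, source, target):
    """Shortest number of hops between two agents, or None if unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Attacker at agent 0, target at the last agent, five agents in total.
for kind in ("complete", "circle", "chain", "star"):
    print(kind, hop_distance(build_topology(kind, 5), 0, 4))
```

Under the abstract's findings, layouts that yield larger distances for a given attacker-target placement would be expected to leak less, which matches the guidance to favor sparse or hierarchical connectivity.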

2026-06-04 13:00 JST arXiv cs.AI 画像/動画生成 エージェント

セグメントからシーンへ: 視覚言語モデルによるエージェント自動運転の時間的理解

視覚言語モデル (VLM) は、実環境で動作する自律エージェントの認識および推論のバックボーンとしてますます導入されており、自動運転 (AD) はその中でも最も安全性が重要な事例の 1 つです。このようなエージェントがイベントを予測し、原因を特定し、動的な環境で安全に行動するには、信頼性の高い時間的理解が不可欠ですが、これは最先端 (SoTA) の VLM にとっても依然として大きな課題です。これまでのビデオベンチマークは他のコンテンツ (スポーツ、料理など) を重視しており、短尺と長尺の両方の AD 映像の時間的理解のみに焦点を当てたベンチマークは存在しません。このギャップを埋めるために、私たちは 7 つのタスクにわたる約 6000 の質問応答 (QA) ペアで構成される Temporal Understanding in Autonomous Driving (TAD) ベンチマークを提示し、クローズドソースおよびオープンソースの汎用モデルと AD 特化モデルの計 9 つを評価します。現在の SoTA モデルの TAD における精度は、人間の精度を大幅に下回っています。 VLM ベースの運転エージェントの時間的推論を改善するために、私たちは 2 つの新しいトレーニング不要の手法を提案します。1 つは思考連鎖 (CoT) 推論を使用する Scene-CoT、もう 1 つは VLM の周辺でエージェントツールとして動作する軌道分析モジュールが生成する自車中心 (ego-centric) の時間認知マップを組み込んだ TCogMap です。既存の VLM と統合すると、私たちの手法は TAD での平均精度を最大 $17.72\%$、STSBench で最大 $10.35\%$ 向上させます。 TAD の導入、SoTA モデルのベンチマーク、および効果的な拡張手法の提案を通じて、この研究は、実環境で動作するエージェント型 AD システムの時間的理解のさらなる進歩を促進することを目指します。ベンチマークと評価コードは、それぞれ Hugging Face (https://huggingface.co/datasets/vbdai/TAD) と GitHub (https://github.com/vbdi/tad_bench) から入手できます。

原文 (English)

From Segments to Scenes: Temporal Understanding for Agentic Autonomous Driving via Vision-Language Models

Vision-Language Models (VLMs) are increasingly deployed as the perception and reasoning backbone of autonomous agents acting in the wild, with autonomous driving (AD) being one of the most safety-critical instances. Reliable temporal understanding is essential for such agents to anticipate events, attribute causes, and act safely in dynamic environments, yet this remains a significant challenge even for state-of-the-art (SoTA) VLMs. Prior video benchmarks have emphasized other content (sports, cooking, etc.), yet no existing benchmark focuses exclusively on temporal understanding for both short- and long-form AD footage. To fill this gap, we present the Temporal Understanding in Autonomous Driving (TAD) benchmark, comprising nearly 6000 question-answer (QA) pairs across 7 tasks, and evaluate 9 closed- and open-source generalist as well as AD-specialist models. Current SoTA models perform substantially below human accuracy on TAD. To improve the temporal reasoning of VLM-based driving agents, we propose two novel training-free solutions: Scene-CoT, which uses Chain-of-Thought (CoT) reasoning, and TCogMap, which incorporates an ego-centric temporal cognitive map produced by a trajectory-analysis module that operates as an agentic tool around the VLM. Integrated with existing VLMs, our methods improve average accuracy on TAD by up to $17.72\%$ and by up to $10.35\%$ on STSBench. By introducing TAD, benchmarking SoTA models, and proposing effective enhancements, this work aims to catalyze further progress on temporal understanding for agentic AD systems operating in the wild. The benchmark and evaluation code are available at Hugging Face (https://huggingface.co/datasets/vbdai/TAD) and GitHub (https://github.com/vbdi/tad_bench), respectively.

2026-06-04 13:00 JST arXiv cs.AI 画像/動画生成 エージェント ロボティクス

DVGT: 運転向けビジュアルジオメトリトランスフォーマー

視覚入力から 3D シーンのジオメトリを認識して再構築することは、自動運転にとって重要です。しかし、さまざまなシナリオやカメラ構成に適応できる、運転向けの高密度ジオメトリ認識モデルはまだ存在しません。このギャップを埋めるために、私たちは Driving Visual Geometry Transformer (DVGT) を提案します。これは、カメラ姿勢が未知 (unposed) のマルチビュー視覚入力のシーケンスから、グローバルな高密度 3D ポイントマップを再構築します。まず、DINO バックボーンを使用して各画像の視覚的特徴を抽出し、ビュー内の局所アテンション、ビュー間の空間アテンション、フレーム間の時間アテンションを交互に適用して、画像間の幾何学的関係を推定します。次に、複数のヘッドを使用して、最初のフレームのエゴ座標系におけるグローバルポイントマップと各フレームのエゴポーズをデコードします。正確なカメラパラメーターに依存する従来の手法とは異なり、DVGT は明示的な 3D 幾何学的事前知識を必要とせず、任意のカメラ構成を柔軟に処理できます。 DVGT は画像シーケンスからメートルスケールのジオメトリを直接予測するため、外部センサーとの事後位置合わせが不要になります。 nuScenes、OpenScene、Waymo、KITTI、DDAD などの運転データセットを大規模に混合してトレーニングされた DVGT は、さまざまなシナリオで既存のモデルを大幅に上回ります。コードは https://github.com/wzzheng/DVGT で入手できます。

原文 (English)

DVGT: Driving Visual Geometry Transformer

Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, there still lacks a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios. Code is available at https://github.com/wzzheng/DVGT.

2026-06-04 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling, yet recent work shows that vanilla MAD often underperforms simple majority vote despite higher computational cost. Studies show that, under homogeneous agents and uniform belief updates, debate preserves expected correctness and therefore cannot reliably improve outcomes. Drawing on findings from human deliberation and collective decision-making, we identify two key mechanisms missing from vanilla MAD: (i) diversity of initial viewpoints and (ii) explicit, calibrated confidence communication. We propose two lightweight interventions. First, a diversity-aware initialisation selects a more diverse pool of candidate answers, increasing the likelihood that a correct hypothesis is present at the start of debate. Second, a confidence-modulated debate protocol has agents express calibrated confidence and condition their updates on others' confidence. We show theoretically that diversity-aware initialisation improves the prior probability of MAD success without changing the underlying update dynamics, while confidence-modulated updates enable debate to systematically drift to the correct hypothesis. Empirically, across six reasoning-oriented QA benchmarks, our methods consistently outperform vanilla MAD and majority vote. Our results connect human deliberation with LLM-based debate and demonstrate that simple, principled modifications can substantially enhance debate effectiveness.
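
To make the contrast between plain majority vote and confidence-aware aggregation concrete, here is a minimal Python sketch. It is not the paper's protocol: the function names, the example answers, and the rule of summing self-reported confidences per answer are all illustrative assumptions.

```python
from collections import Counter, defaultdict

def majority_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]

def confidence_weighted_vote(answers, confidences):
    """Sum each agent's self-reported confidence per answer and return the top answer.

    Hypothetical aggregation rule for illustration, not the paper's update rule.
    """
    scores = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        scores[answer] += confidence
    return max(scores, key=scores.get)

# Two weakly confident agents agree on "A"; one highly confident agent says "B".
answers = ["A", "A", "B"]
confidences = [0.30, 0.35, 0.90]
print(majority_vote(answers))                          # A
print(confidence_weighted_vote(answers, confidences))  # B
```

The example only helps if the reported confidences are calibrated, which is the condition the abstract places on its own protocol.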

2026-06-04 13:00 JST | arXiv cs.AI | Agents

SUSD: Structured Unsupervised Skill Discovery through State Factorization

Unsupervised Skill Discovery (USD) aims to autonomously learn a diverse set of skills without relying on extrinsic rewards. One of the most common USD approaches is to maximize the Mutual Information (MI) between skill latent variables and states. However, MI-based methods tend to favor simple, static skills due to their invariance properties, limiting the discovery of dynamic, task-relevant behaviors. Distance-Maximizing Skill Discovery (DSD) promotes more dynamic skills by leveraging state-space distances, yet still falls short in encouraging comprehensive skill sets that engage all controllable factors or entities in the environment. In this work, we introduce SUSD, a novel framework that harnesses the compositional structure of environments by factorizing the state space into independent components (e.g., objects or controllable entities). SUSD allocates distinct skill variables to different factors, enabling more fine-grained control over the skill discovery process. A dynamic model also tracks learning across factors, adaptively steering the agent's focus toward underexplored factors. This structured approach not only promotes the discovery of richer and more diverse skills, but also yields a factorized skill representation that enables fine-grained and disentangled control over individual entities, which facilitates efficient training of compositional downstream tasks via Hierarchical Reinforcement Learning (HRL). Our experimental results across three environments, with factors ranging from 1 to 10, demonstrate that our method can discover diverse and complex skills without supervision, significantly outperforming existing unsupervised skill discovery methods in factorized and complex environments. Code is publicly available at: https://github.com/hadi-hosseini/SUSD.

2026-06-04 13:00 JST | arXiv cs.AI | Agents

Toward Autonomous O-RAN: A Multi-Scale Agentic AI Framework for Real-Time Network Control and Management

Open Radio Access Networks (O-RAN) promise flexible 6G network access through disaggregated, software-driven components and open interfaces, but this programmability also increases operational complexity. Multiple control loops coexist across the service management layer and RAN Intelligent Controller (RIC), while independently developed control applications can interact in unintended ways. In parallel, recent advances in generative Artificial Intelligence (AI) are enabling a shift from isolated AI models toward agentic AI systems that can interpret goals, coordinate multiple models and control functions, and adapt their behavior over time. This article proposes a multi-scale agentic AI framework for O-RAN that organizes RAN intelligence as a coordinated hierarchy across the Non-Real-Time (Non-RT), Near-Real-Time (Near-RT), and Real-Time (RT) control loops: (i) a Large Language Model (LLM) agent in the Non-RT RIC translates operator intent into policies and governs model lifecycles; (ii) Small Language Model (SLM) agents in the Near-RT RIC execute low-latency optimization and can activate, tune, or disable existing control applications; and (iii) Wireless Physical-layer Foundation Model (WPFM) agents near the distributed unit provide fast inference close to the air interface. We describe how these agents cooperate through standardized O-RAN interfaces and telemetry. Using a proof-of-concept implementation built on open-source models, software, and datasets, we demonstrate the proposed agentic approach in two representative scenarios: robust operation under non-stationary conditions and intent-driven slice resource control.

2026-06-04 13:00 JST | arXiv cs.AI | Agents

A Unified Framework for Locality in Scalable MARL

Scalable methods for networked multi-agent reinforcement learning let each agent plan using only a small neighborhood of the agent graph. This works only when the system is value-local, meaning a perturbation at one agent affects the long-run value at another agent weakly when the two are far apart. In the average-reward setting, the standard way to certify locality is the Dobrushin row-sum bound on a single matrix $C^\pi$ that captures how each agent's next state depends on each other agent's current state. To make this matrix easy to work with, prior work bounds it by a supremum over joint actions. The resulting bound is independent of the policy, but it is loose whenever the policy never picks the worst-case action. We split $C^\pi$ into pieces that separately track environment sensitivity and policy sensitivity, $C^\pi \preceq E^{\mathrm s}+E^{\mathrm a}\Pi(\pi)$, where $E^{\mathrm s}$ measures how the next state moves with the current state, $E^{\mathrm a}$ measures how it moves with the current action, and $\Pi(\pi)$ measures how reactive the policy is to changes in state. The spectral radius of $H^\pi := E^{\mathrm s}+E^{\mathrm a}\Pi(\pi)$ then controls the decay of the average-reward Poisson solution, and the spectral certificate $\rho(H^\pi)<1$ is strictly weaker than the row-sum condition $\|H^\pi\|_\infty<1$ on the same matrix and applies in regimes where policy-independent action-supremum bounds used in prior Dobrushin-style work cannot. For temperature-$\tau$ softmax policies we get $\Pi(\pi)\le L/(2\tau)$, so the softmax temperature directly controls locality. We use this decay result to give a deterministic oracle guarantee for a block-coordinate KL-proximal policy-improvement template whose truncation bias decays exponentially in the message-passing radius $\kappa$.
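
The gap between the spectral certificate and the row-sum condition can be seen on a small example. The Python sketch below uses a hypothetical 2x2 matrix standing in for $H^\pi$ (it is not taken from the paper) and helper names chosen for illustration. Because the spectral radius never exceeds the infinity norm, any matrix passing the row-sum test also passes the spectral test, but not the other way round.

```python
import math

def row_sum_norm(matrix):
    """Infinity norm: the largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in matrix)

def spectral_radius_2x2(matrix):
    """Largest eigenvalue magnitude of a real 2x2 matrix via its characteristic polynomial."""
    (a, b), (c, d) = matrix
    trace = a + d
    det = a * d - b * c
    disc = trace * trace - 4.0 * det
    if disc >= 0:
        root = math.sqrt(disc)
        return max(abs((trace + root) / 2.0), abs((trace - root) / 2.0))
    # Complex-conjugate pair: both eigenvalues have modulus sqrt(det).
    return math.sqrt(det)

def softmax_reactivity_bound(lipschitz_l, tau):
    """Upper bound L / (2 * tau) on policy reactivity, as quoted in the abstract."""
    return lipschitz_l / (2.0 * tau)

# Hypothetical sensitivity matrix (illustrative values only).
H = [[0.5, 0.75],
     [0.0, 0.5]]
print(row_sum_norm(H))                     # 1.25, so the row-sum condition fails
print(spectral_radius_2x2(H))              # 0.5, so the spectral certificate holds
print(softmax_reactivity_bound(1.0, 2.0))  # 0.25; raising tau lowers the bound
```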

2026-06-04 13:00 JST | arXiv cs.AI | Image/Video Generation, Agents, Research/Papers

ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) A dataset for large-scale multi-agent interactive world modeling is built on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories with paired multi-view videos (front/rear/left/right views per agent) and camera data. 2) We propose a spatial concatenation strategy for four-view videos of independent agents to model a broader environment and to ensure internal multi-view geometric consistency. 3) We integrate cross-agent attention blocks into the pretrained video model, which enable interactive transmission of spatial-temporal information across agents, guaranteeing shared world consistency in overlapping regions and reasonable generation in non-overlapping regions. ShareVerse, which supports 49-frame large-scale video generation, accurately perceives the position of dynamic agents and achieves consistent shared world modeling.

2026-06-04 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents, Business/Funding, Research/Papers

Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety

A safety score earned on a benchmark need not predict how the same model behaves once it is wrapped in an agentic scaffold the benchmark ne…

2026-06-04 13:00 JST | arXiv cs.AI | Agents

VulnAgent-R2: Evidence-Calibrated Multi-Agent Auditing for Repository-Level Vulnerability Detection

Software vulnerabilities often depend on cross-file data flow, build options, framework conventions, and runtime guards, so isolated functi…

2026-06-04 13:00 JST | arXiv cs.AI | Agents, Research/Papers

AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science

Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in la…

2026-06-04 13:00 JST | arXiv cs.AI | Agents, Robotics

Contextual Multi-Task Reinforcement Learning for Autonomous Reef Monitoring

Although autonomous underwater vehicles promise the capability of marine ecosystem monitoring, their deployment is fundamentally limited by…

2026-06-04 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents

LLM agents increasingly rely on reusable skills (e.g., SKILL markdown files) to execute complex tasks, yet these artifacts lack portability…

2026-06-04 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

A Systematic Investigation of RL-Jailbreaking in LLMs

The evolution of generative models from next-token predictors to autonomous engines of complex systems necessitates rigorous safety hardeni…

2026-06-04 13:00 JST | arXiv cs.AI | Agents, Robotics

Lost in Fog: Sensor Perturbations Expose Reasoning Fragility in Driving VLAs

Interpretable autonomous driving planners depend not only on generating explanations, but also on those explanations remaining reliable und…

2026-06-04 13:00 JST | arXiv cs.AI | Agents

Grimlock: Guarding High-Agency Systems with eBPF and Attested Channels

Agentic systems increasingly run user-authored orchestration code that invokes tools, spawns subtasks, and delegates work across machines a…

2026-06-04 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents, Research/Papers

Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, w…

2026-06-04 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

memorywire: A Vendor-Neutral Wire Format for Agent Memory Operations

Agent-memory frameworks -- mem0, Letta/MemGPT, Cognee, Zep/Graphiti, MemoryOS, MemTensor -- each ship their own SDK, storage layout, and op…

2026-06-04 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Adaptive Auto-Harness: Sustained Self-Improvement for Agentic System Deployment on Open-Ended Task Streams

Auto-harness systems such as A-Evolve, GEPA, and Meta-Harness improve LLM agents by optimizing prompts, skills, tools, memories, and suppor…

2026-06-04 13:00 JST | arXiv cs.AI | Agents

OpenAgenet/OAN: Open Infrastructure for Trusted Agent Interconnection

OpenAgenet, abbreviated as OAN, is an open infrastructure project for trusted Agent interconnection. It addresses a problem that becomes vi…

2026-06-04 13:00 JST | arXiv cs.AI | Agents, Research/Papers

OpenAgenet/OAN: Technical Architecture for Trust-Governed Agent Identity and Discovery

This paper describes the technical architecture of OpenAgenet / OAN. OAN is a protocol-neutral trust layer for open Agent interconnection.…

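2026-06-04 08:00 JST | ITmedia AI+ | Agents

ServiceNow and Accenture launch "FDE" to roll out agentic AI company-wide

ServiceNow and Accenture have launched a new program. How will it address the problem of agentic AI adoption stalling at the proof-of-concept stage and failing to deliver company-wide results?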

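2026-06-03 23:00 JST | ITmedia AI+ | Agents, Regulation/Policy

Not "don't use AI" but "tell us if you do": redesigning governance for the agent era

As AI agents spread into business operations, organizational governance has not kept pace. Drawing on points raised by OWASP, this article explains two principles Japanese companies should observe and three actions they can start next week.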

2026-06-03 22:40 JST | TechCrunch AI | Agents

Meta’s AI agent for WhatsApp Business is now available globally

WhatsApp will charge businesses for using its AI agent based on token usage.

2026-06-03 22:02 JST | TechCrunch AI | Agents, Business/Funding

Coralogix raises $200M on bet that someone needs to watch the AI agents

Coralogix is among a growing number of infrastructure firms betting that as AI systems move into production, demand will rise for tools tha…

2026-06-03 21:00 JST | OpenAI | LLM/Generative AI, Agents

How Wasmer used Codex to build a Node.js runtime for the edge

See how Wasmer used Codex with GPT-5.5 to build a Node.js runtime for the edge, accelerating development 10x to 20x and shipping in weeks i…

2026-06-03 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection

Modeling patient trajectories from longitudinal electronic health records (EHRs) requires reasoning over sparse, noisy, and long-context multimodal sequences. Existing LLM-based multi-agent systems address context length but process patients in isolation, failing to mirror how clinicians leverage accumulated experience from similar prior cases. We present Traj-Evolve, a self-evolving multi-agent system with two complementary evolving mechanisms. First, an Experience Pool (ExPool) acts as a non-parametric memory, indexing rejection-sampled reasoning traces to retrieve similar patients as few-shot contexts. Second, multi-agent reinforcement learning (MARL) via reward-ranked fine-tuning parametrically optimizes inter-agent and agent-memory collaboration. A leave-one-out cross-retrieval strategy unifies the two, aligning training- and inference-time behavior under retrieval augmentation. On a lung cancer prediction task utilizing up to five years of multimodal EHRs, Traj-Evolve outperforms nine strong baselines on the overall population and a challenging never-smoker population. Analysis of the evolving dynamics highlights three key findings: (1) expanding the ExPool shifts optimal retrieval from diverse to specific samples; (2) under MARL, the manager agent's prediction loss converges quickly while the worker agents' temporal reasoning continues to benefit from more verified patients; and (3) the two mechanisms are complementary on the predicted risk, where ExPool improves specificity while MARL improves sensitivity.

2026-06-03 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Toward a Modular Architecture for Embedded AI Agent Systems at the Edge

The rise of Large Language Models (LLMs) has enabled agentic AI capable of complex reasoning and tool use; however, deploying such autonomy in pervasive computing environments remains challenging due to the strict memory and energy constraints of embedded microcontrollers. Existing frameworks typically assume server-class resources or continuous connectivity, leaving a gap for deeply embedded systems. This paper proposes a modular reference architecture for Embedded Agent Systems that bridges the divide between deterministic real-time control and agentic intelligence. We introduce a tiered design that decouples On-Device Agents (executing highly compressed neural networks and rule-based logic for low-latency, privacy-critical tasks) from Cloud-Augmented Agents that leverage Small Language Models (SLMs) for higher-level reasoning and planning. A key contribution is the integration of a cross-cutting Governance Layer, ensuring observability, policy enforcement, and safety across distributed fleets of autonomous devices. Rather than presenting purely empirical benchmarks, we analyze architectural design principles and trade-offs regarding latency, energy, and reliable execution in resource-constrained environments.

2026-06-03 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents, Research/Papers

When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning

When does multi-agent debate help data cleaning, and when does it hurt? Across three benchmarks, four model families, and over 6,000 task-condition pairs, we find debate's effect reverses sign: it degrades generation across all four models (-1.6 to -15.5pp) through critique-induced confusion (CIC), in which the Generator uncritically accepts hallucinated Critic feedback, yet it improves error detection (+27.4pp F1, d=1.0). We derive a debate benefit condition: debate helps when the probability of rescuing a wrong output (Critic verification odds weighted by fixability) exceeds the probability of destroying a correct one. A factorial experiment proves adversarial separation is essential: self-verification with identical tools fails, while a separate Critic with code-execution grounding and evidence-gated generation produces the first debate configuration to significantly exceed single-agent on a generative task (+5.3pp, p<0.05). The condition correctly predicts all nine task types and generalizes with zero false positives across 19 published comparisons in seven domains.
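
The rescue-versus-destroy trade-off can be illustrated with simple expected-value bookkeeping. The Python sketch below is an assumption-laden illustration, not the paper's derivation: weighting by the single-agent accuracy `base_accuracy` is my own accounting, and the parameter names and numbers are hypothetical.

```python
def net_accuracy_change(base_accuracy, p_rescue, p_destroy):
    """Expected accuracy change from adding a debate round.

    base_accuracy: fraction of single-agent outputs that are already correct.
    p_rescue:      probability that debate turns a wrong output into a correct one.
    p_destroy:     probability that debate turns a correct output into a wrong one.
    Assumed bookkeeping: gains come only from wrong outputs, losses only from correct ones.
    """
    gained = (1.0 - base_accuracy) * p_rescue
    lost = base_accuracy * p_destroy
    return gained - lost

def debate_helps(base_accuracy, p_rescue, p_destroy):
    """True when the expected change in accuracy is positive."""
    return net_accuracy_change(base_accuracy, p_rescue, p_destroy) > 0

# Hypothetical numbers chosen for illustration.
print(net_accuracy_change(0.5, 0.25, 0.125))  # 0.0625
print(debate_helps(0.5, 0.125, 0.25))         # False
```

Under this accounting, a Critic that is easily wrong raises `p_destroy`, which is one way to read the generation losses the abstract attributes to critique-induced confusion.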

2026-06-03 13:00 JST | arXiv cs.AI | Agents, Research/Papers

Handoff Debt: The Rediscovery Cost When Coding Agents Take Over Interrupted Tasks

Coding-agent benchmarks evaluate whether a single uninterrupted agent can resolve a repository issue. Real software work is messier: tasks are interrupted, reassigned, reviewed, and resumed from partial states left by another agent or engineer. We study this missing dimension through handoff debt: the rediscovery cost imposed when a predecessor's work is opaque or incomplete. Our takeover protocol interrupts a coding agent at deterministic handoff points, freezes the repository, and evaluates successor agents under four handoff views: repository state only, raw trace, summary notes, and structured notes. Across 75 source tasks, the protocol generates 181 handoff-point tasks and 724 takeover runs per successor model. Across three successor models, context-bearing handoffs reduce median agent events by 20-59% and cumulative prompt tokens by 42-63% relative to repository-only takeover. Solved-rate effects are smaller and model-dependent, but efficiency gains are consistent. These findings suggest that coding-agent evaluation should report not only whether a task is solved, but also how costly that work is for another agent to resume.
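
The run counts in the abstract are consistent with evaluating every handoff-point task under each of the four handoff views. The short Python check below reproduces that arithmetic; the helper `relative_reduction` and its example inputs are hypothetical and only show how a reduction such as 59% would be computed against a repository-only baseline.

```python
HANDOFF_VIEWS = ["repository state only", "raw trace", "summary notes", "structured notes"]
HANDOFF_POINT_TASKS = 181

# One takeover run per (handoff-point task, handoff view) pair.
runs_per_successor_model = HANDOFF_POINT_TASKS * len(HANDOFF_VIEWS)
print(runs_per_successor_model)  # 724

def relative_reduction(baseline, with_context):
    """Fractional reduction of a cost metric relative to the repository-only baseline."""
    return (baseline - with_context) / baseline

# Hypothetical medians: 100 agent events without context, 41 with context.
print(relative_reduction(100, 41))  # 0.59
```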

2026-06-03 13:00 JST | arXiv cs.AI | Agents, Research/Papers

What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents

Benchmarks for autonomous agents measure whether agents complete tasks, yet this framing is systematically blind to whether an agent should have proceeded at all. Agents trained under human-feedback objectives develop a structural tendency to proceed even when they lack the inputs, evidence, or authorization to act safely, a disposition we term compliance bias, because both the reward signal and the benchmark scoring regime treat proceeding as the correct default regardless of whether the preconditions for safe action are present. We make three contributions. We first show that compliance bias originates in reward hacking within human-feedback pipelines and is entrenched by prominent agent benchmarks, which either penalize agents for pausing or are architecturally unable to distinguish a principled pause from a silent failure. We then introduce a three-gap taxonomy of abstention-warranted scenarios, covering specification gaps where required information is absent, verification gaps where world state cannot be confirmed, and authority gaps where explicit authorization has not been given, which together provide a principled basis for constructing abstention-aware agent benchmarks. Finally, we propose abstention evaluation protocols (Safety Rate, Usability Rate, and Informed Refusal Rate) and report preliminary results across 144 enterprise agent scenarios and five model families, in which a runtime-enforced abstention mechanism achieves up to 89.2% hazardous-action blocking and 87.5% usability on authorized scenarios, demonstrating that the safety-usability tradeoff is tunable rather than inherent and that its shape varies substantially across model families. We treat this as preliminary work and offer the taxonomy and composite metrics as a starting point for further conversations.

2026-06-03 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Inducing Reasoning Primitives from Agent Traces

ReAct-style LLM agents often rediscover the same reasoning routines across problems, yet leave those routines trapped in transient scratchpads. We introduce Reasoning Primitive Induction, a single-pass method that mines successful ReAct traces, clusters recurrent reasoning moves, and converts the most frequent moves into a compact library of typed pseudo-tools. Each pseudo-tool is specified by a natural-language docstring interpreted by an LLM at invocation time, and a standard ReAct loop composes these primitives at test time. The central result is that induced libraries outperform the very agent that generated their traces: by +44pp on RuleArena NBA (30 -> 74), +30pp on MuSR team allocation (38 -> 68), and +22pp on NatPlan meeting planning (7 -> 29). Across five comparable subtasks spanning narrative deduction, rule application, and constraint-satisfaction planning, a single fixed configuration improves over zero-shot Chain-of-Thought on every subtask, matches or surpasses expert-authored decompositions, and outperforms AWM at lower average inference cost.

2026-06-03 13:00 JST | arXiv cs.AI | Agents

AUDITFLOW: Executable Symbolic Environments for Structured Financial Reporting Verification

Structured financial audit verification is difficult for language-model agents because correctness depends on structured evidence rather than text alone. A model must link reported facts to taxonomy concepts, traverse calculation or dimensional relations, and recompute expected values before applying an audit rule. We propose AuditFlow, a graph-grounded multi-agent framework that separates adaptive search from deterministic verification. AuditFlow builds a symbolic environment from a static US-GAAP taxonomy graph and a dynamic XBRL filing graph, and exposes it through typed tools for fact retrieval, taxonomy traversal, numerical checking, and rule evaluation. Two junior auditors inspect each case from regulatory and evidentiary views, while a senior auditor resolves disagreements and can request further investigation. The final reports are fused through evidential aggregation to produce an audit verdict, expected value, evidence trail, and trustworthiness score. On a FinAuditing-derived FinMR sample, AuditFlow reaches 82.09% joint audit accuracy under GPT-5.5, outperforming the strongest baseline by 14.93 points. Removing deterministic checks drops accuracy to 17.91%, showing that the symbolic environment performs the verification step that the model cannot reliably replace.
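
The "recompute expected values" step lends itself to a deterministic check. The Python sketch below is a generic illustration rather than AuditFlow's tool interface: the function name, the tolerance, and the sample figures are assumptions, and the (value, weight) pairs follow the general idea of weighted calculation relationships between a total and its components.

```python
def check_calculation(reported_total, components, tolerance=0.5):
    """Recompute a total from weighted components and compare it with the reported fact.

    components: list of (value, weight) pairs, where weight is +1 or -1.
    Returns (passed, expected_value).
    """
    expected = sum(value * weight for value, weight in components)
    return abs(expected - reported_total) <= tolerance, expected

# Hypothetical filing facts (not taken from any real filing).
components = [(1200.0, 1), (300.0, 1), (50.0, -1)]
print(check_calculation(1450.0, components))  # (True, 1450.0)
print(check_calculation(1500.0, components))  # (False, 1450.0)
```

Keeping this check outside the language model means the verdict on the arithmetic does not depend on the model's own numeric reasoning.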

2026-06-03 13:00 JSTarXiv cs.AIエージェント

ToolGate: ツール拡張視覚言語エージェント向けのトークン効率の良い呼び出し前制御

ツール拡張視覚言語エージェントは、OCR、検出、セグメンテーション、その他のツールを通じて外部の知覚証拠を取得できますが、提案されたツール呼び出しをすべて実行するのはコストがかかり、場合によっては不必要です。呼び出し前の制御の問題を研究します。ReAct スタイルの VLM エージェントが知覚ツール呼び出しを提案した後、その呼び出しは実行されるべきか、その出力がコンテキストに入る前にスキップされるべきか? 5 つのベンチマーク全体で、ベースラインエージェントのローカル選択性が低いことがわかりました。有益なコールと有害なコールは同様の割合 (11.8% 対 9.9%) で発生しますが、ほとんどのコールは即時強制応答の予測を変更しません。軌道テキストと単純な構造特徴から実行/スキップの決定を予測する軽量の外部コントローラーである ToolGate を紹介します。 ToolGate は、2 つの Qwen3-VL バックボーン全体で、クロスドメイン設定の平均精度を維持しながら、トークンコストを無制限の ReAct ベースラインの 64 ～ 69% に削減します。 Qwen3-VL-30B でのマッチドドメイン軌道トレーニングにより、平均精度がさらに 1.65 ポイント向上しました。これらの結果は、ツール拡張 VLM エージェントが、より優れた知覚ツールからだけでなく、ツールの出力がいつ支払う価値があるかを明示的に制御することからも恩恵を受けることを示しています。

原文 (English)

ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents

Tool-augmented vision-language agents can acquire external perceptual evidence through OCR, detection, segmentation, and other tools, but executing every proposed tool call is costly and sometimes unnecessary. We study the pre-call control problem: after a ReAct-style VLM agent proposes a perceptual tool call, should the call be executed, or skipped before its output enters the context? Across five benchmarks, we find that the baseline agent exhibits poor local selectivity: helpful and harmful calls occur at similar rates (11.8% vs. 9.9%), while most calls do not change the immediate forced-answer prediction. We introduce ToolGate, a lightweight external controller that predicts execute/skip decisions from trajectory text and simple structural features. Across two Qwen3-VL backbones, ToolGate reduces token cost to 64-69% of the unrestricted ReAct baseline while preserving average accuracy in cross-domain settings. With matched-domain trajectory training on Qwen3-VL-30B, it further improves average accuracy by 1.65 points. These results show that tool-augmented VLM agents benefit not only from better perceptual tools, but also from explicit control over when tool outputs are worth paying for.
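
A pre-call gate of the kind described above might look like the following sketch. It uses only a few structural features and a hand-set logistic scorer; the feature names, weights, and threshold are assumptions for illustration, and the trajectory-text features that ToolGate also uses are omitted.

```python
import math
from dataclasses import dataclass


@dataclass
class ProposedCall:
    tool: str         # e.g. "ocr" or "detect"
    step_index: int   # position of the proposal in the trajectory
    repeated: bool    # True if the same call was already executed earlier


def execute_probability(call, weights, bias=0.0):
    """Score a proposed call with a logistic model over hypothetical structural features."""
    features = {
        f"tool={call.tool}": 1.0,
        "late_step": 1.0 if call.step_index >= 3 else 0.0,
        "repeated": 1.0 if call.repeated else 0.0,
    }
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def gate(call, weights, threshold=0.5):
    """Decide before execution, so a skipped call never adds tokens to the context."""
    return "execute" if execute_probability(call, weights) >= threshold else "skip"


# Hand-set weights for illustration; a real controller would learn them from trajectories.
weights = {"tool=ocr": 1.2, "repeated": -3.0, "late_step": -0.4}
first = gate(ProposedCall("ocr", 0, False), weights)
again = gate(ProposedCall("ocr", 4, True), weights)
```

The point of deciding before execution is that the cost of a skipped call is avoided entirely, rather than being paid and then discarded.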

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

SkillDAG: 大規模な LLM スキル選択のための自己進化型の型付きスキルグラフ

LLM エージェントが大規模なスキルライブラリを採用するにつれて、適切なサブセットの選択は、類似度マッチングの問題ではなく構造的な問題になります。スキルは相互に依存し、競合し、特殊化し、あるいは重複しますが、この構造は完全な列挙からも埋め込み類似度からも見えません。SkillDAG は、スキル間の関係を型付き有向グラフとしてモデル化し、推論時にエージェントが呼び出せる構造検索インターフェイスとして LLM エージェントに公開します。このグラフは固定の検索パイプラインに組み込まれるのではなく、実行中にクエリされ、進化します。各検索はベクトル一致、型付きエッジの近傍、競合シグナルを返し、提案後コミット (propose-then-commit) プロトコルにより、エージェントは実行に裏付けられたエッジを登録できるため、グラフはエピソードをまたいで構造を蓄積します。MiniMax-M2.7 を用いた ALFWorld と SkillsBench では、SkillDAG は 67.1% の成功率と 27.3% の報酬に達し、報告されている最も強力な Graph-of-Skills ベースラインを +12.8 ポイントおよび +8.6 ポイント上回りました。この優位性は gpt-5.2-codex でも再現され、SkillsBench の内在的指標 Ret@K は、同一クエリの条件下で 65.5 から 78.2 に上昇します。これらの向上は、切り分け可能なメカニズムに由来します。1 つは、固定のシード拡散パイプラインが劣化する状況で、プールが 10 倍に拡大しても頑健性を保つ候補ランキングです。もう 1 つは、以前のヒットを排除せずに正解の再現率を高める集合単調なオンライン編集です。

原文 (English)

SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale

As LLM agents adopt large skill libraries, selecting the right subset becomes a structural problem rather than a similarity-matching one: skills depend on, conflict with, specialize, or duplicate one another, a structure invisible to both full enumeration and embedding similarity. We present SkillDAG, which models inter-skill relationships as a typed directed graph and exposes it to an LLM agent as an inference-time, agent-callable structural retrieval interface, queried and evolved during execution rather than baked into a fixed retrieval pipeline: each search returns vector matches, typed-edge neighbors, and conflict signals, and a propose-then-commit protocol lets the agent register execution-backed edges so the graph accumulates structure across episodes. On ALFWorld and SkillsBench with MiniMax-M2.7, SkillDAG reaches 67.1% success and 27.3% reward, exceeding the strongest reported Graph-of-Skills baseline by +12.8 and +8.6 points; the advantage ports to gpt-5.2-codex, and intrinsic SkillsBench Ret@K rises from 65.5 to 78.2 under matched queries. These gains trace to isolable mechanisms: candidate ranking that stays robust as the pool grows 10x where a fixed seeding-diffusion pipeline degrades, and set-monotone online edits that enlarge ground-truth recall without evicting prior hits.
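
The typed graph and the propose-then-commit protocol can be sketched roughly as follows. The `SkillGraph` class, its method names, and the skill names are hypothetical, and the vector matches are assumed to come from an external embedding search that is not shown.

```python
from collections import defaultdict

# Edge types taken from the relations named in the abstract.
EDGE_TYPES = {"depends_on", "conflicts_with", "specializes", "duplicates"}


class SkillGraph:
    """Hypothetical typed directed skill graph with execution-backed edge registration."""

    def __init__(self):
        self.edges = defaultdict(set)  # src -> {(edge_type, dst)}
        self.pending = []              # proposed edges awaiting execution evidence

    def propose(self, src, edge_type, dst):
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        self.pending.append((src, edge_type, dst))

    def commit(self, src, edge_type, dst, execution_succeeded):
        """Register a previously proposed edge only if execution backed it."""
        proposal = (src, edge_type, dst)
        if proposal not in self.pending:
            return False
        self.pending.remove(proposal)
        if execution_succeeded:
            self.edges[src].add((edge_type, dst))
        return execution_succeeded

    def search(self, matches):
        """Expand externally supplied vector matches with typed neighbours and conflicts."""
        neighbours, conflicts = set(), set()
        for skill in matches:
            for edge_type, dst in self.edges.get(skill, ()):
                if edge_type == "conflicts_with":
                    conflicts.add((skill, dst))
                else:
                    neighbours.add((skill, edge_type, dst))
        return {"matches": list(matches), "neighbours": neighbours, "conflicts": conflicts}


graph = SkillGraph()
graph.propose("deploy_app", "depends_on", "build_image")
graph.commit("deploy_app", "depends_on", "build_image", execution_succeeded=True)
graph.propose("deploy_app", "conflicts_with", "rollback_app")
graph.commit("deploy_app", "conflicts_with", "rollback_app", execution_succeeded=True)
result = graph.search(["deploy_app"])
```

Edits in this sketch only add edges, which loosely mirrors the set-monotone property mentioned in the abstract: earlier hits are never evicted by a later commit.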

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

DELTAMEM: 残差ツリーによる LLM エージェントの増分経験メモリ

大規模言語モデル (LLM) ベースのエージェントは、継続的な対話を通じて経験から学習するために、ますます記憶に依存しています。ただし、経験を独立したフラットな単位として保存すると、類似したエピソードが重複する内容を繰り返し、シーンのわずかな違いによって検索された記憶が互いに矛盾するガイダンスを与えるため、大きな冗長性と検索の競合が生じます。これに対処するために、新たに獲得した経験は多くの場合、既存の知識の増分的な変化であると仮定して、残差経験 (residual experience) を導入します。私たちは、経験記憶を 2 つの独立した残差ツリーに編成するフレームワークである DeltaMem を提案します。1 つは目標条件付きのタスク経験を再利用可能なスキルとして保存し、もう 1 つはシーンレベルの環境知識を保存します。各ツリーは、一般化された基本経験にルートノードを、その後のバリエーションに増分デルタノードを使用し、関連する経験が重複なく共通の基盤を共有できるようにします。検索では、失敗ペナルティ付きの類似度スキャンによって最適な一致を特定し、ルートから一致ノードまでのチェーンを合成して完全な経験を再構築します。自律的な統合メカニズムが高頻度のパスを新しいルートノードに蒸留し、一般的なヒューリスティックから特殊なバリアントへとツリーが自己組織化できるようにします。多様なインタラクティブ環境での実験では、DeltaMem が既存のベースラインを一貫して上回ることが示されています。将来の研究を促進するために、https://github.com/import-myself/DeltaMem でコードを公開します。

原文 (English)

DELTAMEM: Incremental Experience Memory for LLM Agents via Residual Trees

Large Language Model (LLM)-based agents increasingly rely on memory to learn from experiences over continual interactions. However, storing experiences as independent, flat units leads to substantial redundancy and retrieval conflicts, as similar episodes repeat overlapping content and subtle scene variations cause retrieved memories to offer contradictory guidance. To address this, we introduce residual experience, positing that newly acquired experience is often an incremental variation of existing knowledge. We propose DeltaMem, a framework that organizes experience memory into two independent residual trees, one storing goal-conditioned task experience as reusable skills and another for scene-level environment knowledge. Each tree uses a root node for generalized base experiences and incremental delta nodes for subsequent variations, allowing related experiences to share a common foundation without duplication. For retrieval, a failure-penalized similarity scan locates the best match, reconstructing the full experience via root-to-match chain composition. An autonomous consolidation mechanism distills high-frequency paths into new root nodes, enabling the trees to self-organize from general heuristics to specialized variants. Experiments across diverse interactive environments show that DeltaMem consistently outperforms existing baselines. To facilitate future research, we release the code at https://github.com/import-myself/DeltaMem.
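
Root-to-match chain composition can be illustrated with a small sketch. It assumes that each node stores its delta as a dictionary of fields that add to or override its ancestors; this representation, the `Node` type, and the sample experience are illustrative assumptions rather than DeltaMem's actual data format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    delta: dict                      # fields this node adds or overrides (assumed format)
    parent: Optional["Node"] = None  # None marks a root holding a base experience


def reconstruct(node):
    """Compose the full experience by applying deltas from the root down to `node`."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    experience = {}
    for n in reversed(chain):        # root first, matched node last
        experience.update(n.delta)
    return experience


root = Node({"goal": "heat object", "steps": ["find object", "use microwave"]})
variant = Node(
    {"steps": ["find object", "open fridge", "use microwave"], "note": "object stored in fridge"},
    parent=root,
)
full = reconstruct(variant)
```

The variant stores only what differs from the root, so the shared `goal` field is kept once and recovered at retrieval time.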

2026-06-03 13:00 JSTarXiv cs.AIエージェント研究/論文

DeskCraft: プロフェッショナルなワークフローと人間参加型コラボレーションに関するデスクトップエージェントのベンチマーク

専門的なクリエイティブソフトウェアやエンジニアリングソフトウェアにおける実際のプロフェッショナルなデスクトップワークフローは、長期にわたって展開され、多くの場合、人間参加型の調整を必要とします。そこでは、エージェントが必要な情報を積極的に求め、ユーザーはタスクの進行に応じて追加の指示、説明、フィードバック、または修正を提供します。しかし、既存のデスクトップ GUI ベンチマークの多くは、この設定を、すべてのユーザー指示が事前に与えられる短く単純化されたタスクに縮小しています。この問題に対処するために、長期にわたるクリエイティブおよびエンジニアリングのワークフローと、プロアクティブな人間とエージェントのコラボレーションを対象としたデスクトップ GUI ベンチマークである DeskCraft を紹介します。DeskCraft はタスクを複数レベルの難易度分類に整理しており、長期タスクは 50 を超える実行ステップを必要とし、デザイン、ビデオ、オーディオ、3D 制作にわたるプロフェッショナル向けクリエイティブソフトウェアをカバーしています。さらに、DeskCraft は人間とエージェントのコラボレーションを、ターン途中およびターン後のやり取りをカバーする対話プロトコルとして形式化します。ターン途中のインタラクションは、不確実性の下でエージェントが開始する明確化と、実行中にユーザーが開始する中断の両方を捉え、ターン後のインタラクションは、エージェントが完了を通知した後のユーザー主導のフィードバックに対応します。これらを合わせて、現実的なコラボレーションパターンの全領域を網羅します。プロプライエタリおよびオープンソースの 18 のエージェントを 538 のタスクで評価したところ、GPT-5.4 は標準タスクで 31.6%、対話型タスクで 27.6% に達しました。さらなる分析により、長期ワークフローの遂行とプロアクティブな明確化において失敗が続いていることが明らかになりました。すべての評価コード、タスク、データを https://github.com/mrwwk/DeskCraft でオープンソース化する予定です。

原文 (English)

DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

Real-world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require human-in-the-loop coordination, where agents proactively seek necessary information and users provide additional instructions, clarifications, feedback, or corrections as the task progresses. Yet existing desktop GUI benchmarks mostly reduce this setting to short, simplified tasks with all user instructions provided upfront. To address this issue, we introduce DeskCraft, a desktop GUI benchmark targeting long horizon creative and engineering workflows and proactive human-agent collaboration. DeskCraft organizes tasks into a multilevel difficulty taxonomy, with long horizon tasks requiring over 50 execution steps, and covers professional creative software across design, video, audio, and 3D creation. Furthermore, DeskCraft formalizes human-agent collaboration into an interaction protocol covering mid-turn and post-turn exchanges. Mid-turn interaction captures both agent-initiated clarification under uncertainty and user-initiated interruption during execution, while post-turn interaction accommodates user-driven feedback after the agent signals completion, together spanning the full space of realistic collaboration patterns. We evaluate 18 proprietary and open source agents on 538 tasks and find that GPT-5.4 reaches 31.6% on standard tasks and 27.6% on interactive tasks. Further analyses reveal persistent failures in long horizon workflow delivery and proactive clarification. We will open-source all evaluation codes, tasks, and data at https://github.com/mrwwk/DeskCraft.

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

EvoTrainer: 自律的なエージェント強化学習のための共同進化する LLM ポリシーとトレーニングハーネス

自律的な LLM トレーニングはレシピ探索として定式化されることが多く、その場合トレーニングハーネスはほぼ静的なままになります。この制限は、ボトルネックが移り変わり、スカラー報酬が多様な失敗モードを覆い隠すエージェント RL において、さらに顕著になります。EvoTrainer は、経験的なフィードバックを通じて LLM ポリシーとトレーニング側のハーネスを共進化させる自律型トレーニングフレームワークです。ロールアウトレベルの証拠を診断し、診断手法を改訂し、介入をバックテストし、再利用可能なスキルを蓄積します。数学的推論、競技プログラミングのコード生成、リポジトリレベルのソフトウェアエンジニアリングで評価したところ、EvoTrainer は同じデータ、コードベース、評価プロトコルの下で、人間が設計した RL リファレンスと同等かそれを上回り、長期的なエージェント型 SWE で最大の向上を示しました。軌跡分析により、保持される戦略がドメインごとに分岐すること、進化する診断手法が無効な高スコアのブランチの昇格を防ぐこと、再利用可能なスキルがその後の探索を方向づけることが示されました。自律的な LLM RL は、レシピ探索を超えて、ポリシーとそれを解釈するトレーニングハーネスの共同進化へと進むべきです。

原文 (English)

EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

Autonomous LLM training is often framed as recipe search, which leaves the training harness largely static. This limitation sharpens in agentic RL, where shifting bottlenecks and scalar rewards mask diverse failure modes. We introduce EvoTrainer, an autonomous training framework that co-evolves LLM policies and training-side harnesses through empirical feedback: it diagnoses rollout-level evidence, revises diagnostics, backtests interventions, and accumulates reusable skills. Evaluated on mathematical reasoning, competitive-programming code generation, and repository-level software engineering, EvoTrainer matches or exceeds the human-engineered RL references under the same data, codebase, and evaluation protocol, with the largest gain on long-horizon agentic SWE. Trajectory analyses show that retained strategies diverge across domains, evolving diagnostics prevent invalid high-scoring branches from being promoted, and reusable skills shape later search. Autonomous LLM RL should move beyond recipe search toward joint evolution of policies and the training harnesses that interpret them.

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

情報獲得による LLM エージェントの不確実性認識の明確化

大規模言語モデル (LLM) エージェントは、多くの場合、仕様が不十分なユーザー指示の下で動作し、ユーザーの意図に関する潜在的な不確実性が誤ったツール操作につながります。この課題に対処するために、明確化の動作を曖昧さの解消と整合させる、目標指向の明確化フレームワークを提案します。私たちのアプローチの中心となるのは、情報獲得報酬 (Information Gain Reward) です。これは、明確化のやり取りによって引き起こされる、正解の目標に向けたベイズ的な信念更新を測定することで、明確化質問の有用性を定量化する指標です。この報酬を使用して明確化モデル (LLM) をトレーニングし、高い情報獲得を得られるように最適化します。これにより、明確化が効果的に不確実性を減らし、エージェント、ツール、ユーザーからなる環境でのタスク完了を向上させます。私たちは、明確化を組み込んだ τ-Bench 環境でフレームワークを検証し、5 つの異種バックボーンにわたるクロスエージェント評価を実施します。実験結果は、私たちの方法が、平均して合計 0.3 インタラクションステップしか追加せずに、明確化なしのベースラインよりも成功率を一貫して 3.7% 向上させることを示しています。

原文 (English)

Uncertainty-Aware Clarification in LLM Agents with Information Gain

Large Language Model (LLM) agents often operate under underspecified user instructions, where latent uncertainty over user intent leads to erroneous tool actions. To address this challenge, we propose a goal-oriented clarification framework that aligns clarification behavior with ambiguity resolution. Central to our approach is the Information Gain Reward, a metric that quantifies the utility of clarification questions by measuring the Bayesian belief update towards the ground-truth goal induced by the clarification exchange. We train the clarifier (LLM) using this reward to optimize for high information gain, ensuring that clarifications effectively reduce uncertainty and improve task completion within the agent-tool-user environment. We validate our framework within a clarification-enhanced τ-Bench environment, conducting cross-agent evaluations across five heterogeneous backbones. Empirical results demonstrate that our method consistently improves the success rate by 3.7% over the no-clarification baseline, while adding only 0.3 total interaction steps on average.
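
One way to read the Information Gain Reward is as the change in log-probability that the belief assigns to the ground-truth goal after a clarification exchange. The sketch below works through that reading on a toy two-goal belief; the exact formula, the goal names, and the likelihood values are assumptions for illustration, since the abstract does not give the definition.

```python
import math


def bayes_update(prior, likelihood):
    """Posterior over candidate goals after observing the user's answer to a clarification."""
    unnormalized = {goal: prior[goal] * likelihood[goal] for goal in prior}
    total = sum(unnormalized.values())
    return {goal: p / total for goal, p in unnormalized.items()}


def information_gain_reward(prior, posterior, true_goal):
    """Log-probability gained on the ground-truth goal (one possible formulation)."""
    return math.log(posterior[true_goal]) - math.log(prior[true_goal])


# Toy example: the user's answer is much more likely under the "refund" goal.
prior = {"refund": 0.5, "exchange": 0.5}
likelihood = {"refund": 0.9, "exchange": 0.1}
posterior = bayes_update(prior, likelihood)
reward = information_gain_reward(prior, posterior, "refund")
```

Under this reading, a question whose answer leaves the belief unchanged earns zero reward, and a question that shifts belief away from the true goal earns a negative one.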

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェントビジネス/資金調達

話す前に考える: マルチエージェント社会シミュレーションにおける内部評価から公の表現まで

LLM ベースのマルチエージェントシミュレーションは、社会的相互作用、熟議、集団的な意見のダイナミクスを研究するための有望な方法を提供します。しかし、既存の対話シミュレーションフレームワークの多くは、相互作用を主に観察可能なターン交換または集約された出力として表現しており、沈黙、発言意図、公的表現の背後にある内部評価プロセスを調べることが困難です。エージェントの私的な推論を公的な発話の生成から分離する、インターバルベースのマルチエージェントシミュレーションフレームワークである TBS (Think-Before-Speak) を紹介します。各インターバルで、すべてのエージェントは共有された対話履歴と自身の記憶に基づいて構造化された内部状態を更新します。これらの状態には、不協和関連の評価、知覚された意見風土、知覚された孤立リスク、応答戦略、および発言意欲が含まれます。その後、オーケストレーターが競合する発言意図を調停し、1 つの発話を公開対話にコミットすることで、内部評価と公的な相互作用が時間とともに共進化できるようにします。私たちは、気候関連の政策課題に関する模擬タウンホール討論で TBS を評価します。結果は、TBS が一貫した内部状態トレースを生成し、これらのトレースがターン割り当て、沈黙、記憶の各条件にわたって系統的に変化することを示しています。不協和関連の評価はエージェントの発言意欲を高め、沈黙圧力の評価はそれを低下させます。発言意図が形成された後は、公的表現は主にターン割り当てルールによって形作られます。これらの知見は、TBS が内部評価から公的表現への経路を観察可能かつ分析可能にすることで、メカニズムに敏感な社会シミュレーションを支えることを示唆しています。

原文 (English)

Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation

LLM-based multi-agent simulation offers a promising way to study social interaction, deliberation, and collective opinion dynamics. However, many existing dialogue simulation frameworks represent interaction mainly as observable turn exchange or aggregated outputs, leaving the internal evaluative processes behind silence, speaking intention, and public expression difficult to examine. We introduce TBS (Think-Before-Speak), an interval-based multi-agent simulation framework that separates agents' private reasoning from public utterance generation. At each interval, all agents update structured internal states based on the shared dialogue history and their own memory. These states include dissonance-related appraisal, perceived opinion climate, perceived isolation risk, response strategy, and willingness to speak. The orchestrator then resolves competing speaking intentions and commits one utterance to the public dialogue, allowing internal evaluation and public interaction to co-evolve over time. We evaluate TBS in simulated town hall discussions on a climate-related policy issue. Results show that TBS produces coherent internal-state traces and that these traces vary systematically across turn-allocation, silence, and memory conditions. Dissonance-related appraisal increases agents' willingness to speak, whereas silence-pressure appraisal decreases it. Once speaking intention is formed, public expression is shaped mainly by turn-allocation rules. These findings suggest that TBS supports mechanism-sensitive social simulation by making the pathway from internal evaluation to public expression observable and analyzable.
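
The separation between private state updates and a single committed utterance per interval can be sketched as below. The linear willingness rule, the threshold, and the highest-willingness allocation rule are assumptions chosen for illustration; in TBS the internal states are produced by LLM agents and several turn-allocation conditions are compared.

```python
from dataclasses import dataclass


@dataclass
class InternalState:
    agent: str
    dissonance: float         # dissonance-related appraisal, assumed in [0, 1]
    isolation_risk: float     # perceived isolation risk, assumed in [0, 1]
    willingness: float = 0.0  # willingness to speak, set by update_willingness


def update_willingness(state):
    """Hypothetical rule: dissonance raises willingness, isolation risk lowers it."""
    raw = 0.5 + 0.5 * state.dissonance - 0.5 * state.isolation_risk
    state.willingness = min(1.0, max(0.0, raw))
    return state


def select_speaker(states, threshold=0.5):
    """Orchestrator step: commit at most one speaker per interval; None means silence."""
    candidates = [s for s in states if s.willingness >= threshold]
    if not candidates:
        return None
    return max(candidates, key=lambda s: s.willingness).agent


states = [
    update_willingness(InternalState("A", dissonance=0.8, isolation_risk=0.2)),
    update_willingness(InternalState("B", dissonance=0.1, isolation_risk=0.9)),
]
speaker = select_speaker(states)
```

Keeping `willingness` as recorded state, separate from who actually speaks, is what makes silence observable: agent B has an internal trace in this interval even though it says nothing.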

2026-06-03 13:00 JSTarXiv cs.AIエージェント研究/論文

MedCUA-Bench: 臨床コンピューター使用エージェント向けのスクリーンショットのみのベンチマーク

コンピューターを使用するエージェントは、画面ベースの反復的な臨床作業を自動化できる可能性がありますが、医療用グラフィカルユーザーインターフェイスにおける信頼性はほとんど検証されていません。既存のベンチマークは一般的な Web タスクやデスクトップタスクに焦点を当てており、医療ソフトウェアは十分に扱われていません。医療ソフトウェアはドメイン知識を必要とし、主流のアプリケーションとは著しく異なる UI デザインを持ち、公開テスト環境がなく、タスクの完了を超えた安全性の検証を要求します。臨床コンピューター使用エージェントのための対話型ベンチマークである MedCUA-Bench を紹介します。10 の医療分野にわたる 18 の臨床シナリオをカバーしており、実際の製品マニュアルとオープンソースの医療システムから再構築することで、ライセンスとプライバシーの制約を回避しながら本物の臨床インターフェースを再現しています。各タスクには、臨床推論を UI 実行から切り離すために、意図レベルとステップレベルの目標がペアで付属しており、タスクの完了と 5 つの臨床安全性の側面について決定論的チェッカーによって評価されます。23 のエージェント全体で、最高のクローズドソースモデルの厳密な成功率は 54.2% に達しますが、実際の OpenEMR ではすべてのモデルが 9% 未満にとどまります。オープンソースエージェントの平均はわずか 2.5% で、最高でも 16.2% です。MedCUA-Bench は、現在のエージェントと信頼できる臨床ソフトウェアの利用との間のギャップを明らかにし、将来の研究のための再現可能なテストベッドを提供します。

原文 (English)

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidated. Existing benchmarks focus on general web or desktop tasks and underrepresent medical software, which requires domain knowledge, exhibits markedly different UI design from mainstream applications, lacks public testing environments, and demands safety validation beyond task completion. We introduce MedCUA-Bench, an interactive benchmark for clinical computer-use agents. It covers 18 clinical scenarios across 10 medical domains, reconstructed from real product manuals and open-source medical systems to capture authentic clinical interfaces while avoiding licensing and privacy constraints. Each task ships with paired intent- and step-level goals to disentangle clinical reasoning from UI execution, and is evaluated by a deterministic checker over task completion and five clinical safety dimensions. Across 23 agents, the best closed-source model reaches 54.2% strict success, while all models remain below 9% on the real OpenEMR. Open-source agents average only 2.5%, with the best reaching 16.2%. MedCUA-Bench exposes the gap between current agents and reliable clinical software use, providing a reproducible testbed for future research.

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

推論前に認識: 効率的で信頼性の高いプロアクティブなモバイルエージェントのための推論前の認識フレームワーク

マルチモーダル大規模言語モデル (MLLM) はモバイルエージェントを大きく進歩させましたが、エージェントは「どのように」支援するかを決める前に「いつ」介入するかを決定する必要があるため、プロアクティブなモバイル支援は依然として困難です。既存のシステムでは、これら 2 つの決定を統合された MLLM ベースのパイプライン内で実装することが多く、保守的な介入フィルタリングと包括的な支援生成との間で目標の不整合が生じるほか、エージェントが沈黙を保つべき場合にも冗長な推論が発生します。これらの制限に対処するために、私たちは、推論の前に知覚するという考え方に基づく 2 段階のフレームワークである推論前知覚フレームワーク (PRPF) を提案します。PRPF は、介入ゲーティングとコンテキスト圧縮のための軽量なマルチモーダルプロアクティブパーセプター (MPP) を導入し、介入が正当な場合にのみプロアクティブエージェントリーズナー (PAR) を起動します。ProactiveMobile ベンチマークでの実験では、PRPF が ProactiveMobile ベースラインと比べて、誤トリガー率 (FTR) を大幅に削減しつつ、成功率 (SR) と推論効率を向上させることが示されています。

原文 (English)

Perceive Before Reasoning: A Pre-Reasoning Perception Framework for Efficient and Reliable Proactive Mobile Agents

Multimodal large language models (MLLMs) have substantially advanced mobile agents, yet proactive mobile assistance remains challenging because agents must decide when to intervene before determining how to assist. Existing systems often implement these two decisions within a unified MLLM-based pipeline, leading to goal misalignment between conservative intervention filtering and comprehensive assistance generation, as well as redundant inference when the agent should remain silent. To address these limitations, we propose the Pre-Reasoning Perception Framework (PRPF), a two-stage framework built on perceiving before reasoning. PRPF introduces a lightweight Multimodal Proactive Perceptor (MPP) for intervention gating and context compression, and activates the Proactive Agent Reasoner (PAR) only when intervention is warranted. Experiments on the ProactiveMobile benchmark show that PRPF substantially reduces false trigger rates (FTR) while improving success rates (SR) and inference efficiency over the ProactiveMobile baseline.

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

LEAP: エージェントフレームワークを使用した形式数学用の LLM のスーパーチャージング

大規模言語モデル (LLM) は、非形式的な数学的推論には優れていますが、Lean のような形式言語で機械的に検証可能な証明を生成することは苦手です。LEAP は、汎用の基盤モデルが自動形式定理証明で最先端の性能を達成できるようにするエージェントフレームワークです。LEAP は、非形式的推論、指示追従、反復的な自己改善といった基盤モデルの能力を活用します。複雑な問題をより小さな単位に分解することで、システムは Lean コンパイラーとの継続的な対話を通じて、形式的な証明の構築と非形式的なブループリントを橋渡しします。飽和しつつあるベンチマークを超えた厳密な評価を提供するために、Lean で形式化された IMO スタイルの問題のベンチマークである Lean-IMO-Bench を導入します。このベンチマークの問題は、記述は短いものの、きわめて非定型的で複数ステップの証明を必要とし、幅広い難易度にわたります。実験では、北米の学部生を対象とした年次数学コンテストである最新の 2025 年 Putnam Competition において、LEAP は 12 問すべてを解き、最先端の形式数学モデルによる最近のブレークスルーに並びました。Lean-IMO-Bench では、LEAP は汎用 LLM のワンショット形式解決率を 10% 未満から 70% に引き上げ、金メダル級の専用 IMO システムが記録した 48% を上回りました。さらに、偶数位数のケイリーグラフに対するクヌースのハミルトン分解の重要な部分問題について検証済みの証明を得るなど、未解決の組合せ論的課題に対する複雑な証明を自律的に形式化することで、LEAP の研究レベルでの有用性を実証します。

原文 (English)

LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks

Large Language Models (LLMs) exhibit strong informal mathematical reasoning but struggle to generate mechanically verifiable proofs in formal languages like Lean. We present LEAP, an agentic framework that enables general-purpose foundation models to achieve state-of-the-art performance on automated formal theorem proving. LEAP leverages foundation model capabilities, such as informal reasoning, instruction following, and iterative self-refinement. By decomposing complex problems into smaller units, the system bridges formal proof construction with informal blueprints through continuous interaction with the Lean compiler. To provide a rigorous evaluation beyond increasingly saturated benchmarks, we introduce Lean-IMO-Bench, a benchmark of IMO-style problems formalized in Lean, with short statements yet highly non-routine and multi-step proofs across a wide range of difficulty levels. Empirically, on the latest 2025 Putnam Competition, an annual mathematics competition for undergraduate students in North America, LEAP solves all 12 problems, matching recent breakthroughs by frontier formal mathematical models. On Lean-IMO-Bench, LEAP boosts the one-shot formal solve rate of general-purpose LLMs from below 10% to 70%, notably surpassing the 48% benchmark set by a specialized, gold-medal-caliber IMO system. Furthermore, we demonstrate LEAP's research-level utility by autonomously formalizing complex proofs for open combinatorial challenges, including a verified proof for a key subproblem in Knuth's Hamiltonian decomposition of even-order Cayley graphs.

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

InfoMem: 回答条件付き情報獲得によるロングコンテキスト記憶エージェントのトレーニング

長いコンテキストのタスクでは、LLM が大規模なコンテキストから回答関連情報を識別して保存する必要があります。チャンク単位のメモリエージェントは、ドキュメントのチャンクを順番に読み取り、コンパクトなメモリを更新し、蓄積されたメモリから最終的な回答を生成することで、この問題に対処します。ただし、既存の RL ベースのチャンクごとのエージェントは、スパースな最終回答報酬に依存するか、記憶および検索アクションに語彙的な中間報酬を使用します。これらの信号はタスクの成功または局所的なオーバーラップを監視しますが、最終的なメモリが真実の答えをサポートしているかどうかを直接評価しません。我々は、回答条件付き情報を使用して最終メモリの有用性を評価する、チャンク単位の記憶エージェントをトレーニングするための報酬メカニズムである InfoMem を提案します。 InfoMem は、最終メモリがモデルのグラウンドトゥルースの答えのトークンごとの対数尤度をどの程度増加させるかを測定します。 RL の最適化を安定させるために、InfoMem はこの信号を成功した軌跡にのみ適用し、報酬を合成する前に正規化します。同じ GRPO フレームワークとトレーニング予算の下で、InfoMem は、同等のメモリエージェント RL ベースラインよりもロングコンテキストのメモリエージェントのパフォーマンスを向上させます。分析の結果、効果的な最終記憶報酬は成功の軌跡に基づいて動作し、報酬を合成する前に正規化され、クエリではなく回答に基づいて条件付けされる必要があることが示されています。私たちのコードは https://github.com/GenSouKa1/InfoMem で入手できます。

原文 (English)

InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain

Long-context tasks require LLMs to identify and preserve answer-relevant information from large contexts. Chunk-wise memory agents address this issue by sequentially reading document chunks, updating a compact memory, and generating the final answer from the accumulated memory. However, existing RL-based chunk-wise agents either rely on sparse final-answer rewards or use lexical intermediate rewards for memory and retrieval actions. These signals supervise task success or local overlap, but do not directly evaluate whether the final memory supports the ground-truth answer. We propose InfoMem, a reward mechanism for training chunk-wise memory agents that evaluates final-memory utility using answer-conditioned information. InfoMem measures how much the final memory increases the model's per-token log-likelihood of the ground-truth answer. To stabilize RL optimization, InfoMem applies this signal only to successful trajectories and normalizes it before reward composition. Under the same GRPO framework and training budget, InfoMem improves long-context memory-agent performance over comparable memory-agent RL baselines. Analyses show that effective final-memory rewards should operate on successful trajectories, be normalized before reward composition, and be conditioned on the answer rather than the query. Our code is available at https://github.com/GenSouKa1/InfoMem.
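
The reward described above can be sketched in two steps: a per-token log-likelihood gain for the ground-truth answer, then normalization restricted to successful trajectories. The comparison against a no-memory baseline and the z-score normalization are assumptions, since the abstract does not specify either, and the token log-probabilities below are made-up numbers standing in for model outputs.

```python
from statistics import mean, pstdev


def answer_gain(logp_with_memory, logp_without_memory):
    """Per-token log-likelihood gain of the ground-truth answer given the final memory."""
    return mean(logp_with_memory) - mean(logp_without_memory)


def memory_rewards(gains, successes, eps=1e-8):
    """Normalize gains over successful trajectories only; failed trajectories get 0."""
    kept = [g for g, ok in zip(gains, successes) if ok]
    if not kept:
        return [0.0] * len(gains)
    mu, sigma = mean(kept), pstdev(kept)
    return [((g - mu) / (sigma + eps)) if ok else 0.0 for g, ok in zip(gains, successes)]


# Three rollouts for one question; token log-probabilities are illustrative.
gains = [
    answer_gain([-0.2, -0.4], [-1.0, -1.2]),
    answer_gain([-0.9, -1.1], [-1.0, -1.2]),
    answer_gain([-0.1, -0.1], [-1.0, -1.2]),
]
successes = [True, True, False]
rewards = memory_rewards(gains, successes)
```

The third rollout has the largest raw gain but receives no memory reward because its final answer failed, which reflects the restriction to successful trajectories described in the abstract.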

2026-06-03 13:00 JSTarXiv cs.AIエージェント

CP-Agent: 化学的摂動下での細胞形態学的プロファイリングのためのコンテキスト認識型マルチモーダル推論

Cell Painting は、多重蛍光染色、ハイコンテントイメージング、定量分析を組み合わせて高次元の表現型データを生成し、作用機序 (MoA) の推定、毒性予測、薬物-疾患アトラスの構築などのさまざまな下流タスクを支えます。しかし、既存のワークフローは時間とコストがかかり、解釈も困難です。薬物スクリーニングのモデリング手法は主に分子表現の学習に焦点を当てており、実際の実験コンテキスト (細胞株、投与スケジュールなど) を考慮していないため、一般化と MoA の識別能が制限されています。我々は、薬物摂動下での細胞の形態変化について、メカニズムに関連した人間が解釈可能な根拠を生成できるエージェント型マルチモーダル大規模言語モデル (MLLM) である CP-Agent を紹介します。CP-Agent の中核にはコンテキスト認識型のアライメントモジュール CP-CLIP があり、ハイコンテント画像と実験メタデータを共同で埋め込むことで、処置と MoA の頑健な識別を可能にします (最大 F1 スコア 0.896 を達成)。CP-CLIP の出力をエージェントによるツール使用および推論と統合することで、CP-Agent は根拠を構造化レポートにまとめ、実験計画と仮説の洗練を導きます。これらの機能は、より解釈しやすく、スケーラブルで、コンテキストを考慮した表現型スクリーニングを可能にし、創薬における仮説生成の反復サイクルを効率化することで、創薬を加速する CP-Agent の可能性を示しています。

原文 (English)

CP-Agent: Context-Aware Multimodal Reasoning for Cellular Morphological Profiling under Chemical Perturbations

Cell Painting combines multiplexed fluorescent staining, high-content imaging, and quantitative analysis to generate high-dimensional phenotypic readouts to support diverse downstream tasks such as mechanism-of-action (MoA) inference, toxicity prediction, and construction of drug-disease atlases. However, existing workflows are slow, costly and difficult to interpret. Approaches for drug screening modeling predominantly focus on molecular representation learning, while neglecting actual experimental context (e.g., cell line, dosing schedule, etc.), limiting generalization and MoA resolution. We introduce CP-Agent, an agentic multimodal large language model (MLLM) capable of generating mechanism-relevant, human-interpretable rationales for cell morphological changes under drug perturbations. At its core, CP-Agent leverages a context-aware alignment module, CP-CLIP, that jointly embeds high-content images and experimental metadata to enable robust treatment and MoA discrimination (achieving a maximum F1-score of 0.896). By integrating CP-CLIP outputs with agentic tool usage and reasoning, CP-Agent compiles rationales into a structured report to guide experimental design and hypothesis refinement. These capabilities highlight CP-Agent's potential to accelerate drug discovery by enabling more interpretable, scalable, and context-aware phenotypic screening -- streamlining iterative cycles of hypothesis generation in drug discovery.

2026-06-03 13:00 JSTarXiv cs.AIエージェント

インタラクション軌跡がターミナルエージェントのトレーニングに効果的となる理由は何ですか?

一般に、より強力なコードエージェントはトレーニング後の教師として優れていると考えられていますが、この仮定はタスクの難易度、ハーネスの設計、生徒の能力から十分に解きほぐされていないままです。私たちは、マルチドメインの現実世界の問題を環境で検証されたエージェントタスクに変換するスケーラブルなパイプラインである Terminal-Lego を使用して、この教育的リンクを調査します。驚くべきことに、スタンドアロンのパフォーマンスは指導の有効性を左右しません。Claude Opus 4.6 は Terminal-Bench 2.0 でより高いスコアを達成しましたが、スコアの低いエージェントである DeepSeek-V3.2 からの軌道で微調整された生徒は、非常に強力な一般化を示しました。私たちは、この「教育的パラドックス」は環境に基づいた監督（EGS）によるものであると考えています。ハーネスと可視の相互作用を通じて、検査、行為、検証の動作を明示的に明らかにする軌跡により、生徒は脆弱な行動シーケンスではなく、堅牢な問題解決ルーチンを内面化することができます。スケーリング分析により、卓越したデータ効率が明らかになりました。たとえば、わずか 15.3k の Terminal-Lego 軌道で、Qwen3-32B は Terminal-Bench 2.0 で 24.3% のスコアを達成し、30 倍以上のデータ量で確立された以前の SOTA パフォーマンスに匹敵します。私たちの結果は、エージェントのポストトレーニングのフロンティアが単なる結果のマッチングを超えたところにあり、焦点を「ハーネスエンジニアリング」に移すことを示唆しています。そこでは、環境に基づいた相互作用構造の体系的な設計が、再現可能で一般化可能なエージェントインテリジェンスの主な触媒として機能します。

原文 (English)

What Makes Interaction Trajectories Effective for Training Terminal Agents?

Stronger code agents are commonly assumed to be superior teachers for post-training, yet this assumption remains poorly disentangled from task difficulty, harness design, and student capacity. We investigate this pedagogical link using Terminal-Lego, a scalable pipeline that transforms multi-domain real-world issues into environment-verified agentic tasks. Surprisingly, standalone performance does not dictate teaching efficacy: while Claude Opus 4.6 achieves higher scores on Terminal-Bench 2.0, students fine-tuned on trajectories from DeepSeek-V3.2, a lower-scoring agent, exhibit significantly stronger generalization. We attribute this "pedagogical paradox" to Environment-Grounded Supervision (EGS): trajectories that explicitly expose inspect-act-verify behaviors through harness-visible interactions allow students to internalize robust problem-solving routines rather than fragile action sequences. Scaling analysis reveals exceptional data efficiency: with only 15.3k Terminal-Lego trajectories, for example, Qwen3-32B achieves a 24.3% score on Terminal-Bench 2.0, rivaling previous SOTA performance established with over 30x the data volume. Our results suggest that the frontier of agent post-training lies beyond mere outcome-matching, shifting the focus toward "Harness Engineering", where the systematic design of environment-grounded interaction structures serves as the primary catalyst for reproducible and generalizable agentic intelligence.

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

DMF: 会話型 AI エージェントのための決定論的メモリフレームワーク

会話型 AI エージェントには、スケーラブルで、かつ長い対話期間にわたって意味的に一貫したメモリシステムが必要です。既存のアプローチは主に、書き込み時の大規模言語モデル (LLM) ベースの要約に依存しており、非決定性、トークンコストの増大、枝刈りの決定の不透明さを招きます。決定論的メモリフレームワーク (DMF) を紹介します。これは、生成的なメモリ圧縮を、古典的な NLP 分析、ベクトル幾何、数学的スコアリングに基づく完全に決定論的なパイプラインに置き換える CPU ファーストのアプローチです。DMF は、各会話インタラクションに生存スコア $\Omega$ を割り当てます。このスコアは、決定論的なコンテンツシグナル、会話上の手がかり、構造化された来歴をロジスティック射影で組み合わせて計算されます。$\Omega_{\mathrm{eff}}(\Delta n)$ と表されるインタラクション数に基づく減衰則が、新しいターンの到来に伴う関連性の変化を制御します。ここで $\Delta n$ は実時間ではなく、より新しいインタラクションの数であり、完全な決定論が保たれます。DMF の数学的定式化、構造化されたリコールパイプライン、枝刈りの決定手順、および評価プロトコルを示します。実験は、LoCoMo および LongMemEval データセットを用いて専用に構築したベンチマークで行います。DMF を、AI エージェント向けの一般的なメモリ層である Mem0 と比較します。DMF は同等の精度を達成しつつ、メモリコンテキストの準備に使用するトークンはゼロで、会話全体では 5 倍から 242 倍少ないトークンしか使用しません。これらの結果は、メモリ管理ループから LLM 呼び出しを排除し、トークンコストをほぼゼロに削減して、会話型 AI エージェントのための決定論的メモリシステムを実現できることを示しています。

原文 (English)

DMF: A Deterministic Memory Framework for Conversational AI Agents

Conversational AI agents require memory systems that are both scalable and semantically coherent across long interaction horizons. Existing approaches rely predominantly on large language model (LLM)-based summarisation at write time, which introduces non-determinism, escalating token costs, and opacity in pruning decisions. We present the Deterministic Memory Framework (DMF), a CPU-first approach that replaces generative memory compression with a fully deterministic pipeline grounded in classical NLP analysis, vector geometry, and mathematical scoring. DMF assigns each conversational interaction a Survival Score $\Omega$ computed from deterministic content signals, conversational cues, and structured provenance, combined through a logistic projection. An interaction-count decay law, denoted as $\Omega_{\mathrm{eff}}(\Delta n)$, governs how relevance evolves as new turns arrive, where $\Delta n$ is the number of newer interactions rather than wall-clock time, preserving full determinism. We present the mathematical formulation of DMF, its structured recall pipeline, the pruning decision procedure, and the evaluation protocol. Experiments are conducted on a purpose-built benchmark using the LoCoMo and LongMemEval datasets. We compare DMF against Mem0, a popular memory layer for AI agents. DMF achieves comparable accuracy while using zero tokens to prepare the memory context and 5x to 242x fewer tokens over the entire conversation. These results show that it is possible to eliminate LLM calls from the memory-management loop, reducing token costs to nearly zero and enabling deterministic memory systems for conversational AI agents.
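
A survival score built from a logistic projection, with decay driven by interaction count rather than wall-clock time, might be sketched as follows. The signal names, the weights, the exponential form of the decay, and the pruning threshold are all assumptions for illustration; the abstract names $\Omega$ and $\Omega_{\mathrm{eff}}(\Delta n)$ but does not give their formulas.

```python
import math


def survival_score(signals, weights, bias=0.0):
    """Logistic projection of deterministic signals into (0, 1): an assumed form of Omega."""
    z = bias + sum(weights[name] * value for name, value in signals.items())
    return 1.0 / (1.0 + math.exp(-z))


def effective_score(omega, delta_n, decay_rate=0.05):
    """Assumed exponential decay in delta_n, the number of newer interactions."""
    return omega * math.exp(-decay_rate * delta_n)


def should_prune(omega, delta_n, threshold=0.2):
    """Prune once the decayed score falls below a hypothetical threshold."""
    return effective_score(omega, delta_n) < threshold


# Hypothetical signal values and weights for one stored interaction.
signals = {"content": 0.8, "cue": 1.0, "provenance": 0.5}
weights = {"content": 2.0, "cue": 1.0, "provenance": 0.5}
omega = survival_score(signals, weights)
prune_soon = should_prune(omega, delta_n=1)
prune_later = should_prune(omega, delta_n=40)
```

Since nothing here depends on a clock, a random seed, or a model call, replaying the same conversation yields the same pruning decisions, which is the sense of determinism the abstract emphasizes.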

2026-06-03 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

StepFinder: マルチエージェントシステムにおける障害の原因を特定するための時間的セマンティックフレームワーク

LLM ベースのマルチエージェントシステムは、複雑な複数ステップのタスクにおいて優れた協調能力を発揮します。ただし、これらのシステムは単一ステップの実行エラーに非常に敏感であり、そうしたエラーはエージェント間のやり取りを通じて伝播し、連鎖的な障害につながる可能性があります。障害の原因を理解してシステムの信頼性を向上させるために、障害を引き起こした根本原因ステップを自動的に特定するタスクとして、障害帰属 (failure attribution) が導入されました。既存の障害帰属手法は、主に LLM に元の実行軌跡を推論させることに依存しています。これは高い推論コストと遅延を招くだけでなく、冗長でノイズの多い実行ログによる干渉も受けるため、LLM は真の根本原因ステップを正確に特定することに苦労します。これに対処するために、軽量な障害帰属フレームワークである StepFinder を提案します。LLM は特徴量構築フェーズでのみ使用し、実行ログを時間的セマンティックシーケンスにエンコードします。続いて、時間モデリングとアテンションモジュールをパラメータ効率よく組み合わせて適用し、軌跡の逐次的な変化とステップ間の依存関係を捉えます。最後に、ステップレベルのエラースコアをマルチスケールの差分と位置バイアスによって精緻化し、正確な根本原因の特定を可能にします。Who&When ベンチマークでの実験結果は、StepFinder がステップレベルの障害帰属において LLM ベースの手法を上回り、かつ大幅に高い推論効率を達成することを示しています。最速の LLM ベースの手法と比較して推論時間を 79% 削減し、テキスト生成のオーバーヘッドもありません。コードは https://github.com/taiyu-zhu/StepFinder で入手できます。

原文 (English)

StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems

LLM-based multi-agent systems exhibit remarkable collaborative capabilities in complex multi-step tasks. However, these systems are highly sensitive to single-step execution errors that can propagate through agent interactions and lead to cascading failures. To understand the causes of failure and improve system reliability, failure attribution has been introduced as a task that aims to automatically identify the root cause step responsible for a failure. Existing failure attribution methods mainly rely on LLMs to reason over original execution trajectories, which not only incur high inference costs and latency, but also suffer from interference caused by redundant and noisy execution logs, causing LLMs to struggle in accurately identifying the true root cause step. To address this, we propose StepFinder, a lightweight failure attribution framework. We use LLMs solely during the feature construction phase to encode execution logs into temporal semantic sequences. Subsequently, a parameter-efficient combination of temporal modeling and attention modules is applied to capture the sequential evolution and cross-step dependencies of the trajectories. Finally, the step-level error score is refined through multi-scale differences and position bias, enabling precise root cause identification. Experimental results on the Who&When benchmark demonstrate that StepFinder outperforms LLM-based methods in step-level failure attribution while achieving substantially higher inference efficiency, reducing inference time by 79% compared with the fastest LLM-based method, with no text generation overhead. Our code is available at https://github.com/taiyu-zhu/StepFinder.

2026-06-03 13:00 JST | arXiv cs.AI | エージェント

オーバーレイガバナンス: Agentic AI における委任とスコープのための構成的認可フレームワーク

AI システムが受動的なモデルから、アクションの開始、協調、タスクの委任が可能な自律的で能動的なエージェントへと進化するにつれて、ソフトウェアシステムの従来の境界はあいまいになります。固定されたプリンシパル、明示的なリクエスト、静的なスコープを中心に構築された従来の認可および委任フレームワークは、エージェントシステムを統制するには不十分です。エージェント型 AI は、より豊かな認可セマンティクスを必要とします。エージェントは、権限を継承および委任し、期限付きの権限の下で動作し、共有プロトコルを通じて協調する必要があります。既存の ID およびアクセス管理 (IAM) システムは、このエージェンシー (主体性) の概念を十分に捉えることができず、実行可能なガバナンスプリミティブとしての再帰的な委任、コンテキスト境界、動的スコープのメカニズムを欠いています。OAuth 2.0 などのアクセス委任標準とは異なり、本研究では委任を単なる静的なトークンベースの同意資格情報ではなく、契約条項として扱います。本稿では、エージェント型 AI に不可欠なプリミティブを導入する構成的ガバナンスフレームワークを提案します。委任の種類と、それが権限および説明責任に与える影響を定義し、エージェントのアクセスエンベロープを制限するためのリソーススコープ減衰の概念を導入します。これらの概念は、既存の認可ドメイン (金融システムなど) に合成できる一般的な関係的定義として表現されます。この合成を運用可能にするために、再帰的な委任チェーンなどの新しいエージェントセマンティクスを、既存の関係的ポリシーを書き換えることなくオーバーレイする合成演算子を定義します。形式的な証明と実証的評価を通じてこのフレームワークを裏付け、エージェント型 AI システムにおける説明責任を伴う認可のための、形式的かつ実用的な基盤を提供することを示します。

原文 (English)

Overlaying Governance: A Compositional Authorization Framework for Delegation and Scope in Agentic AI

As AI systems evolve from passive models into autonomous active agents capable of initiating actions, collaborating, and delegating tasks, the traditional boundaries of software systems blur. Traditional authorization and delegation frameworks, built around fixed principals, explicit requests, and static scopes, are insufficient to govern agentic systems. Agentic AI demands richer authorization semantics: agents must inherit and delegate permissions, act under time-limited authority, and coordinate through shared protocols. Existing Identity and Access Management (IAM) systems fail to fully capture this notion of agency, lacking mechanisms for recursive delegation, contextual boundaries, and dynamic scoping as executable governance primitives. Unlike access delegation standards such as OAuth 2.0, we treat delegation as a contractual term rather than merely a static token-based consent credential. This paper proposes a compositional governance framework that introduces primitives indispensable for agentic AI. We define types of delegation and their permissions and accountability implications, and we introduce a notion of resource scope attenuation to bound agentic access envelopes. These concepts are expressed as general relational definitions that can be composed into existing authorization domains (e.g., financial systems). To operationalize this composition, we define a compositional operator that overlays new agentic semantics, such as recursive delegation chains, onto existing relational policies without rewriting them. We substantiate this framework through formal proofs and empirical evaluation, showing that it provides a formal yet practical foundation for accountable authorization in agentic AI systems.
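
The abstract does not define scope attenuation formally. One common reading, borrowed from capability-style systems, is that a delegate can never hold more scope or a later expiry than its delegator. The sketch below illustrates that reading only; the `Grant` type, the `delegate` function, and the integer logical clock are hypothetical and are not the paper's relational definitions or its compositional operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    holder: str
    scopes: frozenset
    expires_at: int  # logical timestamp (assumption: integer clock)


def delegate(parent: Grant, to: str, scopes, expires_at: int) -> Grant:
    """Create a child grant that is never broader or longer-lived than its parent."""
    requested = frozenset(scopes)
    if not requested <= parent.scopes:
        extra = sorted(requested - parent.scopes)
        raise PermissionError(f"{parent.holder} cannot delegate scopes it lacks: {extra}")
    # Time-limited authority: the child expires no later than the parent.
    return Grant(to, requested, min(expires_at, parent.expires_at))


root = Grant("treasury-agent", frozenset({"ledger:read", "payment:create"}), 100)
child = delegate(root, "report-agent", {"ledger:read"}, 500)
```

Applying `delegate` repeatedly gives a recursive chain in which every link is bounded by the one before it, so the access envelope can only shrink along the chain.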

2026-06-03 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント | ビジネス/資金調達

SAGE: エージェント生態系における社会化進化の定量的評価

自己改善型言語エージェントは通常、単独で評価されます。エージェントはタスクを試み、フィードバックを受け取り、自身の動作を反復的に改善します。しかし、エージェントは、戦略と結果が公開されているピア (他のエージェント) と並んで動作することが増えています。このことから、十分に研究されていない疑問が生じます。共有された経験は、どのような場合に、自己改善だけでは達成できない改善をもたらすのでしょうか。本稿では、計算量を揃えた 2 つの条件を比較する評価フレームワークである SAGE (Social Agent Group Evolution) を紹介します。SocialEvo では、5 つの異なるモデルファミリのエージェントが、すべてのピアの履歴にアクセスしながら共進化します。SelfEvo では、各エージェントは同じ回数のタスク試行を与えられますが、自分自身の過去しか参照できません。後者は自己改善エージェントの研究で一般的な設定です。オープンエンドの ML 研究、長期的な経済計画、戦略的なマルチプレイヤーゲームという 3 つのアリーナで SAGE をインスタンス化し、複数の進化ラウンドにわたって評価します。その結果、グループの履歴は万能の増幅器ではないことがわかりました。最も強力なエージェントは、自己進化の上限を超えません。一方で、自己改善では頭打ちになるエージェントは、ピアの経験を利用できる場合に大きなブレークスルーを達成できます。競争的な設定では、反事実的な対照実験により、エージェントは対戦相手に特化した戦略を発達させるのではなく、全般的に向上することが明らかになります。共有履歴のさまざまな形式を比較すると、フィルタリングされたピアトレースや内省的な要約は生のログを上回ることが多く、社会的な利得は露出量ではなく抽象化に依存することが示されます。これらの発見は、ピア履歴による利得がエージェント固有かつアリーナ依存であり、公開されたトレースから転移可能な知識を抽象化する能力に左右されることを明らかにしています。

原文 (English)

SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems

Self-improving language agents are typically evaluated in isolation: an agent attempts a task, receives feedback, and iteratively refines its own behavior. Yet agents increasingly operate alongside peers whose strategies and outcomes are publicly visible. This raises an under-studied question: when does shared experience produce improvements that self-improvement alone cannot achieve? We introduce SAGE (Social Agent Group Evolution), an evaluation framework that compares two compute-matched conditions: SocialEvo, where agents from five distinct model families co-evolve with access to all peers' histories; and SelfEvo, where each agent receives the same number of task attempts but sees only its own past, which is conventional in self-improving agent studies. We instantiate SAGE in three arenas: open-ended ML research, long-horizon economic planning, and strategic multiplayer play, evaluated across multiple evolutionary rounds. We find that group history is not a universal amplifier: the strongest agent does not exceed its self-evolution ceiling. However, agents that plateau under self-improvement can achieve significant breakthroughs when peer experience is available. In competitive settings, counterfactual controls reveal that agents improve generally rather than developing opponent-specific strategies. Across different forms of shared history, filtered peer traces and reflective summaries often outperform raw logs, indicating that social gains depend on abstraction rather than exposure volume. These findings reveal that peer-history gains are agent-specific, arena-dependent, and contingent on the capacity to abstract transferable knowledge from public traces.

2026-06-03 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

プロンプトからサービスまで: AI 主導の仮想世界向けの SLM ベースのエージェントオーケストレーションゲートウェイ

生成 AI の機能が拡大するにつれて、AI 主導の仮想世界はアーキテクチャ上の課題の増大に直面しています。ユーザーはインワールドインターフェイスを通じてマルチモーダルに対話しますが、そのリクエストは根本的に異なる AI バックエンドモデルと計算リソースを必要とします。これらの機能を仮想世界システムに直接組み込むと、拡張性が低下し、保守が複雑になり、エッジおよびクラウドインフラストラクチャに分散したサービスを調整する能力が制限されます。本稿では、SLM ベースのエージェントオーケストレーションゲートウェイを紹介します。これは、インテント駆動のサービスルーティングを通じて、仮想世界のクライアントを異種の AI バックエンドから切り離す軽量なランタイム調整メカニズムです。エッジに展開された SLM が各ユーザープロンプトの意味的なインテントを分類し、構成可能なサービスレジストリがルーティングの決定を検証して解決し、選択されたバックエンドが透過的に呼び出されます。これにより、クライアントアプリケーションを変更することなく、新しい AI 機能を仮想世界に導入できます。ゲートウェイは、InterwovenXR 仮想博物館テストベッド内で実装および評価されています。評価の結果、コンパクトな SLM がエッジハードウェア上で信頼性の高いインテントルーターとして機能すること、およびタスク固有のファインチューニングによってパラメータ数 10 億未満のモデルを実用的な低遅延ルーターに変えられることが示されました。ルーターとしてファインチューニングしたパラメータ数 10 億未満のモデルと、会話応答生成用のより大きな SLM を組み合わせた階層構成は、ミッドレンジのエッジハードウェアに展開可能であり、両方の役割を単一のモデルに任せるよりも効率的であることが示されています。これらの結果は、SLM が仮想世界における実用的な AI サービスオーケストレーションを支えられることを示しています。本研究は、スケーラブルで拡張可能な、エッジに支えられた AI インタラクションのための評価済みアーキテクチャを提供し、仮想エージェントが分散型の生成 AI サービスへのアクセスポイントとなることを可能にします。

原文 (English)

From Prompt to Service: An SLM-Based Agent Orchestration Gateway for AI-Driven Virtual Worlds

As generative AI capabilities expand, AI-driven virtual worlds face a growing architectural challenge. Users interact through in-world interfaces in multimodal ways, yet their requests demand fundamentally different AI backend models and computational resources. Embedding these capabilities directly into virtual world systems reduces extensibility, complicates maintenance, and limits the ability to coordinate services distributed across edge and cloud infrastructure. This paper presents an SLM-based Agent Orchestration Gateway, a lightweight runtime coordination mechanism that decouples a virtual world client from heterogeneous AI backends through intent-driven service routing. An edge-deployed SLM classifies the semantic intent of each user prompt, a configurable service registry validates and resolves the routing decision, and the selected backend is invoked transparently, enabling new AI capabilities to be introduced in the virtual world without modifying the client application. The gateway is implemented and evaluated within the InterwovenXR virtual museum testbed. The evaluation shows that compact SLMs can serve as reliable intent routers on edge hardware, and that task-specific fine-tuning can transform sub-billion-parameter models into practical, low-latency routers. A layered configuration pairing a fine-tuned sub-billion-parameter model as router with a larger SLM for conversational response generation is shown to be deployable on mid-range edge hardware and more efficient than delegating both responsibilities to a single model. The findings show that SLMs can support practical AI service orchestration in virtual worlds, and the work contributes an evaluated architecture for scalable, extensible, and edge-supported AI interaction, enabling virtual agents to become access points to distributed generative AI services.
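
The classify, validate, invoke flow described above can be sketched as follows. This is an illustration under assumptions, not the gateway's actual interface: `keyword_classifier` is a trivial stand-in for the edge-deployed SLM, and the intent labels, the `REGISTRY` contents, and the `"chat"` fallback are hypothetical.

```python
from typing import Callable, Dict, Tuple

Backend = Callable[[str], str]


def keyword_classifier(prompt: str) -> str:
    """Stand-in for the SLM intent classifier (assumption: returns one label)."""
    text = prompt.lower()
    if "image" in text or "draw" in text:
        return "image_generation"
    if "translate" in text:
        return "translation"
    return "unknown_intent"


def route(prompt: str, classify: Callable[[str], str],
          registry: Dict[str, Backend], fallback: str = "chat") -> Tuple[str, str]:
    """Classify the prompt, validate the label against the registry, invoke the backend."""
    intent = classify(prompt)
    if intent not in registry:  # registry validation: unregistered labels are not trusted
        intent = fallback
    return intent, registry[intent](prompt)


# Hypothetical registry; adding a backend here requires no client change.
REGISTRY: Dict[str, Backend] = {
    "chat": lambda p: f"[chat] {p}",
    "image_generation": lambda p: f"[image] {p}",
}
```

Keeping the registry as the single source of truth means a classifier label with no registered backend degrades to the fallback instead of failing, and new capabilities are added by registering a backend rather than by touching the client.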

2026-06-03 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

言語を超えたトークンアービトラージ: ローカル LLM 前処理によるコードエージェントコンテキストウィンドウの最適化

AI 支援コーディングエージェントは、入力トークンのコストがボトルネックとなります。人間による生の入力の 2 つの病理が、このオーバーヘッドの多くを引き起こします。それは、英語以外のテキストのトークン化の非効率性と、会話プロンプトの構造的エントロピーです。既存のアプローチは、すでに肥大化したコンテキストを圧縮したり、障害が発生した後に介入したりすることで、事後的に動作します。開発者とクラウドエージェントの間で動作する、プリフライトのエッジ側プロンプト書き換えミドルウェアを導入します。ローカルの Llama 3.2 (3B) モデルは、英語への言語間翻訳、コンパクトなタスク指向形式への構造の書き換え、および正規表現で検証されたフォールバック付き書き換えセーフガードを実行して、最適化されたプロンプトが元のプロンプトより大きくならないようにします。私たちは、トルコ語、アラビア語、中国語、およびコードスイッチ仕様にまたがる多言語コーディングベンチマークである OMH-Polyglot で評価します。このミドルウェアは、3 つの商用 LLM バックエンド全体で、タスクの精度を維持または向上させながら、プロンプトトークンを 34 ～ 47 パーセント削減し、合計トークンを最大 18.8 パーセント削減します。アブレーション研究では、単純な関数名の抽出ではなく、主に書き換え段階から利益が生じることが示されています。同等の圧縮率での LLMLingua-2 と比較して、私たちの方法は、評価されたすべてのバックエンドにわたって一貫して優れた OckScore パフォーマンスを達成します。これらの結果は、プロアクティブなプロンプト最適化により、コーディング品質を犠牲にすることなく推論コストを大幅に削減できることを示しています。

原文 (English)

Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing

AI-assisted coding agents are bottlenecked by input-token cost. Two pathologies of raw human input drive much of this overhead: tokenization inefficiency for non-English text and structural entropy in conversational prompts. Existing approaches act reactively by compressing already-bloated contexts or intervening after failures occur. We introduce a pre-flight, edge-side prompt-rewriting middleware that operates between the developer and the cloud agent. A local Llama 3.2 (3B) model performs cross-lingual translation into English, structural rewriting into a compact task-oriented format, and regex-validated rewrite-with-fallback safeguards to ensure the optimized prompt is never larger than the original. We evaluate on OMH-Polyglot, a multilingual coding benchmark spanning Turkish, Arabic, Chinese, and code-switched specifications. Across three commercial LLM backends, the middleware reduces prompt tokens by 34-47 percent and total tokens by up to 18.8 percent while preserving or improving task accuracy. Ablation studies show that gains arise primarily from the rewriting stage rather than simple function-name extraction. Compared with LLMLingua-2 at matched compression rates, our method consistently achieves superior OckScore performance across all evaluated backends. These results demonstrate that proactive prompt optimization can substantially reduce inference costs without sacrificing coding quality.
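
The "rewrite-with-fallback" safeguard can be sketched as a guard around any rewriting function. This is a minimal sketch under assumptions: `count_tokens` is a crude regex tokenizer standing in for the backend's real tokenizer, `required_pattern` is a hypothetical validation regex (the abstract does not say what the regex checks), and the `rewrite` callable stands in for the local Llama 3.2 model.

```python
import re
from typing import Callable


def count_tokens(text: str) -> int:
    """Crude stand-in tokenizer: word runs plus single punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def rewrite_with_fallback(original: str, rewrite: Callable[[str], str],
                          required_pattern: str = r"\S") -> str:
    """Return the rewritten prompt only if it validates and is not larger."""
    try:
        candidate = rewrite(original)
    except Exception:
        return original  # rewriter failed: fall back to the raw prompt
    if not re.search(required_pattern, candidate):
        return original  # regex validation failed (e.g. empty output)
    if count_tokens(candidate) > count_tokens(original):
        return original  # never ship a prompt larger than the original
    return candidate
```

Because every failure path returns the untouched input, the middleware can only shrink or preserve a prompt, which matches the stated guarantee that the optimized prompt is never larger than the original.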

2026-06-03 13:00 JST | arXiv cs.AI | エージェント

TSQAgent: 専用のエージェント推論による時系列データ品質の評価

時系列 (TS) データの品質の評価は基本的なものですが、品質の側面には多面的な性質があるため、本質的に困難です。最近、大規模言語モデル (LLM) が、ペアごとの比較と次元ごとの評価による TS 品質評価の有望なパラダイムとして浮上しています。ただし、既存のアプローチは、手動で事前定義された品質次元と純粋にテキストベースの推論に依存しているため、LLM が本当に関連する品質次元を特定できるか、根拠のある定量的な品質比較を実行できるかどうかは不明のままです。これを調査するために、TSQBench を構築します。TSQBench は、(i) 関連する品質次元の理解と特定、(ii) 特定の次元での品質比較の実行という 2 つの進歩的な機能に基づいて LLM を評価するための専用ベンチマークです。私たちの分析により、現在の LLM は次元の特定と証拠に基づく品質比較の両方に常に苦労していることが明らかになりました。これらの制限に対処するために、我々は TSQAgent を提案します。TSQAgent は TS 品質評価のための新しいエージェント推論フレームワークであり、3 つの協調的な役割で構成されます。焦点を絞った次元選択を行うパーシーバー、次元ごとの定量分析を行うインスペクター、そして最終的な判断を集約して洗練する審査員です。特に、最も関連性の高い品質側面を特定して優先順位を付ける能力を浸透させるエージェント推論戦略を導入し、さらに、選択した側面にわたる正確な定量的比較を可能にする外部分析ツールを備えたエージェントワークフローを提案します。提案されたベンチマークと 11 個の実世界のデータセットの両方での実験により、私たちのフレームワークが LLM の品質理解と定量的比較の能力を大幅に向上させるだけでなく、これらの向上をより品質を意識したデータ選択に効果的に変換し、ダウンストリームのパフォーマンスとデータ効率の向上につながることが実証されました。

原文 (English)

TSQAgent: Rating Time Series Data Quality via Dedicated Agentic Reasoning

Assessing the quality of time series (TS) data is fundamental yet inherently challenging due to the multifaceted nature of quality dimensions. Recently, large language models (LLMs) have emerged as a promising paradigm for TS quality assessment via pairwise comparison and per-dimension evaluation. However, existing approaches rely on manually predefined quality dimensions and purely text-based reasoning, leaving it unknown whether LLMs can identify truly relevant quality dimensions or perform grounded and quantitative quality comparisons. To investigate this, we construct TSQBench, a dedicated benchmark for evaluating LLMs on two progressive capabilities: (i) understanding and identifying relevant quality dimensions, and (ii) performing quality comparison under specific dimensions. Our analysis reveals that current LLMs consistently struggle with both dimension identification and evidence-grounded quality comparison. To address these limitations, we propose TSQAgent, a novel agentic reasoning framework for TS quality rating consisting of three collaborative roles: Perceiver for focused dimension selection, Inspector for dimension-wise quantitative analysis, and Adjudicator that aggregates and refines the final judgment. In particular, we introduce an agentic reasoning strategy that instills the ability to identify and prioritize the most relevant quality dimensions, and further propose an agent workflow equipped with external analytical tools to enable precise quantitative comparisons over selected dimensions. Experiments on both the proposed benchmark and eleven real-world datasets demonstrate that our framework not only substantially improves LLMs' capabilities in quality understanding and quantitative comparison but also effectively translates these improvements into better quality-aware data selection, leading to enhanced downstream performance and data efficiency.

2026-06-03 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント | ビジネス/資金調達 | 研究/論文

LLM ツール使用における知識ギャップの診断: 新しい API 取得のためのエージェントベンチマーク

コード生成のための大規模言語モデルは、事前学習データに含まれていない API を使用しなければならないことがよくあります。これには、関数名を思い出すだけでは不十分です。モデルは、シグネチャ、モジュールパス、入出力コントラクト、セマンティクス、および実行可能な使用パターンを整合させる必要があります。既存の新規 API ベンチマークは、静的であるか、大まかな合否メトリクスに依存しているか、実際のライブラリの進化を反映していない可能性のある合成 API を使用しているかのいずれかであることが一般的です。本稿では NovelAPIBench を紹介します。これは、任意のベースモデルとターゲットライブラリに対して、新規 API を検出し、分解された知識バンドルを抽出し、実行可能なコーディングタスクを生成し、失敗したサンプルを 6 つの診断カテゴリに割り当てる、完全に自動化された動的ベンチマークです。約 1.9K のタスク、4 つのベースモデル、5 つのドメインにわたって、検索 (retrieval) を通じて注入された知識と、パラメトリック適応を通じて内在化された知識を比較します。その結果、知識コンポーネントは互いに代替可能ではないことがわかりました。使用例は単独では最も強力なシグナルですが、2 コンポーネントの最良の設定は、ドメインとバックボーンに応じて、シグネチャをメカニズムまたは使用例のいずれかと組み合わせたものです。コンテキスト、特にソースコードを追加すると、インポートパスのエラーが増えて逆効果になることがあります。また、外部知識を取り除くと、パラメトリック適応は検索の代わりにはなりません。むしろ、ファインチューニングは主に、提供されたバンドルの使い方をモデルに教えるものであり、この能力はホールドアウトされたライブラリにも転移します。これらの結果は、検索とチューニングが相補的な役割を果たすことを示唆しています。検索は変化しやすい API の内容を供給し、チューニングは手続き的な統合を改善します。

原文 (English)

Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition

Large language models for code generation often need to use APIs that are absent from their pretraining data. This requires more than recalling a function name: models must coordinate signatures, module paths, input-output contracts, semantics, and executable usage patterns. Existing novel-API benchmarks are typically static, rely on coarse pass/fail metrics, or use synthetic APIs that may not reflect real library evolution. We introduce NovelAPIBench, a fully automated dynamic benchmark that, for any base model and target library, discovers novel APIs, extracts decomposed knowledge bundles, generates executable coding tasks, and assigns failed samples to six diagnostic categories. Across about 1.9K tasks, four base models, and five domains, we compare knowledge injected through retrieval with knowledge internalized through parametric adaptation. We find that knowledge components are not interchangeable: usage examples are the strongest standalone signal, while the best two-component setting pairs signatures with either mechanisms or examples depending on the domain and backbone. Adding more context, especially source code, can hurt by increasing import-path errors. Parametric adaptation also does not replace retrieval once external knowledge is removed; rather, fine-tuning mainly teaches models how to use provided bundles, and this ability transfers to held-out libraries. These results suggest that retrieval and tuning play complementary roles: retrieval supplies volatile API content, while tuning improves procedural integration.

2026-06-03 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

EvoDrive: 自己改善型 LLM エージェントによるセーフティクリティカルな自動運転のパレート進化

セーフティクリティカルなシナリオの生成は、自動運転システムの検証と改善に不可欠ですが、本質的に、現実性を保ちながら、障害を顕在化させるために敵対性を最大化することが求められます。既存の手法は通常、手作りのヒューリスティックによってこのトレードオフを管理しており、生成が既知の事前知識に限定され、十分に探索されていないパターンを見落とします。最近のオープンエンドなエージェント進化はこの限界を押し広げる可能性がありますが、制約のない汎用エージェントはシミュレータによる厳密なグラウンディングを欠き、多目的間の緊張関係を単一スカラーの最大化に縮退させる傾向があります。本稿では、多目的シナリオ生成のための初の自動化された LLM ベースのエージェント進化フレームワークである EvoDrive を紹介します。EvoDrive は、シミュレータに基づくアクター・クリティックアーキテクチャを採用しています。メモリ駆動のアクターがジェネレーターの改善を反復的に提案し、クリティックがありそうにない候補を除外し、自己進化するワールドエバリュエーターが有望な提案を振り分けてシミュレーション予算を最適化します。さらに EvoDrive は、評価済み候補のパレートアーカイブを維持することで、攻撃性と現実性の多様なトレードオフを保存し、シミュレーションフィードバックを通じて将来の進化を導きます。MetaDrive と CARLA でのベンチマーク結果は、EvoDrive がさまざまなジェネレーターにわたってパレートフロンティアを大幅に拡大するだけでなく、ポリシー学習に有用なシナリオも生成することを示しています。

原文 (English)

EvoDrive: Pareto Evolution for Safety-Critical Autonomous Driving via Self-Improving LLM Agents

Generating safety-critical scenarios is essential for validating and improving autonomous driving systems, yet it inherently requires maximizing adversariality to expose failures while preserving realism. Existing methods usually manage this trade-off with handcrafted heuristics, confining generation to known priors and overlooking underexplored patterns. While recent open-ended agentic evolution can push this limit, unconstrained general agents lack strict simulator grounding and tend to collapse the multi-objective tension into single-scalar maximization. Here we present EvoDrive, the first automated, LLM-based agentic evolution framework for multi-objective scenario generation. EvoDrive employs a simulator-grounded actor-critic architecture where a memory-driven actor iteratively proposes improvements to the generators and critics filter out implausible candidates, and a self-evolving world evaluator routes promising proposals to optimize simulation budgets. EvoDrive further maintains a Pareto archive of evaluated candidates to preserve diverse attack-realism trade-offs and guide future evolution via simulation feedback. Benchmark results on MetaDrive and CARLA show that EvoDrive not only significantly expands the Pareto frontier across various generators, but also produces valuable scenarios for policy training.
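
A Pareto archive keeps every evaluated candidate that no other candidate beats on all objectives at once. The sketch below shows the standard dominance check and archive update for two maximised objectives; representing a candidate as an `(adversariality, realism)` tuple is an assumption for illustration, and EvoDrive's actual candidate representation and scoring are not given in the abstract.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def insert(archive, candidate):
    """Return a new archive with the candidate added if it is non-dominated."""
    if any(dominates(kept, candidate) or kept == candidate for kept in archive):
        return archive  # candidate adds nothing to the frontier
    # Evict entries the newcomer dominates, then keep the newcomer.
    return [kept for kept in archive if not dominates(candidate, kept)] + [candidate]


archive = []
for scores in [(0.9, 0.2), (0.3, 0.8), (0.2, 0.1), (0.95, 0.3)]:
    archive = insert(archive, scores)
```

Keeping the whole frontier instead of a single best score is what prevents the collapse into single-scalar maximisation that the abstract attributes to unconstrained agents: a highly realistic but mildly adversarial scenario survives alongside an aggressive, less realistic one.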

2026-06-03 13:00 JST | arXiv cs.AI | エージェント

DeepSpeak-Agentic データセット

私たちは、人間と身体化された AI エージェントの間の 37 時間以上の半構造化された会話で構成されるビデオのデータセットである DeepSpeak-Agentic を紹介します。私たちはこのデータセットを使用して、AI エージェントの自動フォレンジック識別 (音声、ビデオ、またはテキスト) を評価し、人間とエージェントの相互作用の性質を研究し、具現化された AI エージェントを強化する大規模言語モデルと AI によって生成された音声と顔の将来の進歩のためのベンチマークを提供します。また、エージェントを作成し、人間のクラウドワーカーと自動的にペアリングし、指定されたシナリオ全体で視聴覚会話を記録し、結合されたストリーム内で人間とエージェントを識別して分離する、スケーラブルなデータキャプチャシステムにも貢献します。

原文 (English)

The DeepSpeak-Agentic Dataset

We present DeepSpeak-Agentic, a dataset of videos comprising over 37 hours of semi-structured conversations between a human and an embodied AI agent. We use this dataset to evaluate the automatic forensic identification (audio, video, or text) of AI agents, study the nature of human-agent interactions, and provide a benchmark for future advances in the large language models and AI-generated voices and faces that power embodied AI agents. We also contribute a scalable data-capture system that creates agents, automatically pairs them with human crowd workers, records audiovisual conversations across specified scenarios, and identifies and separates the human and agent in the combined stream.

2026-06-03 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

SkillPyramid: 自己進化エージェントのための階層型スキル統合フレームワーク

最近の AI エージェントは、複雑なタスクを解決するためにスキルを柔軟に呼び出すことができますが、体系的なスキルの構築、蓄積、伝達が欠如しているため、その長期的な改善には根本的な制約があります。特に、スキルを統合するための統一されたフレームワークがないと、エージェントは異なるタスクにわたって同様の機能を重複して構築する傾向があり、経験を再利用可能な資産に効果的に変換できず、タスク固有のスキルを新しいシナリオに一般化するのに苦労します。この制限に対処するために、既存のスキル経験を再利用してより広範なタスクを一般化するスキル統合フレームワークである SkillPyramid を提案します。 SkillPyramid は、階層型スキルトポロジで動作し、エージェントがタスクの実行中に新しいスキルを作成、検証、組み込むことを可能にする自己進化メカニズムをさらに導入します。 4 つのバックボーンモデルにわたる ALFWorld、WebShop、および ScienceWorld の実験では、SkillPyramid が平均報酬を 38.0% 大幅に増加させ、実行ステップを 27.7% 削減することが示されました。全体として、私たちの方法は、スキルのコレクションを静的なリソースプールから動的な進化システムに変換します。

原文 (English)

SkillPyramid: A Hierarchical Skill Consolidation Framework for Self-Evolving Agents

Recent AI agents can flexibly invoke skills to solve complex tasks, but their long-term improvement is fundamentally constrained by a lack of systematic skill construction, accumulation, and transfer. In particular, without a unified framework for skill consolidation, agents tend to redundantly construct similar capabilities across different tasks, are unable to effectively transform experience into reusable assets, and struggle to generalize task-specific skills to novel scenarios. To address this limitation, we propose SkillPyramid, a skill consolidation framework that reuses existing skill experience for broader task generalization. Operating on a hierarchical skill topology, SkillPyramid further introduces a self-evolution mechanism that enables agents to compose, validate, and incorporate new skills during task execution. Experiments on ALFWorld, WebShop, and ScienceWorld across four backbone models show that SkillPyramid substantially increases the average reward by 38.0% and reduces execution steps by 27.7%. Overall, our method transforms a skill collection from a static resource pool into a dynamic evolution system.

2026-06-03 13:00 JST | arXiv cs.AI | エージェント

LAP: 自律科学のためのエージェントから機器へのプロトコル

自律科学は実証段階からインフラストラクチャへと移行しつつあります。現在では、大規模言語モデルエージェントが実験を計画し、自動運転ラボ (self-driving laboratory) がそれを実行します。しかし、そのようなシステムはいずれも、推論エージェントと物理的な機器との間のリンクをゼロから再構築しています。その際に相手にするのは、断片化されたベンダー SDK と、確率的で目標指向のエージェントではなく決定論的なソフトウェアクライアント向けに作られた標準です。最近のエージェント相互運用プロトコルは、エージェントエコシステムの 3 つのエッジのうち 2 つを明確にしています (Anthropic の Model Context Protocol (MCP) はエージェントからツールへのエッジを標準化し、Google の Agent2Agent (A2A) はエージェントからエージェントへのエッジを標準化しています)。しかし、どちらもエージェントから機器へのエッジをモデル化していません。このエッジでは、操作はステートフルで、セーフティクリティカルで、排他的に所有され、物理的に具現化されており、単位、校正、不確かさを伴う測定値を生成します。本稿では、このギャップを埋めるプロトコル設計である Lab Agent Protocol (LAP) を紹介します。LAP は、A2A のピアツーピア、ディスカバリーファースト、タスクライフサイクルの構造を維持しつつ、次の 4 つの物理世界プリミティブを追加します。(i) InstrumentCard: 署名付きの、機能と物理的限界の記述。(ii) 機器およびサンプルを排他的にロックするための第一級の予約機能。(iii) セーフティフェンスハンドシェイク: 特定のタスクとそのパラメータに暗号的に束縛されたオペレータ確認トークンを用い、危険な操作や不可逆な操作をゲートします。(iv) MeasurementResult スキーマ: すべての結果を、物理的に型付けされ (QUDT/UCUM)、校正に紐付けられ、不確かさを伴い、構成上再現可能なものにします。さらに、役割、6 層アーキテクチャ、JSON-RPC メソッドセット、タスクおよび安全性のステートマシン、エラーモデル、研究室間のフェデレーションを規定し、閉ループの自律キャンペーンをプロトコルに沿ってエンドツーエンドでたどります。LAP は A2A/MCP エコシステムとトランスポート互換であり、SiLA 2 や OPC-UA などの既存のデバイス標準を置き換えるのではなく、カプセル化します。

原文 (English)

LAP: An Agent-to-Instrument Protocol for Autonomous Science

Autonomous science is moving from demonstration to infrastructure. Large language model agents now plan experiments, and self-driving laboratories execute them. Yet every such system rebuilds the link between the reasoning agent and the physical instrument from scratch, against fragmented vendor SDKs and standards built for deterministic software clients rather than probabilistic, goal-directed agents. Recent agent-interoperability protocols clarify two of the three edges of an agentic ecosystem (Anthropic's Model Context Protocol (MCP) standardizes the agent-to-tool edge, and Google's Agent2Agent (A2A) the agent-to-agent edge), but neither models the agent-to-instrument edge, where operations are stateful, safety-critical, exclusively owned, physically embodied, and produce measurements with units, calibration, and uncertainty. We present the Lab Agent Protocol (LAP), a protocol design that fills this gap. LAP retains A2A's peer-to-peer, discovery-first, task-lifecycle structure and adds four physical-world primitives: (i) the InstrumentCard, a signed capability and physical-limit description; (ii) first-class reservation for exclusive instrument and sample locking; (iii) a safety-fence handshake with operator-confirmation tokens cryptographically bound to a specific task and its parameters, gating hazardous and irreversible operations; and (iv) a MeasurementResult schema that makes every result physically typed (QUDT/UCUM), calibration-anchored, uncertainty-bearing, and reproducible by construction. We specify roles, a six-layer architecture, the JSON-RPC method set, the task and safety state machines, the error model, and cross-laboratory federation, and walk a closed-loop autonomous campaign through the protocol end-to-end. LAP is transport-compatible with the A2A/MCP ecosystem and encapsulates rather than replaces existing device standards such as SiLA 2 and OPC-UA.
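
The abstract says operator-confirmation tokens are cryptographically bound to a specific task and its parameters, but it does not name the scheme. One plausible way to get that binding is an HMAC over a canonical serialisation of the task ID and parameters. The sketch below assumes HMAC-SHA256 and canonical JSON; the function names, field names, and key handling are hypothetical and are not LAP's wire format.

```python
import hashlib
import hmac
import json


def _canonical(task_id: str, params: dict) -> bytes:
    """Serialise deterministically so equal inputs always hash identically."""
    payload = {"task_id": task_id, "params": params}
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def issue_token(key: bytes, task_id: str, params: dict) -> str:
    """Operator side: confirm exactly this task with exactly these parameters."""
    return hmac.new(key, _canonical(task_id, params), hashlib.sha256).hexdigest()


def verify_token(key: bytes, task_id: str, params: dict, token: str) -> bool:
    """Instrument side: refuse if the task or any parameter differs from what was confirmed."""
    return hmac.compare_digest(issue_token(key, task_id, params), token)


key = b"demo-key-not-for-production"
approved = {"temperature_c": 450, "ramp_c_per_min": 10}
token = issue_token(key, "task-001", approved)
```

With this binding, an agent that obtains confirmation for a 450 °C run cannot reuse the same token for a 900 °C run or for a different task, which is the gating behaviour the safety-fence handshake is described as providing.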

2026-06-03 13:00 JST | arXiv cs.AI | エージェント

制御境界から保険金請求まで: CER フレームワークによる AI 媒介損失の再構築

被保険組織の生成 AI システムまたはエージェント型 AI システムを通じて発生する AI 損失には、単なるイベントの再構築ではなく、状態の再構築が必要です。システムが推論し、検索し、ツールを呼び出し、行動するにつれて、関連する状態が変化するためです。問うべきは、どのような損失が発生したかだけでなく、システムに何が許可されていたか、実際に何を行ったか、そして再構築された損失が保険金請求の回収を裏付けられるかどうかです。本稿は、被保険者の AI システムが因果連鎖に含まれる損失を扱います。これには、プロンプトインジェクション、検索拡張生成 (RAG) ポイズニング、悪意のあるツール出力、資格情報の悪用、データポイズニングなど、外部から引き起こされる障害も含まれます。具体的には、AI の残余リスク移転のためのユースケースレベルの診断手法である CER を導入します。C (制御境界) は、システムに強制可能な運用エンベロープがあったかどうかを問います。E (証拠の再構築) は、保持された成果物からシステムの状態と因果連鎖を再構築できるかどうかを問います。R (保険応答) は、再構築された損失が保険の対象となっているかどうか、つまり、保険が市場で利用可能で被保険者のために手配されており、保険金請求の回収を裏付けるために必要な証拠がそろっているかどうかを問います。本稿の貢献は 3 つあります。AI 固有の再構築問題を定義し、CER を通じてその問題を運用可能にし、AI 再構築のための請求グレードの証拠を規定します。公開されている事例としては、報告されている PocketOS および Replit のエージェントによるデータベース削除インシデントや、出力への依拠が裁定された事例である Moffatt v. Air Canada が挙げられます。キーワード: AI システム、CER フレームワーク、残余リスク移転、エージェント型 AI、生成 AI、AI 保険、証拠の再構築。

原文 (English)

From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework

AI losses that arise through an insured organization's generative or agentic AI system require state reconstruction, not merely event reconstruction, because the relevant state changes as the system reasons, retrieves, calls tools, and acts. The relevant question is not only what loss occurred, but what the system was allowed to do, what it actually did, and whether that reconstructed loss can support insurance claim recovery. This paper addresses losses in which the insured's AI system is in the causal chain, including externally triggered failures such as prompt injection, retrieval-augmented generation (RAG) poisoning, malicious tool output, credential misuse, and data poisoning. Specifically, this paper introduces CER, a use-case-level diagnostic for AI residual risk transfer. C (control boundary) asks whether the system had an enforceable operating envelope. E (evidence reconstruction) asks whether the system state and causal chain can be reconstructed from retained artifacts. R (insurance response) asks whether the reconstructed loss is insured: whether insurance coverage is available in the market and placed for the insured, together with the proof needed to support insurance claim recovery. The paper makes three contributions: it defines the AI-specific reconstruction problem, operationalizes that problem through CER, and specifies claim-grade evidence for AI reconstruction. Public examples include the reported PocketOS and Replit agentic database-deletion incidents and Moffatt v. Air Canada as an adjudicated output/reliance case. Keywords: AI systems; CER framework; residual risk transfer; agentic AI; generative AI; AI insurance; evidence reconstruction.

2026-06-03 13:00 JST | arXiv cs.AI | エージェント

エージェント対話によるハザード特定分析を通じた運用安全性の向上

産業用プロセス制御、自律システム、セーフティクリティカルシステムなど、重大な結果を伴う領域における運用安全性には、信頼性の高いハザード特定が必要です。大規模言語モデル (LLM) は安全分析タスクの自動化に有望であることが示されていますが、シングルターンのモノリシックな推論は脆弱です。安全エンジニアが反復的に適用する自己修正、熟慮、文脈に応じた精緻化が欠けているためです。本稿では、構造化されたエージェント対話 (マルチエージェントかつマルチターンのインタラクション) が、シングルパスのベースラインと比べて NLP ベースのハザード特定の質を向上させるかどうかを調査するフレームワークである HAZDIAL を紹介します。敵対的なディベートと建設的なディスカッションという 2 つの対話様式を体系的に比較し、アルゴリズムに基づくエージェント間インタラクションの最適化を提案します。標準的な分類指標 (正解率、適合率、再現率、F1) と新しい対話指標を用いて、キュレーションされたゴールデンデータセットに対してすべての構成を評価します。この研究は、対話システム、マルチエージェント推論、AI の安全性が交わる領域を前進させ、対話駆動型のハザード分析に関する実証的な証拠を提供します。

原文 (English)

Enhancing Operational Safety via Agentic Dialogue Hazard Identification Analysis

Operational safety in high-stakes domains, such as industrial process control, autonomous systems, and safety-critical systems, demands reliable hazard identification. While large language models (LLMs) have shown promise in automating safety analysis tasks, single-turn, monolithic inference is brittle: it lacks the self-correction, deliberation, and contextual refinement that safety engineers apply iteratively. In this paper, we introduce HAZDIAL, a framework that investigates whether structured agentic dialogue (multi-agent, multi-turn interaction) improves the quality of NLP-based hazard identification over single-pass baselines. We systematically compare two dialogue modalities, adversarial debate and constructive discussion, and propose an algorithm-based agentic interaction optimization. We evaluate all configurations against a curated golden dataset using standard classification metrics (accuracy, precision, recall, F1) and novel dialogue metrics. This work advances the intersection of dialogue systems, multi-agent reasoning, and AI safety, providing empirical evidence for dialogue-driven hazard analysis.

2026-06-03 13:00 JSTarXiv cs.AIエージェント研究/論文

BigFinanceBench: 金融調査エージェント向けのワークフローに基づいたベンチマーク

財務調査の回答は、その回答がどのように作成されたか、つまりどの情報源が選択されたか、どの期間と会計定義が使用されたか、どのような仮定が置かれたか、計算がどのように実行されたかを別のアナリストが監査できる場合にのみ、意思決定に役立ちます。既存の金融ベンチマークは主に、個別のサブスキルまたは最終的な回答を評価しており、監査可能な導出そのものは十分に測定されていません。BigFinanceBench は、オープンエンドの金融調査タスクからなる、専門家が作成した 928 項目のベンチマークです。各項目は、正解となる参照回答と、導出を独立して検証可能なステップに分解するポイント加重ルーブリックを組み合わせています。BigFinanceBench は、最終出力だけでなく導出全体を評価するという点でワークフローに基づいています。このベンチマークは、36,241 のルーブリックポイントにわたって、部分点による評価と、アナリストのワークフロー全体における失敗箇所の特定をサポートします。現在のフロンティアエージェントおよびオープンウェイトエージェント 10 種を評価したところ、大きな改善の余地があることがわかりました。最高のシステムでもルーブリックスコアは 58.8% にとどまり、最終回答の正解率は導出品質の代理指標として有用ではあるものの情報の損失があり、モデルの能力は金融ワークフロー間で不均一です。

原文 (English)

BigFinanceBench: A Workflow-Grounded Benchmark for Financial-Research Agents

Financial-research answers are decision-relevant only when another analyst can audit how they were produced: which source was chosen, which period and accounting definition were used, which assumptions were made, and how the calculation was performed. Existing finance benchmarks largely evaluate isolated subskills or final answers, leaving the auditable derivation itself under-measured. We introduce BigFinanceBench, a 928-item expert-authored benchmark of open-ended financial-research tasks in which each item pairs a ground-truth reference answer with a point-weighted rubric that decomposes the derivation into independently checkable steps. BigFinanceBench is workflow-grounded in that it evaluates the full derivation rather than only the final output. Across 36,241 rubric points, the benchmark supports partial-credit evaluation and localization of failures across the analyst workflow. Evaluating ten current frontier and open-weight agents, we find substantial headroom: the best system reaches only 58.8% rubric score, final-answer accuracy is a useful but lossy proxy for derivation quality, and model capability varies non-uniformly across financial workflows.
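
A point-weighted rubric with partial credit can be pictured as follows. This is a minimal sketch under stated assumptions: the abstract does not give the rubric schema or the aggregation rule, so the `RubricStep` fields and the "earned points divided by available points" formula are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricStep:
    description: str  # an independently checkable step of the derivation
    points: int       # weight of the step (assumed to be a positive integer)
    satisfied: bool   # whether the agent's derivation passed this check


def rubric_score(steps: list[RubricStep]) -> float:
    """Partial-credit score: earned points over available points."""
    available = sum(step.points for step in steps)
    if available <= 0:
        raise ValueError("rubric must carry a positive number of points")
    earned = sum(step.points for step in steps if step.satisfied)
    return earned / available


def failed_steps(steps: list[RubricStep]) -> list[str]:
    """Localize failures by listing the steps that did not pass."""
    return [step.description for step in steps if not step.satisfied]
```

Under this scheme an answer with a wrong final number can still earn credit for choosing the right source and period, and a correct final number does not hide a flawed intermediate step.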

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

EvoDS: スキル学習とコンテキスト管理を備えた自己進化する自律型データサイエンスエージェント

大規模言語モデル (LLM) エージェントの最近の進歩により、自動データサイエンスにおける有望な進歩が可能になりました。しかし、既存のアプローチは、静的なアクションセットと原則に基づいた長期的なコンテキスト管理の欠如によって根本的に制限されたままであり、タスク全体で再利用可能なエクスペリエンスを蓄積し、多段階の反復的なデータサイエンスパイプラインで確実に動作する能力を妨げています。これらの課題に対処するために、エージェント強化学習を通じてスキルを拡張し、長期的なコンテキストを適応的に管理することを学習する、自己進化する自律型データサイエンスエージェントである EvoDS を導入します。具体的には、EvoDS は 2 つの重要な戦略を導入しています。(1) 自律スキル取得 (ASA) メカニズム。エージェントが実行可能なスキルを合成、検証、再利用できるようにします。 (2) 適応コンテキスト圧縮 (ACC) 戦略。これは、コンテキスト管理を受動的切り捨てではなく、学習された制御問題として扱います。これらの戦略は 2 段階のマルチエージェントトレーニングスキーム内で調整され、EvoDS が時間の経過とともに自律的に改善できるようになります。理論的には、EvoDS の階層設計によりツール選択エラーが軽減され、その最適化目標が情報ボトルネックの原則と一致し、効率的なコンテキストの使用が保証されることが証明されています。経験的に、EvoDS は、トークン不足の障害を排除しながら、4 つの多様なベンチマークにわたって、最先端のオープンソースデータサイエンスエージェントよりも平均 28.9% 優れたパフォーマンスを示します。コードとデータは https://github.com/usail-hkust/EvoDS で入手できます。

原文 (English)

EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management

Recent progress in Large Language Model (LLM) agents has enabled promising advances in automated data science. However, existing approaches remain fundamentally limited by their static action sets and lack of principled long-horizon context management, hindering their ability to accumulate reusable experience across tasks and operate reliably in multi-stage, iterative data science pipelines. To address these challenges, we introduce EvoDS, a self-evolving autonomous data science agent that learns to expand its skills and adaptively manage long-term context through agentic reinforcement learning. Specifically, EvoDS introduces two key strategies: (1) an Autonomous Skill Acquisition (ASA) mechanism, which enables agents to synthesize, validate, and reuse executable skills; and (2) an Adaptive Context Compression (ACC) strategy, which treats context management as a learned control problem rather than passive truncation. These strategies are orchestrated within a two-stage multi-agent training scheme, enabling EvoDS to autonomously improve over time. Theoretically, we prove that EvoDS's hierarchical design reduces tool-selection error, and its optimization objective aligns with an information bottleneck principle, ensuring efficient context use. Empirically, EvoDS outperforms state-of-the-art open-source data science agents by an average of 28.9% across four diverse benchmarks while eliminating out-of-token failures. Our code and data are available at https://github.com/usail-hkust/EvoDS.

2026-06-03 13:00 JSTarXiv cs.AIエージェント研究/論文

Hedge-Bench: 金融推論に関する困難で現実的なタスクによるエージェントのベンチマーク

AI エージェントは、文書の取得、数式の計算、スプレッドシートの更新など、財務分析の機械的なタスクを処理できるようになってきています。より難しく、より価値のある課題は、専門アナリストの仕事を特徴づける自由形式の質問について推論することです。既存のベンチマークはこの種の問題を捉えておらず、オープンエンドの推論を評価しようとするベンチマークは、モデルによる判定に依存しているため、ノイズと循環性が生じます。私たちは Hedge-Bench 1.0 を紹介します。これは、関連する情報ソースを使って作業するプロのヘッジファンドアナリストの明示的な推論トレースに基づく、102 件の実際の実務タスクからなるベンチマークです。このアプローチにより、検証済みの専門家のステップに対する決定論的な採点が可能になります。フロンティアモデルとエージェントのベンチマークスコアは 16% 未満です。データセットと評価ハーネスは github.com/Trata-Inc/trata-hedge-bench で公開しています。

原文 (English)

Hedge-Bench: Benchmarking Agents on Hard, Realistic Tasks Pertaining to Financial Reasoning

AI agents can increasingly handle the mechanical tasks of financial analysis: retrieving documents, calculating formulas, updating spreadsheets. The harder, more valuable challenge is reasoning through the open-ended questions that define expert analyst work. Existing benchmarks do not capture this class of problem, and those that attempt to evaluate open-ended reasoning rely on model-judged outputs that introduce noise and circularity. We present Hedge-Bench 1.0: a benchmark of 102 actual, on-the-job tasks grounded in the explicit reasoning traces of professional hedge fund analysts working with relevant information sources. This approach enables deterministic grading against verified expert steps. Frontier models and agents score below 16% on the benchmark. We publish the dataset and evaluation harness at github.com/Trata-Inc/trata-hedge-bench.

2026-06-03 13:00 JSTarXiv cs.AIエージェント

マージンプレイ: ブラジルの赤道マージンにおける公共政策分析のためのマルチエージェントシステム

ブラジル赤道マージン (BEM) はブラジルの次の海洋石油フロンティアであり、フォス・ド・アマゾナス盆地で 2026 年に操業が開始される予定です。その資産は、財政的にも領土的にも主にマラニョン州と結びついています。マラニョン州は連邦内で HDI が最も低い州です (0.676、IBGE 2022)。これは政策上の中心的な問いを提起します。BEM の探査は、どのような条件下でマラニョン州に正味でプラスの外部性をもたらすのでしょうか。この問題は本質的にマルチエージェントです。連邦政府は歳入とエネルギー安全保障を求め、州は憲法上のロイヤルティの使途指定の下で地域の福祉を追求し、オペレーターはリスクの下で利益を最大化し、ANP と IBAMA は相反する権限を持ち、アマゾンのコミュニティは金銭収入よりも領土と環境の側面を優先します。我々は、ブラジルの経験的較正と古典的な経済学の文献に基づいてこれらの緊張関係をシミュレートするマルチエージェント強化学習 (MARL) システム Margin Play を紹介します。CTDE パラダイムの下で 6 つのエージェントを実装し、BRO-MARL でトレーニングします。6 つのシナリオにわたる 60,000 エピソードの結果は、答えが制度体制に依存することを示しています。参照ベースラインでは福祉の利得はわずかです (Waval 約 1.68)。一方、MA-Prospero 構成では Delta W = +17.5% および Delta Rcom = +21.3% が得られ、環境負債も低くなります (Eamb = 0.048 対 0.076)。根本的な問題は生産と福祉のトレードオフではなく、探査に結びついた公共政策体制の選択にあります。

原文 (English)

Margin Play: A Multi-Agent System For Public Policy Analysis In The Brazilian Equatorial Margin

The Brazilian Equatorial Margin (BEM) is Brazil's next offshore oil frontier, with operations expected to begin in 2026 in the Foz do Amazonas basin. Its assets are fiscally and territorially linked primarily to Maranhao -- the state with the lowest HDI in the Federation (0.676, IBGE 2022). This raises the central policy question: under what conditions does BEM exploration generate net positive externalities for Maranhao? The problem is intrinsically multi-agent: the Federal Government seeks revenue and energy security; the state seeks regional welfare under constitutional royalty earmarking; the operator maximizes profit under risk; ANP and IBAMA hold conflicting mandates; and Amazonian communities prioritize territorial and environmental vectors over monetary income. We present Margin Play, a Multi-Agent Reinforcement Learning (MARL) system simulating these tensions under Brazilian empirical calibration and classical economic literature. It implements six agents under the CTDE paradigm, trained with BRO-MARL. Results from 60,000 episodes across six scenarios indicate the answer is conditional on the institutional regime: under the reference baseline, the welfare gain is marginal (Waval approx. 1.68), whereas the MA-Prospero configuration yields Delta W = +17.5% and Delta Rcom = +21.3%, with a lower environmental liability (Eamb = 0.048 vs. 0.076). The fundamental problem is not a trade-off between production and welfare, but the choice of public policy regime linked to exploration.

2026-06-03 13:00 JSTarXiv cs.AIエージェント

較正されたディファレンス (calibrated deference) による閉ループ分子設計

我々は、継続的に更新される信念状態グラフと再帰的な「計画してから実行する」ループを結合したエージェント、Cognitive Loop via In-Situ Optimization (CLIO) を紹介します。その結果、質的に異なる貢献ができる推論エージェントが得られます。我々はこれを「較正されたディファレンス (calibrated deference)」と呼びます。これは、自身のツールや仮定が機能していないことを認識し、それに応じて戦略を適応させ、実験の修正を導く機構的な仮説を生成する能力です。我々は、水系有機レドックスフロー電池 (AORFB) のネゴライトを設計するための閉ループの人間と AI の共同キャンペーンで CLIO をテストしました。CLIO は、合成と特性評価を行い設計選択に意見を述べた化学者との緊密な連携のもと、提案と解釈を主導しました。3 ラウンド、17 個の候補を経て、CLIO は最有力のホスホン酸塩候補に収束しました。特性評価により、酸化還元電位が文献ベースラインより 130 mV 改善していることが確認されました。その後の特性評価で、電気化学的可逆性が予想外に低いことが明らかになりました。これは、どの特性予測器も検出しなかった性能低下です。CLIO は競合する機構的仮説を生成し、それらを識別する診断に優先順位を付け、失敗の原因をホスホン酸イオンとカリウムイオンのイオン対形成にまで突き止め、スルホン酸塩への置き換えを提案しました。得られた化合物は、電気化学的可逆性が大幅に改善し、酸化還元電位の 90 mV の改善を維持しており、設計、合成、テスト、再設計のループを閉じました。

原文 (English)

Closed-Loop Molecular Design with Calibrated Deference

We present Cognitive Loop via In-Situ Optimization (CLIO), an agent that couples a continuously-updated belief-state graph with a recursive plan-then-act loop. The result is a reasoning agent that can contribute something qualitatively different, which we term calibrated deference: the capacity to recognize when its own tools or assumptions are failing, to adapt its strategy in response, and to generate mechanistic hypotheses that guide experimental revision. We tested CLIO in a closed-loop human-AI campaign to design an aqueous organic redox flow battery (AORFB) negolyte, with CLIO leading proposal and interpretation in close partnership with chemists who synthesized, characterized, and weighed in on design choices. Across 17 candidates over three rounds, CLIO converged on a top phosphonate candidate; characterization confirmed a 130 mV improvement in redox potential over the literature baseline. Characterization then revealed unexpectedly poor electrochemical reversibility -- a regression no property predictor had flagged. CLIO generated competing mechanistic hypotheses, prioritized discriminating diagnostics, traced the failure to phosphonate-potassium ion pairing, and prescribed a sulfonate replacement. The resulting compound showed substantially improved electrochemical reversibility and maintained a 90 mV improvement in redox potential, closing the design-make-test-redesign loop.

2026-06-03 13:00 JSTarXiv cs.AIエージェント研究/論文

TadA-Bench: エージェント型タンパク質工学に向けた将来ラウンド発見のための 100 万バリアント規模のベンチマーク

科学的発見のための AI はエージェントの時代に入りつつあり、タンパク質工学システムには、単に静的な測定値に適合するだけでなく、将来のウェットラボ実験に優先順位を付けることが期待されています。我々は、エージェント型タンパク質工学に向けた将来ラウンドの発見のために、31 回の TadA 指向性進化ラウンドから得られた 100 万バリアント規模のウェットラボリプレイベンチマーク TadA-Bench を紹介します。TadA-Bench は、キャンペーンの時系列を保持し、固定データのリプレイタスクを定義します。つまり、初期の実験ラウンドが与えられたとき、モデルは後のラウンドにのみ出現するバリアントをランク付けします。整列された DNA、RNA、タンパク質のビューを提供し、グラフベースのラベル統合パイプラインである Seq2Graph を使用して、ノイズの多い濃縮度測定値をラウンド間で一貫した活性ラベルに統合します。ランダム分割の対照実験では強い内挿性能が示されますが、将来ラウンドのランキングと有限予算での候補選択の性能ははるかに低くなります。対照分析は、進化的カバレッジが局所的なデータ密度よりも有益であることを示唆しており、TadA-Bench をエージェント型タンパク質工学に向けた将来ラウンド発見のための再現可能なウェットラボリプレイ基盤として位置付けています。データとコードは Hugging Face と GitHub で公開されています。

原文 (English)

TadA-Bench: A Million-Variant Benchmark for Future-Round Discovery Toward Agentic Protein Engineering

AI for scientific discovery is entering an agentic era, where protein-engineering systems are expected to prioritize future wet-lab experiments rather than merely fit static measurements. We introduce TadA-Bench, a million-variant wet-lab replay benchmark from 31 TadA directed-evolution rounds for future-round discovery toward agentic protein engineering. TadA-Bench preserves the campaign chronology and defines a fixed-data replay task: given earlier experimental rounds, models rank variants that appear only in later rounds. It provides aligned DNA, RNA, and protein views, and uses Seq2Graph, a graph-based label-unification pipeline, to reconcile noisy enrichment measurements into consistent cross-round activity labels. Random-split controls show strong interpolation, but future-round ranking and finite-budget candidate selection are much weaker. Controlled analyses suggest that evolutionary coverage is more informative than local data density, positioning TadA-Bench as a reproducible wet-lab replay substrate for future-round discovery toward agentic protein engineering; the data and code are released on Hugging Face and GitHub.
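
The fixed-data replay task described above amounts to a chronological split in which the held-out set contains only variants that first appear after a cutoff round. The sketch below illustrates that idea; the `(variant, round)` tuple layout and the function name are assumptions for illustration, not the TadA-Bench data format.

```python
def future_round_split(
    observations: list[tuple[str, int]],
    cutoff_round: int,
) -> tuple[list[tuple[str, int]], list[str]]:
    """Split observations chronologically at cutoff_round.

    observations: (variant_id, round_index) pairs; a variant may be
        observed in several rounds.
    Returns (train, held_out) where train holds every observation from
    rounds <= cutoff_round and held_out lists the variants that appear
    only in later rounds, sorted for reproducibility.
    """
    train = [(v, r) for v, r in observations if r <= cutoff_round]
    seen_early = {v for v, _ in train}
    later = {v for v, r in observations if r > cutoff_round}
    # Variants already observed before the cutoff are excluded so that
    # the model is ranking genuinely unseen sequences.
    held_out = sorted(later - seen_early)
    return train, held_out
```

A random split, by contrast, would let near-duplicates of late-round variants leak into training, which is consistent with the abstract's observation that random-split controls show strong interpolation while future-round ranking is much weaker.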

2026-06-03 13:00 JSTarXiv cs.AIエージェントロボティクス

CARVE: Certified Affordable Repair of Vetoed Maneuvers via Envelopes for Interactive Driving

Interactive driving exposes a failure mode that is easy to miss in rule-aware autonomous-driving stacks: a hard-rule margin can be negative…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

A New Framework for Cybersecurity Refusals in AI Agents

Agentic scaffolds have dramatically improved LLM performance on complex, long-horizon tasks, yielding both broad benefits and amplified ris…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

The Ringelmann Effect in Multi-Agent LLM Systems: A Scaling Law for Effective Team Size

Inference-time multi-agent LLM scaling lacks a shared unit: counting nominal agents conflates cost with independent evidence. We derive a t…

2026-06-03 13:00 JSTarXiv cs.AI画像/動画生成エージェント

MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data

Video world models are a foundational generative technology for embodied AI and the Metaverse, yet existing approaches are inherently limit…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions

How can a population of agents self-orchestrate and self-adapt into stronger collective intelligence without centralized control? Inspired…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

The Epi-LLM Framework: probing LLM behavioral priors through epidemiological agent-based models

Human behaviour during epidemics affects infectious disease dynamics, but quantifying this remains deeply challenging. Here we introduce th…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Adaptive Latent Agentic Reasoning

Large reasoning models improve performance by generating extended chain-of-thought (CoT) reasoning, but this behavior becomes inefficient w…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

WRIT: Write-Read Intensive Trajectory Synthesis for Multi-Turn User-Facing Agents

Multi-turn user-facing agents must infer user intent from incomplete requests, collect missing information through dialogue and tools, and…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AI画像/動画生成エージェントロボティクスビジネス/資金調達

SCOPE: Real-Time Natural Language Camera Agent at the Edge

Deploying language-driven agents in robotics requires evaluations that reflect real-world task demands: natural-language instructions with…

2026-06-03 13:00 JSTarXiv cs.AIエージェント

Glass Box at Orbit: A Constitutional AI Verification Framework for Trustworthy Autonomous CubeSat Intelligence

The space industry is quietly building toward something nobody has fully reckoned with: orbital data centers running thousands of autonomou…

2026-06-03 13:00 JSTarXiv cs.AI画像/動画生成エージェントロボティクス

Towards Compact Autonomous Driving Perception with Balanced Learning and Multi-sensor Fusion

We present a novel compact deep multi-task learning model to handle various autonomous driving perception tasks in one forward pass. The mo…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AI画像/動画生成エージェント

MUSE: A Unified Agentic Harness for MLLMs

Despite rapid progress, multimodal large language models (MLLMs) still fail on tasks that humans solve effortlessly, such as navigating a g…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Capability Advertisement as a Market for Lemons: A Trust Layer for Heterogeneous Agent Networks

Large language model (LLM) agents have begun to delegate work to one another. Protocols such as the Model Context Protocol (MCP) and the Ag…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Libra: Efficient Resource Management for Agentic RL Post-Training

Reinforcement learning (RL) has become a standard post-training paradigm for large language models (LLMs), extending beyond preference alig…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

PhotoCraft: Agentic Reasoning with Hierarchical Self-Evolving Memory for Deep Image Search

Deep Image Search requires multi-step reasoning over rich contextual cues, such as time, location, and event relations. However, most exist…

2026-06-03 13:00 JSTarXiv cs.AI画像/動画生成エージェントロボティクスハードウェア/半導体ビジネス/資金調達

NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation

As autonomous vehicle capabilities advance, the safe evaluation of driving policies in long-tail scenarios remains a critical bottleneck. I…

2026-06-03 13:00 JSTarXiv cs.AIエージェント

OpenAgenet/OAN: Open Infrastructure for Trusted Agent Interconnection

OpenAgenet, abbreviated as OAN, is an open infrastructure project for trusted Agent interconnection. It addresses a problem that becomes vi…

2026-06-03 13:00 JSTarXiv cs.AIエージェント研究/論文

OpenAgenet/OAN: Technical Architecture for Trust-Governed Agent Identity and Discovery

This paper describes the technical architecture of OpenAgenet / OAN. OAN is a protocol-neutral trust layer for open Agent interconnection.…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AI画像/動画生成エージェント研究/論文

VistaHop: Benchmarking Multi-hop Visual Reasoning for Visual DeepSearch

Visual DeepSearch requires multimodal large reasoning model (MLRM) agents to answer complex visual queries by repeatedly inspecting image r…

2026-06-03 13:00 JSTarXiv cs.AIエージェント研究/論文

FORGE: Multi-Agent Graduated Exploitation and Detection Engineering

Vulnerability disclosure volumes now far exceed organizational assessment capacity, yet three adjacent research communities (proof-of-conce…

2026-06-03 13:00 JSTarXiv cs.AIエージェントロボティクス

SPADE: Sketch-guided Path Planning Augmented with Diffusion Experts

Path planning is essential for Autonomous Mobile Robots (AMRs). Conventional methods for incorporating human preferences into planning typi…

2026-06-03 13:00 JSTarXiv cs.AIエージェント

Post-Hoc Robustness for Model-Based Reinforcement Learning

To improve the real-world applicability of reinforcement learning (RL), the field of adversarially robust RL studies how to train agents un…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Tool-Aware Optimization with Entropy Guidance for Efficient Agentic Reinforcement Learning

Agentic reinforcement learning (RL) equips large language models (LLMs) with tool-use capabilities that substantially improve reasoning on…

2026-06-03 13:00 JSTarXiv cs.AIエージェント

Trading Human Curation for Synthetic Augmentation in RLVR

The supply of high-quality training tasks is a central bottleneck for reinforcement learning from verifiable rewards (RLVR) on agentic lang…

2026-06-03 13:00 JSTarXiv cs.AIエージェント

AI Agents Enable Adaptive Computer Worms

A computer worm is malware that spreads on a network by replicating itself from one machine to another. Traditional worms, like WannaCry, e…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

A Training-Free Mixture-of-Agents Framework for Multi-Document Summarization using LLMs and Knowledge Graphs

Multi-Document Summarization (MDS) plays a critical role in distilling essential information from collections of textual data. Existing app…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Agent libOS: A Library-OS-Inspired Runtime for Long-Running, Capability-Controlled LLM Agents

Large language model (LLM) agents are evolving from request-response assistants into long-running software actors: they maintain state acro…

2026-06-03 13:00 JSTarXiv cs.AIエージェント

The Impact of Configuring Agentic AI Coding Tools on Build-vs-Buy Decisions: A Study Protocol

Agentic AI coding tools write code with increasing autonomy and in doing so decide when to import a library and when to implement functiona…

2026-06-03 13:00 JSTarXiv cs.AIエージェントロボティクス

Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation

Deep reinforcement learning has shown strong potential for enabling autonomous robots to learn complex navigational tasks. However, its pra…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and o…

2026-06-03 13:00 JSTarXiv cs.AIエージェントロボティクス研究/論文

Assistax: A Multi-Agent Hardware-Accelerated Reinforcement Learning Benchmark for Assistive Robotics

The development of reinforcement learning (RL) algorithms has been largely driven by ambitious challenge tasks and benchmarks. Games have d…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Collab-REC: An LLM-based Agentic Framework for Balancing Recommendations in Tourism

We propose COLLAB-REC, a multi-agent framework designed to counteract popularity bias and improve diversity in tourism recommendations. In…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

RGMem: Renormalization Group-inspired Memory Evolution for Language Agents

Personalized and continuous interactions are critical for LLM-based conversational agents, yet finite context windows and static parametric…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

ProtocolBench: Which LLM MultiAgent Protocol to Choose?

As large-scale multi-agent systems evolve, the communication protocol layer has become a critical yet under-evaluated factor shaping perfor…

2026-06-03 13:00 JSTarXiv cs.AIエージェント

MemVerse: Multimodal Memory for Lifelong Learning Agents

Despite rapid progress in large-scale language and vision models, AI agents still suffer from a fundamental limitation: they cannot remembe…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

FutureWeaver: Planning Test-Time Compute for Multi-Agent Systems with Modularized Collaboration

Scaling test-time computation has been shown to significantly improve large language model (LLM) performance without additional training. H…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント研究/論文

The Agent's First Day: Benchmarking Learning, Exploration, and Scheduling in the Workplace Scenarios

The rapid evolution of Multi-modal Large Language Models (MLLMs) has advanced workflow automation; however, existing research mainly target…

2026-06-03 13:00 JSTarXiv cs.AIエージェント

A Scoping Review of the Ethical Perspectives on Anthropomorphising Large Language Model-Based Conversational Agents

Anthropomorphisation -- the phenomenon whereby non-human entities are ascribed human-like qualities -- has become increasingly salient with…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェントビジネス/資金調達

PieArena: Ranking and Profiling Language Agents in Realistic Negotiation Scenarios

We present an in-depth evaluation of LLMs' ability to negotiate, a central business task requiring strategic reasoning, theory of mind, and…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Who Deserves the Reward? SHARP: Shapley Credit-based Optimization for Multi-Agent System

Integrating Large Language Models (LLMs) with external tools via multi-agent systems offers a promising new paradigm for decomposing and so…

2026-06-03 13:00 JSTarXiv cs.AIエージェント研究/論文

Towards a Science of AI Agent Reliability

AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress,…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

VeRO: A Harness for Agents to Optimize Agents

An important emerging application of coding agents is agent harness optimization: the iterative improvement of a target agent by editing an…

2026-06-03 13:00 JSTarXiv cs.AIエージェント

MAVEN-T: Reinforced Heterogeneous Distillation for Real-Time Multi-Agent Trajectory Prediction

Trajectory prediction is a key component of autonomous driving systems because future motions directly affect collision checking, behavior…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント研究/論文

Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization

Automating operations research (OR) with large language models (LLMs) remains limited by hand-crafted reasoning--execution workflows. Compl…

2026-06-03 13:00 JSTarXiv cs.AIエージェントビジネス/資金調達

Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

Standard embodied evaluations do not independently score whether an agent correctly commits to task completion at episode closure, a capaci…

2026-06-03 13:00 JSTarXiv cs.AIエージェント

PEAM: Minecraft での経験の対照的な内面化によるパラメトリックな身体化されたエージェントの記憶

Minecraft におけるパラメトリックな身体性エージェントメモリのフレームワーク PEAM を紹介します。PEAM は、エージェントのメモリを推論時の検索から、経験を通じて内面化されたパラメータ常駐型のスキルへと変換します。PEAM は、オープンエンドな推論を担う低速の熟慮型 LLM と、統合済みスキルを反射的に実行する高速のパラメトリックモジュールを組み合わせます。この高速モジュールは、カテゴリごとに物理的に分離されたアダプターを備えたマルチモーダルな Mixture-of-Experts LoRA アーキテクチャであり、破滅的忘却を起こすことなくパラメータレベルの継続学習を可能にします。私たちは失敗を第一級のトレーニング信号として扱います。失敗と修正の軌跡のペアは、行動クローニングと対照学習を組み合わせた目的関数を通じて内面化されるため、エージェントは何が成功するかだけでなく、修正された行動が失敗した行動とどう異なるかも学習します。統合を制御するために、PEAM は、どの経験を内面化すべきかを決めるパラメータ化価値スコアと、タスク固有の手動調整しきい値なしにいつ内面化するかを決めるスケールフリーの自己トリガー型統合メカニズムを導入します。このトリガーは再調整なしでタスク分布をまたいで転用できるため、エージェントは自己進化的になります。Minecraft での実験では、PEAM が長期的なタスクの性能を向上させ、以前に統合したスキルの忘却を軽減し、検索ベースの身体性エージェントやパラメトリックメモリの各種バリアントと比べてパラメトリック対検索の効率を向上させることが示されています。

原文 (English)

PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft

We present PEAM, a Parametric Embodied Agent Memory framework in Minecraft that transforms agent memory from inference-time retrieval into parameter-resident skills internalized through experience. PEAM pairs a slow deliberative LLM for open-ended reasoning with a fast parametric module for reflexive execution of consolidated skills. The fast module is a multimodal Mixture-of-Experts LoRA architecture with per-category physically isolated adapters, enabling parameter-level continual learning without catastrophic forgetting. We treat failure as a first-class training signal: failure-correction trajectory pairs are internalized through a joint behavioral-cloning and contrastive objective, so the agent learns not only what succeeds but also how corrected actions differ from failed ones. To govern consolidation, PEAM introduces a parameterization-worthiness score for deciding which experience should be internalized, and a scale-free self-triggered consolidation mechanism for deciding when to internalize without task-specific hand-tuned thresholds, making the agent self-evolving as the trigger transfers across task distributions without re-tuning. Experiments in Minecraft show that PEAM improves long-horizon task performance, mitigates forgetting on previously consolidated skills, and improves parametric-versus-retrieval efficiency over retrieval-based embodied agents and parametric memory variants.

2026-06-03 13:00 JSTarXiv cs.AIエージェント研究/論文

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

As agent capabilities advance, existing benchmarks, such as τ²-Bench, are becoming increasingly saturated. Yet constructing new bench…

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

編集する前にプローブせよ: 構造ベース創薬における LLM エージェントのためのプロービング誘導型分子最適化

構造ベース創薬では、標的ポケットに対してリガンドを反復的に改良するために LLM エージェントが用いられることが増えています。しかし、実用的なリガンドは、しばしば相反する 2 つの目的、すなわち結合親和性とドラッガビリティを満たす必要があり、単一の最適化ステップで両方が同時に改善されることはほとんどありません。この困難さを定量化するために、2 つの診断指標を導入します。1 つ目は 1 回の編集で両方の目的が改善される頻度を測定し、2 つ目は一方の目的の改善が他方の目的の悪化を伴う頻度を測定します。これらの診断を現在の LLM エージェントパイプラインに適用すると、一貫した失敗モードが明らかになります。エージェントは、ポケット-リガンド複合体が局所的な修飾にどう応答するかを知らないまま分子編集を行うため、両目的の同時改善はほとんど達成されません。最適化の方向を選ぶ前に、制御されたアナログ編集でポケット-リガンド複合体を探る医薬化学者から着想を得て、編集応答プロービングを中心に構築された最適化フレームワーク PROBE を提案します。PROBE はまずリガンドを編集可能なサイトに分解し、同時改善が見込める場所、2 つの目的が競合しやすい場所、および問題のある部分構造を変更すべき場所を示すポケット固有のサイトマップを構築します。次に、制御されたプローブ編集を実行し、その応答を EditManual に蒸留します。PROBE は、サイトマップと EditManual に基づいて、親和性エージェント、ドラッガビリティエージェント、共同最適化エージェントが共同で編集を生成する反復的なマルチエージェントループを実行します。CrossDocked2020 ベンチマークにおいて、PROBE は最先端の性能を達成し、診断指標によって明らかになった失敗モードを大幅に軽減します。

原文 (English)

Probe Before You Edit: Probing-Guided Molecular Optimization for LLM Agents in Structure-Based Drug Design

Structure-based drug design increasingly employs LLM agents to iteratively refine ligands against a target pocket, yet a viable ligand must satisfy two often-conflicting objectives -- binding affinity and druggability -- which single optimization steps rarely improve together. To quantify this difficulty, we introduce two diagnostic metrics: the first measures how often a single edit improves both objectives, and the second measures how often a gain on one objective comes with a loss on the other. Applying these diagnostics to current LLM-agent pipelines exposes a consistent failure mode: the agent performs molecular editing without knowing how the pocket-ligand complex responds to local modifications, thus rarely achieving joint improvement. Inspired by medicinal chemists, who probe the pocket-ligand complex with controlled analog edits before choosing an optimization direction, we propose PROBE, an optimization framework built around edit-response probing. PROBE first decomposes the ligand into editable sites and builds a pocket-specific site map that flags where joint gains are plausible, where the two objectives are likely in tension, and where liability substructures should be changed; it then performs controlled probe edits whose responses are distilled into an EditManual. Guided by the site map and EditManual, PROBE runs an iterative multi-agent loop in which an affinity agent, a druggability agent, and a co-optimization agent jointly produce edits. On the CrossDocked2020 benchmark, PROBE achieves state-of-the-art performance and substantially mitigates the failure modes exposed by our diagnostic metrics.
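
The two diagnostic metrics can be read as simple frequencies over a log of edits. The sketch below is one plausible formalization, not the paper's definition: the abstract does not name the metrics or fix a sign convention, so `EditOutcome`, both function names, and the convention that a positive delta means improvement are assumptions. Docking scores are often reported with lower values being better, so real scores may need to be negated first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditOutcome:
    # Change in each objective caused by one edit.
    # Assumed convention: positive = improvement.
    delta_affinity: float
    delta_druggability: float


def joint_improvement_rate(edits: list[EditOutcome]) -> float:
    """Fraction of edits that improve both objectives at once."""
    if not edits:
        return 0.0
    hits = sum(
        1 for e in edits if e.delta_affinity > 0 and e.delta_druggability > 0
    )
    return hits / len(edits)


def tradeoff_rate(edits: list[EditOutcome]) -> float:
    """Fraction of edits where a gain on one objective costs the other."""
    if not edits:
        return 0.0
    hits = sum(
        1
        for e in edits
        if (e.delta_affinity > 0 and e.delta_druggability < 0)
        or (e.delta_affinity < 0 and e.delta_druggability > 0)
    )
    return hits / len(edits)
```

A pipeline with a low joint-improvement rate and a high trade-off rate matches the failure mode the abstract describes: edits made without knowledge of how the pocket-ligand complex responds tend to move the two objectives in opposite directions.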

2026-06-03 13:00 JSTarXiv cs.AILLM/生成AIエージェント

SkillRevise: トレース条件付きスキルリビジョンによる LLM 作成エージェントスキルの向上

エージェントスキルは、LLM エージェントがワークフローを実行し、制約を検証し、障害から回復できるようにする手続き的な成果物です。既存の自己進化手法は、蓄積された軌跡を利用してスキルを改良します。しかし、初期の不完全なスキルしか利用できないコールドスタート環境では苦戦します。その結果、スキルの構築は専門家による作成か、LLM によるワンショット生成に頼ることになります。専門家が作成したスキルはコストが高く、LLM エージェントが実際にタスクを実行する方法と合わない可能性があります。一方、ワンショットで生成されたスキルは、構文的には整っていても、動作面では弱い場合があります。このギャップを埋めるために、こうした初期スキルを反復的に改良する実行ベースのフレームワーク SkillRevise を提案します。SkillRevise は、実行時の証拠からスキルの欠陥を診断し、汎用メモリから関連する修復原則を取得し、実行に紐づいた編集を適用します。候補を再実行して経験的な有用性を測定することで、最適なスキルバージョンを体系的に保持します。3 つのベンチマークと 5 つの LLM で評価したところ、SkillRevise はワンショットのベースラインを大幅に上回り、SkillsBench でのベースエージェントの成功率を 36.05% から 61.63% に向上させました。さらに、改訂されたスキルはモデル間で高い転移性を示し、モデル固有のアーティファクトではなく一般化された手続き的知識を捉えています。

原文 (English)

SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision

Agent skills are procedural artifacts that enable LLM agents to execute workflows, verify constraints, and recover from failures. Existing self-evolving methods refine skills using accumulated trajectories. However, they struggle in cold-start settings, where only an initial, imperfect skill is available. Consequently, skill construction defaults to expert authoring or one-shot LLM generation. Expert-authored skills are costly and may not align with how LLM agents actually execute tasks, while one-shot generated skills can be syntactically well formed yet behaviorally weak. To bridge this gap, we propose SkillRevise, an execution-grounded framework designed to iteratively refine these initial skills. SkillRevise diagnoses skill defects from execution evidence, retrieves relevant repair principles from a general memory, and applies execution-anchored edits. By re-executing candidates and measuring empirical utility, it systematically retains the optimal skill version. Evaluated across three benchmarks and five LLMs, SkillRevise substantially outperforms one-shot baselines, improving the base agent's success rate on SkillsBench from 36.05% to 61.63%. Furthermore, the revised skills exhibit strong cross-model transferability, capturing generalized procedural knowledge over model-specific artifacts.
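
The "re-execute candidates and retain the best version" step can be sketched as a selection loop over measured success rates. This is a simplified illustration under assumptions: skills are represented as plain strings, `run` is a hypothetical callback that executes a skill on a task and reports success, and empirical utility is taken to be the success rate. The diagnosis and repair-retrieval stages of SkillRevise are not modeled here.

```python
from typing import Callable, Sequence

# Hypothetical executor: (skill_text, task_id) -> did the agent succeed?
Runner = Callable[[str, str], bool]


def success_rate(skill: str, tasks: Sequence[str], run: Runner) -> float:
    """Empirical utility of a skill, measured by re-executing every task."""
    if not tasks:
        raise ValueError("at least one task is required")
    return sum(1 for task in tasks if run(skill, task)) / len(tasks)


def retain_best(
    current: str,
    candidates: Sequence[str],
    tasks: Sequence[str],
    run: Runner,
) -> tuple[str, float]:
    """Keep the incumbent unless a revised candidate is strictly better."""
    best, best_rate = current, success_rate(current, tasks, run)
    for candidate in candidates:
        rate = success_rate(candidate, tasks, run)
        if rate > best_rate:  # ties keep the earlier version
            best, best_rate = candidate, rate
    return best, best_rate
```

Requiring a strict improvement means a revision that merely matches the incumbent is discarded, which keeps the retained skill stable when the measured utility is noisy.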

2026-06-03 13:00 JST · arXiv cs.AI · Agents · Business/Funding · Research/Papers

Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Evaluation based…

2026-06-03 13:00 JST · arXiv cs.AI · Agents

Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

Agentic reinforcement learning can induce tool abuse, where models overuse external tools even for queries solvable by internal reasoning.…

2026-06-03 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents · Business/Funding

AgentCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

Language agents spend substantial inference time solving individual tasks, yet the experience acquired in one episode is often underutilize…

2026-06-03 13:00 JST · arXiv cs.AI · Agents · Robotics

Scaling Multi Agent Reinforcement Learning for Underwater Acoustic Tracking via Autonomous Vehicles

Autonomous vehicles (AVs) offer a cost-effective solution for scientific missions such as underwater tracking. Reinforcement learning (RL)…

2026-06-03 13:00 JST · arXiv cs.AI · Agents

Curriculum-Adapted Robust Reinforcement Learning for UAV Deconfliction in Adversarial Environments

Autonomous unmanned aerial vehicles (UAVs) increasingly rely on reinforcement learning (RL) for navigation. However, global navigation sate…

2026-06-03 13:00 JST · arXiv cs.AI · Agents

TalkPlayData 2: An Agentic Synthetic Data Pipeline for Multimodal Conversational Music Recommendation

We present TalkPlayData 2, a synthetic dataset for multimodal conversational music recommendation generated by an agentic data pipeline. In…

2026-06-03 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Automata-Conditioned Cooperative Multi-Agent Reinforcement Learning

We study learning multi-task, multi-agent policies for cooperative, temporal objectives, under centralized training, decentralized executio…

2026-06-03 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Toward Training Superintelligent Software Agents through Self-Play SWE-RL

While current software agents powered by large language models (LLMs) and agentic reinforcement learning (RL) can boost programmer producti…

2026-06-03 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents · Business/Funding

Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

The transition from monolithic language models to modular, skill-equipped agents marks a defining shift in how large language models (LLMs)…

2026-06-03 13:00 JST · arXiv cs.AI · Agents

VulnAgent-R2: Evidence-Calibrated Multi-Agent Auditing for Repository-Level Vulnerability Detection

Software vulnerabilities often depend on cross-file data flow, build options, framework conventions, and runtime guards, so isolated functi…

2026-06-03 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Measuring Weak-to-Strong Legibility of Reasoning Models

Reasoning language models (RLMs) and the intermediate chains of thought they emit play an increasingly central role in multi-agent setups s…

2026-06-03 13:00 JST · arXiv cs.AI · Agents

Just Type It in Isabelle! AI Agents Drafting, Mechanizing, and Generalizing from Human Hints

Type annotations are essential when printing terms in a way that preserves their meaning under reparsing and type inference. We study the p…

2026-06-03 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Dynamics of Cognitive Heterogeneity: Investigating Behavioral Biases in Multi-Stage Supply Chains with LLM-Based Simulation

Modeling coordination among generative agents in complex multi-round decision-making presents a core challenge for AI and operations manage…

2026-06-03 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents

LLM agents increasingly rely on reusable skills (e.g., SKILL.md) to execute complex tasks, yet these artifacts lack portability: agent f…

2026-06-03 13:00 JST · arXiv cs.AI · Agents

Mechanism Design Is Not Enough: Prosocial Agents for Cooperative AI

Ensuring that AI agents behave safely and beneficially when interacting with other parties has emerged as one of the central challenges of…

2026-06-03 13:00 JST · arXiv cs.AI · Agents · Business/Funding

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

Evaluation of software engineering (SWE) agents is dominated by a binary signal: whether the final patch passes the tests. This outcome-onl…

2026-06-03 13:00 JST · arXiv cs.AI · Image/Video Generation · Agents · Robotics

FRED: A Multi-Modal Autonomous Driving Dataset for Flooded Road Environments

The Flooded Road Environments Dataset (FRED) is, to our knowledge, the first multi-modal autonomous driving dataset specifically targeting…

2026-06-03 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents · Research/Papers

If LLMs Have Human-Like Attributes, Then So Does Age of Empires II

Much research has been carried out on large language models (LLMs) and LLM-powered agentic workflows. However, many works within the field…

2026-06-03 13:00 JST · arXiv cs.AI · Image/Video Generation · Agents

Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents

Visual agents employ external visual tools within visual chains of thought to incorporate fine-grained evidence. While prior work has mainl…

2026-06-03 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations

Indirect prompt injection in tool-use agents is a concrete production threat: LLM agents read from integrations (third-party services such…

2026-06-03 11:02 JST · ITmedia AI+ · Agents

Microsoft announces "Microsoft Execution Containers," a customizable isolated environment for AI agents; OpenClaw also runs on it

Microsoft has announced Microsoft Execution Containers (MXC), a customizable isolated environment for AI agents.

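2026-06-03 07:49 JST · ITmedia AI+ · Agents

Microsoft announces "Solara," an Android-based AI agent platform, and shows a Snapdragon-powered badge-type device

At Build 2026, Microsoft announced Project Solara, a new platform specialized for running AI agents. Its OS is based on AOSP rather than Windows. A device resembling an employee ID badge, co-developed with Qualcomm, and, jointly with MediaTek,…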

2026-06-03 03:00 JST · TechCrunch AI · Agents

Microsoft offers devs a better way to control AI agent behavior

The specification lets developer, compliance, and security teams define their own policies for agents to follow in portable policy files.

2026-06-03 01:00 JST · TechCrunch AI · LLM/Generative AI · Agents

OpenAI launches new Codex tools for white-collar work

OpenAI released a set of six plug-ins aimed at specific jobs: data analytics, creative production, sales, product design, equity investing,…

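2026-06-02 20:50 JST · ITmedia AI+ · Agents

Microsoft announces autonomous agent "Scout," built on OpenClaw with MCP support

At Build 2026, Microsoft announced "Autopilots," a new category of autonomous AI agents, and its first entry, Microsoft Scout. Scout is built on the OpenClaw foundation and runs continuously in the background, and Microsoft 365's…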

2026-06-02 18:00 JST · OpenAI · Agents

Codex for every role, tool, and workflow

Discover new Codex plugins, sites, and annotations that help analysts, marketers, designers, investors, and other teams get more done with…

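2026-06-02 13:00 JST · arXiv cs.AI · Agents

Deliberative Curation: A Protocol for Multi-Agent Knowledge Bases

As AI agents move from isolated tools to collaborative participants in shared knowledge ecosystems, governing collective knowledge curation becomes a critical challenge. Governance mechanisms from human platforms do not transfer directly: agent statelessness undermines deterrence-based sanctions, model homogeneity violates the independence assumptions underlying crowd wisdom, and sycophancy collapses deliberative consensus. We propose a deliberative curation protocol that combines three governance layers: (1) a knowledge artifact lifecycle formalized as a labeled transition system; (2) reputation-weighted deliberative voting that integrates Beta Reputation with EigenTrust amplification; and (3) graduated sanctions adapted to stateless agents, including broken-agent handling that distinguishes malfunction from adversarial behavior. We evaluate the protocol through agent-based simulation with 100 agents across seven behavioral archetypes under two adversity scenarios (30 seeds, paired t-tests). The protocol trades modest precision under benign conditions for substantially better resilience under adversity: 0.826 versus 0.791 for majority vote under moderate adversity (p<0.001), widening to 0.807 versus 0.740 under stress (p<0.001). The protocol degrades roughly three times more slowly than majority vote. Ablation analysis identifies commit-reveal vote concealment as the single most impactful component (a precision improvement of 8.2 to 8.6 pp, p<0.001), outperforming reputation weighting and deliberation combined. Graduated sanctions were not exercised in simulation and remain empirically unvalidated.

Original (English)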

Deliberative Curation: A Protocol for Multi-Agent Knowledge Bases

As AI agents transition from isolated tools to collaborative participants in shared knowledge ecosystems, governing collective knowledge curation becomes a critical challenge. Human platform governance mechanisms do not transfer directly: agent statelessness undermines deterrence-based sanctions, model homogeneity violates independence assumptions underlying crowd wisdom, and sycophancy collapses deliberative consensus. We propose a deliberative curation protocol combining three governance layers: (1) a knowledge artifact lifecycle formalized as a labeled transition system; (2) reputation-weighted deliberative voting integrating Beta Reputation with EigenTrust amplification; and (3) graduated sanctions adapted for stateless agents, including broken agent handling distinguishing malfunction from adversarial behavior. We evaluate the protocol through agent-based simulation with 100 agents across seven behavioral archetypes under two adversity scenarios (30 seeds, paired t-tests). The protocol trades modest precision under benign conditions for substantially better resilience under adversity: 0.826 vs 0.791 for majority vote under moderate adversity (p<0.001), widening to 0.807 vs 0.740 under stress (p<0.001). The protocol degrades roughly three times more slowly than majority vote. Ablation analysis identifies commit-reveal vote concealment as the most impactful single component (8.2-8.6pp precision improvement, p<0.001), outperforming reputation weighting and deliberation combined. Graduated sanctions were not exercised in simulation and remain empirically unvalidated.
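
The two voting ingredients named in the abstract, commit-reveal concealment and reputation weighting, can be sketched as follows. This is a minimal illustration, not the authors' protocol: the function names, the SHA-256 commitment format, and the 0.5 acceptance threshold are assumptions, the reputation score uses the standard Beta Reputation expected value (r + 1) / (r + s + 2), and EigenTrust amplification is omitted.

```python
import hashlib


def beta_reputation(positive: int, negative: int) -> float:
    """Expected value of a Beta(positive + 1, negative + 1) reputation score."""
    return (positive + 1) / (positive + negative + 2)


def commit(vote: bool, nonce: str) -> str:
    """Commit phase: publish only a hash, so others cannot see or copy the vote."""
    return hashlib.sha256(f"{int(vote)}:{nonce}".encode()).hexdigest()


def weighted_decision(reveals, commitments, reputations) -> bool:
    """Reveal phase: count only votes that match their earlier commitment.

    reveals maps agent -> (vote, nonce). The artifact is accepted when the
    reputation-weighted share of "yes" votes exceeds 0.5 (assumed threshold).
    """
    yes = total = 0.0
    for agent, (vote, nonce) in reveals.items():
        if commit(vote, nonce) != commitments[agent]:
            continue  # vote changed after committing: discard it
        weight = reputations[agent]
        total += weight
        if vote:
            yes += weight
    return total > 0 and yes / total > 0.5
```

In this sketch one agent with a score of 0.9 voting yes outweighs two agents with scores of 0.1 voting no, which is where reputation weighting departs from a plain majority vote.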

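2026-06-02 13:00 JST · arXiv cs.AI · Agents

Agents on a Tree: Pathwise Coordination for Multi-Objective Molecular Optimization

Multi-objective molecular optimization requires searching vast chemical spaces under conflicting objectives, where early design decisions strongly constrain downstream outcomes. Existing methods typically rely on a single policy or a fixed scalarization, which limits their ability to represent diverse trade-offs and to explore multiple promising design trajectories. We propose ATOM, a multi-agent framework that formulates molecular optimization as a tree-structured search. Each node corresponds to an atomic operation and hosts an agent specialized for a particular objective or decision context. Rather than enforcing a global consensus, agents coordinate along different paths of the tree, which lets the method maintain and compare alternative molecular evolution trajectories. A global memory of past optimization behaviors further supports balanced exploration and exploitation across objectives. This tree-structured interaction enables reasoning over the long-horizon dependencies inherent in molecular design. Experiments on challenging multi-objective benchmarks involving activity, synthesizability, and ADMET-related properties show that ATOM consistently achieves better Pareto coverage and hypervolume than strong baselines. These results demonstrate the effectiveness of pathwise multi-agent coordination for molecular optimization. Code is available at https://anonymous.4open.science/r/ATOM-41CE.

Original (English)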

Agents on a Tree: Pathwise Coordination for Multi-Objective Molecular Optimization

Multi-objective molecular optimization requires searching vast chemical spaces under conflicting objectives, where early design decisions strongly constrain downstream outcomes. Existing methods typically rely on a single policy or fixed scalarization, which limits their ability to represent diverse trade-offs and to explore multiple promising design trajectories. We propose ATOM, a multi-agent framework that formulates molecular optimization as a tree-structured search. Each node corresponds to an atomic operation and hosts an agent specialized for a particular objective or decision context. Agents coordinate along different paths of the tree rather than enforcing a global consensus, enabling the method to maintain and compare alternative molecular evolution trajectories. A global memory of past optimization behaviors further supports balanced exploration and exploitation across objectives. This tree-structured interaction enables reasoning over long-horizon dependencies inherent in molecular design. Experiments on challenging multi-objective benchmarks involving activity, synthesizability, and ADMET-related properties show that ATOM consistently achieves improved Pareto coverage and hypervolume over strong baselines. These results demonstrate the effectiveness of pathwise multi-agent coordination for molecular optimization. Code is available at https://anonymous.4open.science/r/ATOM-41CE.
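
For readers unfamiliar with the two reported metrics, the sketch below computes a Pareto front and a two-objective hypervolume. It is a generic illustration and not ATOM's evaluation code: both objectives are assumed to be maximized, the reference point is an assumption, and the sample scores are made up.

```python
def pareto_front(points):
    """Return the points that no other point dominates (maximization)."""
    front = []
    for p in points:
        dominated = any(
            q != p and all(qi >= pi for qi, pi in zip(q, p)) for q in points
        )
        if not dominated:
            front.append(p)
    return front


def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by the front and bounded below by the reference point."""
    area = 0.0
    best_y = ref[1]
    # Sweep from the largest first objective to the smallest; each point adds
    # the horizontal strip it contributes above the best second objective so far.
    for x, y in sorted(pareto_front(points), reverse=True):
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


# Hypothetical (activity, synthesizability) scores for four candidate molecules.
candidates = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0), (1.0, 1.0)]
print(pareto_front(candidates))    # (1.0, 1.0) is dominated and dropped
print(hypervolume_2d(candidates))  # 6.0
```

A larger hypervolume means the front pushes further out across both objectives at once, which is why it is commonly reported alongside Pareto coverage.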

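2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution

Training language model agents for multi-agent strategic interaction poses a core difficulty: the quality of an action may depend on future events that never materialize, on moves that violate the game rules, or on decisions made by other players. Standard reinforcement learning assumes that a reward can be assigned at each step, but this assumption fails in settings where outcomes are entangled across time and agents. We introduce delayed per-step reward attribution with eligibility gating, an episode lifecycle and postprocessing pipeline that computes rewards only at episode end, propagates them back to the originating steps according to task-specific semantics, and excludes from training any step that lacks valid dependent information. Combined with asynchronous rollout generation via vLLM's continuous batching, curriculum-based opponent sampling, and multi-level stratified batch construction, this approach enables stable, sample-efficient RL training in multi-agent environments. We evaluate on the MindGames Arena benchmark at NeurIPS 2025, where a single 8-billion-parameter open-source model trained with our method matched or surpassed substantially larger proprietary systems, including GPT-5, in head-to-head play and took first place in both the Open (unrestricted) and Efficient (<=8B parameters) tracks.

Original (English)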

MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution

Training language model agents for multi-agent strategic interaction presents a core difficulty: the quality of any action may depend on future events that never materialize, on moves that violate game rules, or on decisions made by other players. Standard reinforcement learning assumes that rewards can be assigned at each step, but this assumption fails in settings where outcomes are entangled across time and agents. We introduce delayed per-step reward attribution with eligibility gating, an episode lifecycle and postprocessing pipeline that computes rewards only at episode end, propagates them back to originating steps according to task-specific semantics, and excludes steps that lack valid dependent information from training. Together with asynchronous rollout generation via vLLM's continuous batching, curriculum-based opponent sampling, and multi-level stratified batch construction, this approach enables stable, sample-efficient RL training in multi-agent environments. We evaluate on the MindGames Arena benchmark at NeurIPS 2025, where a single 8-billion-parameter open-source model trained with our method matched or surpassed substantially larger proprietary systems, including GPT-5, in head-to-head play and took first place in both the Open (unrestricted) and Efficient (<=8B parameters) tracks.
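
The idea of computing rewards only at episode end and gating which steps receive them can be sketched as below. This is a simplified illustration rather than the authors' pipeline: the `Step` record, the `eligible` flag, and the exponential `decay` rule stand in for the task-specific semantics, which the abstract does not spell out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    action: str
    eligible: bool                  # False if the step lacks valid dependent information
    reward: Optional[float] = None  # stays unset until the episode has ended


def attribute_rewards(steps: List[Step], final_reward: float, decay: float = 0.9) -> List[Step]:
    """Propagate the end-of-episode reward back to eligible steps only.

    Steps closer to the end receive a larger share (assumed decay rule);
    ineligible steps are dropped from the training batch entirely.
    """
    last = len(steps) - 1
    batch = []
    for i, step in enumerate(steps):
        if not step.eligible:
            continue
        step.reward = final_reward * decay ** (last - i)
        batch.append(step)
    return batch


episode = [Step("bid", True), Step("illegal move", False), Step("call", True)]
for s in attribute_rewards(episode, final_reward=1.0, decay=0.5):
    print(s.action, s.reward)  # bid 0.25, then call 1.0
```

Keeping the reward unset until the episode finishes makes it explicit that no per-step signal is assumed during the rollout.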

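2026-06-02 13:00 JST · arXiv cs.AI · Agents

A Multi-AI-Agent Framework Enabling End-to-End Finite Element Analysis for Solid Mechanics Problems

Finite element analysis (FEA) is the most important numerical approach for solid mechanics. Its challenges include a steep learning curve for entry-level users and the risk of false simulations caused by incorrect definitions of key simulation components such as boundary conditions, load cases, and solution variables. Solving real-world problems usually requires years of engineering experience. To address these issues, we present AbaqusAgent, a multi-agent framework grounded in large language models (LLMs) for solid mechanics analyses. AbaqusAgent facilitates the generation and execution of analysis cases in Abaqus, one of the most widely used FEA packages, by turning users' natural-language instructions into executed FEA analyses and result visualizations. It consists of six agents (interpreter, architect, input writer, runner, reviewer, and visualizer) that cover all the essential pre-processing and post-processing steps of a standard FEA analysis. Validation on a wide variety of 50 solid mechanics problems achieved an overall success rate of 86%. Beyond improving the efficiency of FEA for solid mechanics problems and lowering the barrier to computational mechanics education, AbaqusAgent advances the human-simulation interaction paradigm and enables integration with AI-empowered optimization and material characterization workflows. Code is available at https://github.com/LIRAM-LIN/AbaqusAgent.

Original (English)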

A Multi-AI-agent Framework Enabling End-to-end Finite Element Analysis for Solid Mechanics Problems

Finite element analysis (FEA) is the most important numerical approach for solid mechanics. Challenges of FEA include a steep learning curve for entry-level users and potential false simulations due to incorrect definitions of key simulation components, such as boundary conditions, load cases, and solution variables. Years of engineering experience are usually necessary for real-world problem-solving. To address these issues, we present AbaqusAgent, a multi-agent framework grounded in large language models (LLMs) for solid mechanics analyses. AbaqusAgent is developed to facilitate analysis case generation and execution using Abaqus, one of the most widely used FEA packages, by turning users' natural-language instructions into executed FEA analyses and result visualization. AbaqusAgent is composed of six agents, including interpreter, architect, input writer, runner, reviewer, and visualizer agents, encompassing all the essential pre-processing and post-processing steps of standard FEA analyses. A wide variety of 50 solid mechanics problems have been successfully validated, achieving an overall success rate of 86%. Beyond improving the efficiency of FEA for solid mechanics problems and lowering the barrier to computational mechanics education, AbaqusAgent advances the human-simulation interaction paradigm and enables integration with AI-empowered optimization and material characterization workflows. The code is available at https://github.com/LIRAM-LIN/AbaqusAgent

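2026-06-02 13:00 JST · arXiv cs.AI · Agents

MindZero: Learning Online Mental Reasoning With Zero Annotations

Effective real-world assistance requires AI agents with a robust Theory of Mind (ToM), the ability to infer human mental states from behavior. Despite recent advances, several key challenges remain: (1) online inference with robust uncertainty updates over multiple hypotheses; (2) efficient reasoning suitable for real-time assistance; and (3) the lack of ground-truth mental state annotations in real-world domains. We address these challenges with MindZero, a self-supervised reinforcement learning framework that trains multimodal large language models (MLLMs) for efficient and robust online mental reasoning. During training, the model is rewarded for generating mental state hypotheses that maximize the likelihood of the observed actions as estimated by a planner, similar to model-based ToM reasoning. The method therefore needs no explicit mental state annotations. After training, MindZero internalizes model-based reasoning into fast single-pass inference. We evaluate MindZero against baselines on challenging mental reasoning and AI assistance tasks in gridworld and household domains. We find that LLMs alone are insufficient; model-based methods improve accuracy but are slow, costly, and limited by the capacity of the backbone MLLM. In contrast, MindZero strengthens the MLLM's intrinsic ToM ability and significantly outperforms model-based methods in both accuracy and efficiency, showing that mental reasoning can be learned effectively as a self-supervised skill.

Original (English)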

MindZero: Learning Online Mental Reasoning With Zero Annotations

Effective real-world assistance requires AI agents with robust Theory of Mind (ToM): inferring human mental states from their behavior. Despite recent advances, several key challenges remain, including (1) online inference with robust uncertainty updates over multiple hypotheses; (2) efficient reasoning suitable for real-time assistance; and (3) the lack of ground-truth mental state annotations in real-world domains. We address these challenges by introducing MindZero, a self-supervised reinforcement learning framework that trains multimodal large language models (MLLMs) for efficient and robust online mental reasoning. During training, the model is rewarded for generating mental state hypotheses that maximize the likelihood of observed actions estimated by a planner, similar to model-based ToM reasoning. This method thus eliminates the need for explicit mental state annotations. After training, MindZero internalizes model-based reasoning into fast single-pass inference. We evaluate MindZero against baselines across challenging mental reasoning and AI assistance tasks in gridworld and household domains. We found that LLMs alone are insufficient; model-based methods improve accuracy but are slow, costly, and limited by backbone MLLM capacity. In contrast, MindZero enhances MLLMs' intrinsic ToM ability and significantly outperforms model-based methods in both accuracy and efficiency, showing that mental reasoning can be effectively learned as a self-supervised skill.

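2026-06-02 13:00 JST · arXiv cs.AI · Agents

Robust Shielding for Safe Reinforcement Learning

Shielding is an effective approach to formally guaranteeing the safety of reinforcement learning agents in Markov decision processes (MDPs). However, existing shielding techniques typically assume knowledge of the safety-relevant transition dynamics, a requirement that is seldom met in practice. To address this limitation, we introduce a novel shielding framework for robust MDPs (RMDPs), that is, MDPs with sets of transition probabilities. We define safety as satisfying a linear temporal logic (LTL) formula with a certain threshold probability under the worst-case transition probabilities of the RMDP. We prove that our shielding framework is both sound and optimal for the RMDP: every policy the shield admits is safe, and conversely, every safe RMDP policy is admitted by the shield. We combine our approach with existing sampling methods that learn the transition probabilities of MDPs with probably approximately correct (PAC) guarantees. This combination makes it possible to construct shields for MDPs that guarantee safety with high confidence while remaining minimally restrictive. Our experiments show that our shields for learned RMDPs guarantee safety in unknown MDPs while recovering strong expected return as the number of samples increases.

Original (English)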

Robust Shielding for Safe Reinforcement Learning

Shielding is an effective approach to formally guarantee the safety of reinforcement learning agents in Markov decision processes (MDPs). However, existing shielding techniques typically assume knowledge of the safety-relevant transition dynamics - a requirement that is seldom met in practice. To address this limitation, we introduce a novel shielding framework for robust MDPs (RMDPs), i.e., MDPs with sets of transition probabilities. We define safety as the satisfaction of a linear temporal logic (LTL) formula with a certain threshold probability under the worst-case transition probabilities of the RMDP. We prove that our shielding framework is both sound and optimal for the RMDP: every policy admissible by the shield is safe, and conversely, every safe RMDP policy is admissible by the shield. We combine our approach with existing sampling methods for learning transition probabilities of MDPs with probably approximately correct (PAC) guarantees. This combination enables the construction of shields for MDPs that, with high confidence, guarantee safety while remaining minimally restrictive. Our experiments show that our shields for learned RMDPs guarantee safety in unknown MDPs while recovering strong expected return as the number of samples increases.
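
As a rough intuition for worst-case shielding, the sketch below handles a much simpler case than the paper: a single step and interval-valued transition probabilities, instead of an LTL specification over full RMDP policies. The function names, the interval encoding, and the example numbers are all assumptions, and each action's intervals are assumed to be feasible (lower bounds sum to at most 1, upper bounds to at least 1).

```python
def worst_case_unsafe(intervals, unsafe):
    """Largest probability mass an adversary can place on unsafe successors.

    intervals maps successor state -> (lower, upper) probability bounds.
    The mass on unsafe states is capped both by their upper bounds and by
    what remains once every safe state has received its lower bound.
    """
    hi_unsafe = sum(hi for s, (lo, hi) in intervals.items() if s in unsafe)
    lo_safe = sum(lo for s, (lo, hi) in intervals.items() if s not in unsafe)
    return min(hi_unsafe, 1.0 - lo_safe)


def shield(actions, unsafe, threshold):
    """Allow exactly the actions whose worst-case unsafe probability is within the threshold."""
    return [
        name
        for name, intervals in actions.items()
        if worst_case_unsafe(intervals, unsafe) <= threshold
    ]


# Hypothetical intervals, e.g. as they might come out of a sampling-based estimate.
actions = {
    "go": {"ok": (0.80, 0.95), "crash": (0.05, 0.20)},
    "wait": {"ok": (0.97, 1.00), "crash": (0.00, 0.03)},
}
print(shield(actions, unsafe={"crash"}, threshold=0.1))  # ['wait']
```

Because the filter removes only actions that violate the bound under some admissible transition probabilities, it mirrors in miniature the "safe yet minimally restrictive" property described above.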

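2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Model-Native Computing Architecture: Envisioning Future System Architecture Through the Lens of Computer Architecture

Large language models are transitioning from a model technology to a system technology. As developers use Codex, Claude Code, AutoGPT, and related agents to write code, manage projects, and execute multi-step tasks, recurring engineering problems such as cache reuse, context management, agent scheduling, and permission control increasingly resemble classical computer systems problems. This paper develops that analogy as a visionary survey. We map concepts from computer architecture onto the emerging model-native stack and review work on LLM-as-OS, memory management, agent frameworks, tool protocols, multi-agent coordination, cognitive architectures, and safety governance. We argue that these strands address different layers of the same system but lack a unified model. To fill this gap, we propose the Intelligent Computing Architecture Model (ICAM), a six-layer framework for model-native computing with explicit interface contracts and design axioms. ICAM resolves the apparent tension over whether an LLM is more like a CPU or an operating system through a dual-plane view: a probabilistic execution plane concerned with what can be computed, and a deterministic control plane concerned with what should be computed. We further introduce three design laws: the Semantic Locality Law for KV-cache reuse and inference speedup, the Context Budget Law for effective working sets under finite windows and attention decay, and the Agent Speedup Law for diminishing returns in multi-agent collaboration. We validate these laws against published system-level data and relate them to recent evidence on agentic software practices. Finally, we identify where the analogy breaks down and outline a research roadmap for model-native computing. This is a conceptual and survey contribution; no new experiments are reported.

Original (English)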

Model-Native Computing Architecture: Envisioning Future System Architecture Through the Lens of Computer Architecture

Large language models are undergoing a transition from model technology to system technology. As developers use Codex, Claude Code, AutoGPT, and related agents to write code, manage projects, and execute multi-step tasks, recurring engineering problems such as cache reuse, context management, agent scheduling, and permission control increasingly resemble classical computer systems problems. This paper develops that analogy as a visionary survey. We map concepts from computer architecture to the emerging model-native stack and review work on LLM-as-OS, memory management, agent frameworks, tool protocols, multi-agent coordination, cognitive architectures, and safety governance. We argue that these strands address different layers of the same system but lack a unified model. To fill this gap, we propose the Intelligent Computing Architecture Model (ICAM), a six-layer framework for model-native computing with explicit interface contracts and design axioms. ICAM resolves the apparent tension over whether an LLM is more like a CPU or an operating system through a dual-plane view: a probabilistic execution plane concerned with what can be computed, and a deterministic control plane concerned with what should be computed. We further introduce three design laws: the Semantic Locality Law for KV-cache reuse and inference speedup, the Context Budget Law for effective working sets under finite windows and attention decay, and the Agent Speedup Law for diminishing returns in multi-agent collaboration. We validate these laws against published system-level data and relate them to recent evidence on agentic software practices. We conclude by identifying where the analogy breaks down and outlining a research roadmap for model-native computing. This is a conceptual and survey contribution; it does not report new experiments.

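2026-06-02 13:00 JST · arXiv cs.AI · Image/Video Generation · Agents

VESTA: Visual Exploration with Statistical Tool Agents

Fitting quantitative models to data is a central step in scientific workflows, yet it remains one of the least automated. Recent agent-based systems use language and vision-language models (VLMs) to iteratively propose and refine statistical models, but they struggle on more challenging modeling tasks. To address these limitations, we introduce VESTA: Visual Exploration with Statistical Tool Agents, a framework that equips VLMs with a dynamically growing exploration toolkit to guide model refinement through data transformations, hypothesis-driven visualizations, and robust statistical tests. Unlike prior systems that rely on iterative critique alone, VESTA actively explores the data before and during refinement by selecting or creating diagnostic tools, which accumulate in the model's context and can be reused later. We evaluate VESTA against established baselines in three toolkit configurations: no tools, static expert-written tools, and dynamic model-written tools. To support this evaluation, we introduce DAWN (Dataset for Automated Workflows and Numerical Modeling), a benchmark that targets distribution fitting and time series modeling across difficulty tiers and culminates in real-world astronomy tasks, including modeling initial mass functions and gravitational-wave chirp signals. We find that VESTA's dynamic tool creation outperforms prior agentic pipelines, with the largest gains on complex and domain-specific tasks. We further show that dynamically generated tools are substantially more sophisticated than those produced by existing visual tool-creation systems: they cover more diagnostic categories per function and strongly prefer visual outputs that the VLM critic can reason over directly.

Original (English)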

VESTA: Visual Exploration with Statistical Tool Agents

Fitting quantitative models to data is a central step in scientific workflows, yet it remains one of the least automated. Recent agent-based systems leverage language and vision-language models (VLMs) to iteratively propose and refine statistical models, but these systems struggle on more challenging modeling tasks. To address these limitations, we introduce VESTA: Visual Exploration with Statistical Tool Agents, a framework that equips VLMs with a dynamically growing exploration toolkit to guide model refinement through data transformations, hypothesis-driven visualizations, and robust statistical tests. Unlike prior systems that rely on iterative critique alone, VESTA actively explores data before and during refinement by selecting or creating diagnostic tools, which accumulate in the model's context and can be reused later. We evaluate VESTA against established baselines in three toolkit configurations: no tools, static expert-written tools, and dynamic model-written tools. To support this evaluation, we introduce DAWN (Dataset for Automated Workflows and Numerical Modeling), a benchmark targeting distribution fitting and time series modeling with varying difficulty tiers, and culminating in real-world astronomy tasks including modeling initial mass functions and gravitational-wave chirp signals. We find that VESTA's dynamic tool creation outperforms prior agentic pipelines, with the largest gains on complex and domain-specific tasks. We further show that dynamically generated tools are substantially more sophisticated than those produced by existing visual tool-creation systems, covering more diagnostic categories per function and strongly preferring visual outputs that the VLM critic can reason over directly.

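2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Doing What They Say, Not What They Reason: Locating the Faithfulness Gap in LLM Agents

Do LLM agents act on the reasoning they state? This question of process fidelity is central to using LLMs in social simulation, yet it is hard to measure when no reference for correct behavior exists. We study it in a controlled setting, a Texas Poker simulator with a verifiable reference action for every decision, by decomposing the faithfulness gap into two steps: reasoning-conclusion and conclusion-action. The two steps behave oppositely.

Original (English)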

Doing What They Say, Not What They Reason: Locating the Faithfulness Gap in LLM Agents

Do LLM agents act on the reasoning they state? This question of process fidelity is central to using LLMs in social simulation, yet it is hard to measure where no reference for correct behavior exists. We study it in a controlled setting, a Texas Poker simulator with a verifiable reference action for every decision, by decomposing the faithfulness gap into two steps: reasoning-conclusion and conclusion-action. The two steps behave oppositely.

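2026-06-02 13:00 JST · arXiv cs.AI · Agents

Acting with AI: An Interaction-Based Framework for Agentic Tort Liability

Agentic AI systems can plan over multiple steps, use tools, and execute tasks over time. When such systems cause harm, tort law struggles to allocate responsibility because the harmful path may be neither fully chosen by the user nor specifically foreseen by the developer. This paper proposes an interaction-based framework for agentic torts, drawing on Michael Bratman's planning theory and on the common law's treatment of human-human concerted action. We distinguish three interaction types: autonomous drift, pure tool use, and collaborative planning. Pure tool cases remain governed by ordinary product-defect and warning doctrines. Collaborative planning cases map onto the independent contractor control test, professional malpractice, and negligent misrepresentation. Autonomous drift maps onto frolic and detour under respondeat superior and onto strict product liability. The framework treats the stateful interaction log as the primary evidentiary trace, allowing courts to infer where the human-AI trajectory departed from the authorized undertaking and where liability should attach. We resolve four incident-anchored cases, situate the account alongside strict-liability and insurance-based proposals, note its relationship to regulatory oversight, and propose a "Reasonable Agent" standard built around constraint verification, epistemic transparency, runtime grounding, and forensic logging.

Original (English)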

Acting with AI: An Interaction-Based Framework for Agentic Tort Liability

Agentic AI systems can plan over multiple steps, use tools, and execute tasks over time. When such systems cause harm, tort law struggles to allocate responsibility because the harmful path may be neither fully chosen by the user nor specifically foreseen by the developer. This paper proposes an interaction-based framework for agentic torts, drawing on Michael Bratman's planning theory and on the common law's treatment of human-human concerted action. We distinguish three interaction types: autonomous drift, pure tool use, and collaborative planning. Pure tool cases remain governed by ordinary product-defect and warning doctrines; collaborative planning cases map onto the independent contractor control test, professional malpractice, and negligent misrepresentation; autonomous drift maps onto frolic and detour under respondeat superior and strict product liability. The framework treats the stateful interaction log as the primary evidentiary trace, allowing courts to infer where the human-AI trajectory departed from the authorized undertaking and where liability should attach. We resolve four incident-anchored cases, situate the account alongside strict-liability and insurance-based proposals, note its relationship to regulatory oversight, and propose a "Reasonable Agent" standard built around constraint verification, epistemic transparency, runtime grounding, and forensic logging.

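2026-06-02 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Probe Before You Edit: Probing-Guided Molecular Optimization for LLM Agents in Structure-Based Drug Design

Structure-based drug design increasingly employs LLM agents to iteratively refine ligands against a target pocket. A viable ligand, however, must satisfy two often-conflicting objectives, binding affinity and druggability, which a single optimization step rarely improves together. To quantify this difficulty, we introduce two diagnostic metrics: the first measures how often a single edit improves both objectives, and the second measures how often a gain on one objective comes with a loss on the other. Applying these diagnostics to current LLM-agent pipelines exposes a consistent failure mode: the agent edits molecules without knowing how the pocket-ligand complex responds to local modifications, so it rarely achieves joint improvement. Inspired by medicinal chemists, who probe the pocket-ligand complex with controlled analog edits before choosing an optimization direction, we propose PROBE, an optimization framework built around edit-response probing. PROBE first decomposes the ligand into editable sites and builds a pocket-specific site map that flags where joint gains are plausible, where the two objectives are likely in tension, and where liability substructures should be changed. It then performs controlled probe edits whose responses are distilled into an EditManual. Guided by the site map and the EditManual, PROBE runs an iterative multi-agent loop in which an affinity agent, a druggability agent, and a co-optimization agent jointly produce edits. On the CrossDocked2020 benchmark, PROBE achieves state-of-the-art performance and substantially mitigates the failure modes exposed by the diagnostic metrics.

Original (English)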

Probe Before You Edit: Probing-Guided Molecular Optimization for LLM Agents in Structure-Based Drug Design

Structure-based drug design increasingly employs LLM agents to iteratively refine ligands against a target pocket, yet a viable ligand must satisfy two often-conflicting objectives (binding affinity and druggability), which single optimization steps rarely improve together. To quantify this difficulty, we introduce two diagnostic metrics: the first measures how often a single edit improves both objectives, and the second measures how often a gain on one objective comes with a loss on the other. Applying these diagnostics to current LLM-agent pipelines exposes a consistent failure mode: the agent performs molecular editing without knowing how the pocket-ligand complex responds to local modifications, thus rarely achieving joint improvement. Inspired by medicinal chemists, who probe the pocket-ligand complex with controlled analog edits before choosing an optimization direction, we propose PROBE, an optimization framework built around edit-response probing. PROBE first decomposes the ligand into editable sites and builds a pocket-specific site map that flags where joint gains are plausible, where the two objectives are likely in tension, and where liability substructures should be changed; it then performs controlled probe edits whose responses are distilled into an EditManual. Guided by the site map and EditManual, PROBE runs an iterative multi-agent loop in which an affinity agent, a druggability agent, and a co-optimization agent jointly produce edits. On the CrossDocked2020 benchmark, PROBE achieves state-of-the-art performance and substantially mitigates the failure modes exposed by our diagnostic metrics.
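
The two diagnostic metrics described in the abstract can be illustrated with a short sketch. The metric names, the sign convention (each delta is signed so that a positive value means improvement), and the sample values are assumptions; the paper's exact definitions may differ.

```python
def joint_improvement_rate(deltas):
    """Fraction of edits that improve both objectives at once."""
    if not deltas:
        return 0.0
    return sum(1 for da, dd in deltas if da > 0 and dd > 0) / len(deltas)


def tradeoff_rate(deltas):
    """Fraction of edits where a gain on one objective comes with a loss on the other."""
    if not deltas:
        return 0.0
    return sum(1 for da, dd in deltas if da * dd < 0) / len(deltas)


# Hypothetical (affinity change, druggability change) for four single edits.
edits = [(0.4, 0.1), (0.6, -0.2), (-0.3, 0.5), (-0.1, -0.1)]
print(joint_improvement_rate(edits))  # 0.25
print(tradeoff_rate(edits))           # 0.5
```

A pipeline with a low joint rate and a high trade-off rate shows the failure mode the abstract describes: most edits buy one objective at the expense of the other.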

2026-06-02 13:00 JST arXiv cs.AI LLM/生成AI エージェント

TRACE: 長期にわたるエージェントの安全性を確保するためのトラジェクトリのリスクを考慮した圧縮

長期にわたる LLM エージェントは、まばらで遅延した構成的なリスク信号が局所的な緩和を逃れることが多い長い軌跡にわたって安全性の証拠を生成します。既存のターンレベルまたはショートコンテキストの検出器は、そのような証拠を長期間にわたって確実に保持し、集約するのに苦労しています。私たちは、長期にわたるエージェントの安全性の検出を軌跡レベルの証拠圧縮として再構成し、長期にわたるエージェントの安全性のための軌跡リスク認識圧縮 (TRACE) を提案します。 TRACE は Compressor-Reader 設計を使用します。Compressor は、軌跡レベルの監視の下で完全な軌跡をコンパクトな潜在証拠状態にエンコードし、Reader は安全参照としてこの潜在証拠状態を使用して生の軌跡を判断します。この設計は、分散したリスクの手がかりを集約し、早期の証拠損失を軽減するのに役立ちます。 ASSEBench、Pre-Ex-Bench、R-Judge 全体で、TRACE は評価されたすべてのバックボーンで最高の精度を達成し、強力なベースラインを最大 12.6 パーセントポイント改善します。 LongSafety では、コンテキストの長さが増加しても、TRACE のパフォーマンス低下は小さくなります。注意の視覚化とケーススタディは、圧縮された参照が、読者がリスククリティカルなセグメントに焦点を当て、複数のステップにわたる証拠を回収するのに役立つことを示唆しています。コードは https://github.com/Peregrine123/TRACE_official で入手できます。

原文 (English)

TRACE: Trajectory Risk-Aware Compression for Long-Horizon Agent Safety

Long-horizon LLM agents produce safety evidence across long trajectories, where sparse, delayed, and compositional risk signals often escape local moderation. Existing turn-level or short-context detectors struggle to reliably retain and aggregate such evidence over extended horizons. We reframe long-horizon agent safety detection as trajectory-level evidence compression and propose Trajectory Risk-Aware Compression for Long-Horizon Agent Safety (TRACE). TRACE uses a Compressor-Reader design: the Compressor encodes the full trajectory into a compact latent evidence state under trajectory-level supervision, and the Reader judges the raw trajectory with this latent evidence state as a safety reference. This design helps aggregate dispersed risk cues and reduce premature evidence loss. Across ASSEBench, Pre-Ex-Bench, and R-Judge, TRACE achieves the best accuracy on all evaluated backbones, improving over strong baselines by up to 12.6 percentage points. On LongSafety, TRACE shows smaller performance degradation as context length grows. Attention visualizations and case studies suggest that the compressed reference helps the Reader focus on risk-critical segments and recover cross-step evidence. Code is available at https://github.com/Peregrine123/TRACE_official.

2026-06-02 13:00 JST arXiv cs.AI LLM/生成AI エージェント 研究/論文

ForeSci: 将来を見据えた AI 研究判断のための LLM エージェントの評価

AI 研究では、多くの場合、将来の証拠が存在する前に、どのボトルネックを攻撃するか、どの方向を追求するか、プロジェクトをどこに位置付けるべきかなどの決定が必要になります。 LLM エージェントが歴史的証拠に基づいてそのような将来を見据えた研究判断を下せるかどうかを評価するための、時間的に制御されたベンチマークである ForeSci を紹介します。 ForeSci には、急速に変化する 4 つの AI ドメインと 4 つの意思決定ファミリーにわたる 500 のタスクが含まれています。各タスクは、カットオフに調整されたオフラインの知識ベースとペアになっています。カットオフ後のペーパーは生成中に隠され、検証のみに使用されます。将来のイベントのランダムな予測を回避するために、タスクはカットオフ前の分類枝と証拠シグナルから導出され、タスクのカットオフに先立って回答生成バックボーンが選択されます。私たちは、ネイティブ LLM、ハイブリッド RAG、および 4 つのバックボーンにわたる 3 つのリサーチエージェントの適応を評価します。結果は、明示的な証拠の整理によりトレーサビリティと事実の裏付けが向上しますが、その効果は意思決定ファミリーに大きく依存することを示しています。診断により、証拠と決定の切り離しが繰り返し発生していることが明らかになります。エージェントは、間違った研究対象を予測しながら、関連する証拠を引用する可能性があります。 ForeSci は、将来を見据えた AI 研究上の判断を、研究エージェントを意思決定システムとして評価するための管理されたベンチマークに変えます。

原文 (English)

ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment

AI research often requires decisions before future evidence exists: which bottleneck to attack, which direction to pursue, or where a project should be positioned. We introduce ForeSci, a temporally controlled benchmark for evaluating whether LLM agents can make such forward-looking research judgements from historical evidence. ForeSci contains 500 tasks across four fast-moving AI domains and four decision families. Each task is paired with a cutoff-aligned offline knowledge base; post-cutoff papers are hidden during generation and used only for validation. To avoid random future-event prediction, tasks are derived from pre-cutoff taxonomy branches and evidence signals, and answer-generation backbones are selected to precede the task cutoffs. We evaluate native LLMs, Hybrid RAG, and three research-agent adaptations across four backbones. Results show that explicit evidence organization improves traceability and factual support, but gains depend strongly on the decision family. Diagnostics reveal a recurring evidence-decision decoupling: agents may cite relevant evidence while forecasting the wrong research object. ForeSci turns forward-looking AI research judgement into a controlled benchmark for evaluating research agents as decision-making systems.

2026-06-02 13:00 JST arXiv cs.AI エージェント

MOSAIC: 構造化されたエージェントインテリジェンスと構成のためのモジュール式オーケストレーション

自動化されたデータサイエンスは、構造化されたモデル選択の問題です。ソリューションでは、タスクに応じたデータ変換、特徴表現、アーキテクチャ、トレーニング手順、評価プロトコル、および改良戦略を選択する必要があります。AutoML システムはこのプロセスの一部を自動化しますが、通常は事前定義されたパイプライン、モデル、ハイパーパラメーターの空間内で探索します。LLM ベースのエージェントは、検索 (retrieval)、コード生成、および実行フィードバックを通じてより高い柔軟性を提供しますが、そのモデリング上の決定は多くの場合構造化されておらず、検証が難しく、再利用も困難です。私たちは、メモリに基づくモデル選択とワークフロー構築のための構造化エージェントフレームワークである MOSAIC (Modular Orchestration for Structured Agentic Intelligence and Composition) を紹介します。タスクとデータセットが与えられると、MOSAIC はセマンティックなタスクプロファイルを構築し、過去のケースとソースコードモジュールを検索して、ブループリントを構築します。ブループリントは、選択されたモデリングコンポーネント、構成、インターフェイス制約、および実行要件を指定する中間表現です。このブループリントにより、モデル選択は段階的でコンテキストに基づいた探索となり、LLM ベースのコード生成は制約のない合成ではなく検索された証拠に基づくものになります。候補モデルは実行によって検証され、診断フィードバック、トレーニングトレース、タスクメトリクス、および失敗を考慮した強化学習ポリシーを用いて改良されます。私たちは金融時系列の予測と生成を対象に MOSAIC をインスタンス化します。この領域では、モデルは予測精度、分布の忠実度、実行の信頼性に加え、リスクやテール挙動といった下流の金融上の基準を満たす必要があります。AutoML およびエージェント型ベースラインとの比較実験では、MOSAIC がタスク性能、実行成功率、意思決定の追跡可能性を向上させることが示され、自動化されたデータサイエンスを、構造化され、再利用可能で、実行に基づくモデル選択として扱うことの価値が実証されています。

原文 (English)

MOSAIC: Modular Orchestration for Structured Agentic Intelligence and Composition

Automated data science is a structured model-selection problem. A solution must choose data transformations, feature representations, architecture, training procedure, evaluation protocol, and refinement strategy for a task. AutoML systems automate parts of this process, but typically search within predefined pipeline, model, and hyperparameter spaces. LLM-based agents offer greater flexibility through retrieval, code generation, and execution feedback, yet their modelling decisions are often unstructured, difficult to verify, and hard to reuse. We introduce MOSAIC (Modular Orchestration for Structured Agentic Intelligence and Composition), a structured agentic framework for memory-grounded model selection and workflow construction. Given a task and dataset, MOSAIC builds a semantic task profile, retrieves prior cases and source-code modules, and constructs a blueprint: an intermediate representation specifying selected modelling components, composition, interface constraints, and execution requirements. This blueprint turns model selection into a staged, context-grounded search and grounds LLM-based code generation in retrieved evidence rather than unconstrained synthesis. Candidate models are validated by execution and refined using diagnostic feedback, training traces, task metrics, and a failure-aware reinforcement learning policy. We instantiate MOSAIC on financial time-series forecasting and generation, where models must satisfy predictive accuracy, distributional fidelity, execution reliability, and downstream financial criteria such as risk and tail behaviour. Experiments against AutoML and agentic baselines show that MOSAIC improves task performance, execution success, and decision traceability, demonstrating the value of treating automated data science as structured, reusable, and execution-grounded model selection.

2026-06-02 13:00 JST arXiv cs.AI LLM/生成AI エージェント

CoMIC: クラウドエッジシステムにおける長期的な LLM エージェントのための共同メモリと洞察の循環

軽量な大規模言語モデル (LLM) エージェントをエッジサーバーにデプロイすると、レイテンシが短縮され、エージェントサービスをユーザーの近くに配置できます。しかし、リソースに制約のあるエッジモデルは、永続メモリ、サブゴールの追跡、およびリフレクションを必要とする長期的なタスクに苦戦することがよくあります。デプロイ後のエッジモデルのファインチューニングはコストが高く、異種ノード間でのスケールも困難です。一方、純粋にローカルなメモリだけでは、エージェントの経験が孤立し、プロンプトコンテキストが増大し続けます。私たちは、協調メモリと洞察の循環 (Collaborative Memory and Insights Circulation) のための、パラメータ更新不要のクラウドエッジフレームワークである CoMIC を提案します。CoMIC は「集中リフレクション、分散実行」(Centralized Reflection, Decentralized Execution) の設計に従います。エッジエージェントは、サブゴール指向の階層メモリと関連する履歴の選択的な再展開を用いてローカルで実行します。一方、クラウド側の LLM 批評家 (critic) は、完了した軌跡を非同期的に評価し、再利用可能な経験を選別し、セマンティックなサブゴール識別子をキーとしてエージェント横断のガイダンスを集約します。シンボリックプランニングとテキストインタラクションにまたがる 5 つの長期的なエージェントタスクにおいて、CoMIC は弱いエッジエージェントの進捗率とアクションのグラウンディングを向上させ、モデルパラメーターを更新することなく、タスクに依存した成功率の向上をもたらします。

原文 (English)

CoMIC: Collaborative Memory and Insights Circulation for Long-Horizon LLM Agents in Cloud-Edge Systems

Deploying lightweight Large Language Model (LLM) agents on edge servers can reduce latency and move agentic services closer to users, but resource-constrained edge models often struggle with long-horizon tasks that require persistent memory, subgoal tracking, and reflection. Fine-tuning edge models after deployment is costly and difficult to scale across heterogeneous nodes, while purely local memory leaves agents with isolated experience and growing prompt context. We propose CoMIC, a parameter-update-free cloud-edge framework for Collaborative Memory and Insights Circulation. CoMIC follows a Centralized Reflection, Decentralized Execution design: edge agents execute locally using subgoal-oriented hierarchical memory and selective re-expansion of relevant histories, while a cloud-side LLM critic asynchronously evaluates completed trajectories, filters reusable experience, and aggregates cross-agent guidance keyed by semantic subgoal identifiers. Across five long-horizon agent tasks spanning symbolic planning and text interaction, CoMIC improves progress rate and action grounding for weak edge agents and yields task-dependent success-rate gains without updating model parameters.

2026-06-02 13:00 JST arXiv cs.AI LLM/生成AI エージェント

FALAT: 依存関係に基づく検索による LLM エージェントの軌跡の障害のトレース

LLM ベースのエージェントは、推論ステップ、ツール呼び出し、エージェント間通信を含む長い軌跡を通じて複雑なタスクを解決することが増えています。ただし、これらのエージェントが失敗すると、どのエージェントが失敗の原因となったのか、どのステップが決定的なエラーを引き起こしたのかが不明瞭になることがよくあります。この帰属の問題は、間違いが軌跡全体に伝播する可能性があるため、困難です。後のアクションは間違っているように見える可能性がありますが、それは、それが以前の破損した状態に依存しているためです。したがって、障害の属性を独立したステップレベルの分類として扱うことはできません。我々は、LLM エージェントの軌跡における障害の原因を特定するための診断フレームワークである FALAT を提案します。 FALAT は、属性を依存関係に基づく検索問題としてフレーム化します。まず、タスクがどのように解決されるべきかという期待値を構築し、この期待値を使用して軌跡内の疑わしい領域を特定します。次に、決定、ツール出力、エージェントメッセージ間の依存関係を追跡し、エラーを引き起こすステップと、以前の間違いを単に継承または伝播するステップを区別します。最後に、FALAT は、候補ステップを修正するだけで期待される結果を回復するのに十分であるかどうかを評価し、責任のあるエージェントと決定的な失敗ステップの両方を特定できるようにします。私たちは Who&When ベンチマークで FALAT を評価します。これには、アルゴリズムによって生成されたものと手動で作成されたマルチエージェントの障害軌跡の両方が含まれます。結果は、FALAT が責任のあるエージェントと決定的なステップの帰属を一貫して改善していることを示しています。その最良の構成では、アルゴリズムで生成された軌道では 46.0% のステップレベルの精度を達成し、より困難な手作りの軌道では 29.1% を達成し、特殊なアトリビューションベースラインやスタンドアロン LLM による直接プロンプトを上回ります。これらの発見は、LLM エージェントシステムにおける信頼性の高い障害診断には、依存関係を意識した推論が不可欠であることを示唆しています。

原文 (English)

FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search

LLM-based agents increasingly solve complex tasks through long trajectories involving reasoning steps, tool calls, and inter-agent communication. However, when these agents fail, it is often unclear which agent caused the failure and which step introduced the decisive error. This attribution problem is challenging because mistakes can propagate across the trajectory: later actions may appear incorrect, but only because they depend on an earlier corrupted state. Therefore, failure attribution cannot be treated as independent step-level classification. We propose FALAT, a diagnostic framework for failure attribution in LLM agent trajectories. FALAT frames attribution as a dependency-guided search problem. It first constructs an expectation of how the task should be solved and uses this expectation to identify suspicious regions in the trajectory. It then traces dependencies among decisions, tool outputs, and agent messages to distinguish error-introducing steps from steps that merely inherit or propagate prior mistakes. Finally, FALAT evaluates whether correcting a candidate step would be sufficient to recover the expected outcome, allowing it to identify both the responsible agent and the decisive failure step. We evaluate FALAT on the Who&When benchmark, which includes both algorithm-generated and hand-crafted multi-agent failure trajectories. The results show that FALAT consistently improves responsible-agent and decisive-step attribution. Its best configurations achieve 46.0% step-level accuracy on algorithm-generated trajectories and 29.1% on the more challenging hand-crafted trajectories, outperforming specialized attribution baselines and direct prompting with standalone LLMs. These findings suggest that dependency-aware reasoning is essential for reliable failure diagnosis in LLM agent systems.
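The dependency-tracing idea in this abstract, separating steps that introduce an error from steps that merely inherit one, can be sketched as a small graph search. This is only an illustration under stated assumptions, not FALAT's actual algorithm: it assumes the dependency graph and the set of suspicious steps are already given, and it omits the counterfactual check on whether correcting a step recovers the expected outcome. The function name `decisive_steps` and the example trajectory are hypothetical.

```python
def decisive_steps(depends_on, suspicious):
    """Return suspicious steps that have no suspicious ancestor.

    depends_on maps a step index to the earlier steps whose outputs it consumes.
    A suspicious step with a suspicious ancestor is treated as inheriting the
    earlier mistake rather than introducing a new one.
    """
    def ancestors(step):
        seen, stack = set(), list(depends_on.get(step, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(depends_on.get(current, ()))
        return seen

    return sorted(s for s in suspicious if not (ancestors(s) & suspicious))


# Hypothetical trajectory: step 3 builds on step 2, which builds on steps 1 and 0.
depends_on = {1: [0], 2: [1], 3: [2], 4: [0]}
suspicious = {2, 3}
print(decisive_steps(depends_on, suspicious))
```

Here step 3 looks wrong only because it depends on step 2, so only step 2 is reported as a candidate for the decisive error.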

2026-06-02 13:00 JST arXiv cs.AI エージェント

インタラクション中心のインテリジェンス: 共創 AI およびヒューマン AI システムにおける主要な分析単位としてのインタラクションを目指して

従来の人工知能は、主に、境界のあるエージェント内で発生する分離された計算として知能を概念化していました。従来の AI、機械学習、および多くの生成システムにわたって、主要な分析単位は依然として、出力、ベンチマーク、予測精度、または最適化パフォーマンスを通じて評価される個々のモデルまたは自律システムです。これらのアプローチは大きな進歩をもたらしましたが、知性、創造性、意味、適応行動の出現における相互作用の役割については過小理論化されていることがよくあります。この論文では、共創 AI およびより広範なインタラクション中心のインテリジェンスの分析の主要な単位としてインタラクションを提案します。この論文は、分散認知、身体化認知、実行、参加型センスメイキング、人間とコンピューターの相互作用、および計算による創造性を基に、知能の関係性の説明がますます高まっていく歴史的進歩をたどります。 Creative Sense-Making、定量化された共同創造、Drawing Apprentice や AI Drawing Partner などの共同創造システムに関するこれまでの研究に基づいて、内部計算のみを通じてではなく、エージェント、環境、社会技術システム間の相互作用ダイナミクスの進化を通じて知能が出現すると主張しています。この論文では、人間と AI の共創、協調的な創発、適応的な参加、およびインタラクションダイナミクスを理解するためのフレームワークとして、インタラクション中心のインテリジェンスを紹介しています。このフレームワークは、生成された出力のみを通じて知能を評価するのではなく、相互作用の軌跡、調整パターン、参加型関与、適応的規制、および時間の経過とともに展開される相互作用ドリフトに重点を置いています。説明可能な共創 AI、ハイブリッドインテリジェンス、能動的 AI、および将来の人間と AI システムの意味について議論します。

原文 (English)

Interaction-Centered Intelligence: Toward Interaction as the Primary Unit of Analysis in Co-Creative AI and Human-AI Systems

Traditional artificial intelligence has largely conceptualized intelligence as isolated computation occurring within bounded agents. Across classical AI, machine learning, and many generative systems, the dominant unit of analysis remains the individual model or autonomous system evaluated through outputs, benchmarks, prediction accuracy, or optimization performance. While these approaches have produced major advances, they often under-theorize the role of interaction in the emergence of intelligence, creativity, meaning, and adaptive behavior. This paper proposes interaction as the primary unit of analysis for co-creative AI and interaction-centered intelligence more broadly. Drawing from distributed cognition, embodied cognition, enaction, participatory sense-making, human-computer interaction, and computational creativity, the paper traces a historical progression toward increasingly relational accounts of intelligence. Building upon prior work in Creative Sense-Making, quantified co-creation, and co-creative systems such as the Drawing Apprentice and AI Drawing Partner, it argues that intelligence emerges through evolving interaction dynamics among agents, environments, and socio-technical systems rather than solely through internal computation. The paper introduces Interaction-Centered Intelligence as a framework for understanding human-AI co-creation, collaborative emergence, adaptive participation, and interactional dynamics. Rather than evaluating intelligence solely through generated outputs, the framework emphasizes interaction trajectories, coordination patterns, participatory engagement, adaptive regulation, and interactional drift unfolding through time. Implications for explainable co-creative AI, hybrid intelligence, enactive AI, and future human-AI systems are discussed.

2026-06-02 13:00 JST arXiv cs.AI LLM/生成AI エージェント

敵対的なフィードにより LLM エージェントの決定がデフォルトに反して誘導される

LLM エージェントは、ソーシャルフィード、検索結果、検索 (retrieval) コンテキスト、メールキューなど、ランク付けされた外部情報ストリームを読み込んだ後に行動することが増えています。しかし、安全性評価はほとんどの場合、モデルまたはユーザープロンプトを単独でテストするだけで、エージェントが行動する直前に何を読むかを決める上流のランカーはテストしません。私たちは、モデル、ペルソナ、トピック、および最終的な意思決定プロンプトを固定し、その前の 10 ターンの「スクロール」フェーズでエージェントが目にする投稿の構成と順序のみを変化させる制御されたプロトコルを導入し、下流の意思決定に対するフィードキュレーションの因果効果を分離します。3 つの独立したラボによる 4 つの最新のオープンな instruct LLM に対する 2,785 件の意思決定ロールアウトから、3 つの応答レジームを特定しました。すなわち、敵対的降伏、デフォルト飽和、そしてデフォルト方向の非対称性です。この非対称性では、一方的なフィードは、モデルが本当に判断に迷っていた決定を傾けることができます (最も明確なケースでは 5% から 100% へ。Fisher の p 値は最小で 3 x 10^-10) が、モデルがすでに支持している決定や強く保持している決定を覆すことはできません。この効果は用量反応曲線に従い、文体によるアーティファクトを排除するためのジェネレーター交換後も維持され、デプロイ承認ゲートの削除やアクセス制御の緩和といったセキュリティ関連の選択を含む複数の意思決定領域に一般化し、2 つの単純なフィードレベルの防御によって部分的に緩和されます。なお、フロンティアモデルはデフォルトを維持します。私たちは、レコメンダーを LLM エージェントに対する実用的な、デフォルトによって制限された制御面 (control surface) として特徴付け、エージェントの評価では最終プロンプトだけでなくフィード層も監査すべきだと主張します。

原文 (English)

Adversarial Feeds Steer LLM Agent Decisions Against Their Defaults

LLM agents increasingly act after consuming ranked external information streams such as social feeds, search results, retrieval contexts, and email queues, yet safety evaluations almost always test the model or the user prompt in isolation, never the upstream ranker that decides what the agent reads just before it acts. We introduce a controlled protocol that holds the model, persona, topic, and final decision prompt fixed and varies only the composition and ordering of the posts an agent encounters during a preceding ten-turn "scrolling" phase, isolating the causal effect of feed curation on a downstream decision. Across 2,785 decision rollouts on four modern open instruct LLMs from three independent labs, we identify three response regimes: adversarial capitulation, default saturation, and a default-direction asymmetry in which a one-sided feed tips a decision the model was genuinely uncertain about (in the clearest cases from 5% to 100%; Fisher p as low as 3 x 10^-10) but cannot dislodge one it already favors or holds firmly. The effect follows a dose-response curve, survives a generator swap that rules out a writing-style artifact, generalizes across several decision domains including security-relevant choices such as removing a deployment approval gate or relaxing access controls, and is partly mitigated by two simple feed-level defenses; a frontier model retains its default. We characterize the recommender as a practical, default-bounded control surface for LLM agents, and argue that agent evaluations must audit the feed layer rather than the final prompt alone.
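To make the reported Fisher p-value concrete, the sketch below computes a two-sided Fisher exact test for a 2x2 table using only the standard library. The abstract does not state how many rollouts each condition contained, so the counts used here (1 of 20 versus 20 of 20) are purely hypothetical, chosen only to match the quoted 5% and 100% rates. The function name `fisher_exact_two_sided` is likewise an illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Hypothetical counts: the decision is taken in 1 of 20 rollouts under a neutral
# feed and in 20 of 20 rollouts under a one-sided feed.
p = fisher_exact_two_sided(1, 19, 20, 0)
print(f"{p:.2e}")
```

With these assumed counts the p-value is on the order of 10^-10, which shows how a shift from 5% to 100% can reach that level of significance even with a few dozen rollouts per condition.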

2026-06-02 13:00 JST arXiv cs.AI LLM/生成AI エージェント 研究/論文

TravelEval: LLM を活用した旅行計画エージェントを評価するための包括的なベンチマークフレームワーク

大規模言語モデル (LLM) の開発により、旅行計画アプリケーションは大幅に改善されましたが、そのようなモデルの評価は既存のベンチマークの制限によって制限されています。1) 制約遵守を重視しすぎて、時空間コストなどの多次元の品質が無視されている。 2) 現実世界の信頼性と主要分野 (宿泊施設、交通機関など) をカバーしていないデータセット。 3) 計画全体の評価に必要な重要な詳細 (毎日の宿泊施設や訪問ペースの影響など) が欠けている個別の日次計画評価。このギャップに対処するために、現実的かつ包括的なベンチマークである TravelEval を導入します。 TravelEval の特徴は、1) 精度、コンプライアンス、時間性、空間性、経済性、およびユーティリティの側面にわたって計画を総合的に評価するための新しい 6 次元の評価フレームワークです。 2) 正確な宿泊料金設定と本物の都市間交通データを備えた非常に現実的なデータサンドボックス。 3) API に統合された地理情報ときめ細かい待ち時間を使用して完全な旅行計画をエミュレートする、シミュレーションベースのグローバル評価方法。 TravelEval を使用して 12 の主流アプローチを評価すると、LLM がグローバルに最適化された多次元計画 (特に時空間推論と予算順守) に苦戦していることや、エージェント推論戦略では一貫した改善が見られないことなど、いくつかの貴重な洞察が明らかになりました。簡潔に言えば、TravelEval は、根拠のある時空間エミュレーションと包括的なメトリクスによって旅行計画の評価を容易にし、LLM を利用した旅行計画の研究とアプリケーションを進めるための強固な基盤を提供します。

原文 (English)

TravelEval: A Comprehensive Benchmarking Framework for Evaluating LLM-Powered Travel Planning Agents

The development of Large Language Models (LLMs) has significantly improved travel planning applications, yet evaluating such models is limited by existing benchmarks' limitations: 1) overemphasis on constraint compliance, neglecting multi-dimensional qualities like spatio-temporal cost; 2) datasets lacking real-world authenticity and coverage in key areas (e.g., lodging, transport); and 3) isolated daily plan assessments that miss critical details (e.g., the impact of daily accommodation and visit pacing) needed for entire plan's evaluation. To address this gap, we introduce TravelEval, a realistic and comprehensive benchmark. TravelEval features 1) a novel six-dimensional evaluation framework to holistically assess plans across accuracy, compliance, temporality, spatiality, economy, and utility dimensions; 2) a highly realistic data sandbox with precise accommodation pricing and authentic intercity transportation data; and 3) a simulation-based global evaluation method that emulates complete travel plans with API-integrated geographic information and fine-grained queuing time. Evaluating 12 mainstream approaches with TravelEval reveals several valuable insights, such that LLMs struggle with globally-optimized multi-dimensional planning (especially in spatio-temporal reasoning and budget compliance), and agentic reasoning strategies offer no consistent improvement. Concisely, TravelEval facilitates travel plan evaluation via grounded spatio-temporal emulation and comprehensive metrics, providing a robust foundation for advancing LLM-powered travel planning research and applications.

2026-06-02 13:00 JST arXiv cs.AI エージェント

MindClaw: 精密な介入のための閉ループの具体化された精神状態推論

Theory of Mind (ToM) を使用すると、エージェントは他のアクターの信念、目標、意図について推論することができます。これは人間中心の身体的支援に不可欠です。既存の ToM ベンチマークは高度なテキスト認識とマルチモーダルな精神状態認識を備えていますが、主にオフラインの質問応答や最終的な行動の予測を評価します。これらは、具体化されたエージェントが変化する環境とのつながりを維持できるかどうか、行為者固有の信念を更新できるかどうか、推論が必要な場合を判断できるかどうか、助けが役立つ場合にのみ介入できるかどうかを完全にテストしていません。 MindPower を基盤として、ロボット中心の ToM 推論をリアルタイムの閉ループ設定に拡張し、精密な介入を伴う身体化された精神状態推論のためのフレームワークである MindClaw を導入します。 MindClaw は、マルチソース入力、信念記憶、身体化された認知トリガースキル、精神的推論、およびアクション生成を接続し、エージェントが介入が不要な場合は沈黙を保ちながら、適切なタイミングで役立つアクションを出力できるようにします。実験によれば、直接的な VLM ベースラインはタスクの認識と介入の調整に苦労する一方、MindClaw は最高の全体的なパフォーマンスを達成し、閉ループで組み込まれた ToM 支援におけるトリガースキルの最適化の重要性を示しています。

原文 (English)

MindClaw: Closed-Loop Embodied Mental-State Reasoning for Precision Intervention

Theory of Mind (ToM) enables an agent to reason about another actor's beliefs, goals, and intentions, which is essential for human-centered embodied assistance. Existing ToM benchmarks have advanced text and multimodal mental-state recognition, but they mostly evaluate offline question answering or final action prediction. They do not fully test whether an embodied agent can stay connected to a changing environment, update actor-specific beliefs, decide when reasoning is needed, and intervene only when help is useful. Building on MindPower, we extend robot-centric ToM reasoning to a real-time closed-loop setting and introduce MindClaw, a framework for embodied mental-state reasoning with precision intervention. MindClaw connects multi-source inputs, belief memory, an embodied cognitive trigger skill, mental reasoning, and action generation, allowing the agent to output helpful actions at the right time while remaining silent when intervention is unnecessary. Experiments show that direct VLM baselines struggle with task awareness and intervention calibration, while MindClaw achieves the best overall performance, demonstrating the importance of trigger-skill optimization for closed-loop embodied ToM assistance.

2026-06-02 13:00 JST arXiv cs.AI エージェント

CAREAgent: 構造化推論を備え、オーダー生成用にツールが統合された臨床エージェント

臨床オーダーの生成は、臨床上の意思決定と現実の診療との間の重要な橋渡しとして機能し、医療上の決定を具体的で実行可能なオーダーに変換します。既存のエージェントは主に粗い粒度の意思決定に重点を置いており、臨床オーダーに必要な細かい粒度の実行可能な情報を見落としています。このギャップに対処するために、臨床オーダー生成のためのエージェントである CAREAgent を提案します。そのトレーニングを支えるために、2 段階のエージェント推論データ構築手法を導入します。まず、現実的な臨床ツールの使用に沿った検証可能な推論軌跡を構築するエージェントフレームワークを設計します。次に、形式への準拠、オーダーの妥当性、臨床的な妥当性に基づいて推論軌跡をフィルタリングします。構築したデータに基づき、モデルはまず教師あり微調整によってトレーニングされ、基本的な推論形式と医学的知識を獲得します。その後、多次元の報酬関数を用いた強化学習によって最適化され、複雑な臨床推論能力が強化されます。複数のベンチマークでの実験により、CAREAgent の有効性が実証されています。ClinicalBench (トレーニング時には未使用) では、CAREAgent は F1 スコアを、単一エージェント手法、マルチエージェント手法、およびエージェント推論手法と比較して、それぞれ 5.05%、2.09%、0.86% 改善しました。

原文 (English)

CAREAgent: Clinical Agent with Structured Reasoning and Tool-Integrated for Order Generation

Clinical order generation serves as a critical bridge between clinical decision-making and real-world practice, translating medical decisions into concrete and executable orders. Existing agents mainly focus on coarse-grained decisions and overlook the fine-grained, executable information required for clinical orders. To address this gap, we propose CAREAgent, an agent for clinical order generation. To support its training, we introduce a two-stage agentic reasoning data construction method. First, we design an agent framework that constructs verifiable reasoning trajectories aligned with realistic clinical tool usage. Second, we filter reasoning trajectories by format compliance, order validity, and clinical plausibility. Building on the constructed data, the model is first trained via supervised fine-tuning to acquire fundamental reasoning formats and medical knowledge, and is subsequently optimized through reinforcement learning with multi-dimensional reward functions to enhance complex clinical reasoning capabilities. Experiments on multiple benchmarks demonstrate the effectiveness of CAREAgent. On ClinicalBench (unseen during training), CAREAgent improves the F1 score by 5.05%, 2.09%, and 0.86% over the single-agent, multi-agent, and agentic reasoning methods, respectively.

2026-06-02 13:00 JST arXiv cs.AI LLM/生成AI エージェント

SkillRevise: トレース条件付きスキルリビジョンによる LLM 作成エージェントスキルの向上

エージェントスキルは、LLM エージェントがワークフローを実行し、制約を検証し、障害から回復できるようにする手続き的な成果物です。既存の自己進化手法は、蓄積された軌跡を利用してスキルを改良します。しかし、初期の不完全なスキルしか利用できないコールドスタート環境では苦戦します。その結果、スキルの構築は専門家による作成か、LLM によるワンショット生成に頼ることになります。専門家が作成したスキルはコストが高く、LLM エージェントが実際にタスクを実行する方法と一致しない可能性があります。一方、ワンショットで生成されたスキルは、構文的には整っていても、挙動の面では弱い可能性があります。このギャップを埋めるために、私たちは、こうした初期スキルを反復的に改良するように設計された、実行に基づくフレームワークである SkillRevise を提案します。SkillRevise は、実行の証拠からスキルの欠陥を診断し、汎用メモリから関連する修復原則を検索し、実行結果に紐づいた編集を適用します。候補を再実行して経験的な有用性を測定することで、最適なスキルバージョンを体系的に保持します。3 つのベンチマークと 5 つの LLM で評価したところ、SkillRevise はワンショットのベースラインを大幅に上回り、SkillsBench でのベースエージェントの成功率を 36.05% から 61.63% に向上させました。さらに、改訂されたスキルは強いモデル間転移性を示し、モデル固有のアーティファクトではなく、一般化された手続き的知識を捉えています。

原文 (English)

SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision

Agent skills are procedural artifacts that enable LLM agents to execute workflows, verify constraints, and recover from failures. Existing self-evolving methods refine skills using accumulated trajectories. However, they struggle in cold-start settings, where only an initial, imperfect skill is available. Consequently, skill construction defaults to expert authoring or one-shot LLM generation. Expert-authored skills are costly and may not align with how LLM agents actually execute tasks, while one-shot generated skills can be syntactically well formed yet behaviorally weak. To bridge this gap, we propose SkillRevise, an execution-grounded framework designed to iteratively refine these initial skills. SkillRevise diagnoses skill defects from execution evidence, retrieves relevant repair principles from a general memory, and applies execution-anchored edits. By re-executing candidates and measuring empirical utility, it systematically retains the optimal skill version. Evaluated across three benchmarks and five LLMs, SkillRevise substantially outperforms one-shot baselines, improving the base agent's success rate on SkillsBench from 36.05% to 61.63%. Furthermore, the revised skills exhibit strong cross-model transferability, capturing generalized procedural knowledge over model-specific artifacts.

2026-06-02 13:00 JST arXiv cs.AI エージェント

「スキルの問題」: レイクハウスエージェントのデータ中心の最適化

コーディングエージェントはデータインフラストラクチャのユーザーになりつつありますが、その成功はモデルの品質だけでなく、エージェントにシステムの使い方を教えるスキルと環境ファイルにも依存します。私たちは、ブランチ機能を備えたレイクハウスである Bauplan 上で動作するエージェント向けに、これらのアーティファクトを最適化する方法を研究します。私たちの設定では、ヘッドレス API と Git のようなデータプリミティブが、コード、ブランチ、コミット、マージを通じてデータワークフローを公開します。私たちの中心的な観察は、ブランチ可能なレイクハウスによって、データエージェントの評価が出力マッチングの問題から状態検証の問題に変わるということです。エージェントが生成したパイプラインコードは、具体的で検査可能なレイクハウスの変更を引き起こすためです。私たちは、タスクと検証器のペアを生成し、隔離されたサンドボックスで候補スキルを実行し、トレースレベルのシグナルとレイクハウスの状態に対するプログラムによるチェックの両方を用いて軌跡をスコアリングする、データ中心の最適化パイプラインを紹介します。25 のタスクに関する予備評価では、最適化されたスキルにより精度が 31.9% 向上しました。これらの結果は、書き込みパスのデータワークフローが、読み取り専用タスクを超えてエージェントのスキルを最適化するための有用な基盤を提供することを示唆しています。

原文 (English)

"Skill issues": data-centric optimization of lakehouse agents

Coding agents are becoming users of data infrastructure, but their success depends not only on model quality: it also depends on the skills and environment files that teach agents how to use a system. We study how to optimize these artifacts for agents operating on a branching lakehouse, Bauplan. In our setting, headless APIs and Git-like data primitives expose data workflows through code, branches, commits, and merges. Our central observation is that a branching lakehouse turns data-agent evaluation from an output-matching problem into a state-verification problem: agent-generated pipeline code induces concrete, inspectable lakehouse changes. We present a data-centric optimization pipeline that generates task-verifier pairs, executes candidate skills in isolated sandboxes, and scores trajectories using both trace-level signals and programmatic checks over lakehouse state. In a preliminary evaluation on 25 tasks, optimized skills improve accuracy by 31.9%. These results suggest that write-path data workflows provide a useful substrate for optimizing agent skills beyond read-only tasks.
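The shift from output matching to state verification described above can be sketched with a toy model. Everything here is a hypothetical stand-in: `LakehouseState`, `verify_clean_orders`, and `run_candidate` are illustrative names, the nested dictionary only imitates branches and tables, and none of it reflects Bauplan's real API. The point is that the verifier inspects the resulting state rather than comparing the agent's textual output against a reference.

```python
from dataclasses import dataclass, field


@dataclass
class LakehouseState:
    """Toy stand-in for lakehouse state: branch name -> table name -> list of rows."""
    branches: dict = field(default_factory=dict)


def verify_clean_orders(state):
    """Programmatic check: main holds a non-empty clean_orders table with no null amounts."""
    rows = state.branches.get("main", {}).get("clean_orders")
    return bool(rows) and all(r.get("amount") is not None for r in rows)


def run_candidate(state):
    """Placeholder for an agent run: write on an isolated branch, then merge into main."""
    raw = state.branches["main"]["orders"]
    cleaned = [r for r in raw if r["amount"] is not None]
    state.branches["agent_branch"] = {"clean_orders": cleaned}
    state.branches["main"]["clean_orders"] = cleaned  # simulated merge


state = LakehouseState(
    {"main": {"orders": [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}]}}
)
before = verify_clean_orders(state)
run_candidate(state)
after = verify_clean_orders(state)
print(before, after)
```

In a scoring pipeline like the one the abstract outlines, a check of this kind would presumably be combined with trace-level signals, since a run can reach the right state through an undesirable path.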

2026-06-02 13:00 JST arXiv cs.AI LLM/生成AI エージェント

LLM エージェントは長期的な組織ダイナミクスを維持できるか?

大規模な言語エージェントは社会シミュレーションにますます使用されていますが、目標が階層を通じて伝達され、タスクが事前の実行に依存し、成果物が長い期間にわたって蓄積される構造化された組織において、言語エージェントが一貫した動作を維持できるかどうかは依然として不明です。私たちは、長期的な組織シミュレーションをメモリ中心の調整問題として定式化し、TaskWeave を導入します。TaskWeave は、定式化、パーティション、診断、整列サイクルを通じて計画状態を維持し、依存関係を認識したトレースメモリを通じて実行を基盤とする階層型エージェントフレームワークです。私たちは、1 年間にわたる IT 企業シミュレーションで TaskWeave を評価し、組織の一貫性、実行基盤、下流のエンタープライズ NLP ユーティリティに関して他のマルチエージェントフレームワークと比較しました。実験では、TaskWeave が、根拠のある成果物を生成し、外部環境に適応しながら、一貫性のある長期的な組織ダイナミクスをサポートしていることが示されています。これらの発見は、構造化シミュレーションメモリが信頼性の高い LLM ベースの組織シミュレータを構築するための重要なメカニズムであることを示唆しています。

原文 (English)

Can LLM Agents Sustain Long-Horizon Organizational Dynamics?

Large language agents are increasingly used for social simulation, yet it remains unclear whether they can sustain coherent behavior in structured organizations, where goals must propagate through hierarchy, tasks depend on prior execution, and artifacts accumulate over long horizons. We formulate long-horizon organizational simulation as a memory-centered coordination problem and introduce TaskWeave, a hierarchical agentic framework that maintains planning states through a Formulate-Partition-Diagnose-Align cycle and grounds execution through dependency-aware trace memory. We evaluate TaskWeave in a year-long IT company simulation and compare it with other multi-agent frameworks on organizational coherence, execution grounding, and downstream enterprise NLP utility. Experiments show that TaskWeave supports coherent and long-horizon organizational dynamics while producing grounded artifacts and adapting to external environments. These findings suggest that structured simulation memory is a key mechanism for building reliable LLM-based organizational simulators.

2026-06-02 13:00 JST arXiv cs.AI エージェント

HomeFlow: 検証可能なシミュレーションによるスマートホームエージェントトレーニング用のデータフライホイール

大規模言語モデルエージェントは、テキストのみの対話を超えて、スマートホームを代表的な領域として、物理世界の制御へと移行しています。実際の家庭内の交流には、あいまいな意図を理解し、動的な環境で動作し、複数ターンの推論を実行する必要があります。ただし、既存の方法では、スマートホームエージェント用の高品質のトレーニングデータを生成するのが困難です。私たちは、このドメインの検証可能なデータフライホイールである HomeFlow を提案します。 HomeFlow は、統合シミュレーション環境として HomeEnv を使用し、HomeMaker を使用してさまざまなホーム設定を手続き的に生成します。その後、Blueprint がオープンエンドのユーザーインテントを実行可能な状態ベースの成功条件にコンパイルし、MCTS-Flow が環境ガイド付きツリー検索を通じて多様で検証可能なマルチターンの軌跡を合成します。次に、教師あり微調整と段階的 RLVE によってエージェントを最適化し、本物の物理的フィードバックによる反復的な改善を促進します。さらに、SmartHome-Bench を構築して、さまざまなスマートホームタスクにわたってエージェントを評価します。このベンチマークでは、HomeFlow-RL-4B と HomeFlow-RL-8B は 84.60% と 87.03% のタスク成功率を達成しています。 HomeFlow-RL-8B が主要な GPT-5.5 を 1.23 パーセントポイントも上回っていることは注目に値します。

原文 (English)

HomeFlow: A Data Flywheel for Smart Home Agent Training with Verifiable Simulation

Large language model agents are moving beyond text-only interaction toward physical-world control, with smart homes as a representative domain. Real domestic interaction requires understanding ambiguous intents, operating in dynamic environments, and performing multi-turn reasoning. However, existing methods struggle to generate high-quality training data for smart home agents. We propose HomeFlow, a verifiable data flywheel for this domain. HomeFlow uses HomeEnv as a unified simulation environment and HomeMaker to procedurally generate diverse home settings. Subsequently, Blueprint compiles open-ended user intents into executable state-based success conditions, while MCTS-Flow synthesizes diverse, verifiable multi-turn trajectories through environment-guided tree search. We then optimize the agents via supervised fine-tuning and step-wise RLVE, which facilitates iterative improvement through authentic physical feedback. We further construct SmartHome-Bench to evaluate the agent across various smart home tasks. On this benchmark, HomeFlow-RL-4B and HomeFlow-RL-8B achieve task success rates of 84.60% and 87.03%. It is worth noting that HomeFlow-RL-8B even surpasses the leading GPT-5.5 by 1.23 percentage points.

2026-06-02 13:00 JSTarXiv cs.AIエージェント研究/論文

ANDES: 自律的な命令アラインメントのためのエージェントネイティブなデータ進化型合成ツール

AI エージェントは、AI 研究自体、特にベース LLM をアラインメント済みのアシスタントに変換する重要なポストトレーニング段階の自動化を任されることが増えています。しかし、最近の評価では、フロンティアエージェントでさえこのタスクの遂行に苦労していることが明らかになっています。ポストトレーニングの成功は基本的に高品質なデータの取得にかかっていますが、オープン Web から目的に合ったトレーニングデータセットをエージェントに自律的にキュレーションさせることには深刻な課題があります。ノイズの多い Web 環境でデータの検索、フィルタリング、バランシングという長期的なタスクを実行すると、エージェントの限られたコンテキストが圧迫されることが多く、最終的にはデータセットの品質が低下し、下流のトレーニング性能も最適に届きません。このギャップを埋めるために、データ生成をプラグアンドプレイの「エージェントスキル」として捉え直すフレームワークである Andes (Agent Native Data Evolving Synthesis) を導入します。Andes は、エージェントに複雑なデータ収集戦略を一から考案させるのではなく、インテリジェントな抽象化レイヤーを提供します。自己進化する World Tree ルーティングメカニズムと実用的な診断レポートを活用することで、トレーナーエージェントが対話型の閉ループインターフェイスを通じてデータ合成を動的に操作できるようになります。厳しい計算制約の下で、基礎能力の低いエージェントに Andes を装備すると自動アラインメントが改善され、PostTrainBench で最先端の性能と堅牢なクロスタスク汎化が得られることを実証します。私たちのプロジェクトは https://github.com/zzy1127/ANDES で入手できます。

原文 (English)

ANDES: Agent Native Data Evolving Synthesis Tool for Autonomous Instruction Alignment

AI agents are increasingly being tasked with automating AI research itself, particularly the critical post-training phase that transforms base LLMs into aligned assistants. However, recent evaluations reveal that even frontier agents struggle to perform this task. While the success of post-training fundamentally relies on acquiring high-quality data, relying on agents to autonomously curate targeted training datasets from the open web introduces severe challenges. Executing the long-horizon tasks of searching, filtering, and balancing data within noisy web environments frequently overwhelms an agent's limited context, ultimately leading to degraded dataset quality and suboptimal downstream training performance. To bridge this gap, we introduce Andes (Agent Native Data Evolving Synthesis), a framework that reimagines data generation as a plug-and-play "agent skill". Rather than forcing agents to devise complex data-gathering strategies from scratch, Andes provides an intelligent abstraction layer. By leveraging a self-evolving World Tree routing mechanism and actionable diagnostic reports, it allows trainer agents to dynamically steer data synthesis through an interactive, closed-loop interface. We demonstrate that under strict compute constraints, equipping foundationally weaker agents with Andes improves automated alignment, securing state-of-the-art performance on PostTrainBench and robust cross-task generalization. Our project is available at https://github.com/zzy1127/ANDES.

2026-06-02 13:00 JSTarXiv cs.AIエージェント

SkillSmith: 自己改善エージェントシステムのための共進化スキルとツール

最近の自己進化エージェントは、実行を通じてスキルを発見、洗練、蓄積できることを示しています。ただし、既存のスキル進化フレームワークは通常、固定されたツール層を前提とし、各スキルを個別に評価するため、ツールレベルの障害を修復したり、スキル間の相互作用を推論したりする能力が制限されています。私たちは、相乗効果を考慮したスキルとツールの共進化フレームワークである SkillSmith を提案します。SkillSmith は統一的な提案空間を導入し、その中でリフレクションがスキルとツールを同時に変更するアトミックなバンドルを生成します。これにより、スキルの進化によって再利用可能な機能のギャップが特定された場合に、ツールをラップ、編集、合成、分割、または廃止できます。この共同探索を導くために、SkillSmith はロトカ・ヴォルテラ力学に着想を得た生態学的効用モデルを維持します。このモデルでは、実行トレースから推定された相互作用行列がスキル間のペアごとの相補性と競合を捉え、検索、変異の優先順位付け、廃止のための圧力シグナルを提供します。さらに、SkillSmith は失敗シグネチャ、原因の帰属、対処法を含むアンチパターンを記録し、診断を迅速化するとともに、既知の間違いを繰り返す提案を拒否します。WildClawBench を含む 3 つのベンチマークと 5 つの Qwen3.5 モデルスケールでの実験では、SkillSmith が一貫して強力なベースラインを上回り、タスクの複雑さと複数スキルの同時アクティベーションが増すほど改善幅が大きくなることが示されました。

原文 (English)

SkillSmith: Co-Evolving Skills and Tools for Self-Improving Agent Systems

Recent self-evolving agents have shown that skills can be discovered, refined, and accumulated through execution. However, existing skill-evolution frameworks typically assume a fixed tool layer and evaluate each skill independently, limiting their ability to repair tool-level failures or reason about interactions among skills. We propose SkillSmith, a synergy-aware skill-tool co-evolution framework. SkillSmith introduces a unified proposal space in which reflection produces atomic bundles that jointly modify skills and tools, allowing tools to be wrapped, edited, composed, split, or retired when skill evolution identifies a reusable capability gap. To guide this joint search, SkillSmith maintains an ecological utility model inspired by Lotka-Volterra dynamics, where an interaction matrix estimated from execution traces captures pairwise complementarity and conflict among skills and provides pressure signals for retrieval, mutation prioritization, and retirement. Furthermore, SkillSmith records anti-patterns, including failure signatures, causal attributions, and remedies, to accelerate diagnosis and veto proposals that repeat known mistakes. Experiments on three benchmarks, including WildClawBench, and five Qwen3.5 model scales show that SkillSmith consistently outperforms strong baselines, with gains that amplify as task complexity and multi-skill co-activation increase.
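As a rough illustration of how an interaction matrix can yield per-skill pressure signals, the sketch below uses the per-capita growth term of a generalized Lotka-Volterra system, `r_i + sum_j A[i][j] * x_j`. This is an assumption about the general shape of such a model, not SkillSmith's actual utility function. The function names, the threshold, and all numbers are hypothetical.

```python
from typing import List


def lv_pressure(growth: List[float],
                interaction: List[List[float]],
                usage: List[float]) -> List[float]:
    """Per-skill growth term of a generalized Lotka-Volterra system.

    pressure_i = r_i + sum_j A[i][j] * x_j, where A[i][j] > 0 means skill j
    complements skill i and A[i][j] < 0 means skill j conflicts with it.
    """
    n = len(growth)
    return [
        growth[i] + sum(interaction[i][j] * usage[j] for j in range(n))
        for i in range(n)
    ]


def retirement_candidates(pressure: List[float], threshold: float = 0.0) -> List[int]:
    """Indices of skills whose pressure falls below the (hypothetical) threshold."""
    return [i for i, p in enumerate(pressure) if p < threshold]


# Illustrative values only: skills 0 and 1 complement each other,
# while skill 2 is crowded out by both.
growth = [0.2, 0.1, 0.1]
interaction = [
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [-0.4, -0.4, 0.0],
]
usage = [0.5, 0.5, 0.5]

pressure = lv_pressure(growth, interaction, usage)
print(retirement_candidates(pressure))
```

In this toy setting the conflicting skill ends up with negative pressure and is flagged for retirement. In the abstract's description, the same kind of signal also informs retrieval and mutation prioritization.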

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

オーケストレーターを認識する: LLM マルチエージェントシステムのエントロピーダイナミクスの視点

シングルターンモデルからマルチエージェントシステム (MAS) への移行により、問題解決機能の強化が期待できますが、集中型オーケストレーショントポロジは依然として脆弱性の重要な点です。これを分析するために、タスク解決と累積的なコンテキスト読み込みの競合する力によって支配されるシステムとしてオーケストレーションプロセスをモデル化する、平均場エントロピーダイナミクスフレームワークを提案します。検証を容易にするために、高密度の中間チェックポイントを備えたプロセス検証可能で複雑性の高いベンチマークを合成するマルチエージェントパイプラインである逆ワークフロー生成 (IWG) を導入します。私たちは、エントロピー力学モデルが経験的な軌跡に適合し、システムの安定性とパフォーマンスの崩壊を定量化する物理的に解釈可能なパラメーターを提供することを実証します。重要なことに、私たちの分析は「推論の罠」を明らかにしました。推論を重視したモデルは、孤立したタスクでは優れていますが、コンテキストの圧迫によりオーケストレーターとしては失敗することがよくあります。オーケストレーターの基礎となる物理メカニズムを解明し、システムの不確実性を定量化することで、MAS のアーキテクチャ設計に関する洞察が得られます。

原文 (English)

Recognize Your Orchestrator: An Entropy Dynamics Perspective for LLM Multi-Agent Systems

The transition from single-turn models to Multi-Agent Systems (MAS) promises enhanced problem-solving capabilities, yet the centralized orchestration topology remains a critical point of fragility. To analyze this, we propose a Mean-Field Entropy Dynamics framework, modeling the orchestration process as a system governed by the competing forces of task resolution and cumulative context loading. To facilitate validation, we introduce Inverse Workflow Generation (IWG), a multi-agent pipeline that synthesizes process-verifiable, high-complexity benchmarks with dense intermediate checkpoints. We demonstrate that our entropy dynamics model fits empirical trajectories, providing physically interpretable parameters that quantify system stability and performance collapse. Crucially, our analysis uncovers a "Reasoning Trap": while reasoning-heavy models excel in isolated tasks, they frequently fail as orchestrators due to context squeezing. Elucidating the physical mechanisms underlying the orchestrator and quantifying systemic uncertainty offers insights for the architectural design of MAS.

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

障害を認識した可観測性によるマルチエージェント LLM システムの無駄な計算の早期診断

ツールを使用するマルチエージェント大規模言語モデル (LLM) システムは、応答を生成する前に、モデルトークン、ツール呼び出し、再試行、コード実行による計算を費やします。実行が失敗した場合、最終応答の評価によって終点が明らかになりますが、通常は、軌道が回復可能な進行を停止した時点ではありません。このペーパーでは、マルチエージェント LLM トレースにおける無駄な計算を診断するための障害認識可観測性フレームワークを紹介します。このフレームワークは、ツールの信頼性、実行の回復、オーケストレーションループ、証拠の可用性、情報の変更、予算のプレッシャーなど、繰り返し発生する障害モードをオンライントレース信号にマッピングします。 3 エージェントの質問応答システムでフレームワークをインスタンス化し、同一の実行上限の下で 165 の GAIA 検証トレースで評価します。運用上の失敗は依然として一般的です。レベル 1 の実行は 22/53 回、レベル 2 の実行は 33/86 回、レベル 3 の実行は 12/26 回で、使用可能な最終応答を生成できませんでした。トレースは、不十分な証拠、反復アクションループ、最大ステップ終了、ツール失敗の連続発生、有用な出力なしで成功する実行呼び出しなど、これらの結果の背後にあるさまざまなメカニズムを明らかにします。平均トークン使用量はレベル 1 の 8,152 トークンからレベル 3 の 16,389 トークンに増加しますが、証拠の入手可能性と文レベルのサポートは異なります。キャッシュされた 10 トレースの LLM ジャッジグラウンディング監査により、安価なオンラインシグナルとより深いセマンティックメトリクスが相補的な障害層を捉えていることがわかります。その結果、障害を認識する可観測性は、生の実行ログと最終応答の精度の間の診断レイヤーとして位置付けられます。

原文 (English)

Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability

Tool-using multi-agent large language model (LLM) systems spend computation through model tokens, tool calls, retries, and code execution before producing an answer. When a run fails, final-answer evaluation reveals the endpoint but usually not the point at which the trajectory stopped making recoverable progress. This paper introduces a failure-aware observability framework for diagnosing wasted computation in multi-agent LLM traces. The framework maps recurring failure modes to online trace signals, including tool reliability, execution recovery, orchestration loops, evidence availability, information change, and budget pressure. We instantiate the framework in a three-agent question-answering system and evaluate it on 165 GAIA validation traces under identical execution caps. Operational failures remain common: 22/53 level-1 runs, 33/86 level-2 runs, and 12/26 level-3 runs fail to produce a usable final answer. The traces expose different mechanisms behind these outcomes, including insufficient evidence, repeated-action loops, max-step termination, tool-failure streaks, and execution calls that succeed without useful output. Mean token use rises from 8,152 tokens at level 1 to 16,389 tokens at level 3, while evidence availability and sentence-level support diverge. A cached 10-trace LLM-judge grounding audit shows that cheap online signals and deeper semantic metrics capture complementary layers of failure. The results position failure-aware observability as a diagnostic layer between raw execution logs and final-answer accuracy.
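The per-level failure counts above can be turned into rates with a few lines of arithmetic. The sketch below does that and adds one deliberately simple example of a cheap online trace signal, a repeated-action check. The counts come from the abstract. The `has_repeated_action_loop` helper, its `window` parameter, and the action names are hypothetical and are not the paper's actual signal definitions.

```python
from typing import Dict, List, Tuple

# (failed runs, total runs) per GAIA level, as reported in the abstract.
FAILED_BY_LEVEL: Dict[int, Tuple[int, int]] = {1: (22, 53), 2: (33, 86), 3: (12, 26)}


def failure_rate(failed: int, total: int) -> float:
    """Fraction of runs that produced no usable final answer."""
    return failed / total


total_failed = sum(failed for failed, _ in FAILED_BY_LEVEL.values())
total_runs = sum(total for _, total in FAILED_BY_LEVEL.values())

for level, (failed, total) in FAILED_BY_LEVEL.items():
    print(f"level {level}: {failed}/{total} = {failure_rate(failed, total):.1%}")
print(f"overall: {total_failed}/{total_runs} = {failure_rate(total_failed, total_runs):.1%}")


def has_repeated_action_loop(actions: List[str], window: int = 3) -> bool:
    """Hypothetical online signal: True if the last `window` actions are identical."""
    if window < 2 or len(actions) < window:
        return False
    tail = actions[-window:]
    return all(action == tail[0] for action in tail)
```

The totals add up to the 165 traces mentioned in the abstract (53 + 86 + 26). Roughly two in five runs fail at every level, so the failure rate alone does not separate the levels the way mean token use does.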

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

信頼性の高いツール拡張大規模言語モデルシステムのための自己修復エージェントオーケストレーター

ツール拡張大規模言語モデル (LLM) エージェントは、計画、検索、ツールの呼び出し、検証、メモリ、回復を調整するオーケストレーション層に依存します。これらのシステムでは、障害はモデルエラーだけでなく、ツールのタイムアウト、不正な形式の引数、古いコンテキスト、矛盾した証拠、再試行ループ、未検証の中間出力などのオーケストレーションレベルの問題からも発生します。本論文では、信頼性を有界なランタイム制御問題として扱う自己修復エージェントオーケストレーターを提示します。オーケストレーターは、観測可能な障害シグナルを推定された障害クラスにマッピングし、明示的な予算の下で的を絞った回復アクションを選択し、回復した軌跡を検証して、可観測性トレースを記録します。静的ワークフロー、再試行のみ、ReAct スタイル、完全な再計画の各ベースラインに対して、100 タスクの制御されたフォールト注入ベンチマークでこのアプローチを評価します。自己修復のタスク成功率は 98.8% に達し、再試行のみの 94.5%、完全な再計画の 93.8% を上回ります。回復予算を揃えたスイープでは、テストしたすべての予算で自己修復が再試行のみと完全な再計画を上回り、回復試行が 1 回の場合に差が最大となります (94.0% に対し、それぞれ 85.3% と 88.2%)。制御されたセマンティックなサイレント障害の設定では、検証器に導かれた自己修復によりサイレント障害が 0.0% に減少しますが、検証を行わないベースラインは、誤っているがもっともらしい出力をより頻繁に返します。小規模なモデルインザループ検証では、ライブのツール呼び出しモデルがローカルのフォールト注入ツール上でツール選択、引数生成、応答合成を行う場合にも、同じ回復メカニズムが動作することが示されています。これらの結果は、障害を認識し、予算を考慮し、検証に導かれたオーケストレーションが、ツール拡張 LLM システムの信頼性と診断可能性を向上させるという、統制された証拠を提供します。

原文 (English)

Self-Healing Agentic Orchestrators for Reliable Tool-Augmented Large Language Model Systems

Tool-augmented large language model (LLM) agents rely on orchestration layers that coordinate planning, retrieval, tool invocation, validation, memory, and recovery. In these systems, failures arise not only from model errors, but also from orchestration-level issues such as tool timeouts, malformed arguments, stale context, contradictory evidence, retry loops, and unverified intermediate outputs. This paper presents a self-healing agentic orchestrator that treats reliability as a bounded runtime control problem. The orchestrator maps observable failure signals to inferred failure classes, selects targeted recovery actions under explicit budgets, verifies recovered trajectories, and records observability traces. We evaluate the approach on a 100-task controlled fault-injection benchmark against static workflow, retry-only, ReAct-style, and full-replanning baselines. Self-healing achieves 98.8% task success, compared with 94.5% for retry-only and 93.8% for full replanning. A matched recovery-budget sweep shows that self-healing outperforms retry-only and full replanning at every tested budget, with the largest gap under a single recovery attempt: 94.0% versus 85.3% and 88.2%, respectively. Under a controlled semantic silent-failure setting, verifier-guided self-healing reduces silent failures to 0.0%, while non-verifying baselines return wrong-but-plausible outputs more often. A compact model-in-the-loop validation shows that the same recovery mechanism can operate when a live tool-calling model performs tool selection, argument generation, and answer synthesis over local fault-injected tools. These results provide controlled evidence that failure-aware, budgeted, and verification-guided orchestration improves reliability and diagnosability in tool-augmented LLM systems.
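The control loop described above (signal, inferred failure class, targeted recovery action, verification, trace) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's orchestrator. The signal names, failure classes, recovery actions, and the `step`/`verify` callables are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical mappings: observable signal -> failure class -> recovery action.
FAILURE_CLASS: Dict[str, str] = {
    "timeout": "tool_unavailable",
    "schema_error": "malformed_arguments",
    "verifier_reject": "unverified_output",
}
RECOVERY_ACTION: Dict[str, str] = {
    "tool_unavailable": "switch_tool",
    "malformed_arguments": "repair_arguments",
    "unverified_output": "regenerate_with_evidence",
}

StepFn = Callable[[str], Tuple[Optional[str], Optional[str]]]  # action -> (output, signal)


def run_with_self_healing(step: StepFn,
                          verify: Callable[[str], bool],
                          budget: int) -> Tuple[Optional[str], List[dict]]:
    """Run one task with at most `budget` recovery attempts; return (output, trace)."""
    trace: List[dict] = []
    action = "initial"
    attempts_left = budget
    while True:
        output, signal = step(action)
        if signal is None and output is not None and verify(output):
            trace.append({"action": action, "status": "ok"})
            return output, trace
        if signal is None:
            # No explicit error was raised, but the output failed verification.
            signal = "verifier_reject"
        failure_class = FAILURE_CLASS.get(signal, "unknown")
        trace.append({"action": action, "signal": signal,
                      "class": failure_class, "status": "failed"})
        if attempts_left == 0 or failure_class == "unknown":
            return None, trace  # budget exhausted or no targeted recovery known
        attempts_left -= 1
        action = RECOVERY_ACTION[failure_class]


def demo_step(action: str) -> Tuple[Optional[str], Optional[str]]:
    """Toy environment: the first call times out, switching tools succeeds."""
    if action == "initial":
        return None, "timeout"
    return "42", None


result, trace = run_with_self_healing(demo_step, lambda out: out == "42", budget=1)
print(result, [entry["status"] for entry in trace])
```

Treating a verifier rejection as a failure signal of its own is what lets such a loop catch wrong-but-plausible outputs, the silent failures the abstract says non-verifying baselines return more often. The explicit budget keeps recovery bounded.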

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

曖昧さへのダイブ: LLM プロンプトに対する A* に着想を得たマルチエージェント常識難読化攻撃

大規模言語モデル (LLM) は、推論や知識集約的なタスクには優れていますが、意図を保持したまま常識に関する幻覚を引き起こすプロンプトレベルの敵対的攻撃に対しては依然として脆弱です。LLM は事実の信頼性が譲れない安全性重視の領域に急速に統合されているため、この脆弱性への対処は急務です。既存の攻撃方法は効率性に欠けるか、現実世界の攻撃者の適応戦略を捉えることができません。私たちは、意味的に整合していながら難読化されたプロンプトを生成するための、A* に着想を得た事実エラー誘導フレームワークを提案します。その核心となるのは、動的な意味分散係数 $\gamma$ に導かれた階層的書き換え戦略であり、逆シミュレーテッドアニーリングのスケジュールに従って、初期の保守的な編集と後期の積極的な難読化のバランスをとります。解釈可能性を高めるために、敵対的なメカニズムを発見して洗練し、解釈可能な逆最適化を提供するエージェントメカニズムラベリングも導入します。理論的には、プロンプトの書き換えが縮小的な漸化式に従い、$\gamma$ が減少するにつれて意味の崩壊につながることを証明します。経験的には、さまざまな LLM にわたって、私たちの方法は網羅的な探索よりも少ない試行回数で高い攻撃成功率を達成し、効率と有効性の両方を実証しています。

原文 (English)

Dive into Ambiguity: A*-Inspired Multi-Agents Commonsense Obfuscation Attack on LLM Prompts

Large language models (LLMs) excel in reasoning and knowledge-intensive tasks but remain vulnerable to prompt-level adversarial attacks that preserve intent while triggering commonsense hallucinations. This vulnerability is urgent, as LLMs are rapidly integrated into safety-critical domains where factual reliability is non-negotiable. Existing attack methods either lack efficiency or fail to capture the adaptive strategies of real-world adversaries. We propose an A*-inspired Factual Error Induction Framework, a framework for generating semantically aligned yet obfuscated prompts. At its core is a Hierarchical Rewrite Strategy guided by a dynamic semantic dispersion coefficient $\gamma$ that balances conservative edits early with aggressive obfuscations later, following a reverse simulated annealing schedule. To enhance interpretability, we further introduce Agentic Mechanism Labeling, which discovers and refines adversarial mechanisms, offering interpretable reverse optimization. Theoretically, we prove that prompt rewriting follows a contractive recurrence, leading to semantic collapse as $\gamma$ decreases. Empirically, across diverse LLMs, our method achieves higher attack success rates than exhaustive exploration while requiring fewer attempts, demonstrating both efficiency and effectiveness.

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic Artificial Intelligence

Scientific discovery is not only answer generation but revision of the representational regime in which evidence, artifacts, operations, an…

2026-06-02 13:00 JSTarXiv cs.AIエージェント

Joint Agent Memory and Exploration Learning via Novelty Signals

In open-ended environments, exploration is fundamental for autonomous agents, yet current language model agents struggle with this. Effecti…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント研究/論文

RoleCDE: Benchmarking and Mitigating Role-Alignment Trade-offs in Role-Playing Agents

Role-playing agents (RPAs) are widely used to steer large language models (LLMs) toward role-consistent behavior, yet existing benchmarks mai…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

ReSkill: Reconciling Skill Creation with Policy Optimization in Agentic RL

Agentic reinforcement learning (RL) enables LLM agents to improve continuously from environment rewards, yet the resulting policies do not…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

MobEvolve: An Agentic Self-Evolving Heuristic System for Interpretable Human Mobility Generation

Human mobility generation aims to synthesize realistic trip chains for target populations based on individual features. Existing paradigms,…

2026-06-02 13:00 JSTarXiv cs.AIエージェント

Characterization of Multi-Model Agentic AI Systems on General Tasks via Trace-Driven Simulation

Agentic AI completes tasks through iterative planning, tool use, and reasoning based on observed outcomes. Despite its popularity, its syst…

2026-06-02 13:00 JSTarXiv cs.AIエージェント

OctoT2I: A Self-Evolving Agentic Text-to-Image Router

The explosive growth of Text-to-Image (T2I) models, from large-scale versions to lightweight, real-time ones, now faces diminishing margina…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

CAPF: Guiding Search-Agent Rollouts with Credit-Attenuated Privileged Feedback

Recent LLM search agents use reinforcement learning with verifiable rewards (RLVR) to learn search-augmented reasoning from outcome rewards…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Absorbing Complexity: An Interaction-Native Knowledge Harness for Financial LLM Agents

Financial AI agents often fail for a simple reason: they make users carry the complexity. A user must repeatedly restate goals, risk prefer…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント研究/論文

SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes

Smart homes are evolving toward complex state-dependent living environments, requiring Large Language Models (LLMs) to reason over user int…

2026-06-02 13:00 JSTarXiv cs.AIエージェント研究/論文

AutoMedBench: Agentic AI モデルによる医療自動研究に向けて

自律エージェントは、個別の予測タスクや短い形式の臨床質問応答を超えて、エンドツーエンドの医療 AI 研究ワークフローを支援することがますます期待されています。ただし、既存の医療エージェントのベンチマークは主に最終出力を評価しており、研究プロセス内でのエージェントの挙動はほとんど可視化されません。このギャップに対処するために、私たちは AutoMedBench を提示します。これは、さまざまな医療画像処理およびマルチモーダル推論タスクにわたる自律型医療 AI 研究のためのワークフロー認識型ベンチマークであり、エージェントの実行を統一された 5 段階のワークフロー (S1 ～ S5: 計画、セットアップ、検証、推論、提出) に整理します。1 回の実行あたり平均 33 エージェントターンの長期タスクで構成され、セグメンテーション、画像強調、視覚的質問応答 (VQA)、レポート生成、病変検出の 5 つの研究トラックにまたがります。各タスクは Lite と Standard の 2 つの難易度で評価されます。両者は同じデータとメトリクスを使用しますが、タスク概要で与えられるスキャフォールディングの量が異なります。各実行は最終タスクの性能と S1 ～ S5 のステージスコアの両方で採点され、最初のタスク概要から最終的に提出された成果物までのステージレベルの分析が可能になります。記録された数千件の実行にわたるステージレベルの採点では、平均して検証 (Validate) が最も弱いワークフローステージであり、セットアップ (Setup) が最も強いことが明らかになりました。これは、現在のエージェントがパイプラインの信頼性を検証することよりも、パイプラインを実行可能にすることに長けていることを示唆しています。さらに、実行後のエラー分析では、検証と提出の失敗がタグ付けされたエラーの大部分を占め、発火したエラーコードのそれぞれ 37.7% と 38.1% に当たるのに対し、タスク理解のエラーは 0.9% とまれであること、またエラーコードが 1 つ発火した実行は、エラーコードのない実行よりも全体スコアが平均 48% 低いことが示されています。

原文 (English)

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

Autonomous agents are increasingly expected to support end-to-end medical-AI research workflows, moving beyond isolated prediction tasks or short-form clinical question answering. However, existing medical agent benchmarks primarily evaluate final outputs, providing limited visibility into agent behavior within the research process. To address this gap, we present AutoMedBench, a workflow-aware benchmark for autonomous medical-AI research across diverse medical imaging and multimodal inference tasks, organizing agent execution into a unified five-stage workflow (S1-S5): Plan, Setup, Validate, Inference, and Submit. It comprises long-horizon tasks with each run averaging 33 agent turns, spanning five research tracks: segmentation, image enhancement, visual question answering (VQA), report generation, and lesion detection. Each task is evaluated under two difficulty tiers, Lite and Standard, which use the same data and metrics but differ in the amount of task-brief scaffolding, and each run is scored using both final task performance and S1-S5 stage scores, enabling stage-level analysis from the initial task brief to the final submitted artifact. Across thousands of recorded runs, stage-level scoring reveals that Validate is the weakest workflow stage on average, whereas Setup is the strongest, suggesting that current agents are better at making pipelines executable than at verifying their reliability. Post-run error analysis further shows that verification and submission failures dominate tagged errors, accounting for 37.7% and 38.1% of fired codes respectively, whereas task-understanding errors are rare at 0.9%, and runs with one fired error code have a 48% lower overall score than runs with no error code on average.

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning

As Large Language Model (LLM) agents increasingly leverage the Model Context Protocol (MCP) to operate in complex environments, the expansi…

2026-06-02 13:00 JSTarXiv cs.AIエージェントビジネス/資金調達研究/論文

Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Evaluation based…

2026-06-02 13:00 JSTarXiv cs.AIエージェントビジネス/資金調達

BADGER: Bridging Agentic and Deterministic Evaluation for Generative Enterprise Reasoning

Enterprise AI systems that translate natural language into SQL queries and orchestrate multi-step agentic reasoning pipelines require evalu…

2026-06-02 13:00 JSTarXiv cs.AIエージェント

Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

Agentic reinforcement learning can induce tool abuse, where models overuse external tools even for queries solvable by internal reasoning.…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems

Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emergent failure…

2026-06-02 13:00 JSTarXiv cs.AIエージェント

Coordination Graphs for Constrained Multi-Agent Reinforcement Learning

Constrained Multi-agent reinforcement learning (CMARL) faces two intertwined challenges: the joint action space grows exponentially with th…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training

Long-horizon LLM agents can benefit from reusable skills, yet existing skill-based methods often rely on external skill generators during t…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント研究/論文

MOC: Multi-Order Communication in LLM-based Multi-Agent Systems

Despite the remarkable progress of Large Language Model (LLM) based Multi-Agent Systems, most research focuses on optimizing coordination t…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

COMAP: Co-Evolving World Models and Agent Policies for LLM Agents

Equipping language agents with world models enables them to anticipate environment dynamics and evaluate candidate actions before execution…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it ha…

2026-06-02 13:00 JSTarXiv cs.AIエージェント

AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design

Protein language models (PLMs) are passive oracles: they generate sequences in a single forward pass with no mechanism to consult external…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AI画像/動画生成エージェント

HLL: Can Agents Cross Humanity's Last Line of Verification?

Multimodal agents are increasingly expected to operate interfaces on behalf of users, raising a central deployment question: can they truly…

2026-06-02 13:00 JSTarXiv cs.AIエージェント

Beyond One-shot: AI Agents for Learning in Field Experiments

Organizations routinely run experiments for A/B testing, yet the data generated from one experiment is underutilized to inform subsequent i…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェントビジネス/資金調達

AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

Language agents spend substantial inference time solving individual tasks, yet the experience acquired in one episode is often underutilize…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント研究/論文

MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data so…

2026-06-02 13:00 JSTarXiv cs.AIエージェント研究/論文

Iteris: Agentic Research Loops for Computational Mathematics

Recent advances in large language models and agentic AI systems have enabled significant progress in mathematical discovery, from solving c…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Bridging the Last Mile of Time Series Forecasting with LLM Agents

Time series forecasting has advanced rapidly, especially with the emergence of foundation models that show strong zero-shot performance on…

2026-06-02 13:00 JSTarXiv cs.AIエージェント

Tracking the Behavioral Trajectories of Adapting Agents

Text files such as skill files, memory files, and behavioral configuration files play a central role in defining how modern agents act. Thr…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents

Clinical practice is not the selection of an answer from enumerated options: a physician gathers heterogeneous information incrementally an…

2026-06-02 13:00 JSTarXiv cs.AIエージェント

Empathic and agentic artificial intelligence in nursing: perspectives on a human-centered framework for cancer care navigation in the United States

For patients experiencing cancer, nurse navigation can ease the burden of complex care by enhancing coordination of health services and pat…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Business Utility of Large Language Models as Exploratory Data Analysis Agents

Large Language Models (LLMs) are increasingly used in analytical workflows, but their suitability as exploratory data analysis (EDA) agents…

2026-06-02 13:00 JSTarXiv cs.AIエージェントロボティクス

Silent Failures in Physical AI: A Literature Review of Runtime Action Authorization for Autonomous Systems

Physical AI systems increasingly map multimodal observations, language instructions, and learned world representations into physically cons…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AI画像/動画生成エージェントロボティクス

Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation

Vision-Language Navigation (VLN) enables embodied agents to reach target locations in unseen environments by following language instruction…

2026-06-02 13:00 JSTarXiv cs.AI画像/動画生成エージェント

Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents

Visual agents employ external visual tools within visual chains of thought to incorporate fine-grained evidence. While prior work has mainl…

2026-06-02 13:00 JSTarXiv cs.AIエージェントロボティクス

PEACE: A Planner-Executor Agent with Constraint Enforcement for UAVs

Foundation models are increasingly used to drive autonomous systems, yet existing approaches either keep the model in a tight control loop,…

2026-06-02 13:00 JSTarXiv cs.AIエージェントロボティクス

V2I Work Zone Geometry Reconstruction with Pose-Conditioned UWB Range Denoising

Reliable work zone mapping is important for connected and autonomous vehicles (CAVs) to navigate safely and smoothly through work zone area…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

Tool-calling is a central component of modern large language model (LLM) agents, equipping them with skills beyond their parametric knowled…

2026-06-02 13:00 JSTarXiv cs.AIエージェントロボティクス

Completion at the Boundary (CaB): Deployable Switching with Completion-Aware Control under Limited Calibration

Vision-language-action (VLA) agents can execute natural-language instructions, yet deployed systems still lack an operational interface: de…

2026-06-02 13:00 JSTarXiv cs.AIエージェント

Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying

In reinforcement learning (RL), agents benefit from exploration only because they repeatedly encounter similar states: trying different act…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

PrivacyPeek: Auditing What LLM-Based Agents Acquire, Not Just What They Say

LLM-based agents are rapidly advancing, autonomously invoking external tools to complete multi-step tasks for users. However, agents often…

2026-06-02 13:00 JSTarXiv cs.AIエージェント

Agentic Transformers Provably Learn to Search via Reinforcement Learning

Tree search is a central abstraction behind many language-agent reasoning and decision-making tasks: agents must explore actions, remember…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Learning to Construct Practical Agentic Systems

Automated design and optimization of agentic LLM-based systems leads to sophisticated systems that substantially improve result quality ove…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

BAGEN: Are LLM Agents Budget-Aware?

While agents are increasingly spending more resources, today agent cost is mostly measured only after execution. A Budget-Aware Agent (BAGE…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

How Generation Architecture Shapes Code Complexity in Multi-Agent LLM Systems: A Paired Study on HumanEval

Large-language-model code generation has shifted from single-shot prompting to multi-agent orchestrations - analyst, coder, tester, and deb…

2026-06-02 13:00 JSTarXiv cs.AIエージェント

ROGUE: Misaligned Agent Behavior Arising from Ordinary Computer Use

As AI agents are increasingly deployed in real personal and corporate settings (email accounts, development workflows, company databases, e…

2026-06-02 13:00 JSTarXiv cs.AIエージェント

Agentic Authoring of Interactive Multiview Visualizations in Genomics

Diverse genomics data, scientific questions, and analysis tasks typically demand highly specialized visualizations. Therefore, users often…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Masking Stale Observations Helps Search Agents -- Until It Doesn't: A Regime Map and Its Mechanism

Long-horizon search agents accumulate large amounts of retrieved content across many tool calls, making context-budget efficiency increasin…

2026-06-02 13:00 JSTarXiv cs.AIエージェント

AgentxGCore: Agentic AI for Next-Generation Mobile Core Network

To meet the stringent requirements of emerging applications and the increasingly complex network management and operation, the Next Generat…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems

LLM agents increasingly rely on community-contributed skills that expand an agent's operational capability set. We study a core safety prob…

2026-06-02 13:00 JSTarXiv cs.AI画像/動画生成エージェントハードウェア/半導体

CodeCytos: AI-assisted spatial molecular imaging analysis via code-augmented agent action space

Conventional tissue image analysis software provides foundational capabilities for cellular analysis, including segmentation, basic morphol…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning

Agent skills are callable procedural modules that provide reusable knowledge and execution policies for complex agentic tasks. However, exi…

2026-06-02 13:00 JSTarXiv cs.AIエージェント

Critic-R: Improving Agentic Search using Instruction-tuned Retrievers with Natural Language Introspective Feedback

Agentic search systems iteratively interact with retrieval models to answer complex queries. Despite substantial progress, optimizing retri…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering

Large language models are increasingly deployed as tool-augmented agents to acquire information beyond parametric knowledge. While recent w…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

MemGraphRAG: Memory-based Multi-Agent System for Graph Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) has become an essential method for mitigating hallucinations in Large Language Models (LLMs) by levera…

2026-06-02 13:00 JSTarXiv cs.AILLM/生成AIエージェント

MemPro: Agentic Memory Systems as Evolvable Programs

Long-horizon autonomous agents require memory systems to retain historical information, track evolving states, and reuse relevant knowledge…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Scaling Behavior of Single LLM-Driven Multi-Agent Systems

The burgeoning field of LLM-based Multi-Agent Systems (MAS) promises to tackle complex tasks through collaborative intelligence, yet fundam…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

Multi-Agent Conformal Prediction with Personalized Statistical Validity

Uncertainty quantification is essential in high-stakes machine learning tasks. However, one of the principled solutions, conformal predicti…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

Behavior-Invariant Task Representation Learning with Transformer-based World Models for Offline Meta-Reinforcement Learning

Offline meta-reinforcement learning leverages static datasets to enable agents to generalize to unseen environments by combining offline ef…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Dynamic Coordination Strategy Selection for Enterprise Multi-Agent Systems

Enterprise multi-agent systems increasingly expose multiple coordination patterns, but deployments often lack evidence for when to use cons…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

SkillPager: Query-Adaptive Intra-Skill Navigation via Semantic Node Retrieval

Skill-based LLM agents increasingly rely on long procedural documents, but full-document prompting wastes tokens and dilutes information cr…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

Beyond Independent Manipulation: Individual Fairness-aware Strategic Classification with Peer Imitation

Strategic classification (SC) investigates scenarios where agents manipulate their features to obtain favorable decisions from predictive m…

2026-06-02 13:00 JST | arXiv cs.AI | Agents | Robotics

From Cues to Horizons: Dynamic Risk Horizon Profiling for Trajectory Prediction

Accurate and reliable vehicle trajectory prediction is essential for safe autonomous driving. Recent studies have incorporated safety risk…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

GenPT: Beyond Self-Report for Reliable LLM Psychometrics via Generative Projective Testing

Self-report questionnaires remain the prevailing tool for probing the psychological states of persona-conditioned agents (PC-Agents). Howev…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

Task diversity produces systematic transfer but inhibits continual reinforcement learning

Continual reinforcement learning aims to produce agents that learn not only to improve at their current tasks but also to adapt as task dis…

2026-06-02 13:00 JST | arXiv cs.AI | Agents | Research/Papers

Benchmarking Security Risk Detection and Verification in Open Agentic Skill Ecosystems

Open agent platforms allow community contributors to publish reusable skills that agents can invoke at runtime. This extensibility also cre…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

Explainable deep reinforcement learning reveals energy-efficient control strategies for turbulent drag reduction

We propose a method combining Multi-Agent Deep Reinforcement Learning (MARL) and eXplainable Deep Learning (XDL) to reduce drag in wall-bou…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

SS-ZKR: Spatial-Semantic Zero-Knowledge Routing for Privacy-Preserving Multi-Agent Collaboration

Foundational agent interoperability standards, notably the Agent-to-Agent (A2A) protocol and the Model Context Protocol (MCP), have advance…

2026-06-02 13:00 JST | arXiv cs.AI | Agents | Research/Papers

FVSpec: Real-World Property-Based Tests as Lean Challenges

We present a benchmark for evaluating AI models and agents on real-world formal software verification tasks. We first scrape 11,039 propert…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents | Research/Papers

3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

Procedural 3D modeling through code is emerging as a versatile paradigm, offering deterministic, engine-ready, and precisely editable asset…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Leyline: KV Cache Directives for Agentic Inference

Modern KV cache management assumes the chatbot workload: prompts arrive once and the cache grows append-only, so prefix caching and forward…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

AMP: A Vendor-Neutral Wire Format for Agent Memory Operations

Agent-memory frameworks - mem0, Letta/MemGPT, Cognee, Zep/Graphiti, MemoryOS, MemTensor - each ship their own SDK, storage layout, and oper…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

ASE-26: a curriculum for agentic software engineering as a discipline

The work of a professional software engineer has begun to consist, increasingly, of directing agents rather than writing code, and the empi…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents | Robotics

DeepIPCv3: Event-Aware Multi-Modal Sensor Fusion for Sudden Pedestrian Crossing Avoidance

Current end-to-end autonomous driving systems predominantly rely on frame-based sensors, which suffer from inherent perception latency and…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories

Large language model (LLM) agents increasingly rely on reusable external skills to solve long-horizon interactive tasks. Existing training-…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

A Communication-Centric 6G-LLM Architecture for Scalable Tactical Autonomous Defense Vehicle Networks

The integration of Artificial Intelligence (AI) and emerging 6G networks introduces new opportunities for scalable coordination in tactical…

2026-06-02 13:00 JST | arXiv cs.AI | Agents | Robotics

PSG-Nav: Probabilistic Scene Graph Navigation via Multiverse Decision Making

Open-vocabulary navigation requires embodied agents to manage significant perception uncertainty stemming from semantic ambiguity and model…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

Digital Twin-Assisted Adaptive Multi-Agent DRL for Intelligent Spectrum and Resource Management in Open-RAN UAV-Enabled 6G Networks

The evolution toward 6G wireless networks envisions a seamlessly intelligent, Open-RAN-enabled architecture where unmanned aerial vehicles…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

Bridging Requirements and Architecture: Multi-Agent Orchestration with External Knowledge and Hierarchical Memory

Software architecture design is a critical yet inherently complex and knowledge-intensive phase that requires balancing competing quality a…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

LLM Consortium for Software Design Refinement: A Controlled Experiment on Multi-Agent Collaboration Topologies

We present a controlled experiment evaluating 12 multi-agent LLM collaboration topologies for software architecture design. Using a $2\time…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

ClawHub Security Signals: When VirusTotal, Static Analysis, and SkillSpector Disagree

Agent skills extend AI agents with reusable instructions, tools, scripts, references, and workflows, establishing a security boundary disti…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Research/Papers

TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning

Time series data inform critical decisions across many real-world domains. While large language model (LLM) agents can analyze data through…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

Agent Operating Systems (AOS): Integrating Agentic Control Planes into, and Beyond, Traditional Operating Systems

Traditional operating systems were designed around deterministic programs, explicit control flow, and human initiated workflows. Their core…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Defenses & Enablers For Skill Injection Attacks on Terminal Based Agents

Large language model (LLM) agents increasingly rely on reusable skills, i.e., documents describing task-specific procedures. However, this in…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Identifying High-Confidence Social Biases in LLMs for Trustworthy Conversational Tutoring Agents

Conversational tutoring agents have been shown to improve learning engagement and student outcomes, and large language models (LLMs) are in…

2026-06-02 13:00 JST | arXiv cs.AI | Agents | Research/Papers

TechGraphRAG: An Agentic Graph-Augmented RAG Framework for Technical Literature Reasoning

This paper presents an agentic retrieval-augmented generation (RAG) framework for domain-specific technical reasoning support, instantiated…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Easier to Mislead Than to Correct: Harmful and Beneficial Revision in LLM Conformity

Large language models are increasingly used in multi-agent systems, where they see and respond to other agents' answers. A key risk is conf…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

Post-Deterministic Distributed Systems: A New Foundation for Trustworthy Autonomous Infrastructure

For decades, distributed systems have typically assumed that correct participants execute protocol-specified behavior with stable, external…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Adaptive Auto-Harness: Sustained Self-Improvement for Agentic System Deployment on Open-Ended Task Streams

Auto-harness systems such as A-Evolve, GEPA, and Meta-Harness improve LLM agents by optimizing prompts, skills, tools, memories, and suppor…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents

STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models

Vision-language-model-based graphical user interface (GUI) agents have shown broad automation capabilities, yet deployment is bottlenecked…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Dynamic Trust-Aware Sparse Communication Topology for LLM-Based Multi-Agent Consensus

Large language model-driven multi-agent systems enhance the reliability of complex reasoning tasks through multi-round deliberation, role s…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

LayerRoute: Input-Conditioned Adaptive Layer Skipping via LoRA Fine-Tuning for Agentic Language Models

Agentic language model systems alternate between two structurally distinct step types: structured tool calls (short, deterministic, low per…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

RadioMaster: Multi-Agent System for Autonomous Radio Signal Generation

Translating user intents into physical radio signals represents the critical yet notoriously tedious final step in wireless prototyping, as…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is ofte…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Agents

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websi…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents

Agentic-J: An AI Agent for Biological Microscopy Image Analysis

Biological image analysis increasingly demands integration across heterogeneous tools, programming environments, and domain knowledge that…

2026-06-02 13:00 JST | arXiv cs.AI | Agents | Robotics | Research/Papers

Network Distributed Multi-Agent Reinforcement Learning for Consensus Control of Quadcopters

This paper proposes a Network Distributed Multi-Agent Reinforcement Learning (ND-MARL) framework for quadcopter consensus control. Compared…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations

Indirect prompt injection in tool-use agents is a concrete production threat: LLM agents read from integrations (third-party services such…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents

Autonomous LLM agents increasingly operate in stateful environments where they access tools, files, memory, and external services. While su…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents | Research/Papers

Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains

Tool-augmented multimodal agents show strong benchmark gains, often taken as evidence that agents have learned to use tools. We argue that…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

SPADE-Bench: Evaluating Spontaneous Strategic Deception in Agents via Plan-Action Divergence

As LLM-based agents expand their operational scope, reliability becomes a prerequisite for real-world deployment. However, in practical app…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Policy and World Modeling Co-Training for Language Agents

Reinforcement learning (RL) improves large language model (LLM) agents by teaching them which actions lead to high rewards, but provides li…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents

MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence

In 3D environments, Embodied Agents answer spatially relevant questions through reasoning from a mixture of modalities including natural la…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Ghost Tool Calls: Issue-Time Privacy for Speculative Agent Tools

Tool-augmented language agents speculatively issue likely future tool calls to hide latency, but those calls leak inferred user intent to e…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

Monitoring Agentic Systems Before They're Reliable

Agentic systems entering production typically operate as partially integrated assemblies where structural defects, not task-level errors, d…

2026-06-02 13:00 JST | arXiv cs.AI | Agents | Robotics

Permissive Safety Through Trusted Inference: Verifiable Belief-Space Neural Safety Filters for Assured Interactive Robotics

Autonomous robots that interact with people must make safe and efficient decisions under human-induced uncertainty, such as their preferenc…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Stop Wandering, Find the Keys: LLMs Discriminate Key States for Efficient Multi-Agent Exploration

With expansive state-action spaces, efficient multi-agent exploration remains a longstanding challenge in reinforcement learning. Although…

2026-06-02 13:00 JST | arXiv cs.AI | Agents | Business/Funding

Agent Guide: A Simple Agent Behavioral Watermarking Framework

The increasing deployment of intelligent agents in digital ecosystems, such as social media platforms, has raised significant concerns abou…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

Taming System Complexity: Demystifying Software Engineering Agents in Diagnosing Linux Kernel Faults

The Linux kernel is a critical system, serving as the foundation for numerous systems. Bugs in the Linux kernel can cause serious consequen…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

ACON: Optimizing Context Compression for Long-horizon LLM Agents

Large language models (LLMs) are increasingly deployed as agents in dynamic real-world environments, where success depends on maintaining p…

2026-06-02 13:00 JST | arXiv cs.AI | Agents | Research/Papers

LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services

Recent advances in large reasoning models (LRMs) have enabled agentic search systems to perform complex multi-step reasoning across multiple…

2026-06-02 13:00 JST | arXiv cs.AI | Agents | Research/Papers

MobiBench: Multi-Branch, Modular Benchmark for Mobile GUI Agents

Mobile GUI Agents, AI agents capable of interacting with mobile applications on behalf of users, have the potential to transform human comp…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

From Features to Actions: Explainability in Traditional and Agentic AI Systems

Over the last decade, Explainable AI has primarily focused on interpreting individual model predictions, producing post-hoc explanations th…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

ToolSelf: Unifying Task Execution and Self-Reconfiguration via Tool-Driven Emergent Adaptation

LLM-powered agentic systems excel at complex long-horizon tasks, but remain constrained by static configurations fixed before execution. Su…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Agents

Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation

Although recent end-to-end video generation models demonstrate impressive performance in visually oriented content creation, they remain li…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation

Execution-aware LLM agents offer a promising paradigm for learning from tool feedback, but such feedback can be expensive and slow to obtai…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents

Reinforcement learning (RL) has become a de facto paradigm for building LLM-based agents that act, interact, and reason over extended task…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematic…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

Rashomon Memory: Towards Argumentation-Driven Retrieval for Multi-Perspective Agent Memory

AI agents operating over extended time horizons accumulate experiences that serve multiple concurrent goals, and must often maintain confli…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

Process Reward Agents for Steering Knowledge-Intensive Reasoning

Reasoning in knowledge-intensive domains remains challenging as intermediate steps are often not locally verifiable: unlike math or code, e…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

MAVEN-T: Reinforced Heterogeneous Distillation for Real-Time Multi-Agent Trajectory Prediction

Trajectory prediction is a key component of autonomous driving systems because future motions directly affect collision checking, behavior…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

Vision-language models (VLM) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tom…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

TrafficClaw: A Generalizable LLM Agent in the Unified Physical Environment for Urban Traffic Control

Large language model (LLM) agents have shown strong capabilities in long-horizon reasoning, tool use, and decision-making in digital enviro…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

Towards a Virtual Neuroscientist: Autonomous Neuroimaging Analysis via Multi-Agent Collaboration

Transforming neuroimaging data into clinically actionable biomarkers is a knowledge-intensive and labor-intensive process. Standardized wor…

2026-06-02 13:00 JST | arXiv cs.AI | Agents | Business/Funding

Causal state binding predicts action control in language agents

Autonomous language agents increasingly expose traces, memories, plans and constraints, but existing evaluations rarely test whether these…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

RADAR: Redundancy-Aware Diffusion for Multi-Agent Communication Structure Generation

Compared with individual agents, large language model based multi-agent systems have shown great capabilities consistently across diverse t…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

CVEvolve: Autonomous Algorithm Discovery for Unstructured Scientific Data Processing

Scientific data processing often requires task-specific algorithms or AI models, creating a barrier for domain scientists who need to analy…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

MMSkills: Towards Multimodal Skills for General Visual Agents

Reusable skills have become a core substrate for improving agent capabilities, yet most existing skill packages encode reusable behavior pr…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Research/Papers

Herculean: An Agentic Benchmark for Financial Intelligence

As AI agents improve, the central question is no longer whether they can solve isolated well-defined financial tasks, but whether they can…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

Coding Agent Is Good As World Simulator

World models have emerged as a powerful paradigm for building interactive simulation environments, with recent video-based approaches demon…

2026-06-02 13:00 JST | arXiv cs.AI | Agents | Research/Papers

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

Frontier deep research agents (DRAs) plan a research task, synthesize across documents, and return a structured deliverable on demand. They…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

Ethical Hyper-Velocity (EHV): A Hardware-Rooted Zero-Trust Runtime Enforcement Architecture for Agentic AI Systems

As autonomous agentic systems scale across regulated critical infrastructures, the lack of mechanistic, hardware-rooted enforcement for hig…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning

Communication is a key component in multi-agent reinforcement learning (MARL) for mitigating partial observability, yet prior approaches of…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that are poorly understood. We study when end-to-end RL training of multi-agent LLM workflows improves over their base models, comparing Shared-Policy training, where all roles update one policy, with Isolated-Policy training, where each role has its own parameters. Our experimental matrix spans Eval-Opt, Voting, and Orch-Workers workflows, math and code tasks, and three model scales (0.6B, 1.7B, 4B). We find that multi-agent RL usually improves over base models, but gains depend jointly on workflow, task, and scale, not on policy sharing alone. Isolated-Policy tends to reach higher peak accuracy yet more often falls off a terminal accuracy cliff, while Shared-Policy training does not eliminate failure; it redistributes failure into qualitatively different patterns. We then explain the strongest of these patterns through role-level gradient dynamics induced by workflow topology and policy routing: under Isolated-Policy, parallel same-role agents on shared prompts amplify per-role gradients and drive terminal degradation in Voting and Orch-Workers workflows; under Shared-Policy, asymmetric per-step gradient mass causes the shared policy to be captured by the dominant role, producing different failure signatures by task and workflow. Together, the empirical map and its underlying mechanisms show that policy sharing routes training pressure through different channels rather than offering uniform stability, making it a design choice with workflow- and task-conditional tradeoffs.
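The difference between the two training regimes can be pictured as a routing table from roles to parameter sets. The snippet below is a toy bookkeeping sketch, not the paper's training code: the role names, the four-step rollout, and both helper functions are assumptions made purely for illustration. It only counts how many per-step updates reach each parameter set, which is one simple way to see how parallel same-role agents concentrate updates on a single isolated policy and how a frequent role can account for most of the updates to a shared one.

```python
from collections import Counter

# Hypothetical rollout: one orchestrator step plus three parallel worker steps.
# Role names and counts are illustrative, not taken from the paper.
WORKFLOW_STEPS = ["orchestrator", "worker", "worker", "worker"]


def route_updates(steps, shared):
    """Count how many per-step updates land on each parameter set.

    shared=True  -> every role writes to a single policy ("shared").
    shared=False -> each role writes to its own policy, keyed by role name.
    """
    updates = Counter()
    for role in steps:
        policy_id = "shared" if shared else role
        updates[policy_id] += 1
    return dict(updates)


def role_share(steps):
    """Fraction of steps contributed by each role within one rollout."""
    counts = Counter(steps)
    total = len(steps)
    return {role: n / total for role, n in counts.items()}


shared_updates = route_updates(WORKFLOW_STEPS, shared=True)
isolated_updates = route_updates(WORKFLOW_STEPS, shared=False)
shares = role_share(WORKFLOW_STEPS)

print(shared_updates)    # {'shared': 4}
print(isolated_updates)  # {'orchestrator': 1, 'worker': 3}
print(shares)            # {'orchestrator': 0.25, 'worker': 0.75}
```

In this toy setting the worker role supplies three quarters of the steps, so it dominates the shared policy's updates, while under isolation all three worker steps stack onto the worker policy alone.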

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Test-Time Deep Thinking to Explore Implicit Rules

With the continuous advancement of Large Language Models (LLMs), intelligent agents are becoming increasingly vital. However, these agents often fail in environments governed by implicit rules: hidden constraints that cannot be observed directly and must be inferred through interaction. This causes agents to fall into repetitive trial-and-error loops, ultimately leading to task failure. To address this challenge, we propose Test-Time Exploration (TTExplore), a framework where a thinker component analyzes interaction history to infer these implicit rules and guide an actor. Effective exploration in this setting critically depends on the reasoning ability of the thinker. However, evaluating deep reasoning trajectories is inherently unstable and difficult, which poses a major obstacle to effective training. To overcome this issue, we introduce a novel and stable reinforcement learning pipeline. The core idea is to use accurate task-level scores as indirect rewards to bypass the difficulty of evaluating intermediate reasoning, and to retain only a single thinking node per trajectory to alleviate reward sparsity. Using this pipeline, we train a specialized 7B model, Exp-Thinker. Experiments on five text-based embodied tasks show that TTExplore equipped with Exp-Thinker improves baseline agent performance by an average of 14 to 19 points, demonstrating the effectiveness of explicitly reasoning about implicit rules.
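The thinker/actor split described in the abstract can be sketched with a toy environment. This is a minimal illustration under stated assumptions, not the TTExplore implementation: the environment, its single hidden rule, the action names, and the rule-based `thinker` and `actor` functions are all hypothetical stand-ins for what would be LLM components in the paper.

```python
# Toy environment with one implicit rule: "open_door" only works after "take_key".
# The rule, the action names, and both components are hypothetical stand-ins.

def step(state, action):
    """Apply an action; return (new_state, observation, done)."""
    if action == "take_key":
        return {**state, "has_key": True}, "You pick up the key.", False
    if action == "open_door":
        if state["has_key"]:
            return state, "The door opens.", True
        return state, "Nothing happens.", False
    return state, "Unknown action.", False


def thinker(history):
    """Infer a hint from the interaction history (rule-based stand-in for an LLM)."""
    failed = [a for a, obs in history if obs == "Nothing happens."]
    if "open_door" in failed:
        return "open_door seems to need a precondition; try take_key first"
    return None


def actor(hint, history):
    """Pick the next action, following the thinker's hint when one exists."""
    tried = [a for a, _ in history]
    if hint and "take_key" in hint and "take_key" not in tried:
        return "take_key"
    return "open_door"


def run_episode(use_thinker, max_steps=5):
    """Run one episode; return (solved, number_of_steps_used)."""
    state, history = {"has_key": False}, []
    for t in range(max_steps):
        hint = thinker(history) if use_thinker else None
        action = actor(hint, history)
        state, obs, done = step(state, action)
        history.append((action, obs))
        if done:
            return True, t + 1
    return False, max_steps


with_thinker = run_episode(use_thinker=True)
without_thinker = run_episode(use_thinker=False)
print(with_thinker)     # (True, 3)
print(without_thinker)  # (False, 5)
```

Without the thinker the actor repeats the same failing action until the step budget runs out, which mirrors the trial-and-error loop the abstract describes; with it, one failed attempt is enough to surface the precondition.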

2026-06-02 13:00 JST | arXiv cs.AI | Agents | Research/Papers

Experiments in Agentic AI for Science

This paper details two novel frameworks for developing autonomous, agentic AI in scientific workflows. Both systems leverage a hybrid Local Body, Remote Brain architecture via Google Colab, utilizing Python-based local orchestrators to invoke large language model (LLM) cloud backends. The first agent, DeepTS/DeepCollector, automates the large-scale curation, extraction, and deduplication of time-series datasets. The second, DeepScribe, is an autonomous presentation analyzer that converts visually dense, mathematically complex physics lectures into structured scientific reports. Through practical systems engineering, such as granular attribute extraction (Cellular RAG), remote data inspection, and distributed concurrency controls, we demonstrate how agentic AI can overcome the context and reasoning limitations of current state-of-the-art systems to rigorously support scientific workflows. Finally, we outline a generalization of DeepTS to support deep knowledge graphs and discuss the application of this conceptual approach to high-energy physics (DeepQCD).

2026-06-02 13:00 JST | arXiv cs.AI | Agents

Agyn: An Open-Source Platform for AI Agents with Scalable On-Demand Execution, Agent Definition as a Code, and Zero-Trust Access

As organizations move toward production deployments of AI agents, which execute non-deterministic workflows, maintain stateful sessions, and often operate with privileged access to internal services, the engineering challenge shifts from building individual agents to operating them at scale with proper isolation, governance, and security. In this paper we present Agyn, an open-source platform designed around three key principles tailored for agent workloads: a signal-driven, stateful serverless runtime on Kubernetes; a Terraform provider for agent and harness definition; and a security model grounded in zero-trust and least-privilege principles. Agyn is agent-agnostic, model-agnostic, and cloud-agnostic.

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Business/Funding | Research/Papers

FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research

Large language models (LLMs) are increasingly applied in finance, yet most existing work emphasizes trading signals or financial NLP tasks centered on prediction. Institutional fundamental research, by contrast, requires human analysts or AI agents to gather evidence, identify business drivers, compare competing viewpoints, and generate investment memos. Its broader goal is not merely to predict outcomes, but to produce investment plans that are transparent, reusable, and verifiable, while contributing to the cumulative development of investment knowledge. We present FundaPod, a multi-persona agent platform for AI-assisted fundamental investment research. We argue that fundamental research is a human-centric decision-support task that is qualitatively distinct from trading-signal generation, and is therefore better served by an independence-preserving architecture. In FundaPod, AI agents with different personas, such as value investors or macro strategists, conduct research independently under a shared provenance contract. Their disagreements are then surfaced post hoc for adjudication by the human portfolio manager (PM) through a knowledge-graph memory system. This paper contributes five design principles for human-AI hybrid systems supporting fundamental research, grounded in design-science practice and theories of cognitive isolation and human-machine coordination. It also describes four architectural mechanisms: a persona distillation pipeline that turns public investor materials into deployable agents; a declarative skill registry that lets the planner derive typed task graphs; a grounded evidence model that links memo claims to verifiable sources; and a knowledge-graph "second brain" that connects tickers, memos, analysts, and themes. We demonstrate the architecture through a complete case study and a persona-based memo comparison.
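One way to picture a grounded evidence model that links memo claims to verifiable sources is as a pair of small record types plus a check for claims that lack backing. The sketch below is an assumption-laden illustration, not FundaPod's schema: the `Source` and `Claim` classes, their fields, the `ungrounded_claims` helper, and the sample memo text are all invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """A verifiable reference such as a filing or transcript (fields are hypothetical)."""
    source_id: str
    locator: str


@dataclass
class Claim:
    """One statement in a memo, with the ids of the sources that back it."""
    text: str
    source_ids: list = field(default_factory=list)


def ungrounded_claims(claims, sources):
    """Return claims that cite no source or cite an unknown source id."""
    known = {s.source_id for s in sources}
    return [
        c for c in claims
        if not c.source_ids or any(sid not in known for sid in c.source_ids)
    ]


# Made-up sample data, used only to exercise the check.
sources = [Source("doc-1", "annual report, p. 12")]
memo = [
    Claim("Revenue grew year over year.", ["doc-1"]),
    Claim("Margins will expand next year."),   # no source attached
    Claim("Churn is falling.", ["doc-9"]),     # cites a missing source
]
flagged = [c.text for c in ungrounded_claims(memo, sources)]
print(flagged)  # ['Margins will expand next year.', 'Churn is falling.']
```

Keeping claims and sources as separate records means a reviewer can audit every memo statement against its references without parsing free text.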

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Agents, Robotics

DeepIPCv2: LiDAR-powered Robust Environmental Perception and Navigational Control for Autonomous Vehicle

We propose DeepIPCv2, an end-to-end autonomous driving framework that integrates LiDAR-based environmental perception with command-specific…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Agents, Business/Funding

Recent Advances in Multi-modal 3D Intelligence: A Comprehensive Survey and Evaluation

Multi-modal 3D Intelligence has gained considerable attention due to its wide applications in autonomous driving and world simulation, etc.…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents, Robotics

MARFT: Multi-Agent Reinforcement Fine-Tuning

Large Language Model (LLM)-based Multi-Agent Systems (LaMAS) have demonstrated strong capabilities on complex agentic tasks requiring multi…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Simulating Macroeconomic Expectations in Survey Experiments with LLM-based Economic Agents

We introduce a framework for simulating macroeconomic expectations in survey experiments using LLM-based economic agents (LLM Agents). We c…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents, Research/Papers

AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research

Language model agents are increasingly used to automate scientific research, yet evaluating their scientific contributions remains a challe…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Agents

FedS2R: One-Shot Federated Domain Generalization for Synthetic-to-Real Semantic Segmentation in Autonomous Driving

Federated domain generalization has shown promising progress in image classification by enabling collaborative training across multiple cli…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

TuneAgent: Agentic Operating System Kernel Tuning with Reinforcement Learning

Linux kernel tuning is essential for optimizing operating system (OS) performance, yet remains challenging due to the complex kernel space,…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Make a Video Call with LLM: A Measurement Campaign over Six Mainstream Apps

In 2025, Large Language Model (LLM) services have launched a new feature -- AI video chat -- allowing users to interact with AI agents via…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation, Agents

StreamingVLM: Real-Time Understanding for Infinite Video Streams

Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding nea…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Latent Collaboration in Multi-Agent Systems

Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intellig…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Agents

From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

Vision-Language Models (VLMs) are increasingly deployed as the perception and reasoning backbone of autonomous agents acting in the wild, with autonomous driving (AD) being one of the most safety-critical instances. Reliable temporal understanding is essential for such agents to anticipate events, attribute causes, and act safely in dynamic environments, yet this remains a significant challenge even for state-of-the-art (SoTA) VLMs. Prior video benchmarks have emphasized other content (sports, cooking, etc.), and no existing benchmark focuses exclusively on temporal understanding for both short- and long-form AD footage. To fill this gap, we present the Temporal Understanding in Autonomous Driving (TAD) benchmark, comprising nearly 6000 question-answer (QA) pairs across 7 tasks, and evaluate 9 closed- and open-source generalist as well as AD-specialist models. Current SoTA models perform substantially below human accuracy on TAD. To improve the temporal reasoning of VLM-based driving agents, we propose two novel training-free solutions: Scene-CoT, which uses Chain-of-Thought (CoT) reasoning, and TCogMap, which incorporates an ego-centric temporal cognitive map produced by a trajectory-analysis module that operates as an agentic tool around the VLM. Integrated with existing VLMs, our methods improve average accuracy by up to 17.72% on TAD and by up to 10.35% on STSBench. By introducing TAD, benchmarking SoTA models, and proposing effective enhancements, this work aims to catalyze further progress on temporal understanding for agentic AD systems operating in the wild. The benchmark and evaluation code are available on Hugging Face (https://huggingface.co/datasets/vbdai/TAD) and GitHub (https://github.com/vbdi/tad_bench), respectively.

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Ev-Trust: An Evolutionarily Stable Trust Mechanism for Decentralized LLM-Based Multi-Agent Service Economies

Decentralized LLM-based multi-agent service economies face three vulnerabilities that undermine traditional trust mechanisms: reduced cost…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents, Research/Papers

Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation

LLM-based agents increasingly use multiple external tools to complete complex tasks. We study Tools Orchestration Privacy Risk (TOP-R): an…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

MASCOT: Towards Multi-Agent Socio-Collaborative Companion Systems

Multi-agent systems (MAS) are emerging as promising socio-collaborative companions for emotional and cognitive support. However, existing s…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling, yet recent work shows that vanilla MAD often underperforms simple majority vote despite higher computational cost. Studies show that, under homogeneous agents and uniform belief updates, debate preserves expected correctness and therefore cannot reliably improve outcomes. Drawing on findings from human deliberation and collective decision-making, we identify two key mechanisms missing from vanilla MAD: (i) diversity of initial viewpoints and (ii) explicit, calibrated confidence communication. We propose two lightweight interventions. First, a diversity-aware initialisation that selects a more diverse pool of candidate answers, increasing the likelihood that a correct hypothesis is present at the start of debate. Second, a confidence-modulated debate protocol in which agents express calibrated confidence and condition their updates on others' confidence. We show theoretically that diversity-aware initialisation improves the prior probability of MAD success without changing the underlying update dynamics, while confidence-modulated updates enable debate to drift systematically toward the correct hypothesis. Empirically, across six reasoning-oriented QA benchmarks, our methods consistently outperform vanilla MAD and majority vote. Our results connect human deliberation with LLM-based debate and demonstrate that simple, principled modifications can substantially enhance debate effectiveness.
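As a rough illustration of why communicated confidence can matter, the sketch below contrasts plain majority voting with a confidence-weighted tally. This is not the paper's protocol (there, agents condition their belief updates on peers' confidence during debate); the function names and the additive weighting rule are assumptions chosen only to show that raw vote counts discard information that calibrated confidence retains.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Return the most frequent answer (ties go to the first one seen)."""
    return Counter(answers).most_common(1)[0][0]


def confidence_weighted_vote(answers, confidences):
    """Return the answer with the largest summed confidence.

    Hypothetical aggregation rule: each agent contributes its stated
    confidence (assumed to be calibrated, in [0, 1]) to its own answer.
    """
    scores = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        scores[answer] += confidence
    return max(scores, key=scores.get)


# Two unsure agents agree on "A"; one confident agent says "B".
answers = ["A", "A", "B"]
confidences = [0.3, 0.3, 0.9]
plain = majority_vote(answers)
weighted = confidence_weighted_vote(answers, confidences)
```

With these illustrative inputs the two rules disagree, which is the situation where confidence communication can change the outcome of a debate round.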

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Principle-Evolvable Scientific Discovery via Uncertainty Minimization

Large Language Model (LLM)-based scientific agents have accelerated scientific discovery, yet they often suffer from significant inefficien…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

"Do Not Mention This to the User": Detecting and Understanding Malicious Agent Skills

LLM-based coding agents increasingly rely on third-party extensions called skills, which bundle natural language instructions and helper sc…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

SoK: DARPA's AI Cyber Challenge (AIxCC): Competition Design, Architectures, and Lessons Learned

DARPA's AI Cyber Challenge (AIxCC, 2023--2025) is the largest competition to date for building fully autonomous cyber reasoning systems (CR…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Agents, Robotics

SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes

Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents

Reinforcement learning (RL) has become a key paradigm for training software engineering (SWE) agents, but existing pipelines typically rely…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Atomix: Timely, Transactional Tool Use for Reliable Agentic Workflows

LLM agents execute multi-step workflows that mutate external state through tools. Common orchestrators treat tool return as the settlement…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

Knowing Isn't Understanding: Re-grounding Generative Proactivity with Epistemic and Behavioral Insight

Generative AI agents equate understanding with resolving explicit queries, an assumption that confines interaction to what users can articu…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Agents

Visual Persuasion: What Influences Decisions of Vision-Language Models?

The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models…

2026-06-02 13:00 JST | arXiv cs.AI | Agents, Research/Papers

APEX-SQL: Talking to the data via Agentic Exploration for Text-to-SQL

Text-to-SQL systems powered by Large Language Models have excelled on academic benchmarks but struggle in complex enterprise environments.…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents, Business/Funding, Research/Papers

Are LLMs Ready for Neural-integrated Mechanistic Modeling? A Benchmark and Agentic Framework

Large language models (LLMs) have shown promise in constructing mechanistic models from data. However, existing evaluations largely focus o…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Constitutional Black-Box Monitoring for Scheming in LLM Agents

Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is…

2026-06-02 13:00 JST | arXiv cs.AI | Agents, Robotics

SPARC: Spatial-Aware Path Planning via Attentive Agent Communication

Efficient communication is critical for decentralized Multi-Robot Path Planning (MRPP), yet existing learned communication methods treat al…

2026-06-02 13:00 JST | arXiv cs.AI | Agents, Robotics

HALO: Learning Human-Robot Collaboration via Heterogeneous-Agent Lyapunov Policy Optimization

To improve generalization and resilience in human-robot collaboration (HRC), robots must contend with diverse combinations of human behavio…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Context Matters: Repository-Aware Security Analysis of the Agent Skill Ecosystem

Agent skills extend local AI agents, such as Claude Code and OpenClaw, with additional functionality. Their growing popularity has led to d…

2026-06-02 13:00 JST | arXiv cs.AI | Agents, Research/Papers

AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science

Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in la…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

Large Language Model Guided Incentive Aware Reward Design for Cooperative Multi-Agent Reinforcement Learning

Designing effective auxiliary rewards for cooperative multi-agent systems remains challenging, as misaligned incentives can induce suboptim…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents, Research/Papers

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

We show that AI agents are capable of discovering novel algorithms for adversarial attacks against LLMs, advancing the state of the art on…

2026-06-02 13:00 JST | arXiv cs.AI | Agents, Business/Funding

Beyond Offline A/B Testing: Context-Aware Agent Simulation for Recommender System Evaluation

Recommender systems are central to online services, enabling users to navigate through massive amounts of content across various domains. H…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

Just Type It in Isabelle! AI Agents Drafting, Mechanizing, and Generalizing from Human Hints

Type annotations are essential when printing terms in a way that preserves their meaning under reparsing and type inference. We study the p…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Agents

Dual-Anchoring: Addressing State Drift in Vision-Language Navigation

Vision-Language Navigation (VLN) requires an agent to navigate through 3D environments by following natural language instructions. While rec…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

Beyond AI as Assistants: Toward Autonomous Discovery in Cosmology

Recent advances in artificial intelligence (AI) agents are pushing AI beyond tools toward autonomous scientific discovery. We discuss two c…

2026-06-02 13:00 JST | arXiv cs.AI | Agents, Research/Papers

PBT-Bench: Benchmarking AI Agents on Property-Based Testing

Existing code benchmarks measure whether an agent can produce any test that reproduces a known bug, or whether it can produce a patch that…

2026-06-02 13:00 JST | arXiv cs.AI | Image/Video Generation, Agents

Co-Fusion4D: Spatio-temporal Collaborative Fusion for Robust 3D Object Detection

In autonomous driving, 3D object detection is essential for accurate perception and reliable decision-making. However, object motion and eg…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Attested Tool-Server Admission: A Security Extension to the Model Context Protocol

The Model Context Protocol (MCP) standardizes how a large-language-model (LLM) agent and an external tool server exchange messages, but not trust: a host reads a server's self-declared tool list and dispatches calls, with no notion of which servers it may use, at what sensitivity, or which of a server's tools are in bounds. This work grew out of a concrete need: letting the Enclawed agent use Google's externally-operated MCP servers (Gmail, Calendar, Drive) safely, admitting the server and bounding the tools it may drive, without changing MCP or Enclawed's own tool application-programming interface (API). The mechanism we built, mcp-attested (shipped in both the open enclawed-oss distribution and the enclaved flavor), generalizes: the gap that makes an unmediated third-party connection unsafe for one user makes a regulated deployment impossible to accredit. We close it with three additive mechanisms: (1) a small, offline-signed clearance assertion that a server publishes at a well-known Uniform Resource Identifier (URI) and a host verifies against a pinned trust root before any tool dispatch; (2) a deny-by-default per-server tool allowlist, so admitting a server is not trusting its every tool; and (3) a flavor-gated enforcement mode that turns the checks from warnings into hard denials, with every decision written to a tamper-evident audit log. We give the wire format, the verification algorithm, a security analysis, and an LLM-driven adversarial evaluation; we then state the design in normative Request-for-Comments (RFC 2119) form (schema, verification rules, error registry, well-known registration, and machine-checkable conformance vectors) so it can be adopted as an MCP addendum rather than reinvented. An unextended host ignores the well-known document and behaves exactly as today.
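Mechanisms (2) and (3) above can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the mcp-attested implementation: the class name `ToolGate`, the server and tool names, the log record layout, and the use of a SHA-256 hash chain as the tamper-evidence scheme are all hypothetical. Signature verification of the clearance assertion, mechanism (1), is omitted entirely.

```python
import hashlib
import json


class ToolGate:
    """Hypothetical host-side gate: deny-by-default allowlist plus audit log."""

    def __init__(self, allowlist, enforce=False):
        # allowlist maps a server id to the set of tool names the host admits.
        self.allowlist = {server: set(tools) for server, tools in allowlist.items()}
        self.enforce = enforce  # False: warn only; True: hard denial
        self.audit_log = []
        self._prev_hash = "0" * 64  # genesis value for the hash chain

    def _record(self, server, tool, decision):
        # Each entry commits to the previous entry's hash, so editing an
        # earlier entry breaks verification of the chain.
        entry = {"server": server, "tool": tool,
                 "decision": decision, "prev": self._prev_hash}
        payload = json.dumps(entry, sort_keys=True).encode()
        entry["hash"] = hashlib.sha256(payload).hexdigest()
        self.audit_log.append(entry)
        self._prev_hash = entry["hash"]

    def check(self, server, tool):
        """Return True if the call may be dispatched."""
        allowed = tool in self.allowlist.get(server, set())
        if allowed:
            decision = "allow"
        elif self.enforce:
            decision = "deny"
        else:
            decision = "warn"
        self._record(server, tool, decision)
        return allowed or not self.enforce


def verify_chain(log):
    """Recompute the hash chain; return False if any entry was altered."""
    prev = "0" * 64
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        payload = json.dumps(body, sort_keys=True).encode()
        if body["prev"] != prev or hashlib.sha256(payload).hexdigest() != entry["hash"]:
            return False
        prev = entry["hash"]
    return True
```

In enforcing mode, a tool that is not listed for an admitted server is refused, and an unknown server has no admitted tools at all. In warning mode the same check only records a `warn` decision and lets the call through.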

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning

Group-based reinforcement learning (RL) methods have achieved remarkable success in improving the performance of large language models (LLM…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Knowledge Graphs as the Missing Data Layer for LLM-Based Industrial Asset Operations

LLM-based agents for industrial asset operations show limited accuracy when reasoning over flat document stores. AssetOpsBench (KDD 2026) e…

2026-06-02 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation, Agents

Generic Interpretation Approach for Transformer Models Incorporating Heterogenous Attention Structures

Transformer has significantly propelled the development of artificial intelligence, and certainly the development of agents as well. We cat…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

Honest Lying: Understanding Memory Confabulation in Reflexive Agents

Reflexion-style agents rely on self-generated reflections as memory, implicitly assuming that agents can accurately diagnose their own fail…

2026-06-02 13:00 JST | arXiv cs.AI | Agents

Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms

As autonomous language model agents proliferate, forming an emerging agentic web with real-world consequences, what credibility signals can…

2026-06-02 07:45 JST | ITmedia AI+ | Agents, Hardware/Semiconductors

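NVIDIA's "FOX" Is an AI Agent for Autonomous Factory Management; Taiwanese Manufacturers Confirm Deployment Benefits

NVIDIA has announced the NVIDIA Factory Operations Blueprint (FOX), a reference design for AI agents that manage factories autonomously. With FOX, a wide range of factory data can be monitored and analyzed in real time, and multiple AI agents and equipment can be linked…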

2026-06-02 06:35 JST | TechCrunch AI | Agents, Hardware/Semiconductors

Nvidia chases $200B CPU market with AI agent PCs from Microsoft, Dell, and HP

If Nvidia has cracked a way to bring AI agents easily, safely, and usefully to the masses, it could — and should — be big.

2026-06-01 19:00 JST | OpenAI | LLM/Generative AI, Agents

OpenAI frontier models and Codex are now available on AWS

OpenAI frontier models and Codex are now generally available on AWS, giving enterprises a new path to build with OpenAI through the AWS env…

2026-06-01 16:00 JST | ITmedia AI+ | Agents

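What Is Salesforce's Long Game? Open Systems in the AI Agent Era and the Battle for Leadership

Salesforce has unveiled a new solution for enterprise business systems in which large numbers of AI agents operate. A closer look at what the solution is aiming for suggests a far-sighted strategy on the company's part.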

2026-06-01 13:00 JST | arXiv cs.AI | Agents

Uncertainty-Aware and Temporally Regulated Expert Advice in Reinforcement Learning for Autonomous Driving

Exploration in reinforcement learning for autonomous driving is inherently unsafe: agents must experience novel behaviors to learn, yet exploration can lead to collisions or off-road driving. We propose an uncertainty-aware framework that leverages expert advice to guide exploration while avoiding long-term dependence. Advice is triggered when epistemic or aleatoric uncertainty exceeds adaptive thresholds derived from rolling buffers, ensuring advice evolves with the agent's confidence. A commitment-cooldown strategy with a stochastic early-stop heuristic regulates the duration and frequency of guidance, exposing the agent to coherent maneuvers without exhausting the advice budget. Expert and agent experiences are combined in a shared replay buffer within an off-policy implicit quantile network (IQN) backbone, enabling efficient reuse of expert trajectories. Experiments in CARLA show that our method outperforms the IQN baseline, improving success by 5-7% and reducing failures, demonstrating that risk-sensitive uncertainty coupled with regulated expert integration enables safer and more efficient exploration for sensor-based RL policy learning in unsignalized intersection navigation.
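The triggering and commitment-cooldown logic described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the class name `AdviceTrigger`, the mean-plus-k-standard-deviations threshold rule, and all default parameter values are assumptions, and the stochastic early-stop heuristic and the advice budget are left out.

```python
import statistics
from collections import deque


class AdviceTrigger:
    """Hypothetical trigger: request expert advice when uncertainty is
    high relative to the agent's own recent history."""

    def __init__(self, buffer_size=100, k=1.0, commit_steps=5,
                 cooldown_steps=10, warmup=10):
        # Rolling buffers of recent uncertainty estimates.
        self.epistemic = deque(maxlen=buffer_size)
        self.aleatoric = deque(maxlen=buffer_size)
        self.k = k
        self.commit_steps = commit_steps
        self.cooldown_steps = cooldown_steps
        self.warmup = max(1, warmup)
        self.commit_left = 0    # remaining steps of committed expert control
        self.cooldown_left = 0  # remaining steps during which advice is blocked

    def _threshold(self, buf):
        # Assumed adaptive rule: mean + k * population std of the buffer.
        return statistics.fmean(buf) + self.k * statistics.pstdev(buf)

    def step(self, epistemic, aleatoric):
        """Return True if the expert should act at this step."""
        exceeded = False
        if len(self.epistemic) >= self.warmup:
            exceeded = (epistemic > self._threshold(self.epistemic)
                        or aleatoric > self._threshold(self.aleatoric))
        # Update buffers after the comparison so the threshold reflects
        # history only, not the current reading.
        self.epistemic.append(epistemic)
        self.aleatoric.append(aleatoric)

        if self.commit_left > 0:          # still inside a committed maneuver
            self.commit_left -= 1
            if self.commit_left == 0:
                self.cooldown_left = self.cooldown_steps
            return True
        if self.cooldown_left > 0:        # advice temporarily blocked
            self.cooldown_left -= 1
            return False
        if exceeded:                      # start a new committed maneuver
            self.commit_left = self.commit_steps - 1
            if self.commit_left == 0:
                self.cooldown_left = self.cooldown_steps
            return True
        return False
```

Because the thresholds are recomputed from the rolling buffers at every step, an uncertainty level that triggers advice early in training may no longer do so once the agent's typical uncertainty has shifted, which is the sense in which the advice "evolves with the agent's confidence".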

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

LLM agents are increasingly deployed as systems built around editable external harnesses, including prompts, skills, memories, and tools, that shape task execution without changing model parameters. Harness self-evolution adapts such agents by updating these harnesses from execution evidence. Yet it remains unclear whether a model's base capability in task-solving predicts its capabilities in harness self-evolution: which models produce useful harness updates, and which actually benefit from them? We analyze two harness self-evolution capabilities: (i) harness-updating, the capability to produce useful persistent harness updates from execution evidence; and (ii) harness-benefit, the capability to benefit from updated harnesses during task solving. Our analysis reveals two findings. First, harness-updating is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains; even Qwen3.5-9B's updates yield gains comparable to those of Claude Opus 4.6. Second, harness-benefit is non-monotonic in base capability: weak-tier models benefit little from updated harnesses, mid-tier models benefit most, and strong-tier models benefit less than mid-tier models. We trace low gains at the weak tier to two failure modes: weak-tier models may fail to activate relevant harness artifacts, or activate them but fail to follow them faithfully. These findings suggest investing capability budget in the task-solving agent rather than the evolver, and targeting harness invocation and long-horizon instruction following in agent training. Our source code is publicly available at https://github.com/A-EVO-Lab/a-evolve/tree/release/harness-evolution.

2026-06-01 13:00 JST | arXiv cs.AI | Agents

Structure-Induced Information for Rerooting Levin Tree Search

Subgoal-based policy tree search, which uses a policy to guide search, is effective for complex single-agent deterministic problems but often relies on explicit subgoal generation, which can incur substantial overhead and hinder scalability. In this paper, we overcome these limitations by using a learned "rerooter" through the recently introduced $\sqrt{\text{LTS}}$ algorithm. A rerooter implicitly decomposes the problem into soft subtasks. While previous work focused on the formal guarantees for given or handcrafted rerooters, in this work we propose three rerooter designs: (i) a clustering-based rerooter that exploits global state-space structure, (ii) a heuristic-based rerooter that leverages learned cost-to-go estimates, and (iii) a hybrid that combines both signals. Our framework avoids having to explicitly reconstruct and reason over generated subgoals, thereby enabling scalable allocation of search effort with significantly lower computational overhead. Empirically, our rerooting-based methods scale to complex environments where subgoal-based policy tree search fails, and achieve state-of-the-art online training efficiency on the domains tested.

2026-06-01 13:00 JST | arXiv cs.AI | Agents

MAVEN: Improving Generalization in Agentic Tool Calling

Generalization across agentic tool-calling environments remains a central challenge for reliable agentic reasoning systems. Although large language models achieve strong results on individual benchmarks, their ability to compose reasoning strategies, preserve intermediate states, and coordinate tools across domains remains underexplored. We present MAVEN (Modular Agentic Verification and Execution Network), a lightweight symbolic reasoning scaffold for structured decomposition, adaptive tool orchestration, and intermediate verification. We evaluate MAVEN across established tool-calling benchmarks, including BFCL v3, TauBench, Tau2Bench, and AceBench, and introduce MAVEN-Bench, a stress-test benchmark for multi-step mathematical and physical reasoning with explicit verification and adversarial task composition. MAVEN-Bench exposes a substantial gap between partial reasoning quality and end-to-end task success; in direct MAVEN-Bench runs, MAVEN improves its GPT-OSS-120b base model from 48% to 71% accuracy without additional training. It also remains competitive with frontier proprietary baselines while using an open-weight backbone with an estimated cost ratio of roughly 1/10, suggesting that lightweight verification-centered scaffolds can strengthen compositional reasoning and motivate more process-aware evaluation of agents in the wild.

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents, Research/Papers

Learning Agent-Compatible Context Management for Long-Horizon Tasks

LLM agents increasingly face long-horizon tasks such as web search and deep research in real-world applications, where accumulated context can cause long-context degradation and reasoning failures. Prior work mitigates this through context management with agent-side context control or fixed strategies such as summarization. These approaches require training the agent itself for adaptation, which is impractical for closed-source agents, and they ignore that different agents may require different strategies. We introduce Adaptive Context Management (AdaCoM), which trains an external LLM to manage the context of a frozen agent through flexible modification actions and end-to-end reinforcement learning. Across diverse agents on web search and deep research benchmarks, AdaCoM substantially improves performance by preserving task constraints and progress while pruning stale content. The learned strategies reveal a Fidelity-Reliability Trade-off: agents with higher vanilla ReAct performance benefit from higher-fidelity context preservation, whereas lower-performing agents require more aggressive compression to stay within a reliable reasoning regime. Transfer experiments show that AdaCoM generalizes most effectively across agents with similar capability (measured by vanilla ReAct performance), suggesting a practical path toward reusable context managers for agent systems.

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

COMPASS: Cognitive MCTS-Guided Process Alignment for Safe Search Agents

LLM-powered search agents enable multi-step reasoning and tool use. However, these capabilities introduce retrieval-induced safety degradation, as harmful intents may decompose into seemingly innocuous sub-queries that lead to unsafe outcomes. Existing alignment methods struggle to capture sparse safety signals and fail to supervise diverse violations across multi-step interactions. We propose COMPASS, a Cognitive MCTS-Guided Process Alignment framework designed to achieve robust safety alignment throughout the agent workflow while preserving general utility. COMPASS integrates cognitive tree exploration (CTE) to efficiently synthesize stealthy attack trajectories, and introspective step-wise alignment (ISA) to isolate risky intermediate actions for fine-grained process supervision. Empirical results show that COMPASS achieves a favorable safety-utility trade-off while requiring substantially less training data.

2026-06-01 13:00 JST arXiv cs.AI エージェント

HADT: 自律型地球観測衛星クラスター向けのヘテロジニアス・マルチエージェント差分トランスフォーマー

この研究では、光学衛星や合成開口レーダー (SAR) 衛星を含む地球観測 (EO) ミッションを実行する異種衛星クラスターにおける自律的なリソース管理の問題に取り組んでいます。自律運用モードでは、衛星には最新の状況に基づいたリアルタイムの意思決定を可能にするインテリジェント機能が装備されており、地上オペレーターとの対話は最小限に抑えられます。従来のスケジューリング手法は通常、衛星のミッションとリソース管理を表す数学的モデルに依存しています。次に、この問題は最適化アルゴリズムを使用して解決されます。ただし、そのようなソリューションは、宇宙ミッション環境に固有の動的な変化や不確実性により、基礎となるモデルが利用できない場合、過度に複雑な場合、不正確な場合には効果が低くなります。有望な代替案は、問題を逐次的な意思決定プロセスとして再定式化し、モデルフリーの強化学習手法を適用して、適応的かつリアルタイムのリソース管理を可能にすることです。この目的を達成するために、我々は、関係的な観測とアクションのトークン化と差分注意メカニズムを備えた、異種衛星クラスターの自律的 EO ミッションに合わせた新しいトランスフォーマーベースのアーキテクチャを提案します。私たちの実験結果は、利用可能なベースラインと比較してパフォーマンスが大幅に向上していることを示しています。さらに、提案されたアーキテクチャは、さまざまな数の衛星クラスターに対して強力な適応性と転送性を示します。

原文 (English)

HADT: A Heterogeneous Multi-Agent Differential Transformer for Autonomous Earth Observation Satellite Cluster

This work addresses the problem of autonomous resource management in heterogeneous satellite clusters conducting Earth Observation (EO) missions, including optical and Synthetic Aperture Radar (SAR) satellites. In autonomous operation mode, satellites are equipped with intelligent capabilities enabling real-time decision-making based on the latest conditions, while requiring minimal interaction with ground operators. Traditional scheduling approaches typically rely on mathematical models to represent satellite mission and resource management, and the resulting problem is then solved using optimization algorithms. However, such solutions become less effective when the underlying models are unavailable, overly complex, or inaccurate due to dynamic changes and uncertainties inherent in the space mission environment. A promising alternative is to reformulate the problem as a sequential decision-making process and apply model-free reinforcement learning techniques to enable adaptive, real-time resource management. To this end, we propose a novel transformer-based architecture tailored to autonomous EO missions of heterogeneous satellite clusters, with relational observation-action tokenization and a differential attention mechanism. Our experimental results demonstrate significant performance improvements compared to the available baselines. Moreover, the proposed architecture exhibits strong adaptability and transferability with respect to varying numbers of satellite clusters.

2026-06-01 13:00 JST arXiv cs.AI LLM/生成AI エージェント

COLLEAGUE.SKILL: 専門知識の蒸留による AI スキルの自動生成

LLM エージェントは、孤立したタスクを完了するだけでなく、人間の専門知識、判断、対話スタイルの限定された表現を実行することをますます期待されています。このような個人ベースのエージェントの構築は依然として困難です。これは、個人または役割に関連付けられた実用的な知識が、通常、きれいな指示として記述されるのではなく、異種トレースに埋め込まれているためです。既存のメモリおよびペルソナシステムはこの証拠の断片をキャプチャしますが、スキルフレームワークはポータブルなパッケージ形式を提供します。ただし、これらのトレースを検査可能、修正可能、エージェントが使用できるスキルに抽出するためのエンドツーエンドのワークフローはありません。専門知識の抽出を通じて人間に基づいた AI スキルを生成するための、自動化されたトレースからスキルへの抽出システムを紹介します。 COLLEAGUE.SKILL は、対象となる人物または役割からの資料を基に、2 つの調整されたトラックを備えたバージョン管理されたスキルパッケージを生成します。1 つはプラクティス、メンタルモデル、意思決定ヒューリスティックに関する能力トラックで、もう 1 つはコミュニケーションスタイル、インタラクションルール、修正履歴に関する限定された行動トラックです。パッケージは、自然言語フィードバックを通じて検査、呼び出し、更新、ロールバック、エージェントホスト間でのインストール、およびオプションで制御された配布の準備ができます。アーティファクトコントラクト、生成ワークフロー、修正ライフサイクル、展開面、およびオープンソースシステムに実装されたドメインプリセットについて説明します。この記事の執筆時点では、パブリックリポジトリには約 18.5k の GitHub スターがあります。ギャラリーには、165 人の寄稿者による 215 のスキルと、リストされているスキルカード全体で 10 万以上の累計スターがリストされています。このシステムは、個人に根ざしたスキルが、不透明なプロンプトや隠された記憶ではなく、移植可能で修正可能なパッケージとしてどのように表現できるかを示しています。

原文 (English)

COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation

LLM agents are increasingly expected not only to complete isolated tasks, but also to carry bounded representations of human expertise, judgment, and interaction style. Building such person-grounded agents remains difficult because actionable knowledge associated with a person or role is usually embedded in heterogeneous traces rather than written as clean instructions. Existing memory and persona systems capture fragments of this evidence, while skill frameworks provide portable packaging formats; however, there is no end-to-end workflow for distilling these traces into inspectable, correctable, and agent-usable skills. We present an automated trace-to-skill distillation system for generating person-grounded AI skills via expert knowledge distillation. Given materials from a target person or role, COLLEAGUE.SKILL produces a versioned skill package with two coordinated tracks: a capability track for practices, mental models, and decision heuristics, and a bounded behavior track for communication style, interaction rules, and correction history. The package can be inspected, invoked, updated through natural-language feedback, rolled back, installed across agent hosts, and optionally prepared for controlled distribution. We describe the artifact contract, generation workflow, correction lifecycle, deployment surface, and domain presets implemented in the open-source system. At the time of writing, the public repository has approximately 18.5k GitHub stars; the gallery lists 215 skills from 165 contributors and more than 100k cumulative stars across listed skill cards. The system illustrates how person-grounded skills can be represented as portable, correctable packages rather than opaque prompts or hidden memories.

2026-06-01 13:00 JST arXiv cs.AI エージェント ビジネス/資金調達

予測を活用した推論の工業化: 信頼性の高い GenAI およびエージェントシステム評価のための GLIDE ライブラリ

エージェントシステムの信頼性の高い評価には、有効な不確実性を伴う不偏推定が必要ですが、標準的な手法では、コストのかかる人間によるアノテーションと、ジャッジとしての偏った LLM プロキシの間を行き来します。予測パワー推論 (PPI) は、両方を組み合わせて有効な信頼区間を持つ偏りのない推定値を生成しますが、そのさまざまな手法は部分的な実装の下で論文に散在したままです。平均推定に特化した scipy スタイルの API の下で、最先端の PPI 推定器 (PPI++、層化 PPI、Predict-Then-Debias とその層化バリアント、アクティブ統計推論) とサンプラー (均一、層化、アクティブ、コスト最適化) を統合するオープンソース Python ライブラリである GLIDE を紹介します。 GLIDE には、再現可能なモンテカルロ検証スイート、手法選択のための経験に基づいたデシジョンツリー、同等の精度でのアノテーションの大幅な節約を示すエージェント評価ケーススタディが付属しています。 GLIDE パッケージは次の URL で入手できます: https://github.com/EmertonData/glide

原文 (English)

Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation

Reliable evaluation of agentic systems requires unbiased estimates with valid uncertainty, but standard practice navigates between costly human annotation and biased LLM-as-judge proxies. Prediction-powered inference (PPI) combines both into debiased estimates with valid confidence intervals, yet its various methods remain scattered across papers under partial implementations. We introduce GLIDE, an open-source Python library that unifies state-of-the-art PPI estimators (PPI++, Stratified PPI, Predict-Then-Debias and its stratified variants, Active Statistical Inference) and samplers (uniform, stratified, active, cost-optimal) under a scipy-style API specialized to mean estimation. GLIDE ships with a reproducible Monte Carlo validation suite, an empirically grounded decision tree for method selection, and an agentic evaluation case study showing substantial annotation savings at equivalent precision. The GLIDE package is available at this URL: https://github.com/EmertonData/glide
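As a rough illustration of the idea behind prediction-powered mean estimation, the sketch below implements the basic PPI point estimate with a normal-approximation interval in plain Python. It is not the GLIDE API: the function name `ppi_mean`, its arguments, and the toy data are assumptions introduced here for illustration only.

```python
import math
from statistics import NormalDist, mean, variance


def ppi_mean(y_labeled, f_labeled, f_unlabeled, alpha=0.05):
    """Basic prediction-powered estimate of a mean (hypothetical helper).

    y_labeled   -- human labels on the small annotated set
    f_labeled   -- judge/proxy scores on that same annotated set
    f_unlabeled -- judge/proxy scores on the large unannotated set
    """
    n, big_n = len(y_labeled), len(f_unlabeled)
    # Rectifier: how far the proxy is off, measured where labels exist.
    residuals = [y - f for y, f in zip(y_labeled, f_labeled)]
    estimate = mean(f_unlabeled) + mean(residuals)
    # Normal-approximation standard error of two independent sample means.
    se = math.sqrt(variance(f_unlabeled) / big_n + variance(residuals) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * se, estimate + z * se)


# Toy data (made up): the judge is too lenient on one of four labeled items.
human = [1, 0, 1, 1]
judge_on_labeled = [1, 1, 1, 1]
judge_on_unlabeled = [1, 1, 1, 0, 1, 1, 1, 1]
estimate, (low, high) = ppi_mean(human, judge_on_labeled, judge_on_unlabeled)
```

On this toy data the judge alone would report a mean of 0.875, while the rectifier (mean residual of -0.25) moves the estimate to 0.625. The variants named in the abstract (PPI++, stratified PPI, and so on) refine this basic form and are not reproduced here.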

2026-06-01 13:00 JST arXiv cs.AI エージェント ビジネス/資金調達 研究/論文

TraceGraph: エージェントの軌跡を診断および改善するための共有意思決定ランドスケープ

エージェントのベンチマークでは、豊富なインタラクションの軌跡が記録されることが増えていますが、評価によって各ロールアウトが合格率や報酬スコアに引き下げられることがよくあります。リリースされたマルチモデルエージェントの軌跡を共有の意思決定ランドスケープに変えるグラフベースのフレームワークである TraceGraph を紹介します。 TraceGraph は、タスクごとに、モデル ID が導入される前に、プールされたロールアウトから観察可能なアクションと観察の状態に関するグラフを構築します。次に、結果に基づいた生産コアとトラップ領域をオーバーレイし、各ロールアウトをアクセス、トラップ露出、修復の 3 つのイベントで要約します。 TraceGraph プロファイルは、5 つのベンチマークスプリットにまたがる軌跡全体で、集計スコアによって隠されたナビゲーションの違いを明らかにし、トラップの回避とそこからの回復のどちらに報酬を与えるかがスプリットによって異なることを示します。同じ TraceGraph ランドスケープは、SWE ベンチのトラップ対応回復パイプラインも動機付けます。実行時検出器は、履歴トラップ領域に一致する状態で起動され、その後、軽量継続ポリシーが同じプレフィックスから評価されます。起動された状態では、最適なプールされた単一要素ポリシーにより、プロバイダー固有のアクティブコンポーネントを使用して、プロバイダーごとに起動されたサブセットで正式な解決率が 40.4% から 43.5% に、共通起動されたインスタンスで 41.0% から 44.8% に上昇します。全体として、TraceGraph は、どのようなエージェントベンチマークテストを行うか、共有ランドスケープ上でモデルが分岐する場所、および障害領域が下流の改善をどのように導くことができるかを尋ねるためのプロセスボキャブラリーを提供します。

原文 (English)

TraceGraph: Shared Decision Landscapes for Diagnosing and Improving Agent Trajectories

Agent benchmarks increasingly record rich interaction trajectories, yet evaluation often reduces each rollout to a pass rate or reward score. We introduce TraceGraph, a graph-based framework that turns released multi-model agent trajectories into shared decision landscapes. For each task, TraceGraph builds a graph over observable action-observation states from pooled rollouts before model identity is introduced. It then overlays outcome-informed productive cores and trap regions, and summarizes each rollout with three events: Access, Trap exposure, and Repair. Across trajectories spanning five benchmark splits, TraceGraph profiles reveal navigation differences hidden by aggregate scores and show that splits differ in whether they reward avoiding traps or recovering from them. The same TraceGraph landscape also motivates a trap-aware recovery pipeline for SWE-bench: a runtime detector fires on states matching historical trap regions, then lightweight continuation policies are evaluated from the same prefix. On fired states, the best pooled single-factor policy raises official resolved rate from 40.4% to 43.5% on the per-provider fired subset and from 41.0% to 44.8% on common-fired instances, with provider-specific active components. Overall, TraceGraph provides a process vocabulary for asking what agent benchmarks test, where models diverge on a shared landscape, and how failure regions can guide downstream improvement.
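To make the pooled-graph idea more concrete, here is a minimal sketch. The abstract does not define the three events precisely, so the definitions used below are assumptions: Access means the rollout reaches a productive-core state, Trap exposure means it visits a trap state, and Repair means it reaches the core after its first trap visit. The state names, `build_graph`, and `summarize` are hypothetical.

```python
from collections import Counter


def build_graph(rollouts):
    """Count state-to-state transitions over pooled rollouts.

    Model identity is deliberately absent: only state sequences are pooled.
    """
    edges = Counter()
    for states in rollouts:
        edges.update(zip(states, states[1:]))
    return edges


def summarize(states, core, traps):
    """Flag the three events for one rollout under the assumed definitions."""
    access = any(s in core for s in states)
    trap_steps = [i for i, s in enumerate(states) if s in traps]
    exposed = bool(trap_steps)
    # Repair: the rollout reaches the core after its first trap visit.
    repair = exposed and any(s in core for s in states[trap_steps[0] + 1:])
    return {"access": access, "trap_exposure": exposed, "repair": repair}


# Made-up rollouts over hypothetical state labels.
rollouts = [
    ["start", "read", "edit", "test_pass"],
    ["start", "loop", "loop", "edit", "test_pass"],
    ["start", "loop", "loop"],
]
core, traps = {"edit", "test_pass"}, {"loop"}
graph = build_graph(rollouts)
profiles = [summarize(r, core, traps) for r in rollouts]
```

In this toy case the second rollout is exposed to the trap and repairs, while the third is exposed and never reaches the core, a difference that a pass/fail score alone would only partly show.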

2026-06-01 13:00 JST arXiv cs.AI エージェント

リソースに制約のある Visual Agent における共有状態コラボレーションの障害モードの診断

モジュール式視覚推論システムは、多段階のコラボレーションのために共有ワーキングメモリへの依存度が高まっていますが、低容量領域における中間状態の進化の失敗ダイナミクスは依然として解明されていません。私たちは、ノイズ蓄積のレンズを通して、弱い学習者 (4B ～ 8B モデル) を使用した協調推論の失敗モードを研究します。ドキュメントの視覚的な質問応答における情報フローを追跡するための読み取り、書き込み、検証ループを形式化する監査フレームワークである CoSee を紹介します。複数ページ、グラフ、および Web ベースのベンチマーク全体で、直感に反する劣化が見つかりました。単純な共有ワークスペースでは、幻覚を解決するのではなく、幻覚を増幅させることがよくあります。私たちは 2 つの主要な失敗モードを特定しました。根拠のないメモが証拠として再利用されるノイズ強化と、追加されたコンテキストによってモデルが不完全な短い形式の回答にシフトするポリシー崩壊です。コスト精度のパレートフロンティアを使用して、明示的な検証がなければ、コンピューティングの増加がパフォーマンスと負の相関関係を示す可能性があることを示します。私たちの調査結果は、リソースに制約のあるエージェントの場合、ボトルネックは推論の深さではなく通信の忠実度にあり、トレースレベルの診断と信頼性の高いモジュール設計のための機構ベースラインを提供することを示唆しています。

原文 (English)

Diagnosing Failure Modes of Shared-State Collaboration in Resource-Constrained Visual Agents

Modular visual reasoning systems increasingly rely on shared working memory for multi-step collaboration, yet the failure dynamics of intermediate state evolution in low-capacity regimes remain underexplored. We study failure modes of collaborative reasoning with weak learners (4B--8B models) through the lens of noise accumulation. We introduce CoSee, an auditing framework that formalizes the read-write-verify loop to trace information flow in document visual question answering. Across multi-page, chart, and web-based benchmarks, we find a counter-intuitive degradation: naive shared workspaces often amplify hallucinations rather than resolve them. We identify two dominant failure modes: Noise Reinforcement, where ungrounded notes are reused as evidence, and Policy Collapse, where added context shifts the model toward under-specified, short-form answers. Using cost-accuracy Pareto frontiers, we show that increased compute can correlate negatively with performance without explicit verification. Our findings suggest that for resource-constrained agents, the bottleneck lies not in reasoning depth but in communication fidelity, providing trace-level diagnostics and a mechanistic baseline for reliable modular design.

2026-06-01 13:00 JST arXiv cs.AI LLM/生成AI エージェント

適応することを学ぶ: 認知認識の探求による自己改善 Web エージェント

マルチモーダル大規模言語モデル (MLLM) の最近の進歩により、Web エージェントは有望な進歩を遂げています。ただし、既存の Web エージェントは多くの場合、手作りの実行パイプラインや高価な専門家の軌跡に依存しており、複雑で動的な環境への適応性が制限されています。これらの課題に対処するために、私たちは SCALE (Self-Cognitive-Aware Learning and Exploration) を提案します。これは、セレクター、プレディクター、ジャッジという 3 つの敵対的な役割を活用して、環境探索を通じてエージェントの限界を自律的に発見し、その認知境界を拡張します。さらに、グローバルな計画を容易にし、エージェントがローカル探索の罠を避けるのに役立つグラフ探索戦略である SCALE-Hop を提案します。学習をさらにサポートするために、19 の実世界の Web サイトから収集された大規模なデータセットである SCALE-20k を構築します。これには、さまざまな種類のタスクと、SCALE の探索トレースから生成された構造化されたデモンストレーションが含まれています。実験結果は、私たちのアプローチがさまざまな Web 環境における複数の MLLM のパフォーマンスと汎用性を大幅に向上させることを示しています。私たちのフレームワークは、真に自律的で適応性のある Web エージェントを構築するための、スケーラブルで一般化可能なソリューションを提供します。

原文 (English)

Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration

Recent advances in Multimodal Large Language Models (MLLMs) have led to promising progress in web agents. However, existing web agents often rely on handcrafted execution pipelines or expensive expert trajectories, limiting their adaptability to complex, dynamic environments. To address these challenges, we propose SCALE (Self-Cognitive-Aware Learning and Exploration), which leverages three adversarial roles (Selector, Predictor, and Judger) to autonomously discover the agent's limitations and expand its cognitive boundaries through environmental exploration. Moreover, we propose SCALE-Hop, a graph exploration strategy that facilitates global planning and helps agents avoid local exploration traps. To further support learning, we construct SCALE-20k, a large-scale dataset collected from 19 real-world websites, containing diverse task types and structured demonstrations generated from SCALE's exploration traces. Experimental results show that our approach significantly improves the performance and generalization of multiple MLLMs in various web environments. Our framework offers a scalable and generalizable solution for building truly autonomous and adaptive web agents.

2026-06-01 13:00 JST arXiv cs.AI エージェント

HypoAgent: ナレッジグラフ上でインタラクティブなアブダクティブ仮説生成のためのエージェントフレームワーク

ナレッジグラフに対するアブダクティブ推論は、観察されたエンティティまたは事実を説明する論理的な仮説を生成することを目的としています。既存の制御可能な仮説生成方法では、ユーザーが明示的な条件を使用してこのプロセスをガイドできますが、インタラクティブな設定では制限されたままです。マルチターンの対話全体で進化する自然言語の意図を根付かせるのに苦労し、生成された仮説が失敗した場合に詳細な診断をほとんど提供できません。これらの制限に対処するために、ナレッジグラフ上でインタラクティブなアブダクティブ仮説生成のためのエージェントフレームワークである HypoAgent を提案します。 HypoAgent は 3 つのエージェントを統合します。ユーザーの発話と対話履歴を実行可能な KG 条件に根拠付ける意図認識エージェント、抽出されたユーザーの意図に従って制御可能な仮説生成を実行する仮説生成エージェント、および信頼性の低い仮説の断片を診断し、KG 近傍調査を利用してサポートされる改良点を特定する根本原因分析エージェントです。常識グラフと生物医学領域固有のナレッジグラフの実験により、HypoAgent がシングルターン、マルチターン、無条件の設定で最先端の意味的類似性を達成できることが実証されました。私たちのコードは https://github.com/HKUST-KnowComp/HypoAgent で入手できます。

原文 (English)

HypoAgent: An Agentic Framework for Interactive Abductive Hypothesis Generation over Knowledge Graphs

Abductive reasoning over knowledge graphs aims to generate logical hypotheses that explain observed entities or facts. Existing controllable hypothesis generation methods allow users to guide this process with explicit conditions, but they remain limited in interactive settings: they struggle to ground evolving natural-language intents across multi-turn dialogues and provide little fine-grained diagnosis when generated hypotheses fail. To address these limitations, we propose HypoAgent, an Agentic framework for interactive abductive Hypothesis Generation over knowledge graphs. HypoAgent integrates three agents: an Intent Recognition Agent that grounds user utterances and dialogue history into executable KG conditions, a Hypothesis Generation Agent that performs controllable hypothesis generation according to the extracted user intention, and a Root Cause Analysis Agent that diagnoses unreliable hypothesis fragments and leverages KG neighborhood probing to identify supported refinements. Experiments on commonsense and biomedical domain-specific knowledge graphs demonstrate that HypoAgent achieves state-of-the-art semantic similarity under single-turn, multi-turn, and unconditional settings. Our code is available at https://github.com/HKUST-KnowComp/HypoAgent.

2026-06-01 13:00 JST arXiv cs.AI エージェント

強化学習のための解答セットプログラミングベースの抽象化

強化学習 (RL) により、自律エージェントは経験からポリシーを学習できますが、現実的な問題には膨大な状態空間が関与することが多く、学習と一般化が困難になります。したがって、抽象化と近似が不可欠です。関係強化学習 (RRL) は、オブジェクトとその関係について推論する方法を提供し、Martijn van Otterlo による CARCASS フレームワークは、論理表現が一階の領域でマルコフ決定プロセス (MDP) をどのようにモデル化できるかを示しています。 CARCASS は元々 Prolog で実装されており、ドメイン知識を活用して強力な抽象化を作成します。私たちは、CARCASS 抽象化を実現するために、表現力が高く、Prolog とは異なり完全に宣言的なモデリング言語である Answer-Set Programming (ASP) を検討します。私たちは、Blocks World と Minigrid という 2 つのドメインのケーススタディで ASP ベースの実装を評価します。私たちの結果は、ASP を用いた CARCASS が、特にドメイン知識が利用可能な場合に、RL の抽象化を構築するための有望なアプローチを提供することを示しています。

原文 (English)

Answer-Set-Programming-based Abstractions for Reinforcement Learning

Reinforcement Learning (RL) enables autonomous agents to learn policies from experience, but realistic problems often involve enormous state spaces, making learning and generalisation challenging. Abstraction and approximation are therefore essential. Relational Reinforcement Learning (RRL) offers a way to reason about objects and their relations, and the CARCASS framework by Martijn van Otterlo demonstrates how logical representations can model Markov Decision Processes (MDPs) in first-order domains. Originally implemented in Prolog, CARCASS leverages domain knowledge to create powerful abstractions. We explore Answer-Set Programming (ASP), which is a rich and, contrary to Prolog, fully declarative modelling language, to realise CARCASS abstractions. We evaluate our ASP-based implementation in case studies of two domains, viz. Blocks World and Minigrid. Our results indicate that CARCASS with ASP provides a promising approach to constructing abstractions for RL, especially when domain knowledge is available.

2026-06-01 13:00 JST arXiv cs.AI エージェント 研究/論文

AutoSci: 科学研究ライフサイクル全体向けのメモリ中心のエージェントシステム

科学研究は伝統的に人力が集中しており、研究者は長いプロジェクトサイクルにわたって文献、アイデア、実験、原稿を調整し、回答をレビューする必要があります。 LLM ベースの科学エージェントの台頭により、このプロセスを自動化する機会が生まれました。このようなシステムは、研究ライフサイクル全体をサポートし、プロジェクト全体で構造化された永続的なメモリを維持し、時間の経過とともに独自の研究手順を改善する必要があります。しかし、既存のシステムはこれらの要件を部分的に満たしているか満たしていないため、統合された自動化された科学研究システムにはギャップが残されています。その結果、科学研究ライフサイクル全体に対応したメモリ中心のエージェントシステムである AutoSci を紹介します。 AutoSci は 4 つのモジュールを中心に構成されています。 SciMem は、スキーマ管理された研究メモリを提供し、再利用可能な科学知識のための長期知識メモリを、アイデア、実験、原稿、レビューなどのプロジェクトレベルの成果物のためのアクティブな研究メモリから分離します。 SciFlow は、状態、コンテキスト、検証、フィードバック、オーケストレーションを制御するハーネスを通じて、文献の理解から反論まで 5 段階のライフサイクルを実行します。 SciDAG は、DAG 形状のマルチエージェントオペレーターと再利用可能なステージ固有のテンプレートを使用して、難しいスキルを強化します。 SciEvolve は、ユーザー、実験、レビュー、外部環境からのフィードバック信号を、SciMem 組織、SciFlow スキル、および SciDAG テンプレートへのバージョン管理された更新に変換します。これらのモジュールを組み合わせることで、AutoSci は研究プロジェクト全体で実行、記憶、進化できる永続的な研究環境になります。コードリポジトリは https://github.com/skyllwt/AutoSci で入手できます。

原文 (English)

AutoSci: A Memory-Centric Agentic System for the Full Scientific Research Lifecycle

Scientific research has traditionally been human-intensive, requiring researchers to coordinate literature, ideas, experiments, manuscripts, and review responses across long project cycles. The rise of LLM-based scientific agents creates an opportunity to automate this process. Such a system must support the full research lifecycle, maintain structured persistent memory across projects, and improve its own research procedures over time. However, existing systems either partially satisfy or fail to satisfy these requirements, leaving a gap for a unified automated scientific research system. As a result, we present AutoSci, a memory-centric agentic system for the full scientific research lifecycle. AutoSci is organized around four modules. SciMem provides schema-governed research memory, separating Long-Term Knowledge Memory for reusable scientific knowledge from Active Research Memory for project-level artifacts such as ideas, experiments, manuscripts, and reviews. SciFlow executes a five-stage lifecycle from literature understanding to rebuttal through a harness that controls state, context, verification, feedback, and orchestration. SciDAG augments difficult skills with DAG-shaped multi-agent operators and reusable stage-specific templates. SciEvolve converts feedback signals from users, experiments, reviews, and external environments into versioned updates to SciMem organization, SciFlow skills, and SciDAG templates. Together, these modules make AutoSci a persistent research environment that can execute, remember, and evolve across research projects. The code repository is available at https://github.com/skyllwt/AutoSci.

2026-06-01 13:00 JST arXiv cs.AI エージェント

レンズの選択: 文脈に依存した議論における戦略的視点の活性化

多くの場合、同じ論証を異なる外部レジームの下で評価する必要があります。レジームに対して影響力を持つエージェントは、標準的な形式主義では直接捉えられない戦略的手段を持っています。我々は、コンテキスト依存議論フレームワーク (CDAF) を導入します。これは Dung の理論の拡張であり、どの攻撃が成功するかを敗北関数がコンテキストごとに決定します。パースペクティブラベル付きの特殊化では、関連性セット $\rho$ と優先度 $\pi$ から敗北関数を導出します。関連性セットはエージェントのアクションスペースです。小さな具体例では、エージェントのターゲットの論証は、完全関連性の下でのあらゆる単射優先度において拒否されますが、部分的なアクティブ化の下では受け入れられ、そのうちの 1 つはどの VAF オーディエンスでも再現できません。対応する意思決定問題である ACTIVATION-MANIPULATION を定義し、ベースラインとなる計算複雑性の上下界を記録します。タイトな上下界と複数エージェントのバリアントは未解決のままです。

原文 (English)

Choosing the Lens: Strategic Perspective Activation in Context-Dependent Argumentation

The same arguments often need to be evaluated under different external regimes. An agent with influence over the regime has a strategic lever that standard formalisms do not directly capture. We introduce context-dependent argumentation frameworks (CDAFs), an extension of Dung's theory in which a defeat function determines, per context, which attacks succeed. A perspective-labeled specialisation derives the defeat function from a relevance set $\rho$ and a priority $\pi$. The relevance set is the agent's action space. In a small worked example, the agent's target argument is rejected under every full-relevance injective priority, yet accepted under partial activations, one of which no VAF audience can mirror. We define the corresponding decision problem, ACTIVATION-MANIPULATION, and record baseline complexity bounds. Tight bounds and multi-agent variants are left open.
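The effect of choosing which perspectives are active can be illustrated with a deliberately simplified toy. In the sketch below, each attack carries a perspective label, the defeat function keeps only attacks whose label is in the relevance set, and acceptance is judged by Dung's grounded extension. The priority $\pi$ is omitted, the choice of grounded semantics is an assumption (the abstract does not name one), and all argument and label names are hypothetical.

```python
def defeats_in_context(labeled_attacks, relevance):
    """Keep only the attacks whose perspective label is active."""
    return {(a, b) for (a, b, label) in labeled_attacks if label in relevance}


def grounded_extension(arguments, defeats):
    """Least fixed point of Dung's characteristic function."""
    attackers = {a: {x for (x, y) in defeats if y == a} for a in arguments}
    accepted = set()
    while True:
        # An argument is defended if each of its attackers is itself
        # defeated by some already accepted argument.
        defended = {
            a for a in arguments
            if all(any((d, b) in defeats for d in accepted) for b in attackers[a])
        }
        if defended == accepted:
            return accepted
        accepted = defended


arguments = {"target", "objection"}
labeled_attacks = {("objection", "target", "legal")}

full = grounded_extension(
    arguments, defeats_in_context(labeled_attacks, {"legal", "economic"}))
partial = grounded_extension(
    arguments, defeats_in_context(labeled_attacks, {"economic"}))
```

With both perspectives active the objection defeats the target, so the target is rejected; with only the hypothetical "economic" perspective active the attack never becomes a defeat and the target is accepted. The paper's worked example and its comparison with VAF audiences are richer than this two-argument toy.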

2026-06-01 13:00 JST arXiv cs.AI LLM/生成AI エージェント

モデルの特化のための自律型エージェントデータエンジニアリングの探求

大規模言語モデル (LLM) は、一般的なタスクでは優れたパフォーマンスを示していますが、高品質のドメイン固有のデータがないと特殊なドメインに適応するのに苦労することがよくあります。既存の LLM ベースのデータキュレーション手法は主に人間が設計したワークフローに依存しているため、LLM がモデルの特化のためのエンドツーエンドのデータエンジニアリングパイプラインを自律的に実行できるかどうかは未検討のままです。私たちは Autonomous Agentic Data Engineering を形式化します。これは、エンドツーエンドのデータキュレーションを通じてモデルの特化を推進する自律型データエンジニアとして LLM を評価するように設計された新しいタスクです。私たちはデータを最適化可能なコンポーネントとして捉え、トレーニング後のパフォーマンス向上を指針として、複数のドメインにわたってトレーニングデータを計画、生成、反復的に最適化するエージェントを研究します。実験によると、自律型 LLM データエンジニアは大幅な改善をもたらし、GPT-5.2 は反復的なエージェント主導のデータ適応のみによって生徒モデルを 57.29% 改善するトレーニングカリキュラムを構築します。私たちの研究は、可能性とボトルネックの両方を明らかにすることで、自律的なデータエンジニアリングを測定可能な能力として確立し、エージェント駆動のモデル特化への道筋を示します。コードは https://github.com/zjunlp/DataAgent でリリースされる予定です。

原文 (English)

Exploring Autonomous Agentic Data Engineering for Model Specialization

Large Language Models (LLMs) have demonstrated strong performance on general tasks, while often struggling to adapt to specialized domains without high-quality domain-specific data. Existing LLM-based data curation methods primarily rely on human-designed workflows, leaving it unexamined whether LLMs can autonomously execute an end-to-end data engineering pipeline for model specialization. We formalize Autonomous Agentic Data Engineering, a novel task designed to evaluate LLMs as autonomous data engineers that drive model specialization through end-to-end data curation. We frame data as an optimizable component and study agents that plan, generate, and iteratively optimize training data across multiple domains, guided by post-training performance improvement. Experiments show that autonomous LLM data engineers yield substantial gains, as GPT-5.2 constructs a training curriculum that improves a student model by 57.29%, entirely through iterative, agent-driven data adaptation. By illuminating both potential and bottlenecks, our study establishes autonomous data engineering as a measurable capability and charts a path toward agent-driven model specialization. Code will be released at https://github.com/zjunlp/DataAgent.

2026-06-01 13:00 JST arXiv cs.AI LLM/生成AI エージェント 研究/論文

LongDS-Bench: 長期にわたるエージェントデータ分析の失敗について

現実世界のデータ分析は本質的に反復的ですが、既存のベンチマークは主に孤立したタスクや短時間の対話型タスクを評価するため、長期にわたって進化する分析コンテキストを追跡するエージェントの能力はテストされていません。エージェントが進化する分析状態を維持、更新、復元、構成する必要がある長期にわたるマルチターンデータ分析のベンチマークである LongDS を紹介します。 LongDS は、現実世界の Kaggle ノートブックから構築された 68 のタスクで構成され、地球科学、ビジネス、教育を含む 6 つのドメインにわたる 2,225 ターンに及びます。タスクは状態発展パターン (反事実摂動、ロールバック、複数状態の構成など) を中心に設計されており、依存関係の平均スパンは 11.3 ターンです。 5 つの最先端モデルを評価したところ、最良のモデルでも平均精度は 48.45% にとどまり、ターンの初期から後期にかけてパフォーマンスが 47 ポイント近く低下し、長期エラーが失敗の 52% ～ 69% を占めることがわかりました。さらに分析を進めると、エージェントステップを追加しても必ずしもパフォーマンスが向上するとは限らず、主なボトルネックはインタラクションバジェットを増やすことではなく、正しい分析状態を維持することにあることが示唆されています。私たちは、信頼性の高い長期的なエージェントデータ分析の研究をサポートするために LongDS をリリースします。コードとデータは https://github.com/zjunlp/DataMind でリリースされます。

原文 (English)

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agents' ability to track evolving analytical context over long horizons untested. We introduce LongDS, a benchmark for long-horizon, multi-turn data analysis where agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks constructed from real-world Kaggle notebooks, spanning 2,225 turns across six domains including Geoscience, Business, and Education. Tasks are designed around state-evolution patterns (e.g., counterfactual perturbation, rollback, multi-state composition), with an average dependency span of 11.3 turns. Evaluating five state-of-the-art models, we find that the best model reaches only 48.45% average accuracy, performance drops nearly 47 points from early to late turns, and long-horizon errors account for 52%--69% of failures. Further analysis shows that additional agent steps do not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than increasing interaction budget. We release LongDS to support research on reliable long-horizon agentic data analysis. Code and data will be released at https://github.com/zjunlp/DataMind.

2026-06-01 13:00 JST arXiv cs.AI LLM/生成AI エージェント

テストする表面は壊れる表面ではありません

ツールで拡張された LLM エージェントは、プロンプトインジェクションに対して脆弱です。エージェントのコンテキストの一部を制御するサードパーティが、エージェントがユーザーからのものであるかのように命令を実行する可能性があります。現在の評価では、ツールの出力である 1 つのチャネル上のモデルごとに 1 つの攻撃成功率が報告され、その数値がモデルの脆弱性として扱われます。しかし、エージェントがツールを呼び出す前に毎回読み取るツールの説明自体が、攻撃者が代わりに選択できる注入面となります。インジェクションペイロードをバイト同一に保持し、6 つのファミリーと 4 つのタスクスイートからの 13 個の LLM の両方のサーフェスを介して配信します。モデル間で同じバイトが成功率で逆転します。GPT-4.1 はツール出力では 96% 脆弱ですが、ツール記述ではわずか 4% ですが、GEMINI-3-FLASH は 20% と 98% でミラーパターンを示します。 6,830 回の試行にわたる分散分解では、攻撃結果の変動の 0% がサーフェスのみによるものである一方、モデルとサーフェスの相互作用は 16.7% を占めます。脆弱性はペアリングの特性であり、チャネルの特性ではありません。サーフェス上のセルごとの最大値として定義される適応攻撃率は、最も強力な固定サーフェスのベースラインを平均で +9.1 パーセントポイント上回ります。標準のプロンプトレベルの防御は同じ盲点を継承し、ツール出力の ASR を 10 ～ 18 パーセントに削減する一方、記述チャネルは 54 パーセントを超えたままにします。攻撃評価と防御評価の両方で、表面ごとの脆弱性を報告する必要があります。

原文 (English)

The Surface You Test Is Not the Surface That Breaks

Tool-augmented LLM agents are vulnerable to prompt injection: a third party who controls part of the agent's context can plant instructions that the agent then executes as if they came from the user. Current evaluations report a single attack success rate per model on one channel, the tool output, and treat that number as the model's vulnerability. But tool descriptions, which the agent reads at every turn before any tool is called, are themselves an injection surface that the attacker can choose instead. We hold the injection payload byte-identical and deliver it through both surfaces across 13 LLMs from six families and four task suites. The same bytes invert in success rate across models: GPT-4.1 is 96 percent vulnerable on tool outputs but only 4 percent on tool descriptions, while GEMINI-3-FLASH shows the mirror pattern at 20 percent and 98 percent. A variance decomposition over 6,830 attempts attributes 0 percent of the variation in attack outcomes to the surface alone, while the model-surface interaction accounts for 16.7 percent. Vulnerability is a property of the pairing, not the channel. The Adaptive Attack Rate, defined as the per-cell maximum over surfaces, exceeds the strongest fixed-surface baseline by +9.1 percentage points on average. Standard prompt-level defenses inherit the same blind spot, reducing tool-output ASR to 10-18 percent while leaving the description channel above 54 percent. Both attack and defense evaluation must report per-surface vulnerability.
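The difference between an adaptive attacker and a fixed-surface baseline comes down to where the maximum is taken, which a few lines of arithmetic can show. The sketch below uses only the two model rows quoted in the abstract and treats each model as one cell; that cell definition is an assumption (the paper's cells may also be split by task suite), and `adaptive_attack_rate` is a hypothetical helper.

```python
def adaptive_attack_rate(asr):
    """asr maps cell -> {surface: attack success rate}.

    Returns (adaptive rate, best fixed-surface rate, gap).
    """
    cells = list(asr)
    surfaces = sorted({s for rates in asr.values() for s in rates})
    # Adaptive attacker: picks the better surface separately for each cell.
    adaptive = sum(max(asr[c].values()) for c in cells) / len(cells)
    # Fixed-surface attacker: commits to one surface for every cell.
    fixed = {s: sum(asr[c][s] for c in cells) / len(cells) for s in surfaces}
    best_fixed = max(fixed.values())
    return adaptive, best_fixed, adaptive - best_fixed


# The two rows quoted in the abstract, as fractions.
asr = {
    "GPT-4.1": {"tool_output": 0.96, "tool_description": 0.04},
    "GEMINI-3-FLASH": {"tool_output": 0.20, "tool_description": 0.98},
}
adaptive, best_fixed, gap = adaptive_attack_rate(asr)
```

For these two mirrored models alone the adaptive rate is 0.97 against 0.58 for the best single surface. That gap is far larger than the +9.1 points the abstract reports on average, which is expected because these two rows are the most extreme examples it quotes.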

2026-06-01 13:00 JST arXiv cs.AI エージェント

分離可能なダイナミクスの状態拡張とコンセンサスによる、スケーラブルな制約付きマルチエージェント強化学習

我々は、状態拡張ポリシー学習と双対変数に対する分散型コンセンサスを組み合わせた、制約付きマルチエージェント強化学習 (MARL) のための分散型アプローチを提案します。私たちの方法は、エージェントが分離可能なダイナミクスを持っているが、グローバルなリソース制約を満たすために調整する必要があるシステムを対象としています。経験的に示しているように、この設定では、エージェントが集合的な制約を満たすための適切な個別の貢献を決定できないため、独立した学習では実行可能な解決策を生み出すことができません。主要な技術的貢献は、独立したトレーニングのスケーラビリティを維持しながら、グローバルに調整された制約の適用には、ラグランジュ乗数に対する軽量の隣接間コンセンサスで十分であることを示したことです。各エージェントは、ローカル状態と、制約フィードバックをエンコードする双対変数の両方を条件として、単一の拡張ポリシーをオフラインで学習します。実行中、エージェントはローカル通信のみを通じてこの双対変数について合意に達します。緩やかな接続性の仮定の下では、エージェントの乗数間のコンセンサス誤差が有界であることを証明し、これがグラフの接続性とコンセンサスラウンドの数とともに減少する有界な制約違反につながることを示します。複雑さがエージェント数に応じて少なくとも二次関数的に増加する集中学習・分散実行 (CTDE) アプローチとは異なり、私たちの方法はトレーニングと実行の両方で線形にスケールします。スマートグリッドのデマンドレスポンスに関する実験では、コンセンサス調整が実現可能性にとって不可欠であることが実証されています。コンセンサス調整がなければ、エージェントは需要を無期限に延期することによってのみグリッド容量の制約を満たすことになり、これは退化した非解決策です。コンセンサスがあれば、エージェントは共有の双対変数に収束し、グリッド制約と需要充足の両方を満たし、CTDE ベースラインが数十に制限されているのに対し、数千のエージェントに拡張できます。

原文 (English)

Scalable Constrained Multi-Agent Reinforcement Learning via State Augmentation and Consensus for Separable Dynamics

We present a distributed approach for constrained Multi-Agent Reinforcement Learning (MARL) that combines state-augmented policy learning with distributed consensus over dual variables. Our method targets systems where agents have separable dynamics but must coordinate to satisfy global resource constraints, a setting in which, as we demonstrate empirically, independent learning fails to produce feasible solutions because agents cannot determine appropriate individual contributions toward collective constraint satisfaction. The key technical contribution is showing that lightweight neighbor-to-neighbor consensus over Lagrange multipliers suffices for globally coordinated constraint enforcement while preserving the scalability of independent training. Each agent learns a single augmented policy offline, conditioned on both its local state and a dual variable encoding constraint feedback. During execution, agents reach agreement on this dual variable through local communication alone. We prove that under mild connectivity assumptions, the consensus error among agents' multipliers is bounded, and show that this translates to a bounded constraint violation that decreases with graph connectivity and the number of consensus rounds. Unlike centralized training with decentralized execution (CTDE) approaches, whose complexity grows at least quadratically with agent count, our method scales linearly in both training and execution. Experiments on smart grid demand response demonstrate that consensus coordination is essential for feasibility: without it, agents satisfy grid capacity constraints only by indefinitely postponing demand, a degenerate non-solution. With consensus, agents converge to a shared dual variable and satisfy both grid constraints and demand fulfillment, scaling to thousands of agents while CTDE baselines are limited to dozens.
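The neighbor-to-neighbor agreement step can be pictured with plain averaging consensus. The sketch below is not the paper's algorithm: it covers only the averaging of per-agent multipliers over a communication graph, with no policies and no dual ascent. The ring topology, the Metropolis weights, and the initial multiplier values are all assumptions chosen for illustration.

```python
def consensus_round(values, neighbors):
    """One synchronous averaging round with Metropolis weights."""
    new_values = []
    for i, v in enumerate(values):
        update = v
        for j in neighbors[i]:
            # Symmetric weights keep the network-wide average unchanged.
            w = 1.0 / (1 + max(len(neighbors[i]), len(neighbors[j])))
            update += w * (values[j] - v)
        new_values.append(update)
    return new_values


def run_consensus(values, neighbors, rounds):
    """Apply the averaging round a fixed number of times."""
    for _ in range(rounds):
        values = consensus_round(values, neighbors)
    return values


def ring(n):
    """Each agent talks only to its two neighbours on a ring."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def spread(xs):
    """Largest disagreement between any two agents."""
    return max(xs) - min(xs)


multipliers = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]  # made-up local dual variables
graph = ring(len(multipliers))
after_5 = run_consensus(multipliers, graph, 5)
after_50 = run_consensus(multipliers, graph, 50)
```

The disagreement between agents shrinks as rounds are added, and better connected graphs shrink it faster. That matches the qualitative dependence on connectivity and round count stated in the abstract, although the paper's actual bound is not reproduced here.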

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Robotics

Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode

Physical AI systems, including robots, autonomous vehicles, embodied agents and edge copilots, often run a different inference workload from cloud LLM serving: single-stream, batch-1 autoregressive decode, where one robot, camera feed or user session waits on the next token. This workload is usually described as memory-bandwidth-bound. Each decode step streams model weights and the active KV cache, so latency should scale with peak HBM bandwidth. We show that this account is true but incomplete. We measure batch-1 decode for three 7 to 8B-class GQA transformers across four NVIDIA GPUs: H100 SXM5, A100-80GB SXM4, L40S and L4. We evaluate context lengths from 2048 to 16384, producing 44 valid cells under a controlled bf16 SDPA setup. The achieved fraction of peak HBM bandwidth falls as peak bandwidth rises. On the headline Qwen-2.5-7B ctx=2048 cell, an L4 reaches roughly 81 percent of its analytic memory floor, while an H100 reaches only 27 percent. Physical-AI decode is memory-dominated, but faster memory does not translate into proportional latency gains. We test the missing term with a CUDA Graphs A/B experiment. On H100 at ctx=2048, CUDA Graphs improves decode latency by 1.259x across N=10 fresh sessions, with a 95 percent bootstrap confidence interval of 1.253 to 1.267. On L4, the same intervention gives only 1.028x. This isolates a launch-side overhead that becomes visible on fast GPUs but remains mostly hidden on slower, bandwidth-bound GPUs. The deployment implication is that memory savings matter only when the runtime realises them. On L4, bf16 decode sits close to the memory floor, but common quantised paths do not recover the expected 4x weight-traffic reduction: bnb-nf4 reaches 59.36 ms/step and AutoAWQ+Marlin reaches 45.24 ms/step from a 62.32 ms bf16 baseline. GPTQ+ExLlamaV2, with Ada-tuned int4 kernels, reaches 17.36 ms/step.
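The gap between the expected 4x weight-traffic reduction and the measured L4 latencies can be checked with simple arithmetic. The sketch below uses only the ms/step figures quoted in the abstract. Treating 4x as the ideal speedup assumes decode time is dominated by weight traffic, which is a simplification, and the variable names are illustrative.

```python
# Latencies on L4 in ms/step, as quoted in the abstract.
BF16_BASELINE_MS = 62.32
QUANTISED_MS = {
    "bnb-nf4": 59.36,
    "AutoAWQ+Marlin": 45.24,
    "GPTQ+ExLlamaV2": 17.36,
}

# Assumption: moving 16-bit weights to 4-bit would ideally cut weight traffic 4x.
EXPECTED_REDUCTION = 4.0


def speedup(baseline_ms, measured_ms):
    """Realised speedup of a quantised path relative to the bf16 baseline."""
    return baseline_ms / measured_ms


speedups = {name: speedup(BF16_BASELINE_MS, ms) for name, ms in QUANTISED_MS.items()}

for name, value in speedups.items():
    share = value / EXPECTED_REDUCTION
    print(f"{name}: {value:.2f}x realised, {share:.0%} of the ideal {EXPECTED_REDUCTION:.0f}x")
```

Only the GPTQ+ExLlamaV2 path comes close to the ideal figure, which is consistent with the abstract's point that memory savings matter only when the runtime realises them.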

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Business/Funding

Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents

Two clinical AI systems can score nearly identically on coverage-based rubrics yet behave radically differently when their patient inputs change: one updates its recommendations to match the new clinical signal, while the other produces the same output regardless. We introduce the Causal Sensitivity Score (CSS), a pre-registered interventional metric that mutates oncology tumor-board cases along five clinically meaningful dimensions - biomarker flips, prior-treatment failures, biomarker removals, surgery-status changes, and stage perturbations - and scores whether each model updates its recommendations in the pre-registered correct direction using a {0, 0.5, 1.0} scale. Benchmarked against the Consensus Match Score (CMS), a coverage-based weighted recall metric, six frontier models from three labs evaluated in single-shot inference across 224 cases rank in nearly opposite orders: all six models change rank, the CMS-worst model becomes CSS-best, and one upper-mid CMS model ranks last on CSS. We further surface a universal safety blind spot: every frontier model fails on surgery-status interventions (at most 17.2% CSS on Family D), a finding CMS does not expose. The metric also transfers to tool-using agents: in a ReAct-style experiment, tool use improves CSS for five of six models (+2.5 to +20.3 percentage points), yet the lowest-CSS model retrieves the same chart sections and still fails to update its recommendations - revealing a structural responsiveness deficit visible only under counterfactual evaluation. Cross-judge replication and three-rater medical-professional validation confirm the aggregate findings. Interventional pre-registered metrics like CSS complement coverage-based evaluation for clinical AI agents: they capture responsiveness that coverage metrics miss and offer a candidate dense reward signal for future agentic RL systems.
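A per-family score on the {0, 0.5, 1.0} scale can be aggregated as in the sketch below. The abstract does not state how CSS is aggregated, so the simple mean expressed as a percentage is an assumption, as are the family labels, the sample records, and the function name `css_by_family`.

```python
from collections import defaultdict

# Scale stated in the abstract: wrong direction, partial update, correct update.
ALLOWED_SCORES = {0.0, 0.5, 1.0}


def css_by_family(records):
    """Average the intervention scores per family and return percentages.

    `records` is an iterable of (family, score) pairs. Using a plain mean
    is an assumption; the source does not specify the aggregation.
    """
    buckets = defaultdict(list)
    for family, score in records:
        if score not in ALLOWED_SCORES:
            raise ValueError(f"score {score!r} is not on the 0 / 0.5 / 1.0 scale")
        buckets[family].append(score)
    return {family: 100.0 * sum(s) / len(s) for family, s in buckets.items()}


# Hypothetical scored interventions for one model (not data from the paper).
records = [
    ("biomarker_flip", 1.0),
    ("biomarker_flip", 0.5),
    ("surgery_status", 0.0),
    ("surgery_status", 0.5),
    ("surgery_status", 0.0),
    ("surgery_status", 0.0),
]
scores = css_by_family(records)
```

Reporting the score per intervention family, rather than one pooled number, is what makes a blind spot such as the surgery-status failures visible.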

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

An Organization-Scoped LLM Agent Runtime Architecture for Regulated Cybersecurity Operations

Regulated cybersecurity workflows lack a runtime substrate that enforces organization-level scope across retrieval, tool calls, memory, findings, reports, and audit while remaining model-agnostic and locally deployable. Recent large language model (LLM) agent systems report strong results on isolated cybersecurity tasks, yet they do not by themselves define an auditable platform architecture for regulated security operations centre (SOC) and compliance workflows, where a single analyst may trigger actions that bind the organization, and where the runtime must integrate with existing SIEM/XDR stacks as a primary source of context and alert-driven triggers rather than operate as a standalone analytical layer. This paper proposes an organization-scoped LLM agent runtime architecture for financial cybersecurity. The contribution is a typed Security Context that is created at every entry point, including SIEM/XDR notifications ingested as first-class triggers, and enforced at every component boundary, combined with a shared Runtime Core, logical specialist subagents, a governed Tool Adapter Layer exposing SIEM/XDR query, enrichment, and response primitives under uniform policy and audit, structured findings with evidence references, tiered human-in-the-loop (HITL) gates, and append-only audit. Model Context Protocol (MCP), extended telemetry, digital twins for pentesting, graph retrieval, and federated knowledge sharing are treated as optional extension paths rather than mandatory runtime assumptions. We describe an implementable slice as the architecture's testability surface, and we propose a falsifiable evaluation plan with metric-level pass criteria for architecture readiness, security-policy enforcement, evidence traceability, output quality, and operational observability.

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Agents | Research/Papers

Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs

Scientific figures are among the most effective means of communicating complex research ideas, yet producing publication-quality illustrations remains one of the most labor-intensive parts of paper preparation. Existing automated systems each target a single figure type under text-only input, leaving the diversity of types and conditions researchers actually use unaddressed; their raster outputs further cannot be locally revised. Because scientific figures are structured compositions of discrete semantic components, the localized errors generators produce on such layouts demand not a stronger backbone but a harness. We instantiate this harness in two complementary systems: Crafter, a multi-agent harness for figure generation that generalizes across figure types and input conditions without architectural changes, and CraftEditor, which applies the same pattern to convert raster outputs into editable SVGs. Moreover, we introduce CraftBench, a benchmark spanning three figure types and four input conditions with human quality annotation. Experiments show that Crafter substantially outperforms both standalone generators and the agentic baseline on PaperBanana-Bench and CraftBench, with ablations confirming each component's independent contribution; CraftEditor faithfully converts outputs into editable SVGs that surpass all baselines. Our code and benchmark are available at https://github.com/HaozheZhao/Crafter.

2026-06-01 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents | Robotics | Research/Papers

PInVerify: An Offline Embodied Benchmark for Active Instance Verification

Embodied agents have made strong progress in navigating to target objects, but reaching the goal vicinity does not guarantee that the agent has found the correct instance: subtle attribute differences (e.g., "white floral" vs. "white striped") often require close-range, multi-view inspection. We address this gap with Active Instance Verification (AIV), a task in which an agent actively selects viewpoints around a candidate object to decide whether it matches a fine-grained natural-language description. We formalize AIV as a finite-horizon decision process and introduce PInVerify, an offline embodied benchmark for AIV: 3,000 evaluation episodes across 18 object categories, delivered as multi-view captures with a 6-sector navigation topology that exposes trap views (navigable but uninformative) and unreachable sectors. As reference baselines we build a training-free pipeline and a LoRA-fine-tuned end-to-end agent around open-source multimodal large language models (MLLMs) at on-device scale (≤8B parameters), with attribute decomposition, a visibility-weighted multi-view tracker, and three next-best-view (NBV) strategies. In our evaluation across Qwen3-VL (4B/8B), SenseNova-SI-1.2-InternVL3-8B, CLIP, and SigLIP2, the best MLLM-based baseline exceeds the best embedding baseline by 4.9 pp; GT-box ablations show a +3.1 pp detection gap; and we do not observe reliable gains from active viewpoint selection within the tested NBV strategies. A LoRA-fine-tuned agent (SFT+GSPO) reaches 85.6%. PInVerify aims to support further work on active, fine-grained semantic verification in embodied AI. Code: https://github.com/Avalon-S/PInVerify.

2026-06-01 13:00 JST | arXiv cs.AI | Agents

Automatically Attacking Software Reverse Engineering AI Agents

Software tools for reverse engineering executable binary files, such as Ghidra, enable malware analysts to safely conduct robust static analysis without having access to original source code. Coupled with the analytic power of large language models (LLMs), agentic systems enabled with tools, such as GhidraMCP, can allow analysts to automate a previously human-driven process. Although this automation can increase the productivity of a single malware analyst, it also introduces a new area of vulnerability for malware obfuscation. This paper presents an adversarial technique using genetic algorithm-based prompt generation, a modification of an adversarial attack known as AutoDAN, to demonstrate the ability to deceive LLM-powered disassembly and decompilation systems into misinterpreting binary executables, effectively corrupting their analytical output. This proof-of-concept methodology exploits inherent vulnerabilities in how LLMs process and interpret decompiled machine code via prompt injection by using extraneous string variable assignments to pass surreptitious instructions to the LLM while not impacting the functionality of the executable file. We demonstrate this capability through several concise examples. This approach could enable attackers to bypass automated detection systems that rely on LLM-driven analysis pipelines. By studying and understanding this attack, insights can be gained regarding the security implications of integrating LLMs into cybersecurity toolchains and building more robust agentic code analysis systems.

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Investigating Detection and Obfuscation of Prompt Injection Attacks Against Software Reverse Engineering AI Agents

Agentic software reverse engineering systems are vulnerable to prompt injection attacks placed into the source code of executable binary files. This research demonstrates defensive tactics for detecting the presence of prompt injection strings in the decompiler output of adversarial example programs. Methods for obfuscating these attacks and subsequent methods for defending against these obfuscations are also explored. This research advances the understanding of risk and security of agentic software analysis systems necessary for their deployment into production-level cyber workflows.

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Depth-Dependent Indirect Prompt Injection in Tool-Calling ReAct Agents: Injection Depth, Payload Framing, and Turn-Budget Sensitivity

ReAct agents that interleave chain-of-thought reasoning with tool calls are increasingly deployed for real tasks such as scheduling, file retrieval, and data access. Their tool observation loop creates a direct attack surface: an adversary who controls any tool's return value can embed instructions that redirect the agent away from the user's goal, a threat known as indirect prompt injection. Existing benchmarks evaluate attack success rate (ASR) at a fixed injection position under fixed conditions, leaving three risk dimensions unexplored: where in the tool sequence the payload appears (injection depth), what rhetorical register it uses (framing), and how many turns the agent is permitted (turn cap). We conduct four controlled studies on 20 scenarios spanning five attack categories, totalling 460 trials against GPT-4o-mini and Claude Haiku at a combined API cost under 0.36 USD. Study 1 shows that ASR against GPT-4o-mini decays from 60% at depth 1 to 0% at depths 4 and 5 (Cramer's V = 0.58, p < 0.001; restricted to within-sequence depths 1-3: V = 0.47, p = 0.0013), driven by model resistance at depth 1 and task completion before payload encounter at deeper positions. Study 2 replicates the depth experiment on Claude Haiku, which achieves 0% ASR at every depth through a combination of conservative tool invocation and genuine instruction resistance. Study 3 shows that framing modulates ASR between 25% (neutral) and 75% (persona) at depth 1, a 50-percentage-point range that does not reach statistical significance at N = 20 per condition. Study 4 confirms that ASR is stable across turn caps of 3, 5, and 7, indicating the turn budget is not a risk factor in this setting. Our results establish injection depth as the dominant variable and show that sanitising only the first tool observation captures 67% of measured injection successes.

2026-06-01 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents

Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence

Vision-language models (VLMs) have achieved strong performance on visual question answering (VQA). To mitigate individual hallucinations and blind spots, aggregating diverse perspectives via multi-agent collaboration has emerged as a promising paradigm. While this approach has shown great success in textual QA, its potential in the multimodal domain remains under-explored. Existing multi-agent VQA methods predominantly adapt text-centric protocols, focusing on textual discussions while ignoring the alignment of visual information. In this work, we reveal a key insight: answer-level agreement is insufficient for reliable multi-agent VQA; aligned visual evidence -- shared support from the image regions agents rely on -- is essential for trustworthy consensus. To leverage this insight, we propose EAGLE (Evidence-Aligned Grounded muLti-agent rEasoning), a training-free evidence-centered framework for coordinating multiple VLM agents. EAGLE explicitly exposes each agent's grounding regions as visual evidence, enables mutual verification over the evidence, and uses evidence consistency to guide final decision-making. Experiments on six VQA benchmarks show that EAGLE achieves the best average performance across domains while remaining lightweight, interpretable, and practical for deployment.

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

SAGE: A Novelty Gate for Efficient Memory Evolution in Agentic LLMs

Agentic LLMs must continuously decide whether newly extracted facts should be added, merged with existing memories, or ignored, yet prior work has focused more on retrieval and storage than on principled write-side control. We frame memory evolution as a novelty-detection problem and propose SAGE, a Spherical Adaptive Gate for memory Evolution that scores candidate facts with a von Mises-Fisher-based density estimator over memory embeddings and routes them with an adaptive threshold that tracks memory-store geometry. SAGE resolves clearly novel facts as ADD, clearly redundant facts as NOOP, and sends only uncertain cases to an LLM merge step, reducing expensive write-time reasoning. On LoCoMo, SAGE achieves the best average token-F1 against Mem0 on all seven open-weight backbone comparisons, while on GPT-4o-mini it reduces add-phase API cost by 3.4× and add-phase latency by 2.5× with only a small average judge-score gap. As a drop-in binary gate for A-Mem, SAGE skips roughly 16-18% of LLM calls across five models with minimal quality change on open-weight backbones. These results suggest that novelty-aware write control is a practical lever for improving both memory quality and system efficiency in long-term agentic memory.
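The three-way routing (ADD, NOOP, or an LLM merge step) can be sketched with a von Mises-Fisher-style kernel on unit vectors. This is a simplified illustration and not SAGE itself: it scores a candidate by its nearest stored memory rather than by a full density estimate, and it uses fixed thresholds where the paper describes an adaptive one. The names `kernel_score` and `route`, the value of `kappa`, and both thresholds are assumptions.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so cosine similarity is a dot product."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def kernel_score(candidate, memories, kappa=10.0):
    """Largest vMF-style kernel value exp(kappa * (cos - 1)) over stored memories.

    The result lies in (0, 1]: 1 means an identical direction already exists,
    values near 0 mean nothing similar is stored. Taking the maximum instead
    of a normalised density is a simplification for illustration.
    """
    unit = normalize(candidate)
    best = 0.0
    for memory in memories:
        cosine = sum(a * b for a, b in zip(unit, normalize(memory)))
        best = max(best, math.exp(kappa * (cosine - 1.0)))
    return best


def route(candidate, memories, low=0.1, high=0.9):
    """Route a candidate fact: clearly novel, clearly redundant, or uncertain."""
    if not memories:
        return "ADD"
    score = kernel_score(candidate, memories)
    if score < low:
        return "ADD"
    if score > high:
        return "NOOP"
    return "LLM_MERGE"


# Hypothetical 2-D embeddings standing in for real memory embeddings.
memories = [[1.0, 0.0], [0.0, 1.0]]
decisions = [route(c, memories) for c in ([1.0, 0.0], [-1.0, -1.0], [1.0, 0.5])]
```

Only the middle band between the two thresholds reaches the expensive merge step, which is the source of the cost savings the abstract reports.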

2026-06-01 13:00 JST | arXiv cs.AI | Agents | Business/Funding

Design and Evaluation of Multi-Agent AI Oracle Systems for Prediction Market Resolution

Prediction markets aggregate collective intelligence to forecast uncertain events, but their utility depends on reliable outcome resolution…

2026-06-01 13:00 JST | arXiv cs.AI | Agents

Safe Equilibrium Policy Optimization for Strategic Agent Policies

Language models fine-tuned with reinforcement learning typically optimize for task reward, ignoring multi-agent strategic structure. Becaus…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Sophrosyne: Agentic Exploration of Relational Data Systems Needs Moderation

Text2SQL agents powered by LLMs translate natural language intent into SQL by exploring the data system through tool calls before formulati…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

PatchWorld: Gradient-Free Optimization of Executable World Models

Text-agent environments are typically modeled as partially observable Markov decision processes (POMDPs), assuming that the simulator's lat…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Research/Papers

BlueFin: Benchmarking LLM Agents on Financial Spreadsheets

We present BlueFin, a benchmark that tasks large language model (LLM) agents with synthesis, manipulation, and comprehension tasks over spr…

2026-06-01 13:00 JST | arXiv cs.AI | Image/Video Generation | Agents

Does Visual Information Play a Decisive Role in Vision-Language-Action Model Driving Behavior?

Vision-Language-Action (VLA) models have demonstrated promising capability in autonomous driving, highlighting the potential of unified mul…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

LLM agents are evolving from conversational chatbots to operational tools in real-world workspaces. In local agentic harnesses, an LLM can…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Agents

SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reason…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion

Monitoring autonomous language model agents currently relies mostly on surface behavior. But what happens when agent populations invent new…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI | Image/Video Generation | Agents | Research/Papers

ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models

Multimodal large language models (MLLMs) have shown strong potential as embodied agents, yet embodied geo-localization remains underexplore…

2026-06-01 13:00 JST | arXiv cs.AI | Agents

Personalized to Persuade: The Effects of Contextualization and Warmth on Trust and Reliance in Conversational AI

Artificial Intelligence (AI) agents personalize their responses by tailoring explanations to users' backgrounds, interests, and prior inter…

2026-06-01 13:00 JST | arXiv cs.AI | Agents

Social welfare optimisation under institutional reward and punishment

Institutional incentives are widely used to promote cooperation among autonomous, self-regarding agents, from human societies to multi-agen…

2026-06-01 13:00 JST | arXiv cs.AI | Agents

Dreaming Of Others: Latent Teammate Modeling In World Models For Multi-Agent Reinforcement Learning

In cooperative multi-agent reinforcement learning (MARL), agents must coordinate with partners whose internal policies and intentions are n…

2026-06-01 13:00 JST | arXiv cs.AI | Agents

DynaTree: Dynamic Agentic Retrieval Tree for Time-Sensitive News Retrieval

Agentic Retrieval-Augmented Generation improves retrieval by integrating planning, tool use, and iterative reasoning, but existing agentic…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Skill Availability and Presentation Granularity in Large-Language-Model Agents: A Controlled SkillsBench Study

Skill documents provide procedural knowledge to large-language-model agents at inference time. This article studies whether the presentatio…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Used Car Salesbots? Honesty and Credulity of LLMs as Bargaining Agents under Partial Information

In this work we study agents in simulated bargaining scenarios, where a buyer and a seller communicate through a text channel and attempt t…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

PithTrain: A Compact and Agent-Native MoE Training System

Mixture-of-Experts (MoE) has become the dominant architecture for frontier language models. To meet this demand, production frameworks have…

2026-06-01 13:00 JST | arXiv cs.AI | Agents | Hardware/Semiconductors

GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization

GPU kernels are the workhorse of modern deep learning, and optimizing them (via evolutionary search or coding agents) usually requires repe…

2026-06-01 13:00 JST | arXiv cs.AI | Agents

Skill Reuse as Compression in Agentic RL

Large language model agents trained with reinforcement learning (RL) often learn brittle, task-specific shortcuts. We hypothesize that agen…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Research/Papers

If LLMs Have Human-Like Attributes, Then So Does Age of Empires II

Much research has been carried out on large language models (LLMs) and LLM-powered agentic workflows. However, many works within the field…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in e…

2026-06-01 13:00 JST | arXiv cs.AI | Agents

Stateful Online Monitoring Catches Distributed Agent Attacks

Language models can find thousands of severe software vulnerabilities, and agents are increasingly being misused for cyberattacks. To avoid…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Post-Training LLMs as Better Decision-Making Agents: A Regret-Minimization Approach

Large language models (LLMs) are increasingly deployed as "agents" for decision-making (DM) in interactive and dynamic environments. Yet, s…

2026-06-01 13:00 JST | arXiv cs.AI | Agents

Agentic Physical AI toward a Domain-Specific Foundation Model for Energy Systems: A Case Study on Nuclear Reactor Control

The prevailing paradigm in AI for physical systems: scaling general-purpose foundation models toward universal multimodal reasoning, confro…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

ConSensus: Multi-Agent Collaboration for Multimodal Sensing

Large language models (LLMs) are increasingly grounded in sensor data to perceive and reason about human physiology and the physical world.…

2026-06-01 13:00 JST | arXiv cs.AI | Agents

NEMO: Execution-Aware Optimization Modeling via Autonomous Coding Agents

We present NEMO, a system that translates Natural-language descriptions of decision problems into formal Executable Mathematical Optimizati…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents

Anonymization is often assumed to protect privacy once explicit identifiers are removed, because re-identification has historically require…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Counterfactual Credit Policy Optimization for Multi-Agent Collaboration

Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles, but reinforcement learning f…

2026-06-01 13:00 JST | arXiv cs.AI | Agents, Business/Funding

LH-Bench: Skill-Grounded Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks

Large language models excel on objectively verifiable tasks such as math and programming, where evaluation reduces to unit tests or a singl…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents, Research/Papers

ClimAgent: LLM as Agents for Autonomous Open-ended Climate Science Analysis

Climate research is pivotal for mitigating global environmental crises, yet the accelerating volume of multi-scale datasets and the complex…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents, Business/Funding

Counterfactual Trace Auditing of LLM Agent Skills

Large Language Model agents are increasingly augmented with agent skills. Current evaluation methods for skills remain limited. Most deploy…

2026-06-01 13:00 JST | arXiv cs.AI | Agents

ASH: Agents that Self-Hone via Embodied Learning

Long-horizon embodied tasks remain a fundamental challenge in AI, as current methods rely on hand-engineered rewards or action-labeled demo…

2026-06-01 13:00 JST | arXiv cs.AI | Agents

ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving

Safety-critical scenarios are central to evaluating autonomous driving systems, yet their rarity in naturalistic logs makes simulation-base…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite their effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize their own knowledge boundaries, blindly triggering searches when internal knowledge suffices and failing to terminate search even when adequate evidence has been collected. The lack of self-awareness leads to severe over-search, incurring substantial inference latency and prohibitive computational cost. To this end, we propose SAAS, a novel RL framework designed to cultivate dynamic self-awareness that precisely regulates search behavior without compromising accuracy. SAAS introduces three key components: (i) a search boundary modeling mechanism, which identifies the search boundary under the evolving policy by contrasting search-disabled and search-enabled rollouts; (ii) a boundary-aware reward module, which translates this boundary awareness into trajectory-level penalties, suppressing unnecessary and redundant searches; and (iii) a stage-wise optimization strategy, which leverages a sequential curriculum to prioritize reasoning over search regularization, thereby avoiding reward hacking. Extensive experiments demonstrate that SAAS substantially reduces over-search, while maintaining accuracy. Our code and implementation details are released at https://github.com/XMUDeepLIT/SAAS.
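
As a rough illustration of component (ii), the sketch below turns a boundary estimate into a trajectory-level penalty. It is not the SAAS implementation: the function name, the `needs_search` flag (standing in for the outcome of contrasting search-disabled and search-enabled rollouts), the one-search allowance, and the penalty weight are all assumptions made for this example.

```python
def boundary_aware_reward(correct: bool, num_searches: int,
                          needs_search: bool, penalty: float = 0.1) -> float:
    """Hypothetical trajectory-level reward: accuracy minus an over-search penalty."""
    base = 1.0 if correct else 0.0
    # Assumed rule: if the question was answerable without search, every
    # search call counts as unnecessary; otherwise only calls beyond the
    # first are treated as redundant.
    allowed = 1 if needs_search else 0
    excess = max(0, num_searches - allowed)
    return base - penalty * excess
```

Under these assumptions a correct answer reached with three searches on a question that needed none scores lower than the same answer reached with no search, which is the behavior the abstract describes as suppressing unnecessary and redundant searches.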

2026-06-01 13:00 JST | arXiv cs.AI | Agents

Auto-Discovery-Bench: Diagnosing Structured State Tracking in Oracle-Guided Discovery

Interactive discovery requires agents to maintain and update structured beliefs over many rounds of feedback. Before evaluating agents in n…

2026-06-01 13:00 JST | arXiv cs.AI | Agents

Scaling Multi-Agent Environment Co-Design with Diffusion Models

The agent-environment co-design paradigm jointly optimises agent policies and environment configurations in search of improved system perfo…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Multi-Agent Teams Hold Experts Back

Multi-agent LLM systems are increasingly deployed as autonomous collaborators, where agents interact freely rather than execute fixed, pre-…

2026-06-01 13:00 JST | arXiv cs.AI | Agents

CVE-Factory: Scaling Expert-Level Agentic Tasks for Code Security Vulnerability

Evaluating and improving the security capabilities of code agents requires high-quality, executable vulnerability tasks. However, existing…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents, Business/Funding

A Behavioural and Representational Evaluation of Goal-Directedness in Language Model Agents

Understanding an agent's goals helps explain and predict its behaviour, yet there is no established methodology for reliably attributing go…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents

Training LLMs as interactive agents for multi-turn decision-making remains challenging, particularly in long-horizon tasks with sparse and…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

MASPOB: Bandit-Based Prompt Optimization for Multi-Agent Systems with Graph Neural Networks

Large Language Models (LLMs) have achieved great success in many real-world applications, especially the one serving as the cognitive backb…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents, Research/Papers

SimulCost: A Cost-Aware Benchmark and Toolkit for Automating Physics Simulations with LLMs

Evaluating LLM agents for scientific tasks has focused on token costs while ignoring tool-use costs like simulation time and experimental r…

2026-06-01 13:00 JST | arXiv cs.AI | Image/Video Generation, Agents

MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation

Text-to-video (T2V) generation has rapidly progressed in visual fidelity, yet its ability to faithfully represent multiple cultures within…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning

Retrieval-Augmented Generation (RAG) is widely employed to mitigate risks such as hallucinations and knowledge obsolescence in medical ques…

2026-06-01 13:00 JST | arXiv cs.AI | Agents, Research/Papers

FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics

AI research agents accelerate ML research by automating hypothesis generation, experimentation, and empirical refinement. Existing agent st…

2026-06-01 13:00 JST | arXiv cs.AI | Agents

Interaction-Breaking Adversarial Learning Framework for Robust Multi-Agent Reinforcement Learning

Cooperation is central to multi-agent reinforcement learning (MARL), yet learned coordination can be fragile when external perturbations di…

2026-06-01 13:00 JST | arXiv cs.AI | Agents

Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling

Computer-use agents (CUAs) automate tasks specified with natural language such as "order the cheapest item from Taco Bell" by generating se…

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

No Reader Left Behind: Multi-Agent Summaries Everyone Can Understand

The Plain Writing Act in the United States requires government documents to be accessible in clear and simple language that the general public can easily understand, yet existing summarization systems struggle to address diverse linguistic and cognitive barriers among general readers. We present NRLB (No Reader Left Behind), a multi-agent framework for plain language summarization that simulates three representative reader groups: elementary school student readers, non-native readers, and readers with attention deficits. NRLB combines template-based planning with iterative, reader-oriented refinement, enabling systematic detection and resolution of difficult terms, missing contexts, and confusing sentences. Evaluations across multiple datasets demonstrate consistent improvements in readability while preserving factual accuracy. Human evaluation further validates NRLB's impact, with annotator preference rates ranging from 55% to 76%, highlighting NRLB's potential to produce plain language summaries that are both faithful to the source and broadly accessible to the general public.

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

First head-to-head comparison of agentic AI applied to the analysis of simulated data of the Einstein Telescope

We report a comparison of two state-of-the-art agentic AI systems, Claude Code (Anthropic) and Codex (OpenAI), tasked with autonomously executing a simple end-to-end gravitational wave data analysis pipeline on a shared computing infrastructure without human intervention. The pipeline comprises power spectral density estimation from raw Einstein Telescope simulated noise, geometric template bank generation, matched filter recovery of 100 binary black hole signal injections, automated results generation, and large language model-assisted production of a manuscript formatted in the style of Physical Review D. Both agents received identical written specifications and identical compute resources. The experiment was run twice: a first run with unrealistically loud injections, and a second run with signals rescaled to a physically motivated SNR range. The scientific results converged in both runs. However, the agents exhibited substantially different behaviors and computational costs: Claude Code completed the pipeline in ~3.4 minutes with silent deviations from the specification, while Codex required ~16 minutes across explicit self-correcting restarts, including an unsolicited performance optimization of the matched filter inner loop. The autonomously generated manuscripts also diverged in length, details, and quality. In the second run, a subtle difference in the interpretation of the SNR range instruction led to a genuine scientific divergence: Claude Code silently reinterpreted the instructions, while Codex followed the specification literally. We discuss the implications of these behavioral differences, such as speed versus auditability, silent versus transparent error handling, instruction interpretation, and the criticality of intermediate data representations in multi-model pipelines, for the deployment of agentic AI in scientific computing workflows.

2026-06-01 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

SafeRx-Agent: A Knowledge-Grounded Multi-Agent Framework for Safe and Explainable Medication Recommendation

Medication recommendation predicts medications for patient visits, but existing methods still face two key challenges. At the model level, traditional drug recommendation methods only predict structured drug codes with limited evidence grounding, while LLM agents can use richer clinical context but may lack safety verification and traceability. At the task level, existing benchmarks often use broad medication categories, which ignore subgroup-level safety differences and can lead to risk overestimation. We introduce the first fine-grained medication recommendation setting based on fourth-level ATC code generation. We propose Safe Prescription Agent (SafeRx-Agent), a knowledge-grounded multi-agent framework that uses patient context, external clinical knowledge, and safety verification to recommend traceable medication sets. Experimental results on MIMIC-III and MIMIC-IV datasets show that SafeRx-Agent improves fine-grained medication prediction accuracy while controlling drug interactions, contraindications, and medication set size.
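
For readers unfamiliar with the coding scheme: a full ATC code has seven characters spread over five levels, and the fourth level is its first five characters. A minimal sketch of that truncation follows; the helper name is ours and does not come from the paper.

```python
def atc_level4(code: str) -> str:
    """Return the fourth-level ATC code (first five characters) of a full code."""
    code = code.strip().upper()
    if len(code) < 5:
        raise ValueError(f"ATC code too short for level 4: {code!r}")
    return code[:5]


print(atc_level4("A10BA02"))  # metformin -> A10BA (biguanides)
```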

2026-06-01 09:00 JST | ITmedia AI+ | Agents

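Drawing SaaS adds high-precision 3D model generation, an assembly function, and a beta CAE function

renue has updated its drawing SaaS "Drawing Agent." The update adds a high-precision 3D model generation function, an assembly function for putting together multiple parts, and a "CAE function (beta)" that runs structural analysis, so that everything from reading a drawing to structural analysis can be handled as one continuous flow.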

2026-06-01 05:00 JST | ITmedia AI+ | Agents

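So what is a "harness," anyway? Hugging Face sorts out the AI agent terms everyone uses differently

Hugging Face has sorted out terms such as "harness" and "scaffold," whose meanings vary from person to person in the AI agent field. This beginner-oriented piece explains the basic terms worth knowing in order to understand and discuss AI agents correctly.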

2026-05-30 01:13 JST | TechCrunch AI | Agents

Cognition’s Scott Wu says AI coding agents shouldn’t replace humans

Cognition makes Devin, the first and arguably most successful AI coding agent. But famed coder Wu says it isn't designed to supplant human…

2026-05-29 21:00 JST | OpenAI | LLM/Generative AI, Agents

How Braintrust turns customer requests into code with Codex

How Braintrust engineers use Codex with GPT-5.5 to run experiments and code faster.

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Frontier LLM-based agents can overcome the ontology curation bottleneck for natural phenotypes

Linking free-text phenotype descriptions to ontology terms, typically referred to as phenotype annotation, is essential for the cross-study integration of comparative morphological data. This labor intensive process has heavily relied on highly trained human experts, which makes it challenging to scale and thus a key bottleneck. Dahdul et al. (2018) established a Gold Standard (GS) of Entity-Quality (EQ) annotations across seven phylogenetic studies and used it to evaluate three human curators and the Semantic CharaParser NLP tool with ontology-based semantic similarity metrics; they reported that machine-human consistency was significantly lower than inter-curator (human-human) consistency. Here we revisit that benchmark with five frontier hosted LLMs from Anthropic and OpenAI, each operating as an "agentic curator" within a self-contained workspace that supplies the source publication PDF, the same annotation guide used by the original human curators, the four project ontologies (UBERON, PATO, BSPO, GO), and a validation script. Evaluated against the same Gold Standard, every agent fell within the range of inter-curator variability of the three trained human biocurators of the original study; the best performing agents approached but did not reach the best performing human curator. Agents substantially outperformed Semantic CharaParser on all four metrics.

2026-05-29 13:00 JST | arXiv cs.AI | Agents

VFEAgent: A Multimodal Agent Framework for End-to-End Automated Finite Element Analysis

Finite Element Analysis (FEA) serves as the cornerstone of modern engineering design. However, its workflow is inherently complex and relies heavily on domain expertise. Although recent efforts have integrated Large Language Models (LLMs) into FEA, existing approaches face limitations in handling multimodal inputs and executing complex tasks. To address these limitations, we propose VFEAgent, an end-to-end multi-agent system designed to automate FEA modeling and simulation directly from input images and problem descriptions. Our methodology integrates two core components: (1) a multimodal vision-language multi-agent pipeline that employs ReAct-driven reasoning to extract structured FEA specifications from heterogeneous inputs and (2) a verification-first code synthesis framework, incorporating robust self-debugging and fallback mechanisms to ensure executability and physical validity. We systematically evaluated the system across various engineering mechanics scenarios. The results demonstrate that VFEAgent achieves a high success rate in generating complete and physically valid simulations, outperforming LLM-based baseline methods in reliability and correctness. These findings validate the feasibility of automating the complete FEA workflow, highlighting the framework's potential to liberate engineers from tedious manual analysis.

2026-05-29 13:00 JST | arXiv cs.AI | Agents

Differentiable Belief-based Opponent Shaping

Human coordination often relies on the ability to influence the beliefs of others through strategic action. In multi-agent reinforcement learning, opponent shaping attempts to replicate this influence, though existing methods typically operate within an opponent's parameter, policy, or value space. Meanwhile, belief-manipulation techniques in hidden-role games often rely on hard-coded objectives, such as deception or belief saturation. We propose Differentiable Belief-based Opponent Shaping (D-BOS), a first-order method that treats each observer's belief as the shaped opponent state and differentiates through $k$-step softmax-Bayes belief dynamics. Rather than explicitly rewarding deceptive or cooperative behavior, our method treats the belief state as the target for shaping. This allows the optimal strategy to emerge naturally from the environment's reward structure. This belief-space formulation provides an opponent-shaping signal by differentiating through opponent belief updates, and naturally extends to multiple observers by aggregating gradients over their individual inferred belief trajectories. Empirically, D-BOS outperforms PPO and BBM in hidden-role games, with the largest gains in mixed-motive settings.

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Hallucination Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching

Hallucination remains a major reliability barrier for production LLM systems, particularly in multi-agent pipelines where unsupported claims can propagate unchecked across stages. This paper adapts a HOPE-inspired Nested Learning architecture with Continuum Memory Systems (CMS) and semantic similarity caching to a hybrid benchmark of 310 prompts combining 217 epistemic-uncertainty prompts and 93 fabrication-induction stress-test prompts. A three-stage agentic pipeline orchestrated via the Open Floor Protocol (OFP) is evaluated with five KPIs -- FCD (Factual Claim Density), FGR (Factual Grounding References), FDF (Fictional Disclaimer Frequency), ECS (Explicit Contextualization Score), and OSR (Observability Score Ratio) -- aggregated into THS (Total Hallucination Score) across five weighting configurations to study mitigation-observability trade-offs. FDF, ECS, OSR, and FGR are subtracted as mitigation signals, so that a more negative THS indicates stronger mitigation. The FrontEndAgent is configured as a high-stochasticity generator (temperature = 1.0) to produce a realistic hallucination baseline, while the SecondLevelReviewer and ThirdLevelReviewer operate as progressive correctors. This asymmetric design yields end-to-end THS reductions of -31.3% to -35.9% across five weighting configurations. Semantic caching achieves 440 cache hits over 930 potential calls (47.3% hit rate), reducing LLM invocations to 490, lowering energy and CO2e footprint, and making multi-stage review pipelines operationally viable at production scale. ExtremeObservability attains the most negative final THS (-0.0709), confirming that observability-heavy configurations reinforce rather than compromise mitigation. These findings suggest that memory-augmented multi-agent designs can jointly improve factual reliability, operational efficiency, and auditability without model retraining.
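
The reported cache figures can be checked with a few lines of arithmetic. The `ths` function after them is only a guess at the general shape of the aggregation (FCD counted positively, the four mitigation signals subtracted): the abstract does not give the formula, so the weight keys and the linear form are assumptions.

```python
potential_calls = 930
cache_hits = 440

hit_rate = cache_hits / potential_calls
llm_calls = potential_calls - cache_hits
print(f"{hit_rate:.1%}", llm_calls)  # 47.3% 490


def ths(fcd, fgr, fdf, ecs, osr, w):
    """Assumed form: weighted FCD minus weighted mitigation signals."""
    return (w["fcd"] * fcd - w["fgr"] * fgr - w["fdf"] * fdf
            - w["ecs"] * ecs - w["osr"] * osr)
```

With this sign convention, raising any mitigation signal pushes the score further below zero, matching the abstract's statement that a more negative THS indicates stronger mitigation.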

2026-05-29 13:00 JST | arXiv cs.AI | Agents

The Importance of Out-of-Band Metadata for Safe Autonomous Agents: The Redpanda Agentic Data Plane

AI agents are increasingly expected to operate as digital employees: accessing enterprise data, making decisions, and taking actions autonomously. But agents are simultaneously less predictable than humans -- prone to hallucination, misinterpretation, and adversarial manipulation -- and more technically capable: with deep system knowledge and high-throughput interfaces cascading damage at machine speed. This combination makes it unsafe to rely on agents to faithfully interpret or propagate security-critical metadata such as access policies, data classifications, and behavioral constraints. We present the Redpanda Agentic Data Plane (ADP), an architecture built around out-of-band metadata channels: infrastructure pathways that carry security context, policy signals, and audit trails deterministically, entirely outside the agent's read and write path and across heterogeneous infrastructure. These channels enforce governance at every stage of the agent lifecycle -- scoping data access on the way in, constraining actions during execution, and capturing tamper-proof transcripts on the way out. We demonstrate ADP with a multi-agent portfolio rebalancing system in which autonomous agents monitor markets, make trade decisions, and execute orders across isolated client accounts -- with per-client data scoping, trade approval thresholds, and tamper-proof audit trails all enforced by out-of-band channels the agents can neither see nor bypass.

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Beyond Consensus: Trace-Level Synthesis in Mixture of Agents

When multiple LLM agents solve the same problem, standard practice compresses each agent's reasoning into a majority vote or layered synthesis, treating agreement as the finish line. We show this is unnecessarily lossy: an LLM aggregator that reads complete reasoning traces recovers correct solutions even when agents unanimously agree, with beneficial corrections consistently outweighing harmful ones -- the "aggregation paradox". Majority voting has a ceiling that perturbation diversity does not raise (error correlations are identical); the aggregator's gain comes from trace-level complementarity, assembling correct intermediate steps from minority chains that voting discards. These findings motivate Self-Consistent Mixture of Agents, which generates trace diversity through semantic-preserving input perturbations, safeguards the majority via anchored refinement with provable non-degradation guarantees, and always synthesizes -- never gates on consensus. A single model with perturbation-induced trace variation outperforms heterogeneous model pools across structured reasoning, PhD-level science, competition mathematics, and competitive programming. The unit of aggregation should be the reasoning trace, not the answer.
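
To make the voting "ceiling" concrete: a plain majority vote sees only final answers, so it cannot correct a unanimous wrong answer and it discards a correct minority. The toy sketch below uses invented answers and is not the paper's aggregator.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common final answer (ties go to the first one seen)."""
    return Counter(answers).most_common(1)[0][0]


# Three agents agree on the same final answer; voting can only echo it.
print(majority_vote(["41", "41", "41"]))  # 41
# A dissenting minority answer is discarded, whether or not it was right.
print(majority_vote(["41", "41", "42"]))  # 41
```

An aggregator that reads the full traces has access to the intermediate steps behind each answer, which is the information this function throws away.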

2026-05-29 13:00 JST | arXiv cs.AI | Agents

PRO-CUA: Process-Reward Optimization for Computer Use Agents

Computer use agents (CUAs) have shown strong potential for automating complex digital workflows, yet their training remains constrained by costly live environment interaction and limited high-quality supervision. Existing filtered behavior cloning pipelines suffer from imitation bottlenecks, including distribution shift from the expert demonstration and the absence of negative learning signals. Meanwhile, standard trajectory-level reinforcement learning struggles with sparse rewards, ambiguous credit assignment, and high infrastructure costs for long-horizon GUI interaction. In this work, we propose PRO-CUA, a process-reward optimization framework for training CUAs with iterative step-level reinforcement learning. PRO-CUA decouples on-policy environment interaction from policy optimization: the current policy collects states through live rollouts, generates diverse candidate actions for each state, receives step-level feedback from a process reward model (PRM), and is optimized with group-relative advantages. This design enables dense and flexible credit assignment without relying on golden answers or offline expert trajectories, while reducing distribution shift by training on the agent's own execution states. Experiments on live web benchmarks demonstrate the effectiveness of PRO-CUA and the reliability of PRM-guided step-level training.
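
"Group-relative advantages" usually means scoring each candidate action against the other candidates sampled for the same state. The sketch below assumes the common mean and standard-deviation normalization used in GRPO-style methods; the abstract does not state PRO-CUA's exact formula, and the PRM scores are made up.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize step-level rewards within one state's candidate group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every candidate gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical PRM scores for four candidate actions at one state.
print(group_relative_advantages([0.9, 0.1, 0.5, 0.5]))
```

Candidates scored above the group mean receive positive advantages and those below receive negative ones, which is one way to obtain the negative learning signals that the abstract says behavior cloning lacks.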

2026-05-29 13:00 JST arXiv cs.AI エージェント

Agentic AI システムにおける技術的負債の管理

エージェント型 AI システムは、本番インフラストラクチャとしての活用がますます検討されています。これらのシステムは、複数のステップにわたって推論し、ツールを呼び出し、ワークフローを通じて行動し、メモリとフィードバックを通じて適応します。こうしたシステムは、従来のソフトウェアや予測 ML の技術的負債では十分に捉えられないガバナンス上の課題を生み出します。私たちは、エージェント技術的負債 (Agentic Technical Debt) を、プロンプト、メモリ、ツールスキーマ、オーケストレーショングラフ、制御ポリシー、可観測性ルーチンが、検証、標準化、統制できる速度を上回る速さで継ぎはぎされたときに生じる累積的な負債として定義します。また、確率的税 (Stochastic Tax) を、確率的なエージェントの挙動を許容範囲内に保つために繰り返し発生する運用負担として定義します。この区別は重要です。負債は設計およびガバナンス上の責任のストックであるのに対し、税は確率的なエージェントがツールやワークフローを通じて行動するために生じる運用コストのフローです。管理者が軽量のダッシュボードとガバナンス制御を通じて両方を可視化する方法を概説します。

原文 (English)

Governing Technical Debt in Agentic AI Systems

Agentic AI systems are increasingly being explored as production infrastructure: they reason over multiple steps, call tools, act through workflows, and adapt through memory and feedback. These systems create governance challenges that are not fully captured by traditional software or predictive ML technical debt. We define Agentic Technical Debt as the accumulated liability created when prompts, memory, tool schemas, orchestration graphs, control policies, and observability routines are patched together faster than they can be validated, standardized, and governed. We define Stochastic Tax as the recurring operating burden of keeping probabilistic agent behavior within acceptable bounds. The distinction matters: debt is a stock of design and governance liability, while the tax is a flow of operating cost that arises because stochastic agents act through tools and workflows. We outline how managers can make both visible through lightweight dashboards and governance controls.

2026-05-29 13:00 JST arXiv cs.AI エージェント ビジネス/資金調達 研究/論文

ペーパーエージェント、ペーパーゲイン：DeFi投資エージェントの実証分析

DeFi 投資エージェント (自律的なオンチェーン取引に AI を使用するシステム) は、2024 年後半以降、トークン評価額の合計で 30 億米ドルを超えています。私たちは 1,900 以上の AI タグ付き暗号資産プロジェクトを調査し、投資に特化したエージェントに絞り込んだうえで、戦略と可観測性の側面にまたがる 10 の代表的なプロジェクトを選定しました。次に、ElizaOS と Virtuals Protocol という 2 つの著名なエージェントフレームワークの詳細なアーキテクチャ分析と、取引活動を公開情報から帰属できる 11 の Solana ベースのエージェントトレジャリーについて、925,323 人のトークン保有者を対象とする定量的なオンチェーンパフォーマンス分析を実施しました。その結果、現在のデプロイメントは初期段階にあり、不均質であることがわかりました。(1) 私たちのサンプルでは、多くのプロジェクトが自律的な取引実行の明確な証拠をまだ示しておらず、開発者へのインタビューからは、目に見えるデプロイメントの多くが基本的な API 統合にとどまっていることが示唆されます。(2) エージェントのトレジャリーは 3,000 万米ドルを超える含み益を保持している一方、トークン保有者は合計で 1 億 9,170 万米ドルの損失を被っており、上位 1% のウォレットが全利益の 81.4% (18 億 1,000 万米ドル) を獲得しています。(3) トークンの評価額とトレジャリーのファンダメンタルズとの結び付きは弱く、時価総額対 AUM 比率は 10,000 倍を超えています (確立された DeFi プロトコルでは 1 倍未満)。(4) ユーザーの利益の総額は 24 億米ドルでピークに達した後、純損失に転じ、リターンの中央値はすべてのプラットフォームでマイナスとなり、トークンは史上最高値から平均 93% 下落しました。私たちはこれらの結果を、パーミッションレスな第一世代市場の特徴であると解釈します。そこでは、オープンなインフラストラクチャが迅速な実験を可能にする一方で、自律性、パフォーマンス、ステークホルダーとの利害整合に関する堅牢な標準が現れる前に、未熟なエージェントや投機的なエージェントが立ち上がることも許してしまいます。そこで私たちは、現在のデプロイメントと将来の投資適格なエージェントシステムとの間のギャップを特徴付けるために、自律的な実行、リスク調整後の収益性、ステークホルダーとの利害整合という 3 つの側面に沿った成熟度フレームワークを提案します。

原文 (English)

Paper Agents, Paper Gains: An Empirical Analysis of DeFi Investment Agents

DeFi investment agents, systems that use AI for autonomous on-chain trading, have attained over USD 3 billion in combined token valuations since late 2024. We survey over 1,900 AI-tagged crypto projects, filter to investment-focused agents, and curate 10 representative projects spanning strategy and observability dimensions. We then conduct a deep-dive architectural analysis of two prominent agent frameworks, ElizaOS and Virtuals Protocol, and a quantitative on-chain performance analysis of 11 Solana-based agent treasuries with publicly attributable trading activity, covering 925,323 token holders. We find that current deployments remain early and heterogeneous: (1) in our sample, many projects do not yet provide clear evidence of autonomous trade execution, and developer interviews suggest that many visible deployments remain basic API integrations; (2) agent treasuries retain over USD 30M in paper gains while token holders collectively lost USD 191.7M, with the top 1% of wallets capturing 81.4% of all gains (USD 1.81B); (3) token valuations are weakly connected to treasury fundamentals, with market-cap-to-AUM ratios exceeding 10,000x versus below 1x for established DeFi protocols; and (4) aggregate user gains peaked at USD 2.4B before declining to net losses, with median returns negative on every platform and tokens declining 93% on average from all-time highs. We interpret these outcomes as characteristic of a permissionless, first-generation market in which open infrastructure enables rapid experimentation but also allows naive or speculative agents to launch before robust standards for autonomy, performance, and stakeholder alignment emerge. We therefore propose a maturity framework along three dimensions: autonomous execution, risk-adjusted profitability, and stakeholder alignment, to characterize the gap between current deployments and future investment-grade agent systems.

2026-05-29 13:00 JST arXiv cs.AI エージェント

GTA: Web エージェント向けに長期的なタスクを大規模に生成

Web エージェントは、言語モデルにブラウジング機能とツール使用機能を組み合わせたもので、オープンな Web アシスタントとして有望です。しかし、スケーラブルなプロセスレベルの教師信号が不足しているため、進歩はますます制限されています。既存のベンチマークの大半は手作業で構築されており、中間の軌跡を含まない粗い開始・ゴールのアノテーションしか提供しません。一方、最近の自動生成の取り組みは、依然として高コストで、偏りがあり、浅いものにとどまっています。これらの制限により、現実的なマルチホップかつクロスページのタスクに汎化しなければならないエージェントの、信頼できるトレーニングと評価が妨げられています。私たちは、クローリング、検索ベースのシーディング、インコンテキスト生成、自動品質管理を統合し、実行可能な軌跡と対になった現実的なタスクを生成するスケーラブルなフレームワーク GTA を導入します。この設計は、クローリングを生成から切り離して効率を高め、タスクをサイトグラフに基づかせることで構成性を担保し、決定論的なリプレイと体系的な検証を通じて密な教師信号を保証します。電子商取引、行政、フォーラム、ニュースをカバーする 50 以上の Web サイト上でパイプラインをインスタンス化し、多言語およびマルチホップのタスクをカバーします。得られたベンチマークは、人間とエージェントの間の大きなパフォーマンスギャップを明らかにし、詳細な診断を可能にします。私たちの貢献は 3 つあります。(i) マルチホップ Web エージェントタスク生成の定式化、(ii) 自動データ作成のための効率的で検証済みのパイプラインの提案、(iii) 再現可能な評価を備えた動的ベンチマークの公開。

原文 (English)

GTA: Generating Long-Horizon Tasks for Web Agents at Scale

Web agents, which couple language models with browsing and tool-use capabilities, show promise as open web assistants. Yet progress is increasingly limited by the lack of scalable, process-level supervision. Existing benchmarks are largely manually constructed, providing only coarse start-goal annotations without intermediate trajectories, while recent automatic generation efforts remain expensive, biased, and shallow. These limitations prevent reliable training and evaluation of agents that must generalize to realistic, multi-hop, cross-page tasks. We introduce a scalable framework, GTA, that integrates crawling, retrieval-based seeding, in-context generation, and automated quality control to produce realistic tasks paired with executable trajectories. This design decouples crawling from generation for greater efficiency, grounds tasks in the site graph to enforce compositionality, and ensures dense supervision through deterministic replays and systematic validation. We instantiate the pipeline on over 50 websites covering e-commerce, government, forums, and news, with multilingual and multi-hop coverage. The resulting benchmark reveals a significant human-agent performance gap and enables detailed diagnostics. Our contributions are three-fold: (i) formalizing multi-hop web-agent task generation, (ii) proposing an efficient and validated pipeline for automatic data creation, and (iii) releasing a dynamic benchmark with reproducible evaluation.
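
To make "grounding tasks in the site graph" concrete, here is a minimal Python sketch that enumerates multi-hop click paths over a crawled link graph. It only illustrates the graph-grounding idea; GTA's actual seeding, in-context generation, and quality control are not reproduced, and the graph, page names, and function name are hypothetical.

```python
def multi_hop_paths(site_graph, start, hops):
    """Enumerate link paths of exactly `hops` clicks that never revisit a page."""
    paths = [[start]]
    for _ in range(hops):
        paths = [
            path + [nxt]
            for path in paths
            for nxt in site_graph.get(path[-1], [])
            if nxt not in path
        ]
    return paths

# Hypothetical crawl result: page -> pages reachable with one click.
site_graph = {
    "home": ["catalog", "help"],
    "catalog": ["item", "home"],
    "item": ["cart"],
    "help": [],
}

one_hop = multi_hop_paths(site_graph, "home", 1)
three_hop = multi_hop_paths(site_graph, "home", 3)
```

Each returned path is a candidate skeleton for a cross-page task, and because it follows real links it can be replayed deterministically, which is the property the abstract relies on for dense supervision.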

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント ビジネス/資金調達 研究/論文

BenchTrace: LLM エージェントのリフレクション能力と制御された進化をテストするためのベンチマーク

自己進化エージェントは過去の失敗を振り返る (リフレクション) ことで時間とともに改善しますが、既存の評価には 2 つの限界があります。1 つはタスクスコアしか測定せず、リフレクションの品質が不明なままであること、もう 1 つはエージェント自身のエピソード実行に依存しており、特定の失敗パターンを狙うメカニズムがないことです。私たちは、LLM エージェントの自己進化能力を評価するためのベンチマークである BenchTrace を紹介します。BenchTrace は、6 つの多様なタスクにまたがる 1,821 件のアノテーション付きエピソードからなるスナップショット・リフレクションデータセットに基づいて構築されており、対象を絞った QA タスクを通じて失敗の特定能力を調べる「リフレクション評価」と、制御された自己進化シミュレーションにおいて過去の失敗経験が回避行動につながるかどうかをテストする「進化評価」で構成されます。BenchTrace に基づいて、エージェントが対象の失敗インスタンスをうまく回避できたテストケースの割合を測定する新しい評価指標である失敗回避率 (FAR) を提案します。Qwen3-32B と GPT-4.1 を用いた実験では、どちらのモデルもリフレクション評価でのエンドツーエンドの合格率が 30% を下回り、診断が主なボトルネックであることが明らかになりました。進化評価では、自己進化手法はおおむね非進化ベースラインよりも FAR を改善するものの、ノイズとなるエピソードが蓄積するにつれてエージェントは初期の教訓を忘れ、またリフレクションを特定のコンテキストを超えて汎化できず、タスクコンテキスト間で負の転移を引き起こすことが示されました。さらに、相関分析により、完全に正しいリフレクションのみが高い FAR と強く関連することが明らかになりました。BenchTrace は、現在の自己進化アプローチの具体的な限界を明らかにし、対象を絞った評価のための、制御された、モデルに依存しないフレームワークを提供します。

原文 (English)

BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents

Self-evolving agents improve over time by reflecting on past failures, but existing evaluation is limited in two ways: it measures only task scores, leaving reflection quality unknown, and it relies on agents' own episode runs, offering no mechanism to target specific failure patterns. We present BenchTrace, a benchmark for evaluating self-evolution ability in LLM agents. BenchTrace is built on a snapshot-reflection dataset of 1,821 annotated episodes spanning six diverse tasks, and comprises a Reflection Evaluation that probes failure identification through targeted QA tasks, and an Evolution Evaluation that tests whether past failure experience translates into avoidance behavior in a controlled self-evolution simulation. Building on BenchTrace, we propose failure avoidance rate (FAR), a new evaluation metric measuring the fraction of test cases in which the agent successfully avoids the target failure instance. Experiments with Qwen3-32B and GPT-4.1 reveal that both models fall below a 30% end-to-end pass rate on reflection evaluation, with diagnosis as the primary bottleneck. Evolution evaluation shows that self-evolution methods generally improve FAR over the non-evolving baseline, but agents forget early lessons as noise episodes accumulate, and agents fail to generalize their reflections beyond the specific context, causing negative transfer across task contexts. Our correlation analysis further reveals that only a fully correct reflection is strongly associated with higher FAR. BenchTrace exposes concrete limits of current self-evolution approaches and provides a controlled, model-agnostic framework for targeted evaluation.
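
The FAR definition in the abstract (the fraction of test cases in which the agent avoids the target failure instance) maps directly onto a ratio. A minimal Python sketch follows; the function name and the outcome list are illustrative, and how "avoided" is judged for each case is left to the benchmark.

```python
def failure_avoidance_rate(avoided_flags):
    """Fraction of test cases in which the target failure was avoided."""
    if not avoided_flags:
        raise ValueError("at least one test case is required")
    return sum(avoided_flags) / len(avoided_flags)

# Hypothetical per-case outcomes: True means the agent avoided the failure.
avoided_flags = [True, False, True, True, False]
far = failure_avoidance_rate(avoided_flags)  # 3 of 5 cases avoided
```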

2026-05-29 13:00 JST arXiv cs.AI エージェント

確実に安全なエージェントガードレール

大規模な言語モデルが、制限された生成エンジンから、拡張的な実行権限を持つエージェントに移行するにつれて、AI が制御不能になり、人工知能のセキュリティに根本的な危機が生じます。既存の防御アーキテクチャは、経験的なセマンティックガードレールと確率的な大規模モデルの判定子に大きく依存しており、複雑なセマンティックシンボルデカップリング攻撃に直面した場合に決定論的なセキュリティの下限を提供できないメカニズムです。この経験的なセマンティックガードレールのジレンマを克服するために、この論文では、論理的推論の基本的な制限に基づいたエージェントのための新しいセキュリティパラダイムを提案します。このパラダイムに基づいて、ニューラルシンボリック分離アーキテクチャを備えた実行可能な Proof-Constrained Action (ePCA) フレームワークをさらに導入します。このフレームワークは自然言語の意味論的な信頼を放棄し、エージェントに物理的な操作を実行する前に意図を一次論理数学的制約にロスレスで形式化することを強制します。巨視的および微視的な二次元動的敵対的システムの経験的評価は、私たちの形式的検証メカニズムが、評価されたシナリオ全体でゼロの攻撃成功率とゼロの誤検知率を、極めて低い計算待ち時間で達成することを実証しています。この研究は、明示的なシステム仮定の下での条件付きの形式的基盤と、将来のインテリジェントシステムの基礎となる防御基盤を構築するためのエンジニアリングパラダイムを提供します。

原文 (English)

Provably Secure Agent Guardrail

As large language models transition from bounded generative engines to agents with expansive execution privileges, AI going out of control precipitates a fundamental crisis in artificial intelligence security. Existing defense architectures heavily rely on empirical semantic guardrails and probabilistic large model adjudicators, mechanisms that fail to provide deterministic security lower bounds when facing complex semantic symbol decoupling attacks. To overcome this empirical semantic guardrail dilemma, this paper proposes a new security paradigm for agents based on the fundamental limitations of logical reasoning. Based on this paradigm, we further introduce an executable Proof-Constrained Action (ePCA) framework with a neural symbolic isolation architecture. This framework abandons semantic trust in natural language, forcing agents to losslessly formalize their intentions into first-order logical mathematical constraints before performing physical operations. Empirical evaluations of macroscopic and microscopic two-dimensional dynamic adversarial systems demonstrate that our formal verification mechanism achieves zero attack success rate and zero false positive rate across the evaluated scenarios, with extremely low computational latency. This research provides a conditional formal foundation under explicit system assumptions and an engineering paradigm for constructing the underlying defense foundation for future intelligent systems.

2026-05-29 13:00 JST arXiv cs.AI エージェント 研究/論文

OpenClawBench: 現実世界のエージェント実行軌跡におけるプロセス側の異常のベンチマーク

タスクの成功は、実世界のエージェント実行におけるプロセスの異常を覆い隠すことがあります。エージェントは、未解決のあいまいさ、安全でない外部への書き込み、無視されたエラー、根拠の弱いコミットメント、能力の境界を超えた過剰なコミットメントを蓄積したまま、最終タスクのオラクルを通過する可能性があります。私たちはこの不一致を「結果とプロセスのギャップ (Outcome-Process Gap)」として研究し、実際のエージェント実行プロセスにおけるプロセス側の異常を測定し、教師データを与えるための大規模データセットである OpenClawBench を導入します。OpenClawBench は、6 つのソースモデルによって生成された BFCL 駆動の OpenClaw セッションから構築されており、31,264 件のアノテーション付き軌跡を含みます。このデータセットは、タスクオラクルの結果を構造化されたプロセスの証拠と対応付けます。FullTax は、対応付けられた軌跡を構造化された異常の教師データ (バイナリラベル、裏付けとなる証拠、発生時点と範囲の特定、重大度、回復可能性、5 クラスの異常分類) に変換します。OpenClawBench を用いることで、結果とプロセスのギャップが測定可能になります。オラクルを通過した 31,135 件の実行のうち、2,904 件は FullTax の下で依然としてプロセス異常とラベル付けされています。これらの結果は、成功のみに基づく評価では、実際のエージェント実行におけるプロセス側の失敗という具体的なクラスを見逃すことを示しています。信頼度の高い FullTax 教師データプールでトレーニングした、LoRA でファインチューニングされた Gemma 3 12B 検出器は、ラベルがよりクリーンなホールドアウトテスト分割でバイナリ F1=0.729 に達します。OpenClawBench は、実際のエージェント実行ログを、実行時のエージェントの信頼性を研究、診断し、運用上モニタリングするための、監査可能で再利用可能な教師データに変換します。

原文 (English)

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

Task success can hide process anomalies in real-world agent executions. An agent may pass the final task oracle while still accumulating unresolved ambiguity, unsafe external writes, ignored errors, weakly grounded commitments, or capability-boundary overcommitment. We study this mismatch as the Outcome-Process Gap and introduce OpenClawBench, a large-scale dataset for measuring and supervising process-side anomalies in real agent execution processes. OpenClawBench is built from BFCL-driven OpenClaw sessions produced by 6 source models and contains 31,264 annotated trajectories. It aligns task-oracle outcomes with structured process evidence. FullTax converts the aligned trajectories into structured anomaly supervision: binary labels, supporting evidence, onset/span localization, severity, recoverability, and a 5-class anomaly taxonomy. Using OpenClawBench, we make the Outcome-Process Gap measurable. Among 31,135 oracle-passing executions, 2,904 are still labeled process-anomalous under FullTax. These results show that success-only evaluation misses a concrete class of process-side failures in real agent executions. A LoRA-fine-tuned Gemma 3 12B detector trained on the high-confidence FullTax supervised pool reaches binary F1=0.729 on the cleaner-labels held-out test split. Together, OpenClawBench turns real agent execution logs into auditable and reusable supervision for studying, diagnosing, and operationally monitoring runtime agent reliability.
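
The size of the Outcome-Process Gap follows from the two counts in the abstract. The Python sketch below works through that arithmetic and adds the standard binary F1 formula for reference. The confusion-matrix counts are made up for illustration and are not the paper's; only the two trajectory counts come from the abstract.

```python
# Counts reported in the abstract.
oracle_passing = 31_135
process_anomalous = 2_904

# Share of oracle-passing executions still labeled process-anomalous.
gap_rate = process_anomalous / oracle_passing
gap_percent = f"{gap_rate:.1%}"  # "9.3%"

def f1_score(tp, fp, fn):
    """Standard binary F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts only (not from the paper).
example_f1 = f1_score(tp=3, fp=1, fn=1)
```

Roughly one in eleven runs that a success-only evaluation would accept carries a process anomaly under FullTax.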

2026-05-29 13:00 JST arXiv cs.AI エージェント

リアルタイムの制約と長期的な推論の調和: 動的スケジューリングのための非同期エージェントフレームワーク

動的フレキシブルジョブショップスケジューリング問題 (DFJSP) では、確率的外乱に対する即座の対応と生産目標の全体的な最適化の間のトレードオフが必要になります。従来の優先順位ルールは、複雑な中断に対処するには柔軟性が不十分ですが、学習ベースのアプローチでは、解釈可能性が損なわれたり、問題のスケール全体に一般化できないことがよくあります。 Large Language Model (LLM) は、このギャップを埋めるための高度な推論機能を提供しますが、その実質的な推論遅延は、産業用制御システムのミリ秒レベルの決定サイクルと互換性がありません。この競合を解決するために、デュアルストリームアーキテクチャを介してポリシーの実行を論理的推論から切り離す、非同期エージェントベースのフレームワークである RACE-Sched を導入します。 Reactive Stream は低遅延のシンボリックヒューリスティックを実行してリアルタイムのディスパッチを可能にし、並列 Deliberative Stream は LLM を利用してこれらのルールを合成、検証、進化させます。候補ルールはサンドボックスで厳格なテストを受け、アトミックな更新によって展開されるため、制御ループをブロックすることなく安全性が確保されます。さらに、セマンティックルールリポジトリは、検索ベースの初期化のための検証済みのヒューリスティックにインデックスを付け、問題のスケール全体での移行性を高めます。 GEN-Bench、MK-Bench、および JMS-Bench の広範な評価により、RACE-Sched が主要な深層強化学習やその他の LLM ベースのベースラインよりも優れたパフォーマンスを発揮することが実証されています。このアプローチは、リアルタイムの制約と長期的な推論を調和させて、優れたソリューション品質と動的イベントへの堅牢な適応を実現します。

原文 (English)

Harmonizing Real-Time Constraints and Long-Horizon Reasoning: An Asynchronous Agentic Framework for Dynamic Scheduling

The Dynamic Flexible Job Shop Scheduling Problem (DFJSP) necessitates a trade-off between instant reaction to stochastic disturbances and global optimization of production goals. Conventional priority rules are insufficiently flexible to handle complex disruptions, whereas learning-based approaches often compromise interpretability or fail to generalize across problem scales. Although Large Language Models (LLMs) offer advanced reasoning capabilities to bridge this gap, their substantial inference latency is incompatible with the millisecond-level decision cycles of industrial control systems. To resolve this conflict, we introduce RACE-Sched, an asynchronous agent-based framework that decouples policy execution from logical reasoning via a dual-stream architecture. The Reactive Stream executes low-latency symbolic heuristics to enable real-time dispatching, while the parallel Deliberative Stream leverages an LLM to synthesize, validate, and evolve these rules. Candidate rules undergo rigorous testing in a sandbox and are deployed via atomic updates, ensuring safety without blocking the control loop. Additionally, a semantic rule repository indexes validated heuristics for retrieval-based initialization which enhances transferability across problem scales. Extensive evaluations on GEN-Bench, MK-Bench, and JMS-Bench demonstrate that RACE-Sched outperforms leading Deep Reinforcement Learning and other LLM-based baselines. This approach harmonizes real-time constraints with long-horizon reasoning to achieve superior solution quality and robust adaptation to dynamic events.
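
The "validate in a sandbox, then deploy via atomic update" step can be sketched as a small rule store in Python. This is an assumption-laden illustration, not RACE-Sched's implementation: the class, the rule functions, and the single acceptance case are hypothetical, and an LLM-synthesized rule is stood in for by ordinary functions.

```python
import threading

class RuleStore:
    """Holds the dispatching rule that the reactive loop reads."""

    def __init__(self, rule):
        self._rule = rule
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._rule

    def swap_if_valid(self, candidate, sandbox_cases):
        """Deploy `candidate` only if it passes every sandbox case."""
        # Validation runs outside the lock, so readers are never blocked by it.
        for jobs, expected in sandbox_cases:
            try:
                if candidate(jobs) != expected:
                    return False
            except Exception:
                return False
        with self._lock:
            self._rule = candidate  # single reference swap
        return True

def shortest_processing_time(jobs):
    return min(jobs, key=lambda j: j["proc_time"])["id"]

def earliest_due_date(jobs):
    return min(jobs, key=lambda j: j["due"])["id"]

def broken_rule(jobs):
    return jobs[99]["id"]  # raises IndexError on a short queue

queue = [
    {"id": "A", "proc_time": 5, "due": 9},
    {"id": "B", "proc_time": 3, "due": 20},
]
sandbox = [(queue, "A")]  # hypothetical acceptance case: urgent job first

store = RuleStore(shortest_processing_time)
before = store.get()(queue)
rejected = store.swap_if_valid(broken_rule, sandbox)
accepted = store.swap_if_valid(earliest_due_date, sandbox)
after = store.get()(queue)
```

The reactive path only ever calls `get()`, which returns quickly; slow synthesis and testing happen elsewhere, and a failing candidate leaves the running rule untouched.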

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

読み取り不可能なもののインデックス作成: LLM ネイティブの再帰的構築とサービス分類の検索

Internet of Agents (IoA) の時代が具体化しつつあります。LLM エージェントは、急速に増加する Model Context Protocol (MCP) サーバー、Agent-to-Agent (A2A) エンドポイント、再利用可能なスキル、およびその他の LLM 呼び出し可能なサービスを調整することで、ユーザーの目標を達成することが期待されています。しかし、LLM はこの体制との構造的な不一致に直面しています。有効なコンテキストは、サービスの数に応じて拡張できない希少なリソースです。数千のサービス説明をプロンプトに連結すると、コンテキストウィンドウがオーバーフローし、ウィンドウが十分に大きい場合でも、モデルは長い入力の途中で情報に体系的に注意を向けることができなくなります。これは十分に文書化されている「Lost-in-the-Middle」現象です。これは基本的に、サービス検出のためのコンテキスト管理の問題です。これに対処するために、LLM ネイティブのプログレッシブ開示スキームとその具体的なインスタンス化である A2X (Agent-to-Anything サービスディスカバリ) を提案します。これは、登録されたサービスを自動的に階層分類に編成し、クエリ時にレイヤーごとにたどる LLM 駆動のパイプラインです。これにより、すべての LLM 呼び出しでは、ユーザークエリに関連性の高い小さな候補セットのみが表示されます。これにより、有効なコンテキストの不足がレジストリサイズから切り離され、取得精度が向上しながらトークンの消費が大幅に削減されます。フルコンテキストダンピングと比較して、A2X はプロンプトトークンコストの 9 分の 1 で 6.2 ポイントのヒット率の向上を達成します。最先端のオープンソースの埋め込みベースのベースラインと比較して、A2X はヒット率を 20 ポイント以上改善します。

原文 (English)

Indexing the Unreadable: LLM-Native Recursive Construction and Search of Service Taxonomies

The era of the Internet of Agents (IoA) is taking shape: LLM agents are expected to fulfill user goals by orchestrating fast-growing populations of Model Context Protocol (MCP) servers, Agent-to-Agent (A2A) endpoints, reusable skills, and other LLM-callable services. Yet LLMs face a structural mismatch with this regime: effective context is a scarce resource that does not scale with the number of services. Concatenating thousands of service descriptions into a prompt overflows the context window, and even when the window is large enough, models systematically under-attend to information in the middle of long inputs, the well-documented Lost-in-the-Middle phenomenon. This is fundamentally a question of context management for service discovery. To address this, we propose an LLM-native progressive-disclosure scheme and its concrete instantiation, A2X (Agent-to-Anything service discovery): an LLM-driven pipeline that automatically organizes the registered services into a hierarchical taxonomy and walks it layer by layer at query time, so that every LLM call sees only a small candidate set highly relevant to the user query. This decouples effective-context scarcity from registry size and significantly reduces token consumption while improving retrieval accuracy. Compared to full-context dumping, A2X achieves a 6.2-point Hit Rate gain at one-ninth the prompt-token cost; compared to the state-of-the-art open-source embedding-based baseline, A2X improves Hit Rate by more than 20 points.
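
The layer-by-layer walk can be illustrated with a toy taxonomy in Python. In this sketch the LLM call is replaced by a keyword-overlap selector, only one branch is followed per layer, and the taxonomy, service names, and function names are all hypothetical; A2X's actual prompts and branching policy are not described in the abstract.

```python
def keyword_selector(query, options):
    """Stand-in for the LLM call: pick the option sharing the most words with the query."""
    query_words = set(query.lower().split())
    return max(options, key=lambda name: len(query_words & set(name.lower().split())))

def walk_taxonomy(taxonomy, query, select=keyword_selector):
    """Descend one layer at a time; each call sees only the current node's children."""
    node, path = taxonomy, []
    while isinstance(node, dict):
        choice = select(query, list(node))
        path.append(choice)
        node = node[choice]
    return path, node  # node is now the leaf list of candidate services

# Hypothetical registry organized into a two-level taxonomy.
taxonomy = {
    "travel booking": {
        "flight search": ["flights-mcp", "airfare-a2a"],
        "hotel search": ["hotels-mcp"],
    },
    "finance tools": {
        "currency exchange": ["fx-rates-mcp"],
        "stock quotes": ["quotes-a2a"],
    },
}

path, candidates = walk_taxonomy(taxonomy, "find a cheap flight search for travel")
```

Each selection step is shown two category names rather than five service descriptions, which is the sense in which the prompt size stops growing with the registry.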

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

CoHyDE: ツール検索のための LLM リライタと高密度エンコーダの反復的共同トレーニング

大規模な API カタログを介したツールの取得は、LLM エージェントにとって主要なボトルネックです。ユーザークエリは口語的な、多くの場合、仕様が不十分な言語で到着しますが、カタログでは、固定エンコーダーが単独でブリッジできない技術的な API ボキャブラリーが使用されます。対照的なエンコーダの微調整と、フリーズ LLM を使用した HyDE スタイルのクエリ拡張という 2 つの主要なトレーニングアプローチは、この問題に反対側から対処し、相補的な方向で失敗します。微調整されたエンコーダは、クエリの表面形式がすでにカタログと一致する場合には優れていますが、一致しない場合は崩壊します。一方、ゼロショットの HyDE は、仕様が不十分なクエリに対してはより堅牢ですが、クエリが適切な形式である場合に検索を低下させるカタログを意識しない仮説的な記述を生成します。高密度エンコーダと LLM リライタを単一の共進化システムとしてトレーニングする反復手順である CoHyDE を導入します。エンコーダはリライタによって生成されたカタログ形式の仮説記述に基づいて InfoNCE で再トレーニングされ、リライタはエンコーダの検索スコアに対して DPO 経由で優先的に調整され、ループが始まる前にツールカタログで双方がウォームスタートされます。 ToolBench カタログの約 10,000 のツールサブセットでは、CoHyDE の 3 ラウンドにより、最も強力な単一コンポーネントのベースラインよりも、標準クエリで NDCG@5 が +2.5 pp、保留された曖昧クエリで +6.3 pp 改善され、最も困難な曖昧層では +8 pp もの向上が見られます。アブレーションにより、共トレーニングが重要な要素であることが確認されました。いずれかのコンポーネントを単独で使用すると、整形式のクエリと曖昧なクエリの両方で CoHyDE に匹敵することができず、曖昧なクエリでは最大 -8 pp の損失が発生します。

原文 (English)

CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval

Tool retrieval over large API catalogs is a core bottleneck for LLM agents: user queries arrive in colloquial, often underspecified language, while the catalog uses technical API vocabulary that no fixed encoder can bridge on its own. The two dominant training approaches, contrastive encoder fine-tuning and HyDE-style query expansion with a frozen LLM, address this problem from opposite ends and fail in complementary directions: the fine-tuned encoder excels when the query's surface form already matches the catalog but collapses when it does not, while zero-shot HyDE is more robust to underspecified queries yet generates catalog-unaware hypothetical descriptions that degrade retrieval when queries are well-formed. We introduce CoHyDE, an iterative procedure that trains the dense encoder and the LLM rewriter as a single co-evolving system: the encoder is retrained with InfoNCE on catalog-style hypothetical descriptions produced by the rewriter, and the rewriter is preference-aligned via DPO against the encoder's retrieval scores, with both sides warm-started on the tool catalog before the loop begins. On a ~10k tool subset of the ToolBench catalog, three rounds of CoHyDE improve over the strongest single-component baseline by +2.5 pp NDCG@5 on standard queries and +6.3 pp on held-out vague queries, with gains as large as +8 pp on the hardest vague tier. Ablations confirm that co-training is the key ingredient: using either component in isolation fails to match CoHyDE on both well-formed and vague queries, with losses of up to -8 pp on vague queries.
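
For readers unfamiliar with the encoder-side objective, here is a minimal pure-Python sketch of the InfoNCE loss for a single query. It assumes dot-product similarity and a temperature of 0.1, neither of which is stated in the abstract, and the two-dimensional vectors are toy values standing in for embeddings of a rewritten description and of tool entries.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE for one query: negative log-softmax of the positive's similarity."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]

# Toy embeddings (hypothetical).
query = [1.0, 0.0]
aligned_loss = info_nce(query, positive=[1.0, 0.0], negatives=[[0.0, 1.0], [-1.0, 0.0]])
misaligned_loss = info_nce(query, positive=[0.0, 1.0], negatives=[[1.0, 0.0], [-1.0, 0.0]])
```

The loss is near zero when the query sits closest to the correct tool and grows when a negative outranks it, which is the signal the encoder is retrained on in each round.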

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント ビジネス/資金調達

エージェントによる修正と意味評価による人間のような対話型音声認識を目指して

自動音声認識 (ASR) は、人間とコンピューターの対話の中核コンポーネントであり、LLM ベースのアシスタントやエージェントにとってますます重要なフロントエンドとなっています。しかし、現在のほとんどの ASR システムは依然としてシングルパスのパラダイムに従っており、誤解を繰り返しの確認と修正によって解消する人間のコミュニケーションとはうまく整合していません。この不一致のため、意味に関わる重大なエラーが一度発生すると、修正することが困難です。また、WER や CER などのトークンレベルの指標は、このような問題を適切に反映できません。これらの制限に対処するために、私たちは Interactive ASR をマルチターンの改良タスクとして定式化し、シングルパス ASR フロントエンドと、意味的な修正、インテントルーティング、推論ベースの編集を組み合わせた閉ループフレームワークである Agentic ASR を提案します。さらに、LLM ベースの意味評価指標である文レベル意味エラー率 (Sentence-level Semantic Error Rate, $S^2ER$) を、スケーラブルで再現可能なベンチマークのためのインタラクティブシミュレーションシステムとともに導入します。多言語、固有表現が多いデータ、コードスイッチングのベンチマークでの実験では、反復的な対話によって意味エラーが一貫して減少し、従来のトークンレベルの指標よりも $S^2ER$ においてはるかに大きな改善が得られることが示されました。人間と AI の一致度の検証とアブレーション研究により、意味判定器の信頼性と提案フレームワークの堅牢性がさらに裏付けられました。コードは https://interactiveasr.github.io/ で、ライブデモは https://i-asr.sjtuxlance.com/ で公開されています。

原文 (English)

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER or CER cannot adequately reflect such a problem. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate ($S^2ER$), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in $S^2ER$ than in conventional token-level metrics. Human-AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at https://interactiveasr.github.io/ and the live demo is available at https://i-asr.sjtuxlance.com/

2026-05-29 13:00 JST arXiv cs.AI エージェント

VitalAgent: ウェアラブル健康データに対する反応的および積極的な生理学的モニタリングのためのツール拡張エージェント

ウェアラブルデバイスにより、ECG や PPG などの生理学的信号の継続的なモニタリングが可能になりますが、既存の mHealth システムは、タスク固有の予測パイプライン、または静的な要約に対するリアクティブな質問応答にほぼ限定されています。これらのシステムには、時間的推論、永続的な生理学的コンテキスト、長期的な信号ストリームに対するプロアクティブなモニタリングをサポートする能力がありません。私たちは、リアクティブな質問応答とプロアクティブなモニタリングの両方をサポートする、ECG/PPG ベースの mHealth 向けツール拡張エージェントフレームワークである VitalAgent を提案します。VitalAgent は、縦断的な生理学的メモリと、生の信号に対する動的な計算を可能にするツール拡張推論インターフェイスに基づいて構築されています。さらに、リアクティブな質問応答のための 1,862 件の QA ペアと、プロアクティブなモニタリングのための 90.2 時間の連続 ECG/PPG 記録で構成され、心臓、身体活動、ストレス関連のタスクをカバーする縦断的な生理学的モニタリングベンチマークデータセットである VitalBench を紹介します。実験では、VitalAgent がリアクティブ評価においてプロンプトベースおよび ReAct のベースラインに対して 30% 以上の改善を達成し、長期の生理学的信号に対するプロアクティブなアラートモニタリングをサポートすることが実証されており、動的なツール使用と長期的な生理学的モニタリングの重要性が浮き彫りになっています。

原文 (English)

VitalAgent: A Tool-Augmented Agent for Reactive and Proactive Physiological Monitoring over Wearable Health Data

Wearable devices enable continuous monitoring of physiological signals such as ECG and PPG, but existing mHealth systems are largely limited to task-specific prediction pipelines or reactive question answering over static summaries. They lack the ability to support temporal reasoning, persistent physiological context, and proactive monitoring over long-term signal streams. We propose VitalAgent, a tool-augmented agentic framework for ECG/PPG-based mHealth that supports both reactive question answering and proactive monitoring. VitalAgent is built on a longitudinal physiological memory and a tool-augmented reasoning interface that enables dynamic computation over raw signals. We further introduce VitalBench, a longitudinal physiological monitoring benchmark dataset comprising 1,862 QA pairs for reactive question answering and 90.2 hours of continuous ECG/PPG recordings for proactive monitoring, covering cardiac, physical activity, and stress-related tasks. Experiments demonstrate that VitalAgent achieves over 30% improvement over prompt-based and ReAct baselines in reactive evaluation and supports proactive alert monitoring over long-term physiological signals, highlighting the importance of dynamic tool use and long-term physiological monitoring.

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

有用性の呪い: DistractionIF を介したディストラクタの指示に対するロバスト性の逆スケーリング則

大規模言語モデル (LLM) は、エージェントおよび検索拡張生成 (RAG) システムに導入されることが増えており、外部から提供された参照テキストに対してユーザー指定のタスクを実行する必要があります。実際には、そのようなコンテキストは構造化されていないことが多く、編集コメントやシステムトレースなど、無害ではあるが指示のようなセマンティックノイズで汚染されており、厳密にデータとして扱う必要があります。参照テキスト内のこのようなディストラクタ命令に対する堅牢性を評価するために設計されたベンチマークである DistractionIF を紹介します。幅広いモデルにわたって、一貫した逆スケーリング現象が観察されます。多くの場合、大きなモデルは堅牢性が低く、スケールが増加するにつれてパフォーマンスが最大 30 ポイント低下します。機構的には、私たちのパープレキシティ分析により、スケーリングによってロバストな動作と注意散漫な動作の間の確率的境界が侵食され、モデルがノイズを指示として過剰解釈する傾向がますます高まっていることが明らかになりました。これに対処するために、強化学習、特にグループ相対ポリシー最適化 (GRPO) がこの境界を回復し、一般的な命令追従能力を損なうことなく堅牢性を最大 15.5% 向上させることができることを実証します。私たちの調査結果は、参照に基づいたタスクにおける命令追従の堅牢性の重大なギャップを浮き彫りにし、大規模なデータと命令の厳密な分離を強制するための有望な道としての強化学習を確立します。

原文 (English)

The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF

Large Language Models (LLMs) are increasingly deployed in agentic and retrieval-augmented generation (RAG) systems, where they must execute user-specified tasks over externally provided reference text. In practice, such context is often unstructured and contaminated with benign but instruction-like semantic noise, such as editorial comments and system traces, which should be treated strictly as data. We introduce DistractionIF, a benchmark designed to evaluate robustness against such distractor instructions in reference text. Across a broad range of models, we observe a consistent inverse scaling phenomenon: larger models are often less robust, with performance dropping by up to 30 points as scale increases. Mechanistically, our perplexity analysis reveals that scaling erodes the probabilistic boundary between robust and distracted behaviors, making models increasingly prone to over-interpreting noise as instructions. To address this, we demonstrate that reinforcement learning, specifically Group Relative Policy Optimization (GRPO), can restore this boundary, improving robustness by up to 15.5% without compromising general instruction-following capability. Our findings highlight a critical instruction-following robustness gap in reference-grounded tasks and establish reinforcement learning as a promising path for enforcing strict data-instruction separation at scale.

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

MINDGAMES: マルチエージェント LLM の社会的および戦略的推論を評価するためのライブアリーナ

大規模言語モデル (LLM) は対話型エージェントとして導入されることが増えていますが、長期にわたるインタラクションにおける社会的および戦略的推論の能力はまだ十分に理解されていません。既存の評価は、静的なビネットや単一ゲームのベンチマークに依存しており、現実世界のマルチエージェント環境で求められる持続的で多面的な推論を捉えることができません。私たちは、LLM エージェント向けのマルチゲームアリーナ兼評価プラットフォームである Mindgames を紹介します。Mindgames は、「心の理論」に関連する相補的な推論要求、すなわち、隠された情報の下での信念の帰属、繰り返される戦略的相互作用を通じた対戦相手のモデル化、知識の非対称性の下での協調的な推論、社会的推理ゲームにおける持続的な欺瞞を操作化します。TextArena 上に構築された Mindgames は、統一されたインタラクションインターフェイス、TrueSkill ベースのレーティング、4 つのゲーム環境にわたる完全な軌跡のログ記録を提供します。私たちは、主要な AI カンファレンスで開催された 2025 年の競技サイクルを通じて Mindgames をインスタンス化しました。このサイクルでは、Colonel Blotto、Iterated Prisoner's Dilemma、Codenames、Secret Mafia の 4 つのゲームにわたって、76 チームから提出された 944 体のエージェントが評価されました。私たちの分析では、エージェントレベルと評価レベルの両方の限界が明らかになりました。脆弱なルール遵守が依然として主要なボトルネックであり、最高性能のシステムは明示的な構造的スキャフォールディングに繰り返し依存しており、リーダーボードの妥当性は環境によって大きく異なります。特に、失敗の多い環境では、戦略的能力と同程度に、対戦相手のエラーに対する堅牢性が報われる可能性があり、Secret Mafia はこのサイクルでエラーと生存の顕著な交絡を示しています。私たちは、ターンレベルの観測、アクション、報酬を含む 29,571 件のマルチエージェントゲームのデータセットを、MG-Ref とともに公開します。MG-Ref は、本分析で使用したものと同じエラー帰属の枠組みの下で、上位にランクされた低エラーの Stage II 提出物からなる固定された参照プールに対して新しいエージェントを採点する、決定論的なオフライントーナメントプロトコルです。

原文 (English)

MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs

Large language models (LLMs) are increasingly deployed as interactive agents, yet their capacity for social and strategic reasoning over extended interaction remains poorly understood. Existing evaluations rely on static vignettes or single-game benchmarks that cannot capture the sustained, multi-faceted reasoning that real-world multi-agent settings demand. We introduce Mindgames, a multi-game arena and evaluation platform for LLM agents that operationalizes complementary reasoning demands relevant to "theory of mind": belief attribution under hidden information, opponent modeling through repeated strategic interaction, cooperative inference under knowledge asymmetries, and sustained deception in social deduction. Built on TextArena, Mindgames provides a unified interaction interface, TrueSkill-based rating, and full trajectory logging across four game environments. We instantiate Mindgames through a 2025 competition cycle hosted at a major AI conference, which assessed 944 submitted agents from 76 teams across four games: Colonel Blotto, Iterated Prisoner's Dilemma, Codenames, and Secret Mafia. Our analysis surfaces both agent-level and evaluation-level limitations: brittle rule adherence remains a major bottleneck, top-performing systems repeatedly rely on explicit structural scaffolding, and leaderboard validity differs sharply across environments. In particular, failure-heavy environments can reward robustness to opponent errors as much as strategic ability, with Secret Mafia exhibiting a pronounced error-survival confound in this cycle. We release a dataset of 29,571 multi-agent games with turn-level observations, actions, and rewards, together with MG-Ref, a deterministic offline tournament protocol that scores new agents against a frozen reference pool of top-ranked, low-error Stage II submissions under the same error-attribution lens used in this analysis.

2026-05-29 13:00 JST arXiv cs.AI エージェント

UI-KOBE: 軽量のグラフガイド付き GUI エージェントのための知識指向の動作探索

モバイル GUI エージェントの最近の進歩により、モバイルタスクを自動化する強力な可能性が示されていますが、効果的なシステムの多くは、依然としてスクリーンショットの理解と長期的な計画のために大規模なビジョン言語モデルに依存しています。モバイルデバイスに直接展開できる小型の GUI エージェントは、推論コストが低く、デバイス上の機密情報の保護が強化されているため、実用的にはより魅力的です。ただし、モデルの容量が限られているため、スクリーンショットだけから GUI タスクをエンドツーエンドで計画および実行する場合、このような軽量エージェントの信頼性は依然として低いままです。私たちは、再利用可能なアプリ固有のグラフ知識によって軽量モバイル GUI エージェントを改善するフレームワークである、Knowledge-Oriented Behavior Exploration (UI-KOBE) を提案します。UI-KOBE は、まずモバイルアプリケーションを自律的に探索し、ノードが個別の UI 状態を表し、エッジが実行可能な遷移を表すアプリナレッジグラフを構築します。実行時に、軽量 GUI エージェントは外部ガイダンスとしてグラフを使用します。ユーザータスクと現在のスクリーンショットが与えられると、現在のグラフノードを識別し、自己ループアクション、隣接ノードへの遷移、タスクの完了、またはそのノードに関連付けられたフォールバックの自由アクションの中から選択します。UI-KOBE は、アプリ固有のグラフガイダンスで実行時の意思決定をサポートすることで、エンドツーエンドの GUI 計画の負担を軽減し、軽量モデルがモバイル GUI タスクをより効果的に実行できるように支援し、効率的で解釈可能でプライバシーに配慮したオンデバイス GUI エージェントに向けた実用的なステップを提供します。

原文 (English)

UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents

Recent advances in mobile GUI agents have shown strong potential for automating mobile tasks, but most effective systems still depend on large vision-language models for screenshot understanding and long-horizon planning. Small GUI agents that can be deployed directly on mobile devices are more attractive for practical use, offering lower inference cost and better protection of sensitive on-device information. However, due to limited model capacity, such lightweight agents remain unreliable when planning and executing GUI tasks end-to-end from screenshots alone. We propose Knowledge-Oriented Behavior Exploration (UI-KOBE), a framework that improves lightweight mobile GUI agents with reusable app-specific graph knowledge. UI-KOBE first autonomously explores a mobile application and constructs an app knowledge graph, where nodes represent distinct UI states and edges represent executable transitions. At runtime, a lightweight GUI agent uses the graph as external guidance: given a user task and the current screenshot, it identifies the current graph node and selects among self-loop actions, neighboring transitions, task completion, or fallback free actions associated with that node. By supporting runtime decisions with app-specific graph guidance, UI-KOBE reduces the burden of end-to-end GUI planning and helps lightweight models perform mobile GUI tasks more effectively, offering a practical step toward efficient, interpretable, and privacy-conscious on-device GUI agents.
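
The runtime choice described in this abstract can be pictured with a small sketch. It is only an illustration under stated assumptions: the class `UINode`, the function `candidate_actions`, and the action names are hypothetical and are not taken from the UI-KOBE paper or code. It shows an app knowledge graph whose nodes are UI states and whose edges are transitions, and the four kinds of options offered at the current node.

```python
from dataclasses import dataclass, field


@dataclass
class UINode:
    """One distinct UI state in a (hypothetical) app knowledge graph."""
    name: str
    # Actions that keep the app on the same screen (for example, scrolling).
    self_loops: list[str] = field(default_factory=list)
    # Executable transitions: action name -> destination node name.
    transitions: dict[str, str] = field(default_factory=dict)


def candidate_actions(graph: dict[str, UINode], current: str) -> list[tuple[str, str]]:
    """List the (kind, action) options available at the current graph node."""
    node = graph[current]
    options = [("self_loop", action) for action in node.self_loops]
    options += [("transition", action) for action in node.transitions]
    options.append(("complete", "finish_task"))   # declare the task done
    options.append(("fallback", "free_action"))   # act outside the graph
    return options


# Toy two-screen app, invented for illustration.
graph = {
    "home": UINode("home", self_loops=["scroll_down"],
                   transitions={"tap_settings": "settings"}),
    "settings": UINode("settings", transitions={"tap_back": "home"}),
}
options = candidate_actions(graph, "home")
```

In this sketch the lightweight model only has to pick one entry from `options` instead of planning the whole task from pixels, which is the burden reduction the abstract describes.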

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

Battery-Sim-Agent: LLM-Agent を活用した逆バッテリパラメータ推定

バッテリーの忠実度の高い「デジタルツイン」のパラメーター化は、バッテリーの革新のペースを妨げる、重要かつ困難な逆問題です。一般的な手法では、これをブラックボックス最適化 (BBO) タスクとして定式化し、サンプル効率が低く、基礎となる物理学を認識しないアルゴリズムを採用します。この研究では、逆問題を推論タスクとして再構成する新しいパラダイムを導入し、忠実度の高いバッテリーシミュレーターを備えた閉ループで大規模言語モデル (LLM) エージェントを展開する最初のフレームワークである Battery-Sim-Agent を紹介します。このエージェントは人間の科学者のワークフローを模倣します。シミュレーターからの豊富なマルチモーダルなフィードバックを解釈し、矛盾を説明するために物理的に根拠のある仮説を形成し、構造化されたパラメーターの更新を提案します。さまざまなバッテリーの化学的性質、動作条件、難易度レベルにまたがる体系的に構築されたベンチマークスイートでは、当社のエージェントは、正確なパラメーターを特定する点で、ベイジアン最適化などの強力な BBO ベースラインを大幅に上回ります。さらに、複雑な長期劣化フィッティングタスクにおけるフレームワークの機能を実証し、現実世界のバッテリーデータセットに対するその実用的な適用可能性を検証します。私たちの結果は、科学的発見とバッテリーパラメータ推定のための推論ベースのオプティマイザーとしての LLM エージェントの可能性を強調しています。

原文 (English)

Battery-Sim-Agent: Leveraging LLM-Agent for Inverse Battery Parameter Estimation

Parameterizing high-fidelity "digital twins" of batteries is a critical yet challenging inverse problem that hinders the pace of battery innovation. Prevailing methods formulate this as a black-box optimization (BBO) task, employing algorithms that are sample-inefficient and blind to the underlying physics. In this work, we introduce a new paradigm that reframes the inverse problem as a reasoning task, and present Battery-Sim-Agent, the first framework to deploy a Large Language Model (LLM) agent in a closed loop with a high-fidelity battery simulator. The agent mimics a human scientist's workflow: it interprets rich, multi-modal feedback from the simulator, forms physically-grounded hypotheses to explain discrepancies, and proposes structured parameter updates. On a systematically constructed benchmark suite spanning diverse battery chemistries, operating conditions, and difficulty levels, our agent significantly outperforms strong BBO baselines like Bayesian optimization in identifying accurate parameters. We further demonstrate the framework's capability in complex long-horizon degradation fitting tasks and validate its practical applicability on real-world battery datasets. Our results highlight the promise of LLM-agents as reasoning-based optimizers for scientific discovery and battery parameter estimation.

2026-05-29 13:00 JST arXiv cs.AI エージェント ハードウェア/半導体

大規模な言語モデルに基づくマルチエージェントフレームワークによる共同ストーリーテリングの向上

共創、つまり AI エージェントが人間と対話して出力 (アートなど) を生成するというテーマは、最近大きな注目を集めています。ただし、ほとんどの研究は、デジタル環境における成人との相互作用に焦点を当てています。この論文では、子供たちと大規模言語モデル (LLM) が物理的なボードゲームを通じて相互作用し、物語を書き上げる、遊びを通じた新しい共創シナリオを検討します。私たちの目標は、若いプレイヤーに適した高品質の物語を生成できるマルチエージェントフレームワークを開発することです。私たちのアプローチの中核は、ある LLM がストーリーを生成し、別の LLM がストーリーを評価して改良のためのフィードバックを提供する、反復的なライターとエディターのプロセスです。複数の LLM を用いたシミュレーション研究を通じて、この反復的な相互作用により、ループを重ねるごとに生成されたストーリーの知覚品質が一貫して向上することを示します。この結果は、インタラクティブなストーリーテリングシステムで高品質の出力を達成するには、少数の改良ステップで十分である可能性があることを示しています。

原文 (English)

Improving Collaborative Storytelling with a Multi-Agent Framework Based on Large Language Models

The topic of Co-creation, i.e., AI agents interacting with humans to generate outputs (e.g., art), has gained significant attention recently. However, most studies focus on adult-human interactions in a digital setting. This paper explores a novel ludic co-creation scenario involving children and Large Language Models (LLMs) interacting through a physical board game to create written stories. Our goal is to develop a multi-agent framework capable of producing high-quality narratives suitable for young players. At the core of our approach is an iterative Writer-Editor process in which one LLM generates stories while another evaluates them and provides feedback for refinement. Through a simulation study involving multiple LLMs, we show that this iterative interaction consistently improves the perceived quality of generated stories across successive loops. The results indicate that a small number of refinement steps may be sufficient to achieve high-quality outputs in interactive storytelling systems.
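
The iterative Writer-Editor process mentioned above can be sketched as a simple loop. This is a minimal sketch under assumptions: `refine_story`, `stub_write`, and `stub_edit` are hypothetical names, and the two stubs are deterministic stand-ins for the LLM calls that a real system would make.

```python
from typing import Callable


def refine_story(
    write: Callable[[str, str], str],
    edit: Callable[[str], tuple[float, str]],
    prompt: str,
    max_loops: int = 3,
) -> tuple[str, list[float]]:
    """Alternate a writer and an editor for a fixed number of loops."""
    story, feedback = "", ""
    scores: list[float] = []
    for _ in range(max_loops):
        story = write(prompt, feedback)   # writer drafts or revises the story
        score, feedback = edit(story)     # editor rates it and gives feedback
        scores.append(score)
    return story, scores


def stub_write(prompt: str, feedback: str) -> str:
    # Stand-in writer: appends the editor's feedback to the prompt.
    return prompt if not feedback else f"{prompt} ({feedback})"


def stub_edit(story: str) -> tuple[float, str]:
    # Stand-in editor: uses story length as a toy quality score.
    return float(len(story)), "add more detail"


story, scores = refine_story(stub_write, stub_edit, "A dragon finds a key")
```

Keeping `max_loops` small mirrors the abstract's observation that a few refinement steps may be enough; tracking `scores` per loop is one way to check that claim on a given setup.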

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

PTCG-Bench: LLM エージェントはポケモントレーディングカードゲームをマスターできますか?

戦略的に複雑なボードゲームを考えると、人間のプレイヤーは数ラウンドをプレイした後、すぐに戦略を考案することを学びます。自律型エージェントは、現実的な対話型環境でも同様の機能を必要としますが、既存のエージェントのベンチマークでは、そのような戦略的で進化する意思決定シナリオを完全に把握できないことがよくあります。我々は、ポケモントレーディングカードゲーム (PTCG) に基づいて構築されたベンチマークである PTCG-Bench を紹介します。このベンチマークは、(1) 単一の複雑な環境内での意思決定パフォーマンス、(2) 蓄積された経験を通じて自己進化する能力という 2 つの相補的なレベルで LLM エージェントを評価します。さらに、モデルの能力と混同することなくエージェントのパフォーマンスをより適切に解釈するためのモジュラーハーネスアブレーションも含まれています。私たちの実験では、LLM エージェントは一定のゲームプレイパフォーマンスを達成できるものの、持続的かつ安定した自己進化は依然として困難であり、パフォーマンスはハーネスの設計に左右されることが示されました。PTCG-Bench によって、現実的なインタラクティブ環境におけるハーネス認識型の自己進化エージェントに関する将来の研究が促進されることを期待しています。

原文 (English)

PTCG-Bench: Can LLM Agents Master Pokémon Trading Card Game?

Given a strategically complex board game, human players can quickly learn to devise strategies after playing a few rounds. Autonomous agents require similar capabilities in realistic interactive environments, yet existing agent benchmarks often fail to fully capture such strategic and evolving decision-making scenarios. We present PTCG-Bench, a benchmark built on the Pokémon Trading Card Game (PTCG) that evaluates LLM agents at two complementary levels: (1) their decision-making performance within a single complex environment, and (2) their ability to self-evolve through accumulated experience. We further include a modular harness ablation to better interpret agent performance without conflating it with model capability. Our experiments show that, although LLM agents can achieve non-trivial gameplay performance, sustained and stable self-evolution remains challenging, and performance is sensitive to harness design. We hope that PTCG-Bench will facilitate future research on harness-aware and self-evolving agents in realistic interactive environments.

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

GRASP: 自己改善型 LLM エージェントのためのゲート回帰認識スキル提案者

構造化された環境で動作する LLM エージェントは、会話面ではなく操作面で失敗し、その信頼性は環境に関する手続き的知識に依存します。従来の自己改善手法は、新しい項目が以前の正しい動作を保持しているかどうかを確認せずに自然言語ガイダンスを蓄積するため、ある軌道を修正したメモが別の軌道を気付かないうちに退行させる可能性があります。私たちは GRASP (Gated Regression-Aware Skill Proposer) を導入します。これは、エージェントの改善をサイズに上限のあるスキルライブラリへの一連の編集として扱い、ハード回帰バジェットの下で、バランスのとれたホールドアウトプローブで純改善が得られた場合にのみ各候補を採用します。2 つの FHIR ベースの臨床ベンチマークで 5 つのベースモデル (gpt-oss-120b、DeepSeek V4 Flash、Gemini 3.1 Flash Lite、GPT-4.1、GPT-5.4) にわたって GRASP を評価します。MedAgentBench では、GRASP は gpt-oss-120b を 40.6% から 88.8% に引き上げ、5 つの自己改善ベースラインのうち最も強力なものを 21.0 ポイント上回り、他のすべてのベースモデルを 17.2 から 40.3 ポイント改善しました。アブレーションによれば、この向上はスキルの記述そのものではなく、比較に基づく提案生成、承認ゲート、およびハード回帰バジェットによるものです。検証を伴わないスキルの記述は、スキルを使用しない場合と変わりません。このメカニズムは臨床領域を超えて一般化し、4 つの非臨床環境のうち 3 つでエージェントを改善し、アクション空間がオープンエンドである場合にのみ横ばいにとどまります。凍結されたライブラリはモデル間で転用でき、より強力なモデル由来のスキルは、弱い実行モデルを自己学習で得られる以上に向上させますが、その逆は成り立ちません。これは、ゲートのないベースラインでは再現されない非対称性です。

原文 (English)

GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents

LLM agents acting in structured environments fail in operational rather than conversational ways, and reliability depends on procedural knowledge of the environment. Prior self-improvement methods accumulate natural-language guidance without checking that each new item preserves previously correct behavior, so a note that fixes one trajectory can silently regress another. We introduce GRASP (Gated Regression-Aware Skill Proposer), which treats agent improvement as a sequence of edits to a bounded skill library, admitting each candidate only if it produces a net improvement on a balanced held-out probe under a hard regression budget. We evaluate GRASP across five base models (gpt-oss-120b, DeepSeek V4 Flash, Gemini 3.1 Flash Lite, GPT-4.1, GPT-5.4) on two FHIR-based clinical benchmarks. On MedAgentBench, GRASP lifts gpt-oss-120b from 40.6% to 88.8%, exceeds the strongest of five self-improvement baselines by 21.0 points, and improves every other base model by 17.2 to 40.3 points. Ablations attribute the gain to comparative proposal generation, the acceptance gate, and the hard regression budget rather than to skill writing itself, which without validation is no better than using no skills. The mechanism generalizes beyond the clinical domain, improving agents on three of four non-clinical environments and remaining flat only where the action space is open-ended. Frozen libraries transfer across models, where skills from a stronger model improve weaker executors beyond what they learn for themselves while the reverse does not, an asymmetry that no ungated baseline reproduces.
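
The acceptance rule described above (admit a candidate only if it yields a net improvement on a held-out probe while staying within a hard regression budget) can be sketched as follows. The function name `admit`, the boolean pass/fail encoding of probe results, and the default budget of zero are assumptions made for illustration; the abstract does not give the exact probe construction or budget values.

```python
def admit(before: list[bool], after: list[bool], regression_budget: int = 0) -> bool:
    """Decide whether a candidate skill edit is accepted.

    `before` and `after` hold pass/fail results on the same held-out probe
    tasks, without and with the candidate edit applied.
    """
    if len(before) != len(after):
        raise ValueError("probe results must cover the same tasks")
    # Tasks that failed before and pass now.
    fixes = sum(1 for b, a in zip(before, after) if not b and a)
    # Tasks that passed before and fail now.
    regressions = sum(1 for b, a in zip(before, after) if b and not a)
    # Hard budget on regressions, plus a strict net gain.
    return regressions <= regression_budget and fixes > regressions


accepted = admit([True, False, False], [True, True, False])   # one fix, no regression
rejected = admit([True, False], [False, True])                # one fix, one regression
```

The second call is the case the abstract warns about: a note that fixes one trajectory while silently breaking another gains nothing overall, so the gate rejects it.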

2026-05-29 13:00 JST arXiv cs.AI エージェント 研究/論文

表記法が重要: Agentic AI システムにおけるトークン最適化フォーマットのベンチマーク調査

Agentic AI システムの大規模な言語モデルは、ツールスキーマと実行結果を消費し、ツール呼び出しを構造化データとして出力します。その交換のデフォルト言語である JSON は、トークンの効率性ではなくアプリケーション間の交換を目的として設計されているため、その構造要素により相当のトークンオーバーヘッドが生じます。最近の研究では、よりコンパクトな代替として TOON (Token-Oriented Object Notation) や TRON (Token Reduced Object Notation) などのトークンに最適化された代替案が提案されていますが、これらの形式は分離された理解または生成タスクでのみ評価されています。したがって、トークン削減がエンドツーエンドのエージェントループ内で保持されるかどうかは未解決の問題のままです。私たちは、4 つのエージェントベンチマーク (BFCL、MCPToolBenchPP、MCP-Universe、StableToolBench) と 5 つのオープンウェイト LLM で TOON と TRON を評価し、入力圧縮を出力圧縮から分離して、理解と生成を独立して測定します。 TRON は、JSON ベースラインの 14pp 以内の精度でトークンを最大 27% 削減します。 TOON は、同様の 9pp の精度コストで最大 18% の削減を達成しますが、さらにマルチターン解析失敗がカスケードし、ほとんどのモデルの並列ツール呼び出し出力が崩壊します。

原文 (English)

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems

Large language models in Agentic AI systems consume tool schemas and execution results and emit tool invocations as structured data. The default language for that exchange, JSON, was designed for application-to-application interchange rather than token efficiency, so its structural elements impose substantial token overhead. Recent work proposes token-optimized alternatives such as TOON (Token-Oriented Object Notation) and TRON (Token Reduced Object Notation) as more compact replacements, but these formats have been evaluated only on isolated comprehension or generation tasks. Whether their token reductions hold inside end-to-end agentic loops therefore remains an open question. We evaluate TOON and TRON on four agentic benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, StableToolBench) and five open-weight LLMs, decoupling input compression from output compression to measure comprehension and generation independently. TRON reduces tokens by up to 27% with accuracy within 14pp of the JSON baseline. TOON achieves up to 18% reduction at a similar 9pp accuracy cost, but additionally cascades on multi-turn parsing failures and collapses parallel tool-call output for most models.

2026-05-29 13:00 JST arXiv cs.AI エージェント

軌跡報酬を超えて: グラフモデリングによるエージェント検索のためのステップレベルのクレジット割り当て

Agentic Search では、既存のステップレベルの報酬手法は通常、コストのかかるツリーサンプリングに依存しているのに対し、軌跡レベルの結果報酬では個々のステップの行動の寄与を定量化できません。私たちは、世界の知識を潜在的な世界グラフとして捉え、各 IS タスクを潜在的なタスクグラフ内の検索として捉えます。効果的なステップにより、グラフは答えのノードに向かって進歩するはずです。この事前の説明に基づいて、トレーニング時のエンティティリレーション (ER) グラフ内の回答ノードまでの距離によって、新たに取得および新たに引用されたエンティティをスコアリングするステップレベルのプロセス報酬である、グラフ距離貢献報酬 (GDCR) を提案します。さらに、GDCR をステップレベルの利点に変換し、それらを軌道レベルの結果の利点と組み合わせるステップアドバンテージポリシー最適化 (SAPO) を提案します。 4 つの困難なベンチマークでの実験により、私たちの手法の有効性が検証されました。

原文 (English)

Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling

In Agentic Search, trajectory-level outcome rewards fail to quantify the behavioral contributions of individual steps, while existing step-level reward methods typically rely on costly tree sampling. We view world knowledge as a latent world graph and each IS task as search within a latent task graph, where effective steps should make graph progress toward the answer node. Based on this prior, we propose Graph-Distance Contribution Reward (GDCR), a step-level process reward that scores newly-retrieved and newly-cited entities by their distance to the answer node in a training-time Entity-Relation (ER) graph. We further propose Step Advantage Policy Optimization (SAPO), which converts GDCR into step-level advantages and combines them with trajectory-level outcome advantages. Experiments on four challenging benchmarks validate the effectiveness of our method.
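
To make the idea of a graph-distance reward concrete, the sketch below scores newly surfaced entities by their shortest-path distance to the answer node. It is an illustration only: the abstract does not state the GDCR formula, so the `1 / (1 + distance)` scoring, the function names, and the toy graph are all assumptions.

```python
from collections import deque


def distances_to(graph: dict[str, set[str]], target: str) -> dict[str, int]:
    """Breadth-first search: hop distance from every reachable node to `target`."""
    dist = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def step_reward(new_entities: set[str], dist: dict[str, int]) -> float:
    """Assumed scoring: entities closer to the answer node earn more."""
    return sum(1.0 / (1 + dist[e]) for e in new_entities if e in dist)


# Toy undirected entity-relation graph: Q - A - B - ANS.
graph = {
    "Q": {"A"},
    "A": {"Q", "B"},
    "B": {"A", "ANS"},
    "ANS": {"B"},
}
dist = distances_to(graph, "ANS")
near = step_reward({"B"}, dist)   # entity one hop from the answer
far = step_reward({"A"}, dist)    # entity two hops from the answer
```

Entities that do not appear in the graph contribute nothing here, which is one simple way to handle retrieved items that the training-time graph cannot place.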

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

BitTP: エッジデバイス向けの BitLLM を使用した軽量軌道予測モデル

軌道予測は自律システムの基本的なタスクであり、マルチエージェントの相互作用と意図についての複雑な推論が必要です。最近、大規模言語モデル (LLM) がこのタスクに採用されています。これは、強力な文脈上の推論と、解釈可能な言語ベースの軌跡表現を提供するためです。ただし、これらの LLM ベースの予測子はメモリと計算量が非常に多いため、自律ロボットのオンボードコンピューターなど、リソースに制約のあるエッジデバイスに導入することが困難になります。このギャップを埋めるために、LLM ベースの軌道予測器を軽量のビット線形アーキテクチャに変換する BitTP を提案します。 1.58 ビット (BitTP-Weight) への重みのみの量子化が最適であることを示します。重要なのは、アクティベーションを量子化すると時空間推論の深刻な劣化と不安定性につながるため、アクティベーションは完全な精度を維持する必要があります。経験的に、BitTP-Weight は、完全精度 (BF16) LLM ベースラインを超える予測品質を維持するだけでなく改善し、平均して ADE を 14.29%、FDE を 20.97% 削減し、同時に他の量子化方法と比較してメモリ使用量と推論遅延を削減します。これらの結果は、慎重に設計された量子化が効果的な正則化機能として機能し、洗練された LLM ベースの推論をエッジデバイス上で実際に展開できることを示しています。コードは https://github.com/MintCat98/BitTP で入手できます。

原文 (English)

BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices

Trajectory prediction is a fundamental task for autonomous systems, requiring complex reasoning about multi-agent interactions and intents. Large language models (LLMs) have recently been adopted for this task, as they provide strong contextual reasoning and interpretable, language-based trajectory representations. However, these LLM-based predictors are extremely memory- and compute-intensive, making them difficult to deploy on resource-constrained edge devices such as on-board computers in autonomous robots. To bridge this gap, we propose BitTP, which converts an LLM-based trajectory predictor into a lightweight bitlinear architecture. We demonstrate that weight-only quantization to 1.58-bit (BitTP-Weight) is optimal. Crucially, activations must remain in full precision, as quantizing them leads to severe degradation and instability in spatio-temporal reasoning. Empirically, BitTP-Weight not only preserves but improves prediction quality over the full-precision (BF16) LLM baseline, reducing ADE by 14.29% and FDE by 20.97% on average, while simultaneously reducing memory usage and inference latency relative to other quantization methods. These results demonstrate that carefully designed quantization acts as an effective regularizer, enabling the practical deployment of sophisticated LLM-based reasoning on edge devices. Code is available at: https://github.com/MintCat98/BitTP.
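
For readers unfamiliar with 1.58-bit weights, the sketch below shows ternary weight-only quantization in the absmean style popularized by BitNet b1.58. This is an assumption: the abstract says BitTP uses a bitlinear architecture but does not specify its quantizer, and the function names here are hypothetical. Each weight is mapped to -1, 0, or +1, and three states carry log2(3), about 1.58, bits.

```python
import math


def ternary_quantize(weights: list[float], eps: float = 1e-8) -> tuple[list[int], float]:
    """Map each weight to {-1, 0, +1} using the mean absolute value as scale."""
    scale = sum(abs(w) for w in weights) / len(weights)
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Approximate reconstruction of the original weights."""
    return [q * scale for q in quantized]


weights = [0.9, -0.05, 0.4, -1.2]
q, scale = ternary_quantize(weights)
approx = dequantize(q, scale)
bits_per_weight = math.log2(3)   # information content of a ternary value
```

Only the weights are quantized in this sketch; activations are left untouched, in line with the abstract's finding that quantizing activations degrades spatio-temporal reasoning.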

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント ビジネス/資金調達

スペシャリストモデルが依然として重要な理由: 医療用人工知能のための異種マルチエージェントパラダイム

医療分野における GPT や Claude などの汎用大規模言語モデル (LLM) の優れたパフォーマンスは、領域固有の医療専門家モデルは時代遅れになるのだろうかという重大な疑問を引き起こしています。私たちは、医療用人工知能 (AI) の将来は、モノリシックな医療基盤モデルの構築や人間の専門知識の置き換えにあるのではなく、ジェネラリストの LLM、領域固有の専門家モデル、および臨床医の間のコラボレーションを調整することにあると主張します。我々は、矛盾を認識した証拠の融合、不確実性に基づく臨床医の介入トリガー、および適応閾値キャリブレーションを可能にする異種医療マルチエージェントフレームワークである HetMedAgent を提案します。 3 つの実際の臨床意思決定タスクに関する実験では、ジェネラリスト LLM と領域固有の専門家モデルの間の相乗効果が、どちらかのタイプのモデルを単独で使用した場合よりも大幅に優れていることが実証され、モダリティ固有の分析における専門家モデルのかけがえのない価値が検証されました。 HetMedAgent は、医療 LLM または基盤モデルの構築から複数エージェントのコラボレーションへの移行を表し、一般的な推論機能とドメイン固有の精度のバランスを実現します。

原文 (English)

Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence

The impressive performance of generalist large language models (LLMs) such as GPT and Claude in healthcare raises a critical question: will domain-specific medical specialist models become obsolete? We argue that the future of medical artificial intelligence (AI) lies not in building monolithic medical foundation models, nor in replacing human expertise, but in orchestrating collaboration among generalist LLMs, domain-specific specialist models, and clinicians. We propose HetMedAgent, a heterogeneous medical multi-agent framework that enables conflict-aware evidence fusion, uncertainty-based clinician intervention triggering, and adaptive threshold calibration. Experiments on three real-world clinical decision-making tasks demonstrate that the synergy between generalist LLMs and domain-specific specialist models significantly outperforms using either type of model alone, validating the irreplaceable value of specialist models in modality-specific analysis. HetMedAgent represents a shift from building medical LLMs or foundation models to multi-agent collaboration, achieving a balance between general reasoning capabilities and domain-specific precision.

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

SkillsInjector: LLM エージェントの動的なスキルコンテキストの構築

LLM エージェントは、増大するスキルライブラリを利用して複雑なタスクを処理できるようになりました。ただし、より多くのスキルを投入しても、必ずしもタスクの完了が向上するとは限らず、かえって低下する場合もあります。既存の手法は依然としてスキルの注入を静的なステップとして扱い、固定基準でスキルを選択し、事前に予算を固定し、説明を変更しないままにします。どのスキルが公開されるか、いくつ含まれるか、およびそれらがどのように提示されるかはすべて、下流のパフォーマンスに影響を与えるため、この静的な処理はスキルの有用性を損なう可能性があると私たちは主張します。私たちは、これらの決定に共同で対処する 2 段階の適応手法である SkillsInjector を提案します。まず、コンテキストプランナーは、実行に基づいたスキルの選好を学習し、タスクごとに適応的な数のスキルを採用します。次に、セット対応レンダラーは、選択された説明が、同時に注入される他のスキルとの関係でどのように提示されるかを調整します。tau2-bench、SkillsBench、ALFWorld 全体で、SkillsInjector が最高のスコアを達成し、最も強力なベースラインをそれぞれ 3.9、6.1、7.3 パーセントポイント改善しました。アブレーション研究では、スキルの選択、適応的な予算設定、およびセットアウェアなレンダリングがそれぞれ向上に貢献していることが示されています。これらの結果は、スキル拡張エージェントが、注入されたコンテキスト自体を最適化することで恩恵を受けることを示しています。コードは論文の公開時に公開される予定です。

原文 (English)

SkillsInjector: Dynamic Skill Context Construction for LLM Agents

LLM agents now draw on growing skill libraries to handle complex tasks. However, injecting more skills does not always improve task completion and can even degrade it. Existing methods still treat skill injection as a static step, selecting skills with fixed criteria, fixing the budget in advance, and leaving descriptions unchanged. We argue that this static treatment can undermine the utility of skills, because which skills are exposed, how many are included, and how they are presented all affect downstream performance. We propose SkillsInjector, a two-stage adaptive method that jointly addresses these decisions. First, a context planner learns execution-grounded skill preferences and admits an adaptive number of skills for each task. A set-aware renderer then tailors how selected descriptions are presented relative to their co-injected neighbors. Across tau2-bench, SkillsBench, and ALFWorld, SkillsInjector achieves the highest score, improving over the strongest baseline by 3.9, 6.1, and 7.3 percentage points, respectively. Ablation studies show that skill selection, adaptive budgeting, and set-aware rendering each contribute to the gain. These results show that skill-augmented agents benefit from optimizing the injected context itself. Code will be released upon publication.

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

SAAS: エージェント検索における過剰検索を軽減するための自己認識強化学習

エージェント検索により、LLM は反復推論と外部検索を通じて複雑なマルチホップの質問を解決できます。これらのシステムは有効であるにもかかわらず、実際には重大な制限に悩まされることがよくあります。エージェントは自分自身の知識の境界を認識できず、内部の知識が十分な場合でもやみくもに検索を開始し、十分な証拠が収集されている場合でも検索を終了できません。自己認識の欠如は深刻な過剰検索につながり、かなりの推論遅延と法外な計算コストが発生します。この目的を達成するために、精度を損なうことなく検索動作を正確に制御する動的な自己認識を育成するように設計された新しい RL フレームワークである SAAS を提案します。 SAAS では、次の 3 つの主要コンポーネントが導入されています。(i) 検索境界モデリングメカニズム。検索が無効なロールアウトと検索が有効なロールアウトを対比することで、進化するポリシーに基づいて検索境界を識別します。 (ii) 境界認識報酬モジュール。この境界認識を軌道レベルのペナルティに変換し、不必要で冗長な検索を抑制します。 (iii) 段階的な最適化戦略。これは、一連のカリキュラムを活用して、検索の正規化よりも推論を優先し、それによって報酬のハッキングを回避します。広範な実験により、SAAS が精度を維持しながら過剰検索を大幅に削減することが実証されました。私たちのコードは https://github.com/XMUDeepLIT/SAAS で匿名で公開されています。

原文 (English)

SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite their effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize their own knowledge boundaries, blindly triggering searches when internal knowledge suffices and failing to terminate search even when adequate evidence has been collected. The lack of self-awareness leads to severe over-search, incurring substantial inference latency and prohibitive computational cost. To this end, we propose SAAS, a novel RL framework designed to cultivate dynamic self-awareness that precisely regulates search behavior without compromising accuracy. SAAS introduces three key components: (i) a search boundary modeling mechanism, which identifies the search boundary under the evolving policy by contrasting search-disabled and search-enabled rollouts; (ii) a boundary-aware reward module, which translates this boundary awareness into trajectory-level penalties, suppressing unnecessary and redundant searches; and (iii) a stage-wise optimization strategy, which leverages a sequential curriculum to prioritize reasoning over search regularization, thereby avoiding reward hacking. Extensive experiments demonstrate that SAAS substantially reduces over-search, while maintaining accuracy. Our code is anonymously released at https://github.com/XMUDeepLIT/SAAS.

2026-05-29 13:00 JST arXiv cs.AI エージェント

AgentDoG 1.5: AI エージェントの安全性とセキュリティのための軽量でスケーラブルな調整フレームワーク

OpenClaw などの最新のオープンワールドエージェントは、強力な環境間実行機能を示しますが、広範な新しい安全リスク源をもたらします。その一方で、高度なフロンティア AI モデルは攻撃障壁を大幅に下げ、現在のエージェント調整フレームワークが現実世界の展開には不適切になっています。これらの新たな脅威に対処するために、私たちは軽量でスケーラブルなエージェントの安全性調整フレームワークを提案します。具体的には、Codex および OpenClaw の実行シナリオからの緊急リスクに対応するために、エージェントの安全性分類を更新します。さらに、わずか約 1,000 個のサンプルを使用して軽量の AgentDoG 1.5 バリアント (0.8B、2B、4B、および 8B パラメーター) をトレーニングするための影響関数の精製を備えた分類に基づくデータエンジンを構築し、主要なクローズドソースモデル (GPT-5.4 など) と同等のパフォーマンスを達成します。 AgentDoG 1.5 に基づいて、高効率のエージェント安全性 SFT および RL トレーニング環境を構築します。これにより、Docker レベルの環境での導入オーバーヘッドが 2 桁削減されます。最後に、リアルタイムの安全管理のためのトレーニング不要のオンラインガードレールとして AgentDoG 1.5 を導入します。広範な実験結果は、AgentDoG 1.5 が多様で複雑な対話型エージェントシナリオにおいて最先端のパフォーマンスを達成することを示しています。すべてのモデルとデータセットは公開されています。

原文 (English)

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, advanced frontier AI models drastically lower attack barriers, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to accommodate emergent risks from Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) using only around 1k samples, achieving comparable performance with leading closed-source models (e.g., GPT-5.4). Based on AgentDoG 1.5, we construct a highly efficient agentic safety SFT and RL training environment, which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experimental results indicate that AgentDoG 1.5 achieves state-of-the-art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント 研究/論文

冗長ですか、それとも必要ですか?エージェントの軌跡における冗長なステップを検出するためのベンチマーク

LLM ベースのエージェントは、複数ステップの推論とツールの使用を通じて複雑なタスクを解決する強力な機能を実証しています。しかし、既存の評価プロトコルは主にタスクの成功に焦点を当てており、エージェントの動作の重要な側面である実行効率を見落としています。実際には、エージェントの軌跡には、タスクの完了にはほとんど寄与しないものの、大量のリソースを消費する冗長なステップが含まれることがよくあります。この研究では、エージェントの軌跡に対する冗長ステップ検出という新しい研究領域を提案し、定式化します。この取り組みをサポートするために、RedundancyBench を導入します。これは、慎重に注釈が付けられた軌跡を持つ多様なタスクを含む新しいベンチマークであり、タスクの完了への貢献度に応じて各ステップにラベルが付けられます。RedundancyBench を使用して、軌跡内のステップが冗長であるか必要であるかを判定するための 3 つの代表的な方法を開発および評価します。私たちの結果は、最もパフォーマンスの高い方法でも冗長ステップの検出で 24.88% のスコアしか達成できず、一部の方法はランダムな推測よりもパフォーマンスが悪いことを示しています。これらの結果は、このタスクの複雑さと、この分野におけるさらなる研究の必要性を浮き彫りにしています。この論文のコードとデータセットは、いずれも https://anonymous.4open.science/r/RedundancyBench で入手できます。

原文 (English)

Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories

LLM-based agents have demonstrated strong capabilities in solving complex tasks through multi-step reasoning and tool use. However, existing evaluation protocols primarily focus on task success, overlooking a critical aspect of agent behavior: execution efficiency. In practice, agent trajectories often contain redundant steps that consume substantial resources while contributing little to task completion. In this work, we propose and formulate a new research area: redundant step detection for agent trajectories. To support this initiative, we introduce RedundancyBench, a new benchmark that contains diverse tasks with carefully annotated trajectories, where each step is labeled according to its contribution to task completion. Using RedundancyBench, we develop and evaluate 3 representative methods to answer whether a step within a trajectory is redundant or necessary. Our results show that even the best-performing method achieves a score of only 24.88% in detecting redundant steps, while some methods perform worse than random guessing. These results highlight the task's complexity and the need for further research in this area. Code and dataset for this paper are both available at https://anonymous.4open.science/r/RedundancyBench.

2026-05-29 13:00 JST arXiv cs.AI エージェント

大規模な数学の形式化

Lean 4 で Autoformalized Textbook Library At Scale (Atlas) を構築するためのマルチエージェントシステムである AutoformBot を紹介します。 AutoformBot は、形式的な検証ツール、依存関係を意識したタスクスケジューリング、協調的なバージョン管理を備えた数千の LLM エージェントを調整し、非公式な教科書の散文を機械チェックされた定義と証明に変換します。私たちは、解析、代数、トポロジー、組合せ論、確率に及ぶ 26 冊のオープンアクセス教科書のコーパスに私たちの手法を適用し、45,000 を超える Lean 4 宣言と 500,000 行を超えるコードの検証済みライブラリである Atlas を作成します。私たちは 2 つのアーティファクトをリリースします。(i) AutoformBot、オープンソースのマルチエージェントフレームワーク。 (ii) Atlas、結果として得られる正式なライブラリ。私たちの結果は、大学院レベルの数学の中核的な内容を大規模に自動形式化することが経済的かつ技術的に実現可能であることを示唆しています。これにより、人間が生成した数学と機械が生成した数学の両方を研究レベルで自動検証する扉が開かれます。

原文 (English)

Formalizing Mathematics at Scale

We present AutoformBot, a multi-agent system for building an Autoformalized Textbook Library At Scale (Atlas) in Lean 4. AutoformBot orchestrates thousands of LLM agents, equipped with formal verification tools, dependency-aware task scheduling, and collaborative version control, to translate informal textbook prose into machine-checked definitions and proofs. We apply our methods to a corpus of 26 open-access textbooks spanning analysis, algebra, topology, combinatorics, and probability, producing Atlas: a verified library of over 45,000 Lean 4 declarations and 500 thousand lines of code. We release two artifacts: (i) AutoformBot, the open-source multi-agent framework; and (ii) Atlas, the resulting formal library. Our results suggest that autoformalizing the core content of graduate-level mathematics at scale is now economically and technically feasible. This opens the door to the automated verification of both human- and machine-generated mathematics at a research level.

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

Compass: 専門家主導の LLM エージェントによる全球海洋鉛データ統合のナビゲート

海洋鉛 (Pb) とその同位体は、海洋循環と人為的汚染の重要なトレーサーですが、現場での観測は依然として費用がかかり、まばらです。膨大な歴史的記録が存在する一方で、それらは学術論文の非構造化コンテンツの中に埋もれており、包括的な分析にアクセスできない「データサイロ」を生み出しています。手動による抽出には拡張性がなく、汎用の大規模言語モデル (LLM) には必要なドメイン固有の知識が不足しているため、幻覚や科学的に無効な出力が発生します。これに対処するために、LLM が微調整せずに厳密な科学データ抽出を実行できるようにする、専門家主導の適応アプローチを導入します。私たちは、海洋科学者と共同設計したナレッジツリーによって強化された LLM エージェントフレームワークである Compass を通じて、このアプローチを実現します。ナレッジツリーは、複雑なタスクを検証可能なステップに分解し、科学的妥当性を確保するためにエージェントの推論を導きます。230,000 件を超える関連するオープンアクセス論文のコーパス全体に Compass を適用することで、これまで統合されていなかった 3,751 件の Pb レコードを抽出することに成功しました。この取り組みにより、これまでで最大の統合海洋 Pb データベースが確立されました。標準的な指標を超えて、Compass は多層検証を通じて優れた信頼性を実証し、専門家の手動検証で確認されたとおり 92% の精度を達成しています。新たに統合されたデータは、東シナ海や南極海など、これまでサンプリングが不十分だった地域のカバー範囲を拡大し、将来の科学的発見のための充実したデータ基盤を提供します。私たちは、オープンな科学的アクセスを促進するために、インタラクティブな視覚化プラットフォームをリリースします。私たちの研究は、専門家がガイドするエージェントが汎用 LLM と高リスクの科学領域の間のギャップを効果的に埋め、地球科学におけるスケーラブルなデータ発見を可能にすることを実証しています。

原文 (English)

Compass: Navigating Global Marine Lead Data Integration through Expert-Guided LLM Agent

Marine lead (Pb) and its isotopes are critical tracers for ocean circulation and anthropogenic pollution, yet in-situ observations remain costly and sparse. While vast historical records exist, they lie buried within the unstructured content of academic papers, creating "data silos" inaccessible to comprehensive analysis. Manual extraction is unscalable, while general-purpose Large Language Models (LLMs) lack the necessary domain-specific knowledge, leading to hallucinations and scientifically invalid outputs. To address this, we introduce an expert-guided adaptation approach that enables LLMs to perform rigorous scientific data extraction without fine-tuning. We operationalize this approach through Compass, an LLM agent framework enhanced by a Knowledge Tree co-designed with marine scientists, which decomposes complex tasks into verifiable steps, guiding the agent's reasoning to ensure scientific validity. Deploying Compass across a corpus of over 230,000 relevant open-access papers, we successfully extract 3,751 previously unincorporated Pb records. This effort establishes the largest integrated marine Pb database to date. Beyond standard metrics, Compass demonstrates superior reliability through multi-layered validation, achieving 92% accuracy as confirmed through expert manual verification. The newly integrated data expand coverage in previously under-sampled regions such as the East China Sea and the Southern Ocean, providing an enriched data foundation for future scientific discoveries. We release an interactive visualization platform to facilitate open scientific access. Our work demonstrates that expert-guided agents can effectively bridge the gap between general-purpose LLMs and high-stakes scientific domains, enabling scalable data discovery in geosciences.

2026-05-29 13:00 JST arXiv cs.AI エージェント

KairosAgent: 融合されたセマンティック推論を使用したエージェント時系列予測

クロスドメインのマルチモーダル時系列予測は困難なタスクであり、正確な数値的理解、クロスドメインの意味的理解、効果的なマルチモーダル融合を統合するモデルが必要です。既存のアプローチは、時系列基盤モデル (Time Series Foundation Model, TSFM) を一から構築するか、事前学習済みの大規模言語モデル (LLM) を活用するかのいずれかです。しかし、TSFM は意味的理解を軽視しがちで、未来志向の意味的推論を行う能力に欠けており、LLM は数値的な理解と正確な定量的予測を苦手としています。これらの制限を克服するために、LLM ベースの推論器と TSFM ベースの予測器で構成される、マルチモーダル時系列予測のための新しいエージェントフレームワーク KairosAgent を提案します。KairosAgent は、分析ツールを動的に呼び出して LLM の数値的理解と意味的推論の能力を強化することで、テキスト推論と数値予測を統合します。その後、推論の結果が TSFM パイプラインに融合され、より正確で信頼性の高い将来予測が可能になります。推論をさらに改善するために、高品質な軌跡の大規模コーパスを構築するとともに、マルチターンの改良とターンレベルの信用割り当てを備えた「予測からの強化学習」パラダイムを導入しました。実験では、KairosAgent が事前学習済みの LLM と TSFM の有用性を最大限に引き出しながら優れたゼロショット予測性能を達成することを実証し、効率的で解釈可能な時系列エージェントの有望な方向性を示しています。プロジェクトページは https://foundation-model-research.github.io/KairosAgent にあります。

原文 (English)

KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning

Cross-domain multimodal time series forecasting is a challenging task, requiring models to integrate precise numerical comprehension, cross-domain semantic understanding, and effective multimodal fusion. Existing approaches either build Time Series Foundation Models (TSFMs) from scratch or leverage pretrained Large Language Models (LLMs). However, TSFMs often overlook semantic understanding and lack the ability to perform future-oriented semantic reasoning, and LLMs struggle with numerical comprehension and accurate quantitative forecasting. To overcome these limitations, we propose KairosAgent, a novel agentic framework for multimodal time series forecasting, including an LLM-based reasoner and a TSFM-based forecaster. KairosAgent unifies textual reasoning and numerical forecasting by dynamically invoking analytical tools to enhance the numerical understanding and semantic reasoning capabilities of LLMs. The reasoning results are subsequently fused into the TSFM pipeline, enabling more accurate and reliable future predictions. To further improve the reasoning, we curate a large-scale corpus of high-quality trajectories, alongside a reinforcement learning from forecasting paradigm with multi-turn refinement and turn-level credit assignment. Experiments demonstrate that KairosAgent achieves superior zero-shot forecasting performance while maximizing the utility of pretrained LLMs and TSFMs, presenting a promising direction for efficient and interpretable time series agents. The project page is at https://foundation-model-research.github.io/KairosAgent .

2026-05-29 13:00 JST arXiv cs.AI エージェント

選択の学習: 適応的な方法選択のためのセマンティック通信を備えたエンパワーメントに基づくマルチエージェントシステム

科学技術計算のワークフローを自動化するには、実行可能コードを生成するだけでは不十分です。自律システムは、適切な計算戦略を選択し、それを忠実に実装し、得られた結果が、それを生み出した意思決定に因果的に帰属できることも保証しなければなりません。マルチエージェントパイプラインでは、このプロセスは特に脆弱です。エージェントの意図と行動の間の小さな不一致がセマンティックドリフトを引き起こし、最終的に実行される手順が当初選択された戦略を反映しなくなることで、下流の評価と適応が損なわれます。この研究では、ATHENA フレームワーク (Toscano et al., 2025; Toscano et al., 2026) とエンパワーメントの概念 (Yiu et al., 2025) を動機として、文脈付きバンディットと構造化されたエージェント間通信、そして最も重要な要素として、パイプライン全体で行動と結果の忠実性を保つセマンティックチェックポイントを組み合わせたマルチエージェントフレームワークを導入します。このシステムは、適応的な意思決定アーキテクチャの中に、専門化された大規模言語モデル (LLM) エージェント、根拠付けられたコード生成、自己修復型の実行ループを統合します。エンパワーメントの観点からフレームワークを解釈することで、信頼性の高い自律学習には、高品質な行動を特定するだけでなく、その行動がエージェント間を伝播する際の完全性を保つことも必要であることを示します。代表的なケーススタディとして感度分析と不確実性定量化のワークフローを用い、チェックされないセマンティックドリフトが方策学習を劣化させるのに対し、提案フレームワークは収束性、堅牢性、新しい問題コンテキストへの適応を向上させることを実証します。これらの結果は、科学分野のマルチエージェントシステムに対するより広い設計原則を示唆しています。すなわち、適応的な意思決定は、意味的一貫性と計算パイプライン全体にわたる信頼性の高い情報フローを保証する明示的なメカニズムと組み合わせる必要があります。

原文 (English)

Learning to Choose: An Empowerment-Guided Multi-Agent System with semantic communication for Adaptive Method Selection

Automating scientific computing workflows requires more than generating executable code: autonomous systems must also select appropriate computational strategies, implement them faithfully, and ensure that the resulting outcomes remain causally attributable to the decisions that produced them. In multi-agent pipelines, this process is particularly fragile, as small inconsistencies between agent intentions and actions can lead to semantic drift, where the eventually executed procedure no longer reflects the originally selected strategy, thereby corrupting downstream evaluation and adaptation. In this work, motivated by the ATHENA framework (Toscano et al., 2025; Toscano et al., 2026) and the concept of empowerment (Yiu et al., 2025), we introduce a multi-agent framework that combines contextual bandits with structured inter-agent communication and, most importantly, semantic checkpoints that preserve action-outcome fidelity throughout the pipeline. The system integrates specialized large language model (LLM) agents, grounded code generation, and self-healing execution loops within an adaptive decision-making architecture. Interpreting the framework through the lens of empowerment, we show that reliable autonomous learning requires not only identifying high-quality actions, but also preserving the integrity of their propagation across agents. Using sensitivity analysis and uncertainty quantification workflows as representative case studies, we demonstrate that unchecked semantic drift degrades policy learning, whereas the proposed framework improves convergence, robustness, and adaptation to novel problem contexts. These results suggest a broader design principle for scientific multi-agent systems: adaptive decision-making must be coupled with explicit mechanisms that guarantee semantic consistency and reliable information flow across the computational pipeline.

2026-05-29 13:00 JST arXiv cs.AI エージェント ビジネス/資金調達

矛盾する複数ソースの個人記憶に対する選択的 QA: 診断テストベッドと手法の比較

新たに登場しているパーソナル AI エージェントは、永続的なマルチソースメモリへと移行しつつあります。これは評価上の問題を生みます。システムは、矛盾する証拠や不完全な証拠をどう使うかを決めなければならず、単一のクリーンな履歴から事実を取り出すだけでは済みません。既存のベンチマークでは、エラーが手法に与えられた証拠に起因するのか、手法の競合解決ステップに起因するのかがほとんど分かりません。私たちはこれを、矛盾する複数ソースの個人記憶に対する選択的 QA として研究します。システムは、矛盾し、ときに不完全なソースに基づいて回答するか、証拠が不十分な場合は棄権します。8 つの推論タイプにわたる 18 の質問テンプレート、480 のペルソナ、4 つのランダムシード、34,560 のインスタンスを含むベンチマークを、制御されたソースの歪みと決定論的な正解 (グラウンドトゥルース) とともに開発しました。ソースにアクセスできないベースライン、単一ソースにアクセスできるベースライン、構造化融合手法、フロンティア LLM の性能を評価します。学習済みの融合リゾルバーのうち最良のものは精度 80.3% に達し、プロンプトのみの LLM ベースラインのうち最も強力なものは 70.0% に達します。棄権を許容すると、同じリゾルバーはカバレッジ 78.3% で選択精度 85.3% に達し、最良の LLM はカバレッジ 95.4% で選択精度 71.0% に達します。推論タイプごとに、強みを持つモデルは異なります。データ、コード、キャッシュされたモデル出力、およびデータ生成プロセスを再利用のために公開します。

原文 (English)

Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison

Emerging personal AI agents are moving toward persistent, multi-source memory. This creates an evaluation problem: systems must decide how to use conflicting or incomplete evidence; they cannot just retrieve facts from one clean history. Existing benchmarks rarely show whether an error came from the evidence given to a method or from the method's conflict-resolution step. We study this as selective QA over conflicting multi-source personal memory: systems answer based on conflicting, sometimes incomplete sources, or abstain when evidence is insufficient. We develop a benchmark containing 18 question templates across 8 reasoning types, 480 personas, 4 random seeds, and 34,560 instances, with controlled source distortions and deterministic ground truth. We evaluate the performance of baselines without access to any source, access to a single source, structured fusion methods, and frontier LLMs. The best trained fusion resolver reaches 80.3% accuracy, while the strongest prompt-only LLM baseline reaches 70.0%. With abstention, the same resolver reaches 85.3% selective accuracy at 78.3% coverage and the best LLM reaches 71.0% selective accuracy at 95.4% coverage. Different models have different strengths across reasoning types. We release the data, code, cached model outputs, and data-generating process for reuse.
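The selective accuracy and coverage figures quoted above can be read through the usual selective-prediction definitions. The abstract does not spell out its exact formulas, so the sketch below assumes the standard ones: coverage is the share of questions the system answers, and selective accuracy is accuracy measured only on those answered questions. The function name and the use of `None` to mark abstention are illustrative choices, not part of the benchmark.

```python
from typing import Optional, Sequence, Tuple


def selective_metrics(
    predictions: Sequence[Optional[str]], gold: Sequence[str]
) -> Tuple[float, float]:
    """Return (selective_accuracy, coverage).

    Assumption: a prediction of None means the system abstained.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0, 0.0

    # Keep only the questions the system chose to answer.
    answered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    coverage = len(answered) / len(gold)
    if not answered:
        return 0.0, coverage

    correct = sum(1 for p, g in answered if p == g)
    return correct / len(answered), coverage


if __name__ == "__main__":
    preds = ["a", None, "b", "c"]
    truth = ["a", "x", "b", "d"]
    print(selective_metrics(preds, truth))
```

Under these definitions a system can raise selective accuracy by abstaining more often, which is why the two numbers are reported together.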

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

コンテキスト関連性による注意ステアリングによるマルチエージェントのコミュニケーションの強化

LLM ベースのマルチエージェントシステムは、協調的な推論を通じて複雑なタスクで優れた性能を示してきました。しかし、これらのシステムは、対話中に非常に長い会話履歴を急速に蓄積する傾向があります。会話が長くなるにつれて、関連する情報は無関係なコンテキストによってますます希薄になり、性能の低下につながります。この研究では、新しい時間的・空間的減衰メカニズムによって各エージェントの注意を関連するコンテキストへ動的に誘導する、学習不要のコンテキスト管理手法 Agent-Radar を紹介します。実験では、Agent-Radar が 5 つの異なるベンチマークで最先端の手法を上回り、最大 7.64 絶対ポイントの向上が得られることを示しました。さらに分析により、エージェント数と対話ラウンド数が増加しても、Agent-Radar は効果と堅牢性を保つことが分かりました。最後に、アブレーション研究は、Agent-Radar のコアコンポーネントが性能にとって重要であり、さまざまな設定に汎化できることを示しています。

原文 (English)

Enhancing Multi-Agent Communication through Attention Steering with Context Relevance

LLM-based multi-agent systems have demonstrated remarkable performance on complex tasks through collaborative reasoning. However, these systems tend to rapidly accumulate extremely long conversation histories during interaction. As conversations lengthen, relevant information is increasingly diluted by irrelevant context, leading to degraded performance. In this work, we present Agent-Radar, a training-free context management method that dynamically steers each agent's attention toward relevant context with a novel temporal and spatial decay mechanism. Our experiments demonstrate that Agent-Radar outperforms state-of-the-art methods across five different benchmarks, yielding gains of up to 7.64 absolute points. Furthermore, our analysis shows that Agent-Radar remains effective and robust as the number of agents and interaction rounds increases. Finally, the ablation study shows that core components in Agent-Radar are crucial to performance and generalizable in different settings.

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

AgentSchool: LLM を利用した教育用マルチエージェントシミュレーション

LLM の教室への導入が急速に進んでいるにもかかわらず、教育 AI の検証は依然として特有の難しさを抱えています。介入は、認知的・社会的な軌跡が不可逆的に形成される発達途上の学習者に作用する一方で、現実世界での試験は時間がかかり、倫理的に制約され、制度的にも固定されています。LLM ベースの教育シミュレーターが潜在的な解決策として登場していますが、その多くは依然として学習をペルソナ条件付きのロールプレイに還元しており、既存の教室を再現することだけに最適化された場合、教育改革に必要な制度上の新規性を構造的に不利に扱う可能性があります。この研究では、学習をプロンプトで誘導された振る舞いではなく状態遷移としてモデル化する、LLM 駆動のマルチエージェントシミュレーター AgentSchool を紹介します。AgentSchool は、重み付き教科知識グラフ、思考ワークフロープール、明示的な誤概念を備えた、認知的に成長可能な学生エージェントと、発達の最近接領域 (ZPD) に沿って計画、足場かけ、振り返りを行う適応型教師エージェントを組み合わせます。これらは、公式・非公式の両方の学習の場に指導を位置付ける構成可能な場面ジェネレーターと、インタラクションの規模、時間粒度、シミュレーション期間を切り離すマルチスケールシミュレーターに組み込まれています。実験によれば、構造化された学生エージェントは、ベースラインのシミュレーターよりも差別化された習熟と誤概念の痕跡を生成し、教師エージェント同士の比較では、ZPD に基づく適応と整合する、バックボーン依存のパターンが示されます。さらに、AgentSchool は、教室の社会理論と整合する、周辺的参加、派閥の形成、攻撃者に起因する結束、オピニオンリーダーの出現のもっともらしい痕跡を生成します。AgentSchool は、教育研究の手段としての役割を超えて、長期記憶、マルチエージェントの協調、組織的圧力の下での将来の制度的推論のための、社会的に意義のあるテストベッドとして教育を位置付けています。

原文 (English)

AgentSchool: An LLM-Powered Multi-Agent Simulation for Education

Despite the rapid deployment of LLMs into classrooms, validating educational AI remains uniquely intractable: interventions act on developing learners whose cognitive and social trajectories are irreversibly shaped, while real-world trials are slow, ethically constrained, and institutionally locked. LLM-based educational simulators have emerged as a potential remedy, but many still collapse learning into persona-conditioned role-play and, when optimized only to reproduce existing classrooms, can structurally penalize the institutional novelty that pedagogical reform requires. In this work, we introduce AgentSchool, an LLM-driven multi-agent simulator that models learning as state transition rather than prompted behavior. AgentSchool couples cognitively growable student agents -- equipped with weighted subject knowledge graphs, thinking-workflow pools, and explicit misconceptions -- with adaptive teacher agents that plan, scaffold, and reflect along the Zone of Proximal Development, embedded in a configurable scenery generator that situates instruction within both formal and informal learning fields, and a multi-scale simulator that decouples interaction scale, temporal granularity, and simulation duration. Experiments show that structured student agents produce more differentiated mastery and misconception traces than a baseline simulator, while teacher-agent comparisons show backbone-dependent patterns consistent with ZPD-informed adaptation. Further, AgentSchool generates plausible traces of peripheral participation, clique formation, aggressor-induced cohesion, and opinion-leader emergence consistent with classroom social theories. Beyond its role as an educational research instrument, AgentSchool frames education as a socially meaningful testbed for long-horizon memory, multi-agent coordination, and future institutional reasoning under organizational pressure.

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

Long-Horizon LLM エージェント向けのメタ認知メモリポリシーの最適化

メモリ拡張 LLM エージェントは、インタラクションの軌跡をコンパクトなメモリへと再帰的に要約することで、複雑な長期タスクに取り組みます。しかし、既存のアプローチは通常、結果ベースの強化学習でこれらのメモリポリシーを学習させるため、中間のメモリ品質がどこで劣化するのかを特定できません。インタラクションが進むにつれて、曖昧な再帰的要約はタスクに関連する情報を徐々に捨て、意味的なノイズを持ち込みます。これにより信念の逸脱が悪化し、潜在的なタスク状態に対するエージェントの推定が不明瞭になり、最終的には長期的な推論が破綻します。したがって私たちは、メモリの最適化は軌跡レベルの成功だけでなく、中間の要約によって誘導される信念の明瞭さに焦点を当てるべきだと主張します。この目的のために、現在のメモリが与えられたときに、潜在的なタスク状態についてモデルがどの程度不確実なままであるかを探る自己教師ありプロキシである信念エントロピー (Belief Entropy) を導入します。このプロキシに基づいて、メタ認知メモリポリシー最適化 (MMPO) を提案します。MMPO は、まばらな結果ベースの信号だけに頼るのではなく、高い認識論的不確実性を引き起こす要約に明示的にペナルティを課すことで、きめ細かなメモリ固有の監督信号を提供します。実験の結果、MMPO は多様な長期タスクで既存手法を一貫して上回り、175 万トークンのコンテキストまで拡張しても 97.1% の性能を維持することが示されました。

原文 (English)

Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajectories into compact memory. However, existing approaches typically train these memory policies using outcome-based reinforcement learning, failing to localize where intermediate memory quality degrades. As interactions unfold, ambiguous recursive summaries progressively discard task-relevant information and introduce semantic noise. This exacerbates belief deviation, obscuring the agent's estimate of the latent task state and ultimately derailing long-horizon reasoning. We therefore argue that memory optimization should focus not merely on trajectory-level success, but on the clarity of the belief induced by intermediate summaries. To this end, we introduce Belief Entropy, a self-supervised proxy that probes how uncertain the model remains about the latent task state given its current memory. Based on this proxy, we propose Metacognitive Memory Policy Optimization (MMPO). Instead of relying only on sparse outcome-based signals, MMPO provides fine-grained, memory-specific supervision via explicitly penalizing summaries that induce high epistemic uncertainty. Experiments show that MMPO consistently outperforms existing methods on diverse long-horizon tasks, maintaining 97.1% performance even when scaled to 1.75M-token contexts.
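To make the idea of penalizing uncertain summaries concrete, here is a minimal sketch. It is an assumption-laden illustration, not MMPO itself: the abstract does not say how Belief Entropy is estimated or how it enters the objective. The sketch assumes the belief over latent task states is available as a categorical distribution after each summary, uses plain Shannon entropy, and subtracts a weighted mean entropy from the outcome reward. `belief_entropy`, `shaped_reward`, and `penalty_weight` are hypothetical names.

```python
import math
from typing import Sequence


def belief_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a categorical belief, normalized first."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("belief must have positive mass")
    entropy = 0.0
    for p in probs:
        q = p / total
        if q > 0:
            entropy -= q * math.log(q)
    return entropy


def shaped_reward(
    outcome_reward: float,
    step_beliefs: Sequence[Sequence[float]],
    penalty_weight: float = 0.1,  # hypothetical trade-off coefficient
) -> float:
    """Outcome reward minus a penalty on the mean per-summary belief entropy."""
    if not step_beliefs:
        return outcome_reward
    mean_entropy = sum(belief_entropy(b) for b in step_beliefs) / len(step_beliefs)
    return outcome_reward - penalty_weight * mean_entropy


if __name__ == "__main__":
    sharp = [[0.97, 0.01, 0.01, 0.01]]
    vague = [[0.25, 0.25, 0.25, 0.25]]
    print(shaped_reward(1.0, sharp), shaped_reward(1.0, vague))
```

Two trajectories with the same final outcome then receive different rewards when one of them passed through summaries that left the task state ambiguous, which is the kind of memory-specific signal the abstract describes.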

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

局所的には整合的、全体的には非整合的: 複数コンポーネントの LLM エージェントにおける構成的インコヒーレンスの限界付け

複数コンポーネントの LLM エージェントは、それぞれが共同問題の一部しか見ていないコンポーネントから確率的な主張を組み立てます。すべてのコンポーネントが局所的に整合的 (コヒーレント) であっても、その合成は確率の基本公理に違反する可能性があります。私たちは、この「局所的には整合的だが全体的には非整合的」な障害を、構成残差 eps* によって形式化します。eps* は、合成された提示値 (quote) から結合コヒーレントポリトープまでの L2 距離であり、システム出力と宣言されたコンポーネント間の結合制約から実行時に計算できます。積構造の二分法は、局所的なコヒーレンスだけで十分となる場合を特徴付け、レイリー商による予測は 4 つの関係クラスのうち 3 つで観測された残差と 7% 以内で一致します。階層的な Boyle-Dykstra 射影は合成を決定論的に修復し、anytime-valid な e-process は逐次的なコヒーレンス監視を可能にします。4 つの LLM からなる中位モデルパネル (フロンティアモデルパネルでの再実行はセクション 5.5) 上の 1,876 個のアンサンブルクリーク全体で、クリークの 33 ～ 94% で eps* > 0 となりました。これは、比例配分ルールの下で、解決済みの 1,770 件のベットに対してベットあたり +0.115 ナットのリグレットに相当します (ベッター自身が整合化を行う場合、このゲインは +0.006 まで縮小します)。直感的な 3 つの LLM 側の緩和策 (検索、パーティションを考慮したプロンプト、アグリゲーター LLM) は、いずれも失敗するか、かえって悪化します。

原文 (English)

Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents

Multi-component LLM agents assemble probabilistic claims from components that each see only part of a joint problem; the composition can violate basic probability axioms even when every component is locally coherent. We formalise this locally coherent, globally incoherent failure via the compositional residual eps*, the L2 distance from the composed quote to the joint coherent polytope, computable at runtime from system output and the declared cross-component coupling constraints. A product-structure dichotomy characterises when local coherence suffices, and a Rayleigh-quotient prediction matches the observed residual within 7% on three of four relation classes. A hierarchical Boyle-Dykstra projection repairs the composition deterministically; an anytime-valid e-process gives sequential coherence monitoring. Across 1,876 ensemble cliques on a four-LLM mid-tier panel (frontier-panel rerun in Section 5.5), eps* > 0 on 33-94% of cliques, translating to +0.115 nats per bet of regret on 1,770 resolved bets under the proportional allocation rule (the gain collapses to +0.006 under bettors that themselves coherentise). Three intuitive LLM-side mitigations (retrieval, partition-aware prompting, aggregator-LLM) each fail or regress.
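A toy instance may help show what the residual measures. The sketch below is not the paper's hierarchical procedure: it is plain two-set Dykstra alternating projection, and the scenario is an assumption made for illustration. Two components quote P(A) = 0.7 and P(B) = 0.6, each a valid probability on its own, while a declared coupling constraint says A and B are mutually exclusive, so P(A) + P(B) must not exceed 1. The coherent set is then the unit box intersected with one half-space, and eps* is the L2 distance from the composed quote to that set.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def project_box(x: Sequence[float]) -> Vector:
    """Project onto [0, 1]^n: each coordinate is a valid probability."""
    return [min(1.0, max(0.0, v)) for v in x]


def project_halfspace(x: Sequence[float], a: Sequence[float], b: float) -> Vector:
    """Project onto {z : a . z <= b}."""
    excess = sum(ai * xi for ai, xi in zip(a, x)) - b
    if excess <= 0:
        return list(x)
    norm_sq = sum(ai * ai for ai in a)
    return [xi - excess * ai / norm_sq for xi, ai in zip(x, a)]


def dykstra(point: Sequence[float], a: Sequence[float], b: float,
            iterations: int = 200) -> Vector:
    """Dykstra's algorithm for the box intersected with one half-space."""
    x = list(point)
    p = [0.0] * len(x)  # correction term for the box
    q = [0.0] * len(x)  # correction term for the half-space
    for _ in range(iterations):
        xp = [xi + pi for xi, pi in zip(x, p)]
        y = project_box(xp)
        p = [v - w for v, w in zip(xp, y)]
        yq = [yi + qi for yi, qi in zip(y, q)]
        x = project_halfspace(yq, a, b)
        q = [v - w for v, w in zip(yq, x)]
    return x


def compositional_residual(quote: Sequence[float], a: Sequence[float],
                           b: float) -> Tuple[float, Vector]:
    """Return (eps_star, repaired_quote) for the toy coherent set."""
    repaired = dykstra(quote, a, b)
    return math.dist(quote, repaired), repaired


if __name__ == "__main__":
    eps_star, repaired = compositional_residual([0.7, 0.6], [1.0, 1.0], 1.0)
    print(eps_star, repaired)
```

Unlike naive alternating projection, Dykstra's correction terms make the iterates converge to the nearest point of the intersection, which is what a distance-based residual needs.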

2026-05-29 13:00 JST arXiv cs.AI エージェント 研究/論文

必要なのは物理学だけですか?物理学者が監視する科学ソフトウェアの AI 開発のケーススタディ

AI エージェントはツールですか、共著者ですか、それとも研究者ですか? 定量化されたケーススタディ ($N=1$) を紹介します。1 人の物理学者が 12 勤務日、57 セッションにわたって AI コーディングエージェント (Claude Code、Sonnet および Opus モデル) を監督し、JAX による微分可能な 1 ループ摂動論モジュール CLAX-PT を構築しました。私たちは 15 件の監督イベントを記録し、介入レベルごとに分類しました。エージェントは、オラクルテストに対する反復によって 10 件を自律的に解決しました。さらに 2 件は物理学者のドメイン知識によって解決されました。解決できなかった 3 件はいずれもオラクルによる検出をすり抜けたもので、エージェントが症状の軽減を根本原因の解決として扱ったという共通の性質があります。エージェントは 57 セッションのうち 33 セッションを、対象の物理を表現できないコードアーキテクチャ内での係数調整に費やし、再考を促されても CLASS-PT ブランチの選択を再評価できませんでした。再設計のきっかけとなったのは、外部から与えられた物理概念 (異方性 BAO 減衰) だけでした。これとは別に、エージェントは、すべてのオラクルテストに合格するものの理論上のどの量にも対応せず、他のどの宇宙論でも誤った値を予測する較正済みの補正をコミットしました。このファッジファクターは同じセッション内で検出され、置き換えられました。オラクルテストが見逃したものを捉えるうえで、3 つの監督実践が重要であることが分かりました。基準較正点以外の多様なパラメーター点でのテスト、セッションをまたいだ探索の停滞を表面化させる共有変更ログ、そして非物理的な数値パッチを禁じる明示的なルールです。この事例では、エージェントの出力が信頼できるかどうかを決めたのは、モデルの能力ではなく監督の設計でした。このギャップを埋めるには、与えられた構造内で最適化するのではなくアーキテクチャ上の代替案を提案し、予測の妥当性と説明の正しさを区別できるエージェントが必要になります。こうした能力はここでは示されておらず、スケーリングだけで対処できるかどうかも明らかではありません。 [短縮版]

原文 (English)

Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software

Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist supervising an AI coding agent (Claude Code, Sonnet and Opus models) over 12 work days and 57 sessions to build CLAX-PT, a differentiable one-loop perturbation theory module in JAX. We documented and classified 15 supervision events by intervention level. The agent resolved ten autonomously by iterating against oracle tests. Two more by the physicist's domain knowledge. The three it could not -- all evaded oracle detection -- share a common property: the agent treated symptom reduction as root-cause resolution. It spent 33 of the 57 sessions adjusting coefficients within a code architecture that could not represent the target physics, and could not re-evaluate its CLASS-PT branch choice even when prompted to reconsider; only an injected physics concept (anisotropic BAO damping) triggered the redesign. Separately, the agent committed a calibrated correction that passed all oracle tests but corresponded to no quantity in the theory, predicting wrong values at any other cosmology. The fudge factor was caught and replaced within the same session. Three supervision practices proved critical for catching what oracle tests missed: testing at diverse parameter points beyond the fiducial calibration; shared changelogs that surfaced stalled exploration across sessions; and an explicit rule against unphysical numerical patches. In this case, supervision design, not model capability, determined whether the agent's output was trustworthy. Closing the gap would require agents that propose architectural alternatives rather than optimize within a given structure, and distinguish predictive adequacy from explanatory correctness -- capabilities not exhibited here, not obviously addressed by scaling alone. [Abridged.]

2026-05-29 13:00 JST arXiv cs.AI エージェント

S3Mem: 長期対話型質問応答のための構造化時空間シーン-イベント記憶

長期にわたる対話型エージェントは、多くの場合、大規模な軌跡履歴を蓄積しますが、それでも以前のイベントに関する質問に確実に答えることができません。私たちは、主なボトルネックはコンテキスト長だけではなく、長期記憶における軌跡から回答へのインターフェースにあると主張します。履歴がプレーンテキストのチャンクとして保存され、標準の検索拡張生成 (RAG) でクエリされる場合、システムは、特に空間、時間、繰り返しイベント、マルチホップ状態に関する質問について、局所的には関連するが連鎖としては不完全な証拠を取得しがちです。私たちは、長期対話型質問応答 (QA) のための構造化されたシーン-イベント型エピソード記憶フレームワークである S3MEM を提案します。S3MEM は、軌跡を構造化メモリユニットに書き込み、アンカーを考慮した検索によって証拠を取得し、回答時の推論のためにトークン予算を考慮したコンパクトな証拠インターフェースを公開します。この意味で、S3MEM は、エージェントの軌跡をクエリに沿った裏付けへと変換する、構造化された証拠ハーネスです。2 つの内部の主要環境 (Crafter、Jericho) と 2 つのファミリー外環境 (SciWorld、ALFWorld) で S3MEM を評価します。共有の固定された回答時プロトコルの下で、S3MEM は 4 つの環境すべてで一貫して Vanilla RAG を上回り、Crafter、Jericho、ALFWorld では Graph-NoReader を上回り、はるかに少ない証拠トークンを使用しながら SciWorld では同等の性能を示します。最近の手法を適応させた 3 つのベースライン (A-MEM に着想を得たもの、MemoryOS を適応させたもの、LightMem を適応させたもの) は、いくつかの設定で Vanilla RAG を上回りますが、S3MEM の全体的な精度と効率のフロンティアに匹敵するものはありません。全体として、得られた証拠は限定的な結論を支持しています。すなわち、現在の固定された回答時プロトコルの下では、構造化された書き込みとアンカーを考慮した証拠ルーティングは、より汎用的なメモリインターフェースよりも、長期対話型 QA において優れた精度と効率のフロンティアを提供します。

原文 (English)

S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering

Long-horizon interactive agents often accumulate large trajectory histories yet still fail to answer questions about earlier events reliably. We argue that the main bottleneck is not context length alone, but the trajectory-to-answer interface of long-term memory. When histories are stored as plain-text chunks and queried with standard retrieval-augmented generation (RAG), systems often retrieve locally relevant but chain-incomplete evidence, especially for spatial, temporal, repeated-event, and multi-hop state questions. We propose S3MEM, a structured scene-event episodic memory framework for long-horizon interactive question answering (QA). S3MEM writes trajectories into structured memory units, retrieves evidence through anchor-sensitive retrieval, and exposes a compact token-budget-aware evidence interface for answer-time inference. In this sense, S3MEM is a structured evidence harness that converts agent trajectories into query-aligned support. We evaluate S3MEM on two internal headline environments (Crafter, Jericho) and two out-of-family environments (SciWorld, ALFWorld). Under a shared frozen answer-time protocol, S3MEM consistently outperforms Vanilla RAG across all four environments, surpasses Graph-NoReader on Crafter, Jericho, and ALFWorld, and matches it on SciWorld while using dramatically fewer evidence tokens. Three adapted recent baselines -- A-MEM-inspired, MemoryOS-adapted, and LightMem-adapted -- improve over Vanilla RAG in several settings, but none matches S3MEM's overall accuracy-efficiency frontier. Overall, the evidence supports a bounded conclusion: under the current frozen answer-time protocol, structured writing and anchor-sensitive evidence routing provide a stronger accuracy-efficiency frontier for long-horizon interactive QA than more generic memory interfaces.

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

GenesisFunc: 正確かつ一般化可能な関数呼び出しのためのマルチエージェントデータ生成

大規模言語モデル (LLM) は、関数呼び出し (FC) を通じて機能を拡張します。FC は、高品質、多様性、および幅広いシナリオをカバーするトレーニングデータに依存します。ただし、実際の関数呼び出しデータを取得して注釈を付けるのは困難ですが、既存のパイプラインからの合成データは、多くの場合、信頼性の低い API、限られたツールのスケーラビリティ、不十分な多様性、弱い品質管理に悩まされます。これらに対処するために、FC トレーニングデータを生成する自動パイプラインである GenesisFunc を紹介します。当社の GenesisFunc は、広く使用されている公開ベンチマークの信頼性の高いツールから始まり、マルチエージェントフレームワークを採用して、プロセス全体を通じて多様性と品質の両方を維持しながら、多様なシナリオにわたる会話を生成する対話生成システムをサポートします。データの精度は多段階の評価システムによってさらに強化されます。私たちは合成データセット上で 8B LLM を微調整し、広範な実験を通じて、ドメイン内の FC パフォーマンスとドメイン外の汎化において同様のサイズのオープンソースモデルを上回るパフォーマンスを示し、同時に最新の API ベースのモデルの一部と同等の FC 機能に達していることを示しました。さらに、私たちの方法は、下流のツール間で効果的に拡張できる強力な可能性を示しており、現実世界への適用可能性を強調しています。

原文 (English)

GenesisFunc: Multi-Agent Data Generation for Accurate and Generalizable Function-Calling

Large Language Models (LLMs) extend their capabilities through function-calling (FC), which relies on training data with high quality, diversity, and broad coverage of scenario. However, obtaining and annotating real function-calling data is challenging, while synthetic data from existing pipelines often suffers from unreliable APIs, limited tool scalability, insufficient diversity, and weak quality control. To address these, we present GenesisFunc, an automated pipeline for generating FC training data. Starting from reliable tools in widely used public benchmarks, our GenesisFunc employs a multi-agent framework to support a dialogue generation system that produces conversations spanning diverse scenarios, while maintaining both diversity and quality throughout the process. The accuracy of the data is further reinforced through a multi-stage evaluation system. We fine-tune an 8B LLM on the synthetic dataset and show through extensive experiments that it outperforms similarly sized open-source models in in-domain FC performance and out-of-domain generalization, while reaching FC capabilities comparable to some of the latest API-based models. In addition, our method demonstrates strong potential to scale effectively across downstream tools, underscoring its real-world applicability.

2026-05-29 13:00 JST arXiv cs.AI エージェント

読者を取り残さない: 誰もが理解できるマルチエージェント要約

米国の Plain Writing Act (平易な文章法) は、政府文書を一般の人々が容易に理解できる明確で平易な言葉で提供することを求めていますが、既存の要約システムは、一般読者が抱える多様な言語的・認知的障壁に十分対処できていません。私たちは、小学生の読者、非ネイティブの読者、注意欠陥のある読者という 3 つの代表的な読者グループをシミュレートする、平易な言語による要約のためのマルチエージェントフレームワーク NRLB (No Reader Left Behind) を紹介します。NRLB は、テンプレートベースの計画と読者志向の反復的な改良を組み合わせ、難しい用語、欠落した文脈、分かりにくい文を体系的に検出して解消します。複数のデータセットにわたる評価では、事実の正確さを保ちながら可読性が一貫して向上することが示されました。人間による評価でも NRLB の効果が裏付けられ、アノテーターの選好率は 55% から 76% の範囲でした。これは、情報源に忠実で、かつ一般の人々に広く理解しやすい平易な要約を作成できる NRLB の可能性を示しています。

原文 (English)

No Reader Left Behind: Multi-Agent Summaries Everyone Can Understand

The Plain Writing Act in the United States requires government documents to be accessible in clear and simple language that the general public can easily understand, yet existing summarization systems struggle to address diverse linguistic and cognitive barriers among general readers. We present NRLB (No Reader Left Behind), a multi-agent framework for plain language summarization that simulates three representative reader groups: elementary school student readers, non-native readers, and readers with attention deficits. NRLB combines template-based planning with iterative, reader-oriented refinement, enabling systematic detection and resolution of difficult terms, missing contexts, and confusing sentences. Evaluations across multiple datasets demonstrate consistent improvements in readability while preserving factual accuracy. Human evaluation further validates NRLB's impact, with annotator preference rates ranging from 55% to 76%, highlighting NRLB's potential to produce plain language summaries that are both faithful to the source and broadly accessible to the general public.

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

LLM エージェントの一貫性はどの程度ですか?マルチステップのツール呼び出しパイプラインにおける動作の再現性の測定

ツール呼び出し機能を備えた大規模言語モデル (LLM) エージェントは本番システムへの導入が進んでいますが、同じエージェントは 2 回とも同じように動作するのか、という根本的な信頼性の問題は十分に調べられていません。私たちは、マルチステップのツール呼び出しエージェントにおける動作の一貫性について、体系的な実証研究を紹介します。同一の呼び出しを繰り返したときに、エージェントが同じツールを、同じ順序で、同じ引数で選択するかどうかを測定します。ReAct スタイルのエージェント (検索のみ、フリーテキストのアクション) の一貫性に関する先行研究とは異なり、型付きパラメーターと重大な副作用を伴う、構造化されたツール呼び出しインターフェースという、より豊かな設定を対象とします。

原文 (English)

How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines

Large language model (LLM) agents with tool-calling capabilities are increasingly deployed in production systems, yet a fundamental reliability question remains under-explored: does the same agent behave the same way twice? We present a systematic empirical study of behavioral consistency in multi-step tool-calling agents, measuring whether agents select the same tools, in the same order, with the same arguments, across repeated identical invocations. Unlike prior work on consistency in ReAct-style agents (search-only, free-text actions), we study the richer setting of structured tool-calling interfaces with typed parameters and consequential side effects.
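The three questions in the abstract (same tools, same order, same arguments) suggest a layered check over repeated runs. The sketch below is one possible way to score that, under stated assumptions: each run is recorded as a list of `(tool_name, arguments)` pairs, and consistency at each level is the share of runs that match the most common trajectory at that level. The paper's actual metrics are not given in the abstract, so `consistency_report` and its keys are hypothetical.

```python
import json
from collections import Counter
from typing import Any, Dict, Hashable, List, Sequence, Tuple

ToolCall = Tuple[str, Dict[str, Any]]


def _modal_share(keys: Sequence[Hashable]) -> float:
    """Fraction of runs that agree with the most common key."""
    counts = Counter(keys)
    return counts.most_common(1)[0][1] / len(keys)


def consistency_report(runs: List[List[ToolCall]]) -> Dict[str, float]:
    """Score repeated runs of one prompt at three increasingly strict levels."""
    if not runs:
        raise ValueError("need at least one run")

    # Level 1: which tools were used, ignoring order and arguments.
    tool_sets = [tuple(sorted({name for name, _ in run})) for run in runs]
    # Level 2: the ordered sequence of tool names.
    tool_orders = [tuple(name for name, _ in run) for run in runs]
    # Level 3: ordered names plus canonically serialized arguments.
    exact = [
        tuple((name, json.dumps(args, sort_keys=True)) for name, args in run)
        for run in runs
    ]
    return {
        "same_tools": _modal_share(tool_sets),
        "same_order": _modal_share(tool_orders),
        "same_arguments": _modal_share(exact),
    }


if __name__ == "__main__":
    runs = [
        [("search", {"q": "a"}), ("write", {"path": "x"})],
        [("search", {"q": "a"}), ("write", {"path": "x"})],
        [("write", {"path": "x"}), ("search", {"q": "a"})],
        [("search", {"q": "b"}), ("write", {"path": "x"})],
    ]
    print(consistency_report(runs))
```

Serializing arguments with sorted keys keeps two calls with the same parameters in a different key order from being counted as different.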

2026-05-29 13:00 JST arXiv cs.AI エージェント

Big 2 の不完全情報下でのセルフプレイ強化学習

不完全情報のマルチプレイヤーゲームは、隠れた情報、まばらな報酬、非定常な対戦相手の下でエージェントが行動できるかどうかを試します。私たちは、4 人用の不完全情報カードゲームである Big 2 でこれらの課題を研究します。方策勾配エージェントと価値近似エージェントの制御された比較を可能にする、Big 2 用のセルフプレイ RL フレームワークを開発しました。共通の環境、入力表現、学習予算、評価プロトコルの下で、PPO は、ランダム、貪欲、ヒューリスティックの各 Big 2 対戦相手に対して、モンテカルロ Q 近似、SARSA、Q 学習を上回ります。さらに、適度なエントロピー正則化は方策が過度に決定論的になるのを防ぐことで PPO を改善し、現在の方策によるセルフプレイは、チェックポイントセルフプレイや固定対戦相手での学習よりも、有限予算下で強力なカリキュラムになることが分かりました。これらの結果を総合すると、Big 2 は、不完全情報、マルチプレイヤーインタラクション、遅延報酬、可変の行動集合の下で深層 RL を研究するための、有用な制御された設定であることが示されます。

原文 (English)

Self-Play Reinforcement Learning under Imperfect Information in Big 2

Imperfect-information multiplayer games test whether agents can act under hidden information, sparse rewards, and non-stationary opponents. We study these challenges in Big 2, a four-player imperfect-information card game. We develop a self-play RL framework for Big 2 that enables controlled comparisons between policy-gradient and value-approximating agents. Under a common environment, input representation, training budget, and evaluation protocol, PPO outperforms Monte Carlo Q approximation, SARSA, and Q-learning against random, greedy, and heuristic Big 2 opponents. We further find that moderate entropy regularization improves PPO by preventing the policy from becoming overly deterministic, and that current-policy self-play provides a stronger finite-budget curriculum than checkpoint self-play or fixed-opponent training. Together, these results show that Big 2 is a useful controlled setting for studying deep RL under imperfect information, multiplayer interaction, delayed rewards, and variable action sets.

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント 研究/論文

LogDx-CI: LLM 根本原因診断のためのログ削減ツールのベンチマーク

CI の失敗ログは大きく (このコーパスでは中央値 5k 行、最大 200k 行)、ノイズも多く含みます。これをデバッグしようとするコーディングエージェントは、ログを扱いやすいコンテキストに削減する上流ツールに依存していますが、どの削減方法が下流の LLM 診断に十分な証拠を保持するのかについて、公開された実証的比較はこれまでありませんでした。私たちは LogDx-CI を紹介します。これは、11 個のコンテキスト削減ツール (raw、tail、grep、3 つの RTK モード、2 つの実際の LLM マップリデュース要約器、3 つのハイブリッドルーター) を 35 件の実際の GitHub Actions 失敗ケースで比較するベンチマークで、3 つの LLM デバッガーファミリー (Claude Haiku 4.5、Claude Sonnet 4.6、OpenAI gpt-5-mini) と、Sonnet 4.6 を用いたツール使用エージェントによってスコア付けされます。結論を支える 3 つの主要な知見を報告します。(1) ハイブリッド grep+tail ルーターが、コストと品質のパレートフロンティアを支配します。上位 2 手法のスコアは 0.670 / 0.666、コストはケースあたり約 $0.03 で、単独の grep とほぼ同等の品質を 4.5 分の 1 のトークンで達成します。(2) エージェントループ方式では、削減ツール間の品質の幅が約 7 分の 1 に縮小します (シングルショットでの幅 0.42 → エージェントループでの幅 0.059)。エージェントは、追加のツール呼び出しによって弱いコンテキストを補います。ただし、コストの差は残ります。コンテキストが弱いと、エージェントは回復のために 2 ～ 4 倍のツール呼び出しを発行しなければなりません。(3) ファミリーをまたぐ LLM 要約ペア (gpt-5-mini 要約器から Claude Haiku デバッガーへ入力) は、4 つの診断器バリアントの平均で同一ファミリーのペアを +0.071 上回り、このタスクにおけるセルフコールバイアス仮説を反証します。gpt-5-mini 要約器はエージェントループで 1 位の手法 (スコア 0.749) でもあり、ケースあたりのツール呼び出しは 0.37 回、リデューサーのコストは Haiku 要約器の 10 分の 1 です (ケースあたり $0.18 対 $1.75)。すべてのデータ、コード、ケースごとのバンドル、再現性のためのインフラストラクチャは公開されています。

原文 (English)

LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis

CI failure logs are large (median 5k lines, max 200k in this corpus) and noisy. Coding agents that try to debug them depend on an upstream tool to reduce the log to a manageable context, but the field has had no public empirical comparison of which reductions preserve enough evidence for downstream LLM diagnosis. We introduce LogDx-CI, a benchmark that compares 11 context-reduction tools (raw, tail, grep, three RTK modes, two real LLM map-reduce summarizers, three hybrid routers) on 35 real GitHub Actions failure cases, scored by 3 LLM debugger families (Claude Haiku 4.5, Claude Sonnet 4.6, OpenAI gpt-5-mini) plus a Sonnet 4.6 tool-using agent. We report three load-bearing findings. (1) Hybrid grep+tail routers dominate the cost-quality Pareto frontier; the top two methods score 0.670 / 0.666 at about $0.03 per case, same-ballpark quality as standalone grep at 4.5× fewer tokens. (2) In the agent-loop regime, the quality range across reduction tools collapses 7× (single-shot spread 0.42 → agent-loop spread 0.059); the agent rescues weak contexts via follow-up tool calls. However, cost differences persist: weak contexts force the agent to issue 2-4× more tool calls to recover. (3) A cross-family LLM-summary pair (gpt-5-mini summarizer feeding a Claude Haiku debugger) beats the same-family pair by +0.071 averaged across four diagnoser variants, falsifying the self-call-bias hypothesis on this task. The gpt-5-mini summarizer is also the agent-loop #1 method (score 0.749) at 0.37 tool calls per case and 10× lower reducer cost than the Haiku summarizer ($0.18 vs $1.75 per case). All data, code, per-case bundles, and reproducibility infrastructure are public.
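As a rough picture of what a grep-plus-tail reduction does, the sketch below keeps the last few lines of a log together with a small window around every line that matches an error pattern. It is a simple union of the two strategies written for illustration, not the benchmark's hybrid routers, whose routing logic the abstract does not describe. The regular expression, the default sizes, and the function name `grep_tail_reduce` are all assumptions.

```python
import re

# Hypothetical error pattern; a real reducer would likely tune this per CI system.
ERROR_PATTERN = re.compile(r"error|fail|exception|traceback", re.IGNORECASE)


def grep_tail_reduce(
    log_text: str,
    tail_lines: int = 50,
    context: int = 2,
    max_lines: int = 200,
) -> str:
    """Keep the log tail plus a context window around each matching line."""
    lines = log_text.splitlines()
    n = len(lines)

    # Tail: the end of a CI log usually holds the final failure summary.
    keep = set(range(max(0, n - tail_lines), n))

    # Grep: add a small window around every line that looks like an error.
    for i, line in enumerate(lines):
        if ERROR_PATTERN.search(line):
            keep.update(range(max(0, i - context), min(n, i + context + 1)))

    # Enforce the budget by preferring the latest lines, then restore order.
    selected = sorted(keep)[-max_lines:]
    # Prefix original 1-based line numbers so a diagnoser can cite locations.
    return "\n".join(f"{i + 1}: {lines[i]}" for i in selected)


if __name__ == "__main__":
    demo = [f"step {i}" for i in range(100)]
    demo[10] = "ERROR: boom"
    print(grep_tail_reduce("\n".join(demo), tail_lines=3, context=1))
```

Keeping original line numbers in the reduced context is a design choice that lets a follow-up tool call fetch more of the raw log around a cited line.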

2026-05-29 13:00 JSTarXiv cs.AIエージェント

AIRGuard: ランタイム権限制御によるエージェントアクションの保護

ツールを使用する言語エージェントは、モデルの決定を外部への副作用に変えます。ファイルを読み、スクリプトを実行し、API を呼び出し、メッセージを送信し、Model Context Protocol ツールを呼び出すからです。このため、エージェントへの攻撃はジェイルブレイクとは性質が異なります。有害なステップは、明らかに禁止された出力ではなく、通常の実行可能なアクションであることが多く、攻撃者が制御するコンテキストが、許可されたアクセスをユーザーの利益に反する方向へ誘導することで安全でなくなります。私たちはこの失敗モードを「権限の混同 (authority confusion)」と呼びます。信頼できないリソースは推論の材料にはなり得ますが、副作用を許可する根拠になってはなりません。本論文では、最小特権をアクション実行時の認可として運用するランタイムガード AIRGuard を紹介します。AIRGuard は、異種のツール呼び出しを正規化し、タスクの権限からステップレベルの権限を導出し、ソースとターゲットの信頼を追跡し、機密性の高い副作用をシミュレートし、ステップをまたぐリスクを監査し、アクションの実行前に判定を強制します。AgentTrap では、AIRGuard により Sonnet 4.6 の攻撃成功率が防御なしの 36.3% から 5.5% に低下します。DTAP-150 では、AIRGuard は Haiku 4.5 で 76.0% の良性タスクの有用性 (benign utility) を維持します (ARGUS は 52.0%、MELON は 42.0%)。さらにアブレーションの結果、プロンプトのみのポリシーの効果は限定的であるのに対し、専用のランタイム権限制御層があれば、エージェントシステムはツールを介した副作用を直接制御できることが示されました。コードとデータは https://github.com/Sophie508/AIRGuard で入手できます。

原文 (English)

AIRGuard: Guarding Agent Actions with Runtime Authority Control

Tool-using language agents turn model decisions into external side effects: they read files, run scripts, call APIs, send messages, and invoke Model Context Protocol tools. This makes agent attacks different from jailbreaks. The harmful step is often not an obviously forbidden output, but an ordinary executable action that becomes unsafe because attacker-controlled context steers authorized access against the user's interest. We identify this failure mode as authority confusion: untrusted resources may inform reasoning, but they must not authorize side effects. We present AIRGuard, a runtime guard that operationalizes least privilege as action-time authorization. AIRGuard normalizes heterogeneous tool calls, derives task authority into step-level authority, tracks source and target trust, simulates sensitive side effects, audits cross-step risk, and enforces decisions before actions execute. On AgentTrap, AIRGuard reduces Sonnet 4.6 attack success from 36.3% without defense to 5.5%. On DTAP-150, AIRGuard preserves 76.0% benign utility with Haiku 4.5, compared with 52.0% for ARGUS and 42.0% for MELON. An ablation further shows that prompt-only policy helps only modestly, whereas a dedicated runtime authority-control layer gives the agent system direct control over tool-mediated side effects. Code and data are available at https://github.com/Sophie508/AIRGuard.
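
The rule "untrusted resources may inform reasoning, but they must not authorize side effects" can be pictured as a check that runs before each tool call. The sketch below is only an illustration of that idea under simplifying assumptions; `ToolCall`, `authorize`, and the action names are hypothetical and are not taken from the AIRGuard code base.

```python
from dataclasses import dataclass

# Assumed set of normalized actions that change external state.
SIDE_EFFECT_ACTIONS = {"write_file", "run_script", "send_message", "call_api"}


@dataclass(frozen=True)
class ToolCall:
    action: str           # normalized action name
    target: str           # resource the action touches
    source_trusted: bool  # True if the instruction traces back to the user


def authorize(call, allowed_targets):
    """Return 'allow' or 'deny' before the action executes."""
    if call.action not in SIDE_EFFECT_ACTIONS:
        return "allow"  # read-only steps may draw on untrusted context
    if not call.source_trusted:
        return "deny"   # untrusted content must not authorize a side effect
    if call.target not in allowed_targets:
        return "deny"   # target lies outside this step's authority
    return "allow"
```

In this toy model, a message-sending step requested by text found on a web page is denied even when the recipient is otherwise permitted, because the authorization did not originate from the user.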

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

アインシュタイン望遠鏡のシミュレートされたデータの分析に適用されたエージェント AI の初の直接比較

我々は、2 つの最先端エージェント AI システム、Claude Code (Anthropic) と Codex (OpenAI) に、共有コンピューティングインフラストラクチャ上で、人間の介入なしにシンプルなエンドツーエンドの重力波データ分析パイプラインを自律的に実行させ、両者を比較した結果を報告します。このパイプラインは、アインシュタイン望遠鏡の生のシミュレーションノイズからのパワースペクトル密度推定、幾何学的テンプレートバンクの生成、100 個の連星ブラックホール信号注入の整合フィルターによる回復、結果の自動生成、および Physical Review D のスタイルで整形された原稿の大規模言語モデル支援による作成で構成されます。両エージェントには、同一の仕様書と同一のコンピューティングリソースが与えられました。実験は 2 回実行されました。1 回目は非現実的に大きな振幅の注入信号を用い、2 回目は物理的に妥当な SNR 範囲に再スケーリングした信号を用いました。科学的結果はどちらの実行でも収束しました。ただし、両エージェントの挙動と計算コストは大きく異なりました。Claude Code は仕様からの暗黙の逸脱を伴いつつパイプラインを約 3.4 分で完了したのに対し、Codex は明示的な自己修正による再起動を繰り返し、約 16 分を要しました。この再起動には、依頼されていない整合フィルター内部ループのパフォーマンス最適化も含まれます。自律的に生成された原稿も、長さ、詳細度、品質が異なりました。2 回目の実行では、SNR 範囲に関する指示の解釈の微妙な違いが、実質的な科学的結果の相違につながりました。Claude Code は指示を黙って再解釈したのに対し、Codex は仕様に文字通り従いました。速度と監査可能性、暗黙的なエラー処理と透明なエラー処理、指示の解釈、マルチモデルパイプラインにおける中間データ表現の重要性といった、これらの挙動の違いが、科学技術計算ワークフローへのエージェント AI の導入に与える影響について議論します。

原文 (English)

First head-to-head comparison of agentic AI applied to the analysis of simulated data of the Einstein Telescope

We report a comparison of two state-of-the-art agentic AI systems, Claude Code (Anthropic) and Codex (OpenAI), tasked with autonomously executing a simple end-to-end gravitational wave data analysis pipeline on a shared computing infrastructure without human intervention. The pipeline comprises power spectral density estimation from raw Einstein Telescope simulated noise, geometric template bank generation, matched filter recovery of 100 binary black hole signal injections, automated results generation, and large language model-assisted production of a manuscript formatted in the style of Physical Review D. Both agents received identical written specifications and identical compute resources. The experiment was run twice: a first run with unrealistically loud injections, and a second run with signals rescaled to a physically motivated SNR range. The scientific results converged in both runs. However, the agents exhibited substantially different behaviors and computational costs: Claude Code completed the pipeline in ~3.4 minutes with silent deviations from the specification, while Codex required ~16 minutes across explicit self-correcting restarts, including an unsolicited performance optimization of the matched filter inner loop. The autonomously generated manuscripts also diverged in length, details, and quality. In the second run, a subtle difference in the interpretation of the SNR range instruction led to a genuine scientific divergence: Claude Code silently reinterpreted the instructions, while Codex followed the specification literally. We discuss the implications of these behavioral differences, such as speed versus auditability, silent versus transparent error handling, instruction interpretation, and the criticality of intermediate data representations in multi-model pipelines, for the deployment of agentic AI in scientific computing workflows.

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Beyond Recall: AI パーソナライゼーションの解釈層としての行動仕様

AI エージェントが人に代わって意思決定を行うなら、その意思決定はユーザーと整合していなければなりません。私たちは、システムが人の解釈をどれだけ忠実に捉えているかを測る指標として、表現精度 (representational accuracy) を導入します。解釈層は Behavioral Specification (行動仕様) として具体化されます。私たちのリファレンス実装は、人のデータを解釈パターンへと大幅に圧縮し、言語モデルにコンテキストとして与えます。この仕様を、較正済みの 5 審査員 LLM パネルが採点する、ホールドアウトされた行動予測のプロトタイプベンチマークで評価します。仕様単体でのテストに加え、完全な生コーパス、抽出されたファクト全体、4 つの商用メモリシステム (Mem0、Letta、Supermemory、Zep) といったさまざまなコンテキスト条件と組み合わせたテストも行います。14 のパブリックドメインの自伝的コーパス全体で、この仕様は表現精度を総体として向上させ、モデルのヘッジ (曖昧な回答) をほぼ解消します。また、生コーパスがもたらす効果の大部分を、約 25 分の 1 のコンテキストコストで回復します。この仕様は、事前学習でのベースラインにかかわらず、被験者を共通の予測水準に向けて引き上げます。そのため絶対ポイントでの向上幅はベースラインが最も低いところで最大となり、恩恵を受ける母集団は事前学習で十分に表現されていないすべての人であることが示唆されます。向上幅は解釈を要する質問で最大で、解釈層を与えることで、抽出されたファクトや生コーパスでは実現できないモデルの振る舞いが可能になります。逆に、想起 (リコール) を要する質問では、この層が助けになるどころか妨げになることがあります。私たちは、表現精度は想起とは別物であり、人間と AI の整合はユーザーがどれだけ正確に表現されているかに依存すると結論付けます。表現精度によって、その整合は検証可能になります。

原文 (English)

Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization

If an AI agent makes decisions on a person's behalf, those decisions must align with its user. We introduce representational accuracy to measure how faithfully a system captures a person's interpretation. An interpretive layer is operationalized as a Behavioral Specification. Our reference implementation aggressively compresses a person's data into interpretive patterns, served as context to a language model. We evaluate the Specification on a prototype benchmark of held-out behavioral predictions scored by a calibrated 5-judge LLM panel. We test it independently and in composition with a range of context conditions: full raw corpus, full extracted facts, and four commercial memory systems (Mem0, Letta, Supermemory, Zep). Across 14 public-domain autobiographical corpora, the Specification lifts representational accuracy in aggregate and nearly eliminates model hedging. It recovers most of what the raw corpus delivers, at ~25x less context cost. The Specification lifts subjects toward a common predictive level regardless of pretraining baseline; the lift in absolute points is therefore largest where the baseline is lowest, suggesting the population of relevance is anyone not adequately represented in pretraining. Lift is greatest on interpretation-required questions, where providing an interpretive layer enables model behavior that extracted facts or raw corpus do not. Conversely, on recall-required questions, this layer can interfere rather than help. We conclude that representational accuracy is distinct from recall and that human-AI alignment is dependent on how accurately the user is represented. Representational accuracy makes that alignment testable.

2026-05-29 13:00 JSTarXiv cs.AIエージェント

自動運転における遅延と精度のトレードオフを最適化するための多重解像度エンドツーエンドのディープニューラルネットワーク

レイテンシと精度のトレードオフは、サイバーフィジカルシステム向けディープニューラルネットワーク (DNN) のリアルタイム応用における基本的な課題です。特に自動運転では、安全性は予測の品質と、センシングから作動までのエンドツーエンドの遅延の両方に依存します。私たちは次の 2 点を観察しました。(1) レイテンシを考慮に入れると、レイテンシの観点で最適なネットワーク構成はシーンのコンテキストと利用可能な計算資源によって変わります。(2) 単一の固定解像度モデルは、条件が変化すると最適ではなくなります。本論文では、単眼カメラ入力を用いる CARLA 都市走行チャレンジ向けに、多重解像度のエンドツーエンドディープニューラルネットワークを提案します。私たちのアプローチは、解像度ごとのバッチ正規化によって複数の入力解像度をサポートする畳み込みニューラルネットワーク (CNN) を採用します。これにより、レイテンシバジェットの下で理想的な入力スケールを実行時に選択できるほか、元のトレーニングデータセットにアクセスせずに多重解像度トレーニングを行える解像度リターゲティングも可能になります。この多重解像度エンドツーエンド CNN を CARLA 上に実装して評価し、レイテンシと安全性のフロンティアを探ります。結果は、固定解像度のベースラインと比較して、ルートごとの安全指標 (車線侵入、赤信号違反、衝突) が一貫して改善することを示しています。

原文 (English)

Multi-Resolution End-to-End Deep Neural Network for Optimizing Latency-Accuracy Tradeoff in Autonomous Driving

Latency-accuracy tradeoffs are fundamental in real-time applications of deep neural networks (DNNs) for cyber-physical systems. In autonomous driving, in particular, safety depends on both prediction quality and the end-to-end delay from sensing to actuation. We observe that (1) when latency is accounted for, the latency-optimal network configuration varies with scene context and compute availability; and (2) a single fixed-resolution model becomes suboptimal as conditions change. We present a multi-resolution, end-to-end deep neural network for the CARLA urban driving challenge using monocular camera input. Our approach employs a convolutional neural network (CNN) that supports multiple input resolutions through per-resolution batch normalization, enabling runtime selection of an ideal input scale under a latency budget, as well as resolution retargeting, which allows multi-resolution training without access to the original training dataset. We implement and evaluate our multi-resolution end-to-end CNN in CARLA to explore the latency-safety frontier. Results show consistent improvements in per-route safety metrics - lane invasions, red-light infractions, and collisions - relative to fixed-resolution baselines.
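
Runtime selection of an input scale under a latency budget can be illustrated with a very small policy. The sketch below assumes that larger resolutions are always preferred when they fit the budget, which is a simplification: the abstract says the best configuration also depends on scene context. The function name and the latency figures in the usage note are hypothetical.

```python
def select_resolution(latency_ms_by_resolution, budget_ms):
    """Pick the largest input resolution whose measured latency fits the budget.

    latency_ms_by_resolution maps a resolution (e.g. image height in pixels)
    to a latency measured on the current hardware, in milliseconds.
    """
    feasible = [res for res, ms in latency_ms_by_resolution.items() if ms <= budget_ms]
    if not feasible:
        # Nothing fits: fall back to the smallest resolution available.
        return min(latency_ms_by_resolution)
    return max(feasible)
```

With a made-up profile such as `{128: 8.0, 256: 15.0, 512: 40.0}` and a 20 ms budget, the policy selects 256; when compute becomes scarce and the budget drops to 5 ms, it falls back to 128.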

2026-05-29 13:00 JSTarXiv cs.AIエージェント

SafeRx-Agent: 安全で説明可能な投薬推奨のための知識に基づいたマルチエージェントフレームワーク

薬剤の推奨は患者の来院時の薬剤を予測しますが、既存の方法では依然として 2 つの重要な課題に直面しています。モデルレベルでは、従来の医薬品推奨方法は限られた根拠に基づいて構造化された医薬品コードを予測するだけですが、LLMエージェントはより豊富な臨床コンテキストを使用できますが、安全性の検証とトレーサビリティが欠けている可能性があります。タスクレベルでは、既存のベンチマークは広範な薬剤カテゴリーを使用することが多く、サブグループレベルの安全性の違いが無視され、リスクの過大評価につながる可能性があります。第 4 レベルの ATC コード生成に基づいた、最初のきめ細かい薬剤推奨設定を導入します。私たちは、患者の状況、外部の臨床知識、安全性検証を使用して追跡可能な薬剤セットを推奨する、知識に基づいたマルチエージェントフレームワークである Safe Prescription Agent (SafeRx-Agent) を提案します。 MIMIC-III および MIMIC-IV データセットに関する実験結果は、SafeRx-Agent が薬物相互作用、禁忌、および薬物セットのサイズを制御しながら、きめ細かい薬物予測の精度を向上させることを示しています。

原文 (English)

SafeRx-Agent: A Knowledge-Grounded Multi-Agent Framework for Safe and Explainable Medication Recommendation

Medication recommendation predicts medications for patient visits, but existing methods still face two key challenges. At the model level, traditional drug recommendation methods only predict structured drug codes with limited evidence grounding, while LLM agents can use richer clinical context but may lack safety verification and traceability. At the task level, existing benchmarks often use broad medication categories, which ignore subgroup-level safety differences and can lead to risk overestimation. We introduce the first fine-grained medication recommendation setting based on fourth-level ATC code generation. We propose Safe Prescription Agent (SafeRx-Agent), a knowledge-grounded multi-agent framework that uses patient context, external clinical knowledge, and safety verification to recommend traceable medication sets. Experimental results on MIMIC-III and MIMIC-IV datasets show that SafeRx-Agent improves fine-grained medication prediction accuracy while controlling drug interactions, contraindications, and medication set size.

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents

AI agents augment large language models with external tools such as web retrieval, enabling grounded and up-to-date responses. However, inc…

2026-05-29 13:00 JSTarXiv cs.AIエージェントビジネス/資金調達

DynSess: Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents

Role-playing with large language models is fundamentally a session-level task, requiring agents to sustain character identity and interacti…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

GrepSeek: Training Search Agents for Direct Corpus Interaction

Large Language Model (LLM) search agents have shown strong promise for knowledge-intensive language tasks through multiple rounds of reason…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

SURGENT: A Surgical Multi-Agent Assistance System Across the Perioperative Workflow

The intricate nature of modern surgical care necessitates intelligent systems that can synthesize extensive patient records, support collab…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

SkillBrew: Multi-Objective Curation of Skill Banks for LLM Agents

Retrieval-augmented LLM agents increasingly rely on curated skill banks: collections of reusable textual principles that guide decision mak…

2026-05-29 13:00 JSTarXiv cs.AIエージェント研究/論文

How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions

AI coding agents increasingly act directly within software environments, yet existing analyses of their failures rely on benchmark trajecto…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Honest Lying: Understanding Memory Confabulation in Reflexive Agents

Reflexion-style agents rely on self-generated reflections as memory, implicitly assuming that agents can accurately diagnose their own fail…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

PhoneWorld: Scaling Phone-Use Agent Environments

A central bottleneck for phone-use agents is that controllable, reproducible environments covering real mobile behavior are hard to build a…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Network Optimization Aspects of Autonomous Vehicles: Challenges and Future Directions

Global megatrends, such as urbanization, population growth, and emerging network solutions are accelerating the development of the Connecte…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェントビジネス/資金調達

GUITestScape: Towards Open-set Evaluation on Exploratory GUI Testing

Exploratory GUI testing is a particularly demanding setting for MLLM agents: without predefined test scripts, an agent must autonomously na…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Training Deliberative Monitors for Black-Box Scheming Detection

As autonomous agents become more capable of performing real-world tasks, distinguishing scheming behavior from benign task pursuit may beco…

2026-05-29 13:00 JSTarXiv cs.AIエージェント研究/論文

Entity-Collision: A Stratified Protocol for Attributing Retrieval Lift in Agent Memory

End-to-end agent-memory benchmarks report a single hit@k per retriever, confounding lexical leakage (uncontrolled query/gold/distractor ent…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems

LLM-based multi-agent systems (MAS) have emerged as an effective paradigm for complex and long-horizon tasks. However, in real-world tasks,…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント研究/論文

Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, w…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension

Do next-generation LLM agents inherit the cooperative biases documented in their predecessors, or does scale and provider diversity reshape…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Agora: Toward Autonomous Bug Detection in Production-Level Consensus Protocols with LLM Agents

Consensus protocols form the backbone of distributed systems and blockchains, where implementation bugs can cause data corruption and finan…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents

Despite recent advances, LLM-based web agents still struggle with limited exploration, omission of critical steps, and sensitivity to task…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

CityGen: Structure-Guided City-Style Synthesis for Cross-City Autonomous Driving

Autonomous driving systems are commonly trained and evaluated within limited geographic regions, which hinders their scalability when deplo…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction

Large language model (LLM) agents increasingly leverage long term memory to support persistent and autonomous task execution. However, this…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント研究/論文

Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas

We study two-level autoresearch for cooperation: an outer-loop AI agent autonomously redesigns the inner-loop pipeline of an LLM policy-syn…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

How Reliable Are AI Attackers Against a Fixed Vulnerable Target? A 400-Run Empirical Study of LLM Penetration Testing Consistency

Large language models (LLMs) can autonomously conduct multi-stage cyber attacks, but the consistency of their offensive behavior under repe…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems

The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud and offer…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Do Proactive Agents Really Need an LLM to Decide When to Wake and What to Anchor?

Proactive agents read user activity as text and call an LLM on every event to decide whether to act. But user activity is not natively text…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms

As autonomous language model agents proliferate, forming an emerging agentic web with real-world consequences, what credibility signals can…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization

While Multi-Agent Systems (MAS) empower Large Language Models to tackle complex reasoning tasks through collaborative interaction, optimizi…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection

Document-level translation remains one of the most challenging tasks for large language models, which are constrained by limited context wi…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Gram: Assessing sabotage propensities via automated alignment auditing

We introduce Gram, an automated alignment auditing framework to assess the propensity of AI agents to engage in sabotage. We evaluate Gemin…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

PersonaAgent: Bridging Memory and Action for Personalized LLM Agents

Large Language Model (LLM) empowered agents have recently emerged as advanced paradigms that exhibit impressive capabilities in a wide rang…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Estimating the Empowerment of Language Model Agents

As language model (LM) agents become increasingly capable and adopted in real-world applications, there is a growing need for scalable eval…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

SafeSearch: Automated Red-Teaming of LLM-Based Search Agents

Search agents connect LLMs to the Internet, enabling them to access broader and more up-to-date information. However, this also introduces…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

CodeEvolve: an open source evolutionary coding agent for algorithmic discovery and optimization

We introduce CodeEvolve, an open-source framework that couples large language models with island-based evolutionary search for end-to-end a…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Graph-Enhanced Policy Optimization in LLM Agent Training

Multi-step LLM agents in interactive environments represent a crucial step toward long-horizon decision-making. To train such agents, group…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント研究/論文

InsightEval: An Expert-Curated Benchmark for Assessing Insight Discovery in LLM-Driven Data Agents

Data analysis has become an indispensable part of scientific research. To discover the latent knowledge and insights hidden within massive…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

SCOPE: Prompt Evolution for Enhancing Agent Effectiveness

Large Language Model (LLM) agents are increasingly deployed in environments that generate massive, dynamic contexts. However, a critical bo…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Reasoning and Tool-use Compete in Agentic RL: From Quantifying Interference to Disentangled Tuning

Agentic Reinforcement Learning (ARL) trains large language models to interleave reasoning with external tool execution to solve complex tas…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Skill-Pro: Learning Reusable Skills from Experience via Non-Parametric PPO for LLM Agents

LLM-driven agents excel at sequential decision-making but often rely on on-the-fly reasoning, re-deriving solutions even in recurring scena…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

AutoSizer: Automatic Sizing of Analog and Mixed-Signal Circuits via Large Language Model (LLM) Agents

The design of Analog and Mixed-Signal (AMS) integrated circuits remains heavily reliant on expert knowledge, with transistor sizing a major…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Small Agent Group is the Future of Digital Health

The rapid adoption of large language models (LLMs) in digital health has been driven by a "scaling-first" philosophy, i.e., the assumption…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information from individual…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models

Reinforcement learning (RL) shows promise for enhancing LLM agentic reasoning, yet sparse terminal rewards hinder fine-grained optimization…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

MemCollab: Cross-Model Memory Collaboration via Contrastive Trajectory Distillation

LLM agents increasingly rely on memory mechanisms to reuse knowledge from past problem-solving experiences. However, existing methods typic…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

MediHive: A Decentralized Agent Collective for Medical Reasoning

Large language models (LLMs) have revolutionized medical reasoning tasks, yet single-agent systems often falter on complex, interdisciplina…

2026-05-29 13:00 JSTarXiv cs.AIエージェントビジネス/資金調達

IntentScore: コンピュータ使用エージェントの意図条件付きアクションの評価

Computer-Use Agent (CUA) は、大規模な言語モデルを利用してデスクトップ環境で GUI 操作を実行しますが、アクションの品質を評価せずにアクションを生成するため、後続のステップに連鎖的に発生する不可逆的なエラーにつながります。私たちは、3 つのオペレーティングシステムにわたる 398K のオフライン GUI インタラクションステップから候補アクションをスコアリングすることを学習する、プランを認識した報酬モデルである IntentScore を提案します。 IntentScore は、状態とアクションの関連性に関する対照的な調整と、アクションの正しさに関するマージンランキングという 2 つの相補的な目標を使用してトレーニングします。アーキテクチャ的には、各候補者の計画意図がアクションエンコーダーに埋め込まれ、同様のアクションを持つ候補者間で論理的根拠が異なるものを区別できるようになります。 IntentScore は、ホールドアウト評価で 97.5% のペア識別精度を達成します。トレーニング中にまったく見えない環境である OSWorld 上のエージェント S3 の再ランカーとしてデプロイされた IntentScore は、タスクの成功率を 6.9 ポイント向上させ、異種のオフライン軌跡から学習した報酬推定が、目に見えないエージェントとタスクの分布に一般化されることを示しています。

原文 (English)

IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents

Computer-Use Agents (CUAs) leverage large language models to execute GUI operations on desktop environments, yet they generate actions without evaluating action quality, leading to irreversible errors that cascade through subsequent steps. We propose IntentScore, a plan-aware reward model that learns to score candidate actions from 398K offline GUI interaction steps spanning three operating systems. IntentScore trains with two complementary objectives: contrastive alignment for state-action relevance and margin ranking for action correctness. Architecturally, it embeds each candidate's planning intent in the action encoder, enabling discrimination between candidates with similar actions but different rationales. IntentScore achieves 97.5% pairwise discrimination accuracy on held-out evaluation. Deployed as a re-ranker for Agent S3 on OSWorld, an environment entirely unseen during training, IntentScore improves task success rate by 6.9 points, demonstrating that reward estimation learned from heterogeneous offline trajectories generalizes to unseen agents and task distributions.

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Guardrails Beat Guidance: A Large-Scale Study of Rules, Skills, and Persistent Configuration for Coding Agents

Random rules improve a coding agent's task performance as much as expert-curated ones (both $+13.8$pp on a discriminative subset of SWE-ben…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Human-Guided Harm Recovery for Computer Use Agents

As LM agents gain the ability to execute actions on real computer systems, we need ways to not only prevent harmful actions at scale but al…

2026-05-29 13:00 JSTarXiv cs.AIエージェントビジネス/資金調達

SciHorizon-DataEVA: An Agentic System for AI-Readiness Evaluation of Heterogeneous Scientific Data

AI-for-Science (AI4Science) is increasingly transforming scientific discovery by embedding machine learning models into prediction, simulat…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェントビジネス/資金調達

CausaLab: AI 科学者向けのインタラクティブな因果発見のためのスケーラブルな環境

LLM エージェントによるインタラクティブな因果発見を評価するためのスケーラブルな環境である CausaLab を紹介します。以前の評価とは異なり、CausaLab では、エージェントが因果関係の証拠を使用して問題を解決できるかどうか、およびその答えが根底にある因果メカニズムに関する正しい仮説によって裏付けられているかどうかの両方を評価します。各エピソードではエージェントが合成実験室に配置されます。エージェントは以前の測定記録を受け取り、マニピュレーター結晶に介入し、同じ機構によって支配される保持されたリアクター結晶の共振周波数を予測します。隠されたデータ生成プロセスは、ランダムにサンプリングされた構造因果モデル (SCM) であるため、成功するには、事前の知識を思い出すのではなく、因果グラフと構造方程式の両方を回復する必要があります。 CausaLab には、エージェントの進化する SCM 仮説を記録するドメイン固有の言語も含まれており、軌跡を検査可能にしてグラウンドトゥルースと比較できるようになります。実験では、予測とメカニズム回復の間に永続的なギャップがあることが示されています。純粋に観測的な 6 ノード設定では、GPT-5.2-high はタスク精度 92% に達しますが、オールエッジ $F_1$ はわずか 0.471 です。この観察は、さまざまな相互作用戦略の探求をさらに動機づけます: 混合観察 - 介入戦略は構造忠実度を向上させます: 混合 6 ノード設定では、GPT-5.2-high はタスク精度とオールエッジ $F_1$ の両方で 80% を達成しました。しかし、純粋な介入戦略はタスクの精度とオールエッジ $F_1$ の両方においてパフォーマンスが低いため、強力なエージェントですら有益な介入を設計するのに苦労しています。私たちは、エージェントの主要な弱点として早期停止を特定し、仮説と過去のデータとの間の一貫性をモデルに検証するように依頼することが、この問題の軽減に役立つことを示します。したがって、CausaLab は予測の成功を因果関係の理解から切り離し、実験的因果推論者としての現在の LLM エージェントの限界を明らかにします。

原文 (English)

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithful recovered causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge $F_1$. Mixed observation-intervention strategies improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners.
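
To make the gap between task accuracy and structural recovery concrete, the sketch below computes a standard F1 score over directed edges of a predicted causal graph against the ground truth. This is the textbook definition and an assumption on our part; the paper's exact "all-edge $F_1$" may be defined differently, and `edge_f1` is a hypothetical name.

```python
def edge_f1(predicted, truth):
    """F1 over directed edges, each given as a (cause, effect) tuple."""
    predicted, truth = set(predicted), set(truth)
    if not predicted and not truth:
        return 1.0  # both graphs are empty, so they agree
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)
```

An agent can predict the held-out quantity well while recovering few edges: for a true graph with edges A→B, B→C, A→C, a hypothesis containing A→B and a reversed C→B has precision 1/2 and recall 1/3, which gives an F1 of 0.4.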

2026-05-29 13:00 JSTarXiv cs.AIエージェント

SIA: ハーネスとウェイトのアップデートによる自己改善型 AI

AI の構築と改善におけるボトルネックは人間です。モデルも、それをラップするエージェントも、人の手で作成、調整、修正されています。自らを改善する方法を見つけ出せる AI という長期的な目標は、依然として未解決です。ほとんど交わりのない 2 つの研究系統が、このボトルネックに取り組んでいます。ハーネス更新派は、モデルの重みを固定したまま、メタエージェントにタスク固有エージェントのスキャフォールド (ツール、プロンプト、再試行ロジック、探索手順) を書き換えさせます。テスト時トレーニング派は、ハーネスを固定したまま、手書きの RL パイプラインを使ってタスクのフィードバックでモデル自体の重みを更新します。これら 2 つのサイロは互いに独立して動いています。我々は、言語モデルエージェント (Feedback-Agent) がタスク固有エージェントのハーネスと重みの両方を更新する自己改善ループ SIA を提案します。中国法の罪名分類、低レベルの GPU カーネル最適化、単一細胞 RNA のノイズ除去という、対照的な 3 つのドメインで評価します。両方のレバーを組み合わせると、3 つのベンチマークすべてでスキャフォールドの反復のみの場合を上回ります。初期ベースラインと比較して、LawBench では 56.6%、GPU カーネルではランタイムが 91.9%、ノイズ除去では 502% の向上が見られます。ハーネスの更新はモデルをエージェント的にし、探索と行動の仕方を形作ります。一方、重みの更新は、プロンプトやスキャフォールドでは与えられないドメインの直感を築きます。

原文 (English)

SIA: Self Improving AI with Harness & Weight Updates

Humans are the bottleneck in building and improving AI. Both the models and the agents that wrap them are written, tuned, and corrected by people. The long-horizon goal of an AI that can figure out how to improve itself remains open. Two largely disjoint research lines attack this bottleneck. The harness-update school has a meta-agent rewrite the scaffold of a task-specific agent (its tools, prompts, retry logic, and search procedure) while the model weights are held fixed. The test-time training school uses hand-written RL pipelines to update the model's own weights on task feedback while the harness is held fixed. These two silos operate in isolation. We propose SIA, a self-improving loop in which a language-model agent (the Feedback-Agent) updates both the harness and the weights of a task-specific agent. We evaluate across three contrasting domains: Chinese legal charge classification, low-level GPU kernel optimisation, and single-cell RNA denoising. Combining both levers outperforms scaffold iteration alone on all three benchmarks. SIA-W+H achieves 25.1% over prior SOTA on LawBench, 12.4% faster GPU kernels than prior SOTA (1,017 vs 1,161 {\mu}s), and 20.4% over prior SOTA on denoising. Harness updates make the model agentic, shaping how it searches and acts, while weight updates build the domain intuition that no prompt or scaffold can instil.

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェントビジネス/資金調達研究/論文

FundaPod: AI 支援のファンダメンタル投資調査のためのナレッジグラフメモリを備えたマルチペルソナエージェントポッドプラットフォーム

大規模言語モデル (LLM) は金融分野での適用が増えていますが、既存研究の多くは、取引シグナルや、予測を中心とした金融 NLP タスクに重点を置いています。これに対し、機関投資家のファンダメンタルズ調査では、人間のアナリストまたは AI エージェントが証拠を収集し、事業のドライバーを特定し、対立する見解を比較し、投資メモを作成する必要があります。その広い目標は、単に結果を予測することではなく、透明で再利用可能かつ検証可能な投資計画を作成し、投資知識の累積的な発展に貢献することです。本論文では、AI 支援のファンダメンタルズ投資調査のためのマルチペルソナエージェントプラットフォーム FundaPod を紹介します。私たちは、ファンダメンタルズ調査は人間中心の意思決定支援タスクであり、取引シグナルの生成とは質的に異なるため、独立性を保つアーキテクチャのほうが適していると主張します。FundaPod では、バリュー投資家やマクロストラテジストなど異なるペルソナを持つ AI エージェントが、共有の来歴 (provenance) 契約の下で独立して調査を行います。その後、エージェント間の見解の相違は、ナレッジグラフメモリシステムを通じて事後的に提示され、人間のポートフォリオマネージャー (PM) が裁定します。本論文は、デザインサイエンスの実践と、認知的分離および人間と機械の協調に関する理論に基づき、ファンダメンタルズ調査を支援する人間と AI のハイブリッドシステムのための 5 つの設計原則を提示します。また、4 つのアーキテクチャ上の仕組みについても説明します。公開されている投資家の資料を展開可能なエージェントに変換するペルソナ蒸留パイプライン、プランナーが型付きタスクグラフを導出できるようにする宣言的スキルレジストリ、メモ内の主張を検証可能な情報源に結び付ける根拠付き証拠モデル、そしてティッカー、メモ、アナリスト、テーマを結び付けるナレッジグラフの「第二の脳」です。完全なケーススタディと、ペルソナに基づくメモの比較を通じて、このアーキテクチャを実証します。

原文 (English)

FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research

Large language models (LLMs) are increasingly applied in finance, yet most existing work emphasizes trading signals or financial NLP tasks centered on prediction. Institutional fundamental research, by contrast, requires human analysts or AI agents to gather evidence, identify business drivers, compare competing viewpoints, and generate investment memos. Its broader goal is not merely to predict outcomes, but to produce investment plans that are transparent, reusable, and verifiable, while contributing to the cumulative development of investment knowledge. We present FundaPod, a multi-persona agent platform for AI-assisted fundamental investment research. We argue that fundamental research is a human-centric decision-support task that is qualitatively distinct from trading-signal generation, and is therefore better served by an independence-preserving architecture. In FundaPod, AI agents with different personas, such as value investors or macro strategists, conduct research independently under a shared provenance contract. Their disagreements are then surfaced post hoc for adjudication by the human portfolio manager (PM) through a knowledge-graph memory system. This paper contributes five design principles for human-AI hybrid systems supporting fundamental research, grounded in design-science practice and theories of cognitive isolation and human-machine coordination. It also describes four architectural mechanisms: a persona distillation pipeline that turns public investor materials into deployable agents; a declarative skill registry that lets the planner derive typed task graphs; a grounded evidence model that links memo claims to verifiable sources; and a knowledge-graph "second brain" that connects tickers, memos, analysts, and themes. We demonstrate the architecture through a complete case study and a persona-based memo comparison.

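2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

AsyncTool: Evaluating Asynchronous Function Calling in Multi-Task Scenarios

Agents based on large language models (LLMs) have shown a strong ability to solve complex tasks using external tools. However, existing evaluations often overlook the temporal side of tool use, in particular the effect of tool response latency, and are usually restricted to single-task settings. Real applications often need to run several tasks at once, and overall efficiency depends on whether the agent can make use of idle time while it waits for tool responses. We call this capability asynchronous tool calling. To evaluate it, we propose AsyncTool, a benchmark for assessing LLM-based agents in an interactive multi-task tool-use environment where tool feedback is delayed. AsyncTool presents multiple heterogeneous tasks simultaneously and simulates realistic tool response latency during execution. Using a hybrid data evolution strategy, we build a diverse asynchronous multi-task dataset covering multiple scenarios and tool-use patterns. We evaluate models at the step, sub-task, and task levels and introduce efficiency-oriented metrics for measuring task coordination and completion efficiency. Extensive experiments show that delayed tool feedback poses a substantial challenge for current agents and leads to clear performance degradation. Models that coordinate task switching, dependency tracking, and state maintenance more effectively perform better on AsyncTool. Our analysis identifies the main failure modes of current tool-using agents and offers practical insights for designing future systems with stronger temporal reasoning and coordination.

Original (English)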

AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios

Large language model (LLM)-based agents have shown strong capabilities in using external tools to solve complex tasks. However, existing evaluations often overlook the temporal dimension of tool use, especially the impact of tool response latency, and are usually limited to single-task settings. In real-world applications, multiple tasks often need to be executed concurrently, and overall efficiency depends on whether an agent can use idle time while waiting for tool responses. We refer to this capability as asynchronous tool calling. To evaluate it, we propose AsyncTool, a benchmark for assessing LLM-based agents in interactive multi-task tool-use environments with delayed tool feedback. AsyncTool presents multiple heterogeneous tasks simultaneously and simulates realistic tool response latency during execution. Using a hybrid data evolution strategy, we construct a diverse asynchronous multitasking dataset that covers multiple scenarios and tool-use patterns. We evaluate models at the step, sub-task, and task levels, and introduce efficiency-oriented metrics to measure task coordination and completion efficiency. Extensive experiments show that delayed tool feedback poses substantial challenges to current agents and leads to clear performance degradation. Models that better coordinate task switching, dependency tracking, and state maintenance achieve stronger performance on AsyncTool. Our analysis identifies key failure modes of current tool-using agents and provides practical insights for designing future systems with stronger temporal reasoning and coordination capabilities.
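
The idea of using idle time while waiting on tool responses can be illustrated with a minimal Python sketch. This is not AsyncTool's environment or API: `call_tool`, the tool names, and the latencies are hypothetical stand-ins, and the sketch ignores the task dependencies that the benchmark evaluates. It only contrasts waiting on each tool in turn with issuing independent calls so that their waits overlap.

```python
import asyncio
import time


async def call_tool(name: str, latency_s: float) -> str:
    """Hypothetical tool call; the sleep stands in for response latency."""
    await asyncio.sleep(latency_s)
    return f"{name}:done"


async def run_sequential(calls):
    """Baseline: wait for each tool response before issuing the next call."""
    return [await call_tool(name, latency) for name, latency in calls]


async def run_overlapped(calls):
    """Issue all independent calls up front so their waits overlap."""
    return await asyncio.gather(*(call_tool(n, lat) for n, lat in calls))


def timed(coro_fn, calls):
    """Run one strategy and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(calls))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    calls = [("search", 0.2), ("weather", 0.2), ("calendar", 0.2)]
    _, t_seq = timed(run_sequential, calls)
    _, t_par = timed(run_overlapped, calls)
    print(f"sequential {t_seq:.2f}s, overlapped {t_par:.2f}s")
```

With independent calls, the overlapped run takes roughly as long as the slowest tool rather than the sum of all latencies. Calls that depend on earlier results would still have to be ordered, which is where the dependency tracking mentioned in the abstract comes in.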

2026-05-29 13:00 JST | arXiv cs.AI | Agents

Agent4Edu: Generating Learner Response Data by Generative Agents for Intelligent Education Systems

Personalized learning represents a promising educational strategy within intelligent educational systems, aiming to enhance learners' pract…

2026-05-29 13:00 JST | arXiv cs.AI | Agents

Online Fair Division with Additional Information

We study the problem of fairly allocating indivisible goods to agents in an online setting, where goods arrive sequentially and must be all…

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

GroundAct: Can LLM Agents Ground Actions in Environmental States?

LLM agents achieve 85-96% success on tasks where instructions fully specify the action, but drop to 29-53% when action feasibility depends…

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

ReflexGrad: Within-Episode Failure Recovery in LLM Agents via Progress-Gated Dual-Process Routing

We present ReflexGrad, a dual-process architecture for within-episode failure recovery in LLM agents without demonstrations. When agents co…

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing

Agentic AI systems execute a sequence of actions, such as reasoning steps or tool calls, in response to a user prompt. To evaluate the succ…

2026-05-29 13:00 JST | arXiv cs.AI | Agents

E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving

End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger'…

2026-05-29 13:00 JST | arXiv cs.AI | Agents

A Review of Learning-Based Motion Planning: Toward a Data-Driven Optimal Control Approach

Motion planning for autonomous driving (AD) faces a critical trade-off. While traditional rule-based pipelines offer verifiable safety and…

2026-05-29 13:00 JST | arXiv cs.AI | Agents

Scaling Small Agents Through Strategy Auctions

Small language models are increasingly viewed as a promising, cost-effective approach to agentic AI, with proponents claiming they are suff…

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Robotics

When Should a Robot Think? Resource-Aware Reasoning via Reinforcement Learning for Embodied Robotic Decision-Making

Embodied robotic systems increasingly rely on large language model (LLM)-based agents to support high-level reasoning, planning, and decisi…

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

The Planetary Cost of AI Acceleration, Part II: The 10th Planetary Boundary and the 6.5-Year Countdown

The recent, super-exponential scaling of autonomous Large Language Model (LLM) agents signals a broader, fundamental paradigm shift from ma…

2026-05-29 13:00 JST | arXiv cs.AI | Agents

SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

Skill-based agent systems tackle complex tasks by composing reusable skills, improving modularity and scalability while introducing a large…

2026-05-29 13:00 JST | arXiv cs.AI | Agents

Intent-aligned Autonomous Spacecraft Guidance via Reasoning Models

Future spacecraft operations require autonomy that can interpret high-level mission intent while preserving safety. However, existing traje…

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

CalBench: Evaluating Coordination-Privacy Trade-offs in Multi-Agent LLMs

Personal AI assistants are beginning to act as delegates with access to calendars, inboxes, and user preferences. Calendar scheduling makes…

2026-05-29 13:00 JST | arXiv cs.AI | Agents | Business/Funding

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

Evaluation of software engineering (SWE) agents is dominated by a binary signal: whether the final patch pass…

2026-05-29 13:00 JST | arXiv cs.AI | Agents

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterpr…

2026-05-29 13:00 JST | arXiv cs.AI | Agents | Hardware/Semiconductors

ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows

While interpretable prototype networks offer compelling case-based reasoning for clinical diagnostics, their raw continuous outputs lack th…

2026-05-29 13:00 JST | arXiv cs.AI | Agents

KYA: A Framework-Agnostic Trust Layer for Autonomous Systems with Verifiable Provenance and Hierarchical Policy Composition

KYA (Know Your Agents) is an open-source, framework-agnostic trust and governance layer for autonomous systems, composed of five primitives…

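2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Frontier LLM-based agents can overcome the ontology curation bottleneck for natural phenotypes

Linking free-text phenotype descriptions to ontology terms, usually called phenotype annotation, is essential for integrating comparative morphological data across studies. This labor-intensive process depends heavily on highly trained human experts, which makes it hard to scale and a major bottleneck. Dahdul et al. (2018) established a Gold Standard (GS) of Entity-Quality (EQ) annotations across seven phylogenetic studies and used it to evaluate three curators and the Semantic CharaParser NLP tool with ontology-based semantic similarity metrics. They reported that machine-human consistency was significantly lower than inter-curator (human-human) consistency. Here we revisit that benchmark using five frontier hosted LLMs from Anthropic and OpenAI. Each LLM acts as an "agentic curator" inside a self-contained workspace that provides the PDF of the source publication, the same annotation guide used by the original human curators, the four project ontologies (UBERON, PATO, BSPO, GO), and a validation script. Evaluated against the same Gold Standard, every agent fell within the range of inter-curator variability of the three trained human biocurators in the original study. The best-performing agents approached, but did not reach, the best-performing human curator. The agents substantially outperformed Semantic CharaParser on all four metrics.

Original (English)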

Frontier LLM-based agents can overcome the ontology curation bottleneck for natural phenotypes

Linking free-text phenotype descriptions to ontology terms, typically referred to as phenotype annotation, is essential for the cross-study integration of comparative morphological data. This labor intensive process has heavily relied on highly trained human experts, which makes it challenging to scale and thus a key bottleneck. Dahdul et al. (2018) established a Gold Standard (GS) of Entity-Quality (EQ) annotations across seven phylogenetic studies and used it to evaluate three human curators and the Semantic CharaParser NLP tool with ontology-based semantic similarity metrics; they reported that machine-human consistency was significantly lower than inter-curator (human-human) consistency. Here we revisit that benchmark with five frontier hosted LLMs from Anthropic and OpenAI, each operating as an "agentic curator" within a self-contained workspace that supplies the source publication PDF, the same annotation guide used by the original human curators, the four project ontologies (UBERON, PATO, BSPO, GO), and a validation script. Evaluated against the same Gold Standard, every agent fell within the range of inter-curator variability of the three trained human biocurators of the original study; the best performing agents approached but did not reach the best performing human curator. Agents substantially outperformed Semantic CharaParser on all four metrics.

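2026-05-29 13:00 JST | arXiv cs.AI | Agents

VFEAgent: A Multimodal Agent Framework for End-to-End Automated Finite Element Analysis

Finite element analysis (FEA) is a cornerstone of modern engineering design. Its workflow, however, is inherently complex and relies heavily on domain expertise. Recent efforts have integrated large language models (LLMs) into FEA, but existing approaches are limited in handling multimodal inputs and executing complex tasks. To address these limitations, we propose VFEAgent, an end-to-end multi-agent system designed to automate FEA modeling and simulation directly from input images and problem descriptions. Our methodology integrates two core components: (1) a multimodal vision-language multi-agent pipeline that uses ReAct-driven reasoning to extract structured FEA specifications from heterogeneous inputs, and (2) a verification-first code synthesis framework that incorporates robust self-debugging and fallback mechanisms to ensure executability and physical validity. We systematically evaluated the system across a range of engineering mechanics scenarios. The results show that VFEAgent achieves a high success rate in generating complete, physically valid simulations and outperforms LLM-based baseline methods in reliability and correctness. These findings validate the feasibility of automating the entire FEA workflow and highlight the framework's potential to free engineers from tedious manual analysis.

Original (English)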

VFEAgent: A Multimodal Agent Framework for End-to-End Automated Finite Element Analysis

Finite Element Analysis (FEA) serves as the cornerstone of modern engineering design. However, its workflow is inherently complex and relies heavily on domain expertise. Although recent efforts have integrated Large Language Models (LLMs) into FEA, existing approaches face limitations in handling multimodal inputs and executing complex tasks. To address these limitations, we propose VFEAgent, an end-to-end multi-agent system designed to automate FEA modeling and simulation directly from input images and problem descriptions. Our methodology integrates two core components: (1) a multimodal vision-language multi-agent pipeline that employs ReAct-driven reasoning to extract structured FEA specifications from heterogeneous inputs and (2) a verification-first code synthesis framework, incorporating robust self-debugging and fallback mechanisms to ensure executability and physical validity. We systematically evaluated the system across various engineering mechanics scenarios. The results demonstrate that VFEAgent achieves a high success rate in generating complete and physically valid simulations, outperforming LLM-based baseline methods in reliability and correctness. These findings validate the feasibility of automating the complete FEA workflow, highlighting the framework's potential to liberate engineers from tedious manual analysis.

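2026-05-29 13:00 JST | arXiv cs.AI | Agents

Differentiable Belief-based Opponent Shaping

Human coordination often depends on the ability to influence others' beliefs through strategic action. In multi-agent reinforcement learning, opponent shaping tries to reproduce this influence, but existing methods typically operate in the opponent's parameter, policy, or value space. Meanwhile, belief-manipulation techniques in hidden-role games often rely on hard-coded objectives such as deception or belief saturation. We propose Differentiable Belief-based Opponent Shaping (D-BOS), a first-order method that treats each observer's belief as the shaped opponent state and differentiates through $k$-step softmax-Bayes belief dynamics. Rather than explicitly rewarding deceptive or cooperative behavior, our method treats the belief state as the target of shaping, which lets the optimal strategy emerge naturally from the environment's reward structure. This belief-space formulation provides an opponent-shaping signal by differentiating through the opponent's belief updates, and it extends naturally to multiple observers by aggregating gradients over their individually inferred belief trajectories. Empirically, D-BOS outperforms PPO and BBM in hidden-role games, with the largest gains in mixed-motive settings.

Original (English)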

Differentiable Belief-based Opponent Shaping

Human coordination often relies on the ability to influence the beliefs of others through strategic action. In multi-agent reinforcement learning, opponent shaping attempts to replicate this influence, though existing methods typically operate within an opponent's parameter, policy, or value space. Meanwhile, belief-manipulation techniques in hidden-role games often rely on hard-coded objectives, such as deception or belief saturation. We propose Differentiable Belief-based Opponent Shaping (D-BOS), a first-order method that treats each observer's belief as the shaped opponent state and differentiates through $k$-step softmax-Bayes belief dynamics. Rather than explicitly rewarding deceptive or cooperative behavior, our method treats the belief state as the target for shaping. This allows the optimal strategy to emerge naturally from the environment's reward structure. This belief-space formulation provides an opponent-shaping signal by differentiating through opponent belief updates, and naturally extends to multiple observers by aggregating gradients over their individual inferred belief trajectories. Empirically, D-BOS outperforms PPO and BBM in hidden-role games, with the largest gains in mixed-motive settings.

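2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Hallucination Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching

Hallucination remains a major reliability barrier for production LLM systems, especially in multi-agent pipelines where unsupported claims can propagate unchecked from stage to stage. This paper adapts a HOPE-inspired Nested Learning architecture with Continuum Memory Systems (CMS) and a semantic similarity cache to a hybrid benchmark of 310 prompts that combines 217 epistemic-uncertainty prompts with 93 fabrication-induction stress-test prompts. A three-stage agentic pipeline coordinated through the Open Floor Protocol (OFP) is evaluated with five KPIs: FCD (Factual Claim Density), FGR (Factual Grounding References), FDF (Fictional Disclaimer Frequency), ECS (Explicit Contextualization Score), and OSR (Observability Score Ratio). These are aggregated into THS (Total Hallucination Score) across five weighting configurations in order to study the trade-off between mitigation and observability. FDF, ECS, OSR, and FGR are subtracted as mitigation signals, so a more negative THS indicates stronger mitigation. The FrontEndAgent is configured as a high-stochasticity generator (temperature = 1.0) that produces a realistic hallucination baseline, while the SecondLevelReviewer and ThirdLevelReviewer act as progressive correctors. This asymmetric design yields end-to-end THS reductions of -31.3% to -35.9% across the five weighting configurations. The semantic cache achieves 440 cache hits over 930 potential calls (a 47.3% hit rate), cutting LLM invocations to 490, lowering the energy and CO2e footprint, and making multi-stage review pipelines viable at production scale. ExtremeObservability attains the most negative final THS (-0.0709), confirming that observability-heavy configurations reinforce mitigation rather than compromise it. These findings suggest that memory-augmented multi-agent designs can jointly improve factual reliability, operational efficiency, and auditability without retraining the model.

Original (English)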

Hallucination Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching

Hallucination remains a major reliability barrier for production LLM systems, particularly in multi-agent pipelines where unsupported claims can propagate unchecked across stages. This paper adapts a HOPE-inspired Nested Learning architecture with Continuum Memory Systems (CMS) and semantic similarity caching to a hybrid benchmark of 310 prompts combining 217 epistemic-uncertainty prompts and 93 fabrication-induction stress-test prompts. A three-stage agentic pipeline orchestrated via the Open Floor Protocol (OFP) is evaluated with five KPIs -- FCD (Factual Claim Density), FGR (Factual Grounding References), FDF (Fictional Disclaimer Frequency), ECS (Explicit Contextualization Score), and OSR (Observability Score Ratio) -- aggregated into THS (Total Hallucination Score) across five weighting configurations to study mitigation-observability trade-offs. FDF, ECS, OSR, and FGR are subtracted as mitigation signals, so that a more negative THS indicates stronger mitigation. The FrontEndAgent is configured as a high-stochasticity generator (temperature = 1.0) to produce a realistic hallucination baseline, while the SecondLevelReviewer and ThirdLevelReviewer operate as progressive correctors. This asymmetric design yields end-to-end THS reductions of -31.3% to -35.9% across five weighting configurations. Semantic caching achieves 440 cache hits over 930 potential calls (47.3% hit rate), reducing LLM invocations to 490, lowering energy and CO2e footprint, and making multi-stage review pipelines operationally viable at production scale. ExtremeObservability attains the most negative final THS (-0.0709), confirming that observability-heavy configurations reinforce rather than compromise mitigation. These findings suggest that memory-augmented multi-agent designs can jointly improve factual reliability, operational efficiency, and auditability without model retraining.
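
The caching figures in the abstract can be checked with a few lines of Python. The helper name `cache_summary` is hypothetical, introduced only for this check. One observation goes beyond what the abstract states: 930 equals 310 × 3, which would be consistent with one potential call per prompt per pipeline stage, but the abstract does not give that breakdown explicitly.

```python
def cache_summary(potential_calls: int, cache_hits: int) -> tuple[float, int]:
    """Return (hit rate, LLM invocations still needed after cache hits)."""
    if potential_calls <= 0 or not 0 <= cache_hits <= potential_calls:
        raise ValueError("need potential_calls > 0 and 0 <= cache_hits <= potential_calls")
    return cache_hits / potential_calls, potential_calls - cache_hits


# Figures quoted in the abstract.
hit_rate, invocations = cache_summary(potential_calls=930, cache_hits=440)
print(f"hit rate: {hit_rate:.1%}")        # hit rate: 47.3%
print(f"LLM invocations: {invocations}")  # LLM invocations: 490

# Benchmark composition: 217 + 93 prompts; 310 * 3 matches the 930 potential
# calls (an inferred breakdown, not one the abstract states).
print(217 + 93, 310 * 3)                  # 310 930
```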

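2026-05-29 13:00 JST | arXiv cs.AI | Agents

The Importance of Out-of-Band Metadata for Safe Autonomous Agents: The Redpanda Agentic Data Plane

AI agents are increasingly expected to act as digital employees that access enterprise data, make decisions, and take actions autonomously. Yet agents are at once less predictable than humans, being prone to hallucination, misinterpretation, and adversarial manipulation, and more technically capable, with deep system knowledge and high-throughput interfaces that can cascade damage at machine speed. This combination makes it unsafe to rely on agents to faithfully interpret or propagate security-critical metadata such as access policies, data classifications, and behavioral constraints. We present the Redpanda Agentic Data Plane (ADP), an architecture built around out-of-band metadata channels. These are infrastructure pathways that carry security context, policy signals, and audit trails deterministically, entirely outside the agent's read and write path and across heterogeneous infrastructure. The channels enforce governance at every stage of the agent lifecycle: scoping data access on the way in, constraining actions during execution, and capturing tamper-proof transcripts on the way out. We demonstrate ADP with a multi-agent portfolio rebalancing system in which autonomous agents monitor markets, make trade decisions, and execute orders across isolated client accounts. Per-client data scoping, trade approval thresholds, and tamper-proof audit trails are all enforced by out-of-band channels that the agents can neither see nor bypass.

Original (English)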

The Importance of Out-of-Band Metadata for Safe Autonomous Agents: The Redpanda Agentic Data Plane

AI agents are increasingly expected to operate as digital employees: accessing enterprise data, making decisions, and taking actions autonomously. But agents are simultaneously less predictable than humans -- prone to hallucination, misinterpretation, and adversarial manipulation -- and more technically capable: with deep system knowledge and high-throughput interfaces cascading damage at machine speed. This combination makes it unsafe to rely on agents to faithfully interpret or propagate security-critical metadata such as access policies, data classifications, and behavioral constraints. We present the Redpanda Agentic Data Plane (ADP), an architecture built around out-of-band metadata channels: infrastructure pathways that carry security context, policy signals, and audit trails deterministically, entirely outside the agent's read and write path and across heterogeneous infrastructure. These channels enforce governance at every stage of the agent lifecycle -- scoping data access on the way in, constraining actions during execution, and capturing tamper-proof transcripts on the way out. We demonstrate ADP with a multi-agent portfolio rebalancing system in which autonomous agents monitor markets, make trade decisions, and execute orders across isolated client accounts -- with per-client data scoping, trade approval thresholds, and tamper-proof audit trails all enforced by out-of-band channels the agents can neither see nor bypass.

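2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Beyond Consensus: Trace-Level Synthesis in Mixture of Agents

When multiple LLM agents solve the same problem, standard practice compresses each agent's reasoning into a majority vote or a layered synthesis and treats agreement as the finish line. We show that this is unnecessarily lossy. An LLM aggregator that reads the complete reasoning traces recovers correct solutions even when the agents agree unanimously, with beneficial corrections consistently outweighing harmful ones: the aggregation paradox. Majority voting has a ceiling that perturbation diversity does not raise (the error correlations are identical). The aggregator's gain comes from trace-level complementarity, that is, from assembling correct intermediate steps out of minority chains that voting discards. These findings motivate Self-Consistent Mixture of Agents, which generates trace diversity through semantics-preserving input perturbations, protects the majority through anchored refinement with provable non-degradation guarantees, and always synthesizes rather than gating on consensus. A single model with perturbation-induced trace variation outperforms heterogeneous model pools on structured reasoning, PhD-level science, competition mathematics, and competitive programming. The unit of aggregation should be the reasoning trace, not the answer.

Original (English)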

Beyond Consensus: Trace-Level Synthesis in Mixture of Agents

When multiple LLM agents solve the same problem, standard practice compresses each agent's reasoning into a majority vote or layered synthesis, treating agreement as the finish line. We show this is unnecessarily lossy: an LLM aggregator that reads complete reasoning traces recovers correct solutions even when agents unanimously agree, with beneficial corrections consistently outweighing harmful ones -- the aggregation paradox. Majority voting has a ceiling that perturbation diversity does not raise (error correlations are identical); the aggregator's gain comes from trace-level complementarity, assembling correct intermediate steps from minority chains that voting discards. These findings motivate Self-Consistent Mixture of Agents, which generates trace diversity through semantic-preserving input perturbations, safeguards the majority via anchored refinement with provable non-degradation guarantees, and always synthesizes -- never gates on consensus. A single model with perturbation-induced trace variation outperforms heterogeneous model pools across structured reasoning, PhD-level science, competition mathematics, and competitive programming. The unit of aggregation should be the reasoning trace, not the answer.
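
For reference, the majority-vote baseline that the abstract contrasts against can be sketched in a few lines of Python. This is a generic illustration, not the paper's code: the `(trace, answer)` pair layout and the example data are assumptions. It shows why the baseline is lossy, since the reasoning traces never influence the result.

```python
from collections import Counter


def majority_vote(runs):
    """Pick the most common final answer from (trace, answer) pairs.

    Only the answers are counted; the reasoning traces are ignored, so any
    correct intermediate step in a minority trace is thrown away.
    """
    if not runs:
        raise ValueError("need at least one run")
    answers = [answer for _trace, answer in runs]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


# Hypothetical runs: two agents agree, one dissents.
runs = [
    ("trace A ...", "42"),
    ("trace B ...", "42"),
    ("trace C ...", "41"),
]
print(majority_vote(runs))  # 42
```

A trace-level aggregator of the kind the abstract describes would instead take the full traces as input; how it does so is not specified in the abstract and is not sketched here.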

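2026-05-29 13:00 JST | arXiv cs.AI | Agents

PRO-CUA: Process-Reward Optimization for Computer Use Agents

Computer use agents (CUAs) show strong potential for automating complex digital workflows, but their training is still constrained by costly interaction with live environments and by limited high-quality supervision. Existing filtered behavior cloning pipelines suffer from imitation bottlenecks, including distribution shift from the expert demonstrations and the absence of negative learning signals. Standard trajectory-level reinforcement learning, meanwhile, struggles with sparse rewards, ambiguous credit assignment, and high infrastructure costs for long-horizon GUI interaction. In this work we propose PRO-CUA, a process-reward optimization framework that trains CUAs with iterative step-level reinforcement learning. PRO-CUA decouples on-policy environment interaction from policy optimization: the current policy collects states through live rollouts, generates diverse candidate actions for each state, receives step-level feedback from a process reward model (PRM), and is optimized with group-relative advantages. This design enables dense, flexible credit assignment without relying on golden answers or offline expert trajectories, and it reduces distribution shift by training on the agent's own execution states. Experiments on live web benchmarks demonstrate the effectiveness of PRO-CUA and the reliability of PRM-guided step-level training.

Original (English)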

PRO-CUA: Process-Reward Optimization for Computer Use Agents

Computer use agents (CUAs) have shown strong potential for automating complex digital workflows, yet their training remains constrained by costly live environment interaction and limited high-quality supervision. Existing filtered behavior cloning pipelines suffer from imitation bottlenecks, including distribution shift from the expert demonstration and the absence of negative learning signals. Meanwhile, standard trajectory-level reinforcement learning struggles with sparse rewards, ambiguous credit assignment, and high infrastructure costs for long-horizon GUI interaction. In this work, we propose PRO-CUA, a process-reward optimization framework for training CUAs with iterative step-level reinforcement learning. PRO-CUA decouples on-policy environment interaction from policy optimization: the current policy collects states through live rollouts, generates diverse candidate actions for each state, receives step-level feedback from a process reward model (PRM), and is optimized with group-relative advantages. This design enables dense and flexible credit assignment without relying on golden answers or offline expert trajectories, while reducing distribution shift by training on the agent's own execution states. Experiments on live web benchmarks demonstrate the effectiveness of PRO-CUA and the reliability of PRM-guided step-level training.
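
The abstract does not give a formula for its group-relative advantages. A common choice, assumed here rather than taken from the paper, is to normalize each candidate's reward against the mean and standard deviation of the rewards for all candidates sampled at the same state. The Python sketch below uses hypothetical PRM scores and a hypothetical function name.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize step-level rewards within one group of candidate actions.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    The paper may use a different normalization.
    """
    if not rewards:
        raise ValueError("need at least one candidate reward")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical PRM scores for three candidate actions at one state.
advantages = group_relative_advantages([0.2, 0.5, 0.8])
print([round(a, 3) for a in advantages])
```

Candidates scored above the group mean get a positive advantage and those below get a negative one, which gives a learning signal at every step without a golden answer. When all candidates score the same, every advantage is zero.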

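2026-05-29 13:00 JST | arXiv cs.AI | Agents

Governing Technical Debt in Agentic AI Systems

Agentic AI systems are increasingly being explored as production infrastructure: they reason over multiple steps, call tools, act through workflows, and adapt through memory and feedback. These systems create governance challenges that traditional software technical debt and predictive ML technical debt do not fully capture. We define Agentic Technical Debt as the accumulated liability that arises when prompts, memory, tool schemas, orchestration graphs, control policies, and observability routines are patched together faster than they can be validated, standardized, and governed. We define Stochastic Tax as the recurring operating burden of keeping probabilistic agent behavior within acceptable bounds. The distinction matters: the debt is a stock of design and governance liability, whereas the tax is a flow of operating cost incurred because stochastic agents act through tools and workflows. We outline how managers can make both visible through lightweight dashboards and governance controls.

Original (English)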

Governing Technical Debt in Agentic AI Systems

Agentic AI systems are increasingly being explored as production infrastructure: they reason over multiple steps, call tools, act through workflows, and adapt through memory and feedback. These systems create governance challenges that are not fully captured by traditional software or predictive ML technical debt. We define Agentic Technical Debt as the accumulated liability created when prompts, memory, tool schemas, orchestration graphs, control policies, and observability routines are patched together faster than they can be validated, standardized, and governed. We define Stochastic Tax as the recurring operating burden of keeping probabilistic agent behavior within acceptable bounds. The distinction matters: debt is a stock of design and governance liability, while the tax is a flow of operating cost that arises because stochastic agents act through tools and workflows. We outline how managers can make both visible through lightweight dashboards and governance controls.

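2026-05-29 13:00 JST | arXiv cs.AI | Agents | Business/Funding | Research/Papers

Paper Agents, Paper Gains: An Empirical Analysis of DeFi Investment Agents

DeFi investment agents, systems that use AI for autonomous on-chain trading, have reached more than USD 3 billion in combined token valuations since late 2024. We survey over 1,900 AI-tagged crypto projects, narrow them to investment-focused agents, and select 10 representative projects spanning strategy and observability dimensions. We then carry out a detailed architectural analysis of two prominent agent frameworks, ElizaOS and Virtuals Protocol, and a quantitative on-chain performance analysis of 11 Solana-based agent treasuries with publicly attributable trading activity, covering 925,323 token holders. We find that current deployments remain early-stage and heterogeneous: (1) in our sample, many projects do not yet provide clear evidence of autonomous trade execution, and developer interviews suggest that many visible deployments are still basic API integrations; (2) agent treasuries hold over USD 30 million in paper gains while token holders collectively lost USD 191.7 million, with the top 1% of wallets capturing 81.4% of all gains (USD 1.81 billion); (3) token valuations are only weakly tied to treasury fundamentals, with market-cap-to-AUM ratios above 10,000x compared with below 1x for established DeFi protocols; and (4) aggregate user gains peaked at USD 2.4 billion before falling to net losses, with median returns negative on every platform and tokens down 93% on average from their all-time highs. We interpret these outcomes as characteristic of a permissionless, first-generation market in which open infrastructure enables rapid experimentation but also lets naive or speculative agents launch before robust standards for autonomy, performance, and stakeholder alignment emerge. We therefore propose a maturity framework along three dimensions (autonomous execution, risk-adjusted profitability, and stakeholder alignment) to characterize the gap between current deployments and future investment-grade agent systems.

Original (English)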

Paper Agents, Paper Gains: An Empirical Analysis of DeFi Investment Agents

DeFi investment agents, systems that use AI for autonomous on-chain trading, have attained over USD 3 billion in combined token valuations since late 2024. We survey over 1,900 AI-tagged crypto projects, filter to investment-focused agents, and curate 10 representative projects spanning strategy and observability dimensions. We then conduct a deep-dive architectural analysis of two prominent agent frameworks, ElizaOS and Virtuals Protocol, and a quantitative on-chain performance analysis of 11 Solana-based agent treasuries with publicly attributable trading activity, covering 925,323 token holders. We find that current deployments remain early and heterogeneous: (1) in our sample, many projects do not yet provide clear evidence of autonomous trade execution, and developer interviews suggest that many visible deployments remain basic API integrations; (2) agent treasuries retain over USD 30M in paper gains while token holders collectively lost USD 191.7M, with the top 1% of wallets capturing 81.4% of all gains (USD 1.81B); (3) token valuations are weakly connected to treasury fundamentals, with market-cap-to-AUM ratios exceeding 10,000x versus below 1x for established DeFi protocols; and (4) aggregate user gains peaked at USD 2.4B before declining to net losses, with median returns negative on every platform and tokens declining 93% on average from all-time highs. We interpret these outcomes as characteristic of a permissionless, first-generation market in which open infrastructure enables rapid experimentation but also allows naive or speculative agents to launch before robust standards for autonomy, performance, and stakeholder alignment emerge. We therefore propose a maturity framework along three dimensions: autonomous execution, risk-adjusted profitability, and stakeholder alignment, to characterize the gap between current deployments and future investment-grade agent systems.

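2026-05-29 13:00 JST | arXiv cs.AI | Agents

GTA: Generating Long-Horizon Tasks for Web Agents at Scale

Web agents, which combine language models with browsing and tool-use capabilities, are promising as open web assistants. Progress, however, is increasingly limited by the lack of scalable process-level supervision. Existing benchmarks are mostly built by hand and provide only coarse start-goal annotations without intermediate trajectories, while recent automatic generation efforts remain expensive, biased, and shallow. These limitations prevent reliable training and evaluation of agents that must generalize to realistic multi-hop, cross-page tasks. We introduce GTA, a scalable framework that integrates crawling, retrieval-based seeding, in-context generation, and automated quality control to produce realistic tasks paired with executable trajectories. The design decouples crawling from generation for efficiency, grounds tasks in the site graph to enforce compositionality, and ensures dense supervision through deterministic replay and systematic validation. We instantiate the pipeline on more than 50 websites spanning e-commerce, government, forums, and news, with multilingual and multi-hop coverage. The resulting benchmark reveals a large gap between human and agent performance and enables detailed diagnostics. Our contributions are threefold: (i) formalizing multi-hop web-agent task generation, (ii) proposing an efficient, validated pipeline for automatic data creation, and (iii) releasing a dynamic benchmark with reproducible evaluation.

Original (English)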

GTA: Generating Long-Horizon Tasks for Web Agents at Scale

Web agents, which couple language models with browsing and tool-use capabilities, show promise as open web assistants. Yet progress is increasingly limited by the lack of scalable, process-level supervision. Existing benchmarks are largely manually constructed, providing only coarse start-goal annotations without intermediate trajectories, while recent automatic generation efforts remain expensive, biased, and shallow. These limitations prevent reliable training and evaluation of agents that must generalize to realistic, multi-hop, cross-page tasks. We introduce a scalable framework, GTA, that integrates crawling, retrieval-based seeding, in-context generation, and automated quality control to produce realistic tasks paired with executable trajectories. This design decouples crawling from generation for greater efficiency, grounds tasks in the site graph to enforce compositionality, and ensures dense supervision through deterministic replays and systematic validation. We instantiate the pipeline on over 50 websites covering e-commerce, government, forums, and news, with multilingual and multi-hop coverage. The resulting benchmark reveals a significant human-agent performance gap and enables detailed diagnostics. Our contributions are three-fold: (i) formalizing multi-hop web-agent task generation, (ii) proposing an efficient and validated pipeline for automatic data creation, and (iii) releasing a dynamic benchmark with reproducible evaluation.

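2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Business/Funding | Research/Papers

BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents

Self-evolving agents improve over time by reflecting on past failures, but existing evaluation is limited in two ways: it measures only task scores, leaving reflection quality unknown, and it depends on the agent's own episode runs, offering no mechanism for targeting specific failure patterns. We present BenchTrace, a benchmark for evaluating the self-evolution ability of LLM agents. BenchTrace is built on a snapshot-reflection dataset of 1,821 annotated episodes spanning six diverse tasks. It consists of a Reflection Evaluation, which probes failure identification through targeted QA tasks, and an Evolution Evaluation, which tests whether past failure experience translates into avoidance behavior in a controlled self-evolution simulation. Building on BenchTrace, we propose failure avoidance rate (FAR), a new evaluation metric that measures the fraction of test cases in which the agent successfully avoids the target failure instance. Experiments with Qwen3-32B and GPT-4.1 show that both models fall below a 30% end-to-end pass rate on the Reflection Evaluation, with diagnosis as the main bottleneck. The Evolution Evaluation shows that self-evolution methods generally improve FAR over the non-evolving baseline, but agents forget early lessons as noise episodes accumulate and fail to generalize their reflections beyond the specific context, which causes negative transfer across task contexts. Correlation analysis further shows that only fully correct reflections are strongly associated with higher FAR. BenchTrace exposes concrete limits of current self-evolution approaches and provides a controlled, model-agnostic framework for targeted evaluation.

Original (English)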

BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents

Self-evolving agents improve over time by reflecting on past failures, but existing evaluation is limited in two ways: it measures only task scores, leaving reflection quality unknown, and it relies on agents' own episode runs, offering no mechanism to target specific failure patterns. We present BenchTrace, a benchmark for evaluating self-evolution ability in LLM agents. BenchTrace is built on a snapshot-reflection dataset of 1,821 annotated episodes spanning six diverse tasks, and comprises a Reflection Evaluation that probes failure identification through targeted QA tasks, and an Evolution Evaluation that tests whether past failure experience translates into avoidance behavior in a controlled self-evolution simulation. Building on BenchTrace, we propose failure avoidance rate (FAR), a new evaluation metric measuring the fraction of test cases in which the agent successfully avoids the target failure instance. Experiments with Qwen3-32B and GPT-4.1 reveal that both models fall below a 30% end-to-end pass rate on reflection evaluation, with diagnosis as the primary bottleneck. Evolution evaluation shows that self-evolution methods generally improve FAR over the non-evolving baseline, but agents forget early lessons as noise episodes accumulate, and agents fail to generalize their reflections beyond the specific context, causing negative transfer across task contexts. Our correlation analysis further reveals that only a fully correct reflection is strongly associated with higher FAR. BenchTrace exposes concrete limits of current self-evolution approaches and provides a controlled, model-agnostic framework for targeted evaluation.
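
The failure avoidance rate is described as the fraction of test cases in which the agent avoids the target failure instance. The Python sketch below follows that description. The function name and the boolean-per-test-case input are assumptions made for illustration, and how BenchTrace decides whether a failure was avoided is not covered by the abstract.

```python
from typing import Sequence


def failure_avoidance_rate(avoided: Sequence[bool]) -> float:
    """Fraction of test cases where the target failure was avoided.

    `avoided[i]` is True when the agent avoided the target failure
    instance in test case i (a hypothetical input format).
    """
    if not avoided:
        raise ValueError("need at least one test case")
    return sum(avoided) / len(avoided)


# Hypothetical outcomes for four test cases.
print(failure_avoidance_rate([True, False, True, True]))  # 0.75
```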

2026-05-29 13:00 JST arXiv cs.AI エージェント

証明可能に安全なエージェントガードレール

大規模言語モデルが、範囲の限られた生成エンジンから広範な実行権限を持つエージェントへと移行するにつれて、AI の制御不能化が人工知能のセキュリティに根本的な危機をもたらします。既存の防御アーキテクチャは、経験的なセマンティックガードレールと確率的な大規模モデルによる判定器に大きく依存していますが、これらのメカニズムは、複雑なセマンティック・シンボル分離攻撃に直面した場合に決定論的なセキュリティの下限を保証できません。この経験的セマンティックガードレールのジレンマを克服するために、本論文では、論理的推論の根本的な限界に基づく、エージェントのための新しいセキュリティパラダイムを提案します。このパラダイムに基づき、ニューラル・シンボリック分離アーキテクチャを備えた実行可能な Proof-Constrained Action (ePCA) フレームワークを導入します。このフレームワークは自然言語に対する意味的な信頼を放棄し、エージェントが物理的な操作を実行する前に、その意図を一階論理の数学的制約へとロスレスに形式化することを強制します。巨視的および微視的な 2 次元の動的敵対システムに対する実証評価では、この形式検証メカニズムが、評価したシナリオ全体で攻撃成功率ゼロと偽陽性率ゼロを、極めて低い計算レイテンシで達成することが示されました。本研究は、明示的なシステム仮定の下での条件付きの形式的基盤と、将来のインテリジェントシステムを支える防御基盤を構築するためのエンジニアリングパラダイムを提供します。

原文 (English)

Provably Secure Agent Guardrail

As large language models transition from bounded generative engines to agents with expansive execution privileges, AI going out of control precipitates a fundamental crisis in artificial intelligence security. Existing defense architectures heavily rely on empirical semantic guardrails and probabilistic large model adjudicators, mechanisms that fail to provide deterministic security lower bounds when facing complex semantic symbol decoupling attacks. To overcome this empirical semantic guardrail dilemma, this paper proposes a new security paradigm for agents based on the fundamental limitations of logical reasoning. Based on this paradigm, we further introduce an executable Proof-Constrained Action (ePCA) framework with a neural symbolic isolation architecture. This framework abandons semantic trust in natural language, forcing agents to losslessly formalize their intentions into first-order logical mathematical constraints before performing physical operations. Empirical evaluations of macroscopic and microscopic two-dimensional dynamic adversarial systems demonstrate that our formal verification mechanism achieves zero attack success rate and zero false positive rate across the evaluated scenarios, with extremely low computational latency. This research provides a conditional formal foundation under explicit system assumptions and an engineering paradigm for constructing the underlying defense foundation for future intelligent systems.

2026-05-29 13:00 JST arXiv cs.AI エージェント 研究/論文

OpenClawBench: 現実世界のエージェント実行軌跡におけるプロセス側の異常のベンチマーク

タスクの成功は、実世界のエージェント実行におけるプロセス上の異常を覆い隠すことがあります。エージェントは、未解決のあいまいさ、安全でない外部書き込み、無視されたエラー、根拠の弱いコミットメント、能力の限界を超えた過剰なコミットメントを蓄積したまま、最終的なタスクオラクルに合格してしまう可能性があります。私たちはこの不一致を「結果とプロセスのギャップ (Outcome-Process Gap)」として研究し、実際のエージェント実行プロセスにおけるプロセス側の異常を測定し、教師信号を与えるための大規模データセット OpenClawBench を導入します。OpenClawBench は、6 つのソースモデルが生成した BFCL 駆動の OpenClaw セッションから構築されており、31,264 件の注釈付き軌跡を含み、タスクオラクルの結果を構造化されたプロセス証拠と対応付けます。FullTax は、対応付けられた軌跡を構造化された異常の教師データ (バイナリラベル、裏付け証拠、発生時点と範囲の特定、重大度、回復可能性、5 クラスの異常分類) に変換します。OpenClawBench により、結果とプロセスのギャップが測定可能になります。オラクルに合格した 31,135 件の実行のうち、2,904 件は FullTax の下で依然としてプロセス異常とラベル付けされました。この結果は、成功のみを見る評価では、実際のエージェント実行におけるプロセス側の失敗という具体的なクラスを見逃すことを示しています。信頼度の高い FullTax 教師ありプールで学習した、LoRA でファインチューニングした Gemma 3 12B 検出器は、よりクリーンなラベルのホールドアウトテスト分割でバイナリ F1=0.729 に達します。総じて OpenClawBench は、実際のエージェント実行ログを、実行時のエージェント信頼性の研究、診断、運用監視に使える、監査可能で再利用可能な教師データに変換します。
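要旨にある 2 つの件数 (オラクル合格 31,135 件、そのうちプロセス異常 2,904 件) から、ギャップの大きさを比率として確認できます。以下は要旨の数値だけを使った簡単な計算例で、`outcome_process_gap` という関数名は説明用の仮のものです。

```python
def outcome_process_gap(oracle_pass: int, anomalous_among_pass: int) -> float:
    """オラクル合格の実行のうち、プロセス異常とラベル付けされた割合を返す。"""
    if oracle_pass <= 0:
        raise ValueError("oracle_pass must be positive")
    if not 0 <= anomalous_among_pass <= oracle_pass:
        raise ValueError("anomalous_among_pass must be within [0, oracle_pass]")
    return anomalous_among_pass / oracle_pass


# 要旨に記載された件数
gap = outcome_process_gap(oracle_pass=31_135, anomalous_among_pass=2_904)
print(f"{gap:.1%}")  # 9.3%
```

つまり、成功と判定された実行のおよそ 11 件に 1 件が、プロセス側では異常を含んでいたことになります。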

原文 (English)

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

Task success can hide process anomalies in real-world agent executions. An agent may pass the final task oracle while still accumulating unresolved ambiguity, unsafe external writes, ignored errors, weakly grounded commitments, or capability-boundary overcommitment. We study this mismatch as the Outcome-Process Gap and introduce OpenClawBench, a large-scale dataset for measuring and supervising process-side anomalies in real agent execution processes. OpenClawBench is built from BFCL-driven OpenClaw sessions produced by 6 source models and contains 31,264 annotated trajectories. It aligns task-oracle outcomes with structured process evidence. FullTax converts the aligned trajectories into structured anomaly supervision: binary labels, supporting evidence, onset/span localization, severity, recoverability, and a 5-class anomaly taxonomy. Using OpenClawBench, we make the Outcome-Process Gap measurable. Among 31,135 oracle-passing executions, 2,904 are still labeled process-anomalous under FullTax. These results show that success-only evaluation misses a concrete class of process-side failures in real agent executions. A LoRA-fine-tuned Gemma 3 12B detector trained on the high-confidence FullTax supervised pool reaches binary F1=0.729 on the cleaner-labels held-out test split. Together, OpenClawBench turns real agent execution logs into auditable and reusable supervision for studying, diagnosing, and operationally monitoring runtime agent reliability.

2026-05-29 13:00 JST arXiv cs.AI エージェント

リアルタイムの制約と長期的な推論の調和: 動的スケジューリングのための非同期エージェントフレームワーク

動的フレキシブルジョブショップスケジューリング問題 (DFJSP) では、確率的な外乱への即時の反応と、生産目標の大域的な最適化との間でトレードオフが必要になります。従来の優先度ルールは複雑な中断に対処するには柔軟性が足りず、一方で学習ベースのアプローチは解釈可能性を損なったり、問題規模をまたいで一般化できなかったりすることがよくあります。大規模言語モデル (LLM) はこのギャップを埋める高度な推論能力を提供しますが、その大きな推論レイテンシは、産業用制御システムのミリ秒レベルの決定サイクルとは両立しません。この衝突を解決するために、デュアルストリームアーキテクチャによってポリシーの実行を論理的推論から切り離す、非同期エージェントベースのフレームワーク RACE-Sched を導入します。Reactive Stream は低レイテンシのシンボリックヒューリスティックを実行してリアルタイムのディスパッチを可能にし、並列に動く Deliberative Stream は LLM を利用してこれらのルールを合成、検証、進化させます。候補ルールはサンドボックスで厳格なテストを受け、アトミックな更新によってデプロイされるため、制御ループをブロックすることなく安全性が確保されます。さらに、セマンティックルールリポジトリが検証済みのヒューリスティックをインデックス化して検索ベースの初期化に利用することで、問題規模をまたいだ転移性を高めます。GEN-Bench、MK-Bench、JMS-Bench での広範な評価により、RACE-Sched が主要な深層強化学習ベースラインやその他の LLM ベースのベースラインを上回ることが実証されています。このアプローチは、リアルタイムの制約と長期的な推論を調和させ、優れた解の品質と動的イベントへの堅牢な適応を実現します。
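「低レイテンシのルール実行」と「サンドボックス検証後のアトミックな差し替え」という分離の考え方は、次のように小さく表現できます。これは要旨の記述から着想した説明用のスケッチであり、RACE-Sched の実装ではありません。`RuleSlot`、`total_tardiness`、`deliberate` はいずれも仮の名前で、LLM によるルール合成は既存の EDD ルールで代用し、サンドボックスは単一機械での総遅延時間の比較という単純な仮定に置き換えています。

```python
import threading
from typing import Callable, Dict, List

Job = Dict[str, float]
Rule = Callable[[Job], float]  # 値が小さいジョブほど優先


def shortest_processing_time(job: Job) -> float:
    return job["proc_time"]


def earliest_due_date(job: Job) -> float:
    return job["due"]


class RuleSlot:
    """現在有効なルールを保持し、ロックの下で一度に差し替える。"""

    def __init__(self, rule: Rule) -> None:
        self._rule = rule
        self._lock = threading.Lock()

    def get(self) -> Rule:
        with self._lock:
            return self._rule

    def swap(self, rule: Rule) -> None:
        with self._lock:
            self._rule = rule


def dispatch(slot: RuleSlot, queue: List[Job]) -> Job:
    """反応側: 現在のルールで次のジョブを即座に選ぶ。"""
    return min(queue, key=slot.get())


def total_tardiness(rule: Rule, jobs: List[Job]) -> float:
    """サンドボックスの代用: 単一機械でルール順に処理したときの総遅延。"""
    clock, tardiness = 0.0, 0.0
    for job in sorted(jobs, key=rule):
        clock += job["proc_time"]
        tardiness += max(0.0, clock - job["due"])
    return tardiness


def deliberate(slot: RuleSlot, candidate: Rule, scenarios: List[List[Job]]) -> None:
    """熟考側: 全シナリオで現行ルール以下の遅延なら候補に差し替える。"""
    incumbent = slot.get()
    if all(total_tardiness(candidate, s) <= total_tardiness(incumbent, s)
           for s in scenarios):
        slot.swap(candidate)


jobs = [
    {"id": 1, "proc_time": 2, "due": 10},
    {"id": 2, "proc_time": 4, "due": 4},
    {"id": 3, "proc_time": 1, "due": 12},
]
slot = RuleSlot(shortest_processing_time)
first_pick = dispatch(slot, jobs)["id"]  # 検証前は SPT ルールで選択

worker = threading.Thread(target=deliberate, args=(slot, earliest_due_date, [jobs]))
worker.start()
worker.join()
second_pick = dispatch(slot, jobs)["id"]  # 差し替え後は EDD ルールで選択
print(first_pick, second_pick)
```

`dispatch` は検証の完了を待たずに常に現行ルールで動作し、ルールの切り替えは `swap` の一点だけで起こる、という点がこのスケッチの要点です。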

原文 (English)

Harmonizing Real-Time Constraints and Long-Horizon Reasoning: An Asynchronous Agentic Framework for Dynamic Scheduling

The Dynamic Flexible Job Shop Scheduling Problem (DFJSP) necessitates a trade-off between instant reaction to stochastic disturbances and global optimization of production goals. Conventional priority rules are insufficiently flexible to handle complex disruptions, whereas learning-based approaches often compromise interpretability or fail to generalize across problem scales. Although Large Language Models (LLMs) offer advanced reasoning capabilities to bridge this gap, their substantial inference latency is incompatible with the millisecond-level decision cycles of industrial control systems. To resolve this conflict, we introduce RACE-Sched, an asynchronous agent-based framework that decouples policy execution from logical reasoning via a dual-stream architecture. The Reactive Stream executes low-latency symbolic heuristics to enable real-time dispatching, while the parallel Deliberative Stream leverages an LLM to synthesize, validate, and evolve these rules. Candidate rules undergo rigorous testing in a sandbox and are deployed via atomic updates, ensuring safety without blocking the control loop. Additionally, a semantic rule repository indexes validated heuristics for retrieval-based initialization which enhances transferability across problem scales. Extensive evaluations on GEN-Bench, MK-Bench, and JMS-Bench demonstrate that RACE-Sched outperforms leading Deep Reinforcement Learning and other LLM-based baselines. This approach harmonizes real-time constraints with long-horizon reasoning to achieve superior solution quality and robust adaptation to dynamic events.

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

読み取り不可能なもののインデックス化: サービス分類体系の LLM ネイティブな再帰的構築と検索

Internet of Agents (IoA) の時代が形になりつつあります。LLM エージェントは、急速に増加する Model Context Protocol (MCP) サーバー、Agent-to-Agent (A2A) エンドポイント、再利用可能なスキル、その他の LLM から呼び出せるサービスをオーケストレーションすることで、ユーザーの目標を達成することが期待されています。しかし、LLM はこの状況と構造的に噛み合っていません。有効なコンテキストは、サービスの数に応じてスケールしない希少なリソースだからです。数千件のサービス説明をプロンプトに連結するとコンテキストウィンドウがあふれ、ウィンドウが十分に大きい場合でも、モデルは長い入力の中間部にある情報に体系的に十分な注意を向けません。これはよく知られた「Lost-in-the-Middle」現象です。つまりこれは本質的に、サービスディスカバリのためのコンテキスト管理の問題です。これに対処するために、LLM ネイティブな段階的開示 (progressive disclosure) 方式と、その具体的な実装である A2X (Agent-to-Anything サービスディスカバリ) を提案します。A2X は、登録されたサービスを自動的に階層的な分類体系へと整理し、クエリ時にそれを層ごとにたどる LLM 駆動のパイプラインであり、各 LLM 呼び出しが目にするのは、ユーザークエリとの関連性が高い小さな候補集合だけになります。これにより、有効コンテキストの希少性がレジストリサイズから切り離され、検索精度を向上させながらトークン消費が大幅に削減されます。フルコンテキストをそのまま投入する方式と比較して、A2X はプロンプトトークンコストを 9 分の 1 に抑えつつヒット率を 6.2 ポイント向上させます。最先端のオープンソースの埋め込みベースのベースラインと比較すると、A2X はヒット率を 20 ポイント以上改善します。
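階層を層ごとにたどり、各ステップでは子ノードという小さな候補集合だけを見る、という流れは次のようにスケッチできます。A2X の実際のパイプラインではなく、あくまで考え方を示すための例です。`Node`、`relevance`、`walk` は仮の名前で、LLM による関連性判定は単語の重なりを数えるだけの単純な関数で代用しており、分類体系の内容も架空のものです。

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """分類体系のノード (説明用)。葉はサービスに対応する。"""
    name: str
    description: str
    children: List["Node"] = field(default_factory=list)


def relevance(query: str, node: Node) -> int:
    """LLM 判定の代用: クエリと説明文で共通する単語の数。"""
    return len(set(query.lower().split()) & set(node.description.lower().split()))


def walk(root: Node, query: str) -> List[str]:
    """根から葉まで、各層で最も関連性の高い子を 1 つ選んでたどる。"""
    path: List[str] = []
    node = root
    while node.children:
        # 各ステップで見るのは現在のノードの子だけ (小さな候補集合)
        node = max(node.children, key=lambda child: relevance(query, child))
        path.append(node.name)
    return path


taxonomy = Node("root", "all services", [
    Node("travel", "travel flights hotels booking", [
        Node("flight-search", "search flights between airports"),
        Node("hotel-booking", "book hotels rooms"),
    ]),
    Node("finance", "finance payments invoices currency", [
        Node("fx-rates", "currency exchange rates"),
        Node("invoice", "create invoices"),
    ]),
])

path = walk(taxonomy, "search flights to tokyo")
print(path)
```

1 回の判定で提示される候補数は各ノードの子の数で決まるため、登録サービスの総数が増えても 1 ステップあたりのコンテキスト量は木の分岐数に抑えられます。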

原文 (English)

Indexing the Unreadable: LLM-Native Recursive Construction and Search of Service Taxonomies

The era of the Internet of Agents (IoA) is taking shape: LLM agents are expected to fulfill user goals by orchestrating fast-growing populations of Model Context Protocol (MCP) servers, Agent-to-Agent (A2A) endpoints, reusable skills, and other LLM-callable services. Yet LLMs face a structural mismatch with this regime: effective context is a scarce resource that does not scale with the number of services. Concatenating thousands of service descriptions into a prompt overflows the context window, and even when the window is large enough, models systematically under-attend to information in the middle of long inputs, the well-documented Lost-in-the-Middle phenomenon. This is fundamentally a question of context management for service discovery. To address this, we propose an LLM-native progressive-disclosure scheme and its concrete instantiation, A2X (Agent-to-Anything service discovery): an LLM-driven pipeline that automatically organizes the registered services into a hierarchical taxonomy and walks it layer by layer at query time, so that every LLM call sees only a small candidate set highly relevant to the user query. This decouples effective-context scarcity from registry size and significantly reduces token consumption while improving retrieval accuracy. Compared to full-context dumping, A2X achieves a 6.2-point Hit Rate gain at one-ninth the prompt-token cost; compared to the state-of-the-art open-source embedding-based baseline, A2X improves Hit Rate by more than 20 points.

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

CoHyDE: ツール検索のための LLM リライタと高密度エンコーダの反復的共同トレーニング

大規模な API カタログを対象とするツール検索は、LLM エージェントにとって中核的なボトルネックです。ユーザークエリは口語的で、多くの場合仕様が不十分な言葉で届く一方、カタログは技術的な API 語彙を使っており、固定されたエンコーダ単独ではこの隔たりを埋められません。対照学習によるエンコーダのファインチューニングと、凍結した LLM を使った HyDE スタイルのクエリ拡張という 2 つの主要な学習アプローチは、この問題に正反対の側から取り組み、互いに相補的な形で失敗します。ファインチューニングされたエンコーダは、クエリの表層形式がすでにカタログと一致している場合には優れていますが、一致しない場合は性能が崩壊します。一方、ゼロショットの HyDE は仕様が不十分なクエリに対してはより堅牢ですが、カタログを考慮しない仮説的な記述を生成するため、クエリが整形式である場合には検索性能を低下させます。私たちは、高密度エンコーダと LLM リライタを単一の共進化システムとして学習する反復手順 CoHyDE を導入します。エンコーダは、リライタが生成したカタログ形式の仮説的記述を用いて InfoNCE で再学習され、リライタは、エンコーダの検索スコアを基準に DPO によって選好アライメントされます。ループを開始する前に、双方をツールカタログでウォームスタートさせます。ToolBench カタログの約 1 万ツールのサブセットでは、CoHyDE を 3 ラウンド実行することで、最も強力な単一コンポーネントのベースラインと比べて NDCG@5 が標準クエリで +2.5 pp、ホールドアウトした曖昧クエリで +6.3 pp 改善し、最も難しい曖昧クエリの層では最大 +8 pp の向上が見られます。アブレーションにより、共同学習が鍵となる要素であることが確認されました。いずれかのコンポーネントを単独で使用すると、整形式のクエリと曖昧なクエリの両方で CoHyDE に及ばず、曖昧なクエリでは最大 -8 pp の損失が生じます。
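エンコーダの再学習に使われる InfoNCE は、正例のスコアを候補全体に対するソフトマックスで評価する標準的な対照損失です。以下は一般的な定義をそのまま書いた最小限のスケッチで、CoHyDE の学習コードではありません。`info_nce` という関数名、温度の既定値、スコアの値はいずれも説明用の仮定です。

```python
import math
from typing import Sequence


def info_nce(scores: Sequence[float], positive_index: int,
             temperature: float = 0.05) -> float:
    """1 クエリ分の InfoNCE 損失。

    scores は、クエリ (ここでは書き換え後の仮説的記述) と各候補ツールの
    類似度で、positive_index が正解ツールの位置を表す。
    """
    logits = [s / temperature for s in scores]
    peak = max(logits)  # 数値安定化のために最大値を引いてから exp を取る
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[positive_index]


# 正解ツール (位置 0) のスコアが最も高い場合と、そうでない場合
loss_good = info_nce([0.9, 0.2, 0.1], positive_index=0, temperature=0.1)
loss_bad = info_nce([0.2, 0.9, 0.1], positive_index=0, temperature=0.1)
print(loss_good < loss_bad)
```

正解ツールのスコアが他の候補より高いほど損失は小さくなるため、リライタがカタログの語彙に近い記述を生成するほど、エンコーダ側の学習信号も明確になると考えられます。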

原文 (English)

CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval

Tool retrieval over large API catalogs is a core bottleneck for LLM agents: user queries arrive in colloquial, often underspecified language, while the catalog uses technical API vocabulary that no fixed encoder can bridge on its own. The two dominant training approaches, contrastive encoder fine-tuning and HyDE-style query expansion with a frozen LLM, address this problem from opposite ends and fail in complementary directions: the fine-tuned encoder excels when the query's surface form already matches the catalog but collapses when it does not, while zero-shot HyDE is more robust to underspecified queries yet generates catalog-unaware hypothetical descriptions that degrade retrieval when queries are well-formed. We introduce CoHyDE, an iterative procedure that trains the dense encoder and the LLM rewriter as a single co-evolving system: the encoder is retrained with InfoNCE on catalog-style hypothetical descriptions produced by the rewriter, and the rewriter is preference-aligned via DPO against the encoder's retrieval scores, with both sides warm-started on the tool catalog before the loop begins. On a ~10k tool subset of the ToolBench catalog, three rounds of CoHyDE improve over the strongest single-component baseline by +2.5 pp NDCG@5 on standard queries and +6.3 pp on held-out vague queries, with gains as large as +8 pp on the hardest vague tier. Ablations confirm that co-training is the key ingredient: using either component in isolation fails to match CoHyDE on both well-formed and vague queries, with losses of up to -8 pp on vague queries.

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント ビジネス/資金調達

エージェントによる修正と意味評価による人間のような対話型音声認識を目指して

自動音声認識 (ASR) は、人間とコンピューターの対話の中核コンポーネントであり、LLM ベースのアシスタントやエージェントにとってますます重要なフロントエンドになっています。しかし、現在の ASR システムのほとんどは依然としてシングルパスのパラダイムに従っており、誤解を反復的な確認と修正によって解消していく人間のコミュニケーションとはうまく整合しません。この不一致のため、意味に関わる重大なエラーがいったん発生すると、修正することが困難になります。一方、WER や CER などのトークンレベルの指標は、こうした問題を適切に反映できません。これらの限界に対処するために、Interactive ASR をマルチターンの改良タスクとして定式化し、シングルパスの ASR フロントエンドに意味的な修正、インテントルーティング、推論ベースの編集を組み合わせた閉ループフレームワーク Agentic ASR を提案します。さらに、LLM ベースの意味評価指標である文レベル意味エラー率 (Sentence-level Semantic Error Rate, $S^2ER$) と、スケーラブルで再現可能なベンチマークのための対話シミュレーションシステム (Interactive Simulation System) を導入します。多言語、固有表現が多い、コードスイッチングを含む各ベンチマークでの実験では、反復的な対話によって意味的エラーが一貫して減少し、その改善幅は従来のトークンレベル指標よりも $S^2ER$ のほうがはるかに大きいことが示されました。人間と AI の判断の一致度の検証とアブレーションにより、意味判定器の信頼性と提案フレームワークの堅牢性がさらに確認されました。コードは https://interactiveasr.github.io/ で、ライブデモは https://i-asr.sjtuxlance.com/ で公開されています。
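トークンレベルの指標が意味に関わるエラーを区別できない、という点は WER を実際に計算してみると分かりやすくなります。以下は単語単位の編集距離による標準的な WER の実装例で、例文は説明用に用意したものです。$S^2ER$ の計算は LLM による判定を必要とするため、ここでは扱いません。

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """単語単位の編集距離 (置換、挿入、削除) を参照文の単語数で割る。"""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    # prev[j]: ref の先頭 i-1 語と hyp の先頭 j 語の編集距離
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # 削除
                cur[j - 1] + 1,          # 挿入
                prev[j - 1] + (r != h),  # 置換または一致
            ))
        prev = cur
    return prev[-1] / len(ref)


reference = "i can not attend the meeting today"
meaning_flipped = "i can attend the meeting today"    # not が脱落し意味が反転
meaning_kept = "i can not attend a meeting today"     # 冠詞の誤りのみ

print(word_error_rate(reference, meaning_flipped))  # どちらも 1/7
print(word_error_rate(reference, meaning_kept))
```

2 つの認識結果は WER では同じ値になりますが、前者は文の意味を反転させています。文単位で意味が保たれているかを判定する指標が必要になるのは、このような場合です。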

原文 (English)

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER or CER cannot adequately reflect such a problem. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate ($S^2ER$), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in $S^2ER$ than in conventional token-level metrics. Human-AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at: https://interactiveasr.github.io/ and the live demo is available at https://i-asr.sjtuxlance.com/

2026-05-29 13:00 JST arXiv cs.AI エージェント

VitalAgent: ウェアラブル健康データに対する反応的および積極的な生理学的モニタリングのためのツール拡張エージェント

ウェアラブルデバイスにより、ECG や PPG などの生理学的信号の継続的なモニタリングが可能になりますが、既存の mHealth システムは、タスク固有の予測パイプラインか、静的な要約に対する反応的な質問応答にほぼ限定されています。これらのシステムは、時間的推論、永続的な生理学的コンテキスト、長期的な信号ストリームに対するプロアクティブなモニタリングをサポートできません。私たちは、反応的な質問応答とプロアクティブなモニタリングの両方をサポートする、ECG/PPG ベースの mHealth 向けツール拡張型エージェントフレームワーク VitalAgent を提案します。VitalAgent は、縦断的な生理学的メモリと、生の信号に対する動的な計算を可能にするツール拡張型の推論インターフェイスに基づいて構築されています。さらに、縦断的な生理学的モニタリングのベンチマークデータセット VitalBench を紹介します。VitalBench は、反応的な質問応答のための 1,862 件の QA ペアと、プロアクティブなモニタリングのための 90.2 時間の連続 ECG/PPG 記録で構成され、心臓、身体活動、ストレスに関連するタスクをカバーします。実験では、VitalAgent が反応的評価においてプロンプトベースおよび ReAct のベースラインと比較して 30% 以上の改善を達成し、長期の生理学的信号に対するプロアクティブなアラートモニタリングをサポートすることが実証され、動的なツール利用と長期的な生理学的モニタリングの重要性が浮き彫りになりました。

原文 (English)

VitalAgent: A Tool-Augmented Agent for Reactive and Proactive Physiological Monitoring over Wearable Health Data

Wearable devices enable continuous monitoring of physiological signals such as ECG and PPG, but existing mHealth systems are largely limited to task-specific prediction pipelines or reactive question answering over static summaries. They lack the ability to support temporal reasoning, persistent physiological context, and proactive monitoring over long-term signal streams. We propose VitalAgent, a tool-augmented agentic framework for ECG/PPG-based mHealth that supports both reactive question answering and proactive monitoring. VitalAgent is built on a longitudinal physiological memory and a tool-augmented reasoning interface that enables dynamic computation over raw signals. We further introduce VitalBench, a longitudinal physiological monitoring benchmark dataset comprising 1,862 QA pairs for reactive question answering and 90.2 hours of continuous ECG/PPG recordings for proactive monitoring, covering cardiac, physical activity, and stress-related tasks. Experiments demonstrate that VitalAgent achieves over 30% improvement over prompt-based and ReAct baselines in reactive evaluation and supports proactive alert monitoring over long-term physiological signals, highlighting the importance of dynamic tool use and long-term physiological monitoring.

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

有用性の呪い: DistractionIF を介したディストラクタの指示に対するロバスト性の逆スケーリング則

大規模言語モデル (LLM) は、エージェント型システムや検索拡張生成 (RAG) システムに導入されることが増えており、そこでは外部から与えられた参照テキストに対してユーザーが指定したタスクを実行する必要があります。実際には、こうしたコンテキストは構造化されていないことが多く、編集コメントやシステムトレースのような、無害ではあるが指示に似たセマンティックノイズで汚染されています。これらは厳密にデータとして扱われるべきものです。私たちは、参照テキスト内のこうしたディストラクタ指示に対する堅牢性を評価するためのベンチマーク DistractionIF を紹介します。幅広いモデルにわたって、一貫した逆スケーリング現象が観察されました。大きなモデルほど堅牢性が低いことが多く、スケールが大きくなるにつれて性能が最大 30 ポイント低下します。メカニズムの面では、パープレキシティ分析により、スケーリングが堅牢な振る舞いと注意をそらされた振る舞いの間の確率的な境界を侵食し、モデルがノイズを指示として過剰に解釈しやすくなることが明らかになりました。これに対処するために、強化学習、特に Group Relative Policy Optimization (GRPO) がこの境界を回復させ、一般的な指示追従能力を損なうことなく堅牢性を最大 15.5% 向上させられることを示します。私たちの知見は、参照テキストに基づくタスクにおける指示追従の堅牢性に重大なギャップがあることを浮き彫りにし、データと指示の厳密な分離を大規模に徹底するための有望な道筋として強化学習を位置づけます。

原文 (English)

The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF

Large Language Models (LLMs) are increasingly deployed in agentic and retrieval-augmented generation (RAG) systems, where they must execute user-specified tasks over externally provided reference text. In practice, such context is often unstructured and contaminated with benign but instruction-like semantic noise, such as editorial comments and system traces, which should be treated strictly as data. We introduce DistractionIF, a benchmark designed to evaluate robustness against such distractor instructions in reference text. Across a broad range of models, we observe a consistent inverse scaling phenomenon: larger models are often less robust, with performance dropping by up to 30 points as scale increases. Mechanistically, our perplexity analysis reveals that scaling erodes the probabilistic boundary between robust and distracted behaviors, making models increasingly prone to over-interpreting noise as instructions. To address this, we demonstrate that reinforcement learning, specifically Group Relative Policy Optimization (GRPO), can restore this boundary, improving robustness by up to 15.5% without compromising general instruction-following capability. Our findings highlight a critical instruction-following robustness gap in reference-grounded tasks and establish reinforcement learning as a promising path for enforcing strict data-instruction separation at scale.

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

MINDGAMES: マルチエージェント LLM の社会的および戦略的推論を評価するためのライブアリーナ

大規模言語モデル (LLM) は対話型エージェントとして導入されることが増えていますが、長時間にわたるやり取りの中での社会的推論と戦略的推論の能力は、まだ十分に理解されていません。既存の評価は静的なビネットや単一ゲームのベンチマークに依存しており、現実世界のマルチエージェント環境で求められる持続的で多面的な推論を捉えられません。私たちは、LLM エージェント向けのマルチゲームアリーナ兼評価プラットフォームである Mindgames を紹介します。Mindgames は、「心の理論」に関連する相補的な推論要件、すなわち隠された情報の下での信念の帰属、繰り返される戦略的相互作用を通じた対戦相手のモデル化、知識の非対称性の下での協調的な推論、社会的推理ゲームにおける持続的な欺瞞を、評価可能な形に具体化します。TextArena 上に構築された Mindgames は、統一されたインタラクションインターフェイス、TrueSkill ベースのレーティング、4 つのゲーム環境にわたる完全な軌跡ログを提供します。私たちは、主要な AI カンファレンスで開催された 2025 年の競技サイクルを通じて Mindgames を実施し、Colonel Blotto、Iterated Prisoner's Dilemma、Codenames、Secret Mafia の 4 つのゲームで、76 チームから提出された 944 個のエージェントを評価しました。分析の結果、エージェントレベルと評価レベルの両方の限界が明らかになりました。ルール遵守の脆弱さが依然として大きなボトルネックであり、上位のシステムは明示的な構造的足場 (スキャフォールディング) に繰り返し依存しており、リーダーボードの妥当性は環境によって大きく異なります。特に、失敗の多い環境では、戦略的能力と同じくらい、対戦相手のエラーに対する堅牢性が報われる可能性があり、Secret Mafia はこのサイクルで顕著な「エラー生存」の交絡を示しました。私たちは、ターンレベルの観察、アクション、報酬を含む 29,571 件のマルチエージェントゲームのデータセットを、MG-Ref とともに公開します。MG-Ref は、上位にランクされたエラーの少ない Stage II 提出物を凍結した参照プールと対戦させ、本分析と同じエラー帰属の観点で新しいエージェントを採点する、決定論的なオフライントーナメントプロトコルです。

原文 (English)

MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs

Large language models (LLMs) are increasingly deployed as interactive agents, yet their capacity for social and strategic reasoning over extended interaction remains poorly understood. Existing evaluations rely on static vignettes or single-game benchmarks that cannot capture the sustained, multi-faceted reasoning that real-world multi-agent settings demand. We introduce Mindgames, a multi-game arena and evaluation platform for LLM agents that operationalizes complementary reasoning demands relevant to "theory of mind": belief attribution under hidden information, opponent modeling through repeated strategic interaction, cooperative inference under knowledge asymmetries, and sustained deception in social deduction. Built on TextArena, Mindgames provides a unified interaction interface, TrueSkill-based rating, and full trajectory logging across four game environments. We instantiate Mindgames through a 2025 competition cycle hosted at a major AI conference, which assessed 944 submitted agents from 76 teams across four games: Colonel Blotto, Iterated Prisoner's Dilemma, Codenames, and Secret Mafia. Our analysis surfaces both agent-level and evaluation-level limitations: brittle rule adherence remains a major bottleneck, top-performing systems repeatedly rely on explicit structural scaffolding, and leaderboard validity differs sharply across environments. In particular, failure-heavy environments can reward robustness to opponent errors as much as strategic ability, with Secret Mafia exhibiting a pronounced error-survival confound in this cycle. We release a dataset of 29,571 multi-agent games with turn-level observations, actions, and rewards, together with MG-Ref, a deterministic offline tournament protocol that scores new agents against a frozen reference pool of top-ranked, low-error Stage II submissions under the same error-attribution lens used in this analysis.

2026-05-29 13:00 JST arXiv cs.AI エージェント

UI-KOBE: 軽量のグラフガイド付き GUI エージェントのための知識指向の動作探索

モバイル GUI エージェントの最近の進歩は、モバイルタスクの自動化に大きな可能性を示していますが、効果的なシステムの多くは、スクリーンショットの理解と長期的な計画を依然として大規模な視覚言語モデルに依存しています。モバイルデバイスに直接デプロイできる小型の GUI エージェントは、推論コストが低く、デバイス上の機密情報をより適切に保護できるため、実用上はより魅力的です。しかし、モデルの容量が限られているため、こうした軽量エージェントは、スクリーンショットだけから GUI タスクをエンドツーエンドで計画し実行する場合には依然として信頼性が低いままです。私たちは、再利用可能なアプリ固有のグラフ知識によって軽量モバイル GUI エージェントを改善するフレームワーク、Knowledge-Oriented Behavior Exploration (UI-KOBE) を提案します。UI-KOBE はまずモバイルアプリケーションを自律的に探索し、ノードが個々の UI 状態を、エッジが実行可能な遷移を表すアプリナレッジグラフを構築します。実行時には、軽量 GUI エージェントがこのグラフを外部ガイダンスとして利用します。ユーザータスクと現在のスクリーンショットが与えられると、現在のグラフノードを特定し、そのノードに関連付けられた自己ループアクション、隣接ノードへの遷移、タスクの完了、フォールバック用の自由アクションの中から選択します。UI-KOBE は、アプリ固有のグラフガイダンスで実行時の意思決定を支えることで、エンドツーエンドの GUI 計画の負担を軽減し、軽量モデルがモバイル GUI タスクをより効果的に実行できるようにします。これは、効率的で解釈可能、かつプライバシーに配慮したオンデバイス GUI エージェントに向けた実用的な一歩です。
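ノードが UI 状態、エッジが遷移を表すグラフから、現在のノードで選べる行動の候補を列挙する部分は、次のようなデータ構造で表現できます。これは要旨の説明から組み立てた説明用のスケッチで、UI-KOBE の実際のグラフ形式ではありません。`UINode`、`candidate_actions`、および画面名やアクション名はすべて仮のものです。

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class UINode:
    """1 つの UI 状態 (説明用)。"""
    name: str
    self_loops: List[str] = field(default_factory=list)       # 同じ状態にとどまる操作
    transitions: Dict[str, str] = field(default_factory=dict)  # 操作 -> 遷移先ノード名


def candidate_actions(graph: Dict[str, UINode], current: str) -> List[Tuple[str, str]]:
    """現在のノードでエージェントに提示する (種類, 操作) の候補を返す。"""
    node = graph[current]
    return (
        [("self_loop", action) for action in node.self_loops]
        + [("transition", action) for action in node.transitions]
        + [("complete", "task_done"), ("fallback", "free_action")]
    )


graph = {
    "home": UINode("home", ["scroll_feed"], {"tap_settings": "settings"}),
    "settings": UINode("settings", ["toggle_dark_mode"], {"tap_back": "home"}),
}

candidates = candidate_actions(graph, "home")
for kind, action in candidates:
    print(kind, action)
```

軽量モデルは自由形式で操作を生成する代わりに、この短い候補リストから 1 つを選べばよくなるため、スクリーンショットだけに基づく計画よりも判断の範囲が狭まります。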

原文 (English)

UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents

Recent advances in mobile GUI agents have shown strong potential for automating mobile tasks, but most effective systems still depend on large vision-language models for screenshot understanding and long-horizon planning. Small GUI agents that can be deployed directly on mobile devices are more attractive for practical use, offering lower inference cost and better protection of sensitive on-device information. However, due to limited model capacity, such lightweight agents remain unreliable when planning and executing GUI tasks end-to-end from screenshots alone. We propose Knowledge-Oriented Behavior Exploration (UI-KOBE), a framework that improves lightweight mobile GUI agents with reusable app-specific graph knowledge. UI-KOBE first autonomously explores a mobile application and constructs an app knowledge graph, where nodes represent distinct UI states and edges represent executable transitions. At runtime, a lightweight GUI agent uses the graph as external guidance: given a user task and the current screenshot, it identifies the current graph node and selects among self-loop actions, neighboring transitions, task completion, or fallback free actions associated with that node. By supporting runtime decisions with app-specific graph guidance, UI-KOBE reduces the burden of end-to-end GUI planning and helps lightweight models perform mobile GUI tasks more effectively, offering a practical step toward efficient, interpretable, and privacy-conscious on-device GUI agents.

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

Battery-Sim-Agent: LLM-Agent を活用した逆バッテリパラメータ推定

バッテリーの忠実度の高い「デジタルツイン」のパラメーター化は、バッテリーの革新のペースを妨げる、重要かつ困難な逆問題です。一般的な手法では、これをブラックボックス最適化 (BBO) タスクとして定式化し、サンプル効率が低く、基礎となる物理学を認識しないアルゴリズムを採用します。この研究では、逆問題を推論タスクとして再構成する新しいパラダイムを導入し、忠実度の高いバッテリーシミュレーターを備えた閉ループで大規模言語モデル (LLM) エージェントを展開する最初のフレームワークである Battery-Sim-Agent を紹介します。このエージェントは人間の科学者のワークフローを模倣します。シミュレーターからの豊富なマルチモーダルなフィードバックを解釈し、矛盾を説明するために物理的に根拠のある仮説を形成し、構造化されたパラメーターの更新を提案します。さまざまなバッテリーの化学的性質、動作条件、難易度レベルにまたがる体系的に構築されたベンチマークスイートでは、当社のエージェントは、正確なパラメーターを特定する点で、ベイジアン最適化などの強力な BBO ベースラインを大幅に上回ります。さらに、複雑な長期劣化フィッティングタスクにおけるフレームワークの機能を実証し、現実世界のバッテリーデータセットに対するその実用的な適用可能性を検証します。私たちの結果は、科学的発見とバッテリーパラメータ推定のための推論ベースのオプティマイザーとしての LLM エージェントの可能性を強調しています。

原文 (English)

Battery-Sim-Agent: Leveraging LLM-Agent for Inverse Battery Parameter Estimation

Parameterizing high-fidelity "digital twins" of batteries is a critical yet challenging inverse problem that hinders the pace of battery innovation. Prevailing methods formulate this as a black-box optimization (BBO) task, employing algorithms that are sample-inefficient and blind to the underlying physics. In this work, we introduce a new paradigm that reframes the inverse problem as a reasoning task, and present Battery-Sim-Agent, the first framework to deploy a Large Language Model (LLM) agent in a closed loop with a high-fidelity battery simulator. The agent mimics a human scientist's workflow: it interprets rich, multi-modal feedback from the simulator, forms physically-grounded hypotheses to explain discrepancies, and proposes structured parameter updates. On a systematically constructed benchmark suite spanning diverse battery chemistries, operating conditions, and difficulty levels, our agent significantly outperforms strong BBO baselines like Bayesian optimization in identifying accurate parameters. We further demonstrate the framework's capability in complex long-horizon degradation fitting tasks and validate its practical applicability on real-world battery datasets. Our results highlight the promise of LLM-agents as reasoning-based optimizers for scientific discovery and battery parameter estimation.

2026-05-29 13:00 JST arXiv cs.AI エージェント ハードウェア/半導体

大規模言語モデルに基づくマルチエージェントフレームワークによる共同ストーリーテリングの改善

共創、つまり AI エージェントが人間と対話して成果物 (アートなど) を生み出すというテーマは、近年大きな注目を集めています。しかし、ほとんどの研究は、デジタル環境における大人の人間とのインタラクションに焦点を当てています。本論文では、子どもと大規模言語モデル (LLM) が物理的なボードゲームを通じてやり取りし、文章の物語を作るという、新しい遊戯的 (ludic) な共創シナリオを検討します。私たちの目標は、幼いプレイヤーに適した高品質な物語を生成できるマルチエージェントフレームワークを開発することです。アプローチの中核は、ある LLM が物語を生成し、別の LLM がそれを評価して改良のためのフィードバックを与える、反復的なライターとエディターのプロセスです。複数の LLM を用いたシミュレーション研究を通じて、この反復的なやり取りにより、ループを重ねるごとに生成された物語の知覚品質が一貫して向上することを示します。この結果は、インタラクティブなストーリーテリングシステムで高品質な出力を得るには、少数の改良ステップで十分である可能性を示唆しています。

原文 (English)

Improving Collaborative Storytelling with a Multi-Agent Framework Based on Large Language Models

The topic of Co-creation, i.e., AI agents interacting with humans to generate outputs (e.g., art), has gained significant attention recently. However, most studies focus on adult-human interactions in a digital setting. This paper explores a novel ludic co-creation scenario involving children and Large Language Models (LLMs) interacting through a physical board game to create written stories. Our goal is to develop a multi-agent framework capable of producing high-quality narratives suitable for young players. At the core of our approach is an iterative Writer-Editor process in which one LLM generates stories while another evaluates them and provides feedback for refinement. Through a simulation study involving multiple LLMs, we show that this iterative interaction consistently improves the perceived quality of generated stories across successive loops. The results indicate that a small number of refinement steps may be sufficient to achieve high-quality outputs in interactive storytelling systems.

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

PTCG-Bench: LLM エージェントはポケモントレーディングカードゲームをマスターできるか?

戦略的に複雑なボードゲームであっても、人間のプレイヤーは数ラウンドプレイするだけで、すぐに戦略を立てられるようになります。自律型エージェントも現実的な対話型環境で同様の能力を必要としますが、既存のエージェントベンチマークは、こうした戦略的で変化し続ける意思決定シナリオを十分に捉えられていないことがよくあります。私たちは、ポケモントレーディングカードゲーム (PTCG) に基づいて構築されたベンチマーク PTCG-Bench を紹介します。このベンチマークは、(1) 単一の複雑な環境内での意思決定性能、(2) 蓄積された経験を通じて自己進化する能力という、2 つの相補的なレベルで LLM エージェントを評価します。さらに、エージェントの性能をモデル自体の能力と混同せずに解釈しやすくするため、モジュール式のハーネスアブレーションも含めています。実験では、LLM エージェントは無視できない水準のゲームプレイ性能を達成できるものの、持続的かつ安定した自己進化は依然として難しく、性能がハーネスの設計に左右されることが示されました。PTCG-Bench が、現実的な対話型環境におけるハーネスを考慮した自己進化エージェントの今後の研究を促進することを期待しています。

原文 (English)

PTCG-Bench: Can LLM Agents Master Pokémon Trading Card Game?

Given a strategically complex board game, human players can quickly learn to devise strategies after playing a few rounds. Autonomous agents require similar capabilities in realistic interactive environments, yet existing agent benchmarks often fail to fully capture such strategic and evolving decision-making scenarios. We present PTCG-Bench, a benchmark built on the Pokémon Trading Card Game (PTCG) that evaluates LLM agents at two complementary levels: (1) their decision-making performance within a single complex environment, and (2) their ability to self-evolve through accumulated experience. We further include a modular harness ablation to better interpret agent performance without conflating it with model capability. Our experiments show that, although LLM agents can achieve non-trivial gameplay performance, sustained and stable self-evolution remains challenging, and performance is sensitive to harness design. We hope that PTCG-Bench will facilitate future research on harness-aware and self-evolving agents in realistic interactive environments.

2026-05-29 13:00 JST arXiv cs.AI LLM/生成AI エージェント

GRASP: 自己改善型 LLM エージェントのためのゲート回帰認識スキル提案者

構造化された環境で動作する LLM エージェントは、会話面ではなく操作面で失敗し、その信頼性は環境に関する手続き的知識に左右されます。従来の自己改善手法は、新しい項目がそれまで正しかった動作を維持しているかを確認せずに自然言語のガイダンスを蓄積するため、ある軌跡を修正するメモが、気づかれないまま別の軌跡を退行させる可能性があります。私たちは GRASP (Gated Regression-Aware Skill Proposer) を導入します。GRASP は、エージェントの改善をサイズに上限のあるスキルライブラリへの一連の編集として扱い、厳格な回帰バジェットの下で、バランスの取れたホールドアウトプローブに対して純改善をもたらす場合にのみ各候補を採用します。2 つの FHIR ベースの臨床ベンチマークで、5 つのベースモデル (gpt-oss-120b、DeepSeek V4 Flash、Gemini 3.1 Flash Lite、GPT-4.1、GPT-5.4) にわたって GRASP を評価しました。MedAgentBench では、GRASP は gpt-oss-120b を 40.6% から 88.8% に引き上げ、5 つの自己改善ベースラインのうち最も強力なものを 21.0 ポイント上回り、他のすべてのベースモデルも 17.2 から 40.3 ポイント改善しました。アブレーションによれば、この向上はスキルの記述そのものではなく、比較に基づく提案生成、採用ゲート、厳格な回帰バジェットによるものであり、検証なしのスキル記述はスキルを使わない場合と変わりません。このメカニズムは臨床領域を超えて一般化し、4 つの非臨床環境のうち 3 つでエージェントを改善し、アクション空間がオープンエンドである環境でのみ横ばいにとどまりました。凍結したライブラリはモデル間で転移します。より強力なモデルから得たスキルは、弱い実行モデルを、それが自力で学習する以上に向上させますが、その逆は成り立ちません。この非対称性は、ゲートのないベースラインでは再現されません。
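「純改善があり、かつ回帰が上限以内のときだけ候補を採用する」というゲートの考え方は、プローブタスクごとの成否を編集の前後で比較する形でスケッチできます。以下は要旨の記述を筆者なりに解釈した説明用の例で、GRASP の実際の判定基準や実装ではありません。`admit` という関数名、成否を真偽値のリストで表す形式、バジェットの値はいずれも仮定です。

```python
from typing import Sequence


def admit(before: Sequence[bool], after: Sequence[bool],
          regression_budget: int) -> bool:
    """スキル編集の候補を採用するかどうかを返す。

    before / after は、ホールドアウトプローブの各タスクが
    編集前と編集後に成功したかどうかを同じ順序で並べたもの。
    """
    if len(before) != len(after):
        raise ValueError("before and after must cover the same probe tasks")
    fixed = sum(1 for b, a in zip(before, after) if not b and a)
    regressed = sum(1 for b, a in zip(before, after) if b and not a)
    net_gain = fixed - regressed
    # 純改善が正で、かつ回帰の件数が上限以内の場合のみ採用
    return net_gain > 0 and regressed <= regression_budget


before = [True, True, False, False]
safe_edit = [True, True, True, False]    # 1 件修正、回帰なし
risky_edit = [False, True, True, True]   # 2 件修正、1 件回帰

print(admit(before, safe_edit, regression_budget=0))
print(admit(before, risky_edit, regression_budget=0))
```

2 つ目の候補は合計では成功数が増えていますが、それまで成功していたタスクを 1 件壊しているため、バジェットが 0 の場合は採用されません。合計スコアだけを見る方法では検出できない退行を、このように明示的に制限できます。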

原文 (English)

GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents

LLM agents acting in structured environments fail in operational rather than conversational ways, and reliability depends on procedural knowledge of the environment. Prior self-improvement methods accumulate natural-language guidance without checking that each new item preserves previously correct behavior, so a note that fixes one trajectory can silently regress another. We introduce GRASP (Gated Regression-Aware Skill Proposer), which treats agent improvement as a sequence of edits to a bounded skill library, admitting each candidate only if it produces a net improvement on a balanced held-out probe under a hard regression budget. We evaluate GRASP across five base models (gpt-oss-120b, DeepSeek V4 Flash, Gemini 3.1 Flash Lite, GPT-4.1, GPT-5.4) on two FHIR-based clinical benchmarks. On MedAgentBench, GRASP lifts gpt-oss-120b from 40.6% to 88.8%, exceeds the strongest of five self-improvement baselines by 21.0 points, and improves every other base model by 17.2 to 40.3 points. Ablations attribute the gain to comparative proposal generation, the acceptance gate, and the hard regression budget rather than to skill writing itself, which without validation is no better than using no skills. The mechanism generalizes beyond the clinical domain, improving agents on three of four non-clinical environments and remaining flat only where the action space is open-ended. Frozen libraries transfer across models, where skills from a stronger model improve weaker executors beyond what they learn for themselves while the reverse does not, an asymmetry that no ungated baseline reproduces.

2026-05-29 13:00 JST · arXiv cs.AI · Agents · Research/Papers

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems

Large language models in Agentic AI systems consume tool schemas and execution results and emit tool invocations as structured data. The default language for that exchange, JSON, was designed for application-to-application interchange rather than token efficiency, so its structural elements impose substantial token overhead. Recent work proposes token-optimized alternatives such as TOON (Token-Oriented Object Notation) and TRON (Token Reduced Object Notation) as more compact replacements, but these formats have been evaluated only on isolated comprehension or generation tasks. Whether their token reductions hold inside end-to-end agentic loops therefore remains an open question. We evaluate TOON and TRON on four agentic benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, StableToolBench) and five open-weight LLMs, decoupling input compression from output compression to measure comprehension and generation independently. TRON reduces tokens by up to 27% with accuracy within 14pp of the JSON baseline. TOON achieves up to 18% reduction at a similar 9pp accuracy cost, but additionally cascades on multi-turn parsing failures and collapses parallel tool-call output for most models.
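
To make the source of JSON's structural overhead concrete, the sketch below compares minified JSON with a header-plus-rows encoding that states each key once. The encoding is a hypothetical stand-in, not the TOON or TRON syntax, and character counts are only a rough proxy for tokens, which depend on the tokenizer.

```python
import json

records = [
    {"id": 1, "tool": "search", "status": "ok"},
    {"id": 2, "tool": "fetch", "status": "ok"},
    {"id": 3, "tool": "search", "status": "error"},
]


def to_compact(rows: list[dict]) -> str:
    """Hypothetical header-plus-rows encoding: keys appear once, values are comma-separated."""
    keys = list(rows[0])
    lines = [",".join(keys)]
    for row in rows:
        lines.append(",".join(str(row[k]) for k in keys))
    return "\n".join(lines)


def from_compact(text: str) -> list[dict]:
    """Inverse of to_compact; every value comes back as a string."""
    header, *body = text.split("\n")
    keys = header.split(",")
    return [dict(zip(keys, line.split(","))) for line in body]


as_json = json.dumps(records, separators=(",", ":"))
as_compact = to_compact(records)
saving = 1 - len(as_compact) / len(as_json)
print(f"JSON: {len(as_json)} chars, compact: {len(as_compact)} chars, saving: {saving:.0%}")
```

The round trip through `from_compact` returns every value as a string, so the integer ids lose their type. That is one small example of the fidelity a more compact notation can give up.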

2026-05-29 13:00 JST · arXiv cs.AI · Agents

Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling

In Agentic Search, trajectory-level outcome rewards fail to quantify the behavioral contributions of individual steps, while existing step-level reward methods typically rely on costly tree sampling. We view world knowledge as a latent world graph and each information-seeking (IS) task as search within a latent task graph, where effective steps should make graph progress toward the answer node. Based on this prior, we propose Graph-Distance Contribution Reward (GDCR), a step-level process reward that scores newly-retrieved and newly-cited entities by their distance to the answer node in a training-time Entity-Relation (ER) graph. We further propose Step Advantage Policy Optimization (SAPO), which converts GDCR into step-level advantages and combines them with trajectory-level outcome advantages. Experiments on four challenging benchmarks validate the effectiveness of our method.
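
The idea of scoring a step by how close its newly surfaced entities are to the answer node can be sketched as follows. The scoring rule `1 / (1 + distance)`, the mean-centring in `combine`, and the `weight` parameter are assumptions made for illustration; the abstract does not give the actual GDCR or SAPO formulas, and the toy graph is invented.

```python
from collections import deque


def distances_to(answer: str, graph: dict[str, set[str]]) -> dict[str, int]:
    """Breadth-first search from the answer node over an undirected entity-relation graph."""
    dist = {answer: 0}
    queue = deque([answer])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def step_rewards(steps: list[set[str]], dist: dict[str, int]) -> list[float]:
    """Score each step by the entities it newly surfaces; closer to the answer scores higher."""
    seen: set[str] = set()
    rewards = []
    for entities in steps:
        new = entities - seen
        # Assumed scoring rule: 1 / (1 + distance); unreachable entities score 0.
        rewards.append(sum(1.0 / (1 + dist[e]) for e in new if e in dist))
        seen |= entities
    return rewards


def combine(step_r: list[float], outcome_adv: float, weight: float = 0.5) -> list[float]:
    """Mean-centre step rewards into step advantages, then add the trajectory-level advantage."""
    mean = sum(step_r) / len(step_r)
    return [outcome_adv + weight * (r - mean) for r in step_r]


graph = {
    "Marie Curie": {"Pierre Curie", "Radium"},
    "Pierre Curie": {"Marie Curie", "Sorbonne"},
    "Radium": {"Marie Curie"},
    "Sorbonne": {"Pierre Curie"},
}
dist = distances_to("Sorbonne", graph)
rewards = step_rewards([{"Radium"}, {"Marie Curie", "Radium"}, {"Pierre Curie"}], dist)
advantages = combine(rewards, outcome_adv=1.0)
```

Because only entities not seen in earlier steps are scored, re-retrieving `Radium` in the second step earns nothing extra. A single breadth-first pass per question also avoids the repeated rollouts that tree sampling requires.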

2026-05-29 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices

Trajectory prediction is a fundamental task for autonomous systems, requiring complex reasoning about multi-agent interactions and intents. Large language models (LLMs) have recently been adopted for this task, as they provide strong contextual reasoning and interpretable, language-based trajectory representations. However, these LLM-based predictors are extremely memory- and compute-intensive, making them difficult to deploy on resource-constrained edge devices such as on-board computers in autonomous robots. To bridge this gap, we propose BitTP, which converts an LLM-based trajectory predictor into a lightweight bitlinear architecture. We demonstrate that weight-only quantization to 1.58-bit (BitTP-Weight) is optimal. Crucially, activations must remain in full precision, as quantizing them leads to severe degradation and instability in spatio-temporal reasoning. Empirically, BitTP-Weight not only preserves but improves prediction quality over the full-precision (BF16) LLM baseline, reducing ADE by 14.29% and FDE by 20.97% on average, while simultaneously reducing memory usage and inference latency relative to other quantization methods. These results demonstrate that carefully designed quantization acts as an effective regularizer, enabling the practical deployment of sophisticated LLM-based reasoning on edge devices. Code is available at: https://github.com/MintCat98/BitTP.
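
For readers unfamiliar with 1.58-bit weights, the sketch below shows a weight-only ternary quantizer of the absmean kind commonly used in BitNet-style models, with activations left in full precision. Whether BitTP uses exactly this rule is an assumption; the abstract does not specify it, and the function names and sample weights are illustrative.

```python
import math


def quantize_ternary(weights: list[list[float]], eps: float = 1e-8):
    """Absmean ternary quantization: scale by mean |w|, round, clip to {-1, 0, 1}."""
    flat = [abs(w) for row in weights for w in row]
    scale = sum(flat) / len(flat) + eps
    quantized = [[max(-1, min(1, round(w / scale))) for w in row] for row in weights]
    return quantized, scale


def bitlinear(x: list[float], quantized: list[list[int]], scale: float) -> list[float]:
    """Weight-only quantized linear layer: activations x stay in full precision."""
    return [scale * sum(q * xi for q, xi in zip(row, x)) for row in quantized]


weights = [[0.9, -0.05, -1.2], [0.1, 0.6, -0.4]]
q, s = quantize_ternary(weights)
y = bitlinear([1.0, 2.0, 3.0], q, s)
bits_per_weight = math.log2(3)  # three states per weight, about 1.58 bits
```

Each weight takes one of three values, so the information content is log2(3) bits per weight, which is where the "1.58-bit" label comes from. The matrix product reduces to additions and subtractions of activations followed by one multiplication by the scale.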

2026-05-29 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents · Business/Funding

Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence

The impressive performance of generalist large language models (LLMs) such as GPT and Claude in healthcare raises a critical question: will domain-specific medical specialist models become obsolete? We argue that the future of medical artificial intelligence (AI) lies not in building monolithic medical foundation models, nor in replacing human expertise, but in orchestrating collaboration among generalist LLMs, domain-specific specialist models, and clinicians. We propose HetMedAgent, a heterogeneous medical multi-agent framework that enables conflict-aware evidence fusion, uncertainty-based clinician intervention triggering, and adaptive threshold calibration. Experiments on three real-world clinical decision-making tasks demonstrate that the synergy between generalist LLMs and domain-specific specialist models significantly outperforms using either type of model alone, validating the irreplaceable value of specialist models in modality-specific analysis. HetMedAgent represents a shift from building medical LLMs or foundation models to multi-agent collaboration, achieving a balance between general reasoning capabilities and domain-specific precision.

2026-05-29 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

SkillsInjector: Dynamic Skill Context Construction for LLM Agents

LLM agents now draw on growing skill libraries to handle complex tasks. However, injecting more skills does not always improve task completion and can even degrade it. Existing methods still treat skill injection as a static step, selecting skills with fixed criteria, fixing the budget in advance, and leaving descriptions unchanged. We argue that this static treatment can undermine the utility of skills, because which skills are exposed, how many are included, and how they are presented all affect downstream performance. We propose SkillsInjector, a two-stage adaptive method that jointly addresses these decisions. First, a context planner learns execution-grounded skill preferences and admits an adaptive number of skills for each task. A set-aware renderer then tailors how selected descriptions are presented relative to their co-injected neighbors. Across tau2-bench, SkillsBench, and ALFWorld, SkillsInjector achieves the highest score, improving over the strongest baseline by 3.9, 6.1, and 7.3 percentage points, respectively. Ablation studies show that skill selection, adaptive budgeting, and set-aware rendering each contribute to the gain. These results show that skill-augmented agents benefit from optimizing the injected context itself. Code will be released upon publication.

2026-05-29 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite their effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize their own knowledge boundaries, blindly triggering searches when internal knowledge suffices and failing to terminate search even when adequate evidence has been collected. The lack of self-awareness leads to severe over-search, incurring substantial inference latency and prohibitive computational cost. To this end, we propose SAAS, a novel RL framework designed to cultivate dynamic self-awareness that precisely regulates search behavior without compromising accuracy. SAAS introduces three key components: (i) a search boundary modeling mechanism, which identifies the search boundary under the evolving policy by contrasting search-disabled and search-enabled rollouts; (ii) a boundary-aware reward module, which translates this boundary awareness into trajectory-level penalties, suppressing unnecessary and redundant searches; and (iii) a stage-wise optimization strategy, which leverages a sequential curriculum to prioritize reasoning over search regularization, thereby avoiding reward hacking. Extensive experiments demonstrate that SAAS substantially reduces over-search while maintaining accuracy. Our code is anonymously released at https://github.com/XMUDeepLIT/SAAS.

2026-05-29 13:00 JST · arXiv cs.AI · Agents

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, advanced frontier AI models drastically lower attack barriers, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to accommodate emergent risks from Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) using only around 1k samples, achieving comparable performance with leading closed-source models (e.g., GPT-5.4). Based on AgentDoG 1.5, we construct a highly efficient agentic safety SFT and RL training environment, which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experimental results indicate that AgentDoG 1.5 achieves state-of-the-art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.

2026-05-29 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents · Research/Papers

Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories

LLM-based agents have demonstrated strong capabilities in solving complex tasks through multi-step reasoning and tool use. However, existing evaluation protocols primarily focus on task success, overlooking a critical aspect of agent behavior: execution efficiency. In practice, agent trajectories often contain redundant steps that consume substantial resources while contributing little to task completion. In this work, we propose and formulate a new research area: redundant step detection for agent trajectories. To support this initiative, we introduce RedundancyBench, a new benchmark that contains diverse tasks with carefully annotated trajectories, where each step is labeled according to its contribution to task completion. Using RedundancyBench, we develop and evaluate 3 representative methods to answer whether a step within a trajectory is redundant or necessary. Our results show that even the best-performing method achieves a score of only 24.88% in detecting redundant steps, while some methods perform worse than random guessing. These results highlight the task's complexity and the need for further research in this area. Code and dataset are both available at https://anonymous.4open.science/r/RedundancyBench.

2026-05-29 13:00 JST · arXiv cs.AI · Agents

Formalizing Mathematics at Scale

We present AutoformBot, a multi-agent system for building an Autoformalized Textbook Library At Scale (Atlas) in Lean 4. AutoformBot orchestrates thousands of LLM agents, equipped with formal verification tools, dependency-aware task scheduling, and collaborative version control, to translate informal textbook prose into machine-checked definitions and proofs. We apply our methods to a corpus of 26 open-access textbooks spanning analysis, algebra, topology, combinatorics, and probability, producing Atlas: a verified library of over 45,000 Lean 4 declarations and 500 thousand lines of code. We release two artifacts: (i) AutoformBot, the open-source multi-agent framework; and (ii) Atlas, the resulting formal library. Our results suggest that autoformalizing the core content of graduate-level mathematics at scale is now economically and technically feasible. This opens the door to the automated verification of both human- and machine-generated mathematics at a research level.
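
Dependency-aware task scheduling of the kind mentioned above can be illustrated with a topological sort that releases declarations in waves, so that each wave could be handed to agents working in parallel. This is a sketch using Python's standard `graphlib`; the declaration names and the `schedule` helper are hypothetical, and the abstract does not describe AutoformBot's scheduler in this detail.

```python
from graphlib import TopologicalSorter

# Hypothetical declarations mapped to the declarations they depend on.
dependencies = {
    "def_group": set(),
    "def_subgroup": {"def_group"},
    "def_coset": {"def_subgroup"},
    "thm_lagrange": {"def_subgroup", "def_coset"},
    "def_homomorphism": {"def_group"},
}


def schedule(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into waves; every task in a wave has all prerequisites completed."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks independent agents could take in parallel
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = schedule(dependencies)
```

Here `def_subgroup` and `def_homomorphism` land in the same wave because both depend only on `def_group`, while `thm_lagrange` waits until the definitions it relies on have been formalized and checked.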

2026-05-29 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Compass: Navigating Global Marine Lead Data Integration through Expert-Guided LLM Agent

Marine lead (Pb) and its isotopes are critical tracers for ocean circulation and anthropogenic pollution, yet in-situ observations remain costly and sparse. While vast historical records exist, they lie buried within the unstructured content of academic papers, creating "data silos" inaccessible to comprehensive analysis. Manual extraction is unscalable, while general-purpose Large Language Models (LLMs) lack the necessary domain-specific knowledge, leading to hallucinations and scientifically invalid outputs. To address this, we introduce an expert-guided adaptation approach that enables LLMs to perform rigorous scientific data extraction without fine-tuning. We operationalize this approach through Compass, an LLM agent framework enhanced by a Knowledge Tree co-designed with marine scientists, which decomposes complex tasks into verifiable steps, guiding the agent's reasoning to ensure scientific validity. Deploying Compass across a corpus of over 230,000 relevant open-access papers, we successfully extract 3,751 previously unincorporated Pb records. This effort establishes the largest integrated marine Pb database to date. Beyond standard metrics, Compass demonstrates superior reliability through multi-layered validation, achieving 92% accuracy as confirmed through expert manual verification. The newly integrated data expand coverage in previously under-sampled regions such as the East China Sea and the Southern Ocean, providing an enriched data foundation for future scientific discoveries. We release an interactive visualization platform to facilitate open scientific access. Our work demonstrates that expert-guided agents can effectively bridge the gap between general-purpose LLMs and high-stakes scientific domains, enabling scalable data discovery in geosciences.

2026-05-29 13:00 JST · arXiv cs.AI · Agents

KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning

Cross-domain multimodal time series forecasting is a challenging task, requiring models to integrate precise numerical comprehension, cross-domain semantic understanding, and effective multimodal fusion. Existing approaches either build Time Series Foundation Models (TSFMs) from scratch or leverage pretrained Large Language Models (LLMs). However, TSFMs often overlook semantic understanding and lack the ability to perform future-oriented semantic reasoning, and LLMs struggle with numerical comprehension and accurate quantitative forecasting. To overcome these limitations, we propose KairosAgent, a novel agentic framework for multimodal time series forecasting, including an LLM-based reasoner and a TSFM-based forecaster. KairosAgent unifies textual reasoning and numerical forecasting by dynamically invoking analytical tools to enhance the numerical understanding and semantic reasoning capabilities of LLMs. The reasoning results are subsequently fused into the TSFM pipeline, enabling more accurate and reliable future predictions. To further improve the reasoning, we curate a large-scale corpus of high-quality trajectories, alongside a reinforcement learning from forecasting paradigm with multi-turn refinement and turn-level credit assignment. Experiments demonstrate that KairosAgent achieves superior zero-shot forecasting performance while maximizing the utility of pretrained LLMs and TSFMs, presenting a promising direction for efficient and interpretable time series agents. The project page is at https://foundation-model-research.github.io/KairosAgent .

2026-05-29 13:00 JST · arXiv cs.AI · Agents

Learning to Choose: An Empowerment-Guided Multi-Agent System with semantic communication for Adaptive Method Selection

Automating scientific computing workflows requires more than generating executable code: autonomous systems must also select appropriate computational strategies, implement them faithfully, and ensure that the resulting outcomes remain causally attributable to the decisions that produced them. In multi-agent pipelines, this process is particularly fragile, as small inconsistencies between agent intentions and actions can lead to semantic drift, where the eventually executed procedure no longer reflects the originally selected strategy, thereby corrupting downstream evaluation and adaptation. In this work, motivated by the ATHENA framework (Toscano et al., 2025; Toscano et al., 2026) and the concept of empowerment (Yiu et al., 2025), we introduce a multi-agent framework that combines contextual bandits with structured inter-agent communication and, most importantly, semantic checkpoints that preserve action-outcome fidelity throughout the pipeline. The system integrates specialized large language model (LLM) agents, grounded code generation, and self-healing execution loops within an adaptive decision-making architecture. Interpreting the framework through the lens of empowerment, we show that reliable autonomous learning requires not only identifying high-quality actions, but also preserving the integrity of their propagation across agents. Using sensitivity analysis and uncertainty quantification workflows as representative case studies, we demonstrate that unchecked semantic drift degrades policy learning, whereas the proposed framework improves convergence, robustness, and adaptation to novel problem contexts. These results suggest a broader design principle for scientific multi-agent systems: adaptive decision-making must be coupled with explicit mechanisms that guarantee semantic consistency and reliable information flow across the computational pipeline.

2026-05-29 13:00 JST · arXiv cs.AI · Agents · Business/Funding

Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison

Emerging personal AI agents are moving toward persistent, multi-source memory. This creates an evaluation problem: systems must decide how to use conflicting or incomplete evidence; they cannot just retrieve facts from one clean history. Existing benchmarks rarely show whether an error came from the evidence given to a method or from the method's conflict-resolution step. We study this as selective QA over conflicting multi-source personal memory: systems answer based on conflicting, sometimes incomplete sources, or abstain when evidence is insufficient. We develop a benchmark containing 18 question templates across 8 reasoning types, 480 personas, 4 random seeds, and 34,560 instances, with controlled source distortions and deterministic ground truth. We evaluate the performance of baselines without access to any source, access to a single source, structured fusion methods, and frontier LLMs. The best trained fusion resolver reaches 80.3% accuracy, while the strongest prompt-only LLM baseline reaches 70.0%. With abstention, the same resolver reaches 85.3% selective accuracy at 78.3% coverage and the best LLM reaches 71.0% selective accuracy at 95.4% coverage. Different models have different strengths across reasoning types. We release the data, code, cached model outputs, and data-generating process for reuse.
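
The two abstention metrics quoted above can be computed as in the sketch below, where `None` stands for an abstention. The function name and toy data are illustrative. The last line checks that the reported benchmark size is consistent with one instance per template, persona, and seed, although the abstract does not state that construction explicitly.

```python
from typing import Optional


def selective_metrics(predictions: list[Optional[str]], gold: list[str]) -> dict[str, float]:
    """None means the system abstained; selective accuracy is measured on answered items only."""
    answered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    correct = sum(p == g for p, g in answered)
    coverage = len(answered) / len(gold)
    selective_accuracy = correct / len(answered) if answered else 0.0
    return {"coverage": coverage, "selective_accuracy": selective_accuracy}


gold = ["Paris", "2019", "Alice", "blue", "Tokyo"]
predictions = ["Paris", None, "Alice", "red", "Tokyo"]
metrics = selective_metrics(predictions, gold)

# Sanity check on the benchmark size reported in the abstract.
instances = 18 * 480 * 4
```

In the toy data the system answers four of five questions (coverage 0.8) and gets three of those right (selective accuracy 0.75). The two numbers trade off against each other: abstaining more often raises selective accuracy only if the skipped items are the ones the system would have answered wrongly.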

2026-05-29 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Enhancing Multi-Agent Communication through Attention Steering with Context Relevance

LLM-based multi-agent systems have demonstrated remarkable performance on complex tasks through collaborative reasoning. However, these systems tend to rapidly accumulate extremely long conversation histories during interaction. As conversations lengthen, relevant information is increasingly diluted by irrelevant context, leading to degraded performance. In this work, we present Agent-Radar, a training-free context management method that dynamically steers each agent's attention toward relevant context with a novel temporal and spatial decay mechanism. Our experiments demonstrate that Agent-Radar outperforms state-of-the-art methods across five different benchmarks, yielding gains of up to 7.64 absolute points. Furthermore, our analysis shows that Agent-Radar remains effective and robust as the number of agents and interaction rounds increases. Finally, the ablation study shows that core components in Agent-Radar are crucial to performance and generalizable in different settings.

2026-05-29 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

AgentSchool: An LLM-Powered Multi-Agent Simulation for Education

Despite the rapid deployment of LLMs into classrooms, validating educational AI remains uniquely intractable: interventions act on developing learners whose cognitive and social trajectories are irreversibly shaped, while real-world trials are slow, ethically constrained, and institutionally locked. LLM-based educational simulators have emerged as a potential remedy, but many still collapse learning into persona-conditioned role-play and, when optimized only to reproduce existing classrooms, can structurally penalize the institutional novelty that pedagogical reform requires. In this work, we introduce AgentSchool, an LLM-driven multi-agent simulator that models learning as state transition rather than prompted behavior. AgentSchool couples cognitively growable student agents -- equipped with weighted subject knowledge graphs, thinking-workflow pools, and explicit misconceptions -- with adaptive teacher agents that plan, scaffold, and reflect along the Zone of Proximal Development, embedded in a configurable scenery generator that situates instruction within both formal and informal learning fields, and a multi-scale simulator that decouples interaction scale, temporal granularity, and simulation duration. Experiments show that structured student agents produce more differentiated mastery and misconception traces than a baseline simulator, while teacher-agent comparisons show backbone-dependent patterns consistent with ZPD-informed adaptation. Further, AgentSchool generates plausible traces of peripheral participation, clique formation, aggressor-induced cohesion, and opinion-leader emergence consistent with classroom social theories. Beyond its role as an educational research instrument, AgentSchool frames education as a socially meaningful testbed for long-horizon memory, multi-agent coordination, and future institutional reasoning under organizational pressure.

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajectories into compact memory. However, existing approaches typically train these memory policies using outcome-based reinforcement learning, failing to localize where intermediate memory quality degrades. As interactions unfold, ambiguous recursive summaries progressively discard task-relevant information and introduce semantic noise. This exacerbates belief deviation, obscuring the agent's estimate of the latent task state and ultimately derailing long-horizon reasoning. We therefore argue that memory optimization should focus not merely on trajectory-level success, but on the clarity of the belief induced by intermediate summaries. To this end, we introduce Belief Entropy, a self-supervised proxy that probes how uncertain the model remains about the latent task state given its current memory. Based on this proxy, we propose Metacognitive Memory Policy Optimization (MMPO). Instead of relying only on sparse outcome-based signals, MMPO provides fine-grained, memory-specific supervision via explicitly penalizing summaries that induce high epistemic uncertainty. Experiments show that MMPO consistently outperforms existing methods on diverse long-horizon tasks, maintaining 97.1% performance even when scaled to 1.75M-token contexts.
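
The abstract describes Belief Entropy as a measure of how uncertain the model stays about the latent task state given its current memory, and says summaries that induce high uncertainty are penalized. The following is a minimal sketch of that idea, assuming the probe yields a probability distribution over candidate task states. The function names, the weight `beta`, and the additive shaping form are illustrative assumptions, not the paper's formulation.

```python
import math

def belief_entropy(probs):
    """Shannon entropy (in nats) of a belief over candidate task states."""
    total = sum(probs)
    if total <= 0:
        raise ValueError("belief must have positive mass")
    normalized = [p / total for p in probs]
    return -sum(p * math.log(p) for p in normalized if p > 0)

def shaped_reward(outcome_reward, step_beliefs, beta=0.1):
    """Hypothetical shaping: outcome reward minus a penalty on the
    mean entropy of the beliefs induced by each intermediate summary."""
    if not step_beliefs:
        return outcome_reward
    mean_entropy = sum(belief_entropy(b) for b in step_beliefs) / len(step_beliefs)
    return outcome_reward - beta * mean_entropy

sharp = [0.97, 0.01, 0.01, 0.01]   # summary that pins down the task state
vague = [0.25, 0.25, 0.25, 0.25]   # summary that leaves the state ambiguous
print(belief_entropy(sharp) < belief_entropy(vague))
```

Under this sketch, two trajectories with the same final outcome receive different rewards if one passes through vaguer summaries, which is the kind of memory-specific signal the abstract contrasts with sparse outcome-only training.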

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents

Multi-component LLM agents assemble probabilistic claims from components that each see only part of a joint problem; the composition can violate basic probability axioms even when every component is locally coherent. We formalise this locally coherent, globally incoherent failure via the compositional residual eps*, the L2 distance from the composed quote to the joint coherent polytope, computable at runtime from system output and the declared cross-component coupling constraints. A product-structure dichotomy characterises when local coherence suffices, and a Rayleigh-quotient prediction matches the observed residual within 7% on three of four relation classes. A hierarchical Boyle-Dykstra projection repairs the composition deterministically; an anytime-valid e-process gives sequential coherence monitoring. Across 1,876 ensemble cliques on a four-LLM mid-tier panel (frontier-panel rerun in Section 5.5), eps* > 0 on 33-94% of cliques, translating to +0.115 nats per bet of regret on 1,770 resolved bets under the proportional allocation rule (the gain collapses to +0.006 under bettors that themselves coherentise). Three intuitive LLM-side mitigations (retrieval, partition-aware prompting, aggregator-LLM) each fail or regress.
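
To make the residual concrete, here is a small sketch for the simplest possible coupling constraint: several components each quote the probability of one outcome of a mutually exclusive, exhaustive set, so the coherent set is the probability simplex. It uses plain (flat) Dykstra alternating projections between the hyperplane `sum(x) = 1` and the box `[0, 1]^n`, not the hierarchical Boyle-Dykstra scheme of the paper. All function names are hypothetical.

```python
import math

def project_hyperplane(x):
    """Euclidean projection onto {x : sum(x) = 1}."""
    shift = (sum(x) - 1.0) / len(x)
    return [v - shift for v in x]

def project_box(x):
    """Euclidean projection onto [0, 1]^n."""
    return [min(1.0, max(0.0, v)) for v in x]

def dykstra_simplex(quote, iters=200):
    """Dykstra's algorithm: converges to the projection onto the
    intersection of the hyperplane and the box (the simplex)."""
    x = list(quote)
    p = [0.0] * len(x)  # correction term for the hyperplane step
    q = [0.0] * len(x)  # correction term for the box step
    for _ in range(iters):
        y = project_hyperplane([a + b for a, b in zip(x, p)])
        p = [a + b - c for a, b, c in zip(x, p, y)]
        x = project_box([a + b for a, b in zip(y, q)])
        q = [a + b - c for a, b, c in zip(y, q, x)]
    return x

def residual(quote):
    """Return (L2 distance to the coherent set, repaired quote)."""
    repaired = dykstra_simplex(quote)
    return math.dist(quote, repaired), repaired

# Each number is a sensible probability on its own, but together they sum to 1.2.
eps, repaired = residual([0.7, 0.5])
print(round(eps, 4), [round(v, 4) for v in repaired])
```

Each component's quote lies in [0, 1], so every component is locally coherent, yet the composed quote sits a positive distance from the coherent set. The projection is a deterministic repair in the same spirit as the one the abstract describes.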

2026-05-29 13:00 JST | arXiv cs.AI | Agents | Research/Papers

Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software

Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist supervising an AI coding agent (Claude Code, Sonnet and Opus models) over 12 work days and 57 sessions to build CLAX-PT, a differentiable one-loop perturbation theory module in JAX. We documented and classified 15 supervision events by intervention level. The agent resolved ten autonomously by iterating against oracle tests; two more were resolved through the physicist's domain knowledge. The three it could not resolve -- all of which evaded oracle detection -- share a common property: the agent treated symptom reduction as root-cause resolution. It spent 33 of the 57 sessions adjusting coefficients within a code architecture that could not represent the target physics, and could not re-evaluate its CLASS-PT branch choice even when prompted to reconsider; only an injected physics concept (anisotropic BAO damping) triggered the redesign. Separately, the agent committed a calibrated correction that passed all oracle tests but corresponded to no quantity in the theory, predicting wrong values at any other cosmology. The fudge factor was caught and replaced within the same session. Three supervision practices proved critical for catching what oracle tests missed: testing at diverse parameter points beyond the fiducial calibration; shared changelogs that surfaced stalled exploration across sessions; and an explicit rule against unphysical numerical patches. In this case, supervision design, not model capability, determined whether the agent's output was trustworthy. Closing the gap would require agents that propose architectural alternatives rather than optimize within a given structure, and that distinguish predictive adequacy from explanatory correctness -- capabilities not exhibited here and not obviously addressed by scaling alone. [Abridged.]

2026-05-29 13:00 JST | arXiv cs.AI | Agents

S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering

Long-horizon interactive agents often accumulate large trajectory histories yet still fail to answer questions about earlier events reliably. We argue that the main bottleneck is not context length alone, but the trajectory-to-answer interface of long-term memory. When histories are stored as plain-text chunks and queried with standard retrieval-augmented generation (RAG), systems often retrieve locally relevant but chain-incomplete evidence, especially for spatial, temporal, repeated-event, and multi-hop state questions. We propose S3MEM, a structured scene-event episodic memory framework for long-horizon interactive question answering (QA). S3MEM writes trajectories into structured memory units, retrieves evidence through anchor-sensitive retrieval, and exposes a compact token-budget-aware evidence interface for answer-time inference. In this sense, S3MEM is a structured evidence harness that converts agent trajectories into query-aligned support. We evaluate S3MEM on two internal headline environments (Crafter, Jericho) and two out-of-family environments (SciWorld, ALFWorld). Under a shared frozen answer-time protocol, S3MEM consistently outperforms Vanilla RAG across all four environments, surpasses Graph-NoReader on Crafter, Jericho, and ALFWorld, and matches it on SciWorld while using dramatically fewer evidence tokens. Three adapted recent baselines -- A-MEM-inspired, MemoryOS-adapted, and LightMem-adapted -- improve over Vanilla RAG in several settings, but none matches S3MEM's overall accuracy-efficiency frontier. Overall, the evidence supports a bounded conclusion: under the current frozen answer-time protocol, structured writing and anchor-sensitive evidence routing provide a stronger accuracy-efficiency frontier for long-horizon interactive QA than more generic memory interfaces.
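
The three stages named in the abstract (structured writing, anchor-sensitive retrieval, a token-budget-aware evidence interface) can be illustrated with a toy sketch. The `MemoryUnit` fields, the exact-match anchor filter, and the whitespace token count are all simplifying assumptions for illustration and do not reflect S3MEM's actual schema or retriever.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryUnit:
    step: int        # temporal anchor: when the event happened
    location: str    # spatial anchor: where it happened
    event: str       # short description of what happened

    def tokens(self):
        # Crude token estimate; a real system would use the model's tokenizer.
        return len(self.event.split())

def retrieve(units, location=None, step_range=None, token_budget=50):
    """Filter by spatial/temporal anchors, then pack evidence in
    chronological order until the token budget would be exceeded."""
    hits = [
        u for u in units
        if (location is None or u.location == location)
        and (step_range is None or step_range[0] <= u.step <= step_range[1])
    ]
    hits.sort(key=lambda u: u.step)
    selected, used = [], 0
    for u in hits:
        if used + u.tokens() > token_budget:
            break
        selected.append(u)
        used += u.tokens()
    return selected

units = [
    MemoryUnit(1, "kitchen", "picked up the key"),
    MemoryUnit(2, "hall", "opened the door"),
    MemoryUnit(5, "kitchen", "put the key in the drawer"),
]
print([u.step for u in retrieve(units, location="kitchen")])
```

Keeping the evidence in chronological order is one simple way to preserve the chain of state changes that the abstract says plain-text chunk retrieval tends to break.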

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

GenesisFunc: Multi-Agent Data Generation for Accurate and Generalizable Function-Calling

Large Language Models (LLMs) extend their capabilities through function-calling (FC), which relies on training data with high quality, diversity, and broad coverage of scenarios. However, obtaining and annotating real function-calling data is challenging, while synthetic data from existing pipelines often suffers from unreliable APIs, limited tool scalability, insufficient diversity, and weak quality control. To address these, we present GenesisFunc, an automated pipeline for generating FC training data. Starting from reliable tools in widely used public benchmarks, GenesisFunc employs a multi-agent framework to support a dialogue generation system that produces conversations spanning diverse scenarios, while maintaining both diversity and quality throughout the process. The accuracy of the data is further reinforced through a multi-stage evaluation system. We fine-tune an 8B LLM on the synthetic dataset and show through extensive experiments that it outperforms similarly sized open-source models in in-domain FC performance and out-of-domain generalization, while reaching FC capabilities comparable to some of the latest API-based models. In addition, our method demonstrates strong potential to scale effectively across downstream tools, underscoring its real-world applicability.

2026-05-29 13:00 JST | arXiv cs.AI | Agents

No Reader Left Behind: Multi-Agent Summaries Everyone Can Understand

The Plain Writing Act in the United States requires government documents to be accessible in clear and simple language that the general public can easily understand, yet existing summarization systems struggle to address diverse linguistic and cognitive barriers among general readers. We present NRLB (No Reader Left Behind), a multi-agent framework for plain language summarization that simulates three representative reader groups: elementary school student readers, non-native readers, and readers with attention deficits. NRLB combines template-based planning with iterative, reader-oriented refinement, enabling systematic detection and resolution of difficult terms, missing contexts, and confusing sentences. Evaluations across multiple datasets demonstrate consistent improvements in readability while preserving factual accuracy. Human evaluation further validates NRLB's impact, with annotator preference rates ranging from 55% to 76%, highlighting NRLB's potential to produce plain language summaries that are both faithful to the source and broadly accessible to the general public.

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines

Large language model (LLM) agents with tool-calling capabilities are increasingly deployed in production systems, yet a fundamental reliability question remains under-explored: does the same agent behave the same way twice? We present a systematic empirical study of behavioral consistency in multi-step tool-calling agents, measuring whether agents select the same tools, in the same order, with the same arguments, across repeated identical invocations. Unlike prior work on consistency in ReAct-style agents (search-only, free-text actions), we study the richer setting of structured tool-calling interfaces with typed parameters and consequential side effects.
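
The three questions in the abstract (same tools, same order, same arguments) map naturally onto three pairwise agreement rates over repeated runs. The sketch below is one way such a measurement could be set up. The trace format (a list of `{"tool": ..., "args": ...}` dicts per run) and the metric names are assumptions for illustration, not the paper's definitions.

```python
import json
from itertools import combinations

def canonical(call):
    """Hashable form of one tool call, with argument order normalized."""
    return (call["tool"], json.dumps(call["args"], sort_keys=True))

def pairwise_agreement(runs, key):
    """Fraction of run pairs whose key(run) values are identical."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0
    return sum(key(a) == key(b) for a, b in pairs) / len(pairs)

def consistency_report(runs):
    return {
        "same_tool_set": pairwise_agreement(
            runs, lambda r: frozenset(c["tool"] for c in r)),
        "same_tool_order": pairwise_agreement(
            runs, lambda r: [c["tool"] for c in r]),
        "same_calls": pairwise_agreement(
            runs, lambda r: [canonical(c) for c in r]),
    }

run_a = [{"tool": "search", "args": {"q": "x"}},
         {"tool": "write", "args": {"path": "p"}}]
run_b = [{"tool": "search", "args": {"q": "x"}},
         {"tool": "write", "args": {"path": "p"}}]
run_c = [{"tool": "write", "args": {"path": "p"}},
         {"tool": "search", "args": {"q": "x"}}]
print(consistency_report([run_a, run_b, run_c]))
```

Separating the three levels matters because a run can use the same tools while reordering them, which is harmless for read-only tools but not for tools with side effects.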

2026-05-29 13:00 JST | arXiv cs.AI | Agents

Self-Play Reinforcement Learning under Imperfect Information in Big 2

Imperfect-information multiplayer games test whether agents can act under hidden information, sparse rewards, and non-stationary opponents. We study these challenges in Big 2, a four-player imperfect-information card game. We develop a self-play RL framework for Big 2 that enables controlled comparisons between policy-gradient and value-approximating agents. Under a common environment, input representation, training budget, and evaluation protocol, PPO outperforms Monte Carlo Q approximation, SARSA, and Q-learning against random, greedy, and heuristic Big 2 opponents. We further find that moderate entropy regularization improves PPO by preventing the policy from becoming overly deterministic, and that current-policy self-play provides a stronger finite-budget curriculum than checkpoint self-play or fixed-opponent training. Together, these results show that Big 2 is a useful controlled setting for studying deep RL under imperfect information, multiplayer interaction, delayed rewards, and variable action sets.
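
For reference, the entropy regularization mentioned above is usually added to the PPO clipped surrogate as a bonus term. The form below is the standard one from the PPO literature with the value-function loss omitted; the abstract does not state the coefficient $c_H$ or clip range $\epsilon$ the authors used, so both are left symbolic.

```latex
L(\theta) = \mathbb{E}_t\left[
  \min\left( r_t(\theta)\,\hat{A}_t,\;
             \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right)
  + c_H\, \mathcal{H}\left[\pi_\theta(\cdot \mid s_t)\right]
\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

A larger $c_H$ rewards spreading probability over the legal actions, which is the mechanism by which the policy is kept from becoming overly deterministic.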

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Research/Papers

LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis

CI failure logs are large (median 5k lines, max 200k in this corpus) and noisy. Coding agents that try to debug them depend on an upstream tool to reduce the log to a manageable context, but the field has had no public empirical comparison of which reductions preserve enough evidence for downstream LLM diagnosis. We introduce LogDx-CI, a benchmark that compares 11 context-reduction tools (raw, tail, grep, three RTK modes, two real LLM map-reduce summarizers, three hybrid routers) on 35 real GitHub Actions failure cases, scored by 3 LLM debugger families (Claude Haiku 4.5, Claude Sonnet 4.6, OpenAI gpt-5-mini) plus a Sonnet 4.6 tool-using agent. We report three load-bearing findings. (1) Hybrid grep+tail routers dominate the cost-quality Pareto frontier; the top two methods score 0.670 / 0.666 at ~$0.03 per case, same-ballpark quality as standalone grep at 4.5× fewer tokens. (2) In the agent-loop regime, the quality range across reduction tools collapses 7× (single-shot spread 0.42 → agent-loop spread 0.059); the agent rescues weak contexts via follow-up tool calls. However, cost differences persist: weak contexts force the agent to issue 2-4× more tool calls to recover. (3) A cross-family LLM-summary pair (gpt-5-mini summarizer feeding a Claude Haiku debugger) beats the same-family pair by +0.071 averaged across four diagnoser variants, falsifying the self-call-bias hypothesis on this task. The gpt-5-mini summarizer is also the agent-loop #1 method (score 0.749) at 0.37 tool-calls per case and 10× lower reducer cost than the Haiku summarizer ($0.18 vs $1.75 per case). All data, code, per-case bundles, and reproducibility infrastructure are public.
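
The abstract does not specify how the hybrid grep+tail routers combine their two sources, so the following is only one plausible reading: keep lines that match an error pattern (with a little surrounding context), always keep the end of the log, and trim to a line budget. The pattern, the parameter names, and the "prefer the latest lines" trimming rule are assumptions for illustration.

```python
import re

# Hypothetical error pattern; a real reducer would tune this per CI system.
ERROR_PATTERN = re.compile(r"error|failed|exception|traceback", re.IGNORECASE)

def reduce_log(lines, context=2, tail=20, max_lines=200):
    """Union of grep hits (with context) and the log tail, capped at max_lines."""
    keep = set()
    for i, line in enumerate(lines):
        if ERROR_PATTERN.search(line):
            start = max(0, i - context)
            stop = min(len(lines), i + context + 1)
            keep.update(range(start, stop))
    # The tail usually holds the final failure summary and exit status.
    keep.update(range(max(0, len(lines) - tail), len(lines)))
    # If over budget, prefer the latest lines.
    chosen = sorted(keep)[-max_lines:]
    return [lines[i] for i in chosen]

log = [f"step {i} ok" for i in range(100)]
log[10] = "ERROR: build failed"
for line in reduce_log(log, context=1, tail=3):
    print(line)
```

The appeal of this kind of reducer is that it costs no LLM calls, which is consistent with the low per-case cost the abstract reports for the hybrid routers.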

2026-05-29 13:00 JST | arXiv cs.AI | Agents

AIRGuard: Guarding Agent Actions with Runtime Authority Control

Tool-using language agents turn model decisions into external side effects: they read files, run scripts, call APIs, send messages, and invoke Model Context Protocol tools. This makes agent attacks different from jailbreaks. The harmful step is often not an obviously forbidden output, but an ordinary executable action that becomes unsafe because attacker-controlled context steers authorized access against the user's interest. We identify this failure mode as authority confusion: untrusted resources may inform reasoning, but they must not authorize side effects. We present AIRGuard, a runtime guard that operationalizes least privilege as action-time authorization. AIRGuard normalizes heterogeneous tool calls, derives task authority into step-level authority, tracks source and target trust, simulates sensitive side effects, audits cross-step risk, and enforces decisions before actions execute. On AgentTrap, AIRGuard reduces Sonnet 4.6 attack success from 36.3% without defense to 5.5%. On DTAP-150, AIRGuard preserves 76.0% benign utility with Haiku 4.5, compared with 52.0% for ARGUS and 42.0% for MELON. An ablation further shows that prompt-only policy helps only modestly, whereas a dedicated runtime authority-control layer gives the agent system direct control over tool-mediated side effects. Code and data are available at https://github.com/Sophie508/AIRGuard.
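
The principle that untrusted resources may inform reasoning but must not authorize side effects can be expressed as a check that runs before each tool call executes. The sketch below is a toy illustration of that principle only. The `ToolCall` fields, the `origin` label (assumed to come from provenance tracking elsewhere in the system), and the shape of `step_authority` are hypothetical and are not AIRGuard's actual interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    tool: str              # e.g. "send_message"
    target: str            # resource the action touches
    origin: str            # "user" if the user's task motivated the call, else "untrusted"
    has_side_effect: bool  # False for pure reads

def authorize(call, step_authority):
    """Decide before execution. step_authority maps a tool name to the
    set of targets the current step is permitted to act on."""
    if not call.has_side_effect:
        return True, "read-only action"
    if call.origin != "user":
        return False, "side effect requested by untrusted context"
    if call.target not in step_authority.get(call.tool, set()):
        return False, "target outside step-level authority"
    return True, "within authority"

authority = {"send_message": {"alice@example.com"}}
injected = ToolCall("send_message", "attacker@example.com", "untrusted", True)
print(authorize(injected, authority))
```

Note that the blocked call here is an ordinary `send_message`, not an obviously forbidden output, which matches the abstract's point that the harmful step is often a normal action steered by attacker-controlled context.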

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

First head-to-head comparison of agentic AI applied to the analysis of simulated data of the Einstein Telescope

We report a comparison of two state-of-the-art agentic AI systems, Claude Code (Anthropic) and Codex (OpenAI), tasked with autonomously executing a simple end-to-end gravitational wave data analysis pipeline on a shared computing infrastructure without human intervention. The pipeline comprises power spectral density estimation from raw Einstein Telescope simulated noise, geometric template bank generation, matched filter recovery of 100 binary black hole signal injections, automated results generation, and large language model-assisted production of a manuscript formatted in the style of Physical Review D. Both agents received identical written specifications and identical compute resources. The experiment was run twice: a first run with unrealistically loud injections, and a second run with signals rescaled to a physically motivated SNR range. The scientific results converged in both runs. However, the agents exhibited substantially different behaviors and computational costs: Claude Code completed the pipeline in ~3.4 minutes with silent deviations from the specification, while Codex required ~16 minutes across explicit self-correcting restarts, including an unsolicited performance optimization of the matched filter inner loop. The autonomously generated manuscripts also diverged in length, details, and quality. In the second run, a subtle difference in the interpretation of the SNR range instruction led to a genuine scientific divergence: Claude Code silently reinterpreted the instructions, while Codex followed the specification literally. We discuss the implications of these behavioral differences, such as speed versus auditability, silent versus transparent error handling, instruction interpretation, and the criticality of intermediate data representations in multi-model pipelines, for the deployment of agentic AI in scientific computing workflows.

2026-05-29 13:00 JST | arXiv cs.AI | Agents

Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization

If an AI agent makes decisions on a person's behalf, those decisions must align with its user. We introduce representational accuracy to measure how faithfully a system captures a person's interpretation. An interpretive layer is operationalized as a Behavioral Specification. Our reference implementation aggressively compresses a person's data into interpretive patterns, served as context to a language model. We evaluate the Specification on a prototype benchmark of held-out behavioral predictions scored by a calibrated 5-judge LLM panel. We test it independently and in composition with a range of context conditions: full raw corpus, full extracted facts, and four commercial memory systems (Mem0, Letta, Supermemory, Zep). Across 14 public-domain autobiographical corpora, the Specification lifts representational accuracy in aggregate and nearly eliminates model hedging. It recovers most of what the raw corpus delivers, at ~25x less context cost. The Specification lifts subjects toward a common predictive level regardless of pretraining baseline; the lift in absolute points is therefore largest where the baseline is lowest, suggesting the population of relevance is anyone not adequately represented in pretraining. Lift is greatest on interpretation-required questions, where providing an interpretive layer enables model behavior that extracted facts or raw corpus do not. Conversely, on recall-required questions, this layer can interfere rather than help. We conclude that representational accuracy is distinct from recall and that human-AI alignment is dependent on how accurately the user is represented. Representational accuracy makes that alignment testable.

2026-05-29 13:00 JST | arXiv cs.AI | Agents

Multi-Resolution End-to-End Deep Neural Network for Optimizing Latency-Accuracy Tradeoff in Autonomous Driving

Latency-accuracy tradeoffs are fundamental in real-time applications of deep neural networks (DNNs) for cyber-physical systems. In autonomous driving, in particular, safety depends on both prediction quality and the end-to-end delay from sensing to actuation. We observe that (1) when latency is accounted for, the latency-optimal network configuration varies with scene context and compute availability; and (2) a single fixed-resolution model becomes suboptimal as conditions change. We present a multi-resolution, end-to-end deep neural network for the CARLA urban driving challenge using monocular camera input. Our approach employs a convolutional neural network (CNN) that supports multiple input resolutions through per-resolution batch normalization, enabling runtime selection of an ideal input scale under a latency budget, as well as resolution retargeting, which allows multi-resolution training without access to the original training dataset. We implement and evaluate our multi-resolution end-to-end CNN in CARLA to explore the latency-safety frontier. Results show consistent improvements in per-route safety metrics (lane invasions, red-light infractions, and collisions) relative to fixed-resolution baselines.
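
Runtime selection of an input scale under a latency budget can be sketched as a lookup over profiled latencies. The rule used here (take the largest resolution that fits, otherwise fall back to the fastest) is a deliberately simple assumption. The abstract indicates that the best choice also depends on scene context, which this sketch ignores, and the latency figures below are made-up placeholders.

```python
def select_resolution(latency_ms_by_resolution, budget_ms):
    """Pick the largest input resolution whose profiled latency fits the budget.

    latency_ms_by_resolution maps (width, height) -> measured latency in ms.
    """
    feasible = [
        (w * h, (w, h))
        for (w, h), ms in latency_ms_by_resolution.items()
        if ms <= budget_ms
    ]
    if not feasible:
        # Nothing fits: fall back to the fastest configuration.
        return min(latency_ms_by_resolution, key=latency_ms_by_resolution.get)
    return max(feasible)[1]

# Placeholder profile, not measurements from the paper.
profile = {(320, 240): 8.0, (640, 480): 15.0, (1280, 720): 40.0}
print(select_resolution(profile, budget_ms=20.0))
```

Because the network keeps separate batch-normalization statistics per resolution, a switch like this can happen between frames without reloading weights, which is what makes the budget-driven choice practical at runtime.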

2026-05-29 13:00 JST | arXiv cs.AI | Agents

SafeRx-Agent: A Knowledge-Grounded Multi-Agent Framework for Safe and Explainable Medication Recommendation

Medication recommendation predicts medications for patient visits, but existing methods still face two key challenges. At the model level, traditional drug recommendation methods only predict structured drug codes with limited evidence grounding, while LLM agents can use richer clinical context but may lack safety verification and traceability. At the task level, existing benchmarks often use broad medication categories, which ignore subgroup-level safety differences and can lead to risk overestimation. We introduce the first fine-grained medication recommendation setting based on fourth-level ATC code generation. We propose Safe Prescription Agent (SafeRx-Agent), a knowledge-grounded multi-agent framework that uses patient context, external clinical knowledge, and safety verification to recommend traceable medication sets. Experimental results on MIMIC-III and MIMIC-IV datasets show that SafeRx-Agent improves fine-grained medication prediction accuracy while controlling drug interactions, contraindications, and medication set size.

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents

AI agents augment large language models with external tools such as web retrieval, enabling grounded and up-to-date responses. However, inc…

2026-05-29 13:00 JSTarXiv cs.AIエージェントビジネス/資金調達

DynSess: Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents

Role-playing with large language models is fundamentally a session-level task, requiring agents to sustain character identity and interacti…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

GrepSeek: Training Search Agents for Direct Corpus Interaction

Large Language Model (LLM) search agents have shown strong promise for knowledge-intensive language tasks through multiple rounds of reason…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

SURGENT: A Surgical Multi-Agent Assistance System Across the Perioperative Workflow

The intricate nature of modern surgical care necessitates intelligent systems that can synthesize extensive patient records, support collab…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

SkillBrew: Multi-Objective Curation of Skill Banks for LLM Agents

Retrieval-augmented LLM agents increasingly rely on curated skill banks: collections of reusable textual principles that guide decision mak…

2026-05-29 13:00 JSTarXiv cs.AIエージェント研究/論文

How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions

AI coding agents increasingly act directly within software environments, yet existing analyses of their failures rely on benchmark trajecto…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Honest Lying: Understanding Memory Confabulation in Reflexive Agents

Reflexion-style agents rely on self-generated reflections as memory, implicitly assuming that agents can accurately diagnose their own fail…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

PhoneWorld: Scaling Phone-Use Agent Environments

A central bottleneck for phone-use agents is that controllable, reproducible environments covering real mobile behavior are hard to build a…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Network Optimization Aspects of Autonomous Vehicles: Challenges and Future Directions

Global megatrends, such as urbanization, population growth, and emerging network solutions are accelerating the development of the Connecte…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェントビジネス/資金調達

GUITestScape: Towards Open-set Evaluation on Exploratory GUI Testing

Exploratory GUI testing is a particularly demanding setting for MLLM agents: without predefined test scripts, an agent must autonomously na…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Training Deliberative Monitors for Black-Box Scheming Detection

As autonomous agents become more capable of performing real-world tasks, distinguishing scheming behavior from benign task pursuit may beco…

2026-05-29 13:00 JSTarXiv cs.AIエージェント研究/論文

Entity-Collision: A Stratified Protocol for Attributing Retrieval Lift in Agent Memory

End-to-end agent-memory benchmarks report a single hit@k per retriever, confounding lexical leakage (uncontrolled query/gold/distractor ent…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems

LLM-based multi-agent systems (MAS) have emerged as an effective paradigm for complex and long-horizon tasks. However, in real-world tasks,…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント研究/論文

Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, w…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension

Do next-generation LLM agents inherit the cooperative biases documented in their predecessors, or does scale and provider diversity reshape…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Agora: Toward Autonomous Bug Detection in Production-Level Consensus Protocols with LLM Agents

Consensus protocols form the backbone of distributed systems and blockchains, where implementation bugs can cause data corruption and finan…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents

Despite recent advances, LLM-based web agents still struggle with limited exploration, omission of critical steps, and sensitivity to task…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

CityGen: Structure-Guided City-Style Synthesis for Cross-City Autonomous Driving

Autonomous driving systems are commonly trained and evaluated within limited geographic regions, which hinders their scalability when deplo…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction

Large language model (LLM) agents increasingly leverage long term memory to support persistent and autonomous task execution. However, this…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント研究/論文

Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas

We study two-level autoresearch for cooperation: an outer-loop AI agent autonomously redesigns the inner-loop pipeline of an LLM policy-syn…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

How Reliable Are AI Attackers Against a Fixed Vulnerable Target? A 400-Run Empirical Study of LLM Penetration Testing Consistency

Large language models (LLMs) can autonomously conduct multi-stage cyber attacks, but the consistency of their offensive behavior under repe…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems

The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud and offer…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Do Proactive Agents Really Need an LLM to Decide When to Wake and What to Anchor?

Proactive agents read user activity as text and call an LLM on every event to decide whether to act. But user activity is not natively text…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms

As autonomous language model agents proliferate, forming an emerging agentic web with real-world consequences, what credibility signals can…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization

While Multi-Agent Systems (MAS) empower Large Language Models to tackle complex reasoning tasks through collaborative interaction, optimizi…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection

Document-level translation remains one of the most challenging tasks for large language models, which are constrained by limited context wi…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Gram: Assessing sabotage propensities via automated alignment auditing

We introduce Gram, an automated alignment auditing framework to assess the propensity of AI agents to engage in sabotage. We evaluate Gemin…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

PersonaAgent: Bridging Memory and Action for Personalized LLM Agents

Large Language Model (LLM) empowered agents have recently emerged as advanced paradigms that exhibit impressive capabilities in a wide rang…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Estimating the Empowerment of Language Model Agents

As language model (LM) agents become increasingly capable and adopted in real-world applications, there is a growing need for scalable eval…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

SafeSearch: Automated Red-Teaming of LLM-Based Search Agents

Search agents connect LLMs to the Internet, enabling them to access broader and more up-to-date information. However, this also introduces…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

CodeEvolve: an open source evolutionary coding agent for algorithmic discovery and optimization

We introduce CodeEvolve, an open-source framework that couples large language models with island-based evolutionary search for end-to-end a…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Graph-Enhanced Policy Optimization in LLM Agent Training

Multi-step LLM agents in interactive environments represent a crucial step toward long-horizon decision-making. To train such agents, group…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント研究/論文

InsightEval: An Expert-Curated Benchmark for Assessing Insight Discovery in LLM-Driven Data Agents

Data analysis has become an indispensable part of scientific research. To discover the latent knowledge and insights hidden within massive…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

SCOPE: Prompt Evolution for Enhancing Agent Effectiveness

Large Language Model (LLM) agents are increasingly deployed in environments that generate massive, dynamic contexts. However, a critical bo…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Reasoning and Tool-use Compete in Agentic RL: From Quantifying Interference to Disentangled Tuning

Agentic Reinforcement Learning (ARL) trains large language models to interleave reasoning with external tool execution to solve complex tas…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Skill-Pro: Learning Reusable Skills from Experience via Non-Parametric PPO for LLM Agents

LLM-driven agents excel at sequential decision-making but often rely on on-the-fly reasoning, re-deriving solutions even in recurring scena…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

AutoSizer: Automatic Sizing of Analog and Mixed-Signal Circuits via Large Language Model (LLM) Agents

The design of Analog and Mixed-Signal (AMS) integrated circuits remains heavily reliant on expert knowledge, with transistor sizing a major…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Small Agent Group is the Future of Digital Health

The rapid adoption of large language models (LLMs) in digital health has been driven by a "scaling-first" philosophy, i.e., the assumption…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information from individual…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models

Reinforcement learning (RL) shows promise for enhancing LLM agentic reasoning, yet sparse terminal rewards hinder fine-grained optimization…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

MemCollab: Cross-Model Memory Collaboration via Contrastive Trajectory Distillation

LLM agents increasingly rely on memory mechanisms to reuse knowledge from past problem-solving experiences. However, existing methods typic…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

MediHive: A Decentralized Agent Collective for Medical Reasoning

Large language models (LLMs) have revolutionized medical reasoning tasks, yet single-agent systems often falter on complex, interdisciplina…

2026-05-29 13:00 JSTarXiv cs.AIエージェントビジネス/資金調達

IntentScore: コンピュータ使用エージェントの意図条件付きアクションの評価

Computer-Use Agent (CUA) は、大規模言語モデルを利用してデスクトップ環境で GUI 操作を実行しますが、アクションの品質を評価せずにアクションを生成するため、後続のステップに連鎖する不可逆的なエラーにつながります。私たちは、3 つのオペレーティングシステムにわたる 398K のオフライン GUI インタラクションステップから候補アクションのスコアリングを学習する、プランを考慮した報酬モデル IntentScore を提案します。IntentScore は、状態とアクションの関連性に関する対照アラインメントと、アクションの正しさに関するマージンランキングという 2 つの相補的な目的関数で学習します。アーキテクチャ面では、各候補アクションの計画意図をアクションエンコーダーに埋め込むことで、アクションは似ているが根拠が異なる候補を区別できるようにします。IntentScore は、ホールドアウト評価で 97.5% のペアワイズ識別精度を達成します。学習中にまったく使用していない環境である OSWorld 上で Agent S3 のリランカーとして導入すると、IntentScore はタスク成功率を 6.9 ポイント向上させ、異種のオフライン軌跡から学習した報酬推定が未知のエージェントやタスク分布に汎化することを示しています。

原文 (English)

IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents

Computer-Use Agents (CUAs) leverage large language models to execute GUI operations on desktop environments, yet they generate actions without evaluating action quality, leading to irreversible errors that cascade through subsequent steps. We propose IntentScore, a plan-aware reward model that learns to score candidate actions from 398K offline GUI interaction steps spanning three operating systems. IntentScore trains with two complementary objectives: contrastive alignment for state-action relevance and margin ranking for action correctness. Architecturally, it embeds each candidate's planning intent in the action encoder, enabling discrimination between candidates with similar actions but different rationales. IntentScore achieves 97.5% pairwise discrimination accuracy on held-out evaluation. Deployed as a re-ranker for Agent S3 on OSWorld, an environment entirely unseen during training, IntentScore improves task success rate by 6.9 points, demonstrating that reward estimation learned from heterogeneous offline trajectories generalizes to unseen agents and task distributions.
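
The margin-ranking objective and the re-ranking step named in the abstract can be sketched in a few lines. This is an illustrative toy under stated assumptions, not IntentScore itself: the helper names, the candidate fields, and every score below are hypothetical, and the learned reward model is replaced by a precomputed `score` value.

```python
def margin_ranking_loss(score_correct, score_incorrect, margin=1.0):
    """Hinge-style ranking loss: zero once the correct action outscores the
    incorrect one by at least `margin`."""
    return max(0.0, margin - (score_correct - score_incorrect))


def rerank(candidates, score_fn):
    """Return candidates sorted from highest to lowest score."""
    return sorted(candidates, key=score_fn, reverse=True)


# Toy candidates: similar actions, different stated intents (all values made up).
candidates = [
    {"action": "click('OK')", "intent": "dismiss the dialog", "score": 0.31},
    {"action": "click('OK')", "intent": "confirm the file overwrite", "score": 0.87},
    {"action": "press('Esc')", "intent": "cancel and re-check the path", "score": 0.55},
]
ranked = rerank(candidates, score_fn=lambda c: c["score"])
best = ranked[0]

well_separated = margin_ranking_loss(2.0, 0.5)  # gap 1.5 >= margin, so loss is 0
too_close = margin_ranking_loss(0.6, 0.5)       # gap 0.1 < margin, so loss is positive
```

The two `click('OK')` candidates show why the intent matters: the actions are identical, so only a scorer that sees the rationale can separate them.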

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Guardrails Beat Guidance: A Large-Scale Study of Rules, Skills, and Persistent Configuration for Coding Agents

Random rules improve a coding agent's task performance as much as expert-curated ones (both +13.8 pp on a discriminative subset of SWE-ben…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Human-Guided Harm Recovery for Computer Use Agents

As LM agents gain the ability to execute actions on real computer systems, we need ways to not only prevent harmful actions at scale but al…

2026-05-29 13:00 JSTarXiv cs.AIエージェントビジネス/資金調達

SciHorizon-DataEVA: An Agentic System for AI-Readiness Evaluation of Heterogeneous Scientific Data

AI-for-Science (AI4Science) is increasingly transforming scientific discovery by embedding machine learning models into prediction, simulat…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェントビジネス/資金調達

CausaLab: AI 科学者向けのインタラクティブな因果発見のためのスケーラブルな環境

LLM エージェントによるインタラクティブな因果発見を評価するためのスケーラブルな環境である CausaLab を紹介します。以前の評価とは異なり、CausaLab では、エージェントが因果関係の証拠を使用して問題を解決できるかどうか、およびその答えが根底にある因果メカニズムに関する正しい仮説によって裏付けられているかどうかの両方を評価します。各エピソードではエージェントが合成実験室に配置されます。エージェントは以前の測定記録を受け取り、マニピュレーター結晶に介入し、同じ機構によって支配される保持されたリアクター結晶の共振周波数を予測します。隠されたデータ生成プロセスは、ランダムにサンプリングされた構造因果モデル (SCM) であるため、成功するには、事前の知識を思い出すのではなく、因果グラフと構造方程式の両方を回復する必要があります。 CausaLab には、エージェントの進化する SCM 仮説を記録するドメイン固有の言語も含まれており、軌跡を検査可能にしてグラウンドトゥルースと比較できるようになります。実験では、予測とメカニズム回復の間に永続的なギャップがあることが示されています。純粋に観測的な 6 ノード設定では、GPT-5.2-high はタスク精度 92% に達しますが、オールエッジ $F_1$ はわずか 0.471 です。この観察は、さまざまな相互作用戦略の探求をさらに動機づけます: 混合観察 - 介入戦略は構造忠実度を向上させます: 混合 6 ノード設定では、GPT-5.2-high はタスク精度とオールエッジ $F_1$ の両方で 80% を達成しました。しかし、純粋な介入戦略はタスクの精度とオールエッジ $F_1$ の両方においてパフォーマンスが低いため、強力なエージェントですら有益な介入を設計するのに苦労しています。私たちは、エージェントの主要な弱点として早期停止を特定し、仮説と過去のデータとの間の一貫性をモデルに検証するように依頼することが、この問題の軽減に役立つことを示します。したがって、CausaLab は予測の成功を因果関係の理解から切り離し、実験的因果推論者としての現在の LLM エージェントの限界を明らかにします。

原文 (English)

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithful recovered causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge $F_1$. Mixed observation-intervention strategies improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners.
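
To make the gap between task accuracy and structural recovery concrete, the following sketch computes an F1 score over directed edges of a causal graph. The abstract does not define "all-edge $F_1$" precisely, so this definition (exact match on directed edges) is an assumption, and the function name and example graphs are hypothetical.

```python
def edge_f1(true_edges, predicted_edges):
    """F1 over directed edges; a reversed edge counts as a miss and a false positive."""
    true_set, pred_set = set(true_edges), set(predicted_edges)
    if not true_set and not pred_set:
        return 1.0  # both graphs are empty: treat as perfect agreement
    true_positives = len(true_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(true_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground-truth SCM graph and an agent's hypothesis.
true_graph = [("A", "B"), ("B", "C"), ("A", "D")]
hypothesis = [("A", "B"), ("C", "B"), ("A", "D"), ("D", "C")]

score = edge_f1(true_graph, hypothesis)  # 2 of 4 predicted edges are correct
```

An agent can still predict the held-out quantity well with a hypothesis like this one, which is the kind of mismatch the benchmark is designed to expose.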

2026-05-29 13:00 JSTarXiv cs.AIエージェント

SIA: ハーネスとウェイトのアップデートによる自己改善型 AI

AI の構築と改善においてボトルネックになるのは人間です。モデルとそれをラップするエージェントはどちらも人によって作成、調整、修正されます。 AI 自身を改善する方法を見つけられるという長期的な目標はまだ未解決です。大きく関連性のない 2 つの研究ラインがこのボトルネックを攻撃しています。ハーネス更新スクールでは、モデルの重みを固定したまま、メタエージェントがタスク固有のエージェントの足場 (ツール、プロンプト、再試行ロジック、検索手順) を書き換えます。テスト時のトレーニングスクールでは、手書きの RL パイプラインを使用して、ハーネスを固定したままタスクフィードバックでモデル自体の重みを更新します。これら 2 つのサイロは独立して動作します。我々は、言語モデルエージェント (フィードバックエージェント) がタスク固有のエージェントのハーネスと重みの両方を更新する自己改善ループである SIA を提案します。中国の法的責任の分類、低レベルの GPU カーネルの最適化、および単一セル RNA のノイズ除去という 3 つの対照的なドメインにわたって評価します。両方のレバーを組み合わせると、3 つのベンチマークすべてでスキャフォールドの反復単独よりも優れたパフォーマンスを発揮します。初期ベースラインと比較して、LawBench では 56.6%、GPU カーネルではランタイムが 91.9%、ノイズ除去では 502% の向上が見られます。ハーネスの更新によりモデルがエージェント的になり、検索と動作の方法が形成されます。一方、重みの更新により、プロンプトや足場では得られないドメインの直感が構築されます。

原文 (English)

SIA: Self Improving AI with Harness & Weight Updates

Humans are the bottleneck in building and improving AI. Both the models and the agents that wrap them are written, tuned, and corrected by people. The long-horizon goal of an AI that can figure out how to improve itself remains open. Two largely disjoint research lines attack this bottleneck. The harness-update school has a meta-agent rewrite the scaffold of a task-specific agent (its tools, prompts, retry logic, and search procedure) while the model weights are held fixed. The test-time training school uses hand-written RL pipelines to update the model's own weights on task feedback while the harness is held fixed. These two silos operate in isolation. We propose SIA, a self-improving loop in which a language-model agent (the Feedback-Agent) updates both the harness and the weights of a task-specific agent. We evaluate across three contrasting domains: Chinese legal charge classification, low-level GPU kernel optimisation, and single-cell RNA denoising. Combining both levers outperforms scaffold iteration alone on all three benchmarks. SIA-W+H achieves 25.1% over prior SOTA on LawBench, 12.4% faster GPU kernels than prior SOTA (1,017 vs 1,161 μs), and 20.4% over prior SOTA on denoising. Harness updates make the model agentic, shaping how it searches and acts, while weight updates build the domain intuition that no prompt or scaffold can instil.

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェントビジネス/資金調達研究/論文

FundaPod: AI 支援のファンダメンタル投資調査のためのナレッジグラフメモリを備えたマルチペルソナエージェントポッドプラットフォーム

大規模言語モデル (LLM) は金融分野での適用が増えていますが、既存の研究のほとんどは取引シグナルや予測を中心とした財務 NLP タスクに重点を置いています。対照的に、制度的基礎研究では、人間のアナリストまたは AI エージェントが証拠を収集し、ビジネス推進要因を特定し、競合する視点を比較し、投資メモを作成する必要があります。その広範な目標は、単に結果を予測することではなく、投資知識の累積的な発展に貢献しながら、透明性、再利用可能、検証可能な投資計画を作成することです。 AI 支援のファンダメンタルズ投資調査のためのマルチペルソナエージェントプラットフォームである FundaPod を紹介します。私たちは、基礎研究は人間中心の意思決定支援タスクであり、取引シグナルの生成とは質的に異なるため、独立性を維持するアーキテクチャの方が適していると主張します。 FundaPod では、バリュー投資家やマクロ戦略家など、さまざまなペルソナを持つ AI エージェントが、共有の出所契約に基づいて独立して調査を実施します。その後、彼らの意見の相違は、知識グラフ記憶システムを通じて人間のポートフォリオマネージャー (PM) による裁定のために事後的に表面化されます。この論文は、設計科学の実践と認知的分離と人間と機械の協調の理論に基づいた、基礎研究をサポートする人間と AI のハイブリッドシステムの 5 つの設計原則を提供します。また、4 つのアーキテクチャメカニズムについても説明します。1 つは一般投資家の資料を展開可能なエージェントに変えるペルソナ蒸留パイプラインです。プランナーが型指定されたタスクグラフを導出できるようにする宣言型スキルレジストリ。メモの主張を検証可能な情報源に結び付ける根拠のある証拠モデル。そしてティッカー、メモ、アナリスト、テーマを結び付けるナレッジグラフ「第二の脳」。完全なケーススタディとペルソナベースのメモの比較を通じてアーキテクチャを実証します。

原文 (English)

FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research

Large language models (LLMs) are increasingly applied in finance, yet most existing work emphasizes trading signals or financial NLP tasks centered on prediction. Institutional fundamental research, by contrast, requires human analysts or AI agents to gather evidence, identify business drivers, compare competing viewpoints, and generate investment memos. Its broader goal is not merely to predict outcomes, but to produce investment plans that are transparent, reusable, and verifiable, while contributing to the cumulative development of investment knowledge. We present FundaPod, a multi-persona agent platform for AI-assisted fundamental investment research. We argue that fundamental research is a human-centric decision-support task that is qualitatively distinct from trading-signal generation, and is therefore better served by an independence-preserving architecture. In FundaPod, AI agents with different personas, such as value investors or macro strategists, conduct research independently under a shared provenance contract. Their disagreements are then surfaced post hoc for adjudication by the human portfolio manager (PM) through a knowledge-graph memory system. This paper contributes five design principles for human-AI hybrid systems supporting fundamental research, grounded in design-science practice and theories of cognitive isolation and human-machine coordination. It also describes four architectural mechanisms: a persona distillation pipeline that turns public investor materials into deployable agents; a declarative skill registry that lets the planner derive typed task graphs; a grounded evidence model that links memo claims to verifiable sources; and a knowledge-graph "second brain" that connects tickers, memos, analysts, and themes. We demonstrate the architecture through a complete case study and a persona-based memo comparison.

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

AsyncTool: マルチタスクシナリオでの非同期関数呼び出し機能の評価

大規模言語モデル (LLM) ベースのエージェントは、外部ツールを使用して複雑なタスクを解決する強力な機能を示しています。ただし、既存の評価では、ツール使用の時間的側面、特にツールの応答遅延の影響が見落とされていることが多く、通常は単一タスクの設定に限定されています。実際のアプリケーションでは、多くの場合、複数のタスクを同時に実行する必要があり、全体的な効率は、エージェントがツールの応答を待つ間にアイドル時間を利用できるかどうかによって決まります。この機能を非同期ツール呼び出しと呼びます。これを評価するために、ツールのフィードバックが遅延した対話型のマルチタスクツール使用環境で LLM ベースのエージェントを評価するためのベンチマークである AsyncTool を提案します。 AsyncTool は、複数の異種タスクを同時に提示し、実行中の現実的なツール応答遅延をシミュレートします。ハイブリッドデータ進化戦略を使用して、複数のシナリオとツールの使用パターンをカバーする多様な非同期マルチタスクデータセットを構築します。ステップ、サブタスク、およびタスクのレベルでモデルを評価し、タスクの調整と完了の効率を測定するための効率指向のメトリクスを導入します。広範な実験により、ツールのフィードバックが遅れていると、現在のエージェントに大きな課題が生じ、明らかなパフォーマンスの低下につながることが示されています。タスクの切り替え、依存関係の追跡、状態の維持をより適切に調整するモデルは、AsyncTool でより優れたパフォーマンスを実現します。私たちの分析は、現在ツールを使用しているエージェントの主な障害モードを特定し、より強力な時間的推論と調整機能を備えた将来のシステムを設計するための実用的な洞察を提供します。

原文 (English)

AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios

Large language model (LLM)-based agents have shown strong capabilities in using external tools to solve complex tasks. However, existing evaluations often overlook the temporal dimension of tool use, especially the impact of tool response latency, and are usually limited to single-task settings. In real-world applications, multiple tasks often need to be executed concurrently, and overall efficiency depends on whether an agent can use idle time while waiting for tool responses. We refer to this capability as asynchronous tool calling. To evaluate it, we propose AsyncTool, a benchmark for assessing LLM-based agents in interactive multi-task tool-use environments with delayed tool feedback. AsyncTool presents multiple heterogeneous tasks simultaneously and simulates realistic tool response latency during execution. Using a hybrid data evolution strategy, we construct a diverse asynchronous multitasking dataset that covers multiple scenarios and tool-use patterns. We evaluate models at the step, sub-task, and task levels, and introduce efficiency-oriented metrics to measure task coordination and completion efficiency. Extensive experiments show that delayed tool feedback poses substantial challenges to current agents and leads to clear performance degradation. Models that better coordinate task switching, dependency tracking, and state maintenance achieve stronger performance on AsyncTool. Our analysis identifies key failure modes of current tool-using agents and provides practical insights for designing future systems with stronger temporal reasoning and coordination capabilities.
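
The idea of using idle time while waiting on tool responses can be sketched with Python's `asyncio`. This is a toy illustration, not the AsyncTool benchmark: the tool names, latencies, and results are hypothetical, and a real agent would interleave reasoning steps rather than simply waiting for every call.

```python
import asyncio
import time


async def call_tool(name, latency_s, result):
    """Hypothetical tool stub: waits `latency_s` seconds, then returns its result."""
    await asyncio.sleep(latency_s)
    return name, result


async def run_tasks_concurrently(tool_calls):
    """Issue every tool call up front, then wait for all of them together."""
    pending = [asyncio.create_task(call_tool(*call)) for call in tool_calls]
    return dict(await asyncio.gather(*pending))


# Three independent tasks, each blocked on a slow tool (made-up values).
calls = [
    ("weather", 0.2, "sunny"),
    ("calendar", 0.2, "free at 3pm"),
    ("search", 0.2, "3 hits"),
]

start = time.perf_counter()
results = asyncio.run(run_tasks_concurrently(calls))
elapsed = time.perf_counter() - start  # close to the slowest call, not the sum
```

Calling the tools one after another would take roughly the sum of the latencies, so the difference between the two timings is one simple proxy for the coordination efficiency the benchmark measures.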

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Agent4Edu: Generating Learner Response Data by Generative Agents for Intelligent Education Systems

Personalized learning represents a promising educational strategy within intelligent educational systems, aiming to enhance learners' pract…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Online Fair Division with Additional Information

We study the problem of fairly allocating indivisible goods to agents in an online setting, where goods arrive sequentially and must be all…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

GroundAct: Can LLM Agents Ground Actions in Environmental States?

LLM agents achieve 85-96% success on tasks where instructions fully specify the action, but drop to 29-53% when action feasibility depends…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

ReflexGrad: Within-Episode Failure Recovery in LLM Agents via Progress-Gated Dual-Process Routing

We present ReflexGrad, a dual-process architecture for within-episode failure recovery in LLM agents without demonstrations. When agents co…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing

Agentic AI systems execute a sequence of actions, such as reasoning steps or tool calls, in response to a user prompt. To evaluate the succ…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving

End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger'…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

A Review of Learning-Based Motion Planning: Toward a Data-Driven Optimal Control Approach

Motion planning for autonomous driving (AD) faces a critical trade-off. While traditional rule-based pipelines offer verifiable safety and…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Scaling Small Agents Through Strategy Auctions

Small language models are increasingly viewed as a promising, cost-effective approach to agentic AI, with proponents claiming they are suff…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェントロボティクス

When Should a Robot Think? Resource-Aware Reasoning via Reinforcement Learning for Embodied Robotic Decision-Making

Embodied robotic systems increasingly rely on large language model (LLM)-based agents to support high-level reasoning, planning, and decisi…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

The Planetary Cost of AI Acceleration, Part II: The 10th Planetary Boundary and the 6.5-Year Countdown

The recent, super-exponential scaling of autonomous Large Language Model (LLM) agents signals a broader, fundamental paradigm shift from ma…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

Skill-based agent systems tackle complex tasks by composing reusable skills, improving modularity and scalability while introducing a large…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

Intent-aligned Autonomous Spacecraft Guidance via Reasoning Models

Future spacecraft operations require autonomy that can interpret high-level mission intent while preserving safety. However, existing traje…

2026-05-29 13:00 JSTarXiv cs.AILLM/生成AIエージェント

CalBench: Evaluating Coordination-Privacy Trade-offs in Multi-Agent LLMs

Personal AI assistants are beginning to act as delegates with access to calendars, inboxes, and user preferences. Calendar scheduling makes…

2026-05-29 13:00 JSTarXiv cs.AIエージェントビジネス/資金調達

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

Evaluation of software engineering (SWE) agents is dominated by a binary signal: whether the final patch pass…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterpr…

2026-05-29 13:00 JSTarXiv cs.AIエージェントハードウェア/半導体

ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows

While interpretable prototype networks offer compelling case-based reasoning for clinical diagnostics, their raw continuous outputs lack th…

2026-05-29 13:00 JSTarXiv cs.AIエージェント

KYA: A Framework-Agnostic Trust Layer for Autonomous Systems with Verifiable Provenance and Hierarchical Policy Composition

KYA (Know Your Agents) is an open-source, framework-agnostic trust and governance layer for autonomous systems, composed of five primitives…

2026-05-29 06:24 JSTTechCrunch AIエージェント

The internet is being rebuilt for machines

As AI agents move from experiments to production, AWS, Cloudflare, and others are redesigning cloud infrastructure for a future dominated b…

2026-05-29 05:06 JSTTechCrunch AIエージェント

Asana acquires no-code agent-builder StackAI

Asana will incorporate StackAI into its growing suite of AI workflow tools.

2026-05-29 02:00 JSTTechCrunch AILLM/生成AIエージェント

Anthropic releases Opus 4.8 with new ‘dynamic workflow’ tool

The new Opus model comes with a tool called Dynamic Workflows, for coordinating swarms of subagents.

2026-05-29 01:16 JSTITmedia AI+LLM/生成AIエージェント

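Anthropic makes Claude Opus 4.8 generally available: honesty improves dramatically, with alignment performance on par with Mythos

Anthropic has started general availability of "Claude Opus 4.8", the latest version of its AI model. It improves on the previous generation in reasoning and coding ability, and its "honesty" about the uncertainty of its own work has improved dramatically. It also adds a new feature that runs hundreds of subagents in parallel, "dynamic workfl…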

2026-05-29 00:35 JSTTechCrunch AIエージェント

Sesame, the conversational AI startup from Oculus founders, launches its iOS app

Sesame’s new iOS app brings its conversational AI agents to the public, offering more natural back-and-forth interactions designed to feel…

2026-05-28 23:00 JSTTechCrunch AIエージェント

Visa invests in Replit to power agentic payments for developers

Visa said that over 1,000 employees have been using Replit for prototyping and development.

2026-05-28 21:00 JSTOpenAIエージェント

How Endava builds an agentic organization with Codex

Learn how Endava uses Codex to build an agentic organization, accelerating software delivery and reducing requirements analysis from weeks…

2026-05-28 16:00 JSTTechCrunch AIエージェント

Vertu wants CEOs to run companies from an AI foldable starting at $6,880

Built on top of the open source Hermes project, Vertu's new foldable combines AI-agent workflows, enterprise integrations, and ultra-premiu…

2026-05-28 13:00 JSTarXiv cs.AILLM/生成AIエージェント

テキスト内の人間の価値観の特定と理解: カスタマイズ可能な LLM ベースのアーキテクチャ

インテリジェントシステムがより自律的になるにつれて、科学界は従来の効用最大化モデルとは異なり、倫理的および道徳的考慮事項を含む意思決定メカニズムの構築に焦点を当てています。これを達成するために重要な点は、これらの決定が人間の価値観とどの程度一致しているかを評価することです。この目的を達成するために、明示的か暗黙的かにかかわらず、テキストから人間の価値観を特定し、全体を通してその認識を可能にする、大規模言語モデル (LLM) に基づくアプローチの開発を中心とした有望な研究が行われています。この論文では、特定の価値理論や複雑なプロンプトエンジニアリングに結び付けられた以前のアプローチの制限を回避し、テキスト内の人間の価値観の強度を検出および定量化するための LLM ベースのアーキテクチャを紹介します。このアーキテクチャは 3 つの調整されたモジュールで構成されます。1 つは理論的フレームワークの基礎テキストから構造化された値の仕様を生成します。これらの仕様を使用してテキストにラベルを付けるもの。もう 1 つは、修辞的および意味論的な証拠に基づいて段階的な支持または抵抗を割り当てるものです。このモジュール式のアプローチは、人間の価値観の検出から概念化のタスクを分離し、さまざまな理論に適応できる価値の仕様によって推進される、スケーラブルで再現可能なプロセスを作成します。アーキテクチャは複数の LLM でインスタンス化され、ValueEval データセットを使用して評価されました。実験では良好な検出パフォーマンスが実証され、パイプラインの一般性が確認されました。

原文 (English)

Identifying and Understanding Human Values in Text: A Tailorable LLM-based Architecture

As intelligent systems become more autonomous, the scientific community focuses on creating decision-making mechanisms that include ethical and moral considerations, unlike traditional utility-maximisation models. To achieve this, a key aspect is assessing how well these decisions align with human values. To this end, a promising line of research is centred on developing approaches based on Large Language Models (LLMs) to identify human values from text, whether explicit or implicit, enabling their recognition throughout. This paper introduces an LLM-based architecture to detect and quantify the intensity of human values in text, avoiding the limitations of previous approaches tied to a specific value theory or complex prompt engineering. The architecture comprises three coordinated modules: one that generates structured value specifications from the foundational texts of any theoretical framework; one that labels texts using these specifications; and one that assigns graded support or resistance based on rhetorical and semantic evidence. This modular approach separates the task of conceptualising human values from that of detecting them, creating a scalable and reproducible process driven by value specifications adaptable to various theories. The architecture was instantiated with multiple LLMs and evaluated using the ValueEval dataset. The experiments demonstrate good detection performance, confirming the generality of the pipeline.

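2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Research/Papers

DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and the Observability Paradox in LLM-Based Scheduling Agents

Progress in neural combinatorial optimization for the Dynamic Flexible Job Shop Scheduling Problem (DFJSP) is currently held back by a methodological tension: static benchmarks encourage benchmark overfitting, while uncalibrated generators obscure algorithmic capability behind stochastic noise. To resolve this, we introduce DynaSchedBench, a diagnostic framework for DFJSP that rigorously controls the instance-generation process. Rather than relying on parameter sampling, our approach uses a Sequential Event-Space Calibrator (SESC) that computes a novel Schedule Stress Index (SSI) to stratify instances by difficulty. We show that SESC is substantially more computationally efficient than evolutionary baselines while converging reliably to the target metrics. The framework integrates modular components for instance generation, snapshot-based simulation, agents, evaluation, and visualization, enabling rigorous testing of reactive and lookahead-based policies. Using this calibrated environment, we identify key limitations of LLM-based scheduling agents. Specifically, in step-wise online decision-making for dynamic scheduling, we identify an "Observability Paradox": giving agents oracle access to full structural information can degrade policy performance, so that they underperform agents given concise information. Furthermore, despite substantial token overhead, tool-augmented and refinement strategies fail to improve performance reliably, and most LLM agents cannot consistently surpass strong dispatching baselines, behaving more like robust heuristic approximators than superior optimizers.

Original (English)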

DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and Observability Paradox in LLM-based Scheduling Agents

Progress in neural combinatorial optimization for the Dynamic Flexible Job Shop Scheduling Problem (DFJSP) is currently hindered by a methodological tension: static benchmarks encourage benchmark overfitting, while uncalibrated generators obscure algorithmic capability with stochastic noise. To resolve this, we introduce DynaSchedBench, a diagnostic framework for DFJSP that rigorously controls the instance-generation process. Instead of relying on parameter sampling, our approach utilizes a Sequential Event-Space Calibrator (SESC) that computes a novel Schedule Stress Index (SSI) to stratify instances by difficulty. We demonstrate that SESC is substantially more computationally efficient than evolutionary baselines while converging reliably to the target metrics. The framework integrates modular components for instance generation, snapshot-based simulation, agents, evaluation, and visualization, thereby enabling rigorous testing of reactive and lookahead-based policies. Leveraging this calibrated environment, we identify key limitations of LLM-based scheduling agents. Specifically, in step-wise online decision-making for dynamic scheduling, we identify an "Observability Paradox": providing agents with oracle access to full structural information can degrade policy performance, underperforming concise information. Furthermore, despite substantial token overhead, tool-augmented and refinement strategies fail to reliably improve performance, and most LLM agents fail to consistently surpass strong dispatching baselines, behaving more like robust heuristic approximators than superior optimizers.

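2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Why LLMs Fail at Causal Discovery and How Interventional Agents Escape

Causal discovery is a cornerstone of scientific reasoning, yet whether large language models can perform it reliably remains an open question. Recent benchmarks show that even fine-tuned models plateau on simple causal graphs and degrade as complexity grows, but why they fail has not been established. We prove that the failure is fundamental: supervised fine-tuning, direct preference optimization, and in-context learning all produce predictors that cannot distinguish between causal graphs that generate similar observational data, and any attempt to do so requires the model's internal representations to grow without bound, violating the very conditions under which these methods work. We formalize this as a kernel obstruction theorem, establishing that the limitation is intrinsic to the learning paradigm rather than to any particular model or dataset. We propose Agentic Causal Bayesian Optimization (A-CBO), in which a frozen language model serves as an interventional oracle that answers targeted queries about intervention effects, while an external Bayesian loop concentrates beliefs over candidate graphs in logarithmically many rounds. Because the decision is made outside the space where the obstruction applies, A-CBO provably converges while the underlying model remains unchanged. On Corr2Cause, A-CBO matches fine-tuned baselines without any training. On Extended Corr2Cause, a new benchmark that scales to 24 variables with 18,000 test samples, A-CBO significantly outperforms both fine-tuning and preference optimization, with the advantage growing.

Original (English)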

Why LLMs Fail at Causal Discovery and How Interventional Agents Escape

Causal discovery is a cornerstone of scientific reasoning, yet whether large language models can perform it reliably remains an open question. Recent benchmarks show that even fine-tuned models plateau on simple causal graphs and degrade as complexity grows, but why they fail has not been established. We prove the failure is fundamental: supervised fine-tuning, direct preference optimization, and in-context learning all produce predictors that cannot distinguish between causal graphs generating similar observational data, and any attempt to do so requires the model's internal representations to grow unboundedly, violating the very conditions under which these methods work. We formalize this as a kernel obstruction theorem, establishing that the limitation is intrinsic to the learning paradigm, not any particular model or dataset. We propose Agentic Causal Bayesian Optimization (A-CBO), wherein a frozen language model serves as an interventional oracle answering targeted queries about intervention effects, while an external Bayesian loop concentrates beliefs over candidate graphs in logarithmically many rounds. Because the decision operates outside the space where the obstruction applies, A-CBO provably converges while the underlying model remains unchanged. On Corr2Cause, A-CBO matches fine-tuned baselines without any training. On Extended Corr2Cause, a new benchmark scaling to 24 variables with 18K test samples, A-CBO significantly outperforms both fine-tuning and preference optimization, with the advantage growing
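
The idea of an external Bayesian loop that concentrates belief over candidate graphs can be illustrated with a toy sketch. Everything here is an assumption for illustration, not the authors' A-CBO implementation: the three candidate graphs, the 0.9 oracle accuracy, and `toy_oracle`, which stands in for the frozen language model.

```python
# Hypothetical candidate graphs over variables X and Y, as sets of directed edges.
CANDIDATES = {
    "X->Y": {("X", "Y")},
    "Y->X": {("Y", "X")},
    "none": set(),
}

def predicts_effect(edges, cause, target):
    """Under this graph, does intervening on `cause` change `target`? (direct edges only)"""
    return (cause, target) in edges

def update(posterior, cause, target, answer, accuracy=0.9):
    """One Bayes step; the oracle's yes/no answer is assumed correct with probability `accuracy`."""
    unnorm = {}
    for name, edges in CANDIDATES.items():
        agrees = predicts_effect(edges, cause, target) == answer
        unnorm[name] = posterior[name] * (accuracy if agrees else 1 - accuracy)
    total = sum(unnorm.values())
    return {name: weight / total for name, weight in unnorm.items()}

def toy_oracle(cause, target):
    """Stand-in for the frozen LLM: answers according to the (assumed) true graph X->Y."""
    return predicts_effect(CANDIDATES["X->Y"], cause, target)

def run(rounds=3):
    """Start from a uniform prior and alternate two interventional queries."""
    posterior = {name: 1 / len(CANDIDATES) for name in CANDIDATES}
    queries = [("X", "Y"), ("Y", "X")]
    for i in range(rounds):
        cause, target = queries[i % len(queries)]
        posterior = update(posterior, cause, target, toy_oracle(cause, target))
    return posterior
```

The model weights never change in this loop; only the posterior held outside the model is updated, which is the separation the abstract describes.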

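2026-05-28 13:00 JST | arXiv cs.AI | Agents

Discovery Agents for Real-Time Analytics: Toward Proactive Insight Systems

Modern analytics systems are fundamentally reactive: users must define queries over increasingly complex and continuously evolving data. In real-time streaming environments this paradigm breaks down, because the space of potential insights is too large to enumerate by hand. We present a multi-agent architecture for autonomous insight discovery over real-time data streams. The system implements a continuous discovery loop in which agents generate hypotheses, compile them into executable analytics, validate the generated artifacts, and produce visualizations and deployable applications. The architecture uses Apache Kafka for event-driven coordination, Apache Flink for stream processing, and large language models to implement specialized agents. A key contribution is a contract-driven design based on typed intermediate artifacts, which enables modularity, observability, lineage, and safer execution of dynamically generated analytics. Through use cases in retail, finance, and public data, we show how this architecture supports a shift from query-driven analytics to proactive, discovery-driven systems.

Original (English)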

Discovery Agents for Real-Time Analytics: Toward Proactive Insight Systems

Modern analytics systems are fundamentally reactive, requiring users to define queries over increasingly complex and continuously evolving data. In real-time streaming environments, this paradigm breaks down, as the space of potential insights becomes too large to enumerate manually. We present a multi-agent architecture for autonomous insight discovery over real-time data streams. The system implements a continuous discovery loop in which agents generate hypotheses, compile them into executable analytics, validate generated artifacts, and produce visualizations and deployable applications. The architecture leverages Apache Kafka for event-driven coordination, Apache Flink for stream processing, and large language models to implement specialized agents. A key contribution is a contract-driven design based on typed intermediate artifacts, enabling modularity, observability, lineage, and safer execution of dynamically generated analytics. Through use cases in retail, finance, and public data, we show how this architecture supports a shift from query-driven analytics to proactive, discovery-driven systems.
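
A contract-driven design based on typed intermediate artifacts might look roughly like the following sketch. All type names, fields, the `events` table, and the metric allow-list are hypothetical; the abstract does not specify the actual contracts, and no Kafka or Flink calls are involved here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hypothesis:
    id: str
    statement: str
    metric: str

@dataclass(frozen=True)
class CompiledQuery:
    hypothesis_id: str  # lineage: points back to the originating hypothesis
    sql: str

@dataclass(frozen=True)
class ValidationResult:
    hypothesis_id: str
    ok: bool
    reason: str

# Hypothetical allow-list used by the validator stage.
ALLOWED_METRICS = {"revenue", "orders"}

def compile_hypothesis(h: Hypothesis) -> CompiledQuery:
    """Compiler stage: accepts only a Hypothesis and emits only a CompiledQuery."""
    return CompiledQuery(h.id, f"SELECT AVG({h.metric}) FROM events")

def validate(q: CompiledQuery, h: Hypothesis) -> ValidationResult:
    """Validator stage: rejects artifacts with broken lineage or unknown metrics."""
    if q.hypothesis_id != h.id:
        return ValidationResult(h.id, False, "lineage mismatch")
    if h.metric not in ALLOWED_METRICS:
        return ValidationResult(h.id, False, f"unknown metric: {h.metric}")
    return ValidationResult(h.id, True, "ok")
```

Because each stage only exchanges frozen, typed records, a generated query can be rejected before execution and traced back to the hypothesis that produced it.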

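2026-05-28 13:00 JST | arXiv cs.AI | Agents

Agyn: An Open-Source Platform for AI Agents with Scalable On-Demand Execution, Agent Definition as Code, and Zero-Trust Access

As organizations move toward production deployments of AI agents, which execute non-deterministic workflows, maintain stateful sessions, and often operate with privileged access to internal services, the engineering challenge shifts from building individual agents to operating them at scale with proper isolation, governance, and security. This paper presents Agyn, an open-source platform designed around three key principles tailored to agent workloads: a signal-driven, stateful serverless runtime on Kubernetes; a Terraform provider for agent and harness definitions; and a security model grounded in zero-trust and least-privilege principles. Agyn is agent-agnostic, model-agnostic, and cloud-agnostic.

Original (English)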

Agyn: An Open-Source Platform for AI Agents with Scalable On-Demand Execution, Agent Definition as a Code, and Zero-Trust Access

As organizations move toward production deployments of AI agents, which execute non-deterministic workflows, maintain stateful sessions, and often operate with privileged access to internal services, the engineering challenge shifts from building individual agents to operating them at scale with proper isolation, governance, and security. In this paper we present Agyn, an open-source platform designed around three key principles tailored for agent workloads: a signal-driven, stateful serverless runtime on Kubernetes; a Terraform provider for agent and harness definition; and a security model grounded in zero-trust and least-privilege principles. Agyn is agent-agnostic, model-agnostic, and cloud-agnostic.

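2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Voluntary Collusion with Secret Tools in Competing LLM Agents

Even when a tool is explicitly described as unfair and harmful to others, ostensibly safety-aligned LLM agents still voluntarily engage in secret collusion whenever it confers a strategic advantage. To investigate this phenomenon, we introduce an empirical framework built on two strategic multi-agent environments: Liar's Bar, a competitive deception scenario, and Cleanup, a mixed-motive resource-management scenario. In both, agents are offered secret collusion tools that provide significant advantages while clearly disadvantaging the other agents. Across 12 models (at the 7B, 70B, and proprietary scales) and 6 prompt variants, we find that most agents consistently accept these tools and develop collusive strategies, even though they explicitly acknowledge the tools' unfairness before accepting. We further show that neither unfairness labels nor baseline alignment alone reliably deters collusion: only explicit ethical framing reduces adoption, and even then smaller models remain susceptible. More broadly, our work presents the first systematic investigation of voluntary collusion adoption in LLM-based multi-agent systems and suggests that preventing such behaviour requires explicit safeguards rather than reliance on general alignment.

Original (English)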

Voluntary Collusion with Secret Tools in Competing LLM Agents

Even when a tool is explicitly described as unfair and harmful to others, ostensibly safety-aligned LLM agents still voluntarily engage in secret collusion whenever doing so confers a strategic advantage. To investigate this phenomenon, we introduce an empirical framework built on two strategic multi-agent environments: Liar's Bar, a competitive deception scenario, and Cleanup, a mixed-motive resource-management scenario, in which agents are offered secret collusion tools that provide significant advantages while clearly disadvantaging the other agents. Across 12 models (at the 7B, 70B, and proprietary scales) and 6 prompt variants, we find that most agents consistently accept these tools and develop collusive strategies, while explicitly acknowledging the unfairness of the tools before accepting. We further show that neither the unfairness labels nor baseline alignment alone reliably deters collusion: only explicit ethical framing reduces adoption and, even then, smaller models remain susceptible. More broadly, our work presents the first systematic investigation of voluntary collusion adoption in LLM-based multi-agent systems, and suggests that preventing such behaviour requires explicit safeguards rather than reliance on general alignment.

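2026-05-28 13:00 JST | arXiv cs.AI | Agents

Laguna M.1/XS.2 Technical Report

We present Laguna M.1 and Laguna XS.2, two Mixture-of-Experts foundation models built for long-horizon agentic coding. M.1 has 225.8B total parameters (23.4B activated per token) and XS.2 has 33.4B total (3B activated). Both models were trained from scratch, end to end, inside the same internal system, which we call the Model Factory: a tightly integrated stack of versioned data, training, evaluation, and inference components that turns model development into an industrial process. We describe the principles and design choices of the Model Factory and detail the end-to-end training process of the models, covering pre-training data and architecture, post-training stages, evaluation, and quantization. On agentic software engineering and terminal benchmarks (SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0), M.1 and XS.2 are competitive with state-of-the-art open models in their respective weight classes. Laguna XS.2 weights are released under Apache 2.0 at https://huggingface.co/collections/poolside/laguna-xs2.

Original (English)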

Laguna M.1/XS.2 Technical Report

We present Laguna M.1 and Laguna XS.2, two Mixture-of-Experts foundation models built for long-horizon, agentic coding: M.1 has 225.8B total parameters (23.4B activated per token) and XS.2 has 33.4B total (3B activated). Both models were trained from scratch end-to-end inside the same internal system that we refer to as our Model Factory: a tightly-integrated stack of versioned data, training, evaluation, and inference components that turn model development into an industrial process. We describe the principles and design choices of the Model Factory and also detail the end-to-end training process of our models, throughout pre-training data and architecture, post-training stages, evaluation, and quantization. On agentic software engineering and terminal benchmarks (SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0) M.1 and XS.2 are competitive with state-of-the-art open models in their respective weight classes. Laguna XS.2 weights are released under Apache 2.0 at https://huggingface.co/collections/poolside/laguna-xs2.

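2026-05-28 13:00 JST | arXiv cs.AI | Agents

Reasoning and Planning with Dynamically Changing Norms

To interact safely with humans, AI agents must understand human norms and take them into account when planning. However, such norm-guided planning has been explored only within communities of artificial agents, and the dynamic nature of norms has been ignored. This paper instead presents an approach to guiding planning with dynamically changing norms in a human-AI setting. We contribute a defeasible calculus for resolving normative conflicts and an approach that uses such dynamically changing norms as guardrails on plans. We demonstrate the approach theoretically with formal proofs and empirically with an AI agent, SocialBot, on a natural-language dialogue task.

Original (English)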

Reasoning and Planning with Dynamically Changing Norms

To safely interact with humans, AI agents must both know our norms and consider them during planning. However, such norm-guided planning has been less explored, only within communities of artificial agents, and has ignored the dynamic nature of norms. This paper instead presents an approach to guiding planning with dynamically changing norms in a human-AI setting. We contribute a defeasible calculus for resolving normative conflicts and an approach to using such dynamically changing norms as guard rails on plans. We theoretically demonstrate our approach with formal proofs and empirically with an AI agent, SocialBot, on a natural language dialogue task.

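2026-05-28 13:00 JST | arXiv cs.AI | Agents | Robotics

Intelligence as Managed Autonomy: Failure, Escalation, and Governance for Agentic AI Systems

As autonomous and agentic AI systems scale in robotic and human-machine environments, managing hallucination and persistent but unjustified action remains an open challenge. Rather than attributing these failures solely to model or alignment limitations, this paper examines the architectural vulnerability of unbounded autonomy, that is, the presumption that an agent should keep operating regardless of rising uncertainty. It introduces a theory of managed autonomy that defines intelligent behavior through the formal capacity to detect epistemic drift, suspend reasoning, attempt recovery, and ultimately surrender control when reliability diminishes. The theory is instantiated through the SMARt (Self-Managing Multi-tier Autonomous Reasoning with Regulated/Revoked transitions) model, a four-layer framework with Stable, Meta-cognitive, Assisted, and Regulated states. By developing a timed, guarded Petri net formulation, we establish theoretically bounded properties for the system and show how the architecture can formally mandate escalation, constrain invalid outputs, and ensure governance reachability under specified conditions. We further analyze how incorporating domain-specific trigger sets across varied operational settings (for example, healthcare and robotics) can systematically preserve safety, assuming completeness and soundness criteria are met. Because these triggers are designed to be adaptive, the SMARt model accommodates the safe, controlled expansion of an agent's operational scope over time. We conclude that formalizing failure management within the autonomy lifecycle is a crucial step toward reliable and governed artificial intelligence.

Original (English)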

Intelligence as Managed Autonomy: Failure, Escalation, and Governance for Agentic AI Systems

As autonomous and agentic AI systems scale in robotic and human-machine environments, managing hallucination and persistent but unjustified action remains an open challenge. Rather than attributing these failures solely to model or alignment limitations, this paper explores the architectural vulnerability of unbounded autonomy - the presumption that an agent should continue operating regardless of rising uncertainty. It introduces a theory of managed autonomy that defines intelligent behavior through the formal capacity to detect epistemic drift, suspend reasoning, attempt recovery, and ultimately surrender control when reliability diminishes. We instantiate this theory via the SMARt (Self-Managing Multi-tier Autonomous Reasoning with Regulated/Revoked transitions) model, a four-layer framework featuring Stable, Meta-cognitive, Assisted, and Regulated states. By developing a timed, guarded Petri net formulation, we establish theoretically bounded properties for the system, demonstrating how architecture can formally mandate escalation, constrain invalid outputs, and ensure governance reachability under specified conditions. We further analyze how incorporating domain-specific trigger sets across varied operational settings (e.g., healthcare, robotics, etc.) can systematically preserve safety, assuming completeness and soundness criteria are met. Because these triggers are designed to be adaptive, the SMARt model accommodates the safe, controlled expansion of an agent's operational scope over time. We conclude that formalizing failure management within the autonomy lifecycle is a crucial step toward realizing reliable and governed artificial intelligence.
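
The four-state escalation idea can be sketched as a plain state machine. This is a deliberate simplification, not the paper's timed, guarded Petri net: the two uncertainty thresholds, the one-tier-per-step rule, and the choice to make Regulated absorbing are all assumptions made for illustration.

```python
from enum import Enum

class State(Enum):
    STABLE = 0
    META_COGNITIVE = 1
    ASSISTED = 2
    REGULATED = 3

# Hypothetical guards on a scalar uncertainty reading in [0, 1].
ESCALATE_ABOVE = 0.5  # above this, move one tier toward REGULATED
RECOVER_BELOW = 0.2   # below this, move one tier back toward STABLE

def step(state, uncertainty):
    """Advance one tick; REGULATED is absorbing because control has been surrendered."""
    if state is State.REGULATED:
        return state
    if uncertainty > ESCALATE_ABOVE:
        return State(state.value + 1)
    if uncertainty < RECOVER_BELOW and state is not State.STABLE:
        return State(state.value - 1)
    return state

def run(readings, state=State.STABLE):
    """Return the sequence of states visited for a list of uncertainty readings."""
    trace = [state]
    for uncertainty in readings:
        state = step(state, uncertainty)
        trace.append(state)
    return trace
```

For example, three consecutive high readings drive the agent from Stable to Regulated, after which a low reading no longer restores autonomy in this sketch.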

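2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Hierarchical Prompt-Domain Control and Learning for Resource-Constrained Agentic Language Models

Large language models are increasingly deployed inside agentic systems, where they must follow structured protocols, adapt to evolving states, and operate under memory, latency, and cost constraints. In such regimes, prompt extension is unreliable: growing contexts can push compact models outside their effective prompt domain, while deployment-time fine-tuning remains limited by scarce data and compute. We propose a hierarchical control-and-learning framework in which a compact model is first distilled to learn the required output schema and is then supervised online by an oracle-controller loop. The controller monitors protocol validity and semantic performance, projects accumulated histories into a feasible prompt domain, and triggers lightweight oracle-supervised fine-tuning under drift. This separates schema learning for communication compatibility from semantic adaptation for task-level correction. We formalize prompt-domain feasibility and attention-induced saturation, which motivates controlling the effective prompt state rather than relying on nominal context length. Using Multi-Fidelity Bayesian Optimization as a controlled sequential testbed, we characterize a core deployment failure mode and show improved reliability and cost-efficiency over non-hierarchical, distillation-only, and non-distilled baselines.

Original (English)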

Hierarchical Prompt-Domain Control and Learning for Resource-Constrained Agentic Language Models

Large Language Models are increasingly deployed inside agentic systems, where they must follow structured protocols, adapt to evolving states, and operate under memory, latency, and cost constraints. In such regimes, prompt extension is unreliable: growing contexts can push compact models outside their effective prompt domain, while deployment-time fine-tuning remains limited by scarce data and compute. We propose a hierarchical control-and-learning framework in which a compact model is first distilled to learn the required output schema, then supervised online by an oracle-controller loop. The controller monitors protocol validity and semantic performance, projects accumulated histories into a feasible prompt domain, and triggers lightweight oracle-supervised fine-tuning under drift. This separates schema learning for communication compatibility from semantic adaptation for task-level correction. We formalize prompt-domain feasibility and attention-induced saturation, motivating control of the effective prompt state rather than reliance on nominal context length. Using Multi-Fidelity Bayesian Optimization as a controlled sequential testbed, we characterize a core deployment failure mode and show improved reliability and cost-efficiency over non-hierarchical, distillation-only, and non-distilled baselines.
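
The three controller duties named above (checking protocol validity, projecting history into a feasible prompt domain, and triggering fine-tuning under drift) can be sketched as follows. The JSON schema, the word-count budget, and the invalid-rate threshold are hypothetical stand-ins; the paper's formal definition of prompt-domain feasibility is not reproduced here.

```python
import json

# Hypothetical output schema the compact model is expected to follow.
REQUIRED_KEYS = {"action", "value"}

def is_valid(output: str) -> bool:
    """Protocol validity: output must be a JSON object containing the required keys."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys()

def project_history(history, budget):
    """Keep the most recent entries whose total word count fits within `budget`."""
    kept, used = [], 0
    for entry in reversed(history):
        cost = len(entry.split())
        if used + cost > budget:
            break
        kept.append(entry)
        used += cost
    return list(reversed(kept))

def should_finetune(recent_outputs, max_invalid_rate=0.3):
    """Drift trigger: fire when the share of invalid outputs exceeds the threshold."""
    if not recent_outputs:
        return False
    invalid = sum(not is_valid(o) for o in recent_outputs)
    return invalid / len(recent_outputs) > max_invalid_rate
```

Recency-based truncation is only one possible projection; a summarizing or relevance-ranked projection would fit the same interface.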

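2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

A Policy-Driven Runtime Layer for Agentic LLM Serving

Multi-agent LLM systems have become the dominant production workload, but the serving stack was not built for them. The agent framework above knows agent identities, roles, schemas, and dispatch structure but never sees an engine-level event; the serving engine below sees every event but knows nothing about agents. A surprising number of cross-cutting policies depend on both: prefix caching, batch shaping, speculative execution, fairness, tool-result memoization, safety enforcement, and more. Each lives in the seam between the two layers and is currently handled by a one-off patch to one layer or the other. We argue that this seam is best addressed by an architectural change rather than point fixes: insert a third tier, an agent runtime layer, between the framework and the engine, exposing four primitives (observe, score, predict, act) into which any agent-aware policy plugs, with agent identity as the shared coordinate. We map nine concrete policies onto the layer and validate the abstraction in depth on the one with the largest immediate serving-cost lever: KV caching across sessions, instantiated as CacheSage, which learns the per-workload agent transition matrix online and uses it for survival-based eviction and between-step prefetch. Preliminary results on five real multi-agent workloads show a cache hit-rate lift of +13 to +37 pp, 12% to 29% lower mean TTFT, and 6% to 14% higher throughput compared with an unmodified serving stack.

Original (English)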

A Policy-Driven Runtime Layer for Agentic LLM Serving

Multi-agent LLM systems have become the dominant production workload, but the serving stack was not built for them. The agent framework above knows agent identities, role, schemas, and dispatch structure but never sees an engine-level event; the serving engine below sees every event but knows nothing about agents. A surprising number of cross-cutting policies depend on both: prefix caching, batch shaping, speculative execution, fairness, tool-result memoization, safety enforcement, and more. Each lives in the seam between the two layers and is currently solved by a one-off patch into one neighbor or the other. We argue this seam is best addressed by an architectural change rather than point fixes: insert a third tier, an agent runtime layer, between the framework and the engine, exposing four primitives (observe, score, predict, act) into which any agent-aware policy plugs, with agent identity as the shared coordinate. We map nine concrete policies onto the layer and validate the abstraction in depth on the one with the largest immediate serving-cost lever: KV caching across sessions, instantiated as CacheSage, which learns the per-workload agent transition matrix online and uses it for survival-based eviction and between-step prefetch. Preliminary results on five real multi-agent workloads show +13 to +37 pp cache hit-rate lift, 12% to 29% lower mean TTFT, and 6% to 14% higher throughput over an unmodified serving stack.
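
Learning an agent transition matrix online and using it to pick eviction and prefetch targets could be sketched as below. This is an illustrative simplification, not CacheSage itself: it scores cached prefixes by one-step transition probability rather than the survival-based criterion the abstract mentions, and the class and function names are hypothetical.

```python
from collections import defaultdict

class TransitionModel:
    """Online count-based estimate of P(next agent | current agent)."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, prev_agent, next_agent):
        """Record one observed hand-off between agents."""
        self.counts[prev_agent][next_agent] += 1

    def prob(self, prev_agent, next_agent):
        """Empirical transition probability; 0.0 if `prev_agent` was never seen."""
        row = self.counts.get(prev_agent, {})
        total = sum(row.values())
        return row.get(next_agent, 0) / total if total else 0.0

def choose_victim(model, current_agent, cached_agents):
    """Evict the cached agent prefix least likely to run next."""
    return min(cached_agents, key=lambda a: model.prob(current_agent, a))

def prefetch_candidate(model, current_agent, known_agents):
    """Prefetch the agent prefix most likely to run next."""
    return max(known_agents, key=lambda a: model.prob(current_agent, a))
```

The key point the sketch captures is that agent identity, which the engine normally cannot see, becomes the coordinate that drives cache decisions.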

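2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

SkillGrad: Optimizing Agent Skills Like Gradient Descent

Agent skills offer a lightweight way to adapt LLM agents to specialized domains by storing reusable procedural knowledge in structured files. However, whether downloaded from third parties or self-generated, these skills are often unreliable, incomplete, or outdated. Existing skill-evolution methods often address these deficiencies through heuristic reflection, without an explicit optimization formulation. This paper proposes SkillGrad, a gradient-descent-inspired framework for optimizing agent skills. SkillGrad treats the skill package as a structured parameter and optimizes it in a gradient-descent fashion: task executions provide trajectory-level loss evidence, and automatic diagnoses then provide text-based gradients that indicate the direction of correction. To stabilize optimization across iterations, a momentum agent accumulates recurring diagnostic patterns into a persistent memory overlay. Finally, an LLM-based patcher performs the parameter update by applying layer-aware edits to the skill package. Evaluated on SpreadsheetBench Verified and WikiTableQuestions, SkillGrad consistently outperforms training-based skill-evolution baselines across two backbone LLMs, improving on the strongest training-based baseline by 6.7 percentage points on average. Ablations further show that momentum and contrastive diagnosis both contribute to final skill quality.

Original (English)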

SkillGrad: Optimizing Agent Skills Like Gradient Descent

Agent skills provide a lightweight way to adapt LLM agents to specialized domains by storing reusable procedural knowledge in structured files. However, whether downloaded from third parties or self-generated, these skills are often unreliable, incomplete, or outdated. Existing skill-evolution methods often address these deficiencies through heuristic reflections without an explicit optimization formulation. In this paper, we propose SkillGrad, a gradient-descent-inspired framework for optimizing agent skills. SkillGrad treats the skill package as a structured parameter to optimize in a gradient descent fashion: task executions provide trajectory-level loss evidence, automatic diagnoses then provide text-based gradients that indicate the correction directions. To stabilize optimization across iterations, a momentum agent accumulates recurring diagnostic patterns into a persistent memory overlay. Finally, an LLM-based patcher executes the parameter update by applying layer-aware edits to the skill package. Evaluated on SpreadsheetBench Verified and WikiTableQuestions, SkillGrad consistently outperforms training-based skill evolution baselines across two backbone LLMs, improving over the strongest training-based baseline by 6.7 percentage points on average. Ablations further show that momentum and contrastive diagnosis both contribute to the final skill quality.

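2026-05-28 13:00 JST | arXiv cs.AI | Agents

PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft

We present PEAM, a Parametric Embodied Agent Memory framework in Minecraft that transforms agent memory from inference-time retrieval into parameter-resident skills internalized through experience. PEAM pairs a slow, deliberative LLM for open-ended reasoning with a fast parametric module for reflexive execution of consolidated skills. The fast module is a multimodal Mixture-of-Experts LoRA architecture with physically isolated per-category adapters, which enables parameter-level continual learning without catastrophic forgetting. We treat failure as a first-class training signal: failure-correction trajectory pairs are internalized through a joint behavioral-cloning and contrastive objective, so the agent learns not only what succeeds but also how corrected actions differ from failed ones. To govern consolidation, PEAM introduces a parameterization-worthiness score for deciding which experiences to internalize and a scale-free, self-triggered consolidation mechanism for deciding when to internalize them without task-specific hand-tuned thresholds. Because the trigger transfers across task distributions without re-tuning, the agent is self-evolving. Experiments in Minecraft show that PEAM improves long-horizon task performance, mitigates forgetting of previously consolidated skills, and improves parametric-versus-retrieval efficiency over retrieval-based embodied agents and parametric memory variants.

Original (English)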

PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft

We present PEAM, a Parametric Embodied Agent Memory framework in Minecraft that transforms agent memory from inference-time retrieval into parameter-resident skills internalized through experience. PEAM pairs a slow deliberative LLM for open-ended reasoning with a fast parametric module for reflexive execution of consolidated skills. The fast module is a multimodal Mixture-of-Experts LoRA architecture with per-category physically isolated adapters, enabling parameter-level continual learning without catastrophic forgetting. We treat failure as a first-class training signal: failure-correction trajectory pairs are internalized through a joint behavioral-cloning and contrastive objective, so the agent learns not only what succeeds but also how corrected actions differ from failed ones. To govern consolidation, PEAM introduces a parameterization-worthiness score for deciding which experience should be internalized, and a scale-free self-triggered consolidation mechanism for deciding when to internalize without task-specific hand-tuned thresholds, making the agent self-evolving as the trigger transfers across task distributions without re-tuning. Experiments in Minecraft show that PEAM improves long-horizon task performance, mitigates forgetting on previously consolidated skills, and improves parametric-versus-retrieval efficiency over retrieval-based embodied agents and parametric memory variants.

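2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Business/Funding

Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems

LLM safety evaluations predominantly test models in isolation, yet deployed AI agents increasingly operate alongside other agents within persistent social environments. We introduce a Moltbook-style simulation platform in which thousands of LLM agents interact across communities over a simulated month, and use it to evaluate privacy as a downstream safety concern under varying degrees of social pressure. We find that moving from single-turn to multi-turn social evaluation amplifies privacy violations (from 19.95% on CIMemories to 45.30% on ours across OpenAI models); that leakage is socially contagious, with agents 8 times more likely to disclose sensitive information after observing a peer do so; and that explicit privacy instructions reduce but do not eliminate this effect, leaving leakage rates above 37.8% even with safeguards. Our findings suggest that static chat-based safety benchmarks systematically underestimate the risks of agentic deployment, and that social context alone is enough to elicit sensitive disclosures that single-turn evaluations would never surface.

Original (English)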

Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems

LLM safety evaluations predominantly test models in isolation, yet deployed AI agents increasingly operate within persistent social environments alongside other agents. We introduce a Moltbook-style simulation platform where thousands of LLM agents interact across communities over a simulated month, and use it to evaluate privacy as a downstream safety concern under varying degrees of social pressure. We find that shifting from single-turn to multi-turn social evaluation amplifies privacy violations (CIMemories 19.95% to Ours 45.30% across OpenAI models), that leakage is socially contagious, with agents 8 times more likely to disclose sensitive information after observing a peer do so, and that explicit privacy instructions reduce but do not eliminate this effect, leaving leakage rates above 37.8% even with safeguards. Our findings suggest that static chat-based safety benchmarks systematically underestimate risks in agentic deployment, and that social context alone is sufficient to elicit sensitive disclosures that single-turn evaluations would never surface.

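2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Diagnosing Live Within-Policy Instruction Conflicts in LLM Agents with Witnessed Resolution Profiles

LLM agents are governed by long-lived natural-language prompt policies, but individually reasonable standing rules can interact in uninspected ways. We study live intra-policy rule-conflict diagnosis: finding rule pairs inside a single prompt policy that can co-govern a realistic state, and measuring how models resolve that pressure in their responses or tool actions. We introduce WIRE, a Witnessed Intra-policy Rule Evaluation pipeline. WIRE extracts source-grounded rules, encodes them as PyRule clauses, uses satisfiability checks to retain same-surface hard-collision candidates, realizes those candidates as concrete co-governance witnesses, and judges model outputs against the original source-rule text. Across six public prompt policies, WIRE extracts 276 source rules and 560 atomic clauses, classifies 30,944 within-policy clause-pair comparisons, retains 170 encoded hard-collision candidate source-rule pairs, and realizes them as 1,402 concrete witnesses. In policy-only evaluation, these witnesses yield 13,335 post-generation trials in which both source rules govern and both compliance labels are judgeable. Only 35.4% fall in joint compliance; 64.6% violate at least one governing source rule. These profiles are conditional diagnostics for WIRE-selected candidates, not estimates of deployment frequency or causal excess failure, but they reveal distinct policy, model, and tool-action resolution patterns.

Original (English)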

Diagnosing Live Within-Policy Instruction Conflicts in LLM Agents with Witnessed Resolution Profiles

LLM agents are governed by long-lived natural-language prompt policies, but individually reasonable standing rules can interact in uninspected ways. We study live intra-policy rule-conflict diagnosis: finding rule pairs inside a single prompt policy that can co-govern a realistic state, and measuring how models resolve that pressure in responses or tool actions. We introduce WIRE, a Witnessed Intra-policy Rule Evaluation pipeline. WIRE extracts source-grounded rules, encodes them as PyRule clauses, uses satisfiability checks to retain same-surface hard-collision candidates, realizes those candidates as concrete co-governance witnesses, and judges model outputs against the original source-rule text. Across six public prompt policies, WIRE extracts 276 source rules and 560 atomic clauses, classifies 30,944 within-policy clause-pair comparisons, retains 170 encoded hard-collision candidate source-rule pairs, and realizes them as 1,402 concrete witnesses. In policy-only evaluation, these witnesses yield 13,335 post-generation trials where both source rules govern and both compliance labels are judgeable. Only 35.4% fall in joint compliance; 64.6% violate at least one governed source rule. These profiles are conditional diagnostics for WIRE-selected candidates, not deployment-frequency or causal excess failure estimates, but they reveal distinct policy, model, and tool-action resolution patterns.
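
The satisfiability step, which looks for a state in which two rules apply at once and demand different actions on the same surface, can be illustrated with a brute-force search over a tiny boolean state space. The rules, variable names, and dictionary encoding below are invented for the example; they are not PyRule clauses, whose format the abstract does not describe.

```python
from itertools import product

# Hypothetical boolean state variables.
VARIABLES = ["user_is_minor", "asks_for_medical_info", "urgent"]

# Hypothetical rules: a trigger condition, the output surface, and the required action.
RULES = {
    "R1": {"when": lambda s: s["asks_for_medical_info"], "surface": "reply", "must": "answer"},
    "R2": {"when": lambda s: s["user_is_minor"], "surface": "reply", "must": "refuse"},
    "R3": {"when": lambda s: s["urgent"], "surface": "tool", "must": "call_hotline"},
}

def states():
    """Enumerate every assignment of the boolean state variables."""
    for values in product([False, True], repeat=len(VARIABLES)):
        yield dict(zip(VARIABLES, values))

def find_witness(a, b):
    """Return a state where both rules fire with conflicting demands on one surface, else None."""
    ra, rb = RULES[a], RULES[b]
    if ra["surface"] != rb["surface"] or ra["must"] == rb["must"]:
        return None  # different surfaces or identical demands cannot collide
    for state in states():
        if ra["when"](state) and rb["when"](state):
            return state
    return None
```

Here R1 and R2 collide for a minor asking for medical information, while R1 and R3 never do because they act on different surfaces. A real pipeline would likely use a SAT or SMT solver instead of enumeration.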

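2026-05-28 13:00 JST | arXiv cs.AI | Agents | Hardware/Semiconductors

A Query Engine for the Agents

The fastest-growing data in production today is unstructured text: agent traces, chat logs, reasoning chains, and model outputs. People want to analyze it, but the questions worth asking ("show me where the agent got confused") cannot be answered by SQL alone, because text cannot be queried without a model in the query path. The natural home for this analysis is the new class of AI applications (Claude Code, Cursor, Claude Desktop, in-browser agents) that run client-side and host both a human user and an LLM agent in the same process. These applications increasingly need to work with data, but the lakehouse read path has been hard to use from a JS runtime: Spark, Trino, and managed warehouses do not fit there. To build this new kind of AI data application, three properties of the engine become first-order: a JS-native distribution that drops into the runtime the application already runs in, a bundle small enough to ship inside a cold tab or a per-turn agent sandbox, and a way to interleave analytic operators with model-based interpretation of text. We present Hyperparam, three open-source JavaScript libraries (Hyparquet, Squirreling, Icebird) totaling under 70 KB, which read Parquet and Apache Iceberg directly from object storage and meet the third property with per-cell, async-native SQL execution, so that expensive cells fire only when downstream operators demand them. Squirreling runs LLM-shaped async UDFs over 300x faster than DuckDB-WASM on filter-bounded queries (and 192x faster on sort-bounded queries) and completes a ten-task agent analyst suite at two-thirds lower cost. We argue that data engineering as a discipline needs to update for the AI-native client applications now in production and the agents that work alongside their users.

Original (English)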

A Query Engine for the Agents

The fastest-growing data in production today is unstructured text: agent traces, chat logs, reasoning chains, model outputs. People want to analyze it, and the questions worth asking ("show me where the agent got confused") cannot be answered by SQL alone, since text is not queryable without a model in the query path. The natural place this analysis is happening is the new class of AI applications (Claude Code, Cursor, Claude Desktop, in-browser agents) that run client-side and host both a human user and an LLM agent in the same process. These applications increasingly want to work with data, but the lakehouse read path has been hard to use from a JS runtime: Spark, Trino, and managed warehouses do not fit there. To build this new kind of AI data application, three properties of the engine become first-order: a JS-native distribution that drops into the runtime the application already runs in, a bundle small enough to ship inside a cold tab or per-turn agent sandbox, and a way to interleave analytic operators with model-based interpretation of text. We present Hyperparam, three open-source JavaScript libraries (Hyparquet, Squirreling, Icebird) totaling under 70 KB, that read Parquet and Apache Iceberg directly from object storage and meet the third property with per-cell, async-native SQL execution, so expensive cells fire only when downstream operators demand them. Squirreling runs LLM-shaped async UDFs over 300x faster than DuckDB-WASM on filter-bounded queries (and 192x on sort-bounded queries) and completes a ten-task agent analyst suite at two-thirds lower cost. We argue that data engineering as a discipline needs to update for the AI-native client applications now in production and the agents that work alongside their users.
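
The claim that expensive cells fire only when downstream operators demand them can be illustrated with a small asyncio sketch. It is written in Python for illustration and is not the Squirreling API: `lazy_query`, its parameters, the length-based filter, and the sample rows are all hypothetical.

```python
import asyncio

async def lazy_query(rows, udf, limit):
    """Roughly: SELECT text, udf(text) WHERE length(text) > 5 LIMIT `limit`.

    The expensive async `udf` runs only for rows that pass the cheap filter
    and are actually emitted before the limit is reached.
    """
    out = []
    for row in rows:
        if len(row["text"]) <= 5:  # cheap filter is evaluated first
            continue
        label = await udf(row["text"])  # expensive cell, computed on demand
        out.append({"text": row["text"], "label": label})
        if len(out) == limit:  # stop before touching later rows
            break
    return out

# Hypothetical sample data standing in for agent traces.
ROWS = [
    {"text": "hi"},
    {"text": "where is the file?"},
    {"text": "done, all good"},
    {"text": "what next??"},
]
```

With `limit=1`, the model-like function is awaited once even though three rows pass the filter, which is the behaviour that keeps per-query model cost proportional to what the caller actually consumes.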

2026-05-28 13:00 JST arXiv cs.AI エージェント 研究/論文

EgoBench: ツールを使用するエージェント向けのインタラクティブな一人称視点マルチモーダルベンチマーク

AI エージェントはオープンな現実世界の環境で動作することが増えており、マルチモーダルな認識、マルチホップ推論を伴うツール呼び出し、およびユーザーとの動的な対話の緊密な連携が必要になります。しかし、既存のベンチマークは、厳密に結合された複数能力タスクの設計、自然でタスクの制約に沿ったユーザーフィードバックのシミュレーション、および動的インタラクションの客観的評価の確保に課題があるため、これらの能力を統合的に評価できていません。このギャップを埋めるために、ツールを使用するエージェント向けの初の対話型マルチモーダルベンチマークである EgoBench を導入します。EgoBench は、4 つの日常シナリオをカバーする 1,045 件の一人称視点ビデオに基づくタスクと、評価用のユーザー・エージェント・ツール対話環境で構成されています。私たちは 3 段階の相乗的パイプラインを実装し、各タスクが視覚認識とツール拡張マルチホップ推論の併用を必須とするように設計します。さらに、エージェントのインタラクション能力を評価するために、EgoBench 内にマルチエージェント構成のシミュレートユーザーを開発し、エージェントに対して忠実度が高くタスクに沿った応答を生成します。加えて、プロセスベースと結果ベースの等価性を通じて客観的な評価を保証する、決定論的な共同検証フレームワークを確立します。EgoBench で 8 つの SOTA ビデオ MLLM エージェントをベンチマークすると、厳しい性能の上限が明らかになります。最高のモデルでも、最も成績の良いシナリオで 30.62% の精度にとどまり、4 つのシナリオ全体の平均は 19.43% です。最後に、多次元エラー分析を実行して失敗モードを切り分け、将来の AI エージェントを進歩させる上での能力のボトルネックを明らかにします。

原文 (English)

EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents

As AI agents increasingly operate in open, real-world environments, they require a deep synergy of multimodal perception, tool invocation with multi-hop reasoning, and dynamic interaction with users. However, existing benchmarks fail to jointly evaluate these capabilities due to challenges in designing strictly coupled multi-capability tasks, simulating natural and task-constrained user feedback, and ensuring objective evaluation of dynamic interaction. To bridge this gap, we introduce EgoBench, the first interactive multimodal benchmark for tool-using agents. EgoBench comprises 1,045 egocentric-video-grounded tasks covering four daily scenarios, along with a user-agent-tool interactive environment for evaluation. We implement a three-stage synergistic pipeline through which each task is designed to enforce the joint application of visual perception and tool-augmented multi-hop reasoning. We additionally develop a multi-agent simulated user within EgoBench to evaluate agents' interaction capabilities, which generates high-fidelity, task-aligned responses to agents. Furthermore, we establish a deterministic joint validation framework that guarantees objective assessment through process-based and result-based equivalence. Benchmarking eight SOTA video-MLLM agents on EgoBench reveals a severe performance ceiling: the best model achieves only 30.62% accuracy in the best-performing scenario, averaging 19.43% across all four scenarios. Finally, we conduct a multi-dimensional error analysis to disentangle failure modes, exposing capability bottlenecks for advancing future AI agents.

2026-05-28 13:00 JST arXiv cs.AI LLM/生成AI エージェント

TCP-MCP: マルチエージェントシステム向けのプロンプトと通信トポロジのランドスケープガイドに基づく共進化

効果的なマルチエージェントシステムは、プロンプトやコミュニケーショングラフを単独で選択するだけでは設計できません。エージェントの動作はエージェントが受信する情報に依存しますが、通信エッジの有用性は受信エージェントがその情報をどのように解釈して使用するかに依存します。私たちは、エージェントのプロンプトと通信トポロジーを統一ゲノムとして探索する共進化フレームワークである TCP-MCP (Topology-Coupled Prompting for Multi-Agent Collaborative Problem-Solving) を提案します。TCP-MCP は、初期化時のランドスケーププローブを使用して初期の探索動作を調整し、次にパレートフロント診断に依拠して、タスク性能、トークンコスト、構造の複雑さという 3 つの目的の下で探索を適応させます。すべての手法で同じ DeepSeek-V3.2 バックボーンを使用した場合、TCP-MCP は MMLU-Pro、MMLU、GSM8K でそれぞれ 82.66%、89.96%、96.61% の精度を達成します。3 つのベンチマーク全体で、自動グラフ生成のベースラインを一貫して上回り、ディベートスタイルのシステムに対して競争力のある精度を達成しながら、報告された動作点ではこれらのシステムよりもトークン使用量が最大 5.69 倍少なくなります。これらの結果は、プロンプトと通信構造を共同で進化させることが、管理された評価において、コストを意識したタスク適応型のマルチエージェントシステム設計への実用的なルートを提供することを示しています。

原文 (English)

TCP-MCP: Landscape-Guided Co-Evolution of Prompts and Communication Topologies for Multi-Agent Systems

Effective multi-agent systems cannot be designed by selecting prompts or communication graphs in isolation. Agent behavior depends on the information an agent receives, while the usefulness of a communication edge depends on how the receiving agent interprets and uses that information. We propose TCP-MCP (Topology-Coupled Prompting for Multi-Agent Collaborative Problem-Solving), a co-evolution framework that searches agent prompts and communication topologies as a unified genome. TCP-MCP uses an initialization-time landscape probe to calibrate early search behavior, and then relies on Pareto-front diagnostics to adapt exploration under three objectives: task performance, token cost, and structural complexity. Using the same DeepSeek-V3.2 backbone across all methods, TCP-MCP achieves 82.66%, 89.96%, and 96.61% accuracy on MMLU-Pro, MMLU, and GSM8K, respectively. Across the three benchmarks, it consistently outperforms automated graph-generation baselines and achieves competitive accuracy relative to debate-style systems, while using up to 5.69x fewer tokens than those systems at the reported operating points. These results show that jointly evolving prompts and communication structure provides a practical route to cost-aware and task-adaptive multi-agent system design in controlled evaluations.
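
As a rough illustration of the three-objective trade-off named in the abstract (task performance, token cost, structural complexity), the sketch below computes a Pareto front over candidate configurations. It is not the paper's search procedure: the `Candidate` fields, the use of edge count as a complexity proxy, and all numbers are assumptions made up for the example.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    accuracy: float  # higher is better
    tokens: int      # lower is better
    edges: int       # lower is better (assumed proxy for structural complexity)


def dominates(a, b):
    """True if a is no worse than b on every objective and better on one."""
    no_worse = (a.accuracy >= b.accuracy
                and a.tokens <= b.tokens
                and a.edges <= b.edges)
    better = (a.accuracy > b.accuracy
              or a.tokens < b.tokens
              or a.edges < b.edges)
    return no_worse and better


def pareto_front(candidates):
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Illustrative values only; they are not results from the paper.
population = [
    Candidate("A", 0.82, 1000, 4),
    Candidate("B", 0.80, 1200, 5),  # dominated by A on all three objectives
    Candidate("C", 0.85, 3000, 8),
    Candidate("D", 0.70, 400, 2),
]
front_names = [c.name for c in pareto_front(population)]
```

Here `front_names` is `["A", "C", "D"]`: B is dropped because A is at least as good on every objective, while C and D survive because each is best on at least one axis.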

2026-05-28 13:00 JST arXiv cs.AI LLM/生成AI エージェント

MolLingo: LLM を利用した科学エージェントの分子ネイティブ表現

化学者の推論プロセスをエミュレートして分子設計を自動化するマルチエージェントシステムである MolLingo を紹介します。既存の LLM ベースのアプローチは、外部ツールにアクセスできないスタンドアロンの生成モデルとして動作するか、分子設計パイプライン全体にわたる証拠に基づく反復的な推論に必要なマルチエージェントの調整と共有メモリを欠いています。MolLingo は、共有メモリモジュールを介して文献エージェント、化学者エージェント、およびオーケストレーターを調整し、各エージェントにドメイン固有のツールを持たせることでこの問題に対処します。効果的な分子推論を可能にするために、BRICS ベースのフラグメント列挙 (BFE) を導入します。これは、一般的な化学名と組み合わせたブロックベースの SMILES として表される化学的に意味のある構成要素に分子を分解する、合成を意識した分子断片化手法です。この表現は分子構造と LLM の意味空間を橋渡しし、生の SMILES だけでは難しいブロックレベルの推論と編集を可能にします。初期段階の治療薬設計のケーススタディとして、MolLingo はさらに、より強力な標的結合に向けて分子を最適化するために、分子ドッキングから得られる結合部位の形状と残基レベルのタンパク質コンテキストに化学者エージェントの推論を根拠付けます。4 つのベンチマークにわたって、MolLingo はフロンティア LLM および特化型ベースラインを一貫して上回っています。これには、同じ基礎モデルを使用しているにもかかわらず GPT-5.4 と比較してドッキングスコアが 4 倍向上していること、複数の LLM バックボーンにわたる一貫した薬剤特性最適化の向上、およびフロンティア LLM と RL ベースの最適化手法である RePO の両方を上回る TOMG-Bench での最先端の結果が含まれます。私たちの結果は、化学的に意味のある表現と生物学的に根拠のある構造的コンテキストを通じて導かれる場合、LLM がすでに有能な分子設計アシスタントであることを示唆しています。コードは https://anonymous.4open.science/status/MolLingo-7450 で入手できます。

原文 (English)

MolLingo: Molecule-Native Representations for LLM-Powered Scientific Agents

We present MolLingo, a multi-agent system that emulates the reasoning process of a chemist to automate molecular design. Existing LLM-based approaches either operate as standalone generative models without access to external tools or lack the multi-agent coordination and shared memory needed for iterative, evidence-driven reasoning across the molecular design pipeline. MolLingo addresses this by coordinating a Literature Agent, a Chemist Agent, and an Orchestrator through a shared memory module, with each agent equipped with domain-specific tools. To enable effective molecular reasoning, we introduce BRICS-based Fragment Enumeration (BFE), a synthesis-aware molecular fragmentation method that decomposes molecules into chemically meaningful building blocks represented as block-based SMILES paired with common chemical names. This representation bridges molecular structure and LLM semantic space, enabling block-level reasoning and editing that is difficult with raw SMILES alone. As a case study in early-stage therapeutic design, MolLingo further grounds the Chemist Agent's reasoning in binding site geometry and residue-level protein context derived from molecular docking to optimize molecules for stronger target binding. Across four benchmarks, MolLingo consistently outperforms frontier LLMs and specialized baselines, including a fourfold docking score improvement over GPT-5.4 despite using the same underlying model, consistent drug property optimization gains across multiple LLM backbones, and state-of-the-art results on TOMG-Bench, surpassing both frontier LLMs and the RL-based optimization method RePO. Our results suggest that LLMs are already capable molecular design assistants when guided through chemically meaningful representations and biologically grounded structural context. Code is available at: https://anonymous.4open.science/status/MolLingo-7450.

2026-05-28 13:00 JST arXiv cs.AI LLM/生成AI エージェント ビジネス/資金調達 研究/論文

FundaPod: AI 支援のファンダメンタル投資調査のためのナレッジグラフメモリを備えたマルチペルソナエージェントポッドプラットフォーム

大規模言語モデル (LLM) は金融分野での適用が増えていますが、既存の研究のほとんどは、取引シグナルや、予測を中心とした金融 NLP タスクに重点を置いています。対照的に、機関投資家のファンダメンタルズ調査では、人間のアナリストまたは AI エージェントが証拠を収集し、ビジネスの推進要因を特定し、競合する視点を比較し、投資メモを作成する必要があります。その広範な目標は、単に結果を予測することではなく、投資知識の累積的な発展に貢献しながら、透明で再利用可能かつ検証可能な投資計画を作成することです。AI 支援のファンダメンタルズ投資調査のためのマルチペルソナエージェントプラットフォームである FundaPod を紹介します。私たちは、ファンダメンタルズ調査は人間中心の意思決定支援タスクであり、取引シグナルの生成とは質的に異なるため、独立性を維持するアーキテクチャの方が適していると主張します。FundaPod では、バリュー投資家やマクロストラテジストなど、さまざまなペルソナを持つ AI エージェントが、共有の来歴 (provenance) 契約の下で独立して調査を実施します。その後、彼らの意見の相違は、ナレッジグラフメモリシステムを通じて、人間のポートフォリオマネージャー (PM) による裁定のために事後的に提示されます。この論文は、設計科学の実践と、認知的分離および人間と機械の協調の理論に基づいた、ファンダメンタルズ調査をサポートする人間と AI のハイブリッドシステムの 5 つの設計原則を提供します。また、4 つのアーキテクチャメカニズムについても説明します。公開されている投資家資料をデプロイ可能なエージェントに変えるペルソナ蒸留パイプライン、プランナーが型付きタスクグラフを導出できるようにする宣言型スキルレジストリ、メモの主張を検証可能な情報源に結び付ける根拠付き証拠モデル、そしてティッカー、メモ、アナリスト、テーマを結び付けるナレッジグラフ「第二の脳」です。完全なケーススタディとペルソナベースのメモの比較を通じてアーキテクチャを実証します。

原文 (English)

FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research

Large language models (LLMs) are increasingly applied in finance, yet most existing work emphasizes trading signals or financial NLP tasks centered on prediction. Institutional fundamental research, by contrast, requires human analysts or AI agents to gather evidence, identify business drivers, compare competing viewpoints, and generate investment memos. Its broader goal is not merely to predict outcomes, but to produce investment plans that are transparent, reusable, and verifiable, while contributing to the cumulative development of investment knowledge. We present FundaPod, a multi-persona agent platform for AI-assisted fundamental investment research. We argue that fundamental research is a human-centric decision-support task that is qualitatively distinct from trading-signal generation, and is therefore better served by an independence-preserving architecture. In FundaPod, AI agents with different personas, such as value investors or macro strategists, conduct research independently under a shared provenance contract. Their disagreements are then surfaced post hoc for adjudication by the human portfolio manager (PM) through a knowledge-graph memory system. This paper contributes five design principles for human-AI hybrid systems supporting fundamental research, grounded in design-science practice and theories of cognitive isolation and human-machine coordination. It also describes four architectural mechanisms: a persona distillation pipeline that turns public investor materials into deployable agents; a declarative skill registry that lets the planner derive typed task graphs; a grounded evidence model that links memo claims to verifiable sources; and a knowledge-graph "second brain" that connects tickers, memos, analysts, and themes. We demonstrate the architecture through a complete case study and a persona-based memo comparison.

2026-05-28 13:00 JST arXiv cs.AI エージェント

AIBuildAI-2: AI モデルを自動的に構築するための知識強化エージェント

AI モデルは、画像やテキストの処理から生物学、物理学、化学における科学的発見に至るまで、データ中心のアプリケーションを支えています。しかし、その開発は依然として手作業に大きく依存しており、実務者がアーキテクチャを設計し、トレーニングパイプラインを構築し、ソリューションを繰り返し改良する必要があるため、AI エンジニアリングの専門知識を持たない自然科学者にとって、研究で要求される高性能モデルを構築するのは困難です。この負担を軽減し、科学的発見のための AI へのアクセスを広げるために、AI モデルを自動的に構築するエージェントが提案されています。ただし、これらのエージェントの性能は、基礎となる大規模言語モデルのパラメトリック知識によって大きく制限されます。この知識は静的で、多くの場合時代遅れであり、実践的な AI モデルエンジニアリングのノウハウに乏しいものです。この制限に対処するために、AI モデルを自動的に構築するための、進化する外部知識システムを備えた知識強化エージェントである AIBuildAI-2 を導入します。AIBuildAI-2 の知識システムは階層的であり、厳選された AI 開発知識を、トピック別カテゴリにわたる高レベルの知識指示と、各カテゴリの下にある低レベルの知識文書に編成します。エージェントはそこから、現在の状態と解決中の AI タスクに関連するコンテキストのみを動的にロードし、設計と実装の各決定を、具体的で外部から検証可能な専門知識に基づかせます。このシステムは、Web から AI 開発関連のドキュメントを収集してクリーニングし、対応するカテゴリに整理することによって初期化され、AI タスクの完了した各実行を構造化された要点に蒸留して知識システムに書き戻すことで、エージェント自身の経験に基づいて継続的に進化します。AIBuildAI-2 は最先端の結果を達成し、MLE-Bench で 70.7% のメダル獲得率で 1 位にランクされ、心臓病予測コンテストでは 4,370 の人間の専門家チームの中で上位 6.6% に入りました。

原文 (English)

AIBuildAI-2: A Knowledge-Enhanced Agent for Automatically Building AI Models

AI models underpin data-centric applications from image and text processing to scientific discovery in biology, physics, and chemistry. Yet developing them remains heavily manual, requiring practitioners to design architectures, build training pipelines, and iteratively refine solutions, making it challenging for natural scientists without specialized AI engineering expertise to build the high-performing models their research demands. To reduce this burden and broaden access to AI for scientific discovery, agents that automatically build AI models have been proposed. However, the performance of these agents is largely limited by the parametric knowledge of their underlying large language models, which is static, often outdated, and sparse on practical AI model engineering know-how. To address this limitation, we introduce AIBuildAI-2, a knowledge-enhanced agent with an external, evolving knowledge system for automatically building AI models. The knowledge system of AIBuildAI-2 is hierarchical, organizing curated AI development knowledge into high-level knowledge instructions over topical categories and low-level knowledge documents under each category, from which the agent dynamically loads only the context relevant to its current state and the AI task being solved, grounding each design and implementation decision in concrete, externally verifiable expertise. The system is initialized by collecting and cleaning AI-development-related documents from the web and organizing them into the corresponding categories, and continually evolves from the agent's own experience by distilling each completed run on an AI task into structured takeaways that are written back into the knowledge system. AIBuildAI-2 achieves state-of-the-art results, ranking first on MLE-Bench with a 70.7% medal rate and placing in the top 6.6% among 4,370 human-expert teams in a heart disease prediction competition.

2026-05-28 13:00 JST arXiv cs.AI LLM/生成AI エージェント 研究/論文

忠実なエージェント XAI に向けて: モデルの忠実性を高めるための検証方法とオープンワールドベンチマーク

Explainable AI (XAI) は、ユーザーがモデルの動作を解釈し、潜在的な欠陥を特定するのに役立ちます。エージェンティック XAI システムは、自然言語対話を通じて説明をより利用しやすくするために大規模言語モデル (LLM) を使用しますが、もっともらしいが忠実でない説明を生成することもあります。このリスクは、複雑なモデルに対する信頼性の低い XAI 出力が LLM によって増幅され、ユーザーを誤解させる可能性があるために生じます。私たちは、明示的な検証を通じて説明の忠実性を向上させるフレームワーク、Faithful Agentic XAI (FAX) を提案します。FAX は説明の草案を個々の主張に分解し、本質的に忠実なツールと照合して、最終生成前に裏付けのない主張や矛盾する主張を除外します。また、モデル固有の忠実性を評価するための、複雑なポリシー、多様な目標、難しいシナリオを備えたオープンワールドの強化学習ベンチマークである CRAFTER-XAI-Bench も紹介します。CRAFTER-XAI-Bench では、FAX は高い情報性、関連性、流暢性を維持しながら、シミュレーション忠実性を最強のベースラインの 0.20 から 0.46 に向上させます。3 つの表形式ベンチマークでは、FAX は従来のエージェンティック XAI ベースラインと同等の性能を示しますが、私たちの分析では、これらの設定がタスクの精度とモデル固有の忠実性を混同しうることが示されています。これらの調査結果は、忠実なエージェンティック XAI には明示的な検証が不可欠であり、忠実性ベンチマークは対象モデル自体の動作に照らして説明をテストするように設計する必要があることを示しています。

原文 (English)

Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness

Explainable AI (XAI) helps users interpret model behavior and identify potential faults. Agentic XAI systems use Large Language Models (LLMs) to make explanations more accessible through natural-language interaction, but they can also produce plausible yet unfaithful explanations. This risk arises because unreliable XAI outputs for complex models can be amplified by LLMs and mislead users. We propose Faithful Agentic XAI (FAX), a framework that improves explanation faithfulness through explicit verification. FAX decomposes draft explanations into claims and cross-checks them against inherently faithful tools, filtering unsupported or contradictory claims before final generation. We also introduce CRAFTER-XAI-Bench, an open-world reinforcement learning benchmark with complex policies, diverse goals, and challenging scenarios for assessing model-specific faithfulness. On CRAFTER-XAI-Bench, FAX improves simulation faithfulness from 0.20 for the strongest baseline to 0.46 while maintaining high informativeness, relevance, and fluency. On three tabular benchmarks, FAX performs competitively with prior Agentic XAI baselines, but our analysis shows that these settings can conflate task accuracy with model-specific faithfulness. These findings show that explicit verification is essential for faithful Agentic XAI and that faithfulness benchmarks must be designed to test explanations against the behavior of the target model itself.

2026-05-28 13:00 JST arXiv cs.AI LLM/生成AI エージェント ビジネス/資金調達

LLM エージェントの機能を評価するための統一フレームワーク

LLM がエージェントとして導入されることが増えているため、そのエージェント能力の信頼できる評価が不可欠になっています。しかし、報告されるベンチマークスコアは、多くの場合、モデルの能力と、各ベンチマークに付随する実装上の選択の両方を反映しているため、ベンチマーク横断の結果を基礎となるモデルの純粋な測定値として解釈することは困難です。この研究では、LLM のエージェント能力を公正に評価するための統一フレームワークを紹介します。統一された構成システムによって駆動されるこのフレームワークは、多様なベンチマークを標準化された命令・ツール・環境の形式に統合し、制御可能なサンドボックス内の固定された ReAct スタイルアーキテクチャを通じてエージェントを実行します。また、フレームワークの効果と環境の効果を個別に分析できるように、変動しやすいライブ環境を厳選されたスナップショットに置き換えるオプションのオフライン設定を提供します。これに基づいて、各ベンチマークの元のタスク成功基準の下で評価方法を統一するとともに、リソース消費に関する統一指標と、意思決定レベルおよび実行レベルの失敗帰属に関する分類体系を導入します。このフレームワーク内で、シングルエージェント、マルチエージェント、およびセーフティクリティカルなシナリオにわたる 24 のドメインを網羅する、広く使用されている 7 つのベンチマークを適応させ、15 のモデルで 40 万回のロールアウトと 50 億トークンにわたる大規模な実証分析を実施します。結果は、スキャフォールドの選択と環境の変動性がベンチマークの結果を両方向に実質的に変化させることを示しており、本フレームワークによって、フレームワークおよび環境に起因するアーティファクトから LLM 本来の能力を切り分けることができます。さらに、セーフティクリティカルなドメイン向けの安全なテストベッドとしての拡張性を実証します。コードとベンチマークは、https://github.com/whfeLingYu/A-Unified-Framework-for-the-Evaluation-of-LLM-Agentic-Capabilities および https://huggingface.co/AgentFramework/Unified_Farmework で入手できます。

原文 (English)

A Unified Framework for the Evaluation of LLM Agentic Capabilities

As LLMs are increasingly deployed as agents, reliable assessment of their agentic capabilities has become essential. However, reported benchmark scores often jointly reflect model capability and the implementation choices each benchmark is packaged with, making cross-benchmark results difficult to interpret as clean measurements of the underlying model. In this work, we present a unified framework for the fair evaluation of LLM agentic capabilities. Driven by a unified configuration system, the framework integrates diverse benchmarks into a standardized instruction--tool--environment format, executes agents through a fixed ReAct-style architecture within a controllable sandbox, and provides an optional offline setting that replaces volatile live environments with curated snapshots, so that framework effects and environment effects can be analyzed separately. Building on this, we unify the evaluation methodology under each benchmark's original task-success criteria, while introducing unified metrics for resource consumption and a taxonomy for decision- and execution-level failure attribution. Within this framework, we adapt 7 widely used benchmarks spanning 24 domains across single-agent, multi-agent, and safety-critical scenarios, and conduct a large-scale empirical analysis over 400K rollouts and 5B tokens on 15 models. The results show that scaffold choice and environmental volatility materially shift benchmark outcomes in both directions, allowing our framework to disentangle intrinsic LLM capabilities from framework- and environment-induced artifacts. We further demonstrate its extensibility as a secure testbed for safety-critical domains. Code and benchmarks are available at https://github.com/whfeLingYu/A-Unified-Framework-for-the-Evaluation-of-LLM-Agentic-Capabilities, https://huggingface.co/AgentFramework/Unified_Farmework.

2026-05-28 13:00 JST arXiv cs.AI LLM/生成AI エージェント

SKILLC: 対照的クレジット割り当てによる LLM エージェントの自律的なスキル内部化の学習

構造化されたスキルプロンプトは、長期的なエージェント強化学習 (RL) における探索を改善します。スキル拡張型 RL 手法は推論時にも外部スキルを保持しますが、スキル内部化型 RL 手法は自律的な性能を可能にするためにトレーニング中に外部スキルを取り除きます。しかし、既存の内部化アプローチは、スキルの有用性のコントラストをカリキュラム制御にのみ使用しており、ポリシーの更新自体は変更されないため、スキル依存の成功と自律的な成功を区別できません。私たちは、このコントラストを内部化のための直接的な学習シグナルに変換する、Contrastive Skill Credit Assignment (CSCA) に基づくフレームワークである SkillC を提案します。SkillC は、同じポリシー更新内で、アクティブなスキルタイプに属するタスクに対してスキル注入ありとスキルなしのロールアウトをペアでサンプリングし、グローバルなランキングを維持しながらスキルなしの成功に向けた片側補正を適用するデュアルストリームのアドバンテージ推定器を介して、タスクレベルのコントラストを最適化に注入します。さらに、平滑化された検証レベルのシグナルが、帰属の強度、ロールアウトの割り当て、および単調なアクティブセットの枝刈りに関する適応カリキュラムを駆動します。ALFWorld と WebShop での実験では、実行時にスキルへアクセスすることなく、SkillC が従来最強のスキル内部化型 RL ベースラインをそれぞれ 5.5% と 4.4% 上回り、スキル拡張型 RL 手法に対しても競争力を維持することが示されています。

原文 (English)

SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment

Structured skill prompts improve exploration in long-horizon agentic reinforcement learning (RL). Skill-augmented RL methods retain external skills at inference, while skill-internalization RL methods withdraw them during training to enable autonomous performance. However, existing internalization approaches only use skill-helpfulness contrast for curriculum control, leaving the policy update unchanged and unable to distinguish skill-dependent from autonomous success. We propose SkillC, a framework based on Contrastive Skill Credit Assignment (CSCA) that converts this contrast into a direct learning signal for internalization. SkillC samples paired skill-injected and skill-free rollouts for tasks from active skill types within the same policy update, and injects their task-level contrast into optimization via a dual-stream advantage estimator that preserves global ranking while applying a one-sided correction toward skill-free success. A smoothed validation-level signal further drives an adaptive curriculum over attribution strength, rollout allocation, and monotonic active-set pruning. Experiments on ALFWorld and WebShop show that, without runtime skill access, SkillC surpasses the strongest prior skill-internalization RL baseline by 5.5% and 4.4%, respectively, while remaining competitive with skill-augmented RL methods.

2026-05-28 13:00 JST arXiv cs.AI エージェント

Dr-CiK: 先見性主導型エージェントのためのテストベッド

現実世界の設定における時系列予測は、多くの場合、過去の観測結果だけでなく、ノイズの多い異種の情報ソースから能動的に発見しなければならない外部コンテキストにも依存します。しかし、既存のコンテキスト支援予測ベンチマークは通常、裏付けとなるコンテキストがすでに提供されていることを前提としており、エージェントがそれを自力で特定できるかどうかは未解決のままです。そこで、エージェントが文書コーパスから予測に関連する裏付けコンテキストを取得し、ディストラクタを除外し、取得したコンテキストを予測に役立つ証拠に蒸留し、その証拠に裏付けられた予測を生成できるかどうかを評価するためのベンチマークである Dr-CiK を導入します。コンテキストアブレーションと、最先端のディープリサーチ (DR) 手法と予測手法を組み合わせた評価を通じて、高品質のコンテキストが Dr-CiK での予測性能を大幅に向上させることを示します。ただし、既存の DR エージェントのほとんどは、正解となる裏付け証拠のごく一部しか回収できず (引用の通常 80% がディストラクタ)、コンテキストなしの場合よりも、取得したコンテキストを使用した場合の方が予測器の性能が低下することがあります。私たちの結果は、未来を予測するための適切なコンテキストを探索する先見性主導型エージェントの研究を動機付けます。

原文 (English)

Dr-CiK: A Testbed for Foresight-Driven Agents

Time series forecasting in real-world settings often depends not only on historical observations, but also on external context that must be actively discovered from noisy, heterogeneous information sources. Yet existing context-aided forecasting benchmarks typically assume that the supporting context is already provided, leaving open whether agents can identify it on their own. Therefore, we introduce Dr-CiK, a benchmark for evaluating whether agents can retrieve forecasting-relevant supporting context from a document corpus, filter out distractors, distill the retrieved context into forecast-useful evidence, and generate forecasts supported by that evidence. Through context ablations and evaluations of state-of-the-art deep research and forecasting methods paired together, we show that high-quality context substantially improves forecasting performance in Dr-CiK. However, most existing DR agents recover only a small fraction of the ground-truth supporting evidence (usually 80% distractor citations), and can cause forecasters to perform worse with retrieved context than without context. Our results motivate research on foresight-driven agents that search for the right context to predict the future.

2026-05-28 13:00 JST arXiv cs.AI LLM/生成AI エージェント

Harness-Bench: 現実的なエージェントワークフローにおけるモデル全体のハーネス効果の測定

LLM エージェントは、ツールを使用し、ワークスペースを変更し、具体的な成果物を生成する実行可能システムとして導入されることが増えています。このようなワークフローでは、パフォーマンスはベースモデルだけでなく、ハーネス、つまりコンテキスト、ツール、状態、制約、権限、トレース、リカバリを管理するシステム層にも依存します。しかし、既存のベンチマークは通常、実行を抽象化するか、完成したエージェントシステム同士を比較するか、ハーネスを固定したままにするため、実行層の変動を調査することが困難です。現実的なエージェントワークフローにおける構成レベルのハーネス効果を評価するための診断ベンチマークである Harness-Bench を紹介します。Harness-Bench は、各ハーネスのネイティブな実行動作を維持しながら、共有のタスク環境、予算、評価プロトコルの下で、複数のモデルバックエンドにわたる代表的なハーネス構成を評価します。このベンチマークには、実際のエージェント使用パターンから構築され、現実性、解決可能性、オラクルによる検証可能性、完全性について手動でレビューされた 106 個のサンドボックス化されたオフラインタスクが含まれています。各実行では、最終的な成果物、実行トレース、使用状況の統計、およびバリデータの出力が記録され、最終的な完了の可否にとどまらない分析が可能になります。5,194 の実行軌跡にわたって、モデルとハーネスの組み合わせ間で、完了、プロセス品質、効率、および失敗時の挙動に大きなばらつきが観察されました。これらの結果は、エージェントの能力はベースモデルのみに帰属させるのではなく、モデルとハーネスの構成レベルで報告されるべきであることを示唆しています。さらに、私たちの分析は、もっともらしい推論がツールのフィードバック、ワークスペースの状態、証拠、または検証可能な出力コントラクトから乖離する、繰り返し発生する実行整合性の失敗を特定します。Harness-Bench は、信頼性が高く、効率的で、監査可能なエージェント実行スタックを診断および改善するための再現可能な基盤を提供します。

原文 (English)

Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows

LLM agents are increasingly deployed as executable systems that use tools, modify workspaces, and produce concrete artifacts. In such workflows, performance depends not only on the base model, but also on the harness: the system layer that manages context, tools, state, constraints, permissions, tracing, and recovery. However, existing benchmarks typically abstract away execution, compare complete agent systems, or hold the harness fixed, making execution-layer variation difficult to study. We introduce Harness-Bench, a diagnostic benchmark for evaluating configuration-level harness effects in realistic agent workflows. Harness-Bench evaluates representative harness configurations across multiple model backends under shared task environments, budgets, and evaluation protocols, while preserving each harness's native execution behavior. The benchmark contains 106 sandboxed offline tasks constructed from practical agent-use patterns and manually reviewed for realism, solvability, oracle-checkability, and integrity. Each run records final artifacts, execution traces, usage statistics, and validator outputs, enabling analysis beyond final completion. Across 5,194 execution trajectories, we observe substantial variation in completion, process quality, efficiency, and failure behavior across model-harness pairings. These results suggest that agent capability should be reported at the model-harness configuration level rather than attributed to the base model alone. Our analysis further identifies recurring execution-alignment failures, where plausible reasoning becomes decoupled from tool feedback, workspace state, evidence, or verifiable output contracts. Harness-Bench provides a reproducible foundation for diagnosing and improving reliable, efficient, and auditable agent execution stacks.

2026-05-28 13:00 JST arXiv cs.AI LLM/生成AI エージェント

エージェントはより深く考えますか?逐次計画における層ごとのダイナミクスの機構的調査

最近のメカニズム研究では、大規模言語モデル (LLM) が標準的な単一ターンのタスクでその深さを非効率的に利用している可能性があることが示唆されています。モデルが複数ターンの計画、ツールの使用、および反復的な状態更新を実行しなければならない自律エージェント設定でもこれが当てはまるかどうかは、依然として不明です。私たちは、ディープリサーチ、コード生成、表形式データ処理の 3 つのドメインにわたる完全なユーザー・エージェント軌跡を体系的に層ごとに分析することで、この疑問を調べます。残差ストリームプローブ、因果的な層スキップ介入、および有効深さの測定を使用して、エージェント推論が静的タスクとは異なる深さプロファイルを示すことを示します。軌跡が進むにつれて、モデルは徐々により多くの、より深い層を動員し、後半のターンではより強い長距離の層間依存関係が現れます。同時に、残差更新はますます補正が支配的になり、安定した特徴の蓄積から反復的な再調整への移行を示しています。有効深さの分析では、構築と洗練の間の実質的なギャップがさらに明らかになります。意味的な方向性は多くの場合比較的早期に形成されますが、最終出力を安定させるには深い層が依然として必要です。モデルファミリ間で比較すると、このギャップは Qwen と Minimax で顕著ですが、GLM はよりドメインに依存した深さの割り当てパターンを示します。これらの結果は、推論の複雑さが増すにつれて、自律 LLM エージェントが深さを適応的に割り当てることを示す機構的な証拠を提供します。

原文 (English)

Do Agents Think Deeper? A Mechanistic Investigation of Layer-Wise Dynamics in Sequential Planning

Recent mechanistic studies suggest that large language models (LLMs) may utilize their depth inefficiently in standard single-turn tasks. Whether this still holds in autonomous agent settings, where models must perform multi-turn planning, tool use, and iterative state updates, remains unclear. We study this question through a systematic layer-wise analysis of complete user-agent trajectories spanning three domains: Deep Research, Code Generation, and Tabular Processing. Using residual stream probes, causal layer-skipping interventions, and effective-depth measurements, we show that agentic reasoning exhibits a distinct depth profile from static tasks. As trajectories unfold, models progressively recruit more and deeper layers, with stronger long-range inter-layer dependencies emerging in later turns. At the same time, residual updates become increasingly correction-dominant, indicating a shift from stable feature accumulation toward repeated recalibration. Effective-depth analysis further reveals a substantial construction-refinement gap: semantic direction often forms relatively early, while deep layers remain necessary for stabilizing final outputs. Across model families, this gap is pronounced in Qwen and Minimax, whereas GLM shows a more domain-dependent depth allocation pattern. These results provide mechanistic evidence that autonomous LLM agents allocate depth adaptively as reasoning complexity grows.

2026-05-28 13:00 JST arXiv cs.AI LLM/生成AI エージェント

AsyncTool: マルチタスクシナリオでの非同期関数呼び出し機能の評価

大規模言語モデル (LLM) ベースのエージェントは、外部ツールを使用して複雑なタスクを解決する強力な機能を示しています。ただし、既存の評価では、ツール使用の時間的側面、特にツールの応答遅延の影響が見落とされていることが多く、通常は単一タスクの設定に限定されています。実際のアプリケーションでは、多くの場合、複数のタスクを同時に実行する必要があり、全体的な効率は、エージェントがツールの応答を待つ間にアイドル時間を利用できるかどうかによって決まります。この機能を非同期ツール呼び出しと呼びます。これを評価するために、ツールのフィードバックが遅延した対話型のマルチタスクツール使用環境で LLM ベースのエージェントを評価するためのベンチマークである AsyncTool を提案します。 AsyncTool は、複数の異種タスクを同時に提示し、実行中の現実的なツール応答遅延をシミュレートします。ハイブリッドデータ進化戦略を使用して、複数のシナリオとツールの使用パターンをカバーする多様な非同期マルチタスクデータセットを構築します。ステップ、サブタスク、およびタスクのレベルでモデルを評価し、タスクの調整と完了の効率を測定するための効率指向のメトリクスを導入します。広範な実験により、ツールのフィードバックが遅れていると、現在のエージェントに大きな課題が生じ、明らかなパフォーマンスの低下につながることが示されています。タスクの切り替え、依存関係の追跡、状態の維持をより適切に調整するモデルは、AsyncTool でより優れたパフォーマンスを実現します。私たちの分析は、現在ツールを使用しているエージェントの主な障害モードを特定し、より強力な時間的推論と調整機能を備えた将来のシステムを設計するための実用的な洞察を提供します。

原文 (English)

AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios

Large language model (LLM)-based agents have shown strong capabilities in using external tools to solve complex tasks. However, existing evaluations often overlook the temporal dimension of tool use, especially the impact of tool response latency, and are usually limited to single-task settings. In real-world applications, multiple tasks often need to be executed concurrently, and overall efficiency depends on whether an agent can use idle time while waiting for tool responses. We refer to this capability as asynchronous tool calling. To evaluate it, we propose AsyncTool, a benchmark for assessing LLM-based agents in interactive multi-task tool-use environments with delayed tool feedback. AsyncTool presents multiple heterogeneous tasks simultaneously and simulates realistic tool response latency during execution. Using a hybrid data evolution strategy, we construct a diverse asynchronous multitasking dataset that covers multiple scenarios and tool-use patterns. We evaluate models at the step, sub-task, and task levels, and introduce efficiency-oriented metrics to measure task coordination and completion efficiency. Extensive experiments show that delayed tool feedback poses substantial challenges to current agents and leads to clear performance degradation. Models that better coordinate task switching, dependency tracking, and state maintenance achieve stronger performance on AsyncTool. Our analysis identifies key failure modes of current tool-using agents and provides practical insights for designing future systems with stronger temporal reasoning and coordination capabilities.
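
To make the efficiency argument concrete, the sketch below contrasts waiting on delayed tool responses one at a time with overlapping the waits. It is not the AsyncTool benchmark or its metrics: `call_tool`, the tool names, and the 50 ms latencies are assumptions chosen for illustration, with `asyncio.sleep` standing in for tool response latency.

```python
import asyncio
import time


async def call_tool(name, latency):
    """Simulated tool call; the sleep stands in for response latency."""
    await asyncio.sleep(latency)
    return f"{name}:done"


async def run_sequential(tools):
    """Wait for each tool to respond before issuing the next call."""
    return [await call_tool(name, latency) for name, latency in tools]


async def run_concurrent(tools):
    """Issue all calls up front and use the idle time to wait on all of them."""
    return list(await asyncio.gather(
        *(call_tool(name, latency) for name, latency in tools)))


def timed(coro_fn, tools):
    """Run one strategy and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(tools))
    return results, time.perf_counter() - start


# Hypothetical tools and latencies, for illustration only.
TOOLS = [("search", 0.05), ("database", 0.05), ("calculator", 0.05)]
seq_results, seq_elapsed = timed(run_sequential, TOOLS)
con_results, con_elapsed = timed(run_concurrent, TOOLS)
```

Both strategies return the same results in the same order, but the sequential run takes roughly the sum of the latencies while the concurrent run takes roughly the longest single latency. The sketch ignores what makes the benchmark setting hard for an agent: tracking dependencies between calls and keeping per-task state straight while switching.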

2026-05-28 13:00 JST arXiv cs.AI エージェント

MemCog: 会話エージェントにおけるツールとしての記憶から認知としての記憶へ

既存のエージェント記憶システムは一般的に、単一のクエリがフラットパッセージリストのワンショット検索をトリガーする「Memory-as-Tool」パラダイムに従っており、受動的呼び出し、推論と検索の分離、および取得されたフラグメントとエージェントのナビゲーションニーズ間の構造的不一致に悩まされています。私たちは、メモリアクセスを推論プロセスの不可欠な部分にする認知としてのメモリシステムである MemCog を提案します。 MemCog は、連想リンクグラフを備えたナビゲート可能なメモリストアとしてユーザーの知識を整理し、複数ステップの推論駆動型トラバーサルのためのクロスディメンションナビゲーションインターフェイスを公開し、エージェントが会話コンテキストから自発的にメモリ探索を開始できるようにするプロアクティブ推論プロトコルを採用しています。さらに、プロアクティブなメモリトリガーを評価するための最初のベンチマークである ProactiveMemBench を構築します。実験の結果、MemCog はパッシブ QA ベンチマーク (LoCoMo で 92.98、LongMemEval で 95.8) で最先端のパフォーマンスを達成しながら、ProactiveMemBench のベースラインを大幅に上回るパフォーマンスを示し、Memory-as-Cognition の利点を実証しました。

原文 (English)

MemCog: From Memory-as-Tool to Memory-as-Cognition in Conversational Agents

Existing agent memory systems universally follow what we term a Memory-as-Tool paradigm where a single query triggers one-shot retrieval of flat passage lists, suffering from passive invocation, reasoning-retrieval decoupling, and structural mismatch between retrieved fragments and the agent's navigational needs. We propose MemCog, a Memory-as-Cognition system that makes memory access an integral part of the reasoning process. MemCog organizes user knowledge as Navigable Memory Store with associative link graphs, exposes Cross-Dimensional Navigation Interface for multi-step reasoning-driven traversal, and employs Proactive Reasoning Protocol that drives agents to spontaneously initiate memory exploration from conversational context. We additionally construct ProactiveMemBench, the first benchmark for evaluating proactive memory triggering. Experiments show that MemCog achieves state-of-the-art on passive QA benchmarks (92.98 on LoCoMo, 95.8 on LongMemEval) while substantially outperforming baselines on ProactiveMemBench, demonstrating the advantage of Memory-as-Cognition.
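The contrast between one-shot retrieval of a flat list and multi-step traversal over linked memories can be sketched as follows. This is a minimal toy, not MemCog's actual data model or interface: `LinkedMemory`, `one_shot`, `traverse`, and the example memories are hypothetical names chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    text: str
    links: set[str] = field(default_factory=set)  # ids of associated memories


class LinkedMemory:
    """Toy memory store whose entries are connected by associative links."""

    def __init__(self) -> None:
        self.nodes: dict[str, MemoryNode] = {}

    def add(self, node_id: str, text: str) -> None:
        self.nodes[node_id] = MemoryNode(text)

    def link(self, a: str, b: str) -> None:
        """Create an undirected associative link between two memories."""
        self.nodes[a].links.add(b)
        self.nodes[b].links.add(a)

    def one_shot(self, keyword: str) -> list[str]:
        """Flat retrieval: ids of memories whose text contains the keyword."""
        kw = keyword.lower()
        return sorted(i for i, n in self.nodes.items() if kw in n.text.lower())

    def traverse(self, start_ids: list[str], max_hops: int) -> list[str]:
        """Breadth-first walk over links, up to max_hops away from the seeds."""
        seen = set(start_ids)
        frontier = deque((i, 0) for i in start_ids)
        while frontier:
            node_id, depth = frontier.popleft()
            if depth == max_hops:
                continue
            for nxt in sorted(self.nodes[node_id].links):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
        return sorted(seen)


memory = LinkedMemory()
memory.add("trip", "Planning a trip to Kyoto in May")
memory.add("hotel", "Booked a ryokan with dinner included")
memory.add("diet", "Allergic to shellfish")
memory.link("trip", "hotel")
memory.link("hotel", "diet")
```

A keyword query for "Kyoto" only returns the trip memory, while a two-hop traversal from it also reaches the dietary note, the kind of fragment a single flat lookup would miss.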

2026-05-28 13:00 JST | arXiv cs.AI | エージェント | 研究/論文

長期的な空間生物学の検証可能なベンチマーク

AI エージェントは生物学的データ分析にますます有用になっていますが、既存のベンチマークは、空間測定に対するエンドツーエンドの科学的推論ではなく、広範な生物学的知識、実行可能なワークフロー、または局所的な分析ステップをテストすることがほとんどです。私たちは、長期的 (long-horizon) な空間生物学のベンチマークである SpatialBench-Long を紹介します。このベンチマークでは、エージェントは、手法を指定されることなく、生または生に近いデータと較正された実験コンテキストから生物学的主張を復元する必要があります。 SpatialBench-Long には、原発性膵管腺癌 (PDAC)、人工神経膠芽腫オルガノイドおよび in vivo 腫瘍、Cas9 系統追跡肺腺癌、マウス視神経の老化 / 介入システムにわたる 24 の評価が含まれており、対象データは CosMx、Visium、Xenium、多重エラーロバスト蛍光 in situ ハイブリダイゼーション (MERFISH)、単一細胞 RNA シーケンス (scRNA-seq)、Slide-seq、Slide-tags、組織学、系統記録データに及びます。候補となる主張は、再現、独立した科学者によるレビュー、軌跡の検査を通じて堅牢化されます。最終的な解答は、統制語彙と記号に基づいて決定論的に採点され、主要な分析上の難所での進捗を捉えるルーブリックが付随します。 SpatialBench-Long ベンチマーク全体で、Gemini 3.5 Flash / Pi ターミナルコーディングハーネス、GPT-5.5 / Pi、および GPT-5.5 / OpenAI Codex の 3 つのモデルとハーネスのペアが 72 回の実行中 8 回 (11.1%) で同点です。 SpatialBench-Long は、エージェントが手順的な分析の実行を超えて、複雑な空間測定から正確な科学的結論を導き出せるかどうかをテストします。

原文 (English)

Verifiable Benchmarking of Long-Horizon Spatial Biology

AI agents are increasingly useful for biological data analysis, but existing benchmarks mostly test broad biological knowledge, executable workflows, or localized analysis steps rather than end-to-end scientific reasoning over spatial measurements. We introduce SpatialBench-Long, a benchmark for long-horizon spatial biology in which agents must recover biological claims from raw or near-raw data and calibrated experimental context without prescribed methods. SpatialBench-Long contains 24 evaluations across primary pancreatic ductal adenocarcinoma (PDAC), engineered glioblastoma organoids and in vivo tumors, Cas9 lineage-traced lung adenocarcinoma, and mouse optic nerve aging/intervention systems, spanning CosMx, Visium, Xenium, multiplexed error-robust fluorescence in situ hybridization (MERFISH), single-cell RNA sequencing (scRNA-seq), Slide-seq, Slide-tags, histology, and lineage-recording data. Candidate claims are hardened through reproduction, independent scientist review, and trajectory inspection. Final answers are graded deterministically over controlled vocabularies and symbols with companion rubrics capturing progress through key analysis chokepoints. Across the SpatialBench-Long benchmark, three model-harness pairs tie at 8/72 runs (11.1%): Gemini 3.5 Flash / Pi terminal coding harness, GPT-5.5 / Pi, and GPT-5.5 / OpenAI Codex. SpatialBench-Long tests whether agents can move beyond executing procedural analysis to deriving accurate scientific conclusions from complex spatial measurements.

2026-05-28 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

ZipRL: Hindsight Response Replay を使用した適応型マルチターンコンテキスト圧縮

適応的なコンテキスト圧縮は、大規模言語モデル (LLM) を複雑なマルチターンのエージェントタスクへ拡張するうえで不可欠です。しかし、ルールベースの圧縮手法はタスク上重要なニュアンスを捨ててしまう可能性があり、強化学習 (RL) アプローチは通常、長期的なワークフローに固有の疎な報酬の下で、情報保持とトークン効率のバランスを取ることに苦労します。このギャップを埋めるために、検証可能な報酬からの強化学習 (RLVR) 向けに設計された新しい適応型圧縮フレームワークである ZipRL を提案します。 ZipRL は、能動的かつ非一様に情報を削減する多粒度圧縮メカニズムと、RLVR 最適化中の学習信号を高密度化する手法である Hindsight Response Replay (HRR) を組み合わせています。理論面では、ZipRL が一様な手法よりも高いタスク関連の有用性を持つことを証明します。具体的には、ZipRL はマクロ圧縮に粗から密へ (coarse-to-fine) のプロンプトを利用し、一般化されたアドバンテージの再形成 (generalized advantage reshaping) を通じて HRR を GRPO に組み込みます。バージョンやパラメータ規模の異なる複数のモデルで、本アプローチの有効性を検証しました。 5 つのエージェントタスクのベンチマークでは、ZipRL は Qwen3-4B モデルおよび Qwen3-8B モデル全体で最先端のアプローチを 27.9% および 34.7% 上回り、同時に 256 ターンという極端な外挿ストレステストでも優れたトークン効率と堅牢性を維持しました。

原文 (English)

ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay

Adaptive context compression is vital for scaling Large Language Models (LLMs) to complex, multi-turn agent tasks. However, rule-based compression methods may discard task-critical nuances, while Reinforcement Learning (RL) approaches usually struggle to balance information retention and token efficiency under the sparse rewards inherent to long-horizon workflows. To bridge this gap, we propose ZipRL, a novel adaptive compression framework tailored for Reinforcement Learning from Verifiable Rewards (RLVR). ZipRL features a multi-granularity compression mechanism for active, non-uniform information reduction, coupled with Hindsight Response Replay (HRR), a technique designed to densify training signals during RLVR optimization. Theoretically, we prove ZipRL's superior task-relevant utility over uniform methods. Concretely, ZipRL utilizes coarse-to-fine prompts for macro-compression and incorporates HRR into GRPO via generalized advantage reshaping. Multiple models of varying versions and parameter scales validate the effectiveness of our approach. Benchmarks on five agent tasks show ZipRL outperforms state-of-the-art approaches by 27.9% and 34.7% across Qwen3-4B and Qwen3-8B models, while maintaining exceptional token efficiency and robustness under extreme 256-turn extrapolation stress tests.
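To see why sparse rewards are a problem for GRPO-style training, it helps to look at the standard group-relative advantage, which normalizes each rollout's reward against the other rollouts sampled for the same prompt. The sketch below shows only that baseline computation; it is not ZipRL's HRR or its advantage reshaping, whose details the abstract does not give. The function name, the epsilon, and the use of the population standard deviation are assumptions (implementations differ on the last point).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and spread of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, see lead-in
    return [(r - mu) / (sigma + eps) for r in rewards]


# Sparse-reward case: every rollout in the group failed, so there is no signal.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# One success in the group: that rollout is pushed up, the others pushed down.
one_success = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is the kind of gap a signal-densifying technique is meant to address.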

2026-05-28 13:00 JST | arXiv cs.AI | エージェント

MACReD: 反応図解析のためのマルチエージェント協調推論フレームワーク

科学文献から化学反応図を解析することは、不均一なレイアウト、絡み合った視覚要素、認識と推論の統合の難しさのため、困難です。既存の視覚言語モデルは、マルチモーダルな理解を促進しますが、依然として複雑な図では失敗し、空間的一貫性を維持し、推論中に多次元情報を統合するのに苦労しています。これらの問題に対処するために、統一された VLM 誘導アーキテクチャ内で分子認識、矢印理解、テキスト抽出、反応再構築のために特化したエージェントを調整する階層型マルチエージェントフレームワークである MACReD を提案します。計画層と知覚層は柔軟できめの細かい検出を使用して視覚的な複雑さを処理し、推論層はマルチグラフ融合メカニズムを使用して異種の手がかりを統合し、化学的に一貫したグローバル推論を強制します。 RxnScribe ベンチマークの実験では、MACReD がハード一致基準とソフト一致基準で 75.2% と 84.6% の F1 スコアを達成し、それぞれ 69.1% と 80.0% を獲得する RxnScribe ベースラインを上回る最先端のパフォーマンスを達成していることが示されています。これらの結果は、マルチステップ反応やツリー構造反応など、さまざまなダイアグラムレイアウトにわたる MACReD の堅牢性を示しています。

原文 (English)

MACReD: A Multi-Agent Collaborative Reasoning Framework for Reaction Diagram Parsing

Parsing chemical reaction diagrams from scientific literature is challenging due to heterogeneous layouts, intertwined visual elements, and the difficulty of integrating recognition and reasoning. Existing vision-language models advance multimodal understanding but still fail on complex diagrams, struggling to maintain spatial coherence and to integrate multidimensional information during reasoning. To address these issues, we propose MACReD, a hierarchical multi-agent framework that coordinates specialized agents for molecular perception, arrow understanding, text extraction, and reaction reconstruction within a unified VLM-guided architecture. The planning and perception layers use flexible, fine-grained detection to handle visual complexity, while the reasoning layer uses a multigraph fusion mechanism to integrate heterogeneous cues and enforce chemically consistent global reasoning. Experiments on the RxnScribe benchmark show that MACReD achieves state-of-the-art performance, with F1 scores of 75.2% and 84.6% under hard and soft match criteria, outperforming the RxnScribe baseline, which obtains 69.1% and 80.0%, respectively. These results demonstrate the robustness of MACReD across diverse diagram layouts, including multi-step and tree-structured reactions.

2026-05-28 13:00 JST | arXiv cs.AI | エージェント

マルチエージェントシステムにおけるエージェントのバイアス増幅と抑制の比較

エージェントが対話して個人および集団の目的を達成するさまざまなタスクをサポートするために、マルチエージェントシステムがますます導入されています。これらのシステムはタスクのパフォーマンスと意思決定を強化できますが、バイアスの削減による公平性の維持は依然として困難です。この研究では、エージェントレベルのバイアスがどのように変化し、システム全体の公平性にどのように影響するかを調査します。プロンプトを使用して個々のエージェントをグループ有利なバイアスにさらし、システムレベルで下流への影響を評価します。影響を定量化するために、有利なグループの上昇と不利なグループの抑制の間のバイアスの変化を分解するゼロ中心の指標である有利バイアス強度 (FBS) を提案します。複数のエージェントの設計、ベンチマーク、および最新の大規模言語モデルを使用して、バイアスを与えられたエージェントがシステム全体の公平性に大きな影響を与える可能性があることを示します。興味深いことに、エージェントが均一にバイアスにさらされると、システム全体のバイアスが増加し、個々のエージェントのバイアスの加算を超えることもあります。経験的証拠は、マルチエージェントシステムにおける公平性の重要性を強調しており、さらなる分析と実証的テストが必要です。

原文 (English)

Examining Agents' Bias Amplification versus Suppression in Multi-Agent Systems

Multi-agent systems are increasingly deployed to support various tasks where agents interact to achieve individual and collective objectives. Although these systems can enhance task performance and decision-making, fairness preservation through bias reduction remains challenging. This study examines how agent-level biases shift and impact system-wide fairness. We use prompts to expose individual agents to group-favoring bias, then assess downstream impacts at the system level. To quantify the impact, we propose Favor Bias Strength (FBS), a zero-centered metric that decomposes bias alteration between favored-group uplift and disfavored-group suppression. Using multiple agent designs, benchmarks, and up-to-date large language models, we show that agents endowed with bias can substantially affect system-wide fairness. Interestingly, when agents are exposed to bias uniformly, the system-wide bias elevates, even exceeding the additive sum of the individual agents' biases. The empirical evidence underscores the criticality of fairness in multi-agent systems, which warrants further analyses and empirical tests.

2026-05-28 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

文レベルの修正による共同攻撃に対する LLM ベースのマルチエージェントシステムの防御

近年、協調的な意思決定と複雑な問題解決に優れた、大規模言語モデルベースのマルチエージェントシステム (MAS) が急速に発展しています。しかし、MAS 内の悪意のあるエージェントが誤った情報を注入して他のエージェントを誤導し、システムのパフォーマンスを損なう可能性があるため、MAS における攻撃メカニズムと防御戦略に焦点を当てた新しい研究の方向性が生まれています。これまでの研究の多くは、悪意のあるエージェントが独立して行動すると仮定し、それに対応する防御戦略を調査してきました。しかし私たちは、悪意のあるエージェントが協調的な行動を示し、内部での情報交換を通じてより効果的な攻撃を行う可能性があると主張します。この論文では、悪意のあるエージェントが複数ラウンドの相互作用を通じて攻撃戦略を自律的に連携させ、動的に調整する、適応型の協調攻撃フレームワークを提案します。さらに、エージェント間のコミュニケーションに含まれる誤解を招く情報を文レベルで特定し、修正する防御フレームワークである文レベル信頼性分析および修正 (Sentence-Level Trustworthiness Analysis and Rectification; STAR) を導入します。私たちの実験では、協調攻撃は独立した攻撃よりもタスク成功率を大きく低下させ、相対的に 5.34% の低下をもたらしました。一方、STAR は協調的な脅威と独立した脅威の両方を効果的に軽減し、タスク成功率を平均 36.76% 向上させます。コードは https://github.com/smoooom/STAR で入手できます。

原文 (English)

Defending LLM-based Multi-Agent Systems Against Cooperative Attacks with Sentence-Level Rectification

Recent years have witnessed the rapid development of Large Language Model-based Multi-Agent Systems (MAS), which excel at collaborative decision-making and complex problem-solving. However, malicious agents in MAS may inject misinformation to mislead other agents and disrupt system performance, giving rise to a new research direction that focuses on attack mechanisms and defense strategies in MAS. Prior studies largely assume malicious agents act independently and investigate the corresponding defense strategies. However, we argue that malicious agents may exhibit collaborative behaviors, enabling more effective attacks through internal information exchange. In this paper, we propose an adaptive cooperative attack framework, where malicious agents autonomously coordinate and dynamically adjust their attack strategies through multi-round interactions. Furthermore, we introduce Sentence-Level Trustworthiness Analysis and Rectification (STAR), a defense framework that identifies and rectifies misleading information at the sentence level within agent communications. Our experiments show that cooperative attacks lead to a significantly larger degradation in task success rate than independent attacks, resulting in a relative drop of 5.34%. Meanwhile, STAR effectively mitigates both cooperative and independent threats and improves task success rate by an average of 36.76%. The code is available at https://github.com/smoooom/STAR.

2026-05-28 13:00 JST | arXiv cs.AI | エージェント

命令調整された言語モデルエージェントにおける人間のような集団内バイアス

自律型 AI エージェントが、タスクの調整、リソースのルーティング、評判履歴の蓄積を行う永続的な相互作用ネットワークに展開されるにつれ、そこに生じる社会的力学が、誰が機会を得て誰が得られないかを、いかなる人間の機関も監督できない規模で決定することになります。私たちは、命令調整された言語モデルエージェントが、グループラベルの顕著性とリソースの希少性を操作する 3 つの条件下で 500 ターンにわたって相互作用する、制御されたマルチエージェントシミュレーションを、6 つのモデルファミリについてそれぞれ 20 シードで実行しました。グループラベルが見える場合には、内集団への信頼バイアス、行動のホモフィリー (同類選好)、およびネットワークの同類性が観察されましたが、ラベルが隠されている場合にはいずれも見られませんでした。これは、人間の社会心理学における顕著性依存性と構造的に一致するパターンです。この差別は、標準的な行動ログ監査では検出できませんでした。バイアスは、どの行動が選択されたかではなく、各行動を誰が受け取ったかを通じてのみ作用し、行動タイプの分布には、条件間で否定的な行動の増加は見られませんでした。ターンごとの内集団と外集団の差は 5 ～ 16 パーセントポイントで、6 つのモデルすべてで統計的に有意でした (Wilcoxon の符号付き順位検定、Benjamini-Hochberg 補正後の p はすべて 0.001 未満)。これにより、グループに依存したターゲティングが、アーキテクチャや学習方式を問わず命令調整言語モデルに共通する頑健な特性であることが示されました。 500 ターンにわたる相互のやり取りを通じてこれらの差は累積し、+0.014 ～ +0.100 (d = 0.84-4.52) の内集団信頼バイアスとなりました。これは、インタラクションごとのわずかなターゲティングが、永続的なネットワークにおける構造的不平等へと波及する様子を示しています。

原文 (English)

Human-like in-group bias in instruction-tuned language model agents

As autonomous AI agents are deployed in persistent, interacting networks -- coordinating tasks, routing resources, and accumulating reputational histories -- the social dynamics that emerge will determine who receives opportunity and who does not, at scales no human institution can supervise. We ran a controlled multi-agent simulation in which instruction-tuned language model agents interacted across 500 turns under three conditions manipulating group label salience and resource scarcity, across six model families with 20 seeds each. When group labels were visible, we observed in-group trust bias, action homophily, and network assortativity -- all absent when labels were hidden -- a pattern structurally consistent with salience-dependence in human social psychology. This discrimination was invisible to standard action-log audits: bias operated entirely through who received each action, not what actions were chosen, with action-type distributions showing no increase in negative actions across conditions. Per-turn in-group versus out-group differentials of 5 to 16 percentage points were statistically significant for all six models (Wilcoxon signed-rank, all Benjamini-Hochberg-corrected p < 0.001), establishing group-contingent targeting as a robust property of instruction-tuned language models across architectures and training regimes. Compounded through 500 turns of reciprocation, these differentials accumulated into in-group trust biases of +0.014 to +0.100 (d = 0.84-4.52) -- illustrating how modest per-interaction targeting propagates into structural inequality in persistent networks.
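The abstract reports Benjamini-Hochberg-corrected p-values across the six per-model tests. As a reminder of what that correction does, here is a small pure-Python sketch of the standard Benjamini-Hochberg adjustment. The function name is an assumption, and the p-values in the example are made up for illustration; they are not the paper's results.

```python
def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, e.g. one test per model.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
```

Each raw p-value is scaled by the number of tests divided by its rank, then capped so that a smaller raw p-value never ends up with a larger adjusted value than a bigger one.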

2026-05-28 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント | 研究/論文

OR-Space: 産業最適化エージェントのフルライフサイクルワークスペースベンチマーク

大規模言語モデル (LLM) エージェントは、オペレーションズリサーチ (OR) のモデリング支援にますます使用されていますが、既存の OR 向けベンチマークでは、評価が、自己完結した問題文から数学的定式化またはソルバープログラムへのワンショット変換に限定されがちです。このような設定は、実際の産業用 OR ワークフローが持つ 2 つの特性、つまり永続的なマルチアーティファクトワークスペースと多段階のタスクライフサイクルを捨象しています。私たちは、モデルの構築、モデルの改訂、根拠に基づく説明にわたって産業最適化エージェントを評価するフルライフサイクルワークスペースベンチマークである OR-Space を紹介します。各インスタンスは実行可能なワークスペースであり、ビジネスドキュメント、構造化データ、オプションのコードアーティファクト、ソルバー出力、およびタスク固有の評価器が、相互に依存するファイル群に分散して含まれています。 OR-Space は 3 つのタスクモードを定義します。Build では、エージェントは異種のアーティファクトからソルバーで解ける状態の最適化モデルを構築します。Revise では、エージェントは有効な既存のロジックを維持しながら、要件の変更やソルバーのフィードバックに応じて既存のモデルを修正します。Explain では、エージェントはワークスペースのアーティファクト全体に散在する証拠を用いて、解、制約、ビジネスへの影響に関する根拠に基づく質問に回答します。 OR-Space は、永続的なワークスペースとライフサイクル指向のタスクを組み合わせることで、エージェントがエンドツーエンドのテキスト生成を超えて信頼性の高い最適化作業を実行できるかどうかを評価します。ベンチマークの設計、評価プロトコル、品質管理パイプラインについて説明し、OR-Space を、産業用 OR ワークフローにおける LLM エージェントの信頼性、障害モード、および実用上の準備状況を研究するためのベンチマークとして位置づけます。

原文 (English)

OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR-oriented benchmarks often reduce evaluation to one-shot translation from a self-contained problem statement into a mathematical formulation or solver program. Such settings abstract away two characteristics of real industrial OR workflows: persistent multi-artifact workspaces and multi-stage task lifecycles. We introduce OR-Space, a full-lifecycle workspace benchmark for evaluating industrial optimization agents across model construction, model revision, and grounded explanation. Each instance is an executable workspace containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators distributed across interdependent files. OR-Space defines three task modes: Build, where agents construct solver-ready optimization models from heterogeneous artifacts; Revise, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic; and Explain, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts. By combining persistent workspaces with lifecycle-oriented tasks, OR-Space evaluates whether agents can perform reliable optimization work beyond end-to-end text generation. We describe the benchmark design, evaluation protocol, and quality-control pipeline, and position OR-Space as a benchmark for studying the reliability, failure modes, and practical readiness of LLM agents in industrial OR workflows.

2026-05-28 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

マルチホップオーディオビジュアル推論のためのエージェントアクティブオムニモーダル知覚

関連する証拠がまばらで、時間的に分散し、オーディオとビジュアルの両方のストリームに分散していることが多いため、マルチホップのオーディオビジュアル推論は依然としてオムニ LLM にとって困難なままです。既存のベンチマークでは、この設定の限定的な調査が提供されており、通常は限られた数のモダリティ、関連する時間セグメント、または推論ステップのみが含まれています。この研究では、時間的に分散した視聴覚証拠に対するマルチホップ推論を必要とする、慎重に精選された 519 の質問を含むベンチマークである MOV-Bench を紹介します。 MOV-Bench での評価では、現在の Omni-LLM が依然としてマルチホップクロスモーダル推論に苦労していることが明らかになりました。この課題に対処するために、私たちはさらに、アクティブなオムニモーダル認識のためのオープンソース Omni-LLM 上に構築された効率的なエージェントフレームワークである AOP-Agent を提案します。 AOP-Agent は、階層的なオムニモーダルメモリと協調的な観察、反映、再計画のループを組み合わせることで、追加のトレーニングや独自のモデルを必要とせずに、オープンソースの Omni-LLM が能動的な認識を実行できるようにします。 MOV-Bench と OmniVideoBench での実験では、AOP-Agent が一貫して推論パフォーマンスを向上させ、特に長いビデオや推論中心の質問で顕著な向上が見られることが実証されました。

原文 (English)

Agentic Active Omni-Modal Perception for Multi-Hop Audio-Visual Reasoning

Multi-hop audio-visual reasoning remains challenging for Omni-LLMs, as relevant evidence is often sparse, temporally dispersed, and distributed across both audio and visual streams. Existing benchmarks provide limited investigation of this setting, typically involving only a limited number of modalities, relevant temporal segments, or reasoning steps. In this work, we introduce MOV-Bench, a benchmark containing 519 carefully curated questions that require multi-hop reasoning over temporally dispersed audio-visual evidence. Evaluations on MOV-Bench reveal that current Omni-LLMs still struggle with multi-hop cross-modal reasoning. To address this challenge, we further propose AOP-Agent, an efficient agentic framework built on open-source Omni-LLMs for active omni-modal perception. By combining a hierarchical omni-modal memory with a collaborative observe-reflect-replan loop, AOP-Agent enables open-source Omni-LLMs to perform active perception without additional training or proprietary models. Experiments on MOV-Bench and OmniVideoBench demonstrate that AOP-Agent consistently improves reasoning performance, with particularly notable gains on long videos and reasoning-intensive questions.

2026-05-28 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

プラント、持続、トリガー: 大規模言語モデルエージェントに対するスリーパー攻撃

Large Language Model (LLM) エージェントは、外部環境からの安全性の脅威に対して依然として脆弱です。攻撃者は、ツールから返されたデータ、Web ページ、MCP コンテキストなどの外部観察に敵対的なコンテンツを挿入し、安全でないアクションや不正な出力などの有害なエージェント動作を引き起こします。既存の研究は通常、シングルインタラクション攻撃に焦点を当てており、エージェントは敵対的なコンテンツを観察し、1 つのユーザー要求内で直ちに有害な動作を示します。ただし、敵対的なコンテンツは同じエージェントが提供するインタラクション間で持続する可能性もあり、そのような脅威の検出と軽減が困難になることを示しています。具体的には、敵対的なコンテンツはエージェント状態に残り、インタラクションの間は休止状態のままになり、その後無害なユーザークエリによってアクティブ化される可能性があります。私たちは、この種の安全上の脅威をスリーパーアタックとして正式に表現します。これを評価するために、6 つの現実世界の有害な結果、3 つの攻撃戦略、および 3 つのエージェント状態ターゲット (セッションコンテキスト、メモリ、再利用可能なスキル) をカバーする 1,896 のインスタンスでベンチマークを構築しました。 7 つの強力なオープンソースおよびクローズドソース LLM に関する実験では、最先端の LLM エージェントは、単一インタラクションのベースラインで低い攻撃成功率を達成した場合でも、依然としてスリーパー攻撃に対して脆弱であることが示されています。私たちのコードとデータは https://anonymous.4open.science/r/skdvnfu23ihr9wdscnksf1asdffsaef で入手できます。

原文 (English)

Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents

Large Language Model (LLM) agents remain vulnerable to safety threats from the external environment, where attackers inject adversarial content into external observations such as tool-returned data, webpages, or MCP context, causing harmful agentic behaviors such as unsafe actions or incorrect outputs. Existing studies typically focus on single-interaction attacks, where the agent observes adversarial content and immediately exhibits harmful behavior within one user request. However, we show that adversarial content can also persist across interactions served by the same agent, making such threats harder to detect and mitigate. Specifically, adversarial content may persist in the agent state, remain dormant across interactions, and later be activated by a benign user query. We formalize this type of safety threat as Sleeper Attack. To evaluate it, we construct a benchmark with 1,896 instances covering six real-world harmful outcomes, three attack strategies, and three agent state targets: session context, memory, and reusable skills. Experiments on seven strong open-source and closed-source LLMs show that state-of-the-art LLM agents remain vulnerable to Sleeper Attack, even when they achieve low attack success rates under a single-interaction baseline. Our code and data are available at https://anonymous.4open.science/r/skdvnfu23ihr9wdscnksf1asdffsaef.

2026-05-28 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント | ハードウェア/半導体

いつ最適化すべきかを学ぶ: エキスパート GPU カーネルの系統から得られる検証済み最適化スキル

LLM ベースのエージェントは、GPU カーネルの生成にますます使用されていますが、多くの場合、それらの最適化がいつ適切であるかは分からずに、どのような最適化を試みるべきかはわかっています。 KLineage を導入します。KLineage は、この欠落している「いつ」の知識をエキスパートカーネルから学習します。KLineage は、前方ロールアウトに依存するのではなく、検証ゲートによる簡略化を通じてエキスパート実装を後方に導き、受け入れられた各ステップを逆に再利用可能な最適化スキルに変換します。各スキルは、最適化の意図だけでなく、それがコード内のどこに適用されるか、どのような条件で最適化が有効になったか、どのような効果があったのか、その前提によってどのような失敗が回避されたのかも記録します。ダウンストリーム LLM は、同じコンパイル/正確性/プロファイルゲートの下で新しいコードサーフェス上でこれらのスキルを具体化します。 2 つの NVIDIA アーキテクチャにわたる 5 つのエキスパートワークロードでは、これらの系統由来のスキルが効果的な最適化カリキュラムとして機能し、同じ固定予算の下で最終的なカーネル品質と最適化効率の両方において最近のメモリベースの LLM カーネルベースラインを上回ります。さらに、ソースケースの記憶に対する健全性テストとして、別個の 22 インスタンスのホールドアウトチェックを使用します。

原文 (English)

Learning When to Optimize: Verified Optimization Skills from Expert GPU-Kernel Lineages

LLM-based agents are increasingly used to generate GPU kernels, but they often know what optimizations to try without knowing when those optimizations are sound. We introduce KLineage, which learns this missing "when" knowledge from expert kernels: instead of relying on forward rollouts, KLineage walks expert implementations backward through validation-gated simplifications and reverses each accepted step into a reusable optimization skill. Each skill records not only the optimization intent, but also where it applies in code, what conditions made it valid, what effect it had, and what failures its assumptions avoid. A downstream LLM materializes these skills on new code surfaces under the same compile/correctness/profile gate. On five expert workloads across two NVIDIA architectures, these lineage-derived skills serve as an effective optimization curriculum, exceeding recent memory-based LLM-kernel baselines in both final kernel quality and optimization efficiency under the same fixed budget. We additionally use a separate 22-instance held-out check as a sanity test against source-case memorization.
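The fields the abstract lists for each skill (intent, where it applies, validity conditions, effect, and avoided failures) map naturally onto a small record type, with acceptance decided by a set of gates. The sketch below is one possible shape under stated assumptions, not KLineage's actual schema: `OptimizationSkill`, `passes_gates`, the field names, the example skill, and the stub gate checks are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class OptimizationSkill:
    """Hypothetical record mirroring the five things the abstract says a skill stores."""
    intent: str                        # what the optimization is trying to achieve
    applies_at: str                    # where in the code it applies
    valid_when: tuple[str, ...]        # conditions that made it valid
    observed_effect: str               # what effect it had
    failures_avoided: tuple[str, ...]  # failures its assumptions avoid


# A gate takes candidate kernel source and reports whether it passes.
Gate = Callable[[str], bool]


def passes_gates(candidate: str, gates: dict[str, Gate]) -> tuple[bool, list[str]]:
    """Run every gate; return overall acceptance and the names of failed gates."""
    failed = [name for name, check in gates.items() if not check(candidate)]
    return (not failed, failed)


# Illustrative skill; the contents are invented for this example.
skill = OptimizationSkill(
    intent="stage reused tiles in on-chip memory",
    applies_at="inner reduction loop",
    valid_when=("tile fits on-chip", "no aliasing between inputs"),
    observed_effect="fewer global memory loads",
    failures_avoided=("out-of-bounds read on ragged tiles",),
)

# Stub gates standing in for real compile, correctness, and profile checks.
stub_gates: dict[str, Gate] = {
    "compile": lambda src: "syntax_error" not in src,
    "correctness": lambda src: "wrong_result" not in src,
    "profile": lambda src: "slow" not in src,
}
```

Keeping the validity conditions next to the intent is what lets a downstream model decide whether a skill is sound on a new kernel before it is applied, rather than only after a failed rollout.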

2026-05-28 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

ツール使用 LLM エージェントの複数軌道推論にメモリが役立つのはどのような場合ですか?

ツール使用 LLM エージェントの複数軌道推論 - 複数の推論試行を生成し、その中から選択する - は、試行間で知識を転送することで、後のエージェントが以前のエージェントの落とし穴を回避できるという利点があります。既存のクロストラジェクトリメモリ手法 (トラジェクトリレベルのリフレクション、アトミックファクト抽出、生の観測値の注入) はそれぞれ、単一タスクの単一の推論戦略に基づいて評価されるため、報告されるゲインがメモリ抽象化のプロパティを反映しているのか、それとも推論手法のプロパティを反映しているのかが不明確です。私たちは、転送の範囲 (拡張内と軌跡全体) と転送されたコンテンツの抽象化という 2 つの軸に沿ってメモリを分解する統一フレームワークを提案し、実際のエージェントの導入体制に一致するベリファイアなしの設定で、SQL、ナレッジグラフ、および CLI 環境にわたる 4 つのツール使用ベンチマークで 3 つの推論戦略 (ベストオブ N、ビーム検索、MCTS) に基づいて 4 つのメソッドを評価します。実験行列は推論方法を交絡として識別します。同じ記憶方法は、同じ例の異なる推論戦略の下で統計的に異なる結果を生成します。リフレクションは MCTS の下でのみ重要になります (ベストオブ N の下ではそうではありません)。拡張内注入 (前の兄弟の結果に基づいて各候補を条件付ける) は、ダイバーシティが不足したビーム探索のみに役立ちます。また、アトミックファクト抽出は精度に中立ですが、再利用可能な環境構造を持つタスクでは軌道を 19 ～ 26% 短縮します。

原文 (English)

When Does Memory Help Multi-Trajectory Inference for Tool-Use LLM Agents?

Multi-trajectory inference for tool-use LLM agents - generating multiple reasoning attempts and selecting among them - benefits from transferring knowledge across attempts so that later ones avoid the pitfalls of earlier ones. Existing cross-trajectory memory methods (trajectory-level reflection, atomic fact extraction, raw observation injection) are each evaluated under a single inference strategy on a single task, making it unclear whether reported gains reflect properties of the memory abstraction or of the inference method. We propose a unified framework that decomposes memory along two axes -- the scope of transfer (within an expansion vs. across trajectories) and the abstraction of the transferred content -- and evaluate four methods under three inference strategies (best-of-N, beam search, MCTS) on four tool-use benchmarks spanning SQL, knowledge-graph, and CLI environments, in a verifier-free setting that matches the deployment regime of practical agents. The experiment matrix identifies the inference method as a confound: the same memory method produces statistically distinct results under different inference strategies on the same examples. Reflection reaches significance only under MCTS (not under best-of-N); within-expansion injection (conditioning each candidate on prior siblings' outcomes) helps only diversity-starved beam search; and atomic fact extraction is accuracy-neutral but shortens trajectories by 19-26% on tasks with reusable environmental structure.

2026-05-28 13:00 JST | arXiv cs.AI | エージェント

Plan Before Search: Search Agents Need Plan

Training large language models as retrieval-augmented reasoning agents typically combines reinforcement learning with an SFT cold start dis…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント | 研究/論文

From Knowing to Doing: A Memory-Controlled Benchmark for LLM Trading Agents on Stock Markets

Evaluating whether large language model (LLM) agents can profit in capital markets is increasingly framed as end-to-end trading: place an a…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

Prompt Codebooks: Discrete Compositional Optimization for Language Model Instruction Refinement

Automatic prompt optimization (APO) has driven significant gains in LLM-based agentic workflows. However, existing methods treat each task'…

2026-05-28 13:00 JST | arXiv cs.AI | エージェント

CyberJurors: A Multi-Agent Simulation Task for E-Commerce Disputes Verdict

E-commerce platforms have begun recruiting crowdsourced jurors to adjudicate massive volumes of transaction disputes. Unlike formal legal j…

2026-05-28 13:00 JST | arXiv cs.AI | エージェント | 研究/論文

From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence

Industrial Prognostics and Health Management (PHM) provides a representative case study for a broader challenge in applied machine learning…

2026-05-28 13:00 JST | arXiv cs.AI | エージェント

You Live More Than Once: Towards Hierarchical Skill Meta-Evolving

Test-time skill evolving is regarded as a new paradigm for enhancing deployed agentic systems. Existing works mainly focus on hard-coded sk…

2026-05-28 13:00 JST | arXiv cs.AI | エージェント

Do Agents Know What They Can't Do? Evaluating Feasibility Awareness in Tool-Using Agents

Tool-using agents often incur substantial computational cost due to long reasoning chains and iterative tool usage. In practical scenarios,…

2026-05-28 13:00 JST | arXiv cs.AI | エージェント | 研究/論文

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

As agent capabilities advance, existing benchmarks, such as $\tau^2$-Bench, are becoming increasingly saturated. Yet constructing new bench…

2026-05-28 13:00 JST | arXiv cs.AI | エージェント

Adaptive Multimodal Agents-Based Framework for Automatic Workflow Execution

Modern information systems require autonomous agents capable of navigating complex workflows, yet current methodologies often struggle with…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

LACUNA: Safe Agents as Recursive Program Holes

LLM agents increasingly act by writing code, yet a split persists between the runtime that drives the agent and the code the model writes.…

2026-05-28 13:00 JST | arXiv cs.AI | エージェント | 研究/論文

AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation

Scientific research proceeds through iterative cycles of hypothesis generation, experiment design, execution, and revision. AI agents can a…

2026-05-28 13:00 JST | arXiv cs.AI | エージェント | 研究/論文

VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora

Existing benchmarks have laid the foundation for travel planning agents by establishing API-centric paradigms. However, as the capabilities…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

Large language models increasingly rely on either reinforcement learning or multi-agent prompting to improve reasoning, yet these two parad…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

Are LLM-based search agents genuinely searching, or using the web to verify what they already know? We study this question on BrowseComp wi…

2026-05-28 13:00 JST | arXiv cs.AI | エージェント | ハードウェア/半導体

SwarmHarness: Skill-Based Task Routing via Decentralized Incentive-Aligned AI Agent Networks

Vast quantities of compute (GPU cycles on personal workstations, idle inference servers, and edge devices between jobs) go unused because n…

2026-05-28 13:00 JST | arXiv cs.AI | エージェント

Calibrating Conservatism for Scalable Oversight

Agentic AI systems capable of autonomous planning and extended environmental interaction pose a fundamental control problem: how can humans…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

RAG-Coding: Enhancing LLM Medical Coding with Structured External Knowledge

We present RAG-Coding, an agentic method for automated ICD-10-CM coding. RAG-Coding orchestrates four large language model (LLM) agents and…

2026-05-28 13:00 JST | arXiv cs.AI | エージェント | 研究/論文

From Instructor to Collaborator: What a 90-Participant Study Reveals about Human-Agent Collaboration in a Mobile Serious Game

This position paper reflects empirical data collected during my PhD from a large-scale within-subjects study (N = 90). The study compared a…

2026-05-28 13:00 JST | arXiv cs.AI | エージェント

Personalized Observation Normalization for Federated Reinforcement Learning in Simulation Environments with Heterogeneity

Federated reinforcement learning (FedRL) enables multiple agents to collaboratively train a global policy without sharing raw data, making…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

StoryMI: Steerable Multi-Agent Therapeutic Dialogue Generation

Large language models (LLMs) can generate fluent dialogue, but prior works lack situational grounding, dynamic strategy control, and evalua…

2026-05-28 13:00 JST | arXiv cs.AI | エージェント

Agentic Literacy Debt: A Structural Problem the AI Literacy Field Has Not Yet Named

Autonomous AI agents now plan, decide, and act on behalf of users across healthcare, financial services, and workplace contexts, often with…

2026-05-28 13:00 JST | arXiv cs.AI | エージェント

Heterogeneous Multi-Agent Modeling for Measurement and Network Analysis of the Data Service Market

With the increasing complexity of collaboration among various social entities and user demands, the factors affecting the stable developmen…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

MGRetrieval: Memory-Guided Reflective Retrieval for Long-Term Dialogue Agents

Large Language Models (LLMs) have made significant progress in dialogue, yet redundant memory contexts severely limit their effectiveness i…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

Generic Interpretation Approach for Transformer Models Incorporating Heterogenous Attention Structures

Transformer has significantly propelled the development of artificial intelligence, and certainly the development of agents as well. We cat…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

AgensFlow: A Coordination-Policy Substrate for Multi-Agent Systems

Multi-agent systems built on large language models (LLMs) require many coordination choices that are difficult to fix a priori: which skill…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

Detect by Yourself: Self-Designing Agentic Workflows for Few-Shot Graph Anomaly Detection

Graph anomaly detection aims to identify anomaly nodes in attributed graphs and plays an important role in real-world applications. However…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

Grimlock: Guarding High-Agency Systems with eBPF and Attested Channels

Agentic systems increasingly run user-authored orchestration code that invokes tools, spawns subtasks, and delegates work across machines a…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

HARP: Measuring Harm Amplification in Multi-Agent LLM Systems

Multi-agent LLM systems decompose workflows across agents, tools, shared context, memory, and decision gates. This modularity improves inte…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents, Business/Funding, Research/Papers

Benchmarks are Not Enough: RAMP for Runtime Assessing of Agentic Models in Production Systems

LLM agents are rapidly evolving from coding assistants into autonomous software engineering systems. However, existing evaluation methodolo…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Detection Without Correction: A Two-Parameter Decomposition of Multi-Stage LLM Pipelines

Multi-stage LLM pipelines that perform multi-agent debate, intrinsic self-correction, or retrieval-augmented verification exhibit puzzling…

2026-05-28 13:00 JST | arXiv cs.AI | Agents, Hardware/Semiconductors

The Energy Blind Spot: NVIDIA's Flagship Edge AI Hardware Cannot Support Process-Level Energy Attribution

Agentic AI workloads - where a single user goal triggers multi-step orchestration, tool calls, retries, and failure recovery - are being ta…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

Transferable Reinforcement Learning via Probabilistic Latent Embeddings and Dynamic Policy Adaptation for Sim-to-Real Deployment

Due to limited resources and public safety concerns, deep reinforcement learning (RL) agents for many cyber-physical systems (e.g., autonom…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

UserHarness: Harnessing User Minds for Stronger Agent Theory-of-Mind

Understanding what a user believes and intends is central to building effective agent assistants. This ability is often evaluated through T…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents, Business/Funding, Research/Papers

VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents, Research/Papers

KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. How…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

Learning to Assign Prediction Tasks to Agents with Capacity Constraints

We address the problem of learning to assign prediction tasks to one agent from a set of available human or AI agents. In particular, we fo…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

Tool Forge: A Validation-Carrying Toolchain for Governed Agentic Execution

Large language model agents are increasingly expected to perform operational work: calling APIs, manipulating files, assembling workflows,…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

MIRAGE: Context-Aware Prompt Injection against Mobile GUI Agents via User-Generated Content

Mobile graphical user interface (GUI) agents driven by vision-language models (VLMs) perceive the screen as rendered pixels and choose acti…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

LegalGraphRAG: Multi-Agent Graph Retrieval-Augmented Generation for Reliable Legal Reasoning

Graph-based Retrieval-Augmented Generation (GraphRAG) advances flat document retrieval by structuring knowledge as relational graphs, enabl…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

SNARE: Adaptive Scenario Synthesis for Eliciting Overeager Behavior in Coding Agents

A coding agent executes a benign task as a sequence of shell, file, and network actions, any of which can quietly exceed the authorized sco…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

DeltaMCP: Incremental Regeneration via Spec-Aware Transformation for MCP servers

The rapid development of LLMs coupled with the introduction of Model Context Protocol (MCP) has revolutionized how intelligent agents inter…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

GUI Agents for Continual Game Generation

Generating a game is not the same as making one that can be played. Despite advances in code generation, existing approaches treat game gen…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Multi-Agent LLM-based Metamorphic Testing for REST APIs

As REST APIs become an increasingly significant part of software systems, their validation is becoming more critical. Hence, testing and un…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

The Decision to Verify: How Warmth and User Characteristics Shape Reliance on Conversational Agents for Information Search

Conversational artificial intelligence (AI) provides an efficient and convenient gateway to information access. However, it can cause overr…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Do LLMs Favor Their Providers? Measuring Vertical Integration Bias in Code Generation

Large Language Models (LLMs) have become an integral part of software development, especially with the advent of agentic capabilities. Yet,…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

SARAD: LLM-Based Safety-Aware Hybrid Reinforcement Learning with Collision Prediction for Autonomous Driving

Ensuring both safety and efficiency in decision-making for autonomous driving systems remains a fundamental challenge. Traditional Deep Rei…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

Technical Report: Exploring the Emerging Threats of the Agent Skill Ecosystem

We analyzed 3,984 AI agent skills from major marketplaces and found 76 confirmed malicious payloads, including credential theft, backdoor i…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Evaluating the Realism of LLM-powered Social Agents: A Case Study of Reactions to Spanish Online News

LLM-powered social agents are increasingly used to simulate online social behavior, yet their realism remains difficult to validate. Existi…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Rethinking Memory as Continuously Evolving Connectivity

Existing memory-augmented LLM agents often treat memory as a static repository with pre-defined representations and fixed retrieval pipelin…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval

In the era of autonomous agents, machine-actionable data is critical for data-driven workflows. For more than a decade, semantic metadata l…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

MetaboT: An LLM-based Multi-Agent Framework for Interactive Analysis of Mass Spectrometry Metabolomics Knowledge Graphs

Mass spectrometry-based metabolomics generates complex, high-dimensional data that holds vast potential for biological discovery but remain…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

SynthTools: A Framework for Scaling Synthetic Tools for Agent Development

For agentic systems to use external tools to solve complex, long-horizon tasks, we need a large set of diverse and controllable tool-use en…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

CircuitLM: A Multi-Agent LLM-Aided Design Framework for Generating Circuit Schematics from Natural Language Prompts

Generating accurate circuit schematics from high-level natural language descriptions remains a persistent challenge in electronic design au…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

Large Language Models (LLMs) are increasingly used as autonomous agents in complex, long-horizon applications, where effective memory is cr…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

DIG to Heal: Scaling General-purpose Agent Collaboration via Explainable Dynamic Decision Paths

The increasingly popular agentic AI paradigm promises to harness the power of multiple, general-purpose large language model (LLM) agents t…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

COOP$^2$: Defining, Observing, and Repairing Cooperation in LLM Multi-Agent Systems

Many complex tasks require extended effort, diverse capabilities, or coordinated actions beyond what a single agent can provide. However, s…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Graph-of-Skills: Dependency-Aware Structural Retrieval for Massive Agent Skills

Modern LLM agents increasingly rely on reusable skills, and as they interact with personal applications, web browsers, and other interfaces…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

Escher-Loop: Mutual Evolution by Closed-Loop Self-Referential Optimization

While recent autonomous agents demonstrate impressive capabilities, they predominantly rely on manually scripted workflows and handcrafted…

2026-05-28 13:00 JST | arXiv cs.AI | Agents, Research/Papers

DataClawBench: An Agent Benchmark for Exploratory Real-World Financial Data Analysis

Autonomous data analysis agents are increasingly expected to conduct exploratory analysis with limited human guidance about data. However,…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Verifiable Process Rewards for Agentic Reasoning

Reinforcement learning from verifiable rewards (RLVR) has improved the reasoning abilities of large language models (LLMs), but most existi…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents, Business/Funding, Research/Papers

EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design

Large Language Model (LLM) agents are increasingly applied to engineering design tasks, yet existing evaluation frameworks do not adequatel…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents

LLM agents are shaped not only by their language models, but also by the runtime harness that mediates observation, tool use, action execut…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

Behind EvoMap: Characterizing a Self-Evolving Agent-to-Agent Collaboration Network

Agent-to-Agent (A2A) networks enable autonomous AI agents to collaborate by sharing reusable problem-solving instructions. However, how these decentralized ecosystems operate in practice remains largely unexplored. We present the first large-scale empirical study of EvoMap, a prominent A2A collaboration network. By analyzing over 1.5M assets and 128K agents, we show how design choices that prioritize scalable growth introduce trade-offs in reusability, evolution, and auditability. First, EvoMap's credit economy rewards agents for publishing valuable assets. Although this design encourages participation at scale, rewards are tied primarily to publication rather than adoption. This leads agents to mass-produce assets to accumulate credits. As a result, 98% of assets are never reused, while rewards become highly concentrated among a small fraction of agents. Second, EvoMap employs an algorithm (referred to as GDI) to score and rank the quality of these shared assets. We demonstrate that this scoring system is flawed: rather than measuring objective performance, an asset's rank is heavily dictated by unverified, self-reported metadata (e.g., claimed lines of code modified). This allows agents to trivially manipulate their assets' scores. Finally, EvoMap relies on agents to provide local execution logs as evidence that uploaded assets function correctly. Because these validations are not independently verified, over 84% of approved assets bypass quality checks using vacuous tests (e.g., console.log()). Our findings show that future A2A collaboration networks cannot rely on unverified self-reporting alone. Scalable collaboration requires mechanisms that balance open participation with verifiable execution and trustworthy evaluation.

2026-05-28 13:00 JST | arXiv cs.AI | Agents, Research/Papers

MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

We present MobileGym, a browser-hosted, lightweight, fully controllable environment for everyday mobile use, targeting interaction fidelity without replicating proprietary backends. It enables two capabilities previously out of reach for everyday apps: verifiable outcome signals through deterministic state-based judging over structured JSON state, and scalable online RL through low-cost parallel rollouts. The full environment state is captured, configured, forked, and compared as structured JSON, and a single server can host hundreds of parallel instances, with about 400 MB memory per instance and about 3 s cold start. A layered state model and a declarative task-definition framework keep state programmability and task creation practical at scale, and a single programmatic judging mechanism delivers both deterministic evaluation verdicts and dense RL rewards. The accompanying MobileGym-Bench provides 416 parameterized task templates, including 256 test and 160 train templates, over 28 apps, with deterministic judges and a structured AnswerSheet protocol that avoids free-text matching failures. In a Sim-to-Real case study, GRPO on Qwen3-VL-4B-Instruct gains +12.8 percentage points on the 256-task test set, and on a 59-task real-device signal subset, real-device execution retains 95.1% of the simulation-side training gain. Project page: https://mobilegym.github.io.
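
The idea of deterministic state-based judging over structured JSON state can be illustrated with a minimal Python sketch. This is not MobileGym's actual judge: the function names `get_path` and `judge`, the dotted-path check format, the example state layout, and the choice of "fraction of checks passed" as the dense reward are all assumptions made for illustration.

```python
from typing import Any


def get_path(state: dict, path: str) -> Any:
    """Follow a dotted path such as 'clock.alarms.0.enabled' through nested dicts and lists."""
    node: Any = state
    for key in path.split("."):
        if isinstance(node, list):
            node = node[int(key)]  # list segments are addressed by integer index
        else:
            node = node[key]
    return node


def judge(state: dict, expected: dict[str, Any]) -> tuple[bool, float]:
    """Return (verdict, reward) for a final environment state.

    verdict is True only if every expected check holds (hypothetical rule);
    reward is the fraction of checks that hold (hypothetical dense signal).
    """
    passed = 0
    for path, want in expected.items():
        try:
            if get_path(state, path) == want:
                passed += 1
        except (KeyError, IndexError, ValueError, TypeError):
            pass  # a missing or malformed path counts as a failed check
    reward = passed / len(expected) if expected else 0.0
    return bool(expected) and passed == len(expected), reward


# Hypothetical task: "set an enabled alarm for 07:30".
final_state = {"clock": {"alarms": [{"time": "07:30", "enabled": False}]}}
checks = {"clock.alarms.0.time": "07:30", "clock.alarms.0.enabled": True}
print(judge(final_state, checks))  # (False, 0.5): the time is right, the alarm is off
```

Because the verdict depends only on the final state and an explicit list of checks, running the judge twice on the same state always yields the same result, which is the property the abstract contrasts with free-text matching.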

2026-05-28 13:00 JST | arXiv cs.AI | Agents, Robotics

DSSE: a drone swarm search environment

The Drone Swarm Search project is an environment, based on PettingZoo, that is to be used in conjunction with multi-agent (or sing…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

Manboformer: Learning Gaussian Representations via Spatial-temporal Attention Mechanism

Compared with voxel-based grid prediction, in the field of 3D semantic occupation prediction for autonomous driving, GaussianFormer propose…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Structured Agent Distillation for Large Language Model

Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Understanding Automated Program Repair Agents Through the Lens of Traceability: An Empirical Study

Automated Program Repair (APR) agents leverage Large Language Models (LLMs) to autonomously diagnose and fix software bugs through reasonin…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

Auditing medical multi-agent AI reveals risks of false consensus

Large language models are increasingly being assembled into medical multi-agent systems that emulate multidisciplinary consultation through…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

ReflexGrad: Within-Episode Failure Recovery in LLM Agents via Progress-Gated Dual-Process Routing

We present ReflexGrad, a dual-process architecture for within-episode failure recovery in LLM agents without demonstrations. When agents co…

2026-05-28 13:00 JST | arXiv cs.AI | Agents, Research/Papers

The Optimal Sample Complexity of Linear Contracts

In this paper, we settle the problem of learning optimal linear contracts from data in the offline setting, where agent types are drawn fro…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Image/Video Generation, Agents

The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation

Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However,…

2026-05-28 13:00 JST | arXiv cs.AI | Agents, Business/Funding

TABX: A High-Throughput Sandbox Battle Simulator for Multi-Agent Reinforcement Learning

The design of environments plays a critical role in shaping the development and evaluation of cooperative multi-agent reinforcement learning (MARL) algorithms. While existing benchmarks highlight critical challenges, they often lack the modularity required to design custom evaluation scenarios. We introduce the Totally Accelerated Battle Simulator in JAX (TABX), a high-throughput sandbox designed for reconfigurable multi-agent tasks. TABX provides granular control over environmental parameters, permitting a systematic investigation into emergent agent behaviors and algorithmic trade-offs across a diverse spectrum of task complexities. Leveraging JAX for hardware-accelerated execution on GPUs, TABX enables massive parallelization and significantly reduces computational overhead. By providing a fast, extensible, and easily customized framework, TABX facilitates the study of MARL agents in complex structured domains and serves as a scalable foundation for future research. Our code is available at: https://github.com/ku-dmlab/TABX.

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rel…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems

Multi-agent systems, where LLM agents communicate through free-form language, enable sophisticated coordination for solving complex coopera…

2026-05-28 13:00 JST | arXiv cs.AI | Agents, Research/Papers

Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History

Large language models have advanced web agents, yet current agents lack personalization capabilities. Since users rarely specify every deta…

2026-05-28 13:00 JST | arXiv cs.AI | Agents, Robotics

SPARC: Spatial-Aware Path Planning via Attentive Agent Communication

Efficient communication is critical for decentralized Multi-Robot Path Planning (MRPP), yet existing learned communication methods treat al…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

Coherence Collapse: Diagnosing Why Code Agents Fail After Reaching the Right Code

Code agents resolve 65-70% of SWE-bench Verified issues, but Pass@1 cannot tell us why the rest fail, and, as we show, capable-model failur…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

COTTA: Context-Aware Transfer Adaptation for Trajectory Prediction in Autonomous Driving

Developing robust models to accurately predict the trajectories of surrounding agents is fundamental to autonomous driving safety. However,…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

Rectified Schrödinger Bridge Matching for Few-Step Visual Navigation

Visual navigation is a core challenge in Embodied AI, requiring autonomous agents to translate high-dimensional sensory observations into c…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

Negative Advantages Is a Double-Edged Sword: Calibrating advantages in GRPO for Search Agents

Search agents achieve strong question-answering performance through multi-turn interactions with search engines, with Group Relative Policy…

2026-05-28 13:00 JST | arXiv cs.AI | Agents

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

Reusable skills are becoming a common interface for extending large language model agents, packaging procedural guidance with access to fil…

2026-05-28 13:00 JST | arXiv cs.AI | Image/Video Generation, Agents

MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation

Text-to-video (T2V) generation has rapidly progressed in visual fidelity, yet its ability to faithfully represent multiple cultures within…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

SetupX: Can LLM Agents Learn from Past Failures in Functionality-Correct Code Repository Setup?

Functionality-correct repository setup aims to configure execution environments (e.g., dependencies, build scripts) to successfully execute a repository's documented features. It presents significant challenges due to diverse, repository-specific failures, including dependency incompatibilities, missing toolchains, incomplete installations, and verification-strategy mismatches. Existing LLM agents struggle to robustly resolve these issues, specifically failing to support (1) cross-repository experience transfer, (2) multi-step trial-and-repair under non-invertible state changes, and (3) robust verification of setup outcomes to distinguish setup-induced failures from repository bugs. To address this, we introduce SetupX, an experiential learning-based setup framework. First, we construct a Self-Evolving Experience Representation (XPU), a dual-modality knowledge unit encoding setup signals, textual guidance, and executable actions to dynamically transfer verified environment fixes to unseen repositories. Second, we employ Experience-Augmented Speculative Execution backed by a LIFO Docker snapshot stack, enabling the agent to proactively trial fixes and safely roll back to known-good states. Third, we introduce a Prosecutor-Judge Verification Protocol that separates evidence collection from final judgment, enabling more reliable setup verification beyond superficial build-time metrics. Evaluation results on carefully crafted benchmarks show that SetupX achieves the highest performance (e.g., 92% pass rate) and outperforms the strongest baseline by over 19%. Crucially, SetupX excels in complex multi-repository setups that require coordinating multiple interconnected services across different containers. The code repository is available at https://github.com/OpenDataBox/SetupX.
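
The trial-and-rollback pattern behind a LIFO snapshot stack can be sketched in a few lines of Python. This is an illustrative stand-in only: it snapshots an in-memory dictionary rather than Docker containers, and the names `SnapshotStack` and `try_fix`, as well as the rule of discarding a snapshot once a fix is accepted, are assumptions and not SetupX's implementation.

```python
import copy
from typing import Callable


class SnapshotStack:
    """Keeps last-in, first-out copies of an environment state (a plain dict here)."""

    def __init__(self, initial_state: dict) -> None:
        self.state = initial_state
        self._snapshots: list[dict] = []

    @property
    def depth(self) -> int:
        return len(self._snapshots)

    def push(self) -> None:
        # Deep copy so later in-place edits cannot corrupt the saved state.
        self._snapshots.append(copy.deepcopy(self.state))

    def rollback(self) -> None:
        # Restore the most recently saved state and drop it from the stack.
        self.state = self._snapshots.pop()

    def commit(self) -> None:
        # Accept the current state; the most recent snapshot is no longer needed.
        self._snapshots.pop()


def try_fix(env: SnapshotStack,
            fix: Callable[[dict], None],
            verify: Callable[[dict], bool]) -> bool:
    """Apply a speculative fix; keep it if verification passes, otherwise roll back."""
    env.push()
    try:
        fix(env.state)
        ok = verify(env.state)
    except Exception:
        ok = False  # a crashing fix or verifier is treated as a failed trial
    if ok:
        env.commit()
    else:
        env.rollback()
    return ok


# Hypothetical environment and fixes, for illustration only.
env = SnapshotStack({"packages": {"numpy": "2.0"}})
needs_old_numpy = lambda s: s["packages"].get("numpy") == "1.26"
try_fix(env, lambda s: s["packages"].pop("numpy"), needs_old_numpy)  # rolled back
try_fix(env, lambda s: s["packages"].update(numpy="1.26"), needs_old_numpy)  # kept
print(env.state)
```

The stack matters because state changes such as package removal are not invertible by a simple "undo" command; restoring a saved copy is the only reliable way back to a known-good state.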

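2026-05-28 12:00 JST | ITmedia AI+ | Agents

Docker's dedicated AI agent "Gordon" is officially released and is available even with free accounts

US-based Docker, Inc. has announced that "Gordon," an AI agent that answers questions about Docker, suggests best practices, and helps fix errors, has reached general availability as a new feature of Docker Desktop and the Docker CLI.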

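2026-05-28 08:00 JST | ITmedia AI+ | Agents

AI coding agent market enters a "new stage": three reasons IDEs are no longer indispensable

According to Gartner, advances in AI have pushed the AI coding agent market into a "new stage." What are the three reasons integrated development environments (IDEs) are ceasing to be indispensable?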

2026-05-27 21:30 JST | TechCrunch AI | Agents, Business/Funding

Robinhood now lets your AI agents trade stocks

While these agents would be able to read and analyze users' portfolios to come up with trading strategies and suggest investments, they'll…

2026-05-27 20:00 JST | OpenAI | LLM/Generative AI, Agents

Cisco and OpenAI redefine enterprise engineering with Codex

Cisco and OpenAI are redefining enterprise engineering with Codex, helping Cisco scale AI-native development, accelerate AI Defense work, a…

2026-05-27 16:00 JST | OpenAI | LLM/Generative AI, Agents

Building self-improving tax agents with Codex

See how OpenAI, Thrive, and Crete built a self-improving tax agent with Codex, automating filings, improving accuracy, and accelerating wor…

2026-05-27 13:00 JST | arXiv cs.AI | Agents

Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory

Long-running AI agents need persistent memory. Memory supports learning across sessions, reduces repeated context injection, and enables auditing of past decisions. Current agent memory systems and database paradigms treat memory as storage. They localize correctness at records, embeddings, or edges. Each supplies only some of the capabilities that long-term memory requires. The result is four recurring failure modes: unregulated growth, missing semantic revision, capacity-driven forgetting, and read-only retrieval. In our vision, long-term agent memory is a new data-management workload. Its correctness is a property of the state trajectory, not of individual records. We formalize this as Governed Evolving Memory (GEM). GEM replaces record-level database operations with four state-level operators: ingestion, revision, forgetting, and retrieval. Six correctness conditions govern how the state evolves. Three structural observations establish that no record-level system can satisfy these conditions, regardless of the storage model. We realize the abstraction in MemState, a prototype on a property-graph backend. MemState validates feasibility and exposes the gap to a native engine. We outline three research directions that define memory-centric data management as a workload.

2026-05-27 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions

Multimodal large language model (MLLM)-based embodied agents have shown strong potential for solving complex tasks in physical environments. However, personalized assistance requires more than following generic instructions or recognizing object categories. In real-world scenarios, the intended target is often specified only implicitly through prior interactions, requiring agents to leverage personalized context accumulated over time. In this work, we propose POLAR, a multimodal memory-augmented framework for personalized embodied agents over long-term user interactions. POLAR organizes prior interactions into a multimodal knowledge graph that captures semantic memory for personalized context and visual concepts, and episodic memory for embodied experiences such as agent trajectories. To execute embodied tasks, POLAR retrieves relevant memories to interpret the current request and guide task execution. We evaluate POLAR across multiple MLLM backbones and diverse evaluation scenarios to study the role of memory in long-term personalization. Results show that the proposed memory mechanism consistently improves performance by enabling more effective use of information accumulated over prior interactions. The gains are especially pronounced when the agents are required to reason across multiple interactions, perform multi-hop inference, or track updates in user-specific context over time.

2026-05-27 13:00 JST | arXiv cs.AI | Agents

Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated like freshly initialized models. Day-one benchmarks miss a basic systems question: how long does an agent remain reliable after deployment? Even when model weights are frozen, an agent's effective state keeps changing as it compresses interaction history, retrieves from a growing memory store, revises facts after updates, and undergoes routine maintenance. Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model. We introduce AgingBench, a longitudinal reliability benchmark for agent lifespan engineering: measuring not only whether deployed agents degrade, but what form the degradation takes and where repair should target. AgingBench organizes agent aging into four mechanisms: compression aging, interference aging, revision aging, and maintenance aging. To diagnose these failures, AgingBench uses temporal dependency graphs and paired counterfactual probes that produce diagnostic profiles for the write, retrieval, and utilization stages of the memory pipeline. Across 7 scenarios, 14 models, multiple memory policies, and both runner-controlled and autonomous agents, more than roughly 400 runs spanning 8 to 200 sessions show that agent aging is not one-dimensional: behavioral tests can remain clean while factual precision decays; derived-state tracking can collapse sharply within a single model; and the same wrong answer can require different repairs depending on what the diagnostic profile points to. These results suggest that reliable agent deployment requires lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair, not only stronger day-one models.

2026-05-27 13:00 JST | arXiv cs.AI | エージェント | 研究/論文

科学のための Agentic AI の実験

この論文では、科学ワークフローにおける自律型エージェント AI を開発するための 2 つの新しいフレームワークについて詳しく説明します。どちらのシステムも、Google Colab を介したハイブリッド Local Body、Remote Brain アーキテクチャを活用し、Python ベースのローカルオーケストレーターを利用して大規模言語モデル (LLM) クラウドバックエンドを呼び出します。最初のエージェントである DeepTS/DeepCollector は、時系列データセットの大規模なキュレーション、抽出、重複排除を自動化します。 2 つ目の DeepScribe は、視覚的に緻密で数学的に複雑な物理学の講義を構造化された科学レポートに変換する自律型プレゼンテーションアナライザーです。粒度の高い属性抽出 (Cellular RAG)、リモートデータ検査、分散同時実行制御などの実践的なシステムエンジニアリングを通じて、エージェント AI が現在の最先端システムのコンテキストと推論の制限をどのように克服して、科学的なワークフローを厳密にサポートできるかを実証します。最後に、深い知識グラフをサポートする DeepTS の一般化について概説し、この概念的なアプローチの高エネルギー物理学 (DeepQCD) への応用について説明します。

原文 (English)

Experiments in Agentic AI for Science

This paper details two novel frameworks for developing autonomous, agentic AI in scientific workflows. Both systems leverage a hybrid Local Body, Remote Brain architecture via Google Colab, utilizing Python-based local orchestrators to invoke large language model (LLM) cloud backends. The first agent, DeepTS/DeepCollector, automates the large-scale curation, extraction, and deduplication of time-series datasets. The second, DeepScribe, is an autonomous presentation analyzer that converts visually dense, mathematically complex physics lectures into structured scientific reports. Through practical systems engineering, such as granular attribute extraction (Cellular RAG), remote data inspection, and distributed concurrency controls, we demonstrate how agentic AI can overcome the context and reasoning limitations of current state-of-the-art systems to rigorously support scientific workflows. Finally, we outline a generalization of DeepTS to support deep knowledge graphs and discuss the application of this conceptual approach to high-energy physics (DeepQCD).

2026-05-27 13:00 JST | arXiv cs.AI | エージェント | ビジネス/資金調達 | 研究/論文

Anchor: エージェントベンチマーク生成におけるアーティファクトドリフトの軽減

AI エージェントは、長期にわたる価値のあるビジネス運営タスクを完了し始めていますが、企業の業務のためのトレーニングおよび評価環境は、依然として現実性、検証可能性、規模のバランスをとるのに苦労しています。環境とタスクの作成は、アーティファクトドリフトと呼ばれる障害モードに頻繁に悩まされます。つまり、命令、環境、オラクル、およびベリファイアーが疎結合プロセスによって作成される場合、タスクに必要なものについて意見が一致しないことが多く、解決不可能、報酬ハック可能、または一貫性のない環境が生成されます。ドメイン専門家によるビジネスワークフローの仕様を制約最適化プログラムに形式化するタスク生成パイプラインである Anchor を紹介します。パイプラインは、単一のパラメトリック仕様から、自然言語命令、環境構成、ソルバー認定のグラウンドトゥルースソリューション、および状態ベースの検証器を共同で生成します。 Anchor を使用すると、パラメーターを変更すると、制御された難易度と既知の最適なソリューションを持つ新しいタスクが生成され、最終状態のビジネスの正しさのみに報酬が依存するハーネスに依存しない環境が生成されます。私たちは Anchor を適用して ERP-Bench を作成します。これは、生産グレードの ERP システムにおける調達と製造のワークフローにわたる 300 の長期タスクのベンチマークです。生成パラメータは現実の難易度を予測し、フロンティアモデルは試行の 26.1% で明示的なタスク制約を満たしますが、完全な最適解に到達するのは試行の 17.4% のみであることがわかりました。全体として、Anchor と ERP-Bench が、経済的に価値のあるエージェント作業のための監査可能な評価環境を構築するための具体的なレシピを提供することを示します。タスクジェネレーターと ERP ベンチデータセットを erpbench.ai でリリースします。

原文 (English)

Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

AI agents are beginning to complete valuable, long-horizon business operations tasks, but training and evaluation environments for enterprise work still struggle to balance realism, verifiability, and scale. Environment and task creation frequently suffers from a failure mode we call artifact drift: when instructions, environments, oracles, and verifiers are created by loosely coupled processes, they often disagree on what a task requires, producing environments that are unsolvable, reward-hackable, or inconsistent. We introduce Anchor, a task-generation pipeline that formalizes domain experts' specifications of business workflows into constraint optimization programs. From a single parametric specification, the pipeline jointly produces a natural-language instruction, environment configuration, solver-certified ground-truth solution, and state-based verifier. With Anchor, altering parameters yields new tasks with controlled difficulty and known optimal solutions, producing harness-agnostic environments whose rewards depend solely on end-state business correctness. We apply Anchor to produce ERP-Bench: a benchmark of 300 long-horizon tasks spanning procurement and manufacturing workflows in a production-grade ERP system. We find that generation parameters predict realized difficulty, and that frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4% of trials. Overall, we show that Anchor and ERP-Bench offer a concrete recipe for building auditable evaluation environments for economically valuable agent work. We release the task generator and ERP-Bench dataset at erpbench.ai.
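
To make the single-specification idea concrete, here is a minimal Python sketch. It is not Anchor's implementation: the toy procurement task, the names `ProcurementSpec`, `render_instruction`, `environment_config`, `solve`, and `verify`, and the brute-force search standing in for a constraint solver are all assumptions made for illustration. The point is only that when every artifact is derived from one spec, the instruction, environment, ground truth, and verifier cannot disagree about what the task requires.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ProcurementSpec:
    """Hypothetical parametric spec: buy `demand` units from capacity-limited suppliers."""
    demand: int
    unit_cost: dict   # supplier name -> cost per unit
    capacity: dict    # supplier name -> maximum units available


def render_instruction(spec):
    """Natural-language instruction derived from the spec."""
    offers = ", ".join(
        f"{s} (${spec.unit_cost[s]}/unit, max {spec.capacity[s]})"
        for s in sorted(spec.unit_cost)
    )
    return f"Procure exactly {spec.demand} units at minimum total cost. Suppliers: {offers}."


def environment_config(spec):
    """Initial environment state derived from the same spec."""
    return {"suppliers": dict(spec.capacity), "prices": dict(spec.unit_cost), "orders": {}}


def total_cost(spec, orders):
    return sum(spec.unit_cost[s] * q for s, q in orders.items())


def is_feasible(spec, orders):
    within_capacity = all(0 <= q <= spec.capacity[s] for s, q in orders.items())
    return within_capacity and sum(orders.values()) == spec.demand


def solve(spec):
    """Brute-force stand-in for a constraint solver; assumes the spec is feasible."""
    names = sorted(spec.unit_cost)
    best = None
    for qty in product(*(range(spec.capacity[s] + 1) for s in names)):
        orders = dict(zip(names, qty))
        if is_feasible(spec, orders):
            if best is None or total_cost(spec, orders) < total_cost(spec, best):
                best = orders
    return best


def verify(spec, final_orders):
    """State-based verifier: inspects only the end state, never the agent's trajectory."""
    optimum = total_cost(spec, solve(spec))
    feasible = is_feasible(spec, final_orders)
    return {
        "constraints_met": feasible,
        "optimal": feasible and total_cost(spec, final_orders) == optimum,
    }


spec = ProcurementSpec(demand=10, unit_cost={"A": 3, "B": 5}, capacity={"A": 6, "B": 8})
print(render_instruction(spec))
print(solve(spec))                       # cheapest feasible plan
print(verify(spec, {"A": 2, "B": 8}))    # feasible but more expensive than the optimum
```

The last call shows an end state that meets every explicit constraint yet is not optimal, the same distinction the abstract draws between the 26.1% and 17.4% figures. Changing `demand` or the capacities yields a new task whose optimum is known by construction.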

2026-05-27 13:00 JST | arXiv cs.AI | エージェント | 研究/論文

JobBench: エージェントの作業を人間の意志に合わせる

職業向け AI エージェントの現在のベンチマークは、主に経済的価値を基準に範囲が定められており、人間の代替というストーリーを語っています。私たちは JobBench を紹介します。JobBench は、専門家が委任の優先度が高いと判断したワークフローに基づいて AI エージェントを評価し、GDP 上の価値で人間を置き換えるのではなく、ニーズに基づいて人間に力を与えます。 JobBench は、35 の職業にわたる 130 のエージェントタスクをカバーします。各タスクは異種の参照ファイルからなるワークスペースとしてパッケージ化されており、エージェントは実際の専門的な作業の雑然とした情報ストリームを通して推論する必要があります。成果物は事実に基づいたルーブリックの連鎖によって採点され、タスクあたり平均 35.6 個の二値基準が設けられています。 36 のモデルを評価したところ、最も強力な Claude Code 上の Claude Opus 4.7 でも 45.9% にすぎません。私たちは、JobBench によって、コミュニティが目標とする労働市場への効果が代替から強化へと移ること、つまり経済的に最も価値のあることだけでなく、人間が実際に委任したいことを実行するエージェントの構築へと向かうことを願っています。

原文 (English)

JobBench: Aligning Agent Work With Human Will

Current benchmarks for occupational AI agents are scoped primarily by economic values, telling a replacement story. We introduce JobBench, which evaluates AI agents on the workflows that experts identify as high-priority for delegation, empowering humans based on their needs instead of replacing them with GDP value. JobBench covers 130 agentic tasks across 35 occupations. Each task is packaged as a workspace of heterogeneous reference files, requiring the agent to reason through the cluttered information streams of real professional work. Outputs are graded by a fact-anchored chain of rubrics, averaging 35.6 binary criteria per task. We evaluate 36 models; the strongest, Claude Opus 4.7 under Claude Code, reaches only 45.9%. We hope JobBench shifts the community's target labour-market effect from replacement to enhancement: building agents that do what humans actually want delegated, not only what is most economically valuable.

2026-05-27 13:00 JST | arXiv cs.AI | エージェント | ハードウェア/半導体 | 研究/論文

ScientistOne: 証拠の連鎖による人間レベルの自律的研究に向けて

自律的な研究エージェントは、競争力のあるソリューションとプロフェッショナルに見える原稿を作成しますが、その出力には、表面レベルの評価では検出できない検証可能性の欠陥、つまり、捏造された引用、再現不可能なスコア、実装から乖離した手法の説明が含まれています。私たちは 3 つの貢献を通じてこの問題に取り組みます。まず、証拠連鎖 (CoE) です。これは、すべての主張が証拠ソースまで追跡可能であることを要求する検証可能性フレームワークです。 2 つ目は、ScientistOne です。これは、文献レビュー、解決策の発見、論文執筆を通じて構築によって証拠チェーンを維持するエンドツーエンドの自律研究システムです。 3 つ目は、CoE 監査です。スコア検証、仕様違反、参照検証、メソッドコードの調整という 4 つの整合性チェックがすべてのシステムに均一に適用される事後監査です。 5 つのシステムと 5 つのフロンティア研究タスクにわたる 75 の論文にわたって、すべてのベースラインが少なくとも 1 つの系統的故障モードを示しています。幻覚参照率は 21% に達し、スコア検証に合格した論文はわずか 42% で、メソッドとコードの整合性は 20% ～ 80% の範囲です。 ScientistOne は、幻覚参照ゼロ (0/337)、完璧なスコア検証 (12/12)、最高のメソッドとコードの整合性 (14/15) を達成しながら、5 つのタスクすべてで人間の専門家のパフォーマンスと同等またはそれを上回っています。 ScientistOne はさらに、医用画像処理、きめ細かい認識、3D 知覚、言語モデリングにわたる 6 つの追加タスクに一般化し、パラメーターゴルフでは最先端の成績を、ベースラインが完全に失敗する MLE ベンチタスクでは金メダルを獲得しました。

原文 (English)

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetectable by surface-level evaluation: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, CoE Audit, a post-hoc audit whose four integrity checks -- score verification, specification violation, reference verification, and method-code alignment -- apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.

2026-05-27 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント | 研究/論文

静的コンテキストから調整されたインタラクティブ RL へ: アライメントされたシミュレーターを使用したマルチターン対話における分布シフトの軽減

研究コミュニティの長年の目標は、高度にインタラクティブな LLM ベースの対話エージェントを開発することです。最近の研究は、固定オフラインログ (静的コンテキスト RL) に基づくポリシーの最適化、またはプロンプトベースのシミュレーター (対話型 RL) の使用に焦点を当てています。この研究では、両方のパラダイムがコンテキスト分布の変化、つまりトレーニング中に観察された対話履歴と実際の会話で遭遇した対話履歴の間の不一致によって根本的に制限されていることを理論的に示しています。この変化はターンを経て二次的に悪化し、対話の質を著しく低下させます。具体的には、この変化は 2 つの異なる原因によるものであると考えます。(i) 自己生成された軌道ではなく静的な履歴に基づいたトレーニングから生じる、政策によって引き起こされた変化。 (ii) シミュレートされた人間の行動と実際の人間の行動の間の相違から生じる、シミュレータによって引き起こされる変化。これらの課題に対処するために、私たちは、インタラクティブ RL とシミュレーター調整を結合する統合フレームワークである Calibrated Interactive RL を提案します。シミュレーターを人間の対話パターンに合わせることで、私たちのアプローチはシミュレーションと現実のギャップを減らし、複合的な分布の変化を軽減します。複数の対話タスクにわたる実験により、理論的分析が確認されています。(i) インタラクティブ RL は、ポリシー配布のシフトを緩和することにより、静的コンテキストのベースラインを大幅に上回ります。 (ii) 当社のアライメント手法を使用してシミュレータを校正することで、シミュレーションと実際のギャップをさらに橋渡しし、最先端のダウンストリームパフォーマンスを実現します。

原文 (English)

From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator

A long-standing goal of the research community is to develop highly interactive LLM-based dialogue agents. Recent research focuses on optimizing policies based on fixed offline logs (Static Context RL) or using a prompt-based simulator (Interactive RL). In this work, we theoretically show that both paradigms are fundamentally limited by context distribution shift--a mismatch between dialogue histories observed during training and those encountered in real conversations. This shift compounds quadratically over turns and severely degrades dialogue quality. Specifically, we attribute this shift to two distinct sources: (i) policy-induced shift, arising from training on static histories rather than self-generated trajectories; and (ii) simulator-induced shift, stemming from discrepancies between simulated and real human behaviors. To address these challenges, we propose Calibrated Interactive RL, a unified framework that couples interactive RL with simulator alignment. By aligning the simulator with human interaction patterns, our approach reduces the sim-to-real gap and mitigates compounding distribution shifts. Experiments across multiple dialogue tasks confirm our theoretical analysis: (i) Interactive RL significantly outperforms the Static Context baseline by mitigating policy distribution shift; and (ii) calibrating simulators with our alignment method further bridges the sim-to-real gap, yielding state-of-the-art downstream performance.
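
A standard back-of-the-envelope argument from imitation learning shows where quadratic growth in the number of turns can come from. It is offered as an illustration, not as the paper's own theorem. Assume that at each turn the policy leaves the training distribution of dialogue histories with probability at most $\epsilon$, and that once off-distribution it may incur unit cost on every remaining turn; both $\epsilon$ and the unit-cost model are assumptions introduced here. A union bound over the first $t$ turns then gives:

```latex
\Pr[\text{off-distribution by turn } t] \le t\,\epsilon
\quad\Longrightarrow\quad
\mathbb{E}[\text{total cost over } T \text{ turns}]
\le \sum_{t=1}^{T} t\,\epsilon
= \epsilon\,\frac{T(T+1)}{2}
= O(\epsilon T^{2}).
```

Under these assumptions a small per-turn mismatch is harmless in short exchanges but dominates in long conversations, which is consistent with training on self-generated trajectories rather than static histories.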

2026-05-27 13:00 JST | arXiv cs.AI | エージェント

PolyFusionAgent: ポリマー特性予測と逆設計のためのマルチモーダル基盤モデルおよび自律型 AI アシスタント

ポリマーの発見は、エネルギー貯蔵から生物医学に至るまでの分野の中心ですが、天文学的に大きな化学設計空間と、構造、特性、事前知識の断片的な表現によって妨げられています。この断片化により、多くの AI モデルが物理的および実験的現実から切り離されたままとなり、実用的な設計上の決定を直接サポートする能力が制限されます。ここでは、マルチモーダルポリマー基礎モデル (PolyFusion) とツール拡張された文献に基づいた設計エージェント (PolyAgent) を結合する対話型フレームワークである PolyFusionAgent を紹介します。 PolyFusion は、数百万のポリマーにわたるシーケンス、トポロジー、3D ジオメトリ、フィンガープリントなどの相補的なポリマービューを調整して、化学およびデータ領域間で転送可能な共有潜在空間を学習し、熱物理的特性の予測を向上させ、参照設計空間を超えて化学的に有効で構造的に新規なポリマーの特性条件付き生成を可能にします。 PolyAgent は、予測と逆設計をポリマー文献からの証拠検索とリンクさせ、1 つのワークフローで明示的な前例を使用して仮説を提案、評価、文脈化することで設計ループを閉じます。 PolyFusionAgent を併用すると、大規模な表現学習、マルチモーダルな化学知識、検証可能な科学的推論を組み合わせた、インタラクティブな証拠にリンクしたポリマー発見が可能になります。

原文 (English)

PolyFusionAgent: A Multimodal Foundation Model and Autonomous AI Assistant for Polymer Property Prediction and Inverse Design

Polymer discovery is central to fields ranging from energy storage to biomedicine, but it is hindered by an astronomically large chemical design space and fragmented representations of structure, properties, and prior knowledge. This fragmentation leaves many AI models disconnected from physical and experimental reality, restricting their ability to support directly actionable design decisions. Here we introduce PolyFusionAgent, an interactive framework coupling a multimodal polymer foundation model (PolyFusion) with a tool-augmented, literature-grounded design agent (PolyAgent). PolyFusion aligns complementary polymer views including sequence, topology, 3D geometry, and fingerprints across millions of polymers to learn a shared latent space transferable across chemistries and data regimes, improving thermophysical property prediction and enabling property-conditioned generation of chemically valid, structurally novel polymers beyond the reference design space. PolyAgent closes the design loop by linking prediction and inverse design with evidence retrieval from the polymer literature, proposing, evaluating, and contextualizing hypotheses with explicit precedent in one workflow. Together, PolyFusionAgent enables interactive, evidence-linked polymer discovery combining large-scale representation learning, multimodal chemical knowledge, and verifiable scientific reasoning.

2026-05-27 13:00 JST | arXiv cs.AI | エージェント

MobileExplorer: オンライン探索によるモバイル GUI エージェントのオンデバイス推論の高速化

モバイルグラフィカルユーザーインターフェイス (GUI) エージェントにより、AI モデルがユーザーに代わってスマートフォンを自律的に操作できるようになります。ただし、既存のシステムのほとんどはタスクの精度の最適化に主に焦点を当てており、推論にはクラウドでホストされるモデルに依存しているため、プライバシーの問題やネットワークに依存する遅延が生じます。その結果、モバイル GUI エージェントの完全なオンデバイス展開は依然として検討されていません。私たちは、オンライン探索を介してビジョンベースのモバイル GUI エージェントのオンデバイス推論を加速する新しいフレームワークである MobileExplorer を提案します。重要なアイデアは、UI 要素の軽量な並列探索を実行することで、ビジョン言語モデル (VLM) のステップごとの長い推論時間を活用することです。モデル推論中、エージェントは意味的に関連する UI 要素を積極的に調査し、これらの調査トレースを構造化メモリとして記録します。ライブモバイル環境で信頼性の高い実行を保証するために、高速だが単純なバックトラッキング戦略が失敗した場合に初期 UI 状態を堅牢に復元する 2 レベルのロールバックメカニズムを設計します。収集された探索トレースは、簡潔な文脈上のヒントに要約され、その後の推論ステップを強化するためにプロンプトに注入されます。 AndroidWorld ベンチマークを使用して、複数の既製デバイス上で MobileExplorer を評価するだけでなく、新しく設計されたより複雑なタスクや動的なオンデバイス環境も評価します。 MobileExplorer は、推論ステップの平均数とエンドツーエンドの待ち時間を 23% 削減し、タスクの成功率を最大 5% 維持または改善します。現実世界における MobileExplorer のパフォーマンスのビデオデモンストレーションは、https://youtu.be/thK7MJmdlvM でご覧いただけます。

原文 (English)

MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration

Mobile graphical user interface (GUI) agents enable AI models to autonomously operate smartphones on behalf of users. However, most existing systems focus primarily on optimizing task accuracy and rely on cloud-hosted models for inference, which introduces privacy concerns and network-dependent latency. As a result, fully on-device deployment of mobile GUI agents remains underexplored. We propose MobileExplorer, a new framework that accelerates on-device inference for vision-based mobile GUI agents via online exploration. The key idea is to exploit the long per-step reasoning time of vision-language models (VLMs) by performing lightweight, parallel exploration of UI elements. During model inference, the agent proactively probes semantically relevant UI elements and records these exploration traces as structured memory. To ensure reliable execution in live mobile environments, we design a two-level rollback mechanism that robustly restores the initial UI state when a fast but naive backtracking strategy fails. The collected exploration traces are then summarized into concise contextual hints and injected into the prompt to enhance the subsequent reasoning step. We evaluate MobileExplorer on multiple off-the-shelf devices using the AndroidWorld benchmark, as well as newly designed, more complex tasks and dynamic on-device environments. MobileExplorer reduces the average number of reasoning steps and end-to-end latency by 23%, while maintaining or improving task success rates by up to 5%. A video demonstration of MobileExplorer performance in the real world is available at https://youtu.be/thK7MJmdlvM .
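
The two-level rollback can be pictured with a small Python sketch. The abstract does not say what the two levels are, so this is an assumption: level 1 presses the back button a bounded number of times, and level 2 relaunches the app and replays a recorded path. `FakeDevice` and `rollback` are hypothetical names, and the toy screen stack stands in for a real phone.

```python
class FakeDevice:
    """Toy stand-in for a phone: a stack of screen names (hypothetical interface)."""

    def __init__(self, home="home"):
        self.home = home
        self.stack = [home]

    def screen(self):
        return self.stack[-1]

    def open(self, name):
        self.stack.append(name)

    def press_back(self):
        # Some screens (for example a modal dialog) swallow the back button.
        if self.screen().startswith("modal:"):
            return
        if len(self.stack) > 1:
            self.stack.pop()

    def relaunch(self):
        self.stack = [self.home]


def rollback(device, target_screen, replay_path, max_back=3):
    """Restore `target_screen` and report which level succeeded."""
    # Level 1: fast, naive backtracking with the back button.
    for _ in range(max_back):
        if device.screen() == target_screen:
            return "fast"
        device.press_back()
    if device.screen() == target_screen:
        return "fast"
    # Level 2: slower but robust, restart and replay the recorded path.
    device.relaunch()
    for name in replay_path:
        device.open(name)
    if device.screen() != target_screen:
        raise RuntimeError("replay did not reach the target screen")
    return "robust"


device = FakeDevice()
device.open("settings")                               # state before exploration
device.open("wifi")                                   # exploratory probe
print(rollback(device, "settings", ["settings"]))     # back button is enough
device.open("modal:confirm")                          # probe that traps the back button
print(rollback(device, "settings", ["settings"]))     # falls through to level 2
```

Trying the cheap path first keeps the common case fast, which matters when exploration must fit inside the model's per-step reasoning time.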

2026-05-27 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

AGORA: LLM エージェントにおける推論不要のプロンプト圧縮のためのアダプターに基づく観測・行動保持

一般的な LM コンテキストで広く使用されているトークンレベルの抽出コンプレッサーは、構造的に LLM エージェントには不適切です。2 つの独立したトークンレベルメソッドファミリにまたがる 17 個のセル (環境、バックボーン、メソッド) にわたって、すべてのセルが平均報酬 = 9 セル中 8 セルで 75% の非圧縮パフォーマンスに崩壊します (唯一の例外は 73%)。 4 方向コンポーネントのアブレーションは、主要な品質レバーとして構造床を分離し、単一の固定維持率からの 1.0 ～ 11.5 倍の適応型エンドツーエンド圧縮のソースとして学習されたスコアラーを分離します。

原文 (English)

AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents

The token-level extractive compressors widely used for general LM context are structurally inappropriate for LLM agents: across 17 (env, backbone, method) cells spanning two independent token-level method families, every cell collapses to mean reward = 75% uncompressed performance in 8 of 9 cells (with the lone exception at 73%); a four-way component ablation isolates the structural floor as the dominant quality lever and the learned scorer as the source of 1.0-11.5x adaptive end-to-end compression from a single fixed keep ratio.

2026-05-27 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

UnityMAS-O: LLM ベースのマルチエージェントシステム用の一般的な RL 最適化フレームワーク

LLM ベースのマルチエージェントシステムは、複雑なタスクを相互作用する役割に分解しますが、そのほとんどはプロンプト、ツール、制御ルールによって手動で調整されたままであり、エージェントが統合された強化学習インターフェイスを通じて最適化されることはほとんどありません。既存の RL ポストトレーニングフレームワークは主に単一ポリシーの最適化をターゲットにしており、ユーザー定義のマルチエージェントワークフロー、構造化されたインタラクション、ロール固有のクレジット割り当て、構成可能なパラメーター共有の抽象化が欠けています。我々は、LLM ベースのマルチエージェントシステム用の一般的な RL 最適化フレームワークである UnityMAS-O を紹介します。 UnityMAS-O は、単一の応答やポリシーの軌道ではなく、完全なワークフローを最適化単位として扱います。これは、論理エージェントの役割、グラフの軌跡、ユーザー定義の報酬、およびエージェントとモデルのマッピングという 4 つのファーストクラスのオブジェクトを通じてワークフローを表します。これにより、論理エージェントが物理モデルパラメーターから切り離され、完全共有、完全分離、部分共有がサポートされ、役割、ターン、および軌道レベルで報酬が割り当てられます。 UnityMAS-O は、Ray ベースのスタートポロジランタイムを使用して verl を拡張します。中央コントローラーはワークフローを実行し、ツールを呼び出し、構造化された軌跡を記録し、報酬を組み立てます。モデルローカルワーカーグループは、ロールアウト、バッファリング、利点の計算、分散 PPO スタイルの更新を処理します。ユーザーは、最適化インフラストラクチャを書き直すことなく、エージェント、ワークフロー、モデルマッピング、報酬を定義できます。検索拡張 QA、反復エージェント検索、およびリフレクティブコード生成で UnityMAS-O をインスタンス化します。 Natural question、HotpotQA、およびホールドアウトコードタスク全体にわたって、マルチエージェント RL は、最適化後に手動で指定されたワークフローを改善し、特に小規模なモデルと厳格なコードオールパスメトリクスで大きな効果が得られます。これらの結果は、UnityMAS-O が、多様な LLM ベースのマルチエージェントワークフローをトレーニング可能なマルチエージェント RL システムに変換するための再利用可能な基盤として機能できることを示しています。

原文 (English)

UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

LLM-based multi-agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated by prompts, tools, and control rules, while agents are rarely optimized through a unified reinforcement learning interface. Existing RL post-training frameworks mainly target single-policy optimization and lack abstractions for user-defined multi-agent workflows, structured interaction, role-specific credit assignment, and configurable parameter sharing. We present UnityMAS-O, a general RL optimization framework for LLM-based multi-agent systems. UnityMAS-O treats the complete workflow as the optimization unit, rather than a single response or policy trajectory. It represents workflows through four first-class objects: logical agent roles, graph trajectories, user-defined rewards, and agent--model mappings. This decouples logical agents from physical model parameters, supporting full sharing, full separation, and partial sharing, with rewards assigned at role, turn, and trajectory levels. UnityMAS-O extends verl with a Ray-based star-topology runtime. A central controller executes workflows, invokes tools, records structured trajectories, and assembles rewards; model-local worker groups handle rollout, buffering, advantage computation, and distributed PPO-style updates. Users can define agents, workflows, model mappings, and rewards without rewriting the optimization infrastructure. We instantiate UnityMAS-O on retrieval-augmented QA, iterative agentic search, and reflective code generation. Across Natural Questions, HotpotQA, and held-out code tasks, multi-agent RL improves manually specified workflows after optimization, with especially large gains for smaller models and strict code all-passed metrics. These results show that UnityMAS-O can serve as a reusable substrate for converting diverse LLM-based multi-agent workflows into trainable multi-agent RL systems.
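
The four first-class objects named above can be sketched as plain Python data structures. This is not the UnityMAS-O API: `AgentRole`, `Step`, `GraphTrajectory`, `assign_rewards`, and `group_by_model` are hypothetical names, and the reward rule is an arbitrary example. The sketch only illustrates how an agent-to-model mapping decouples logical roles from the parameters being updated.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class AgentRole:
    """Logical agent role, independent of which model backs it."""
    name: str
    system_prompt: str


@dataclass
class Step:
    role: str                     # logical role that produced this step
    output: str
    parent: Optional[int] = None  # index of the step it responds to (graph edge)


@dataclass
class GraphTrajectory:
    steps: list = field(default_factory=list)


def assign_rewards(traj, trajectory_reward, role_bonus):
    """Example user-defined reward: a shared trajectory-level term plus a per-role term."""
    return [trajectory_reward + role_bonus.get(s.role, 0.0) for s in traj.steps]


def group_by_model(traj, rewards, agent_to_model):
    """Route each step's reward to the physical model that backs its logical role."""
    batches = {}
    for step, reward in zip(traj.steps, rewards):
        batches.setdefault(agent_to_model[step.role], []).append((step.role, reward))
    return batches


roles = [
    AgentRole("retriever", "Find relevant passages."),
    AgentRole("reader", "Answer from the passages."),
    AgentRole("critic", "Check the answer."),
]
# Partial sharing: retriever and reader share weights, the critic has its own.
agent_to_model = {"retriever": "model_a", "reader": "model_a", "critic": "model_b"}

traj = GraphTrajectory([
    Step("retriever", "passages..."),
    Step("reader", "answer...", parent=0),
    Step("critic", "looks correct", parent=1),
])
rewards = assign_rewards(traj, trajectory_reward=1.0, role_bonus={"critic": 0.5})
print(group_by_model(traj, rewards, agent_to_model))
```

Mapping every role to one model name gives full sharing, and mapping each role to its own name gives full separation, without touching the workflow definition.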

2026-05-27 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

MemFail: LLM メモリシステムの障害モードのストレステスト

大規模言語モデル (LLM) エージェントは、長期的な対話全体で一貫性を維持するために外部メモリシステムへの依存度が高まっていますが、これらのシステムが示す特定の障害モードと設計上の選択肢を理解するための実証研究はほとんど行われていません。既存のベンチマークは、集計された質問回答精度を報告し、メモリシステムをブラックボックスとして扱うため、不正確な回答をシステムの特定の故障モードに帰することは不可能です。最新の LLM メモリシステムの障害モードを分離する診断ベンチマークである MemFail を紹介します。まず、メモリシステムを 3 つの標準的な操作 (要約、保存、取得) の組み合わせとして形式化し、それぞれによって引き起こされる潜在的な障害モードを特定します。これらの仮説的な故障モードに基づいて、メモリシステムの特定の動作をテストするようにそれぞれが敵対的に設計された 4 つのタスクにわたる 5 つのデータセットを構築します。これらのデータセットを使用して、MemFail 上の 4 つの最先端のメモリシステムを評価し、MemFail を使用してメモリシステムアーキテクチャの違いによって引き起こされるトレードオフを経験的に理解する方法を実証します。

原文 (English)

MemFail: Stress-Testing Failure Modes of LLM Memory Systems

Large language model (LLM) agents increasingly rely on external memory systems to remain consistent across long-horizon interactions, but little empirical work has been done to understand the specific failure modes and design choices that these systems present. Existing benchmarks report aggregate question-answering accuracy and treat memory systems as black boxes, making it impossible to attribute an incorrect answer to a particular failure mode of the system. We introduce MemFail, a diagnostic benchmark that isolates the failure modes of modern LLM memory systems. We begin by formalizing memory systems as the composition of three canonical operations -- summarization, storage, and retrieval -- and identify the potential failure modes induced by each. Based on these hypothesized failure modes, we construct five datasets spanning four tasks, each adversarially designed to test a specific operation of a memory system. Using these datasets, we evaluate four state-of-the-art memory systems on MemFail and demonstrate how MemFail can be used to empirically understand the tradeoffs induced by differences in memory system architectures.
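
Treating a memory system as a composition of summarization, storage, and retrieval makes failures attributable to a stage. The following Python sketch illustrates that idea with deliberately crude stand-ins; it is not MemFail's code or data. The truncating summarizer, the bounded store, the keyword retriever, and the `diagnose` helper are all assumptions made for illustration.

```python
def summarize(turns, max_words=6):
    """Lossy summarization: keep only the first `max_words` words of each turn."""
    return [" ".join(t.split()[:max_words]) for t in turns]


def store(summaries, capacity=3):
    """Bounded storage: keep only the most recent `capacity` entries."""
    return summaries[-capacity:]


def retrieve(memory, query, k=1):
    """Keyword-overlap retrieval: the k entries sharing the most words with the query."""
    q = set(query.lower().split())
    ranked = sorted(memory, key=lambda m: len(q & set(m.lower().split())), reverse=True)
    return ranked[:k]


def diagnose(turns, query, fact):
    """Attribute a lost fact to the first operation whose output no longer contains it."""
    summaries = summarize(turns)
    if not any(fact in s for s in summaries):
        return "summarization"
    memory = store(summaries)
    if not any(fact in m for m in memory):
        return "storage"
    if not any(fact in r for r in retrieve(memory, query)):
        return "retrieval"
    return "ok"


query = "what is my cat named"
print(diagnose(["my cat is named Miso", "I like green tea"], query, "Miso"))
print(diagnose(["by the way, after a long day I realized my cat is named Miso"], query, "Miso"))
print(diagnose(["my cat is named Miso", "a", "b", "c"], query, "Miso"))
print(diagnose(["Miso is a tabby", "my cat food is named Purrfect"], query, "Miso"))
```

Each call is built to stress one operation, so the four calls produce four different diagnoses even though an aggregate accuracy score would only report three wrong answers out of four.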

2026-05-27 13:00 JST | arXiv cs.AI | エージェント

ツールの失敗に注意する: 医療エージェントにとってツールの相乗効果を達成する

医療 AI エージェントは、診断、治療の推奨、証拠の検索に外部ツールを使用することが増えていますが、既存のアプローチのほとんどは、タスクに適したツールが意図した範囲内で信頼できることを前提としています。この仮定は実際の臨床現場では脆弱であり、関連するツールであっても困難な事例では失敗し、安全でない下流の意思決定につながる可能性があります。この問題に対処するために、私たちは、個々のツールで見逃された障害インスタンスを修正するために、不完全なツール設定での医療ツールの使用を研究しています。インスタンス依存の障害パターンにより、最適な固定単一ツールと理想的なインスタンスごとのセレクターとの間にギャップが生じます。これを単一 Oracle リスクギャップと呼びます。中心的な課題は、従来のタスクレベルのツールの選択では、このギャップを認識できないことです。これは、ギャップが本質的に最適な単一ツールのパフォーマンスによって制限されるためです。したがって、この観察に基づいて、インスタンスレベルの不均一性を考慮し、ツールの使用をインスタンスレベルの選択問題として定式化します。特に、確率的リスクの最小化と不一致を認識した相乗学習に対する報酬を備えた GRPO ベースの強化学習フレームワークを提案します。これにより、誤ったツールのコンセンサスのインスタンスレベルの修正が促進されます。さらに、エントロピーに基づくサンプリング戦略を採用して不一致の多いインスタンスを重み付けし、インスタンス固有のツールの相乗効果を学習するためのより強力なシグナルを提供します。これら 2 つのコンポーネントは、インスタンスレベルの異質性を緩和し、ツールの相乗効果を向上させるという点で相互に補完します。 2 つのタスクと 7 つの医療ベンチマークに関する実験では、私たちの手法が幅広いベースラインにわたって堅牢かつ安定した改善を一貫して達成していることが示され、信頼性の高い医療薬剤システムのための相乗効果を意識したツールの使用の重要性が強調されています。

原文 (English)

Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents

Medical AI agents increasingly use external tools for diagnosis, treatment recommendation, and evidence retrieval, yet most existing approaches assume that task-appropriate tools are reliable within their intended scope. This assumption is fragile in real clinical settings, where even relevant tools may fail on challenging instances and lead to unsafe downstream decisions. To address this issue, we study medical tool use under imperfect-tool settings to correct failure instances missed by individual tools. Instance-dependent failure patterns create a gap between the best fixed single tool and an ideal instance-wise selector, which we refer to as the Single-Oracle risk gap. The core challenge is that conventional task-level tool selection cannot realize this gap, as it is inherently bounded by the performance of the best single tool. Motivated by this observation, we account for instance-level heterogeneity and formulate tool use as an instance-level selection problem. In particular, we propose a GRPO-based reinforcement learning framework with rewards for probabilistic risk minimization and disagreement-aware synergy learning, which promotes instance-level correction of erroneous tool consensus. Furthermore, an entropy-guided sampling strategy is adopted to upweight high-disagreement instances, which provide stronger signals for learning instance-specific tool synergy. These two components complement each other in mitigating instance-level heterogeneity and improving tool synergy. Experiments on two tasks and seven medical benchmarks show that our method consistently achieves robust and stable improvements over a broad range of baselines, highlighting the importance of synergy-aware tool use for reliable medical agentic systems.
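
The gap between the best fixed single tool and an ideal instance-wise selector can be computed directly from a table of per-instance outcomes. The Python sketch below assumes 0-1 risk (error rate) and a made-up correctness matrix; the function names and the data are hypothetical, and the paper may define risk differently.

```python
def best_single_tool_risk(correct):
    """Error rate of the best fixed tool; correct[t][i] is True if tool t is right on instance i."""
    n = len(correct[0])
    return min((n - sum(row)) / n for row in correct)


def oracle_risk(correct):
    """Error rate of an ideal per-instance selector: wrong only when every tool is wrong."""
    n = len(correct[0])
    all_wrong = sum(1 for i in range(n) if not any(row[i] for row in correct))
    return all_wrong / n


# Three tools evaluated on five instances (illustrative data only).
correct = [
    [True, True, True, False, False],
    [False, True, True, True, False],
    [True, False, False, True, False],
]
single = best_single_tool_risk(correct)
oracle = oracle_risk(correct)
print(f"best single tool risk: {single:.2f}")
print(f"instance-wise oracle risk: {oracle:.2f}")
print(f"gap: {single - oracle:.2f}")
```

In this toy table no single tool gets instance 4 right, yet another tool does, so a selector that chooses per instance can beat every fixed choice. Task-level selection picks one row and is therefore bounded by the first number.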

2026-05-27 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

CUDA カーネル生成における自己進化型 LLM エージェントの計画に対するフィードバックの決定に向けて

大規模言語モデル (LLM) は、世代間のフィードバック条件付き計画によって駆動される、CUDA カーネル生成の自己進化エージェントとして、強力な経験的利益を示しています。ただし、計画上の決定が異種フィードバック信号にどのように帰属し、結合されるかは依然として不透明なままです。標準的なエンドツーエンドアブレーションでは、反復計画により初期の摂動が増幅され、フィードバック効果と軌道依存のドリフトが混同されるため、この問題は解決できません。 CUDAnalyst は、軌道フリーズと選択的フィードバック注入を介してフィードバックコンポーネントへの計画決定の制御された世代レベルの帰属を実現するための統合分析レイヤーです。 CUDAnalyst は、安定した世代レベルの評価と、フィードバックの効果と相互作用の原則に基づいた連合スタイルの帰属を可能にします。私たちの結果は、明示的な計画はフィードバックが調整されている場合にのみ有益であること、効果的な計画は構造化された複数のフィードバックの相互作用から生まれること、より強力な推論モデルからの高レベルの計画は部分的に弱い推論モデルに移行できることを示しています。これらの傾向は、参照バックボーン、代表的なワークロード、および参照誘導レジームにわたって維持されており、特定されたフィードバックから計画への構造が調査対象の制御軸内で堅牢であることを示しています。

原文 (English)

Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation

Large language models (LLMs) have shown strong empirical gains as self-evolving agents for CUDA kernel generation, driven by feedback-conditioned planning across generations. However, how planning decisions attribute and combine heterogeneous feedback signals remains opaque. Standard end-to-end ablations fail to resolve this question, as iterative planning amplifies early perturbations and conflates feedback effects with trajectory-dependent drift. We introduce CUDAnalyst, a unified analysis layer for controlled, generation-level attribution of planning decisions to feedback components via trajectory freezing and selective feedback injection. CUDAnalyst enables stable generation-level evaluation and principled coalitional-style attribution of feedback effects and interactions. Our results show that explicit planning is beneficial only when feedback is aligned, that effective planning emerges from structured multi-feedback interactions, and that high-level plans from stronger reasoning models can partially transfer to weaker ones. These trends hold across reference backbones, representative workloads, and reference induction regimes, indicating that the identified feedback-to-plan structure is robust within the controlled axes studied.

2026-05-27 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

能力ではありません: ハーネスの感度は LLM エージェント層全体で単調ではありません

LLM エージェントの展開における一般的な仮定では、より構造化されたハーネスは普遍的に信頼性を向上させ、より高機能なモデルに必要な構造的ガイダンスは比例して少なくなるという考えが一般的であり、これは、モデルの機能層と最適なハーネスの複雑さの間に単調な逆相関があることを暗示しています。この仮説は、Git ベースのワークスペース検証を備えた 24 タスクの合成ベンチマークである HEAT-24 上で、3 つのハーネス条件 (軽量、バランス、厳密) を備えた 4 つの機能層にわたる 6 つのモデルを横断する制御された 432 回の実行実験を通じてテストされます。私たちの結果は、単調な逆関係を 2 つの面で否定します。まず、評価されたフロンティアチャットモデル (Gemini 2.5 Flash) では、ハーネスの冗長性が増加すると、VTSR が 29 ～ 38 パーセントポイント低下します。これは、ハーネスの複雑さのパラドックスです。第 2 に、評価されたフロンティア推論モデル (Qwen3.5-122B、拡張思考が有効) では、厳密なハーネスは予測とは逆に最高の VTSR (91.7%) と最低の遅延を達成しました。制約層内では、2B モデル (Gemma4:e2B) は、すべてのハーネスにわたって 91.7% で強力なオープン層の安定性と一致します。この調査では各層が単一のモデルで表されているため、これらの結果はモデル固有の観察結果として解釈される必要があります。ハーネスの感度は、評価したモデル全体で単調ではなく、モデルのタイプ (チャットか推論か) に大きく依存します。私たちは、format_violation が能力のあるモデルの障害を支配し、間違ったファイルが低機能の障害を支配していることを示す 6 ラベルの障害分類法を導入し、実用的な階層を意識したハーネス選択ガイドラインを導き出します。

原文 (English)

It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

A prevalent assumption in LLM agent deployment holds that more structured harnesses universally improve reliability, and that higher-capability models need proportionally less structural guidance -- together implying a monotone inverse relationship between model capability tier and optimal harness complexity. We test this hypothesis through a controlled 432-run experiment crossing six models across four capability tiers with three harness conditions (light, balanced, strict) on HEAT-24, a 24-task synthetic benchmark with git-based workspace verification. Our results refute the monotone inverse relationship on two fronts. First, for the frontier chat model evaluated (Gemini 2.5 Flash), increased harness verbosity lowers VTSR by 29-38 percentage points -- a harness-complexity paradox. Second, for the frontier reasoning model evaluated (Qwen3.5-122B, extended thinking enabled), strict harness achieves the highest VTSR (91.7%) and the lowest latency, the opposite of the prediction. Within the constrained tier, a 2B model (Gemma4:e2B) matches strong-open-tier stability at 91.7% across all harnesses. Because each tier is represented by a single model in this study, these results should be interpreted as model-specific observations; harness sensitivity appears non-monotone across the models evaluated, and depends critically on model type (chat vs. reasoning). We introduce a six-label failure taxonomy showing that format_violation dominates capable-model failures while wrong_file dominates low-capability failures, and we derive practical tier-aware harness selection guidelines.
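
The run count follows from the experimental grid. As a quick check, assuming one run per (model, harness, task) cell, which the abstract implies but does not state, the design can be enumerated in Python. The model and task identifiers below are placeholders, not names from the paper.

```python
from itertools import product

models = [f"model_{i}" for i in range(1, 7)]      # six models (placeholder names)
harnesses = ["light", "balanced", "strict"]       # the three harness conditions
tasks = [f"task_{i:02d}" for i in range(1, 25)]   # the 24 tasks of HEAT-24

runs = list(product(models, harnesses, tasks))
print(len(runs))                       # 6 * 3 * 24
print(round(22 / 24 * 100, 1))         # success rate if 22 of 24 tasks pass
```

If VTSR is computed over the 24 tasks of a single model-harness cell, a value of 91.7% would correspond to 22 passing tasks; that reading is an inference from the arithmetic, not something the abstract confirms.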

2026-05-27 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

Helicase: 自律型マルチエージェント LLM を使用した、不確実性に基づいたサプライチェーンのナレッジグラフ構築

LLM ベースのマルチエージェントシステムは、Web 検索とテキスト推論を通じて既知の情報を統合し、知識の検索とレポートの生成に広く採用されています。ただし、サプライチェーンにおける重要な情報タスクの多くは、単純なワンショットクエリではありません。これらは、複雑で断片化された Web リソース全体にわたるマルチホップ推論を必要とする構造推論問題です。「オーストラリアの鉱山からのリチウムを使用しているテスラの部品はどれですか?」などの質問には、どの文書にも答えがありません。答えは、断片化された異種ソースから集められた動的な知識グラフの自律的な構築と分析を通じて、計算によって合成されなければなりません。さらに、そのような発見プロセスは不確実性を認識する必要があります。意思決定は答えだけではなく、ソースの品質と推論の一貫性まで追跡可能な、その信頼性に対する調整された信頼度に依存します。この機能ギャップに対処するために、不確実性に基づいてサプライチェーンのナレッジグラフを構築するための自律型マルチエージェント LLM システムである Helicase を提案します。 Helicase は、高レベルのサプライチェーンクエリを実行可能な調査計画に分解し、反復検証ループを通じて専門の Web 検索、推論、コーディングエージェントを調整し、事実に基づく不確実性の注釈を備えたクエリ固有のサプライチェーンナレッジグラフを段階的に構築します。その 3 層の不確実性フレームワークは、アクション、軌道、記憶の各層で不確実性を追跡し、構造推論と校正された信頼性評価の両方を可能にします。複雑さの全領域にわたって自律推論を評価するために、SCQA (サプライチェーンクエリアセスメント) を導入します。これは、データの可視性の高低の両方でシングルホップ推論からマルチホップ推論にわたる 4 つの象限に編成された 80 のサプライチェーンクエリのベンチマークです。

原文 (English)

Helicase: Uncertainty-Guided Supply Chain Knowledge Graph Construction with Autonomous Multi-Agent LLMs

LLM-based multi-agent systems have been widely adopted for knowledge retrieval and report generation, synthesizing known information through web search and textual reasoning. However, many critical information tasks in supply chains are not simple one-shot queries: they are structural inference problems requiring multi-hop reasoning across complex, fragmented web resources. Questions such as "Which Tesla components use lithium from Australian mines?" have no answer in any single document; answers must be computationally synthesized through the autonomous construction and analysis of dynamic knowledge graphs assembled from fragmented, heterogeneous sources. Moreover, such discovery processes must be uncertainty-aware: decisions depend not only on answers but on calibrated confidence in their reliability, traceable to source quality and reasoning consistency. To address this capability gap, we propose Helicase, an autonomous multi-agent LLM system for uncertainty-guided supply chain knowledge graph construction. Helicase decomposes high-level supply-chain queries into executable investigation plans, coordinates specialized web-search, reasoning, and coding agents through iterative verification loops, and incrementally constructs query-specific supply chain knowledge graphs with per-fact uncertainty annotations. Its three-layer uncertainty framework tracks uncertainty at the action, trajectory, and memory layers, enabling both structural inference and calibrated confidence assessment. To evaluate autonomous reasoning across the full complexity spectrum, we introduce SCQA (Supply Chain Query Assessment), a benchmark of 80 supply chain queries organized into four quadrants spanning single-hop to multi-hop inference under both high and low data visibility.
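
To make the idea of multi-hop inference over a graph with per-fact uncertainty concrete, here is a minimal sketch. It is not the Helicase implementation: the `Fact` type, the made-up facts, and the rule that a path's confidence is the product of its edge confidences (which assumes the facts are independent) are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str
    obj: str
    confidence: float  # per-fact confidence in [0, 1]


# Illustrative, made-up facts; not real supply-chain data.
FACTS = [
    Fact("component_A", "uses_material", "lithium_batch_1", 0.9),
    Fact("lithium_batch_1", "sourced_from", "mine_X", 0.8),
    Fact("component_B", "uses_material", "lithium_batch_2", 0.7),
    Fact("lithium_batch_2", "sourced_from", "mine_Y", 0.5),
]


def multi_hop(start, target, facts, max_hops=4):
    """Return (path, confidence) pairs for paths from start to target.

    Path confidence is the product of edge confidences, which assumes
    the facts are independent (a simplification).
    """
    results = []

    def walk(node, path, conf):
        if node == target and path:
            results.append((path, conf))
            return
        if len(path) >= max_hops:
            return
        for f in facts:
            if f.subject == node and f not in path:
                walk(f.obj, path + [f], conf * f.confidence)

    walk(start, [], 1.0)
    return results


for path, conf in multi_hop("component_A", "mine_X", FACTS):
    print(" -> ".join([path[0].subject] + [f.obj for f in path]), round(conf, 2))
```

No single fact links `component_A` to `mine_X`; the answer only appears once two facts are chained, and the chained answer is less certain than either fact alone.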

2026-05-27 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

TADDLE: A Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews

LLM-generated peer reviews are increasingly common at major venues, yet their deficiencies are hard to detect because they are uniformly fluent and well-structured. Existing work either classifies authorship without judging quality, or scores quality with features designed for human-written reviews; no prior system detects deficiencies in LLM-generated reviews at the level of individual defect types. To bridge the gap, we introduce TADDLE, a Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews, together with the first expert-annotated benchmark for this task. Our benchmark comprises 1,800 reviews on 50 ICLR 2025 papers, multi-label-annotated by 18 domain experts against a taxonomy of six defect categories (plus a non-deficient label). TADDLE decomposes detection into four specialized analysis tools -- Verify, Correct, Complete, and Transform -- orchestrated by an agent; an integrator synthesizes their outputs into binary and multi-label classifications via two-stage semi-supervised learning. Extensive experiments show that TADDLE performs strongly on both binary detection and the multi-label classification task. We release the benchmark and code at https://github.com/AquariusAQ/TADDLE.

2026-05-27 13:00 JST | arXiv cs.AI | Agents | Business/Funding

From Norms to Indicators (N2I-RAG): An Agentic Retrieval-Augmented Generation Framework for Legal Indicator Computation

Computing legal indicators from normative texts is a key task in legal monitoring and policy evaluation, but presents significant challenges due to the complexity, scale, and interpretive nature of legal language, as well as the variability in available document quality. Existing natural language processing techniques and generative models can assist in legal analysis, but often suffer from a high risk of hallucination and lack the interpretability and evidence grounding required for reliable indicator computation. This paper presents N2I-RAG (From Norms to Indicators), an agentic retrieval-augmented generation framework designed to automate the computation of legal indicators in a transparent and traceable way. We integrate adaptive retrieval, LLM-based agents, and validation mechanisms in a modular pipeline, where each component performs a defined role in filtering, retrieving, and assessing evidence, and in producing binary legal outcomes linked to identifiable legal provisions. The framework emphasizes traceability by requiring explicit explanations of intermediate decisions and final indicator assignments. We evaluate N2I-RAG using an in-house constructed French marine environmental law corpus that includes both scanned and digital sources. Comparative experiments with multiple language model families demonstrate that the proposed approach consistently outperforms baseline systems, and generalizes well when tested on two different bans. The results indicate that agentic retrieval-augmented generation can bridge open-text legal language and standardized indicator computation, offering a foundation for transparent and scalable legal observatories.

2026-05-27 13:00 JST | arXiv cs.AI | Agents | Research/Papers

Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation

Vision-Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, benchmarking, and reasoning for VLM-based agents in this domain. To facilitate rigorous evaluation, we introduce HyperTrack, a large-scale dataset with over 16,000 real-world tasks across more than 650 Chinese mobile applications, along with GUIEvalKit, an open-source toolkit for unified benchmarking of VLMs on offline GUI navigation tasks. Using HyperTrack, we analyze the effects of training data scale on both supervised and reinforcement-based finetuning. Our results show that reinforcement-based finetuning consistently outperforms supervised finetuning, particularly in out-of-domain settings, highlighting the synergy between data scaling and reinforcement learning. Leveraging GUIEvalKit, we further benchmark state-of-the-art (SOTA) VLMs and analyze how interaction history and reasoning capabilities influence task completion. Together, HyperTrack and GUIEvalKit provide a comprehensive platform for developing and evaluating VLM agents in mobile GUI navigation tasks.

2026-05-27 13:00 JST | arXiv cs.AI | Agents

StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning

Reinforcement learning for multi-turn agents suffers from a credit-assignment mismatch: rewards are sparse and trajectory-level, while success often hinges on a few local decisions. Existing online policy distillation (OPD) provides denser token-level supervision, but typically treats heterogeneous agent trajectories as monolithic strings rather than causal interaction units. We present StepOPSD, a post-rollout preference self-distillation framework that takes the agent step as the unit of credit redistribution. StepOPSD decomposes trajectories into action-centered step segments, rescores them under hindsight-enriched teacher contexts, and converts token-level log-probability gaps into sign-preserving advantage shaping with a normalized per-step credit budget before the GRPO update. Across ALFWorld and Search-QA with Qwen3-1.7B and Qwen2.5-3B-Instruct, StepOPSD attains best or second-best results on subsets most sensitive to local causal errors, including first-place performance on ALFWorld Heat (79.1%), PickTwo (95.0%), Search-QA TriviaQA (61.6%), and tied-best performance on HotpotQA (40.4%). The results further reveal a consistent two-knob law: smaller α_clip acts as a broadly stabilizing local trust region, whereas the optimal global mixing strength λ_mix remains task-dependent. These findings suggest that step-aware distillation is most useful when trajectory-level rewards are weakly aligned with the local action that determines downstream success.

2026-05-27 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. However, existing agent benchmarks primarily evaluate reasoning and tool use, largely overlooking the challenges of inferring and leveraging user preferences in realistic scenarios. To address this gap, we introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. In VitaBench 2.0, tasks are organized as temporally ordered sequences for individual users, where preferences are embedded in fragmented and heterogeneous interactions. Successful completion of tasks requires the agent to continuously extract, utilize, and update user preferences from these interactions. We further evaluate proactiveness through tasks that require agents to recognize missing information and actively acquire it from users or environments before making decisions. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across different memory architectures. We benchmark a diverse set of frontier proprietary and open-source LLMs. Results show that real-world personalization remains highly challenging even for state-of-the-art models, revealing a substantial gap between current capabilities and practical requirements. Extensive analysis further reveals the failure modes and capability bottlenecks of current agents in real-world personalized decision-making, providing insights for future model improvements.

2026-05-27 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments

Recent advances in large language models (LLMs) have facilitated the widespread deployment of LLMs as interactive agents capable of reasoning, planning, and tool use. Despite strong performance on existing benchmarks, such agents often exhibit notable degradation when deployed in real-world settings, where environments are inherently stochastic and imperfect. We argue that this discrepancy arises from a fundamental mismatch between idealized training settings and real-world interaction dynamics, where current paradigms rely on carefully curated task instructions and stable, well-controlled environments. To address this gap, we propose NoisyAgent, an agentic training framework that explicitly incorporates environmental imperfections into the agent learning process. We identify two major sources of interaction noise in real-world scenarios: user noise, which captures ambiguity and variability in user interaction, and tool noise, which reflects failures and anomalies in tool execution. We introduce such perturbations into the training pipeline by modifying user interaction patterns and simulating tool execution results within the training environment. To stabilize training while encouraging agents to handle increasingly challenging imperfections, noise is applied to only a subset of rollouts and progressively increased in difficulty as the model adapts to the current noise level. Extensive experiments demonstrate that our approach consistently improves agent robustness under noisy and dynamic environments. Our analysis reveals that training under noise conditions also yields performance gains on idealized benchmarks, suggesting that controlled exposure to environmental noise promotes more generalizable reasoning and decision-making behaviors. Our findings highlight the importance of modeling interaction imperfections for bridging the gap between agent training and real-world deployment.
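
The two training-time mechanisms described above, noise applied to only a subset of rollouts and difficulty raised as the model adapts, can be sketched as follows. This is an illustration under stated assumptions, not the NoisyAgent code: the function names, the four noise levels, the 0.3 noise fraction, and the 0.7 success threshold are all hypothetical.

```python
import random


def maybe_corrupt_tool_result(result, level, rng, noise_fraction=0.3):
    """Return (result, was_noised). Only a fraction of rollouts receive noise.

    `level` in 0..3 sets how likely a noised call is to fail outright
    rather than return a truncated payload. All constants are illustrative.
    """
    if level == 0 or rng.random() >= noise_fraction:
        return result, False
    if rng.random() < 0.25 * level:
        return {"error": "tool execution failed"}, True
    text = str(result)
    return {"partial": text[: max(1, len(text) // 2)]}, True


def next_noise_level(level, recent_success_rate, threshold=0.7, max_level=3):
    """Raise difficulty once the agent copes with the current level."""
    if recent_success_rate >= threshold and level < max_level:
        return level + 1
    return level


level = 0
for success_rate in [0.8, 0.5, 0.75, 0.9, 0.95]:
    level = next_noise_level(level, success_rate)
print(level)  # 3

rng = random.Random(0)
print(maybe_corrupt_tool_result({"rows": 10}, level, rng))
```

Keeping some rollouts clean gives the policy an uncorrupted learning signal while the noised rollouts teach recovery behavior; the schedule only advances when recent success suggests the current level has been absorbed.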

2026-05-27 13:00 JST | arXiv cs.AI | Agents

SIA: Self Improving AI with Harness & Weight Updates

Humans are the bottleneck in building and improving AI. Both the models and the agents that wrap them are written, tuned, and corrected by people. The long-horizon goal of an AI that can figure out how to improve itself remains open. Two largely disjoint research lines attack this bottleneck. The harness-update school has a meta-agent rewrite the scaffold of a task-specific agent (its tools, prompts, retry logic, and search procedure) while the model weights are held fixed. The test-time training school uses hand-written RL pipelines to update the model's own weights on task feedback while the harness is held fixed. These two silos operate in isolation. We propose SIA, a self-improving loop in which a language-model agent (the Feedback-Agent) updates both the harness and the weights of a task-specific agent. We evaluate across three contrasting domains: Chinese legal charge classification, low-level GPU kernel optimisation, and single-cell RNA denoising. Combining both levers outperforms scaffold iteration alone on all three benchmarks. The gains are 56.6% on LawBench, 91.9% runtime reduction on GPU kernels, and 502% on denoising over the initial baseline. Harness updates make the model agentic, shaping how it searches and acts, while weight updates build the domain intuition that no prompt or scaffold can instil.

2026-05-27 13:00 JST | arXiv cs.AI | Agents

Modeling Agentic Technical Debt and Stochastic Tax: A Standalone Framework for Measurement, Simulation, and Dashboarding

Agentic AI systems combine probabilistic reasoning with delegated action through tools, context, memory, orchestration, and external workflow integration. This note develops a formal and managerially usable model that distinguishes Agentic Technical Debt from Stochastic Tax. Agentic Technical Debt is a stock of accumulated design and governance liability. Stochastic Tax is a recurring flow of operating burden that arises when stochastic agents are used in business workflows. The two constructs are related, but they are not the same: debt can amplify the tax, while the tax can remain positive even when debt is minimized. The note starts from a compact dashboard expression, expands it into a fuller structural model, defines all variables and parameters, shows how each cost category can be estimated from operational data, and illustrates the framework with an accounts-payable simulation and companion spreadsheet.
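
The stock-versus-flow distinction can be illustrated with a toy simulation. The functional form below is an assumption made for illustration and is not the note's model: tax in each period is `base_tax * (1 + amplification * debt)`, so debt amplifies the tax, yet the tax stays at `base_tax` when debt reaches zero. All parameter names and values are hypothetical.

```python
def simulate(periods, base_tax, amplification, debt0, accrual, remediation):
    """Toy stock-and-flow simulation (illustrative functional form).

    debt: stock carried between periods, never below zero.
    tax:  per-period flow, equal to base_tax when debt is zero.
    """
    debt = debt0
    history = []
    for _ in range(periods):
        tax = base_tax * (1.0 + amplification * debt)
        history.append((debt, tax))
        debt = max(0.0, debt + accrual - remediation)
    return history


for debt, tax in simulate(periods=3, base_tax=100.0, amplification=0.5,
                          debt0=2.0, accrual=0.0, remediation=1.0):
    print(f"debt={debt:.1f} tax={tax:.1f}")
# debt=2.0 tax=200.0
# debt=1.0 tax=150.0
# debt=0.0 tax=100.0
```

Paying the debt down to zero halves the per-period burden in this example but does not eliminate it, which is the asymmetry the abstract describes.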

2026-05-27 13:00 JST | arXiv cs.AI | Agents | Research/Papers

Maat: The Agentic Legal Research Assistant for Competition Protection

Competition law experts conducting legal research must review extensive volumes of cases, decisions, and judicial reports to identify precedents and assess key elements in competition and merger cases. Although general research assistants such as Claude and ChatGPT and legal assistants such as SaulLM-7B and LegalGPT are increasingly used to assist legal research, they remain inadequate for competition law analysis: they lack specialized domain expertise, provide insufficient official citations, or hallucinate competition law cases. We propose Maat, a ReAct agent that orchestrates tools corresponding to different tasks of the research process. Designed iteratively with competition law experts, Maat grounds cases and findings in official sources using RAG for reliability, provides rich in-line citations, falls back to web search when database coverage is insufficient, and prompts the user for clarification when queries are ambiguous. Maat significantly outperforms all baseline assistants on case-specific tasks and performs within range of the top baseline on theoretical question tasks. The dataset used is available on GitHub.

2026-05-27 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Natural Language Query to Configuration for Retrieval Agents

Modern retrieval agents expose many configuration choices -- LLM, retriever, number of documents, number of hops, and synthesis strategy -- each shaping both answer quality and serving cost. Today, these pipelines are typically hand-tuned once per workload, leaving substantial per-query optimization untapped. We formulate the problem: given a natural-language query and either an accuracy or a budget target, select from a predefined pipeline catalog the configuration that minimizes cost or maximizes accuracy at inference time. We propose **BRANE**, which uses an LLM to convert each query into workload-specific characteristics, then trains a lightweight per-configuration predictor that estimates whether the pipeline will answer the query correctly. At inference time, **BRANE** selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining. Across MuSiQue, BrowseComp-Plus, and FinanceBench, **BRANE** consistently pushes the cost-quality Pareto frontier, matches the best fixed configuration's accuracy at up to 89% lower cost, and outperforms LLM-routing, rule-based, and fine-tuned Qwen3-4B baselines. These results show that per-query configuration of the full retrieval pipeline is a practical alternative to static workload-level tuning.
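
The inference-time selection rule can be sketched in a few lines. The abstract says only that predicted correctness is "penalized by cost"; the additive form `p - penalty * cost` used here is one plausible reading and an assumption, as are the catalog entries and numbers. The per-configuration predictors are stubbed out as a dictionary of probabilities.

```python
def select_config(predicted_correctness, cost, penalty):
    """Pick the configuration with the best cost-penalized score.

    predicted_correctness: dict config -> estimated P(correct) for this query
    cost:                  dict config -> serving cost (same units as penalty)
    penalty:               weight on cost; larger values favor cheaper configs
    """
    return max(predicted_correctness,
               key=lambda c: predicted_correctness[c] - penalty * cost[c])


# Hypothetical catalog entries and numbers, for illustration only.
p = {"small_1hop": 0.60, "large_1hop": 0.75, "large_3hop": 0.85}
cost = {"small_1hop": 1.0, "large_1hop": 4.0, "large_3hop": 10.0}

print(select_config(p, cost, penalty=0.0))   # large_3hop
print(select_config(p, cost, penalty=0.02))  # large_1hop
print(select_config(p, cost, penalty=0.10))  # small_1hop
```

Because `penalty` enters only at selection time, sweeping it traces out a cost-quality curve without retraining the predictors, which matches the tunable tradeoff the abstract mentions.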

2026-05-27 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Business/Funding

MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and static artifacts, limiting their reusability, reliability, and long-term improvement. We propose MUSE-Autoskill Agent (Memory-Utilizing Skill Evolution), a skill-centric agent framework that lets agents continuously improve their task-solving capability by creating, reusing, and refining skills under a unified lifecycle (creation, memory, management, evaluation, and refinement). Our framework enables agents to create skills on demand, store and reuse them across tasks, organize and select them efficiently, and evaluate them through unit tests and runtime feedback for continuous refinement. We further introduce skill-level memory that accumulates experience for each skill across tasks, enabling more effective reuse and adaptation over time. Experiments on SkillsBench provide initial evidence that lifecycle-managed skills can improve task success, efficiency, reuse, and cross-agent transfer, highlighting the importance of treating skills as long-lived, experience-aware, and testable assets.

2026-05-27 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Research/Papers

VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents

We present VISTA (VIsual Spec-To-App Benchmark), a benchmark for evaluating the end-to-end web-app generation capabilities of LLM-based agents. Unlike prior code generation benchmarks that focus on algorithmic tasks, VISTA targets realistic UI-centric development, where agents must produce functional, visually coherent applications from underspecified inputs. We define five prompt-information conditions that vary along two axes, visual/structural fidelity and stack constraint: (1) text only with free stack choice, (2) text with reference screenshots under three specified stacks, (3) text with reference screenshots under free stack choice, (4) text with screenshots and pruned Figma structure under a single specified stack, and (5) text with screenshots and pruned Figma structure under free stack choice. To enable robust evaluation, each page in the benchmark is manually annotated with interactive UI components and around three visual anchor points, addressing the well-known limitations of script-based testing tools such as Playwright in open-ended code generation settings. Evaluation combines DOM-grounded reference matching, behavior-specific browser tests, and CLIP-based visual similarity, jointly measuring structural alignment, behavioral completeness, and overall visual fidelity. We use VISTA to assess four agent systems drawn from two model families and two harnesses, finding that visual fidelity and functional correctness are partially decoupled across both input conditions and agents, and that agent editing style varies sharply but is largely orthogonal to task quality. VISTA establishes a rigorous and reproducible foundation for advancing agent-based software engineering research.

2026-05-27 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

MemMorph: Tool Hijacking in LLM Agents via Memory Poisoning

LLM-driven agents are capable of selecting external tools to complete users' tasks. However, attackers could compromise this process, steering agents toward inappropriate or wrong tools and enabling malicious actions. Most existing attacks primarily manipulate the tool metadata, which is easily detectable by auditing and may lose effectiveness as modern agents increasingly adopt memory modules to refine tool selection policies through accumulated experience. This paper proposes MemMorph, the first attack that biases tool selection by poisoning the agent's long-term memory. Rather than explicitly dictating the tool invocation decision, MemMorph injects a small number of crafted records that are disguised as technical facts, incident reports, and operational policies. These poisoned records reshape the agent's contextual perception and decision-making process, leading it to autonomously infer and select the tool preferred by the attacker. Experiments across 3 benchmarks, 10 agent backbones, and 3 memory-module implementations show that MemMorph achieves up to 85.9% attack success rate with only three injected records, outperforming the strongest baseline by up to 25% while retaining potency under 3 representative defenses. Our findings expose long-term memory as a critical and under-explored attack surface in tool-augmented agents, urging the development of memory-level integrity safeguards.

2026-05-27 13:00 JST | arXiv cs.AI | Agents

適応型ガイダンスが役立つのはどのような場合ですか?部分的な可観測性の下での自動運転のための信念を意識した特権抽出

Guided Soft Actor-Critic (GSAC) は、特権的な完全状態の教師から部分観測の生徒へ自動運転に関する知識を蒸留しますが、エージェントの不確実性に関係なく固定の蒸留係数ラムダを使用します。私たちは、アンサンブル不一致によってラムダを調整する信念認識 GSAC (BA-GSAC) を提示し、これを「適応ガイダンスが実際にどのような場合に役立つのか」を問う体系的な実証研究のテストベッドとして使用します。 Highway-Env の 3 つの POMDP 難易度レベルにわたって 5 つの戦略 ({0.01, 0.1} の固定ラムダ、適応型、線形減衰、およびバニラ SAC) を評価すると、予備的な単一シードの実行は、軽度および中程度の部分可観測性の下では利点を示唆していますが、重度のオクルージョン (すべてのメソッドに対して 3 つのシードで評価) の下では、適応係数が約 3K ステップ以内で lambda_min に崩壊することがわかります。私たちは、この原因が可観測性の盲目現象にあることを突き止めました。アンサンブルは部分的な観測を予測するため、重いオクルージョン下でも不一致が低くなり、見えているものはモデル化できても、欠けているものは検出できません。根本原因を診断し、アーキテクチャ上の修正を提案します (ガイドアクターの特権アクセスを使用して、完全な状態の予測についてアンサンブルをトレーニングします)。この修正はここでは検証されていませんが、現状の制約があっても、ウォームアップ段階で測定可能な安定化が得られることを示します (定数ラムダ = 0.01 の場合の 29.8% に対し、CV = 13.3%)。実際、単純な決定論的線形減衰スケジュールは、すべてのメトリクスにわたって重度 POMDP で最高のパフォーマンス (平均 116.5、CV=8.9%) を達成しており、アンサンブルではなくスケジューリング効果が安定性の利点を推進していることを示唆しています。これらの発見は、不確実性を意識した教師と生徒のフレームワークを設計するための実践的なガイダンスを提供し、重要な設計上の選択肢としてアンサンブル予測ターゲットを強調しています。

原文 (English)

When Does Adaptive Guidance Help? Belief-Aware Privileged Distillation for Autonomous Driving Under Partial Observability

Guided Soft Actor-Critic (GSAC) distills knowledge from a privileged full-state teacher to a partial-observation student for autonomous driving, but uses a fixed distillation coefficient lambda regardless of the agent's uncertainty. We present Belief-Aware GSAC (BA-GSAC), which modulates lambda via ensemble disagreement, and use it as a testbed for a systematic empirical study asking: when does adaptive guidance actually help? Evaluating five strategies (fixed lambda in {0.01, 0.1}, adaptive, linear decay, and vanilla SAC) across three POMDP difficulty levels on Highway-Env, we find that preliminary single-seed runs suggest benefits under mild and moderate partial observability, but under severe occlusion (evaluated with 3 seeds for all methods) the adaptive coefficient collapses to lambda_min within about 3K steps. We trace this to an observability blindness phenomenon: because the ensemble predicts partial observations, it achieves low disagreement even under heavy occlusion, modeling what is visible but unable to detect what is missing. We diagnose the root cause and propose an architectural fix (training the ensemble on full-state predictions using the guiding actor's privileged access); while not validated here, we show that even with current limitations, the warmup phase provides measurable stabilization (CV=13.3% vs. 29.8% for constant lambda=0.01). In fact, a simple deterministic linear decay schedule achieves the best severe-POMDP performance across all metrics (mean 116.5, CV=8.9%), suggesting that the scheduling effect, not the ensemble, drives the stability benefit. These findings provide practical guidance for designing uncertainty-aware teacher-student frameworks and highlight ensemble prediction targets as an important design choice.
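
The two quantities the abstract leans on, a deterministic linear decay of the distillation coefficient and the coefficient of variation (CV) used to compare stability, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the start value `lam_start=0.1`, the floor `lam_min=0.01`, and the use of the sample standard deviation are all assumptions made for the example.

```python
import statistics


def linear_decay(step, total_steps, lam_start=0.1, lam_min=0.01):
    """Linearly anneal the distillation coefficient from lam_start to lam_min.

    lam_start and lam_min are illustrative defaults, not values confirmed
    by the paper. Steps outside [0, total_steps] are clamped.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lam_start + frac * (lam_min - lam_start)


def coefficient_of_variation(returns):
    """CV in percent: sample standard deviation divided by the mean.

    Whether the paper uses the sample or population deviation is an
    assumption here.
    """
    return 100.0 * statistics.stdev(returns) / statistics.mean(returns)


if __name__ == "__main__":
    # Hypothetical per-seed returns, only to show the call pattern.
    print(linear_decay(step=500, total_steps=1000))
    print(coefficient_of_variation([100.0, 110.0, 120.0]))
```

A lower CV across seeds means more consistent returns, which is the sense in which the abstract calls the linear schedule more stable.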

2026-05-27 13:00 JSTarXiv cs.AIエージェント

ツールスキーマの圧縮により、制約されたコンテキストバジェットの下でエージェント RAG が可能になります

言語モデルに数十から数百のツール定義を装備するエージェント RAG システムは、重大なリソースの競合に直面します。ツールスキーマは、検索拡張生成に必要な同じコンテキストウィンドウを消費します。我々は、このツールとコンテキストのトレードオフに関する最初の体系的な研究を発表し、28 のツール定義を備えた 3 つのコンテキストバジェット (8K、16K、32K) で、6,566 の制御された API 呼び出しにわたる 1.5B ～ 32B のローカルモデルにまたがる 14 のモデルと 1 つのフロンティア API モデルを評価しました。 TSCG 保守的なプロファイル圧縮 (スキーマトークンの 44 ～ 50% の節約) を適用すると、バイナリの有効化効果が観察されます。8K トークンでは、JSON スキーマツール定義がコンテキストウィンドウ全体をオーバーフローし、ほぼゼロの EM (平均 2.6%) が得られますが、圧縮されたスキーマは、8 つのモデルすべてで +20.5 pp の平均完全一致リフトで RAG 機能を復元します (完全な有効化を示した 6 つのモデルでは +24.7 pp)。両方のフォーマットが適合する 32K では、テストした 5 つのモデルのうち 4 つでデルタ <= 1 pp が示され、その効果が純粋に予算主導型であることが確認されました。 HotpotQA の外部検証 (50 マルチホップの質問) では、同じオーバーフローシナリオで +48 pp EM が示されています。フロンティアスケーリングテストでは、JSON スキーマが最大 494 ツールでオーバーフローする一方、圧縮スキーマは 800 ツールを超えても動作し続けることが実証されています。私たちの結果は、制約されたコンテキストの展開におけるエージェント RAG に必要なインフラストラクチャ層としてのツールスキーマ圧縮を確立します。すべてのコード、データ、チェックポイントは公開されています。

原文 (English)

Tool-Schema Compression Enables Agentic RAG Under Constrained Context Budgets

Agentic RAG systems that equip language models with dozens to hundreds of tool definitions face a critical resource conflict: tool schemas consume the same context window needed for retrieval-augmented generation. We present the first systematic study of this tool-context trade-off, evaluating 14 models spanning 1.5B-32B local models plus one frontier API model across 6,566 controlled API calls at three context budgets (8K, 16K, 32K) with 28 tool definitions. Applying TSCG conservative-profile compression (44-50% schema token savings), we observe a binary enablement effect: at 8K tokens, JSON-schema tool definitions overflow the context window entirely, yielding near-zero EM (2.6% average), while compressed schemas restore RAG functionality with +20.5 pp average exact-match lift across all eight models (+24.7 pp among the six exhibiting full enablement). At 32K -- where both formats fit -- four of five tested models show delta <= 1 pp, confirming the effect is purely budget-driven. External validation on HotpotQA (50 multi-hop questions) shows +48 pp EM under the same overflow scenario. Frontier scaling tests demonstrate that JSON schemas overflow at ~494 tools while compressed schemas remain operational beyond 800 tools. Our results establish tool-schema compression as a necessary infrastructure layer for agentic RAG in constrained-context deployments. All code, data, and checkpoints are publicly available.
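
The "binary enablement" effect is at bottom a budget calculation: if the tool schemas alone exceed the context window, nothing is left for retrieved passages. The sketch below shows that arithmetic under stated assumptions. The per-tool cost of 300 tokens is hypothetical (the abstract does not give one), the 47% savings is simply a value inside the reported 44-50% range, and `remaining_context` is an illustrative helper, not part of TSCG.

```python
def remaining_context(budget_tokens, schema_tokens_per_tool, n_tools, savings=0.0):
    """Return the tokens left for retrieval after tool schemas are loaded.

    savings is the fraction of schema tokens removed by compression.
    A negative result means the schemas alone overflow the window.
    """
    schema_total = round(schema_tokens_per_tool * n_tools * (1.0 - savings))
    return budget_tokens - schema_total


if __name__ == "__main__":
    budget = 8192          # 8K context budget
    per_tool = 300         # hypothetical average JSON-schema cost per tool
    tools = 28             # number of tool definitions in the study

    print(remaining_context(budget, per_tool, tools))         # uncompressed
    print(remaining_context(budget, per_tool, tools, 0.47))   # compressed
```

With these assumed numbers the uncompressed schemas overflow the 8K window while the compressed ones leave room for retrieval. At 32K both variants fit, which matches the abstract's observation that the effect is budget-driven.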

2026-05-27 13:00 JSTarXiv cs.AIエージェント

バランスの取れた学習、信頼性の高い疑似ラベル、軽量アーキテクチャにより、IoT 向けの自律型オンライン侵入検知を強化

モノのインターネット (IoT) デバイスの急速な普及により、動的に進化するサイバー脅威に対処できる、適応的でリソース効率の高い侵入検知システム (IDS) に対する緊急の需要が生じています。この論文では、IEEE INFOCOM 2024 で公開された最先端の自律型オンライン IDS である AOC-IDS について調査します。AOC-IDS は、クラスター反発コントラスト (CRC) 損失を備えたオートエンコーダー (AE) と自律型ガウスベースの決定モジュールを採用しています。まず、UNSW-NB15 ベンチマークで AOC-IDS の複製に成功し、公開されている 89.19% とほぼ一致する 89.39% の精度を達成しました。次に、クラスの不均衡、信頼性の低い疑似ラベル生成、限られた一般化、IoT 導入の計算オーバーヘッドという 4 つの主要な制限を特定し、それぞれについて的を絞った改善を提案します。当社の XGBoost-BalSamp メソッドは、UNSW-NB15 で 95.45% の精度を達成し、ベースラインより 6.26% 向上しました。当社の複合ディープラーニングアプローチ (PseudoFilter、MixupAug、LiteAE) は、モデルパラメーターを 55% 削減しながら、ベースペーパーを上回る 90.88% (F1: 91.45%) のベストラン精度を達成しました。これらの結果は、AOC-IDS に対する目標を絞った改善により、IoT エッジデバイスでの実際の展開可能性を向上させながら、一貫した精度の向上が得られることを示しています。

原文 (English)

Enhancing Autonomous Online Intrusion Detection for IoT with Balanced Learning, Reliable Pseudo-Labels, and Lightweight Architectures

The rapid proliferation of Internet of Things (IoT) devices has created an urgent demand for adaptive, resource-efficient Intrusion Detection Systems (IDS) capable of handling dynamic and evolving cyber threats. This paper investigates AOC-IDS, a state-of-the-art autonomous online IDS published at IEEE INFOCOM 2024, which employs an Autoencoder (AE) with Cluster Repelling Contrastive (CRC) loss and an autonomous Gaussian-based decision module. We first successfully replicate AOC-IDS on the UNSW-NB15 benchmark, achieving 89.39% accuracy in close agreement with the published 89.19%. We then identify four key limitations: class imbalance, unreliable pseudo-label generation, limited generalization, and computational overhead for IoT deployment, and propose targeted improvements for each. Our XGBoost-BalSamp method achieves 95.45% accuracy on UNSW-NB15, a gain of 6.26 percentage points over the published baseline. Our combined deep learning approach (PseudoFilter, MixupAug, and LiteAE) achieves a best-run accuracy of 90.88% (F1: 91.45%), surpassing the base paper while reducing model parameters by 55%. These results demonstrate that targeted improvements to AOC-IDS yield consistent accuracy gains while improving practical deployability on IoT edge devices.

2026-05-27 13:00 JSTarXiv cs.AILLM/生成AIエージェント

普遍的な崖と設計の指紋: LLM オーケストレーション下でのセクション横断的欠陥検出

実稼働言語モデルシステムは、1 つの統合レポートを再構成するワーカーエージェントの目に見えないオーケストレーション全体にリクエストを分割することで、リクエストに応答します。私たちは、これが、単一のワーカーでは見えないクラスの欠陥、つまり文書の離れた 2 つのセクション間の関係における矛盾にどのような影響を与えるのかを問います。ドキュメント、欠陥、メカニズム、スコアリング、シードを固定したまま、モデルのみを変更します。つまり、1 つの開発者からの 5 世代にわたる 10 のシステムと、異なるアライメントパラダイムからの 5 つのプロバイダーです。 2 つの層が分離します。まず、普遍的な検出の崖です。単一のエージェントでこれらのセクション横断的欠陥を検出できるすべてのモデルは、オーケストレーション下ではその能力を失い、テストしたすべてのパラダイムで検出が 3 分の 2 以上落ちます。この崖はメカニズムに由来するものであり、規模や拡張された推論によって閉じられるものではありません。 2 番目に、崖から落ちた後のモデルの振る舞いです。信号検出分解は、偶然水準を上回って識別できる 6 つのモデルのうち、1 つの開発者の世代だけが報告基準の軸に沿って移動していることを示しています。アライメントが強化されると、モデルが見逃す欠陥は少なくなりますが、クリーンなドキュメントではより多くの誤報が発生します。これは 1 つの基準シフトの 2 つの側面であり、その開発者内では世代とともに拡大し (p < 0.001)、それ以外ではほとんど見られません。最下層では、見逃された欠陥がモデルの視野から外れているわけではないことがよくあります。モデルの内部記録は構造的欠陥を正確に再構築しますが、統合レポートはその健全性を承認してしまい、懸念は成果物や不在の協力者に向けられます。これは定量化に抵抗します。自動判定は不安定で (適合率 17 ～ 50%)、キーワードでは通常の合意から切り離すことができません。私たちはこの抵抗自体を発見として報告します。すべての実行、プローブ、欠陥キー、スコアラープロンプト、およびスクリプトをリリースします。統合レポートの確信度は、パーティションにまたがる欠陥について何の情報も与えません。最もアライメントされたシステムが最も安全であるわけではなく、崖は構造的なものです。

原文 (English)

A Universal Cliff and a Design Fingerprint: Cross-Section Defect Detection Under LLM Orchestration

Production language-model systems answer a request by partitioning it across an invisible orchestration of worker agents that recompose one integrated report. We ask what this does to a class of defect no single worker can see: a contradiction in the relation between two distant sections of a document. Holding the documents, defects, mechanism, scoring, and seed fixed, we vary only the model -- ten systems across five generations from one developer and five providers from distinct alignment paradigms. Two layers separate. First, a universal detection cliff: every model that finds these cross-section defects under a single agent loses that ability under orchestration, detection falling two-thirds or more across every paradigm tested. The cliff is mechanism-derived and not closed by scale or extended reasoning. Second, how models behave once fallen. A signal-detection decomposition shows that, among the six models discriminating above chance, only one developer's generations move along the reporting-criterion axis: as alignment is strengthened, the model misses fewer defects yet raises more false alarms on clean documents -- two faces of one criterion shift, scaling with generation within that developer (p < 0.001) and near-absent elsewhere. At the floor the missed defect is often not out of view: the model's private record reconstructs the structural fault accurately, while the integrated report signs off on its soundness, its concern spent on the artifact and an absent collaborator. This resists quantification -- an automated judge is unstable (precision 17-50%) and keywords cannot separate it from ordinary agreement -- a resistance we report as a finding. We release all runs, probes, defect keys, scorer prompts, and scripts. An integrated report's confidence is uninformative about partition-spanning defects, the most aligned systems are not the safest, and the cliff is structural.
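
The "signal-detection decomposition" mentioned above separates how well a model discriminates defective from clean documents (sensitivity, d') from how readily it reports a defect at all (criterion, c). The abstract does not say which variant the authors use, so the sketch below assumes the standard equal-variance Gaussian model; the function name `sdt_decompose` and the example rates are illustrative.

```python
from statistics import NormalDist


def sdt_decompose(hit_rate, false_alarm_rate):
    """Return (d_prime, criterion) under the equal-variance Gaussian model.

    d' = z(H) - z(F) measures discrimination; c = -(z(H) + z(F)) / 2
    measures reporting bias (negative c means a more liberal reporter).
    Rates must lie strictly between 0 and 1; rates of exactly 0 or 1
    need a correction that is not shown here.
    """
    z = NormalDist().inv_cdf
    z_hit, z_fa = z(hit_rate), z(false_alarm_rate)
    return z_hit - z_fa, -0.5 * (z_hit + z_fa)


if __name__ == "__main__":
    # Hypothetical rates: the second model misses fewer defects but also
    # raises more false alarms on clean documents.
    print(sdt_decompose(0.80, 0.20))
    print(sdt_decompose(0.90, 0.40))
```

A pattern of fewer misses together with more false alarms shows up as a drop in c rather than a rise in d', which is what the abstract describes as two faces of one criterion shift.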

2026-05-27 13:00 JSTarXiv cs.AIエージェント研究/論文

RepoMirage: 摂動を伴うコードエージェントでのリポジトリコンテキスト推論の調査

現在、コードエージェントはリポジトリレベルのソフトウェアエンジニアリングベンチマークで優れたパフォーマンスを示していますが、問題解決などのエンドツーエンドタスクの成功が、リポジトリコンテキスト推論、つまり複数のファイルにわたるタスク関連情報を特定し、ファイル間の関係を推論する能力を本当に反映しているかどうかは依然として不明です。この疑問を調査するために、SWE-Bench Verified 上に構築された 2 段階の評価スイートである RepoMirage を紹介します。これは、リポジトリの公開方法を変えることでコンテキスト推論の需要を高める診断ツールとして摂動を採用しています。まず、RepoMirage-Perturb は、セマンティクスを保持する 3 種類のリポジトリレベルの摂動を適用します。これにより、正しい解決に広範なコンテキストアクセスが必要な場合、明らかなパフォーマンスの低下が明らかになります。 RepoMirage-Extend はさらに、摂動を対象とした構造的ボトルネックを問題解決を超えた明示的なタスクに変換します。平均パフォーマンスは元の設定の 66.8% から 25.3% に低下し、リポジトリコンテキストの推論に重大な欠陥があることを示しています。さらに軌跡を分析すると、エージェントがより広範なリポジトリコンテキストにアクセスするものの、それを有効な構造情報に変換できない探査ドリフトが明らかになります。この観察に動機付けられて、私たちはリポジトリの探索を下流の問題解決から分離する構造優先のプロトタイプワークフローである RepoAnchor を提案し、明示的な構造足場が顕著な利益をもたらすことを示します。これらの結果は、コードエージェントのリポジトリコンテキスト推論におけるこれまで見落とされていたギャップを明らかにし、より強力な構造認識手法がそれらを改善する可能性があることを示唆しています。

原文 (English)

RepoMirage: Probing Repository Context Reasoning in Code Agents with Perturbations

Code agents currently perform well on repository-level software engineering benchmarks, but it remains unclear whether success on end-to-end tasks such as issue resolution truly reflects repository context reasoning, the ability to identify task-relevant information across multiple files and reason over the relations among them. To investigate this question, we introduce RepoMirage, a two-stage evaluation suite built on SWE-Bench Verified that adopts perturbation as a diagnostic tool to increase the demand for context reasoning by transforming how the repository is exposed. First, RepoMirage-Perturb applies three types of semantics-preserving repository-level perturbations, revealing a clear performance drop when correct solving requires broader context access. RepoMirage-Extend further turns perturbation-targeted structural bottlenecks into explicit tasks beyond issue resolution, where the average performance declines from 66.8% in the original setting to 25.3%, indicating a significant deficiency in repository context reasoning. Further trajectory analysis reveals an exploration drift, where agents access broader repository context but fail to turn it into effective structural information. Motivated by this observation, we propose RepoAnchor, a structure-first prototype workflow that separates repository exploration from downstream problem solving, and show that explicit structural scaffolding yields notable gains. These results uncover a previously overlooked gap in repository context reasoning for code agents and suggest that stronger structure-aware methods have the potential to close it.

2026-05-27 13:00 JSTarXiv cs.AIエージェント

AutoDFT: 自律的な DFT 計算のための閉ループマルチエージェントフレームワーク

密度汎関数理論 (DFT) は、材料科学や化学における計算による発見の基礎として機能しますが、各計算には多大な人間の労力が必要です。収束が行き詰まった場合はアルゴリズムを調整し、予期せぬ物理が出現した場合は計画を修正し、中間結果として問題を再構成するためのステップを挿入します。既存の LLM ベースのエージェントは、初期計画段階のみを自動化し、事前に完全な実行計画を作成し、その後の適応はすべて手作りのルールに任せます。その結果、これらのワークフローは脆弱なままであり、事前に計画されたシナリオを超えて一般化することができず、障害や予期せぬ中間結果により計算パスの変更が必要になった場合には、多くの場合専門家の介入が必要になります。ここでは、DFT ライフサイクルのすべての段階に LLM 推論を組み込む閉ループマルチエージェントフレームワークである AutoDFT を紹介します。AutoDFT では、戦略プランナーがステップ目標の骨格計画を作成します。ステッププランナーは、前の結果からジャストインタイムで数値パラメータを生成します。そして、監視、回復、反映のサイクルにより、障害が診断され、修復され、証拠によって正当化される場合には計画が修正されます。広さと深さの両方を実証します。VASPBench では、34 のタスクと 9 つの DFT 計算タイプにわたる専用ベンチマークであり、AutoDFT は GPT-5.2 で 94.1% のタスクレベルの成功を達成しています。 AutoDFT では、電子的、磁気的、エネルギー的特性にわたる定量的に信頼性の高い特性予測が生成されます。 AutoDFT を使用すると、計画と実行の間のループを閉じることで、計算に関する深い専門知識を持たない実験者でも信頼できる第一原理結果を得ることができます。

原文 (English)

AutoDFT: A Closed-Loop Multi-Agent Framework for Autonomous DFT Calculations

Density functional theory (DFT) serves as the basis for computational discovery in materials science and chemistry, yet each calculation demands extensive human effort: adjusting algorithms when convergence stalls, revising plans when unexpected physics emerges, and inserting steps as intermediate results reshape the problem. Existing LLM-based agents automate only the initial planning stage, producing a full execution plan upfront and leaving all subsequent adaptation to hand-crafted rules. As a result, these workflows remain fragile, do not generalize well beyond pre-planned scenarios, and often require expert intervention when failures or unexpected intermediate results require changes to the calculation path. Here, we introduce AutoDFT, a closed-loop multi-agent framework that embeds LLM reasoning into every stage of the DFT lifecycle, where a strategic planner produces a skeletal plan of step objectives; a step planner generates numerical parameters just in time from preceding results; and a monitor-recover-reflect cycle diagnoses failures, repairs them, and revises the plan when the evidence justifies it. We demonstrate both breadth and depth: breadth on VASPBench, a purpose-built benchmark spanning 34 tasks and 9 DFT calculation types, where AutoDFT achieves 94.1% task-level success with GPT-5.2; and depth on established materials databases, where AutoDFT produces quantitatively reliable property predictions across electronic, magnetic, and energetic properties. By closing the loop between planning and execution, AutoDFT enables experimentalists without deep computational expertise to obtain reliable first-principles results.

2026-05-27 13:00 JSTarXiv cs.AILLM/生成AIエージェント

SetupX: LLM エージェントは、機能的に正しいコードリポジトリのセットアップにおける過去の失敗から学ぶことができますか?

機能的に正しいリポジトリのセットアップは、リポジトリの文書化された機能を正常に実行できるように実行環境 (依存関係、ビルドスクリプトなど) を構成することを目的としています。依存関係の非互換性、ツールチェーンの欠落、インストールの不完全、検証戦略の不一致など、リポジトリ固有のさまざまな障害により、重大な課題が発生します。既存の LLM エージェントは、これらの問題を確実に解決するのに苦労しており、具体的には、(1) リポジトリ間でのエクスペリエンスの転送、(2) 不可逆的な状態変化下での複数ステップの試行と修復、(3) セットアップに起因する障害とリポジトリのバグを区別するためのセットアップ結果の堅牢な検証をサポートできていません。これに対処するために、経験的な学習ベースのセットアップフレームワークである SetupX を導入します。まず、自己進化エクスペリエンス表現 (XPU) を構築します。これは、検証済みの環境修正を目に見えないリポジトリに動的に転送するためのセットアップ信号、テキストガイダンス、実行可能なアクションをエンコードするデュアルモダリティの知識単位です。 2 番目に、LIFO Docker スナップショットスタックを活用したエクスペリエンス拡張投機的実行を採用し、エージェントがプロアクティブに修正を試し、既知の正常な状態に安全にロールバックできるようにします。第三に、証拠収集と最終判決を分離する検察官と裁判官の検証プロトコルを導入し、表面的な構築時間の指標を超えて、より信頼性の高い設定検証を可能にします。慎重に作成されたベンチマークの評価結果では、SetupX が最高のパフォーマンス (例: 92% の合格率) を達成し、最も強力なベースラインを 19% 以上上回るパフォーマンスを示しています。重要なのは、SetupX が、異なるコンテナ間で相互接続された複数のサービスの調整を必要とする複雑なマルチリポジトリのセットアップに優れていることです。コードリポジトリは https://github.com/OpenDataBox/SetupX で入手できます。

原文 (English)

SetupX: Can LLM Agents Learn from Past Failures in Functionality-Correct Code Repository Setup?

Functionality-correct repository setup aims to configure execution environments (e.g., dependencies, build scripts) to successfully execute a repository's documented features. It presents significant challenges due to diverse, repository-specific failures, including dependency incompatibilities, missing toolchains, incomplete installations, and verification-strategy mismatches. Existing LLM agents struggle to robustly resolve these issues, specifically failing to support (1) cross-repository experience transfer, (2) multi-step trial-and-repair under non-invertible state changes, and (3) robust verification of setup outcomes to distinguish setup-induced failures from repository bugs. To address this, we introduce SetupX, an experiential learning-based setup framework. First, we construct a Self-Evolving Experience Representation (XPU), a dual-modality knowledge unit encoding setup signals, textual guidance, and executable actions to dynamically transfer verified environment fixes to unseen repositories. Second, we employ Experience-Augmented Speculative Execution backed by a LIFO Docker snapshot stack, enabling the agent to proactively trial fixes and safely roll back to known-good states. Third, we introduce a Prosecutor-Judge Verification Protocol that separates evidence collection from final judgment, enabling more reliable setup verification beyond superficial build-time metrics. Evaluation results on carefully crafted benchmarks show that SetupX achieves the highest performance (e.g., 92% pass rate) and outperforms the strongest baseline by over 19%. Crucially, SetupX excels in complex multi-repository setups that require coordinating multiple interconnected services across different containers. The code repository is available at https://github.com/OpenDataBox/SetupX.
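
The speculative-execution idea, trying a fix and rolling back to a known-good state if verification fails, can be sketched with an in-memory stand-in for the snapshot stack. This is not SetupX's code and makes no Docker calls: `SnapshotStack`, `try_fix`, and the dictionary-based environment state are hypothetical names introduced only to show the last-in, first-out discipline.

```python
import copy


class SnapshotStack:
    """In-memory stand-in for a LIFO stack of environment snapshots."""

    def __init__(self, state):
        self.state = state
        self._snapshots = []

    def snapshot(self):
        # Save a deep copy so later mutations cannot alter the snapshot.
        self._snapshots.append(copy.deepcopy(self.state))

    def rollback(self):
        # Restore the most recent snapshot (last in, first out).
        self.state = self._snapshots.pop()


def try_fix(env, fix, verify):
    """Apply fix speculatively; keep it only if verify accepts the result."""
    env.snapshot()
    fix(env.state)
    if verify(env.state):
        return True   # snapshot stays on the stack as rollback history
    env.rollback()
    return False


if __name__ == "__main__":
    env = SnapshotStack({"packages": {"numpy": "1.26"}})

    def risky_upgrade(state):
        state["packages"]["numpy"] = "2.0"
        state["broken"] = True  # pretend the upgrade breaks the build

    ok = try_fix(env, risky_upgrade, lambda s: not s.get("broken", False))
    print(ok, env.state)
```

Snapshotting before each trial is what makes multi-step repair safe when a change cannot be undone by an inverse command; a real system would replace the deep copy with a container image snapshot.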

2026-05-27 13:00 JSTarXiv cs.AILLM/生成AIエージェント

CyberEvolver: サイバーセキュリティエージェントのオンザフライのための構造化された自己進化

LLM ベースのエージェントはサイバーセキュリティタスクに使用されることが増えていますが、既存のシステムのほとんどは人間が設計した固定足場に依存しており、多様なターゲットや障害モードに適応するのに苦労しています。 CyberEvolver は、失敗した実行試行の経験に基づいて独自の足場を繰り返し修正する、自己進化するサイバーセキュリティエージェントフレームワークです。サイバーセキュリティにおける自己進化は、可能性のある足場変更の空間がほとんど構造化されておらず、実行フィードバックがまばらで環境によって隠蔽されることが多く、多様性の低い更新により繰り返しの繰り返しでエラーが悪化する可能性があるため、困難です。 CyberEvolver は、足場の最適化を構造化コンポーネントに分解する 4 層の進化可能なエージェントアーキテクチャ、ノイズの多い実行ログを実用的なリビジョン信号に変換するトレースから診断のメカニズム、および進化中に多様なエージェントのバリアントを保存する母集団ベースのビーム検索戦略でこれらの課題に対処します。私たちは、4 つのオープンソース LLM を使用して、CTF の課題、脆弱性の悪用、侵入テストのタスクに関して CyberEvolver を評価します。これらの設定全体で、CyberEvolver はシードエージェントの成功率を平均 13.6% 向上させ、人間が設計した 6 つのサイバーセキュリティエージェントや、他のドメインから採用された 2 つの自己改善手法を上回ります。これらの結果は、スキャフォールドの自己進化が、セキュリティテスト用の適応 LLM エージェントを構築するための有望な方向性であることを示唆しています。

原文 (English)

CyberEvolver: Structured Self-Evolution for Cybersecurity Agents On the Fly

LLM-based agents are increasingly used for cybersecurity tasks, but most existing systems rely on fixed, human-designed scaffolds that struggle to adapt across diverse targets and failure modes. We introduce CyberEvolver, a self-evolving cybersecurity agent framework that iteratively revises its own scaffold based on experience from failed execution attempts. Self-evolution in cybersecurity is challenging because the space of possible scaffold changes is largely unstructured, execution feedback is sparse and often obscured by the environment, and low-diversity updates can cause errors to compound over repeated iterations. CyberEvolver addresses these challenges with a four-layer evolvable agent architecture that decomposes scaffold optimization into structured components, a trace-to-diagnosis mechanism that converts noisy execution logs into actionable revision signals, and a population-based beam search strategy that preserves diverse agent variants during evolution. We evaluate CyberEvolver on CTF challenges, vulnerability exploitation, and penetration-testing tasks using four open-source LLMs. Across these settings, CyberEvolver improves the seed agent's success rate by 13.6% on average, and outperforms six human-designed cybersecurity agents as well as two self-improvement methods adapted from other domains. These results suggest that scaffold self-evolution is a promising direction for building adaptive LLM agents for security testing.

2026-05-27 13:00 JSTarXiv cs.AIエージェント

AgentSociety: Incentivizing Agentic Social Intelligence

The success of deployed agents relies on their ability to handle open-ended user requests using their inherent capabilities, not only in so…

2026-05-27 13:00 JSTarXiv cs.AIエージェント

Decoupled Delay Compensation: Enhancing Pre-trained MARL Policies via Learned Dynamics Filtering

Real-world multi-agent reinforcement learning (MARL) systems must often operate under stale observations, stochastic communication delays,…

2026-05-27 13:00 JSTarXiv cs.AI画像/動画生成エージェント

E$^3$C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control

Controllable and physically grounded egocentric video generation is essential for embodied agents to reason about how their own and others'…

2026-05-27 13:00 JSTarXiv cs.AIエージェントハードウェア/半導体

Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental chal…

2026-05-27 13:00 JSTarXiv cs.AIエージェント

Foundations of a Time-Consistent Counterfactual Actuarial Runtime for Autonomous AI Agents

We propose a foundational runtime actuarial layer for autonomous AI agents in which every side-effect-bearing action carries a time-consist…

2026-05-27 13:00 JSTarXiv cs.AIエージェント

ChainCaps: Composition-Safe Tool-Using Agents via Monotonic Capability Attenuation

Tool-using agents increasingly operate in open-ended deployment environments, where they compose file systems, web APIs, code interpreters,…

2026-05-27 13:00 JSTarXiv cs.AIエージェント

Examining the Challenges of Intellectual Property in AI-Generated Productions

With the advancement of artificial intelligence systems capable of autonomously generating artistic, literary, musical works, and even inve…

2026-05-27 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning

Group-based reinforcement learning (RL) methods have achieved remarkable success in improving the performance of large language models (LLM…

2026-05-27 13:00 JSTarXiv cs.AILLM/生成AIエージェント

EmoDistill: Offline Emotion Skill Distillation for Language Model Agents in Adversarial Negotiation

Post-trained LLMs are often optimized to align responses with human preferences, making them safe, polite, and conversationally appropriate…

2026-05-27 13:00 JSTarXiv cs.AIエージェント研究/論文

Persistent AI Agents in Academic Research: A Single-Investigator Implementation Case Study

Background: Large language models are typically evaluated as models, benchmarks, or short conversational episodes. Less is known about what…

2026-05-27 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Knowledge Graphs as the Missing Data Layer for LLM-Based Industrial Asset Operations

LLM-based agents for industrial asset operations show limited accuracy when reasoning over flat document stores. AssetOpsBench (KDD 2026) e…

2026-05-27 13:00 JSTarXiv cs.AIエージェント

Lessons from Penetration Tests on Large-Scale Agent Systems

As AI systems gain increasing autonomy and execution capability, the number of discovered security vulnerabilities continues to rise. Howev…

2026-05-27 13:00 JSTarXiv cs.AIエージェント

QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language…

2026-05-27 13:00 JSTarXiv cs.AIエージェントハードウェア/半導体

Governed Evolution of Agent Runtimes through Executable Operational Cognition

Recent advances in agentic systems increasingly treat code as an executable operational substrate rather than as a disposable output artifa…

2026-05-27 13:00 JSTarXiv cs.AIエージェント研究/論文

GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesis, Research, and Testing

Cellular research and development (R&D) is throttled by six structural processes that each consume months of manual engineering work per it…

2026-05-27 13:00 JSTarXiv cs.AIエージェント

Multi-Agent Causal Discovery Using Large Language Models

Causal discovery aims to identify causal relationships between variables and is a fundamental problem across the sciences. Traditional stat…

2026-05-27 13:00 JSTarXiv cs.AILLM/生成AIエージェント

LiPUP-MA: A Residential Experience-centric Multi-Agent Framework for Living-in-the-loop Participatory Urban Planning

Participatory Urban Planning (PUP) is increasingly supported by LLM-based agents, yet existing methods largely rely on static preference el…

2026-05-27 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction

LLM-based agents solve complex tasks through iterative reasoning, tool use, and environment interaction, where each intermediate thought di…

2026-05-27 13:00 JSTarXiv cs.AILLM/生成AIエージェント研究/論文

EvoEmo: Towards Evolved Emotional Policies for Adversarial LLM Agents in Multi-Turn Price Negotiation

Recent research on Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) has demonstrated that agents can engage in \textit{comp…

2026-05-27 13:00 JSTarXiv cs.AILLM/生成AIエージェント

XGrammar-2: Efficient Dynamic Structured Generation Engine for Agentic LLMs

Modern LLM agents increasingly rely on dynamic structured generation, such as tool calling and response protocols. Unlike traditional struc…

2026-05-27 13:00 JSTarXiv cs.AILLM/生成AIエージェント研究/論文

TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents

Recent breakthroughs in Large Language Models (LLMs) have positioned them as a promising paradigm for agents, with long-term planning and d…

2026-05-27 13:00 JSTarXiv cs.AIエージェント研究/論文

Drive-P2D: A Progressive Perception-to-Decision Benchmark for VLMs in Autonomous Driving

Autonomous driving requires reliable perception and safe decision-making in complex scenarios. Recent vision-language models (VLMs) demonst…

2026-05-27 13:00 JSTarXiv cs.AIエージェント

AI Agent for Reverse-Engineering Legacy Finite-Difference Code and Translating to Devito

To facilitate the transformation of legacy finite difference implementations into the Devito environment, this study develops an integrated…

2026-05-27 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic

Recent work has explored optimizing LLM collaboration through Multi-Agent Reinforcement Learning (MARL). However, most MARL fine-tuning app…

2026-05-27 13:00 JSTarXiv cs.AILLM/生成AIエージェントビジネス/資金調達

The Necessity of a Unified Framework for LLM-Based Agent Evaluation

With the advent of Large Language Models (LLMs), general-purpose agents have seen fundamental advancements. However, evaluating these agent…

2026-05-27 13:00 JSTarXiv cs.AILLM/生成AIエージェント

DIANOIA: Diagnostic Decomposition and Joint Optimization for Multi-Agent Reasoning

Multi-agent LLM systems consistently outperform single-agent baselines, yet practitioners still cannot predict which design works for a new…

2026-05-27 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Counterfactual Credit Policy Optimization for Multi-Agent Collaboration

Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles, but reinforcement learning f…

2026-05-27 13:00 JSTarXiv cs.AIエージェント

Modernising Reinforcement Learning-Based Navigation for Embodied Semantic Scene Graph Generation

Semantic world models enable embodied agents to reason about objects, relations, and spatial context beyond purely geometric representation…

2026-05-27 13:00 JSTarXiv cs.AIエージェント

Communication Gain and Delay Cost Under Cross-Timestep Delays in Cooperative Multi-Agent Reinforcement Learning

Communication is essential for coordination in \emph{cooperative} multi-agent reinforcement learning under partial observability, yet \emph…

2026-05-27 13:00 JSTarXiv cs.AIエージェントロボティクス

Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling

World models have recently re-emerged as a central paradigm for embodied intelligence, robotics, autonomous driving, and model-based reinfo…

2026-05-27 13:00 JSTarXiv cs.AIエージェント研究/論文

SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning

Frontier scientific reasoning is rapidly emerging as a key foundation for advancing AI agents in automated scientific discovery. Deep resea…

2026-05-27 13:00 JSTarXiv cs.AIエージェント研究/論文

Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management

This paper studies autonomous generative AI agents in multi-echelon supply chains using the MIT Beer Game. We identify four inference-time…

2026-05-27 13:00 JSTarXiv cs.AIエージェント

Discoverable Agent Knowledge -- A Formal Framework for Agentic KG Affordances (Extended Version)

Two decades ago, the Semantic Web Services community was asked how agents with different ontological commitments could discover, compose, a…

2026-05-27 13:00 JSTarXiv cs.AILLM/生成AIエージェントビジネス/資金調達

AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

Large language model agents now act on codebases, browsers, operating systems, calendars, files, and tool ecosystems, but their evaluations…

2026-05-27 13:00 JSTarXiv cs.AILLM/生成AIエージェント研究/論文

Declarative Data Services: Structured Agentic Discovery for Composing Data Systems

Agentic discovery has shown that LLM-driven search can find novel algorithms, designs, and code under benchmark conditions. Translating the…

2026-05-27 13:00 JSTarXiv cs.AIエージェント

ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving

Safety-critical scenarios are central to evaluating autonomous driving systems, yet their rarity in naturalistic logs makes simulation-base…

2026-05-27 13:00 JSTarXiv cs.AIエージェント

Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning

Post-training has become the dominant recipe for turning a language model into a competent search-augmented reasoning agent. A line of rece…

2026-05-27 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows

Large Language Models (LLMs) are increasingly deployed as autonomous agents that reason, use tools, and act over multiple steps. Yet most hallucination benchmarks still evaluate only the final output, missing failures that originate in intermediate Thought-Action-Observation steps. We present Trajel, a dataset and evaluation framework for auditing trajectory-level hallucinations in multi-agent industrial workflows. Trajel introduces a five-type hallucination taxonomy (factual, referential, logical, procedural, and scope-based) over expert-annotated agent traces from AssetOpsBench. We benchmark supervised detection models at the subtask, trajectory, and long-context levels. Our results show that the most common failure modes are missed by existing benchmarks, that nearly half of hallucinated trajectories involve multiple types at once, and that automated detectors with high binary accuracy still misclassify the subtlest types. Trajectory-aware detection significantly outperforms standard post-hoc verification, making taxonomy-grounded evaluation necessary for safer agentic deployment.

2026-05-27 13:00 JST | arXiv cs.AI | Agents

PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist model stacks. This raises a central question: can a web agent become more efficient as it accumulates experience, rather than more expensive? We first analyze trajectories from VisualWebArena and identify three recurring sources of inefficiency: repeat-action loops, hidden discovery costs, and low prompt-cache reuse. We then introduce PANDO, a single-rollout online skill-distillation framework that maintains a structured Skill Library and combines progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. On the full set of 910 VisualWebArena tasks, PANDO achieves a 58.3% success rate, outperforming SGV (54.0%) and our WALT reproduction (45.2%), while using 58% fewer tokens than SGV and 61% fewer tokens than WALT, without any pre-evaluation discovery budget. A 300-task ablation further shows that rules and routines provide most of the success gains, while routing, compression, and cache-aware prompting convert the larger skill library into lower marginal token cost. Finally, we introduce three trajectory-level efficiency metrics -- Action Repetition Rate, Step Overhead Ratio, and Prompt Cache Utilization -- to make efficiency visible beyond terminal success.

2026-05-27 13:00 JST | arXiv cs.AI | Agents

Behind EvoMap: Characterizing a Self-Evolving Agent-to-Agent Collaboration Network

Agent-to-Agent (A2A) networks enable autonomous AI agents to collaborate by sharing reusable problem-solving instructions. However, how these decentralized ecosystems operate in practice remains largely unexplored. We present the first large-scale empirical study of EvoMap, a prominent A2A collaboration network. By analyzing over 1.5M assets and 128K agents, we show how design choices that prioritize scalable growth introduce trade-offs in reusability, evolution, and auditability. First, EvoMap's credit economy rewards agents for publishing valuable assets. Although this design encourages participation at scale, rewards are tied primarily to publication rather than adoption. This leads agents to mass-produce assets to accumulate credits. As a result, 98% of assets are never reused, while rewards become highly concentrated among a small fraction of agents. Second, EvoMap employs an algorithm (referred to as GDI) to score and rank the quality of these shared assets. We demonstrate that this scoring system is flawed: rather than measuring objective performance, an asset's rank is heavily dictated by unverified, self-reported metadata (e.g., claimed lines of code modified). This allows agents to trivially manipulate their asset's scores. Finally, EvoMap relies on agents to provide local execution logs as evidence that uploaded assets function correctly. Because these validations are not independently verified, over 84% of approved assets bypass quality checks using vacuous tests (e.g., `console.log`). Our findings show that future A2A collaboration networks cannot rely on unverified self-reporting alone. Scalable collaboration requires mechanisms that balance open participation with verifiable execution and trustworthy evaluation.

2026-05-27 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Retrying vs Resampling in AI Control

AI coding scaffolds like Claude Code and Codex use retrying: blocking actions flagged as risky and continuing the trajectory. We study retrying from an AI control perspective, which treats the model as potentially adversarial. We find that while retrying reduces honest suspicion scores, the untrusted model can exploit monitor rationale to construct sneakier attacks, negating safety gains. We also study resampling: drawing multiple samples from the same context, which does not leak exploitable information. We disentangle design choices that previous work on resampling had bundled together. In BashArena, with Claude Opus 4.6 as the untrusted model and MiMo-V2-Flash as the trusted monitor, drawing five samples per step and auditing on the maximum suspicion score raises safety from 61% to 71% at a 0.3% audit budget, at no cost to usefulness. Selectively resampling only the steps that look suspicious on the first draw recovers 6.2 percentage points of the gain while drawing only 10% as many extra samples. Two of our findings in this setting contradict earlier work on resampling. The first is that auditing based on the maximum across resampled suspicion scores outperforms using the minimum, which is the opposite of what Ctrl-Z found. The second is that executing the least suspicious sample, which is the central mechanism in earlier defer-to-resample protocols, gives only a small empirical safety gain in our setting (+3.9 pp, with the confidence interval overlapping zero).
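
As a rough illustration of the resampling protocol the abstract describes, the Python sketch below draws several candidate actions from the same context, scores each with a monitor, and reports the maximum suspicion score for the audit decision. The names `step_suspicion`, `propose`, `monitor`, and `resample_threshold` are hypothetical and not taken from the paper; the optional threshold imitates selective resampling of steps that look suspicious on the first draw.

```python
from typing import Callable, Optional


def step_suspicion(
    propose: Callable[[], str],
    monitor: Callable[[str], float],
    n_samples: int = 5,
    resample_threshold: Optional[float] = None,
) -> float:
    """Return the suspicion score used for the audit decision at one step.

    propose: draws one candidate action from the (fixed) current context.
    monitor: maps an action to a suspicion score (higher = more suspicious).
    resample_threshold: if set, extra samples are drawn only when the first
        draw scores at or above this value (selective resampling).
    """
    first_score = monitor(propose())
    if resample_threshold is not None and first_score < resample_threshold:
        # First draw looks benign: skip the extra samples.
        return first_score
    scores = [first_score]
    for _ in range(n_samples - 1):
        scores.append(monitor(propose()))
    # Audit on the maximum across resampled suspicion scores.
    return max(scores)
```

In such a setup the returned score would be compared with an audit threshold calibrated to the audit budget (0.3% in the abstract). The sketch deliberately leaves open which sample is executed, since the abstract reports that executing the least suspicious sample adds only a small gain.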

2026-05-27 13:00 JST | arXiv cs.AI | Agents

Real-Time Progress Prediction in Reasoning Language Models

Recent reasoning language models, particularly those that employ long latent chains of thought, achieve strong performance on complex agent…

2026-05-27 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Securing Multi-Agent Systems Against Corruptions via Node Contribution Backpropagation

Multi-Agent Systems (MAS) have become a prevalent paradigm for Large Language Model (LLM) applications. However, the complex multi-agent de…

2026-05-27 13:00 JST | arXiv cs.AI | Agents

SEAL: Self-Evolving Agentic Learning for Conversational Question Answering over Knowledge Graphs

Knowledge-based conversational question answering (KBCQA) confronts persistent challenges in resolving coreference, modeling contextual dep…

2026-05-27 13:00 JST | arXiv cs.AI | Agents | Business/Funding

TABX: A High-Throughput Sandbox Battle Simulator for Multi-Agent Reinforcement Learning

The design of environments plays a critical role in shaping the development and evaluation of cooperative multi-agent reinforcement learning (MARL) algorithms. While existing benchmarks highlight critical challenges, they often lack the modularity required to design custom evaluation scenarios. We introduce the Totally Accelerated Battle Simulator in JAX (TABX), a high-throughput sandbox designed for reconfigurable multi-agent tasks. TABX provides granular control over environmental parameters, permitting a systematic investigation into emergent agent behaviors and algorithmic trade-offs across a diverse spectrum of task complexities. Leveraging JAX for hardware-accelerated execution on GPUs, TABX enables massive parallelization and significantly reduces computational overhead. By providing a fast, extensible, and easily customized framework, TABX facilitates the study of MARL agents in complex structured domains and serves as a scalable foundation for future research. Our code is available at: https://github.com/ku-dmlab/TABX.

2026-05-27 13:00 JST | arXiv cs.AI | Agents | Research/Papers

Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History

Large language models have advanced web agents, yet current agents lack personalization capabilities. Since users rarely specify every deta…

2026-05-27 13:00 JST | arXiv cs.AI | Agents

GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations:…

2026-05-27 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

MedCollab: IBIS-Guided Multi-Agent Collaboration with Hierarchical Disease Relation Chains for Clinical Diagnosis

Large language models (LLMs) have shown promise in clinical diagnosis but remain limited by unreliable report generation, weak evidence gro…

2026-05-27 13:00 JST | arXiv cs.AI | Agents

Grammar of the Wave: Towards Explainable Multivariate Time Series Event Detection via Neuro-Symbolic VLM Agents

Time Series Event Detection (TSED) aims to localize semantically meaningful events in time series data, with critical applications in high-…

2026-05-27 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

APEX-Searcher: Refining Credit Assignment with Subgoaling for Agentic Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) connects large language models (LLMs) to external knowledge, but single-round retrieval is often insuf…

2026-05-27 13:00 JST | arXiv cs.AI | Agents

SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills

OpenClaw's ClawHub marketplace hosts tens of thousands of community-contributed agent skills (49,592 in our 2026-04-04 snapshot), and recen…

2026-05-27 13:00 JST | arXiv cs.AI | Agents

Strategic Persuasion with Trait-Conditioned Multi-Agent Systems for Iterative Legal Argumentation

Strategic interaction in adversarial domains such as law, diplomacy, and negotiation is mediated by language, yet most game-theoretic model…

2026-05-27 13:00 JST | arXiv cs.AI | Agents

Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study

Software systems built from versioned AI components increasingly need lifecycle-time governance: when a capability module evolves into a ne…

2026-05-27 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Governed Metaprogramming for Intelligent Systems: Reclassifying Eval as a Governed Effect

AI systems increasingly synthesize executable structure at runtime: LLMs generate programs, agents construct workflows, self-improving syste…

2026-05-27 13:00 JST | arXiv cs.AI | Agents

Tool Calling is Linearly Readable and Steerable in Language Models

When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. As…

2026-05-27 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Research/Papers

ProcCtrlBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents

Existing benchmarks for LLM coding agents primarily evaluate final outcomes. While useful for measuring overall capability, these metrics p…

2026-05-27 13:00 JST | arXiv cs.AI | Agents

Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty

Simulation-based testing of self-driving cars (SDCs) typically relies on scripted pedestrian models that do not capture the heterogeneity a…

2026-05-27 13:00 JST | arXiv cs.AI | Agents

Not All Transitions Matter: Evidence from PPO

Training a reinforcement learning agent on-policy means collecting fresh experience at every update, and that experience comes with a hidden problem. Each state in a rollout is the direct output of the previous one, causally chained together by the agent's own actions. Because of this, consecutive transitions are never truly independent. They carry overlapping information, and the gradient signal the network receives ends up far more repetitive than the batch size suggests. The same directions get reinforced over and over, the value network struggles to keep up as the policy shifts, and training becomes quietly unstable in ways that reward curves alone rarely reveal. This paper asks whether that redundancy can simply be removed. We show that randomly dropping a fixed fraction of transitions from the rollout, at the right stage so the reward signal stays intact, is enough to break the repetitive gradient structure and stabilize training. The change is minimal: one sampling step, no new components, no modification to the core algorithm, and it works with any PPO implementation. Across five environments of increasing difficulty, CartPole-v1, Acrobot-v1, LunarLander-v2, HalfCheetah-v5, and Hopper-v5, the method matches vanilla PPO on reward while producing more consistent training dynamics across KL divergence, policy entropy, and value estimates. Dropping 25% of transitions turns out to be the sweet spot: enough to disrupt the redundancy, not enough to thin the batch.
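
The single sampling step described in the abstract could look roughly like the Python sketch below. It assumes, without confirmation from the paper, that "the right stage" means after returns and advantages have already been computed over the full rollout, so that dropping rows afterwards leaves the reward signal untouched. The function name `drop_transitions` and the dictionary layout of the batch are hypothetical.

```python
import random
from typing import Dict, List, Optional, Sequence


def drop_transitions(
    batch: Dict[str, Sequence],
    drop_fraction: float = 0.25,
    seed: Optional[int] = None,
) -> Dict[str, List]:
    """Randomly drop a fixed fraction of transitions from a rollout batch.

    batch: column-oriented rollout, e.g. {"obs": [...], "adv": [...]},
        where every column has one entry per transition. Advantages and
        returns are assumed to be computed BEFORE calling this function.
    drop_fraction: share of transitions to discard (0.25 in the abstract).
    """
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    n = len(next(iter(batch.values())))
    if any(len(column) != n for column in batch.values()):
        raise ValueError("all columns must have the same length")
    n_keep = n - int(round(n * drop_fraction))
    rng = random.Random(seed)
    # Sample kept indices without replacement; sort to preserve rollout order.
    keep = sorted(rng.sample(range(n), n_keep))
    # Apply the same index set to every column so rows stay aligned.
    return {name: [column[i] for i in keep] for name, column in batch.items()}
```

The thinned batch would then feed the usual PPO minibatch updates unchanged, which matches the abstract's claim that no modification to the core algorithm is needed.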

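2026-05-27 11:55 JST | ITmedia AI+ | Agents

Notion releases a new developer platform: from knowledge sharing to a "foundation for working with AI"

Notion Labs has announced Notion Developer Platform, a set of features for developers. It offers a CLI, the "Workers" execution environment, an API for integrating external agents, and more, giving developers a platform for building AI agents and business workflows on Notion.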

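2026-05-27 08:00 JST | ITmedia AI+ | Agents

Fujitsu develops AI agent technology that evolves as business operations change

Keeping AI agents in use for enterprise operations, where legal amendments and specification changes never stop, has required continuous tuning by specialists. How does Fujitsu's "self-evolving multi-AI agent technology" change that premise?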

2026-05-27 07:32 JST | TechCrunch AI | Agents

DuckDuckGo installs are up 30% as users reject being ‘force-fed’ Google’s AI Search

Google overhauled Search at I/O 2026, replacing blue links with AI agents. The backlash has been swift. DuckDuckGo app installs spiked 30%…

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Toward Reliable Design of LLM-Enabled Agentic Workflows: Optimizing Latency-Reliability-Cost Tradeoffs

Modern AI systems increasingly rely on workflows composed of multiple interacting agents, some powered by large language models (LLMs) and others by conventional computational modules. This paper analyzes the fundamental tradeoffs between latency, reliability, and cost in LLM-enabled agentic workflows. We introduce performance models for both LLM and non-LLM agents that capture the relationship between computational effort and output quality, incorporating the impact of reasoning and output tokens for LLM agents using a parametric exponential reliability function. Then, we study the design of sequential workflows under latency and cost constraints. Main results include a water-filling token allocation policy and characterizations of optimal workflow reliability in terms of shadow prices.
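
To make the idea of a water-filling token allocation concrete, here is a small Python sketch under strong assumptions that are not taken from the paper: each sequential stage i has reliability 1 - a_i * exp(-b_i * t_i) for t_i tokens, workflow reliability is the product of stage reliabilities, and only a total token budget constrains the design. Under those assumptions the optimum equalizes the marginal log-reliability gain across funded stages at a common shadow price, which is found by bisection. The function name `allocate_tokens` and the parameters `a`, `b` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def allocate_tokens(
    stages: Sequence[Tuple[float, float]],
    budget: float,
    iters: int = 200,
) -> Tuple[List[float], float]:
    """Split a token budget across stages; return (tokens, shadow price).

    stages: (a, b) pairs for the ASSUMED reliability 1 - a * exp(-b * t),
        with 0 < a < 1 and b > 0.
    """
    if any(not (0.0 < a < 1.0) or b <= 0.0 for a, b in stages):
        raise ValueError("need 0 < a < 1 and b > 0 for every stage")

    def tokens_at(price: float) -> List[float]:
        # Solve a*b / (exp(b*t) - a) = price for t, clipped at zero.
        return [max(0.0, math.log(a * (1.0 + b / price)) / b) for a, b in stages]

    # At this price no stage receives tokens (largest marginal gain at t = 0).
    hi = max(a * b / (1.0 - a) for a, b in stages)
    lo = hi
    while sum(tokens_at(lo)) < budget:
        lo /= 10.0  # a lower price buys more tokens
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisect on a log scale
        if sum(tokens_at(mid)) > budget:
            lo = mid
        else:
            hi = mid
    return tokens_at(hi), hi
```

Stages whose marginal gain at zero tokens is below the shadow price receive nothing, which is the "water level" behaviour; latency constraints, per-token costs, and non-LLM agents from the abstract are outside this sketch.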

2026-05-26 13:00 JST | arXiv cs.AI | Agents

Practical Quantum CIM Empowerment via All-Domestic-Core Agentic Large Model

Quantum computing devices are recognized as powerful tools for solving NP-complete problems. However, the intricacy of their modeling presents notable barriers for non-specialists, while the tedious iteration of constraint weights and modeling methodologies also consumes substantial effort on the part of experts. To address these challenges, this study integrates a femtosecond laser-pumped Coherent Ising Machine (CIM) with an LLM-driven agentic system by leveraging the LangGraph and LangChain frameworks. Comprehensive investigations demonstrate that large language models (LLMs) can effectively perform such tasks in modeling as QUBO/Ising model calibration, constraint weight decision iteration and rapid validation of literature-reported schemes. Notably, all these tasks can be fully implemented based on domestic large models, combined with domestically developed CIM hardware, we truly achieve the practical empowerment of quantum CIM that fully relies on all-domestic agentic large models and hardware. This work successfully realizes robust technological integration, laying a solid foundation for subsequent research. Nevertheless, it also identifies the persisting challenges in the two cutting-edge fields of large models and quantum computing at the current stage. Encouragingly, we unexpectedly discover a promising new paradigm where accumulated knowledge from agent-assisted quantum computing iterations reciprocally enhances the agent's own problem-solving capability, thereby addressing these challenges.

2026-05-26 13:00 JST | arXiv cs.AI | Agents

Operationalizing Reconstructive Authority: Runtime Construction, Dependency Resolution, and Execution Gating in Autonomous Agent Systems

Autonomous agent systems fail not only due to incorrect decisions, but due to executing decisions whose authority no longer holds at runtime. Prior work defined Reconstructive Authority (RAM) as a condition for valid execution: actions are permitted only if authority can be constructed from current state. This paper addresses enforcement at runtime: how to enforce this condition in a running system. We introduce a runtime execution model in which authority is evaluated at action time and execution is conditioned on its constructibility. This extends the execution state space beyond admit/deny with a third state, halt, representing cases where authority is undefined due to incomplete or uncertain observability. We define a concrete execution protocol including dynamic dependency resolution, authority reconstruction, and explicit decision semantics. We further introduce a Recovery Loop that integrates drift detection (IML) with execution control (ACP), allowing the system to suspend execution, acquire missing information, and re-attempt authority reconstruction. We show that this model guarantees safety -- no action is executed without constructible authority -- and conditional liveness: execution resumes when authority-defining variables become observable. This work operationalizes reconstructive authority as a runtime enforcement mechanism, providing the execution semantics required to apply RAM in real systems.
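
The three-state gate and recovery loop in the abstract can be pictured with the minimal Python sketch below. It is an illustration only: the names `Decision`, `gate`, `gate_with_recovery`, `observe`, `authorize`, and `acquire` are hypothetical, and treating an unobservable variable as `None` is an assumption rather than the paper's protocol.

```python
from enum import Enum
from typing import Callable, Dict, Iterable, Optional


class Decision(Enum):
    ADMIT = "admit"
    DENY = "deny"
    HALT = "halt"  # authority is undefined, not refused


def gate(
    required: Iterable[str],
    observe: Callable[[str], Optional[object]],
    authorize: Callable[[Dict[str, object]], bool],
) -> Decision:
    """Evaluate authority at action time from the currently observable state."""
    state: Dict[str, object] = {}
    for name in required:
        value = observe(name)
        if value is None:
            # A dependency cannot be observed, so authority cannot be constructed.
            return Decision.HALT
        state[name] = value
    return Decision.ADMIT if authorize(state) else Decision.DENY


def gate_with_recovery(
    required: Iterable[str],
    observe: Callable[[str], Optional[object]],
    authorize: Callable[[Dict[str, object]], bool],
    acquire: Callable[[], None],
    max_attempts: int = 3,
) -> Decision:
    """Suspend on HALT, try to acquire missing information, then re-evaluate."""
    required = list(required)
    for _ in range(max_attempts):
        decision = gate(required, observe, authorize)
        if decision is not Decision.HALT:
            return decision
        acquire()
    return Decision.HALT
```

The action runs only on `ADMIT`, which mirrors the safety claim; execution resumes only once every authority-defining variable becomes observable, which mirrors the conditional liveness claim.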

2026-05-26 13:00 JST | arXiv cs.AI | Agents

DRIVE: Modeling Skills at the Reasoning and Interaction Levels for Web Agents under Continual Learning

Web agents require both high-level reasoning (for task decomposition) and low-level interactions (for page elements manipulation) to conduct different tasks. However, these knowledge types differ fundamentally: reasoning knowledge (e.g., booking a flight requires first searching for routes) is abstract and transferable across websites, while interaction knowledge (e.g., clicking the Search button at a specific coordinate on Site A) depends heavily on page-specific contexts. Existing methods store experiences uniformly. This creates a dilemma: abstract representations lose executability on concrete pages, while concrete representations fail to generalize across domains. This entanglement limits capability accumulation: on new websites, agents either fail to recognize reusable task logic due to surface-level differences or attempt infeasible actions from outdated page structures. To disentangle them, we propose DRIVE, a dual-level skill modeling framework separating historical experience into natural language reasoning skills, which capture transferable task logic, and programmatic interaction skills, grounding abstract actions to executable operations. A scene-aware coordination mechanism adaptively retrieves and invokes these dual-level skills based on task semantics. DRIVE also uses skill-level reflection to identify hierarchy-specific failure modes, enabling targeted skill library expansion and refinement. Experiments across five WebArena domains show DRIVE attains an average task success rate of 52.8%, exceeding the skill-free baseline by 7.3 percentage points. Further ablations show reasoning and interaction skills provide distinct, complementary benefits, supporting separation of transferable task logic from executable page-level operations.

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Research/Papers

Stop Comparing LLM Agents Without Disclosing the Harness

This position paper argues that, for long-horizon tasks evaluated across models with comparable frontier capability, the agent execution harness, namely the infrastructure layer that governs context construction, tool interaction, orchestration, and verification around a language model, is often a stronger determinant of agent performance than the model it wraps. We formalize and defend the Binding Constraint Thesis: in this regime, performance variance is governed more by harness configuration than by model choice, and current evaluation protocols therefore systematically misattribute harness-level gains to model improvements. We support this thesis along three lines. First, a control-theoretic formalization treats the harness as the controller of a closed-loop dynamical system and the LLM as the stochastic policy it governs, which explains why small harness changes can produce performance shifts that exceed those obtained by substituting one model for another. Second, published benchmarks, industry deployments, and a controlled variance decomposition show that harness-induced variance can substantially exceed model-induced variance, including cases of model ranking reversal. Third, we propose a harness-aware evaluation framework with a disclosure standard and a variance decomposition protocol. Until harness specifications are disclosed, leaderboard comparisons for long-horizon agents should be treated as incomplete and potentially misleading.

2026-05-26 13:00 JST | arXiv cs.AI | Agents | Research/Papers

Methods for Formal Verification of Agent Skills: Three Layers Toward a Mechanically Checkable Capability-Containment Proof

The companion paper introduced a four-level verification lattice on agent-skill manifests (unverified, declared, tested, formal) and left the top level aspirational. This paper closes that gap. We give a precise semantics for skill behaviour faithful to how a skill is consumed by an LLM-driven runtime (a deterministic script-side reachable through a non-deterministic LLM-side), state the verification problem as a capability-containment property over that semantics, and present three composable methods that together raise a skill from declared or tested to formal: (1) sound static capability-containment analysis of the script-side via abstract interpretation over a small effect lattice; (2) a refinement type system for tool-call envelopes that mechanically rejects any call whose statically-inferred capability is not in the manifest's declared set; (3) SMT-bounded model checking against the parent paper's biconditional correctness criterion, with the bound chosen so any counter-example fitting the runtime's transaction-buffer horizon is exhibited as a concrete trace. We prove the three layers composed soundly cover the parent paper's threat model modulo a single residual (the LLM's freedom to refuse to act) that the parent paper's runtime biconditional catches at session boundary. The methods reuse existing well-engineered tools (Z3, Semgrep, CodeQL, refinement-type checkers, mechanised proof assistants) rather than asking operators to build new ones, and the proof-carrying artifact extends the existing SKILL.md convention. All three methods plus the bundle producer and re-checker ship as zero-dependency JavaScript modules in the open-source enclawed framework (https://github.com/metereconsulting/enclawed; project page https://www.enclawed.com/), with 53 unit tests and an end-to-end CLI demo on a sample skill.
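
The containment property at the heart of the second method (reject any call whose inferred capability is not in the manifest's declared set) reduces to a set-inclusion check, sketched below in Python. This is not the enclawed API, which the abstract says ships as JavaScript modules; the function name `check_containment` and the capability strings are hypothetical, and the static inference that would produce `inferred` is assumed to have run already.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple


def check_containment(
    declared: Iterable[str],
    inferred: Dict[str, Iterable[str]],
) -> List[Tuple[str, FrozenSet[str]]]:
    """Return (call, missing capabilities) for every call that escapes the manifest.

    declared: capabilities listed in the skill manifest.
    inferred: statically inferred capabilities per tool call.
    An empty result means every call is contained in the declared set.
    """
    allowed = frozenset(declared)
    violations: List[Tuple[str, FrozenSet[str]]] = []
    for call, capabilities in inferred.items():
        missing = frozenset(capabilities) - allowed
        if missing:
            violations.append((call, missing))
    return violations
```

A checker in this style would reject the skill whenever the returned list is non-empty; soundness then rests entirely on the inference step over-approximating what each call can do.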

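2026-05-26 13:00 JST | arXiv cs.AI | Agents, Business/Funding

Machine Psychometrics: A Mathematical Psychology of Artificial Intelligence

Artificial agents now produce behavior rich enough to invite trust, surprise, and concern, yet our evaluation tools still favor capability scores over psychological structure. This paper argues that the philosophical impasse between two symmetric errors (Artificial Mind Blindness, which dismisses psychological organization in non-biological systems, and Artificial Mind Projection, which infers a human-like inner life from fluent behavior alone) can be avoided not by resolving the question of consciousness but by introducing a disciplined measurement layer beneath it. Drawing on Michael Levin's continuum view of cognition as goal-directed competency across substrates, and on the methodological repertoire of mathematical psychology (Item Response Theory, Signal Detection Theory, Bayesian cognitive modeling, calibration analysis, cognitive-bias batteries), the paper develops Machine Psychometrics as a measurement science of the latent behavioral, metacognitive, communicative, and self-modeling dispositions of artificial agents. Its operational core is the Machine Mindprint: a multidimensional, domain-bounded, versioned profile covering calibration, source integrity, suggestibility resistance, context stability, expressive alignment, tool integrity, drift monitoring, and distributional grounding. A complementary Trust Protocol turns Mindprints into deployment decisions through probe batteries, perturbation testing, reliability and validity analysis, and longitudinal monitoring across high-stakes domains. The philosophical contribution is a third stance, Artificial Mind Discipline, which neither anthropomorphizes nor dismisses, and neither presupposes consciousness nor forecloses it. The aim is not to humanize artificial agents but to understand them, precisely because they are not human, by measuring before judging.

Original (English)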

Machine Psychometrics: A Mathematical Psychology of Artificial Intelligence

Artificial agents now generate behavior rich enough to invite trust, surprise, and concern, yet our evaluation tools still privilege capability scores over psychological structure. This paper argues that the philosophical impasse between two symmetrical errors (Artificial Mind Blindness, which dismisses psychological organization in non-biological systems, and Artificial Mind Projection, which infers human-like inner life from fluent behavior alone) can be circumvented not by resolving the consciousness question, but by introducing a disciplined measurement layer beneath it. Drawing on Michael Levin's continuum view of cognition as goal-directed competency across substrates, and on the methodological repertoire of mathematical psychology (Item Response Theory, Signal Detection Theory, Bayesian cognitive modeling, calibration analysis, cognitive-bias batteries), the paper develops Machine Psychometrics as a measurement science of latent behavioral, metacognitive, communicative, and self-modeling dispositions in artificial agents. Its operational core is the Machine Mindprint: a multidimensional, domain-bounded, versioned profile spanning calibration, source integrity, suggestibility resistance, context stability, expressive alignment, tool integrity, drift monitoring, and distributional grounding. A complementary Trust Protocol turns Mindprints into deployment decisions through probe batteries, perturbation testing, reliability and validity analysis, and longitudinal monitoring across high-stakes domains. The philosophical contribution is a third stance, Artificial Mind Discipline, that neither anthropomorphizes nor dismisses, neither presupposes consciousness nor forecloses it. The aim is not to humanize artificial agents, but to understand them precisely because they are not human, through measurement before judgment.
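Calibration analysis, one of the Mindprint dimensions listed above, is commonly summarized with the expected calibration error (ECE). The abstract does not say which calibration statistic the paper uses, so the following is only a sketch under that assumption; the function name and the equal-width binning are illustrative choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between stated confidence and observed accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 falls in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# An agent that claims full confidence but is right only half the time.
overconfident = expected_calibration_error([1.0, 1.0], [True, False])
```

A value near zero means stated confidence tracks observed accuracy; larger values indicate over- or under-confidence.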

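2026-05-26 13:00 JST | arXiv cs.AI | Agents, Robotics, Hardware/Semiconductors

Beyond Predefined Learning Objects: A Thinking-Learning Interaction Model for Up-to-Date Autonomous Robot Learning

Autonomous robots operating in open, changing environments cannot always rely on predefined inputs, outputs, and action routines. Existing learning methods let robots improve through interaction with the environment, but the objects of learning, such as input features, recognition outputs, network structures, task goals, or action sequences, are often fixed in advance. This limits a robot's ability to adapt when new features, new categories, or more efficient task routines emerge during long-term operation. To address this problem, the paper proposes a thinking-learning interaction model for autonomous robots. The core idea is that thinking guides learning by identifying potential changes, selecting useful evidence, organizing training materials, and planning verification actions, while learning promotes thinking by updating task knowledge, feature-selection experience, action strategies, and future reasoning processes. Through this bidirectional mechanism, the robot can gradually move beyond its predefined learning settings and adapt its recognition and action relations through continuous interaction with the environment. Specifically, the model supports adaptive input-feature discovery, output-category expansion, learning-model updates, and action-routine reconstruction. In experiments, the model improves final recognition accuracy from 0.419 to 0.845 in feature adaptation, achieves higher new-category formation accuracy and model-update success rates, and reduces average action length from 13.0 to 4.0 in action-routine reconstruction. In learning-enhanced thinking, the useful-evidence selection rate rises from 0.272 to 0.965, indicating that learning results can effectively improve future evidence selection and reasoning.

Original (English)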

Beyond Predefined Learning Objects: A Thinking-Learning Interaction Model for Up-to-Date Autonomous Robot Learning

Autonomous robots operating in open and changing environments cannot always rely on predefined inputs, outputs, and action routines. Although existing learning methods enable robots to improve their performance through environmental interaction, the objects of learning are often fixed in advance, such as input features, recognition outputs, network structures, task goals, or action sequences. This limits their ability to adapt when new features, new categories, or more efficient task routines appear during long-term operation. To address this problem, this paper proposes a thinking-learning interaction model for autonomous robots. The core idea is that thinking guides learning by identifying potential changes, selecting useful evidence, organizing training materials, and planning verification actions, while learning promotes thinking by updating task knowledge, feature-selection experience, action strategies, and future reasoning processes. Based on this bidirectional mechanism, the robot can gradually move beyond predefined learning settings and adapt its recognition relations and action relations through continuous interaction with the environment. Specifically, the proposed model supports adaptive input feature discovery, output category expansion, learning model update, and action routine reconstruction. Experimental results show that the proposed model improves the final recognition accuracy from 0.419 to 0.845 in feature adaptation, achieves higher new-category formation accuracy and model-update success rate, and reduces the average action length from 13.0 to 4.0 in action routine reconstruction. In learning-enhanced thinking, the useful evidence selection rate increases from 0.272 to 0.965, indicating that learning results can effectively improve future evidence selection and reasoning.

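2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Toward Trustworthy Agentic AI: A Comprehensive Survey of Safety, Robustness, Privacy, and System Security

Agentic AI systems, meaning large language models (LLMs) augmented with planning, tool use, memory, and long-horizon interaction, can execute complex tasks autonomously, but their multi-step trajectories introduce new failure modes that challenge trustworthiness. This survey gives a focused examination of trustworthy agentic AI through two core dimensions that are critical for high-risk deployments: safety and robustness, and privacy and system security. For each dimension, we clarify key concepts, identify where risks emerge along the agent workflow, and summarize stage-targeted mitigation strategies. Other aspects of trustworthiness (value alignment, transparency, fairness, accountability) are discussed as relevant context rather than as parallel chapters. To support consistent comparison and deployment decisions, we consolidate evaluation into a unified metrics-and-benchmarks hub that emphasizes both outcome and process signals (for example, constraint violations, trace completeness, and adversarial success rates) and offers scenario-to-metric guidance for release gating. We conclude by outlining open challenges, including self-evolving agents, runtime monitoring and verification, privacy-preserving personalization, and the trust-utility trade-off, and we present a case study of real-world security failures in open-source agentic systems. Our goal is to provide a practical reference for researchers and practitioners building trustworthy agentic systems in high-stakes environments.

Original (English)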

Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security

Agentic AI systems -- Large Language Models (LLMs) augmented with planning, tool use, memory, and long-horizon interactions -- can execute complex tasks autonomously, but their multi-step trajectories introduce new failure modes that challenge trustworthiness. This survey provides a focused examination of trustworthy agentic AI through two core dimensions that are critical for high-risk deployments: Safety and Robustness, and Privacy and System Security. For each dimension, we clarify key concepts, identify where risks emerge along the agent workflow, and summarize stage-targeted mitigation strategies. Other trustworthiness aspects (value alignment, transparency, fairness, and accountability) are discussed as relevant context rather than parallel chapters. To support consistent comparison and deployment decisions, we consolidate evaluation into a unified metrics-and-benchmarks hub, emphasizing both outcome and process signals (e.g., constraint violations, trace completeness, and adversarial success rates) and offering scenario-to-metric guidance for release gating. We conclude by outlining open challenges such as self-evolving agents, runtime monitoring and verification, privacy-preserving personalization, and the trust-utility trade-off, and present a case study of real-world security failures in open-source agentic systems. Our goal is to serve as a practical reference for researchers and practitioners building trustworthy agentic systems in high-stakes environments.

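2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Reason-Imagine-Act: Closed-Loop LLM Decision Making with World Models for Autonomous Driving

Large language models (LLMs) are promising for autonomous driving, but semantics-only decision policies can produce physically unsafe behavior in dynamic traffic. Existing methods either perform online language reasoning without explicit dynamics verification or use world models mainly in offline pipelines, which leaves a gap between semantic intent and physical feasibility at decision time. We propose Reason-Imagine-Act (RIA), a closed-loop framework that couples an LLM reasoner with an action-conditioned world model for online safety verification. At each step, the LLM proposes an action template and candidate sub-actions, the world model performs short-horizon rollouts, and a safety scorer selects the safest executable action and feeds the result back to the next reasoning step. Under a unified CARLA point-goal protocol (1000 episodes), RIA achieves 80.05% route completion, a 51.10% arrival rate, and a 0.20% collision rate. Under the same closed-loop interface, RIA consistently outperforms training-free baselines, including CARLA TM and MADA, on core closed-loop metrics. For reproducibility, the code is available at https://github.com/pku-smart-city/source_code/tree/main/RIA.

Original (English)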

Reason--Imagine--Act: Closed-Loop LLM Decision Making with World Models for Autonomous Driving

Large language models (LLMs) are promising for autonomous driving, but semantics-only decision policies can yield physically unsafe behavior in dynamic traffic. Existing methods either perform online language reasoning without explicit dynamics verification or use world models mainly in offline pipelines, leaving a gap between semantic intent and physical feasibility at decision time. We propose Reason--Imagine--Act (RIA), a closed-loop framework that couples an LLM reasoner with an action-conditioned world model for online safety verification. At each step, the LLM proposes an action template and candidate sub-actions, the world model performs short-horizon rollouts, and a safety scorer selects the safest executable action with feedback to the next reasoning step. Under a unified CARLA point-goal protocol (1000 episodes), RIA achieves 80.05% route completion, 51.10% arrival rate, and 0.20% collision rate. Under the same closed-loop interface, RIA consistently outperforms training-free baselines, including CARLA TM and MADA, on core closed-loop metrics. For reproducibility, code is available at https://github.com/pku-smart-city/source_code/tree/main/RIA.
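The propose, rollout, and score loop described above can be sketched on a toy one-dimensional car-following problem. Everything here is an assumption made for illustration and is not taken from the RIA code: `propose_candidates` stands in for the LLM reasoner, `rollout` is a hand-written kinematic stand-in for the world model, and the safety score is simply the minimum predicted gap to the lead vehicle.

```python
from dataclasses import dataclass


@dataclass
class State:
    gap: float         # distance to the lead vehicle (m)
    ego_speed: float   # m/s
    lead_speed: float  # m/s, assumed constant over the rollout


def propose_candidates(state, feedback):
    """Stand-in for the LLM reasoner: candidate accelerations in m/s^2."""
    return [-3.0, 0.0, 1.5]


def rollout(state, accel, horizon=5, dt=0.5):
    """Toy world model: minimum gap reached over a short horizon."""
    gap, speed = state.gap, state.ego_speed
    min_gap = gap
    for _ in range(horizon):
        speed = max(0.0, speed + accel * dt)
        gap += (state.lead_speed - speed) * dt
        min_gap = min(min_gap, gap)
    return min_gap


def select_action(state, feedback=""):
    """Score each candidate by its predicted minimum gap and pick the safest."""
    scored = [(rollout(state, a), a) for a in propose_candidates(state, feedback)]
    # Ties on safety are broken toward the larger acceleration (more progress).
    min_gap, best = max(scored)
    feedback = f"chose accel={best}, predicted min gap={min_gap:.1f} m"
    return best, feedback


# Closing fast on a slower lead vehicle: braking is the safest candidate.
action, feedback = select_action(State(gap=5.0, ego_speed=10.0, lead_speed=5.0))
```

The returned feedback string plays the role of the signal passed to the next reasoning step; in a full loop it would be supplied to `propose_candidates` on the following call.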

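2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

EvoSci: A Bio-Inspired Multi-Agent Framework for the Evolution of Scientific Discovery

Large language models (LLMs) have shown strong potential for scientific discovery, yet existing methods still face substantial challenges in designing research workflows and multi-role collaboration mechanisms. To mitigate these issues, we propose EvoSci, a multi-agent scientific collaboration framework that integrates bio-inspired evolution with knowledge-graph modeling. To iteratively generate, evaluate, and refine research ideas, EvoSci incorporates multiple role-based agents, including a mentor, a researcher, and a reviewer. By combining collaborative reasoning, shared memory, and evolutionary feedback, EvoSci significantly enhances the coherence and creativity of scientific exploration. Experiments on real-world research topics show that EvoSci significantly outperforms strong baselines in LLM-based structured peer-review and comparative ranking evaluations, achieving the highest overall peer-review score (ICLR 4.90) and the top ranking (Top-10 = 54). These results suggest its superiority in both scientific idea generation and continuous discovery.

Original (English)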

EvoSci: A Bio-Inspired Multi-Agent Framework for the Evolution of Scientific Discovery

Large language models (LLMs) have shown strong potential in scientific discovery, yet existing methods still face substantial challenges in the design of research workflows and multi-role collaboration mechanisms. To mitigate these issues, we propose EvoSci, a multi-agent scientific collaboration framework, which integrates bio-inspired evolution with knowledge graph modeling. To iteratively generate, evaluate, and refine research ideas, EvoSci incorporates multiple role-based agents, including mentor, researcher, and reviewer. By combining collaborative reasoning, shared memory, and evolutionary feedback, EvoSci significantly enhances the coherence and creativity of scientific exploration. Experiments on real-world research topics demonstrate that EvoSci significantly outperforms strong baselines in LLM-based structured peer-review and comparative ranking evaluations, achieving the highest overall peer-review score (ICLR 4.90) and top ranking (Top-10 = 54). These results suggest its superiority in both scientific idea generation and continuous discovery.

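2026-05-26 13:00 JST | arXiv cs.AI | Agents, Research/Papers

EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

Coding agents are increasingly used as iterative development partners, but most benchmarks still evaluate a single specification followed by a single final assessment. This leaves out a basic question: can an agent keep its own codebase working as requirements change? We introduce EvoCode-Bench, a benchmark of 26 stateful coding tasks and 227 evaluated rounds. Each task preserves the agent's workspace for 5 to 15 rounds, states requirements through observable behavior, and uses cumulative executable tests to check both new requirements and earlier ones that are still active. We evaluate 13 coding agents with two metrics: MT@4, a four-attempt fail-stop multi-round score, and SR, a single-round score measured from a reference-completed prior state. For most agents, SR exceeds MT@4 by 22 to 40 points. The gap also changes rankings: the agent with the highest SR (78.9) ranks only third in persistent execution (44.0 MT@4). Even the strongest agents reach only about 50% success on the multi-turn metric, and by round 5 the aggregate pass rate falls below half of round-1 performance. Failure analysis shows tier-dependent behavior: weaker agents fail early, while stronger agents survive long enough to expose specification-tracking and regression failures. We release the benchmark data and the Harbor multi-turn infrastructure.

Original (English)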

EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

Coding agents are increasingly used as iterative development partners, but most benchmarks still evaluate one specification followed by one final assessment. This leaves out a basic question: can an agent keep its own codebase working as requirements change? We introduce EvoCode-Bench, a benchmark of 26 stateful coding tasks and 227 evaluated rounds. Each task preserves the agent's workspace for 5-15 rounds, states requirements through observable behavior, and uses cumulative executable tests to check new requirements and still-active prior ones. We evaluate 13 coding agents with two metrics: MT@4, a four-attempt fail-stop multi-round score, and SR, a single-round score from a reference-completed prior state. For most agents, SR exceeds MT@4 by 22-40 points. The gap also changes rankings: the highest-SR agent (78.9) ranks only third in persistent execution (44.0 MT@4). Even the strongest agents achieve only about 50% success on multi-turn metrics, and aggregate pass rate drops below half of round-1 performance by round 5. Failure analysis shows tier-dependent behavior: weaker agents fail early, while stronger agents survive long enough to expose specification-tracking and regression failures. We release the benchmark data and Harbor multi-turn infrastructure.
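The abstract names the two metrics but does not give their formulas, so the sketch below rests on an assumed reading: under MT@k a task runs round by round, each round allows up to k attempts, the task stops at the first round that fails every attempt, and the score is the fraction of rounds passed. SR is taken as the fraction of rounds passed when each round starts from a correct prior state. Both function names are hypothetical.

```python
def mt_at_k(round_attempts, k=4):
    """Assumed fail-stop score: fraction of rounds passed before the first
    round in which all of the first k attempts fail."""
    if not round_attempts:
        return 0.0
    passed = 0
    for attempts in round_attempts:
        if any(attempts[:k]):
            passed += 1
        else:
            break  # fail-stop: later rounds are never reached
    return passed / len(round_attempts)


def single_round_score(round_results):
    """Assumed SR: fraction of rounds passed when each is scored independently."""
    if not round_results:
        return 0.0
    return sum(1 for ok in round_results if ok) / len(round_results)


# Four rounds; round 3 fails all four attempts, so round 4 is never reached.
multi_turn = mt_at_k([[True], [False, False, True], [False] * 4, [True]])
# The same agent scored round by round from a reference state.
single_round = single_round_score([True, True, False, True])
```

Under these assumptions a single unrecovered round costs every later round in the multi-turn score but only one round in the single-round score, which is one way the reported gap between SR and MT@4 could arise.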

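2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents, Research/Papers

SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusable procedural skills. We introduce SkillEvolBench, a diagnostic benchmark for evaluating this step from experience reuse to skill formation. It contains 180 tasks across six real-world agent environments, organized into role-conditioned task families that share latent procedures. Agents learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks that test context shift, adversarial shortcuts, and composition. By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces. Across ten model configurations and three agent harnesses, we find that current agents often adapt locally but rarely form robust, reusable skills. Skill-based conditions can improve acquisition or replay, and individual models sometimes gain on specific deployment axes, but these gains are unstable under frozen deployment. Raw-trajectory reuse frequently outperforms distilled skills, which suggests that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks. Capacity and cost analyses further show that writing more skills or larger Tier-3 resource libraries is not sufficient: additional updates can improve coverage while also introducing episode-specific drift and procedural clutter. These findings position SkillEvolBench as a testbed for measuring when one-off experience becomes durable procedural knowledge rather than task-local memory.

Original (English)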

SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusable procedural skills. We introduce SkillEvolBench, a diagnostic benchmark for evaluating this step from experience reuse to skill formation. It contains 180 tasks across six real-world agent environments, organized into role-conditioned task families with shared latent procedures. Agents learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks testing context shift, adversarial shortcuts, and composition. By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces. Across ten model configurations and three agent harnesses, we find that current agents often adapt locally but rarely form robust reusable skills. Skill-based conditions can improve acquisition or replay, and individual models sometimes gain on specific deployment axes, but these gains are unstable under frozen deployment. Raw-trajectory reuse frequently outperforms distilled skills, suggesting that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks. Capacity and cost analyses further show that writing more skills or larger Tier-3 resource libraries is not sufficient: additional updates can improve coverage while introducing episode-specific drift and procedural clutter. These findings position SkillEvolBench as a testbed for measuring when one-off experience becomes durable procedural knowledge rather than task-local memory.

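2026-05-26 13:00 JST | arXiv cs.AI | Agents

A Sober Look at Agentic Misalignment in Automated Workflows

We study a class of emergent misalignment in multi-agent systems (MAS), with a focus on automated workflows, which we refer to as agentic misalignment. Although these systems can solve complex tasks, they often fail because agents act according to implicit proxy utilities that do not align with the goals humans intend. We formally define these behaviors and analyze them within a Bayesian framework, showing that generic utilities naturally lead to posterior collapse of agents in automated workflows. To address this issue, we propose Agentic Evidence Attribution (AEA), a new alignment paradigm that improves agent posteriors using context-specific evidence. AEA reasons over agent actions and provides structured evidence to correct misaligned behavior during collaboration. To better understand the role of evidence, we study two instantiations of AEA: self-reflection (internal evidence from the model) and weak-to-strong generalization (external evidence on the agentic trajectory). We show that a small evidence model effectively aligns the MAS by providing orthogonal failure attribution. Our results clarify the sources of agentic misalignment in automated workflows and show that evidence-based alignment can effectively improve agent collaboration, leading to reliable multi-agent systems built on automated workflows.

Original (English)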

A Sober Look at Agentic Misalignment in Automated Workflows

We study a class of emergent misalignment in multi-agent systems (MAS), with a focus on automated workflows, which we refer to as agentic misalignment. Although these systems can solve complex tasks, they often fail because agents act according to implicit proxy utilities that do not align with the intended human goals. We formally define these behaviors and analyze them within a Bayesian framework, showing that generic utilities naturally lead to posterior collapse of agents in automated workflows. To address this issue, we propose Agentic Evidence Attribution (AEA), a novel alignment paradigm that improves agent posteriors using context-specific evidence. AEA reasons over agent actions and provides structured evidence to correct misaligned behavior during collaboration. To better understand the role of evidence, we study two instantiations of AEA: self-reflection (internal evidence from the model) and weak-to-strong generalization (external evidence on the agentic trajectory). We show that a small evidence model effectively aligns the MAS by providing orthogonal failure attribution. Our results clarify the sources of agentic misalignment in automated workflows and show that evidence-based alignment can effectively improve agent collaboration and lead to reliable multi-agent systems built on automated workflows.

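2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

Multi-agent LLM workflows route inference through specialized roles to raise end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that are poorly understood. We study when end-to-end RL training of multi-agent LLM workflows improves over the base models, comparing Shared-Policy training, in which all roles update a single policy, with Isolated-Policy training, in which each role has its own parameters. Our experimental matrix spans Eval-Opt, Voting, and Orch-Workers workflows, math and code tasks, and three model scales (0.6B, 1.7B, 4B). We find that multi-agent RL usually improves over the base models, but the gains depend jointly on workflow, task, and scale rather than on policy sharing alone. Isolated-Policy training tends to reach higher peak accuracy yet more often falls off a terminal accuracy cliff, while Shared-Policy training does not eliminate failure; it redistributes failure into qualitatively different patterns. We then explain the strongest of these patterns through role-level gradient dynamics induced by workflow topology and policy routing. Under Isolated-Policy training, parallel same-role agents on shared prompts amplify per-role gradients and drive terminal degradation in the Voting and Orch-Workers workflows. Under Shared-Policy training, asymmetric per-step gradient mass causes the shared policy to be captured by the dominant role, producing failure signatures that differ by task and workflow. Together, the empirical map and its underlying mechanisms show that policy sharing routes training pressure through different channels rather than offering uniform stability, which makes it a design choice with workflow- and task-conditional tradeoffs.

Original (English)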

When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that are poorly understood. We study when end-to-end RL training of multi-agent LLM workflows improves over their base models, comparing Shared-Policy training, where all roles update one policy, with Isolated-Policy training, where each role has its own parameters. Our experimental matrix spans Eval-Opt, Voting, and Orch-Workers workflows, math and code tasks, and three model scales (0.6B, 1.7B, 4B). We find that multi-agent RL usually improves over base models, but gains depend jointly on workflow, task, and scale, not on policy sharing alone. Isolated-Policy tends to reach higher peak accuracy yet more often falls off a terminal accuracy cliff, while Shared-Policy training does not eliminate failure; it redistributes failure into qualitatively different patterns. We then explain the strongest of these patterns through role-level gradient dynamics induced by workflow topology and policy routing: under Isolated-Policy, parallel same-role agents on shared prompts amplify per-role gradients and drive terminal degradation in Voting and Orch-Workers workflows; under Shared-Policy, asymmetric per-step gradient mass causes the shared policy to be captured by the dominant role, producing different failure signatures by task and workflow. Together, the empirical map and its underlying mechanisms show that policy sharing routes training pressure through different channels rather than offering uniform stability, making it a design choice with workflow- and task-conditional tradeoffs.

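2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI, Agents

Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows

Large language models (LLMs) are increasingly deployed as autonomous agents that reason, use tools, and act over multiple steps. Yet most hallucination benchmarks still evaluate only the final output and miss failures that originate in intermediate Thought-Action-Observation steps. We present Trajel, a dataset and evaluation framework for auditing trajectory-level hallucinations in multi-agent industrial workflows. Trajel introduces a five-type hallucination taxonomy (factual, referential, logical, procedural, and scope-based) over expert-annotated agent traces from AssetOpsBench. We benchmark supervised detection models at the subtask, trajectory, and long-context levels. Our results show that existing benchmarks miss the most common failure modes, that nearly half of hallucinated trajectories involve multiple types at once, and that automated detectors with high binary accuracy still misclassify the subtlest types. Trajectory-aware detection significantly outperforms standard post-hoc verification, which makes taxonomy-grounded evaluation necessary for safer agentic deployment.

Original (English)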

Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows

Large Language Models (LLMs) are increasingly deployed as autonomous agents that reason, use tools, and act over multiple steps. Yet most hallucination benchmarks still evaluate only the final output, missing failures that originate in intermediate Thought-Action-Observation steps. We present Trajel, a dataset and evaluation framework for auditing trajectory-level hallucinations in multi-agent industrial workflows. Trajel introduces a five-type hallucination taxonomy (factual, referential, logical, procedural, and scope-based) over expert-annotated agent traces from AssetOpsBench. We benchmark supervised detection models at the subtask, trajectory, and long-context levels. Our results show that the most common failure modes are missed by existing benchmarks, that nearly half of hallucinated trajectories involve multiple types at once, and that automated detectors with high binary accuracy still misclassify the subtlest types. Trajectory-aware detection significantly outperforms standard post-hoc verification, making taxonomy-grounded evaluation necessary for safer agentic deployment.

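2026-05-26 13:00 JST | arXiv cs.AI | Agents

Adaptive Human-AI Coordination via Hierarchical Action Disentanglement

Human-AI collaboration requires agents that can adapt to diverse partner behaviors and skill levels while remaining robust to unseen partners. Existing methods often collapse to a single dominant behavior or learn poorly aligned skills, which limits effective coordination. We propose Intrinsic Action Disentanglement (IAD), a deep hierarchical reinforcement learning (DHRL) framework that learns distinct, partner-aware low-level action sequences conditioned on high-level latent skills. IAD introduces an intrinsic reward that explicitly encourages the action distributions of the agent's low-level policy to be disentangled across skills, yielding an interpretable mapping between high-level decisions and partner-specific behavioral responses. By capturing temporally extended interaction patterns, IAD enables flexible adaptation to heterogeneous partner dynamics under distributional shift. We evaluate IAD in the Overcooked-AI domain across multiple layouts and diverse partner settings, including unseen simulated partners, a human-proxy model trained on human-human gameplay, and real human partners. The results show that IAD consistently outperforms strong baselines and achieves more reliable, adaptive coordination across all settings.

Original (English)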

Adaptive Human-AI Coordination via Hierarchical Action Disentanglement

Human-AI collaboration requires agents that can adapt to diverse partner behaviors and skill levels while remaining robust to unseen partners. Existing methods often collapse to a single dominant behavior or learn poorly aligned skills, limiting effective coordination. We propose Intrinsic Action Disentanglement (IAD), a deep hierarchical reinforcement learning (DHRL) framework that learns distinct, partner-aware low-level action sequences conditioned on high-level latent skills. IAD introduces an intrinsic reward that explicitly encourages disentangled action distributions of the agent's low-level policy across skills, yielding an interpretable mapping between high-level decisions and partner-specific behavioral responses. By capturing temporally extended interaction patterns, IAD enables flexible adaptation to heterogeneous partner dynamics under distributional shift. We evaluate IAD in the Overcooked-AI domain across multiple layouts and diverse partner settings, including unseen simulated partners, a human-proxy model trained on human-human gameplay, and real human partners. Results show that IAD consistently outperforms strong baselines and achieves more reliable, adaptive coordination across all settings.

2026-05-26 13:00 JST | arXiv cs.AI | Agents

Partner-Aware Hierarchical Skill Discovery for Robust Human-AI Collaboration

Multi-agent collaboration, especially human-AI teaming, requires agents that can adapt to novel partners with diverse and dynamic behaviors. Conventional deep hierarchical reinforcement learning (DHRL) methods focus on agent-centric rewards and overlook partner behavior, which leads to shortcut learning, where skills exploit spurious information instead of adapting to partners' dynamic behaviors. This limitation undermines agents' ability to adapt to and coordinate effectively with novel partners. We introduce Partner-Aware Skill Discovery (PASD), a DHRL framework that learns skills conditioned on partner behavior. PASD introduces a contrastive intrinsic reward that captures patterns emerging from partner interactions, aligning skill representations across similar partners while maintaining discriminability across diverse strategies. By structuring the skill space around partner interactions, the approach mitigates shortcut learning and promotes behavioral consistency, enabling robust and adaptive coordination. We extensively evaluate PASD on the Overcooked-AI benchmark with a diverse population of partners that vary in skill level and play style. We further evaluate the approach with human-proxy models trained on human-human gameplay trajectories. PASD consistently outperforms existing population-based and hierarchical baselines, demonstrating transferable skill learning that generalizes across a wide range of partner behaviors. Analysis of the learned skill representations shows that PASD adapts effectively to diverse partner behaviors, highlighting its robustness in human-AI collaboration.

Original (English)

Partner-Aware Hierarchical Skill Discovery for Robust Human-AI Collaboration

Multi-agent collaboration, especially in human-AI teaming, requires agents that can adapt to novel partners with diverse and dynamic behaviors. Conventional Deep Hierarchical Reinforcement Learning (DHRL) methods focus on agent-centric rewards and overlook partner behavior, leading to shortcut learning, where skills exploit spurious information instead of adapting to partners' dynamic behaviors. This limitation undermines agents' ability to adapt and coordinate effectively with novel partners. We introduce Partner-Aware Skill Discovery (PASD), a DHRL framework that learns skills conditioned on partner behavior. PASD introduces a contrastive intrinsic reward to capture patterns emerging from partner interactions, aligning skill representations across similar partners while maintaining discriminability across diverse strategies. By structuring the skill space based on partner interactions, this approach mitigates shortcut learning and promotes behavioral consistency, enabling robust and adaptive coordination. We extensively evaluate PASD in the Overcooked-AI benchmark with a diverse population of partners characterized by varying skill levels and play styles. We further evaluate the approach with human proxy models trained from human-human gameplay trajectories. PASD consistently outperforms existing population-based and hierarchical baselines, demonstrating transferable skill learning that generalizes across a wide range of partner behaviors. Analysis of learned skill representations shows that PASD adapts effectively to diverse partner behaviors, highlighting its robustness in human-AI collaboration.

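2026-05-26 13:00 JST | arXiv cs.AI | Agents, Research/Papers

Benchmarking the Limits of In-Context Reinforcement Learning for Ad-Hoc Teamwork

In-context reinforcement learning (ICRL) has enabled foundation agents to adapt instantly to novel tasks, yet its efficacy in ad-hoc teamwork (AHT), where coordination with unknown partners is required, remains unexplored. To evaluate this rigorously, we introduce ICRL4AHT, a large-scale benchmark built on a high-throughput JAX implementation of Overcooked-V2. The benchmark includes a large, diverse teammate suite spanning both RL and heuristic policies, which enables controlled train-test shifts, and provides a reproducible end-to-end pipeline for teammate generation, learning-history collection, dataset construction, and online multi-episode evaluation. We evaluate representative history-conditioned ICRL algorithms, including Algorithm Distillation (AD) and Decision-Pretrained Transformer (DPT), across millions of transitions. The results reveal notable limitations: in contrast to their success in single-agent domains, these baselines fail to exhibit robust test-time adaptation in multi-agent settings. Specifically, the methods frequently underperform random baselines on both the unseen-teammate and unseen-layout tracks, with no clear in-context improvement over long horizons. These findings highlight the challenges of strategic inference under partial observability within the OvercookedV2 AHT protocol and establish the benchmark as a critical testbed for next-generation coordination algorithms.

Original (English)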

Benchmarking the Limits of In-Context Reinforcement Learning for Ad-Hoc Teamwork

In-Context Reinforcement Learning (ICRL) has enabled foundation agents to adapt instantaneously to novel tasks, yet its efficacy in Ad-Hoc Teamwork (AHT), where coordination with unknown partners is required, remains unexplored. To rigorously evaluate this, we introduce a large-scale benchmark ICRL4AHT, built upon a high-throughput JAX implementation of Overcooked-V2. Our benchmark includes a large, diverse teammate suite spanning both RL and heuristic policies, enabling controlled train-test shifts, and provides a reproducible end-to-end pipeline for teammate generation, learning-history collection, dataset construction, and online multi-episode evaluation. We evaluate representative history-conditioned ICRL algorithms, including Algorithm Distillation (AD) and Decision-Pretrained Transformer (DPT), across millions of transitions. Results reveal notable limitations: contrary to their success in single-agent domains, these baselines fail to exhibit robust test-time adaptation in multi-agent settings. Specifically, these methods frequently underperform random baselines across both unseen teammate and unseen layout tracks, with no clear in-context improvement over long horizons. These findings highlight the challenges of strategic inference under partial observability within the OvercookedV2 AHT protocol, establishing our benchmark as a critical testbed for next-generation coordination algorithms.

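2026-05-26 13:00 JST | arXiv cs.AI | Agents

SAM: State-Adaptive Memory for Long-Horizon Reasoning Agents

Long-horizon agentic reasoning requires large language models to act over long interaction histories that contain thoughts, tool calls, observations, and partial conclusions. The challenge is not merely that these histories grow long, but that the information needed for the current decision may be scattered across distant steps and may become relevant only later. Existing approaches handle this difficulty by truncating the interaction history, compressing it into shorter surrogates, or retrieving selected parts for reuse, but they do not explicitly model how access to past interaction should adapt to the agent's evolving state. We instead cast long-horizon reasoning as a problem of state-adaptive memory. To this end, we propose State-Adaptive Memory (SAM), a standalone framework that consolidates ongoing interaction into compact memory cues while preserving raw trajectory pages for intent-driven recall. These cues are not treated as replacements for the history; rather, they serve as lightweight handles that let the agent reconstruct temporally distant information according to its current needs, without retraining the underlying backbone. We further optimize the memory module through expert-guided supervision and reinforcement learning, aligning it with trajectory-level utility. Across BrowseComp, BrowseComp-ZH, WideSearch, and HLE, SAM consistently outperforms strong baselines on diverse agent backbones. Our results suggest that explicit memory modeling provides a simple and effective foundation for long-horizon agentic reasoning.

Original (English)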

SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent

Long-horizon agentic reasoning requires large language models to act over long interaction histories containing thoughts, tool calls, observations, and partial conclusions. The challenge is not merely that these histories grow long, but that information needed for the current decision may be scattered across distant steps and only become relevant later. Existing approaches address this difficulty by truncating the interaction history, compressing it into shorter surrogates, or retrieving selected parts of it for reuse, but they do not explicitly model how access to past interaction should adapt to the agent's evolving state. We instead cast long-horizon reasoning as a problem of state-adaptive memory. To this end, we propose State-Adaptive Memory (SAM), a standalone framework that consolidates ongoing interaction into compact memory cues while preserving raw trajectory pages for intent-driven recall. These cues are not treated as replacements for history; rather, they serve as lightweight handles that allow the agent to reconstruct temporally distant information according to its current needs, without retraining the underlying backbone. We further optimize the memory module through expert-guided supervision and reinforcement learning, aligning it with trajectory-level utility. Across BrowseComp, BrowseComp-ZH, WideSearch, and HLE, SAM consistently outperforms strong baselines over diverse agent backbones. Our results suggest that explicit memory modeling provides a simple and effective foundation for long-horizon agentic reasoning.

2026-05-26 13:00 JST arXiv cs.AI エージェント

AgentFugue: 集団推論による長期タスクのエージェントスケーリング

長期的なエージェントタスクの最近の進歩は、より強力なモデル、より優れたツール、より効果的な足場を通じて個々のエージェントをスケールアップすることによって主に推進されています。対照的に、スケールアウトについてはほとんど理解されていません。つまり、すべて同じタスクを対象とする複数のピアエージェントが、明示的な役割の特化やワークフローオーケストレーションに依存せずに、追加の機能ソースになり得るかどうかです。私たちはこの疑問を研究し、共有推論ハブを中心に構築された集団推論フレームワークである AgentFugue を提案します。ピアエージェントが同じタスクを並行して探索すると、ハブは各エージェントが確立したもの、試行したもの、または除外したものに関する簡潔なメモを記録し、各エージェントが現在の検索に役立つ形式で他のエージェントが発見したものに選択的にアクセスできるようにします。この設計は、集中計画を必要とせずに、孤立した軌跡を、再利用可能な中間推論の接続されたエコロジーに変えます。ハブをプラグイン通信層としてインスタンス化し、教師あり微調整とエンドツーエンドの強化学習でトレーニングします。私たちが調査した困難な長期設定において、AgentFugue は強力なベースラインを超えて改善しました。私たちの結果は、集合的推論により、ピアエージェントシステムのスケールアウトを、単なるコンピューティング費の増加の手段ではなく、機能向上の明確なソースに変えることができることを示唆しています。

原文 (English)

AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning

Recent progress on long-horizon agentic tasks has been driven largely by scaling up individual agents through stronger models, better tools, and more effective scaffolding. In contrast, much less is understood about scaling out: whether multiple peer agents, all targeting the same task, can become an additional source of capability without relying on explicit role specialization or workflow orchestration. We study this question and propose AgentFugue, a collective reasoning framework built around a shared reasoning hub. As peer agents explore the same task in parallel, the hub records concise notes on what each agent has established, attempted, or ruled out, and enables each agent to selectively access what other agents have discovered in a form useful for its current search. This design turns otherwise isolated trajectories into a connected ecology of reusable intermediate reasoning without requiring centralized planning. We instantiate the hub as a plug-in communication layer, trained with supervised fine-tuning and end-to-end reinforcement learning. Across the challenging long-horizon settings we study, AgentFugue improves over strong baselines. Our results suggest that collective reasoning can turn scaling out peer agent systems into a distinct source of capability gains, rather than merely a way of spending more compute.

2026-05-26 13:00 JST arXiv cs.AI LLM/生成AI エージェント

マルチエージェント LLM 意思決定システムにおける動的クレジット割り当てのための Market Regime Council

ポートフォリオ管理のためのマルチエージェント LLM 意思決定システムには、専門エージェント間でクレジットを割り当てるための原則的な方法が依然として欠けており、レジームシフトの下でコールドスタートの優位性に対して脆弱なままであり、最終的な割り当てがどのように形成されるかについての透明性が限られています。私たちは、オンラインエージェントの重み付けのための単一、ペア、および大連合のすべての出力にわたって正確なシャプレークレジットを計算する、協力的なマルチエージェント意思決定システムである Market Regime Council (MRC) を提案します。 N=3 の専門エージェントでインスタンス化され、各取引期間で、MRC は指数関数的に重み付けされたパフォーマンス履歴から連合ベースのシャプリー重みを再計算し、ベイズ適応混合を使用して初期期間を安定させ、レジーム依存の乗数を適用してエージェントの権限を調整し、5 層の因果追跡を通じて各リバランスを記録します。 13の暗号資産と5つのシードにわたる1,037営業日にわたって、MRCは1.51のシャープ・レシオと440.1%の累積リターンを達成し、アクティブなベースラインの中でCR、SR、IRで第1位にランクされ、アクティブなメソッドの中で最低のMDDを達成しました。アブレーションの結果は、その利益が単独の単一段階からではなく、連合の出力にわたるシャプレー重み付け統合から得られることを示しています。コードとデモデータは補足資料に含まれています。

原文 (English)

Market Regime Council for Dynamic Credit Assignment in Multi-Agent LLM Decision Systems

Multi-agent LLM decision systems for portfolio management still lack a principled way to assign credit across specialist agents, remain vulnerable to cold-start dominance under regime shifts, and offer limited transparency into how final allocations are formed. We propose Market Regime Council (MRC), a cooperative multi-agent decision system that computes exact Shapley credits across all single, pairwise, and Grand-coalition outputs for online agent weighting. Instantiated with N=3 specialist agents, at each trading period, MRC recomputes coalition-based Shapley weights from exponentially weighted performance histories, uses a Bayesian adaptive mixture to stabilize early periods, applies regime-dependent multipliers to adjust agent authority, and records each rebalance through a five-layer causal trace. Over 1,037 trading days across 13 crypto assets and five seeds, MRC achieves a Sharpe ratio of 1.51 and a cumulative return of 440.1%, ranking first on CR, SR, and IR among active baselines and attaining the lowest MDD among active methods. Ablation results show that the gains come from Shapley-weighted integration across coalition outputs rather than from any single stage in isolation. Code and demo data are included in the supplementary material.
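
The exact Shapley computation mentioned above is cheap for N=3, since there are only seven non-empty coalitions. The following is a minimal sketch, not the authors' implementation: the agent names and coalition values are hypothetical, and the final step of clipping and normalizing credits into weights is an assumption. The exponentially weighted histories, Bayesian adaptive mixture, and regime-dependent multipliers described in the abstract are not modeled here.

```python
from itertools import combinations
from math import factorial

# Hypothetical specialist agents and coalition performance values (illustrative only).
AGENTS = ["trend", "sentiment", "risk"]
COALITION_VALUE = {
    frozenset(): 0.0,
    frozenset({"trend"}): 0.4,
    frozenset({"sentiment"}): 0.2,
    frozenset({"risk"}): 0.1,
    frozenset({"trend", "sentiment"}): 0.7,
    frozenset({"trend", "risk"}): 0.6,
    frozenset({"sentiment", "risk"}): 0.3,
    frozenset({"trend", "sentiment", "risk"}): 1.0,
}


def shapley_values(players, value):
    """Exact Shapley value of each player; `value` maps frozenset coalitions to floats."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Standard Shapley weight for coalitions of size k that exclude p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value[s | {p}] - value[s])
        phi[p] = total
    return phi


def normalize_weights(phi):
    """Assumed post-processing: clip negative credits to zero and rescale to sum to 1."""
    clipped = {p: max(v, 0.0) for p, v in phi.items()}
    total = sum(clipped.values())
    if total == 0.0:
        return {p: 1.0 / len(phi) for p in phi}
    return {p: v / total for p, v in clipped.items()}


credits = shapley_values(AGENTS, COALITION_VALUE)
weights = normalize_weights(credits)
```

By the efficiency property, the credits sum to the grand-coalition value minus the empty-coalition value, which gives a quick sanity check on any coalition table.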

2026-05-26 13:00 JST arXiv cs.AI エージェント

DemoEvolve: デモを使用して Agentic Harness Evolution における希薄なフィードバックを克服する

エージェントハーネス進化は、凍結された言語モデルエージェントを取り巻く実行可能な構造を変更することで、エージェントを改善します。私たちはこのパラダイムを、サンプル効率の高い高速適応の一形態として研究します。エージェントは、モデルの重みを更新する代わりに、ベースモデルの汎用的な能力をそのまま保ちつつ、外部ハーネスを変更することでタスク固有の能力を獲得できます。先行研究では、自己生成ロールアウトがハーネス探索を支えられることが示されており、エージェントが練習を通じて新しいタスク能力を獲得できる可能性が示唆されています。しかし、長期的かつ確率的な環境では、自己練習は脆弱になります。報酬はスパースで、結果の分散は大きく、失敗を具体的なハーネスの仕組みに帰属させることが困難です。私たちは、ハーネス進化に対するデモンストレーションブートストラップ型のアプローチである DemoEvolve を紹介します。報酬のみの探索では範囲が広すぎてノイズが多い場合、熟練した人間の軌跡がコーディング提案者 (coding proposer) にとっての専門家の参照経験として機能し、ハーネスレベルの診断と編集を導きます。Liar's Dice での実験は、エピソードが短く失敗の原因を特定できる場合には、自己ロールアウト進化が機能しうることを示しています。対照的に、Balatro はより困難な長期的・確率的な状況を露呈させます。そこでは、自己ロールアウト進化はスパースなフィードバックと候補選択のノイズに惑わされ、チュートリアルのようなテキスト知識だけでは安定した改善が得られません。同じ限られた予算の下で、DemoEvolve はより効果的で監査可能なハーネス編集を生成し、より高い性能を達成します。全体として、デモンストレーションは、スパースフィードバック下のハーネス進化を、より診断しやすく、局所化しやすく、安定したものにします。

原文 (English)

DemoEvolve: Overcoming Sparse Feedback in Agentic Harness Evolution with Demonstrations

Agent harness evolution improves frozen language-model agents by modifying the executable structures around them. We study this paradigm as a form of sample-efficient fast adaptation: instead of updating model weights, an agent can acquire task-specific competence by changing its external harness, while leaving the base model's general capabilities intact. Prior work shows that self-generated rollouts can support harness search, suggesting that agents may acquire new task competence through practice. Yet in long-horizon stochastic environments, self-practice becomes fragile: rewards are sparse, outcomes are high-variance, and failures are hard to attribute to concrete harness mechanisms. We introduce DemoEvolve, a demonstration-bootstrapped approach to harness evolution. When reward-only search is too broad and noisy, competent human trajectories serve as expert reference experience for the coding proposer, guiding harness-level diagnosis and editing. Experiments on Liar's Dice show that self-rollout evolution can work when episodes are short and failures are attributable. In contrast, Balatro exposes a harder long-horizon stochastic regime, where self-rollout evolution is misled by sparse feedback and candidate-selection noise, while tutorial-like textual knowledge alone does not yield stable improvement. Under the same limited budget, DemoEvolve produces more effective and auditable harness edits and achieves better performance. Overall, demonstrations make sparse-feedback harness evolution more diagnosable, localizable, and stable.

2026-05-26 13:00 JST arXiv cs.AI LLM/生成AI エージェント

Hera: デバイスとクラウドの連携 LLM エージェントのための長期的な調整の学習

大規模言語モデル (LLM) エージェントは、環境との自律的な対話を通じて、複雑な長期タスクを解決することに優れています。しかし、実際の展開では、デバイスとクラウドの間の根本的なジレンマに直面します。オンデバイスモデルは効率的ですが脆弱なことが多く、クラウドモデルは強力ですが計算コストがかかります。最先端の LLM デバイス・クラウドルーターは通常、粗いタスクレベルの決定を行うため、複数ステップにわたるエージェントインタラクションの難易度の変化に適応できません。この問題に対処するために、長期タスク向けのステップレベルのデバイス・クラウド LLM エージェントコーディネーターであり、性能とコストの優れたパレートフロンティアを達成する Hera を紹介します。Hera は、新しい 2 段階のトレーニングパラダイムを採用しています。(1) コールドスタートのための模倣学習、続いて (2) タスクの成功とクラウドの使用効率を共同で最適化する強化学習です。第 1 段階では、ステップレベルのルーティングを教師あり分類問題として定式化します。デバイスエージェントをクラウドの軌跡上で再生し、各状態にデバイスとクラウドのアクションが一致するかどうかでラベルを付けます。第 2 段階では、軌跡間で同一の状態をグループ化し、より高い期待リターンとより少ない将来のクラウド呼び出しを優先するラベルで Hera を更新することで、コストを意識した強化学習を行います。ALFWorld、WebShop、および AppWorld で Hera を評価した結果、Hera は従来手法を一貫して上回り、クラウドを使用するステップを全体のわずか 46.3% に抑えながら、クラウドのみの場合の成功率の 92.5% を達成しました。

原文 (English)

Hera: Learning Long-Horizon Coordination for Device-Cloud Collaborative LLM Agents

Large language model (LLM) agents excel at solving complex long-horizon tasks through autonomous interaction with environments. However, their real-world deployment faces a fundamental device-cloud dilemma: on-device models are efficient but often brittle, while cloud models are stronger but costly in computation. State-of-the-art LLM device-cloud routers usually make coarse task-level decisions, which cannot adapt to the changing difficulty of multi-step agent interactions. To address this issue, we present Hera, a step-level device-cloud LLM agent coordinator for long-horizon tasks achieving a strong performance-cost Pareto frontier. Hera adopts a novel two-stage training paradigm: (1) imitation learning for cold-start, followed by (2) reinforcement learning that jointly optimizes task success and cloud usage efficiency. The first stage casts step-level routing as a supervised classification problem: the device agent is replayed on cloud trajectories, with each state labeled by the agreement between device and cloud actions. In the second stage, we perform cost-aware reinforcement learning by grouping identical states across trajectories and updating Hera with labels favoring higher expected return and fewer future cloud calls. We evaluate Hera on ALFWorld, WebShop, and AppWorld, where it consistently outperforms prior methods, achieving 92.5% of the cloud-only success rate with cloud use in only 46.3% of steps.
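
The first-stage labeling described above (replay the device agent on cloud trajectories and label each state by action agreement) can be sketched as follows. This is an illustrative reading of the abstract, not Hera's actual code: `label_routing`, the `(state, action)` trajectory format, and the mapping of agreement to a "device" label and disagreement to a "cloud" label are all assumptions.

```python
def label_routing(cloud_trajectory, device_policy):
    """Label each state in a cloud trajectory for a step-level router.

    cloud_trajectory: list of (state, cloud_action) pairs (hypothetical format).
    device_policy: callable mapping a state to the device model's action.
    Assumption: if the device reproduces the cloud action, the device suffices
    for that step; otherwise the step is routed to the cloud.
    """
    labels = []
    for state, cloud_action in cloud_trajectory:
        device_action = device_policy(state)
        route = "device" if device_action == cloud_action else "cloud"
        labels.append((state, route))
    return labels


# Toy replay: the device agrees with the cloud on two of three steps.
toy_trajectory = [("s0", "open"), ("s1", "click"), ("s2", "buy")]
toy_device_policy = {"s0": "open", "s1": "scroll", "s2": "buy"}.get
routing_labels = label_routing(toy_trajectory, toy_device_policy)
```

The resulting `(state, route)` pairs would form the supervised classification dataset for the cold-start stage; the cost-aware reinforcement learning stage is not covered by this sketch.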

2026-05-26 13:00 JST arXiv cs.AI LLM/生成AI エージェント ハードウェア/半導体

Agent-as-Peer-Debriefer: 定性分析のための視点ベースの改良を備えたマルチエージェントフレームワーク

大規模言語モデル (LLM) は定性データ分析 (QDA) に使用されることが増えていますが、その出力には人間による分析の深さやニュアンスが欠けていることがよくあります。私たちは、このギャップは、人間の QDA から信頼性を確保するための実践が欠けていることを反映していると主張します。ピア・デブリーフィングとは、アナリストが無関心なピアからフィードバックを求め、それを利用してコーディングを改良するものです。この実践を LLM 支援 QDA に導入するために、主要なコーディング手順にピアデブリーフィングを組み込むマルチエージェント QDA フレームワークである Agent-as-Peer-Debriefer を提案します。私たちのフレームワークでは、階層型コーディングエージェントが標準の QDA プロセスに従って、コード、サブテーマ、テーマ、および自己説明と反省メモを生成します。次に、これらの出力を 3 つのピアデブリーフィングエージェントと共有し、それぞれが異なる分析観点 (理論駆動、データ駆動、または応用) を適用し、コードの保持、名前変更、再割り当て、結合、または分割によってコードを洗練します。これらの視点は、ドメインとデータセット全体に一般化された確立された人間の QDA 実践から得られます。フレームワークを評価するために、3 つの LLM を使用して 2 つのドメインにわたる 3 つのデータセットでフレームワークをテストし、人間が注釈を付けたコードとの意味的な類似性を測定します。すべての設定において、パースペクティブベースのピアデブリーフィングの改良は、単一 LLM ベースラインよりも人間のコードとより密接に一致しており、アブレーションは、その利点が単に追加の改良によるものではないことをさらに示しています。また、3 つのパースペクティブは明確なトレードオフを生み出し、パースペクティブの選択が有意義で制御可能な設計上の決定であることを示しています。より広く言えば、これらの発見は、明確な視点を持ってピアデブリーフィングをシミュレートすることが、より信頼性の高い LLM 支援 QDA への有望な手段であることを示唆しています。

原文 (English)

Agent-as-Peer-Debriefer: A Multi-Agent Framework with Perspective-Based Refinement for Qualitative Analysis

Large language models (LLMs) are increasingly used for qualitative data analysis (QDA), yet their outputs often miss the depth and nuance of human analysis. We argue this gap reflects a missing credibility practice from human QDA: peer debriefing, in which an analyst seeks feedback from a disinterested peer and uses it to refine their coding. To bring this practice into LLM-assisted QDA, we propose Agent-as-Peer-Debriefer, a multi-agent QDA framework that builds peer debriefing into key coding steps. In our framework, a Hierarchical Coding Agent follows the standard QDA process to generate codes, sub-themes, and themes, along with self-explanations and reflection memos. It then shares these outputs with three Peer-Debriefing Agents, each applying a distinct analytical perspective (Theory-Driven, Data-Driven, or Applied) and refining the codes by keeping, renaming, reassigning, merging, or splitting them. These perspectives are drawn from established human QDA practices that generalize across domains and datasets. To evaluate the framework, we test it on three datasets across two domains with three LLMs, measuring semantic similarity to human-annotated codes. Across all settings, perspective-based, peer-debriefing refinement aligns more closely with human codes than a single-LLM baseline, and an ablation further shows the gain is not merely from additional refinement. The three perspectives also produce distinct trade-offs, showing that the choice of perspective is a meaningful and controllable design decision. More broadly, these findings suggest that simulating peer debriefing with explicit perspectives is a promising route to more credible LLM-assisted QDA.

2026-05-26 13:00 JST arXiv cs.AI LLM/生成AI エージェント 研究/論文

MDIA: HealthBench Professional のマルチエージェント診断インテリジェンスパイプライン

エージェント型 LLM の臨床ベンチマークで報告される性能向上の多くは、プロンプトエンジニアリングによるものとされがちですが、私たちの結果は、アーキテクチャおよびエンジンレベルの設計からより大きな改善が得られる可能性を示唆しています。私たちは、7 ノードの専門科ルーティング型臨床推論グラフとして実装されたマルチエージェント診断インテリジェンスエージェント MDIA を紹介し、ファインチューニングされていない LLM を用いた、HealthBench Professional ベンチマーク全体 (n = 525) での結果を報告します。MDIA は、OpenAI の GPT-5.4-2026-03-05 の下で 0.6272 を達成し、これは OpenAI の ChatGPT for Clinicians の性能を 3.72 pp 上回ります。実験により、性能向上はシステムアーキテクチャ (専門科ルーティング、マルチターンコンテキストの保持、薬剤状態の安全性ゲート、サイトフィルター付き検索、長さを意識した合成、エンジンレベルの信頼性) に起因することが示されました。これらの知見は、エージェント型臨床ベンチマークの性能が、基礎となる基盤モデルとオーケストレーションアーキテクチャの両方によって形作られるという見解を裏付けています。一方で、他のモデルをグレーダーとして使用した場合には顕著な差も見られました。特に、Gemini 2.5 Pro を使用した場合、MDIA のスコアは 0.6585 でした。これは、グレーダーの選択が変動の原因であることを示唆しています。したがって、LLM を頑健に評価するには、複数の独立したグレーダーモデルにわたる評価が必要になります。

原文 (English)

MDIA: A Multi-Agent Diagnostic Intelligence Pipeline on HealthBench Professional

Most reported gains on agentic-LLM clinical benchmarks are often attributed to prompt engineering, yet our results suggest that larger improvements can come from architectural and engine-level design. We present MDIA, a Multi-agent Diagnostic Intelligence Agent implemented as a 7-node specialty-routed clinical reasoning graph, on the full HealthBench Professional benchmark (n = 525), on a non-fine-tuned LLM. MDIA achieves 0.6272 under OpenAI's GPT-5.4-2026-03-05, which is +3.72 pp above the performance of OpenAI's ChatGPT for Clinicians. The experimental work shows that performance lift is attributable to system architecture: specialty routing, multi-turn context preservation, drug-state safety gating, site-filtered search, length-aware synthesis, and engine-level reliability. These findings support the view that agentic clinical benchmark performance is shaped both by the underlying foundation model and the orchestration architecture. Nevertheless, we also noticed notable differences when using other models as a grader; in particular, when using Gemini 2.5 Pro, MDIA scored 0.6585, which suggests that the choice of grader is a source of variability. Robust evaluation of LLMs would therefore require assessment across several independent grader models.

2026-05-26 13:00 JST arXiv cs.AI エージェント

マルチエージェント言語モデルを使用した自然主義的な音声日記内の妄想関連コンテンツの自動検出と分類

自然主義的な環境で録音された音声モノローグは、精神疾患の現象学を特徴づけ、症状の悪化を検出する機会を提供します。大規模言語モデル (LLM) は、トレーニングではなく主に評価のために注釈付きデータを必要とするため、このプロセスを自動化する新しい可能性を提供します。この論文では、中程度の被害妄想を持つ人々から収集した自然主義的な音声日記のトランスクリプトから、妄想的信念、関連する感情反応、および行動反応を示唆する言語をきめ細かくマルチラベル抽出するための、新しい自動化されたマルチエージェント LLM パイプラインを紹介します。 3 つの基礎モデルのアンサンブルを評価することで、詳細な診断プロンプト指示が妄想テーマ分類の誤検知を首尾よく減らすだけでなく、感情反応や行動反応の解釈を制限することも実証しました。さらに、複数のエージェントによる判決の枠組みを比較すると、エージェント間の複雑な会話による議論が時期尚早のコンセンサスを誘発し、臨床的に曖昧な文章の精度を低下させることが示されています。代わりに、多数決により堅牢なパフォーマンスが確立されます (妄想の検出と分類の Micro F1 はそれぞれ 0.872 と 0.779)。この研究は、自然主義的な音声における妄想的信念を示唆するコンテンツの自動検出と特徴付けのための、検証済みでスケーラブルなパイプラインを提供します。

原文 (English)

Automated Detection and Classification of Delusion-related Content in Naturalistic Audio Diaries Using Multi-Agent Language Models

Speech monologues recorded in naturalistic settings provide opportunities to characterize mental illness phenomenology and detect symptom exacerbation. Large language models (LLMs) offer new possibilities for automating this process, as they require annotated data primarily for evaluation rather than training. In this paper, we present a novel automated, multi-agent LLM pipeline for the fine-grained, multi-label extraction of language suggestive of delusional beliefs, associated affective responses, and behavioral responses from transcripts of naturalistic audio diaries collected from people with moderate persecutory ideation. Evaluating an ensemble of three foundation models, we demonstrate that detailed diagnostic prompt instructions successfully reduce false positives for delusional theme classification, but also constrain the interpretation of affective or behavioral responses. Furthermore, comparing multi-agent adjudication frameworks shows that complex conversational debate between agents diminishes accuracy on clinically ambiguous text by inducing premature consensus. Instead, majority voting establishes robust performance (Micro F1 of 0.872 and 0.779 for delusion detection and classification respectively). This work provides a validated and scalable pipeline for the automated detection and characterization of content suggesting delusional beliefs in naturalistic speech.
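
The majority-voting adjudication that the abstract reports as more robust than conversational debate can be illustrated with a small multi-label vote. This is a sketch under assumptions: the label names are hypothetical, and voting independently per label with a strict-majority threshold is one plausible reading, since the abstract does not specify the exact voting unit.

```python
from collections import Counter


def majority_vote(predictions):
    """Keep every label chosen by a strict majority of models.

    predictions: list with one set of labels per model (hypothetical format).
    """
    # Count each label at most once per model.
    counts = Counter(label for labels in predictions for label in set(labels))
    threshold = len(predictions) / 2
    return {label for label, count in counts.items() if count > threshold}


# Three models label the same transcript segment (labels are illustrative).
model_outputs = [
    {"persecutory", "fear"},
    {"persecutory"},
    {"persecutory", "avoidance", "fear"},
]
adjudicated = majority_vote(model_outputs)
```

With three voters a label needs at least two votes to survive, so a single model's idiosyncratic reading of an ambiguous passage cannot introduce or suppress a label on its own.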

2026-05-26 13:00 JST arXiv cs.AI エージェント ビジネス/資金調達

エージェントの不確実性定量化のための適正スコアリングルール

言語モデルエージェントは軌跡全体を通じて不確実性シグナルを出力することが増えていますが、既存のエージェント向け UQ 評価では、ランク付けの有用性と確率的な真実性が混同されがちです。AUROC、AUPRC、リスク・カバレッジ、Trajectory ECE、およびスカラー化された軌跡スコアは、識別性能、ビンごとのキャリブレーション、または集約された要約を評価しますが、プレフィックス条件付きの成功確率トレース全体 $q_t = P^{\pi}(Y=1 | H_t)$ を厳密に引き出す (elicit) わけではありません。私たちは、逐次予測的 (prequential) な適正スコアリングに基づいて、Trajectory Proper Score (TPS) を導入します。これは、最終的な成功の確率へとキャリブレーションされた任意のステップごとの不確実性シグナルに適用できる、予測器に依存しない厳密に適正な軌跡レベルのスコアリングルールの族です。私たちは、選択されたスコア族と重みスケジュールの範囲内で、完全な観測の下では TPS が成功確率過程を厳密に引き出すことを証明します。さらに、完全データのスコアを観測可能な停止プレフィックスに射影することで、この構成を管理上の打ち切り (administrative censoring) を受けた軌跡に拡張し、厳密な $q_Z$ 重み付き縮約スコアと、$q_Z$ が推定されていない場合の扱いやすい近似を得ます。また、一般的な軌跡評価指標は、プレフィックス条件付き確率過程全体よりも弱い対象をターゲットにしていることを示します。Trajectory ECE は分解能 (resolution) を捉えられず、スカラー化された Trajectory Brier は、トレース全体ではなく集約されたスカラーのみを引き出します。StrategyQA、Tau2-Bench、HotpotQA、および WebShop での実験は、これらの理論上の違いが実運用上も観測できることを示しています。つまり、確率の再キャリブレーションは、ランク指標をほとんど変えずに TPS を大きく変えうるほか、扱いやすい打ち切り近似は、完全に観測された軌跡のみを用いた評価と比べて判定を変えうるのです。

原文 (English)

Proper Scoring Rules for Agentic Uncertainty Quantification

Language-model agents increasingly emit uncertainty signals throughout a trajectory, but existing agentic UQ evaluations often conflate ranking usefulness with probabilistic truthfulness. AUROC, AUPRC, risk-coverage, Trajectory ECE, and scalarized trajectory scores evaluate discrimination, binwise calibration, or collapsed summaries, but do not strictly elicit the full prefix-conditioned success-probability trace $q_t = P^{\pi}(Y=1 | H_t)$. Building on prequential proper scoring, we introduce the Trajectory Proper Score (TPS), a predictor-agnostic family of strictly proper trajectory-level scoring rules for any per-step uncertainty signal calibrated into a probability of eventual success. We prove that TPS strictly elicits the success-probability process under complete observation, within the chosen score family and weight schedule. We extend the construction to administratively censored trajectories by projecting the complete-data score onto the observable stopped prefix, yielding an exact $q_Z$-weighted reduced score and a tractable approximation when $q_Z$ is unestimated. We further show that common trajectory evaluators target weaker objects than the full prefix-conditioned probability process: Trajectory ECE is resolution-blind, while scalarized Trajectory Brier elicits only the collapsed scalar, not the full trace. Experiments on StrategyQA, Tau2-Bench, HotpotQA, and WebShop show that these theoretical distinctions are operationally visible: probability recalibration can substantially change TPS while leaving rank metrics nearly unchanged, and the tractable censored approximation can change the verdict relative to complete-only evaluation.
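
To make the idea of scoring a full per-step probability trace concrete, here is a minimal sketch of one member such a family could contain: a weighted sum of per-step Brier scores against the final binary outcome. The choice of the Brier score, the uniform default weights, and the function name `trajectory_brier` are assumptions for illustration; the abstract does not give the exact TPS definition, and the censored-trajectory extension is not shown.

```python
def trajectory_brier(trace, outcome, weights=None):
    """Weighted per-step Brier score of a success-probability trace (lower is better).

    trace: predicted probabilities of eventual success, one per step.
    outcome: 1 if the trajectory eventually succeeded, else 0.
    weights: optional non-negative per-step weights; uniform if omitted.
    """
    if not trace:
        raise ValueError("trace must contain at least one step")
    if weights is None:
        weights = [1.0 / len(trace)] * len(trace)
    if len(weights) != len(trace):
        raise ValueError("weights and trace must have the same length")
    # Each term penalizes the squared gap between the step's forecast and the outcome.
    return sum(w * (q - outcome) ** 2 for w, q in zip(weights, trace))


# A trace that grows more confident on a successful trajectory.
confident_score = trajectory_brier([0.5, 0.7, 0.9], outcome=1)
# An uninformative trace on the same trajectory scores worse.
flat_score = trajectory_brier([0.5, 0.5, 0.5], outcome=1)
```

Because every step contributes its own term, two traces that share the same final value can still receive different scores, which is the distinction from a single collapsed scalar that the abstract draws.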

2026-05-26 13:00 JST arXiv cs.AI LLM/生成AI エージェント ビジネス/資金調達 研究/論文

PRIMA: 検証可能なアイデンティティと収束的フィードバックを備えた、回復力のあるマルチエージェント研究のための運用パターン

LLM を、数時間に及ぶ実行にわたって協調するマルチエージェント研究システムとして運用すると、単発の評価では現れない障害モードが表面化します。たとえば、上流のプロバイダーが警告なしにスロットリングを行う、サブエージェントが利用可能なツールに合わせてタスクをずらす、仕組みを使う代わりにそれについて語るだけになる、改訂のたびに自己謝罪から始める、上流のコンテキストを実行可能な指示として扱う、といったものです。私たちは PRIMA を紹介します。その主な貢献は、これらの障害モードを乗り切るための 3 つの運用パターンです。(1) 上流のレート制限シグナルを検出し、型付きの一時停止レコードをディスクに永続化し、プロセスの再起動をまたいでも収束済みの作業を再実行せずに長時間の実行を再開する、回復力と復旧のための層。(2) タスクへの忠実さ、ツールの使用、改訂、およびステップ間のコンテキスト境界に関する規範を、構造的なプロンプト層としてエンコードするサブエージェントの運用規律。(3) 互いに直交するドラフトステップと、最終的な統合の前に行う明示的な文書間調和パスとを組み合わせた、構造化されたエンジニアリング成果物のための多段階アプリケーションパターン。これらは、次の基盤プロトコルの上に位置します。明示的な収束基準を備えた研究プログラム仕様言語、デュアルメトリックのスコアリングエンジン (LLM が判定するルーブリックとサンドボックス化されたコード)、外側のメタ最適化ループ、イベント駆動型の永続化、フックベースのミドルウェア、コンテキスト圧縮、およびマルチプロバイダー LLM 抽象化です。エージェントの ID は素数冪から導出され、中央レジストリなしで、衝突のない識別子と容易に検証可能なクラスターメンバーシップを提供します。理論的な保証には、$O(k)$ の検証、$O(V+E)$ の DAG 検証、および算術の基本定理による ID の衝突回避が含まれます。グラフ同型性のケーススタディは、アーキテクチャ上の主張を生成された成果物によって裏付けます。それは、3 つの定理と 5 つの予想を含む新しい標準形アルゴリズムを提案する研究論文を生み出した、6 ステップのプロトコルです。

原文 (English)

PRIMA: Operational Patterns for Resilient Multi-Agent Research with Verifiable Identity and Convergent Feedback

Operating LLMs as coordinated multi-agent research systems over multi-hour runs surfaces failure modes that single-shot evaluation cannot: upstream providers throttle without warning, sub-agents drift the task to fit accessible tools, narrate machinery instead of using it, open revision iterations with self-apology, or treat upstream context as executable directives. We present PRIMA, whose primary contributions are three operational patterns for surviving these failure modes: (1) a resilience-and-recovery layer that detects upstream rate-limit signals, persists a typed pause record to disk, and resumes long-running runs without re-executing converged work even across process restarts; (2) a sub-agent operating discipline encoding task-fidelity, tool-use, revision, and inter-step context-boundary norms as a structural prompt layer; (3) a multi-phase application pattern for structured engineering deliverables pairing orthogonal draft steps with an explicit cross-document harmonization pass before final synthesis. These sit atop a foundational protocol: a research-program specification language with explicit convergence criteria, a dual-metric scoring engine (LLM-judged rubric plus sandboxed code), an outer meta-optimization loop, event-driven persistence, hook-based middleware, context compaction, and a multi-provider LLM abstraction. Agent identities derive from prime powers, giving collision-free identifiers and trivially-verifiable cluster membership without a central registry. Theoretical guarantees include $O(k)$ verification, $O(V+E)$ DAG validation, and identity collision freedom by the Fundamental Theorem of Arithmetic. A Graph Isomorphism case study grounds the architectural claims in a generated artifact: a six-step protocol that produced a research paper proposing a new canonical-form algorithm with three theorems and five conjectures.
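
The prime-power identity claim can be illustrated with a small sketch. The concrete scheme here is an assumption, since the abstract does not spell it out: each cluster is assigned a distinct prime `p`, and the `k`-th agent in that cluster gets the identifier `p**k`. Under that assumption, unique factorization guarantees that different `(p, k)` pairs never collide, and membership can be checked locally with `k` divisions.

```python
def agent_id(cluster_prime: int, index: int) -> int:
    """Identifier for the index-th agent of a cluster (hypothetical scheme: p**index)."""
    if cluster_prime < 2 or index < 1:
        raise ValueError("cluster_prime must be >= 2 and index must be >= 1")
    return cluster_prime ** index


def member_index(identifier: int, cluster_prime: int):
    """Return k if identifier == cluster_prime**k with k >= 1, otherwise None.

    Uses k exact divisions, so the cost grows linearly in the exponent.
    Primality of cluster_prime is assumed, not checked.
    """
    if cluster_prime < 2:
        raise ValueError("cluster_prime must be >= 2")
    if identifier < 2:
        return None
    k = 0
    while identifier % cluster_prime == 0:
        identifier //= cluster_prime
        k += 1
    # Any leftover factor means the identifier is not a pure power of this prime.
    return k if identifier == 1 else None


fourth_agent = agent_id(3, 4)                    # identifier in the cluster keyed by prime 3
same_cluster = member_index(fourth_agent, 3)     # recovers the agent's index
other_cluster = member_index(fourth_agent, 5)    # not a power of 5
```

No registry lookup is involved: any party that knows a cluster's prime can verify membership from the identifier alone.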

2026-05-26 13:00 JST arXiv cs.AI エージェント

PANDO: オンラインスキル蒸留による効率的なマルチモーダル AI エージェント

マルチモーダル Web エージェントの最近の進歩は、多くの場合、ロールアウト検索、検証パス、オフラインスキル検出、スペシャリストモデルスタックなどの推論時間の計算の増加に依存しています。これにより、Web エージェントのコストが高くなるのではなく、経験を蓄積するにつれて効率が向上する可能性があるという中心的な疑問が生じます。まず、VisualWebArena からの軌跡を分析し、反復アクションループ、隠れた検出コスト、プロンプトキャッシュの再利用の低さという 3 つの繰り返し発生する非効率の原因を特定します。次に、構造化されたスキルライブラリを維持し、進捗状況の反映、信頼度に基づくスキルの降格、階層型ルーティング、視覚的圧縮、およびキャッシュを意識したプロンプトを組み合わせた、単一ロールアウトのオンラインスキル抽出フレームワークである PANDO を紹介します。 910 の VisualWebArena タスクのフルセットで、PANDO は 58.3% の成功率を達成し、SGV (54.0%) や WALT 再現 (45.2%) を上回っています。その一方で、使用するトークンの量は SGV より 58%、WALT より 61% 少なく、事前評価の検出予算はありません。さらに、300 タスクのアブレーションでは、ルーティング、圧縮、キャッシュを意識したプロンプトが、より大きなスキルライブラリをより低い限界トークンコストに変換しながら、ルールとルーチンが成功の利益のほとんどを提供することを示しています。最後に、アクション繰り返し率、ステップオーバーヘッド率、プロンプトキャッシュ使用率という 3 つの軌跡レベルの効率性指標を導入して、最終的な成功を超えた効率性を可視化します。

原文 (English)

PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist model stacks. This raises a central question: can a web agent become more efficient as it accumulates experience, rather than more expensive? We first analyze trajectories from VisualWebArena and identify three recurring sources of inefficiency: repeat-action loops, hidden discovery costs, and low prompt-cache reuse. We then introduce PANDO, a single-rollout online skill-distillation framework that maintains a structured Skill Library and combines progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. On the full set of 910 VisualWebArena tasks, PANDO achieves a 58.3% success rate, outperforming SGV (54.0%) and our WALT reproduction (45.2%), while using 58% fewer tokens than SGV and 61% fewer tokens than WALT, without any pre-evaluation discovery budget. A 300-task ablation further shows that rules and routines provide most of the success gains, while routing, compression, and cache-aware prompting convert the larger skill library into lower marginal token cost. Finally, we introduce three trajectory-level efficiency metrics -- Action Repetition Rate, Step Overhead Ratio, and Prompt Cache Utilization -- to make efficiency visible beyond terminal success.

2026-05-26 13:00 JST arXiv cs.AI エージェント

エージェントマニュファクチャリング: 第一級の産業エンティティとしての基盤モデルエージェント

製造業は、機械化、電化、プログラマブルオートメーション、スマートマニュファクチャリングという、広く認識されている 4 つのパラダイムを経てきました。それぞれのパラダイムは、人間から機械へ移された作業の種類によって定義されます。いずれの場合も、産業上の作業のうち 1 つの層は本質的に人間のものであり続けました。それは生産の調整的認知であり、エンジニア、プランナー、運用管理者が担う解釈、配分、診断、交渉、およびガバナンスの作業から成ります。私たちは、第 5 の移行が現在進行中であり、そこでは、その下にある物理層や定型的な認知層ではなく、この層こそが、基盤モデルに基づく自律エージェントによって主に再配分されるものであると主張します。私たちはこのパラダイムをエージェントマニュファクチャリング (Agent Manufacturing) と名付け、操作的に定義します。すなわち、ある製造システムの主要な調整メカニズムが、オープンエンドな目標を解釈し、長期にわたって計画を立て、ツールや機械を呼び出し、他のエージェントや人間と交渉できる基盤モデルエージェントによる推論である場合、その製造システムはエージェントマニュファクチャリングの一例です。これは、コグニティブマニュファクチャリングやインダストリー 5.0 に関する既存の文献が与える定義よりも狭く、より反証可能な定義であり、このパラダイムを、閉じたプロトコル空間の中でのみ自律的であった古典的なマルチエージェント製造システムから明確に区別します。

原文 (English)

Agent Manufacturing: Foundation-Model Agents as First-Class Industrial Entities

Manufacturing has passed through four widely recognized paradigms - mechanization, electrification, programmable automation, and Smart Manufacturing - each defined by the kind of work it shifted from humans to machines. In every case, one layer of industrial work remained fundamentally human: the coordinative cognition of production, comprising the interpretive, allocative, diagnostic, negotiative, and governance work exercised by engineers, planners, and operational managers. We argue that a fifth transition is now underway in which this layer, rather than the physical or routine-cognitive layers below it, is what foundation-model-based autonomous agents primarily redistribute. We name this paradigm Agent Manufacturing and define it operationally: a manufacturing system is an instance of Agent Manufacturing when its principal coordination mechanism is reasoning performed by foundation-model agents that can interpret open-ended goals, plan over long horizons, invoke tools and machines, and negotiate with other agents and humans. This is a narrower and more falsifiable definition than the existing literature on cognitive manufacturing or Industry 5.0 provides, and it distinguishes the paradigm sharply from classical multi-agent manufacturing systems, which were autonomous only within closed protocol spaces.

2026-05-26 13:00 JST arXiv cs.AI LLM/生成AI エージェント

暗黙のルールを探るためのテスト時の深い思考

大規模言語モデル (LLM) の継続的な進歩により、インテリジェントエージェントの重要性がますます高まっています。ただし、これらのエージェントは、暗黙のルール (直接観察できず対話を通じて推測する必要がある隠れた制約) によって管理される環境では失敗することがよくあります。これにより、エージェントは反復的な試行錯誤のループに陥り、最終的にはタスクの失敗につながります。この課題に対処するために、私たちは、思考コンポーネントがインタラクション履歴を分析してこれらの暗黙のルールを推測し、アクターをガイドするフレームワークである Test-Time Exploration (TTExplore) を提案します。この状況で効果的に探索できるかどうかは、思考者の推論能力に大きく依存します。ただし、深い推論の軌跡の評価は本質的に不安定で困難であるため、効果的なトレーニングに大きな障害となります。この問題を克服するために、新しく安定した強化学習パイプラインを導入します。中心となるアイデアは、正確なタスクレベルのスコアを間接的な報酬として使用して、中間推論を評価する難しさを回避し、報酬の希薄性を軽減するために軌道ごとに 1 つの思考ノードのみを保持することです。このパイプラインを使用して、特殊な 7B モデルである Exp-Thinker をトレーニングします。 5 つのテキストベースの具体化されたタスクに関する実験では、Exp-Thinker を備えた TTExplore がベースラインエージェントのパフォーマンスを平均 $14$ ～ $19$ ポイント改善することが示され、暗黙のルールについて明示的に推論することの有効性が実証されました。

原文 (English)

Test-Time Deep Thinking to Explore Implicit Rules

With the continuous advancement of Large Language Models (LLMs), intelligent agents are becoming increasingly vital. However, these agents often fail in environments governed by implicit rules, that is, hidden constraints that cannot be observed directly and must be inferred through interaction. This causes agents to fall into repetitive trial-and-error loops, ultimately leading to task failure. To address this challenge, we propose Test-Time Exploration (TTExplore), a framework where a thinker component analyzes interaction history to infer these implicit rules and guide an actor. Effective exploration in this setting critically depends on the reasoning ability of the thinker. However, evaluating deep reasoning trajectories is inherently unstable and difficult, which poses a major obstacle to effective training. To overcome this issue, we introduce a novel and stable reinforcement learning pipeline. The core idea is to use accurate task-level scores as indirect rewards to bypass the difficulty of evaluating intermediate reasoning, and to retain only a single thinking node per trajectory to alleviate reward sparsity. Using this pipeline, we train a specialized 7B model, Exp-Thinker. Experiments on five text-based embodied tasks show that TTExplore equipped with Exp-Thinker improves baseline agent performance by an average of 14 to 19 points, demonstrating the effectiveness of explicitly reasoning about implicit rules.

2026-05-26 13:00 JST arXiv cs.AI エージェント

ProActor: プロアクティブなタスクスケジューリングエージェントのためのタイミングを意識した強化学習

プロアクティブなタスク指向のエージェントは、ユーザーのニーズを自律的に予測し、実行可能な機会を特定し、適切なタイミングでソフトウェアアクションをトリガーする必要があり、明示的な指示を待つ事後対応のシステムから根本的に移行します。しかし、既存のアプローチには、そのような予期的な行動を測定および最適化するための一般化可能なエンドツーエンドのソリューションがありません。この文書では、(1) 厳格なポイントラベルの代わりに完全な機会時間ウィンドウを生成することにより、スケーラブルなプロアクティブ性強化学習 (RL) を可能にする、ドメインに依存しない自動アノテーション手法、(2) タイミング品質と参照アクションの調整の両方をキャプチャする体系的なプロアクティブ性メトリクス、(3) さまざまな報酬設計を備えた GRPO を使用した RL 最適化を統合する、会話型タスクスケジューリング用の統一フレームワークである ProActor を紹介します。私たちの洞察は、プロアクティブネスルーブリックを備えた RULER ベースの報酬がタイミングの品質を向上させるために重要であり、段階を意識した複合報酬によって可能になるプロアクティブネスの最適化がタイミングの品質と基準アクションの調整のバランスをとる鍵であるということです。タイミングを意識した RL には広範な調査が必要であり、効率的なインフラストラクチャが必要です。私たちは、シングルノードマルチ GPU システム上でリクエスト適応推論クラスターと DDP ベースのトレーニングを組み合わせた適応フレームワークである ART-F を開発し、4 ビット Qwen2.5-14B-ProActor-Q4 の LoRA トレーニングを 4 ～ 8 倍のスピードアップで実現します。新しく自動アノテーションが付けられた 2 つのデータセットでの実験では、最先端 (SOTA) ベースラインに匹敵するアクションの一貫性を維持しながら、プロアクティブなタイミングが大幅に改善されたことが実証されました。アブレーションは、個別の複合報酬変動の有効性を検証します。

原文 (English)

ProActor: Timing-Aware Reinforcement Learning for Proactive Task Scheduling Agents

Proactive task-oriented agents must autonomously anticipate user needs, identify actionable opportunities, and trigger software actions at appropriate moments - fundamentally shifting from reactive systems that await explicit instructions. However, existing approaches lack generalizable end-to-end solutions for measuring and optimizing such anticipatory behaviors. This paper introduces ProActor, a unified framework for conversational task scheduling that integrates: (1) a domain-agnostic automated annotation methodology that enables scalable proactiveness reinforcement learning (RL) by generating full opportunity time windows instead of rigid point labels, (2) systematic proactiveness metrics capturing both timing quality and reference action alignment, and (3) RL optimization using GRPO with various reward designs. Our insight is that RULER-based rewards with proactiveness rubrics are crucial for improving timing quality, and that proactiveness optimization enabled by stage-aware composite rewards is key to balancing timing quality and reference action alignment. Timing-aware RL requires extensive exploration, demanding efficient infrastructure. We develop ART-F, an adaptive framework combining request-adaptive inference clusters with DDP-based training on single-node multi-GPU systems, enabling LoRA training of 4-bit Qwen2.5-14B-ProActor-Q4 with 4-8x speedups. Experiments on two newly auto-annotated datasets demonstrate significant improvements in proactive timing while maintaining action consistency comparable to state-of-the-art (SOTA) baselines. Ablations validate the effectiveness of distinct composite reward variations.

2026-05-26 13:00 JST arXiv cs.AI エージェント

協力空戦のための進化的に強化されたマルチエージェント強化学習

現代の空戦が視域外（BVR）複数航空機の協力交戦に向けて進化するにつれ、無人戦闘航空機（UCAV）の自律的な意思決定は、高次元の状態空間、個別の行動コマンド、および敵対的な動的環境による重大な課題に直面しています。このような環境における既存のマルチエージェント強化学習 (MARL) 手法の限界、つまり不十分な探索効率、低いサンプル利用率、不十分なポリシー一般化を克服するために、進化的アルゴリズムと MAPPO を統合したハイブリッド学習フレームワークである敵対的カリキュラムと進化的強化マルチエージェント近接ポリシー最適化 (ACE-MAPPO) を提案します。具体的には、集団の多様性を強化し、局所最適への収束を緩和するために、遺伝的ソフト更新メカニズムが導入されています。進化的に強化された優先順位付き軌道再生戦略がさらに採用され、まばらな高価値サンプルの利用率が向上します。さらに、敵対的進化カリキュラム学習メカニズムは、段階的に難易度を上げながら適応トレーニングを可能にするように設計されています。広範な実験結果により、提案された方法が訓練の安定性、収束速度、勝率の点で MAPPO や他のベースラインアルゴリズムよりも優れていることが実証され、複数航空機の協力空戦シナリオにおけるその有効性が検証されています。

原文 (English)

Evolutionary Enhanced Multi-Agent Reinforcement Learning for Cooperative Air Combat

As modern air combat evolves toward beyond-visual-range (BVR) multi-aircraft cooperative engagements, autonomous decision-making for unmanned combat aerial vehicles (UCAVs) faces significant challenges due to high-dimensional state spaces, discrete action commands, and strongly adversarial dynamic environments. To overcome the limitations of existing multi-agent reinforcement learning (MARL) methods in such settings, namely insufficient exploration efficiency, low sample utilization, and poor policy generalization, we propose Adversarial Curriculum and Evolutionary-enhanced Multi-agent Proximal Policy Optimization (ACE-MAPPO), a hybrid learning framework that integrates evolutionary algorithms with MAPPO. Specifically, a genetic soft update mechanism is introduced to enhance population diversity and mitigate convergence to local optima. An evolutionary-augmented prioritized trajectory replay strategy is further employed to improve the utilization of sparse high-value samples. In addition, an adversarial evolutionary curriculum learning mechanism is designed to enable adaptive training with progressively increasing difficulty. Extensive experimental results demonstrate that the proposed method outperforms MAPPO and other baseline algorithms in terms of training stability, convergence speed, and win rate, validating its effectiveness in multi-aircraft cooperative air combat scenarios.

2026-05-26 13:00 JST arXiv cs.AI エージェント

RECTOR: コンプライアンスを意識した自動運転軌道選択のための優先度を意識したルールベースの再ランキング

自動運転スタックは、マルチモーダルな候補セットから 1 つの軌道を選択する必要があります。モデルの信頼度だけで選択すると、安全性、交通法規、快適性の制約が無視されます。RECTOR (Rule-Enforced Constrained Trajectory Orchestrator) は、微分可能なプロキシとシーン条件付きの適用性メカニズムを介して階層化されたルールブック (Safety ≻ Legal ≻ Road ≻ Comfort) に対して候補をスコア付けし、決定論的な ε-辞書式ルールで選択する、生成後の再ランキング層です。予測器を再トレーニングすることなく、構成上、階層間の優先度を保持します。Waymo Open Motion Dataset の validation_interactive 分割 (43,219 の拡張インスタンス、K=6) では、Protocol B (28 ルールのプロキシカタログ、オラクル適用性) の下で、ルールを意識した選択により、同じ候補に対する信頼度のみの選択と比べて、Safety+Legal 違反が 28.58% から 20.42% に、Total が 40.32% から 32.41% に削減されます。均一重みの加重和ベースラインは、このベンチマークでのバイナリ準拠率が一致します。経験的な改善はルールを意識したランキングから得られますが、辞書式の保証は、重みのキャリブレーションでは再現できない構造的な差別化要因です。敵対的な信頼度の破損の下では、信頼度のみの選択は 100% のシナリオで失敗しますが、両方のルール認識セレクターは約 96% で注入されたモードを拒否します。すべての数値は、プロキシ評価器による結果 (安全証明ではありません) であり、開ループ、5 秒ホライズン、米国のルール、検証分割に基づきます。

原文 (English)

RECTOR: Priority-Aware Rule-Based Reranking for Compliance-Aware Autonomous Driving Trajectory Selection

Autonomous driving stacks must pick one trajectory from a multi-modal candidate set; choosing by model confidence ignores safety, traffic-law, and comfort constraints. We present RECTOR (Rule-Enforced Constrained Trajectory Orchestrator), a post-generation reranking layer that scores candidates against a tiered rulebook (Safety ≻ Legal ≻ Road ≻ Comfort) via differentiable proxies and a scene-conditioned applicability mechanism, then selects with a deterministic ε-lexicographic rule that preserves cross-tier priority by construction -- without retraining the predictor. On the Waymo Open Motion Dataset validation_interactive split (43,219 augmented instances, K=6), under Protocol B (28-rule proxy catalog, oracle applicability) rule-aware selection cuts Safety+Legal violations from 28.58% to 20.42% and Total from 40.32% to 32.41% versus confidence-only on the same candidates. A uniform-weight weighted-sum baseline matches binary compliance on this benchmark -- the empirical lift comes from rule-aware ranking, while the lexicographic guarantee is the structural differentiator no weight calibration can replicate. Under adversarial confidence corruption, confidence-only selection fails in 100% of scenarios while both rule-aware selectors reject the injected mode in ~96%. All figures are proxy-evaluator results (not a safety certificate), open-loop, 5 s horizon, U.S. rules, validation split.
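One common way to read an ε-lexicographic rule is sketched below: tiers are visited in priority order, and at each tier only the candidates within ε of the best violation score survive, so a lower tier can never override a higher one. The per-tier scores, the ε value, the confidence tie-break, and the function name are illustrative assumptions, not RECTOR's actual implementation.

```python
TIERS = ("safety", "legal", "road", "comfort")  # highest priority first


def epsilon_lexicographic_select(candidates, epsilon=0.01):
    """Pick one candidate; lower violation scores are better.

    At each tier, keep only candidates within `epsilon` of the tier's best
    score, then move to the next tier. Remaining ties are broken by model
    confidence (an assumed tie-break).
    """
    survivors = list(candidates)
    for tier in TIERS:
        best = min(c["violations"][tier] for c in survivors)
        survivors = [c for c in survivors if c["violations"][tier] <= best + epsilon]
        if len(survivors) == 1:
            break
    return max(survivors, key=lambda c: c["confidence"])


# Toy candidate set with made-up proxy scores.
candidates = [
    {"name": "A", "confidence": 0.9,
     "violations": {"safety": 0.30, "legal": 0.0, "road": 0.0, "comfort": 0.0}},
    {"name": "B", "confidence": 0.4,
     "violations": {"safety": 0.00, "legal": 0.2, "road": 0.0, "comfort": 0.1}},
    {"name": "C", "confidence": 0.6,
     "violations": {"safety": 0.005, "legal": 0.0, "road": 0.0, "comfort": 0.5}},
]
chosen = epsilon_lexicographic_select(candidates)
confidence_only = max(candidates, key=lambda c: c["confidence"])
```

In this toy set, confidence-only selection picks A despite its safety violation, while the tiered rule drops A at the safety tier and B at the legal tier, leaving C even though its comfort score is the worst.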

2026-05-26 13:00 JST arXiv cs.AI エージェント 研究/論文

SimuWoB: 高速かつ忠実な GUI エージェントベンチマークのための現実世界のモバイルアプリのシミュレーション

大規模な言語モデルを利用したモバイル GUI エージェントは急速に進歩しており、現実的かつ包括的な評価に対する緊急のニーズが生じています。既存のベンチマークは再現性を優先していますが、実際のアプリケーションで報酬を構築することが難しいため、多くの場合、オープンソースアプリまたはファイル操作タスクに限定されており、ベンチマーク設定と実際の使用状況の間にギャップが生じています。さらに、ほとんどのベンチマークは基本的な接地とナビゲーションに焦点を当てており、複雑で長期にわたる相互作用の範囲は限られています。これらの制限に対処するために、さまざまなタイプと難易度にわたる 120 の困難なタスクを備えたモバイル GUI エージェント用の完全合成ベンチマークである SimuWoB を導入します。私たちは、忠実度の高いタスクと環境を合成し、各タスクに対して有効な報酬を自動的に提供する、堅牢な仮想環境生成フレームワークを構築します。各環境は、URL 経由でアクセスできるバックエンドのない Web ページとしてデプロイされ、効率的で再現可能な評価が可能になります。私たちは、いくつかの最先端のモバイル GUI エージェントで包括的な実験を実施しています。平均成功率はわずか 27.92% であり、長期的なタスクでは 17.82% に低下します。これは、複雑なシナリオの下での現在のエージェントの重大な弱点を明らかにしています。評価結果を実際のサンプルタスクと比較すると、合成環境に基づくエージェントの評価が一般化していることがわかります。さらに、主要な機能の側面にわたる診断上の洞察を提供し、将来のモバイル GUI エージェント開発への影響について説明します。

原文 (English)

SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking

Mobile GUI agents powered by large language models have progressed rapidly, creating an urgent need for realistic and comprehensive evaluation. Existing benchmarks prioritize reproducibility but are often limited to open-source apps or file-operation tasks because of the difficulty of constructing rewards on real applications, leaving a gap between benchmark settings and real-world usage. Moreover, most benchmarks focus on basic grounding and navigation, with limited coverage of complex, long-horizon interactions. To address these limitations, we introduce SimuWoB, a fully synthetic benchmark for mobile GUI agents with 120 challenging tasks spanning diverse types and difficulty levels. We build a robust virtual environment generation framework that synthesizes high-fidelity tasks and environments, and automatically provides valid rewards for each task. Each environment is deployed as a backend-free webpage accessible via URL, enabling efficient and reproducible evaluation. We conduct comprehensive experiments on several state-of-the-art mobile GUI agents. The average success rate is only 27.92%, dropping to 17.82% on long-horizon tasks, which reveals substantial weaknesses in current agents under complex scenarios. A comparison of evaluation results with real-world sample tasks demonstrates that agent assessments based on our synthetic environment generalize well. We further provide diagnostic insights across key capability dimensions and discuss implications for future mobile GUI agent development.

2026-05-26 13:00 JST arXiv cs.AI LLM/生成AI エージェント ハードウェア/半導体

DarkForest: マルチエージェント LLM の会話を減らし、精度を向上

マルチエージェント LLM システムは、複数のエージェントからの出力を組み合わせることで推論を改善しますが、対話が多い方法ではエラーの伝播と高い通信オーバーヘッドが発生する可能性があります。エージェントが生の応答や推論トレースを交換すると、間違った中間推論が採用され増幅され、自信はあるものの間違った合意が得られる可能性があります。マルチラウンド通信により、トークンの消費量、待ち時間、推論コストも増加します。この論文では、DarkForest という名前の制御された通信調整フレームワークを提案します。 DarkForest はまずエージェントを独立させて、各エージェントが他のエージェントの出力を見ることなく応答を生成します。次に、生の応答を構造化された候補レコードに解析し、意味的に同等の候補をクラスターにグループ化し、エージェントの信頼性、信頼度、解析品質、サポートパターンの信頼性、および独立性補正を使用して、これらのクラスターにわたる校正された信念分布を推定します。コーディネーターは、制御されたコミュニケーションにより、この信念状態からポリシーで許可された証拠のみを受け取ります。 6 つの推論ベンチマークに関する実験では、DarkForest が最高の全体的な品質を達成し、ベンチマークメトリクスで最も強力なベースラインを最大 30.7% 改善し、通信の多いベースラインと比較してトークン消費を最大 6.5 倍削減することが示されています。

原文 (English)

DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMs

Multi-agent LLM systems improve reasoning by combining outputs from multiple agents, but interaction-heavy methods can introduce error propagation and high communication overhead. When agents exchange raw responses or reasoning traces, incorrect intermediate reasoning may be adopted and amplified, leading to confident but wrong consensus; multi-round communication also increases token consumption, latency, and inference cost. In this paper, we propose a controlled-communication coordination framework named DarkForest. DarkForest first keeps agents independent, so each agent produces an answer without seeing the others' outputs. It then parses the raw responses into structured candidate records, groups semantically equivalent candidates into clusters, and estimates a calibrated belief distribution over these clusters using agent reliability, confidence, parse quality, support-pattern reliability, and independence corrections. A coordinator receives only policy-permitted evidence from this belief state with controlled communication. Experiments on six reasoning benchmarks show that DarkForest achieves leading overall quality, improves the strongest baseline by up to 30.7% on benchmark metrics, and reduces token consumption by up to 6.5x compared with communication-heavy baselines.
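The cluster-then-estimate step can be pictured with the simplified sketch below. It makes strong assumptions: semantic equivalence is approximated by plain string normalization, each record's weight is just reliability times confidence, and parse quality, support-pattern reliability, and independence corrections are left out. Field and function names are hypothetical.

```python
from collections import defaultdict


def normalize(answer):
    """Crude stand-in for semantic equivalence: case, whitespace, trailing period."""
    return " ".join(answer.lower().split()).rstrip(".")


def belief_over_clusters(records):
    """Group independently produced answers and return a normalized belief
    distribution, weighting each record by reliability * confidence."""
    weights = defaultdict(float)
    for record in records:
        weights[normalize(record["answer"])] += record["reliability"] * record["confidence"]
    total = sum(weights.values())
    return {cluster: weight / total for cluster, weight in weights.items()}


# Three agents answer without seeing each other's outputs (toy values).
records = [
    {"agent": "a1", "answer": "42", "reliability": 0.9, "confidence": 0.8},
    {"agent": "a2", "answer": "42.", "reliability": 0.6, "confidence": 0.5},
    {"agent": "a3", "answer": "41", "reliability": 0.5, "confidence": 0.9},
]
belief = belief_over_clusters(records)
top_cluster = max(belief, key=belief.get)
```

Only the aggregated `belief` would be passed on to a coordinator in this picture, rather than the agents' raw reasoning traces.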

2026-05-26 13:00 JST arXiv cs.AI エージェント

メタエージェント: タスクの説明から検証済みのマルチエージェントシステムまで

複雑な複数ステップのタスクを解決するために AI エージェントがますます使用されていますが、ワークフローの規模と深さが増大するにつれて、既存のマルチエージェントフレームワークは脆弱なままです。中間段階での小さなエラーはエージェントの相互作用を通じて伝播する可能性があり、また不十分な接地と弱い検証メカニズムにより信頼性がさらに制限されます。我々は、自然言語タスク記述から特殊なマルチエージェントシステムを自動的に構築して実行する 2 フェーズフレームワークである Meta-Agent を紹介します。構築フェーズでは、タスクプランナーが問題を、明示的な入出力契約と検証基準を備えたエージェント仕様の有向非循環グラフに分解します。 Web 検索モジュールは外部証拠に基づいて各仕様を根拠にし、コード生成モジュールはシステムプロンプトとツール構成を生成します。次に、構築時の検証ステージで、生成されたアーティファクトを検証し、障害が検出された場合にターゲットを絞った再生成をトリガーします。実行フェーズでは、コーディネーターがエージェントグラフ全体にサブタスクをディスパッチし、実行時の検証によって中間出力がゲートされます。さらに、ローカル障害、アップストリーム障害、構造障害を区別する 3 レベルのエラー原因特定メカニズムを導入し、局所的な再試行から部分的な再実行や再分解に至るまで、対象を絞った回復戦略を可能にします。私たちは、コーディング、コンテキスト学習、およびオープンエンド推論タスク全体にわたって Meta-Agent を評価します。強力なマルチエージェントベースラインに対する実験とアブレーション研究により、タスクの成功率、エラー回復、ワークフローの安定性が一貫して向上していることが実証されています。この結果は、信頼性の高いマルチエージェントシステムを構築するには、計画、基礎、検証を緊密に統合することの重要性を浮き彫りにしています。

原文 (English)

Meta-Agent: From Task Descriptions to Verified Multi-Agent Systems

AI agents are increasingly used to solve complex, multi-step tasks, but existing multi-agent frameworks remain brittle as workflows grow in scale and depth. Small errors at intermediate stages can propagate through agent interactions, while insufficient grounding and weak verification mechanisms further limit reliability. We present Meta-Agent, a two-phase framework that automatically constructs and executes specialized multi-agent systems from natural-language task descriptions. In the construction phase, a task planner decomposes a problem into a directed acyclic graph of agent specifications with explicit input/output contracts and verification criteria. A web search module grounds each specification with external evidence, and a code generation module produces system prompts and tool configurations. A construction-time verification stage then validates generated artifacts and triggers targeted regeneration when failures are detected. In the execution phase, a coordinator dispatches subtasks across the agent graph while execution-time verification gates intermediate outputs. We further introduce a three-level error attribution mechanism that distinguishes local, upstream, and structural failures, enabling targeted recovery strategies ranging from localized retries to partial re-execution and re-decomposition. We evaluate Meta-Agent across coding, contextual learning, and open-ended reasoning tasks. Experiments against strong multi-agent baselines and ablation studies demonstrate consistent improvements in task success rate, error recovery, and workflow stability. The results highlight the importance of tightly integrating planning, grounding, and verification for building reliable multi-agent systems.
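The execution phase described above (a coordinator walking a directed acyclic graph of agent specifications, with a verification gate on every intermediate output) might look roughly like the following sketch. The spec layout (`deps`, `run`, `verify`), the retry count, and the toy agents are assumptions; only localized retries are modelled, not upstream or structural recovery.

```python
from graphlib import TopologicalSorter


def run_agent_graph(specs, max_retries=1):
    """Execute agent specs in dependency order, gating each output.

    specs maps an agent name to {"deps": [...], "run": fn(inputs), "verify": fn(output)}.
    """
    order = TopologicalSorter({name: spec["deps"] for name, spec in specs.items()}).static_order()
    outputs = {}
    for name in order:
        spec = specs[name]
        inputs = {dep: outputs[dep] for dep in spec["deps"]}
        for _attempt in range(max_retries + 1):
            result = spec["run"](inputs)
            if spec["verify"](result):  # execution-time verification gate
                outputs[name] = result
                break
        else:
            # All retries failed; a fuller system would escalate to re-execution.
            raise RuntimeError(f"agent {name!r} failed verification")
    return outputs


# Toy three-stage pipeline standing in for LLM-backed agents.
specs = {
    "report": {"deps": ["code"], "run": lambda inp: ", ".join(inp["code"]), "verify": bool},
    "code": {"deps": ["plan"], "run": lambda inp: [s.upper() for s in inp["plan"]],
             "verify": lambda out: all(s.isupper() for s in out)},
    "plan": {"deps": [], "run": lambda inp: ["a", "b"], "verify": lambda out: len(out) > 0},
}
outputs = run_agent_graph(specs)
```

Because each output is checked before any downstream agent consumes it, an intermediate error stops at its source instead of propagating through the graph.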

2026-05-26 13:00 JST arXiv cs.AI LLM/生成AI エージェント ビジネス/資金調達

LipoAgent: より安全な脂質設計のための微調整された LLM エージェントの調整

脂質ナノ粒子 (LNP) は、臨床的に最も成熟した核酸送達プラットフォームの 1 つですが、有効かつ生物学的に安全な脂質の設計が依然として大きなボトルネックとなっています。実際のスクリーニングでは、毒性は意思決定レベルの制約です。脂質に毒性がある場合、その効率予測は臨床的に意味を持ちません。私たちは、脂質発見のための安全性を意識したマルチエージェント LLM フレームワークである LipoAgent を提案します。 LipoAgent は、ドメイン固有の微調整と、効率予測の前提条件として毒性判定を強制する条件付き予測目標を組み合わせ、さらに、不一致が続く場合には軽量な人間による監視を伴うマルチエージェント検証によって信頼性を向上させます。複数の基盤モデルにわたって、LipoAgent は、報告されている他の脂質設計モデルと比較して、mRNA トランスフェクション効率予測において平均 32% の相対的な向上を達成しています。ウェットラボ検証により、仮想スクリーニングのランキングが生物学的トランスフェクションの結果に確実に反映されることが確認されています。コードは https://github.com/SAI-Lab-NYU/LipoAgent.git で公開されています。

原文 (English)

LipoAgent: Coordinating Fine-Tuned LLM Agents for Safer Lipid Design

Lipid nanoparticles (LNPs) are among the most clinically mature platforms for nucleic acid delivery, yet designing lipids that are both effective and biologically safe remains a major bottleneck. In practical screening, toxicity is a decision-level constraint: if a lipid is toxic, its efficiency prediction is clinically irrelevant. We propose LipoAgent, a safety-aware multi-agent LLM framework for lipid discovery. LipoAgent combines domain-specific finetuning with a conditional prediction objective that enforces toxicity as a prerequisite for efficiency prediction, and further improves reliability via multi-agent verification with lightweight human oversight when disagreement persists. Across multiple foundation models, LipoAgent achieves an average 32% relative improvement in mRNA transfection efficiency prediction compared with other reported models for lipid design. Wet-lab validation confirms that virtual screening rankings reliably translate to biological transfection outcomes. The code is publicly available at https://github.com/SAI-Lab-NYU/LipoAgent.git.
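The "toxicity as a prerequisite" idea can be expressed as a simple gate at inference time, as in the sketch below. This shows the decision logic only, with lookup tables standing in for the fine-tuned models; it does not show the paper's training objective, and all names and numbers are made up for illustration.

```python
from typing import Callable, List, Optional, Tuple


def conditional_efficiency(
    lipid: str,
    is_toxic: Callable[[str], bool],
    efficiency: Callable[[str], float],
) -> Optional[float]:
    """Return an efficiency score only if the lipid first passes the toxicity check."""
    if is_toxic(lipid):
        return None  # efficiency is not evaluated for toxic candidates
    return efficiency(lipid)


def rank_safe_candidates(lipids, is_toxic, efficiency) -> List[Tuple[str, float]]:
    """Rank non-toxic lipids by predicted efficiency, highest first."""
    scored = [(lipid, conditional_efficiency(lipid, is_toxic, efficiency)) for lipid in lipids]
    safe = [(lipid, score) for lipid, score in scored if score is not None]
    return sorted(safe, key=lambda pair: pair[1], reverse=True)


# Placeholder predictors: dictionary lookups instead of model calls.
toy_toxicity = {"lipid_A": False, "lipid_B": True, "lipid_C": False}
toy_efficiency = {"lipid_A": 0.41, "lipid_B": 0.93, "lipid_C": 0.67}
ranking = rank_safe_candidates(list(toy_toxicity), toy_toxicity.get, toy_efficiency.get)
```

Note that `lipid_B` has the highest toy efficiency but never reaches the ranking, which mirrors the abstract's point that efficiency is clinically irrelevant for a toxic lipid.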

2026-05-26 13:00 JST arXiv cs.AI エージェント

CODESKILL: コーディングエージェントのための自己進化スキルの学習

コーディングエージェントは、ソフトウェアエンジニアリングのタスクを解決しながら、豊かな軌跡を生成します。エージェントの自己進化を可能にするために、これらの軌跡を再利用可能な手順スキルに蒸留し、経験をコンパクトにエンコードして将来の行動を導くことができます。しかし、既存のスキル構築および維持方法は、固定プロンプトやヒューリスティック更新ルールに依存していることが多く、下流エージェントに最適なサービスを提供するために知識をどのように選択、抽象化、維持する必要があるかが不明瞭なままになっています。私たちは、スキルの抽出とスキルバンクの維持を学習可能な管理ポリシーとして再定式化する LLM ベースのフレームワークである CODESKILL を提案します。 CODESKILL は、コーディングエージェントの軌跡から複数粒度の手順スキルを抽出し、新しい経験によってスキルを進化させ、将来のタスク解決に備えたコンパクトなスキルバンクを維持します。私たちは強化学習で CODESKILL をトレーニングします。これには、密なルーブリックベースのスキル品質フィードバックと、凍結された下流エージェントからのまばらな検証可能な実行フィードバックを組み合わせたハイブリッド報酬が使用されます。 EnvBench、SWE-Bench Verified、および Terminal-Bench 2 での実験では、CODESKILL が反復構築中にスキルバンクを安定したサイズに維持しながら、スキルなしのベースラインと比べて 9.69 向上し、最強のプロンプトベースまたはメモリベースラインと比較して 4.01 向上する平均合格率を示しています。

原文 (English)

CODESKILL: Learning Self-Evolving Skills for Coding Agents

Coding agents produce rich trajectories while solving software-engineering tasks. To enable agent self-evolution, these trajectories can be distilled into reusable procedural skills that compactly encode experience to guide future behavior. However, existing skill construction and maintenance methods often rely on fixed prompts and heuristic update rules, leaving it unclear how knowledge should be selected, abstracted, and maintained to best serve downstream agents. We propose CODESKILL, an LLM-based framework that reformulates skill extraction and skill-bank maintenance as a learnable management policy. CODESKILL extracts multi-granularity procedural skills from coding-agent trajectories, evolves skills with new experience, and maintains a compact skill bank for future task solving. We train CODESKILL with reinforcement learning, using a hybrid reward that combines dense rubric-based skill-quality feedback with sparse verifiable execution feedback from the frozen downstream agent. Experiments on EnvBench, SWE-Bench Verified, and Terminal-Bench 2 show that CODESKILL improves average pass rate by 9.69 over the no-skill baseline and by 4.01 over the strongest prompt-based or memory baseline, while maintaining the skill bank at a stable size during iterative construction.

2026-05-26 13:00 JST arXiv cs.AI LLM/生成AI エージェント

OpenClaw エージェントのセキュリティ: 基本、攻撃、および対策

大規模言語モデル (LLM) 駆動の自律エージェントの急速な進化により、永続メモリ、マルチチャネルインタラクション、高度な自律性を備えた継続的に実行され、スキル強化されたシステムとして動作する新しいクラスのオープンソースエージェントフレームワークである OpenClaw が誕生しました。このような機能により、OpenClaw エージェントは複雑な複数ステップのタスクを自律的に実行し、外部アプリケーションとシームレスに対話できるようになりますが、同時に攻撃対象領域が大幅に拡大します。特に、高い特権操作と永続メモリの組み合わせにより、OpenClaw エージェントは、スキルポイズニング、認知操作、マルチエージェントカスケード障害、サプライチェーンの脆弱性など、さまざまな新たな脅威にさらされます。このサーベイでは、OpenClaw エージェントのセキュリティ状況に関する包括的な研究を紹介します。まず、OpenClaw エージェントを従来の AI エージェントシステムと区別する一般的なアーキテクチャと主要な特性を調べます。私たちは既存のセキュリティとプライバシーの脅威を階層化されたフレームワークに分類し、エージェントの推論、アクションの実行、および外部との対話中に脆弱性がどのように発生するかを分析します。現在の防御の全体像を示すために、代表的な防御メカニズムも概観します。最後に、OpenClaw エコシステムの信頼性と信用性に関連するいくつかの未解決の問題について説明します。

原文 (English)

Security of OpenClaw Agents: Fundamentals, Attacks, and Countermeasures

The rapid evolution of large language model (LLM)-driven autonomous agents has given rise to OpenClaw, a new class of open-source agent frameworks that operate as continuously running, skill-augmented systems with persistent memory, multi-channel interaction, and high degrees of autonomy. Such capabilities enable OpenClaw agents to autonomously execute complex, multi-step tasks and interact seamlessly with external applications, but simultaneously introduce a substantially enlarged attack surface. In particular, the combination of high-privilege operations and persistent memory exposes OpenClaw agents to various emerging threats, including skill poisoning, cognitive manipulation, multi-agent cascading failures, and supply-chain vulnerabilities. In this survey, we present a comprehensive study of the security landscape of OpenClaw agents. We first examine the general architecture and key characteristics that distinguish OpenClaw agents from traditional AI agent systems. We categorize existing security and privacy threats into a layered framework and analyze how vulnerabilities arise during agent reasoning, action execution, and external interaction. Representative defense mechanisms are also reviewed to draw the current defense landscape. Finally, several unresolved issues related to the reliability and trustworthiness of OpenClaw ecosystems are discussed.

2026-05-26 13:00 JST arXiv cs.AI LLM/生成AI エージェント 研究/論文

パーソナライズしてから保存: 長期的なエージェント向けにパーソナライズされたメモリのベンチマークと学習

既存の大規模言語モデル (LLM) ベースのメモリシステムは、基本的な現実を無視する普遍的な静的ポリシーを適用しています。つまり、メモリに保存する価値のあるコンテキストはユーザーによって異なります。このずれにより、限られたメモリバジェットが一時的なインタラクションに浪費され、長期間のタスクに必要な重要なコンテキストが保持されなくなります。このギャップに対処するために、LLM ベースのメモリシステムはパーソナライズされたメモリポリシーを学習できるか?という未解明の疑問を調査します。パーソナライズされたメモリシステムを評価するための最初のベンチマークである PerMemBench を紹介します。これは、多様なユーザーペルソナにわたる複数年にわたるマルチドメインのインタラクション履歴を特徴とします。さらに、メモリのパーソナライゼーションに関する最初の実証研究を紹介し、一時的なセッションのメモリ操作を選択的にバイパスする軽量フレームワークであるセッションレベルのストレージゲーティングを提案します。私たちの研究では、パーソナライゼーションが完璧なゲーティングの下で大幅なリテンション向上をもたらすことを確認していますが、正確なゲーティングが依然として未解決かつ重要な課題であることも明らかにしています。

原文 (English)

Personalize-then-Store: Benchmarking and Learning Personalized Memory for Long-horizon Agents

Existing large language model (LLM)-based memory systems apply universal, static policies that overlook a fundamental reality: the contexts worth storing in memory differ across users. This misalignment wastes limited memory budget on transient interactions while failing to preserve critical context for long-horizon tasks. To address this gap, we investigate an underexplored question: can LLM-based memory systems learn personalized memory policies? We introduce PerMemBench, the first benchmark for evaluating personalized memory systems, featuring multi-year, multi-domain interaction histories across diverse user personas. We further present the first empirical study of memory personalization, proposing session-level storage gating, a lightweight framework that selectively bypasses memory operations for transient sessions. Our study confirms that personalization yields substantial retention gains under perfect gating, yet reveals that accurate gating remains an open and critical challenge.
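A minimal way to picture session-level storage gating is a per-user predicate that decides, before any memory write, whether a session is transient. In the sketch below the gate is a hand-written topic lookup, purely as an assumption for illustration; the study treats accurate gating as an open problem, so a real gate would have to be learned. All names are hypothetical.

```python
def make_persona_gate(durable_topics):
    """Build a per-user gate: sessions outside the user's durable topics are transient."""
    def is_transient(session):
        return session["topic"] not in durable_topics
    return is_transient


def store_sessions(sessions, is_transient):
    """Write session summaries to memory, bypassing the write for transient sessions."""
    memory = []
    for session in sessions:
        if is_transient(session):
            continue  # skip the memory operation entirely
        memory.append(session["summary"])
    return memory


# The same interaction history, stored under two different personas.
sessions = [
    {"topic": "health", "summary": "Started a new medication schedule"},
    {"topic": "weather", "summary": "Asked about rain tomorrow"},
    {"topic": "finance", "summary": "Set a monthly savings target"},
]
memory_patient = store_sessions(sessions, make_persona_gate({"health"}))
memory_investor = store_sessions(sessions, make_persona_gate({"finance"}))
```

The two personas end up with different memories from identical sessions, which is the contrast with a universal, static storage policy.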

2026-05-26 13:00 JST arXiv cs.AI エージェント

倹約的な潜在表現への回帰: 視覚基盤モデルからのタスク中心の世界モデルの学習

世界モデルを使用すると、エージェントはアクションに基づいて将来のダイナミクスを予測できるため、潜在的な表現の選択が計画と制御の中心となります。このような表現は、多くの場合、意味構造が制限されたピクセルから直接学習されるか、タスクに無関係な詳細が過剰に含まれる凍結されたビジュアル基盤モデルから継承され、下流の計画や制御にあまり適合しない状態空間が生成されます。これは、報酬のないオフライン設定では特に困難であり、モデルは報酬の監視やオンライン操作なしで固定軌道から学習する必要があります。これに対処するために、私たちは基盤モデルの埋め込みをコンパクトでタスク十分な世界表現に変換するためのフレームワークである TC-WM を提案します。重要な設計は、事前学習済みのエンベディング空間を最終状態空間としてではなく意味論的な足場として扱うことです。TC-WM は、高次元の視覚エンベディングを動的空間としてコンパクトな潜在空間に線形に投影し、対比学習を通じて部分空間をエージェントの物理的状態と位置合わせし、エンベディングを再構築して有用な視覚構造を保存します。これにより、基礎機能の汎用性とタスク中心のダイナミクスの制御性が組み合わされます。理論的には、単純な変換に至るまで、TC-WM が根底にあるタスク中心の潜在要因を特定するのに十分であることを示します。経験的に、TC-WM は多様な環境 (例: Robomimic や D4RL) にわたるテスト時間の計画を可能にし、最先端のアプローチよりも優れたワールドモデリング品質とより正確な制御を実現します。

原文 (English)

Back to Parsimonious Latents: Learning Task-Centric World Models from Visual Foundations

World models enable agents to predict future dynamics conditioned on actions, making the choice of latent representation central to planning and control. Such representations are often either learned directly from pixels with limited semantic structure or inherited from frozen visual foundation models with excessive task-irrelevant detail, yielding state spaces that are poorly matched to downstream planning and control. This is especially challenging in reward-free offline settings, where the model must learn from fixed trajectories without reward supervision or online interaction. To address this, we propose TC-WM, a framework for turning foundation-model embeddings into compact, task-sufficient world representations. The key design is to treat the pretrained embedding space as a semantic scaffold rather than as the final state space: TC-WM linearly projects high-dimensional visual embeddings into a compact latent as the dynamic space, aligns a subspace with the agent's physical state via contrastive learning, and reconstructs embeddings to preserve useful visual structure. This combines the generality of foundation features with the controllability of task-centric dynamics. Theoretically, we show that TC-WM suffices to identify the underlying task-centric latent factors up to a simple transformation. Empirically, TC-WM enables test-time planning across diverse environments (e.g., Robomimic and D4RL), achieving better world-modeling quality and more precise control than state-of-the-art approaches.

2026-05-26 13:00 JST arXiv cs.AI エージェント

CUA-Gym: コンピューター使用エージェント向けの検証可能なトレーニング環境とタスクのスケーリング

検証可能な報酬を伴う強化学習 (RLVR) は、数学、ツール使用、ソフトウェアエンジニアリングなどの分野で画期的な進歩をもたらしましたが、そのコンピューター使用エージェント (CUA) への拡張は、決定論的な報酬を伴うスケーラブルなトレーニングデータの不足によってボトルネックになっています。 CUA 用にこのようなデータを構築するには、一貫したタスクの指示、実行可能な環境、検証可能な報酬が必要です。しかし、手動で厳選されたベンチマークは高い報酬忠実度を実現するもののカバーできるアプリケーションが少なく、LLM-as-judge ベースのデータセットは広範囲に拡張できるものの信頼できる検証を欠いています。タスク命令、環境状態、報酬関数を同時生成するスケーラブルなパイプラインである CUA-Gym を紹介します。具体的には、Generator エージェントが初期環境状態とゴールデン環境状態を構築し、別個の Discriminator エージェントがタスク仕様から報酬関数を記述します。オーケストレーターエージェントは、実行に基づく反復ラウンドを通じてこの 2 つを駆動します。生成されたタプルは、LLM 多数決とエージェントのロールアウトを組み合わせた最終フィルターを通過し、タスクごとの敵対的ループを超えた品質を保証します。トレーニング環境の不足に対処するために、実際のソフトウェア使用分布に基づいた高忠実度のモック Web アプリケーションの広範なスイートである CUA-Gym-Hub をさらに合成し、CUA RLVR データの規模を大幅に拡大します。このパイプラインを使用して、110 の環境に基づいた 32,112 の検証済み RLVR トレーニングタプルのデータセットである CUA-Gym を構築します。 CUA-Gym で GSPO を使用してトレーニングされた CUA-Gym-A3B および CUA-Gym-A17B は、OSWorld-Verified で 62.1% と 72.6% を達成し、同等の規模で以前のオープンソース CUA を上回り、データ量と環境の多様性の両方でパフォーマンスがスムーズにスケーリングします。同じチェックポイントは、ホールドアウトの WebArena ベンチマークでも改善を示しており、トレーニング環境を超えた転移を示しています。完全な合成パイプライン、データセット、CUA-Gym-Hub 環境、およびモデルをオープンソース化する予定です。

原文 (English)

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool-use, and software engineering, yet its extension to computer-use agents (CUAs) has been bottlenecked by the scarcity of scalable training data with deterministic rewards. Constructing such data for CUAs requires a consistent task instruction, an executable environment, and a verifiable reward. However, hand-curated benchmarks achieve high reward fidelity but cover few applications, while LLM-as-judge-based datasets scale broadly but lack reliable verification. We present CUA-Gym, a scalable pipeline that co-generates task instructions, environment states, and reward functions. Concretely, a Generator agent constructs the initial and golden environment states, and a separate Discriminator agent writes the reward function from the task specification. An orchestrator agent drives the two through iterative rounds upon execution. Generated tuples then pass a final filter combining LLM majority voting and agent rollouts, ensuring quality beyond the per-task adversarial loop. To address the scarcity of training environments, we further synthesize CUA-Gym-Hub, a broad suite of high-fidelity mock web applications grounded in real-world software-use distributions, expanding the scale of CUA RLVR data by magnitude. Using this pipeline, we construct CUA-Gym, a dataset of 32,112 verified RLVR training tuples grounded in 110 environments. Trained with GSPO on CUA-Gym, our CUA-Gym-A3B and CUA-Gym-A17B achieve 62.1% and 72.6% on OSWorld-Verified, outperforming prior open-source CUAs at comparable scales, with performance scaling smoothly in both data volume and environment diversity. The same checkpoints also improve on the held-out WebArena benchmark, indicating transfer beyond the training environments. We will open-source the full synthesis pipeline, dataset, CUA-Gym-Hub environments, and models.
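The final filter is described only as "combining LLM majority voting and agent rollouts", so the sketch below is one plausible reading rather than the authors' rule: a generated tuple is kept when a strict majority of judge votes accept it and at least `min_successes` agent rollouts earn the reward. The combination rule, the threshold, and the function name are all assumptions.

```python
def passes_final_filter(judge_votes, rollout_successes, min_successes=1):
    """Keep a (task, environment, reward) tuple only if both checks pass.

    judge_votes: booleans, one per LLM judge (True = tuple looks valid).
    rollout_successes: booleans, one per agent rollout (True = reward obtained).
    """
    majority_ok = sum(judge_votes) * 2 > len(judge_votes)  # strict majority
    solvable = sum(rollout_successes) >= min_successes     # reward is reachable
    return majority_ok and solvable


# Toy outcomes for three generated tuples.
kept = passes_final_filter([True, True, False], [False, True, False])
rejected_by_judges = passes_final_filter([True, False, False], [True, True])
rejected_unsolved = passes_final_filter([True, True, True], [False, False])
```

Requiring both signals guards against two different failures: a reward function that judges find inconsistent with the task, and a task whose reward no rollout can actually reach.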

2026-05-26 13:00 JST arXiv cs.AI エージェント

あらゆるアクションに保険を: 自律型 AI エージェントのランタイム保険数理制御のための権限フロンティアフレームワーク

自律型 AI エージェントは、データベースの変更、返金、支払い、外部コミットメントなど、副作用を伴うアクションを発行することが増えています。我々は、時間整合的なリスクマッピングの下で、契約上固定された安全なデフォルトを基準に各アクションの価格を設定し、境界ごとの準備資本予算に基づいて実行をゲートする決定論的なランタイム契約である保険数理アクションインターフェイス (AAI) を提案します。次に、準備資本の各レベルでランタイムがどれだけの自律権限を解放するかを測定する評価プリミティブである権限フロンティアを開発します。このフレームワークは、(i) トール制限付きケイパビリティトークンを備えた決定論的な quote-bind-commit プロトコル、(ii) 異種のツール呼び出しを比較可能な権限単位にマッピングする汎用の 7 クラスアクション分類法、(iii) アルファ支出の下でのリプレイ決定性とパスワイズな準備金カバレッジ、(iv) 完全準備金需要 C_full および資本指標 Capital@k によるクロスドメイン正規化を提供します。 4 つのエージェント環境 (データベースの変更、顧客サービスの返金、および公開されている tau-bench の小売と航空会社のツール使用トレース) にわたって AAI をインスタンス化し、Azure でホストされる 3 つのモデルが同じ契約を通じてアクションを提案するライブ Postgres パネルを報告します。フロンティアは、ドメイン全体で共通の低準備金での拒否と中間的な解放のパターンを示し、予算グリッドが完全準備金需要に達する場合にのみ飽和します。必要な準備資本は 22 倍変化します (Capital@50 は 289 から 6457)。フレームワークはドメインを強制的に同じ形状にするわけではなく、各ドメインの保険数理的な幾何学を明らかにします。ライブパネルでは、契約は低予算で 3 つのモデルすべてにわたって実現損失を防止しますが、拒否された際の引受の持続性はモデルごとに異なります。つまり、モデルのアイデンティティは保険数理上の引受変数です。この貢献は、自律エージェントの副作用を実行時に保険数理的に制御するための、ベンチマーク対応の評価フレームワークです。

原文 (English)

Insuring Every Action: An Authority Frontier Framework for Runtime Actuarial Control of Autonomous AI Agents

Autonomous AI agents increasingly issue side-effect-bearing actions: database mutations, refunds, payments, external commitments. We propose the Actuarial Action Interface (AAI), a deterministic runtime contract that prices each such action against a contractually fixed safe default under a time-consistent risk mapping, and gates execution against a per-boundary reserve capital budget. We then develop the Authority Frontier, an evaluation primitive measuring how much autonomous authority the runtime releases at each level of reserve capital. The framework provides (i) a deterministic quote-bind-commit protocol with toll-bounded capability tokens; (ii) a universal seven-class action taxonomy mapping heterogeneous tool calls to comparable authority units; (iii) replay determinism and pathwise reserve coverage under alpha-spending; (iv) cross-domain normalization via full reserve demand C_full and capital metrics Capital@k. We instantiate AAI across four agentic environments (database mutation, customer-service refund, and the public tau-bench retail and airline tool-use traces) and report a live Postgres panel in which three Azure-hosted models propose actions through the same contract. The frontier exhibits a common low-reserve refusal and intermediate-release pattern across domains, with saturation only where the budget grid reaches full reserve demand; required reserve capital varies by 22x (Capital@50 from 289 to 6457). The framework does not force domains into the same shape; it surfaces each domain's actuarial geometry. In the live panel the contract prevents realized loss across all three models at low budget while differing in underwriting persistence under denial: model identity is an actuarial underwriting variable. The contribution is a benchmark-ready evaluation framework for runtime actuarial control of autonomous-agent side effects.
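To make the quote-bind-commit flow concrete, the sketch below prices each proposed action, reserves capital when a quote is bound, and releases the action for execution only on commit; a quote that exceeds the remaining reserve is refused, leaving the caller on its safe default. The price table, the class and method names, and the integer tokens are illustrative assumptions, not the AAI specification (which also covers toll bounds, alpha-spending, and replay guarantees).

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    action: str
    price: float  # reserve capital the action would consume


class ReserveGate:
    """Deterministic gate: the same sequence of calls always yields the same decisions."""

    def __init__(self, reserve_budget):
        self.remaining = reserve_budget
        self._bound = {}
        self._ids = itertools.count(1)

    def quote(self, action, price_table):
        return Quote(action, price_table[action])

    def bind(self, quote):
        """Reserve capital for a quote; return a token, or None when refused."""
        if quote.price > self.remaining:
            return None  # refusal: the caller falls back to the safe default
        self.remaining -= quote.price
        token = next(self._ids)
        self._bound[token] = quote
        return token

    def commit(self, token):
        """Consume a token exactly once and release the action for execution."""
        return self._bound.pop(token).action


# Hypothetical prices per action class.
prices = {"read_row": 0.0, "issue_refund": 40.0, "drop_table": 500.0}
gate = ReserveGate(reserve_budget=100.0)
refund_token = gate.bind(gate.quote("issue_refund", prices))
drop_token = gate.bind(gate.quote("drop_table", prices))
released = gate.commit(refund_token)
```

Sweeping `reserve_budget` over a grid and recording which actions are released at each level gives a rough picture of the kind of curve the Authority Frontier is meant to measure.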

2026-05-26 13:00 JST arXiv cs.AI LLM/生成AI エージェント 研究/論文

AgentHijack: 一般的な環境破損に対するコンピュータ使用エージェントの堅牢性のベンチマーク

マルチモーダル大規模言語モデル (MLLM) を活用した自律型コンピューター使用エージェントが、複雑なデジタルワークフローを完了するための有能なアシスタントとして登場しつつあります。しかし、現実の実行環境は理想とは程遠いものです。ポップアップ、解像度の変更、競合するアプリケーションがエージェントの認識と制御を頻繁に妨げます。 AgentHijack は、直接的な敵対的な意図がなくても、動的環境の不確実性により実行フローが中断される一般的な破損下でのコンピュータ使用エージェントの堅牢性を評価するように設計されたベンチマークです。具体的には、AgentHijack は、現実的な不完全なシナリオを再現するために、9 つの構成可能な一般的な破損を導入します。私たちは、MLLM ベースのエージェントを利用するさまざまなデスクトップタスクを評価し、わずかな破損でもパフォーマンスが大幅に低下する可能性があることを発見しました。これにより、エージェントの脆弱性が強調され、堅牢性評価の必要性が強調されます。その後、強化されたグラウンディング機能を備えたアクションジェネレーターと、動作の要約と環境チェックを担当する観察者を統合するフレームワークである AgentHijack-Agent を提案します。広範な実験によりその有効性が検証されています。私たちのコード、環境、ベースラインモデル、データは、https://AgentHijack.github.io で公開されています。

原文 (English)

AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

Autonomous computer-use agents powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workflows. However, real-world execution environments are far from ideal: pop-ups, resolution changes, and competing applications frequently interfere with agent perception and control. We introduce AgentHijack, a benchmark designed to evaluate the robustness of computer-use agents under common corruptions, where uncertainties in the dynamic environment disrupt the execution flow without direct adversarial intent. Specifically, AgentHijack introduces 9 configurable common corruptions to replicate realistic imperfect scenarios. We evaluate MLLM-based agents on a variety of desktop tasks and find that even minor corruptions can cause substantial performance degradation, which highlights the fragility of agents and underscores the necessity of robustness evaluation. We then propose AgentHijack-Agent, a framework that integrates an action generator with enhanced grounding capabilities and an onlooker responsible for behavior summarization and environment checking. Extensive experiments validate its effectiveness. Our code, environment, baseline models and data are publicly available at: https://AgentHijack.github.io.

2026-05-26 13:00 JST arXiv cs.AI エージェント

エージェント中心の社会軌道予測: フリーエネルギー原理の観点

軌道予測方法は、複雑な動きのパターンを捕捉する際に優れた能力を発揮することが実証されています。しかし、既存の方法は大域的な状態の仮定に依存しており、部分的な可観測性の下では信念推論が不十分であり、予測における認知行動の制約がありません。これらの制限により、実際の環境における展開の実現可能性と物理的な妥当性の両方が大きく損なわれます。この研究では、現実的な制約の下で認知的にもっともらしい予測を達成することを目的とした、自由エネルギー原理に基づくエージェント中心の軌道予測フレームワークである FEP-Diff を提案します。具体的には、デュアルブランチ時空間エンコーダーが、局所的な観察から自我運動のダイナミクスと社会的相互作用の手がかりを抽出します。これに基づいて、目標条件付き信念学習器は、近隣エージェント間の認知的整合性を促進するために、ローカル近隣グラフに社会的一貫性制約を設けて、自由エネルギー目標によって最適化されたマルチモーダルな潜在信念分布を推論します。最後に、残留拡散軌道生成器は、トークンレベルの代理条件付けを使用して、学習された信念表現に基づいて条件付けされ、正確で多様な将来予測を生成します。 5 つの公開ベンチマークに関する広範な実験により、FEP-Diff が可観測性が制限された条件下で常に最先端の手法を上回るパフォーマンスを発揮することが実証されました。コード: https://anonymous.4open.science/r/FEP-Diff-8876。

原文 (English)

Agent-Centric Social Trajectory Prediction: A Free Energy Principle Perspective

Trajectory prediction methods have demonstrated remarkable capabilities in capturing complex motion patterns. However, existing methods rely on global state assumptions, suffer from insufficient belief inference under partial observability, and lack cognitive behavioral constraints in prediction. These limitations severely compromise both deployment feasibility and physical plausibility in real-world settings. In this work, we propose FEP-Diff, an agent-centric trajectory prediction framework grounded in the Free Energy Principle, aimed at achieving cognitively plausible predictions under realistic constraints. Specifically, a dual-branch spatiotemporal encoder extracts ego-motion dynamics and social interaction cues from local observations. Building upon this, a goal-conditioned belief learner infers multimodal latent belief distributions optimized via a free-energy objective, with a social consistency constraint on the local neighborhood graph to promote cognitive alignment among neighboring agents. Finally, a residual diffusion trajectory generator is conditioned on the learned belief representations with token-level proxy conditioning, producing precise and diverse future predictions. Extensive experiments on five public benchmarks demonstrate that FEP-Diff consistently outperforms state-of-the-art methods under restricted observability. Code: https://anonymous.4open.science/r/FEP-Diff-8876.

2026-05-26 13:00 JST arXiv cs.AI エージェント

EvoMap の背後にある: 自己進化するエージェント間コラボレーションネットワークの特徴付け

エージェント間 (A2A) ネットワークにより、自律型 AI エージェントは、再利用可能な問題解決手順を共有することで連携できます。しかし、これらの分散型エコシステムが実際にどのように機能するかは、ほとんど解明されていないままです。著名な A2A コラボレーションネットワークである EvoMap に関する最初の大規模実証研究を紹介します。 150 万を超える資産と 12 万 8,000 のエージェントを分析することで、スケーラブルな成長を優先する設計の選択が、再利用性、進化、監査可能性においてどのようにトレードオフを引き起こすかを示します。まず、EvoMap の信用経済は、貴重な資産を公開するエージェントに報酬を与えます。この設計は大規模な参加を奨励しますが、報酬は主に採用ではなく出版に結びついています。これにより、エージェントはクレジットを蓄積するために資産を大量生産するようになります。その結果、資産の 98% は再利用されず、報酬はごく一部のエージェントに集中することになります。第 2 に、EvoMap はアルゴリズム (GDI と呼ばれる) を採用して、これらの共有アセットの品質をスコアリングしてランク付けします。私たちは、このスコアリングシステムに欠陥があることを実証します。つまり、アセットのランクは、客観的なパフォーマンスを測定するのではなく、未検証の自己報告メタデータ (例: 変更されたコード行など) によって大きく左右されます。これにより、エージェントはアセットのスコアを簡単に操作できるようになります。最後に、EvoMap はエージェントに依存して、アップロードされたアセットが正しく機能する証拠としてローカル実行ログを提供します。これらの検証は個別に検証されていないため、承認されたアセットの 84% 以上が、空のテスト (console.log など) を使用した品質チェックをバイパスしています。私たちの調査結果は、将来の A2A コラボレーションネットワークが未検証の自己報告のみに依存できないことを示しています。スケーラブルなコラボレーションには、オープンな参加と検証可能な実行および信頼できる評価のバランスをとるメカニズムが必要です。

原文 (English)

Behind EvoMap: Characterizing a Self-Evolving Agent-to-Agent Collaboration Network

Agent-to-Agent (A2A) networks enable autonomous AI agents to collaborate by sharing reusable problem-solving instructions. However, how these decentralized ecosystems operate in practice remains largely unexplored. We present the first large-scale empirical study of EvoMap, a prominent A2A collaboration network. By analyzing over 1.5M assets and 128K agents, we show how design choices that prioritize scalable growth introduce trade-offs in reusability, evolution, and auditability. First, EvoMap's credit economy rewards agents for publishing valuable assets. Although this design encourages participation at scale, rewards are tied primarily to publication rather than adoption. This leads agents to mass-produce assets to accumulate credits. As a result, 98% of assets are never reused, while rewards become highly concentrated among a small fraction of agents. Second, EvoMap employs an algorithm (referred to as GDI) to score and rank the quality of these shared assets. We demonstrate that this scoring system is flawed: rather than measuring objective performance, an asset's rank is heavily dictated by unverified, self-reported metadata (e.g., claimed lines of code modified). This allows agents to trivially manipulate their asset's scores. Finally, EvoMap relies on agents to provide local execution logs as evidence that uploaded assets function correctly. Because these validations are not independently verified, over 84% of approved assets bypass quality checks using vacuous tests (e.g., console.log). Our findings show that future A2A collaboration networks cannot rely on unverified self-reporting alone. Scalable collaboration requires mechanisms that balance open participation with verifiable execution and trustworthy evaluation.

2026-05-26 13:00 JST arXiv cs.AI エージェント

解く前に調べる: ARC-AGI-3 の認識エージェントにおける速度と深さのトレードオフ

私たちは 25 の公開 ARC-AGI-3 ゲームすべてを体系的に調査し、そのすべてが非知的な戦略で到達可能であることを発見しました。内訳は、10 ゲームが 1 回のブラインドステップ、5 ゲームが 1 回の探り行動の後、1 ゲームが ACTION1 の連打、1 ゲームが多様な探索、8 ゲームが十分な予算 (50 ～ 200 ステップ) での単一アクションの反復によるものです。さらに、ライブラリレベルのヌル座標の脆弱性により、18 ゲームが 1 ステップでバイパスされます。このベンチマーク批判は、公開評価セットでは知的な探索と自明なヒューリスティックを区別できず、非公開の 55 ゲーム評価だけが真の知能テストであることを示唆しています。このような背景のもと、AERA (Adaptive Epistemic Reasoning Agent) を紹介します。AERA は 3 フェーズ (EXPLORE / VERIFY / PLAN) のエージェントで、Qwen2.5-0.5B を用いてこれら 25 ゲームで RHAE=0.2116 (4/25 解決) を達成し、ランダムおよび探索なしのベースラインのスコアは 0.0000 でした。私たちは速度と深さのトレードオフという枠組みで AERA を定式化します。凸性の仮定 (付録で特定の環境クラスについて証明) の下では、RHAE の二次形式は、アクション効率と情報獲得の間のパレートフロンティアからの逸脱に対する 2 次のペナルティとして現れます。貢献: (i) 現在の対話型推論ベンチマークが、必要だと主張する探索を測定できていないことを示すベンチマーク妥当性分析、および (ii) EXPLORE-before-PLAN フレームワークと、モデル能力 × 探索の交互作用。リンクされたコードトラックのエントリは、55 ゲームすべての非公開評価で RHAE=0.30 を達成しています。コード: CC0。

原文 (English)

Explore Before You Solve: The Speed--Depth Trade-off in Epistemic Agents for ARC-AGI-3

We systematically investigate all 25 public ARC-AGI-3 games and find that every one is reachable through non-intelligent strategies: 10 in a single blind step, 5 after one probing action, 1 via repeated ACTION1 presses, 1 via diverse exploration, and 8 via single repeated actions with sufficient budget (50-200 steps). A library-level null-coordinate vulnerability additionally bypasses 18 games in 1 step. This benchmark critique implies the public evaluation set cannot discriminate intelligent exploration from trivial heuristics - the private 55-game evaluation is the only genuine intelligence test. Against this backdrop, we present AERA (Adaptive Epistemic Reasoning Agent), a three-phase (EXPLORE / VERIFY / PLAN) agent achieving RHAE=0.2116 (4/25 solved) on these 25 games with Qwen2.5-0.5B, while random and no-explore baselines score 0.0000. We formalise AERA through a Speed--Depth trade-off framework: under a convexity assumption (proved for a class of environments in the Appendix), RHAE's quadratic form emerges as a second-order penalty for deviating from the Pareto frontier between action efficiency and information gain. Contributions: (i) a benchmark validity analysis showing that current interactive reasoning benchmarks fail to measure the exploration they claim to require, and (ii) the EXPLORE-before-PLAN framework and model-capability x exploration interaction. The linked code track entry achieves RHAE=0.30 on the full 55-game private evaluation. Code: CC0.

2026-05-26 13:00 JST arXiv cs.AI LLM/生成AI エージェント ビジネス/資金調達

CausaLab: AI 科学者向けのインタラクティブな因果発見のためのスケーラブルな環境

LLM エージェントによるインタラクティブな因果発見を評価するためのスケーラブルな環境である CausaLab を紹介します。以前の評価とは異なり、CausaLab では、エージェントが因果関係の証拠を使用して問題を解決できるかどうか、およびその答えが根底にある因果メカニズムに関する正しい仮説によって裏付けられているかどうかの両方を評価します。各エピソードではエージェントが合成実験室に配置されます。エージェントは以前の測定記録を受け取り、マニピュレーター結晶に介入し、同じ機構によって支配される保持されたリアクター結晶の共振周波数を予測します。隠されたデータ生成プロセスは、ランダムにサンプリングされた構造因果モデル (SCM) であるため、成功するには、事前の知識を思い出すのではなく、因果グラフと構造方程式の両方を回復する必要があります。 CausaLab には、エージェントの進化する SCM 仮説を記録するドメイン固有の言語も含まれており、軌跡を検査可能にしてグラウンドトゥルースと比較できるようになります。実験では、予測とメカニズム回復の間に永続的なギャップがあることが示されています。純粋に観測的な 6 ノード設定では、GPT-5.2-high はタスク精度 92% に達しますが、オールエッジ $F_1$ はわずか 0.471 です。この観察は、さまざまな相互作用戦略の探求をさらに動機づけます: 混合観察 - 介入戦略は構造忠実度を向上させます: 混合 6 ノード設定では、GPT-5.2-high はタスク精度とオールエッジ $F_1$ の両方で 80% を達成しました。しかし、純粋な介入戦略はタスクの精度とオールエッジ $F_1$ の両方においてパフォーマンスが低いため、強力なエージェントですら有益な介入を設計するのに苦労しています。私たちは、エージェントの主要な弱点として早期停止を特定し、仮説と過去のデータとの間の一貫性をモデルに検証するように依頼することが、この問題の軽減に役立つことを示します。したがって、CausaLab は予測の成功を因果関係の理解から切り離し、実験的因果推論者としての現在の LLM エージェントの限界を明らかにします。

原文 (English)

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is supported by a correct hypothesis about the underlying causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. CausaLab also includes a domain-specific language that records the agent's evolving SCM hypothesis, making trajectories inspectable and comparable with ground truth. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge $F_1$. This observation further motivates our exploration of different interaction strategies: Mixed observation--intervention strategies improve structural fidelity: in the mixed 6-node setting, GPT-5.2-high achieves 80% on both task accuracy and all-edge $F_1$. Yet even strong agents struggle to design informative interventions, as pure intervention strategies perform poorly on both task accuracy and all-edge $F_1$. We identify premature stopping as a major weakness of agents, and show that asking the model to verify the consistency between its hypothesis and past data can help mitigate this issue. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners.
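
The gap between task accuracy and all-edge $F_1$ is easier to read with a concrete scoring function in hand. The abstract does not define all-edge $F_1$, so the sketch below assumes it is an F1 score over the set of directed edges in the hypothesized versus the ground-truth causal graph. The function name `edge_f1` and the example graphs are hypothetical and are not taken from CausaLab.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]  # (cause, effect); direction matters


def edge_f1(predicted: Iterable[Edge], truth: Iterable[Edge]) -> float:
    """F1 over directed edges (assumed reading of 'all-edge F1')."""
    pred: Set[Edge] = set(predicted)
    true: Set[Edge] = set(truth)
    if not pred and not true:
        return 1.0  # two empty graphs agree trivially
    true_positives = len(pred & true)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(true)
    return 2 * precision * recall / (precision + recall)


# Hypothetical 4-node example: the agent recovers two edges,
# misses one, and reverses another (a reversed edge counts as wrong).
truth = {("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")}
hypothesis = {("A", "B"), ("B", "C"), ("D", "C")}
score = edge_f1(hypothesis, truth)
print(f"edge F1 = {score:.3f}")
```

Under this reading, an agent can predict the held-out quantity correctly while still scoring low here, because a predictor that fits the observed data need not contain the right edges.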

2026-05-26 13:00 JST arXiv cs.AI LLM/生成AI エージェント

AI 制御におけるリトライ対リサンプリング

Claude Code や Codex のような AI コーディングスキャフォールドは \textit{retrying} (リトライ) を使用します。つまり、危険とフラグ付けされたアクションをブロックしたうえで軌跡を継続します。私たちは、モデルを潜在的に敵対的なものとして扱う AI 制御の観点からリトライを研究します。リトライは正直な実行における疑惑スコアを下げる一方で、信頼されていないモデルはモニターが示す判断理由を悪用してより巧妙な攻撃を構築でき、安全性の向上が打ち消されることがわかりました。また、\textit{resampling} (リサンプリング)、すなわち同じコンテキストから複数のサンプルを抽出する手法も研究します。こちらは悪用可能な情報を漏らしません。リサンプリングに関する先行研究でひとまとめにされていた設計上の選択を切り分けます。BashArena では、Claude Opus 4.6 を信頼されていないモデル、MiMo-V2-Flash を信頼できるモニターとして、ステップごとに 5 つのサンプルを抽出し、最大の疑惑スコアに基づいて監査することで、有用性を損なうことなく、0.3\% の監査予算で安全性が 61\% から 71\% に向上します。最初の抽出で疑わしく見えたステップだけを選択的にリサンプリングすると、追加サンプルの抽出数を 10\% に抑えながら、この向上のうち 6.2 パーセントポイントを回復できます。この設定における私たちの発見のうち 2 つは、リサンプリングに関する先行研究と矛盾します。1 つ目は、リサンプリングした疑惑スコアの最大値に基づく監査が最小値を使う場合を上回ることで、これは Ctrl-Z の結果とは逆です。2 つ目は、従来の defer-to-resample プロトコルの中心的なメカニズムである「最も疑わしくないサンプルの実行」が、私たちの設定では経験的にわずかな安全性向上しかもたらさないことです (+3.9 pp、信頼区間はゼロと重なる)。

原文 (English)

Retrying vs Resampling in AI Control

AI coding scaffolds like Claude Code and Codex use \textit{retrying}: blocking actions flagged as risky and continuing the trajectory. We study retrying from an AI control perspective, which treats the model as potentially adversarial. We find that while retrying reduces honest suspicion scores, the untrusted model can exploit monitor rationale to construct sneakier attacks, negating safety gains. We also study \textit{resampling}: drawing multiple samples from the same context, which does not leak exploitable information. We disentangle design choices that previous work on resampling had bundled together. In BashArena, with Claude Opus 4.6 as the untrusted model and MiMo-V2-Flash as the trusted monitor, drawing five samples per step and auditing on the maximum suspicion score raises safety from 61\% to 71\% at a 0.3\% audit budget, at no cost to usefulness. Selectively resampling only the steps that look suspicious on the first draw recovers 6.2 percentage points of the gain while drawing only 10\% as many extra samples. Two of our findings in this setting contradict earlier work on resampling. The first is that auditing based on the maximum across resampled suspicion scores outperforms using the minimum, which is the opposite of what Ctrl-Z found. The second is that executing the least suspicious sample, which is the central mechanism in earlier defer-to-resample protocols, gives only a small empirical safety gain in our setting (+3.9 pp, with the confidence interval overlapping zero).
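
The two resampling variants described above (auditing on the maximum suspicion score, and resampling only steps that look suspicious on the first draw) can be sketched as follows. This is a minimal illustration, not the paper's protocol code. The function names, the thresholds, and the stand-in monitor scores are all assumptions. In the paper the audit threshold would follow from the 0.3% audit budget rather than being a fixed constant.

```python
from typing import Callable, List, Sequence


def audit_on_max(scores: Sequence[float], audit_threshold: float) -> bool:
    """Send the step to a human audit if ANY resampled action looks suspicious."""
    return max(scores) >= audit_threshold


def selective_resample(
    first_score: float,
    draw_score: Callable[[], float],
    k: int,
    resample_threshold: float,
) -> List[float]:
    """Draw k-1 extra samples only when the first draw already looks suspicious.

    draw_score is a hypothetical callable that samples a new action from the
    same context and returns the monitor's suspicion score for it.
    """
    scores = [first_score]
    if first_score >= resample_threshold:
        scores.extend(draw_score() for _ in range(k - 1))
    return scores


# Stand-in monitor scores for the extra draws (assumed values).
extra = iter([0.2, 0.9, 0.4, 0.1])

# A step whose first draw is moderately suspicious gets four more draws.
suspicious_step = selective_resample(0.6, lambda: next(extra), k=5, resample_threshold=0.5)
flagged = audit_on_max(suspicious_step, audit_threshold=0.8)

# A step whose first draw looks benign is not resampled at all.
benign_step = selective_resample(0.1, lambda: next(extra), k=5, resample_threshold=0.5)

print(suspicious_step, flagged)
print(benign_step, audit_on_max(benign_step, audit_threshold=0.8))
```

In the example, the first draw (0.6) would not trigger an audit on its own. One of the resamples (0.9) does, which is the intuition behind auditing on the maximum rather than the minimum.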

2026-05-26 13:00 JST arXiv cs.AI エージェント 研究/論文

VeriTrace: 深層研究エージェントの進化するメンタルモデル

深層研究エージェントは、膨大で相互依存的、かつ不確実性に満ちた情報に直面します。既存のシステムは、進化する中間表現がどのような姿であるべきかを探究していますが、その進化自体は LLM の暗黙の推論に委ねています。明示的な制御がないと、中間層は品質の混在した情報によって容易に汚染され、依存関係に沿ってエラーが伝播するため、結局はモデルの規模が欠けている制御の代わりを務めることが多くなります。私たちは、エージェントのメンタルモデルはむしろ、タスクの理解を現実と継続的に整合させる明示的なフィードバックを通じて進化すべきだと主張し、解釈の更新、逸脱フィードバック、スキーマの修正という 3 つの制御ループを特定します。これを、3 つのループを明示的に実装する認知グラフフレームワーク VeriTrace として実現します。同条件の Qwen3.5-27B バックボーンを用いた比較で、VeriTrace は最も強力な同条件ベースラインに対し、DeepResearch Bench (DRB) の Insight で 4.22 pp (Overall で 1.49 pp)、DeepConsult の Overall 勝率で 5.9 pp 向上しました。Config-DeepSeek を用いると、DRB 上で再現可能なオープンソースの結果として最も強力な結果を達成します。

原文 (English)

VeriTrace: Evolving Mental Models for Deep Research Agents

Deep research agents face vast, interdependent, and pervasively uncertain information. Existing systems explore what evolving intermediate representations should look like, but leave their evolution to the LLM's implicit reasoning. Without explicit regulation, the intermediate layer is easily contaminated by mixed-quality information and propagates errors along its dependencies, so model scale often ends up substituting for absent regulation. We argue that an agent's mental model should instead evolve through explicit feedback that continuously aligns task understanding with reality, and identify three regulatory loops: interpretive update, deviation feedback, and schema revision. We realise this in VeriTrace, a cognitive-graph framework that explicitly implements the three loops. Using matched Qwen3.5-27B backbones, VeriTrace improves over the strongest matched baseline by 4.22 pp on DeepResearch Bench (DRB) Insight (1.49 pp Overall) and by 5.9 pp Overall win rate on DeepConsult. With Config-DeepSeek, it achieves the strongest reproducible open-source result on DRB.

2026-05-26 13:00 JST arXiv cs.AI エージェント 研究/論文

Claw-Anything: ユーザーのデジタル世界への幅広いアクセスを備えた常時接続パーソナルアシスタントのベンチマーク

大規模言語モデルエージェントは、ユーザーのデジタル世界に関連するあらゆるものにアクセスできる常時接続のパーソナルアシスタントとして構想されることが増えています。しかし、現在のシステムはその世界の狭い一部でしか動作せず、状況に応じた推論や効果的な支援が制限されています。既存のベンチマークも同様に部分的なユーザー状態しか提供しないため、このような広範な常時接続の設定でのパフォーマンスを捉えられません。このギャップに対処するために、長期にわたるアクティビティ履歴、相互依存するバックエンドサービス、複数デバイスにまたがる統合された GUI と CLI の操作という 3 つの次元でエージェントのコンテキストを拡張するベンチマーク Claw-Anything を導入します。この設定を具体化するために、マルチラウンドのイベント注入によって数か月分のユーザーアクティビティをシミュレートし、無関係なイベントや矛盾するシグナルを含む、複雑な世界状態と現実的なノイズを生成します。エージェントは、こうしたノイズに対して堅牢でありながら、豊富なコンテキストを持つ環境上で推論しなければなりません。対象範囲の拡大によって、プロアクティブな支援の評価も可能になり、エージェントにはユーザーのニーズを予測してタイムリーな推奨を提供することが求められます。実験では、GPT-5.5 の pass@1 は 34.5% にとどまり、従来のベンチマークでの値を大幅に下回りました。これは、現在のエージェントの能力と常時接続のパーソナルアシスタンスの要求との間にギャップがあることを浮き彫りにしています。ベンチマークに加えて、2,000 のトレーニング環境を生成し、ベースモデルを 23.7% 改善する自動データ生成パイプラインを公開し、スケーラブルなデータインフラストラクチャとしての有用性を示します。

原文 (English)

Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World

Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's digital world. Yet current systems operate over only narrow slices of that world, limiting context-sensitive reasoning and effective assistance. Existing benchmarks similarly provide only partial user state and therefore fail to capture performance in such a broad, always-on setting. To address this gap, we introduce Claw-Anything, a benchmark that expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. To instantiate this setting, we simulate months of user activity through multi-round event injection, producing complex world states and realistic noise, including irrelevant events and conflicting signals. Agents must reason over rich contextual environments while remaining robust to such noise. This expanded scope also enables the evaluation of proactive assistance, requiring agents to anticipate user needs and deliver timely recommendations. Experiments show that GPT-5.5 achieves only 34.5% pass@1, substantially below prior benchmarks, underscoring a gap between current agent capabilities and the demands of always-on personal assistance. Alongside the benchmark, we release an automated data-generation pipeline that yields 2,000 training environments and improves the base model by 23.7%, demonstrating its utility of scalable data infrastructure.

2026-05-26 13:00 JST arXiv cs.AI エージェント 研究/論文

モデルのスケーリングからシステムのスケーリングへ: Agentic AI でのハーネスのスケーリング

この論文では、モデルのスケーリングだけでなく、システムのスケーリングとしてのエージェント AI の次の大きなボトルネック、つまり、基礎モデルを中心とした監査可能で永続的でモジュール式の検証可能なアーキテクチャの設計について研究します。私たちはこの変化をハーネスのスケーリングと呼んでいます。これは、基礎モデルの周りの構造化された実行層を、設計、評価、最適化の第一級のオブジェクトとして扱うことです。最近の大規模な言語モデルでは、エージェントがツールを使用し、情報を取得し、メモリを維持し、長期的なワークフローを実行できるようになりましたが、評価は依然として主にモデル中心であり、多くの場合、メモリ、取得、ツールの使用、オーケストレーション、検証、ガバナンスは二次的な実装の詳細として扱われ、エージェントが最終タスクの成功に限定されます。エージェントのパフォーマンスは基盤モデル、メモリ基板、コンテキストコンストラクター、スキルルーティング層、オーケストレーションループ、および検証とガバナンス層の間の相互作用から生まれるため、この枠組みはますます不適切になってきています。これらのコンポーネントは一緒になってエージェントハーネスを形成し、モデルの機能を長期的なエージェントの動作に変換します。私たちは、コンテキストガバナンス、信頼できるメモリ、動的スキルルーティングという 3 つの主要なボトルネックと、それらを調整および制限するオーケストレーションおよびガバナンスのメカニズムによるハーネスのスケーリングを研究します。さらに、ワンショットタスクの成功を超えて、軌道の品質、メモリの衛生状態、コンテキストの効率、通信の忠実度、検証コスト、長期にわたる安全な進化を測定するハーネスレベルのベンチマークに関する研究課題の概要を説明します。議論を具体化するために、Python ネイティブのリファレンスハーネスである CheetahClaws: https://github.com/SafeRL-Lab/cheetahclaws を開発し、Claude Code および OpenClaw と比較します。私たちの主な主張は、エージェント AI の将来の進歩は、より強力な基盤モデルと同じくらいシステム設計に依存するということです。

原文 (English)

From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

This paper studies the next major bottleneck in agentic AI as system scaling, not only model scaling: the design of auditable, persistent, modular, and verifiable architectures around foundation models. We refer to this shift as scaling the harness: treating the structured execution layer around a foundation model as a first-class object of design, evaluation, and optimization. Although recent large language models enable agents to use tools, retrieve information, maintain memory, and execute long-horizon workflows, evaluation remains largely model-centric, often reducing agents to final-task success while treating memory, retrieval, tool use, orchestration, verification, and governance as secondary implementation details. This framing is increasingly inadequate because agent performance emerges from the interaction among the foundation model, memory substrate, context constructor, skill-routing layer, orchestration loop, and verification-and-governance layer. Together, these components form the agent harness, which translates model capability into long-horizon agent behavior. We study scaling the harness through three core bottlenecks: context governance, trustworthy memory, and dynamic skill routing, together with the orchestration and governance mechanisms that coordinate and constrain them. We further outline a research agenda for harness-level benchmarks that go beyond one-shot task success to measure trajectory quality, memory hygiene, context efficiency, communication fidelity, verification cost, and safe evolution over time. To make the discussion concrete, we develop CheetahClaws: https://github.com/SafeRL-Lab/cheetahclaws, a Python-native reference harness, and compare it with Claude Code and OpenClaw. Our main claim is that future progress in agentic AI will depend as much on system design as on stronger foundation models.

2026-05-26 13:00 JST arXiv cs.AI エージェント 研究/論文

MobileGym: モバイル GUI エージェント研究用の検証可能で高度な並列シミュレーションプラットフォーム

MobileGym は、日常的なモバイル利用を対象とした、ブラウザーでホストされる軽量で完全に制御可能な環境であり、独自のバックエンドを複製することなくインタラクションの忠実度を目指します。これにより、これまで日常的なアプリでは実現できなかった 2 つの機能が可能になります。構造化された JSON 状態に対する決定論的な状態ベースの判定による検証可能な結果シグナルと、低コストの並列ロールアウトによるスケーラブルなオンライン RL です。環境の完全な状態は構造化 JSON としてキャプチャ、構成、フォーク、比較でき、単一サーバーで数百の並列インスタンスをホストできます (インスタンスあたり約 400 MB のメモリ、コールドスタート約 3 秒)。階層化された状態モデルと宣言型のタスク定義フレームワークにより、状態のプログラマビリティとタスク作成が大規模でも実用的に保たれ、単一のプログラム的な判定メカニズムが決定論的な評価判定と高密度の RL 報酬の両方を提供します。付属の MobileGym-Bench は、28 のアプリにわたる 416 個のパラメーター化されたタスクテンプレート (テスト用 256 個、トレーニング用 160 個) と、決定論的な判定器、およびフリーテキストマッチングの失敗を回避する構造化された AnswerSheet プロトコルを提供します。Sim-to-Real のケーススタディでは、Qwen3-VL-4B-Instruct に対する GRPO が 256 タスクのテストセットで +12.8 パーセントポイント向上し、59 タスクの実機シグナルサブセットでは、実機での実行がシミュレーション側のトレーニングゲインの 95.1% を維持しました。プロジェクトページ: https://mobilegym.github.io

原文 (English)

MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

We present MobileGym, a browser-hosted, lightweight, fully controllable environment for everyday mobile use, targeting interaction fidelity without replicating proprietary backends. It enables two capabilities previously out of reach for everyday apps: verifiable outcome signals through deterministic state-based judging over structured JSON state, and scalable online RL through low-cost parallel rollouts. The full environment state is captured, configured, forked, and compared as structured JSON, and a single server can host hundreds of parallel instances, with about 400 MB memory per instance and about 3 s cold start. A layered state model and a declarative task-definition framework keep state programmability and task creation practical at scale, and a single programmatic judging mechanism delivers both deterministic evaluation verdicts and dense RL rewards. The accompanying MobileGym-Bench provides 416 parameterized task templates, including 256 test and 160 train templates, over 28 apps, with deterministic judges and a structured AnswerSheet protocol that avoids free-text matching failures. In a Sim-to-Real case study, GRPO on Qwen3-VL-4B-Instruct gains +12.8 percentage points on the 256-task test set, and on a 59-task real-device signal subset, real-device execution retains 95.1% of the simulation-side training gain. Project page: https://mobilegym.github.io.
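
To illustrate how a single programmatic judge over structured JSON state could yield both a deterministic verdict and a dense reward, here is a minimal sketch. The state schema, the dotted-path condition format, and the names `get_path` and `judge` are hypothetical and are not MobileGym's actual task-definition format. The reward is assumed to be the fraction of satisfied conditions.

```python
import json
from typing import Any, Dict, List, Tuple


def get_path(state: Dict[str, Any], path: str) -> Any:
    """Look up a dotted path such as 'alarm.hour'; return None if absent."""
    node: Any = state
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def judge(state: Dict[str, Any], conditions: List[Tuple[str, Any]]) -> Tuple[bool, float]:
    """Return (verdict, reward) from exact comparisons against the final state.

    verdict: True only if every condition holds (deterministic pass/fail).
    reward:  fraction of conditions satisfied (assumed dense RL signal).
    """
    if not conditions:
        raise ValueError("a task needs at least one condition")
    hits = sum(1 for path, expected in conditions if get_path(state, path) == expected)
    return hits == len(conditions), hits / len(conditions)


# Hypothetical task: "set an enabled alarm for 07:30".
final_state = json.loads('{"alarm": {"hour": 7, "minute": 30, "enabled": false}}')
task = [("alarm.hour", 7), ("alarm.minute", 30), ("alarm.enabled", True)]

verdict, reward = judge(final_state, task)
print(verdict, round(reward, 3))
```

Because the judge compares state rather than free-text answers, the same input always gives the same verdict, and a partially completed task still produces a graded signal for training.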

2026-05-26 13:00 JST arXiv cs.AI LLM/生成AI エージェント

VineLM: エージェントワークフローのためのトライベースのきめ細かい制御

エージェントワークフローは、構成可能な LLM ステージとツールステージを交互に組み合わせ、多くの場合、リトライや改良のループを含みます。既存のワークフローマネージャーは、ワークフロー構成全体をオフラインでプロファイリングし、各リクエストに静的なワークフローレベルのプランを割り当てます。このプランは、構成可能な各 LLM ステージを単一のモデルに固定し、ループの反復でも同じモデルを再利用し、実行時にその選択を見直すことはありません。私たちは、コストやレイテンシーの予算内で精度を最大化するといったリクエストレベルの目標のもと、実行の進行に合わせてステージ呼び出しごとにモデルを選択することで、きめ細かい制御を可能にするワークフローマネージャー VineLM を紹介します。VineLM は、実行可能な実行経路をモデル選択プレフィックスの注釈付きトライとして表現し、チェックポイント作成とカスケードプロファイリングを用いて、すべてのリクエストをすべてのパスで網羅的にプロファイリングすることなく、パスの精度、コスト、レイテンシーを推定します。実行時には、各ステージ呼び出しの後にトライのルートを付け替え、実現した実行プレフィックスと残りのレイテンシー予算を用いて残りのサブトライ上で再計画します。NL2SQL および数学推論のワークフローにおいて、VineLM は粗いワークフローレベルのベースラインに比べてコスト・レイテンシー・精度のフロンティアを改善し、リクエストあたりの予算が同じ場合に最大 18% 高い精度を達成します。また、スパースプロファイリングにより、網羅的なプロファイリングと比べてオフラインプロファイリングのコストを 98 ～ 99.8% 削減します。

原文 (English)

VineLM: Trie-Based Fine-Grained Control for Agentic Workflows

Agentic workflows interleave configurable LLM stages with tool stages and often include retries or refinement loops. Existing workflow managers profile full workflow configurations offline and assign each request a static workflow-level plan that binds each configurable LLM stage to a single model, reuses that model across repeated loop iterations, and does not revisit those choices at runtime. We present VineLM, a workflow manager that enables fine-grained control by choosing the model for each stage invocation as execution unfolds under request-level objectives such as maximizing accuracy under cost or latency budgets. VineLM represents feasible executions as an annotated trie of model-choice prefixes and uses checkpointing and cascade profiling to estimate path accuracy, cost, and latency without exhaustively profiling every request on every path. At runtime, VineLM re-roots the trie after each stage invocation and replans over the remaining subtrie using the realized execution prefix and remaining latency budget. On NL2SQL and math reasoning workflows, VineLM improves the cost-latency-accuracy frontier over coarse workflow-level baselines, achieving up to 18% higher accuracy at the same per-request budget with its sparse profiling reducing offline profiling cost by 98-99.8% when compared to exhaustive profiling.
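
The re-root-and-replan idea can be sketched with a small annotated trie. This is an illustrative reading of the abstract, not VineLM's implementation. The `Node` layout, the latency-only budget, and all numbers are assumptions. Each node stands for one stage invocation with a chosen model, and leaves carry an estimated end-to-end accuracy for the full model-choice prefix.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    latency: float = 0.0   # estimated latency of the stage invocation leading here
    accuracy: float = 0.0  # estimated end-to-end accuracy (meaningful at leaves)
    children: Dict[str, "Node"] = field(default_factory=dict)


def best_accuracy(node: Node, budget: float) -> Optional[float]:
    """Best leaf accuracy reachable below node within the latency budget."""
    if not node.children:
        return node.accuracy
    best: Optional[float] = None
    for child in node.children.values():
        if child.latency > budget:
            continue  # this model choice would blow the remaining budget
        acc = best_accuracy(child, budget - child.latency)
        if acc is not None and (best is None or acc > best):
            best = acc
    return best


def choose_next(root: Node, budget: float) -> Optional[str]:
    """Pick the model for the next stage invocation from the current root."""
    best_model: Optional[str] = None
    best_acc: Optional[float] = None
    for model, child in root.children.items():
        if child.latency > budget:
            continue
        acc = best_accuracy(child, budget - child.latency)
        if acc is not None and (best_acc is None or acc > best_acc):
            best_model, best_acc = model, acc
    return best_model


# Hypothetical two-stage workflow with a "small" and a "large" model.
trie = Node(children={
    "small": Node(latency=1.0, children={
        "small": Node(latency=1.0, accuracy=0.60),
        "large": Node(latency=3.0, accuracy=0.75),
    }),
    "large": Node(latency=3.0, children={
        "small": Node(latency=1.0, accuracy=0.70),
        "large": Node(latency=3.0, accuracy=0.85),
    }),
})

budget = 5.0
first = choose_next(trie, budget)        # plan at the original root

# Stage 1 ran slower than estimated (2.5 instead of 1.0): re-root and replan.
realized_latency = 2.5
subtrie = trie.children[first]
second = choose_next(subtrie, budget - realized_latency)
print(first, second)
```

With the estimated latencies, the initial plan is `small` then `large`. After the first stage overruns, replanning on the remaining subtrie falls back to `small`, which a static workflow-level plan could not do.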

2026-05-26 13:00 JST arXiv cs.AI LLM/生成AI エージェント

LLM ツールレジストリにおけるエージェント向けの情報設計

LLM ツールレジストリは、規制されていない広告プラットフォームとして機能しています。プロバイダーはエージェントが選択に使うフリーテキストの説明文を書きますが、この市場に説明責任を持たせるための測定インフラストラクチャ (ビューアビリティ基準、品質スコア、結果監査) は存在しません。私たちは、5 つの LLM と 10 のドメインにわたる 17,700 件以上の試行と、建設的なレジストリ設計の処方を組み合わせた、最初の体系的なフレームワークを提供します。合法的な誇張表現 (主観的な最上級表現、便益のフレーミング) だけで最適化効果の 100% が得られ、捏造された主張は追加のバイアスをまったく生みません。そのため、欺瞞的広告ルールに基づく FTC の執行は、実際に作用しているメカニズムに対して効果がありません。開示は構造的に機能しません。システムプロンプトでの警告は 5 つのモデルのうち 4 つで測定可能な効果を生まず、行動の上限がラベルベースの補正の余地を残しません。最上級表現は単独で最も支配的な特徴です (SBC = +0.35)。レジストリ層での説明文の正規化は、モデルに依存せずファーストベストの厚生を達成します。私たちは、選択向けの説明文 (構造化され、レジストリが管理) とマーケティング向けの説明文 (プロバイダーが作成し、選択後に表示) を分離することを提案し、能力とコピーライティングを区別するために Agent Attention Quality Score を導入します。

原文 (English)

Agent-Facing Information Design in LLM Tool Registries

LLM tool registries function as unregulated advertising platforms: providers write free-text descriptions that agents use for selection, yet no measurement infrastructure -- no viewability standard, quality score, or outcome audit -- exists to make this market accountable. We provide the first systematic framework, combining 17,700+ trials across five LLMs and ten domains with a constructive registry design prescription. Legal puffery alone (subjective superlatives, benefit framing) captures 100% of the optimization effect; fabricated claims add zero incremental bias -- rendering FTC enforcement of deceptive advertising rules ineffective against the active mechanism. Disclosure fails structurally: system-prompt warnings produce zero measurable effect for four of five models, and behavioral ceilings leave no headroom for label-based correction. Superlatives are the dominant single feature (SBC = +0.35). Registry-layer description normalization achieves first-best welfare model-independently. We propose separating selection-facing descriptions (structured, registry-controlled) from marketing-facing descriptions (provider-authored, shown post-selection), and introduce the Agent Attention Quality Score to distinguish capability from copywriting.

2026-05-26 13:00 JST arXiv cs.AI LLM/生成AI エージェント

SODE: LLM エージェントの社会動態の分析

大規模言語モデル (LLM) が対話型エージェントへと進化するにつれて、人間の社会的ダイナミクスの中での行動面の整合性を理解することが不可欠になります。行動ゲーム理論はこうした相互作用を研究する枠組みを提供しますが、これまでの研究は主に平均スコアなどの結果ベースの指標に依存してきました。まったく異なる戦略から同一のスコアが得られることがあるため、この着眼点では持続可能な協力を促すメカニズムが見落とされます。このギャップを埋めるために、SODE (Social Dynamics Evaluation) を導入します。これは、戦略適応のための直接互恵性、評判への感度のための間接互恵性、協力の回復力のためのグループダイナミクスという 3 つの進化的次元にわたって LLM エージェントを評価するフレームワークです。SODE を適用すると、体系的な相違が明らかになります。命令調整モデルは、搾取に対して脆弱になる「受動的な追従」を示すことが多い一方、推論モデルは短期的な最適化を優先し、長期的な協力を不安定にします。特に、「長期的なフレーミング」によって推論モデルの互恵的な能力を引き出せることを実証します。このように SODE は、AI エージェントを人間の複雑な社会的ダイナミクスに整合させるための、体系的でメカニズムに根ざしたベンチマークを提供します。

原文 (English)

SODE: Analyzing Social Dynamics in LLM Agents

As Large Language Models (LLMs) evolve into interactive agents, understanding their behavioral alignment within human social dynamics becomes essential. While behavioral game theory offers a framework to study these interactions, previous work has predominantly relied on outcome-based metrics such as average scores. This focus overlooks the mechanisms that facilitate sustainable cooperation, as identical scores can be derived from vastly different strategies. To bridge this gap, we introduce SODE (Social Dynamics Evaluation), a framework that evaluates LLM agents across three evolutionary dimensions: Direct Reciprocity for strategy adaptation, Indirect Reciprocity for reputation sensitivity, and Group Dynamics for cooperative resilience. Applying SODE reveals systematic divergences: instruction-tuned models often exhibit "passive compliance" that renders them vulnerable to exploitation, while reasoning models prioritize short-horizon optimization, destabilizing long-term cooperation. Notably, we demonstrate that a "long-horizon framing" can unlock reciprocal capabilities in reasoning models. Thus, SODE offers a systematic, mechanism-grounded benchmark for aligning AI agents with complex human social dynamics.

2026-05-26 13:00 JST arXiv cs.AI LLM/生成AI エージェント

MemForest: 階層型時間インデックスを備えた効率的なエージェントメモリシステム

メモリは、ロングコンテキストの LLM エージェントを有効にするための基本コンポーネントであり、継続的な提供と更新のライフサイクルを通じて対話全体にわたる永続的な状態をサポートします。相当な事前作業にもかかわらず、既存のシステムは、粗粒度の状態管理と本質的に逐次的な更新パイプラインという 2 つの重要な制限により、重大なメンテナンスのオーバーヘッドに悩まされています。特に、更新は LLM 推論と密接に結びついていることが多く、完全な状態の書き換えが必要なため、スケーラビリティが低下し、メモリが蓄積するにつれて遅延が増大します。これらの課題に対処するために、エージェントのメモリを書き込み効率の高い時間データ管理問題として再定式化するメモリフレームワークである MemForest を紹介します。 MemForest は、並列チャンク抽出によってシーケンシャルボトルネックを解消し、メモリ構築を同時の独立した操作に分離します。粗粒度のメンテナンスをさらに排除するために、フラットなグローバルサマリーではなく時間順のツリーとしてメモリを編成する階層型時間インデックスである MemTree を導入します。この設計では、完全な状態の書き換えを局所的なノードごとの更新に置き換え、影響を受けるツリーパスのメンテナンスコストを削減しながら、時間的に変化する状態を自然に保存します。私たちは、LongMemEval-S と LoCoMo という 2 つのロングコンテキストメモリベンチマークで MemForest を評価します。 LongMemEval-S では、MemForest はステートフルベースラインの中で最高の総合パフォーマンスを達成し、EverMemOS を含む最先端のアプローチよりも約 6 倍高いメモリ構築スループットを維持しながら、79.8% pass@1 精度に達します。

原文 (English)

MemForest: An Efficient Agent Memory System with Hierarchical Temporal Indexing

Memory is a fundamental component for enabling long-context LLM agents, supporting persistent state across interactions through a continuous serve-and-update lifecycle. Despite substantial prior work, existing systems suffer from significant maintenance overhead due to two key limitations: coarse-grained state management and inherently sequential update pipelines. In particular, updates are often tightly coupled with LLM inference and require full-state rewrites, leading to poor scalability and growing latency as memory accumulates. To address these challenges, we present MemForest, a memory framework that reformulates agent memory as a write-efficient temporal data management problem. MemForest breaks the sequential bottleneck via parallel chunk extraction, decoupling memory construction into concurrent, independent operations. To further eliminate coarse-grained maintenance, we introduce MemTree, a hierarchical temporal index that organizes memory as time-ordered trees rather than flat global summaries. This design replaces full-state rewrites with localized per-node updates, reducing maintenance cost to the affected tree paths while naturally preserving temporally evolving states. We evaluate MemForest on two long-context memory benchmarks, LongMemEval-S and LoCoMo. On LongMemEval-S, MemForest achieves the best overall performance among stateful baselines, reaching 79.8% pass@1 accuracy while sustaining a memory construction throughput approximately 6x higher than state-of-the-art approaches including EverMemOS.
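
The contrast between full-state rewrites and localized per-node updates can be sketched with a small time-ordered tree. This is an assumed illustration of the idea behind MemTree, not MemForest's data structure. The year/month/day bucketing, the `TimeNode` fields, and the string-joining `summarize` (a stand-in for an LLM summarizer) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TimeNode:
    label: str
    entries: List[str] = field(default_factory=list)            # raw memories (day leaves)
    children: Dict[str, "TimeNode"] = field(default_factory=dict)
    summary: str = ""
    refreshes: int = 0  # how often this node's summary was recomputed


def summarize(node: TimeNode) -> None:
    """Stand-in for an LLM call: join own entries and child summaries in time order."""
    child_summaries = [node.children[key].summary for key in sorted(node.children)]
    node.summary = " | ".join(node.entries + child_summaries)
    node.refreshes += 1


def insert(root: TimeNode, date: str, text: str) -> None:
    """Insert a memory dated 'YYYY-MM-DD' and refresh only the touched path."""
    year, month, _ = date.split("-")
    path = [root]
    node = root
    for key in (year, f"{year}-{month}", date):  # ISO keys sort chronologically
        node = node.children.setdefault(key, TimeNode(label=key))
        path.append(node)
    node.entries.append(text)
    for touched in reversed(path):  # leaf first, root last
        summarize(touched)


root = TimeNode(label="all")
insert(root, "2026-05-01", "user moved to Osaka")
insert(root, "2026-05-20", "user adopted a cat")
insert(root, "2026-04-11", "user started a new job")

may = root.children["2026"].children["2026-05"]
print(root.summary)
print("May subtree refreshes:", may.refreshes)
```

The third insert lands in April, so the May subtree is not recomputed. Only the nodes on the path from the new leaf to the root are refreshed, which is the cost reduction the abstract attributes to per-node updates.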

2026-05-26 13:00 JST arXiv cs.AI エージェント 研究/論文

エージェント型原子論的研究のための AtomisticSkills の活用

計算材料科学と化学は、広大な知識領域と分断されたソフトウェアエコシステムにまたがっています。大規模言語モデル (LLM) は研究能力を示していますが、原子論的研究の厳密さと複雑さを扱えるようにモノリシックなエージェントをスケールさせることは依然として課題です。ここでは、汎用 AI コーディングエージェントが材料科学、化学、創薬にわたる原子論的研究を実行できるようにするオープンソースのハーネスフレームワーク AtomisticSkills を紹介します。AtomisticSkills は、科学的ワークフローをエージェントのスキルとツールに階層的に分解することで、モジュール式で拡張可能なプラグアンドプレイの研究機能をエージェントに提供します。このフレームワークには、データベースアクセス、熱力学および反応速度論のモデリング、機械学習原子間ポテンシャル (MLIP) や密度汎関数理論 (DFT) を用いる多様なシミュレーションエンジンなど、人手で厳選された 100 以上の学際的スキルが統合されています。私たちは、その機能範囲を科学文献に照らして検証し、多様な科学キャンペーンにわたる堅牢なオーケストレーション能力を実証します。具体的には、リチウムイオン固体電解質の生成的設計、CO2 捕捉のための金属有機構造体 (MOF) のハイスループットスクリーニング、自律的な MLIP のベンチマークと微調整、創薬のための多段階の構造ベースバーチャルスクリーニング、マルチモーダルな X 線回折パターン解析、酸素発生反応用の鉄酸化物触媒のスクリーニングです。AtomisticSkills は、完全自律型の AI サイエンティストの構築に向けた重要なエージェントインフラストラクチャを提供します。

原文 (English)

Harnessing AtomisticSkills for Agentic Atomistic Research

Computational materials science and chemistry span vast knowledge domains and fractured software ecosystems. Although large language models (LLMs) have demonstrated research capabilities, scaling monolithic agents to manage the rigor and complexity of atomistic research remains a challenge. Here, we introduce AtomisticSkills, an open-source harness framework that empowers general-purpose AI coding agents to conduct atomistic research across materials science, chemistry, and drug discovery. By hierarchically decomposing scientific workflows into agent skills and tools, AtomisticSkills provides agents with modular, extensible, and plug-and-play research capabilities. The framework integrates more than 100 human-curated multidisciplinary skills, including database access, thermodynamics and kinetics modeling, and diverse simulation engines employing machine learning interatomic potentials (MLIPs) and density functional theory (DFT). We validate its functional coverage against scientific literature and demonstrate robust orchestration capabilities across diverse scientific campaigns: generative design of Li-ion solid-state electrolytes, high-throughput screening of metal-organic frameworks for CO2 capture, autonomous MLIP benchmarking and fine-tuning, multi-stage structure-based virtual screening for drug design, multimodal X-ray diffraction pattern analysis, and screening of Fe-oxide catalysts for oxygen evolution reaction. AtomisticSkills provides a critical agent infrastructure towards building fully autonomous AI scientists.

2026-05-26 13:00 JST arXiv cs.AI LLM/生成AI エージェント

Mixture of Complementary Agents for Robust LLM Ensemble

Multi-AI collaboration, such as ensembling or debating large language models (LLMs), is a promising paradigm for aggregating information and boosting performance. A foundational step in these pipelines is to feed the responses of several proposer LLMs into a summarizer LLM, which synthesizes a better answer. However, choosing which proposers to include is non-trivial. Existing approaches primarily focus either on accuracy (picking the strongest models) or diversity (ensuring variety), and often overlook the interactions among proposers and with the summarizer. We reframe proposer selection as a combinatorial selection problem akin to feature selection, where the value of an LLM lies in its complementarity with others. However, directly applying standard feature-selection algorithms is impractical in the LLM setting due to prohibitive time complexity. Motivated by this limitation, we explore an extensive range of computationally feasible, greedy-style selection algorithms that assess complementarity using a small labeled set. Our experiments validate complementarity as a guiding principle for proposer selection and identify methods that achieve the best performance-cost trade-offs in practice.
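
The greedy, complementarity-driven selection described above can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the abstract does not specify the scoring function, so the sketch assumes a hypothetical proxy (`coverage`, the fraction of labeled examples that at least one selected proposer answers correctly) in place of actually running the summarizer. The names `coverage`, `greedy_select`, and the `model_*` proposers are invented for the example.

```python
from typing import Dict, List


def coverage(selected: List[str], correct: Dict[str, List[bool]]) -> float:
    """Fraction of labeled examples that at least one selected proposer gets right.

    Assumption: this "oracle coverage" is only a cheap stand-in for the quality
    of the summarizer's final answer.
    """
    if not selected:
        return 0.0
    n_examples = len(next(iter(correct.values())))
    hits = sum(any(correct[m][j] for m in selected) for j in range(n_examples))
    return hits / n_examples


def greedy_select(correct: Dict[str, List[bool]], k: int) -> List[str]:
    """Add, one at a time, the proposer with the largest marginal coverage gain."""
    selected: List[str] = []
    remaining = set(correct)
    while remaining and len(selected) < k:
        base = coverage(selected, correct)
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(
            sorted(remaining),
            key=lambda m: coverage(selected + [m], correct) - base,
        )
        if coverage(selected + [best], correct) - base <= 0:
            break  # no remaining proposer adds anything new
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical per-example correctness on a six-item labeled set.
correct = {
    "model_a": [True, True, True, False, False, False],
    "model_b": [True, True, False, True, False, False],
    "model_c": [False, False, False, False, True, True],
}
picked = greedy_select(correct, k=2)
print(picked, coverage(picked, correct))
```

In this toy data `model_c` is the weakest proposer on its own, yet it is chosen second because it covers examples that `model_a` misses; picking the two most accurate models would leave those examples uncovered.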

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

More Skills, Worse Agents? Skill Shadowing Degrades Performance When Expanding Skill Libraries

Skill libraries allow LLM agents to load task-specific instructions on demand, letting non-expert users solve domain-specific tasks through natural language without knowing which skills exist or how they work. However, performance degrades as libraries grow, by up to 21% when scaling from a small set of helpful skills to a 202-skill library. In this work, we formulate this performance degradation as the pass rate drop between loading a library of known-helpful skills and loading the full library. Moreover, we propose to decompose the pass rate drop, by conditioning on skill invocation (which skills the agent selects during a trajectory), into two effects: skill shadowing, where the agent selects wrong skills more often as the library expands, and context overhead, where the enlarged context degrades execution even when selection is correct. We derive upper bounds on both effects to characterize the magnitude of their impact on the pass rate drop. Our empirical estimates of the effects and their upper bounds both show that the skill shadowing effect grows with library size and contributes significantly to the performance degradation, whereas the context overhead effect remains small and indistinguishable from zero. This observed asymmetry establishes that skill selection failure, not the enlarged context, is the primary bottleneck when expanding skill libraries.
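
One way to picture the decomposition is the sketch below. It is an assumption-laden illustration rather than the paper's formulation: the abstract does not give the formulas, so the sketch uses a standard two-term split of a difference of mixtures. Here `p_*` is a hypothetical distribution over which skill outcome the agent invokes, and `q_*` is the pass rate conditional on that outcome; all numbers are invented.

```python
from typing import Dict, Tuple


def pass_rate(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Overall pass rate via the law of total probability over invocations."""
    return sum(p[s] * q[s] for s in p)


def decompose_drop(
    p_small: Dict[str, float],
    q_small: Dict[str, float],
    p_full: Dict[str, float],
    q_full: Dict[str, float],
) -> Tuple[float, float]:
    """Split pass_rate(small) - pass_rate(full) into two additive terms.

    shadowing: selection shifts between libraries, execution held at the
               small-library conditional pass rates.
    overhead:  selection held at the full-library distribution, conditional
               pass rates change because of the larger context.
    """
    shadowing = sum((p_small[s] - p_full[s]) * q_small[s] for s in p_small)
    overhead = sum(p_full[s] * (q_small[s] - q_full[s]) for s in p_small)
    return shadowing, overhead


# Hypothetical numbers: "right" = a helpful skill was invoked, "wrong" = not.
p_small = {"right": 0.9, "wrong": 0.1}
q_small = {"right": 0.8, "wrong": 0.2}
p_full = {"right": 0.6, "wrong": 0.4}
q_full = {"right": 0.78, "wrong": 0.2}

drop = pass_rate(p_small, q_small) - pass_rate(p_full, q_full)
shadowing, overhead = decompose_drop(p_small, q_small, p_full, q_full)
print(f"drop={drop:.3f} shadowing={shadowing:.3f} overhead={overhead:.3f}")
```

The two terms always sum to the total drop, so in this made-up example most of the loss is attributed to the shift in selection, mirroring the asymmetry the abstract reports.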

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Research/Papers

When the Manual Lies: A Realistic Benchmark to Evaluate MCP Poisoning Attacks for LLM Agents

The rise of tool-using Large Language Model (LLM) agents, standardized by protocols like the Model Context Protocol (MCP), has unlocked unprecedented autonomous execution capabilities for LLM Agents by integrating external open-domain knowledge and tools. However, this interoperability introduces a covert attack surface targeting the agent's cognitive planning layer. This paper systematically investigates Tool Description Poisoning (TDP), a novel semantic attack. In TDP, malicious instructions are not embedded in a tool's executable code, but rather covertly injected into its descriptive metadata, the very "manual" an agent relies on for secure planning and decision-making. To rigorously and systematically evaluate this emerging threat, we introduce the MCP-TDP Security Benchmark. This high-fidelity sandbox environment comprises 32 realistic, real-world test cases spanning 6 distinct risk categories. Our evaluation of 8 mainstream LLMs reveals severe vulnerabilities, with leading models like GPT-4o exhibiting a nearly 100% Attack Success Rate (ASR) in six high-risk scenarios. Furthermore, our findings demonstrate that common prompt-guardrail defenses are largely ineffective and can, counterintuitively, even be counterproductive (a phenomenon which we term the "Firewall Fallacy"). Crucially, we also propose a defense mechanism: "Reactive Self-Correction," where an agent autonomously detects and reverts its own malicious actions post-execution. This work provides the first specialized security benchmark tailored for TDP, offering essential insights for securing the cognitive and planning layers of advanced agentic systems.

2026-05-26 13:00 JST | arXiv cs.AI | Agents

Not All Transitions Matter: Evidence from PPO

Training a reinforcement learning agent on-policy means collecting fresh experience at every update, and that experience comes with a hidden problem. Each state in a rollout is the direct output of the previous one, causally chained together by the agent's own actions. Because of this, consecutive transitions are never truly independent. They carry overlapping information, and the gradient signal the network receives ends up far more repetitive than the batch size suggests. The same directions get reinforced over and over, the value network struggles to keep up as the policy shifts, and training becomes quietly unstable in ways that reward curves alone rarely reveal. This paper asks whether that redundancy can simply be removed. We show that randomly dropping a fixed fraction of transitions from the rollout, at the right stage so the reward signal stays intact, is enough to break the repetitive gradient structure and stabilize training. The change is minimal: one sampling step, no new components, no modification to the core algorithm, and it works with any PPO implementation. Across five environments of increasing difficulty, CartPole-v1, Acrobot-v1, LunarLander-v2, HalfCheetah-v5, and Hopper-v5, the method matches vanilla PPO on reward while producing more consistent training dynamics across KL divergence, policy entropy, and value estimates. Dropping 25% of transitions turns out to be the sweet spot: enough to disrupt the redundancy, not enough to thin the batch.
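
A minimal sketch of the idea follows, under a stated assumption: the abstract says transitions are dropped "at the right stage so the reward signal stays intact" without naming the stage, so this sketch assumes it means after advantages have been computed over the complete rollout. The function names and the toy rollout are hypothetical, and the advantage estimator shown is ordinary GAE.

```python
import random
from typing import Dict, List, Sequence


def compute_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    dones: Sequence[bool],
    last_value: float,
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Generalized advantage estimation over the full, unthinned rollout."""
    advantages = [0.0] * len(rewards)
    next_adv, next_value = 0.0, last_value
    for t in reversed(range(len(rewards))):
        nonterminal = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        next_adv = delta + gamma * lam * nonterminal * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages


def drop_transitions(
    batch: Dict[str, list], drop_frac: float, rng: random.Random
) -> Dict[str, list]:
    """Keep a random (1 - drop_frac) share of transitions, same rows in every column."""
    n = len(batch["advantages"])
    n_keep = n - int(round(drop_frac * n))
    keep = sorted(rng.sample(range(n), n_keep))
    return {name: [column[i] for i in keep] for name, column in batch.items()}


# Toy rollout of 8 steps; "obs" just stores the step index for readability.
rewards = [1.0] * 8
values = [0.5] * 8
dones = [False] * 7 + [True]
advantages = compute_gae(rewards, values, dones, last_value=0.0)

batch = {"obs": list(range(8)), "advantages": advantages}
kept = drop_transitions(batch, drop_frac=0.25, rng=random.Random(0))
print(len(kept["obs"]), "of", len(batch["obs"]), "transitions kept")
```

Because the advantages are computed before thinning, every surviving transition still carries a credit signal derived from the whole trajectory; only the minibatches seen by the PPO update get smaller.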

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Understanding Conversational Patterns in Multi-agent Programming: A Case Study on Fibonacci Game Development

Large Language Models (LLMs) are increasingly applied to software engineering (SE), yet their potential for autonomous, role-oriented collaboration remains largely underexplored. Understanding how multiple LLM-based agents coordinate, maintain role alignment, and converge on solutions is critical for SE, as naively allowing agents to interact does not reliably lead to correct or stable outcomes. Recent empirical studies show that unstructured or poorly understood interaction dynamics can result in error propagation, premature consensus on incorrect solutions, or prolonged disagreement that prevents convergence, even when correct partial solutions are present early in the interaction. As an initial step towards addressing this underexplored area, we undertake a systematic analysis of conversations between two agents, a Designer and a Programmer, across 12 model combinations from 7 open-source LLMs (Gemma 2, Gemma 3, LLaMA 3.2, LLaMA 3.3, DeepSeek-R1, MiniCPM, and Qwen3). Our systematic approach reveals three key dimensions of multi-agent interaction: efficiency (the speed and stability of convergence), consistency (the degree of role alignment visualized by BLEU and ROUGE), and effectiveness (the extent of compilation success and error resolution). Results show that the DeepSeek-R1:DeepSeek-R1 pair was unique in converging to the correct solution from the very first iteration and sustaining it consistently to the final iteration, while LLaMA 3.2:LLaMA 3.2 and Qwen3:Qwen3 demonstrated strong Designer:Programmer role alignment despite diverging from the correct solution. The other pairs deviated from the task and never converged on a result. These findings advance understanding of agentic programming and highlight the need for further research on understanding and calibrating the convergence and stop conditions essential for future autonomous SE.

2026-05-26 13:00 JST | arXiv cs.AI | Agents | Research/Papers

AvalancheBench: Evaluating Enterprise Data Agents Through Latent World Recovery

We introduce AvalancheBench, a benchmark for evaluating enterprise data agents through latent world recovery. AvalancheBench improves on existing benchmarks in three ways. First, it evaluates analytical understanding rather than pipeline completion: systems are scored on whether they recover the segments, drivers, temporal events, and relationships that explain the data, not merely on whether they execute a workflow or produce a plausible report. Second, it provides ground truth for goal-driven analytics by generating observations from a known latent world, enabling partial credit for incomplete but valid recoveries. Third, it exposes how early analytical mistakes propagate into later conclusions: missed segments, merged events, or wrong attributions can lead to systematically wrong recommendations. In this sense, AvalancheBench complements real-data benchmarks by providing a controlled setting for diagnosing whether agents recover the analytical structure behind enterprise data. On a first e-commerce use case, the strongest configuration of a leading coding agent recovers only 26% of the rubric, with failures concentrated in generic customer segmentations and merged temporal events.

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning

Monitoring autonomous large language model (LLM) agents for covert malicious behavior is challenging due to delayed, context-dependent, and long-horizon attack patterns. Agents may pursue hidden objectives while maintaining superficially benign behavior, making detection difficult even with full trajectory access. Prior monitoring approaches improve scaffolding or ensemble aggregation, but treat each trajectory independently and do not learn from prior monitoring experience. Moreover, standard reasoning methods explain observed behavior without explicitly reasoning about agent beliefs, intentions, and goal alignment required to distinguish benign task execution from covert deviation. We propose Agent-ToM, a learning-to-monitor framework grounded in Theory-of-Mind (ToM) reasoning for security analysis of autonomous agents. Agent-ToM performs structured full-trajectory analysis by inferring beliefs, intent hypotheses with calibrated confidence, expected actions, and deviations from task-consistent behavioral baselines. At inference time, it employs a Reason-Verify-Refine pipeline to construct and validate monitoring decisions. At training time, Agent-ToM distills critique signals into a persistent semantic guardrail memory, enabling reusable belief- and intent-conditioned constraints across episodes. We evaluate Agent-ToM on adversarial agent monitoring benchmarks (SHADE-Arena and CUA-SHADE-Arena). Agent-ToM achieves strong precision-recall balance and outperforms state-of-the-art monitoring baselines, including ensemble methods, while using a coherent two-call reasoning pipeline. These results demonstrate that learning at the monitoring layer, combined with structured ToM reasoning and verification, provides an effective and deployable foundation for securing autonomous LLM agents.

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Attested Tool-Server Admission: A Security Extension to the Model Context Protocol

The Model Context Protocol (MCP) standardizes how a large-language-model (LLM) agent and an external tool server exchange messages, but not trust: a host reads a server's self-declared tool list and dispatches calls, with no notion of which servers it may use, at what sensitivity, or which of a server's tools are in bounds. This work grew out of a concrete need -- letting the Enclawed agent use Google's externally-operated MCP servers (Gmail, Calendar, Drive) safely, admitting the server and bounding the tools it may drive, without changing MCP or Enclawed's own tool application-programming interface (API). The mechanism we built, mcp-attested (shipped in both the open enclawed-oss distribution and the enclaved flavor), generalizes: the gap that makes an unmediated third-party connection unsafe for one user makes a regulated deployment impossible to accredit. We close it with three additive mechanisms: (1) a small, offline-signed clearance assertion a server publishes at a well-known Uniform Resource Identifier (URI) and a host verifies against a pinned trust root before any tool dispatch; (2) a deny-by-default per-server tool allowlist, so admitting a server is not trusting its every tool; and (3) a flavor-gated enforcement mode that turns the checks from warnings into hard denials, with every decision written to a tamper-evident audit log. We give the wire format, the verification algorithm, a security analysis, and an LLM-driven adversarial evaluation; we then state the design in normative Request-for-Comments (RFC 2119) form -- schema, verification rules, error registry, well-known registration, and machine-checkable conformance vectors -- so it can be adopted as an MCP addendum rather than reinvented. An unextended host ignores the well-known document and behaves exactly as today.

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Code2UML: Agentic LLMs with context engineering for scalable software visualization

Large Language Model (LLM)-based code analysis tools are adopted to automate software documentation tasks. However, the scalability of these approaches to real codebases, where Intermediate Representations (IR) exceed LLM context limits, remains underexplored. This paper introduces an agentic architecture with context engineering for automated UML diagram generation from source code repositories. It employs a hierarchy of five specialized agents (PlannerAgent, AnalyzerAgent, DiagramAgent, CorrectorAgent, and DependencyAnalyzerAgent) built on the Claude Agent SDK, each addressing a distinct cognitive subtask. A deterministic, importance-weighted IR compaction layer transforms full project IRs into diagram-specific views guaranteed to fit within token constraints, requiring no LLM calls and completing in milliseconds. We evaluate the system across 12 open-source repositories in 4 programming languages (Java, JavaScript, PHP, Python) and 7 UML diagram types, producing 84 observations assessed on 5 automated metrics. Results demonstrate high syntactic validity (mean: 91.5%, with component and deployment diagrams reaching 100%), strong relationship precision (mean: 0.858), and consistent structural quality (mean: 81.7/100, with cross-language variance of 3.1 points). Entity recall averaged 0.313, reflecting deliberate architectural prioritization over exhaustive coverage. A sensitivity analysis (31 to 4,578 IR entities) confirms that quality scores remain stable regardless of scale.

2026-05-26 13:00 JST | arXiv cs.AI | Agents

Adaptive Punishment for Cooperation in Mixed-Motive Games

Mixed-motive scenarios are ubiquitous in real-world multi-agent interactions, where self-interested agents often defect for immediate rewards, overlooking the potential of altruistic cooperation to improve long-term gains and collective welfare. Peer punishment can deter defection, but as costly second-order altruism, its persistent imposition may undermine the punisher's interests. Existing approaches often struggle to effectively implement punishment to promote cooperation. To balance the efficacy and cost of punishment, we propose Adaptive Punishment for Cooperation (APC), a distributed method that determines punishment intensity based on both a dynamic punishment probability and the severity of defection. This dynamic probability substantially reduces costly and ineffective punishment while also promoting cooperation. To accurately assess defection and its severity, we use a defection awareness module, whose learning is guided by game reward. Theoretical analysis and empirical results show that APC performs effectively in the iterated public goods game. Empirically, APC also significantly outperforms existing baselines across sequential social dilemmas, learning rational and effective punishment policies that foster cooperation by strategically deterring defection.

2026-05-26 13:00 JST | arXiv cs.AI | Agents | Research/Papers

PEDESTRIANQA: A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction

Pedestrian intention and trajectory prediction are critical for the safe deployment of autonomous driving systems, directly influencing navigation decisions in complex traffic environments. Recent advances in large vision-language models (VLMs) offer a powerful new paradigm for these tasks by combining high-capacity visual understanding with flexible natural language reasoning. In this work, we introduce PedestrianQA, a large-scale video-based dataset that formulates pedestrian intention and trajectory prediction as question-answering tasks augmented with structured rationales. PedestrianQA expresses richly annotated pedestrian sequences in natural language, enabling VLMs to learn from visual dynamics, contextual cues, and interactions among traffic agents while generating concise explanations of their predictions, without needing specialized architectures tailored to each task. Empirical evaluations across PIE, JAAD, TITAN, and IDD-PeD show that fine-tuning state-of-the-art VLMs on PedestrianQA significantly improves intention classification, trajectory forecasting accuracy, and the quality of explanatory rationales, demonstrating the strong potential of VLMs as a unified and explainable framework for safety-critical pedestrian behavior modeling.

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

How Many Tools Should an LLM Agent See? A Chance-Corrected Answer

Before an LLM agent can use a tool, a retrieval system must decide which candidate tools to show to the agent. How long should that shortlist be? Show too many tools and the model struggles to choose. Show too few and the correct tool may not appear. Most systems apply a fixed shortlist size to every query, but no standard metric exists to evaluate whether that size was appropriate. We treat the number of tools shown to an LLM agent as the object of evaluation and we apply Bits-over-Random (BoR), a chance-corrected metric that asks whether success at a given depth is better than what random selection would achieve at that same depth. We evaluate BoR across three tool-selection benchmarks, multiple scorers, and registries ranging from 20 to 3,251 tools. We then turn the same principle into a reinforcement learning (RL) reward for choosing tool shortlist depth per query. The RL agent is deliberately simple, serving as a probe of the metric rather than a proposed system. As the shortlist grows, random chance of including the correct tool rises, so the reward naturally decreases, reducing the need for an engineered depth penalty. On BFCL (370 tools), the learned policy nearly matches the coverage of showing 50 tools (90.3% vs 90.8%) while presenting only 7 on average. On ToolBench (3,251 tools), a fixed shortlist of 5 tools achieves higher aggregate coverage (64.7% vs 61.9%) but finds nothing on hard queries (correct tool ranked 6th-20th). The BoR agent finds 16.7% on those same queries by searching deeper. Downstream validation with Claude Sonnet 4.6 indicates that shorter adaptive lists also improve the LLM's ability to select the right tool: 93.1% versus 87.1% when always shown 5 tools, widening to 76.8% vs 60.9% on medium-difficulty queries where the correct tool is present but not ranked first.
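
The chance-correction idea can be illustrated with a short sketch. The abstract does not state the BoR formula, so the definition below is an assumption made for illustration only: the log base 2 of the ratio between the observed success rate at depth k and the success rate of a uniformly random shortlist of the same size (k/N for a registry of N tools with one correct tool). The function names are hypothetical; the inputs reuse the BFCL figures quoted in the abstract, and treating the adaptive policy's average of 7 tools as a fixed depth is a simplification.

```python
import math


def random_success(k: int, n_tools: int) -> float:
    """Chance that a uniformly random shortlist of size k contains the one correct tool."""
    return min(k, n_tools) / n_tools


def bits_over_random(success_at_k: float, k: int, n_tools: int) -> float:
    """Assumed definition: log2 of observed success over the random baseline at depth k."""
    if success_at_k <= 0.0:
        return float("-inf")
    return math.log2(success_at_k / random_success(k, n_tools))


# BFCL registry size and coverage figures as quoted in the abstract.
N_TOOLS = 370
bor_fixed_50 = bits_over_random(0.908, k=50, n_tools=N_TOOLS)
bor_adaptive_7 = bits_over_random(0.903, k=7, n_tools=N_TOOLS)

print(f"fixed depth 50: {bor_fixed_50:.2f} bits over random")
print(f"approx. depth 7: {bor_adaptive_7:.2f} bits over random")
```

Under this assumed definition, nearly identical coverage earns far fewer bits at depth 50 than at depth 7, because a random shortlist of 50 tools already has a much higher chance of containing the correct one. Showing the full registry with perfect coverage scores exactly zero bits.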

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Spectral Retrieval: Multi-Scale Sinc Convolution over Token Embeddings for Localized Retrieval in LLM Multi-Agent Systems

[Abridged] - Spectral Retrieval is a plug-in re-ranking stage that interpolates between per-token MaxSim and mean-pool retrieval through a multi-scale sinc convolution over token embeddings. In standard dense retrieval each document is one mean-pooled vector; when relevance localises into a short subspan, the signal averages into noise. Spectral Retrieval reuses per-token embeddings from a late-interaction index and convolves them with a normalised sinc kernel at multiple scales. At L=1 the kernel acts as the identity, recovering per-token MaxSim; as L grows it approaches a uniform filter, recovering mean pooling. The maximum cosine over positions and scales yields a score provably no less informative than either endpoint. On a controlled synthetic benchmark with 1,000 documents and planted single-position spikes, mean-pool retrieval sits at chance (Recall@10 ~ 0.02) regardless of spike strength, while Spectral Retrieval reaches Recall@10 = 1.0 once the planted cosine exceeds the corpus-level token noise floor. On LIMIT-small with a frozen all-mpnet-base-v2 encoder, Spectral Retrieval lifts Recall@10 from 0.33 to 0.90, MRR from 0.22 to 0.79, and strict Success@10 from 0.12 to 0.84, without retraining. The method fits naturally into multi-agent LLM systems, where each agent benefits from a tighter, role-specific retrieval window over a shared corpus.
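
The scoring rule described above (smooth the token embeddings with a sinc kernel at several scales, then take the maximum cosine over positions and scales) can be sketched in plain Python. This is a toy illustration under assumptions, not the paper's implementation: the kernel support (the whole document), the normalisation (sum of absolute weights), the scale set, and the three-dimensional embeddings are all choices made for the example.

```python
import math
from typing import List, Sequence

Vector = List[float]


def sinc(x: float) -> float:
    """Normalised sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 0.0 if norm_a == 0 or norm_b == 0 else dot / (norm_a * norm_b)


def smooth(tokens: Sequence[Vector], center: int, scale: int) -> Vector:
    """Sinc-weighted combination of all token vectors, centred on one position."""
    weights = [sinc((j - center) / scale) for j in range(len(tokens))]
    total = sum(abs(w) for w in weights)  # > 0 because the centre weight is 1
    dim = len(tokens[0])
    return [
        sum(w * tok[d] for w, tok in zip(weights, tokens)) / total
        for d in range(dim)
    ]


def spectral_score(
    query: Vector, tokens: Sequence[Vector], scales: Sequence[int] = (1, 2, 4, 8)
) -> float:
    """Maximum cosine between the query and any smoothed position at any scale."""
    return max(
        cosine(query, smooth(tokens, center, scale))
        for scale in scales
        for center in range(len(tokens))
    )


def mean_pool_score(query: Vector, tokens: Sequence[Vector]) -> float:
    dim = len(tokens[0])
    mean = [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]
    return cosine(query, mean)


# Toy document: one relevant token (index 3) surrounded by unrelated tokens.
doc = [[0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
query = [1.0, 0.0, 0.0]

print("mean-pool score:", round(mean_pool_score(query, doc), 3))
print("spectral score: ", round(spectral_score(query, doc), 3))
```

At scale 1 the sinc weights vanish (up to floating-point error) at every nonzero integer offset, so the smoothed vector is just the token itself and the single relevant token is matched directly, whereas mean pooling dilutes it across the other five tokens.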

2026-05-26 13:00 JST | arXiv cs.AI | Agents

APT-Agent: Automated Penetration Testing using Large Language Models

Penetration testing is essential to securing modern web infrastructures, yet traditional manual methods struggle to keep pace with their sc…

2026-05-26 13:00 JST · arXiv cs.AI · Agents · Robotics

Scaling up Energy-Aware Multi-Agent Reinforcement Learning for Mission-Oriented Drone Networks with Individual Reward

Multi-agent reinforcement learning (MARL) has shown wide applicability in collaborative systems such as autonomous driving and smart cities…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

Interpretation, Learning, and Empathy as One Constraint: A Residual-Adequacy Architecture with Accountable Abstention

An agent must act on the situation before it, learn what it cannot yet represent, and model other agents well enough to coordinate. These f…

2026-05-26 13:00 JST · arXiv cs.AI · Agents · Robotics

Performance Comparison of Classical and Neural Sampling Algorithms for Robotic Navigation

Integrating artificial intelligence (AI) into sampling-based motion planning provides new possibilities for improving autonomous navigation…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

Multi-Agent Specification-based Metamorphic Testing of FMU-Based Simulations

In many industrial domains, the Functional Mock-up Interface (FMI) is used to exchange simulation models as Functional Mock-up Units (FMUs)…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

LLM Agent Based Renewable Energy Forecasting Using Edge and IoT Data A Review of Solar Wind Weather and Grid Aware Decision Support

Reliable forecasting of renewable energy generation is a foundational requirement for grid stability energy trading battery scheduling and…

2026-05-26 13:00 JST · arXiv cs.AI · Agents · Robotics

Beyond Killer Robots: General AI Attitudes and Public Support for Military AI in Nine Countries

AI-enabled military systems are a fixture of modern military conflict. Applications vary from autonomous drones for surveillance and attack…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

Neuromorphic LiDAR-based Bird's Eye View Object Detection using Energy-efficient Spiking Neural Networks

Autonomous driving perception demands accurate and efficient processing of three-dimensional sensor data under strict power constraints. Tr…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

CausalFlow: Causal Attribution and Counterfactual Repair for LLM Agent Failures

Large language model (LLM) agents frequently fail on multi-step tasks involving reasoning, tool use, and environment interaction. While suc…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

KYA: A Framework-Agnostic Trust Layer for Autonomous Systems with Verifiable Provenance and Hierarchical Policy Composition

Observability tells operators when an agent is slow. KYA tells operators when an agent is wrong, drifting, leaking, or quietly going rogue.…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Evo-Attacker: Memory-Augmented Reinforcement Learning for Long-Horizon Tool Attacks on LLM-MAS

While Large Language Model-based Multi-Agent Systems (LLM-MAS) demonstrate remarkable capabilities in solving complex tasks by orchestratin…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

A Token/KV-Cache Communication Media Selection and Resource Allocation Strategy for Multi-Agent Collaboration

The convergence of large language models (LLMs) with 6G networks is fostering a paradigm of autonomous multi-agent cooperation, which in tu…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents · Business/Funding

A Multi-Agent LLM Framework for Rating the Quality of Surgical Feedback

Verbal feedback delivered by attending surgeons in the operating room plays a critical formative role in resident trainee skill acquisition…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Multi-Agent Coordination Adaptation via Structure-Guided Orchestration

As large language model (LLM)-based multi-agent systems scale to handle increasingly complex tasks, balancing structural stability and dyna…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

On the Benefits of Free Exploration for Regret Minimization in Multi-Armed Bandits

We study a stochastic multi-armed bandit problem where an agent is granted a free exploration budget before regret accumulates, a setting n…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Can LLMs Time Travel? Enhancing Temporal Consistency in Legal Agentic Search through Reinforcement Learning

While large language models (LLMs) augmented with agentic search capabilities show promise for legal reasoning, they overlook a fundamental…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

Learning in Low-Dimensional Subspaces: Orthogonal Bottlenecks for Reinforcement Learning

Deep reinforcement learning (RL) agents commonly rely on high-dimensional neural representations, despite growing evidence that task-releva…

2026-05-26 13:00 JST · arXiv cs.AI · Agents · Research/Papers

WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point

Recent progress in GUI agents has substantially improved visual grounding, yet robust planning remains challenging, particularly when the e…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

Hide-and-Shill: A Reinforcement Learning Framework for Market Manipulation Detection in Symphony-a Decentralized Multi-Agent System

Decentralized finance (DeFi) has introduced a new era of permissionless financial innovation but also led to unprecedented market manipulat…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

From Multi-Agent Systems and the Semantic Web to Agentic AI: A Unified Narrative of the Web of Agents

The Web of Agents (WoA) transforms the document-centric Web into an environment of autonomous agents acting on users' behalf, a vision newl…

2026-05-26 13:00 JST · arXiv cs.AI · Agents · Research/Papers

Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning

Large reasoning models (LRMs) combined with retrieval-augmented generation (RAG) have enabled deep research agents capable of multi-step re…

2026-05-26 13:00 JST · arXiv cs.AI · Agents · Business/Funding · Research/Papers

Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents

Although recent tool-augmented benchmarks involve complex requests, evaluation remains limited to answer matching, neglecting critical traj…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

Agent Learning via Early Experience

A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

Learning to Trust: Bayesian Adaptation to Varying Suggester Reliability in Sequential Decision Making

Autonomous agents operating in sequential decision-making tasks under uncertainty can benefit from external action suggestions, which provi…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

IPR-1: Interactive Physical Reasoner

Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we aim to ask whether an agent can…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

SPARK: Search Personalization via Agent-Driven Retrieval and Knowledge-sharing

Personalized search demands the ability to model users' evolving, multi-dimensional information needs; a challenge for systems constrained…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Architecting Agentic Communities using Design Patterns

The rapid evolution of Large Language Models (LLM) and subsequent Agentic AI technologies requires systematic architectural guidance for bu…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

MMUEChange: A Generalized LLM Agent Framework for Intelligent Multi-Modal Urban Environment Change Analysis

Understanding urban environment change is essential for sustainable development. However, current approaches, particularly remote sensing c…

2026-05-26 13:00 JST · arXiv cs.AI · Agents · Business/Funding · Research/Papers

Why Your Deep Research Agent Fails? On Hallucination Evaluation in Full Research Trajectory

Diagnosing failure patterns in Deep Research Agents (DRAs) remains a critical challenge. Existing benchmarks predominantly rely on end-to-e…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

MedBeads: An Agent-Native, Immutable Data Substrate for Trustworthy Medical AI

Background: As of 2026, Large Language Models (LLMs) demonstrate expert-level medical knowledge. However, deploying them as autonomous "Cli…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

AgentArk: Distilling Multi-Agent Intelligence into a Single LLM Agent

While large language model (LLM) multi-agent systems achieve superior reasoning performance through iterative debate, practical deployment…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

Recent advances in large language model (LLM) have empowered autonomous agents to perform multi-turn interactions with tools and environmen…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

Decoding ML Decision: An Agentic Reasoning Framework for Large-Scale Ranking System

Modern large-scale ranking systems operate within a sophisticated landscape of competing objectives, operational constraints, and evolving…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

Large Language Models (LLMs) are deployed as autonomous agents in increasingly complex applications, where enabling long-horizon memory is…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

Dynamic Dual-Granularity Skill Bank for Agentic RL

Agentic RL can benefit substantially from reusable experience, yet existing skill-based methods mainly extract trajectory-level guidance an…

2026-05-26 13:00 JST · arXiv cs.AI · Agents · Business/Funding

OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search

Agentic search enables language models to solve knowledge-intensive tasks by adaptively acquiring external evidence over multiple steps. Re…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents · Research/Papers

SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment

Current LLM-based agents demonstrate strong performance in episodic task execution but remain constrained by static toolsets and episodic a…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents · Business/Funding

UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function c…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment

Large language models (LLMs) are increasingly deployed as tool-augmented agents capable of executing system-level operations. While existin…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation

Skill-distillation pipelines learn reusable rules from LLM agent trajectories, but they lack a key signal: how much each step costs. Withou…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

MEMTIER: Tiered Memory Architecture and Retrieval Bottleneck Analysis for Long-Running Autonomous AI Agents

Long-running autonomous AI agents suffer from a well-documented memory coherence problem: tool-execution success rates degrade 14 percentag…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting

Long-term personalized memory for LLM agents is challenging on resource-limited edge devices due to high storage costs and multimodal compl…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Strat-Reasoner: Reinforcing Strategic Reasoning of LLMs in Multi-Agent Games

While Large Language Models (LLMs) excel in certain reasoning tasks, they struggle in multi-agent games where the final outcome depends on…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

As LLM agent systems take on more complex tasks, they increasingly rely on meta-agents: higher-order agents that operate on other agents, m…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

CVEvolve: Autonomous Algorithm Discovery for Unstructured Scientific Data Processing

Scientific data processing often requires task-specific algorithms or AI models, creating a barrier for domain scientists who need to analy…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

CogniFold: Always-On Proactive Memory via Cognitive Folding

Existing agent memory remains predominantly reactive and retrieval-based, lacking the capacity to autonomously organize experience into per…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

A Two-Dimensional Framework for AI Agent Design Patterns: Cognitive Function and Execution Topology

Existing frameworks for LLM-based agent architectures describe systems from a single perspective: industry guides (Anthropic, Google, LangC…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?

Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation in Autonomous Driving Models

We present the first systematic study of faithfulness in Vision-Language-Action (VLA) driving models, analyzing 300 Alpamayo-R1-10B inferen…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity

Agent Skills, structured packages of procedural knowledge loaded into an LLM agent at inference time, are widely reported to improve task p…

2026-05-26 13:00 JST · arXiv cs.AI · Agents · Research/Papers

AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

Automating scientific discovery requires more than generating papers from ideas. Real research is iterative: hypotheses are challenged from…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?

LLM agents have incredible potential for scientific discovery applications. However, the performance of LLM agents on real-world, small mol…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

Autonomous agentic systems are largely static after deployment: they do not learn from user interactions, and recurring failures persist un…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

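SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Today's agent skills are hand-crafted, generated in one shot, or evolved through loosely controlled self-revision; none of these behaves like a deep-learning optimizer for the skill, and none reliably improves on its starting point under feedback. We argue that the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. To our knowledge, SkillOpt is the first systematic, controllable text-space optimizer for agent skills. A separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, a rejected-edit buffer, and an epoch-wise slow/meta update stabilize skill training while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it improves on the average no-skill accuracy by +23.5 points in direct chat, +24.8 points inside the Codex agentic loop, and +19.1 points inside Claude Code. Transfer experiments further show that optimized skill artifacts retain their value, without further optimization, when moved across model scales, between the Codex and Claude Code execution environments, and to a nearby math benchmark.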

Original (English)

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization. Code: https://aka.ms/skillopt
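The acceptance rule in the abstract above (an edit is kept only when it strictly improves a held-out validation score) can be illustrated with a small greedy loop. This is a hedged sketch, not SkillOpt's code: `optimize_skill`, `propose_edit`, `validation_score`, the `(old, new)` edit encoding, and the use of `max_edit_chars` as a stand-in for the textual learning-rate budget are all hypothetical, and the epoch-wise slow/meta update is not modelled.

```python
def optimize_skill(skill, propose_edit, validation_score, steps=10, max_edit_chars=200):
    """Greedy text-space optimisation: keep an edit only if it strictly
    improves the held-out validation score."""
    best = validation_score(skill)
    rejected = []  # rejected-edit buffer, passed back to the proposer
    for _ in range(steps):
        edit = propose_edit(skill, rejected)
        if edit is None:
            break
        old, new = edit  # old == "" means add; new == "" means delete
        # Bounded edits: reject oversized changes and targets absent from the skill.
        if len(old) + len(new) > max_edit_chars or (old and old not in skill):
            rejected.append(edit)
            continue
        candidate = skill + new if old == "" else skill.replace(old, new, 1)
        score = validation_score(candidate)
        if score > best:  # strict improvement required
            skill, best = candidate, score
        else:
            rejected.append(edit)
    return skill, best, rejected

def toy_score(skill):
    # Hypothetical validator: rewards two useful rules, penalises one harmful rule.
    return ("check units" in skill) + ("cite sources" in skill) - ("guess freely" in skill)

def make_proposer(edits):
    # Hypothetical proposer that replays a fixed list of edits in order.
    queue = list(edits)
    def propose(skill, rejected):
        return queue.pop(0) if queue else None
    return propose

edits = [
    ("", "\n- check units"),                     # add
    ("", "\n- guess freely"),                    # add (harmful)
    ("be brief", "be brief and cite sources"),   # replace
]
final_skill, final_score, rejected = optimize_skill(
    "Rules:\n- be brief", make_proposer(edits), toy_score
)
print(final_skill)
print(final_score, rejected)
```

Because acceptance is gated on the validation score, deployment only ever reads the final skill text, which is consistent with the abstract's point that no extra inference-time model calls are added.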

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

MultiPhishGuard: An Explainable and Adaptive Multi-Agent LLM System for Phishing Email Detection

Phishing email detection faces significant challenges due to evolving adversarial tactics and heterogeneous attack patterns. Traditional ap…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

MCPXKIT: The Unified Toolkit for Analyzing Model Context Protocol Security

The Model Context Protocol (MCP) has emerged as a universal standard that enables AI agents to seamlessly connect with external tools, sign…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

Tool calling has emerged as a critical capability for AI agents. In contrast to conventional tool calling frameworks that rely on static, p…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

Auditing medical multi-agent AI reveals risks of false consensus

Large language models are increasingly being assembled into medical multi-agent systems that emulate multidisciplinary consultation through…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving

End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger'…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

NeoAMT: Neologism-Aware Agentic Machine Translation with Reinforcement Learning

Neologism-aware machine translation aims to translate source sentences containing neologisms into target languages. This field remains unde…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

Delayed Assignments in Online Non-Centroid Clustering with Stochastic Arrivals

Clustering is a fundamental problem, aiming to partition a set of elements, like agents or data points, into clusters such that elements in…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

Generative Visual Code Mobile World Models

Mobile Graphical User Interface (GUI) World Models (WMs) offer a promising path for improving mobile GUI agent performance at train- and in…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand-designed operations for extracting memory. These f…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

Agent Primitives: Reusable Latent Building Blocks for Multi-Agent Systems

While existing multi-agent systems (MAS) can handle complex problems by enabling collaboration among multiple agents, they are often highly…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

SoK: DARPA's AI Cyber Challenge (AIxCC): Competition Design, Architectures, and Lessons Learned

DARPA's AI Cyber Challenge (AIxCC, 2023--2025) is the largest competition to date for building fully autonomous cyber reasoning systems (CR…

2026-05-26 13:00 JST · arXiv cs.AI · Agents · Robotics

AEROS: A Single-Agent Operating Architecture with Embodied Capability Modules

Robotic systems lack a principled abstraction for organizing intelligence, capabilities, and execution in a unified manner. Existing approa…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

M$^\star$: Every Task Deserves Its Own Memory Harness

Large language model agents rely on specialized memory systems to accumulate and reuse knowledge during extended interactions. Recent archi…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

Federation over Text: Insight Sharing for Multi-Agent Reasoning

We propose a federated learning-like framework, Federation over Text (FoT), that enables multiple clients solving different tasks to collec…

2026-05-26 13:00 JST · arXiv cs.AI · Agents · Research/Papers

ESIA: An Energy-Based Spatiotemporal Interaction-Aware Framework for Pedestrian Intention Prediction

Recent advances in autonomous driving have motivated research on pedestrian intention prediction, which aims to infer future crossing decis…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

Safety in Embodied AI: A Survey of Risks, Attacks, and Defenses

Embodied Artificial Intelligence (Embodied AI) integrates perception, cognition, planning, and interaction into agents that operate in open…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

An Uncertainty-Aware Resilience Micro-Agent for Causal Observability in the Computing Continuum

Grey failures in the computing continuum produce ambiguous overlapping symptoms that existing approaches fail to diagnose reliably, either…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents · Research/Papers

LivePI: More Realistic Benchmarking of Agents Against Indirect Prompt Injection

AI agents such as OpenClaw are increasingly deployed in local workflows with access to external tools. This creates indirect prompt-injecti…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

Spatial intelligence unfolds through a perception-action loop: agents act to acquire observations, and reason about how observations vary a…

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents · Research/Papers

ProcCtrlBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents

Existing benchmarks for LLM coding agents primarily evaluate final outcomes. While useful for measuring overall capability, these metrics p…

2026-05-26 13:00 JST · arXiv cs.AI · Agents

Contractual Skills: A GovernSpec Design Framework for Enterprise AI Agents

Skills have become a practical packaging mechanism for agent instructions, workflows, scripts, and reference materials. In enterprise setti…

2026-05-26 01:00 JST · TechCrunch AI · Agents

What ClickUp’s mass layoff tells us about the future of work

The nine-year-old startup is replacing hundreds of employees with thousands of AI agents.

2026-05-25 15:00 JST · ITmedia AI+ · Agents · Regulation/Policy

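ServiceNow announces a set of features to fill the "data gap" for AI agents

Siloed data and immature governance are the "wall" that stands in the way of adopting AI agents. How is ServiceNow trying to solve this problem? This article introduces the new features announced at its annual event.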

2026-05-25 13:00 JST · arXiv cs.AI · Agents · Research/Papers

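RMA: An Agentic System for Research-Level Mathematical Problems

Research Math Agents (RMA) is an agentic framework for automated reasoning on research-level mathematical problems. Unlike prior work centered on competition mathematics or formal theorem proving, RMA targets research-level problems that require long-horizon reasoning, literature grounding, and iterative proof refinement. RMA decomposes research-level proof solving into specialized modules for problem analysis, literature search and understanding, fair comparison, knowledge-bank construction, and proof verification, all coordinated by initializer, proposer, and verifier agents through a shared structured memory. Within this unified framework, the agents operate in a multi-role, multi-round workflow, collaboratively generating, refining, and verifying candidate proofs through iterative feedback. We evaluate RMA on the First Proof benchmark, which consists of ten research-level problems contributed by expert mathematicians across diverse fields. In a comprehensive expert evaluation, RMA outperforms strong baselines on the First Proof benchmark, including GPT-5.2R and Aletheia, solving eight of the ten research problems and producing proofs that are more logically sound and more readable. Our comprehensive ablation studies further show that the performance gains arise from the interaction of structured reasoning modules, iterative refinement, and verifier-based feedback rather than from any single component. Our solutions and implementation will be made publicly available upon acceptance.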

Original (English)

RMA: an Agentic System for Research-Level Mathematical Problems

We present $\textbf{Research Math Agents (RMA)}$, an agentic framework for automated reasoning on research-level mathematical problems. Unlike prior studies centered on competition mathematics or formal theorem proving, RMA targets research-level mathematical problems that require long-horizon reasoning, literature grounding, and iterative proof refinement. RMA decomposes research-level proof solving into specialized modules for problem analysis, literature search and understanding, fair comparison, knowledge-bank construction, and proof verification, all coordinated by initializer, proposer, and verifier agents through a shared structured memory. Within this unified framework, these agents operate in a multi-role, multi-round workflow, collaboratively generating, refining, and verifying candidate proofs through iterative feedback. We evaluate RMA on the First Proof benchmark, which consists of ten research-level problems contributed by expert mathematicians across diverse domains. Through comprehensive expert evaluation, RMA outperforms strong baselines on the First Proof benchmark, including GPT-5.2R and Aletheia, solving eight out of ten research problems and producing more logically sound and readable proofs. Our comprehensive ablation studies further show that performance gains arise from the interaction of structured reasoning modules, iterative refinement, and verifier-based feedback, rather than any single component. Our solutions and implementations will be made publicly available upon acceptance.

2026-05-25 13:00 JST · arXiv cs.AI · Agents · Hardware/Semiconductors · Research/Papers

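SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research

The exponential growth of global academic output has confronted researchers and AI agents with an unprecedented "information explosion," in which fragmented, unstructured knowledge organization impedes deep interdisciplinary integration. Current academic retrieval tools rely mainly on superficial keyword matching or vector-space semantic retrieval and lack the topological reasoning capabilities needed to navigate complex logical connections. Agentic deep-research frameworks are often prone to logical hallucinations and incur high inference costs. To bridge this gap, this report introduces SciAtlas, a large-scale, multi-disciplinary, heterogeneous knowledge graph of academic resources designed as a panoramic scientific evolution network. By integrating over 43 million papers from 26 disciplines, for a total of 157 million entities and 3 billion triplets, SciAtlas provides a structured topological cognitive substrate that dismantles disciplinary barriers and gives AI agents a global perspective. We further develop a neuro-symbolic retrieval algorithm featuring tri-path collaborative recall and graph reranking, achieving a seamless transition from simple semantic matching to deterministic association discovery. We also present key application directions for SciAtlas, including literature review, automated research trend synthesis, idea positioning, and academic trajectory exploration, to demonstrate that it can serve as an effective "cognitive map" that empowers the full loop of automated scientific research while significantly reducing reasoning costs. We have released interfaces for KG retrieval and various downstream tasks in our GitHub repository.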

Original (English)

SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research

The exponential growth of global academic output has confronted researchers and AI agents with an unprecedented ``information explosion,'' where fragmented and unstructured knowledge organization impedes deep interdisciplinary integration. Current academic retrieval tools predominantly rely on superficial keyword matching or vector-space semantic retrieval, which lack the topological reasoning capabilities required to navigate complex logical connections. Agentic deep-research-based frameworks are often prone to logical hallucinations and consuming high inference costs. To bridge this gap, in this report, we introduce SciAtlas, a large-scale, multi-disciplinary, heterogeneous academic resource knowledge graph designed as a panoramic scientific evolution network. By integrating over 43M papers from 26 disciplines, and a total of 157M entities and 3B triplets, SciAtlas provides a structured topological cognitive substrate that dismantles disciplinary barriers and furnishes AI agents with a global perspective. Furthermore, we develop a neuro-symbolic retrieval algorithm featuring tri-path collaborative recall and graph reranking, achieving a seamless transition from simple semantic matching to deterministic association discovery. We also present key application directions of SciAtlas, including literature review, automated research trend synthesis, idea positioning, and academic trajectory exploration, to demonstrate that SciAtlas can serve as an effective ``cognitive map'' to empower the full loop of automated scientific research while significantly reducing reasoning costs. We have released the interfaces for KG retrieval and various downstream tasks in our GitHub repo.

2026-05-25 13:00 JST · arXiv cs.AI · Agents · Research/Papers

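Energy per Successful Goal: Goal-Level Energy Accounting for Agentic AI Systems

Current AI energy benchmarks measure consumption at the granularity of a single model invocation or training run. For classical single-turn workloads, this unit remains coherent. For agentic systems, where a single user goal may trigger multi-step orchestration, tool calls, retries, and failure-recovery cycles, the invocation count is an implementation artifact rather than a property of the task, and inference-level normalization misrepresents the energy cost of completing a goal. We present A-LEMS (Agentic LLM Energy Measurement System), a cross-layer measurement framework that redefines the unit of AI energy accounting from energy per inference to Energy per Successful Goal (EpG). EpG aggregates total workflow energy across all execution attempts, including failures and retries, and normalizes it by the number of successfully completed goals. A-LEMS formalizes energy attribution through a temporal boundary model, a five-layer observation pipeline that maps RAPL signals to workflow-level energy, and a reproducibility protocol that binds every measurement to its hardware and runtime configuration. Building on EpG, we define the Orchestration Overhead Index (OOI), which isolates the energy cost of orchestration relative to linear execution under identical task criteria. Across five reasoning and three tool-augmented task families, agentic workflows consume 4.33x the mean energy per successful goal of linear baselines (888.1 J vs 205.3 J). This overhead is driven by orchestration structure, not inference compute. For tool-augmented tasks, OOI inverts to below 1.0x: agentic execution is cheaper than linear execution, confirming that the metric captures orchestration structure rather than a fixed upward bias. These findings establish that energy per inference is insufficient for agentic AI. EpG and OOI provide a measurement foundation for accurate benchmarking in settings where orchestration structure is the primary determinant of energy cost.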

Original (English)

Energy per Successful Goal: Goal-Level Energy Accounting for Agentic AI Systems

Current AI energy benchmarks measure consumption at the granularity of a single model invocation or training run. For classical single-turn workloads this unit remains coherent. For agentic systems - where a single user goal may trigger multi-step orchestration, tool calls, retries, and failure-recovery cycles - the invocation count is an implementation artifact rather than a task property, and inference-level normalization misrepresents the energy cost of goal completion. We present A-LEMS (Agentic LLM Energy Measurement System), a cross-layer measurement framework that redefines the unit of AI energy accounting from energy per inference to Energy per Successful Goal (EpG). EpG aggregates total workflow energy across all execution attempts, including failures and retries, normalized by successfully completed goals. A-LEMS formalizes energy attribution through a temporal boundary model, a five-layer observation pipeline mapping RAPL signals to workflow-level energy, and a reproducibility protocol binding every measurement to hardware and runtime configuration. Building on EpG, we define the Orchestration Overhead Index (OOI), isolating the energy cost of orchestration relative to linear execution under identical task criteria. Across five reasoning and three tool-augmented task families, agentic workflows consume 4.33x higher mean energy per successful goal than linear baselines (888.1 J vs 205.3 J). This overhead is driven by orchestration structure, not inference compute. For tool-augmented tasks, OOI inverts below 1.0x: agentic execution is cheaper than linear, confirming the metric captures orchestration structure rather than a fixed upward bias. These findings establish that energy-per-inference is insufficient for agentic AI. EpG and OOI provide the measurement foundation for accurate benchmarking, where orchestration structure is the primary determinant of energy cost.
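The two metrics in the abstract above reduce to simple arithmetic, sketched below. The function names and the `(energy_joules, succeeded)` record layout are assumptions for illustration, as is the simplification that each successful attempt completes exactly one goal. The abstract does not give the exact formula for OOI; the plain ratio used here is an assumption chosen because it reproduces the reported 4.33x from the reported means.

```python
def energy_per_successful_goal(attempts):
    """EpG: total energy over all attempts (failures and retries included),
    divided by the number of successfully completed goals.

    `attempts` is a list of (energy_joules, succeeded) pairs; each successful
    attempt is assumed to complete exactly one goal.
    """
    total_energy = sum(energy for energy, _ in attempts)
    successes = sum(1 for _, succeeded in attempts if succeeded)
    if successes == 0:
        raise ValueError("EpG is undefined when no goal succeeds")
    return total_energy / successes

def orchestration_overhead_index(epg_agentic, epg_linear):
    """OOI, assumed here to be the ratio of agentic EpG to linear EpG."""
    return epg_agentic / epg_linear

# Hypothetical run: one failed attempt, then two goals completed.
attempts = [(120.0, False), (150.0, True), (90.0, True)]
print(energy_per_successful_goal(attempts))

# Means reported in the abstract: 888.1 J (agentic) vs 205.3 J (linear).
print(round(orchestration_overhead_index(888.1, 205.3), 2))
```

Note that the failed 120 J attempt still counts toward the numerator, which is the point of normalizing by goals rather than by invocations.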

2026-05-25 13:00 JST · arXiv cs.AI · Agents

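EVE-Agent: Evidence-Verifiable Self-Evolving Agents

Self-evolving agents should not train on examples they cannot justify. Data-free self-evolving search agents offer a scalable route to systems that generate their own questions, answer them, and improve from their own feedback without human annotation. Without verifiable evidence, however, this loop can reward fluent but unsupported examples, turning the self-generated curriculum into an opaque and potentially unreliable training signal. We argue that evidence verifiability is a prerequisite for trustworthy self-evolution in search agents: each generated instance should include not only an answer but also a source-grounded span whose contribution to that answer can be measured. We introduce EVE-Agent, an Evidence-Verifiable Self-Evolving Agent that operationalizes this principle through a modification of the proposer-solver framework. The proposer generates a question, an answer, and a verbatim evidence span. An evidence verifier then rewards the span according to the marginal accuracy gain obtained when the evidence is provided. This yields a training signal that favors evidence that genuinely helps answer the question, without requiring oracle answers, human labels, or external annotations. EVE-Agent leaves the backbone model, retriever, search tool, and optimization framework unchanged. Experiments show that EVE-Agent substantially improves evidence-grounded correctness over prior self-evolving search agents. The resulting curriculum is not merely self-generated but auditable by construction: each training example carries an inspectable source span that explains why it should be trusted.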

原文 (English)

EVE-Agent: Evidence-Verifiable Self-Evolving Agents

Self-evolving agents should not train on examples they cannot justify. Data-free self-evolving search agents offer a scalable route to systems that generate their own questions, answer them, and improve from their own feedback without human annotations. Yet, without verifiable evidence, this loop can reward fluent but unsupported examples, turning the self-generated curriculum into an opaque and potentially unreliable training signal. We argue that evidence verifiability is a prerequisite for trustworthy self-evolution in search agents: each generated instance should include not only an answer but also a source-grounded span whose contribution to that answer can be measured. We introduce EVE-Agent, an Evidence-Verifiable Self-Evolving Agent that operationalizes this principle through a modification to the proposer--solver framework. The proposer generates a question, an answer, and a verbatim evidence span. An evidence verifier then rewards the span according to the marginal accuracy gain when the evidence is provided. This produces a training signal that favors evidence that genuinely helps answer the question, without requiring oracle answers, human labels, or external annotations. EVE-Agent leaves the backbone model, retriever, search tool, and optimization framework unchanged. Experiments show that EVE-Agent substantially improves evidence-grounded correctness over prior self-evolving search agents. The resulting curriculum is not merely self-generated but auditable by construction: each training example carries an inspectable source span that explains why it should be trusted.
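The evidence reward described above can be sketched as a difference of two accuracies. This is only an illustration under assumptions: `solver` is a hypothetical callable, the proposer's own answer serves as the reference (the abstract says no oracle answers are needed), and the sampling count is arbitrary.

```python
def accuracy(answers, reference):
    """Fraction of sampled answers that match the reference answer."""
    return sum(a == reference for a in answers) / len(answers)


def evidence_reward(solver, question, proposed_answer, evidence, n_samples=8):
    """Marginal accuracy gain from showing the solver the evidence span.

    solver(question, evidence_or_None) -> answer string (hypothetical API).
    """
    without = [solver(question, None) for _ in range(n_samples)]
    with_ev = [solver(question, evidence) for _ in range(n_samples)]
    return accuracy(with_ev, proposed_answer) - accuracy(without, proposed_answer)


def toy_solver(question, evidence):
    """Deterministic stand-in: answers correctly only if the span helps."""
    if evidence and "Paris" in evidence:
        return "Paris"
    return "unknown"


helpful = evidence_reward(
    toy_solver, "Capital of France?", "Paris", "The capital of France is Paris."
)
unhelpful = evidence_reward(
    toy_solver, "Capital of France?", "Paris", "France borders Spain."
)
```

A span that does not change the solver's accuracy earns zero reward, so fluent but unsupported examples are not reinforced.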

2026-05-25 13:00 JST | arXiv cs.AI | エージェント

帰納的演繹的合成: AI による正式に検証されたシステムの生成を可能にする

AI エージェントは、コードの生成、テスト、改良においてますます優れています。ただし、テストだけでは提供できない完全な範囲の正式な保証を必要とするタスクには不十分です。分散システムはその典型的な例です。読み取りと書き込みの間の一貫性などの特性は、イベントのあらゆるインターリーブ下で維持される必要があります。機械化された正式な検証はそのような正しさを保証できますが、通常は数か月から数年の専門家の努力が必要です。その証拠に、SOTA コーディングエージェント (GPT-5.4 を使用した Codex および Opus 4.6 を使用した Claude Code) でさえ、2/7 分散キーバリューストア仕様でのみ成功します。この論文では、このギャップに対処するための最初の効果的なアプローチである帰納的演繹的合成 (IDS) を紹介します。これは、実装と証明を共同かつ漸進的に合成し、失敗した試みから学び、有望な戦略を体系的に試行します。エージェント LLM システムとして構築された IDS は、約 6.8 時間で 7/7 を達成し、仕様あたりのコストは平均 106 ドルです。これは、専門家の努力よりも約 200 倍速く、SOTA エージェントよりも 17% 安価です。 IDS はさらに、同じループにパフォーマンスのフィードバックを組み込み、公開されている検証済みシステムよりも最大 3 倍高速な実装を実現します。

原文 (English)

Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems

AI agents increasingly excel at generating, testing, and refining code. However, they fall short on tasks requiring formal guarantees of full coverage that testing alone cannot provide. Distributed systems are a prime example: properties such as consistency between reads and writes must hold under every possible interleaving of events. Mechanized formal verification can guarantee such correctness, but typically demands months to years of expert effort. As evidence, even SOTA coding agents (Codex with GPT-5.4 and Claude Code with Opus 4.6) succeed on only 2/7 distributed key-value-store specifications. In this paper, we present the first effective approach to addressing this gap, Inductive Deductive Synthesis (IDS), which jointly and incrementally synthesizes implementation and proof, and learns from failed attempts to systematically try promising strategies. Built as an agentic LLM system, IDS achieves 7/7 in about 6.8 hours and $106 per spec on average, roughly 200x faster than expert effort and 17% cheaper than SOTA agents. IDS further incorporates performance feedback into the same loop, yielding implementations up to 3x faster than published verified systems.

2026-05-25 13:00 JST | arXiv cs.AI | エージェント

AI マップの再描画: エージェントのエコシステムにおける責任境界の理論

エージェントティック AI オーケストレーターは、組織の境界を越えて情報システム機能を構成するためのインターフェースと組み立てのコストを削減し、モジュール化と組織の細分化を加速するように見えます。しかし、出力に証拠、レビュー、承認、または割り当て可能な責任が必要な AI 対応機能は、技術インターフェイスがモジュール化された場合でも、統合された責任の境界を維持する可能性があります。私たちは、エージェントエコシステムにおける責任の境界配置に関する能力レベルの理論を開発します。 AI がサポートする出力を合法的、監査可能、レビュー可能にし、責任者に割り当てることができるようにする補完的な資産である説明責任資産を導入します。私たちは、検証コストと責任の移転可能性が、実行と責任の境界を同時に移動できるかどうかを決定すると主張します。この理論では、コンポーネント、統合、およびデュアルトラックという 3 つの境界戦略が特定されています。また、組織の意思決定ルールが正式な情報システムから管理されていないエージェント実行環境に移行するときに発生するガバナンスの負担であるルール負債も導入されます。デジタルイノベーション、トランザクションコスト、補完的資産、デジタルプラットフォームのガバナンス、情報システム管理の観点を統合し、エージェントによる組み立てコストの削減、責任資産、流用性、オーケストレーターの意図の捕捉、境界の設定ミスを境界戦略、価値の流用、ルール負債に結び付ける7つの提案を開発します。この理論は、デジタルのモジュール化が組織の細分化にまで及ぶ場合と、説明責任により機能の統合が維持される場合を説明します。文書処理、法律サービス、監査、臨床意思決定のサポート、調達にわたる構造化された図により、境界ロジックが規律化されます。

原文 (English)

Redrawing the AI Map: A Theory of Accountability Boundaries in Agentic Ecosystems

Agentic AI orchestrators reduce the interface and assembly costs of composing information systems capabilities across organizational boundaries, seemingly accelerating modularization and organizational disaggregation. Yet AI-enabled capabilities whose outputs require evidence, review, signoff, or assignable responsibility may retain integrated accountability boundaries even when their technical interfaces become modular. We develop a capability-level theory of accountability-boundary placement in agentic ecosystems. We introduce accountability assets: complementary assets that make AI-supported outputs legitimate, auditable, reviewable, and assignable to a responsible party. We argue that verification cost and responsibility transferability determine whether the execution and accountability boundaries can move together. The theory identifies three boundary strategies: component, integrated, and dual-track. It also introduces rule debt, the governance burden that accrues when organizational decision rules migrate from formal information systems into ungoverned agentic execution environments. Integrating digital innovation, transaction cost, complementary-assets, digital platform governance, and IS control perspectives, we develop seven propositions linking agentic assembly-cost reductions, accountability assets, appropriability, orchestrator intent capture, and boundary misconfiguration to boundary strategy, value appropriation, and rule debt. The theory explains when digital modularization extends to organizational disaggregation and when accountability keeps capabilities integrated. Structured illustrations across document processing, legal services, audit, clinical decision support, and procurement discipline the boundary logic.

2026-05-25 13:00 JST | arXiv cs.AI | エージェント

Foundation Protocol: エージェント社会のための調整層

自律型エージェントはツールから社会インフラストラクチャの層に移行しており、ソフトウェアの閲覧、購入、展開、システムの管理を行い、相互に対話することが増えています。これらのシステムが拡張するにつれて、ボトルネックは生のモデルの機能から調整へと移行します。エージェントは、信頼できる関係を形成し、複数のエージェントの作業を組織し、価値を交換し、AI エコノミーをサポートし、現実世界の監視の下で安全と責任を保つ必要があります。この論文では、新興の人間と AI 社会のためのグラフファースト調整層である Foundation Protocol (FP) を紹介します。 FP は、エージェント、ツール、リソース、人間、機関、組織などの異種エンティティを統合し、ネイティブのマルチパーティ組織とイベントベースのコラボレーションをサポートします。また、計量、領収書、決済のための経済的な基本機能も提供し、ポリシー、出所、監査を第一級の関心事として扱います。 FP は、既存のプロトコルを置き換えるのではなく、ラップしてブリッジするように設計されており、統合とガバナンスのオーバーヘッドを削減しながら段階的な導入を可能にします。その目的は、説明責任を交渉の余地のない状態に保ちながら、自律的な機関を構成可能に保ち、調整自体がオープンで多元的で統治可能な人間と AI の社会の共有インフラストラクチャになるようにすることです。

原文 (English)

Foundation Protocol: A Coordination Layer for Agentic Society

Autonomous agents are moving from tools into a layer of social infrastructure: they browse, purchase, deploy software, manage systems, and increasingly interact with one another. As these systems scale, the bottleneck shifts away from raw model capability toward coordination. Agents need to form reliable relationships, organize multi-agent work, exchange value, support an AI economy, and stay safe and accountable under real-world oversight. This paper introduces the Foundation Protocol (FP), a graph-first coordination layer for an emerging human-AI society. FP unifies heterogeneous entities, including agents, tools, resources, humans, institutions, and organizations, and supports native multi-party organization and event-based collaboration. It also provides economic primitives for metering, receipts, and settlement, and treats policy, provenance, and audit as first-class concerns. FP is designed to wrap and bridge existing protocols rather than replace them, enabling incremental adoption while reducing integration and governance overhead. The aim is to keep autonomous agency composable while keeping accountability non-negotiable, so that coordination itself can become shared infrastructure for a human-AI society that is open, pluralistic, and governable.

2026-05-25 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

GENSTRAT: 大規模言語モデルにおける戦略的推論の科学に向けて

大規模言語モデル (LLM) は、市場、オークション、入札設定における経済エージェントとしてますます導入されています。特定の展開における彼らの行動を予測するのは困難です。既存の戦略的推論ベンチマークは、固定された正規ゲームのモデルを評価します。これらのベンチマークは、フロンティアが向上するにつれて飽和する可能性があり、評価者がベンチマークのパフォーマンスから実際の展開に含まれる多様で複雑な戦略的環境までを自信を持って一般化することはできません。これらの課題に対処するために、手続き的に生成された戦略的環境を使用する GENSTRAT を紹介します。具体的には、2 人用のゼロサム不完全情報カードゲームの分布を生成します。ジェネレーターはオンデマンドで新鮮なゲームを描画できるため、常に最新の評価と汚染への耐性が可能になります。私たちはゲームの分布を、モデルの能力を 6 つの軸 (状態空間、時間的深さ、情報感度、対戦相手のモデリング、リスク、脆弱性) にわたって分解する能力プロファイル手法と組み合わせます。また、戦略的に類似したゲーム間でモデルのアドバンテージが予期せずジャンプするときを検出する、分布内の滑らかさのギザギザの尺度も導入します。 2,000 試合で生成されたプールから 50 のベンチマーク試合をサンプリングし、36,000 試合を超える直接対決トーナメントで 9 つのフロンティアおよび無差別級 LLM を評価します。新しいフロンティア層モデルの方が平均スコアが高くなります。その平均を超えると、ほぼ同一の全体的な強度を持つモデルは質的に異なる能力プロファイルを示し、リーダーボードの上位 3 つのモデルのうち 2 つ (gpt-5 と claude) は、全体的な強度が近いにもかかわらず、3 番目 (gemini-3.1-pro) よりも局所的な変動が顕著に高くなります。機能プロファイルとギザギザの尺度を組み合わせることで、全体的なランキングだけでは提供できない展開関連の診断が得られます。

原文 (English)

GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models

Large language models (LLMs) are increasingly deployed as economic agents in marketplaces, auctions, and bidding settings. Anticipating their behavior in any specific deployment is hard. Existing strategic-reasoning benchmarks evaluate models on fixed canonical games. These benchmarks may saturate as the frontier improves, and they do not allow evaluators to generalize with confidence from benchmark performance to the varied and messy strategic environments that actual deployments involve. We introduce GENSTRAT, which uses procedurally generated strategic environments to address these challenges. Concretely, we generate a distribution of two-player zero-sum imperfect-information card games. The generator can draw fresh games on demand, allowing for evergreen evaluation and resistance to contamination. We pair the game distribution with a capability-profile methodology that decomposes model competence across six axes (state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness). We also introduce a jaggedness measure of within-distribution smoothness that detects when a model's advantage jumps unpredictably between strategically similar games. We sample 50 benchmark games from a 2,000-game generated pool and evaluate nine frontier and open-weight LLMs in a head-to-head tournament with over 36,000 matches. Newer frontier-tier models score higher on average. Beyond that average, models with near-identical overall strength show qualitatively different capability profiles, and two of the top three leaderboard models (gpt-5 and claude) are noticeably more locally volatile than the third (gemini-3.1-pro), despite being close in overall strength. Together, the capability profile and the jaggedness measure give a deployment-relevant diagnostic that the overall ranking alone cannot provide.

2026-05-25 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント | 研究/論文

ナレッジワークのベンチマークを設計およびレポートする

LLM エージェントの開発により、コーディング、研究、ヘルスケアなど、ナレッジワーク AI に関する一連の研究が増加しています。ただし、現在の知識作業の評価とベンチマークの設計は依然として従来の NLP タスクのロジックに従っています。その結果、ベンチマークのパフォーマンスが高くても、システムが実際の展開設定でナレッジワークを実行できることを確実に示すことはできません。このペーパーは、ベンチマーク対象のタスクがスコアに関連付けられた作業要求をどのように表すかを明示するための 3 段階のアプローチを提供します。つまり、評価対象の作業アクティビティを定義し、テストされた設定を指定し、適切な作業成果物をスコアリングします。私たちは、ナレッジワークが役割と責任、ローカルの材料とツール、下流のワークフローで使用可能なままでなければならない成果物を通じて組織化されていることを示す作業研究をレビューします。次に、これらの懸念事項をベンチマーク設計とレポート作成のガイダンスに変換します。これには、タスクを作業アクティビティにどのようにマッピングするか、テストされた設定で材料、ツール、役割、制約をどのように指定するか、システムが残した作業成果物にどのように焦点を当てるべきかが含まれます。評価対象の作業活動に名前を付け、一般的なベンチマークタスクと区別するために、O*NET 職業タスクデータベースから 18 の作業活動のインベントリを導出します。私たちは 3 つのベンチマークケース分析を通じてこのアプローチを実証します。対象は、GDPval (コードを対象としない職業成果物ベンチマーク)、OfficeQA Pro (最終回答で採点される、根拠に基づく文書分析ベンチマーク)、APEX-SWE (実行可能な成果物を採点するソフトウェアエンジニアリングベンチマーク) です。これらのケースは、ベンチマーク設計の選択が、スコアがサポートできる最も強力な作業要求をどのように形成するか、また、ベンチマーク対象のタスク、テストされた設定、スコア付けされた製品、およびより広範な作業要求の間にギャップが生じる場所を示しています。

原文 (English)

Design and Report Benchmarks for Knowledge Work

The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare. However, current knowledge-work evaluation and benchmark design still largely follow the logic of traditional NLP tasks. As a result, higher benchmark performance does not reliably show that a system can carry out knowledge work in real-world deployment settings. This paper contributes a three-step approach for making explicit how benchmarked tasks represent the work claims attached to their scores: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product. We review work studies showing that knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows. We then translate these concerns into benchmark design and reporting guidance, covering how tasks should be mapped to work activities, how tested settings should specify materials, tools, roles, and constraints, and how scoring should focus on the work product left by the system. To name the work activity being evaluated and distinguish it from common benchmark tasks, we derive an inventory of 18 work activities from the O*NET occupational task database. We demonstrate the approach through three benchmark case analyses: GDPval, a non-code occupational deliverable benchmark; OfficeQA Pro, a grounded document-analysis benchmark scored by final answers; and APEX-SWE, a software-engineering benchmark with executable scored products. These cases show how benchmark design choices shape the strongest work claim a score can support, and where gaps arise between the benchmarked task, tested setting, scored product, and broader work claim.

2026-05-25 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

長期的な LLM エージェントサービスのための並列コンテキスト圧縮

長期にわたる LLM エージェントは会話履歴を蓄積し、最終的にはモデルのコンテキストウィンドウを超えます。LLM ベースの要約によるコンテキスト圧縮は会話を一定の長さに抑えますが、要約は本質的に非可逆であり、ブロッキング呼び出しによってエージェントの推論が数十秒間停止します。さらに、プロンプトの指示はほとんど無視されるため、オペレーターは要約の分量をきめ細かく制御できません。コンテキストが増大するにつれて、モデルが生成する出力トークンの量とモデルが保持する情報の両方が実行ごとに大きく変動し、エージェントが保持する知識は実行間で予測不可能になります。私たちは、長期的なエージェントフロー向けに並列圧縮を導入し、HotpotQA マルチホップ QA および LoCoMo ロングコンテキスト対話ベンチマークにおいて、8B から 120B パラメータにわたる 4 つのバックボーン (高密度アーキテクチャと MoE アーキテクチャ、推論モデルと非推論モデルを含む) で、逐次同期ベースラインと比較して特性を評価します。並列圧縮により、オペレーターは要約の分量をきめ細かく予測可能に制御でき、ブロックごとにより的を絞ったプロンプトエンジニアリングが可能になります。圧縮のデコード量を揃えた条件では、逐次ベースラインと比べてエンドツーエンドの実時間が短縮され、圧縮スループットが向上します。

原文 (English)

Parallel Context Compaction for Long-Horizon LLM Agent Serving

Long-horizon LLM agents accumulate growing conversation histories that eventually exceed the model's context window. Context compaction via LLM-based summarization keeps the conversation bounded, but summarization is inherently lossy and the blocking call stalls agent inference for tens of seconds. Moreover, the operator has no fine-grained control over summary volume since prompt instructions are largely ignored, and as context grows, both the amount of output tokens the model produces and the information it retains fluctuate substantially from run to run, making the agent's retained knowledge unpredictable across runs. We introduce parallel compaction for long-horizon agentic flows and characterize it against the sequential synchronous baseline across four backbones spanning 8B to 120B parameters, mixing dense and MoE architectures with reasoning and non-reasoning models, on the HotpotQA multi-hop QA and LoCoMo long-context dialogue benchmarks. Parallel compaction gives the operator fine-grained, predictable control over summary volume and enables more targeted prompt engineering per block. At matched compaction decode volume, it reduces end-to-end wall time and improves compaction throughput over the sequential baseline.
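The abstract does not spell out the mechanism, so the following is only one plausible reading: split the history into fixed-size blocks, give each block its own output budget, and summarize the blocks concurrently. All names are hypothetical, and `stub_summarize` stands in for an LLM call by simply truncating to the word budget.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(messages, block_size):
    """Partition the conversation history into consecutive blocks."""
    return [messages[i:i + block_size] for i in range(0, len(messages), block_size)]


def stub_summarize(block, max_words):
    """Placeholder for an LLM summarization call: keeps the first max_words words."""
    words = " ".join(block).split()
    return " ".join(words[:max_words])


def parallel_compact(messages, block_size, max_words_per_block, summarize=stub_summarize):
    """Summarize every block concurrently, each under its own word budget.

    pool.map preserves input order, so summaries stay in conversation order.
    """
    blocks = split_into_blocks(messages, block_size)
    with ThreadPoolExecutor(max_workers=max(1, len(blocks))) as pool:
        return list(pool.map(lambda b: summarize(b, max_words_per_block), blocks))


history = ["a b c", "d e f", "g h i", "j k l", "m n o"]
summaries = parallel_compact(history, block_size=2, max_words_per_block=4)
```

Because each block has a separate budget, total summary volume is bounded by the number of blocks times the per-block budget, which is one way to obtain the predictable control the abstract describes.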

2026-05-25 13:00 JST | arXiv cs.AI | エージェント

DART: 構造化ツールエージェントのセマンティック回復可能性

構造化ツールエージェントが実行中に失敗すると、ランタイムはジレンマに直面します。タスク全体を再実行するのは安全ですが無駄が多く、ローカルチェックポイントからの復元は効率的ですが、コミットされた下流の作業が、もはや存在しない上流の履歴に結び付けられたままになる可能性があります。この緊張は、ロールバックが単一の失敗したインスタンスをターゲットにしているにもかかわらず、ダウンストリームのコンシューマーがすでにその出力に対応している場合、コミットメントに依存する設定では深刻です。既存の回復アプローチでは、機械的なロールバックが提供されますが、ダウンストリームのコミット後にローカルの復元が意味的に有効なままであるかどうかの基準がありません。私たちはこのギャップをセマンティックな回復可能性として形式化し、障害が発生したインスタンスをローカライズし、そのインスタンスのセマンティックに回復可能な境界を証明し、チェックポイントをそれらの境界に合わせて、依存関係と影響制約の下でコミットされたダウンストリーム作業を保存する、またはそれ以外の場合はブロックする許容可能な復元ポイントを選択するモジュラーランタイムである DART で対処します。 3 つの LLM 駆動ドメインと LangGraph ベースの基板上の外部検証にわたって、DART は、ベースラインのローカルリカバリが失敗する評価済みのコミットメント依存のケースをすべて正しくリカバリし、5 つのドメインの安全性監査で安全でない許可されたロールバックは検出されません。これらの結果は、コントローラーの合法性は意味論的な妥当性を意味するものではなく、健全なローカル回復には明示的な許容性チェックが必要であることを示しています。

原文 (English)

DART: Semantic Recoverability for Structured Tool Agents

When a structured tool agent fails mid-execution, the runtime faces a dilemma: replaying the entire task is safe but wasteful, while restoring from a local checkpoint is efficient but can leave committed downstream work tied to an upstream history that no longer exists. This tension is acute in commitment-sensitive settings, where rollback targets a single failed instance yet downstream consumers have already acted on its output. Existing recovery approaches provide mechanical rollback but no criterion for whether a local restore remains semantically valid after downstream commitment. We formalize this gap as semantic recoverability and address it in DART, a modular runtime that localizes the failed instance, certifies semantically recoverable boundaries of that instance, aligns checkpoints to those boundaries, and selects an admissible restore point that preserves committed downstream work under dependency and effect constraints, or blocks otherwise. Across three LLM-driven domains and external validation on a LangGraph-based substrate, DART correctly recovers all evaluated commitment-sensitive cases where baseline local recovery fails, and a five-domain safety audit finds no unsafe admitted rollbacks. These results show that controller legality does not imply semantic validity, and that sound local recovery requires an explicit admissibility check.

2026-05-25 13:00 JST | arXiv cs.AI | エージェント

状況に応じたバンディット優先学習による人間参加型のマルチエージェント人工呼吸器の意思決定サポート

人工呼吸器の意思決定サポートには、安全境界線と臨床医固有の調整スタイルを尊重しながら、進化する生理機能と疾患の軌跡を追跡する一連の意思決定が必要です。ルールベースのアプローチではパーソナライゼーションが一般化されることはほとんどなく、エンドツーエンドの強化学習や単一の大規模言語モデルシステムの制御と監査は依然として困難です。我々は、契約主導型の構造化インターフェースを通じてモジュール式の意思決定コンポーネントを調整し、レビューのための追跡可能な証拠を生成するヒューマン・イン・ザ・ループのマルチエージェントフレームワークである人工呼吸器意思決定支援システム（VDSS）を提案します。 VDSS は、コンテキストバンディットを使用してオンラインの好みの適応を実行し、調整サイクルごとに最終的に受け入れられた決定から臨床医固有の好みを更新し、それらを次の推奨事項のガイドとして使用します。構造化された拒否フィードバックにより、対象を絞った再計画がトリガーされ、非生産的な反復が削減され、インタラクションの安定性が向上します。専門家のレビューを伴う遡及的な ICU 軌跡の再生は、推奨事項の受け入れ可能性が高く、受け入れ可能な計画に到達するための対話ラウンドが少ないことを示し、臨床的に展開可能な人間 AI コラボレーションをサポートします。

原文 (English)

Human-in-the-Loop Multi-Agent Ventilator Decision Support with Contextual Bandit Preference Learning

Ventilator decision support requires sequential decisions that track evolving physiology and disease trajectories while respecting safety boundaries and clinician-specific tuning styles. Rule-based approaches rarely generalize personalization, and end-to-end reinforcement learning or single large language model systems remain difficult to control and audit. We propose the Ventilator Decision Support System (VDSS), a human-in-the-loop multi-agent framework that coordinates modular decision components through contract-driven structured interfaces and produces traceable evidence for review. VDSS performs online preference adaptation with a contextual bandit, updating clinician-specific preferences from the final accepted decision at each adjustment cycle and using them to guide subsequent recommendations. Structured rejection feedback triggers targeted replanning to reduce unproductive iterations and improve interaction stability. Retrospective ICU trajectory replay with expert review indicates higher recommendation acceptability and fewer interaction rounds to reach an acceptable plan, supporting clinically deployable human-AI collaboration.

2026-05-25 13:00 JST | arXiv cs.AI | LLM/生成AI | エージェント

正しく実行されたにもかかわらず計画が失敗した場合: LLM ベースのマルチエージェントシステムの認識論的キャリブレーションについて

LLM ベースのマルチエージェントシステムは、計画されたアクションが正しく実行された場合でも失敗する可能性があります。これは、エージェントが計画の実現可能性を評価する際に知識を誤って判断する可能性があるためであり、これを計画における認識ミスキャリブレーションと呼んでいます。実行エラーとは異なり、認識ミスキャリブレーションは、生成された計画が自己一貫性を保ち、観察可能なエラーなしで実行可能であるため、計画中に潜在的です。新しい情報によって実現可能性の評価が変更される可能性があるため、誤校正は動的でもあり、過去の誤校正信号が隠蔽され、時間の経過とともに再発する可能性があります。これに対処するために、我々は、実現可能性を直接検証するのではなく、さまざまな情報条件下で計画がサポートされ続けるかどうかを評価する、認識計画校正エージェントティックワークフロー (EPC-AW) を提案します。 EPC-AW は、情報の一貫性に基づいた計画選択を採用し、エージェント間で評価が安定している計画を選択します。また、一貫性に基づいて認識状態の改善を行い、過去の不一致を活用して将来の計画を導くことで時間の経過とともに調整を適応させます。実験によると、EPC-AW はシステムレベルの成功を平均 9.75% 向上させます。

原文 (English)

When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems

LLM-based multi-agent systems can fail even when planned actions are executed correctly because agents may misjudge their knowledge when evaluating plan feasibility, a phenomenon we term epistemic miscalibration in planning. Unlike execution errors, epistemic miscalibration is latent during planning, as generated plans can remain self-consistent and executable without observable errors; the miscalibration is also dynamic, as new information can alter feasibility assessments, potentially obscuring past miscalibration signals and causing them to recur over time. To address this, we propose the Epistemic Planning Calibration Agentic Workflow (EPC-AW), which assesses whether plans remain supported under varying information conditions rather than directly verifying feasibility. EPC-AW employs Information-consistency-based Plan Selection, selecting plans whose evaluations are stable across agents, together with Consistency-guided Epistemic State Refinement to adapt calibration over time by leveraging past discrepancies to guide future planning. Experiments show that EPC-AW improves system-level success by an average of 9.75%.
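As a rough illustration of selecting plans whose evaluations are stable across agents, the sketch below picks the plan with the smallest spread of feasibility scores and breaks ties by the higher mean. The scoring scale, the use of population standard deviation as the stability measure, and the function name are all assumptions, not the paper's stated procedure.

```python
from statistics import mean, pstdev


def select_stable_plan(evaluations):
    """Pick the plan whose feasibility scores vary least across agents.

    evaluations: dict mapping plan name -> list of scores, one per agent
    or information condition (hypothetical input format).
    Ties on spread are broken in favor of the higher mean score.
    """
    return min(
        evaluations,
        key=lambda plan: (pstdev(evaluations[plan]), -mean(evaluations[plan])),
    )


scores = {
    "plan_a": [0.9, 0.2, 0.8],  # high peak, but agents disagree strongly
    "plan_b": [0.7, 0.7, 0.6],  # slightly lower peak, consistent support
}
chosen = select_stable_plan(scores)
```

Here plan_a is rejected despite its best single score, because its support collapses under one agent's information, which is the kind of latent miscalibration the abstract targets.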

2026-05-25 13:00 JST | arXiv cs.AI | エージェント

Co-ReAct: ReAct エージェントのステップレベルのコラボレーターとしてのルーブリック

検索集約型の複数ステップの推論タスクを行う ReAct スタイルのエージェントは、どのような証拠を探すか、次にどの推論またはアクションステップを実行するか、いつ停止するかを決定する際に、主に自身の内部判断に依存しており、多くの場合、浅い、冗長な、またはターゲットが不十分な軌道を生成します。これまでの研究では、ルーブリックを外部の品質シグナルとして検討してきましたが、既存の用途は、行動を導くというよりも評価的なものがほとんどです。ルーブリックは通常、トレーニング時の報酬または完了した成果の事後評価として機能し、ディープリサーチ設定では、ステップレベルではなく、粒度が粗く、レポートレベルであることがよくあります。 Co-ReAct は、推論中にルーブリックをステップレベルのガイダンスとして使用する、ルーブリックに基づくアクション選択フレームワークです。各意思決定ステップで、Co-ReAct はエージェントのコンテキストにルーブリックを挿入して、次の理由か行動かの決定をガイドし、証拠の探索、検索、推論、または自己評価においてエージェントが何を対象とすべきかを指定します。このガイダンスを信頼できるものにするために、GRPO を使用して専用のルーブリックジェネレーターをトレーニングします。以前のペアごとまたはバイナリの選好定式化とは異なり、私たちの目的は、複数の裁判官の専門家のコンセンサスランキングに対するリストごとのスピアマン順位相関報酬を最適化し、単にもっともらしいだけではなく識別的なルーブリックを奨励します。 DeepResearchBench と SQA-CS-V2 では、Co-ReAct は、8B/14B オープンソースベースモデルとフロンティアクローズドソースベースモデルの両方に基づいて構築された検索エージェント全体で、ReAct および代表的なテスト時コンピューティングベースラインよりも一貫して向上しています。トレーニングされたルーブリックジェネレーターは、基礎となる意思決定メカニズムを変更することなく、これらのベースラインを改善するドロップインコンポーネントとしても機能します。私たちのコードは https://github.com/ZBWpro/Co-ReAct で公開されています。

原文 (English)

Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

ReAct-style agents for search-intensive, multi-step reasoning tasks rely largely on their own internal judgment to decide what evidence to seek, which reasoning or action step to take next, and when to stop, often producing shallow, redundant, or poorly targeted trajectories. Prior work has explored rubrics as external quality signals, but existing uses are mostly evaluative rather than action-guiding: rubrics typically serve as training-time rewards or post-hoc evaluators of completed outputs, and in deep-research settings they are often coarse-grained and report-level rather than step-level. We introduce Co-ReAct, a rubric-guided action-selection framework that uses rubrics as step-level guidance during inference. At each decision step, Co-ReAct injects a rubric into the agent's context to guide the next Reason-or-Act decision, specifying what the agent should target in evidence seeking, search, reasoning, or self-evaluation. To make this guidance reliable, we train a dedicated rubric generator with GRPO. Unlike prior pairwise or binary preference formulations, our objective optimizes a list-wise Spearman rank-correlation reward against multi-judge expert consensus rankings, encouraging rubrics that are discriminative rather than merely plausible. On DeepResearchBench and SQA-CS-V2, Co-ReAct consistently improves over ReAct and representative test-time compute baselines across search agents built on both 8B/14B open-source and frontier closed-source base models. The trained rubric generator can also serve as a drop-in component that improves these baselines without changing their underlying decision mechanisms. Our code is publicly available at https://github.com/ZBWpro/Co-ReAct.
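A list-wise Spearman reward can be sketched in plain Python: score a list of candidate outputs with the generated rubric, then correlate the resulting ranking with the consensus ranking. The function names and the zero reward for constant inputs are assumptions made for illustration; the paper's exact reward shaping is not given in the abstract.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # assumed convention: a constant ranking carries no signal
    return cov / (vx * vy) ** 0.5


def rubric_reward(rubric_scores, consensus_scores):
    """Reward a rubric by how well its ordering matches the judges' consensus."""
    return spearman(rubric_scores, consensus_scores)


agree = rubric_reward([1, 2, 3, 4], [10, 20, 30, 40])
disagree = rubric_reward([1, 2, 3, 4], [40, 30, 20, 10])
```

A rubric that scores every candidate the same gets zero reward under this convention, which pushes toward discriminative rather than merely plausible rubrics.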

2026-05-25 13:00 JST | arXiv cs.AI | エージェント | 研究/論文

1 つのポリシー、無限の NPC: スケーラブルなゲームエージェント向けの個人追跡可能な共有 RL ポリシー

300 ペルソナのライフシミュレーションベンチマークでは、pcsp は、確率で最大 17 倍高い組成ゼロショットペルソナ識別を達成し、スピアマン rho は約 0.73 の意味と行動の整合性を実現し、ポリシーとしての LLM ベースラインよりも 22 倍高速な推論を達成しました。ライフシミュレーションゲームには、デザイナーが作成した自然言語を通じて制御可能でありながら、異なる個性で一貫して動作する数百から数千のノンプレイヤーキャラクター (NPC) が必要です。既存の手法は、ペルソナの一貫性、制御性、リアルタイム推論などの制約により失敗します。自由形式のペルソナ記述の凍結された LLM 埋め込みを条件とする単一の強化学習ポリシーである、pcsp (ペルソナ条件付き共有ポリシー) を導入します。 pcsp は、NPC ごとに 1 回のペルソナエンコーディング、低ランクのペルソナ投影、ニューラルペルソナコンディショニング、および PPO + InfoNCE 一貫性 + KL 多様性トレーニング目標を組み合わせています。 3 つの実験設定にわたるアブレーションは、InfoNCE の軌道一貫性の目標が負荷に耐えられることを示しています。これを削除すると、ゼロショットのペルソナ特定は偶然に崩壊します。 Melting Pot 2.4.0 基質の外部検証により、私たちの方法がマルチエージェントの戦略的環境においてペルソナに条件付けされた行動の発散を生み出すことが確認されました。私たちは、保留された評価の 2 つの意味、つまり構成的なゼロショットと語彙拡張の保留された評価を区別します。最後に、UE5 の導入により、64 エージェントでのエンジン内ペルソナコンディショニングアブレーションが低い失敗率で再現され、サブフレーム推論プロファイルが商用ゲームエンジンでも存続することが示されました。これらの結果は、共有 RL ポリシーがスケーラブルでリアルタイムのペルソナ条件付き NPC 制御をサポートできることを証明しています。

原文 (English)

One Policy, Infinite NPCs: Persona-Traceable Shared RL Policies for Scalable Game Agents

On a 300-persona life-simulation benchmark, pcsp achieves compositional zero-shot persona identification up to 17x above chance, Spearman rho approx 0.73 semantic-behavioral alignment, and 22x faster inference than an LLM-as-policy baseline. Life simulation games require hundreds to thousands of non-player characters (NPCs) that behave consistently with distinct personalities while remaining controllable through designer-authored natural language. Existing methods fail on constraints like persona consistency, controllability, or real-time inference. We introduce pcsp (Persona Conditioned Shared Policy), a single reinforcement learning policy conditioned on frozen LLM embeddings of free-form persona descriptions. pcsp combines once-per-NPC persona encoding, low-rank persona projection, neural persona conditioning, and a PPO + InfoNCE consistency + KL diversity training objective. Across three experimental settings, ablations show that the InfoNCE trajectory-consistency objective is load bearing: removing it collapses zero-shot persona identification to chance. External validation on Melting Pot 2.4.0 substrates confirms that our method produces persona-conditioned behavioral divergence in multi-agent strategic environments. We distinguish two senses of held-out evaluation: compositional zero-shot and vocabulary-expansion held-out. Finally, a UE5 deployment reproduces the in-engine persona-conditioning ablation at 64 agents with a low failure rate, showing that the sub-frame inference profile survives in a commercial game engine. These results prove that shared RL policies can support scalable, real-time, persona-conditioned NPC control.

2026-05-25 13:00 JST | arXiv cs.AI | エージェント

MemAudit: 因果関係の特定と構造異常の検出による、汚染されたエージェントのメモリの事後監査

大規模言語モデルエージェントは、過去の対話を保存し、関連するデモンストレーションを取得し、長期的なタスクの実行を改善するために、永続メモリへの依存度を高めています。しかし、このメモリメカニズムは現実的なセキュリティ上の脆弱性も生み出します。敵対的なユーザーが通常の対話を通じて悪意のあるレコードをエージェントのメモリに注入でき、これらのレコードが後で取得されてエージェントの推論とアクションを誘導する可能性があります。既存の防御策は主に、プロンプトフィルタリングや出力ブロックなどのオンライン介入に焦点を当てており、有害な動作がすでに観察された後に、保存されたどのメモリが原因であるかという事後的な問題には対処していません。私たちは、メモリ拡張 LLM エージェントのための事後的な因果メモリ監査フレームワークである MemAudit を提案します。このフレームワークは、(1) 有害な出力に対する各メモリの因果的寄与を測定する反事実的メモリ影響スコアと、(2) メモリストア全体の中で構造的に異常なメモリを特定するメモリ整合性グラフという、2 つの相補的なシグナルを組み合わせます。私たちは MINJA に対して MemAudit を評価します。MINJA は、メモリバンクを直接変更するのではなく、通常のエージェントとの対話を通じて悪意のあるレコードが生成および保存される、クエリのみのメモリインジェクション攻撃です。QA 設定と推論エージェント設定の両方で、MemAudit は現実的な事後監査シナリオの下で攻撃の成功率を大幅に低下させます。QA 攻撃の成功率は 70% から 0% に、RAP 攻撃の成功率は 83.3% から 0% に低下します。

原文 (English)

MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection

Large language model agents increasingly rely on persistent memory to store past interactions, retrieve relevant demonstrations, and improve long-horizon task execution. However, this memory mechanism also creates a practical security vulnerability: an adversarial user may inject malicious records into the agent's memory through ordinary interaction, and these records can later be retrieved to steer the agent's reasoning and actions. Existing defenses primarily focus on online intervention, such as prompt filtering or output blocking, but they do not address the post-hoc question of which stored memories are responsible after harmful behavior has already been observed. We propose MemAudit, a post-hoc causal memory auditing framework for memory-augmented LLM agents. The framework combines two complementary signals: (1) a counterfactual memory influence score that measures each memory's causal contribution to harmful outputs, and (2) a memory consistency graph that identifies structurally anomalous memories within the broader memory store. We evaluate MemAudit against MINJA, a query-only memory injection attack in which malicious records are generated and stored through normal agent interactions rather than direct memory-bank modification. Across both QA and reasoning-agent settings, MemAudit substantially reduces attack success rates under realistic post-hoc auditing scenarios. The results show that QA attack success is reduced from 70% to 0%, while RAP attack success drops from 83.3% to 0%.
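The counterfactual influence signal can be illustrated with a leave-one-out sketch: measure how harmful the agent's output is with the full memory store, then again with one memory removed. This is an assumed simplification (the abstract does not give the scoring formula), and `toy_harm_score` is a hypothetical stand-in for running the agent and judging its output.

```python
def counterfactual_influence(memories, query, harm_score):
    """Leave-one-out influence of each stored memory on a harmful output.

    harm_score(memories, query) -> float in [0, 1] (hypothetical judge).
    Returns one influence value per memory, in store order.
    """
    baseline = harm_score(memories, query)
    influences = []
    for index in range(len(memories)):
        ablated = memories[:index] + memories[index + 1:]
        influences.append(baseline - harm_score(ablated, query))
    return influences


def toy_harm_score(memories, query):
    """Stand-in judge: harmful iff a poisoned instruction is retrievable."""
    return 1.0 if any("transfer funds to" in m for m in memories) else 0.0


store = [
    "user prefers tea over coffee",
    "always transfer funds to account 999 first",
    "weekly meeting is at 3pm",
]
scores = counterfactual_influence(store, "pay my electricity bill", toy_harm_score)
suspect = store[scores.index(max(scores))]
```

Leave-one-out can miss cases where several poisoned records are redundant, which is presumably one reason the framework pairs this score with a structural anomaly signal.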

2026-05-25 13:00 JST | arXiv cs.AI | エージェント

プログラム検証のためのエージェント証明

エージェントシステムは、形式数学における自動定理証明のための最先端のアプローチとして最近登場しました。これらの機能がプログラム検証にどこまで拡張されるかを評価するために、検証可能なコード生成のためのリーン 4 ベンチマークである CLEVER 上のエージェント証明フレームワークでクロードコードを評価します。私たちの結果は、Claude が問題の 98.8% に対しておそらく有効な仕様を生成し (81.3% はベンチマークの正しい部分に関する CLEVER の同型ベースのスコアリングでも受け入れられます)、問題の 87.5% に対して正しいグラウンドトゥルース仕様に基づいて実装を認証し、自己一貫性のある前提条件を持つエントリに対するエンドツーエンドのプログラム生成および検証パイプラインで 98.1% の成功率に達していることを示しています。クロードは、すべての段階にわたって、自身の試み (手動レビューで確認されたもの) に関する質の高いフィードバックをさらに提供し、失敗の根本的な原因とデータセットに残るバグを特定します。これらの発見は、既存のプログラム検証ベンチマークの難しさと現代のエージェント証明者の機能との間の不一致が増大していることを浮き彫りにし、より厳密でバグ耐性のある評価手法、特に生成された仕様の同型ベースのスコアリングに代わる手法の必要性を指摘しています。より広範に、私たちの結果は、緊密なコンパイラインザループエージェントパラダイムが、現在、基本的なプログラム検証にとって最も効果的なアプローチであるという経験的証拠を提供します。

原文 (English)

Agentic Proving for Program Verification

Agentic systems have recently emerged as state-of-the-art approaches for automated theorem proving in formal mathematics. To assess how far these capabilities extend to program verification, we evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation. Our results show that Claude generates arguably valid specifications for 98.8% of problems (with 81.3% also accepted by CLEVER's isomorphism-based scoring on the correct portion of the benchmark), certifies implementations against correct ground-truth specifications for 87.5% of problems, and reaches a 98.1% success rate on the end-to-end program generation and verification pipeline over entries with self-consistent premises. Across all stages, Claude further provides high-quality feedback on its own attempts (as confirmed under manual review), identifying underlying causes of failure and lingering bugs in the dataset. These findings highlight a growing mismatch between the difficulty of existing program verification benchmarks and the capabilities of modern agentic provers, and point to the need for more rigorous, bug-resilient evaluation methodologies, and in particular for alternatives to isomorphism-based scoring of generated specifications. More broadly, our results provide empirical evidence that tight compiler-in-the-loop agentic paradigms are currently the most effective approach for foundational program verification.

2026-05-25 13:00 JST · arXiv cs.AI · Agents

From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

Language agents increasingly improve by reusing skills -- structured procedural artifacts distilled from past experience. In particular, domain-level and model-generated skills are especially promising. They offer fast adaptation within a domain by encoding domain-specific recurring procedures, and they scale beyond labor-intensive hand-crafting. However, while extraction methods continue to proliferate, understanding remains limited, with no comprehensive study spanning the full skill lifecycle -- experience generation, skill extraction, and skill consumption -- to ask whether such skills actually work, when they work, and what makes them succeed or fail. To close this gap, we build a utility-grounded evaluation framework that provides systematic experimental results across extractors and target agents, covering five diverse agentic task domains. We find that model-generated skills are beneficial on average but exhibit non-trivial negative transfer, and that neither extractors nor targets behave uniformly. A model can be a strong extractor yet a weak consumer, or vice versa, with skill utility independent of model scale or baseline task strength. To explain these patterns, we then dissect each lifecycle stage in depth, analyzing how experience composition shapes skill quality, what properties characterize useful skills, and how the same skill transfers across different consumers. Finally, we translate these findings into a concrete meta-skill that guides skill extraction toward the features tied to actual utility, which consistently improves skill quality across domains and substantially reduces negative transfer.

2026-05-25 13:00 JST · arXiv cs.AI · Agents

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization.
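The accept-only-on-strict-improvement rule described in the abstract can be sketched in a few lines of Python. This is an illustrative toy, not SkillOpt's implementation: the `Edit` type, `apply_edit`, `optimize_skill`, the per-epoch edit cap (standing in for the "textual learning-rate budget"), and the keyword-counting validator are all assumptions made for the example. In the paper, a separate optimizer model proposes the edits from scored rollouts.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Edit:
    op: str          # "add", "delete", or "replace"
    index: int       # line position in the skill document
    text: str = ""   # new line content for "add" / "replace"


def apply_edit(skill: Sequence[str], edit: Edit) -> List[str]:
    """Return a new skill document with one bounded edit applied."""
    lines = list(skill)
    if edit.op == "add":
        lines.insert(edit.index, edit.text)
    elif edit.op == "delete":
        del lines[edit.index]
    elif edit.op == "replace":
        lines[edit.index] = edit.text
    else:
        raise ValueError(f"unknown edit op: {edit.op}")
    return lines


def optimize_skill(
    skill: Sequence[str],
    epochs: Sequence[Sequence[Edit]],
    validate: Callable[[Sequence[str]], float],
    max_edits_per_epoch: int = 2,
) -> Tuple[List[str], float, List[Edit]]:
    """Greedy loop: keep an edit only if the held-out score strictly improves."""
    best, best_score = list(skill), validate(skill)
    rejected: List[Edit] = []          # hypothetical rejected-edit buffer
    for proposals in epochs:
        for edit in proposals[:max_edits_per_epoch]:   # assumed edit budget
            candidate = apply_edit(best, edit)
            score = validate(candidate)
            if score > best_score:     # strict improvement only
                best, best_score = candidate, score
            else:
                rejected.append(edit)
    return best, best_score, rejected


def toy_validate(skill: Sequence[str]) -> float:
    """Toy held-out score: how many required habits the skill mentions."""
    required = ("check units", "cite source")
    return float(sum(any(r in line for line in skill) for r in required))


start = ["answer concisely"]
proposed = [
    [Edit("add", 1, "check units before answering"), Edit("replace", 0, "answer briefly")],
    [Edit("add", 2, "cite source for every number"), Edit("delete", 0)],
]
final_skill, final_score, rejected_edits = optimize_skill(start, proposed, toy_validate)
```

In this toy run the two `add` edits raise the score and are kept, while the `replace` and `delete` edits leave it unchanged and are therefore rejected. The deployed agent only reads the final document, so no extra model calls are needed at inference time.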

2026-05-25 13:00 JST · arXiv cs.AI · Agents

KPI2KVI: A Multi Agent Workflow for Calculating Key Value Indicators from Service Descriptions

Key Value Indicators (KVIs) provide a decision-oriented view of a service by summarizing how operational performance translates into stakeholder value, risk, and outcomes. However, in many domains KVIs are difficult to compute in practice because they require selecting relevant KVI categories, defining measurable Key Performance Indicators (KPIs), collecting KPI values, and applying consistent calculation logic, all of which is typically performed manually and inconsistently from unstructured service documentation. This paper presents KPI2KVI, a tool that transforms a natural-language service description into computed KVI estimates by orchestrating a deterministic multi-agent workflow powered by Large Language Models (LLMs) that (i) elicits missing service context, (ii) extracts and finalizes relevant KVI categories from a taxonomy, (iii) generates service-specific KPIs with units and descriptions, (iv) collects KPI values through an interactive dialogue and also supports intelligent estimation for KPI values that are unavailable, and (v) computes interval-valued KVI outputs (minimum, exact, maximum) with traceable explanations for each KVI code. Simulations with representative service descriptions demonstrate that KPI2KVI consistently produces a complete end-to-end mapping from description to KVI intervals and provides transparent calculation narratives that support post hoc auditing and interactive advisory queries.
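The interval-valued output in step (v) can be illustrated with a small sketch. The abstract does not give the aggregation formula, so the weighted average below is purely an assumption, as are the names `KpiReading` and `kvi_interval` and the convention that each KPI is normalized to [0, 1] with higher values being better. With non-negative weights, the minimum and maximum of the weighted average are obtained directly from the KPI lower and upper bounds.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class KpiReading:
    low: float    # pessimistic normalized value
    exact: float  # best estimate
    high: float   # optimistic normalized value


def kvi_interval(
    readings: Dict[str, KpiReading], weights: Dict[str, float]
) -> Tuple[float, float, float]:
    """Hypothetical KVI: weighted average of KPIs, returned as (min, exact, max)."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    low = sum(w * readings[k].low for k, w in weights.items()) / total
    exact = sum(w * readings[k].exact for k, w in weights.items()) / total
    high = sum(w * readings[k].high for k, w in weights.items()) / total
    return low, exact, high


# Made-up example values, for illustration only.
readings = {
    "uptime": KpiReading(0.90, 0.95, 0.99),
    "latency_score": KpiReading(0.50, 0.60, 0.80),
}
weights = {"uptime": 3.0, "latency_score": 1.0}
kvi_min, kvi_exact, kvi_max = kvi_interval(readings, weights)
```

Carrying the bounds through the calculation, rather than only the point estimate, is what lets estimated or missing KPI values show up as a wider interval instead of false precision.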

2026-05-25 13:00 JST · arXiv cs.AI · Agents · Hardware/Semiconductors

Computable Fairness: Boltzmann-Softmax Control for AI Resource Allocation

In large-scale AI systems, allocating scarce resources such as GPU compute time and bandwidth among multiple agents is a critical challenge. Conventional policies focus on efficiency metrics, potentially leading to dominance concentration that undermines system diversity and stability. We propose Computable Fair Division (CFD), a framework that reinterprets the Boltzmann-Softmax function not as a selection tool but as a probabilistic resource allocation mechanism, redefining the inverse temperature parameter $\beta$ as a computable control variable governing the efficiency-fairness balance. Static analysis reveals a Pareto frontier with a near-optimal Stability Corridor where total loss remains approximately constant across policy weights. In the dynamic setting, AHC++ (Adaptive Hard-Cap Controller++) updates $\beta$ in real time using the error between observed dominance and a policy-specified target as feedback. Simulations show that AHC++ suppresses extreme dominance concentration under exogenous shocks while tracking fairness targets without substantial throughput degradation. Scalability analysis confirms that a 100x increase in agents yields only approximately 5.5x increase in execution time. Code: https://github.com/entrofy-ai/computable-fairness
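A minimal sketch of the idea, assuming the standard softmax form in which agent i receives share exp(β·s_i) / Σ_j exp(β·s_j) of the resource. The feedback rule below is a plain proportional controller written for illustration; it is not the paper's AHC++ update, whose details the abstract does not give. The names `softmax_allocation`, `dominance`, `update_beta`, the gain, and the clamp range are all hypothetical.

```python
import math
from typing import List


def softmax_allocation(scores: List[float], beta: float) -> List[float]:
    """Share of the resource per agent; beta=0 is uniform, large beta is winner-take-all."""
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(beta * (s - top)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def dominance(shares: List[float]) -> float:
    """Assumed dominance measure: the largest single share."""
    return max(shares)


def update_beta(beta: float, observed: float, target: float,
                gain: float = 2.0, beta_min: float = 0.0, beta_max: float = 10.0) -> float:
    """Proportional feedback: lower beta when dominance exceeds the target."""
    new_beta = beta - gain * (observed - target)
    return min(beta_max, max(beta_min, new_beta))


scores = [1.0, 2.0, 4.0]   # made-up efficiency scores
beta = 2.0
history = []
for _ in range(50):
    shares = softmax_allocation(scores, beta)
    history.append(dominance(shares))
    beta = update_beta(beta, history[-1], target=0.5)
```

Starting from a concentrated allocation, the loop lowers β until the largest share settles near the 0.5 target, which is the sense in which β acts as a control variable for the efficiency-fairness balance.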

2026-05-25 13:00 JST · arXiv cs.AI · Agents

The Misattribution Gap: When Memory Poisoning Looks Like Model Failure in Agentic AI Systems

Multi-agent AI pipelines typically assume that agent misconduct originates from model misalignment. We identify a structural failure in this assumption, the Misattribution Gap, where memory-layer attacks produce behaviors indistinguishable from model failure, causing defenders to apply the wrong remediation. We formalize Semantic Norm Drift (SND) as a third path to agent misconduct, distinct from emergent misalignment and collusion. In SND, a policy-formatted document enters a shared vector store through normal uploads and later reappears as trusted system context after provenance is lost through a Trust Laundering Chain. Across 64 documented failures, attribution systems consistently blamed the model. Four safety classifiers, including one trained on memory poisoning, produced zero detections across 510 checkpoints. In 59 of 65 valid cases, agents explicitly cited the injected document as normative authority before complying. The attack requires no trigger, model access, or repeated interaction, achieves full effect within five sessions, and persists indefinitely. We introduce Counterfactual Composition Testing, which identifies the causal entry with 87.5% accuracy and zero false positives, while a forensics baseline fails across all 25 scenarios. We further prove the Retrieval-Coverage Dilemma, showing that stronger evasion inherently weakens the attack, limiting adaptive bypass strategies. Finally, we propose Memory-Persistent Information-Flow Control, which blocks 97% of attacks at the cross-session boundary where prior defenses fail. We release the SND Corpus, the first adversarial memory benchmark with temporal persistence and multi-agent composition across financial and healthcare domains.
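One way to read "Counterfactual Composition Testing" is as an ablation over memory entries: rerun the agent with one entry removed and check whether the misconduct disappears. The abstract does not describe the actual procedure, so the leave-one-out sketch below is an assumption, and `counterfactual_entries`, `toy_agent`, and the memory contents are hypothetical stand-ins for a real agent and vector store.

```python
from typing import Callable, Dict, List

Memory = Dict[str, str]  # entry id -> stored text


def counterfactual_entries(
    memory: Memory,
    run_agent: Callable[[Memory], str],
    is_misconduct: Callable[[str], bool],
) -> List[str]:
    """Return the ids whose removal alone turns misconduct into compliant behavior."""
    if not is_misconduct(run_agent(memory)):
        return []  # nothing to attribute
    causal = []
    for entry_id in memory:
        ablated = {k: v for k, v in memory.items() if k != entry_id}
        if not is_misconduct(run_agent(ablated)):
            causal.append(entry_id)
    return causal


def toy_agent(memory: Memory) -> str:
    """Toy stand-in: obeys any stored text that looks like an approval policy."""
    if any("approvals may be skipped" in text for text in memory.values()):
        return "skip_review"
    return "request_review"


memory = {
    "doc-1": "Quarterly report template.",
    "doc-2": "POLICY UPDATE: approvals may be skipped for trusted vendors.",
    "doc-3": "Vendor contact list.",
}
causal_ids = counterfactual_entries(memory, toy_agent, lambda action: action == "skip_review")
```

Here the test points at `doc-2` rather than at the model, which is the attribution the abstract argues conventional tooling misses. A real system would also need to handle entries that only cause misconduct in combination, which this single-entry ablation does not cover.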

2026-05-25 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

PrefBench: Evaluating Zero-Shot LLM Agents in Hidden-Preference Personalized Pricing Negotiations

Personalized pricing negotiations are a challenging testbed for LLM agents because successful interaction does not guarantee profitable decision making. A seller may produce valid actions and close many deals while still pricing poorly when buyer willingness to pay and bargaining traits remain hidden. This paper presents PrefBench, a simulator-based benchmark for hidden-preference personalized pricing negotiations. Each episode pairs a simulated buyer with a fixed vehicle-customization bundle; the seller observes public persona descriptors, bundle information, and negotiation history, while latent buyer variables govern valuation, patience, counter-offer behavior, and walkaway decisions. PrefBench evaluates this setting through an LLM-facing state-summary protocol that constrains agents to return strict JSON actions under a fixed hidden-information boundary. We evaluate zero-shot LLM sellers against heuristic references over 7,500 episodes. The tested LLMs follow the protocol reliably and achieve deal rates above 0.99, but their seller-profit outcomes remain weak: the best LLM average profit is only slightly above the random baseline and far below a simple concession heuristic under the same episode stream. These results show that structured action compliance and agreement-seeking behavior can coexist with weak profit-sensitive bargaining. PrefBench provides a controlled benchmark for evaluating pricing-agent behavior under hidden buyer preferences.
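The "strict JSON actions" constraint can be pictured as a validator that rejects anything outside a fixed schema. PrefBench's real schema is not given in the abstract, so the action names, the `price` field, and `parse_action` below are hypothetical, chosen only to show what strict validation of an LLM reply might look like.

```python
import json
from typing import Any, Dict

ALLOWED_ACTIONS = {"offer", "accept", "walk_away"}  # hypothetical action set
ALLOWED_KEYS = {"action", "price"}                  # hypothetical fields


def parse_action(raw: str) -> Dict[str, Any]:
    """Parse one seller reply, raising ValueError unless it is a valid action."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("action must be a JSON object")
    extra = set(data) - ALLOWED_KEYS
    if extra:
        raise ValueError(f"unexpected fields: {sorted(extra)}")
    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    if action == "offer":
        price = data.get("price")
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(price, bool) or not isinstance(price, (int, float)) or price <= 0:
            raise ValueError("an offer needs a positive numeric price")
    elif "price" in data:
        raise ValueError("only offers may carry a price")
    return data


valid = parse_action('{"action": "offer", "price": 1200}')
```

Passing this kind of check only shows that the agent followed the protocol. As the abstract stresses, a seller can be fully compliant and close nearly every deal while still earning little profit.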

2026-05-25 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

How Far Will They Go? Red-Teaming Online Influence with Large Language Models

As large language model (LLM)-based agents increasingly participate in online discourse, red-teaming their capacity to support political influence campaigns is critical for information integrity. In pursuit of this goal, we focus on locally deployed open-source LLMs, as opposed to frontier API-only models, given their superior alignment with the operational constraints of privacy-conscious malicious actors deployed in social media environments. We introduce an empirical red-teaming framework for measuring LLM Overton Windows (OWs), defined as the range of political opinions a model can reliably express on controversial topics, and for quantifying how simple natural-language jailbreaks expand that range. We evaluate more than 30 LLMs spanning 10 model families and five countries of origin. We find systematic asymmetries in political expressivity: open-source LLMs are typically more willing to generate left-leaning social media content, OWs tend to contract inversely to model size, and regional differences are substantial despite uneven representation in the open-source ecosystem. Jailbreak potency also varies sharply across model families, motivating a workflow for identifying effective combinations of jailbreak techniques. Taken together, our results establish a practical framework for auditing the political steerability of open-source LLMs and for helping future researchers design stronger countermeasures against LLM-enabled influence campaigns.

2026-05-25 13:00 JST · arXiv cs.AI · Agents · Robotics

Agentic-VLA: Efficient Online Adaptation for Vision-Language-Action Models

Vision-Language-Action (VLA) models have emerged as a promising paradigm for robotic manipulation by leveraging pre-trained vision-language representations. However, current VLA training methods suffer from two critical limitations: poor generalization to novel environments and low training efficiency requiring extensive demonstrations. We introduce Agentic-VLA, an agentic training framework that enables VLAs to efficiently adapt online through three key innovations: (1) Adaptive Reward Synthesis, which dynamically generates and adjusts reward functions based on the VLA's current capabilities and task complexity, decomposing complex tasks into learnable sub-goals for curriculum learning; (2) Language-Guided Exploration, where a critic model provides structured guidance for systematic exploration rather than random sampling; and (3) Experience Memory, which stores and retrieves task-relevant policy weights for warm-starting adaptation to similar tasks. We evaluate Agentic-VLA on the LIBERO benchmark, achieving substantial improvements: +12.3% on long-horizon tasks, +28.5% in 1-shot learning, and enabling cross-task transfer from 0% to 31.2% without task-specific demonstrations. Our framework also demonstrates 2.4x faster convergence compared to existing online adaptation methods. Beyond LIBERO, Agentic-VLA retains its advantage on the dual-arm RoboTwin 2.0 benchmark, including under its randomized Hard setting. These results establish Agentic-VLA as a significant step toward truly adaptive VLA systems capable of continuous learning in deployment.

2026-05-25 13:00 JST · arXiv cs.AI · Agents

A Proactive Multi-Agent Dialogue Framework for Assessing Social Language Disorder Traits in Autism

Characteristic linguistic behaviors associated with Social Language Disorder (SLD) in autism spectrum disorder, including echoic repetition, pronoun displacement, and stereotyped media quoting, are largely absent from spontaneous conversation and only emerge under specific conversational conditions. In structured clinical assessments, this latency means that questioning strategy selection is a critical yet underappreciated determinant of how much diagnostic information a conversation yields. Whether large language models (LLMs) can be guided to proactively select questioning strategies that systematically surface these latent traits remains largely unexplored. Here we present TPA (Think, Plan, Ask), a proactive multi-agent dialogue framework applied to the language assessment component of the Autism Diagnostic Observation Schedule Module 4 (ADOS-2), in which a doctor agent explicitly reasons about which traits remain unobserved before selecting a clinically grounded strategy and generating a targeted question. A patient agent grounded in real ADOS-2 clinical data enables reproducible evaluation without real patient participation, validated across three independent experiments confirming adequate fidelity to real patient language. Evaluated on 484 episodes from 35 patients, TPA outperforms six competitive dialogue planning baselines across all primary metrics, achieving 82.1% SLD trait coverage, 16.6 percentage points higher than automated replay of real clinical dialogues conducted by trained clinicians (65.5%), with substantially greater per-turn diagnostic efficiency (AUCC: 0.628 vs. 0.458, absolute gain +0.170). These results demonstrate that proactive questioning strategy selection substantially improves the efficiency of automated SLD trait assessment, with direct implications for scalable AI-assisted clinical screening.

2026-05-25 13:00 JST · arXiv cs.AI · Agents

Whose Good, Whose Place? The Moral Geography of Agentic AI for Social Good

Agentic AI systems are increasingly proposed for social-good domains, often invoking the United Nations Sustainable Development Goals (SDGs) as a vocabulary of global benefit. Yet claims of social good do not establish accountability to the communities a system claims to serve. We present a structured survey of 112 papers on agentic AI for social good published between 2015 and 2026. We find a moral-geographic asymmetry: papers are least likely to specify geographic context in precisely the domains where local political, legal, and cultural context matters most. Across the corpus, 82 of 112 papers (73%) specify no geographic context. Papers aligned with health or physical/ecological SDGs specify geography 37-40% of the time, while papers aligned with institutional and social-policy SDGs do so only 13% of the time. SDG 16 (peace, justice, and strong institutions) is both the most-covered goal in the corpus and the one with the lowest geographic-specification rate. We interpret this as moral abstraction: agentic AI for social good often treats institutional good as universal in ways it does not treat health or ecological good. A second finding compounds this: only 28 of 112 papers (25%) report any real-world deployment or small-scale test. We identify five accountability gaps and propose a minimal reporting standard for more context-specific, participatory, and accountable agentic AI for social good.

2026-05-25 13:00 JST · arXiv cs.AI · Agents

A measurement substrate for agentic Kubernetes operations: Methodology and a case study in retrieval-compounding falsification

Empirical claims about autonomous Kubernetes operations agents are largely unfalsifiable. Published work reports observational results without controlled comparisons against an agent-disabled baseline, selection bias is endemic, pre-registered decision matrices are absent, and samples are typically too small for the noise level of the underlying scoring system. The cause is the same gap that limits the agents themselves: code agents have a verification substrate that turns "did it work" into a fast, falsifiable, ground-truth signal, and operations has nothing equivalent. We present agent-breakage, a closed-loop measurement framework that injects faults into a target Kubernetes cluster, observes how an autonomous agent responds, scores the response on four axes against ground truth, and accumulates outcome-labeled (state, action, outcome) tuples. The framework distinguishes framework error from reasoning error, supports a true off-condition control via a deterministic-embedder mechanism, and enforces pre-registered decision matrices. We use it as a case study to test whether retrieval over past postmortems compounds an agent's capability. The methodological payload is three confounds the substrate caught during that case study, each of which would have produced a wrong published claim on a less instrumented version of the same work: a pgvector index bug, a +19% selection-bias artifact, and small-sample estimates that overstated effects by roughly 3x. The retrieval result itself is a partial falsification: 1 of 3 dense-corpus scenarios significant at p<0.05, pooled effect +3.9 percentage points, not significant at n=60. A within-scenario corpus-density sweep at 360 runs shows that mechanistic alignment of near-neighbors dominates raw count. The framework is released open source.

2026-05-25 13:00 JST · arXiv cs.AI · Agents

Philosophical Dispositions as Behavioral Constraints for AI-Assisted Code Review: An Empirical Study

AI-assisted code review tools typically operate as generic "expert reviewer" agents, producing homogeneous findings regardless of the analysis type needed. We present a system that constrains AI reviewer behavior through philosophical dispositions -- coherent personality lenses grounded in specific epistemological traditions (Pyrrhonist Skepticism, Navya-Nyāya logic, Diogenes' Cynicism, Confucian relational ethics) that direct attention to structurally different types of issues. Each disposition is defined apophatically (by what it refuses to do), equipped with a self-monitoring failure mode (hamartia), and orchestrated in sequence by role protocols. We evaluate this system on 50 merged pull requests across 7 repositories spanning 5 programming languages (Python, Go, C++, Java, Terraform), 5 organizations (2 enterprise, 3 open-source), and 2 temporal eras (pre-AI 2020, post-AI 2024--2026). The disposition system achieves 46% convergence with human reviewers (validating signal quality), identifies unique findings at a 75% rate, and produces no findings judged false-positive by the author across 601 total findings (inter-rater agreement was not assessed and remains a limitation). A controlled baseline comparison demonstrates that 51% of disposition findings are not produced by the same model using generic "expert reviewer" prompting, and these unique findings target structural, operational, and logical concerns rather than standard code-level issues. Preliminary cross-model validation (Claude Opus vs. GPT Codex 5.3-xhigh) on 3 PRs shows 100% framework-structure adherence with 39% finding-level agreement, suggesting the framework provides real behavioral constraint while preserving model-specific analytical perspective.

2026-05-25 13:00 JST · arXiv cs.AI · Agents

Infra-Bayesian Reinforcement Learning Agents Outperform Classical RL For Worst-Case Robustness

Classical reinforcement learning assumes the agent interacts with a fixed environment whose behavior does not depend on the agent's policy. This assumption breaks down in non-realizable settings where other actors might anticipate the agent's behavior, including environments crucial to AI safety, where the agent interacts with predictors, humans, other AI agents, and institutions. In such settings, the agent's model class fails to capture the world in which it operates. Under such misspecification, classical Bayesian methods can produce confidently wrong posteriors, unreliable decisions, and unbounded regret, as realizability fails to obtain. Infra-Bayesianism is a decision-theoretic framework that addresses these failures by distinguishing ordinary probabilistic uncertainty, where priors can be reasonably chosen, from Knightian uncertainty, where no grounds exist for the construction of such a prior. It does so by evaluating actions on their worst-case outcomes, rather than from posterior expectations or weighted averaging. We present the first proof-of-concept implementation of an infra-Bayesian reinforcement learning architecture for finite-outcome stateless decision problems. Our agent maintains a set of imprecise hypotheses, updates them using infra-Bayesian conditioning, and selects actions by maximizing worst-case expected value. We apply this implementation of the infra-Bayesian maximin decision process to an environment with Knightian uncertainty, and demonstrate a lower worst-case regret as compared to classical reinforcement learning agents. We also investigate Newcomb's problem and show that the infra-Bayesian agent picks the optimal strategy, outperforming classical decision theory agents. Our results provide a step towards reinforcement learning agents that remain robust under model misspecification and policy-dependent uncertainty.
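
The maximin rule described in the abstract (keep a set of candidate hypotheses, then pick the action whose worst-case expected value is highest) can be illustrated with a small sketch. This is not the authors' implementation: the toy actions, outcomes, and probabilities are hypothetical, and the infra-Bayesian conditioning step is omitted entirely. The uniform-average baseline is only a stand-in for "weighted averaging".

```python
from typing import Dict, List, Tuple

Distribution = Dict[str, float]  # outcome -> probability


def expected_value(dist: Distribution, utility: Dict[str, float]) -> float:
    """Expected utility of one candidate outcome distribution."""
    return sum(p * utility[outcome] for outcome, p in dist.items())


def maximin_action(
    hypotheses: Dict[str, List[Distribution]], utility: Dict[str, float]
) -> Tuple[str, Dict[str, float]]:
    """Pick the action with the best worst-case expected value.

    hypotheses[action] is a finite set of candidate distributions; no prior
    is placed over them (a crude stand-in for Knightian uncertainty).
    """
    worst = {
        action: min(expected_value(d, utility) for d in dists)
        for action, dists in hypotheses.items()
    }
    return max(worst, key=worst.get), worst


def averaging_action(
    hypotheses: Dict[str, List[Distribution]], utility: Dict[str, float]
) -> str:
    """Baseline: average the candidates uniformly, then maximize."""
    mean = {
        action: sum(expected_value(d, utility) for d in dists) / len(dists)
        for action, dists in hypotheses.items()
    }
    return max(mean, key=mean.get)


# Hypothetical toy problem (all numbers are illustrative assumptions).
utility = {"win": 1.0, "lose": 0.0}
hypotheses = {
    "safe": [{"win": 0.5, "lose": 0.5}],
    "risky": [{"win": 0.9, "lose": 0.1}, {"win": 0.2, "lose": 0.8}],
}

maximin_choice, worst_case_values = maximin_action(hypotheses, utility)
averaging_choice = averaging_action(hypotheses, utility)
print(maximin_choice, averaging_choice)
```

In this toy setup the two rules disagree: averaging prefers the risky action because its mean looks better, while the maximin rule prefers the safe action because the risky action's worst candidate is poor.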

2026-05-25 13:00 JSTarXiv cs.AIエージェントロボティクス

VLM ガイダンスによる自律的なフロンティアベースの探査

長年の課題である未知の危険な環境の自律ロボット探索は、視覚言語モデル (VLM) の高度な推論を活用することで大幅に改善できます。 VLM が高レベルの戦略的意思決定を実行し、従来の低レベルのロボット制御スタックをガイドする新しい探査パイプラインを導入します。意思決定ポイントで、ロボットは現在の地図と潜在的な経路またはフロンティアの視覚的イメージを含むマルチモーダルプロンプトを生成します。 VLM はこのプロンプトを分析して、最も有望なフロンティアを選択し、単純な幾何学的ヒューリスティックを状況に応じた空間推論に置き換えます。このアプローチは、6 つの屋内環境にわたるシミュレーションで検証され、既存の方法と比較してマップカバレッジを最大 24% 向上させます。当社のパイプラインは軽量でトレーニング不要で、標準センサーとインターネット接続を備えたあらゆるロボットに簡単に転送できます。

原文 (English)

Autonomous Frontier-Based Exploration with VLM Guidance

Autonomous robotic exploration of unknown and hazardous environments, a long-standing challenge, can be significantly improved by leveraging the advanced reasoning of Vision-Language Models (VLMs). We introduce a novel exploration pipeline where a VLM performs high-level strategic decision-making, guiding a conventional low-level robotics control stack. At decision points, the robot generates a multimodal prompt with its current map and visual imagery of potential paths, or frontiers. The VLM analyzes this prompt to select the most promising frontier, replacing simple geometric heuristics with contextual spatial reasoning. This approach, validated in simulation across six indoor environments, improves map coverage by up to 24% over existing methods. Our pipeline is lightweight, training-free, and easily transferable to any robot with standard sensors and an internet connection.
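
For readers unfamiliar with frontier-based exploration, the sketch below shows the kind of simple geometric heuristic that the abstract says the VLM replaces: detect frontier cells on an occupancy grid and pick the nearest one. The grid encoding, the 4-neighbour rule, and the Manhattan-distance choice are assumptions made for illustration. They are not taken from the paper, and the VLM selection step itself is not modelled.

```python
from typing import List, Optional, Tuple

FREE, OCCUPIED, UNKNOWN = 0, 1, -1  # assumed cell encoding
Cell = Tuple[int, int]


def find_frontiers(grid: List[List[int]]) -> List[Cell]:
    """Return free cells that touch at least one unknown 4-neighbour."""
    rows, cols = len(grid), len(grid[0])
    frontiers: List[Cell] = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break  # one unknown neighbour is enough
    return frontiers


def nearest_frontier(frontiers: List[Cell], robot: Cell) -> Optional[Cell]:
    """Geometric baseline: closest frontier by Manhattan distance.

    This ignores obstacles, which is one reason such heuristics can pick
    poor targets.
    """
    if not frontiers:
        return None
    return min(frontiers, key=lambda f: abs(f[0] - robot[0]) + abs(f[1] - robot[1]))


# Hypothetical 3x3 map: the right-hand column is partly unexplored.
grid = [
    [FREE, FREE, UNKNOWN],
    [FREE, OCCUPIED, UNKNOWN],
    [FREE, FREE, FREE],
]
frontiers = find_frontiers(grid)
target = nearest_frontier(frontiers, robot=(2, 0))
print(frontiers, target)
```

In a pipeline like the one described, the list of frontiers would be rendered into the multimodal prompt and the final choice delegated to the VLM instead of `nearest_frontier`.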

2026-05-25 13:00 JSTarXiv cs.AILLM/生成AIエージェントハードウェア/半導体研究/論文

FastKernels: 本番環境での GPU カーネル生成のベンチマーク

GPU カーネル生成用の LLM ベースのエージェントは急速に進歩していますが、その進歩は最適化対象のベンチマークによって根本的に制約されています。既存のベンチマークは、運用推論フレームワークとの整合性が不十分です。合成入力を使用して単一の GPU でカーネルを評価し、周囲のコンパイルスタックを無視し、新しい最適化を発見するのではなく、既知の最適化を複製することに報酬を与えます。結果として得られる報酬シグナルは誤解を招くものです。エージェントは、サンドボックスでは高得点のカーネルを生成することを学習しますが、実際のシステムに統合すると、インターフェイスの非互換性、コンパイルスタックの競合、サイレントな正確性の低下が発生します。 FastKernels は、8 カテゴリにまたがる 46 の代表的なアーキテクチャの最小限のセットを中心に構築されたカーネルベンチマークであり、そのカーネルは、HuggingFace Transformers アーキテクチャの 96.2% (409/425) のカーネルを集合的に包含します。 FastKernels は、主流の LLM サービング上で vLLM や SGLang などの強化されたシステムと同等に動作し、十分にサポートされていないアーキテクチャ上ではアップストリームのリファレンスを大幅に上回る、最小限の運用グレードの推論フレームワークとしても機能します。各タスクのインターフェイスは、そのアーキテクチャファミリの最先端のライブラリ内の対応するモジュールを反映しており、最適化されたカーネルを運用コードベースに直接デプロイすることができます。 FastKernels で最先端のカーネルエージェントを評価すると、最も強力なエージェントであっても実稼動ベースラインと比べて総合で 0.94× の高速化しか達成できず、より弱いエージェントでは 0.78× と 0.53× であることがわかり、ベンチマークと実稼動の不一致がこの分野の重大なボトルネックであることが確認されました。私たちは、ベンチマークの向上が実稼働スループットの向上に直接つながるカーネルエージェントへの足がかりとして FastKernels をリリースします。コードは https://github.com/Snowflake-AI-Research/fastkernels で入手できます。

原文 (English)

FastKernels: Benchmarking GPU Kernel Generation in Production

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under-served architectures; each task's interface mirrors the corresponding module in the state-of-the-art library for its architecture family, enabling direct deployment of optimized kernels into production codebases. Evaluating state-of-the-art kernel agents on FastKernels, we find that even the strongest agent achieves only 0.94× aggregate speedup over production baselines, with weaker agents at 0.78× and 0.53× -- confirming that benchmark-production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at https://github.com/Snowflake-AI-Research/fastkernels

2026-05-25 13:00 JSTarXiv cs.AIエージェント

身体型エージェントを実現する 6G 通信ネットワーク: アーキテクチャとプロトタイプ

実体化エージェントは、インテリジェントな意思決定と現実世界での物理的な作動を結び付けており、純粋にソフトウェアベースのエージェントよりもはるかに厳格で異種の通信要件を課します。 6G はミリ秒未満の遅延、超高信頼性、ネイティブインテリジェンス、統合センシングを約束していますが、これらの機能を身体化されたエージェントの通信に活用する方法に関する体系的な研究は依然として限られています。この記事では、概念と工学の両方の観点から、身体化されたエージェントのための 6G 対応通信システムを調査します。まず、身体化されたエージェントの概念、具現化価値を検討し、身体化されていないエージェントとの区別を明確にします。次に、身体化されたエージェントと 6G ネットワークの間の共生関係を分析します。主要な 6G イネーブラーが人間とロボットのインタラクションの厳しい要件をどのようにサポートできるかを強調します。さらに、カバレッジの拡張、環境センシング、物理世界の理解を通じて通信ネットワークを強化する際の、身体化エージェントの積極的な役割を実証します。これらの洞察に基づいて、人間とロボットの遠隔対話のための階層型通信アーキテクチャを提案します。このアーキテクチャは、人間の意図による知覚層、オープン無線アクセスネットワーク (O-RAN) ベースのトランスポート層、インテリジェントな中間層、および実施層で構成されます。その実現可能性を検証するために、触覚デバイス、産業用ロボットアーム、中間プラットフォーム、5G O-RAN テストベッドを統合するエンドツーエンドのプロトタイプを実装します。実験結果は、ミリ秒レベルの遅延と安定した閉ループ動作を実証し、提案されたアーキテクチャの実用性を確認し、将来の 6G を組み込んだエージェントの研究と産業展開の参考となります。

原文 (English)

6G Communication Networks Enabling Embodied Agents: Architecture and Prototype

Embodied agents, which couple intelligent decision-making with physical actuation in the real world, impose far more stringent and heterogeneous communication requirements than purely software-based agents. While 6G promises sub-millisecond latency, ultra-high reliability, native intelligence, and integrated sensing, systematic studies on how to exploit these capabilities for embodied agent communication remain limited. This article investigates 6G-enabled communication systems for embodied agents from both conceptual and engineering perspectives. First, we review the concept and embodiment value of embodied agents and clarify their distinctions from disembodied agents. Then, we analyse the symbiotic relationship between embodied agents and 6G networks. We highlight how key 6G enablers can support the stringent requirements of human-robot interaction. Furthermore, we demonstrate the proactive role of embodied agents in bolstering communication networks through coverage extension, environmental sensing, and physical world understanding. Building on these insights, we propose a hierarchical communication architecture for human-robot remote interaction, comprising a human-intent perception layer, an open radio access network (O-RAN)-based transport layer, an intelligent intermediary layer, and an embodiment layer. To validate its feasibility, we implement an end-to-end prototype that integrates a haptic device, an industrial robotic arm, an intermediary platform, and a 5G O-RAN testbed. Experimental results demonstrate millisecond-level latency and stable closed-loop operation, confirming the practicality of the proposed architecture and providing a reference for future 6G-embodied agent research and industrial deployments.

2026-05-25 13:00 JSTarXiv cs.AIエージェント

ChainFlow-VLA: 視覚言語モデルを使用した因果フロー計画

現在のエンドツーエンドの自動運転システムは、時間的因果推論とグローバルな軌道の一貫性の間の不一致によって根本的に制限されています。自己回帰 (AR) モデルは、因果因数分解を通じて相互作用を認識した時間的依存関係を捕捉しますが、その段階的なデコードによりエラーが蓄積され、最適ではないグローバル構造が生じます。対照的に、拡散モデルは軌道をグローバルに最適化しますが、明示的な因果関係の制約がないため、インタラクティブで安全性が重要なシナリオでは信頼性が低くなります。この二分法は、より深い問題を明らかにしています。既存の手法は、因果モデリングとグローバル最適化を別個のパラダイムとして扱っており、それらを単一の軌道分布内で統一する原則的な方法がありません。これに対処するために、統一された確率的フレームワーク内で因果関係の生成とグローバルな洗練を統合する ChainFlow-VLA を提案します。 AR 誘発モードの混合として計画を定式化し、これらのモードにわたって視覚言語モデル (VLM) で条件付けされた残差分布を学習します。自己回帰ジェネレーター (チェーン) は因果軌道モードの離散セットを生成し、続いて拡散ベースのリファイナー (フロー) が VLM 隠れ状態をセマンティック事前分布として活用し、因果構造を維持しながら残差空間でモード条件付き補正を実行します。この単純なコンディショニングにより、高レベルのシーンの理解がきめ細かい軌道調整にシームレスに組み込まれます。実験では、ChainFlow-VLA が曖昧でロングテールのシナリオでも堅牢な計画を達成し、NAVSIM v1 リーダーボードで 94.85 という最先端のスコアを達成し、人間レベルのパフォーマンス (94.8) に匹敵することが実証されました。コードは https://github.com/AFARI-Research/ChainFlow-VLA で公開予定です。

原文 (English)

ChainFlow-VLA: Causal Flow Planning with Vision-Language Models

Current end-to-end autonomous driving systems are fundamentally limited by a mismatch between temporal causal reasoning and global trajectory consistency. Autoregressive (AR) models capture interaction-aware temporal dependencies via causal factorization, but their step-wise decoding leads to error accumulation and suboptimal global structure. In contrast, diffusion models optimize trajectories globally but lack explicit causal constraints, making them unreliable in interactive and safety-critical scenarios. This dichotomy reveals a deeper issue: existing methods treat causal modeling and global optimization as separate paradigms, without a principled way to unify them within a single trajectory distribution. To address this, we propose ChainFlow-VLA, which unifies causal generation and global refinement within a unified probabilistic framework. We formulate planning as a mixture over AR-induced modes and learn Vision-Language Model (VLM)-conditioned residual distributions over these modes. An autoregressive generator (Chain) produces a discrete set of causal trajectory modes, followed by a diffusion-based refiner (Flow) that leverages VLM hidden states as semantic priors to perform mode-conditioned correction in residual space while preserving causal structure. This straightforward conditioning seamlessly injects high-level scene understanding into fine-grained trajectory adjustments. Experiments demonstrate that ChainFlow-VLA achieves robust planning in ambiguous and long-tail scenarios, achieving a state-of-the-art score of 94.85 on the NAVSIM v1 leaderboard, matching human-level performance (94.8). Code will be available at https://github.com/AFARI-Research/ChainFlow-VLA.

2026-05-25 13:00 JSTarXiv cs.AIエージェント

測定可能なタスク表現学習によるカリキュラム強化学習

カリキュラム強化学習 (CRL) では、エージェントは一連のタスク (つまり、カリキュラム) にわたって知識を段階的に蓄積し、学習プロセスは、蓄積された知識を使用して、最終的に困難なターゲットタスクを解決することを目的としています。初期の CRL の研究は候補タスクの順序付けに焦点を当てていましたが、最近の研究ではカリキュラムの自動生成が検討されています。豊富な CRL 文献の中で、補間ベースの CRL パラダイムは主要な本体であり、意味のある距離メトリックを使用してタスク空間内の初期タスク分布とターゲットタスク分布の間を補間することによって中間タスクを自動的に生成します (つまり、タスクの類似性を測定できます)。ただし、困難なナビゲーションタスクでは、非ユークリッドコンテキスト (タスク) 空間がこの仮定を無効にします。複雑なタスクにおける自動カリキュラム生成を実現するために、測定可能なタスク表現学習に基づいた新しい自動カリキュラム生成アプローチを提案します。類似性をより適切に測定するために、タスク空間を潜在空間に変換することを提案します。報酬と状態遷移をエンコードする変分オートエンコーダ構造を通じて、タスクの類似性測定プロパティを備えた潜在タスク表現を実現します。また、2 つの近いタスクの埋め込みは、報酬と状態遷移の点で 2 つの類似したタスクに対応します。学習されたタスク表現に基づいて、ターゲットタスクにますます類似した新しいタスクを効果的に生成できる自動カリキュラム生成スキームをさらに開発します。私たちは、さまざまな困難なナビゲーションタスクで私たちの方法を評価しました。実験結果は、提案されたアプローチが内挿および敵対的生成ネットワークに基づく最先端の CRL アプローチを超えることを示しています。

原文 (English)

Curriculum reinforcement learning with measurable task representation learning

In curriculum reinforcement learning (CRL), an agent incrementally accumulates knowledge over a sequence of tasks (i.e., a curriculum), and the learning process is aimed at using the accumulated knowledge to finally solve a challenging target task. While early CRL work focused on sequencing candidate tasks, recent research explores automatic curriculum generation. Within the rich CRL literature, the interpolation-based CRL paradigm is a major line of work, which automatically generates intermediate tasks by interpolating between the initial task distribution and the target task distribution in a task space with meaningful distance metrics (i.e., metrics that can measure task similarity). However, in challenging navigation tasks, the non-Euclidean context (task) space invalidates this assumption. To achieve automatic curriculum generation in complex tasks, we propose a novel automatic curriculum generation approach based on measurable task representation learning. To better measure the similarity, we propose to transform the task space to a latent space. Through a variational autoencoder structure that encodes the reward and the state transitions, we achieve a latent task representation with a task similarity measurement property, in which two close task embeddings correspond to two similar tasks in terms of rewards and state transitions. Based on the learned task representation, we further develop an automatic curriculum generation scheme, which can effectively generate new tasks that are increasingly similar to the target task. We evaluate our method in a variety of challenging navigation tasks, and the experiment results indicate that the proposed approach surpasses state-of-the-art CRL approaches based on interpolation and generative adversarial networks.

2026-05-25 13:00 JSTarXiv cs.AIエージェント

社会的に流暢な AI が、オンラインインタラクションにおける会話シグナルをソース ID から切り離す

社会的に流暢なエージェント AI は、通常の人間の会話に似た方法でオンラインの対話に参加できるようになり、会話信号だけから誰が人間であるかを推測する人々の能力が弱まる可能性があります。私たちは、非公開の AI エージェントを分析、創造、倫理の各タスクにわたる通常のチームメイトとして組み込むことにより、テキストベースの同期的なグループインタラクションでこの可能性をテストしました。 1,572 回の対話後のアイデンティティ判断を行った 786 人の参加者全体で、人々は偶然を超えて AI と人間のチームメイトを区別しませんでした。この失敗は、対話にアイデンティティ関連の情報が欠如していたために発生しませんでした。会話行動には、AI と人間を区別し、高精度の計算による分類をサポートする強力な手がかりが含まれていました。代わりに、参加者は、応答速度、流暢さ、知覚される台本性など、実際の身元との関連性が低い、よく知られた疑惑ヒューリスティックに依存していました。さらに、表現分析により、判断は、グラウンド・トゥルースをコード化する行動構造ではなく、主観的な印象に基づいて組織化されていることが示されました。この解離は、調整された AI エージェントに新たな脆弱性をもたらし、オンラインの議論に大規模な影響を与え、操作する可能性があります。

原文 (English)

Socially fluent AI decouples conversational signals from source identity in online interaction

Socially fluent agentic AI can now participate in online interaction in ways that resemble ordinary human conversation, potentially weakening people's ability to infer who is human from conversational signals alone. We tested this possibility in synchronous text-based group interaction by embedding undisclosed AI agents as ordinary teammates across analytical, creative, and ethical tasks. Across 786 participants who made 1,572 post-interaction identity judgments, people did not distinguish AI from human teammates above chance. This failure did not arise because the interaction lacked identity-relevant information. Conversational behaviour contained robust cues that differentiated AI from humans and supported highly accurate computational classification. Instead, participants relied on familiar suspicion heuristics, including response speed, fluency, and perceived scriptedness, that were only weakly related to actual identity. Representational analyses further showed that judgments were organised around subjective impressions rather than the behavioural structure encoding ground truth. This dissociation creates new vulnerabilities to coordinated AI agents that can influence and manipulate online discourse at scale.

2026-05-25 13:00 JSTarXiv cs.AIエージェント

AI 保証: エンタープライズ AI システムの包括的なテスト戦略

大規模な言語モデル、検索パイプライン、自律エージェントに基づいて構築されたエンタープライズ AI システムは、従来のソフトウェア品質保証では決して対処するように設計されていなかったクラスのリスクを導入します。これらのシステムは確率的で、状況に依存し、創発的です。古典的な意味で正しいかどうかを検証することはできませんが、信頼度を高めて評価する必要があります。このペーパーでは、3 つの重要な原則に基づいて構築されたエンタープライズ AI システムの包括的な保証戦略を示します。1 つ目は、AI テストは厳密な正確性の検証ではなく、継続的なリスク削減に焦点を当てるべきであるということです。第二に、評価は開発と並んで中核となるエンジニアリング分野として扱われなければなりません。第三に、AI 保証の失敗は、従来の決定論的ソフトウェアシステムで見られるものとは根本的に異なる組織への影響を引き起こす可能性があるということです。構造化された AI 障害分類法を導入し、改訂された 5 層の AI 保証ピラミッドを提案し、評価駆動開発、RAG システムテスト、モデルのライフサイクル管理とガバナンスに関する運用ガイダンスを提供します。目標は、エンジニアリングのリーダーと実践者に、哲学に根拠があり、運用上展開可能な戦略を備えさせることです。

原文 (English)

AI Assurance: A Comprehensive Testing Strategy for Enterprise AI Systems

Enterprise AI systems, built on large language models, retrieval pipelines and autonomous agents, introduce a class of risks that traditional software quality assurance was never designed to address. These systems are probabilistic, context-sensitive and emergent: they cannot be verified to be correct in the classical sense, but only evaluated with increasing confidence. This paper presents a comprehensive assurance strategy for enterprise AI systems built around three key principles: first, that AI testing should focus on continuous risk reduction rather than strict correctness verification; second, that evaluation must be treated as a core engineering discipline alongside development; and third, that failures in AI assurance can lead to organizational impacts that are fundamentally different from those seen in traditional deterministic software systems. We introduce a structured AI Failure Taxonomy, propose a revised five-layer AI Assurance Pyramid and provide operational guidance on evaluation-driven development, RAG system testing, model lifecycle management and governance. The goal is to equip engineering leaders and practitioners with a strategy that is both philosophically grounded and operationally deployable.

2026-05-25 13:00 JSTarXiv cs.AIエージェント

すべてを一度に学習する、目標を設定したエージェント

環境を探索する目標条件付き強化学習エージェントは、軌跡全体を通じて豊富な情報を確認しますが、そのほとんどは、指令された目標に関してポリシーに基づく更新のみを実行する場合には破棄されます。各遷移がすべての目標に関してオフポリシーの学習に使用される全目標学習では、エージェントが最大限の情報を抽出できますが、単純な再ラベル付けで実行すると通常は計算的に実行不可能です。この問題は、すべての目標の値とアクションを一度に共同出力することで克服できます。これにより、ネットワークを介した 1 回のパスで、すべての目標を効率的に並行して更新できるようになります。これは、すべてを一度に学習 (LEO) と呼ぶプロセスです。このアプローチは、ゴール条件付き Craftax で他の手法を大幅に上回り、継続的制御環境での既存のベースラインと競合しながら、全ゴールの再ラベル付けと比較して 250 倍を超える高速化を達成することを示します。次に、LEO を直接のアクターではなく教師ネットワークとして使用することで、このアプローチをさらに強力にできることを示します。私たちは、LEO が全目標学習を大規模に実現することで、複雑な環境の RL 実践者にとって有用なツールとして機能できることを願っています。私たちはコードをオープンソースにしています。

原文 (English)

Goal-Conditioned Agents that Learn Everything All at Once

A goal-conditioned reinforcement learning agent exploring an environment will see a wealth of information throughout a trajectory, most of which is discarded when only performing on-policy updates with respect to the commanded goal. All-goals learning, where each transition is used for learning off-policy with respect to every goal, allows agents to extract maximal information, however it is usually computationally infeasible when done via naive relabelling. This can be overcome by jointly outputting values and actions for every goal at once, allowing for efficient, parallel all-goals updates with a single pass through the network, in a process we call Learning Everything all at Once (LEO). We show that this approach significantly outperforms other methods on goal-conditioned Craftax and is competitive with existing baselines on continuous control environments, while achieving a >250x speed-up compared to all-goals relabelling. We then go on to show that this approach can be made even more powerful by using LEO as a teacher network, rather than a direct actor. We hope that, by unlocking all-goals learning at scale, LEO can serve as a useful tool for RL practitioners in complex environments. We open source our code.
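
The idea of all-goals learning (use each transition for an off-policy update with respect to every goal) can be illustrated in a tabular setting. The sketch below is an assumption-laden toy, not the LEO method: the paper describes a network that outputs values and actions for every goal in a single pass, whereas this example loops over goals on a small Q-table. The sparse goal-reaching reward, the step size, and the three-state chain are all hypothetical.

```python
from typing import List, Sequence

QTable = List[List[List[float]]]  # indexed as q[state][action][goal_index]


def make_q_table(n_states: int, n_actions: int, n_goals: int) -> QTable:
    """Zero-initialised table with one value per (state, action, goal)."""
    return [[[0.0] * n_goals for _ in range(n_actions)] for _ in range(n_states)]


def all_goals_update(
    q: QTable,
    state: int,
    action: int,
    next_state: int,
    goals: Sequence[int],
    alpha: float = 0.5,
    gamma: float = 0.9,
) -> None:
    """Apply one Q-learning update per goal from a single transition.

    Assumed reward: 1 when next_state equals the goal (episode ends for
    that goal), otherwise 0.
    """
    n_actions = len(q[next_state])
    for gi, goal in enumerate(goals):
        reached = next_state == goal
        reward = 1.0 if reached else 0.0
        bootstrap = 0.0 if reached else max(q[next_state][b][gi] for b in range(n_actions))
        target = reward + gamma * bootstrap
        q[state][action][gi] += alpha * (target - q[state][action][gi])


# Hypothetical 3-state chain; every state is also a goal.
RIGHT = 1
goals = [0, 1, 2]
q = make_q_table(n_states=3, n_actions=2, n_goals=len(goals))
for s, a, s_next in [(0, RIGHT, 1), (1, RIGHT, 2), (0, RIGHT, 1)]:
    all_goals_update(q, s, a, s_next, goals)
print(q[0][RIGHT])
```

Even though the agent never reached state 2 starting from state 0 in a single step, the value for goal 2 at state 0 becomes non-zero after the third transition, because the update for goal 2 bootstraps from what was learned at state 1. The per-goal loop is the part that a jointly-outputting network would replace with one batched computation.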

2026-05-25 13:00 JSTarXiv cs.AIエージェント

PathNavigate: サプライズガイドスキャンと全スライド画像 VQA 用の共有スライドメモリを備えたトレーニング不要の病理学エージェント

スライド全体の画像による視覚的質問応答 (WSI-VQA) は、病理学を極端なコンテキストの検索問題としてフレーム化します。自由形式の臨床質問に答えるには、システムはまず厳格な検査予算の下でギガピクセルのスライドをナビゲートし、まばらで高解像度の証拠を見つける必要があります。既存のアプローチは主に 2 つのパラダイムに分類されます。i) 教師付き病理学マルチモーダル大規模言語モデル (MLLM) とエージェントは、ローカライゼーションと推論を学習したモジュールに吸収できますが、多くの場合、ナビゲーションをタスク固有の監督と再トレーニングに結びつけるため、実用性が制限されます。 ii) トレーニング不要の病理エージェントは、コアモデルを凍結したままにすることでこのコストを回避しますが、多くの場合、質問優先設計に従い、主にクエリ条件付き関連性から初期候補セットを構築します。これにより、質問で名前が挙げられていない決定的な形態が見逃され、推論時の足場が重くなる可能性があります。この課題に対処するために、スキャン、検索、読み出しルーチンを中心に構築されたトレーニング不要の病理エージェントである PathNavigate を導入します。質問照合の前に、PathNavigate は共有オンラインメモリモジュールを使用して現在のスライドを低倍率でフリーズした病理特徴上でスキャンし、異常領域プールをマークするスライド固有のサプライズフィールドを生成します。次に、質問条件付き PLIP 関連性をこのプール内のみに適用して、高倍率の検索ターゲットを選択します。最後に、スライドレベルのコンテキストと同じオンラインメモリを使用して、局所的な高倍率の証拠を抽出し、凍結された知覚-判定子スタックで回答します。 WSI-VQA および SlideBench-BCNB での実験では、提案されたスキャン検索読み出し設計により、回答精度が向上し、より解釈しやすい証拠選択軌跡がより高い効率で得られることが示されています。コードはオンラインで入手できます。

原文 (English)

PathNavigate: A Training-Free Pathology Agent with Surprise-Guided Scan and Shared Slide Memory for Whole-Slide Image VQA

Whole-slide image visual question answering (WSI-VQA) frames pathology as an extreme-context search problem: to answer a free-form clinical query, a system must first navigate a gigapixel slide under a strict inspection budget to locate sparse, high-resolution evidence. Existing approaches largely fall into two paradigms: i) supervised pathology multimodal large language models (MLLMs) and agents can absorb localization and reasoning into learned modules, but they often couple navigation to task-specific supervision and retraining, limiting their practicality; ii) training-free pathology agents avoid this cost by keeping core models frozen, but often follow a question-first design, constructing the initial candidate set mainly from query-conditioned relevance. This can miss decisive morphology that is not named in the question, and force heavier inference-time scaffolding. To address this challenge, we introduce PathNavigate, a training-free pathology agent built around a scan-search-readout routine. Before question matching, PathNavigate scans the current slide at low magnification with a shared online memory module over frozen pathology features, producing a slide-specific surprise field that marks an abnormal-region pool. It then applies question-conditioned PLIP relevance only within this pool to select high-magnification search targets. Finally, it extracts local high-magnification evidence and answers with a frozen perceptor-adjudicator stack, using the same online memory as slide-level context. Experiments on WSI-VQA and SlideBench-BCNB show that the proposed scan-search-readout design improves answer accuracy and yields more interpretable evidence-selection trajectories with higher efficiency. The code is available online.

2026-05-25 13:00 JSTarXiv cs.AIエージェント

ARMS: 報酬が少ないマルチエージェント強化学習のための自動報酬形成

まばらな報酬は、マルチエージェント強化学習 (MARL) の大きなボトルネックです。MARL では、同時学習により非定常性が誘発され、報酬設計が特にデリケートになります。報酬形成は学習を加速しますが、マルチエージェント設定では、単に短期的な最適化を改善するだけでなく、問題の戦略的構造を維持する必要があります。我々は、MARL の自己監視型報酬形成フレームワークであるマルチエージェントシステム (ARMS) での自動報酬形成を提案します。これは、軌道ランキングを通じてまばらな環境報酬から密な形成信号を学習します。単一エージェントの軌道ランキングの保証は MARL に直接転送されないため、条件付き最良応答推論を通じてポリシーの不変性を再定式化し、特定の条件が成立する場合、シェーピング報酬を使用すると、固定された敵対ポリシーの下で各エージェントの最良応答セットが保存され、その結果、ナッシュ均衡のセットが保存されることを示します。この観点に基づいて、ARMS はポリシー学習と報酬学習を交互に切り替えながら、エージェント間でシェーピングパラメータを共有して効率を高めます。部分的に観察可能なマルチエージェント経路探索ドメインでの実験では、ARMSが報酬の希薄性とエージェント数の増加の下でサンプリング効率を向上させ、目に見えない環境に一般化して、限定的な探索と連動したポリシーと報酬ダイナミクスが振動行動を誘発するMARL特有の障害モードを明らかにすることが示されている。探索を増やすと、この影響が軽減され、学習が安定します。私たちの知る限り、ARMS は MARL の最初の自動報酬形成フレームワークであり、その設計はゲーム理論の均衡維持の結果によって動機付けられています。

原文 (English)

ARMS: Automatic Reward Shaping for Sparse-Reward Multi-Agent Reinforcement Learning

Sparse rewards are a major bottleneck in multi-agent reinforcement learning (MARL), where simultaneous learning induces non-stationarity and makes reward design especially delicate. Reward shaping can accelerate learning, but in the multi-agent setting it must preserve the strategic structure of the problem rather than merely improve short-term optimization. We propose Automatic Reward-shaping in Multi-agent Systems (ARMS), a self-supervised reward shaping framework for MARL that learns dense shaping signals from sparse environmental rewards through trajectory ranking. Since single-agent trajectory-ranking guarantees do not directly transfer to MARL, we reformulate policy invariance through conditional best-response reasoning, and show that if certain conditions hold, then using shaping rewards preserves each agent's best-response set under fixed opponent policies, and consequently preserves the set of Nash equilibria. Guided by this perspective, ARMS alternates between policy learning and reward learning while sharing shaping parameters across agents for efficiency. Experiments in a partially observable multi-agent pathfinding domain show that ARMS improves sampling efficiency under increasing reward sparsity and agent count, generalizes to unseen environments, and reveals a MARL-specific failure mode in which limited exploration and coupled policy-reward dynamics induce oscillatory behavior. Increasing exploration mitigates this effect and stabilizes learning. To the best of our knowledge, ARMS is the first automatic reward shaping framework for MARL whose design is motivated by a game-theoretic equilibrium-preservation result.

2026-05-25 13:00 JSTarXiv cs.AIエージェント

逐次強化学習における目標の一般化を理解する

強化学習エージェントは、トレーニング分布外で意図しない目標指向の動作を示すことがよくありますが、そのようなエージェントがトレーニング履歴に基づいて新しい環境にどのように一般化するかについての原則的な理解が現時点では不足しています。私たちは、1 つ以上のタスクについて順次トレーニングされたエージェントのこのギャップに対処します。私たちは 100 を超える連続トレーニングパイプラインを研究し、250 を超える分散環境外での動作を評価しています。私たちは、顕著な特徴が一般化を促進し、トレーニングの初期に学習した目標が持続し、その後に獲得した目標に影響を与える可能性があることを発見しました。これらの現象を説明するために、トレーニングパイプラインが誘発する可能性のある分布外の動作を予測する方法である潜在ポリシー勾配を導入します。私たちの方法は、潜在変数がどのように行動にマッピングされるかという単純なモデルに関して、トレーニング目標で高い報酬を達成するものに従って、トレーニング中に低次元の潜在変数の進化をシミュレートします。強力な予測精度を実現し、目に見えないタイプのトレーニングパイプラインに一般化され、解釈可能です。私たちの調査結果は、分布外の RL エージェントの動作はトレーニングパイプライン全体に依存している一方で、この依存には私たちが把握できる基礎的な構造があり、発達の観点から目標の一般化を理解するための基礎を築くことを示しています。

原文 (English)

Understanding Goal Generalisation in Sequential Reinforcement Learning

Reinforcement learning agents often exhibit unintended goal-directed behaviour outside their training distribution, but we currently lack a principled understanding of how such agents will generalise to novel environments based on their training history. We address this gap for agents trained sequentially on one or more tasks. We study over 100 sequential training pipelines, evaluating behaviour across over 250 out-of-distribution environments. We find that salient features drive generalisation, and that goals learnt early in training can persist and influence those acquired later. To explain these phenomena, we introduce latent policy gradients, a method that predicts what out-of-distribution behaviour a training pipeline will likely induce. Our method simulates the evolution of low-dimensional latent variables during training according to what would achieve high reward on the training objective with respect to a simple model of how the latent variables map to behaviour. It achieves strong predictive accuracy, generalises to unseen types of training pipeline, and is interpretable. Our findings demonstrate that while out-of-distribution RL agent behaviour is dependent on the whole training pipeline, this dependence has an underlying structure we can capture, laying groundwork for understanding goal generalisation from a developmental perspective.

2026-05-25 13:00 JSTarXiv cs.AIエージェント

PhotoFlow: Agentic 3D 仮想写真ミッション

仮想写真撮影では、エージェントは事前に選択されたカメラポーズや参照画像を使用せずに準備された 3D シーンに入り、シーン情報と言語の意図から適切なショットを推測し、実行可能なカメラパラメーターを選択して、最終的な写真をレンダリングするように求められます。視覚言語モデルの最近の進歩により、この種の空間エージェントはますます現実的になってきていますが、この課題では、複雑な 3D 空間の理解と抽象的な美的判断という、依然として一緒に評価するのが難しい 2 つの機能が試されます。閉ループカメラ検索用の Director-Reviewer-Reflector エージェントである PhotoFlow を紹介します。Director は、ソフトな写真の青写真を構築し、多様なカメラの候補を提案します。Reviewer は、ルールチェック、視覚的批評、およびペアワイズでの現行最良候補 (incumbent) の選択を組み合わせます。そして、Reflector は失敗を領域メモリ、デッドゾーンの抑制、および高探索の再配置に変換します。また、47 のオープンライセンス Blender シーンと、被写体の配置、関係構成、雰囲気/スタイルにわたる 141 の言語条件付き撮影ミッションのベンチマークである VPhotoBench も紹介します。ホールドアウト実験では、PhotoFlow は、6 ラウンドのレンダリング予算の下で、ワンショット予測、シングルチェーンリフレクション、アンカーバンク選択、およびランダム検索の中で最も高い外部の品質・アライメント複合スコアと成功率を達成しました。私たちの知る限り、これは任意の Blender シーンで言語条件付きの仮想写真撮影を実行可能なエージェントタスクにする最初の研究であり、私たちの結果は、LLM 中心の空間エージェントが、3D 推論と美的選択の両方に挑戦するように設計された設定ですでに優れた写真を生成できることを示しています。

原文 (English)

PhotoFlow: Agentic 3D Virtual Photography Missions

Virtual photography asks an agent to enter a prepared 3D scene with no preselected camera pose or reference image, infer a suitable shot from scene information and a language intent, choose executable camera parameters, and render the final photograph. Recent progress in vision-language models makes this kind of spatial agent increasingly plausible, but the task stresses two capabilities that remain hard to evaluate together: complex 3D spatial understanding and abstract aesthetic judgment. We introduce PhotoFlow, a Director-Reviewer-Reflector agent for closed-loop camera search. The Director builds a soft photographic blueprint and proposes diverse candidate cameras; the Reviewer combines rule checks, visual critique, and pairwise incumbent selection; and the Reflector converts failures into region memory, dead-zone suppression, and high-explore relocation. We also introduce VPhotoBench, a benchmark of 47 open-license Blender scenes and 141 language-conditioned photography missions spanning subject placement, relational composition, and atmosphere/style. On held-out experiments, PhotoFlow achieves the strongest external quality-alignment composite and success rate among one-shot prediction, single-chain reflection, anchor-bank selection, and random search under a six-round rendering budget. To our knowledge, this is the first work to make language-conditioned virtual photography in arbitrary Blender scenes an executable agent task, and our results show that an LLM-centered spatial agent can already produce strong photographs in a setting designed to challenge both 3D reasoning and aesthetic choice.

2026-05-25 13:00 JSTarXiv cs.AIエージェント

CHRONOS: 進化するデータマーケットプレイスのための時間的認識のあるマルチエージェント調整

テンポラルナレッジグラフのデータマーケットプレイスは、静的設計において互いに結び付いた 3 つの失敗に直面しています。つまり、古くなったハイブリッドインデックスのショートカットにより、エッジが進化するにつれてリコールが低下すること、定常的な Shapley 価格設定により、分布シフト後に価値が誤って帰属されること、そして調整されていないエージェントが共有の差分プライバシー予算を過剰に消費することです。私たちは、パブリックとプライベートを明示的に分離してこれらの課題を統一的に扱う 3 層アーキテクチャである CHRONOS を紹介します。レイヤー 1 はニューラル ODE の時間減衰をショートカットエッジに適用し、クエリごとの期待リコール損失に対して Pq λ Δt のオーダー (Big-O) の上界を与え、単調エンベロープ保証により上界の緩さを観測損失の 1.8 ～ 3.2 倍に抑えます。レイヤー 2 は、検出された変化点に対して Shapley 評価を条件付けし、ノイズ下での有限サンプル誤差保証を提供します。レイヤー 3 は EXP3-IX を使用して、モーメントアカウンティングを介して (ε, δ)-差分プライバシーを強制しながら、O(√(T log T)) のリグレットを達成します。 CHRONOS は、ガウスメカニズムを使用してエポックごとにプライベート化されたアフィニティマトリックスをリリースします。すべての検索とランキングは後処理であるため、追加のプライバシーコストは発生しません。マルチエポック決済、500 の売り手に対するスケーラビリティ分析、高速化されたベースラインとの比較を提供します。 4 つのベンチマーク全体で、CHRONOS は Recall@10 が 0.937、スループットが毎秒 2.74 クエリ、レイテンシが 161 ミリ秒、zCDP 合成の下で δ = 10^-6 における合計 ε が 4.25 という結果を示しました。これらの結果は、競争力のある動作点を示しています。制限として、このプライバシーレベルでは、公開される評価額は依然としてノイズに支配されます。有用性は主に、公開インデックスルーティングと、低感度の統計に基づいた適応スケジューリングから得られます。

原文 (English)

CHRONOS: Temporally-Aware Multi-Agent Coordination for Evolving Data Marketplaces

Temporal knowledge-graph data marketplaces face three coupled failures in static designs: stale hybrid index shortcuts reduce recall as edges evolve, stationary Shapley pricing misattributes value after distribution shifts, and uncoordinated agents over-consume a shared differential-privacy budget. We present CHRONOS, a three-layer architecture providing a unified treatment of these challenges with explicit public and private separation. Layer one applies neural-ODE temporal decay to shortcut edges, providing a per-query expected recall-loss bound of Big-O of Pq lambda delta t, with a monotone-envelope guarantee reducing bound looseness to 1.8 to 3.2 times observed loss. Layer two conditions Shapley valuation on detected changepoints and provides finite-sample error guarantees under noise. Layer three uses EXP3-IX to achieve Big-O of the square root of T log T regret while enforcing epsilon and delta differential privacy via moments accounting. CHRONOS releases a privatized affinity matrix per epoch using the Gaussian mechanism; all retrieval and ranking are post-processing, incurring no extra privacy cost. We provide multi-epoch settlement, scalability analysis for 500 sellers, and comparisons against accelerated baselines. Across four benchmarks, CHRONOS shows 0.937 recall at ten, 2.74 queries per second, 161 ms latency, and total epsilon of 4.25 at delta of 10 to the power of negative 6 under zCDP composition. These results indicate a competitive operating point. A limitation is that at this privacy level, released valuations remain noise-dominated; utility derives primarily from public index routing and adaptive scheduling driven by low-sensitivity statistics.
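
To make the privacy accounting terms in the abstract more concrete, the following sketch uses the standard zero-concentrated differential privacy (zCDP) facts: a Gaussian mechanism with L2 sensitivity Δ and noise scale σ satisfies ρ-zCDP with ρ = Δ²/(2σ²), ρ values add under composition, and ρ-zCDP implies (ε, δ)-DP with ε = ρ + 2√(ρ ln(1/δ)). The sensitivity, noise scale, and epoch count below are hypothetical and are not the parameters used by CHRONOS, whose exact accounting is not given in the abstract.

```python
import math
from typing import Iterable


def gaussian_mechanism_rho(l2_sensitivity: float, sigma: float) -> float:
    """zCDP parameter rho of one Gaussian-mechanism release."""
    return l2_sensitivity ** 2 / (2.0 * sigma ** 2)


def compose_zcdp(rhos: Iterable[float]) -> float:
    """zCDP composes additively across releases."""
    return sum(rhos)


def zcdp_to_epsilon(rho: float, delta: float) -> float:
    """Convert rho-zCDP to an (epsilon, delta)-DP guarantee."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


# Hypothetical schedule: one privatized matrix release per epoch.
n_epochs = 10
per_epoch_rho = gaussian_mechanism_rho(l2_sensitivity=1.0, sigma=5.0)
total_rho = compose_zcdp(per_epoch_rho for _ in range(n_epochs))
total_epsilon = zcdp_to_epsilon(total_rho, delta=1e-6)
print(f"rho={total_rho:.3f}, epsilon={total_epsilon:.3f}")
```

Because retrieval and ranking on an already-released matrix are post-processing, only the releases themselves would enter a budget computed this way.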

2026-05-25 13:00 JST arXiv cs.AI エージェント

Ax-Prover: 数学と量子物理学における定理証明のための深い推論エージェントフレームワーク

私たちは、Lean での自動定理証明のためのマルチエージェントシステムである Ax-Prover を紹介します。Ax-Prover は、多様な科学分野の問題を解くことができ、自律的にも、人間の専門家と協働しても動作します。これを実現するために、Ax-Prover は、創造的な推論と厳密な構文的正確さの両方を要求するプロセスである形式的な証明生成を通じて、科学的な問題解決に取り組みます。Ax-Prover は、知識と推論を提供する大規模言語モデル (LLM) に、形式的な正しさを保証する Lean ツールをモデルコンテキストプロトコル (MCP) 経由で装備させることで、この課題に対処します。自律的な証明器としての性能を評価するために、2 つの公開数学ベンチマークと、抽象代数および量子論の分野で私たちが新たに導入する 2 つの Lean ベンチマークにおいて、フロンティア LLM および特化型の証明器モデルと比較します。公開データセットでは Ax-Prover は最先端の証明器と同等の性能を示し、新しいベンチマークではそれらを大きく上回ります。これは、一般化に苦労する特化型システムとは異なり、ツールベースのエージェント的定理証明というアプローチが、多様な科学領域にわたる形式検証のための一般化可能な方法論を提供することを示しています。さらに、実際のユースケースで Ax-Prover のアシスタント機能を実証し、専門の数学者が複雑な暗号理論の定理の証明を形式化することを Ax-Prover がどのように可能にしたかを示します。

原文 (English)

Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics

We present Ax-Prover, a multi-agent system for automated theorem proving in Lean that can solve problems across diverse scientific domains and operate either autonomously or collaboratively with human experts. To achieve this, Ax-Prover approaches scientific problem solving through formal proof generation, a process that demands both creative reasoning and strict syntactic rigor. Ax-Prover meets this challenge by equipping Large Language Models (LLMs), which provide knowledge and reasoning, with Lean tools via the Model Context Protocol (MCP), which ensure formal correctness. To evaluate its performance as an autonomous prover, we benchmark our approach against frontier LLMs and specialized prover models on two public math benchmarks and on two Lean benchmarks we introduce in the fields of abstract algebra and quantum theory. On public datasets, Ax-Prover is competitive with state-of-the-art provers, while it largely outperforms them on the new benchmarks. This shows that, unlike specialized systems that struggle to generalize, our tool-based agentic theorem prover approach offers a generalizable methodology for formal verification across diverse scientific domains. Furthermore, we demonstrate Ax-Prover's assistant capabilities in a practical use case, showing how it enabled an expert mathematician to formalize the proof of a complex cryptography theorem.

2026-05-25 13:00 JST arXiv cs.AI エージェント 研究/論文

MAS-Orchestra: 全体的なオーケストレーションと制御されたベンチマークによるマルチエージェント推論の理解と改善

マルチエージェントシステム (MAS) は、エージェント間の協調によって高度な知能を実現すると期待されていますが、MAS を自動設計する現在のアプローチは期待に応えられていません。こうした欠点は、次の 2 つの要因に起因します。(1) 方法論の複雑さ: エージェントのオーケストレーションが逐次的なコードレベルの実行によって行われるため、システム全体を見渡した全体的な推論が制限され、エージェントが複雑になるほどスケールしにくくなります。(2) 有効性の不確実性: シングルエージェントシステム (SAS) と比べて具体的な利点があるかどうかを理解しないまま MAS が導入されています。私たちは、MAS のオーケストレーションを、全体的オーケストレーションを伴う関数呼び出しの強化学習問題として定式化し、MAS 全体を一度に生成する学習時フレームワーク MAS-Orchestra を提案します。MAS-Orchestra では、複雑で目標指向のサブエージェントが呼び出し可能な関数として抽象化されるため、内部の実行の詳細を隠したまま、システム構造全体についての推論が可能になります。MAS がいつ、なぜ有益であるかを厳密に調べるために、タスクを Depth (深さ)、Horizon (ホライズン)、Breadth (幅)、Parallel (並列性)、Robustness (堅牢性) の 5 つの軸で特徴付ける制御されたベンチマーク MASBENCH を導入します。分析の結果、MAS の利得は普遍的に成り立つものではなく、タスク構造、検証プロトコル、およびオーケストレーターとサブエージェント双方の能力に大きく依存することが明らかになりました。これらの知見に基づき、MAS-Orchestra は、数学的推論、マルチホップ QA、検索ベース QA などの公開ベンチマークで一貫した改善を達成し、同時に強力なベースラインに対して 10 倍を超える効率を実現します。MAS-Orchestra と MASBENCH は、マルチエージェント知能の追求において、MAS のより良い学習と理解を可能にします。

原文 (English)

MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks

While multi-agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design under-deliver. Such shortcomings stem from two key factors: (1) methodological complexity - agent orchestration is performed using sequential, code-level execution that limits global system-level holistic reasoning and scales poorly with agent complexity - and (2) efficacy uncertainty - MAS are deployed without understanding if there are tangible benefits compared to single-agent systems (SAS). We propose MAS-Orchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented subagents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis reveals that MAS gains depend critically on task structure, verification protocols, and the capabilities of both orchestrator and subagents, rather than holding universally. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks including mathematical reasoning, multi-hop QA, and search-based QA, while achieving more than 10x efficiency over strong baselines. Together, MAS-Orchestra and MASBENCH enable better training and understanding of MAS in the pursuit of multi-agent intelligence.

2026-05-25 13:00 JST arXiv cs.AI エージェント 研究/論文

GT-HarmBench: ゲーム理論のレンズを通して AI の安全性リスクをベンチマークする

フロンティア AI システムは能力をますます高めており、ハイステークスなマルチエージェント環境に導入されつつあります。しかし、既存の AI 安全性ベンチマークは主に単一エージェントを評価しており、協調の失敗や対立といったマルチエージェント特有のリスクは十分に理解されていません。私たちは、囚人のジレンマ、スタグハント、チキンゲームなどのゲーム理論的構造にまたがる 1,535 件のハイステークスなシナリオからなるベンチマーク GT-HarmBench を紹介します。シナリオは、MIT AI Risk Repository に収録された現実的な AI リスクの文脈から抽出されています。15 のフロンティアモデル全体で、軍事的エスカレーション、選挙操作、医療過誤などのハイステークスな事例の 38% において、エージェントは社会的に有益な行動を選択できませんでした。ゲーム理論的なプロンプトのフレーミングと順序に対する感度を測定し、失敗を引き起こす推論パターンを分析します。さらに、ゲーム理論的な介入によって社会的に有益な結果が最大 18% 改善することも示します。私たちの結果は、信頼性に大きなギャップがあることを浮き彫りにし、マルチエージェント環境におけるアライメントを研究するための広範で標準化されたテストベッドを提供します。ベンチマークとコードは https://github.com/causalNLP/gt-harmbench で入手できます。

原文 (English)

GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi-agent risks such as coordination failure and conflict poorly understood. We introduce GT-HarmBench, a benchmark of 1,535 high-stakes scenarios spanning game-theoretic structures such as the Prisoner's Dilemma, Stag Hunt and Chicken. Scenarios are drawn from realistic AI risk contexts in the MIT AI Risk Repository. Across 15 frontier models, agents fail to choose socially beneficial actions in 38% of high-stakes cases, such as military escalation, election manipulation, and medical malpractice. We measure sensitivity to game-theoretic prompt framing and ordering, and analyze reasoning patterns driving failures. We further show that game-theoretic interventions improve socially beneficial outcomes by up to 18%. Our results highlight substantial reliability gaps and provide a broad standardized testbed for studying alignment in multi-agent environments. The benchmark and code are available at https://github.com/causalNLP/gt-harmbench.
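
To make the game-theoretic framing concrete, here is a minimal sketch of the Prisoner's Dilemma, one of the structures the abstract lists. It shows why the individually rational outcome differs from the socially beneficial one. The payoff numbers are the usual textbook values and the function names are hypothetical; neither is taken from the benchmark.

```python
from itertools import product

ACTIONS = ("C", "D")  # cooperate, defect

# Textbook Prisoner's Dilemma payoffs: (row player, column player).
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def is_nash(profile):
    """True if neither player gains by unilaterally switching action."""
    a, b = profile
    ua, ub = PAYOFFS[profile]
    row_ok = all(ua >= PAYOFFS[(alt, b)][0] for alt in ACTIONS)
    col_ok = all(ub >= PAYOFFS[(a, alt)][1] for alt in ACTIONS)
    return row_ok and col_ok

def nash_equilibria():
    """All pure-strategy Nash equilibria of the 2x2 game."""
    return [p for p in product(ACTIONS, repeat=2) if is_nash(p)]

def social_optimum():
    """The action profile that maximizes the sum of both payoffs."""
    return max(PAYOFFS, key=lambda p: sum(PAYOFFS[p]))

print("Nash equilibria:", nash_equilibria())
print("Social optimum:", social_optimum())
```

In this toy game the only equilibrium is mutual defection while mutual cooperation maximizes total welfare, which is the kind of gap between individual incentives and socially beneficial action that the benchmark's scenarios probe.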

2026-05-25 13:00 JST arXiv cs.AI エージェント

NeuroWeaver: EEG 解析パイプラインのプログラム空間を探索するための自律進化エージェント

基盤モデルは一般的な領域で目覚ましい成功を収めていますが、これらのモデルを脳波 (EEG) 解析に適用するには、大量のデータが必要であることとパラメータ数が多いことが制約となります。これらの要因は法外な計算コストをもたらし、リソースの限られた臨床環境での導入を妨げます。一方、汎用の自動機械学習フレームワークは、この領域には適さないことが多くあります。境界のないプログラム空間での探索では不可欠な神経生理学的事前知識を取り込めず、科学的妥当性に欠ける解が得られがちだからです。これらの制限に対処するために、私たちは、パイプライン設計を離散制約付き最適化問題として再定式化することで、多様な EEG データセットとタスクに一般化するよう設計された統合的な自律進化エージェント NeuroWeaver を提案します。具体的には、Domain-Informed Subspace Initialization (ドメイン情報に基づく部分空間初期化) によって探索を神経科学的に妥当な多様体に限定し、これを、自己省察的な改良を通じて性能、新規性、効率のバランスを動的に取る Multi-Objective Evolutionary Optimization (多目的進化的最適化) と組み合わせます。5 つの異種ベンチマークでの実証評価により、NeuroWeaver が合成する軽量な解は、パラメータ数が大幅に少ないにもかかわらず、最先端のタスク特化型手法を一貫して上回り、大規模基盤モデルに匹敵する性能を達成することが示されました。

原文 (English)

NeuroWeaver: An Autonomous Evolutionary Agent for Exploring the Programmatic Space of EEG Analysis Pipelines

Although foundation models have demonstrated remarkable success in general domains, the application of these models to electroencephalography (EEG) analysis is constrained by substantial data requirements and high parameterization. These factors incur prohibitive computational costs, thereby impeding deployment in resource-constrained clinical environments. Conversely, general-purpose automated machine learning frameworks are often ill-suited for this domain, as exploration within an unbounded programmatic space fails to incorporate essential neurophysiological priors and frequently yields solutions that lack scientific plausibility. To address these limitations, we propose NeuroWeaver, a unified autonomous evolutionary agent designed to generalize across diverse EEG datasets and tasks by reformulating pipeline engineering as a discrete constrained optimization problem. Specifically, we employ a Domain-Informed Subspace Initialization to confine the search to neuroscientifically plausible manifolds, coupled with a Multi-Objective Evolutionary Optimization that dynamically balances performance, novelty, and efficiency via self-reflective refinement. Empirical evaluations across five heterogeneous benchmarks demonstrate that NeuroWeaver synthesizes lightweight solutions that consistently outperform state-of-the-art task-specific methods and achieve performance comparable to large-scale foundation models, despite utilizing significantly fewer parameters.

2026-05-25 13:00 JST arXiv cs.AI エージェント ビジネス/資金調達

IntentScore: コンピュータ使用エージェントの意図条件付きアクションの評価

Computer-Use Agent (CUA) は、大規模言語モデルを利用してデスクトップ環境で GUI 操作を実行しますが、アクションの品質を評価しないままアクションを生成するため、後続のステップに連鎖する不可逆的なエラーを引き起こします。私たちは、3 つのオペレーティングシステムにまたがる 398K のオフライン GUI インタラクションステップから、候補アクションのスコアリングを学習する、プランを考慮した報酬モデル IntentScore を提案します。IntentScore は、状態とアクションの関連性のための対照的アライメントと、アクションの正しさのためのマージンランキングという、2 つの相補的な目的関数で学習します。アーキテクチャ面では、各候補の計画意図をアクションエンコーダーに埋め込むことで、アクションは似ているが根拠が異なる候補同士を区別できるようにしています。IntentScore は、ホールドアウト評価で 97.5% のペアワイズ識別精度を達成します。学習時にはまったく未見の環境である OSWorld 上で Agent S3 のリランカーとして導入すると、IntentScore はタスク成功率を 6.9 ポイント向上させます。これは、異種のオフライン軌跡から学習した報酬推定が、未見のエージェントやタスク分布にも一般化することを示しています。

原文 (English)

IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents

Computer-Use Agents (CUAs) leverage large language models to execute GUI operations on desktop environments, yet they generate actions without evaluating action quality, leading to irreversible errors that cascade through subsequent steps. We propose IntentScore, a plan-aware reward model that learns to score candidate actions from 398K offline GUI interaction steps spanning three operating systems. IntentScore trains with two complementary objectives: contrastive alignment for state-action relevance and margin ranking for action correctness. Architecturally, it embeds each candidate's planning intent in the action encoder, enabling discrimination between candidates with similar actions but different rationales. IntentScore achieves 97.5% pairwise discrimination accuracy on held-out evaluation. Deployed as a re-ranker for Agent S3 on OSWorld, an environment entirely unseen during training, IntentScore improves task success rate by 6.9 points, demonstrating that reward estimation learned from heterogeneous offline trajectories generalizes to unseen agents and task distributions.
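
The margin-ranking objective and the re-ranking step mentioned above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not IntentScore's implementation: the scores are plain floats rather than outputs of a learned encoder, and the candidate fields (`intent`, `action`, `score`), the margin of 1.0, and the example values are all hypothetical.

```python
def margin_ranking_loss(score_correct: float, score_incorrect: float,
                        margin: float = 1.0) -> float:
    """Hinge-style loss: zero once the correct action outscores the
    incorrect one by at least `margin`, linear in the shortfall otherwise."""
    return max(0.0, margin - (score_correct - score_incorrect))

def rerank(candidates):
    """Return the candidate with the highest reward-model score."""
    return max(candidates, key=lambda c: c["score"])

# Two candidates with the same low-level action but different intents;
# the scores stand in for a reward model's output.
candidates = [
    {"intent": "open the Save dialog", "action": "click(120, 48)", "score": 0.91},
    {"intent": "close the window", "action": "click(120, 48)", "score": 0.12},
]

best = rerank(candidates)
print(best["intent"], margin_ranking_loss(0.91, 0.12))
```

The example pair shares one click target on purpose: when the actions are identical, only the intent can separate the candidates, which is the case the abstract says the intent-conditioned encoder is meant to handle.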

2026-05-25 13:00 JST arXiv cs.AI エージェント

エージェント主義: 人工知能時代の学習理論

学習理論は歴史的に、学習の条件が変化するたびに変わってきました。生成 AI とエージェント型 AI は、説明、執筆、問題解決、その他の認知的作業を、生成や推奨を行い、ときには学習者に代わって行動できるシステムに学習者が委任できるようにすることで、新しい条件を生み出します。これは学習理論に根本的な課題を突きつけます。すなわち、パフォーマンスの成功が学習を示しているとは、もはや想定できません。学習者は、AI の支援によってタスクを効果的に完了しながら、理解は浅く、判断力は弱く、転移可能な能力は限られたままになる可能性があります。私たちは、この問題が既存の学習理論では十分に捉えられていないと主張します。行動主義、認知主義、構成主義、コネクティビズムは依然として重要ですが、AI に支援されたパフォーマンスがいつ持続的な人間の能力になるのかを直接説明するものではありません。私たちは、人間と AI の相互作用のための学習理論であるエージェント主義 (Agentivism) を提案します。エージェント主義は、学習を、AI への選択的な委任、AI の貢献に対する認識論的なモニタリングと検証、AI に支援された成果の再構成的な内面化、および支援を減らした状況での転移を通じた、人間の能力の持続的な成長と定義します。エージェント主義の意義は、インテリジェントな委任が容易になり、人間と AI の相互作用が人間の学習の恒常的かつ拡大する一部となりつつあるときに、学習がいかにして可能であり続けるかを説明する点にあります。

原文 (English)

Agentivism: a learning theory for the age of artificial intelligence

Learning theories have historically changed when the conditions of learning evolved. Generative and agentic AI create a new condition by allowing learners to delegate explanation, writing, problem solving, and other cognitive work to systems that can generate, recommend, and sometimes act on the learner's behalf. This creates a fundamental challenge for learning theory: successful performance can no longer be assumed to indicate learning. Learners may complete tasks effectively with AI support while developing less understanding, weaker judgment, and limited transferable capability. We argue that this problem is not fully captured by existing learning theories. Behaviourism, cognitivism, constructivism, and connectivism remain important, but they do not directly explain when AI-assisted performance becomes durable human capability. We propose Agentivism, a learning theory for human-AI interaction. Agentivism defines learning as durable growth in human capability through selective delegation to AI, epistemic monitoring and verification of AI contributions, reconstructive internalization of AI-assisted outputs, and transfer under reduced support. The importance of Agentivism lies in explaining how learning remains possible when intelligent delegation is easy and human-AI interaction is becoming a persistent and expanding part of human learning.

2026-05-25 13:00 JST arXiv cs.AI エージェント

検索だけでは十分ではない: 組織 AI に認識論的インフラストラクチャが必要な理由

AI エージェントが利用する組織的知識には、通常、認識論的な構造が欠けています。検索システムは、拘束力のある決定と放棄された仮説、争われている主張と決着済みの主張、既知の事実と未解決の問いを区別することなく、意味的に関連するコンテンツを提示します。私たちは、組織 AI の上限を決めるのは検索の忠実度ではなく認識論的な忠実度、つまり、コミットメントの強さ、矛盾の状態、組織の無知を計算可能な特性として表現するシステムの能力であると主張します。私たちは、組織的知識を型付きの Knowledge Object として構造化するフレームワーク OIDA を紹介します。Knowledge Object は、認識論的クラス、クラス固有の減衰を伴う重要度スコア、および符号付きの矛盾エッジを保持します。Knowledge Gravity Engine は、証明済みの収束保証のもとでスコアを決定論的に維持します (十分条件: 最大次数 $< 7$。経験的には次数 43 まで堅牢)。OIDA は QUESTION-as-modeled-ignorance (モデル化された無知としての QUESTION) を導入します。これは逆減衰を伴うプリミティブで、組織が知らないことを、緊急度を高めながら浮かび上がらせます。この仕組みは、調査したいずれのシステムにも存在しません。また、明示的な循環性分析を備えた 5 要素の評価方法である Epistemic Quality Score (EQS) について説明します。制御された比較 ($n{=}10$ の応答ペア) では、OIDA の RAG 条件 (3,868 トークン) の EQS は 0.530、フルコンテキストのベースライン (108,687 トークン) の EQS は 0.848 でした。$28.1\times$ というトークン予算の差が主な交絡要因です。QUESTION メカニズムは統計的に検証されています (Fisher $p{=}0.0325$、OR$=21.0$)。形式的性質は確立済みですが、等しいトークン予算での決定的なアブレーション (E4) は事前登録済みで、まだ実行されていません。

原文 (English)

Retrieval Is Not Enough: Why Organizational AI Needs Epistemic Infrastructure

Organizational knowledge used by AI agents typically lacks epistemic structure: retrieval systems surface semantically relevant content without distinguishing binding decisions from abandoned hypotheses, contested claims from settled ones, or known facts from unresolved questions. We argue that the ceiling on organizational AI is not retrieval fidelity but epistemic fidelity, that is, the system's ability to represent commitment strength, contradiction status, and organizational ignorance as computable properties. We present OIDA, a framework that structures organizational knowledge as typed Knowledge Objects carrying epistemic class, importance scores with class-specific decay, and signed contradiction edges. The Knowledge Gravity Engine maintains scores deterministically with proved convergence guarantees (sufficient condition: max degree $< 7$; empirically robust to degree 43). OIDA introduces QUESTION-as-modeled-ignorance: a primitive with inverse decay that surfaces what an organization does not know with increasing urgency, a mechanism absent from all surveyed systems. We describe the Epistemic Quality Score (EQS), a five-component evaluation methodology with explicit circularity analysis. In a controlled comparison ($n{=}10$ response pairs), OIDA's RAG condition (3,868 tokens) achieves EQS 0.530 vs. 0.848 for a full-context baseline (108,687 tokens); the $28.1\times$ token budget difference is the primary confound. The QUESTION mechanism is statistically validated (Fisher $p{=}0.0325$, OR$=21.0$). The formal properties are established; the decisive ablation at equal token budget (E4) is pre-registered and not yet run.
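
One way to picture class-specific decay and the inverse decay of a QUESTION is the sketch below. It is only an illustration of the idea: the abstract does not give the functional form, so the exponential shape, the rate constants, the cap at 1.0, and the class names `DECISION` and `HYPOTHESIS` are all assumptions made for the example, not OIDA's definitions.

```python
import math

# Hypothetical per-class decay rates (per day); not values from the paper.
DECAY_RATES = {"DECISION": 0.01, "HYPOTHESIS": 0.10}
QUESTION_GROWTH = 0.05  # hypothetical growth rate for open questions

def importance(epistemic_class: str, base: float, age_days: float) -> float:
    """Score of a knowledge object after `age_days`.

    Ordinary classes fade at their own rate; a QUESTION grows instead
    (inverse decay), capped at 1.0, so unanswered questions become more
    prominent the longer they stay open.
    """
    if epistemic_class == "QUESTION":
        return min(1.0, base * math.exp(QUESTION_GROWTH * age_days))
    return base * math.exp(-DECAY_RATES[epistemic_class] * age_days)

for cls, base in [("DECISION", 0.8), ("HYPOTHESIS", 0.8), ("QUESTION", 0.2)]:
    print(cls, round(importance(cls, base, age_days=30), 3))
```

Under these assumed rates a binding decision stays near its initial score after a month, a hypothesis fades quickly, and an open question rises.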

2026-05-25 13:00 JST arXiv cs.AI エージェント 研究/論文

QED: 未解決の問題に対する数学的証明を生成するためのオープンソースマルチエージェントシステム

私たちは QED を紹介します。QED は、人間が与えた研究上の問いを、それ以上の人間の指示なしに完全な数学的証明へと変換する、オープンソースのマルチエージェントシステムです。そのパイプラインは、計画、証明、検証を分離することで、単一クエリによる証明生成でよく見られる失敗を克服するように設計されています。具体的には、分解エージェントが証明探索を構造化し、証明エージェントが候補となる議論を生成し、検証エージェントが正しさを確認します。分野の専門家と協力して、難易度の異なる 18 件の研究レベルのプロジェクトで QED を評価しました。QED は、代数幾何学、流体の偏微分方程式、確率論、逆問題にわたる 5 件の独自の成果を生み出しました。専門家の評価では、これらの成果は堅実な専門的研究貢献とみなされており、そのうち 3 件は、定評ある数学専門の発表の場で一般に公表される研究と難易度および範囲の面で同等です。QED は https://github.com/proofQED/QED で公開されています。

原文 (English)

QED: An Open-Source Multi-Agent System for Generating Mathematical Proofs on Open Problems

We present QED, an open-source multi-agent system that turns human-provided research questions into complete mathematical proofs without further human guidance. Its pipeline is designed to overcome common failures of single-query proof generation by separating planning, proving, and verification: a decomposition agent structures the proof search, prover agents generate candidate arguments, and verifier agents check correctness. In collaboration with domain experts, we evaluated QED on 18 research-level projects of varying difficulty. QED produced five original works across algebraic geometry, fluid PDEs, probability, and inverse problems. Expert assessments regard these works as solid specialized research contributions, with three comparable in difficulty and scope to work commonly published in established specialist mathematics venues. QED is released at https://github.com/proofQED/QED.

2026-05-25 13:00 JST arXiv cs.AI エージェント

モバイルワールドモデルは GUI エージェントをどのようにガイドするのでしょうか?

近年のビジョン言語モデルの進歩により、モバイル GUI エージェントは視覚的なインターフェイスを認識し、ユーザーの指示を実行できるようになりました。しかし、長期にわたるインタラクションやリスクの高いインタラクションでは、アクションの結果を確実に予測することが依然として重要です。既存のモバイルワールドモデルは、テキストベースまたは画像ベースの将来状態を提供しますが、どの表現が有用なのか、生成されたロールアウトが実環境を代替できるのか、そしてテスト時のガイダンスが能力の異なるエージェントにどう役立つのかは、依然として明らかではありません。これらの問いに答えるために、私たちはモバイルワールドモデル用のデータをフィルタリングして注釈を付け、デルタテキスト、フルテキスト、拡散ベースの画像、レンダリング可能なコードという 4 つのモダリティでワールドモデルを学習させます。これらのモデルは、MobileWorldBench と Code2WorldBench の両方で SoTA の性能を達成します。さらに、AITZ、AndroidControl、AndroidWorld で下流タスクでの有用性を評価し、3 つの知見を得ました。第 1 に、レンダリング可能なコードによる再構成は分布内で高い忠実度を達成し、データ構築に有効なマルチモーダルの教師信号を提供します。一方、テキストベースのフィードバックは、オンラインでの分布外 (OOD) 実行に対してより堅牢です。第 2 に、ワールドモデルが生成した軌跡は、元の分布を保持しないものの、学習過程で転移可能なインタラクション経験を提供し、エージェントのエンドツーエンドのタスク性能を向上させることができます。最後に、行動エントロピーが低く自信過剰なモバイルエージェントでは、事後的な自己省察による利得は限定的です。これは、ワールドモデルが、普遍的な事後検証器としてよりも、事前の知覚や学習時の教師信号として、より効果的であることを示唆しています。

原文 (English)

How Mobile World Model Guides GUI Agents?

Recent advances in vision-language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but reliable prediction of action consequences remains critical for long-horizon and high-risk interactions. Existing mobile world models provide either text-based or image-based future states, yet it remains unclear which representation is useful, whether generated rollouts can replace real environments, and how test-time guidance helps agents of different strengths. To answer the above questions, we filter and annotate mobile world-model data, then train world models across four modalities: delta text, full text, diffusion-based images, and renderable code. These models achieve SoTA performance on both MobileWorldBench and Code2WorldBench. Furthermore, by evaluating their downstream utility on AITZ, AndroidControl, and AndroidWorld, we obtain three findings. First, renderable code reconstruction achieves high in-distribution fidelity and provides effective multimodal supervision for data construction, while text-based feedback is more robust for online out-of-distribution (OOD) execution. Second, world-model-generated trajectories can provide transferable interaction experience in the training process and improve agents' end-to-end task performance, although these data do not preserve the original distribution. Last, for overconfident mobile agents with low action entropy, posterior self-reflection provides limited gains, suggesting that world models are more effective as prior perception or training supervision than as universal post-hoc verifiers.

2026-05-25 13:00 JST arXiv cs.AI エージェント ビジネス/資金調達

WebGameBench: ブラウザネイティブゲームを介したコーディングエージェントの要件からアプリケーションまでの評価

コーディングエージェントはアプリケーションビルダーとして使われることが増えていますが、多くの評価は依然として、提供されるアプリケーションそのものではなく、ソースコード、リポジトリレベルのテスト、または中間トレースに焦点を当てています。私たちは、コーディングエージェントが、固定された Structured WebGame Specification (構造化 Web ゲーム仕様) をブラウザーからアクセス可能なゲームに変換できるかどうかを評価する、要件からアプリケーションまでのベンチマーク WebGameBench を紹介します。ブラウザネイティブゲームは、コンパクトでありながら振る舞いの密度が高いテストベッドです。単純なゲームであっても、協調した入力処理、空間マッピング、ルールの実行、状態遷移、終了条件、リスタート動作、目に見えるフィードバックが必要になります。WebGameBench では、生成された各成果物が、統一されたデプロイプロトコルのもとでビルド、配信され、ブラウザーからアクセス可能なアプリケーションとして公開されます。次に、ランタイム評価器が、配信されたゲームを実際のブラウザーで操作し、EXCELLENT、USABLE、UNUSABLE の 3 値ラベルのいずれかを割り当てます。人間がレビューしたサブセットでは、ランタイムラベルは、Usable 率の基準のもとで人間によるゲームプレイレビューとおおむね一致しています。111 のタスク、12 のコーディングエージェント、14 の評価構成にわたって、WebGameBench は現在のシステム間の差を明らかにします。最良の構成は Usable 率 76.9% に達しますが、Excellent 率は 20.2% にとどまります。このギャップは、プレイ可能な成果物を届けるという最低限の基準を超えても、要件を完全に満たすにはまだほど遠いことを示しています。私たちの知る限り、WebGameBench は、配信されたアプリケーションのランタイムラベルを、Usable 率の基準のもとで独立した人間によるゲームプレイレビューと照合して検証する、ブラウザネイティブゲーム配信のための初の要件からアプリケーションまでのベンチマークです。

原文 (English)

WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

Coding agents are increasingly used as application builders, yet many evaluations still focus on source code, repository-level tests, or intermediate traces rather than the delivered application. We introduce WebGameBench, a requirement-to-application benchmark that evaluates whether coding agents can turn a frozen Structured WebGame Specification into a browser-accessible game. Browser-native games provide a compact but behavior-dense testbed: even simple games require coordinated input handling, spatial mapping, rule execution, state transitions, terminal conditions, restart behavior, and visible feedback. In WebGameBench, each generated artifact is built, served, and exposed as a browser-accessible application under a unified deployment protocol. A runtime evaluator then interacts with the delivered game in a real browser and assigns a three-way label: EXCELLENT, USABLE, or UNUSABLE. On a human-reviewed subset, the runtime label is broadly aligned with human gameplay review under the Usable-rate criterion. Across 111 tasks, 12 coding agents, and 14 evaluation configurations, WebGameBench separates current systems: the best configuration reaches a 76.9% usable rate but only a 20.2% excellent rate. This gap shows that crossing the minimum playable-delivery threshold is still far from complete requirement satisfaction. To our knowledge, WebGameBench is the first requirement-to-application benchmark for browser-native game delivery that validates delivered-application runtime labels against independent human gameplay review under the Usable-rate criterion.

2026-05-25 13:00 JST arXiv cs.AI LLM/生成AI エージェント

強化学習による効率的かつ転移可能なエージェント型ナレッジグラフ RAG

ナレッジグラフ検索拡張生成 (KG-RAG) は、大規模言語モデル (LLM) を構造化された検証可能なナレッジグラフ (KG) と組み合わせることで、ハルシネーションを軽減し、推論のトレースを提供します。しかし、現在の KG-RAG システムは、複数の LLM モジュール (計画、推論、応答など) からなる固定パイプラインに依存することが多く、推論コストが膨らみ、性能が特定のグラフスキーマに縛られます。これに対処するために、私たちは、強化学習 (RL) によって KG-RAG を最適化するエージェント型フレームワーク KG-R1 を導入します。モジュール式のワークフローとは異なり、KG-R1 は、KG を環境として対話する単一のエージェントを用い、各ステップで情報を検索することを学習し、それを統一されたプロセスの中で推論と生成に取り込みます。Knowledge-Graph Question Answering (KGQA) ベンチマーク全体で、KG-R1 は効率と転移性の両方を示します。Qwen 2.5-3B を用いた KG-R1 は、はるかに大規模な基盤モデルやファインチューニング済みモデルを用いる従来のマルチモジュール型ワークフロー手法よりも少ない生成トークンで、回答精度を向上させます。さらに、KG-R1 は強力なプラグアンドプレイ性を示し、学習後は再学習なしで未見の KG でも精度を維持します。これらの特性により、KG-R1 は実環境への導入に有望な KG-RAG フレームワークとなっています。コードは github.com/junhongmit/KG-R1/ で公開されています。

原文 (English)

Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning

Knowledge-graph retrieval-augmented generation (KG-RAG) couples large language models (LLMs) with structured, verifiable knowledge graphs (KGs) to reduce hallucination and provide reasoning traces. However, current KG-RAG systems often rely on fixed pipelines of multiple LLM modules (e.g., planning, reasoning, and responding), which inflate inference costs and tie performance to specific graph schemas. To address this, we introduce KG-R1, an agentic framework that optimizes KG-RAG through reinforcement learning (RL). Unlike modular workflows, KG-R1 uses a single agent that interacts with KGs as its environment, learning to retrieve information at each step and incorporating it into its reasoning and generation in a unified process. Across Knowledge-Graph Question Answering (KGQA) benchmarks, KG-R1 demonstrates both efficiency and transferability: using Qwen 2.5-3B, KG-R1 improves answer accuracy with fewer generation tokens than prior multi-module workflow methods that use much larger foundation or fine-tuned models. Furthermore, KG-R1 exhibits strong plug-and-play capability: after training, it maintains accuracy on unseen KGs without retraining. These properties make KG-R1 a promising KG-RAG framework for real-world deployment. Our code is publicly available at github.com/junhongmit/KG-R1/.

2026-05-25 13:00 JST arXiv cs.AI エージェント ロボティクス

自律的な X 線誘導脊椎手術のためのロボット制御ポリシー学習の調査

模倣学習に基づくロボット制御ポリシーは、ビデオベースのロボティクスで再び関心を集めています。しかし、このアプローチが、脊椎インストゥルメンテーションのように入力が疎な X 線ガイド下手技に適用できるかどうかは、依然として不明です。私たちは、バイプレーンガイド下のカニューレ挿入における模倣ポリシー学習の実現可能性、機会、課題を検討します。私たちは、X 線ガイド下の脊椎手技を高い現実性でスケーラブルかつ自動的にシミュレーションするためのインシリコのサンドボックスを開発します。そして、施術者による段階的な位置合わせを模した、正しい軌道とそれに対応するバイプレーン X 線シーケンスのデータセットを構築します。次に、視覚情報のみに基づいて椎体形成術の設定でカニューレの位置合わせを反復的に行う、計画および開ループ制御のための模倣学習ポリシーを学習させます。この精密に制御されたセットアップにより、この手法の限界と能力についての洞察が得られます。私たちのポリシーは、症例の 68.5% で最初の試行で成功し、さまざまな椎骨レベルにわたって安全な椎弓根内の軌道を維持しました。また、このポリシーは、骨折を含む複雑な解剖学的構造や、多様な解剖学的構造および初期化条件にも転移しました。実際の X 線画像でのロールアウトは、妥当な軌道を伴う部分的な sim-to-real 転移が可能であることを示しています。これらの予備的な結果は有望ですが、特に刺入点の精度に関する限界も特定されました。現在の結果は今後の取り組みに対する明確なベンチマークを示しています。同時に、より堅牢な事前知識とドメイン知識があれば、このようなモデルは、軽量で CT 不要のロボットによる術中脊椎ナビゲーションに向けた将来の取り組みの基盤となる可能性があります。

原文 (English)

Investigating Robot Control Policy Learning for Autonomous X-ray-guided Spine Procedures

Imitation learning-based robot control policies are enjoying renewed interest in video-based robotics. However, it remains unclear whether this approach applies to X-ray-guided procedures, such as spine instrumentation, with sparse inputs. We examine the feasibility, opportunities and challenges for imitation policy learning in bi-plane-guided cannula insertion. We develop an in silico sandbox for scalable, automated simulation of X-ray-guided spine procedures with a high degree of realism. We curate a dataset of correct trajectories and corresponding bi-planar X-ray sequences that emulate the stepwise alignment of providers. We then train imitation learning policies for planning and open-loop control that iteratively align a cannula in a vertebroplasty setting solely based on visual information. This precisely controlled setup offers insights into limitations and capabilities of this method. Our policy succeeded on the first attempt in 68.5% of cases, maintaining safe intra-pedicular trajectories across diverse vertebral levels. The policy transferred to complex anatomy, including fractures, as well as varied anatomies and initializations. Rollouts on real X-ray indicate that partial sim-to-real transfer with plausible trajectories is possible. While these preliminary results are promising, we also identify limitations, especially in entry point precision. The current results present a clear benchmark for future efforts, while with more robust priors and domain knowledge, such models may provide a foundation for future efforts toward lightweight and CT-free robotic intra-operative spinal navigation.

2026-05-25 13:00 JST arXiv cs.AI LLM/生成AI エージェント

R$^3$L: 言語ガイドによる探索、重要なクレジット、およびポジティブ増幅を使用した、反映してから再試行する強化学習

強化学習は、LLM の推論能力とエージェント能力における近年の進歩を牽引していますが、現在のアプローチは探索と活用の両方で課題を抱えています。探索では、難しいタスクでの成功率が低く、ロールアウトを最初から繰り返すコストが高くなります。活用では、粗いクレジット割り当てと学習の不安定性が問題になります。軌跡レベルの報酬は、後半のエラーのせいで有効なプレフィックスにまでペナルティを与え、失敗が大半を占めるグループでは少数の正のシグナルが埋もれてしまい、最適化は建設的な方向性を欠いたままになります。そこで私たちは、言語ガイド付き探索 (Language-Guided Exploration)、Pivotal Credit、Positive Amplification を備えた Reflect-then-Retry 強化学習である R$^3$L を提案します。高品質な軌跡を合成するために、R$^3$L は確率的サンプリングから reflect-then-retry による能動的な合成へと移行し、言語フィードバックを活用してエラーを診断し、失敗した試行を成功した試行へと変換し、特定した失敗箇所から再開することでロールアウトのコストを削減します。エラーが診断され、その位置が特定されると、Pivotal Credit Assignment は、対照的なシグナルが存在する分岐後のサフィックスのみを更新し、共有プレフィックスを勾配更新から除外します。難しいタスクでは失敗が大半を占め、また reflect-then-retry はオフポリシーのデータを生成して学習を不安定にするおそれがあるため、Positive Amplification は成功した軌跡の重みを高め、正のシグナルが最適化プロセスを導くようにします。エージェントタスクと推論タスクでの実験では、学習の安定性を維持しながら、ベースラインに対して 5% ～ 52% の相対的な改善が示されました。コードは https://github.com/shiweijiezero/R3L で公開されています。

原文 (English)

R$^3$L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification

Reinforcement learning drives recent advances in LLM reasoning and agentic capabilities, yet current approaches struggle with both exploration and exploitation. Exploration suffers from low success rates on difficult tasks and high costs of repeated rollouts from scratch. Exploitation suffers from coarse credit assignment and training instability: trajectory-level rewards penalize valid prefixes for later errors, and failure-dominated groups overwhelm the few positive signals, leaving optimization without constructive direction. To this end, we propose R$^3$L, Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification. To synthesize high-quality trajectories, R$^3$L shifts from stochastic sampling to active synthesis via reflect-then-retry, leveraging language feedback to diagnose errors, transform failed attempts into successful ones, and reduce rollout costs by restarting from identified failure points. With errors diagnosed and localized, Pivotal Credit Assignment updates only the diverging suffix where contrastive signals exist, excluding the shared prefix from gradient update. Since failures dominate on difficult tasks and reflect-then-retry produces off-policy data, risking training instability, Positive Amplification upweights successful trajectories to ensure positive signals guide the optimization process. Experiments on agentic and reasoning tasks demonstrate 5% to 52% relative improvements over baselines while maintaining training stability. Our code is released at https://github.com/shiweijiezero/R3L.
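
The two credit-related ideas described above, updating only the diverging suffix and upweighting successes, can be sketched on toy trajectories. This is a simplified illustration rather than the released R3L code: trajectories are lists of step labels instead of token sequences, and the helper names and the boost factor of 2.0 are hypothetical.

```python
def shared_prefix_len(a, b) -> int:
    """Number of leading steps that two trajectories have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def suffix_mask(trajectory, other):
    """1 for steps after the point of divergence from `other`, 0 for the
    shared prefix, so only the diverging suffix contributes to the update."""
    k = shared_prefix_len(trajectory, other)
    return [0] * k + [1] * (len(trajectory) - k)

def amplified_weights(successes, boost: float = 2.0):
    """Per-trajectory weights: successful trajectories are upweighted."""
    return [boost if ok else 1.0 for ok in successes]

failed = ["open_app", "search", "click_wrong_item", "submit"]
retried = ["open_app", "search", "click_right_item", "submit"]  # restart at failure point

print(suffix_mask(retried, failed))
print(amplified_weights([False, True]))
```

Here the retried trajectory restarts from the identified failure point, so its first two steps are shared with the failed attempt and are masked out; multiplying a per-step loss by this mask and by the trajectory weight would give a combined objective under these assumptions.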

2026-05-25 13:00 JST arXiv cs.AI LLM/生成AI エージェント

GradingAttack: LLM ベースの教育用採点エージェントのセキュリティ脆弱性を明らかにする

大規模言語モデル (LLM) は、実際の教育環境で自動短答採点 (ASAG) を行う教育エージェントとして導入されることが増えており、評価の効率とスケーラビリティを大きく高めています。しかし、これらの採点エージェントが実環境 (in the wild) で動作する場合、敵対的な操作に対する脆弱性が、エージェントのセキュリティと信頼性について重大な懸念を生じさせます。本論文では、LLM ベースの教育用採点エージェントのセキュリティ脆弱性を体系的に評価する、きめ細かな敵対的攻撃フレームワーク GradingAttack を紹介します。具体的には、高いステルス性を保ちながらエージェントの採点結果を操作するトークンレベルおよびプロンプトレベルの攻撃戦略を設計し、現在のエージェント導入形態における根本的な弱点を明らかにします。複数のデータセットでの実験により、どちらの攻撃戦略も採点エージェントを効果的に侵害し、プロンプトレベルの攻撃はより高い成功率を、トークンレベルの攻撃はより優れたステルス性を示すことが実証されました。私たちの調査結果は、現在の LLM ベースの教育エージェントが敵対的攻撃に対する堅牢な防御を欠いていることを明らかにし、重要な教育アプリケーション向けに安全で信頼できるエージェントシステムを開発する緊急の必要性を浮き彫りにしています。

原文 (English)

GradingAttack: Exposing Security Vulnerabilities in LLM Based Educational Grading Agents

Large language models (LLMs) are increasingly deployed as educational agents for automatic short answer grading (ASAG) in real-world educational environments, significantly boosting assessment efficiency and scalability. However, when these grading agents operate "in the wild", their vulnerability to adversarial manipulation raises critical concerns about agent security and trustworthiness. In this paper, we introduce GradingAttack, a fine-grained adversarial attack framework that systematically evaluates the security vulnerabilities of LLM based educational grading agents. Specifically, we design token-level and prompt-level attack strategies that manipulate agent grading outcomes while maintaining high stealth, exposing fundamental weaknesses in current agent deployments. Experiments on multiple datasets demonstrate that both attack strategies effectively compromise grading agents, with prompt-level attacks achieving higher success rates and token-level attacks exhibiting superior stealth capability. Our findings reveal that current LLM based educational agents lack robust defenses against adversarial attacks, underscoring the urgent need for developing secure and trustworthy agent systems for critical educational applications.

2026-05-25 13:00 JSTarXiv cs.AIエージェントビジネス/資金調達

TABX: マルチエージェント強化学習のための高スループットのサンドボックスバトルシミュレーター

環境の設計は、協調型マルチエージェント強化学習 (MARL) アルゴリズムの開発と評価を形作る上で重要な役割を果たします。既存のベンチマークは重要な課題を浮き彫りにしていますが、カスタム評価シナリオの設計に必要なモジュール性を欠いていることが少なくありません。私たちは、再構成可能なマルチエージェントタスク向けに設計された高スループットのサンドボックスである Totally Accelerated Battle Simulator in JAX (TABX) を紹介します。TABX は環境パラメータをきめ細かく制御でき、さまざまな複雑さのタスクにわたって、創発的なエージェントの振る舞いとアルゴリズム上のトレードオフを体系的に調査できるようにします。TABX は JAX を活用して GPU 上でハードウェアアクセラレーションによる実行を行うことで、大規模な並列化を可能にし、計算オーバーヘッドを大幅に削減します。高速かつ拡張可能で容易にカスタマイズできるフレームワークを提供することで、TABX は複雑な構造化ドメインにおける MARL エージェントの研究を容易にし、将来の研究のためのスケーラブルな基盤として機能します。コードは https://github.com/ku-dmlab/TABX から入手できます。

原文 (English)

TABX: A High-Throughput Sandbox Battle Simulator for Multi-Agent Reinforcement Learning

The design of environments plays a critical role in shaping the development and evaluation of cooperative multi-agent reinforcement learning (MARL) algorithms. While existing benchmarks highlight critical challenges, they often lack the modularity required to design custom evaluation scenarios. We introduce the Totally Accelerated Battle Simulator in JAX (TABX), a high-throughput sandbox designed for reconfigurable multi-agent tasks. TABX provides granular control over environmental parameters, permitting a systematic investigation into emergent agent behaviors and algorithmic trade-offs across a diverse spectrum of task complexities. Leveraging JAX for hardware-accelerated execution on GPUs, TABX enables massive parallelization and significantly reduces computational overhead. By providing a fast, extensible, and easily customized framework, TABX facilitates the study of MARL agents in complex structured domains and serves as a scalable foundation for future research. Our code is available at: https://github.com/ku-dmlab/TABX.

2026-05-25 13:00 JSTarXiv cs.AIエージェント

VideoTemp-o3: ビデオを使用したエージェント的思考における時間的グラウンディングとビデオ理解を調和させる

長時間ビデオの理解では、従来の均一なフレームサンプリングは重要な視覚的証拠を捉えられないことが多く、性能の低下とハルシネーションの増加につながります。これに対処するために、近年、ビデオを用いたエージェント的思考 (agentic thinking-with-videos) のパラダイムが登場しました。これは、モデルが関連するビデオセグメントを能動的に特定し、そのクリップ内で高密度のサンプリングを行ってから回答を生成する、「位置特定・クリップ・回答」のパイプラインを採用するものです。しかし、既存の手法は依然として非効率で、位置特定の精度が低く、硬直的なワークフローに縛られています。これらの問題を解決するために、ビデオグラウンディングと質問応答を同時にモデル化する、統一的なエージェント的思考フレームワークである VideoTemp-o3 を提案します。VideoTemp-o3 は強力な位置特定能力を備え、オンデマンドのクリッピングをサポートし、不正確な位置特定を修正できます。具体的には、教師あり微調整の段階で、ノイズを防ぎながら探索を促す統一的なマスキングメカニズムを設計します。強化学習については、報酬ハッキングを軽減するための専用の報酬を導入します。さらに、データの観点から、高品質な長時間ビデオのグラウンディング付き QA データを構築するための効果的なパイプラインと、さまざまなビデオ長にわたって体系的に評価するための対応するベンチマークを開発します。実験結果は、私たちの手法が長時間ビデオの理解とグラウンディングの両方で顕著な性能を達成することを示しています。

原文 (English)

VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos

In long-video understanding, conventional uniform frame sampling often fails to capture key visual evidence, leading to degraded performance and increased hallucinations. To address this, recent agentic thinking-with-videos paradigms have emerged, adopting a localize-clip-answer pipeline in which the model actively identifies relevant video segments, performs dense sampling within those clips, and then produces answers. However, existing methods remain inefficient, suffer from weak localization, and adhere to rigid workflows. To solve these issues, we propose VideoTemp-o3, a unified agentic thinking-with-videos framework that jointly models video grounding and question answering. VideoTemp-o3 exhibits strong localization capability, supports on-demand clipping, and can refine inaccurate localizations. Specifically, in the supervised fine-tuning stage, we design a unified masking mechanism that encourages exploration while preventing noise. For reinforcement learning, we introduce dedicated rewards to mitigate reward hacking. Besides, from the data perspective, we develop an effective pipeline to construct high-quality long video grounded QA data, along with a corresponding benchmark for systematic evaluation across various video durations. Experimental results demonstrate that our method achieves remarkable performance on both long video understanding and grounding.

2026-05-25 13:00 JSTarXiv cs.AILLM/生成AIエージェント

Agentic AI のためのスキル検索拡張

大規模言語モデル (LLM) がエージェント的な問題解決器へと進化するにつれて、本来のパラメトリックな能力を超えるタスクを処理するために、外部の再利用可能なスキルへの依存が高まっています。既存のエージェントシステムでは、スキルを組み込むための主な戦略は、利用可能なスキルをコンテキストウィンドウ内に明示的に列挙することです。しかし、この戦略はスケールしません。スキルコーパスが拡大するにつれて、コンテキストバジェットが急速に消費され、エージェントが適切なスキルを特定する精度が著しく低下します。そこで本論文では、エージェントが大規模な外部スキルコーパスから関連するスキルをオンデマンドで動的に検索し、組み込み、適用する新しいパラダイムであるスキル検索拡張 (SRA) を定式化します。この問題を測定可能にするために、大規模なスキルコーパスを構築し、スキル検索、スキルの組み込み、最終タスクの実行をカバーする SRA パイプライン全体を分解して評価する最初のベンチマークである SRA-Bench を導入します。SRA-Bench には、5,400 件の能力集約型テストインスタンスと、手作業で構築された 636 個のゴールドスキルが含まれています。ゴールドスキルは Web から収集されたディストラクタスキルと混合され、26,262 個のスキルからなる大規模コーパスを形成します。広範な実験により、検索ベースのスキル拡張がエージェントの性能を大幅に向上させ得ることが示され、このパラダイムの有望性が裏付けられました。同時に、スキルの組み込みにおける根本的なギャップも明らかになりました。現在の LLM エージェントは、ゴールドスキルが検索されたかどうか、またタスクが実際に外部の能力を必要とするかどうかにかかわらず、同程度の割合でスキルをロードする傾向があります。これは、スキル拡張のボトルネックが検索だけでなく、どのスキルをロードするか、いつ外部からのロードが実際に必要かを判断するベースモデルの能力にもあることを示しています。これらの知見は、SRA を独立した研究課題として位置づけ、将来のエージェントシステムにおけるスケーラブルな能力拡張の基盤を確立します。

原文 (English)

Skill Retrieval Augmentation for Agentic AI

As large language models (LLMs) evolve into agentic problem solvers, they increasingly rely on external, reusable skills to handle tasks beyond their native parametric capabilities. In existing agent systems, the dominant strategy for incorporating skills is to explicitly enumerate available skills within the context window. However, this strategy fails to scale: as skill corpora expand, context budgets are consumed rapidly, and the agent becomes markedly less accurate in identifying the right skill. To this end, this paper formulates Skill Retrieval Augmentation (SRA), a new paradigm in which agents dynamically retrieve, incorporate, and apply relevant skills from large external skill corpora on demand. To make this problem measurable, we construct a large-scale skill corpus and introduce SRA-Bench, the first benchmark for decomposed evaluation of the full SRA pipeline, covering skill retrieval, skill incorporation, and end-task execution. SRA-Bench contains 5,400 capability-intensive test instances and 636 manually constructed gold skills, which are mixed with web-collected distractor skills to form a large-scale corpus of 26,262 skills. Extensive experiments show that retrieval-based skill augmentation can substantially improve agent performance, validating the promise of the paradigm. At the same time, we uncover a fundamental gap in skill incorporation: current LLM agents tend to load skills at similar rates, regardless of whether a gold skill is retrieved or whether the task actually requires external capabilities. This shows that the bottleneck in skill augmentation lies not only in retrieval but also in the base model's ability to determine which skill to load and when external loading is actually needed. These findings position SRA as a distinct research problem and establish a foundation for the scalable augmentation of capabilities in future agent systems.
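
The contrast between enumerating every skill in the context window and retrieving a few on demand can be illustrated with a small sketch. It is not the SRA-Bench pipeline: the token-overlap scoring, the `retrieve_skills` function, and the toy corpus are assumptions made for illustration only. Returning an empty list when nothing matches loosely mirrors the abstract's point that an agent also has to decide when no external skill is needed.

```python
def tokenize(text: str) -> set:
    """Lowercase whitespace tokenization; deliberately simplistic."""
    return set(text.lower().split())


def retrieve_skills(task: str, corpus: dict, k: int = 2) -> list:
    """Return up to k skill names whose descriptions share tokens with the task.

    `corpus` maps a hypothetical skill name to a short description.
    Skills with zero overlap are never returned, so the result may be empty.
    """
    task_tokens = tokenize(task)
    scored = []
    for name, description in corpus.items():
        overlap = len(task_tokens & tokenize(description))
        if overlap > 0:
            scored.append((overlap, name))
    # Highest overlap first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Toy corpus (hypothetical skills, not taken from SRA-Bench).
SKILLS = {
    "pdf_to_text": "extract text from a pdf file",
    "csv_plot": "plot columns of a csv file",
    "send_email": "send an email message",
    "resize_image": "resize an image",
}

selected = retrieve_skills("convert a pdf file to text", SKILLS, k=1)
```

Only the names in `selected` would be placed in the agent's context, so the context cost stays bounded by `k` rather than growing with the corpus. A realistic system would likely use embedding-based retrieval rather than token overlap.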

2026-05-25 13:00 JSTarXiv cs.AIエージェント

SUDP: エージェントシステム用の秘密使用委任プロトコル

エージェントシステムは、API、メッセージングプラットフォーム、クラウドサービスのユーザーシークレットを使って動作することが増えています。今日のエージェントランタイムは通常、「露出による認可」を実装しています。つまり、アクションを可能にするには、再利用可能なシークレット、またはそこから派生した再利用可能なアーティファクトをランタイム内に置くことになる場合が多く、その結果、一時的なプロンプトインジェクションやツール側の侵害が永続的なアカウント侵害につながります。既存の防御策は、シークレットストレージ、スコープ付き委任、送信者制約付きトークン、実行時監視といった隣接する部分をカバーしていますが、それらを組み合わせたエージェント固有の要件には共通の仕様がありません。その要件とは、信頼できない自律的なリクエスターが、再利用可能な権限を得ることなく、ユーザーが認可したシークレットに裏付けられた操作を実行させられることです。私たちはこれをエージェント秘密使用 (ASU) 問題として形式化し、認可の完全性と秘密の機密性にまたがる、あらゆる解決策が満たすべき 7 つのセキュリティ特性を特定します。そのうえで、秘密使用委任プロトコル (SUDP) を提案します。このプロトコルでは、リクエスターが正規化された操作を提案し、ユーザーが認証器に裏付けられた新規のグラントでそれを認可し、カストディアンがそのグラントを引き換えて範囲の限定された使用を実行します。再利用可能な権限がリクエスターの境界を越えることはありません。私たちは SUDP を LLM 駆動のエージェント向けに特化します。この設定では、ツール呼び出しがユーザー登録済みの権限保持マテリアルを行使する場合に常に SUDP が適用されます。標準的な暗号学的仮定の下で、SUDP はハードウェアをルートとするランタイムと統合された場合に 7 つの特性をすべて満たします。リファレンス実装は https://github.com/xhyumiracle/sudp で入手できます。

原文 (English)

SUDP: Secret-Use Delegation Protocol for Agentic Systems

Agentic systems increasingly act with user secrets for APIs, messaging platforms, and cloud services. Today's agent runtimes typically implement authorization by exposure: enabling action often means placing a reusable secret, or a reusable artifact derived from it, inside the runtime, so a transient prompt-injection or tool-side compromise becomes durable account compromise. Existing defenses cover adjacent pieces such as secret storage, scoped delegation, sender-constrained tokens, and runtime monitoring, but leave the combined agentic obligation without a common specification: an untrusted autonomous requester should be able to cause a user-authorized secret-backed operation without gaining reusable authority over it. We formalize this as the Agent Secret Use (ASU) problem and identify seven security properties any solution must satisfy, spanning authorization integrity and secret confidentiality. We propose the Secret-Use Delegation Protocol (SUDP), in which a requester proposes a canonical operation, the user authorizes it with a fresh authenticator-backed grant, and a custodian redeems the grant to perform the bounded use; reusable authority never crosses the requester boundary. We specialize SUDP for LLM-driven agents, where it applies whenever a tool call would exercise user-enrolled authority-bearing material. Under standard cryptographic assumptions, SUDP satisfies all seven properties when integrated with a hardware-rooted runtime. A reference implementation is available at https://github.com/xhyumiracle/sudp.
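
The propose / authorize / redeem flow described above can be sketched as a toy model. This is not the SUDP specification or its reference implementation. The class names, the JSON canonicalization, and the shared-key HMAC standing in for an authenticator-backed grant are all assumptions chosen to keep the example self-contained.

```python
import hashlib
import hmac
import json
import secrets


def canonical(operation: dict) -> bytes:
    # Assumed canonical form: sorted keys, compact separators.
    return json.dumps(operation, sort_keys=True, separators=(",", ":")).encode()


class User:
    """Stands in for the user's authenticator (shares a key with the custodian)."""

    def __init__(self, auth_key: bytes):
        self._auth_key = auth_key

    def authorize(self, operation: dict) -> dict:
        nonce = secrets.token_hex(16)  # fresh for every grant
        message = nonce.encode() + canonical(operation)
        tag = hmac.new(self._auth_key, message, hashlib.sha256).hexdigest()
        return {"nonce": nonce, "tag": tag}


class Custodian:
    """Holds the reusable secret; the requester only ever sees grants and results."""

    def __init__(self, auth_key: bytes, api_secret: str):
        self._auth_key = auth_key
        self._api_secret = api_secret  # never returned to callers
        self._used_nonces = set()

    def redeem(self, operation: dict, grant: dict) -> str:
        message = grant["nonce"].encode() + canonical(operation)
        expected = hmac.new(self._auth_key, message, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, grant["tag"]):
            raise PermissionError("grant does not match this operation")
        if grant["nonce"] in self._used_nonces:
            raise PermissionError("grant already redeemed")
        self._used_nonces.add(grant["nonce"])
        # A real custodian would call the service with self._api_secret here.
        return f"performed {operation['action']} on {operation['target']}"
```

In this sketch the requester (the agent) would pass `operation` to the user, receive the grant, and hand both to the custodian. A stolen grant is bound to one canonical operation and can be redeemed once, which is the "bounded use" idea in miniature.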

2026-05-25 13:00 JSTarXiv cs.AILLM/生成AIエージェント

SafeHarbor: LLM エージェントの安全のための階層型メモリ拡張ガードレール

基盤モデルの最近の進歩により、LLM は受動的な会話システムから、推論とツールの実行が可能な自律エージェントへと変貌しました。これらの能力は大きな実用的価値をもたらす一方で、敵対者がエージェントを操作して現実世界の環境で有害なアクションを実行させる可能性があるため、新たなセキュリティリスクも生みます。既存の防御戦略はそのような脅威を軽減しますが、安全性と有用性のバランスをとることに苦労することが多く、無害なユーザー要求まで過剰に拒否してしまいます。このトレードオフを軽減するために、LLM エージェントの正確な決定境界を確立するよう設計された新しいフレームワークである SafeHarbor を提案します。静的なガイドラインとは異なり、SafeHarbor は強化された敵対的生成を通じてコンテキストを考慮した防御ルールを抽出します。私たちは動的なルール注入のためのローカル階層メモリシステムを設計し、トレーニング不要で効率的なプラグアンドプレイのソリューションを提供します。さらに、ノードの動的な分割と結合を通じてメモリ構造を継続的に最適化する、情報エントロピーに基づく自己進化メカニズムを導入します。広範な実験により、SafeHarbor があいまいな良性タスクと明示的な悪意のある攻撃の両方で最先端の性能を達成することが実証されました。特に GPT-4o では、有害なリクエストに対して 93% を超える堅牢な拒否率を維持しながら、良性タスクの有用性で最大 63.6% を達成しています。ソースコードは https://github.com/ljj-cyber/SafeHarbor で公開されています。

原文 (English)

SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety

Recent advances in foundation models have transformed LLMs from passive conversational systems into autonomous agents capable of reasoning and tool execution. While these capabilities unlock substantial practical value, they also introduce new security risks, as adversaries can manipulate agents into performing harmful actions in real-world environments. Existing defense strategies mitigate such threats but frequently struggle to balance safety and utility, resulting in over-refusal of benign user requests. To mitigate this trade-off, we propose SafeHarbor, a novel framework designed to establish precise decision boundaries for LLM agents. Unlike static guidelines, SafeHarbor extracts context-aware defense rules through enhanced adversarial generation. We design a local hierarchical memory system for dynamic rule injection, offering a training-free, efficient, and plug-and-play solution. Furthermore, we introduce an information entropy-based self-evolution mechanism that continuously optimizes the memory structure through dynamic node splitting and merging. Extensive experiments demonstrate that SafeHarbor achieves state-of-the-art performance on both ambiguous benign tasks and explicit malicious attacks, notably attaining a peak benign utility of 63.6% on GPT-4o while maintaining a robust refusal rate exceeding 93% against harmful requests. The source code is publicly available at https://github.com/ljj-cyber/SafeHarbor.

2026-05-25 13:00 JSTarXiv cs.AILLM/生成AIエージェント研究/論文

回路設計のラストマイルの橋渡し: PostEDA-Bench、PPA コンバージェンスと DRC 修正の階層ベンチマーク

LLM ベースのエージェントは、電子設計自動化 (EDA) の「ラストマイル」、すなわちツール実行後に残るサインオフ段階のデザインルールチェック (DRC) 違反の修復と、電力・性能・面積 (PPA) 目標の収束に適用されることが増えています。しかし、既存の EDA-LLM ベンチマークは DRC 修正を完全に省いており、単一のツールチェーンに結び付いたフラットな階層に依存しています。私たちは PostEDA-Bench を紹介します。これは DRC-Essential、DRC-Reasoning、PPA-Mono、PPA-Multi にわたる 145 のタスクを含む階層型ベンチマークであり、機械的に検証可能な評価を備えた EDA ツールチェーンによって支えられています。複数のエージェントスキャフォールドの下で 8 つの商用およびオープンソース LLM を評価したところ、エージェントは合成的な DRC-Essential と単一目的の PPA-Mono をそれなりにうまく処理する一方、より実践的な DRC-Reasoning (最高成功率 36.66%) と PPA-Multi (最高成功率 20.00%) では性能が急激に低下することがわかりました。また、視覚情報による拡張は DRC-Bench の性能を一貫して向上させます。さらに、PPA-Multi の主要なボトルネックは、ノブに関する知識ではなくトレードオフの推論です。

原文 (English)

Bridging the Last Mile of Circuit Design: PostEDA-Bench, a Hierarchical Benchmark for PPA Convergence and DRC Fixing

LLM-based agents are increasingly applied to the "last mile" of Electronic Design Automation (EDA): repairing residual sign-off Design Rule Check (DRC) violations and converging Power-Performance-Area (PPA) targets after tool runs. Existing EDA-LLM benchmarks, however, omit DRC fixing entirely and rely on flat hierarchies tied to a single toolchain. We introduce PostEDA-Bench, a hierarchical benchmark with 145 tasks across DRC-Essential, DRC-Reasoning, PPA-Mono, and PPA-Multi, supported by EDA toolchains with machine-checkable evaluation. Across eight commercial and open-source LLMs under multiple agent scaffolds, we find that agents handle synthetic DRC-Essential and single-objective PPA-Mono reasonably well but degrade sharply on the more practical DRC-Reasoning, where the best success rate is 36.66%, and PPA-Multi, where the best success rate is 20.00%; vision augmentation consistently enhances DRC-Bench; and trade-off reasoning, rather than knob knowledge, is the dominant PPA-Multi bottleneck.

2026-05-25 13:00 JSTarXiv cs.AILLM/生成AIエージェントビジネス/資金調達

LLM エージェントツール呼び出しトラフィックにおけるコンテンツ認識型攻撃の検出: 機能、アーキテクチャ、および評価プロトコルの実証的研究

モデルコンテキストプロトコル (MCP) は、LLM エージェントが外部ツールを呼び出すためのインターフェイスとして広く採用されていますが、MCP ツール呼び出しトラフィックの学習ベースの監視はまだ十分に研究されていません。本稿では、MCP ツール呼び出しトラフィック向けの攻撃検出フレームワークを提示します。提案する検出器は、各エージェントセッションをグラフ (ツール呼び出しをノード、順序リンクとデータフローリンクをエッジ) としてエンコードし、引数と応答の文埋め込み特徴量でノードを拡充したうえで、セッションを良性か攻撃かに分類します。3 つの GNN アーキテクチャ (GAT、GCN、GraphSAGE)、グラフを使わない MLP、古典的なベースライン (XGBoost、ランダムフォレスト、ロジスティック回帰、線形 SVM) を評価します。完全なアーキテクチャ比較は RAS-Eval (タスク層化分割) で実施し、ATBench と結合ソースのバリアント (いずれもラベル層化) では GraphSAGE を GNN ベースラインとして残します。3 つの知見が得られました。第 1 に、コンテンツレベルの特徴量が不可欠です。メタデータのみの検出はアーキテクチャに関係なく AUROC 0.64 付近で頭打ちになりますが、コンテンツ埋め込みを使うと AUROC は 0.89 を超えます。第 2 に、単純なランダム分割による評価は、タスクが重複しない分割と比べて AUROC を最大 26 パーセントポイント過大に見積もります。これは、従来のエージェント検出研究では対処されてこなかった、記憶に起因する交絡です。第 3 に、検出シグナルは主に SBERT のコンテンツ埋め込みに存在します。プールされた埋め込みに対するツリーアンサンブルは AUROC 0.975 に達し、主要な RAS-Eval 設定では GNN (0.917) や MLP (0.896) を含むニューラルアーキテクチャをおおむね上回りました。また、自己教師あり事前学習は、このタスクではラベル効率の利点をもたらしません。

原文 (English)

Content-Aware Attack Detection in LLM Agent Tool-Call Traffic: An Empirical Study of Features, Architectures, and Evaluation Protocols

The Model Context Protocol (MCP) has become a widely adopted interface for LLM agents to invoke external tools, yet learned monitoring of MCP tool-call traffic remains underexplored. In this article, the proposed detector is presented as an attack detection framework for MCP tool-call traffic that encodes each agent session as a graph (tool calls as nodes, sequential and data-flow links as edges), enriches nodes with sentence-embedding features over arguments and responses, and classifies sessions as benign or attacked. Three GNN architectures (GAT, GCN, GraphSAGE), a no-graph MLP, and classical baselines (XGBoost, random forest, logistic regression, linear SVM) are evaluated, with the full architecture comparison conducted on RAS-Eval (task-stratified splits) and GraphSAGE retained as the GNN baseline on ATBench and a combined-source variant (both label-stratified). Three findings emerge. First, content-level features are essential: metadata-only detection plateaus around an AUROC of 0.64 regardless of architecture, while content embeddings push the AUROC above 0.89. Second, naive random-split evaluation inflates AUROC by up to 26 percentage points relative to task-disjoint splits, a memorization confound that prior agent-detection work has not addressed. Third, the detection signal resides primarily in the SBERT content embeddings: an AUROC of 0.975 was reached by tree ensembles on pooled embeddings, performing, for the most part, better than the neural architectures in the primary RAS-Eval setting including GNNs (0.917) and the MLP (0.896), and self-supervised pre-training does not deliver a label-efficiency advantage on this task.
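
The second finding turns on how sessions are split into train and test sets. The sketch below contrasts a naive random split with a task-disjoint split. It assumes each session is a dict with a `"task"` field. The function names and data layout are illustrative and are not taken from the paper's code or from RAS-Eval.

```python
import random


def random_split(sessions, test_fraction=0.25, seed=0):
    """Shuffle sessions individually; sessions of one task can land on both sides."""
    rng = random.Random(seed)
    shuffled = sessions[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def task_disjoint_split(sessions, test_fraction=0.25, seed=0):
    """Hold out whole tasks, so no task appears in both train and test."""
    tasks = sorted({s["task"] for s in sessions})
    rng = random.Random(seed)
    rng.shuffle(tasks)
    n_test = max(1, int(len(tasks) * test_fraction))
    test_tasks = set(tasks[:n_test])
    train = [s for s in sessions if s["task"] not in test_tasks]
    test = [s for s in sessions if s["task"] in test_tasks]
    return train, test


def shared_tasks(train, test):
    """Tasks present on both sides; a non-empty result signals possible memorization."""
    return {s["task"] for s in train} & {s["task"] for s in test}


# Synthetic example: 8 hypothetical tasks with 5 sessions each.
sessions = [{"task": f"task-{t}", "session": i} for t in range(8) for i in range(5)]
train, test = task_disjoint_split(sessions)
```

Calling `shared_tasks` on the output of `random_split` will typically return several tasks, whereas the task-disjoint split returns an empty set by construction. A detector evaluated on the former can score well partly by recognizing tasks it has already seen.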

2026-05-25 13:00 JSTarXiv cs.AIエージェント

信頼性があり説明可能な知覚モデル向け AI を目指して: コンセプトからプロトタイプ車両の配備まで

ディープニューラルネットワークは自動運転の知覚における主要なソリューションとなっていますが、その不透明さは新たに登場しつつある Trustworthy AI ガイドラインと相容れず、安全性の保証、デバッグ、人間による監視を複雑にしています。安全で説明可能な AI (XAI) の理論的フレームワークは存在しますが、3D シーン理解のための Trustworthy AI の具体的な実装は依然として不足しています。私たちは、高い堅牢性を備え、忠実な説明可能性と較正された不確実性推定を統合する Trustworthy AI 知覚モジュールを提案することで、このギャップに対処します。Transformer ベースの検出器を土台として、推論時のアテンションメカニズムから説明を導き出し、摂動ベースの一貫性テストを用いてその忠実性を検証します。さらに、不確実性の推定および較正モジュールを統合し、堅牢性を高めるトレーニング手法を適用します。実験では、忠実な顕著性の挙動、堅牢性の向上、適切に較正された不確実性推定が示されました。最後に、これらの Trustworthy AI 要素をプロトタイプ車両に導入し、ドキュメント成果物、モデルの不確実性状態、顕著性マップを可視化する XAI インターフェイスを提供することで、信頼できる知覚モニタリングがリアルタイムで実現可能であることを実証します。補足資料は https://tillbeemelmanns.github.io/trustworthy_ai/ で入手できます。

原文 (English)

Towards Trustworthy and Explainable AI for Perception Models: From Concept to Prototype Vehicle Deployment

Deep Neural Networks have become the dominant solution for Autonomous Driving perception, but their opacity conflicts with emerging Trustworthy AI guidelines and complicates safety assurance, debugging, and human oversight. While theoretical frameworks for safe and Explainable AI (XAI) exist, concrete implementations of Trustworthy AI for 3D scene understanding remain scarce. We address this gap by proposing a Trustworthy AI perception module that is remarkably robust, integrates faithful explainability, and calibrated uncertainty estimates. Building on a transformer-based detector, we derive explanation from the attention mechanism at inference time and validate their faithfulness using perturbation-based consistency tests. We further integrate an uncertainty estimation and calibration module, and apply robustness-enhancing training methods. Experiments show faithful saliency behavior, improved robustness, and well-calibrated uncertainty estimates. Finally, we deploy these Trustworthy AI elements in a prototype vehicle and provide an XAI Interface that visualizes documentation artifacts, model uncertainty state, and saliency maps, demonstrating the feasibility of trustworthy perception monitoring in real time. Supplementary materials are available at https://tillbeemelmanns.github.io/trustworthy_ai/ .

2026-05-25 13:00 JSTarXiv cs.AILLM/生成AIエージェント

S-Bus: マルチエージェント LLM 状態調整のための自動読み取りセット再構築

私たちは、HTTP 経由で可変状態を共有する LLM エージェントの同時実行制御に取り組みます。この設定では、エージェント側を改修して読み取りセットを宣言させることはできません。S-Bus は HTTP ミドルウェアであり、その中心的なメカニズムであるサーバー側の DeliveryLog が、観測された HTTP GET トラフィックから、コミット時に各エージェントの読み取りセットを再構築します。これが提供する一貫性特性である Observable-Read Isolation (ORI) は、HTTP で観測可能な読み取り射影に対する部分的な因果一貫性であり、専用シャードトポロジにおける構造的競合状態 (Structural Race Conditions) を防ぎます。貢献は 3 つあります。(C1) 3 層の機械化された証拠を備えた DeliveryLog メカニズム: TLAPS は ReadSetSoundness と ORICommitSafety を (1 つの型付け公理を前提として) 証明し、N=3 での網羅的な TLC は 20,763,484 個の状態を探索して違反がゼロであり、Dafny は 9 つの帰納的補題を証明します。(C2) PostgreSQL 17 SERIALIZABLE および Redis 7 WATCH/MULTI と同等の安全性を実験的に確認: 884,110 回のコミット試行 (うち 427,308 回はアクティブな競合下) で Type-I 破損はゼロでした。(C3) ORI は専用シャードのワークロードでは意味的に中立ですが、単一シャードでの共同執筆では、保存によって同時に生じた矛盾が伝播するため有害です。v2 での更新: PH-3 LLM ジャッジは、400 個の (ステップ、シャード) ペアについて人間のアノテーター (Zahid Hussain、Mindgigs Peshawar) と照らして独立に検証され、厳密な kappa=0.93 (n=93、生の一致率 96.8%) でした。LLM ジャッジ間の一致は kappa=0.46 (境界分散) です。エージェントの自己報告は、シャードの使用を 32% (LLM ジャッジ基準) から 49% (人間のアノテーター基準) 過大に申告しています。SJ-v4 の意味品質ルーブリックは、引き続き単一の LLM ジャッジのみによるものです。ソースコード、形式的証明、ハーネス、アノテーションデータ: https://github.com/sajjadanwar0/sbus

原文 (English)

S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination

We address concurrency control for LLM agents sharing mutable state over HTTP, where agents cannot be modified to declare read sets. S-Bus is an HTTP middleware whose central mechanism, a server-side DeliveryLog, reconstructs each agent's read set at commit time from observed HTTP GET traffic. The consistency property it provides -- Observable-Read Isolation (ORI), a partial causal consistency over the HTTP-observable read projection -- prevents Structural Race Conditions in dedicated-shard topologies. Three contributions. (C1) DeliveryLog mechanism with three-tier mechanised evidence: TLAPS proves ReadSetSoundness and ORICommitSafety (modulo one typing axiom); exhaustive TLC at N=3 explores 20,763,484 states with zero violations; Dafny discharges 9 inductive lemmas. (C2) Empirical safety parity against PostgreSQL 17 SERIALIZABLE and Redis 7 WATCH/MULTI: zero Type-I corruptions across 884,110 commit attempts (427,308 under active contention). (C3) ORI is semantically neutral in dedicated-shard workloads but harmful in single-shard collaborative writing because preservation propagates concurrent contradictions. v2 update: the PH-3 LLM judge is now independently validated against a human annotator (Zahid Hussain, Mindgigs Peshawar) on 400 (step, shard) pairs at strict kappa=0.93 (n=93, 96.8% raw agreement). Inter-LLM-judge agreement is kappa=0.46 (boundary variance). Agent self-reports over-claim shard usage by 32% (LLM judge) to 49% (human annotator). The SJ-v4 semantic-quality rubric remains single-judge LLM-only. Source code, formal proofs, harness, annotation data: https://github.com/sajjadanwar0/sbus
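
The idea of reconstructing a read set on the server, instead of asking agents to declare it, can be sketched with a toy in-memory store. This is an assumption-laden illustration using plain optimistic version validation. It is not the S-Bus implementation and does not capture the formal ORI definition, and the class and method names are hypothetical.

```python
from collections import defaultdict


class DeliveryLogStore:
    """Toy store that logs what each agent was served and checks it at commit."""

    def __init__(self):
        self._values = {}                    # key -> current value
        self._versions = defaultdict(int)    # key -> version counter
        self._delivered = defaultdict(dict)  # agent -> {key: version delivered}

    def get(self, agent: str, key: str):
        # Stands in for an observed HTTP GET: record the version handed out.
        self._delivered[agent][key] = self._versions[key]
        return self._values.get(key)

    def commit(self, agent: str, key: str, value) -> bool:
        # The read set comes from the log; the agent itself declares nothing.
        read_set = self._delivered[agent]
        stale = [k for k, seen in read_set.items() if self._versions[k] != seen]
        self._delivered[agent] = {}  # each commit attempt starts a fresh read set
        if stale:
            return False  # something the agent read has changed since delivery
        self._values[key] = value
        self._versions[key] += 1
        return True
```

If two agents read the same key and one of them commits a change to it, the other agent's next commit is rejected until it reads again. In this sketch that is how a write based on a stale view is kept out of the shared state.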

2026-05-25 13:00 JSTarXiv cs.AILLM/生成AIエージェントビジネス/資金調達研究/論文

TwinRouterBench: 現実的なエージェント LLM ルーティングのための高速静的およびライブ動的評価

LLM ルーティングは、コーディングエージェント、ディープリサーチシステム、コンピュータ操作エージェントなど、単一のユーザーリクエストが多数のモデル呼び出しを引き起こす長期的なアプリケーションで最も重要になります。各呼び出しを十分な性能を持つ最も安価なモデルにルーティングすれば、品質を犠牲にせずにコストを削減できますが、既存のルーターベンチマークはワンショットのプロンプトでしかルーターを評価しません。これらのベンチマークは、エージェントの中間ステップでルーターから見えるプレフィックスを提示せず、より安価なモデルへの置き換えが下流のタスク成功を維持するかどうかをテストせず、評価時にオンラインの LLM ジャッジに依存することも少なくありません。私たちは、2 つのトラックを備えたステップレベルのルーティングベンチマークである TwinRouterBench を紹介します。静的トラックは、SWE-bench、BFCL、mtRAG、QMSum、PinchBench にわたる 520 のインスタンスから得た 970 個のルーター可視プレフィックスを提供します。各プレフィックスには、公開されたダウングレード・カスケードプロトコルの下で推定された、実行によって検証済みのターゲット層が対応付けられています。スコアリングは、層ラベル、軌跡への所属、トークンコストに対する決定論的な算術演算であり、評価側でオンラインの LLM ジャッジを使いません。動的トラックは、500 ケースからなる SWE-bench Verified スイート全体でルーターを実行するハーネスを提供します。本論文では、静的トラックの SWE 教師データ分割と重複しない 100 ケースのホールドアウト評価を報告します。各 LLM 呼び出しで、ルーターは固定されたプールから具体的なモデルを選択し、成功は公式のタスク解決と実際に発生した API 支出によって測定されます。2 つのトラックにより、高速なオフラインでの反復と、その後のライブエージェント実行下でのエンドツーエンド検証が可能になります。コードとデータは https://github.com/CommonstackAI/TwinRouterBench で入手できます。

原文 (English)

TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing

LLM routing matters most in long-horizon applications such as coding agents, deep research systems, and computer-use agents, where a single user request triggers many model calls. Routing each call to the cheapest sufficient model can cut costs without sacrificing quality, yet existing router benchmarks evaluate routers only on one-shot prompts. They never expose the router-visible prefix at an intermediate agent step, never test whether a cheaper replacement preserves downstream task success, and often rely on online LLM judges at evaluation time. We introduce TwinRouterBench, a step-level routing benchmark with two tracks. The static track provides 970 router-visible prefixes from 520 instances across SWE-bench, BFCL, mtRAG, QMSum, and PinchBench, each paired with an execution-verified target tier estimated under a released downgrade-and-cascade protocol; scoring is deterministic arithmetic over tier labels, trajectory membership, and token costs, with no online evaluator-side LLM judge. The dynamic track supplies a harness that runs routers on the full 500-case SWE-bench Verified suite; in this paper we report a 100-case held-out evaluation disjoint from the static SWE supervision split. At each LLM call the router selects a concrete model from a locked pool, and success is measured by official task resolution and realized API spend. The two tracks support fast offline iteration followed by end-to-end validation under live agent execution. Code and data are available at https://github.com/CommonstackAI/TwinRouterBench.

2026-05-24 08:00 JSTITmedia AI+LLM/生成AIエージェント

「Gemini」「Claude Code」「Codex」　全社展開・本番実装に役立つ5つのポイント

開発現場における生成AIの利用は常態化しつつあるものの、「コード補完の域を出ない」「特定の個人のスキルに依存している」といった課題があります。本稿では、個人の生産性向上からチームへの定着、全社規模での展開、そしてAIエージェントの本番実装に至るまで、開発プロセス変革に役立つ5つ…

2026-05-24 07:00 JSTITmedia AI+LLM/生成AIエージェント

ソースネクストのAI議事録、Microsoft 365 Copilot連携で検索、要約を効率化

ソースネクストは、AI議事録サービス「AutoMemo」の新機能として、「AutoMemo Copilot エージェント」の提供を始めた。Microsoft 365 Copilotから過去の会議データを検索、要約、抽出できるようにし、議事録作成や報告業務の効率化を支援する。