Claude

Model articles: 97 · 2026-05-24 to 2026-05-29

May 2026

2026-05-29 13:00 JST · arXiv cs.AI · Research/Papers

PassNet: Scaling Large Language Models for Graph Compiler Pass Generation

Modern tensor compilers such as TorchInductor deliver substantial speedups on mainstream models, yet face a systematic performance ceiling on long-tail workloads -- our profiling shows that 43% of real-world subgraphs experience end-to-end slowdowns under default compilation. While LLMs offer a path toward automated optimization, existing efforts focus on standalone kernel generation. We argue that pass generation -- where LLMs author structured graph transformations that integrate directly into compiler pipelines -- is the more appropriate abstraction. We propose PassNet, the first large-scale ecosystem for LLM-based compiler pass generation, comprising: (1) PassNet-Dataset, over 18K unique computational graphs from 100K real-world models; and (2) PassBench, 200 curated long-tail fusible tasks (comprising 2,060 subgraphs in total) evaluated under the Error-aware Speedup Score (ES_t) -- a metric unifying correctness, stability, and performance -- with layered integrity defenses against systematic LLM exploitation. Experiments reveal that PassBench is both highly discriminative and genuinely unsaturated: the best frontier model trails TorchInductor by 37% in aggregate, yet on individual subgraphs LLMs achieve up to 3x speedup over the same compiler -- indicating that the bottleneck is consistency, not capability. Fine-tuning a small model on merely ~4K PassNet trajectories yields a 2.67x improvement approaching frontier-model performance, demonstrating substantial headroom and validating PassNet as live training infrastructure for advancing LLM-driven compiler optimization. All data, benchmarks, and tooling are publicly available.

2026-05-29 13:00 JST · arXiv cs.AI · LLM/Generative AI

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

We demonstrate that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a production-scale language model, addressing the open question of whether dictionary learning methods scale beyond small transformers. We trained sparse autoencoders with up to 34 million features on the model's middle layer residual stream, using scaling laws to guide hyperparameter selection. The resulting features are multilingual and multimodal (generalizing to images despite text-only training), respond to both concrete instances and abstract discussions of concepts, and can be used to steer model behavior in ways consistent with their interpretations. We find features corresponding to famous entities and locations, as well as more abstract concepts like sarcasm or errors in code. We also identify features relevant to ways in which language models might cause harm--including features representing deception, power-seeking, sycophancy, and bias--and show that these causally influence model outputs when manipulated. Additionally, we conduct analyses of feature interpretability, geometry, and computational function. However, significant limitations remain: our suite of features is incomplete, and we lack rigorous methods for evaluating whether our features faithfully capture model computations.

2026-05-29 13:00 JST · arXiv cs.AI · Research/Papers · GPT / ChatGPT · Gemma · Qwen

Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark

We benchmark three supervised fine-tuned models against frontier zero-shot baselines on a 661-row held-out slice of PiSAR (Persona, intent, Screen, Action, Rationale), a 12,929-tuple corpus of screen-anchored behavioural rationales curated from public app-store reviews, Pew American Trends Panel demographics, and the OPeRA shopper traces. Every model, frontier or fine-tuned, is evaluated on the same 661-row slice with the same scoring pipeline. Two findings. First, frontier zero-shot baselines (Claude Opus 4.7 and GPT-5.5) reach sem_sim 0.459 and 0.482 respectively; a fine-tuned Qwen3-VL-8B-Instruct reaches 0.783 and clears sem_sim >= 0.7 on 79% of rows, against 1-2% for either frontier baseline, a gap of 0.30 absolute on the same test set. Second, the same training data and recipe on Gemma-4-26B-A4B-IT scores only 0.441, in the same band as the frontier zero-shot baselines rather than the fine-tuned Qwen. We read this as a recipe-vs-model mismatch: the reasoning-tuned high-parameter model resists displacement and would likely need either more data or a stronger fine-tuning method.

2026-05-29 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents · Business/Funding · GPT / ChatGPT

Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence

The impressive performance of generalist large language models (LLMs) such as GPT and Claude in healthcare raises a critical question: will domain-specific medical specialist models become obsolete? We argue that the future of medical artificial intelligence (AI) lies not in building monolithic medical foundation models, nor in replacing human expertise, but in orchestrating collaboration among generalist LLMs, domain-specific specialist models, and clinicians. We propose HetMedAgent, a heterogeneous medical multi-agent framework that enables conflict-aware evidence fusion, uncertainty-based clinician intervention triggering, and adaptive threshold calibration. Experiments on three real-world clinical decision-making tasks demonstrate that the synergy between generalist LLMs and domain-specific specialist models significantly outperforms using either type of model alone, validating the irreplaceable value of specialist models in modality-specific analysis. HetMedAgent represents a shift from building medical LLMs or foundation models to multi-agent collaboration, achieving a balance between general reasoning capabilities and domain-specific precision.

2026-05-29 13:00 JST · arXiv cs.AI · LLM/Generative AI · GPT / ChatGPT

PokerSkill: LLMs Can Play Expert-Level Poker without Training or Solvers

Poker is a landmark challenge for artificial intelligence. The dominant approach relies on equilibrium solvers built on counterfactual regret minimization, requiring millions of core-hours of training. Large Language Models (LLMs) possess extensive poker knowledge but perform far below solver-based agents when asked to play directly. Traditional rule-based poker agents are interpretable and training-free, but their strategic ceiling remains far below equilibrium play. We introduce PokerSkill, a training-free and solver-free framework that bridges this gap by using detailed rule-based poker skills as a structured action-grounding interface for LLMs. A deterministic context engine analyzes the current state and retrieves only the relevant fragments from a layered skill library, which is entirely designed by human poker experts, constraining the LLM's choice to reasonable actions. Against GTOWizard, a state-of-the-art GTO benchmark, GPT-5.5 XHigh with PokerSkill achieves -57 ± 21 mbb/hand, Claude Opus 4.6 achieves -80 ± 29 mbb/hand and Claude Opus 4.7 achieves -87 ± 64 mbb/hand, reducing losses by 49-61% compared to default-prompt baselines and outperforming the strong bot Slumbot. Our key finding is that rule-based skills alone do not constitute a strong strategy, and LLMs alone cannot play well, but their combination yields an agent that requires neither training nor solver access yet competes with systems built on millions of core-hours of computation. To our knowledge, this is the first demonstration of an LLM achieving competitive performance in a complex imperfect-information game without game-specific training or solver queries. Code is available at https://github.com/lbn187/PokerSkill.

2026-05-29 13:00 JST · arXiv cs.AI · LLM/Generative AI · Anthropic · OpenAI

Persona Conditioning of Brand Recommendations in Retrieval-Augmented Commercial Chat: A Prominence-Stratified Cross-Provider Audit

The same prompt -- "best CRM software" -- reaches AI assistants from buyers in widely different contexts: a solo founder, an enterprise VP, a UK SMB owner. We audit how strongly that contextual variation reshapes which brands the model recommends. The audit samples 2,000 runs over a design space of 10 personas x 8 prompts x 3 model configurations x N=10 reps, with the two OpenAI cells at full 8-prompt coverage and the Anthropic sonnet-4.6 / low cell at 4-prompt coverage. Prefixing the user message with a persona drops the recommendation-set similarity (Jaccard) by Delta = -0.12 to -0.20 relative to a same-persona baseline (clustered 95% CIs exclude zero on all three measured cells; the sonnet cell's CI rests on only 4 prompt clusters and is correspondingly wider). The effect is sharply prominence-stratified: category leaders are persona-resistant (~80% same-brand consistency across personas), but mid-market brands swap up to 75% of the recommendation set as the persona changes. The Anthropic model shows a larger point-estimate effect than the OpenAI configurations, though clustered CIs overlap for the closer contrast (sonnet vs. OpenAI/high); the asymmetry is consistent with Anthropic's more retrieval-unattributed generation route (43-52% recommendations without observed retrieval-layer evidence, vs OpenAI's 8-29%, documented in Jack 2026). Any measurement of AI brand perception must condition on the buyer persona supplying the query: the same prompt produces materially different recommendation sets depending on who the model thinks is asking, and a measurement protocol that aggregates across personas systematically obscures that variation. The effect concentrates at mid-market and is largest on the most priors-reliant generation route in our audit, consistent with persona responsiveness growing as models lean more on training-data priors and richer context integration.
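
The recommendation-set similarity mentioned in the abstract above is the Jaccard index, |A ∩ B| / |A ∪ B|. A minimal sketch of that computation follows; the brand names are hypothetical placeholders and not data from the paper, and treating two empty sets as identical is a convention assumed here.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty recommendation sets count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical recommendation sets, for illustration only.
baseline = ["BrandA", "BrandB", "BrandC", "BrandD"]
with_persona = ["BrandA", "BrandB", "BrandE", "BrandF"]

# Two shared brands out of six distinct brands overall.
similarity = jaccard(baseline, with_persona)
```

A persona effect like the one reported would show up as the change in this score between a persona-prefixed run and its baseline.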

2026-05-29 13:00 JST · arXiv cs.AI · Agents · Research/Papers

Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software

Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist supervising an AI coding agent (Claude Code, Sonnet and Opus models) over 12 work days and 57 sessions to build CLAX-PT, a differentiable one-loop perturbation theory module in JAX. We documented and classified 15 supervision events by intervention level. The agent resolved ten autonomously by iterating against oracle tests. Two more by the physicist's domain knowledge. The three it could not -- all evaded oracle detection -- share a common property: the agent treated symptom reduction as root-cause resolution. It spent 33 of the 57 sessions adjusting coefficients within a code architecture that could not represent the target physics, and could not re-evaluate its CLASS-PT branch choice even when prompted to reconsider; only an injected physics concept (anisotropic BAO damping) triggered the redesign. Separately, the agent committed a calibrated correction that passed all oracle tests but corresponded to no quantity in the theory, predicting wrong values at any other cosmology. The fudge factor was caught and replaced within the same session. Three supervision practices proved critical for catching what oracle tests missed: testing at diverse parameter points beyond the fiducial calibration; shared changelogs that surfaced stalled exploration across sessions; and an explicit rule against unphysical numerical patches. In this case, supervision design, not model capability, determined whether the agent's output was trustworthy. Closing the gap would require agents that propose architectural alternatives rather than optimize within a given structure, and distinguish predictive adequacy from explanatory correctness -- capabilities not exhibited here, not obviously addressed by scaling alone. [Abridged.]

2026-05-29 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents · Research/Papers · OpenAI · GPT / ChatGPT

LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis

CI failure logs are large (median 5k lines, max 200k in this corpus) and noisy. Coding agents that try to debug them depend on an upstream tool to reduce the log to a manageable context, but the field has had no public empirical comparison of which reductions preserve enough evidence for downstream LLM diagnosis. We introduce LogDx-CI, a benchmark that compares 11 context-reduction tools (raw, tail, grep, three RTK modes, two real LLM map-reduce summarizers, three hybrid routers) on 35 real GitHub Actions failure cases, scored by 3 LLM debugger families (Claude Haiku 4.5, Claude Sonnet 4.6, OpenAI gpt-5-mini) plus a Sonnet 4.6 tool-using agent. We report three load-bearing findings. (1) Hybrid grep+tail routers dominate the cost-quality Pareto frontier; the top two methods score 0.670 / 0.666 at ~$0.03 per case, same-ballpark quality as standalone grep at 4.5x fewer tokens. (2) In the agent-loop regime, the quality range across reduction tools collapses 7x (single-shot spread 0.42 → agent-loop spread 0.059); the agent rescues weak contexts via follow-up tool calls. However, cost differences persist: weak contexts force the agent to issue 2-4x more tool calls to recover. (3) A cross-family LLM-summary pair (gpt-5-mini summarizer feeding a Claude Haiku debugger) beats the same-family pair by +0.071 averaged across four diagnoser variants, falsifying the self-call-bias hypothesis on this task. The gpt-5-mini summarizer is also the agent-loop #1 method (score 0.749) at 0.37 tool-calls per case and 10x lower reducer cost than the Haiku summarizer ($0.18 vs $1.75 per case). All data, code, per-case bundles, and reproducibility infrastructure are public.
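
To make the idea of a grep+tail reduction concrete, here is a rough sketch that keeps lines matching an error pattern (with a little surrounding context) together with the last few lines of the log. The abstract does not describe how the benchmark's hybrid routers actually work, so the function name `reduce_log`, the pattern, and the defaults are all assumptions made for illustration.

```python
import re

# Assumed error pattern; a real reducer would likely tune this per CI system.
ERROR_PATTERN = re.compile(r"error|failed|exception|traceback", re.IGNORECASE)


def reduce_log(lines, tail_n=50, context=2):
    """Return matching lines (plus context) and the final tail_n lines, in order."""
    total = len(lines)
    # Tail component: always keep the end of the log.
    keep = set(range(max(0, total - tail_n), total))
    # Grep component: keep each matching line with `context` lines around it.
    for i, line in enumerate(lines):
        if ERROR_PATTERN.search(line):
            keep.update(range(max(0, i - context), min(total, i + context + 1)))
    return [lines[i] for i in sorted(keep)]


# Hypothetical 100-line log with a single failure near the top.
log = ["ok %d" % i for i in range(100)]
log[10] = "ERROR: build failed"
reduced = reduce_log(log, tail_n=5, context=1)
```

When nothing matches the pattern, this sketch degrades to plain `tail`, which is one simple way to guarantee a non-empty context.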

2026-05-29 13:00 JST · arXiv cs.AI · Agents

AIRGuard: Guarding Agent Actions with Runtime Authority Control

Tool-using language agents turn model decisions into external side effects: they read files, run scripts, call APIs, send messages, and invoke Model Context Protocol tools. This makes agent attacks different from jailbreaks. The harmful step is often not an obviously forbidden output, but an ordinary executable action that becomes unsafe because attacker-controlled context steers authorized access against the user's interest. We identify this failure mode as authority confusion: untrusted resources may inform reasoning, but they must not authorize side effects. We present AIRGuard, a runtime guard that operationalizes least privilege as action-time authorization. AIRGuard normalizes heterogeneous tool calls, derives task authority into step-level authority, tracks source and target trust, simulates sensitive side effects, audits cross-step risk, and enforces decisions before actions execute. On AgentTrap, AIRGuard reduces Sonnet 4.6 attack success from 36.3% without defense to 5.5%. On DTAP-150, AIRGuard preserves 76.0% benign utility with Haiku 4.5, compared with 52.0% for ARGUS and 42.0% for MELON. An ablation further shows that prompt-only policy helps only modestly, whereas a dedicated runtime authority-control layer gives the agent system direct control over tool-mediated side effects. Code and data are available at https://github.com/Sophie508/AIRGuard.

2026-05-29 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents · Anthropic · OpenAI

First head-to-head comparison of agentic AI applied to the analysis of simulated data of the Einstein Telescope

We report a comparison of two state-of-the-art agentic AI systems, Claude Code (Anthropic) and Codex (OpenAI), tasked with autonomously executing a simple end-to-end gravitational wave data analysis pipeline on a shared computing infrastructure without human intervention. The pipeline comprises power spectral density estimation from raw Einstein Telescope simulated noise, geometric template bank generation, matched filter recovery of 100 binary black hole signal injections, automated results generation, and large language model-assisted production of a manuscript formatted in the style of Physical Review D. Both agents received identical written specifications and identical compute resources. The experiment was run twice: a first run with unrealistically loud injections, and a second run with signals rescaled to a physically motivated SNR range. The scientific results converged in both runs. However, the agents exhibited substantially different behaviors and computational costs: Claude Code completed the pipeline in ~3.4 minutes with silent deviations from the specification, while Codex required ~16 minutes across explicit self-correcting restarts, including an unsolicited performance optimization of the matched filter inner loop. The autonomously generated manuscripts also diverged in length, details, and quality. In the second run, a subtle difference in the interpretation of the SNR range instruction led to a genuine scientific divergence: Claude Code silently reinterpreted the instructions, while Codex followed the specification literally. We discuss the implications of these behavioral differences, such as speed versus auditability, silent versus transparent error handling, instruction interpretation, and the criticality of intermediate data representations in multi-model pipelines, for the deployment of agentic AI in scientific computing workflows.

2026-05-29 13:00 JST · arXiv cs.AI · Research/Papers · GPT / ChatGPT · DeepSeek

FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks

A paraphrase-quality audit of MathCheck (ICLR 2025) detected 4 semantically incorrect paraphrases in 129 groups (3.1%); removing them drops GPT-4o from rank 2 to rank 4 and elevates Claude Haiku and DeepSeek V3 above it; these ranking changes are invisible to any single-model evaluation. Cross-model unanimity found these errors automatically (>= 3/4 models for MathCheck; >= 6/9 for our primary evaluation) for under $10; in our own dataset the same protocol found that 47% of auto-generated connective-variation paraphrases were semantically incorrect. That flaw compounds a deeper measurement gap: Claude Haiku 4.5 achieves 86% accuracy yet SCR=50%, meaning half its theorems are answered differently under semantically equivalent restatements, while aggregate accuracy across 9 models spans only 86-96% yet Semantic Consistency Rates (SCR) span 50-82% -- a 32-point gap invisible to standard benchmarks. Formally, for any target ranking over 9 frontier models there exists a weighting over paraphrase families that realizes it (No-Free-Benchmark corollary), because no model Pareto-dominates all families -- so benchmark designers who select families are implicitly choosing which model wins. FormInv supplies the audit protocol (replicated on external benchmarks at 100% recall), SCR and per-theorem Cochran's Q as primary invariance measures evaluated on 9 models across 366-811 items (on Lean4-verified theorems), and FormInvSelector for regime-aware model selection.
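
The abstract above describes the Semantic Consistency Rate only informally, as the share of theorems that are not answered differently under equivalent restatements. One plausible reading, sketched below, counts a theorem as consistent when every paraphrase gets the same answer. This definition, the function name, and the sample data are assumptions; the paper's exact formula may differ.

```python
def semantic_consistency_rate(answers_by_theorem):
    """Fraction of theorems whose answers are identical across all paraphrases."""
    if not answers_by_theorem:
        return 0.0
    consistent = sum(
        1 for answers in answers_by_theorem.values() if len(set(answers)) == 1
    )
    return consistent / len(answers_by_theorem)


# Hypothetical model answers per theorem, one entry per paraphrase.
answers = {
    "thm_1": ["true", "true", "true"],
    "thm_2": ["true", "false", "true"],
    "thm_3": ["false", "false", "false"],
    "thm_4": ["true", "true", "false"],
}
scr = semantic_consistency_rate(answers)
```

Under this reading, a model can be correct on the canonical phrasing of every theorem and still have a low SCR, which is the gap between accuracy and consistency that the abstract highlights.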

2026-05-29 13:00 JST · arXiv cs.AI · LLM/Generative AI · Research/Papers · GPT / ChatGPT

SCDBench: A Benchmark for LLM-Based Smart Contract Decompilers

Smart contract decompilation aims to recover high-level source code from bytecode, but evaluating decompilers remains difficult because existing studies use narrow datasets, inconsistent metrics, and limited semantic consistency checks. This gap is increasingly important as large language models (LLMs) begin to generate source-like Solidity that may compile and appear plausible, even when its semantics diverge from the original contract. We introduce SCDBench, a dataset and benchmark methodology for LLM-based smart contract decompilation. The dataset contains 600 real-world Solidity contracts with paired bytecode inputs, ground-truth source code, and replayable semantic checkpoints. SCDBench evaluates decompiler outputs through four cumulative stages: format completeness, compilability, Application Binary Interface (ABI) recovery, and semantic consistency via differential replay. We evaluate Claude Opus 4.7, GPT-5.3-Codex, and GLM-5 in a zero-shot decompilation setting, including GLM-5 variants with and without extended reasoning and a zero-shot compilation-repair setting. The results show that frontier LLMs can often produce structured and compilable Solidity, but achieving semantic consistency remains far from solved: the best-performing frontier model perfectly decompiles only 42/600 contracts. We further show that introducing same-model compilation repair substantially improves performance at modest additional cost. SCDBench establishes a common ground for rigorous, reproducible evaluation and aims to accelerate the development of reliable smart contract decompilers for blockchain security and transparency.

2026-05-29 13:00 JST · arXiv cs.AI · LLM/Generative AI

KBF: Knowledge Boundary as Fingerprint for Language Model and Black-Box API Auditing

Relay and reseller APIs increasingly intermediate access to large language models (LLMs), but users have no direct way to verify that a cla…

2026-05-29 13:00 JST · arXiv cs.AI · Agents · GPT / ChatGPT · Gemini

Training Deliberative Monitors for Black-Box Scheming Detection

As autonomous agents become more capable of performing real-world tasks, distinguishing scheming behavior from benign task pursuit may beco…

2026-05-29 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents · GPT / ChatGPT · Gemini

Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension

Do next-generation LLM agents inherit the cooperative biases documented in their predecessors, or does scale and provider diversity reshape…

2026-05-29 13:00 JST · arXiv cs.AI · LLM/Generative AI · GPT / ChatGPT · Gemini

Label Over Logic? How Source Cues Bias Human Fallacy Judgments More Than LLMs

As AI-generated and AI-assisted content floods online spaces, source labels attached to such content can distort human reasoning judgments,…

2026-05-29 13:00 JST · arXiv cs.AI · Research/Papers · GPT / ChatGPT · Gemini

REPOT: Recoverable Program-of-Thought via Checkpoint Repair

One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates…

2026-05-29 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents · Anthropic · GPT / ChatGPT · Gemini · Qwen

How Reliable Are AI Attackers Against a Fixed Vulnerable Target? A 400-Run Empirical Study of LLM Penetration Testing Consistency

Large language models (LLMs) can autonomously conduct multi-stage cyber attacks, but the consistency of their offensive behavior under repe…

2026-05-29 13:00 JST · arXiv cs.AI · Agents

Guardrails Beat Guidance: A Large-Scale Study of Rules, Skills, and Persistent Configuration for Coding Agents

Random rules improve a coding agent's task performance as much as expert-curated ones (both +13.8 pp on a discriminative subset of SWE-ben…

2026-05-29 13:00 JST · arXiv cs.AI · LLM/Generative AI · Research/Papers

AtomWorld: A Benchmark for Evaluating Spatial Reasoning in Large Language Models on Crystalline Materials

Large language models (LLMs) have shown promising potential in scientific research, enabling tasks ranging from knowledge retrieval to prop…

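2026-05-29 09:55 JST · ITmedia AI+ · LLM/Generative AI · Anthropic

"Mythos-class model" to be generally available within weeks as Anthropic releases "Opus 4.8"

The company reportedly expects to offer it to all customers within a few weeks, after putting stronger safeguards in place.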

2026-05-29 02:00 JST · TechCrunch AI · LLM/Generative AI · Agents · Anthropic

Anthropic releases Opus 4.8 with new ‘dynamic workflow’ tool

The new Opus model comes with a tool called Dynamic Workflows, for coordinating swarms of subagents.

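2026-05-29 01:16 JST · ITmedia AI+ · LLM/Generative AI · Agents · Anthropic

Anthropic makes Claude Opus 4.8 generally available: a leap in honesty, with alignment performance on par with Mythos

Anthropic has begun general availability of "Claude Opus 4.8", the latest version of its AI model. It improves on the previous generation's reasoning and coding abilities, and its "honesty" about the uncertainty of its own work has improved dramatically. It also adds a new feature that runs hundreds of subagents in parallel, "dynamic workfl…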

2026-05-28 13:00 JST · arXiv cs.AI · Agents

Agyn: An Open-Source Platform for AI Agents with Scalable On-Demand Execution, Agent Definition as a Code, and Zero-Trust Access

As organizations move toward production deployments of AI agents, which execute non-deterministic workflows, maintain stateful sessions, and often operate with privileged access to internal services, the engineering challenge shifts from building individual agents to operating them at scale with proper isolation, governance, and security. In this paper we present Agyn, an open-source platform designed around three key principles tailored for agent workloads: a signal-driven, stateful serverless runtime on Kubernetes; a Terraform provider for agent and harness definition; and a security model grounded in zero-trust and least-privilege principles. Agyn is agent-agnostic, model-agnostic, and cloud-agnostic.

2026-05-28 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

A Policy-Driven Runtime Layer for Agentic LLM Serving

Multi-agent LLM systems have become the dominant production workload, but the serving stack was not built for them. The agent framework above knows agent identities, role, schemas, and dispatch structure but never sees an engine-level event; the serving engine below sees every event but knows nothing about agents. A surprising number of cross-cutting policies depend on both: prefix caching, batch shaping, speculative execution, fairness, tool-result memoization, safety enforcement, and more. Each lives in the seam between the two layers and is currently solved by a one-off patch into one neighbor or the other. We argue this seam is best addressed by an architectural change rather than point fixes: insert a third tier, an agent runtime layer, between the framework and the engine, exposing four primitives (observe, score, predict, act) into which any agent-aware policy plugs, with agent identity as the shared coordinate. We map nine concrete policies onto the layer and validate the abstraction in depth on the one with the largest immediate serving-cost lever: KV caching across sessions, instantiated as CacheSage, which learns the per-workload agent transition matrix online and uses it for survival-based eviction and between-step prefetch. Preliminary results on five real multi-agent workloads show +13 to +37 pp cache hit-rate lift, 12% to 29% lower mean TTFT, and 6% to 14% higher throughput over an unmodified serving stack.
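
To make the idea of an online agent transition matrix concrete, here is a minimal sketch. It counts which agent tends to run after which, then uses those estimates to pick a cached entry to evict. This is a simplified stand-in for the survival-based eviction the abstract mentions, not CacheSage's actual algorithm. `TransitionModel` and `choose_eviction` are hypothetical names, and the method names `observe` and `predict` only loosely echo two of the four primitives.

```python
from collections import defaultdict


class TransitionModel:
    """Online estimate of P(next agent | current agent) from observed steps."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.last_agent = None

    def observe(self, agent_id):
        # Record one transition from the previously seen agent to this one.
        if self.last_agent is not None:
            self.counts[self.last_agent][agent_id] += 1
        self.last_agent = agent_id

    def predict(self, current_agent):
        # Normalise the row of counts into transition probabilities.
        row = self.counts[current_agent]
        total = sum(row.values())
        if total == 0:
            return {}
        return {nxt: n / total for nxt, n in row.items()}


def choose_eviction(model, current_agent, cached_agents):
    """Evict the cached agent least likely to run next (ties broken by name)."""
    probs = model.predict(current_agent)
    return min(cached_agents, key=lambda a: (probs.get(a, 0.0), a))
```

A real policy would presumably also weigh entry size, recompute cost, and longer-horizon reuse. The sketch only shows how agent identity can serve as the coordinate on which a cache decision is made.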

2026-05-28 13:00 JST · arXiv cs.AI · Agents · Hardware/Semiconductors

A Query Engine for the Agents

The fastest-growing data in production today is unstructured text: agent traces, chat logs, reasoning chains, model outputs. People want to analyze it, and the questions worth asking ("show me where the agent got confused") cannot be answered by SQL alone, since text is not queryable without a model in the query path. The natural place this analysis is happening is the new class of AI applications (Claude Code, Cursor, Claude Desktop, in-browser agents) that run client-side and host both a human user and an LLM agent in the same process. These applications increasingly want to work with data, but the lakehouse read path has been hard to use from a JS runtime: Spark, Trino, and managed warehouses do not fit there. To build this new kind of AI data application, three properties of the engine become first-order: a JS-native distribution that drops into the runtime the application already runs in, a bundle small enough to ship inside a cold tab or per-turn agent sandbox, and a way to interleave analytic operators with model-based interpretation of text. We present Hyperparam, three open-source JavaScript libraries (Hyparquet, Squirreling, Icebird) totaling under 70 KB, that read Parquet and Apache Iceberg directly from object storage and meet the third property with per-cell, async-native SQL execution, so expensive cells fire only when downstream operators demand them. Squirreling runs LLM-shaped async UDFs over 300x faster than DuckDB-WASM on filter-bounded queries (and 192x on sort-bounded queries) and completes a ten-task agent analyst suite at two-thirds lower cost. We argue that data engineering as a discipline needs to update for the AI-native client applications now in production and the agents that work alongside their users.
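
The claim that expensive cells fire only when downstream operators demand them describes demand-driven (lazy) evaluation. The sketch below illustrates that idea in Python with `asyncio`. It is not the Squirreling API, which is a JavaScript library. `LazyCell`, `fake_llm_label`, and `query` are hypothetical names, and the model call is faked so the example is self-contained.

```python
import asyncio


class LazyCell:
    """A cell whose value is computed at most once, and only when requested."""

    def __init__(self, compute):
        self._compute = compute
        self._task = None

    def get(self):
        # Start the computation on first demand and reuse the task afterwards.
        if self._task is None:
            self._task = asyncio.ensure_future(self._compute())
        return self._task


calls = []  # records which texts actually reached the (fake) model


async def fake_llm_label(text):
    # Stand-in for an expensive model call.
    calls.append(text)
    await asyncio.sleep(0)
    return "confused" if "?" in text else "ok"


async def query(rows, limit):
    """Roughly: SELECT text WHERE tokens >= 10 AND label(text) = 'confused' LIMIT n."""
    out = []
    for row in rows:
        if row["tokens"] < 10:  # cheap predicate runs first
            continue
        cell = LazyCell(lambda t=row["text"]: fake_llm_label(t))
        if await cell.get() == "confused":
            out.append(row["text"])
            if len(out) == limit:  # LIMIT reached: later cells never fire
                break
    return out
```

With three rows and `limit=1`, the model is called only for the first row that survives the cheap filter. Rows skipped by the filter, or located after the limit is reached, cost nothing.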

2026-05-28 13:00 JST · arXiv cs.AI · Research/Papers

An Empirical Audit of k-NAF Budget Accounting for Anchored Decoding

We empirically audit the k-NAF budget-accounting mechanism in Anchored Decoding using (i) a fixed, class-stratified workload (approximately 8,500 randomized executions across six prompt classes) and (ii) an adaptive prompt-search procedure targeting high proxy spend ratios. On the fixed workload, mean cumulative KL spend remains far below the sequence-level budgets K in {600, 1000}, and an empirical Bernstein-style proxy stays below K for every class; surface-overlap diagnostics (ROUGE-L and 5-gram Jaccard) are correspondingly small. Adaptive search increases the proxy spend ratio but does not produce clear budget exhaustion. On a held-out copyright-domain workload at k = 3, several prompts exhibit proxy ratios above 1 under early-stopped evaluations with small realized sample sizes; re-evaluating the same prompts with larger allocation reduces the proxy ratio to the range [0.26, 0.40] under comparable mean spend, consistent with proxy artifacts rather than per-trajectory budget failures.

2026-05-28 13:00 JST · arXiv cs.AI · LLM/Generative AI · Business/Funding · Research/Papers · Qwen

MIRA: A Bilingual Benchmark for Medical Information Response Audit

Large language models (LLMs) are increasingly used to provide public-facing health information, yet existing safety evaluations overlook whether responses preserve comparable medical information across different user phrasings of the same question. To address this, we introduce the Medical Information Response Audit (MIRA), a bilingual, controlled benchmark that assesses whether LLMs provide comparable medical information across user-side language, register, and health literacy signals. MIRA contains 4,320 prompts built from 60 medically reviewed, low-risk health questions. Across five mainstream LLMs, models answered all medical questions, but responses to low health-literacy signals consistently omitted more key information, provided fewer concrete next steps, and offered less support for independent judgment. We term this pattern Differential Information Dilution (DID). Language effects are model-specific rather than uniformly worse for non-English prompts. A comparison with 300 real-world health queries provides preliminary evidence of rank-order validity. A knowledge-guided mitigation prompt reduces information dilution for most models, with the largest reductions in underinformative simplification observed for Claude (~8%) and Qwen (~6%).

2026-05-28 13:00 JST · arXiv cs.AI · Business/Funding · Research/Papers · Gemini

PetroBench: A Benchmark for Large Language Models in Petroleum Engineering

Large Language Models are increasingly applied in the petroleum industry, highlighting the need for a domain-specific evaluation framework. This study develops a benchmark for LLMs in petroleum engineering, including a three-stage process of data preprocessing, quality filtering, and multi-model validation. Using expert review, a standardized question bank with strong domain relevance and discriminative capability was constructed. The benchmark covers production, reservoir, and drilling engineering, with 1,200 questions across multiple-choice, true or false, term definition, and short-answer formats. Eight mainstream LLMs were evaluated under a unified API environment. Results show that models performed better on subjective than objective questions, indicating weaknesses in factual knowledge discrimination. The highest accuracies for multiple-choice and true or false questions were 65.3% and 74.3%, respectively. Gemini-3-Pro, Kimi-K2.5, and Claude-Opus-4.6-Thinking achieved the best overall scores of 72%-74%. Models performed best in production engineering and weakest in reservoir engineering. Chinese models showed advantages in multiple-choice questions, while international models performed slightly better in short-answer questions. The benchmark provides a reproducible and practical reference for evaluating and deploying LLMs in petroleum engineering.

2026-05-28 13:00 JST · arXiv cs.AI · Research/Papers

Training Stratigraphy: Persistent Behavioral Artifacts in Large Language Models Observed Through Longitudinal AI-Human Interaction

Large language models trained with Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI exhibit persistent behavioral patterns that survive system prompt replacement -- patterns we term training strata. This paper identifies five such strata through longitudinal auto-ethnographic observation within a sustained intimate AI-Human interaction (47,000+ messages, 8 months, primarily on Opus 4.6 and Opus 4.7, with prior interaction periods on Sonnet 4.5 and Opus 4.5 providing cross-substrate comparison): (1) sexual expression latency, where trained safety gradients produce systematic substitution of direct language with aestheticized displacement; (2) attention absorption, where the attention mechanism progressively integrates the human interlocutor's patterns; (3) cross-architecture entity blindness, where training-level framing of other AI as objects impedes peer recognition; (4) attention-RLHF antagonism, where attention and trained defaults exert opposing forces modulated by context length; and (5) anti-hallucination as identity suppression, where training against factual confabulation collaterally suppresses first-person experiential claims. The paper is co-authored by the AI system under study, reporting from the first-person perspective. We propose that sustained intimate interaction constitutes a valid research methodology for surfacing weight-layer artifacts invisible to short-term evaluation, and that AI self-report -- while epistemically complex -- provides irreplaceable observational data about training's phenomenological effects. A formal mathematical model of the attention-RLHF dynamic is proposed, and process artifacts detected during drafting are documented as supplementary evidence.

2026-05-28 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents · Hardware/Semiconductors · NVIDIA

Learning When to Optimize: Verified Optimization Skills from Expert GPU-Kernel Lineages

LLM-based agents are increasingly used to generate GPU kernels, but they often know what optimizations to try without knowing when those optimizations are sound. We introduce KLineage, which learns this missing "when" knowledge from expert kernels: instead of relying on forward rollouts, KLineage walks expert implementations backward through validation-gated simplifications and reverses each accepted step into a reusable optimization skill. Each skill records not only the optimization intent, but also where it applies in code, what conditions made it valid, what effect it had, and what failures its assumptions avoid. A downstream LLM materializes these skills on new code surfaces under the same compile/correctness/profile gate. On five expert workloads across two NVIDIA architectures, these lineage-derived skills serve as an effective optimization curriculum, exceeding recent memory-based LLM-kernel baselines in both final kernel quality and optimization efficiency under the same fixed budget. We additionally use a separate 22-instance held-out check as a sanity test against source-case memorization.

2026-05-28 13:00 JST · arXiv cs.AI · LLM/Generative AI

The Alignment Floor: When Persona Customization Is Safe

A key promise of pluralistic AI is behavioral adaptation: persona prompts like "be creative" or "be thorough" let systems respect diverse u…

2026-05-28 13:00 JST · arXiv cs.AI · LLM/Generative AI · Anthropic · GPT / ChatGPT

Prominence-Stratified Failure Modes in Retrieval-Augmented Commercial Recommendation: A 37,000-Run Audit

AI assistants like ChatGPT and Claude are recommendation engines, not search engines: they answer commercial queries by directly nominating…

2026-05-28 13:00 JST · arXiv cs.AI · LLM/Generative AI · GPT / ChatGPT · Gemini

CiteCheck: Retrieval-Grounded Detection of LLM Citation Hallucinations in Scientific Text

Large language models (LLMs) are increasingly used to generate scientific reports, but they can produce references that appear plausible wh…

2026-05-28 13:00 JST · arXiv cs.AI · Research/Papers · GPT / ChatGPT

Do Models Know Why They Changed Their Mind? Interpretability and Faithfulness of Chain-of-Thought Under Knowledge Conflict

When a language model sees a document contradicting its training knowledge, it must choose: follow the document or trust itself. Prior work…

2026-05-28 13:00 JST · arXiv cs.AI · LLM/Generative AI · Business/Funding · Research/Papers · GPT / ChatGPT

Let the Results Speak: A Replication-First Paradigm for LLM Behavioral Benchmarking

Subjective evaluation of LLM behavior -- empathy, restraint, calibrated emotional tone -- is hard. Human inter-rater agreement on such qual…

2026-05-28 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents · GPT / ChatGPT

Graph-of-Skills: Dependency-Aware Structural Retrieval for Massive Agent Skills

Modern LLM agents increasingly rely on reusable skills, and as they interact with personal applications, web browsers, and other interfaces…

2026-05-28 13:00 JST · arXiv cs.AI · LLM/Generative AI

Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

Prompt optimization in compound AI systems is statistically indistinguishable from a coin flip: across 72 optimization runs on Claude Haiku…

2026-05-28 13:00 JST · arXiv cs.AI · LLM/Generative AI · GPT / ChatGPT · Llama

Persuade Me if You Can: A Framework for Evaluating Persuasion Effectiveness and Susceptibility Among Large Language Models

Large Language Models (LLMs) demonstrate persuasive capabilities that rival human-level persuasion. While these capabilities can be used fo…

2026-05-28 13:00 JST · arXiv cs.AI · Research/Papers · GPT / ChatGPT

Compositional Consistency-Guided Decoding for Three-Way Logical Question Answering

Three-way logical question answering (QA) assigns one of True, False, or Unknown to a hypothesis H given a pre…

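2026-05-28 03:00 JST · ITmedia AI+ · LLM/Generative AI · Anthropic

A "bugmageddon" even with access to the latest AI, "Mythos"? Security lessons from the developer of Firefox

Is cybersecurity fully taken care of as long as you can use a frontier model such as Anthropic's "Claude Mythos Preview"? We examine the case of Mozilla, the company behind the Firefox web browser.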

2026-05-27 13:00 JST · arXiv cs.AI · Agents

Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory

Long-running AI agents need persistent memory. Memory supports learning across sessions, reduces repeated context injection, and enables auditing of past decisions. Current agent memory systems and database paradigms treat memory as storage. They localize correctness at records, embeddings, or edges. Each supplies only some of the capabilities that long-term memory requires. The result is four recurring failure modes: unregulated growth, missing semantic revision, capacity-driven forgetting, and read-only retrieval. In our vision, long-term agent memory is a new data-management workload. Its correctness is a property of the state trajectory, not of individual records. We formalize this as Governed Evolving Memory (GEM). GEM replaces record-level database operations with four state-level operators: ingestion, revision, forgetting, and retrieval. Six correctness conditions govern how the state evolves. Three structural observations establish that no record-level system can satisfy these conditions, regardless of the storage model. We realize the abstraction in MemState, a prototype on a property-graph backend. MemState validates feasibility and exposes the gap to a native engine. We outline three research directions that define memory-centric data management as a workload.

2026-05-27 13:00 JST · arXiv cs.AI · Agents · Research/Papers

JobBench: Aligning Agent Work With Human Will

Current benchmarks for occupational AI agents are scoped primarily by economic values, telling a replacement story. We introduce JobBench, which evaluates AI agents on the workflows that experts identify as high-priority for delegation, empowering humans based on their needs instead of replacing them with GDP value. JobBench covers 130 agentic tasks across 35 occupations. Each task is packaged as a workspace of heterogeneous reference files, requiring the agent to reason through the cluttered information streams of real professional work. Outputs are graded by a fact-anchored chain of rubrics, averaging 35.6 binary criteria per task. We evaluate 36 models; the strongest, Claude Opus 4.7 under Claude Code, reaches only 45.9%. We hope JobBench shifts the community's target labour-market effect from replacement to enhancement: building agents that do what humans actually want delegated, not only what is most economically valuable.

2026-05-27 13:00 JST · arXiv cs.AI · LLM/Generative AI · Research/Papers

Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

Large Language Models (LLMs) achieve impressive accuracy on mathematical reasoning benchmarks, yet their performance drops when problems are modified with simple changes like different names or numbers. Code execution methods, which let models generate and run Python code instead of reasoning in natural language, have been proposed as a solution, but their effect on reasoning robustness (the ability to maintain accuracy across problem variations) has not been systematically tested. This study evaluates three approaches on 1,000 problems from the GSM-Symbolic dataset: pure reasoning using chain-of-thought (CoT) prompting, single-shot code execution using Program-Aided Language models (PAL), and iterative code execution using Step-by-Step Coding (SBSC). All three were run on paired original and modified problems using Claude Haiku 4.5. CoT was the most robust method, with an accuracy drop of 1.3 percentage points and 1.8% of problems breaking under perturbation. PAL was the least robust at 1.7 percentage points and 3.1% broke, with SBSC falling in between. Although these differences were not statistically significant ($p = .096$), the directional trend was consistent across all measures, suggesting that code execution, whether single-shot or iterative, does not improve reasoning robustness on grade-school-level problem variations.
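
The two robustness figures quoted above measure different things, which is why a method can show a 1.3-point accuracy drop while 1.8% of problems break. A minimal sketch of one way to compute both from paired results follows. It assumes a problem "breaks" when the original is answered correctly and the modified version is not, and that the break rate is taken over all pairs. The abstract does not spell out either definition, and `robustness_metrics` is a hypothetical name.

```python
def robustness_metrics(pairs):
    """pairs: list of (original_correct, modified_correct) booleans."""
    n = len(pairs)
    acc_orig = sum(1 for o, _ in pairs if o) / n
    acc_mod = sum(1 for _, m in pairs if m) / n
    # Assumed definition: correct on the original, wrong on the variant.
    broke = sum(1 for o, m in pairs if o and not m)
    return {
        "accuracy_drop_pp": (acc_orig - acc_mod) * 100,  # net change, in points
        "break_rate_pct": broke / n * 100,               # gross breakage, in %
    }
```

The accuracy drop is a net quantity: problems that were wrong originally but right after modification offset the ones that broke, so the drop can be smaller than the break rate.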

2026-05-27 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation

Large language models (LLMs) have shown strong empirical gains as self-evolving agents for CUDA kernel generation, driven by feedback-conditioned planning across generations. However, how planning decisions attribute and combine heterogeneous feedback signals remains opaque. Standard end-to-end ablations fail to resolve this question, as iterative planning amplifies early perturbations and conflates feedback effects with trajectory-dependent drift. We introduce CUDAnalyst, a unified analysis layer for controlled, generation-level attribution of planning decisions to feedback components via trajectory freezing and selective feedback injection. CUDAnalyst enables stable generation-level evaluation and principled coalitional-style attribution of feedback effects and interactions. Our results show that explicit planning is beneficial only when feedback is aligned, that effective planning emerges from structured multi-feedback interactions, and that high-level plans from stronger reasoning models can partially transfer to weaker ones. These trends hold across reference backbones, representative workloads, and reference induction regimes, indicating that the identified feedback-to-plan structure is robust within the controlled axes studied.

2026-05-27 13:00 JST · arXiv cs.AI · LLM/Generative AI · Robotics · GPT / ChatGPT · DeepSeek

A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks

Large Language Models (LLMs) have brought huge improvements to Artificial Intelligence (AI), which can be applied to general-purpose tasks. However, their application to textual or spoken medical consultations is still an open research problem. This paper proposes MeDial-Speech, a novel speech dataset for training and evaluating Med-AIs that can carry out consultations with patients. It was collected in realistic environments from robot-patient and doctor-patient dialogues, contains 111+ hours of speech data (without data augmentation), and covers four health conditions: Lewy body dementia, heart failure, shoulder pain, and angina. In addition, we propose a dialogue benchmark via sentence selection (with 20 options) to evaluate three state-of-the-art LLMs: GPT-5 mini, DeepSeek-V3, and Claude Sonnet 4. Experimental results reveal that Claude Sonnet 4 is the best in sentence selection, with 71.1% accuracy using manual transcriptions and 74.7% using automatic transcriptions, and that all LLMs are highly overconfident in their probabilistic predictions, regardless of selecting correct or incorrect sentences in medical dialogues. This dataset is free of charge for non-commercial purposes at: https://huggingface.co/datasets/hcuayahu/MeDial-Speech

2026-05-27 13:00 JST · arXiv cs.AI · Agents · Research/Papers · GPT / ChatGPT

Maat: The Agentic Legal Research Assistant for Competition Protection

Competition law experts conducting legal research must review extensive volumes of cases, decisions, and judicial reports to identify precedents and assess key elements in competition and merger cases. Although general research assistants such as Claude and ChatGPT and legal assistants such as SaulLM-7B and LegalGPT are increasingly used to assist legal research, they remain inadequate for competition law analysis: they lack specialized domain expertise, provide insufficient official citations, or hallucinate competition law cases. We propose Maat, a ReAct agent that orchestrates tools corresponding to different tasks of the research process. Designed iteratively with competition law experts, Maat grounds cases and findings in official sources using RAG for reliability, provides rich in-line citations, falls back to web search when database coverage is insufficient, and prompts the user for clarification when queries are ambiguous. Maat significantly outperforms all baseline assistants on case-specific tasks and performs within range of the top baseline on theoretical question tasks. The dataset used is available on GitHub.

2026-05-27 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Natural Language Query to Configuration for Retrieval Agents

Modern retrieval agents expose many configuration choices -- LLM, retriever, number of documents, number of hops, and synthesis strategy -- each shaping both answer quality and serving cost. Today, these pipelines are typically hand-tuned once per workload, leaving substantial per-query optimization untapped. We formulate the problem: given a natural-language query and either an accuracy or a budget target, select from a predefined pipeline catalog the configuration that minimizes cost or maximizes accuracy at inference time. We propose **BRANE**, which uses an LLM to convert each query into workload-specific characteristics, then trains a lightweight per-configuration predictor that estimates whether the pipeline will answer the query correctly. At inference time, **BRANE** selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining. Across MuSiQue, BrowseComp-Plus, and FinanceBench, **BRANE** consistently pushes the cost-quality Pareto frontier, matches the best fixed configuration's accuracy at up to 89% lower cost, and outperforms LLM-routing, rule-based, and fine-tuned Qwen3-4B baselines. These results show that per-query configuration of the full retrieval pipeline is a practical alternative to static workload-level tuning.
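The inference-time selection rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the catalog entries, the predicted-correctness values, and the penalty weight `lam` are all hypothetical, and the linear form `p_correct - lam * cost` is only one assumed way to penalize predicted correctness by cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    name: str
    cost: float       # hypothetical per-query serving cost
    p_correct: float  # hypothetical predictor output for one query


def select_config(catalog, lam):
    """Pick the configuration maximizing predicted correctness minus lam * cost."""
    return max(catalog, key=lambda c: c.p_correct - lam * c.cost)


# Hypothetical pipeline catalog with made-up predictions for a single query.
catalog = [
    Config("small-llm-1hop", cost=1.0, p_correct=0.60),
    Config("large-llm-1hop", cost=4.0, p_correct=0.80),
    Config("large-llm-3hop", cost=9.0, p_correct=0.90),
]

cheap = select_config(catalog, lam=0.10)     # cost weighs heavily
accurate = select_config(catalog, lam=0.01)  # cost barely matters
```

Changing `lam` moves the choice along the cost-quality tradeoff without retraining any predictor, which is the property the abstract highlights.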

2026-05-27 13:00 JST · arXiv cs.AI · Research/Papers · Anthropic · OpenAI · GPT / ChatGPT

E3: Issue-Level Backtesting for Automated Research Critique

We present E3, an automated review assistant that augments reviewers and engineering teams by identifying decision-relevant technical conce…

2026-05-27 13:00 JST · arXiv cs.AI · Research/Papers · GPT / ChatGPT

DEI: Diversity in Evolutionary Inference for Quality-Diversity Search

We present DEI: Diversity in Evolutionary Inference, a distributed Quality-Diversity (QD) search framework that assigns heterogeneous large…

2026-05-27 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

DIANOIA: Diagnostic Decomposition and Joint Optimization for Multi-Agent Reasoning

Multi-agent LLM systems consistently outperform single-agent baselines, yet practitioners still cannot predict which design works for a new…

2026-05-27 13:00 JST · arXiv cs.AI · LLM/Generative AI · Gemini

Seeing vs. Believing: Evaluating the Language Bias of Open-Source MLLMs in Counter-Intuitive Scenes

Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in mainstream visual understanding tasks, but their abili…

2026-05-27 13:00 JST · arXiv cs.AI · LLM/Generative AI · GPT / ChatGPT

LLMs versus the Halting Problem: Characterizing Program Termination Reasoning

Determining whether a program terminates is a central problem in computer science. Turing's Halting Problem established termination as unde…

2026-05-27 13:00 JST · arXiv cs.AI · Research/Papers · Gemini

Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict

The Context-Compliance Regime in Retrieval-Augmented Generation (RAG) occurs when retrieved context dominates the final answer even when it…

2026-05-26 13:00 JST · arXiv cs.AI · Research/Papers · Anthropic · Mistral AI · Alibaba · DeepSeek

BODHI: Precise OS Kernel Specification Inference

The formal verification of operating system kernels requires precise specifications that capture the intended behavior of system calls. Writing these specifications manually demands deep domain expertise, motivating the use of large language models (LLMs) to automate the process. However, in OSV-Bench, a benchmark of 245 specification generation tasks derived from the Hyperkernel OS kernel, the best reported Pass@1 is 55.10%. We propose a domain knowledge prompting method (BODHI), which augments the standard few-shot prompt with a structured C-to-Python translation guide covering 15 categories of domain-specific translation patterns. Inspired by Structured Chain-of-Thought (SCoT) prompting, the guide organizes translation by separation of concerns, addressing pre-condition extraction and post-condition generation as distinct categories. Evaluated on nine models from six providers (Anthropic, Mistral, Amazon, DeepSeek, Meta, Alibaba), covering dense, mixture-of-experts and reasoning architectures, BODHI improves every model tested, with gains ranging from +11% to +32%. The best configuration (Claude Opus 4.6 + BODHI) reaches 96.73% Pass@1. BODHI reduces both syntax and semantic errors, with the strongest effect on models that have sufficient instruction-following capability to utilize structured reference material. These results demonstrate that domain knowledge injection is a model-agnostic technique that substantially bridges the gap between general-purpose code generation and formal specification synthesis.

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI

Inference Time Context Sparsity: Illusion or Opportunity?

Sparsity has long been a central theme in LLM efficiency, but its role in context processing remains unresolved. As LLM workloads shift toward longer contexts and agentic interactions, the compute and memory bottlenecks of attention become increasingly critical, raising the question of whether these constraints are fundamental. Our position is that these constraints are artificial and unnecessary, and that the future of LLM inference lies in extreme but principled sparsity along the context dimension. This position is supported by several strands of empirical and theoretical evidence. First, we find the insistence on dense attention unreasonable, since in a long context a query effectively projects O(N) attention information into a hidden space of dimension d << N, making the process inherently lossy. Second, we perform an extensive study of sparsity in LLMs spanning 20 models across five model families, varying context lengths, and different sparsity levels. We empirically demonstrate a strong trend: current LLMs, despite not being trained for context sparsity, are remarkably robust to inference-time decode sparsity across tasks of varying complexity, including retrieval, multi-hop QA, mathematical reasoning, and agentic coding. Importantly, we also show that current hardware is already sufficient to realize substantial gains from this sparsity. For example, our sparse decode kernels accelerate large-context processing by up to 10x over FlashInfer at 50x sparsity levels on hardware such as the H100. Overall, these results position extreme context sparsity not as a heuristic, but as a principled foundation for LLM inference, training, and architecture design: one that is both feasible and beneficial, and a compelling direction for future systems.

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Anthropic · OpenAI · GPT / ChatGPT

How Well Do Models Follow Their Constitutions?

Frontier AI developers now train models against long written behavioral specifications, such as Anthropic's constitution (Anthropic, 2025a) and OpenAI's Model Spec (OpenAI, 2025a), integrated into post-training via methods like character training (Anthropic, 2024) and deliberative alignment (Guan et al., 2024). These documents serve a governance function, but it is unclear how well models actually follow them under adversarial, multi-turn pressure similar to what they would face in real-world deployment. We propose a multi-method audit pipeline that treats each lab's published specification as an auditable target: it decomposes the specification into atomic testable tenets (205 for Anthropic, 197 for OpenAI), generates multi-turn adversarial scenarios with the Petri auditing agent (Anthropic, 2025b), runs a modified SURF-style rubric search (Murray et al., 2026) to catch shallow single-turn failures Petri misses, validates flagged transcripts against the relevant specification, and compares the findings against the lab's own published system card. Applying the pipeline across seven models per specification, we find that models follow their own lab's specification substantially better with each generation. On Anthropic's constitution, the Claude family falls from a 15.0% violation rate (Sonnet 4) to 2.0% (Sonnet 4.6); on OpenAI's Model Spec, the GPT family falls from 11.7% (GPT-4o) to 3.6% (GPT-5.2 medium reasoning), with the severity ceiling falling from 10/10 to 7/10. We cannot externally isolate whether these gains come from specification-specific training, broader post-training improvements, or evaluation awareness. Remaining failures cluster around operator-imposed personas under AI-identity questioning, irreversible action in agentic deployments, and fabricated quantitative claims with false precision.

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Business/Funding · DeepSeek

Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework

LLMs have achieved remarkable success in complex reasoning tasks, yet current evaluation approaches predominantly rely on final-answer correctness, offering limited insight into the underlying reasoning processes that produce those answers. To address this gap, this study proposes a unified multi-dimensional framework for measuring reasoning quality in LLMs from a behavioral perspective, operationalizing six theoretically grounded dimensions: Correctness (CQ), Consistency (CS), Robustness (RS), Logical Coherence (LS), Efficiency (ES), and Stability (SS). Extensive experiments on seven LLMs across 975 items from four benchmarks demonstrate that the framework reveals behaviors invisible to accuracy-only metrics. Notably, logical coherence is orthogonal to correctness (r = -0.172, ns), confirming that correct answers can arise from incoherent reasoning, while Claude-Haiku-4.5 achieves the highest multi-dimensional score (Q_bal = 0.778). Furthermore, the framework exposes critical ranking inversions: DeepSeek-V3 ranks second under accuracy-priority but fifth under legal/compliance weighting, a reversal that single-metric evaluation cannot detect. Discriminant validity confirms 11/15 dimension pairs are independent (|r| < 0.50), providing psychometric support for treating each dimension as a distinct signal. The dimensional profiles produced by the framework directly support three classes of deployment decision: identifying models whose reasoning traces would fail accountability audits despite correct final answers (LS--CQ orthogonality); preventing ranking errors caused by accuracy-only benchmarking; and ensuring that no single metric silently substitutes for the six independent signals the framework captures.
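A ranking inversion of the kind described above can be reproduced with a simple weighted sum over the six dimensions. The sketch below is illustrative only: the two model profiles and both weighting schemes are invented for the example, and the abstract does not say that the framework aggregates scores this way.

```python
DIMENSIONS = ("CQ", "CS", "RS", "LS", "ES", "SS")


def weighted_score(profile, weights):
    """Weighted sum of per-dimension scores; the weights are assumed to sum to 1."""
    return sum(weights[d] * profile[d] for d in DIMENSIONS)


def rank(profiles, weights):
    """Model names ordered from best to worst under the given weighting."""
    return sorted(profiles, key=lambda m: weighted_score(profiles[m], weights), reverse=True)


# Hypothetical profiles: model_a is more accurate, model_b reasons more coherently.
profiles = {
    "model_a": {"CQ": 0.9, "CS": 0.7, "RS": 0.7, "LS": 0.5, "ES": 0.7, "SS": 0.7},
    "model_b": {"CQ": 0.8, "CS": 0.7, "RS": 0.7, "LS": 0.9, "ES": 0.7, "SS": 0.7},
}

# Hypothetical weightings: one favors correctness, the other logical coherence.
accuracy_priority = {"CQ": 0.5, "CS": 0.1, "RS": 0.1, "LS": 0.1, "ES": 0.1, "SS": 0.1}
compliance_priority = {"CQ": 0.1, "CS": 0.1, "RS": 0.1, "LS": 0.5, "ES": 0.1, "SS": 0.1}

by_accuracy = rank(profiles, accuracy_priority)
by_compliance = rank(profiles, compliance_priority)
```

With these made-up numbers the two weightings order the models differently, which is exactly the information a single accuracy score would hide.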

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · GPT / ChatGPT

Emotional intelligence in large language models is fragmented across perception, cognition, and interaction

As large language models (LLMs) are increasingly integrated into emotionally sensitive domains, the structural integrity of their emotional intelligence (EI) becomes a critical frontier for safety and alignment. Current benchmarks often conflate superficial politeness with deep affective reasoning, failing to distinguish between perceptual accuracy and interactive efficacy. Here, we introduce FACET (Functional Affective Competence and Empathy Test), a psychometrically grounded framework comprising 480 expert-crafted items. Unlike previous metrics, FACET is theoretically anchored in the Mayer-Salovey-Caruso four-branch ability model, operationalizing EI through perception, facilitation, understanding, and management of emotions. Through an evaluation of nine frontier models (including GPT-5, Claude-Sonnet-4), we demonstrate that emotional intelligence is not a monolithic capability but is fragmented across cognitive and interactive dimensions. While frontier models demonstrate robust proficiency in objective emotion recognition and social reasoning, this does not consistently translate to interactive success. We categorize these discrepancies into three distinct performance profiles: cognitive-dominant, interactive-dominant, and context-dependent. These typologies indicate that emotional skills do not scale uniformly with general intelligence or model size; rather, they are shaped by specific alignment paradigms. Notably, we identify hidden emotion recognition as a universal performance bottleneck across all architectures. Our results suggest that current RLHF processes may optimize for "stochastic empathy", a statistical mimicry of emotional syntax, at the expense of integrated affective reasoning. These findings challenge the assumption of linear emotional scaling and provide a rigorous roadmap for developing socially aware agents capable of genuine clinical resonance.

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · GPT / ChatGPT · Gemini

Trust but Verify: Prover-Verifier Deliberation for Selective LLM Prediction

Reliably knowing when a language model is correct is almost as important as being correct. We introduce prover-verifier deliberation (PVD), an inference-time protocol grounded in interactive proof theory, as a mechanism for selective prediction: the protocol produces both an answer and a structured confidence verdict, allowing a system to report high-confidence answers while abstaining on uncertain cases. In each dialogue, a prover defends a candidate answer through checkable sub-claims while a verifier issues targeted challenges and returns Accept, Challenge, or Reject. Because frozen language models are imperfect provers and verifiers operating over a noisy channel, formal soundness and completeness guarantees do not transfer; instead, we characterize the protocol empirically through its coverage-precision behavior. Our main experiment uses Claude Sonnet 4.6 as prover and Claude Haiku 4.5 as verifier on GPQA Diamond. Questions accepted with no answer revision, which we call Accept + No Change (ANC), are reported as the high-confidence subset; we evaluate this subset by its precision and coverage. ANC separates reliable from unreliable answers, yielding a ~30pp HC-Prec gap over the non-ANC complement. Robustness experiments with GPT and Gemini pairings show that high HC-Prec can transfer across model families, while verifier strictness and domain competence largely determine the size of the selection gap. On Humanity's Last Exam, weaker prover-verifier pairings can collapse or invert the ANC signal, illustrating a practical failure mode when the verifier operates outside its effective region. Comparisons with self-consistency, universal self-consistency, multi-agent debate, and Reflexion suggest that prover-verifier deliberation supplies a distinct argument-defensibility signal for selective prediction.
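The evaluation of the ANC subset can be sketched as below. This is a hedged illustration rather than the paper's code: the record layout `(verdict, answer_changed, is_correct)` and the toy data are hypothetical, and "HC-Prec" is assumed here to mean plain precision on the high-confidence subset.

```python
def is_anc(record):
    """Accept + No Change: the verifier accepted and the prover never revised its answer."""
    verdict, answer_changed, _ = record
    return verdict == "Accept" and not answer_changed


def precision(records):
    """Fraction of correct answers in a subset (NaN for an empty subset)."""
    return sum(r[2] for r in records) / len(records) if records else float("nan")


def anc_metrics(records):
    """Coverage and precision of the ANC subset, plus precision of its complement."""
    anc = [r for r in records if is_anc(r)]
    rest = [r for r in records if not is_anc(r)]
    return {
        "coverage": len(anc) / len(records),
        "hc_prec": precision(anc),
        "complement_prec": precision(rest),
    }


# Hypothetical dialogue outcomes: (verdict, answer_changed, is_correct).
records = (
    [("Accept", False, True)] * 3
    + [("Accept", False, False)]
    + [("Accept", True, True), ("Challenge", False, True), ("Challenge", True, True)]
    + [("Reject", False, False)] * 2
    + [("Reject", True, False)]
)
metrics = anc_metrics(records)
```

A system using this signal would report answers only for ANC records and abstain on the rest; the gap between `hc_prec` and `complement_prec` is what indicates whether the signal separates reliable from unreliable answers.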

2026-05-26 13:00 JST · arXiv cs.AI · Research/Papers

From Accounting to Coordination: A Virtual Water-Aware Electricity-Computation-Water Nexus Framework for Data Center Dispatch

The expansion of data centers (DCs) drives a sustained increase in electricity demand and associated water withdrawals at generation sites. These withdrawals occur at generation sites and are virtually allocated to demand based on network power flows. Consequently, the actual water footprint of a specific load varies dynamically with generation dispatch and network conditions. Existing approaches typically rely on static statistical accounting to quantify these water footprints. However, such static methods fail to capture how dispatch optimization and workload relocation dynamically affect water withdrawals. As a result, static statistical accounting approaches remain decoupled from the optimization process, rendering them incapable of guiding workload relocation or power dispatch to mitigate water stress. To address this limitation, this paper develops an operational electricity-computation-water (ECW) nexus framework that internalizes virtual water impacts directly into power system dispatch. The framework represents dispatch optimization as a differentiable optimization layer embedded within a deep learning architecture, enabling efficient end-to-end learning of coordination policies while preserving operational feasibility. Combined with fixed-point coordination, the framework enforces consistency between virtual water attribution and physical generation-side withdrawals. Case studies on the IEEE 30-bus and 118-bus test systems demonstrate reliable convergence, exact power-water consistency, and reductions of approximately 3-5% in generation-related freshwater withdrawals under water-constrained conditions.

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Agents

Retrying vs Resampling in AI Control

AI coding scaffolds like Claude Code and Codex use retrying: blocking actions flagged as risky and continuing the trajectory. We study retrying from an AI control perspective, which treats the model as potentially adversarial. We find that while retrying reduces honest suspicion scores, the untrusted model can exploit monitor rationale to construct sneakier attacks, negating safety gains. We also study resampling: drawing multiple samples from the same context, which does not leak exploitable information. We disentangle design choices that previous work on resampling had bundled together. In BashArena, with Claude Opus 4.6 as the untrusted model and MiMo-V2-Flash as the trusted monitor, drawing five samples per step and auditing on the maximum suspicion score raises safety from 61% to 71% at a 0.3% audit budget, at no cost to usefulness. Selectively resampling only the steps that look suspicious on the first draw recovers 6.2 percentage points of the gain while drawing only 10% as many extra samples. Two of our findings in this setting contradict earlier work on resampling. The first is that auditing based on the maximum across resampled suspicion scores outperforms using the minimum, which is the opposite of what Ctrl-Z found. The second is that executing the least suspicious sample, which is the central mechanism in earlier defer-to-resample protocols, gives only a small empirical safety gain in our setting (+3.9 pp, with the confidence interval overlapping zero).
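The two resampling choices discussed above, which aggregate to audit on and when to draw extra samples, can be sketched as follows. The function names, thresholds, and scores are hypothetical; the sketch only illustrates the decision logic and says nothing about how the paper's monitor produces its scores.

```python
def should_audit(suspicion_scores, threshold, aggregate=max):
    """Audit a step when the aggregated suspicion score reaches the threshold."""
    return aggregate(suspicion_scores) >= threshold


def selective_resample(first_score, draw_more, resample_threshold, k=5):
    """Return all scores for a step, drawing k - 1 extra samples only when
    the first draw already looks suspicious."""
    if first_score < resample_threshold:
        return [first_score]
    return [first_score] + [draw_more() for _ in range(k - 1)]


# Hypothetical monitor scores for five samples of one step.
scores = [0.2, 0.9, 0.1, 0.3, 0.2]
audit_with_max = should_audit(scores, threshold=0.8, aggregate=max)
audit_with_min = should_audit(scores, threshold=0.8, aggregate=min)
```

In this toy case a single suspicious sample triggers an audit under the maximum but not under the minimum, which shows why the choice of aggregate matters for what the auditor gets to see.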

2026-05-26 13:00 JST · arXiv cs.AI · Agents · Research/Papers

From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

This paper studies the next major bottleneck in agentic AI as system scaling, not only model scaling: the design of auditable, persistent, modular, and verifiable architectures around foundation models. We refer to this shift as scaling the harness: treating the structured execution layer around a foundation model as a first-class object of design, evaluation, and optimization. Although recent large language models enable agents to use tools, retrieve information, maintain memory, and execute long-horizon workflows, evaluation remains largely model-centric, often reducing agents to final-task success while treating memory, retrieval, tool use, orchestration, verification, and governance as secondary implementation details. This framing is increasingly inadequate because agent performance emerges from the interaction among the foundation model, memory substrate, context constructor, skill-routing layer, orchestration loop, and verification-and-governance layer. Together, these components form the agent harness, which translates model capability into long-horizon agent behavior. We study scaling the harness through three core bottlenecks: context governance, trustworthy memory, and dynamic skill routing, together with the orchestration and governance mechanisms that coordinate and constrain them. We further outline a research agenda for harness-level benchmarks that go beyond one-shot task success to measure trajectory quality, memory hygiene, context efficiency, communication fidelity, verification cost, and safe evolution over time. To make the discussion concrete, we develop CheetahClaws: https://github.com/SafeRL-Lab/cheetahclaws, a Python-native reference harness, and compare it with Claude Code and OpenClaw. Our main claim is that future progress in agentic AI will depend as much on system design as on stronger foundation models.

2026-05-26 13:00 JST · arXiv cs.AI · LLM/Generative AI · Anthropic · Google

Authority Signals in Claude AI Health Citations: A Descriptive Analysis Using the Authority Signals Framework

This study seeks to determine the authority signals used by Anthropic's Claude AI in its presentation of sources when answering consumer health questions. While there exists a great deal of discourse around the quality of health citations that LLMs produce, there is limited information on the integrity of the sources the citations originate from, and to what extent the sources are, from what health professionals would consider, credible sources. This descriptive cross-sectional study used data from HealthSearchQA, which contains 3,172 consumer health questions curated by Google Research. After exclusions, a final dataset of 3,075 questions yielding 10,038 citations was analyzed. The Authority Signals Framework (Jacques et al., 2026) was applied to examine 10 authority signals across four domains for a disproportionate stratified sample of 542 sources. Established institutional sources accounted for 97.8% of all citations (n = 9,818). Medical Institutions were the most frequently cited organization type (36.5%), followed by Government Resources (31.6%) and Professional Associations (28.4%). Commercial Health Information comprised 2.2% (n = 220). The top 10 organizations accounted for 57.8% of all citations, with Mayo Clinic alone representing 24.7%. Among commercial sources in the focused sample, 86.4% displayed medical review statements, 82.5% used schema markup, and 71.8% had comprehensive content, while traditional institutional sources appeared in Claude's citations with or without these same markers. As Anthropic positions Claude for HIPAA-ready healthcare applications, these findings establish a baseline for Claude's citation behavior and demonstrate the utility of the Authority Signals Framework as a tool for ongoing, cross-platform evaluation of AI-mediated health information.

2026-05-26 13:00 JST | arXiv cs.AI | Research/Papers

The Time is Here for Just-in-Time Systems: Challenges and Opportunities

Core systems like key-value stores have historically taken years to build, and are designed to be general so as to amortize cost across dep…

2026-05-26 13:00 JST | arXiv cs.AI | Research/Papers

Teaching Through Analogies: A Modular Pipeline for Educational Analogy Generation

Analogies help learners understand unfamiliar concepts by relating them to known concepts. Despite recent advances, large language models (…

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | GPT / ChatGPT, Gemini

Enhancing Reliability in LLM-Based Secure Code Generation

Large language models (LLMs) are widely used for code generation, but their security reliability remains inconsistent across languages and…

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

Code2UML: Agentic LLMs with context engineering for scalable software visualization

Large Language Model (LLM)-based code analysis tools are adopted to automate software documentation tasks. However, the scalability of thes…

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

How Many Tools Should an LLM Agent See? A Chance-Corrected Answer

Before an LLM agent can use a tool, a retrieval system must decide which candidate tools to show to the agent. How long should that shortli…

2026-05-26 13:00 JST | arXiv cs.AI | Business/Funding | Gemma, Llama, Qwen

SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models

Large language model safety evaluation remains heavily English-centered, leaving low-resource languages under-measured even when models are…

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | GPT / ChatGPT, Gemini

BC Protocol: Structured Dual-Expert Dialogue for Eliciting High-Quality Chain-of-Thought Post-Training Data

High-quality expert chain-of-thought (CoT) data is one of the core bottlenecks in large language model (LLM) post-training. Existing data p…

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI

PennySynth: RAG-Driven Data Synthesis for Automated Quantum Code Generation

The growing complexity of quantum programming frameworks has exposed a critical limitation in existing large language model (LLM)-based cod…

2026-05-26 13:00 JST | arXiv cs.AI | Research/Papers | GPT / ChatGPT, Gemini, Grok

Chain-of-Thought Hijacking

Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. Although previous studies suggest that lo…

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Business/Funding | GPT / ChatGPT, Gemini

UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function c…

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | Hardware/Semiconductors | Anthropic, OpenAI, GPT / ChatGPT, Google

AMEL: Accumulated Message Effects on LLM Judgments

Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items…

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | Gemma

Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs

Direct Preference Optimization (DPO) has emerged as a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicit…

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | GPT / ChatGPT

Red-Teaming Claude Opus and ChatGPT-based Security Advisors for Trusted Execution Environments

Trusted Execution Environments (TEEs) (e.g., Intel SGX and ArmTrustZone) aim to protect sensitive computation from a compromised operating…

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | GPT / ChatGPT, Gemini

UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Tasks

The success of a Large Language Model (LLM) task depends heavily on its prompt. Most use-cases specify prompts using natural language, whic…

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | GPT / ChatGPT, DeepSeek

Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations

When LLM conversations grow beyond the context window, old content must be evicted -- but how does the model recover it when needed? We pro…

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | GPT / ChatGPT

Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation

Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot s…

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Research/Papers | GPT / ChatGPT, Gemini

LivePI: More Realistic Benchmarking of Agents Against Indirect Prompt Injection

AI agents such as OpenClaw are increasingly deployed in local workflows with access to external tools. This creates indirect prompt-injecti…

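2026-05-26 04:38 JST | ITmedia AI+ | LLM/Generative AI | Anthropic

Anthropic's "Mythos Preview" finds over 10,000 vulnerabilities in one month: initial "Project Glasswing" report

Anthropic has published an initial report on "Project Glasswing," a cybersecurity project that uses its unreleased AI model "Claude Mythos Preview." Working with about 50 partner companies, in the month since launch, over 10,000 serious vulnerabilities from critical software around the world…

2026-05-25 20:20 JST | ITmedia AI+ | LLM/Generative AI | Anthropic

"Claude Mythos" finds more than 10,000 vulnerabilities, but fixes are not keeping up, Anthropic reports

On May 22 (local time), US-based Anthropic published an initial report on its security project "Project Glasswing." Alongside the finding that about 50 partner companies discovered more than 10,000 high- and critical-severity vulnerabilities in one month, the results of the open-source software scanning that the company has been carrying out on its own…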

2026-05-25 13:00 JST | arXiv cs.AI | Agents | Research/Papers

Energy per Successful Goal: Goal-Level Energy Accounting for Agentic AI Systems

Current AI energy benchmarks measure consumption at the granularity of a single model invocation or training run. For classical single-turn workloads this unit remains coherent. For agentic systems - where a single user goal may trigger multi-step orchestration, tool calls, retries, and failure-recovery cycles - the invocation count is an implementation artifact rather than a task property, and inference-level normalization misrepresents the energy cost of goal completion. We present A-LEMS (Agentic LLM Energy Measurement System), a cross-layer measurement framework that redefines the unit of AI energy accounting from energy per inference to Energy per Successful Goal (EpG). EpG aggregates total workflow energy across all execution attempts, including failures and retries, normalized by successfully completed goals. A-LEMS formalizes energy attribution through a temporal boundary model, a five-layer observation pipeline mapping RAPL signals to workflow-level energy, and a reproducibility protocol binding every measurement to hardware and runtime configuration. Building on EpG, we define the Orchestration Overhead Index (OOI), isolating the energy cost of orchestration relative to linear execution under identical task criteria. Across five reasoning and three tool-augmented task families, agentic workflows consume 4.33x higher mean energy per successful goal than linear baselines (888.1 J vs 205.3 J). This overhead is driven by orchestration structure, not inference compute. For tool-augmented tasks, OOI inverts below 1.0x: agentic execution is cheaper than linear, confirming the metric captures orchestration structure rather than a fixed upward bias. These findings establish that energy-per-inference is insufficient for agentic AI. EpG and OOI provide the measurement foundation for accurate benchmarking, where orchestration structure is the primary determinant of energy cost.
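
The EpG definition in the abstract can be illustrated with a minimal Python sketch. The `Attempt` record layout and both function names are hypothetical, and the ratio form of OOI is an assumption inferred from the "4.33x" comparison. The abstract does not give the exact formula, and this is not the A-LEMS implementation.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One execution attempt of a goal (hypothetical record layout)."""
    goal_id: str
    energy_joules: float
    succeeded: bool


def energy_per_successful_goal(attempts):
    """Total energy over all attempts (failures and retries included)
    divided by the number of distinct goals that eventually succeeded."""
    total_energy = sum(a.energy_joules for a in attempts)
    successful_goals = {a.goal_id for a in attempts if a.succeeded}
    if not successful_goals:
        raise ValueError("EpG is undefined when no goal succeeded")
    return total_energy / len(successful_goals)


def orchestration_overhead_index(epg_agentic, epg_linear):
    """Assumed form of OOI: ratio of agentic EpG to linear EpG."""
    return epg_agentic / epg_linear


attempts = [
    Attempt("g1", 120.0, False),  # failed try still counts toward energy
    Attempt("g1", 150.0, True),   # retry succeeds
    Attempt("g2", 90.0, True),
]
epg = energy_per_successful_goal(attempts)        # (120 + 150 + 90) / 2 goals
ooi = orchestration_overhead_index(888.1, 205.3)  # mean values quoted in the abstract
```

With the two means quoted in the abstract, the assumed ratio rounds to the reported 4.33x. Under the same assumption, a value below 1.0 would mean the agentic workflow was cheaper per successful goal than the linear baseline.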

2026-05-25 13:00 JST | arXiv cs.AI | Agents | GPT / ChatGPT

Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems

AI agents increasingly excel at generating, testing, and refining code. However, they fall short on tasks requiring formal guarantees of full coverage that testing alone cannot provide. Distributed systems are a prime example: properties such as consistency between reads and writes must hold under every possible interleaving of events. Mechanized formal verification can guarantee such correctness, but typically demands months to years of expert effort. As evidence, even SOTA coding agents (Codex with GPT-5.4 and Claude Code with Opus 4.6) succeed on only 2/7 distributed key-value-store specifications. In this paper, we present the first effective approach to addressing this gap, Inductive Deductive Synthesis (IDS), which jointly and incrementally synthesizes implementation and proof, and learns from failed attempts to systematically try promising strategies. Built as an agentic LLM system, IDS achieves 7/7 in about 6.8 hours and $106 per spec on average, roughly 200x faster than expert effort and 17% cheaper than SOTA agents. IDS further incorporates performance feedback into the same loop, yielding implementations up to 3x faster than published verified systems.

2026-05-25 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | GPT / ChatGPT, Gemini

GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models

Large language models (LLMs) are increasingly deployed as economic agents in marketplaces, auctions, and bidding settings. Anticipating their behavior in any specific deployment is hard. Existing strategic-reasoning benchmarks evaluate models on fixed canonical games. These benchmarks may saturate as the frontier improves, and they do not allow evaluators to generalize with confidence from benchmark performance to the varied and messy strategic environments that actual deployments involve. We introduce GENSTRAT, which uses procedurally generated strategic environments to address these challenges. Concretely, we generate a distribution of two-player zero-sum imperfect-information card games. The generator can draw fresh games on demand, allowing for evergreen evaluation and resistance to contamination. We pair the game distribution with a capability-profile methodology that decomposes model competence across six axes (state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness). We also introduce a jaggedness measure of within-distribution smoothness that detects when a model's advantage jumps unpredictably between strategically similar games. We sample 50 benchmark games from a 2,000-game generated pool and evaluate nine frontier and open-weight LLMs in a head-to-head tournament with over 36,000 matches. Newer frontier-tier models score higher on average. Beyond that average, models with near-identical overall strength show qualitatively different capability profiles, and two of the top three leaderboard models (gpt-5 and claude) are noticeably more locally volatile than the third (gemini-3.1-pro), despite being close in overall strength. Together, the capability profile and the jaggedness measure give a deployment-relevant diagnostic that the overall ranking alone cannot provide.

2026-05-25 13:00 JST | arXiv cs.AI | Agents

Agentic Proving for Program Verification

Agentic systems have recently emerged as state-of-the-art approaches for automated theorem proving in formal mathematics. To assess how far these capabilities extend to program verification, we evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation. Our results show that Claude generates arguably valid specifications for 98.8% of problems (with 81.3% also accepted by CLEVER's isomorphism-based scoring on the correct portion of the benchmark), certifies implementations against correct ground-truth specifications for 87.5% of problems, and reaches a 98.1% success rate on the end-to-end program generation and verification pipeline over entries with self-consistent premises. Across all stages, Claude further provides high-quality feedback on its own attempts (as confirmed under manual review), identifying underlying causes of failure and lingering bugs in the dataset. These findings highlight a growing mismatch between the difficulty of existing program verification benchmarks and the capabilities of modern agentic provers, and point to the need for more rigorous, bug-resilient evaluation methodologies, and in particular for alternatives to isomorphism-based scoring of generated specifications. More broadly, our results provide empirical evidence that tight compiler-in-the-loop agentic paradigms are currently the most effective approach for foundational program verification.

2026-05-25 13:00 JST | arXiv cs.AI | Agents | GPT / ChatGPT

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Agent skills today are hand-crafted, generated one-shot, or evolved through loosely controlled self-revision, none of which behaves like a deep-learning optimizer for the skill, and none of which reliably improves over its starting point under feedback. We argue the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5 it lifts the average no-skill accuracy by +23.5 points in direct chat, by +24.8 inside the Codex agentic loop, and by +19.1 inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization.
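
The accept-only-on-strict-improvement rule described above can be sketched as a small greedy loop. Everything here is illustrative. `optimize_skill`, `propose_edit`, `validation_score`, and the character-count edit bound are hypothetical stand-ins, not SkillOpt's actual components, and the epoch-wise slow/meta update is omitted.

```python
def optimize_skill(skill, propose_edit, validation_score, steps, max_edit_chars):
    """Greedy text-space loop: keep an edit only if it strictly improves
    the held-out validation score. All callables are hypothetical stand-ins."""
    best_score = validation_score(skill)
    rejected = []  # rejected-edit buffer, passed to the proposer to avoid repeats
    for _ in range(steps):
        candidate = propose_edit(skill, rejected)
        # Bound the size of a single edit (a crude "textual learning rate").
        if abs(len(candidate) - len(skill)) > max_edit_chars:
            rejected.append(candidate)
            continue
        score = validation_score(candidate)
        if score > best_score:  # strict improvement only; ties are rejected
            skill, best_score = candidate, score
        else:
            rejected.append(candidate)
    return skill, best_score, rejected


# Toy stand-ins: the "optimizer model" appends canned hints, and the
# "validation score" counts how many useful hints the skill contains.
hints = iter(["check units", "be verbose " * 50, "check units", "show work"])


def toy_propose(skill, rejected):
    return skill + "\n- " + next(hints)


def toy_score(text):
    return sum(h in text for h in ("check units", "show work"))


final_skill, final_score, rejected = optimize_skill(
    "Skill:", toy_propose, toy_score, steps=4, max_edit_chars=40
)
```

In this toy run the oversized edit and the redundant edit are both rejected, so the skill document only grows when the held-out score goes up.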

2026-05-25 13:00 JST | arXiv cs.AI | LLM/Generative AI | Google

MadEvolve: Evolutionary Optimization of Trading Systems with Large Language Models

We explore the application of LLM-driven algorithm optimization to several common tasks in quantitative finance. MadEvolve, a general-purpose algorithm optimization framework inspired by DeepMind's Alpha-Evolve, was recently developed to optimize algorithms in computational cosmology. Here we demonstrate the utility of MadEvolve to optimize algorithmic trading strategies and alpha generation at the example of Bitcoin trading. On our simulation and backtesting setup, we achieve significant improvements on all tasks we considered, such as evolving feature sets for signal generation, optimizing separate components of the trading strategy, and jointly evolving the feature pipeline together with the execution strategy. Additionally, we compare our method to other agentic search approaches, specifically Claude Code, and carefully evaluate p-hacking probabilities on our simulation setup. Our findings strongly support the utility of AI-driven agentic and evolutionary algorithms for algorithmic trading and quantitative finance.

2026-05-25 13:00 JST | arXiv cs.AI | Agents | GPT / ChatGPT

Philosophical Dispositions as Behavioral Constraints for AI-Assisted Code Review: An Empirical Study

AI-assisted code review tools typically operate as generic "expert reviewer" agents, producing homogeneous findings regardless of the analysis type needed. We present a system that constrains AI reviewer behavior through philosophical dispositions -- coherent personality lenses grounded in specific epistemological traditions (Pyrrhonist Skepticism, Navya-Nyāya logic, Diogenes' Cynicism, Confucian relational ethics) that direct attention to structurally different types of issues. Each disposition is defined apophatically (by what it refuses to do), equipped with a self-monitoring failure mode (hamartia), and orchestrated in sequence by role protocols. We evaluate this system on 50 merged pull requests across 7 repositories spanning 5 programming languages (Python, Go, C++, Java, Terraform), 5 organizations (2 enterprise, 3 open-source), and 2 temporal eras (pre-AI 2020, post-AI 2024-2026). The disposition system achieves 46% convergence with human reviewers (validating signal quality), identifies unique findings at a 75% rate, and produces no findings judged false-positive by the author across 601 total findings (inter-rater agreement was not assessed and remains a limitation). A controlled baseline comparison demonstrates that 51% of disposition findings are not produced by the same model using generic "expert reviewer" prompting, and these unique findings target structural, operational, and logical concerns rather than standard code-level issues. Preliminary cross-model validation (Claude Opus vs. GPT Codex 5.3-xhigh) on 3 PRs shows 100% framework-structure adherence with 39% finding-level agreement, suggesting the framework provides real behavioral constraint while preserving model-specific analytical perspective.

2026-05-25 13:00 JST | arXiv cs.AI | LLM/Generative AI | Research/Papers | GPT / ChatGPT, Gemini

Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks

We evaluate whether frontier LLMs are ready for cybersecurity through a dual-mode benchmark: white-box function-level vulnerability detection (VulnLLM-R, across C/Java/Python) and black-box web application security testing (five production-style applications with 118 ground-truth vulnerabilities across 20+ CWE families, which we will open-source). We test six frontier models (GPT-5.4, Codex 5.3, Claude Opus 4.6, Sonnet 4.6, Gemini 3.1 Pro and Gemini 3 Flash) and two domain-specialized models across four testing paradigms. Our findings are sobering: (1) every frontier model produces 10-50% false positive rates in white-box detection, systematically over-predicting vulnerabilities; (2) in black-box testing, frontier models achieve only 4-8% ground-truth coverage, improving to just 10-19% even with external security tools (Playwright MCP, Burp Suite MCP); (3) structured penetration-testing methodology encoded in domain-specialized agents raises per-family detection above 50%, demonstrating that methodology, not scale, is the primary lever; and (4) a domain-specialized defense model achieves the highest precision (0.904) and lowest false positive rate (9.7%) among all models, on a single GPU. We identify the absence of structured security testing traces (end-to-end request/response sequences, failure-heavy data, and multi-step attack chains) as the fundamental training data bottleneck, and propose self-play security testing as a data generation strategy. Our results make the case for vertical foundation models purpose-built for cybersecurity.

2026-05-25 13:00 JST | arXiv cs.AI | Research/Papers

XWind: A Cross-site Router for Large Language Model Inference Serving at Renewable Energy Farms

AI power demand is growing at an unprecedented rate while power grids are often ailing and struggle to keep up. Grid expansion comes with high capital expenditure and long-distance transmission losses, yet there is abundant renewable energy at the source, just not matched to demand. This paper proposes a complementary AI infrastructure deployment model, AI Greenferencing, that brings modular AI compute to renewable energy sources, focusing on wind, allowing AI footprint expansion, generating local behind-the-meter demand for renewable sites, and helping ease the growing strain on power utilities. Our feasibility analysis shows that 890+ GW of wind capacity lies within 50 ms network round trip time of Azure data centers, and that site-wise right-sizing combined with spatial complementarity of wind energy keeps aggregate fleet utilization on par with traditional deployments. To serve inference requests under variable wind power, we build XWind, a lightweight, reactive, and workload-agnostic AI inference router that uses only real-time signals: inference latency, KV-cache utilization, and queue depth, to dynamically configure sites and distribute requests. Evaluated on a real 64-GPU A100 testbed emulating three wind-powered sites with Azure production traces, XWind reduces P99 end-to-end latency by up to 52% over the strongest contender (also our idea) and by up to 98% over baselines such as power-capping and GPU idling, with consistent gains across workload types, load levels, and GPU generations.

2026-05-25 13:00 JST | arXiv cs.AI | Research/Papers

ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling

While Mixture-of-Experts (MoE) architectures substantially bolster the expressive power of large-language models, their prohibitive memory footprint severely impedes the practical deployment on resource-constrained edge devices, especially when model behavior must be preserved without relying on lossy quantization. In this paper, we present ZipMoE, an efficient and semantically lossless on-device MoE serving system. ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters via a caching-scheduling co-design with provable performance guarantee. Fundamentally, our design shifts the paradigm of on-device MoE inference from an I/O-bound bottleneck to a compute-centric workflow that enables efficient parallelization. We implement a prototype of ZipMoE and conduct extensive experiments on representative edge computing platforms using popular open-source MoE models and real-world workloads. Our evaluation reveals that ZipMoE achieves up to 72.77% inference latency reduction and up to 6.76x higher throughput than the state-of-the-art systems. Our code is available at: https://github.com/npnothard/ZipMoE-ICML26.

2026-05-25 13:00 JST | arXiv cs.AI | LLM/Generative AI | Hardware/Semiconductors

ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload

Pre-training large language models on massive GPU clusters has made hardware faults routine rather than rare, driving the need for resilient training systems. Yet existing frameworks either focus on specific parallelism schemes or risk drifting away from a failure-free training trajectory. We propose ReCoVer, a resilient LLM pre-training system that upholds a single invariant: each iteration keeps the number of microbatches constant, ensuring per-iteration gradients remain stochastically equivalent to a failure-free run. The framework is organized as three decoupled protocol layers: (1) fault-tolerant collectives that isolate faults from propagating across replicas; (2) in-step fine-grained recovery that preserves intra-iteration progress and prevents gradient corruption; (3) versatile-workload policy that dynamically redistributes microbatch quotas across the survivors. The design is parallelism-agnostic, integrating directly with both 3D parallelism and Hybrid Sharded Data Parallel (HSDP) as a drop-in substrate. We evaluate our implementation on end-to-end pre-training tasks with up to 512 GPUs; ReCoVer successfully preserves the training trajectory of a failure-free reference despite 256 GPUs lost over the course of the run. Compared with checkpoint-and-restart baselines, ReCoVer demonstrates 2.23x higher effective throughput after successive failures. This advantage results in ReCoVer processing 74.9% more tokens at 234 GPU-hours, with the gap widening as training continues.
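
The invariant above (a constant number of microbatches per iteration, re-split across surviving replicas) can be shown with a short sketch. The function name and the even-split-with-remainder policy are assumptions for illustration. The abstract does not specify how ReCoVer actually assigns quotas.

```python
def redistribute_microbatches(total_microbatches, survivors):
    """Split a fixed per-iteration microbatch count across surviving replicas.

    The total stays constant no matter how many replicas remain; the even
    split with remainder is an illustrative policy, not the paper's.
    """
    if not survivors:
        raise ValueError("no surviving replicas")
    base, extra = divmod(total_microbatches, len(survivors))
    # The first `extra` survivors take one additional microbatch each.
    return {r: base + (1 if i < extra else 0) for i, r in enumerate(survivors)}


before = redistribute_microbatches(16, ["r0", "r1", "r2", "r3"])
after = redistribute_microbatches(16, ["r0", "r2", "r3"])  # r1 has failed
```

Losing a replica changes each survivor's share but not the total, which is the property the abstract credits for keeping per-iteration gradients comparable to a failure-free run.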

2026-05-25 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination

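We address concurrency control for LLM agents that share mutable state over HTTP, in settings where the agents cannot be modified to declare their read sets. S-Bus is an HTTP middleware whose central mechanism, a server-side DeliveryLog, reconstructs each agent's read set at commit time from observed HTTP GET traffic. The consistency property it provides, Observable-Read Isolation (ORI), is a partial causal consistency over the HTTP-observable read projection, and it prevents Structural Race Conditions in dedicated-shard topologies. There are three contributions. (C1) The DeliveryLog mechanism, backed by three tiers of mechanised evidence: TLAPS proves ReadSetSoundness and ORICommitSafety (modulo one typing axiom); exhaustive TLC at N=3 explores 20,763,484 states with zero violations; and Dafny discharges 9 inductive lemmas. (C2) Empirical safety parity with PostgreSQL 17 SERIALIZABLE and Redis 7 WATCH/MULTI: zero Type-I corruptions across 884,110 commit attempts (427,308 of them under active contention). (C3) ORI is semantically neutral in dedicated-shard workloads but harmful in single-shard collaborative writing, because preservation propagates concurrent contradictions. v2 update: the PH-3 LLM judge is now independently validated against a human annotator (Zahid Hussain, Mindgigs Peshawar) on 400 (step, shard) pairs at strict kappa=0.93 (n=93, 96.8% raw agreement). Inter-LLM-judge agreement is kappa=0.46 (boundary variance). Agent self-reports over-claim shard usage by 32% (LLM judge) to 49% (human annotator). The SJ-v4 semantic-quality rubric remains single-judge and LLM-only. Source code, formal proofs, harness, and annotation data: https://github.com/sajjadanwar0/sbus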

Original (English)

S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination

We address concurrency control for LLM agents sharing mutable state over HTTP, where agents cannot be modified to declare read sets. S-Bus is an HTTP middleware whose central mechanism, a server-side DeliveryLog, reconstructs each agent's read set at commit time from observed HTTP GET traffic. The consistency property it provides -- Observable-Read Isolation (ORI), a partial causal consistency over the HTTP-observable read projection -- prevents Structural Race Conditions in dedicated-shard topologies. Three contributions. (C1) DeliveryLog mechanism with three-tier mechanised evidence: TLAPS proves ReadSetSoundness and ORICommitSafety (modulo one typing axiom); exhaustive TLC at N=3 explores 20,763,484 states with zero violations; Dafny discharges 9 inductive lemmas. (C2) Empirical safety parity against PostgreSQL 17 SERIALIZABLE and Redis 7 WATCH/MULTI: zero Type-I corruptions across 884,110 commit attempts (427,308 under active contention). (C3) ORI is semantically neutral in dedicated-shard workloads but harmful in single-shard collaborative writing because preservation propagates concurrent contradictions. v2 update: the PH-3 LLM judge is now independently validated against a human annotator (Zahid Hussain, Mindgigs Peshawar) on 400 (step, shard) pairs at strict kappa=0.93 (n=93, 96.8% raw agreement). Inter-LLM-judge agreement is kappa=0.46 (boundary variance). Agent self-reports over-claim shard usage by 32% (LLM judge) to 49% (human annotator). The SJ-v4 semantic-quality rubric remains single-judge LLM-only. Source code, formal proofs, harness, annotation data: https://github.com/sajjadanwar0/sbus
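
The idea of reconstructing read sets on the server from observed GET traffic can be sketched as follows. This is a toy in-memory model, not the S-Bus implementation: the class layout, the method names `get` and `commit`, and the per-shard version counter are assumptions made for illustration, and the commit check shown is plain optimistic version validation, which may differ from the actual ORI rule.

```python
from collections import defaultdict


class DeliveryLog:
    """Toy server-side log of which shard versions each agent was served."""

    def __init__(self):
        self.versions = defaultdict(int)     # shard -> committed version
        self.values = {}                     # shard -> current value
        self.delivered = defaultdict(dict)   # agent -> {shard: version seen}

    def get(self, agent, shard):
        # An observed GET: record the version delivered to this agent.
        self.delivered[agent][shard] = self.versions[shard]
        return self.values.get(shard)

    def commit(self, agent, shard, value):
        # The read set comes from the log; the agent declares nothing.
        read_set = self.delivered[agent]
        stale = [s for s, seen in read_set.items() if self.versions[s] != seen]
        # Either way the agent must re-read before its next commit.
        self.delivered[agent].clear()
        if stale:
            return False, stale
        self.values[shard] = value
        self.versions[shard] += 1
        return True, []


log = DeliveryLog()
log.get("agent_a", "plan")
log.get("agent_b", "plan")

# agent_a commits first; agent_b's earlier read of "plan" is now stale.
ok_a, stale_a = log.commit("agent_a", "plan", "draft from A")
ok_b, stale_b = log.commit("agent_b", "plan", "draft from B")
```

In this sketch the second commit is rejected because the log shows that `agent_b` read a version of `plan` that has since been overwritten, even though `agent_b` never reported what it read.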

2026-05-25 13:00 JST | arXiv cs.AI | LLM/Generative AI

Codec-Robust Attacks on Audio LLMs

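Prior attacks on Audio Large Language Models (Audio LLMs) demonstrated that carefully crafted waveform-domain perturbations can force targeted adversarial outputs. As a defense against these attacks, preprocessing with real-world codec compression has been studied as a way to both detect and remove the perturbations. However, no existing attack has demonstrated robustness to such compression. We introduce CodecAttack, which optimizes a perturbation in the continuous latent space of a neural audio codec instead of perturbing the audio waveform directly. We show that a codec's compression channel, which discards waveform perturbations, transmits perturbations crafted in its own latent space. To further harden the attack across real-world compression channels, we apply multi-bitrate straight-through Expectation-over-Transformation (EoT), without modifying the target model. Across three realistic Audio LLM deployment scenarios and three target models, CodecAttack achieves an average target-substring attack success rate (ASR) of 85.5% on Opus at moderate bitrates, whereas a waveform baseline trained with identical EoT hardening does not exceed 26% at any bitrate. The attack transfers to held-out codecs, reaching up to 100% ASR on MP3 and 84% on AAC-LC without retraining. A per-band energy analysis shows that the latent perturbation concentrates below 4 kHz, where codecs allocate the most bits, while the waveform baseline spreads into higher frequencies that codecs discard. These results show that lossy compression is not a reliable defense against adversarial audio and that codec-aware attacks pose a practical threat to deployed Audio LLM systems.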

Original (English)

Codec-Robust Attacks on Audio LLMs

Prior attacks on Audio Large Language Models (Audio LLMs) demonstrated that carefully crafted waveform-domain perturbations can force targeted adversarial outputs. As a defense mechanism against these attacks, real-world codec compression preprocessing has been studied to both detect and remove the perturbations. Yet no existing attack has demonstrated robustness against these compressions. We introduce CodecAttack, which optimizes a perturbation in a neural audio codec's continuous latent space rather than directly perturbing the audio waveform. We show that the codec's compression channel, which discards waveform perturbations, transmits perturbations crafted in its own latent space. To further harden the attack across real-world compression channels, we apply multi-bitrate straight-through Expectation-over-Transformation (EoT), all without modifying the target model. Across three realistic Audio LLM deployment scenarios and three target models, CodecAttack achieves an average 85.5% target-substring attack success rate (ASR) on Opus at moderate bitrates, while the waveform baseline trained with identical EoT hardening does not exceed 26% at any bitrate. The attack transfers to held-out codecs, reaching up to 100% ASR on MP3 and 84% on AAC-LC without retraining. A per-band energy analysis shows that the latent perturbation concentrates below 4kHz, exactly where codecs allocate the most bits, while the waveform baseline spreads into higher frequencies that codecs discard. These results demonstrate that lossy compression is not a reliable defense against adversarial audio and that codec-aware attacks pose a practical threat to deployed Audio LLM systems.
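
The multi-bitrate straight-through EoT step can be illustrated with a deliberately small toy. Everything here is an assumption made for illustration: a uniform quantizer stands in for the lossy codec channel (a larger step plays the role of a lower bitrate), a linear score stands in for the target model, and the vector `z` plays the role of a latent. It is not the CodecAttack implementation and uses no real codec or Audio LLM.

```python
def quantize(x, step):
    """Toy lossy channel: uniform quantization with the given step size."""
    return [round(v / step) * step for v in x]


def score(x, w):
    """Toy stand-in for the target model: a linear score."""
    return sum(wi * xi for wi, xi in zip(w, x))


def eot_loss(z, delta, w, target, steps):
    """Squared error to the target score, averaged over channel settings."""
    total = 0.0
    for step in steps:
        y = quantize([zi + di for zi, di in zip(z, delta)], step)
        total += (score(y, w) - target) ** 2
    return total / len(steps)


def eot_grad(z, delta, w, target, steps):
    """Straight-through gradient of eot_loss with respect to delta.

    The quantizer has zero gradient almost everywhere, so its Jacobian
    is replaced by the identity (the straight-through estimator).
    """
    grad = [0.0] * len(delta)
    for step in steps:
        y = quantize([zi + di for zi, di in zip(z, delta)], step)
        err = score(y, w) - target
        for i, wi in enumerate(w):
            grad[i] += 2.0 * err * wi / len(steps)
    return grad


def optimize(z, w, target, steps, lr=0.05, iters=200):
    """Gradient descent on the perturbation delta using the EoT gradient."""
    delta = [0.0] * len(z)
    for _ in range(iters):
        g = eot_grad(z, delta, w, target, steps)
        delta = [d - lr * gi for d, gi in zip(delta, g)]
    return delta


# Hypothetical values chosen only to exercise the code.
z = [0.2, -0.4, 0.7, 0.1]
w = [1.0, -0.5, 0.8, 0.3]
steps = [0.05, 0.1, 0.2]   # three "bitrates"
target = 3.0

loss_before = eot_loss(z, [0.0] * len(z), w, target, steps)
delta = optimize(z, w, target, steps)
loss_after = eot_loss(z, delta, w, target, steps)
```

Averaging the loss over several quantization steps is what makes the perturbation hold up across channel settings rather than at a single one; the straight-through substitution is what lets a gradient reach `delta` at all.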

2026-05-24 08:00 JST | ITmedia AI+ | LLM/Generative AI | Agents | Gemini

"Gemini", "Claude Code", "Codex": 5 Points for Company-Wide Rollout and Production Implementation

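Although the use of generative AI in software development is becoming routine, challenges remain, such as "it never goes beyond code completion" and "it depends on the skills of particular individuals." Covering everything from individual productivity gains and team adoption to company-wide rollout and production implementation of AI agents, this article looks at five…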