DeepSeek

Company articles: 42 | 2026-05-25 to 2026-05-29

May 2026

2026-05-29 13:00 JST | arXiv cs.AI | Research/Papers

ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression

Mixture-of-Experts (MoE) language models reduce per-token computation but still require storing and serving all experts, making deployment memory-intensive. Existing post-training compression methods mainly shrink this cost by pruning experts or merging their weights. We formulate post-training MoE compression as expert-pool consolidation: retaining a smaller set of pretrained experts as reusable prototypes and deterministically remapping each original expert reference to one selected prototype. This view separates the reduced expert pool from the reuse structure that represents the original expert slots, and allows prototype sharing within local layer scopes while preserving the original router interface. We propose ConMoE, a train-free prototype remapping framework that selects retained experts using calibration-based contribution and replaceability signals, then redirects original expert calls to the selected prototypes without weight updates or post-compression fine-tuning. Experiments on three pretrained MoE language models show that ConMoE matches or outperforms strong pruning and merging baselines in several settings, achieving the best average score on deepseek-moe-16b-base at both 25% and 50% routed-expert reduction, while remaining competitive on Qwen3-30B-A3B and OLMoE-1B-7B-0125. Ablations indicate that deterministic reassignment is the most stable component, whereas broader cross-layer sharing and post-hoc weight fusion are model-dependent.
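
The remapping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring inputs `contribution` and `similarity`, the selection rule (top scores), and the function names are all assumptions made for illustration. It only shows how a smaller prototype pool can serve the original expert slots while the router keeps emitting the original expert ids.

```python
def consolidate_layer(contribution, similarity, keep):
    """Pick `keep` prototypes and map every original expert slot to one of them.

    contribution[i]  -- hypothetical calibration score for expert i
    similarity[i][j] -- hypothetical replaceability score between experts i and j
    """
    n = len(contribution)
    # Assumption: retain the experts with the highest calibration contribution.
    prototypes = sorted(range(n), key=lambda i: (-contribution[i], i))[:keep]
    remap = {}
    for i in range(n):
        if i in prototypes:
            remap[i] = i  # retained experts map to themselves
        else:
            # Dropped slots are redirected to the most similar retained prototype
            # (ties broken toward the lower expert id, so the map is deterministic).
            remap[i] = max(prototypes, key=lambda p: (similarity[i][p], -p))
    return prototypes, remap


def route(topk_experts, remap):
    """Redirect the router's original expert ids; the router itself is unchanged."""
    return [remap[e] for e in topk_experts]
```

Because the map is a plain lookup table, no weights are updated: only the prototypes' parameters need to be stored, and each original slot resolves to exactly one of them.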

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | GPT / ChatGPT | Gemini

GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents

LLM agents acting in structured environments fail in operational rather than conversational ways, and reliability depends on procedural knowledge of the environment. Prior self-improvement methods accumulate natural-language guidance without checking that each new item preserves previously correct behavior, so a note that fixes one trajectory can silently regress another. We introduce GRASP (Gated Regression-Aware Skill Proposer), which treats agent improvement as a sequence of edits to a bounded skill library, admitting each candidate only if it produces a net improvement on a balanced held-out probe under a hard regression budget. We evaluate GRASP across five base models (gpt-oss-120b, DeepSeek V4 Flash, Gemini 3.1 Flash Lite, GPT-4.1, GPT-5.4) on two FHIR-based clinical benchmarks. On MedAgentBench, GRASP lifts gpt-oss-120b from 40.6% to 88.8%, exceeds the strongest of five self-improvement baselines by 21.0 points, and improves every other base model by 17.2 to 40.3 points. Ablations attribute the gain to comparative proposal generation, the acceptance gate, and the hard regression budget rather than to skill writing itself, which without validation is no better than using no skills. The mechanism generalizes beyond the clinical domain, improving agents on three of four non-clinical environments and remaining flat only where the action space is open-ended. Frozen libraries transfer across models, where skills from a stronger model improve weaker executors beyond what they learn for themselves while the reverse does not, an asymmetry that no ungated baseline reproduces.
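
A rough sketch of the acceptance gate described above, under stated assumptions: probe results are represented as task-id to pass/fail maps, "net improvement" is counted as newly fixed tasks minus newly broken ones, and the name `admit` is hypothetical. The paper's actual gate may differ in its details.

```python
def admit(candidate_results, baseline_results, regression_budget):
    """Return True if a candidate skill edit passes the gate.

    Both result arguments map probe task id -> bool (task solved or not),
    measured on the same held-out probe set.
    """
    fixed = sum(
        1 for task, ok in candidate_results.items()
        if ok and not baseline_results[task]
    )
    regressed = sum(
        1 for task, ok in candidate_results.items()
        if not ok and baseline_results[task]
    )
    # Hard budget: too many regressions reject the edit regardless of gains.
    if regressed > regression_budget:
        return False
    # Otherwise require a strictly positive net improvement on the probe.
    return fixed - regressed > 0
```

The two checks are kept separate on purpose: an edit that fixes many tasks but breaks more than the budget allows is still rejected, which is the property that ungated note accumulation lacks.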

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI

OptSkills: Learning Generalizable Optimization Skills from Problem Archetypes via Cluster-Based Distillation

Leveraging Large Language Models (LLMs) to automatically formulate and solve optimization problems from natural language has emerged as an efficient paradigm for automated optimization. However, existing methods still exhibit limited generalization: they are sensitive to superficial narrative variations, reuse experience mainly at the case level, and struggle to adapt to shifted or emerging problem types. We propose OptSkills, an archetype-centric skill learning and reasoning agent system for optimization modeling and solving. To improve robust generalization, our system clusters problems by their underlying archetypes rather than surface narratives. To improve in-distribution generalization, it explores diverse modeling paradigms and solver configurations within each cluster, then distills successful trajectories into reusable workflow-level skills. To improve out-of-distribution generalization, it refines existing skills or expands the skill library using newly obtained trajectories. Our system achieves a state-of-the-art micro-averaged accuracy of 68.27% on datasets encompassing diverse problem types and scenarios. In addition, on MIPLIB-NL, a highly challenging large-scale and high-dimensional benchmark, it achieves 26.91% accuracy, outperforming DeepSeek-V3.2-Thinking by 4.53%. After skill learning on Nano-CO, it reaches 72.79% on the OOD NLCO benchmark. Code and skills are available at https://github.com/fujiwaranoM0kou/OptSkills.

2026-05-29 13:00 JST | arXiv cs.AI | Research/Papers | Claude | GPT / ChatGPT

FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks

A paraphrase-quality audit of MathCheck (ICLR 2025) detected 4 semantically incorrect paraphrases in 129 groups (3.1%); removing them drops GPT-4o from rank 2 to rank 4 and elevates Claude Haiku and DeepSeek V3 above it; these ranking changes are invisible to any single-model evaluation. Cross-model unanimity found these errors automatically (>= 3/4 models for MathCheck; >= 6/9 for our primary evaluation) for under $10; in our own dataset the same protocol found that 47% of auto-generated connective-variation paraphrases were semantically incorrect. That flaw compounds a deeper measurement gap: Claude Haiku 4.5 achieves 86% accuracy yet SCR=50%, meaning half its theorems are answered differently under semantically equivalent restatements, while aggregate accuracy across 9 models spans only 86-96% yet Semantic Consistency Rates (SCR) span 50-82% -- a 32-point gap invisible to standard benchmarks. Formally, for any target ranking over 9 frontier models there exists a weighting over paraphrase families that realizes it (No-Free-Benchmark corollary), because no model Pareto-dominates all families -- so benchmark designers who select families are implicitly choosing which model wins. FormInv supplies the audit protocol (replicated on external benchmarks at 100% recall), SCR and per-theorem Cochran's Q as primary invariance measures evaluated on 9 models across 366-811 items (on Lean4-verified theorems), and FormInvSelector for regime-aware model selection.
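
To see how accuracy and SCR can diverge, here is a small sketch. The abstract does not give the exact formula, so this assumes SCR is the fraction of theorems that receive the same answer under every paraphrase; the function names and data layout are illustrative only.

```python
def semantic_consistency_rate(answers_by_theorem):
    """answers_by_theorem: theorem id -> list of answers, one per paraphrase.

    Assumed definition: share of theorems answered identically across
    all of their semantically equivalent restatements.
    """
    if not answers_by_theorem:
        raise ValueError("no theorems given")
    consistent = sum(
        1 for answers in answers_by_theorem.values() if len(set(answers)) == 1
    )
    return consistent / len(answers_by_theorem)


def accuracy(answers_by_theorem, gold):
    """Fraction of individual paraphrase answers that match the gold label."""
    total = sum(len(answers) for answers in answers_by_theorem.values())
    correct = sum(
        answer == gold[theorem]
        for theorem, answers in answers_by_theorem.items()
        for answer in answers
    )
    return correct / total
```

With two theorems and three paraphrases each, a single flipped answer leaves per-answer accuracy at 5/6 while the consistency rate drops to 1/2, which is the kind of gap the abstract reports at larger scale.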

2026-05-29 13:00 JST | arXiv cs.AI | Agents

SURGENT: A Surgical Multi-Agent Assistance System Across the Perioperative Workflow

The intricate nature of modern surgical care necessitates intelligent systems that can synthesize extensive patient records, support collab…

2026-05-29 13:00 JST | arXiv cs.AI | Research/Papers | Qwen

ESPO: Early-Stopping Proximal Policy Optimization

When a large language model under reinforcement learning commits a wrong reasoning step early in a trajectory, standard algorithms force it…

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI

Empathic Prompting: Non-Verbal Context Integration for Multimodal LLM Conversations

We present Empathic Prompting, a novel framework for multimodal human-AI interaction that enriches Large Language Model (LLM) conversations…

2026-05-29 13:00 JST | arXiv cs.AI | LLM/Generative AI | Qwen

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement l…

2026-05-29 13:00 JST | arXiv cs.AI | Research/Papers | GPT / ChatGPT

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answe…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents

TCP-MCP: Landscape-Guided Co-Evolution of Prompts and Communication Topologies for Multi-Agent Systems

Effective multi-agent systems cannot be designed by selecting prompts or communication graphs in isolation. Agent behavior depends on the information an agent receives, while the usefulness of a communication edge depends on how the receiving agent interprets and uses that information. We propose TCP-MCP (Topology-Coupled Prompting for Multi-Agent Collaborative Problem-Solving), a co-evolution framework that searches agent prompts and communication topologies as a unified genome. TCP-MCP uses an initialization-time landscape probe to calibrate early search behavior, and then relies on Pareto-front diagnostics to adapt exploration under three objectives: task performance, token cost, and structural complexity. Using the same DeepSeek-V3.2 backbone across all methods, TCP-MCP achieves 82.66%, 89.96%, and 96.61% accuracy on MMLU-Pro, MMLU, and GSM8K, respectively. Across the three benchmarks, it consistently outperforms automated graph-generation baselines and achieves competitive accuracy relative to debate-style systems, while using up to 5.69× fewer tokens than those systems at the reported operating points. These results show that jointly evolving prompts and communication structure provides a practical route to cost-aware and task-adaptive multi-agent system design in controlled evaluations.
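
The three-objective trade-off mentioned above can be made concrete with a generic Pareto-front filter. This is a textbook non-dominated filter, not the paper's diagnostic procedure; the tuple layout `(accuracy, token_cost, complexity)` is an assumption chosen for illustration.

```python
def dominates(a, b):
    """True if candidate a Pareto-dominates candidate b.

    Each candidate is (accuracy, token_cost, complexity):
    accuracy is maximized, the other two are minimized.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep only candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]
```

A candidate with lower accuracy survives as long as it is cheaper or structurally simpler than every more accurate alternative, which is why a search over the front can surface low-token configurations.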

2026-05-28 13:00 JST | arXiv cs.AI | Research/Papers

Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation

Chain-of-thought (CoT) distillation trains a smaller model to imitate a teacher's reasoning trace, but it is typically evaluated by final-a…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI

FPMoE: A Sparse Mixture-of-Experts Approach to Functional Code Generation

Despite rapid progress in LLM-based code generation, existing models are predominantly trained on imperative languages, leaving functional…

2026-05-28 13:00 JST | arXiv cs.AI | Research/Papers | GPT / ChatGPT

Pruning and Distilling Mixture-of-Experts into Dense Language Models

Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI

How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving

Modern large language model (LLM) inference has progressively disaggregated to keep pace with growing model sizes and tight TTFT and TPOT s…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI

VULPO: Context-Aware Vulnerability Detection via On-Policy LLM Optimization

Large language models (LLMs) have recently shown strong potential in vulnerability detection (VD). However, accurately detecting vulnerabil…

2026-05-28 13:00 JST | arXiv cs.AI | LLM/Generative AI

Differential syntactic and semantic encoding in LLMs

We study how syntactic and semantic information is encoded in inner layer representations of Large Language Models (LLMs), focusing on the…

2026-05-28 13:00 JST | arXiv cs.AI | Research/Papers | Gemma | Qwen

When PCOS Meets Eating Disorders: An Explainable AI Approach to Detecting the Hidden Triple Burden

Women with polycystic ovary syndrome (PCOS) face substantially elevated risks of body image distress, disordered eating, and metabolic chal…

2026-05-28 13:00 JST | arXiv cs.AI | Research/Papers | Llama

GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

Multi-head Latent Attention (MLA), the attention used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matc…

2026-05-27 13:00 JST | arXiv cs.AI | LLM/Generative AI | Robotics | Claude | GPT / ChatGPT

A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks

Large Language Models (LLMs) have brought huge improvements to Artificial Intelligence (AI), which can be applied to general-purpose tasks. However, their application to textual or spoken medical consultations is still an open research problem. This paper proposes MeDial-Speech, a novel speech dataset for training and evaluating Med-AIs that can carry out consultations with patients. It was collected in realistic environments from robot-patient and doctor-patient dialogues, contains 111+ hours of speech data (without data augmentation), and covers four health conditions: Lewy body dementia, heart failure, shoulder pain, and angina. In addition, we propose a dialogue benchmark via sentence selection (with 20 options) to evaluate three state-of-the-art LLMs: GPT-5 mini, DeepSeek-V3, and Claude Sonnet 4. Experimental results reveal that Claude Sonnet 4 is the best in sentence selection, with 71.1% accuracy using manual transcriptions and 74.7% using automatic transcriptions, and that all LLMs are highly overconfident in their probabilistic predictions, regardless of selecting correct or incorrect sentences in medical dialogues. This dataset is free of charge for non-commercial purposes at: https://huggingface.co/datasets/hcuayahu/MeDial-Speech

2026-05-27 13:00 JST | arXiv cs.AI | Hardware/Semiconductors | Llama

Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

Large reasoning models (LRMs) generate chain-of-thought (CoT) traces before producing final outputs, introducing a dynamic internal state that may complicate control mechanisms such as refusal. Unlike instruction-tuned LLMs, where refusal is mediated by a single directional subspace, refusal in LRMs additionally depends on the CoT. In DeepSeek-R1-Distill-LLaMA-8B, activation steering reverses refusal in only 39% of cases when the CoT is kept fixed, but removing the CoT entirely increases this to 70%, indicating that the CoT actively reinforces refusal. In a two-stage intervention where the model regenerates its CoT under activation steering, refusal is reversed in 94% of cases, while the resulting CoT alone retains 48% of this effect even after steering is removed. This suggests that the CoT can carry and reconstruct the compliance signal independently. These findings indicate that refusal in LRMs is jointly encoded in residual stream activations and CoT. This joint encoding makes LRMs more robust against activation-level interventions alone, but exposes the CoT as a possible alternative attack surface.
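
For readers unfamiliar with activation steering, the basic operation can be sketched on plain vectors. This is a generic illustration, not the paper's code: a real intervention would act on a model's residual-stream activations at chosen layers, and the function names and the scalar `alpha` are assumptions.

```python
import math


def _unit(direction):
    """Normalize a direction vector to unit length."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [d / norm for d in direction]


def projection(hidden, direction):
    """Scalar component of a hidden state along the (normalized) direction."""
    return sum(h * u for h, u in zip(hidden, _unit(direction)))


def steer(hidden, direction, alpha):
    """Shift a hidden state by alpha along the normalized direction.

    A negative alpha moves the state away from the direction, which is the
    sense in which steering is used to suppress a behavior such as refusal.
    """
    return [h + alpha * u for h, u in zip(hidden, _unit(direction))]
```

Steering changes only the component along the chosen direction, so `projection(steer(h, r, alpha), r)` equals `projection(h, r) + alpha` up to floating-point error. The abstract's point is that in reasoning models this single-direction shift is not the whole story, because the CoT text carries part of the signal.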

2026-05-27 13:00 JST | arXiv cs.AI | LLM/Generative AI | Llama | Qwen

ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference

Fine-grained Mixture-of-Experts (MoE) models sparsely activate only a subset of experts per token, reducing activated computation while mai…

2026-05-27 13:00 JST | arXiv cs.AI | Research/Papers

From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

Retrieval-Augmented Generation (RAG) systems depend critically on the quality of document preprocessing, yet no prior study has evaluated P…

2026-05-27 13:00 JST | arXiv cs.AI | Agents | Gemini

Strategic Persuasion with Trait-Conditioned Multi-Agent Systems for Iterative Legal Argumentation

Strategic interaction in adversarial domains such as law, diplomacy, and negotiation is mediated by language, yet most game-theoretic model…

2026-05-27 13:00 JST | arXiv cs.AI | Research/Papers | Llama | Alibaba | Qwen

The ATOM Report: Measuring the Open Language Model Ecosystem

We present a comprehensive adoption snapshot of the leading open language models and who is building them, focusing on the ~1.5K mainline o…

2026-05-26 13:00 JST | arXiv cs.AI | Research/Papers | Anthropic | Claude | Mistral AI | Alibaba

BODHI: Precise OS Kernel Specification Inference

The formal verification of operating system kernels requires precise specifications that capture the intended behavior of system calls. Writing these specifications manually demands deep domain expertise, motivating the use of large language models (LLMs) to automate the process. However, in OSV-Bench, a benchmark of 245 specification generation tasks derived from the Hyperkernel OS kernel, the best reported Pass@1 is 55.10%. We propose a domain knowledge prompting method (BODHI), which augments the standard few-shot prompt with a structured C-to-Python translation guide covering 15 categories of domain-specific translation patterns. Inspired by Structured Chain-of-Thought (SCoT) prompting, the guide organizes translation by separation of concerns, addressing pre-condition extraction and post-condition generation as distinct categories. Evaluated on nine models from six providers (Anthropic, Mistral, Amazon, DeepSeek, Meta, Alibaba), covering dense, mixture-of-experts and reasoning architectures, BODHI improves every model tested, with gains ranging from +11% to +32%. The best configuration (Claude Opus 4.6 + BODHI) reaches 96.73% Pass@1. BODHI reduces both syntax and semantic errors, with the strongest effect on models that have sufficient instruction-following capability to utilize structured reference material. These results demonstrate that domain knowledge injection is a model-agnostic technique that substantially bridges the gap between general-purpose code generation and formal specification synthesis.

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | Llama

Learning to Reason Efficiently with A* Post-Training

Many applications of large language models (LLMs) require deductive reasoning, yet models frequently produce incorrect or redundant inference steps. We frame natural language inference as a search problem where the final answer is the valid proof itself, requiring a reasoning procedure in which intermediate inferences are correct. Specifically, we investigate whether LLMs can learn to generate correct and efficient proofs with guidance from A* search -- an algorithm that guarantees an optimally efficient path to a goal. We explore two training techniques: supervised fine-tuning on execution traces from A* and reinforcement learning with A*-informed process reward models. Empirically, we find that Llama-3.2 models in the 1B--3B range benefit substantially from A* post training, going from near-zero accuracy to outperforming DeepSeek-V3.2 -- a much larger model. Our analysis uncovers a trade-off: while simple correctness rewards maximize accuracy, A*-informed signals strike a balance between accuracy and efficiency. Furthermore, we find that on larger search spaces, models trained with imperfect heuristics exhibit superior accuracy. Our results demonstrate a promising direction towards reasoning guided by principles derived from classical search algorithms.
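
For reference, a compact version of classical A* search, the algorithm whose traces and signals the abstract uses for training. The graph encoding and function signature here are assumptions for illustration; how the paper maps proof states to nodes is not described in the abstract.

```python
import heapq


def a_star(graph, start, goal, heuristic):
    """Return (path, cost) for a cheapest path from start to goal.

    graph:     node -> list of (neighbor, step_cost)
    heuristic: node -> estimated remaining cost to goal; the path is
               guaranteed cheapest only if this never overestimates.
    """
    # Queue entries are (estimated total cost, cost so far, node, path).
    frontier = [(heuristic(start), 0, start, [start])]
    best_cost = {start: 0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        if cost > best_cost.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was found
        for neighbor, step in graph.get(node, []):
            new_cost = cost + step
            if new_cost < best_cost.get(neighbor, float("inf")):
                best_cost[neighbor] = new_cost
                heapq.heappush(
                    frontier,
                    (new_cost + heuristic(neighbor), new_cost, neighbor, path + [neighbor]),
                )
    return None, float("inf")
```

The order in which nodes are popped is the kind of execution trace that supervised fine-tuning could imitate, and the estimated total cost is one natural source for a process-level reward, though the paper's exact formulation may differ.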

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | Business/Funding | Claude

Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework

LLMs have achieved remarkable success in complex reasoning tasks, yet current evaluation approaches predominantly rely on final-answer correctness, offering limited insight into the underlying reasoning processes that produce those answers. To address this gap, this study proposes a unified multi-dimensional framework for measuring reasoning quality in LLMs from a behavioral perspective, operationalizing six theoretically grounded dimensions: Correctness (CQ), Consistency (CS), Robustness (RS), Logical Coherence (LS), Efficiency (ES), and Stability (SS). Extensive experiments on seven LLMs across 975 items from four benchmarks demonstrate that the framework reveals behaviors invisible to accuracy-only metrics. Notably, logical coherence is orthogonal to correctness (r = -0.172, ns), confirming that correct answers can arise from incoherent reasoning, while Claude-Haiku-4.5 achieves the highest multi-dimensional score (Q_bal = 0.778). Furthermore, the framework exposes critical ranking inversions: DeepSeek-V3 ranks second under accuracy-priority but fifth under legal/compliance weighting, a reversal that single-metric evaluation cannot detect. Discriminant validity confirms 11/15 dimension pairs are independent (|r| < 0.50), providing psychometric support for treating each dimension as a distinct signal. The dimensional profiles produced by the framework directly support three classes of deployment decision: identifying models whose reasoning traces would fail accountability audits despite correct final answers (LS--CQ orthogonality); preventing ranking errors caused by accuracy-only benchmarking; and ensuring that no single metric silently substitutes for the six independent signals the framework captures.
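
The ranking inversions described above follow from how dimension scores are weighted. The sketch below assumes a simple normalized weighted sum; the abstract does not state how the framework aggregates its dimensions, and the model names, scores, and weight profiles in the usage note are hypothetical, not results from the paper.

```python
def rank_models(profiles, weights):
    """Rank models by a weighted average of their dimension scores.

    profiles: model name -> {dimension: score in [0, 1]}
    weights:  dimension -> non-negative weight (normalized internally)
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    scores = {
        model: sum(weights[d] * dims[d] for d in weights) / total
        for model, dims in profiles.items()
    }
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda m: (-scores[m], m))
```

With made-up profiles such as `{"model_x": {"CQ": 0.95, "LS": 0.4}, "model_y": {"CQ": 0.8, "LS": 0.9}}`, weights of `{"CQ": 0.8, "LS": 0.2}` put `model_x` first, while `{"CQ": 0.3, "LS": 0.7}` put `model_y` first. The same mechanism would produce the accuracy-priority versus legal/compliance reversal the abstract reports.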

2026-05-26 13:00 JST | arXiv cs.AI | Agents | Research/Papers

VeriTrace: Evolving Mental Models for Deep Research Agents

Deep research agents face vast, interdependent, and pervasively uncertain information. Existing systems explore what evolving intermediate representations should look like, but leave their evolution to the LLM's implicit reasoning. Without explicit regulation, the intermediate layer is easily contaminated by mixed-quality information and propagates errors along its dependencies, so model scale often ends up substituting for absent regulation. We argue that an agent's mental model should instead evolve through explicit feedback that continuously aligns task understanding with reality, and identify three regulatory loops: interpretive update, deviation feedback, and schema revision. We realise this in VeriTrace, a cognitive-graph framework that explicitly implements the three loops. Using matched Qwen3.5-27B backbones, VeriTrace improves over the strongest matched baseline by 4.22 pp on DeepResearch Bench (DRB) Insight (1.49 pp Overall) and by 5.9 pp Overall win rate on DeepConsult. With Config-DeepSeek, it achieves the strongest reproducible open-source result on DRB.

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | Agents | Gemma | Llama

Understanding Conversational Patterns in Multi-agent Programming: A Case Study on Fibonacci Game Development

Large Language Models (LLMs) are increasingly applied to software engineering (SE), yet their potential for autonomous, role-oriented colla…

2026-05-26 13:00 JST | arXiv cs.AI | Research/Papers

Grammatically-Guided Sparse Attention for Efficient and Interpretable Transformers

The quadratic complexity of self-attention in Transformer models remains a significant bottleneck for processing long sequences and deployi…

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | Business/Funding | Research/Papers | GPT / ChatGPT | Gemma

When Reasoning Hurts: Source-Aware Evaluation of Frontier LLMs for Clinical SOAP Note Generation

Reasoning-enabled LLMs perform strongly on medical reasoning benchmarks, but it remains unclear whether these gains transfer to structured…

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | GPT / ChatGPT | Gemini

READER: Reasoning-Enhanced AI-Generated Text Detection

Recent advances in large language models (LLMs) have made it increasingly difficult to distinguish human-written text from AI-generated con…

2026-05-26 13:00 JST | arXiv cs.AI | Business/Funding | Research/Papers

Context-Instrumental Data Distillation for Kubernetes Manifest Generation: Method and Experimental Evaluation

This paper examines the specialization of Small Language Models (SLMs) with up to 4 billion parameters for generating artifacts in domain-s…

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | Business/Funding

From Prompt Optimization to Multi-Dimensional Credibility Evaluation: Enhancing Trustworthiness of Chinese LLM-Generated Liver MRI Reports -- with Preliminary Extension to Lung Cancer

Large language models (LLMs) have demonstrated promising performance in generating diagnostic conclusions from imaging findings, thereby su…

2026-05-26 13:00 JST | arXiv cs.AI | Agents | GPT / ChatGPT

MEMTIER: Tiered Memory Architecture and Retrieval Bottleneck Analysis for Long-Running Autonomous AI Agents

Long-running autonomous AI agents suffer from a well-documented memory coherence problem: tool-execution success rates degrade 14 percentag…

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | GPT / ChatGPT

SoK: A Comprehensive Security Analysis of Jailbreak Resilience in GPT and DeepSeek Models

The rapid proliferation of Large Language Models (LLMs) has heightened concerns regarding their exposure to jailbreak attacks, which craft…

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | Hardware/Semiconductors | Qwen

HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output Space

Mixture-of-Experts (MoE) architectures in large language models (LLMs) deliver exceptional performance and reduced inference costs compared…

2026-05-26 13:00 JST | arXiv cs.AI | Research/Papers | Llama

vAttention: Verified Sparse Attention

State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extensi…

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | Claude | GPT / ChatGPT

Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations

When LLM conversations grow beyond the context window, old content must be evicted -- but how does the model recover it when needed? We pro…

2026-05-26 13:00 JST | arXiv cs.AI | LLM/Generative AI | GPT / ChatGPT

Persona-Model Collapse in Emergent Misalignment

Fine-tuning large language models on narrow data with harmful content produces broadly misaligned behavior on unrelated prompts, a phenomen…

2026-05-26 13:00 JST | arXiv cs.AI | Research/Papers | Anthropic | Google

The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next

Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases -- and…

2026-05-25 13:00 JST | arXiv cs.AI | Research/Papers

Strategic Coercion Within Alliances: The Greenland Sovereignty Game as an AI Stress Test

What happens when the strongest alliance member pressures a weaker member over territory and strategic control? We examine the Greenland sovereignty crisis as a stress test for LLM geopolitics, centered on the 2019-2026 U.S. push to acquire Greenland from the Kingdom of Denmark. The crisis nests two collective-action problems: Arctic strategic control and whether NATO can enforce alliance norms against the dominant member. We develop three games (asymmetric coercion; a NATO assurance game with a critical-mass tipping point; a triadic extensive-form game with social preferences) and test them with a multi-agent simulation in which eight frontier LLMs play six geopolitical roles (United States, Denmark, Greenland, NATO, Russia, Canada) across 3,604 completed games and 108,120 action observations. Using inverse game theory, we recover each model's structural utility parameters (alpha, beta, gamma, delta, eta) for material self-interest, reciprocity, inequality aversion, norm respect, and commitment consistency. Three findings stand out. First, all eight models become more escalatory under coercion framing (four-action escalation rises from 10.7% to 28.6%). Second, Chinese-origin models show systematically different power-weight profiles from Western-origin models when playing the U.S. role. Third, peaceful US acquisition emerges in only 1.9% of clean games and only 3 of 8 frontier models ever achieve it, most prominently DeepSeek V3.2, which executes a stable five-round playbook through the metropole. Prompts emphasizing jus cogens and self-determination reduce escalation back near baseline in the English-only confirmatory sample; multilingual contrasts are reported as exploratory sensitivity checks. We position this as a structural benchmark for LLM geopolitical behavior, complementing action-frequency benchmarks.
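
The headline figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check on the quoted numbers, not anything taken from the paper itself. It assumes that every role acts once per round, which the abstract does not state. The variable names are illustrative.

```python
# Figures quoted in the abstract above.
completed_games = 3_604
action_observations = 108_120
roles = 6  # United States, Denmark, Greenland, NATO, Russia, Canada

# Average number of recorded actions per completed game.
obs_per_game = action_observations / completed_games

# Assumption (not stated in the abstract): each role acts once per round,
# so the implied number of rounds is observations per game divided by roles.
implied_rounds = obs_per_game / roles

# Four-action escalation rates quoted in the abstract, in percent.
baseline_rate = 10.7
coercion_rate = 28.6
increase_pp = coercion_rate - baseline_rate  # absolute change, percentage points
ratio = coercion_rate / baseline_rate        # relative change

print(f"observations per game: {obs_per_game:.1f}")
print(f"implied rounds per game: {implied_rounds:.1f}")
print(f"escalation increase: {increase_pp:.1f} pp ({ratio:.2f}x)")
```

Under that assumption the counts divide evenly, which is consistent with the five-round playbook mentioned for DeepSeek V3.2, though the abstract does not confirm that all games have the same length.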